---
date: "2024-07-01"
engine: julia
---
# Importing and Exporting Files {.unnumbered}
## Overview
Until recently, Julia didn't have a single package to import the majority of file types.
However, with the increasing development efforts put into the [`{Tidier.jl}`](https://github.com/TidierOrg) meta-packages, aimed at recreating the wildly popular set of `{tidyverse}` R packages in Julia, a new, convenient, single interface exists to the many excellent Julia I/O (input/output) packages.
Here, we will explore some common workflows that can utilize the `{TidierFiles.jl}` package, though more details can be found in the [documentation](https://tidierorg.github.io/TidierFiles.jl/latest/).
## File Paths
If you've set up your project using `{DrWatson}`, you will have a number of [useful helper functions](./new-project.qmd#drwatson-helpers) available to use in your project (after loading and activating the project with `using DrWatson; quickactivate("Project")`).
When you set up the project in a new folder, Julia creates a ***Project.toml*** and a ***Manifest.toml*** file in the project folder.
This does all the hard work of setting up **relative paths** and making sure each project has its own list of packages and versions, avoiding any conflicts that may arise when you set up another project with similar packages.
To make use of your code and project-specific packages, you simply need to activate the project using `{DrWatson}`'s `quickactivate("Project")` function (or with `]activate .` if you are in the correct directory).
A **relative** filepath is simply the steps that need to be taken to find your target file from the current location, e.g., the active file.
For example, you may have the following files in a project:
```{bash}
│projectdir
│
├── data
│ └── exp_raw.csv
├── out
│ └── exp_clean.parquet
.
.
.
├── scripts
│ ├── cleaning.jl
│ └── intro.jl
.
.
.
├── Manifest.toml
└── Project.toml
```
The relative filepath of the ***exp_raw.csv*** data file from the ***intro.jl*** scripts file is `../data/exp_raw.csv`.
Here, `../` means look in the parent directory i.e., one directory up from the current location, with the rest pointing to the ***data/*** directory and appropriate file.
If you wanted to reference the ***cleaning.jl*** file from ***intro.jl***, the relative path would just be `cleaning.jl` (or alternatively `./cleaning.jl`, where `./` means look in the current directory).
::: {.callout-note}
Generally you won't need to refer to other files in ***scripts/***, as they shouldn't contain the functions used for calculations, just the scripts that call the functions defined in the ***src/*** directory to produce your results.
As we'll come to later, Julia has been designed to make creating and working with packages very simple, which we will take full advantage of to minimize the risk of files relying on each other in a circular fashion.
:::
As mentioned, `{DrWatson}` provides convenience functions to simplify the process of locating files with relative paths.
To load a file from the ***data/*** folder, we need to `include()` it.
This effectively finds the file and runs it, so everything defined in it is now available for use.
In practice, we can simply use the command `include(DrWatson.datadir("filename.jl"))` (or drop the `DrWatson.` package namespace if you brought `{DrWatson}` into the **`Main` namespace** with the `using` command - see [here](./misc.qmd#namespaces) for more information about **namespaces**).
This helpful function means that the file ***filename.jl*** can be loaded from within any file in your project without needing to use multiple `../` calls in the relative path, or specify the absolute filepath.
It also means that you can send your code to the REPL and it will execute correctly.
This is not a given otherwise: the REPL is usually started in the project directory, while your current file may sit in a subdirectory, so relative paths written from the file's location would resolve differently.
::: {.callout-note}
In case you're wondering what `{DrWatson}` is doing when it sets up these helper scripts, we can just have a look at the source code.
The beauty of Julia is that its speed and expressiveness means that most Julia packages are written in pure Julia and don't need to resort to lower-level languages for the internals!
Looking at [these lines](https://github.com/JuliaDynamics/DrWatson.jl/blob/3f2a6cf4b1c66bbfdeaab9527c436b1d275dba7d/src/project_setup.jl#L28-L33) in the source code, we can see that the `projectdir()` function is effectively a wrapper around the below code, with a check to make sure you have activated the project.
```{julia}
#| eval: false
dirname(Base.active_project())
```
:::
::: {.callout-tip}
Because we created the `out/` directory to save our processed data files to, it would be useful to have a similar helper function to `{DrWatson}`'s.
We can do that very easily using the following function definition:
```{julia}
#| eval: false
outdir(args...) = DrWatson.projectdir("out", args...)
```
Incidentally, this is exactly how `{DrWatson}` defines its own helper functions: the definition states that the function can take any number of arguments, and joins the project directory, the ***out/*** directory, and those arguments into a single path.
For example, `outdir("simulation-files")` becomes ***path-to-project/out/simulation-files/***.
:::
## `{TidierFiles}` Overview
At the time of writing this section, the following file formats can be read and written to using `{TidierFiles}`:
- Delimited files
    - .csv: `read_csv()` and `write_csv()`
    - .tsv: `read_tsv()` and `write_tsv()`/`write_table()`
    - .txt: `read_delim()`
- Excel files: `read_xlsx()` and `write_xlsx()`
    - .xlsx
- SPSS files: `read_sav()` and `write_sav()`
    - .sav
    - .por
- SAS files: `read_sas()` and `write_sas()`
    - .sas7bdat
    - .xpt
- Stata files: `read_dta()` and `write_dta()`
    - .dta
- Arrow files: `read_arrow()` and `write_arrow()`
    - .arrow
- Parquet files: `read_parquet()` and `write_parquet()`
    - .parquet
Reading and writing files is greatly simplified by using a common interface.
To read a file, the general command is `read_{format}(path-to-file.{format})` e.g., to read the ***exp_raw.csv*** file we would use the command `read_csv(DrWatson.datadir("exp_raw.csv"))`, and to read the ***exp_clean.parquet*** file we would use the command `read_parquet(outdir("exp_clean.parquet"))`.
If you wish to use the values from the file, rather than just viewing them, you should assign them to a variable e.g. `raw_df = read_csv(DrWatson.datadir("exp_raw.csv"))`.
Writing data to files is similarly simple, just replacing the `read_` with `write_`, and passing in a `DataFrame` as the first argument.
For example, to write a dataframe to a csv file, you simply use the command `write_csv(raw_df, outdir("exp_clean.csv"))`.
Currently, all write functions require the use of a `DataFrame` object, and all read functions return a `DataFrame` object.
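As a concrete illustration, here is a minimal sketch of this round trip, assuming the project layout shown earlier (with ***exp_raw.csv*** in ***data/***) and the hypothetical `outdir()` helper defined in the callout above; the cleaning step is left as a placeholder.
```{julia}
#| eval: false
using DrWatson, TidierFiles
# read the raw data into a DataFrame (assumes data/exp_raw.csv exists)
raw_df = read_csv(DrWatson.datadir("exp_raw.csv"))
# ... cleaning steps would go here ...
# write the resulting DataFrame back out; only the `write_` prefix and format change
write_parquet(raw_df, outdir("exp_clean.parquet"))
write_csv(raw_df, outdir("exp_clean.csv"))
```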
## Importing Files
### CSV Files from Internet URLs
With `{TidierFiles}` we can read data from URLs that are hosted on the web.
To do this, we can use the `read_csv()` function and pass in the URL as the first argument, instead of a relative filepath.
If the CSV or txt file is hosted on GitHub, you can navigate to the file in GitHub, open up the raw version, and pass the URL to the `read_csv()`/`read_delim()` function.
![](./figs/import-export/GitHub-file-import-1.png)
![](./figs/import-export/GitHub-file-import-2.png)
```{julia}
using TidierFiles

read_csv(
    "https://raw.githubusercontent.com/TidierOrg/TidierFiles.jl/main/testing_files/csvtest.csv";
    skip = 1, # skip the first data row
    n_max = 4, # only read the first 4 rows
    col_select = ["ID", "Score"], # only read the "ID" and "Score" columns
    missingstring = ["4"] # replace the value "4" with a missing value representation
)
```
### Specific Excel Sheets & Ranges
While it's generally preferable to try and use non-proprietary data files that can be read by multiple software tools, such as `.csv` files, we are sometimes just provided with Excel files (or need to create them).
In these situations, we can use `{TidierFiles}`'s Excel functions to access data from specific sheets, or use the underlying [`{XLSX}`](https://github.com/JuliaIO/XLSX.jl) package to read and write Excel files.
To read data from a particular sheet, we use the `sheet = ` keyword argument of the `read_xlsx(path-to-file.xlsx, sheet = "sheet-name")` function.
Often, Excel sheets contain multiple tables, which may be related to one-another, but should be considered as their own separate entities.
In these situations we can use the `range = ` keyword argument of the `read_xlsx()` function to only read a portion of the Excel sheet.
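As a short sketch, the `sheet = ` and `range = ` keyword arguments can be combined in a single call; the workbook name, sheet name, and cell range below are hypothetical placeholders.
```{julia}
#| eval: false
using DrWatson, TidierFiles
# read only the table in cells A1:D20 of the "2022_cases" sheet of a hypothetical workbook
cases_2022 = read_xlsx(
    DrWatson.datadir("surveillance.xlsx");
    sheet = "2022_cases",
    range = "A1:D20",
)
```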
### Missing Data
For delimited and Excel files, it is possible to specify how missing data should be handled on file reads.
This is done using the `missingstring = ` keyword argument of the `read_{format}()` functions, and defaults to `""`.
### Skipping Rows
Your input data files may include a number of header rows that you do not want to include in the resulting dataframes, for example, a data dictionary that lists how values in the main table are coded.
These rows can be skipped using the `skip = ` keyword argument of all the `read_{format}()` functions.
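As a quick sketch, the `skip = ` and `missingstring = ` keyword arguments can be combined in one call; the file name and the particular missing-value codes below are hypothetical.
```{julia}
#| eval: false
using DrWatson, TidierFiles
# skip two header rows (e.g., a data dictionary) and treat "NA" or "-99" as missing
survey_df = read_csv(
    DrWatson.datadir("survey_raw.csv");
    skip = 2,
    missingstring = ["NA", "-99"],
)
```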
### Application Programming Interfaces (APIs)
Sometimes an API is used to request and access data.
Although it sounds very technical and complicated, an API is simply a (documented) method of interacting with a code or data source.
In fact, we've already been using APIs throughout this book - every package provides a number of functions that are ***exported*** and available to end-users, which constitutes an API!
Now that we're a little more comfortable with the notion of an API, let's see how it is often used in programming: to access data and information from websites.
For this example, we are going to use the [Delphi COVIDcast API](https://delphi.cmu.edu/epidemic-signals/) to collect COVID-19 death incidence data for the USA between 2020 and 2022.
Since the beginning of the COVID-19 pandemic, the CMU Delphi team has been collating data from multiple sources and have provided an API for use by researchers and the public alike.
As part of this project they have created an [R package](https://cmu-delphi.github.io/covidcast/covidcastR/) and a [Python package](https://cmu-delphi.github.io/covidcast/covidcast-py/html/) to make interacting with the API easier.
While there is not an equivalent Julia package, we can use the [API documentation](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html#constructing-api-queries) to assist us in downloading the data.
Examining the instructions, we can see we need to use the following base URL: `https://api.delphi.cmu.edu/epidata/covidcast/`, and that there are a number of required query parameters.
To see how we can piece together a query, the Delphi team have provided a [number of examples](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html#constructing-api-queries) we can use, along with our recently gained understanding of the API requirements.
Looking at the examples we can see that each API query starts with `?` and uses `&` to join the query parameters.
To fill out the rest of the details, we will look at the example provided in the [R package vignette](https://cmu-delphi.github.io/covidcast/covidcastR/), as this is what we are trying to recreate.
```{julia}
base_url = "https://api.delphi.cmu.edu/epidata/covidcast/?"
data_source = "jhu-csse"
signal = "deaths_incidence_num"
geo = "nation"
geo_values = "us"
time_type = "day"
time_values = "20200415-20221231"
url = base_url * "data_source=" * data_source *
"&signal=" * signal *
"&time_type=" * time_type *
"&geo_type=" * geo *
"&time_values=" * time_values *
"&geo_values=" * geo_values
```
::: {.callout-note collapse="true"}
We use the `*` character to perform ***string concatenation*** i.e., to join two or more strings together.
:::
Now that we have our API query, we need to make the request.
API calls often use the HTTP protocol (the same one that you are using to access this book), which defines a number of request methods (verbs) such as **GET**, **POST**, **PUT**, and **DELETE**, though we will just focus on **GET**.
In Julia, the `{HTTP}` package is useful for making HTTP requests, in this case making use of the `get()` function.
Examining the resulting `response` object, we can see there are a number of **properties** that can be accessed, but the ones we are interested in are the `status` and `body`.
The `status` property provides us with the HTTP status code of the response, i.e., whether the request was successful or not.
The `body` property provides us with the response body, which in this case is the actual data we are interested in.
```{julia}
using HTTP
response = HTTP.get(url)
```
::: {.callout-tip}
Although it is good to know how to use string concatenation to create the API query, we can use the `query` keyword argument to pass a Dictionary of API parameters to the `get()` function, as shown in the [documentation](https://juliaweb.github.io/HTTP.jl/stable/client/#query), which is a little cleaner and less prone to errors.
```{julia}
#| eval: false
response = HTTP.get(
    base_url;
    query = Dict(
        "data_source" => "jhu-csse",
        "signal" => "deaths_incidence_num",
        "geo_type" => "nation",
        "geo_values" => "us",
        "time_type" => "day",
        "time_values" => "20200415-20221231",
    )
)
```
:::
::: {.callout-warning}
In practice, you would want to check the status code of the response to make sure that the request was successful, and if not, return an error to stop errors from being propagated throughout your code.
In this case, we can just check the response code in the REPL using the following code, which **asserts** that the status code is 200, the code commonly used to denote a successful API request.
If the status code is 200, nothing will be printed, and if not, an error will be thrown.
An example of an unsuccessful API request would be a 404 error, commonly used to denote that the requested resource does not exist (think about when you came across a 404 error when browsing the internet).
HTTP status codes are not hard rules, but they are a standard that most people agree to follow.
See [here](https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml) for a full list of HTTP status codes.
```{julia}
@assert 200 == response.status
```
See this later discussion of [Result types](./misc.qmd#result-types) for more information on returning **errors as values** for a better way to handle possible (likely) sources of errors in your code.
:::
```{julia}
first(response.body, 10)
```
Unfortunately the `body` property is a `Vector{UInt8}`, which we can't work with directly, so we need to convert it to a `String` before we can manipulate it for plotting purposes.
```{julia}
body = String(response.body)
```
::: {.callout-note collapse="true"}
Once we convert the body to a `String`, the original byte vector is emptied (the `String` constructor takes ownership of the data), so it cannot be used again.
To confirm this, we can try calling it again.
```{julia}
response.body
```
This is not normally an issue, but it's something to be aware of.
If you really wanted to make sure it was retained, you should instead use the command
```{julia}
#| eval: false
body = String(copy(response.body))
```
:::
Once we have downloaded the JSON data, we need to turn it into something that we can work with in Julia (i.e., a native data structure like a Dictionary).
There are a few different options, but here we will use the [`{Serde.jl}`](https://github.com/JuliaData/Serde.jl) package as it can convert many more file formats, not just JSON.
`{Serde.jl}` is loosely modeled on the popular [`{serde.rs}`](https://github.com/serde-rs/serde) Rust library (in that it aims to achieve the same goals); the name stands for **ser**ializing and **de**serializing data (i.e., converting language-specific objects into common formats and back).
The function we are interested in here is `Serde.parse_json()`, which takes the JSON object and converts it into a Dictionary.
::: {.callout-note collapse="true"}
JSON is a commonly used data format used to transfer data between web applications and APIs, as well as between many different programming languages.
It is non-tabular and hierarchical, and looks a little like a Dictionary entry.
An example from the Delphi COVIDcast API documentation, which states that it returns a JSON object by default, is shown below:
```{json}
{
  "result": 1,
  "epidata": [
    {
      "geo_value": "06001",
      "time_value": 20200407,
      "direction": null,
      "value": 1.1293550689064,
      "stderr": 0.53185454111042,
      "sample_size": 281.0245
    },
    ...
  ],
  "message": "success"
}
```
:::
```{julia}
using Serde
parsed = Serde.parse_json(body)
```
We can see that `{Serde.jl}` has converted the JSON into a Dictionary, and the data we are interested in is in the `epidata` property, which is itself a `Vector{Dict{String, Any}}` (a vector of dictionaries that has strings for keys and any type for the values).
```{julia}
epidata = parsed["epidata"]
```
Now we are finally ready to work with the data.
The first thing we want to do is examine what the data looks like, and the easiest way to do this is to just examine the first element in the Vector.
```{julia}
epidata[1]
```
We can double check that some of the other elements in the Vector also look the same, but assuming they do, we are really interested in the `time_value` and `value` properties as our goal is to plot the number of deaths over time.
In order for us to plot the data, we need to create two vectors that can be passed to our plotting library: `{GLMakie.jl}`.
Let's do that with a loop.
```{julia}
using Dates

# Create two empty vectors of the correct types
dates = Vector{Date}(undef, length(epidata))
vals = Vector{Float64}(undef, length(epidata))

# Use pairs to ensure we have the correct length of the dictionary
# and create a pair of indices and values that we can iterate over
for (index, dictionary) in pairs(epidata)
    vals[index] = dictionary["value"]
    # using the `Date` constructor and the `dateformat` string macro we can convert
    # the `time_value` into a `Date` format
    dates[index] = Date("$(dictionary["time_value"])", dateformat"yyyymmdd")
end
```
Now we have our data in vector form, let's plot it!
Throughout this book we will use `{GLMakie.jl}` for plotting, and for the time being you can ignore the implementation details.
In a [later chapter](./misc.qmd#plotting) we will cover the details of the plotting library.
```{julia}
#| fig-align: center
using GLMakie

# Create a minimal custom theme with bold, larger axis labels
function theme_adjustments()
    return Theme(;
        fontsize = 16,
        Axis = (;
            xlabelsize = 20,
            ylabelsize = 20,
            xlabelfont = :bold,
            ylabelfont = :bold,
        ),
        Colorbar = (;
            labelsize = 20,
            labelfont = :bold,
        ),
    )
end

custom_theme = merge(theme_adjustments(), theme_minimal())

set_theme!(
    custom_theme;
    fontsize = 16,
    linewidth = 2,
)

# Create a lineplot of the data using GLMakie
fig = Figure()
ax = Axis(fig[1, 1]; xlabel = "Date", ylabel = "Deaths incidence\nin the USA")
lines!(ax, dates, vals)
fig
```
### R Files
It is possible that you will be working with others who use R.
While it would be ideal if all of your data was saved to common formats, such as CSV, that is not always the case, with many R users choosing to save with R-specific formats.
While `{TidierFiles.jl}` does not yet have the functionality to read R files (though there [are discussions](https://github.com/TidierOrg/TidierFiles.jl/issues/10) about implementing it), we can use the [`{RData.jl}` package](https://github.com/JuliaData/RData.jl) to load RDS and RData files.
As noted in the ***README.md***, we will also add and load the `{CodecBzip2.jl}` and `{CodecXz.jl}` packages for reading R data files that might use non-default compression methods.
As an example, we will load the fictional [Malaria count data](https://epirhandbook.com/new_pages/data_used.html#data_malaria) from the EpiRHandbook, and the [Niamey data](https://github.com/objornstad/epimdr2) from Ottar Bjornstad's "Epidemics: Models and Data in R" Book.
```{julia}
using RData
import CodecBzip2, CodecXz
using DrWatson
malaria = RData.load(DrWatson.datadir("malaria_facility_count_data.rds"))
```
```{julia}
niamey_dict = RData.load(DrWatson.datadir("niamey.rda"))
```
As we can see, the malaria RDS file loaded as expected, creating a DataFrame object that we can manipulate directly.
However, the Niamey RData file is not loaded as a DataFrame; instead, it is loaded as a Dictionary, so we need to extract the DataFrame from it.
This highlights the difference between RData and RDS files: an RDS file can only contain a single object, so it is effectively a special case of an RData file.
In both cases, we should confirm that all the variable type guesses are correct, as well as inspecting them for missingness, but this is something we will cover in later chapters.
```{julia}
niamey = niamey_dict["niamey"]
```
### Non-Tabular Data - `{JLD2.jl}`
We don't always work with tabular data.
For example, we may want to save the fit of a regression model to a file for later use, or we may just have a multi-dimensional array of epidemic simulations that we do not want to first transform into a matrix form.
In these situations we need to use a different file format, and in Julia a common choice is the [`{JLD2.jl}` package](https://github.com/JuliaIO/JLD2.jl).
`{JLD2.jl}` is a package, written in pure Julia, that allows us to read and write files in a format compatible with the [`{HDF5.jl}` package](https://github.com/JuliaIO/HDF5.jl).
HDF5 (Hierarchical Data Format v5) is a common file format that can be written and read by many programming languages, including a couple of R packages (['{hdf5r}'](https://github.com/hhoeflin/hdf5r), [`{rhdf5}`](https://github.com/grimbough/rhdf5)), making it an especially good choice for polyglot teams.
::: {.callout-warning collapse="true"}
While `{JLD2.jl}` does its [best to save custom **structs**](https://juliaio.github.io/JLD2.jl/dev/hdf5compat/#Understanding-how-Julia-structs-are-encoded) in a manner that other languages will be able to interpret, this may not always work, so if cross-language compatibility is essential it is better to convert them into a common format, such as Arrays.
:::
To load a JLD2 file we can use the `load()` or the `jldopen()` functions.
`load()` works by reading the file and returning it as a Dictionary that can be saved to an object and indexed.
```{julia}
using JLD2
example_load = JLD2.load(datadir("example.jld2"))
```
```{julia}
example_load["dict"]
```
`jldopen()` works in a [similar way to the `XLSX.openxlsx()` function](#specific-excel-sheets) in that it opens a file in a certain mode (`r` for read-only; `r+` for read/write, failing if the file doesn't exist; `w` for read/write, overwriting existing files; and `a+` for read/write, preserving the existing file contents, i.e., appending).
By default, `jldopen()` will open the file in read-only mode, but we are writing it explicitly for clarity in this example.
Once the file has been opened, we can index it just like we do with a Dictionary.
```{julia}
example_jldopen = jldopen(datadir("example.jld2"), "r")
```
```{julia}
example_jldopen["dict"]
```
Although objects are accessed in the same way between the two methods, we can see that they are not the same thing by examining the types (though for our purposes at the moment it doesn't matter).
```{julia}
typeof(example_load)
```
```{julia}
typeof(example_jldopen)
```
::: {.callout-warning}
Once you have finished reading/writing a file with `jldopen()` you need to close it.
If you don't, you will not be able to open it again elsewhere; for example, you may have multiple scripts that reference the same data file while you work in the same long-running REPL (which is usually recommended in Julia development).
To do this, use `close()` to close the file.
```{julia}
close(example_jldopen)
```
The `close()` function is not necessary if you use the `load()` function.
:::
## Exporting Files
### Specific Excel Sheets
If we want to write data to a specific sheet, we cannot use the `{TidierFiles}` Excel functions.
More information on how this can be accomplished can be found in the [documentation](https://felipenoris.github.io/XLSX.jl/stable/tutorial/#Edit-Existing-Files) of the `{XLSX}` package, but in brief, you must open the Excel file in either `w` (write) or `rw` (read-write) mode, depending on whether the file already exists, before setting the sheet's values equal to the data you wish to add.
An example of creating a new spreadsheet, taken from the [documentation](https://felipenoris.github.io/XLSX.jl/stable/tutorial/#Create-New-Files), is shown below:
```{julia}
#| eval: false
using XLSX

XLSX.openxlsx("my_new_file.xlsx", mode="w") do xf
    sheet = xf[1]
    XLSX.rename!(sheet, "new_sheet")
    sheet["A1"] = "this"
    sheet["A2"] = "is a"
    sheet["A3"] = "new file"
    sheet["A4"] = 100
    # will add a row from "A5" to "E5"
    sheet["A5"] = collect(1:5) # equivalent to `sheet["A5", dim=2] = collect(1:5)`
    # will add a column from "B1" to "B4"
    sheet["B1", dim=1] = collect(1:4)
    # will add a matrix from "A7" to "C9"
    sheet["A7:C9"] = [ 1 2 3 ; 4 5 6 ; 7 8 9 ]
end
```
::: {.callout-note}
If you are not already familiar with `do` blocks, don't worry, they are relatively simple.
The Excel file is created and passed to the `do` block as the `xf` variable, which can then be manipulated inside an anonymous function that has its own **scope**.
You can use the `do` block when using the `map()` function, as [shown earlier](./julia-basics.qmd#map).
:::
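To edit a file that already exists and write to a specific sheet, as described above, the file can instead be opened in `rw` (read-write) mode; below is a minimal sketch under the assumption that the workbook and a sheet named "summary" already exist (both names are hypothetical).
```{julia}
#| eval: false
using XLSX
# open an existing workbook in read-write mode and edit one sheet in place
XLSX.openxlsx("my_existing_file.xlsx", mode="rw") do xf
    sheet = xf["summary"]    # index an existing sheet by its name
    sheet["A1"] = "scenario"
    sheet["B1"] = "mean_deaths"
    sheet["A2"] = "baseline"
    sheet["B2"] = 123.4
end
```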
### Non-Tabular Data - `{JLD2.jl}`
As mentioned earlier, we may want to save non-tabular data to a file.
In these situations, we can use the `{JLD2.jl}` package, and there are two functions that we can use to do this.
Let's recreate the jld2 file from earlier.
```{julia}
dict = Dict("a" => 100, "b" => "This is a string")
x = 100
jldsave(datadir("example.jld2"); dict, x)
```
We can also rename the objects to be saved with the keyword arguments.
```{julia}
jldsave(datadir("example_rename.jld2"); my_dict = dict, x)
load(datadir("example_rename.jld2"))
```
The alternative method is to open up the file in read/write mode and edit it directly.
::: {.callout-note collapse="true"}
In this case, we have not created an object for the file, using a `do` block instead.
To close the file, we need to use the `close()` function at the end of the block.
:::
```{julia}
#| eval: false
jldopen(datadir("example.jld2"), "w") do file
file["dict"] = dict
file["x"] = x
close(file)
end
```
The advantage of this method is that we can create **groups**, i.e., a folder-like structure within the file that organizes objects.
```{julia}
jldopen(datadir("example_group.jld2"), "w") do file
group_1 = JLD2.Group(file, "group_1")
# equivalent to writing file["group_1/dict"] = dict
# and file["group_1"]["dict"] = dict
group_1["dict"] = dict
group_1["x"] = x
group_2 = JLD2.Group(file, "group_2")
group_2["string"] = "This exists in group 2"
close(file)
end
load(datadir("example_group.jld2"))
```