├── .gitignore ├── docs ├── Project.toml ├── make.jl └── src │ ├── index.md │ ├── setup.md │ ├── tabletraits.md │ └── fileio.md ├── LICENSE ├── .travis.yml └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | docs/build 2 | docs/Manifest.toml 3 | -------------------------------------------------------------------------------- /docs/Project.toml: -------------------------------------------------------------------------------- 1 | [deps] 2 | Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" 3 | 4 | [compat] 5 | Documenter = "~0.21" 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. 2 | To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. 3 | -------------------------------------------------------------------------------- /docs/make.jl: -------------------------------------------------------------------------------- 1 | using Documenter 2 | 3 | makedocs( 4 | sitename = "Julia for Data Science", 5 | pages = [ 6 | "index.md", 7 | "setup.md", 8 | "fileio.md", 9 | "tabletraits.md" 10 | ] 11 | ) 12 | 13 | deploydocs( 14 | repo = "github.com/davidanthoff/jl4ds" 15 | ) 16 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: julia 2 | os: 3 | - linux 4 | julia: 5 | - 1.0 6 | notifications: 7 | email: false 8 | branches: 9 | only: 10 | - master 11 | - /release-.*/ 12 | - /v(\d+)\.(\d+)\.(\d+)/ 13 | script: 14 | - julia --project=docs/ -e 'using Pkg; Pkg.instantiate(); Pkg.develop(PackageSpec(path=pwd()))' 15 | - julia --project=docs/ --color=yes docs/make.jl 16 | 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](http://www.repostatus.org/badges/latest/wip.svg)](http://www.repostatus.org/#wip) 2 | [![](https://img.shields.io/badge/docs-stable-blue.svg)](http://www.david-anthoff.com/jl4ds/stable/) 3 | [![](https://img.shields.io/badge/docs-dev-blue.svg)](http://www.david-anthoff.com/jl4ds/dev/) 4 | [![Build Status](https://travis-ci.org/davidanthoff/jl4ds.svg?branch=master)](https://travis-ci.org/davidanthoff/jl4ds) 5 | 6 | # Julia for Data Science 7 | 8 | This might eventually turn into a book. The current version can be read at [Julia for Data Science](http://www.david-anthoff.com/jl4ds/stable/). 9 | -------------------------------------------------------------------------------- /docs/src/index.md: -------------------------------------------------------------------------------- 1 | # Julia for Data Science 2 | 3 | This might eventually turn into a book, who knows. For now it is a work in progress that might be useful to some. 4 | 5 | This book will not teach you about all the packages in the julia data ecosystem. The book describes a particular way to do data science with julia. That approach is based on a family of packages that follow a common philosophy and are tightly integrated with each other. While this makes for a coherent and simple story, it does mean that there is a risk that you are missing out on some great features in other packages if you blindly follow the advice in this book. I try to mitigate this with a section at the end of each chapter that points you towards alternative packages that you can use to achieve similar outcomes as described in that chapter. 
6 | 7 | This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/). 8 | -------------------------------------------------------------------------------- /docs/src/setup.md: -------------------------------------------------------------------------------- 1 | # Setup 2 | 3 | All you need to follow the examples in this book is a working installation of the programming language [julia](https://julialang.org/) and the [Queryverse.jl](https://github.com/queryverse/Queryverse.jl) package. While that is technically all you need, I would also recommend that you install [VS Code](https://code.visualstudio.com/) with the [VS Code julia extension](https://marketplace.visualstudio.com/items?itemName=julialang.language-julia) and [IJulia](https://github.com/JuliaLang/IJulia.jl), the julia kernel for [Jupyter](http://jupyter.org/). 4 | 5 | ## Installing julia 6 | 7 | Installing julia is easy: you just download a version for your computer from the [julia language website](https://julialang.org/downloads/) and then install it on your machine. Once you have installed julia, try to start it. You should see the julia REPL: a command line at which you can enter julia code and execute it by pressing `Enter`. 8 | 9 | All examples in this book are based on julia 1.0 or newer. 10 | 11 | ## Installing the Queryverse.jl package 12 | 13 | Once julia is installed, just start it and enter the following command in the Pkg REPL mode (which you switch to by pressing `]` at the `julia>` prompt): 14 | 15 | ```julia 16 | pkg> add Queryverse 17 | ``` 18 | 19 | This will install the [Queryverse.jl](https://github.com/queryverse/Queryverse.jl) package onto your system. The [Queryverse.jl](https://github.com/queryverse/Queryverse.jl) package is a meta-package that pulls in a large number of other packages that you need to run the code examples in this book.
You could install all of these packages one-by-one manually, but it is much easier to just install (and later use) the [Queryverse.jl](https://github.com/queryverse/Queryverse.jl) package instead and not worry about those details. 20 | -------------------------------------------------------------------------------- /docs/src/tabletraits.md: -------------------------------------------------------------------------------- 1 | # Advanced - TableTraits.jl 2 | 3 | !!! note 4 | 5 | This chapter describes the TableTraits.jl interface as it existed on julia 0.6. There have been some small changes for julia 1.0 that are not yet reflected in this chapter. 6 | 7 | This chapter describes the internals of various table interfaces that are defined in the [TableTraits.jl](https://github.com/queryverse/TableTraits.jl) package. Most data science users do not need to read this chapter; it mostly targets developers who want to integrate their own packages with the ecosystem of packages described in this book. 8 | 9 | ## Overview 10 | 11 | The [TableTraits.jl](https://github.com/queryverse/TableTraits.jl) package defines three interfaces that a table can implement: the iterable tables interface, the columns-copy interface and the columns-view interface. A function that accepts a table as an argument can then use these interfaces to access the data in the table. By accessing the data in the table via one of these three interfaces, the function can interact with many different types of tables, without taking a dependency on any specific table implementation. 12 | 13 | While the three table interfaces are entirely independent from each other, one of the three is more equal than the others: any table that wants to participate in the Queryverse ecosystem *must* implement the iterable tables interface, and every function that accepts a table as an argument *must* be able to access the data from that table via the iterable tables interface.
The iterable tables interface is thus the most fundamental of the three interfaces, and one that any implementation must support. The two other interfaces (columns-copy and columns-view) are more specialized, but can provide much better performance in certain situations. Tables and table consumers *may* support those interfaces in addition to the iterable tables interface. 14 | 15 | The [TableTraitsUtils.jl](https://github.com/queryverse/TableTraitsUtils.jl) package provides a number of helper functions that make it easier to implement and consume the interfaces described in this chapter. Most packages that want to integrate with the ecosystem described in this chapter should first check whether any of the helper functions in that package can be used to implement these interfaces, before attempting to follow the guidelines in this chapter to implement these interfaces manually. 16 | 17 | ## The iterable tables interface 18 | 19 | ### Specification 20 | 21 | The iterable table interface has two core parts: 22 | 23 | 1. A simple way for a type to signal that it is an iterable table. It also provides a way for consumers of an iterable table to check whether a particular value is an iterable table and a convention on how to start the iteration of the table. 24 | 2. A number of conventions for how tabular data should be iterated. 25 | 26 | In addition, [TableTraits.jl](https://github.com/queryverse/TableTraits.jl) provides a number of small helper functions that make it easier to implement an iterable table consumer. 27 | 28 | #### Signaling and detection of iterable tables 29 | 30 | In general, a type is an iterable table if it can be iterated and if the element type that is returned during iteration is a `NamedTuple`.
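For example, a plain `Vector` of `NamedTuple`s already satisfies this definition. The following sketch uses julia 1.0 `NamedTuple` syntax and only Base functionality:

```julia
# A Vector of NamedTuples is the simplest possible iterable table:
# each element is one row, and each field is one column.
table = [(name = "John", age = 34.0), (name = "Sally", age = 52.0)]

# The element type is a NamedTuple, so this value qualifies as an
# iterable table under the definition above.
@assert eltype(table) <: NamedTuple

# A consumer can iterate rows and access columns as fields:
for row in table
    println(row.name, " is ", row.age, " years old")
end
```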
31 | 32 | In a slight twist of the standard julia iteration interface, the iterable tables interface introduces one extra step into this simple story: consumers should never iterate a data source directly by calling the `start` function on it; instead, they should always call `IteratorInterfaceExtensions.getiterator` on the data source, and then use the standard julia iterator protocol on the value returned by `IteratorInterfaceExtensions.getiterator`. This function is defined in the [IteratorInterfaceExtensions.jl](https://github.com/queryverse/IteratorInterfaceExtensions.jl) package. 33 | 34 | This indirection enables us to implement type stable iterator functions `start`, `next` and `done` for data sources that don't incorporate enough information in their type for type stable versions of these three functions (e.g. `DataFrame`s). [IteratorInterfaceExtensions.jl](https://github.com/queryverse/IteratorInterfaceExtensions.jl) provides a default implementation of `IteratorInterfaceExtensions.getiterator` that simply returns the data source itself. For data sources that have enough type information to implement type stable versions of the iteration functions, this default implementation of `IteratorInterfaceExtensions.getiterator` works well. For other types, like `DataFrame`, package authors can provide their own `IteratorInterfaceExtensions.getiterator` implementation that returns a value of some new type that has enough information encoded in its type parameters so that one can implement type stable versions of `start`, `next` and `done`. 35 | 36 | The function `IteratorInterfaceExtensions.isiterable` enables a consumer to check whether any arbitrary value is iterable, in the sense that `IteratorInterfaceExtensions.getiterator` will return something that can be iterated. The default `IteratorInterfaceExtensions.isiterable(x::Any)` implementation checks whether a suitable `start` method for the type of `x` exists.
Types that use the indirection described in the previous paragraph might not implement a `start` method, though. Instead, their `IteratorInterfaceExtensions.getiterator` method will return a value of a type for which `start` is implemented. Such types should manually add a method to `IteratorInterfaceExtensions.isiterable` that returns `true` for values of their type, so that consumers can detect that a call to `IteratorInterfaceExtensions.getiterator` is going to be successful. 37 | 38 | The final function in the detection and signaling interface of iterable tables is `TableTraits.isiterabletable(x)`. This function is defined in the [TableTraits.jl](https://github.com/queryverse/TableTraits.jl) package. The fallback implementation for this method will check whether `IteratorInterfaceExtensions.isiterable(x)` returns `true`, and whether `eltype(x)` returns a `NamedTuple` type. For types that don't provide their own `IteratorInterfaceExtensions.getiterator` method, this will signal the correct behavior to consumers. For types that use the indirection method described above by providing their own `IteratorInterfaceExtensions.getiterator` method, package authors should provide their own `TableTraits.isiterabletable` method that returns `true` if that data source will iterate values of type `NamedTuple` from the value returned by `IteratorInterfaceExtensions.getiterator`. 39 | 40 | #### Iteration conventions 41 | 42 | Any iterable table should return elements of type `NamedTuple`. Each column of the source table should be encoded as a field in the named tuple, and the type of that field in the named tuple should reflect the data type of the column in the table. If a column can hold missing values, the type of the corresponding field in the `NamedTuple` should be a `DataValue{T}` where `T` is the data type of the column.
The `NamedTuple` type is defined in the [NamedTuples.jl](https://github.com/blackrock/NamedTuples.jl) package (on julia 1.0, `NamedTuple` is part of Base), and the `DataValue` type is defined in the [DataValues.jl](https://github.com/queryverse/DataValues.jl) package. 43 | 44 | ### Integration Guide 45 | 46 | This section describes how package authors can integrate their own packages with the iterable tables ecosystem. Specifically, it explains how one can turn a type into an iterable table and how one can write code that consumes iterable tables. 47 | 48 | The code that integrates a package with the iterable tables ecosystem should live in the repository of that package. For example, if `Foo.jl` wants to be integrated with the iterable tables ecosystem, one should add the necessary code to the `Foo.jl` repository. 49 | 50 | #### Consuming iterable tables 51 | 52 | One cannot dispatch on an iterable table because iterable tables don't have a common super type. Instead, one has to add a method that takes a value of any type as an argument to consume an iterable table. For conversions between types, it is recommended that one adds a constructor that takes one argument without any type restriction that can convert an iterable table into the target type. For example, if one has added a new table type called `MyTable`, one would add a constructor with this method signature for this type: `function MyTable(iterable_table)`. For other situations, for example a plotting function, one would also add a method without any type restriction, for example `plot(iterable_table)`. 53 | 54 | The first step inside any function that consumes iterable tables is to check whether the argument that was passed is actually an iterable table or not. This can easily be done with the `TableTraits.isiterabletable` function.
For example, the constructor for a new table type might start like this: 55 | ```julia 56 | function MyTable(source) 57 | TableTraits.isiterabletable(source) || error("Argument is not an iterable table.") 58 | 59 | # Code that converts things follows 60 | end 61 | ``` 62 | Once it has been established that the argument is actually an iterable table, there are multiple ways to proceed. The following two sections describe these options; which one is appropriate for a given situation depends on a variety of factors. 63 | 64 | ##### Reusing an existing consumer of iterable tables 65 | 66 | This option is by far the simplest way to add support for consuming an iterable table. Essentially, the strategy is to reuse the conversion implementation of some other type. For example, one can simply convert the iterable table into a `DataFrame` right after one has checked that the argument of the function is actually an iterable table. Once the iterable table is converted to a `DataFrame`, one can use the standard API of `DataFrame`s to proceed. This strategy is especially simple for packages that already support interaction with `DataFrame`s (or any of the other table types that support the iterable tables interface). The code for the `MyTable` constructor might look like this: 67 | ```julia 68 | function MyTable(source) 69 | TableTraits.isiterabletable(source) || error("Argument is not an iterable table.") 70 | 71 | df = DataFrame(source) 72 | return MyTable(df) 73 | end 74 | ``` 75 | This assumes that `MyTable` has another constructor that accepts a `DataFrame`. 76 | 77 | While this strategy to consume iterable tables is simple to implement, it leads to a tighter coupling than needed in many situations. In particular, a package that follows this strategy will still need a dependency on an existing table type, which is often not ideal. I therefore recommend this strategy only as a first quick-and-dirty way to become compatible with the iterable table ecosystem.
The options described in the next sections are generally more robust ways to achieve the iterable table integration. 78 | 79 | ##### Coding a complete conversion 80 | 81 | Coding a custom conversion is more work than reusing an existing consumer of iterable tables, but it provides more flexibility. 82 | 83 | In general, a custom conversion function also needs to start with a call to `TableTraits.isiterabletable` to check whether one actually has an iterable table. The second step in any custom conversion function is to call the `IteratorInterfaceExtensions.getiterator` function on the iterable table. This will return an instance of a type that implements the standard julia iterator interface, i.e. one can call `start`, `next` and `done` on the instance that is returned by `IteratorInterfaceExtensions.getiterator`. For some iterable tables `IteratorInterfaceExtensions.getiterator` will just return the argument that one has passed to it, but for other iterable tables it will return an instance of a different type. 84 | 85 | `IteratorInterfaceExtensions.getiterator` is generally not a type stable function. Given that this function is generally only called once per conversion, this hopefully is not a huge performance issue. The functions that really need to be type stable are `start`, `next` and `done` because they will be called for every row of the table that is to be converted. In general, these three functions will be type stable for the type of the return value of `IteratorInterfaceExtensions.getiterator`. But given that `IteratorInterfaceExtensions.getiterator` is not type stable, one needs to use a function barrier to make sure the three iteration functions are called from a type stable function. 86 | 87 | The next step in a custom conversion function is typically to find out what columns the iterable table has.
The helper functions `TableTraits.column_types` and `TableTraits.column_names` provide this functionality (note that these are not part of the official iterable tables interface; they are simply helper functions that make it easier to find this information). Both functions need to be called with the return value of `IteratorInterfaceExtensions.getiterator` as the argument. `TableTraits.column_types` returns a vector of `Type`s that are the element types of the columns of the iterable table. `TableTraits.column_names` returns a vector of `Symbol`s with the names of the columns. 88 | 89 | Custom conversion functions can at this point optionally check whether the iterable table implements the `length` function by checking whether `Base.iteratorsize(typeof(iter))==Base.HasLength()` (this is part of the standard iteration protocol). It is important to note that every consumer of iterable tables needs to handle the case where no length information is available, but can provide an additional, typically faster implementation if length information is provided by the source. A typical pattern might be that a consumer can pre-allocate the arrays that should hold the data from the iterable tables with the right size if length information is available from the source. 90 | 91 | With all this information, a consumer would now typically allocate the data structures that should hold the converted data. This will almost always be very consumer specific. Once these data structures have been allocated, one can actually implement the loop that iterates over the source rows. To get good performance it is recommended that this loop is implemented in a new function (behind a function barrier), and that the function with the loop is type stable. Often this will require the use of a generated function that generates code for each column of the source. This can avoid a loop over the columns while one is iterating over the rows.
It is often key to avoid a loop over columns inside the loop over the rows, given that columns can have different types, which almost inevitably would lead to a type instability. 92 | 93 | A good example of a custom consumer of an iterable table is the code in the `DataTable` integration. 94 | 95 | #### Creating an iterable table source 96 | 97 | There are generally two strategies for turning some custom type into an iterable table. The first strategy works if one can implement a type stable version of `start`, `next` and `done` that iterates elements of type `NamedTuple` directly for the source type. If that is not feasible, the strategy is to create a new iterator type. The following two sections describe both approaches. 98 | 99 | ##### Directly implementing the julia base iteration trait 100 | 101 | This strategy only works if the type that one wants to expose as an iterable table has enough information about the structure of the table that one can implement a type stable version of `start`, `next` and `done`. Typically that requires that one can deduce the names and types of the columns of the table purely from the type (and type parameters). For some types that works, but for other types (like `DataFrame`) this strategy won't work. 102 | 103 | If the type one wants to expose as an iterable table allows this strategy, the implementation is fairly straightforward: one simply needs to implement the standard julia base iterator interface, and during iteration one should return `NamedTuple`s for each element. The fields in the `NamedTuple` correspond to the columns of the table, i.e. the names of the fields are the column names, and the types of the fields are the column types. If the source supports some notion of missing values, it should return `NamedTuple`s that have fields of type `DataValue{T}`, where `T` is the data type of the column. 104 | 105 | It is important to not only implement `start`, `next` and `done` from the julia iteration protocol.
Iterable tables also always require that `eltype` is implemented. Finally, one should either implement `length`, if the source supports returning the number of rows without expensive computations, or one should add a method to `Base.iteratorsize` that returns `Base.SizeUnknown()` for the custom type. 106 | 107 | The implementation of a type stable `next` method typically requires the use of generated functions. 108 | 109 | ##### Creating a custom iteration type 110 | 111 | For types that don't have enough information encoded in their type to implement a type stable version of the julia iteration interface, the best strategy is to create a custom iteration type that implements the julia iteration interface and has enough type information. 112 | 113 | For example, for the `MyTable` type one might create a new iterator type called `MyTableIterator{T}` that holds the type of the `NamedTuple` that this iterator will return in `T`. 114 | 115 | To expose this new iterator type to consumers, one needs to add a method to the `IteratorInterfaceExtensions.getiterator` function. This function takes an instance of the type one wants to expose as an iterable table, and returns a value of a new type that should actually be used for the iteration itself. For example, `function IteratorInterfaceExtensions.getiterator(table::MyTable)` would return an instance of `MyTableIterator{T}`. 116 | 117 | In addition to adding a method to `IteratorInterfaceExtensions.getiterator`, one must also add methods to the `IteratorInterfaceExtensions.isiterable` and `TableTraits.isiterabletable` functions for the type one wants to turn into an iterable table; in both cases those methods should return `true`. 118 | 119 | The final step is to implement the full julia iteration interface for the custom iterator type that one returned from `IteratorInterfaceExtensions.getiterator`. All the same requirements that were discussed in the previous section apply here as well.
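Putting these steps together, the wiring for such a custom iterator type might look like the following sketch (it reuses the hypothetical `MyTable` and `MyTableIterator{T}` names from above, elides the actual bodies, and assumes the julia 0.6-era iteration functions described in this chapter):

```julia
# T is the NamedTuple type that the iterator returns, so that type
# stable versions of start, next and done can be implemented.
struct MyTableIterator{T}
    table::MyTable
end

# Return the custom iterator instead of the table itself. The
# NamedTuple type T would be computed from the column names and
# element types stored in `table`.
function IteratorInterfaceExtensions.getiterator(table::MyTable)
    T = ... # elided: construct the NamedTuple type for this table
    return MyTableIterator{T}(table)
end

# Signal that MyTable is iterable, and an iterable table.
IteratorInterfaceExtensions.isiterable(::MyTable) = true
TableTraits.isiterabletable(::MyTable) = true

# Finally, implement eltype and the full iteration protocol for the
# iterator type, returning one NamedTuple of type T per row.
Base.eltype(::Type{MyTableIterator{T}}) where {T} = T
```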
120 | 121 | An example of this strategy is the `DataTable` integration. 122 | 123 | ## The columns-copy interface [experimental] 124 | 125 | Note that this interface is still experimental and might change in the future. 126 | 127 | ### Specification 128 | 129 | The columns-copy interface consists of only two functions: `TableTraits.supports_get_columns_copy` (to check whether a table supports this interface) and `TableTraits.get_columns_copy` (to get a copy of all the columns in the table). 130 | 131 | This interface allows a consumer of a table to obtain a copy of the data in a table. The copy will consist of one vector for each column of the source table. The key feature of this interface is that the consumer will "own" the vectors it obtains: it can modify, delete or do anything else with them. This implies that a source that returns columns via this interface should not hold onto the actual vectors it returns. 132 | 133 | The `TableTraits.supports_get_columns_copy` function accepts one argument that has to be a table. It will return `true` or `false`, depending on whether the table supports the columns-copy interface or not. 134 | 135 | The `TableTraits.get_columns_copy` function also accepts one argument that is a table. It returns a `NamedTuple` with one field for each column in the source table. Each field should hold a vector with the actual values for that column. 136 | 137 | If the source table supports a notion of missing data in a column, it should return a `DataValueVector` from the [DataValues.jl](https://github.com/queryverse/DataValues.jl) package for such columns.
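As an illustration, a hypothetical table type that stores two columns `a` and `b` as vectors in fields of the same names might implement the columns-copy interface like this (the `MyColumnTable` type and its field names are made up for this sketch; they are not part of any real package):

```julia
TableTraits.supports_get_columns_copy(::MyColumnTable) = true

function TableTraits.get_columns_copy(table::MyColumnTable)
    # The consumer will own the returned vectors, so the source must
    # hand out copies rather than its internal storage.
    return (a = copy(table.a), b = copy(table.b))
end
```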
138 | 139 | ## The columns-view interface 140 | 141 | ### Specification 142 | 143 | The columns-view interface consists of only two functions: `TableTraits.supports_get_columns_view` (to check whether a table supports this interface) and `TableTraits.get_columns_view` (to get a view into the source table). 144 | 145 | This interface allows a consumer of a table to get access to the columns in a table via a standardized interface. In particular, a consumer can obtain a `NamedTuple` of columns that gives access to the data in the source table. The key feature of this interface is that the consumer is only allowed to read data from the arrays returned by this interface. The consumer must not attempt to modify the content of the source table via the arrays that were returned from this interface. A source should in general not make copies of the data if it implements this interface. In essence, this interface gives a read-only view into a table that a consumer can use to access any cell in the table. 146 | 147 | The `TableTraits.supports_get_columns_view` function accepts one argument that has to be a table. It will return `true` or `false`, depending on whether the table supports the columns-view interface or not. 148 | 149 | The `TableTraits.get_columns_view` function also accepts one argument that is a table. It returns a `NamedTuple` with one field for each column in the source table. Each field should hold a vector with the actual values for that column. 150 | 151 | If the source table supports a notion of missing data in a column, the `eltype` of the vector for that column must be of type `DataValue`. 152 | 153 | ## The TableTraitsUtils.jl package 154 | 155 | [TODO] 156 | -------------------------------------------------------------------------------- /docs/src/fileio.md: -------------------------------------------------------------------------------- 1 | # Tabular File IO 2 | 3 | This chapter will teach you how to read data from and write data to files.
We will limit the discussion to tabular data, i.e. data that has the structure of a table. 4 | 5 | ## Getting Started 6 | 7 | ### Loading Data 8 | 9 | The main function for reading tabular data from disc is the `load` function from the [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) package. You always pass a filename as the first argument to the `load` function. For some file formats you can also pass additional arguments that control the details of how the data is loaded. You will learn about those additional arguments later in the chapter. 10 | 11 | It is often convenient to materialize the data from a tabular file into a data structure like a `DataFrame`. You can do that easily by passing the return value from the `load` function to the `DataFrame` constructor, like this: 12 | 13 | ```julia 14 | using Queryverse 15 | 16 | df = DataFrame(load("mydata.csv")) 17 | ``` 18 | 19 | You can also use the pipe syntax to achieve the same result: 20 | 21 | ```julia 22 | using Queryverse 23 | 24 | df = load("mydata.csv") |> DataFrame 25 | ``` 26 | 27 | The pipe syntax is particularly useful when you want to apply some data transformation to the data that you are loading. For example, you can filter the data before you materialize it into a `DataFrame` like this: 28 | 29 | ```julia 30 | using Queryverse 31 | 32 | df = load("mydata.csv") |> @filter(_.age>20) |> DataFrame 33 | ``` 34 | 35 | The `load` function can load many different tabular file formats. The following code loads an Excel file: 36 | 37 | ```julia 38 | using Queryverse 39 | 40 | df = load("mydata.xlsx", "Sheet1") |> DataFrame 41 | ``` 42 | 43 | Note how we have to pass the name of the sheet to read as a second argument to the `load` function for Excel files. 44 | 45 | A full list of supported file formats is provided later in this chapter. 46 | 47 | You can also use the `load` function to acquire data from a remote server by passing a URI as the filename. 
The following code loads a CSV file from a remote server: 48 | 49 | ```julia 50 | using Queryverse 51 | 52 | df = load("https://raw.githubusercontent.com/queryverse/CSVFiles.jl/master/test/data.csv") |> DataFrame 53 | ``` 54 | 55 | ### Saving Data 56 | 57 | The `save` function from the [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) package is the main function to save tabular data to disc. The first argument to the `save` function is the filename you want to use for the file. The file extension of that filename will determine in what format the data will be written to disc. The second argument is the table you want to write to disc. Here is a simple example that writes some data to a CSV file: 58 | 59 | ```julia 60 | using Queryverse 61 | 62 | df = DataFrame(Name=["Jim", "Sally", "John"], Age=[23., 56., 34.]) 63 | 64 | save("mydata.csv", df) 65 | ``` 66 | 67 | You can also use the pipe syntax with the `save` function: 68 | 69 | ```julia 70 | using Queryverse 71 | 72 | df = DataFrame(Name=["Jim", "Sally", "John"], Age=[23., 56., 34.]) 73 | 74 | df |> save("mydata.csv") 75 | ``` 76 | 77 | The `save` function works with any tabular data structure, not just `DataFrame`s, and it supports many different file formats. The following code shows how you can load data from a CSV file, filter it and then write it out directly as a Feather file, without ever materializing it into a `DataFrame`: 78 | 79 | ```julia 80 | using Queryverse 81 | 82 | load("mydata.csv") |> @filter(_.age>23) |> save("mydata.feather") 83 | ``` 84 | 85 | For some file formats you can pass additional configuration arguments to the `save` function that control the details of how the file is written to disc.
The following example writes a table to disc as a CSV file, but uses a non-standard delimiter character and also does not write a header to the file:

```julia
using Queryverse

df = DataFrame(Name=["Jim", "Sally", "John"], Age=[23., 56., 34.])

df |> save("mydata.csv", delim=';', header=false)
```

## The `load` and `save` functions

[TODO]

## CSV Files

[TODO add general description of CSV files]

### Loading CSV Files

If you pass a filename with the extension `*.csv` to the `load` function, [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will use the [CSVFiles.jl](https://github.com/queryverse/CSVFiles.jl) package to load that file. The package supports filenames that point to a file on your local computer and URLs that point to a file on a remote server:

```julia
using Queryverse

# Load a local file
df = load("mycsv.csv") |> DataFrame

# Load a remote file
url = "https://raw.githubusercontent.com/queryverse/CSVFiles.jl/master/test/data.csv"
df = load(url) |> DataFrame
```

#### Delimiter character

By default, CSV files use a comma `,` to separate content in different columns. While that is the most common case, CSV files also sometimes use a different character to separate content in different columns. For example, you might want to read a file that uses a semicolon `;` to separate columns, like the following example:

```
Name;Age
John;34
Sally;52
```

You can tell `load` to use a different character as the delimiter between columns by passing a `Char` value as the second argument to the `load` function:

```julia
using Queryverse

df = load("mycsvfile_with_semicolon.csv", ';') |> DataFrame
```

You can tell `load` to use any character as the column delimiter.
Another common case besides the semicolon is a tab character (written as `'\t'` in julia).

A special case arises when one or multiple spaces are used to separate columns. If you have a file like that, you can use the `spacedelim=true` argument with the `load` function. For example, say you have a file like this:

```
Name   Age
John   34
Sally  52
```

Note how columns are separated with multiple spaces in this file. You can load this file with the following code:

```julia
using Queryverse

df = load("mycsvfile_with_whitespaces.csv", spacedelim=true) |> DataFrame
```

#### Column Names

In most CSV files the first line contains the names of the columns, and subsequent lines the actual data itself. If you call `load` with no special arguments, it will assume that the first line in the CSV file holds column names. An example of such a CSV file might look like this:

```
Name,Age,Children
"John",23.,1
"Sally",54.,3
```

But sometimes CSV files don't have a special header row with the column names, and instead start with the actual data in the first row, like in this file:

```
"John",23.,1
"Sally",54.,3
```

You can indicate this situation by calling the `load` function with the keyword argument `header_exists=false`:

```julia
using Queryverse

df = load("myfile.csv", header_exists=false) |> DataFrame
```

The `header_exists=false` keyword argument causes two things: first, the first row of the file will now be read as data, i.e. the resulting table will have a first row with data from the first line in the file. Second, the columns will be named by numbers, i.e. the name of the first column will be `1`, the name of the second column `2` and so on, unless you specify custom column names with the `colnames` keyword argument.
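The two options can also be combined. As a sketch (the file name here is hypothetical), the following call reads a headerless file and assigns custom column names in one step:

```julia
using Queryverse

# hypothetical headerless file: header_exists=false makes load treat the
# first line as data, and colnames supplies names for the resulting columns
df = load("mydata_noheader.csv", header_exists=false, colnames=["name", "age", "children"]) |> DataFrame
```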
The `colnames` keyword argument of the `load` function allows you to specify your own column names. You can use that option to either specify the names of all columns as an array of `String`s, or you can change the name of only a few columns by passing a `Dict`.

When you pass an array of `String`s, you indicate that you want the names in the array to be used as the column names in the resulting table. The following code loads a CSV file and specifies custom column names:

```julia
using Queryverse

df = load("mydata.csv", colnames=["name", "age", "children"]) |> DataFrame
```

When you use the `colnames` argument with `header_exists=true` (or don't specify that keyword argument), the names in `colnames` will replace the names that are loaded from the file.

Sometimes you load some data from a CSV file that has a column header and you want to replace the names of just a few columns. While you could pass an array of `String`s to the `colnames` argument, it would be cumbersome: you would have to specify the names of all columns, even the ones that you don't want to rename. In this situation you can pass a `Dict` to the `colnames` argument instead. Each element in the `Dict` is one renaming rule that `load` should apply to the columns it loads from the file. The key for each element specifies which column should be renamed, and the value the new name. The key can be specified either as a `String`, in which case it refers to the column name that is present in the file, or as an `Int`, in which case it refers to the position of the column that should be renamed. The values in the `Dict` always have to be `String`s, i.e. the new names. Note that you cannot pass a `Dict` to `colnames` when you call `load` with `header_exists=false`. The following code example will load a CSV file, and rename the column with the original name "Age" to "age", and the third column to "children".
All other columns will keep the names that are specified in the file:

```julia
using Queryverse

df = load("mydata.csv", colnames=Dict("Age"=>"age", 3=>"children")) |> DataFrame
```

#### Rows to Read

`load` accepts two keyword arguments that allow you to specify whether all lines in the file should be read or not.

With the `skiplines_begin` argument you can tell `load` to ignore a certain number of lines at the beginning of the file. This is useful if you have a file like this:

```
# This file was generated on 1/1/2017
# By John
Name,Age,Children
"John",34.,2
"Sally",23.,1
```

In this example the first two lines in the file contain some meta information that is not part of the table data itself. You can load such a file like this:

```julia
using Queryverse

df = load("mydata.csv", skiplines_begin=2) |> DataFrame
```

With that option, the first two lines will be ignored and the file is treated as if it started in line 3.

[TODO There should actually be an option to limit the number of rows that are read]

#### Column Types

[TODO]

#### Quote and Escape Character

If a CSV file has a column with string data, it should ideally surround the actual string in quotation marks `"` or some other quote character. This is important because otherwise such a string column could not contain the character that is used as the delimiter character between columns. A typical example CSV file might look like this:

```
Name,Age
"Smith, John",23.
"Smith, Sally",35.
```

Note that the values in the `Name` column here contain a comma `,` which is also the delimiter character between columns in this file.
But because the whole string is surrounded by quotation marks `"`, the CSV reader understands that the comma between the last and first name here is part of the `Name` column and does not separate the `Name` from the `Age` column.

Some CSV files use a different character as their quotation character. For example, a file might use single quotation marks `'` like in this example:

```
Name,Age
'Smith, John',23.
'Smith, Sally',35.
```

The keyword argument `quotechar` of the `load` function allows you to specify the quote character used in the file you want to load. The above file could be loaded like this:

```julia
using Queryverse

df = load("mydata.csv", quotechar='\'') |> DataFrame
```

Note how we need to use the julia escape character `\` here to create a `Char` instance with content `'`.

There is still a problem, though: what if you have a column that sometimes contains the character that is used as the quote character? For that case you can specify an escape character: whenever the escape character followed by the quote character appears in a column, the quote character is not interpreted as the end of the column, but as an appearance of that character in the column itself. An example file might look like this:

```
Text,Number
"This text contains a \" mark",23
"This line doesn't",45
```

The content of the first column in the first row here should be read as `This text contains a " mark`. You can specify what character is used as the escape character with the `escapechar` keyword argument:

```julia
using Queryverse

df = load("mydata.csv", escapechar='\\') |> DataFrame
```

Note how we have to escape the `\` character itself in the julia string: `\` is the julia escape character, and to create a `Char` instance with content `\` we have to write `'\\'`.
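The loading options discussed in this section can be freely combined in one call. As a sketch (the file name is hypothetical), the following loads a semicolon-delimited file that quotes strings with single quotation marks and has two metadata lines at the top:

```julia
using Queryverse

# hypothetical file: ';' as the delimiter (passed as the second positional
# argument), single quotes as the quote character, two metadata lines to skip
df = load("mydata_custom.csv", ';', quotechar='\'', skiplines_begin=2) |> DataFrame
```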

### Saving CSV Files

To save a table as a CSV file, call the `save` function with a filename that has a `*.csv` extension. [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will then use the [CSVFiles.jl](https://github.com/queryverse/CSVFiles.jl) package to save the table. The following example shows how to save a table as a CSV file:

```julia
using Queryverse

df = DataFrame(name=["John", "Sally"], age=[23.,25.])

df |> save("mydata.csv")
```

The `save` function accepts a number of arguments when saving a CSV file that control the precise format of the CSV file that is written.

#### Delimiter character

You can control which character should separate columns in the result file by passing the keyword argument `delim` to the `save` function. The following code uses a semicolon `;` as the column separator character:

```julia
using Queryverse

df = DataFrame(name=["John", "Sally"], age=[23.,25.])

df |> save("mydata.csv", delim=';')
```

#### Header

By default `save` writes the names of the columns as the first line in the CSV file. You can change that behavior by passing the `header=false` keyword argument:

```julia
using Queryverse

df = DataFrame(name=["John", "Sally"], age=[23.,25.])

df |> save("mydata.csv", header=false)
```

This will write a CSV file that looks like this:

```
"John",23.
"Sally",25.
```

#### Quote and Escape Character

The `quotechar` and `escapechar` keyword arguments control how text columns get written to disc. By default `save` will surround any text with double quotation marks `"`, and use a backslash `\` to escape any occurrence of the quote character in the actual text of a column.
The following code instead uses plus `+` as the quote character and a forward slash `/` as the escape character:

```julia
using Queryverse

df = DataFrame(name=["John + Jim", "Sally"], age=[23.,25.])

df |> save("mydata.csv", quotechar='+', escapechar='/')
```

This code will write the following CSV file:

```
+name+,+age+
+John /+ Jim+,23.
+Sally+,25.
```

## Feather Files

[TODO add general description of Feather files]

### Loading Feather Files

If you pass a filename with the extension `*.feather` to the `load` function, [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will use the [FeatherFiles.jl](https://github.com/queryverse/FeatherFiles.jl) package to load that file. The following example demonstrates how you can load a Feather file:

```julia
using Queryverse

# Load a local file
df = load("mydata.feather") |> DataFrame
```

There are no options you can specify when loading a Feather file.

### Saving Feather Files

You can save a table as a Feather file by calling the `save` function with a filename that has the `*.feather` extension. In that case [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will use the [FeatherFiles.jl](https://github.com/queryverse/FeatherFiles.jl) package to save that file. This example shows you how to save a table as a Feather file:

```julia
using Queryverse

df = DataFrame(name=["John", "Sally"], age=[23.,25.])

df |> save("mydata.feather")
```

No other options can be specified when saving a Feather file.

## Excel Files

[TODO add general description of Excel files]

### Loading Excel Files

You can load both `*.xls` and `*.xlsx` files with `load`.
If you pass a filename with one of those extensions to `load`, [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will use the [ExcelFiles.jl](https://github.com/queryverse/ExcelFiles.jl) package to load those files.

To load an Excel file, you always need to specify either a sheet name or a range in addition to the filename itself.

The following example loads the sheet `Sheet1` from an Excel file:

```julia
using Queryverse

df = load("mydata.xlsx", "Sheet1") |> DataFrame
```

When you pass a sheet name to `load` without any other option, it will automatically skip any initial empty rows or columns in the Excel file, and then read the remaining content on the sheet. You can also manually control what data should be read from the sheet by using a number of keyword arguments. The `skipstartrows` argument takes an `Int`; when it is specified, the `load` function will ignore the first `skipstartrows` rows in the file. Note that in this case `load` will no longer attempt to automatically figure out on which row your data is starting in the sheet. The `skipstartcols` option works the same way, but for columns. The `nrows` and `ncols` keyword arguments allow you to specify how many rows and columns you want to read from the sheet. The following example uses all four options to skip the first two rows and first three columns, and then read a table with four rows and five columns:

```julia
using Queryverse

df = load("mydata.xlsx", "Sheet1", skipstartrows=2, skipstartcols=3, nrows=4, ncols=5) |> DataFrame
```

Instead of passing a sheet name to `load`, you can also pass a full Excel range specification. Excel range specifications have the form `Sheetname!CellRef1:CellRef2`. `CellRef1` and `CellRef2` designate the top left and bottom right cell of the rectangle that you want to load.
For example, the range specification `Sheet1!B2:D5` denotes the data on `Sheet1` that lies in the rectangle that has cell `B2` as the top left corner and `D5` as the bottom right corner. To load that data with julia you can use this code:

```julia
using Queryverse

df = load("mydata.xlsx", "Sheet1!B2:D5") |> DataFrame
```

Without any other arguments, `load` will assume that the first row in this rectangle contains the column names of a table. If that is not the case for your data, you can specify the keyword argument `header=false`, in which case `load` will treat the first row in the rectangle specified by the range as data. The columns will get automatically generated names. You can also pass custom column names with the `colnames` keyword argument, which accepts an array of `String`s. If you pass column names via the `colnames` argument with the option `header=true` (the default setting), `load` will ignore the first row in the rectangle specified by the range and instead use the names you passed in the `colnames` argument. The following code reads data from `Sheet1` in range `A2:C5`, treats the first row as data and assigns custom column names:

```julia
using Queryverse

df = load("mydata.xlsx", "Sheet1!A2:C5", header=false, colnames=["Name", "Age", "Children"]) |> DataFrame
```

### Saving Excel Files

To save a table as an Excel file, call the `save` function with a filename that has a `*.xlsx` extension. [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will then use the [ExcelFiles.jl](https://github.com/queryverse/ExcelFiles.jl) package to save the table.
The following example shows how to save a table as an Excel file:

```julia
using Queryverse

df = DataFrame(name=["John", "Sally"], age=[23.,25.])

df |> save("mydata.xlsx")
```

#### Sheet name

You can specify the name of the sheet in the Excel file that will receive the table data via the `sheetname` keyword argument of the `save` function. The following code writes the data to a sheet with the name `Custom Name`:

```julia
using Queryverse

df = DataFrame(name=["John", "Sally"], age=[23.,25.])

df |> save("mydata.xlsx", sheetname="Custom Name")
```

## Stata, SPSS, and SAS Files

[TODO add general description of stats files]

### Loading Stata, SPSS, and SAS Files

You can load files that were saved in one of the formats of these statistical software packages that have the extension `*.dta`, `*.por`, `*.sav` or `*.sas7bdat`. If you call the `load` function with a filename with any of these extensions, [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will use the [StatFiles.jl](https://github.com/queryverse/StatFiles.jl) package to read those files. The following code example demonstrates how you can read a file in each of these formats:

```julia
using Queryverse

df1 = load("mydata.dta") |> DataFrame

df2 = load("mydata.por") |> DataFrame

df3 = load("mydata.sav") |> DataFrame

df4 = load("mydata.sas7bdat") |> DataFrame
```

There are no further options you can specify when loading one of these files.
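Because `load` returns a tabular value that the rest of the Queryverse understands, files in these formats can also be converted directly to another format. As a sketch (assuming a hypothetical `mydata.dta` exists), the following pipeline converts a Stata file to CSV without ever materializing it into a `DataFrame`:

```julia
using Queryverse

# read a Stata file and write it back out as a CSV file in one pipeline
load("mydata.dta") |> save("mydata.csv")
```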

## Parquet Files

[TODO add general description of Parquet files]

### Loading Parquet Files

If you pass a filename with the extension `*.parquet` to the `load` function, [FileIO.jl](https://github.com/JuliaIO/FileIO.jl) will use the [ParquetFiles.jl](https://github.com/queryverse/ParquetFiles.jl) package to load that file. The following example demonstrates how you can load a Parquet file:

```julia
using Queryverse

# Load a local file
df = load("mydata.parquet") |> DataFrame
```

There are no options you can specify when loading a Parquet file.

## Alternative Packages

This section described how you can use packages from the Queryverse to load and save data. While those are useful, they are not the only julia packages that you can use for tabular file IO; in fact, there are many other excellent packages for those tasks. I encourage you to explore those packages and use them whenever they are a good fit for your work. Here is an (incomplete) list of other packages you might want to take a look at:

- [CSV.jl](https://github.com/JuliaData/CSV.jl).
- [uCSV.jl](https://github.com/cjprybol/uCSV.jl).
- [TextParse.jl](https://github.com/JuliaComputing/TextParse.jl) (*).
- [ReadWriteDlm2.jl](https://github.com/strickek/ReadWriteDlm2.jl).
- [Feather.jl](https://github.com/JuliaData/Feather.jl).
- [FeatherLib.jl](https://github.com/queryverse/FeatherLib.jl) (*).
- [ReadStat.jl](https://github.com/WizardMac/ReadStat.jl) (*).
- [SASLib.jl](https://github.com/tk3369/SASLib.jl).
- [ExcelReaders.jl](https://github.com/queryverse/ExcelReaders.jl) (*).
- [XLSX.jl](https://github.com/felipenoris/XLSX.jl) (*).
- [Taro.jl](https://github.com/aviks/Taro.jl).
- [Bedgraph.jl](https://github.com/CiaranOMara/Bedgraph.jl) (*).
- [DBFTables.jl](https://github.com/JuliaData/DBFTables.jl).
- [RData.jl](https://github.com/JuliaStats/RData.jl).

Note that some of these packages actually power the Queryverse file IO packages; I have denoted those packages with (*).