[DEPRECATED APRIL 2020]

This library is now deprecated. Checkout a second implementation of this library here: https://github.com/p-ranav/csv2.

Highlights

Header-only library
Fast, asynchronous, multi-threaded processing using:
- Lock-free Concurrent Queues
- Robin hood Hashing
Requires C++17
MIT License

Reading CSV files

Simply include reader.hpp and you're good to go.

#include <csv/reader.hpp>

To start parsing CSV files, create a csv::Reader object and call .read(filename).

csv::Reader foo;
foo.read("test.csv");

This .read method is non-blocking. The reader spawns multiple threads to tokenize the file stream and build a "list of dictionaries". While the reader is doing it's thing, you can start post-processing the rows it has parsed so far using this iterator pattern:

while(foo.busy()) {
  if (foo.ready()) {
    auto row = foo.next_row();  // Each row is a csv::unordered_flat_map (github.com/martinus/robin-hood-hashing)
    auto foo = row["foo"]       // You can use it just like an std::unordered_map
    auto bar = row["bar"];
    // do something
  }
}

If instead you'd like to wait for all the rows to get processed, you can call .rows() which is a convenience method that executes the above while loop

auto rows = foo.rows();           // blocks until the CSV is fully processed
for (auto& row : rows) {          // Example: [{"foo": "1", "bar": "2"}, {"foo": "3", "bar": "4"}, ...] 
  auto foo = row["foo"];
  // do something
}

Dialects

This csv library comes with three standard dialects:

Name	Description
excel	The excel dialect defines the usual properties of an Excel-generated CSV file
excel_tab	The excel_tab dialect defines the usual properties of an Excel-generated TAB-delimited file
unix	The unix dialect defines the usual properties of a CSV file generated on UNIX systems, i.e. using '\n' as line terminator and quoting all fields

Configuring Custom Dialects

Custom dialects can be constructed with .configure_dialect(...)

csv::Reader csv;
csv.configure_dialect("my fancy dialect")
  .delimiter("")
  .quote_character('"')
  .double_quote(true)
  .skip_initial_space(false)
  .trim_characters(' ', '\t')
  .ignore_columns("foo", "bar")
  .header(true)
  .skip_empty_rows(true);

csv.read("foo.csv");
for (auto& row : csv.rows()) {
  // do something
}

Property	Data Type	Description
delimiter	`std::string`	specifies the character sequence which should separate fields (aka columns). Default = `","`
quote_character	`char`	specifies a one-character string to use as the quoting character. Default = `'"'`
double_quote	`bool`	controls the handling of quotes inside fields. If true, two consecutive quotes should be interpreted as one. Default = `true`
skip_initial_space	`bool`	specifies how to interpret whitespace which immediately follows a delimiter; if false, it means that whitespace immediately after a delimiter should be treated as part of the following field. Default = `false`
trim_characters	`std::vector<char>`	specifies the list of characters to trim from every value in the CSV. Default = `{}` - nothing trimmed
ignore_columns	`std::vector<std::string>`	specifies the list of columns to ignore. These columns will be stripped during the parsing process. Default = `{}` - no column ignored
header	`bool`	indicates whether the file includes a header row. If true the first row in the file is a header row, not data. Default = `true`
column_names	`std::vector<std::string>`	specifies the list of column names. This is useful when the first row of the CSV isn't a header Default = `{}`
skip_empty_rows	`bool`	specifies how empty rows should be interpreted. If this is set to true, empty rows are skipped. Default = `false`

The line terminator is '\n' by default. I use std::getline and handle stripping out '\r' from line endings. So, for now, this is not configurable in custom dialects.

Multi-character Delimiters

Consider this strange, messed up log file:

[Thread ID] :: [Log Level] :: [Log Message] :: {Timestamp}
04 :: INFO :: Hello World ::             1555164718
02        :: DEBUG :: Warning! Foo has happened                :: 1555463132

To parse this file, simply configure a new dialect that splits on "::" and trims whitespace, braces, and bracket characters.

csv::Reader csv;
csv.configure_dialect("my strange dialect")
  .delimiter("::")
  .trim_characters(' ', '[', ']', '{', '}');   

csv.read("test.csv");
for (auto& row : csv.rows()) {
  auto thread_id = row["Thread ID"];    // "04"
  auto log_level = row["Log Level"];    // "INFO"
  auto message = row["Log Message"];    // "Hello World"
  // do something
}

Ignoring Columns

Consider the following CSV. Let's say you don't care about the columns age and gender. Here, you can use .ignore_columns and provide a list of columns to ignore.

name, age, gender, email, department
Mark Johnson, 50, M, mark.johnson@gmail.com, BA
John Stevenson, 35, M, john.stevenson@gmail.com, IT
Jane Barkley, 25, F, jane.barkley@gmail.com, MGT

You can configure the dialect to ignore these columns like so:

csv::Reader csv;
csv.configure_dialect("ignore meh and fez")
  .delimiter(", ")
  .ignore_columns("age", "gender");

csv.read("test.csv");
auto rows = csv.rows();
// Your rows are:
// [{"name": "Mark Johnson", "email": "mark.johnson@gmail.com", "department": "BA"},
//  {"name": "John Stevenson", "email": "john.stevenson@gmail.com", "department": "IT"},
//  {"name": "Jane Barkley", "email": "jane.barkley@gmail.com", "department": "MGT"}]

No Header?

Sometimes you have CSV files with no header row:

If you want to prevent the reader from parsing the first row as a header, simply:

Set .header to false
Provide a list of column names with .column_names(...)

csv.configure_dialect("no headers")
  .header(false)
  .column_names("foo", "bar", "baz");

The CSV rows will now look like this:

[{"foo": "9", "bar": "52", "baz": "1"}, {"foo": "52", "bar": "91", "baz": "0"}, ...]

If .column_names is not called, then the reader simply generates dictionary keys like so:

[{"0": "9", "1": "52", "2": "1"}, {"0": "52", "1": "91", "2": "0"}, ...]

Dealing with Empty Rows

Sometimes you have to deal with a CSV file that has empty lines; either in the middle or at the end of the file:

a,b,c
1,2,3

4,5,6

10,11,12

Here's how this get's parsed by default:

csv::Reader csv;
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": 1, "b": 2, "c": 3}, {"a": "", "b": "", "c": ""}, {"a": "4", "b": "5", "c": "6"}, {"a": "", ...}]

If you don't care for these empty rows, simply call .skip_empty_rows(true)

csv::Reader csv;
csv.configure_dialect()
  .skip_empty_rows(true);
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": 1, "b": 2, "c": 3}, {"a": "4", "b": "5", "c": "6"}, {"a": "10", "b": "11", "c": "12"}]

Reading first N rows

If you know exactly how many rows to parse, you can help out the reader by using the .read(filename, num_rows) overloaded method. This saves the reader from trying to figure out the number of lines in the CSV file. You can use this method to parse the first N rows of the file instead of parsing all of it.

csv::Reader foo;
foo.read("bar.csv", 1000);
auto rows = foo.rows();

Note: Do not provide num_rows greater than the actual number of rows in the file - The reader will loop forever till the end of time.

Performance Benchmark

// benchmark.cpp
void parse(const std::string& filename) {
  csv::Reader foo;
  foo.read(filename);
  std::vector<csv::unordered_flat_map<std::string_view, std::string>> rows;
  while (foo.busy()) {
    if (foo.ready()) {
      auto row = foo.next_row();
      rows.push_back(row);
    }
  }
}

$ g++ -pthread -std=c++17 -O3 -Iinclude/ -o test benchmark.cpp
$ time ./test

Each test is run 30 times on an Intel(R) Core(TM) i7-6650-U @ 2.20 GHz CPU.

Here are the average-case execution times:

Dataset	File Size	Rows	Cols	Time
Demographic Statistics By Zip Code	27 KB	237	46	0.026s
Simple 3-column CSV	14.1 MB	761,817	3	0.523s
Majestic Million	77.7 MB	1,000,000	12	1.972s
Crimes 2001 - Present	1.50 GB	6,846,406	22	32.411s

Writing CSV files

Simply include writer.hpp and you're good to go.

#include <csv/writer.hpp>

To start writing CSV files, create a csv::Writer object and provide a filename:

csv::Writer foo("test.csv");

Constructing a writer spawns a worker thread that is ready to start writing rows. Using .configure_dialect, configure the dialect to be used by the writer. This is where you can specify the column names:

foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

Now it's time to write rows. You can do this in multiple ways:

foo.write_row("1", "2", "3");                                     // parameter packing
foo.write_row({"4", "5", "6"});                                   // std::vector
foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });

You can also omit one or more values dynamically when using maps:

foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"c", "9"} });                                      // omitting "b"
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"b", "8"}, {"c", "9"} });                                      // omitting "a"
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"} });                                      // omitting "c"

Finally, once you're done writing rows, call .close() to stop the worker thread and close the file stream.

foo.close();

Here's an example writing 3 million lines of CSV to a file:

csv::Writer foo("test.csv");
foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

for (long i = 0; i < 3000000; i++) {
  auto x = std::to_string(i % 100);
  auto y = std::to_string((i + 1) % 100);
  auto z = std::to_string((i + 2) % 100);
  foo.write_row(x, y, z);
}
foo.close();

The above code takes about 1.8 seconds to execute on my Surface Pro 4.

Steps For Contributors

Contributions are welcome, have a look at the CONTRIBUTING.md document for more information.

git clone https://github.com/p-ranav/csv.git
cd csv
git submodule update --init --recursive
mkdir build
cd build
cmake .. -DCSV_BUILD_TESTS=ON
cmake --build . --config Debug
ctest --output-on-failure -C Debug

Steps For Users

git clone https://github.com/p-ranav/csv.git
cd csv
mkdir build
cd build
cmake ../.
sudo make install

Continuous Integration Reports

AppVeyor
Travis CI

License

The project is available under the MIT license.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

[DEPRECATED APRIL 2020]

Highlights

Table of Contents

Reading CSV files

Dialects

Configuring Custom Dialects

Multi-character Delimiters

Ignoring Columns

No Header?

Dealing with Empty Rows

Reading first N rows

Performance Benchmark

Writing CSV files

Steps For Contributors

Steps For Users

Continuous Integration Reports

License

Files

README.md

Latest commit

History

README.md

File metadata and controls

[DEPRECATED APRIL 2020]

Highlights

Table of Contents

Reading CSV files

Dialects

Configuring Custom Dialects

Multi-character Delimiters

Ignoring Columns

No Header?

Dealing with Empty Rows

Reading first N rows

Performance Benchmark

Writing CSV files

Steps For Contributors

Steps For Users

Continuous Integration Reports

License