Copyright (C) 2017 Paulius Danenas
Generate synthetic datasets from YAML specifications. The datasets can be used directly for research or to train models. Currently, only Pandas dataframes are supported as output.
The package can be easily installed from its GitHub repository using Python's pip utility.
Dataset generation is fairly straightforward:
import data_faker as df
spec_file = 'examples/distributions.yaml'
output = 'output.csv'
df.generate(spec_file, output)
A command-line tool is also installed during setup. It generates datasets and serializes them straight from the command line:
datafaker -o output.csv examples/distributions.yaml
or:
datafaker --output-file output.csv examples/distributions.yaml
Currently the tool supports serialization to CSV files only. However, the created dataset can easily be serialized to other formats by generating a Pandas dataframe directly with the generate_pandas method and then using built-in pandas methods or third-party tools.
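As a minimal sketch of that workflow, the snippet below writes a dataframe to line-delimited JSON with a built-in pandas writer and reads it back. The dataframe here is a small hand-made stand-in so that the example runs without the package installed. The commented-out generate_pandas call shows where the real dataset would presumably come from, and its exact signature is an assumption, not something confirmed above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the generated dataset. With the package installed, the frame
# would presumably be obtained along these lines (signature assumed):
#     import data_faker
#     frame = data_faker.generate_pandas('examples/distributions.yaml')
frame = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# Write to a temporary directory so the example leaves no files behind
# in the working directory.
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "output.json")

# Serialize with a built-in pandas writer: one JSON object per row.
frame.to_json(json_path, orient="records", lines=True)

# Read the file back to check that the round trip preserved the rows.
restored = pd.read_json(json_path, orient="records", lines=True)
print(restored)
```

Other pandas writers such as to_pickle or to_excel follow the same pattern. Some of them need additional optional dependencies.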
TBD
This tool requires several other Python libraries to function: