🌌 amanogawa: Graph construction meets DAG-based data processing engine

Flexible graph construction and data pre-processing engine

The tutorial is here. You can try amanogawa and hoshizora on Jupyter via Docker.

(:warning: Currently an alpha version. The internal structure and APIs may change significantly.)

✨ Features

  • Easy to use
    • You can use amanogawa as a Python library, a C++ library, or a CLI tool
    • Flexible DAG representation
  • Extremely fast
  • Modular design
    • You can add templates for data sources, formats, data-processing steps, joins, branches, etc. as plugins

🔜 Install

Linux and macOS are supported.

Python library via pip

pip install amanogawa

From source

Prerequisites

  • Make
  • CMake 3.0+
  • Clang++ 3.4+
  • Python 3

make init

CLI

make release

Python library

python3 setup.py install

💡 Example

Task 1: Simple (Python)

Read a single JSON file, filter it, and export the result to CSV

sample.json

[
  {"id": 1, "name": "Aries"},
  {"id": 2, "name": "Taurus"},
  {"id": 3, "name": "Gemini"}
]
import amanogawa as am
builder = am.ConfigBuilder()
config = builder.source('file').set('path', 'sample.json').format('json') \
    .set('columns',
        [{'name': 'id', 'type': 'int'}, {'name': 'name', 'type': 'string'}]) \
    .set('filter', {'key': 'name', 'op': 'contains', 'cond': 'i'}) \
    .sink('file').set('path', 'sample.csv').format('csv') \
    .build()
am.execute(config)

sample.csv

id,name
1,Aries
3,Gemini
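For reference, the filter step above can be read as a simple row predicate. The sketch below is a plain-Python equivalent of that one setting (illustrative only; amanogawa evaluates filters inside its own engine, and only the `contains` operator shown in the example is assumed here):

```python
# Rows from sample.json
rows = [
    {"id": 1, "name": "Aries"},
    {"id": 2, "name": "Taurus"},
    {"id": 3, "name": "Gemini"},
]

def passes(row, key, op, cond):
    # 'contains' keeps rows whose value for `key` includes the substring `cond`
    if op == "contains":
        return cond in row[key]
    raise ValueError(f"unsupported op: {op}")

filtered = [r for r in rows if passes(r, "name", "contains", "i")]
print(filtered)  # → [{'id': 1, 'name': 'Aries'}, {'id': 3, 'name': 'Gemini'}]
```

"Aries" and "Gemini" contain the letter "i", so only rows 1 and 3 survive, matching sample.csv above.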

Task 2: Graph construction (Python)

Read JSON Lines, construct a graph, and export the edges to CSV

comments.jsonl

{"content": "Apple Strawberry Apple", "command": "foo"}
{"content": "Apple Strawberry", "command": "foo"}
{"content": "Apple Apple", "command": "bar"}
{"content": "Banana Banana", "command": "foo bar"}
{"content": "Pineapple Banana Banana", "command": "foo"}
import amanogawa as am
builder = am.ConfigBuilder()
config = builder.source('file').set('path', 'comments.jsonl').format('json') \
    .set('columns', [{'name': 'content', 'type': 'string'}]) \
    .flow('to_graph').set('mode', 'bow').set('column', 'content').set('knn', {'k': 2, 'p': 1.5}) \
    .sink('file').set('path', 'graph').format('csv').set('delimiter', ' ').build()
am.execute(config)
graph

src dst
0 4
0 3
0 2
1 4
1 3
1 2
2 4
2 3
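One plausible reading of the `to_graph` step in `bow` mode: each row's `content` column is tokenized on whitespace into a bag-of-words vector, and rows are then connected by k-nearest-neighbour search. The sketch below shows the vectorization and a Minkowski distance matching the `p` setting; both are assumptions about the config keys above, and amanogawa's actual edge-selection semantics are not reproduced here:

```python
from collections import Counter

# Rows from comments.jsonl (content column only)
contents = [
    "Apple Strawberry Apple",
    "Apple Strawberry",
    "Apple Apple",
    "Banana Banana",
    "Pineapple Banana Banana",
]

# Bag-of-words: token -> count per row
bags = [Counter(text.split()) for text in contents]

def minkowski(a: Counter, b: Counter, p: float) -> float:
    # Minkowski distance over the union of tokens; p=1.5 matches the
    # 'knn' setting above (an assumption about what `p` controls).
    keys = set(a) | set(b)
    return sum(abs(a[k] - b[k]) ** p for k in keys) ** (1 / p)

print(bags[0])                           # Counter({'Apple': 2, 'Strawberry': 1})
print(minkowski(bags[0], bags[1], 1.5))  # rows 0 and 1 differ by one 'Apple'
```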

Task 3: Complex (CLI)

Read two CSVs, join them, split by column, and export to CSV and TSV

In

kinmosa.csv

id,name,blood_type
1,karen,3
2,ayaya,0
3,shino,0
4,yo-ko,2
5,alice,0

blood.csv

id,type
0,A
1,B
2,O
3,AB

Config

config.toml

[source.read_awesome_csv]
type = "file"
path = "kinmosa.csv"
[source.read_awesome_csv.format]
type = "csv"
columns = [
  { name = "id", type = "int" },
  { name = "name", type = "string" },
  { name = "blood_type", type = "int" }
]

[branch.id_name_blood]
type = "column"
from = "read_awesome_csv"
to = [
  { name = "id_name", columns = [ "id", "name" ] },
  { name = "blood", columns = [ "blood_type" ] }
]

[source.about_blood]
type = "file"
path = "blood.csv"
[source.about_blood.format]
type = "csv"
columns = [
  { name = "id", type = "int" },
  { name = "type_string", type = "string" }
]

[confluence.blood_type]
type = "key"
from = [
  { name = "about_blood", key = "id" },
  { name = "blood", key = "blood_type" }
]

[sink.write_id_name_tsv]
type = "file"
path = "result_id_name.tsv"
from = "id_name"
[sink.write_id_name_tsv.format]
type = "csv"
delimiter = "\t"

[sink.write_blood_csv]
type = "file"
path = "result_blood.csv"
from = "blood_type"
[sink.write_blood_csv.format]
type = "csv"

Run

./amanogawa-cli config.toml

Out

result_id_name.tsv

id	name
1	karen
2	ayaya
3	shino
4	yo-ko
5	alice

result_blood.csv

id,type_string
0,A
0,A
0,A
2,O
3,AB
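The `confluence.blood_type` step is a key join: the `blood` branch (the `blood_type` column split out of kinmosa.csv) is matched against `about_blood` on its `id` key. A plain-Python sketch of that one join (row order in the real output may differ; the result is sorted here to match the listing above):

```python
# blood_type column of kinmosa.csv, rows 1..5 (karen..alice)
kinmosa_blood_type = [3, 0, 0, 2, 0]

# blood.csv as a lookup table: id -> type_string
about_blood = {0: "A", 1: "B", 2: "O", 3: "AB"}

# Join each blood_type value against about_blood's id key
joined = sorted((bt, about_blood[bt]) for bt in kinmosa_blood_type)
for bt, type_string in joined:
    print(f"{bt},{type_string}")
# 0,A / 0,A / 0,A / 2,O / 3,AB — the rows of result_blood.csv
```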

😣 WIP

  • Support for serially numbered files
  • Efficient config builder
  • Automatic input-schema generator, like guess in Embulk
  • Out-of-core processing
  • Effective parallel processing and scheduling
  • Dynamic DAG scheduling
  • Effective use of Apache Arrow (currently used only as an interface)
  • Row-based, column-based, and compound data handling
  • Data validation and error handling
  • Sharing amanogawa-core between plugins
  • Tools for creating third-party plugins
  • Tests
  • Many, many plugins

💚 Acknowledgement

This project was supported by IPA (Mito Project).