This repository contains the solution audit review artifacts for the Kraft Heinz team.
- Create a .env file and fill in the environment variables as shown in example.env.
- Install the libraries listed in requirements.txt.
pip install -r requirements.txt
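Before running anything, it can help to confirm that .env defines every variable named in example.env. The following is a minimal sketch using only the standard library; it assumes both files use plain KEY=VALUE lines, and the helper names `read_env_keys` and `missing_env_keys` are hypothetical and not part of this repository.

```python
from pathlib import Path


def read_env_keys(path):
    """Return the set of variable names defined in a KEY=VALUE file."""
    keys = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        # Tolerate shell-style "export KEY=VALUE" lines.
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.add(key)
    return keys


def missing_env_keys(example_path="example.env", env_path=".env"):
    """List variables named in the example file but absent from the .env file."""
    return sorted(read_env_keys(example_path) - read_env_keys(env_path))
```

Calling `missing_env_keys()` from the repository root returns an empty list when .env covers every variable in example.env.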
The hierarchy_queries directory contains the notebook that executes the queries and retrieves parent/child hierarchies:
- query_test.ipynb
  - Executes the Cypher queries that extract parent/child hierarchies.
- config.yaml
  - Holds the Cypher queries used in the notebook. It contains three example queries:
    - original_query - the original Cypher query provided by the Kraft team.
    - spanning_tree_only - expands the hierarchy using apoc.path.spanningTree.
    - qpp_query_only_nodes - expands the hierarchy using quantified path patterns (QPP) and returns only the nodes, as a list.
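To illustrate the difference between the two expansion strategies named above, here is a rough sketch of what such queries can look like. These are not the queries in config.yaml: the `Item` label, the `id` property, the `HAS_CHILD` relationship type, the `QUERIES` dict, and the `run_query` helper are all assumptions made for illustration. The helper accepts any session object with a `run` method, such as a session from the official Neo4j Python driver.

```python
# Hypothetical stand-in for config.yaml. The label (Item), property (id),
# and relationship type (HAS_CHILD) are assumptions, not the real schema.
QUERIES = {
    # Hierarchy expansion with the APOC path expander.
    "spanning_tree_only": """
        MATCH (root:Item {id: $root_id})
        CALL apoc.path.spanningTree(root, {relationshipFilter: 'HAS_CHILD>'})
        YIELD path
        RETURN [n IN nodes(path) | n.id] AS node_ids
    """,
    # Hierarchy expansion with a quantified relationship (one or more hops),
    # returning only the nodes as a single list.
    "qpp_query_only_nodes": """
        MATCH (root:Item {id: $root_id})-[:HAS_CHILD]->+(descendant:Item)
        RETURN collect(DISTINCT descendant.id) AS node_ids
    """,
}


def run_query(session, name, **params):
    """Run one named query and return its records as plain dicts."""
    result = session.run(QUERIES[name], **params)
    return [record.data() for record in result]
```

With a live session this would be called as `run_query(session, "qpp_query_only_nodes", root_id=...)`. Quantified path patterns require a Neo4j version that supports them, so the second query is a sketch under that assumption.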
The json_export directory contains the Python scripts that run the JSON export using APOC:
-
  - Iterates through each node label and relationship type and exports each one to its own JSON Lines file.
  - Data is now exported incrementally rather than in bulk.
  - We take advantage of the multicore CPU of the AuraDB instance by running the export job in four parallel threads (this may be increased to six, depending on the specifications of Kraft’s AuraDB instance).
  - We batch each export transaction in chunks of 200 rows.
-
  - Performs the same process as above, but exports all nodes into one file and all relationships into another.
  - The process runs in under 10 minutes.
-
  - The data exports are validated by comparing the node label and relationship type counts retrieved from the AuraDB instance with the line counts of the corresponding export files. The validate_data_export.ipynb notebook performs this check automatically.
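The incremental, parallel export pattern described above can be sketched in plain Python as follows. This is not the repository's APOC-based implementation: it batches on the client side, and `fetch_rows` is a hypothetical callable standing in for whatever streams rows for a given label or relationship type from the database. The `<name>.jsonl` file naming is also an assumption.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

BATCH_SIZE = 200  # rows per batch, matching the chunk size described above
MAX_WORKERS = 4   # parallel export threads, matching the setup described above


def batched(rows, size=BATCH_SIZE):
    """Yield successive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def export_one(name, fetch_rows, out_dir):
    """Write every row for one label or type to <name>.jsonl, batch by batch."""
    out_path = Path(out_dir) / f"{name}.jsonl"
    written = 0
    with out_path.open("w", encoding="utf-8") as fh:
        for batch in batched(fetch_rows(name)):
            fh.writelines(json.dumps(row) + "\n" for row in batch)
            written += len(batch)
    return name, written


def export_all(names, fetch_rows, out_dir, workers=MAX_WORKERS):
    """Export each label or type in its own task and return {name: row count}."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda n: export_one(n, fetch_rows, out_dir), names)
        return dict(results)
```

Each label or type gets its own output file, so the threads never write to the same file, and raising `workers` to six only changes the pool size.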
The exported data can be found in the directory exports/*.
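The count-based validation described above can be sketched as a comparison between expected counts and the line counts of the exported files. This is not the code in validate_data_export.ipynb: the helper names and the one-file-per-name `<name>.jsonl` layout are assumptions, and `expected_counts` is assumed to have been gathered from the database beforehand (for example with a `MATCH (n:SomeLabel) RETURN count(n)` query per label).

```python
from pathlib import Path


def count_lines(path):
    """Count the non-empty lines in a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


def validate_export(expected_counts, export_dir, suffix=".jsonl"):
    """Compare database counts with exported line counts.

    Returns {name: (expected, actual)} for every mismatch.
    An empty dict means every file matched its expected count.
    """
    mismatches = {}
    for name, expected in expected_counts.items():
        path = Path(export_dir) / f"{name}{suffix}"
        # A missing file is treated as an export of zero rows.
        actual = count_lines(path) if path.exists() else 0
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

A non-empty result pinpoints which labels or relationship types need to be re-exported.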