Cypher implementation of the LDBC SNB benchmark. Note that some BI queries cannot be expressed (efficiently) in vanilla Cypher, so they make use of the APOC and Graph Data Science Neo4j libraries.
The Neo4j implementation expects the data to be in the `composite-projected-fk` CSV layout, without headers and with quoted fields. (Rationale: files should not have headers, as these are provided separately in the `headers/` directory, and quoting the fields in the CSV is required to preserve trailing spaces.)
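As a quick illustration of why the quoting matters (the sample row below is hypothetical, not taken from the actual data set):

```shell
# Hypothetical sample row: with quoteAll=true, every field is quoted,
# so a trailing space inside a field survives the round trip instead of
# being trimmed by a whitespace-stripping CSV parser.
printf '"42"|"Hello world "\n' > sample.csv
# The trailing space before the closing quote is still there:
grep -c 'world "' sample.csv   # prints 1
```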
To generate data that conforms to these requirements, run Datagen with the `--explode-edges` and `--format-options header=false,quoteAll=true` options.
This implementation also supports compressed data sets (`.csv.gz` files), both for the initial load and for the batches. The scripts in this repository automatically detect whether a compressed or an uncompressed data set is used.
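The detection can be sketched roughly as follows (the function name and file layout are illustrative assumptions, not the repository's actual code):

```shell
# Prefer the compressed variant of a CSV file if it exists; otherwise
# fall back to the uncompressed one (illustrative sketch only).
resolve_csv() {
    if [ -f "$1.csv.gz" ]; then
        echo "$1.csv.gz"
    else
        echo "$1.csv"
    fi
}

# Demo with two dummy files:
mkdir -p demo
touch demo/part-0.csv.gz demo/part-1.csv
resolve_csv demo/part-0   # prints demo/part-0.csv.gz
resolve_csv demo/part-1   # prints demo/part-1.csv
```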
In Datagen's directory (`ldbc_snb_datagen_spark`), issue the following commands. We assume that the Datagen project is built and that the `${PLATFORM_VERSION}` and `${DATAGEN_VERSION}` environment variables are set correctly.
```bash
export SF=desired_scale_factor
export LDBC_SNB_DATAGEN_MAX_MEM=available_memory
export LDBC_SNB_DATAGEN_JAR=$(sbt -batch -error 'print assembly / assemblyOutputPath')
rm -rf out-sf${SF}/
tools/run.py \
    --cores $(nproc) \
    --memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
    -- \
    --format csv \
    --scale-factor ${SF} \
    --explode-edges \
    --mode bi \
    --output-dir out-sf${SF}/ \
    --format-options header=false,quoteAll=true,compression=gzip
```
- Set the `${NEO4J_CSV_DIR}` environment variable.
  - To use a locally generated data set, set the `${LDBC_SNB_DATAGEN_DIR}` and `${SF}` environment variables and run:

    ```bash
    . scripts/use-datagen-data-set.sh
    ```

  - To download and use the sample data set, run:

    ```bash
    scripts/get-sample-data-set.sh
    . scripts/use-sample-data-set.sh
    ```

- Configure Neo4j to use the available memory, e.g.:

  ```bash
  export NEO4J_ENV_VARS="${NEO4J_ENV_VARS-} --env NEO4J_dbms_memory_pagecache_size=20G --env NEO4J_dbms_memory_heap_max__size=20G"
  ```

- Load the data:

  ```bash
  scripts/load-in-one-step.sh
  ```
- The substitution parameters should be generated using the `paramgen` tool.
Test loading the microbatches:

```bash
scripts/batches.sh
```

This script looks for the microbatches in the `${NEO4J_CSV_DIR}` directory on the host machine but maps the paths relative to the `/import` directory in the Docker container (Neo4j's dedicated import directory, which it uses as the basis of the import paths in the `LOAD CSV` Cypher commands). For example, the `${NEO4J_CSV_DIR}/deletes/dynamic/Post/batch_id=2012-09-13/part-x.csv` path is translated to the relative path `deletes/dynamic/Post/batch_id=2012-09-13/part-x.csv`.
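This mapping amounts to stripping the host-side prefix; a minimal sketch using plain parameter expansion (the `NEO4J_CSV_DIR` value here is an arbitrary example):

```shell
# Strip the host-side prefix to obtain the path relative to the
# container's /import directory (NEO4J_CSV_DIR value is illustrative).
NEO4J_CSV_DIR=/data/ldbc-csv
host_path="${NEO4J_CSV_DIR}/deletes/dynamic/Post/batch_id=2012-09-13/part-x.csv"
container_path="${host_path#${NEO4J_CSV_DIR}/}"
echo "${container_path}"   # prints deletes/dynamic/Post/batch_id=2012-09-13/part-x.csv
```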
To run the queries, issue:

```bash
scripts/queries.sh ${SF}
```

For a test run, use:

```bash
scripts/queries.sh ${SF} --test
```

To start a database that has already been loaded, run:

```bash
scripts/start.sh
```