The paramgen implements parameter curation to ensure predictable performance results that (mostly) correspond to a normal distribution.
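In a nutshell, the curation step avoids sampling parameter values uniformly: it prefers values whose factor counts (e.g., a person's number of friends) lie close to the median, so that query runtimes cluster instead of having a heavy tail. A minimal sketch of this kind of selection (the data, threshold, and names here are hypothetical, not the paramgen's actual code):

```python
# Hypothetical illustration of parameter curation, not the paramgen's actual code.
import statistics

# A factor table: candidate parameter values with their frequency counts,
# e.g. (personId, numFriends) pairs extracted by the Datagen.
factors = [(1, 3), (2, 87), (3, 91), (4, 95), (5, 1024)]

median = statistics.median(count for _, count in factors)
tolerance = 0.1  # assumed: keep values within 10% of the median count

# Selecting only values whose counts are close to the median yields
# parameters that trigger similar amounts of work per query.
curated = [value for value, count in factors
           if abs(count - median) <= tolerance * median]
print(curated)  # [2, 3, 4]
```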
- Install dependencies:

  ```bash
  scripts/install-dependencies.sh
  ```
- Generating the factors with the Datagen: in the Datagen directory (`ldbc_snb_datagen_spark`), issue the following commands. We assume that the Datagen project is built and that the `${LDBC_SNB_DATAGEN_MAX_MEM}` and `${LDBC_SNB_DATAGEN_JAR}` environment variables are set correctly.

  ```bash
  export SF=desired_scale_factor
  export LDBC_SNB_DATAGEN_MAX_MEM=available_memory
  export LDBC_SNB_DATAGEN_JAR=$(sbt -batch -error 'print assembly / assemblyOutputPath')

  rm -rf out-sf${SF}/
  tools/run.py \
      --cores $(nproc) \
      --memory ${LDBC_SNB_DATAGEN_MAX_MEM} \
      -- \
      --format parquet \
      --scale-factor ${SF} \
      --mode raw \
      --output-dir out-sf${SF} \
      --generate-factors
  ```
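  As a quick sanity check, you can list the factor tables the run produced (a minimal sketch; only the output path comes from the command above, everything else is an assumption):

  ```python
  # Hypothetical sanity check that the factor tables were generated.
  import os

  factors_dir = f"out-sf{os.environ['SF']}/factors/csv/raw/composite-merged-fk"

  tables = sorted(os.listdir(factors_dir))
  print(f"{len(tables)} factor tables, e.g.: {tables[:3]}")
  ```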
- Obtaining the factors: create the `scratch/factors/` directory and move the factor directories from `out-sf${SF}/factors/csv/raw/composite-merged-fk/` (`cityPairsNumFriends/`, `personDisjointEmployerPairs/`, etc.) into it. Assuming that your `${LDBC_SNB_DATAGEN_DIR}` and `${SF}` environment variables are set, run:

  ```bash
  scripts/get-factors.sh
  ```

  To download and use the factors of the sample data set, run:

  ```bash
  scripts/get-sample-factors.sh
  export SF=0.003
  ```
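  For reference, the staging performed in this step amounts to roughly the following copy (a sketch of the manual step, not the actual contents of `scripts/get-factors.sh`):

  ```python
  # Hypothetical sketch of staging the factor directories into scratch/factors/.
  import os
  import shutil

  src = os.path.join(
      os.environ["LDBC_SNB_DATAGEN_DIR"],
      f"out-sf{os.environ['SF']}/factors/csv/raw/composite-merged-fk",
  )

  os.makedirs("scratch/factors", exist_ok=True)
  for name in os.listdir(src):
      # Copy each factor directory (cityPairsNumFriends/, etc.) into scratch/factors/
      shutil.copytree(
          os.path.join(src, name),
          os.path.join("scratch/factors", name),
          dirs_exist_ok=True,  # Python 3.8+
      )
  ```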
- To run the parameter generator, ensure that `${SF}` is set correctly and issue:

  ```bash
  scripts/paramgen.sh
  ```
- The parameters will be placed in the `../parameters/parameters-sf${SF}/` directory.
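  To confirm the run succeeded, you can check that the output files exist and are non-empty (a minimal sketch; the directory comes from above, everything else is an assumption):

  ```python
  # Hypothetical check that the parameter files were generated and are non-empty.
  import os
  from pathlib import Path

  out_dir = Path(f"../parameters/parameters-sf{os.environ['SF']}")
  files = sorted(p for p in out_dir.iterdir() if p.is_file())

  print(f"{len(files)} parameter files in {out_dir}")
  for p in files:
      assert p.stat().st_size > 0, f"empty parameter file: {p.name}"
  ```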
The parameter generator performs several join and aggregation operations on large tables and therefore uses a significant amount of memory. For example, the process for SF30,000 uses 404.8 GB of RAM and takes about 11 minutes on an AWS EC2 m6id.32xlarge instance with 128 vCPUs.