Skip to content

Latest commit

 

History

History
99 lines (76 loc) · 5.28 KB

GlutenUsage.md

File metadata and controls

99 lines (76 loc) · 5.28 KB

Gluten Usage

How to compile Gluten jar

git clone -b main https://github.com/oap-project/gluten.git
cd gluten

When compiling Gluten, a backend should be enabled for execution. For example, add below options to enable Velox backend:

If you wish to automatically build velox from source. The default velox installation path will be in "/PATH_TO_GLUTEN/tools/build/velox-ep/velox_ep".

-Dbuild_velox=ON -Dbuild_velox_from_source=ON

If you wish to enable Velox backend and you have an existing compiled Velox or Velox repo, please use velox_home to set the path.

-Dbuild_velox=ON -Dvelox_home=${VELOX_HOME}

We provide a single command to help build arrow as well as velox under Ubuntu20.04 environment. The full compiling command would be like:

mvn clean package -Pbackends-velox -Pfull-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dbuild_velox=ON -Dbuild_velox_from_source=ON -Dbuild_arrow=ON

If Arrow has once been installed successfully on your env, and there is no change to Arrow, you can add below config to disable Arrow compilation each time when compiling Gluten:

-Dbuild_arrow=OFF -Darrow_root=${ARROW_LIB_PATH}

Compile gluten again with existed velox and arrow. The full compiling command would be like:

mvn package -Pbackends-velox -Pfull-scala-compiler -DskipTests -Dcheckstyle.skip -Dbuild_cpp=ON -Dbuild_velox=ON -Dbuild_arrow=ON -Dbuild_protobuf=OFF

while arrow_root and velox_home is the default value as following

Based on the different environment, there are some parameters can be set via -D with mvn.

Parameters Description Default Value
build_cpp Enable or Disable building CPP library OFF
cpp_tests Enable or Disable CPP Tests OFF
build_arrow Build Arrow from Source OFF
arrow_root When build_arrow set to False, arrow_root will be enabled to find the location of your existing arrow library. /PATH_TO_GLUTEN/tools/build/arrow_install
build_protobuf Build Protobuf from Source. If set to False, default library path will be used to find protobuf library. ON
build_velox_from_source Enable or Disable building Velox from a specific velox github repository. A default installed path will be in velox_home OFF
backends-velox Add -Pbackends-velox in maven command to compile the JVM part of Velox backend false
backends-clickhouse Add -Pbackends-clickhouse in maven command to compile the JVM part of ClickHouse backend false
build_velox Enable or Disable building the CPP part of Velox backend OFF
velox_home (only valid when build_velox is ON) The path to the compiled Velox project. When building Gluten with Velox, if you have an existing Velox, please set it. /PATH_TO_GLUTEN/tools/build/velox_ep
compile_velox(only valid when velox_home is assigned) recompile exising Velox use custom compile parameters OFF
velox_build_type The build type Velox was built with from source code. Gluten uses this value to locate the binary path of Velox's binary libraries. release
debug_build Whether to generate debug binary library from Gluten's C++ codes. OFF

When build_arrow set to True, the build_arrow.sh will be launched and compile a custom arrow library from OAP Arrow If you wish to change any parameters from Arrow, you can change it from the build_arrow.sh script.

How to submit Spark Job

Key Spark Configurations when using Gluten
spark.plugins io.glutenproject.GlutenPlugin
spark.gluten.sql.columnar.backend.lib ${BACKEND}
spark.sql.sources.useV1SourceList avro
spark.memory.offHeap.size 20g
spark.driver.extraClassPath ${GLUTEN_HOME}/backends-velox/target/gluten-jvm-<version>-SNAPSHOT-jar-with-dependencies.jar
spark.executor.extraClassPath ${GLUTEN_HOME}/backends-velox/target/gluten-jvm-<version>-SNAPSHOT-jar-with-dependencies.jar

${BACKEND} can be velox or clickhouse, refer Velox.md and ClickHouse.md to get more detail.

Below is an example of the script to submit Spark SQL query.

cat query.scala

// For ORC
val table = spark.read.format("orc").load("${ORC_PATH}")
// For Parquet
val table = spark.read.parquet("${PARQURT_PATH}")
table.createOrReplaceTempView("${TABLE_NAME}")
time{spark.sql("${QUERY}").show}

Submit the above script from spark-shell to trigger a Spark Job with certain configurations.

cat query.scala | spark-shell --name query --master yarn --deploy-mode client --conf spark.plugins=io.glutenproject.GlutenPlugin --conf spark.gluten.sql.columnar.backend.lib=${BACKEND} --conf spark.driver.extraClassPath=${gluten_jvm_jar} --conf spark.executor.extraClassPath=${gluten_jvm_jar} --conf spark.memory.offHeap.size=20g --conf spark.sql.sources.useV1SourceList=avro --num-executors 6 --executor-cores 6 --driver-memory 20g --executor-memory 25g --conf spark.executor.memoryOverhead=5g --conf spark.driver.maxResultSize=32g

For more information about How to use Gluten with Velox, please go to Decision Support Benchmark1.