Spark Job in Airflow throwing error in Hive Session #32223
Replies: 1 comment
-
It depends on how your deployment is done. Most likely, when you (or someone on your team) set up Airflow, the environment the Airflow worker runs in ended up different from your interactive session: for example, some environment variables were never set there. We have no idea what your deployment, executor, etc. look like, so we can only guess, but you should talk to whoever set Airflow up for you (or, if that was you, you need to understand the differences between how you log in to the node and how applications are run there).

In particular, look into the differences between interactive and non-interactive shell sessions. They have different initialization steps: usually different environment variables are set and different files are processed (/etc/profile, /etc/bashrc, or similar). This is generic Linux knowledge that a Deployment Manager should master when configuring and deploying applications. Most likely this is your problem: your environment variables are set one way when you log in interactively and another way where a non-interactive application (like Airflow) runs, for example via systemd or whatever other mechanism your Deployment Manager chose for managing applications.

Also, if you use the Kubernetes Executor, your Deployment Manager has to set the right pod template (see the docs) with all the environment variables that are needed, and your image needs to be built according to your needs, but my guess (a wild one, because you have not explained your deployment details) is that this is not your case.
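As a concrete first step, here is a minimal sketch (the DAG id, task ids, and command are illustrative, not from this thread) that dumps the environment an Airflow task actually sees, so you can diff it against `env | sort` from an interactive login on the same node. It also shows one way to pin missing variables explicitly via BashOperator's `env` and `append_env` arguments (available in recent Airflow 2.x versions):

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
import pendulum

with DAG(
    dag_id="debug_env",  # hypothetical DAG id
    start_date=pendulum.datetime(2023, 6, 1, tz="UTC"),
    schedule=None,
    catchup=False,
):
    # Compare this task's log with `env | sort` from your interactive shell
    # on the node; every difference is a candidate cause.
    dump_env = BashOperator(
        task_id="dump_env",
        bash_command="env | sort",
    )

    # Once you know which variables differ, pass them explicitly instead of
    # relying on profile files that a non-interactive shell never sources.
    # The command and variable values below are assumptions.
    run_job = BashOperator(
        task_id="run_job",
        bash_command="spark-submit my_hive_job.py",  # illustrative command
        env={"SPARK_HOME": "/app/spark3.3.1"},       # path from the traceback
        append_env=True,  # merge with, rather than replace, the worker env
    )
```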
-
I have a Spark job which creates a Hive connection. When I run it directly on the node, the job works fine, but when I run it via Airflow I get the following error:
```
hive = HiveWarehouseSession.session(spark).userPassword('','!').build()
[2023-06-27 15:49:14,347] {bash.py:173} INFO - File "/var/disk/hdp/3.1.5.0-152/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip/pyspark_llap/sql/session.py", line 88, in build
[2023-06-27 15:49:14,347] {bash.py:173} INFO - File "/app/spark3.3.1/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
[2023-06-27 15:49:14,347] {bash.py:173} INFO - File "/app/spark3.3.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
[2023-06-27 15:49:14,347] {bash.py:173} INFO - File "/app/spark3.3.1/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
[2023-06-27 15:49:14,348] {bash.py:173} INFO - py4j.protocol.Py4JJavaError: An error occurred while calling o71.build.
[2023-06-27 15:49:14,348] {bash.py:173} INFO - : java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
[2023-06-27 15:49:14,348] {bash.py:173} INFO - at com.hortonworks.spark.sql.hive.llap.LlapQueryExecutionListener.<init>(LlapQueryExecutionListener.scala:25)
[2023-06-27 15:49:14,348] {bash.py:173} INFO - at com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl.<init>(HiveWarehouseSessionImpl.java:91)
[2023-06-27 15:49:14,348] {bash.py:173} INFO - at com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.build(HiveWarehouseBuilder.java:103)
[2023-06-27 15:49:14,348] {bash.py:173} INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[2023-06-27 15:49:14,348] {bash.py:173} INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[2023-06-27 15:49:14,348] {bash.py:173} INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at java.lang.reflect.Method.invoke(Method.java:498)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at py4j.Gateway.invoke(Gateway.java:282)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at java.lang.Thread.run(Thread.java:750)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
[2023-06-27 15:49:14,349] {bash.py:173} INFO - ... 15 more
[2023-06-27 15:49:14,350] {bash.py:173} INFO
```
I am expecting it to run the same way as it does from the node.
Why is running it from Airflow as a BashOperator different from running it directly on the node?
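For reference, the failing task is presumably wired up roughly like this (a reconstruction under assumptions, not the actual DAG; the task id and script name are invented):

```python
from airflow.operators.bash import BashOperator

# Hypothetical reconstruction of the failing task: BashOperator runs the
# command through a non-interactive `bash -c` subprocess on the worker,
# so files like /etc/profile or ~/.bash_profile are not sourced and the
# job sees a different environment than an interactive login on the node.
run_hive_job = BashOperator(
    task_id="run_hive_job",                      # illustrative task id
    bash_command="spark-submit my_hive_job.py",  # script name is invented
)
```

Since `bash -c` is a non-interactive, non-login shell, none of the profile files that an interactive session reads are sourced by default, which is exactly the difference the reply above describes.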