Error in accept call on a passive RdmaChannel #31
More notes: IbvContext supports only -1333102304 CPU cores? Why negative?
19/05/02 15:53:42 WARN rdma.RdmaNode: IbvContext supports only -1333102304 CPU cores, while there are 88 CPU cores in the system. This may lead to under-utilization of the system's CPU cores. This limitation may be adjustable in the RDMA device configuration.
Hi, seems like you're using the Spark shuffle service for dynamic allocation.
Hi Petro, thanks for reviewing my file. I tried disabling spark.shuffle.service in the config as well as disabling dynamicAllocation, and it still throws the same error. Spark properties used, including those specified through the main class:
19/05/03 17:43:00 WARN rdma.RdmaNode: IbvContext supports only -704911072 CPU cores, while there are 88 CPU cores in the system. This may lead to under-utilization of the system's CPU cores. This limitation may be adjustable in the RDMA device configuration.
Ok, seems like there's an overflow when requesting completion vectors from disni. Strange. Can you please try to run with these prebuilt disni libraries, which should log something like:
j2c::getContextNumCompVectors: obj_id 25435344654, num_comp_vectors: 1234
disni_log.tar.gz
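For illustration only (this is not SparkRDMA's or disni's actual code, and `sanitizeCompVectors` is a hypothetical helper): a minimal sketch of the kind of defensive clamp Peter's diagnosis suggests, so a garbage or overflowed completion-vector count coming back from the native layer cannot propagate into later arithmetic.

```java
// Hypothetical sketch: clamp a completion-vector count reported by a native
// call (e.g. something like j2c::getContextNumCompVectors) before using it.
public class CompVectorGuard {

    // If the native layer reports a negative or zero count (failed call or
    // 32-bit overflow), fall back to a single completion vector; otherwise
    // never use more vectors than the system has CPU cores.
    static int sanitizeCompVectors(int reported, int systemCpus) {
        if (reported <= 0) {
            return 1;
        }
        return Math.min(reported, systemCpus);
    }

    public static void main(String[] args) {
        // The negative value from the log above would be clamped to 1.
        System.out.println(sanitizeCompVectors(-1333102304, 88)); // prints 1
        System.out.println(sanitizeCompVectors(4, 88));           // prints 4
    }
}
```

With such a guard the warning would still fire, but the node would degrade to one completion vector instead of carrying the negative count forward.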
Do you use Mellanox OFED?
Thanks will try and report back.
Hi Petro, I am not sure if OFED is installed. I can share the output of sudo service rdma status:
Upper layer protocol modules:
User space access modules:
Connection management modules:
Configured IPoIB interfaces: ib0 ib1
Currently active IPoIB interfaces: ib0 ib1 bondib0
Can you please run with the debug disni library and send the Spark logs?
Hi Petro, this is the output after using the prebuilt disni libraries:
j2c::createEventChannel: obj_id 140621149451072
Thanks, seems like we need to check for the negative value. Can you please run with the next Spark config:
Hi Petro, I have tested using the config and receive the same error. I have attached the YARN logs for your consideration.
I'm not sure if this helps, but I see a different num_comp_vectors each time I run:
j2c::createEventChannel: obj_id 140671346315408
Made a PR to fix this issue. Can you please try to run with the attached jar?
Can you please run ofed_info?
Using the provided jar, I receive the following error:
19/05/13 15:00:20 ERROR spark.SparkContext: Error initializing SparkContext.
Not able to run ofed_info at the moment; the command is not found on the IB switches. Working with support for more information on that.
lsmod | grep ipoib
Ah sorry, wrong jar. Here's the correct one:
ib_read_bw test, server side:
Dual-port : OFF
Device : mlx4_0
Petro, I got an error using the latest attached jar:
ERROR rdma.RdmaNode: Error in accept call on a passive RdmaChannel: java.io.IOException: createCQ() failed
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
Driver stacktrace:
So you have an unconfigured fabric. Please make sure that the network is configured correctly.
Thanks Petro. What are your recommendations for configuring the fabric? This is the first time I have come across this issue on Oracle Linux; the system is OL6. From the results of ib_read_bw, are you confirming the fabric is unconfigured?
@rmunoz527 you need to follow the OFED installation tutorial (assuming you're using a Mellanox product). Need to make sure
Hi,
I am currently evaluating this library and have not done any specific configuration with respect to the InfiniBand network the Spark nodes interconnect on. Can you point me in the right direction on what might be the cause of the issue? See the config and stacktrace below.
spark2-submit -v --num-executors 10 --executor-cores 5 --executor-memory 4G --conf spark.driver.extraClassPath=/opt/mellanox/spark-rdma-3.1.jar --conf spark.executor.extraClassPath=/opt/mellanox/spark-rdma-3.1.jar --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager --class com.github.ehiggs.spark.terasort.TeraSort /tmp/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /tmp/data/terasort_in /tmp/data/terasort_out
Parsed arguments:
master yarn
deployMode client
executorMemory 4G
executorCores 5
totalExecutorCores null
propertiesFile /opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath /opt/mellanox/spark-rdma-3.1.jar
driverExtraLibraryPath /opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native
driverExtraJavaOptions null
supervise false
queue null
numExecutors 10
files null
pyFiles null
archives null
mainClass com.github.ehiggs.spark.terasort.TeraSort
primaryResource file:/tmp/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar
name com.github.ehiggs.spark.terasort.TeraSort
childArgs [/tmp/data/terasort_in /tmp/data/terasort_out]
jars null
packages null
packagesExclusions null
repositories null
(spark.shuffle.manager,org.apache.spark.shuffle.rdma.RdmaShuffleManager)
(spark.executor.extraLibraryPath,/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.authenticate,false)
(spark.yarn.jars,local:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/)
(spark.driver.extraLibraryPath,/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.yarn.historyServer.address,
(spark.yarn.am.extraLibraryPath,/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/lib/native)
(spark.eventLog.enabled,true)
(spark.dynamicAllocation.schedulerBacklogTimeout,1)
(spark.yarn.config.gatewayPath,/opt/cloudera/parcels)
(spark.ui.killEnabled,true)
(spark.dynamicAllocation.maxExecutors,148)
(spark.serializer,org.apache.spark.serializer.KryoSerializer)
(spark.shuffle.service.enabled,true)
(spark.hadoop.yarn.application.classpath,)
(spark.dynamicAllocation.minExecutors,0)
(spark.dynamicAllocation.executorIdleTimeout,60)
(spark.yarn.config.replacementPath,{{HADOOP_COMMON_HOME}}/../../..)
(spark.sql.hive.metastore.version,1.1.0)
(spark.submit.deployMode,client)
(spark.shuffle.service.port,7337)
(spark.executor.extraClassPath,/opt/mellanox/spark-rdma-3.1.jar)
(spark.hadoop.mapreduce.application.classpath,)
(spark.eventLog.dir,
(spark.master,yarn)
(spark.dynamicAllocation.enabled,true)
(spark.sql.catalogImplementation,hive)
(spark.sql.hive.metastore.jars,${env:HADOOP_COMMON_HOME}/../hive/lib/:${env:HADOOP_COMMON_HOME}/client/*)
(spark.driver.extraClassPath,/opt/mellanox/spark-rdma-3.1.jar)
19/05/02 13:54:17 WARN spark.SparkContext: Using an existing SparkContext; some configuration may not take effect.
[Stage 0:> (0 + 17) / 45]19/05/02 13:54:18 ERROR rdma.RdmaNode: Error in accept call on a passive RdmaChannel: java.io.IOException: createCQ() failed
java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaChannel.processRdmaCmEvent(RdmaChannel.java:345)
at org.apache.spark.shuffle.rdma.RdmaChannel.stop(RdmaChannel.java:894)
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:176)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "RdmaNode connection listening thread" java.lang.RuntimeException: Exception in RdmaNode listening thread java.lang.NullPointerException
at org.apache.spark.shuffle.rdma.RdmaNode.lambda$new$0(RdmaNode.java:210)
at java.lang.Thread.run(Thread.java:748)
19/05/02 13:54:20 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 0.0 (TID 3, , executor 3): java.lang.ArithmeticException: / by zero
at org.apache.spark.shuffle.rdma.RdmaNode.getNextCpuVector(RdmaNode.java:278)
at org.apache.spark.shuffle.rdma.RdmaNode.getRdmaChannel(RdmaNode.java:301)
at org.apache.spark.shuffle.rdma.RdmaShuffleManager.org$apache$spark$shuffle$rdma$RdmaShuffleManager$$getRdmaChannel(RdmaShuffleManager.scala:314)
at org.apache.spark.shuffle.rdma.RdmaShuffleManager.getRdmaChannelToDriver(RdmaShuffleManager.scala:322)
at org.apache.spark.shuffle.rdma.RdmaShuffleManager.publishMapTaskOutput(RdmaShuffleManager.scala:410)
at org.apache.spark.shuffle.rdma.writer.wrapper.RdmaWrapperShuffleWriter.stop(RdmaWrapperShuffleWriter.scala:118)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
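The `java.lang.ArithmeticException: / by zero` in `RdmaNode.getNextCpuVector` is consistent with a round-robin modulo over a vector count that ended up zero or negative. The sketch below is hypothetical (`CpuVectorPicker` and its fields are not the project's actual code); it shows where a clamp in the constructor would prevent the division by zero, assuming the picker cycles through completion vectors with a modulo.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin CPU-vector picker. If the reported vector count
// is zero or negative (as in the overflow discussed above), an unguarded
// "counter % numVectors" divides by zero; clamping to at least 1 avoids it.
public class CpuVectorPicker {
    private final AtomicInteger counter = new AtomicInteger(0);
    private final int numVectors;

    CpuVectorPicker(int reportedVectors) {
        // Clamp so the modulo in next() can never divide by zero.
        this.numVectors = Math.max(1, reportedVectors);
    }

    int next() {
        return counter.getAndIncrement() % numVectors;
    }

    public static void main(String[] args) {
        // With the bogus negative count from the logs, every call degrades
        // to vector 0 instead of throwing ArithmeticException.
        CpuVectorPicker picker = new CpuVectorPicker(-704911072);
        System.out.println(picker.next()); // prints 0
        System.out.println(picker.next()); // prints 0
    }
}
```

This matches the shape of the fix Peter's PR comment implies: validate the native value once, at the boundary, rather than in every caller.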