I started seeing some odd issues when testing the removal of locking in spill/materialize. I tested this in a highly constrained environment with 20GB of GPU memory and 8GB of host memory to create heavy disk traffic.
I then rolled back the locking changes and see the issue less often, but it still happens.
There's an issue with #11747 that needs to be fixed. The current culprit is chunked packed batches that went to disk, as that accounts for much of the spill activity I see before it blows up:
java.lang.IllegalArgumentException: requirement failed: onAllocFailure invoked with invalid allocSize -158492791120
at scala.Predef$.require(Predef.scala:281)
at com.nvidia.spark.rapids.DeviceMemoryEventHandler.onAllocFailure(DeviceMemoryEventHandler.scala:111)
at ai.rapids.cudf.Table.groupByAggregate(Native Method)
at ai.rapids.cudf.Table.access$3300(Table.java:41)
at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:3994)
at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$2(GpuAggregateExec.scala:575)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$1(GpuAggregateExec.scala:562)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.AggHelper.performGroupByAggregation(GpuAggregateExec.scala:561)
at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:476)
at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:495)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:493)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:492)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:513)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:649)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:553)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at com.nvidia.spark.rapids.CloseableBufferedIterator.next(CloseableBufferedIterator.scala:65)
at com.nvidia.spark.rapids.CloseableBufferedIterator.head(CloseableBufferedIterator.scala:47)
at com.nvidia.spark.rapids.AggregateUtils$$anon$1.$anonfun$next$1(GpuAggregateExec.scala:173)
at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:126)
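For reference, here is a minimal sketch of the check that is tripping, plus one hedged reading of the negative value. The guard mirrors the require message in the trace; AllocSizeSketch, interpretNativeSize, and the signature shown are illustrative assumptions, not the actual plugin code. Viewed as an unsigned 64-bit value, -158492791120 is just under 16 EiB, which looks more like a size_t subtraction underflow on the native side (e.g. stale or mismatched size metadata for a chunked packed batch that went through disk) than a plausible allocation request.

```scala
// Minimal sketch, not plugin code: the guard below mirrors the require()
// message from the stack trace, and interpretNativeSize is a hypothetical
// helper showing how an unsigned (size_t) underflow on the native side
// surfaces as a negative Long once it crosses JNI into the JVM.
object AllocSizeSketch {

  // Mirrors the failing check in DeviceMemoryEventHandler.onAllocFailure
  // (the real handler's signature and retry handling may differ).
  def onAllocFailure(allocSize: Long, retryCount: Int): Boolean = {
    require(allocSize >= 0,
      s"onAllocFailure invoked with invalid allocSize $allocSize")
    false // the real handler would attempt to spill and signal a retry
  }

  // If native code computes `small - large` on unsigned size_t values, the
  // wrapped-around result arrives in Java as a negative Long.
  def interpretNativeSize(signedFromJni: Long): BigInt =
    BigInt(java.lang.Long.toUnsignedString(signedFromJni))

  def main(args: Array[String]): Unit = {
    val badSize = -158492791120L
    // Just under 16 EiB when viewed as unsigned 64-bit, i.e. an underflow,
    // not a realistic allocation request.
    println(s"unsigned view of $badSize: ${interpretNativeSize(badSize)} bytes")
    try {
      onAllocFailure(badSize, retryCount = 0)
    } catch {
      case e: IllegalArgumentException =>
        println(e.getMessage) // matches the requirement failure in the trace
    }
  }
}
```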