
[BUG] data corruption with spill framework changes #11885

Closed
abellina opened this issue Dec 18, 2024 · 0 comments · Fixed by #11887
Labels: ? - Needs Triage (need team to review and classify), bug (something isn't working)

I started seeing some odd issues while testing the removal of locking in spill/materialize. I tested this in a highly constrained environment, with 20GB of GPU memory and 8GB of host memory, to force heavy disk traffic.

After rolling back the locking changes I see the problem less often, but it still happens.

There is an issue with #11747 that needs to be fixed. The current suspect is chunked packed batches that were spilled to disk, since that accounts for most of the spill activity I see before it blows up:

java.lang.IllegalArgumentException: requirement failed: onAllocFailure invoked with invalid allocSize -158492791120                                      
        at scala.Predef$.require(Predef.scala:281)
        at com.nvidia.spark.rapids.DeviceMemoryEventHandler.onAllocFailure(DeviceMemoryEventHandler.scala:111)
        at ai.rapids.cudf.Table.groupByAggregate(Native Method)
        at ai.rapids.cudf.Table.access$3300(Table.java:41)
        at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:3994)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$2(GpuAggregateExec.scala:575)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$1(GpuAggregateExec.scala:562)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.performGroupByAggregation(GpuAggregateExec.scala:561)
        at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:476)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:495)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:493)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:492)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:513)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:649)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:553)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at com.nvidia.spark.rapids.CloseableBufferedIterator.next(CloseableBufferedIterator.scala:65)
        at com.nvidia.spark.rapids.CloseableBufferedIterator.head(CloseableBufferedIterator.scala:47)
        at com.nvidia.spark.rapids.AggregateUtils$$anon$1.$anonfun$next$1(GpuAggregateExec.scala:173)
        at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:126)
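For context on the failure mode, the require in DeviceMemoryEventHandler rejects negative allocation sizes, so any size that has been corrupted or has underflowed upstream surfaces exactly like this. The sketch below is purely illustrative and assumes a hypothetical SpillTracker and a stand-in onAllocFailure guard (none of these names are from the plugin); it only shows how a Long size counter that is decremented twice for the same buffer yields a negative value that trips such a check.

```scala
// Illustrative only: SpillTracker, release, and this onAllocFailure stand-in are
// hypothetical and not the plugin's real code; only the require-style guard
// mirrors the check reported in the stack trace above.
object SpillSizeUnderflowSketch {

  // Hypothetical bookkeeping for bytes held by a spilled buffer.
  final class SpillTracker(var trackedBytes: Long) {
    def release(bytes: Long): Unit = trackedBytes -= bytes
  }

  // Stand-in for an allocation-failure callback that rejects negative sizes,
  // analogous to the check in DeviceMemoryEventHandler.onAllocFailure.
  def onAllocFailure(allocSize: Long): Unit = {
    require(allocSize >= 0, s"onAllocFailure invoked with invalid allocSize $allocSize")
    // A real handler would trigger spilling here.
  }

  def main(args: Array[String]): Unit = {
    val tracker = new SpillTracker(trackedBytes = 1024L)
    tracker.release(1024L) // first release: counter reaches 0
    tracker.release(1024L) // double-counted release: counter underflows to -1024
    onAllocFailure(tracker.trackedBytes) // throws IllegalArgumentException
  }
}
```

If the size recorded when a chunked packed batch is spilled disagrees with the size reported when it is read back, a mismatch of this kind could occur, which would be consistent with the suspicion above.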