
[BUG] data corruption with spill framework changes #11885

Closed
abellina opened this issue Dec 18, 2024 · 0 comments · Fixed by #11887
Labels: ? - Needs Triage (need team to review and classify), bug (something isn't working)

I started seeing some odd issues while testing the removal of locking in spill/materialize. I tested this in a highly constrained environment, with 20GB of GPU memory and 8GB of host memory, to force heavy disk traffic.

After rolling back the locking changes I see the problem less often, but it still happens.

There is an issue with #11747 that needs to be fixed. The current suspect is chunked packed batches that were spilled to disk, since that accounts for most of the spill activity I see before it blows up:

java.lang.IllegalArgumentException: requirement failed: onAllocFailure invoked with invalid allocSize -158492791120                                      
        at scala.Predef$.require(Predef.scala:281)
        at com.nvidia.spark.rapids.DeviceMemoryEventHandler.onAllocFailure(DeviceMemoryEventHandler.scala:111)
        at ai.rapids.cudf.Table.groupByAggregate(Native Method)
        at ai.rapids.cudf.Table.access$3300(Table.java:41)
        at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:3994)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$2(GpuAggregateExec.scala:575)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$performGroupByAggregation$1(GpuAggregateExec.scala:562)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.performGroupByAggregation(GpuAggregateExec.scala:561)
        at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:476)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:495)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:493)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
        at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:492)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:513)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:649)
        at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:553)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at com.nvidia.spark.rapids.CloseableBufferedIterator.next(CloseableBufferedIterator.scala:65)
        at com.nvidia.spark.rapids.CloseableBufferedIterator.head(CloseableBufferedIterator.scala:47)
        at com.nvidia.spark.rapids.AggregateUtils$$anon$1.$anonfun$next$1(GpuAggregateExec.scala:173)
        at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:126)
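For context on the failure mode, the require in DeviceMemoryEventHandler rejects negative allocation sizes, so any size that has been corrupted or has underflowed upstream surfaces exactly like this. The sketch below is purely illustrative and assumes a hypothetical SpillTracker and a stand-in onAllocFailure guard (none of these names are from the plugin); it only shows how a Long size counter that is decremented twice for the same buffer yields a negative value that trips such a check.

```scala
// Illustrative only: SpillTracker, release, and this onAllocFailure stand-in are
// hypothetical and not the plugin's real code; only the require-style guard
// mirrors the check reported in the stack trace above.
object SpillSizeUnderflowSketch {

  // Hypothetical bookkeeping for bytes held by a spilled buffer.
  final class SpillTracker(var trackedBytes: Long) {
    def release(bytes: Long): Unit = trackedBytes -= bytes
  }

  // Stand-in for an allocation-failure callback that rejects negative sizes,
  // analogous to the check in DeviceMemoryEventHandler.onAllocFailure.
  def onAllocFailure(allocSize: Long): Unit = {
    require(allocSize >= 0, s"onAllocFailure invoked with invalid allocSize $allocSize")
    // A real handler would trigger spilling here.
  }

  def main(args: Array[String]): Unit = {
    val tracker = new SpillTracker(trackedBytes = 1024L)
    tracker.release(1024L) // first release: counter reaches 0
    tracker.release(1024L) // double-counted release: counter underflows to -1024
    onAllocFailure(tracker.trackedBytes) // throws IllegalArgumentException
  }
}
```

If the size recorded when a chunked packed batch is spilled disagrees with the size reported when it is read back, a mismatch of this kind could occur, which would be consistent with the suspicion above.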