PERF: Performance Improvement on `DataFrame.to_csv()` when `index=False` #59608

KevsterAmp · 2024-08-26T10:13:13Z

closes PERF: Significant Performance Difference in DataFrame.to_csv() with and without Index Reset #59312
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I added an alternative ndarray, with the same length as the output of `_get_values_for_csv`, for use in `write_csv_rows` when there are no index levels to write (crude testing below):

pandas/pandas/_libs/writers.pyx

Line 42 in 642d244

Py_ssize_t i, j = 0, k = len(data_index), N = 100, ncols = len(cols)
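To illustrate why a same-length placeholder is enough, here is a minimal pure-Python sketch. It is not the Cython code in `writers.pyx`: `write_rows_sketch` is a hypothetical stand-in that only mirrors the one property visible in the line above, namely that the row count is taken from `len(data_index)`. It assumes the index values themselves are read only when `nlevels` is non-zero.

```python
import csv
import io

import numpy as np


def write_rows_sketch(data, data_index, nlevels, writer):
    """Hypothetical, simplified row writer (supports nlevels of 0 or 1).

    The number of rows comes from len(data_index), as in the quoted line;
    the index values are only touched when an index column is written.
    """
    for j in range(len(data_index)):
        row = []
        if nlevels == 1:
            row.append(data_index[j])
        row.extend(column[j] for column in data)
        writer.writerow(row)


# Column-major string data, standing in for already-formatted values.
data = [["1", "2", "3"], ["a", "b", "c"]]

# With nlevels == 0 only the length matters, so an uninitialized array works.
placeholder = np.empty(3)

buffer = io.StringIO()
write_rows_sketch(data, placeholder, 0, csv.writer(buffer, lineterminator="\n"))
output = buffer.getvalue()
print(output)
```

Under that assumption, the cost of formatting the real index can be skipped entirely when `index=False`.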

Tests

Using the same code as the referenced issue:

import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv
import time

NUM_ROWS = 10000000
NUM_COLS = 20

# Example Multi-Index DataFrame
df = pd.DataFrame(
    {
        f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS)
        for col_idx in range(NUM_COLS)
    }
)
df = df.set_index(["col_0", "col_1"], drop=False)

# Timing Operation A
start_time = time.time()
df.to_csv("file_A.csv", index=False)
end_time = time.time()
print(f"Operation A time: {end_time - start_time} seconds")

# Timing Operation B
start_time = time.time()
df_reset = df.reset_index(drop=True)
df_reset.to_csv("file_B.csv", index=False)
end_time = time.time()
print(f"Operation B time: {end_time - start_time} seconds")

Output before performance improvement

Operation A time: 869.2354643344879 seconds
Operation B time: 42.1906418800354 seconds

Output after performance improvement

Operation A time: 51.408071756362915 seconds
Operation B time: 45.78637385368347 seconds

Operation B serves as the baseline, since it resets the index before writing. The change brings Operation A (about 869 s before, about 51 s after) close to that baseline.
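For a quicker sanity check, the comparison can be scaled down and run against an in-memory buffer. This is only a sketch: the helper `timed_to_csv` and the reduced sizes are assumptions made for illustration, and timings at this scale say little about the 10,000,000-row case. It does confirm that both operations produce identical CSV text.

```python
import io
import time

import pandas as pd

NUM_ROWS = 1_000  # far smaller than the 10,000,000 rows used above
NUM_COLS = 5


def timed_to_csv(frame):
    """Write frame to an in-memory buffer with index=False; return (seconds, text)."""
    buffer = io.StringIO()
    start = time.perf_counter()
    frame.to_csv(buffer, index=False)
    return time.perf_counter() - start, buffer.getvalue()


df = pd.DataFrame(
    {
        f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS)
        for col_idx in range(NUM_COLS)
    }
)
df = df.set_index(["col_0", "col_1"], drop=False)

# Operation A: write directly, keeping the MultiIndex on the frame.
seconds_a, csv_a = timed_to_csv(df)

# Operation B: drop the index first, then write.
seconds_b, csv_b = timed_to_csv(df.reset_index(drop=True))

print(f"Operation A time: {seconds_a} seconds")
print(f"Operation B time: {seconds_b} seconds")
print(f"Identical output: {csv_a == csv_b}")
```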

mroeschke · 2024-08-26T17:14:32Z

pandas/io/formats/csvs.py

+        ix = (
+            self.data_index[slicer]._get_values_for_csv(**self._number_format)
+            if self.nlevels != 0
+            else np.full(end_i - start_i, None)


Can you use np.empty instead?
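A small sketch of the difference behind this suggestion, assuming the placeholder's contents are never read and only its length is used: `np.full` has to write a value into every slot, while `np.empty` allocates without initializing.

```python
import numpy as np

n_rows = 4

# np.full with None yields an object array with every slot set to None.
filled = np.full(n_rows, None)

# np.empty yields a float64 array by default; its contents are arbitrary.
placeholder = np.empty(n_rows)

print(filled.dtype, len(filled))
print(placeholder.dtype, len(placeholder))
```

Both arrays have the same length, which is the only property the placeholder needs under the assumption above.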

mroeschke · 2024-08-26T17:14:48Z

For your benchmark, could you show before and after timings?

KevsterAmp · 2024-08-26T23:16:40Z

Output before performance improvement

Operation A time: 869.2354643344879 seconds
Operation B time: 42.1906418800354 seconds

Output (after performance improvement)

Operation A time: 51.408071756362915 seconds
Operation B time: 45.78637385368347 seconds

Operation B serves as the baseline, since it resets the index before writing. The change brings Operation A close to that baseline.

Added output times to the description as well

mroeschke · 2024-08-27T00:07:34Z

Thanks @KevsterAmp

KevsterAmp added 2 commits August 26, 2024 17:57

add alternative ix when self.nlevel is 0

b252376

add to latest whatsnew

3a9d2b9

KevsterAmp changed the title ~~add alternative ix when self.nlevel is 0~~ PERF: Performance Improvement on DataFrame.to_csv() when index=False Aug 26, 2024

mroeschke reviewed Aug 26, 2024

mroeschke added the Performance (memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Aug 26, 2024

change np.full to np.empty

c170220

mroeschke added this to the 3.0 milestone Aug 27, 2024

mroeschke approved these changes Aug 27, 2024

mroeschke merged commit bd81fef into pandas-dev:main Aug 27, 2024
47 checks passed
