PERF: Performance Improvement on DataFrame.to_csv() when index=False #59608

Merged

@KevsterAmp (Contributor) commented Aug 26, 2024

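I added an alternative ndarray with the same length as the output of _get_values_for_csv. It is passed to write_csv_rows, which reads the row count from its length (the relevant line is shown below). I have only tested this crudely.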

Py_ssize_t i, j = 0, k = len(data_index), N = 100, ncols = len(cols)
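A minimal sketch of the idea, not the actual pandas code: when there are no index levels to write, the formatted index values are never needed, so a cheap placeholder of the right length can stand in for them. The helper name `chunk_index_values` and the `str()` formatting are assumptions made for illustration only.

```python
import numpy as np


def chunk_index_values(index_values, start_i, end_i, nlevels):
    """Hypothetical stand-in for the per-chunk index preparation.

    When nlevels is 0 (no index is written), skip the expensive
    formatting and return a placeholder with the same length.
    """
    if nlevels != 0:
        # Stand-in for the real formatting step: stringify each value.
        chunk = index_values[start_i:end_i]
        return np.array([str(v) for v in chunk], dtype=object)
    # Placeholder: only its length matters to the consumer.
    return np.full(end_i - start_i, None)


placeholder = chunk_index_values(list(range(10)), 2, 5, nlevels=0)
formatted = chunk_index_values(list(range(10)), 2, 5, nlevels=1)
print(len(placeholder), list(formatted))
```

The consumer still sees an array of the expected length, so its row loop is unchanged.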

Tests

Using the same code as the referenced issue:

import time

import pandas as pd

NUM_ROWS = 10_000_000
NUM_COLS = 20

# Example Multi-Index DataFrame
df = pd.DataFrame(
    {
        f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS)
        for col_idx in range(NUM_COLS)
    }
)
df = df.set_index(["col_0", "col_1"], drop=False)

# Timing Operation A
start_time = time.time()
df.to_csv("file_A.csv", index=False)
end_time = time.time()
print(f"Operation A time: {end_time - start_time} seconds")

# Timing Operation B
start_time = time.time()
df_reset = df.reset_index(drop=True)
df_reset.to_csv("file_B.csv", index=False)
end_time = time.time()
print(f"Operation B time: {end_time - start_time} seconds")

Output before performance improvement

Operation A time: 869.2354643344879 seconds
Operation B time: 42.1906418800354 seconds

Output after performance improvement

Operation A time: 51.408071756362915 seconds
Operation B time: 45.78637385368347 seconds

Operation B resets the index before writing and serves as the baseline for comparison. The change brings Operation A from about 869 seconds down to about 51 seconds (roughly 17x faster), close to Operation B.
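For a quicker sanity check than the 10,000,000-row run, a scaled-down sketch like the following could be used. The sizes, the `timed_to_csv` helper, and writing to an in-memory buffer instead of a file are all assumptions of this sketch, so its timings are not comparable to the numbers above. It also checks that both operations produce identical CSV output.

```python
import io
import time

import pandas as pd

NUM_ROWS = 10_000  # far smaller than the benchmark above
NUM_COLS = 5

df = pd.DataFrame(
    {
        f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS)
        for col_idx in range(NUM_COLS)
    }
)
df = df.set_index(["col_0", "col_1"], drop=False)


def timed_to_csv(frame):
    """Write frame to an in-memory buffer; return (csv_text, seconds)."""
    buf = io.StringIO()
    start = time.perf_counter()
    frame.to_csv(buf, index=False)
    elapsed = time.perf_counter() - start
    return buf.getvalue(), elapsed


csv_a, time_a = timed_to_csv(df)  # Operation A: MultiIndex kept
csv_b, time_b = timed_to_csv(df.reset_index(drop=True))  # Operation B

print(f"Operation A time: {time_a:.4f} seconds")
print(f"Operation B time: {time_b:.4f} seconds")
print("Identical output:", csv_a == csv_b)
```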

@KevsterAmp changed the title from "add alternative ix when self.nlevel is 0" to "PERF: Performance Improvement on DataFrame.to_csv() when index=False" on Aug 26, 2024
ix = (
self.data_index[slicer]._get_values_for_csv(**self._number_format)
if self.nlevels != 0
else np.full(end_i - start_i, None)
Member

Can you use np.empty instead?
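A small sketch of the difference between the two calls, assuming only the placeholder's length is ever read. np.full(n, None) builds an object array and fills every slot, while np.empty(n) allocates a float64 buffer without initializing it, which is cheaper when the contents do not matter.

```python
import numpy as np

n = 4

filled = np.full(n, None)  # object dtype, every element set to None
uninitialized = np.empty(n)  # float64 dtype, contents are arbitrary
empty_objects = np.empty(n, dtype=object)  # object arrays start out as None

# All three have the length a consumer would read.
print(len(filled), len(uninitialized), len(empty_objects))
print(filled.dtype, uninitialized.dtype, empty_objects.dtype)
```

The values in the plain np.empty array must never be read, since they are whatever happened to be in memory.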

@mroeschke
Member

For your benchmark, could you show before and after timings?

@mroeschke added the Performance (memory or execution speed performance) and IO CSV (read_csv, to_csv) labels on Aug 26, 2024
@KevsterAmp
Contributor Author

Added output times to the description as well

@mroeschke mroeschke added this to the 3.0 milestone Aug 27, 2024
@mroeschke mroeschke merged commit bd81fef into pandas-dev:main Aug 27, 2024
47 checks passed
@mroeschke
Member

Thanks @KevsterAmp

Successfully merging this pull request may close these issues.

PERF: Significant Performance Difference in DataFrame.to_csv() with and without Index Reset