Discussion on Enhancing Performance and Memory Efficiency in Data Processing Workflows #84

Open · enhancement (New feature or request) · Assignee: Sohambutala

Sohambutala (Collaborator) commented May 15, 2024

This ticket opens a discussion on potential improvements to performance and memory management in our data processing workflows. The current approach works, but it incurs significant I/O overhead because each stage must write its outputs to disk and subsequent stages must read them back.

Current Challenge: Each stage of our workflow currently writes its output to disk, which the next stage then reads back. This is not only I/O-intensive but also a performance bottleneck, largely because of the serialization required to pass data between Prefect flows.
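
To make the round trip concrete, here is a minimal, purely hypothetical sketch of the pattern described above. The stage names, file formats, and paths are invented for illustration and are not our actual flow code:

```python
import pandas as pd
from prefect import flow, task

@task
def stage_one(raw_path: str, out_path: str) -> str:
    df = pd.read_csv(raw_path)       # read input
    df.to_parquet(out_path)          # write the intermediate result to disk (needs pyarrow)
    return out_path

@task
def stage_two(in_path: str) -> pd.DataFrame:
    return pd.read_parquet(in_path)  # the next stage re-reads the same data

@flow
def pipeline(raw_path: str = "raw.csv") -> pd.DataFrame:
    intermediate = stage_one(raw_path, "stage_one.parquet")
    return stage_two(intermediate)
```

Every write/read pair like this is overhead whenever the next stage could have consumed the in-memory object directly.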

Proposed Solutions for Discussion:

  1. Custom Serialization:

    Pros: Allows tighter control over how data is managed and passed between stages.
    Cons: Requires additional development effort and maintenance.
    Implementation: Develop a custom class (if required) that handles serialization of intermediate data efficiently (see the first sketch after this list).

  2. Compression:

    Pros: Reduces the physical size of serialized files, potentially decreasing I/O time.
    Cons: Introduces a trade-off between the time saved on I/O operations and the additional time required for compressing and decompressing data.
    Implementation: Implement compression algorithms suited to our data types and processing needs (see the compression sketch after this list).

  3. Enhanced Caching Strategy:

    Pros: Minimizes disk I/O by keeping frequently used data in faster access storage areas.
    Cons: Complex implementation, especially when deciding what data remains in cache and what gets written to disk.
    Implementation: Use a hybrid caching system with Redis for in-memory caching and disk-based storage for less frequently accessed data. Consider writing a custom caching strategy tailored to predict and pre-load data required for upcoming stages (see the cache sketch after this list).
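
For option 1, a minimal sketch of what a custom serializer could look like, assuming the intermediate results are pandas DataFrames (the class name and the choice of Arrow/Feather are assumptions for discussion, not a decision). It exposes the usual dumps/loads pair so it could later be wired into whatever result-serializer hook Prefect offers:

```python
import io

import pandas as pd

class DataFrameSerializer:
    """Hypothetical serializer: Arrow/Feather instead of generic pickle."""

    def dumps(self, df: pd.DataFrame) -> bytes:
        buf = io.BytesIO()
        df.to_feather(buf)            # columnar format, fast to write and read (needs pyarrow)
        return buf.getvalue()

    def loads(self, blob: bytes) -> pd.DataFrame:
        return pd.read_feather(io.BytesIO(blob))
```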
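
For option 2, a sketch of wrapping serialized bytes in compression using only the standard library, so it is easy to benchmark against our real data. The preset value is an assumption to tune; zstd or lz4 would be worth comparing if we accept an extra dependency:

```python
import lzma

def compress(blob: bytes) -> bytes:
    # A low preset trades compression ratio for speed, which matters more for I/O-bound stages.
    return lzma.compress(blob, preset=1)

def decompress(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```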
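
For option 3, a rough sketch of the hybrid idea: hot intermediates live in Redis with a TTL, while every entry also lands on disk as the fallback. The host/port, key-to-filename scheme, and TTL are placeholders for discussion, not proposals:

```python
from pathlib import Path

import redis  # requires the redis-py package and a running Redis server

class HybridCache:
    def __init__(self, cache_dir: str = "cache", ttl_seconds: int = 3600):
        self.redis = redis.Redis(host="localhost", port=6379)
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl_seconds

    def put(self, key: str, blob: bytes) -> None:
        self.redis.set(key, blob, ex=self.ttl)   # hot copy in memory, expires automatically
        (self.dir / key).write_bytes(blob)       # durable copy on disk

    def get(self, key: str) -> bytes | None:
        blob = self.redis.get(key)               # fast path: in-memory hit
        if blob is not None:
            return blob
        path = self.dir / key                    # slow path: fall back to disk
        return path.read_bytes() if path.exists() else None
```

A predictive pre-load layer could sit on top of get/put once we know which stages follow which.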

Any insights, comments, or criticism are welcome.
