Discussion on Enhancing Performance and Memory Efficiency in Data Processing Workflows #84

Open · enhancement (New feature or request) · Assignee: Sohambutala

Sohambutala (Collaborator) commented May 15, 2024

This ticket opens a discussion on potential improvements to performance and memory management in our data processing workflows. The current approach works, but it incurs significant I/O overhead because each stage must write its outputs to disk and subsequent stages must read them back.

Current Challenge: Each stage of our workflow currently writes its output to disk, which the next stage then reads back. This is not only I/O-intensive but also a performance bottleneck, largely because of the serialization required to pass data between Prefect flows.
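
To make the round trip concrete, here is a minimal, purely hypothetical sketch of the pattern described above. The stage names, file formats, and paths are invented for illustration and are not our actual flow code:

```python
import pandas as pd
from prefect import flow, task

@task
def stage_one(raw_path: str, out_path: str) -> str:
    df = pd.read_csv(raw_path)       # read input
    df.to_parquet(out_path)          # write the intermediate result to disk (needs pyarrow)
    return out_path

@task
def stage_two(in_path: str) -> pd.DataFrame:
    return pd.read_parquet(in_path)  # the next stage re-reads the same data

@flow
def pipeline(raw_path: str = "raw.csv") -> pd.DataFrame:
    intermediate = stage_one(raw_path, "stage_one.parquet")
    return stage_two(intermediate)
```

Every write/read pair like this is overhead whenever the next stage could have consumed the in-memory object directly.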

Proposed Solutions for Discussion:

  1. Custom Serialization:

    Pros: Allows tighter control over how data is managed and passed between stages.
    Cons: Requires additional development effort and maintenance.
    Implementation: Develop a custom class (if required) that handles serialization of intermediate data efficiently (see the first sketch after this list).

  2. Compression:

    Pros: Reduces the physical size of serialized files, potentially decreasing I/O time.
    Cons: Introduces a trade-off between the time saved on I/O operations and the additional time required for compressing and decompressing data.
    Implementation: Implement compression algorithms suited to our data types and processing needs (see the compression sketch after this list).

  3. Enhanced Caching Strategy:

    Pros: Minimizes disk I/O by keeping frequently used data in faster access storage areas.
    Cons: Complex implementation, especially when deciding what data remains in cache and what gets written to disk.
    Implementation: Use a hybrid caching system with Redis for in-memory caching and disk-based storage for less frequently accessed data. Consider writing a custom caching strategy tailored to predict and pre-load data required for upcoming stages (see the cache sketch after this list).
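
For option 1, a minimal sketch of what a custom serializer could look like, assuming the intermediate results are pandas DataFrames (the class name and the choice of Arrow/Feather are assumptions for discussion, not a decision). It exposes the usual dumps/loads pair so it could later be wired into whatever result-serializer hook Prefect offers:

```python
import io

import pandas as pd

class DataFrameSerializer:
    """Hypothetical serializer: Arrow/Feather instead of generic pickle."""

    def dumps(self, df: pd.DataFrame) -> bytes:
        buf = io.BytesIO()
        df.to_feather(buf)            # columnar format, fast to write and read (needs pyarrow)
        return buf.getvalue()

    def loads(self, blob: bytes) -> pd.DataFrame:
        return pd.read_feather(io.BytesIO(blob))
```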
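
For option 2, a sketch of wrapping serialized bytes in compression using only the standard library, so it is easy to benchmark against our real data. The preset value is an assumption to tune; zstd or lz4 would be worth comparing if we accept an extra dependency:

```python
import lzma

def compress(blob: bytes) -> bytes:
    # A low preset trades compression ratio for speed, which matters more for I/O-bound stages.
    return lzma.compress(blob, preset=1)

def decompress(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```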
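
For option 3, a rough sketch of the hybrid idea: hot intermediates live in Redis with a TTL, while every entry also lands on disk as the fallback. The host/port, key-to-filename scheme, and TTL are placeholders for discussion, not proposals:

```python
from pathlib import Path

import redis  # requires the redis-py package and a running Redis server

class HybridCache:
    def __init__(self, cache_dir: str = "cache", ttl_seconds: int = 3600):
        self.redis = redis.Redis(host="localhost", port=6379)
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl_seconds

    def put(self, key: str, blob: bytes) -> None:
        self.redis.set(key, blob, ex=self.ttl)   # hot copy in memory, expires automatically
        (self.dir / key).write_bytes(blob)       # durable copy on disk

    def get(self, key: str) -> bytes | None:
        blob = self.redis.get(key)               # fast path: in-memory hit
        if blob is not None:
            return blob
        path = self.dir / key                    # slow path: fall back to disk
        return path.read_bytes() if path.exists() else None
```

A predictive pre-load layer could sit on top of get/put once we know which stages follow which.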

Any insights, comments, or criticism are welcome.
