Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DAPHNE-#629] efficient processing StringData in DenseMatrix #797

Open
wants to merge 33 commits into
base: main
Choose a base branch
from

Conversation

saminbassiri
Copy link
Contributor

[DAPHNE-#629] Efficient Processing of String Data Sets in DAPHNE with FixedStr16 Class and std::string Class for DenseMatrix

Summary

This PR addresses issue #629 by enhancing the string support in DAPHNE, making it practical to process string data sets. The main addition is Generilizeing or specializing current template structures for FixedStr16 class and std::string class. While significant progress has been made, additional features related to element-wise comparisons will be added in the upcoming days.

Key Features Implemented

  • FixedStr16 Class: A fixed-size string class with a 16-character buffer.
  • DenseMatrix for string value type: Generalize the current DenseMatrix class to support string data.
  • I/O Operations: Reading CSV files containing string columns.
  • Convert String Matrices to numeric:
    • oneHot: Applies one-hot-encoding to the given (n x m) matrix of strings.
    • recode: Applies dictionary encoding to the given (n x 1) matrix.
    • Cast: String value and matrix objects can be cast to a particular numeric type.
  • Source Operations:
    • fill: Creates a matrix and sets all elements to a particular value.
  • Reorganization:
    • transpose: Transposes a given matrix.

Testing

  • Initial tests for the DenseMatrix<std::string> and DenseMatrix<FixedStr16> have been implemented, verifying functionality for newly added features and data types. However, tests for std::strings are currently failing due to issues with element-wise OPT equality. These will be addressed with the upcoming features.

Upcoming Features

The following features will be added in the next few days:

  • Comparison Operators:
    • Element-Wise Binary operators for comparing DenseMatrix<std::string>  and/or DenseMatrix<FixedStr16>.
    • Element-Wise Unary operators for comparing DenseMatrix<std::string>  and /or DenseMatrix<FixedStr16>.

@saminbassiri saminbassiri force-pushed the 629_Efficient_Processing_DenseMatrix_of_Strings branch from 4fc65f8 to 12849fd Compare August 2, 2024 10:23
@saminbassiri
Copy link
Contributor Author

saminbassiri commented Aug 2, 2024

Upcoming Tasks Completed

The additional features mentioned before have been successfully implemented.

Changes:

The following features, as outlined in the PR message, have now been added:

  • resolving the previously noted issues with equality OPT onDenseMatrix<std::string>.

  • Comparison Operators:

    • Element-Wise Binary Operators: Implemented for comparing two Data with string value.
      • Scalar Comparison Equality and Inequality: operator == and != between two std::string or two FixedStr16.
      • Matrices Comparison Equality and Inequality: operator == and != between two DenseMatrix<std::string> or two DenseMatrix<FixedStr16>.
      • Scalar Comparison Greater than and Lower than: operator > and < between two std::string or two FixedStr16
      • Matrices Comparison Greater than and Lower than: operator > and < between two DenseMatrix<std::string> or two DenseMatrix<FixedStr16>.
      • Scalar Concatenation: operator + on twostd::string or two FixedStr16. (note: since the result of Concatenation may not fit in FixedStr16, the output of Concat is always std::string).
      • DenseMatrix Concatenation: operator + on twoDenseMatrix<std::string> or two DenseMatrix<FixedStr16>. (note: since the result of Concatenation may not fit in FixedStr16, the output of Concat is always DenseMatrix<std::string>).
    • Element-Wise Unary Operators: Implemented for Data with string value.
      • Scalar Operators Upper and Lower: operator UPPER and LOWER on std::string orFixedStr16.
      • Matrices Operators Upper and Lower: operator UPPER and LOWER on DenseMatrix<std::string> or DenseMatrix<FixedStr16>.

Test

Initial tests for DenseMatrix<std::string> and DenseMatrix<FixedStr16> were implemented, verifying functionality for newly added features and data types.

@saminbassiri saminbassiri marked this pull request as ready for review August 2, 2024 11:37
@pdamme pdamme added the LDE summer 2024 Student project in the course Large-scale Data Engineering at TU Berlin (summer 2024). label Aug 4, 2024
@pdamme pdamme self-requested a review August 4, 2024 19:18
@pdamme pdamme force-pushed the 629_Efficient_Processing_DenseMatrix_of_Strings branch from 12849fd to 6b49646 Compare September 8, 2024 17:13
- Renamed test CSV and meta data files.
- Removed unnecessary casts of ValueTypeUtils::default_value to the type it already has.
- Renamed ValueTypeUtils::default_value to defaultValue.
- oneHot-kernel
  - Removed superfluous additional convenience functions.
  - Release recoded intermediate.
  - Made index calculation view-aware.
  - Reduced the code duplication of the specializations for std::string and FixedStr16 by factoring out the code into a separate function template.
- Tidied up BinaryOpCode.h.
- Little formatting corrections.
- And some more minor things.
Copy link
Collaborator

@pdamme pdamme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this contribution, @saminbassiri. Better support for string processing, especially matrices of string value type, is highly welcome. Overall, your code already looks good to me. I've made some minor improvements and corrections myself (see my commit for details), but there are also a few points I'd like to ask you to improve before we merge this PR:

Required changes: (before this PR can be merged)

  1. Please make string comparisons case-sensitive by default. In case a users wants a case-insensitive comparison, they can apply the typical trick of lower-casing/upper-casing both (matrices of) strings explicitly.
  2. FixedStr16(const std::string& other): It seems to me that this constructor reads beyond the end of the given string's data and does not ensure that all remaining fields in the fixed-size string are '\0'; please double-check.
  3. Parsing of CSV files containing string data: Please add more corner cases to the tests, e.g., (a) empty string (with and without double quotes), (b) a quoted string containing an escaped double quote (e.g., "abc""def"), (c) a quoted string containing a newline.
  4. castSca-kernel: When casting a string scalar to a numeric type, please don't always rely on stold, but use a value type-specific parsing function. For instance, not all integers can be exactly represented by a double. You can take inspiration from DaphneDSLVisitor::visitLiteral().
  5. oneHot-kernel: The recoded intermediate result can be an invalid input to oneHot, since codes are not assigned contiguously. For instance, please add a test case with a column like ["a", "a", "b"]; the recoded column should be [0, 0, 1], but it would be [0, 0, 2] with the current code. This deviation can lead to an out-of-bounds error in the oneHot-kernel.
  6. Diversity in test/example data: Please either avoid columns like name and gender or make them reflect a more realistic diversity. This is important since we strive for an open and inclusive development community around DAPHNE.
  7. Generally don't use platform-specific C++ types like int and long as the value types of DAPHNE matrices and scalars. Instead, please use the [u]intX_t-types (e.g., int64_t).

@saminbassiri
Copy link
Contributor Author

saminbassiri commented Sep 22, 2024

Thank you for the thorough review and detailed feedback, @pdamme. I have considered the points you raised, and here is a summary of the changes:

  • String Comparisons: I updated the code to make string comparisons case-sensitive.

  • FixedStr16 Constructor: I reviewed and corrected the constructor to ensure it doesn’t read beyond the end of the string’s data. Hence, excluding the null character, FixedStr16 contains 15 characters.

  • CSV Parsing Tests: I extended the CSV parsing tests to cover the corner cases you mentioned and fixed bugs related to these new tests:

    • Empty strings with and without quotes.
    • Quoted strings containing escaped double quotes (e.g., "abc""def").
    • Quoted strings that contain newlines.
  • Numeric Casting (castSca-kernel): I modified the casting logic to avoid relying solely on stold when converting string scalars to numeric types.

  • oneHot-Kernel: I updated the logic to ensure the recoded column maintains contiguous values, such as converting ["a", "a", "b", "b"] to [0, 0, 1, 1]. I also added a corresponding test case to verify the functionality.

  • Diversity in Test Data: I avoid columns like name and gender in the test data.

  • Platform-Specific C++ Types: I reviewed the code and replaced the platform-specific types with the appropriate type.

- The FixedStr16 buffer no longer requires a null-terminator.
- This change optimizes memory usage for FixedStr16 value type.
@saminbassiri
Copy link
Contributor Author

PR Update:

I have applied several changes related to this PR:

  1. Handling Unsupported Result Types During String Casting:

    • This change improves the robustness of the code by adding proper handling of unsupported result types during string casting. If an unsupported type is used, a compile-time warning is issued using the C++ [[deprecated]] attribute, and a runtime error will be thrown if the CastSca function is called with an invalid result type. This ensures safer and more predictable behavior.
  2. FixedStr16 Buffer Size Update:

    • The FixedStr16 constructor has been updated to support 16-character strings without requiring a null terminator. Additionally, I have updated the test cases in CastObjTest.cpp to reflect this change.

@saminbassiri saminbassiri force-pushed the 629_Efficient_Processing_DenseMatrix_of_Strings branch from 56c1fe7 to 5205b56 Compare October 2, 2024 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LDE summer 2024 Student project in the course Large-scale Data Engineering at TU Berlin (summer 2024).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants