-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DAPHNE-#629] efficient processing StringData in DenseMatrix #797
base: main
Are you sure you want to change the base?
[DAPHNE-#629] efficient processing StringData in DenseMatrix #797
Conversation
4fc65f8
to
12849fd
Compare
Upcoming Tasks CompletedThe additional features mentioned before have been successfully implemented. Changes:The following features, as outlined in the PR message, have now been added:
TestInitial tests for |
…> struct for std::string and FixedStr16
…ng and FixedStr16
12849fd
to
6b49646
Compare
- Renamed test CSV and meta data files. - Removed unnecessary casts of ValueTypeUtils::default_value to the type it already has. - Renamed ValueTypeUtils::default_value to defaultValue. - oneHot-kernel - Removed superfluous additional convenience functions. - Release recoded intermediate. - Made index calculation view-aware. - Reduced the code duplication of the specializations for std::string and FixedStr16 by factoring out the code into a separate function template. - Tidied up BinaryOpCode.h. - Little formatting corrections. - And some more minor things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this contribution, @saminbassiri. Better support for string processing, especially matrices of string value type, is highly welcome. Overall, your code already looks good to me. I've made some minor improvements and corrections myself (see my commit for details), but there are also a few points I'd like to ask you to improve before we merge this PR:
Required changes: (before this PR can be merged)
- Please make string comparisons case-sensitive by default. In case a users wants a case-insensitive comparison, they can apply the typical trick of lower-casing/upper-casing both (matrices of) strings explicitly.
FixedStr16(const std::string& other)
: It seems to me that this constructor reads beyond the end of the given string's data and does not ensure that all remaining fields in the fixed-size string are'\0'
; please double-check.- Parsing of CSV files containing string data: Please add more corner cases to the tests, e.g., (a) empty string (with and without double quotes), (b) a quoted string containing an escaped double quote (e.g.,
"abc""def"
), (c) a quoted string containing a newline. castSca
-kernel: When casting a string scalar to a numeric type, please don't always rely onstold
, but use a value type-specific parsing function. For instance, not all integers can be exactly represented by a double. You can take inspiration fromDaphneDSLVisitor::visitLiteral()
.oneHot
-kernel: The recoded intermediate result can be an invalid input tooneHot
, since codes are not assigned contiguously. For instance, please add a test case with a column like ["a", "a", "b"]; the recoded column should be [0, 0, 1], but it would be [0, 0, 2] with the current code. This deviation can lead to an out-of-bounds error in theoneHot
-kernel.- Diversity in test/example data: Please either avoid columns like name and gender or make them reflect a more realistic diversity. This is important since we strive for an open and inclusive development community around DAPHNE.
- Generally don't use platform-specific C++ types like
int
andlong
as the value types of DAPHNE matrices and scalars. Instead, please use the[u]intX_t
-types (e.g.,int64_t
).
Thank you for the thorough review and detailed feedback, @pdamme. I have considered the points you raised, and here is a summary of the changes:
|
- The FixedStr16 buffer no longer requires a null-terminator. - This change optimizes memory usage for FixedStr16 value type.
PR Update: I have applied several changes related to this PR:
|
56c1fe7
to
5205b56
Compare
[DAPHNE-#629] Efficient Processing of String Data Sets in DAPHNE with FixedStr16 Class and std::string Class for DenseMatrix
Summary
This PR addresses issue #629 by enhancing the string support in DAPHNE, making it practical to process string data sets. The main addition is Generilizeing or specializing current template structures for
FixedStr16
class andstd::string
class. While significant progress has been made, additional features related to element-wise comparisons will be added in the upcoming days.Key Features Implemented
oneHot
: Applies one-hot-encoding to the given (n x m) matrix of strings.recode
: Applies dictionary encoding to the given (n x 1) matrix.Cast
: String value and matrix objects can be cast to a particular numeric type.fill
: Creates a matrix and sets all elements to a particular value.transpose
: Transposes a given matrix.Testing
DenseMatrix<std::string>
andDenseMatrix<FixedStr16>
have been implemented, verifying functionality for newly added features and data types. However, tests for std::strings are currently failing due to issues with element-wise OPT equality. These will be addressed with the upcoming features.Upcoming Features
The following features will be added in the next few days:
DenseMatrix<std::string>
and/orDenseMatrix<FixedStr16>
.DenseMatrix<std::string>
and /orDenseMatrix<FixedStr16>
.