Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass target arrow type to array_decoder_factory #92

Merged
merged 2 commits into from
Jun 5, 2024

Conversation

progval
Copy link
Contributor

@progval progval commented Jun 4, 2024

This is the prerequisite to support configurable timestamp precision mentioned in #75 (comment)

@Jefffrey
Copy link
Collaborator

Jefffrey commented Jun 4, 2024

I plan to take a look at this later today 👍

Copy link
Collaborator

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @progval, this API looks good to me overall 👍

One thing I'm concerning is we may also need to provide a method that can convert ORC type to Arrow type, the so-called "default" option (or there is one and I miss it?). Otherwise the external user of array_decoder_factory needs to maintain that mapping by themselves.

src/array_decoder/struct_decoder.rs Outdated Show resolved Hide resolved
Comment on lines 321 to 325
pub fn array_decoder_factory(
column: &Column,
field: Arc<Field>,
stripe: &Stripe,
) -> Result<Box<dyn ArrayBatchDecoder>> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key API change 👍

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
@progval
Copy link
Contributor Author

progval commented Jun 5, 2024

One thing I'm concerning is we may also need to provide a method that can convert ORC type to Arrow type, the so-called "default" option (or there is one and I miss it?). Otherwise the external user of array_decoder_factory needs to maintain that mapping by themselves.

Yes, I'm planning to add it after #93 is merged. Probably .with_default_time_unit(TimeUnit) or .with_default_timestamp_type::<T: arrow::datatypes::ArrowTimestampType>() on the reader builder

Copy link
Collaborator

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's possible to store the Field inside Column itself, considering it's already storing DataType?

#[derive(Debug)]
pub struct Column {
number_of_rows: u64,
footer: Arc<StripeFooter>,
name: String,
data_type: DataType,
}

Might cut down on having to pass field everywhere 🤔

@progval
Copy link
Contributor Author

progval commented Jun 5, 2024

Do you think it's possible to store the Field inside Column itself, considering it's already storing DataType?

#[derive(Debug)]
pub struct Column {
number_of_rows: u64,
footer: Arc<StripeFooter>,
name: String,
data_type: DataType,
}

Might cut down on having to pass field everywhere 🤔

It looked like a good idea, but it actually complexifies everything:

  1. matching both types was only in array_decoder/mod.rs while destructuring the ORC type, but is now spread between that module, column.rs, and array_decoder/{struct,list,map,union}.rs. In particular, the union decoder expects UnionFields which is an iterable of (i8, Field) unlike Fields that every other one would use, so it can't rely on Column::children() to match both types
  2. Column::children() now needs to return a Result
  3. We would need the Field when constructing the Column, so we need to pass Fields through Stripe::new, so also in StripeFactory
  4. Overall it makes the code larger and less readable. See my attempt: progval/datafusion-orc@target-type...progval:datafusion-orc:target-type-in-column (which doesn't even include all of point 3, or moving validation that both types match from array_decoder_factory to Column::new to make sure the Column is self-consistent)

Copy link
Collaborator

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, and thanks for the detailed breakdown on my suggestion ❤️

@Jefffrey Jefffrey merged commit 749128b into datafusion-contrib:main Jun 5, 2024
9 checks passed
waynexia added a commit that referenced this pull request Oct 24, 2024
* Pass target arrow type to array_decoder_factory

* Remove debug prints

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

---------

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants