PARQUET-675: Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types #43

julienledem · 2016-10-26T18:12:41Z

No description provided.

rdblue

The types make sense, but I'd like to have quotes from the SQL spec here to explain a bit more what year-month and day-millisecond "units" are. I think this should update the logical types doc as well.

rdblue · 2016-10-26T22:22:04Z

src/main/thrift/parquet.thrift

@@ -162,19 +162,48 @@ enum ConvertedType {
  BSON = 20;

  /**
+   * @Deprecated: use INTERVAL_YEAR_MONTH or INTERVAL_DAY_TIME
+   * since the SQL standard defines either YEAR_MONTH or DAY_TIME unit.
+   * This is deprecated in favor of those 2 types


What does the SQL spec say?

See links in https://issues.apache.org/jira/browse/PARQUET-675.
For example: "An interval is defined as the difference between two dates and times. Intervals are expressed in one of two different ways. One is a year-month interval that expresses intervals in terms of years and an integral number of months. The other is a day-time interval that expresses intervals in terms of days, minutes, and seconds. These two types of intervals are distinct and cannot be mixed, because months can have varying numbers of days."
https://msdn.microsoft.com/en-us/library/ms716506(v=vs.85).aspx

Should I add this quote in there?

SQL 92 Spec:

4.5.2 Intervals There are two classes of intervals. One class, called year-month intervals, has an express or implied datetime precision that in- cludes no fields other than YEAR and MONTH, though not both are required. The other class, called day-time intervals, has an ex- press or implied interval precision that can include any fields other than YEAR or MONTH.

Further:

Operations involving items of type interval require that the interval items be mutually compara- ble. ... Year-month intervals are mutually comparable only with other year- month intervals ... Day-time intervals are mutually comparable only with other day- time intervals.

How are intervals comparable? Are there details in the spec for resolving cases like (1 day, 0 ms) <=> (0 days, 86400000 ms)?

By the SQL definition 1 day = 24 hours = 86400 seconds.

_Keyword______Valid_values_of_INTERVAL_fields______________________ | DAY | Unconstrained except by <interval leading field | precision | HOUR | Hours (within days) (0-23) | | | | | MINUTE | Minutes (within hours) (0-59) | | | | |_SECOND_____|_Seconds_(within_minutes)_(0-59.999...)______________|

and

When it is necessary to add new most significant datetime fields, the asso- ciated value is effectively converted to the new precision in a manner obeying the natural rules for dates and times associated with the Gregorian calendar.

Of course this is just the standard SQL, Parquet does not have to follow this.

@michal-databricks, can you post the reference for 1 day = 24 hours = 86400 seconds? I'm not sure that what you've quoted here requires it. The second paragraph, "When it is necessary...", is talking about conversion and the first constrains the fields. That doesn't mean that "1 day" and "23 hours 59 minutes 59.999999 seconds" are comparable. It just means that 86400 seconds is an invalid interval.

@rdblue, this is the complete paragraph defining the comparison of day-time intervals where the quote was taken from.

Day-time intervals are mutually comparable only with other day- time intervals. If two day-time intervals have different interval precisions, they are, for the purpose of any operations between them, effectively converted to the same precision by appending new <datetime field>s to either the most significant end or the least significant end of one or both day-time intervals. New least sig- nificant <datetime field>s are assigned a value of 0. When it is necessary to add new most significant datetime fields, the asso- ciated value is effectively converted to the new precision in a manner obeying the natural rules for dates and times associated with the Gregorian calendar.

To my understanding, this means that, for example, comparing INTERVAL '100' HOUR with INTERVAL '4 04:00' DAY TO MINUTE requires conversion to the common super type, in this case INTERVAL DAY TO MINUTE. During conversion least significant fields are filled with zeros and most significant "in a manner obeying the natural rules for dates and times associated with the Gregorian calendar", since the non-most-significant fields must conform to the constraints in the table (comes from the same chapter). In this case INTERVAL '100' HOUR becomes INTERVAL '4 04:00' DAY TO MINUTE. Now they are compared field by field (making them equal).
This is the best reference from the SQL '92 standard I can give to argument 1 day = 24 hours. And while implementations of the standard vary they still tend to use the above method for comparisons.

rdblue · 2016-10-26T22:35:46Z

src/main/thrift/parquet.thrift

+   * the provided duration.  This duration of time is independent of any
+   * particular timezone or date.
+   */
+  INTERVAL_DAY_TIME = 23;


Is there an alternative to embedding two values in a single byte array? The problem is that this is incompatible with stats filtering and efficient integer encodings. I think those are significant enough that we should consider alternatives.

I like the idea of using this annotation for a group of two int32 values. That way we get good compression and encoding, stats filtering works, and we get can get independent dictionary encoding, which would work well for the days field. What are the downsides to that approach instead of the byte array?

the downside is the two values not being contiguous for processing.
We can also make it an Int64 with the most significant 32 bits for day and the rest for ms with the constraint that ms are < 1 day. That would make it sortable.

Given that a consumer would be unlikely to read/consume the two values separately, storing them separately seems like overkill. Since we already do the set of three values in the existing interval, it seems like storing two in a single 8 bytes would be okay and consistent.

My understanding is that using a single INT64 with milliseconds (1 day = 86400000 milliseconds) would be the most convenient solution. Besides better encoding and statistics it has the advantage of unique representation (does not allow for 1 day and 86400000 ms). Non unique representation would be especially confusing with negative intervals. Single integer storage would be also consistent with the existing storage format of TIMESTAMPs and the storage format for INTERVAL_YEAR_MONTH proposed here.
On that note, it would be better to have 2 types instead: INTERVAL_DAY_MILLIS and INTERVAL_DAY_MICROS just as there are TIMESTAMP_MILLIS and TIMESTAMP_MICROS

I agree with Jacques that this is lossy because a day isn't consistently 86400000 ms.

We need to store 2 values, but I'm not convinced that there is much value in storing these in a single byte array. What about users that only use days or only use micros? The proposed type would use 8 bytes per value either way. I don't think it is worth that cost when we can deserialize two columns to a single arrow column instead. Is there a micro-benchmark to back up the benefit of using this format instead of a 2-column format?

Using an int64 won't work because with 2 values that aren't constrained by one another, min and max stats are invalid (days always overrides micros). In fact, because days and micros can't be converted to one another, I'd argue that there is no consistent ordering for this interval type. So we would get compression for the days value but at the cost of more confusion about how to produce stats.

I am guessing that by day not equal 86400000 you mean the influence of leap seconds. I don't know any database system which would implement those, especially for intervals. Most just store one component for days and seconds (among them Spark) and while for example Postgres uses separate components for them, it still makes a day equal 24 hours for comparison purposes. But if you do decide to store milliseconds with intention of it being an independent dimension, 32 bits seem not enough: 2 ^ 32 / 86400000 = 49.71. This is only +/-25 "days" range in the signed case.

Oh, another reason you may mean for different day lengths is if they span different DST time zones (on the DST change dates). SQL does not provide a meaningful way to handle those as SQL time stamps only carry time zone offset (in minutes) and not the time zone in the traditional sense (location).

The case I was thinking about was DST. The length of a day may be 23 hours, 24 hours, or 25 hours. That makes 1 day incomparable with 24 hours, unless the SQL spec includes a way to compare them. I'm fine with not having a comparison for Parquet, even if the SQL spec defines one.

The other consideration is whether we should constrain hours, minutes, and seconds to less than 24 hours. The other comment from the SQL spec appears to support that, but constraining the microseconds field to < 86,400,000 will require conversion logic somewhere in order to support arithmetic on interval values: 0 days, 22 hours + 0 days 4 hours = ???. Parquet could just require the microseconds field to be less than 86,400,000, but that shifts the problem to processing engines. And either way, it doesn't support the idea of storing the entire interval using microseconds.

@rdblue, there are basically two approaches:

Treat days and seconds as two dimensions. In this case seconds cannot be constrained by 86,400 (otherwise you cannot express 25 hour days on a DST change). Addition and multiplication by a scalar works per dimension. 0 days, 22 hours + 0 days, 4 hours = 0 days, 26 hours.

Treat one day as 24 hours making this one dimension: a scalar of seconds (which I consider the SQL way, but it seems you are not convinced that this is the SQL standard semantic). 0 days 22 hours + 0 days 4 hour = 1 day 2 hours. In this case the storage can be days + 86400000 + ms to make the ordering and arithmetic easier. While it could be still stored as 2 values, it would complicate the implementation and hamper the encoding. Which was my original point.

Going forward with option 1) is fine (as long as it is clearly documented), but then I don't see the advantage over the existing interval type with this is deprecating.

piyushnarang · 2016-10-31T18:15:56Z

LogicalTypes.md


+#### INTERVAL
+`INTERVAL` is deprecated. Please use the more precise `INTERVAL_YEAR_MONTH` and
+`INTERVAL_DAY_TIME` per SQL spec.


can we link to the SQL spec here?

http://webstore.ansi.org/RecordDetail.aspx?sku=ISO%2fIEC+9075-1%3a2011

Sorry I meant if we could add a link to the sql spec in the md file, it might be useful for folks reading in the future.

Sorry, I should have added a comment. The SQL spec is not free. This is a paid link.
There are drafts hosted here and there. We can use those.

closes apache#43 and closes apache#50 This PR also implements 1) PARQUET-532: Null values detection needs to be fixed and tested 2) PARQUET-502: Scanner segfaults when its batch size is smaller than the number of rows 3) PARQUET-526: Add more complete unit test coverage for column Scanner implementations 4) PARQUET-531: Can't read past first page in a column Author: Deepak Majeti <deepak.majeti@hp.com> Closes apache#62 from majetideepak/PARQUET-538 and squashes the following commits: 1e56f83 [Deepak Majeti] Trigger notification 6478a7c [Deepak Majeti] TYPED_TEST 1d14171 [Deepak Majeti] Added Boolean Test and Scanner:Next API d1da031 [Deepak Majeti] lint issue 45f10aa [Deepak Majeti] Reproducer for PARQUET-502 88e27c6 [Deepak Majeti] formatting 8aac435 [Deepak Majeti] PARQUET-526 dca7e2d [Deepak Majeti] PARQUET-532 and PARQUET-502 Fix a622021 [Deepak Majeti] Reverted PARQUET-524 and addressed comments 859c1df [Deepak Majeti] minor comment edits d938a13 [Deepak Majeti] PARQUET-538 df1fbd7 [Deepak Majeti] Templated single page tests 8548e3c [Deepak Majeti] PARQUET-524 c265fea [Deepak Majeti] fixed PARQUET-499 bugs

michal-databricks · 2017-08-09T09:25:47Z

src/main/thrift/parquet.thrift

@@ -162,19 +162,48 @@ enum ConvertedType {
  BSON = 20;

  /**
+   * @Deprecated: use INTERVAL_YEAR_MONTH or INTERVAL_DAY_TIME
+   * since the SQL standard defines either YEAR_MONTH or DAY_TIME unit.
+   * This is deprecated in favor of those 2 types


SQL 92 Spec:

4.5.2 Intervals There are two classes of intervals. One class, called year-month intervals, has an express or implied datetime precision that in- cludes no fields other than YEAR and MONTH, though not both are required. The other class, called day-time intervals, has an ex- press or implied interval precision that can include any fields other than YEAR or MONTH.

Further:

Operations involving items of type interval require that the interval items be mutually compara- ble. ... Year-month intervals are mutually comparable only with other year- month intervals ... Day-time intervals are mutually comparable only with other day- time intervals.

michal-databricks · 2017-08-09T09:32:55Z

src/main/thrift/parquet.thrift

+   * An interval of time with a year-month unit.
+   *
+   * This type annotates data stored as an INT32
+   * This data is stored as a little endian unsigned


This should be signed. SQL standard does not mention this explicitly, but defines interval - interval operation, which would be problematic without negative intervals. Also all SQL implementations I am aware of, allow negative intervals.

michal-databricks · 2017-08-09T09:43:58Z

src/main/thrift/parquet.thrift

+   * the provided duration.  This duration of time is independent of any
+   * particular timezone or date.
+   */
+  INTERVAL_DAY_TIME = 23;


My understanding is that using a single INT64 with milliseconds (1 day = 86400000 milliseconds) would be the most convenient solution. Besides better encoding and statistics it has the advantage of unique representation (does not allow for 1 day and 86400000 ms). Non unique representation would be especially confusing with negative intervals. Single integer storage would be also consistent with the existing storage format of TIMESTAMPs and the storage format for INTERVAL_YEAR_MONTH proposed here.
On that note, it would be better to have 2 types instead: INTERVAL_DAY_MILLIS and INTERVAL_DAY_MICROS just as there are TIMESTAMP_MILLIS and TIMESTAMP_MICROS

jacques-n · 2017-08-12T16:39:28Z

Ugh, we didn't realize that this hadn't been merged. We've been using the representations here with multiple customers for the last six months using the annotations and formats above so the reality is that there are files with the representation described in the PR in the wild at this point.

I'm happy to help get this over the finish line but we should wrap up discussion.

Michal is recommending a collapsed format for day. However, I think that is a bit surprising to users since it is lossy. A user may want to express 28 hours. The collapsed format would replace this with 1 day, 4 hours so I'm inclined to stick with the separate, non-lossy representation. It is a trivial calculating if you want to canonicalize for internal processing but the storage should be lossless.

@rdblue and @julienledem, what are you thoughts to the previous points. It seems like there was some discussion of a group but Julien and myself preferred a consolidated approach since we don't see any scenario where you'd only be interacting with one of the two items in the group. Additionally, in general, since Intervals are such a small part of every dataset I've seen in the real world, I'm not inclined to spend a lot of engineering work trying to optimize them too extensively. That's probably just laziness talking... but I like to think of it as pragmatism :)

michal-databricks · 2017-08-14T11:45:37Z

My suggestion to use the collapsed format was based on the assumption that the point of moving away from the old Interval format was to make it closer to the SQL standard definition of the interval data type. However the classic SQL implementations are usually far from implementing the standard in the first place. MySQL and MS SQL Server do not have interval as a storable data type at all. PostgreSQL allows literals and declaration as standard (with an exception of explicit precision, like interval hour to second(3)), but internally it is all one type holding 3 components (months, days and seconds; like Parquet now, but probably as floating point numbers). Even then it equates 1 day with 24 hours (and surprisingly also 1 month with 30 days) for comparison and order, and it is inconsistent with conversion between those fields:

> select interval '2' day
0 years 0 mons 2 days 0 hours 0 mins 0.00 secs
> select interval '2' day / 4 * 4
0 years 0 mons 0 days 48 hours 0 mins 0.00 secs
> select interval '1' month / 4 * 4
0 years 0 mons 28 days 48 hours 0 mins 0.00 secs

From the common SQL implementations Oracle seem to be the closest to the standard.
The bottom line is that with the current state of the SQL implementations it is hard to guess what is the user expectation when it comes to the semantics of the interval types.

Either way, I think, the comparison semantics should be well defined within Paruqet format as this influences filter pushdown features like min max statistics and hashing. Of course if there is no linear order on interval then min max cannot be even defined.
This issue is important because it may affect correctness of the systems interfacing Paruqet, even if efficiency is not the key for this data type.

wesm · 2017-08-15T04:00:04Z

Whatever decision is reached here, could we incorporate this into the related Arrow patch apache/arrow#920 so that we can round-trip to Arrow memory without any data loss?

jacques-n · 2017-08-17T18:10:28Z

@julianhyde, would love your thoughts on this as you know more than anybody on these things.

Oracle:
SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '1' HOUR FROM DUAL
2017-03-12 01:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '2' HOUR FROM DUAL
2017-03-12 03:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '3' HOUR FROM DUAL
2017-03-12 04:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '24' HOUR FROM DUAL
2017-03-13 01:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '1' DAY FROM DUAL
2017-03-13 01:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '86400' SECOND(2,4) FROM
dual
2017-03-13 01:00:00

Postgres:
SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '1' HOUR
2017-03-12 01:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '2' HOUR
2017-03-12 03:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '3' HOUR
2017-03-12 03:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '24' HOUR
2017-03-13 00:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '1' day
2017-03-13 00:00:00

SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '86400' SECOND
2017-03-13 00:00:00

So people don't agree on what adding intervals across time skips mean. However, they each agree internally that skipping is the same whether we're using 1 day or 86400seconds. This means that interval should be comparable.

On the flipside, Postgres maintains a distinction between 25 hours and 1 day, 1 hour whereas Oracle does not

Oracle:
SELECT INTERVAL '25' HOUR FROM dual
1 1:0:0.0

SELECT INTERVAL '1 1' DAY TO HOUR FROM dual
1 1:0:0.0

Postges:
select INTERVAL '25' hour
0 years 0 mons 0 days 25 hours 0 mins 0.00 secs

select INTERVAL '1 1' day to hour
0 years 0 mons 1 days 1 hours 0 mins 0.00 secs

julianhyde · 2017-08-18T08:46:33Z

These examples are complicated. I don't recall whether the Oracle ones will use TIMESTAMP WITH TIME ZONE or TIMESTAMP WITH LOCAL TIME ZONE literals. And Postgres will use TIMESTAMP WITH TIME ZONE, but it's semantics are non-standard.

Also, I don't know whether PST and US/Pacific are the same, and whether either of them time-shifts.

I'm not an expert on TIMESTAMP WITH TIME ZONE. Evidently its arithmetic is more complicated than TIMESTAMP, so someone should consult the sql standard.

Could you punt on TIMESTAMP WITH TIME ZONE for now?

Regarding your second example. I'm pretty sure both systems will give the two literals different types - interval hour and interval day to hour. So they are distinct in both systems, regardless of how they are printed. Distinct in the same sense that integer 2 and double 2 are distinct. They have different types, may or may not be printed as the same string, but yield the same value if converted to each other's type.

IIRC calcite would print '1 1' and '25', but store 25 * 60 * 60 * 1000 for both. If arrow is being frugal it should store 25 for both.

michal-databricks · 2017-08-18T09:24:22Z

@jacques-n, Postgres does actually make use of the difference between hours and days in some cases:

SET TIME ZONE 'US/Pacific';

SELECT TIMESTAMP WITH TIME ZONE '2017-03-12 0:00:00 PST' + INTERVAL '24' HOUR;
2017-03-13T08:00:00Z

SELECT TIMESTAMP WITH TIME ZONE '2017-03-12 0:00:00 PST' + INTERVAL '1' DAY;
2017-03-13T07:00:00Z

However the docs clearly state this is not standard SQL: "(...) the SQL standard has an odd mix of date and time types and capabilities. Two obvious problems are: (...) The default time zone is specified as a constant numeric offset from UTC. It is therefore impossible to adapt to daylight-saving time when doing date/time arithmetic across DST boundaries. (...) To address these difficulties, (...) PostgreSQL assumes your local time zone for any type containing only date or time." (From 8.5.3. Time Zones).
The literal syntax it uses is also non standard. WITH TIME ZONE is not necessary nor allowed, instead the result type should be derived from the literal string:

         17)The data type of a <timestamp literal> that does not specify
            <time zone interval> is TIMESTAMP(P), where P is the number of
            digits in <seconds fraction>, if specified, and 0 otherwise.
            The data type of a <timestamp literal> that specifies <time zone
            interval> is TIMESTAMP(P) WITH TIME ZONE, where P is the number
            of digits in <seconds fraction>, if specified, and 0 otherwise.

Without non-standard time zone implementation the distinction between days and 24 hour wouldn't be very useful. Also while for Postgres DATETIME + INTERVAL semantic is as above, for DATETIME - DATETIME the result assumes 24h = 1 day which leads to confusing results like:
timestamp1 + (timestamp2 - timestamp1) <> timestamp2
The assumption is true for comparisons as well, BTW:

SELECT INTERVAL '24:01' HOUR TO MINUTE > INTERVAL '1' DAY;
true

SELECT INTERVAL '24' HOUR = INTERVAL '1' DAY;
true

Either way the current proposal in the PR cannot represent Postrgres intervals because of the small range for seconds and potential mixing of months and seconds:

select interval '20' hour * 10e7 + interval '1' month;
0 years 1 mons 0 days 2000000000 hours 0 mins 0.00 secs

@julianhyde, It is true that INTERVAL '1' DAY and INTERVAL '24' HOUR are by the standard of distinct types even when equal. But this can be typically handled on the matadata level.

From Parquet point of view the most flexible way would be to have parametrized interval type (like decimal) with least significant field/seconds scale. So one of: DAY, HOUR, MINUTE, SECOND, SECOND(1), ... SECOND(9). As this may be excessive implementation wise for such a minor type, a middle ground would be to have at least DAY, MILLISECOND (SECOND(3)), and MICROSECOND(SECOND(6)) scales. This is made even easier with the new logical types PR.

rdblue · 2017-08-18T16:06:46Z

There's a lot of behavior to consider for processing engines here, but the conclusion I think we can draw from all of this information for Parquet is that engines may have semantics where 24 hours and 1 day are not equal, so Parquet should not force engines to consider them equal by storing a single value.

That means what we're back where we started: we have two values to store for these types and need to choose a good way to store them. I don't see a down-side to separate columns, since we can easily construct a single column in Arrow from them when deserializing.

julianhyde · 2017-08-18T16:17:10Z

So, Postgres diverges from the SQL standard and allows time zones that don't have fixed offsets from UTC. This seems useful, but as Jacques has noted, it leads to madness. So, let's stick to the standard.

And, if I read you correctly, Postgres allows intervals with a mixture of hours and months. Again, madness.

At a minimum, Arrow needs 2 interval types: INTERVAL MONTH and INTERVAL SECOND. Optionally, we could also have finer-grained types. For instance, INTERVAL HOUR TO SECOND would be compatible with INTERVAL SECOND but would use a format string of 'HH:MM:SS'.

I don't feel strongly whether Arrow has 2 interval types or goes for the full list in standard SQL: YEAR, YEAR_TO_MONTH, MONTH, DAY, DAY_TO_HOUR, DAY_TO_MINUTE, DAY_TO_SECOND, HOUR, HOUR_TO_MINUTE, HOUR_TO_SECOND, MINUTE, MINUTE_TO_SECOND, SECOND. Postgres even has a few more: QUARTER, MILLENNIUM, MILLISECOND etc.

I think a reasonable level of support in Arrow is precision and scale, e.g. DAY(5) TO SECOND(6). implies that the value holds microsecond values between -8,640,000,000,000,000 and 8,640,000,000,000,000 (the number of microseconds in 10,000 days). Thus behind the scenes it looks like a fixed decimal type.

wesm · 2017-08-28T21:32:29Z

@julianhyde my thinking in my patch https://github.com/apache/arrow/pull/920/files was to add a time resolution to the DAY_TIME interval type to address resolutions from seconds down to nanoseconds. This matches up with the timedelta64[UNIT] type found in some other systems. I wasn't sure whether this would be adequate (4 different 64-bit interval types, plus a YEAR_MONTH type) -- it's appealing from a time arithmetic point of view (because differences of timestamps yield these interval values quite gracefully). Do we need a 16-byte interval type or is 8 bytes enough?

julianhyde · 2017-08-28T22:24:48Z

Do whatever you have for timestamp. If you only have an 8 byte timestamp type, an 8 byte interval will be enough. But if you have 16 byte timestamp, it will generate 16 byte values if you subtract them.

wesm · 2017-08-28T23:28:28Z

Understood. Should it become necessary in the future, we can augment the metadata to accommodate additional precision. I think 8 bytes will be adequate for the vast majority of use cases for Arrow

nevi-me · 2021-01-23T11:11:06Z

Hi everyone,

I've opened #165 to implement a LogicalType for the Interval. From looking at the conversations on this thread, it looks like the Arrow-related concerns have either been addressed, or might not be issues as we have stabilised on an Interval type in Arrow.

Pleas let me know if we're missing any details, or concerns to still address.
I'm keen to get Arrow <> Parquet support for intervals, I'm working on the Parquet Rust implementation over at the Arrow repo.

Thanks

PARQUET-675: Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types

e39aa7b

rdblue reviewed Oct 26, 2016

View reviewed changes

add LogicalTypes description

ddc9c47

piyushnarang reviewed Oct 31, 2016

View reviewed changes

michal-databricks reviewed Aug 9, 2017

View reviewed changes

nevi-me mentioned this pull request Jan 23, 2021

PARQUET-675: Specify Interval LogicalType #165

Closed

nwt mentioned this pull request Apr 26, 2023

Use pqarrow in zio/parquetio brimdata/super#4547

Merged

tustvold mentioned this pull request Aug 1, 2023

[Parquet][C++] Support month_day_nano_interval type in Parquet apache/arrow#36799

Open

asfimport mentioned this pull request Jun 23, 2024

Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types #313

Open

PARQUET-675: Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types #43

Are you sure you want to change the base?

PARQUET-675: Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types #43

Conversation

julienledem commented Oct 26, 2016

rdblue left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacques-n Oct 28, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michal-databricks Aug 14, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michal-databricks Aug 9, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacques-n commented Aug 12, 2017

michal-databricks commented Aug 14, 2017

wesm commented Aug 15, 2017

jacques-n commented Aug 17, 2017

julianhyde commented Aug 18, 2017

michal-databricks commented Aug 18, 2017 • edited Loading

rdblue commented Aug 18, 2017

julianhyde commented Aug 18, 2017

wesm commented Aug 28, 2017

julianhyde commented Aug 28, 2017

wesm commented Aug 28, 2017

nevi-me commented Jan 23, 2021

jacques-n Oct 28, 2016 •

edited

Loading

michal-databricks Aug 14, 2017 •

edited

Loading

michal-databricks Aug 9, 2017 •

edited

Loading

michal-databricks commented Aug 18, 2017 •

edited

Loading