-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2249: Add nan_count to handle NaNs in statistics #196
base: master
Are you sure you want to change the base?
Conversation
This commit proposes an improvement for handling of NaN values in FLOAT and DOUBLE type columns. The goal is to allow reading engines, regardless of how they order NaN w.r.t. other values, to efficiently use statistics for scan pruning while NaN values are present, which currently is impossible in most cases. This is to be accomplished in a fully backward compatible manner, so that existing readers and writers do not need to be updated immediatly but can migrate over time to make use of the improved semantics. There was already [work on improving the handling of float and double columns](apache#185) which laid good ground work for backward compatible improvements, but it wasn't sufficient to fix all the problems with NaN values, which are laid out hereinafter. Problem Description =================== Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely the following problems exist: Statistics don't tell whether NaNs are present ---------------------------------------------- As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might seem to be not too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview over what different systems are doing can be found [here](apache/arrow-rs#264 (comment)). The gist of it is that different systems with different semantics exist w.r.t. NaNs and most of the systems do order NaNs; either less than or greater than all other values. For example, if the semantics of the reading query engine mandate that NaN is to be treated greater than all other values, the predicate `x > 1.0` *should* include NaN values. If a page has `max = 0.0` now, the engine would *not* be able to skip the page, as the page might contain NaNs which would need to be included in the query result. Likewise, the predicate `x < 1.0` should include NaN if NaN is treated to be less than all other values by the reading engine. Again, a page with `min = 2.0` couldn't be skipped in this case by the reader. Thus, even if a user doesn't query for NaN explicitly, they might use other predictes that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem on first sight. Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. Conversely, it would be nice if Parquet would enable scan pruning in these cases, regardless of whether the reader and writer agree upon whether NaN is smaller, greater, or incomparible to all other values. Note that the problem exist especially if the Parquet file *doesn't* include any NaNs, so this is not only a problem in the case where NaNs are present; it is a problem for the way more common case of NaNs not being present. Handling NaNs in a ColumnIndex ------------------------------ There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so we *have to* put something in here. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent the emission for a column index for that column chunk, so even pages in that chunk that don't contain NaNs will not be indexed. It would be nice if there was a defined way of writing the `ColumnIndex` when NaNs (and especially only-NaN pages) are present. Handling only-NaN pages & column chunks --------------------------------------- Note: Hereinafter, whenever the term *only-NaN* is used, it refers to a page column chunk, whose only non-null values are NaNs. E.g., an only-NaN page is allowed to have a mixture of null values and NaNs or only NaNs, but no non-NaN non-null values. The `Statistics` objects stored in page headers and in the file footer have a similar, albeit smaller problem: `min_value` and `max_value` are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: Simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine did decide to omit them for whatever other reason, which is allowed by the spec as the field is optional. Consequently, a reader cannot know whether missing `min_value` and `max_value` means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains". It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages. Proposed solution ================= Adding NaN counts ----------------- The proposed solution for handling NaNs in statistics is akin to what [Apache Iceberg](https://iceberg.apache.org/spec/) does: add an *optional* `nan_count` field to `Statistics` and an *optional* `nan_counts` list to `ColumnIndex`. This way, all places where statistics are being retained can specify whether NaN values are present. This already solves the first problem, as now queries wanting to retain NaNs can check whether the count is > 0 to see whether a page or column chunk contains NaNs. Handling NaN-only pages & column chunks --------------------------------------- Adding `nan_count`/`nan_counts` fields does not solve the problem of only-NaN pages, yet. But since we have a new optional field in both the `Statistics` object and the `ColumnIndex` object, we can tie a stricter semantics to the existence of this field. I.e., we can mandate that writers who write this optional field have to treat NaNs in a specific way. We basically have two options for treating only-NaN pages or column chunks: * Order the writer to write NaN as min/max in this case. * Order the writer to write nothing, i.e., * omit the `min_value` / `max_value` in `Statistics` * write byte[0] in the min_values/max_values entry of the `ColumnIndex` I propose to go with the first option of writing NaN as min/max in case of only-Nan pages & column chunks. A section depicting the decision process for this follows below. Thus, to solve the problem of only-NaN pages, the comments in the spec are extended to mandate the following behavior: * Once a writer writes the `nan_count`/`nan_counts` fields, they have to: a) never write NaN into min/max if there are non-NaN non-Null values and b) always write min=max=NaN if the only non-null values in a page are NaN * A reader observing that `nan_count`/`nan_counts` field was written can then rely on that if min or max are Nan, then both have to be NaN and this means that the only non-NULL values are NaN. Should we write NaN or nothing for only-NaN pages & column chunks? ------------------------------------------------------------------ Here are the cons of each approach and how to mitigate them: CONs for writing NaN in this case: * Writing NaN breaks with the "don't write NaN into min and max bounds" rule. * However, one could argue that breaking the rule in this edge case is okay, as since if NaN is the only value in a page, then it doesn't matter where to sort NaN w.r.t. other values, as there are no other values. It is the only value in the page, so it is the min and max of the page * Breaking this rule has no consequences on existing readers, as they should ignore NaN anyway, i.e., treat it as if it wasn't written, so legacy readers should treat both cases the same anyway. * There might be existing writers that have written NaN for min & max for pages that do not only contain NaN but also other values, so a reader couldn't rely on min=max=NaN to mean that the only non-null value is NaN * However, as specified, we can enforce a stricter semantics once the `nan_count` field is written: We could define that once a writing engine writes this field, it has to a) never write NaN into min/max if there are non-NaN non-Null values and b) always write min=max=NaN if the only non-null values in a page are NaN. Then, readers could rely on the semantics once they observe that the `nan_count` field was written. * NaNs take more space than not writing the field or writing byte[0] in the column index. This space overhead should usually be negligible. In conclusion, there is no big con for writing NaN. It can be implemented in a fully backward compatible way that still allows future writers and readers to apply a more strict semantics. CONs for writing nothing in this case: * Writing byte[0] to a ColumnIndex might break older readers who expect the `min_values`/`max_values` field to be a value with correct length unless `null_pages` is true for the entry. * Although readers should be lenient enough and handle wrongly sized min/max values gracefully by ignoring them we cannot be sure this is the case for any reader. Thus, we might legacy spec-conforming readers to reject the new Parquet file, which is bad. * Omit the `min_value` / `max_value` in `Statistics` is suboptimal, as it first looks as if the writing engine has decided to not write them for whatever reason. In this case, the page could never be pruned by a reader, as the reader couldn't know which values are in there. Yes, we could define that a writer may not omit min/max if they write `null_count` and must only omit them if a page has only NaNs, but this seems to be quite fishy semancially. In conclusion, the cons for the NaN approach have mitigations and can be handled in a backward compatible way, while the cons for writing nothing might be non-backward-compatible. Therefore, I propose to write NaN as min/max for only-nan pages & column chunks. Considerations ============== Backward compatibility ---------------------- The suggested change is fully backward compatible both in the read and write direction: Older readers not supporting `nan_count`/`nan_counts` yet can stay as is. As the fields are optional, readers not supporting them will simply ignore them. The spec already today mandates that if a reader sees `NaN` in min or max fields they should ignore it. They can continue doing so. No older reader will have regressed performance; any page that an older reader would have skipped before can still be skipped. Older writers can continue writing files without `nan_count`/`nan_counts` and `nans_first`. Even if an old reader sets min=max=NaN for a page that does contain non-NaN values, readers supporting this new semantics will not misinterpret these bounds, as the writer will not write `nan_count`/`nan_counts`, so the new more strict semantics does not apply when reading. As `nan_count`/`nan_counts` are local to the scopes where they apply (column index, page, column chunk), even stiching together row groups from a writer that didn't write them and a writer that does write them works. This would result in a file where some pages / column indexes / column chunks would have them set while others wouldn't. Versioning ---------- This proposal definitly does not require a *major* version bump, as it is fully backward compatible. I do not fully understand the versioning policy of parquet, so I don't know whether this change would require a minor version bump. One could argue that it is not necessary as the mere existence of the `nan_count`/`nan_counts` field would be the "feature flag" that would indicate whether a writer supported this change or not. There wouldn't be a version check necessary in a reader; they would just need to check for the existence of the `nan_count`/`nan_counts` fields. No Unnecessary Overhead ----------------------- As thrift encodes missing optional fields with zero bytes in the compact protocol, non-FLOAT/DOUBLE columns will not incur any overhead for the new optional fields. Design alternatives and why they were not chosen ================================================ Ordering NaN before or after all other values --------------------------------------------- We could simply define NaN to be smaller or greater than all other values and then allow NaN in the respective bound. This however has many drawbacks: * NaN is the widest bound possible, so adding NaNs to min and max isn't very useful, as it makes pruning for non-NaN values almost impossible in the respective direction. * As mentioned, not all systems agree on whether NaN is larger or smaller than all other values. If we decided for one, systems that choose the other semantics couldn't filter effectively. * This contradicts the current spec of not writing NaN to min/max, so it would make older readers no longer skip pages they could skip before. Adding a new ColumnOrder ------------------------ We could add a new ColumnOrder that specifies NaN to be smaller or greater than all other values, or even supports -NaN and +NaN ordering them as smaller and larger than all values, respectively. For example, Iceberg mandates the following sort order: -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN Once we define such an order, we could again allow NaN (and potentially -NaN) in bounds again. This however has the following drawbacks: * As with the previous alternative: NaN is the widest bound possible, so adding NaNs to min and max isn't very useful, as it makes pruning for non-NaN values almost impossible in the respective direction. If we even allow -NaN and +NaN, a page containing both would have no meaningful min and max and wouldn't allow any pruning at all. * Most systems don't support -NaN, as mathematically speaking, it is nonsense. The only reason why it exists is that the physical reprsentation of floats has a sign bit that also exists for NaN representations. * The fact that NaNs being so unuseful for min/max bounds is displayed by the fact that even though Iceberg has such a well defined sort order, they still do what is proposed in this proposal and *do not* include -NaN/NaN into min/max bounds and rather track them through nan_counts. All in all, any alternative putting NaN into min/max bounds (except for only-NaN-pages) suffers from the problem that NaN bounds are too wide and therefore not useful for pruning. Writing a second `value_counts` list in the ColumnIndex ------------------------------------------------------- The column index does allow `byte[0]` values already, in case a page contains only nulls. We could enable the same for only-NaN pages by not storing only the `nan_counts`, but also the `value_counts` of a page. Then, one could check whether a page in the column index contains only NaNs by checking `nan_count + nulls_count = value_count`. Hoewever, this would mean yet another list in the column index, making the column index bigger and more expensive to deserialize. And while the `nan_counts` list only exists for FLOAT/DOUBLE columns, the `value_counts` list would exist for all columns and therefore take up considerably more space. Also, this would not be backward compatible, as an older reader wouldn't know of the new lists, so it would see a `byte[0]` and would need to treat it as invalid. All in all, the extra list doesn't seem to add enough value for its cost and reduced backward compatibility. Do nothing ---------- As long as we don't do anything, columns containing NaNs will almost be useless for scan pruning. The problems outlined will persist, making double columns almost unprunable for some predicates. That is not satisfactory. Why wasn't sort order tackled? ============================== Even with this improvement which fixes the semantics of NaN values in statistics, NaN values are still a problem in some other places as there is still not a defined order for them, so the `boundary_order` in a column index and the `SortingColumn` would still have undefined placement for `NaN`. This mainly wasn't tackled for two reasons: * It is an orthogonal issue. This improvement is about enabling NaNs in statistics, so after this change all statistics can handle NaN in a well-defined way. Sort odering of columns to leverage `boundary_order` or `SortingColumn` needs to be solved in a different way anyway, as the mere information about whether (only or some) NaNs are present isn't enough here but it needs to be defined whether they come before or after all values. * Even though we could fix both statistics and sort order by just defining NaN to be smaller or greater than other values, doing so for statistics is *not* a good idea, as having NaN in bounds makes too wide bounds that aren't helpful for filtering. * If sort ordering will be fixed by a different commit one day, the design of this commit shouldn't have a (negative) influence on that future design, as NaN counts and not including NaNs into bounds is a good thing to do anyway. * While fixing statistics with NaN counts is pretty uncontested w.r.t. design alternatives (see the respective section for a discussion why), the design to be chosen for sort orders isn't that clear: * We could define a new `ColumnOrder` with well defined NaN ordering. This would be the cleanest fix, but also a "very big gun", as this would be the first non-default column order in existence. * We could define a `nans_first` fields which tells whether NaNs are to be sorted before or after other values, akin to the already existing field `nulls_first`. This would be a more micro-invasive change, but it would be less clean IMHO, as there is a tool for defining column ordering--the `ColumnOrder`--and not using that tool but working around it feels hacky. Thus, sort ordering of NaNs wasn't tackled in this commit. They can be tackled by a follow-up change if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can decide nan by min-max. Can we just decide it by null_count + nan_count == num_values
?
README.md
Outdated
* If the max is -0, the row group may contain +0 values as well. | ||
* When looking for NaN values, min and max should be ignored. | ||
* The following compatibility rules should be applied when reading statistics: | ||
* If the nan_count field is set to > 0 and both min and max are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it's a little strict here? Just ingore min-max seems ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU To your general comment (I can't answer there)
The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can decide nan by min-max. Can we just decide it by
null_count + nan_count == num_values
?
The problem is that the ColumnIndex does not have the num_values
field, so using this computation to derive whether there are only NaNs would only be applicable to Statistics, not to the column index. Of course, we could do what I suggested in alternatives and give the column index a num_values
list. Then this would indeed work everywhere but at the cost of an additional list.
So I see we have the following options:
- Do what I did here, i.e., use min/max to determine whether there are only NaNs
- Add a
num_values
list to the ColumnIndex - Accept the fact that the column index cannot detect only-NaN pages (might lead to fishy semantics)
- Tell readers to use the
min==max==NaN
reasoning only in the column index, and use thenull_count + nan_count == num_values
for the statistics.
Which one would you suggest here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To this suggestion:
Seems it's a little strict here? Just ingore min-max seems ok?
Note that the line you mentioned here just tells a reader that they can rely on this information, and therfore could, e.g., skip this page if a predicate like x = 12.34
was used. They can of course also opt to ignore this information and not skip but rather scan the page. If we removed this, a reader couldn't do the skip here.
I guess this is related to your general suggestion: How do we detect only-NaN pages? Depending on what we do for that, this line will be adapted accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH: I would actually love to have a num_values
list in the column index. We have the same in the statistics, Iceberg does the same, and not needing min=max=NaN for only-NaN checking would actually be much more elegant IMHO.
I just didn't want to suggest adding another list to each column index for the added space cost. However, given that these indexes are negligibly small in comparison to the data, I think actually no one would mind that extra space. If the consensus is that this is preferrable, I'm happy to adapt the commit to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it, I think using both min-max is backward-capatible and can represent "all-data-is-nan". https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L944 can we import a status like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU Yes, we could also add a nan_pages
bool list in the column index. That would work as well.
My gut feeling is that one day having a value_counts
count would be more useful than boolean lists. We already have null_pages
and null_counts
and we would then also have nan_pages
and nan_counts
, both null_pages
and nan_pages
would be obsolete if there were value_counts
. Yes, storing one integer (value_counts
) is likely more space than storing two booleans (null_pages
& nan_pages
), but knowing the number of values in a page could also be helpful for other pruposes.
But yes, we could drop the testing of min=max=NaN
if we had a nan_pages
list in the column index.
Note though that if we then also drop that we write NaN
into the min/max here and rather write nothing, then we would be strictly speaking not backward compatible, as legacy readers might assume that any min/max for a page in the column index that has null_pages == false
has a valid bound value, which would no longer be the case for an only-NaN page. I'm not sure whether any reader does this or whether that's just a theoretical problem though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe you are right. My point is that, if we write nan_count or even record count, the program would works well. However, non-float point page would have some size-overhead. Personally, I'd like to use list<bool>
, because it's easy to implement, and also lightweight. And we can hear others idea.
Instead of writing min and max as NaN when there are only NaN values and then let the reader to check whether min max NaN are credible by evaluating whether naNCounts is empty, wouldn't it be much simpler if we just left the evaluation of isNaN and notNaN to the reader? |
@zhongyujiang (as I can't answer your comment directly). Here is the problem with your suggestion of checking
|
@JFinis Thanks for your reply, just realized that the page value count is stored in the page header, not in the column index. I overlooked your comments above before asked the question, sorry for that. |
The gist of all opened issues is the question how to encode pages/column chunks that contain only NaNs. This is actually only an issue for the I see three alternatives to handle the problem in the
I'm fine with either of these, so I would like us to reach a consensus for one of the alternatives here; then I can update my PR to reflect the decision. As this is my first contribution to parquet, I don't know the decision processes here. Do we vote? Is there a single or group of decision makers? Please let me know how to come to a conclusion here. As a help for the decision: Here are again the PROs and CONs of the three alternativs:
* Explanation of "in theory not 100% backward compatible": Today, min and max in a column index have to have a valid value unless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal is very exhaustive. Thanks for the effort! The Parquet community is strict to format change and maintainers are busy these days. It takes time to review proposals like this.
Personally speaking, apart from adding a nan_count
to the statistics, I would go with the option 3: adding a nan_pages
bool list to the column index. I am not in favor of writing any NaN to min/max bounds.
Thank you, @JFinis, for working on this. This is not an easy topic. |
@JFinis Do you have a plan to revive this? |
I finally have time to continue on this. Sorry for the long wait. As @gszadovszky has highlighted, we have to store a valid double/float value into the min/max bounds of the column index to be compatible with legacy readers. So the initial proposal to write NaN into min/max in this case would actually work. But so far not everyone was happy with using these NaNs in readers to see whether we have an only-nan page. Therefore, the suggestion was to also add For all other statistics, we would never (as before) write NaNs to bounds and instead now advice readers to use I have not updated the PR description yet to reflect this new design; only the files themselves have been updated. @wgtmac @mapleFU @gszadovszky Please review and let me know if you agree with this design. Then I will update the PR description accordingly. |
This commit updates the proposal based on the suggestions in the PR. The biggest change is that readers are no longer expected to check min == max == NaN to find only-NaN pages. Instead, they should check nan_count + null_count == num_values in pages and nan_pages[x] == true in the column index. This way, we no longer rely on NaN comparison rules and readers can and should continue to ignore NaN values they find in bounds. However, as was pointed out, we cannot write "no value" into min and max bounds in the column index, as this would not be compatible with legacy readers. Instead, writers must write something here. Therefore, they are now suggested to write NaN here, but readers will still ignore this and instead must rely on the new nan_pages field. In addition, two further suggestions were implemented: * Removed the duplicate explanation from README.md. It now only points to parquet.thrift as the source of truth. * Softened the wording from "nan_count fields should always be set" to "it is suggested to always set the nan_coutn fields".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
* who do not consider nan_pages yet are still able to use the column index | ||
* but are not able to skip only-NaN pages. | ||
*/ | ||
6: optional list<bool> nan_pages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? We already know:
- the NaN count for each page (in
nan_counts
) - the null count for each page (in
null_counts
) - the number of rows for each page (from the OffsetIndex)
It seems this might be enough to infer whether a page is all-NaN (except perhaps if there are repetition levels?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, if we do need an additional list (because of repeated columns?), it might be more worthwhile to add an optional list<i64> value_counts
instead, as it would then benefit all column types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think optional list<i64> value_counts
is more common, but I think null already has null_counts
, and value_counts
might consume more bytes for every leaf column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As nan_counts
will be set only after this proposal, could we simply deduce a NaN page by checking null_pages[i] == false && nan_counts[i] > 0 && min_values[i] == NaN && max_values[i] == NaN
? If that is true, we can safely remove definition of nan_pages
list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, number of rows in the offset index isn't enough due to repeated values.
Apart from this, the suggestions seem to turn a bit in circles now. Note that all suggestions in this thread were already mentioned in my earlier post where I depicted our possible options for the column index.
@pitrou what you mentioned was my Option 2. I personally would prefer this as it feels like a useful thing to have anyway. Having said that, others pointed rightfully out that it would cost a few bytes even for non float columns. The value might be valuable for other tasks as well. For example, it could be used to quickly check how many nested values are in a page. By having these values one could sum up the nested values per column chunk by adding up all the value counts. This is currently a value that cannot be optained at all through statistics; instead one has to decode pages and count. For example, the SQL query SELECT count(*) FROM some_nested_column;
could be fully answered with such a value_counts field.
@wgtmac your proposal was my Option 1 and actually my initial proposal (see previous commit). Note that you earlier actually were against writing NaNs and rather preferred the nan_pages approach:
Personally speaking, apart from adding a nan_count to the statistics, I would go with the option 3: adding a nan_pages bool list to the column index. I am not in favor of writing any NaN to min/max bounds.
Is your argument that if we now need to write the NaNs anyway, that we should in this case just use them instead of adding nan_pages? I do agree that this would save the extra field and I personally see nothing wrong in doing this. Readers need to be able to detect NaN values anyway (to ignore them), so readers should be able to use the same logic to determin min=max=NaN <=> all values are NaN.
As mentioned in my previous post where I compared the three approaches, I am happy to implement any of them and I think all of them will fulfill the requirements. In my personal opinion, I like the current approach with nan_pages actually the least, as it seems redundant if we have to write NaN values anyway and I see no problem in using NaN values for the "all values NaN check".
I also like the option of adding a value_counts field to the column index of all columns. It feels like a useful and missing field (that is not subsumed by offset index row counts for nested columns) and I would love to add it as well and I feel the few extra bytes will be so negligible in contrast to the actual data that no-one will ever care. Also it would enable us to do the check for all values NaN the same way in page statistics and in the column index.
So we're back at the three options I proposed:
- Drop nan_pages and use my initial approach of "min=max=NaN && nan_counts > 0 <=> all values are NaN" in the column index
- Drop nan_pages and instead add value_counts so we can use value_counts-null_counts==nan_counts to determine whether all values are null. (My personal favorite)
- Retain the current state and use
nan_pages
@wgtmac @mapleFU @gszadovszky @pitrou could we arrive at a consensus here? I'm happy to adapt my PR to any of the solutions. @gszadovszky you also haven't mentioned your favorite, yet (you just pointed out that we have to write some valid value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou writing anything but NaN into min/max was one of my suggestions to circumvent the problem that parquet-mr doesn't seem to check for NaN values in min/max while reading and therefore would probably yield wrong results once we start writing NaNs into these values.
This would only work if we go back to maintaining either nan_pages
or value_counts
though, as otherwise, as you correctly pointed out, we wouldn't have a way to draw the distinction between only-NaN and real infinities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there was an argument that some writers already write NaN values into column indexes. Hence, they try to filter on NaN values. Now, we start writing
[-Inf,+Inf]
for NaN only pages. NaN is probably out of[-Inf,+Inf]
interval so that reader would drop the only NaN page while searching for a NaN.
Hi gabor, which implemention has do like that? I check C++ implemention but it doesn't do this. Maybe we can do a check here? Since I guess [-inf, +inf]
sounds ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU, I did not think about any specific implementation. (TBH, I only have experince with parquet-mr.) This is mentioned in the PR description. Maybe, we do not have any implementations as such.
@JFinis, I agree we should not care about the potential systems already writing NaN values into column indexes. Also agree that writing NaN values to min/max is risky for existing systems. So we need to write non-NaN valid values to min/max for all-NaN pages. (And of course mark them with either nan_pages
or value_counts
.)
The more we narrow the range the higher the chance the page will be dropped during filtering which is good because we should not search for NaN values based on the spec anyway. What do you think about [-Inf, -Inf]
? The worst case is we will read the page of all NaN values instead of dropping. In this very case we would not drop it for < x
like cases. (This turned out to be the rephrasing and summary of your previous comments. 😄 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think [-inf, +inf]
it's ok. Now I guess only Rust impl and parquet-mr has the potential problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to catching up the discussion. I like the idea to write either [-inf, +inf] or [-0.0, +0.0] for NaN-only pages.
As NaN value does not have a well-defined order across systems, simply leveraging page min/max values to filter NaN does not make any sense. Therefore I think this design can break such misuses.
* - It is suggested to always set the nan_count fields for FLOAT and | ||
DOUBLE columns. | ||
* - NaNs should not be written to min or max statistics fields except | ||
* in the column index, where a value has to be written incase of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* in the column index, where a value has to be written incase of | |
* in the column index, where a value has to be written in case of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this with my next revision once we have decided on this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thorough research!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JFinis! I agree with you guys and this looks good. I have left some comments mostly on the rationale of nan_pages
.
* - NaNs should not be written to min or max statistics fields. | ||
* - It is suggested to always set the nan_count fields for FLOAT and | ||
DOUBLE columns. | ||
* - NaNs should not be written to min or max statistics fields except |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect to explicitly state that NaN value should not be written to min or max fields in the Statistics of DataPageHeader, DataPageHeaderV2 and ColumnMetaData. But it is suggested to write NaN to min_values and max_values fields in the ColumnIndex where a value has to be written in case of a only-NaN page
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this with my next revision once we have decided on this issue.
* who do not consider nan_pages yet are still able to use the column index | ||
* but are not able to skip only-NaN pages. | ||
*/ | ||
6: optional list<bool> nan_pages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As nan_counts
will be set only after this proposal, could we simply deduce a NaN page by checking null_pages[i] == false && nan_counts[i] > 0 && min_values[i] == NaN && max_values[i] == NaN
? If that is true, we can safely remove definition of nan_pages
list.
Let's cc some of the maintainers of parquet-rs: @AdamGS @tustvold @alamb |
#196 (comment) |
So arrow-rs has a nice handing on float/double comparings, I guess we only need to consider that the new data will not broken by stale parquet-mr reader? @gszadovszky |
@mapleFU, as I've written before that's why we initiated ColumnOrder to make the format open to specify orderings. I don't know how the other implementations use this already. In the current parquet-mr (since we introduced |
I agree w/ @tustvold's standpoint. Some thoughts on top of what he wrote: IMHO this is leaking application details into the storage format. If you start to differentiate NaN from "all normal values" and NULL you may do the same for +/-Inf, because it also acts as a poison value in most computations. But you may also do that for "nearly Inf" because someone divided by "nearly zero" and these super big values are equally nonsensical. This whole discussion isn't even specific to floats. Why do boolean stats not count true/false separately? What about empty strings and byte arrays? Or empty lists in general? My point is: this is opening a can of worms and the complexity isn't worth the gain. The better alternative is: let the user cast invalid values to NULL if they wanna exclude them from their data, because this is exactly what missing values were invented for. If they still want to store broken data and want to have some niche understanding of statistics, provide a way to attach application-defined stats to parquet (this extends to a number of histogram types or counts of other "special" values). Keep the storage format baseline simple. IEEE total ordering is well defined and universally agreed upon. I think the world doesn't need yet another special floating point treatment. |
I think we already have type-defined order, and already exclude +inf and -inf. And not when if a page is all |
Indeed, and what I am suggesting is rather than layering on more complexity to workaround the problems of such an approach, how about we just remove this complexity? Perhaps defining a new ColumnOrder if necessary for backwards compatibility, although I am personally not sure this would even be necessary, as I suspect most readers already handle NaNs in some capacity even if they just ignore them. |
Okay, |
@tustvold @crepererum Do I interpret your answer correctly in that your suggestion would be to
Did I get this right? I guess this can also be implemented in each language by "bit casting" the float bits to integer bits and doing an integer comparison, correct? So even if the underlying language doesn't have native support for total ordering, it should still be possible to implement this. I do see a certain beauty in this approach in it being "simple". As always, I'm happy to adapt my PR to this approach, if we can get consensus that we want this. @mapleFU @gszadovszky @pitrou @wgtmac What is your opinion on this proposal? |
Its a bit more than a simple bit cast, but broadly speaking yes. |
The idea looks ok. I've check arrow parquet's reader implemention. For statistics:
For |
It's difficult to say without understanding the implications. Say a data page contains some NaNs, what happens? |
@pitrou On the write path:
On the read path:
Ramifications:
|
@JFinis Thanks a lot! I agree that makes sense. The main problem IMHO is that old readers wouldn't support page filtering on such files. That said, we have to move forward somehow. |
I agree this is a con. Total ordering is nice if the goal is to order the data. If the goal is to filter the data then I think any consideration of NaN/null/infinity is meaningless. However, I also agree with @crepererum that this is a slippery slope and I agree with @JFinis that NaNs are not widely used and simpler is better. I don't entirely agree the solution can always be to replace NaN/Infinity with NULL but the cases where it can't are probably very rare. Besides, the penalty here is only a performance loss and not incorrect results so it's more manageable. So, on the balance, I'd say I'm neutral. If there are other advantages to this approach then the disadvantages to dataset filtering are probably not enough outweigh them. We might want to add a small sentence to some kind of pyarrow or parquet documentation somewhere so that users can be aware of this. |
The new |
To support old readers with the statistics we can choose to write |
Just coming back to this as it has come up a bit downstream, the approach described in #196 (comment) makes a lot of sense to me. Would it help move this forward if I were to raise a separate PR proposing it?
Provided Java provides some mechanism to interpret a float as an integer, it is just a case of some bit operations - https://doc.rust-lang.org/src/core/num/f64.rs.html#1336
Why would filter predicates not also need a well-defined order? FWIW arrow-rs uses total order for all floating point comparison. |
@tustvold I actually already have the change ready in my local repo. I was just distracted by other work and it seemed there was little interest so far in advancing this quickly, so I didn't update it on github, yet. I can update the PR tomorrow :). |
I hate to not stick to my word, but I won't be able to create the PR today, as my wife is going into labor and I'll have to drive her to the clinic soon 😅. I pushed the status I have so far to my fork. You can already have a look if you want: https://github.com/jfinis/parquet-format/tree/totalorder The commit is basically done, I just wanted to proof read everything and write a descriptive message for the commit and the PR. I'll find some time once we're back from the hospital, i.e., in a few days. But for now, I first need to deliver something else 👶 . |
Congratulations! Take all the time you need, there is no urgency on this from my end, just wanted to avoid things stalling out |
This commit adds a new column order `IEEE754TotalOrder`, which can be used for floating point types (FLOAT, DOUBLE, FLOAT16). The advantage of the new order is a well-defined ordering between -0,+0 and the various possible bit patterns of NaNs. Thus, every single possible bit pattern of a floating point value has a well-defined order now, so there are no possibilities where two implementations might apply different orders when the new column order is used. With the default column order, there were many problems w.r.t. NaN values which lead to reading engines not being able to use statistics of floating point columns for scan pruning even in the case where no NaNs were in the data set. The problems are discussed in detail in the next section. This solution to the problem is the result of the extended discussion in apache#196, which ended with the consensus that IEEE 754 total ordering is the best approach to solve the problem in a simple manner without introducing special fields for floating point columns (such as `nan_counts`, which was proposed in that PR). Please refer to the discussion in that PR for all the details why this solution was chosen over various design alternatives. Note that this solution is fully backward compatible and should not break neither old readers nor writers, as a new column order is added. Legacy writers can continue not writing this new order and instead writing the default type defined order. Legacy readers should avoid using any statistics on columns that have a column order they do not understand and therefore should just not use the statistics for columns ordered using the new order. The remainder of this message explains in detail what the problems are and how the proposed solution fixes them. Problem Description =================== Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely the following problems exist: Statistics don't tell whether NaNs are present ---------------------------------------------- As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might seem to be not too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview over what different systems are doing can be found here. The gist of it is that different systems with different semantics exist w.r.t. NaNs and most of the systems do order NaNs; either less than or greater than all other values. For example, if the semantics of the reading query engine mandate that NaN is to be treated greater than all other values, the predicate x > 1.0 should include NaN values. If a page has max = 0.0 now, the engine would not be able to skip the page, as the page might contain NaNs which would need to be included in the query result. Likewise, the predicate x < 1.0 should include NaN if NaN is treated to be less than all other values by the reading engine. Again, a page with min = 2.0 couldn't be skipped in this case by the reader. Thus, even if a user doesn't query for NaN explicitly, they might use other predictes that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem on first sight. Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. And as state, that can be many seemingly innocuous greater than or less than predicates in most databases systems. Conversely, it would be nice if Parquet would enable scan pruning in these cases, regardless of whether the reader and writer agree upon whether NaN is smaller, greater, or incomparable to all other values. Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the way more common case of NaNs not being present. Handling NaNs in a ColumnIndex ------------------------------ There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so we have to put something in here. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed. It would be nice if there was a defined way of writing the ColumnIndex when NaNs (and especially only-NaN pages) are present. Handling only-NaN pages & column chunks --------------------------------------- Note: Hereinafter, whenever the term only-NaN is used, it refers to a page or column chunk, whose only non-null values are NaNs. E.g., an only-NaN page is allowed to have a mixture of null values and NaNs or only NaNs, but no non-NaN non-null values. The Statistics objects stored in page headers and in the file footer have a similar, albeit smaller problem: min_value and max_value are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: Simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine did decide to omit them for whatever other reason, which is allowed by the spec as the field is optional. Consequently, a reader cannot know whether missing min_value and max_value means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains". It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages. Solution ======== IEEE 754 total order solves all the mentioned problems. As NaNs now have a defined place in the ordering, they can be incorporated into min and max bounds. In fact, in contrast to the default ordering, they do not need any special casing anymore, so all the remarks how readers and writers should special-handle NaNs and -0/+0 no longer apply to the new ordering. As NaNs are incorporated into min and max, a reader can now see whether NaNs are contained through the statistics. Thus, a reading engine just has to map its NaN semantics to the NaN semantics of total ordering. For example, if the semantics of the reading engine treat all NaNs (also -NaNs) as greater than all other values, a reading engine having a predicate `x > 5.0` (which should include NaNs) may not filter any pages / row groups if either min or max are (+/-)NaN. Only-NaN pages can now also be included in the column index, as they are no longer a special case. In conclusion, all mentioned problems are solved by using IEEE 754 total ordering.
This commit adds a new column order `IEEE754TotalOrder`, which can be used for floating point types (FLOAT, DOUBLE, FLOAT16). The advantage of the new order is a well-defined ordering between -0,+0 and the various possible bit patterns of NaNs. Thus, every single possible bit pattern of a floating point value has a well-defined order now, so there are no possibilities where two implementations might apply different orders when the new column order is used. With the default column order, there were many problems w.r.t. NaN values which lead to reading engines not being able to use statistics of floating point columns for scan pruning even in the case where no NaNs were in the data set. The problems are discussed in detail in the next section. This solution to the problem is the result of the extended discussion in apache#196, which ended with the consensus that IEEE 754 total ordering is the best approach to solve the problem in a simple manner without introducing special fields for floating point columns (such as `nan_counts`, which was proposed in that PR). Please refer to the discussion in that PR for all the details why this solution was chosen over various design alternatives. Note that this solution is fully backward compatible and should not break neither old readers nor writers, as a new column order is added. Legacy writers can continue not writing this new order and instead writing the default type defined order. Legacy readers should avoid using any statistics on columns that have a column order they do not understand and therefore should just not use the statistics for columns ordered using the new order. The remainder of this message explains in detail what the problems are and how the proposed solution fixes them. Problem Description =================== Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely the following problems exist: Statistics don't tell whether NaNs are present ---------------------------------------------- As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might seem to be not too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview over what different systems are doing can be found here. The gist of it is that different systems with different semantics exist w.r.t. NaNs and most of the systems do order NaNs; either less than or greater than all other values. For example, if the semantics of the reading query engine mandate that NaN is to be treated greater than all other values, the predicate x > 1.0 should include NaN values. If a page has max = 0.0 now, the engine would not be able to skip the page, as the page might contain NaNs which would need to be included in the query result. Likewise, the predicate x < 1.0 should include NaN if NaN is treated to be less than all other values by the reading engine. Again, a page with min = 2.0 couldn't be skipped in this case by the reader. Thus, even if a user doesn't query for NaN explicitly, they might use other predictes that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem on first sight. Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. And as state, that can be many seemingly innocuous greater than or less than predicates in most databases systems. Conversely, it would be nice if Parquet would enable scan pruning in these cases, regardless of whether the reader and writer agree upon whether NaN is smaller, greater, or incomparable to all other values. Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the way more common case of NaNs not being present. Handling NaNs in a ColumnIndex ------------------------------ There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so we have to put something in here. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed. It would be nice if there was a defined way of writing the ColumnIndex when NaNs (and especially only-NaN pages) are present. Handling only-NaN pages & column chunks --------------------------------------- Note: Hereinafter, whenever the term only-NaN is used, it refers to a page or column chunk, whose only non-null values are NaNs. E.g., an only-NaN page is allowed to have a mixture of null values and NaNs or only NaNs, but no non-NaN non-null values. The Statistics objects stored in page headers and in the file footer have a similar, albeit smaller problem: min_value and max_value are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: Simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine did decide to omit them for whatever other reason, which is allowed by the spec as the field is optional. Consequently, a reader cannot know whether missing min_value and max_value means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains". It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages. Solution ======== IEEE 754 total order solves all the mentioned problems. As NaNs now have a defined place in the ordering, they can be incorporated into min and max bounds. In fact, in contrast to the default ordering, they do not need any special casing anymore, so all the remarks how readers and writers should special-handle NaNs and -0/+0 no longer apply to the new ordering. As NaNs are incorporated into min and max, a reader can now see whether NaNs are contained through the statistics. Thus, a reading engine just has to map its NaN semantics to the NaN semantics of total ordering. For example, if the semantics of the reading engine treat all NaNs (also -NaNs) as greater than all other values, a reading engine having a predicate `x > 5.0` (which should include NaNs) may not filter any pages / row groups if either min or max are (+/-)NaN. Only-NaN pages can now also be included in the column index, as they are no longer a special case. In conclusion, all mentioned problems are solved by using IEEE 754 total ordering.
Okay, finally done. As the new solution (total order) does not share a single line with the current solution and this PR gets quite long and contrived, I created a new PR: #221 I hope this is fine. If you rather want me to continue in this PR, let me know, then I'll close the other one and instead update this one. Otherwise, let's continue the discussion about total order in the new PR :). @tustvold @mapleFU @wgtmac @crepererum @etseidl @gszadovszky @pitrou FYI |
This commit proposes an improvement for handling of NaN values in FLOAT and DOUBLE type columns. The goal is to allow reading engines, regardless of how they order NaN w.r.t. other values, to efficiently use statistics for scan pruning on those columns, which currently is impossible in most cases. This is to be accomplished in a fully backward compatible manner, so that existing readers and writers do not need to be updated immediatly but can migrate over time to make use of the improved semantics.
There was already work on improving the handling of float and double columns which laid good ground work for backward compatible improvements, but it wasn't sufficient to fix all the problems with NaN values, which are laid out hereinafter.
Problem Description
Currently, the way NaN values are to be handled in statistics inhibits most scan pruning once NaN values are present in DOUBLE or FLOAT columns. Concretely the following problems exist:
Statistics don't tell whether NaNs are present
As NaN values are not to be incorporated in min/max bounds, a reader cannot know whether NaN values are present. This might seem to be not too problematic, as most queries will not filter for NaNs. However, NaN is ordered in most database systems. For example, Postgres, DB2, and Oracle treat NaN as greater than any other value, while MSSQL and MySQL treat it as less than any other value. An overview over what different systems are doing can be found
here. The gist of it is that different systems with different semantics exist w.r.t. NaNs and most of the systems do order NaNs; either less than or greater than all other values.
For example, if the semantics of the reading query engine mandate that NaN is to be treated greater than all other values, the predicate
x > 1.0
should include NaN values. If a page hasmax = 0.0
now, the engine would not be able to skip the page, as the page might contain NaNs which would need to be included in the query result.Likewise, the predicate
x < 1.0
should include NaN if NaN is treated to be less than all other values by the reading engine. Again, a page withmin = 2.0
couldn't be skipped in this case by the reader.Thus, even if a user doesn't query for NaN explicitly, they might use other predictes that need to filter or retain NaNs in the semantics of the reading engine, so the fact that we currently can't know whether a page or row group contains NaN is a bigger problem than it might seem on first sight.
Currently, any predicate that needs to retain NaNs cannot use min and max bounds in Parquet and therefore cannot be used for scan pruning at all. And as state, that can be many seemingly innocuous greater than or less than predicates in most databases systems. Conversely, it would be nice if Parquet would enable scan pruning in these cases, regardless of whether the reader and writer agree upon whether NaN is smaller, greater, or incomparable to all other values.
Note that the problem exists especially if the Parquet file doesn't include any NaNs, so this is not only a problem in the edge case where NaNs are present; it is a problem in the way more common case of NaNs not being present.
Handling NaNs in a ColumnIndex
There is currently no well-defined way to write a spec-conforming ColumnIndex once a page has only NaN (and possibly null) values. NaN values should not be included in min/max bounds, but if a page contains only NaN values, then there is no other value to put into the min/max bounds. However, bounds in a ColumnIndex are non-optional, so we have to put something in here. The spec does not describe what engines should do in this case. Parquet-mr takes the safe route and does not write a column index once NaNs are present. But this is a huge pessimization, as a single page containing NaNs will prevent writing a column index for the column chunk containing that page, so even pages in that chunk that don't contain NaNs will not be indexed.
It would be nice if there was a defined way of writing the
ColumnIndex
when NaNs (and especially only-NaN pages) are present.Handling only-NaN pages & column chunks
Note: Hereinafter, whenever the term only-NaN is used, it refers to a page or column chunk, whose only non-null values are NaNs. E.g., an only-NaN page is allowed to have a mixture of null values and NaNs or only NaNs, but no non-NaN non-null values.
The
Statistics
objects stored in page headers and in the file footer have a similar, albeit smaller problem:min_value
andmax_value
are optional here, so it is easier to not include NaNs in the min/max in case of an only-NaN page or column chunk: Simply omit these optional fields. However, this brings a semantic ambiguity with it, as it is now unclear whether the min/max value wasn't written because there were only NaNs, or simply because the writing engine did decide to omit them for whatever other reason, which is allowed by the spec as the field is optional.Consequently, a reader cannot know whether missing
min_value
andmax_value
means "only NaNs, you can skip this page if you are looking for only non-NaN values" or "no stats written, you have to read this page as it is undefined what values it contains".It would be nice if we could handle NaNs in a way that would allow scan pruning for these only-NaN pages.
Proposed solution
Adding NaN counts
The proposed solution for handling NaNs in statistics is akin to what Apache Iceberg does: add an optional
nan_count
field toStatistics
and an optionalnan_counts
list toColumnIndex
. This way, all places where statistics are being retained can specify whether NaN values are present. This already solves the first problem, as now queries wanting to retain NaNs can check whether the count is > 0 to see whether a page or column chunk contains NaNs.Handling NaN-only pages & column chunks
Adding
nan_count
/nan_counts
fields does not solve the problem of only-NaN pages, yet. But since we have a new optional field in both theStatistics
object and theColumnIndex
object, we can tie a stricter semantics to the existence of this field. I.e., we can mandate that writers who write this optional field have to treat NaNs in a specific way.We basically have two options for treating only-NaN pages or column chunks:
min_value
/max_value
inStatistics
byte[0]
in themin_values
/max_values
list entry of theColumnIndex
I propose to go with the first option of writing NaN as min/max in case of only-Nan pages & column chunks. A section depicting the decision process for this follows below.
Thus, to solve the problem of only-NaN pages, the comments in the spec are extended to mandate the following behavior:
nan_count
/nan_counts
fields, they have to:nan_count
/nan_counts
field was written can then rely on that if min or max are NaN, then both have to be NaN and this means that the only non-NULL values are NaN.Should we write NaN or nothing for only-NaN pages & column chunks?
Here are the cons of each approach and how to mitigate them:
CONs for writing NaN in this case:
nan_count
field is written: We define that once a writing engine writes this field, it has to never write NaN into min/max if there are non-NaN non-Null values and always write min=max=NaN if the only non-null values in a page are NaN. Consequently, readers can rely on the semantics once they observe that thenan_count
field is written and this becomes a non-issue.In conclusion, there is no big con for writing NaN. It can be implemented in a fully backward compatible way that still allows future writers and readers to apply a more strict semantics.
CONs for writing nothing in this case:
min_values
/max_values
field to be a value with correct length unlessnull_pages
is true for the entry.min_value
/max_value
fields inStatistics
is suboptimal, as it is indistinguishable from the writing engine not writing these fields for whatever other reason. Would we do this, then the page could never be pruned by a reader, as the reader couldn't rely on this meaning "there are only NaNs" vs "the writer hasn't written any min/max for this page".nan_count
field and must only omit them if a page has only NaNs, but this seems to be quite fishy, semancially.In conclusion, the cons for the NaN approach have mitigations and can be handled in a backward compatible way, while the cons for writing nothing might be non-backward-compatible. Therefore, I propose to write NaN as min/max for only-nan pages & column chunks.
Considerations
Backward compatibility
The suggested change is fully backward compatible both in the read and write direction:
Older readers not supporting
nan_count
/nan_counts
yet can stay as is. As the fields are optional, readers not supporting them will simply ignore them.The spec already today mandates that if a reader sees
NaN
in min or max fields they should ignore it. They can continue doing so.No older reader will have regressed performance; any page that an older reader would have skipped before can still be skipped.
Older writers can continue writing files without
nan_count
/nan_counts
. Even if an old reader sets min=max=NaN for a page that does contain non-NaN values, readers supporting this new semantics will not misinterpret these bounds, as the writer will not writenan_count
/nan_counts
, so the new more strict semantics does not apply when reading.As
nan_count
/nan_counts
are local to the scopes where they apply (column index, page, column chunk), even stitching together row groups from a writer that didn't write them and a writer that does write them works. This would result in a file where some pages / column indexes / column chunks would have them written while others wouldn't.Versioning
This proposal definitly does not require a major version bump, as it is fully backward compatible.
I do not fully understand the versioning policy of parquet, so I don't know whether this change would require a minor version bump. One could argue that it is not necessary as the mere existence of the
nan_count
/nan_counts
field would be the "feature flag" that would indicate whether a writer supported this change or not. There wouldn't be a version check necessary in a reader; they would just need to check for the existence of thenan_count
/nan_counts
fields.No Unnecessary Overhead
As thrift encodes missing optional fields with zero bytes in the compact protocol, non-FLOAT/DOUBLE columns will not incur any overhead for the new optional fields.
Design alternatives and why they were not chosen
Ordering NaN before or after all other values
We could simply define NaN to be smaller or greater than all other values and then allow NaN in the respective bound.
This however has many drawbacks:
Adding a new ColumnOrder
We could add a new ColumnOrder that specifies NaN to be smaller or greater than all other values, or even supports -NaN and +NaN ordering them as smaller and larger than all values, respectively. For example, Iceberg mandates the following sort order:
-NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN
Once we define such an order, we could again allow NaN (and potentially -NaN) in bounds again.
This however has the following drawbacks:
All in all, any alternative putting NaN into min/max bounds (except for only-NaN-pages) suffers from the problem that NaN bounds are too wide and therefore not useful for pruning.
Writing a second
value_counts
list in the ColumnIndexThe column index does allow
byte[0]
values already, in case a page contains only nulls. We could enable the same for only-NaN pages by not storing only thenan_counts
, but also thevalue_counts
of a page. Then, one could check whether a page in the column index contains only NaNs by checkingnan_count + nulls_count = value_count
. Hoewever, this would mean yet another list in the column index, making the column index bigger and more expensive to deserialize. And while thenan_counts
list only exists for FLOAT/DOUBLE columns, thevalue_counts
list would exist for all columns and therefore take up considerably more space. Also, this would not be backward compatible, as an older reader wouldn't know of the new lists, so it would see abyte[0]
and would need to treat it as invalid.All in all, the extra list doesn't seem to add enough value for its cost (especially also cost on non-float columns) and reduced backward compatibility.
Do nothing
As long as we don't do anything, columns containing NaNs will almost be useless for scan pruning. The problems outlined will persist, making double columns almost unprunable for some predicates. That is not satisfactory.
Why wasn't sort order tackled?
Even with this improvement which fixes the semantics of NaN values in statistics, NaN values are still a problem in some other places as there is still not a defined order for them, so the
boundary_order
in a column index and theSortingColumn
would still have undefined placement forNaN
.This mainly wasn't tackled for two reasons:
boundary_order
orSortingColumn
needs to be solved in a different way anyway, as the mere information about whether (only or some) NaNs are present isn't enough here but it needs to be defined whether they come before or after all values.ColumnOrder
with well defined NaN ordering. This would be the cleanest fix, but also a "very big gun", as this would be the first non-default column order in existence. Also, we would have to decide whether we want to have different sort orders for nans first, nans last, and +/-nan allowed.nans_first
field which tells whether NaNs are to be sorted before or after other values, akin to the already existing fieldnulls_first
. This would be a more micro-invasive change, but it would be less clean IMHO, as there is a tool for defining column ordering--theColumnOrder
--and not using that tool but working around it feels hacky.Thus, sort ordering of NaNs wasn't tackled in this commit. They can be tackled by a follow-up change if necessary.
Epilogue
Make sure you have checked all steps below.
Jira
Commits
Documentation