PARQUET-2249: Add nan_count to handle NaNs in statistics #196

Open

JFinis wants to merge 3 commits into master

Commits on Mar 22, 2023

  1. PARQUET-2249: Add nan_count to handle NaNs in Statistics

    This commit proposes an improvement for handling of NaN values in FLOAT
    and DOUBLE type columns. The goal is to allow reading engines,
    regardless of how they order NaN w.r.t. other values, to efficiently use
    statistics for scan pruning while NaN values are present, which
    currently is impossible in most cases. This is to be accomplished in a
    fully backward compatible manner, so that existing readers and writers
    do not need to be updated immediately but can migrate over time to make
    use of the improved semantics.
    
    There was already [work on improving the handling of float and double
    columns](apache#185) which laid
    good groundwork for backward compatible improvements, but it wasn't
    sufficient to fix all the problems with NaN values, which are laid out
    hereinafter.
    
    Problem Description
    ===================
    
    Currently, the way NaN values are to be handled in statistics inhibits
    most scan pruning once NaN values are present in DOUBLE or FLOAT
    columns. Concretely, the following problems exist:
    
    Statistics don't tell whether NaNs are present
    ----------------------------------------------
    
    As NaN values are not to be incorporated in min/max bounds, a reader
    cannot know whether NaN values are present. This might not seem too
    problematic, as most queries will not filter for NaNs. However, NaN
    is ordered in most database systems. For example, Postgres, DB2, and
    Oracle treat NaN as greater than any other value, while MSSQL and MySQL
    treat it as less than any other value. An overview of what different
    systems do can be found
    [here](apache/arrow-rs#264 (comment)).
    The gist of it is that different systems exist with different semantics
    w.r.t. NaNs, and most of these systems do order NaNs, either less than
    or greater than all other values.
    
    For example, if the semantics of the reading query engine mandate that
    NaN is to be treated greater than all other values, the predicate `x >
    1.0` *should* include NaN values. If a page has `max = 0.0` now, the
    engine would *not* be able to skip the page, as the page might contain
    NaNs which would need to be included in the query result.
    
    Likewise, the predicate `x < 1.0` should include NaN if NaN is treated
    to be less than all other values by the reading engine. Again, a page
    with `min = 2.0` couldn't be skipped in this case by the reader.
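    To illustrate, here is a minimal reader-side sketch (Python, with
    purely illustrative names, not taken from any actual implementation)
    of the skip decision for `x > limit` under "NaN is greater than all
    values" semantics:

    ```python
    # Hypothetical check: can a page be skipped for the predicate
    # `x > limit` in an engine that treats NaN as greater than all values?
    def can_skip_page_gt(page_max: float, limit: float,
                         may_contain_nan: bool) -> bool:
        """True iff the page provably has no row satisfying x > limit."""
        if may_contain_nan:
            # Under these semantics `x > limit` must retain NaNs, and the
            # page might contain some, so it can never be skipped.
            return False
        return page_max <= limit

    # Today a reader must conservatively assume NaNs may be present, so
    # even a page with max = 0.0 cannot be pruned for `x > 1.0`:
    assert not can_skip_page_gt(page_max=0.0, limit=1.0, may_contain_nan=True)
    # Knowing the page is NaN-free would enable pruning:
    assert can_skip_page_gt(page_max=0.0, limit=1.0, may_contain_nan=False)
    ```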
    
    Thus, even if a user doesn't query for NaN explicitly, they might use
    other predicates that need to filter or retain NaNs in the semantics of
    the reading engine, so the fact that we currently can't know whether a
    page or row group contains NaN is a bigger problem than it might seem
    at first sight.
    
    Currently, any predicate that needs to retain NaNs cannot use min and
    max bounds in Parquet and therefore cannot be used for scan pruning at
    all. Conversely, it would be nice if Parquet would enable scan pruning
    in these cases, regardless of whether the reader and writer agree upon
    whether NaN is smaller, greater, or incomparable to all other values.
    
    Note that the problem exists especially if the Parquet file *doesn't*
    include any NaNs: this is not only a problem in the case where NaNs
    are present; it is a problem for the far more common case of NaNs not
    being present.
    
    Handling NaNs in a ColumnIndex
    ------------------------------
    
    There is currently no well-defined way to write a spec-conforming
    ColumnIndex once a page has only NaN (and possibly null) values. NaN
    values should not be included in min/max bounds, but if a page contains
    only NaN values, then there is no other value to put into the min/max
    bounds. However, bounds in a ColumnIndex are non-optional, so we *have
    to* put something there. The spec does not describe what engines
    should do in this case. Parquet-mr takes the safe route and does not
    write a column index once NaNs are present. But this is a huge
    pessimization, as a single page containing NaNs will prevent the
    emission of a column index for that column chunk, so even pages in that
    chunk that don't contain NaNs will not be indexed.
    
    It would be nice if there was a defined way of writing the `ColumnIndex`
    when NaNs (and especially only-NaN pages) are present.
    
    Handling only-NaN pages & column chunks
    ---------------------------------------
    
    Note: Hereinafter, whenever the term *only-NaN* is used, it refers to a
    page or column chunk whose only non-null values are NaNs. E.g., an
    only-NaN page is allowed to have a mixture of null values and NaNs or
    only NaNs, but no non-NaN non-null values.
    
    The `Statistics` objects stored in page headers and in the file footer
    have a similar, albeit smaller problem: `min_value` and `max_value` are
    optional here, so it is easier to not include NaNs in the min/max in
    case of an only-NaN page or column chunk: Simply omit these optional
    fields. However, this brings a semantic ambiguity with it, as it is now
    unclear whether the min/max value wasn't written because there were only
    NaNs, or simply because the writing engine decided to omit them for
    whatever other reason, which is allowed by the spec as the fields are
    optional.
    
    Consequently, a reader cannot know whether missing `min_value` and
    `max_value` means "only NaNs, you can skip this page if you are looking
    for only non-NaN values" or "no stats written, you have to read this
    page as it is undefined what values it contains".
    
    It would be nice if we could handle NaNs in a way that would allow scan
    pruning for these only-NaN pages.
    
    Proposed solution
    =================
    
    Adding NaN counts
    -----------------
    
    The proposed solution for handling NaNs in statistics is akin to what
    [Apache Iceberg](https://iceberg.apache.org/spec/) does: add an
    *optional* `nan_count` field to `Statistics` and an *optional*
    `nan_counts` list to `ColumnIndex`.  This way, all places where
    statistics are being retained can specify whether NaN values are
    present. This already solves the first problem, as now queries wanting
    to retain NaNs can check whether the count is > 0 to see whether a page
    or column chunk contains NaNs.
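    To make this concrete, here is a minimal Python sketch; the dataclasses
    are simplified illustrative mirrors of the Thrift structs, not the
    actual generated code:

    ```python
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Statistics:                   # simplified mirror, fields omitted
        min_value: Optional[bytes] = None
        max_value: Optional[bytes] = None
        null_count: Optional[int] = None
        nan_count: Optional[int] = None           # proposed new field

    @dataclass
    class ColumnIndex:                  # simplified mirror, fields omitted
        null_pages: List[bool]
        min_values: List[bytes]
        max_values: List[bytes]
        nan_counts: Optional[List[int]] = None    # proposed new field

    def page_may_contain_nan(index: ColumnIndex, page: int) -> bool:
        if index.nan_counts is None:
            return True   # legacy writer: NaN presence is unknown
        return index.nan_counts[page] > 0
    ```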
    
    Handling NaN-only pages & column chunks
    ---------------------------------------
    
    Adding `nan_count`/`nan_counts` fields does not yet solve the problem
    of only-NaN pages. But since we have a new optional field in both the
    `Statistics` object and the `ColumnIndex` object, we can tie a stricter
    semantics to the existence of this field. I.e., we can mandate that
    writers who write this optional field have to treat NaNs in a specific
    way.
    
    We basically have two options for treating only-NaN pages or column
    chunks:
    * Require the writer to write NaN as min/max in this case.
    * Require the writer to write nothing, i.e.,
        * omit the `min_value` / `max_value` in `Statistics`
        * write byte[0] in the min_values/max_values entry of the
          `ColumnIndex`
    
    I propose to go with the first option of writing NaN as min/max in case
    of only-NaN pages & column chunks. A section depicting the decision
    process for this follows below.
    
    Thus, to solve the problem of only-NaN pages, the comments in the spec
    are extended to mandate the following behavior:
    
    * Once a writer writes the `nan_count`/`nan_counts` fields, they have
      to: a) never write NaN into min/max if there are non-NaN non-null
      values and b) always write min=max=NaN if the only non-null values in
      a page are NaN.
    * A reader observing that the `nan_count`/`nan_counts` fields were
      written can then rely on the following: if min or max is NaN, then
      both have to be NaN, and this means that the only non-null values
      are NaN (see the sketch below).
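    A minimal sketch of both sides of this contract (Python; illustrative
    names, min/max assumed to be already decoded to floats, all-null pages
    handled elsewhere):

    ```python
    import math

    # Writer side: bounds for a page with at least one non-null value.
    # NaNs stay excluded unless they are the only non-null values.
    def bounds_for_page(non_null_values):
        non_nan = [v for v in non_null_values if not math.isnan(v)]
        if not non_nan:                       # only-NaN page
            return math.nan, math.nan
        return min(non_nan), max(non_nan)

    # Reader side: only valid if the writer wrote nan_count, which opts
    # it into the stricter semantics above.
    def is_only_nan_page(min_v, max_v, nan_count):
        if nan_count is None:
            return False          # legacy writer, no conclusion possible
        return math.isnan(min_v)  # then max_v is guaranteed to be NaN too
    ```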
    
    Should we write NaN or nothing for only-NaN pages & column chunks?
    ------------------------------------------------------------------
    
    Here are the cons of each approach and how to mitigate them:
    
    CONs for writing NaN in this case:
    * Writing NaN breaks with the "don't write NaN into min and max bounds"
      rule.
        * However, one could argue that breaking the rule in this edge case
          is okay: if NaN is the only value in a page, then it doesn't
          matter where NaN sorts w.r.t. other values, as there are no other
          values. It is the only value in the page, so it is the min and
          max of the page.
        * Breaking this rule has no consequences for existing readers, as
          they should ignore NaN anyway, i.e., treat it as if it wasn't
          written, so legacy readers will treat both cases the same.
    * There might be existing writers that have written NaN for min & max
      for pages that contain not only NaN but also other values, so a
      reader couldn't rely on min=max=NaN to mean that the only non-null
      values are NaN.
        * However, as specified, we can enforce a stricter semantics once
          the `nan_count` field is written: We could define that once a
          writing engine writes this field, it has to a) never write NaN
          into min/max if there are non-NaN non-null values and b) always
          write min=max=NaN if the only non-null values in a page are NaN.
          Then, readers could rely on the semantics once they observe that
          the `nan_count` field was written.
    * NaNs take more space than not writing the field or writing byte[0] in
      the column index. This space overhead should usually be negligible.
    
    In conclusion, there is no big con for writing NaN. It can be
    implemented in a fully backward compatible way that still allows future
    writers and readers to apply a more strict semantics.
    
    CONs for writing nothing in this case:
    * Writing byte[0] to a ColumnIndex might break older readers who expect
      the `min_values`/`max_values` field to be a value with correct length
      unless `null_pages` is true for the entry.
      * Although readers should be lenient enough to handle wrongly sized
        min/max values gracefully by ignoring them, we cannot be sure this
        is the case for every reader. Thus, we might cause legacy
        spec-conforming readers to reject the new Parquet file, which is
        bad.
    * Omitting the `min_value` / `max_value` in `Statistics` is suboptimal,
      as it looks at first as if the writing engine had decided not to
      write them for whatever reason. In this case, the page could never be
      pruned by a reader, as the reader couldn't know which values are in
      there. Yes, we could define that a writer may not omit min/max if
      they write `nan_count` and must only omit them if a page has only
      NaNs, but this seems to be quite fishy semantically.
    
    In conclusion, the cons for the NaN approach have mitigations and can be
    handled in a backward compatible way, while the cons for writing nothing
    might be non-backward-compatible. Therefore, I propose to write NaN as
    min/max for only-NaN pages & column chunks.
    
    Considerations
    ==============
    
    Backward compatibility
    ----------------------
    
    The suggested change is fully backward compatible both in the read and
    write direction:
    
    Older readers not supporting `nan_count`/`nan_counts` yet can stay as
    is.  As the fields are optional, readers not supporting them will simply
    ignore them.
    
    The spec already today mandates that if a reader sees `NaN` in min or
    max fields they should ignore it. They can continue doing so.
    
    No older reader will have regressed performance; any page that an older
    reader would have skipped before can still be skipped.
    
    Older writers can continue writing files without
    `nan_count`/`nan_counts`. Even if an old writer sets
    min=max=NaN for a page that does contain non-NaN values, readers
    supporting the new semantics will not misinterpret these bounds, as the
    writer will not write `nan_count`/`nan_counts`, so the new, stricter
    semantics does not apply when reading.
    
    As `nan_count`/`nan_counts` are local to the scopes where they apply
    (column index, page, column chunk), even stitching together row groups
    from a writer that didn't write them and a writer that does write them
    works. This would result in a file where some pages / column indexes /
    column chunks would have them set while others wouldn't.
    
    Versioning
    ----------
    
    This proposal definitely does not require a *major* version bump, as it
    is fully backward compatible.
    
    I do not fully understand the versioning policy of Parquet, so I don't
    know whether this change would require a minor version bump. One could
    argue that it is not necessary, as the mere existence of the
    `nan_count`/`nan_counts` fields would be the "feature flag" that
    indicates whether a writer supported this change or not. No version
    check would be necessary in a reader; they would just need to check
    for the existence of the `nan_count`/`nan_counts` fields.
    
    No Unnecessary Overhead
    -----------------------
    
    As Thrift encodes missing optional fields with zero bytes in the compact
    protocol, non-FLOAT/DOUBLE columns will not incur any overhead for the
    new optional fields.
    
    Design alternatives and why they were not chosen
    ================================================
    
    Ordering NaN before or after all other values
    ---------------------------------------------
    
    We could simply define NaN to be smaller or greater than all other
    values and then allow NaN in the respective bound.
    
    This however has many drawbacks:
    * NaN is the widest bound possible, so adding NaNs to min and max isn't
      very useful, as it makes pruning for non-NaN values almost impossible
      in the respective direction.
    * As mentioned, not all systems agree on whether NaN is larger or
      smaller than all other values. If we decided on one, systems that
      chose the other semantics couldn't filter effectively.
    * This contradicts the current spec of not writing NaN to min/max, so it
      would make older readers no longer skip pages they could skip before.
    
    Adding a new ColumnOrder
    ------------------------
    
    We could add a new ColumnOrder that specifies NaN to be smaller or
    greater than all other values, or even supports -NaN and +NaN ordering
    them as smaller and larger than all values, respectively. For example,
    Iceberg mandates the following sort order:
    
    -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN
    
    Once we define such an order, we could allow NaN (and potentially
    -NaN) in bounds again.
    
    This however has the following drawbacks:
    * As with the previous alternative: NaN is the widest bound possible, so
      adding NaNs to min and max isn't very useful, as it makes pruning for
      non-NaN values almost impossible in the respective direction. If we
      even allow -NaN and +NaN, a page containing both would have no
      meaningful min and max and wouldn't allow any pruning at all.
    * Most systems don't support -NaN, as mathematically speaking, it is
      nonsense. The only reason it exists is that the physical
      representation of floats has a sign bit that also exists for NaN
      representations.
    * How unhelpful NaNs are in min/max bounds is evidenced by the fact
      that even though Iceberg has such a well-defined sort order, it
      still does what is proposed here and does *not* include -NaN/NaN in
      min/max bounds, instead tracking them through nan_counts.
    
    All in all, any alternative putting NaN into min/max bounds (except for
    only-NaN pages) suffers from the problem that NaN bounds are too wide
    and therefore not useful for pruning.
    
    Writing a second `value_counts` list in the ColumnIndex
    -------------------------------------------------------
    
    The column index does allow `byte[0]` values already, in case a page
    contains only nulls. We could enable the same for only-NaN pages by
    storing not only the `nan_counts` but also the `value_counts` of a page.
    Then, one could check whether a page in the column index contains only
    NaNs by checking `nan_count + null_count == value_count`. However,
    this would mean yet another list in the column index, making the column
    index bigger and more expensive to deserialize. And while the
    `nan_counts` list only exists for FLOAT/DOUBLE columns, the
    `value_counts` list would exist for all columns and therefore take up
    considerably more space. Also, this would not be backward compatible,
    as an older reader wouldn't know of the new lists, so it would see a
    `byte[0]` and would need to treat it as invalid.
    
    All in all, the extra list doesn't seem to add enough value for its cost
    and reduced backward compatibility.
    
    Do nothing
    ----------
    
    As long as we don't do anything, columns containing NaNs will be almost
    useless for scan pruning. The problems outlined will persist, making
    double columns almost unprunable for some predicates. That is not
    satisfactory.
    
    Why wasn't sort order tackled?
    ==============================
    
    Even with this improvement, which fixes the semantics of NaN values in
    statistics, NaN values are still a problem in some other places, as
    there is still no defined order for them, so the `boundary_order` in a
    column index and the `SortingColumn` would still have undefined
    placement for `NaN`.
    
    This mainly wasn't tackled for two reasons:
    * It is an orthogonal issue. This improvement is about enabling NaNs in
      statistics, so after this change all statistics can handle NaN in a
      well-defined way. Sort ordering of columns to leverage
      `boundary_order` or `SortingColumn` needs to be solved in a different
      way anyway, as the mere information about whether (only or some) NaNs
      are present isn't enough here; it needs to be defined whether they
      come before or after all other values.
      * Even though we could fix both statistics and sort order by just
        defining NaN to be smaller or greater than other values, doing so
        for statistics is *not* a good idea, as having NaN in bounds makes
        them too wide to be helpful for filtering.
      * If sort ordering is fixed by a different commit one day, the
        design of this commit shouldn't have a (negative) influence on that
        future design, as NaN counts and not including NaNs in bounds are
        good things to do anyway.
    * While fixing statistics with NaN counts is pretty uncontested w.r.t.
      design alternatives (see the respective section for a discussion why),
      the design to be chosen for sort orders isn't that clear:
      * We could define a new `ColumnOrder` with well defined NaN ordering.
        This would be the cleanest fix, but also a "very big gun", as this
        would be the first non-default column order in existence.
      * We could define a `nans_first` field which tells whether NaNs are
        to be sorted before or after other values, akin to the already
        existing field `nulls_first`. This would be a less invasive
        change, but it would be less clean IMHO, as there is a tool for
        defining column ordering--the `ColumnOrder`--and not using that
        tool but working around it feels hacky.
    
    Thus, sort ordering of NaNs wasn't tackled in this commit. It can be
    tackled by a follow-up change if necessary.
    JFinis committed Mar 22, 2023 (commit 2f3449e)

Commits on Jun 12, 2023

  1. PARQUET-2249: Update with suggestions

    This commit updates the proposal based on the suggestions in the PR.
    
    The biggest change is that readers are no longer expected to check
    min == max == NaN to find only-NaN pages. Instead, they should check
    nan_count + null_count == num_values in pages and nan_pages[x] == true
    in the column index. This way, we no longer rely on NaN comparison
    rules and readers can and should continue to ignore NaN values they
    find in bounds.
    
    However, as was pointed out, we cannot write "no value" into min and max
    bounds in the column index, as this would not be compatible with legacy
    readers. Instead, writers must write something here. Therefore, they
    are now suggested to write NaN here, but readers will still ignore this
    and instead must rely on the new nan_pages field.
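    A minimal sketch of the revised reader-side checks (Python;
    illustrative names following the description above):

    ```python
    # Page/chunk statistics: only-NaN iff every value is NaN or null.
    def stats_only_nan(nan_count, null_count, num_values):
        if nan_count is None or null_count is None:
            return False   # legacy writer: no conclusion possible
        return nan_count + null_count == num_values

    # Column index: the new nan_pages list flags only-NaN pages directly;
    # NaNs a reader finds in min/max bounds are still ignored.
    def index_only_nan(nan_pages, page):
        return nan_pages is not None and nan_pages[page]
    ```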
    
    In addition, two further suggestions were implemented:
    
    * Removed the duplicate explanation from README.md. It now only points
      to parquet.thrift as the source of truth.
    * Softened the wording from "nan_count fields should always be set" to
      "it is suggested to always set the nan_count fields".
    JFinis committed Jun 12, 2023 (commit aecffd7)

Commits on Nov 9, 2023

  1. 8d31bff