-
-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat(core): make path filter configurable #329
base: master
Are you sure you want to change the base?
Feat(core): make path filter configurable #329
Conversation
This change adds a pathFilter option to ParquetReader builder interface because there are some situations where users needs to configure path filter predicates(e.g. They use `_` prefix for partition columns).
.gitignore
Outdated
@@ -11,3 +11,4 @@ target | |||
spark-warehouse | |||
project/.plugins.sbt.swp | |||
project/project | |||
project/metals.sbt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to track by vcs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted 2276a90
import org.scalatest.BeforeAndAfter | ||
import org.scalatest.EitherValues | ||
import org.scalatest.flatspec.AnyFlatSpec | ||
import org.scalatest.matchers.should.Matchers | ||
import org.scalatest.{BeforeAndAfter, EitherValues} | ||
import org.slf4j.{Logger, LoggerFactory} | ||
import org.slf4j.Logger | ||
import org.slf4j.LoggerFactory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just ran organize imports
by scalafmt
@@ -42,6 +46,12 @@ object ParquetReader extends IOOps { | |||
*/ | |||
def filter(filter: Filter): Builder[T] | |||
|
|||
/** @param pathFilter | |||
* optional path filter; ParquetReader traverses paths that match this predicate to resolve partitions. It uses | |||
* org.apache.parquet.hadoop.util.HiddenFileFilter by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mentioning org.apache.parquet.hadoop.util.HiddenFileFilter
here feels leaking implementation details.
Current HiddenFileFilter definition is as simple as !_.getName().startsWith(Set('.','_'))
. Should we define it in com.github.mjakubowski84.parquet4s package instead of using org.apache.parquet.hadoop.util.HiddenFileFilter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet4s relies so much on parquet-hadoop that I am not so concerned about leaking this detail.
However, PathFilter
seems to be quite an esoteric (but static & reusable) option, so I think that it can go into ParquetReader.Options
case class alongside Hadoop Configuration. That way, it will also not cause confusion with an existing filter. What do you think?
This reverts commit d53ed10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind also adding this feature to Akka/Pekko and FS2 modules?
@@ -42,6 +46,12 @@ object ParquetReader extends IOOps { | |||
*/ | |||
def filter(filter: Filter): Builder[T] | |||
|
|||
/** @param pathFilter | |||
* optional path filter; ParquetReader traverses paths that match this predicate to resolve partitions. It uses | |||
* org.apache.parquet.hadoop.util.HiddenFileFilter by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet4s relies so much on parquet-hadoop that I am not so concerned about leaking this detail.
However, PathFilter
seems to be quite an esoteric (but static & reusable) option, so I think that it can go into ParquetReader.Options
case class alongside Hadoop Configuration. That way, it will also not cause confusion with an existing filter. What do you think?
protected def findPartitionedPaths( | ||
path: Path, | ||
configuration: Configuration | ||
configuration: Configuration, | ||
pathFilter: PathFilter = HiddenFileFilter.INSTANCE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This change adds a pathFilter option to ParquetReader builder interface
because there are some situations where users needs to configure path filter
predicates(e.g. They use
_
prefix for partition columns).Currently, there seems no option to change default path filter(
org.apache.parquet.hadoop.util.HiddenFileFilter
)