-
Notifications
You must be signed in to change notification settings - Fork 114
Use indexes subdirectory for custom index system path #369
base: master
Are you sure you want to change the base?
Conversation
@imback82 @rapoth @apoorvedave1 @thugsatbay Though this change will be a breaking change, I think we need this sub root dir to avoid any kind of interference. |
|
The default behavior is creating @imback82 could you add some comment? |
I guess the only concern I have is the breaking change as you pointed out. @rapoth? |
I'll give the example of using custom indexing path with Iceberg. Iceberg has its own In the case of Hyperspace, the current possibility of setting the path to whatever value it is needed is the best option although the user needs to know not to use a folder used for other purposes (ie: In my opinion, adding the hardcoded What should some one do if they want to keep the indexes next to their data using custom indexes path but the |
@andrei-ionescu That's a fair point but currently Hyperspace doesn't have "metadata" dir and rely on the directory structure under the system path; so I suggest this sub-root dir. We don't support different index data locations yet (#243), but I think Hyperspace also needs "metadata" dir to maintain indexes located in different paths. |
@sezruby I'm not against having a default folder name for the custom index system path, but I'm against of having it hardcoded (burned in code) as I'm for making the sub-folder a setting at all levels: either if the user chooses the Spark warehouse (default) or dataset path (custom path) or other path if a new suffix is added to the path, the suffix should be a setting. I see that in Code wise, this what I suggest:
And add the new
Just to conclude, I'm not against having this subdirectory but 1) let's make it settable 2) let's make it consistent over all possible pathing options in Hyperspace. |
src/main/scala/com/microsoft/hyperspace/index/PathResolver.scala
Outdated
Show resolved
Hide resolved
@andrei-ionescu Yep sounds good to me. Thanks for the good suggestion :) @rapoth any opinion on this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@imback82 Now at least we have a workaround ( indexDirName = "" ) for the breaking change. WDYT? |
// Config used for setting the system path, which is considered as a "root" path for Hyperspace; | ||
// e.g, indexes are created under the system path. | ||
val INDEX_SYSTEM_PATH = "spark.hyperspace.system.path" | ||
|
||
// Config used for subdirectory name under the system path. | ||
val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe simply DIR or LOCATION instead of DIR_NAME?
@@ -63,9 +63,16 @@ private[hyperspace] class PathResolver(conf: SQLConf, hadoopConf: Configuration) | |||
* @return Hyperspace index system path. | |||
*/ | |||
def systemPath: Path = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bit confused by now about what a system path is in Hyperspace, because the value of INDEX_SYSTEM_PATH and the return value of this function can be different, yet both are being called "system paths". Which one is the "system path"? Shouldn't we name either one differently? Maybe we can 1) improve the docstring, 2) change the config name from INDEX_SYSTEM_PATH to something like INDEX_SYSTEM_ROOT, and/or 3) use a more self-explanatory name than system path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think it's better to change the function name rather than the config string. So my suggestion is 1) keep "system path" as it is (w/o subdir) 2) give another name to "system path"/subdir path.
How about indexLocationPath
?
cc @imback82
I think this PR is good in general, but we need some clarification and write something about how and where data is stored in Hyperspace. Maybe there is one already? |
Maybe this one? https://microsoft.github.io/hyperspace/docs/toh-indexes-on-the-lake/ |
// Config used for setting the system path, which is considered as a "root" path for Hyperspace; | ||
// e.g, indexes are created under the system path. | ||
val INDEX_SYSTEM_PATH = "spark.hyperspace.system.path" | ||
|
||
// Config used for subdirectory name under the system path. | ||
val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName" | ||
val INDEX_DIR_NAME_DEFAULT = "indexes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: will change the default dir name as "hyperspace".
What is the context for this pull request?
What changes were proposed in this pull request?
Use subdirectory for custom index system path for consistency & maintenance.
e.g. with indexDirName = "indexes"
Added new config for the subdir name
Does this PR introduce any user-facing change?
Yes, this PR changes the system directory path.
This is a breaking change.
How was this patch tested?
Need to fix tests