Use indexes subdirectory for custom index system path #369

sezruby · 2021-02-26T20:22:44Z

What is the context for this pull request?

Tracking Issue: n/a
Parent Issue: n/a
Dependencies: n/a

What changes were proposed in this pull request?

Use subdirectory for custom index system path for consistency & maintenance.
e.g. with indexDirName = "indexes"

current:
- /custom/system/path/index1
- /custom/system/path/index2

with this change:
- /custom/system/path/indexes/index1
- /custom/system/path/indexes/index2

Added new config for the subdir name

// Config used for subdirectory name under the system path.
  val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName"
  val INDEX_DIR_NAME_DEFAULT = "indexes"

Does this PR introduce any user-facing change?

Yes, this PR changes the system directory path.
This is a breaking change.

How was this patch tested?

Need to fix tests

sezruby · 2021-02-26T20:36:15Z

@imback82 @rapoth @apoorvedave1 @thugsatbay

Though this change will be a breaking change, I think we need this sub root dir to avoid any kind of interference.
We shouldn't rely on user's input. WDYT?

thugsatbay · 2021-02-26T21:13:44Z

One advantage I see is that user index are more organized.
Will we be keeping some files in the indexes folder that denote some global state of all the index created so far ?
What kind of interference are we expecting ?
Will user ever come to the hyperspace index dir and play or move files ?

sezruby · 2021-02-27T00:10:49Z

we might have some metadata later
if there's other directories in there, index manager tries to find or build IndexLogEntry on it and causes some exception, or directory size estimation or something like that
rarely but possible

The default behavior is creating indexes under the spark warehouse dir.
But if the system path config is set, we don't create indexes under it.

@imback82 could you add some comment?

imback82 · 2021-02-27T00:16:38Z

I guess the only concern I have is the breaking change as you pointed out. @rapoth?

andrei-ionescu · 2021-03-03T10:32:37Z

I'll give the example of using custom indexing path with Iceberg.

Iceberg has its own metadata folder where it's keeping its own information. That is the default value but can be changed if needed through some table or custom define Spark parameters. This is named metadata_location and is a relative path starting from the dataset path. In the case of Iceberg it is set in stone that the main use case is keeping the metadata next to the data.

In the case of Hyperspace, the current possibility of setting the path to whatever value it is needed is the best option although the user needs to know not to use a folder used for other purposes (ie: metadata in Iceberg case).

In my opinion, adding the hardcoded indexes value inside the path adds some limitations to the current freedom.

What should some one do if they want to keep the indexes next to their data using custom indexes path but the indexes folder inside the dataset is already used by other indexing mechanism?

sezruby · 2021-03-04T04:01:39Z

@andrei-ionescu That's a fair point but currently Hyperspace doesn't have "metadata" dir and rely on the directory structure under the system path; so I suggest this sub-root dir. We don't support different index data locations yet (#243), but I think Hyperspace also needs "metadata" dir to maintain indexes located in different paths.

andrei-ionescu · 2021-03-04T10:04:49Z

@sezruby I'm not against having a default folder name for the custom index system path, but I'm against of having it hardcoded (burned in code) as indexes.

I'm for making the sub-folder a setting at all levels: either if the user chooses the Spark warehouse (default) or dataset path (custom path) or other path if a new suffix is added to the path, the suffix should be a setting.

I see that in HyperspaceConf are these two constants: INDEX_SYSTEM_PATH and INDEXES_DIR. I see that INDEXES_DIR is not used in code and I would suggest to make use of it too, or rename it for a better understanding.

Code wise, this what I suggest:

def systemPath: Path = {
  val indexesSuffix = conf.getConfString(IndexConstants.INDEXES_DIR_SUFFIX, IndexConstants.INDEXES_DIR)
  val defaultIndexesPath = conf.getConfString("spark.sql.warehouse.dir")
  new Path(conf.getConfString(IndexConstants.INDEX_SYSTEM_PATH, defaultIndexesPath), indexesSuffix)
}

And add the new INDEXES_DIR_SUFFIX into IndexConstants

val INDEXES_DIR_SUFFIX = "spark.hyperspace.indexes.dir.suffix"

Just to conclude, I'm not against having this subdirectory but 1) let's make it settable 2) let's make it consistent over all possible pathing options in Hyperspace.

src/main/scala/com/microsoft/hyperspace/index/PathResolver.scala

sezruby · 2021-03-05T07:45:43Z

@andrei-ionescu Yep sounds good to me. Thanks for the good suggestion :)

@rapoth any opinion on this PR?

andrei-ionescu

LGTM

sezruby · 2021-03-11T02:01:00Z

@imback82 Now at least we have a workaround ( indexDirName = "" ) for the breaking change. WDYT?

clee704 · 2021-04-09T04:54:09Z

src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala

  // Config used for setting the system path, which is considered as a "root" path for Hyperspace;
  // e.g, indexes are created under the system path.
  val INDEX_SYSTEM_PATH = "spark.hyperspace.system.path"

+  // Config used for subdirectory name under the system path.
+  val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName"


nit: maybe simply DIR or LOCATION instead of DIR_NAME?

clee704 · 2021-04-09T05:30:17Z

src/main/scala/com/microsoft/hyperspace/index/PathResolver.scala

@@ -63,9 +63,16 @@ private[hyperspace] class PathResolver(conf: SQLConf, hadoopConf: Configuration)
   * @return Hyperspace index system path.
   */
  def systemPath: Path = {


I'm a little bit confused by now about what a system path is in Hyperspace, because the value of INDEX_SYSTEM_PATH and the return value of this function can be different, yet both are being called "system paths". Which one is the "system path"? Shouldn't we name either one differently? Maybe we can 1) improve the docstring, 2) change the config name from INDEX_SYSTEM_PATH to something like INDEX_SYSTEM_ROOT, and/or 3) use a more self-explanatory name than system path.

Good point. I think it's better to change the function name rather than the config string. So my suggestion is 1) keep "system path" as it is (w/o subdir) 2) give another name to "system path"/subdir path.

How about indexLocationPath?

cc @imback82

clee704 · 2021-04-09T05:41:54Z

I think this PR is good in general, but we need some clarification and write something about how and where data is stored in Hyperspace. Maybe there is one already?

sezruby · 2021-04-09T05:54:42Z

Maybe there is one already?

Maybe this one? https://microsoft.github.io/hyperspace/docs/toh-indexes-on-the-lake/
I think it's good to add this link to quick-start guide for reference.

sezruby · 2021-04-09T05:58:06Z

src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala

  // Config used for setting the system path, which is considered as a "root" path for Hyperspace;
  // e.g, indexes are created under the system path.
  val INDEX_SYSTEM_PATH = "spark.hyperspace.system.path"

+  // Config used for subdirectory name under the system path.
+  val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName"
+  val INDEX_DIR_NAME_DEFAULT = "indexes"


NOTE: will change the default dir name as "hyperspace".

Use indexes subdirectory for custom index system path

d28e14f

andrei-ionescu suggested changes Mar 4, 2021

View reviewed changes

src/main/scala/com/microsoft/hyperspace/index/PathResolver.scala Outdated Show resolved Hide resolved

sezruby added 2 commits March 4, 2021 23:47

Merge remote-tracking branch 'upstream/master' into path

2a47dc7

Add INDEX_DIR_NAME config

fec013a

sezruby changed the title ~~[WIP] Use indexes subdirectory for custom index system path~~ Use indexes subdirectory for custom index system path Mar 5, 2021

sezruby requested review from imback82, rapoth, apoorvedave1 and thugsatbay March 5, 2021 17:58

sezruby self-assigned this Mar 7, 2021

andrei-ionescu previously approved these changes Mar 10, 2021

View reviewed changes

clee704 reviewed Apr 9, 2021

View reviewed changes

sezruby commented Apr 9, 2021

View reviewed changes

sezruby added 2 commits April 12, 2021 10:00

Merge remote-tracking branch 'upstream/master' into path

4423b3b

Change systemPath => indexLocationDir

4537281

sezruby dismissed andrei-ionescu’s stale review via 4537281 April 12, 2021 18:03

Merge branch 'master' into path

a2833ef

sezruby added the breaking changes label Apr 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use indexes subdirectory for custom index system path #369

Use indexes subdirectory for custom index system path #369

sezruby commented Feb 26, 2021 •

edited

Loading

sezruby commented Feb 26, 2021

thugsatbay commented Feb 26, 2021

sezruby commented Feb 27, 2021 •

edited

Loading

imback82 commented Feb 27, 2021

andrei-ionescu commented Mar 3, 2021

sezruby commented Mar 4, 2021

andrei-ionescu commented Mar 4, 2021 •

edited

Loading

sezruby commented Mar 5, 2021

andrei-ionescu left a comment

sezruby commented Mar 11, 2021

clee704 Apr 9, 2021

clee704 Apr 9, 2021

sezruby Apr 9, 2021

clee704 commented Apr 9, 2021

sezruby commented Apr 9, 2021

sezruby Apr 9, 2021

Use indexes subdirectory for custom index system path #369

Are you sure you want to change the base?

Use indexes subdirectory for custom index system path #369

Conversation

sezruby commented Feb 26, 2021 • edited Loading

What is the context for this pull request?

What changes were proposed in this pull request?

Does this PR introduce any user-facing change?

How was this patch tested?

sezruby commented Feb 26, 2021

thugsatbay commented Feb 26, 2021

sezruby commented Feb 27, 2021 • edited Loading

imback82 commented Feb 27, 2021

andrei-ionescu commented Mar 3, 2021

sezruby commented Mar 4, 2021

andrei-ionescu commented Mar 4, 2021 • edited Loading

sezruby commented Mar 5, 2021

andrei-ionescu left a comment

Choose a reason for hiding this comment

sezruby commented Mar 11, 2021

clee704 Apr 9, 2021

Choose a reason for hiding this comment

clee704 Apr 9, 2021

Choose a reason for hiding this comment

sezruby Apr 9, 2021

Choose a reason for hiding this comment

clee704 commented Apr 9, 2021

sezruby commented Apr 9, 2021

sezruby Apr 9, 2021

Choose a reason for hiding this comment

sezruby commented Feb 26, 2021 •

edited

Loading

sezruby commented Feb 27, 2021 •

edited

Loading

andrei-ionescu commented Mar 4, 2021 •

edited

Loading