OpenSearch indexing and queries #2834

Merged
merged 75 commits into from
Aug 16, 2024

phixMe
Member

@phixMe phixMe commented Jun 7, 2024

Problem

Our search currently does not support nested queries on OpenLineage facets, code, linked entities, and IDs. We want
our search to be the best place to explore and catalog OpenLineage-based data.

Opensearch.Demo.mov

Includes

  • Indexing of documents on OL events
  • New APIs to search for jobs and datasets
  • Search UI updates that can be enabled, with a fallback to the legacy search system
  • Local and Helm updates for the new OpenSearch containers this requires (including the dashboard)

To Follow Up

  • Update the job and dataset indexes on all tag-related actions.
  • Add support for more facets and integrations (Trino, Flink...)

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@boring-cyborg boring-cyborg bot added the api API layer changes label Jun 7, 2024

netlify bot commented Jun 7, 2024

Deploy Preview for peppy-sprite-186812 canceled.

Name Link
🔨 Latest commit d096138
🔍 Latest deploy log https://app.netlify.com/sites/peppy-sprite-186812/deploys/66bfdfaf894d82000853712d

@boring-cyborg boring-cyborg bot added the web label Jun 11, 2024
@boring-cyborg boring-cyborg bot added the docker label Jun 26, 2024
@@ -68,6 +68,7 @@ public OpenLineageResource(
  public void create(@Valid @NotNull BaseEvent event, @Suspended final AsyncResponse asyncResponse)
      throws JsonProcessingException, SQLException {
    if (event instanceof LineageEvent) {
      serviceFactory.getSearchService().indexEvent((LineageEvent) event);
Member

If search.enabled=false, will the index call fail/error?

Member Author

Handling in the indexEvent method:

  public void indexEvent(@Valid @NotNull LineageEvent event) {
    if (!searchConfig.isEnabled()) {
      log.debug("Search is disabled, skipping indexing");
      return;
    }
    UUID runUuid = runUuidFromEvent(event.getRun());
    log.debug("Indexing event {}", event);

    if (event.getInputs() != null) {
      indexDatasets(event.getInputs(), runUuid, event);
    }
    if (event.getOutputs() != null) {
      indexDatasets(event.getOutputs(), runUuid, event);
    }
    indexJob(runUuid, event);
  }

Member

OK, when we follow up with a SearchService interface, we'll want to bind search to an engine (psql, opensearch) and can do away with the flag.
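
A minimal sketch of what that follow-up could look like, assuming hypothetical names throughout (`SearchBindingSketch`, `Event`, `NoOpSearchService`, `InMemorySearchService`, `bind`, and the engine strings are illustrative and not taken from the Marquez codebase). The idea is that the implementation is chosen once at startup, so callers no longer need to check an enabled flag before indexing:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types come from the Marquez codebase.
public class SearchBindingSketch {

  /** Minimal stand-in for an OpenLineage event; the real LineageEvent is much richer. */
  static final class Event {
    final String jobNamespace;
    final String jobName;

    Event(String jobNamespace, String jobName) {
      this.jobNamespace = jobNamespace;
      this.jobName = jobName;
    }
  }

  /** Engine-agnostic contract that callers such as a resource class would depend on. */
  interface SearchService {
    void indexEvent(Event event);
  }

  /** Bound when no search engine is configured: indexing becomes a silent no-op. */
  static final class NoOpSearchService implements SearchService {
    @Override
    public void indexEvent(Event event) {
      // Intentionally empty: replaces the "is search enabled?" check at call sites.
    }
  }

  /** Stand-in for an engine-backed implementation; it only records keys in memory. */
  static final class InMemorySearchService implements SearchService {
    final List<String> indexed = new ArrayList<>();

    @Override
    public void indexEvent(Event event) {
      indexed.add(event.jobNamespace + ":" + event.jobName);
    }
  }

  /** Chooses the implementation once, at startup, from an assumed engine name. */
  static SearchService bind(String engine) {
    if (engine == null || engine.equals("none")) {
      return new NoOpSearchService();
    }
    if (engine.equals("memory")) {
      return new InMemorySearchService();
    }
    throw new IllegalArgumentException("Unknown search engine: " + engine);
  }

  public static void main(String[] args) {
    SearchService disabled = bind("none");
    disabled.indexEvent(new Event("food_delivery", "etl_orders")); // does nothing

    InMemorySearchService enabled = (InMemorySearchService) bind("memory");
    enabled.indexEvent(new Event("food_delivery", "etl_orders"));
    System.out.println(enabled.indexed); // prints [food_delivery:etl_orders]
  }
}
```

With this shape, a real engine-backed class would sit where `InMemorySearchService` does, and the call in the resource stays a plain `indexEvent(...)` regardless of configuration.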

* SPDX-License-Identifier: Apache-2.0
*/

package marquez.api;
Member

@wslulciuc wslulciuc Aug 9, 2024

I would move the pkg to marquez.api.v2beta.SearchResource until we promote the API to v2.

build.gradle Outdated
@@ -64,6 +64,8 @@ subprojects {

dependencies {
implementation "org.projectlombok:lombok:${lombokVersion}"
implementation 'org.opensearch.client:opensearch-rest-client:2.15.0'
implementation 'org.opensearch.client:opensearch-java:2.6.0'
Member

Move to api/build.gradle


codecov bot commented Aug 9, 2024

Codecov Report

Attention: Patch coverage is 22.36025% with 125 lines in your changes missing coverage. Please review.

Project coverage is 83.28%. Comparing base (422fd43) to head (d096138).
Report is 1 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...i/src/main/java/marquez/service/SearchService.java | 8.40% | 106 Missing and 3 partials ⚠️ |
| ...c/main/java/marquez/api/v2beta/SearchResource.java | 21.05% | 15 Missing ⚠️ |
| ...src/main/java/marquez/api/OpenLineageResource.java | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2834      +/-   ##
============================================
- Coverage     84.77%   83.28%   -1.49%     
- Complexity     1470     1477       +7     
============================================
  Files           256      259       +3     
  Lines          6626     6785     +159     
  Branches        308      313       +5     
============================================
+ Hits           5617     5651      +34     
- Misses          856      977     +121     
- Partials        153      157       +4     


# Conflicts:
#	.env.example
#	docker-compose.web.yml
run_id: string
name: string
namespace: string
eventType: string
Member

How's eventType used?

Member

@wslulciuc wslulciuc left a comment

Amazing work, @phixMe! We definitely have some follow up work, but great to see the progress on search 💯 🚀 🥇

phixMe and others added 4 commits August 9, 2024 18:02
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
@wslulciuc wslulciuc self-requested a review August 11, 2024 19:33
Member

@wslulciuc wslulciuc left a comment

🔍 ❤️

@phixMe phixMe merged commit ead480b into main Aug 16, 2024
15 of 16 checks passed
@phixMe phixMe deleted the feature/es-client branch August 16, 2024 23:34
alaturqua pushed a commit to alaturqua/marquez that referenced this pull request Aug 21, 2024
* Elasticsearch code.

* Adding basic responses for elasticsearch.

* Saving highlights.

* Saving code cleanup.

* Adding EsSearch.

* Saving partial progress.

* Refinements.

* Small bug fixes.

* Fixing alignment

* Migrating es jobs naming to be specific.

* Adding boilerplate for dataset es search

* Adding datasets.

* Adding polish for more data.

* Empty state and other small enhancements.

* Adding arrow key functionality.

* Removing console log

* Spotless

* Refinements to queries.

* Adding debounce.

* Fixing alignment issues.

* Saving updates for password setting via env config for elasticsearch.

* Setting up startup scripts and adding corresponding waits.

* Adding logs and more fields for jobs.

* Resolving jackson serialization issue.

* Small updates for search display.

* Adding onClick handlers.

* Fixing null cases, adding more search options for datasets.

* Handling enter key.

* Fixing minor encoding and layout issues for spark related open lineage events.

* Additional fixes for text overflow on names and namespaces.

* Fixing indexing problem.

* Transitioning to opensearch.

* Removing elasticsearch references.

* Isolation of search code, calling services.

* Adding config to support multiple instances.

* Spotless

* Adding helm files.

* Adding in stronger password for search.

* Handling debouncing.

* Adding "ADVANCED_SEARCH" configurable variable for web.

* Fixing some tests.

* Moving indexing down a row.

* Spotless

* Putting back removed code.

* Merge spotless resolution.

* Skipping over search for db migration tests.

* Adding search back to migration

* Trying out ci config setting.

* Removing search from base config as a whole.

* Pushing out header updates.

* Review comment on search service init.

* Fixing up dependencies in docker to apply migrations.

* Code review updates and naming changes.

* newline

* Updating for beta vs. non beta endpoints in search resource.

* Moving search resource to its own place.

* Removing prints.

* Removing all helm changes for this work stream.

* Adding back lock file contents.

* Adding header

* Adding middleware proxy.

* Code review updates.

* Moving from outer gradle to api gradle.

* Removing extra containers.

* Removing extra containers.

* Set timeout for seed container to 60s

Signed-off-by: wslulciuc <willy@datakin.com>

* Fixing `--no-search` and frontend config.

* Add check before indexing ol event

Signed-off-by: wslulciuc <willy@datakin.com>

* Fix db migration CI job

Signed-off-by: wslulciuc <willy@datakin.com>

---------

Signed-off-by: wslulciuc <willy@datakin.com>
Co-authored-by: phix <peter.hicks@astronomer.io>
Co-authored-by: wslulciuc <willy@datakin.com>
Signed-off-by: Isa Inalcik <isa.inalcik@gmail.com>
@wslulciuc wslulciuc mentioned this pull request Aug 23, 2024
4 participants