
feat: Bump Lucene@8.11.3 and lucene-gosen@8.11.0 #10810

Open: wants to merge 8 commits into master

Conversation

@miurahr (Contributor) commented Aug 6, 2024

Update the Apache Lucene version from 5.5.5 to 8.11.3, the latest version that supports Java 8.
For dependency resolution, also update Lucene-Gosen to 8.11.0.

Unfortunately, the Gosen project does not publish this version to Maven Central, so I use my own build of Gosen.

Implement #3628

Changes

  • regenerate ngram test data
  • Because o.a.l.d.LongField was removed in Lucene 8.x, switch to LongPoint; affected:
    - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AggregatedNgramToLucene.java
    - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java
  • MultiFields.getFields(reader) is deprecated; affected:
    - languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java
    - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java
    - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java
    - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/NeededNGramCounter.java
    - languagetool-standalone/src/main/java/org/languagetool/dev/HomophoneOccurrenceDumper.java
  • limitedTopDocs.topDocs.totalHits now requires .value, which returns long instead of int
  • Change DirectoryReader.open(IndexWriter, boolean) to open(IndexWriter, boolean, boolean); affected:
    - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java
  • languagetool-wikipedia/pom.xml: change lucene.version
  • CharacterUtils changed fill to a static function instead of an instance method
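To make the list above concrete, here is a minimal, self-contained sketch of the LongField-to-LongPoint migration and the new totalHits access. Field names and values are made up for illustration, it assumes lucene-core 8.11.x on the classpath, and it is not code from this PR:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LongPointMigration {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        Document doc = new Document();
        doc.add(new StringField("ngram", "the big", Field.Store.YES));
        // Lucene 5.x: one field was both indexed and stored:
        //   doc.add(new LongField("count", 12345L, Field.Store.YES));
        // Lucene 8.x: LongPoint indexes the value for queries,
        // and a separate StoredField keeps it retrievable from the hit document.
        doc.add(new LongPoint("count", 12345L));
        doc.add(new StoredField("count", 12345L));
        writer.addDocument(doc);
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(LongPoint.newExactQuery("count", 12345L), 1);
        // totalHits is now a TotalHits object; its .value field is a long.
        System.out.println(hits.totalHits.value);
      }
    }
  }
}
```

The same index-field/stored-field split is what the migration applies in AggregatedNgramToLucene and CommonCrawlToNgram, where counts must stay retrievable while remaining queryable.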

Summary by CodeRabbit

  • New Features

    • Enhanced handling of n-gram counts for improved indexing efficiency.
    • Improved term processing flexibility in the identification of uppercase phrases.
  • Bug Fixes

    • Updated total hits access method to ensure consistency across the application.
  • Documentation

    • Clarified import statements and dependencies in project configuration files.
  • Chores

    • Updated multiple library dependencies to their latest versions for improved functionality.

@miurahr miurahr force-pushed the topic/miurahr/migrate-lucene-8.11.3 branch from dc932e8 to b3ae165 Compare August 6, 2024 08:33
@miurahr miurahr marked this pull request as ready for review August 7, 2024 03:32
Signed-off-by: Hiroshi Miura <miurahr@linux.com>
@miurahr (Contributor, Author) commented Aug 29, 2024

You can check the change with the upgraded English ngram data, which is now in the Lucene 8 format.
https://github.com/miurahr/lt-8-ngram-data

Updated: I've updated the download permission, and you can now download the data (19:00, 30 Aug JST).

@miurahr (Contributor, Author) commented Aug 30, 2024

Here is an example usage of the patch.
This is a container configuration that launches a languagetool-server built from source.
It uses a privately forked version of LT with the proposed modification and the Gradle build system.
https://github.com/miurahr/languagetool-server-container

Updated: I've fixed the entry point script, which had an error in the ngram download URL (19:00, 30 Aug JST).

@danielnaber (Member) commented:

Thanks!

Unfortunately, the Gosen project does not publish this version to Maven Central, so I use my own build of Gosen.

Could you explain this in more detail? Are all the dependencies still available at Maven Central?

@miurahr (Contributor, Author) commented Aug 30, 2024

Thanks!

Unfortunately, the Gosen project does not publish this version to Maven Central, so I use my own build of Gosen.

Could you explain this in more detail? Are all the dependencies still available at Maven Central?

Sure.

They commit changes to track Lucene versions and build it for their in-house project, but they do not publish releases frequently.
They have published only two versions, '6.0.1' and '6.2.1', on Maven Central.

Here is the project home of the forked lucene-gosen project.
https://github.com/omegat-org/lucene-gosen

The rel-8.11 branch builds lucene-gosen against Lucene 8.11.x.

You can find lucene-gosen on Maven Central; please see https://mvnrepository.com/artifact/org.omegat.lucene/lucene-gosen

The OmegaT project has published three versions:

  • '5.0.0' - used for OmegaT 6.0 with LanguageTool 3.5 and Lucene 5.2.1
  • '5.5.1' - used for OmegaT 6.1 with LanguageTool 6.1 and Lucene 5.5.5
  • '8.11.0' - for this PR, forked from a specific commit (*) in the history and built with Lucene 8.11.3

Which commit was chosen?

I started the rel-8.11 branch from the commit that merged lucene-gosen PR#32,
"Support Solr/Lucene 8.0.0".

What modifications were made to lucene-gosen?

Dropped the unused dependency on the restlet library.

@miurahr (Contributor, Author) commented Aug 30, 2024

Are all the dependencies still available at Maven Central?

Yes, all the dependencies are available at Maven Central.

@miurahr (Contributor, Author) commented Aug 30, 2024

To upgrade the other languages' ngram data to the Lucene 8 format, https://github.com/hakanai/luceneupgrader will help.

@danielnaber (Member) commented:

@miurahr Thanks again for your contribution. I've tested the PR on a live server, and everything looks good so far. The plan is to merge it shortly after the LT 6.5 release on Friday (we're in a feature freeze, so we won't be able to merge it before then).


coderabbitai bot commented Oct 1, 2024

Walkthrough

The changes primarily involve updates to the handling of n-gram indexing and retrieval in Lucene across multiple files. Key modifications include the transition from using LongField to LongPoint and StoredField for numeric fields, updates to how total hits are accessed, and improvements in the term processing logic. The overall structure of the indexing process remains intact, but the changes enhance efficiency and correctness in how n-grams are stored and retrieved.

Changes

Files and change summary:

  • languagetool-dev/src/main/java/.../AggregatedNgramToLucene.java and languagetool-dev/src/main/java/.../CommonCrawlToNgram.java: transition from LongField to LongPoint and StoredField for numeric data indexing; updates to method signatures and internal logic for handling n-gram counts and total hits.
  • languagetool-dev/src/main/java/.../GermanUppercasePhraseFinder.java: modifications to term-processing logic, allowing more flexible handling of terms from the Lucene index while maintaining core functionality.
  • pom.xml: updates to dependency versions, including lucene-gosen and lucene, reflecting the shift to newer library versions.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (11)
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java (1)

Line range hint 1-73: Suggest improvements for error handling, resource management, and code readability.

While the main change looks good, there are several areas where the overall code could be improved:

  1. Error Handling: Add try-catch blocks to handle potential IOExceptions and NumberFormatExceptions.
  2. Resource Management: Use try-with-resources to ensure proper closing of IndexReader and FSDirectory.
  3. Variable Naming: Rename 'max' to a more descriptive name like 'maxCount'.
  4. Constants: Define the progress print interval as a constant.

Here's a suggested refactoring that addresses these points:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class LargestNGramFinder {
    private static final int PROGRESS_INTERVAL = 10_000;

    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("Usage: " + LargestNGramFinder.class.getSimpleName() + " <ngramIndexDir>");
            System.exit(1);
        }

        try (FSDirectory fsDir = FSDirectory.open(new File(args[0]).toPath());
             IndexReader reader = DirectoryReader.open(fsDir)) {

            IndexSearcher searcher = new IndexSearcher(reader);
            Terms terms = MultiTerms.getTerms(reader, "ngram");
            long maxCount = 0;
            String maxTerm = "";
            TermsEnum termsEnum = terms.iterator();
            int count = 0;
            BytesRef next;

            while ((next = termsEnum.next()) != null) {
                String term = next.utf8ToString();
                try {
                    TopDocs topDocs = searcher.search(new TermQuery(new Term("ngram", term)), 5);
                    int docId = topDocs.scoreDocs[0].doc;
                    Document document = reader.document(docId);
                    long thisCount = Long.parseLong(document.get("count"));
                    if (maxCount < thisCount) {
                        maxCount = thisCount;
                        maxTerm = term;
                    }
                    if (count % PROGRESS_INTERVAL == 0) {
                        System.out.println(count + " -> " + topDocs.totalHits + " for " + term + " -> " + thisCount + ", max so far: " + maxCount + " for '" + maxTerm + "'");
                    }
                    count++;
                } catch (NumberFormatException e) {
                    System.err.println("Error parsing count for term: " + term);
                }
            }
            System.out.println("Max: " + maxCount + " for " + maxTerm);
        } catch (IOException e) {
            System.err.println("Error processing index: " + e.getMessage());
        }
    }
}

This refactoring improves error handling, ensures proper resource management, and enhances code readability.

languagetool-dev/src/main/java/org/languagetool/dev/bigdata/TextIndexCreator.java (1)

Multiple StandardAnalyzer Initializations Found Without CharArraySet.EMPTY_SET

Several instances of StandardAnalyzer are initialized without using CharArraySet.EMPTY_SET, which may lead to inconsistencies with the updated Lucene 8.11.3 API:

  • languagetool-wikipedia/src/main/java/org/languagetool/dev/dumpcheck/SentenceSourceIndexer.java
  • languagetool-dev/src/main/java/org/languagetool/dev/wordsimilarity/SimilarWordFinder.java
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/FrequencyIndexCreator.java
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AggregatedNgramToLucene.java
  • languagetool-dev/src/main/java/org/languagetool/dev/EmptyLuceneIndexCreator.java

Please update these initializations to use CharArraySet.EMPTY_SET to maintain consistency and adhere to the latest API standards.

🔗 Analysis chain

Line range hint 47-47: LGTM: StandardAnalyzer initialization updated correctly.

The StandardAnalyzer is now initialized with CharArraySet.EMPTY_SET, which is consistent with the Lucene 8.11.3 API. The functionality remains unchanged.

To ensure this change doesn't affect other parts of the codebase, please run the following script:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for other occurrences of StandardAnalyzer initialization
# that might need similar updates.

# Test: Search for StandardAnalyzer usage
rg --type java -A 5 'new StandardAnalyzer\('

Length of output: 661


Script:

#!/bin/bash
# Description: Find StandardAnalyzer initializations not using CharArraySet.EMPTY_SET

# Step 1: Find all StandardAnalyzer initializations
rg --type java 'new StandardAnalyzer\(' > all_standard_analyzer_initializations.txt

# Step 2: Exclude those using CharArraySet.EMPTY_SET
grep -v 'CharArraySet\.EMPTY_SET' all_standard_analyzer_initializations.txt

Length of output: 1021

languagetool-dev/src/main/java/org/languagetool/dev/bigdata/NeededNGramCounter.java (1)

61-61: LGTM! Consider adding null check for robustness.

The change from MultiFields.getFields(reader).terms("ngram") to MultiTerms.getTerms(reader, "ngram") is appropriate and aligns with the updated Lucene API. This new method provides a more direct way to retrieve the Terms for the "ngram" field.

For improved robustness, consider adding a null check before proceeding with the terms.iterator() call:

-Terms terms = MultiTerms.getTerms(reader, "ngram");
-TermsEnum termsEnum = terms.iterator();
+Terms terms = MultiTerms.getTerms(reader, "ngram");
+if (terms == null) {
+    System.out.println("No terms found for 'ngram' field");
+    return;
+}
+TermsEnum termsEnum = terms.iterator();

This will prevent a potential NullPointerException if the "ngram" field doesn't exist in the index.

languagetool-wikipedia/src/main/java/org/languagetool/dev/index/AnyCharTokenizer.java (2)

85-85: LGTM: CharacterUtils usage updated correctly.

The fill method call has been correctly updated to use the static method from CharacterUtils. This change is consistent with the new Lucene API and makes the code more concise.

Consider adding a comment explaining the purpose of this line, as it's a crucial part of the tokenizer's functionality:

// Fill the buffer with characters from the input stream
CharacterUtils.fill(this.ioBuffer, this.input);

99-99: LGTM: Character handling updated correctly.

The codePointAt method call has been correctly updated to use the standard Java Character class method. This change simplifies the code and removes the dependency on the Lucene-specific CharacterUtils for this operation.

Consider caching the result of this.ioBuffer.getBuffer() to potentially improve performance, especially if this method is called frequently in a loop:

char[] buffer = this.ioBuffer.getBuffer();
int c = Character.codePointAt(buffer, this.bufferIndex);

This change would avoid repeated method calls to getBuffer() and might lead to a small performance improvement.

languagetool-core/src/main/java/org/languagetool/languagemodel/LuceneSingleIndexLanguageModel.java (1)

197-199: LGTM: Consistent update for totalHits access with a suggestion

The change from docs.totalHits > 2000 to docs.totalHits.value > 2000 is consistent with the previous updates and correctly reflects the new Lucene API. The error message has been appropriately updated to include docs.totalHits.value.

Consider reviewing the 2000 match limit:
The current implementation throws a RuntimeException when there are more than 2000 matches. This hard limit might be too restrictive for some use cases. Consider implementing a more flexible approach, such as:

  1. Making the limit configurable.
  2. Implementing pagination to handle large result sets.
  3. Logging a warning instead of throwing an exception, and returning the best 2000 matches.

Example of making the limit configurable:

-      if (docs.totalHits.value > 2000) {
-        throw new RuntimeException("More than 2000 matches for '" + term + "' not supported for performance reasons: " +
-                                   docs.totalHits.value + " matches in " + luceneSearcher.directory);
+      private static final int MAX_MATCHES = 2000;
+      if (docs.totalHits.value > MAX_MATCHES) {
+        throw new RuntimeException("More than " + MAX_MATCHES + " matches for '" + term + "' not supported for performance reasons: " +
+                                   docs.totalHits.value + " matches in " + luceneSearcher.directory);
       }

This change would make it easier to adjust the limit in the future if needed.

languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java (1)

274-279: LGTM! Consider adding a comment for clarity.

The changes to the assertMatches method look good. Updating the parameter type from int to long for expectedMatches allows for handling larger result sets, which is a good practice. The update to accessing total hits is consistent with the changes in the Lucene API for version 8.11.3.

Consider adding a brief comment explaining the reason for using totalHits.value to improve code readability:

 private void assertMatches(AbstractPatternRule patternRule, long expectedMatches) throws Exception {
   PatternRuleQueryBuilder queryBuilder = new PatternRuleQueryBuilder(language, searcher);
   Query query = queryBuilder.buildRelaxedQuery(patternRule);
   //System.out.println("QUERY: " + query);
+  // Use totalHits.value to get the exact number of hits (Lucene 8.x API)
   long matches = searcher.search(query, 1000).totalHits.value;
   assertEquals("Query failed: " + query, expectedMatches, matches);
 }
pom.xml (1)

Line range hint 158-167: Address TODO comments in pom.xml

There are several TODO comments in the pom.xml file, indicating potential future work or issues to be addressed. For example:

  • "TODO: Check errors on compile in wikipedia, dev and rpm"
  • "TODO: Check compatibility"
  • "TODO: Check for replacement"

These comments suggest that there might be some incomplete work or known issues with the current changes.

Consider creating GitHub issues to track these TODOs and ensure they are addressed in future updates. Would you like me to create draft issues for each of these TODOs?

languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java (1)

70-70: Remove or Relocate Commented Test Code

The line // term = "persischer Golf"; // for testing is commented out test code. Consider removing it or moving it to an appropriate testing environment to keep the production code clean and maintainable.

languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java (2)

Line range hint 180-267: Optimize index reader refresh logic

Repeatedly opening DirectoryReader instances without proper closure can lead to resource leaks. Consider using DirectoryReader.openIfChanged to refresh the reader only when there are changes, and ensure that previous readers are properly closed.

Apply this diff to optimize reader handling:

-// Not sure why this doesn't work, should be faster:
-/*DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
-if (newReader != null) {
-  reader = newReader;
-}*/
-index.reader = DirectoryReader.open(index.indexWriter);
+DirectoryReader newReader = DirectoryReader.openIfChanged(index.reader, index.indexWriter);
+if (newReader != null) {
+    index.reader.close();
+    index.reader = newReader;
+    index.searcher = new IndexSearcher(index.reader);
+}

Handle totalHits relation for accurate hit counts

Several usages of totalHits do not consider the relation field, which may lead to inaccurate hit count handling. Please update the following instances to handle the relation appropriately:

  • languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java:
    • Lines where totalHits is accessed without .value.
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AutomaticConfusionRuleEvaluator.java:
  • languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java:
  • languagetool-dev/src/main/java/org/languagetool/dev/wordsimilarity/SimilarWordFinder.java:
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java:
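To illustrate the value/relation distinction, here is a small self-contained sketch. It models TotalHits with a simplified local stand-in (the real class is org.apache.lucene.search.TotalHits in lucene-core), so the types below are illustrative, not the actual Lucene API surface:

```java
public class TotalHitsDemo {
  // Simplified stand-in for org.apache.lucene.search.TotalHits (Lucene 8+):
  // a hit count plus a relation saying whether that count is exact.
  enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

  static final class TotalHits {
    final long value;
    final Relation relation;
    TotalHits(long value, Relation relation) { this.value = value; this.relation = relation; }
  }

  // Honor the relation: since Lucene 8, collectors may stop counting at a
  // threshold, in which case value is only a lower bound on the hit count.
  static String describe(TotalHits hits) {
    return hits.relation == Relation.EQUAL_TO
        ? Long.toString(hits.value)   // exact count
        : ">= " + hits.value;         // lower bound
  }

  public static void main(String[] args) {
    System.out.println(describe(new TotalHits(1523, Relation.EQUAL_TO)));
    System.out.println(describe(new TotalHits(2000, Relation.GREATER_THAN_OR_EQUAL_TO)));
  }
}
```

Code that compares totalHits.value against a limit should apply the same check before treating the number as exact.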
🔗 Analysis chain

Line range hint 186-198: Handle TotalHits relation for accurate hit counts

In Lucene 8.x, totalHits has changed from an int to a TotalHits object, which includes both value and relation. While accessing topDocs.totalHits.value provides the hit count, it's important to consider the relation field to determine if the hit count is exact or a lower bound (e.g., when using TOTAL_HITS_THRESHOLD).

Run the following script to check for other usages of totalHits that may need to handle the relation field:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find usages of totalHits to ensure proper handling of TotalHits object.

rg --type java -A 2 $'totalHits\\b(?!\\.)'

Length of output: 273


Script:

#!/bin/bash
# Description: Find usages of totalHits to ensure proper handling of TotalHits object using PCRE2.

rg --pcre2 --type java -A 2 'totalHits\b(?!\.)'

Length of output: 3938

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR, between commits 6706526 and 593c282.

📒 Files selected for processing (18)
  • languagetool-core/src/main/java/org/languagetool/languagemodel/LuceneSingleIndexLanguageModel.java (2 hunks)
  • languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java (1 hunks)
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AggregatedNgramToLucene.java (2 hunks)
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java (4 hunks)
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java (3 hunks)
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java (2 hunks)
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/NeededNGramCounter.java (1 hunks)
  • languagetool-dev/src/main/java/org/languagetool/dev/bigdata/TextIndexCreator.java (1 hunks)
  • languagetool-language-modules/ja/pom.xml (1 hunks)
  • languagetool-standalone/src/main/java/org/languagetool/dev/HomophoneOccurrenceDumper.java (2 hunks)
  • languagetool-wikipedia/pom.xml (1 hunks)
  • languagetool-wikipedia/src/main/java/org/languagetool/dev/dumpcheck/SentenceSourceIndexer.java (1 hunks)
  • languagetool-wikipedia/src/main/java/org/languagetool/dev/index/AnyCharTokenizer.java (3 hunks)
  • languagetool-wikipedia/src/main/java/org/languagetool/dev/index/PatternRuleQueryBuilder.java (1 hunks)
  • languagetool-wikipedia/src/main/java/org/languagetool/dev/index/Searcher.java (6 hunks)
  • languagetool-wikipedia/src/main/java/org/languagetool/dev/index/SearcherResult.java (2 hunks)
  • languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java (1 hunks)
  • pom.xml (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • languagetool-wikipedia/src/main/java/org/languagetool/dev/dumpcheck/SentenceSourceIndexer.java
🔇 Additional comments (19)
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java (1)

50-50: Approve the Lucene API update with a suggestion for performance verification.

The change from using MultiFields.getFields(reader) followed by fields.terms("ngram") to directly using MultiTerms.getTerms(reader, "ngram") aligns well with the Lucene 8.x API. This modification simplifies the code and maintains the core functionality.

To ensure this change doesn't introduce any performance regressions, consider running a performance test. You can use the following script to check if there are any other occurrences of the old API usage that might need updating:

This script will help identify any other locations in the codebase where similar updates might be necessary.

languagetool-language-modules/ja/pom.xml (1)

43-46: Verify the new lucene-gosen dependency.

The group ID for the lucene-gosen dependency has been changed from com.github.lucene-gosen to org.omegat.lucene. This change aligns with the PR objectives, which mention using a personal build of Gosen. However, we should ensure that this new dependency is correctly configured and available.

Run the following script to verify the availability and correctness of the new dependency:

The change is approved, pending successful verification of the new dependency.

languagetool-wikipedia/src/main/java/org/languagetool/dev/index/SearcherResult.java (1)

38-38: LGTM! Approve the changes to luceneMatchCount.

The modifications to change luceneMatchCount from int to long are correct and consistent throughout the class. This change allows for handling larger datasets, which is a good improvement.

Let's verify that these changes don't break any existing code:

If any of these searches return results, it might indicate areas of the codebase that need to be updated to work with the new long type.

Also applies to: 84-90

✅ Verification successful

Verification Successful: No usages of int-based luceneMatchCount methods found.

The changes to luceneMatchCount from int to long have been successfully verified and do not affect existing code.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining usage of int-based luceneMatchCount methods

# Test 1: Search for any remaining int-based setter calls
echo "Searching for int-based setter calls:"
rg --type java 'setLuceneMatchCount\s*\(\s*\d+\s*\)'

# Test 2: Search for any explicit int casts when calling the getter
echo "Searching for int casts of getLuceneMatchCount:"
rg --type java '\(int\)\s*\w+\.getLuceneMatchCount\(\)'

# Test 3: Search for any comparisons that might assume int return type
echo "Searching for potential int-based comparisons:"
rg --type java 'getLuceneMatchCount\(\)\s*([<>=!]=?|[&|^])\s*\d+'

Length of output: 473

languagetool-dev/src/main/java/org/languagetool/dev/bigdata/TextIndexCreator.java (2)

22-22: LGTM: Import statement updated correctly.

The import for CharArraySet has been updated to reflect the package reorganization in Lucene 8.11.3. This change is consistent with the library upgrade mentioned in the PR objectives.


Line range hint 1-89: Overall, the changes look good. Consider verifying similar updates across the project.

The modifications in this file are minimal and correctly reflect the Lucene 8.11.3 upgrade. The import statement and StandardAnalyzer initialization have been updated appropriately. The overall functionality of the TextIndexCreator class remains unchanged.

To ensure consistency across the project, please run the following script to identify other files that might require similar updates:

#!/bin/bash
# Description: Identify other files that might need updates due to Lucene API changes

# Test 1: Search for CharArraySet imports that might need updating
echo "Files with potential CharArraySet import issues:"
rg --type java 'import org\.apache\.lucene\.analysis\.util\.CharArraySet'

# Test 2: Search for StandardAnalyzer initializations that might need updating
echo "\nFiles with potential StandardAnalyzer initialization issues:"
rg --pcre2 --type java 'new StandardAnalyzer\((?!CharArraySet\.EMPTY_SET)'
languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java (1)

50-75: Approved with suggestions for improvement

The changes successfully update the code to work with newer versions of Lucene and provide more flexibility by iterating over all indexed fields. This is a good improvement in terms of compatibility and extensibility.

However, there are a few points to consider:

  1. Performance: If there are many indexed fields, iterating over all of them might impact performance. Consider adding a check to only process relevant fields if known.

  2. Error messages: The error messages still reference the "ngram" field specifically (e.g., line 75). These should be updated to be more generic or to include the current field name.

  3. Logging: Consider adding some debug logging to show which fields are being processed, especially if multiple fields are expected.

Here's a suggested improvement for the error message:

- throw new RuntimeException("More hits than expected for " + term + ": " + topDocs.totalHits);
+ throw new RuntimeException("More hits than expected for term '" + term + "' in field '" + field + "': " + topDocs.totalHits.value);

Also, consider adding a debug log at the start of the field iteration:

System.out.println("Processing field: " + field);

To ensure that the changes don't introduce any regressions, please run the following verification script:

This script will help identify any other parts of the codebase that might need similar updates.

✅ Verification successful

Verification Successful

The review comment has been thoroughly verified. The only reference to 'ngram' outside of StartTokenCounter.java is within a comment in LangIdentExamples.java, which does not affect the functionality. No other uses of MultiFields.getFields or problematic totalHits usages were found related to the changes.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the StartTokenCounter class is used correctly throughout the codebase

# Test: Check for any direct references to the 'ngram' field that might need updating
echo "Checking for 'ngram' field references:"
rg --type java "'ngram'" --glob '!StartTokenCounter.java'

# Test: Look for other uses of MultiFields.getFields that might need similar updates
echo "Checking for other uses of MultiFields.getFields:"
rg --type java "MultiFields\.getFields"

# Test: Verify that totalHits is accessed correctly in other parts of the code
echo "Checking totalHits usage:"
rg --type java "totalHits\s*\."

Length of output: 2846

languagetool-wikipedia/src/main/java/org/languagetool/dev/index/AnyCharTokenizer.java (2)

21-21: LGTM: Import statement updated correctly.

The import statement for CharacterUtils has been updated to reflect the package reorganization in Lucene 8.11.3. This change is consistent with the PR objective of updating the Lucene version.


Line range hint 1-150: Summary: Successfully updated to Lucene 8.11.3 API

The changes in this file successfully update the AnyCharTokenizer class to be compatible with Lucene 8.11.3. The main modifications include:

  1. Updated import statement for CharacterUtils.
  2. Replaced usage of charUtils instance with static method calls to CharacterUtils.
  3. Switched to using standard Java Character methods where appropriate.

These changes maintain the original functionality while aligning the code with the latest Lucene API. The updates also slightly simplify the code by removing the need for a charUtils instance.
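To illustrate the third point: the standard-library Character methods are code-point aware on their own, so no CharacterUtils instance is needed for lowercasing. A minimal, self-contained example (not the tokenizer itself; the class and method names here are illustrative):

```java
public class LowerCaseDemo {
  // Lowercase the code point starting at offset, writing it back in place.
  // Returns the number of chars written (2 for supplementary code points).
  public static int toLowerCase(char[] buf, int offset) {
    int cp = Character.codePointAt(buf, offset);  // may consume a surrogate pair
    int lower = Character.toLowerCase(cp);        // code-point based
    return Character.toChars(lower, buf, offset); // write back into the buffer
  }

  public static void main(String[] args) {
    char[] buf = "ÄBC".toCharArray();
    int n = toLowerCase(buf, 0);
    System.out.println(new String(buf) + " " + n); // äBC 1
  }
}
```

This is the same pattern the tokenizer's normalization step relies on, minus the Lucene buffer plumbing.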

To ensure that these changes haven't introduced any regressions, please run the following command to check for any existing tests that cover this tokenizer:

If no tests are found, consider adding unit tests to verify the tokenizer's behavior with the updated Lucene version.

languagetool-standalone/src/main/java/org/languagetool/dev/HomophoneOccurrenceDumper.java (3)

21-21: LGTM: Import changes align with updated Lucene usage.

The addition of MultiTerms import and removal of Fields and MultiFields imports correctly reflect the changes in how terms are accessed from the Lucene index. This change is consistent with the modifications in the getIterator() method.


114-115: Excellent: Improved term retrieval efficiency.

The update to use MultiTerms.getTerms() is a good improvement:

  1. It simplifies the term retrieval process by eliminating the intermediate Fields object.
  2. The code is now more direct and efficient in accessing the "ngram" terms.
  3. This change aligns with Lucene 8.x API updates.

The method's overall functionality and return type remain unchanged, ensuring compatibility with the rest of the class.
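The shape of the change, sketched as a diff (variable names may differ from the actual file):

```diff
- Fields fields = MultiFields.getFields(reader);
- Terms terms = fields.terms("ngram");
+ Terms terms = MultiTerms.getTerms(reader, "ngram");
```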


Line range hint 1-138: Summary: Successful Lucene API update with improved efficiency.

The changes in this file successfully update the Lucene API usage to version 8.11.3:

  1. Import statements have been updated to reflect the new API.
  2. The getIterator() method now uses a more efficient approach to retrieve terms.

These modifications improve the performance of term retrieval without altering the overall functionality of the HomophoneOccurrenceDumper class. The changes are localized and don't introduce any apparent issues in other parts of the class.

Great job on keeping the code up-to-date with the latest Lucene version while also improving its efficiency!

languagetool-core/src/main/java/org/languagetool/languagemodel/LuceneSingleIndexLanguageModel.java (3)

151-152: LGTM: Consistent update for totalHits access

The change from docs.totalHits > 1000 to docs.totalHits.value > 1000 is consistent with the previous update and correctly reflects the new Lucene API. The error message has been appropriately updated to include docs.totalHits.value.


Line range hint 1-249: Summary: Successful update to Lucene 8.11.3 API

The changes in this file successfully update the code to work with Lucene 8.11.3 API, specifically in how totalHits is accessed. The updates have been applied consistently throughout the file, maintaining the existing logic while adapting to the new API.

Key points:

  1. All instances of docs.totalHits have been changed to docs.totalHits.value.
  2. Error messages have been updated accordingly.
  3. The overall functionality of the class remains unchanged.

While the changes are correct and consistent, there's an opportunity to improve the handling of large result sets, as mentioned in the previous comment.


149-151: LGTM: Lucene API update for totalHits access

The change from docs.totalHits == 0 to docs.totalHits.value == 0 correctly reflects the update in the Lucene API for accessing total hits. This is consistent with the PR objective of updating to Lucene 8.11.3.

To ensure consistency, let's verify if this change has been applied uniformly across the codebase:
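A search like the following (a hypothetical script, in the style of the other verification steps in this review) could flag any remaining numeric comparisons against totalHits:

```shell
#!/bin/bash
# Old-style numeric use of totalHits (pre-8.x API):
rg --type java 'totalHits\s*(==|!=|<|>|<=|>=)'
# New-style access via the TotalHits object:
rg --type java 'totalHits\.value'
```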

languagetool-wikipedia/src/main/java/org/languagetool/dev/index/PatternRuleQueryBuilder.java (1)

127-127: LGTM: Updated createWeight call to match new Lucene API

The change correctly updates the indexSearcher.createWeight method call to match the new Lucene API. The use of ScoreMode.COMPLETE_NO_SCORES is appropriate here as scoring is not needed for term extraction. The added boost value of 1.0f is neutral and maintains the existing behavior. This update ensures compatibility with the newer Lucene version while preserving the method's functionality.

pom.xml (3)

221-221: Major update to Lucene version

The Lucene version has been updated from 5.5.5 to 8.11.3. This is a significant version jump that aligns with the PR objectives.

Please ensure that all Lucene-dependent code has been updated to work with this new version. Run the following script to check for any usage of Lucene classes that might have changed:

#!/bin/bash
# Check for usage of common Lucene classes that might have changed
rg --type java "import org\.apache\.lucene\.(index|search|store|document)\."

Line range hint 158-167: Multiple dependency version updates

Several dependencies have been updated to newer versions:

  • org.sweble.wikitext: 1.1.1 -> 3.1.9
  • org.apache.commons.commons-compress: 1.2 -> 1.26.0
  • org.mariadb.jdbc: 2.2.3 -> 2.7.6

These are significant version jumps that may introduce new features, bug fixes, and potentially breaking changes.

Please ensure that the project still compiles and functions correctly with these new versions. Pay special attention to any usage of these libraries in your code. Run the following script to check for their usage:

#!/bin/bash
# Check for usage of updated dependencies
rg --type java "import org\.sweble\.wikitext\."
rg --type java "import org\.apache\.commons\.compress\."
rg --type java "import org\.mariadb\.jdbc\."

165-165: Significant update to Lucene-Gosen dependency

The Lucene-Gosen dependency has been updated from version 6.2.1 to 8.11.0, and the groupId has changed from com.github.lucene-gosen to org.omegat.lucene. This aligns with the PR objectives but represents a major version jump.

Please ensure that all code using Lucene-Gosen has been tested with this new version. Run the following script to check for any usage of the old groupId:

✅ Verification successful

Lucene-Gosen Dependency Update Verified

No remaining references to com.github.lucene-gosen were found in pom.xml or any Java files. The dependency update is complete and does not impact other parts of the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any remaining references to the old Lucene-Gosen groupId
rg --type xml "com\.github\.lucene-gosen"

Length of output: 3930


Script:

#!/bin/bash
# Verify any remaining references to the old Lucene-Gosen groupId using alternative search patterns
fd "pom.xml" | xargs grep -H "com\.github\.lucene-gosen"
# Additionally, search through all Java files for any lingering import statements
fd --type java | xargs grep -H "com\.github\.lucene-gosen"

Length of output: 289


Script:

#!/bin/bash
# Verify any remaining references to the old Lucene-Gosen groupId in pom.xml
fd "pom.xml" | xargs grep -H "com\.github\.lucene-gosen"
# Additionally, search through all Java files for any lingering import statements
fd --extension java | xargs grep -H "com\.github\.lucene-gosen"

Length of output: 122

languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java (1)

123-123: Correct Usage of topDocs.totalHits.value

The change to topDocs.totalHits.value correctly reflects the updated Lucene 8.x API for accessing the total number of hits. This is an appropriate update.
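Note that in Lucene 8.x totalHits is a TotalHits object carrying both a count and a relation; when hit counting is capped, value is only a lower bound. A sketch (topDocs comes from an IndexSearcher.search call):

```java
long hits = topDocs.totalHits.value;  // a long, not an int
if (topDocs.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
  // hits is a lower bound; the real total may be larger
}
```

Code that needs an exact count should check the relation rather than trust value alone.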

@@ -38,7 +38,7 @@

 <properties>
 <!-- we have to repeat it here as otherwise Grails project languagetool-community-website will fail: -->
-<lucene.version>5.5.5</lucene.version>
+<lucene.version>8.11.3</lucene.version>

💡 Codebase verification

Deprecated Lucene API Usage Detected

The update to Lucene version 8.11.3 introduces potential compatibility issues due to deprecated API usages found in the codebase.

Areas to Address:

  • Deprecated Methods:

    • DirectoryReader.open is used in multiple files:
      • languagetool-wikipedia/src/main/java/org/languagetool/dev/index/Searcher.java
      • languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java
      • languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java
      • [List continues as per script output]
  • Existing TODO Comments:

    • languagetool-core/src/main/java/org/languagetool/languagemodel/BaseLanguageModel.java
      • // TODO: lm.getCount("_START_") returns 0 for Google data -- see getCount(String) in LuceneLanguageModel

Recommendations:

  1. Update Deprecated APIs:

    • Replace DirectoryReader.open with the updated methods as per Lucene 8.11.3 documentation.
  2. Address TODOs:

    • Resolve the TODO comment to ensure the getCount method functions correctly with the updated Lucene version.
  3. Comprehensive Testing:

    • After making the necessary changes, perform thorough testing to ensure that all Lucene-dependent functionalities work as expected with the new version.
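For the DirectoryReader.open point, the relevant Lucene 8.11 overloads are sketched below (writer is an open IndexWriter; this is illustrative, not the project's code):

```java
// Near-real-time reader, applying all deletes (the common case):
DirectoryReader r1 = DirectoryReader.open(writer);
// Explicit control over delete handling:
DirectoryReader r2 = DirectoryReader.open(writer, /*applyAllDeletes*/ true, /*writeAllDeletes*/ false);
```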
🔗 Analysis chain

LGTM! Verify compatibility with the updated Lucene version.

The Lucene version update from 5.5.5 to 8.11.3 aligns with the PR objectives. This is a significant version jump, which may introduce breaking changes or new features.

To ensure smooth integration, please verify the following:

  1. Compatibility of all Lucene-dependent code with version 8.11.3.
  2. Update of any Lucene-related documentation or comments in the codebase.
  3. Successful build and all tests passing with the new version.

Run the following script to check for potential compatibility issues:

This script will help identify areas that might need attention due to the Lucene version update.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential Lucene compatibility issues

# Test: Search for Lucene imports and usages
echo "Searching for Lucene imports and usages:"
rg --type java "import org\.apache\.lucene" -A 5

# Test: Check for any TODO or FIXME comments related to Lucene
echo "Checking for TODO or FIXME comments related to Lucene:"
rg --type java "(TODO|FIXME).*Lucene"

# Test: Look for deprecated Lucene API usage
echo "Checking for potentially deprecated Lucene API usage:"
rg --type java "MultiFields\.getFields|DirectoryReader\.open|LongField"

Length of output: 45771

## languagetool-dev sub-project

- Add a StoredField alongside the LongPoint for ngram fields
- Use DirectoryReader.open(IndexWriter) instead of open(IndexWriter, boolean, boolean)
- Add null/empty checks for FieldInfo and Terms

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
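The first bullet reflects a standard Lucene 8 pattern: LongPoint only indexes the value for querying and does not store it, so a separate StoredField is needed if the value must be read back from results. A sketch (field name and variables are illustrative):

```java
Document doc = new Document();
doc.add(new LongPoint("count", count));    // indexed: supports exact/range queries
doc.add(new StoredField("count", count));  // stored: retrievable from search results
writer.addDocument(doc);

// Querying later:
Query q = LongPoint.newExactQuery("count", count);
```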
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
pom.xml (1)

Line range hint 70-82: Address TODO comments in dependency updates.

Several dependencies have been updated, and there are TODO comments indicating areas that require attention or further verification. These include:

  • Checking for errors on compile in wikipedia, dev, and rpm modules after updating org.sweble.wikitext.
  • Verifying compatibility after updating org.apache.commons.commons-compress.
  • Checking for updates or replacements for edu.washington.cs.knowitall.opennlp.
  • Reviewing the update of org.mariadb.jdbc.

Please address these TODO items before merging:

  1. Compile and test the wikipedia, dev, and rpm modules with the updated org.sweble.wikitext version.
  2. Verify compatibility with the new org.apache.commons.commons-compress version.
  3. Investigate if there's an update or replacement available for edu.washington.cs.knowitall.opennlp.
  4. Review and confirm the org.mariadb.jdbc update.

After addressing these items, please remove or update the TODO comments accordingly.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4edeeb2 and 86295a2.

📒 Files selected for processing (1)
  • pom.xml (3 hunks)
🧰 Additional context used
🔇 Additional comments (3)
pom.xml (3)

217-217: Lucene version updated to 8.11.3.

The update of Lucene to version 8.11.3 from 5.5.5 is a significant change that aligns with the PR objectives. This major version update likely introduces breaking changes that need to be addressed in the codebase.

Please ensure that all necessary code modifications mentioned in the PR objectives (such as replacing LongField with LongPoint) have been implemented. Run the following script to check for any remaining usage of deprecated Lucene classes:

#!/bin/bash
# Search for usage of deprecated Lucene classes
rg "import org\.apache\.lucene\.document\.LongField" --type java
rg "LongField\(" --type java

Line range hint 1-1337: Overall pom.xml changes look good, but require some final steps.

The updates to Lucene, Lucene-Gosen, and other dependencies are well-implemented and align with the PR objectives. However, there are a few items that need attention before merging:

  1. Verify compatibility with the new versions, especially for Lucene and Lucene-Gosen.
  2. Address the TODO comments in the dependency updates section.
  3. Ensure all necessary code modifications related to the Lucene update have been implemented throughout the project.
  4. Run a full build and test suite to catch any potential issues introduced by these dependency updates.

Please run a full build and test suite, and confirm that all tests pass with the updated dependencies:

#!/bin/bash
# Run full build and test suite
mvn clean install

Once these steps are completed and all tests pass, this PR should be ready for final review and merge.


164-164: Lucene-Gosen dependency updated successfully.

The update of lucene-gosen to version 8.11.0 and the change in groupId to org.omegat.lucene align with the PR objectives. This is a significant update that may require corresponding code changes.

Please verify that all code using this dependency is compatible with the new version. Run the following script to check for any usage of the old groupId:

@SteVio89
Collaborator

Sorry for the delay here. We're currently updating our backend to prepare everything for this PR.


3 participants