feat: Bump Lucene@8.11.3 and lucene-gosen@8.11.0 #10810
base: master
Conversation
Force-pushed from dc932e8 to b3ae165.
Signed-off-by: Hiroshi Miura <miurahr@linux.com>
You can check the change with the upgraded English ngram data, which is now in Lucene 8 format. Updated: I've updated the download permission, so you can now download it (19:00, 30th Aug JST).
Here is an example of usage of the patch. Updated: I've updated the entry point script, which had an ngram download URL error. (19:00, 30th Aug JST)
Thanks!
Could you explain this in more detail? Are all the dependencies still available at Maven Central?
Sure. They commit changes to keep up with Lucene versions and build the library for their in-house project, but they do not publish releases frequently. Here is the project home of the forked lucene-gosen project. You can find the three versions the OmegaT project has published.
Which commit was chosen? I have started looking into what modifications need to be made.
Yes, all the dependencies are available at Maven Central.
To upgrade other language ngram data to the Lucene 8 format, https://github.com/hakanai/luceneupgrader will help.
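For reference (not part of this PR), stock Lucene also ships org.apache.lucene.index.IndexUpgrader; the sketch below shows that approach, with the index path taken from the command line as a placeholder. Note that IndexUpgrader only reads indexes written by the previous major version, so jumping from the old 5.x ngram format to 8.x needs intermediate steps — which is why a multi-version tool like the linked luceneupgrader is the practical choice here.

```java
import java.nio.file.Paths;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;

// Minimal sketch: rewrite all segments of an existing index into the current Lucene format.
public class NgramIndexUpgrade {
  public static void main(String[] args) throws Exception {
    // args[0]: path to an existing ngram index directory (placeholder)
    try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]))) {
      // Upgrades segments written by the previous major Lucene version only;
      // older indexes must be upgraded step by step or with a multi-version tool.
      new IndexUpgrader(dir).upgrade();
    }
  }
}
```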
@miurahr Thanks again for your contribution. I've tested the PR on a live server, and everything looks good so far. The plan is to merge it shortly after the LT 6.5 release on Friday (we're in feature freeze, so we won't be able to merge it before then).
Walkthrough

The changes primarily involve updates to the handling of n-gram indexing and retrieval in Lucene across multiple files. Key modifications include the transition from deprecated Lucene 5.x APIs to their Lucene 8.x replacements.
Actionable comments posted: 8
🧹 Outside diff range and nitpick comments (11)
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java (1)
Line range hint 1-73: Suggest improvements for error handling, resource management, and code readability.

While the main change looks good, there are several areas where the overall code could be improved:
- Error Handling: Add try-catch blocks to handle potential IOExceptions and NumberFormatExceptions.
- Resource Management: Use try-with-resources to ensure proper closing of IndexReader and FSDirectory.
- Variable Naming: Rename 'max' to a more descriptive name like 'maxCount'.
- Constants: Define the progress print interval as a constant.
Here's a suggested refactoring that addresses these points:
```java
public class LargestNGramFinder {

  private static final int PROGRESS_INTERVAL = 10_000;

  public static void main(String[] args) {
    if (args.length != 1) {
      System.out.println("Usage: " + LargestNGramFinder.class.getSimpleName() + " <ngramIndexDir>");
      System.exit(1);
    }
    try (FSDirectory fsDir = FSDirectory.open(new File(args[0]).toPath());
         IndexReader reader = DirectoryReader.open(fsDir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Terms terms = MultiTerms.getTerms(reader, "ngram");
      long maxCount = 0;
      String maxTerm = "";
      TermsEnum termsEnum = terms.iterator();
      int count = 0;
      BytesRef next;
      while ((next = termsEnum.next()) != null) {
        String term = next.utf8ToString();
        try {
          TopDocs topDocs = searcher.search(new TermQuery(new Term("ngram", term)), 5);
          int docId = topDocs.scoreDocs[0].doc;
          Document document = reader.document(docId);
          long thisCount = Long.parseLong(document.get("count"));
          if (maxCount < thisCount) {
            maxCount = thisCount;
            maxTerm = term;
          }
          if (count % PROGRESS_INTERVAL == 0) {
            System.out.println(count + " -> " + topDocs.totalHits + " for " + term + " -> " + thisCount +
                ", max so far: " + maxCount + " for '" + maxTerm + "'");
          }
          count++;
        } catch (NumberFormatException e) {
          System.err.println("Error parsing count for term: " + term);
        }
      }
      System.out.println("Max: " + maxCount + " for " + maxTerm);
    } catch (IOException e) {
      System.err.println("Error processing index: " + e.getMessage());
    }
  }
}
```

This refactoring improves error handling, ensures proper resource management, and enhances code readability.
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/TextIndexCreator.java (1)
Multiple StandardAnalyzer initializations found without CharArraySet.EMPTY_SET

Several instances of StandardAnalyzer are initialized without using CharArraySet.EMPTY_SET, which may lead to inconsistencies with the updated Lucene 8.11.3 API:
languagetool-wikipedia/src/main/java/org/languagetool/dev/dumpcheck/SentenceSourceIndexer.java
languagetool-dev/src/main/java/org/languagetool/dev/wordsimilarity/SimilarWordFinder.java
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/FrequencyIndexCreator.java
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AggregatedNgramToLucene.java
languagetool-dev/src/main/java/org/languagetool/dev/EmptyLuceneIndexCreator.java
Please update these initializations to use CharArraySet.EMPTY_SET to maintain consistency and adhere to the latest API standards.

🔗 Analysis chain
Line range hint 47-47: LGTM: StandardAnalyzer initialization updated correctly.

The StandardAnalyzer is now initialized with CharArraySet.EMPTY_SET, which is consistent with the Lucene 8.11.3 API. The functionality remains unchanged. To ensure this change doesn't affect other parts of the codebase, please run the following script:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for other occurrences of StandardAnalyzer initialization
# that might need similar updates.

# Test: Search for StandardAnalyzer usage
rg --type java -A 5 'new StandardAnalyzer\('
```

Length of output: 661
Script:
```bash
#!/bin/bash
# Description: Find StandardAnalyzer initializations not using CharArraySet.EMPTY_SET

# Step 1: Find all StandardAnalyzer initializations
rg --type java 'new StandardAnalyzer\(' > all_standard_analyzer_initializations.txt

# Step 2: Exclude those using CharArraySet.EMPTY_SET
grep -v 'CharArraySet\.EMPTY_SET' all_standard_analyzer_initializations.txt
```

Length of output: 1021
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/NeededNGramCounter.java (1)
61-61: LGTM! Consider adding null check for robustness.

The change from MultiFields.getFields(reader).terms("ngram") to MultiTerms.getTerms(reader, "ngram") is appropriate and aligns with the updated Lucene API. This new method provides a more direct way to retrieve the Terms for the "ngram" field. For improved robustness, consider adding a null check before proceeding with the terms.iterator() call:

```diff
-Terms terms = MultiTerms.getTerms(reader, "ngram");
-TermsEnum termsEnum = terms.iterator();
+Terms terms = MultiTerms.getTerms(reader, "ngram");
+if (terms == null) {
+  System.out.println("No terms found for 'ngram' field");
+  return;
+}
+TermsEnum termsEnum = terms.iterator();
```

This will prevent a potential NullPointerException if the "ngram" field doesn't exist in the index.

languagetool-wikipedia/src/main/java/org/languagetool/dev/index/AnyCharTokenizer.java (2)
85-85: LGTM: CharacterUtils usage updated correctly.

The fill method call has been correctly updated to use the static method from CharacterUtils. This change is consistent with the new Lucene API and makes the code more concise. Consider adding a comment explaining the purpose of this line, as it's a crucial part of the tokenizer's functionality:

```java
// Fill the buffer with characters from the input stream
CharacterUtils.fill(this.ioBuffer, this.input);
```
99-99: LGTM: Character handling updated correctly.

The codePointAt method call has been correctly updated to use the standard Java Character class method. This change simplifies the code and removes the dependency on the Lucene-specific CharacterUtils for this operation. Consider caching the result of this.ioBuffer.getBuffer() to potentially improve performance, especially if this method is called frequently in a loop:

```java
char[] buffer = this.ioBuffer.getBuffer();
int c = Character.codePointAt(buffer, this.bufferIndex);
```

This change would avoid repeated method calls to getBuffer() and might lead to a small performance improvement.

languagetool-core/src/main/java/org/languagetool/languagemodel/LuceneSingleIndexLanguageModel.java (1)
197-199: LGTM: Consistent update for totalHits access, with a suggestion.

The change from docs.totalHits > 2000 to docs.totalHits.value > 2000 is consistent with the previous updates and correctly reflects the new Lucene API. The error message has been appropriately updated to include docs.totalHits.value.
.Consider reviewing the 2000 match limit:
The current implementation throws a RuntimeException when there are more than 2000 matches. This hard limit might be too restrictive for some use cases. Consider implementing a more flexible approach, such as:
- Making the limit configurable.
- Implementing pagination to handle large result sets.
- Logging a warning instead of throwing an exception, and returning the best 2000 matches.
Example of making the limit configurable:
```diff
- if (docs.totalHits.value > 2000) {
-   throw new RuntimeException("More than 2000 matches for '" + term + "' not supported for performance reasons: " +
-     docs.totalHits.value + " matches in " + luceneSearcher.directory);
+ private static final int MAX_MATCHES = 2000;
+ if (docs.totalHits.value > MAX_MATCHES) {
+   throw new RuntimeException("More than " + MAX_MATCHES + " matches for '" + term + "' not supported for performance reasons: " +
+     docs.totalHits.value + " matches in " + luceneSearcher.directory);
  }
```

This change would make it easier to adjust the limit in the future if needed.
languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java (1)
274-279: LGTM! Consider adding a comment for clarity.

The changes to the assertMatches method look good. Updating the parameter type from int to long for expectedMatches allows for handling larger result sets, which is a good practice. The update to accessing total hits is consistent with the changes in the Lucene API for version 8.11.3. Consider adding a brief comment explaining the reason for using totalHits.value to improve code readability:

```diff
 private void assertMatches(AbstractPatternRule patternRule, long expectedMatches) throws Exception {
   PatternRuleQueryBuilder queryBuilder = new PatternRuleQueryBuilder(language, searcher);
   Query query = queryBuilder.buildRelaxedQuery(patternRule);
   //System.out.println("QUERY: " + query);
+  // Use totalHits.value to get the exact number of hits (Lucene 8.x API)
   long matches = searcher.search(query, 1000).totalHits.value;
   assertEquals("Query failed: " + query, expectedMatches, matches);
 }
```
pom.xml (1)
Line range hint 158-167: Address TODO comments in pom.xml

There are several TODO comments in the pom.xml file, indicating potential future work or issues to be addressed. For example:
- "TODO: Check errors on compile in wikipedia, dev and rpm"
- "TODO: Check compatibility"
- "TODO: Check for replacement"
These comments suggest that there might be some incomplete work or known issues with the current changes.
Consider creating GitHub issues to track these TODOs and ensure they are addressed in future updates. Would you like me to create draft issues for each of these TODOs?
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java (1)
70-70: Remove or relocate commented test code.

The line // term = "persischer Golf"; // for testing is commented-out test code. Consider removing it or moving it to an appropriate testing environment to keep the production code clean and maintainable.

languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java (2)
Line range hint 180-267: Optimize index reader refresh logic.

Repeatedly opening DirectoryReader instances without proper closure can lead to resource leaks. Consider using DirectoryReader.openIfChanged to refresh the reader only when there are changes, and ensure that previous readers are properly closed. Apply this diff to optimize reader handling:

```diff
-// Not sure why this doesn't work, should be faster:
-/*DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
-if (newReader != null) {
-  reader = newReader;
-}*/
-index.reader = DirectoryReader.open(index.indexWriter);
+DirectoryReader newReader = DirectoryReader.openIfChanged(index.reader, index.indexWriter);
+if (newReader != null) {
+  index.reader.close();
+  index.reader = newReader;
+  index.searcher = new IndexSearcher(index.reader);
+}
```
Handle totalHits relation for accurate hit counts

Several usages of totalHits do not consider the relation field, which may lead to inaccurate hit count handling. Please update the following instances to handle the relation appropriately:
languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java:
- Lines where
totalHits
is accessed without.value
.languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AutomaticConfusionRuleEvaluator.java:
languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java:
languagetool-dev/src/main/java/org/languagetool/dev/wordsimilarity/SimilarWordFinder.java:
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java:
🔗 Analysis chain
Line range hint
186-198
: HandleTotalHits
relation for accurate hit countsIn Lucene 8.x,
totalHits
has changed from anint
to aTotalHits
object, which includes bothvalue
andrelation
. While accessingtopDocs.totalHits.value
provides the hit count, it's important to consider therelation
field to determine if the hit count is exact or a lower bound (e.g., when usingTOTAL_HITS_THRESHOLD
).Run the following script to check for other usages of
totalHits
that may need to handle therelation
field:🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Find usages of totalHits to ensure proper handling of TotalHits object.
rg --type java -A 2 $'totalHits\\b(?!\\.)'
```

Length of output: 273
Script:
```bash
#!/bin/bash
# Description: Find usages of totalHits to ensure proper handling of TotalHits object using PCRE2.
rg --pcre2 --type java -A 2 'totalHits\b(?!\.)'
```

Length of output: 3938
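To make the point above concrete, here is a small hedged sketch (not code from this PR) of reading the Lucene 8 TotalHits object and falling back to an exact count when only a lower bound is reported:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TotalHits;

final class HitCounts {
  /** Returns an exact hit count, even when the collector only reported a lower bound. */
  static long exactHitCount(IndexSearcher searcher, Query query, TopDocs topDocs) throws IOException {
    if (topDocs.totalHits.relation == TotalHits.Relation.EQUAL_TO) {
      return topDocs.totalHits.value;  // the count is exact
    }
    // GREATER_THAN_OR_EQUAL_TO: counting stopped at the hits threshold,
    // so ask Lucene to count all matches for this query.
    return searcher.count(query);
  }
}
```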
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (18)
- languagetool-core/src/main/java/org/languagetool/languagemodel/LuceneSingleIndexLanguageModel.java (2 hunks)
- languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java (1 hunks)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AggregatedNgramToLucene.java (2 hunks)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java (4 hunks)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java (3 hunks)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java (2 hunks)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/NeededNGramCounter.java (1 hunks)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/TextIndexCreator.java (1 hunks)
- languagetool-language-modules/ja/pom.xml (1 hunks)
- languagetool-standalone/src/main/java/org/languagetool/dev/HomophoneOccurrenceDumper.java (2 hunks)
- languagetool-wikipedia/pom.xml (1 hunks)
- languagetool-wikipedia/src/main/java/org/languagetool/dev/dumpcheck/SentenceSourceIndexer.java (1 hunks)
- languagetool-wikipedia/src/main/java/org/languagetool/dev/index/AnyCharTokenizer.java (3 hunks)
- languagetool-wikipedia/src/main/java/org/languagetool/dev/index/PatternRuleQueryBuilder.java (1 hunks)
- languagetool-wikipedia/src/main/java/org/languagetool/dev/index/Searcher.java (6 hunks)
- languagetool-wikipedia/src/main/java/org/languagetool/dev/index/SearcherResult.java (2 hunks)
- languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java (1 hunks)
- pom.xml (3 hunks)
✅ Files skipped from review due to trivial changes (1)
- languagetool-wikipedia/src/main/java/org/languagetool/dev/dumpcheck/SentenceSourceIndexer.java
🔇 Additional comments (19)
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java (1)
50-50: Approve the Lucene API update with a suggestion for performance verification.

The change from using MultiFields.getFields(reader) followed by fields.terms("ngram") to directly using MultiTerms.getTerms(reader, "ngram")
aligns well with the Lucene 8.x API. This modification simplifies the code and maintains the core functionality.To ensure this change doesn't introduce any performance regressions, consider running a performance test. You can use the following script to check if there are any other occurrences of the old API usage that might need updating:
This script will help identify any other locations in the codebase where similar updates might be necessary.
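For readers following the review, a hedged before/after sketch of this API change (the "ngram" field name matches the code under review; the helper class is illustrative only):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;

final class NgramTermsAccess {
  /** Lucene 8.x way to get the merged Terms for the "ngram" field. */
  static Terms ngramTerms(IndexReader reader) throws IOException {
    // Lucene 5.x (removed):  Fields fields = MultiFields.getFields(reader);
    //                        Terms terms = fields.terms("ngram");
    // Lucene 8.x:            MultiTerms.getTerms(reader, "ngram")
    // Note: may return null if no document contains the field.
    return MultiTerms.getTerms(reader, "ngram");
  }
}
```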
languagetool-language-modules/ja/pom.xml (1)
43-46: Verify the new lucene-gosen dependency.

The group ID for the lucene-gosen dependency has been changed from com.github.lucene-gosen to org.omegat.lucene. This change aligns with the PR objectives, which mention using a personal build of Gosen. However, we should ensure that this new dependency is correctly configured and available. Run the following script to verify the availability and correctness of the new dependency:
The change is approved, pending successful verification of the new dependency.
languagetool-wikipedia/src/main/java/org/languagetool/dev/index/SearcherResult.java (1)
38-38
: LGTM! Approve the changes toluceneMatchCount
.The modifications to change
luceneMatchCount
fromint
tolong
are correct and consistent throughout the class. This change allows for handling larger datasets, which is a good improvement.Let's verify that these changes don't break any existing code:
If any of these searches return results, it might indicate areas of the codebase that need to be updated to work with the new
long
type.Also applies to: 84-90
✅ Verification successful
Verification Successful: No usages of int-based
luceneMatchCount
methods found.The changes to
luceneMatchCount
fromint
tolong
have been successfully verified and do not affect existing code.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check for any remaining usage of int-based luceneMatchCount methods

# Test 1: Search for any remaining int-based setter calls
echo "Searching for int-based setter calls:"
rg --type java 'setLuceneMatchCount\s*\(\s*\d+\s*\)'

# Test 2: Search for any explicit int casts when calling the getter
echo "Searching for int casts of getLuceneMatchCount:"
rg --type java '\(int\)\s*\w+\.getLuceneMatchCount\(\)'

# Test 3: Search for any comparisons that might assume int return type
echo "Searching for potential int-based comparisons:"
rg --type java 'getLuceneMatchCount\(\)\s*([<>=!]=?|[&|^])\s*\d+'
```

Length of output: 473
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/TextIndexCreator.java (2)
22-22
: LGTM: Import statement updated correctly.The import for
CharArraySet
has been updated to reflect the package reorganization in Lucene 8.11.3. This change is consistent with the library upgrade mentioned in the PR objectives.
Line range hint
1-89
: Overall, the changes look good. Consider verifying similar updates across the project.The modifications in this file are minimal and correctly reflect the Lucene 8.11.3 upgrade. The import statement and
StandardAnalyzer
initialization have been updated appropriately. The overall functionality of theTextIndexCreator
class remains unchanged.To ensure consistency across the project, please run the following script to identify other files that might require similar updates:
```bash
#!/bin/bash
# Description: Identify other files that might need updates due to Lucene API changes

# Test 1: Search for CharArraySet imports that might need updating
echo "Files with potential CharArraySet import issues:"
rg --type java 'import org\.apache\.lucene\.analysis\.util\.CharArraySet'

# Test 2: Search for StandardAnalyzer initializations that might need updating
echo "\nFiles with potential StandardAnalyzer initialization issues:"
rg --type java 'new StandardAnalyzer\((?!CharArraySet\.EMPTY_SET)'
```

languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java (1)
50-75: Approved with suggestions for improvement

The changes successfully update the code to work with newer versions of Lucene and provide more flexibility by iterating over all indexed fields. This is a good improvement in terms of compatibility and extensibility.
However, there are a few points to consider:
Performance: If there are many indexed fields, iterating over all of them might impact performance. Consider adding a check to only process relevant fields if known.
Error messages: The error messages still reference the "ngram" field specifically (e.g., line 75). These should be updated to be more generic or to include the current field name.
Logging: Consider adding some debug logging to show which fields are being processed, especially if multiple fields are expected.
Here's a suggested improvement for the error message:
```diff
- throw new RuntimeException("More hits than expected for " + term + ": " + topDocs.totalHits);
+ throw new RuntimeException("More hits than expected for term '" + term + "' in field '" + field + "': " + topDocs.totalHits.value);
```

Also, consider adding a debug log at the start of the field iteration:
System.out.println("Processing field: " + field);To ensure that the changes don't introduce any regressions, please run the following verification script:
This script will help identify any other parts of the codebase that might need similar updates.
✅ Verification successful
Verification Successful
The review comment has been thoroughly verified. The only reference to
'ngram'
outside ofStartTokenCounter.java
is within a comment inLangIdentExamples.java
, which does not affect the functionality. No other uses ofMultiFields.getFields
or problematictotalHits
usages were found related to the changes.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify that the StartTokenCounter class is used correctly throughout the codebase

# Test: Check for any direct references to the 'ngram' field that might need updating
echo "Checking for 'ngram' field references:"
rg --type java "'ngram'" --glob '!StartTokenCounter.java'

# Test: Look for other uses of MultiFields.getFields that might need similar updates
echo "Checking for other uses of MultiFields.getFields:"
rg --type java "MultiFields\.getFields"

# Test: Verify that totalHits is accessed correctly in other parts of the code
echo "Checking totalHits usage:"
rg --type java "totalHits\s*\."
```

Length of output: 2846
languagetool-wikipedia/src/main/java/org/languagetool/dev/index/AnyCharTokenizer.java (2)
21-21
: LGTM: Import statement updated correctly.The import statement for
CharacterUtils
has been updated to reflect the package reorganization in Lucene 8.11.3. This change is consistent with the PR objective of updating the Lucene version.
Line range hint
1-150
: Summary: Successfully updated to Lucene 8.11.3 APIThe changes in this file successfully update the
AnyCharTokenizer
class to be compatible with Lucene 8.11.3. The main modifications include:
- Updated import statement for
CharacterUtils
.- Replaced usage of
charUtils
instance with static method calls toCharacterUtils
.- Switched to using standard Java
Character
methods where appropriate.These changes maintain the original functionality while aligning the code with the latest Lucene API. The updates also slightly simplify the code by removing the need for a
charUtils
instance.To ensure that these changes haven't introduced any regressions, please run the following command to check for any existing tests that cover this tokenizer:
If no tests are found, consider adding unit tests to verify the tokenizer's behavior with the updated Lucene version.
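A minimal sketch of such a test is shown below. It assumes AnyCharTokenizer exposes a no-argument constructor and follows the standard Lucene Tokenizer lifecycle; both are assumptions to be adjusted against the actual class.

```java
import static org.junit.Assert.assertFalse;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.Test;
import org.languagetool.dev.index.AnyCharTokenizer;

public class AnyCharTokenizerSmokeTest {

  @Test
  public void testProducesTokensAfterLuceneUpgrade() throws Exception {
    // Assumption: a no-arg constructor exists; adapt if the real class needs arguments.
    AnyCharTokenizer tokenizer = new AnyCharTokenizer();
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.setReader(new StringReader("Dies ist ein Test."));
    List<String> tokens = new ArrayList<>();
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      tokens.add(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
    // At minimum, non-empty input should yield at least one token.
    assertFalse(tokens.isEmpty());
  }
}
```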
languagetool-standalone/src/main/java/org/languagetool/dev/HomophoneOccurrenceDumper.java (3)
21-21
: LGTM: Import changes align with updated Lucene usage.The addition of
MultiTerms
import and removal ofFields
andMultiFields
imports correctly reflect the changes in how terms are accessed from the Lucene index. This change is consistent with the modifications in thegetIterator()
method.
114-115
: Excellent: Improved term retrieval efficiency.The update to use
MultiTerms.getTerms()
is a good improvement:
- It simplifies the term retrieval process by eliminating the intermediate
Fields
object.- The code is now more direct and efficient in accessing the "ngram" terms.
- This change aligns with Lucene 8.x API updates.
The method's overall functionality and return type remain unchanged, ensuring compatibility with the rest of the class.
Line range hint
1-138
: Summary: Successful Lucene API update with improved efficiency.The changes in this file successfully update the Lucene API usage to version 8.11.3:
- Import statements have been updated to reflect the new API.
- The
getIterator()
method now uses a more efficient approach to retrieve terms.These modifications improve the performance of term retrieval without altering the overall functionality of the
HomophoneOccurrenceDumper
class. The changes are localized and don't introduce any apparent issues in other parts of the class.Great job on keeping the code up-to-date with the latest Lucene version while also improving its efficiency!
languagetool-core/src/main/java/org/languagetool/languagemodel/LuceneSingleIndexLanguageModel.java (3)
151-152
: LGTM: Consistent update for totalHits accessThe change from
docs.totalHits > 1000
todocs.totalHits.value > 1000
is consistent with the previous update and correctly reflects the new Lucene API. The error message has been appropriately updated to includedocs.totalHits.value
.
Line range hint
1-249
: Summary: Successful update to Lucene 8.11.3 APIThe changes in this file successfully update the code to work with Lucene 8.11.3 API, specifically in how
totalHits
is accessed. The updates have been applied consistently throughout the file, maintaining the existing logic while adapting to the new API.Key points:
- All instances of
docs.totalHits
have been changed todocs.totalHits.value
.- Error messages have been updated accordingly.
- The overall functionality of the class remains unchanged.
While the changes are correct and consistent, there's an opportunity to improve the handling of large result sets, as mentioned in the previous comment.
149-151
: LGTM: Lucene API update for totalHits accessThe change from
docs.totalHits == 0
todocs.totalHits.value == 0
correctly reflects the update in the Lucene API for accessing total hits. This is consistent with the PR objective of updating to Lucene 8.11.3.To ensure consistency, let's verify if this change has been applied uniformly across the codebase:
languagetool-wikipedia/src/main/java/org/languagetool/dev/index/PatternRuleQueryBuilder.java (1)
127-127
: LGTM: UpdatedcreateWeight
call to match new Lucene APIThe change correctly updates the
indexSearcher.createWeight
method call to match the new Lucene API. The use ofScoreMode.COMPLETE_NO_SCORES
is appropriate here as scoring is not needed for term extraction. The added boost value of1.0f
is neutral and maintains the existing behavior. This update ensures compatibility with the newer Lucene version while preserving the method's functionality.pom.xml (3)
221-221
: Major update to Lucene versionThe Lucene version has been updated from 5.5.5 to 8.11.3. This is a significant version jump that aligns with the PR objectives.
Please ensure that all Lucene-dependent code has been updated to work with this new version. Run the following script to check for any usage of Lucene classes that might have changed:
```bash
#!/bin/bash
# Check for usage of common Lucene classes that might have changed
rg --type java "import org\.apache\.lucene\.(index|search|store|document)\."
```
Line range hint
158-167
: Multiple dependency version updatesSeveral dependencies have been updated to newer versions:
- org.sweble.wikitext: 1.1.1 -> 3.1.9
- org.apache.commons.commons-compress: 1.2 -> 1.26.0
- org.mariadb.jdbc: 2.2.3 -> 2.7.6
These are significant version jumps that may introduce new features, bug fixes, and potentially breaking changes.
Please ensure that the project still compiles and functions correctly with these new versions. Pay special attention to any usage of these libraries in your code. Run the following script to check for their usage:
```bash
#!/bin/bash
# Check for usage of updated dependencies
rg --type java "import org\.sweble\.wikitext\."
rg --type java "import org\.apache\.commons\.compress\."
rg --type java "import org\.mariadb\.jdbc\."
```
165-165
: Significant update to Lucene-Gosen dependencyThe Lucene-Gosen dependency has been updated from version 6.2.1 to 8.11.0, and the groupId has changed from
com.github.lucene-gosen
toorg.omegat.lucene
. This aligns with the PR objectives but represents a major version jump.Please ensure that all code using Lucene-Gosen has been tested with this new version. Run the following script to check for any usage of the old groupId:
✅ Verification successful
Lucene-Gosen Dependency Update Verified
No remaining references to
com.github.lucene-gosen
were found inpom.xml
or any Java files. The dependency update is complete and does not impact other parts of the codebase.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check for any remaining references to the old Lucene-Gosen groupId
rg --type xml "com\.github\.lucene-gosen"
```

Length of output: 3930
Script:
```bash
#!/bin/bash
# Verify any remaining references to the old Lucene-Gosen groupId using alternative search patterns
fd "pom.xml" | xargs grep -H "com\.github\.lucene-gosen"

# Additionally, search through all Java files for any lingering import statements
fd --type java | xargs grep -H "com\.github\.lucene-gosen"
```

Length of output: 289
Script:
```bash
#!/bin/bash
# Verify any remaining references to the old Lucene-Gosen groupId in pom.xml
fd "pom.xml" | xargs grep -H "com\.github\.lucene-gosen"

# Additionally, search through all Java files for any lingering import statements
fd --extension java | xargs grep -H "com\.github\.lucene-gosen"
```

Length of output: 122
languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java (1)
123-123: Correct usage of topDocs.totalHits.value

The change to topDocs.totalHits.value correctly reflects the updated Lucene 8.x API for accessing the total number of hits. This is an appropriate update.
```diff
@@ -38,7 +38,7 @@
   <properties>
     <!-- we have to repeat it here as otherwise Grails project languagetool-community-website will fail: -->
-    <lucene.version>5.5.5</lucene.version>
+    <lucene.version>8.11.3</lucene.version>
```
💡 Codebase verification
Deprecated Lucene API Usage Detected
The update to Lucene version 8.11.3 introduces potential compatibility issues due to deprecated API usages found in the codebase.
Areas to Address:

- Deprecated Methods: DirectoryReader.open is used in multiple files:
  - languagetool-wikipedia/src/main/java/org/languagetool/dev/index/Searcher.java
  - languagetool-wikipedia/src/test/java/org/languagetool/dev/index/PatternRuleQueryBuilderTest.java
  - languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java
  - [List continues as per script output]
- Existing TODO Comments:
  - languagetool-core/src/main/java/org/languagetool/languagemodel/BaseLanguageModel.java
    // TODO: lm.getCount("_START_") returns 0 for Google data -- see getCount(String) in LuceneLanguageModel

Recommendations:

- Update Deprecated APIs: Replace DirectoryReader.open with the updated methods as per Lucene 8.11.3 documentation.
- Address TODOs: Resolve the TODO comment to ensure the getCount method functions correctly with the updated Lucene version.
- Comprehensive Testing:
- After making the necessary changes, perform thorough testing to ensure that all Lucene-dependent functionalities work as expected with the new version.
🔗 Analysis chain
LGTM! Verify compatibility with the updated Lucene version.
The Lucene version update from 5.5.5 to 8.11.3 aligns with the PR objectives. This is a significant version jump, which may introduce breaking changes or new features.
To ensure smooth integration, please verify the following:
- Compatibility of all Lucene-dependent code with version 8.11.3.
- Update of any Lucene-related documentation or comments in the codebase.
- Successful build and all tests passing with the new version.
Run the following script to check for potential compatibility issues:
This script will help identify areas that might need attention due to the Lucene version update.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential Lucene compatibility issues
# Test: Search for Lucene imports and usages
echo "Searching for Lucene imports and usages:"
rg --type java "import org\.apache\.lucene" -A 5
# Test: Check for any TODO or FIXME comments related to Lucene
echo "Checking for TODO or FIXME comments related to Lucene:"
rg --type java "(TODO|FIXME).*Lucene"
# Test: Look for deprecated Lucene API usage
echo "Checking for potentially deprecated Lucene API usage:"
rg --type java "MultiFields\.getFields|DirectoryReader\.open|LongField"
Length of output: 45771
Resolved review threads:
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java (outdated)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AggregatedNgramToLucene.java (outdated)
- languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java (outdated)
- languagetool-wikipedia/src/main/java/org/languagetool/dev/index/Searcher.java
## language-dev sub-project

- Add StoredField addition to LongPoint for ngram
- Use DirectoryReader.open(IndexWriter) instead of open(IndexWriter, boolean, boolean)
- Null and empty check for FieldInfo and Terms

Signed-off-by: Hiroshi Miura <miurahr@linux.com>
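As background for the "StoredField addition to LongPoint" item above: in Lucene 8, LongPoint is index-only (it supports exact and range queries but is not stored), so a parallel StoredField is needed when the value — here the ngram count — must be read back from a retrieved document. A hedged sketch, with field names following the "ngram"/"count" convention used elsewhere in this PR:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

final class NgramDocuments {
  /** Builds a document holding an ngram and its occurrence count (Lucene 8.x style). */
  static Document ngramDoc(String ngram, long count) {
    Document doc = new Document();
    doc.add(new TextField("ngram", ngram, Field.Store.YES));
    doc.add(new LongPoint("count", count));    // indexed for queries, but not retrievable
    doc.add(new StoredField("count", count));  // stored copy, read back via getField("count").numericValue()
    return doc;
  }
}
```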
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
pom.xml (1)
Line range hint 70-82: Address TODO comments in dependency updates.

Several dependencies have been updated, and there are TODO comments indicating areas that require attention or further verification. These include:
- Checking for errors on compile in wikipedia, dev, and rpm modules after updating org.sweble.wikitext.
- Verifying compatibility after updating org.apache.commons.commons-compress.
- Checking for updates or replacements for edu.washington.cs.knowitall.opennlp.
- Reviewing the update of org.mariadb.jdbc.
.Please address these TODO items before merging:
- Compile and test the wikipedia, dev, and rpm modules with the updated org.sweble.wikitext version.
- Verify compatibility with the new org.apache.commons.commons-compress version.
- Investigate if there's an update or replacement available for edu.washington.cs.knowitall.opennlp.
- Review and confirm the org.mariadb.jdbc update.

After addressing these items, please remove or update the TODO comments accordingly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- pom.xml (3 hunks)
🧰 Additional context used
🔇 Additional comments (3)
pom.xml (3)
217-217
: Lucene version updated to 8.11.3.The update of Lucene to version 8.11.3 from 5.5.5 is a significant change that aligns with the PR objectives. This major version update likely introduces breaking changes that need to be addressed in the codebase.
Please ensure that all necessary code modifications mentioned in the PR objectives (such as replacing
LongField
withLongPoint
) have been implemented. Run the following script to check for any remaining usage of deprecated Lucene classes:

```bash
#!/bin/bash
# Search for usage of deprecated Lucene classes
rg "import org\.apache\.lucene\.document\.LongField" --type java
rg "LongField\(" --type java
```
Line range hint
1-1337
: Overall pom.xml changes look good, but require some final steps.The updates to Lucene, Lucene-Gosen, and other dependencies are well-implemented and align with the PR objectives. However, there are a few items that need attention before merging:
- Verify compatibility with the new versions, especially for Lucene and Lucene-Gosen.
- Address the TODO comments in the dependency updates section.
- Ensure all necessary code modifications related to the Lucene update have been implemented throughout the project.
- Run a full build and test suite to catch any potential issues introduced by these dependency updates.
Please run a full build and test suite, and confirm that all tests pass with the updated dependencies:
```bash
#!/bin/bash
# Run full build and test suite
mvn clean install
```

Once these steps are completed and all tests pass, this PR should be ready for final review and merge.
164-164
: Lucene-Gosen dependency updated successfully.The update of lucene-gosen to version 8.11.0 and the change in groupId to
org.omegat.lucene
align with the PR objectives. This is a significant update that may require corresponding code changes.Please verify that all code using this dependency is compatible with the new version. Run the following script to check for any usage of the old groupId:
Sorry for the delay here. We're currently updating our backend to prepare everything for this PR.
Update Apache Lucene from version 5.5.5 to 8.11.3, the latest version that supports Java 8.
For dependency resolution, also update lucene-gosen to 8.11.0.
Unfortunately, the Gosen project does not publish this version to Maven Central, so I use my own build of Gosen.
Implements #3628

Changes

- o.a.l.d.LongField is removed in Lucene 8.x; switch to LongPoint. Affected:
  - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/AggregatedNgramToLucene.java
  - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java
- MultiFields.getFields(reader) is deprecated. Affected:
  - languagetool-dev/src/main/java/org/languagetool/dev/archive/StartTokenCounter.java
  - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/GermanUppercasePhraseFinder.java
  - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/LargestNGramFinder.java
  - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/NeededNGramCounter.java
  - languagetool-standalone/src/main/java/org/languagetool/dev/HomophoneOccurrenceDumper.java
- limitedTopDocs.topDocs.totalHits: add .value, which returns a long instead of an int.
- DirectoryReader.open(IndexWriter, boolean) changed to open(IndexWriter, boolean, boolean). Affected:
  - languagetool-dev/src/main/java/org/languagetool/dev/bigdata/CommonCrawlToNgram.java
- languagetool-wikipedia/pom.xml: change lucene.version.
- CharacterUtils changes fill to be a static function instead of a method.

Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores