Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fr] Ignore words starting with capital letter in "VERBE_SUIVI_D_UN_NOM" ruleset #10422

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Sharcoux
Copy link
Contributor

Fixup for VERBE_SUIVI_D_UN_NOM based on the last diff.

We will fix 2 things:

  • Words starting with a capital letter should be ignored
  • Words that can be both a noun and something else should be ignored

I noticed that the rule is still triggered when there is a newline character between the 2 tokens. I don't know how to ignore that. For now, the check about capital letters should do the trick, but that's not ideal.

@Sharcoux Sharcoux marked this pull request as draft March 19, 2024 15:22
@Sharcoux Sharcoux marked this pull request as ready for review March 21, 2024 13:08
@Sharcoux
Copy link
Contributor Author

The disambiguation processor seem to have many flaws with this kind of structure. Anyway, this rule will just ignore the mistake when the disambiguation is not good, and when the processor will improve, it will correctly catch them all.

For the record, an example of disambiguation problem:


"Il prend café"

Token Lemma Part-of-speech
Il il R pers suj 3 m s
prend prendre V ind pres 3 s
café café J e sp / N m s

There is no way that café could be an adjective. There is not even a single name in the sentence...


"Il prend pelle"

Token Lemma Part-of-speech
Il il R pers suj 3 m s
prend prendre V ind pres 3 s
pelle pelle / peller N f s / V imp pres 2 s / V ind pres 1 s / V ind pres 3 s / V sub pres 1 s / V sub pres 3 s

How can a conjugated verb follow a conjugated verb that is not an auxiliary? What could be the subject of "V sub pres 1 s"? That doesn't seem to make any sense.

@Sharcoux
Copy link
Contributor Author

This is the continuation of the PR: #10385

I'll keep tracking down the diffs but we should be close to the end.

@Sharcoux
Copy link
Contributor Author

@jaumeortola This is ready for merge

@Sharcoux
Copy link
Contributor Author

Sharcoux commented Apr 8, 2024

@jaumeortola @LucieSteib Can we merge this? So I can check if every case is 100% handled?

@@ -116344,7 +116351,7 @@ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
<suggestion>un \2</suggestion>
<suggestion>une \2</suggestion>
<suggestion>des \2</suggestion>
<example correction="un rhume|une rhume|des rhume">J'attrape <marker>rhume</marker>.</example>
<example correction="un lapin|une lapin|des lapin">J'attrape <marker>lapin</marker>.</example>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I still don't get why it's not possible to get the right suggestion here (feminine or plural forms...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love it if it was possible. Do you know how? I thought that LanguageTool had no solution for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, we have a verb followed by a name. The easiest and one of the likeliest solution is to add the missing article. But we need to know the gender and number of the name in order to generate the correct article. But I don't think that the suggestions can be conditional to the gender or number of that name. At least I found nothing of the sort in the documentation and ChatGPT didn't know one either...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, to be fair, the model is kinda good at doing this. Maybe you could check if some of the example here are not detected by the model?
image

@Sharcoux
Copy link
Contributor Author

Sharcoux commented Apr 8, 2024

Ok, the original problem I was trying to solve was with the sentence: L'eau entre par la bouche du poisson et ressort par les opercules.. Also, I notice that Il prend pelle. or Il prend café. won't be flagged as wrong. However, it's true that many mistakes seem to be correctly identified by the rule "AI_FR_GGEC_MISSING_DETERMINER". But what is this rule exactly and where is it coming from? Are you sure that it is available in this project? I couldn't find anything about it in the repo.

@LucieSteib
Copy link
Collaborator

LucieSteib commented Apr 8, 2024

And indeed you are right, both these sentences don't get corrected properly yet:
L'eau entre par la bouche du poisson et ressort par les opercules. -> is still a false positive by CONFUSION_ER_E_PAR[20] (that could be corrected with a simple antipattern in the corresponding subrule)
Il prend pelle. is still a false negative indeed, I can add it to the next model training.

The "rules" you see starting with AI_FR_GGEC aren't XML rules as the other ones in the grammar.xml file.
They are operation done by our Artificial Intelligence (for) Grammatical Error Correction for FRench.
These parts of LanguageTool are not Open Source, that's why you can't find them "in the project". They have an entirely different way of working, and nothing about that lives in this repository.

@Sharcoux
Copy link
Contributor Author

Sharcoux commented Apr 8, 2024

The original rule CONFUSION_ER_E_PAR is wrong and is supposed to be replaced with CONFUSION_ER_E_PAR2. But you're right, this is not related to this PR and we can discuss it in the other one.

About the fixes made by the AI, I understand that this part might not be available in the open source project, but in this case, that means that there is no rule for those in the current project and thus, this PR is important.

@LucieSteib
Copy link
Collaborator

About the fixes made by the AI, I understand that this part might not be available in the open source project, but in this case, that means that there is no rule for those in the current project and thus, this PR is important.

The models operations are available for non-Premium users, on the Editor and the web extension.
Only the code supporting it is not accessible, it's a mt5 model massively trained that cannot be changed like XML rules, that's why it's not "in the OS repository".

@LucieSteib
Copy link
Collaborator

LucieSteib commented Apr 9, 2024

I remember that you want to replace CONFUSION_ER_E_PAR withCONFUSION_ER_E_PAR2 entirely, yes.
Will the version 2 remove FPs and add TPs compare to the current CONFUSION_ER_E_PAR?
You should be able to see it in the diffs, by going through a good testing, plus assessing the False Positives (if the new version is generating a ton of false positives, it's not really an improvement).
To be noted, though, the diffs are not generated properly for a few days, so we might better wait for a fix of this issue before merging this PR anyways.

@Sharcoux
Copy link
Contributor Author

Sharcoux commented Apr 9, 2024

@LucieSteib About CONFUSION_ER_E_PAR2, I'm confident to arrive at a quite good result and I promise to analyze very carefully the diffs until I get something satisfying. However, I had to give up some detections to work around errors made in the disambiguation process, as mentioned here. In the future, I might try to improve the disambiguation to achieve even better detection. But the current version should already bring quite an improvement.

I'm not sure about how to do the replace of one rule by the other and see the diff. I just know how to see the diffs introduced by new rules being added. Maybe, when the rule is ready, I should open a PR that does the replacement?

The models operations are available for non-Premium users, on the Editor and the web extension.
Only the code supporting it is not accessible, it's a mt5 model massively trained that cannot be changed like XML rules, that's why it's not "in the OS repository".

I don't think that this is available for self hosted instances, is it? We use it for offline usage. If it's not available, then I still insist that we need a rule in the project to cover those problems. I have no problem trying to improve the rule, though, if you don't like the way it is now.

@LucieSteib
Copy link
Collaborator

LucieSteib commented Apr 9, 2024

No, you're right, the model would be far too heavy for a self-hosted instance, it's not accessible like that.

About VERBE_SUIVI_D_UN_NOM the gender/number agreement for the determiner, look maybe at D_N_E_OU_E[2] that is correcting "Je prends un pelle" to "une pelle":

<pattern>
    <token postag="D m s">
        <exception scope="previous">qui</exception>
        <exception regexp="yes">(?-i)CE|audit|tout</exception></token>
    <token regexp="yes" min="0" max="1">très|trop|vraiment</token>
    <token postag="N f s" regexp="yes">.*e$
        <exception scope="next">-</exception>
        <exception regexp="yes">cote|chèvre|chouette|date|attrape|bouche|trompe|pratique|couche|lave|cure|case|pile|toilette|coupe|barde|ride|bienvenue|monte|marque|porte|micro-ondes?|traine|escape|lie|lire|serge|donne|marine|martyre|malaise|(?-i)[A-Z].*|prime|tome</exception></token>
</pattern>
<filter class="org.languagetool.rules.fr.FrenchSuppressMisspelledSuggestionsFilter" args="suppressMatch:true"/>
<message>Le déterminant s'accorde avec le nom "\3".</message>
<suggestion suppress_misspelled="yes">\1 \2 <match no="3" regexp_match="(?iu)e" regexp_replace="é"/></suggestion>
<suggestion suppress_misspelled="yes">\1 \2 <match no="3" regexp_match="(?iu)e" regexp_replace=""/></suggestion>
<suggestion><match no="1" postag="(D|J) .*" postag_regexp="yes" postag_replace="$1 f s"/> \2 \3</suggestion>
<example correction="le carré|la carre">Il faut que <marker>le carre</marker> soit à gauche.</example>

About CONFUSION_ER_E_PAR2: I've nothing against the rule :) as long as it's at least as good as the current one, see the process we could try:

About the actual replacement, when the rule is ready (meaning:

  • there are no more FPs visible in the diff,
  • all the FPs AND FNs from the current are handled)
    we can change the priorities in the file French.java to achieve something close to a “replacement” in the sense of pasting the new rule over the old one.
    The priority will have to be higher for the new version than the old one, and below the model's priority. But we will cross that bridge when we get to the river.

<exception>godot</exception>
<exception postag="[^N].*" postag_regexp="yes"/>
<exception regexp="yes" case_sensitive="yes">[A-ZÉÈÀÙÂÊÎÔÛÄËÏÖÜÇ].*</exception>
<exception regexp="yes">réparation|confirmation|famille|godot|lundi|mardi|mercredi|jeudi|vendredi|samedi|dimanche|janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre|début|mi-.*|fin</exception>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion(format): maybe here you could use ENTITY like mois_annee and jours_semaine (but maybe also unites_temps, parties_journee...) all the ENTITIES are at the top of the grammar.xml file.

You can call them in the rule with structures like:
<token regexp="yes">&mois_annee;</token>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants