The analyzer keeps the analysis of former analyzed tokens in the results #17

Open
dpappas opened this issue May 31, 2019 · 4 comments

dpappas commented May 31, 2019

Hi everyone!

I am trying to use the greeklish analyzer in Elasticsearch 6.7.1, but the analyzer keeps appending the analysis of previously analyzed tokens to the results.

For example, I analyze the token "αυγο" and get its analysis.
Then I analyze the token "παιχνιδι" and get the analysis of "παιχνιδι", but the greeklish variants of "αυγο" are appended to the results.

Can you help me?
Thank you in advance

Contributor

cmantas commented May 31, 2019

Hello. Could you illustrate the problem using the analyze API?

Author

dpappas commented May 31, 2019

Ok!

We can reproduce this in Kibana as well:

DELETE /my_index

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "simple_greeklish": {
          "tokenizer": "standard",
          "filter": ["lowercase", "greeklish"]
        }
      },
      "filter": {
        "greeklish": {
          "type": "skroutz_greeklish",
          "max_expansions": 5,
          "greek_variants": "false"
        }
      }
    }
  }
}

When I send the following request:

GET /my_index/_analyze
{
  "analyzer": "simple_greeklish",
  "text": "αυγο"
}

I get the expected analysis:

{
  "tokens" : [
    {
      "token" : "αυγο",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "aygo",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "avgo",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "afgo",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "augo",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "greeklish_word",
      "position" : 0
    }
  ]
}

But then, when I send the following request:

GET /my_index/_analyze
{
  "analyzer": "simple_greeklish",
  "text": "παιχνιδι"
}

I get the new analysis, but the greeklish tokens of the former one as well (note that they carry the offsets of the new token):

{
  "tokens" : [
    {
      "token" : "παιχνιδι",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "pehnidi",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "paichnidi",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "paihnidi",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "pexnidi",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "paixnidi",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "aygo",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "avgo",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "afgo",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    },
    {
      "token" : "augo",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "greeklish_word",
      "position" : 0
    }
  ]
}
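
To confirm the leak without comparing the two responses by eye, a small helper can diff them. This is only a sketch: `leaked_tokens` is a hypothetical name, and the inputs are assumed to be the parsed JSON bodies of the two `_analyze` responses above, reduced here to the "token" field.

```python
def leaked_tokens(first_response, second_response):
    """Return tokens of the second response that also appeared in the first.

    Both arguments are parsed `_analyze` response bodies: dicts with a
    "tokens" list whose items carry a "token" string.
    """
    seen_before = {t["token"] for t in first_response["tokens"]}
    # Keep the order in which the second response reports its tokens.
    return [t["token"] for t in second_response["tokens"]
            if t["token"] in seen_before]


# The two responses from this report, reduced to the "token" field.
first = {"tokens": [{"token": t} for t in
                    ["αυγο", "aygo", "avgo", "afgo", "augo"]]}
second = {"tokens": [{"token": t} for t in [
    "παιχνιδι", "pehnidi", "paichnidi", "paihnidi", "pexnidi", "paixnidi",
    "aygo", "avgo", "afgo", "augo",
]]}

print(leaked_tokens(first, second))
```

For the responses above it reports the four greeklish variants of "αυγο" and not the Greek token itself. The check is only meaningful when the two input texts share no legitimate tokens.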

I have tried resetting each of the following, but the output did not change:
tokenStream in GreeklishTokenFilterFactory
greeklishWords in GreeklishTokenFilter
greeklishList and perWordGreeklish in GreeklishGenerator
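
For context on why resetting those fields is the natural thing to try: the output above looks like what a reused object produces when it accumulates results in a field and never discards them between inputs. The sketch below is a toy Python model of that pattern, not the plugin's Java code. `LeakyExpander`, `ResettingExpander`, and the `VARIANTS` table are hypothetical names, and whether this is the actual cause in the plugin is not established in this thread.

```python
# Hypothetical lookup table, filled with the variants reported above.
VARIANTS = {
    "αυγο": ["aygo", "avgo", "afgo", "augo"],
    "παιχνιδι": ["pehnidi", "paichnidi", "paihnidi", "pexnidi", "paixnidi"],
}


class LeakyExpander:
    """Toy stand-in for a filter instance that is reused across requests."""

    def __init__(self, variants):
        self.variants = variants
        self.buffer = []  # per-instance state shared by every call

    def analyze(self, word):
        # Modeled bug: new variants are put in front of the buffer, but
        # variants buffered by an earlier call are never discarded.
        self.buffer = list(self.variants[word]) + self.buffer
        return [word] + self.buffer


class ResettingExpander(LeakyExpander):
    """Same model, but each analysis starts from an empty buffer."""

    def analyze(self, word):
        self.buffer = []
        return super().analyze(word)


leaky = LeakyExpander(VARIANTS)
leaky.analyze("αυγο")
print(leaky.analyze("παιχνιδι"))   # new variants followed by the stale ones

fixed = ResettingExpander(VARIANTS)
fixed.analyze("αυγο")
print(fixed.analyze("παιχνιδι"))   # only the variants of the current word
```

In this model the second call on `leaky` returns the same ten tokens, in the same order, as the second response above, while `fixed` returns only the six expected ones. If clearing the equivalent fields in the plugin changes nothing, the stale state presumably lives somewhere other than the fields that were cleared.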

Contributor

cmantas commented May 31, 2019

Very strange.
Thankfully, this bug does not manifest itself in Elasticsearch v5.4.2, so that version can be used if downgrading is an option.

Unfortunately, development of this plugin for newer versions hasn't started yet, so the dev team cannot offer a timeline/ETA for a fix.
It does seem like a critical issue for v6.x.x, though.

@petsoukos

This issue still persists up to ES 7.7. Has anyone found a solution/workaround?
