A simple Elasticsearch plugin that adds a copypaste token filter for filtering out duplicate text.
./bin/elasticsearch-plugin install https://github.com/abiko-search/analysis-copypaste/releases/download/v1.0.0/analysis-copypaste-1.0.0.zip
GET /_analyze?pretty=true
{
  "tokenizer": "standard",
  "filter": ["copypaste"],
  "text": "So I repeat every sentence twice. So I repeat every sentence twice. That is dumb."
}
The request returns the following result:
{
  "tokens" : [
    {
      "token" : "So",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "I",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "repeat",
      "start_offset" : 5,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "every",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "sentence",
      "start_offset" : 18,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "twice",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "That",
      "start_offset" : 68,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "is",
      "start_offset" : 73,
      "end_offset" : 75,
      "type" : "<ALPHANUM>",
      "position" : 13
    },
    {
      "token" : "dumb",
      "start_offset" : 76,
      "end_offset" : 80,
      "type" : "<ALPHANUM>",
      "position" : 14
    }
  ]
}
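Note the gap in the result: positions jump from 5 to 12 and offsets from 32 to 68. The six tokens of the repeated sentence are dropped, while the surviving tokens keep their original positions and offsets.

The following Python sketch illustrates that behaviour outside Elasticsearch. It is not the plugin's algorithm: the function name dedupe_tokens, the regex tokenizer, and the 3-token window are all assumptions chosen for illustration. It simply drops every token that belongs to a 3-token sequence already seen earlier in the text.

```python
import re


def dedupe_tokens(text, window=3):
    """Drop tokens that are part of an already-seen run of `window` tokens.

    Illustrative only: a naive stand-in, not the copypaste filter's logic.
    Returns (token, start_offset, end_offset, position) tuples.
    """
    # Rough approximation of the standard tokenizer for plain ASCII words.
    tokens = [
        (m.group(), m.start(), m.end(), pos)
        for pos, m in enumerate(re.finditer(r"\w+", text))
    ]

    seen = set()
    duplicate = [False] * len(tokens)

    # Slide a fixed-size window over the token stream.
    for i in range(len(tokens) - window + 1):
        key = tuple(t[0] for t in tokens[i:i + window])
        if key in seen:
            # Every token inside a repeated window is marked for removal.
            for j in range(i, i + window):
                duplicate[j] = True
        else:
            seen.add(key)

    # Survivors keep their original offsets and positions.
    return [t for t, dup in zip(tokens, duplicate) if not dup]


if __name__ == "__main__":
    text = (
        "So I repeat every sentence twice. "
        "So I repeat every sentence twice. That is dumb."
    )
    for token in dedupe_tokens(text):
        print(token)
```

For the sample text above, the sketch keeps the same nine tokens as the plugin output, with the same offsets and positions. Other inputs may differ from what the plugin returns.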
CopyPaste is a derivative work of the Elasticsearch DeDuplicatingTokenFilter.