Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

目标和参考资料 #1

Open
guonaihong opened this issue Jul 30, 2020 · 12 comments
Open

目标和参考资料 #1

guonaihong opened this issue Jul 30, 2020 · 12 comments

Comments

@guonaihong
Copy link
Contributor

guonaihong commented Jul 30, 2020

目标

  • 用go实现字符串相似度lib
  • 处理中文准确度较高(目前很多老外写的库处理中文效果不佳)
  • 集成多种相似度算法(编辑距离,汉明编码,骰子系数)

莱文斯坦-编辑距离(Levenshtein)

Hamming

Dice's coefficient

  • https://blog.csdn.net/gjk0223/article/details/2314844
    n个字符算集合一个元素,这点容易忽略,n是可以配置的,很多开源项目都忽略这点。原论文公式是 2 *(a 和b的交集) /(len(a) + len(b)),默认选择2,但是2对中文不太友好

Jaro

TODO

  • Damerau-Levenshtein - distance & normalized
  • Jaro and Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix length

补充

https://help.highbond.com/helpdocs/analytics/13/scripting-guide/zh-cn/Content/lang_ref/functions/r_dicecoefficient.htm

参考API设计(取名)

参考选用了哪些算法名字

@SummerSec
Copy link
Collaborator

余弦相似度算法考虑实现吗?

@guonaihong
Copy link
Contributor Author

余弦相似度算法考虑实现吗?

可以, 不过要排https://github.com/guonaihong/gstl 这个项目发布之后了.

@SummerSec
Copy link
Collaborator

余弦相似度算法考虑实现吗?

可以, 不过要排https://github.com/guonaihong/gstl 这个项目发布之后了.

OK,我看看能不能学会,康康能不能提交一个pr,simhash也是相似度算法。

@guonaihong
Copy link
Contributor Author

guonaihong commented May 24, 2022 via email

@SummerSec
Copy link
Collaborator

我通过简单实验发现如果将文本进行base64编码之后,相似度结果会更加精确一些。
示例代码:

package main

import (
	"encoding/base64"
	"fmt"
	"github.com/antlabs/strsim"
)

func main() {
	en := base64.NewEncoding("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
	s1 := "<!DOCTYPE html>\n<html>\n<head>\n<title>Error</title>\n<style>\n                                                       \n    body {\n        width: 35em;\n        margin: 0 auto;\n        font-family: Tahoma, Verdana, Arial, sans-serif;\n    }\n                                                     \n</style>\n</head>\n<body>\n<h1><center>An error occurred.</center></h1>\n<p><center>An error occurred.</center></p>\n<p><center>                                                                                                        </center></p>\n</body>\n</html>"
	s2 := "<html>\n<head><title>404 Not Found</title></head>\n<body>\n<center><h1>404 Not Found</h1></center>\n<hr><center>openresty</center>\n</body>\n</html>\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n"
	str1 := en.EncodeToString([]byte(s1))
	str2 := en.EncodeToString([]byte(s2))
	// 默认算法的相似度 莱文斯坦-编辑距离
	a1 := strsim.Compare(str1, str2)
	b1 := strsim.Compare(s1, s2)
	fmt.Printf("Defalut Levenshtein  Similarity base64 %f original %f\n", a1, b1)
	// Hamming算法计算出的相似度
	a2 := strsim.Compare(str1, str2, strsim.Hamming())
	b2 := strsim.Compare(s1, s2, strsim.Hamming())
	fmt.Printf("Hamming Similarity base64 %f original %f\n", a2, b2)

	// Dice's coefficient 算法计算出的相似度
	a3 := strsim.Compare(str1, str2, strsim.DiceCoefficient())
	b3 := strsim.Compare(s1, s2, strsim.DiceCoefficient())
	fmt.Printf("Dice's coefficient Similarity  base64 %f original %f\n", a3, b3)

	//jaro 算法计算出的相似度
	a4 := strsim.Compare(str1, str2, strsim.Jaro())
	b4 := strsim.Compare(str1, str2, strsim.Jaro())
	fmt.Printf("jaro Similarity  base64 %f original %f\n", a4, b4)

}

运行测试结果,我想将这个发现提交pr,我觉得是很有必要的。

Defalut Levenshtein Similarity base64 0.111264 original 0.156250
Hamming Similarity base64 0.067308 original 0.084559
Dice's coefficient Similarity base64 0.229599 original 0.286772
jaro Similarity base64 0.521006 original 0.521006

@SummerSec
Copy link
Collaborator

另外,我想加入你们组织,不知是否可以?

@SummerSec
Copy link
Collaborator

我大致简单说明一下我加入的余弦相似度算法和Jaro-Winkler算法
首先是Jaro-Winkler算法,我在看相关文档发现你提出Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix lengt问题的解决方法,相同的前缀最大值为4,也就是p的影响因子的取值范围为[0,4]是一个闭区间,如果大于4取4。在jaro-and-jaro-winkler-similarity有着对应的代码
image

@SummerSec
Copy link
Collaborator

余弦相似度,相关原理就是高中学过的空间向量定理,两个向量夹角的角度。但余弦相似度得计算词出现频率,如果用分词的效率不高,并且计算量,内存开销都会很大。于是我想到采用base64编码的方式,因为base64编码的字符串同一个字符是相同的,也等价的计算词语出现的频率。然后将base64标准字符串进行余弦计算。

package main

import (
	"fmt"
	"github.com/antlabs/strsim"
)

func main() {
	Levenshtein := strsim.Compare("abc", "ab")
	fmt.Printf("Levenshtein Algorithm: %f\n", Levenshtein)
	dice := strsim.Compare("abc", "ab", strsim.DiceCoefficient())
	fmt.Printf("Dice Coefficient Algorithm: %f\n", dice)
	jaro := strsim.Compare("abc", "ab", strsim.Jaro())

	fmt.Printf("Jaro Algorithm: %f\n", jaro)
	jarow := strsim.Compare("abc", "ab", strsim.JaroWinkler())
	fmt.Printf("Jaro Winkler Algorithm: %f\n", jarow)
	ham := strsim.Compare("abc", "ab", strsim.Hamming())
	fmt.Printf("Hamming Algorithm: %f\n", ham)

	consine := strsim.Compare("abc", "ab", strsim.Cosine())
	fmt.Printf("Consine Similarity Algorithm : %f\n", consine)
}

最终计算出来的结果也是可观

Levenshtein Algorithm: 0.666667
Dice Coefficient Algorithm: 0.666667
Jaro Algorithm: 0.888889
Jaro Winkler Algorithm: 0.888889
Hamming Algorithm: 0.666667
Consine Similarity Algorithm : 0.577350

@SummerSec
Copy link
Collaborator

simhash的难点感觉是在分词和加权,分词处理之后,加权操作目前没有很好的解决方法。分词的话,如果是中英文的都有的情况也很难处理,比例说网页的源代码。对于分词这部分,我目前简单将文本进行base64编码,然后将粗暴的以四个字符分为一组。对于加权这块,我是统计每组出现的概率进行加权。

	Levenshtein := strsim.Compare("abc", "ab")
	fmt.Printf("Levenshtein Algorithm: %f\n", Levenshtein)
	dice := strsim.Compare("abc", "ab", strsim.DiceCoefficient())
	fmt.Printf("Dice Coefficient Algorithm: %f\n", dice)
	jaro := strsim.Compare("abc", "ab", strsim.Jaro())

	fmt.Printf("Jaro Algorithm: %f\n", jaro)
	jarow := strsim.Compare("abc", "ab", strsim.JaroWinkler())
	fmt.Printf("Jaro Winkler Algorithm: %f\n", jarow)
	ham := strsim.Compare("abc", "ab", strsim.Hamming())
	fmt.Printf("Hamming Algorithm: %f\n", ham)

	consine := strsim.Compare("abc", "ab", strsim.Cosine())
	fmt.Printf("Consine Similarity Algorithm : %f\n", consine)

	simhash := strsim.Compare("abc", "ab", strsim.Simhash())
	fmt.Printf("Simhash Algorithm: %f \n", simhash)

目前没有经过大量测试,简单结果仅限参考。

Levenshtein Algorithm: 0.666667
Dice Coefficient Algorithm: 0.666667
Jaro Algorithm: 0.888889
Jaro Winkler Algorithm: 0.888889
Hamming Algorithm: 0.666667
Consine Similarity Algorithm : 0.577350
Simhash Algorithm: 0.500000

@kooksee
Copy link

kooksee commented Jan 12, 2023

https://en.wikipedia.org/wiki/MinHash
这个也可以集成下,我们之前用于文档的相识度,用这个比较的

@guonaihong
Copy link
Contributor Author

https://en.wikipedia.org/wiki/MinHash 这个也可以集成下,我们之前用于文档的相识度,用这个比较的

可以。

@zzZfans
Copy link

zzZfans commented Sep 4, 2023

那种算法的表现最好?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants