目标和参考资料 #1

---原始邮件--- 发件人: ***@***.***> 发送时间: 2022年5月24日(周二) 晚上9:12 收件人: ***@***.***>; 抄送: ***@***.******@***.***>; 主题: Re: [antlabs/strsim] 目标和参考资料 (#1) 余弦相似度算法考虑实现吗？可以, 不过要排https://github.com/guonaihong/gstl 这个项目发布之后了. OK，我看看能不能学会，康康能不能提交一个pr，simhash也是相似度算法。 — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: ***@***.***>

SummerSec · 2022-05-25T07:22:02Z

我通过简单实验发现如果将文本进行base64编码之后，相似度结果会更加精确一些。
示例代码：

package main

import (
	"encoding/base64"
	"fmt"
	"github.com/antlabs/strsim"
)

func main() {
	en := base64.NewEncoding("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
	s1 := "<!DOCTYPE html>\n<html>\n<head>\n<title>Error</title>\n<style>\n                                                       \n    body {\n        width: 35em;\n        margin: 0 auto;\n        font-family: Tahoma, Verdana, Arial, sans-serif;\n    }\n                                                     \n</style>\n</head>\n<body>\n<h1><center>An error occurred.</center></h1>\n<p><center>An error occurred.</center></p>\n<p><center>                                                                                                        </center></p>\n</body>\n</html>"
	s2 := "<html>\n<head><title>404 Not Found</title></head>\n<body>\n<center><h1>404 Not Found</h1></center>\n<hr><center>openresty</center>\n</body>\n</html>\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n"
	str1 := en.EncodeToString([]byte(s1))
	str2 := en.EncodeToString([]byte(s2))
	// 默认算法的相似度 莱文斯坦-编辑距离
	a1 := strsim.Compare(str1, str2)
	b1 := strsim.Compare(s1, s2)
	fmt.Printf("Defalut Levenshtein  Similarity base64 %f original %f\n", a1, b1)
	// Hamming算法计算出的相似度
	a2 := strsim.Compare(str1, str2, strsim.Hamming())
	b2 := strsim.Compare(s1, s2, strsim.Hamming())
	fmt.Printf("Hamming Similarity base64 %f original %f\n", a2, b2)

	// Dice's coefficient 算法计算出的相似度
	a3 := strsim.Compare(str1, str2, strsim.DiceCoefficient())
	b3 := strsim.Compare(s1, s2, strsim.DiceCoefficient())
	fmt.Printf("Dice's coefficient Similarity  base64 %f original %f\n", a3, b3)

	//jaro 算法计算出的相似度
	a4 := strsim.Compare(str1, str2, strsim.Jaro())
	b4 := strsim.Compare(str1, str2, strsim.Jaro())
	fmt.Printf("jaro Similarity  base64 %f original %f\n", a4, b4)

}

运行测试结果，我想将这个发现提交pr，我觉得是很有必要的。

Defalut Levenshtein Similarity base64 0.111264 original 0.156250
Hamming Similarity base64 0.067308 original 0.084559
Dice's coefficient Similarity base64 0.229599 original 0.286772
jaro Similarity base64 0.521006 original 0.521006

SummerSec · 2022-05-25T07:23:18Z

另外，我想加入你们组织，不知是否可以？

SummerSec · 2022-05-25T11:50:55Z

我大致简单说明一下我加入的余弦相似度算法和Jaro-Winkler算法
首先是Jaro-Winkler算法，我在看相关文档发现你提出Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix lengt问题的解决方法，相同的前缀最大值为4，也就是p的影响因子的取值范围为[0,4]是一个闭区间，如果大于4取4。在jaro-and-jaro-winkler-similarity有着对应的代码

SummerSec · 2022-05-25T12:04:14Z

余弦相似度，相关原理就是高中学过的空间向量定理，两个向量夹角的角度。但余弦相似度得计算词出现频率，如果用分词的效率不高，并且计算量，内存开销都会很大。于是我想到采用base64编码的方式，因为base64编码的字符串同一个字符是相同的，也等价的计算词语出现的频率。然后将base64标准字符串进行余弦计算。

package main

import (
	"fmt"
	"github.com/antlabs/strsim"
)

func main() {
	Levenshtein := strsim.Compare("abc", "ab")
	fmt.Printf("Levenshtein Algorithm: %f\n", Levenshtein)
	dice := strsim.Compare("abc", "ab", strsim.DiceCoefficient())
	fmt.Printf("Dice Coefficient Algorithm: %f\n", dice)
	jaro := strsim.Compare("abc", "ab", strsim.Jaro())

	fmt.Printf("Jaro Algorithm: %f\n", jaro)
	jarow := strsim.Compare("abc", "ab", strsim.JaroWinkler())
	fmt.Printf("Jaro Winkler Algorithm: %f\n", jarow)
	ham := strsim.Compare("abc", "ab", strsim.Hamming())
	fmt.Printf("Hamming Algorithm: %f\n", ham)

	consine := strsim.Compare("abc", "ab", strsim.Cosine())
	fmt.Printf("Consine Similarity Algorithm : %f\n", consine)
}

最终计算出来的结果也是可观

Levenshtein Algorithm: 0.666667
Dice Coefficient Algorithm: 0.666667
Jaro Algorithm: 0.888889
Jaro Winkler Algorithm: 0.888889
Hamming Algorithm: 0.666667
Consine Similarity Algorithm : 0.577350

SummerSec · 2022-05-27T04:28:56Z

simhash的难点感觉是在分词和加权，分词处理之后，加权操作目前没有很好的解决方法。分词的话，如果是中英文的都有的情况也很难处理，比例说网页的源代码。对于分词这部分，我目前简单将文本进行base64编码，然后将粗暴的以四个字符分为一组。对于加权这块，我是统计每组出现的概率进行加权。

	Levenshtein := strsim.Compare("abc", "ab")
	fmt.Printf("Levenshtein Algorithm: %f\n", Levenshtein)
	dice := strsim.Compare("abc", "ab", strsim.DiceCoefficient())
	fmt.Printf("Dice Coefficient Algorithm: %f\n", dice)
	jaro := strsim.Compare("abc", "ab", strsim.Jaro())

	fmt.Printf("Jaro Algorithm: %f\n", jaro)
	jarow := strsim.Compare("abc", "ab", strsim.JaroWinkler())
	fmt.Printf("Jaro Winkler Algorithm: %f\n", jarow)
	ham := strsim.Compare("abc", "ab", strsim.Hamming())
	fmt.Printf("Hamming Algorithm: %f\n", ham)

	consine := strsim.Compare("abc", "ab", strsim.Cosine())
	fmt.Printf("Consine Similarity Algorithm : %f\n", consine)

	simhash := strsim.Compare("abc", "ab", strsim.Simhash())
	fmt.Printf("Simhash Algorithm: %f \n", simhash)

目前没有经过大量测试，简单结果仅限参考。

Levenshtein Algorithm: 0.666667
Dice Coefficient Algorithm: 0.666667
Jaro Algorithm: 0.888889
Jaro Winkler Algorithm: 0.888889
Hamming Algorithm: 0.666667
Consine Similarity Algorithm : 0.577350
Simhash Algorithm: 0.500000

kooksee · 2023-01-12T15:39:49Z

https://en.wikipedia.org/wiki/MinHash
这个也可以集成下，我们之前用于文档的相识度，用这个比较的

guonaihong · 2023-01-13T14:13:28Z

https://en.wikipedia.org/wiki/MinHash 这个也可以集成下，我们之前用于文档的相识度，用这个比较的

可以。

zzZfans · 2023-09-04T06:17:59Z

那种算法的表现最好？

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

目标和参考资料 #1

目标和参考资料 #1

guonaihong commented Jul 30, 2020 •

edited

Loading

SummerSec commented May 24, 2022

guonaihong commented May 24, 2022

SummerSec commented May 24, 2022

guonaihong commented May 24, 2022 via email

SummerSec commented May 25, 2022

SummerSec commented May 25, 2022

SummerSec commented May 25, 2022

SummerSec commented May 25, 2022

SummerSec commented May 27, 2022

kooksee commented Jan 12, 2023

guonaihong commented Jan 13, 2023

zzZfans commented Sep 4, 2023

目标和参考资料 #1

目标和参考资料 #1

Comments

guonaihong commented Jul 30, 2020 • edited Loading

目标

莱文斯坦-编辑距离(Levenshtein)

Hamming

Dice's coefficient

Jaro

TODO

补充

参考API设计(取名)

参考选用了哪些算法名字

SummerSec commented May 24, 2022

guonaihong commented May 24, 2022

SummerSec commented May 24, 2022

guonaihong commented May 24, 2022 via email

SummerSec commented May 25, 2022

SummerSec commented May 25, 2022

SummerSec commented May 25, 2022

SummerSec commented May 25, 2022

SummerSec commented May 27, 2022

kooksee commented Jan 12, 2023

guonaihong commented Jan 13, 2023

zzZfans commented Sep 4, 2023

guonaihong commented Jul 30, 2020 •

edited

Loading