Edit Distance (Levenshtein Distance)
Source: Internet. Editor: 程序博客网. Time: 2024/06/11 01:09
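Anyone working in natural language processing will be familiar with this concept. Edit distance is the minimum number of insertions, deletions, and substitutions needed to transform a source string (s) into a target string (t). It is widely used in NLP, for example in evaluation metrics such as WER and mWER, and it is also commonly used to count the changes made to an original text.
The edit distance algorithm was first proposed by the Russian scientist Levenshtein, which is why it is also called Levenshtein distance.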
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example,
- If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
- If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.
The Levenshtein distance algorithm has been used in:
- Spell checking
- Speech recognition
- DNA analysis
- Plagiarism detection
Algorithm
Steps
Step 1. Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns.
Step 2. Initialize the first row to 0..n. Initialize the first column to 0..m.
Step 3. Examine each character of s (i from 1 to n).
Step 4. Examine each character of t (j from 1 to m).
Step 5. If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1.
Step 6. Set cell d[i,j] of the matrix (column i, row j) equal to the minimum of:
a. The cell immediately above plus 1: d[i,j-1] + 1.
b. The cell immediately to the left plus 1: d[i-1,j] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
Step 7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
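The steps can also be written compactly as a recurrence. This is only a restatement of steps 2 to 7, using the same d, i, j, n, m and cost, with the characters of s and t numbered from 1:

```latex
\[
d[i,0] = i \quad (0 \le i \le n), \qquad d[0,j] = j \quad (0 \le j \le m),
\]
\[
d[i,j] = \min\bigl(\, d[i-1,j] + 1,\;\; d[i,j-1] + 1,\;\; d[i-1,j-1] + \mathrm{cost}(i,j) \,\bigr),
\qquad
\mathrm{cost}(i,j) =
\begin{cases}
0 & \text{if } s[i] = t[j],\\
1 & \text{otherwise,}
\end{cases}
\]
\[
\mathrm{LD}(s,t) = d[n,m].
\]
```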
Example
This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".
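Steps 1 and 2
|   |   | G | U | M | B | O |
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 |   |   |   |   |   |
| A | 2 |   |   |   |   |   |
| M | 3 |   |   |   |   |   |
| B | 4 |   |   |   |   |   |
| O | 5 |   |   |   |   |   |
| L | 6 |   |   |   |   |   |
Steps 3 to 6 When i = 1
|   |   | G | U | M | B | O |
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 |   |   |   |   |
| A | 2 | 1 |   |   |   |   |
| M | 3 | 2 |   |   |   |   |
| B | 4 | 3 |   |   |   |   |
| O | 5 | 4 |   |   |   |   |
| L | 6 | 5 |   |   |   |   |
Steps 3 to 6 When i = 2
|   |   | G | U | M | B | O |
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 |   |   |   |
| A | 2 | 1 | 1 |   |   |   |
| M | 3 | 2 | 2 |   |   |   |
| B | 4 | 3 | 3 |   |   |   |
| O | 5 | 4 | 4 |   |   |   |
| L | 6 | 5 | 5 |   |   |   |
Steps 3 to 6 When i = 3
|   |   | G | U | M | B | O |
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 | 2 |   |   |
| A | 2 | 1 | 1 | 2 |   |   |
| M | 3 | 2 | 2 | 1 |   |   |
| B | 4 | 3 | 3 | 2 |   |   |
| O | 5 | 4 | 4 | 3 |   |   |
| L | 6 | 5 | 5 | 4 |   |   |
Steps 3 to 6 When i = 4
|   |   | G | U | M | B | O |
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 | 2 | 3 |   |
| A | 2 | 1 | 1 | 2 | 3 |   |
| M | 3 | 2 | 2 | 1 | 2 |   |
| B | 4 | 3 | 3 | 2 | 1 |   |
| O | 5 | 4 | 4 | 3 | 2 |   |
| L | 6 | 5 | 5 | 4 | 3 |   |
Steps 3 to 6 When i = 5
|   |   | G | U | M | B | O |
|   | 0 | 1 | 2 | 3 | 4 | 5 |
| G | 1 | 0 | 1 | 2 | 3 | 4 |
| A | 2 | 1 | 1 | 2 | 3 | 4 |
| M | 3 | 2 | 2 | 1 | 2 | 3 |
| B | 4 | 3 | 3 | 2 | 1 | 2 |
| O | 5 | 4 | 4 | 3 | 2 | 1 |
| L | 6 | 5 | 5 | 4 | 3 | 2 |
Step 7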
The distance is in the lower right hand corner of the matrix, i.e. 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and 1 insertion = 2 changes).
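To reproduce the example matrix programmatically, the steps can be sketched roughly as follows. The function name `levenshteinMatrix` and the use of `std::vector` are assumptions made for illustration and do not come from the original text. The matrix is stored row-first, so the cell the steps call d[i,j] is `d[j][i]` here.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): builds the full matrix
// with one column per character of s and one row per character of t,
// matching the layout of the example tables.
std::vector<std::vector<int>> levenshteinMatrix(const std::string& s,
                                                const std::string& t) {
    const std::size_t n = s.size();  // columns 1..n correspond to s
    const std::size_t m = t.size();  // rows 1..m correspond to t
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1, 0));

    // Step 2: first row is 0..n, first column is 0..m.
    for (std::size_t i = 0; i <= n; ++i) d[0][i] = static_cast<int>(i);
    for (std::size_t j = 0; j <= m; ++j) d[j][0] = static_cast<int>(j);

    for (std::size_t i = 1; i <= n; ++i) {          // step 3: each character of s
        for (std::size_t j = 1; j <= m; ++j) {      // step 4: each character of t
            // Step 5: std::string is 0-indexed, hence the -1 offsets.
            const int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            // Step 6: minimum of the three neighbouring cells.
            d[j][i] = std::min({d[j - 1][i] + 1,            // cell above
                                d[j][i - 1] + 1,            // cell to the left
                                d[j - 1][i - 1] + cost});   // diagonal cell
        }
    }
    return d;  // step 7: the distance is d[m][n]
}
```

For s = "GUMBO" and t = "GAMBOL", the returned matrix has 7 rows and 6 columns, and its lower right entry `d[6][5]` is 2, the same value as in the last table.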
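#include <stdio.h>
#include <string.h>

char s1[1000], s2[1000];

// Returns the smallest of three values.
int min3(int a, int b, int c) {
    int t = a < b ? a : b;
    return t < c ? t : c;
}

// Prints the edit distance between s1 (length len1) and s2 (length len2).
void editDistance(int len1, int len2) {
    // d[i][j] is the distance between the first i chars of s1 and the first j chars of s2.
    int** d = new int*[len1 + 1];
    for (int k = 0; k <= len1; k++)
        d[k] = new int[len2 + 1];
    int i, j;
    for (i = 0; i <= len1; i++)
        d[i][0] = i;
    for (j = 0; j <= len2; j++)
        d[0][j] = j;
    for (i = 1; i <= len1; i++)
        for (j = 1; j <= len2; j++) {
            // The arrays are 0-indexed, so cell (i, j) compares s1[i-1] with s2[j-1].
            int cost = s1[i - 1] == s2[j - 1] ? 0 : 1;
            int deletion = d[i - 1][j] + 1;
            int insertion = d[i][j - 1] + 1;
            int substitution = d[i - 1][j - 1] + cost;
            d[i][j] = min3(deletion, insertion, substitution);
        }
    printf("%d\n", d[len1][len2]);
    // Free every row, then the array of row pointers.
    for (int k = 0; k <= len1; k++)
        delete[] d[k];
    delete[] d;
}

int main() {
    // %999s keeps each word inside the 1000-byte buffers.
    while (scanf("%999s %999s", s1, s2) == 2)
        editDistance((int)strlen(s1), (int)strlen(s2));
    return 0;
}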
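The function above allocates a full (len1+1) by (len2+1) matrix. Since each row depends only on the row before it, a variant that keeps just two rows is possible when only the final distance is needed. The sketch below is such a variant, not the code from the post; the name `editDistanceTwoRows` and the `std::string` interface are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical variant (not in the original post): same recurrence, but only
// the previous and the current row are kept in memory.
int editDistanceTwoRows(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);

    // Row 0: turning the empty prefix of a into the first j chars of b takes j insertions.
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = static_cast<int>(i);  // first i chars of a to the empty string: i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,            // deletion
                                curr[j - 1] + 1,        // insertion
                                prev[j - 1] + cost});   // substitution
        }
        std::swap(prev, curr);  // the row just filled becomes the previous row
    }
    return prev[b.size()];
}
```

With this sketch, `editDistanceTwoRows("test", "tent")` gives 1 and `editDistanceTwoRows("GUMBO", "GAMBOL")` gives 2, in line with the examples in the text. The trade-off is that the intermediate matrix is no longer available for inspection.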