java - Spell-Check: Find one-to-one token difference mapping between two strings -
i stumbled on question on internet archive , having difficulty wrapping head around it. want find desired mapping amongst different tokens between 2 strings. output should string-to-string map.
for example:
string1: hewlottpackardenterprise helped american raleways in n y
string2: hewlett packard enterprise helped american railways in ny
output:
hewlottpackardenterprise -> hewlett packard enterprisehewlott -> hewlett
raleways -> railways
n y -> ny
note: have been able write edit-distance method, finds types of edits (segregated types, deletion, substitution etc.) , can convert first string second convert
method
what have tried far?
approach 1: began naive approach of splitting both strings space, inserting tokens of first string hash map , comparing tokens of other string hashmap. however, approach fails misses on relevant mappings.
approach 2: utilize covert
method find edit positions in string, , type of edits. using space edits, i'm able create mapping hewlottpackardenterprise -> hewlett packardenterprise. however, method explodes more , more things need splitted within same word.
appreciate thoughts in regard! clear doubts in comments.
public string returnwhitespaceedittoken(editdone e, list<string> testtokens) { int pos = e.pos, count=0, i=0; string resulttoken = null; if (e.type.equals(deleteedit)) { (i=0;i<testtokens.size();i++) { count+=testtokens.get(i).length(); if (count==pos) { break; } if (i!=testtokens.size()-1) { count++; } } resulttoken = testtokens.get(i) + " " + testtokens.get(i+1); } else if (e.type.equals(insertedit)) { (i=0;i<testtokens.size();i++) { count+=testtokens.get(i).length(); if (count>pos) { break; } if (i!=testtokens.size()-1) { count++; } } string token = testtokens.get(i); resulttoken = token.substring(count-token.length(), pos) + token.substring(pos, count); } return resulttoken; }
a pretty common way of handling problems find longest common subsequence (or it's dual shortest edit script) between 2 strings , post-process output specific format want; in case string maps.
wikipedia has pretty decent introduction problem here: https://en.wikipedia.org/wiki/longest_common_subsequence_problem
and great paper "an o(nd) difference algorithm , variations" myers can found here. http://www.xmailserver.org/diff2.pdf
Comments
Post a Comment