读书人

字符串匹配,容许一个字符不匹配

发布时间: 2012-12-30 10:43:14 作者: rapoo

字符串匹配,允许一个字符不匹配
Oh!! Mankind is in trouble again.This time its a deadly disease spreading at a rate never seen before. Efficient detectors for the virus responsible is the need of the hour. You are the lead at Central Hospital and need to find a fast and reliable way to detect the 'foot-prints' of the virus DNA in that of patient.

The DNA of the patient as well as of the virus consists of lower case letters. Since the data collected is raw there may be some errors. You will need to find all substrings in the patient DNA that exactly matches the virus DNA with the exception of at most one mismatch.

For example tolerating at most one mismatch, "aa" and "aa" are matching, "ab" and "aa" are matching, while "ab" and "ba" are not.

Input:
The first line contains the number of test cases T. T cases follow. Each case contains two lines containing strings P(Patient DNA) and V(Virus DNA) . Each case is followed by a blank line.

Output:
Output T lines, one corresponding to each case. For each case, output a space delimited list of starting indices (0 indexed) of substrings of P which are matching with V according to the condition mentioned above . The indices has to be in increasing order.

Constraints:
1 <= T <= 10
P and V contain at most 100000 characters each.
All characters in P and V are lowercase letters.

Sample Input:
3
abbab
ba



hello

world



banana

nan


Sample Output:
1 2



0 2

Explanation:
For the first case, the substrings of P starting at indices 1 and 2 are "bb" and "ba" and they are matching with the string V which is "ba".
For the second case, there are no matching substrings so the output is a blank line.

题目大意:
给定两个字符串s1,s2。在s1中找到s2(最多允许一个字符不相同)。
相同字符串匹配可以用kmp,所以,我的思路是改变一下kmp。
1,如果字符串完全相同,直接用kmp。
2,如果有一个字符串不相同,那么在第一个不相同的地方做标记,p_i, p_j,然后,继续比较下去,直到第二个不匹配的字符或者后面的字符全部相符。
前面两种情况在一次匹配完后,都要将下标(分别用i,j表示)移动。
情况1,j = pat[j-1], i = i+1
情况2, i, j = p_i, p_j #先移到第一次不匹配的地方
while j > 0 and tmp[j] != src[i]:
j = pat[j-1] #类似kmp方法将j尽量向前移动
这是一次匹配的情况,每次匹配都返回当前下标i,j,开始下标,是否匹配四个参数

代码:

def kmp_pattern(seq):
P = [0]*len(seq)


j = 0
for i in range(1, len(seq)):
while j > 0 and seq[j] != seq[i]:
j = P[j-1]
if seq[j] == seq[i]:
j += 1
P[i] = j
return P

def kmp_match(src, tmp, pat):
ret = []
j = 0
for i in range(len(src)):
while j > 0 and tmp[j] != src[i]:
j = pat[j-1]
if tmp[j] == src[i]:
j += 1
if j == len(tmp):
ret.append(i-len(tmp)+1)
j = pat[j-1]
return ret

def kmp_mismatch_once(src, tmp, pat, i, j):
p_i, p_j = i, j
state = 0
s_len, t_len = len(src), len(tmp)
while i<len(src):
if j<t_len and state == 0:
if src[i] == tmp[j]:
j += 1
else:
state = 1
p_i, p_j = i, j
j += 1
elif j<t_len and state == 1:
if src[i] == tmp[j]:
j += 1
else:
#not match
i, j = p_i, p_j
while j>0 and tmp[j] != src[i]:
j = pat[j-1]
return i, j, i-j, False

if j == len(tmp):


ret = i - t_len + 1
if state == 0:
j = pat[j-1]
return i+1, j, ret, True
elif state == 1:
i, j = p_i, p_j
while j>0 and tmp[j] != src[i]:
j = pat[j-1]
return i, j, ret, True
i += 1
return i, j, 0, False


def kmp_one_mismatch(src, tmp, pat):
pre_i, pre_j, i, j = 0, 0, 0, 0
pre_idx = -1
idx, ret2 = 0, False
ret = []
while i<len(src):
i, j, idx, ret2 = kmp_mismatch_once(src, tmp, pat, i, j)
if pre_i == i and pre_j == j and idx == pre_idx: #repeat
i += 1
else:
pre_i, pre_j = i, j
pre_idx = idx
if ret2 == True:
ret.append(idx)
return ret

sys.stdin=open( 'input00.txt', 'r')
T = int(raw_input())
while T:
T -= 1
src = raw_input()
tmp = raw_input()
if T != 0:
emp = raw_input()
src = list(src)
tmp = list(tmp)

pat = kmp_pattern(tmp)
idces = kmp_one_mismatch(src, tmp, pat)
for v in idces:
print(v),
print('')



提交的结果是wrong error。这个问题前前后后搞了我2周了。我怀疑我的逻辑上有问题。
请帮助一下,谢谢。

[解决办法]


换我做的话我会用rabin karp,配上int64的hash。事先把匹配串的所有1mismatch的hash给算出来。一共也就26*m的hash数,而且能线性的求出所有的hash。排个序以便二分搜索。接下来就是算原串所有长度m的子串,然后看是不是在刚才求出的列表里。整个可以做到nlogn。

kmp我一直不算掌握的很好,所以你这个对不对我还真说不上- -b
[解决办法]
试试用NFA或者DFA解决
[解决办法]
目测可以用NFA,求链接
[解决办法]

引用:
引用:目测可以用NFA,求链接

详细说说算法?复杂度为多少?


和9L做法差不多,如果javascript的正则表达式是转化为自动机来处理的话
个人理解本质上kmp就是在跑一个nfa。用nfa理解可能简单一点,只是个思路而已,复杂度我也不知道。

读书人网 >软件架构设计

热点推荐