Python: how to restrict the value range when computing statistics on time differences
2012-04-18 12:33:33 192.168.13.106 218.16.121.240 80
2012-04-18 12:33:43 192.168.13.106 110.75.187.22 80
2012-04-18 12:34:13 192.168.65.27 192.168.0.188 443
2012-04-18 12:34:27 192.168.40.117 192.168.0.174 80
2012-04-18 12:35:39 192.168.20.109 119.147.113.98 80
2012-04-18 12:35:59 192.168.20.109 119.147.113.98 80
2012-04-18 12:36:13 192.168.65.27 192.168.0.189 443
2012-04-18 12:36:20 192.168.13.106 113.11.195.106 80
2012-04-18 12:36:26 192.168.50.112 192.168.0.174 80
2012-04-18 12:36:33 192.168.50.146 118.186.66.51 80
2012-04-18 12:36:43 192.168.30.105 192.168.0.174 80
2012-04-18 12:36:53 192.168.50.145 119.147.194.250 80
2012-04-18 12:37:01 192.168.40.105 192.168.0.174 80
2012-04-18 12:37:12 192.168.13.106 182.50.0.106 80
2012-04-18 12:37:33 192.168.13.106 182.50.0.106 80
2012-04-19 12:34:13 192.168.65.27 192.168.0.188 443
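The text format is shown above.
I want to find out whether records that share the same last three fields (source IP, destination IP, port) recur at a regular time interval.
My approach:
1. Extract the records whose last three fields are identical and group them together.
2. Within each group, compute the differences between successive timestamps. (Records can span more than one day; how does Python handle subtraction across a day boundary?)
3. Compute the percentage accounted for by the most frequent time interval. If it is greater than 90%, write the entry to high.txt and return 1.
If it is greater than 60% and less than 90%, write it to middle.txt and return 0.
If it is less than 60%, write it to low.txt.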
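On the cross-day question in step 2: subtracting two `datetime` objects takes the date into account, so a gap that crosses midnight should need no special handling. A minimal sketch in Python 3 (the helper name `seconds_between` and the constant `FMT` are made up for illustration and are not from the post):

```python
from datetime import datetime

# Timestamp layout used in the sample log lines.
FMT = "%Y-%m-%d %H:%M:%S"

def seconds_between(earlier, later):
    """Return the gap in seconds between two timestamp strings."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    # total_seconds() includes whole days, unlike the .seconds attribute.
    return delta.total_seconds()

# The two sample records for 192.168.65.27 192.168.0.188 443 are one day apart.
print(seconds_between("2012-04-18 12:34:13", "2012-04-19 12:34:13"))  # 86400.0
```

Note that `timedelta.seconds` alone would drop the day component, which is why the sketch uses `total_seconds()`.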
In other words, the three output files should look like this (values made up for illustration):
high.txt
192.168.13.106 182.50.0.106 80,95%
middle.txt
192.168.65.27 192.168.0.188 443,70%
low.txt
192.168.65.27 192.168.0.188 443,30%
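The threshold rule and the output line format could be sketched as below (Python 3). The function names `classify` and `format_line` are hypothetical. The post does not say what happens at exactly 60% or 90%, nor what is returned in the low case, so the sketch assumes 90% falls into middle, 60% falls into low, and low returns `None`:

```python
def classify(ratio):
    """Map a ratio in [0, 1] to (output file name, return code)."""
    if ratio > 0.9:
        return "high.txt", 1
    if ratio > 0.6:
        return "middle.txt", 0
    # Assumption: no return code is specified for the low bucket.
    return "low.txt", None

def format_line(key, ratio):
    """Build a line such as '192.168.13.106 182.50.0.106 80,95%'."""
    return "%s,%d%%" % (key, round(ratio * 100))

print(classify(0.95), format_line("192.168.13.106 182.50.0.106 80", 0.95))
```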
My current code is as follows:
- Python code
import re
from datetime import datetime

# read data from file
s = open(r'/home/test').read()
print s

# format data
# srcIP->destIP:port = date time
dateDict = {}
# srcIP->destIP:port = number of slots
slotDict = {}
# total number of slots
totalNum = 0

# loop
for line in s.split('\n'):
    items = line.split(' ')
    if len(items)==5:
        # total time slot
        totalNum += 1
        # new key
        newkey = items[-3]+'->'+items[-2]+':'+items[-1]
        # dateDict
        if dateDict.has_key(newkey):
            dateDict[newkey].append(items[0]+' '+items[1])
        else:
            dateDict[newkey] = [items[0]+' '+items[1]]
        # slotDist
        if slotDict.has_key(newkey):
            slotDict[newkey] += 1
        else:
            slotDict[newkey] = 0

# write files
for k in slotDict.keys():
    # ratio
    ratio = slotDict[k]*1.0/totalNum
    # line string
    newline = k+', '+str(int(ratio*100))+'%\n'
    # open file
    if ratio>0.9:
        fid = open('/home/susy/work/data/high1.txt','a+')
        print 1
    elif 0.6<ratio<=0.9:
        fid = open('/home/susy/work/data/middle1.txt','a+')
        print 0
    elif ratio<=0.6:
        fid = open('/home/susy/work/data/low.txt1','a+')
    # write, close
    fid.write(newline)
    fid.close()

print 'DONE!'
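The problem now is this:
when computing statistics on the time differences, I do not want to count differences of less than 10 seconds.
I only want to count differences greater than or equal to 10 seconds and compute the percentage over those.
How can I do that?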
[Solution]
- Python code
#!/usr/bin/python
# encoding: utf-8
import re
import datetime

# one record: "YYYY-MM-DD HH:MM:SS src_ip dst_ip port"
patt = re.compile(r'''
    (?P<dt>\d{4}\-\d{2}\-\d{2}\s\d{2}:\d{2}:\d{2})\s
    (?P<src>\d+(\.\d+){3})\s
    (?P<tag>\d+(\.\d+){3})\s
    (?P<port>\d+)
    ''', re.I|re.U|re.X)

def dataReader(filename):
    # yield one dict per line that matches the record pattern
    with open(filename, 'rt') as handle:
        for ln in handle:
            m = patt.match(ln.strip())
            if m:
                yield m.groupdict()
            else:
                continue

def s2dt(s, fmt='%Y-%m-%d %H:%M:%S'):
    return datetime.datetime.strptime(s, fmt)

def dataCollector(filename):
    # group timestamps by (src, dst, port)
    collector = {}
    for d in dataReader(filename):
        collector.setdefault( (d['src'],d['tag'],d['port']),[] ).append(s2dt(d['dt']))
    return collector

def delta(timelist):
    # differences between successive timestamps, in seconds
    timelist.sort()
    dlist = []
    t0 = timelist.pop(0)
    for t in timelist:
        d = (t - t0).total_seconds()
        t0 = t
        # skip differences shorter than 10 seconds
        if d < 10:
            continue
        dlist.append(d)
    return countdlist(dlist)

def countdlist(dlist):
    # most frequent difference and its share of the counted differences
    dd, totalcnt = {}, 0
    for d in dlist:
        totalcnt += 1
        dd.setdefault(d,[]).append(d)
    lst = [(len(dd[d]),d) for d in dd]
    if not lst:
        return None
    lst.sort()
    cnt, dur = lst[-1]
    # multiply by 100 so the value is a percentage
    return dur, '%.2f%%'%(100.*cnt/totalcnt)

for category, timelist in dataCollector(r'test').items():
    print category, delta(timelist)
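
For anyone on Python 3, the same idea (drop gaps shorter than 10 seconds, then take the share of the most common remaining gap) can be sketched more compactly with `collections.Counter`. This is only a sketch under assumptions: the function name `dominant_interval` and the parameter `min_gap` are made up here, and the timestamps in the usage example are invented rather than taken from the log above.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def dominant_interval(timestamps, min_gap=10):
    """Return (interval_seconds, share) for the most common gap >= min_gap.

    share is computed over the kept gaps only; returns None if none qualify.
    """
    times = sorted(datetime.strptime(t, FMT) for t in timestamps)
    # Gaps between successive timestamps, in seconds.
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    # Ignore gaps shorter than min_gap seconds.
    kept = [g for g in gaps if g >= min_gap]
    if not kept:
        return None
    interval, count = Counter(kept).most_common(1)[0]
    return interval, count / len(kept)

# Invented timestamps: gaps are 5, 30, 30 and 60 seconds.
sample = [
    "2012-04-18 12:00:00",
    "2012-04-18 12:00:05",
    "2012-04-18 12:00:35",
    "2012-04-18 12:01:05",
    "2012-04-18 12:02:05",
]
result = dominant_interval(sample)
print(result[0], "%.2f%%" % (100 * result[1]))
```

Here the 5-second gap is discarded, so the 30-second interval accounts for 2 of the 3 counted gaps. As in the code above, the percentage is taken over the gaps that pass the filter, not over all gaps.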