
[Programming Collective Intelligence Study Notes] Counting the Words in a Feed

Published: 2012-07-30 16:19:05 Author: rapoo


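Almost every blog can be read online or through an RSS feed. An RSS feed is a simple XML document that contains information about the blog and all of its article entries.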
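To make the "simple XML document" point concrete, here is a minimal sketch that reads a tiny feed using only the standard library. The feed text, the example.com URLs, and the helper name `read_feed` are all made up for illustration, and the sketch assumes a plain RSS 2.0 layout (`channel` containing `item` elements). Real feeds vary a lot more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written RSS 2.0 document (not taken from any real blog).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Hello &lt;b&gt;world&lt;/b&gt;</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <description>More words here</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return (blog title, [(entry title, entry description), ...])."""
    # Assumes RSS 2.0: <rss><channel>...<item>...</item></channel></rss>
    channel = ET.fromstring(xml_text).find("channel")
    entries = [(item.findtext("title"), item.findtext("description"))
               for item in channel.findall("item")]
    return channel.findtext("title"), entries


blog_title, entries = read_feed(SAMPLE_RSS)
print(blog_title)
for entry_title, description in entries:
    print(entry_title, "|", description)
```

Note that the description comes back with HTML tags still in it, which is why the word-counting code has to strip tags before splitting the text into words.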
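The program uses the third-party feedparser module, which makes it easy to get the title, link, and article entries from any RSS or Atom feed. The complete code is as follows: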

'''
Created on Jul 14, 2012

@Author: killua
@E-mail: killua_hzl@163.com
@Homepage: http://www.yidooo.net
@Description: Counting the words in a Feed

feedparser: feedparser is a Python library that parses feeds in all known formats, including Atom, RSS, and RDF. It runs on Python 2.4 all the way up to 3.2.

dataset: http://kiwitobes.com/clusters/feedlist.txt
    You can download feeds from this list. Maybe some feeds you can access in China.
'''

import feedparser
import re

# Get the words from a piece of feed text
def getwords(html):
    # Remove all the HTML tags
    text = re.compile(r"<[^>]+>").sub('', html)

    # Split words on runs of non-alphabetic characters
    words = re.compile(r"[^A-Za-z]+").split(text)

    # Convert words to lowercase and drop empty strings
    wordlist = [word.lower() for word in words if word != ""]

    return wordlist

# Returns title and dictionary of word counts for an RSS feed
def getFeedwordcounts(url):
    # Parse the feed
    d = feedparser.parse(url)
    wordcounts = {}

    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e:
            summary = e.summary
        else:
            summary = e.description

        words = getwords(e.title + ' ' + summary)
        for word in words:
            wordcounts.setdefault(word, 0)
            wordcounts[word] += 1

    return d.feed.title, wordcounts

if __name__ == '__main__':
    # Number of blogs in which each word appears more than once
    blogcount = {}
    wordcounts = {}

    # Read the feed URLs, dropping trailing newlines and blank lines
    with open('resource/feedlist.txt') as feedFile:
        feedlist = [line.strip() for line in feedFile if line.strip()]

    for feedUrl in feedlist:
        try:
            title, wc = getFeedwordcounts(feedUrl)
            wordcounts[title] = wc
            for word, count in wc.items():
                blogcount.setdefault(word, 0)
                if count > 1:
                    blogcount[word] += 1
        except Exception:
            print('Failed to parse feed %s' % feedUrl)

    # Keep words found in more than 10% and fewer than 50% of the blogs
    wordlist = []
    for w, bc in blogcount.items():
        frac = float(bc) / len(feedlist)
        if frac > 0.1 and frac < 0.5:
            wordlist.append(w)

    # Write the result to the file
    with open('blogdata', 'w') as datafile:
        # Write the header row
        datafile.write('Blog')
        for word in wordlist:
            datafile.write('\t%s' % word)
        datafile.write('\n')
        # Write one row of counts per blog
        for blogname, wc in wordcounts.items():
            print(blogname)
            datafile.write(blogname)
            for word in wordlist:
                if word in wc:
                    datafile.write("\t%d" % wc[word])
                else:
                    datafile.write("\t0")
            datafile.write('\n')
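The two ideas in the program that are easiest to get wrong are the tokenizing and the word filter, so here is a small offline sketch of both that runs without feedparser or network access. The blog names and texts are invented, and `count_words` and `select_words` are hypothetical helper names that do not appear in the program above. One simplification to be aware of: the fraction here is taken over the blogs actually present in the dictionary, whereas the program divides by the length of the feed list, including feeds that failed to parse.

```python
import re


def getwords(html):
    """Strip HTML tags, split on non-letters, and lowercase the words."""
    text = re.sub(r"<[^>]+>", "", html)
    return [w.lower() for w in re.split(r"[^A-Za-z]+", text) if w]


def count_words(words):
    """Return a dict mapping each word to its number of occurrences."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts


def select_words(wordcounts, low=0.1, high=0.5):
    """Keep words used more than once by a fraction of blogs in (low, high)."""
    blogcount = {}
    for wc in wordcounts.values():
        for word, count in wc.items():
            if count > 1:
                blogcount[word] = blogcount.get(word, 0) + 1
    total = len(wordcounts)  # simplification: only blogs that were parsed
    return sorted(w for w, bc in blogcount.items()
                  if low < float(bc) / total < high)


# Invented sample texts standing in for the title + summary of each blog.
blogs = {
    "Blog A": "<p>Python python code</p> the the",
    "Blog B": "the the cat cat",
    "Blog C": "the the cat cat dog",
    "Blog D": "the the end",
}

wordcounts = {name: count_words(getwords(text)) for name, text in blogs.items()}
selected = select_words(wordcounts)
print(selected)
```

With these four sample blogs, "the" is repeated in every blog (fraction 1.0) and "cat" in exactly half (0.5), so both fall outside the window, while "python" is repeated in one blog out of four (0.25) and is kept. This is the point of the two bounds: very common words carry no information about a blog, and words that almost nobody repeats are too rare to compare blogs with.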



If you repost this article, please credit the source: 阿龙の异度空间

Permalink: http://www.yidooo.net/archives/3255.html
