OCR Study Notes

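I have recently been looking into CAPTCHA recognition, mainly from the OCR side, and these notes are a summary. The reference articles listed at the end already explain CAPTCHA recognition in detail. Building an OCR pipeline is not hard; the hard part is raising the recognition rate. The basic workflow is: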

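1. Original image

2. Preprocessing (noise removal)

3. Standardization (grayscale transformation, binarization, normalization)

4. Image segmentation (I personally find this step the hardest; there are many algorithms, such as vertical projection histograms, KNN, and Color Filling)

5. Feature extraction

6. Machine learning

7. Recognition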
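
Step 4 mentions the vertical projection histogram. A minimal sketch of that idea follows, assuming an already binarized image stored as a list of rows in which 0 is ink and 255 is background. The function names `vertical_projection` and `split_columns` are made up for this illustration. The code counts the ink pixels in each column and cuts wherever the count drops to zero, so it only works when the characters do not touch.

```python
def vertical_projection(rows):
    """Count ink pixels (value 0) in every column of a binarized image."""
    width = len(rows[0])
    return [sum(1 for row in rows if row[x] == 0) for x in range(width)]


def split_columns(projection):
    """Return (start, end) column ranges, end exclusive, where ink is present."""
    segments = []
    start = None
    for x, count in enumerate(projection):
        if count > 0 and start is None:
            start = x                      # entering a character
        elif count == 0 and start is not None:
            segments.append((start, x))    # leaving a character
            start = None
    if start is not None:                  # character touches the right edge
        segments.append((start, len(projection)))
    return segments


# Tiny 3x7 example: two "characters" separated by blank columns
W, B = 255, 0
image = [
    [W, B, B, W, W, B, W],
    [W, B, W, W, W, B, W],
    [W, B, B, W, W, B, W],
]
projection = vertical_projection(image)
print(projection)                 # [0, 3, 2, 0, 0, 3, 0]
print(split_columns(projection))  # [(1, 3), (5, 6)]
```

With PIL, each (start, end) pair could then be cut out with im.crop((start, 0, end, height)).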

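In short, OCR is a very interesting research topic. It involves a lot of work in computer graphics and image processing, machine learning, and neural networks, and it makes a good entry point for studying machine learning. A ready-made handwritten-digit sample set, MNIST, is available online to play with. The attachment also includes a paper on the theory of the vector space model (VSM), which clearly explains how to compute the similarity between two objects.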
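
The paper itself is not reproduced here. As a rough illustration of the vector space model idea, the sketch below assumes each object has already been turned into a feature vector of equal length and compares them with cosine similarity, a common choice in VSM. Whether the paper uses exactly this measure is an assumption, and the feature vectors are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors, e.g. per-column ink counts of character images
template_1 = [0, 4, 4, 0]
template_7 = [4, 4, 1, 0]
unknown = [0, 3, 4, 0]

scores = {
    "1": cosine_similarity(unknown, template_1),
    "7": cosine_similarity(unknown, template_7),
}
best = max(scores, key=scores.get)
print(best)  # 1
```

The unknown sample is assigned to whichever template scores highest, which is the simplest form of template matching for the recognition step.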

0. Basic PIL API usage:

# -*- coding: utf-8 -*-
from PIL import Image, ImageDraw

path = "/home/yunpeng/test4/data/4399/simple/8.png"
im = Image.open(path)
im = im.convert('L')  # 8-bit grayscale
print('img info:', im.format, im.size, im.mode)

# Binarize: pixels brighter than 90 become white, the rest black
width, height = im.size
for x in range(width):
    for y in range(height):
        p = im.getpixel((x, y))
        if p > 90:
            im.putpixel((x, y), 255)
        else:
            im.putpixel((x, y), 0)

# Trim the blank margins on the left and right
cols = set()
for x in range(width):
    for y in range(height):
        if im.getpixel((x, y)) < 200:
            cols.add(x)
left, right = min(cols), max(cols)  # a set is unordered, so use min/max
box = (left, 0, right + 1, height)  # the right edge of a crop box is exclusive
im = im.crop(box)

# Vertical projection: black pixels per column, scaled by 4 for display
width, height = im.size
ps = [0] * width
for x in range(width):
    for y in range(height):
        if im.getpixel((x, y)) == 0:
            ps[x] += 4

# Draw the projection as a bar chart on a 200x200 white canvas
image = Image.new('RGB', (200, 200), (255, 255, 255))
draw = ImageDraw.Draw(image)
for k in range(len(ps)):
    source = (k, 199)          # start point: bottom row, x = 0, 1, 2, ...
    target = (k, 199 - ps[k])  # end point: bar of height ps[k]
    draw.line([source, target], (100, 100, 100), 1)
image.show()
im.show()

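1. What is a grayscale transformation?

In Photoshop, a grayscale transformation can boost the R, G, and B channels in any proportion and then mix them. A tonal change applied to a black-and-white picture is called a grayscale transformation, and so is a color change applied to a color picture.
Take the linear transformation as an example.
It can be written as a linear function: g(x,y) = a' + (b'-a')/(b-a) × (f(x,y)-a)
f(x,y) is the value of a pixel in the original image and g(x,y) is its value after the transformation.
[a,b] is the gray range of the original image and [a',b'] is the gray range of the new image.
Applying this function to the R, G, and B components separately boosts individual colors, which are then mixed for output.
If b'-a' > b-a, the gray range widens, so contrast increases and the image looks clearer.
If b'-a' < b-a, the gray range narrows, so contrast decreases.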
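
The linear mapping above can be sketched in a few lines. This is only an illustration: the function name `linear_stretch` and the sample row are made up, and clamping inputs that fall outside [a, b] is an extra assumption not stated in the formula.

```python
def linear_stretch(value, a, b, a_new, b_new):
    """Map a gray level from the range [a, b] to [a_new, b_new]."""
    if b == a:
        raise ValueError("source range must not be empty")
    value = min(max(value, a), b)  # assumption: clamp inputs outside [a, b]
    return a_new + (b_new - a_new) / (b - a) * (value - a)


# Stretch a low-contrast row whose gray levels lie in [100, 150] to [0, 255]
row = [100, 110, 120, 150]
stretched = [round(linear_stretch(p, 100, 150, 0, 255)) for p in row]
print(stretched)  # [0, 51, 102, 255]
```

Here the target range (255) is wider than the source range (50), so neighbouring gray levels end up further apart, which is the contrast increase described above.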

PS: in PIL, im.convert('L') converts an image to 8-bit grayscale.

2. What is a histogram?

A histogram counts, for each color value, how many pixels in the image have that value.
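
As a minimal sketch of that definition, assuming a grayscale image flattened into a list of values from 0 to 255 (the function name `gray_histogram` and the sample data are invented for illustration):

```python
def gray_histogram(pixels, levels=256):
    """Return a list where entry v is the number of pixels with gray value v."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts


# A flattened 2x4 grayscale image (hypothetical data)
pixels = [0, 0, 255, 255, 255, 128, 0, 255]
hist = gray_histogram(pixels)
print(hist[0], hist[128], hist[255])  # 3 1 4
```

PIL already provides this through the histogram() method of an image, so the hand-written loop is only there to show what is being counted.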

Reference:

Computing and displaying a histogram with PIL

3. How do I install tesseract?

References:

Installing tesseract on Ubuntu for OCR

Cracking website CAPTCHAs with tesseract-ocr
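
Once tesseract is installed (on Ubuntu the package is usually named tesseract-ocr, which comes from general knowledge rather than from the articles above), it can be driven from Python through its command line. The sketch below assumes the basic form `tesseract <image> <outputbase> -l <lang>`. The helper names and the file names `captcha.png` and `result` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize(image_path, output_base="result", lang="eng"):
    """Run tesseract and return the text, or None if tesseract is not installed."""
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # tesseract writes its result to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()


print(build_tesseract_command("captcha.png", "result"))
# ['tesseract', 'captcha.png', 'result', '-l', 'eng']
```

Feeding tesseract the binarized, cropped image rather than the raw CAPTCHA is generally what the preprocessing steps are for.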

4. References

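Python Imaging Library (PIL): basic concepts and library overview
http://www.cnblogs.com/wei-li/archive/2012/04/19/2443281.html
http://www.cnblogs.com/wei-li/archive/2012/04/19/2456725.html
http://iysm.net/?tag=pil

Image processing with Python:
http://blog.csdn.net/gzlaiyonghao/article/details/1852726

Computing image similarity ("Python Can Do It Too", part 1)
http://blog.csdn.net/gzlaiyonghao/article/details/2325027

Detecting pornographic images in 10 lines of code ("Python Can Do It Too", part 2)
http://blog.csdn.net/gzlaiyonghao/article/details/3166735

Recognizing handwritten digits with a BP artificial neural network ("Python Can Do It Too", part 3)
http://blog.csdn.net/gzlaiyonghao/article/details/7109898

A discussion of algorithms for large-scale similar-image recognition (fairly shallow)
http://caocao.iteye.com/blog/149776

Implementing filters with PIL (1): sketch and pencil-drawing effects
http://blog.sina.com.cn/s/blog_5eeb1e2f0101axvi.html

Image processing: the Hough transform (line detection algorithm)
http://blog.csdn.net/jia20003/article/details/7724530

Simple image processing in Python (the most detailed; parts 1-16, including thinning and the Fourier transform)
http://www.cnblogs.com/xianglan/category/272764.html

Image CAPTCHA recognition example using ImageMagick + tesseract-ocr (fairly high recognition rate):
http://blog.csdn.net/mlks_2008/article/details/8052782

tesseract-ocr training method:
http://www.lixin.me/blog/2012/05/26/29536

OCR study notes and some tesseract tests:
http://blog.csdn.net/viewcode/article/details/7784600

Notes on recognizing one website's CAPTCHA (removing the background color):
http://blog.csdn.net/bh20077/article/details/7041280

Cracking simple CAPTCHAs with imagemagick and tesseract-ocr (ruby):
http://hooopo.iteye.com/blog/993538

Building neural networks with Python (IBM; a Hopfield network can reconstruct distorted patterns and remove noise)
http://www.ibm.com/developerworks/cn/linux/l-neurnet/

Weaknesses of common CAPTCHAs and CAPTCHA recognition
http://drops.wooyun.org/tips/141

A general algorithm for removing interference lines from text images:
http://wenku.baidu.com/view/63bac64f2b160b4e767fcfed.html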

Decoding CAPTCHA’s:
http://www.boyter.org/decoding-captchas/

===================================================================
Tesseract OCR training and recognition summary:
http://miphol.com/muse/2013/06/tesseract-ocr-1.html
http://miphol.com/muse/2013/05/tesseract-ocr.html

Tesseract-OCR character recognition: sample training (using the jTessBoxEditor tool, fairly detailed)
http://blog.csdn.net/firehood_/article/details/8433077

Getting started with the Tesseract-OCR engine
http://blog.csdn.net/xiaochunyong/article/details/7193744

Official Tesseract configuration
http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html

Recognizing image CAPTCHAs with touching characters
http://wenku.baidu.com/view/343c200c581b6bd97f19ead9.html

Research on recognition techniques for CAPTCHAs with distorted and touching characters
http://wenku.baidu.com/view/45896630580216fc700afd16.html

-----------------------------------------------------------------------
wiki:
http://zh.wikipedia.org/zh-cn/captcha
http://en.wikipedia.org/wiki/Image_segmentation

Python Module for Mean Shift Image Segmentation:
http://code.google.com/p/pymeanshift/

Taobao CAPTCHA:
http://pin.aliyun.com/get_img?identity=taoquan.taobao.com&sessionid=1381293634479

CAPTCHA recognition tool: tesseract (the most detailed)
http://hilojack.sinaapp.com/?p=866

How to recognize advanced CAPTCHAs? 鬼仔's Blog (the most advanced)
http://huaidan.org/archives/2085.html

A brief look at OCR with Tesseract:
http://www.cnblogs.com/brooks-dotnet/archive/2010/10/05/1844203.html

Summary of how to use tesseract-ocr:
http://hyhx2008.github.io/tesseract-ocrshi-yong-fang-fa-zong-jie.html

The open-source OCR engine Tesseract
http://hi.baidu.com/lifulinghan/item/b59af9eb1d92282d5a7cfb69

Cracking website CAPTCHAs with tesseract-ocr
http://grunt1223.iteye.com/blog/904313

breaking weak captcha in slightly more than 26 lines of groovy-code
http://www.kellyrob99.com/blog/2010/03/14/breaking-weak-captcha-in-slightly-more-than-26-lines-of-groovy-code/

Detailed usage of tesseract-ocr 3.02 (training language data)
http://www.cnblogs.com/huyulin/p/3305563.html

Training and using tesseract-ocr3
http://www.cnblogs.com/zcsor/archive/2011/02/21/1959555.html

tesseract java api
http://stackoverflow.com/questions/13974645/using-tesseract-from-java

tesseract python api
http://code.google.com/p/pytesser/
https://github.com/rosarior/pytesser
https://code.google.com/p/pytesser/wiki/README

Recognizing CAPTCHAs: what is your success rate?
http://aoingl.iteye.com/blog/1389232
http://ptlogin.4399.com/ptlogin/captcha.do?captchaId=captchaReq011404b815f6235726
http://www.andrew.cmu.edu/user/ericwu/parch/finalreport.html

[1] L. von Ahn, M. Blum and J. Langford. Telling Humans and Computers Apart Automatically[J], Comm. of the ACM, 46(Aug. 2003), 57-60.
[2] K. Chellapilla, K. Larson, P. Simard and M. Czerwinski, Building Segmentation Based Human-friendly Human Interaction Proofs[C], 2nd Int’l Workshop on Human Interaction Proofs, Springer-Verlag, LNCS 3517, 2005.
[3] J. Yan and A. S. El Ahmad. Usability of CAPTCHAs - Or, Usability issues in CAPTCHA design[C], the fourth Symposium on Usable Privacy and Security, Pittsburgh, USA, July 2008.
[4] K. Chellapilla, K. Larson, P. Simard, M. Czerwinski, Computers beat humans at single character recognition in reading-based Human Interaction Proofs[C], In 2nd Conference on Email and Anti-Spam (CEAS’05), 2005.
[5] J. Yan and A. S. El Ahmad. A Low-cost Attack on a Microsoft CAPTCHA[C], 15th ACM Conference on Computer and Communications Security (CCS’08). Virginia, USA, Oct 27-31, 2008. ACM Press. 543-554.
[6] Microsoft Corporation. Human Interaction Proof (HIP) - Technical and Market Overview[J], 2006. Accessed in Jan 2011.
[7] J. Yan and A. S. El Ahmad. Breaking Visual CAPTCHAs with Naive Pattern Recognition Algorithms[C], in Proc. of the 23rd Annual Computer Security Applications Conference (ACSAC’07). FL, USA, Dec 2007. IEEE computer society. 279-291.
[8] G. Mori and J. Malik. Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA[C], IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03), Vol 1, June 2003, 134-141.
[9] G. Moy, N. Jones, C. Harkless and R. Potter. Distortion estimation techniques in solving visual CAPTCHAs[C], IEEE CVPR, 2004.
[10] K. Chellapilla and P. Simard, Using Machine Learning to Break Visual Human Interaction Proofs[M], Neural Information Processing Systems (NIPS), MIT Press, 2004.
[11] L. von Ahn, M. Blum, N. J. Hopper, and J. Langford, CAPTCHA: Using hard AI problems for security[C]. Eurocrypt’2003.
[12] W. Zhang, J. Sun, and X. Tang. Cat head detection - how to effectively exploit shape and texture features[C]. In Proc. ECCV 2008, Part IV, LNCS 5305 (2008), 802-816.
[13] P. Golle. Machine learning attacks against the Asirra CAPTCHA[C]. In ACM CCS’2008, 535-542.
[14] http://recaptcha.net/learnmore.html, 2012-10-19.
[15] Elie Bursztein, Matthieu Martin, and John C. Mitchell. Text-based CAPTCHA strengths and weaknesses[C]. 18th ACM conference, 2011.
[16] Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum, 2008. reCAPTCHA: Human-Based Character Recognition via Web Security Measures[J]. Science, 321(5895):1465-1468.
[17] Li Ying. Web CAPTCHA Generation and Recognition[D]. Nanjing University of Science and Technology, graduate thesis, 2008.
[18] Zeidenberg, Matthew. Neural Networks in Artificial Intelligence[M]. 1990: Ellis Horwood Limited. 1990. ISBN 0-13-612185-3.
[19] Zhang Shuya, Zhao Yiming, Zhao Xiaoyu, et al. Research on recognition methods for authentication-code characters[J]. Journal of Ningbo University: Science and Engineering Edition, 2007, 12(4): 429-433.
[20] Pan Dafu, Wang Bo. A digit CAPTCHA recognition method based on outer contours[J]. Microcomputer Information: Measurement and Control Automation, 2007, 23(9-1): 0256-0258.
[21] Jia Leilei, Chen Xihua, Xiong Chuan. Fuzzy recognition of CAPTCHAs[J]. Journal of Xichang College: Natural Science Edition, 2010, 24(1): 60-62.
