Chapter 11 Image Recognition and Text Processing

11.1 Overview of OCR Libraries

11.1.1 Pillow

from PIL import Image, ImageFilter

kitten = Image.open('kitten.jpg')
# Apply a Gaussian blur filter to the image
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save('kitten_blurred.jpg')
blurryKitten.show()

11.1.2 Tesseract

  • Tesseract is widely regarded as the best and most accurate open-source OCR system available today.

  • Tesseract is also highly flexible. It can be trained to recognize any typeface (as long as the style of that typeface stays consistent, as we will see later), and it can recognize any Unicode character.

  • Tesseract is a command-line tool, not a Python library imported with an import statement. After installation, it is run with the tesseract command outside of Python.
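Because Tesseract is invoked as an external command rather than imported, a script can first check that the binary is actually on the PATH before trying to call it. A minimal sketch using only the standard library (the function name is mine):

```python
import shutil

def tesseract_available():
    """Return True if the tesseract command-line binary is on the PATH."""
    return shutil.which('tesseract') is not None

if tesseract_available():
    print('tesseract is installed')
else:
    print('tesseract is not installed')
```

shutil.which performs the same lookup the shell does, so this check mirrors whether a later subprocess call to tesseract would succeed.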

11.2 Processing Well-Formatted Text

tesseract text.tif textoutput

  • If the image is first blurred, converted to a compressed JPG format, and given a slight background gradient, the recognition results degrade dramatically.

  • When you run into problems like this, you can first clean up the image with a Python script.

from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)

    # Apply a threshold filter to the image, then save it
    image = image.point(lambda x: 0 if x < 143 else 255)
    image.save(newFilePath)

    # Call the system tesseract command to run OCR on the image
    subprocess.call(['tesseract', newFilePath, 'output'])

    # Open the output file and read the result
    outputFile = open('output.txt', 'r')
    print(outputFile.read())
    outputFile.close()

cleanFile('text2.png', 'text_2_clean.png')
This IS some text, wntten In Arial, that will be "
Tesseract Here are some symbols: l@#$%"&
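The point() call in cleanFile maps every pixel darker than 143 to pure black and everything else to pure white, which removes the background gradient that confuses Tesseract. A self-contained sketch of that thresholding on a tiny synthetic grayscale image (the threshold value is the one used above):

```python
from PIL import Image

# Build a tiny grayscale image: one dark pixel, one light pixel
img = Image.new('L', (2, 1))
img.putpixel((0, 0), 100)   # below the threshold of 143
img.putpixel((1, 0), 200)   # above the threshold

# Same threshold filter as cleanFile: dark pixels -> 0, light pixels -> 255
thresholded = img.point(lambda x: 0 if x < 143 else 255)

print(list(thresholded.getdata()))  # -> [0, 255]
```

The result is a pure black-and-white image with no intermediate gray values, which is exactly the kind of input Tesseract handles best.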

Scraping Text from Images on Websites

First, navigate to the large-print edition of Tolstoy's War and Peace. Open the reader, collect the image URLs, then download the images, run OCR on them, and finally print the text of each image.

import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

# Create a new Selenium driver
driver = webdriver.PhantomJS(executable_path='/usr/local/phantomjs/bin/phantomjs')
#driver = webdriver.Firefox()


driver.get('http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200')
time.sleep(5)

# Click the book preview button
driver.find_element_by_id('sitbLogoImg').click()
imageList = set()

# Wait for the page to finish loading
time.sleep(5)
# Keep turning pages while the right arrow is clickable
while 'pointer' in driver.find_element_by_id('sitbReaderRightPageTurner').get_attribute('style'):
    driver.find_element_by_id('sitbReaderRightPageTurner').click()
    time.sleep(2)
    # Collect the newly loaded pages
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute('src')
        imageList.add(image)
driver.quit()

# Process the collected image URLs with Tesseract
for image in sorted(imageList):
    urlretrieve(image, 'page.jpg')
    p = subprocess.Popen(['tesseract', 'page.jpg', 'page'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    f = open('page.txt', 'r')
    print(f.read())

11.3 Reading CAPTCHAs and Training Tesseract

11.4 Retrieving CAPTCHAs and Submitting Solutions

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import subprocess
import requests
from PIL import Image
from PIL import ImageOps

def cleanImage(imagePath):
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill='white')
    borderImage.save(imagePath)

html = urlopen('http://pythonscraping.com/humans-only')
bsObj = BeautifulSoup(html, 'lxml')
# Gather the form values that need to be processed
# (including the CAPTCHA image and the input fields)
imageLocation = bsObj.find('img', {'title': 'Image CAPTCHA'})['src']
formBuildId = bsObj.find('input', {'name': 'form_build_id'})['value']
captchaSid = bsObj.find('input', {'name': 'captcha_sid'})['value']
captchaToken = bsObj.find('input', {'name': 'captcha_token'})['value']

captchaUrl = 'http://pythonscraping.com' + imageLocation
urlretrieve(captchaUrl, 'captcha.jpg')
cleanImage('captcha.jpg')
p = subprocess.Popen(['tesseract', 'captcha.jpg', 'captcha'],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
f = open('captcha.txt', 'r')

# Strip the spaces and newlines from the recognition result
captchaResponse = f.read().replace(' ', '').replace('\n', '')
print('Captcha solution attempt: ' + captchaResponse)

if len(captchaResponse) == 5:
    params = {'captcha_token': captchaToken, 'captcha_sid': captchaSid,
              'form_id': 'comment_node_page_form', 'form_build_id': formBuildId,
              'captcha_response': captchaResponse, 'name': 'Ryan Mitchell',
              'subject': 'I come to seek the Grail',
              'comment_body[und][0][value]':
                  '...and I am definitely not a bot'}
    r = requests.post('http://www.pythonscraping.com/comment/reply/10', data=params)
    responseObj = BeautifulSoup(r.text, 'lxml')
    if responseObj.find('div', {'class': 'messages'}) is not None:
        print(responseObj.find('div', {'class': 'messages'}).get_text())
else:
    print('There was a problem reading the CAPTCHA correctly!')
Captcha solution attempt: 4'”LY

Error message
The answer you entered for the CAPTCHA was not correct.
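Besides thresholding, cleanImage adds a 20-pixel white border with ImageOps.expand, which gives Tesseract some empty space around the characters to work with. A small self-contained sketch of what that call does to an image:

```python
from PIL import Image, ImageOps

# A 100x40 all-black grayscale image standing in for a CAPTCHA
img = Image.new('L', (100, 40), color=0)

# Same call as cleanImage: add a 20-pixel white border on every side
bordered = ImageOps.expand(img, border=20, fill='white')

print(bordered.size)              # -> (140, 80): 20 pixels added per side
print(bordered.getpixel((0, 0)))  # -> 255: the new border pixels are white
```

Tesseract tends to misread glyphs that touch the edge of the image, so padding the CAPTCHA before OCR is a cheap way to improve accuracy.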