Chapter 11 Image Recognition and Text Processing

11.1 Overview of OCR Libraries

11.1.1 Pillow

from PIL import Image, ImageFilter

kitten = Image.open('kitten.jpg')
# Apply a Gaussian blur filter to the image
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save('kitten_blurred.jpg')
blurryKitten.show()

11.1.2 Tesseract

  • Tesseract is widely regarded as the best and most accurate open-source OCR system available today.

  • Tesseract is also highly flexible. It can be trained to recognize any typeface (as long as the style of that typeface stays consistent, as we will see later), and it can recognize any Unicode character.

  • Tesseract is a command-line tool, not a Python library imported with an import statement. After installation, it is run with the tesseract command outside of Python.
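Because Tesseract is invoked as an external command rather than imported, a script can first check that the binary is actually on the PATH before trying to call it. A minimal sketch using only the standard library (the function name is mine):

```python
import shutil

def tesseract_available():
    """Return True if the tesseract command-line binary is on the PATH."""
    return shutil.which('tesseract') is not None

if tesseract_available():
    print('tesseract is installed')
else:
    print('tesseract is not installed')
```

shutil.which performs the same lookup the shell does, so this check mirrors whether a later subprocess call to tesseract would succeed.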

11.2 Processing Well-Formatted Text

tesseract text.tif textoutput

  • If the image is first blurred, converted to a compressed JPG format, and given a slight background gradient, the recognition results degrade dramatically.

  • When you run into problems like this, you can first clean up the image with a Python script.

from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)

    # Apply a threshold filter to the image, then save it
    image = image.point(lambda x: 0 if x < 143 else 255)
    image.save(newFilePath)

    # Call the system tesseract command to run OCR on the image
    subprocess.call(['tesseract', newFilePath, 'output'])

    # Open the output file and read the result
    outputFile = open('output.txt', 'r')
    print(outputFile.read())
    outputFile.close()

cleanFile('text2.png', 'text_2_clean.png')
This IS some text, wntten In Arial, that will be "
Tesseract Here are some symbols: l@#$%"&
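The point() call in cleanFile maps every pixel darker than 143 to pure black and everything else to pure white, which removes the background gradient that confuses Tesseract. A self-contained sketch of that thresholding on a tiny synthetic grayscale image (the threshold value is the one used above):

```python
from PIL import Image

# Build a tiny grayscale image: one dark pixel, one light pixel
img = Image.new('L', (2, 1))
img.putpixel((0, 0), 100)   # below the threshold of 143
img.putpixel((1, 0), 200)   # above the threshold

# Same threshold filter as cleanFile: dark pixels -> 0, light pixels -> 255
thresholded = img.point(lambda x: 0 if x < 143 else 255)

print(list(thresholded.getdata()))  # -> [0, 255]
```

The result is a pure black-and-white image with no intermediate gray values, which is exactly the kind of input Tesseract handles best.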

Scraping Text from Images on Websites

First, navigate to the large-print edition of Tolstoy's War and Peace. Open the reader, collect the image URLs, then download the images, run OCR on them, and finally print the text of each image.

import time
from urllib.request import urlretrieve
import subprocess
from selenium import webdriver

# Create a new Selenium driver
driver = webdriver.PhantomJS(executable_path='/usr/local/phantomjs/bin/phantomjs')
#driver = webdriver.Firefox()


driver.get('http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200')
time.sleep(5)

# Click the book preview button
driver.find_element_by_id('sitbLogoImg').click()
imageList = set()

# Wait for the page to finish loading
time.sleep(5)
# Keep turning pages while the right arrow is clickable
while 'pointer' in driver.find_element_by_id('sitbReaderRightPageTurner').get_attribute('style'):
    driver.find_element_by_id('sitbReaderRightPageTurner').click()
    time.sleep(2)
    # Collect the newly loaded pages
    pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
    for page in pages:
        image = page.get_attribute('src')
        imageList.add(image)
driver.quit()

# Process the collected image URLs with Tesseract
for image in sorted(imageList):
    urlretrieve(image, 'page.jpg')
    p = subprocess.Popen(['tesseract', 'page.jpg', 'page'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    f = open('page.txt', 'r')
    print(f.read())

11.3 Reading CAPTCHAs and Training Tesseract

11.4 Retrieving CAPTCHAs and Submitting Solutions

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
import subprocess
import requests
from PIL import Image
from PIL import ImageOps

def cleanImage(imagePath):
    image = Image.open(imagePath)
    image = image.point(lambda x: 0 if x < 143 else 255)
    borderImage = ImageOps.expand(image, border=20, fill='white')
    borderImage.save(imagePath)

html = urlopen('http://pythonscraping.com/humans-only')
bsObj = BeautifulSoup(html, 'lxml')
# Gather the form values that need to be processed
# (including the CAPTCHA image and the input fields)
imageLocation = bsObj.find('img', {'title': 'Image CAPTCHA'})['src']
formBuildId = bsObj.find('input', {'name': 'form_build_id'})['value']
captchaSid = bsObj.find('input', {'name': 'captcha_sid'})['value']
captchaToken = bsObj.find('input', {'name': 'captcha_token'})['value']

captchaUrl = 'http://pythonscraping.com' + imageLocation
urlretrieve(captchaUrl, 'captcha.jpg')
cleanImage('captcha.jpg')
p = subprocess.Popen(['tesseract', 'captcha.jpg', 'captcha'],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
f = open('captcha.txt', 'r')

# Strip the spaces and newlines from the recognition result
captchaResponse = f.read().replace(' ', '').replace('\n', '')
print('Captcha solution attempt: ' + captchaResponse)

if len(captchaResponse) == 5:
    params = {'captcha_token': captchaToken, 'captcha_sid': captchaSid,
              'form_id': 'comment_node_page_form', 'form_build_id': formBuildId,
              'captcha_response': captchaResponse, 'name': 'Ryan Mitchell',
              'subject': 'I come to seek the Grail',
              'comment_body[und][0][value]':
                  '...and I am definitely not a bot'}
    r = requests.post('http://www.pythonscraping.com/comment/reply/10', data=params)
    responseObj = BeautifulSoup(r.text, 'lxml')
    if responseObj.find('div', {'class': 'messages'}) is not None:
        print(responseObj.find('div', {'class': 'messages'}).get_text())
else:
    print('There was a problem reading the CAPTCHA correctly!')
Captcha solution attempt: 4'”LY

Error message
The answer you entered for the CAPTCHA was not correct.
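Besides thresholding, cleanImage adds a 20-pixel white border with ImageOps.expand, which gives Tesseract some empty space around the characters to work with. A small self-contained sketch of what that call does to an image:

```python
from PIL import Image, ImageOps

# A 100x40 all-black grayscale image standing in for a CAPTCHA
img = Image.new('L', (100, 40), color=0)

# Same call as cleanImage: add a 20-pixel white border on every side
bordered = ImageOps.expand(img, border=20, fill='white')

print(bordered.size)              # -> (140, 80): 20 pixels added per side
print(bordered.getpixel((0, 0)))  # -> 255: the new border pixels are white
```

Tesseract tends to misread glyphs that touch the edge of the image, so padding the CAPTCHA before OCR is a cheap way to improve accuracy.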