Chapter 1: Your First Web Scraper

2017-12-11

1.1 Network Connections

from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
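The `b'...'` prefix in the output above shows that `read()` returns raw bytes rather than text. A minimal sketch of turning those bytes into a string follows; it assumes the page is encoded as UTF-8, and `decode_body` is a hypothetical helper name that does not appear in the original notes.

```python
def decode_body(raw, charset='utf-8'):
    """Turn the bytes returned by read() into a str.

    The UTF-8 default is an assumption; pass another charset if the page uses one.
    """
    # errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(charset, errors='replace')


# A shortened stand-in for the bytes printed above.
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'
print(decode_body(raw))
```

Once decoded, the `\n` escapes are printed as real line breaks, which makes the page source easier to read.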

1.2 Introduction to BeautifulSoup

1.2.1 Installing BeautifulSoup

from bs4 import BeautifulSoup

1.2.2 Running BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.h1)

<h1>An Interesting Title</h1>
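The same tag can be reached through more than one attribute path. The sketch below illustrates this on a small inline document, so it runs without a network connection. The inline HTML is an assumed stand-in for the downloaded page, and the parser is Python's built-in `'html.parser'` rather than `'lxml'`, so no extra install is needed.

```python
from bs4 import BeautifulSoup

# A small inline document standing in for the downloaded page (an assumption).
html = ('<html><head><title>A Useful Page</title></head>'
        '<body><h1>An Interesting Title</h1></body></html>')

# 'html.parser' ships with Python, so lxml is not required here.
soup = BeautifulSoup(html, 'html.parser')

# Each of these expressions reaches the same <h1> tag.
print(soup.h1)
print(soup.body.h1)
print(soup.html.body.h1)

# get_text() drops the markup and returns only the text content.
print(soup.h1.get_text())
```

The shortest form, `soup.h1`, returns the first `<h1>` found anywhere in the document, which is convenient but less explicit about where the tag is expected to be.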

1.2.3 Connecting Reliably

Handling foreseeable exceptions

- The page does not exist on the server

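from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    # The server answered with an error status (for example 404 or 500).
    print(e)
    # Return None, stop the program, or fall back to another plan.
else:
    pass  # No exception was raised: the program continues.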
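- The server does not exist
# In Python 3, urlopen() does not return None when the server cannot be
# reached; it raises URLError, so catch that instead of testing for None.
from urllib.error import URLError
try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except URLError as e:
    print('The server could not be found')
else:
    pass  # The program continues.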
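Both failure cases above, an error response from the server and a server that cannot be reached at all, can be handled in one place. The following is a minimal sketch under stated assumptions: `fetch` is a hypothetical helper name that is not part of the original notes, and returning `None` on failure is just one possible policy.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error:', e.code)
    except URLError as e:
        # The server could not be reached (bad host name, no network, ...).
        print('Could not reach the server:', e.reason)
    return None
```

A caller would then test the result with `is None` before parsing, for example after `fetch('http://www.pythonscraping.com/pages/page1.html')`.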

Checking whether a tag exists in the BeautifulSoup object

print(bsObj.nonExistentTag)

None


C:\Program Files\Anaconda3\lib\site-packages\bs4\element.py:1050: UserWarning: .nonExistentTag is deprecated, use .find("nonExistent") instead.
  tag_name, tag_name))

print(bsObj.nonExistentTag.someTag)

C:\Program Files\Anaconda3\lib\site-packages\bs4\element.py:1050: UserWarning: .nonExistentTag is deprecated, use .find("nonExistent") instead.
  tag_name, tag_name))



---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-6-e80cf0b4416d> in <module>()
----> 1 print(bsObj.nonExistentTag.someTag)


AttributeError: 'NoneType' object has no attribute 'someTag'

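try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError:
    # bsObj.nonExistingTag is None, so reading .anotherTag from it raises.
    print('Tag was not found')
else:
    # The outer tag exists; the inner one may still be missing.
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)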

Tag was not found


C:\Program Files\Anaconda3\lib\site-packages\bs4\element.py:1050: UserWarning: .nonExistingTag is deprecated, use .find("nonExisting") instead.
  tag_name, tag_name))
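The warnings above recommend `.find()` over attribute-style access. A possible way to follow a chain of tags without triggering either the warning or the `AttributeError` is sketched below. `find_nested` is a hypothetical helper, not part of BeautifulSoup or the original notes, and the inline HTML is an assumed stand-in for the real page.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow a chain of tag names with find(); return None if any is missing."""
    node = root
    for name in tag_names:
        node = node.find(name)
        if node is None:
            # Stop early instead of calling find() on None.
            return None
    return node


# Inline HTML so the example runs offline; 'html.parser' needs no extra install.
soup = BeautifulSoup(
    '<html><body><h1>An Interesting Title</h1></body></html>', 'html.parser')

print(find_nested(soup, 'body', 'h1'))
print(find_nested(soup, 'nonExistentTag', 'someTag'))
```

The first call prints the `<h1>` tag and the second prints `None`, so the caller needs only a single `is None` check rather than a `try`/`except` block.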

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    """Return the page's first <h1> tag, or None if it cannot be retrieved."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # HTTPError: the server returned an error status.
        # URLError: the server could not be reached at all.
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'lxml')
        title = bsObj.h1
    except AttributeError:
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>