The requests library used earlier is a synchronous crawler: requests run strictly one after another, usually driven by a `for` loop. For asynchronous crawling, Python has the third-party aiohttp library.
The test target is the title of every show on the first 10 pages of the YYeTs US-drama list:
http://yyetss.com/list-lishi-all-1.html
asyncio+aiohttp
- The overall approach is much like requests. First define a coroutine function `async def get_title` that performs the HTTP request and parses the page:
  - `async with aiohttp.ClientSession(headers=headers) as ss:` opens a session and keeps it alive; the request headers go in the parentheses, same as with requests.
  - `async with ss.get(url) as resp:` sends the request and yields a response `resp`.
  - `text = await resp.text()` reads the body. Note that aiohttp's `resp.text()` is not the same as requests' `resp.text` attribute: the source shows it is a coroutine function, which is why it can take the `await` keyword.
  - `title_list = re.findall(pattern, text)` parses the page. This must stay at the `ss` indentation level, i.e. inside the open session; the parsing itself uses a regex, exactly as with requests.
- In `main()`:
  - `loop = asyncio.get_event_loop()` on first use creates an event loop in the main thread.
  - `tasks = [get_title(page) for page in range(1, 11)]` builds the task list; because `get_title(page)` returns a coroutine, each entry can be scheduled on the event loop.
  - `asyncio.wait(tasks)` — per the docs, `wait` returns two values, `done` and `pending`: `done` holds the Tasks that completed, `pending` the Tasks that had not finished by the timeout, and each Task's return value is read via `future.result()`. See the event-loop section of my asyncio notes for a concrete example.
  - `loop.run_until_complete(asyncio.wait(tasks))` submits the task list to the event loop and runs it to completion.
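The done/pending behavior of `asyncio.wait` can be sketched with plain asyncio (a minimal example with no HTTP; `work` is a hypothetical stand-in task using `asyncio.sleep`):

```python
import asyncio

async def work(n):
    # Simulated IO-bound task: sleep, then return a result
    await asyncio.sleep(0.01 * n)
    return n * 2

async def main():
    # Wrap coroutines in Tasks (required on Python 3.11+)
    tasks = [asyncio.ensure_future(work(n)) for n in range(1, 4)]
    # wait returns two sets: done (finished Tasks) and pending (not finished)
    done, pending = await asyncio.wait(tasks)
    # Each Task's return value is read with .result()
    return sorted(t.result() for t in done), len(pending)

results, n_pending = asyncio.run(main())
print(results, n_pending)  # [2, 4, 6] 0
```

Since no timeout is passed, `pending` comes back empty and every result is in `done`.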
```python
import asyncio, aiohttp, re, time

pattern = re.compile('<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a.*?title="(.*?)"', re.S)
headers = {
    'Cookie': 'yyetss=yyetss2020; Hm_lvt_68b4c43849deddc211a9468dc296fdbc=1608692874,1608731298,1608989825,1608990438; Hm_lvt_165ce09938742bc47fef3de7d50e5f86=1608692874,1608731298,1608989825,1608990438; Hm_lpvt_68b4c43849deddc211a9468dc296fdbc=1608990806; Hm_lpvt_165ce09938742bc47fef3de7d50e5f86=1608990806',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

# Define the coroutine function; coroutine functions cannot be called directly
async def get_title(page):
    url = 'http://yyetss.com/list-lishi-all-{}.html'.format(page)
    # aiohttp connects like this
    async with aiohttp.ClientSession(headers=headers) as ss:
        async with ss.get(url) as resp:
            # Note: resp.text() here is not the same as requests' resp.text;
            # it is a coroutine function, which is why it can be awaited
            text = await resp.text()
            # Parse the page
            title_list = re.findall(pattern, text)
            print("Page {}:".format(page), title_list)

if __name__ == '__main__':
    begin = time.time()
    loop = asyncio.get_event_loop()
    tasks = [get_title(page) for page in range(1, 11)]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
    print("Elapsed:", time.time() - begin)
# Async elapsed: 0.3580803871154785
```
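A side note: on Python 3.8+ passing bare coroutines to `asyncio.wait` is deprecated (and rejected in 3.11+), and `asyncio.run` plus `asyncio.gather` is the usual modern pattern. A minimal sketch, with `asyncio.sleep` as a hypothetical stand-in for the aiohttp request:

```python
import asyncio

async def get_title(page):
    # Hypothetical stub: asyncio.sleep stands in for the aiohttp request + parse
    await asyncio.sleep(0.01)
    return "page {}".format(page)

async def main():
    # gather runs the coroutines concurrently and returns results in submission order
    return await asyncio.gather(*(get_title(p) for p in range(1, 4)))

titles = asyncio.run(main())
print(titles)  # ['page 1', 'page 2', 'page 3']
```

Unlike `wait`, `gather` hands back the return values directly, in order, with no `.result()` bookkeeping.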
Plain requests
```python
import requests, re, time

pattern = re.compile('<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a.*?title="(.*?)"', re.S)
headers = {
    'Cookie': 'yyetss=yyetss2020; Hm_lvt_68b4c43849deddc211a9468dc296fdbc=1608692874,1608731298,1608989825,1608990438; Hm_lvt_165ce09938742bc47fef3de7d50e5f86=1608692874,1608731298,1608989825,1608990438; Hm_lpvt_68b4c43849deddc211a9468dc296fdbc=1608990806; Hm_lpvt_165ce09938742bc47fef3de7d50e5f86=1608990806',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

# Plain synchronous function; pages are fetched one after another
def get_title(page):
    url = 'http://yyetss.com/list-lishi-all-{}.html'.format(page)
    # headers must be passed as a keyword argument
    # (the second positional argument of requests.get is params, not headers)
    resp = requests.get(url, headers=headers)
    text = resp.text
    title_list = re.findall(pattern, text)
    print("Page {}:".format(page), title_list)

begin = time.time()
for page in range(1, 11):
    get_title(page)
print("Elapsed:", time.time() - begin)
# Elapsed: 3.47572922706604
```
Thread pool
```python
import requests, re, time
from concurrent.futures import ThreadPoolExecutor

pattern = re.compile('<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a.*?title="(.*?)"', re.S)
headers = {
    'Cookie': 'yyetss=yyetss2020; Hm_lvt_68b4c43849deddc211a9468dc296fdbc=1608692874,1608731298,1608989825,1608990438; Hm_lvt_165ce09938742bc47fef3de7d50e5f86=1608692874,1608731298,1608989825,1608990438; Hm_lpvt_68b4c43849deddc211a9468dc296fdbc=1608990806; Hm_lpvt_165ce09938742bc47fef3de7d50e5f86=1608990806',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

# Plain synchronous function; the thread pool provides the concurrency
def get_title(page):
    url = 'http://yyetss.com/list-lishi-all-{}.html'.format(page)
    resp = requests.get(url, headers=headers)  # headers as keyword argument
    text = resp.text
    title_list = re.findall(pattern, text)
    print("Page {}:".format(page), title_list)

# ============ multithreading
if __name__ == '__main__':
    begin = time.time()
    with ThreadPoolExecutor(max_workers=10) as executor:
        [executor.submit(get_title, page) for page in range(1, 11)]
    print("Elapsed:", time.time() - begin)
# 10 threads, elapsed: 0.715099573135376
```
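`executor.submit` returns a `Future`, and `.result()` blocks until that task finishes, which is how return values (rather than prints) would be collected. A sketch with `time.sleep` as a hypothetical stand-in for the blocking `requests.get`:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def get_title(page):
    # Hypothetical stub: time.sleep stands in for the blocking requests.get + parse
    time.sleep(0.05)
    return "page {}".format(page)

begin = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    # submit returns a Future; .result() blocks until that task is done
    futures = [executor.submit(get_title, page) for page in range(1, 11)]
    titles = [f.result() for f in futures]
print(titles)
print("Elapsed:", time.time() - begin)  # roughly 0.05s: all ten sleeps overlap
```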
Source of `resp.text()` in aiohttp
```python
async def text(self, encoding: Optional[str] = None, errors: str = "strict") -> str:
    """Read response payload and decode."""
    if self._body is None:
        await self.read()
    if encoding is None:
        encoding = self.get_encoding()
    return self._body.decode(encoding, errors=errors)  # type: ignore
```
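Because `text` is declared with `async def`, calling it only creates a coroutine object; the value appears only once it is awaited (or run). A small illustration with a hypothetical `text()` stub, not the real aiohttp method:

```python
import asyncio

async def text():
    # Hypothetical stub mimicking aiohttp's coroutine method
    return "<html>...</html>"

print(asyncio.iscoroutinefunction(text))  # True
coro = text()                 # calling it only creates a coroutine object
print(type(coro).__name__)    # coroutine
result = asyncio.run(coro)    # awaiting/running it yields the actual string
print(result)
```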
Summary

| Approach | Elapsed (s) |
| --- | --- |
| asyncio + aiohttp | 0.3580803871154785 |
| Thread pool (10 threads) | 0.715099573135376 |
| Plain requests loop | 3.47572922706604 |
As the numbers show, asyncio + aiohttp is far more efficient than a plain requests loop. HTTP requests are IO-bound, so going async yields a substantial speedup; somewhat unexpectedly, it even beats the 10-thread pool here.