Implementing an Async Crawler with asyncio + aiohttp

Published on 2021-01-12


The requests library used previously makes a synchronous crawler: requests run strictly in order, typically iterating over pages with a for loop.
To build an asynchronous crawler, Python has the third-party aiohttp library, which makes async HTTP straightforward.

The test target is the first 10 pages of the American-drama listing on YYeTs (人人影视); the goal is to grab each show's title:
http://yyetss.com/list-lishi-all-1.html

asyncio+aiohttp

  1. The overall approach is much like using the requests library: first define a coroutine function, async def get_title, that performs the HTTP request and parses the page.
  2. async with aiohttp.ClientSession(headers=headers) as ss:
    Opens a session to keep the connection alive; the request headers go in the parentheses, just as with requests.
  3. async with ss.get(url) as resp:
    Issues the request and yields a response, resp.
  4. text = await resp.text()
    Note that resp.text() here is not the same as resp.text in the requests library: the aiohttp source (quoted at the end of this post) shows it is a coroutine function, which is why await can be applied to it.
  5. title_list = re.findall(pattern,text)
    Parsing must stay inside the session block, i.e. within the ss indentation while the session is still open; the parsing itself is the same as with requests, here using a regular expression.
  6. In the main block (under if __name__ == '__main__':):
    loop = asyncio.get_event_loop()
    Called for the first time, this creates an event loop in the main thread.
    tasks = [get_title(page) for page in range(1,11)]
    Since get_title(page) is a coroutine function, tasks is a list of coroutine objects that will be scheduled on the event loop.
    asyncio.wait(tasks)
    Per the docs, asyncio.wait returns two sets, done and pending: done holds the Tasks that completed, pending holds those still unfinished at timeout; a completed Task's return value is read via its result() method (see the sketch after this list). For a concrete example, see the event-loop part of my asyncio notes.
    loop.run_until_complete(asyncio.wait(tasks))
    Submits the task list to the event loop and runs it until all tasks complete.
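A minimal, self-contained sketch of consuming asyncio.wait's return values (fetch here is a hypothetical stand-in for an HTTP request, not part of the crawler below; assumes Python 3.7+ for asyncio.run):

import asyncio

async def fetch(n):
    await asyncio.sleep(0.1)               # stand-in for a real HTTP request
    return "result {}".format(n)

async def demo():
    tasks = [asyncio.ensure_future(fetch(n)) for n in range(3)]
    # wait returns two sets: Tasks that finished, and Tasks still pending at timeout
    done, pending = await asyncio.wait(tasks, timeout=5)
    for task in done:
        print(task.result())               # the coroutine's return value

asyncio.run(demo())

The full script: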
import asyncio,aiohttp,re,time
pattern = re.compile('<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a.*?title="(.*?)"',re.S)
headers = {
    'Cookie': 'yyetss=yyetss2020; Hm_lvt_68b4c43849deddc211a9468dc296fdbc=1608692874,1608731298,1608989825,1608990438; Hm_lvt_165ce09938742bc47fef3de7d50e5f86=1608692874,1608731298,1608989825,1608990438; Hm_lpvt_68b4c43849deddc211a9468dc296fdbc=1608990806; Hm_lpvt_165ce09938742bc47fef3de7d50e5f86=1608990806',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
# Define a coroutine function; calling it directly only creates a coroutine object
async def get_title(page):
    url = 'http://yyetss.com/list-lishi-all-{}.html'.format(page)
    # aiohttp opens connections via the async context managers below
    async with aiohttp.ClientSession(headers=headers) as ss:
        async with ss.get(url) as resp:
            # Note: resp.text() here is not the same as resp.text in the requests library
            # resp.text() is a coroutine function, which is why await can be applied
            text = await resp.text()
        # parse the page (still inside the session block)
        title_list = re.findall(pattern,text)
        print("Page {}:".format(page), title_list)

if __name__ == '__main__':
    begin = time.time()
    loop = asyncio.get_event_loop()
    tasks = [get_title(page) for page in range(1,11)]
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
    print("耗时:",time.time() - begin)

# async, elapsed: 0.3580803871154785
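Side note: the manual loop management above is the pre-3.7 style. Since Python 3.7, asyncio.run() is the recommended entry point, and asyncio.gather() additionally returns results in submission order. A hedged equivalent of the main block (reusing get_title from the script above; assumes Python 3.7+):

async def main():
    # gather schedules all ten coroutines concurrently and waits for them all
    await asyncio.gather(*(get_title(page) for page in range(1, 11)))

if __name__ == '__main__':
    begin = time.time()
    asyncio.run(main())
    print("Elapsed:", time.time() - begin)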

Plain requests

import requests,re,time
pattern = re.compile('<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a.*?title="(.*?)"',re.S)
headers = {
    'Cookie': 'yyetss=yyetss2020; Hm_lvt_68b4c43849deddc211a9468dc296fdbc=1608692874,1608731298,1608989825,1608990438; Hm_lvt_165ce09938742bc47fef3de7d50e5f86=1608692874,1608731298,1608989825,1608990438; Hm_lpvt_68b4c43849deddc211a9468dc296fdbc=1608990806; Hm_lpvt_165ce09938742bc47fef3de7d50e5f86=1608990806',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
# Synchronous version: each request blocks until it finishes
def get_title(page):
    url = 'http://yyetss.com/list-lishi-all-{}.html'.format(page)
    resp = requests.get(url, headers=headers)
    text = resp.text
    title_list = re.findall(pattern, text)
    print("页数{}:".format(page), title_list)

begin = time.time()
for page in range(1,11):
    get_title(page)

print("耗时:",time.time() - begin)

# elapsed: 3.47572922706604

Thread pool

import requests,re,time
from concurrent.futures import ThreadPoolExecutor
pattern = re.compile('<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a.*?title="(.*?)"',re.S)
headers = {
    'Cookie': 'yyetss=yyetss2020; Hm_lvt_68b4c43849deddc211a9468dc296fdbc=1608692874,1608731298,1608989825,1608990438; Hm_lvt_165ce09938742bc47fef3de7d50e5f86=1608692874,1608731298,1608989825,1608990438; Hm_lpvt_68b4c43849deddc211a9468dc296fdbc=1608990806; Hm_lpvt_165ce09938742bc47fef3de7d50e5f86=1608990806',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
# Same synchronous function as above; the thread pool supplies the concurrency
def get_title(page):
    url = 'http://yyetss.com/list-lishi-all-{}.html'.format(page)
    resp = requests.get(url, headers=headers)
    text = resp.text
    title_list = re.findall(pattern, text)
    print("页数{}:".format(page), title_list)

# ============ multithreading
if __name__ == '__main__':
    begin = time.time()
    with ThreadPoolExecutor(max_workers=10) as executor:
        [executor.submit(get_title,page) for page in range(1,11)]

    print("耗时:",time.time() - begin)

# 10 threads, elapsed: 0.715099573135376
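The list comprehension above throws the Future objects away, which is fine for timing, since the with block waits for all submitted work before exiting. To collect return values or surface exceptions raised in worker threads, keep the futures; a minimal sketch reusing get_title from the script above:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(get_title, page) for page in range(1, 11)]
    for fut in as_completed(futures):
        fut.result()   # returns the worker's value, re-raising any exception it hit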

aiohttp source for resp.text()

    async def text(self, encoding: Optional[str] = None, errors: str = "strict") -> str:
        """Read response payload and decode."""
        if self._body is None:
            await self.read()

        if encoding is None:
            encoding = self.get_encoding()

        return self._body.decode(encoding, errors=errors)  # type: ignore
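Since text() is declared async, forgetting the await yields a coroutine object rather than the decoded body. A minimal sketch of the difference (reusing the listing URL from above):

import asyncio, aiohttp

async def main():
    async with aiohttp.ClientSession() as ss:
        async with ss.get('http://yyetss.com/list-lishi-all-1.html') as resp:
            coro = resp.text()      # a coroutine object, NOT a string
            text = await coro       # str: the decoded response body
            print(type(coro), len(text))

asyncio.run(main())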

Summary

Type                        Elapsed (s)
asyncio+aiohttp             0.3580803871154785
Thread pool (10 threads)    0.715099573135376
Plain requests loop         3.47572922706604

Crawler timing comparison

As the table shows, asyncio + aiohttp is far more efficient than looping with plain requests: HTTP requests are IO-bound, so async concurrency delivers a large speedup. Somewhat unexpectedly, it even beats the thread pool here, plausibly because the single-threaded event loop avoids the thread creation and context-switching overhead the pool pays.

