Building an asynchronous crawler with asyncio + requests

Preface

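A natural question: since the aiohttp library already works well with asyncio, why bother with an asyncio + requests crawler at all? The answer is that the technique is not specific to crawlers. It targets the more general situation described next.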

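In most programs we either use asyncio coroutines for all asynchronous work, or use process pools and thread pools for all concurrent work. When coroutine-based code has to be mixed with pool-based code, however, the following call is what ties the two together. It is also the core piece of code in this note: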

loop.run_in_executor(executor or None, func, *args)

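If the executor argument is None, the event loop uses its default executor (a ThreadPoolExecutor that it creates on first use) and calls that executor's submit method so that a pool thread runs the given function; submit returns a concurrent.futures.Future object. If you pass your own thread pool or process pool, the function is submitted to it directly, just as with executor.submit(func, *args), where executor is the pool object you passed.
Internally, asyncio.wrap_future then wraps the concurrent.futures.Future in an asyncio.Future, so what ends up in tasks is a list of awaitable objects, the same as in the asyncio + aiohttp version.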
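
The two calling forms can be sketched as follows. This is a minimal illustration, not the crawler itself: blocking_square is a hypothetical stand-in for a blocking call such as requests.get, and the sketch only checks that the object returned by run_in_executor is an asyncio.Future that can be awaited.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_square(n):
    """Hypothetical stand-in for a blocking call such as requests.get."""
    time.sleep(0.1)
    return n * n


async def demo():
    loop = asyncio.get_running_loop()

    # executor=None: the loop's default executor runs the function.
    fut_default = loop.run_in_executor(None, blocking_square, 3)

    # Explicit executor: the function is submitted to this pool.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_pool = loop.run_in_executor(pool, blocking_square, 4)
        is_aio_future = isinstance(fut_pool, asyncio.Future)
        results = await asyncio.gather(fut_default, fut_pool)

    return is_aio_future, results


is_aio_future, results = asyncio.run(demo())
print(is_aio_future, results)
```

Both futures are awaited together with asyncio.gather, so the two sleeps overlap instead of running back to back.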

Implementing the asyncio + requests crawler

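In the code listing further down, get_title is a synchronous, ordinary function rather than a coroutine. Some operations cannot be written as coroutines, for compatibility reasons or because of project conventions, but we can still run them asynchronously by combining a thread pool with run_in_executor.
with ThreadPoolExecutor(max_workers=10) as tpool:
First, create a thread pool tpool with at most 10 worker threads.
loop.run_in_executor
A for loop then calls loop.run_in_executor(tpool, get_title, page) once per page, adding the 10 get_title(page) calls to the task list.
Internally this is roughly equivalent to: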

for page in range(1, 11):
    cf_future = tpool.submit(get_title, page)       # still a concurrent.futures.Future
    aio_future = asyncio.wrap_future(cf_future)     # now an asyncio.Future, which can be treated as a task
    tasks.append(aio_future)                        # add it to the task list

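This is the Future type conversion that happens internally. It is what lets a synchronous (ordinary) function run asynchronously through a thread pool and loop.run_in_executor.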
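
The conversion can also be done by hand, which makes the two Future types visible. The following is only a sketch: get_length is a hypothetical synchronous function standing in for get_title, and the code does explicitly what run_in_executor is described as doing above.

```python
import asyncio
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor


def get_length(word):
    """Hypothetical synchronous function standing in for get_title."""
    return len(word)


async def manual_wrap(words):
    loop = asyncio.get_running_loop()
    tasks = []
    kinds = []
    with ThreadPoolExecutor(max_workers=3) as tpool:
        for word in words:
            # submit returns a concurrent.futures.Future
            cf_future = tpool.submit(get_length, word)
            # wrap_future turns it into an awaitable asyncio.Future
            aio_future = asyncio.wrap_future(cf_future, loop=loop)
            kinds.append((
                isinstance(cf_future, concurrent.futures.Future),
                isinstance(aio_future, asyncio.Future),
            ))
            tasks.append(aio_future)
        lengths = await asyncio.gather(*tasks)
    return kinds, lengths


kinds, lengths = asyncio.run(manual_wrap(["asyncio", "requests", "pool"]))
print(kinds, lengths)
```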

Code

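"""
Asynchronous crawling with requests: the request itself needs no coroutine function.
Instead, a thread pool is combined with loop.run_in_executor(executor, func, *args).
"""
import asyncio
import re
import time
from concurrent.futures import ThreadPoolExecutor

import requests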
pattern = re.compile('<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a.*?title="(.*?)"',re.S)
headers = {
    'Cookie': 'yyetss=yyetss2020; Hm_lvt_68b4c43849deddc211a9468dc296fdbc=1608692874,1608731298,1608989825,1608990438; Hm_lvt_165ce09938742bc47fef3de7d50e5f86=1608692874,1608731298,1608989825,1608990438; Hm_lpvt_68b4c43849deddc211a9468dc296fdbc=1608990806; Hm_lpvt_165ce09938742bc47fef3de7d50e5f86=1608990806',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}

def get_title(page):
    url = 'http://yyetss.com/list-lishi-all-{}.html'.format(page)
    # Pass headers by keyword: the second positional argument of requests.get is params.
    resp = requests.get(url, headers=headers)
    text = resp.text
    title_list = re.findall(pattern, text)
    print("Page {}:".format(page), title_list)

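async def main():
    loop = asyncio.get_running_loop()
    tasks = []
    with ThreadPoolExecutor(max_workers=10) as tpool:
        for page in range(1, 11):
            # Each call returns an asyncio.Future backed by a pool thread.
            tasks.append(loop.run_in_executor(
                tpool,
                get_title,
                page
            ))
        # Await all pages here so the event loop is not blocked when the pool shuts down.
        await asyncio.gather(*tasks)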

if __name__ == '__main__':
    begin = time.time()
    # asyncio.run creates the event loop, runs main() and closes the loop afterwards.
    asyncio.run(main())
    print("Elapsed:", time.time() - begin)
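
The listing above prints the titles from inside the worker threads. If the caller needs the data, one option is to have the worker return it and to collect everything with asyncio.gather, which returns results in the order the tasks were passed in. The sketch below is a variant under stated assumptions: fetch_html is a hypothetical stub that replaces requests.get so the example runs offline, and both the HTML it returns and the simplified pattern are made up for illustration.

```python
import asyncio
import re
from concurrent.futures import ThreadPoolExecutor

# Simplified pattern for the made-up markup below (not the real site's HTML).
pattern = re.compile(r'<a[^>]*title="(.*?)"', re.S)


def fetch_html(page):
    """Hypothetical stub replacing requests.get(url, headers=headers).text."""
    return ('<a href="#" title="show-{0}a"></a>'
            '<a href="#" title="show-{0}b"></a>').format(page)


def get_title(page):
    # Return the titles instead of printing them.
    return page, pattern.findall(fetch_html(page))


async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=10) as tpool:
        tasks = [loop.run_in_executor(tpool, get_title, page)
                 for page in range(1, 4)]
        # gather keeps the results in submission order.
        return await asyncio.gather(*tasks)


results = asyncio.run(main())
for page, titles in results:
    print("Page {}:".format(page), titles)
```

Returning values keeps the output in one place and avoids interleaved prints from different threads.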

Summary

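When an operation cannot be called as a coroutine function but you still want it to run asynchronously, combine a thread pool or process pool with loop.run_in_executor(executor or None, func, *args). In my test this approach was slightly slower than asyncio + aiohttp, taking 0.77 s. It is best seen as a technique for this particular kind of situation.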
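
To see where the speedup comes from without touching the network, the sketch below times ten blocking calls run one after another and then through a thread pool. slow_call and its 0.1 s delay are assumptions standing in for a request; actual timings depend on the machine and are not the figures quoted above.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(i):
    """Hypothetical blocking call; sleeps instead of doing network I/O."""
    time.sleep(0.1)
    return i


def run_sequential(n):
    begin = time.perf_counter()
    results = [slow_call(i) for i in range(n)]
    return results, time.perf_counter() - begin


async def run_in_pool(n):
    loop = asyncio.get_running_loop()
    begin = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as tpool:
        tasks = [loop.run_in_executor(tpool, slow_call, i) for i in range(n)]
        results = await asyncio.gather(*tasks)
    return results, time.perf_counter() - begin


seq_results, seq_time = run_sequential(10)
pool_results, pool_time = asyncio.run(run_in_pool(10))
print("sequential: {:.2f}s, thread pool: {:.2f}s".format(seq_time, pool_time))
```

With one worker per call, the sleeps overlap, so the pooled run takes roughly as long as a single call rather than the sum of all ten.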


tonghao
