Choosing Thread-Pool Task Granularity in a Multiprocess, Multithreaded Crawler

Published 2021-01-21 · 1151 views


In an earlier post, http://tonghao.xyz/archives/139/, I crawled with multiple processes plus a thread pool, where each thread-pool task was a single show. Today I watched another video in which each thread-pool task was a whole page instead.

To compare the efficiency of the two approaches, I made a small change to the original code.

Original code

import re
import requests
from concurrent.futures import ThreadPoolExecutor

# headers (request headers), url_head (site root), and get_link (the per-show
# downloader) are defined elsewhere in the full script

def get_html(url):
    # Extract the total page count from the pagination bar
    pattern = re.compile(r'<ul class=\'pagination\'>.*?<a href=.*?>(\d+)页</a></li></ul>', re.S)
    pattern2 = re.compile(r'<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a href="(.*?)" title="(.*?)"', re.S)
    resp = requests.get(url, headers=headers)  # headers must be passed as a keyword argument
    if resp.status_code == 200:
        html = resp.text
        pages = int(re.search(pattern, html).group(1))  # int() is safer than eval()
        url_ = re.sub(r'\d', '{}', url)
        # url_  http://yyetss.com/list-lishi-all-{}.html
        threadpools = ThreadPoolExecutor(max_workers=24)
        for page in range(1, pages + 1):  # loop over pages
            print("===================Page <{}>==================".format(page))
            url_page = url_.format(page)
            # url_page  http://yyetss.com/list-lishi-all-1.html
            resp2 = requests.get(url_page, headers=headers)
            if resp2.status_code == 200:
                link_title_list = re.findall(pattern2, resp2.text)
                for item in link_title_list:  # loop over the shows on this page
                    # URL of each show's detail page
                    link = url_head + item[0]
                    title = item[1]
                    # Submit each show to the pool; a page lists 24 shows, hence max_workers=24
                    threadpools.submit(get_link, link, title)
                print(len(link_title_list))
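The post reports timing results but doesn't show the harness. A minimal sketch of how either version could be timed, assuming get_html is first given a threadpools.shutdown(wait=True) after its submit loop (my addition; without it the pool keeps downloading after get_html returns and the clock stops too early):

import time

start = time.perf_counter()
get_html('http://yyetss.com/list-lishi-all-1.html')
# Only meaningful if get_html drains its pool before returning,
# e.g. via threadpools.shutdown(wait=True) after the submit loop.
print("total crawl time: {:.2f}s".format(time.perf_counter() - start))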

After changing the thread-pool task to one page per task

# imports and the headers / url_head / get_link globals are the same as above

def get_html(url):
    # Extract the total page count from the pagination bar
    pattern = re.compile(r'<ul class=\'pagination\'>.*?<a href=.*?>(\d+)页</a></li></ul>', re.S)
    resp = requests.get(url, headers=headers)  # headers as a keyword argument
    if resp.status_code == 200:
        html = resp.text
        pages = int(re.search(pattern, html).group(1))  # int() instead of eval()
        url_ = re.sub(r'\d', '{}', url)
        # url_  http://yyetss.com/list-lishi-all-{}.html
        threadpools = ThreadPoolExecutor(max_workers=24)
        for page in range(1, pages + 1):  # loop over pages
            # submit each page's parsing job to the pool
            threadpools.submit(get_page, url_, page)

def get_page(url_, page):
    pattern2 = re.compile(r'<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a href="(.*?)" title="(.*?)"', re.S)
    print("===================Page <{}>==================".format(page))
    url_page = url_.format(page)
    # url_page  http://yyetss.com/list-lishi-all-1.html
    resp2 = requests.get(url_page, headers=headers)
    if resp2.status_code == 200:
        link_title_list = re.findall(pattern2, resp2.text)
        for item in link_title_list:  # loop over the shows on this page
            # URL of each show's detail page
            link = url_head + item[0]
            title = item[1]
            get_link(link, title)  # runs synchronously inside this page's task
        print(len(link_title_list))
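A side note that applies to both versions: submit() returns a Future, and since those Futures are thrown away, an exception raised inside get_page (or get_link) vanishes silently. A small sketch of collecting them, reusing url_ and pages from the code above:

from concurrent.futures import ThreadPoolExecutor, as_completed

threadpools = ThreadPoolExecutor(max_workers=24)
futures = [threadpools.submit(get_page, url_, page) for page in range(1, pages + 1)]
for future in as_completed(futures):
    future.result()  # re-raises any exception that occurred inside get_page
threadpools.shutdown(wait=True)  # block until every queued task has finished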

Summary

After running both versions, the original approach of submitting each individual show to the thread pool turned out to be faster, so I'll keep writing it that way. In hindsight this makes sense: with one task per page, each worker downloads its page's 24 shows one after another, while per-show tasks are fine-grained enough to keep all 24 workers busy. The simulation below illustrates the effect.
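Here is a self-contained simulation that replaces the network requests with time.sleep (the delay and counts are illustrative assumptions, not measurements from the real site). With per-page tasks only 4 tasks exist, so 20 of the 24 workers sit idle; with per-show tasks every worker stays busy:

import time
from concurrent.futures import ThreadPoolExecutor

PAGES, SHOWS_PER_PAGE, WORKERS = 4, 24, 24

def fetch_show(page, idx):
    time.sleep(0.05)  # stand-in for one detail-page download

def fetch_page(page):
    for idx in range(SHOWS_PER_PAGE):  # shows are handled serially inside one task
        fetch_show(page, idx)

def run(label, submit_all):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:  # waits for all tasks on exit
        submit_all(pool)
    print("{}: {:.2f}s".format(label, time.perf_counter() - start))

# Per-page tasks: only 4 coarse tasks, so parallelism is capped at 4 (~1.2 s).
run("per-page", lambda pool: [pool.submit(fetch_page, p) for p in range(PAGES)])

# Per-show tasks: 96 small tasks keep all 24 workers busy (~0.2 s).
run("per-show", lambda pool: [pool.submit(fetch_show, p, i)
                              for p in range(PAGES)
                              for i in range(SHOWS_PER_PAGE)])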

