In the post http://tonghao.xyz/archives/139/ I used multiple processes and threads; in a video I watched today, however, each thread-pool task was a whole page, whereas in my version each task was a single show.
To compare the efficiency of the two approaches, I made a small change to the original code.
Original code
import re
import requests
from concurrent.futures import ThreadPoolExecutor

def get_html(url):
    # Extract the total page count from the pagination bar
    pattern = re.compile(r"<ul class='pagination'>.*?<a href=.*?>(\d+)页</a></li></ul>", re.S)
    pattern2 = re.compile(r'<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a href="(.*?)" title="(.*?)"', re.S)
    resp = requests.get(url, headers=headers)  # headers must be passed by keyword
    if resp.status_code == 200:
        html = resp.text
        pages = int(re.search(pattern, html).group(1))  # int() instead of eval()
        url_ = re.sub(r'\d', '{}', url)
        # url_  http://yyetss.com/list-lishi-all-{}.html
        threadpools = ThreadPoolExecutor(max_workers=24)
        for page in range(1, pages + 1):  # loop over pages
            print("===================Page <{}>==================".format(page))
            url_page = url_.format(page)
            # url_page  http://yyetss.com/list-lishi-all-1.html
            resp2 = requests.get(url_page, headers=headers)
            if resp2.status_code == 200:
                link_title_list = re.findall(pattern2, resp2.text)
                for item in link_title_list:  # loop over the shows on one page
                    # URL of each show
                    link = url_head + item[0]
                    title = item[1]
                    # one task per show; a page holds 24 shows, hence max_workers=24
                    threadpools.submit(get_link, link, title)
                print(len(link_title_list))
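The `re.sub(r'\d', '{}', url)` step above turns the seed URL into a page template. A quick standalone check of what it produces (note that it replaces every digit, so it only behaves as intended while the seed URL contains a single-digit page number):

```python
import re

url = "http://yyetss.com/list-lishi-all-1.html"
template = re.sub(r"\d", "{}", url)
print(template)            # http://yyetss.com/list-lishi-all-{}.html
print(template.format(3))  # http://yyetss.com/list-lishi-all-3.html

# Caveat: a multi-digit page number in the seed URL yields two placeholders
print(re.sub(r"\d", "{}", "http://yyetss.com/list-lishi-all-10.html"))
# http://yyetss.com/list-lishi-all-{}{}.html
```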
After changing the thread-pool task granularity to one task per page
def get_html(url):
    # Extract the total page count from the pagination bar
    pattern = re.compile(r"<ul class='pagination'>.*?<a href=.*?>(\d+)页</a></li></ul>", re.S)
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        html = resp.text
        pages = int(re.search(pattern, html).group(1))
        url_ = re.sub(r'\d', '{}', url)
        # url_  http://yyetss.com/list-lishi-all-{}.html
        threadpools = ThreadPoolExecutor(max_workers=24)
        for page in range(1, pages + 1):  # loop over pages
            # one parsing task per page goes into the pool
            threadpools.submit(get_page, url_, page)

def get_page(url_, page):
    pattern2 = re.compile(r'<div class="col-xs-3 col-sm-3 col-md-2 c-list-box">.*?<a href="(.*?)" title="(.*?)"', re.S)
    print("===================Page <{}>==================".format(page))
    url_page = url_.format(page)
    # url_page  http://yyetss.com/list-lishi-all-1.html
    resp2 = requests.get(url_page, headers=headers)
    if resp2.status_code == 200:
        link_title_list = re.findall(pattern2, resp2.text)
        for item in link_title_list:  # the shows on one page run serially inside this task
            # URL of each show
            link = url_head + item[0]
            title = item[1]
            get_link(link, title)
        print(link_title_list)
        print(len(link_title_list))
Summary
After running both versions, the original approach of submitting one task per show still turned out to be somewhat faster, so I'll stick with writing it that way.
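The gap can be reproduced without touching the site at all. A minimal sketch (assumptions: network calls simulated with `time.sleep`, 4 pages of 24 shows each): per-show tasks keep all 24 workers busy across pages, while per-page tasks leave most workers idle whenever there are fewer pages than workers, because the shows inside one page run serially:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

PAGES = 4
SHOWS_PER_PAGE = 24
FETCH_TIME = 0.01  # simulated latency of one get_link() call

def fetch_show(show_id):
    time.sleep(FETCH_TIME)  # stand-in for downloading one show

def fetch_page(page):
    for i in range(SHOWS_PER_PAGE):  # serial loop inside a single task
        fetch_show((page, i))

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def per_show():
    # fine granularity: 96 small tasks spread over 24 workers
    with ThreadPoolExecutor(max_workers=24) as pool:
        wait([pool.submit(fetch_show, (p, i))
              for p in range(PAGES) for i in range(SHOWS_PER_PAGE)])

def per_page():
    # coarse granularity: only 4 tasks, each doing 24 serial fetches
    with ThreadPoolExecutor(max_workers=24) as pool:
        wait([pool.submit(fetch_page, p) for p in range(PAGES)])

t_show = timed(per_show)
t_page = timed(per_page)
print(t_show < t_page)  # True: per-show granularity finishes sooner
```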