
Script performs very slowly even when it runs asynchronously

Python

繁星淼淼 2021-09-14 10:37:12

I've written a script using asyncio together with the aiohttp library to parse the content of a website asynchronously. I've tried to apply the logic in the following script the way it is usually applied in scrapy. However, when I execute my script, it behaves the way synchronous libraries like requests or urllib.request do. As a result, it is very slow and doesn't serve the purpose.

I know I can get around this by defining all the next-page links in the link variable. But am I not already doing the task the right way with my existing script?

In the script, the processing_docs() function collects the links to the different posts and passes the refined links to the fetch_again() function, which fetches the title from each target page. There is also logic in processing_docs() that collects the next_page link and passes it to the fetch() function to repeat the process. This next_page call is making the script slower, whereas we usually do the same thing in scrapy and get the expected performance.

My question is: how can I achieve the same result while keeping the existing logic intact?

    import aiohttp
    import asyncio
    from lxml.html import fromstring
    from urllib.parse import urljoin

    link = "https://stackoverflow.com/questions/tagged/web-scraping"

    async def fetch(url):
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                text = await response.text()
                result = await processing_docs(session, text)
            return result

    async def processing_docs(session, html):
        tree = fromstring(html)
        titles = [urljoin(link,title.attrib['href']) for title in tree.cssselect(".summary .question-hyperlink")]
        for title in titles:
            await fetch_again(session,title)

        next_page = tree.cssselect("div.pager a[rel='next']")
        if next_page:
            page_link = urljoin(link,next_page[0].attrib['href'])
            await fetch(page_link)

    async def fetch_again(session,url):
        async with session.get(url) as response:
            text = await response.text()
            tree = fromstring(text)
            title = tree.cssselect("h1[itemprop='name'] a")[0].text
            print(title)

    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        loop.run_until_complete(asyncio.gather(*(fetch(url) for url in [link])))
        loop.close()


1 Answer

元芳怎么了

Contributed 1798 experience points, received more than 7 likes

The whole point of using asyncio is that you can run multiple fetches concurrently (in parallel with each other). Let's look at your code:

    for title in titles:
        await fetch_again(session, title)

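This part means that each new fetch_again starts only after the previous one has been awaited (finished). If you do things this way, then yes, there is no difference from the synchronous approach.

To get the full benefit of asyncio, start multiple fetches concurrently with asyncio.gather: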

    await asyncio.gather(*[
        fetch_again(session, title)
        for title
        in titles
    ])

You'll see a significant speedup.
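To make the difference concrete without touching the network, here is a minimal sketch that compares the two styles. It assumes a hypothetical fake_fetch() coroutine that sleeps for 0.1 seconds in place of the real fetch_again(); the names and delays are illustrative only.

```python
import asyncio
import time


async def fake_fetch(title: str, delay: float = 0.1) -> str:
    """Hypothetical stand-in for fetch_again(): sleeps instead of doing HTTP."""
    await asyncio.sleep(delay)
    return title.upper()


async def run_sequential(titles):
    # Each await must finish before the next coroutine even starts.
    return [await fake_fetch(t) for t in titles]


async def run_concurrent(titles):
    # gather() schedules all coroutines at once and keeps the input order.
    return await asyncio.gather(*(fake_fetch(t) for t in titles))


async def timed(coro):
    """Await a coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


async def main():
    titles = [f"post-{i}" for i in range(10)]
    seq_result, seq_time = await timed(run_sequential(titles))
    con_result, con_time = await timed(run_concurrent(titles))
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
    return seq_result, list(con_result), seq_time, con_time


if __name__ == "__main__":
    asyncio.run(main())
```

With ten simulated requests, the sequential version should take roughly the sum of the delays, while the gather() version should take roughly the length of a single delay. Both return the same results in the same order.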

You can go even further and start the fetch for the next page concurrently with fetch_again for the titles:

    async def processing_docs(session, html):
        coros = []
        tree = fromstring(html)

        # titles:
        titles = [
            urljoin(link, title.attrib['href'])
            for title
            in tree.cssselect(".summary .question-hyperlink")
        ]
        for title in titles:
            coros.append(
                fetch_again(session, title)
            )

        # next_page:
        next_page = tree.cssselect("div.pager a[rel='next']")
        if next_page:
            page_link = urljoin(link, next_page[0].attrib['href'])
            coros.append(
                fetch(page_link)
            )

        # await:
        await asyncio.gather(*coros)
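The same pattern can be tried offline. The sketch below is only an illustration: FAKE_SITE, fetch_post() and fetch_page() are hypothetical stand-ins for the real site, fetch_again() and fetch()/processing_docs(), and asyncio.sleep() replaces the HTTP requests.

```python
import asyncio

# Hypothetical in-memory "site": page -> (post links on that page, next page or None).
FAKE_SITE = {
    "page-1": (["post-a", "post-b"], "page-2"),
    "page-2": (["post-c", "post-d"], "page-3"),
    "page-3": (["post-e"], None),
}


async def fetch_post(url: str, seen: list) -> None:
    """Stand-in for fetch_again(): pretend to download one post."""
    await asyncio.sleep(0.05)
    seen.append(url)


async def fetch_page(url: str, seen: list) -> None:
    """Stand-in for fetch() and processing_docs() combined."""
    await asyncio.sleep(0.05)  # pretend to download the listing page
    posts, next_page = FAKE_SITE[url]
    coros = [fetch_post(post, seen) for post in posts]
    if next_page is not None:
        # The next page is crawled alongside this page's posts.
        coros.append(fetch_page(next_page, seen))
    await asyncio.gather(*coros)


async def crawl(start: str = "page-1") -> list:
    seen = []
    await fetch_page(start, seen)
    return seen


if __name__ == "__main__":
    print(sorted(asyncio.run(crawl())))
```

Because posts from different pages finish in an interleaved order, the example sorts the collected links before printing them.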

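Important note

While this approach lets you do things much faster, you may want to limit the number of concurrent requests at any one time to avoid heavy resource usage on both your machine and the server.

You can use asyncio.Semaphore for this purpose: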

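    semaphore = asyncio.Semaphore(10)

    async def fetch(url):
        async with aiohttp.ClientSession() as session:
            # Hold the semaphore only while the request is in flight.
            # Holding it across processing_docs() would keep one slot busy
            # per page during the recursive fetch() of the next page.
            async with semaphore:
                async with session.get(url) as response:
                    text = await response.text()
            result = await processing_docs(session, text)
            return result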
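As a rough check that a semaphore really caps concurrency, the following sketch counts how many jobs are inside the guarded section at once. The limited_job() and run_jobs() helpers and the state dictionary are hypothetical, and asyncio.sleep() stands in for the HTTP request. The semaphore is created inside the running coroutine here, which is an assumption made to avoid event-loop binding problems on older Python versions.

```python
import asyncio


async def limited_job(semaphore: asyncio.Semaphore, state: dict) -> None:
    """Hypothetical job that records how many jobs run at the same time."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for the HTTP request
        state["active"] -= 1


async def run_jobs(n_jobs: int = 50, limit: int = 10) -> int:
    # Created inside the running event loop (see the assumption above).
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(limited_job(semaphore, state) for _ in range(n_jobs)))
    return state["peak"]


if __name__ == "__main__":
    print("peak concurrency:", asyncio.run(run_jobs()))
```

The reported peak should never exceed the limit passed to the semaphore, no matter how many jobs are scheduled with gather().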

Answered 2021-09-14

