
How to automatically retrieve RSS files in Python

POPMUISE 2023-07-18 16:30:39
I am building a system that scrapes news articles from RSS files and passes them to a sentiment analysis API. This is my first time working on a project of this scale. I am at the stage where I can get the raw text from the links in an RSS file. I now need to build a system that automatically fetches the RSS files whenever they are updated. Any high-level ideas on how to achieve this? Thanks.

1 Answer

慕桂英546537


Start with the simple case of just looping over the RSS feeds:

import feedparser, requests
from bs4 import BeautifulSoup
import urllib.parse
import pandas as pd

# get some RSS feeds....
resp = requests.get("https://blog.feedspot.com/world_news_rss_feeds/")
soup = BeautifulSoup(resp.content.decode(), "html.parser")
rawfeeds = soup.find_all("h2")
feeds = {}
for rf in rawfeeds:
    a = rf.find("a")
    if a is not None:
        # feed name -> feed URL, pulled out of the link's "q" query parameter
        feeds[a.string.replace("RSS Feed", "").strip()] = urllib.parse.parse_qs(a["href"])["q"][0].replace("site:", "")

# now source them all into a dataframe
df = pd.DataFrame()
for k, url in feeds.items():
    try:
        df = pd.concat([df, pd.json_normalize(feedparser.parse(url)["entries"]).assign(Source=k)])
    except Exception:
        # feedparser flags most malformed feeds as "bozo" rather than raising,
        # but guard against hard failures (e.g. xml.sax.SAXParseException) anyway
        print(f"invalid xml: {url}")

Making it re-entrant:

  1. Use feedparser's etag and modified support (conditional GET; see the sketch below)

  2. Persist the dataframes so that a re-run picks up where the last one left off
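Point 1 uses feedparser's conditional-GET support: pass back the etag/modified values from the previous fetch, and an unchanged feed answers 304 with no entries. A minimal sketch, with a placeholder feed URL:

import feedparser

URL = "https://example.com/feed.xml"    # placeholder feed URL

# first fetch: remember the validators the server handed back
d = feedparser.parse(URL)
etag, modified = d.get("etag"), d.get("modified")

# later fetch: send them back; 304 means nothing new to parse
d2 = feedparser.parse(URL, etag=etag, modified=modified)
if d2.get("status") == 304:
    print("feed unchanged since last fetch")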

I would use threading so that the fetching is not purely sequential. Obviously, with threads you need to think about synchronising your save points. You can then run this under a scheduler to periodically pick up new items from the RSS feeds and fetch the associated articles.
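A rough sketch of that shape (the fetch_feed helper, worker count, example feed URL and 15-minute interval are all assumptions, and a real deployment would use cron or a proper scheduler rather than a sleep loop):

import time
import feedparser
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

entries = []
save_lock = Lock()    # synchronise access to the shared save point

def fetch_feed(url):
    # each worker parses one feed; only the shared state is touched under the lock
    fp = feedparser.parse(url)
    with save_lock:
        entries.extend(fp["entries"])

def run_once(urls):
    with ThreadPoolExecutor(max_workers=8) as ex:
        list(ex.map(fetch_feed, urls))

urls = ["http://feeds.bbci.co.uk/news/world/rss.xml"]    # assumed example feed
while True:
    run_once(urls)
    print(f"{len(entries)} entries collected so far")
    time.sleep(15 * 60)    # poll every 15 minutes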

import feedparser, requests, newspaper
from bs4 import BeautifulSoup
import urllib.parse
from pathlib import Path
import pandas as pd

# working directory for the persisted state
p = Path.cwd().joinpath("news")
p.mkdir(exist_ok=True)

# get some RSS feeds....
if p.joinpath("rss.pickle").is_file():
    dfrss = pd.read_pickle(p.joinpath("rss.pickle"))
else:
    resp = requests.get("https://blog.feedspot.com/world_news_rss_feeds/")
    soup = BeautifulSoup(resp.content.decode(), "html.parser")
    rawfeeds = soup.find_all("h2")
    feeds = []
    for rf in rawfeeds:
        a = rf.find("a")
        if a is not None:
            feeds.append({"name": a.string.replace("RSS Feed", "").strip(),
                          "url": urllib.parse.parse_qs(a["href"])["q"][0].replace("site:", ""),
                          "etag": "", "status": 0, "debug_msg": "", "modified": ""})
    dfrss = pd.DataFrame(feeds).set_index("url")

if p.joinpath("rssdata.pickle").is_file():
    df = pd.read_pickle(p.joinpath("rssdata.pickle"))
else:
    df = pd.DataFrame({"id": [], "link": []})

# now source them all into a dataframe. head() is there for testing purposes
for r in dfrss.head(5).itertuples():
    try:
        fp = feedparser.parse(r.Index, etag=r.etag, modified=r.modified)
        if fp.bozo == 1: raise Exception(fp.bozo_exception)
    except Exception as e:
        fp = feedparser.FeedParserDict(**{"etag": r.etag, "entries": [], "status": 500, "debug_message": str(e)})
    # keep meta information of what has already been sourced from an RSS feed
    if "etag" in fp.keys(): dfrss.loc[r.Index, "etag"] = fp.etag
    dfrss.loc[r.Index, "status"] = fp.status
    if "debug_message" in fp.keys(): dfrss.loc[r.Index, "debug_msg"] = fp.debug_message
    # 304 means up to date... getting 301 and entries hence test len...
    if len(fp["entries"]) > 0:
        dft = pd.json_normalize(fp["entries"]).assign(Source=r.Index)
        # don't capture items that have already been captured...
        df = pd.concat([df, dft[~dft["link"].isin(df["link"])]])

# save to make re-entrant...
dfrss.to_pickle(p.joinpath("rss.pickle"))
df.to_pickle(p.joinpath("rssdata.pickle"))

# finally get the text...
if p.joinpath("text.pickle").is_file():
    dftext = pd.read_pickle(p.joinpath("text.pickle"))
else:
    dftext = pd.DataFrame({"link": [], "text": []})

# head() is there for testing purposes
for r in df[~df["link"].isin(dftext["link"])].head(5).itertuples():
    a = newspaper.Article(r.link)
    a.download()
    a.parse()
    # DataFrame.append() was removed in pandas 2.x; use pd.concat instead
    dftext = pd.concat([dftext, pd.DataFrame([{"link": r.link, "text": a.text}])], ignore_index=True)

dftext.to_pickle(p.joinpath("text.pickle"))

Then run your analysis on the retrieved data.

(screenshot of the resulting dataframe: //img3.sycdn.imooc.com/64b64de80001fe8211990717.jpg)
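Since the end goal is sentiment analysis, the hand-off from the text pickle to an API is a plain HTTP call. A sketch with a completely hypothetical endpoint and response shape (substitute your actual sentiment API's contract):

import requests
import pandas as pd

API_URL = "https://api.example.com/v1/sentiment"    # hypothetical endpoint

def score_text(text):
    # the JSON payload and "score" field are assumptions; adapt to your API
    resp = requests.post(API_URL, json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()["score"]

dftext = pd.read_pickle("news/text.pickle")
dftext["sentiment"] = dftext["text"].map(score_text)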
