
How to check for, and extract, years and percentages in a CSV with Python

明月笑刀无情 2022-09-13 09:56:08
I have a CSV file, news.csv, that contains a lot of data. I want to check whether each row contains any year: 1 if it does, 0 otherwise. The same goes for percentages: 1 if the row contains a percentage, 0 otherwise. I also want to extract the values. Below is my code so far. I get an error (ValueError: Wrong number of items passed 2, placement implies 1) when I try to extract the percentage.

news=pd.read_csv("news.csv")
news['year']= news['STORY'].str.extract(r'(?!\()\b(\d+){1}')
news["howmanyyear"] = news["STORY"].str.count(r'(?!\()\b(\d+){1}')
news["existyear"] = news["howmany"] != 0
news["existyear"] = news["existyear"].astype(int)
news['percentage']= news['STORY'].str.extract(r'(\s100|\s\d{1})(\.\d+)+%')
news.to_csv('news.csv')

The code that extracts the year seems to work, but it also picks up ordinary numbers, and it extracts only one year per row.

A sample of my CSV file:

ID  STORY
1   There are a total of 2,070 people died in 2001 due to the virus
2   20% of people in the village have diabetes in 2007
3   About 70 percent of them still believe the rumor
4   In 2003 and 2020, the pneumonia pandemic spread in the world

This is the output I want:

ID  STORY                                                            existyear  year       existpercentage  percentage
1   There are a total of 2,070 people died in 2001 due to the virus  1          2001       0                -
2   20% of people in the village have diabetes in 2007               1          2007       1                20%
3   About 70 percent of them still believe the rumor                 0          -          1                70
4   In 2003 and 2020, the pneumonia pandemic spread in the world     1          2003,2020  0                -

1 Answer

MYYA


Creating a sample DataFrame:


import pandas as pd

c = [1, 2, 3, 4]
d = ["There are a total of 2,070 people died in 2001 due to the virus",
     "20% of people in the village have diabetes in 2007",
     "About 70 percent of them still believe the rumor",
     "In 2003 and 2020, the pneumonia pandemic spread in the world"]
f = ['2001', '2007', '-', '2003,2020']
g = ['-', '20%', '70', '-']

# One list per column; the key is the column name (no trailing space in 'ID').
df = pd.DataFrame({'ID': c, 'STORY': d, 'year': f, 'percentage': g})

df:


ID  STORY                                                            year       percentage
1   There are a total of 2,070 people died in 2001 due to the virus  2001       -
2   20% of people in the village have diabetes in 2007               2007       20%
3   About 70 percent of them still believe the rumor                 -          70
4   In 2003 and 2020, the pneumonia pandemic spread in the world     2003,2020  -

Code:


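import re

def year_exists_or_not(value):
    # 1 if the value contains a four-digit number starting with 1-3, else 0.
    if re.search(r'[1-3][0-9]{3}', value):
        return 1
    else:
        return 0

def perc_or_not(value):
    # 1 if the value contains at least one digit, else 0.
    if re.search(r'\d+', value):
        return 1
    else:
        return 0

# Both helpers run on the already-filled 'year' and 'percentage' columns,
# where a missing value is stored as '-'.
df['existyear'] = df['year'].apply(year_exists_or_not)
df['existpercentage'] = df['percentage'].apply(perc_or_not)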

df:


ID  STORY                                                            existyear  year       existpercentage  percentage
1   There are a total of 2,070 people died in 2001 due to the virus  1          2001       0                -
2   20% of people in the village have diabetes in 2007               1          2007       1                20%
3   About 70 percent of them still believe the rumor                 0          -          1                70
4   In 2003 and 2020, the pneumonia pandemic spread in the world     1          2003,2020  0                -
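
The helpers above test columns that already hold the extracted values. As a rough sketch of how the two flags could instead be computed straight from the STORY text, the block below uses `Series.str.contains`. The names `YEAR_RE` and `PCT_RE` and both patterns are assumptions made for illustration, not part of the answer: a "year" is taken to be a standalone four-digit number from 1900 to 2099, so a plain count such as "2000 people" would still be flagged.

```python
import pandas as pd

stories = pd.Series([
    "There are a total of 2,070 people died in 2001 due to the virus",
    "20% of people in the village have diabetes in 2007",
    "About 70 percent of them still believe the rumor",
    "In 2003 and 2020, the pneumonia pandemic spread in the world",
])

# Assumed patterns (hypothetical, chosen for this sample data):
# a standalone four-digit number between 1900 and 2099 ...
YEAR_RE = r"\b(?:19|20)\d{2}\b"
# ... and a number followed by "%" or the word "percent".
PCT_RE = r"\d+(?:\.\d+)?\s*(?:%|percent)"

# str.contains yields booleans; astype(int) turns them into 1/0 flags.
exist_year = stories.str.contains(YEAR_RE, regex=True).astype(int)
exist_pct = stories.str.contains(PCT_RE, regex=True).astype(int)

print(exist_year.tolist())
print(exist_pct.tolist())
```

The word boundaries keep "2,070" and "20%" from being read as years, because neither contains four consecutive digits starting with 19 or 20.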

Edit:


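# Collect every four-digit year in STORY. str(list)[1:-1] strips the list
# brackets, so the quotes stay in the result, e.g. "'2003', '2020'".
df['year'] = df['STORY'].apply(lambda row: str(re.findall(r'.*?([1-3][0-9]{3})', row))[1:-1])

# Capture the number that comes right before "%" or " percent".
df['percentage'] = df['STORY'].apply(lambda row: str(re.findall(r"(\d+)(?:%| percent)", row))[1:-1])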

df:


    ID  STORY                                               year            percentage
0   1   There are a total of 2,070 people died in 2001...   '2001'
1   2   20% of people in the village have diabetes in ...   '2007'          '20'
2   3   About 70 percent of them still believe the rumor                    '70'
3   4   In 2003 and 2020, the pneumonia pandemic sprea...   '2003', '2020'
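
The ValueError in the question most likely comes from the percentage pattern having two capture groups: `Series.str.extract` returns one column per group, so its two-column result cannot be assigned to the single column `news['percentage']`. A minimal sketch that sidesteps this and produces all four columns in one pass is shown below, using `Series.str.findall`. The patterns and the names `YEAR_RE` and `PCT_RE` are assumptions for illustration (years limited to 1900-2099), not the method used in the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "STORY": [
        "There are a total of 2,070 people died in 2001 due to the virus",
        "20% of people in the village have diabetes in 2007",
        "About 70 percent of them still believe the rumor",
        "In 2003 and 2020, the pneumonia pandemic spread in the world",
    ],
})

# Assumed patterns (hypothetical): no capture group for years, and exactly
# one capture group for percentages so findall returns plain strings.
YEAR_RE = r"\b(?:19|20)\d{2}\b"
PCT_RE = r"(\d+(?:\.\d+)?)\s*(?:%|percent)"

# findall gives one list of matches per row (empty list when nothing matches).
years = df["STORY"].str.findall(YEAR_RE)
percents = df["STORY"].str.findall(PCT_RE)

# A non-empty list means the row contains at least one match.
df["existyear"] = years.str.len().gt(0).astype(int)
df["existpercentage"] = percents.str.len().gt(0).astype(int)

# Join multiple matches with commas; rows without a match become "-".
df["year"] = years.str.join(",").replace("", "-")
df["percentage"] = percents.str.join(",").replace("", "-")

print(df[["ID", "existyear", "year", "existpercentage", "percentage"]])
```

Joining the lists avoids the quote characters that `str(list)[1:-1]` leaves behind, and the flags are derived from the same match lists, so they cannot disagree with the extracted values.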


Answered 2022-09-13