
Pandas: split a string column into component columns based on a string list of start/end split points (overlapping)


森栏 2022-03-09 20:55:55
In my Pandas DataFrame of strings, one column holds a long string that I want to split into separate strings, each going into its own row of a new DataFrame. The second column is a label, and the same label should appear with every component produced from that row. The start/end split points should be determined by a set of strings: each component string begins where one of the strings from that set is encountered. The start string itself should go into its own column of that row and should not remain inside the split-off text.

Here is an example. I have this set of strings:

listStrings = {'\nIntroduction', '\nCase', '\nLiterature', '\nBackground', '\nRelated', '\nMethods', '\nMethod', '\nTechniques', '\nMethodology', '\nResults', '\nResult', '\nExperimental', '\nExperiments', '\nExperiment', '\nDiscussion', '\nLimitations', '\nConclusion', '\nConclusions', '\nConcluding',
               'Introduction\n', 'Case\n', 'Literature\n', 'Background\n', 'Related\n', 'Methods\n', 'Method\n', 'Techniques\n', 'Methodology\n', 'Results\n', 'Result\n', 'Experimental\n', 'Experiments\n', 'Experiment\n', 'Discussion\n', 'Limitations\n', 'Conclusion\n', 'Conclusions\n', 'Concluding\n',
               'INTRODUCTION', 'CASE', 'LITERATURE', 'BACKGROUND', 'RELATED', 'METHODS', 'METHOD', 'TECHNIQUES', 'METHODOLOGY', 'RESULTS', 'RESULT', 'EXPERIMENTAL', 'EXPERIMENTS', 'EXPERIMENT', 'DISCUSSION', 'LIMITATIONS', 'CONCLUSION', 'CONCLUSIONS', 'CONCLUDING',
               'Introduction:', 'Case:', 'Literature:', 'Background:', 'Related:', 'Methods:', 'Method:', 'Techniques:', 'Methodology:', 'Results:', 'Result:', 'Experimental:', 'Experiments:', 'Experiment:', 'Discussion:', 'Limitations:', 'Conclusion:', 'Conclusions:', 'Concluding:'}

Nothing from the string in column A should be saved until it reaches one of the strings in listStrings. Once it reaches one, place that listStrings entry in its own column of a row of the new DataFrame. Then place everything that follows that entry into a new row, until the segment reaches another listStrings entry. Then repeat the process: put that entry in its own column, create a new row for the new segment, and so on.
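For reference in the answer below, a minimal testdf of the shape described above might look like the following. The text and labels here are invented purely for illustration; the real data comes from the question's own corpus.

import pandas as pd

# hypothetical input frame: column A holds the raw text, column B the label
testdf = pd.DataFrame({
    'A': ['BACKGROUND\nSome background text. METHODS\nSome methods text.',
          'Preamble that is discarded. \nResults: some findings. \nConclusion: closing remarks.'],
    'B': ['Entry1', 'Entry2'],
})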

1 Answer

大话西游666


Here is one approach; I'm not sure how efficient it is on large datasets:


import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text on the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# stack the keywords
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# output dataframe
# note that we shift the chunks to match the keywords
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})

Output:


                D                                                  E
0 0    BACKGROUND  \nDiagnostic uncertainty in ALS has serious ma...
  1       METHODS  \nData from 75 ALS patients and 75 healthy con...
  2        RESULT  S\nFollowing predictor variable selection, a c...
  3    DISCUSSION  \nThis study evaluates disease-associated imag...
  4           NaN                                                NaN
1 0     \nResults  : The findings show ICT innovation was effecti...
  1  \nConclusion  : By evaluating the ICT innovation, empirical ...
  2           NaN                                                NaN
2 0    BACKGROUND   AND PURPOSE\nRotator cuff tears are associate...
  1       METHODS  \nSupraspinatus muscle biopsies were obtained ...
  2        RESULT  S\nDegenerative changes were present in both p...
  3           NaN                                                NaN
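A side note, not part of the original answer: none of the keywords in listStrings contain regex metacharacters, so the plain '|'.join works here, but escaping each entry keeps the pattern safe if the set ever changes:

import re

# defensive variant of the pattern above; assumes listStrings as defined in the question
pat = '|'.join(re.escape(s) for s in listStrings)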

Edit:


Here is a version of the solution that produces the exact output specified in the question:


import numpy as np
import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text on the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# flatten the keywords into one array
keys = np.concatenate(new_df.values)

# shift the chunks to match the keywords and drop the trailing NaNs
values = chunks.groupby(level=0).shift(-1).dropna().values

# repeat each row's label once per keyword found in that row
labels = np.concatenate([len(val) * [testdf['B'][ind]] for ind, val in enumerate(new_df.values)])

# output dataframe
pd.DataFrame({'C': keys, 'D': values, 'E': labels})

Output:


    C             D                                                    E
0   BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...    Entry1
1   METHODS       \nData from 75 ALS patients and 75 healthy con...    Entry1
2   RESULTS       \nFollowing predictor variable selection, a cl...    Entry1
3   DISCUSSION    \nThis study evaluates disease-associated imag...    Entry1
4   \nResult      s: The findings show ICT innovation was effect...    Entry2
5   \nConclusion  : By evaluating the ICT innovation, empirical ...    Entry2
6   BACKGROUND    AND PURPOSE\nRotator cuff tears are associate...     Entry3
7   METHODS       \nSupraspinatus muscle biopsies were obtained ...    Entry3
8   RESULTS       \nDegenerative changes were present in both pa...    Entry3
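As a further aside (a sketch, not part of the original answer): the shift/dropna bookkeeping can be avoided by splitting with a capturing group, because re.split keeps the captured keywords in its result. This assumes the same testdf, listStrings and pat as above:

import re
import pandas as pd

def split_with_keys(text, pattern):
    # re.split with a capturing group returns [preamble, key1, chunk1, key2, chunk2, ...]
    parts = re.split(f'({pattern})', text)
    it = iter(parts[1:])      # discard the preamble before the first keyword
    return list(zip(it, it))  # pair each keyword with the text that follows it

rows = [(key, chunk, row.B)
        for _, row in testdf.iterrows()
        for key, chunk in split_with_keys(row.A, pat)]
pd.DataFrame(rows, columns=['C', 'D', 'E'])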

