
Pandas: split a string column into component columns based on a string list of start/end split points (overlapping)


森栏 2022-03-09 20:55:55
In my Pandas DataFrame of strings, one column holds a long string that I want to split into separate strings, each going into its own row of a new DataFrame. The second column is a label, and the same label should appear with every component produced from that row. The start/end split points should be determined by a set of strings: each component string begins where one of the strings from that set is encountered. The start string itself should go into its own column of that row and should not remain inside the split-off text.

Here is an example. I have this set of strings:

listStrings = {'\nIntroduction', '\nCase', '\nLiterature', '\nBackground', '\nRelated', '\nMethods', '\nMethod', '\nTechniques', '\nMethodology', '\nResults', '\nResult', '\nExperimental', '\nExperiments', '\nExperiment', '\nDiscussion', '\nLimitations', '\nConclusion', '\nConclusions', '\nConcluding',
               'Introduction\n', 'Case\n', 'Literature\n', 'Background\n', 'Related\n', 'Methods\n', 'Method\n', 'Techniques\n', 'Methodology\n', 'Results\n', 'Result\n', 'Experimental\n', 'Experiments\n', 'Experiment\n', 'Discussion\n', 'Limitations\n', 'Conclusion\n', 'Conclusions\n', 'Concluding\n',
               'INTRODUCTION', 'CASE', 'LITERATURE', 'BACKGROUND', 'RELATED', 'METHODS', 'METHOD', 'TECHNIQUES', 'METHODOLOGY', 'RESULTS', 'RESULT', 'EXPERIMENTAL', 'EXPERIMENTS', 'EXPERIMENT', 'DISCUSSION', 'LIMITATIONS', 'CONCLUSION', 'CONCLUSIONS', 'CONCLUDING',
               'Introduction:', 'Case:', 'Literature:', 'Background:', 'Related:', 'Methods:', 'Method:', 'Techniques:', 'Methodology:', 'Results:', 'Result:', 'Experimental:', 'Experiments:', 'Experiment:', 'Discussion:', 'Limitations:', 'Conclusion:', 'Conclusions:', 'Concluding:'}

Nothing from the string in column A should be saved until it reaches one of the strings in listStrings. Once it reaches one, place that listStrings entry in its own column of a row of the new DataFrame. Then place everything that follows that entry into a new row, until the segment reaches another listStrings entry. Then repeat the process: put that entry in its own column, create a new row for the new segment, and so on.
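For reference in the answer below, a minimal testdf of the shape described above might look like the following. The text and labels here are invented purely for illustration; the real data comes from the question's own corpus.

import pandas as pd

# hypothetical input frame: column A holds the raw text, column B the label
testdf = pd.DataFrame({
    'A': ['BACKGROUND\nSome background text. METHODS\nSome methods text.',
          'Preamble that is discarded. \nResults: some findings. \nConclusion: closing remarks.'],
    'B': ['Entry1', 'Entry2'],
})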

1 Answer

大话西游666


Here is one approach; I'm not sure how efficient it is on large datasets:


import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text on the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# stack the keywords
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# output dataframe
# note that we shift the chunks to match the keywords
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})

Output:


                D                                                  E
0 0    BACKGROUND  \nDiagnostic uncertainty in ALS has serious ma...
  1       METHODS  \nData from 75 ALS patients and 75 healthy con...
  2        RESULT  S\nFollowing predictor variable selection, a c...
  3    DISCUSSION  \nThis study evaluates disease-associated imag...
  4           NaN                                                NaN
1 0     \nResults  : The findings show ICT innovation was effecti...
  1  \nConclusion  : By evaluating the ICT innovation, empirical ...
  2           NaN                                                NaN
2 0    BACKGROUND   AND PURPOSE\nRotator cuff tears are associate...
  1       METHODS  \nSupraspinatus muscle biopsies were obtained ...
  2        RESULT  S\nDegenerative changes were present in both p...
  3           NaN                                                NaN
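A side note, not part of the original answer: none of the keywords in listStrings contain regex metacharacters, so the plain '|'.join works here, but escaping each entry keeps the pattern safe if the set ever changes:

import re

# defensive variant of the pattern above; assumes listStrings as defined in the question
pat = '|'.join(re.escape(s) for s in listStrings)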

Edit:


Here is a version of the solution that produces the exact output specified in the question:


import numpy as np
import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text on the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# flatten the keywords into one array
keys = np.concatenate(new_df.values)

# shift the chunks to match the keywords and drop the trailing NaNs
values = chunks.groupby(level=0).shift(-1).dropna().values

# repeat each row's label once per keyword found in that row
labels = np.concatenate([len(val) * [testdf['B'][ind]] for ind, val in enumerate(new_df.values)])

# output dataframe
pd.DataFrame({'C': keys, 'D': values, 'E': labels})

Output:


    C             D                                                    E
0   BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...    Entry1
1   METHODS       \nData from 75 ALS patients and 75 healthy con...    Entry1
2   RESULTS       \nFollowing predictor variable selection, a cl...    Entry1
3   DISCUSSION    \nThis study evaluates disease-associated imag...    Entry1
4   \nResult      s: The findings show ICT innovation was effect...    Entry2
5   \nConclusion  : By evaluating the ICT innovation, empirical ...    Entry2
6   BACKGROUND    AND PURPOSE\nRotator cuff tears are associate...     Entry3
7   METHODS       \nSupraspinatus muscle biopsies were obtained ...    Entry3
8   RESULTS       \nDegenerative changes were present in both pa...    Entry3
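As a further aside (a sketch, not part of the original answer): the shift/dropna bookkeeping can be avoided by splitting with a capturing group, because re.split keeps the captured keywords in its result. This assumes the same testdf, listStrings and pat as above:

import re
import pandas as pd

def split_with_keys(text, pattern):
    # re.split with a capturing group returns [preamble, key1, chunk1, key2, chunk2, ...]
    parts = re.split(f'({pattern})', text)
    it = iter(parts[1:])      # discard the preamble before the first keyword
    return list(zip(it, it))  # pair each keyword with the text that follows it

rows = [(key, chunk, row.B)
        for _, row in testdf.iterrows()
        for key, chunk in split_with_keys(row.A, pat)]
pd.DataFrame(rows, columns=['C', 'D', 'E'])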

