Here is one approach; I'm not sure how efficient it is on large datasets:
import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)
# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1    [\nResults, \nConclusion]
# 2    [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object
# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()
# stack the keywords (this join/split round trip assumes no keyword contains a space):
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()
# our returned dataframe
# note that we shift the chunks to match the keywords
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})
Output:
      D             E
0  0  BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...
   1  METHODS       \nData from 75 ALS patients and 75 healthy con...
   2  RESULT        S\nFollowing predictor variable selection, a c...
   3  DISCUSSION    \nThis study evaluates disease-associated imag...
   4  NaN           NaN
1  0  \nResults     : The findings show ICT innovation was effecti...
   1  \nConclusion  : By evaluating the ICT innovation, empirical ...
   2  NaN           NaN
2  0  BACKGROUND     AND PURPOSE\nRotator cuff tears are associate...
   1  METHODS       \nSupraspinatus muscle biopsies were obtained ...
   2  RESULT        S\nDegenerative changes were present in both p...
   3  NaN           NaN
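To see why the chunks have to be shifted, and why RESULT is matched with a stray "S" left in the text, here is a minimal sketch using only the standard `re` module on a single string. The keyword list and text are made up for illustration, and ordering the keywords longest-first is my own assumption about how to avoid the prefix match, not something the code above does.

```python
import re

# Hypothetical keywords and text, for illustration only.
keywords = ["METHODS", "RESULT", "RESULTS"]
text = "METHODS\nfoo bar. RESULTS\nbaz."

# Longest keywords first, so RESULTS is tried before its prefix RESULT;
# re.escape guards against regex metacharacters inside a keyword.
ordered = sorted(keywords, key=len, reverse=True)
pat = "|".join(re.escape(k) for k in ordered)

found = re.findall(pat, text)   # keywords in order of appearance
chunks = re.split(pat, text)    # always one more item than `found`

# chunks[0] is the text before the first keyword (empty here); dropping it
# pairs each keyword with the text that follows it.
pairs = list(zip(found, chunks[1:]))
print(pairs)
```

Dropping `chunks[0]` is the same pairing that `shift(-1)` performs on the stacked chunks, and the extra chunk slot is where the trailing `NaN` rows in the output come from. With the keywords in their original order, the alternation would stop at `RESULT` and leave the `S` at the start of the chunk.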
Edit:
Here is a version of the solution that gives the exact output specified in the question:
import numpy as np
import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)
# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1    [\nResults, \nConclusion]
# 2    [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object
# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()
# flatten the keywords into one array
keys = np.concatenate(new_df.values)
# shift the chunks to match the keywords, then drop the leftover NaN slots
values = chunks.groupby(level=0).shift(-1).dropna().values
# repeat each row's label from column B once per keyword found in that row
labels = np.concatenate([len(val) * [testdf['B'].iloc[ind]] for ind, val in enumerate(new_df.values)])
# our returned dataframe
pd.DataFrame({'C': keys, 'D': values, 'E': labels})
Output:
   C             D                                                  E
0  BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...  Entry1
1  METHODS       \nData from 75 ALS patients and 75 healthy con...  Entry1
2  RESULTS       \nFollowing predictor variable selection, a cl...  Entry1
3  DISCUSSION    \nThis study evaluates disease-associated imag...  Entry1
4  \nResult      s: The findings show ICT innovation was effect...  Entry2
5  \nConclusion  : By evaluating the ICT innovation, empirical ...  Entry2
6  BACKGROUND     AND PURPOSE\nRotator cuff tears are associate...  Entry3
7  METHODS       \nSupraspinatus muscle biopsies were obtained ...  Entry3
8  RESULTS       \nDegenerative changes were present in both pa...  Entry3
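For anyone who wants something self-contained to experiment with, below is a rough sketch of the same three-column result built with a plain Python loop instead of the `.str` accessor and `pd.concat`. This is a different formulation from the code above, not a drop-in equivalent. The small `testdf` and `listStrings` here are hypothetical stand-ins for the ones in the question, and the `re.escape` call is an added precaution in case a keyword contains regex metacharacters.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the question's `testdf` and `listStrings`.
testdf = pd.DataFrame({
    "A": ["BACKGROUND\nfoo METHODS\nbar", "\nResults: baz \nConclusion: qux"],
    "B": ["Entry1", "Entry2"],
})
listStrings = ["BACKGROUND", "METHODS", "\nResults", "\nConclusion"]

# One alternation pattern over all (escaped) keywords.
pat = "|".join(map(re.escape, listStrings))

rows = []
for text, label in zip(testdf["A"], testdf["B"]):
    found = re.findall(pat, text)      # keywords present in this row
    chunks = re.split(pat, text)[1:]   # text after each keyword
    # One output row per (keyword, following text) pair, tagged with the label.
    rows.extend({"C": k, "D": c, "E": label} for k, c in zip(found, chunks))

out = pd.DataFrame(rows, columns=["C", "D", "E"])
print(out)
```

Because each row is handled on its own, there is no need to shift or drop `NaN` values afterwards. I have not measured whether this is faster or slower than the version above on a large frame.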