1 Answer
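I would not suggest using regular expressions here. A regex is definitely not as intuitive as splitting each line on whitespace, iterating over the tokens, rearranging the list where needed, and finally joining it back together. You could try something like this: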
with open('corpus.txt', 'r') as corpus, \
        open('reordered_corpus.txt', 'w') as reordered_corpus:
    for phrase in corpus:
        phrase = phrase.split()  # split on whitespace
        vb_index = rp_index = -1  # indices of the verb and the particle
        for i, word_pos in enumerate(phrase):
            pos = word_pos.split('_')[1]  # POS tag is at index 1 after splitting on _
            if pos == 'VB' or pos == 'VBZ':  # can add more verb POS tags
                vb_index = i
            elif vb_index >= 0 and pos == 'RP':  # or more particle POS tags
                rp_index = i
                break  # found both, so can stop
        if vb_index >= 0 and rp_index >= 0:  # move the particle right after the verb
            phrase = phrase[:vb_index+1] + [phrase[rp_index]] + \
                     phrase[vb_index+1:rp_index] + phrase[rp_index+1:]
        reordered_corpus.write(' '.join(phrase) + '\n')
With this code, if corpus.txt reads,
you_PRP mean_VBP we_PRP should_MD kick_VB them_PRP out_RP ._.
don_VB 't_NNP take_VB it_PRP off_RP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB the_DT man_NN out_RP ._.
shut_VBZ it_PRP down_RP !_.
then after running it, reordered_corpus.txt will be,
you_PRP mean_VBP we_PRP should_MD kick_VB out_RP them_PRP ._.
don_VB 't_NNP take_VB off_RP it_PRP until_IN I_PRP say_VBP so_RB ._.
please_VB help_VB out_RP the_DT man_NN ._.
shut_VBZ down_RP it_PRP !_.
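
If you want to try the reordering on single lines without touching any files, the same per-line logic can be sketched as a small function. This is only a sketch under a few assumptions: the name `move_particle` and the `VERB_TAGS` / `PARTICLE_TAGS` sets are made up for illustration, and tokens without an underscore are simply skipped rather than raising an `IndexError`.

```python
# Hypothetical tag sets; extend them (e.g. with 'VBP', 'VBD') as your corpus needs.
VERB_TAGS = {'VB', 'VBZ'}
PARTICLE_TAGS = {'RP'}


def move_particle(line, verb_tags=VERB_TAGS, particle_tags=PARTICLE_TAGS):
    """Move the first particle that follows a verb to directly after that verb."""
    tokens = line.split()  # split on whitespace
    vb_index = rp_index = -1
    for i, word_pos in enumerate(tokens):
        # rpartition splits on the last underscore; sep is '' if there is none
        _, sep, pos = word_pos.rpartition('_')
        if not sep:
            continue  # untagged token, ignore it
        if pos in verb_tags:
            vb_index = i
        elif vb_index >= 0 and pos in particle_tags:
            rp_index = i
            break  # found both, so can stop
    if rp_index >= 0:  # implies a verb was seen before the particle
        tokens.insert(vb_index + 1, tokens.pop(rp_index))
    return ' '.join(tokens)


print(move_particle("shut_VBZ it_PRP down_RP !_."))
# shut_VBZ down_RP it_PRP !_.
```

Keeping the tags in sets makes it easy to add more verb or particle tags later without touching the loop.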