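What code splits a sentence into a list of its constituent words and punctuation marks? Most text preprocessing programs tend to strip punctuation. For example, if I input: "Punctuations to be included as its own unit."
the expected output is:
result = ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
Thanks a lot!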
2 Answers
慕村9548890
Contributed 1884 experience points, received 4+ likes
You may want to consider using the Natural Language Toolkit, or nltk.
Try this:
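import nltk

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, download it once
# with nltk.download('punkt') (recent NLTK releases ask for 'punkt_tab' instead).
sentence = "Punctuations to be included as its own unit."
tokens = nltk.word_tokenize(sentence)
print(tokens)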
Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
郎朗坤
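Contributed 1921 experience points, received 9+ likes
The snippet below uses a regular expression to split the text into a list of words and punctuation marks.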
import re
import string

# Escape the punctuation so that characters such as ], \ and ^ are literal inside the class.
# The raw string keeps \w from being read as an invalid string escape.
pattern = r"\w+|[" + re.escape(string.punctuation) + "]"
content = "Punctuations to be included as its own unit."
words_and_punctuation = re.findall(pattern, content)
print(words_and_punctuation)
Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
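If the text may contain punctuation outside ASCII (curly quotes, for instance), a variation on the same idea might help. This is only a minimal sketch, not part of the answer above: the helper name split_words_and_punctuation is made up for illustration, and the pattern treats any single character that is neither a word character nor whitespace as its own token.

```python
import re

# Hypothetical helper, named here for illustration only.
# \w+      matches a run of word characters (letters, digits, underscore).
# [^\w\s]  matches one character that is neither a word character nor
#          whitespace, i.e. a single punctuation mark or symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def split_words_and_punctuation(text):
    """Return the words and punctuation marks of text as a flat list."""
    return TOKEN_PATTERN.findall(text)


print(split_words_and_punctuation("Punctuations to be included as its own unit."))
print(split_words_and_punctuation("Wait... is “this” fine?"))
```

Note that each punctuation character becomes a separate token, so "..." comes back as three '.' entries, and an underscore counts as part of a word because \w matches it.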