1 Answer
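(Moving my comment to an answer.)
You are trying to process the file object rather than the text in the file. After creating the text file, reopen it and read the entire file before tokenizing.
Try this code: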
import os
from nltk.tokenize import sent_tokenize, word_tokenize

path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"

# Concatenate the text of every file in the folder into result.txt
outfile = open('result.txt', 'w')
files = os.listdir(path)
for file in files:
    with open(os.path.join(path, file)) as f:
        outfile.write(f.read() + '\n')
        #outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')
outfile.close()  # done writing

with open('result.txt') as outfile:  # open for read
    alltext = outfile.read()  # read entire file into a string
    print(alltext)

sent_tokens = sent_tokenize(alltext)  # tokenize the file text into sentences
word_tokens = word_tokenize(alltext)  # tokenize the file text into words
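
To make the file-object-versus-text distinction concrete, here is a minimal, self-contained sketch of the same workflow that runs without NLTK or the original folder. It is only an illustration under stated assumptions: the helper names `combine_text_files` and `simple_word_tokens`, the sample files `a.txt` and `b.txt`, and the UTF-8 encoding are all hypothetical, and the regex tokenizer is a crude stand-in for `word_tokenize`, not a replacement for it.

```python
import re
import tempfile
from pathlib import Path


def combine_text_files(folder, outfile):
    """Write the text of every regular file in `folder` to `outfile` (hypothetical helper)."""
    with open(outfile, "w", encoding="utf-8") as out:
        for p in sorted(Path(folder).iterdir()):
            if p.is_file():
                out.write(p.read_text(encoding="utf-8") + "\n")


def simple_word_tokens(text):
    """Crude stand-in for nltk's word_tokenize: splits into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


with tempfile.TemporaryDirectory() as tmp:
    # Sample input files (made up for this sketch)
    docs = Path(tmp) / "docs"
    docs.mkdir()
    (docs / "a.txt").write_text("Hello world.", encoding="utf-8")
    (docs / "b.txt").write_text("Second file here.", encoding="utf-8")

    # Keep the combined file outside the input folder so it is not read back in
    result = Path(tmp) / "result.txt"
    combine_text_files(docs, result)

    with open(result, encoding="utf-8") as f:
        handle_is_text = isinstance(f, str)  # False: f is a file object, not text
        alltext = f.read()                   # the string a tokenizer actually needs

tokens = simple_word_tokens(alltext)
print(tokens)
```

A tokenizer expects a string, so the value to pass is `alltext` (the result of `f.read()`), never `f` itself. With NLTK installed, `simple_word_tokens(alltext)` could be swapped for `word_tokenize(alltext)`.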