How to parse a text file with pandas and create a list

繁星coding 2023-10-06 19:35:25
I'm trying to use pandas to build a list/array containing every word in the "review/text" field of the following text file:

product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score:5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.

product/productId: B00813GRG4
review/userId: A1D87F6ZCVE5NK
review/profileName: dll pa
review/helpfulness: 0/0
review/score: 1.0
review/time: 1346976000
review/summary: Not as Advertised
review/text: Product arrived labeled as Jumbo Salted Peanuts...

(The text file foods.txt is available at: http://snap.stanford.edu/data/web-FineFoods.html)

My end goal is to identify all the unique words that appear in the review/text field. I wrote the following code:

    import pandas as pd

    f=open("foods.txt","r")
    df=pd.read_csv(f,names=['product/productId','review/userId','review/profileName','review/helpfulness','review/score','review/time','review/summary'])
    selected = df[ df['review/summary'] ]
    print(selected)
    selected.to_csv('result.csv', sep=' ', header=False)

However, I get the following error:

ValueError: cannot index with vector containing NA / NaN values

Any suggestions or comments?

3 Answers

动漫人物

Contributed 1815 experience points, received more than 10 likes

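I think you need to do the following to pull every record out of the file and collect the review/summary values. You don't need a DataFrame for this.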


# create a dictionary to store the list of review summary values
d = {'review summary':[]}

# function to extract only the review summary from a full record
def split_review_summary(full_line):
    # find review/text and exclude it from the line
    found = full_line.find('review/text:')
    if found >= 0:
        full_line = full_line[:found]

    # find review/summary; all text to the right of it is the review summary
    # add this to the dictionary
    found = full_line.find('review/summary:')
    if found >= 0:
        review_summary = full_line[(found + 15):]
        d['review summary'].append(review_summary)

# open the file for reading
with open ("xyz.txt","r") as f:
    # read the first line
    new_line = f.readline().rstrip('\n')
    # loop through the rest of the lines
    for line in f:
        # remove the newline from the data
        line = line.rstrip('\n')

        # if the line starts with product/productId, then it's a new entry:
        # process the previous record and strip out the review summary
        # by calling split_review_summary
        if line[:17] == 'product/productId':
            split_review_summary(new_line)
            # reset new_line to the current line
            new_line = line
        else:
            # append to new_line, as it's part of the previous record
            new_line += line

# the last full record has not been processed yet,
# so send it to split_review_summary to extract the review summary
split_review_summary(new_line)

# now dictionary d has all the review summary items
print (d)

The output will be:


{'review summary': [' Good Quality Dog Food ', ' Not as Advertised ']}

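I think the scope of your question also includes writing to a new file.

You can open a file and write the dictionary to it as a single line, which will contain all the details. I'll leave that part for you to work out.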
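One possible way to handle the file-writing step left open above is to serialize the dictionary as a single JSON line. This is only a sketch: the helper `write_dict_as_line`, the file name `summaries.jsonl`, and the choice of JSON are assumptions, not something the answer specifies.

```python
import json
import os
import tempfile

# Same shape as the dictionary `d` built above (values copied from the output shown).
d = {'review summary': [' Good Quality Dog Food ', ' Not as Advertised ']}

def write_dict_as_line(data, path):
    """Append `data` to `path` as one JSON-encoded line (hypothetical helper)."""
    with open(path, 'a', encoding='utf-8') as out:
        out.write(json.dumps(data) + '\n')

# Assumed output location; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), 'summaries.jsonl')
write_dict_as_line(d, out_path)

# Read the line back to confirm the round trip preserves the dictionary.
with open(out_path, encoding='utf-8') as check:
    restored = json.loads(check.readline())
```

JSON is used here because a single line can be read back into the same dictionary with `json.loads`; a plain `str(d)` would also fit on one line but is harder to parse again.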


Answered 2023-10-06
30秒到达战场

Contributed 1828 experience points, received more than 6 likes

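CSV stands for comma-separated values. I don't see any commas separating the fields in your file.

It looks more like a broken dictionary (one that is missing the comma between entries):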


my_dict ={
 'productid': 12312312,
 'some_key': 'I am the key!',
}
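To illustrate the point about commas, the sketch below feeds one line of the sample data to Python's standard `csv` module. With no comma in the line, the whole line comes back as a single field, which is presumably why the later columns in the question's DataFrame end up empty. Splitting on the first colon with `str.partition` recovers the key and the value instead. The sample line is taken from the question; the variable names are illustrative.

```python
import csv
import io

# One line from the sample data in the question.
line = "review/summary: Good Quality Dog Food"

# A CSV reader splits on commas; this line has none, so it yields one field.
rows = list(csv.reader(io.StringIO(line + "\n")))

# Splitting on the first ":" treats the line as a key/value pair instead.
key, sep, value = line.partition(":")
value = value.strip()
```

Here `rows` holds a single row with a single field (the entire line), while `key` and `value` hold the two halves of the pair.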


Answered 2023-10-06
白猪掌柜的

Contributed 1893 experience points, received more than 10 likes

I looked at the link S.Ghoshal provided and came up with the following:


# Open the file; the with block closes it once reading is done
with open('foods.txt') as your_file:
    # Read every line
    reviews = your_file.readlines()

reviews_array = []
dictionary = {}

# Go through every line; a blank line marks the end of a review
for review in reviews:
    # Split on the first ":" only, so colons inside the review text are kept
    this_line = review.split(":", 1)
    if len(this_line) > 1:
        # The part before the first ":" is the dictionary key; the rest is the content
        dictionary[this_line[0]] = this_line[1].strip()
    elif dictionary:
        # A blank line was found: save the finished review and reset
        # the dictionary for the next review
        reviews_array.append(dictionary)
        dictionary = {}

# Append the last review if the file did not end with a blank line
if dictionary:
    reviews_array.append(dictionary)

with open("output.txt", "a") as f1:
    for r in reviews_array:
        print(r['review/text'], file=f1)

Now all the words from the lines that start with review/text are dumped to a file. The next step is to create a list containing all the unique words.
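A minimal sketch of that last step, assuming a "word" means a lowercase run of letters and apostrophes (the question does not define it). The function name `unique_words` and the sample lines are illustrative; in practice the lines would come from the review texts written out above.

```python
import re

def unique_words(lines):
    """Return the sorted list of distinct lowercase words found in `lines`."""
    words = set()
    for line in lines:
        # Assumption: a word is a run of letters, optionally containing apostrophes.
        words.update(re.findall(r"[a-z']+", line.lower()))
    return sorted(words)

# Sample review fragments taken from the question.
sample_lines = [
    "Product arrived labeled as Jumbo Salted Peanuts",
    "My Labrador is finicky and she appreciates this product",
]
vocabulary = unique_words(sample_lines)
```

A set is used so that each word is stored once no matter how often it appears; lowercasing first means "Product" and "product" count as the same word.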


Answered 2023-10-06