
pd.Series: mean "score" per row, based on words mapped through another Series of scores

Python

撒科打诨 2021-11-16 14:38:01

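I have a (very large) Series of keywords, where each row holds several keywords separated by '-':

In [5]: word_series
Out[5]:
0    the-cat-is-pink
1           blue-sea
2      best-job-ever
dtype: object

I have another Series that holds a score for each word (the words are the index, the scores are the values):

In [7]: all_scores
Out[7]:
the     0.34
cat     0.56
best    0.01
ever    0.77
is      0.12
pink    0.34
job     0.01
sea     0.87
blue    0.65
dtype: float64

Every word in word_series appears in all_scores. I am looking for the fastest way to assign a score to each row of word_series, defined as the mean of the all_scores values of its words. If a row is n/a, its score should be the mean of all the scores.

I tried apply like this, but it is too slow:

scores = word_series.apply(
    lambda x: all_scores[x.split('-')].mean()).fillna(
    all_scores.mean())

Then I thought I could use str.split to break all_words into columns, and perhaps do a matrix-multiplication-style operation between this new matrix M and my scores, M.mul(all_scores), where each entry in M is matched to the all_scores value with the same index label. That would be the first step. To get the mean, I could then divide by the number of non-NA values in each row.

In [9]: all_words.str.split('-', expand=True)
Out[9]:
      0    1     2     3
0   the  cat    is  pink
1  blue  sea  None  None
2  best  job  ever  None

Is such an operation feasible? Or is there another fast way to achieve this?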


2 Answers

繁星淼淼


Working with string data in pandas is slow, so use a list comprehension that looks up each word in the scores Series and takes the mean:

import pandas as pd
from statistics import mean

# For each row, look up the score of every word and average them.
L = [mean(all_scores.get(y) for y in x.split('-')) for x in word_series]
a = pd.Series(L, index=word_series.index)
print(a)

0    0.340000
1    0.760000
2    0.263333
dtype: float64
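For anyone who wants to run this end to end, here is a minimal self-contained sketch using the sample data from the question. Converting the scores to a plain dict first is my own assumption, not part of the answer: dict lookups are usually cheaper than calling Series.get once per word, but this was not benchmarked here. The names score_of, row_means and scores are illustrative.

```python
import pandas as pd
from statistics import mean

# Sample data copied from the question.
word_series = pd.Series(["the-cat-is-pink", "blue-sea", "best-job-ever"])
all_scores = pd.Series(
    {"the": 0.34, "cat": 0.56, "best": 0.01, "ever": 0.77, "is": 0.12,
     "pink": 0.34, "job": 0.01, "sea": 0.87, "blue": 0.65}
)

# Assumption: a dict lookup per word is cheaper than Series.get per word.
score_of = all_scores.to_dict()

# Average the scores of the words in each row, in plain Python.
row_means = [mean(score_of[w] for w in row.split("-")) for row in word_series]
scores = pd.Series(row_means, index=word_series.index)
print(scores)
```

This relies on every word being present in all_scores, as the question states; a missing word would raise KeyError.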

Or, with a hand-written mean:

def mean(a):
    return sum(a) / len(a)

L = [mean([all_scores.get(y) for y in x.split('-')]) for x in word_series]
a = pd.Series(L, index=word_series.index)

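If some words might have no match, pass np.nan as the default argument to get and use numpy.nanmean:

import numpy as np

# Unmatched words become NaN, which nanmean ignores.
L = [np.nanmean([all_scores.get(y, np.nan) for y in x.split('-')]) for x in word_series]
a = pd.Series(L, index=word_series.index)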
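The question also asks that a row which is n/a should get the mean of all scores. A missing row is not a string, so x.split would fail on it. The following is a minimal sketch of one way to cover that case. The sample data is hypothetical (row 1 is missing and "dog" has no score), and the helper row_score is my own name. Giving a row with no matching words the overall mean is an assumption; the question only specifies this fallback for n/a rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 is missing, and "dog" has no score.
word_series = pd.Series(["the-cat", np.nan, "blue-dog"])
all_scores = pd.Series({"the": 0.34, "cat": 0.56, "blue": 0.65})

overall = all_scores.mean()      # fallback value for rows with nothing to average
lookup = all_scores.to_dict()    # plain dict for per-word lookups

def row_score(row):
    # Missing rows (NaN/None) are not strings, so they cannot be split.
    if not isinstance(row, str):
        return overall
    # Keep only the words that actually have a score.
    found = [lookup[w] for w in row.split("-") if w in lookup]
    # Assumption: a row with no known words also falls back to the overall mean.
    return sum(found) / len(found) if found else overall

scores = pd.Series([row_score(r) for r in word_series], index=word_series.index)
print(scores)
```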

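Or, skipping unmatched words before averaging:

def mean(a):
    # An empty list means no word in the row had a score.
    return sum(a) / len(a) if a else np.nan

L = [mean([all_scores.get(y, np.nan) for y in x.split('-') if y in all_scores.index])
     for x in word_series]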
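The question asked whether a split-then-combine operation inside pandas is feasible. One possible sketch, which is not from the answer above, uses Series.str.split, Series.explode, Series.map and a groupby on the original row labels. Whether it beats the list comprehension on a given dataset is something to measure rather than assume.

```python
import pandas as pd

# Sample data copied from the question.
word_series = pd.Series(["the-cat-is-pink", "blue-sea", "best-job-ever"])
all_scores = pd.Series(
    {"the": 0.34, "cat": 0.56, "best": 0.01, "ever": 0.77, "is": 0.12,
     "pink": 0.34, "job": 0.01, "sea": 0.87, "blue": 0.65}
)

# One word per row; each word keeps the index label of the row it came from.
words = word_series.str.split("-").explode()

# Map every word to its score, then average within each original row.
scores = words.map(all_scores).groupby(level=0).mean()

# Rows that produced no score at all get the overall mean, as the question asks.
scores = scores.fillna(all_scores.mean())
print(scores)
```

The groupby mean skips NaN values, so a word without a score is simply left out of its row's average.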

2021-11-16

慕妹3146593


Here is one approach, with the words and the scores each held in a DataFrame:

print(a)

             words
0  the-cat-is-pink
1         blue-sea
2    best-job-ever

print(b)

      all_scores
the         0.34
cat         0.56
best        0.01
ever        0.77
is          0.12
pink        0.34
job         0.01
sea         0.87
blue        0.65

b = b.reset_index()
print(b)

  index  all_scores
0   the        0.34
1   cat        0.56
2  best        0.01
3  ever        0.77
4    is        0.12
5  pink        0.34
6   job        0.01
7   sea        0.87
8  blue        0.65

# For each row, find the score of every word in b and average them.
a['score'] = a['words'].str.split('-').apply(
    lambda x: sum(b.loc[b['index'] == w, 'all_scores'].iloc[0] for w in x) / len(x)
)

Output:

             words     score
0  the-cat-is-pink  0.340000
1         blue-sea  0.760000
2    best-job-ever  0.263333

2021-11-16
