
Map column values to "miscellaneous" if their value count is below a threshold - categorical column - Pandas DataFrame

Python

牧羊人nacy 2021-06-08 14:34:58

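I have a pandas DataFrame of shape ~[200K, 40]. The DataFrame has a categorical column (one of many columns) with more than 1000 unique values. I can view the count of each unique value in that column with: df['column_name'].value_counts(). How can I now combine the values whose value_count is below a threshold, say 100, and map them to, say, "miscellaneous"? Or do the same based on a cumulative percentage of rows?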


3 Answers

至尊宝的传说

Contributed 1789 experience points, received 10+ likes

You can extract the values to mask from the index of value_counts, then use replace to map them to "miscellaneous":

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (2000, 2)), columns=['A', 'B'])

frequencies = df['A'].value_counts()
condition = frequencies < 200  # define the condition however you like
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'miscellaneous')

df['A'] = df['A'].replace(mask_dict)  # or work on a copy to leave the original data unmodified

Now value_counts shows all values that fell below the threshold grouped under miscellaneous:

df['A'].value_counts()
Out[18]:
miscellaneous    947
3                226
1                221
0                204
7                201
2                201
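The same idea can be sketched without building a replacement dictionary, by looking up each row's own frequency and keeping only the rows that reach the threshold. This is a minimal illustration, not the answerer's code: the toy column, the threshold of 2, and the names row_counts and lumped are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy column; values and threshold are illustrative only.
s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="A")
threshold = 2

# Per-row frequency: every value is looked up in the value_counts table.
row_counts = s.map(s.value_counts())

# Keep values whose count reaches the threshold, replace the rest.
lumped = s.where(row_counts >= threshold, "miscellaneous")

print(lumped.tolist())
# ['a', 'a', 'a', 'b', 'b', 'miscellaneous', 'miscellaneous']
```

Because the column here holds strings from the start, the result stays a column of strings rather than mixing integers with the new label.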

Answered 2021-06-09

德玛西亚99

Contributed 1770 experience points, received 3+ likes

I think you need:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['a','a','a','a','b','b','b','c','d']})
s = df['A'].value_counts()
print(s)

a    4
b    3
d    1
c    1
Name: A, dtype: int64

If you need to sum all the counts below the threshold:

threshold = 2
m = s < threshold

# keep only the values at or above the threshold
out = s[~m]
# add the combined count of the values below the threshold as a new entry
out['misc'] = s[m].sum()
print(out)

a       4
b       3
misc    2
Name: A, dtype: int64

But if you only need to rename the index values that fall below the threshold:

out = s.rename(dict.fromkeys(s.index[s < threshold], 'misc'))
print(out)

a       4
b       3
misc    1
misc    1
Name: A, dtype: int64

If you need to replace the values in the original column, use GroupBy.transform with numpy.where:

df['A'] = np.where(df.groupby('A')['A'].transform('size') < threshold, 'misc', df['A'])
print(df)

      A
0     a
1     a
2     a
3     a
4     b
5     b
6     b
7  misc
8  misc
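One caveat worth sketching: numpy.where returns a plain NumPy array, so a column stored with the pandas category dtype would come back as object dtype. The example below is a minimal sketch of one way to keep the category dtype, assuming the column really is stored as category (the question only says "categorical"); the names rare and lumped are made up for the illustration.

```python
import pandas as pd

# Hypothetical column stored with the category dtype.
s = pd.Series(["a", "a", "a", "a", "b", "b", "b", "c", "d"], dtype="category")
threshold = 2

counts = s.value_counts()
rare = counts[counts < threshold].index.tolist()  # labels below the threshold

# The new label has to be registered as a category before it can be assigned.
lumped = s.cat.add_categories(["misc"])
lumped[lumped.isin(rare)] = "misc"

# Drop the categories that no longer occur in the data.
lumped = lumped.cat.remove_unused_categories()

print(lumped.tolist())
print(list(lumped.cat.categories))
```

Calling remove_unused_categories at the end is optional; it only keeps the category list in step with the values that remain.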

Answered 2021-06-09

白衣非少年

Contributed 1155 experience points, received 0+ likes

An alternative solution:

cond = df['col'].value_counts()
threshold = 100
df['col'] = np.where(df['col'].isin(cond.index[cond >= threshold]), df['col'], 'miscellaneous')
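The question also asks about a cutoff based on the cumulative percentage of rows. One possible reading, sketched here under assumptions, is to keep the most frequent values until they cover a target share of the rows and lump everything else together. The 75% target, the toy data, and the names coverage, shares, and keep are illustrative choices, not something taken from the thread.

```python
import pandas as pd

# Hypothetical toy column: a appears 5 times, b 3 times, c and d once each.
s = pd.Series(list("aaaaabbbcd"), name="col")
coverage = 0.75  # assumed target share of rows to keep un-lumped

# Relative frequencies, sorted from most to least frequent.
shares = s.value_counts(normalize=True)
cumulative = shares.cumsum()

# Keep a value if the share covered *before* it is still below the target.
keep = shares[(cumulative - shares) < coverage].index

lumped = s.where(s.isin(keep), "miscellaneous")
print(lumped.value_counts())
```

Here a and b together cover 80% of the rows, so c and d are mapped to "miscellaneous". Comparing the share covered before each value, rather than the cumulative share itself, ensures the value that crosses the target is still kept.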

Answered 2021-06-09
