Detect and exclude outliers in a Pandas DataFrame

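I have a pandas DataFrame with a few columns. I know that certain rows are outliers based on the value of one column. For example, the values in column 'Vol' are all around 12xx, and one value is 4000 (an outlier). I want to exclude the rows whose 'Vol' value is an outlier like this. So, essentially, I need to put a filter on the DataFrame that selects all rows where the value of a certain column is within, say, 3 standard deviations of the mean. What is an elegant way to achieve this?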

3 Answers

扬帆大鱼

1799 experience points contributed, more than 9 upvotes received

You can use boolean indexing, just as you would with a numpy.array.

import numpy as np
import pandas as pd

# Example dataset of normally distributed data.
df = pd.DataFrame({'Data': np.random.normal(size=200)})

# Keep only the rows within 3 standard deviations of the mean of column 'Data'.
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]

# Or, if you prefer, the same filter written as a negation.
df[~(np.abs(df.Data - df.Data.mean()) > (3 * df.Data.std()))]

For a Series it is similar:

S = pd.Series(np.random.normal(size=200))

S[~((S-S.mean()).abs() > 3*S.std())]
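To connect this back to the 'Vol' column from the question, the same filter could be wrapped in a small helper. This is only a minimal sketch: the function name `drop_outliers`, the `n_std` parameter, the extra `Id` column, and the sample values are all made up for illustration and are not part of pandas or of the original answer.

```python
import numpy as np
import pandas as pd


def drop_outliers(df, column, n_std=3.0):
    """Keep rows whose `column` value lies within n_std standard
    deviations of the column mean. Hypothetical helper, not a pandas API."""
    values = df[column]
    deviation = (values - values.mean()).abs()
    return df[deviation <= n_std * values.std()]


# Made-up sample: twenty values around 12xx plus one obvious outlier.
vol = [1200 + i for i in range(20)] + [4000]
df = pd.DataFrame({"Vol": vol, "Id": range(21)})

# The row with Vol == 4000 is dropped; the other twenty rows are kept.
filtered = drop_outliers(df, "Vol")
```

One caveat worth keeping in mind: the outlier itself inflates both the mean and the standard deviation. With the sample standard deviation that pandas uses by default, a value can be at most (n-1)/sqrt(n) standard deviations from the mean of n values, so with ten or fewer rows this filter can never remove anything.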

2019-07-31

波斯汪

1811 experience points contributed, more than 4 upvotes received

For each DataFrame column, you can compute a quantile:

q = df["col"].quantile(0.99)

Then filter on it:

df[df["col"] < q]
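The filter above trims only the upper tail. If unusually low values matter as well, one possible variant keeps the rows between a lower and an upper quantile. The 0.01 and 0.99 cutoffs and the sample data below are assumptions chosen for illustration, not values from the answer.

```python
import pandas as pd

# Made-up data: 0..99 plus one very low and one very high value.
df = pd.DataFrame({"col": list(range(100)) + [-1000, 1000]})

low = df["col"].quantile(0.01)
high = df["col"].quantile(0.99)

# Series.between is inclusive on both ends by default.
trimmed = df[df["col"].between(low, high)]
```

Note that a quantile cutoff always discards a fixed share of the rows, whether or not they are genuine outliers, so it behaves differently from the standard-deviation filter when the data contains no extreme values.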

2019-07-31
