
Removing duplicate rows with dplyr

R language

SMILET 2019-10-08 11:20:22

I have a data.frame like this:

set.seed(123)
df = data.frame(x=sample(0:1,10,replace=T),y=sample(0:1,10,replace=T),z=1:10)
> df
   x y  z
1  0 1  1
2  1 0  2
3  0 1  3
4  1 1  4
5  1 0  5
6  0 1  6
7  1 0  7
8  1 0  8
9  1 0  9
10 0 1 10

I want to remove duplicate rows based on the first two columns. Expected output:

df[!duplicated(df[,1:2]),]
  x y z
1 0 1 1
2 1 0 2
4 1 1 4

I am looking for a solution that uses the dplyr package.


3 Answers

幕布斯7119047

Contributed 1794 experience points, received more than 8 likes

Note: dplyr now includes the distinct() function for this purpose.

The original answer follows:

library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

One approach is to group by the key columns and then keep only the first row of each group:

df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
##   x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4

(In dplyr 0.2 you won't need the dummy z variable and can simply write row_number() == 1.)
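The simplified form mentioned above might look like the following. This is only a minimal sketch, assuming dplyr 0.2 or later is installed. The data frame is hard-coded to the values printed in the question, because sample() can produce different draws under a different R version even with the same seed; the name first_rows is made up for illustration.

```r
library(dplyr)

# Values copied from the data frame printed in the question (hard-coded so the
# example does not depend on the random number generator).
df <- data.frame(
  x = c(0, 1, 0, 1, 1, 0, 1, 1, 1, 0),
  y = c(1, 0, 1, 1, 0, 1, 0, 0, 0, 1),
  z = 1:10
)

# Keep the first row of every (x, y) group, then drop the grouping so that
# later steps operate on the whole table again.
first_rows <- df %>%
  group_by(x, y) %>%
  filter(row_number() == 1) %>%
  ungroup()

print(first_rows)
```

The trailing ungroup() is optional; it is included here so the result behaves like an ordinary data frame in subsequent operations.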

I've also been thinking about adding a slice() function that would work like this:

df %>% group_by(x, y) %>% slice(from = 1, to = 1)

Or perhaps a variation of unique() that would let you select which variables to use:

df %>% unique(x, y)

Posted 2019-10-08

拉丁的传说

Contributed 1789 experience points, received more than 8 likes

Here is a solution using dplyr 0.3.

library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

> df %>% distinct(x, y)
  x y z
1 0 1 1
2 1 0 2
3 1 1 4

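Update for dplyr 0.5

As of dplyr 0.5, the default behavior of distinct() is to return only the columns specified in the ... argument.

To get the original result, you now have to use: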

df %>% distinct(x, y, .keep_all = TRUE)
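To make the difference concrete, the two calls can be compared side by side. This is a minimal sketch, assuming dplyr 0.5 or later. The data frame is hard-coded to the values printed in the question rather than generated with sample(), and the names keys_only, with_z and base_result are made up for illustration.

```r
library(dplyr)

# Values copied from the data frame printed in the question.
df <- data.frame(
  x = c(0, 1, 0, 1, 1, 0, 1, 1, 1, 0),
  y = c(1, 0, 1, 1, 0, 1, 0, 0, 0, 1),
  z = 1:10
)

# Only the columns named in the call are returned.
keys_only <- df %>% distinct(x, y)

# .keep_all = TRUE keeps every column, using the first row found for each
# (x, y) combination.
with_z <- df %>% distinct(x, y, .keep_all = TRUE)

# The base R expression from the question, for comparison.
base_result <- df[!duplicated(df[, 1:2]), ]

print(names(keys_only))
print(with_z)
print(all(with_z$z == base_result$z))
```

On this data, the .keep_all = TRUE result holds the same rows as the base R duplicated() approach; only the row names differ, because distinct() renumbers them.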

Posted 2019-10-08

月关宝盒

Contributed 1772 experience points, received more than 5 likes

Most of the time, the best solution is to use distinct() from dplyr, as has already been suggested.

However, here is another approach that uses dplyr's slice() function.

# Generate fake data for the example
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = TRUE),
  y = sample(0:1, 10, replace = TRUE),
  z = 1:10
)

# In each group of rows formed by combinations of x and y
# retain only the first row
df %>%
  group_by(x, y) %>%
  slice(1)

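Difference from using the distinct() function

The advantage of this solution is that it makes explicit which rows are retained from the original data frame, and it pairs well with the arrange() function.

Say you have customer sales data and want to keep one record per customer, and you want that record to be the one from their most recent purchase. You could then write: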

customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1)
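The pipeline above can be tried on a small table. The data below is entirely hypothetical; only the column names Customer_ID and Purchase_Date come from the answer, while the Amount column and the name latest are assumptions added for illustration. A minimal sketch:

```r
library(dplyr)

# Hypothetical sales data, invented for this example.
customer_purchase_data <- data.frame(
  Customer_ID   = c("A", "A", "B", "B", "C"),
  Purchase_Date = as.Date(c("2019-01-05", "2019-03-10", "2019-02-01",
                            "2019-01-15", "2019-04-20")),
  Amount        = c(10, 25, 40, 15, 30)
)

# Sort newest first, then keep the first (most recent) row per customer.
latest <- customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1) %>%
  ungroup()

print(latest)
```

Because the rows are sorted before grouping, slice(1) picks each customer's most recent purchase. If a customer has two purchases on the same date, whichever one arrange() places first is kept.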

Posted 2019-10-08
