How to flag the last duplicate element in a pandas DataFrame

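As you know, there is a .duplicated method for finding duplicates in a column, but what I need is to identify the last duplicate element, given that my data is sorted by date. Here is the expected result, Last_dup, for the column Policy_id:

Id  Policy_id  Start_Date  Last_dup
0   b123       2019/02/24  0
1   b123       2019/03/24  0
2   b123       2019/04/24  1
3   c123       2018/09/01  0
4   c123       2018/10/01  1
5   d123       2017/02/24  0
6   d123       2017/03/24  1

Thanks in advance for your help and support!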


2 Answers

慕的地8271018


Use Series.duplicated or DataFrame.duplicated, specifying the column and the parameter keep='last'. Then invert the mask and cast it to integer so that True/False maps to 1/0, or use numpy.where:

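# Option 1: invert the boolean mask and cast it to int
df['Last_dup1'] = (~df['Policy_id'].duplicated(keep='last')).astype(int)

# Option 2: numpy.where (requires `import numpy as np`)
df['Last_dup1'] = np.where(df['Policy_id'].duplicated(keep='last'), 0, 1)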

Or:

df['Last_dup1'] = (~df.duplicated(subset=['Policy_id'], keep='last')).astype(int)

df['Last_dup1'] = np.where(df.duplicated(subset=['Policy_id'], keep='last'), 0, 1)

print(df)

   Id Policy_id  Start_Date  Last_dup  Last_dup1
0   0      b123  2019/02/24         0          0
1   1      b123  2019/03/24         0          0
2   2      b123  2019/04/24         1          1
3   3      c123  2018/09/01         0          0
4   4      c123  2018/10/01         1          1
5   5      d123  2017/02/24         0          0
6   6      d123  2017/03/24         1          1
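For reference, here is a minimal self-contained sketch of the approach above, built on the sample data from the question. The `sort_values` step is an addition that is not in the original answer: it assumes the frame might not already be ordered by `Start_Date`, and it relies on the `YYYY/MM/DD` strings sorting chronologically as plain text. The names `is_earlier` and `last_dup_np` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4, 5, 6],
    "Policy_id": ["b123", "b123", "b123", "c123", "c123", "d123", "d123"],
    "Start_Date": ["2019/02/24", "2019/03/24", "2019/04/24",
                   "2018/09/01", "2018/10/01",
                   "2017/02/24", "2017/03/24"],
})

# Assumption: make sure rows are ordered by date within each policy,
# so that "last occurrence" really means "most recent"
df = df.sort_values(["Policy_id", "Start_Date"]).reset_index(drop=True)

# duplicated(keep='last') is True for every row of a policy except its last one
is_earlier = df["Policy_id"].duplicated(keep="last")

# Invert the mask and cast to int: 1 marks the last row per Policy_id
df["Last_dup"] = (~is_earlier).astype(int)

# The same flags computed with numpy.where
last_dup_np = np.where(is_earlier, 0, 1)

print(df)
```

Note that a Policy_id that appears only once is also flagged with 1, since its single row is its own last occurrence.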

Posted 2022-01-05

芜湖不芜


It can also be done as shown below, without using Series.duplicated:

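# Map each Policy_id to an Id. For repeated keys the later row overwrites the
# earlier one, so the dictionary values are the most recent Ids.
dictionary = df[['Id','Policy_id']].set_index('Policy_id').to_dict()['Id']

# Flag rows whose Id is one of those most recent Ids (assumes Id is unique per row)
df['Last_dup'] = df['Id'].isin(list(dictionary.values())).astype(int)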
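A self-contained sketch of this dictionary idea, using the sample data from the question, might look like the following. It uses `dict(zip(...))` as a stand-in for the `set_index(...).to_dict()` call, and it assumes that `Id` is unique per row and that the rows are already sorted by date. The names `last_id_by_policy` and `expected` are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Sample data from the question, already sorted by date within each policy
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4, 5, 6],
    "Policy_id": ["b123", "b123", "b123", "c123", "c123", "d123", "d123"],
    "Start_Date": ["2019/02/24", "2019/03/24", "2019/04/24",
                   "2018/09/01", "2018/10/01",
                   "2017/02/24", "2017/03/24"],
})

# Later rows overwrite earlier ones for the same key, so each Policy_id
# ends up mapped to the Id of its last row
last_id_by_policy = dict(zip(df["Policy_id"], df["Id"]))

# Flag rows whose Id is one of those "last" Ids
# (only valid under the assumption that Id is unique per row)
df["Last_dup"] = df["Id"].isin(list(last_id_by_policy.values())).astype(int)

# Cross-check against the duplicated(keep='last') approach
expected = (~df["Policy_id"].duplicated(keep="last")).astype(int)
print(df["Last_dup"].equals(expected))
```

If `Id` is not unique, the membership test can flag the wrong rows, in which case the duplicated(keep='last') approach is the safer choice.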

Posted 2022-01-05
