
Pandas: merging dataframes with duplicate key values so the rows align

Python

繁花如伊 2021-11-23 19:17:19

Here are links to the original data sources: the capacity dataset and the type dataset, or the modified versions, modified dataset 1 and modified dataset 2.

I have 2 dataframes to merge:

first_df=pd.DataFrame([['2001','Abu Dhabi','100-','462'],
                       ['2001','Abu Dhabi','100','44'],
                       ['2001','Abu Dhabi','200','657'],
                       ['2001','Dubai','100-','40'],
                       ['2001','Dubai','100','30'],
                       ['2001','Dubai','200','51'],
                       ['2002','Abu Dhabi','100-','300'],
                       ['2002','Abu Dhabi','100','220'],
                       ['2002','Abu Dhabi','200','56'],
                       ['2002','Dubai','100-','55'],
                       ['2002','Dubai','100','67'],
                       ['2002','Dubai','200','89']],columns=['Year','Emirate','Capacity','Number'])

second_df=pd.DataFrame([['2001','Abu Dhabi','Performed','45'],
                        ['2001','Abu Dhabi','Not Performed','76'],
                        ['2001','Dubai','Performed','90'],
                        ['2001','Dubai','Not Performed','50'],
                        ['2002','Abu Dhabi','Performed','78'],
                        ['2002','Abu Dhabi','Not Performed','45'],
                        ['2002','Dubai','Performed','76'],
                        ['2002','Dubai','Not Performed','58']],columns=['Year','Emirate','Type','Value'])

So I set a MultiIndex on both dataframes:

first=first_df.set_index(['Year','Emirate'])
second=second_df.set_index(['Year','Emirate'])

and merged them:

merged=first.merge(second,how='outer',right_index=True,left_index=True)

The result (merged) looks like this:

| Year , Emirate        | Capacity   |  Number | Type          |   Value |
|:----------------------|:-----------|--------:|:--------------|--------:|
| ('2001', 'Abu Dhabi') | 100-       |     462 | Performed     |      45 |
| ('2001', 'Abu Dhabi') | 100-       |     462 | Not Performed |      76 |
| ('2001', 'Abu Dhabi') | 100        |      44 | Performed     |      45 |
| ('2001', 'Abu Dhabi') | 100        |      44 | Not Performed |      76 |
| ('2001', 'Abu Dhabi') | 200        |     657 | Performed     |      45 |

The two dataframes, once joined, should have neither duplicated rows (as in this first merge) nor rows shifted down (as in the concat variant). What would be a solution that aligns the 2 dataframes nicely?


2 Answers

qq_遁去的一_1


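I see the problem: it comes from joining on ['Year','Emirate'] alone. For example, 2001 Abu Dhabi in the first dataframe matches 2001 Abu Dhabi in the second for both "Performed" and "Not Performed". This is an m x n (many-to-many) join: every row with a given key on the left is paired with every row with the same key on the right. Unless you specify a key that uniquely identifies each row, you will keep getting the same result.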
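To make the m x n effect concrete, here is a minimal sketch using only the 2001 Abu Dhabi rows from the question. The variable names (`keys`, `n_rows`, `keys_are_unique`) are illustrative and not from the original post. It counts the rows produced by the merge and then uses the `validate` argument of `DataFrame.merge`, which should raise a `MergeError` when the keys are not unique on either side.

```python
import pandas as pd
from pandas.errors import MergeError

# Only the 2001 / Abu Dhabi rows from the question, to keep the example small
first_df = pd.DataFrame(
    [["2001", "Abu Dhabi", "100-", "462"],
     ["2001", "Abu Dhabi", "100", "44"],
     ["2001", "Abu Dhabi", "200", "657"]],
    columns=["Year", "Emirate", "Capacity", "Number"],
)
second_df = pd.DataFrame(
    [["2001", "Abu Dhabi", "Performed", "45"],
     ["2001", "Abu Dhabi", "Not Performed", "76"]],
    columns=["Year", "Emirate", "Type", "Value"],
)

keys = ["Year", "Emirate"]

# 3 left rows and 2 right rows share the same key, so the join yields 3 * 2 rows
merged = first_df.merge(second_df, how="outer", on=keys)
n_rows = len(merged)

# How many rows each side holds per key
left_counts = first_df.groupby(keys).size()
right_counts = second_df.groupby(keys).size()

# Ask pandas to verify that the keys are unique on both sides
try:
    first_df.merge(second_df, how="outer", on=keys, validate="one_to_one")
    keys_are_unique = True
except MergeError:
    keys_are_unique = False

print(n_rows, keys_are_unique)
```

Passing `validate` is a cheap way to catch this kind of accidental row multiplication before it reaches later steps.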

Answered 2021-11-23

慕婉清6462132


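I assume your data is not quite right yet: your expected output is achievable, but not with the current join logic.

You are missing a third key column in second_df, namely Capacity. If we add this column and perform a left merge, we can get your expected output.

By the way, there is no need to set the columns as the index, so the solution looks like this.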

import pandas as pd

# Clean up and create correct dataframes
first_df=pd.DataFrame([['2001','Abu Dhabi','100-','462'],
                       ['2001','Abu Dhabi','100','44'],
                       ['2001','Abu Dhabi','200','657'],
                       ['2001','Dubai','100-','40'],
                       ['2001','Dubai','100','30'],
                       ['2001','Dubai','200','51'],
                       ['2002','Abu Dhabi','100-','300'],
                       ['2002','Abu Dhabi','100','220'],
                       ['2002','Abu Dhabi','200','56'],
                       ['2002','Dubai','100-','55'],
                       ['2002','Dubai','100','67'],
                       ['2002','Dubai','200','89']],columns=['Year','Emirate','Capacity','Number'])

# second_df now carries a Capacity column to act as the third key;
# the rows with empty strings match nothing in first_df
second_df=pd.DataFrame([['2001','Abu Dhabi','100-','Performed','45'],
                        ['2001','Abu Dhabi','100','Not Performed','76'],
                        ['2001','Abu Dhabi','','',''],
                        ['2001','Dubai','100-','Performed','90'],
                        ['2001','Dubai','100','Not Performed','50'],
                        ['2001','Dubai','','',''],
                        ['2002','Abu Dhabi','100-','Performed','78'],
                        ['2002','Abu Dhabi','100','Not Performed','45'],
                        ['2002','Abu Dhabi','','',''],
                        ['2002','Dubai','100-','Performed','76'],
                        ['2002','Dubai','100','Not Performed','58'],
                        ['2002','Dubai','','','']],columns=['Year','Emirate','Capacity','Type','Value'])

# Perform a left merge on all three keys to get the correct output
merged=first_df.merge(second_df,how='left',on=['Year', 'Emirate', 'Capacity'])

Output

    Year    Emirate  Capacity  Number           Type  Value
0   2001  Abu Dhabi      100-     462      Performed     45
1   2001  Abu Dhabi       100      44  Not Performed     76
2   2001  Abu Dhabi       200     657            NaN    NaN
3   2001      Dubai      100-      40      Performed     90
4   2001      Dubai       100      30  Not Performed     50
5   2001      Dubai       200      51            NaN    NaN
6   2002  Abu Dhabi      100-     300      Performed     78
7   2002  Abu Dhabi       100     220  Not Performed     45
8   2002  Abu Dhabi       200      56            NaN    NaN
9   2002      Dubai      100-      55      Performed     76
10  2002      Dubai       100      67  Not Performed     58
11  2002      Dubai       200      89            NaN    NaN
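If second_df cannot be given a real Capacity column, one possible alternative (not part of the answer above) is to pair rows by their position within each Year/Emirate group. The sketch below assumes that row order is meaningful, i.e. that the first "Type" row of a group belongs next to the first "Capacity" row, which the question does not confirm. The helper column name `pos` is made up for illustration, and `groupby(...).cumcount()` supplies the per-group row number used as the extra key.

```python
import pandas as pd

# The 2001 rows from the question, with second_df in its original shape
first_df = pd.DataFrame(
    [["2001", "Abu Dhabi", "100-", "462"],
     ["2001", "Abu Dhabi", "100", "44"],
     ["2001", "Abu Dhabi", "200", "657"],
     ["2001", "Dubai", "100-", "40"],
     ["2001", "Dubai", "100", "30"],
     ["2001", "Dubai", "200", "51"]],
    columns=["Year", "Emirate", "Capacity", "Number"],
)
second_df = pd.DataFrame(
    [["2001", "Abu Dhabi", "Performed", "45"],
     ["2001", "Abu Dhabi", "Not Performed", "76"],
     ["2001", "Dubai", "Performed", "90"],
     ["2001", "Dubai", "Not Performed", "50"]],
    columns=["Year", "Emirate", "Type", "Value"],
)

keys = ["Year", "Emirate"]

# Number the rows inside each (Year, Emirate) group: 0, 1, 2, ...
# "pos" is a hypothetical helper column, assumed to reflect the intended pairing
left = first_df.assign(pos=first_df.groupby(keys).cumcount())
right = second_df.assign(pos=second_df.groupby(keys).cumcount())

# With the position as an extra key, each left row matches at most one right row
aligned = left.merge(right, how="left", on=keys + ["pos"]).drop(columns="pos")

print(aligned)
```

Each group keeps its three Capacity rows; the first two pick up a Type and Value, and the third is left as NaN because second_df has only two rows per group.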

Answered 2021-11-23
