Merging two DataFrames: UPSERT in Python

Insert or update in a pandas DataFrame

I want to merge storage_df and processed_df as shown below. Assume Phone is the primary key:

1. If the value exists, update the fields (and create the remaining columns, such as Gender in the example below).
2. If the value does not exist, insert it into the final DataFrame, like 382837371 in the example.

Note that the number of columns keeps growing as we process more information, but there is a limit of 32 columns up to which processed_df/storage_df will grow.

storage_df
Phone      Name
918348483  Sumit
874647474  Saurabh
238362633  NA

processed_df
Phone      Name     Gender
874647474  Saurabh  Male
238362633  NA       Female
382837371  NA       Male

final_df
Phone      Name     Gender
918348483  Sumit    NA
874647474  Saurabh  Male
238362633  NA       Female
382837371  NA       Male

For this I used pandas' combine_first:

final_df = processed_df.set_index('phone').combine_first(storage_df.set_index('phone'))

But as the DataFrames grow, the system runs out of memory (16 GB RAM) and cannot combine frames of shape (88488, 6) and (7307, 8).

One option is to store both DataFrames in SQL using sqlite and then use UPSERT. Can you guide me on the syntax for that? I would really prefer to do this in memory rather than in a database, though.

  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 5364, in combine_first
    return self.combine(other, combiner, overwrite=False)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 5229, in combine
    this, other = self.align(other, copy=False)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3792, in align
    broadcast_axis=broadcast_axis)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 8423, in align
    fill_axis=fill_axis)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 8459, in _align_frame
    allow_dups=True)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 4490, in _reindex_with_indexers
    copy=copy)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1220, in reindex_indexer
    self._consolidate_inplace()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace


3 Answers

芜湖不芜

Contributed 1796 experience points, received more than 7 likes

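You can try a pandas outer join.

# 1. Join the DataFrames on the key column
final_df = storage_df.merge(processed_df, on='Phone', how='outer', suffixes=('', '_y'))
# 2. Drop the extra (duplicated) columns left over from the merge
final_df.drop(list(final_df.filter(regex=r'.*_y$').columns), axis=1, inplace=True)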
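Note that dropping the `_y` columns discards whatever processed_df holds in the shared columns, so values already in storage_df are never updated. Below is a minimal sketch of a variant that lets non-null processed_df values win, using the sample data from the question. The helper name `upsert_merge` and the `_new` suffix are made up for this illustration, and it assumes no real column name already ends in `_new`.

```python
import pandas as pd

# Sample data from the question (NA represented as None).
storage_df = pd.DataFrame({
    "Phone": [918348483, 874647474, 238362633],
    "Name": ["Sumit", "Saurabh", None],
})
processed_df = pd.DataFrame({
    "Phone": [874647474, 238362633, 382837371],
    "Name": ["Saurabh", None, None],
    "Gender": ["Male", "Female", "Male"],
})


def upsert_merge(storage, processed, key="Phone", suffix="_new"):
    """Hypothetical helper: outer-join on `key`, preferring non-null
    values from `processed` for columns present in both frames."""
    merged = storage.merge(processed, on=key, how="outer", suffixes=("", suffix))
    new_cols = [c for c in merged.columns if c.endswith(suffix)]
    for new_col in new_cols:
        base_col = new_col[: -len(suffix)]
        # Take the processed value where it exists, else keep the stored one.
        merged[base_col] = merged[new_col].combine_first(merged[base_col])
    return merged.drop(columns=new_cols)


final_df = upsert_merge(storage_df, processed_df)
print(final_df)
```

Whether this stays within memory for the real data has not been measured here; it only shows the update-then-insert semantics on the small example.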

2022-05-24

PIPIONE

Contributed 1829 experience points, received more than 9 likes

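Set Phone as the index of both DataFrames, since you said it is the primary key, and then use pandas.concat.

While doing so, drop the common columns from the other DataFrame; otherwise they will be duplicated in the resulting DataFrame.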

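>>> import pandas as pd
>>> df1.set_index('Phone', inplace=True)
>>> df2.set_index('Phone', inplace=True)
>>> # Columns that only df2 has; Index.difference is used because newer pandas rejects a set as an indexer
>>> other_cols = df2.columns.difference(df1.columns)
>>> # sort=True sorts the Phone index, matching the output below
>>> df = pd.concat([df1, df2[other_cols]], axis=1, sort=True)
>>> df
              Name  Gender
Phone
238362633      NaN  Female
382837371      NaN    Male
874647474  Saurabh    Male
918348483    Sumit     NaN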
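The concat approach relies on Phone really being unique within each frame. Here is a small sketch that checks this before concatenating. `concat_new_columns` is a hypothetical helper name, and raising `ValueError` on duplicates is a choice made for this illustration, not something the answer prescribes.

```python
import pandas as pd

storage_df = pd.DataFrame({
    "Phone": [918348483, 874647474, 238362633],
    "Name": ["Sumit", "Saurabh", None],
})
processed_df = pd.DataFrame({
    "Phone": [874647474, 238362633, 382837371],
    "Name": ["Saurabh", None, None],
    "Gender": ["Male", "Female", "Male"],
})


def concat_new_columns(left, right, key="Phone"):
    """Hypothetical helper: add the columns only `right` has to `left`,
    aligning rows on `key` after checking that `key` is unique."""
    for label, frame in (("left", left), ("right", right)):
        if frame[key].duplicated().any():
            raise ValueError(f"{label} frame has duplicate {key} values")
    left_indexed = left.set_index(key)
    right_indexed = right.set_index(key)
    # Keep only the columns that the left frame does not already have.
    new_cols = right_indexed.columns.difference(left_indexed.columns)
    return pd.concat([left_indexed, right_indexed[new_cols]], axis=1, sort=True)


out = concat_new_columns(storage_df, processed_df)
print(out)
```

Unlike the in-place `set_index` calls above, this version leaves the input frames unchanged.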

2022-05-24

泛舟湖上清波郎朗

Contributed 1818 experience points, received more than 3 likes

All you need to do is drop the duplicated column first and then do an outer join.

# As mentioned, you don't need this column.
processed_df.drop('Name', axis=1, inplace=True)
# Now do an outer join on the key.
final_df = storage_df.merge(processed_df, on='Phone', how='outer')
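The question also asks about SQLite's UPSERT syntax. The following is a rough sketch of that route using the standard `sqlite3` module and `INSERT ... ON CONFLICT ... DO UPDATE`, which assumes SQLite 3.24 or newer. The table name `storage` and the helpers `to_sql_value` and `upsert_frame` are hypothetical, and using `COALESCE` so that missing values never overwrite stored ones is an assumption about the desired behaviour.

```python
import sqlite3
import pandas as pd

storage_df = pd.DataFrame({
    "Phone": [918348483, 874647474, 238362633],
    "Name": ["Sumit", "Saurabh", None],
})
processed_df = pd.DataFrame({
    "Phone": [874647474, 238362633, 382837371],
    "Name": ["Saurabh", None, None],
    "Gender": ["Male", "Female", "Male"],
})


def to_sql_value(value):
    """Convert pandas/NumPy scalars to plain Python values; missing -> None."""
    if pd.isna(value):
        return None
    return value.item() if hasattr(value, "item") else value


def upsert_frame(conn, df, table="storage", key="Phone"):
    """Hypothetical helper: upsert every row of `df` into `table`."""
    # Add columns the table does not have yet (the column set grows over time).
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    for col in df.columns:
        if col not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{col}"')
    cols = list(df.columns)
    col_list = ", ".join(f'"{c}"' for c in cols)
    marks = ", ".join("?" for _ in cols)
    # On a key conflict, keep the stored value when the incoming one is NULL.
    updates = ", ".join(
        f'"{c}" = COALESCE(excluded."{c}", "{c}")' for c in cols if c != key
    )
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    sql = (
        f'INSERT INTO "{table}" ({col_list}) VALUES ({marks}) '
        f'ON CONFLICT("{key}") {action}'
    )
    rows = [
        tuple(to_sql_value(v) for v in row)
        for row in df.itertuples(index=False, name=None)
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, rows)


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE storage ("Phone" INTEGER PRIMARY KEY, "Name" TEXT)')
upsert_frame(conn, storage_df)
upsert_frame(conn, processed_df)
final_df = pd.read_sql_query('SELECT * FROM storage ORDER BY "Phone"', conn)
print(final_df)
```

Pointing `sqlite3.connect` at a file path instead of `":memory:"` would keep the accumulated table on disk, so only the incoming batch has to be held in a DataFrame.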

2022-05-24
