
Pandas DataFrame: filling a column from a self-dependency within the data

Python

德玛西亚99 2021-07-30 14:41:10

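I have a DataFrame in which the value of "relation" is determined by codeid. leather has "codeid" = 11, which already appeared for bag, so we put the value bag in its relation. The same goes for shoes. To do: fill in the "relation" values by checking codeid within the DataFrame. Any help would be appreciated. Edit: the same codeid, e.g. 11, can appear more than twice, but "relation" can only take the value bag, because bag is the first row with codeid=11. I have also updated the picture.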


3 Answers

慕工程0101907

Contributed 1887 experience points, received more than 5 likes

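If you only want the first value of each duplicated codeid copied to its later duplicates, use transform with 'first', then set the remaining rows to NaN through loc with a mask built from duplicated: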

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'name': list('brslp'),
                   'codeid': [11, 12, 13, 11, 13]})

# give every row the name of the first row that shares its codeid
df['relation'] = df.groupby('codeid')['name'].transform('first')
print(df)
   id name  codeid relation
0   1    b      11        b
1   2    r      12        r
2   3    s      13        s
3   4    l      11        b
4   5    p      13        s

# mark every occurrence of a repeated codeid except its last one
print(df['codeid'].duplicated(keep='last'))
0     True
1    False
2     True
3    False
4    False
Name: codeid, dtype: bool

# mark all rows of repeated codeids, then invert the boolean mask with ~ to get the unique rows
print(~df['codeid'].duplicated(keep=False))
0    False
1     True
2    False
3    False
4    False
Name: codeid, dtype: bool

# combine the two boolean masks with |
print(df['codeid'].duplicated(keep='last') | ~df['codeid'].duplicated(keep=False))
0     True
1     True
2     True
3    False
4    False
Name: codeid, dtype: bool

# set relation to NaN wherever the combined mask is True
df.loc[df['codeid'].duplicated(keep='last') |
       ~df['codeid'].duplicated(keep=False), 'relation'] = np.nan
print(df)
   id name  codeid relation
0   1    b      11      NaN
1   2    r      12      NaN
2   3    s      13      NaN
3   4    l      11        b
4   5    p      13        s
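The question's edit says a codeid can appear more than twice. With three or more occurrences, the keep='last' mask above would also blank the middle occurrences, not just the first. A minimal sketch of one way around that, assuming only the first occurrence of each codeid should stay empty; the sixth row (a third codeid=11) is a hypothetical addition for illustration:

```python
import numpy as np
import pandas as pd

# Sample data; the sixth row is hypothetical and not part of the original example.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'name': list('brslpx'),
    'codeid': [11, 12, 13, 11, 13, 11],
})

# Every row receives the name of the first row sharing its codeid.
df['relation'] = df.groupby('codeid')['name'].transform('first')

# duplicated() defaults to keep='first', so its negation is True exactly on
# the first occurrence of each codeid (which includes codeids seen only once).
is_first = ~df['codeid'].duplicated()
df.loc[is_first, 'relation'] = np.nan

print(df)
```

Here rows 3 and 5 both end up with relation 'b', and a single mask replaces the two chained ones.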

2021-08-03

万千封印

Contributed 1891 experience points, received more than 3 likes

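This is not the best solution, since it uses extra memory, but here is my attempt. df1 is created to hold the rows whose relation column is null, because the null value appears to mark the first occurrence. After some cleanup, the two dataframes are merged into one.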

import pandas as pd

df = pd.DataFrame([['bag', 11, 'null'],
                   ['shoes', 12, 'null'],
                   ['shopper', 13, 'null'],
                   ['leather', 11, 'bag'],
                   ['plastic', 13, 'shopper'],
                   ['something', 13, '']], columns=['name', 'codeid', 'relation'])

# keep only the rows whose relation is the string 'null' (the first occurrences)
df1 = df.loc[df['relation'] == 'null'].copy()
# keep one row per codeid, retaining the first entry
df1.drop_duplicates(subset=['codeid'], inplace=True)
# drop the column that is no longer needed
df1 = df1.drop('relation', axis=1)
# merge on codeid; name_y then holds the first name seen for each codeid
final_df = pd.merge(df, df1, on='codeid')
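The same idea can be sketched without building a merged frame, by turning the first row per codeid into a lookup Series and mapping it back. This is an alternative under the assumption that "first occurrence" means first in row order; the name first_name is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['bag', 'shoes', 'shopper', 'leather', 'plastic', 'something'],
    'codeid': [11, 12, 13, 11, 13, 13],
})

# One row per codeid (the first seen), turned into a codeid -> name lookup.
first_name = df.drop_duplicates(subset='codeid').set_index('codeid')['name']

# Look up each row's codeid, then blank the rows that are themselves
# the first occurrence of their codeid.
df['relation'] = df['codeid'].map(first_name)
df['relation'] = df['relation'].mask(~df['codeid'].duplicated())

print(df)
```

This writes the result straight into the relation column instead of producing name_x and name_y columns.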

2021-08-03

繁星点点滴滴

Contributed 1803 experience points, received more than 3 likes

I think you want to do something like this:

import pandas as pd

df = pd.DataFrame([['bag', 11, 'null'],
                   ['shoes', 12, 'null'],
                   ['shopper', 13, 'null'],
                   ['leather', 11, 'bag'],
                   ['plastic', 13, 'shoes']], columns=['name', 'codeid', 'relation'])

def codeid_analysis(rows):
    if rows['codeid'] == 11:
        rows['relation'] = 'bag'
    elif rows['codeid'] == 12:
        rows['relation'] = 'shirt'  # for example. You should put what you want here
    elif rows['codeid'] == 13:
        rows['relation'] = 'pants'  # for example. You should put what you want here
    return rows

result = df.apply(codeid_analysis, axis=1)
print(result)
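If the codeid-to-relation pairs really are fixed in advance, the if/elif chain can also be written as a dictionary lookup, which avoids calling a Python function once per row. A rough sketch; the dictionary name code_to_relation is hypothetical, and 'shirt' and 'pants' are the placeholder values from the answer above, not values taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['bag', 'shoes', 'shopper', 'leather', 'plastic'],
    'codeid': [11, 12, 13, 11, 13],
})

# Hypothetical lookup table; replace the values with whatever each codeid should map to.
code_to_relation = {11: 'bag', 12: 'shirt', 13: 'pants'}

# Series.map looks up each codeid in the dict; codeids missing from it become NaN.
df['relation'] = df['codeid'].map(code_to_relation)

print(df)
```

Unlike the groupby-based answer, this fills every row (including the first occurrence) and needs the mapping to be maintained by hand.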

2021-08-03
