3 回答

TA贡献1887条经验 获得超5个赞
如果只想要第一个重复值到最后重复使用transformwithfirst然后NaN通过locwith设置值duplicated:
df = pd.DataFrame({'id':[1,2,3,4,5],
'name':list('brslp'),
'codeid':[11,12,13,11,13]})
df['relation'] = df.groupby('codeid')['name'].transform('first')
print (df)
id name codeid relation
0 1 b 11 b
1 2 r 12 r
2 3 s 13 s
3 4 l 11 b
4 5 p 13 s
#get first duplicated values of codeid
print (df['codeid'].duplicated(keep='last'))
0 True
1 False
2 True
3 False
4 False
Name: codeid, dtype: bool
#get all duplicated values of codeid with inverting boolenam mask by ~ for unique rows
print (~df['codeid'].duplicated(keep=False))
0 False
1 True
2 False
3 False
4 False
Name: codeid, dtype: bool
#chain boolen mask together
print (df['codeid'].duplicated(keep='last') | ~df['codeid'].duplicated(keep=False))
0 True
1 True
2 True
3 False
4 False
Name: codeid, dtype: bool
#replace True values by mask by NaN
df.loc[df['codeid'].duplicated(keep='last') |
~df['codeid'].duplicated(keep=False), 'relation'] = np.nan
print (df)
id name codeid relation
0 1 b 11 NaN
1 2 r 12 NaN
2 3 s 13 NaN
3 4 l 11 b
4 5 p 13 s

TA贡献1891条经验 获得超3个赞
这不是最佳解决方案,因为它会占用您的内存,但这是我的尝试。df1创建是为了保存列的null值relation,因为似乎空值是第一次出现。经过一些清理后,两个数据帧被合并为一个。
import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'],
['shoes', 12, 'null'],
['shopper', 13, 'null'],
['leather', 11, 'bag'],
['plastic', 13, 'shopper'],
['something',13,""]], columns = ['name', 'codeid', 'relation'])
df1=df.loc[df['relation'] == 'null'].copy()#create a df with only null values in relation
df1.drop_duplicates(subset=['name'], inplace=True)#drops the duplicates and retains the first entry
df1=df1.drop("relation",axis=1)#drop the unneeded column
final_df=pd.merge(df, df1, left_on='codeid', right_on='codeid')#merge the two dfs on the columns names

TA贡献1803条经验 获得超3个赞
我想你想做这样的事情:
import pandas as pd
df = pd.DataFrame([['bag', 11, 'null'],
['shoes', 12, 'null'],
['shopper', 13, 'null'],
['leather', 11, 'bag'],
['plastic', 13, 'shoes']], columns = ['name', 'codeid', 'relation'])
def codeid_analysis(rows):
if rows['codeid'] == 11:
rows['relation'] = 'bag'
elif rows['codeid'] == 12:
rows['relation'] = 'shirt' #for example. You should put what you want here
elif rows['codeid'] == 13:
rows['relation'] = 'pants' #for example. You should put what you want here
return rows
result = df.apply(codeid_analysis, axis = 1)
print(result)
添加回答
举报