
Priority-based categorization using pandas/Python

Python

翻翻过去那场雪 2023-08-08 10:25:19

I have invoice data in the DataFrame below, together with several lists of codes.

df = pd.DataFrame({
    'invoice': [1,1,2,2,2,3,3,3,4,4,4,5,5,6,6,6,7],
    'code': [101,104,105,101,106,106,104,101,104,105,111,109,111,110,101,114,104],
    'qty': [2,1,1,3,2,4,7,1,1,1,1,4,2,1,2,2,2]})

+---------+------+-----+
| invoice | code | qty |
+---------+------+-----+
|       1 |  101 |   2 |
|       1 |  104 |   1 |
|       2 |  105 |   1 |
|       2 |  101 |   3 |
|       2 |  106 |   2 |
|       3 |  106 |   4 |
|       3 |  104 |   7 |
|       3 |  101 |   1 |
|       4 |  104 |   1 |
|       4 |  105 |   1 |
|       4 |  111 |   1 |
|       5 |  109 |   4 |
|       5 |  111 |   2 |
|       6 |  110 |   1 |
|       6 |  101 |   2 |
|       6 |  114 |   2 |
|       7 |  104 |   2 |
+---------+------+-----+

The code lists are:

Soda = [101,102]
Hot = [103,109]
Juice = [104,105]
Milk = [106,107,108]
Dessert = [110,111]

My task is to add a new column, category, according to the order of priority specified below.

Priority 1: If the total qty of an invoice is more than 10, it should be categorized as Mega. Example: the qty sum of invoice 3 is 12.

Priority 2: From the rest of the invoices, if any code in the invoice is in the Milk list, the category should be Healthy. Example: invoice 2 has code 106, which is in Milk, so the full invoice is categorized as Healthy regardless of the other items in the invoice (codes 101 & 105), because the priority applies to the full invoice.

Priority 3: From the rest of the invoices, if any code in the invoice is in the Juice list, there are two cases:

(3.1) If the sum of the juice quantities is equal to 1, the category should be OneJuice. Example: invoice 1 has code 104 with qty 1, so invoice 1 gets OneJuice regardless of the other items in the invoice (code 101), because the priority applies to the full invoice.

(3.2) If the sum of the juice quantities is greater than 1, the category should be ManyJuice. Example: invoice 4 has codes 104 & 105 with qty 1 + 1 = 2.

Priority 4: From the rest of the invoices, if any code in the invoice is in the Hot list, it should be categorized as HotLovers, regardless of the other items in the invoice.

Priority 5: From the rest of the invoices, if any code in the invoice is in the Dessert list, it should be categorized as DessertLovers.

Finally, all remaining invoices should be categorized as Others.
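The priority rules above can be read as a chain of early returns applied to one invoice at a time. The following is a minimal reference sketch of that reading, not the method from the answer; `classify_invoice` and `per_invoice` are names introduced here for illustration, and the last row of the data follows the table (invoice 7, code 104, qty 2).

```python
import pandas as pd

# Code lists from the question.
Soda = [101, 102]
Hot = [103, 109]
Juice = [104, 105]
Milk = [106, 107, 108]
Dessert = [110, 111]

df = pd.DataFrame({
    'invoice': [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7],
    'code': [101, 104, 105, 101, 106, 106, 104, 101, 104, 105, 111,
             109, 111, 110, 101, 114, 104],
    'qty': [2, 1, 1, 3, 2, 4, 7, 1, 1, 1, 1, 4, 2, 1, 2, 2, 2],
})


def classify_invoice(group):
    """Hypothetical helper: category of one invoice, rules checked in priority order."""
    if group['qty'].sum() > 10:                      # Priority 1
        return 'Mega'
    if group['code'].isin(Milk).any():               # Priority 2
        return 'Healthy'
    # Sum only the quantities of juice rows; this is 0 when there is no juice.
    juice_qty = group.loc[group['code'].isin(Juice), 'qty'].sum()
    if juice_qty == 1:                               # Priority 3.1
        return 'OneJuice'
    if juice_qty > 1:                                # Priority 3.2
        return 'ManyJuice'
    if group['code'].isin(Hot).any():                # Priority 4
        return 'HotLovers'
    if group['code'].isin(Dessert).any():            # Priority 5
        return 'DessertLovers'
    return 'Others'


# One label per invoice, then broadcast back to every row of that invoice.
per_invoice = {int(inv): classify_invoice(g) for inv, g in df.groupby('invoice')}
df['category'] = df['invoice'].map(per_invoice)
print(df)
```

A per-invoice Python loop like this is easy to audit against the rules, but it is likely to be slower on large frames than a vectorized approach.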


1 Answer

天涯尽头无女友

Contributed 1831 experience points, received more than 9 likes

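You can try np.select. It gives each row the label of the first condition that is true, so listing the conditions in priority order implements the rules. groupby(...).transform(...) computes each condition per invoice and broadcasts it back to every row of that invoice.

import numpy as np
import pandas as pd

df['category'] = np.select([
    # Priority 1: total quantity of the invoice is more than 10
    df.groupby('invoice')['qty'].transform('sum') > 10,
    # Priority 2: any Milk code in the invoice
    df['code'].isin(Milk).groupby(df.invoice).transform('any'),
    # Priority 3.1 / 3.2: total juice quantity equals 1 / exceeds 1
    (df['qty']*df['code'].isin(Juice)).groupby(df.invoice).transform('sum') == 1,
    (df['qty']*df['code'].isin(Juice)).groupby(df.invoice).transform('sum') > 1,
    # Priority 4 and 5: any Hot code, then any Dessert code
    df['code'].isin(Hot).groupby(df.invoice).transform('any'),
    df['code'].isin(Dessert).groupby(df.invoice).transform('any')
],
    ['Mega','Healthy','OneJuice','ManyJuice','HotLovers','DessertLovers'],
    'Others'
)

print(df)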

Output

    invoice  code  qty       category
0         1   101    2       OneJuice
1         1   104    1       OneJuice
2         2   105    1        Healthy
3         2   101    3        Healthy
4         2   106    2        Healthy
5         3   106    4           Mega
6         3   104    7           Mega
7         3   101    1           Mega
8         4   104    1      ManyJuice
9         4   105    1      ManyJuice
10        4   111    1      ManyJuice
11        5   109    4      HotLovers
12        5   111    2      HotLovers
13        6   110    1  DessertLovers
14        6   101    2  DessertLovers
15        6   114    2  DessertLovers
16        7   104    2      ManyJuice
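The result depends on np.select returning, for each position, the choice paired with the first condition that is true, and the default when none is. A small sketch on a toy array (the names `qty`, `conditions`, and `labels` are made up for this illustration) shows why the order of the conditions encodes the priority:

```python
import numpy as np

# Toy quantities: 12 satisfies both "> 10" and "> 1", but only the first match counts.
qty = np.array([12, 3, 1, 0])

conditions = [qty > 10, qty > 1, qty == 1]   # highest priority first
labels = ['Mega', 'Many', 'One']

result = np.select(conditions, labels, default='Others')
print(result)
```

Swapping the first two conditions would label the 12 as 'Many', which is why the Mega check has to come before the others.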

Microbenchmark

pd.show_versions()

commit      : None
python      : 3.7.5.final.0
python-bits : 64
OS          : Linux
OS-release  : 4.4.0-18362-Microsoft
machine     : x86_64
processor   : x86_64
byteorder   : little
LC_ALL      : None
LANG        : C.UTF-8
LOCALE      : en_US.UTF-8
pandas      : 0.25.3
numpy       : 1.17.4

Data created with

def make_data(n):
    # n rows, three rows per invoice; codes 101-111; quantities 1-7, mostly 1 or 2
    return pd.DataFrame({
        'invoice':np.arange(n)//3,
        'code':np.random.choice(np.arange(101,112),n),
        'qty':np.random.choice(np.arange(1,8), n, p=[10/25,10/25,1/25,1/25,1/25,1/25,1/25])
    })
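Because make_data draws from the global NumPy random state, two runs produce different frames. If repeatable benchmark inputs are wanted, one option is a seeded variant; the sketch below is an assumption and not part of the original benchmark, and `make_data_seeded` and its `seed` parameter are hypothetical names.

```python
import numpy as np
import pandas as pd


def make_data_seeded(n, seed=0):
    """Hypothetical reproducible variant of make_data: same columns and distributions."""
    rng = np.random.default_rng(seed)  # local generator, independent of global state
    return pd.DataFrame({
        'invoice': np.arange(n) // 3,                     # three rows per invoice
        'code': rng.choice(np.arange(101, 112), n),       # codes 101..111
        'qty': rng.choice(np.arange(1, 8), n,             # quantities 1..7, mostly 1 or 2
                          p=[10/25, 10/25, 1/25, 1/25, 1/25, 1/25, 1/25]),
    })


sample = make_data_seeded(9)
print(sample)
```

Calling it twice with the same `seed` returns identical frames, which makes timing runs easier to compare.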

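Results

import perfplot

# get_category and get_with_np_select are the two functions being compared;
# their definitions are not shown in this post.
perfplot.show(
    setup=make_data,
    kernels=[get_category, get_with_np_select],
    n_range=[2**k for k in range(8, 20)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')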
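The benchmark kernels are not defined in the post. As a rough sketch only, assuming get_with_np_select simply wraps the np.select expression from the answer, it could look like the following; the body is a guess at that wrapper, and the timing uses the standard-library timeit on small sizes instead of perfplot.

```python
import timeit

import numpy as np
import pandas as pd

Hot = [103, 109]
Juice = [104, 105]
Milk = [106, 107, 108]
Dessert = [110, 111]


def make_data(n):
    # Same generator as in the answer: three rows per invoice.
    return pd.DataFrame({
        'invoice': np.arange(n) // 3,
        'code': np.random.choice(np.arange(101, 112), n),
        'qty': np.random.choice(np.arange(1, 8), n,
                                p=[10/25, 10/25, 1/25, 1/25, 1/25, 1/25, 1/25]),
    })


def get_with_np_select(df):
    """Assumed shape of the kernel: returns one category per row."""
    inv = df['invoice']
    # Juice quantity per invoice, computed once and reused for both juice rules.
    juice_qty = (df['qty'] * df['code'].isin(Juice)).groupby(inv).transform('sum')
    return np.select([
        df.groupby('invoice')['qty'].transform('sum') > 10,
        df['code'].isin(Milk).groupby(inv).transform('any'),
        juice_qty == 1,
        juice_qty > 1,
        df['code'].isin(Hot).groupby(inv).transform('any'),
        df['code'].isin(Dessert).groupby(inv).transform('any'),
    ], ['Mega', 'Healthy', 'OneJuice', 'ManyJuice', 'HotLovers', 'DessertLovers'],
        'Others')


# Time a few small sizes; absolute numbers depend on the machine and library versions.
for n in (2**8, 2**12):
    data = make_data(n)
    seconds = timeit.timeit(lambda: get_with_np_select(data), number=3) / 3
    print(f'n={n}: {seconds:.4f} s per call')
```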

Answered 2023-08-08
