5 Answers
Contributed 1815 experience points, received 10+ likes
Here is another solution:
import numpy as np
mask = df.groupby('customer_id').transform(np.size).eq(1)
    date  amount_spent
0  False         False
1  False         False
2  False         False
3  False         False
4   True          True
5   True          True
6   True          True
7   True          True
8   True          True
df[mask['date'] & df.date.eq('2020-01-10')]
         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
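To run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and computes an equivalent mask using the string alias `"size"` on a single column. The names `counts` and `new_on_target` are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-10", "2020-01-01", "2020-01-10", "2020-01-01",
             "2020-01-10", "2020-01-10", "2020-01-10", "2020-01-10"],
    "customer_id": [24, 24, 58, 58, 98, 99, 67, 32, 75],
    "amount_spent": [123, 145, 89, 67, 34, 86, 140, 321, 76],
})

# Number of rows per customer, broadcast back onto every row of that customer.
counts = df.groupby("customer_id")["customer_id"].transform("size")

# A customer seen exactly once, and seen on the target date, is treated as new.
new_on_target = df[counts.eq(1) & df["date"].eq("2020-01-10")]
print(new_on_target)
```

Note the assumption baked into this approach: each customer has at most one row per date. A new customer with two purchases on the target date would have a count of 2 and be missed.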
Contributed 1829 experience points, received 13+ likes
If you are looking for a general approach, this may be a solution:
import pandas as pd
df = pd.DataFrame({
    'date':['2020-01-01','2020-01-10','2020-01-01','2020-01-10','2020-01-01','2020-01-10','2020-01-10','2020-01-10','2020-01-10'],
    'customer_id':[24,24,58,58,98,99,67,32,75],
    'amount_spent':[123,145,89,67,34,86,140,321,76]
})
print(df)
         date  customer_id  amount_spent
0  2020-01-01           24           123
1  2020-01-10           24           145
2  2020-01-01           58            89
3  2020-01-10           58            67
4  2020-01-01           98            34
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
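You want to compare the last two dates. Your real dataset may look different, and you may not know in advance which dates those are, so the first step is to find them.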
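df = df.sort_values(by='date')
# One row per distinct date, in ascending order
take_last_dates = df.drop_duplicates(subset='date').sort_values(by='date')
# Keep only the last two dates, in case the data holds more than two
take_last_dates = take_last_dates.date.tolist()[-2:]
print(take_last_dates)
['2020-01-01', '2020-01-10']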
Now create two DataFrames, one per date, so you can compare their customers:
df_prev = df[
    df.date==take_last_dates[0]
]
print(df_prev)
         date  customer_id  amount_spent
0  2020-01-01           24           123
2  2020-01-01           58            89
4  2020-01-01           98            34
df_current = df[
    df.date==take_last_dates[1]
]
print(df_current)
         date  customer_id  amount_spent
1  2020-01-10           24           145
3  2020-01-10           58            67
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
Finally, you can get your result from these two DataFrames:
new_customers = df_current[
    ~df_current.customer_id.isin(df_prev.customer_id.tolist())
]
print(new_customers)
         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
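The steps above could be wrapped into one helper. This is only a sketch: the function name `new_customers_on_latest_date` and its parameters are hypothetical, and it assumes the dates are ISO-formatted strings (or real datetimes) so that sorting them is chronological.

```python
import pandas as pd

def new_customers_on_latest_date(df, date_col="date", id_col="customer_id"):
    """Rows from the latest date whose customer did not appear on the previous date."""
    # ISO date strings sort chronologically; convert with pd.to_datetime otherwise.
    dates = sorted(df[date_col].unique())
    if len(dates) < 2:
        raise ValueError("need at least two distinct dates to compare")
    prev_date, current_date = dates[-2:]
    prev_ids = df.loc[df[date_col] == prev_date, id_col]
    current = df[df[date_col] == current_date]
    return current[~current[id_col].isin(prev_ids)]

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-10", "2020-01-01", "2020-01-10", "2020-01-01",
             "2020-01-10", "2020-01-10", "2020-01-10", "2020-01-10"],
    "customer_id": [24, 24, 58, 58, 98, 99, 67, 32, 75],
    "amount_spent": [123, 145, 89, 67, 34, 86, 140, 321, 76],
})

result = new_customers_on_latest_date(df)
print(result)
```

Like the answer, this compares only the last two dates. A customer who bought on some earlier date but skipped the previous one would still be reported as new.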
Contributed 1845 experience points, received 8+ likes
Assuming there is a typo in your example (99 should be 98), you can do the following:
import pandas as pd
df = pd.DataFrame([["2020-01-01",24,123],
                   ["2020-01-10",24,145],
                   ["2020-01-01",58,89],
                   ["2020-01-10",58,67],
                   ["2020-01-01",98,34],
                   ["2020-01-10",98,86],
                   ["2020-01-10",67,140],
                   ["2020-01-10",32,321],
                   ["2020-01-10",75,76]],columns = ["date","customer_id","amount_spent" ])
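# 0 for a customer's first row, 1 for the second, and so on
df["order"] = df.groupby("customer_id").cumcount()
# Rows on the target date that are the customer's first appearance
df[(df["date"] == "2020-01-10") & (df["order"]==0)]
Output:
         date  customer_id  amount_spent  order
6  2020-01-10           67           140      0
7  2020-01-10           32           321      0
8  2020-01-10           75            76      0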
This will need adjusting depending on how complex your df is.
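One adjustment that is likely to be needed: `cumcount()` numbers rows in their existing order, so it only identifies a customer's first purchase if the frame is sorted by date. A minimal sketch, using the same rows as above but deliberately out of order (the name `first_purchases` is illustrative):

```python
import pandas as pd

# Same rows as in the answer (with 99 changed to 98), with later dates listed first.
df = pd.DataFrame(
    [["2020-01-10", 24, 145],
     ["2020-01-01", 24, 123],
     ["2020-01-10", 58, 67],
     ["2020-01-01", 58, 89],
     ["2020-01-10", 98, 86],
     ["2020-01-01", 98, 34],
     ["2020-01-10", 67, 140],
     ["2020-01-10", 32, 321],
     ["2020-01-10", 75, 76]],
    columns=["date", "customer_id", "amount_spent"],
)

# Sort chronologically first; a stable sort keeps the original order within a date.
df = df.sort_values("date", kind="stable")

# 0 marks each customer's earliest row once the frame is in date order.
df["order"] = df.groupby("customer_id").cumcount()

first_purchases = df[(df["date"] == "2020-01-10") & (df["order"] == 0)]
print(first_purchases)
```

Without the sort, customers 24, 58, and 98 would get `order == 0` on their 2020-01-10 rows and be wrongly reported as new.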
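Contributed 1816 experience points, received 4+ likes
This is along the lines of what you noted. I'm not sure your sample data and output are what you intended, so I changed customer 99 to 98 on 2020-01-10.
1. Create a mask that marks the rows before the date you want (its negation marks the rows on or after it).
2. Select the rows on or after the switch date, minus the customers that already existed before the switch date, using isin().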
import io
import datetime as dt
import pandas as pd
df = pd.read_csv(io.StringIO("""date customer_id amount_spent
2020-01-01 24 123
2020-01-10 24 145
2020-01-01 58 89
2020-01-10 58 67
2020-01-01 98 34
2020-01-10 98 86
2020-01-10 67 140
2020-01-10 32 321
2020-01-10 75 76"""), sep=r"\s+")
df["date"] = pd.to_datetime(df["date"])
# True for rows before the switch date
mask = df["date"] < dt.datetime(2020,1,10)
# Rows on/after the switch date whose customer never appeared before it;
# sum only amount_spent, since a datetime column cannot be summed
dfnew = df[~mask & ~df["customer_id"].isin(df.loc[mask,"customer_id"])].groupby("customer_id")[["amount_spent"]].sum()
print(dfnew.to_string())
Output:
             amount_spent
customer_id
32                    321
67                    140
75                     76
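A variation on the same idea, offered as a sketch rather than as part of the answer: compute each customer's first-seen date with `transform("min")` and keep the rows where that date is on or after the switch date. The names `cutoff`, `first_seen`, and `new_rows` are illustrative.

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date customer_id amount_spent
2020-01-01 24 123
2020-01-10 24 145
2020-01-01 58 89
2020-01-10 58 67
2020-01-01 98 34
2020-01-10 98 86
2020-01-10 67 140
2020-01-10 32 321
2020-01-10 75 76"""), sep=r"\s+")
df["date"] = pd.to_datetime(df["date"])

cutoff = pd.Timestamp("2020-01-10")

# Earliest date on which each customer appears, repeated on every row of that customer.
first_seen = df.groupby("customer_id")["date"].transform("min")

# Customers whose first appearance is on or after the cutoff are new.
new_rows = df[first_seen >= cutoff]
print(new_rows)
```

This keeps the original rows (including the date column) instead of aggregating, which can be handy if the per-purchase detail is needed afterwards.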
Contributed 1821 experience points, received 6+ likes
IIUC, you can get the customer_id values present on 2020-01-01, then filter them out:
s = df.loc[df["date"]=="2020-01-01", "customer_id"]
print(df[~df["customer_id"].isin(s)])
         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
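The same filter can also be written as an anti-join. This is a different technique from the `isin()` call above, shown only as a sketch: `merge` with `how="left"` and `indicator=True` adds a `_merge` column, and rows marked `"left_only"` have no match among the earlier customers. The names `prev_ids`, `current`, and `new_rows` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-10", "2020-01-01", "2020-01-10", "2020-01-01",
             "2020-01-10", "2020-01-10", "2020-01-10", "2020-01-10"],
    "customer_id": [24, 24, 58, 58, 98, 99, 67, 32, 75],
    "amount_spent": [123, 145, 89, 67, 34, 86, 140, 321, 76],
})

# Distinct customers present on the earlier date.
prev_ids = df.loc[df["date"] == "2020-01-01", ["customer_id"]].drop_duplicates()
current = df[df["date"] == "2020-01-10"]

# Left join against the earlier customers; "left_only" rows had no earlier purchase.
merged = current.merge(prev_ids, on="customer_id", how="left", indicator=True)
new_rows = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(new_rows)
```

Unlike the `isin()` version, this restricts the result to the target date explicitly, which matters once the data holds more than two dates. Note that `merge` resets the index.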