Python: Computing cumulative amounts over a period of time in a Pandas DataFrame

神不在的星期二 2022-10-06 15:37:49
Goal: compute the cumulative revenue since 2020-01-01.

I have a Python list of dictionaries, shown below:

data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
        {"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
        {"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
        {"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]

The same data viewed as a table:

game_id   user_id   amt   activity date
Racing    ABC123    5     2020-01-01
Racing    ABC123    1     2020-01-04
Racing    CDE123    1     2020-01-04
DH        CDE123    100   2020-01-03
DH        CDE456    10    2020-01-02
DH        CDE789    5     2020-01-02
DH        CDE456    1     2020-01-03
DH        CDE456    1     2020-01-03

Age is the difference in days between the transaction date and 2020-01-01. Total payers is the number of unique payers per game.

I am trying to create a data frame that holds the cumulative results for every day from the date of the first transaction through the date of the last transaction. For example, for game_id Racing we start on 2020-01-01 with an amount of 5, so Age is 0. On 2020-01-02 the amount is still 5 because there were no transactions that day. On 2020-01-03 the amount is still 5. On 2020-01-04 the amount is 7 because there were 2 transactions that day.

Expected output:

Game    Age    Cum_rev    Total_unique_payers_per_game
Racing  0      5          2
Racing  1      5          2
Racing  2      5          2
Racing  3      7          2
DH      0      0          3
DH      1      15         3
DH      2      117        3
DH      3      117        3

How can I use window functions in Python the way we do in SQL? Is there a better way to solve this?
1 Answer

拉风的咖菲猫

The tricky part here is filling in the missing dates. I used apply for that, but I'm not sure it's the best approach.
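To make the date-filling step concrete on its own, here is a minimal sketch for a single game. The toy frame `sparse` and the name `calendar` are made up for illustration and are not part of the answer's code. A left merge against a complete daily calendar keeps every day, and days without transactions come back as NaN, which is then set to 0.

```python
import pandas as pd

# Hypothetical toy data for one game: revenue exists on two days only.
sparse = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-04"]),
    "amt": [5, 2],
})

# A complete daily calendar covering the period of interest.
calendar = pd.DataFrame({"date": pd.date_range("2020-01-01", "2020-01-04")})

# Left merge keeps every calendar day; days without revenue get NaN, then 0.
filled = calendar.merge(sparse, on="date", how="left")
filled["amt"] = filled["amt"].fillna(0).astype(int)

print(filled["amt"].tolist())  # [5, 0, 0, 2]
```

Once every day has a row, a running total over `amt` gives the cumulative revenue for that game.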


import pandas as pd

data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
        {"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
        {"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
        {"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]

df = pd.DataFrame(data)

# we want datetime, not object
df["date"] = pd.to_datetime(df["date"])

# unique payers per game; merged back in at the end
grp = (df.groupby("game_id")["user_id"]
         .nunique()
         .reset_index(name="Total_unique_payers_per_game"))

# sum amt per game_id and date
df = df.groupby(["game_id", "date"])["amt"].sum().reset_index()

# dates from 2020-01-01 to the max date in df
dates = pd.DataFrame({"date": pd.date_range("2020-01-01", df["date"].max())})

# add the missing dates for one game, with amt 0 on days without transactions
def expand_dates(x):
    # errors="ignore": newer pandas versions may not pass the grouping column to apply
    x = pd.merge(dates, x.drop(columns="game_id", errors="ignore"),
                 on="date", how="left")
    x["amt"] = x["amt"].fillna(0).astype(int)
    return x

df = (df.groupby("game_id")
        .apply(expand_dates)
        .reset_index()
        .drop("level_1", axis=1))

df["Cum_rev"] = df.groupby("game_id")["amt"].transform("cumsum")

# equivalent shorthand; both rely on rows being in date order within each game
# df["Cum_rev"] = df.groupby("game_id")["amt"].cumsum()

# merge unique payers per game
df = pd.merge(df, grp, on="game_id", how="left")

# Age: days elapsed since 2020-01-01
df["Age"] = (df["date"] - pd.Timestamp("2020-01-01")).dt.days

# finally, keep only the columns of the expected output
df = (df[["game_id", "Age", "Cum_rev", "Total_unique_payers_per_game"]]
        .rename(columns={"game_id": "Game"}))
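On the question of SQL-style window functions and whether there is a simpler route, here is a rough sketch of one alternative. It is not the approach used above, only a variation under stated assumptions. `groupby(...).cumsum()` plays roughly the role of `SUM(amt) OVER (PARTITION BY game_id ORDER BY date)`, and reindexing against a full game-by-day grid built with `pd.MultiIndex.from_product` can stand in for the apply step. The function name `cumulative_revenue` and its `start` parameter are hypothetical, introduced only for this example.

```python
import pandas as pd


def cumulative_revenue(data, start="2020-01-01"):
    """Hypothetical helper: daily cumulative revenue per game, without groupby.apply."""
    df = pd.DataFrame(data)
    df["date"] = pd.to_datetime(df["date"])

    # Unique payers per game, mapped back onto the result at the end.
    payers = df.groupby("game_id")["user_id"].nunique()

    # Daily revenue per game (one entry per game/date that had transactions).
    daily = df.groupby(["game_id", "date"])["amt"].sum()

    # Full grid: every game crossed with every day in the period.
    games = df["game_id"].unique()  # order of first appearance
    days = pd.date_range(start, df["date"].max())
    grid = pd.MultiIndex.from_product([games, days], names=["game_id", "date"])

    # Game/day pairs without transactions become 0 revenue.
    out = daily.reindex(grid, fill_value=0).reset_index()

    # Rough analogue of SUM(amt) OVER (PARTITION BY game_id ORDER BY date);
    # the grid is already in date order within each game.
    out["Cum_rev"] = out.groupby("game_id")["amt"].cumsum()
    out["Age"] = (out["date"] - pd.Timestamp(start)).dt.days
    out["Total_unique_payers_per_game"] = out["game_id"].map(payers)

    return out.rename(columns={"game_id": "Game"})[
        ["Game", "Age", "Cum_rev", "Total_unique_payers_per_game"]
    ]


data = [{"game_id":"Racing","user_id":"ABC123","amt":5,"date":"2020-01-01"},
        {"game_id":"Racing","user_id":"ABC123","amt":1,"date":"2020-01-04"},
        {"game_id":"Racing","user_id":"CDE123","amt":1,"date":"2020-01-04"},
        {"game_id":"DH","user_id":"CDE123","amt":100,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":10,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE789","amt":5,"date":"2020-01-02"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"},
        {"game_id":"DH","user_id":"CDE456","amt":1,"date":"2020-01-03"}]

result = cumulative_revenue(data)
print(result)
```

For the sample data this gives Cum_rev 5, 5, 5, 7 for Racing and 0, 15, 117, 117 for DH, matching the expected output in the question. Building the grid once avoids calling a Python function per group, which may matter when there are many games.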


Answered 2022-10-06