Merging two data frames based on multiple conditions

I am looking to compare two data frames (df-a and df-b) and find where a given ID and date in one data frame (df-b) falls within a date range of the matching ID in the other data frame (df-a). I then want to take all the columns from df-a and concatenate them onto the df-b rows they match. For example, say I have a data frame df-a in the following format:

df-a:

   ID     Start_Date  End_Date    A    B    C    D   E
0  cd2    2020-06-01  2020-06-24  'a'  'b'  'c'  10  20
1  cd2    2020-06-24  2020-07-21
2  cd56   2020-06-10  2020-07-03
3  cd915  2020-04-28  2020-07-21
4  cd103  2020-04-13  2020-04-24

and df-b:

   ID     Date
0  cd2    2020-05-12
1  cd2    2020-04-12
2  cd2    2020-06-10
3  cd15   2020-04-28
4  cd193  2020-04-13

I would like an output data frame like this:

df-c =

   ID     Date        Start_Date  End_Date    A    B    C    D   E
0  cd2    2020-05-12  -           -           -    -    -    -   -
1  cd2    2020-04-12  -           -           -    -    -    -   -
2  cd2    2020-06-10  2020-06-01  2020-06-24  'a'  'b'  'c'  10  20
3  cd15   2020-04-28  -           -           -    -    -    -   -
4  cd193  2020-04-13  -           -           -    -    -    -   -

In a previous post I got a great answer that lets me compare the data frames and drop rows wherever this condition is met, but I am struggling to work out how to pull the matching information out of df-a properly. My current attempt is below:

df_c = df_b.copy()
ar = []
for i in range(df_c.shape[0]):
    currentID = df_c.stafnum[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.stafnum == currentID]
    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j, :].Leave_Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j, :].Leave_End_Date
        if (startDate <= currentDate <= endDate):
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j, :])
            #df_d=pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j,:]], axis=0)
            #df_fin_2=df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)
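For anyone who wants to reproduce the example, the two sample frames can be built roughly as follows. This is only a sketch: it uses the simplified column names from the tables (`ID`, `Start_Date`, `End_Date`, `Date`) rather than the `stafnum` / `Leave_Start_Date` names in the attempt, and it assumes the blank A to E cells in df-a are missing values.

```python
import numpy as np
import pandas as pd

# df_a: one row per ID and date range; only the first row has values in A-E
# (assumption: the blank cells in the question are missing values)
df_a = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56", "cd915", "cd103"],
    "Start_Date": ["2020-06-01", "2020-06-24", "2020-06-10", "2020-04-28", "2020-04-13"],
    "End_Date": ["2020-06-24", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"],
    "A": ["a", None, None, None, None],
    "B": ["b", None, None, None, None],
    "C": ["c", None, None, None, None],
    "D": [10, np.nan, np.nan, np.nan, np.nan],
    "E": [20, np.nan, np.nan, np.nan, np.nan],
})

# df_b: the ID/Date pairs to look up in df_a
df_b = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"],
})

# Parse the date columns so comparisons are made on real timestamps
for col in ("Start_Date", "End_Date"):
    df_a[col] = pd.to_datetime(df_a[col])
df_b["Date"] = pd.to_datetime(df_b["Date"])
```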


1 Answer

慕尼黑5688855


So you want to do a kind of "soft" match. Here is a solution that tries to vectorize the date-range matching.

import numpy as np
import pandas as pd

# As in the question: df_a holds the date ranges, df_b holds the ID/Date rows.
# Dates are compared as strings here; the inequalities only work if they are in y-m-d format.
# Otherwise it is safer to parse every date column, e.g. `df_b.Date = pd.to_datetime(df_b.Date)`.

# Create the groupby object once so df_a can be filtered efficiently inside the loop;
# a good idea if df_a is considerably large and has many different IDs.
gdf_a = df_a.groupby('ID')
a_IDs = gdf_a.indices  # dictionary of grouped rows {ID: array(integer indices)}
n_extra = df_a.shape[1] - 1  # number of df_a columns other than `ID`

matched = []  # collects the matched df_a row for every row of df_b

# iterate over rows with `.itertuples()`, more efficient than iterating range(len(df_b))
for i, ID, date in df_b.itertuples():
    if ID in a_IDs:
        gID = gdf_a.get_group(ID)  # rows of df_a with this ID
        inrange = gID.Start_Date.le(date) & gID.End_Date.ge(date)
        if inrange.any():
            matched.append(
                gID.loc[inrange.idxmax()]  # first row whose range contains the date
                .values[1:]                # plain array, with `ID` sliced out
            )
        else:
            matched.append([np.nan] * n_extra)  # no date in range, fill with NaNs
    else:
        matched.append([np.nan] * n_extra)  # no ID match, fill with NaNs

# align the matched rows to df_b's index before joining
df_c = df_b.join(pd.DataFrame(matched, columns=df_a.columns[1:], index=df_b.index))
print(df_c)

Output

      ID        Date  Start_Date    End_Date    A    B    C     D     E
0    cd2  2020-05-12         NaN         NaN  NaN  NaN  NaN   NaN   NaN
1    cd2  2020-04-12         NaN         NaN  NaN  NaN  NaN   NaN   NaN
2    cd2  2020-06-10  2020-06-01  2020-06-24    a    b    c  10.0  20.0
3   cd15  2020-04-28         NaN         NaN  NaN  NaN  NaN   NaN   NaN
4  cd193  2020-04-13         NaN         NaN  NaN  NaN  NaN   NaN   NaN
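If the loop becomes a bottleneck, a merge-then-filter variant might be worth trying. The sketch below is not part of the approach above: `match_ranges` and the helper columns `_row` and `_pos` are hypothetical names, and it assumes the date columns were already parsed with `pd.to_datetime` and that the two frames share only the `ID` column. It pairs every df_b row with every df_a row of the same ID, keeps the pairs whose range contains the date, and joins the first hit per df_b row back onto df_b.

```python
import pandas as pd


def match_ranges(df_a, df_b):
    """Attach to each df_b row the first df_a row with the same ID whose
    [Start_Date, End_Date] range contains df_b's Date (NaN if none)."""
    # Helper columns: remember each df_b row label and each df_a row position
    left = df_b.assign(_row=df_b.index)
    right = df_a.assign(_pos=range(len(df_a)))

    # Every (df_b row, df_a row) pair that shares an ID
    pairs = left.merge(right, on="ID", how="inner")

    # Keep only the pairs whose date range contains the date
    hit = pairs[(pairs["Start_Date"] <= pairs["Date"])
                & (pairs["Date"] <= pairs["End_Date"])]

    # If several ranges match, keep the one that comes first in df_a
    hit = hit.sort_values("_pos").drop_duplicates("_row")

    extra = hit.set_index("_row").drop(columns=["ID", "Date", "_pos"])
    return df_b.join(extra)  # left join on df_b's index


# Small usage example with a subset of the question's data
df_a = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56"],
    "Start_Date": pd.to_datetime(["2020-06-01", "2020-06-24", "2020-06-10"]),
    "End_Date": pd.to_datetime(["2020-06-24", "2020-07-21", "2020-07-03"]),
    "A": ["a", None, None],
})
df_b = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd15"],
    "Date": pd.to_datetime(["2020-05-12", "2020-06-10", "2020-04-28"]),
})

df_c = match_ranges(df_a, df_b)
print(df_c)
```

The trade-off is memory: the intermediate merge holds one row per (df_b row, df_a range) pair for each shared ID, so for IDs with very many ranges the grouped loop may still be the safer choice.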

Answered 2023-06-20
