I am analyzing PGA Tour data. For machine learning purposes, I want the columns to represent statistics across several weeks. Below is an example of the original data structure.

    import pandas as pd
    import numpy as np

    data = {'Player Name': ['Tiger', 'Tiger', 'Tiger', 'Tiger', 'Tiger', 'Tiger',
                            'Jack', 'Jack', 'Jack', 'Jack', 'Jack', 'Jack', 'Jack'],
            'Date': [1, 2, 4, 6, 7, 9, 1, 3, 4, 6, 9, 10, 11],
            'SG Total': [13, 2, 14, 6, 8, 1, 1, 3, 8, 4, 9, 2, 1]}
    df_original = pd.DataFrame(data)

I would like to get the data into the following format.

    data = {'Player Name': ['Tiger', 'Tiger', 'Tiger', 'Jack', 'Jack', 'Jack', 'Jack'],
            'Date': [6, 7, 9, 6, 9, 10, 11],
            'SG Total (Date t-3)': [13, 2, 14, 1, 3, 8, 4],
            'SG Total (Date t-2)': [2, 14, 6, 3, 8, 4, 9],
            'SG Total (Date t-1)': [14, 6, 8, 8, 4, 9, 2],
            'SG Total (Date t)': [6, 8, 1, 4, 9, 2, 1]}
    df_correct = pd.DataFrame(data)

In the real dataset I am working with, I have roughly 1,000 columns, so the new desired dataset could have about 4,000 columns. As you can see in the desired dataset, I dropped the first 3 weeks for each player. The dates start at each player's 4th week of data, because I use the first 3 weeks to fill in (t-3), (t-2), and (t-1).

I originally created a dataset for every week, regardless of whether a player played that week, and used this code to create the desired DataFrame.

    #%% Creates weekly dataframes & predictions dataframes
    import functools

    # Create one dataframe per week, keyed as 'Week_<i>_df'
    dict_of_weeks = {}
    for i in range(1, df_numeric_combined['Date'].nunique() + 1):
        key = 'Week_{}_df'.format(i)
        # .copy() so the column edits below do not act on a slice of the original frame
        dict_of_weeks[key] = df_numeric_combined[df_numeric_combined['Date'] == i].copy()
        # Tag every column with its week number...
        dict_of_weeks[key].columns += ' (Week ' + str(i) + ')'
        # ...then restore the names of the key columns
        dict_of_weeks[key].rename(columns={'Player Name (Week ' + str(i) + ')': 'Player Name',
                                           'Date (Week ' + str(i) + ')': 'Date'},
                                  inplace=True)

    # Create one prediction dataframe per week from weeks i-3, i-2, i-1 and i
    dict_of_predictions = {}
    df_weeks = []
    for i in range(4, df_numeric_combined['Date'].nunique() + 1):
        dfs = [dict_of_weeks['Week_' + str(i - 3) + '_df'],
               dict_of_weeks['Week_' + str(i - 2) + '_df'],
               dict_of_weeks['Week_' + str(i - 1) + '_df'],
               dict_of_weeks['Week_' + str(i) + '_df']]
        dict_of_predictions['Week_{}_predictions'.format(i)] = functools.reduce(
            lambda left, right: pd.merge(left, right, on=['Player Name'], how='outer'), dfs)

However, this code only works if a player plays in consecutive weeks, because it relies on the week number and subtracts 3, 2, and 1. The end goal is to get the data into the `df_correct` format.
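One way to read the requirement is as a per-player lag: (t-1) means each player's previous appearance rather than the previous calendar week. The following is only a minimal sketch under that assumption, using `groupby(...).shift(k)` on the sample data. The helper name `add_player_lags` and its parameters are hypothetical, and the current-week column is labelled `(Date t)` here. It also assumes each player has at most one row per date.

```python
import pandas as pd

data = {'Player Name': ['Tiger', 'Tiger', 'Tiger', 'Tiger', 'Tiger', 'Tiger',
                        'Jack', 'Jack', 'Jack', 'Jack', 'Jack', 'Jack', 'Jack'],
        'Date': [1, 2, 4, 6, 7, 9, 1, 3, 4, 6, 9, 10, 11],
        'SG Total': [13, 2, 14, 6, 8, 1, 1, 3, 8, 4, 9, 2, 1]}
df_original = pd.DataFrame(data)


def add_player_lags(df, value_cols, n_lags=3,
                    player_col='Player Name', date_col='Date'):
    """Hypothetical helper: lag each stat by the player's own previous appearances."""
    # Work on a clean 0..n-1 index so the original row order can be restored later.
    df = df.reset_index(drop=True)
    # Order rows chronologically; shift() then walks back through each player's own rows.
    ordered = df.sort_values(date_col, kind='stable')
    grouped = ordered.groupby(player_col, sort=False)

    pieces = [ordered[[player_col, date_col]]]
    for k in range(n_lags, 0, -1):
        lagged = grouped[value_cols].shift(k)   # value from the k-th previous appearance
        lagged.columns = [f'{c} (Date t-{k})' for c in value_cols]
        pieces.append(lagged)

    current = ordered[value_cols].copy()
    current.columns = [f'{c} (Date t)' for c in value_cols]
    pieces.append(current)

    # One concat per lag block, rather than inserting columns one at a time.
    out = pd.concat(pieces, axis=1)

    # Keep only rows with at least n_lags earlier appearances for that player.
    enough_history = grouped.cumcount() >= n_lags
    return out[enough_history].sort_index().reset_index(drop=True)


df_lagged = add_player_lags(df_original, ['SG Total'])
print(df_lagged)
```

Because `shift` inserts `NaN` before the rows are filtered, the lagged columns come back as floats. With many stat columns, passing the full list as `value_cols` shifts them all at once per lag, so the number of `concat` pieces stays at `n_lags + 2` regardless of how many columns there are.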