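Your logic is more complicated than it needs to be. The same result can be achieved in two steps:
1. Delete the rows whose index is not in the rawIdx list. I use a trick for this: mark those rows as NaN so that dropna() removes them.
2. shift() the column.
This performs well: a fraction of a second on a data set of more than 0.5 million rows.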
import datetime as dt
import random
import time

import numpy as np
import pandas as pd

# Build sample timestamps every 128 s, randomly skipping about half of them.
d = [d for d in pd.date_range(dt.datetime(2015, 5, 1, 2),
                              dt.datetime(2020, 5, 1, 4), freq="128s")
     if random.randint(0, 3) < 2]  # miss some sample times...
# random manipulation of rawIdx so there are some rows where ts is not in rawIdx
df = pd.DataFrame({"ts": d,
                   "rawIdx": [x if random.randint(0, 3) <= 2
                              else x + pd.Timedelta(1, unit="s") for x in d],
                   "val": [random.randint(0, 50) for x in d]}).set_index("ts")
start = time.time()
print(f"size before: {len(df)}")
dfc = df.assign(
    # float64 so the column can hold NaN: rows whose ts is not in rawIdx get NaN
    # and are then removed by dropna()
    issue=lambda dfa: np.array(np.where(dfa.index.isin(dfa["rawIdx"]), True, np.nan), dtype="float64"),
).dropna().drop(columns="issue").assign(
    # after dropna(), rawIdx equals the index on every remaining row.
    # Note: this shift runs on the unfiltered df and is aligned back on the index,
    # so the "next" value can be the rawIdx of a row that was dropped.
    nextloc_ixS=df.rawIdx.shift(-1)
)
print(f"size after: {len(dfc)}\ntime: {time.time()-start:.2f}s\n\n{dfc.head().to_string()}")
Output
size before: 616264
size after: 462207
time: 0.13s
                                 rawIdx  val         nextloc_ixS
ts
2015-05-01 02:02:08 2015-05-01 02:02:08   33 2015-05-01 02:06:24
2015-05-01 02:06:24 2015-05-01 02:06:24   40 2015-05-01 02:08:33
2015-05-01 02:10:40 2015-05-01 02:10:40   15 2015-05-01 02:12:48
2015-05-01 02:12:48 2015-05-01 02:12:48   45 2015-05-01 02:17:04
2015-05-01 02:17:04 2015-05-01 02:17:04   14 2015-05-01 02:21:21
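In the sample output, some nextloc_ixS values (02:08:33, 02:21:21) are not themselves present in the filtered index, because shift() was applied to the original df rather than to the filtered frame. If what you want is the next surviving timestamp, a minimal sketch of the same two steps might look like the following. The four-row frame is hypothetical data made up for illustration, and the plain boolean mask stands in for the NaN/dropna() trick; it is one possible variant, not necessarily what the original question needs.

```python
import pandas as pd

# Hypothetical toy data: the third row's rawIdx is off by one second,
# so its ts does not appear anywhere in rawIdx.
ts = pd.DatetimeIndex([
    "2015-05-01 02:02:08",
    "2015-05-01 02:06:24",
    "2015-05-01 02:08:32",
    "2015-05-01 02:10:40",
], name="ts")
raw_idx = list(ts)
raw_idx[2] = raw_idx[2] + pd.Timedelta(1, unit="s")
df = pd.DataFrame({"rawIdx": raw_idx, "val": [33, 40, 7, 15]}, index=ts)

# Step 1: keep only the rows whose index value occurs somewhere in rawIdx.
kept = df[df.index.isin(df["rawIdx"])]

# Step 2: shift on the filtered frame (the lambda receives `kept`),
# so each row gets the rawIdx of the next surviving row; the last row gets NaT.
result = kept.assign(nextloc_ixS=lambda dfa: dfa["rawIdx"].shift(-1))

print(result.to_string())
```

Passing a lambda to assign() matters here: it is evaluated against the frame being assigned to, whereas a Series computed from the original df is aligned on the index and can carry values from rows that were dropped.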