
Creating time intervals that do not exist in a DataFrame

动漫人物 2022-05-24 15:42:20

I have detailed machine error / machine stop data with plant, station, machine, start datetime and end datetime. I want to use Python/pandas to create the time intervals during which the machine was running normally. In other words, I want a 24-hour timeline in which every interval is labelled either working (no error occurred) or not working.

The DataFrame for 1 station (of 17), 1 machine type (of 10) and 1 day looks like this:

Stat.  Mac.  start_date           end_date             start_no  end_no  status
A      B     2019-01-03 00:00:00  2019-01-03 01:30:00  1         90      pause
A      B     2019-01-03 09:35:00  2019-01-03 10:20:00  575       620     pause
A      B     2019-01-03 20:20:00  2019-01-03 20:40:00  1220      1240    pause
A      B     2019-01-03 21:45:00  2019-01-03 22:45:00  1305      1365    pause

For the same station-machine-day combination, the requested DataFrame should look like this:

Stat.  Mac.  start_date           end_date             start_no  end_no  status
A      B     2019-01-03 00:00:00  2019-01-03 00:00:01  0         1       working
A      B     2019-01-03 00:00:00  2019-01-03 01:30:00  1         90      pause
A      B     2019-01-03 01:30:00  2019-01-03 09:35:00  90        575     working
A      B     2019-01-03 09:35:00  2019-01-03 10:20:00  575       620     pause
A      B     2019-01-03 10:20:00  2019-01-03 20:20:00  620       1220    working
A      B     2019-01-03 20:20:00  2019-01-03 20:40:00  1220      1240    pause
A      B     2019-01-03 20:40:00  2019-01-03 21:45:00  1240      1305    working
A      B     2019-01-03 21:45:00  2019-01-03 22:45:00  1305      1365    pause
A      B     2019-01-03 22:45:00  2019-01-03 23:59:00  1365      1439    working

I uploaded a sample DataFrame (1000 rows, ~80 KB) at the link below:

https://gofile.io/?c=tKA8Qj

How should I solve this problem? Thanks in advance.


2 Answers

宝慕林4294392


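This problem has a sequential pattern: the "start_no" and "end_no" columns of the input can be turned into the columns of the desired DataFrame. If we lay the values out as (start_no0, end_no0, start_no1, end_no1, ...), we already have most of the required "start_no" and "end_no" values. A small fix-up then gives exactly the required columns. The same logic applies to start_date and end_date, since they represent the same thing.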
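As a minimal sketch of the interleaving idea, using the minute values from the question's example: the day bounds 0 and 1439 are assumptions read off the question's expected output, not something this answer states.

```python
import numpy as np

# Pause boundaries in minutes since midnight, taken from the question's example.
start_no = np.array([1, 575, 1220, 1305])
end_no = np.array([90, 620, 1240, 1365])

# Interleave to (start_no0, end_no0, start_no1, end_no1, ...).
edges = np.column_stack([start_no, end_no]).ravel()

# Assumed day bounds: 0 and 1439 (first and last values in the expected output).
edges = np.concatenate([[0], edges, [1439]])

# Consecutive pairs of edges are the intervals; they alternate working/pause.
intervals = [(int(a), int(b)) for a, b in zip(edges[:-1], edges[1:])]
status = ["working" if i % 2 == 0 else "pause" for i in range(len(intervals))]

for (lo, hi), s in zip(intervals, status):
    print(lo, hi, s)
```

The alternation only holds if the pauses within a group are sorted and do not overlap.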

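Because you have different station and machine values, we can split the problem into groups by indexing on Stat., Mac., start_date and end_date. In the code, I drop the time part of the original datetime fields so that all rows of the same day fall into one group. Basically, I group the data and iterate over each group to build a new DataFrame containing the information you want.

For the case you shared, the code looks like this: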

    import numpy as np
    import pandas as pd

    data = pd.read_excel("sample_2.xlsx")

    # keep only the date part of (start|end)_date, without the time
    data["_sDate"] = data["start_date"].dt.strftime("%Y-%m-%d")
    data["_eDate"] = data["end_date"].dt.strftime("%Y-%m-%d")

    # group the data by station, machine and day
    grouped = data.groupby(["Station", "Machine", "_sDate", "_eDate"])

    # container for storing the result of each group
    container = []

    # iterate over the groups
    for (station, machine, _, _), group in grouped:
        # sort the pauses by start_number
        group = group.sort_values("start_number")

        # flatten to (start0, end0, start1, end1, ...)
        nums = group[["start_number", "end_number"]].values.flatten()
        dates = group[["start_date", "end_date"]].values.flatten()

        ## insert the values needed to show the first working interval
        # its end is nums[0] seconds after dates[0]
        # (the same offset as nums[0]*10**9 nanoseconds)
        dates = np.insert(dates, 1, dates[0] + np.timedelta64(int(nums[0]), "s"))
        # its start number is 0
        nums = np.insert(nums, 0, 0)

        # build the DataFrame for this group
        nrow = nums.size - 1  # one row per pair of consecutive values
        newdf = pd.DataFrame({
            # use the group's own keys instead of hard-coding "A" and "B"
            "Station": station,
            "Machine": machine,
            "start_date": dates[:-1],
            "end_date": dates[1:],
            "start_no": nums[:-1],
            "end_no": nums[1:],
            "status": np.tile(["working", "pause"], nrow // 2),
        })
        container.append(newdf)

    # combine all groups and renumber the rows from 0
    df_final = pd.concat(container, ignore_index=True)
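The code above stops at the last pause of each group, so the final working interval of the day (22:45 onward in the question's example) is not produced. A possible variant that covers the whole day is sketched below. It assumes the question's column names (Stat., Mac., start_date, end_date, status), pauses that neither overlap nor cross midnight, and a day that ends at the next midnight rather than at 23:59. The function name add_working_intervals is hypothetical.

```python
import pandas as pd

def add_working_intervals(pauses, keys=("Stat.", "Mac.")):
    """Return the pause rows plus 'working' rows that fill the gaps of each day."""
    keys = list(keys)
    pauses = pauses.assign(day=pauses["start_date"].dt.normalize())
    pieces = []
    for group_key, g in pauses.groupby(keys + ["day"]):
        g = g.sort_values("start_date")
        day_start = group_key[-1]
        day_end = day_start + pd.Timedelta(days=1)  # assumption: day ends at midnight
        # Gap i runs from the end of pause i-1 (or midnight) to the start of pause i.
        gaps = pd.DataFrame({
            "start_date": [day_start, *g["end_date"]],
            "end_date": [*g["start_date"], day_end],
        })
        # Drop zero-length gaps, e.g. when the first pause starts at midnight.
        gaps = gaps[gaps["end_date"] > gaps["start_date"]].copy()
        for col, val in zip(keys, group_key):
            gaps[col] = val
        gaps["status"] = "working"
        cols = keys + ["start_date", "end_date", "status"]
        pieces.append(pd.concat([g[cols], gaps[cols]]))
    return (pd.concat(pieces)
              .sort_values(keys + ["start_date"])
              .reset_index(drop=True))

# The four pause rows from the question.
pauses = pd.DataFrame({
    "Stat.": ["A"] * 4,
    "Mac.": ["B"] * 4,
    "start_date": pd.to_datetime(["2019-01-03 00:00:00", "2019-01-03 09:35:00",
                                  "2019-01-03 20:20:00", "2019-01-03 21:45:00"]),
    "end_date": pd.to_datetime(["2019-01-03 01:30:00", "2019-01-03 10:20:00",
                                "2019-01-03 20:20:00".replace("20:20", "20:40"),
                                "2019-01-03 22:45:00"]),
    "status": ["pause"] * 4,
})
result = add_working_intervals(pauses)
print(result)
```

Working from the timestamps rather than from start_no/end_no avoids having to keep the two representations in sync; the minute numbers can be recomputed from the dates afterwards if they are still needed.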

Answered 2022-05-24

万千封印


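A quick-to-write but slow approach is to iterate over all rows and compare each row with the next one. You only have 1000 rows, so this is fine for now. It would look like this: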

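    import pandas as pd

    df = pd.read_excel("sample_2.xlsx")
    df["status"] = "pause"
    df = df.sort_values(["Workcenter", "Machine", "Error_Reason", "Class",
                         "start_date", "start_time", "end_date", "end_time"]).reset_index(drop=True)

    # collect the "working" rows in a list and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0)
    working_rows = []
    for i in range(len(df) - 1):
        row = df.loc[i]
        next_row = df.loc[i + 1]
        # copy the row so the original pause row is not modified
        new_row = row.copy()
        new_row["status"] = "working"
        # the working interval runs from the end of this pause to the start of the next
        new_row["start_date"] = row["end_date"]
        new_row["end_date"] = next_row["start_date"]
        new_row["start_number"] = row["end_number"]
        new_row["end_number"] = next_row["start_number"]
        working_rows.append(new_row)

    new_df = pd.concat([df, pd.DataFrame(working_rows)], ignore_index=True)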
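If the row loop becomes too slow on larger data, the same pairing of "end of this pause" with "start of the next pause" can be sketched with groupby and shift. This is an illustration rather than the answerer's code: the small frame below is made up (the rows for machine C are hypothetical), and like the loop it only fills the gaps between pauses, not the time before the first or after the last pause.

```python
import pandas as pd

# Illustrative data: machine B rows follow the question, machine C rows are invented.
pauses = pd.DataFrame({
    "Machine": ["B", "B", "B", "C", "C"],
    "start_date": pd.to_datetime([
        "2019-01-03 00:00", "2019-01-03 09:35", "2019-01-03 20:20",
        "2019-01-03 02:00", "2019-01-03 05:00"]),
    "end_date": pd.to_datetime([
        "2019-01-03 01:30", "2019-01-03 10:20", "2019-01-03 20:40",
        "2019-01-03 03:00", "2019-01-03 06:00"]),
})
pauses["status"] = "pause"
pauses = pauses.sort_values(["Machine", "start_date"]).reset_index(drop=True)

# Start of the next pause on the same machine (NaT for each machine's last pause).
next_start = pauses.groupby("Machine")["start_date"].shift(-1)

# One working row per pause that has a successor on the same machine.
working = pd.DataFrame({
    "Machine": pauses["Machine"],
    "start_date": pauses["end_date"],
    "end_date": next_start,
    "status": "working",
}).dropna(subset=["end_date"])

result = (pd.concat([pauses, working])
            .sort_values(["Machine", "start_date"])
            .reset_index(drop=True))
print(result)
```

Shifting within each Machine group keeps the last pause of one machine from being paired with the first pause of the next, which a plain loop over consecutive rows does not check.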

Answered 2022-05-24
