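From my understanding of how your code works, it takes so long because it runs in O(n^c) time: for every index, it has to traverse the entire dataset several times to check the conditions.
So it is best to avoid traversing the whole dataset for each index, i.e. to make it work in O(n) linear time. To do that, I would do the following: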
import pandas as pd
from pandas import Timestamp
import datetime
data_copy = pd.DataFrame(data = {
'sensor_id': {
0: 1385001, 1: 1385001, 2: 1385001, 3: 1385001, 4: 1385001, 5: 1385001,
6: 1385001, 7: 1385001, 8: 1385001, 9: 1385001},
'label': {
0: 50.79999923706055, 1: 52.69230651855469, 2: 50.0, 3: 48.61538314819336,
4: 48.0, 5: 47.90909194946289, 6: 51.41666793823242, 7: 48.3684196472168,
8: 49.8636360168457, 9: 48.66666793823242},
'avg5': {
0: 49.484848, 1: 51.735294, 2: 51.59375, 3: 49.266666,
4: 50.135135999999996, 5: 50.5, 6: 50.8, 7: 52.69230699999999,
8: 50.0, 9: 48.615383},
'timestamp5': {
0: Timestamp('2014-08-01 00:00:00'), 1: Timestamp('2014-08-01 00:05:00'),
2: Timestamp('2014-08-01 00:10:00'), 3: Timestamp('2014-08-01 00:15:00'),
4: Timestamp('2014-08-01 00:20:00'), 5: Timestamp('2014-08-01 00:25:00'),
6: Timestamp('2014-08-01 00:30:00'), 7: Timestamp('2014-08-01 00:35:00'),
8: Timestamp('2014-08-01 00:40:00'), 9: Timestamp('2014-08-01 00:45:00')}})
hours_added = datetime.timedelta(minutes = 40)
# Create a data series that combines the information about sensor_id & timestamp5
sen_time = data_copy['sensor_id'].astype(str) + data_copy['timestamp5'].astype(str)
# Create a dictionary of the corresponding { sensor_id + timestamp5 : avg5 } values
dictionary = pd.Series(data_copy['avg5'].values, sen_time).to_dict()
# Create a data series combining the timestamp5 + 40 mins information
timePlus40 = data_copy['timestamp5'] + hours_added
# Create a mapping column that combines the sensor_id & timestamp5+40mins
sensor_timePlus40 = (data_copy['sensor_id'].astype(str) + timePlus40.astype(str))
# Create a new_label series by mapping the dictionary onto sensor_timePlus40
new_label = sensor_timePlus40.map(dictionary)
# Extract indices where this series has non-NaN values
where = new_label.notnull()
# Replace the values in the 'label' column with only non-NaN new_label values
data_copy.loc[where, 'label'] = new_label.loc[where]
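I believe this is similar to the ideas that @pecey and @iracebeth_18 proposed in the comments.
This edited version reflects the OP's wish (from the comments) to update the label column only where the new value is non-NaN.
The result looks like this: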
> print(data_copy)
   sensor_id      label       avg5          timestamp5
0    1385001  50.000000  49.484848 2014-08-01 00:00:00
1    1385001  48.615383  51.735294 2014-08-01 00:05:00
2    1385001  50.000000  51.593750 2014-08-01 00:10:00
3    1385001  48.615383  49.266666 2014-08-01 00:15:00
4    1385001  48.000000  50.135136 2014-08-01 00:20:00
5    1385001  47.909092  50.500000 2014-08-01 00:25:00
6    1385001  51.416668  50.800000 2014-08-01 00:30:00
7    1385001  48.368420  52.692307 2014-08-01 00:35:00
8    1385001  49.863636  50.000000 2014-08-01 00:40:00
9    1385001  48.666668  48.615383 2014-08-01 00:45:00
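As a variation on the dictionary lookup, the same update could probably also be written as a left self-merge, which avoids converting the keys to strings. This is only a sketch: the function name `update_label_with_future_avg` and the helper column `future_avg5` are made up for illustration, the label values are rounded versions of the sample data, and it assumes each (sensor_id, timestamp5) pair occurs only once (duplicates would multiply rows in the merge).

```python
import pandas as pd

# Sample data in the same shape as above (labels rounded for brevity).
data = pd.DataFrame({
    'sensor_id': [1385001] * 10,
    'label': [50.8, 52.69, 50.0, 48.62, 48.0, 47.91, 51.42, 48.37, 49.86, 48.67],
    'avg5': [49.484848, 51.735294, 51.59375, 49.266666, 50.135136,
             50.5, 50.8, 52.692307, 50.0, 48.615383],
    'timestamp5': pd.date_range('2014-08-01 00:00', periods=10, freq='5min'),
})


def update_label_with_future_avg(df, minutes=40):
    """Replace 'label' with the avg5 recorded `minutes` later for the same sensor.

    Rows without a matching later record keep their original label.
    Assumes (sensor_id, timestamp5) pairs are unique.
    """
    # Shift the lookup timestamps back so that each avg5 lines up with
    # the row it is supposed to update.
    future = (
        df[['sensor_id', 'timestamp5', 'avg5']]
        .assign(timestamp5=lambda d: d['timestamp5'] - pd.Timedelta(minutes=minutes))
        .rename(columns={'avg5': 'future_avg5'})
    )
    # A left merge keeps every original row, in the original order.
    merged = df.merge(future, on=['sensor_id', 'timestamp5'], how='left')
    # Use the future value where it exists, otherwise keep the old label.
    merged['label'] = merged['future_avg5'].fillna(merged['label'])
    return merged.drop(columns='future_avg5')


print(update_label_with_future_avg(data))
```

Note that the merge returns a new frame with a fresh 0..n-1 index rather than modifying `data` in place, so this suits data whose index carries no meaning.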
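Timing the dictionary-based code above against your original code with timeit shows a faster runtime, and the gap will only widen as the dataset grows.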
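To check a timing claim like this yourself, a comparison could be sketched roughly as below. Since the original code is not shown here, `naive_update` is only a hypothetical stand-in that rescans the whole frame for every row; `make_data` and `mapped_update` are likewise illustrative names, and the synthetic data is random rather than real sensor readings.

```python
import timeit

import numpy as np
import pandas as pd


def make_data(n, seed=0):
    """Build a synthetic single-sensor frame with 5-minute timestamps."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'sensor_id': np.full(n, 1385001),
        'label': rng.uniform(45, 55, n),
        'avg5': rng.uniform(45, 55, n),
        'timestamp5': pd.date_range('2014-08-01', periods=n, freq='5min'),
    })


def naive_update(df, minutes=40):
    """Hypothetical slow version: one full scan of the frame per row."""
    out = df.copy()
    delta = pd.Timedelta(minutes=minutes)
    for i in out.index:
        match = df[(df['sensor_id'] == df.at[i, 'sensor_id'])
                   & (df['timestamp5'] == df.at[i, 'timestamp5'] + delta)]
        if not match.empty:
            out.at[i, 'label'] = match['avg5'].iloc[0]
    return out


def mapped_update(df, minutes=40):
    """Dictionary-lookup version, following the approach in this answer."""
    out = df.copy()
    key = out['sensor_id'].astype(str) + out['timestamp5'].astype(str)
    lookup = pd.Series(out['avg5'].values, key).to_dict()
    shifted = out['timestamp5'] + pd.Timedelta(minutes=minutes)
    new_label = (out['sensor_id'].astype(str) + shifted.astype(str)).map(lookup)
    where = new_label.notnull()
    out.loc[where, 'label'] = new_label.loc[where]
    return out


data = make_data(300)
# Both versions should produce the same frame before timing means anything.
assert naive_update(data).equals(mapped_update(data))

t_naive = timeit.timeit(lambda: naive_update(data), number=1)
t_mapped = timeit.timeit(lambda: mapped_update(data), number=1)
print(f"naive: {t_naive:.4f} s, mapped: {t_mapped:.4f} s")
```

Increasing `n` in `make_data` is a simple way to see how the two timings scale; the exact numbers will depend on the machine and the pandas version.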