PySpark: Insert rows with specific timestamps into a DataFrame

I have the following Spark DataFrame:

id  | time                    | Value
id1 | 2020-02-22 04:57:36.843 | 1.4
id2 | 2020-02-22 04:57:50.850 | 1.7
id3 | 2020-02-22 04:58:02.133 | 1.2

I want to insert rows after each existing row at a fixed time interval (for example, 5 seconds). The output should look like this:

id  | time                    | Value
id1 | 2020-02-22 04:57:36.843 | 1.4
id1 | 2020-02-22 04:57:41.843 |
id1 | 2020-02-22 04:57:46.843 |
id1 | 2020-02-22 04:57:51.843 |
id2 | 2020-02-22 04:57:50.850 | 1.7
id2 | 2020-02-22 04:57:55.850 |
id2 | 2020-02-22 04:58:00.850 |
id2 | 2020-02-22 04:58:05.850 |
id3 | 2020-02-22 04:58:02.133 | 1.2
id3 | 2020-02-22 04:58:07.133 |
id3 | 2020-02-22 04:58:12.133 |
id3 | 2020-02-22 04:58:17.133 |

I tried to do this with a for loop, creating a new DataFrame for each new row and merging it into the existing DataFrame with union, but it did not work. In particular, I could not get the id into the new rows with this approach. Do you know how I can get the desired output?


1 Answer

人到中年有点甜


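Here is my attempt, with some modifications to the data: for example, I could not understand how a seconds value of 62 could exist, so the id3 timestamp in the output below differs from the one in the question.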

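from pyspark.sql import Window
from pyspark.sql.functions import coalesce, explode, expr, lead, to_timestamp

# Order all rows by time so each row can look at the next row's timestamp.
w = Window.orderBy('time')

(df.select('id', 'time')
    # Parse the string column into a timestamp.
    .withColumn('time', to_timestamp('time', 'yyyy-MM-dd HH:mm:ss.SSS'))
    # Next row's time; the last row falls back to its own time + 10 seconds.
    .withColumn('time2', coalesce(lead('time', 1).over(w), expr('time + interval 10 seconds')))
    # 5-second steps from time up to at most time2 + 5 seconds (inclusive).
    .withColumn('seq', expr('sequence(time, time2 + interval 5 seconds, interval 5 seconds)'))
    # One output row per generated timestamp.
    .withColumn('time', explode('seq'))
    .select('id', 'time')
    # Bring back Value for the original rows; generated rows get null, then 0.
    .join(df, ['id', 'time'], 'left')
    .fillna(0)
    .show(20, False))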

+---+-----------------------+-----+
|id |time                   |Value|
+---+-----------------------+-----+
|id1|2020-02-22 04:57:36.843|1.4  |
|id1|2020-02-22 04:57:41.843|0.0  |
|id1|2020-02-22 04:57:46.843|0.0  |
|id1|2020-02-22 04:57:51.843|0.0  |
|id2|2020-02-22 04:57:50.85 |1.7  |
|id2|2020-02-22 04:57:55.85 |0.0  |
|id2|2020-02-22 04:58:00.85 |0.0  |
|id3|2020-02-22 04:57:59.133|1.2  |
|id3|2020-02-22 04:58:04.133|0.0  |
|id3|2020-02-22 04:58:09.133|0.0  |
|id3|2020-02-22 04:58:14.133|0.0  |
+---+-----------------------+-----+
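
The row counts above differ per id (4, 3, 4) because the stop value of each sequence depends on the next row's timestamp. The following is a minimal pure-Python sketch of that arithmetic, not Spark code: the `sequence` helper here is a hypothetical stand-in that only models the inclusive start/stop/step behaviour of Spark SQL's `sequence`, and `expand_until_next` and `expand_fixed` are illustrative names. It uses the timestamps from the answer's output and also shows a fixed-count variant that would give three extra rows per id, as in the question's desired output.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"
STEP = timedelta(seconds=5)


def sequence(start, stop, step):
    """Inclusive range of datetimes; a stand-in modeled on Spark SQL's sequence()."""
    out = []
    current = start
    while current <= stop:
        out.append(current)
        current += step
    return out


# Timestamps as they appear in the answer's output, already sorted by time.
rows = [
    ("id1", datetime.strptime("2020-02-22 04:57:36.843", FMT)),
    ("id2", datetime.strptime("2020-02-22 04:57:50.850", FMT)),
    ("id3", datetime.strptime("2020-02-22 04:57:59.133", FMT)),
]


def expand_until_next(rows):
    """Mirror the answer: stop at (next row's time + 5 s); last row uses time + 10 s."""
    result = []
    for i, (row_id, t) in enumerate(rows):
        nxt = rows[i + 1][1] if i + 1 < len(rows) else t + timedelta(seconds=10)
        result += [(row_id, ts) for ts in sequence(t, nxt + STEP, STEP)]
    return result


def expand_fixed(rows, extra=3):
    """Variant: always add `extra` rows per id, independent of the neighbouring rows."""
    return [
        (row_id, ts)
        for row_id, t in rows
        for ts in sequence(t, t + extra * STEP, STEP)
    ]


def count_per_id(expanded):
    counts = {}
    for row_id, _ in expanded:
        counts[row_id] = counts.get(row_id, 0) + 1
    return counts


print(count_per_id(expand_until_next(rows)))
print(count_per_id(expand_fixed(rows)))
```

In Spark terms, the fixed-count variant would roughly correspond to using `time + interval 15 seconds` as the stop expression, which would make the `time2` column unnecessary. That mapping is an assumption and has not been run against the asker's data.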

2023-06-27
