如何融化火花数据帧?是否有相当于PandasMelt功能的PandasMelt函数存在于PySPark或至少在Scala中的ApacheSPark中?到目前为止,我在python中运行了一个示例数据集,现在我想对整个数据集使用SPark。提前谢谢。
3 回答
catspeake
TA贡献1111条经验 获得超0个赞
UPD
from pyspark.sql.functions import explode def melt(df): sp = df.columns[1:] return (df .rdd .map(lambda x: [str(x[0]), [(str(i[0]), float(i[1] if i[1] else 0)) for i in zip(sp, x[1:])]], preservesPartitioning = True) .toDF() .withColumn('_2', explode('_2')) .rdd.map(lambda x: [str(x[0]), str(x[1][0]), float(x[1][1] if x[1][1] else 0)], preservesPartitioning = True) .toDF() )
columns=['a', 'b', 'c', 'd', 'e', 'f'] pd_df = pd.DataFrame([[1,2,3,4,5,6], [4,5,6,7,9,8], [7,8,9,1,2,4], [8,3,9,8,7,4]], columns=columns) df = spark.createDataFrame(pd_df) +---+---+---+---+---+---+ | a| b| c| d| e| f| +---+---+---+---+---+---+ | 1| 2| 3| 4| 5| 6| | 4| 5| 6| 7| 9| 8| | 7| 8| 9| 1| 2| 4| | 8| 3| 9| 8| 7| 4| +---+---+---+---+---+---+ cols = df.columns[1:] df.selectExpr('a', "stack({}, {})".format(len(cols), ', '.join(("'{}', {}".format(i, i) for i in cols)))) +---+----+----+ | a|col0|col1| +---+----+----+ | 1| b| 2| | 1| c| 3| | 1| d| 4| | 1| e| 5| | 1| f| 6| | 4| b| 5| | 4| c| 6| | 4| d| 7| | 4| e| 9| | 4| f| 8| | 7| b| 8| | 7| c| 9| ...
添加回答
举报
0/150
提交
取消