Parsing a list of JSON strings with a PySpark DataFrame

白板的微信 2023-07-27 10:32:23
I am trying to read a list of JSON strings into a PySpark DataFrame. My input data is below; the goal is a DataFrame with two columns, user (string) and ips (Array[String]).

sampleJson = [
    ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',),
    ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',),
    ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',),
    ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',),
    ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',),
    ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',),
]

Thanks for your help.

1 Answer

汪汪一只猫

Use the from_json function with an explicitly defined schema.


Example:

from pyspark.sql.functions import *
from pyspark.sql.types import *

sampleJson = [
    ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',),
    ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',),
    ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',),
    ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',),
    ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',),
    ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',),
]

# build a single-column DataFrame; the JSON string lands in column "_1"
df1 = spark.createDataFrame(sampleJson)

# target schema: user as string, ips as array of strings
sch = StructType([
    StructField('user', StringType(), False),
    StructField('ips', ArrayType(StringType()))
])

# parse the JSON column and expand the resulting struct into top-level columns
df1.withColumn("n", from_json(col("_1"), sch)).select("n.*").show(10, False)
#+----+--------------------------------------------------------------------+
#|user|ips                                                                 |
#+----+--------------------------------------------------------------------+
#|100 |[191.168.192.101, 191.168.192.103, 191.168.192.96, 191.168.192.99]  |
#|101 |[191.168.192.102, 191.168.192.105, 191.168.192.103, 191.168.192.107]|
#|102 |[191.168.192.105, 191.168.192.101, 191.168.192.105, 191.168.192.107]|
#|103 |[191.168.192.96, 191.168.192.100, 191.168.192.107, 191.168.192.101] |
#|104 |[191.168.192.99, 191.168.192.99, 191.168.192.102, 191.168.192.99]   |
#|105 |[191.168.192.99, 191.168.192.99, 191.168.192.100, 191.168.192.96]   |
#+----+--------------------------------------------------------------------+

# resulting schema
df1.withColumn("n", from_json(col("_1"), sch)).select("n.*").printSchema()
#root
# |-- user: string (nullable = true)
# |-- ips: array (nullable = true)
# |    |-- element: string (containsNull = true)
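
As a variation, if you prefer not to write the schema by hand, you can let Spark infer it from one of the JSON strings with schema_of_json. This is a minimal sketch assuming the snippet above has already run (so spark, df1 and sampleJson are in scope) and that every row shares the shape of the first one; note that user is inferred as bigint, so cast it if you need a string:

from pyspark.sql.functions import from_json, schema_of_json, col, lit

# infer the struct type from a single representative JSON string
inferred_schema = schema_of_json(lit(sampleJson[0][0]))

parsed = (df1
          .withColumn("n", from_json(col("_1"), inferred_schema))
          .select("n.*")
          .withColumn("user", col("user").cast("string")))  # inferred as bigint, cast back to string

parsed.printSchema()
parsed.show(10, False)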


Answered 2023-07-27