How to read streaming nested JSON from Kafka in Spark using Java

Java

湖上湖 2022-09-14 10:29:12

I am trying to read complex nested JSON data from Kafka in Spark using Java, and I am having trouble forming the Dataset.

The actual JSON sent to Kafka:

{"sample_title": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title2": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}
{"sample_title3": {"txn_date": "2019-01-10","timestamp": "2019-02-01T08:57:18.100Z","txn_type": "TBD","txn_rcvd_time": "01/04/2019 03:32:32.135","txn_ref": "Test","txn_status": "TEST"}}

Dataset<Row> df = spark.readStream().format("kafka")
    .option("spark.local.dir", config.getString(PropertyKeys.SPARK_APPLICATION_TEMP_LOCATION.getCode()))
    .option("kafka.bootstrap.servers", config.getString(PropertyKeys.KAFKA_BOORTSTRAP_SERVERS.getCode()))
    .option("subscribe", config.getString(PropertyKeys.KAFKA_TOPIC_IPE_STP.getCode()))
    .option("startingOffsets", "earliest")
    .option("spark.default.parallelism", config.getInt(PropertyKeys.SPARK_APPLICATION_DEFAULT_PARALLELISM_VALUE.getCode()))
    .option("spark.sql.shuffle.partitions", config.getInt(PropertyKeys.SPARK_APPLICATION_SHUFFLE_PARTITIONS_COUNT.getCode()))
    .option("kafka.security.protocol", config.getString(PropertyKeys.SECURITY_PROTOCOL.getCode()))

val output = df.selectExpr("CAST(value AS STRING)").as(Encoders.STRING()).filter(x -> x.contains("sample_title"));

Since the input can contain multiple schemas, the code should be able to handle that, filter by title, and map the records to a Dataset of type Title.


1 Answer

杨__羊羊


First, make the Title class a Java bean, i.e. write getters and setters for its fields.

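import java.io.Serializable;
import java.sql.Timestamp;

public class Title implements Serializable {
    private String txn_date;
    private Timestamp timestamp;
    private String txn_type;
    private String txn_rcvd_time;
    private String txn_ref;
    private String txn_status;

    // Encoders.bean instantiates the bean through a public no-argument constructor when deserializing
    public Title() {}

    public Title(String data) {
        ... // set values for fields with the data
    }

    // add all getters and setters for fields
}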
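The body of Title(String data) is elided above. As a minimal sketch of one way it could be filled in, the version below pulls the string fields out of the wrapped JSON with a regular expression from the standard library. This is an assumption for illustration, not the answerer's implementation: it only handles flat "key": "value" pairs without escaped quotes, as in the sample records, and a real job would more likely use a JSON library. The STRING_FIELD pattern and the ISO-8601 parsing of timestamp are choices made here, not something the original specifies.

```java
import java.io.Serializable;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Title bean with a parsing constructor.
class Title implements Serializable {
    // Matches "key": "value" pairs whose value is a plain string without escaped quotes.
    // The outer wrapper key ("sample_title": {...}) is skipped because its value is an object.
    private static final Pattern STRING_FIELD =
            Pattern.compile("\"(\\w+)\"\\s*:\\s*\"([^\"]*)\"");

    private String txn_date;
    private Timestamp timestamp;
    private String txn_type;
    private String txn_rcvd_time;
    private String txn_ref;
    private String txn_status;

    // No-argument constructor for bean-style instantiation.
    public Title() {}

    public Title(String data) {
        Map<String, String> fields = new HashMap<>();
        Matcher m = STRING_FIELD.matcher(data);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        this.txn_date = fields.get("txn_date");
        // Assumes the timestamp field is ISO-8601 in UTC, as in the sample records.
        String ts = fields.get("timestamp");
        this.timestamp = (ts == null) ? null : Timestamp.from(Instant.parse(ts));
        this.txn_type = fields.get("txn_type");
        this.txn_rcvd_time = fields.get("txn_rcvd_time");
        this.txn_ref = fields.get("txn_ref");
        this.txn_status = fields.get("txn_status");
    }

    public String getTxn_date() { return txn_date; }
    public void setTxn_date(String v) { this.txn_date = v; }
    public Timestamp getTimestamp() { return timestamp; }
    public void setTimestamp(Timestamp v) { this.timestamp = v; }
    public String getTxn_type() { return txn_type; }
    public void setTxn_type(String v) { this.txn_type = v; }
    public String getTxn_rcvd_time() { return txn_rcvd_time; }
    public void setTxn_rcvd_time(String v) { this.txn_rcvd_time = v; }
    public String getTxn_ref() { return txn_ref; }
    public void setTxn_ref(String v) { this.txn_ref = v; }
    public String getTxn_status() { return txn_status; }
    public void setTxn_status(String v) { this.txn_status = v; }
}
```

Fields that are missing from a record are simply left null, so one bean can tolerate records that carry only some of the keys.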

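Dataset<Title> resultdf = df.selectExpr("CAST(value AS STRING)")
    .as(Encoders.STRING()) // convert Dataset<Row> to Dataset<String> before mapping
    .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));

// apply any predicate on title; the status check is only an example
resultdf.filter((FilterFunction<Title>) title -> "TEST".equals(title.getTxn_status()));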

If you want to filter the data first and then apply the encoder:

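// requires static imports of col and get_json_object from org.apache.spark.sql.functions
df.selectExpr("CAST(value AS STRING)")
    .filter(get_json_object(col("value"), "$.sample_title").isNotNull())
    .as(Encoders.STRING())
    // for a simple filter, use .filter((FilterFunction<String>) t -> t.contains("sample_title")) here instead
    .map((MapFunction<String, Title>) value -> new Title(value), Encoders.bean(Title.class));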
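One caveat about the simple contains("sample_title") filter: with the sample records from the question it also lets through the records keyed sample_title2 and sample_title3, because their keys start with the same text, whereas the "$.sample_title" JSON path only matches the exact key. A minimal sketch of a stricter string predicate is shown below. TopLevelKeyFilter and startsWithKey are hypothetical names introduced here, and the check assumes each record is a JSON object whose first key is the title.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TopLevelKeyFilter {
    // Returns true when the record is a JSON object whose first key is exactly `key`.
    static boolean startsWithKey(String json, String key) {
        Pattern p = Pattern.compile("^\\s*\\{\\s*\"" + Pattern.quote(key) + "\"\\s*:");
        return p.matcher(json).find();
    }
}
```

Such a predicate could be called from the typed string filter in place of contains, for example t -> TopLevelKeyFilter.startsWithKey(t, "sample_title").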

Answered 2022-09-14
