2 回答
TA贡献2012条经验 获得超12个赞
如何加入这 2 个 DataFrame 并返回这样的结果?
df_joined = df1.join(df2, Seq('USER_NAME'), 'left' )
如何获取许可证仍然相关的所有用户的列表?
df_relevant = df_joined
.withColumn('STATUS', when(col('REQUEST_DATE') > col('START_SERVICE') and col('REQUEST_DATE') < col('END_SERVICE'), col('STATUS')).otherwise('No information')
.select('USER_NAME', 'REQUEST_DATE', 'STATUS' )
TA贡献1829条经验 获得超13个赞
您在不正确的字符串值上比较 <= 和 >=。在进行此类比较之前,您应该将它们转换为时间戳。下面的代码对我有用。
顺便说一句..您使用的过滤条件没有给出您在问题中发布的结果。请再次检查。
scala> val df= Seq(("Alex","2018-03-01 00:00:00"),("Alex","2018-09-01 00:00:00"),("Bob","2018-03-01 00:00:00"),("Mark","2018-02-01 00:00:00"),("Mark","2018-07-01 00:00:00"),("Kate","2018-02-01 00:00:00")).toDF("USER_NAME","REQUEST_DATE").withColumn("REQUEST_DATE",to_timestamp('REQUEST_DATE))
df: org.apache.spark.sql.DataFrame = [USER_NAME: string, REQUEST_DATE: timestamp]
scala> df.printSchema
root
|-- USER_NAME: string (nullable = true)
|-- REQUEST_DATE: timestamp (nullable = true)
scala> df.show(false)
+---------+-------------------+
|USER_NAME|REQUEST_DATE |
+---------+-------------------+
|Alex |2018-03-01 00:00:00|
|Alex |2018-09-01 00:00:00|
|Bob |2018-03-01 00:00:00|
|Mark |2018-02-01 00:00:00|
|Mark |2018-07-01 00:00:00|
|Kate |2018-02-01 00:00:00|
+---------+-------------------+
scala> val df2 = Seq(( "Alex","2018-01-01 00:00:00","2018-06-01 00:00:00","Active"),("Bob","2018-01-01 00:00:00","2018-02-01 00:00:00","Not Active"),("Mark","2018-01-01 00:00:00","2018-05-01 23:59:59","Active"),("Mark","2018-05-01 00:00:00","2018-08-01 23:59:59","VIP")).toDF("USER_NAME","START_SERVICE","END_SERVICE","STATUS").withColumn("START_SERVICE",to_timestamp('START_SERVICE)).withColumn("END_SERVICE",to_timestamp('END_SERVICE))
df2: org.apache.spark.sql.DataFrame = [USER_NAME: string, START_SERVICE: timestamp ... 2 more fields]
scala> df2.printSchema
root
|-- USER_NAME: string (nullable = true)
|-- START_SERVICE: timestamp (nullable = true)
|-- END_SERVICE: timestamp (nullable = true)
|-- STATUS: string (nullable = true)
scala> df2.show(false)
+---------+-------------------+-------------------+----------+
|USER_NAME|START_SERVICE |END_SERVICE |STATUS |
+---------+-------------------+-------------------+----------+
|Alex |2018-01-01 00:00:00|2018-06-01 00:00:00|Active |
|Bob |2018-01-01 00:00:00|2018-02-01 00:00:00|Not Active|
|Mark |2018-01-01 00:00:00|2018-05-01 23:59:59|Active |
|Mark |2018-05-01 00:00:00|2018-08-01 23:59:59|VIP |
+---------+-------------------+-------------------+----------+
scala> df.join(df2,Seq("USER_NAME"),"leftOuter").filter(" REQUEST_DATE >= START_SERVICE and REQUEST_DATE <= END_SERVICE").select(df("*"),df2("status")).show(false)
+---------+-------------------+------+
|USER_NAME|REQUEST_DATE |status|
+---------+-------------------+------+
|Alex |2018-03-01 00:00:00|Active|
|Mark |2018-02-01 00:00:00|Active|
|Mark |2018-07-01 00:00:00|VIP |
+---------+-------------------+------+
scala>
添加回答
举报