
PySpark: search for keywords with a regular expression, then join with another DataFrame

一只名叫tom的猫 2023-02-22 13:53:43
I have two DataFrames.

DataFrame A

    name       groceries
    Mike       apple, orange, banana, noodle, red wine
    Kate       white wine, green beans, extra pineapple hawaiian pizza
    Leah       red wine, juice, rice, grapes, green beans
    Ben        water, spaghetti

DataFrame B

    id       item
    0001     red wine
    0002     green beans

I go through B row by row and use a regular expression to check whether each item appears in the groceries column of A:

    import re
    import pyspark.sql.functions as F

    df = None
    for keyword in B.select('item').rdd.flatMap(lambda x: x).collect():
        if keyword is None:
            continue
        # Build one lookahead per word, so every word must appear as a whole word.
        pattern = '(?i)^'
        start = '(?=.*\\b'
        end = '\\b)'
        for word in re.split('\\s+', keyword):
            pattern = pattern + start + word + end
        pattern = pattern + '.*$'

        # Keep the matching rows of A and tag them with the keyword.
        if df is None:
            df = A.filter(A['groceries'].rlike(pattern)).withColumn('item', F.lit(keyword))
        else:
            df = df.unionAll(A.filter(A['groceries'].rlike(pattern)).withColumn('item', F.lit(keyword)))

The output I want is the rows of A that contain an item from B, with the item keyword added as a new column:

    name       groceries                                                     item
    Mike       apple, orange, banana, noodle, red wine                       red wine
    Leah       red wine, juice, rice, grapes, green beans                    red wine
    Kate       white wine, green beans, extra pineapple hawaiian pizza       green beans
    Leah       red wine, juice, rice, grapes, green beans                    green beans

The actual output is not what I want, and I don't understand what is wrong with this approach. I would also like to know whether there is a way to join A and B directly with rlike, so that rows are joined only when an item from B appears in the groceries of A. Thanks!
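To make the pattern logic in the question easier to inspect, here is a minimal sketch that builds the same lookahead pattern in plain Python and tries it on the sample rows. It uses Python's `re` module as a stand-in for Spark's `rlike` (which uses Java regular expressions), so treat it as an approximation; `build_pattern` and the sample dictionary are names introduced here for illustration only.

```python
import re

def build_pattern(keyword: str) -> str:
    """Rebuild the question's pattern: one whole-word lookahead per word."""
    pattern = "(?i)^"
    for word in re.split(r"\s+", keyword.strip()):
        pattern += r"(?=.*\b" + word + r"\b)"
    return pattern + ".*$"

# Sample rows copied from DataFrame A in the question.
groceries_by_name = {
    "Mike": "apple, orange, banana, noodle, red wine",
    "Kate": "white wine, green beans, extra pineapple hawaiian pizza",
    "Leah": "red wine, juice, rice, grapes, green beans",
    "Ben": "water, spaghetti",
}

def names_matching(keyword: str) -> list:
    pattern = build_pattern(keyword)
    return [n for n, g in groceries_by_name.items() if re.search(pattern, g)]

matches_red_wine = names_matching("red wine")        # ['Mike', 'Leah']
matches_green_beans = names_matching("green beans")  # ['Kate', 'Leah']

# Caveat: each word is checked independently, so the words need not be adjacent.
loose_hit = re.search(build_pattern("red wine"), "red apple, white wine") is not None
# A phrase pattern (an alternative, not the question's code) keeps the words together.
phrase_hit = re.search(r"(?i)\bred\s+wine\b", "red apple, white wine") is not None
```

On the sample data the pattern selects the expected rows, but because each word gets its own lookahead, a row such as "red apple, white wine" would also count as containing "red wine". That may be one source of unexpected rows on a larger dataset.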

1 Answer

慕尼黑的夜晚无繁华

Contributed 1864 experience points, received more than 6 likes

A like-style join is possible with F.expr(). In your case you need to combine it with an inner join, using rlike as the join condition. Try this:


    # %%
    import pyspark.sql.functions as F

    # Assumes an active SparkSession named `spark`.
    # One row per shopper; groceries is a comma-separated string.
    test1 = spark.createDataFrame([("Mike","apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)" ),("kate","Whitewine,greenbeans,pineapple"),("Ben","Water,Spaghetti")],schema=["name","groceries"])

    # Items to search for.
    test2 = spark.createDataFrame([("001","redwine"),("002","greenbeans"),("003","cd")],schema=["id","item"])

    # %%
    # Inner join that keeps each pair where groceries matches item as a regex.
    test_join = test1.join(test2, F.expr("""groceries rlike item"""), how='inner')

Result:

    test_join.show(truncate=False)

    +----+-------------------------------------------------------------------------------------------------+---+----------+
    |name|groceries                                                                                        |id |item      |
    +----+-------------------------------------------------------------------------------------------------+---+----------+
    |Mike|apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)|001|redwine   |
    |Mike|apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)|002|greenbeans|
    |Mike|apple,greenbeans,redwine,the little prince 70th anniversary gift set (book/cd/downloadable audio)|003|cd        |
    |kate|Whitewine,greenbeans,pineapple                                                                   |002|greenbeans|
    +----+-------------------------------------------------------------------------------------------------+---+----------+
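To see what the rlike join above computes, here is a rough plain-Python sketch of the same idea: pair every row of the first table with every row of the second and keep the pairs where the item, read as a regular expression, is found somewhere in the groceries string. It uses Python's `re.search` as a stand-in for Spark's `rlike` (Java regex), and `regex_inner_join` is a hypothetical helper, not a Spark API.

```python
import re
from itertools import product

# Same sample rows as test1 and test2 in the answer.
test1 = [
    ("Mike", "apple,greenbeans,redwine,the little prince 70th anniversary "
             "gift set (book/cd/downloadable audio)"),
    ("kate", "Whitewine,greenbeans,pineapple"),
    ("Ben", "Water,Spaghetti"),
]
test2 = [("001", "redwine"), ("002", "greenbeans"), ("003", "cd")]

def regex_inner_join(left, right):
    """Keep every (left, right) pair where the item regex occurs in groceries."""
    return [
        (name, groceries, item_id, item)
        for (name, groceries), (item_id, item) in product(left, right)
        if re.search(item, groceries)  # unanchored search, case-sensitive
    ]

joined = regex_inner_join(test1, test2)
pairs = [(name, item_id) for name, _, item_id, _ in joined]
# pairs == [('Mike', '001'), ('Mike', '002'), ('Mike', '003'), ('kate', '002')]
```

Two details are visible here: the match is an unanchored substring search, so "cd" matches inside "(book/cd/downloadable audio)", and it is case-sensitive, so "Whitewine" does not match "redwine". Unlike this list, a Spark join does not guarantee row order.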

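For your more complex dataset, where items can contain characters that have a special meaning in a regular expression, the contains() function should work, because it compares the item as a literal substring: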


    import pyspark.sql.functions as F

    # Assumes an active SparkSession named `spark`.
    # Groceries now include multi-word items and an item with parentheses and slashes.
    test1 = spark.createDataFrame([("Mike","apple, oranges, red wine,green beans"),("Kate","Whitewine, green beans waterrr, pineapple, red wine"), ("Leah", "red wine, juice, rice, grapes, green beans"),("Ben","Water,Spaghetti, the little prince 70th anniversary gift set (book/cd/downloadable audio)")],schema=["name","groceries"])

    # Items to search for, matched as literal substrings.
    test2 = spark.createDataFrame([("001","red wine"),("002","green beans waterrr"), ("003", "the little prince 70th anniversary gift set (book/cd/downloadable audio)")],schema=["id","item"])

    # %%
    # Inner join that keeps each pair where groceries contains item verbatim.
    test_join = test1.join(test2, F.col('groceries').contains(F.col('item')), how='inner')

Result:

    +----+-----------------------------------------------------------------------------------------+---+------------------------------------------------------------------------+
    |name|groceries                                                                                |id |item                                                                    |
    +----+-----------------------------------------------------------------------------------------+---+------------------------------------------------------------------------+
    |Mike|apple, oranges, red wine,green beans                                                     |001|red wine                                                                |
    |Kate|Whitewine, green beans waterrr, pineapple, red wine                                      |001|red wine                                                                |
    |Kate|Whitewine, green beans waterrr, pineapple, red wine                                      |002|green beans waterrr                                                     |
    |Leah|red wine, juice, rice, grapes, green beans                                               |001|red wine                                                                |
    |Ben |Water,Spaghetti, the little prince 70th anniversary gift set (book/cd/downloadable audio)|003|the little prince 70th anniversary gift set (book/cd/downloadable audio)|
    +----+-----------------------------------------------------------------------------------------+---+------------------------------------------------------------------------+
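The difference between the two join conditions shows up on the last item, which contains parentheses. The following sketch, again using Python's `re` as an approximation of Spark's Java-regex `rlike`, compares reading the item as a pattern with reading it as literal text. The variable names are illustrative only.

```python
import re

item = "the little prince 70th anniversary gift set (book/cd/downloadable audio)"
groceries = "Water,Spaghetti, " + item

# Read as a regex, "(...)" is a capture group, so the literal parentheses
# in the groceries string are never matched.
as_regex = re.search(item, groceries) is not None               # False

# Literal substring test, analogous to what contains() does.
as_literal = item in groceries                                  # True

# Escaping the metacharacters makes the regex behave like a literal match.
as_escaped = re.search(re.escape(item), groceries) is not None  # True
```

So a plain regex match can silently drop rows whose items contain metacharacters such as parentheses, while a literal comparison keeps them. The trade-off is that a literal substring test gives up regex features such as word boundaries and case-insensitive matching.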



Answered 2023-02-22