3 Answers
Contributor with 1853 experience points, more than 6 upvotes
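Here is another approach that uses array_sort together with the Spark equality operator, which handles arrays like any other type, provided the arrays are sorted: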
from pyspark.sql.functions import lit, array, array_sort, array_intersect
target_ar = ["Physics", "Math"]
search_ar = array_sort(array(*[lit(e) for e in target_ar]))
df.where(array_sort(array_intersect(df["Speciality"], search_ar)) == search_ar) \
.show(10, False)
# +-----------+-----------------------------------+
# |Studentname|Speciality |
# +-----------+-----------------------------------+
# |Alex |[Physics, Math, biology] |
# |Sam |[Economics, History, Math, Physics]|
# +-----------+-----------------------------------+
First we find the common elements with array_intersect(df["Speciality"], search_ar), then we use == to compare the sorted arrays.
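To make the logic concrete, here is a minimal plain-Python sketch of what the condition checks for each row. It is not Spark code: `matches` is a hypothetical helper written only for illustration, and the "Kim" row is invented sample data (only Alex and Sam come from the output above).

```python
def matches(speciality, target):
    """Plain-Python model of:
    array_sort(array_intersect(speciality, target)) == array_sort(target)
    """
    wanted = set(target)
    # Like array_intersect: keep the common elements, each one only once
    common = [s for s in dict.fromkeys(speciality) if s in wanted]
    # Sorting both sides makes the comparison independent of element order
    return sorted(common) == sorted(target)


target_ar = ["Physics", "Math"]
rows = {
    "Alex": ["Physics", "Math", "biology"],
    "Sam": ["Economics", "History", "Math", "Physics"],
    "Kim": ["Math", "History"],  # hypothetical row, not in the original data
}

selected = [name for name, spec in rows.items() if matches(spec, target_ar)]
print(selected)  # ['Alex', 'Sam']
```

In this model a target list that repeats a value would never match, because the intersection holds each value only once, so it is probably safest to keep target_ar free of duplicates.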
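Contributor with 1834 experience points, more than 8 upvotes
Using the higher-order function filter should be the most scalable and efficient way to do this (Spark 2.4):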
from pyspark.sql import functions as F
df.withColumn("new", F.size(F.expr("""filter(Speciality, x-> x=='Math' or x== 'Physics')""")))\
.filter("new=2").drop("new").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality |
+-----------+-----------------------------------+
|Alex |[Physics, Math, biology] |
|Sam |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
If you want to do this dynamically with an array like a1, you can use F.array_except and F.array and then filter on size (Spark 2.4):
a1=['Math','Physics']
df.withColumn("array", F.array_except("Speciality",F.array(*(F.lit(x) for x in a1))))\
.filter("size(array) = size(Speciality) - {}".format(len(a1))).drop("array").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality |
+-----------+-----------------------------------+
|Alex |[Physics, Math, biology] |
|Sam |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
To get the count, you can use .count() instead of .show().
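The hard-coded condition in the first snippet of this answer could also be generated from a Python list. Below is a small sketch of one way to build that SQL string; `build_filter_condition` is a hypothetical helper, not part of PySpark, and it assumes the values contain no quote characters and that each Speciality array has no duplicates.

```python
def build_filter_condition(column, wanted):
    """Build a Spark SQL condition string requiring that `column`
    contains every value in `wanted`.

    Assumptions: values contain no quote characters, and the array
    column holds no duplicate entries.
    """
    # One equality test per wanted value, joined with "or" for the lambda body
    predicate = " or ".join("x == '{}'".format(value) for value in wanted)
    # The number of matching elements must equal the number of wanted values
    return "size(filter({}, x -> {})) = {}".format(column, predicate, len(wanted))


a1 = ["Math", "Physics"]
condition = build_filter_condition("Speciality", a1)
print(condition)
# size(filter(Speciality, x -> x == 'Math' or x == 'Physics')) = 2
```

The resulting string could then be passed to df.filter(condition), which avoids the temporary "new" column.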
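Contributor with 1900 experience points, more than 5 upvotes
Assuming no student has duplicate entries in Speciality (that is, there are no rows like the following):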
StudentName Speciality
SomeStudent ['Physics', 'Math', 'Biology', 'Physics']
you can use explode and groupby in pandas.
So, for your problem:
# df is above dataframe
# Lookup subjects
a1 = ['Physics', 'Math']
gdata = df.explode('Speciality').groupby(['Speciality']).size().to_frame('Count')
gdata.loc[a1, 'Count']
# Count
# Speciality
# Physics 3
# Math 2
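For readers without pandas at hand, the same per-subject count can be sketched in plain Python. This swaps in collections.Counter for explode and groupby; the `students` dictionary is made-up sample data (the "Kim" row is hypothetical), chosen so that the counts match the output shown above.

```python
from collections import Counter

# Sample data: Alex and Sam come from the thread, Kim is hypothetical
students = {
    "Alex": ["Physics", "Math", "biology"],
    "Sam": ["Economics", "History", "Math", "Physics"],
    "Kim": ["Physics", "History"],
}

# Lookup subjects
a1 = ["Physics", "Math"]

# Equivalent of explode + groupby + size: flatten every list into
# individual subjects, then count how often each subject occurs
counts = Counter(subject for specs in students.values() for subject in specs)

# Equivalent of gdata.loc[a1, 'Count']
lookup = {subject: counts[subject] for subject in a1}
print(lookup)  # {'Physics': 3, 'Math': 2}
```

As in the pandas version, the counts only equal the number of students per subject if no student lists the same subject twice.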