I have two DataFrames. The first one, AA, looks like this:

AA =
+---+---+---+-----+-----+
|id1|id2| nr|cell1|cell2|
+---+---+---+-----+-----+
|  1|  1|  0|  ab2|  ac3|
|  1|  1|  1|  dg6|  jf2|
|  2|  1|  1|  84d|  kf6|
|  2|  2|  1|  89m|  k34|
|  3|  1|  0|  5bd|  nc4|
+---+---+---+-----+-----+

The second DataFrame, BB, looks like this:

BB =
+---+---+---+----+
|  a|  b|use|cell|
+---+---+---+----+
|  1|  1|  x| ab2|
|  1|  1|  a| dg6|
|  2|  1|  b| 84d|
|  2|  2|  t| 89m|
|  3|  1|  d| 5bd|
+---+---+---+----+

The cell column of BB holds every cell that can appear in the cell1 and cell2 columns of AA (cell1 - cell2 is an interval).

I want to add two columns to BB, val1 and val2, under the following conditions.

val1 is 1 when id1 == id2 (in AA), cell (in BB) == cell1 or cell2 (in AA), and nr == 1 (in AA). It is 0 otherwise.

val2 is 1 when id1 != id2 (in AA), cell (in BB) == cell1 or cell2 (in AA), and nr == 1 (in AA). It is also 0 otherwise.

My attempt:

    from pyspark.sql.functions import when, col

    condition = col("id1") == col("id2")
    result = df.withColumn("val1", when(condition, 1))
    result.show()

But I quickly realized that this task is well beyond my PySpark skills.
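To pin down what the two conditions should produce on the sample data, here is a minimal plain-Python sketch of the rule. It is not a PySpark solution, only a reference for the expected values. The function name add_flags is hypothetical, and two points are assumptions rather than things the question states: a BB cell matches an AA row only when it equals cell1 or cell2 exactly (the interval is not expanded), and if several AA rows match the same cell, a flag is 1 as soon as any one of them qualifies.

```python
# Sample rows copied from the question: (id1, id2, nr, cell1, cell2)
AA = [
    (1, 1, 0, "ab2", "ac3"),
    (1, 1, 1, "dg6", "jf2"),
    (2, 1, 1, "84d", "kf6"),
    (2, 2, 1, "89m", "k34"),
    (3, 1, 0, "5bd", "nc4"),
]

# Sample rows copied from the question: (a, b, use, cell)
BB = [
    (1, 1, "x", "ab2"),
    (1, 1, "a", "dg6"),
    (2, 1, "b", "84d"),
    (2, 2, "t", "89m"),
    (3, 1, "d", "5bd"),
]


def add_flags(bb_rows, aa_rows):
    """Return the BB rows extended with (val1, val2).

    Hypothetical helper: assumes exact matching on cell1/cell2 and
    "any matching AA row qualifies" semantics.
    """
    out = []
    for a, b, use, cell in bb_rows:
        # Keep the ids of AA rows with nr == 1 whose cell1 or cell2 equals this cell.
        hits = [
            (id1, id2)
            for id1, id2, nr, cell1, cell2 in aa_rows
            if nr == 1 and cell in (cell1, cell2)
        ]
        val1 = int(any(id1 == id2 for id1, id2 in hits))  # id1 == id2 case
        val2 = int(any(id1 != id2 for id1, id2 in hits))  # id1 != id2 case
        out.append((a, b, use, cell, val1, val2))
    return out


result = add_flags(BB, AA)
for row in result:
    print(row)
```

Under the same assumptions, the PySpark counterpart would likely be a left join of BB to AA on the cell match, two when(...).otherwise(0) columns, and a groupBy on the BB columns with a max over each flag.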