3 Answers
This is where Option comes in handy:
val extractDateAsOptionInt = udf((d: String) => d match {
  // A null input maps to None, which Spark writes out as a null cell in the column.
  case null => None
  case s    => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
})
Or, to make it even safer in the general case:
import scala.util.Try

// Try catches both the NullPointerException on null input and any parse failure;
// toOption turns a failure into None, i.e. a null cell in the output column.
val extractDateAsOptionInt = udf((d: String) => Try(
  d.substring(0, 10).filterNot("-".toSet).toInt
).toOption)
All credit goes to Dmitriy Selivanov, who pointed out this solution as a (missing?) edit here.
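A quick usage sketch (a minimal example; the SparkSession named spark and the sample data are my own assumptions):

import spark.implicits._

// Hypothetical sample: one well-formed timestamp string and one null.
val df = Seq("2016-01-04 12:34:56", null).toDF("x")

// The null row yields null in "y"; the first row yields 20160104.
df.withColumn("y", extractDateAsOptionInt($"x")).show()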
Another approach is to handle null outside the UDF:
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.IntegerType

val extractDateAsInt = udf(
  (d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)

// Only invoke the UDF for non-null values; null rows get a null literal directly.
df.withColumn("y",
  when($"x".isNull, lit(null))
    .otherwise(extractDateAsInt($"x"))
    .cast(IntegerType)
)
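This keeps the UDF itself free of null handling, but the when/otherwise guard has to be repeated at every call site; the helper object in the answer below factors that check into reusable wrappers.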
I created the following code to make user-defined functions that handle null values as described. Hope it helps somebody!
/**
 * Set of methods to construct [[org.apache.spark.sql.UserDefinedFunction]]s that
 * handle `null` values.
 */
object NullableFunctions {

  import org.apache.spark.sql.functions._
  import scala.reflect.runtime.universe.TypeTag
  import org.apache.spark.sql.UserDefinedFunction

  /**
   * Given a function A1 => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that
   *   - if the input is null, None is returned, which creates a null value in the output Spark column;
   *   - if the input is non-null, Some(f(input)) is returned, making f(input) the value in the output column.
   * @param f function from A1 => RT
   * @tparam RT return type
   * @tparam A1 input parameter type
   * @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour described above
   */
  def nullableUdf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
    udf[Option[RT], A1]((i: A1) => i match {
      case null => None
      case _    => Some(f(i))
    })
  }

  /**
   * Given a function (A1, A2) => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that
   *   - if one of the input parameters is null, None is returned, which creates a null value in the
   *     output Spark column;
   *   - if both input parameters are non-null, Some(f(input1, input2)) is returned, making
   *     f(input1, input2) the value in the output column.
   * @param f function from (A1, A2) => RT
   * @tparam RT return type
   * @tparam A1 first input parameter type
   * @tparam A2 second input parameter type
   * @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour described above
   */
  def nullableUdf[RT: TypeTag, A1: TypeTag, A2: TypeTag](f: Function2[A1, A2, RT]): UserDefinedFunction = {
    udf[Option[RT], A1, A2]((i1: A1, i2: A2) => (i1, i2) match {
      case (null, _) => None
      case (_, null) => None
      case (s1, s2)  => Some(f(s1, s2))
    })
  }
}
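A rough usage sketch of both overloads (a minimal example; the data and column names are made up, and spark.implicits._ is assumed to be in scope):

import NullableFunctions.nullableUdf

// One-argument variant: a null string yields a null length.
val strLen = nullableUdf((s: String) => s.length)

// Two-argument variant: if either input is null, the result is null.
val concat2 = nullableUdf((a: String, b: String) => a + b)

val df = Seq(("foo", null: String), ("bar", "baz")).toDF("a", "b")
df.select(strLen($"a"), concat2($"a", $"b")).show()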