1 回答
TA贡献1788条经验 获得超4个赞
这应该按预期工作——很可能你的实现有问题——可以尝试处理一个虚拟数据集。并不TransformerMixin真正关心输入是numpy还是pandas.DataFrame,它将按“预期”工作。
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline
class CustomTransformer(TransformerMixin):
def __init__(self, some_stuff=None, column_names= []):
self.some_stuff = some_stuff
self.column_names = column_names
def fit(self, X, y=None):
return self
def transform(self, X):
# do stuff on X, and return dataframe
# of the same shape - this gets messy
# if the preceding item is a numpy array
# and not a dataframe
if isinstance(X, np.ndarray):
X = pd.DataFrame(X, columns=self.column_names)
X['str_len'] = X['my_str'].apply(lambda x: str(x)).str.len()
X['custom_func'] = X['val'].apply(lambda x: 1 if x > 0.5 else -1)
return X
df = pd.DataFrame({
'my_str': [111, 2, 3333],
'val': [0, 1, 1]
})
# mixing this works as expected
my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"]))
my_pipeline.fit_transform(df)
# using this by itself works as well
my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"]))
my_pipeline.fit_transform(df)
输出是:
In [ ]: my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"]))
...: my_pipeline.fit_transform(df)
Out[ ]:
my_str val str_len custom_func
0 -0.671543 -1.414214 19 -1
1 -0.742084 0.707107 18 1
2 1.413627 0.707107 17 1
In [ ]: my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"]))
...: my_pipeline.fit_transform(df)
Out[ ]:
my_str val str_len custom_func
0 111 0 3 -1
1 2 1 1 1
2 3333 1 4 1
sklearn-pandas或者,如果您想直接将事物映射到数据框,则可以使用
from sklearn_pandas import DataFrameMapper
# using sklearn-pandas
str_transformer = FunctionTransformer(lambda x: x.apply(lambda y: y.str.len()))
cust_transformer = FunctionTransformer(lambda x: (x > 0.5) *2 -1)
mapper = DataFrameMapper([
(['my_str'], str_transformer),
(['val'], make_pipeline(StandardScaler(), cust_transformer))
], input_df=True, df_out=True)
mapper.fit_transform(df)
输出:
In [ ]: mapper.fit_transform(df)
Out[47]:
my_str val
0 3 -1
1 2 1
2 1 1
使用 sklearn pandas 可以让您更具体地将输入作为数据框,将输出作为数据框,并允许您将每一列单独映射到每个感兴趣的管道,而不是将列名编码/硬编码为对象的一部分TransformerMixin。
添加回答
举报