How to quickly write an sklearn LabelEncoder?

Tags: machine learning algorithms

In traditional machine learning, there are many ways to encode categorical features.

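Among them, label encoding is one of the simplest, and sklearn.preprocessing provides an implementation, LabelEncoder. Its purpose is to map every distinct value of a categorical feature to an integer between 0 and n_classes - 1, where n_classes is the number of distinct values.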
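As a small illustration of the mapping described above, the sketch below runs LabelEncoder on a toy list. The city names are made-up sample values, not data from this post.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values invented for illustration only.
cities = ["paris", "tokyo", "amsterdam", "tokyo", "paris"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ holds the distinct values in sorted order;
# a value's code is its position in that array.
print(encoder.classes_.tolist())  # ['amsterdam', 'paris', 'tokyo']
print(codes.tolist())             # [1, 2, 0, 2, 1]

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform(codes).tolist()
print(restored == cities)         # True
```

Note that the codes follow the sorted order of the values, not the order in which they first appear.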

Since label encoding only assigns a sequence number to each distinct category value, could a hand-written label encoder be faster than the one in sklearn?

Data (test_data)

test_data.shape

(65022441, 1): 65,022,441 rows in total.

test_data.drop_duplicates() leaves 783 rows, so the categorical feature has 783 distinct values.
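The original dataset is not published with the post. To try the functions in this post, a much smaller synthetic stand-in with similar characteristics (a single column named category with 783 distinct values) could be built as sketched here. The row count, the random seed, and the cat_ naming scheme are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

N_ROWS = 100_000     # assumption: far fewer rows than the 65,022,441 in the post
N_CATEGORIES = 783   # matches the number of distinct values reported above

rng = np.random.default_rng(0)
# Hypothetical category names; the real values are not shown in the post.
names = np.array([f"cat_{i}" for i in range(N_CATEGORIES)])
# Include one copy of every name so that all 783 categories are guaranteed to appear,
# then fill the rest of the rows with random draws.
values = np.concatenate([names, rng.choice(names, size=N_ROWS - N_CATEGORIES)])
rng.shuffle(values)

test_data = pd.DataFrame({"category": values})
print(test_data.shape)                   # (100000, 1)
print(len(test_data.drop_duplicates()))  # 783
```

Timings on a frame this small will not resemble the numbers reported in the post; the stand-in is only meant for checking that the code runs.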

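We encode test_data in three ways: with LabelEncoder, by numbering the distinct values via reset_index, and by building an index dict. We then use memory_profiler's %memit and IPython's %timeit to compare the memory consumption and run time of the three approaches on the whole of test_data.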

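LabelEncoder

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_category_with_LabelEncoder(data, category_col):
    # Fit the encoder on the distinct values only, then join the codes back onto data.
    category_df = pd.DataFrame()
    category_df[category_col] = data[category_col].drop_duplicates()
    category_df[category_col+'_encode'] = LabelEncoder().fit_transform(category_df[category_col].astype(str))
    data = pd.merge(data, category_df, on=category_col, how="left")
    return data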

reset_index

def encode_category_with_index(data, category_col):
    # Number the distinct values by their row position after reset_index, then join back.
    category_df = data[[category_col]].drop_duplicates()
    category_df.reset_index(inplace=True)
    category_df[category_col+'_encode'] = category_df.index
    category_df = category_df.drop("index", axis=1)
    data = pd.merge(data, category_df, on=category_col, how="left")
    return data

index dict

import numpy as np

def encode_category_with_index_dict(data, category_col):
    # Build a {value: code} dict from the distinct values and map it over the column.
    category_dict = data[category_col].value_counts()
    category_dict = pd.Series(np.arange(0, len(category_dict)), index=category_dict.index).to_dict()
    data[category_col+'_encode'] = data[category_col].map(category_dict).astype('int32')
    return data
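One thing to watch with the dictionary approach: Series.map returns NaN for any value that is not a key in the dictionary, and value_counts skips NaN by default, so a final cast to int32 fails when the column contains missing values. A possible variant is sketched here. The function name, the toy data, and the sentinel code -1 for missing values are assumptions of this sketch, and it uses pd.unique rather than value_counts, so codes follow order of first appearance rather than frequency.

```python
import pandas as pd

def encode_with_dict_safe(data, category_col, missing_code=-1):
    """Hypothetical variant of the dict-based encoder that tolerates missing values."""
    # pd.unique keeps first-appearance order; drop NaN so it receives the sentinel instead.
    categories = pd.unique(data[category_col].dropna())
    category_dict = {value: code for code, value in enumerate(categories)}
    codes = data[category_col].map(category_dict)
    # Entries without a dictionary key are NaN here; replace them before the integer cast.
    data[category_col + "_encode"] = codes.fillna(missing_code).astype("int32")
    return data

df = pd.DataFrame({"category": ["a", "b", None, "a", "c"]})
encoded = encode_with_dict_safe(df, "category")
print(encoded["category_encode"].tolist())  # [0, 1, -1, 0, 2]
```

Like the dict-based function above, this writes the new column into the frame that was passed in.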

Memory usage and run time

import memory_profiler
%load_ext memory_profiler

%memit encode_category_with_LabelEncoder(test_data, 'category')
%timeit encode_category_with_LabelEncoder(test_data, 'category')

peak memory: 24225.01 MiB, increment: 6077.23 MiB
22.3 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%memit encode_category_with_index(test_data, 'category')
%timeit encode_category_with_index(test_data, 'category')

peak memory: 27573.56 MiB, increment: 9425.56 MiB
48 s ± 812 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%memit encode_category_with_index_dict(test_data, 'category')
%timeit encode_category_with_index_dict(test_data, 'category')

peak memory: 18892.13 MiB, increment: 0.00 MiB
13.4 s ± 74.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
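The %memit and %timeit magics only work inside IPython or Jupyter. For a plain script, a rough equivalent can be sketched with the standard library. tracemalloc reports memory allocated during the call as seen through Python's allocation hooks, which is a different quantity from the process-level peak that memory_profiler reports, so its numbers are not directly comparable with those above. The helper name measure and the toy encoder are hypothetical.

```python
import time
import tracemalloc

import pandas as pd

def measure(func, *args, repeats=3):
    """Return (best wall time in seconds, peak traced memory in MiB) for func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    # Trace memory in a separate run, because tracing slows the call down.
    tracemalloc.start()
    func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak / 2**20

def toy_encoder(data, category_col):
    # Stand-in for any of the three encoders; returns a new frame with the codes added.
    mapping = {v: i for i, v in enumerate(pd.unique(data[category_col]))}
    return data.assign(**{category_col + "_encode": data[category_col].map(mapping)})

sample = pd.DataFrame({"category": ["a", "b", "c"] * 10_000})
seconds, peak_mib = measure(toy_encoder, sample, "category")
print(f"best of 3: {seconds:.4f} s, peak traced memory: {peak_mib:.2f} MiB")
```

Unlike %timeit, which reports mean and standard deviation, this helper reports the best of a few runs; either convention works as long as all three encoders are measured the same way.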

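The last two approaches share the same idea: both use the position of each distinct category value as its code. The second approach, however, pays for reset_index and merge, and it ends up slower than LabelEncoder (48 s versus 22.3 s). Storing the index in a dict and mapping it over the column is the fastest of the three: 13.4 s versus 22.3 s, roughly 1.7 times as fast as LabelEncoder, with no measured memory increment.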
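pandas also ships a built-in that follows the same idea as the index-based approaches: pd.factorize assigns codes in order of first appearance and returns the distinct values alongside them. It is not one of the three methods benchmarked in this post, so no timing is claimed for it here. The sketch below, on made-up toy data, only shows that its codes differ from LabelEncoder's sorted codes by a relabeling.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

values = pd.Series(["pear", "apple", "pear", "fig", "apple"])  # toy data

# factorize: codes follow the order of first appearance.
codes, uniques = pd.factorize(values)
print(codes.tolist())     # [0, 1, 0, 2, 1]
print(uniques.tolist())   # ['pear', 'apple', 'fig']

# LabelEncoder: codes follow the sorted order of the distinct values.
sk_codes = LabelEncoder().fit_transform(values)
print(sk_codes.tolist())  # [2, 0, 2, 1, 0]

# Both encodings group the rows identically, just under different labels:
# each factorize code pairs with exactly one LabelEncoder code.
pairs = set(zip(codes.tolist(), sk_codes.tolist()))
print(len(pairs) == len(uniques))  # True
```

The practical consequence is that codes produced by different encoders should not be mixed: a model trained on one encoding needs the same encoding at prediction time.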

Author: 损失函数 (algorithm engineer)
