Pandas
学习莫烦python,非常感谢~记录自己在学习python过程中的点滴。
Pandas 安装
Anaconda安装
pip安装
Pandas 基本介绍
Series
DataFrame
dtype
:查看数据中的类型index
:查看队列的序号(行)columns
:查看数据的名称(列)values
:查看数据所有值describe
:查看数据的总结transpose
:翻转数据(转置)sort_index
:对数据的 index 进行排序并输出sort_values
:对数据 value 排序输出:
Series
Series
的字符串表现形式为:索引在左边,值在右边。由于我们没有为数据指定索引。于是会自动创建一个0到N-1(N为长度)的整数型索引。
参考代码:
import pandas as pdimport numpy as np s = pd.Series([1,3,6,np.nan,44,1]) print(s)""" 0 1.0 1 3.0 2 6.0 3 NaN 4 44.0 5 1.0 dtype: float64 """
DataFrame
DataFrame
是一个表格型的数据结构,它包含有一组有序的列,每列可以是不同的值类型(数值,字符串,布尔值等)。DataFrame
既有行索引也有列索引, 它可以被看做由Series
组成的大字典。
数据准备
dates = pd.date_range('20160101',periods=6) df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d']) print(df)""" a b c d 2016-01-01 -0.253065 -2.071051 -0.640515 0.613663 2016-01-02 -1.147178 1.532470 0.989255 -0.499761 2016-01-03 1.221656 -2.390171 1.862914 0.778070 2016-01-04 1.473877 -0.046419 0.610046 0.204672 2016-01-05 -1.584752 -0.700592 1.487264 -1.778293 2016-01-06 0.633675 -1.414157 -0.277066 -0.442545 """
挑选b
的元素
print(df['b'])""" 2016-01-01 -2.071051 2016-01-02 1.532470 2016-01-03 -2.390171 2016-01-04 -0.046419 2016-01-05 -0.700592 2016-01-06 -1.414157 Freq: D, Name: b, dtype: float64 """
创建一组没有给定行标签和列标签的数据 df1
,默认的从0
开始index
:
df1 = pd.DataFrame(np.arange(12).reshape((3,4))) print(df1)""" 0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 """
另一种方式创建df2
:
df2 = pd.DataFrame({'A' : 1., 'B' : pd.Timestamp('20130102'), 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), 'D' : np.array([3] * 4,dtype='int32'), 'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo'}) print(df2)""" A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo """
查看数据中的类型, 使用 dtype
属性:
print(df2.dtypes)""" df2.dtypes A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object """
查看队列的序号,使用 index
属性:
print(df2.index)# Int64Index([0, 1, 2, 3], dtype='int64')
查看数据的名称,使用 columns
属性:
print(df2.columns)# Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
查看所有df2
的值,使用 values
属性:
print(df2.values)""" array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object) """
查看数据的总结, 可以用 describe()
:
df2.describe() """ A C Dcount 4.0 4.0 4.0mean 1.0 1.0 3.0std 0.0 0.0 0.0min 1.0 1.0 3.025% 1.0 1.0 3.050% 1.0 1.0 3.075% 1.0 1.0 3.0max 1.0 1.0 3.0"""
翻转数据, transpose
:
print(df2.T)""" 0 1 2 \ A 1 1 1 B 2013-01-02 00:00:00 2013-01-02 00:00:00 2013-01-02 00:00:00 C 1 1 1 D 3 3 3 E test train test F foo foo foo 3 A 1 B 2013-01-02 00:00:00 C 1 D 3 E train F foo """
对数据的 index
进行排序并输出:
print(df2.sort_index(axis=1, ascending=False))""" F E D C B A 0 foo test 3 1.0 2013-01-02 1.0 1 foo train 3 1.0 2013-01-02 1.0 2 foo test 3 1.0 2013-01-02 1.0 3 foo train 3 1.0 2013-01-02 1.0 """# 对数据的 value 进行排序并输出:print(df2.sort_values(by='B'))""" A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo """
Pandas 选择数据
简单的筛选
loc
:根据标签来选择数据iloc
:根据位置来选择数据ix
:混合选择通过判断的筛选
数据准备:
dates = pd.date_range('20130101', periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])""" A B C D 2013-01-01 0 1 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
简单的筛选:
print(df['A']) print(df.A)""" 2013-01-01 0 2013-01-02 4 2013-01-03 8 2013-01-04 12 2013-01-05 16 2013-01-06 20 Freq: D, Name: A, dtype: int64 """
选择跨越多行或多列:
print(df[0:3])""" A B C D 2013-01-01 0 1 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 """print(df['20130102':'20130104'])""" A B C D 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 2013-01-04 12 13 14 15 """
根据标签 loc
:
print(df.loc['20130102'])""" A 4 B 5 C 6 D 7 Name: 2013-01-02 00:00:00, dtype: int64 """print(df.loc[:,['A','B']]) """ A B 2013-01-01 0 1 2013-01-02 4 5 2013-01-03 8 9 2013-01-04 12 13 2013-01-05 16 17 2013-01-06 20 21 """print(df.loc['20130102',['A','B']])""" A 4 B 5 Name: 2013-01-02 00:00:00, dtype: int64 """
根据序列 iloc
:
print(df.iloc[3,1])# 13print(df.iloc[3:5,1:3])""" B C 2013-01-04 13 14 2013-01-05 17 18 """print(df.iloc[[1,3,5],1:3])""" B C 2013-01-02 5 6 2013-01-04 13 14 2013-01-06 21 22 """
混合选择 ix
:
print(df.ix[:3,['A','C']])""" A C 2013-01-01 0 2 2013-01-02 4 6 2013-01-03 8 10 """
通过判断的筛选:
print(df[df.A>8])""" A B C D 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
Pandas 设置值
根据位置设置
loc
和iloc
根据条件设置
按行或列设置
添加数据
数据准备:
dates = pd.date_range('20130101', periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])""" A B C D 2013-01-01 0 1 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
根据位置设置 loc
和iloc
df.iloc[2,2] = 1111df.loc['20130101','B'] = 2222""" A B C D 2013-01-01 0 2222 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 1111 11 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
根据条件设置:
df.B[df.A>4] = 0""" A B C D 2013-01-01 0 2222 2 3 2013-01-02 4 5 6 7 2013-01-03 8 0 1111 11 2013-01-04 12 0 14 15 2013-01-05 16 0 18 19 2013-01-06 20 0 22 23 """
按行或列设置:
df['F'] = np.nan""" A B C D F 2013-01-01 0 2222 2 3 NaN 2013-01-02 4 5 6 7 NaN 2013-01-03 8 0 1111 11 NaN 2013-01-04 12 0 14 15 NaN 2013-01-05 16 0 18 19 NaN 2013-01-06 20 0 22 23 NaN """
添加数据:
df['E'] = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130101',periods=6)) """ A B C D F E 2013-01-01 0 2222 2 3 NaN 1 2013-01-02 4 5 6 7 NaN 2 2013-01-03 8 0 1111 11 NaN 3 2013-01-04 12 0 14 15 NaN 4 2013-01-05 16 0 18 19 NaN 5 2013-01-06 20 0 22 23 NaN 6 """
Pandas 处理丢失数据
dropna()
: 去掉有NaN
的行或列fillna()
:将NaN
的值用其他值代替isnull()
:判断是否有缺失数据NaN
创建含 NaN
的矩阵:
dates = pd.date_range('20130101', periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D']) df.iloc[0,1] = np.nan df.iloc[1,2] = np.nan""" A B C D 2013-01-01 0 NaN 2.0 3 2013-01-02 4 5.0 NaN 7 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
去掉有 NaN
的行或列:
df.dropna( axis=0, # 0: 对行进行操作; 1: 对列进行操作 how='any' # 'any': 只要存在 NaN 就 drop 掉; 'all': 必须全部是 NaN 才 drop ) """ A B C D 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
将 NaN
的值用其他值代替:
df.fillna(value=0)""" A B C D 2013-01-01 0 0.0 2.0 3 2013-01-02 4 5.0 0.0 7 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
判断是否有缺失数据 NaN
:
df.isnull() """ A B C D 2013-01-01 False True False False 2013-01-02 False False True False 2013-01-03 False False False False 2013-01-04 False False False False 2013-01-05 False False False False 2013-01-06 False False False False """np.any(df.isnull()) == True # True
Pandas 导入导出
pandas
可以读取与存取的资料格式有很多种,像csv
、excel
、json
、html
与pickle
等…, 详细请看官方说明文件
参考代码:
import pandas as pd #加载模块#读取csvdata = pd.read_csv('student.csv')#打印出dataprint(data)# 将资料存取成pickledata.to_pickle('student.pickle')
Pandas 合并 concat
参数
axis
:合并方向参数
ignore_index
:重置index
参数
join
:合并方式参数
join_axes
:依照 axes 合并append
:添加数据
axis
(合并方向):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d']) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])# concat纵向合并res = pd.concat([df1, df2, df3], axis=0)# 打印结果print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 0 1.0 1.0 1.0 1.0# 1 1.0 1.0 1.0 1.0# 2 1.0 1.0 1.0 1.0# 0 2.0 2.0 2.0 2.0# 1 2.0 2.0 2.0 2.0# 2 2.0 2.0 2.0 2.0
ignore_index
(重置 index
):
# 承上一个例子,并将index_ignore设定为Trueres = pd.concat([df1, df2, df3], axis=0, ignore_index=True)# 打印结果print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 1.0 1.0 1.0# 4 1.0 1.0 1.0 1.0# 5 1.0 1.0 1.0 1.0# 6 2.0 2.0 2.0 2.0# 7 2.0 2.0 2.0 2.0# 8 2.0 2.0 2.0 2.0
join
(合并方式):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3]) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])# 纵向"外"合并df1与df2res = pd.concat([df1, df2], axis=0, join='outer') print(res)# a b c d e# 1 0.0 0.0 0.0 0.0 NaN# 2 0.0 0.0 0.0 0.0 NaN# 3 0.0 0.0 0.0 0.0 NaN# 2 NaN 1.0 1.0 1.0 1.0# 3 NaN 1.0 1.0 1.0 1.0# 4 NaN 1.0 1.0 1.0 1.0# 承上一个例子# 纵向"内"合并df1与df2res = pd.concat([df1, df2], axis=0, join='inner')# 打印结果print(res)# b c d# 1 0.0 0.0 0.0# 2 0.0 0.0 0.0# 3 0.0 0.0 0.0# 2 1.0 1.0 1.0# 3 1.0 1.0 1.0# 4 1.0 1.0 1.0# 重置index并打印结果res = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True) print(res)# b c d# 0 0.0 0.0 0.0# 1 0.0 0.0 0.0# 2 0.0 0.0 0.0# 3 1.0 1.0 1.0# 4 1.0 1.0 1.0# 5 1.0 1.0 1.0
join_axes
(依照 axes
合并):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3]) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])# 依照`df1.index`进行横向合并res = pd.concat([df1, df2], axis=1, join_axes=[df1.index])# 打印结果print(res)# a b c d b c d e# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 移除join_axes,并打印结果res = pd.concat([df1, df2], axis=1) print(res)# a b c d b c d e# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
append
(添加数据):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d']) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) df3 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) s1 = pd.Series([1,2,3,4], index=['a','b','c','d'])# 将df2合并到df1的下面,以及重置index,并打印出结果res = df1.append(df2, ignore_index=True) print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 1.0 1.0 1.0# 4 1.0 1.0 1.0 1.0# 5 1.0 1.0 1.0 1.0# 合并多个df,将df2与df3合并至df1的下面,以及重置index,并打印出结果res = df1.append([df2, df3], ignore_index=True) print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 1.0 1.0 1.0# 4 1.0 1.0 1.0 1.0# 5 1.0 1.0 1.0 1.0# 6 1.0 1.0 1.0 1.0# 7 1.0 1.0 1.0 1.0# 8 1.0 1.0 1.0 1.0# 合并series,将s1合并至df1,以及重置index,并打印出结果res = df1.append(s1, ignore_index=True) print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 2.0 3.0 4.0
Pandas 合并 merge
依据一组key合并
依据两组key合并
Indicator
:indicator=True
会将合并的记录放在新的一列依据index合并
解决overlapping的问题
依据一组key合并:
import pandas as pd#定义资料集并打印出left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']}) right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}) print(left)# A B key# 0 A0 B0 K0# 1 A1 B1 K1# 2 A2 B2 K2# 3 A3 B3 K3print(right)# C D key# 0 C0 D0 K0# 1 C1 D1 K1# 2 C2 D2 K2# 3 C3 D3 K3#依据key column合并,并打印出res = pd.merge(left, right, on='key') print(res) A B key C D# 0 A0 B0 K0 C0 D0# 1 A1 B1 K1 C1 D1# 2 A2 B2 K2 C2 D2# 3 A3 B3 K3 C3 D3
依据两组key合并:
import pandas as pd#定义资料集并打印出left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 'key2': ['K0', 'K1', 'K0', 'K1'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']}) right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'], 'key2': ['K0', 'K0', 'K0', 'K0'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}) print(left)# A B key1 key2# 0 A0 B0 K0 K0# 1 A1 B1 K0 K1# 2 A2 B2 K1 K0# 3 A3 B3 K2 K1print(right)# C D key1 key2# 0 C0 D0 K0 K0# 1 C1 D1 K1 K0# 2 C2 D2 K1 K0# 3 C3 D3 K2 K0#依据key1与key2 columns进行合并,并打印出四种结果['left', 'right', 'outer', 'inner']res = pd.merge(left, right, on=['key1', 'key2'], how='inner') print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A2 B2 K1 K0 C1 D1# 2 A2 B2 K1 K0 C2 D2res = pd.merge(left, right, on=['key1', 'key2'], how='outer') print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A1 B1 K0 K1 NaN NaN# 2 A2 B2 K1 K0 C1 D1# 3 A2 B2 K1 K0 C2 D2# 4 A3 B3 K2 K1 NaN NaN# 5 NaN NaN K2 K0 C3 D3res = pd.merge(left, right, on=['key1', 'key2'], how='left') print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A1 B1 K0 K1 NaN NaN# 2 A2 B2 K1 K0 C1 D1# 3 A2 B2 K1 K0 C2 D2# 4 A3 B3 K2 K1 NaN NaNres = pd.merge(left, right, on=['key1', 'key2'], how='right') print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A2 B2 K1 K0 C1 D1# 2 A2 B2 K1 K0 C2 D2# 3 NaN NaN K2 K0 C3 D3
Indicator :
import pandas as pd#定义资料集并打印出df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']}) df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]}) print(df1)# col1 col_left# 0 0 a# 1 1 bprint(df2)# col1 col_right# 0 1 2# 1 2 2# 2 2 2# 依据col1进行合并,并启用indicator=True,最后打印出res = pd.merge(df1, df2, on='col1', how='outer', indicator=True) print(res)# col1 col_left col_right _merge# 0 0.0 a NaN left_only# 1 1.0 b 2.0 both# 2 2.0 NaN 2.0 right_only# 3 2.0 NaN 2.0 right_only# 自定indicator column的名称,并打印出res = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column') print(res)# col1 col_left col_right indicator_column# 0 0.0 a NaN left_only# 1 1.0 b 2.0 both# 2 2.0 NaN 2.0 right_only# 3 2.0 NaN 2.0 right_only
依据index合并
import pandas as pd#定义资料集并打印出left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2']) right = pd.DataFrame({'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']}, index=['K0', 'K2', 'K3']) print(left)# A B# K0 A0 B0# K1 A1 B1# K2 A2 B2print(right)# C D# K0 C0 D0# K2 C2 D2# K3 C3 D3# 依据左右资料集的index进行合并,how='outer',并打印出res = pd.merge(left, right, left_index=True, right_index=True, how='outer') print(res)# A B C D# K0 A0 B0 C0 D0# K1 A1 B1 NaN NaN# K2 A2 B2 C2 D2# K3 NaN NaN C3 D3# 依据左右资料集的index进行合并,how='inner',并打印出res = pd.merge(left, right, left_index=True, right_index=True, how='inner') print(res)# A B C D# K0 A0 B0 C0 D0# K2 A2 B2 C2 D2
解决overlapping的问题:
import pandas as pd#定义资料集boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]}) girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})#使用suffixes解决overlapping的问题res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner') print(res)# age_boy k age_girl# 0 1 K0 4# 1 1 K0 5
Pandas plot 出图
创建一个
Series
Dataframe
可视化
创建一个Series
:
import pandas as pdimport numpy as npimport matplotlib.pyplot as plt# 随机生成1000个数据data = pd.Series(np.random.randn(1000),index=np.arange(1000)) # 为了方便观看效果, 我们累加这个数据data.cumsum()# pandas 数据可以直接观看其可视化形式data.plot() plt.show()
img1
Dataframe
可视化:
data = pd.DataFrame( np.random.randn(1000,4), index=np.arange(1000), columns=list("ABCD") ) data.cumsum() data.plot() plt.show()
img2
ax = data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class1')# 将之下这个 data 画在上一个 ax 上面data.plot.scatter(x='A',y='C',color='LightGreen',label='Class2',ax=ax) plt.show()
img3
作者:CrazyWolf_081c
链接:https://www.jianshu.com/p/1ea7c0d32bab
共同学习,写下你的评论
评论加载中...
作者其他优质文章