Pandas
学习莫烦python,非常感谢~记录自己在学习python过程中的点滴。
Pandas 安装
Anaconda安装
pip安装
Pandas 基本介绍
SeriesDataFramedtype:查看数据中的类型index:查看队列的序号(行)columns:查看数据的名称(列)values:查看数据所有值describe:查看数据的总结transpose:翻转数据(转置)sort_index:对数据的 index 进行排序并输出sort_values:对数据 value 排序输出:
Series
Series的字符串表现形式为:索引在左边,值在右边。由于我们没有为数据指定索引。于是会自动创建一个0到N-1(N为长度)的整数型索引。
参考代码:
import pandas as pdimport numpy as np s = pd.Series([1,3,6,np.nan,44,1]) print(s)""" 0 1.0 1 3.0 2 6.0 3 NaN 4 44.0 5 1.0 dtype: float64 """
DataFrame
DataFrame是一个表格型的数据结构,它包含有一组有序的列,每列可以是不同的值类型(数值,字符串,布尔值等)。DataFrame既有行索引也有列索引, 它可以被看做由Series组成的大字典。
数据准备
dates = pd.date_range('20160101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
print(df)"""
a b c d
2016-01-01 -0.253065 -2.071051 -0.640515 0.613663
2016-01-02 -1.147178 1.532470 0.989255 -0.499761
2016-01-03 1.221656 -2.390171 1.862914 0.778070
2016-01-04 1.473877 -0.046419 0.610046 0.204672
2016-01-05 -1.584752 -0.700592 1.487264 -1.778293
2016-01-06 0.633675 -1.414157 -0.277066 -0.442545
"""挑选b 的元素
print(df['b'])""" 2016-01-01 -2.071051 2016-01-02 1.532470 2016-01-03 -2.390171 2016-01-04 -0.046419 2016-01-05 -0.700592 2016-01-06 -1.414157 Freq: D, Name: b, dtype: float64 """
创建一组没有给定行标签和列标签的数据 df1,默认的从0开始index:
df1 = pd.DataFrame(np.arange(12).reshape((3,4))) print(df1)""" 0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 """
另一种方式创建df2:
df2 = pd.DataFrame({'A' : 1., 'B' : pd.Timestamp('20130102'), 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), 'D' : np.array([3] * 4,dtype='int32'), 'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo'})
print(df2)"""
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
"""查看数据中的类型, 使用 dtype 属性:
print(df2.dtypes)""" df2.dtypes A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object """
查看队列的序号,使用 index属性:
print(df2.index)# Int64Index([0, 1, 2, 3], dtype='int64')
查看数据的名称,使用 columns 属性:
print(df2.columns)# Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
查看所有df2的值,使用 values 属性:
print(df2.values)"""
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
"""查看数据的总结, 可以用 describe():
df2.describe() """ A C Dcount 4.0 4.0 4.0mean 1.0 1.0 3.0std 0.0 0.0 0.0min 1.0 1.0 3.025% 1.0 1.0 3.050% 1.0 1.0 3.075% 1.0 1.0 3.0max 1.0 1.0 3.0"""
翻转数据, transpose:
print(df2.T)""" 0 1 2 \ A 1 1 1 B 2013-01-02 00:00:00 2013-01-02 00:00:00 2013-01-02 00:00:00 C 1 1 1 D 3 3 3 E test train test F foo foo foo 3 A 1 B 2013-01-02 00:00:00 C 1 D 3 E train F foo """
对数据的 index 进行排序并输出:
print(df2.sort_index(axis=1, ascending=False))""" F E D C B A 0 foo test 3 1.0 2013-01-02 1.0 1 foo train 3 1.0 2013-01-02 1.0 2 foo test 3 1.0 2013-01-02 1.0 3 foo train 3 1.0 2013-01-02 1.0 """# 对数据的 value 进行排序并输出:print(df2.sort_values(by='B'))""" A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo """
Pandas 选择数据
简单的筛选
loc:根据标签来选择数据iloc:根据位置来选择数据ix:混合选择通过判断的筛选
数据准备:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])"""
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
"""简单的筛选:
print(df['A']) print(df.A)""" 2013-01-01 0 2013-01-02 4 2013-01-03 8 2013-01-04 12 2013-01-05 16 2013-01-06 20 Freq: D, Name: A, dtype: int64 """
选择跨越多行或多列:
print(df[0:3])""" A B C D 2013-01-01 0 1 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 """print(df['20130102':'20130104'])""" A B C D 2013-01-02 4 5 6 7 2013-01-03 8 9 10 11 2013-01-04 12 13 14 15 """
根据标签 loc:
print(df.loc['20130102'])""" A 4 B 5 C 6 D 7 Name: 2013-01-02 00:00:00, dtype: int64 """print(df.loc[:,['A','B']]) """ A B 2013-01-01 0 1 2013-01-02 4 5 2013-01-03 8 9 2013-01-04 12 13 2013-01-05 16 17 2013-01-06 20 21 """print(df.loc['20130102',['A','B']])""" A 4 B 5 Name: 2013-01-02 00:00:00, dtype: int64 """
根据序列 iloc:
print(df.iloc[3,1])# 13print(df.iloc[3:5,1:3])""" B C 2013-01-04 13 14 2013-01-05 17 18 """print(df.iloc[[1,3,5],1:3])""" B C 2013-01-02 5 6 2013-01-04 13 14 2013-01-06 21 22 """
混合选择 ix:
print(df.ix[:3,['A','C']])""" A C 2013-01-01 0 2 2013-01-02 4 6 2013-01-03 8 10 """
通过判断的筛选:
print(df[df.A>8])""" A B C D 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
Pandas 设置值
根据位置设置
loc和iloc根据条件设置
按行或列设置
添加数据
数据准备:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])"""
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
"""根据位置设置 loc 和iloc
df.iloc[2,2] = 1111df.loc['20130101','B'] = 2222""" A B C D 2013-01-01 0 2222 2 3 2013-01-02 4 5 6 7 2013-01-03 8 9 1111 11 2013-01-04 12 13 14 15 2013-01-05 16 17 18 19 2013-01-06 20 21 22 23 """
根据条件设置:
df.B[df.A>4] = 0""" A B C D 2013-01-01 0 2222 2 3 2013-01-02 4 5 6 7 2013-01-03 8 0 1111 11 2013-01-04 12 0 14 15 2013-01-05 16 0 18 19 2013-01-06 20 0 22 23 """
按行或列设置:
df['F'] = np.nan""" A B C D F 2013-01-01 0 2222 2 3 NaN 2013-01-02 4 5 6 7 NaN 2013-01-03 8 0 1111 11 NaN 2013-01-04 12 0 14 15 NaN 2013-01-05 16 0 18 19 NaN 2013-01-06 20 0 22 23 NaN """
添加数据:
df['E'] = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130101',periods=6))
"""
A B C D F E
2013-01-01 0 2222 2 3 NaN 1
2013-01-02 4 5 6 7 NaN 2
2013-01-03 8 0 1111 11 NaN 3
2013-01-04 12 0 14 15 NaN 4
2013-01-05 16 0 18 19 NaN 5
2013-01-06 20 0 22 23 NaN 6
"""Pandas 处理丢失数据
dropna(): 去掉有NaN的行或列fillna():将NaN的值用其他值代替isnull():判断是否有缺失数据NaN
创建含 NaN 的矩阵:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan"""
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
"""去掉有 NaN 的行或列:
df.dropna( axis=0, # 0: 对行进行操作; 1: 对列进行操作 how='any' # 'any': 只要存在 NaN 就 drop 掉; 'all': 必须全部是 NaN 才 drop ) """ A B C D 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
将 NaN 的值用其他值代替:
df.fillna(value=0)""" A B C D 2013-01-01 0 0.0 2.0 3 2013-01-02 4 5.0 0.0 7 2013-01-03 8 9.0 10.0 11 2013-01-04 12 13.0 14.0 15 2013-01-05 16 17.0 18.0 19 2013-01-06 20 21.0 22.0 23 """
判断是否有缺失数据 NaN:
df.isnull() """ A B C D 2013-01-01 False True False False 2013-01-02 False False True False 2013-01-03 False False False False 2013-01-04 False False False False 2013-01-05 False False False False 2013-01-06 False False False False """np.any(df.isnull()) == True # True
Pandas 导入导出
pandas可以读取与存取的资料格式有很多种,像csv、excel、json、html与pickle等…, 详细请看官方说明文件
参考代码:
import pandas as pd #加载模块#读取csvdata = pd.read_csv('student.csv')#打印出dataprint(data)# 将资料存取成pickledata.to_pickle('student.pickle')Pandas 合并 concat
参数
axis:合并方向参数
ignore_index:重置index参数
join:合并方式参数
join_axes:依照 axes 合并append:添加数据
axis (合并方向):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d']) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])# concat纵向合并res = pd.concat([df1, df2, df3], axis=0)# 打印结果print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 0 1.0 1.0 1.0 1.0# 1 1.0 1.0 1.0 1.0# 2 1.0 1.0 1.0 1.0# 0 2.0 2.0 2.0 2.0# 1 2.0 2.0 2.0 2.0# 2 2.0 2.0 2.0 2.0
ignore_index (重置 index):
# 承上一个例子,并将index_ignore设定为Trueres = pd.concat([df1, df2, df3], axis=0, ignore_index=True)# 打印结果print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 1.0 1.0 1.0# 4 1.0 1.0 1.0 1.0# 5 1.0 1.0 1.0 1.0# 6 2.0 2.0 2.0 2.0# 7 2.0 2.0 2.0 2.0# 8 2.0 2.0 2.0 2.0
join (合并方式):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3]) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])# 纵向"外"合并df1与df2res = pd.concat([df1, df2], axis=0, join='outer') print(res)# a b c d e# 1 0.0 0.0 0.0 0.0 NaN# 2 0.0 0.0 0.0 0.0 NaN# 3 0.0 0.0 0.0 0.0 NaN# 2 NaN 1.0 1.0 1.0 1.0# 3 NaN 1.0 1.0 1.0 1.0# 4 NaN 1.0 1.0 1.0 1.0# 承上一个例子# 纵向"内"合并df1与df2res = pd.concat([df1, df2], axis=0, join='inner')# 打印结果print(res)# b c d# 1 0.0 0.0 0.0# 2 0.0 0.0 0.0# 3 0.0 0.0 0.0# 2 1.0 1.0 1.0# 3 1.0 1.0 1.0# 4 1.0 1.0 1.0# 重置index并打印结果res = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True) print(res)# b c d# 0 0.0 0.0 0.0# 1 0.0 0.0 0.0# 2 0.0 0.0 0.0# 3 1.0 1.0 1.0# 4 1.0 1.0 1.0# 5 1.0 1.0 1.0
join_axes (依照 axes 合并):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3]) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])# 依照`df1.index`进行横向合并res = pd.concat([df1, df2], axis=1, join_axes=[df1.index])# 打印结果print(res)# a b c d b c d e# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 移除join_axes,并打印结果res = pd.concat([df1, df2], axis=1) print(res)# a b c d b c d e# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0# 4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
append (添加数据):
import pandas as pdimport numpy as np# 定义资料集df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d']) df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) df3 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) s1 = pd.Series([1,2,3,4], index=['a','b','c','d'])# 将df2合并到df1的下面,以及重置index,并打印出结果res = df1.append(df2, ignore_index=True) print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 1.0 1.0 1.0# 4 1.0 1.0 1.0 1.0# 5 1.0 1.0 1.0 1.0# 合并多个df,将df2与df3合并至df1的下面,以及重置index,并打印出结果res = df1.append([df2, df3], ignore_index=True) print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 1.0 1.0 1.0# 4 1.0 1.0 1.0 1.0# 5 1.0 1.0 1.0 1.0# 6 1.0 1.0 1.0 1.0# 7 1.0 1.0 1.0 1.0# 8 1.0 1.0 1.0 1.0# 合并series,将s1合并至df1,以及重置index,并打印出结果res = df1.append(s1, ignore_index=True) print(res)# a b c d# 0 0.0 0.0 0.0 0.0# 1 0.0 0.0 0.0 0.0# 2 0.0 0.0 0.0 0.0# 3 1.0 2.0 3.0 4.0
Pandas 合并 merge
依据一组key合并
依据两组key合并
Indicator:indicator=True会将合并的记录放在新的一列依据index合并
解决overlapping的问题
依据一组key合并:
import pandas as pd#定义资料集并打印出left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']})
print(left)# A B key# 0 A0 B0 K0# 1 A1 B1 K1# 2 A2 B2 K2# 3 A3 B3 K3print(right)# C D key# 0 C0 D0 K0# 1 C1 D1 K1# 2 C2 D2 K2# 3 C3 D3 K3#依据key column合并,并打印出res = pd.merge(left, right, on='key')
print(res)
A B key C D# 0 A0 B0 K0 C0 D0# 1 A1 B1 K1 C1 D1# 2 A2 B2 K2 C2 D2# 3 A3 B3 K3 C3 D3依据两组key合并:
import pandas as pd#定义资料集并打印出left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 'key2': ['K0', 'K1', 'K0', 'K1'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'], 'key2': ['K0', 'K0', 'K0', 'K0'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']})
print(left)# A B key1 key2# 0 A0 B0 K0 K0# 1 A1 B1 K0 K1# 2 A2 B2 K1 K0# 3 A3 B3 K2 K1print(right)# C D key1 key2# 0 C0 D0 K0 K0# 1 C1 D1 K1 K0# 2 C2 D2 K1 K0# 3 C3 D3 K2 K0#依据key1与key2 columns进行合并,并打印出四种结果['left', 'right', 'outer', 'inner']res = pd.merge(left, right, on=['key1', 'key2'], how='inner')
print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A2 B2 K1 K0 C1 D1# 2 A2 B2 K1 K0 C2 D2res = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A1 B1 K0 K1 NaN NaN# 2 A2 B2 K1 K0 C1 D1# 3 A2 B2 K1 K0 C2 D2# 4 A3 B3 K2 K1 NaN NaN# 5 NaN NaN K2 K0 C3 D3res = pd.merge(left, right, on=['key1', 'key2'], how='left')
print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A1 B1 K0 K1 NaN NaN# 2 A2 B2 K1 K0 C1 D1# 3 A2 B2 K1 K0 C2 D2# 4 A3 B3 K2 K1 NaN NaNres = pd.merge(left, right, on=['key1', 'key2'], how='right')
print(res)# A B key1 key2 C D# 0 A0 B0 K0 K0 C0 D0# 1 A2 B2 K1 K0 C1 D1# 2 A2 B2 K1 K0 C2 D2# 3 NaN NaN K2 K0 C3 D3Indicator :
import pandas as pd#定义资料集并打印出df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,2],'col_right':[2,2,2]})
print(df1)# col1 col_left# 0 0 a# 1 1 bprint(df2)# col1 col_right# 0 1 2# 1 2 2# 2 2 2# 依据col1进行合并,并启用indicator=True,最后打印出res = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
print(res)# col1 col_left col_right _merge# 0 0.0 a NaN left_only# 1 1.0 b 2.0 both# 2 2.0 NaN 2.0 right_only# 3 2.0 NaN 2.0 right_only# 自定indicator column的名称,并打印出res = pd.merge(df1, df2, on='col1', how='outer', indicator='indicator_column')
print(res)# col1 col_left col_right indicator_column# 0 0.0 a NaN left_only# 1 1.0 b 2.0 both# 2 2.0 NaN 2.0 right_only# 3 2.0 NaN 2.0 right_only依据index合并
import pandas as pd#定义资料集并打印出left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
print(left)# A B# K0 A0 B0# K1 A1 B1# K2 A2 B2print(right)# C D# K0 C0 D0# K2 C2 D2# K3 C3 D3# 依据左右资料集的index进行合并,how='outer',并打印出res = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(res)# A B C D# K0 A0 B0 C0 D0# K1 A1 B1 NaN NaN# K2 A2 B2 C2 D2# K3 NaN NaN C3 D3# 依据左右资料集的index进行合并,how='inner',并打印出res = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(res)# A B C D# K0 A0 B0 C0 D0# K2 A2 B2 C2 D2解决overlapping的问题:
import pandas as pd#定义资料集boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})#使用suffixes解决overlapping的问题res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
print(res)# age_boy k age_girl# 0 1 K0 4# 1 1 K0 5Pandas plot 出图
创建一个
SeriesDataframe可视化
创建一个Series :
import pandas as pdimport numpy as npimport matplotlib.pyplot as plt# 随机生成1000个数据data = pd.Series(np.random.randn(1000),index=np.arange(1000)) # 为了方便观看效果, 我们累加这个数据data.cumsum()# pandas 数据可以直接观看其可视化形式data.plot() plt.show()
img1
Dataframe 可视化:
data = pd.DataFrame(
np.random.randn(1000,4),
index=np.arange(1000),
columns=list("ABCD")
)
data.cumsum()
data.plot()
plt.show()img2
ax = data.plot.scatter(x='A',y='B',color='DarkBlue',label='Class1')# 将之下这个 data 画在上一个 ax 上面data.plot.scatter(x='A',y='C',color='LightGreen',label='Class2',ax=ax) plt.show()
img3
作者:CrazyWolf_081c
链接:https://www.jianshu.com/p/1ea7c0d32bab
共同学习,写下你的评论
评论加载中...
作者其他优质文章


