首页手记 pandas数据结构

pandas数据结构

标签：

Python

创建数据

Series和python的列表类似。DataFrame则类似值为Series的字典。

create.py

#!/usr/bin/env python3# -*- coding: utf-8 -*-# create.pyimport pandas as pd

print("\n\n创建序列Series")
s = pd.Series(['banana', 42])
print(s)

print("\n\n指定索引index创建序列Series")
s = pd.Series(['Wes McKinney', 'Creator of Pandas'], index=['Person', 'Who'])
print(s)# 注意：列名未必为执行的顺序，通常为按字母排序print("\n\n创建数据帧DataFrame")
scientists = pd.DataFrame({    ' Name': ['Rosaline Franklin', 'William Gosset'],    ' Occupation': ['Chemist', 'Statistician'],    ' Born': ['1920-07-25', '1876-06-13'],    ' Died': ['1958-04-16', '1937-10-16'],    ' Age': [37, 61]})
print(scientists)

print("\n\n指定顺序(index和columns)创建数据帧DataFrame")
scientists = pd.DataFrame(
    data={'Occupation': ['Chemist', 'Statistician'],    'Born': ['1920-07-25', '1876-06-13'],    'Died': ['1958-04-16', '1937-10-16'],    'Age': [37, 61]},
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age'])
print(scientists)

执行结果：

$ ./create.py 


创建序列Series0    banana1        42dtype: object


指定索引index创建序列Series
Person         Wes McKinney
Who       Creator of Pandas
dtype: object


创建数据帧DataFrame
                Name    Occupation        Born        Died   Age0  Rosaline Franklin       Chemist  1920-07-25  1958-04-16    371     William Gosset  Statistician  1876-06-13  1937-10-16    61指定顺序(index和columns)创建数据帧DataFrame
                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37William Gosset     Statistician  1876-06-13  1937-10-16   61

Series

官方文档：http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

Series的属性

属性	描述
loc	使用索引值获取子集
iloc	使用索引位置获取子集
dtype或dtypes	类型
T	转置
shape	数据的尺寸
size	元素的数量
values	ndarray或类似ndarray的Series

Series的方法

方法	描述
append	连接2个或更多系列
corr	计算与其他Series的关联
cov	与其他Series计算协方差
describe	计算汇总统计
drop duplicates	返回一个没有重复项的Series
equals	Series是否具有相同的元素
get values	获取Series的值，与values属性相同
hist	绘制直方图
min	返回最小值
max	返回最大值
mean	返回算术平均值
median	返回中位数
mode(s)	返回mode(s)
replace	用指定值替换系列中的值
sample	返回Series中值的随机样本
sort values	排序
to frame	转换为数据帧
transpose	返回转置
unique	返回numpy.ndarray唯一值

series.py

#!/usr/bin/python3# -*- coding: utf-8 -*-# CreateDate: 2018-3-14# series.pyimport pandas as pdimport numpy as np

scientists = pd.DataFrame(
    data={'Occupation': ['Chemist', 'Statistician'],    'Born': ['1920-07-25', '1876-06-13'],    'Died': ['1958-04-16', '1937-10-16'],    'Age': [37, 61]},
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age'])
print(scientists)# 从数据帧(DataFrame)获取的行或者列为Seriesfirst_row = scientists.loc['William Gosset']
print(type(first_row))
print(first_row)# index和keys是一样的print(first_row.index)
print(first_row.keys())
print(first_row.values)

print(first_row.index[0])
print(first_row.keys()[0])# Pandas.Series和numpy.ndarray很类似ages = scientists['Age']
print(ages)# 统计，更多参考http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statisticsprint(ages.mean())
print(ages.min())
print(ages.max())
print(ages.std())

scientists = pd.read_csv('../data/scientists.csv')
ages = scientists['Age']
print(ages)
print(ages.mean())
print(ages.describe())
print(ages[ages > ages.mean()])
print(ages > ages.mean())
manual_bool_values = [True, True, False, False, True, True, False, False]
print(ages[manual_bool_values])

print(ages + ages)
print(ages * ages)
print(ages + 100)
print(ages * 2)
print(ages + pd.Series([1, 100]))# print(ages + np.array([1, 100])) 会报错，不同类型相加，大小一定要一样print(ages + np.array([1, 100, 1, 100, 1, 100, 1, 100]))# 排序： 默认有自动排序print(ages)
rev_ages = ages.sort_index(ascending=False)
print(rev_ages)
print(ages * 2)
print(ages + rev_ages)

执行结果

$ python3 series.py 
                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37William Gosset     Statistician  1876-06-13  1937-10-16   61<class 'pandas.core.series.Series'>Occupation    StatisticianBorn            1876-06-13Died            1937-10-16Age                     61Name: William Gosset, dtype: object
Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')
Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')
['Statistician' '1876-06-13' '1937-10-16' 61]
Occupation
Occupation
Rosaline Franklin    37William Gosset       61Name: Age, dtype: int6449.0376116.970562748477140    371    612    903    664    565    456    417    77Name: Age, dtype: int6459.125count     8.000000mean     59.125000std      18.325918min      37.00000025%      44.00000050%      58.50000075%      68.750000max      90.000000Name: Age, dtype: float641    612    903    667    77Name: Age, dtype: int640    False1     True2     True3     True4    False5    False6    False7     TrueName: Age, dtype: bool0    371    614    565    45Name: Age, dtype: int640     741    1222    1803    1324    1125     906     827    154Name: Age, dtype: int640    13691    37212    81003    43564    31365    20256    16817    5929Name: Age, dtype: int640    1371    1612    1903    1664    1565    1456    1417    177Name: Age, dtype: int640     741    1222    1803    1324    1125     906     827    154Name: Age, dtype: int640     38.01    161.02      NaN3      NaN4      NaN5      NaN6      NaN7      NaN
dtype: float640     381    1612     913    1664     575    1456     427    177Name: Age, dtype: int640    371    612    903    664    565    456    417    77Name: Age, dtype: int647    776    415    454    563    662    901    610    37Name: Age, dtype: int640     741    1222    1803    1324    1125     906     827    154Name: Age, dtype: int640     741    1222    1803    1324    1125     906     827    154Name: Age, dtype: int64

数据帧(DataFrame)

DataFrame是最常见的Pandas对象,可认为是Python存储类似电子表格的数据的方式。Series多常见功能都包含在DataFrame中。

子集的方法

注意ix现在已经不推荐使用。

DataFrame常用的索引操作有：

方式	描述
df[val]	选择单个列
df [[ column1, column2, ... ]]	选择多个列
df.loc[val]	选择行
df. loc [[ label1 , label2 ,...]]	选择多行
df.loc[:, val]	基于行index选择列
df.loc[val1, val2]	选择行列
df.iloc[row number]	基于行数选择行
df. iloc [[ row1, row2, ...]] Multiple rows by row number	基于行数选择多行
df.iloc[:, where]	选择列
df.iloc[where_i, where_j]	选择行列
df.at[label_i, label_j]	选择值
df.iat[i, j]	选择值
reindex method	通过label选择多行或列
get_value, set_value	通过label选择耽搁行或列
df[bool]	选择行
df [[ bool1, bool2, ...]]	选择行
df[ start :stop: step ]	基于行数选择行

#!/usr/bin/python3# -*- coding: utf-8 -*-# CreateDate: 2018-3-31# df.pyimport pandas as pdimport numpy as np

scientists = pd.read_csv('../data/scientists.csv')
print(scientists[scientists['Age'] > scientists['Age'].mean()])
first_half = scientists[: 4]
second_half = scientists[ 4 :]
print(first_half)
print(second_half)
print(first_half + second_half)
print(scientists * 2)

执行结果

#!/usr/bin/python3# -*- coding: utf-8 -*-# df.pyimport pandas as pdimport numpy as np

scientists = pd.read_csv('../data/scientists.csv')
print(scientists[scientists['Age'] > scientists['Age'].mean()])
first_half = scientists[: 4]
second_half = scientists[ 4 :]
print(first_half)
print(second_half)
print(first_half + second_half)
print(scientists * 2)

执行结果

$ python3 df.py 
                   Name        Born        Died  Age     Occupation1        William Gosset  1876-06-13  1937-10-16   61   Statistician2  Florence Nightingale  1820-05-12  1910-08-13   90          Nurse3           Marie Curie  1867-11-07  1934-07-04   66        Chemist7          Johann Gauss  1777-04-30  1855-02-23   77  Mathematician
                   Name        Born        Died  Age    Occupation0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist1        William Gosset  1876-06-13  1937-10-16   61  Statistician2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse3           Marie Curie  1867-11-07  1934-07-04   66       Chemist
            Name        Born        Died  Age          Occupation4  Rachel Carson  1907-05-27  1964-04-14   56           Biologist5      John Snow  1813-03-15  1858-06-16   45           Physician6    Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist7   Johann Gauss  1777-04-30  1855-02-23   77       Mathematician
  Name Born Died  Age Occupation0  NaN  NaN  NaN  NaN        NaN1  NaN  NaN  NaN  NaN        NaN2  NaN  NaN  NaN  NaN        NaN3  NaN  NaN  NaN  NaN        NaN4  NaN  NaN  NaN  NaN        NaN5  NaN  NaN  NaN  NaN        NaN6  NaN  NaN  NaN  NaN        NaN7  NaN  NaN  NaN  NaN        NaN
                                       Name                  Born  \0        Rosaline FranklinRosaline Franklin  1920-07-251920-07-25   1              William GossetWilliam Gosset  1876-06-131876-06-13   2  Florence NightingaleFlorence Nightingale  1820-05-121820-05-12   3                    Marie CurieMarie Curie  1867-11-071867-11-07   4                Rachel CarsonRachel Carson  1907-05-271907-05-27   5                        John SnowJohn Snow  1813-03-151813-03-15   6                    Alan TuringAlan Turing  1912-06-231912-06-23   7                  Johann GaussJohann Gauss  1777-04-301777-04-30   

                   Died  Age                            Occupation  
0  1958-04-161958-04-16   74                        ChemistChemist  
1  1937-10-161937-10-16  122              StatisticianStatistician  
2  1910-08-131910-08-13  180                            NurseNurse  
3  1934-07-041934-07-04  132                        ChemistChemist  
4  1964-04-141964-04-14  112                    BiologistBiologist  
5  1858-06-161858-06-16   90                    PhysicianPhysician  
6  1954-06-071954-06-07   82  Computer ScientistComputer Scientist  
7  1855-02-231855-02-23  154            MathematicianMathematician

修改列

#!/usr/bin/python3# -*- coding: utf-8 -*-# Author:    xurongzhong#126.com wechat:pythontesting qq:37391319# qq群：144081101 591302926  567351477# CreateDate: 2018-06-07# change.pyimport pandas as pdimport numpy as npimport random

scientists = pd.read_csv('../data/scientists.csv')
print(scientists['Born'].dtype)
print(scientists['Died'].dtype)
print(scientists.head())# 转为日期 参考：https://docs.python.org/3.5/library/datetime.htmlborn_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
died_datetime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')# 增加列scientists['born_dt'], scientists['died_dt'] = (born_datetime, died_datetime)
print(scientists.shape)
print(scientists.head())

random.seed(42)
random.shuffle(scientists['Age']) # 此修改会作用于scientistsprint(scientists.head())

scientists['age_days_dt'] = (scientists['died_dt'] - scientists['born_dt'])
print(scientists.head())

执行结果：

$ python3 change.py 
object
object
                   Name        Born        Died  Age    Occupation0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist1        William Gosset  1876-06-13  1937-10-16   61  Statistician2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse3           Marie Curie  1867-11-07  1934-07-04   66       Chemist4         Rachel Carson  1907-05-27  1964-04-14   56     Biologist
(8, 7)
                   Name        Born        Died  Age    Occupation    born_dt  \0     Rosaline Franklin  1920-07-25  1958-04-16   37       Chemist 1920-07-25   1        William Gosset  1876-06-13  1937-10-16   61  Statistician 1876-06-13   2  Florence Nightingale  1820-05-12  1910-08-13   90         Nurse 1820-05-12   3           Marie Curie  1867-11-07  1934-07-04   66       Chemist 1867-11-07   4         Rachel Carson  1907-05-27  1964-04-14   56     Biologist 1907-05-27   

     died_dt  
0 1958-04-16  1 1937-10-16  2 1910-08-13  3 1934-07-04  4 1964-04-14  /usr/lib/python3.5/random.py:272: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  x[i], x[j] = x[j], x[i]
                   Name        Born        Died  Age    Occupation    born_dt  \0     Rosaline Franklin  1920-07-25  1958-04-16   66       Chemist 1920-07-25   1        William Gosset  1876-06-13  1937-10-16   56  Statistician 1876-06-13   2  Florence Nightingale  1820-05-12  1910-08-13   41         Nurse 1820-05-12   3           Marie Curie  1867-11-07  1934-07-04   77       Chemist 1867-11-07   4         Rachel Carson  1907-05-27  1964-04-14   90     Biologist 1907-05-27   

     died_dt  
0 1958-04-16  1 1937-10-16  2 1910-08-13  3 1934-07-04  4 1964-04-14  
                   Name        Born        Died  Age    Occupation    born_dt  \0     Rosaline Franklin  1920-07-25  1958-04-16   66       Chemist 1920-07-25   1        William Gosset  1876-06-13  1937-10-16   56  Statistician 1876-06-13   2  Florence Nightingale  1820-05-12  1910-08-13   41         Nurse 1820-05-12   3           Marie Curie  1867-11-07  1934-07-04   77       Chemist 1867-11-07   4         Rachel Carson  1907-05-27  1964-04-14   90     Biologist 1907-05-27   

     died_dt age_days_dt  
0 1958-04-16  13779 days  
1 1937-10-16  22404 days  
2 1910-08-13  32964 days  
3 1934-07-04  24345 days  
4 1964-04-14  20777 days

数据导入导出

out.py

#!/usr/bin/python3# -*- coding: utf-8 -*-# Author:    china-testing#126.com wechat:pythontesting qq群:630011153# CreateDate: 2018-3-31# out.pyimport pandas as pdimport numpy as npimport random
scientists = pd.read_csv('../data/scientists.csv')

names = scientists['Name']
print(names)
names.to_pickle('../output/scientists_names_series.pickle')
scientists.to_pickle('../output/scientists_df.pickle')# .p, .pkl,  .pickle 是常用的pickle文件扩展名scientist_names_from_pickle = pd.read_pickle('../output/scientists_df.pickle')
print(scientist_names_from_pickle)

names.to_csv('../output/scientist_names_series.csv')
scientists.to_csv('../output/scientists_df.tsv', sep='\t')# 不输出行号scientists.to_csv('../output/scientists_df_no_index.csv', index=None)# Series可以转为df再输出成excel文件names_df = names.to_frame()
names_df.to_excel('../output/scientists_names_series_df.xls')
names_df.to_excel('../output/scientists_names_series_df.xlsx')

scientists.to_excel('../output/scientists_df.xlsx', sheet_name='scientists',
                    index=False)

执行结果：

$ python3 out.py 
0       Rosaline Franklin1          William Gosset2    Florence Nightingale3             Marie Curie4           Rachel Carson5               John Snow6             Alan Turing7            Johann Gauss
Name: Name, dtype: object
                   Name        Born        Died  Age          Occupation0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist1        William Gosset  1876-06-13  1937-10-16   61        Statistician2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse3           Marie Curie  1867-11-07  1934-07-04   66             Chemist4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist5             John Snow  1813-03-15  1858-06-16   45           Physician6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician

注意：序列一般是直接输出成excel文件

更多的输入输出方法：

方式	描述
to_clipboard	将数据保存到系统剪贴板进行粘贴
to_dense	将数据转换为常规“密集”DataFrame
to_dict	将数据转换为Python字典
to_gbq	将数据转换为Google BigQuery表格
toJidf	将数据保存为分层数据格式（HDF）
to_msgpack	将数据保存到可移植的类似JSON的二进制文件中
toJitml	将数据转换为HTML表格
tojson	将数据转换为JSON字符串
toJatex	将数据转换为LTEXtabular环境
to_records	将数据转换为记录数组
to_string	将DataFrame显示为stdout的字符串
to_sparse	将数据转换为SparceDataFrame
to_sql	将数据保存到SQL数据库中
to_stata	将数据转换为Stata dta文件

读CSV文件

read_csv.py

#!/usr/bin/python3# -*- coding: utf-8 -*-# Author:    china-testing#126.com wechat:pythontesting QQ群:630011153# CreateDate: 2018-3-9# read_csv.pyimport pandas as pd

df = pd.read_csv("1.csv", header=None) # 不读取列名print("df:")
print(df)

print("df.head():")
print(df.head()) # head(self, n=5)，默认为5行，类似的有tailprint("df.tail():")
print(df.tail())

df = pd.read_csv("1.csv") # 默认读取列名print("df:")
print(df)

df = pd.read_csv("1.csv", names=['号码','群号']) # 自定义列名print("df:")
print(df)# 自定义列名，去掉第一行df = pd.read_csv("1.csv", skiprows=[0], names=['号码','群号'])
print("df:")
print(df)

执行结果：

df:           0          10         qq    qqgroup1   37391319  1440811012   37391320  1440811023   37391321  1440811034   37391322  1440811045   37391323  1440811056   37391324  1440811067   37391325  1440811078   37391326  1440811089   37391327  14408110910  37391328  14408111011  37391329  14408111112  37391330  14408111213  37391331  14408111314  37391332  14408111415  37391333  144081115df.head():          0          10        qq    qqgroup1  37391319  1440811012  37391320  1440811023  37391321  1440811034  37391322  144081104df.tail():           0          111  37391329  14408111112  37391330  14408111213  37391331  14408111314  37391332  14408111415  37391333  144081115df:
          qq    qqgroup0   37391319  1440811011   37391320  1440811022   37391321  1440811033   37391322  1440811044   37391323  1440811055   37391324  1440811066   37391325  1440811077   37391326  1440811088   37391327  1440811099   37391328  14408111010  37391329  14408111111  37391330  14408111212  37391331  14408111313  37391332  14408111414  37391333  144081115df:
          号码         群号0         qq    qqgroup1   37391319  1440811012   37391320  1440811023   37391321  1440811034   37391322  1440811045   37391323  1440811056   37391324  1440811067   37391325  1440811078   37391326  1440811089   37391327  14408110910  37391328  14408111011  37391329  14408111112  37391330  14408111213  37391331  14408111314  37391332  14408111415  37391333  144081115df:
          号码         群号0   37391319  1440811011   37391320  1440811022   37391321  1440811033   37391322  1440811044   37391323  1440811055   37391324  1440811066   37391325  1440811077   37391326  1440811088   37391327  1440811099   37391328  14408111010  37391329  14408111111  37391330  14408111212  37391331  14408111313  37391332  14408111414  37391333  144081115

写CSV文件

#!/usr/bin/python3# -*- coding: utf-8 -*-# write_csv.pyimport pandas as pd

data ={'qq': [37391319,37391320], 'group':[1,2]}

df = pd.DataFrame(data=data, columns=['qq','group'])
df.to_csv('2.csv',index=False)

作者：python作业AI毕业设计
链接：https://www.jianshu.com/p/e55f6a2e2cae

点击查看更多内容

为 TA 点赞

若觉得本文不错，就分享一下吧！

评论

评论

共同学习，写下你的评论

评论加载中...

展开查看更多评论

作者其他优质文章

正在加载中

幕布斯6054654

手记
篇

粉丝

219

获赞与收藏

1011

关注作者，订阅最新文章

阅读免费教程

Python 办公自动化教程

17个小节 25913 878

Python 算法入门教程

15个小节 27656 1081

Python 进阶应用教程

38个小节 66324 1042

推荐

评论

收藏

共同学习，写下你的评论



感谢您的支持，我会继续努力的～

扫码打赏，你说多少就多少

赞赏金额会直接到老师账户

支付方式

打开微信扫一扫，即可进行扫码打赏哦

今天注册有机会得

100积分直接送

付费专栏免费学

大额优惠券免费领

立即参与放弃机会

点击
抽奖

慕课手记新用户专享福利

恭喜你，你的运气太好了，居然抽中了 100个积分！

恭喜你，抽中了价值元的专栏！

太棒了，直接落到你账户里！

积分商城里的罗技鼠标、机械键盘、
Kindle 阅读器、小米平衡车
Apple iPad （10.2英寸）、大额优惠券
在等着你去兑换了噢

作者：

免费赠送

兑换码：1111222211 复制

优惠券可用于购买实战课、体系课
无门槛使用

先去看看，有什么好东西马上兑换我爱学习，选课去


热搜

最近搜索清空

pandas数据结构

创建数据

Series

数据帧(DataFrame)

修改列

数据导入导出

阅读免费教程