python382怎么用

阅读: 评论:0

python382怎么用

python382怎么用

Python3快速入门(十五)——Pandas数据处理

一、函数应用

1、函数应用简介

如果要将自定义函数或其它库函数应用于Pandas对象,有三种使用方式。pipe()将函数用于表格,apply()将函数用于行或列,applymap()将函数用于元素。

2、表格函数应用

可以通过将函数对象和参数作为pipe函数的参数来执行自定义操作,会对整个DataFrame执行操作。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

def adder(x, y):

return x + y

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(5, 3),columns=['col1', 'col2', 'col3'])

print(df)

df = df.pipe(adder, 1)

print(df)

# output:

# col1 col2 col3

# 0 0.390803 0.940306 -1.300635

# 1 -0.349588 -1.290132 0.415693

# 2 -0.079585 -0.083825 0.262867

# 3 0.582377 0.171701 -1.011748

# 4 -0.466655 1.746269 1.281538

# col1 col2 col3

# 0 1.390803 1.940306 -0.300635

# 1 0.650412 -0.290132 1.415693

# 2 0.920415 0.916175 1.262867

# 3 1.582377 1.171701 -0.011748

# 4 0.533345 2.746269 2.281538

3、行、列函数应用

使用apply()函数可以沿DataFrame或Panel的轴执行应用函数,采用可选axis参数。 默认情况下,操作按列执行。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

def adder(x, y):

return x + y

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

print(df)

# 按列执行

result = df.apply(np.sum)

print(result)

# 按行执行

result = df.apply(np.sum, axis=1)

print(result)

# output:

# col1 col2 col3

# 0 -1.773775 -0.608478 0.602059

# 1 -0.208412 0.969435 -0.292108

# 2 0.776864 -0.768559 -0.389092

# 3 -2.088412 1.133090 1.006486

# 4 0.693241 1.808845 0.772191

# col1 -2.600494

# col2 2.534332

# col3 1.699536

# dtype: float64

# 0 -1.780194

# 1 0.468915

# 2 -0.380788

# 3 0.051164

# 4 3.274277

# dtype: float64

4、元素函数应用

在DataFrame的applymap()函数可以接受任何Python函数,并且返回单个值。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

print(df)

df = df.applymap(lambda x: x + 1)

print(df)

# output:

# col1 col2 col3

# 0 2.396185 -0.263581 -0.090799

# 1 1.718716 0.876074 -1.067746

# 2 -1.033945 -0.078448 1.036566

# 3 0.553849 0.251312 -0.422640

# 4 -0.896062 1.605349 -0.089430

# col1 col2 col3

# 0 3.396185 0.736419 0.909201

# 1 2.718716 1.876074 -0.067746

# 2 -0.033945 0.921552 2.036566

# 3 1.553849 1.251312 0.577360

# 4 0.103938 2.605349 0.910570

二、数据清洗

1、数据清洗简介

数据清洗是一项复杂且繁琐的工作,同时也是数据分析过程中最为重要的环节。数据清洗的目的一是通过清洗让数据可用,二是让数据变的更适合进行数据分析工作。因此,脏数据要清洗,干净数据也要清洗。在实际数据分析中,数据清洗将占用项目70%左右的时间。

2、缺失值处理

查看每一列有多少缺失值。

df.isnull().sum()

查看每一列有多少完整的数据

df.shape[0]-df.isnull().sum()

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))

print(df)

print(df.isnull().sum())

print(df.shape[0] - df.isnull().sum())

# output:

# A B C

# 2019-01-01 1.138325 0.981597 1.359580

# 2019-01-02 -1.622074 0.812393 -0.946351

# 2019-01-03 0.049815 1.194241 0.807209

# 2019-01-04 1.500074 -0.570367 -0.328529

# 2019-01-05 0.465869 1.049651 -0.112453

# 2019-01-06 -1.399495 0.492769 1.961198

# A 0

# B 0

# C 0

# dtype: int64

# A 6

# B 6

# C 6

# dtype: int64

删除列

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df)

del df['D']

# 删除第2列

df.lumns[2], axis=1, inplace=True)

# 删除B列

df.drop('B', axis=1, inplace=True)

print(df)

# output:

# A B C

# 2019-01-01 -0.703151 0.753482 -0.624376

# 2019-01-02 -0.396221 -0.832279 -1.419897

# 2019-01-03 -0.179341 -0.368501 -0.300810

# 2019-01-04 0.464156 0.117461 1.502114

# 2019-01-05 -1.022012 -1.612456 1.611377

# 2019-01-06 -0.677521 0.001020 -0.342290

# A

# 2019-01-01 -0.703151

# 2019-01-02 -0.396221

# 2019-01-03 -0.179341

# 2019-01-04 0.464156

# 2019-01-05 -1.022012

# 2019-01-06 -0.677521

删除NaN值

df.dropna(self, axis=0, how='any', thresh=None, subset=None,

inplace=False)

axis为轴,0表示对行进行操作,1表示对列进行操作。

how为操作类型,’any’表示只要出现NaN的行或列都删除,’all’表示删除整行或整列都为NaN的行或列。

thresh:NaN的阈值,达到thresh时删除。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df.iloc[1, 3] = None

df.iloc[2, 2] = None

print(df)

print(df.dropna(axis=1))

print(df.dropna(how='any'))

# output:

# A B C D

# 2019-01-01 -0.152239 -2.315100 -0.504998 -0.987549

# 2019-01-02 -1.884801 1.046506 -1.618871 NaN

# 2019-01-03 0.976682 -1.043107 NaN 0.391338

# 2019-01-04 0.143389 0.951518 0.040632 -0.443944

# 2019-01-05 3.092766 0.787921 -2.408260 -1.111238

# 2019-01-06 -0.179249 0.573734 -0.912023 0.261517

# A B

# 2019-01-01 -0.152239 -2.315100

# 2019-01-02 -1.884801 1.046506

# 2019-01-03 0.976682 -1.043107

# 2019-01-04 0.143389 0.951518

# 2019-01-05 3.092766 0.787921

# 2019-01-06 -0.179249 0.573734

# A B C D

# 2019-01-01 -0.152239 -2.315100 -0.504998 -0.987549

# 2019-01-04 0.143389 0.951518 0.040632 -0.443944

# 2019-01-05 3.092766 0.787921 -2.408260 -1.111238

# 2019-01-06 -0.179249 0.573734 -0.912023 0.261517

填充NaN值

df.fillna(self, value=None, method=None, axis=None, inplace=False,limit=None, downcast=None, **kwargs)

value:填充的值,可以为字典,字典的key为列名称。

inplace:表示是否对源数据进行修改,默认为False。

fillna默认会返回新对象,但也可以对现有对象进行就地修改。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df.iloc[1, 3] = None

df.iloc[2, 2] = None

print(df)

print(df.fillna({'C': 3.14, 'D': 0.0}))

# 使用指定值填充

df.fillna(value=3.14, inplace=True)

print(df)

# output:

# A B C D

# 2019-01-01 0.490727 -0.603079 0.202922 2.012060

# 2019-01-02 -0.855106 0.305557 0.851141 NaN

# 2019-01-03 -0.324215 0.629637 NaN -0.174930

# 2019-01-04 0.085996 0.173265 0.416938 -0.903989

# 2019-01-05 0.009368 0.410056 -1.297822 -2.202893

# 2019-01-06 0.021892 -0.359749 -0.608556 -0.859454

# A B C D

# 2019-01-01 0.490727 -0.603079 0.202922 2.012060

# 2019-01-02 -0.855106 0.305557 0.851141 0.000000

# 2019-01-03 -0.324215 0.629637 3.140000 -0.174930

# 2019-01-04 0.085996 0.173265 0.416938 -0.903989

# 2019-01-05 0.009368 0.410056 -1.297822 -2.202893

# 2019-01-06 0.021892 -0.359749 -0.608556 -0.859454

# A B C D

# 2019-01-01 0.490727 -0.603079 0.202922 2.012060

# 2019-01-02 -0.855106 0.305557 0.851141 3.140000

# 2019-01-03 -0.324215 0.629637 3.140000 -0.174930

# 2019-01-04 0.085996 0.173265 0.416938 -0.903989

# 2019-01-05 0.009368 0.410056 -1.297822 -2.202893

# 2019-01-06 0.021892 -0.359749 -0.608556 -0.859454

对数据进行布尔填充

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df.iloc[1, 3] = None

df.iloc[2, 2] = None

print(df)

print(pd.isnull(df))

# output:

# A B C D

# 2019-01-01 -1.337471 0.154446 0.493862 1.278946

# 2019-01-02 2.853301 -0.151376 0.318281 NaN

# 2019-01-03 1.094465 0.059063 NaN 0.216805

# 2019-01-04 -0.983091 -1.052905 0.416604 -1.431156

# 2019-01-05 -1.421142 1.015465 -1.851315 -0.680514

# 2019-01-06 0.224378 -0.636699 -0.749040 -0.728368

# A B C D

# 2019-01-01 False False False False

# 2019-01-02 False False False True

# 2019-01-03 False False True False

# 2019-01-04 False False False False

# 2019-01-05 False False False False

# 2019-01-06 False False False False

3、行和列处理

通过字典键可以进行列选择,获取DataFrame中的一列数据。

生成DataFrame时指定index和columns

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df)

# output:

# A B C D

# 2013-01-01 1.116914 -0.221035 -0.577299 -0.328831

# 2013-01-02 1.764656 1.462838 -0.360678 1.176134

# 2013-01-03 0.144396 -0.594359 -0.548543 1.281829

# 2013-01-04 0.632378 0.895123 -0.757924 -1.325917

# 2013-01-05 0.219125 -1.247446 0.335363 -0.676052

# 2013-01-06 0.963715 -0.131331 0.326482 -0.718461

index和columns也可以在DataFrame创建后指定

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df)

df.index = pd.date_range('20130201', periods=df.shape[0])

print(df)

df.index = pd.date_range('20130301', periods=len(df))

print(df)

# output:

# A B C D

# 2013-01-01 1.588442 1.548420 0.132539 0.410512

# 2013-01-02 0.200415 1.515354 2.275575 -1.533603

# 2013-01-03 0.838294 0.067409 -1.157181 0.401973

# 2013-01-04 0.551363 -0.749296 0.343762 -1.558969

# 2013-01-05 -0.799507 -1.343379 -0.006312 1.091014

# 2013-01-06 0.012188 -0.382384 0.280008 -2.333430

# a b c d

# 2013-02-01 1.588442 1.548420 0.132539 0.410512

# 2013-02-02 0.200415 1.515354 2.275575 -1.533603

# 2013-02-03 0.838294 0.067409 -1.157181 0.401973

# 2013-02-04 0.551363 -0.749296 0.343762 -1.558969

# 2013-02-05 -0.799507 -1.343379 -0.006312 1.091014

# 2013-02-06 0.012188 -0.382384 0.280008 -2.333430

# A B C D

# 2013-03-01 1.588442 1.548420 0.132539 0.410512

# 2013-03-02 0.200415 1.515354 2.275575 -1.533603

# 2013-03-03 0.838294 0.067409 -1.157181 0.401973

# 2013-03-04 0.551363 -0.749296 0.343762 -1.558969

# 2013-03-05 -0.799507 -1.343379 -0.006312 1.091014

# 2013-03-06 0.012188 -0.382384 0.280008 -2.333430

可以指定某一列为index

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

df['date'] = dates

print(df)

df = df.set_index('date', drop=True)

print(df)

# output:

# A B C D date

# 0 0.910416 -0.378195 0.332562 -0.194766 2013-01-01

# 1 0.533733 0.888629 -0.358143 1.583278 2013-01-02

# 2 0.482362 -0.905558 1.045753 -0.874653 2013-01-03

# 3 0.901622 -0.535862 -0.439763 -0.640594 2013-01-04

# 4 -1.273577 -0.746785 1.448309 -0.368285 2013-01-05

# 5 0.191289 -1.246213 0.184757 -1.143074 2013-01-06

# A B C D

# date

# 2013-01-01 0.910416 -0.378195 0.332562 -0.194766

# 2013-01-02 0.533733 0.888629 -0.358143 1.583278

# 2013-01-03 0.482362 -0.905558 1.045753 -0.874653

# 2013-01-04 0.901622 -0.535862 -0.439763 -0.640594

# 2013-01-05 -1.273577 -0.746785 1.448309 -0.368285

# 2013-01-06 0.191289 -1.246213 0.184757 -1.143074

在原有DataFrame的基础上,可以创建一个新的DataFrame,或者将原有DataFrame按行进行汇总统计创建一个新的DataFrame。

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))

print(df)

df1 = pd.DataFrame()

df1['min'] = df.min()

df1['max'] = df.max()

df1['std'] = df.std()

print(df1)

df['min'] = df.min(axis=1)

df['max'] = df.max(axis=1)

df['std'] = df.std(axis=1)

print(df)

# output:

# A B C

# 2013-01-01 0.901073 1.706925 -0.503194

# 2013-01-02 0.379870 0.729674 0.579337

# 2013-01-03 -1.285323 -0.665951 -0.161148

# 2013-01-04 -0.714282 0.423376 0.586061

# 2013-01-05 -0.895171 -0.413328 0.485803

# 2013-01-06 1.926472 -0.718467 1.113522

# min max std

# A -1.285323 1.926472 1.234084

# B -0.718467 1.706925 0.955797

# C -0.503194 1.113522 0.582913

# A B C min max std

# 2013-01-01 0.901073 1.706925 -0.503194 -0.503194 1.706925 1.113132

# 2013-01-02 0.379870 0.729674 0.579337 0.379870 0.729674 0.175247

# 2013-01-03 -1.285323 -0.665951 -0.161148 -1.285323 -0.161148 0.562671

# 2013-01-04 -0.714282 0.423376 0.586061 -0.714282 0.586061 0.685749

# 2013-01-05 -0.895171 -0.413328 0.485803 -0.895171 0.485803 0.696763

# 2013-01-06 1.926472 -0.718467 1.113522 -0.718467 1.926472 1.341957

axis=0,对DataFrame的每一列数据进行统计运算,得到一行。axis=0,对DataFrame的每一行数据进行统计运算,得到一列。

DataFrame可以修改index和columns。

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=list('ABC'))

print(df)

df = df.rename(index=lambda x: x + 5, columns={'A': 'newA', 'B': 'newB'})

print(df)

# output:

# A B C

# 2013-01-01 0.834910 0.652175 0.537611

# 2013-01-02 1.083902 0.836208 -1.466876

# 2013-01-03 -0.044256 0.932547 1.843682

# 2013-01-04 1.610113 -0.705734 -0.145042

# 2013-01-05 1.114897 0.273569 -0.047725

# 2013-01-06 -0.541942 -0.112752 1.644338

# newA newB C

# 2013-01-06 0.834910 0.652175 0.537611

# 2013-01-07 1.083902 0.836208 -1.466876

# 2013-01-08 -0.044256 0.932547 1.843682

# 2013-01-09 1.610113 -0.705734 -0.145042

# 2013-01-10 1.114897 0.273569 -0.047725

# 2013-01-11 -0.541942 -0.112752 1.644338

列数据的单位统一

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df['D'] = [10000, 34000, 60000, 34000, 56000, 80000]

print(df)

for i in range(len(df['D'])):

weight = float(df.iloc[i, 3]) / 10000

df.iloc[i, 3] = '{}万'.format(weight)

print(df)

# output:

# A B C D

# 2019-01-01 -0.889533 -0.411451 0.563969 10000

# 2019-01-02 -0.573239 0.264805 -0.058530 34000

# 2019-01-03 1.224993 -1.815338 -2.075301 60000

# 2019-01-04 0.266483 1.841926 -0.759681 34000

# 2019-01-05 -0.167595 0.432617 0.533577 56000

# 2019-01-06 -0.973877 0.700821 1.093101 80000

# A B C D

# 2019-01-01 -0.889533 -0.411451 0.563969 1.0万

# 2019-01-02 -0.573239 0.264805 -0.058530 3.4万

# 2019-01-03 1.224993 -1.815338 -2.075301 6.0万

# 2019-01-04 0.266483 1.841926 -0.759681 3.4万

# 2019-01-05 -0.167595 0.432617 0.533577 5.6万

# 2019-01-06 -0.973877 0.700821 1.093101 8.0万

4、重复值删除

df.duplicated(self, subset=None, keep='first')

检查DataFrame是否有重复数据。

subset:子集,列标签或列标签的序列

keep:可选值为first,last,False,first表示保留第一个出现的值,last表示保留最后一个出现的值,False表示保留所有的值。

df.drop_duplicates(self, subset=None, keep='first', inplace=False)

删除DataFrame的重复数据。

subset:子集,列标签或列标签的序列

keep:可选值为first,last,False,first表示保留第一个出现的值,last表示保留最后一个出现的值,False表示保留所有的值。

inplace:值为True表示修改源数据,值为False表示不修改源数据

import pandas as pd

import numpy as np

if __name__ == "__main__":

data = [['Alex', np.nan, 80], ['Bob', 25, 90], ['Bob', 25, 90]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

# 使用bool过滤,取出重复的值

print(df[df.duplicated(keep=False)])

# 删除重复值,修改源数据

df.drop_duplicates(keep='last', inplace=True)

print(df)

# output:

# Name Age Score

# 0 Alex NaN 80

# 1 Bob 25.0 90

# 2 Bob 25.0 90

# Name Age Score

# 1 Bob 25.0 90

# 2 Bob 25.0 90

# Name Age Score

# 0 Alex NaN 80

# 2 Bob 25.0 90

5、异常值处理

异常值分为两种,一种是非法数据,如数字列的中间夹杂着一些汉字或者是符号;第二种是异常数据,异乎寻常的大数值或者是小数值。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

def swap(x):

if type(x) == str:

if x[-1] == '岁':

x = int(x[:-1])

elif x[-1] == '分':

x = int(x[:-1])

return x

if __name__ == "__main__":

data = [['Alex', np.nan, '89分'], ['Bob', '25岁', '90分'], ['Bob', '28岁', '90分']]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

df = df.applymap(swap)

print(df)

# output:

# Name Age Score

# 0 Alex NaN 89分

# 1 Bob 25岁 90分

# 2 Bob 28岁 90分

# Name Age Score

# 0 Alex NaN 89

# 1 Bob 25.0 90

# 2 Bob 28.0 90

6、数据格式清洗

清除字段字符的前后空格

df[‘city’]=df[‘city’].map(str.strip)

将字段进行大小写转换:

df[‘city’]=df[‘city’].str.lower()

import pandas as pd

import numpy as np

if __name__ == "__main__":

data = [['Alex', np.nan, 80], [' Bob ', 25, 90], [' Bob', 25, 90]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

# 清除字符串前后空格

print(df['Name'].map(str.strip))

# 大小写转换

print(df['Name'].str.lower())

# output:

# Name Age Score

# 0 Alex NaN 80

# 1 Bob 25.0 90

# 2 Bob 25.0 90

# 0 Alex

# 1 Bob

# 2 Bob

# Name: Name, dtype: object

# 0 alex

# 1 bob

# 2 bob

# Name: Name, dtype: object

更改列的数据类型:

df[‘price’].astype(‘int’)

7、数据替换

df[‘city’].replace(‘sh’, ‘shanghai’)

import pandas as pd

import numpy as np

if __name__ == "__main__":

data = [['Alex', np.nan, 80], ['Bob', 25, 90], ['Bob', 25, 90]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

print(df['Name'].replace('Bob', 'Bauer'))

# output:

# Name Age Score

# 0 Alex NaN 80

# 1 Bob 25.0 90

# 2 Bob 25.0 90

# 0 Alex

# 1 Bauer

# 2 Bauer

# Name: Name, dtype: object

替换时,字符串前后不能有空格存在,必须严格匹配。

三、数据处理

1、排序

(1)按标签排序

sort_index(self, axis=0, level=None, ascending=True, inplace=False,

kind='quicksort', na_position='last', sort_remaining=True,

by=None)

使用sort_index()函数,通过传递axis参数和排序顺序,可以对DataFrame进行排序。 默认情况下,按照升序对行标签进行排序。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col1', 'col2', 'col3'])

print(df)

df = df.sort_index()

print(df)

# output:

# col1 col2 col3

# rank2 -0.627700 -0.361006 -1.126366

# rank1 -1.997538 1.569461 0.454773

# rank4 -0.598688 1.348594 0.777791

# rank3 -0.190794 -1.209312 0.830699

# col1 col2 col3

# rank1 -1.997538 1.569461 0.454773

# rank2 -0.627700 -0.361006 -1.126366

# rank3 -0.190794 -1.209312 0.830699

# rank4 -0.598688 1.348594 0.777791

通过将布尔值传递给升序参数ascending,可以控制排序顺序;通过传递axis参数值为1,可以对列标签进行排序。 默认情况下,axis = 0,对行标签进行排序。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])

print(df)

# 按列标签进行排序

df = df.sort_index(ascending=True, axis=1)

print(df)

# output:

# col3 col2 col1

# rank2 -0.715319 -0.245760 -1.282737

# rank1 0.046705 -0.202133 0.185576

# rank4 -1.608270 -0.491281 0.047686

# rank3 -1.013456 -0.020197 1.184151

# col1 col2 col3

# rank2 -1.282737 -0.245760 -0.715319

# rank1 0.185576 -0.202133 0.046705

# rank4 0.047686 -0.491281 -1.608270

# rank3 1.184151 -0.020197 -1.013456

(2)按值排序

sort_values(self, by, axis=0, ascending=True, inplace=False,

kind='quicksort', na_position='last')

使用sort_values函数可以按值排序,接收一个by参数,使用DataFrame的列名称作为值,根据某列进行排序。by可以是列名称的列表。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])

print(df)

df = df.sort_values(by="col2")

print(df)

# output:

# col3 col2 col1

# rank2 -0.706054 -2.135880 1.066836

# rank1 0.290660 -2.214451 -1.724394

# rank4 1.211874 0.475177 -0.711855

# rank3 -0.253331 1.211301 -0.208633

# col3 col2 col1

# rank1 0.290660 -2.214451 -1.724394

# rank2 -0.706054 -2.135880 1.066836

# rank4 1.211874 0.475177 -0.711855

# rank3 -0.253331 1.211301 -0.208633

sort_values()提供mergesort,heapsort和quicksort三种排序算法,mergesort是唯一的稳定排序算法,通过参数kind进行传递。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])

print(df)

df = df.sort_values(by="col2", kind='mergesort')

print(df)

# output:

# col3 col2 col1

# rank2 -0.243768 -0.344846 0.535481

# rank1 -1.491950 0.690749 -2.023808

# rank4 -0.656292 -0.704788 0.655129

# rank3 0.468007 -0.250702 0.079670

# col3 col2 col1

# rank4 -0.656292 -0.704788 0.655129

# rank2 -0.243768 -0.344846 0.535481

# rank3 0.468007 -0.250702 0.079670

# rank1 -1.491950 0.690749 -2.023808

按顺序进行多列降序排序

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])

print(df)

df = df.sort_values(by=['col1', 'col3'], ascending=True, axis=0)

print(df)

# output:

# col3 col2 col1

# rank2 1.035965 1.048124 -0.341586

# rank1 2.391899 -1.575462 0.616940

# rank4 0.968523 -0.932288 -0.553498

# rank3 0.585521 1.907344 -0.264500

# col3 col2 col1

# rank4 0.968523 -0.932288 -0.553498

# rank2 1.035965 1.048124 -0.341586

# rank3 0.585521 1.907344 -0.264500

# rank1 2.391899 -1.575462 0.616940

2、分组

Pandas可以使用groupby函数对DataFrame进行拆分,得到分组对象。

group_keys=True, squeeze=False, observed=False, **kwargs)

by:分组方式,可以是字典、函数、标签、标签列表

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90], ['Jack', 26, 80]]

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])

print(df)

group_obj1 = df.groupby('Name')

print(ups)

print('===================================')

# 单层分组迭代

for key, data in group_obj1:

print(key)

print(data)

group_obj2 = df.groupby(['Name', 'A'])

# 分组信息查看

print(ups)

print('===================================')

# 多层分组迭代

for key, data in group_obj2:

print(key)

print(data)

# output:

# Name Age A

# a Alex 24 80

# b Bob 25 90

# c Bauer 25 90

# d Jack 26 80

# {'Alex': Index(['a'], dtype='object'), 'Bauer': Index(['c'], dtype='object'), 'Bob': Index(['b'], dtype='object'), 'Jack': Index(['d'], dtype='object')}

# ===================================

# Alex

# Name Age A

# a Alex 24 80

# Bauer

# Name Age A

# c Bauer 25 90

# Bob

# Name Age A

# b Bob 25 90

# Jack

# Name Age A

# d Jack 26 80

# {('Alex', 80): Index(['a'], dtype='object'), ('Bauer', 90): Index(['c'], dtype='object'), ('Bob', 90): Index(['b'], dtype='object'), ('Jack', 80): Index(['d'], dtype='object')}

# ===================================

# ('Alex', 80)

# Name Age A

# a Alex 24 80

# ('Bauer', 90)

# Name Age A

# c Bauer 25 90

# ('Bob', 90)

# Name Age A

# b Bob 25 90

# ('Jack', 80)

# Name Age A

# d Jack 26 80

filter()函数可以用于过滤数据。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data = [['Alex', 24, 80], ['Bob', 25, 92], ['Bauer', 25, 90], ['Jack', 26, 80]]

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])

print(df)

group_obj1 = df.groupby('Age')

print(ups)

# 过滤年龄相同的人

group = group_obj1.filter(lambda x: len(x) > 1)

print(group)

# output:

# Name Age A

# a Alex 24 80

# b Bob 25 92

# c Bauer 25 90

# d Jack 26 80

# {24: Index(['a'], dtype='object'), 25: Index(['b', 'c'], dtype='object'), 26: Index(['d'], dtype='object')}

# Name Age A

# b Bob 25 92

# c Bauer 25 90

3、合并

<(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)

合并两个DataFrame对象。

left ,左DataFrame对象。

right,右DataFrame对象。

on,列(名称)连接,必须在左DataFrame和右DataFrame对象中存在(找到)。

left_on,左侧DataFrame中的列用作键,可以是列名或长度等于DataFrame长度的数组。

right_on,来自右DataFrame的列作为键,可以是列名或长度等于DataFrame长度的数组。

left_index,如果为True,则使用左侧DataFrame中的索引(行标签)作为其连接键。 在具有MultiIndex(分层)的DataFrame的情况下,级别的数量必须与来自右DataFrame的连接键的数量相匹配。

right_index ,与右DataFrame的left_index具有相同的用法。

how,可选值为left, right, outer,inner,默认为inner。

sort,按照字典顺序通过连接键对结果DataFrame进行排序。默认为True,设置为False时,可以大大提高性能。

在一个键上合并两个DataFrame的示例如下:

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]

left = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])

data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]

right = pd.DataFrame(data2, columns=['Name', 'B', 'C'])

print(left)

print('==================================')

print(right)

print('==================================')

df = pd.merge(left, right, on='Name')

print(df)

# output:

# Name Age A

# 0 Alex 24 80

# 1 Bob 25 90

# 2 Bauer 25 90

# ==================================

# Name B C

# 0 Alex 87 78

# 1 Bob 67 87

# 2 Bauer 98 78

# ==================================

# Name Age A B C

# 0 Alex 24 80 87 78

# 1 Bob 25 90 67 87

# 2 Bauer 25 90 98 78

合并多个键上的两个DataFrame的示例如下:

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]

left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]

right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

print(left)

print('==================================')

print(right)

print('==================================')

df = pd.merge(left, right, on=['ID', 'Name'])

print(df)

# output:

# ID Name Age A

# 0 1 Alex 24 80

# 1 2 Bob 25 90

# 2 3 Bauer 25 90

# ==================================

# ID Name B C

# 0 1 Alex 87 78

# 1 4 Bob 67 87

# 2 3 Bauer 98 78

# ==================================

# ID Name Age A B C

# 0 1 Alex 24 80 87 78

# 1 3 Bauer 25 90 98 78

使用“how”参数进行合并,如何合并参数指定如何确定哪些键将被包含在结果表中。如果组合键没有出现在左侧或右侧表中,则连接表中的值将为NA。

left:LEFT OUTER JOIN,使用左侧对象的键。

right:RIGHT OUTER JOIN,使用右侧对象的键。

outer:FULL OUTER JOIN,使用键的联合。

inner:INNER JOIN,使用键的交集。

Left Join示例:

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]

left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]

right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

print(left)

print('==================================')

print(right)

print('==================================')

df = pd.merge(left, right, on='ID', how='left')

print(df)

# output:

# ID Name Age A

# 0 1 Alex 24 80

# 1 2 Bob 25 90

# 2 3 Bauer 25 90

# ==================================

# ID Name B C

# 0 1 Alex 87 78

# 1 4 Bob 67 87

# 2 3 Bauer 98 78

# ==================================

# ID Name_x Age A Name_y B C

# 0 1 Alex 24 80 Alex 87.0 78.0

# 1 2 Bob 25 90 NaN NaN NaN

# 2 3 Bauer 25 90 Bauer 98.0 78.0

Right Join示例:

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]

left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]

right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

print(left)

print('==================================')

print(right)

print('==================================')

df = pd.merge(left, right, on='ID', how='right')

print(df)

# output:

# ID Name Age A

# 0 1 Alex 24 80

# 1 2 Bob 25 90

# 2 3 Bauer 25 90

# ==================================

# ID Name B C

# 0 1 Alex 87 78

# 1 4 Bob 67 87

# 2 3 Bauer 98 78

# ==================================

# ID Name_x Age A Name_y B C

# 0 1 Alex 24.0 80.0 Alex 87 78

# 1 3 Bauer 25.0 90.0 Bauer 98 78

# 2 4 NaN NaN NaN Bob 67 87

Outer Join示例:

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]

left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]

right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

print(left)

print('==================================')

print(right)

print('==================================')

df = pd.merge(left, right, on='ID', how='outer')

print(df)

# output:

# ID Name Age A

# 0 1 Alex 24 80

# 1 2 Bob 25 90

# 2 3 Bauer 25 90

# ==================================

# ID Name B C

# 0 1 Alex 87 78

# 1 4 Bob 67 87

# 2 3 Bauer 98 78

# ==================================

# ID Name_x Age A Name_y B C

# 0 1 Alex 24.0 80.0 Alex 87.0 78.0

# 1 2 Bob 25.0 90.0 NaN NaN NaN

# 2 3 Bauer 25.0 90.0 Bauer 98.0 78.0

# 3 4 NaN NaN NaN Bob 67.0 87.0

Inner Join示例:

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]

left = pd.DataFrame(data1, columns=['ID', 'Name', 'Age', 'A'])

data2 = [[1, 'Alex', 87, 78], [4, 'Bob', 67, 87], [3, 'Bauer', 98, 78]]

right = pd.DataFrame(data2, columns=['ID', 'Name', 'B', 'C'])

print(left)

print('==================================')

print(right)

print('==================================')

df = pd.merge(left, right, on='ID', how='inner')

print(df)

# output:

# ID Name Age A

# 0 1 Alex 24 80

# 1 2 Bob 25 90

# 2 3 Bauer 25 90

# ==================================

# ID Name B C

# 0 1 Alex 87 78

# 1 4 Bob 67 87

# 2 3 Bauer 98 78

# ==================================

# ID Name_x Age A Name_y B C

# 0 1 Alex 24 80 Alex 87 78

# 1 3 Bauer 25 90 Bauer 98 78

4、级联

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,

keys=None, levels=None, names=None, verify_integrity=False,

sort=None, copy=True)

沿某个轴进行级联操作。

objs,Series、DataFrame或Panel对象的序列或字典。

axis,{0,1,...},默认为0,axis=0表示按index进行级联,axis=1表示按columns进行级联。

join,{'inner', 'outer'},默认inner,指示如何处理其它轴上的索引。

ignore_index,布尔值,默认为False。如果指定为True,则不使用连接轴上的索引值。结果轴将被标记为:0,...,n-1。

join_axes ,Index对象的列表。用于其它(n-1)轴的特定索引,而不是执行内部/外部集逻辑。

sort:是否进行排序,True会进行排序,False不进行排序。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]

one = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])

data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]

two = pd.DataFrame(data2, columns=['Name', 'B', 'C'])

print(one)

print('==================================')

print(two)

print('==================================')

df = pd.concat([one, two], axis=1, sort=False)

print(df)

# output:

# Name Age A

# 0 Alex 24 80

# 1 Bob 25 90

# 2 Bauer 25 90

# ==================================

# Name B C

# 0 Alex 87 78

# 1 Bob 67 87

# 2 Bauer 98 78

# ==================================

# Name Age A Name B C

# 0 Alex 24 80 Alex 87 78

# 1 Bob 25 90 Bob 67 87

# 2 Bauer 25 90 Bauer 98 78

当结果的索引是重复的,如果想要生成的对象必须遵循自己的索引,需要将ignore_index设置为True。

Pandas提供了连接DataFrame的append方法,沿axis=0连接。

df.append(self, other, ignore_index=False,

verify_integrity=False, sort=None)

向DataFrame对象中添加新的行,如果添加的列名不在DataFrame对象中,将会被当作新的列进行添加。

other:DataFrame、series、dict、list

ignore_index:默认值为False,如果为True则不使用index标签。

verify_integrity :默认值为False,如果为True当创建相同的index时会抛出ValueError的异常。

sort:boolean,默认是None。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90]]

one = pd.DataFrame(data1, columns=['Name', 'Age', 'A'])

data2 = [['Alex', 87, 78], ['Bob', 67, 87], ['Bauer', 98, 78]]

two = pd.DataFrame(data2, columns=['Name', 'B', 'C'])

print(one)

print('==================================')

print(two)

print('==================================')

df = one.append(two, sort=False)

print(df)

# output:

# Name Age A

# 0 Alex 24 80

# 1 Bob 25 90

# 2 Bauer 25 90

# ==================================

# Name B C

# 0 Alex 87 78

# 1 Bob 67 87

# 2 Bauer 98 78

# ==================================

# Name Age A B C

# 0 Alex 24.0 80.0 NaN NaN

# 1 Bob 25.0 90.0 NaN NaN

# 2 Bauer 25.0 90.0 NaN NaN

# 0 Alex NaN NaN 87.0 78.0

# 1 Bob NaN NaN 67.0 87.0

# 2 Bauer NaN NaN 98.0 78.0

Pandas提供了连接DataFrame的join方法,沿axis=1连接,用于将两个DataFrame中的不同的列索引合并成为一个DataFrame。

df.join(self, other, on=None, how='left', lsuffix='', rsuffix='',

sort=False)

join方法提供SQL的Join操作,默认为为左外连接how=left。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data1 = [['Alex', 24, 80], ['Bob', 25, 90], ['Bauer', 25, 90],['Jack', 26, 80]]

one = pd.DataFrame(data1, index=['a', 'b', 'c', 'd'], columns=['Name', 'Age', 'A'])

data2 = [[87, 78], [67, 87], [98, 78]]

two = pd.DataFrame(data2, index=['a', 'b', 'c'], columns=['B', 'C'])

print(one)

print('==================================')

print(two)

print('==================================')

df = one.join(two)

print(df)

# output:

# Name Age A

# a Alex 24 80

# b Bob 25 90

# c Bauer 25 90

# d Jack 26 80

# ==================================

# B C

# a 87 78

# b 67 87

# c 98 78

# ==================================

# Name Age A B C

# a Alex 24 80 87.0 78.0

# b Bob 25 90 67.0 87.0

# c Bauer 25 90 98.0 78.0

# d Jack 26 80 NaN NaN

5、迭代

迭代DataFrame提供列名。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df)

for col in df:

print(col, end=' ')

# output:

# A B C D

# 2019-01-01 -0.415754 -1.214340 -0.103952 1.232414

# 2019-01-02 -0.367888 0.257199 -1.615029 -0.335322

# 2019-01-03 0.552697 0.202993 -1.000219 -0.530897

# 2019-01-04 0.503410 -1.610091 1.660362 0.649700

# 2019-01-05 0.575416 -1.962578 -1.681379 -0.425239

# 2019-01-06 1.075917 -0.499081 1.886878 -0.073895

# A B C D

df.iteritems()用于迭代(key,value)对,将每个列标签作为key,value为Series对象。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

dates = pd.date_range('20190101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(df)

for key, value in df.iteritems():

print(key, value)

# output:

# A B C D

# 2019-01-01 -0.302021 1.343811 -0.070351 -0.409479

# 2019-01-02 -0.365564 0.743572 -0.475075 1.026054

# 2019-01-03 0.025748 1.395340 -0.987686 0.141003

# 2019-01-04 -0.291348 -1.173600 -2.286905 0.528416

# 2019-01-05 -1.844523 -0.052567 0.575980 0.260001

# 2019-01-06 0.271046 -0.583334 -0.596251 0.772095

# A 2019-01-01 -0.302021

# 2019-01-02 -0.365564

# 2019-01-03 0.025748

# 2019-01-04 -0.291348

# 2019-01-05 -1.844523

# 2019-01-06 0.271046

# Freq: D, Name: A, dtype: float64

# B 2019-01-01 1.343811

# 2019-01-02 0.743572

# 2019-01-03 1.395340

# 2019-01-04 -1.173600

# 2019-01-05 -0.052567

# 2019-01-06 -0.583334

# Freq: D, Name: B, dtype: float64

# C 2019-01-01 -0.070351

# 2019-01-02 -0.475075

# 2019-01-03 -0.987686

# 2019-01-04 -2.286905

# 2019-01-05 0.575980

# 2019-01-06 -0.596251

# Freq: D, Name: C, dtype: float64

# D 2019-01-01 -0.409479

# 2019-01-02 1.026054

# 2019-01-03 0.141003

# 2019-01-04 0.528416

# 2019-01-05 0.260001

# 2019-01-06 0.772095

# Freq: D, Name: D, dtype: float64

df.iterrows()用于返回迭代器,产生每个index以及包含每行数据的Series。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

print(df)

for index, value in df.iterrows():

print(index, value)

# output:

# A B C D

# 0 -1.097851 0.785749 -1.727198 -1.120925

# 1 -1.420429 0.094384 -1.566202 0.237084

# 2 -0.761957 0.552395 0.680884 -0.290955

# 3 0.357713 -0.323331 1.438013 -1.334616

# 4 0.015467 -2.431556 -0.717285 -0.094409

# 5 -1.198224 -1.370170 0.201725 0.258093

# 0 A -1.097851

# B 0.785749

# C -1.727198

# D -1.120925

# Name: 0, dtype: float64

# 1 A -1.420429

# B 0.094384

# C -1.566202

# D 0.237084

# Name: 1, dtype: float64

# 2 A -0.761957

# B 0.552395

# C 0.680884

# D -0.290955

# Name: 2, dtype: float64

# 3 A 0.357713

# B -0.323331

# C 1.438013

# D -1.334616

# Name: 3, dtype: float64

# 4 A 0.015467

# B -2.431556

# C -0.717285

# D -0.094409

# Name: 4, dtype: float64

# 5 A -1.198224

# B -1.370170

# C 0.201725

# D 0.258093

# Name: 5, dtype: float64

df.itertuples()方法将为DataFrame中的每一行返回一个产生一个命名元组的迭代器。元组的第一个元素是行的index,而剩余的值是行值。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

print(df)

for row in df.itertuples():

print(row)

# output:

# A B C D

# 0 0.681324 1.047734 -1.909570 -0.845900

# 1 -0.879077 -0.897085 -0.795461 -0.634519

# 2 0.484502 -0.061608 0.605827 -0.321721

# 3 -0.051974 1.533112 -1.011544 -0.922280

# 4 -0.634157 -0.173692 1.228584 -1.229581

# 5 0.236769 -0.933609 0.111948 1.048215

# Pandas(Index=0, A=0.6813238552921729, B=1.0477343302788706, C=-1.909570436815022, D=-0.8459001766064564)

# Pandas(Index=1, A=-0.8790771200969485, B=-0.8970849190216943, C=-0.7954606477323869, D=-0.6345188867416923)

# Pandas(Index=2, A=0.48450157948338324, B=-0.061608014575315506, C=0.6058267522125123, D=-0.32172144100965605)

# Pandas(Index=3, A=-0.05197447447575398, B=1.5331115391025778, C=-1.0115444345763995, D=-0.9222798204619236)

# Pandas(Index=4, A=-0.6341570074338677, B=-0.173692444412635, C=1.2285839004083785, D=-1.2295807166909738)

# Pandas(Index=5, A=0.23676890089548117, B=-0.9336090868233837, C=0.11194794444517034, D=1.0482154173833818)

迭代用于读取,迭代器返回原始对象(视图)的副本,因此迭代时更改将不会反映在原始对象上。

6、SQL化操作

在SQL中,SELECT使用逗号分隔的列列表(或选择所有列)来完成。

SELECT ID, Name FROM tablename LIMIT 5;

在Pandas中,列选择通过传递列名到DataFrame。

df[['ID', 'Name']].head(5)

SELECT操作示例:

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]

df = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'A'])

print(df)

print(df[['ID', 'Name']].head(5))

# output:

# ID Name Age A

# 0 1 Alex 24 80

# 1 2 Bob 25 90

# 2 3 Bauer 25 90

# ID Name

# 0 1 Alex

# 1 2 Bob

# 2 3 Bauer

在SQL中,使用WHERE进行条件过滤。

SELECT * FROM tablename WHERE Name = 'Bauer' LIMIT 5;

在Pandas中,通常使用布尔索引进行过滤。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

data = [[1, 'Alex', 24, 80], [2, 'Bob', 25, 90], [3, 'Bauer', 25, 90]]

df = pd.DataFrame(data, columns=['ID', 'Name', 'Age', 'A'])

print(df)

print('===========================')

print(df[df['Name'] == 'Bauer'].head(5))

# output:

# ID Name Age A

# 0 1 Alex 24 80

# 1 2 Bob 25 90

# 2 3 Bauer 25 90

# ===========================

# ID Name Age A

# 2 3 Bauer 25 90

四、数据分析

1、描述性统计

(1)sum

返回所请求轴的值的总和。 默认情况下,轴为索引(axis=0)。

import pandas as pd

if __name__ == "__main__":

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

print(df.sum())

print(df.sum(1))

# output:

# Name Age Score

# 0 Alex 25 80

# 1 Bob 26 90

# 2 Bauer 24 87

# Name AlexBobBauer

# Age 75

# Score 257

# dtype: object

# 0 105

# 1 116

# 2 111

# dtype: int64

(2)mean

返回平均值。

import pandas as pd

if __name__ == "__main__":

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

an())

# output:

# Name Age Score

# 0 Alex 25 80

# 1 Bob 26 90

# 2 Bauer 24 87

# Age 25.000000

# Score 85.666667

# dtype: float64

(3)std

返回数字列的Bressel标准偏差。

import pandas as pd

if __name__ == "__main__":

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

print(df.std())

# output:

# Name Age Score

# 0 Alex 25 80

# 1 Bob 26 90

# 2 Bauer 24 87

# Age 1.000000

# Score 5.131601

# dtype: float64

(4)median

求所有值的中位数。

import pandas as pd

if __name__ == "__main__":

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

dian())

# output:

# Name Age Score

# 0 Alex 25 80

# 1 Bob 26 90

# 2 Bauer 24 87

# Age 25.0

# Score 87.0

# dtype: float64

(5)min

求所有值中的最小值。

import pandas as pd

if __name__ == "__main__":

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

print(df.min())

# output:

# Name Age Score

# 0 Alex 25 80

# 1 Bob 26 90

# 2 Bauer 24 87

# Name Alex

# Age 24

# Score 80

# dtype: object

(6)max

求所有值中的最大值。

import pandas as pd

if __name__ == "__main__":

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

print(df.max())

# output:

# Name Age Score

# 0 Alex 25 80

# 1 Bob 26 90

# 2 Bauer 24 87

# Name Bob

# Age 26

# Score 90

# dtype: object

(7)describe

汇总有关DataFrame列的统计信息的摘要。

def describe(self, percentiles=None, include=None, exclude=None)

include用于传递关于什么列需要考虑用于总结的必要信息的参数。获取值列表,默认情况下是number 。

object - 汇总字符串列

number - 汇总数字列

all - 将所有列汇总在一起(不应将其作为列表值传递)

import pandas as pd

if __name__ == "__main__":

data = [['Alex', 25, 80], ['Bob', 26, 90], ['Bauer', 24, 87]]

df = pd.DataFrame(data, columns=['Name', 'Age', 'Score'])

print(df)

print(df.describe(include="all"))

# output:

# Name Age Score

# 0 Alex 25 80

# 1 Bob 26 90

# 2 Bauer 24 87

# Name Age Score

# count 3 3.0 3.000000

# unique 3 NaN NaN

# top Alex NaN NaN

# freq 1 NaN NaN

# mean NaN 25.0 85.666667

# std NaN 1.0 5.131601

# min NaN 24.0 80.000000

# 25% NaN 24.5 83.500000

# 50% NaN 25.0 87.000000

# 75% NaN 25.5 88.500000

# max NaN 26.0 90.000000

abs:求所有值的绝对值

prod:求所有值的乘积

cumsum:累计总和

cumprod:累计乘积

2、百分比变化

Series,DatFrames和Panel都有pct_change()函数,用于将每个元素与其前一个元素进行比较,并计算变化百分比。默认情况下,pct_change()对列进行操作; 如果想应用到行上,那么可使用axis = 1参数。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(4, 3), index=['rank2', 'rank1', 'rank4', 'rank3'], columns=['col3', 'col2', 'col1'])

print(df)

print(df.pct_change())

# output:

# col3 col2 col1

# rank2 0.988739 2.062798 1.400892

# rank1 0.394663 -0.988307 1.583098

# rank4 -0.768109 -0.163727 -1.801323

# rank3 0.999816 -1.224068 1.470020

# col3 col2 col1

# rank2 NaN NaN NaN

# rank1 -0.600842 -1.479110 0.130064

# rank4 -2.946241 -0.834336 -2.137846

# rank3 -2.301659 6.476294 -1.816078

3、协方差

协方差适用于Series数据,Series对象有一个方法cov用来计算Series对象之间的协方差,NA将被自动排除。当应用于DataFrame对象时,协方差方法计算所有列之间的协方差(cov)值。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])

print(df)

print(df['a'].cov(df['b']))

v())

# output:

# a b c d e

# 0 1.168443 -0.343905 2.254448 0.269765 -0.928009

# 1 0.542551 -1.303205 -1.767313 -0.349884 -0.352578

# 2 -2.028410 -1.176339 0.156047 1.426468 -1.338805

# 0.48923631972868176

# a b c d e

# a 2.870241 0.489236 0.713430 -1.312818 0.581441

# b 0.489236 0.271550 0.974811 -0.023849 -0.055862

# c 0.713430 0.974811 4.046193 0.580236 -0.558184

# d -1.312818 -0.023849 0.580236 0.812892 -0.430603

# e 0.581441 -0.055862 -0.558184 -0.430603 0.245420

4、相关性

相关性显示了任何两个数值(Series)之间的线性关系。有多种计算相关性的方法,如pearson(默认),spearman和kendall。如果DataFrame中存在任何非数字列,则会自动排除。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

df = pd.DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])

print(df)

print(df['a'].corr(df['b']))

())

# output:

# a b c d e

# 0 -2.110756 0.693665 0.405701 -0.628349 -1.062029

# 1 -1.331364 1.283434 1.619166 -0.025866 1.742287

# 2 -1.159944 0.435840 -0.251710 -0.347102 -0.026825

# 0.052396578025987336

# a b c d e

# a 1.000000 0.052397 -0.000006 0.743940 0.664845

# b 0.052397 1.000000 0.998626 0.706309 0.780790

# c -0.000006 0.998626 1.000000 0.668242 0.746977

# d 0.743940 0.706309 0.668242 1.000000 0.993772

# e 0.664845 0.780790 0.746977 0.993772 1.000000

5、数据排名

数据排名为元素数组中的每个元素生成排名。在关系的情况下,分配平均等级。

# -*- coding=utf-8 -*-

import pandas as pd

import numpy as np

if __name__ == "__main__":

s = pd.Series(np.random.randn(5), index=list('abcde'))

print(s)

s['a'] = s['c']

print(s.rank())

# output:

# a 1.597684

# a 1.597684

# b 1.107413

# c -0.298296

# d -0.281076

# e -0.667954

# dtype: float64

# a 2.5

# b 5.0

# c 2.5

# d 4.0

# e 1.0

# dtype: float64

rank使用一个默认为True的升序参数; False时,数据被反向排序,较大的值被分配较小的排序。

本文发布于:2024-02-04 08:53:37,感谢您对本站的认可!

本文链接:https://www.4u4v.net/it/170703510854136.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:
留言与评论(共有 0 条评论)
   
验证码:

Copyright ©2019-2022 Comsenz Inc.Powered by ©

网站地图1 网站地图2 网站地图3 网站地图4 网站地图5 网站地图6 网站地图7 网站地图8 网站地图9 网站地图10 网站地图11 网站地图12 网站地图13 网站地图14 网站地图15 网站地图16 网站地图17 网站地图18 网站地图19 网站地图20 网站地图21 网站地图22/a> 网站地图23