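A growing number of packages are built on top of pandas to address specific needs in data preparation, analysis and visualization. This is encouraging because it means pandas is not only helping users handle their data tasks, but also giving developers a better starting point for building powerful and more focused data tools. Creating libraries that complement pandas' functionality also allows pandas development to remain focused on its original requirements.

We would like to make it easier for users to find these projects. If you know of other substantial projects that you feel should be on this list, please let us know.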

Statistics and Machine Learning

Statsmodels

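Statsmodels is the prominent Python "statistics and econometrics library", and it has a long-standing special relationship with pandas. Statsmodels provides powerful statistics, econometrics, analysis and modeling functionality that is out of pandas' scope. Statsmodels uses pandas objects as the underlying data container for computation.
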
sklearn-pandas

Use pandas DataFrames in your scikit-learn ML pipeline.

Visualization

Bokeh

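Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.
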
yhat/ggplot

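Hadley Wickham's ggplot2 is a foundational exploratory visualization package for the R language. Based on "The Grammar of Graphics", it provides a powerful, declarative and extremely general way to generate customized plots of any kind of data. It's really quite incredible. Various implementations in other languages are available, but a faithful implementation for Python users has long been missing. Although still young (as of January 2014), the yhat/ggplot project has been progressing quickly in that direction.
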
Seaborn

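Although pandas has quite a bit of "just plot it" functionality built in, visualization, and in particular statistical graphics, is a vast field with a long tradition and lots of ground to cover. The Seaborn project builds on top of pandas and matplotlib to make it easy to plot more advanced types of data than pandas handles on its own.
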
Vincent

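The Vincent project leverages Vega (which in turn leverages d3) to create plots. Although functional, as of summer 2016 the Vincent project had not been updated in over two years and is unlikely to receive further updates.
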
IPython Vega

Like Vincent, the IPython Vega project leverages Vega to create plots, but primarily targets the IPython Notebook environment.

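Plotly

Plotly's Python API enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and D3.js. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of matplotlib, ggplot for Python, and Seaborn can convert figures into interactive web-based plots. Plots can be drawn in IPython Notebooks, edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has cloud, offline, or on-premise accounts for private use.
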
Pandas-Qt

Spun off from the main pandas library, the Pandas-Qt library enables DataFrame visualization and manipulation in PyQt4 and PySide applications.

IDE

IPython

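IPython is an interactive command shell and distributed computing environment. IPython Notebook is a web application for creating IPython notebooks. An IPython notebook is a JSON document containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media. IPython notebooks can be converted to a number of open standard output formats (HTML, HTML presentation slides, LaTeX, PDF, ReStructuredText, Markdown) through "Download As" in the web interface and through ipython nbconvert.

pandas DataFrames implement the _repr_html_ method, which IPython Notebook uses to display (abbreviated) HTML tables. (Note: HTML tables may or may not be compatible with non-HTML IPython output formats.)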
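
As a minimal sketch of the display hook mentioned above, the snippet below calls _repr_html_ directly on a small DataFrame. The column names and values are invented for illustration, and the exact markup returned may differ between pandas versions.

```python
import pandas as pd

# Hypothetical example data; any DataFrame behaves the same way.
df = pd.DataFrame({"city": ["Oslo", "Lima"], "population_m": [0.7, 10.0]})

# The notebook front end calls this method to obtain an HTML rendering.
html = df._repr_html_()

# The result is a plain string that contains an HTML <table> element.
print(html[:60])
```

Outside a notebook the method is rarely called by hand; it is shown here only to make the mechanism visible.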

quantopian/qgrid

qgrid is "an interactive grid for sorting and filtering DataFrames in IPython Notebook", built with SlickGrid.

Spyder

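Spyder is a cross-platform, Qt-based, open-source Python IDE with editing, testing, debugging, and introspection features. Spyder can now introspect and display pandas DataFrames, and show both "column-wise min/max and global min/max coloring."
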
API

pandas-datareader

pandas-datareader is a remote data access library for pandas. pandas.io from pandas < 0.17.0 is now refactored/split off to, and importable from, pandas_datareader (PyPI: pandas-datareader). Many/most of the supported APIs have at least a documentation paragraph in the pandas-datareader docs.

The following data feeds are available:

- Yahoo! Finance
- Google Finance
- FRED
- Fama/French
- World Bank
- OECD
- Eurostat
- EDGAR Index

quandl/Python

The Quandl API for Python wraps the Quandl REST API to return pandas DataFrames with time series indexes.

pydatastream

PyDatastream is a Python interface to the Thomson Dataworks Enterprise (DWE/Datastream) SOAP API that returns indexed pandas DataFrames or Panels with financial data. This package requires valid credentials for this API (non-free).

pandaSDMX

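pandaSDMX is an extensible library to retrieve and acquire statistical data and metadata disseminated in SDMX 2.1. This standard is currently supported by the European statistics office (Eurostat) and the European Central Bank (ECB). Datasets may be returned as pandas Series or multi-indexed DataFrames.
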
fredapi

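fredapi is a Python interface to the Federal Reserve Economic Data (FRED) provided by the Federal Reserve Bank of St. Louis. It works with both the FRED database and the ALFRED database, which contains point-in-time data (i.e. historic data revisions). fredapi provides a Python wrapper for the FRED HTTP API, and also provides several convenient methods for parsing and analyzing point-in-time data from ALFRED. fredapi makes use of pandas and returns data in a Series or DataFrame. This module requires a FRED API key, which you can obtain for free on the FRED website.
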
Domain Specific

Geopandas

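Geopandas extends pandas data objects to include geographic information which supports geometric operations. If your work entails maps and geographical coordinates, and you love pandas, you should take a close look at Geopandas.
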
xarray

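xarray brings the labeled data power of pandas to the physical sciences by providing N-dimensional variants of the core pandas data structures. It aims to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data at which pandas excels.
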
Out-of-core

Dask

Dask is a flexible parallel computing library for analytics. Dask allows a familiar DataFrame interface to be used for out-of-core, parallel and distributed computing.

Blaze

Blaze provides a standard API for doing computations with various in-memory and on-disk backends: NumPy, Pandas, SQLAlchemy, MongoDB, PyTables, PySpark.

Odo

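Odo provides a uniform API for moving data between different formats. It uses pandas' own read_csv for CSV IO and leverages many existing packages such as PyTables, h5py, and pymongo to move data between non-pandas formats. Its graph-based approach can also be extended by end users for custom formats that may be too specific for the core of odo.
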
Frequently Asked Questions (FAQ)

DataFrame memory usage

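As of pandas version 0.15.0, the memory usage of a dataframe (including the index) is shown when accessing the info method of a dataframe. A configuration option, display.memory_usage (see Options and Settings), specifies whether the dataframe's memory usage will be displayed when invoking the df.info() method.

For example, the memory usage of the dataframe below is shown when calling df.info():
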
In [1]: dtypes = ['int64', 'float64', 'datetime64[ns]', 'timedelta64[ns]',
|    ...:           'complex128', 'object', 'bool']
|    ...: 
| 
| In [2]: n = 5000
| 
| In [3]: data = dict([ (t, np.random.randint(100, size=n).astype(t))
|    ...:                 for t in dtypes])
|    ...: 
| 
| In [4]: df = pd.DataFrame(data)
| 
| In [5]: df['categorical'] = df['object'].astype('category')
| 
| In [6]: df.info()
| <class 'pandas.core.frame.DataFrame'>
| RangeIndex: 5000 entries, 0 to 4999
| Data columns (total 8 columns):
| bool               5000 non-null bool
| complex128         5000 non-null complex128
| datetime64[ns]     5000 non-null datetime64[ns]
| float64            5000 non-null float64
| int64              5000 non-null int64
| object             5000 non-null object
| timedelta64[ns]    5000 non-null timedelta64[ns]
| categorical        5000 non-null category
| dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
| memory usage: 284.1+ KB
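
The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.

New in version 0.17.1.

Passing memory_usage='deep' enables a more accurate memory usage report that accounts for the full usage of the contained objects. This is optional, as the deeper introspection can be expensive.
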
In [7]: df.info(memory_usage='deep')
| <class 'pandas.core.frame.DataFrame'>
| RangeIndex: 5000 entries, 0 to 4999
| Data columns (total 8 columns):
| bool               5000 non-null bool
| complex128         5000 non-null complex128
| datetime64[ns]     5000 non-null datetime64[ns]
| float64            5000 non-null float64
| int64              5000 non-null int64
| object             5000 non-null object
| timedelta64[ns]    5000 non-null timedelta64[ns]
| categorical        5000 non-null category
| dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
| memory usage: 401.2 KB
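
The difference between the shallow and the deep report can also be checked programmatically. The sketch below is an illustration under assumptions: the column name and string contents are invented, and it relies on the deep keyword of DataFrame.memory_usage, the method-level counterpart of memory_usage='deep'.

```python
import pandas as pd

# A column of Python strings stored with dtype=object (hypothetical data).
df = pd.DataFrame({"name": pd.Series(["x" * 50] * 1000, dtype=object)})

# Shallow: counts only the references held in the array.
shallow = df.memory_usage(index=False).sum()

# Deep: additionally interrogates each Python object for its own size.
deep = df.memory_usage(index=False, deep=True).sum()

print(shallow, deep)
```

The deep figure is larger because every string object is measured in addition to the reference that points to it.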
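
By default the display option is set to True, but it can be explicitly overridden by passing the memory_usage argument when invoking df.info().

The memory usage of each column can be found by calling the memory_usage method. This returns a Series with an index represented by column names and the memory usage of each column shown in bytes. For the dataframe above, the memory usage of each column and the total memory usage of the dataframe can be found with the memory_usage method:
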
In [8]: df.memory_usage()
| Out[8]: 
| Index                 72
| bool                5000
| complex128         80000
| datetime64[ns]     40000
| float64            40000
| int64              40000
| object             40000
| timedelta64[ns]    40000
| categorical         5800
| dtype: int64
| 
| # total memory usage of dataframe
| In [9]: df.memory_usage().sum()
| Out[9]: 290872

By default the memory usage of the dataframe's index is shown in the returned Series. The memory usage of the index can be suppressed by passing the index=False argument:

In [10]: df.memory_usage(index=False)
| Out[10]: 
| bool                5000
| complex128         80000
| datetime64[ns]     40000
| float64            40000
| int64              40000
| object             40000
| timedelta64[ns]    40000
| categorical         5800
| dtype: int64
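
The memory usage displayed by the info method uses the memory_usage method to determine the memory usage of a dataframe, while also formatting the output in human-readable units (base-2 representation; i.e., 1 KB = 1024 bytes).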
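
To make the base-2 convention concrete, here is a small stand-alone sketch. The helper name format_bytes is hypothetical and this is not pandas' internal implementation; it only reproduces the arithmetic for the 290872-byte total shown earlier.

```python
def format_bytes(num_bytes):
    """Format a byte count using base-2 units (1 KB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if size < 1024.0:
            return "%.1f %s" % (size, unit)
        size /= 1024.0
    return "%.1f PB" % size


# 290872 / 1024 = 284.05..., which is displayed with one decimal place.
print(format_bytes(290872))  # 284.1 KB
```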

See also Categorical Memory Usage.

Byte-Ordering Issues

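Occasionally you may have to deal with data that were created on a machine with a different byte order than the one on which you are running Python. To deal with this issue, convert the underlying NumPy array to the native system byte order before passing it to the Series/DataFrame/Panel constructors, using something similar to the following:
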
In [11]: x = np.array(list(range(10)), '>i4') # big endian
| 
| In [12]: newx = x.byteswap().newbyteorder() # force native byteorder
| 
| In [13]: s = pd.Series(newx)
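
An alternative sketch of the same conversion is given below for NumPy releases in which ndarray.newbyteorder is not available (it was removed in NumPy 2.0). It assumes only that dtype.newbyteorder and ndarray.astype are present, and the variable names simply mirror the example above.

```python
import numpy as np
import pandas as pd

x = np.array(list(range(10)), ">i4")  # big endian

# Build the native-order dtype and let astype copy and swap bytes as needed.
newx = x.astype(x.dtype.newbyteorder("="))

s = pd.Series(newx)
print(newx.dtype.isnative, s.tolist())
```

Unlike byteswap followed by a reinterpretation, astype converts by value, so the numbers stay the same whichever byte order the input used.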

See the NumPy documentation on byte order for more details.

Visualizing Data in Qt applications

There is no support for such visualization in pandas. However, the external package pandas-qt does provide this functionality.

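pandas: powerful Python data analysis toolkit

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.

Many of these principles are here to address the shortcomings frequently experienced using other languages and scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing/modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.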

Installation

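The easiest way for the majority of users to install pandas is to install it as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. This is the installation method chosen by most pandas users.

Instructions for installing from source, PyPI, various Linux distributions, or a development version are also provided.

Python version support

Officially Python 2.7, 3.4, 3.5, and 3.6

Installing pandas

Trying out pandas, no installation required!

The easiest way to start experimenting with pandas does not involve installing pandas at all.

Wakari is a free service that provides a hosted IPython Notebook service in the cloud.

Simply create an account, and have access to pandas from within your browser via an IPython Notebook in a few minutes.
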
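Installing pandas with Anaconda

Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.

The simplest way to install not only pandas, but also Python and the most popular packages that make up the SciPy stack (IPython, NumPy, Matplotlib, ...) is with Anaconda, a cross-platform (Linux, Mac OS X, Windows) Python distribution for data analytics and scientific computing.

After running a simple installer, the user will have access to pandas and the rest of the SciPy stack without needing to install anything else, and without needing to wait for any software to be compiled.

Installation instructions for Anaconda can be found here.

A full list of the packages available as part of the Anaconda distribution can be found here.

An additional advantage of installing with Anaconda is that you do not require admin rights to install it. It installs in the user's home directory, which also makes it trivial to delete Anaconda at a later date (just delete that folder).
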
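Installing pandas with Miniconda

The previous section outlined how to get pandas installed as part of the Anaconda distribution. However, this approach means you will install well over one hundred packages and involves downloading an installer which is a few hundred megabytes in size.

If you want to have more control over which packages are installed, or have limited internet bandwidth, then installing pandas with Miniconda may be a better solution.

Conda is the package manager that the Anaconda distribution is built upon. It is a package manager that is both cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination).

Miniconda allows you to create a minimal, self-contained Python installation, and then use the Conda command to install additional packages.

First you will need Conda to be installed; downloading and running the Miniconda installer will do this for you. The installer can be found here.

The next step is to create a new conda environment (these are analogous to a virtualenv, but they also allow you to specify precisely which Python version to install). Run the following command from a terminal window:
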
conda create -n name_of_my_env python

This will create a minimal environment with only Python installed in it. To put yourself inside this environment, run:

source activate name_of_my_env

On Windows the command is:

activate name_of_my_env

The final step required is to install pandas. This can be done with the following command:

conda install pandas

To install a specific pandas version:

conda install pandas=0.13.1

To install other packages, IPython for example:

conda install ipython

To install the full Anaconda distribution:

conda install anaconda

If you need packages that are available to pip but not conda, then install pip, and then use pip to install those packages:

conda install pip
pip install django

Installing from PyPI

pandas can be installed via pip from PyPI.

pip install pandas

This will likely require the installation of a number of dependencies, including NumPy, will require a compiler to compile required bits of code, and can take a few minutes to complete.

Installing using your Linux distribution's package manager.

The commands in this table will install pandas for Python 2 from your distribution. To install pandas for Python 3 you may need to use the package python3-pandas.

Distribution	Status	Download / Repository Link	Install method
Debian	stable	official Debian repository	`sudo apt-get install python-pandas`
Debian & Ubuntu	unstable (latest packages)	NeuroDebian	`sudo apt-get install python-pandas`
Ubuntu	stable	official Ubuntu repository	`sudo apt-get install python-pandas`
Ubuntu	unstable (daily builds)	PythonXY PPA; activate by: `sudo add-apt-repository ppa:pythonxy/pythonxy-devel && sudo apt-get update`	`sudo apt-get install python-pandas`
OpenSuse	stable	OpenSuse repository	`zypper in python-pandas`
Fedora	stable	official Fedora repository	`dnf install python-pandas`
Centos / RHEL	stable	EPEL repository	`yum install python-pandas`

Installing from source

See the contributing documentation for complete instructions on building from the git source tree. Further, see creating a development environment if you wish to create a pandas development environment.

Running the test suite

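pandas is equipped with an exhaustive set of unit tests covering about 97% of the codebase. To run it on your machine and verify that everything is working (and that you have all of the dependencies, soft and hard, installed), make sure you have nose and run:
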
>>> import pandas as pd
| >>> pd.test()
| Running unit tests for pandas
| pandas version 0.18.0
| numpy version 1.10.2
| pandas is installed in pandas
| Python version 2.7.11 |Continuum Analytics, Inc.|
|    (default, Dec  6 2015, 18:57:58) [GCC 4.2.1 (Apple Inc. build 5577)]
| nose version 1.3.7
| ..................................................................S......
| ........S................................................................
| .........................................................................
| 
| ----------------------------------------------------------------------
| Ran 9252 tests in 368.339s
| 
| OK (SKIP=117)

Dependencies

- setuptools
- NumPy: 1.7.1 or higher
- python-dateutil: 1.5 or higher
- pytz: needed for time zone support

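Optional Dependencies

- Cython: only necessary to build the development version. Version 0.19.1 or higher.
- SciPy: miscellaneous statistical functions
- xarray: pandas-like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
- PyTables: necessary for HDF5-based storage. Version 3.0.0 or higher required, version 3.2.1 or higher highly recommended.
- SQLAlchemy: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database-specific driver. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs. Some common drivers are:
  - psycopg2: for PostgreSQL
  - pymysql: for MySQL
  - SQLite: for SQLite, this is included in Python's standard library by default.
- matplotlib: for plotting
- For Excel I/O:
  - xlrd/xlwt: Excel reading (xlrd) and writing (xlwt)
  - openpyxl: openpyxl version 1.6.1 or higher (but lower than 2.0.0), or version 2.2 or higher, for writing .xlsx files (xlrd >= 0.9.0)
  - XlsxWriter: alternative Excel writer
- Jinja2: template engine for conditional HTML formatting.
- boto: necessary for Amazon S3 access.
- blosc: for msgpack compression using blosc
- One of PyQt4, PySide, pygtk, xsel, or xclip: necessary to use read_clipboard(). Most package managers on Linux distributions will have xclip and/or xsel immediately available for installation.
- Google's python-gflags (https://github.com/google/python-gflags/), oauth2client, httplib2 and google-api-python-client: needed for gbq
- Backports.lzma: only for Python 2, for writing to and/or reading from an xz-compressed DataFrame in CSV; Python 3 support is built into the standard library.
- One of the following combinations of libraries is needed to use the top-level read_html() function:
  - BeautifulSoup4 and html5lib (any recent version of html5lib is okay)
  - BeautifulSoup4 and lxml
  - BeautifulSoup4 and html5lib and lxml
  - Only lxml, although see HTML reading gotchas for reasons why you should probably not take this approach.

Warning

- If you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will not work with only BeautifulSoup4 installed.
- You are highly encouraged to read HTML reading gotchas. It explains issues surrounding the installation and usage of the above three libraries.
- You may need to install an older version of BeautifulSoup4: versions 4.2.1, 4.1.3 and 4.0.2 have been confirmed for 64- and 32-bit Ubuntu/Debian.
- Additionally, if you are using Anaconda you should read the gotchas about HTML parsing libraries.

Note

- If you are on a system with apt-get you can do

  sudo apt-get build-dep python-lxml

  to get the necessary dependencies for installation of lxml. This helps prevent errors later in the process.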

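Note

Without the optional dependencies, many useful features will not work. Hence, it is highly recommended that you install these. A packaged distribution like Anaconda or Enthought Canopy may be worth considering.
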
Internals

Indexing

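In pandas there are a few objects implemented which can serve as valid containers for the axis labels:

- Index: the generic "ordered set" object, an ndarray of object dtype assuming nothing about its contents. The labels must be hashable (and likely immutable) and unique. Populates a dict of label to location in Cython to do O(1) lookups.
- Int64Index: a version of Index highly optimized for 64-bit integer data, such as time stamps
- Float64Index: a version of Index highly optimized for 64-bit float data
- MultiIndex: the standard hierarchical index object
- DatetimeIndex: an Index object with Timestamp boxed elements (impl are the int64 values)
- TimedeltaIndex: an Index object with Timedelta boxed elements (impl are the int64 values)
- PeriodIndex: an Index object with Period elements

There are functions that make the creation of a regular index easy:

- date_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of Python datetime objects
- period_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of Period objects, representing timespans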
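
The two helper functions can be tried out as follows. This is a minimal sketch: the start dates, the number of periods and the frequency strings are arbitrary choices made for illustration.

```python
import pandas as pd

# Three consecutive days, starting from an arbitrary example date.
days = pd.date_range("2016-01-01", periods=3, freq="D")

# Three consecutive monthly periods.
months = pd.period_range("2016-01", periods=3, freq="M")

print(days)
print(months)
```

The first call yields a DatetimeIndex and the second a PeriodIndex, two of the label containers listed above.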

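The motivation for having an Index class in the first place was to enable different implementations of indexing. This means that it is possible for you, the user, to implement a custom Index subclass that may be better suited to a particular application than the ones provided in pandas.

From an internal implementation point of view, the relevant methods that an Index must define are one or more of the following (depending on how incompatible the new object internals are with the Index functions):

- get_loc: returns an "indexer" (an integer, or in some cases a slice object) for a label
- slice_locs: returns the "range" to slice between two labels
- get_indexer: computes the indexing vector for reindexing / data alignment purposes. See the source / docstrings for more on this
- get_indexer_non_unique: computes the indexing vector for reindexing / data alignment purposes when the index is non-unique. See the source / docstrings for more on this
- reindex: does any pre-conversion of the input index, then calls get_indexer
- union, intersection: computes the union or intersection of two Index objects
- insert: inserts a new label into an Index, yielding a new object
- delete: deletes a label, yielding a new object
- drop: deletes a set of labels
- take: analogous to ndarray.take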
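
A few of the methods listed above can be exercised on an ordinary Index to see what they return. The labels below are made up for the example, and the snippet uses the public Index API rather than a custom subclass.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c", "d"])

position = idx.get_loc("c")                 # integer position of one label
bounds = idx.slice_locs("b", "c")           # (start, stop) positions covering b..c
indexer = idx.get_indexer(["d", "a", "z"])  # -1 marks a label that is absent
combined = idx.union(pd.Index(["e"]))       # returns a new Index object

print(position, bounds, list(indexer), list(combined))
```

A custom Index subclass would need to return values of the same shape from these methods for reindexing and alignment to keep working.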

MultiIndex

Internally, the MultiIndex consists of a few things: the levels, the integer labels, and the level names:

In [1]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
| 
| In [2]: index
| Out[2]: 
| MultiIndex(levels=[[0, 1, 2], [u'one', u'two']],
|            labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
|            names=[u'first', u'second'])
| 
| In [3]: index.levels
| Out[3]: FrozenList([[0, 1, 2], [u'one', u'two']])
| 
| In [4]: index.labels
| Out[4]: FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
| 
| In [5]: index.names
| Out[5]: FrozenList([u'first', u'second'])
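
You can probably guess that the labels determine which unique element is identified with that location at each layer of the index. It is important to note that sortedness is determined solely from the integer labels and does not check (or care) whether the levels themselves are sorted. Fortunately, the constructors from_tuples and from_arrays ensure that this is true, but if you compute the levels and labels yourself, please be careful.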
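
The relationship between levels and integer labels can be illustrated without pandas at all. The sketch below is plain Python, not the MultiIndex implementation; it rebuilds the index tuples from the levels and labels printed above.

```python
levels = [[0, 1, 2], ["one", "two"]]
labels = [[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]

# For each row position, look up the label of every level by its integer code.
tuples = [
    tuple(level[code] for level, code in zip(levels, row_codes))
    for row_codes in zip(*labels)
]

print(tuples)
```

Each integer in labels is simply a position into the corresponding entry of levels, which is why ordering can be decided from the integers alone.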

Subclassing pandas Data Structures

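Warning

There are some easier alternatives before considering subclassing pandas data structures.

- Method chains with pipe
- Use composition. See here.

This section describes how to subclass pandas data structures to meet more specific needs. There are 2 points which need attention:

- Override constructor properties.
- Define original properties

Note

You can find a nice example in the geopandas project.
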
Override Constructor Properties

Each data structure has constructor properties that specify the data constructors. By overriding these properties, you can retain the defined classes through pandas data manipulations.

There are 3 constructors to be defined:

- _constructor: used when a manipulation result has the same dimensions as the original.
- _constructor_sliced: used when a manipulation result has one lower dimension than the original, such as DataFrame single-column slicing.
- _constructor_expanddim: used when a manipulation result has one higher dimension than the original, such as Series.to_frame() and DataFrame.to_panel().

The following table shows how pandas data structures define constructor properties by default.

Property Attributes	`Series`	`DataFrame`	`Panel`
`_constructor`	`Series`	`DataFrame`	`Panel`
`_constructor_sliced`	`NotImplementedError`	`Series`	`DataFrame`
`_constructor_expanddim`	`DataFrame`	`Panel`	`NotImplementedError`

The example below shows how to define SubclassedSeries and SubclassedDataFrame overriding constructor properties.

from pandas import DataFrame, Series

class SubclassedSeries(Series):

    @property
    def _constructor(self):
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        return SubclassedDataFrame

class SubclassedDataFrame(DataFrame):

    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries

>>> s = SubclassedSeries([1, 2, 3])
| >>> type(s)
| <class '__main__.SubclassedSeries'>
| 
| >>> to_framed = s.to_frame()
| >>> type(to_framed)
| <class '__main__.SubclassedDataFrame'>
| 
| >>> df = SubclassedDataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
| >>> df
|    A  B  C
| 0  1  4  7
| 1  2  5  8
| 2  3  6  9
| 
| >>> type(df)
| <class '__main__.SubclassedDataFrame'>
| 
| >>> sliced1 = df[['A', 'B']]
| >>> sliced1
|    A  B
| 0  1  4
| 1  2  5
| 2  3  6
| >>> type(sliced1)
| <class '__main__.SubclassedDataFrame'>
| 
| >>> sliced2 = df['A']
| >>> sliced2
| 0    1
| 1    2
| 2    3
| Name: A, dtype: int64
| >>> type(sliced2)
| <class '__main__.SubclassedSeries'>

Define Original Properties

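To let original data structures have additional properties, you should let pandas know what properties are added. pandas maps unknown properties to data names overriding __getattribute__. Defining original properties can be done in one of 2 ways:

- Define _internal_names and _internal_names_set for temporary properties which WILL NOT be passed to manipulation results.
- Define _metadata for normal properties which will be passed to manipulation results.

Below is an example that defines 2 original properties, "internal_cache" as a temporary property and "added_property" as a normal property.
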
class SubclassedDataFrame2(DataFrame):
| 
|     # temporary properties
|     _internal_names = pd.DataFrame._internal_names + ['internal_cache']
|     _internal_names_set = set(_internal_names)
| 
|     # normal properties
|     _metadata = ['added_property']
| 
|     @property
|     def _constructor(self):
|         return SubclassedDataFrame2

>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
| >>> df
|    A  B  C
| 0  1  4  7
| 1  2  5  8
| 2  3  6  9
| 
| >>> df.internal_cache = 'cached'
| >>> df.added_property = 'property'
| 
| >>> df.internal_cache
| cached
| >>> df.added_property
| property
| 
| # properties defined in _internal_names is reset after manipulation
| >>> df[['A', 'B']].internal_cache
| AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'
| 
| # properties defined in _metadata are retained
| >>> df[['A', 'B']].added_property
| property

Package overview

Data structures at a glance

Dimensions	Name	Description
1	Series	1D labeled homogeneously-typed array
2	DataFrame	General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
3	Panel	General 3D labeled, also size-mutable array

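Why more than 1 data structure?

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.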
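
The dictionary-like behavior can be sketched in a few lines. The column names and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

df["C"] = df["A"] + df["B"]   # insert a new column, like setting a dict key
b = df.pop("B")               # remove a column and get it back as a Series
del df["A"]                   # delete a column in place

print(list(df.columns))
```

After these three operations only column C remains in the container, while the popped column B lives on as an independent Series.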

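Also, we would like sensible default behaviors for the common API functions which take into account the typical orientation of time series and cross-sectional data sets. When using ndarrays to store 2- and 3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions; axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set there is likely to be a "right" way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data transformations in downstream functions.

For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results in more readable code:
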
for col in df.columns:
|     series = df[col]
|     # do something with series

Mutability and copying of data

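All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast majority of methods produce new objects and leave the input data untouched. In general, though, we like to favor immutability where sensible.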
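
The distinction between changing values in place, changing size in place, and methods that return new objects can be seen in a short sketch. The data are arbitrary, and sort_values stands in here for "most methods".

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2]})

df.loc[0, "A"] = 30            # value-mutable: change a value in place
df["B"] = [1.0, 2.0, 3.0]      # size-mutable: add a column in place

sorted_df = df.sort_values("A")  # returns a new object, df keeps its order

print(df["A"].tolist(), sorted_df["A"].tolist())
```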

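Getting Support

The first stop for pandas issues and ideas is the GitHub issue tracker. If you have a general question, pandas community experts can answer through Stack Overflow.

Longer discussions occur on the developer mailing list, and commercial support inquiries for Lambda Foundry should be sent to: support@lambdafoundry
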
Credits

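pandas development began at AQR Capital Management in April 2008. It was open-sourced at the end of 2009. AQR continued to provide resources for development through the end of 2011, and continues to contribute bug reports today.

Since January 2012, Lambda Foundry has been providing development resources, as well as commercial support, training, and consulting for pandas.

pandas is only made possible by a group of people around the world like you who have contributed new code, bug reports, fixes, comments and ideas. A complete list can be found on GitHub.

Development Team

pandas is a part of the PyData project. The PyData Development Team is a collection of developers focused on the improvement of Python's data libraries. The core team that coordinates development can be found on GitHub. If you are interested in contributing, please visit the project website.
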
License

=======
| License
| =======
| 
| pandas is distributed under a 3-clause ("Simplified" or "New") BSD
| license. Parts of NumPy, SciPy, numpydoc, bottleneck, which all have
| BSD-compatible licenses, are included. Their licenses follow the pandas
| license.
| 
| pandas license
| ==============
| 
| Copyright (c) 2011-2012, Lambda Foundry, Inc. and PyData Development Team
| All rights reserved.
| 
| Copyright (c) 2008-2011 AQR Capital Management, LLC
| All rights reserved.
| 
| Redistribution and use in source and binary forms, with or without
| modification, are permitted provided that the following conditions are
| met:
| 
|     * Redistributions of source code must retain the above copyright
|        notice, this list of conditions and the following disclaimer.
| 
|     * Redistributions in binary form must reproduce the above
|        copyright notice, this list of conditions and the following
|        disclaimer in the documentation and/or other materials provided
|        with the distribution.
| 
|     * Neither the name of the copyright holder nor the names of any
|        contributors may be used to endorse or promote products derived
|        from this software without specific prior written permission.
| 
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS
| "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
| LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
| A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
| OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
| SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
| LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
| DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
| THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
| (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
| OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
| 
| About the Copyright Holders
| ===========================
| 
| AQR Capital Management began pandas development in 2008. Development was
| led by Wes McKinney. AQR released the source under this license in 2009.
| Wes is now an employee of Lambda Foundry, and remains the pandas project
| lead.
| 
| The PyData Development Team is the collection of developers of the PyData
| project. This includes all of the PyData sub-projects, including pandas. The
| core team that coordinates development on GitHub can be found here:
| http://github.com/pydata.
| 
| Full credits for pandas contributors can be found in the documentation.
| 
| Our Copyright Policy
| ====================
| 
| PyData uses a shared copyright model. Each contributor maintains copyright
| over their contributions to PyData. However, it is important to note that
| these contributions are typically only changes to the repositories. Thus,
| the PyData source code, in its entirety, is not the copyright of any single
| person or institution. Instead, it is the collective copyright of the
| entire PyData Development Team. If individual contributors want to maintain
| a record of what changes/contributions they have specific copyright on,
| they should indicate their copyright in the commit message of the change
| when they commit the change to one of the PyData repositories.
| 
| With this in mind, the following banner should be used in any source code
| file to indicate the copyright and license terms:
| 
| #-----------------------------------------------------------------------------
| # Copyright (c) 2012, PyData Development Team
| # All rights reserved.
| #
| # Distributed under the terms of the BSD Simplified License.
| #
| # The full license is in the LICENSE file, distributed with this software.
| #-----------------------------------------------------------------------------
| 
| Other licenses can be found in the LICENSES directory.

rpy2 / R interface

Updating your code to use rpy2 functions

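In v0.16.0, the pandas.rpy module was deprecated and users are pointed to the similar functionality in rpy2 itself.

Instead of importing with import pandas.rpy.common as com, the following should be done to activate the pandas conversion support in rpy2:
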
from rpy2.robjects import pandas2ri
| pandas2ri.activate()
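
Converting data frames back and forth between rpy2 and pandas should be largely automated (no need to convert explicitly, it will be done on the fly in most rpy2 functions).

To convert explicitly, the functions are pandas2ri.py2ri() and pandas2ri.ri2py(). So these functions can be used to replace the existing functions in pandas:

- com.convert_to_r_dataframe(df) should be replaced with pandas2ri.py2ri(df)
- com.convert_robj(rdf) should be replaced with pandas2ri.ri2py(rdf)

Note: these functions are for the latest version (rpy2 2.5.x) and were called pandas2ri.pandas2ri() and pandas2ri.ri2pandas() previously.

Some of the other functionality in pandas.rpy can also be replaced easily. For example, to load R data as done with the load_data function, the current method:
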
df_iris = com.load_data('iris')

can be replaced by:

from rpy2.robjects import r
| r.data('iris')
| df_iris = pandas2ri.ri2py(r['iris'])
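
The convert_to_r_matrix function can be replaced by the normal pandas2ri.py2ri to convert data frames, with a subsequent call to the R as.matrix function.

Warning

Not all conversion functions in rpy2 work exactly the same as the current methods in pandas. If you experience problems or limitations in comparison to the ones in pandas, please report this at the issue tracker.

See also the documentation of the rpy2 project.
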
R interface with rpy2

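If your computer has R and rpy2 (> 2.2) installed (which will be left to the reader), you will be able to leverage the functionality below. On Windows, doing this is quite an ordeal, but users on Unix-like systems should find it quite easy. rpy2 evolves over time and is currently reaching its release 2.3, while the current interface is designed for the 2.2.x series. We recommend using 2.2.x over other series unless you are prepared to fix parts of the code. That said, rpy2-2.3.0 introduces improvements such as a better R-Python bridge memory management layer, so it might be a good idea to bite the bullet and submit patches for the few minor differences that need to be fixed.
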
# if installing for the first time
| hg clone http://bitbucket.org/lgautier/rpy2
| 
| cd rpy2
| hg pull
| hg update version_2.2.x
| sudo python setup.py install
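
Note

To use R packages with this interface, you will need to install them inside R yourself. At the moment it cannot install them for you.

Once you have installed R and rpy2, you should be able to import pandas.rpy.common without a hitch.

Transferring R data sets into Python

The load_data function retrieves an R data set and converts it to the appropriate pandas object (most likely a DataFrame):
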
In [1]: import pandas.rpy.common as com
| 
| In [2]: infert = com.load_data('infert')
| 
| In [3]: infert.head()
| Out[3]: 
|   education   age  parity  induced  case  spontaneous  stratum  pooled.stratum
| 1    0-5yrs  26.0     6.0      1.0   1.0          2.0        1             3.0
| 2    0-5yrs  42.0     1.0      1.0   1.0          0.0        2             1.0
| 3    0-5yrs  39.0     6.0      2.0   1.0          0.0        3             4.0
| 4    0-5yrs  34.0     4.0      2.0   1.0          0.0        4             2.0
| 5   6-11yrs  35.0     3.0      1.0   1.0          1.0        5            32.0

Converting DataFrames into R objects

New in version 0.8.

Starting from pandas 0.8, there is experimental support to convert DataFrames into the equivalent R object (that is, data.frame):

In [4]: import pandas.rpy.common as com
| 
| In [5]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]},
|    ...:                   index=["one", "two", "three"])
|    ...: 
| 
| In [6]: r_dataframe = com.convert_to_r_dataframe(df)
| 
| In [7]: print(type(r_dataframe))
| <class 'rpy2.robjects.vectors.DataFrame'>
| 
| In [8]: print(r_dataframe)
|       A B C
| one   1 4 7
| two   2 5 8
| three 3 6 9
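
The DataFrame's index is stored as the rownames attribute of the data.frame instance.

You can also use convert_to_r_matrix to obtain a Matrix instance, but bear in mind that it will only work with homogeneously-typed DataFrames (as R matrices bear no information on the data type):
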
In [9]: import pandas.rpy.common as com
| 
| In [10]: r_matrix = com.convert_to_r_matrix(df)
| 
| In [11]: print(type(r_matrix))
| <class 'rpy2.robjects.vectors.Matrix'>
| 
| In [12]: print(r_matrix)
|       A B C
| one   1 4 7
| two   2 5 8
| three 3 6 9
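
Calling R functions with pandas objects

High-level interface to R estimators

Remote Data Access

DataReader

The sub-package pandas.io.data is removed in favor of a separately installable pandas-datareader package. This will allow the data modules to be updated independently of your pandas installation. The API for pandas-datareader v0.1.1 is the same as in pandas v0.16.1. (GH8961)

You should replace the imports of the following:
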
from pandas.io import data, wb

with:

from pandas_datareader import data, wb

Google Analytics

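The ga module provides a wrapper for the Google Analytics API to simplify retrieving traffic data. Result sets are parsed into a pandas DataFrame with a shape and data types derived from the source table.
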
Configuring Access to Google Analytics

The first thing you need to do is to set up access to the Google Analytics API. Follow the steps below:

In the Google Developers Console

1. Enable the Analytics API
2. Create a new project
3. Create a new Client ID for an "Installed Application" (in the "APIs & auth / Credentials" section of the newly created project)
4. Download it (JSON file)

On your machine

1. Rename it to client_secrets.json
2. Move it to the pandas/io module directory

The first time you use the read_ga() function, a browser window will open to ask you to authenticate to the Google API. Do proceed.

Using the Google Analytics API

The following will fetch users and pageviews (metrics) data per day of the week, from 2014-01-01 to 2014-08-01, from a particular property.

import pandas.io.ga as ga
| ga.read_ga(
|     account_id  = "2360420",
|     profile_id  = "19462946",
|     property_id = "UA-2360420-5",
|     metrics     = ['users', 'pageviews'],
|     dimensions  = ['dayOfWeek'],
|     start_date  = "2014-01-01",
|     end_date    = "2014-08-01",
|     index_col   = 0,
|     filters     = "pagePath=~aboutus;ga:country==France",
| )
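
The only mandatory arguments are metrics, dimensions and start_date. We strongly recommend that you always specify the account_id, profile_id and property_id to avoid accessing the wrong data bucket in Google Analytics.

The index_col argument indicates which dimension(s) has to be taken as index.

The filters argument indicates the filtering to apply to the query. In the above example, the page URL has to contain aboutus AND the visitors' country has to be France.
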
Sparse data structures

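We have implemented "sparse" versions of Series and DataFrame. These are not sparse in the typical "mostly 0" sense. Rather, you can view these objects as being "compressed", where any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been "sparsified". This will make much more sense in an example. All of the standard pandas data structures have a to_sparse method:

The to_sparse method takes a kind argument (for the sparse index, see below) and a fill_value. So if we had a mostly zero Series, we could convert it to sparse with fill_value=0:

The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

As you can see, the density (% of values that have not been "compressed") is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter. Functionally, their behavior should be nearly identical to their dense counterparts.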
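
The idea of storing only the values that differ from the fill value, and of density as the stored fraction, can be sketched as follows. This is an illustration under assumptions, not the original example: it uses pd.arrays.SparseArray as exposed in recent pandas releases (1.0 or later) instead of the to_sparse method described here, and the data are invented.

```python
import numpy as np
import pandas as pd

values = np.array([1.5, np.nan, np.nan, np.nan, 2.5,
                   np.nan, np.nan, np.nan, np.nan, np.nan])

# For float data the fill_value defaults to NaN, so NaNs are not stored.
sparse = pd.arrays.SparseArray(values)

print(sparse.sp_values)     # only the stored (non-fill) values
print(sparse.density)       # fraction of values actually stored
dense = np.asarray(sparse)  # back to a regular ndarray
```

Two of the ten values are stored, so the density is 0.2; the remaining eight positions are represented only by the fill value.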

SparseArray

SparseArray is the base layer for all of the sparse indexed data structures. It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:

In [12]: arr = np.random.randn(10)
| 
| In [13]: arr[2:5] = np.nan; arr[7:8] = np.nan
| 
| In [14]: sparr = pd.SparseArray(arr)
| 
| In [15]: sparr
| Out[15]: 
| [-1.95566352972, -1.6588664276, nan, nan, nan, 1.15893288864, 0.145297113733, nan, 0.606027190513, 1.33421134013]
| Fill: nan
| IntIndex
| Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

Like the indexed objects (SparseSeries, SparseDataFrame), a SparseArray can be converted back to a regular ndarray by calling to_dense:

In [16]: sparr.to_dense()
| Out[16]: 
| array([-1.9557, -1.6589,     nan,     nan,     nan,  1.1589,  0.1453,
|            nan,  0.606 ,  1.3342])
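
SparseList

The SparseList class has been deprecated and will be removed in a future version. See the docs of a previous version for documentation on SparseList.

SparseIndex objects

Two kinds of SparseIndex are implemented, block and integer. We recommend using block as it is more memory efficient. The integer format keeps an array of all of the locations where the data are not equal to the fill value. The block format tracks only the locations and sizes of blocks of data.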
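
The difference between the two formats can be shown with a plain-Python sketch. The function names integer_index and block_index are hypothetical and this is not how pandas implements SparseIndex; the sketch also compares values with !=, so it is meant for non-NaN fill values only.

```python
def integer_index(values, fill_value):
    """Positions of every value that differs from fill_value."""
    return [i for i, v in enumerate(values) if v != fill_value]


def block_index(values, fill_value):
    """(locations, lengths) of consecutive runs of non-fill values."""
    locations, lengths = [], []
    previous = None
    for i in integer_index(values, fill_value):
        if previous is not None and i == previous + 1:
            lengths[-1] += 1        # extend the current block
        else:
            locations.append(i)     # start a new block
            lengths.append(1)
        previous = i
    return locations, lengths


data = [0, 5, 6, 7, 0, 0, 9]
print(integer_index(data, 0))  # [1, 2, 3, 6]
print(block_index(data, 0))    # ([1, 6], [3, 1])
```

With long runs of stored values the block form needs two numbers per run rather than one number per value, which is where the memory saving comes from.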

Sparse Dtypes

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, the fill_value default changes:

- float64: np.nan
- int64: 0
- bool: False

In [17]: s = pd.Series([1, np.nan, np.nan])
| 
| In [18]: s
| Out[18]: 
| 0    1.0
| 1    NaN
| 2    NaN
| dtype: float64
| 
| In [19]: s.to_sparse()
| Out[19]: 
| 0    1.0
| 1    NaN
| 2    NaN
| dtype: float64
| BlockIndex
| Block locations: array([0], dtype=int32)
| Block lengths: array([1], dtype=int32)
| 
| In [20]: s = pd.Series([1, 0, 0])
| 
| In [21]: s
| Out[21]: 
| 0    1
| 1    0
| 2    0
| dtype: int64
| 
| In [22]: s.to_sparse()
| Out[22]: 
| 0    1
| 1    0
| 2    0
| dtype: int64
| BlockIndex
| Block locations: array([0], dtype=int32)
| Block lengths: array([1], dtype=int32)
| 
| In [23]: s = pd.Series([True, False, True])
| 
| In [24]: s
| Out[24]: 
| 0     True
| 1    False
| 2     True
| dtype: bool
| 
| In [25]: s.to_sparse()
| Out[25]: 
| 0     True
| 1    False
| 2     True
| dtype: bool
| BlockIndex
| Block locations: array([0, 2], dtype=int32)
| Block lengths: array([1, 1], dtype=int32)
| 

| 

205 |

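You can change the dtype using .astype(); the result is also sparse. Note that .astype() also converts the fill_value, so that the dense representation stays consistent.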

In [26]: s = pd.Series([1, 0, 0, 0, 0])

In [27]: s
Out[27]:
0    1
1    0
2    0
3    0
4    0
dtype: int64

In [28]: ss = s.to_sparse()

In [29]: ss
Out[29]:
0    1
1    0
2    0
3    0
4    0
dtype: int64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [30]: ss.astype(np.float64)
Out[30]:
0    1.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

A ValueError is raised if any value, including the fill_value, cannot be coerced to the specified dtype.

In [1]: ss = pd.Series([1, np.nan, np.nan]).to_sparse()
0    1.0
1    NaN
2    NaN
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

In [2]: ss.astype(np.int64)
ValueError: unable to coerce current fill_value nan to int64 dtype
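
Why the fill value takes part in the conversion can be sketched without pandas. The following is a simplified model under the assumption that a sparse object is just a list of stored values plus one fill value; `sparse_astype` is a hypothetical helper, not a pandas function, and Python's built-in `int` and `float` stand in for the NumPy dtypes.

```python
def sparse_astype(stored, fill_value, caster):
    """Cast the stored values and the fill value with the same caster.

    Raises ValueError when the fill value itself cannot be converted,
    mirroring the failure shown above for a NaN fill value and an int dtype.
    """
    try:
        new_fill = caster(fill_value)
    except (ValueError, OverflowError) as exc:
        raise ValueError(
            f"unable to coerce fill_value {fill_value!r}"
        ) from exc
    return [caster(v) for v in stored], new_fill


# int -> float works: both the stored values and the fill value are converted.
print(sparse_astype([1], 0, float))  # ([1.0], 0.0)

# float with a NaN fill value -> int fails, because NaN has no integer form.
try:
    sparse_astype([1.0], float("nan"), int)
except ValueError as err:
    print("failed:", err)
```

Converting only the stored values would leave a fill value of the old type, and the dense form of the result would then mix two types.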

Sparse Calculation

You can apply NumPy ufuncs to a SparseArray and get a SparseArray as the result.

In [31]: arr = pd.SparseArray([1., np.nan, np.nan, -2., np.nan])

In [32]: np.abs(arr)
Out[32]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

The ufunc is also applied to the fill_value. This is needed to get the correct dense result.

In [33]: arr = pd.SparseArray([1., -1, -1, -2., -1], fill_value=-1)

In [34]: np.abs(arr)
Out[34]:
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)

In [35]: np.abs(arr).to_dense()
Out[35]: array([ 1.,  1.,  1.,  2.,  1.])
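
The reason the fill value must be transformed too can be shown with a small pure-Python model of the example above. This is a sketch under the assumption that a sparse array is represented by positions, stored values and a fill value; `apply_elementwise` and `to_dense` are hypothetical helpers, and the built-in `abs` stands in for `np.abs`.

```python
def apply_elementwise(func, stored, fill_value):
    """Apply func to the stored values and to the fill value."""
    return [func(v) for v in stored], func(fill_value)


def to_dense(length, positions, stored, fill_value):
    """Rebuild the dense list from the sparse pieces."""
    dense = [fill_value] * length
    for pos, val in zip(positions, stored):
        dense[pos] = val
    return dense


# Sparse pieces of [1.0, -1, -1, -2.0, -1] with fill_value=-1.
positions = [0, 3]
stored = [1.0, -2.0]
fill_value = -1

new_stored, new_fill = apply_elementwise(abs, stored, fill_value)

# Correct: the fill value was transformed as well.
print(to_dense(5, positions, new_stored, new_fill))    # [1.0, 1, 1, 2.0, 1]

# Wrong: keeping the old fill value leaves negative entries behind.
print(to_dense(5, positions, new_stored, fill_value))  # [1.0, -1, -1, 2.0, -1]
```

Only the first result agrees with applying `abs` to the dense data directly.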

Interaction with scipy.sparse

An experimental API is available for converting between sparse pandas and scipy.sparse structures.

A SparseSeries.to_coo() method is implemented for transforming a SparseSeries indexed by a MultiIndex to a scipy.sparse.coo_matrix.

The method requires a MultiIndex with two or more levels.

In [36]: s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])

In [37]: s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
   ....:                                      (1, 2, 'a', 1),
   ....:                                      (1, 1, 'b', 0),
   ....:                                      (1, 1, 'b', 1),
   ....:                                      (2, 1, 'b', 0),
   ....:                                      (2, 1, 'b', 1)],
   ....:                                      names=['A', 'B', 'C', 'D'])
   ....:

In [38]: s
Out[38]:
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64

# SparseSeries
In [39]: ss = s.to_sparse()

In [40]: ss
Out[40]:
A  B  C  D
1  2  a  0    3.0
         1    NaN
   1  b  0    1.0
         1    3.0
2  1  b  0    NaN
         1    NaN
dtype: float64
BlockIndex
Block locations: array([0, 2], dtype=int32)
Block lengths: array([1, 2], dtype=int32)

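In the example below, we transform the SparseSeries to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define the labels for the rows and the third and fourth levels define the labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.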

In [41]: A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
   ....:                              column_levels=['C', 'D'],
   ....:                              sort_labels=True)
   ....:

In [42]: A
Out[42]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [43]: A.todense()
Out[43]:
matrix([[ 0.,  0.,  1.,  3.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

In [44]: rows
Out[44]: [(1, 1), (1, 2), (2, 1)]

In [45]: columns
Out[45]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]

Specifying different row and column labels (and not sorting them) yields a different sparse matrix:

In [46]: A, rows, columns = ss.to_coo(row_levels=['A', 'B', 'C'],
   ....:                              column_levels=['D'],
   ....:                              sort_labels=False)
   ....:

In [47]: A
Out[47]:
<3x2 sparse matrix of type '<type 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [48]: A.todense()
Out[48]:
matrix([[ 3.,  0.],
        [ 1.,  3.],
        [ 0.,  0.]])

In [49]: rows
Out[49]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [50]: columns
Out[50]: [0, 1]
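
How the chosen levels turn into row and column coordinates can be sketched in pure Python. This is a simplified model and not the pandas implementation: `to_coo_sketch` is a hypothetical function, levels are given by position rather than by name, and single-level labels come back as 1-tuples rather than the scalars shown in the output above.

```python
import math


def to_coo_sketch(index, values, row_levels, column_levels, sort_labels=False):
    """Return (row, column, value) triples plus the row and column labels."""

    def labels(levels):
        # Collect the distinct label tuples in order of first appearance.
        seen = []
        for key in index:
            label = tuple(key[i] for i in levels)
            if label not in seen:
                seen.append(label)
        return sorted(seen) if sort_labels else seen

    rows, columns = labels(row_levels), labels(column_levels)
    triples = []
    for key, value in zip(index, values):
        if math.isnan(value):
            continue  # missing values are not stored in the sparse matrix
        r = rows.index(tuple(key[i] for i in row_levels))
        c = columns.index(tuple(key[i] for i in column_levels))
        triples.append((r, c, value))
    return triples, rows, columns


nan = float("nan")
index = [(1, 2, "a", 0), (1, 2, "a", 1), (1, 1, "b", 0),
         (1, 1, "b", 1), (2, 1, "b", 0), (2, 1, "b", 1)]
values = [3.0, nan, 1.0, 3.0, nan, nan]

# Levels A, B (positions 0, 1) as rows; C, D (positions 2, 3) as columns.
triples, rows, columns = to_coo_sketch(index, values, [0, 1], [2, 3],
                                       sort_labels=True)
print(rows)     # [(1, 1), (1, 2), (2, 1)]
print(columns)  # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
print(triples)  # [(1, 0, 3.0), (0, 2, 1.0), (0, 3, 3.0)]
```

The triples name the same three non-zero cells as the first 3x4 matrix above. Passing `[0, 1, 2]` and `[3]` without sorting reproduces the cell positions of the second, 3x2 matrix.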

A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.

In [51]: from scipy import sparse

In [52]: A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])),
   ....:                       shape=(3, 4))
   ....:

In [53]: A
Out[53]:
<3x4 sparse matrix of type '<type 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [54]: A.todense()
Out[54]:
matrix([[ 0.,  0.,  1.,  2.],
        [ 3.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.]])

The default behaviour (dense_index=False) simply returns a SparseSeries containing only the non-null entries.

In [55]: ss = pd.SparseSeries.from_coo(A)

In [56]: ss
Out[56]:
0  2    1.0
   3    2.0
1  0    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

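Specifying dense_index=True results in an index that is the Cartesian product of the row and column coordinates of the matrix. Note that this consumes a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.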

In [57]: ss_dense = pd.SparseSeries.from_coo(A, dense_index=True)

In [58]: ss_dense
Out[58]:
0  0    NaN
   1    NaN
   2    1.0
   3    2.0
1  0    3.0
   1    NaN
   2    NaN
   3    NaN
2  0    NaN
   1    NaN
   2    NaN
   3    NaN
dtype: float64
BlockIndex
Block locations: array([2], dtype=int32)
Block lengths: array([3], dtype=int32)
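
The memory cost of dense_index=True comes from the size of the index, which can be estimated with a short pure-Python sketch. This is an illustration only: `from_coo_sketch` is a hypothetical function that models the index as a list of (row, column) keys and is not the pandas implementation.

```python
from itertools import product


def from_coo_sketch(shape, triples, dense_index=False):
    """Return a list of ((row, column), value) pairs for a COO-style matrix."""
    stored = {(r, c): v for r, c, v in triples}
    if dense_index:
        # Every (row, column) pair of the matrix: the Cartesian product.
        keys = list(product(range(shape[0]), range(shape[1])))
    else:
        # Only the coordinates that actually hold a value.
        keys = sorted(stored)
    return [(key, stored.get(key, float("nan"))) for key in keys]


# Same matrix as above: shape (3, 4) with three stored elements.
triples = [(1, 0, 3.0), (0, 2, 1.0), (0, 3, 2.0)]

compact = from_coo_sketch((3, 4), triples)
full = from_coo_sketch((3, 4), triples, dense_index=True)

print(compact)    # [((0, 2), 1.0), ((0, 3), 2.0), ((1, 0), 3.0)]
print(len(full))  # 12
```

The compact index grows with the number of stored elements, while the dense index grows with rows times columns, regardless of how few values are present.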
