├── Pyspark.sql_module.ipynb ├── README.md ├── Scel ├── coal_dict.txt ├── 东北话大全【官方推荐】.scel ├── 史记【官方推荐】.scel ├── 开发大神专用词库【官方推荐】.scel ├── 柳宗元诗词【官方推荐】.scel └── 诗经【官方推荐】.scel ├── Spark_SQL.ipynb ├── Spark_Transformations_总结&举例.ipynb ├── Untitled.ipynb ├── scel_TR_txt.ipynb └── school.csv /Pyspark.sql_module.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 1. Module Context" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Spark SQL 和 DataFrames 的重要类\n", 15 | " - __pyspark.sql.SparkSession__ : DataFrame 和 SQL 功能的主入口;\n", 16 | " - __pyspark.sql.DataFrame__ : 分布在命名列中的分布式数据集合;\n", 17 | " - __pyspark.sql.Column__ : DataFrame 中的列表达式;\n", 18 | " - __pyspark.sql.Row__ : DataFrame 中的一行数据;\n", 19 | " - __pyspark.sql.GroupedData__ : 聚合方法,由 DataFrame.groupBy() 返回;\n", 20 | " - __pyspark.sql.DataFrameNaFunctions__ : 处理缺失数据的方法(空值);\n", 21 | " - __pyspark.sql.DataFrameStatFunctions__ : 统计功能的方法;\n", 22 | " - __pyspark.sql.functions__ : 可用于 DataFrame 的内置函数列表;\n", 23 | " - __pyspark.sql.types__ : 可用的数据类型列表;\n", 24 | " - __pyspark.sql.Window__ : 用于处理窗口函数" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## 1.1 class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)\n", 32 | "\n", 33 | " SparkSession 可用于创建 DataFrame,将 DataFrame 注册为表,在表上执行 SQL,缓存表以及读取 Parquet 文件。要创建 SparkSession,请使用以下构建器模式:\n", 34 | " \n", 35 | " __spark = SparkSession.builder \\ __ \n", 36 | " \n", 37 | " __... .master(\"local\") \\__\n", 38 | " \n", 39 | " __... .appName(\"Word Count\") \\__\n", 40 | " \n", 41 | " __... .config(\"spark.some.config.option\", \"some-value\") \\__\n", 42 | " \n", 43 | " __... 
.getOrCreate()__\n", 44 | " \n", 45 | "- __builder__ : 具有构建 SparkSession 实例的 Builder 的类属性\n", 46 | " + __appName__(name)\n", 47 | " \n", 48 | " 设置应用程序的名称,该名称将显示在 Spark Web UI 中。\n", 49 | "\n", 50 | " 如果未设置应用程序名称,则将使用随机生成的名称。\n", 51 | "\n", 52 | " 参数:name - 应用程序名称\n", 53 | "\n", 54 | " + __config__(key=None, value=None, conf=None)\n", 55 | " \n", 56 | " 设置配置选项。使用此方法设置的选项会自动传播到 SparkConf 和 SparkSession 自身的配置。\n", 57 | "\n", 58 | " 对于现有的 SparkConf,请使用 conf 参数。\n", 59 | " \n", 60 | " Parameters:\t\n", 61 | " - key – 配置属性的键名称字符串\n", 62 | " - value – 配置属性的值\n", 63 | " - conf – SparkConf 的一个实例\n", 64 | " \n", 65 | " + __enableHiveSupport__()\n", 66 | " \n", 67 | " 启用 Hive 支持,包括与持久性 Hive Metastore 的连接、对 Hive SerDe 的支持以及 Hive 用户自定义函数(UDF)。\n", 68 | " \n", 69 | " + __getOrCreate__()\n", 70 | " \n", 71 | " 获取现有 SparkSession,如果没有现有 SparkSession,则根据此构建器中设置的选项创建新 SparkSession。\n", 72 | "\n", 73 | " 此方法首先检查是否存在有效的全局默认 SparkSession,如果存在,则返回该实例。如果不存在有效的全局默认 SparkSession,则该方法将创建新的 SparkSession,并将新创建的 SparkSession 指定为全局默认值。\n", 74 | " \n", 75 | " + __master__(master)\n", 76 | " \n", 77 | " 设置要连接的 Spark master URL,例如 “local” 表示在本地运行,“local[4]” 表示在本地以 4 核运行,“spark://master:7077” 表示在 Spark 独立集群上运行。\n", 78 | "\n", 79 | " 参数:master - Spark master 的 URL\n", 80 | " \n", 81 | " + __catalog__\n", 82 | " \n", 83 | " 用户可通过其创建、删除、更改或查询底层数据库、表、函数等的接口。\n", 84 | "\n", 85 | " 返回:目录(Catalog)\n", 86 | " \n", 87 | " + __conf__\n", 88 | " \n", 89 | " Spark 的运行时配置接口。\n", 90 | "\n", 91 | " 用户可以通过该接口获取和设置与 Spark SQL 相关的所有 Spark 和 Hadoop 配置。获取配置的值时,默认为底层 SparkContext 中设置的值(如果有)\n", 92 | " \n", 93 | " + __createDataFrame__(data, schema=None, samplingRatio=None, verifySchema=True)\n", 94 | " \n", 95 | " 从 RDD、列表或 pandas.DataFrame 创建 DataFrame。\n", 96 | "\n", 97 | " 当 schema 是列名列表时,将从数据推断每列的类型。\n", 98 | "\n", 99 | " 当 schema 为 None 时,它将尝试从数据推断出模式(列名和类型),此时数据应该是 Row、namedtuple 或 dict 的 RDD。\n", 100 | "\n", 101 | " 当 schema 是 pyspark.sql.types.DataType 或数据类型字符串时,它必须与实际数据匹配,否则将在运行时抛出异常。如果给定的模式不是 pyspark.sql.types.StructType,它将被包装到 
pyspark.sql.types.StructType 中作为其唯一的字段,并且字段名称将为“value”,每条记录也将被包装成一个元组,可以在以后转换为行。\n", 102 | "\n", 103 | " 如果需要模式推断,则 samplingRatio 用于确定用于模式推断的行的比率。如果 samplingRatio 为 None,则将使用第一行。\n", 104 | "\n", 105 | " __参数__:\n", 106 | " - data - 任何类型的 SQL 数据表示的 RDD(例如 row、tuple、int、boolean 等),或 list 或 pandas.DataFrame。\n", 107 | " - schema - pyspark.sql.types.DataType 或数据类型字符串或列名列表,默认为 None。数据类型字符串格式等同于 pyspark.sql.types.DataType.simpleString,不同之处在于顶级结构类型可以省略 struct<>,并且原子类型使用 typeName() 作为它们的格式,例如对 pyspark.sql.types.ByteType 使用 byte 而不是 tinyint。我们也可以使用 int 作为 IntegerType 的简称。\n", 108 | " - samplingRatio - 用于推断的行的样本比率\n", 109 | " - verifySchema - 根据模式验证每一行的数据类型。\n", 110 | " \n", 111 | " __返回:数据帧__\n", 112 | " \n", 113 | " + __newSession__()\n", 114 | " \n", 115 | " 返回一个新的 SparkSession 作为新会话,它具有单独的 SQLConf、已注册的临时视图和 UDF,但共享 SparkContext 和表缓存。\n", 116 | " + __range__(start, end=None, step=1, numPartitions=None)\n", 117 | " \n", 118 | " 创建一个只含单个名为 id 的 pyspark.sql.types.LongType 列的 DataFrame,其中包含从 start 到 end(不包含)、步长为 step 的范围内的元素。\n", 119 | "\n", 120 | " 参数:\n", 121 | " - start - 起始值\n", 122 | " - end - 结束值(不包含)\n", 123 | " - step - 步长(默认值:1)\n", 124 | " - numPartitions - DataFrame 的分区数\n", 125 | " \n", 126 | " __返回:数据帧__\n", 127 | " \n", 128 | " + __read__\n", 129 | " \n", 130 | " 返回一个 DataFrameReader,可用于以 DataFrame 的形式读取数据。\n", 131 | "\n", 132 | " __返回:DataFrameReader__\n", 133 | " \n", 134 | " \n", 135 | " + __readStream__\n", 136 | " \n", 137 | " 返回一个 DataStreamReader,可用于将数据流作为流式 DataFrame 读取。\n", 138 | " \n", 139 | " __返回:DataStreamReader__\n", 140 | " \n", 141 | " + __sparkContext__\n", 142 | " \n", 143 | " 返回底层的 SparkContext\n", 144 | " \n", 145 | " + __sql__(sqlQuery)\n", 146 | " \n", 147 | " 返回表示给定查询结果的 DataFrame\n", 148 | "\n", 149 | " __返回:数据帧__\n", 150 | " \n", 151 | " + __stop__()\n", 152 | " \n", 153 | " 停止底层的 SparkContext\n", 154 | " \n", 155 | " + __streams__\n", 156 | " \n", 157 | " 返回一个 StreamingQueryManager,它允许管理在此上下文中所有活动的 StreamingQuery。\n", 158 | "\n", 159 | " 
返回:StreamingQueryManager\n", 160 | " \n", 161 | " + __table__(tableName)\n", 162 | " \n", 163 | " 将指定的表作为 DataFrame 返回。\n", 164 | "\n", 165 | " __返回:数据帧__\n", 166 | " \n", 167 | " + __udf__\n", 168 | " \n", 169 | " 返回用于注册 UDF 的 UDFRegistration。\n", 170 | "\n", 171 | " __返回:UDFRegistration__\n", 172 | " \n", 173 | " + __version__\n", 174 | " \n", 175 | " 运行此应用程序的 Spark 版本。\n", 176 | "\n", 177 | " " 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "## 1.2 pyspark.sql.SQLContext(sparkContext, sparkSession=None, jsqlContext=None)\n", 185 | "\n", 186 | " Spark 1.x 中使用 Spark 结构化数据(行和列)的入口点。\n", 187 | "\n", 188 | " 从 Spark 2.0 开始,它被 SparkSession 取代。但是,为了向后兼容,此类得以保留。\n", 189 | "\n", 190 | " 可以使用 SQLContext 创建 DataFrame,将 DataFrame 注册为表,对表执行 SQL,缓存表以及读取 Parquet 文件。\n", 191 | "\n", 192 | "#### 参数:\n", 193 | "- sparkContext - 支持此 SQLContext 的 SparkContext。\n", 194 | "- sparkSession - 包装此 SQLContext 的 SparkSession。\n", 195 | "- jsqlContext - 可选的 JVM Scala SQLContext。如果设置,则不会在 JVM 中实例化新的 SQLContext,而是对这个对象进行所有调用。" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "## 1.3 pyspark.sql.HiveContext(sparkContext, jhiveContext=None)\n", 203 | "\n", 204 | " Spark SQL 的一种变体,它与存储在 Hive 中的数据集成。\n", 205 | "\n", 206 | " 从类路径上的 hive-site.xml 读取 Hive 的配置。它支持运行 SQL 和 HiveQL 命令。\n", 207 | " \n", 208 | " __参数:__\n", 209 | " \n", 210 | " - __sparkContext__ - 要包装的 SparkContext。\n", 211 | " \n", 212 | " - __jhiveContext__ - 可选的 JVM Scala HiveContext。如果设置,则不会在 JVM 中实例化新的 HiveContext,而是对这个对象进行所有调用。" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": null, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": {}, 233 | "outputs": [], 234 | "source": [] 235 | }, 236 
| { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": {}, 254 | "outputs": [], 255 | "source": [] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": null, 330 | "metadata": {}, 331 | "outputs": [], 332 | "source": [] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": null, 337 | "metadata": 
{}, 338 | "outputs": [], 339 | "source": [] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": null, 344 | "metadata": {}, 345 | "outputs": [], 346 | "source": [] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": {}, 352 | "outputs": [], 353 | "source": [] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": null, 358 | "metadata": {}, 359 | "outputs": [], 360 | "source": [] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": {}, 373 | "outputs": [], 374 | "source": [] 375 | } 376 | ], 377 | "metadata": { 378 | "kernelspec": { 379 | "display_name": "Python 3", 380 | "language": "python", 381 | "name": "python3" 382 | }, 383 | "language_info": { 384 | "codemirror_mode": { 385 | "name": "ipython", 386 | "version": 3 387 | }, 388 | "file_extension": ".py", 389 | "mimetype": "text/x-python", 390 | "name": "python", 391 | "nbconvert_exporter": "python", 392 | "pygments_lexer": "ipython3", 393 | "version": "3.6.5" 394 | } 395 | }, 396 | "nbformat": 4, 397 | "nbformat_minor": 2 398 | } 399 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Spark_Python 2 | Spark—Python学习笔记 3 | -------------------------------------------------------------------------------- /Scel/coal_dict.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CQiang27/Spark_Python/e03a492805ccdaf4b557032ee446de312b38d117/Scel/coal_dict.txt -------------------------------------------------------------------------------- /Scel/东北话大全【官方推荐】.scel: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/CQiang27/Spark_Python/e03a492805ccdaf4b557032ee446de312b38d117/Scel/东北话大全【官方推荐】.scel -------------------------------------------------------------------------------- /Scel/史记【官方推荐】.scel: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CQiang27/Spark_Python/e03a492805ccdaf4b557032ee446de312b38d117/Scel/史记【官方推荐】.scel -------------------------------------------------------------------------------- /Scel/开发大神专用词库【官方推荐】.scel: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CQiang27/Spark_Python/e03a492805ccdaf4b557032ee446de312b38d117/Scel/开发大神专用词库【官方推荐】.scel -------------------------------------------------------------------------------- /Scel/柳宗元诗词【官方推荐】.scel: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CQiang27/Spark_Python/e03a492805ccdaf4b557032ee446de312b38d117/Scel/柳宗元诗词【官方推荐】.scel -------------------------------------------------------------------------------- /Scel/诗经【官方推荐】.scel: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CQiang27/Spark_Python/e03a492805ccdaf4b557032ee446de312b38d117/Scel/诗经【官方推荐】.scel -------------------------------------------------------------------------------- /Spark_SQL.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "- __Spark SQL, DataFrames and Datasets Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html#overview__\n", 8 | "\n", 9 | "\n", 10 | "- __pyspark.sql module: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame__" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# 1. 
Overview" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "- Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.\n", 25 | "\n", 26 | "- All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell.\n", 27 | "\n", 28 | "- Spark SQL是用于结构化数据处理的Spark模块。 与基本的Spark RDD API不同,Spark SQL提供的接口为Spark提供了有关数据结构和正在执行的计算的更多信息。 在内部,Spark SQL使用此额外信息来执行额外的优化。 有几种与Spark SQL交互的方法,包括SQL和Dataset API。 在计算结果时,使用相同的执行引擎,与您用于表达计算的API /语言无关。 这种统一意味着开发人员可以轻松地在不同的API之间来回切换,从而提供表达给定转换的最自然的方式。\n", 29 | "\n", 30 | "- 此页面上的所有示例都使用Spark分发中包含的示例数据,并且可以在spark-shell,pyspark shell或sparkR shell中运行。\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# 2. SQL" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "- One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame. 
You can also interact with the SQL interface using the command-line or over JDBC/ODBC.\n", 45 | "- Spark SQL的一个用途是执行SQL查询。 Spark SQL还可用于从现有Hive安装中读取数据。 有关如何配置此功能的更多信息,请参阅Hive Tables部分。 从另一种编程语言中运行SQL时,结果将作为数据集/数据框返回。 您还可以使用命令行或JDBC / ODBC与SQL接口进行交互。" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "# 3. Datasets and DataFrames" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "- A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar.\n", 60 | "\n", 61 | "- A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. 
While, in Java API, users need to use Dataset to represent a DataFrame.\n", 62 | "\n", 63 | "- Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.\n", 64 | "\n", 65 | "- 数据集(Dataset)是分布式数据集合。Dataset 是 Spark 1.6 中添加的一个新接口,它兼具 RDD 的优势(强类型、能够使用强大的 lambda 函数)以及 Spark SQL 优化执行引擎的优点。数据集可以从 JVM 对象构造,然后使用函数式转换(map、flatMap、filter 等)进行操作。Dataset API 在 Scala 和 Java 中可用。Python 不支持 Dataset API,但由于 Python 的动态特性,Dataset API 的许多好处已经可用(即您可以通过名称自然地访问行的字段,如 row.columnName)。R 的情况类似。\n", 66 | "\n", 67 | "- DataFrame 是组织为命名列的 Dataset。它在概念上等同于关系数据库中的表或 R/Python 中的数据框,但在底层具有更丰富的优化。DataFrame 可以从多种来源构建,例如:结构化数据文件、Hive 中的表、外部数据库或现有 RDD。DataFrame API 在 Scala、Java、Python 和 R 中可用。在 Scala 和 Java 中,DataFrame 由 Row 的 Dataset 表示。在 Scala API 中,DataFrame 只是 Dataset[Row] 的类型别名;而在 Java API 中,用户需要使用 Dataset 来表示 DataFrame。\n", 68 | "\n", 69 | "- 在本文档中,我们经常将 Row 的 Scala/Java Dataset 称为 DataFrame。" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "# 3. Getting Started" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "## 3.1 Starting Point: _SparkSession_\n", 84 | "\n", 85 | "- The __entry point__ into all functionality in Spark is the SparkSession class. 
To create a basic SparkSession, just use SparkSession.builder():\n", 86 | "- Spark中所有__功能的入口点__是 __SparkSession__ 类。 要创建基本的SparkSession,只需使用SparkSession.builder():\n", 87 | " \n", 88 | " __SparkSession.builder().getOrCreate()__" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 1, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "from pyspark.sql import SparkSession\n", 98 | "import os, time\n", 99 | "\n", 100 | "os.environ['SPARK_HOME'] = \"D:/Spark\"\n", 101 | "\n", 102 | "spark = SparkSession \\\n", 103 | " .builder \\\n", 104 | " .appName(\"Python Spark SQL basic example\") \\\n", 105 | " .config(\"spark.some.config.option\", \"some-value\") \\\n", 106 | " .getOrCreate()\n", 107 | "spark.stop()\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## 3.2 Creating DataFrames\n", 115 | "- With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.\n", 116 | "\n", 117 | "- As an example, the following creates a DataFrame based on the content of a JSON file:\n", 118 | "\n", 119 | "- 使用SparkSession,应用程序可以从现有RDD,Hive表或Spark数据源创建DataFrame。\n", 120 | "\n", 121 | "- 作为示例,以下内容基于JSON文件的内容创建DataFrame:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 2, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "+----+-------+\n", 134 | "| age| name|\n", 135 | "+----+-------+\n", 136 | "|null|Michael|\n", 137 | "| 30| Andy|\n", 138 | "| 19| Justin|\n", 139 | "+----+-------+\n", 140 | "\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "from pyspark.sql import SparkSession\n", 146 | "import os, time\n", 147 | "\n", 148 | "os.environ['SPARK_HOME'] = \"D:/Spark\"\n", 149 | "\n", 150 | "spark = SparkSession \\\n", 151 | " .builder \\\n", 152 | " .appName(\"Python Spark SQL basic example\") \\\n", 153 | " 
.config(\"spark.some.config.option\", \"some-value\") \\\n", 154 | " .getOrCreate()\n", 155 | "# spark is an existing SparkSession\n", 156 | "df = spark.read.json(\"D:/Spark/examples/src/main/resources/people.json\")\n", 157 | "# Displays the content of the DataFrame to stdout\n", 158 | "df.show()\n", 159 | "# +----+-------+\n", 160 | "# | age| name|\n", 161 | "# +----+-------+\n", 162 | "# |null|Michael|\n", 163 | "# | 30| Andy|\n", 164 | "# | 19| Justin|\n", 165 | "# +----+-------+\n", 166 | "spark.stop()\n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 3, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "+----+-------+\n", 179 | "| age| name|\n", 180 | "+----+-------+\n", 181 | "|null|Michael|\n", 182 | "| 30| Andy|\n", 183 | "| 19| Justin|\n", 184 | "+----+-------+\n", 185 | "\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "from pyspark.sql import SparkSession\n", 191 | "\n", 192 | "spark = SparkSession \\\n", 193 | " .builder \\\n", 194 | " .appName(\"Python Spark SQL basic example\") \\\n", 195 | " .config(\"spark.some.config.option\", \"some-value\") \\\n", 196 | " .getOrCreate()\n", 197 | "# spark is an existing SparkSession\n", 198 | "df = spark.read.json(\"D:/Spark/examples/src/main/resources/people.json\")\n", 199 | "# Displays the content of the DataFrame to stdout\n", 200 | "df.show()" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "## 3.3 Untyped Dataset Operations (aka DataFrame Operations)\n", 208 | "- DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python and R.\n", 209 | "\n", 210 | "- As mentioned above, in Spark 2.0, DataFrames are just Dataset of Rows in Scala and Java API. 
These operations are also referred as “untyped transformations” in contrast to “typed transformations” come with strongly typed Scala/Java Datasets.\n", 211 | "\n", 212 | "- Here we include some basic examples of structured data processing using Datasets:\n", 213 | "\n", 214 | "- DataFrames 为 Scala、Java、Python 和 R 中的结构化数据操作提供特定领域语言(DSL)。\n", 215 | "\n", 216 | "- 如上所述,在 Spark 2.0 中,DataFrame 只是 Scala 和 Java API 中 Row 的 Dataset。这些操作也被称为“无类型转换”,与之相对的是强类型 Scala/Java Dataset 所带来的“类型化转换”。\n", 217 | "\n", 218 | "- 这里我们包括一些使用 Dataset 进行结构化数据处理的基本示例:" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 4, 224 | "metadata": {}, 225 | "outputs": [ 226 | { 227 | "name": "stdout", 228 | "output_type": "stream", 229 | "text": [ 230 | "root\n", 231 | " |-- age: long (nullable = true)\n", 232 | " |-- name: string (nullable = true)\n", 233 | "\n", 234 | "+-------+\n", 235 | "| name|\n", 236 | "+-------+\n", 237 | "|Michael|\n", 238 | "| Andy|\n", 239 | "| Justin|\n", 240 | "+-------+\n", 241 | "\n", 242 | "+-------+---------+\n", 243 | "| name|(age + 1)|\n", 244 | "+-------+---------+\n", 245 | "|Michael| null|\n", 246 | "| Andy| 31|\n", 247 | "| Justin| 20|\n", 248 | "+-------+---------+\n", 249 | "\n", 250 | "+---+----+\n", 251 | "|age|name|\n", 252 | "+---+----+\n", 253 | "| 30|Andy|\n", 254 | "+---+----+\n", 255 | "\n", 256 | "+----+-----+\n", 257 | "| age|count|\n", 258 | "+----+-----+\n", 259 | "| 19| 1|\n", 260 | "|null| 1|\n", 261 | "| 30| 1|\n", 262 | "+----+-----+\n", 263 | "\n" 264 | ] 265 | } 266 | ], 267 | "source": [ 268 | "# spark, df are from the previous example\n", 269 | "# Print the schema in a tree format\n", 270 | "df.printSchema()\n", 271 | "# root\n", 272 | "# |-- age: long (nullable = true)\n", 273 | "# |-- name: string (nullable = true)\n", 274 | "\n", 275 | "# Select only the \"name\" column\n", 276 | "df.select(\"name\").show()\n", 277 | "# +-------+\n", 278 | "# | name|\n", 279 | "# +-------+\n", 280 | "# |Michael|\n", 281 | "# | Andy|\n", 282 | "# | Justin|\n", 283 
| "# +-------+\n", 284 | "\n", 285 | "# Select everybody, but increment the age by 1\n", 286 | "df.select(df['name'], df['age'] + 1).show()\n", 287 | "# +-------+---------+\n", 288 | "# | name|(age + 1)|\n", 289 | "# +-------+---------+\n", 290 | "# |Michael| null|\n", 291 | "# | Andy| 31|\n", 292 | "# | Justin| 20|\n", 293 | "# +-------+---------+\n", 294 | "\n", 295 | "# Select people older than 21\n", 296 | "df.filter(df['age'] > 21).show()\n", 297 | "# +---+----+\n", 298 | "# |age|name|\n", 299 | "# +---+----+\n", 300 | "# | 30|Andy|\n", 301 | "# +---+----+\n", 302 | "\n", 303 | "# Count people by age\n", 304 | "df.groupBy(\"age\").count().show()\n", 305 | "# +----+-----+\n", 306 | "# | age|count|\n", 307 | "# +----+-----+\n", 308 | "# | 19| 1|\n", 309 | "# |null| 1|\n", 310 | "# | 30| 1|\n", 311 | "# +----+-----+" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "## 3.4 Running SQL Queries Programmatically" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 5, 324 | "metadata": {}, 325 | "outputs": [ 326 | { 327 | "name": "stdout", 328 | "output_type": "stream", 329 | "text": [ 330 | "+----+-------+\n", 331 | "| age| name|\n", 332 | "+----+-------+\n", 333 | "|null|Michael|\n", 334 | "| 30| Andy|\n", 335 | "| 19| Justin|\n", 336 | "+----+-------+\n", 337 | "\n" 338 | ] 339 | } 340 | ], 341 | "source": [ 342 | "# Register the DataFrame as a SQL temporary view\n", 343 | "df.createOrReplaceTempView(\"people\")\n", 344 | "\n", 345 | "sqlDF = spark.sql(\"SELECT * FROM people\")\n", 346 | "sqlDF.show()\n", 347 | "# +----+-------+\n", 348 | "# | age| name|\n", 349 | "# +----+-------+\n", 350 | "# |null|Michael|\n", 351 | "# | 30| Andy|\n", 352 | "# | 19| Justin|\n", 353 | "# +----+-------+" 354 | ] 355 | }, 356 | { 357 | "cell_type": "markdown", 358 | "metadata": {}, 359 | "source": [ 360 | "## 3.5 Global Temporary View\n", 361 | "- Temporary views in Spark SQL are session-scoped and will 
disappear if the session that creates it terminates. If you want to have a temporary view that is shared among all sessions and keep alive until the Spark application terminates, you can create a global temporary view. Global temporary view is tied to a system preserved database global_temp, and we must use the qualified name to refer it, e.g. SELECT * FROM global_temp.view1.\n", 362 | "- Spark SQL中的临时视图是会话范围的,如果创建它的会话终止,它将消失。 如果您希望拥有一个在所有会话之间共享的临时视图并保持活动状态,直到Spark应用程序终止,您可以创建一个全局临时视图。 全局临时视图与系统保留的数据库global_temp绑定,我们必须使用限定名称来引用它,例如 SELECT * FROM global_temp.view1。" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 6, 368 | "metadata": { 369 | "scrolled": true 370 | }, 371 | "outputs": [ 372 | { 373 | "name": "stdout", 374 | "output_type": "stream", 375 | "text": [ 376 | "+----+-------+\n", 377 | "| age| name|\n", 378 | "+----+-------+\n", 379 | "|null|Michael|\n", 380 | "| 30| Andy|\n", 381 | "| 19| Justin|\n", 382 | "+----+-------+\n", 383 | "\n", 384 | "+----+-------+\n", 385 | "| age| name|\n", 386 | "+----+-------+\n", 387 | "|null|Michael|\n", 388 | "| 30| Andy|\n", 389 | "| 19| Justin|\n", 390 | "+----+-------+\n", 391 | "\n" 392 | ] 393 | } 394 | ], 395 | "source": [ 396 | "# Register the DataFrame as a global temporary view\n", 397 | "df.createGlobalTempView(\"people\")\n", 398 | "\n", 399 | "# Global temporary view is tied to a system preserved database `global_temp`\n", 400 | "spark.sql(\"SELECT * FROM global_temp.people\").show()\n", 401 | "# +----+-------+\n", 402 | "# | age| name|\n", 403 | "# +----+-------+\n", 404 | "# |null|Michael|\n", 405 | "# | 30| Andy|\n", 406 | "# | 19| Justin|\n", 407 | "# +----+-------+\n", 408 | "\n", 409 | "# Global temporary view is cross-session\n", 410 | "spark.newSession().sql(\"SELECT * FROM global_temp.people\").show()\n", 411 | "# +----+-------+\n", 412 | "# | age| name|\n", 413 | "# +----+-------+\n", 414 | "# |null|Michael|\n", 415 | "# | 30| Andy|\n", 416 | "# | 19| Justin|\n", 417 | "# 
+----+-------+" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "## 3.6 Creating Datasets" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "- Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or transmitting over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object.\n", 432 | "\n", 433 | "- 数据集与RDD类似,但是,它们不使用Java序列化或Kryo,而是使用专门的编码器来序列化对象以便通过网络进行处理或传输。 虽然编码器和标准序列化都负责将对象转换为字节,但编码器是动态生成的代码,并使用一种格式,允许Spark执行许多操作,如过滤,排序和散列,而无需将字节反序列化为对象。" 434 | ] 435 | }, 436 | { 437 | "cell_type": "markdown", 438 | "metadata": {}, 439 | "source": [ 440 | "# 4. Interoperating with RDDs" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "- Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.\n", 448 | "\n", 449 | "- The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. 
While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.\n", 450 | "\n", 451 | "- Spark SQL支持两种不同的方法将现有RDD转换为数据集。 第一种方法使用反射来推断包含特定类型对象的RDD的模式。 这种基于反射的方法可以提供更简洁的代码,并且在编写Spark应用程序时已经了解模式时可以很好地工作。\n", 452 | "\n", 453 | "- 创建数据集的第二种方法是通过编程接口,允许您构建模式,然后将其应用于现有RDD。 虽然此方法更详细,但它允许您在直到运行时才知道列及其类型时构造数据集。" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "## 4.1 Inferring the Schema Using Reflection\n", 461 | "- Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table, and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.\n", 462 | "- Spark SQL可以将Row对象的RDD转换为DataFrame,从而推断出数据类型。 通过将键/值对列表作为kwargs传递给Row类来构造行。 此列表的键定义表的列名称,并通过对整个数据集进行采样来推断类型,类似于对JSON文件执行的推断。" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 9, 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "name": "stdout", 472 | "output_type": "stream", 473 | "text": [ 474 | "Name: Justin\n" 475 | ] 476 | } 477 | ], 478 | "source": [ 479 | "from pyspark.sql import Row\n", 480 | "\n", 481 | "sc = spark.sparkContext\n", 482 | "\n", 483 | "# Load a text file and convert each line to a Row.\n", 484 | "lines = sc.textFile(\"examples/src/main/resources/people.txt\")\n", 485 | "parts = lines.map(lambda l: l.split(\",\"))\n", 486 | "people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))\n", 487 | "\n", 488 | "# Infer the schema, and register the DataFrame as a table.\n", 489 | "schemaPeople = spark.createDataFrame(people)\n", 490 | "schemaPeople.createOrReplaceTempView(\"people\")\n", 491 | "\n", 492 | "# SQL can be run over DataFrames that have been registered as a table.\n", 493 | "teenagers = spark.sql(\"SELECT name FROM people WHERE 
age >= 13 AND age <= 19\")\n", 494 | "\n", 495 | "# The results of SQL queries are Dataframe objects.\n", 496 | "# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.\n", 497 | "teenNames = teenagers.rdd.map(lambda p: \"Name: \" + p.name).collect()\n", 498 | "for name in teenNames:\n", 499 | " print(name)\n", 500 | "# Name: Justin" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "## 4.2 Programmatically Specifying the Schema\n", 508 | "- When a dictionary of kwargs cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.\n", 509 | "\n", 510 | "- Create an RDD of tuples or lists from the original RDD;\n", 511 | "\n", 512 | "- Create the schema represented by a StructType matching the structure of tuples or lists in the RDD created in the step 1.\n", 513 | "\n", 514 | "- Apply the schema to the RDD via createDataFrame method provided by SparkSession.\n", 515 | "\n", 516 | "- 当无法提前定义 kwargs 字典时(例如,记录结构以字符串形式编码,或者需要解析文本数据集并为不同用户以不同方式投影字段),可以通过以下三个步骤以编程方式创建 DataFrame。\n", 517 | "\n", 518 | "- 从原始 RDD 创建元组或列表的 RDD;\n", 519 | "\n", 520 | "- 创建由 StructType 表示的模式,该模式与步骤 1 中创建的 RDD 中的元组或列表的结构相匹配。\n", 521 | "\n", 522 | "- 通过 SparkSession 提供的 createDataFrame 方法将模式应用于 RDD。" 523 | ] 524 | }, 525 | { 526 | "cell_type": "code", 527 | "execution_count": 10, 528 | "metadata": {}, 529 | "outputs": [ 530 | { 531 | "name": "stdout", 532 | "output_type": "stream", 533 | "text": [ 534 | "+-------+\n", 535 | "| name|\n", 536 | "+-------+\n", 537 | "|Michael|\n", 538 | "| Andy|\n", 539 | "| Justin|\n", 540 | "+-------+\n", 541 | "\n" 542 | ] 543 | } 544 | ], 545 | "source": [ 546 | "# Import data types\n", 547 | "from pyspark.sql.types import *\n", 548 | "\n", 549 | "sc = spark.sparkContext\n", 550 | "\n", 551 | "# Load a text file and convert 
each line to a Row.\n", 552 | "lines = sc.textFile(\"examples/src/main/resources/people.txt\")\n", 553 | "parts = lines.map(lambda l: l.split(\",\"))\n", 554 | "# Each line is converted to a tuple.\n", 555 | "people = parts.map(lambda p: (p[0], p[1].strip()))\n", 556 | "\n", 557 | "# The schema is encoded in a string.\n", 558 | "schemaString = \"name age\"\n", 559 | "\n", 560 | "fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]\n", 561 | "schema = StructType(fields)\n", 562 | "\n", 563 | "# Apply the schema to the RDD.\n", 564 | "schemaPeople = spark.createDataFrame(people, schema)\n", 565 | "\n", 566 | "# Creates a temporary view using the DataFrame\n", 567 | "schemaPeople.createOrReplaceTempView(\"people\")\n", 568 | "\n", 569 | "# SQL can be run over DataFrames that have been registered as a table.\n", 570 | "results = spark.sql(\"SELECT name FROM people\")\n", 571 | "\n", 572 | "results.show()\n", 573 | "# +-------+\n", 574 | "# | name|\n", 575 | "# +-------+\n", 576 | "# |Michael|\n", 577 | "# | Andy|\n", 578 | "# | Justin|\n", 579 | "# +-------+" 580 | ] 581 | }, 582 | { 583 | "cell_type": "markdown", 584 | "metadata": {}, 585 | "source": [ 586 | "# 5. Data Sources" 587 | ] 588 | }, 589 | { 590 | "cell_type": "markdown", 591 | "metadata": {}, 592 | "source": [ 593 | "- Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data. 
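The programmatic-schema example above declares every field, including `age`, as `StringType`. When downstream SQL needs numeric predicates, the field types can be declared explicitly instead. A minimal sketch, assuming the same two-column `people.txt` layout; the `int(...)` cast and the `IntegerType` field are additions not present in the original notebook:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("TypedSchemaSketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Cast the second column to int so that `age` supports numeric comparisons.
people = parts.map(lambda p: (p[0], int(p[1].strip())))

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

schemaPeople = spark.createDataFrame(people, schema)
schemaPeople.createOrReplaceTempView("people")

# The numeric predicate now runs without string coercion.
spark.sql("SELECT name FROM people WHERE age > 20").show()
```

With an all-`StringType` schema, `age > 20` would force Spark to compare strings or cast at query time; declaring the type once in the schema keeps the intent explicit.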
This section describes the general methods for loading and saving data using the Spark Data Sources and then goes into specific options that are available for the built-in data sources.\n", 594 | "- Spark SQL支持通过DataFrame接口对各种数据源进行操作。 DataFrame可以使用关系转换进行操作,也可以用于创建临时视图。 将DataFrame注册为临时视图允许您对其数据运行SQL查询。 本节介绍使用Spark数据源加载和保存数据的一般方法,然后介绍可用于内置数据源的特定选项。" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "## 5.1 Generic Load/Save Functions" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": 18, 607 | "metadata": {}, 608 | "outputs": [ 609 | { 610 | "name": "stdout", 611 | "output_type": "stream", 612 | "text": [ 613 | "Init passed!\n" 614 | ] 615 | }, 616 | { 617 | "ename": "AnalysisException", 618 | "evalue": "'path file:/C:/Users/Berlin/Python/Spark_Python/namesAndFavColors.parquet already exists.;'", 619 | "output_type": "error", 620 | "traceback": [ 621 | "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", 622 | "\u001b[1;31mPy4JJavaError\u001b[0m Traceback (most recent call last)", 623 | "\u001b[1;32mD:\\Anaconda3\\lib\\pyspark\\sql\\utils.py\u001b[0m in \u001b[0;36mdeco\u001b[1;34m(*a, **kw)\u001b[0m\n\u001b[0;32m 62\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 63\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 64\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mpy4j\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mprotocol\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mPy4JJavaError\u001b[0m \u001b[1;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 624 | "\u001b[1;32mD:\\Anaconda3\\lib\\py4j\\protocol.py\u001b[0m in 
\u001b[0;36mget_return_value\u001b[1;34m(answer, gateway_client, target_id, name)\u001b[0m\n\u001b[0;32m 327\u001b[0m \u001b[1;34m\"An error occurred while calling {0}{1}{2}.\\n\"\u001b[0m\u001b[1;33m.\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 328\u001b[1;33m format(target_id, \".\", name), value)\n\u001b[0m\u001b[0;32m 329\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 625 | "\u001b[1;31mPy4JJavaError\u001b[0m: An error occurred while calling o263.parquet.\n: org.apache.spark.sql.AnalysisException: path file:/C:/Users/Berlin/Python/Spark_Python/namesAndFavColors.parquet already exists.;\r\n\tat org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:109)\r\n\tat org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)\r\n\tat org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)\r\n\tat org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)\r\n\tat org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)\r\n\tat org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)\r\n\tat org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)\r\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\r\n\tat org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)\r\n\tat org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)\r\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)\r\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)\r\n\tat org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)\r\n\tat 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)\r\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)\r\n\tat org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)\r\n\tat org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)\r\n\tat org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)\r\n\tat org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)\r\n\tat org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:547)\r\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\r\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)\r\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)\r\n\tat java.lang.reflect.Method.invoke(Unknown Source)\r\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\r\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\r\n\tat py4j.Gateway.invoke(Gateway.java:282)\r\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\r\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\r\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\r\n\tat java.lang.Thread.run(Unknown Source)\r\n", 626 | "\nDuring handling of the above exception, another exception occurred:\n", 627 | "\u001b[1;31mAnalysisException\u001b[0m Traceback (most recent call last)", 628 | "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 9\u001b[0m \u001b[1;31m# 读取example下面的parquet文件\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 10\u001b[0m \u001b[0mdf\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mspark\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mload\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"examples/src/main/resources/users.parquet\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 
11\u001b[1;33m \u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mselect\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"name\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;34m\"favorite_color\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparquet\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"namesAndFavColors.parquet\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 12\u001b[0m \u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mshow\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 629 | "\u001b[1;32mD:\\Anaconda3\\lib\\pyspark\\sql\\readwriter.py\u001b[0m in \u001b[0;36mparquet\u001b[1;34m(self, path, mode, partitionBy, compression)\u001b[0m\n\u001b[0;32m 802\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mpartitionBy\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpartitionBy\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 803\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_set_opts\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcompression\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mcompression\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 804\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_jwrite\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mparquet\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 805\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 806\u001b[0m \u001b[1;33m@\u001b[0m\u001b[0msince\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m1.6\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 630 | "\u001b[1;32mD:\\Anaconda3\\lib\\py4j\\java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[1;34m(self, *args)\u001b[0m\n\u001b[0;32m 1255\u001b[0m \u001b[0manswer\u001b[0m \u001b[1;33m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1256\u001b[0m return_value = get_return_value(\n\u001b[1;32m-> 1257\u001b[1;33m answer, self.gateway_client, self.target_id, self.name)\n\u001b[0m\u001b[0;32m 1258\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1259\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mtemp_arg\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mtemp_args\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 631 | "\u001b[1;32mD:\\Anaconda3\\lib\\pyspark\\sql\\utils.py\u001b[0m in \u001b[0;36mdeco\u001b[1;34m(*a, **kw)\u001b[0m\n\u001b[0;32m 67\u001b[0m e.java_exception.getStackTrace()))\n\u001b[0;32m 68\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'org.apache.spark.sql.AnalysisException: '\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 69\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mAnalysisException\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ms\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m': '\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 70\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0ms\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstartswith\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'org.apache.spark.sql.catalyst.analysis'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 71\u001b[0m \u001b[1;32mraise\u001b[0m 
\u001b[0mAnalysisException\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ms\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m': '\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mstackTrace\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 632 | "\u001b[1;31mAnalysisException\u001b[0m: 'path file:/C:/Users/Berlin/Python/Spark_Python/namesAndFavColors.parquet already exists.;'" 633 | ] 634 | } 635 | ], 636 | "source": [ 637 | "from pyspark.sql import SparkSession\n", 638 | "from pyspark.sql import Row\n", 639 | "import os, time\n", 640 | "\n", 641 | "os.environ['SPARK_HOME'] = \"D:/Spark\"\n", 642 | "\n", 643 | "spark = SparkSession \\\n", 644 | " .builder \\\n", 645 | " .appName(\"Python Spark SQL basic example\") \\\n", 646 | " .config(\"spark.some.config.option\", \"some-value\") \\\n", 647 | " .getOrCreate()\n", 648 | "print('Init passed!')\n", 649 | "# 读取example下面的parquet文件\n", 650 | "df = spark.read.load(\"examples/src/main/resources/users.parquet\")\n", 651 | "df.select(\"name\", \"favorite_color\").write.parquet(\"namesAndFavColors.parquet\")\n", 652 | "df.show()\n" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": null, 658 | "metadata": {}, 659 | "outputs": [], 660 | "source": [] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": null, 665 | "metadata": {}, 666 | "outputs": [], 667 | "source": [] 668 | }, 669 | { 670 | "cell_type": "code", 671 | "execution_count": null, 672 | "metadata": {}, 673 | "outputs": [], 674 | "source": [] 675 | }, 676 | { 677 | "cell_type": "code", 678 | "execution_count": null, 679 | "metadata": {}, 680 | "outputs": [], 681 | "source": [] 682 | }, 683 | { 684 | "cell_type": "code", 685 | "execution_count": null, 686 | "metadata": {}, 687 | "outputs": [], 688 | "source": [] 689 | }, 690 | { 691 | "cell_type": "code", 692 
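The `AnalysisException` above is raised because `DataFrameWriter` defaults to erroring out when the target path already exists. Passing an explicit save mode avoids the failure; a sketch, not part of the original notebook:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SaveModeSketch").getOrCreate()

df = spark.read.load("examples/src/main/resources/users.parquet")

# mode() accepts "overwrite" (replace existing data), "append",
# "ignore" (silently skip the write), or the default "error".
df.select("name", "favorite_color") \
  .write.mode("overwrite") \
  .parquet("namesAndFavColors.parquet")
```

Re-running the cell with `mode("overwrite")` succeeds on every run instead of failing once the output directory exists.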
| "execution_count": null, 693 | "metadata": {}, 694 | "outputs": [], 695 | "source": [] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": null, 700 | "metadata": {}, 701 | "outputs": [], 702 | "source": [] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": null, 707 | "metadata": {}, 708 | "outputs": [], 709 | "source": [] 710 | }, 711 | { 712 | "cell_type": "code", 713 | "execution_count": null, 714 | "metadata": {}, 715 | "outputs": [], 716 | "source": [] 717 | }, 718 | { 719 | "cell_type": "code", 720 | "execution_count": null, 721 | "metadata": {}, 722 | "outputs": [], 723 | "source": [] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": null, 728 | "metadata": {}, 729 | "outputs": [], 730 | "source": [] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": null, 742 | "metadata": {}, 743 | "outputs": [], 744 | "source": [] 745 | }, 746 | { 747 | "cell_type": "code", 748 | "execution_count": null, 749 | "metadata": {}, 750 | "outputs": [], 751 | "source": [] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": {}, 757 | "outputs": [], 758 | "source": [] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": null, 763 | "metadata": {}, 764 | "outputs": [], 765 | "source": [] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": null, 770 | "metadata": {}, 771 | "outputs": [], 772 | "source": [] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | "metadata": {}, 778 | "outputs": [], 779 | "source": [] 780 | }, 781 | { 782 | "cell_type": "code", 783 | "execution_count": null, 784 | "metadata": {}, 785 | "outputs": [], 786 | "source": [] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "metadata": {}, 792 | "outputs": [], 793 | 
"source": [] 794 | } 795 | ], 796 | "metadata": { 797 | "kernelspec": { 798 | "display_name": "Python 3", 799 | "language": "python", 800 | "name": "python3" 801 | }, 802 | "language_info": { 803 | "codemirror_mode": { 804 | "name": "ipython", 805 | "version": 3 806 | }, 807 | "file_extension": ".py", 808 | "mimetype": "text/x-python", 809 | "name": "python", 810 | "nbconvert_exporter": "python", 811 | "pygments_lexer": "ipython3", 812 | "version": "3.6.5" 813 | } 814 | }, 815 | "nbformat": 4, 816 | "nbformat_minor": 2 817 | } 818 | -------------------------------------------------------------------------------- /Untitled.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "- __Spark SQL, DataFrames and Datasets Guide: http://spark.apache.org/docs/latest/sql-programming-guide.html#overview__\n", 8 | "\n", 9 | "\n", 10 | "- __pyspark.sql module: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame__" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "# 1. Overview" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "- Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. 
This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.\n", 25 | "\n", 26 | "- All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell.\n", 27 | "\n", 28 | "- Spark SQL是用于结构化数据处理的Spark模块。 与基本的Spark RDD API不同,Spark SQL提供的接口为Spark提供了有关数据结构和正在执行的计算的更多信息。 在内部,Spark SQL使用此额外信息来执行额外的优化。 有几种与Spark SQL交互的方法,包括SQL和Dataset API。 在计算结果时,使用相同的执行引擎,与您用于表达计算的API /语言无关。 这种统一意味着开发人员可以轻松地在不同的API之间来回切换,从而提供表达给定转换的最自然的方式。\n", 29 | "\n", 30 | "- 此页面上的所有示例都使用Spark分发中包含的示例数据,并且可以在spark-shell,pyspark shell或sparkR shell中运行。\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "# 2. SQL" 38 | ] 39 | }, 40 | { 41 | "cell_type": "markdown", 42 | "metadata": {}, 43 | "source": [ 44 | "- One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.\n", 45 | "- Spark SQL的一个用途是执行SQL查询。 Spark SQL还可用于从现有Hive安装中读取数据。 有关如何配置此功能的更多信息,请参阅Hive Tables部分。 从另一种编程语言中运行SQL时,结果将作为数据集/数据框返回。 您还可以使用命令行或JDBC / ODBC与SQL接口进行交互。" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "# 3. Datasets and DataFrames" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "- A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. 
A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar.\n", 60 | "\n", 61 | "- A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset to represent a DataFrame.\n", 62 | "\n", 63 | "- Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.\n", 64 | "\n", 65 | "- 数据集是分布式数据集合。 Dataset是Spark 1.6中添加的一个新接口,它提供了RDD的优势(强类型,使用强大的lambda函数的能力)以及Spark SQL优化执行引擎的优点。数据集可以从JVM对象构造,然后使用功能转换(map,flatMap,filter等)进行操作。数据集API在Scala和Java中可用。 Python没有对Dataset API的支持。但由于Python的动态特性,数据集API的许多好处已经可用(即您可以通过名称自然地访问行的字段row.columnName)。 R的情况类似。\n", 66 | "\n", 67 | "- DataFrame是组织为命名列的数据集。它在概念上等同于关系数据库中的表或R / Python中的数据框,但在引擎盖下具有更丰富的优化。 DataFrame可以从多种来源构建,例如:结构化数据文件,Hive中的表,外部数据库或现有RDD。 DataFrame API在Scala,Java,Python和R中可用。在Scala和Java中,DataFrame由行数据集表示。在Scala API中,DataFrame只是Dataset [Row]的类型别名。而在Java API中,用户需要使用数据集来表示DataFrame。\n", 68 | "\n", 69 | "- 在本文档中,我们经常将行的Scala / Java数据集称为DataFrame。" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "# 3. 
Getting Started" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "## 3.1 Starting Point: _SparkSession_\n", 84 | "\n", 85 | "- The __entry point__ into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder:\n", 86 | "- Spark中所有__功能的入口点__是 __SparkSession__ 类。 要创建基本的SparkSession,只需使用SparkSession.builder:\n", 87 | " \n", 88 | " __SparkSession.builder.getOrCreate()__" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 10, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "from pyspark.sql import SparkSession\n", 98 | "import os, time\n", 99 | "\n", 100 | "os.environ['SPARK_HOME'] = \"D:/Spark\"\n", 101 | "\n", 102 | "spark = SparkSession \\\n", 103 | " .builder \\\n", 104 | " .appName(\"Python Spark SQL basic example\") \\\n", 105 | " .config(\"spark.some.config.option\", \"some-value\") \\\n", 106 | " .getOrCreate()\n", 107 | "spark.stop()\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## 3.2 Creating DataFrames\n", 115 | "- With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.\n", 116 | "\n", 117 | "- As an example, the following creates a DataFrame based on the content of a JSON file:\n", 118 | "\n", 119 | "- 使用SparkSession,应用程序可以从现有RDD,Hive表或Spark数据源创建DataFrame。\n", 120 | "\n", 121 | "- 作为示例,以下内容基于JSON文件的内容创建DataFrame:" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 17, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "+----+-------+\n", 134 | "| age| name|\n", 135 | "+----+-------+\n", 136 | "|null|Michael|\n", 137 | "| 30| Andy|\n", 138 | "| 19| Justin|\n", 139 | "+----+-------+\n", 140 | "\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "from pyspark.sql import SparkSession\n", 
146 | "import os, time\n", 147 | "\n", 148 | "os.environ['SPARK_HOME'] = \"D:/Spark\"\n", 149 | "\n", 150 | "spark = SparkSession \\\n", 151 | " .builder \\\n", 152 | " .appName(\"Python Spark SQL basic example\") \\\n", 153 | " .config(\"spark.some.config.option\", \"some-value\") \\\n", 154 | " .getOrCreate()\n", 155 | "# spark is an existing SparkSession\n", 156 | "df = spark.read.json(\"D:/Spark/examples/src/main/resources/people.json\")\n", 157 | "# Displays the content of the DataFrame to stdout\n", 158 | "df.show()\n", 159 | "# +----+-------+\n", 160 | "# | age| name|\n", 161 | "# +----+-------+\n", 162 | "# |null|Michael|\n", 163 | "# | 30| Andy|\n", 164 | "# | 19| Justin|\n", 165 | "# +----+-------+\n", 166 | "spark.stop()\n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 18, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "+----+-------+\n", 179 | "| age| name|\n", 180 | "+----+-------+\n", 181 | "|null|Michael|\n", 182 | "| 30| Andy|\n", 183 | "| 19| Justin|\n", 184 | "+----+-------+\n", 185 | "\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "from pyspark.sql import SparkSession\n", 191 | "\n", 192 | "spark = SparkSession \\\n", 193 | " .builder \\\n", 194 | " .appName(\"Python Spark SQL basic example\") \\\n", 195 | " .config(\"spark.some.config.option\", \"some-value\") \\\n", 196 | " .getOrCreate()\n", 197 | "# spark is an existing SparkSession\n", 198 | "df = spark.read.json(\"D:/Spark/examples/src/main/resources/people.json\")\n", 199 | "# Displays the content of the DataFrame to stdout\n", 200 | "df.show()" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "## 3.3 Untyped Dataset Operations (aka DataFrame Operations)\n", 208 | "- DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python and R.\n", 209 | "\n", 210 | "- As mentioned above, in 
Spark 2.0, DataFrames are just Dataset of Rows in the Scala and Java API. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.\n", 211 | "\n", 212 | "- Here we include some basic examples of structured data processing using Datasets:\n", 213 | "\n", 214 | "- DataFrames为Scala,Java,Python和R中的结构化数据操作提供特定于域的语言。\n", 215 | "\n", 216 | "- 如上所述,在Spark 2.0中,DataFrames只是Scala和Java API中Rows的数据集。 与“类型转换”相比,这些操作也称为“无类型转换”,带有强类型Scala / Java数据集。\n", 217 | "\n", 218 | "- 这里我们包括一些使用数据集进行结构化数据处理的基本示例:" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 23, 224 | "metadata": {}, 225 | "outputs": [ 226 | { 227 | "name": "stdout", 228 | "output_type": "stream", 229 | "text": [ 230 | "root\n", 231 | " |-- age: long (nullable = true)\n", 232 | " |-- name: string (nullable = true)\n", 233 | "\n", 234 | "+-------+\n", 235 | "| name|\n", 236 | "+-------+\n", 237 | "|Michael|\n", 238 | "| Andy|\n", 239 | "| Justin|\n", 240 | "+-------+\n", 241 | "\n", 242 | "+-------+---------+\n", 243 | "| name|(age + 1)|\n", 244 | "+-------+---------+\n", 245 | "|Michael| null|\n", 246 | "| Andy| 31|\n", 247 | "| Justin| 20|\n", 248 | "+-------+---------+\n", 249 | "\n", 250 | "+---+----+\n", 251 | "|age|name|\n", 252 | "+---+----+\n", 253 | "| 30|Andy|\n", 254 | "+---+----+\n", 255 | "\n", 256 | "+----+-----+\n", 257 | "| age|count|\n", 258 | "+----+-----+\n", 259 | "| 19| 1|\n", 260 | "|null| 1|\n", 261 | "| 30| 1|\n", 262 | "+----+-----+\n", 263 | "\n" 264 | ] 265 | } 266 | ], 267 | "source": [ 268 | "# spark, df are from the previous example\n", 269 | "# Print the schema in a tree format\n", 270 | "df.printSchema()\n", 271 | "# root\n", 272 | "# |-- age: long (nullable = true)\n", 273 | "# |-- name: string (nullable = true)\n", 274 | "\n", 275 | "# Select only the \"name\" column\n", 276 | "df.select(\"name\").show()\n", 277 | "# +-------+\n", 278 | "# | name|\n", 279 | 
280 | "# |Michael|\n", 281 | "# | Andy|\n", 282 | "# | Justin|\n", 283 | "# +-------+\n", 284 | "\n", 285 | "# Select everybody, but increment the age by 1\n", 286 | "df.select(df['name'], df['age'] + 1).show()\n", 287 | "# +-------+---------+\n", 288 | "# | name|(age + 1)|\n", 289 | "# +-------+---------+\n", 290 | "# |Michael| null|\n", 291 | "# | Andy| 31|\n", 292 | "# | Justin| 20|\n", 293 | "# +-------+---------+\n", 294 | "\n", 295 | "# Select people older than 21\n", 296 | "df.filter(df['age'] > 21).show()\n", 297 | "# +---+----+\n", 298 | "# |age|name|\n", 299 | "# +---+----+\n", 300 | "# | 30|Andy|\n", 301 | "# +---+----+\n", 302 | "\n", 303 | "# Count people by age\n", 304 | "df.groupBy(\"age\").count().show()\n", 305 | "# +----+-----+\n", 306 | "# | age|count|\n", 307 | "# +----+-----+\n", 308 | "# | 19| 1|\n", 309 | "# |null| 1|\n", 310 | "# | 30| 1|\n", 311 | "# +----+-----+" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": null, 331 | "metadata": {}, 332 | "outputs": [], 333 | "source": [] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": null, 338 | "metadata": {}, 339 | "outputs": [], 340 | "source": [] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": null, 345 | "metadata": {}, 346 | "outputs": [], 347 | "source": [] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [] 355 | } 356 | ], 357 | "metadata": { 358 | "kernelspec": { 359 | "display_name": "Python 3", 360 | "language": "python", 361 | "name": "python3" 362 | }, 363 | "language_info": { 364 | "codemirror_mode": { 365 | "name": "ipython", 366 | "version": 3 367 | }, 368 | 
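The untyped operations above end with `df.groupBy("age").count()`, which goes through the `GroupedData` class listed in section 1. The `pyspark.sql.functions` module from that same list supplies richer aggregates than `count()`. A sketch assuming the same `people.json` DataFrame; the alias names are illustrative, not from the original notebook:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("AggSketch").getOrCreate()
df = spark.read.json("D:/Spark/examples/src/main/resources/people.json")

# agg() without a preceding groupBy() aggregates over the whole DataFrame;
# each argument is a column expression built from pyspark.sql.functions.
df.agg(
    F.count("name").alias("people"),
    F.avg("age").alias("avg_age"),
    F.max("age").alias("max_age"),
).show()
```

The same `F.avg` / `F.max` expressions can be passed to `df.groupBy("age").agg(...)` to aggregate per group rather than over the whole frame.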
"file_extension": ".py", 369 | "mimetype": "text/x-python", 370 | "name": "python", 371 | "nbconvert_exporter": "python", 372 | "pygments_lexer": "ipython3", 373 | "version": "3.6.5" 374 | } 375 | }, 376 | "nbformat": 4, 377 | "nbformat_minor": 2 378 | } 379 | -------------------------------------------------------------------------------- /scel_TR_txt.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 1. Sogou dictionary (.scel) files [](https://pinyin.sogou.com/dict/)\n", 8 | "## 2. The struct functions -- struct.pack() and struct.unpack()\n", 9 | "\n", 10 | "The conversion relies on a format string, which specifies how values are laid out when packed and unpacked.\n", 11 | "\n", 12 | "### 2.1 struct.pack(fmt, v1, v2, ...)\n", 13 | "\n", 14 | "Packs the values v1, v2, ... into a byte string using the layout given by fmt. The arguments must match fmt exactly; the packed string is returned.\n", 15 | "\n", 16 | "### 2.2 struct.unpack(fmt, string)\n", 17 | "\n", 18 | "The inverse of pack: unpacks the data in string and returns a tuple; even a single value is unpacked into a one-element tuple. len(string) must equal calcsize(fmt), which brings in the calcsize function: struct.calcsize(fmt) computes the size in bytes of the structure described by fmt.\n", 19 | "\n", 20 | "A format string consists of one or more format characters, which the Python manual describes as follows:" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "|Format\t|C Type\t|Python\t|Notes|\n", 28 | "| --- | --- | --- | --- |\n", 29 | "|x\t|pad byte\t|no value\t| \n", 30 | "|c\t|char\t|string of length 1\t| \n", 31 | "|b\t|signed char\t|integer\t| \n", 32 | "|B\t|unsigned char\t|integer\t| \n", 33 | "|h\t|short\t|integer\t| \n", 34 | "|H\t|unsigned short\t|integer\t| \n", 35 | "|i\t|int\t|integer\t |\n", 36 | "|I\t|unsigned int\t|long\t |\n", 37 | "|l\t|long\t|integer\t |\n", 38 | "|L\t|unsigned long\t|long\t |\n", 39 | "|q\t|long long\t|long\t|(1)\n", 40 | "|Q\t|unsigned long long\t|long\t|(1)\n", 41 | "|f\t|float\t|float\t |\n", 42 | "|d\t|double\t|float\t |\n", 43 | "|s\t|char[]\t|string\t |\n", 44 | "|p\t|char[]\t|string\t |\n", 45 
| "|P\t|void *\t|integer|" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "原作者:\n", 53 | "搜狗的scel词库就是保存的文本的unicode编码,每两个字节一个字符(中文汉字或者英文字母)找出其每部分的偏移位置即可\n", 54 | "\n", 55 | "主要两部分:\n", 56 | "\n", 57 | "1.全局拼音表,貌似是所有的拼音组合,字典序\n", 58 | " 格式为(index,len,pinyin)的列表\n", 59 | " index: 两个字节的整数 代表这个拼音的索引\n", 60 | " len: 两个字节的整数 拼音的字节长度\n", 61 | " pinyin: 当前的拼音,每个字符两个字节,总长len\n", 62 | "\n", 63 | "2.汉语词组表\n", 64 | " 格式为(same,py_table_len,py_table,{word_len,word,ext_len,ext})的一个列表\n", 65 | " same: 两个字节 整数 同音词数量\n", 66 | " py_table_len: 两个字节 整数\n", 67 | " py_table: 整数列表,每个整数两个字节,每个整数代表一个拼音的索引\n", 68 | "\n", 69 | " word_len:两个字节 整数 代表中文词组字节数长度\n", 70 | " word: 中文词组,每个中文汉字两个字节,总长度word_len\n", 71 | " ext_len: 两个字节 整数 代表扩展信息的长度,好像都是10\n", 72 | " ext: 扩展信息 前两个字节是一个整数(不知道是不是词频) 后八个字节全是0\n", 73 | "\n", 74 | " {word_len,word,ext_len,ext} 一共重复same次 同音词 相同拼音表" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 63, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "------------------------------------------------------------\n", 87 | "词库名: 东北话大全【官方推荐】\n", 88 | "词库类型: 方言\n", 89 | "描述信息: 贼拉逗的东北话你一定听过,说起来实在霸气!词库终于出来了,再也不愁打不出来你想说的话,专门为你私人订制哦,还不快去下载!!!\n", 90 | "词库示例: 不想嘎哈 老嘎瘩 嗯呐 瞅你咋的 咋地 \n", 91 | "------------------------------------------------------------\n", 92 | "词库名: 史记【官方推荐】\n", 93 | "词库类型: 文学\n", 94 | "描述信息: 作者司马迁以其“究天人之际,通古今之变,成一家之言”的史识,使《史记》成为中国第一部,也是最出名的纪传体通史。小编整理了书中经典故事、名句、人物等,让你更快打出相关词汇,我在这等你哦~\n", 95 | "词库示例: 十表 姬昌 项羽本纪 武王伐纣 虚怀若谷 商鞅变法 \n", 96 | "------------------------------------------------------------\n", 97 | "词库名: 开发大神专用词库【官方推荐】\n", 98 | "词库类型: 互联网\n", 99 | "描述信息: 程序猿们是不是遨游在代码的海洋里无法自拔?小编知道你们整日找BUG辛苦了,为辅助你们的工作特意奉上专属词库,提高工作效率,畅快打字。欢迎你们前来补充词条哦 !\n", 100 | "词库示例: 资源保留 代码 优先级 启动事件 排期 公开测试 \n", 101 | "------------------------------------------------------------\n", 102 | "词库名: 柳宗元诗词【官方推荐】\n", 103 | "词库类型: 诗词歌赋\n", 104 | "描述信息: 
柳宗元,唐朝文学家、散文家和思想家。倡导唐代古文运动。散文论说性强,笔锋犀利,讽刺辛辣。游记写景状物,多所寄托。诗多抒写抑郁悲愤、思乡怀友之情,自成一路。\n",
105 | "词库示例: 柳河东 永州八记 捕蛇者说 柳河东集 江雪 小石潭记 \n",
106 | "------------------------------------------------------------\n",
107 | "词库名: 诗经【官方推荐】\n",
108 | "词库类型: 诗词歌赋\n",
109 | "描述信息: 《诗经》是中国古代诗歌开端,最早的一部诗歌总集,现存305篇(此外有目无诗的6篇,共311篇),分《风》、《雅》、《颂》三部分。《颂》有40篇,《雅》有105篇(《小雅》中有6篇有目无诗,不计算在内),《风》的数量最多,共160篇,合起来是305篇。古人取其整数,常说“诗三百”。\n",
110 | "词库示例: 君子好逑 悠哉悠哉 左右采之 琴瑟友之 施于中谷 黄鸟于飞 \n"
111 | ]
112 | }
113 | ],
114 | "source": [
115 | "import struct\n",
116 | "import os\n",
117 | "\n",
118 | "# offset of the pinyin table\n",
119 | "startPy = 0x1540\n",
120 | "\n",
121 | "# offset of the Chinese word table\n",
122 | "startChinese = 0x2628\n",
123 | "\n",
124 | "# global pinyin table\n",
125 | "GPy_Table = {}\n",
126 | "\n",
127 | "# parse result:\n",
128 | "# a list of (frequency, pinyin, word) tuples\n",
129 | "GTable = []\n",
130 | "\n",
131 | "# decode raw two-byte (UTF-16LE) data into a string\n",
132 | "def byte2str(data):\n",
133 | "    pos = 0\n",
134 | "    s = ''  # avoid shadowing the built-in str\n",
135 | "    while pos < len(data):\n",
136 | "        c = chr(struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0])\n",
137 | "        if c != chr(0):\n",
138 | "            s += c\n",
139 | "        pos += 2\n",
140 | "    return s\n",
141 | "\n",
142 | "# read the pinyin table\n",
143 | "def getPyTable(data):\n",
144 | "    data = data[4:]\n",
145 | "    pos = 0\n",
146 | "    while pos < len(data):\n",
147 | "        index = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
148 | "        pos += 2\n",
149 | "        lenPy = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
150 | "        pos += 2\n",
151 | "        py = byte2str(data[pos:pos + lenPy])\n",
152 | "        GPy_Table[index] = py\n",
153 | "        pos += lenPy\n",
154 | "\n",
155 | "# look up the pinyin of one word from its index table\n",
156 | "def getWordPy(data):\n",
157 | "    pos = 0\n",
158 | "    ret = ''\n",
159 | "    while pos < len(data):\n",
160 | "        index = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
161 | "        ret += GPy_Table[index]\n",
162 | "        pos += 2\n",
163 | "    return ret\n",
164 | "\n",
165 | "# read the Chinese word table\n",
166 | "def getChinese(data):\n",
167 | "    pos = 0\n",
168 | "    while pos < len(data):\n",
169 | "        # number of homophones\n",
170 | "        same = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
171 | "\n",
172 | "        # length of the pinyin index table\n",
173 | "        pos += 2\n",
174 | "        py_table_len = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
175 | "\n",
176 | "        # pinyin index table\n",
177 | "        pos += 2\n",
178 | "        py = getWordPy(data[pos: pos + py_table_len])\n",
179 | "\n",
180 | "        # Chinese words\n",
181 | "        pos += py_table_len\n",
182 | "        for i in range(same):\n",
183 | "            # byte length of the Chinese word\n",
184 | "            c_len = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
185 | "            # the Chinese word itself\n",
186 | "            pos += 2\n",
187 | "            word = byte2str(data[pos: pos + c_len])\n",
188 | "            # length of the extension data\n",
189 | "            pos += c_len\n",
190 | "            ext_len = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
191 | "            # word frequency\n",
192 | "            pos += 2\n",
193 | "            count = struct.unpack('H', bytes([data[pos], data[pos + 1]]))[0]\n",
194 | "\n",
195 | "            # save the entry\n",
196 | "            GTable.append((count, py, word))\n",
197 | "\n",
198 | "            # advance to the next word\n",
199 | "            pos += ext_len\n",
200 | "\n",
201 | "\n",
202 | "def scel2txt(file_name):\n",
203 | "    # separator\n",
204 | "    print('-' * 60)\n",
205 | "    # read the whole file\n",
206 | "    with open(file_name, 'rb') as f:\n",
207 | "        data = f.read()\n",
208 | "\n",
209 | "    print(\"词库名:\", byte2str(data[0x130:0x338]))\n",
210 | "    print(\"词库类型:\", byte2str(data[0x338:0x540]))\n",
211 | "    print(\"描述信息:\", byte2str(data[0x540:0xd40]))\n",
212 | "    print(\"词库示例:\", byte2str(data[0xd40:startPy]))\n",
213 | "\n",
214 | "    getPyTable(data[startPy:startChinese])\n",
215 | "    getChinese(data[startChinese:])\n",
216 | "\n",
217 | "if __name__ == '__main__':\n",
218 | "\n",
219 | "    # folder containing the .scel files\n",
220 | "    in_path = \"Scel\"\n",
221 | "\n",
222 | "    fin = [fname for fname in os.listdir(in_path) if fname.endswith(\".scel\")]\n",
223 | "    for f in fin:\n",
224 | "        f = os.path.join(in_path, f)\n",
225 | "        scel2txt(f)\n",
226 | "\n",
227 | "    with open('./Scel/coal_dict.txt', 'w', encoding='utf-8') as fout:\n",
228 | "        for count, py, word in GTable:\n",
229 | "            fout.write(str(count) + '\\t\\t\\t' + py + '\\t\\t\\t' + word + '\\n')\n",
230 | "\n",
231 | "\n"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": null,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": []
240 | }
241 | ],
242 | "metadata": {
243 | "kernelspec": {
244 | "display_name": "Python 3",
245 | "language": "python",
246 | "name": "python3"
247 | },
248 | "language_info": {
249 | "codemirror_mode": {
250 | "name": "ipython",
251 | "version": 3
252 | },
253 | "file_extension": ".py",
254 | "mimetype": "text/x-python",
255 | "name": "python",
256 | "nbconvert_exporter": "python",
257 | "pygments_lexer": "ipython3",
258 | "version": "3.6.5"
259 | }
260 | },
261 | "nbformat": 4,
262 | "nbformat_minor": 2
263 | }
264 | 
--------------------------------------------------------------------------------
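The `struct` behaviour that the notebook's first markdown cell describes -- the `pack`/`unpack` round trip, the one-element tuple, and the `len(string) == calcsize(fmt)` rule that `byte2str` relies on with the `'H'` format character -- can be sketched in a few lines. This is an illustrative example added here, not part of the original notebook; note it pins the byte order with `'<'` (little-endian), whereas the notebook's `'H'` uses native order, which happens to match on typical x86 machines:

```python
import struct

# '<HH': two little-endian unsigned shorts -- the same 'H' code the
# notebook's byte2str uses to read two-byte UTF-16 units
packed = struct.pack('<HH', 0x4E2D, 0x6587)  # the code units of "中文"

# the packed length always equals calcsize(fmt)
assert len(packed) == struct.calcsize('<HH') == 4

# unpack always returns a tuple, even for a single value
(first,) = struct.unpack('<H', packed[:2])
assert chr(first) == '中'

# round trip: unpack the whole buffer and rebuild the text
values = struct.unpack('<HH', packed)
assert ''.join(chr(v) for v in values) == '中文'
```

Passing a buffer whose length does not equal `calcsize(fmt)` raises `struct.error`, which is why the parser above always advances `pos` in exact two-byte steps.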