16 |
--------------------------------------------------------------------------------
/7.构造空间数据的RDD/data/dltb.prj:
--------------------------------------------------------------------------------
1 | PROJCS["WGS_1984_Web_Mercator_Auxiliary_Sphere",GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Mercator_Auxiliary_Sphere"],PARAMETER["False_Easting",0.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",0.0],PARAMETER["Standard_Parallel_1",0.0],PARAMETER["Auxiliary_Sphere_Type",0.0],UNIT["Meter",1.0]]
--------------------------------------------------------------------------------
/11.Map算子解析(2)/data/cn.shp.xml:
--------------------------------------------------------------------------------
1 |
2 |
8 |
9 | 空间数据的快速查询,主要是通过索引来实现的,而空间分析发展了几十年,空间索引技术已经很成熟了,正如我一直强调的,我们没必要去自己造轮子,直接拿来主义就行。
10 |
11 | 有同学问,空间索引怎么用?这个问题问得好:
12 |
13 |
14 |
15 | GeoPandas可以快速地构建一个支持空间数据运算的DataFrame,然后利用这个DataFrame就可以快速进行查询了,看下面的例子:
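
下面给出一个简化的示意代码(数据路径、字段均为示例,并非原文代码;旧版 GeoPandas 的 sjoin 用 `op` 参数,新版改名为 `predicate`):

```python
import geopandas as gpd
import numpy as np
from shapely.geometry import Point

# 读取 32 个多边形(示例路径,仅作演示)
polys = gpd.read_file("data/polygon.shp")

# 在多边形的范围内随机模拟 10000 个点,构造 GeoDataFrame
minx, miny, maxx, maxy = polys.total_bounds
xs = np.random.uniform(minx, maxx, 10000)
ys = np.random.uniform(miny, maxy, 10000)
pts = gpd.GeoDataFrame(geometry=[Point(x, y) for x, y in zip(xs, ys)], crs=polys.crs)

# 空间连接:对每个点做一次相交查询,找出它落在哪个多边形里
joined = gpd.sjoin(pts, polys, how="inner", op="intersects")
print(joined.shape)
```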
16 |
17 |
18 |
19 |
20 |
21 | 上面是一个简单的GeoPandas的例子,模拟10000个点对32个多边形进行相交查询。
22 |
23 | GeoPandas是支持空间索引的,但是对于点数据选面数据这种操作,单独构建空间索引的意义不大,构建索引的开销甚至比计算本身还要大一些。
24 |
25 | 不过对比传统的迭代模式,这种方式还是要快很多;如果丢弃掉属性数据,单纯做空间数据的对比,速度还会更快,这里的差距主要来自Pandas切片带来的效率开销,有兴趣的同学可以自行了解一下。
26 |
27 | 不过如果仅有空间数据,把所有的属性信息都丢掉的话,那么分析也就没有任何意义了。
28 |
29 | 下面来看看GeoPandas在PySpark里面如何去使用:
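
这里给出一个按分区使用 GeoPandas 做相交判断的思路草图(假设面数据不大、可以整体广播到各个节点,路径与数据格式均为示例):

```python
import pyspark
import geopandas as gpd
from shapely.geometry import Point

sc = pyspark.SparkContext()

# 面数据量不大时,在 Driver 端读出并广播到各个计算节点(示例路径)
polys = gpd.read_file("data/polygon.shp")
bc_polys = sc.broadcast(polys)

def intersect_partition(rows):
    # rows:每行是 "x\ty" 形式的文本(示例格式)
    gdf = bc_polys.value
    sindex = gdf.sindex              # 利用 GeoPandas 自带的空间索引
    for line in rows:
        x, y = map(float, line.split("\t"))
        p = Point(x, y)
        # 先用索引粗筛,再做精确的相交判断
        for i in sindex.intersection(p.bounds):
            if gdf.geometry.iloc[i].intersects(p):
                yield (i, 1)

res = sc.textFile("data/points.tsv") \
        .mapPartitions(intersect_partition) \
        .reduceByKey(lambda a, b: a + b) \
        .collect()
```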
30 |
31 |
32 |
33 | 可以发现,效果非常的拔群——比传统粗暴的迭代,5倍效率的提升。但是如果参与运算的数据量少的话,GeoPandas的优势就不大了,毕竟DataFrame的切片也是需要开销的。
34 |
35 | 待续未完。
36 |
37 |
38 | 示例中的代码,可以到虾神的github或者gitee上去下载:
39 |
40 | Github:
41 | https://github.com/allenlu2008/PySparkDemo
42 |
43 | gitee:
44 | https://gitee.com/godxia/PySparkDemo
45 |
--------------------------------------------------------------------------------
/14.Uber H3蜂窝多边形聚合统计/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(14)外篇:
2 | # 经纬度数据利用Uber H3进行聚合统计
3 |
4 |
5 | 以前曾经写过一篇Uber H3算法相关的文章,大家有兴趣可以翻一下:
6 |
7 | Uber H3算法实现蜂窝六边形聚合
8 |
9 | 上一篇文章既然说了GeoHash,那么今天也顺便说说在PySpark里面怎么用H3做一下六边形聚合统计。实际上看过上一篇文章的同学不用我说也都知道怎么做了……无非就是选择一个聚合的尺度,然后做Map,再做reduceByKey嘛……
10 |
11 | 实际上,如果你仅仅是要来做一下蜂窝格网密度的可视化,数据量少的情况下,根本不用PySpark什么的,在Python的matplotlib包里面,直接有hexbin,可以直接进行蜂窝密度分析,如下所示:
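
hexbin 的用法大致如下(这里用随机点做演示,点的数量和参数都只是示例):

```python
import numpy as np
import matplotlib.pyplot as plt

# 随机模拟一批经纬度点(仅作演示)
x = np.random.normal(116.4, 0.5, 1000000)
y = np.random.normal(39.9, 0.5, 1000000)

# gridsize 控制六边形的数量(越大越密),mincnt=1 表示没有点的格子不画
plt.hexbin(x, y, gridsize=60, cmap="YlOrRd", mincnt=1)
plt.colorbar()
plt.show()
```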
12 |
13 |
14 |
15 | 如果就从可视化角度来看,基本上也就够用了,而且关键是matplotlib这个包,从2.x重新构建之后,效率还挺高,比如我做了个1000万个点的聚合测试:
16 |
17 |
18 |
19 | 在我的机器上,只需要7秒多就完成了聚合和绘制了,极其厉害。
20 |
21 | 但是可视化仅仅是数据分析最初始的手段,很多时候,不但要能够可视化出来,还需要有后续的分析,比如做空间聚类或者做深度或者广度的钻取。
22 |
23 | 下面我们通过PySpark + H3 来实现全球地震分析,进行分源分区的统计:
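
核心思路和 GeoHash 一样:在 map 里把经纬度编码成某一分辨率的 H3 索引,再 reduceByKey 计数。下面是一个思路草图(假设地震数据为 tsv 文本,字段位置仅为示例;h3 包以 3.x 版本的 geo_to_h3 接口为例):

```python
import pyspark
import h3

sc = pyspark.SparkContext()

def to_h3(line):
    # 假设每行形如:时间\t纬度\t经度\t震级(字段位置仅为示例)
    f = line.split("\t")
    lat, lon = float(f[1]), float(f[2])
    # 分辨率 5 的六边形边长大约 8km 左右
    return (h3.geo_to_h3(lat, lon, 5), 1)

res = sc.textFile("data/earthquake.tsv") \
        .map(to_h3) \
        .reduceByKey(lambda a, b: a + b) \
        .collect()
```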
24 |
25 |
26 |
27 |
28 |
29 | 统计完之后,还可以用matplotlib可视化一下看看效果:
30 |
31 |
32 |
33 |
34 | 可以看出,整个分析计算的方式,与geohash基本一致,而且结果保留下来之后,你还可以把数据写成Shapefile,进行更进一步的空间分析。把数据写成shapefile的相关内容,可以看看ArcGIS或者GDAL的相关接口,不是这里的重点,所以这里不做多说。
35 |
36 |
37 | ### Github:
38 | https://github.com/allenlu2008/PySparkDemo
39 |
40 | ### gitee:
41 | https://gitee.com/godxia/PySparkDemo
--------------------------------------------------------------------------------
/18.OD矩阵/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(18):
2 | # groupByKey算子实现轨迹数据OD矩阵
3 |
4 |
5 |
6 | 上篇说到,可以用groupByKey算子来一次计算输出多个值。比之reduceByKey,groupByKey算子能够更细粒度地实现批量数据的业务;当然,reduceByKey也行,只要你设计的算法足够精细,毕竟人家是始祖级算子。
7 |
8 | 今天我们来看一个案例,就是轨迹计算里面,大家都很关心的一个功能:OD矩阵。
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 | 啥叫OD矩阵呢?实际上就是把每次出行归类为起点(Origin)与终点(Destination),然后度量每个起点区域到每个终点区域之间的通行量,最后类似于这种效果:
17 |
18 |
19 |
20 | 实际上OD矩阵的各种业务算法可以很复杂,但是如果我们手中的是LBS数据,只需要做固定时间区间的OD流量,就非常简单了,比如下面的数据:
21 |
22 |
23 |
24 | CID一列,是车辆的ID,为了脱敏,用了UUID;time是当前的时间,后面是经纬度。下面先对这批数据做个简单的描述性统计:
25 |
26 |
27 |
28 |
29 |
30 |
31 |
32 |
33 |
34 | 可以看见,数据主要集中在北京的城六区,下面我们先来做一个简单的移动连线,也就是把每辆车的首点和最后一个点提取出来,然后做连线:
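
核心逻辑可以用 groupByKey 草拟如下(假设数据为 csv 文本,字段依次为 CID、time、经度、纬度,分隔符等均为示例):

```python
import pyspark

sc = pyspark.SparkContext()

def parse(line):
    cid, t, lon, lat = line.split(",")
    return (cid, (t, float(lon), float(lat)))

def od_pair(kv):
    cid, records = kv
    pts = sorted(records, key=lambda r: r[0])    # 按时间排序
    o, d = pts[0], pts[-1]                       # 首点为 O,末点为 D
    return (cid, (o[1], o[2], d[1], d[2]))

od = sc.textFile("data/track.csv") \
       .map(parse) \
       .groupByKey() \
       .map(od_pair)

# 如果再把 O、D 各自映射到区域编号(比如 geohash 或行政区),
# 就可以继续用 reduceByKey 统计每个 OD 对之间的通行量了
```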
35 |
36 |
37 |
38 |
39 |
40 | 这样,由起讫点组成的流向线就绘制完成了。实际上在OD矩阵里面,很多时候是有区域选择的,比如研究几个城区之间的车辆流动,或者是外地进京车辆、车辆离京情况、工作区——住宅区流动(职住分析)等等,还有就是各种时间分辨率的切片。但是万变不离其宗,只要掌握了最基础的算法,其他的无非就是在这个算法上加载各种条件罢了。
41 |
42 | 待续未完
43 |
44 | 需要代码和数据的同学,到下面地址下载:
45 |
46 | Github:
47 |
48 | https://github.com/allenlu2008/PySparkDemo
49 |
50 | gitee:
51 |
52 | https://gitee.com/godxia/PySparkDemo
53 |
54 |
--------------------------------------------------------------------------------
/15.Map算子精确裁切面要素/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(15):
2 | # 利用Map精确裁剪面状要素
3 |
4 |
5 | Map是Spark最重要的一个算子,没有之一。所有的transformation算子,基本上都可以用map算子来替代,可以说Map算子实际上是Spark的始祖级算子(废话——人家本来就来源自MapReduce算法,顾名思义嘛)
6 |
7 |
8 |
9 | 前面我们实际上已经做过了利用filter算子对空间数据进行过滤。实际上,提取数据属于数据生产的主要工作之一,也就是按照条件从相应数据中选出符合需求的数据。但是在空间上,有些数据的提取是需要生成新的数据的,比如修一条路,需要征用土地:
10 |
11 |
12 |
13 | 征用土地就要收费,而如果你这样征用的话:
14 |
15 |
16 |
17 | 你是不是会觉得:
18 |
19 |
20 |
21 | 所以涉及到RMB的东西,那是一点都不能多的……
22 |
23 |
24 |
25 | 那么怎么来做精确裁切呢?小数据量好说,所有的GIS软件都内置了Clip功能,但是我们要考虑的是超大数据量,比如从全国几亿个地块里面,裁剪出京九线沿线地块这种操作:
26 |
27 |
28 |
29 |
30 |
31 | 一次裁剪,如果用传统工具,那就是以天为单位……如果还要做不同缓冲尺度的裁剪,那么无论是分析计算所用的时间,还是生成结果所需的存储空间,都是一个极其庞大的量。
32 |
33 | 所以这种级别的运算,就可以考虑使用分布式计算了,那么算法如何写呢?如果掌握了Map算子的话,那是灰常的简单,不信?LOOK:
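
这里给出一个用 ogr 在 map 里做 Intersection 的思路草图(假设图斑已经转成“地类名称\tWKT”的文本行,裁切范围以 WKT 字符串的形式通过闭包带到各节点;路径和数据都只是示例):

```python
import pyspark
from osgeo import ogr

sc = pyspark.SparkContext()

# 裁切范围(比如某条道路的缓冲区面),以 WKT 字符串给出,便于序列化到各节点
clip_wkt = "POLYGON((116.3 39.8,116.5 39.8,116.5 40.0,116.3 40.0,116.3 39.8))"

def clip(line):
    # 假设每行形如:地类名称\tWKT
    name, wkt = line.split("\t")
    geom = ogr.CreateGeometryFromWkt(wkt)
    clip_geom = ogr.CreateGeometryFromWkt(clip_wkt)   # ogr 几何对象不可序列化,所以在函数内部构造
    inter = geom.Intersection(clip_geom)
    if inter is None or inter.IsEmpty():
        return None
    return (name, inter.ExportToWkt())

clipped = sc.textFile("data/dltb.tsv").map(clip).filter(lambda x: x is not None)
```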
34 |
35 |
36 |
37 |
38 |
39 | 可以看出,只需要在map方法里面,做一个简单的处理,就把需要裁切的内容处理出来了,如果后续还需要继续分析,那么只需要把这个内容写成shapefile,或者写入数据库,就可以继续使用了。
40 |
41 | 从以上流程可以看出,map实际上可以处理所有需要迭代的内容,然后把处理的结果直接生成为新的RDD就可以了。这也就是MapReduce的编程核心:如何构建合理有效的map。
42 |
43 | 可以说,掌握了Map,基本上就大半掌握了Spark的编程精髓了,在以后的应用中,我们还会重复再重复的使用Map来进行处理。
44 |
45 |
46 |
47 | ## 以上代码,需要的同学可以直接从下面地方下载:
48 |
49 | ### Github:
50 | https://github.com/allenlu2008/PySparkDemo
51 |
52 | ### gitee:
53 | https://gitee.com/godxia/PySparkDemo
54 |
--------------------------------------------------------------------------------
/3.啥是算子(2)/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(3):
2 | # 啥是算子(2):转换算子与行动算子
3 |
4 | 上篇文章说到,算子实际上就是一种操作,在计算机里面就是一个函数,这篇文章说说Spark里面的算子的分类:
5 |
6 |
7 |
8 |
9 | Spark里面的算子一共有两种,一种称之为Transformation(转换)算子,一种称之为Action(行动)算子。
10 |
11 | Transformation算子,主要用来处理数据,也就是进行数据状态的转换,比如上篇文章里面,把排骨切成块,这个切的动作,就是一个Transformation算子。
12 |
13 | 每一个转换算子,都会得到一个新的RDD。比如把整扇肋排切了,这个转换操作,得到的就是一个全新的被切成块的排骨。
14 |
15 |
16 |
17 | 那么Action算子是干嘛的呢?
18 | 顾名思义,Action算子,主要用来触发提交作业,是真正的执行动作的算子。
19 |
20 |
21 |
22 | 既然从Action才开始执行操作,那么前面的转换算子是拿来干什么的呢?
23 |
24 | 这就是Spark的一个特点了:
25 |
26 | Spark的转换算子,仅仅生成一个函数链,记录每个转换操作,但是并不实际执行这些转换操作,只有当出现行动算子的时候,这些操作才开始按照设定的函数链依次执行。
27 |
28 | 下面我们来看看这个例子:
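
例子的思路大致如下(数据量和计时数值因机器而异,仅作示意):

```python
import pyspark, random, datetime

sc = pyspark.SparkContext()
rdd = sc.parallelize([chr(random.randint(65, 90)) for i in range(1000000)])

s = datetime.datetime.now()
rdd2 = rdd.map(lambda a: (a, 1))                 # 转换算子:只记录函数链,不执行
print("map 耗时:", datetime.datetime.now() - s)   # 几乎为 0

s = datetime.datetime.now()
res = rdd2.reduceByKey(lambda a, b: a + b).collect()   # collect 是行动算子,真正触发计算
print("collect 耗时:", datetime.datetime.now() - s)
```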
29 |
30 |
31 | 从上面的案例看出,所有转换算子都几乎不耗费任何时间——因为转换算子本身是不执行操作的,只有到了行动算子的时候,才会真正执行。
32 |
33 | 如果说上面的例子表示的是转换算子不执行,那么下面的例子,就说明了Spark里面每个RDD都有自己独立的函数链:转换完得到的是一个新的RDD,下次要用同样的结果的时候,整条函数链还得再执行一次:
34 |
35 |
36 | 可以看见,两种不同的写法,最后的计算时间都差不多。
37 |
38 |
39 | 那么我们用缓存的技术,再做一遍看看:
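
缓存的写法大致如下(接着上面的 rdd,仅作示意):

```python
# 把 reduceByKey 的结果缓存下来,第二次使用时不再重复执行函数链
cached = rdd.map(lambda a: (a, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .cache()

print(cached.collect())   # 第一次:真正执行函数链,并把结果放入缓存
print(cached.collect())   # 第二次:直接读缓存,明显变快
```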
40 |
41 |
42 |
43 |
44 | 如果不进行缓存,那么第二次执行还是需要进行reduceByKey运算,只是减少了任务调度的时间。加上缓存之后,就不用进行reduceByKey了,直接提交任务进行结果转换就行。
45 |
46 | 从上面几个示例可以看出,RDD算子里面最核心的两类算子的执行方式,转换算子仅做记录,行动算子提交任务。而如果我们的某一步结果需要重复使用,最方便的方式就是使用缓存。
47 |
48 | 啥是算子这个问题,到此就讲完了,从下一篇开始,我们进入Spark算子与空间计算的应用。
49 |
50 | 以上代码,可以通过虾神的github或者gitee下载,地址如下:
51 |
52 | # github
53 | https://github.com/allenlu2008/PySparkDemo
54 |
55 | # gitee:
56 | https://gitee.com/godxia/PySparkDemo
57 |
58 |
59 | 待续未完。
60 |
--------------------------------------------------------------------------------
/3.啥是算子(2)/3.啥是算子(2).md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(3):
2 | # 啥是算子(2):转换算子与行动算子
3 |
4 | 上篇文章说到,算子实际上就是一种操作,在计算机里面就是一个函数,这篇文章说说Spark里面的算子的分类:
5 |
6 |
7 |
8 |
9 | Spark里面的算子一共有两种,一种称之为Transformation(转换)算子,一种称之为Action(行动)算子。
10 |
11 | Transformation算子,主要用来处理数据,也就是进行数据状态的转换,比如上篇文章里面,把排骨切成块,这个切的动作,就是一个Transformation算子。
12 |
13 | 每一个转换算子,都会得到一个新的RDD。比如把整扇肋排切了,这个转换操作,得到的就是一个全新的被切成块的排骨。
14 |
15 |
16 |
17 | 那么Action算子是干嘛的呢?
18 | 顾名思义,Action算子,主要用来触发提交作业,是真正的执行动作的算子。
19 |
20 |
21 |
22 | 那么既然从Action才开始执行操作,前面的转换算子是拿来干什么的呢?
23 |
24 | 这就是Spark的一个特点了:
25 |
26 | Spark的转换算子,仅仅生成一个函数链,记录每个转换操作,但是并不实际执行这些转换操作,只有当出现行动算子的时候,这些操作才开始按照设定的函数链依次执行。
27 |
28 | 下面我们来看看下面的这个例子:
29 |
30 |
31 | 从上面的案例看出,所有转换算子都几乎不耗费任何时间——因为转换算子本身是不执行操作的,只有到了行动算子的时候,才会真正执行。
32 |
33 | 如果说上面的例子表示的是转换算子不执行,那么下面的例子,就说明了Spark里面每个RDD都有自己独立的函数链:转换完得到的是一个新的RDD,下次要用同样的结果的时候,整条函数链还得再执行一次:
34 |
35 |
36 | 可以看见,两种不同的写法,最后的计算时间都差不多的,
37 |
38 |
39 | 那么我们用缓存的技术,再做一遍看看:
40 |
41 |
42 |
43 |
44 | 如果不进行缓存,那么第二次执行还是需要进行reduceByKey运算,只是减少了任务调度的时间。加上缓存之后,就不用进行reduceByKey了,直接提交任务进行结果转换就行。
45 |
46 | 从上面几个示例可以看出,RDD算子里面最核心的两类算子的执行方式,转换算子仅做记录,行动算子提交任务。而如果我们的某一步结果需要重复使用,最方便的方式就是使用缓存。
47 |
48 | 啥是算子这个问题,到此就讲完了,从下一篇开始,我们进入Spark算子与空间计算的应用。
49 |
50 | 以上代码,可以通过虾神的github或者gitee下载,地址如下:
51 |
52 | # github
53 | https://github.com/allenlu2008/PySparkDemo
54 |
55 | # gitee:
56 | https://gitee.com/godxia/PySparkDemo
57 |
58 |
59 | 待续未完。
60 |
--------------------------------------------------------------------------------
/6.数据生成算子/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(6):
2 | # 数据生成算子
3 |
4 |
5 |
6 | 从本章开始讲各种算子,首先是最常用的算子,就是数据生成的算子。
7 |
8 | 一开始我们讲过,RDD只能通过转换而来,那么最早的RDD是怎么来的?先讲讲始祖RDD是怎么生成的。
9 |
10 | 最初始的RDD的获取方式如下:
11 |
12 |
13 |
14 |
15 | 主要是有两种来源模式,一是从持久化的数据源进行获取,Spark支持的数据源非常全面,几乎市面上能找到的持久化数据存储系统都可以支持,但是最常见的还是直接读取Hadoop的分布式文件系统HDFS这类分布式存储。
16 |
17 | 因为从分布式存储系统来获取数据,可以充分利用Spark分布式节点的优势,去多个进程去读取分布式存储节点上的数据,而且如果是HDFS的话,默认直接就会使用HDFS的数据分片作为Spark的任务大小分片,省去了很多任务调度和任务分配的开销。
18 |
19 |
20 |
21 | 结构化数据库和非结构化数据库(存储系统)的读取结构与HDFS集群读取是一样的,都是通过Driver进行控制分配,然后由每个计算节点去读取存储系统的内容,读取到Worker中,变成RDD。
22 |
23 | 而如果是本地磁盘的话,一般要采用URL(共享路径)的方式,这样才能够让所有的Spark节点都访问到数据位置;如果指定的只是某一台机器上的本地路径,那么会导致其他计算节点无法读取的异常。
24 |
25 |
26 |
27 | 通过上面两个架构的比较,可以发现,使用分布式存储系统,能够最大程度的利用分布式计算节点的优势,而共享存储的读取,就要受制于共享磁盘的访问效率了。
28 |
29 | 第二种就是从Spark支持的各种集合对象来转换。
30 |
31 |
32 |
33 | 通过JVM对数据进行读取,然后生成各种能够序列化的对象集合,通过Spark Driver进行转换,变成RDD,然后把这些RDD推送到每个计算节点上。
34 |
35 | 这种方式能够获取的数据类型就更加广泛了,理论上只要Java、Spark、Python等语言能够读取并且序列化的对象集合,都可以转换为RDD。
36 |
37 | 但是这种模式,有一个致命的缺点:读取并且序列化的数据,不能超过Driver的内存大小。因为这种模式,是把所有数据全部读取进来,然后进行序列化并且传输的,如果你要序列化的数据大小大于Driver的可用内存,那么就会出现内存溢出的问题。(当然,要解决这个问题,可以采用磁盘缓存等多种技术,这里暂时不聊这个)。
38 |
39 | 而采用前面分布式读取的方式,则不会出现这个问题,因为worker节点直接去获取数据,并且获取到的是一个个的数据块,不存在需要在Driver这一台机器上进行序列化的问题。
40 |
41 | 下面通过几个示例,来简单看看如何从数据中获取到RDD。
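
两种来源的写法大致如下(路径均为示例):

```python
import pyspark

sc = pyspark.SparkContext()

# 1. 从持久化数据源读取:HDFS 或者共享磁盘上的文本文件
rdd1 = sc.textFile("hdfs://namenode:9000/data/poi.tsv")   # HDFS 示例路径
rdd2 = sc.textFile("file:///share/data/poi.tsv")          # 共享磁盘示例路径

# 2. 从 Driver 内存中的集合转换(数据量必须能装进 Driver 的内存)
rdd3 = sc.parallelize([("A", 1), ("B", 2), ("C", 3)])

print(rdd3.count())
```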
42 |
43 |
44 |
45 |
46 |
47 | Spark默认可以直接读取文本类型的文件,那么空间数据怎么变成RDD呢?我们下回分解
48 |
49 | 待续未完
50 |
51 |
--------------------------------------------------------------------------------
/16.reduceByKey算子/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(16):
2 | # reduceByKey算子简介(1)
3 |
4 |
5 | 前面的文章,讲了各种map,同学也都发现了,map后面老是跟着一个叫做reduceByKey的算子,是干嘛的呢?或者说,如何去理解它的运行原理呢?
6 |
7 | ###
8 |
9 |
10 |
11 | 正如MapReduce这个名字所言,reduceByKey也是始祖级的算子。那么今天来说说这个reduceByKey算子的运行原理。
12 |
13 | reduce从名字上看,就知道,主要是用来做聚合,比如下面的例子:
14 |
15 |
16 |
17 | 看看代码:
18 |
19 |
20 |
21 |
22 | reduceByKey,最核心的操作,就是这个By Key了。By Key的话,就表示在这个Map中,只要Key相同,就会相互进行计算,而且这些计算是累积的。
23 |
24 | 这个算子一般都是与Map算子组合起来使用的,一般来说Map负责构建数据结构,ReduceByKey算子负责进行聚合统计。
25 |
26 | Spark和Hadoop一类的框架,最早就是用来进行统计分析的,在属性数据进行计算的时候,ReduceByKey是非常重要的一个操作,比如下面这个案例:
27 |
28 |
29 |
30 | 在地类图斑的统计中,分类统计应该是最常见的一种操作,类似于SQL里面的group by操作,上面的代码就是实现这一系列的过程。当然,有的同学说,你写这么多代码,SQL只需要一句就行……嗯,要写简单也行,上面的代码是为了展开介绍,需要的话也可以写得非常非常简单,把所有代码都集成在一起,也就是一句就可以了:
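
“一句话”的写法大致是这样(假设 sc 已经创建,每行是“地类名称\t面积”的文本,仅作示意):

```python
res = sc.textFile("data/dltb.tsv") \
        .map(lambda line: (line.split("\t")[0], float(line.split("\t")[1]))) \
        .reduceByKey(lambda a, b: a + b) \
        .collect()
```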
31 |
32 |
33 |
34 | 最后再来看一个例子,把上篇文章精确裁切的功能和分类统计放一起,来做一个精确裁切统计:
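
把裁切和分类统计放在一起的思路草图如下(沿用前面用 ogr 做裁切的思路,假设每行是“地类名称\t图斑面积\tWKT”的文本;字段、路径均为示例):

```python
from osgeo import ogr

clip_wkt = "POLYGON((...))"   # 裁切范围的 WKT,此处从略

def clip_area(line):
    name, area, wkt = line.split("\t")
    geom = ogr.CreateGeometryFromWkt(wkt)
    inter = geom.Intersection(ogr.CreateGeometryFromWkt(clip_wkt))
    if inter is None or inter.IsEmpty():
        return None
    # 用裁切出来的几何面积占原几何面积的比例,去乘台账里的图斑面积
    ratio = inter.GetArea() / geom.GetArea()
    return (name, float(area) * ratio)

res = sc.textFile("data/dltb.tsv") \
        .map(clip_area) \
        .filter(lambda x: x is not None) \
        .reduceByKey(lambda a, b: a + b) \
        .collect()
```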
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 | 上面的代码,在map里面封装了一个业务逻辑,国土和农业的地类图斑数据,都有台账数据一说,里面的面积记录也有做合同面积、权证面积等,因为历史原因,未必和实际地理面积相同(在历史上,不同地区对地类的分级不一样,按照当年按地征收农业税的定标方法,贫瘠的土地有一亩做五分入账,或者丰沃的土地,有八分做一亩入账等的历史原因,所以图斑面积和实测面积可能相去甚远)。
43 |
44 | 为了能用图斑面积进行计算,就需要知道原始地块被裁切掉了多少,所以这里直接计算裁切掉的比例,然后用这个比例系数乘以图斑面积,最后得到结果面积。
45 |
46 | 待续未完
47 |
48 | 需要代码的同学,到下面地址下载:
49 | Github:
50 | https://github.com/allenlu2008/PySparkDemo
51 |
52 | gitee:
53 | https://gitee.com/godxia/PySparkDemo
54 |
--------------------------------------------------------------------------------
/17.groupbyKey算子简介/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(17):
2 | # groupByKey算子解析
3 |
4 |
5 |
6 | 上一篇说了reduceByKey,这是一个统计意义用处特别大的算子,本篇再介绍另外一个用处比较广泛的算子:groupByKey。
7 |
8 | 从名字上看可以看出,reduceByKey是按key聚合,那么groupByKey,就是按key进行分组了。不同的地方在于,聚合是最后输出成一个值,而分组的话,不会把所有值都合并在一起:
9 |
10 |
11 |
12 | 有同学会问,分组在一起,有啥用呢?看下面这个例子:
13 |
14 |
15 |
16 | 上面的数据结构中,如果要同时计算max和min两个值,那么用reduceByKey的话,传统模式需要计算两次(因为每次只能输出一个值),而用groupByKey的话,因为返回的是原始数据的集合,所以只需要计算一次就可以了。
17 |
18 | 当然,如果你设计的map结构足够好,也可以一次计算出结果,这就是我一直强调的:
19 |
20 | ## 设计合理有效Map的数据结构,是Spark数据分析最重要的内容,没有之一
21 |
22 | 当然,还有其他的方法也可以做到,比如利用flatMap算子,此乃后话,暂时不提。
23 |
24 | 有的同学会问,我有啥必要一次性做多个结果啊,我一次做一个结果不行么?来来来,举个例子,让我们看看,啥情况下才需要一次计算出多个结果捏?
25 |
26 |
27 |
28 | 最简单的,就是计算平均数——需要累加和计数两个值,才能完成平均数的计算。
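
两种写法的对比大致如下(与仓库里“平均数计算.ipynb”的思路一致,假设 rdd2 的每个元素是 (类别, 数值) 形式的键值对):

```python
# 传统模式:map 成 (key, (值, 1)),reduceByKey 同时累加“和”与“个数”
avg1 = rdd2.map(lambda x: (x[0], (x[1], 1))) \
           .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
           .map(lambda x: (x[0], x[1][0] / x[1][1]))

# groupByKey 模式:拿到每个 key 的全部值,直接 sum/len 一步算完
avg2 = rdd2.groupByKey() \
           .map(lambda x: (x[0], sum(x[1]) / len(x[1])))
```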
29 |
30 |
31 |
32 | 虽然传统模式也很容易解决,但是明显感觉groupByKey的写法更加简洁明了,而且更加易读易理解。
33 |
34 | 当然,还有一些其他的应用,比如在做系统日志统计的时候,我需要看最近发生的10次信息、警告和错误各是什么……或者是最近的20次、30次呢……用Map和ReduceByKey会让你写得死去活来,但是用groupByKey,就各种容易了:
35 |
36 |
37 |
38 |
39 |
40 | 实际上,在我上面的示例里面,海量数据的情况下,性能是非常不好的,因为分组之后,每一组里面的value,可能依然非常非常多,要是多到单个节点无法处理的情况下,那么依然会出现内存溢出。
41 |
42 | 要处理这种情况,实际上应该使用下面的方法:
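
一种常见的处理思路是改用 aggregateByKey,在聚合的过程中就只保留每个 key 最近的 N 条记录,避免把整组数据攒在一个节点上,示意如下(N、数据结构均为示例):

```python
N = 10   # 每个 key 只保留最近的 10 条(示例)

def merge_value(acc, v):
    # acc 是当前已保留的记录列表,v 是新进来的一条 (时间, 内容)
    acc.append(v)
    return sorted(acc, key=lambda r: r[0], reverse=True)[:N]

def merge_combiners(a, b):
    # 合并不同分区的中间结果,同样只保留最近的 N 条
    return sorted(a + b, key=lambda r: r[0], reverse=True)[:N]

# 假设 logrdd 的元素形如 (级别, (时间, 内容))
recent = logrdd.aggregateByKey([], merge_value, merge_combiners)
```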
43 |
44 |
45 |
46 | 从上面给出的两个简单算法和示例可以看出,groupByKey实际上是MapReduce的一种高层抽象和封装,它能实现的功能,用原生的MapReduce算法,只要设计得足够好依然能够实现。但是很多时候,用这个算子,能够简化很多的工作。
47 |
48 | 待续未完:
49 | 需要代码的同学,到下面地址下载:
50 |
51 | Github:
52 | https://github.com/allenlu2008/PySparkDemo
53 |
54 | gitee:
55 | https://gitee.com/godxia/PySparkDemo
56 |
57 |
58 |
--------------------------------------------------------------------------------
/5.如何在PySpark里面使用空间运算接口/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(5):
2 | # 如何在PySpark里面使用空间运算接口
3 |
4 |
5 | Spark是分布式计算的,PySpark实际上是用Python调用了Spark的底层框架,那么这些框架是如何被调用的呢?上一篇说了一下Python里面利用GDAL包实现的空间算子,那么整个调用流程是怎么样的呢?今天我们来一起探索一下。
6 |
7 | 本系列的第一篇文章就说过,要跑PySpark,需要用Py4J这个包,这个包的作用就是利用Python来调用Java虚拟机里面的对象,原理是这样的:
8 |
9 |
10 |
11 | Python的算法通过Socket将任务发送给Java虚拟机,Java虚拟机通过Py4J这个包解析,然后调用Spark的Worker计算节点,再把任务还原成Python实现,执行完成之后,再反向走一遍。
12 |
13 | 具体的说明请查看这篇文章,我就不重复了:
14 | http://sharkdtu.com/posts/pyspark-internal.html
15 |
16 | 可以看出来,用Python写的内容,最后在Worker端,也是用Python的算法或者包来进行实现的,下面我们来做一个实验:
17 |
18 |
19 |
20 | 利用Python的sys包来查看运行的Python的版本,用socket包来查看节点的机器名,这两个包都是Python特有的,如果说PySpark仅仅运行的是Java的话,在不同节点上,应该是无法执行的。
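
实验的代码大致如下(结果会因节点而异,仅作示意):

```python
import pyspark, sys, socket

sc = pyspark.SparkContext()

def node_info(it):
    # 返回真正执行这段代码的机器名和 Python 版本
    yield (socket.gethostname(), sys.version)

print(sc.parallelize(range(8), 8).mapPartitions(node_info).distinct().collect())
```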
21 |
22 | 我这里一共有两台机器,分别叫做sparkvm.com和sparkvmslave.com,其中Sparkvm.com是master + worker,而sparkvmslave.com仅仅是worker。
23 |
24 | 最后执行的结果表明,在不同的节点上返回了不同的结果。
25 |
26 | 从上面的实验可以看出,不同的计算节点上,最终使用的是Python的算法包,那么如何在不同的节点上使用空间分析算法呢?
27 |
28 | 在Spark上,是利用算法插件这种方式来实现的:
29 |
30 |
31 |
32 | 只要在不同的节点上都安装同样的Python算法包,就可以执行了,关键点在于需要配置好系统的Python,因为PySpark默认调用的是系统的Python。
33 |
34 | 下面再做一个实验:
35 |
36 |
37 |
38 | 然后在PySpark上面再运行一个示例:
39 |
40 |
41 |
42 | 两个节点,为什么全部都在一个节点上执行呢?看看debug出来的日志:
43 |
44 |
45 |
46 | 发现在153节点上,已经抛出了异常,说没有找到pygeohash包。
47 |
48 | 下面我在153上面,把pygeohash包安装上:
49 |
50 |
51 |
52 | 然后再次执行上面的内容:
53 |
54 |
55 |
56 | 最后我们来利用gdal的空间算法接口,来跑一个示例:
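
比如在每条记录上用 ogr 做一次缓冲区计算,示意如下(前提是每个节点都装好了GDAL;数据是随手写的示例,假设 sc 已经创建):

```python
from osgeo import ogr

def buffer_wkt(wkt):
    geom = ogr.CreateGeometryFromWkt(wkt)
    return geom.Buffer(0.01).ExportToWkt()   # 粗略地按度做 0.01 的缓冲,仅作演示

res = sc.parallelize(["POINT(116.4 39.9)", "POINT(121.5 31.2)"]) \
        .map(buffer_wkt) \
        .collect()
print(res)
```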
57 |
58 |
59 |
60 | 待续未完
61 |
--------------------------------------------------------------------------------
/13.geohash编码/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(13)外篇:
2 | # 经纬度数据利用GeoHash进行聚合统计
3 |
4 |
5 | 点数据的分析是空间大数据的主要分析模式:
6 |
7 |
8 |
9 |
10 | 点分析一般有如下分析内容:
11 |
12 |
13 |
14 | 而点分析最常用的算法就是点聚合:
15 |
16 |
17 |
18 | 比如在轨迹数据中,任意一个点独立拿出来,都没有任何的意义。比如要了解某个路段某个时间的交通状况,就需要把研究区域内所有的点都进行综合计算,或者用众数,或者用平均数来代表这个区域的基本情况。
19 |
20 | 所以,点聚合是点数据分析中最常用分析手段。
21 |
22 | 而我们能够快速获取到的点数据,一般都是LBS(基于位置的服务)数据,基于位置的话,90%以上,都是以经纬度模式进行表达的。
23 |
24 |
25 |
26 | 如果数据是经纬度的话,有个非常简单实用的算法,叫做GeoHash:
27 |
28 |
29 |
30 | GeoHash的原理就是按照区域,把经纬度进行编码,按照不同的网格精度,变成不同位数的编码,在同一区域中的编码相同。
31 |
32 | 比如上图里面的那个例子,如果两个坐标的前6位相同,那么表示在第六级精度格网上,两个坐标处于同一个格网。这样,我们就可以按照编码的内容,按照不同的精度聚合位置点数据了。
33 |
34 | GeoHash的网格精度如下:
35 |
36 |
37 |
38 | 比如我们要按照152m*152m来设定一个区域,那么设置生成七位数的geohash编码即可,如下面的例子所示:
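
用 pygeohash 编码非常直接(与仓库里 geohash示例.ipynb 的结果一致):

```python
import pygeohash

# 第 7 级编码,对应约 152m x 152m 的格网
print(pygeohash.encode(39.911, 116.521, 7))   # 输出 'wx4g598'
```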
39 |
40 |
41 |
42 |
43 | GeoHash可以让我们快速的对点数据进行聚合,下面我们来看看如何使用GeoHash算法配合PySpark,来对地震数据进行快速聚合统计:
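
配合 PySpark 的话,就是在 map 里做编码、再 reduceByKey 计数,思路草图如下(假设地震数据为 tsv 文本,纬度、经度所在的字段位置仅为示例):

```python
import pyspark, pygeohash

sc = pyspark.SparkContext()

def to_geohash(line):
    f = line.split("\t")
    lat, lon = float(f[1]), float(f[2])
    return (pygeohash.encode(lat, lon, 3), 1)   # 三级编码,大约150公里见方的格网

res = sc.textFile("data/earthquake.tsv") \
        .map(to_geohash) \
        .reduceByKey(lambda a, b: a + b) \
        .collect()
```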
44 |
45 |
46 |
47 | 从上面的代码可以看出,如果我们要按照152公里的网格进行聚合,只需要用三级编码就可以了,实际上与以前说过的整除截余法的原理是一样的。
48 |
49 | GeoHash能够最大的限度简化网格聚合的算法,但是也有一定的限制,主要的限制就是geohash编码的网格大小是固定的,如果我们要实现100米或者500米网格聚合,就没有办法直接用了。
50 |
51 | 当然,如果你真的要搞什么500米网格,首先要考虑的不是geohash,而是先选择一个什么样的投影坐标系——经纬度转换为米本来就需要进行投影,而一旦投影之后,你就可以采用整除截余法来进行聚合了。
52 |
53 |
54 |
55 | 实际上已经说了好几篇的聚合,有的同学可能也有一个疑问:为什么你老在说聚合,难道点数据分析就做聚合就可以了么?
56 |
57 | 当然不是,但是聚合可以说是我们处理数据的一个基本手段,比如要做核密度分析,核密度的一个特性,使得聚合成为核密度分析的重要前置条件,这个问题,以后我们再说。
58 |
59 |
60 |
61 |
62 | 待续未完
63 |
64 |
65 | ## 代码下载:
66 |
67 | Github:
68 | https://github.com/allenlu2008/PySparkDemo
69 |
70 | gitee:
71 | https://gitee.com/godxia/PySparkDemo
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 |
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 |
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 |
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | .hypothesis/
50 | .pytest_cache/
51 |
52 | # Translations
53 | *.mo
54 | *.pot
55 |
56 | # Django stuff:
57 | *.log
58 | local_settings.py
59 | db.sqlite3
60 |
61 | # Flask stuff:
62 | instance/
63 | .webassets-cache
64 |
65 | # Scrapy stuff:
66 | .scrapy
67 |
68 | # Sphinx documentation
69 | docs/_build/
70 |
71 | # PyBuilder
72 | target/
73 |
74 | # Jupyter Notebook
75 | .ipynb_checkpoints
76 |
77 | # IPython
78 | profile_default/
79 | ipython_config.py
80 |
81 | # pyenv
82 | .python-version
83 |
84 | # celery beat schedule file
85 | celerybeat-schedule
86 |
87 | # SageMath parsed files
88 | *.sage.py
89 |
90 | # Environments
91 | .env
92 | .venv
93 | env/
94 | venv/
95 | ENV/
96 | env.bak/
97 | venv.bak/
98 |
99 | # Spyder project settings
100 | .spyderproject
101 | .spyproject
102 |
103 | # Rope project settings
104 | .ropeproject
105 |
106 | # mkdocs documentation
107 | /site
108 |
109 | # mypy
110 | .mypy_cache/
111 | .dmypy.json
112 | dmypy.json
113 |
114 | # Pyre type checker
115 | .pyre/
116 |
--------------------------------------------------------------------------------
/2.啥是算子(1)/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(2):
2 | # 啥是算子(1):RDD
3 |
4 |
5 | 说到蒜子嘛,不就是:
6 |
7 |
8 |
9 | 额,不对啊,那蒜子就应该是:
10 |
11 |
12 |
13 | 那么我只能表示:
14 |
15 |
16 |
17 | 那么,到底什么是算子呢?
18 |
19 | 算子实际上是Spark里面的一种专业术语,实际上就是一种函数,但是与传统函数不一样的是,它进行的是状态的转换。比如:
20 |
21 |
22 |
23 | 将小鸡变成KFC全家桶,之间就有好多个步骤,那么每个步骤实际上就是一个算子。
24 |
25 |
26 | 要说算子,首先就要先说说Spark里面的一个核心概念:
27 | ## RDD
28 |
29 | RDD的全称叫做“弹性分布式数据集”(Resilient Distributed Dataset),网络上关于介绍它的东西海了去了,我这里就不赘述了,总体来说,他有如下几个特性
30 |
31 | ### 1、弹性:
32 | 弹性这个词,这里实际上说的是数据的灵活性,包括了以下这些能力:
33 |
34 |
35 |
36 | ### 2、分布式:
37 |
38 |
39 |
40 | 要分析的数据会被分散成若干块,存放在不同的worker节点上,进行处理。
41 |
42 | ### 3、不可变:
43 |
44 |
45 |
46 | RDD作为Spark的核心内容,掌握了RDD的操作,就等于掌握了绝大部分Spark的应用方式了。下面我们通过一个具体的例子来说明RDD变换流程,比如红烧排骨的做法:
47 |
48 |
49 |
50 | 每一个成品,就是一个RDD,而每一次变换,就是一个算子。
51 |
52 | 比如第一个步骤,是整扇的肋排,通过“切”这个操作,变成了小块。那么在这个操作里面,整扇的肋排,就是初始的RDD,算子就是“切”这个操作,转换而成的一个新的RDD,就是变成了小块的排骨。
53 |
54 | 第二个步骤也是一样,小块排骨这个RDD,通过焯水这步骤,变成去掉了血水的半熟品。这里初始的RDD,就是上一个步骤转换而来的RDD,这个步骤的算子,就是焯水这个操作,转换而来的新的RDD,就是去掉了血水的半熟排骨。
55 |
56 | 后面的步骤也都是一样,每个步骤都是由上一个步骤的结果转换而来,每个转换的操作,就称之为一个算子。
57 |
58 | 这里我们就可以看见RDD的其中一个特性,就是默认情况下,转换之后,前一个RDD就不存在,比如切块之后,整扇的肋排,就已经消失了,变成了切块之后的小块排骨了。
59 |
60 | 这样的好处是,Spark本身要处理的数据量就非常庞大,所以通过转换生成的方式,不需要在内存里面驻留多份数据,而且不对原始数据进行改变的话,控制的难度就会大大下降,只需要维护当前唯一状态就行。
61 |
62 | 但是,这里又带来了一个问题,比如:
63 |
64 |
65 |
66 | 刚刚切片好,突然把桌子打翻了怎么办呢?也就是在处理的过程中,RDD出现了错误,没有保留上一步的RDD的情况下,如何恢复呢?
67 |
68 | Spark通过血统(lineage)的方式进行恢复,也就是它会记录下你的RDD是怎么转换而来的,比如:
69 |
70 |
71 |
72 | 这样仅需要重复以前的步骤,就可以进行恢复了。
73 |
74 | 但是如果我们不是仅仅执行一次,比如我们先要做一个红烧排骨,又要做一个排骨汤,那么前面的步骤就需要全部再来一遍,那么这种情况下,如何处理呢?
75 |
76 | 答案是利用缓存。
77 |
78 |
79 |
80 | 缓存的优点是可以多次重复使用和快速容错恢复,但是缺点就是系统的资源消耗会乘以2,那么何时使用缓存,怎么使用缓存,就是一个仁者见仁智者见智的过程了。
81 |
82 | 通过上面的内容,大家基本上也都可以了解RDD的一些特性了,后面我们讲算子和操作的时候,这些特性还会不断的规范我们的编程思想。
83 |
84 | 待续未完。
85 |
--------------------------------------------------------------------------------
/2.啥是算子(1)/啥是算子.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(2):
2 | # 啥是算子(1):RDD
3 |
4 |
5 | 说到蒜子嘛,不就是:
6 |
7 |
8 |
9 | 额,不对啊,那蒜子就应该是:
10 |
11 |
12 |
13 | 那么我只能表示:
14 |
15 |
16 |
17 | 那么,到底什么是算子呢?
18 |
19 | 算子实际上是Spark里面的一种专业术语,实际上就是一种函数,但是与传统函数不一样的是,它进行的是状态的转换。比如:
20 |
21 |
22 |
23 | 将小鸡变成KFC全家桶,之间就有好多个步骤,那么每个步骤实际上就是一个算子。
24 |
25 |
26 | 要说算子,首先就要先说说Spark里面的一个核心概念:
27 | ## RDD
28 |
29 | RDD的全称叫做“弹性分布式数据集”(Resilient Distributed Dataset),网络上关于介绍它的东西海了去了,我这里就不赘述了,总体来说,他有如下几个特性
30 |
31 | ### 1、弹性:
32 | 弹性这个词,这里实际上说的是数据的灵活性,包括了以下这些能力:
33 |
34 |
35 |
36 | ### 2、分布式:
37 |
38 |
39 |
40 | 要分析的数据会被分散成若干块,存放在不同的worker节点上,进行处理。
41 |
42 | ### 3、不可变:
43 |
44 |
45 |
46 | RDD作为Spark的核心内容,掌握了RDD的操作,就等于掌握了绝大部分Spark的应用方式了。下面我们通过一个具体的例子来说明RDD变换流程,比如红烧排骨的做法:
47 |
48 |
49 |
50 | 每一个成品,就是一个RDD,而每一次变换,就是一个算子。
51 |
52 | 比如第一个步骤,是整扇的肋排,通过“切”这个操作,变成了小块。那么在这个操作里面,整扇的肋排,就是初始的RDD,算子就是“切”这个操作,转换而成的一个新的RDD,就是变成了小块的排骨。
53 |
54 | 第二个步骤也是一样,小块排骨这个RDD,通过焯水这步骤,变成去掉了血水的半熟品。这里初始的RDD,就是上一个步骤转换而来的RDD,这个步骤的算子,就是焯水这个操作,转换而来的新的RDD,就是去掉了血水的半熟排骨。
55 |
56 | 后面的步骤也都是一样,每个步骤都是由上一个步骤的结果转换而来,每个转换的操作,就称之为一个算子。
57 |
58 | 这里我们就可以看见RDD的其中一个特性,就是默认情况下,转换之后,前一个RDD就不存在,比如切块之后,整扇的肋排,就已经消失了,变成了切块之后的小块排骨了。
59 |
60 | 这样的好处是,Spark本身要处理的数据量就非常庞大,所以通过转换生成的方式,不需要在内存里面驻留多份数据,而且不对原始数据进行改变的话,控制的难度就会大大下降,只需要维护当前唯一状态就行。
61 |
62 | 但是,这里又带来了一个问题,比如:
63 |
64 |
65 |
66 | 刚刚切片好,突然把桌子打翻了怎么办呢?也就是在处理的过程中,RDD出现了错误,没有保留上一步的RDD的情况下,如何恢复呢?
67 |
68 | Spark通过血统(lineage)的方式进行恢复,也就是它会记录下你的RDD是怎么转换而来的,比如:
69 |
70 |
71 |
72 | 这样仅需要重复以前的步骤,就可以进行恢复了。
73 |
74 | 但是如果我们不是仅仅执行一次,比如我们先要做一个红烧排骨,又要做一个排骨汤,那么前面的步骤就需要全部再来一遍,那么这种情况下,如何处理呢?
75 |
76 | 答案是利用缓存。
77 |
78 |
79 |
80 | 缓存的优点是可以多次重复使用和快速容错恢复,但是缺点就是系统的资源消耗会乘以2,那么何时使用缓存,怎么使用缓存,就是一个仁者见仁智者见智的过程了。
81 |
82 | 通过上面的内容,大家基本上也都可以了解RDD的一些特性了,后面我们讲算子和操作的时候,这些特性还会不断的规范我们的编程思想。
83 |
84 | 待续未完。
85 |
--------------------------------------------------------------------------------
/7.构造空间数据的RDD/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(7):
2 | # 构造空间数据的RDD(1)
3 |
4 |
5 | Spark默认读取的是文本类型的文件,但是作为GISer,我们打交道的基本上都不是文本文件,无论是通用的Shapefile还是地理数据库,或者是栅格文件,都是以二进制为主的文件,那么在Spark里面怎么用呢?
6 |
7 |
8 |
9 | 上一篇说过,可以通过对象序列化的方式来实现,比如先读成序列化对象,然后转换成RDD,比如下面这个地类图斑数据:
10 |
11 |
12 |
13 | 存储为Shapefile的数据,如何变成RDD呢?可以通过对象序列化的方式来实现,代码如下:
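
代码的思路大致如下(arcpy 的字段名与路径均为示例):

```python
import arcpy
import pyspark

sc = pyspark.SparkContext()

rows = []
# 用 SearchCursor 把每个图斑读成 (地类名称, 几何要素) 的元组
with arcpy.da.SearchCursor(r"data/dltb.shp", ["DLMC", "SHAPE@"]) as cur:
    for dlmc, shape in cur:
        rows.append((dlmc, shape))

# 把元组的集合直接转成 RDD
rdd = sc.parallelize(rows)
print(rdd.count())
```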
14 |
15 |
16 |
17 | 上面的例子里面,我们直接通过arcpy包,读取了shapefile里面的要素,把每一条数据简单取出,构建成了一个由地类名称和几何要素组成的元组,然后把这些元组的集合直接转变成了RDD。
18 |
19 | 之后使用这个RDD就和使用普通RDD一样了,里面几何要素的属性和方法都能直接使用。
20 |
21 | 这种方法能够序列化任意对象,只要是你的Driver能够读取的数据,都能够变成RDD。但是问题昨天也说过了,如果你序列化的对象太大,超过了Driver节点的可用内存,那么可能会被撑爆:
22 |
23 |
24 |
25 | 那么如果要另外一种方式,比如你的数据是存放在HDFS上面的怎么办呢?还有,如果你的数据大小超过了Driver节点的可用内存又怎么办呢?
26 |
27 | 如果要去读取Shapefile,或者file gdb这种数据格式,就得说到Spark对自定义格式数据的读取了,但是我们这里暂时不去说这么复杂的东西,先来看看一般情况下如何解决这个问题。
28 |
29 | (ArcGIS Geoanalytics server有一套特有的API,可以把Shapefile读取成FeatureRDD,但是需要有ArcGIS GA的许可和API,我们这里主要说开源体系,所以这里暂时略过)。
30 |
31 |
32 |
33 | 那么怎么在仅能支持文本模式的默认情况下,读取空间数据呢?
34 |
35 |
36 |
37 | 最简单的模式,就是直接把空间数据表达为文本模式,如果要表达的是点的话,仅需要定义x、y两个字段即可,比如我们的地震点位数据:
38 |
39 |
40 |
41 | 但是空间数据里面,还有大量带有拓扑结构的数据,比如线和面,那么就需要借助OGC规范定义的一些结构了,比如GeoJson:
42 |
43 |
44 |
45 | 或者是WKB/WKT结构:
46 |
47 |
48 |
49 | 但是WKT仅能用于描述空间对象,附带在空间数据上面的那些属性数据怎么办呢?这时就可以采用csv/tsv格式来进行表达:
50 |
51 |
52 |
53 | tsv结合wkt,就可以进行要素级别的表达了:
54 |
55 |
56 |
57 | 那么为什么不用csv来结合wkt呢?这里实际上大家也都看出来了,csv是用逗号来分割的,而wkt里面有大量的逗号,除非仅表达点数据,否则会破坏掉数据分割的标记。
58 |
59 | 而wkt + 属性字段的模式,也是标准的ogc可读空间数据表达模式,比如在PostGIS里面,直接就会表达为:
60 |
61 |
62 |
63 |
64 | 下面我们通过一个示例来说明如何将数据转换为文本模式数据(采用GDAL):
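
转换的思路大致如下(用 ogr 读 shapefile,把属性和 WKT 用制表符拼起来写出;字段与路径均为示例):

```python
from osgeo import ogr

ds = ogr.Open("data/dltb.shp")
layer = ds.GetLayer(0)

with open("data/dltb.tsv", "w", encoding="utf-8") as f:
    for feat in layer:
        name = feat.GetField("DLMC")                 # 示例属性字段:地类名称
        wkt = feat.GetGeometryRef().ExportToWkt()    # 几何转成 WKT
        f.write("{0}\t{1}\n".format(name, wkt))
```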
65 |
66 |
67 |
68 | 最后生成的tsv数据如下:
69 |
70 |
71 |
72 | 待续未完
73 |
74 | 需要代码的同学可以从我的github或者gitee上自行下载:
75 |
76 | github:
77 | https://github.com/allenlu2008/PySparkDemo
78 |
79 | gitee:
80 | https://gitee.com/godxia/PySparkDemo
81 |
82 |
--------------------------------------------------------------------------------
/1.PySpark测试.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ```python
4 | import pyspark
5 | ```
6 |
7 |
8 | ```python
9 | conf = pyspark.SparkConf()
10 | ```
11 |
12 | ## 设置为local模式,在单机模拟环境下,这是默认的模式
13 |
14 |
15 | ```python
16 | conf.setMaster("local[*]")
17 | ```
18 |
19 |
20 |
21 |
22 |
9 |
10 |
11 | 此类计算,一般会有两种情况,一种是判断,是否相交;另外一种就是取出相交的部分来。
12 |
13 | 所以借用英文语法里面的疑问句形式,把空间运算也分为两类:
14 |
15 |
16 |
17 |
18 | 第一类空间运算,对比英文的一般疑问句,实际上是一种空间关系的计算,不会生成新的结果,仅仅返回参与运算的两个空间数据的关系的判断,比如图一的两个面,进行空间关系计算,判断条件为“相交”的话,返回的结果仅会有一个boolean值:True。
19 |
20 |
21 |
22 | 第二类,就是所谓的空间数据分析了,这种分析对比第一类关系运算而言,会生成新的数据;比如图一,分析条件为“相交”的话,如果要求返回的结果为面要素,那么就会把相交的部分给取出来:
23 |
24 |
25 |
26 |
27 | 当然,空间关系的算法已经很成熟了,大家有兴趣可以自行去查阅,我们这里不做基础算法的普及,而且也没有必要去自己造轮子,贯彻学以致用的原理,我们仅去了解如何去用。
28 |
29 | 先来说说目前业界使用最广泛的空间算法库。
30 |
31 | 在OGC标准(开放地理空间信息联盟(Open Geospatial Consortium))之前,做GIS的每家组织都搞了自己的一套空间对象规则和空间计算规则,各种百花齐放(文体……两开花),之后因为实在太纷乱了,所以OGC横空出世,给出了一套空间对象标准的顶层架构,在这个顶层下面大家继续各自文体两开花,但是好歹有了一个通用的约定。
32 | OGC在矢量数据上定义了如下操作:
33 | 首先是一些几何信息定义的:
34 |
35 |
36 |
37 | 空间关系判断的一些定义:
38 |
39 |
40 |
41 | 这些空间关系,返回的都是True/False。
42 |
43 | 最后就是空间几何运算:
44 |
45 |
46 |
47 | 这些几何运算,会生成一个新的几何对象。
48 |
49 | 上面就是OGC规定的一些矢量数据的关系和运算规则,只要符合OGC标准的任意组织和单位做的空间算法,都将包含这些基本算法,而且经过千锤百炼的进化,我们也就没有必要自己再去写一套了。
50 |
51 | 这些算法,又如何实现呢?在OGC下面,一般又分为如下两个体系的实现:
52 |
53 |
54 |
55 |
56 | 开源体系里面,一共有两种实现,第一种是业界使用的最广泛的GDAL,GDAL的全称是空间数据抽象库(Geospatial Data Abstraction Library),主要是用C++来实现的,这套体系里面,又衍生扩展出了无数的分支,比如PostGIS、Python GDAL/OGR,R语言的RGDAL包等。
57 |
58 | 另外一套就是JAVA体系下面的JTS(Java Topology Suite:Java拓扑套件),这一套包在GIS领域的开源体系里面虽然名声不显,但是在非GIS专业里面,却是大放异彩,在它下面衍生出来的空间应用,包括了GeoServer这种开源WebGIS头把交椅的系统、Oracle Spatial这种企业级空间处理插件,还包括了Spark下面用于空间处理的GeoSpark等等,都应用了JTS。
59 |
60 | 然后是闭源的系统,正如操作系统界一说闭源就是微软为代表,那么GIS界就是以ESRI为主要的标靶了。Esri在OGC标准下面开发了一整套几何关系算法,称之为Esri Geometry,功能除了上面OGC相关的标准以外,还有很多自行扩展的一些算法,几何组织模式也有一些不同,就以Esri的arcpy包而论,几何结构就表现为:
61 |
62 |
63 |
64 | 所有的几何结构,都以点为基准结构进行构造,几何点结构+空间关系,就变成了各种几何要素。这些几何要素包含了如下的属性:
65 |
66 |
67 |
68 | 当然,也是包含了各种空间运算了,下面就是一部分:
69 |
70 |
71 |
72 | 这些算法,单独拿出来非常容易理解,比如下面我们来做几个简单计算:
73 | 这里主要采用GDAL来实现:
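
比如用 ogr 从 WKT 构造两个面,分别做一次关系判断和一次几何运算(数据是随手写的示例):

```python
from osgeo import ogr

g1 = ogr.CreateGeometryFromWkt("POLYGON((0 0,2 0,2 2,0 2,0 0))")
g2 = ogr.CreateGeometryFromWkt("POLYGON((1 1,3 1,3 3,1 3,1 1))")

print(g1.Intersects(g2))                    # 空间关系判断,返回 True/False
print(g1.Intersection(g2).ExportToWkt())    # 空间几何运算,生成新的几何对象
```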
74 |
75 |
76 |
77 | 其他的方法,大家有兴趣自行查看API即可,我这里就不一一说明了。
78 |
79 | 这些空间算法如何在PySpark里面去使用呢?有一些什么条件呢?我们下一篇再说。
80 |
81 | 待续未完。
82 |
--------------------------------------------------------------------------------
/17.groupbyKey算子简介/平均数计算.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pyspark,random"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 2,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "ls = [chr(random.randint(65,71)) \n",
19 | " for i in range(1000)]"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 3,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "rdd = pyspark.SparkContext().parallelize(ls)\n",
29 | "rdd2 = rdd.map(lambda x :(x,random.randint(0,10000))).cache()"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "## 传统模式计算平均数"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 4,
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "data": {
46 | "text/plain": [
47 | "[('B', 4762.109677419355),\n",
48 | " ('C', 5253.848),\n",
49 | " ('F', 4667.606666666667),\n",
50 | " ('A', 5231.463235294118),\n",
51 | " ('E', 4873.174825174825),\n",
52 | " ('G', 5127.118055555556),\n",
53 | " ('D', 5298.21768707483)]"
54 | ]
55 | },
56 | "execution_count": 4,
57 | "metadata": {},
58 | "output_type": "execute_result"
59 | }
60 | ],
61 | "source": [
62 | "rdd2.map(lambda x : (x[0],(x[1],1)))\\\n",
63 | ".reduceByKey(lambda x,y:(x[0]+y[0],x[1]+y[1]))\\\n",
64 | ".map(lambda x : (x[0],x[1][0]/x[1][1])).collect()"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "## groupByKey模式"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 5,
77 | "metadata": {},
78 | "outputs": [
79 | {
80 | "data": {
81 | "text/plain": [
82 | "[('B', 4762.109677419355),\n",
83 | " ('C', 5253.848),\n",
84 | " ('F', 4667.606666666667),\n",
85 | " ('A', 5231.463235294118),\n",
86 | " ('E', 4873.174825174825),\n",
87 | " ('G', 5127.118055555556),\n",
88 | " ('D', 5298.21768707483)]"
89 | ]
90 | },
91 | "execution_count": 5,
92 | "metadata": {},
93 | "output_type": "execute_result"
94 | }
95 | ],
96 | "source": [
97 | "rdd2.groupByKey()\\\n",
98 | ".map(lambda x : (x[0],sum(x[1])/len(x[1])))\\\n",
99 | ".collect()"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "metadata": {},
106 | "outputs": [],
107 | "source": []
108 | }
109 | ],
110 | "metadata": {
111 | "kernelspec": {
112 | "display_name": "Python 3",
113 | "language": "python",
114 | "name": "python3"
115 | },
116 | "language_info": {
117 | "codemirror_mode": {
118 | "name": "ipython",
119 | "version": 3
120 | },
121 | "file_extension": ".py",
122 | "mimetype": "text/x-python",
123 | "name": "python",
124 | "nbconvert_exporter": "python",
125 | "pygments_lexer": "ipython3",
126 | "version": "3.6.6"
127 | }
128 | },
129 | "nbformat": 4,
130 | "nbformat_minor": 2
131 | }
132 |
--------------------------------------------------------------------------------
/13.geohash编码/geohash示例.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# geohash示例"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "import pygeohash"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### 设置第五级编码"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 3,
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
33 | "text/plain": [
34 | "'wx4g5'"
35 | ]
36 | },
37 | "execution_count": 3,
38 | "metadata": {},
39 | "output_type": "execute_result"
40 | }
41 | ],
42 | "source": [
43 | "pygeohash.encode(39.91,116.52,5)"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "## 从第一级开始编码"
51 | ]
52 | },
53 | {
54 | "cell_type": "markdown",
55 | "metadata": {},
56 | "source": [
57 | "### 一直到第六级都是一样的,第七级才开始不同\n",
58 | "### 表示了152m* 152m格子的时候,这两个点不在同一个格网里面了"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 7,
64 | "metadata": {},
65 | "outputs": [
66 | {
67 | "name": "stdout",
68 | "output_type": "stream",
69 | "text": [
70 | "i = 1: [w] -> [w]\n",
71 | "i = 2: [wx] -> [wx]\n",
72 | "i = 3: [wx4] -> [wx4]\n",
73 | "i = 4: [wx4g] -> [wx4g]\n",
74 | "i = 5: [wx4g5] -> [wx4g5]\n",
75 | "i = 6: [wx4g59] -> [wx4g59]\n",
76 | "i = 7: [wx4g598] -> [wx4g59b]\n",
77 | "i = 8: [wx4g5984] -> [wx4g59b8]\n",
78 | "i = 9: [wx4g59842] -> [wx4g59b8r]\n",
79 | "i = 10: [wx4g59842j] -> [wx4g59b8r2]\n",
80 | "i = 11: [wx4g59842jd] -> [wx4g59b8r2h]\n",
81 | "i = 12: [wx4g59842jdu] -> [wx4g59b8r2h9]\n"
82 | ]
83 | }
84 | ],
85 | "source": [
86 | "for i in range(1,13):\n",
87 | " print(\"i = {0}: [{1}] -> [{2}]\".format(i,pygeohash.encode(39.911,116.521,i),\n",
88 | " pygeohash.encode(39.912,116.522,i)))"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "### 可以通过decode进行解码"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 8,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "data": {
105 | "text/plain": [
106 | "(39.911, 116.521)"
107 | ]
108 | },
109 | "execution_count": 8,
110 | "metadata": {},
111 | "output_type": "execute_result"
112 | }
113 | ],
114 | "source": [
115 | "pygeohash.decode(\"wx4g59842jdu\")"
116 | ]
117 | }
118 | ],
119 | "metadata": {
120 | "kernelspec": {
121 | "display_name": "Python 3",
122 | "language": "python",
123 | "name": "python3"
124 | },
125 | "language_info": {
126 | "codemirror_mode": {
127 | "name": "ipython",
128 | "version": 3
129 | },
130 | "file_extension": ".py",
131 | "mimetype": "text/x-python",
132 | "name": "python",
133 | "nbconvert_exporter": "python",
134 | "pygments_lexer": "ipython3",
135 | "version": "3.6.7"
136 | }
137 | },
138 | "nbformat": 4,
139 | "nbformat_minor": 2
140 | }
141 |
--------------------------------------------------------------------------------
/1.windows模拟开发环境搭建/readme.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(1):
2 | # windows模拟开发环境搭建
3 |
4 |
5 | Spark是个灰常强大的东西……
6 |
7 | 实际上要说分布式集群神马的,Spark和hadoop一类的分布式计算框架,作为此轮大数据浪潮的尖刀,各种摧城拔寨,所向披靡,但是你真的对Spark的强大有所了解么?
8 |
9 |
10 |
11 | 我们先来看看Spark能干嘛,万能的广告平台某度的主页面:
12 |
13 |
14 |
15 | 翻了N页之后,发现所谓的实战,99%是针对电商领域的应用,不管是什么日志分析、行为分析什么的,就咱们做画地图的同学来说,这些个所谓的“实战”就是:
16 |
17 |
18 |
19 | 那么Spark可不可以在我们的GIS领域里面使用呢?又怎么使用呢?这就是为什么会有这一系列文章了:
20 |
21 |
22 | 废话少说,下面进入入门环节篇,在windows上如何进行模拟开发环境的搭建:
23 |
24 | 为什么要用windows呢?用Linux不好么?如果你能在linux上完整部署Spark/hadoop,以及整体开发环境,那么下面就不用看了……下面是给一些不怎么用过linux的同学看的。
25 |
26 |
27 |
28 | ps:最后的Spark生产环境,都是Linux的,但是我们学习和模拟开发环境,在windows上是木有一点问题,如下所示:
29 |
30 |
31 |
32 | 那么在windows里面如何搭建Spark的测试运行环境呢?
33 |
34 | 实际上Spark运行在Java虚拟机与Scala语言环境下面,这两种东西,都是在任意平台下都可以执行的,所以你只需要在windows上面安装JDK与Scala就可以了。
35 | 步骤如下:
36 |
37 | - 1、下载并且安装Oracle Java 1.8 的版本(小版本尽量高点,比如161或者181)
38 | - 2、下载并且安装scala,如果追求稳定,就安装2.11或者2.12,追新就安装2.13,注意,2.13对JDK的版本也要求比较高。
39 | - 3、下载,并且解压Spark。
40 | - 4、最关键的一点来了:下载并且解压Hadoop在windows上的模拟依赖文件:winutils(自行百度这个名字)。该文件应该存放在xxx/bin/winutils.exe这样的结构下面,那么你的xxx目录,就是hadoop的home
41 | - 5、设置如下环境变量:
42 | - a、JAVA_HOME :JDK安装的目录
43 | - b、SCALA_HOME:Scala安装的目录
44 | - c、SPARK_HOME:Spark解压的目录
45 | - d、HADOOP_HOME:上面的winutils.exe所在的bin的上一级目录下,比如你放到:C:\hadoop\bin\winutils.exe,那你的HADOOP_HOME就写c:\hadoop
46 | - e、设置环境变量的PATH,将你上面的a、b、c的bin,写入到path里面,如下所示:
47 |
48 |
49 |
50 |
51 |
52 |
53 | 全部设置完成之后,打开cmd窗口,输入spark-shell,如果出现如下界面,表示部署成功:
54 |
55 |
56 |
57 | ## 注意:我这里使用的是ArcGIS自带的Spark,如果是自行安装的Spark,效果是一模一样的,不用去纠结版本的问题。
58 |
59 | 接下去,就是PySpark的部署了。
60 |
61 | 先看看PySpark是个嘛东西。
62 |
63 | Spark很强大,但是Scala不好学(如果你要做Spark开发,建议还是学习一下Spark,但是如果你就是用用,那么就没多大必要了),幸好Spark支持用Python来对他进行原生的调用。用Python来调用Spark的包,就称为PySpark。
64 |
65 | 网络上有很多关于PySpark的介绍了,我这里就不赘述了,那么Python是怎么调用Spark的呢?核心在于这样一个包,叫做Py4j。这包的作用就是通过Python来调用Java虚拟机中的JAVA对象,所以实际上来说,PySpark核心还是调用的是Java,而并非是Python原生态的对象。
66 |
67 | 如何使用PySpark呢?建议大家直接使用Anaconda Python 3来调用PySpark。
68 |
69 | 首先安装Anaconda Python3版本:
70 |
71 |
72 | 安装完成之后,直接通过conda或者pip安装py4j,安装完成之后测试一下Py4J是否可用了:
73 |
74 |
75 | 之后,就要设置PySpark包了。PySpark的位置在你Spark的目录下面,比如我的在这里:
76 |
77 | 找到了之后,就可以把这个路径设置到你的Python环境里面了,主要方法有很多,这里介绍做简单的一种,就是制作pth文件:
78 |
79 | - 1、找到你的anaconda Python的安装目录,找到Python站点包的位置:如果你默认安装的是anaconda的话,应该是在D:\ProgramData\Anaconda3\Lib\site-packages下面,如果有env(虚拟环境),就应该在env/<虚拟环境>/Lib/site-packages下面:
80 |
81 |
82 |
83 |
84 | - 2、在下面创建一个文本文件,把后缀名改为pth,如下,比如我的名字就叫做pyspark.pth
85 |
86 |
87 | - 3、用记事本打开,在里面写上你Spark下面的python的路径:
88 |
89 |
90 | - 4、保存——退出。
91 |
92 | 然后我们来测试一下PySpark在我们的Python环境中是否可用了:
93 | 打开Python,然后输入import pyspark,如果不报错,就表示成功。
94 |
95 |
96 |
97 | ### 如果出现了错误,多半是你pth里面的路径设置不正确,注意使用双斜杠。
98 | 之后就可以用jupyter来进行pyspark开发了,下面建立一个jupyter notebook,来测试一下pyspark:
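
测试的代码很简单(和仓库里 1.PySpark测试.ipynb 的内容基本一致):

```python
import pyspark, random

conf = pyspark.SparkConf()
conf.setMaster("local[*]")      # 单机模拟环境下的 local 模式
conf.setAppName("AAA")
sc = pyspark.SparkContext(conf=conf)

# 随机生成十万个字母,然后分类统计聚合
x = [chr(random.randint(65, 90)) for i in range(100000)]
rdd = sc.parallelize(x)
print(rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect())
```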
99 |
100 | 最后需要源代码的同学,可以到虾神的代码托管仓库去自行下载:
101 |
102 | https://gitee.com/godxia/PySparkDemo
103 |
104 |
105 |
106 | 待续未完
107 |
108 |
--------------------------------------------------------------------------------
/1.windows模拟开发环境搭建/windows模拟开发环境搭建.md:
--------------------------------------------------------------------------------
1 | # PySpark算子处理空间数据全解析(1):
2 | # windows模拟开发环境搭建
3 |
4 |
5 | Spark是个灰常强大的东西……
6 |
7 | 实际上要说分布式集群神马的,Spark和hadoop一类的分布式计算框架,作为此轮大数据浪潮的尖刀,各种摧城拔寨,所向披靡,但是你真的对Spark的强大有所了解么?
8 |
9 |
10 |
11 | 我们先来看看Spark能干嘛,万能的广告平台某度的主页面:
12 |
13 |
14 |
15 | 翻了N页之后,发现所谓的实战,99%是针对电商领域的应用,不管是什么日志分析、行为分析什么的,就咱们做画地图的同学来说,这些个所谓的“实战”就是:
16 |
17 |
18 |
19 | 那么Spark可不可以在我们的GIS领域里面使用呢?又怎么使用呢?这就是为什么会有这一系列文章了:
20 |
21 |
22 | 废话少说,下面进入入门环节篇,在windows上如何进行模拟开发环境的搭建:
23 |
24 | 为什么要用windows呢?用Linux不好么?如果你能在linux上完整部署Spark/hadoop,以及整体开发环境,那么下面就不用看了……下面是给一些不怎么用过linux的同学看的。
25 |
26 |
27 |
28 | ps:最后的Spark生产环境,都是Linux的,但是我们学习和模拟开发环境,在windows上是木有一点问题,如下所示:
29 |
30 |
31 |
32 | 那么在windows里面如何搭建Spark的测试运行环境呢?
33 |
34 | 实际上Spark运行在Java虚拟机与Scala语言环境下面,这两种东西,都是在任意平台下都可以执行的,所以你只需要在windows上面安装JDK与Scala就可以了。
35 | 步骤如下:
36 |
37 | - 1、下载并且安装Oracle Java 1.8 的版本(小版本尽量高点,比如161或者181)
38 | - 2、下载并且安装scala,如果追求稳定,就安装2.11或者2.12,追新就安装2.13,注意,2.13对JDK的版本也要求比较高。
39 | - 3、下载,并且解压Spark。
40 | - 4、最关键的一点来了:下载并且解压Hadoop在windows上的模拟依赖文件:winutils(自行百度这个名字)。该文件应该存放在xxx/bin/winutils.exe这样的结构下面,那么你的xxx目录,就是hadoop的home
41 | - 5、设置如下环境变量:
42 | - a、JAVA_HOME :JDK安装的目录
43 | - b、SCALA_HOME:Scala安装的目录
44 | - c、SPARK_HOME:Spark解压的目录
45 | - d、HADOOP_HOME:上面的winutils.exe所在的bin的上一级目录下,比如你放到:C:\hadoop\bin\winutils.exe,那你的HADOOP_HOME就写c:\hadoop
46 | - e、设置环境变量的PATH,将你上面的a、b、c的bin,写入到path里面,如下所示:
47 |
48 |
49 |
50 |
51 |
52 |
53 | 全部设置完成之后,打开cmd窗口,输入spark-shell,如果出现如下界面,表示部署成功:
54 |
55 |
56 |
57 | ## 注意:我这里使用的是ArcGIS自带的Spark,如果是自行安装的Spark,效果是一模一样的,不用去纠结版本的问题。
58 |
59 | 接下去,就是PySpark的部署了。
60 |
61 | 先看看PySpark是个嘛东西。
62 |
63 | Spark很强大,但是Scala不好学(如果你要做Spark开发,建议还是学习一下Spark,但是如果你就是用用,那么就没多大必要了),幸好Spark支持用Python来对他进行原生的调用。用Python来调用Spark的包,就称为PySpark。
64 |
65 | 网络上有很多关于PySpark的介绍了,我这里就不赘述了,那么Python是怎么调用Spark的呢?核心在于这样一个包,叫做Py4j。这包的作用就是通过Python来调用Java虚拟机中的JAVA对象,所以实际上来说,PySpark核心还是调用的是Java,而并非是Python原生态的对象。
66 |
67 | 如何使用PySpark呢?建议大家直接使用Anaconda Python 3来调用PySpark。
68 |
69 | 首先安装Anaconda Python3版本:
70 |
71 |
72 | 安装完成之后,直接通过conda或者pip安装py4j,安装完成之后测试一下Py4J是否可用了:
73 |
74 |
75 | 之后,就要设置PySpark包了。PySpark的位置在你Spark的目录下面,比如我的在这里:
76 |
77 | 找到了之后,就可以把这个路径设置到你的Python环境里面了,主要方法有很多,这里介绍做简单的一种,就是制作pth文件:
78 |
79 | - 1、找到你的anaconda Python的安装目录,找到Python站点包的位置:如果你默认安装的是anaconda的话,应该是在D:\ProgramData\Anaconda3\Lib\site-packages下面,如果有env(虚拟环境),就应该在env/<虚拟环境>/Lib/site-packages下面:
80 |
81 |
82 |
83 |
84 | - 2、在下面创建一个文本文件,把后缀名改为pth,如下,比如我的名字就叫做pyspark.pth
85 |
86 |
87 | - 3、用记事本打开,在里面写上你Spark下面的python的路径:
88 |
89 |
90 | - 4、保存——退出。
91 |
92 | 然后我们来测试一下PySpark在我们的Python环境中是否可用了:
93 | 打开Python,然后输入import pyspark,如果不报错,就表示成功。
94 |
95 |
96 |
97 | ### 如果出现了错误,多半是你pth里面的路径设置不正确,注意使用双斜杠。
98 | 之后就可以用jupyter来进行pyspark开发了,下面建立一个jupyter notebook,来测试一下pyspark:
99 |
100 | 最后需要源代码的同学,可以到虾神的代码托管仓库去自行下载:
101 |
102 | https://gitee.com/godxia/PySparkDemo
103 |
104 |
105 |
106 | 待续未完
107 |
108 |
--------------------------------------------------------------------------------
/16.reduceByKey算子/ReduceByKey示例.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pyspark,random"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "# 随机生成1000个字母"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 2,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "ls = [chr(random.randint(65,72)) for i in range(1000)]"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 3,
31 | "metadata": {},
32 | "outputs": [
33 | {
34 | "name": "stdout",
35 | "output_type": "stream",
36 | "text": [
37 | "['D', 'A', 'C', 'D', 'E', 'G', 'E', 'E', 'D', 'D', 'E', 'F', 'G', 'E', 'E', 'F', 'B', 'F', 'E', 'F']\n"
38 | ]
39 | }
40 | ],
41 | "source": [
42 | "print(ls[:20])"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 4,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "sc = pyspark.SparkContext()"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 5,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "rdd = sc.parallelize(ls)"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "# MAP操作,每个字母计数为一"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 6,
73 | "metadata": {},
74 | "outputs": [],
75 | "source": [
76 | "maprdd = rdd.map(lambda x: (x ,1))"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 7,
82 | "metadata": {},
83 | "outputs": [
84 | {
85 | "name": "stdout",
86 | "output_type": "stream",
87 | "text": [
88 | "[('D', 1), ('A', 1), ('C', 1), ('D', 1), ('E', 1), ('G', 1), ('E', 1), ('E', 1), ('D', 1), ('D', 1)]\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "print(maprdd.take(10))"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "# ReduceByKey操作,同key的数据之间做计算。"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 8,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "reduceRes = maprdd.reduceByKey(lambda x,y : (x+y))"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 9,
115 | "metadata": {},
116 | "outputs": [
117 | {
118 | "data": {
119 | "text/plain": [
120 | "[('B', 124),\n",
121 | " ('H', 118),\n",
122 | " ('C', 139),\n",
123 | " ('A', 130),\n",
124 | " ('F', 132),\n",
125 | " ('D', 121),\n",
126 | " ('E', 128),\n",
127 | " ('G', 108)]"
128 | ]
129 | },
130 | "execution_count": 9,
131 | "metadata": {},
132 | "output_type": "execute_result"
133 | }
134 | ],
135 | "source": [
136 | "reduceRes.collect()"
137 | ]
138 | }
139 | ],
140 | "metadata": {
141 | "kernelspec": {
142 | "display_name": "Python 3",
143 | "language": "python",
144 | "name": "python3"
145 | },
146 | "language_info": {
147 | "codemirror_mode": {
148 | "name": "ipython",
149 | "version": 3
150 | },
151 | "file_extension": ".py",
152 | "mimetype": "text/x-python",
153 | "name": "python",
154 | "nbconvert_exporter": "python",
155 | "pygments_lexer": "ipython3",
156 | "version": "3.6.6"
157 | }
158 | },
159 | "nbformat": 4,
160 | "nbformat_minor": 2
161 | }
162 |
--------------------------------------------------------------------------------
/10.map算子解析(1)/map数据结构示例.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 71,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import random,datetime"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 66,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "class workmate:\n",
19 | " def __init__(self,name,age,clothesColor):\n",
20 | " self.name = name\n",
21 | " self.age = age\n",
22 | " self.clothesColor = clothesColor\n",
23 | " def __str__(self):\n",
24 | " return \"name:<{0}> age:<{1}> clothesColor:<{2}>\\\n",
25 | " \".format(self.name,self.age,self.clothesColor)"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 67,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "colorlist = [\"red\",\"blue\",\"green\",\n",
35 | " \"yellow\",\"black\",\"gray\",\n",
36 | " \"white\",\"orange\"]"
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "metadata": {},
42 | "source": [
43 | "### 生成一个100万条件记录的集合"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 78,
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "arrDemo = [workmate(chr(random.randint(65,91)),\n",
53 | " random.randint(21,55),\n",
54 | " random.choice(colorlist)) for i in range(1000000)]"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 79,
60 | "metadata": {},
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "name:age:<43> clothesColor:\n", 67 | "name: age:<21> clothesColor: \n", 68 | "name: age:<33> clothesColor: \n", 69 | "name: age:<47> clothesColor: \n", 70 | "name: age:<47> clothesColor: \n", 71 | "name: age:<52> clothesColor: \n", 72 | "name: age:<33> clothesColor: \n", 73 | "name: age:<33> clothesColor: \n", 74 | "name: age:<38> clothesColor: \n", 75 | "name: age:<37> clothesColor:
\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "for w in arrDemo[:10]:\n", 81 | " print(str(w))" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### 如果要在数组里面找到某个条件,只能进行遍历和判断" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 80, 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "37037\n", 101 | "0:00:00.090800\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "s = datetime.datetime.now()\n", 107 | "v = 0\n", 108 | "for w in arrDemo:\n", 109 | " if w.name == \"A\":\n", 110 | " v +=1\n", 111 | "print(v)\n", 112 | "print(datetime.datetime.now() -s)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "### 构造一个map,把要进行查询的条件作为key" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 81, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "mapDemo = {}\n", 129 | "for w in arrDemo:\n", 130 | " if w.name in mapDemo:\n", 131 | " mapDemo[w.name].append(w)\n", 132 | " else:\n", 133 | " mapDemo[w.name] = [w]" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "### 直接进行索引就可以了,效率比数组迭代,要快两个数量级。" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 85, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "name": "stdout", 150 | "output_type": "stream", 151 | "text": [ 152 | "37037\n", 153 | "0:00:00.000998\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "s = datetime.datetime.now()\n", 159 | "print(len(mapDemo[\"A\"]))\n", 160 | "print(datetime.datetime.now() -s)" 161 | ] 162 | } 163 | ], 164 | "metadata": { 165 | "kernelspec": { 166 | "display_name": "Python 3", 167 | "language": "python", 168 | "name": "python3" 169 | }, 170 | "language_info": { 171 | "codemirror_mode": { 172 | "name": "ipython", 173 | "version": 3 174 | }, 175 | "file_extension": ".py", 176 | "mimetype": "text/x-python", 177 | "name": "python", 178 | "nbconvert_exporter": "python", 179 | "pygments_lexer": "ipython3", 180 | "version": "3.6.7" 181 | } 182 | }, 183 | "nbformat": 4, 184 | "nbformat_minor": 2 185 | } 186 | -------------------------------------------------------------------------------- /1.PySpark测试.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pyspark" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "conf = pyspark.SparkConf()" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## 设置为local模式,在单机模拟环境下,这是默认的模式" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "data": { 35 | "text/plain": [ 36 | " " 37 | ] 38 | }, 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "conf.setMaster(\"local[*]\")" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## 设置任务名称" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "data": { 62 | "text/plain": [ 63 | " " 64 | ] 65 | }, 66 | "execution_count": 4, 67 | "metadata": {}, 68 | "output_type": 
"execute_result" 69 | } 70 | ], 71 | "source": [ 72 | "conf.setAppName(\"AAA\")" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "sc = pyspark.SparkContext(conf=conf)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 6, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "import random" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## 随机生成十万个字母,A-Z" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 7, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "x = [chr(random.randint(65,91)) for i in range(100000)]" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## 显示前100个" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 11, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "['V', 'C', 'H', 'I', 'K', 'F', 'K', 'B', 'Y', 'N', 'X', 'Y', 'S', 'W', 'K', 'U', 'F', 'Y', 'W', 'K', 'E', 'J', 'Y', 'P', 'J', 'R', 'Y', 'L', 'F', 'Q', 'W', 'D', 'S', 'N', 'Z', 'C', '[', 'N', 'M', 'A', 'P', 'F', 'E', 'D', 'N', 'R', 'C', 'Y', 'G', 'R', 'S', 'L', 'Z', 'A', 'H', 'G', 'E', 'E', 'M', 'L', 'G', 'A', 'Q', 'F', 'Z', 'I', 'B', 'S', 'B', 'Y', 'C', 'O', 'V', 'G', 'Y', 'J', 'F', 'E', 'K', 'D', 'T', 'P', 'M', 'X', '[', 'M', 'I', 'K', 'O', 'Q', 'U', 'S', 'W', 'U', 'C', 'S', 'E', 'A', 'N', 'X']\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "print(x[0:100])" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 9, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "rdd = sc.parallelize(x)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## 分类统计聚合,每个字母计数为1,然后看看出现了多少次" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 10, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "[('N', 3760),\n", 158 | " ('J', 3692),\n", 159 | " ('[', 3657),\n", 160 | " ('O', 3738),\n", 161 | " ('H', 3688),\n", 162 | " ('B', 3620),\n", 163 | " ('C', 3626),\n", 164 | " ('W', 3729),\n", 165 | " ('R', 3769),\n", 166 | " ('V', 3719),\n", 167 | " ('I', 3788),\n", 168 | " ('X', 3734),\n", 169 | " ('Z', 3698),\n", 170 | " ('K', 3661),\n", 171 | " ('L', 3659),\n", 172 | " ('F', 3767),\n", 173 | " ('U', 3771),\n", 174 | " ('Q', 3711),\n", 175 | " ('M', 3694),\n", 176 | " ('A', 3683),\n", 177 | " ('T', 3649),\n", 178 | " ('S', 3655),\n", 179 | " ('Y', 3765),\n", 180 | " ('E', 3614),\n", 181 | " ('P', 3731),\n", 182 | " ('D', 3706),\n", 183 | " ('G', 3716)]" 184 | ] 185 | }, 186 | "execution_count": 10, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "rdd.map(lambda w : (w,1)).reduceByKey(lambda a,b : a+b).collect()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 3", 206 | "language": "python", 207 | "name": "python3" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 3 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython3", 
219 | "version": "3.6.7" 220 | } 221 | }, 222 | "nbformat": 4, 223 | "nbformat_minor": 2 224 | } 225 | -------------------------------------------------------------------------------- /.ipynb_checkpoints/1.PySpark测试-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pyspark" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "conf = pyspark.SparkConf()" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## 设置为local模式,在单机模拟环境下,这是默认的模式" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "data": { 35 | "text/plain": [ 36 | " " 37 | ] 38 | }, 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "conf.setMaster(\"local[*]\")" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## 设置任务名称" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "data": { 62 | "text/plain": [ 63 | " " 64 | ] 65 | }, 66 | "execution_count": 4, 67 | "metadata": {}, 68 | "output_type": "execute_result" 69 | } 70 | ], 71 | "source": [ 72 | "conf.setAppName(\"AAA\")" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "sc = pyspark.SparkContext(conf=conf)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 6, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "import random" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## 随机生成十万个字母,A-Z" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 7, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "x = [chr(random.randint(65,91)) for i in range(100000)]" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## 显示前100个" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 11, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "['V', 'C', 'H', 'I', 'K', 'F', 'K', 'B', 'Y', 'N', 'X', 'Y', 'S', 'W', 'K', 'U', 'F', 'Y', 'W', 'K', 'E', 'J', 'Y', 'P', 'J', 'R', 'Y', 'L', 'F', 'Q', 'W', 'D', 'S', 'N', 'Z', 'C', '[', 'N', 'M', 'A', 'P', 'F', 'E', 'D', 'N', 'R', 'C', 'Y', 'G', 'R', 'S', 'L', 'Z', 'A', 'H', 'G', 'E', 'E', 'M', 'L', 'G', 'A', 'Q', 'F', 'Z', 'I', 'B', 'S', 'B', 'Y', 'C', 'O', 'V', 'G', 'Y', 'J', 'F', 'E', 'K', 'D', 'T', 'P', 'M', 'X', '[', 'M', 'I', 'K', 'O', 'Q', 'U', 'S', 'W', 'U', 'C', 'S', 'E', 'A', 'N', 'X']\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "print(x[0:100])" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 9, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "rdd = sc.parallelize(x)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "## 分类统计聚合" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 10, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "[('N', 
3760),\n", 158 | " ('J', 3692),\n", 159 | " ('[', 3657),\n", 160 | " ('O', 3738),\n", 161 | " ('H', 3688),\n", 162 | " ('B', 3620),\n", 163 | " ('C', 3626),\n", 164 | " ('W', 3729),\n", 165 | " ('R', 3769),\n", 166 | " ('V', 3719),\n", 167 | " ('I', 3788),\n", 168 | " ('X', 3734),\n", 169 | " ('Z', 3698),\n", 170 | " ('K', 3661),\n", 171 | " ('L', 3659),\n", 172 | " ('F', 3767),\n", 173 | " ('U', 3771),\n", 174 | " ('Q', 3711),\n", 175 | " ('M', 3694),\n", 176 | " ('A', 3683),\n", 177 | " ('T', 3649),\n", 178 | " ('S', 3655),\n", 179 | " ('Y', 3765),\n", 180 | " ('E', 3614),\n", 181 | " ('P', 3731),\n", 182 | " ('D', 3706),\n", 183 | " ('G', 3716)]" 184 | ] 185 | }, 186 | "execution_count": 10, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "rdd.map(lambda w : (w,1)).reduceByKey(lambda a,b : a+b).collect()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 3", 206 | "language": "python", 207 | "name": "python3" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 3 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython3", 219 | "version": "3.6.7" 220 | } 221 | }, 222 | "nbformat": 4, 223 | "nbformat_minor": 2 224 | } 225 | -------------------------------------------------------------------------------- /3.啥是算子(2)/3.2 啥是算子(2)函数链的执行.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random\n", 10 | "import datetime\n", 11 | "import pyspark" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "sc = pyspark.SparkContext()" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "x = [chr(random.randint(65,91)) for i in range(1000000)]" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 4, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "rdd = sc.parallelize(x)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## 两种不同的编写方式,执行的时间是一样 " 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### 第一种模式,可以看见rdd2定义了一次,被使用了两次:\n", 53 | "### 两次都是用同一个rdd做reduceByKey" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 5, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "0:00:10.367666\n", 66 | "-------------\n", 67 | "0:00:09.702549\n", 68 | "-------------\n", 69 | "0:00:20.070215\n", 70 | "#############\n", 71 | "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), ('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n", 72 | "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), 
('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "rdd2 = rdd.map(lambda a : (a,1))\n", 78 | "s = datetime.datetime.now()\n", 79 | "res = rdd2.reduceByKey(lambda a,b : a+b).collect()\n", 80 | "rt1 = datetime.datetime.now() -s \n", 81 | "print(rt1)\n", 82 | "print(\"-------------\")\n", 83 | "s = datetime.datetime.now()\n", 84 | "res2 = rdd2.reduceByKey(lambda a,b : a+b).collect()\n", 85 | "rt2 = datetime.datetime.now() -s \n", 86 | "print(rt2)\n", 87 | "print(\"-------------\")\n", 88 | "print(rt1 + rt2)\n", 89 | "print(\"#############\")\n", 90 | "print(res)\n", 91 | "print(res2)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## 第二种写法,map+reduceByKey每次都生成两个不同的rdd,但是结果(执行时间大致)是一样的" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "metadata": {}, 105 | "outputs": [ 106 | { 107 | "name": "stdout", 108 | "output_type": "stream", 109 | "text": [ 110 | "0:00:10.230988\n", 111 | "-------------\n", 112 | "0:00:10.221805\n", 113 | "-------------\n", 114 | "0:00:20.452793\n", 115 | "#############\n", 116 | "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), ('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n", 117 | "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), ('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "s = datetime.datetime.now()\n", 123 | "rdd2 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b)\n", 124 | "res = rdd2.collect()\n", 125 | "rt1 = datetime.datetime.now() -s \n", 126 | "print(rt1)\n", 127 | "print(\"-------------\")\n", 128 | "s = datetime.datetime.now()\n", 129 | "rdd3 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b)\n", 130 | "res2 = rdd3.collect()\n", 131 | "rt2 = datetime.datetime.now() -s \n", 132 | "print(rt2)\n", 133 | "print(\"-------------\")\n", 134 | "print(rt1 + rt2)\n", 135 | "print(\"#############\")\n", 136 | "print(res)\n", 137 | "print(res2)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [] 146 | } 147 | ], 148 | "metadata": { 149 | "kernelspec": { 150 | "display_name": "Python 3", 151 | "language": "python", 152 | "name": "python3" 153 | }, 154 | "language_info": { 155 | "codemirror_mode": { 156 | "name": "ipython", 157 | "version": 3 158 | }, 159 | "file_extension": ".py", 160 | "mimetype": "text/x-python", 161 | "name": "python", 162 | "nbconvert_exporter": "python", 163 | "pygments_lexer": "ipython3", 164 | "version": "3.6.7" 165 | } 166 | }, 167 | "nbformat": 4, 168 | "nbformat_minor": 2 169 | } 170 | 
-------------------------------------------------------------------------------- /3.啥是算子(2)/3.3 啥是算子(2)缓存.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random\n", 10 | "import datetime\n", 11 | "import pyspark" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "sc = pyspark.SparkContext()" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "x = [chr(random.randint(65,91)) for i in range(1000000)]" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 4, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "rdd = sc.parallelize(x)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## 生成一个reduceByKey之后的rdd2,连续两次进行提交" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### 可以看见,第二次稍微快点,是因为第二次不用进行任务调度,节约了任务调度的时间,但是依然耗费了计算时间" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 5, 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "name": "stdout", 62 | "output_type": "stream", 63 | "text": [ 64 | "0:00:10.000775\n", 65 | "-------------\n", 66 | "0:00:05.434287\n", 67 | "-------------\n", 68 | "0:00:15.435062\n", 69 | "#############\n", 70 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), ('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n", 71 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), ('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "rdd2 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b)\n", 77 | "s = datetime.datetime.now()\n", 78 | "res1 = rdd2.collect()\n", 79 | "rt1 = datetime.datetime.now() -s \n", 80 | "print(rt1)\n", 81 | "print(\"-------------\")\n", 82 | "s = datetime.datetime.now()\n", 83 | "res2 = rdd2.collect()\n", 84 | "rt2 = datetime.datetime.now() -s \n", 85 | "print(rt2)\n", 86 | "print(\"-------------\")\n", 87 | "print(rt1 + rt2)\n", 88 | "print(\"#############\")\n", 89 | "print(res1)\n", 90 | "print(res2)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## 直接在reduceByKey的时候,生成一个缓存" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### 第二次执行,就只需要从缓存RDD来执行,仅需要1秒钟。" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 6, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "0:00:10.037251\n", 117 | "-------------\n", 118 | "0:00:01.072305\n", 119 | "-------------\n", 120 | "0:00:11.109556\n", 121 | "#############\n", 122 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), 
('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n", 123 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), ('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "rdd2 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b).cache()\n", 129 | "s = datetime.datetime.now()\n", 130 | "res = rdd2.collect()\n", 131 | "rt1 = datetime.datetime.now() -s \n", 132 | "print(rt1)\n", 133 | "print(\"-------------\")\n", 134 | "s = datetime.datetime.now()\n", 135 | "res2 = rdd2.collect()\n", 136 | "rt2 = datetime.datetime.now() -s \n", 137 | "print(rt2)\n", 138 | "print(\"-------------\")\n", 139 | "print(rt1 + rt2)\n", 140 | "print(\"#############\")\n", 141 | "print(res1)\n", 142 | "print(res2)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [] 151 | } 152 | ], 153 | "metadata": { 154 | "kernelspec": { 155 | "display_name": "Python 3", 156 | "language": "python", 157 | "name": "python3" 158 | }, 159 | "language_info": { 160 | "codemirror_mode": { 161 | "name": "ipython", 162 | "version": 3 163 | }, 164 | "file_extension": ".py", 165 | "mimetype": "text/x-python", 166 | "name": "python", 167 | "nbconvert_exporter": "python", 168 | "pygments_lexer": "ipython3", 169 | "version": "3.6.7" 170 | } 171 | }, 172 | "nbformat": 4, 173 | "nbformat_minor": 2 174 | } 175 | -------------------------------------------------------------------------------- /3.啥是算子(2)/.ipynb_checkpoints/3.2 啥是算子(2)函数链的执行-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random\n", 10 | "import datetime\n", 11 | "import pyspark" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "sc = pyspark.SparkContext()" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "x = [chr(random.randint(65,91)) for i in range(1000000)]" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 4, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "rdd = sc.parallelize(x)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## 两种不同的编写方式,执行的时间是一样 " 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### 第一种模式,可以看见rdd2定义了一次,被使用了两次:\n", 53 | "### 两次都是用同一个rdd做reduceByKey" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 5, 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "name": "stdout", 63 | "output_type": "stream", 64 | "text": [ 65 | "0:00:10.367666\n", 66 | "-------------\n", 67 | "0:00:09.702549\n", 68 | "-------------\n", 69 | "0:00:20.070215\n", 70 | "#############\n", 71 
| "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), ('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n", 72 | "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), ('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "rdd2 = rdd.map(lambda a : (a,1))\n", 78 | "s = datetime.datetime.now()\n", 79 | "res = rdd2.reduceByKey(lambda a,b : a+b).collect()\n", 80 | "rt1 = datetime.datetime.now() -s \n", 81 | "print(rt1)\n", 82 | "print(\"-------------\")\n", 83 | "s = datetime.datetime.now()\n", 84 | "res2 = rdd2.reduceByKey(lambda a,b : a+b).collect()\n", 85 | "rt2 = datetime.datetime.now() -s \n", 86 | "print(rt2)\n", 87 | "print(\"-------------\")\n", 88 | "print(rt1 + rt2)\n", 89 | "print(\"#############\")\n", 90 | "print(res)\n", 91 | "print(res2)" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "## 第二种写法,map+reduceByKey每次都生成两个不同的rdd,但是结果(执行时间大致)是一样的" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "metadata": {}, 105 | "outputs": [ 106 | { 107 | "name": "stdout", 108 | "output_type": "stream", 109 | "text": [ 110 | "0:00:10.230988\n", 111 | "-------------\n", 112 | "0:00:10.221805\n", 113 | "-------------\n", 114 | "0:00:20.452793\n", 115 | "#############\n", 116 | "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), ('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n", 117 | "[('N', 36921), ('J', 36963), ('O', 37176), ('[', 37147), ('H', 37025), ('B', 37037), ('R', 36973), ('W', 37438), ('C', 36989), ('Z', 37157), ('V', 36778), ('X', 36705), ('I', 37165), ('K', 36993), ('L', 37151), ('Q', 37044), ('T', 36665), ('U', 37025), ('A', 37126), ('F', 36657), ('M', 37224), ('S', 37248), ('Y', 36968), ('G', 36989), ('D', 37100), ('E', 37147), ('P', 37189)]\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "s = datetime.datetime.now()\n", 123 | "rdd2 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b)\n", 124 | "res = rdd2.collect()\n", 125 | "rt1 = datetime.datetime.now() -s \n", 126 | "print(rt1)\n", 127 | "print(\"-------------\")\n", 128 | "s = datetime.datetime.now()\n", 129 | "rdd3 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b)\n", 130 | "res2 = rdd3.collect()\n", 131 | "rt2 = datetime.datetime.now() -s \n", 132 | "print(rt2)\n", 133 | "print(\"-------------\")\n", 134 | "print(rt1 + rt2)\n", 135 | "print(\"#############\")\n", 136 | "print(res)\n", 137 | "print(res2)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [] 146 | } 147 | ], 148 | "metadata": { 149 | "kernelspec": { 150 | "display_name": "Python 3", 151 | "language": "python", 
152 | "name": "python3" 153 | }, 154 | "language_info": { 155 | "codemirror_mode": { 156 | "name": "ipython", 157 | "version": 3 158 | }, 159 | "file_extension": ".py", 160 | "mimetype": "text/x-python", 161 | "name": "python", 162 | "nbconvert_exporter": "python", 163 | "pygments_lexer": "ipython3", 164 | "version": "3.6.7" 165 | } 166 | }, 167 | "nbformat": 4, 168 | "nbformat_minor": 2 169 | } 170 | -------------------------------------------------------------------------------- /3.啥是算子(2)/.ipynb_checkpoints/3.3 啥是算子(2)缓存-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random\n", 10 | "import datetime\n", 11 | "import pyspark" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "sc = pyspark.SparkContext()" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 3, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "x = [chr(random.randint(65,91)) for i in range(1000000)]" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 4, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "rdd = sc.parallelize(x)" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "## 生成一个reduceByKey之后的rdd2,连续两次进行提交" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### 可以看见,第二次稍微快点,是因为第二次不用进行任务调度,节约了任务调度的时间,但是依然耗费了计算时间" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 5, 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "name": "stdout", 62 | "output_type": "stream", 63 | "text": [ 64 | "0:00:10.000775\n", 65 | "-------------\n", 66 | "0:00:05.434287\n", 67 | "-------------\n", 68 | "0:00:15.435062\n", 69 | "#############\n", 70 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), ('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n", 71 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), ('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "rdd2 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b)\n", 77 | "s = datetime.datetime.now()\n", 78 | "res1 = rdd2.collect()\n", 79 | "rt1 = datetime.datetime.now() -s \n", 80 | "print(rt1)\n", 81 | "print(\"-------------\")\n", 82 | "s = datetime.datetime.now()\n", 83 | "res2 = rdd2.collect()\n", 84 | "rt2 = datetime.datetime.now() -s \n", 85 | "print(rt2)\n", 86 | "print(\"-------------\")\n", 87 | "print(rt1 + rt2)\n", 88 | "print(\"#############\")\n", 89 | "print(res1)\n", 90 | "print(res2)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "## 直接在reduceByKey的时候,生成一个缓存" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### 
第二次执行,就只需要从缓存RDD来执行,仅需要1秒钟。" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 6, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "0:00:10.037251\n", 117 | "-------------\n", 118 | "0:00:01.072305\n", 119 | "-------------\n", 120 | "0:00:11.109556\n", 121 | "#############\n", 122 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), ('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n", 123 | "[('N', 36919), ('[', 37345), ('O', 37158), ('J', 37141), ('H', 36538), ('B', 37152), ('R', 36929), ('C', 37090), ('W', 36951), ('Z', 37120), ('I', 36969), ('V', 36720), ('X', 37498), ('L', 36923), ('K', 37071), ('Q', 37015), ('A', 37038), ('M', 37333), ('T', 36822), ('U', 36748), ('F', 37038), ('S', 36858), ('E', 36917), ('Y', 37405), ('G', 37114), ('P', 37035), ('D', 37153)]\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "rdd2 = rdd.map(lambda a : (a,1)).reduceByKey(lambda a,b : a+b).cache()\n", 129 | "s = datetime.datetime.now()\n", 130 | "res = rdd2.collect()\n", 131 | "rt1 = datetime.datetime.now() -s \n", 132 | "print(rt1)\n", 133 | "print(\"-------------\")\n", 134 | "s = datetime.datetime.now()\n", 135 | "res2 = rdd2.collect()\n", 136 | "rt2 = datetime.datetime.now() -s \n", 137 | "print(rt2)\n", 138 | "print(\"-------------\")\n", 139 | "print(rt1 + rt2)\n", 140 | "print(\"#############\")\n", 141 | "print(res1)\n", 142 | "print(res2)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [] 151 | } 152 | ], 153 | "metadata": { 154 | "kernelspec": { 155 | "display_name": "Python 3", 156 | "language": "python", 157 | "name": "python3" 158 | }, 159 | "language_info": { 160 | "codemirror_mode": { 161 | "name": "ipython", 162 | "version": 3 163 | }, 164 | "file_extension": ".py", 165 | "mimetype": "text/x-python", 166 | "name": "python", 167 | "nbconvert_exporter": "python", 168 | "pygments_lexer": "ipython3", 169 | "version": "3.6.7" 170 | } 171 | }, 172 | "nbformat": 4, 173 | "nbformat_minor": 2 174 | } 175 | -------------------------------------------------------------------------------- /3.啥是算子(2)/3.1 啥是算子(2)Transformation和Action.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random\n", 10 | "import datetime" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import pyspark" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 3, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "sc = pyspark.SparkContext()" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## 生成100万个字母组成的随机列表" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 4, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "x = [chr(random.randint(65,91)) for i in range(1000000)]" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## 
生成利用转换算子parallelize将列表转换为RDD" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "## 生成了一个管道集合RDD,顾名思义,只是生成一个管道,而不是实体" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 8, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "name": "stdout", 68 | "output_type": "stream", 69 | "text": [ 70 | "0:00:00.222304\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "s = datetime.datetime.now()\n", 76 | "rdd = sc.parallelize(x)\n", 77 | "print(datetime.datetime.now() - s)\n", 78 | "print(rdd)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## 执行count这个行动算子,开始提交任务,所以整体执行时间需要5秒" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 11, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "0:00:05.061927\n", 98 | "1000000\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "s = datetime.datetime.now()\n", 104 | "x2 = rdd.count()\n", 105 | "print(datetime.datetime.now() - s)\n", 106 | "print(x2)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## 在rdd算子上面执行一个转换算子map,使每个字母进行一个计数,实际上就是一个转换过程:\n", 114 | "### 集合(\"w\") ——> 集合(\"w\",1) " 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## map算子作为一个转换算子,仅仅记录一个函数过程链,而不发生实际的操作,所以完全没有花费任何执行时间。" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 13, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "0:00:00\n", 134 | "PythonRDD[4] at RDD at PythonRDD.scala:48\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "s = datetime.datetime.now()\n", 140 | "rdd2 = rdd.map(lambda x : (x,1))\n", 141 | "print(datetime.datetime.now() - s)\n", 142 | "print(rdd2)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "## 同上,reduceByKey算子,也是一个转换算子,转换的结果是得到一个同key相交的结果,如:" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "### ((\"w\",1),(\"w\",1),(\"w\",1),(\"z\",1),(\"z\",1)) ——>\n", 157 | "### ((\"w\",3),(\"z\",2))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "## 同样,reduceByKey算子也不会触发执行,所以也没有花费执行时间" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 16, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "0:00:00.034910\n", 177 | "PythonRDD[15] at RDD at PythonRDD.scala:48\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "s = datetime.datetime.now()\n", 183 | "rdd3 = rdd2.reduceByKey(lambda a,b : a+b)\n", 184 | "print(datetime.datetime.now() - s)\n", 185 | "print(rdd3)" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "## 最后调用collect算子,这是一个action算子,开始执行全环节运行,全部函数链执行。" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 17, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "0:00:09.629967\n", 205 | "[('[', 36920), ('O', 36954), ('N', 37148), ('J', 36862), ('B', 36882), ('H', 36955), ('W', 37045), ('C', 37055), ('R', 36978), ('X', 37271), ('V', 36881), ('Z', 37272), ('I', 36911), ('K', 37436), ('L', 36595), ('A', 36775), ('T', 37037), ('M', 37113), ('Q', 36969), ('U', 37495), 
('F', 37137), ('S', 37112), ('G', 36919), ('Y', 36865), ('D', 37267), ('P', 36932), ('E', 37214)]\n" 206 | ] 207 | } 208 | ], 209 | "source": [ 210 | "s = datetime.datetime.now()\n", 211 | "res = rdd3.collect()\n", 212 | "print(datetime.datetime.now() - s)\n", 213 | "print(res)" 214 | ] 215 | } 216 | ], 217 | "metadata": { 218 | "kernelspec": { 219 | "display_name": "Python 3", 220 | "language": "python", 221 | "name": "python3" 222 | }, 223 | "language_info": { 224 | "codemirror_mode": { 225 | "name": "ipython", 226 | "version": 3 227 | }, 228 | "file_extension": ".py", 229 | "mimetype": "text/x-python", 230 | "name": "python", 231 | "nbconvert_exporter": "python", 232 | "pygments_lexer": "ipython3", 233 | "version": "3.6.7" 234 | } 235 | }, 236 | "nbformat": 4, 237 | "nbformat_minor": 2 238 | } 239 | -------------------------------------------------------------------------------- /3.啥是算子(2)/.ipynb_checkpoints/3.1 啥是算子(2)Transformation和Action-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random\n", 10 | "import datetime" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "import pyspark" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 3, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "sc = pyspark.SparkContext()" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "## 生成100万个字母组成的随机列表" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 4, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "x = [chr(random.randint(65,91)) for i in range(1000000)]" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## 生成利用转换算子parallelize将列表转换为RDD" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "## 生成了一个管道集合RDD,顾名思义,只是生成一个管道,而不是实体" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 8, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "name": "stdout", 68 | "output_type": "stream", 69 | "text": [ 70 | "0:00:00.222304\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "s = datetime.datetime.now()\n", 76 | "rdd = sc.parallelize(x)\n", 77 | "print(datetime.datetime.now() - s)\n", 78 | "print(rdd)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## 执行count这个行动算子,开始提交任务,所以整体执行时间需要5秒" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 11, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "0:00:05.061927\n", 98 | "1000000\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "s = datetime.datetime.now()\n", 104 | "x2 = rdd.count()\n", 105 | "print(datetime.datetime.now() - s)\n", 106 | "print(x2)" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "## 在rdd算子上面执行一个转换算子map,使每个字母进行一个计数,实际上就是一个转换过程:\n", 114 | "### 集合(\"w\") ——> 集合(\"w\",1) " 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## map算子作为一个转换算子,仅仅记录一个函数过程链,而不发生实际的操作,所以完全没有花费任何执行时间。" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 13, 127 | "metadata": {}, 128 | "outputs": [ 129 | { 130 | 
"name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "0:00:00\n", 134 | "PythonRDD[4] at RDD at PythonRDD.scala:48\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "s = datetime.datetime.now()\n", 140 | "rdd2 = rdd.map(lambda x : (x,1))\n", 141 | "print(datetime.datetime.now() - s)\n", 142 | "print(rdd2)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "## 同上,reduceByKey算子,也是一个转换算子,转换的结果是得到一个同key相交的结果,如:" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "### ((\"w\",1),(\"w\",1),(\"w\",1),(\"z\",1),(\"z\",1)) ——>\n", 157 | "### ((\"w\",3),(\"z\",2))" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "## 同样,reduceByKey算子也不会触发执行,所以也没有花费执行时间" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 16, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "name": "stdout", 174 | "output_type": "stream", 175 | "text": [ 176 | "0:00:00.034910\n", 177 | "PythonRDD[15] at RDD at PythonRDD.scala:48\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "s = datetime.datetime.now()\n", 183 | "rdd3 = rdd2.reduceByKey(lambda a,b : a+b)\n", 184 | "print(datetime.datetime.now() - s)\n", 185 | "print(rdd3)" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "## 最后调用collect算子,这是一个action算子,开始执行全环节运行,全部函数链执行。" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 17, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "0:00:09.629967\n", 205 | "[('[', 36920), ('O', 36954), ('N', 37148), ('J', 36862), ('B', 36882), ('H', 36955), ('W', 37045), ('C', 37055), ('R', 36978), ('X', 37271), ('V', 36881), ('Z', 37272), ('I', 36911), ('K', 37436), ('L', 36595), ('A', 36775), ('T', 37037), ('M', 37113), ('Q', 36969), ('U', 37495), ('F', 37137), ('S', 37112), ('G', 36919), ('Y', 36865), ('D', 37267), ('P', 36932), ('E', 37214)]\n" 206 | ] 207 | } 208 | ], 209 | "source": [ 210 | "s = datetime.datetime.now()\n", 211 | "res = rdd3.collect()\n", 212 | "print(datetime.datetime.now() - s)\n", 213 | "print(res)" 214 | ] 215 | } 216 | ], 217 | "metadata": { 218 | "kernelspec": { 219 | "display_name": "Python 3", 220 | "language": "python", 221 | "name": "python3" 222 | }, 223 | "language_info": { 224 | "codemirror_mode": { 225 | "name": "ipython", 226 | "version": 3 227 | }, 228 | "file_extension": ".py", 229 | "mimetype": "text/x-python", 230 | "name": "python", 231 | "nbconvert_exporter": "python", 232 | "pygments_lexer": "ipython3", 233 | "version": "3.6.7" 234 | } 235 | }, 236 | "nbformat": 4, 237 | "nbformat_minor": 2 238 | } 239 | -------------------------------------------------------------------------------- /8.构造空间数据的RDD(2)/构造点数据.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 10, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import os,ogr" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pyspark" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 3, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "sc = pyspark.SparkContext()" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 4, 33 | "metadata": 
{}, 34 | "outputs": [], 35 | "source": [ 36 | "eqcsv = os.getcwd()+\"\\\\data\\\\eq2013.csv\"" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 7, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "rdd = sc.textFile(eqcsv)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 8, 51 | "metadata": {}, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/plain": [ 56 | "8205" 57 | ] 58 | }, 59 | "execution_count": 8, 60 | "metadata": {}, 61 | "output_type": "execute_result" 62 | } 63 | ], 64 | "source": [ 65 | "rdd.count()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 9, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/plain": [ 76 | "['date,time,x,y,z,lv,M,v1,v2,v3,v4',\n", 77 | " '2013/5/31,23:41:56,-113.408,37.175,6.6,2.5,ML2.5,SLC,,UTAH,',\n", 78 | " '2013/5/31,23:09:05,-113.411,37.178,6,2.5,ML2.5,SLC,,UTAH,',\n", 79 | " '2013/5/31,22:45:34,-113.413,37.172,4,2.9,ML2.9,SLC,,UTAH,',\n", 80 | " '2013/5/31,22:34:26,-113.414,37.174,3.2,2.8,ML2.8,SLC,,UTAH,',\n", 81 | " '2013/5/31,22:34:02,-178.08,51.127,26,3.1,ML3.1,AEIC,,ANDREANOF ISLANDS, ALEUTIAN IS.',\n", 82 | " '2013/5/31,22:21:56,143.032,21.648,322.5,4.4,,,,MARIANA ISLANDS REGION,',\n", 83 | " '2013/5/31,21:17:52,-64.754,19.103,36.9,2.9,MD2.9,RSPR,,VIRGIN ISLANDS,',\n", 84 | " '2013/5/31,20:17:11,161.56,-10.28,58.8,4.7,,,,SOLOMON ISLANDS,',\n", 85 | " '2013/5/31,19:51:33,-121.068,40.148,6.6,2.2,MD2.2,NC,,NORTHERN CALIFORNIA,']" 86 | ] 87 | }, 88 | "execution_count": 9, 89 | "metadata": {}, 90 | "output_type": "execute_result" 91 | } 92 | ], 93 | "source": [ 94 | "rdd.take(10)" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "## 编写一个过滤的方法。\n", 102 | "### 原理是用xy两个字段来构造一个点要素,然后用gdal的IsValid方法来验证,如果能用,就表示这一行是一个合法的要素描述信息" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "def isPoint(line):\n", 112 | " pntline = line.split(\",\")\n", 113 | " try:\n", 114 | " wkt = \"POINT({0} {1})\".format(float(pntline[2]),float(pntline[3]))\n", 115 | " geom = ogr.CreateGeometryFromWkt(wkt)\n", 116 | " return geom.IsValid()\n", 117 | " except:\n", 118 | " return False" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 12, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "eqrdd = rdd.filter(lambda line : isPoint(line))" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## 执行过滤之后,发现减少了一行" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 13, 140 | "metadata": {}, 141 | "outputs": [ 142 | { 143 | "data": { 144 | "text/plain": [ 145 | "8204" 146 | ] 147 | }, 148 | "execution_count": 13, 149 | "metadata": {}, 150 | "output_type": "execute_result" 151 | } 152 | ], 153 | "source": [ 154 | "eqrdd.count()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 14, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "data": { 164 | "text/plain": [ 165 | "['2013/5/31,23:41:56,-113.408,37.175,6.6,2.5,ML2.5,SLC,,UTAH,',\n", 166 | " '2013/5/31,23:09:05,-113.411,37.178,6,2.5,ML2.5,SLC,,UTAH,',\n", 167 | " '2013/5/31,22:45:34,-113.413,37.172,4,2.9,ML2.9,SLC,,UTAH,',\n", 168 | " '2013/5/31,22:34:26,-113.414,37.174,3.2,2.8,ML2.8,SLC,,UTAH,',\n", 169 | " '2013/5/31,22:34:02,-178.08,51.127,26,3.1,ML3.1,AEIC,,ANDREANOF ISLANDS, ALEUTIAN IS.',\n", 170 | " 
'2013/5/31,22:21:56,143.032,21.648,322.5,4.4,,,,MARIANA ISLANDS REGION,',\n", 171 | " '2013/5/31,21:17:52,-64.754,19.103,36.9,2.9,MD2.9,RSPR,,VIRGIN ISLANDS,',\n", 172 | " '2013/5/31,20:17:11,161.56,-10.28,58.8,4.7,,,,SOLOMON ISLANDS,',\n", 173 | " '2013/5/31,19:51:33,-121.068,40.148,6.6,2.2,MD2.2,NC,,NORTHERN CALIFORNIA,',\n", 174 | " '2013/5/31,19:29:31,-67.355,18.185,12,2.7,MD2.7,RSPR,,MONA PASSAGE,']" 175 | ] 176 | }, 177 | "execution_count": 14, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "eqrdd.take(10)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [] 192 | } 193 | ], 194 | "metadata": { 195 | "kernelspec": { 196 | "display_name": "Python 3", 197 | "language": "python", 198 | "name": "python3" 199 | }, 200 | "language_info": { 201 | "codemirror_mode": { 202 | "name": "ipython", 203 | "version": 3 204 | }, 205 | "file_extension": ".py", 206 | "mimetype": "text/x-python", 207 | "name": "python", 208 | "nbconvert_exporter": "python", 209 | "pygments_lexer": "ipython3", 210 | "version": "3.6.7" 211 | } 212 | }, 213 | "nbformat": 4, 214 | "nbformat_minor": 2 215 | } 216 | -------------------------------------------------------------------------------- /8.构造空间数据的RDD(2)/tsv&wkt数据构建面数据RDD.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import ogr" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pyspark" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 3, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "sc = pyspark.SparkContext()" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 4, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "rdd = sc.textFile(\"./data/dltb.tsv\")" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "### 直接过滤掉字段头和不能构建成要素的行(清洗脏数据)" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 5, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "def filterHeader(line):\n", 53 | " geoline = line.split(\"\\t\")\n", 54 | " try:\n", 55 | " return ogr.CreateGeometryFromWkt(geoline[1]).IsValid()\n", 56 | " except:\n", 57 | " return False" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "### 清洗之前,包括头,一共193行" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 6, 70 | "metadata": {}, 71 | "outputs": [ 72 | { 73 | "data": { 74 | "text/plain": [ 75 | "193" 76 | ] 77 | }, 78 | "execution_count": 6, 79 | "metadata": {}, 80 | "output_type": "execute_result" 81 | } 82 | ], 83 | "source": [ 84 | "rdd.count()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "### 过滤掉之后剩下192行了" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 7, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "rdd2 = rdd.filter(lambda line : filterHeader(line))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 8, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "192" 112 | ] 113 | }, 114 | "execution_count": 8, 115 | "metadata": {}, 116 | 
"output_type": "execute_result" 117 | } 118 | ], 119 | "source": [ 120 | "rdd2.count()" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "## 构建一个由几何要素和需要的属性组成的元祖" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 9, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "def myBuilderGeom(line):\n", 137 | " geoline = line.split(\"\\t\")\n", 138 | " geo = ogr.CreateGeometryFromWkt(geoline[1])\n", 139 | " return (geo,geoline[3])" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 10, 145 | "metadata": {}, 146 | "outputs": [], 147 | "source": [ 148 | "rdd3 = rdd2.map(lambda line : myBuilderGeom(line))" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "### 查看前面五行" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 33, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/plain": [ 166 | "[( >,\n", 167 | " '旱地'),\n", 168 | " ( >,\n", 169 | " '园地'),\n", 170 | " ( >,\n", 171 | " '林地'),\n", 172 | " ( >,\n", 173 | " '园地'),\n", 174 | " ( >,\n", 175 | " '园地')]" 176 | ] 177 | }, 178 | "execution_count": 33, 179 | "metadata": {}, 180 | "output_type": "execute_result" 181 | } 182 | ], 183 | "source": [ 184 | "rdd3.take(5)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### 进行面积统计" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 32, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "data": { 201 | "text/plain": [ 202 | "[('旱地', 1062114.5405737045),\n", 203 | " ('园地', 974189.4862199697),\n", 204 | " ('农村居民点用地', 39557.91353517026),\n", 205 | " ('其他独立建设用地', 257737.88656154094),\n", 206 | " ('林地', 4250805.430884263)]" 207 | ] 208 | }, 209 | "execution_count": 32, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "rdd3.map(lambda feat: (feat[1],\n", 216 | " feat[0].GetArea())).reduceByKey(lambda x,y:x+y).collect()" 217 | ] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "execution_count": null, 222 | "metadata": {}, 223 | "outputs": [], 224 | "source": [] 225 | } 226 | ], 227 | "metadata": { 228 | "kernelspec": { 229 | "display_name": "Python 3", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.6.7" 244 | } 245 | }, 246 | "nbformat": 4, 247 | "nbformat_minor": 2 248 | } 249 | -------------------------------------------------------------------------------- /17.groupbyKey算子简介/groupByKey示例.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pyspark" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## 生成一个1000个元素组成的随机字符数组" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "ls = 
[chr(random.randint(65,71)) \n", 35 | " for i in range(1000)]" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 4, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "rdd = pyspark.SparkContext().parallelize(ls)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## 构建一个map,key是随机字符,value是随机整数\n", 52 | "## 进行缓存" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 5, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "rdd2 = rdd.map(lambda x :(x,random.randint(0,10000))).cache()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 6, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "[('B', 6769), ('F', 8230), ('E', 3325), ('B', 148), ('F', 1956), ('D', 4230), ('A', 5862), ('D', 8148), ('A', 9723), ('D', 9276), ('E', 5407), ('G', 5177), ('F', 7832), ('B', 157), ('D', 5441), ('D', 2184), ('E', 3121), ('E', 4381), ('F', 7427), ('B', 4172), ('G', 4421), ('A', 7796), ('E', 1898), ('A', 2690), ('B', 5578), ('F', 2669), ('E', 2744), ('E', 5593), ('F', 2173), ('C', 5830)]\n" 74 | ] 75 | } 76 | ], 77 | "source": [ 78 | "print(rdd2.take(30))" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## 进行groupByKey操作" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 7, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "grdd = rdd2.groupByKey()" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "## groupByKey会生成一个(key,[value迭代器])组成的结构" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 8, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "data": { 111 | "text/plain": [ 112 | "[('B', )]" 113 | ] 114 | }, 115 | "execution_count": 8, 116 | "metadata": {}, 117 | "output_type": "execute_result" 118 | } 119 | ], 120 | "source": [ 121 | "grdd.take(1)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## 再来一次Map,这次执行获取最大值和最小值的操作" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 9, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "def mymap(gmap):\n", 138 | " return (gmap[0],(min(gmap[1]),max(gmap[1])))" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 10, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "data": { 148 | "text/plain": [ 149 | "[('B', (104, 9917)),\n", 150 | " ('C', (63, 9979)),\n", 151 | " ('F', (56, 9958)),\n", 152 | " ('A', (10, 9790)),\n", 153 | " ('E', (168, 9929)),\n", 154 | " ('D', (30, 9925)),\n", 155 | " ('G', (95, 9954))]" 156 | ] 157 | }, 158 | "execution_count": 10, 159 | "metadata": {}, 160 | "output_type": "execute_result" 161 | } 162 | ], 163 | "source": [ 164 | "grdd.map(lambda g: mymap(g)).collect()" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "## 传统方式一、计算max和min,需要进行两次计算" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 11, 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [ 180 | "maxval = rdd2.reduceByKey(lambda x,y:max(x,y)).collect()\n", 181 | "minval = rdd2.reduceByKey(lambda x,y:min(x,y)).collect()" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 12, 187 | "metadata": { 188 | "scrolled": true 189 | }, 190 | "outputs": [ 191 | { 192 | "name": "stdout", 193 | "output_type": 
"stream", 194 | "text": [ 195 | "[('B', 9917), ('C', 9979), ('F', 9958), ('A', 9790), ('E', 9929), ('D', 9925), ('G', 9954)]\n", 196 | "[('B', 104), ('C', 63), ('F', 56), ('A', 10), ('E', 168), ('D', 30), ('G', 95)]\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "print(maxval)\n", 202 | "print(minval)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "## 设计好map结构,也可以一次性算完" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": 14, 215 | "metadata": {}, 216 | "outputs": [ 217 | { 218 | "data": { 219 | "text/plain": [ 220 | "[('B', (9917, 104)),\n", 221 | " ('C', (9979, 63)),\n", 222 | " ('F', (9958, 56)),\n", 223 | " ('A', (9790, 10)),\n", 224 | " ('E', (9929, 168)),\n", 225 | " ('D', (9925, 30)),\n", 226 | " ('G', (9954, 95))]" 227 | ] 228 | }, 229 | "execution_count": 14, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "rdd2.map(lambda x : (x[0],(x[1],x[1])))\\\n", 236 | ".reduceByKey(lambda x,y:(max(x[0],y[0]),min(x[1],y[1]))).collect()" 237 | ] 238 | } 239 | ], 240 | "metadata": { 241 | "kernelspec": { 242 | "display_name": "Python 3", 243 | "language": "python", 244 | "name": "python3" 245 | }, 246 | "language_info": { 247 | "codemirror_mode": { 248 | "name": "ipython", 249 | "version": 3 250 | }, 251 | "file_extension": ".py", 252 | "mimetype": "text/x-python", 253 | "name": "python", 254 | "nbconvert_exporter": "python", 255 | "pygments_lexer": "ipython3", 256 | "version": "3.6.6" 257 | } 258 | }, 259 | "nbformat": 4, 260 | "nbformat_minor": 2 261 | } 262 | -------------------------------------------------------------------------------- /9.空间过滤算子解析/filter算子简介.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "## 生成一个由八种颜色随机组成的10万个元素组成的列表" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 4, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "collist = [random.choice([\"red\",\"blue\",\"green\",\n", 26 | " \"yellow\",\"black\",\"gray\",\n", 27 | " \"white\",\"orange\"])\\\n", 28 | " for i in range(100000)]" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 8, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "name": "stdout", 38 | "output_type": "stream", 39 | "text": [ 40 | "['black', 'white', 'white', 'blue', 'yellow', 'blue', 'white', 'red', 'blue', 'green', 'white', 'white', 'red', 'orange', 'gray', 'blue', 'red', 'yellow', 'gray', 'red', 'blue', 'yellow', 'gray', 'red', 'red', 'yellow', 'gray', 'green', 'yellow', 'yellow', 'white', 'red', 'gray', 'orange', 'blue', 'red', 'green', 'black', 'orange', 'orange', 'red', 'green', 'blue', 'red', 'blue', 'black', 'green', 'yellow', 'white', 'black', 'orange', 'green', 'white', 'red', 'orange', 'green', 'green', 'white', 'yellow', 'red', 'orange', 'gray', 'black', 'red', 'red', 'blue', 'orange', 'white', 'green', 'yellow', 'black', 'yellow', 'yellow', 'gray', 'white', 'red', 'red', 'white', 'green', 'white', 'gray', 'red', 'black', 'white', 'blue', 'black', 'orange', 'orange', 'blue', 'orange', 'orange', 'green', 'orange', 'black', 'red', 'white', 'black', 'yellow', 'red', 'red']\n" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "print(collist[:100])" 46 | ] 47 | }, 48 | { 
49 | "cell_type": "code", 50 | "execution_count": 9, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "import pyspark" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 10, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "sc = pyspark.SparkContext()" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 11, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "colrdd = sc.parallelize(collist)" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 15, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "['black', 'white', 'white', 'blue', 'yellow', 'blue', 'white', 'red', 'blue', 'green', 'white', 'white', 'red', 'orange', 'gray', 'blue', 'red', 'yellow', 'gray', 'red', 'blue', 'yellow', 'gray', 'red', 'red', 'yellow', 'gray', 'green', 'yellow', 'yellow', 'white', 'red', 'gray', 'orange', 'blue', 'red', 'green', 'black', 'orange', 'orange', 'red', 'green', 'blue', 'red', 'blue', 'black', 'green', 'yellow', 'white', 'black', 'orange', 'green', 'white', 'red', 'orange', 'green', 'green', 'white', 'yellow', 'red', 'orange', 'gray', 'black', 'red', 'red', 'blue', 'orange', 'white', 'green', 'yellow', 'black', 'yellow', 'yellow', 'gray', 'white', 'red', 'red', 'white', 'green', 'white', 'gray', 'red', 'black', 'white', 'blue', 'black', 'orange', 'orange', 'blue', 'orange', 'orange', 'green', 'orange', 'black', 'red', 'white', 'black', 'yellow', 'red', 'red']\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "print(colrdd.take(100))" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "### 定义个判断方法,如果等于red,就返回True,否则就返回False" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 12, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "def isRed(col):\n", 106 | " if col == \"red\":\n", 107 | " return True\n", 108 | " else:\n", 109 | " return False" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "### 执行过滤" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 13, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "redrdd = colrdd.filter(lambda col:isRed(col))" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 16, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "['red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red', 'red']\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "print(redrdd.take(100))" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "### 十万条数据中,为red的数据有多少" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 17, 155 | "metadata": {}, 156 | "outputs": [ 
157 | { 158 | "data": { 159 | "text/plain": [ 160 | "12487" 161 | ] 162 | }, 163 | "execution_count": 17, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "redrdd.count()" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "### 上面那个判断方法,也可以非常简化的写成:" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 19, 182 | "metadata": {}, 183 | "outputs": [ 184 | { 185 | "data": { 186 | "text/plain": [ 187 | "12487" 188 | ] 189 | }, 190 | "execution_count": 19, 191 | "metadata": {}, 192 | "output_type": "execute_result" 193 | } 194 | ], 195 | "source": [ 196 | "colrdd.filter(lambda col : col == \"red\").count()" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### 看看等于红色,或者等于蓝色的有多少个" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 20, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "24810" 215 | ] 216 | }, 217 | "execution_count": 20, 218 | "metadata": {}, 219 | "output_type": "execute_result" 220 | } 221 | ], 222 | "source": [ 223 | "colrdd.filter(lambda col : col == \"red\" or col == \"blue\").count()" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [] 232 | } 233 | ], 234 | "metadata": { 235 | "kernelspec": { 236 | "display_name": "Python 3", 237 | "language": "python", 238 | "name": "python3" 239 | }, 240 | "language_info": { 241 | "codemirror_mode": { 242 | "name": "ipython", 243 | "version": 3 244 | }, 245 | "file_extension": ".py", 246 | "mimetype": "text/x-python", 247 | "name": "python", 248 | "nbconvert_exporter": "python", 249 | "pygments_lexer": "ipython3", 250 | "version": "3.6.7" 251 | } 252 | }, 253 | "nbformat": 4, 254 | "nbformat_minor": 2 255 | } 256 | -------------------------------------------------------------------------------- /16.reduceByKey算子/reduceByKey进行属性统计.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import ogr,pandas" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import pyspark" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 3, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "sc = pyspark.SparkContext()" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 4, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "rdd = sc.textFile(\"./data/dltb.tsv\")" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## 过滤掉header" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 5, 49 | "metadata": { 50 | "scrolled": true 51 | }, 52 | "outputs": [ 53 | { 54 | "data": { 55 | "text/plain": [ 56 | "['FID\\tSHAPE\\tTBBH\\tDLMC\\tTBMJ\\tShape_Leng\\tShape_Area']" 57 | ] 58 | }, 59 | "execution_count": 5, 60 | "metadata": {}, 61 | "output_type": "execute_result" 62 | } 63 | ], 64 | "source": [ 65 | "rdd.take(1)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 6, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "geoRdd = rdd.filter(lambda line : line.find(\"POLYGON\")>0)" 75 | ] 76 | }, 77 | { 78 | "cell_type": 
"markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "## 借用pandas查看第一条记录" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 7, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/html": [ 92 | " \n", 93 | "\n", 106 | "" 145 | ], 146 | "text/plain": [ 147 | " 0\n", 148 | "0 0\n", 149 | "1 POLYGON ((11135971.0803 2710491.2501,11135968....\n", 150 | "2 379\n", 151 | "3 旱地\n", 152 | "4 1489.7\n", 153 | "5 1375.76858615\n", 154 | "6 47227.8152174" 155 | ] 156 | }, 157 | "execution_count": 7, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | } 161 | ], 162 | "source": [ 163 | "pandas.DataFrame(geoRdd.take(1)[0].split(\"\\t\"))" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## map操作,设计key为地类名称,value为图斑面积" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 8, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "mapRdd = geoRdd.map(lambda line:(line.split(\"\\t\")[3],\n", 180 | " float(line.split(\"\\t\")[4])))" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "## 查看一下前10条的内容,可以看见数据中的地类名称和图斑面积被构建成了一个map" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 9, 193 | "metadata": {}, 194 | "outputs": [ 195 | { 196 | "name": "stdout", 197 | "output_type": "stream", 198 | "text": [ 199 | "[('旱地', 1489.7), ('园地', 174938.39), ('林地', 16361.45), ('园地', 6216.62), ('园地', 767.08), ('园地', 84939.96), ('园地', 808.57), ('园地', 26997.04), ('旱地', 6367.33), ('林地', 23087.05)]\n" 200 | ] 201 | } 202 | ], 203 | "source": [ 204 | "print(mapRdd.take(10))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "## 进行reduceByKey操作" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 10, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "res = mapRdd.reduceByKey(lambda x,y:x+y).collect()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 11, 226 | "metadata": { 227 | "scrolled": true 228 | }, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | "[('旱地', 501681.6499999999), ('园地', 1281049.4700000002), ('农村居民点用地', 82505.71), ('其他独立建设用地', 2507353.24), ('林地', 1648852.08)]\n" 235 | ] 236 | } 237 | ], 238 | "source": [ 239 | "print(res)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 12, 245 | "metadata": {}, 246 | "outputs": [ 247 | { 248 | "data": { 249 | "text/plain": [ 250 | "[('旱地', 501681.6499999999),\n", 251 | " ('园地', 1281049.4700000002),\n", 252 | " ('农村居民点用地', 82505.71),\n", 253 | " ('其他独立建设用地', 2507353.24),\n", 254 | " ('林地', 1648852.08)]" 255 | ] 256 | }, 257 | "execution_count": 12, 258 | "metadata": {}, 259 | "output_type": "execute_result" 260 | } 261 | ], 262 | "source": [ 263 | "rdd.filter(lambda line : line.find(\"POLYGON\")>0)\\\n", 264 | ".map(lambda line:(line.split(\"\\t\")[3],\n", 265 | " float(line.split(\"\\t\")[4])))\\\n", 266 | ".reduceByKey(lambda x,y:x+y).collect()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": null, 272 | "metadata": {}, 273 | "outputs": [], 274 | "source": [] 275 | } 276 | ], 277 | "metadata": { 278 | "kernelspec": { 279 | "display_name": "Python 3", 280 | "language": "python", 281 | "name": "python3" 282 | }, 283 | "language_info": { 284 | "codemirror_mode": { 285 | "name": "ipython", 286 | "version": 3 287 | }, 288 | 
"file_extension": ".py", 289 | "mimetype": "text/x-python", 290 | "name": "python", 291 | "nbconvert_exporter": "python", 292 | "pygments_lexer": "ipython3", 293 | "version": "3.6.6" 294 | } 295 | }, 296 | "nbformat": 4, 297 | "nbformat_minor": 2 298 | } 299 | -------------------------------------------------------------------------------- /10.map算子解析(1)/map示例:数据集构造.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import random" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "def createList():\n", 19 | " v1 = chr(random.randint(65,90))\n", 20 | " v2 = chr(random.randint(97,122))\n", 21 | " return random.choice([v1,v2])" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## 构造一个100万个随机大小字母组成的数据集" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "metadata": {}, 35 | "outputs": [], 36 | "source": [ 37 | "arr = [createList() for i in range(1000000)]" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 4, 43 | "metadata": {}, 44 | "outputs": [ 45 | { 46 | "name": "stdout", 47 | "output_type": "stream", 48 | "text": [ 49 | "['J', 'f', 'U', 'j', 'r', 'p', 'c', 'L', 'a', 'J', 'C', 'l', 'R', 'v', 'B', 'R', 'M', 'Z', 'a', 'E', 'Q', 'i', 'B', 'o', 'z', 'f', 'O', 'q', 'L', 'x', 'L', 'P', 'i', 'y', 'v', 'S', 'g', 'G', 'o', 'k', 'r', 's', 'n', 'v', 'H', 'K', 'j', 'A', 'Y', 'i', 'q', 't', 'j', 'h', 'T', 'G', 'S', 'O', 'y', 'R', 'Z', 'k', 'O', 'Q', 'o', 'z', 'x', 'x', 'd', 'O', 'f', 'L', 'S', 'E', 'm', 'o', 'w', 'v', 'l', 'J', 'E', 'W', 'y', 'p', 'D', 'X', 'I', 't', 'I', 'g', 'o', 't', 'V', 'A', 'd', 'G', 'B', 'D', 'c', 'q']\n" 50 | ] 51 | } 52 | ], 53 | "source": [ 54 | "print(arr[0:100])" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 5, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "import pyspark" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 6, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "sc = pyspark.SparkContext()" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 7, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "rdd = sc.parallelize(arr)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### 对数据集进行转化:转化为(key,value)模式" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "### 其中,key是原来的字母,value可以是用来计算的内容,比如计数的话,就是1" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 8, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "maprdd = rdd.map(lambda a:(a,1))" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 9, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "[('J', 1), ('f', 1), ('U', 1), ('j', 1), ('r', 1), ('p', 1), ('c', 1), ('L', 1), ('a', 1), ('J', 1), ('C', 1), ('l', 1), ('R', 1), ('v', 1), ('B', 1), ('R', 1), ('M', 1), ('Z', 1), ('a', 1), ('E', 1), ('Q', 1), ('i', 1), ('B', 1), ('o', 1), ('z', 1), ('f', 1), ('O', 1), ('q', 1), ('L', 1), ('x', 1), ('L', 1), ('P', 1), ('i', 1), ('y', 1), ('v', 1), ('S', 1), ('g', 1), ('G', 1), ('o', 1), ('k', 1), ('r', 1), ('s', 1), ('n', 1), ('v', 1), ('H', 
1), ('K', 1), ('j', 1), ('A', 1), ('Y', 1), ('i', 1), ('q', 1), ('t', 1), ('j', 1), ('h', 1), ('T', 1), ('G', 1), ('S', 1), ('O', 1), ('y', 1), ('R', 1), ('Z', 1), ('k', 1), ('O', 1), ('Q', 1), ('o', 1), ('z', 1), ('x', 1), ('x', 1), ('d', 1), ('O', 1), ('f', 1), ('L', 1), ('S', 1), ('E', 1), ('m', 1), ('o', 1), ('w', 1), ('v', 1), ('l', 1), ('J', 1), ('E', 1), ('W', 1), ('y', 1), ('p', 1), ('D', 1), ('X', 1), ('I', 1), ('t', 1), ('I', 1), ('g', 1), ('o', 1), ('t', 1), ('V', 1), ('A', 1), ('d', 1), ('G', 1), ('B', 1), ('D', 1), ('c', 1), ('q', 1)]\n" 117 | ] 118 | } 119 | ], 120 | "source": [ 121 | "print(maprdd.take(100))" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "### 统计一下" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 10, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "[('J', 19127), ('r', 19367), ('l', 19431), ('i', 19512), ('O', 19144), ('y', 19344), ('g', 19303), ('s', 19238), ('N', 19116), ('a', 19133), ('B', 19380), ('z', 19407), ('H', 19347), ('m', 19286), ('e', 19219), ('j', 19178), ('C', 19070), ('R', 19145), ('W', 19192), ('Z', 19289), ('q', 19222), ('k', 19084), ('n', 19372), ('t', 18921), ('X', 19141), ('I', 19328), ('V', 19202), ('p', 19272), ('c', 19339), ('L', 19139), ('K', 19294), ('b', 19380), ('U', 19154), ('M', 19034), ('Q', 19093), ('x', 19158), ('A', 19231), ('T', 19290), ('w', 18995), ('F', 19353), ('S', 19072), ('h', 19322), ('d', 19214), ('f', 19206), ('v', 19338), ('E', 19272), ('o', 19261), ('P', 19297), ('G', 19216), ('Y', 19157), ('D', 19162), ('u', 19253)]\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "res = maprdd.reduceByKey(lambda x,y : x+y).collect()\n", 146 | "print(res)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "### 也可以在map里面写入逻辑表达,比如小写字母计数为1,大写字母计数为2" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 11, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "def myMap(w):\n", 163 | " if ord(w) >=97:\n", 164 | " return (w,1)\n", 165 | " else:\n", 166 | " return (w,2)" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 12, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "maprdd2 = rdd.map(lambda a: myMap(a))" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 13, 181 | "metadata": {}, 182 | "outputs": [ 183 | { 184 | "name": "stdout", 185 | "output_type": "stream", 186 | "text": [ 187 | "[('J', 2), ('f', 1), ('U', 2), ('j', 1), ('r', 1), ('p', 1), ('c', 1), ('L', 2), ('a', 1), ('J', 2), ('C', 2), ('l', 1), ('R', 2), ('v', 1), ('B', 2), ('R', 2), ('M', 2), ('Z', 2), ('a', 1), ('E', 2), ('Q', 2), ('i', 1), ('B', 2), ('o', 1), ('z', 1), ('f', 1), ('O', 2), ('q', 1), ('L', 2), ('x', 1), ('L', 2), ('P', 2), ('i', 1), ('y', 1), ('v', 1), ('S', 2), ('g', 1), ('G', 2), ('o', 1), ('k', 1), ('r', 1), ('s', 1), ('n', 1), ('v', 1), ('H', 2), ('K', 2), ('j', 1), ('A', 2), ('Y', 2), ('i', 1), ('q', 1), ('t', 1), ('j', 1), ('h', 1), ('T', 2), ('G', 2), ('S', 2), ('O', 2), ('y', 1), ('R', 2), ('Z', 2), ('k', 1), ('O', 2), ('Q', 2), ('o', 1), ('z', 1), ('x', 1), ('x', 1), ('d', 1), ('O', 2), ('f', 1), ('L', 2), ('S', 2), ('E', 2), ('m', 1), ('o', 1), ('w', 1), ('v', 1), ('l', 1), ('J', 2), ('E', 2), ('W', 2), ('y', 1), ('p', 1), ('D', 2), ('X', 2), ('I', 2), ('t', 1), ('I', 2), ('g', 1), ('o', 1), ('t', 1), ('V', 2), ('A', 2), 
('d', 1), ('G', 2), ('B', 2), ('D', 2), ('c', 1), ('q', 1)]\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "print(maprdd2.take(100))" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 14, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "name": "stdout", 202 | "output_type": "stream", 203 | "text": [ 204 | "[('J', 38254), ('r', 19367), ('l', 19431), ('i', 19512), ('O', 38288), ('y', 19344), ('g', 19303), ('s', 19238), ('N', 38232), ('a', 19133), ('B', 38760), ('z', 19407), ('H', 38694), ('m', 19286), ('e', 19219), ('j', 19178), ('C', 38140), ('R', 38290), ('W', 38384), ('Z', 38578), ('q', 19222), ('k', 19084), ('n', 19372), ('t', 18921), ('X', 38282), ('I', 38656), ('V', 38404), ('p', 19272), ('c', 19339), ('L', 38278), ('K', 38588), ('b', 19380), ('U', 38308), ('M', 38068), ('Q', 38186), ('x', 19158), ('A', 38462), ('T', 38580), ('w', 18995), ('F', 38706), ('S', 38144), ('h', 19322), ('d', 19214), ('f', 19206), ('v', 19338), ('E', 38544), ('o', 19261), ('P', 38594), ('G', 38432), ('Y', 38314), ('D', 38324), ('u', 19253)]\n" 205 | ] 206 | } 207 | ], 208 | "source": [ 209 | "res2 = maprdd2.reduceByKey(lambda x,y : x+y).collect()\n", 210 | "print(res2)" 211 | ] 212 | } 213 | ], 214 | "metadata": { 215 | "kernelspec": { 216 | "display_name": "Python 3", 217 | "language": "python", 218 | "name": "python3" 219 | }, 220 | "language_info": { 221 | "codemirror_mode": { 222 | "name": "ipython", 223 | "version": 3 224 | }, 225 | "file_extension": ".py", 226 | "mimetype": "text/x-python", 227 | "name": "python", 228 | "nbconvert_exporter": "python", 229 | "pygments_lexer": "ipython3", 230 | "version": "3.6.7" 231 | } 232 | }, 233 | "nbformat": 4, 234 | "nbformat_minor": 2 235 | } 236 | -------------------------------------------------------------------------------- /17.groupbyKey算子简介/每类数据取前十条.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 本例只是作为算法示例,生产环境中不建议使用,因为这种排序的代价极大,特别是在大数据环境下。" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pyspark,random,datetime" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "d = datetime.datetime.strptime('2017-01-01 00:00:00',\n", 26 | " '%Y-%m-%d %H:%M:%S')" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 3, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "ls = [(random.choice([\"info\",\"warning\",\"error\"]),\n", 36 | " d + datetime.timedelta(seconds=random.randint(0,1000000)))\n", 37 | " for i in range(10000)]" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 4, 43 | "metadata": { 44 | "scrolled": true 45 | }, 46 | "outputs": [ 47 | { 48 | "data": { 49 | "text/plain": [ 50 | "[('warning', datetime.datetime(2017, 1, 11, 6, 5, 27)),\n", 51 | " ('info', datetime.datetime(2017, 1, 6, 16, 35, 31)),\n", 52 | " ('info', datetime.datetime(2017, 1, 4, 18, 29, 55)),\n", 53 | " ('info', datetime.datetime(2017, 1, 4, 21, 14, 8))]" 54 | ] 55 | }, 56 | "execution_count": 4, 57 | "metadata": {}, 58 | "output_type": "execute_result" 59 | } 60 | ], 61 | "source": [ 62 | "ls[1:5]" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 5, 68 | "metadata": {}, 69 | "outputs": [], 70 | "source": [ 71 | "sc = pyspark.SparkContext()" 72 | ] 73 | }, 74 | 
{ 75 | "cell_type": "code", 76 | "execution_count": 6, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "rdd = sc.parallelize(ls).cache()" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 7, 86 | "metadata": {}, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/plain": [ 91 | "[('error', datetime.datetime(2017, 1, 1, 18, 20, 56)),\n", 92 | " ('warning', datetime.datetime(2017, 1, 11, 6, 5, 27)),\n", 93 | " ('info', datetime.datetime(2017, 1, 6, 16, 35, 31)),\n", 94 | " ('info', datetime.datetime(2017, 1, 4, 18, 29, 55)),\n", 95 | " ('info', datetime.datetime(2017, 1, 4, 21, 14, 8))]" 96 | ] 97 | }, 98 | "execution_count": 7, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "rdd.take(5)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 8, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "rdd2 = rdd.groupByKey()" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "## 分组,数据如下:" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 9, 126 | "metadata": { 127 | "scrolled": true 128 | }, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "info\n", 135 | "[datetime.datetime(2017, 1, 6, 16, 35, 31), datetime.datetime(2017, 1, 4, 18, 29, 55), datetime.datetime(2017, 1, 4, 21, 14, 8), datetime.datetime(2017, 1, 6, 23, 47, 1), datetime.datetime(2017, 1, 4, 19, 57, 31), datetime.datetime(2017, 1, 12, 6, 11, 57), datetime.datetime(2017, 1, 7, 13, 32, 47), datetime.datetime(2017, 1, 6, 12, 47, 49), datetime.datetime(2017, 1, 10, 16, 41, 53), datetime.datetime(2017, 1, 6, 0, 53, 26)]\n", 136 | "error\n", 137 | "[datetime.datetime(2017, 1, 1, 18, 20, 56), datetime.datetime(2017, 1, 5, 8, 47, 43), datetime.datetime(2017, 1, 11, 5, 0, 27), datetime.datetime(2017, 1, 4, 9, 9, 10), datetime.datetime(2017, 1, 4, 18, 31, 10), datetime.datetime(2017, 1, 10, 16, 55, 1), datetime.datetime(2017, 1, 11, 20, 0, 36), datetime.datetime(2017, 1, 1, 17, 56, 23), datetime.datetime(2017, 1, 2, 5, 45, 22), datetime.datetime(2017, 1, 8, 6, 8, 42)]\n", 138 | "warning\n", 139 | "[datetime.datetime(2017, 1, 11, 6, 5, 27), datetime.datetime(2017, 1, 2, 13, 35, 8), datetime.datetime(2017, 1, 8, 12, 37, 24), datetime.datetime(2017, 1, 6, 15, 20, 42), datetime.datetime(2017, 1, 8, 15, 52, 2), datetime.datetime(2017, 1, 12, 8, 58, 12), datetime.datetime(2017, 1, 4, 20, 56, 43), datetime.datetime(2017, 1, 8, 0, 48, 18), datetime.datetime(2017, 1, 11, 9, 3, 18), datetime.datetime(2017, 1, 11, 19, 16, 18)]\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "for t in rdd2.collect():\n", 145 | " print(t[0])\n", 146 | " flag = 0\n", 147 | " print(list(t[1])[:10])" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "## 定义一个map方法,进行排序,并且取出前10条" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### 再次强调,在大数据环境下,这种方式无论是性能还是效果都很不实用" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 10, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "def mymap(x):\n", 171 | " key = x[0]\n", 172 | " value = sorted(list(x[1]))[0:10]\n", 173 | " return (key,value)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 11, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "[('info',\n", 185 | " 
[datetime.datetime(2017, 1, 1, 0, 4, 56),\n", 186 | " datetime.datetime(2017, 1, 1, 0, 11, 43),\n", 187 | " datetime.datetime(2017, 1, 1, 0, 12, 36),\n", 188 | " datetime.datetime(2017, 1, 1, 0, 13, 35),\n", 189 | " datetime.datetime(2017, 1, 1, 0, 22, 40),\n", 190 | " datetime.datetime(2017, 1, 1, 0, 24, 20),\n", 191 | " datetime.datetime(2017, 1, 1, 0, 30, 31),\n", 192 | " datetime.datetime(2017, 1, 1, 0, 40, 23),\n", 193 | " datetime.datetime(2017, 1, 1, 0, 46, 8),\n", 194 | " datetime.datetime(2017, 1, 1, 0, 46, 50)]),\n", 195 | " ('error',\n", 196 | " [datetime.datetime(2017, 1, 1, 0, 1, 31),\n", 197 | " datetime.datetime(2017, 1, 1, 0, 7, 21),\n", 198 | " datetime.datetime(2017, 1, 1, 0, 9, 19),\n", 199 | " datetime.datetime(2017, 1, 1, 0, 11, 56),\n", 200 | " datetime.datetime(2017, 1, 1, 0, 23, 38),\n", 201 | " datetime.datetime(2017, 1, 1, 0, 39, 26),\n", 202 | " datetime.datetime(2017, 1, 1, 0, 44, 17),\n", 203 | " datetime.datetime(2017, 1, 1, 0, 47, 15),\n", 204 | " datetime.datetime(2017, 1, 1, 0, 53, 58),\n", 205 | " datetime.datetime(2017, 1, 1, 0, 54, 9)]),\n", 206 | " ('warning',\n", 207 | " [datetime.datetime(2017, 1, 1, 0, 0, 37),\n", 208 | " datetime.datetime(2017, 1, 1, 0, 13, 7),\n", 209 | " datetime.datetime(2017, 1, 1, 0, 15, 33),\n", 210 | " datetime.datetime(2017, 1, 1, 0, 24, 57),\n", 211 | " datetime.datetime(2017, 1, 1, 0, 27, 28),\n", 212 | " datetime.datetime(2017, 1, 1, 0, 31, 12),\n", 213 | " datetime.datetime(2017, 1, 1, 0, 38, 18),\n", 214 | " datetime.datetime(2017, 1, 1, 0, 43, 56),\n", 215 | " datetime.datetime(2017, 1, 1, 0, 59),\n", 216 | " datetime.datetime(2017, 1, 1, 0, 59, 6)])]" 217 | ] 218 | }, 219 | "execution_count": 11, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "rdd2.map(lambda x:mymap(x)).collect()" 226 | ] 227 | }, 228 | { 229 | "cell_type": "markdown", 230 | "metadata": {}, 231 | "source": [ 232 | "## 以上场景,海量数据情况下,是使用filter + sortBy算子" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": 13, 238 | "metadata": {}, 239 | "outputs": [ 240 | { 241 | "data": { 242 | "text/plain": [ 243 | "[('info', datetime.datetime(2017, 1, 1, 0, 4, 56)),\n", 244 | " ('info', datetime.datetime(2017, 1, 1, 0, 11, 43)),\n", 245 | " ('info', datetime.datetime(2017, 1, 1, 0, 12, 36)),\n", 246 | " ('info', datetime.datetime(2017, 1, 1, 0, 13, 35)),\n", 247 | " ('info', datetime.datetime(2017, 1, 1, 0, 22, 40)),\n", 248 | " ('info', datetime.datetime(2017, 1, 1, 0, 24, 20)),\n", 249 | " ('info', datetime.datetime(2017, 1, 1, 0, 30, 31)),\n", 250 | " ('info', datetime.datetime(2017, 1, 1, 0, 40, 23)),\n", 251 | " ('info', datetime.datetime(2017, 1, 1, 0, 46, 8)),\n", 252 | " ('info', datetime.datetime(2017, 1, 1, 0, 46, 50))]" 253 | ] 254 | }, 255 | "execution_count": 13, 256 | "metadata": {}, 257 | "output_type": "execute_result" 258 | } 259 | ], 260 | "source": [ 261 | "rdd.filter(lambda x :(x[0] ==\"info\"))\\\n", 262 | ".sortBy(lambda x : (x[1],x)).take(10)" 263 | ] 264 | } 265 | ], 266 | "metadata": { 267 | "kernelspec": { 268 | "display_name": "Python 3", 269 | "language": "python", 270 | "name": "python3" 271 | }, 272 | "language_info": { 273 | "codemirror_mode": { 274 | "name": "ipython", 275 | "version": 3 276 | }, 277 | "file_extension": ".py", 278 | "mimetype": "text/x-python", 279 | "name": "python", 280 | "nbconvert_exporter": "python", 281 | "pygments_lexer": "ipython3", 282 | "version": "3.6.6" 283 | } 284 | }, 285 | "nbformat": 4, 286 | "nbformat_minor": 2 287 | } 288 
| -------------------------------------------------------------------------------- /6.数据生成算子/1.数据转换为RDD.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pyspark" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "sc = pyspark.SparkContext()" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## 直接读取本地数据" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 15, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "rdd = sc.textFile(\"./data/eq2013.csv\")" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 16, 40 | "metadata": {}, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/plain": [ 45 | "['date,time,x,y,z,lv,M,v1,v2,v3,v4',\n", 46 | " '2013/5/31,23:41:56,-113.408,37.175,6.6,2.5,ML2.5,SLC,,UTAH,',\n", 47 | " '2013/5/31,23:09:05,-113.411,37.178,6,2.5,ML2.5,SLC,,UTAH,',\n", 48 | " '2013/5/31,22:45:34,-113.413,37.172,4,2.9,ML2.9,SLC,,UTAH,',\n", 49 | " '2013/5/31,22:34:26,-113.414,37.174,3.2,2.8,ML2.8,SLC,,UTAH,']" 50 | ] 51 | }, 52 | "execution_count": 16, 53 | "metadata": {}, 54 | "output_type": "execute_result" 55 | } 56 | ], 57 | "source": [ 58 | "rdd.take(5)" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 17, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "data": { 68 | "text/plain": [ 69 | "8205" 70 | ] 71 | }, 72 | "execution_count": 17, 73 | "metadata": {}, 74 | "output_type": "execute_result" 75 | } 76 | ], 77 | "source": [ 78 | "rdd.count()" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "## 通过HDFS读取数据" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 12, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "rdd2 = sc.textFile(\"hdfs://sparkvm.com:8020/data/eq/eq2013.csv\")" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 13, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "['date,time,x,y,z,lv,M,v1,v2,v3,v4',\n", 106 | " '2013/5/31,23:41:56,-113.408,37.175,6.6,2.5,ML2.5,SLC,,UTAH,',\n", 107 | " '2013/5/31,23:09:05,-113.411,37.178,6,2.5,ML2.5,SLC,,UTAH,',\n", 108 | " '2013/5/31,22:45:34,-113.413,37.172,4,2.9,ML2.9,SLC,,UTAH,',\n", 109 | " '2013/5/31,22:34:26,-113.414,37.174,3.2,2.8,ML2.8,SLC,,UTAH,']" 110 | ] 111 | }, 112 | "execution_count": 13, 113 | "metadata": {}, 114 | "output_type": "execute_result" 115 | } 116 | ], 117 | "source": [ 118 | "rdd2.take(5)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 14, 124 | "metadata": {}, 125 | "outputs": [ 126 | { 127 | "data": { 128 | "text/plain": [ 129 | "8205" 130 | ] 131 | }, 132 | "execution_count": 14, 133 | "metadata": {}, 134 | "output_type": "execute_result" 135 | } 136 | ], 137 | "source": [ 138 | "rdd2.count()" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "## 通过Python自行读取进行序列化" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 18, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "import csv" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 35, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "eq = [row[:-1] for 
row in open(\"./data/eq2013.csv\")]" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 36, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "['date,time,x,y,z,lv,M,v1,v2,v3,v4',\n", 175 | " '2013/5/31,23:41:56,-113.408,37.175,6.6,2.5,ML2.5,SLC,,UTAH,',\n", 176 | " '2013/5/31,23:09:05,-113.411,37.178,6,2.5,ML2.5,SLC,,UTAH,',\n", 177 | " '2013/5/31,22:45:34,-113.413,37.172,4,2.9,ML2.9,SLC,,UTAH,',\n", 178 | " '2013/5/31,22:34:26,-113.414,37.174,3.2,2.8,ML2.8,SLC,,UTAH,']" 179 | ] 180 | }, 181 | "execution_count": 36, 182 | "metadata": {}, 183 | "output_type": "execute_result" 184 | } 185 | ], 186 | "source": [ 187 | "eq[:5]" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 37, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "rdd3 = sc.parallelize(eq)" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 38, 202 | "metadata": {}, 203 | "outputs": [ 204 | { 205 | "data": { 206 | "text/plain": [ 207 | "['date,time,x,y,z,lv,M,v1,v2,v3,v4',\n", 208 | " '2013/5/31,23:41:56,-113.408,37.175,6.6,2.5,ML2.5,SLC,,UTAH,',\n", 209 | " '2013/5/31,23:09:05,-113.411,37.178,6,2.5,ML2.5,SLC,,UTAH,',\n", 210 | " '2013/5/31,22:45:34,-113.413,37.172,4,2.9,ML2.9,SLC,,UTAH,',\n", 211 | " '2013/5/31,22:34:26,-113.414,37.174,3.2,2.8,ML2.8,SLC,,UTAH,']" 212 | ] 213 | }, 214 | "execution_count": 38, 215 | "metadata": {}, 216 | "output_type": "execute_result" 217 | } 218 | ], 219 | "source": [ 220 | "rdd3.take(5)" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 39, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "data": { 230 | "text/plain": [ 231 | "8205" 232 | ] 233 | }, 234 | "execution_count": 39, 235 | "metadata": {}, 236 | "output_type": "execute_result" 237 | } 238 | ], 239 | "source": [ 240 | "rdd3.count()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "## 通过JDBC读取数据库转为RDD" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "metadata": {}, 253 | "source": [ 254 | "### Spark主要通过spark.sql来实现数据库获取" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### 首先把Postgresql的JDBC包,导入到环境变量中" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 41, 267 | "metadata": {}, 268 | "outputs": [], 269 | "source": [ 270 | "import os" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 49, 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "os.environ['SPARK_CLASSPATH']=\\\n", 280 | "\"$SPARK_CLASSPATH;{0}/data/postgresql-42.2.4.jar\".format(os.getcwd())" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "### PostgreSQL的JDBC连接字符串" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 51, 293 | "metadata": {}, 294 | "outputs": [], 295 | "source": [ 296 | "url=\"jdbc:postgresql://sparkvm.com:5432/postgis?user=postgres&password=postgres\"" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 52, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "import pyspark.sql" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 53, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "pyssql = pyspark.sql.SQLContext(sc)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "### 
通过SQLContext来获取数据,需要设置读取模式、URL和读取的表格" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 58, 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "df1 = pyssql.read.format(\"jdbc\").\\\n", 331 | "option(\"url\", url).option(\"dbtable\",\"china\").load()" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "### 显示前面5行" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 60, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "name": "stdout", 348 | "output_type": "stream", 349 | "text": [ 350 | "+---+--------------------+----------+------+------------+--------+--------+--------+--------+--------+---------+----------+----------+--------+--------+--------+--------+--------+--------+----------+\n", 351 | "| id| geom|first_name| code| area|pop_2009|pop_2005|pop_2000|pop_1999|pop_1995| pop_1990|pop_birth_|pop_death_|gdp_2009|gdp_2008|gdp_2007|gdp_2006|gdp_2005|cpi_2009|categories|\n", 352 | "+---+--------------------+----------+------+------------+--------+--------+--------+--------+--------+---------+----------+----------+--------+--------+--------+--------+--------+--------+----------+\n", 353 | "| 1|0106000020E610000...| 北京|110000| 1.634556E10| 1755.0| 1538.0| 1382.0| 1257.0| 1251.0|1081.9407| 8.06| 5.12|12153.03| 11115.0| 9846.81| 8117.78| 6969.52| 22154.0| 1.0|\n", 354 | "| 2|0106000020E610000...| 天津|120000|1.1660963E10| 1228.16| 1043.0| 1001.0| 959.0| 942.0| 878.5402| 8.3| 6.23| 7521.85| 6719.01| 5252.76| 4462.74| 3905.64| 15149.0| 1.0|\n", 355 | "| 3|0106000020E610000...| 山西|140000|1.5727177E11| 3427.36| 3355.0| 3297.0| 3204.0| 3077.0|2875.9014| 10.87| 6.12| 7358.31| 7315.4| 6024.45| 4878.61| 4230.53| 6854.0| 1.0|\n", 356 | "| 7|0106000020E610000...| 上海|310000| 5.2556017E9| 1921.0| 1778.0| 1674.0| 1474.0| 1415.0|1334.1896| 8.64| 7.05|15046.45|14069.86|12494.01|10572.24| 9247.66| 29572.0| 2.0|\n", 357 | "| 4|0106000020E610000...| 辽宁|210000| 1.450149E11| 4319.0| 4221.0| 4238.0| 4171.0| 4092.0|3945.9697| 6.06| 6.15|15212.49|13668.58| 11164.3| 9304.52| 8047.26| 10848.0| 1.0|\n", 358 | "+---+--------------------+----------+------+------------+--------+--------+--------+--------+--------+---------+----------+----------+--------+--------+--------+--------+--------+--------+----------+\n", 359 | "only showing top 5 rows\n", 360 | "\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "df1.show(5)" 366 | ] 367 | } 368 | ], 369 | "metadata": { 370 | "kernelspec": { 371 | "display_name": "Python 3", 372 | "language": "python", 373 | "name": "python3" 374 | }, 375 | "language_info": { 376 | "codemirror_mode": { 377 | "name": "ipython", 378 | "version": 3 379 | }, 380 | "file_extension": ".py", 381 | "mimetype": "text/x-python", 382 | "name": "python", 383 | "nbconvert_exporter": "python", 384 | "pygments_lexer": "ipython3", 385 | "version": "3.6.8" 386 | } 387 | }, 388 | "nbformat": 4, 389 | "nbformat_minor": 2 390 | } 391 | -------------------------------------------------------------------------------- /11.Map算子解析(2)/地震点分区统计.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import pyspark,ogr,datetime" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "### 读取中国每个省" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": {}, 23 | 
"outputs": [], 24 | "source": [ 25 | "driver = ogr.GetDriverByName('ESRI Shapefile')\n", 26 | "dataSource = driver.Open(\"./data/cn.shp\", 0)\n", 27 | "layer = dataSource.GetLayerByIndex(0)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "## 构造一个用来计算的几何结构,注意这里用wkt,是因为Python无法序列化GDAL的对象为一个列表" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "featArr = []\n", 44 | "layer.ResetReading()\n", 45 | "for f in layer:\n", 46 | " featArr.append((f.GetField(\"FIRST_NAME\"),\n", 47 | " f.geometry().ExportToIsoWkt()))" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 4, 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "name": "stdout", 57 | "output_type": "stream", 58 | "text": [ 59 | "北京 POLYGON ((117.201635735 40.077285175,117.188380491 40.0635675290001,117.172605018 40.047241738,117.17614785 40.059573125,117.176306955 40.0598749100001,117.177471185 40.0619996300001,117.182759255 40.0684182210001,117.17511575 40.0703352350001,117.091653325 40.075145455,117.079871615 40.075415475000\n" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "print(featArr[0][0],featArr[0][1][:300])" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": 5, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "sc = pyspark.SparkContext()" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 6, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "def isPoint(line):\n", 83 | " pntline = line.split(\",\")\n", 84 | " try:\n", 85 | " wkt = \"POINT({0} {1})\".format(float(pntline[2]),\n", 86 | " float(pntline[3]))\n", 87 | " geom = ogr.CreateGeometryFromWkt(wkt)\n", 88 | " return geom.IsValid()\n", 89 | " except:\n", 90 | " return False" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 7, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "rdd = sc.textFile(\"./data/eq2013.csv\"\n", 100 | " ).filter(lambda line : isPoint(line))" 101 | ] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": 8, 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "data": { 110 | "text/plain": [ 111 | "['2013/5/31,23:41:56,-113.408,37.175,6.6,2.5,ML2.5,SLC,,UTAH,',\n", 112 | " '2013/5/31,23:09:05,-113.411,37.178,6,2.5,ML2.5,SLC,,UTAH,',\n", 113 | " '2013/5/31,22:45:34,-113.413,37.172,4,2.9,ML2.9,SLC,,UTAH,',\n", 114 | " '2013/5/31,22:34:26,-113.414,37.174,3.2,2.8,ML2.8,SLC,,UTAH,',\n", 115 | " '2013/5/31,22:34:02,-178.08,51.127,26,3.1,ML3.1,AEIC,,ANDREANOF ISLANDS, ALEUTIAN IS.']" 116 | ] 117 | }, 118 | "execution_count": 8, 119 | "metadata": {}, 120 | "output_type": "execute_result" 121 | } 122 | ], 123 | "source": [ 124 | "rdd.take(5)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "### map方法,构造一个结构:\n", 132 | "### 如果这个省包含了点,则为(省名,数量1)\n", 133 | "### 不在中国任何一个省里面,则为(other,数量1)" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 9, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "def myMap(pnt,featArr):\n", 143 | " for feat in featArr:\n", 144 | " key = feat[0]\n", 145 | " geo = ogr.CreateGeometryFromWkt(feat[1])\n", 146 | " pntline = pnt.split(\",\")\n", 147 | " wkt = \"POINT({0} {1})\".format(float(pntline[2]),\n", 148 | " float(pntline[3]))\n", 149 | " pntGeom = ogr.CreateGeometryFromWkt(wkt)\n", 150 | " if geo.Contains(pntGeom):\n", 151 | " return (key,1)\n", 152 | " return (\"other\",1)" 
153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 10, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "maprdd = rdd.map(lambda line: myMap(line,featArr))" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 11, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "[('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('西藏', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1), ('other', 1)]\n" 174 | ] 175 | } 176 | ], 177 | "source": [ 178 | "print(maprdd.take(100))" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "### reduceByKey是分组聚合" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 12, 191 | "metadata": {}, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "0:00:48.190894\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "s = datetime.datetime.now()\n", 203 | "res = maprdd.reduceByKey(lambda x,y : x+y).collect()\n", 204 | "print(datetime.datetime.now() -s)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "### 计算结果" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 13, 217 | "metadata": { 218 | "scrolled": true 219 | }, 220 | "outputs": [ 221 | { 222 | "name": "stdout", 223 | "output_type": "stream", 224 | "text": [ 225 | "[('other', 8042), ('云南', 10), ('四川', 69), ('青海', 17), ('山东', 1), ('贵州', 1), ('西藏', 23), ('新疆', 20), ('内蒙古', 4), ('台湾', 12), ('广东', 1), ('广西', 1), ('甘肃', 2), ('辽宁', 1)]\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "print(res)" 231 | ] 232 | }, 233 | { 234 | "cell_type": "markdown", 235 | "metadata": {}, 236 | "source": [ 237 | "# 先把不是中国范围内的数据过滤掉" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "### 用gdal获取到中国的extent" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 16, 250 | "metadata": {}, 251 | "outputs": [ 252 | { 253 | "data": { 254 | "text/plain": [ 255 | "(73.61689383500004, 135.08727119000002, 18.278927775000056, 53.56026110500005)" 256 | ] 257 | }, 258 | "execution_count": 
16, 259 | "metadata": {}, 260 | "output_type": "execute_result" 261 | } 262 | ], 263 | "source": [ 264 | "cnext = layer.GetExtent()\n", 265 | "cnext" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "### 写一个过滤方法,过滤中国范围以外的数据" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 17, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "def filterExtent(line,cnext):\n", 282 | " pntline = line.split(\",\")\n", 283 | " x = float(pntline[2])\n", 284 | " y = float(pntline[3])\n", 285 | " if x >= cnext[0] and x <= cnext[1] \\\n", 286 | " and y >= cnext[2] and y <= cnext[3]:\n", 287 | " return True\n", 288 | " else:\n", 289 | " return False" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 19, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [ 298 | "maprdd2 = rdd.filter(lambda line:filterExtent(line,cnext))\\\n", 299 | ".map(lambda line: myMap(line,featArr))" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "### 速度显著提升" 307 | ] 308 | }, 309 | { 310 | "cell_type": "code", 311 | "execution_count": 20, 312 | "metadata": {}, 313 | "outputs": [ 314 | { 315 | "name": "stdout", 316 | "output_type": "stream", 317 | "text": [ 318 | "0:00:06.146958\n" 319 | ] 320 | } 321 | ], 322 | "source": [ 323 | "s = datetime.datetime.now()\n", 324 | "res = maprdd.reduceByKey(lambda x,y : x+y).collect()\n", 325 | "print(datetime.datetime.now() -s)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 21, 331 | "metadata": {}, 332 | "outputs": [ 333 | { 334 | "name": "stdout", 335 | "output_type": "stream", 336 | "text": [ 337 | "[('other', 232), ('云南', 10), ('四川', 69), ('青海', 17), ('山东', 1), ('贵州', 1), ('西藏', 23), ('新疆', 20), ('内蒙古', 4), ('台湾', 12), ('广东', 1), ('广西', 1), ('甘肃', 2), ('辽宁', 1)]\n" 338 | ] 339 | } 340 | ], 341 | "source": [ 342 | "print(res)" 343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": null, 348 | "metadata": {}, 349 | "outputs": [], 350 | "source": [] 351 | } 352 | ], 353 | "metadata": { 354 | "kernelspec": { 355 | "display_name": "Python 3", 356 | "language": "python", 357 | "name": "python3" 358 | }, 359 | "language_info": { 360 | "codemirror_mode": { 361 | "name": "ipython", 362 | "version": 3 363 | }, 364 | "file_extension": ".py", 365 | "mimetype": "text/x-python", 366 | "name": "python", 367 | "nbconvert_exporter": "python", 368 | "pygments_lexer": "ipython3", 369 | "version": "3.6.7" 370 | } 371 | }, 372 | "nbformat": 4, 373 | "nbformat_minor": 2 374 | } 375 | --------------------------------------------------------------------------------\n", 107 | " \n", 108 | "
\n", 144 | "\n", 109 | " \n", 112 | " \n", 113 | " \n", 114 | "\n", 110 | " 0 \n", 111 | "\n", 115 | " \n", 118 | "0 \n", 116 | "0 \n", 117 | "\n", 119 | " \n", 122 | "1 \n", 120 | "POLYGON ((11135971.0803 2710491.2501,11135968.... \n", 121 | "\n", 123 | " \n", 126 | "2 \n", 124 | "379 \n", 125 | "\n", 127 | " \n", 130 | "3 \n", 128 | "旱地 \n", 129 | "\n", 131 | " \n", 134 | "4 \n", 132 | "1489.7 \n", 133 | "\n", 135 | " \n", 138 | "5 \n", 136 | "1375.76858615 \n", 137 | "\n", 139 | " \n", 142 | " \n", 143 | "6 \n", 140 | "47227.8152174 \n", 141 | "