├── .gitattributes
├── 01-大数据背景知识
    ├── README.md
    └── imgs
    │   ├── datanode.png
    │   ├── datawarehouse.png
    │   ├── dis.png
    │   ├── download.png
    │   ├── edits.png
    │   ├── hbase.png
    │   ├── mergeedits.png
    │   ├── oev.png
    │   ├── oiv.png
    │   ├── pic.png
    │   ├── product.png
    │   ├── result.png
    │   ├── upload.png
    │   ├── weather.png
    │   ├── wxgzh.jpg
    │   └── yarn.png
├── 02-大数据应用案例分析
    ├── README.md
    └── imgs
    │   ├── ali.png
    │   ├── arc.png
    │   ├── full.png
    │   ├── log.png
    │   ├── oparc.png
    │   ├── optimize.png
    │   ├── point.png
    │   ├── table.png
    │   └── tra.png
├── 03-Hadoop
    ├── README.md
    └── imgs
    │   ├── HDFSConsole.png
    │   ├── YARNConsole.png
    │   ├── ecosystem.jpg
    │   ├── hadoop-logo.jpg
    │   ├── hadoop-logo2.jpg
    │   ├── hadoop.png
    │   ├── hdfs-logo.jpg
    │   ├── map.png
    │   ├── ssh.png
    │   └── warn.png
├── 04-HDFS基础
    ├── README.md
    └── imgs
    │   ├── console.png
    │   ├── download.png
    │   ├── group.png
    │   ├── safe.png
    │   └── upload.png
├── 05-HDFS性能优化
    └── README.md
├── 06-MapReduce基础
    ├── README.md
    └── imgs
    │   ├── mr-console.png
    │   ├── mr-dataflow.png
    │   ├── mr-shuffle.png
    │   └── mr-yarn.png
├── 07-MapReduce十大经典案例
    └── README.md
├── 08-MapReduce性能优化
    └── README.md
├── 09-HBase基础
    ├── README.md
    └── imgs
    │   ├── arc.png
    │   ├── hbase-logo.png
    │   ├── hbasearc.png
    │   ├── hbasemapreduce.png
    │   └── hbasetable.png
├── 10-HBase性能优化
    └── README.md
├── 11-Hive基础
    ├── README.md
    └── imgs
    │   ├── hive_logo_medium.jpg
    │   └── hivearc.png
├── 12-Hive性能优化
    └── README.md
├── 13-Sqoop基础
    ├── README.md
    └── imgs
    │   └── sqoop-logo.png
├── 14-Sqoop性能优化
    └── README.md
├── 15-Flume基础
    ├── README.md
    └── imgs
    │   ├── flume-arc.png
    │   ├── flume-logo.png
    │   └── pic.png
├── 16-Flume性能优化
    └── README.md
├── 17-ZooKeeper
    ├── README.md
    └── imgs
    │   ├── zookeeper-arc.png
    │   └── zookeeper-logo.png
├── 18-Zookeeper性能优化
    └── README.md
├── 19-Redis基础
    └── README.md
├── 20-Redis性能优化
    └── README.md
├── 21-Storm基础
    ├── README.md
    └── imgs
    │   ├── arc1.png
    │   ├── arc2.png
    │   ├── event-logger.png
    │   ├── nimbus-process.png
    │   ├── process.png
    │   ├── storm-arc.png
    │   ├── storm-log.png
    │   ├── storm-model.png
    │   ├── storm-submit.png
    │   ├── storm-tcp1.png
    │   ├── storm-tcp2.png
    │   ├── storm-ui.png
    │   ├── storm-wordcount.png
    │   ├── storm-zk.png
    │   ├── water.png
    │   └── worker-process.png
├── 22-Storm与其他组件集成
    └── README.md
├── 23-Storm性能优化
    └── README.md
├── 24-JStorm基础
    └── README.md
├── 25-JStorm与其他组件集成
    └── README.md
├── 26-JStorm性能优化
    └── README.md
├── 27-Azkaban
    └── README.md
├── 28-Scala
    ├── README.md
    └── imgs
    │   ├── add.png
    │   ├── changerule.png
    │   ├── com.png
    │   ├── data-type.png
    │   ├── def-demo.png
    │   ├── def.png
    │   ├── exception.png
    │   ├── fun-param-process.png
    │   ├── fun-param.png
    │   ├── helpc.png
    │   ├── highfun.png
    │   ├── imp.png
    │   ├── insert-value.png
    │   ├── lazy-com.png
    │   ├── lazy.png
    │   ├── math.png
    │   ├── matrix.png
    │   ├── monkey.png
    │   ├── nothing.png
    │   ├── param-type.png
    │   ├── reverc.png
    │   ├── scala-logo.jpg
    │   ├── trycatch.png
    │   ├── unit.png
    │   ├── upbound.png
    │   ├── viewbb.png
    │   ├── viewbund.png
    │   ├── viewerr.png
    │   ├── viewsuccess.png
    │   └── yincan.png
├── 29-SparkCore
    ├── README.md
    └── imgs
    │   ├── active-standby.png
    │   ├── arc1.png
    │   ├── arc2.png
    │   ├── depen.png
    │   ├── diff.png
    │   ├── easy.png
    │   ├── everywhere.png
    │   ├── jav.png
    │   ├── persit.png
    │   ├── proc.png
    │   ├── rdd.png
    │   ├── rddpar.png
    │   ├── sca.png
    │   ├── shuffle.png
    │   ├── spark.png
    │   ├── spark83.png
    │   ├── speed.png
    │   ├── standalone.png
    │   ├── startall.png
    │   ├── storeage.png
    │   ├── sz.png
    │   └── wc.png
├── 30-SparkSQL
    └── README.md
├── 31-SparkStreaming
    └── README.md
├── 32-Spark与其他组件集成
    └── README.md
├── 33-Spark性能优化
    └── README.md
├── 34-Flink
    └── README.md
├── 35-Flink与其他组件集成
    └── README.md
├── 36-Flink性能优化
    └── README.md
├── 37-CDH
    └── README.md
├── LICENSE
└── README.md


/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 | 


--------------------------------------------------------------------------------
/01-大数据背景知识/README.md:
--------------------------------------------------------------------------------
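  1 | ## I. Origins and Background of Hadoop
  2 | 
  3 | ### (1) What Is Big Data
  4 | 
  5 | Big data refers to data sets that cannot be captured, managed, and processed by conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight discovery, and process optimization.
  6 | 
  7 | The five characteristics of big data (the "5 Vs", proposed by IBM):
  8 | 
  9 | * Volume (large scale)
 10 | * Velocity (high speed)
 11 | * Variety (diversity)
 12 | * Value
 13 | * Veracity (trustworthiness)
 14 | 
 15 | Typical big data use cases:
 16 | 
 17 | * Product recommendations on e-commerce sites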
 18 | 
 19 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/product.png)
 20 | 
 21 | * Weather forecasting based on big data
 22 | 
 23 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/weather.png)
 24 | 
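 25 | ### (2) OLTP and OLAP
 26 | 
 27 | * OLTP: On-Line Transaction Processing.
 28 | 
 29 | 	Also called transaction-oriented processing. Its defining trait is that user data received at the front end is sent immediately to the computing center for processing, and the result is returned within a very short time, making it one way to respond quickly to user operations. OLTP is the main workload of traditional relational databases and covers basic, day-to-day transactions such as bank transfers.
 30 | 
 31 | * OLAP: On-Line Analytical Processing.
 32 | 
 33 | 	OLAP is the main workload of data warehouse systems. It supports complex analytical operations, focuses on decision support, and returns query results that are intuitive and easy to understand. Typical example: product recommendation.
 34 | 
 35 | 
 36 | * Differences between OLTP and OLAP:
 37 | 
 38 | Dimension | OLTP | OLAP
 39 | ---|---|---
 40 | Users | Operators, lower-level managers | Decision makers, senior managers
 41 | Function | Day-to-day operational processing | Analysis and decision making
 42 | DB design | Application-oriented | Subject-oriented
 43 | Data | Current, up to date, detailed, two-dimensional, isolated | Historical, aggregated, multidimensional, integrated, unified
 44 | Access | Reads and writes tens of records | Reads millions of records
 45 | Unit of work | Simple transactions | Complex queries
 46 | DB size | 100 MB to GB | 100 GB to TB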
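
The contrast in the table can be made concrete with a small in-memory sketch. It is only an illustration: the `Order` class, its fields, and the sample rows are hypothetical, and a plain map stands in for what would really be a relational database (OLTP) or a data warehouse (OLAP).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy contrast between an OLTP-style and an OLAP-style operation.
final class OltpVsOlap {

    // Hypothetical row type; amounts are kept in cents to avoid rounding issues.
    static final class Order {
        final int id;
        final String category;
        long amountCents;

        Order(int id, String category, long amountCents) {
            this.id = id;
            this.category = category;
            this.amountCents = amountCents;
        }
    }

    // OLTP style: a short transaction that reads and writes one row by key.
    static void updateAmount(Map<Integer, Order> table, int id, long newAmountCents) {
        Order order = table.get(id);
        if (order == null) {
            throw new IllegalArgumentException("no such order: " + id);
        }
        order.amountCents = newAmountCents;
    }

    // OLAP style: a read-only scan over every row, aggregated by one dimension.
    static Map<String, Long> totalByCategory(Map<Integer, Order> table) {
        Map<String, Long> totals = new TreeMap<>();
        for (Order order : table.values()) {
            totals.merge(order.category, order.amountCents, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<Integer, Order> table = new LinkedHashMap<>();
        table.put(1, new Order(1, "books", 1_500));
        table.put(2, new Order(2, "books", 2_500));
        table.put(3, new Order(3, "toys", 4_000));

        updateAmount(table, 2, 3_000);              // touches a single record
        System.out.println(totalByCategory(table)); // prints {books=4500, toys=4000}
    }
}
```

The update touches one record, while the aggregation has to read all of them, which mirrors the "Access" and "Unit of work" rows of the table.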
 47 | 
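 48 | ### (3) Data Warehouse
 49 | 
 50 | A data warehouse (Data Warehouse, abbreviated DW or DWH) is a strategic collection that supplies all types of data to support decision making at every level of an enterprise. It is a single data store created for analytical reporting and decision support.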
 51 | 
 52 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/datawarehouse.png)
 53 | 
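 54 | ### (4) Origins of Hadoop: from Lucene to Nutch, and from Nutch to Hadoop
 55 | 
 56 | * In 2003-2004, Google published some details of its GFS and MapReduce ideas. Building on them, Doug Cutting and others spent two years of their spare time implementing DFS and MapReduce mechanisms, which sharply improved Nutch's performance.
 57 | 
 58 | * Yahoo hired Doug Cutting and took on his project.
 59 | 
 60 | * In the fall of 2005, Hadoop was formally introduced into the Apache Foundation as part of Nutch, a subproject of Lucene. In March 2006, MapReduce and the Nutch Distributed File System (NDFS) were both folded into a project named Hadoop.
 61 | 
 62 | * The name comes from a toy elephant that belonged to Doug Cutting's son.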
 63 | 
 64 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/pic.png)
 65 | 
 66 | ## II. Architecture of Apache Hadoop
 67 | 
 68 | ### (1) Distributed Storage: HDFS
 69 | 
 70 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/dis.png)
 71 | 
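 72 | * NameNode
 73 | 
 74 | 1. Maintains the HDFS file system; it is the master node of HDFS.
 75 | 
 76 | 2. Accepts client requests: uploading files, downloading files, creating directories, and so on.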
 77 | 
 78 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/upload.png)
 79 | 
 80 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/download.png)
 81 | 
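 82 | 3. Records a log of client operations (the edits file), which captures the latest state of HDFS.
 83 | 
 84 | 	1. The edits file stores every operation performed on the HDFS file system since the last checkpoint, such as adding a file, renaming a file, or deleting a directory.
 85 | 
 86 | 	2. Storage directory: $HADOOP_HOME/tmp/dfs/name/current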
 87 | 
 88 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/edits.png)
 89 | 
 90 | 	3. The hdfs oev -i command converts the (binary) log into an XML file.
 91 | 
 92 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/oev.png)
 93 | 
 94 | 	The output looks like this:
 95 | 	
 96 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/result.png)
 97 | 
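 98 | 	4. Maintains file metadata, and saves metadata that is rarely used in memory (chosen with an LRU algorithm) to disk (the fsimage file).
 99 | 
100 | 		1. fsimage is the metadata checkpoint of the HDFS file system stored on disk. It records the serialized information of all directories and files in HDFS as of the last checkpoint.
101 | 
102 | 		2. Storage directory: $HADOOP_HOME/tmp/dfs/name/current
103 | 
104 | 		3. The hdfs oiv -i command converts the (binary) image into text (plain text or XML).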
105 | 		
106 | 		![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/oiv.png)
107 | 
108 | * DataNode
109 | 
110 | 1. Stores data in units of blocks
111 | 
112 | 	1. Block size in Hadoop 1.0: 64 MB
113 | 	
114 | 	2. Block size in Hadoop 2.0: 128 MB
115 | 
116 | 2. Fully distributed mode requires at least two DataNodes
117 | 
118 | 3. Data directory: specified by the hadoop.tmp.dir parameter
119 | 
120 | For example:
121 | 
122 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/datanode.png)
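
The effect of the block size can be worked through with a little arithmetic. The sketch below is a plain-Java illustration, not Hadoop code: the 300 MB file and the replication factor of 2 are assumed values chosen for the example, and it relies on the fact that the last block of a file only occupies as much space as the data it holds.

```java
// Sketch: how many blocks a file occupies and how much raw storage it uses.
final class BlockMath {

    static final long MB = 1024L * 1024L;

    // Number of blocks needed for fileSize bytes; the last block may be partial.
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and blockSize positive");
        }
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    // Raw bytes stored across all DataNodes: one copy of the data per replica.
    static long rawStorage(long fileSize, int replication) {
        return fileSize * replication;
    }

    public static void main(String[] args) {
        long fileSize = 300 * MB; // assumed example file

        System.out.println(blockCount(fileSize, 64 * MB));  // Hadoop 1.0 block size: prints 5
        System.out.println(blockCount(fileSize, 128 * MB)); // Hadoop 2.0 block size: prints 3
        System.out.println(rawStorage(fileSize, 2) / MB + " MB"); // prints 600 MB
    }
}
```

With 128 MB blocks the file is split into two full blocks and one 44 MB block, so a larger block size means fewer blocks for the NameNode to track.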
123 | 
124 | * Secondary NameNode
125 | 
126 | 1. Its main job is to merge the logs
127 | 
128 | 2. The log merge process:
129 | 
130 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/mergeedits.png)
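
The idea behind the merge can be modeled in a few lines: a new fsimage is the old fsimage with the edits replayed on top of it. The sketch below is a toy model under that assumption only. The `Edit` and `OpType` types are hypothetical, the namespace is reduced to a set of paths, and none of this reflects the real on-disk formats or the network steps of the actual process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of a checkpoint: new fsimage = old fsimage + replayed edits.
final class CheckpointSketch {

    enum OpType { ADD, DELETE, RENAME }

    // Hypothetical edit record; target is only used by RENAME.
    static final class Edit {
        final OpType type;
        final String path;
        final String target;

        Edit(OpType type, String path, String target) {
            this.type = type;
            this.path = path;
            this.target = target;
        }
    }

    // Replays the edits on a copy of the image and returns the merged image.
    static TreeSet<String> merge(TreeSet<String> fsimage, List<Edit> edits) {
        TreeSet<String> merged = new TreeSet<>(fsimage);
        for (Edit edit : edits) {
            switch (edit.type) {
                case ADD:
                    merged.add(edit.path);
                    break;
                case DELETE:
                    merged.remove(edit.path);
                    break;
                case RENAME:
                    if (merged.remove(edit.path)) {
                        merged.add(edit.target);
                    }
                    break;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> fsimage = new TreeSet<>(List.of("/data", "/data/a.txt"));

        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(OpType.ADD, "/data/b.txt", null));
        edits.add(new Edit(OpType.RENAME, "/data/a.txt", "/data/c.txt"));
        edits.add(new Edit(OpType.DELETE, "/data/b.txt", null));

        TreeSet<String> newImage = merge(fsimage, edits);
        edits.clear(); // once merged, the old edits are no longer needed

        System.out.println(newImage); // prints [/data, /data/c.txt]
    }
}
```

Because the merged image already contains every logged change, the edits log can start again from empty, which keeps it small and shortens NameNode restarts.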
131 | 
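132 | * Known problems with HDFS
133 | 
134 | 1. The NameNode is a single point of failure, which makes HDFS hard to use in online scenarios
135 | 
136 | 	Solution: Hadoop 1.0 has none. Hadoop 2.0 uses ZooKeeper to provide NameNode HA (high availability).
137 | 
138 | 2. The NameNode is under heavy load and limited by memory, which restricts scalability
139 | 
140 | 	Solution: Hadoop 1.0 has none. Hadoop 2.0 uses NameNode Federation to scale horizontally.
141 | 
142 | ### (2) YARN: Distributed Computing (MapReduce)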
143 | 
144 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/yarn.png)
145 | 
146 | * ResourceManager
147 | 
148 | 	1. Receives client requests to run jobs
149 | 	
150 | 	2. Allocates resources
151 | 	
152 | 	3. Assigns tasks
153 | 
154 | * NodeManager (runs MapReduce tasks)
155 | 
156 | 	1. Fetches data from the DataNode and executes tasks
157 | 
158 | ### (3) HBase Architecture
159 | 
160 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/hbase.png)
161 | 
162 | 
163 | 
164 | 
165 | 
166 | 
167 | 
168 | 
169 | 
170 | 


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/datanode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/datanode.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/datawarehouse.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/datawarehouse.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/dis.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/dis.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/download.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/edits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/edits.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/hbase.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/hbase.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/mergeedits.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/mergeedits.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/oev.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/oev.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/oiv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/oiv.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/pic.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/product.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/product.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/result.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/upload.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/upload.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/weather.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/weather.png


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/wxgzh.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/wxgzh.jpg


--------------------------------------------------------------------------------
/01-大数据背景知识/imgs/yarn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/01-大数据背景知识/imgs/yarn.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/README.md:
--------------------------------------------------------------------------------
 1 | ## Hadoop Application Case Studies
 2 | 
 3 | ### (1) Architecture of Internet Applications
 4 | 
 5 | 1. Traditional architecture
 6 | 
 7 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/tra.png)
 8 | 
 9 | 2. Improved architecture
10 | 
11 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/optimize.png)
12 | 
13 | 3. Complete architecture
14 | 
15 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/full.png)
16 | 
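17 | ### (2) Log Analysis
18 | 
19 | 1. Requirements:
20 | 
21 | 	Analyze the Apache server logs of a technical forum and compute the forum's key metrics to support the operators' decisions.
22 | 
23 | 2. The forum log data has two parts:
24 | 
25 | 	* Historical data, about 56 GB, covering up to 2012-05-29
26 | 
27 | 	* Since 2013-05-30, one data file of about 150 MB is generated per day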
28 | 
29 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/log.png)
30 | 
31 | 3. Key metrics:
32 | 
33 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/point.png)
34 | 
35 | 4. System architecture:
36 | 
37 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/arc.png)
38 | 
39 | 5. Improved system architecture:
40 | 
41 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/oparc.png)
42 | 
43 | 6. HBase table structure:
44 | 
45 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/table.png)
46 | 
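47 | 7. Log analysis workflow:
48 | 
49 | 	* Periodically import the log data into HDFS
50 | 
51 | 	* Periodically import the detailed logs into HBase for storage
52 | 
53 | 	* Periodically run multidimensional analysis on the data with Hive
54 | 
55 | 	* Periodically export the Hive analysis results to MySQL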
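
To give a feel for what "computing key metrics" from Apache server logs involves, here is a minimal single-machine sketch in plain Java. It is an illustration only: the two metrics (page views per day and distinct client IPs) and the sample log lines are assumptions, the lines are assumed to follow the Apache Common Log Format, and the real case runs this kind of computation on Hadoop rather than in one JVM.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Computes two illustrative metrics from Apache access log lines.
final class AccessLogStats {

    // Common Log Format: ip ident user [day:time zone] "method path protocol" status bytes
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[(\\d{2}/\\w{3}/\\d{4}):[^\\]]+\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\S+)");

    // Page views per day: every well-formed request counts once.
    static Map<String, Integer> pageViewsPerDay(List<String> lines) {
        Map<String, Integer> views = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                views.merge(m.group(2), 1, Integer::sum);
            }
        }
        return views;
    }

    // Number of distinct client IP addresses across all well-formed lines.
    static int uniqueIps(List<String> lines) {
        Set<String> ips = new HashSet<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                ips.add(m.group(1));
            }
        }
        return ips.size();
    }

    public static void main(String[] args) {
        // Made-up sample lines; the last one is deliberately malformed.
        List<String> lines = List.of(
                "10.0.0.1 - - [30/May/2013:17:38:20 +0800] \"GET /forum.php HTTP/1.1\" 200 1127",
                "10.0.0.2 - - [30/May/2013:17:38:21 +0800] \"GET /index.php HTTP/1.1\" 200 512",
                "10.0.0.1 - - [31/May/2013:09:00:00 +0800] \"POST /login.php HTTP/1.1\" 302 -",
                "this line is malformed and is skipped");

        System.out.println(pageViewsPerDay(lines)); // prints {30/May/2013=2, 31/May/2013=1}
        System.out.println(uniqueIps(lines));       // prints 2
    }
}
```

Malformed lines are skipped rather than treated as errors, since raw server logs usually contain some noise.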
56 | 
57 | ### (3) Hadoop at Taobao
58 | 
59 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/02-大数据应用案例分析/imgs/ali.png)
60 | 


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/ali.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/ali.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/arc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/arc.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/full.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/full.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/log.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/log.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/oparc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/oparc.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/optimize.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/optimize.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/point.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/point.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/table.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/table.png


--------------------------------------------------------------------------------
/02-大数据应用案例分析/imgs/tra.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/02-大数据应用案例分析/imgs/tra.png


--------------------------------------------------------------------------------
/03-Hadoop/README.md:
--------------------------------------------------------------------------------
  1 | ## Installing and Configuring Hadoop 2.x
  2 | 
  3 | ### (1) Prerequisites for Installing and Deploying Hadoop
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/hadoop-logo.jpg)
  6 | 
  7 | * Install Linux
  8 | 
  9 | * Install the JDK
 10 | 
 11 | ### (2) Hadoop Directory Layout
 12 | 
 13 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/hadoop.png)
 14 | 
 15 | ### (3) Three Deployment Modes of Hadoop
 16 | 
 17 | * Local mode
 18 | 
 19 | Config file | Parameter | Reference value
 20 | ---|---|---
 21 | hadoop-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144
 22 | 
 23 | * Pseudo-distributed mode
 24 | 
 25 | Config file | Parameter | Reference value
 26 | ---|---|---
 27 | hadoop-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144
 28 | hdfs-site.xml | dfs.replication | 1
 29 | ... | dfs.permissions | false
 30 | core-site.xml | fs.defaultFS | hdfs://<hostname>:9000
 31 | ... | hadoop.tmp.dir | /root/training/hadoop-2.7.3/tmp
 32 | mapred-site.xml | mapreduce.framework.name | yarn
 33 | yarn-site.xml | yarn.resourcemanager.hostname | <hostname>
 34 | ... | yarn.nodemanager.aux-services | mapreduce_shuffle
 35 | 
 36 | * Fully distributed mode
 37 | 
 38 | Config file | Parameter | Reference value
 39 | ---|---|---
 40 | hadoop-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144
 41 | hdfs-site.xml | dfs.replication | 2
 42 | ... | dfs.permissions | false
 43 | core-site.xml | fs.defaultFS | hdfs://<hostname>:9000
 44 | ... | hadoop.tmp.dir | /root/training/hadoop-2.7.3/tmp
 45 | mapred-site.xml | mapreduce.framework.name | yarn
 46 | yarn-site.xml | yarn.resourcemanager.hostname | <hostname>
 47 | ... | yarn.nodemanager.aux-services | mapreduce_shuffle
 48 | slaves | IP address or hostname of each DataNode | qujianlei001
 49 | 
 50 | If the following warning appears:
 51 | 
 52 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/warn.png)
 53 | 
 54 | add the environment variables below to these two files:
 55 | 
 56 | * In the hadoop-env.sh script:
 57 | 
 58 | 	export JAVA_HOME=/root/training/jdk1.8.0_144
 59 | 	export HADOOP_HOME=/root/training/hadoop-2.7.3
 60 | 	export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
 61 | 	export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
 62 | 
 63 | * In the yarn-env.sh script:
 64 | 
 65 | 	export JAVA_HOME=/root/training/jdk1.8.0_144
 66 | 	export HADOOP_HOME=/root/training/hadoop-2.7.3
 67 | 	export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
 68 | 	export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
 69 | 
 70 | ### (4) Verifying the Hadoop Environment
 71 | 
 72 | * HDFS Console: http://192.168.157.11:50070
 73 | 
 74 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/HDFSConsole.png)
 75 | 
 76 | * YARN Console: http://192.168.157.11:8088
 77 | 
 78 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/YARNConsole.png)
 79 | 
 80 | ### (5) Configuring Passwordless SSH Login
 81 | 
 82 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/ssh.png)
 83 | 
 84 | For detailed steps, see: https://blog.csdn.net/a909301740/article/details/84147035
 85 | 
 86 | 
 87 | 
 88 | 
 89 | 
 90 | 
 91 | 
 92 | 
 93 | 
 94 | 
 95 | 
 96 | 
 97 | 
 98 | 
 99 | 
100 | 


--------------------------------------------------------------------------------
/03-Hadoop/imgs/HDFSConsole.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/HDFSConsole.png


--------------------------------------------------------------------------------
/03-Hadoop/imgs/YARNConsole.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/YARNConsole.png


--------------------------------------------------------------------------------
/03-Hadoop/imgs/ecosystem.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/ecosystem.jpg


--------------------------------------------------------------------------------
/03-Hadoop/imgs/hadoop-logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/hadoop-logo.jpg


--------------------------------------------------------------------------------
/03-Hadoop/imgs/hadoop-logo2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/hadoop-logo2.jpg


--------------------------------------------------------------------------------
/03-Hadoop/imgs/hadoop.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/hadoop.png


--------------------------------------------------------------------------------
/03-Hadoop/imgs/hdfs-logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/hdfs-logo.jpg


--------------------------------------------------------------------------------
/03-Hadoop/imgs/map.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/map.png


--------------------------------------------------------------------------------
/03-Hadoop/imgs/ssh.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/ssh.png


--------------------------------------------------------------------------------
/03-Hadoop/imgs/warn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/03-Hadoop/imgs/warn.png


--------------------------------------------------------------------------------
/04-HDFS基础/README.md:
--------------------------------------------------------------------------------
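  1 | ## HDFS
  2 | 
  3 | ### (1) HDFS Command-Line Operations
  4 | 
  5 | #### 1. HDFS file commands (for help, run: hdfs dfs)
  6 | 
  7 | Command | Description | Example
  8 | ---|---|---
  9 | -mkdir | Create a directory on HDFS | Create the directory /data on HDFS: hdfs dfs -mkdir /data <br/> Create /data/input along with any missing parent directories: hdfs dfs -mkdir -p /data/input
 10 | -ls | List the directories and files under the given HDFS path | List the files and directories under the HDFS root: hdfs dfs -ls / <br/> List the files and directories under /data on HDFS: hdfs dfs -ls /data
 11 | -ls -R | Recursively list all directories and files under the given HDFS path | List the files and directories under the HDFS root and all its subdirectories: hdfs dfs -ls -R /
 12 | -put | Upload a file, or save keyboard input, to HDFS | Upload the local Linux file data.txt to HDFS: hdfs dfs -put data.txt /data/input <br/> Save keyboard input to a file on HDFS: hdfs dfs -put - /aaa.txt (press Ctrl+D to end input)
 13 | -moveFromLocal | Like put, but the source file is removed from the local file system afterwards | hdfs dfs -moveFromLocal data.txt /data/input
 14 | -copyFromLocal | Like put | hdfs dfs -copyFromLocal data.txt /data/input
 15 | -get | Copy a file from HDFS to the local file system | hdfs dfs -get /data/input/data.txt /root/
 16 | -rm | Delete one or more files or directories | Delete several files: hdfs dfs -rm /data1.txt /data2.txt <br/> Delete several directories: hdfs dfs -rm -r /data /input
 17 | -getmerge | Merge all files under the given HDFS directory, in sorted order, into a local file. The local file is created if it does not exist and overwritten if it does | Merge all files under /data/input on HDFS into the local file a.txt: hdfs dfs -getmerge /data/input /root/a.txt
 18 | -cp | Copy files within HDFS | 
 19 | -mv | Move files within HDFS | 
 20 | -count | Count the directories, files, and total file size under the given HDFS path <br/> Output columns: directory count, file count, total file size, input path | hdfs dfs -count /data
 21 | -du | Show each file under the given HDFS path and its size | hdfs dfs -du /
 22 | -text, -cat | Equivalent to the Linux cat command | hdfs dfs -cat /input/1.txt
 23 | balancer | If the administrator finds that some DataNodes hold too much data while others hold relatively little, this command manually starts the internal balancing process | hdfs balancer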
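
When these commands are driven from a script or a program, their text output usually has to be parsed. As a small illustration, the sketch below splits one line of `hdfs dfs -count` output into the four columns listed in the table. The `CountLine` class and the sample line are hypothetical, and the code assumes the default output with no extra options.

```java
// Parses one line of default `hdfs dfs -count` output:
// directory count, file count, total file size in bytes, path.
final class CountLine {

    final long dirCount;
    final long fileCount;
    final long contentSize;
    final String path;

    CountLine(long dirCount, long fileCount, long contentSize, String path) {
        this.dirCount = dirCount;
        this.fileCount = fileCount;
        this.contentSize = contentSize;
        this.path = path;
    }

    static CountLine parse(String line) {
        // The columns are padded with spaces; a limit of 4 keeps a path that contains spaces intact.
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length != 4) {
            throw new IllegalArgumentException("unexpected -count output: " + line);
        }
        return new CountLine(
                Long.parseLong(parts[0]),
                Long.parseLong(parts[1]),
                Long.parseLong(parts[2]),
                parts[3]);
    }

    public static void main(String[] args) {
        // Made-up sample line in the shape the command prints.
        CountLine c = parse("           2            3               1024 /data");
        System.out.println(c.dirCount + " dirs, " + c.fileCount + " files, "
                + c.contentSize + " bytes under " + c.path);
        // prints 2 dirs, 3 files, 1024 bytes under /data
    }
}
```

For anything beyond quick scripting, the Java API described in the next section avoids text parsing altogether.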
 24 | 
 25 | #### 2. HDFS admin commands (for help, run: hdfs dfsadmin)
 26 | 
 27 | Command | Description | Example
 28 | ---|---|---
 29 | -report | Show the total HDFS capacity, remaining capacity, and DataNode information | hdfs dfsadmin -report
 30 | -safemode | HDFS safe mode command: enter, leave, get, wait | hdfs dfsadmin -safemode enter <br/> hdfs dfsadmin -safemode leave <br/> hdfs dfsadmin -safemode get <br/> hdfs dfsadmin -safemode wait
 31 | 
 32 | ### (2) The HDFS Java API
 33 | 
 34 | The required pom dependencies are:
 35 | ```xml
 36 | <dependencies>
 37 |     <dependency>
 38 |         <groupId>junit</groupId>
 39 |         <artifactId>junit</artifactId>
 40 |         <version>4.12</version>
 41 |         <scope>test</scope>
 42 |     </dependency>
 43 |     <dependency>
 44 |         <groupId>org.apache.hadoop</groupId>
 45 |         <artifactId>hadoop-client</artifactId>
 46 |         <version>2.7.3</version>
 47 |     </dependency>
 48 |     <dependency>
 49 |         <groupId>org.apache.hadoop</groupId>
 50 |         <artifactId>hadoop-common</artifactId>
 51 |         <version>2.7.3</version>
 52 |     </dependency>
 53 |     <dependency>
 54 |         <groupId>org.apache.hadoop</groupId>
 55 |         <artifactId>hadoop-hdfs</artifactId>
 56 |         <version>2.7.3</version>
 57 |     </dependency>
 58 | </dependencies>
 59 | ```
 60 | 
 61 | With the Java API provided by HDFS, we can do the following:
 62 | 
 63 | 1. Create a directory on HDFS
 64 |     ```java
 65 |     @Test
 66 |     public void testMkDir() throws Exception {
 67 |         Configuration conf = new Configuration();
 68 |         conf.set("fs.defaultFS", "hdfs://192.168.137.25:9000");
 69 |         FileSystem fs = FileSystem.get(conf);
 70 |         // Create the directory
 71 |         boolean flag = fs.mkdirs(new Path("/inputdata"));
 72 |         System.out.println(flag);
 73 |     }
 74 |     ```
 75 | 
 76 | 2. Read data through the FileSystem API (download a file)
 77 |     ```java
 78 |     @Test
 79 |     public void testDownload() throws Exception {
 80 |         // Build an input stream that reads from HDFS
 81 |         FileSystem fs = FileSystem.get(new URI("hdfs://192.168.2.123:9000"), new Configuration());
 82 |         InputStream in = fs.open(new Path("/inputdata/a.war"));
 83 | 
 84 |         // Build an output stream to the local file
 85 |         OutputStream out = new FileOutputStream("d:\\a.war");
 86 |         IOUtils.copy(in, out);
 87 |     }
 88 |     ```
 89 | 
 90 | 3. Write data (upload a file)
 91 |     ```java
 92 |     @Test
 93 |     public void testUpload() throws Exception {
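 94 |         // The local file to upload (input stream)
 95 |         InputStream in = new FileInputStream("d:\\test.war");
 96 | 
 97 |         // Build an output stream that writes to HDFS (the target path is an assumption)
 98 |         FileSystem fs = FileSystem.get(new URI("hdfs://192.168.2.123:9000"), new Configuration());
 99 |         OutputStream out = fs.create(new Path("/inputdata/test.war"));
100 |         IOUtils.copy(in, out); // utility class that handles both upload and download
101 |         out.close(); // closing the stream completes the HDFS write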
102 |     }
103 |     ```
104 | 
105 | 4. Inspect directory and file information
106 |     ```java
107 |     @Test
108 |     public void checkFileInformation() throws Exception {
109 |         Configuration conf = new Configuration();
110 |         conf.set("fs.defaultFS", "hdfs://192.168.137.25:9000");
111 | 
112 |         FileSystem fs = FileSystem.get(conf);
113 |         FileStatus[] status = fs.listStatus(new Path("/hbase"));
114 | 
115 |         for (FileStatus f : status) {
116 |             String dir = f.isDirectory() ? "directory" : "file";
117 |             String name = f.getPath().getName();
118 |             String path = f.getPath().toString();
119 |             System.out.println(dir + "------" + name + ",path:" + path);
120 |             System.out.println(f.getAccessTime());
121 |             System.out.println(f.getBlockSize());
122 |             System.out.println(f.getGroup());
123 |             System.out.println(f.getLen());
124 |             System.out.println(f.getModificationTime());
125 |             System.out.println(f.getOwner());
126 |             System.out.println(f.getPermission());
127 |             System.out.println(f.getReplication());
128 |         }
129 | 
130 |     }
131 |     ```
132 | 
133 | 5. Find where a file's blocks are located in the HDFS cluster
134 |     ```java
135 |     @Test
136 |     public void findFileBlockLocation() throws Exception {
137 |         Configuration conf = new Configuration();
138 |         conf.set("fs.defaultFS", "hdfs://192.168.137.25:9000");
139 | 
140 |         FileSystem fs = FileSystem.get(conf);
141 |         FileStatus fStatus = fs.getFileStatus(new Path("/data/mydata.txt"));
142 | 
143 |         BlockLocation[] blocks = fs.getFileBlockLocations(fStatus, 0, fStatus.getLen());
144 |         for (BlockLocation block : blocks) {
145 |             System.out.println(Arrays.toString(block.getHosts()) + "\t" + Arrays.toString(block.getNames()));
146 |         }
147 |     }
148 |     ```
149 | 
150 | 6. Delete data
151 |     ```java
152 |     @Test
153 |     public void deleteFile() throws Exception {
154 |         Configuration conf = new Configuration();
155 |         conf.set("fs.defaultFS", "hdfs://192.168.137.25:9000");
156 | 
157 |         FileSystem fs = FileSystem.get(conf);
158 |         // The second argument controls whether the delete is recursive
159 |         boolean flag = fs.delete(new Path("/mydir/test.txt"), false);
160 |         System.out.println(flag ? "Delete succeeded" : "Delete failed");
161 |     }
162 |     ```
163 | 
164 | 6. Get information about every DataNode in the HDFS cluster
165 |     ```java
166 |     @Test
167 |     public void testDataNode() throws Exception {
168 |         Configuration conf = new Configuration();
169 |         conf.set("fs.defaultFS", "hdfs://192.168.137.25:9000");
170 | 
171 |         DistributedFileSystem fs = (DistributedFileSystem) FileSystem.get(conf);
172 |         DatanodeInfo[] dataNodeStats = fs.getDataNodeStats();
173 |         for (DatanodeInfo dataNode : dataNodeStats) {
174 |             System.out.println(dataNode.getHostName() + "\t" + dataNode.getName());
175 |         }
176 |     }
177 |     ```
178 | 
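Item 4 above prints one `BlockLocation` per block, so the number of lines it prints depends on the file length and the block size. The following is a minimal sketch of that arithmetic in plain Java, assuming the Hadoop 2.x default block size of 128 MB (`dfs.blocksize`). `BlockMath` and its methods are hypothetical helpers for illustration, not part of the Hadoop API.

```java
// Illustrative only: BlockMath is a hypothetical helper, not part of the Hadoop API.
class BlockMath {

    // Number of blocks needed to store a file (ceiling division); an empty file has none.
    static long blockCount(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        return (fileLength + blockSize - 1) / blockSize;
    }

    // Length of the last block, which is usually shorter than a full block.
    static long lastBlockLength(long fileLength, long blockSize) {
        if (fileLength == 0) {
            return 0;
        }
        long remainder = fileLength % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // assumed block size: 128 MB
        long fileLength = 300L * 1024 * 1024; // a 300 MB file
        System.out.println(blockCount(fileLength, blockSize));      // prints 3
        System.out.println(lastBlockLength(fileLength, blockSize)); // prints 46137344 (44 MB)
    }
}
```

Under these assumptions a 300 MB file would produce three `BlockLocation` entries: two full blocks and one 44 MB block.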
179 | ### (3) The HDFS Web Console
180 | 
181 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/04-HDFS基础/imgs/console.png)
182 | 
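183 | ### (4) The HDFS Trash
184 | 
185 | * The trash is disabled by default. To enable it, add fs.trash.interval to core-site.xml; its value is the retention threshold in minutes (1440 is 24 hours). For example: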
186 |     ```xml
187 |     <property>
188 |         <name>fs.trash.interval</name>
189 |         <value>1440</value>
190 |     </property>
191 |     ```
192 | 
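193 | * With the trash enabled, deleting a file from the shell moves it to /user/root/.Trash/Current (the trash of the root user) instead of removing it
194 | * Files in the trash can be restored quickly
195 | * Once a file has been in the trash longer than the configured threshold, it is deleted permanently and its data blocks are released
196 | * List the trash: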
197 |     ```shell
198 |     hdfs dfs -ls /user/root/.Trash/Current
199 |     ```
200 | * Restore a file from the trash:
201 |     ```shell
202 |     hdfs dfs -cp /user/root/.Trash/Current/data.txt /input
203 |     ```
204 | 
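The retention rule described above can be sketched in a few lines of Java. This is a simplified model that judges each trashed file individually against `fs.trash.interval`, whereas real HDFS expires whole trash checkpoints. `TrashExpiry` and `isExpired` are hypothetical names used only for illustration.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: a simplified model of trash retention, not the Hadoop implementation.
class TrashExpiry {

    // Returns true once a file has stayed in the trash longer than intervalMinutes.
    // An interval of 0 models a disabled trash, where a delete takes effect at once.
    static boolean isExpired(long trashedAtMillis, long nowMillis, long intervalMinutes) {
        if (intervalMinutes <= 0) {
            return true;
        }
        long ageMillis = nowMillis - trashedAtMillis;
        return ageMillis > TimeUnit.MINUTES.toMillis(intervalMinutes);
    }

    public static void main(String[] args) {
        long interval = 1440; // minutes, the value used in the core-site.xml example
        long trashedAt = 0L;
        long after23Hours = TimeUnit.HOURS.toMillis(23);
        long after25Hours = TimeUnit.HOURS.toMillis(25);
        System.out.println(isExpired(trashedAt, after23Hours, interval)); // prints false
        System.out.println(isExpired(trashedAt, after25Hours, interval)); // prints true
    }
}
```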
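205 | ### (5) HDFS Snapshots
206 | * A snapshot is a point-in-time image of the entire file system or of a single directory
207 | * Snapshots are useful in scenarios such as:
208 | 	* Protecting against user mistakes
209 | 	* Backup
210 | 	* Experiments and testing
211 | 	* Disaster recovery
212 | * Snapshot operations
213 | 	* Allow snapshots on a directory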
214 |     ```shell
215 |     hdfs dfsadmin -allowSnapshot /input
216 |     ```
217 | 	* Create a snapshot
218 |     ```shell
219 |     hdfs dfs -createSnapshot /input backup_input_01
220 |     ```
221 | 	* List the directories that allow snapshots
222 |     ```shell
223 |     hdfs lsSnapshottableDir
224 |     ```
225 | 	* Compare two snapshots
226 |     ```shell
227 |     hdfs snapshotDiff /input backup_input_01 backup_input_02
228 |     ```
229 | 	* Restore a file from a snapshot
230 |     ```shell
231 |     hdfs dfs -cp /input/.snapshot/backup_input_01/data.txt /input
232 |     ```
233 | 
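234 | ### (6) HDFS User Permission Management
235 | 
236 | * The user that starts the NameNode service is the superuser, and that user's group is supergroup
237 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/04-HDFS基础/imgs/group.png)
238 | 
239 | * Shell commands
240 | 
241 |     Command | Description
242 |     ---|---
243 |     chmod [-R] mode file ... | Only the file's owner or the superuser may change the file's mode
244 |     chgrp [-R] group file ... | The user running chgrp must be the file's owner and belong to the target group, or be the superuser
245 |     chown [-R] [owner][:[group]] file ... | Only the superuser may change the file's owner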
246 | 
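The `mode` argument of `chmod` is usually written in octal. As a small illustration of how an octal mode maps to the `rwx` string that `hdfs dfs -ls` displays, here is a sketch in plain Java. `ModeString` is a hypothetical helper and does not call any Hadoop API.

```java
// Illustrative only: ModeString is a hypothetical helper, not part of the Hadoop API.
class ModeString {

    // Converts the low nine permission bits of a mode (for example 0755)
    // into the symbolic form: owner, group, then others.
    static String toSymbolic(int mode) {
        char[] flags = {'r', 'w', 'x'};
        StringBuilder symbolic = new StringBuilder(9);
        for (int bit = 8; bit >= 0; bit--) {
            boolean set = ((mode >> bit) & 1) == 1;
            symbolic.append(set ? flags[(8 - bit) % 3] : '-');
        }
        return symbolic.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSymbolic(0755)); // prints rwxr-xr-x
        System.out.println(toSymbolic(0644)); // prints rw-r--r--
    }
}
```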
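247 | ### (7) HDFS Quota Management
248 | 
249 | #### What is a quota?
250 | 
251 | A quota is a limit that HDFS enforces on a directory. A newly created directory has no quota, and the largest quota is Long.MAX_VALUE. A name quota of 1 forces a directory to stay empty, because the directory itself counts against its own quota.
252 | 
253 | #### Types of quota
254 | 
255 | * Name quota: the maximum number of files and directories that the directory tree may contain.
256 | * Space quota: the maximum number of bytes that the files in the directory tree may occupy; every replica counts against it.
257 | 
258 | #### Quota examples
259 | 
260 | * Set the name quota of /input to 3: ```hdfs dfsadmin -setQuota 3 /input```
261 | 
262 | * Clear the name quota of /input: ```hdfs dfsadmin -clrQuota /input```
263 | 
264 | * Set the space quota of /input to 1 MB: ```hdfs dfsadmin -setSpaceQuota 1048576 /input```
265 | 
266 | * Clear the space quota of /input: ```hdfs dfsadmin -clrSpaceQuota /input```
267 | 
268 | **Note: an operation that would push the number of files or the space used past the quota fails with an error.**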
269 | 
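The two quota types can be modeled with simple arithmetic. The sketch below assumes that the directory itself counts toward its name quota and that every replica counts toward the space quota. It is a simplified model: real HDFS accounts for space as blocks are allocated, so it can reject a write earlier than this model suggests. `QuotaCheck` and its methods are hypothetical names.

```java
// Illustrative only: a simplified model of HDFS quota checks, not the Hadoop implementation.
class QuotaCheck {

    // The directory itself uses one unit of its own name quota,
    // in addition to every file and subdirectory beneath it.
    static boolean fitsNameQuota(long nameQuota, long existingEntries, long newEntries) {
        return 1 + existingEntries + newEntries <= nameQuota;
    }

    // Every replica of a file is charged against the space quota.
    static boolean fitsSpaceQuota(long spaceQuota, long usedBytes, long fileBytes, int replication) {
        return usedBytes + fileBytes * replication <= spaceQuota;
    }

    public static void main(String[] args) {
        // Name quota 3 on an empty directory: room for two entries, not three.
        System.out.println(fitsNameQuota(3, 0, 2)); // prints true
        System.out.println(fitsNameQuota(3, 0, 3)); // prints false

        // Space quota 1 MB: a 400000-byte file with 3 replicas needs 1200000 bytes.
        System.out.println(fitsSpaceQuota(1048576L, 0L, 400000L, 3)); // prints false
    }
}
```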
270 | ### (8) HDFS Safe Mode
271 | 
272 | * What is safe mode?
273 | 
274 |     Safe mode is a Hadoop protection mechanism that safeguards the data blocks in the cluster. While HDFS is in safe mode, it is read-only.
275 | 
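276 | * **When the cluster starts, it first enters safe mode.** While in safe mode, the NameNode checks the integrity of the data blocks from the block reports that the DataNodes send. hdfs-default.xml sets a threshold of 0.999 (dfs.namenode.safemode.threshold-pct): the fraction of blocks that must have at least the minimum number of replicas (dfs.namenode.replication.min) before the NameNode leaves safe mode, as shown:
277 | 
278 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/04-HDFS基础/imgs/safe.png)
279 | 
280 |     Replica counts are corrected too. Suppose the replication factor (dfs.replication) is 5, so each block should have 5 replicas on the DataNodes. A block with only 3 replicas has a ratio of 3/5 = 0.6, so the system automatically copies it to other DataNodes until it has 5 replicas again. A block with 8 replicas exceeds the configured 5, so the system deletes the 3 surplus replicas.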
281 | 
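282 | * Although files cannot be modified in safe mode, you can still browse the directory tree and read file contents.
283 | 
284 | * Safe mode can be queried, entered, and left from the command line:
285 |     * Check the safe mode status: ```hdfs dfsadmin -safemode get```
286 |     * Enter safe mode: ```hdfs dfsadmin -safemode enter```
287 |     * Leave safe mode: ```hdfs dfsadmin -safemode leave```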
288 | 
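The 0.999 threshold can be read as a simple ratio test. The sketch below assumes the threshold is the fraction of blocks reported with at least the minimum number of replicas, which is how hdfs-default.xml describes `dfs.namenode.safemode.threshold-pct`. `SafeModeCheck` and `canLeaveSafeMode` are hypothetical names, and the real NameNode logic has more conditions than this.

```java
// Illustrative only: a simplified model of the safe mode threshold, not the NameNode code.
class SafeModeCheck {

    // safeBlocks: blocks that already have at least the minimum number of replicas.
    // totalBlocks: all blocks the NameNode knows about.
    static boolean canLeaveSafeMode(long safeBlocks, long totalBlocks, double thresholdPct) {
        if (totalBlocks == 0) {
            return true; // an empty file system has nothing to wait for
        }
        return (double) safeBlocks / totalBlocks >= thresholdPct;
    }

    public static void main(String[] args) {
        double threshold = 0.999; // default value in hdfs-default.xml
        System.out.println(canLeaveSafeMode(600, 1000, threshold));  // prints false
        System.out.println(canLeaveSafeMode(1000, 1000, threshold)); // prints true
    }
}
```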
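289 | ## How HDFS Uploads and Downloads Work
290 | 
291 | ### (1) How an Upload Works
292 | 
293 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/04-HDFS基础/imgs/upload.png)
294 | 
295 | ### (2) How a Download Works
296 | 
297 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/04-HDFS基础/imgs/download.png)
298 | 
299 | ## HDFS Internals
300 | 
301 | ### HDFS's low-level communication is built on RPC and dynamic proxy objects (Proxy)
302 | 
303 | ### (1) RPC
304 | 
305 | #### What is RPC?
306 | 
307 | RPC stands for Remote Procedure Call. The code of the called procedure does not run locally on the caller's side, so a connection must be established between the caller and the callee, which live in two different places, for them to communicate.
308 | The basic RPC communication model is a synchronous form of communication built on client/server inter-process communication. It gives the client a procedure abstraction of the remote service, and the underlying message passing is transparent to the client.
309 | In RPC, the client is the caller that requests a service, and the server is the callee, the program that executes the client's request.
310 | 
311 | #### RPC example
312 | 
313 | * Server side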
314 |     ```java
315 |     package rpc.server;
316 | 
317 |     import org.apache.hadoop.ipc.VersionedProtocol;
318 | 
319 |     public interface MyInterface extends VersionedProtocol {
320 | 
321 |         // Protocol version number
322 |         public static long versionID=1;
323 | 
324 |         // Method that clients can call
325 |         public String sayHello(String name);
326 |     }
327 |     ```
328 | 
329 |     ```java
330 |     package rpc.server;
331 | 
332 |     import java.io.IOException;
333 | 
334 |     import org.apache.hadoop.ipc.ProtocolSignature;
335 | 
336 |     public class MyInterfaceImpl implements MyInterface {
337 | 
338 |         @Override
339 |         public ProtocolSignature getProtocolSignature(String arg0, long arg1, int arg2) throws IOException {
340 |             // Return the protocol signature (version number)
341 |             return new ProtocolSignature(MyInterface.versionID, null);
342 |         }
343 | 
344 |         @Override
345 |         public long getProtocolVersion(String arg0, long arg1) throws IOException {
346 |             // Return the version number of this implementation
347 |             return MyInterface.versionID;
348 |         }
349 | 
350 |         @Override
351 |         public String sayHello(String name) {
352 |             System.out.println("********* Called on the server side *********");
353 |             return "Hello " + name;
354 |         }
355 | 
356 |     }
357 |     ```
358 | 
359 |     ```java
360 |     package rpc.server;
361 | 
362 |     import java.io.IOException;
363 | 
364 |     import org.apache.hadoop.HadoopIllegalArgumentException;
365 |     import org.apache.hadoop.conf.Configuration;
366 |     import org.apache.hadoop.ipc.RPC;
367 |     import org.apache.hadoop.ipc.RPC.Server;
368 | 
369 |     public class RPCServer {
370 | 
371 |         public static void main(String[] args) throws Exception {
372 |             // Create an RPC builder
373 |             RPC.Builder builder = new RPC.Builder(new Configuration());
374 | 
375 |             // Set the RPC server's address and port
376 |             builder.setBindAddress("localhost");
377 |             builder.setPort(7788);
378 | 
379 |             // Deploy our implementation on the server
380 |             builder.setProtocol(MyInterface.class);
381 |             builder.setInstance(new MyInterfaceImpl());
382 | 
383 |             // Build the server
384 |             Server server = builder.build();
385 | 
386 |             // Start the server
387 |             server.start();
388 | 
389 |         }
390 | 
391 |     }
392 |     ```
393 | 
394 | * Client side
395 |     ```java
396 |     package rpc.client;
397 | 
398 |     import java.io.IOException;
399 |     import java.net.InetSocketAddress;
400 | 
401 |     import org.apache.hadoop.conf.Configuration;
402 |     import org.apache.hadoop.ipc.RPC;
403 | 
404 |     import rpc.server.MyInterface;
405 | 
406 |     public class RPCClient {
407 | 
408 |         public static void main(String[] args) throws Exception {
409 |             // Obtain a proxy object for the server side
410 |             MyInterface proxy = RPC.getProxy(MyInterface.class,  // the server-side interface to call
411 |                                              MyInterface.versionID,      // version number
412 |                                              new InetSocketAddress("localhost", 7788), // address of the RPC server
413 |                                              new Configuration());
414 | 
415 |             String result = proxy.sayHello("Tom");
416 |             System.out.println("Result: "+ result);
417 |         }
418 | 
419 |     }
420 |     ```
421 | 
422 | ### (2) Java Dynamic Proxy Objects
423 | 
424 | * A proxy stands in for another object in order to control access to it.
425 | 
426 | * The core is the JDK's Proxy class
427 | 
428 |     ```java
429 |     package proxy;
430 |     
431 |     public interface MyBusiness {
432 |     
433 |         public void method1();
434 |     
435 |         public void method2();
436 |     }
437 |     ```
438 | 
439 |     ```java
440 |     package proxy;
441 |     
442 |     public class MyBusinessImpl implements MyBusiness {
443 |     
444 |         @Override
445 |         public void method1() {
446 |             System.out.println("method1");
447 |         }
448 |     
449 |         @Override
450 |         public void method2() {
451 |             System.out.println("method2");
452 |         }
453 |     }
454 |     ```
455 | 
456 |     ```java
457 |     package proxy;
458 |     
459 |     import java.lang.reflect.InvocationHandler;
460 |     import java.lang.reflect.Method;
461 |     import java.lang.reflect.Proxy;
462 |     
463 |     public class ProxyTestMain {
464 |     
465 |         public static void main(String[] args) {
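466 |             // Create the real object
467 |             MyBusiness obj = new MyBusinessImpl();
468 |     
469 |             // Override the behavior of method1 without modifying its source
470 |             // by creating a proxy for the real object
471 |             /*
472 |             Proxy.newProxyInstance(loader,      the class loader
473 |                                    interfaces,  the interfaces the real object implements
474 |                                    h )          an InvocationHandler that defines how calls on the proxy are handled
475 |             */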
476 |     
477 |             MyBusiness proxyObj = (MyBusiness) Proxy.newProxyInstance(ProxyTestMain.class.getClassLoader(), 
478 |                                                          obj.getClass().getInterfaces(), 
479 |                                                          new InvocationHandler() {
480 |     
481 |                                             @Override
482 |                                             public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
483 |                                                 // One call made by the client
484 |                                                 /*
485 |                                                  * method: the method the client called
486 |                                                  * args  : the method's arguments
487 |                                                  */
488 |                                                 if(method.getName().equals("method1")){
489 |                                                     // Overridden behavior
490 |                                                     System.out.println("****** method1 overridden *********");
491 |                                                     return null;
492 |                                                 }else{
493 |                                                     // Other methods are delegated to the real object
494 |                                                     return method.invoke(obj, args);
495 |                                                 }
496 |                                             }
497 |                         });
498 |     
499 |             // Call method1 and method2 through the proxy
500 |             proxyObj.method1();
501 |             proxyObj.method2();
502 |         }
503 |     
504 |     }
505 |     ```
506 | 
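To show how the two ideas in this section fit together, the following sketch builds a client-side stub with a JDK dynamic proxy and forwards each call to a stand-in "transport". It runs in a single process and uses only the JDK, so it imitates the shape of an RPC call rather than Hadoop's actual wire protocol. `Greeter`, `GreeterImpl`, `FakeTransport` and `getProxy` are hypothetical names chosen for this illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Illustrative only: an in-process imitation of an RPC client stub.
// None of these types belong to Hadoop.
class RpcProxySketch {

    // The protocol interface shared by client and server.
    interface Greeter {
        String sayHello(String name);
    }

    // The server-side implementation of the protocol.
    static class GreeterImpl implements Greeter {
        @Override
        public String sayHello(String name) {
            return "Hello " + name;
        }
    }

    // Stands in for the network: receives a method name plus arguments and
    // dispatches the call to the server-side instance by reflection.
    static class FakeTransport {
        private final Class<?> protocol;
        private final Object instance;

        FakeTransport(Class<?> protocol, Object instance) {
            this.protocol = protocol;
            this.instance = instance;
        }

        Object call(String methodName, Class<?>[] paramTypes, Object[] args) throws Exception {
            Method target = protocol.getMethod(methodName, paramTypes);
            return target.invoke(instance, args);
        }
    }

    // Builds the client-side proxy: every protocol call is forwarded to the transport.
    static <T> T getProxy(Class<T> protocol, FakeTransport transport) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Object.class) {
                // toString, hashCode and equals are answered locally
                return method.invoke(transport, args);
            }
            return transport.call(method.getName(), method.getParameterTypes(), args);
        };
        Object stub = Proxy.newProxyInstance(
                protocol.getClassLoader(), new Class<?>[] {protocol}, handler);
        return protocol.cast(stub);
    }

    public static void main(String[] args) {
        FakeTransport transport = new FakeTransport(Greeter.class, new GreeterImpl());
        Greeter greeter = getProxy(Greeter.class, transport);
        System.out.println(greeter.sayHello("Tom")); // prints "Hello Tom"
    }
}
```

The client only ever sees the `Greeter` interface. In a real RPC framework the transport would serialize the method name and arguments and send them over a socket, but the proxy side would look much the same.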
507 | 
508 | 
509 | 
510 | 
511 | 
512 | 
513 | 
514 | 


--------------------------------------------------------------------------------
/04-HDFS基础/imgs/console.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/04-HDFS基础/imgs/console.png


--------------------------------------------------------------------------------
/04-HDFS基础/imgs/download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/04-HDFS基础/imgs/download.png


--------------------------------------------------------------------------------
/04-HDFS基础/imgs/group.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/04-HDFS基础/imgs/group.png


--------------------------------------------------------------------------------
/04-HDFS基础/imgs/safe.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/04-HDFS基础/imgs/safe.png


--------------------------------------------------------------------------------
/04-HDFS基础/imgs/upload.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/04-HDFS基础/imgs/upload.png


--------------------------------------------------------------------------------
/05-HDFS性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/05-HDFS性能优化/README.md


--------------------------------------------------------------------------------
/06-MapReduce基础/README.md:
--------------------------------------------------------------------------------
  1 | ## MapReduce
  2 | 
  3 | ### (1) How MapReduce Runs on YARN
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/06-MapReduce基础/imgs/mr-yarn.png)
  6 | 
  7 | ### (2) A First MapReduce Program: WordCount
  8 | 
  9 | * Required pom dependencies:
 10 |     ```xml
 11 |     <dependencies>
 12 |         <dependency>
 13 |             <groupId>org.apache.hadoop</groupId>
 14 |             <artifactId>hadoop-client</artifactId>
 15 |             <version>2.7.3</version>
 16 |         </dependency>
 17 |         <dependency>
 18 |             <groupId>org.apache.hadoop</groupId>
 19 |             <artifactId>hadoop-common</artifactId>
 20 |             <version>2.7.3</version>
 21 |         </dependency>
 22 |         <dependency>
 23 |             <groupId>org.apache.hadoop</groupId>
 24 |             <artifactId>hadoop-hdfs</artifactId>
 25 |             <version>2.7.3</version>
 26 |         </dependency>
 27 |     </dependencies>
 28 |     ```
 29 | 
 30 | * Mapper implementation:
 31 |     ```java
 32 |     import java.io.IOException;
 33 | 
 34 |     import org.apache.hadoop.io.LongWritable;
 35 |     import org.apache.hadoop.io.Text;
 36 |     import org.apache.hadoop.mapreduce.Mapper;
 37 | 
 38 |     public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
 39 | 
 40 |         private Text k = new Text();
 41 | 
 42 |         private LongWritable v = new LongWritable();
 43 | 
 44 |         @Override
 45 |         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
 46 |             String line = value.toString();
 47 |             // Split the line into words
 48 |             String[] words = line.split(" ");
 49 |             // Emit a (word, 1) pair for each word
 50 |             for (String word : words) {
 51 |                 k.set(word);
 52 |                 v.set(1L);
 53 |                 context.write(k, v);
 54 |             }
 55 |         }
 56 |     }
 57 |     ```
 58 | 
 59 | 
 60 | * Reducer implementation:
 61 |     ```java
 62 |     import java.io.IOException;
 63 | 
 64 |     import org.apache.hadoop.io.LongWritable;
 65 |     import org.apache.hadoop.io.Text;
 66 |     import org.apache.hadoop.mapreduce.Reducer;
 67 | 
 68 |     public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
 69 | 
 70 |         private LongWritable value = new LongWritable();
 71 | 
 72 |         @Override
 73 |         protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
 74 |             long sum = 0;
 75 |             // Sum the counts
 76 |             for (LongWritable v : values) {
 77 |                 sum += v.get();
 78 |             }
 79 |             // Emit the total
 80 |             value.set(sum);
 81 |             context.write(key, value);
 82 |         }
 83 |     }
 84 |     ```
 85 | 
 86 | * Driver implementation:
 87 |     ```java
 90 |     import org.apache.hadoop.conf.Configuration;
 91 |     import org.apache.hadoop.fs.Path;
 92 |     import org.apache.hadoop.io.LongWritable;
 93 |     import org.apache.hadoop.io.Text;
 94 |     import org.apache.hadoop.mapreduce.Job;
 95 |     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 96 |     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 97 | 
 98 |     public class Driver {
 99 | 
100 |         public static void main(String[] args) throws Exception {
101 |             args = new String[]{"D:/EclipseWorkspace/mapreducetop10/hello.txt",
102 |                 "D:/EclipseWorkspace/mapreducetop10/output"};
103 | 
104 |             Configuration conf = new Configuration();
105 |             Job job = Job.getInstance(conf);
106 |             // Set the jar by its entry class
107 |             job.setJarByClass(Driver.class);
108 | 
109 |             // Configure the mapper
110 |             job.setMapperClass(WordCountMapper.class);
111 |             job.setMapOutputKeyClass(Text.class);
112 |             job.setMapOutputValueClass(LongWritable.class);
113 | 
114 |             // Configure the reducer
115 |             job.setReducerClass(WordCountReducer.class);
116 |             job.setOutputKeyClass(Text.class);
117 |             job.setOutputValueClass(LongWritable.class);
118 | 
119 |             // Set the job's input and output paths
120 |             FileInputFormat.setInputPaths(job, new Path(args[0]));
121 |             FileOutputFormat.setOutputPath(job, new Path(args[1]));
122 | 
123 |             // Submit the job, wait for it to finish, and report the result through the exit code
124 |             System.exit(job.waitForCompletion(true) ? 0 : 1);
125 |         }
126 |     }
127 |     ```
128 | 
129 | ### (3) Data Flow of WordCount
130 | 
131 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/06-MapReduce基础/imgs/mr-dataflow.png)
132 | 
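The three stages in the diagram (map, shuffle, reduce) can be imitated in memory with plain Java collections. The sketch below is not a MapReduce job and uses no Hadoop classes; it only mirrors the data flow of the WordCount program above. `WordCountFlow` and its methods are hypothetical names, and a sorted `TreeMap` stands in for the shuffle's group-and-sort-by-key step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the WordCount data flow.
class WordCountFlow {

    // Map phase: emit one (word, 1) pair per word of the line.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Long>(word, 1L));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share a key, with keys in sorted order.
    static TreeMap<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases over all input lines.
    static Map<String, Long> wordCount(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(wordCount(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```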
133 | ### (4) Sorting with MapReduce
134 | 
135 | **Sorting: MapReduce sorts records by key2 (the Mapper's output key), so key2 must implement the WritableComparable interface**
136 | 
137 | * Test data: emp.csv
138 |     ```text
139 |     7369,SMITH,CLERK,7902,1980/12/17,800,,20
140 |     7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
141 |     7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30
142 |     7566,JONES,MANAGER,7839,1981/4/2,2975,,20
143 |     7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
144 |     7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30
145 |     7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
146 |     7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
147 |     7839,KING,PRESIDENT,,1981/11/17,5000,,10
148 |     7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30
149 |     7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
150 |     7900,JAMES,CLERK,7698,1981/12/3,950,,30
151 |     7902,FORD,ANALYST,7566,1981/12/3,3000,,20
152 |     7934,MILLER,CLERK,7782,1982/1/23,1300,,10
153 |     ```
154 | 
155 | * SortMapper:
156 |     ```java
157 |     import java.io.IOException;
158 | 
159 |     import org.apache.hadoop.io.LongWritable;
160 |     import org.apache.hadoop.io.NullWritable;
161 |     import org.apache.hadoop.io.Text;
162 |     import org.apache.hadoop.mapreduce.Mapper;
163 |     import org.apache.hadoop.mapreduce.Mapper.Context;
164 | 
165 |     public class SortMapper extends Mapper<LongWritable, Text, Employee, NullWritable> {
166 | 
167 |         @Override
168 |         protected void map(LongWritable key, Text value,Context context)
169 |                 throws IOException, InterruptedException {
170 |             //7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
171 |             String str = value.toString();
172 |             // Split the line into fields
173 |             String[] words = str.split(",");
174 | 
175 |             Employee e = new Employee();
176 |             e.setEmpno(Integer.parseInt(words[0]));
177 |             e.setEname(words[1]);
178 |             e.setJob(words[2]);
179 |             try {
180 |                 e.setMgr(Integer.parseInt(words[3]));
181 |             } catch (Exception e2) {
182 |                 e.setMgr(0);
183 |             }
184 |             e.setHiredate(words[4]);
185 |             e.setSal(Integer.parseInt(words[5]));
186 |             try {
187 |                 e.setComm(Integer.parseInt(words[6]));
188 |             } catch (Exception e2) {
189 |                 e.setComm(0);
190 |             }		
191 |             e.setDeptno(Integer.parseInt(words[7]));
192 | 
193 |             // Emit the employee as the key
194 |             context.write(e, NullWritable.get());
195 |         }
196 |     }
197 |     ```
198 | 
199 | * key2, which implements the WritableComparable interface:
200 |     ```java
203 |     import java.io.DataInput;
204 |     import java.io.DataOutput;
205 |     import java.io.IOException;
206 | 
207 |     import org.apache.hadoop.io.Writable;
208 |     import org.apache.hadoop.io.WritableComparable;
209 | 
210 |     //7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
211 |     public class Employee implements WritableComparable<Employee>{
212 | 
213 |         private int empno;
214 |         private String ename;
215 |         private String job;
216 |         private int mgr;
217 |         private String hiredate;
218 |         private int sal;
219 |         private int comm;
220 |         private int deptno;
221 | 
222 |         public Employee(){
223 | 
224 |         }
225 | 
226 |         @Override
227 |         public int compareTo(Employee o) {
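228 |             // Sort by salary in ascending order; break ties by employee number
229 |             int bySal = Integer.compare(this.sal, o.getSal());
230 |             if (bySal != 0) {
231 |                 return bySal;
232 |             }
233 |             return Integer.compare(this.empno, o.getEmpno());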
234 |         }
235 | 
236 |         @Override
237 |         public String toString() {
238 |             return "Employee [empno=" + empno + ", ename=" + ename + ", job=" + job
239 |                     + ", mgr=" + mgr + ", hiredate=" + hiredate + ", sal=" + sal
240 |                     + ", comm=" + comm + ", deptno=" + deptno + "]";
241 |         }
242 | 
243 |         @Override
244 |         public void readFields(DataInput in) throws IOException {
245 |             this.empno = in.readInt();
246 |             this.ename = in.readUTF();
247 |             this.job = in.readUTF();
248 |             this.mgr = in.readInt();
249 |             this.hiredate = in.readUTF();
250 |             this.sal = in.readInt();
251 |             this.comm = in.readInt();
252 |             this.deptno = in.readInt();
253 |         }
254 | 
255 |         @Override
256 |         public void write(DataOutput output) throws IOException {
257 |             ////7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
258 |             output.writeInt(empno);
259 |             output.writeUTF(ename);
260 |             output.writeUTF(job);
261 |             output.writeInt(mgr);
262 |             output.writeUTF(hiredate);
263 |             output.writeInt(sal);
264 |             output.writeInt(comm);
265 |             output.writeInt(deptno);
266 |         }
267 | 
268 |         public int getEmpno() {
269 |             return empno;
270 |         }
271 | 
272 |         public void setEmpno(int empno) {
273 |             this.empno = empno;
274 |         }
275 | 
276 |         public String getEname() {
277 |             return ename;
278 |         }
279 | 
280 |         public void setEname(String ename) {
281 |             this.ename = ename;
282 |         }
283 | 
284 |         public String getJob() {
285 |             return job;
286 |         }
287 | 
288 |         public void setJob(String job) {
289 |             this.job = job;
290 |         }
291 | 
292 |         public int getMgr() {
293 |             return mgr;
294 |         }
295 | 
296 |         public void setMgr(int mgr) {
297 |             this.mgr = mgr;
298 |         }
299 | 
300 |         public String getHiredate() {
301 |             return hiredate;
302 |         }
303 | 
304 |         public void setHiredate(String hiredate) {
305 |             this.hiredate = hiredate;
306 |         }
307 | 
308 |         public int getSal() {
309 |             return sal;
310 |         }
311 | 
312 |         public void setSal(int sal) {
313 |             this.sal = sal;
314 |         }
315 | 
316 |         public int getComm() {
317 |             return comm;
318 |         }
319 | 
320 |         public void setComm(int comm) {
321 |             this.comm = comm;
322 |         }
323 | 
324 |         public int getDeptno() {
325 |             return deptno;
326 |         }
327 | 
328 |         public void setDeptno(int deptno) {
329 |             this.deptno = deptno;
330 |         }
331 |     }
332 |     ```
333 | 
334 | * Driver:
335 |     ```java
336 |     import org.apache.hadoop.conf.Configuration;
337 |     import org.apache.hadoop.fs.Path;
338 |     import org.apache.hadoop.io.LongWritable;
339 |     import org.apache.hadoop.io.NullWritable;
340 |     import org.apache.hadoop.io.Text;
341 |     import org.apache.hadoop.mapreduce.Job;
342 |     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
343 |     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
344 | 
345 |     public class SortMain {
346 | 
347 |         public static void main(String[] args) throws Exception{
348 | 
349 |             // Sort employees by salary
350 |             Job job = Job.getInstance(new Configuration());
351 | 
352 |             // Set the jar by its entry class
353 |             job.setJarByClass(SortMain.class);
354 | 
355 |             // Configure the mapper
356 |             job.setMapperClass(SortMapper.class);
357 |             job.setMapOutputKeyClass(Employee.class);
358 |             job.setMapOutputValueClass(NullWritable.class);
359 | 
360 |             job.setOutputKeyClass(Employee.class);
361 |             job.setOutputValueClass(NullWritable.class);
362 | 
363 |             // Set the job's input and output paths (HDFS paths)
364 |             FileInputFormat.addInputPath(job, new Path(args[0]));
365 |             FileOutputFormat.setOutputPath(job, new Path(args[1]));
366 | 
367 |             // Submit the job and wait for it to finish
368 |             job.waitForCompletion(true);
369 |         }
370 |     }
371 |     ```
372 | 
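The ordering that a sort key defines can be tried out without a cluster. The sketch below is one possible total ordering for the employee data, written in plain Java: it sorts by salary and, as an assumption made here, breaks ties by employee number so that two employees never compare as equal. `SalarySortSketch` and `Emp` are hypothetical names, and `Comparable` stands in for Hadoop's `WritableComparable`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: plain Java stand-in for a sortable MapReduce key.
class SalarySortSketch {

    static class Emp implements Comparable<Emp> {
        final int empno;
        final String ename;
        final int sal;

        Emp(int empno, String ename, int sal) {
            this.empno = empno;
            this.ename = ename;
            this.sal = sal;
        }

        // Ascending by salary; ties are broken by employee number (an assumption of this sketch).
        @Override
        public int compareTo(Emp other) {
            int bySal = Integer.compare(sal, other.sal);
            return bySal != 0 ? bySal : Integer.compare(empno, other.empno);
        }
    }

    // Returns the employee names in key order.
    static List<String> namesBySalary(List<Emp> emps) {
        List<Emp> sorted = new ArrayList<>(emps);
        Collections.sort(sorted);
        List<String> names = new ArrayList<>();
        for (Emp e : sorted) {
            names.add(e.ename);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Emp> emps = Arrays.asList(
                new Emp(7369, "SMITH", 800),
                new Emp(7499, "ALLEN", 1600),
                new Emp(7521, "WARD", 1250),
                new Emp(7654, "MARTIN", 1250));
        System.out.println(namesBySalary(emps)); // prints [SMITH, WARD, MARTIN, ALLEN]
    }
}
```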
373 | ### (5) Partitioning with a Partitioner
374 | 
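A partitioner decides which reduce task receives each record that the mapper emits. The rule used by the `EmployeePartition` class in this section can be tried out without a cluster; the sketch below reproduces it in plain Java. `PartitionRuleSketch` and `partitionFor` are hypothetical names, and `numPartitions` stands in for the number of reduce tasks configured on the job.

```java
// Illustrative only: the department-based partitioning rule, without any Hadoop classes.
class PartitionRuleSketch {

    // Department 10 and 20 each get their own partition; every other department shares one.
    // The modulo keeps the result inside the range [0, numPartitions).
    static int partitionFor(int deptno, int numPartitions) {
        if (deptno == 10) {
            return 1 % numPartitions;
        } else if (deptno == 20) {
            return 2 % numPartitions;
        } else {
            return 3 % numPartitions;
        }
    }

    public static void main(String[] args) {
        int numPartitions = 3; // assumed number of reduce tasks
        System.out.println(partitionFor(10, numPartitions)); // prints 1
        System.out.println(partitionFor(20, numPartitions)); // prints 2
        System.out.println(partitionFor(30, numPartitions)); // prints 0
    }
}
```

With fewer than three reduce tasks the modulo makes departments share partitions, and with a single reduce task every record lands in partition 0.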
375 | * Mapper:
376 |     ```java
377 |     import java.io.IOException;
378 | 
379 |     import org.apache.hadoop.io.LongWritable;
380 |     import org.apache.hadoop.io.NullWritable;
381 |     import org.apache.hadoop.io.Text;
382 |     import org.apache.hadoop.mapreduce.Mapper;
383 |     import org.apache.hadoop.mapreduce.Mapper.Context;
384 | 
385 |     public class EmployeeMapper  extends Mapper<LongWritable, Text, LongWritable, Employee> {
386 | 
387 |         @Override
388 |         protected void map(LongWritable key, Text value,Context context)
389 |                 throws IOException, InterruptedException {
390 |             //7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
391 |             String str = value.toString();
392 |             // Split the line into fields
393 |             String[] words = str.split(",");
394 | 
395 |             Employee e = new Employee();
396 |             e.setEmpno(Integer.parseInt(words[0]));
397 |             e.setEname(words[1]);
398 |             e.setJob(words[2]);
399 |             try {
400 |                 e.setMgr(Integer.parseInt(words[3]));
401 |             } catch (Exception e2) {
402 |                 e.setMgr(0);
403 |             }
404 |             e.setHiredate(words[4]);
405 |             e.setSal(Integer.parseInt(words[5]));
406 |             try {
407 |                 e.setComm(Integer.parseInt(words[6]));
408 |             } catch (Exception e2) {
409 |                 e.setComm(0);
410 |             }		
411 |             e.setDeptno(Integer.parseInt(words[7]));
412 | 
413 |             // Emit the employee keyed by department number
414 |             context.write(new LongWritable(e.getDeptno()),e);
415 |         }
416 |     }
417 |     ```
418 | 
419 | * Reducer:
420 |     ```java
421 |     import java.io.IOException;
422 | 
423 |     import org.apache.hadoop.io.LongWritable;
424 |     import org.apache.hadoop.mapreduce.Reducer;
425 | 
426 |     public class EmployeeReducer extends Reducer<LongWritable, Employee, LongWritable, Employee> {
427 | 
428 |         @Override
429 |         protected void reduce(LongWritable deptno, Iterable<Employee> values,Context context)
430 |                 throws IOException, InterruptedException {
431 |             for(Employee e:values){
432 |                 context.write(deptno, e);
433 |             }
434 |         }
435 | 
436 |     }
437 |     ```
438 | 
439 | * Employee:
440 |     ```java
441 |     import java.io.DataInput;
442 |     import java.io.DataOutput;
443 |     import java.io.IOException;
444 | 
445 |     import org.apache.hadoop.io.Writable;
446 |     import org.apache.hadoop.io.WritableComparable;
447 | 
448 |     //7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
449 |     public class Employee implements Writable{
450 | 
451 |         private int empno;
452 |         private String ename;
453 |         private String job;
454 |         private int mgr;
455 |         private String hiredate;
456 |         private int sal;
457 |         private int comm;
458 |         private int deptno;
459 | 
460 |         public Employee(){
461 | 
462 |         }
463 | 
464 |         @Override
465 |         public String toString() {
466 |             return "Employee [empno=" + empno + ", ename=" + ename + ", job=" + job
467 |                     + ", mgr=" + mgr + ", hiredate=" + hiredate + ", sal=" + sal
468 |                     + ", comm=" + comm + ", deptno=" + deptno + "]";
469 |         }
470 | 
471 |         @Override
472 |         public void readFields(DataInput in) throws IOException {
473 |             this.empno = in.readInt();
474 |             this.ename = in.readUTF();
475 |             this.job = in.readUTF();
476 |             this.mgr = in.readInt();
477 |             this.hiredate = in.readUTF();
478 |             this.sal = in.readInt();
479 |             this.comm = in.readInt();
480 |             this.deptno = in.readInt();
481 |         }
482 | 
483 |         @Override
484 |         public void write(DataOutput output) throws IOException {
485 |             ////7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
486 |             output.writeInt(empno);
487 |             output.writeUTF(ename);
488 |             output.writeUTF(job);
489 |             output.writeInt(mgr);
490 |             output.writeUTF(hiredate);
491 |             output.writeInt(sal);
492 |             output.writeInt(comm);
493 |             output.writeInt(deptno);
494 |         }
495 | 
496 |         public int getEmpno() {
497 |             return empno;
498 |         }
499 | 
500 |         public void setEmpno(int empno) {
501 |             this.empno = empno;
502 |         }
503 | 
504 |         public String getEname() {
505 |             return ename;
506 |         }
507 | 
508 |         public void setEname(String ename) {
509 |             this.ename = ename;
510 |         }
511 | 
512 |         public String getJob() {
513 |             return job;
514 |         }
515 | 
516 |         public void setJob(String job) {
517 |             this.job = job;
518 |         }
519 | 
520 |         public int getMgr() {
521 |             return mgr;
522 |         }
523 | 
524 |         public void setMgr(int mgr) {
525 |             this.mgr = mgr;
526 |         }
527 | 
528 |         public String getHiredate() {
529 |             return hiredate;
530 |         }
531 | 
532 |         public void setHiredate(String hiredate) {
533 |             this.hiredate = hiredate;
534 |         }
535 | 
536 |         public int getSal() {
537 |             return sal;
538 |         }
539 | 
540 |         public void setSal(int sal) {
541 |             this.sal = sal;
542 |         }
543 | 
544 |         public int getComm() {
545 |             return comm;
546 |         }
547 | 
548 |         public void setComm(int comm) {
549 |             this.comm = comm;
550 |         }
551 | 
552 |         public int getDeptno() {
553 |             return deptno;
554 |         }
555 | 
556 |         public void setDeptno(int deptno) {
557 |             this.deptno = deptno;
558 |         }
559 |     }
560 |     ```
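The `write` and `readFields` methods above only work as a pair if they handle the fields in the same order. The sketch below illustrates that round trip with plain JDK streams. `EmployeeRecordSketch` and its helper methods are hypothetical stand-ins written for this example, not part of Hadoop or of the original project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical plain-JDK stand-in for the Employee Writable; it uses no Hadoop classes.
class EmployeeRecordSketch {
    int empno;
    String ename;
    String job;
    int mgr;
    String hiredate;
    int sal;
    int comm;
    int deptno;

    // Writes the fields in the same order as Employee.write(DataOutput).
    void write(DataOutputStream out) throws IOException {
        out.writeInt(empno);
        out.writeUTF(ename);
        out.writeUTF(job);
        out.writeInt(mgr);
        out.writeUTF(hiredate);
        out.writeInt(sal);
        out.writeInt(comm);
        out.writeInt(deptno);
    }

    // Reads the fields back in exactly the order they were written.
    void readFields(DataInputStream in) throws IOException {
        empno = in.readInt();
        ename = in.readUTF();
        job = in.readUTF();
        mgr = in.readInt();
        hiredate = in.readUTF();
        sal = in.readInt();
        comm = in.readInt();
        deptno = in.readInt();
    }

    // Serializes a record to bytes and deserializes it into a fresh object.
    static EmployeeRecordSketch roundTrip(EmployeeRecordSketch source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                source.write(out);
            }
            EmployeeRecordSketch copy = new EmployeeRecordSketch();
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds the sample row 7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30.
    static EmployeeRecordSketch sample() {
        EmployeeRecordSketch e = new EmployeeRecordSketch();
        e.empno = 7499;
        e.ename = "ALLEN";
        e.job = "SALESMAN";
        e.mgr = 7698;
        e.hiredate = "1981/2/20";
        e.sal = 1600;
        e.comm = 300;
        e.deptno = 30;
        return e;
    }

    public static void main(String[] args) {
        EmployeeRecordSketch copy = roundTrip(sample());
        System.out.println(copy.empno + "," + copy.ename + "," + copy.deptno);
    }
}
```

If the two methods disagreed on the order (for example reading `sal` before `hiredate`), the stream would be misinterpreted, which is why the two methods are usually edited together.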
561 | 
562 | * Partitioner：
563 |     ```java
564 |     import org.apache.hadoop.io.LongWritable;
565 |     import org.apache.hadoop.mapreduce.Partitioner;
566 | 
567 |     public class EmployeePartition extends Partitioner<LongWritable, Employee> {
568 | 
569 |         @Override
570 |         public int getPartition(LongWritable key2, Employee e, int numPartition) {
571 |             // Partitioning rule: each department maps to its own reduce task
572 |             if(e.getDeptno() == 10){
573 |                 return 1%numPartition;
574 |             }else if(e.getDeptno() == 20){
575 |                 return 2%numPartition;
576 |             }else{
577 |                 return 3%numPartition;
578 |             }
579 |         }
580 |     }
581 |     ```
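To see what the `% numPartition` in the partitioner does, the following minimal sketch replays the same rule without any Hadoop classes. `PartitionSketch` and `partitionFor` are hypothetical names used only for this illustration.

```java
// Hypothetical helper that mirrors the rule in EmployeePartition.getPartition.
class PartitionSketch {
    // Department 10 -> slot 1, department 20 -> slot 2, everything else -> slot 3,
    // then wrapped into the valid range [0, numPartitions) with the modulo.
    static int partitionFor(int deptno, int numPartitions) {
        int slot;
        if (deptno == 10) {
            slot = 1;
        } else if (deptno == 20) {
            slot = 2;
        } else {
            slot = 3;
        }
        return slot % numPartitions;
    }

    public static void main(String[] args) {
        int[] departments = {10, 20, 30};
        for (int reducers = 1; reducers <= 3; reducers++) {
            for (int deptno : departments) {
                System.out.println("reducers=" + reducers + " deptno=" + deptno
                        + " -> partition " + partitionFor(deptno, reducers));
            }
        }
    }
}
```

With three reduce tasks, department 10 goes to partition 1, department 20 to partition 2, and all other departments to partition 0. With only two reduce tasks, departments 10 and 30 share partition 1, so the number of reduce tasks should match the number of distinct partitions the rule is meant to produce.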
582 | 
583 | * Driver：
584 |     ```java
585 |     import org.apache.hadoop.conf.Configuration;
586 |     import org.apache.hadoop.fs.Path;
587 |     import org.apache.hadoop.io.LongWritable;
588 |     import org.apache.hadoop.io.NullWritable;
589 |     import org.apache.hadoop.mapreduce.Job;
590 |     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
591 |     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
592 |     public class PartitionMain {
593 | 
594 |         public static void main(String[] args) throws Exception {
595 |             // Split the employee records by department into separate output files
596 |             Configuration conf = new Configuration();
597 |             Job job = Job.getInstance(conf);
598 | 
599 |             // Specify the entry point of the program
600 |             job.setJarByClass(PartitionMain.class);
601 | 
602 |             // Specify the mapper and its output key/value types
603 |             job.setMapperClass(EmployeeMapper.class);
604 |             job.setMapOutputKeyClass(LongWritable.class);
605 |             job.setMapOutputValueClass(Employee.class);
606 | 
607 |             // Set the partitioning rule; three reduce tasks give one output file per partition
608 |             job.setPartitionerClass(EmployeePartition.class);
609 |             job.setNumReduceTasks(3);
610 | 
611 |             job.setReducerClass(EmployeeReducer.class);
612 |             job.setOutputKeyClass(LongWritable.class);
613 |             job.setOutputValueClass(Employee.class);
614 | 
615 |             // Specify the input and output paths of the job (HDFS paths)
616 |             FileInputFormat.addInputPath(job, new Path(args[0]));
617 |             FileOutputFormat.setOutputPath(job, new Path(args[1]));
618 | 
619 |             // Submit the job and wait for it to finish
620 |             job.waitForCompletion(true);
621 |         }
622 |     }
623 |     ```
624 | 
625 | ### （五）The Shuffle Process
626 | 
627 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/06-MapReduce基础/imgs/mr-shuffle.png)
628 | 
629 | ### （六）Managing MapReduce Jobs
630 | 
631 | * Monitor running jobs through the web console:
632 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/06-MapReduce基础/imgs/mr-console.png)
633 | 
634 | * Manage jobs with the yarn application command:
635 | 	1. Show help: ```yarn application -help```
636 | 	2. List running applications, including MapReduce jobs: ```yarn application -list```
637 | 	3. Check the status of an application: ```yarn application -status <application_id>```
638 | 	4. Forcibly kill an application: ```yarn application -kill <application_id>```
639 | 
640 | 
641 | 
642 | 
643 | 
644 | 
645 | 
646 | 


--------------------------------------------------------------------------------
/06-MapReduce基础/imgs/mr-console.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/06-MapReduce基础/imgs/mr-console.png


--------------------------------------------------------------------------------
/06-MapReduce基础/imgs/mr-dataflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/06-MapReduce基础/imgs/mr-dataflow.png


--------------------------------------------------------------------------------
/06-MapReduce基础/imgs/mr-shuffle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/06-MapReduce基础/imgs/mr-shuffle.png


--------------------------------------------------------------------------------
/06-MapReduce基础/imgs/mr-yarn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/06-MapReduce基础/imgs/mr-yarn.png


--------------------------------------------------------------------------------
/07-MapReduce十大经典案例/README.md:
--------------------------------------------------------------------------------
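 1 | ## Ten Classic MapReduce Cases
 2 | 
 3 | ### （一）WordCount Case
 4 | 
 5 | ### （二）Traffic Summary Case
 6 | 
 7 | ### （三）Auxiliary Sort and Secondary Sort Case (GroupingComparator)
 8 | 
 9 | ### （四）Small-File Handling Case (Custom InputFormat)
10 | 
11 | ### （五）Log Filtering with Custom Log Output Paths Case (Custom OutputFormat)
12 | 
13 | ### （六）Multi-Table Join Case in MapReduce
14 | 
15 | ### （七）Log Cleaning Case
16 | 
17 | ### （八）Inverted Index Case
18 | 
19 | ### （九）Finding Common Blog Friends Case
20 | 
21 | ### （十）Compression/Decompression Case
22 | 
23 | ### （十一）KeyValueTextInputFormat Usage Case
24 | 
25 | ### （十二）NLineInputFormat Usage Case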
26 | 
27 | 


--------------------------------------------------------------------------------
/08-MapReduce性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/08-MapReduce性能优化/README.md


--------------------------------------------------------------------------------
/09-HBase基础/README.md:
--------------------------------------------------------------------------------
  1 | ## HBase
  2 | 
  3 | ### （一）What Is HBase?
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/09-HBase基础/imgs/hbase-logo.png)
  6 | 
  7 | An open-source, column-oriented NoSQL database that stores its data on HDFS and relies on ZooKeeper for distributed coordination.
  8 | 
  9 | ### （二）HBase Architecture
 10 | 
 11 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/09-HBase基础/imgs/hbasearc.png)
 12 | 
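 13 | * HMaster:
 14 | 
 15 |     1. Manages users' create, delete, alter, and query operations on tables (table-level administration; row reads and writes go to the RegionServers).
 16 | 
 17 |     2. Balances load across the HRegionServers by adjusting how Regions are distributed.
 18 | 
 19 |     3. Assigns the new Regions after a Region splits because its data has grown too large.
 20 | 
 21 |     4. When a RegionServer goes down, migrates the Regions of the failed server to other servers.
 22 | 
 23 | * HRegionServer: the server that hosts Regions.
 24 | 
 25 | * Region: the basic unit of data storage and management in HBase.
 26 | 
 27 | * Store: a Region consists of multiple Stores, each corresponding to one column family (CF) of the table. A Store has two parts: a MemStore and StoreFiles.
 28 | 
 29 | * MemStore: a write cache. A change to table data is written to the WAL first and only then to the MemStore; when the MemStore fills up, it is flushed to a StoreFile (implemented underneath as an HFile).
 30 | 
 31 | * StoreFile: a thin wrapper around an HFile.
 32 | 
 33 | * HFile: the file that actually stores HBase data, located on HDFS. Data in an HFile is sorted by RowKey, CF, and Column.
 34 | 
 35 | * ZooKeeper: an HBase client first contacts ZooKeeper to obtain the address of the RegionServer, and only then can it operate on that RegionServer.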
 36 | 
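The write path described above (WAL first, then MemStore, then a flush to a StoreFile) can be illustrated with a toy model. Everything in this sketch is hypothetical: `StoreSketch` is not an HBase class, the WAL is just an in-memory list, and the flush threshold counts entries, whereas HBase itself flushes based on MemStore size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of one Store; the names are hypothetical and unrelated to HBase's real classes.
class StoreSketch {
    // Write-ahead log: append-only record of every change.
    final List<String> wal = new ArrayList<>();
    // MemStore: sorted in-memory write cache.
    final TreeMap<String, String> memStore = new TreeMap<>();
    // StoreFiles: sorted snapshots produced by flushes, oldest first.
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();
    // Assumption for the sketch: flush after this many entries.
    final int flushThreshold;

    StoreSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // A write goes to the WAL first, then to the MemStore; a full MemStore is flushed.
    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);
        memStore.put(rowKey, value);
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    // Turns the current MemStore content into a new StoreFile and empties the MemStore.
    void flush() {
        storeFiles.add(new TreeMap<>(memStore));
        memStore.clear();
    }

    // A read checks the MemStore first, then the StoreFiles from newest to oldest.
    String get(String rowKey) {
        if (memStore.containsKey(rowKey)) {
            return memStore.get(rowKey);
        }
        for (int i = storeFiles.size() - 1; i >= 0; i--) {
            String value = storeFiles.get(i).get(rowKey);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        StoreSketch store = new StoreSketch(2);
        store.put("stu001", "Tom1");
        store.put("stu002", "Tom2"); // reaches the threshold and triggers a flush
        store.put("stu003", "Tom3"); // stays in the MemStore
        System.out.println("store files: " + store.storeFiles.size());
        System.out.println("memstore entries: " + store.memStore.size());
        System.out.println("stu001 -> " + store.get("stu001"));
    }
}
```

Because both the MemStore and each flushed snapshot are kept in key order, the flushed files are already sorted by row key, which matches the description of HFile above.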
 37 | ### （三）HBase Table Structure
 38 | 
 39 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/09-HBase基础/imgs/hbasetable.png)
 40 | 
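 41 | ### （四）Installing and Deploying HBase
 42 | 
 43 | * Download HBase: https://hbase.apache.org/downloads.html
 44 | 
 45 | * Extract it to a directory of your choice on Linux, then edit the following files under the conf directory:
 46 | 
 47 | * Local (standalone) mode configuration:
 48 | 
 49 |     Config file | Parameter | Reference value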
 50 |     ---|---|---
 51 |     hbase-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144
 52 |     hbase-site.xml | hbase.rootdir | file:///root/training/hbase-1.3.1/data
 53 | 
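 54 | * Pseudo-distributed mode configuration:
 55 | 
 56 |     Config file | Parameter | Reference value
 57 |     ---|---|---
 58 |     hbase-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144
 59 |     ... | HBASE_MANAGES_ZK | true
 60 |     hbase-site.xml | hbase.rootdir | file:///root/training/hbase-1.3.1/data (value as given in the source; a distributed setup normally points this at an hdfs:// URI)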
 61 |     ... | hbase.cluster.distributed | true
 62 |     ... | hbase.zookeeper.quorum | 192.168.157.111
 63 |     ... | dfs.replication | 1
 64 |     regionservers | | 192.168.157.111
 65 | 
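 66 | * Fully distributed mode configuration:
 67 | 
 68 |     Config file | Parameter | Reference value
 69 |     ---|---|---
 70 |     hbase-env.sh | JAVA_HOME | /root/training/jdk1.8.0_144
 71 |     ... | HBASE_MANAGES_ZK | true
 72 |     hbase-site.xml | hbase.rootdir | file:///root/training/hbase-1.3.1/data (value as given in the source; a distributed setup normally points this at an hdfs:// URI)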
 73 |     ... | hbase.cluster.distributed | true
 74 |     ... | hbase.zookeeper.quorum | 192.168.157.111
 75 |     ... | dfs.replication | 2
 76 |     ... | hbase.master.maxclockskew | 180000
 77 |     regionservers | | 192.168.157.111
 78 |     ... | | 192.168.157.112
 79 | 
 80 | * General format of the XML configuration file (for reference only):
 81 | 
 82 |     ```xml
 83 |     <?xml version="1.0"?>
 84 |     <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 85 |     <configuration>
 86 |         <property>
 87 |             <name>hbase.rootdir</name>
 88 |             <value>file:///root/training/hbase-1.3.1/data</value>
 89 |         </property>
 90 |         <property>
 91 |             <name>hbase.rootdir</name>
 92 |             <value>file:///root/training/hbase-1.3.1/data</value>
 93 |         </property>
 94 |         <property>
 95 |             <name>hbase.rootdir</name>
 96 |             <value>file:///root/training/hbase-1.3.1/data</value>
 97 |         </property>
 98 |     </configuration>
 99 |     ```
100 | 
101 | * Lines 45, 46, and 47 of hbase-env.sh can be removed when running on JDK 1.8.
102 | 
103 | * Start HDFS: ```start-dfs.sh```
104 | 
105 | * Start ZooKeeper: ```zkServer.sh start```
106 | 
107 | * Start HBase: ```start-hbase.sh```
108 | 
109 | * Start the HBase shell: ```hbase shell```
110 | 
111 | * View the status of the HMaster and RegionServers in the web UI: localhost:16010
112 | 
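113 | ### （五）-ROOT- and .META.
114 | 
115 | * HBase has two special tables, -ROOT- and .META.
116 | 
117 |     * -ROOT-: records the Region information of the .META. table; -ROOT- has only one Region
118 |     
119 |     * .META.: records the Region information of user-created tables; .META. can have multiple Regions
120 | 
121 | 	> Note: the -ROOT- table was removed as of version 0.96 because the extra lookup step hurt performance; since then ZooKeeper records the location of the meta table (renamed hbase:meta) directly.
122 | 	
123 | * ZooKeeper records the location of the -ROOT- table
124 | 
125 | * Before 0.96, a client that wants to access user data first asks ZooKeeper for the location of the -ROOT- table, then reads -ROOT-, then reads .META., and only then learns where the user data lives and can access it.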
126 | 
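The pre-0.96 lookup chain described above can be sketched with sorted maps: each catalog level maps a region's start key to the next hop, and a lookup picks the entry with the greatest start key that is not larger than the row key. All names and values in this sketch (`RegionLookupSketch`, `meta-1`, `rs-a`, and so on) are hypothetical, and the ZooKeeper step is represented simply by the program holding the `ROOT` map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the -ROOT- -> .META. -> user region lookup; all data is made up.
class RegionLookupSketch {
    // -ROOT-: start key of each .META. region -> name of that .META. region.
    static final TreeMap<String, String> ROOT = new TreeMap<>();
    // .META.: per .META. region, start key of each user region -> hosting region server.
    static final Map<String, TreeMap<String, String>> META = new HashMap<>();

    static {
        ROOT.put("", "meta-1");
        ROOT.put("m", "meta-2");

        TreeMap<String, String> meta1 = new TreeMap<>();
        meta1.put("", "rs-a");
        meta1.put("g", "rs-b");
        META.put("meta-1", meta1);

        TreeMap<String, String> meta2 = new TreeMap<>();
        meta2.put("m", "rs-c");
        meta2.put("t", "rs-d");
        META.put("meta-2", meta2);
    }

    // Step 1: -ROOT- tells which .META. region covers the row key.
    // Step 2: that .META. region tells which server hosts the user region.
    static String locate(String rowKey) {
        String metaRegion = ROOT.floorEntry(rowKey).getValue();
        return META.get(metaRegion).floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        for (String rowKey : new String[] {"apple", "hello", "stu001", "zebra"}) {
            System.out.println(rowKey + " -> " + locate(rowKey));
        }
    }
}
```

From 0.96 onward the first hop disappears: the client learns the meta table's location from ZooKeeper and performs only the second lookup.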
127 | 
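128 | ### （六）HBase Shell
129 | 
130 | * Command formats:
131 | 
132 |     Operation | Command
133 |     ---|---
134 |     Create a table | create 'table_name','column_family_1','column_family_2','column_family_N'
135 |     Add a record | put 'table_name','row_key','column_family:column','value'
136 |     Get a record | get 'table_name','row_key'
137 |     Count the records in a table | count 'table_name'
138 |     Delete a record | delete 'table_name','row_key','column_family:column'
139 |     Drop a table | The table must be disabled first: step 1 disable 'table_name', step 2 drop 'table_name'
140 |     Scan all records | scan 'table_name'
141 |     Scan all data of one column in a table | scan 'table_name',{COLUMNS=>'column_family:column'}
142 |     Update a record | Simply put again to overwrite the old value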
143 | 
144 | ### （七）HBase Java API
145 | 
146 | * The pom dependency is as follows:
147 | 
148 |     ```xml
149 |     <dependency>
150 |         <groupId>org.apache.hbase</groupId>
151 |         <artifactId>hbase-client</artifactId>
152 |         <version>1.3.1</version>
153 |     </dependency>
154 |     ```
155 | 
156 | * Create a table
157 | 
158 |     ```java
159 |     @Test
160 |     public void testCreateTable() throws Exception {
161 |         // Configuration
162 |         Configuration conf = new Configuration();
163 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
164 | 
165 |         // Create the admin client
166 |         HBaseAdmin admin = new HBaseAdmin(conf);
167 |         // Create the table descriptor
168 |         HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("students"));
169 |         // Create the column families
170 |         HColumnDescriptor h1 = new HColumnDescriptor("info");
171 |         HColumnDescriptor h2 = new HColumnDescriptor("grade");
172 |         // Add the column families to the table descriptor
173 |         htd.addFamily(h1);
174 |         htd.addFamily(h2);
175 | 
176 |         // Create the table
177 |         admin.createTable(htd);
178 |         admin.close();
179 |     }
180 |     ```
181 | 
182 | * Insert a single row
183 | 
184 |     ```java
185 |     @Test
186 |     public void testPut() throws Exception {
187 |         // Configuration
188 |         Configuration conf = new Configuration();
189 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
190 | 
191 |         // Open the table by name
192 |         HTable table = new HTable(conf, "students");
193 |         // Create a Put for one row, identified by its row key
194 |         Put put = new Put(Bytes.toBytes("stu000"));
195 |         // Set the cell: family (column family), qualifier (column), value
196 |         put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Tom"));
197 | 
198 |         // Write the row
199 |         table.put(put);
200 |         table.close();
201 |     }
202 |     ```
203 | 
204 | * Insert multiple rows
205 | 
206 |     ```java
207 |     @Test
208 |     public void testPutList() throws Exception {
209 |         // Configuration
210 |         Configuration conf = new Configuration();
211 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
212 | 
213 |         // Open the table by name
214 |         HTable table = new HTable(conf, "students");
215 | 
216 |         // Build the list of rows to insert
217 |         List<Put> list = new ArrayList<Put>();
218 |         for (int i = 1; i < 11; i++) {
219 |             Put put = new Put(Bytes.toBytes("stu00" + i));
220 |             put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Tom" + i));
221 |             // Add the row to the list
222 |             list.add(put);
223 |         }
224 |         table.put(list);
225 |         table.close();
226 |     }
227 |     ```
228 | 
229 | * Query a row by row key
230 | 
231 |     ```java
232 |     @Test
233 |     public void testGet() throws Exception {
234 |         // Configuration
235 |         Configuration conf = new Configuration();
236 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
237 | 
238 |         // Open the table to query
239 |         HTable table = new HTable(conf, "students");
240 | 
241 |         // Build a Get for the given row key
242 |         Get get = new Get(Bytes.toBytes("stu001"));
243 |         // Run the query
244 |         Result result = table.get(get);
245 | 
246 |         // Print the result
247 |         System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
248 |         table.close();
249 |     }
250 |     ```
251 | 
252 | * Scan all data in a table
253 | 
254 |     ```java
255 |     @Test
256 |     public void testScan() throws Exception {
257 |         // Configuration
258 |         Configuration conf = new Configuration();
259 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
260 | 
261 |         HTable table = new HTable(conf, "students");
262 |         // Create a Scan
263 |         Scan scan = new Scan();
264 | 
265 |         // Scan the table
266 |         ResultScanner result = table.getScanner(scan);
267 | 
268 |         // Print the returned values
269 |         for (Result r : result) {
270 |             System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
271 |         }
272 |         table.close();
273 |     }
274 |     ```
275 | 
276 | * Delete a table
277 | 
278 |     ```java
279 |     @Test
280 |     public void testDropTable() throws Exception {
281 |         // Configuration
282 |         Configuration conf = new Configuration();
283 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
284 | 
285 |         // Create the admin client
286 |         HBaseAdmin admin = new HBaseAdmin(conf);
287 | 
288 |         // Disable the table first
289 |         admin.disableTable(Bytes.toBytes("students"));
290 |         // Delete the table
291 |         admin.deleteTable(Bytes.toBytes("students"));
292 |         admin.close();
293 |     }
294 |     ```
295 | 
296 | ### （八）HBase Filters
297 | 
298 | * Single column value filter
299 | 
300 |     Similar to: select * from students where name = 'Tom1';
301 | 
302 |     ```java
303 |     @Test
304 |     public void testSingleColumnValueFilter() throws Exception {
305 |         // Configuration
306 |         Configuration conf = new Configuration();
307 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
308 | 
309 |         // Open the table to query
310 |         HTable table = new HTable(conf, "students");
311 |         // Create a Scan
312 |         Scan scan = new Scan();
313 |         // Create the filter: a SingleColumnValueFilter matching rows whose name is Tom1
314 |         SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes("info"),
315 |                 Bytes.toBytes("name"),
316 |                 CompareFilter.CompareOp.EQUAL,
317 |                 Bytes.toBytes("Tom1"));
318 |         // Apply the filter to the scan
319 |         scan.setFilter(filter);
320 |         // Query the data
321 |         ResultScanner result = table.getScanner(scan);
322 |         for (Result r : result) {
323 |             // Print the value
324 |             System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
325 |         }
326 |         table.close();
327 |     }
328 |     ```
329 | 
330 | * Column name prefix filter
331 | 
332 |     Similar to: select name from students;
333 | 
334 |     ```java
335 |     @Test
336 |     public void testColumnPrefixFilter() throws Exception {
337 |         // Configuration
338 |         Configuration conf = new Configuration();
339 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
340 | 
341 |         // Open the table to query
342 |         HTable table = new HTable(conf, "students");
343 |         // Create a Scan
344 |         Scan scan = new Scan();
345 | 
346 |         // Create the column name prefix filter
347 |         ColumnPrefixFilter filter = new ColumnPrefixFilter(Bytes.toBytes("nam"));
348 |         // ColumnPrefixFilter filter = new ColumnPrefixFilter(Bytes.toBytes("name"));
349 |         scan.setFilter(filter);
350 | 
351 |         // Scan the table
352 |         ResultScanner result = table.getScanner(scan);
353 |         for (Result r : result) {
354 |             // Print the value
355 |             System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
356 |         }
357 |         table.close();
358 |     }
359 |     ```
360 | 
361 | * Multiple column name prefix filter
362 | 
363 |     Similar to: select name, gender from students;
364 | 
365 |     ```java
366 |     @Test
367 |     public void testMultipleColumnPrefixFilter() throws Exception {
368 |         // Configuration
369 |         Configuration conf = new Configuration();
370 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
371 | 
372 |         // Open the table to query
373 |         HTable table = new HTable(conf, "students");
374 |         // Create a Scan
375 |         Scan scan = new Scan();
376 | 
377 |         // Specify the column name prefixes to query
378 |         byte[][] prefixs = new byte[][]{Bytes.toBytes("name"), Bytes.toBytes("gender")};
379 |         // Create a MultipleColumnPrefixFilter
380 |         MultipleColumnPrefixFilter filter = new MultipleColumnPrefixFilter(prefixs);
381 |         // Set the filter on the Scan
382 |         scan.setFilter(filter);
383 | 
384 |         // Query the data
385 |         ResultScanner result = table.getScanner(scan);
386 |         for (Result r : result) {
387 |             String name = Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
388 |             String gender = Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("gender")));
389 |             // Print the values
390 |             System.out.println(name + "\t" + gender);
391 |         }
392 |         // Close the table
393 |         table.close();
394 |     }
395 |     ```
396 | 
397 | * Row key filter
398 | 
399 |     Similar to: select * from students where id = 1;
400 | 
401 |     ```java
402 |     @Test
403 |     public void testRowKeyFilter() throws Exception {
404 |         // Configuration
405 |         Configuration conf = new Configuration();
406 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
407 | 
408 |         // Open the table to query
409 |         HTable table = new HTable(conf, "students");
410 |         // Create a Scan
411 |         Scan scan = new Scan();
412 |         // Create a row key filter
413 |         RowFilter filter = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("stu000"));
414 |         // Apply the filter to the scan
415 |         scan.setFilter(filter);
416 |         // Query the data
417 |         ResultScanner result = table.getScanner(scan);
418 |         for (Result r : result) {
419 |             String name = Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
420 |             String gender = Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("gender")));
421 |             // Print the values
422 |             System.out.println(name + "\t" + gender);
423 |         }
424 |         // Close the table
425 |         table.close();
426 |     }
427 |     ```
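One detail worth checking when a row key filter is built on a regular expression: to the best of my understanding, `RegexStringComparator` reports a match when the pattern is found anywhere in the row key, so an unanchored pattern is not an exact comparison. The sketch below reproduces that behaviour with `java.util.regex` only. It is an assumption-labelled illustration (the class and method names are hypothetical), so verify against the HBase version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-JDK illustration of unanchored versus anchored regex matching on row keys.
class RowRegexSketch {
    // Returns the row keys in which the regex is found anywhere (unanchored search).
    static List<String> matchingRows(String regex, List<String> rowKeys) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String rowKey : rowKeys) {
            if (pattern.matcher(rowKey).find()) {
                matches.add(rowKey);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // testPutList writes "stu00" + i for i = 1..10, which includes stu001 and stu0010.
        List<String> rowKeys = Arrays.asList("stu000", "stu001", "stu002", "stu0010");
        System.out.println(matchingRows("stu001", rowKeys));
        System.out.println(matchingRows("^stu001$", rowKeys));
    }
}
```

With the data written by `testPutList`, the unanchored pattern `stu001` also selects `stu0010`. Anchoring the pattern (`^stu001$`) or comparing bytes exactly with `BinaryComparator` would be closer to the SQL `where id = 1` analogy.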
428 | 
429 | * Using several filters together
430 | 
431 |     Similar to: select name from students where id = 1;
432 | 
433 |     ```java
434 |     @Test
435 |     public void testFilter() throws Exception {
436 |         // Configuration
437 |         Configuration conf = new Configuration();
438 |         conf.set("hbase.zookeeper.quorum", "192.168.0.1");
439 | 
440 |         // Open the table to query
441 |         HTable table = new HTable(conf, "students");
442 |         // Create a Scan
443 |         Scan scan = new Scan();
444 |         // First filter: row key filter
445 |         RowFilter rowFilter = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("stu002"));
446 |         // Second filter: column name prefix filter
447 |         ColumnPrefixFilter columnPrefixFilter = new ColumnPrefixFilter(Bytes.toBytes("name"));
448 |         // Create the FilterList; MUST_PASS_ALL means every filter must match
449 |         FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
450 |         filterList.addFilter(rowFilter);
451 |         filterList.addFilter(columnPrefixFilter);
452 | 
453 |         scan.setFilter(filterList);
454 |         // Query the data
455 |         ResultScanner result = table.getScanner(scan);
456 |         for (Result r : result) {
457 |             System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
458 |         }
459 |         table.close();
460 |     }
461 |     ```
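The difference between the two `FilterList` operators can be pictured as AND versus OR over the individual filters. The following toy model, which uses only JDK predicates and a hypothetical `Cell` class, is a simplification: real HBase filters are evaluated in several stages (row, then cell), which this sketch does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model of FilterList semantics; Cell and the helper methods are hypothetical.
class FilterListSketch {
    static final class Cell {
        final String row;
        final String qualifier;
        final String value;

        Cell(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // MUST_PASS_ALL analogue: keep a cell only if every filter accepts it.
    static List<Cell> mustPassAll(List<Cell> cells, List<Predicate<Cell>> filters) {
        List<Cell> kept = new ArrayList<>();
        for (Cell cell : cells) {
            if (filters.stream().allMatch(f -> f.test(cell))) {
                kept.add(cell);
            }
        }
        return kept;
    }

    // MUST_PASS_ONE analogue: keep a cell if at least one filter accepts it.
    static List<Cell> mustPassOne(List<Cell> cells, List<Predicate<Cell>> filters) {
        List<Cell> kept = new ArrayList<>();
        for (Cell cell : cells) {
            if (filters.stream().anyMatch(f -> f.test(cell))) {
                kept.add(cell);
            }
        }
        return kept;
    }

    static List<Cell> sampleCells() {
        return Arrays.asList(
                new Cell("stu002", "name", "Tom2"),
                new Cell("stu002", "gender", "M"),
                new Cell("stu003", "name", "Tom3"));
    }

    static List<Predicate<Cell>> sampleFilters() {
        List<Predicate<Cell>> filters = new ArrayList<>();
        filters.add(cell -> cell.row.equals("stu002"));          // row key filter
        filters.add(cell -> cell.qualifier.startsWith("name"));  // column name prefix filter
        return filters;
    }

    public static void main(String[] args) {
        System.out.println("all: " + mustPassAll(sampleCells(), sampleFilters()).size());
        System.out.println("one: " + mustPassOne(sampleCells(), sampleFilters()).size());
    }
}
```

In this sample only the `name` cell of `stu002` passes both filters, while every sample cell passes at least one of them.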
462 | 
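463 | ### （九）MapReduce on HBase
464 | 
465 | * No pom dependency has been identified for this yet; the required jars are all the jars under HBase's lib directory (in HBase 1.x the `org.apache.hadoop.hbase.mapreduce` classes ship in the `hbase-server` artifact)
466 | 
467 | * Test data: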
468 | 
469 |     ```shell
470 |     create 'word','content'
471 |     put 'word','1','content:info','I love Beijing'
472 |     put 'word','2','content:info','I love China'
473 |     put 'word','3','content:info','Beijing is the capital of China'
474 | 
475 |     create 'stat','content'
476 |     ```
477 | 
478 | * Mapper：
479 | 
480 |     ```java
481 |     import org.apache.hadoop.hbase.client.Result;
482 |     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
483 |     import org.apache.hadoop.hbase.mapreduce.TableMapper;
484 |     import org.apache.hadoop.hbase.util.Bytes;
485 |     import org.apache.hadoop.io.IntWritable;
486 |     import org.apache.hadoop.io.Text;
487 | 
488 |     import java.io.IOException;
489 | 
490 |     /**
491 |      * @author 曲健磊
492 |      * @date 2019-03-14 19:23:39
493 |      */
494 |     public class MyMapper extends TableMapper<Text, IntWritable> {
495 |         @Override
496 |         protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
497 |             // Input: one row read from the HBase table 'word'
498 |             String words = Bytes.toString(value.getValue(Bytes.toBytes("content"), Bytes.toBytes("info")));
499 | 
500 |             // Tokenize, e.g. I love Beijing
501 |             String[] itr = words.split(" ");
502 | 
503 |             for (String w : itr) {
504 |                 // Emit (word, 1)
505 |                 Text w1 = new Text();
506 |                 w1.set(w);
507 |                 context.write(w1, new IntWritable(1));
508 |             }
509 |         }
510 |     }
511 |     ```
512 | 
513 | * Reducer：
514 | 
515 |     ```java
516 |     import org.apache.hadoop.hbase.client.Put;
517 |     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
518 |     import org.apache.hadoop.hbase.mapreduce.TableReducer;
519 |     import org.apache.hadoop.hbase.util.Bytes;
520 |     import org.apache.hadoop.io.IntWritable;
521 |     import org.apache.hadoop.io.Text;
522 | 
523 |     import java.io.IOException;
524 | 
525 |     /**
526 |      * @author 曲健磊
527 |      * @date 2019-03-15 14:08:54
528 |      */
529 |     public class MyReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {
530 |         @Override
531 |         protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
532 |             // Sum the counts
533 |             int sum = 0;
534 |             for (IntWritable val : values) {
535 |                 sum += val.get();
536 |             }
537 |             // Output goes to an HBase table
538 |             // Build a Put, using the key (the word) as the row key
539 |             Put put = new Put(Bytes.toBytes(key.toString()));
540 | 
541 |             // Store the total in content:info
542 |             put.add(Bytes.toBytes("content"), Bytes.toBytes("info"), Bytes.toBytes(String.valueOf(sum)));
543 | 
544 |             // Write to HBase
545 |             context.write(new ImmutableBytesWritable(Bytes.toBytes(key.toString())), put);
546 |         }
547 |     }
548 |     ```
549 | 
550 | * Main:
551 | 
552 |     ```java
553 |     import org.apache.hadoop.conf.Configuration;
554 |     import org.apache.hadoop.hbase.client.Scan;
555 |     import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
556 |     import org.apache.hadoop.hbase.util.Bytes;
557 |     import org.apache.hadoop.io.IntWritable;
558 |     import org.apache.hadoop.io.Text;
559 |     import org.apache.hadoop.mapreduce.Job;
560 | 
561 |     import java.io.IOException;
562 | 
563 |     /**
564 |      * @author 曲健磊
565 |      * @date 2019-03-15 14:14:41
566 |      */
567 |     public class MyDriver {
568 |         public static void main(String[] args) throws Exception {
569 |             Configuration conf = new Configuration();
570 |             conf.set("hbase.zookeeper.quorum", "127.0.0.1");
571 | 
572 |             // Create the Job
573 |             Job job = Job.getInstance(conf);
574 |             job.setJarByClass(MyDriver.class);
575 | 
576 |             // Create the Scan
577 |             Scan scan = new Scan();
578 |             // Optionally restrict the scan to a single column
579 |             scan.addColumn(Bytes.toBytes("content"), Bytes.toBytes("info"));
580 | 
581 |             // Register the Mapper that reads from the HBase table 'word'
582 |             TableMapReduceUtil.initTableMapperJob("word", scan, MyMapper.class, Text.class, IntWritable.class, job);
583 | 
584 |             // Register the Reducer that writes to the HBase table 'stat'
585 |             TableMapReduceUtil.initTableReducerJob("stat", MyReducer.class, job);
586 | 
587 |             job.waitForCompletion(true);
588 |         }
589 |     }
590 |     ```
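As a rough cross-check of what the job should write into the `stat` table, the same split-and-sum logic can be run on the three test rows in plain Java. `WordCountSketch` is a hypothetical helper for this illustration; it uses no HBase or Hadoop classes and only mirrors the Mapper's `split(" ")` and the Reducer's summation.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK replay of the Mapper (split on spaces) and Reducer (sum per word).
class WordCountSketch {
    // The three rows inserted into the 'word' table in the test data.
    static final String[] TEST_ROWS = {
        "I love Beijing",
        "I love China",
        "Beijing is the capital of China"
    };

    static Map<String, Integer> count(String[] rows) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String row : rows) {
            for (String word : row.split(" ")) {
                // Equivalent to emitting (word, 1) and summing in the reducer.
                totals.merge(word, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> entry : count(TEST_ROWS).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Each distinct word becomes one row key in `stat`, with the total stored as a string in `content:info`; for the test data that means eight rows, with `I`, `love`, `Beijing`, and `China` each counted twice.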
591 | 
592 | * Package the code as a jar and upload it to the server
593 | 
594 | * Edit Hadoop's hadoop-env.sh and append the following:
595 | 
596 |     ```
597 |     export HADOOP_CLASSPATH=$HBASE_HOME/lib/*:${HADOOP_CLASSPATH}
598 |     ```
599 | 
600 |     > Note: HBASE_HOME must first be defined in ```/etc/profile``` and set to the HBase home directory.
601 |     
602 | * Contents of /etc/profile:
603 | 
604 |     ```shell
605 |     HBASE_HOME=/root/training/hbase-1.3.1
606 |     export HBASE_HOME
607 | 
608 |     PATH=$HBASE_HOME/bin:$PATH
609 |     export PATH
610 |     ```
611 | 
612 |     > Note: after editing ```/etc/profile```, run ```source /etc/profile``` to reload the environment variables.
613 | 
614 | 
615 | * Start HDFS: start-dfs.sh
616 | 
617 | * Start YARN: start-yarn.sh
618 | 
619 | * Start ZooKeeper: zkServer.sh start
620 | 
621 | * Start HBase: start-hbase.sh
622 | 
623 | * Submit the jar to the Hadoop cluster: hadoop jar xxx.jar
624 | 
625 | * Check the result in the hbase shell:
626 | 
627 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/09-HBase基础/imgs/hbasemapreduce.png)
628 | 
629 | ### （十）HBase HA
630 | 
631 | * Architecture:
632 | 
633 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/09-HBase基础/imgs/arc.png)
634 | 
635 | * Start a separate HMaster on one of the RegionServer hosts (it serves as a backup master):
636 | 
637 |     ```
638 |     hbase-daemon.sh start master
639 |     ```
640 | 
641 | 


--------------------------------------------------------------------------------
/09-HBase基础/imgs/arc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/09-HBase基础/imgs/arc.png


--------------------------------------------------------------------------------
/09-HBase基础/imgs/hbase-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/09-HBase基础/imgs/hbase-logo.png


--------------------------------------------------------------------------------
/09-HBase基础/imgs/hbasearc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/09-HBase基础/imgs/hbasearc.png


--------------------------------------------------------------------------------
/09-HBase基础/imgs/hbasemapreduce.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/09-HBase基础/imgs/hbasemapreduce.png


--------------------------------------------------------------------------------
/09-HBase基础/imgs/hbasetable.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/09-HBase基础/imgs/hbasetable.png


--------------------------------------------------------------------------------
/10-HBase性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/10-HBase性能优化/README.md


--------------------------------------------------------------------------------
/11-Hive基础/README.md:
--------------------------------------------------------------------------------
  1 | ## Hive
  2 | 
  3 | ### （一）What Is Hive?
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/11-Hive基础/imgs/hive_logo_medium.jpg)
  6 | 
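  7 | * A data warehouse platform built on Hadoop that provides many features for data warehouse management
  8 | 
  9 | * Originated at Facebook, in a team led by Jeff Hammerbacher
 10 | 
 11 | * Facebook contributed the Hive project to Apache in 2008
 12 | 
 13 | * Defines a SQL-like language, HiveQL; Hive can be viewed as a mapper from SQL to MapReduce
 14 | 
 15 | * Offers connections through the Hive shell, JDBC/ODBC, Thrift clients, and so on
 16 | 
 17 | ### （二）Hive Architecture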
 18 | 
 19 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/11-Hive基础/imgs/hivearc.png)
 20 | 
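 21 | * There are three main user interfaces: the CLI, JDBC/ODBC, and the WebUI
 22 | 
 23 |     * CLI: the shell command line
 24 |     * JDBC/ODBC: Hive's programmatic interface (JDBC for Java), used much like JDBC with a traditional database
 25 |     * WebUI: access to Hive through a browser
 26 | 
 27 | * Hive stores its metadata in a relational database (the metastore), such as MySQL or Derby. The metadata includes table names, each table's columns and partitions with their properties, table properties (for example whether the table is external), and the directory that holds the table's data
 28 | 
 29 | * The interpreter, compiler, and optimizer take an HQL query through lexical analysis, syntax analysis, compilation, and optimization to produce a query plan. The generated plan is stored in HDFS and later executed by MapReduce
 30 | 
 31 | * Hive's data is stored in HDFS, and most queries are executed by MapReduce (a plain select * query, such as select * from table, does not generate a MapReduce job)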
 32 | 
 33 | ### （三）Installing and Configuring Hive
 34 | 
 35 | * Hive download address: http://mirrors.shu.edu.cn/apache/hive/
 36 | 
 37 | * Extract the tarball to the target location: tar -zxvf apache-hive-2.3.0-bin.tar.gz -C ~/training/
 38 | 
 39 | * Put the MySQL driver jar (version 5.1.43 or later) into the lib directory
 40 | 
 41 | * Edit the $HIVE_HOME/conf/hive-site.xml configuration file:
 42 | 
 43 |     Config file | Parameter | Reference value
 44 |     ---|---|---
 45 |     hive-site.xml | javax.jdo.option.ConnectionURL | jdbc:mysql://localhost:3306/hive?useSSL=false
 46 |     ... | javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver
 47 |     ... | javax.jdo.option.ConnectionUserName | root
 48 |     ... | javax.jdo.option.ConnectionPassword | Welcome_1
 49 | 
 50 | * Start the MySQL database: systemctl start mysqld
 51 | 
 52 | * Start HDFS: start-dfs.sh
 53 | 
 54 | * Initialize Hive's MetaStore: schematool -dbType mysql -initSchema
 55 | 
 56 | * Start Hive: hive
 57 | 
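 58 | ### （四）Hive Data Types
 59 | 
 60 | * Primitive types
 61 | 
 62 |     * tinyint/smallint/int/bigint: integer types
 63 |     * float/double: floating-point types
 64 |     * boolean: Boolean type
 65 |     * string: string type
 66 | 
 67 | * Complex types
 68 | 
 69 |     * Array: an array type, made up of elements that all share one data type
 70 |     * Map: a collection type of key->value pairs; elements are accessed by key
 71 |     * Struct: a structure type whose elements may have different data types; the elements are accessed with "dot notation"
 72 | 
 73 | * Date/time types
 74 | 
 75 |     * Date: supported since Hive 0.12.0
 76 |     * Timestamp: supported since Hive 0.8.0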
 77 | 
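 78 | ### （五）Hive Data Model
 79 | 
 80 | #### 5.1. Hive data storage:
 81 | 
 82 | * Based on HDFS
 83 | * No dedicated data storage format
 84 | * The storage structures mainly include databases, files, tables, and views
 85 | * Text files (.txt files) can be loaded directly
 86 | * The column and row delimiters of the data are specified when the table is created
 87 | 
 88 | #### 5.2. Tables:
 89 | 
 90 | ##### 5.2.1. Inner Table (internal table)
 91 | 
 92 | * Conceptually similar to a Table in a database
 93 | * Each Table has a corresponding directory in Hive that stores its data
 94 | * All Table data (excluding External Tables) is kept in this directory
 95 | * Dropping the table deletes both the metadata and the data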
 96 | 
 97 |     ```sql
 98 |     create table emp
 99 |     (empno int,
100 |     ename string,
101 |     job string,
102 |     mgr int,
103 |     hiredate string,
104 |     sal int,
105 |     comm int,
106 |     deptno int)
107 |     row format delimited fields terminated by ',';
108 |     ```
109 | 
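110 | ##### 5.2.2. Partition Table
111 | 
112 | * A partition corresponds to a dense index on the partition column in a database
113 | 
114 | * In Hive, each partition of a table corresponds to a directory under the table's directory, and all of that partition's data is stored in that directory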
115 | 
116 |     ```sql
117 |     create table emp_part
118 |     (empno int,
119 |     ename string,
120 |     job string,
121 |     mgr int,
122 |     hiredate string,
123 |     sal int,
124 |     comm int)
125 |     partitioned by (deptno int)
126 |     row format delimited fields terminated by ',';
127 |     ```
128 | 
129 | * Inserting data into the partitioned table:
130 | 
131 |     ```sql
132 |     insert into table emp_part partition(deptno=10)
133 |     select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=10;
134 |     insert into table emp_part partition(deptno=20)
135 |     select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=20;
136 |     ```
137 | 
138 |     > An insert statement is translated into a MapReduce job, so YARN must be started first: start-yarn.sh
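
The partition-to-directory mapping described above can be sketched as follows. This is a minimal illustration, not Hive's own code: the class name `PartitionPathSketch` and the table directory `/user/hive/warehouse/emp_part` are assumptions (the real root depends on the warehouse configuration).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: mirrors the "<table dir>/<column>=<value>" layout described above.
class PartitionPathSketch {

    // Appends one "/column=value" segment per partition column, in insertion order.
    static String partitionDir(String tableDir, Map<String, String> spec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> entry : spec.entrySet()) {
            path.append('/').append(entry.getKey()).append('=').append(entry.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Assumed table directory; the real root depends on the warehouse configuration.
        String tableDir = "/user/hive/warehouse/emp_part";
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("deptno", "10");
        System.out.println(partitionDir(tableDir, spec));
    }
}
```

Each `insert ... partition(deptno=...)` statement therefore writes its files into a separate directory, which is why a query that filters on `deptno` only needs to read the matching directory.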
139 | 
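140 | ##### 5.2.3. External Table
141 | 
142 | * Points to data that already exists in HDFS; partitions can be created
143 | 
144 | * Its metadata is organized the same way as an internal table's, but the actual data is stored quite differently
145 | 
146 | * An external table involves a single step: creating the table and loading the data happen together. The data is not moved into the warehouse directory; the table only links to the external data. Dropping an external table removes only that link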
147 | 
148 |     ```sql
149 |     create external table ex_student
150 |     (sid int, sname string, age int)
151 |     row format delimited fields terminated by ','
152 |     location '/students';
153 |     ```
154 | 
155 | ##### 5.2.4. Bucket Table
156 | 
157 | * A bucketed table hashes the value of a column and stores the rows in different files according to the hash
158 | 
159 |     ```sql
160 |     create table emp_bucket
161 |     (empno int,
162 |     ename string,
163 |     job string,
164 |     mgr int,
165 |     hiredate string,
166 |     sal int,
167 |     comm int,
168 |     deptno int)
169 |     clustered by (job) into 4 buckets
170 |     row format delimited fields terminated by ',';
171 |     ```
172 | 
173 |     > Note: data cannot be loaded directly into a bucketed table; it must be inserted with an insert statement
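
The hashing idea behind `clustered by (job) into 4 buckets` can be sketched like this. It is a simplified illustration: Hive uses its own hash function, and Java's `String.hashCode` is only a stand-in here. The class name `BucketSketch` and the sample job titles are assumptions.

```java
// Illustrative only: assigns a row to a bucket from the hash of its clustering column.
class BucketSketch {

    // Clears the sign bit so the result of the modulo is never negative.
    static int bucketFor(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // Sample values for the "job" column; not taken from a real data set.
        String[] jobs = {"CLERK", "SALESMAN", "MANAGER", "ANALYST", "PRESIDENT"};
        for (String job : jobs) {
            System.out.println(job + " -> bucket " + bucketFor(job, 4));
        }
    }
}
```

Rows with the same `job` value always land in the same bucket file, which is what makes bucketing useful for sampling and for joins on the bucketed column.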
174 |     
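175 | #### 5.3. View
176 | 
177 | * A view is a virtual table, a logical concept, and can span multiple tables
178 | * A view is built on existing tables; the tables it depends on are called base tables
179 | * The purpose of a view is to simplify complex queries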
180 | * ```create view myview as select sname from student1;```
181 | 
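182 | ### (6) Importing Data into Hive
183 | 
184 | * Hive supports two ways of importing data
185 | 
186 |     * Importing with the load statement
187 |     * Importing data from a relational database with Sqoop
188 | 
189 | * Importing with the load statement
190 | 
191 |     * Data file: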
192 |     
193 |         ```
194 |         student.csv
195 |         1,Tom,23
196 |         2,Mary,24
197 |         3,Mike,22
198 |         
199 |         create table student(sid int, sname string, age int)
200 |         row format delimited fields terminated by ',';
201 |         ```
202 | 
203 |     * Loading a local data file:
204 |     
205 |         ```sql
206 |         load data local inpath '/root/training/data/student.csv' into table student;
207 |         ```
208 |     
209 |         > Note: Hive's default field delimiter is Ctrl-A (\001), not a comma, so the delimiter must be specified when the table is created.
210 |     
211 |         ```sql
212 |         create table student1
213 |         (sid int,sname string,age int)
214 |         row format delimited fields terminated by ',';
215 |         ```
216 | 
217 |     * Loading data that is already on HDFS:
218 | 
219 |         ```sql
220 |         create table student2
221 |         (sid int,sname string,age int)
222 |         row format delimited fields terminated by ',';
223 |         ```
224 |         
225 |         ```sql
226 |         load data inpath '/input/student.csv' into table student2;
227 |         ```
228 | 
229 | * Importing data from a relational database with Sqoop
230 | 
231 |     * Copy the structure of a relational table into Hive:
232 |     
233 |         ```shell
234 |         sqoop create-hive-table --connect jdbc:mysql://localhost:3306/test --username root --password 123 --table student --hive-table student
235 |         ```
236 |     
237 |         > Note: --table student is the table in the MySQL database test, and --hive-table student is the name of the new table in Hive
238 |     
239 |     * Import data from the relational database into Hive:
240 |         
241 |         ```shell
242 |         sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123 --table student --hive-import
243 |         ```
244 |     
245 |     * Export Hive data to MySQL:
246 |     
247 |         ```shell
248 |         sqoop export --connect jdbc:mysql://localhost:3306/test --username root --password 123 --table uv_info --export-dir /user/hive/warehouse/uv/dt=2011-08-03
249 |         ```
250 | 
251 | ### (7) Hive Queries
252 | 
253 | 
254 | 
255 | 
256 | ### (8) Hive Client Access: JDBC
257 | 
258 | 
259 | ### (9) Hive User-Defined Functions
260 | 
261 | 
262 | ### (10) Troubleshooting Common Hive Problems
263 | 
264 | * After starting Hive, the ```show tables``` command fails with: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
265 | 
266 |     Add the following configuration to hive-site.xml:
267 |     ```
268 |     <property>
269 |         <name>hive.metastore.schema.verification</name>
270 |         <value>false</value>
271 |     </property>
272 |     ```
273 | 
274 | * How to convert a date-formatted string such as "2019-04-04 21:13:20" into a date value in Hive:
275 | 
276 | 	1. Convert the date string to a Unix timestamp:
277 | 	
278 | 		select unix_timestamp(action_time, 'yyyy-MM-dd HH:mm:ss') from tab;
279 | 
280 | 	2. Convert the Unix timestamp back into a formatted date:
281 | 	
282 | 		select from_unixtime(unix_timestamp(action_time, 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd HH:mm:ss') from tab;
283 | 
284 | * When a MySQL table is copied to Hive, the Hive table is created with a default field delimiter. To change the delimiter (here, to a tab):
285 | 
286 | 	```sql
287 | 	alter table tbname set SERDEPROPERTIES('field.delim'='\t');
288 | 	```
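
For the date-conversion question above, the same two steps can be reproduced in plain Java to check what a given string should turn into. This is a sketch with assumed names (`DateConversionSketch`, `toUnixSeconds`, `fromUnixSeconds`); it passes the time zone explicitly, whereas the Hive functions use the time zone of the environment they run in.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative counterparts of unix_timestamp(...) and from_unixtime(...).
class DateConversionSketch {

    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Step 1: date string -> seconds since 1970-01-01 00:00:00 in the given zone.
    static long toUnixSeconds(String text, ZoneId zone) {
        return LocalDateTime.parse(text, FMT).atZone(zone).toEpochSecond();
    }

    // Step 2: seconds since the epoch -> formatted date string in the given zone.
    static String fromUnixSeconds(long seconds, ZoneId zone) {
        return LocalDateTime.ofInstant(Instant.ofEpochSecond(seconds), zone).format(FMT);
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneOffset.UTC; // assumption: UTC chosen to make the result reproducible
        long seconds = toUnixSeconds("2019-04-04 21:13:20", zone);
        System.out.println(seconds);
        System.out.println(fromUnixSeconds(seconds, zone));
    }
}
```

Converting to seconds and back with the same pattern and zone returns the original string, which makes the round trip a convenient sanity check for a new format pattern.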
289 | 


--------------------------------------------------------------------------------
/11-Hive基础/imgs/hive_logo_medium.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/11-Hive基础/imgs/hive_logo_medium.jpg


--------------------------------------------------------------------------------
/11-Hive基础/imgs/hivearc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/11-Hive基础/imgs/hivearc.png


--------------------------------------------------------------------------------
/12-Hive性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/12-Hive性能优化/README.md


--------------------------------------------------------------------------------
/13-Sqoop基础/README.md:
--------------------------------------------------------------------------------
  1 | ## Sqoop
  2 | 
  3 | ### (1) What is Sqoop?
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/13-Sqoop基础/imgs/sqoop-logo.png)
  6 | 
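  7 | Sqoop is an open-source tool for transferring data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (for example MySQL, Oracle, or Postgres) into HDFS, and it can export data from HDFS into a relational database.
  8 | 
  9 | The Sqoop project started in 2009 as a third-party module for Hadoop. It later became an independent Apache project so that users could deploy it quickly and developers could iterate faster.
 10 | 
 11 | ### (2) How does Sqoop work?
 12 | 
 13 | * Under the hood, it connects to the database through JDBC.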
 14 | 
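 15 | ### (3) Installing and Configuring Sqoop
 16 | 
 17 | * Sqoop download: http://mirror.bit.edu.cn/apache/sqoop/1.4.7/
 18 | 
 19 | * Extract the tarball to the target directory
 20 | 
 21 | * Add the Sqoop root directory to the environment variables
 22 | 
 23 | * Copy MySQL's mysql-connector-java-5.1.43-bin.jar into Sqoop's lib directory
 24 | 
 25 | ### (4) Using Sqoop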
 26 | 
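 27 | Command | Description
 28 | ---|---
 29 | codegen | Maps a relational table to a Java source file, a Java class, and the related jar
 30 | create-hive-table | Creates a Hive table whose structure matches a relational table
 31 | eval | Runs a SQL statement against the relational database and prints the result to the console, so a statement can be checked before it is used in an import
 32 | export | Exports data from HDFS into a relational database
 33 | help | Lists the available commands
 34 | import | Imports the data of a database table into HDFS
 35 | import-all-tables | Imports the data of every table in a database into HDFS
 36 | job | Creates a saved Sqoop job; the job does not run until it is explicitly executed
 37 | list-databases | Lists the names of all databases on the server
 38 | list-tables | Lists the names of all tables in a database
 39 | merge | Merges data from different HDFS directories and stores the result in a specified directory
 40 | metastore | Records the metadata of Sqoop jobs
 41 | version | Shows Sqoop version information
 42 | 
 43 | Option | Description
 44 | ---|---
 45 | -m | Number of map tasks to run in parallel
 46 | --split-by | Column used to split the data (preferably an int column; otherwise setting it is not recommended)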
 47 | 
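How `-m` and `--split-by` work together can be sketched as follows: the range between the smallest and largest value of the split column is cut into one contiguous slice per map task, and each task imports only its slice. This is a simplified illustration, not Sqoop's actual splitter; the class name `SplitRangeSketch` and the column name `id` are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: divides an integer key range into one slice per map task.
class SplitRangeSketch {

    // Returns inclusive {low, high} pairs covering [min, max] with at most numMappers slices.
    static List<long[]> splits(long min, long max, int numMappers) {
        if (numMappers < 1) {
            throw new IllegalArgumentException("numMappers must be at least 1");
        }
        List<long[]> ranges = new ArrayList<>();
        long total = max - min + 1;
        long size = (total + numMappers - 1) / numMappers; // ceiling division
        for (long low = min; low <= max; low += size) {
            long high = Math.min(low + size - 1, max);
            ranges.add(new long[] {low, high});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed key range 1..100 and 4 map tasks (-m 4).
        for (long[] range : splits(1, 100, 4)) {
            System.out.println("WHERE id >= " + range[0] + " AND id <= " + range[1]);
        }
    }
}
```

An evenly distributed integer column gives slices of similar size, which is why an int column is the preferred choice for `--split-by`; a skewed or string column tends to leave some map tasks with far more rows than others.
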
 48 | ### (5) Examples
 49 | 
 50 | * Example 1: map a MySQL table to a Java file
 51 | 
 52 | 	```shell
 53 | 	sqoop codegen --connect jdbc:mysql://localhost:3306/dbname --username root --password 123 --table emp
 54 | 	```
 55 | 
 56 | * Example 2: create a Hive table with the same structure as a MySQL table
 57 | 
 58 | 	```shell
 59 | 	sqoop create-hive-table --connect jdbc:mysql://localhost:3306/dbname --username root --password Welcome_1 --table emp --hive-table emp
 60 | 	```
 61 | 
 62 | 	> Note: hive-common-2.3.3.jar from hive/lib must be copied into Sqoop's lib directory, otherwise the command fails.
 63 | 
 64 | * Example 3: check whether a SQL statement is correct with Sqoop
 65 | 
 66 | 	```shell
 67 | 	sqoop eval --connect jdbc:mysql://localhost:3306/dbname --username root --password Welcome_1 --query 'select * from cate'
 68 | 	```
 69 | 
 70 | * Example 4: import MySQL data into HDFS
 71 | 
 72 | 	```shell
 73 | 	sqoop import --connect jdbc:mysql://localhost:3306/dbname --username root --password Welcome_1 --table cate --target-dir /data
 74 | 	```
 75 | 
 76 | * Example 5: import the data of all MySQL tables into HDFS
 77 | 
 78 | 	```shell
 79 | 	sqoop import-all-tables "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" --connect jdbc:mysql://localhost:3306/jzgyl --username root --password Welcome_1
 80 | 	```
 81 | 
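 82 | 	> Note: the "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" option allows the import to proceed when a table's primary key is a string. Imported tables are stored under /user/root in HDFS by default, and a Java file for each table is generated in the directory where the command was run.
 83 | 
 84 | * Example 6: export HDFS data to MySQL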
 85 | 
 86 | 	```shell
 87 | 	sqoop export --connect jdbc:mysql://localhost:3306/jzgyl --username root --password Welcome_1 --table cate --export-dir /data
 88 | 	```
 89 | 
 90 | 	> Note: if MySQL is not configured to use utf8 consistently in its configuration file, the exported text will be garbled.
 91 | 
 92 | 
 93 | * Example 7: list all databases
 94 | 
 95 | 	```shell
 96 | 	sqoop list-databases --connect jdbc:mysql://localhost:3306/jzgyl --username root --password Welcome_1
 97 | 	```
 98 | 
 99 | * Example 8: list all tables in a database
100 | 
101 | 	```shell
102 | 	sqoop list-tables --connect jdbc:mysql://localhost:3306/jzgyl --username root --password Welcome_1
103 | 	```
104 | 
105 | * Example 9: show the Sqoop version
106 | 
107 | 	```shell
108 | 	sqoop version
109 | 	```
110 | 
111 | * Example 10: import MySQL table data into HBase
112 | 
113 | 	```shell
114 | 	sqoop import --connect jdbc:mysql://localhost:3306/jzgyl --username root --password Welcome_1 --table cate --columns id,name,create_time,update_time --hbase-table cate --hbase-row-key id --column-family info
115 | 	```
116 | 
117 | 	> PS: when I ran this command, the console showed the submitted MapReduce job as Killed and I assumed it had failed, but the YARN web UI showed that it had completed successfully.
118 | 
119 | 
120 | 


--------------------------------------------------------------------------------
/13-Sqoop基础/imgs/sqoop-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/13-Sqoop基础/imgs/sqoop-logo.png


--------------------------------------------------------------------------------
/14-Sqoop性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/14-Sqoop性能优化/README.md


--------------------------------------------------------------------------------
/15-Flume基础/README.md:
--------------------------------------------------------------------------------
  1 | ## Flume
  2 | 
  3 | ### (1) What is Flume?
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/15-Flume基础/imgs/flume-logo.png)
  6 | 
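  7 | Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting large volumes of log data. Flume supports custom data senders in a logging system for collecting data, and it can also apply simple processing to the data and write it to various (customizable) receivers.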
  8 | 
  9 | 
 10 | ### (2) Flume Architecture
 11 | 
 12 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/15-Flume基础/imgs/flume-arc.png)
 13 | 
 14 | * Source: collects log data
 15 | 
 16 | * Channel: buffers log data
 17 | 
 18 | * Sink: moves the log data buffered in the Channel to a destination (HDFS in this diagram)
 19 | 
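The three roles above can be mimicked in a few lines of plain Java. This is only a conceptual sketch, not Flume's API: a bounded queue stands in for the Channel, and the names `PipelineSketch`, `source`, and `sink` are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: Source -> Channel -> Sink modeled with a bounded in-memory queue.
class PipelineSketch {

    // Source: offers each event to the channel and returns how many the channel accepted.
    static int source(List<String> events, BlockingQueue<String> channel) {
        int accepted = 0;
        for (String event : events) {
            if (channel.offer(event)) { // offer returns false when the channel is full
                accepted++;
            }
        }
        return accepted;
    }

    // Sink: takes up to batchSize buffered events out of the channel, in arrival order.
    static List<String> sink(BlockingQueue<String> channel, int batchSize) {
        List<String> delivered = new ArrayList<>();
        channel.drainTo(delivered, batchSize);
        return delivered;
    }

    public static void main(String[] args) {
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(1000); // assumed capacity
        source(Arrays.asList("line 1", "line 2", "line 3"), channel);
        System.out.println(sink(channel, 100));
    }
}
```

The Channel decouples the two sides: the Source can keep accepting events while the Sink is slow, up to the channel's capacity.
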
 20 | ### (3) Installing and Configuring Flume
 21 | 
 22 | * Download Flume: http://flume.apache.org/download.html
 23 | 
 24 | * Extract: tar -zxvf apache-flume-1.7.0-bin.tar.gz -C ~/training
 25 | 
 26 | * Rename flume-env.sh.template to flume-env.sh
 27 | 
 28 | * Edit conf/flume-env.sh and set JAVA_HOME
 29 | 
 30 | ### (4) Collecting Log Data with Flume
 31 | 
 32 | Create a myagent directory under the Flume root directory, and create the following configuration files in it:
 33 | 
 34 | #### Example 1: tail a file and print newly appended content to the console
 35 | 
 36 | ```shell
 37 | #bin/flume-ng agent -n a1 -f myagent/a1.conf -c conf -Dflume.root.logger=INFO,console
 38 | #Define the agent's source, channel, and sink names
 39 | a1.sources=r1
 40 | a1.channels=c1
 41 | a1.sinks=k1
 42 | 
 43 | #Define the source
 44 | a1.sources.r1.type=exec
 45 | a1.sources.r1.command=tail -F /root/logs/a.log
 46 | 
 47 | #Define the channel
 48 | a1.channels.c1.type=memory
 49 | a1.channels.c1.capacity=1000
 50 | a1.channels.c1.transactionCapacity=100
 51 | 
 52 | #Define the sink
 53 | a1.sinks.k1.type=logger
 54 | 
 55 | #Wire the source, channel, and sink together
 56 | a1.sources.r1.channels=c1
 57 | a1.sinks.k1.channel=c1
 58 | ```
 59 | 
 60 | #### Example 2: watch a directory and print the content of each new file to the console
 61 | 
 62 | ```shell
 63 | #bin/flume-ng agent -n a2 -f myagent/a2.conf -c conf -Dflume.root.logger=INFO,console
 64 | #Define the agent's source, channel, and sink names
 65 | a2.sources=r1
 66 | a2.channels=c1
 67 | a2.sinks=k1
 68 | 
 69 | #Define the source
 70 | a2.sources.r1.type=spooldir
 71 | a2.sources.r1.spoolDir=/root/logs/
 72 | 
 73 | #Define the channel
 74 | a2.channels.c1.type=memory
 75 | a2.channels.c1.capacity=1000
 76 | a2.channels.c1.transactionCapacity=100
 77 | 
 78 | #Define the sink
 79 | a2.sinks.k1.type=logger
 80 | 
 81 | #Wire the source, channel, and sink together
 82 | a2.sources.r1.channels=c1
 83 | a2.sinks.k1.channel=c1
 84 | ```
 85 | 
 86 | #### Example 3: watch a directory and copy each new file to a directory on HDFS
 87 | 
 88 | ```shell
 89 | #bin/flume-ng agent -n a3 -f myagent/a3.conf -c conf -Dflume.root.logger=INFO,console
 90 | #Define the agent's source, channel, and sink names
 91 | a3.sources=r1
 92 | a3.channels=c1
 93 | a3.sinks=k1
 94 | 
 95 | #Define the source
 96 | a3.sources.r1.type=spooldir
 97 | a3.sources.r1.spoolDir=/root/logs
 98 | 
 99 | #Define an interceptor on the source that adds a timestamp to each event
100 | a3.sources.r1.interceptors=i1
101 | a3.sources.r1.interceptors.i1.type=org.apache.flume.interceptor.TimestampInterceptor$Builder
102 | 
103 | #Define the channel
104 | a3.channels.c1.type=memory
105 | a3.channels.c1.capacity=1000
106 | a3.channels.c1.transactionCapacity=100
107 | 
108 | #Define the sink
109 | a3.sinks.k1.type=hdfs
110 | a3.sinks.k1.hdfs.path=hdfs://127.0.0.1:9000/flume/%Y%m%d
111 | a3.sinks.k1.hdfs.filePrefix=events-
112 | a3.sinks.k1.hdfs.fileType=DataStream
113 | 
114 | #Do not roll files by event count (roll by size or by time instead)
115 | a3.sinks.k1.hdfs.rollCount=0
116 | a3.sinks.k1.hdfs.rollSize=134217728
117 | a3.sinks.k1.hdfs.rollInterval=60
118 | 
119 | #Wire the source, channel, and sink together
120 | a3.sources.r1.channels=c1
121 | a3.sinks.k1.channel=c1
122 | ```
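
The three roll settings in the configuration above combine as an "any trigger fires" rule, where 0 switches a trigger off. The sketch below illustrates that rule only; it is not the HDFS sink's code, and the names `RollPolicySketch` and `shouldRoll` are assumptions.

```java
// Illustrative only: decides when the current HDFS file is closed and a new one is started.
class RollPolicySketch {

    // A limit of 0 disables that trigger; otherwise reaching the limit causes a roll.
    static boolean shouldRoll(long events, long bytes, long ageSeconds,
                              long rollCount, long rollSize, long rollInterval) {
        boolean byCount = rollCount > 0 && events >= rollCount;
        boolean bySize = rollSize > 0 && bytes >= rollSize;
        boolean byAge = rollInterval > 0 && ageSeconds >= rollInterval;
        return byCount || bySize || byAge;
    }

    public static void main(String[] args) {
        // Limits taken from the a3 configuration: count disabled, 128 MB, 60 seconds.
        long rollCount = 0, rollSize = 134217728L, rollInterval = 60;
        System.out.println(shouldRoll(5_000_000, 1024, 10, rollCount, rollSize, rollInterval));
        System.out.println(shouldRoll(10, 1024, 61, rollCount, rollSize, rollInterval));
    }
}
```

With these values a file is closed after 60 seconds or once it reaches 128 MB (134217728 bytes), whichever comes first, no matter how many events it holds.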
123 | 
124 | #### Example 4: watch a directory and push the data of each new file to Kafka
125 | 
126 | ```shell
127 | #bin/flume-ng agent -n a4 -f myagent/a4.conf -c conf -Dflume.root.logger=INFO,console
128 | #Define the agent's source, channel, and sink names
129 | a4.sources = r1
130 | a4.channels = c1
131 | a4.sinks = k1
132 | 
133 | #Define the source
134 | a4.sources.r1.type = spooldir
135 | a4.sources.r1.spoolDir = /root/logs
136 | 
137 | #Define the channel
138 | a4.channels.c1.type = memory
139 | a4.channels.c1.capacity = 10000
140 | a4.channels.c1.transactionCapacity = 100
141 | 
142 | #Set the Kafka sink
143 | a4.sinks.k1.type= org.apache.flume.sink.kafka.KafkaSink
144 | 
145 | #Set the Kafka broker address and port
146 | #On HDP clusters the default Kafka broker port is 6667, not 9092
147 | a4.sinks.k1.brokerList=qujianlei:9092
148 | 
149 | #Set the Kafka topic
150 | a4.sinks.k1.topic=mytopic
151 | 
152 | #Set the serializer
153 | a4.sinks.k1.serializer.class=kafka.serializer.StringEncoder
154 | 
155 | #Wire the source, channel, and sink together
156 | a4.sources.r1.channels = c1
157 | a4.sinks.k1.channel = c1
158 | ```
159 | 
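160 | ### Common Problems
161 | 
162 | 1. The Flume process dies frequently because of insufficient memory
163 | 
164 | 	Change the -Xmx value of the JAVA_OPTS variable in $FLUME_HOME/bin/flume-ng
165 | 
166 | 2. Kafka limits message size to 1 MB by default for both consuming and producing, so when Flume reads from or writes to Kafka with larger messages, add the corresponding parameter to the agent configuration file
167 | 	* Source (consuming): ```agent.sources.r1.kafka.consumer.max.partition.fetch.bytes=10240000```
168 | 	* Sink (producing): ```agent.sinks.k1.kafka.producer.max.request.size=10240000```
169 | 
170 | 3. After installing Ganglia to monitor Flume, the browser reports insufficient permissions
171 | 
172 | 	* Changing ```/etc/httpd/conf.d/ganglia.conf``` to the following did not help: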
173 |     ```shell
174 |     Alias /ganglia /usr/share/ganglia
175 | 
176 |     <Location /ganglia>
177 |       Order deny,allow
178 |       Deny from all
179 |       Allow from all
180 |       # Allow from 127.0.0.1
181 |       # Allow from ::1
182 |       # Allow from .example.com
183 |     </Location>
184 |     ```
185 | 
186 | 	* Changing ```/etc/selinux/config``` to the following did not help:
187 |     ```shell
188 |     # This file controls the state of SELinux on the system.
189 |     # SELINUX= can take one of these three values:
190 |     #     enforcing - SELinux security policy is enforced.
191 |     #     permissive - SELinux prints warnings instead of enforcing.
192 |     #     disabled - No SELinux policy is loaded.
193 |     SELINUX=disabled
194 |     # SELINUXTYPE= can take one of these two values:
195 |     #     targeted - Targeted processes are protected,
196 |     #     mls - Multi Level Security protection.
197 |     SELINUXTYPE=targeted
198 |     ```
199 | 
200 |     * Running the following command, or rebooting, did not help:
201 |     ```shell
202 |     setenforce 0
203 |     ```
204 | 
205 |     * Changing the permissions of the /var/lib/ganglia directory to 777 did not help either, and by this point I was ready to give up:
206 |     ```shell
207 |     chmod -R 777 /var/lib/ganglia
208 |     ```
209 | 
210 | 	* After going out for a 3 km run to cool off, I found the answer:
211 | 	Changing ```/etc/httpd/conf/httpd.conf``` to the following made the page accessible:
212 |     ```shell
213 |     <Directory />
214 |         AllowOverride none
215 |         #Require all denied
216 |     </Directory>
217 |     ```
218 | 	Before the change it looked like this, which suggests that the HTTP server denied all access by default:
219 |     ```shell
220 |     <Directory />
221 |         AllowOverride none
222 |         Require all denied
223 |     </Directory>
224 |     ```
225 | 	The page after the fix:
226 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/15-Flume基础/imgs/pic.png)
227 | 
228 | ### Other
229 | 
230 | * netcat installation tutorial: https://blog.csdn.net/z1941563559/article/details/81347981
231 | 
232 | 
233 | 


--------------------------------------------------------------------------------
/15-Flume基础/imgs/flume-arc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/15-Flume基础/imgs/flume-arc.png


--------------------------------------------------------------------------------
/15-Flume基础/imgs/flume-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/15-Flume基础/imgs/flume-logo.png


--------------------------------------------------------------------------------
/15-Flume基础/imgs/pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/15-Flume基础/imgs/pic.png


--------------------------------------------------------------------------------
/16-Flume性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/16-Flume性能优化/README.md


--------------------------------------------------------------------------------
/17-ZooKeeper/README.md:
--------------------------------------------------------------------------------
  1 | ## Zookeeper
  2 | 
  3 | ### (1) What is ZooKeeper?
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/17-ZooKeeper/imgs/zookeeper-logo.png)
  6 | 
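  7 | ZooKeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.
  8 | 
  9 | ### (2) ZooKeeper Architecture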
 10 | 
 11 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/17-ZooKeeper/imgs/zookeeper-arc.png)
 12 | 
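 13 | ### (3) What can ZooKeeper do for us?
 14 | 
 15 | * Hadoop 2.0 uses ZooKeeper to implement HA (high availability with multiple NameNodes), relying on its event handling to ensure that only one NameNode in the cluster is active, and to store configuration information.
 16 | 
 17 | * HBase uses ZooKeeper's event handling to ensure that the cluster has only one HMaster, to detect HRegionServers coming online and going down, and to store access control lists.
 18 | 
 19 | ### (4) Installing, Configuring, and Starting ZooKeeper
 20 | 
 21 | 1. Download: https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/
 22 | 
 23 | 2. Extract: tar -zxvf zookeeper-3.4.14.tar.gz
 24 | 
 25 | 3. Edit ```$ZOOKEEPER_HOME/conf/zoo.cfg``` as follows: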
 26 |     ```shell
 27 |     dataDir=$ZOOKEEPER_HOME/tmp
 28 | 
 29 |     server.1=hostname1:2888:3888
 30 |     server.2=hostname2:2888:3888
 31 |     server.3=hostname3:2888:3888
 32 |     ```
 33 | 
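 34 | 	> Port 2888 is used for data synchronization between ZooKeeper nodes
 35 | 	> Port 3888 is used to elect a new leader when the current leader dies
 36 | 
 37 | 4. Create a myid file containing this server's ID in the ```$ZOOKEEPER_HOME/tmp``` directory: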
 38 |     ```shell
 39 |     echo 1 > $ZOOKEEPER_HOME/tmp/myid
 40 |     ```
 41 | 
 42 | 5. Copy the configured ZooKeeper to the other machines, and change the myid file on each of them to 2 and 3 respectively:
 43 |     ```shell
 44 |     scp -r $ZOOKEEPER_HOME/ hostname2:/$ZOOKEEPER_HOME
 45 |     scp -r $ZOOKEEPER_HOME/ hostname3:/$ZOOKEEPER_HOME
 46 |     ```
 47 | 
 48 | 6. Start ZooKeeper: ```zkServer.sh start```
 49 | 
 50 | 7. Check ZooKeeper status: ```zkServer.sh status```
 51 | 
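The three-server layout configured above relates to how a ZooKeeper ensemble stays available: it keeps serving as long as a majority of its servers can talk to each other. The arithmetic can be sketched as follows; the class and method names are assumptions chosen for illustration.

```java
// Illustrative only: majority arithmetic for a ZooKeeper ensemble.
class QuorumSketch {

    // Smallest number of servers that forms a majority.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of servers that can fail while a majority is still reachable.
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " servers: quorum " + quorumSize(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

A fourth server raises the quorum without tolerating more failures than three servers do, which is why ensembles are usually sized with an odd number of servers.
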
 52 | ### (5) Common ZooKeeper Commands
 53 | 
 54 | 1. ls -- list the children of a znode (similar to listing a directory), for example:
 55 |     ```shell
 56 |     [zk: 127.0.0.1:2181(CONNECTED) 1] ls /
 57 |     ```
 58 | 
 59 | 2. ls2 -- like ls, but also shows information such as time and version, for example:
 60 |     ```shell
 61 |     [zk: 127.0.0.1:2181(CONNECTED) 1] ls2 /
 62 |     ```
 63 | 
 64 | 3. create -- create a znode and set its initial content, for example:
 65 |     ```shell
 66 |     [zk: 127.0.0.1:2181(CONNECTED) 1] create /test "test" 
 67 |     ```
 68 | 
 69 | 4. get -- get the data of a znode, for example:
 70 |     ```shell
 71 |     [zk: 127.0.0.1:2181(CONNECTED) 1] get /test
 72 |     ```
 73 | 
 74 | 5. set -- change the content of a znode, for example:
 75 |     ```shell
 76 |     [zk: 127.0.0.1:2181(CONNECTED) 1] set /test "ricky"
 77 |     ```
 78 | 
 79 | 6. delete -- delete a znode, for example:
 80 |     ```shell
 81 |     [zk: 127.0.0.1:2181(CONNECTED) 1] delete /test
 82 |     ```
 83 | 
 84 | 7. quit -- exit the client, for example:
 85 |     ```shell
 86 |     [zk: 127.0.0.1:2181(CONNECTED) 1] quit
 87 |     ```
 88 | 
 89 | 8. help -- show the help text, for example:
 90 |     ```shell
 91 |     [zk: 127.0.0.1:2181(CONNECTED) 1] help
 92 |     ```
 93 | 
 94 | ### (6) ZooKeeper Use Cases
 95 | 
 96 | * Implementing a flash sale with ZooKeeper's distributed lock:
 97 |     pom dependencies:
 98 |     ```xml
 99 |     <dependencies>
100 |         <dependency>
101 |             <groupId>org.apache.curator</groupId>
102 |             <artifactId>curator-framework</artifactId>
103 |             <version>4.0.0</version>
104 |         </dependency>
105 |         <dependency>
106 |             <groupId>org.apache.curator</groupId>
107 |             <artifactId>curator-recipes</artifactId>
108 |             <version>4.0.0</version>
109 |         </dependency>
110 |         <dependency>
111 |             <groupId>org.apache.curator</groupId>
112 |             <artifactId>curator-client</artifactId>
113 |             <version>4.0.0</version>
114 |         </dependency>
115 |         <dependency>
116 |             <groupId>org.apache.zookeeper</groupId>
117 |             <artifactId>zookeeper</artifactId>
118 |             <version>3.4.6</version>
119 |         </dependency>
120 |         <dependency>
121 |             <groupId>com.google.guava</groupId>
122 |             <artifactId>guava</artifactId>
123 |             <version>16.0.1</version>
124 |         </dependency>
125 |     </dependencies>
126 |     ```
127 | 
128 |     Main program:
129 |     ```java
130 |     package com.qjl.kafkatest.producer;
131 | 
132 |     import org.apache.curator.RetryPolicy;
133 |     import org.apache.curator.framework.CuratorFramework;
134 |     import org.apache.curator.framework.CuratorFrameworkFactory;
135 |     import org.apache.curator.framework.recipes.locks.InterProcessMutex;
136 |     import org.apache.curator.retry.ExponentialBackoffRetry;
137 | 
138 |     public class TestDistributedLock {
139 | 
140 |         private static int number = 10;
141 | 
142 |         private static void getNumber() {
143 |             System.out.println("\n\n******* Business method start ************");
144 |             System.out.println("Current value: " + number);
145 |             number--;
146 |             try {
147 |                 Thread.sleep(2000);
148 |             } catch (InterruptedException e) {
149 |                 e.printStackTrace();
150 |             }
151 |         }
152 | 
153 |         public static void main(String[] args) {
154 |             // Retry up to 10 times with exponential backoff, starting from a 1000 ms base sleep
155 |             RetryPolicy policy = new ExponentialBackoffRetry(1000, 10);
156 |             CuratorFramework cf = CuratorFrameworkFactory.builder()
157 |                     .connectString("qujianlei:2181")
158 |                     .retryPolicy(policy)
159 |                     .build();
160 |             cf.start();
161 | 
162 |             final InterProcessMutex lock = new InterProcessMutex(cf, "/aaa");
163 | 
164 |             for (int i = 0; i < 10; i++) {
165 |                 new Thread(new Runnable() {
166 |                     @Override
167 |                     public void run() {
168 |                         try {
169 |                             lock.acquire();
170 |                             getNumber();
171 |                         } catch (Exception e) {
172 |                             e.printStackTrace();
173 |                         } finally {
174 |                             try {
175 |                                 lock.release();
176 |                             } catch (Exception e) {
177 |                                 e.printStackTrace();
178 |                             }
179 |                         }
180 |                     }
181 |                 }).start();
182 |             }
183 |         }
184 |     }
185 |     ```
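
The program above calls `lock.acquire()` inside the `try` block, so `lock.release()` still runs in `finally` when the acquire itself failed. A common arrangement is to acquire before entering `try`. The sketch below shows that arrangement with an in-process `ReentrantLock` as a stand-in, so it runs without a ZooKeeper server; it does not coordinate across machines the way `InterProcessMutex` does, and the names `LocalLockSketch`, `takeOne`, and `remainingAfter` are assumptions.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: same acquire / work / release shape, using a local lock instead of Curator.
class LocalLockSketch {

    private static int number;
    private static final ReentrantLock lock = new ReentrantLock();

    // Takes one item if any is left; the lock is acquired before the try block.
    static void takeOne() {
        lock.lock();
        try {
            if (number > 0) {
                number--;
            }
        } finally {
            lock.unlock(); // reached only when the lock is actually held
        }
    }

    // Starts one thread per buyer and returns the stock left after all of them finish.
    static int remainingAfter(int stock, int buyers) {
        number = stock;
        Thread[] threads = new Thread[buyers];
        for (int i = 0; i < buyers; i++) {
            threads[i] = new Thread(LocalLockSketch::takeOne);
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return number;
    }

    public static void main(String[] args) {
        System.out.println("Remaining stock: " + remainingAfter(10, 15));
    }
}
```

The `number > 0` check inside the locked section is what keeps the stock from going negative when there are more buyers than items.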
186 | 
187 | 
188 | 
189 | 
190 | 
191 | 
192 | 
193 | 
194 | 
195 | 
196 | 


--------------------------------------------------------------------------------
/17-ZooKeeper/imgs/zookeeper-arc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/17-ZooKeeper/imgs/zookeeper-arc.png


--------------------------------------------------------------------------------
/17-ZooKeeper/imgs/zookeeper-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/17-ZooKeeper/imgs/zookeeper-logo.png


--------------------------------------------------------------------------------
/18-Zookeeper性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/18-Zookeeper性能优化/README.md


--------------------------------------------------------------------------------
/19-Redis基础/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/19-Redis基础/README.md


--------------------------------------------------------------------------------
/20-Redis性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/20-Redis性能优化/README.md


--------------------------------------------------------------------------------
/21-Storm基础/README.md:
--------------------------------------------------------------------------------
  1 | ## Storm
  2 | 
  3 | ### (1) What is Storm?
  4 | 
  5 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-log.png)
  6 | 
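  7 | Storm provides a set of general primitives for distributed real-time computation. It can be used for "stream processing", processing messages and updating databases in real time, which is an alternative way of managing queues and worker clusters. Storm can also be used for "continuous computation", running continuous queries over data streams and streaming the results to users as they are computed. It can also be used for "distributed RPC", running expensive computations in parallel.
  8 | 
  9 | Storm makes it easy to write and scale complex real-time computations on a cluster of computers. Storm is to real-time processing what Hadoop is to batch processing. Storm guarantees that every message will be processed, and it is fast: a small cluster can process millions of messages per second. Better still, you can develop in any programming language.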
 10 | 
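 11 | ### (2) Offline Computing and Stream Computing
 12 | 
 13 | #### Offline computing
 14 | 
 15 | * Offline computing: data is acquired in batches, transferred in batches, computed periodically in batches, and then displayed
 16 | 
 17 | * Representative technologies: Sqoop for batch import, HDFS for batch storage, MapReduce for batch computation, and Hive
 18 | 
 19 | #### Stream computing
 20 | 
 21 | * Stream computing: data is produced, transferred, computed, and displayed in real time
 22 | 
 23 | * Representative technologies: Flume for real-time collection, Kafka/metaq for real-time storage, Storm/JStorm/SparkStreaming/Flink for real-time computation, Redis for caching real-time results, and persistent storage (MySQL).
 24 | 
 25 | * In one sentence: continuously produced data is collected and computed in real time so that results are available as quickly as possible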
 26 | 
 27 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/water.png)
 28 | 
 29 | #### Differences between Storm and Hadoop
 30 | 
 31 | Storm | Hadoop
 32 | ---|---
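 33 | Storm is used for real-time computing | Hadoop is used for offline computing
 34 | Storm processes data held in memory, arriving continuously | Hadoop processes data held in the file system, batch by batch
 35 | Storm's data arrives over the network | Hadoop's data is stored on disk
 36 | Storm's programming model is similar to Hadoop's | Hadoop's programming model is similar to Storm's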
 37 | 
 38 | ### (3) Storm Architecture
 39 | 
 40 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/arc1.png)
 41 | 
 42 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/arc2.png)
 43 | 
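 44 | * **Nimbus**: responsible for resource allocation and task scheduling
 45 | * **Supervisor**: receives the tasks assigned by nimbus, and starts and stops the worker processes it manages. The number of workers started on a supervisor is set in the configuration file.
 46 | * **Worker**: a process that runs the actual business logic. A worker runs only two kinds of tasks: Spout tasks and Bolt tasks.
 47 | * **Executor**: since Storm 0.8, an Executor is a physical thread inside a Worker process. Tasks of the same Spout / Bolt may share one physical thread, and an Executor runs only tasks that belong to the same Spout / Bolt.
 48 | * **Task**: each Spout / Bolt thread in a worker is called a task. Since Storm 0.8, a task no longer corresponds to a physical thread: tasks of the same Spout / Bolt may share one physical thread, and that thread is called an Executor.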
 49 | 
 50 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/worker-process.png)
 51 | 
 52 | ### (4) How Storm Runs
 53 | 
 54 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/nimbus-process.png)
 55 | 
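 56 | * Users do not need to care about how the overall processing flow is organized and coordinated; they only define the business logic of each step.
 57 | * The Worker is the role that actually executes tasks, and what a Worker does when executing a task is determined by the business logic we define.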
 58 | 
 59 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-arc.png)
 60 | 
 61 | ### (5) Installing and Configuring Storm
 62 | 
 63 | * Download Storm: http://storm.apache.org/downloads.html
 64 | 
 65 | * Extract: ```tar -zxvf apache-storm-1.0.3.tar.gz```
 66 | 
 67 | * Edit ```/etc/profile``` to set the environment variables:
 68 |     ```shell
 69 |     STORM_HOME=/root/training/apache-storm-1.0.3
 70 |     export STORM_HOME
 71 | 
 72 |     PATH=$STORM_HOME/bin:$PATH
 73 |     export PATH
 74 |     ```
 75 | 
 76 | * Edit the configuration file: $STORM_HOME/conf/storm.yaml
 77 | 
 78 |     ```shell
 79 |     ########### These MUST be filled in for a storm configuration
 80 |     # ZooKeeper server addresses
 81 |     storm.zookeeper.servers:
 82 |         - "192.168.137.81"
 83 |         - "192.168.137.82"
 84 |         - "192.168.137.83"
 85 | 
 86 |     # Address of the Storm master node (nimbus)
 87 |     nimbus.seeds: ["192.168.137.81"]
 88 | 
 89 |     # Directory where Storm stores its data
 90 |     storm.local.dir: "/root/training/apache-storm-1.0.3/tmp"
 91 | 
 92 |     # Worker ports on each supervisor (one worker per port)
 93 |     supervisor.slots.ports:
 94 |         - 6700
 95 |         - 6701
 96 |         - 6702
 97 |         - 6703
 98 |     ```
 99 | 
100 | 	> PS: to set up Storm HA, list multiple nimbus hosts in nimbus.seeds.
101 | 
102 | * Use ```scp``` to copy the installation to the other nodes:
103 | 
104 |     ```shell
105 |     scp -r apache-storm-1.0.3 root@192.168.137.82:/root/training
106 |     scp -r apache-storm-1.0.3 root@192.168.137.83:/root/training
107 |     ```
108 | 
109 | ### (6) Starting and Viewing Storm
110 | 
111 | * On the nimbus host (the machine listed in nimbus.seeds), start the nimbus and logviewer services
112 |     ```shell
113 |     nohup storm nimbus &
114 |     nohup storm logviewer &
115 |     ```
116 | 
117 | * On the nimbus host, start the ui service
118 |     ```shell
119 |     nohup storm ui &
120 |     ```
121 | 
122 | * On the other nodes, start the supervisor and logviewer services
123 |     ```shell
124 |     nohup storm supervisor &
125 |     nohup storm logviewer &
126 |     ```
127 | 
128 | * View the Storm cluster: open port 8080 on the nimbus host in a browser (for example http://192.168.137.81:8080) to see the Storm UI
129 | 
130 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-ui.png)
131 | 
132 | ### (7) Storm Programming Model
133 | 
134 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-model.png)
135 | 
136 | * Topology：Storm 中运行的一个实时应用程序的名称。
137 | * Spout：在一个 topology 中获取数据源流的组件。同常情况下，Spout 会从外部数据源中读取数据，然后转换为 topology 内部的源数据。
138 | * Bolt：接收数据然后执行业务逻辑的组件，用户可以在其中执行自己想要的操作。
139 | * Tuple：一次消息传递的基本单元，理解为一组消息就是一个 Tuple。
140 | * Stream：表示数据的流向。
141 | * StreamGroup：数据的分组策略。
142 | 	* Shuffle Grouping：随机分组，尽量均匀分布到下游 Bolt 中。
143 | 	* Fields Grouping：按字段分组，按数据中 field 值进行分组；相同 field 值的 Tuple 被发送到相同的 Task。
144 | 	* All Grouping：广播。
145 | 	* Global Grouping：全局分组，Tuple 被分配到一个 Bolt 中的一个 Task，实现事务性的 Topology。
146 | 	* None Grouping：不分组。
147 | 	* Direct Grouping：直接分组，指定分组。
148 | 
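As a rough illustration of how Shuffle Grouping and Fields Grouping differ, here is a minimal Scala sketch. It only simulates the routing decision with a hash and a random number; it is not Storm's implementation, and `GroupingSketch`, `numTasks`, and both method names are hypothetical.

```scala
// Simplified simulation of two grouping strategies (NOT Storm's real code).
object GroupingSketch {
  // Hypothetical number of Tasks of the downstream Bolt.
  val numTasks = 4

  // Fields grouping: the same field value always maps to the same Task index.
  def fieldsGrouping(word: String): Int =
    Math.floorMod(word.hashCode, numTasks)

  // Shuffle grouping: pick a Task at random so the load spreads out.
  def shuffleGrouping(rnd: scala.util.Random): Int =
    rnd.nextInt(numTasks)
}

val words = List("I", "love", "Beijing", "I", "love", "China")
// Every occurrence of the same word is routed to the same Task index.
words.foreach(w => println(s"$w -> task ${GroupingSketch.fieldsGrouping(w)}"))
```

This is why a word counter is usually placed behind a Fields Grouping: all occurrences of one word reach the same Task, so a per-Task map holds the complete count for that word.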
 149 | ### (8) Storm Programming Example: WordCount
 150 | 
 151 | A typical stream-processing architecture:
 152 | 
 153 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-wordcount.png)
 154 | 
 155 | * Flume collects the data.
 156 | * Kafka buffers the data collected by Flume.
 157 | * Storm performs the computation.
 158 | * Redis, an in-memory database, stores the results.
 159 | 
 160 | * Required pom dependency:
161 |     ```xml
162 |     <dependencies>
163 |         <dependency>
164 |             <groupId>org.apache.storm</groupId>
165 |             <artifactId>storm-core</artifactId>
166 |             <version>1.0.3</version>
167 |         </dependency>
168 |     </dependencies>
169 |     ```
170 | 
 171 | * Create a Spout that collects data and serves as the data source of the whole Topology:
172 |     ```java
173 |     package test;
174 | 
175 |     import java.util.Map;
176 |     import java.util.Random;
177 | 
178 |     import org.apache.storm.spout.SpoutOutputCollector;
179 |     import org.apache.storm.task.TopologyContext;
180 |     import org.apache.storm.topology.OutputFieldsDeclarer;
181 |     import org.apache.storm.topology.base.BaseRichSpout;
182 |     import org.apache.storm.tuple.Fields;
183 |     import org.apache.storm.tuple.Values;
184 |     import org.apache.storm.utils.Utils;
185 | 
186 |     public class WordCountSpout extends BaseRichSpout {
187 | 
 188 |         // Simulated input data
 189 |         private String[] data = {"I love Beijing", "I love China", "Beijing is the capital of China"};
 190 | 
 191 |         // Used to emit tuples to the next component
 192 |         private SpoutOutputCollector collector;
 193 | 
 194 |         public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
 195 |             // Spout initialization
 196 |             this.collector = collector;
 197 |         }
 198 | 
 199 |         // Called repeatedly by the Storm framework to fetch data from the external source
 200 |         public void nextTuple() {
 201 |             Utils.sleep(3000);
 202 |             int random = (new Random()).nextInt(3);
 203 |             String sentence = data[random];
 204 | 
 205 |             // Emit the sentence
 206 |             System.out.println("Emitting: " + sentence);
 207 |             this.collector.emit(new Values(sentence));
 208 |         }
 209 | 
 210 |         // Declare the field name of the emitted tuples
211 |         public void declareOutputFields(OutputFieldsDeclarer declarer) {
212 |             declarer.declare(new Fields("sentence"));
213 | 
214 |         }
215 |     }
216 |     ```
217 | 
 218 | * Create a Bolt (WordCountSplitBolt) that splits each sentence into words:
219 |     ```java
220 |     package test;
221 | 
222 |     import java.util.Map;
223 | 
224 |     import org.apache.storm.task.OutputCollector;
225 |     import org.apache.storm.task.TopologyContext;
226 |     import org.apache.storm.topology.OutputFieldsDeclarer;
227 |     import org.apache.storm.topology.base.BaseRichBolt;
228 |     import org.apache.storm.tuple.Fields;
229 |     import org.apache.storm.tuple.Tuple;
230 |     import org.apache.storm.tuple.Values;
231 | 
232 |     public class WordCountSplitBolt extends BaseRichBolt {
233 | 
 234 |         // Used to emit tuples to the next Bolt
235 |         private OutputCollector collector;
236 | 
237 |         public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
238 |             this.collector = collector;
239 |         }
240 | 
241 |         public void execute(Tuple input) {
242 |             String sentence = input.getStringByField("sentence");
 243 |             // Split the sentence into words
244 |             String[] words = sentence.split(" ");
245 |             for (String word : words) {
246 |                 this.collector.emit(new Values(word, 1));
247 |             }
248 |         }
249 | 
250 |         public void declareOutputFields(OutputFieldsDeclarer declarer) {
251 |             declarer.declare(new Fields("word", "count"));
252 |         }
253 |     }
254 |     ```
255 | 
 256 | * Create a Bolt (WordCountBoltCount) that counts the words:
257 |     ```java
258 |     package test;
259 | 
260 |     import java.util.HashMap;
261 |     import java.util.Map;
262 | 
263 |     import org.apache.storm.task.OutputCollector;
264 |     import org.apache.storm.task.TopologyContext;
265 |     import org.apache.storm.topology.OutputFieldsDeclarer;
266 |     import org.apache.storm.topology.base.BaseRichBolt;
267 |     import org.apache.storm.tuple.Tuple;
268 | 
269 |     public class WordCountBoltCount extends BaseRichBolt {
270 | 
271 |         private Map<String, Integer> result = new HashMap<String, Integer>();
272 | 
273 |         public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
274 | 
275 |         }
276 | 
277 |         public void execute(Tuple input) {
278 |             String word = input.getStringByField("word");
279 |             int count = input.getIntegerByField("count");
280 | 
 281 |             if (result.containsKey(word)) {
 282 |                 int total = result.get(word);
 283 |                 result.put(word, total + count);
 284 |             } else {
 285 |                 result.put(word, count);
 286 |             }
 287 |             // Print the running totals to the console
 288 |             System.out.println("Result so far: " + result);
289 |         }
290 | 
291 |         public void declareOutputFields(OutputFieldsDeclarer declarer) {
292 | 
293 |         }
294 |     }
295 |     ```
296 | 
 297 | * Create the main program (WordCountTopology) and submit the topology to a local cluster:
298 |     ```java
299 |     package test;
300 | 
301 |     import org.apache.storm.Config;
302 |     import org.apache.storm.LocalCluster;
303 |     import org.apache.storm.StormSubmitter;
304 |     import org.apache.storm.generated.StormTopology;
305 |     import org.apache.storm.topology.TopologyBuilder;
306 |     import org.apache.storm.tuple.Fields;
307 | 
308 |     public class WordCountTopology {
309 | 
310 |         public static void main(String[] args) throws Exception {
311 |             TopologyBuilder builder = new TopologyBuilder();
312 | 
 313 |             // Set the topology's Spout
 314 |             builder.setSpout("wordcount_spout", new WordCountSpout());
 315 | 
 316 |             // Set the first Bolt
 317 |             builder.setBolt("wordcount_splitbolt", new WordCountSplitBolt()).shuffleGrouping("wordcount_spout");
 318 | 
 319 |             // Set the second Bolt
 320 |             builder.setBolt("wordcount_count", new WordCountBoltCount()).fieldsGrouping("wordcount_splitbolt", new Fields("word"));
 321 | 
 322 |             // Create the Topology
 323 |             StormTopology wc = builder.createTopology();
 324 | 
 325 |             Config config = new Config();
 326 | 
 327 |             // Submit the topology to a local, in-process cluster
 328 |             LocalCluster localCluster = new LocalCluster();
 329 |             localCluster.submitTopology("mywordcount", config, wc);
 330 | 
 331 |             // To run on a real Storm cluster, submit with StormSubmitter instead
 332 |             // StormSubmitter.submitTopology(args[0], config, wc);
333 |         }
334 |     }
335 |     ```
336 | 
 337 | * Right-click the class in Eclipse and run it (**note: start Eclipse as administrator**).
338 | 
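 339 | ### (9) Common Storm Commands
 340 | 
 341 | Storm provides several simple and useful commands for managing topologies: they submit, kill, deactivate, activate, and rebalance a topology.
 342 | 
 343 | 1. Submit a topology: storm jar [jar path] [package.TopologyClass] [topology name]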
344 |     ```shell
345 |     storm jar storm-0.0.1-SNAPSHOT.jar test.WordCountTopology MyWordCount
346 |     ```
347 | 
 348 | 2. Kill a topology: storm kill [topology name] -w 10
349 |     ```shell
350 |     storm kill topology-name -w 10
351 |     ```
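 352 | 	> With -w [seconds], kill specifies how long Storm waits after deactivating the topology before shutting it down, much like braking before a stop.
 353 | 3. Deactivate a topology: storm deactivate [topology name]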
354 |     ```shell
355 |     storm deactivate topology-name
356 |     ```
357 | 
 358 | 4. Activate a topology: storm activate [topology name]
359 |     ```shell
360 |     storm activate topology-name
361 |     ```
362 | 
 363 | 5. Rebalance a topology: storm rebalance [topology name]
364 |     ```shell
365 |     storm rebalance topology-name
366 |     ```
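 367 | 	> Rebalancing redistributes a topology's work across the cluster, which makes it a powerful command. For example, after you add nodes to a running cluster, rebalance deactivates the topology, waits for the timeout, reassigns the Workers, and then restarts the topology.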
368 | 
 369 | ### (10) Analyzing the WordCount Flow
 370 | 
 371 | The events link of each component in the Storm UI shows the messages emitted by that component (spout or bolt). The event logger is disabled by default; to enable it, set ```topology.eventlogger.executors: 1``` in the configuration file. The possible values are:
 372 | 
 373 | * ```topology.eventlogger.executors: 0``` disabled (the default)
 374 | * ```topology.eventlogger.executors: 1``` one Event Logger for the whole topology
 375 | * ```topology.eventlogger.executors: nil``` one Event Logger per worker
376 | 
377 | 	![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/event-logger.png)
378 | 
 379 | #### Data Flow of WordCount
 380 | 
 381 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/process.png)
 382 | 
 383 | ### (11) Data Stored in ZooKeeper by a Storm Cluster
 384 | 
 385 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-zk.png)
 386 | 
 387 | ### (12) Topology Submission Flow in a Storm Cluster
 388 | 
 389 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-submit.png)
 390 | 
 391 | ### (13) Storm's Internal Communication Mechanism
392 | 
393 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-tcp1.png)
394 | 
395 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/21-Storm基础/imgs/storm-tcp2.png)
396 | 
397 | 
398 | 


--------------------------------------------------------------------------------
/21-Storm基础/imgs/arc1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/arc1.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/arc2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/arc2.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/event-logger.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/event-logger.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/nimbus-process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/nimbus-process.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/process.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-arc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-arc.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-log.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-log.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-model.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-submit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-submit.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-tcp1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-tcp1.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-tcp2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-tcp2.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-ui.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-ui.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-wordcount.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-wordcount.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/storm-zk.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/storm-zk.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/water.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/water.png


--------------------------------------------------------------------------------
/21-Storm基础/imgs/worker-process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/21-Storm基础/imgs/worker-process.png


--------------------------------------------------------------------------------
/22-Storm与其他组件集成/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/22-Storm与其他组件集成/README.md


--------------------------------------------------------------------------------
/23-Storm性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/23-Storm性能优化/README.md


--------------------------------------------------------------------------------
/24-JStorm基础/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/24-JStorm基础/README.md


--------------------------------------------------------------------------------
/25-JStorm与其他组件集成/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/25-JStorm与其他组件集成/README.md


--------------------------------------------------------------------------------
/26-JStorm性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/26-JStorm性能优化/README.md


--------------------------------------------------------------------------------
/27-Azkaban/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/27-Azkaban/README.md


--------------------------------------------------------------------------------
/28-Scala/README.md:
--------------------------------------------------------------------------------
   1 | ## The Scala Programming Language
   2 | 
   3 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/scala-logo.jpg)
   4 | 
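   5 | ### (1) Scala Language Basics
   6 | 
   7 | #### 1. Introduction to Scala
   8 | 
   9 | Scala is a multi-paradigm programming language designed to integrate the features of object-oriented and functional programming. It runs on the Java platform (the Java Virtual Machine) and is compatible with existing Java programs. It can also run on Java ME in the CLDC configuration. There is another implementation for the .NET platform, although its updates lag behind. Scala's compilation model (separate compilation, dynamic class loading) is the same as that of Java and C#, so Scala code can call Java libraries (or .NET libraries in the .NET implementation). Scala, including its compiler and libraries, is released under a BSD license.
  10 | 
  11 | Learning Scala lays the groundwork for learning Spark later.
  12 | 
  13 | #### 2. Downloading and Installing Scala
  14 | 
  15 | * Install a JDK (Scala depends on the JDK)
  16 | 
  17 | * Download Scala: http://www.scala-lang.org/download/
  18 | 
  19 | * Install Scala: set the SCALA_HOME and PATH environment variables.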
  20 | 
  21 | #### 3. Scala Runtime Environments
  22 | 
  23 | * REPL (Read Evaluate Print Loop): the interactive command line
  24 | 
  25 | * IDE: graphical development tools
  26 |   
  27 |   * Scala IDE (based on Eclipse): http://scala-ide.org/ 
  28 |   * IntelliJ IDEA with the Scala plugin: http://www.jetbrains.com/idea/download/
  29 | 
  30 | #### 4. Common Data Types
  31 | 
  32 | Note: in Scala, every value is an object. For example:
  33 | 
  34 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/data-type.png)
  35 | 
  36 | 1. Numeric types: Byte, Short, Int, Long, Float, Double
  37 |   
  38 |    * Byte: 8-bit signed integer, from -128 to 127
  39 |    * Short: 16-bit signed integer, from -32768 to 32767
  40 |    * Int: 32-bit signed integer
  41 |    * Long: 64-bit signed integer
  42 |      
  43 |      ```scala
  44 |        val a:Byte = 10
  45 |        a + 10
  46 |        // Result: res9: Int = 20
  47 |        // res9 is the name of the newly generated variable that holds the result
  48 |      ```
  49 |      
  50 |      **Note: in Scala a variable can be defined without a type, because Scala infers the type automatically.**
  51 | 
  52 | 2. Character and string types: Char and String
  53 |   
  54 |     Scala strings support interpolation:
  55 |    
  56 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/insert-value.png)
  57 |    
  58 |     **Note: the string literal is prefixed with s; the result is equivalent to "My Name is" + s1**
  59 | 
  60 | 3. The Unit type: the counterpart of void in Java
  61 |   
  62 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/unit.png)
  63 | 
  64 | 4. The Nothing type: usually indicates that an Exception is thrown during execution. For example, consider a function defined as follows:
  65 |   
  66 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/nothing.png)
  67 | 
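The screenshots above are not available as text, so here is a minimal sketch of string interpolation, Unit, and Nothing that can be pasted into the REPL. The names `s1`, `greeting`, `sayHello`, and `fail` are chosen for illustration and are not necessarily those used in the images.

```scala
val s1 = "Tom"
// The s prefix enables interpolation: $s1 is replaced by the value of s1.
val greeting = s"My Name is $s1"   // "My Name is Tom"

// Unit: the method returns no meaningful value; its only value is ().
def sayHello(name: String): Unit = println(s"Hello, $name")

// Nothing: the method never returns normally, it always throws.
def fail(msg: String): Nothing = throw new IllegalArgumentException(msg)
```

Because Nothing is a subtype of every type, a call such as `fail("...")` can appear wherever a value of any type is expected.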
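  68 | #### 5. Declaring and Using Variables
  69 | 
  70 | * Declare variables with val and var:
  71 |   
  72 |   ```scala
  73 |     val answer = 8 * 3 + 2
  74 |   ```
  75 | 
  76 | * val: the value defined is actually a constant; use var to declare a value that can change
  77 |   
  78 |     **Note: the variable's type does not have to be given explicitly; Scala infers it automatically**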
  79 | 
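To make the difference between val and var concrete, here is a small REPL-style sketch; `counter` and `message` are illustrative names, not taken from the original text.

```scala
val answer = 8 * 3 + 2     // type inferred as Int, value 26
// answer = 0              // would not compile: reassignment to val

var counter = 0            // a var can be reassigned
counter = counter + 1

val message: String = "Hello"   // an explicit type annotation is optional
```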
  80 | #### 6. Using Functions and Methods
  81 | 
  82 | * You can use Scala's predefined functions, for example to find the larger of two values:
  83 | 
  84 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/math.png)
  85 | 
  86 | * You can also define your own functions with the def keyword. Syntax:
  87 |   
  88 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/def.png)
  89 |   
  90 |     Example:
  91 |   
  92 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/def-demo.png)
  93 | 
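Since the images above are not reproduced as text, the following sketch shows one predefined function and two user-defined ones. The functions `sum` and `factorial` are hypothetical examples, not necessarily the ones in the screenshots.

```scala
import scala.math.max

// A predefined function: the larger of two values.
val larger = max(3, 7)   // 7

// General shape: def name(param: Type, ...): ReturnType = body
def sum(x: Int, y: Int): Int = x + y

// A recursive function must declare its return type explicitly.
def factorial(n: Int): Int = if (n <= 1) 1 else n * factorial(n - 1)
```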
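  94 | #### 7. Conditional Expressions
  95 | 
  96 | Scala's if/else has the same syntax as in Java or C++.
  97 | 
  98 | In Scala, however, if/else is an expression and has a value: the value of the expression that follows if or else, whichever branch is taken.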
  99 | 
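A short sketch of if/else used as an expression; the variable names are illustrative.

```scala
val x = 5

// The whole if/else yields a value that can be assigned directly.
val sign = if (x > 0) 1 else -1          // 1

// When the branches have different types, a common supertype is inferred.
val mixed = if (x > 10) "big" else 0     // 0
```

This plays the role of Java's ternary operator `?:`, which Scala does not need.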
 100 | #### 8. Loops
 101 | 
 102 | Scala has the same while and do loops as Java and C++.
 103 | 
 104 | In Scala, you can iterate with for and foreach.
 105 | 
 106 | * for loop examples:
 107 |   
 108 |   ```scala
 109 |     // Define a collection
 110 |     var list = List("Mary", "Tom", "Mike")
 111 |   
 112 |     println("********** for: first form ***********")
 113 |     for (s <- list) println(s)
 114 |   
 115 |     println("********** for: second form ***********")
 116 |     for {
 117 |         s <- list
 118 |         if (s.length > 3)
 119 |     } println(s)
 120 |   
 121 |     println("********** for: third form ***********")
 122 |     for ( s <- list if s.length <= 3) println(s)
 123 |   ```
 124 | 
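 125 | #### 9. Function Parameters
 126 | 
 127 | * Scala has two evaluation strategies for function arguments:
 128 |   
 129 |   * Call By Value: the argument is evaluated exactly once, before the function body runs
 130 |   * Call By Name: the argument is evaluated every time it is used inside the function body
 131 |     
 132 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/fun-param.png)
 133 |     
 134 |     Let's trace how the two calls above are evaluated:
 135 |     
 136 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/fun-param-process.png)
 137 |     
 138 |     A slightly more complex example:
 139 |     
 140 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/com.png)
 141 | 
 142 | * Kinds of function parameters in Scala:
 143 |   
 144 |   * Default parameters
 145 |   
 146 |   * Named parameters
 147 |   
 148 |   * Variable-length parameters (varargs)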
 149 |     
 150 |       ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/param-type.png)
 151 | 
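The screenshots above are not available as text. The sketch below illustrates the same ideas under the assumption that a side-effecting helper makes the number of evaluations visible; `ParamSketch` and all of its members are hypothetical names.

```scala
object ParamSketch {
  var evaluations = 0
  // Each call has a visible side effect, so evaluations can be counted.
  def next(): Int = { evaluations += 1; evaluations }

  // Call by value: x is evaluated once, before the body runs.
  def byValue(x: Int): Int = x + x
  // Call by name: x is re-evaluated each time it is used in the body.
  def byName(x: => Int): Int = x + x

  // Default parameter: name falls back to "Tom" when omitted.
  def greet(name: String = "Tom"): String = s"Hello $name"
  // Variable-length parameter: any number of Int arguments.
  def total(nums: Int*): Int = nums.sum
}

println(ParamSketch.byValue(ParamSketch.next()))  // next() runs once:  1 + 1 = 2
println(ParamSketch.byName(ParamSketch.next()))   // next() runs twice: 2 + 3 = 5
println(ParamSketch.greet())                      // Hello Tom
println(ParamSketch.greet(name = "Mary"))         // named argument: Hello Mary
println(ParamSketch.total(1, 2, 3))               // 6
```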
 152 | #### 10. Lazy Values
 153 | 
 154 | When a val is declared lazy, its initialization is deferred until the value is accessed for the first time.
 155 | 
 156 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/lazy.png)
 157 | 
 158 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/lazy-com.png)
 159 | 
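A minimal sketch of the deferred initialization described above; `LazySketch`, `initialized`, and `config` are hypothetical names used only to make the moment of evaluation observable.

```scala
object LazySketch {
  var initialized = false
  // The block on the right-hand side runs only on first access to config.
  lazy val config: String = { initialized = true; "loaded" }
}

println(LazySketch.initialized)  // false: config has not been evaluated yet
println(LazySketch.config)       // loaded (first access triggers initialization)
println(LazySketch.initialized)  // true
```

A lazy val is evaluated at most once; later accesses reuse the stored result.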
 160 | #### 11. Exception Handling
 161 | 
 162 | Exceptions in Scala work the same way as in Java or C++. Use the throw keyword to throw an exception.
 163 | 
 164 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/exception.png)
 165 | 
 166 | Use try...catch...finally to catch and handle exceptions:
 167 | 
 168 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/trycatch.png)
 169 | 
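As the images are not reproduced here, the following sketch shows throw and try...catch...finally in text form. The functions `checkAge` and `safeDivide` are hypothetical examples.

```scala
// throw is an expression of type Nothing, so it fits in either branch.
def checkAge(age: Int): Int =
  if (age < 0) throw new IllegalArgumentException("age must not be negative")
  else age

// catch uses pattern matching on the exception type.
def safeDivide(a: Int, b: Int): Int =
  try {
    a / b
  } catch {
    case e: ArithmeticException =>
      println(s"caught: ${e.getMessage}")
      0
  } finally {
    // Runs whether or not an exception was thrown.
    println("finally block executed")
  }
```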
 170 | #### 12. Arrays
 171 | 
 172 | * Fixed-length arrays: use Array
 173 |   
 174 |   ```scala
 175 |   // Fixed-length arrays
 176 |   val a = new Array[Int](10)
 177 |   val b = new Array[String](5)
 178 |   val c = Array("Tom", "Mary", "Mike")
 179 |   ```
 180 | 
 181 | * Variable-length arrays: use ArrayBuffer
 182 |   
 183 |   ```scala
 184 |   // Variable-length array (ArrayBuffer lives in scala.collection.mutable)
 185 |   val d = scala.collection.mutable.ArrayBuffer[Int]()
 186 |   // Append single elements
 187 |   d += 1
 188 |   d += 2
 189 |   d += 3
 190 |   // Append several elements at once
 191 |   d += (10, 12, 13)
 192 |   
 193 |   // Remove the last two elements
 194 |   d.trimEnd(2)
 195 |   
 196 |   // Convert the ArrayBuffer to an Array
 197 |   d.toArray
 198 |   ```
 199 | * Iterating over an array:
 200 |   
 201 |   ```scala
 202 |   // Iterate over an array
 203 |   var a = Array("Tom", "Mary", "Mike")
 204 |   
 205 |   // Iterate with a for loop
 206 |   for (s <- a) println(s)
 207 |   
 208 |   // Transform the array: yield builds a new array
 209 |   val b = for {
 210 |       s <- a
 211 |       s1 = s.toUpperCase
 212 |   } yield (s1)
 213 |   
 214 |   // foreach can also be used to print each element
 215 |   a.foreach(println)
 216 |   ```
 217 | * Common array operations:
 218 |   
 219 |   ```scala
 220 |   import scala.collection.mutable.ArrayBuffer
 221 |   
 222 |   val myArray = Array(1, 10, 2, 3, 5, 4)
 223 |   
 224 |   // Maximum
 225 |   myArray.max
 226 |   
 227 |   // Minimum
 228 |   myArray.min
 229 |   
 230 |   // Sum
 231 |   myArray.sum
 232 |   
 233 |   // Define a variable-length array
 234 |   var myArray1 = ArrayBuffer(1, 10, 2, 3, 5, 4)
 235 |   
 236 |   // Sort in descending order
 237 |   myArray1.sortWith(_ > _)
 238 |   
 239 |   // Sort in ascending order
 240 |   myArray1.sortWith(_ < _)
 241 |   ```
 242 | 
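 243 | * Multidimensional arrays:
 244 |   
 245 |   * As in Java, a multidimensional array is implemented as an array of arrays.
 246 |   
 247 |   * Ragged arrays, in which each row has a different length, are also possible.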
 248 |     
 249 |     ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/matrix.png)
 250 | 
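The matrix screenshot is not available as text; a minimal sketch of both a rectangular and a ragged array follows. The names `matrix` and `triangle` are illustrative.

```scala
// A 3 x 4 matrix of Int, initialized to 0.
val matrix = Array.ofDim[Int](3, 4)
matrix(1)(2) = 10

// A ragged array: row i has i + 1 elements.
val triangle = new Array[Array[Int]](3)
for (i <- triangle.indices) triangle(i) = new Array[Int](i + 1)
```

`Array.ofDim` returns an `Array[Array[Int]]`, which is why elements are addressed with two separate index calls, `matrix(row)(column)`.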
 251 | #### 13. Map
 252 | 
 253 | A Map is a collection of (key, value) pairs; a pair can be created with the -> operator, for example:
 254 | 
 255 | ```scala
 256 | val scores = Map("Alice" -> 10, "Bob" -> 3, "Cindy" -> 8)
 257 | ```
 258 | 
 259 | Maps come in two kinds: immutable Maps and mutable Maps.
 260 | 
 261 | ```scala
 262 | // Immutable Map
 263 | val math = scala.collection.immutable.Map("Alice" -> 95, "Tom" -> 59)
 264 | 
 265 | // Mutable Map
 266 | val english = scala.collection.mutable.Map("Alice" -> 80)
 267 | val chinese = scala.collection.mutable.Map(("Alice", 80), ("Tom", 30))
 268 | ```
 269 | 
 270 | Map operations
 271 | 
 272 | * Looking up values in a Map
 273 |   
 274 |   ```scala
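 275 |   // 1. Look up values in the Map
 276 |   // Direct lookup throws an exception if the key does not exist
 277 |   chinese("Alice")
 278 |   chinese.get("Alice")   // returns an Option: None if the key does not exist
 279 |   if (chinese.contains("Alice")) {
 280 |       chinese("Alice")
 281 |   } else {
 282 |       -1
 283 |   }
 284 |   // Shorthand for the if/else above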
 285 |   chinese.getOrElse("Alice", -1)
 286 |   ```
 287 | 
 288 | * Updating values in a Map (mutable Maps only)
 289 | 
 290 |   ```scala
 291 |   // 2. Update a value in the Map
 292 |   chinese("Bob") = 100
 293 |   // Add a new entry to the Map
 294 |   chinese += "Tom" -> 85
 295 |   // Remove an entry from the Map
 296 |   chinese -= "Bob"
 297 |   ```
 298 | 
 299 | * Iterating over a Map
 300 | 
 301 |   ```scala
 302 |   // 3. Iterate over the Map with for or foreach
 303 |   for (s <- chinese) println(s)
 304 |   chinese.foreach(println)
 305 |   ```
 306 | 
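 307 | #### 14. Tuple
 308 | 
 309 | A tuple is an aggregate of values of different types.
 310 | 
 311 | For example: `val t = (1, 3.14, "Friend") // type: Tuple3[Int, Double, java.lang.String]`
 312 | 
 313 | Here Tuple is the type, and 3 means that the tuple has three elements.
 314 | 
 315 | Accessing and iterating over a tuple: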
 316 | 
 317 | ```scala
 318 | // Define tuples; t1 has 3 elements
 319 | val t1 = (1, 2, "Tom")
 320 | val t2 = new Tuple4("Marry", 3.14, 100, "Hello")
 321 | 
 322 | // Access the elements of a tuple with _1, _2, ...
 323 | t2._1
 324 | t2._2
 325 | t2._3
 326 | t2._4
 327 | // t2._5 ---> error
 328 | 
 329 | // Iterate over a Tuple:
 330 | t2.productIterator.foreach(println)
 331 | ```
 332 | 
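 333 | Note: to iterate over the elements of a Tuple, first obtain the corresponding iterator (productIterator); for and foreach cannot be applied to the tuple directly.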
 334 | 
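 335 | ### (2) Object-Oriented Programming in Scala
 336 | 
 337 | 1. Basic concepts of object orientation
 338 |   
 339 |    Data and the methods that operate on it are put together as one interdependent whole: an object
 340 |    
 341 |    The three main characteristics of object orientation:
 342 |    
 343 |    * Encapsulation
 344 |    
 345 |    * Inheritance
 346 |    
 347 |    * Polymorphism
 348 | 
 349 | 2. Defining classes
 350 |   
 351 |    A simple class with parameterless methods: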
 352 |    
 353 |    ```scala
 354 |    class Counter {
 355 |        private var value = 0;
 356 |        def increment() { value += 1 }
 357 |        def current() = value;
 358 |    }
 359 |    ```
 360 |    
 361 |    Example: note that class is not preceded by the public keyword.
 362 |    
 363 |    ```scala
 364 |    // Defining a class in Scala
 365 |    class Student1 {
 366 |        // Properties
 367 |        private var stuName: String = "Tom"
 368 |        private var stuAge: Int = 20
 369 |    
 370 |        // Member methods
 371 |        def getStuName(): String = stuName
 372 |        def setStuName(newName: String) = this.stuName = newName
 373 |    
 374 |        def getStuAge(): Int = stuAge
 375 |        def setStuAge(newAge: Int) = this.stuAge = newAge
 376 |    }
 377 |    ```
 378 |    
 379 |    To write a main method, define it in the class's companion object, that is, in an object with the same name as the class.
 380 |    
 381 |    ```scala
 382 |    // Companion object of Student1
 383 |    object Student1 {
 384 |        def main(args: Array[String]): Unit = {
 385 |            // Test Student1
 386 |            var s1 = new Student1
 387 |    
 388 |            // First output
 389 |            println(s1.getStuName() + "\t" + s1.getStuAge())
 390 |    
 391 |            // Call the set methods
 392 |            s1.setStuName("Mary")
 393 |            s1.setStuAge(25)
 394 |    
 395 |            // Second output
 396 |            println(s1.getStuName() + "\t" + s1.getStuAge())
 397 |    
 398 |            // Third output
 399 |            println(s1.stuName + "\t" + s1.stuAge)
 400 |            // Note: stuName and stuAge are private, so why can they be accessed directly here? This leads to the discussion of property get and set methods
 401 |        }
 402 |    }
 403 |    ```
 404 | 
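 405 | 3. Getter and setter methods of properties
 406 |   
 407 |    - Scala automatically generates get and set methods for a var property; when the property is private, the generated methods are private too, but the companion object can still call them
 408 |      
 409 |      `private var stuName: String = "Tom"`
 410 |      - get method: named stuName ==> called as `s2.stuName`; it is generated without a parameter list, so it is called without parentheses
 411 |      
 412 |      - set method: named `stuName_=` ==> called as `s2.stuName = "Mary"`
 413 |    
 414 |    * Given the property `private var money: Int = 1000`, suppose money should have a get method but no set method
 415 |    
 416 |      * Solution: define it as a constant: `private val money: Int = 1000`
 417 |    
 418 |    * private[this]: the property is private to the object instance itself, so no set and get methods are generated.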
 419 |         ```scala
 420 |         class Student2 {
 421 |             // Properties
 422 |             private var stuName: String = "Tom"
 423 |             // private[this] var stuAge: Int = 20
 424 |             private var stuAge: Int = 20
 425 |             private val money: Int = 1000
 426 |         }
 427 | 
 428 |         // Test
 429 |         object Student2 {
 430 |             def main(args: Array[String]): Unit = {
 431 |                 var s2 = new Student2
 432 | 
 433 |                 println(s2.stuName + "\t" + s2.stuAge)
 434 |                 println(s2.stuName + "\t" + s2.stuAge + "\t" + s2.money)
 435 | 
 436 |                 // Changing the value of money does not compile (money is a val), so the line is commented out
 437 |                 // s2.money = 2000
 438 |             }
 439 |         }
 440 |         ```
 441 | 
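 442 | 4. Inner classes (nested classes)
 443 |   
 444 |    A class can be defined inside another class. In the example below, the Student3 class contains a Course class that stores the student's elective courses.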
 445 |    
 446 |    ```scala
 447 |    import scala.collection.mutable.ArrayBuffer
 448 |    
 449 |    // Nested class: inner class
 450 |    class Student3 {
 451 |        // Inner class that records the information of an elective course
 452 |        class Course(val courseName: String, val credit: Int) {
 453 |            // Other methods go here
 454 |        }
 455 |        // Properties
 456 |        private var stuName: String = "Tom"
 457 |        private var stuAge: Int = 20
 458 |    
 459 |        // ArrayBuffer that records all courses chosen by this student
 460 |        private var courseList = new ArrayBuffer[Course]()
 461 |    
 462 |        // Method that adds a new course to the student's record
 463 |        def addNewCourse(cname: String, credit: Int) {
 464 |            // Create the new course
 465 |            var c = new Course(cname, credit)
 466 |            // Append the course to the list
 467 |            courseList += c
 468 |        }
 469 |    }
 470 |    ```
 471 |    
 472 |    A test program:
 473 |    
 474 |    ```scala
 475 |    // Test (extends App so that the statements in the body run as the program)
 476 |    object Student3 extends App {
 477 |        // Create a student object
 478 |        var s3 = new Student3
 479 |    
 480 |        // Add new courses for the student
 481 |        s3.addNewCourse("Chinese", 2)
 482 |        s3.addNewCourse("English", 3)
 483 |        s3.addNewCourse("Math", 4)
 484 |    
 485 |        // Output
 486 |        println(s3.stuName + "\t" + s3.stuAge)
 487 |        println("*************Elective courses*************")
 488 |        for (s <- s3.courseList) println(s.courseName + "\t" + s.credit)
 489 |    }
 490 |    ```
 491 | 
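 492 | 5. Class constructors
 493 |   
 494 |    A class has two kinds of constructors: the primary constructor and auxiliary constructors
 495 |    
 496 |    * Primary constructor: it is combined with the class declaration, and a class can have only one
 497 |      
 498 |      `Student4(val stuName: String, val stuAge: Int)`
 499 |      
 500 |      1. Defines the primary constructor of the class, with two parameters
 501 |      
 502 |      2. Declares two properties, stuName and stuAge, with their get methods (the parameters are val, so no set methods are generated)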
 503 |        
 504 |         ```scala
 505 |         class Student4(val stuName: String, val stuAge: Int) {}
 506 |         
 507 |         object Student4 {
 508 |             def main(args: Array[String]) {
 509 |                 // Create a Student4 object by calling the primary constructor
 510 |                 var s4 = new Student4("Tom", 20)
 511 |                 println(s4.stuName + "\t" + s4.stuAge)
 512 |             }
 513 |         }
 514 |         ```
 515 |    
 516 |    * Auxiliary constructors: a class can have several, each defined with the keyword this
 517 |    
 518 |         ```scala
 519 |           class Student4(val stuName: String, val stuAge: Int) {
 520 |               // Define an auxiliary constructor
 521 |               def this(age: Int) {
 522 |                   // Call the primary constructor
 523 |                   this("no name", age)
 524 |               }
 525 |           }
 526 | 
 527 |           object Student4 {
 528 |               def main(args: Array[String]) {
 529 |                   // Create a new Student4 object by calling the auxiliary constructor
 530 |                   var s42 = new Student4(25)
 531 |                   println(s42.stuName + "\t" + s42.stuAge)
 532 |               }
 533 |           }
 534 |         ```
 535 | 
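 536 | 6. Objects in Scala
 537 |   
 538 |    Scala has no static modifier, but all members of an object are effectively static. If a class with the same name exists, that class is the object's companion class. An object is typically used for initialization and similar operations on behalf of its companion class.
 539 |    
 540 |    Uses of object
 541 |    
 542 |    - Singleton objects: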
 543 |      
 544 |      ```scala
 545 |      // Implement the singleton pattern with an object
 546 |      object CreditCard {
 547 |          // Variable that holds the credit card number
 548 |          private[this] var creditCardNumber: Long = 0
 549 |      
 550 |          // Generate a new card number
 551 |          def generateNewCCNumber(): Long = {
 552 |              creditCardNumber += 1
 553 |              creditCardNumber
 554 |          }
 555 |      
 556 |          // 测试程序
 557 |          def main(args: Array[String]) {
 558 |              // 产生新的卡号
 559 |              println(CreditCard.generateNewCCNumber())
 560 |              println(CreditCard.generateNewCCNumber())
 561 |              println(CreditCard.generateNewCCNumber())
 562 |              println(CreditCard.generateNewCCNumber())
 563 |          }
 564 |      
 565 |      }
 566 |      ```
 567 |    
 568 |    * 使用应用程序对象：
 569 |    
 570 |        ```scala
 571 |          // 使用应用程序对象： 可以省略main方法
 572 |          object HelloWorld extends App {
 573 |              // 通过如下方式取得命令行的参数
 574 |              if (args.length > 0) {
 575 |                  println(args(0))
 576 |              } else {
 577 |                  println("no arguments")
 578 |              }
 579 |          }
 580 |        ```
 581 | 
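A companion object and its class can access each other's private members. The sketch below uses that to build a simple factory; `Ticket`, `next`, and `issuedCount` are hypothetical names used only for illustration.

```scala
// Hypothetical example: the constructor is private, so only the companion can call it
class Ticket private (val number: Int)

object Ticket {
  // Shared state lives in the object, much like a static field in Java
  private var issued = 0

  // Factory method: the companion object may call the private constructor
  def next(): Ticket = {
    issued += 1
    new Ticket(issued)
  }

  def issuedCount: Int = issued
}
```

Each call to `Ticket.next()` returns a ticket whose number is one higher than the previous one, while `new Ticket(1)` outside the companion would not compile.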
 582 | 7. The apply method in Scala
 583 |   
 584 |    The apply method is called when the compiler sees an expression of the following form:
 585 |    
 586 |    `Object(arg1, arg2, ... argn)`
 587 |    
 588 |    Typically, such an apply method returns an instance of the companion class, which lets you omit the new keyword.
 589 |    
 590 |    ```scala
 591 |    var myarray = Array(1, 2, 3)
 592 |    ```
 593 |    
 594 |    An example of an apply method defined in an object:
 595 |    
 596 |    ```scala
 597 |    // The apply method of an object
 598 |    class Student5(val stuName: String) {
 599 |    
 600 |    }
 601 |    object Student5 {
 602 |        // Define our own apply method
 603 |        def apply(stuName: String) = {
 604 |            println("********Apply in Object**********")
 605 |            new Student5(stuName)
 606 |        }
 607 |    
 608 |        def main(args: Array[String]): Unit = {
 609 |            // Create a Student5 object with new: apply is not called
 610 |            val s51 = new Student5("Tom")
 611 |            println(s51.stuName)
 612 |    
 613 |            // Create a Student5 object without new: this calls apply
 614 |            val s52 = Student5("Mary")
 615 |            println(s52.stuName)
 616 |        }
 617 |    }
 618 |    ```
 619 | 
 620 | 8. Inheritance in Scala
 621 |   
 622 |    Like Java, Scala uses the extends keyword to extend a class.
 623 |    
 624 |    - Case 1: the Employee class extends the Person class
 625 |      
 626 |      ```scala
 627 |      // Demonstrates inheritance in Scala
 628 |      // Parent class
 629 |      class Person(val name: String, val age: Int) {
 630 |        // Define a method
 631 |        def sayHello(): String = "Hello " + name + " and the age is " + age
 632 |      }
 633 |      
 634 |      // Subclass
 635 |      class Employee(override val name: String, override val age: Int, val salary: Int) extends Person(name, age) {
 636 |      
 637 |      }
 638 |      
 639 |      object Demo1 {
 640 |        def main(args: Array[String]): Unit = {
 641 |          // Create a Person object
 642 |          val p1 = new Person("Tom", 20)
 643 |          println(p1.sayHello())
 644 |      
 645 |          // Create an Employee object
 646 |          val p2: Person = new Employee("Mike", 25, 1000)
 647 |          println(p2.sayHello())
 648 |        }
 649 |      }
 650 |      ```
 651 |    
 652 |    - Case 2: overriding a parent-class method in a subclass:
 653 |      
 654 |      ```scala
 655 |      // Subclass
 656 |      class Employee(override val name: String, override val age: Int, val salary: Int) extends Person(name, age) {
 657 |        override def sayHello(): String = "sayHello in the subclass"
 658 |      }
 659 |      ```
 660 |    * Case 3: using an anonymous subclass
 661 |      
 662 |      ```scala
 663 |      // Create a new Person object from an anonymous subclass
 664 |      val p3: Person = new Person("Jerry", 26) {
 665 |        override def sayHello(): String = "sayHello in the anonymous subclass"
 666 |      }
 667 |      println(p3.sayHello())
 668 |      ```
 669 |    
 670 |    * Case 4: using an abstract class. An abstract class can contain abstract methods; it cannot be instantiated directly and can only be extended.
 671 |      
 672 |      ```scala
 673 |      // Abstract classes in Scala
 674 |      // Parent class: abstract
 675 |      abstract class Vehicle {
 676 |        // Define an abstract method
 677 |        def checkType(): String
 678 |      }
 679 |      // Subclasses
 680 |      class Car extends Vehicle {
 681 |        override def checkType(): String = "I am a car"
 682 |      }
 683 |      class Bicycle extends Vehicle {
 684 |        override def checkType(): String = "I am a bike"
 685 |      }
 686 |      object Demo2 {
 687 |        def main(args: Array[String]): Unit = {
 688 |          // Define two vehicles
 689 |          val v1: Vehicle = new Car
 690 |          println(v1.checkType())
 691 |          val v2: Vehicle = new Bicycle
 692 |          println(v2.checkType())
 693 |        }
 694 |      }
 695 |      ```
 696 |    
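 697 |    * Case 5: using abstract fields. An abstract field is a field without an initial value.
 698 |      
 699 |      ```scala
 700 |      // Abstract parent class
 701 |      abstract class Person {
 702 |        // First abstract field; it has only a getter
 703 |        val id: Int
 704 |        // Second abstract field; it has a getter and a setter
 705 |        var name: String
 706 |      }
 707 |      // Subclass: it must give the abstract fields initial values, otherwise it must be abstract too
 708 |      abstract class Employee extends Person {
 709 |        // val id: Int = 1
 710 |        var name: String = "no name"
 711 |      }
 712 |      // Alternatively, define a primary constructor that takes an id parameter; its name must match the field name in the parent class.
 713 |      class Employee2(val id: Int) extends Person {
 714 |        var name: String = "no name"
 715 |      }
 716 |      ```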
 717 | 
 718 | 9. Traits in Scala
 719 |   
 720 |    A trait is similar to an abstract class. The main difference is that a class can mix in several traits, whereas it can extend only one class.
 721 |    
 722 |    ```scala
 723 |    // First trait
 724 |    trait Human {
 725 |      val id: Int
 726 |      val name: String
 727 |    
 728 |      // A concrete method
 729 |      def sayHello(): String = "Hello " + name
 730 |    }
 731 |    
 732 |    // Second trait
 733 |    trait Actions {
 734 |      // An abstract method
 735 |      def getActionNames(): String
 736 |    }
 737 |    
 738 |    // Subclass
 739 |    class Student(val id: Int, val name: String) extends Human with Actions {
 740 |      override def getActionNames(): String = "Action is running"
 741 |    }
 742 |    
 743 |    object Demo2 {
 744 |      def main(args: Array[String]): Unit = {
 745 |        // Create a Student object
 746 |        val s1 = new Student(1, "Tom")
 747 |        println(s1.sayHello())
 748 |        println(s1.getActionNames())
 749 |      }
 750 |    }
 751 |    ```
 752 | 
 753 | 10. File access in Scala
 754 |   
 755 |     * Reading lines:
 756 |       
 757 |       ```scala
 758 |       import scala.io.Source
 759 |       
 760 |       object Demo2 {
 761 |         def main(args: Array[String]): Unit = {
 762 |           val source = Source.fromFile("d:/a.txt")
 763 |           // 1. Read the whole file as a single string
 764 |           // println(source.mkString)
 765 |           // 2. Read the file line by line
 766 |           val lines = source.getLines()
 767 |           for (l <- lines) println(l)
 768 |           // Close the source to release the file handle
 769 |           source.close()
 770 |         }
 771 |       }
 772 |       ```
 773 |     
 774 |     * Reading characters:
 775 |       
 776 |       ```scala
 777 |       val source = Source.fromFile("d:/a.txt")
 778 |       for (c <- source) println(c)
 779 |       ```
 780 |     
 781 |       Here source acts as an iterator over the characters in the file.
 782 |     
 783 |     * Reading from a URL or another source. Remember to specify the character set, UTF-8 here:
 784 |       
 785 |       ```scala
 786 |       // Read from a URL or another source: http://www.baidu.com
 787 |       val source = scala.io.Source.fromURL("http://www.baidu.com", "UTF-8")
 788 |       println(source.mkString)
 789 |       ```
 790 |     
 791 |     * Reading binary files: Scala has no API of its own for reading binary files, but you can read them through Java's InputStream.
 792 |       
 793 |       ```scala
 794 |       import java.io.{File, FileInputStream}
 795 |       object Demo2 {
 796 |        def main(args: Array[String]): Unit = {
 797 |          val file = new File("d:/a.txt")
 798 |          // Create an InputStream
 799 |          val in = new FileInputStream(file)
 800 |          // Create a buffer as large as the file
 801 |          val buffer = new Array[Byte](file.length().toInt)
 802 |          // Read the bytes into the buffer
 803 |          in.read(buffer)
 804 |          // Close the stream
 805 |          in.close()
 806 |        }
 807 |       }
 808 |       ```
 809 |     
 810 |     * Writing a text file:
 811 |       
 812 |       ```scala
 813 |       import java.io.PrintWriter
 814 |       object Demo2 {
 815 |          def main(args: Array[String]): Unit = {
 816 |            val out = new PrintWriter("d:/m.txt")
 817 |            for (i <- 1 to 10) out.println(i)
 818 |            out.close()
 819 |          }
 820 |       }
 821 |       ```
 822 | 
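The examples above leave the file open if reading or writing throws. A minimal sketch of the same operations with try/finally, assuming UTF-8 text files; the `FileRoundTrip` object and its method names are hypothetical.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object FileRoundTrip {
  // Write the lines to the file, closing the writer even if writing fails
  def writeLines(file: File, lines: Seq[String]): Unit = {
    val out = new PrintWriter(file, "UTF-8")
    try lines.foreach(line => out.println(line))
    finally out.close()
  }

  // Read all lines into a List, closing the source even if reading fails
  def readLines(file: File): List[String] = {
    val source = Source.fromFile(file, "UTF-8")
    try source.getLines().toList
    finally source.close()
  }
}
```

Converting the line iterator to a List before closing matters: `getLines()` is lazy, so iterating it after `close()` would fail.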
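 823 | ### (3) Functional programming in Scala
 824 | 
 825 | 1. Functions in Scala
 826 |   
 827 |    In Scala, functions are first-class citizens, just like numbers. A function can be stored in a variable, that is, used as the value of a variable (a function value).
 828 |    
 829 |    ```scala
 830 |    def myFun1(name: String): String = "Hello " + name
 831 |    println(myFun1("Tom"))
 832 |    
 833 |    def myFun2(): String = "Hello World"
 834 |    
 835 |    // Function values: use a function as the value of a variable
 836 |    val v1: String => String = myFun1
 837 |    val v2: () => String = myFun2
 838 |    // Call myFun2 through v2 and pass the result to myFun1 through v1
 839 |    println(v1(v2()))
 840 |    ```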
 841 | 
 842 | 2. Anonymous functions
 843 |   
 844 |    ```scala
 845 |    // An anonymous function
 846 |    (x: Int) => x * 3
 847 |    // Example: (1,2,3) --> (3,6,9)
 848 |    Array(1,2,3).map((x: Int) => x * 3).foreach(println)
 849 |    // Because map takes a function as its argument, the anonymous function above can be passed to it directly
 850 |    ```
 851 | 
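 852 | 3. Functions that take functions as parameters: higher-order functions
 853 |   
 854 |    ```scala
 855 |    import scala.math._
 856 |    
 857 |    // Define a higher-order function: a function with a function parameter
 858 |    def someAction(f: Double => Double) = f(10)
 859 |    println(someAction(sqrt))
 860 |    ```
 861 |    
 862 |    The parameter is named f and has the function type Double => Double: it takes a Double argument and returns a Double.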
 863 | 
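A higher-order function can take a function as a parameter, return one, or both. A minimal sketch; `applyTwice` and `adder` are hypothetical names.

```scala
object HigherOrderDemo {
  // Takes a function as a parameter and applies it twice
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Returns a function as its result
  def adder(n: Int): Int => Int = (x: Int) => x + n

  val sixteen = applyTwice(_ + 3, 10)        // (10 + 3) + 3
  val eightyOne = applyTwice(x => x * x, 3)  // (3 * 3) * (3 * 3)
  val twelve = adder(2)(10)                  // 10 + 2
}
```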
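 864 | 4. Closures
 865 |   
 866 |    A closure is a function that captures variables from its enclosing scope: when one function is defined inside another, the inner function can access the variables of the outer function.
 867 |    
 868 |    ```scala
 869 |    def mulBy(factor: Double) = (x: Double) => x * factor
 870 |    
 871 |    val triple = mulBy(3)
 872 |    val half = mulBy(0.5)
 873 |    // Call the two closures
 874 |    println(triple(10) + "\t" + half(8))
 875 |    ```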
 876 | 
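A closure captures the variable itself, not a snapshot of its value, so it can keep private mutable state between calls. A small sketch with a hypothetical `makeCounter` function:

```scala
object ClosureDemo {
  // Each call creates a fresh count variable that only the returned function can see
  def makeCounter(): () => Int = {
    var count = 0
    () => {
      count += 1
      count
    }
  }
}
```

Two counters created by separate calls to `makeCounter()` do not share state, because each closure captured its own `count`.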
 877 | 5. Currying
 878 |   
 879 |    Currying converts a function with several parameters into a chain of functions, each taking a single parameter.
 880 |    
 881 |    ```scala
 882 |    // Example: the following two definitions compute the same result
 883 |    
 884 |    def add(x: Int, y: Int) = x + y
 885 |    
 886 |    def addCurried(x: Int)(y: Int) = x + y // Scala syntax for currying
 887 |    ```
 888 |    
 889 |    A simple example:
 890 |    
 891 |    ```scala
 892 |    // An ordinary function
 893 |    def mulByOneTime(x: Int, y: Int) = x * y
 894 |    
 895 |    // A curried function
 896 |    def mulByOneTime1(x: Int) = (y: Int) => x * y
 897 |    
 898 |    // The shorthand form
 899 |    def mulByOneTime2(x: Int)(y: Int) = x * y
 900 |    ```
 901 | 
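The practical benefit of currying is partial application: fixing the first argument yields a new function of the remaining one. A minimal sketch with hypothetical names:

```scala
object CurryDemo {
  def add(x: Int)(y: Int): Int = x + y

  // Supplying only the first argument list gives a function of the second
  val addFive: Int => Int = add(5)

  // An ordinary two-argument function value can be converted with curried
  val plus: (Int, Int) => Int = (x, y) => x + y
  val plusCurried: Int => Int => Int = plus.curried

  val shifted = List(1, 2, 3).map(addFive)
}
```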
 902 | 6. Examples of higher-order functions
 903 |   
 904 |    ![Higher-order functions](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/highfun.png)
 905 |    
 906 |    Example 1: map
 907 |    
 908 |    ```scala
 909 |    // map
 910 |    // Applies a function to every element of a list and returns a list with the same number of elements
 911 |    val numbers = List(1,2,3,4,5,6,7,8,9,10)
 912 |    numbers.map((i: Int) => i * 2).foreach(println)
 913 |    ```
 914 |    
 915 |    Example 2: foreach
 916 |    
 917 |    ```scala
 918 |    val numbers = List(1,2,3,4,5,6,7,8,9,10)
 919 |    
 920 |    // foreach
 921 |    // foreach is similar to map, but it returns nothing; it is used only for the side effects of the function
 922 |    numbers.foreach(println)
 923 |    ```
 924 |    
 925 |    Example 3: filter
 926 |    
 927 |    ```scala
 928 |    val numbers = List(1,2,3,4,5,6,7,8,9,10)
 929 |    // filter
 930 |    // Removes every element for which the given function returns false
 931 |    numbers.filter((i: Int) => i % 2 == 0).foreach(println)
 932 |    ```
 933 |    
 934 |    Example 4: zip
 935 |    
 936 |    ```scala
 937 |    // zip
 938 |    // zip combines the elements of two lists into a single list of pairs
 939 |    val nList = List(1,2,3) zip List(4,5,6)
 940 |    nList.foreach(println)
 941 |    ```
 942 |    
 943 |    Example 5: partition (always splits into exactly two groups and returns a Tuple2)
 944 |    
 945 |    ```scala
 946 |    val numbers = List(1,2,3,4,5,6,7,8,9,10)
 947 |    // partition
 948 |    // partition splits a list according to the return value of the given function
 949 |    val tup = numbers.partition((i: Int) => i % 2 == 0)
 950 |    tup._1.foreach(println)
 951 |    println("----------")
 952 |    tup._2.foreach(println)
 953 |    ```
 954 |    
 955 |    Example 6: find
 956 |    
 957 |    ```scala
 958 |    val numbers = List(1,2,3,4,5,6,7,8,9,10)
 959 |    // find
 960 |    
 961 |    // find returns the first element that matches the predicate, wrapped in an Option
 962 |    
 963 |    println(numbers.find(_ % 3 == 0))
 964 |    ```
 965 |    
 966 |    Example 7: flatten
 967 |    
 968 |    ```scala
 969 |    // flatten
 970 |    // flatten expands a nested structure
 971 |    
 972 |    List(List(1,2,3),List(4,5,6)).flatten.foreach(println)
 973 |    ```
 974 |    
 975 |    Example 8: flatMap
 976 |    
 977 |    ```scala
 978 |    // flatMap
 979 |    // flatMap is a commonly used combinator that combines map and flatten
 980 |    val myList = List(List(1,2,3), List(4,5,6))
 981 |    myList.flatMap(x => x.map(_*2)).foreach(println)
 982 |    ```
 983 |    
 984 |    (1). First, the function is applied to each of the inner lists (1,2,3) and (4,5,6), doubling every element.
 985 |    
 986 |    (2). Then the resulting lists are flattened into a single list.
 987 | 
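That flatMap is equivalent to map followed by flatten can be checked directly. A short sketch with hypothetical value names:

```scala
object FlatMapDemo {
  val nested = List(List(1, 2, 3), List(4, 5, 6))

  // flatMap in one step
  val viaFlatMap = nested.flatMap(xs => xs.map(_ * 2))

  // The same result in two steps: map first, then flatten
  val viaMapThenFlatten = nested.map(xs => xs.map(_ * 2)).flatten

  // A typical use: split each sentence into words and collect all words in one list
  val words = List("hello world", "hello scala").flatMap(_.split(" ").toList)
}
```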
 988 | ### (4) Collections in Scala
 989 | 
 990 | 1. Mutable and immutable collections
 991 |   
 992 |    * Mutable collections: can be modified in place.
 993 |    
 994 |    * Immutable collections
 995 |      
 996 |      * An immutable collection never changes, so references to it can be shared safely.
 997 |      
 998 |      * This holds even in a multithreaded application.
 999 |        
1000 |        ```scala
1001 |        // Immutable collection
1002 |        val math = scala.collection.immutable.Map("Alice"->80,"Bob"->78)
1003 |        
1004 |        // Mutable collection
1005 |        val english = scala.collection.mutable.Map("Alice"->80)
1006 |        ```
1007 |        
1008 |        Collection operations:
1009 |        
1010 |        ```scala
1011 |        // 1. Get a value from the collection
1012 |        println(english("Alice"))
1013 |        // 2. Call contains to check whether a key exists
1014 |        if (english.contains("Alice")) {
1015 |          println("key exists")
1016 |        } else {
1017 |          println("key does not exist")
1018 |        }
1019 |        // 3. Return a default value when the key does not exist
1020 |        println(english.getOrElse("Alice1", -1))
1021 |        
1022 |        // 4. Update a value in the collection
1023 |        english("Alice") = 59
1024 |        println(english.getOrElse("Alice", -1))
1025 |        
1026 |        // 5. Add an element to the collection
1027 |        english += "xiaoming" -> 40
1028 |        
1029 |        // 6. Remove an element from the collection
1030 |        english -= "Alice"
1031 |        ```
1032 | 
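An immutable collection is never modified; operations such as `+` and `-` return a new collection instead. A minimal sketch with hypothetical value names:

```scala
object ImmutableMapDemo {
  val math = Map("Alice" -> 80, "Bob" -> 78)

  // + returns a new map; math itself is left untouched
  val updated = math + ("Alice" -> 90) + ("Tom" -> 60)

  // - also returns a new map
  val removed = updated - "Bob"
}
```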
1033 | 2. Lists
1034 |   
1035 |    * Immutable list (List)
1036 |      
1037 |      ```scala
1038 |      // Immutable list: List
1039 |      // A list of strings
1040 |      val nameslist = List("Bob", "Mary", "Mike")
1041 |      // A list of integers
1042 |      val intList = List(1,2,3,4,5)
1043 |      // An empty list
1044 |      val nullList:List[Nothing] = List()
1045 |      // A two-dimensional list
1046 |      val dim: List[List[Int]] = List(List(1,2,3),List(4,5,6))
1047 |      ```
1048 |      
1049 |      Operations on immutable lists:
1050 |      
1051 |      ```scala
1052 |      println("First name: " + nameslist.head)
1053 |      // tail does not return the last element; it returns the list of all elements except the first
1054 |      println(nameslist.tail)
1055 |      println("Is the list empty: " + nameslist.isEmpty)
1056 |      ```
1057 |    
1058 |    * Mutable list (LinkedList)
1059 |    
1060 |        ```scala
1061 |          // Mutable list: LinkedList is like the immutable List, but its values can be modified in place
1062 |          // Note: mutable.LinkedList was deprecated in Scala 2.11 and removed in 2.13
1063 |          import scala.collection.mutable
1064 |          val myList = mutable.LinkedList(1,2,3,4,5)
1065 |          // Task: multiply every value in the list by 2; elem holds the value of a node
1066 |          // Define a pointer to the start of the list
1067 |          var cur = myList
1068 |          // Nil is the empty list: loop until cur is empty
1069 |          while (cur != Nil) {
1070 |             // Multiply the current value by 2
1071 |             cur.elem = cur.elem * 2
1072 |             // Move the pointer to the next node
1073 |             cur = cur.next
1074 |          }
1075 | 
1076 |          // Show the result
1077 |          println(myList)
1078 |        ```
1079 | 
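Since `mutable.LinkedList` is no longer available in Scala 2.13, a comparable in-place update can be sketched with `ListBuffer`. This is a swapped-in alternative rather than the original technique, and `doubled` is a hypothetical helper name.

```scala
import scala.collection.mutable.ListBuffer

object ListBufferDemo {
  // Copy the values into a mutable buffer and double each element in place
  def doubled(values: Seq[Int]): ListBuffer[Int] = {
    val buf = ListBuffer(values: _*)
    for (i <- buf.indices) buf(i) = buf(i) * 2
    buf
  }
}
```

When mutation is not required, `List(1, 2, 3, 4, 5).map(_ * 2)` gives the same values as a new immutable list.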
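1080 | 3. Sequences
1081 |   
1082 |    Commonly used sequences include Vector and Range.
1083 |    
1084 |    * Vector is the immutable counterpart of ArrayBuffer: an indexed sequence.
1085 |      
1086 |      ```scala
1087 |      // Vector: a collection type introduced to make random access more efficient than with List
1088 |      // It supports fast lookup and update
1089 |      
1090 |      val v = Vector(1,2,3,4,5,6)
1091 |      
1092 |      // find returns the first element that satisfies the predicate
1093 |      v.find(_ > 3)
1094 |      v.updated(2, 100) // returns a new Vector; v itself is unchanged
1095 |      ```
1096 |    
1097 |    * Range represents a sequence of integers
1098 |    
1099 |        ```scala
1100 |          // Range: an ordered sequence of evenly spaced Int values
1101 |          // The following three expressions produce the same Range
1102 |          println("First form: " + Range(0, 5))
1103 |          println("Second form: " + (0 until 5))
1104 |          println("Third form: " + (0 to 4))
1105 | 
1106 |          // Two ranges can be concatenated
1107 |          ('0' to '9') ++ ('A' to 'Z')
1108 | 
1109 |          // A Range can be converted to a List
1110 |          (1 to 5).toList
1111 |        ```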
1112 | 
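1113 | 4. Sets and set operations
1114 |   
1115 |    * A Set is a collection of distinct elements.
1116 |    
1117 |    * Unlike a list, a set does not preserve insertion order. By default it is implemented as a hash set.
1118 |      
1119 |      Example 1: creating sets
1120 |      
1121 |      ```scala
1122 |      // Set: a collection of distinct elements, hash-based by default
1123 |      import scala.collection.mutable
1124 |      // Create a Set
1125 |      var s1 = Set(2,0,1)
1126 |      // Add an element that is not yet in s1
1127 |      s1 = s1 + 100
1128 |      
1129 |      // Add the same element again: s1 is unchanged, because a set holds no duplicates
1130 |      s1 = s1 + 100
1131 |      
1132 |      // Create a LinkedHashSet, which keeps insertion order
1133 |      var weeksday = mutable.LinkedHashSet("Monday", "Tuesday", "Wednesday", "Thursday")
1134 |      // Create a sorted set
1135 |      var s2 = mutable.SortedSet(1,2,3,10,4)
1136 |      ```
1137 |      
1138 |      Example 2: set operations
1139 |      
1140 |      ```scala
1141 |      // Set operations
1142 |      // 1. Add an element (+ returns a new set; use += to update a mutable set in place)
1143 |      
1144 |      weeksday + "Friday"
1145 |      
1146 |      // 2. Check whether an element exists
1147 |      weeksday.contains("Tuesday")
1148 |      
1149 |      // 3. Check whether one set is a subset of another
1150 |      Set("Tuesday", "Thursday", "Sunday") subsetOf(weeksday)
1151 |      ```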
1152 | 
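Beyond membership and subset tests, sets support the usual algebraic operations. A short sketch on immutable sets, with hypothetical value names:

```scala
object SetOpsDemo {
  val a = Set(1, 2, 3, 4)
  val b = Set(3, 4, 5)

  val union = a | b          // elements in a or b
  val intersection = a & b   // elements in both a and b
  val difference = a.diff(b) // elements in a but not in b
  val isSubset = Set(3, 4).subsetOf(a)
}
```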
1153 | 5. Pattern matching
1154 |   
1155 |    Scala has a powerful pattern matching mechanism that can be used in many situations:
1156 |    
1157 |    * switch statements
1158 |    
1159 |    * type checks
1160 |    
1161 |    Scala also provides case classes, which are optimized for pattern matching.
1162 |    
1163 |    Pattern matching examples:
1164 |    
1165 |    * A better switch
1166 |      
1167 |      ```scala
1168 |      // A better switch
1169 |      var sign = 0
1170 |      var ch1 = '-'
1171 |      ch1 match {
1172 |          case '+' => sign = 1
1173 |          case '-' => sign = -1
1174 |          case _ => sign = 0
1175 |      }
1176 |      println(sign)
1177 |      ```
1178 |    
1179 |    * Guards
1180 |    
1181 |        ```scala
1182 |          // Guards: match every value that satisfies a condition
1183 |          var ch2 = '6'
1184 |          var digit: Int = -1
1185 |          ch2 match {
1186 |              case '+' => println("This is a +")
1187 |              case '-' => println("This is a -")
1188 |              case _ if Character.isDigit(ch2) => digit = Character.digit(ch2, 10)
1189 |              case _ => println("Something else")
1190 |          }
1191 |          println("Digit: " + digit)
1192 |        ```
1193 |    
1194 |    * Variables in patterns
1195 |    
1196 |        ```scala
1197 |          // Variables in patterns
1198 |          var str3 = "Hello World"
1199 |          str3(7) match {
1200 |              case '+' => println("This is a +")
1201 |              case '-' => println("This is a -")
1202 |              case ch => println("The character is: " + ch)
1203 |          }
1204 |        ```
1205 |    
1206 |    * Type patterns
1207 |    
1208 |        ```scala
1209 |          // Type patterns
1210 |          var v4: Any = 100
1211 |          v4 match {
1212 |              case x: Int => println("This is an integer: " + x)
1213 |              case s: String => println("This is a string: " + s)
1214 |              case _ => println("Some other type")
1215 |          }
1216 |        ```
1217 |    
1218 |    * Matching arrays and lists
1219 |    
1220 |        ```scala
1221 |          // Matching arrays and lists
1222 |          var myArray = Array(1,2,3)
1223 |          myArray match {
1224 |              case Array(0) => println("0")
1225 |              case Array(x,y) => println("The array has two elements")
1226 |              case Array(x,y,z) => println("The array has three elements")
1227 |              case Array(x, _*) => println("This is an array")
1228 |          }
1229 |          // The last case matches an array with any number of elements (at least one), so it acts as the default
1230 |        ```
1231 |    
1232 |        ```scala
1233 |          var myList = List(1,2,3)
1234 |          myList match {
1235 |              case List(0) => println("0")
1236 |              case List(x,y) => println("The list has two elements")
1237 |              case List(x,y,z) => println("This is a list with three elements")
1238 |              case List(x, _*) => println("The list has several elements")
1239 |          }
1240 |        ```
1241 | 
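Lists can also be matched with the `::` pattern, which splits a list into its head and tail and pairs naturally with recursion. A small sketch with hypothetical function names:

```scala
object ListPatternDemo {
  // Sum a list recursively by matching on its structure
  def sum(xs: List[Int]): Int = xs match {
    case Nil          => 0
    case head :: tail => head + sum(tail)
  }

  // Describe a list by how many leading elements it has
  def describe(xs: List[Int]): String = xs match {
    case Nil         => "empty"
    case x :: Nil    => s"one element: $x"
    case x :: y :: _ => s"starts with $x and $y"
  }
}
```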
1242 | 6. Case classes
1243 |   
1244 |    Simply put, a case class is an ordinary class definition preceded by the keyword case, which makes the class usable in pattern matching.
1245 |    
1246 |    The biggest benefit of case classes is that they support pattern matching.
1247 |    
1248 |    First, recall the pattern matching shown earlier:
1249 |    
1250 |    ```scala
1251 |    // Ordinary pattern matching
1252 |    var name: String = "Tom"
1253 |    
1254 |    name match {
1255 |        case "Tom" => println("Hello Tom")
1256 |        case "Mary" => println("Hello Mary")
1257 |        case _ => println("Others")
1258 |    }
1259 |    ```
1260 |    
1261 |    Next, to check whether an object is an instance of a class, you can use isInstanceOf, which plays the role of Java's instanceof.
1262 |    
1263 |    ```scala
1264 |    // Is an object an instance of a given class? Use isInstanceOf
1265 |    class Fruit
1266 |    
1267 |    class Apple(name: String) extends Fruit
1268 |    class Banana(name: String) extends Fruit
1269 |    
1270 |    // Create the objects
1271 |    var aApple: Fruit = new Apple("apple")
1272 |    var bBanana: Fruit = new Banana("banana")
1273 |    
1274 |    println("Is aApple a Fruit? " + aApple.isInstanceOf[Fruit])
1275 |    println("Is aApple an Apple? " + aApple.isInstanceOf[Apple])
1276 |    println("Is aApple a Banana? " + aApple.isInstanceOf[Banana])
1277 |    ```
1278 |    
1279 |    Scala offers a simpler way to perform this check: case classes.
1280 |    
1281 |    ```scala
1282 |    // Pattern matching with case classes
1283 |    class Vehicle
1284 |    case class Car(name: String) extends Vehicle
1285 |    case class Bicycle(name: String) extends Vehicle
1286 |    
1287 |    // Define a Car object
1288 |    var aCar: Vehicle = new Car("Tom's car")
1289 |    aCar match {
1290 |        case Car(name) => println("I am a car: " + name)
1291 |        case Bicycle(name) => println("I am a bicycle")
1292 |        case _ => println("Some other vehicle")
1293 |    }
1294 |    ```
1295 | 
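Besides pattern matching, the compiler generates several conveniences for a case class: a companion apply method, structural equality, a readable toString, and copy. A minimal sketch using a hypothetical `Point` class:

```scala
// Hypothetical case class for illustration
case class Point(x: Int, y: Int)

object CaseClassDemo {
  val p1 = Point(1, 2)        // no new needed: the generated apply is called
  val p2 = Point(1, 2)
  val moved = p1.copy(y = 5)  // copy with one field changed

  // Destructuring through the generated extractor
  def onXAxis(p: Point): Boolean = p match {
    case Point(_, 0) => true
    case _           => false
  }
}
```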
1296 | ### (5) Advanced features of Scala
1297 | 
1298 | 1. Generic classes
1299 |   
1300 |    As in Java or C++, classes and traits can take type parameters. In Scala, type parameters are written in square brackets.
1301 |    
1302 |    ```scala
1303 |    class GenericClass[T] {
1304 |        // Define a variable
1305 |        private var content: T = _
1306 |    
1307 |        // Define the set and get methods for the variable
1308 |        def set(value: T) = {content = value}
1309 |        def get(): T = {content}
1310 |    }
1311 |    ```
1312 |    
1313 |    Test program:
1314 |    
1315 |    ```scala
1316 |    // Test
1317 |    object GenericClass {
1318 |        def main(args: Array[String]): Unit = {
1319 |            // Create a generic object holding an Int
1320 |            val intGeneric = new GenericClass[Int]
1321 |            intGeneric.set(123)
1322 |            println("The value is: " + intGeneric.get())
1323 |    
1324 |            // Create a generic object holding a String
1325 |            val stringGeneric = new GenericClass[String]
1326 |            stringGeneric.set("Hello Scala")
1327 |            println("The value is: " + stringGeneric.get())
1328 |        }
1329 |    }
1330 |    ```
1331 | 
1332 | 2. Generic functions
1333 |   
1334 |    Functions and methods can take type parameters too. As with generic classes, the type parameter is placed after the method name. Note that ClassTag is required here: it carries the runtime type information needed to create an Array[T].
1335 |    
1336 |    ```scala
1337 |    import scala.reflect.ClassTag
1338 |    
1339 |    // A function that creates an array of Int
1340 |    def mkIntArray(elems: Int*) = Array[Int](elems:_*)
1341 |    
1342 |    // A function that creates an array of String
1343 |    def mkStringArray(elems: String*) = Array[String](elems:_*)
1344 |    
1345 |    // Question: can one function, mkArray, create both Int arrays and String arrays?
1346 |    // A generic function
1347 |    def mkArray[T:ClassTag](elems:T*) = Array[T](elems:_*)
1348 |    mkArray(1,2,3,5,8)
1349 |    mkArray("Tom", "Mary")
1350 |    ```
1351 | 
1352 | 3. Upper Bounds and Lower Bounds
1353 |   
1354 |    Upper and lower type bounds restrict the range of a type variable. They have the following meaning:
1355 |    
1356 |    * S <: T
1357 |      
1358 |      This defines an upper bound: S must be a subtype of T, or T itself (a type counts as its own subtype).
1359 |    
1360 |    * U >: T
1361 |    
1362 |      This defines a lower bound: U must be a supertype of T, or T itself.
1363 |    
1364 |    * A simple example
1365 |      
1366 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/upbound.png)
1367 |      
1368 |      * A slightly more complex example (upper bound):
1369 |      
1370 |          ```scala
1371 |          class Vehicle {
1372 |              def drive() = {println("Driving")}
1373 |          }
1374 | 
1375 |          class Car extends Vehicle {
1376 |              override def drive() = {println("Car Driving")}
1377 |          }
1378 | 
1379 |          class Bicycle extends Vehicle {
1380 |              override def drive() = {println("Bicycle Driving")}
1381 |          }
1382 | 
1383 |          object ScalaUpperBounds {
1384 |              // Define a method whose type parameter must be Vehicle or a subtype
1385 |              def takeVehicle[T <: Vehicle](v: T) = {v.drive()}
1386 | 
1387 |              def main(args: Array[String]): Unit = {
1388 |                  val v: Vehicle = new Vehicle
1389 |                  takeVehicle(v)
1390 | 
1391 |                  val c: Car = new Car
1392 |                  takeVehicle(c)
1393 |              }
1394 |          }
1395 |          ```
1396 | 
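The section gives code only for the upper bound. A lower bound can be sketched as follows; `Animal`, `Dog`, `Cat`, and `prepend` are hypothetical names used for illustration.

```scala
class Animal(val name: String)
class Dog(name: String) extends Animal(name)
class Cat(name: String) extends Animal(name)

object LowerBoundDemo {
  // U >: Dog means U must be Dog or one of its supertypes,
  // so the result can be widened to a list of a more general type
  def prepend[U >: Dog](elem: U, dogs: List[Dog]): List[U] = elem :: dogs

  val dogs: List[Dog] = List(new Dog("Rex"))

  // Prepending a Cat forces U to be inferred as Animal
  val animals: List[Animal] = prepend(new Cat("Tom"), dogs)
}
```

The standard library's own `::` on List uses the same kind of lower bound, which is why prepending a Cat to a `List[Dog]` yields a `List[Animal]`.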
1397 | 4. View bounds
1398 |   
1399 |    A view bound, written <%, is more permissive than <:. In addition to all subtypes, it also accepts types that can be implicitly converted to the bound. Because it applies more widely, a view bound can be used in place of an upper bound.
1400 |    
1401 |    Example:
1402 |    
1403 |    * Take the earlier example. Because the upper bound of T is String, passing 100 and 200 causes a type mismatch.
1404 |      
1405 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/viewbund.png)
1406 |    
1407 |    * However, 100 and 200 can be converted to strings, so a view bound lets the addTwoString method accept a wider range of types: String and its subtypes, plus any type that can be converted to String.
1408 |    
1409 |      Note that the operator is <%.
1410 |    
1411 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/viewbb.png)
1412 |    
1413 |    * Running it, however, produces an error:
1414 |    
1415 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/viewerr.png)
1416 |    
1417 |      The reason is that Scala defines no rule for converting an Int to a String, so to use the view bound we must create the conversion rule ourselves.
1418 |    
1419 |    * Create the conversion rule:
1420 |      
1421 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/changerule.png)
1422 |    
1423 |    * It now runs successfully:
1424 |      
1425 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/viewsuccess.png)
1426 | 
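Because the example above exists only as screenshots, here is a text sketch of the same idea, assuming Scala 2.12 or 2.13. A view bound `T <% String` is shorthand for an extra implicit parameter of type `T => String`, and the sketch writes that parameter out explicitly. The method body and the conversion `int2String` are assumptions, not a transcription of the images.

```scala
import scala.language.implicitConversions

object ViewBoundDemo {
  // Hypothetical conversion rule from Int to String
  implicit def int2String(n: Int): String = n.toString

  // Same meaning as: def addTwoString[T <% String](x: T, y: T)
  def addTwoString[T](x: T, y: T)(implicit toStr: T => String): String =
    toStr(x) + " " + toStr(y)

  val fromStrings = addTwoString("Hello", "World") // String needs no conversion
  val fromInts = addTwoString(100, 200)            // uses int2String
}
```

Without `int2String` in scope, the call with 100 and 200 would not compile, which corresponds to the error described above.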
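1427 | 5. Covariance and contravariance
1428 |   
1429 |    * Covariance:
1430 |      
1431 |      In the generic definition of a class or trait, placing a + sign before a type parameter makes the class or trait covariant in that parameter.
1432 |      
1433 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/helpc.png)
1434 |    
1435 |    * Contravariance:
1436 |      
1437 |      In the definition of a class or trait, placing a - sign before a type parameter makes the generic class or trait contravariant in that parameter.
1438 |      
1439 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/reverc.png)
1440 |    
1441 |    * Summary:
1442 |      
1443 |      Covariance: a generic variable can hold a value whose type argument is the declared type or one of its subtypes.
1444 |      
1445 |      Contravariance: a generic variable can hold a value whose type argument is the declared type or one of its supertypes.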
1446 | 
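The two variance annotations can be sketched in text form as follows. The classes `Bird`, `Sparrow`, `Cage`, and `Describer` are hypothetical and are not taken from the screenshots.

```scala
class Bird { def name: String = "bird" }
class Sparrow extends Bird { override def name: String = "sparrow" }

// Covariant: Cage[Sparrow] is a subtype of Cage[Bird]
class Cage[+T](val occupant: T)

// Contravariant: Describer[Bird] is a subtype of Describer[Sparrow]
trait Describer[-T] { def describe(x: T): String }

object VarianceDemo {
  val sparrowCage: Cage[Sparrow] = new Cage(new Sparrow)
  val birdCage: Cage[Bird] = sparrowCage // allowed because of +T

  val birdDescriber: Describer[Bird] = new Describer[Bird] {
    def describe(x: Bird): String = "a " + x.name
  }
  val sparrowDescriber: Describer[Sparrow] = birdDescriber // allowed because of -T

  val text = sparrowDescriber.describe(new Sparrow)
}
```

Removing the `+` or `-` makes the corresponding assignment a compile error, since generic types are invariant by default.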
1447 | 6. Implicit conversion functions
1448 |   
1449 |    An implicit conversion function is a function with a single parameter that is declared with the implicit keyword.
1450 |    
1451 |    * An example from the earlier discussion of view bounds:
1452 |      
1453 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/imp.png)
1454 |    
1455 |    * Another example: converting a Fruit object into a Monkey object
1456 |      
1457 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/monkey.png)
1458 | 
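A text sketch of the Fruit to Monkey conversion follows. Only the two class names come from the text; the class bodies, the `eat` method, and the `fruit2Monkey` function are assumptions.

```scala
import scala.language.implicitConversions

class Fruit(val name: String)

class Monkey(val fruit: Fruit) {
  def eat(): String = s"Monkey eats ${fruit.name}"
}

object ImplicitConversionDemo {
  // Implicit conversion function: a single parameter, marked implicit
  implicit def fruit2Monkey(f: Fruit): Monkey = new Monkey(f)

  // Fruit has no eat method, so the compiler inserts fruit2Monkey(...)
  val result: String = new Fruit("banana").eat()
}
```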
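1459 | 7. Implicit parameters
1460 |   
1461 |    A function parameter declared with implicit is called an implicit parameter. Implicit parameters can also be used to implement implicit conversions.
1462 |    
1463 |    ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/yincan.png)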
1464 | 
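As a text sketch of an implicit parameter, the hypothetical `smaller` function below takes its ordering implicitly. It uses the standard library's `Ordering` type class, which may differ from what the screenshot shows.

```scala
object ImplicitParamDemo {
  // The compiler supplies ord from the implicit scope when it is not passed explicitly
  def smaller[T](a: T, b: T)(implicit ord: Ordering[T]): T =
    if (ord.lt(a, b)) a else b

  val minInt = smaller(100, 23)        // Ordering[Int] is found implicitly
  val minStr = smaller("Hello", "ABC") // Ordering[String] is found implicitly

  // The implicit parameter can still be passed by hand
  val maxInt = smaller(100, 23)(Ordering[Int].reverse)
}
```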
1465 | 8. Implicit classes
1466 |   
1467 |    An implicit class is a class marked with the implicit modifier. Its main purpose is to add functionality to an existing class.
1468 |    
1469 |    ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/add.png)
1470 | 
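A text sketch of an implicit class that adds methods to Int. The names `RichCalc`, `add`, and `isEven` are hypothetical and not taken from the screenshot.

```scala
object ImplicitClassDemo {
  // An implicit class must be nested inside an object, class, or trait
  // and take exactly one constructor parameter
  implicit class RichCalc(val x: Int) {
    def add(y: Int): Int = x + y
    def isEven: Boolean = x % 2 == 0
  }

  val one = 1
  val sum = one.add(2)  // rewritten by the compiler to new RichCalc(one).add(2)
  val even = sum.isEven
}
```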
1471 | ### (6) Common Scala syntax errors
1472 | 
1473 | 1. Error caused by calling a method without the dot:
1474 | 
1475 |    ```scala
1476 |     val dataFieldValue = dataFieldStr toInt
1477 |     if (dataFieldValue >= startParamFieldValue && dataFieldValue <= endParamFieldValue) {
1478 |       return true
1479 |     } else {
1480 |       return false
1481 |     }
1482 |    ```
1483 | 
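1484 |     Compilation fails, because with postfix notation the parser treats the following line as a continuation of the expression: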
1485 | 
1486 |    ```scala
1487 |     Error:(311, 7) illegal start of simple expression
1488 |           if (dataFieldValue >= startParamFieldValue && dataFieldValue <= endParamFieldValue) {
1489 |    ```
1490 | 
1491 |     Correct version:
1492 | 
1493 |    ```scala
1494 |     val dataFieldValue = dataFieldStr.toInt
1495 |     if (dataFieldValue >= startParamFieldValue && dataFieldValue <= endParamFieldValue) {
1496 |       return true
1497 |     } else {
1498 |       return false
1499 |     }
1500 |    ```
1501 | 
1502 | 2. Ways to create a numeric sequence for a loop in Scala:
1503 | 
1504 |    ```scala
1505 |    // 0-10
1506 |    for (i <- 0 to 10) {
1507 |        println(i)
1508 |    }
1509 |    
1510 |    // 0-9
1511 |    for (i <- 0 until 10) {
1512 |        println(i)
1513 |    }
1514 |    ```
1515 | 
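1516 | 3. What is the difference between groupByKey and reduceByKey?
1517 | 
1518 |    Both involve a shuffle. The difference is that reduceByKey merges the values of each key within a partition before the shuffle, which reduces the amount of data transferred and makes it more efficient.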
1519 | 
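1520 | 4. What does ++= do in Scala?
1521 | 
1522 |    ```scala
1523 |    import scala.collection.mutable.ArrayBuffer
1524 |    val d = ArrayBuffer[Int]()
1525 |    d += 9
1526 |    val m = ArrayBuffer[Int]()
1527 |    m += 29
1528 |    
1529 |    m ++= d
1530 |    
1531 |    println(m) // ArrayBuffer(29, 9)
1532 |    println(d) // ArrayBuffer(9)
1533 |    
1534 |    m -= 9
1535 |    
1536 |    println(m) // ArrayBuffer(29)
1537 |    println(d) // ArrayBuffer(9)
1538 |    ```
1539 | 
1540 |    ++= appends all elements of one collection to another. The two buffers stay independent: removing 9 from m leaves d unchanged. Only the element references are appended, though; the elements themselves are not deep-copied.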
1541 | 
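With Int elements the distinction between copying values and copying references is invisible. The sketch below uses a hypothetical mutable `Box` class to show that `++=` leaves the buffers independent while the appended elements are shared.

```scala
import scala.collection.mutable.ArrayBuffer

object AppendAllDemo {
  // Hypothetical mutable element type
  class Box(var value: Int)

  val shared = new Box(1)
  val d = ArrayBuffer(shared)
  val m = ArrayBuffer[Box]()

  m ++= d            // m now refers to the same Box object as d
  m(0).value = 42    // the change is visible through d as well
  m += new Box(7)    // only m grows; d keeps its single element
}
```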


--------------------------------------------------------------------------------
/28-Scala/imgs/add.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/add.png


--------------------------------------------------------------------------------
/28-Scala/imgs/changerule.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/changerule.png


--------------------------------------------------------------------------------
/28-Scala/imgs/com.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/com.png


--------------------------------------------------------------------------------
/28-Scala/imgs/data-type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/data-type.png


--------------------------------------------------------------------------------
/28-Scala/imgs/def-demo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/def-demo.png


--------------------------------------------------------------------------------
/28-Scala/imgs/def.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/def.png


--------------------------------------------------------------------------------
/28-Scala/imgs/exception.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/exception.png


--------------------------------------------------------------------------------
/28-Scala/imgs/fun-param-process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/fun-param-process.png


--------------------------------------------------------------------------------
/28-Scala/imgs/fun-param.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/fun-param.png


--------------------------------------------------------------------------------
/28-Scala/imgs/helpc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/helpc.png


--------------------------------------------------------------------------------
/28-Scala/imgs/highfun.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/highfun.png


--------------------------------------------------------------------------------
/28-Scala/imgs/imp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/imp.png


--------------------------------------------------------------------------------
/28-Scala/imgs/insert-value.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/insert-value.png


--------------------------------------------------------------------------------
/28-Scala/imgs/lazy-com.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/lazy-com.png


--------------------------------------------------------------------------------
/28-Scala/imgs/lazy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/lazy.png


--------------------------------------------------------------------------------
/28-Scala/imgs/math.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/math.png


--------------------------------------------------------------------------------
/28-Scala/imgs/matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/matrix.png


--------------------------------------------------------------------------------
/28-Scala/imgs/monkey.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/monkey.png


--------------------------------------------------------------------------------
/28-Scala/imgs/nothing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/nothing.png


--------------------------------------------------------------------------------
/28-Scala/imgs/param-type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/param-type.png


--------------------------------------------------------------------------------
/28-Scala/imgs/reverc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/reverc.png


--------------------------------------------------------------------------------
/28-Scala/imgs/scala-logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/scala-logo.jpg


--------------------------------------------------------------------------------
/28-Scala/imgs/trycatch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/trycatch.png


--------------------------------------------------------------------------------
/28-Scala/imgs/unit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/unit.png


--------------------------------------------------------------------------------
/28-Scala/imgs/upbound.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/upbound.png


--------------------------------------------------------------------------------
/28-Scala/imgs/viewbb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/viewbb.png


--------------------------------------------------------------------------------
/28-Scala/imgs/viewbund.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/viewbund.png


--------------------------------------------------------------------------------
/28-Scala/imgs/viewerr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/viewerr.png


--------------------------------------------------------------------------------
/28-Scala/imgs/viewsuccess.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/viewsuccess.png


--------------------------------------------------------------------------------
/28-Scala/imgs/yincan.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/28-Scala/imgs/yincan.png


--------------------------------------------------------------------------------
/29-SparkCore/README.md:
--------------------------------------------------------------------------------
  1 | ## Spark Core
  2 | 
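  3 | ### (1) What Is Spark?
  4 | 
  5 | 1. What is Spark?
  6 | 
  7 |    ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/spark.png)
  8 | 
  9 |    **Spark is a fast, general-purpose engine for large-scale data processing.**
 10 | 
 11 |    Spark is a fast, general-purpose, scalable big-data analytics engine. It was created in 2009 at UC Berkeley's AMPLab, open-sourced in 2010, entered the Apache Incubator in June 2013, and became an Apache top-level project in February 2014. The Spark ecosystem has since grown into a collection of subprojects, including Spark SQL, Spark Streaming, GraphX, and MLlib. Spark is a parallel computing framework built around in-memory computation, which improves the timeliness of data processing in big-data environments while preserving high fault tolerance and scalability, and it can be deployed on large numbers of inexpensive machines to form a cluster. Spark is backed by many big-data companies, including Hortonworks, IBM, Intel, Cloudera, MapR, Pivotal, Baidu, Alibaba, Tencent, JD.com, Ctrip, and Youku Tudou. Baidu has applied Spark to Fengchao, web search, Zhidahao, Baidu Big Data, and other services; Alibaba used GraphX to build large-scale graph computation and graph mining systems that implement the recommendation algorithms of many production systems; Tencent's Spark cluster reached 8,000 nodes, the largest Spark cluster known at the time of writing.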
 12 | 
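 13 | 2. Why learn Spark?
 14 | 
 15 |    * Problems with Hadoop's MapReduce computation model:
 16 | 
 17 |      Anyone who has studied Hadoop MapReduce knows that its core is the shuffle. Across the whole shuffle, at least six rounds of I/O take place. The figure below is the shuffle diagram drawn in the MapReduce chapter.
 18 | 
 19 |      Intermediate output: MapReduce-based engines usually write intermediate results to disk for storage and fault tolerance. In addition, when a query (for example, from Hive) is translated into MapReduce jobs, it often produces several chained stages, each of which relies on the underlying file system (such as HDFS) to store its output. That I/O is comparatively slow, which drags down MapReduce's running speed.
 20 | 
 21 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/shuffle.png)
 22 | 
 23 |    * Spark's defining feature: in-memory computation
 24 | 
 25 |      Spark is an alternative to MapReduce that is compatible with HDFS and Hive and fits into the Hadoop ecosystem, making up for MapReduce's shortcomings.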
 26 | 
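 27 | 3. Spark's features: fast, easy to use, general-purpose, compatible
 28 | 
 29 |    * Fast
 30 | 
 31 |      Compared with Hadoop MapReduce, Spark's in-memory computation is reported to be more than 100 times faster, and even its disk-based computation about 10 times faster. Spark implements an efficient DAG execution engine, which lets it process data flows efficiently in memory.
 32 | 
 33 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/speed.png)
 34 | 
 35 |    * Easy to use
 36 | 
 37 |      Spark provides APIs in Java, Python, and Scala and offers more than 80 high-level operators, so users can build different applications quickly. Spark also supports interactive Python and Scala shells, which make it convenient to try out solutions against a Spark cluster.
 38 | 
 39 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/easy.png)
 40 | 
 41 |    * General-purpose
 42 | 
 43 |      Spark offers a unified solution. It can be used for **batch processing**, interactive queries (**Spark SQL**), real-time stream processing (**Spark Streaming**), machine learning (**Spark MLlib**), and graph computation (**GraphX**), and these kinds of processing can be combined seamlessly in one application. A unified platform is attractive because it reduces the staffing cost of development and maintenance and the hardware cost of deployment.
 44 | 
 45 |      Spark also fits well into the Hadoop architecture: it can operate on HDFS directly, and Hive on Spark and Pig on Spark provide framework-level integration with Hadoop.
 46 | 
 47 |    * Compatible
 48 | 
 49 |      Spark integrates easily with other open-source products. For example, it can use Hadoop YARN or Apache Mesos as its resource manager and scheduler, and it can process all data that Hadoop supports, including HDFS, HBase, and Cassandra. This matters especially to users who already run a Hadoop cluster, because they can use Spark's processing power without migrating any data. Spark does not have to rely on a third-party resource manager either: its built-in Standalone mode serves as resource manager and scheduler, which lowers the barrier to entry further and makes Spark easy to deploy and use. Spark also provides tools for deploying a Standalone Spark cluster on EC2.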
 50 | 
 51 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/everywhere.png)
 52 | 
 53 | ### (2) Spark Architecture, Installation, and Deployment
 54 | 
 55 | 1. Architecture of a Spark cluster
 56 | 
 57 |    * The official diagram:
 58 | 
 59 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/arc1.png)
 60 | 
 61 |    * A diagram that is easier to follow:
 62 | 
 63 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/arc2.png)
 64 | 
 65 | 2. Installing and deploying Spark
 66 | 
 67 |    * Spark can be deployed in the following modes:
 68 |      * Standalone
 69 |      * YARN
 70 |      * Apache Mesos
 71 |      * Amazon EC2
 72 | 
 73 |    * Pseudo-distributed Spark Standalone deployment:
 74 | 
 75 |      * Configuration file: `conf/spark-env.sh`
 76 |        * `export JAVA_HOME=/root/training/jdk1.7.0_75`
 77 |        * `export SPARK_MASTER_HOST=spark81`
 78 |        * `export SPARK_MASTER_PORT=7077`
 79 |      * Configuration file: `conf/slaves`
 80 |        * `spark81`
 81 | 
 82 |    * Fully distributed Spark Standalone deployment:
 83 | 
 84 |      * Configuration file: `conf/spark-env.sh`
 85 |        * `export JAVA_HOME=/root/training/jdk1.7.0_75`
 86 |        * `export SPARK_MASTER_HOST=spark81`
 87 |        * `export SPARK_MASTER_PORT=7077`
 88 |      * Configuration file: `conf/slaves`
 89 |        * `spark83`
 90 |        * `spark84`
 91 | 
 92 |    * Start the Spark cluster: `start-all.sh`
 93 | 
 94 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/startall.png)
 95 | 
 96 | 3. Implementing Spark HA
 97 | 
 98 |    * Single-node recovery based on the file system
 99 | 
100 |      Intended mainly for development and test environments. Spark is given a directory in which it stores the registration information of Spark applications and workers and writes their recovery state. If the master fails, restarting the Master process with `sbin/start-master.sh` recovers the registration information of the running Spark applications and workers.
101 | 
102 |      File-system-based recovery is configured through `SPARK_DAEMON_JAVA_OPTS` in `spark-env.sh`:
103 | 
104 |      |       Configuration parameter       |                              Value                              |
105 |      | :---------------------------------: | :-------------------------------------------------------------: |
106 |      |      spark.deploy.recoveryMode      | Set to FILESYSTEM to enable single-node recovery; default: NONE |
107 |      |   spark.deploy.recoveryDirectory    |         Directory in which Spark stores the recovery state      |
108 | 
109 |      Example:
110 | 
111 |      ```shell
112 |      export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/root/training/spark-2.1.0-bin-hadoop2.7/recovery"
113 |      ```
114 | 
115 |      Test:
116 | 
117 |      1. Start the Spark cluster on spark82
118 | 
119 |      2. Start a Spark shell on spark83
120 | 
121 |         `MASTER=spark://spark82:7077 spark-shell`
122 | 
123 |      3. Stop the master on spark82
124 | 
125 |         `stop-master.sh`
126 | 
127 |      4. Watch the output on spark83:
128 | 
129 |         ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/spark83.png)
130 | 
131 |      5. Restart the master on spark82:
132 | 
133 |         `start-master.sh`
134 | 
135 |    * Standby Masters based on ZooKeeper
136 | 
137 |      ZooKeeper provides a leader election mechanism. With it, a cluster can run several Masters while guaranteeing that only one is Active and the rest are Standby. When the Active Master fails, one of the Standby Masters is elected to take over. Because the cluster state, including Worker, Driver, and Application information, is persisted in ZooKeeper, a failover only affects the submission of new jobs; jobs that are already running are not affected. The overall architecture with ZooKeeper is shown below:
138 | 
139 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/standalone.png)
140 | 
141 |      |  Configuration parameter   |                               Value                                |
142 |      | :------------------------: | :----------------------------------------------------------------: |
143 |      | spark.deploy.recoveryMode  | Set to ZOOKEEPER to enable standby-Master recovery; default: NONE  |
144 |      | spark.deploy.zookeeper.url |               Addresses of the ZooKeeper cluster                   |
145 |      | spark.deploy.zookeeper.dir | Directory in ZooKeeper where Spark stores its state; default: /spark |
146 | 
147 |      Example:
148 | 
149 |      ```shell
150 |      export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=bigdata12:2181,bigdata13:2181,bigdata14:2181 -Dspark.deploy.zookeeper.dir=/spark"
151 |      ```
152 | 
153 |      In addition, on every node, comment out the settings that were used for the fully distributed environment:
154 | 
155 |      ```shell
156 |      # Comment out these two lines:
157 |      
158 |      # export SPARK_MASTER_HOST=spark82
159 |      # export SPARK_MASTER_PORT=7077
160 |      ```
161 | 
162 |      Information stored in ZooKeeper:
163 | 
164 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/sz.png)
165 | 
166 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/active-standby.png)
167 | 
168 | ### (3) Running the Spark Demo Programs
169 | 
170 | 1. Running a Spark example program
171 | 
172 |    * Start the Spark cluster: `sbin/start-all.sh`
173 | 
174 |    * Example jar: `$SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar`
175 | 
176 |    * Sources of all examples: `$SPARK_HOME/examples/src/main`, available in Java, Python, Scala, and R
177 | 
178 |    * Demo: estimating π with the Monte Carlo method
179 | 
180 |      ```shell
181 |      spark-submit --master spark://spark81:7077 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.1.0.jar 100
182 |      ```
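
The Monte Carlo idea behind this demo can be tried without a cluster. The following is a minimal plain-Scala sketch, not the bundled `SparkPi` source: `estimatePi`, the sample count, and the seed are illustrative names and values chosen here. It draws random points in the square [-1, 1] × [-1, 1] and uses the share that falls inside the unit circle, which approaches π/4.

```scala
import scala.util.Random

// Estimate pi from `samples` random points in the square [-1, 1] x [-1, 1].
// (Hypothetical helper for illustration; not part of Spark.)
def estimatePi(samples: Int, rng: Random): Double = {
  val inside = (1 to samples).count { _ =>
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    x * x + y * y <= 1 // true when the point lies inside the unit circle
  }
  // area(circle) / area(square) = pi / 4
  4.0 * inside / samples
}

// A fixed seed keeps the run reproducible.
val pi = estimatePi(100000, new Random(42))
println(s"Pi is roughly $pi")
```

The Spark example distributes the same sampling loop over the cluster; the trailing `100` in the command above is the number of slices the work is split into.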
183 | 
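184 | 2. Using the Spark shell
185 | 
186 |    `spark-shell` is the interactive shell that ships with Spark. It makes interactive programming convenient: users can write Spark programs in Scala at its prompt.
187 | 
188 |    * Start the Spark shell: `spark-shell`
189 | 
190 |      It also accepts the following options:
191 | 
192 |      Option reference:
193 | 
194 |      ```shell
195 |      --master spark://spark81:7077 # address of the master
196 |      --executor-memory 2g # memory available to each executor: 2 GB
197 |      --total-executor-cores 2 # total number of CPU cores used across the cluster: 2
198 |      ```
199 | 
200 |      For example:
201 | 
202 |      ```shell
203 |      spark-shell --master spark://spark81:7077 --executor-memory 2g --total-executor-cores 2
204 |      ```
205 | 
206 |    * Note:
207 | 
208 |      If no master address is given when the Spark shell is started, the shell still starts and runs programs normally, but it is actually running in Spark's local mode: only a single process is started on the local machine, and no connection to the cluster is made.
209 | 
210 |      Note the difference between the logs in local mode and in cluster mode: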
211 | 
212 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/diff.png)
213 | 
214 |    * Writing a WordCount program in the Spark shell
215 | 
216 |      The program:
217 | 
218 |      ```scala
219 |      sc.textFile("hdfs://192.168.88.111:9000/data/data.txt")
220 |        .flatMap(_.split(" "))
221 |        .map((_, 1))
222 |        .reduceByKey(_+_)
223 |        .saveAsTextFile("hdfs://192.168.88.111:9000/output/spark/wc")
224 |      ```
225 | 
226 |      Explanation:
227 | 
228 |      * `sc` is the SparkContext object, the entry point for submitting Spark programs
229 |      * `textFile("hdfs://192.168.88.111:9000/data/data.txt")` reads the data from HDFS
230 |      * `flatMap(_.split(" "))` maps each line to its words and then flattens the result
231 |      * `map((_, 1))` pairs each word with 1 to form a tuple
232 |      * `reduceByKey(_+_)` reduces by key, summing the values
233 |      * `saveAsTextFile("hdfs://192.168.88.111:9000/output/spark/wc")` writes the result to HDFS
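
To see what each step produces without a cluster or HDFS, the same pipeline can be mimicked on an ordinary Scala collection. This is only an analogy under stated assumptions: the two input lines are made up, and `groupBy` followed by a sum stands in for `reduceByKey`, which plain collections do not have.

```scala
// Stand-in for the lines that textFile would read (made-up sample data).
val lines = List("hello spark", "hello scala")

// flatMap: split every line into words, then flatten into one list.
val words = lines.flatMap(_.split(" "))

// map: pair every word with the count 1.
val pairs = words.map((_, 1))

// Stand-in for reduceByKey(_ + _): group the pairs by word and sum the 1s.
val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

counts.foreach(println)
```

On an RDD, `reduceByKey` does the grouping and summing in one distributed step and combines values within each partition before the shuffle.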
234 | 
235 | 3. Writing a WordCount program in IDEA
236 | 
237 |    * Required pom dependency:
238 | 
239 |      ```xml
240 |      <dependency>
241 |      	<groupId>org.apache.spark</groupId>
242 |      	<artifactId>spark-core_2.11</artifactId>
243 |      	<version>2.1.0</version>
244 |      </dependency>
245 |      ```
246 | 
247 |    * Install the Scala plugin in IDEA
248 | 
249 |    * Create a Maven project
250 | 
251 |    * Write the source code, package it as a jar, and upload it to Linux
252 | 
253 |    * Scala version:
254 | 
255 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/sca.png)
256 | 
257 |      Run the program:
258 | 
259 |      ```shell
260 |      spark-submit --master spark://spark81:7077 --class mydemo.WordCount jars/wc.jar hdfs://192.168.88.111:9000/data/data.txt hdfs://192.168.88.111:9000/output/spark/wc
261 |      ```
262 | 
263 |    * Java version (prints the result directly to the screen):
264 | 
265 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/jav.png)
266 | 
267 |      Run the program:
268 | 
269 |      ```shell
270 |      spark-submit --master spark://spark81:7077 --class mydemo.JavaWordCount jars/wc.jar hdfs://192.168.88.111:9000/data/data.txt
271 |      ```
272 | 
273 | ### (4) Spark Execution Mechanism and Internals
274 | 
275 | 1. How WordCount is executed
276 | 
277 |    ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/wc.png)
278 | 
279 | 2. How Spark submits a job
280 | 
281 |    ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/proc.png)
282 | 
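283 | ### (5) Spark Operators
284 | 
285 | 1. RDD basics
286 | 
287 |    * What is an RDD?
288 | 
289 |      An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. RDDs have the characteristics of a data-flow model: automatic fault tolerance, locality-aware scheduling, and scalability. An RDD lets users explicitly cache a working set in memory while running multiple queries, so that later queries can reuse it, which speeds up those queries considerably.
290 | 
291 |    * Properties of an RDD (a passage from the source code)
292 | 
293 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/rdd.png)
294 | 
295 |      * **A list of partitions**, the basic units that make up the dataset. Each partition is processed by one task, so the number of partitions determines the granularity of parallelism. Users can specify the number of partitions when creating an RDD; if they do not, a default is used, which is the number of CPU cores allocated to the program.
296 |      * **A function for computing each partition.** RDDs are computed partition by partition, and every RDD implements a `compute` function for this purpose. `compute` composes iterators, so the result of each intermediate computation does not need to be stored.
297 |      * **Dependencies on other RDDs.** Every transformation produces a new RDD, so RDDs form a pipeline-like chain of dependencies. When the data of some partitions is lost, Spark can recompute just those partitions through the dependencies instead of recomputing every partition of the RDD.
298 |      * **A Partitioner, the RDD's partitioning function.** Spark currently implements two kinds: the hash-based `HashPartitioner` and the range-based `RangePartitioner`. Only key-value RDDs have a Partitioner; for other RDDs the partitioner is `None`. The Partitioner determines both the number of partitions of the RDD itself and the number of partitions of the parent RDD's shuffle output.
299 |      * **A list of preferred locations** for accessing each partition. For an HDFS file, this list holds the locations of the blocks of each partition. Following the principle that moving computation is cheaper than moving data, Spark tries to schedule each task on the node that stores the data block it will process.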
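
Most of these properties can be inspected directly on an RDD. The sketch below assumes `spark-core` is on the classpath and uses a local two-thread master; the application name and sample data are illustrative. In `spark-shell`, `getOrCreate` simply returns the existing `sc`.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Assumption: local mode with two threads, for experimenting only.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("rdd-properties"))

// Partitions: request 4 explicitly instead of relying on the default.
val numbers = sc.parallelize(1 to 8, 4)
val numPartitions = numbers.partitions.length

// Dependencies: map creates a child RDD whose parent is `numbers`.
val pairs = numbers.map(n => (n % 2, n))
val parentOfPairs = pairs.dependencies.head.rdd

// Partitioner: None until a key-value RDD is explicitly partitioned.
val before = pairs.partitioner
val after = pairs.partitionBy(new HashPartitioner(3)).partitioner

// Preferred locations: empty for an in-memory collection;
// for an HDFS-backed RDD this lists the hosts that store the block.
val locations = numbers.preferredLocations(numbers.partitions(0))
```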
300 | 
301 |    * Ways to create an RDD
302 | 
303 |      * From an external data file, such as one on HDFS
304 | 
305 |        ```scala
306 |        val rdd1 = sc.textFile("hdfs://192.168.88.111:9000/data/data.txt")
307 |        ```
308 | 
309 |      * With `sc.parallelize`:
310 | 
311 |        ```scala
312 |        val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
313 |        ```
314 | 
315 |      * RDD operators (really just the functions you call) come in two kinds: Transformation operators and Action operators
316 | 
317 |    * How an RDD works
318 | 
319 |      ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/rddpar.png)
320 | 
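321 | 2. Transformation operators
322 | 
323 |    All transformations on an RDD are lazy: they do not compute their results right away. Instead, they only remember the transformations applied to the base dataset (for example, a file). The transformations actually run only when an action requires a result to be returned to the Driver. This design lets Spark run more efficiently.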
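
The laziness can be observed with an accumulator that counts how often the function passed to `map` actually runs. This is a sketch assuming `spark-core` on the classpath and a local master; the accumulator name and data are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell getOrCreate returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-transformations"))

// Counts how many times the map function has been executed.
val calls = sc.longAccumulator("map calls")

// Defining the transformation only records it; no element is processed yet.
val doubled = sc.parallelize(1 to 5).map { x => calls.add(1); x * 2 }
val callsBeforeAction: Long = calls.value

// collect() is an action, so the recorded map now runs on every element.
val result = doubled.collect()
val callsAfterAction: Long = calls.value

println(s"before action: $callsBeforeAction, after action: $callsAfterAction")
```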
324 | 
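325 |    |                Operator                 |                           Meaning                            |
326 |    | :-------------------------------------: | :----------------------------------------------------------: |
327 |    |                map(func)                | Returns a new RDD formed by passing each input element through func |
328 |    |              filter(func)               | Returns a new RDD formed by the input elements for which func returns true |
329 |    |              flatMap(func)              | Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element) |
330 |    |           mapPartitions(func)           | Similar to map, but runs separately on each partition of the RDD, so on an RDD of type T, func must be of type Iterator[T] => Iterator[U] |
331 |    |      mapPartitionsWithIndex(func)       | Similar to mapPartitions, but func also takes an integer giving the index of the partition, so on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U] |
332 |    | sample(withReplacement, fraction, seed) | Samples the share of the data given by fraction, with or without replacement as selected by withReplacement; seed sets the seed of the random number generator |
333 |    |         union(otherDataset)         | Returns a new RDD containing the union of the source RDD and the argument RDD |
334 |    | intersection(otherDataset) | Returns a new RDD containing the intersection of the source RDD and the argument RDD |
335 |    | distinct([numTasks]) | Returns a new RDD with the duplicates of the source RDD removed |
336 |    | groupByKey([numTasks]) | Called on an RDD of (K,V) pairs; returns an RDD of (K, Iterable[V]) pairs |
337 |    | reduceByKey(func, [numTasks]) | Called on an RDD of (K,V) pairs; returns an RDD of (K,V) pairs in which the values of each key are aggregated with the given reduce function. As with groupByKey, the number of reduce tasks can be set through the optional second argument |
338 |    | sortByKey([ascending], [numTasks]) | Called on an RDD of (K,V) pairs where K implements Ordered; returns an RDD of (K,V) pairs sorted by key |
339 |    | join(otherDataset, [numTasks]) | Called on RDDs of type (K,V) and (K,W); returns an RDD of (K,(V,W)) pairs containing every pairing of the elements that share a key |
340 |    | cogroup(otherDataset, [numTasks]) | Called on RDDs of type (K,V) and (K,W); returns an RDD of type (K,(Iterable\<V>,Iterable\<W>)) |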
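
Several of the operators in the table can be tried on small in-memory collections. The following is a sketch assuming `spark-core` on the classpath and a local master; all data values are made up. Results that pass through a shuffle come back in no guaranteed order, so they are converted to sets or maps before being inspected.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell getOrCreate returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("transformations"))

val nums = sc.parallelize(Seq(1, 2, 3, 4))
val words = sc.parallelize(Seq("a b", "b c c")).flatMap(_.split(" "))

// filter + map: keep the even numbers, then square them.
val evenSquares = nums.filter(_ % 2 == 0).map(n => n * n).collect().toList

// distinct: remove duplicate words.
val uniqueWords = words.distinct().collect().toSet

// reduceByKey: count the occurrences of each word.
val counts = words.map((_, 1)).reduceByKey(_ + _)
val countMap = counts.collect().toMap

// union keeps duplicates; intersection keeps only the common elements.
val other = sc.parallelize(Seq(4, 5))
val unioned = nums.union(other).collect().toList
val common = nums.intersection(other).collect().toList

// join: pair up the values that share a key.
val prices = sc.parallelize(Seq(("a", 1.5), ("c", 2.0)))
val joined = counts.join(prices).collect().toMap

// sortByKey: descending order by key.
val sortedKeys = counts.sortByKey(ascending = false).collect().map(_._1).toList
```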
341 | 
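342 | 3. Action operators
343 | 
344 |    |                    Action                     |                           Meaning                            |
345 |    | :-------------------------------------------: | :----------------------------------------------------------: |
346 |    |                 reduce(func)                  | Aggregates all elements of the RDD using func, which must be commutative and associative |
347 |    |                   collect()                   | Returns all elements of the dataset to the driver program as an array |
348 |    |                    count()                    |           Returns the number of elements in the RDD          |
349 |    |                    first()                    |             Returns the first element of the RDD             |
350 |    |                    take(n)                    |   Returns an array with the first n elements of the dataset  |
351 |    | takeSample(*withReplacement*,*num*, [*seed*]) | Returns an array with a random sample of num elements of the dataset, drawn with or without replacement; seed sets the seed of the random number generator |
352 |    |          saveAsTextFile(*path*)           | Writes the elements of the dataset as a text file to HDFS or another supported file system; Spark calls toString on each element to turn it into a line of text in the file |
353 |    |               countByKey()                | For an RDD of (K,V) pairs, returns a map of (K, Long) pairs giving the number of elements for each key |
354 |    |              foreach(func)              | Runs func on each element of the dataset, typically for a side effect such as updating an accumulator |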
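
The actions above can be exercised on a small collection as follows. This is a sketch assuming `spark-core` on the classpath and a local master; the data, the seed, and the accumulator name are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell getOrCreate returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("actions"))

val nums = sc.parallelize(Seq(3, 1, 4, 1, 5))

val total = nums.reduce(_ + _)        // sum of all elements
val all = nums.collect().toList       // every element, in partition order
val size = nums.count()               // number of elements, as a Long
val head = nums.first()               // first element
val firstTwo = nums.take(2).toList    // first two elements

// Two random elements without replacement; the seed makes the draw repeatable.
val sample = nums.takeSample(withReplacement = false, num = 2, seed = 42L)

// countByKey returns a local Map from each key to its number of occurrences.
val perKey = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).countByKey()

// foreach runs on the executors, so results are passed back through an accumulator.
val acc = sc.longAccumulator("sum")
nums.foreach(x => acc.add(x))
val viaForeach: Long = acc.value
```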
355 | 
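356 | 4. RDD caching
357 | 
358 |    An RDD can cache earlier computation results through the `persist` or `cache` method. The caching does not happen at the moment these methods are called, however. It happens when a later action is triggered, at which point the RDD is cached in the memory of the compute nodes for later reuse.
359 | 
360 |    ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/persit.png)
361 | 
362 |    The source code shows that `cache` ends up calling `persist`, and the default storage level of both keeps a single copy in memory only. Spark has many other storage levels, which are defined in `object StorageLevel`.
363 | 
364 |    ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/storeage.png)
365 | 
366 |    A cached copy can be lost, and data held in memory can be evicted when memory runs short. The RDD's fault-tolerance mechanism ensures that the computation still runs correctly even if the cache is lost: the lost data is recomputed through the chain of transformations on the RDD. Because the partitions of an RDD are relatively independent, only the lost partitions need to be recomputed, not all of them.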
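
A short sketch of the caching API, assuming `spark-core` on the classpath and a local master; the data and application name are illustrative. Note that an RDD's storage level cannot be changed once set, so a second RDD is used to show a non-default level.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: local mode; in spark-shell getOrCreate returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("rdd-cache"))

val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)

// cache() is persist(StorageLevel.MEMORY_ONLY); it only marks the RDD.
squares.cache()
val levelAfterCache = squares.getStorageLevel

// The first action computes the partitions and stores them in memory;
// the second action can then read the cached partitions.
val size = squares.count()
val sumOfSquares = squares.reduce(_ + _)

// Release the cached blocks once they are no longer needed.
squares.unpersist()
val levelAfterUnpersist = squares.getStorageLevel

// Other levels are passed to persist explicitly.
val spillable = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_AND_DISK)
val spillableLevel = spillable.getStorageLevel
```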
367 | 
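368 | 5. RDD fault tolerance
369 | 
370 |    A checkpoint (in essence, writing an RDD to disk) supplements lineage-based fault tolerance. When the lineage grows too long, recovering through it becomes too expensive, and it is better to checkpoint at an intermediate stage. If a node later fails and partitions are lost, the lineage is replayed starting from the checkpointed RDD, which reduces the cost.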
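
Checkpointing is requested per RDD and, like caching, takes effect only once an action runs. The sketch below assumes `spark-core` on the classpath and a local master, and it uses a local temporary directory as the checkpoint directory purely for illustration; on a cluster this would normally be an HDFS path.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; in spark-shell getOrCreate returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("rdd-checkpoint"))

// Assumption: a throwaway local directory stands in for an HDFS path.
sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)

// Caching first avoids computing the RDD a second time when the checkpoint is written.
val evens = sc.parallelize(1 to 100).map(_ + 1).filter(_ % 2 == 0).cache()

// checkpoint() only marks the RDD; nothing has been written yet.
evens.checkpoint()
val beforeAction = evens.isCheckpointed

// The action runs the job, after which the RDD is written to the checkpoint directory.
val size = evens.count()
val afterAction = evens.isCheckpointed

// The lineage now starts from the checkpoint instead of the original collection.
println(evens.toDebugString)
```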
371 | 
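372 | 6. RDD dependencies and the stages of a Spark job
373 | 
374 |    * RDD dependencies
375 | 
376 |      * An RDD can depend on its parent RDD(s) in two ways: through a narrow dependency or through a wide dependency.
377 | 
378 |        ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/29-SparkCore/imgs/depen.png)
379 | 
380 |        * A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD.
381 |        * A wide dependency means that multiple partitions of the child RDD depend on the same partition of the parent RDD.
382 | 
383 |    * Stages in a Spark job
384 | 
385 |      A DAG (Directed Acyclic Graph) is formed when the original RDD goes through a series of transformations. The DAG is divided into stages according to the dependencies between RDDs. For narrow dependencies, the transformations of a partition are computed within one stage. For wide dependencies, a shuffle is involved, so the next computation can start only after the parent RDD has been fully processed. Wide dependencies are therefore the boundaries at which stages are split.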
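
The kind of dependency an operator creates can be checked on the resulting RDD. This sketch assumes `spark-core` on the classpath and a local master; the data is made up. `map` yields a narrow (`OneToOneDependency`) link, while `reduceByKey` yields a wide (`ShuffleDependency`) link, which is where the scheduler cuts a new stage.

```scala
import org.apache.spark.{OneToOneDependency, ShuffleDependency, SparkConf, SparkContext}

// Assumption: local mode; in spark-shell getOrCreate returns the existing `sc`.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("rdd-dependencies"))

val words = sc.parallelize(Seq("a", "b", "a"), 2)
val pairs = words.map((_, 1))           // narrow: each child partition reads one parent partition
val counts = pairs.reduceByKey(_ + _)   // wide: a child partition may read from every parent partition

val mapIsNarrow = pairs.dependencies.head.isInstanceOf[OneToOneDependency[_]]
val reduceIsWide = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

// The indentation steps in the debug string mark the shuffle (stage) boundaries.
println(counts.toDebugString)
```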
386 | 
387 | 7. RDD basic exercises
388 | 
389 | ### (6) Advanced Spark RDD Operators
390 | 
391 | 
392 | 
393 | 
394 | 
395 | 
396 | 
397 | ### (7) Basic Spark Programming Examples
398 | 
399 | 
400 | 
401 | 
402 | 
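403 | ### (8) Some Problems
404 | 
405 | 1. A custom Spark accumulator is updated inside a `foreach` operator, but the accumulated values are gone once `foreach` returns?
406 | 
407 |    Cause: the overridden `merge` function was wrong, so the Driver did not merge the accumulators sent back from the nodes.
408 | 
409 |    * Incorrect version:
410 | 
411 |      ```scala
412 |      override def merge(other: AccumulatorV2[String, mutable.HashMap[String, Int]]): Unit = {
413 |          other match {
414 |              case acc: SessionAggrStatAccumulator => {
415 |                  // Intended to merge the key-value pairs of acc into the current map, summing the counts
416 |                  this.aggrStatmap./:(acc.value) {
417 |                      case (map, (k, v)) =>
418 |                          map += (k -> (v + map.getOrElse(k, 0)))
419 |                  }
420 |              }
421 |          }
422 |      }
423 |      ```
424 | 
425 |    * Correct version:
426 | 
427 |      ```scala
428 |      override def merge(other: AccumulatorV2[String, mutable.HashMap[String, Int]]): Unit = {
429 |          other match {
430 |              case acc: SessionAggrStatAccumulator => {
431 |                  // Merge the key-value pairs of acc into the current map, summing the counts
432 |                  (this.aggrStatmap /: acc.value) {
433 |                      case (map, (k, v)) =>
434 |                          map += (k -> (v + map.getOrElse(k, 0)))
435 |                  }
436 |              }
437 |          }
438 |      }
439 |      ```
440 | 
441 |    * Digging further into the cause: the mistake came from how Scala's `/:` operator is invoked. `/:` is `foldLeft`, and because its name ends in a colon it is right-associative: `(z /: xs)(op)` means `xs./:(z)(op)`, that is, `xs.foldLeft(z)(op)`.
442 | 
443 |    * Incorrect version: `this.aggrStatmap./:(acc.value)` calls `/:` on the current map with `acc.value` as the start value. The fold therefore accumulates into the map returned by `acc.value`, and the current map is left unchanged.
444 | 
445 |    * Correct version: `(this.aggrStatmap /: acc.value)` is `acc.value./:(this.aggrStatmap)`, so the fold starts from the current map and `+=` adds the other accumulator's pairs into it in place.
446 | 
447 |    * The same rule applies to other operators whose names end in a colon, such as `::` and `:::`. `++` does not end in a colon and is an ordinary left-associative method.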
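
The difference between the two call styles can be reproduced with plain Scala maps, without Spark. The sketch below uses `foldLeft`, for which `/:` is an alias: an operator whose name ends in `:` is invoked on its right operand, so `a./:(b)(op)` is `a.foldLeft(b)(op)`, while `(a /: b)(op)` is `b.foldLeft(a)(op)`. The helper `addInto` and the sample maps are illustrative names, not part of the original accumulator.

```scala
import scala.collection.mutable

// Same merge step as in the accumulator: add one (key, count) pair into the map.
def addInto(acc: mutable.HashMap[String, Int], kv: (String, Int)): mutable.HashMap[String, Int] = {
  acc += (kv._1 -> (kv._2 + acc.getOrElse(kv._1, 0)))
  acc
}

// Dot-call form, mine./:(theirs)(addInto): `theirs` is the start value and gets mutated.
val mine = mutable.HashMap("a" -> 1, "b" -> 2)
val theirs = mutable.HashMap("b" -> 10, "c" -> 20)
mine.foldLeft(theirs)(addInto)

// Infix form, (mine2 /: theirs2)(addInto): `mine2` is the start value and gets mutated.
val mine2 = mutable.HashMap("a" -> 1, "b" -> 2)
val theirs2 = mutable.HashMap("b" -> 10, "c" -> 20)
theirs2.foldLeft(mine2)(addInto)

println(s"dot call:   mine = $mine, theirs = $theirs")
println(s"infix call: mine2 = $mine2, theirs2 = $theirs2")
```

Writing `foldLeft` explicitly in `merge` avoids the ambiguity altogether, and newer Scala versions deprecate `/:` in its favor.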
448 | 


--------------------------------------------------------------------------------
/29-SparkCore/imgs/active-standby.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/active-standby.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/arc1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/arc1.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/arc2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/arc2.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/depen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/depen.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/diff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/diff.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/easy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/easy.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/everywhere.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/everywhere.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/jav.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/jav.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/persit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/persit.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/proc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/proc.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/rdd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/rdd.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/rddpar.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/rddpar.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/sca.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/sca.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/shuffle.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/shuffle.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/spark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/spark.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/spark83.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/spark83.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/speed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/speed.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/standalone.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/standalone.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/startall.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/startall.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/storeage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/storeage.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/sz.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/sz.png


--------------------------------------------------------------------------------
/29-SparkCore/imgs/wc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/29-SparkCore/imgs/wc.png


--------------------------------------------------------------------------------
/30-SparkSQL/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/30-SparkSQL/README.md


--------------------------------------------------------------------------------
/31-SparkStreaming/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/31-SparkStreaming/README.md


--------------------------------------------------------------------------------
/32-Spark与其他组件集成/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/32-Spark与其他组件集成/README.md


--------------------------------------------------------------------------------
/33-Spark性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/33-Spark性能优化/README.md


--------------------------------------------------------------------------------
/34-Flink/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/34-Flink/README.md


--------------------------------------------------------------------------------
/35-Flink与其他组件集成/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/35-Flink与其他组件集成/README.md


--------------------------------------------------------------------------------
/36-Flink性能优化/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/36-Flink性能优化/README.md


--------------------------------------------------------------------------------
/37-CDH/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MrQuJL/hadoop-guide/59670710fdbf78794aa7ef5060b5adf8fa717535/37-CDH/README.md


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "{}"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright {yyyy} {name of copyright owner}
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
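  1 | ## Big Data Learning Guide
  2 | 
  3 | ### Overview
  4 | In a traditional OLTP system, data is generally stored in a relational database and each request is processed by a single server. Even when a service is deployed as a cluster, each machine handles its own client request independently; the machines do not cooperate on a single request.
  5 | 
  6 | As a business grows and data accumulates, many companies end up holding TB, PB, or even EB of data. The traditional OLTP stack (Spring, MySQL) can neither store nor compute over data at that scale, and Hadoop emerged to fill the gap.
  7 | 
  8 | > PS: Click an image to jump to the corresponding section.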
  9 | 
 10 | ### First-Generation Compute Engine: Hadoop
 11 | ------
 12 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/hadoop-logo.jpg)](https://github.com/MrQuJL/hadoop-guide/tree/master/03-Hadoop)
 13 | 
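 14 | Hadoop uses **HDFS** to split a file into data blocks, store the blocks across different nodes, and keep redundant copies of each block on multiple nodes for high availability, which solves the storage problem for massive data. It uses **MapReduce** to distribute a program to different nodes, where each copy processes only part of the input and all copies run in parallel; their results are then merged, which solves the computation problem for massive data.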
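
The storage half of this idea can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the tiny block size, and the round-robin placement are assumptions made for illustration. In real HDFS the NameNode decides placement and takes rack topology into account.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign every block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# A 10-byte "file" split into 4-byte blocks, each stored on 2 of 3 nodes.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)
```

Because every block lives on two nodes here, losing any single node still leaves one copy of each block readable.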
 15 | 
 16 | #### 1. HDFS: Hadoop Distributed File System
 17 | 
 18 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/03-Hadoop/imgs/hdfs-logo.jpg)](https://github.com/MrQuJL/hadoop-guide/tree/master/04-HDFS基础)
 19 | 
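 20 | Whether you are learning Hadoop, Spark, or Flink, the first step is to learn how to install, configure, and use the HDFS distributed file system, which is the foundation of the big data stack. The MapReduce programs, Spark programs, and Hive queries we write day to day all ultimately operate on data stored in HDFS.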
 21 | 
 22 | #### 2. MapReduce
 23 | 
 24 | Once you know how to work with the underlying HDFS, you can write MapReduce programs that read data from HDFS and perform business processing.
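
To make the map, shuffle, and reduce phases concrete, here is a minimal single-process word count in Python. It is a sketch of the programming model only: the function names are hypothetical, the input "splits" are in-memory lists rather than HDFS blocks, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(splits):
    """Run map over every line of every split, then shuffle and reduce."""
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Two input splits, standing in for data that would live on different nodes.
splits = [["hello hadoop", "hello hdfs"], ["hadoop yarn"]]
counts = word_count(splits)
```

In a real cluster each split would be mapped on a different node in parallel, and the shuffle would move data across the network before the reducers run.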
 25 | 
 26 | #### 3. Hive
 27 | 
 28 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/11-Hive基础/imgs/hive_logo_medium.jpg)](https://github.com/MrQuJL/hadoop-guide/tree/master/11-Hive基础)
 29 | 
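 30 | After using MapReduce for a while, you will find that operations such as multi-table joins are tedious to write directly against the MapReduce framework. Hive was created to address this: you run a single HQL (Hive SQL) statement, and Hive automatically translates it into a MapReduce job and submits it to Yarn.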
 31 | 
 32 | #### 4. Pig
 33 | 
 34 | Pig is similar to Hive. You write a Pig Latin statement, and Pig automatically converts it into a MapReduce job and submits it to Yarn.
 35 | 
 36 | > PS: In industry, Hive is used more often.
 37 | 
 38 | #### 5. Sqoop
 39 | 
 40 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/13-Sqoop基础/imgs/sqoop-logo.png)](https://github.com/MrQuJL/hadoop-guide/tree/master/13-Sqoop基础)
 41 | 
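 42 | Whether you write MapReduce programs directly or generate them indirectly through Hive or Pig queries, the HDFS data you work with may simply be local test data uploaded by hand. In production, data may come from business databases, event tracking, logs, or public-platform data collected by Python crawlers. Since MapReduce programs read data from HDFS, how do we get data from business systems into HDFS?
 43 | 
 44 | For relational databases (MySQL, Oracle), the usual choice is Sqoop: a single Sqoop command imports data from the relational database into HDFS. Sqoop can also export data from HDFS back into a relational database.
 45 | 
 46 | > PS: Under the hood, Sqoop reads from the relational database over JDBC, so an import can put considerable load on the database. For a small company with modest traffic this may not matter, and Sqoop is convenient. At a large company, heavy traffic may rule out touching the production database directly, and you may not have access to it anyway. The usual approach there is to ask a DBA to export the data as CSV files, which are then uploaded to HDFS by a script or collected by Flume.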
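
The CSV export step mentioned above can be sketched in Python. This is an illustration only: it uses an in-memory SQLite database as a stand-in for MySQL or Oracle, the `orders` table and the helper name are hypothetical, and the upload to HDFS is left out.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn: sqlite3.Connection, table: str, out) -> int:
    """Write every row of `table` to `out` as CSV with a header row; return the row count."""
    if not table.isidentifier():
        raise ValueError("table name must be a plain identifier")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:  # stream rows instead of loading the whole table
        writer.writerow(row)
        count += 1
    return count


# Stand-in for a business database (assumption: schema and rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

buffer = io.StringIO()
exported = export_table_to_csv(conn, "orders", buffer)
```

Running such an export against a read replica or an offline copy, rather than the primary, is one way to avoid the load concern described in the note.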
 47 | 
 48 | #### 6. Flume
 49 | 
 50 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/15-Flume基础/imgs/flume-logo.png)](https://github.com/MrQuJL/hadoop-guide/tree/master/15-Flume基础)
 51 | 
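 52 | As mentioned in the Sqoop section, Flume can collect CSV files into HDFS, but its main use is log collection. Flume can monitor a directory and, whenever a new file appears in it, upload that file to HDFS.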
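
The directory-monitoring idea can be sketched as a simple polling loop in Python. This is not Flume code: in Flume the behavior is configured on an agent (for example with its spooling directory source), and the function names and the dictionary standing in for HDFS below are assumptions for illustration.

```python
import os
import tempfile


def poll_new_files(directory: str, seen: set) -> list:
    """Return the names of regular files not seen before, and remember them."""
    new_files = sorted(
        name for name in os.listdir(directory)
        if name not in seen and os.path.isfile(os.path.join(directory, name))
    )
    seen.update(new_files)
    return new_files


def collect(directory: str, seen: set, sink: dict) -> None:
    """Copy the contents of every newly arrived file into `sink` (stand-in for HDFS)."""
    for name in poll_new_files(directory, seen):
        with open(os.path.join(directory, name), "rb") as f:
            sink[name] = f.read()


with tempfile.TemporaryDirectory() as spool_dir:
    seen, sink = set(), {}
    with open(os.path.join(spool_dir, "app.log"), "wb") as f:
        f.write(b"first\n")
    collect(spool_dir, seen, sink)   # picks up app.log
    with open(os.path.join(spool_dir, "app2.log"), "wb") as f:
        f.write(b"second\n")
    collect(spool_dir, seen, sink)   # picks up only app2.log
```

A production collector would also have to deal with files that are still being written and with restarts, which this sketch ignores.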
 53 | 
 54 | #### 7. HBase
 55 | 
 56 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/09-HBase基础/imgs/hbase-logo.png)](https://github.com/MrQuJL/hadoop-guide/tree/master/09-HBase基础)
 57 | 
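 58 | Data in HDFS lives on disk, and reading directly from disk is slow compared with reading from memory. HBase was built to make queries over massive data fast; it is a NoSQL database built on top of HDFS.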
 59 | 
 60 | #### 8. ZooKeeper
 61 | 
 62 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/17-ZooKeeper/imgs/zookeeper-logo.png)](https://github.com/MrQuJL/hadoop-guide/tree/master/17-ZooKeeper)
 63 | 
 64 | ZooKeeper can provide HA (high availability) for big data components such as Hadoop, HBase, and Storm.
 65 | 
 66 | #### 9. Azkaban
 67 | 
 68 | 
 69 | #### 10. Storm
 70 | 
 71 | 
 72 | #### 11. JStorm
 73 | 
 74 | 
 75 | ### Second-Generation Compute Engine: Spark
 76 | ------
 77 | 
 78 | #### The Scala Programming Language
 79 | 
 80 | [![image](https://github.com/MrQuJL/hadoop-guide/blob/master/28-Scala/imgs/scala-logo.jpg)](https://github.com/MrQuJL/hadoop-guide/tree/master/28-Scala)
 81 | 
 82 | #### SparkCore
 83 | 
 84 | 
 85 | 
 86 | #### SparkSQL
 87 | 
 88 | 
 89 | 
 90 | 
 91 | #### SparkStreaming
 92 | 
 93 | 
 94 | 
 95 | 
 96 | 
 97 | 
 98 | ### Third-Generation Compute Engine: Flink
 99 | ------
100 | 
101 | 
102 | 
103 | 
104 | ### CDH (Cloudera’s Distribution Including Apache Hadoop)
105 | 
106 | 
107 | 
108 | 1. *Java* and *Linux* fundamentals
109 | 
110 | 
111 | 
112 | 
113 | 
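114 | 2. Learning Hadoop: architecture, principles, programming
115 | 	```
116 | 	* Stage 1: HDFS (file system), MapReduce (Java programs), HBase (NoSQL database)
117 | 	* Stage 2: data analysis engines ---> Hive, Pig
118 | 		data collection engines ---> Sqoop, Flume
119 | 	* Stage 3: HUE: web management tool
120 | 		ZooKeeper: HA for Hadoop
121 | 		Oozie: workflow engine
122 | 	```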
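123 | 3. Learning Spark
124 | 	```
125 | 	* Stage 1: the Scala programming language
126 | 	* Stage 2: Spark Core -----> in-memory data computation
127 | 	* Stage 3: Spark SQL -----> similar to SQL statements in Oracle
128 | 	* Stage 4: Spark Streaming ---> real-time (stream) computation, e.g. a waterworks
129 | 	```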
130 | 4. Apache Storm: similar to Spark Streaming ---> real-time (stream) computation, e.g. a waterworks
131 | 	```
132 | 	* NoSQL: Redis, an in-memory database (still learning...)
133 | 	```
134 | 
135 | 
136 | 
137 | Follow my WeChat official account for more content:
138 | 
139 | ![image](https://github.com/MrQuJL/hadoop-guide/blob/master/01-大数据背景知识/imgs/wxgzh.jpg)
140 | 
141 | 
142 | 
143 | 
144 | 
145 | 
146 | 
147 | 


--------------------------------------------------------------------------------