├── README.md
├── interview-question
    ├── TopN.md
    ├── group-by.md
    ├── 列转行.md
    ├── 窗口分析.md
    ├── 窗口分析2lag-lead.md
    ├── 自连接.md
    └── 行转列.md
└── png
    ├── case_when1.png
    ├── case_when2.png
    ├── explodeAndSplit.png
    ├── id_course1.png
    ├── lateral_view.png
    ├── login_log.png
    ├── row_number.png
    ├── row_number2.png
    ├── shop1.png
    ├── shop2.png
    ├── split.png
    ├── sum.png
    ├── use_rank.png
    ├── user_click1.png
    ├── user_click_temp1.png
    ├── user_click_view1.png
    ├── user_click_view2.png
    ├── 列转行.png
    ├── 行转列.png
    ├── 面试题2_2.png
    ├── 面试题2_3.png
    ├── 面试题2_4.png
    ├── 面试题2_5.png
    ├── 面试题4_2.png
    ├── 面试题5_1.png
    └── 面试题5_2.png


/README.md:
--------------------------------------------------------------------------------
  1 | # Hive_interview-question
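  2 | **A collection of typical Hive interview questions.**
  3 | 
  4 | SQL interview questions
  5 | 
  6 | HQL interview questions
  7 | 
  8 | ## Question categories
  9 |     Window analysis
 10 |     Row/column conversion
 11 |     TopN
 12 |     Self-join
 13 |     group by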
 14 | 
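 15 | ## Window analysis
 16 | 
 17 | [Window analysis](interview-question/窗口分析.md)
 18 | 
 19 | 1. Window functions: the aggregate functions sum/max/min/avg explained
 20 | 2. Interview question 1
 21 |    - "Find each shop's sales for the current month and its cumulative sales up to that month."
 22 | 3. Interview question 2
 23 |    - "For each user, find the highest single-month visit count up to each month and the cumulative visit count up to that month."
 24 | 4. Interview question 3
 25 |    - "Group by day and mac, compute the cumulative sales total of each group, and append it to every record."
 26 | 
 27 | [Window analysis 2: lag/lead](interview-question/窗口分析2lag-lead.md)
 28 | 
 29 | 1. Interview question
 30 | 
 31 |    - "Find the ids of users who logged in on three consecutive days."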
 32 | 
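 33 | ## Row/column conversion
 34 | 
 35 | [Rows to columns](interview-question/行转列.md)
 36 | 
 37 | Merges multiple rows into a single column of one row.
 38 | 
 39 | 1. Interview question 4
 40 | 
 41 |    - "Find the student ids of all students whose math score is higher than their Chinese score."
 42 | 
 43 |    - The case...when statement
 44 | 
 45 | 2. Interview question 5
 46 | 
 47 |    - "Show each student's course selections as 1/0 flags."
 48 | 
 49 |    - The collect_set()/collect_list() functions
 50 | 
 51 |    - The array_contains() function
 52 | 
 53 |    - The if() function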
 54 | 
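 55 | ## TopN
 56 | 
 57 | [TopN](interview-question/TopN.md)
 58 | 
 59 | 1. Window functions: the ranking functions row_number()/rank()/dense_rank()
 60 | 
 61 |    - row_number() in depth
 62 | 
 63 | 2. Interview question 6
 64 | 
 65 |    - "Find the highest temperature of each year and its date."
 66 | 
 67 | 3. Interview question 7
 68 | 
 69 |    - "For each hobby, find the oldest person and the two oldest people."
 70 | 
 71 |    - [Columns to rows](./interview-question/列转行.md) (lateral view + explode function)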
 72 | 
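 73 | ## Self-join
 74 | 
 75 | [Self-join](./interview-question/自连接.md)
 76 | 
 77 | 1. Interview question 8
 78 | 
 79 |    "Find all numbers that appear at least **three times in a row**."
 80 | 
 81 |    - Cartesian product
 82 |    - Join query
 83 |    - lag, lead
 84 | 
 85 | 2. Interview question 9
 86 | 
 87 |    "For each student, find the best course and its score, the worst course and its score, and the average score."
 88 | 
 89 |    - explode function --> columns to rows
 90 |    - Window functions --> ranking within groups
 91 |    - case..when, concat(), max + group by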
 92 | 
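 93 | ## group by
 94 | 
 95 | [group-by](./interview-question/group-by.md)
 96 | 
 97 | 1. Interview question 10
 98 | 
 99 |    "For each product, output the net profit and the average daily cost of each month in 2018."
100 | 
101 |    "For each product, output the change in cost of each day in March 2018 compared with the previous day."
102 | 
103 |    "Output how many products had a total income greater than 22000 yuan in April 2018; this must be a single SQL statement, without joins or subqueries."
104 | 
105 |    "For the product with the highest total income in April 2018, output its daily income and cost, using the over() function."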
106 | 
107 | 


--------------------------------------------------------------------------------
/interview-question/TopN.md:
--------------------------------------------------------------------------------
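  1 | # TopN questions
  2 | 
  3 | ## Window functions: ranking functions
  4 | 
  5 | ### A first look at row_number()
  6 | 
  7 | - **Syntax:** `row_number() over(partition by partition_col order by sort_col desc/asc) as rn`
  8 | - Explanation: given a partition column, a sort column, and a sort direction, returns each row's position within its partition.
  9 | - Typical use: sort within groups, then filter for the top N.
 10 | - Example:
 11 | 
 12 | Data: (cookie_id, create_time, pv)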
 13 | 
 14 | ```
 15 | cookie1,2015-04-10,1
 16 | cookie1,2015-04-11,5
 17 | cookie1,2015-04-12,7
 18 | cookie1,2015-04-13,3
 19 | cookie1,2015-04-14,2
 20 | cookie1,2015-04-15,4
 21 | cookie1,2015-04-16,4
 22 | cookie2,2015-04-10,2
 23 | cookie2,2015-04-11,3
 24 | cookie2,2015-04-12,5
 25 | cookie2,2015-04-13,6
 26 | cookie2,2015-04-14,3
 27 | cookie2,2015-04-15,9
 28 | cookie2,2015-04-16,7
 29 | ```
 30 | 
 31 | Create the table and load the data:
 32 | 
 33 | ```sql
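 34 | -- Create the table with a comma (,) as the field delimiter
 35 | create table cookie(cookie_id string, create_time string, pv int) row format delimited fields terminated by ',';
 36 | 
 37 | -- Prepare the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/cookie_data.txt)
 38 | 
 39 | -- Load the data into the table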
 40 | load data local inpath "/root/yber/data/cookie_data.txt" into table cookie;
 41 | ```
 42 | 
 43 | Query:
 44 | 
 45 | ```sql
 46 | select
 47 |  cookie_id,
 48 |  create_time,
 49 |  pv,
 50 |  row_number() over (partition by cookie_id order by pv desc) as rn 
 51 | from 
 52 | cookie;
 53 | ```
 54 | 
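 55 | Result: the rows are split into two partitions, cookie1 and cookie2; within each partition they are sorted by <u>page views (pv)</u> in descending order, and the position is returned as rn.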
 56 | 
 57 | ![](../png/row_number2.png)
 58 | 
 59 | Going one step further, get the top 1 record of each partition:
 60 | 
 61 | ```sql
 62 | select aa.cookie_id,aa.create_time,aa.pv
 63 | from
 64 | (select
 65 |  cookie_id,
 66 |  create_time,
 67 |  pv,
 68 |  row_number() over (partition by cookie_id order by pv desc) as rn
 69 | from cookie) aa
 70 | where rn=1;
 71 | ```
 72 | 
 73 | Result:
 74 | 
 75 | ![](../png/面试题4_2.png)
 76 | 
 77 | 
 78 | 
 79 | ### Comparing the common ranking functions
 80 | 
 81 | #### Syntax
 82 | 
 83 | ##### row_number()
 84 | 
 85 | ##### rank()
 86 | 
 87 | ##### dense_rank()
 88 | 
 89 | ```sql
 90 | row_number() over(partition by partition_col order by sort_col desc/asc) as rn
 91 | rank() over(partition by partition_col order by sort_col desc/asc) as rn
 92 | dense_rank() over(partition by partition_col order by sort_col desc/asc) as rn
 93 | ```
 94 | 
 95 | #### Usage
 96 | 
 97 | <u>All three are used in the same way as row_number(); they differ only in how ties are numbered, as the table below shows.</u>
 98 | 
 99 | | Function | Description | Example |
100 | | --- | --- | --- |
101 | | row_number: numbers rows **sequentially**, **no gaps** | (**ties still get consecutive numbers**) | 1-2-3-4-5.... |
102 | | rank: numbers rows **sequentially**, **equal** values get the **same number**, **gaps are left** | (**after a tie for first place there is no second place**) | 1-1-3-4-5.... |
103 | | dense_rank: numbers rows **sequentially**, **equal** values get the **same number**, **no gaps** | (**after a tie for first place comes second place**) | 1-1-2-3-4.... |
104 | 
105 | ![](../png/row_number.png)
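    | 
    | As a minimal sketch of the difference, the query below applies all three functions to the `cookie` table created earlier in this file. The column aliases `rn_row_number`, `rn_rank`, and `rn_dense_rank` are names chosen here for illustration only.
    | 
    | ```sql
    | -- Sketch: same partition and ordering for all three functions,
    | -- so only the handling of ties differs between the result columns.
    | select
    |   cookie_id,
    |   create_time,
    |   pv,
    |   row_number() over (partition by cookie_id order by pv desc) as rn_row_number,
    |   rank()       over (partition by cookie_id order by pv desc) as rn_rank,
    |   dense_rank() over (partition by cookie_id order by pv desc) as rn_dense_rank
    | from cookie;
    | ```
    | 
    | For cookie1 the pv values in descending order are 7, 5, 4, 4, 3, 2, 1, so row_number() should give 1 to 7, rank() should give 1, 2, 3, 3, 5, 6, 7, and dense_rank() should give 1, 2, 3, 3, 4, 5, 6. Which of the two pv = 4 rows row_number() places first is not determined by this ordering.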
106 | 
107 | ## Other functions
108 | 
109 | ### Substring function
110 | 
111 | | substring(string, start position, length) | The start position is counted from 1 |
112 | | ------------------------------------- | ------------------- |
113 | | substring(2015011023,1,4)             | 2015                |
114 | 
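115 | ## Interview question 6
116 | 
117 | ### Requirements, data, and table creation
118 | 
119 | - Requirement: write Hive HQL statements that
120 | 
121 |   1. find the highest temperature of each year (year, highest temperature)
122 | 
123 |   2. find the day on which each year's highest temperature occurred (date, highest temperature)
124 | 
125 | - Data: (line)
126 | 
127 |   For example, 2010012325 means that the temperature on 2010-01-23 was 25 degrees.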
128 | 
129 |   ```
130 |   2014010114
131 |   2014010216
132 |   2014010317
133 |   2014010410
134 |   2014010506
135 |   2012010609
136 |   2012010732
137 |   2012010812
138 |   2012010919
139 |   2012011023
140 |   2001010116
141 |   2001010212
142 |   2001010310
143 |   2001010411
144 |   2001010529
145 |   2013010619
146 |   2013010722
147 |   2013010812
148 |   2013010929
149 |   2013011023
150 |   2008010105
151 |   2008010216
152 |   2008010337
153 |   2008010414
154 |   2008010516
155 |   2007010619
156 |   2007010712
157 |   2007010812
158 |   2007010999
159 |   2007011023
160 |   2010010114
161 |   2010010216
162 |   2010010317
163 |   2010010410
164 |   2010010506
165 |   2015010649
166 |   2015010722
167 |   2015010812
168 |   2015010999
169 |   2015011023
170 |   ```
171 | 
172 | - Create the table and load the data
173 | 
174 |   ```sql
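175 |   -- Create the table (one string column; a comma delimiter is declared, but the data contains no commas)
176 |   create table if not exists temperature(line string) row format delimited fields terminated by ",";
177 |   
178 |   -- Prepare the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/temperature_data.txt)
179 |   
180 |   -- Load the data into the table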
181 |   load data local inpath "/root/yber/data/temperature_data.txt" into table temperature;
182 |   ```
183 | 
184 | ### Approach and implementation
185 | 
186 | - Approach
187 | 
188 |   Each raw record is a single string, so we use the <u>substring</u> function `substring()` to cut it into its parts.
189 | 
190 |   | substring(string, start position, length) | The start position is counted from 1 |
191 |   | ------------------------------------- | ------------------- |
192 |   | substring(2015011023,1,4)             | 2015                |
193 |   | substring(2015011023,9)               | 23                  |
194 | 
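195 | - Implementation
196 | 
197 |   - Part 1: find the highest temperature of each year (year, highest temperature)
198 | 
199 |     ```sql
200 |     select 
201 |       substring(line,1,4) as year, 
202 |       max(cast(substring(line,9) as int)) as temperature -- cast so that max() compares numbers, not strings
203 |     from 
204 |       temperature 
205 |     group by 
206 |       substring(line,1,4);
207 |     ```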
208 |   
209 |     ```
210 |     2001	29
211 |     2007	99
212 |     2008	37
213 |     2010	17
214 |     2012	32
215 |     2013	29
216 |     2014	17
217 |     2015	99
218 |     ```
219 |   
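220 |   - Part 2: find the day on which each year's highest temperature occurred (date, highest temperature)
221 |   
222 |     ```sql
223 |     select a.year,a.dt,a.temperature,a.index
224 |     from (
225 |       select 
226 |       substring(line,1,4) as year,
227 |       substring(line,5,4) as dt,
228 |       cast(substring(line,9) as int) as temperature,
229 |       row_number() over(partition by substring(line,1,4) order by cast(substring(line,9) as int) desc) as index
230 |       from 
231 |     temperature
232 |     ) a 
233 |     where a.index <=1;
234 |     ```
235 |     
236 |     <u>Inner query</u>: uses row_number() and substring() to select year, month-day, temperature, and rank (partitioned by year, ordered by temperature from high to low). The temperature is cast to int so that it is ordered numerically rather than as a string.
237 |     
238 |     <u>Outer query</u>: selects from that result and uses where to keep the wanted ranks (for top N, use `where index<=n`).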
239 | 
240 | **Another solution to part 2: a join query**
241 | 
242 | Only the outline is given here, without going into detail (see the figure).
243 | 
244 | ![](../png/面试题5_2.png)
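    | 
    | The figure is not reproduced in text form, so the following is only one possible reading of the join approach: compute each year's maximum temperature with group by, then join it back to the parsed rows. The aliases `t`, `m`, and `max_temperature` are illustrative names, not taken from the original.
    | 
    | ```sql
    | -- Sketch: join every parsed record to its year's maximum temperature.
    | select t.year, t.dt, t.temperature
    | from (
    |   -- one row per record: year, month-day, numeric temperature
    |   select
    |     substring(line,1,4) as year,
    |     substring(line,5,4) as dt,
    |     cast(substring(line,9) as int) as temperature
    |   from temperature
    | ) t
    | join (
    |   -- one row per year: the highest temperature of that year
    |   select
    |     substring(line,1,4) as year,
    |     max(cast(substring(line,9) as int)) as max_temperature
    |   from temperature
    |   group by substring(line,1,4)
    | ) m
    | on t.year = m.year and t.temperature = m.max_temperature;
    | ```
    | 
    | Unlike the row_number() version, this form returns every day that ties for a year's maximum, which may or may not be what the question intends.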
245 | 
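246 | ## Interview question 7
247 | 
248 | ### Requirements, data, and table creation
249 | 
250 | - Requirement: write Hive HQL statements that
251 | 
252 |   1. **find the oldest person for each hobby**
253 | 
254 |   2. **list the two oldest people for each hobby, together with their names**
255 | 
256 | - Data: (id, name, age, favors)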
257 | 
258 |   ```
259 |   1,huangbo,45,a-c-d-f
260 |   2,xuzheng,36,b-c-d-e
261 |   3,huanglei,41,c-d-e
262 |   4,liushishi,22,a-d-e
263 |   5,liudehua,39,e-f-d
264 |   6,liuyifei,35,a-d-e
265 |   ```
266 | 
267 | - Create the table and load the data
268 | 
269 |   ```sql
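270 |   -- Create the table with a comma (,) as the field delimiter
271 |   create table if not exists interest(id int, name string,age int,favors string) row format delimited fields terminated by ",";
272 |   
273 |   -- Prepare the data file on the server's local file system or on HDFS. Here it is on the local file system (/opt/zyb/data/interest_data.txt)
274 |   
275 |   -- Load the data into the table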
276 |   load data local inpath "/opt/zyb/data/interest_data.txt" into table interest;
277 |   ```
278 | 
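279 | ### Approach and implementation
280 | 
281 | - Analysis
282 | 
283 | **Data**: one person has several hobbies (**one-to-many**)
284 | 
285 | **Requirement**: find the oldest person within each hobby (**pick one out of many**)
286 | 
287 | **Method**: [columns to rows](./列转行.md)
288 | 
289 | Use the **explode function** **explode** together with the **string splitting function** **split**,
290 | 
291 | combined with a **lateral view**.
292 | 
293 | - Implementation
294 | 
295 | **Statement 1**: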
296 | 
297 | ```sql
298 | -- Use a lateral view to turn the hobbies column into rows.
299 | select id,name,age,t2.favor
300 | from 
301 | interest 
302 | lateral view explode(split(favors,"-"))t2 as favor;
303 | ```
304 | 
305 | **Result 1**:
306 | 
307 | ```
308 | id|name     |age|favor|
309 | --+---------+---+-----+
310 |  1|huangbo  | 45|a    |
311 |  1|huangbo  | 45|c    |
312 |  1|huangbo  | 45|d    |
313 |  1|huangbo  | 45|f    |
314 |  2|xuzheng  | 36|b    |
315 |  2|xuzheng  | 36|c    |
316 |  2|xuzheng  | 36|d    |
317 |  2|xuzheng  | 36|e    |
318 |  3|huanglei | 41|c    |
319 |  3|huanglei | 41|d    |
320 |  3|huanglei | 41|e    |
321 |  4|liushishi| 22|a    |
322 |  4|liushishi| 22|d    |
323 |  4|liushishi| 22|e    |
324 |  5|liudehua | 39|e    |
325 |  5|liudehua | 39|f    |
326 |  5|liudehua | 39|d    |
327 |  6|liuyifei | 35|a    |
328 |  6|liuyifei | 35|d    |
329 |  6|liuyifei | 35|e    |
330 | ```
331 | 
332 | **Statement 2**:
333 | 
334 | Building on statement 1:
335 | 
336 | ```sql
337 | -- Use row_number to rank people by age within each hobby
338 | select 
339 |   id,
340 |   name,
341 |   age,
342 |   t2.favor as favor,
343 |   row_number() over(partition by t2.favor order by age desc) as rn
344 | from 
345 |   interest 
346 |   lateral view explode(split(favors,"-"))t2 as favor
347 |   lateral view explode(split(favors,"-"))t2 as favor;
348 | 
349 | **Result 2**:
350 | 
351 | ```
352 | id	name		age	favor	rn
353 | 1	huangbo		45	a		1
354 | 6	liuyifei	35	a		2
355 | 4	liushishi	22	a		3
356 | 2	xuzheng		36	b		1
357 | 1	huangbo		45	c		1
358 | 3	huanglei	41	c		2
359 | 2	xuzheng		36	c		3
360 | 1	huangbo		45	d		1
361 | 3	huanglei	41	d		2
362 | 5	liudehua	39	d		3
363 | 2	xuzheng		36	d		4
364 | 6	liuyifei	35	d		5
365 | 4	liushishi	22	d		6
366 | 3	huanglei	41	e		1
367 | 5	liudehua	39	e		2
368 | 2	xuzheng		36	e		3
369 | 6	liuyifei	35	e		4
370 | 4	liushishi	22	e		5
371 | 1	huangbo		45	f		1
372 | 5	liudehua	39	f		2
373 | ```
374 | 
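375 | At this point the rows are partitioned by hobby and ordered by age, so all that remains is to filter on rn.
376 | 
377 | (`where rn<=2` keeps the two oldest people)
378 | 
379 | (`where rn<=1` keeps the single oldest person)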
380 | 
381 | ```sql
382 | -- List the two oldest people for each hobby.
383 | select * from (
384 |   select 
385 |     id,
386 |     name,
387 |     age,
388 |     t2.favor as favor,
389 |     row_number() over(partition by t2.favor order by age desc) as rn
390 |   from 
391 |     interest 
392 |     lateral view explode(split(favors,"-"))t2 as favor
393 | ) a
394 | where rn <=2;
395 | ```
396 | 
397 | 
398 | 
399 | **Supplement: answering part 1 without row_number (this returns only the maximum age per hobby, not the person).**
400 | 
401 | ```sql
402 | select favor,max(age) as max_favor_person from (
403 |   select id,name,age,t2.favor as favor
404 |   from 
405 |   interest 
406 |   lateral view explode(split(favors,"-"))t2 as favor
407 | ) a
408 | group by favor;
409 | ```
410 | 
411 | ```
412 | favor	max_favor_person
413 | a		45
414 | b		36
415 | c		45
416 | d		45
417 | e		41
418 | f		45
419 | ```
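    | 
    | The group by query above yields only the maximum age. As a hedged sketch of how the person's name could be recovered without row_number, the exploded rows can be joined back to the per-hobby maximum. The CTE name `exploded` and the aliases `e`, `m`, and `max_age` are assumptions made for this illustration; `with` requires a Hive version that supports common table expressions.
    | 
    | ```sql
    | -- Sketch: oldest person per hobby via a join instead of row_number().
    | with exploded as (
    |   -- one row per (person, hobby)
    |   select id, name, age, t2.favor as favor
    |   from interest
    |   lateral view explode(split(favors, "-")) t2 as favor
    | )
    | select e.favor, e.name, e.age
    | from exploded e
    | join (
    |   -- maximum age within each hobby
    |   select favor, max(age) as max_age
    |   from exploded
    |   group by favor
    | ) m
    | on e.favor = m.favor and e.age = m.max_age;
    | ```
    | 
    | With the sample data, where no two people share an age, this should return the same people as the rn = 1 rows of result 2. If two people tied for the maximum age of a hobby, both would be returned.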
420 | 
421 | 


--------------------------------------------------------------------------------
/interview-question/group-by.md:
--------------------------------------------------------------------------------
  1 | # group by
  2 | 
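  3 | ## Interview question 10
  4 | 
  5 | ### Requirements, data, and table creation
  6 | 
  7 | #### Requirements
  8 | 
  9 | Write Hive HQL statements that:
 10 | 
 11 | 1. For each product, output the net profit and the average daily cost of each month in 2018.
 12 | 2. For each product, output the change in cost of each day in March 2018 compared with the previous day.
 13 | 3. Output how many products had a total income greater than 22000 yuan in April 2018; this must be a single SQL statement, without joins or subqueries.
 14 | 4. For the product with the highest total income in April 2018, output its daily income and cost, using the over() function.
 15 | 
 16 | - Data: (date dt, product id, daily income, daily cost)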
 17 | 
 18 |   ```
 19 |   2018-03-01,a,3000,2500
 20 |   2018-03-01,b,4000,3200
 21 |   2018-03-01,c,3200,2400
 22 |   2018-03-01,d,3000,2500
 23 |   2018-03-02,a,3000,2500
 24 |   2018-03-02,b,1500,800
 25 |   2018-03-02,c,2600,1800
 26 |   2018-03-02,d,2400,1000
 27 |   2018-03-03,a,3100,2400
 28 |   2018-03-03,b,2500,2100
 29 |   2018-03-03,c,4000,1200
 30 |   2018-03-03,d,2500,1900
 31 |   2018-03-04,a,2800,2400
 32 |   2018-03-04,b,3200,2700
 33 |   2018-03-04,c,2900,2200
 34 |   2018-03-04,d,2700,2500
 35 |   2018-03-05,a,2700,1000
 36 |   2018-03-05,b,1800,200
 37 |   2018-03-05,c,5600,2200
 38 |   2018-03-05,d,1200,1000
 39 |   2018-03-06,a,2900,2500
 40 |   2018-03-06,b,4500,2500
 41 |   2018-03-06,c,6700,2300
 42 |   2018-03-06,d,7500,5000
 43 |   2018-04-01,a,3000,2500
 44 |   2018-04-01,b,4000,3200
 45 |   2018-04-01,c,3200,2400
 46 |   2018-04-01,d,3000,2500
 47 |   2018-04-02,a,3000,2500
 48 |   2018-04-02,b,1500,800
 49 |   2018-04-02,c,4600,1800
 50 |   2018-04-02,d,2400,1000
 51 |   2018-04-03,a,6100,2400
 52 |   2018-04-03,b,4500,2100
 53 |   2018-04-03,c,6000,1200
 54 |   2018-04-03,d,3500,1900
 55 |   2018-04-04,a,2800,2400
 56 |   2018-04-04,b,3200,2700
 57 |   2018-04-04,c,2900,2200
 58 |   2018-04-04,d,2700,2500
 59 |   2018-04-05,a,4700,1000
 60 |   2018-04-05,b,3800,200
 61 |   2018-04-05,c,5600,2200
 62 |   2018-04-05,d,5200,1000
 63 |   2018-04-06,a,2900,2500
 64 |   2018-04-06,b,4500,2500
 65 |   2018-04-06,c,6700,2300
 66 |   2018-04-06,d,7500,5000
 67 |   ```
 68 | 
 69 | - Create the table and load the data
 70 | 
 71 |   ```sql
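 72 |   -- Create the table with a comma (,) as the field delimiter
 73 |   create table if not exists goods(dt string, id string, income int, cost int) row format delimited fields terminated by ','; 
 74 |   
 75 |   -- Prepare the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/goods_data.txt)
 76 |   
 77 |   -- Load the data into the table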
 78 |   load data local inpath '/root/yber/data/goods_data.txt' into table goods;
 79 |   ```
 80 | 
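 81 | #### Approach and implementation
 82 | 
 83 | - Analysis
 84 | 
 85 |   Aggregate functions + group by
 86 | 
 87 | - Implementation
 88 | 
 89 | 1. For each product, output the net profit and the average daily cost of each month in 2018.
 90 | 
 91 |    Statement: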
 92 | 
 93 |    ```sql
 94 |    -- First use substring to extract the year and month, and compute the daily profit.
 95 |    select 
 96 |    substring(dt,1,4) as year,
 97 |    substring(dt,6,2) as month,
 98 |    dt as day,
 99 |    id,
100 |    income,
101 |    cost,
102 |    (income-cost) as profit 
103 |    from 
104 |    goods;
105 |    
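106 |    -- Then use group by to get the monthly total profit and the average daily cost; the filter below limits the output to March as an example.
107 |    -- The filter month = '03' can be applied in either the inner or the outer query; filtering in the inner query generally reduces the work left for the outer query.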
108 |    select id,month,sum(profit) as month_sum_profit ,avg(cost) as daily_cost
109 |    from (
110 |      select 
111 |        substring(dt,1,4) as year,
112 |        substring(dt,6,2) as month,
113 |        dt as day,
114 |        id,
115 |        income,
116 |        cost,
117 |        (income-cost) as profit
118 |      from 
119 |        goods
120 |    ) a
121 |    where month = '03'
122 |    group by id,month;
123 |    ```
124 | 
125 |    Result:
126 | 
127 |    | id   | month | month_sum_profit | daily_cost  |
128 |    | ---- | ----- | ---------------- | ----------- |
129 |    | a    | 3     | 4200             | 2216.666667 |
130 |    | b    | 3     | 6000             | 1916.666667 |
131 |    | c    | 3     | 12900            | 2016.666667 |
132 |    | d    | 3     | 5400             | 2316.666667 |
133 | 
134 | 2. For each product, output the change in cost of each day in March 2018 compared with the previous day.
135 | 
136 |    - Method 1: `lag() over(partition by order by)`
137 | 
138 |    ```sql
139 |    select id,day,cost,last_cost,(cost-last_cost) as cost_change
140 |    from (
141 |      select 
142 |        substring(dt,1,4) as year,
143 |        substring(dt,6,2) as month,
144 |        dt as day,
145 |        id,
146 |        cost,
147 |        -- Fetch the previous row's cost, defaulting to 0 when there is no previous row; partition by month and id, order by id and date (dt)
148 |        lag(cost,1,0) over(partition by substring(dt,6,2),id order by id,dt) as last_cost
149 |      from 
150 |        goods
151 |      where 
152 |        -- Filter for March 2018 in the inner query; the alias month cannot be used here because where is evaluated before the select list
153 |        substring(dt,1,4) = '2018' and substring(dt,6,2) ='03'
154 |    )a
155 |    ;
156 |    ```
157 | 
158 |    - Method 2: self-join
159 | 
160 |    Statement:
161 | 
162 |    ```sql
163 |    select aa.aid as id,aa.bday as day,(aa.bcost-aa.acost) as difference
164 |    from
165 |      (
166 |      select a.id as aid,a.cost as acost,a.day as aday,b.cost as bcost,b.day as bday from 
167 |        (
168 |        select id,income,cost,substring(dt,1,4) as year,substring(dt,6,2) as month,substring(dt,9,2) as day
169 |        from goods
170 |        where substring(dt,6,2)='03' and substring(dt,1,4)='2018'
171 |        order by id,month,day
172 |        ) a
173 |      left join
174 |        (
175 |        select id,income,cost,substring(dt,1,4) as year,substring(dt,6,2) as month,substring(dt,9,2) as day
176 |        from goods
177 |        where substring(dt,6,2)='03' and substring(dt,1,4)='2018'
178 |        order by id,month,day
179 |        ) b
180 |      on a.id=b.id and a.month=b.month and cast(a.day as int)=cast(b.day as int)-1
181 |      ) aa
182 |    where aa.bcost is not null;
183 |    ```
184 | 
185 |    Result (lag):
186 | 
187 |    | id   | day      | cost | last_cost | cost_change |
188 |    | ---- | -------- | ---- | --------- | ----------- |
189 |    | a    | 2018/3/1 | 2500 | 0         | 2500        |
190 |    | a    | 2018/3/2 | 2500 | 2500      | 0           |
191 |    | a    | 2018/3/3 | 2400 | 2500      | -100        |
192 |    | a    | 2018/3/4 | 2400 | 2400      | 0           |
193 |    | a    | 2018/3/5 | 1000 | 2400      | -1400       |
194 |    | a    | 2018/3/6 | 2500 | 1000      | 1500        |
195 |    | b    | 2018/3/1 | 3200 | 0         | 3200        |
196 |    | b    | 2018/3/2 | 800  | 3200      | -2400       |
197 |    | b    | 2018/3/3 | 2100 | 800       | 1300        |
198 |    | b    | 2018/3/4 | 2700 | 2100      | 600         |
199 |    | b    | 2018/3/5 | 200  | 2700      | -2500       |
200 |    | b    | 2018/3/6 | 2500 | 200       | 2300        |
201 |    | c    | 2018/3/1 | 2400 | 0         | 2400        |
202 |    | c    | 2018/3/2 | 1800 | 2400      | -600        |
203 |    | c    | 2018/3/3 | 1200 | 1800      | -600        |
204 |    | c    | 2018/3/4 | 2200 | 1200      | 1000        |
205 |    | c    | 2018/3/5 | 2200 | 2200      | 0           |
206 |    | c    | 2018/3/6 | 2300 | 2200      | 100         |
207 |    | d    | 2018/3/1 | 2500 | 0         | 2500        |
208 |    | d    | 2018/3/2 | 1000 | 2500      | -1500       |
209 |    | d    | 2018/3/3 | 1900 | 1000      | 900         |
210 |    | d    | 2018/3/4 | 2500 | 1900      | 600         |
211 |    | d    | 2018/3/5 | 1000 | 2500      | -1500       |
212 |    | d    | 2018/3/6 | 5000 | 1000      | 4000        |
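    | 
    |    In the table above, the default value 0 passed to lag() makes the first day's cost_change equal to the full cost, which is not really a change. A minimal variant, assuming it is acceptable to leave the first day empty, omits the default so that lag() returns NULL when there is no previous row:
    | 
    |    ```sql
    |    -- Sketch: first day of each product gets NULL instead of a misleading change.
    |    select
    |      id,
    |      dt as day,
    |      cost,
    |      lag(cost, 1) over (partition by id order by dt) as last_cost,
    |      cost - lag(cost, 1) over (partition by id order by dt) as cost_change
    |    from goods
    |    where substring(dt,1,4) = '2018'
    |      and substring(dt,6,2) = '03';
    |    ```
    | 
    |    Because the where clause already restricts the rows to a single month, partitioning by id alone is enough here.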
213 | 
214 | 3. Output how many products had a total income greater than 22000 yuan in April 2018; this must be a single SQL statement, without joins or subqueries.
215 | 
216 |    Statement:
217 | 
218 |    ```sql
219 |    select 
220 |      substring(dt,1,4) as year,
221 |      substring(dt,6,2) as month,
222 |      id,
223 |      sum(income) as month_income
224 |    from 
225 |      goods
226 |    where 
227 |      -- column aliases cannot be used in where
228 |      substring(dt,1,4) = '2018' and
229 |      substring(dt,6,2) = '04'
230 |    group by
231 |      -- column aliases cannot be used in group by
232 |      substring(dt,1,4),
233 |      substring(dt,6,2),
234 |      id
235 |    having 
236 |      -- having can use the alias; the requirement says "greater than", hence > rather than >=
237 |      month_income > 22000;
238 |    ```
239 | 
240 |    Result:
241 | 
242 |    | year | month | id   | month_income |
243 |    | ---- | ----- | ---- | ------------ |
244 |    | 2018 | 4     | a    | 22500        |
245 |    | 2018 | 4     | c    | 29000        |
246 |    | 2018 | 4     | d    | 24300        |
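    | 
    |    The statement above lists the qualifying products rather than counting them. One possible way to return the count itself in a single statement, still without a join or subquery, is to apply a window count over the grouped result. This is a sketch that relies on window functions being evaluated after group by and having; it should be checked on the Hive version in use. The alias `product_cnt` is a name chosen here.
    | 
    |    ```sql
    |    -- Sketch: count the groups that survive the having filter.
    |    select
    |      count(*) over () as product_cnt  -- number of qualifying products, repeated on every row
    |    from goods
    |    where substring(dt,1,4) = '2018'
    |      and substring(dt,6,2) = '04'
    |    group by id
    |    having sum(income) > 22000
    |    limit 1;                            -- every row carries the same count, so keep one
    |    ```
    | 
    |    With the sample data the qualifying products are a, c, and d, so the expected count is 3.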
247 | 
248 | 4. For the product with the highest total income in April 2018, output its daily income and cost, using the over() function.
249 | 
250 |    Statement:
251 | 
252 |    ```sql
253 |    -- The result of this query is shown in the figure at the bottom.
254 |    select 
255 |      *,
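256 |      -- Rank the monthly totals from the inner query (partition by month, total income descending); rn=1 marks the product with the highest total income.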
257 |      rank() over(partition by month order by sum_income desc) as rn
258 |    from (
259 |        select 
260 |          substring(dt,1,4) as year,
261 |          substring(dt,6,2) as month,
262 |          dt as day,
263 |          id,
264 |          income,
265 |          cost,
266 |          -- Compute each product's (id) total income for the month
267 |          sum(income) over(partition by id order by id) as sum_income
268 |        from 
269 |          goods
270 |        where substring(dt,6,2) = '04'  
271 |    ) a
272 |    
273 |    
274 |    
275 |    
276 |    -- Final result: filter the query above on rn=1.
277 |    select * from (
278 |    select *,rank() over(partition by month order by sum_income desc) as rn
279 |    from (
280 |    select 
281 |    substring(dt,1,4) as year,
282 |    substring(dt,6,2) as month,
283 |    dt as day,
284 |    id,
285 |    income,
286 |    cost,
287 |    sum(income) over(partition by id order by id) as sum_income
288 |    from 
289 |    goods
290 |    where substring(dt,6,2) = '04'
291 |    ) a
292 |    ) b
293 |    where rn = 1
294 |    ```
295 | 
296 |    Result (final):
297 | 
298 |    | a.year | a.month | a.day    | a.id | a.income | a.cost | a.sum_income | rn   |
299 |    | ------ | ------- | -------- | ---- | -------- | ------ | ------------ | ---- |
300 |    | 2018   | 4       | 2018/4/2 | c    | 4600     | 1800   | 29000        | 1    |
301 |    | 2018   | 4       | 2018/4/3 | c    | 6000     | 1200   | 29000        | 1    |
302 |    | 2018   | 4       | 2018/4/5 | c    | 5600     | 2200   | 29000        | 1    |
303 |    | 2018   | 4       | 2018/4/1 | c    | 3200     | 2400   | 29000        | 1    |
304 |    | 2018   | 4       | 2018/4/6 | c    | 6700     | 2300   | 29000        | 1    |
305 |    | 2018   | 4       | 2018/4/4 | c    | 2900     | 2200   | 29000        | 1    |
306 | 
307 | 
308 | 
309 | **Why does part 4 use rank rather than row_number?**
310 | 
311 | <img src="../png/use_rank.png" style="zoom:75%;" />
312 | 
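313 | 
314 | For details, review how these functions differ in the "TopN" chapter.
315 | As the figure above shows, each product's monthly total income is attached to every one of its rows.
316 | With row_number, the ranks shown above as 1,1,1,1,1,1 would become 1,2,3,4,5,6,
317 | so filtering on rn=1 would no longer return all of the wanted rows.
318 | 
319 | rank gives equal values the same rank, so a single filter returns all of the wanted rows at once.
320 | 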
321 | 
322 | 


--------------------------------------------------------------------------------
/interview-question/列转行.md:
--------------------------------------------------------------------------------
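 1 | # Columns to rows
 2 | 
 3 | ## Concept
 4 | 
 5 | Takes the several values stored in one column for a given key and turns them into multiple rows, so that each row holds exactly one value per key. (The data in one column of a single row is spread over several rows.)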
 6 | 
 7 | ![](../png/列转行.png)
 8 | 
 9 | ## Techniques used
10 | 
11 | ### String function: split
12 | 
13 | | Function | Usage             | Meaning                     |
14 | | ------- | ----------------- | --------------------------- |
15 | | split() | split(favors,'-') | Splits favors on - into an array of parts. |
16 | 
17 | ![](../png/split.png)
18 | 
19 | ### Explode function: explode
20 | 
21 | | Function  | Usage                             | Meaning                                                | Pronunciation               |
22 | | --------- | --------------------------------- | ------------------------------------------------------ | --------------------------- |
23 | | explode() | explode(array)                    | Turns the elements of an array into multiple rows      | `explode /ɪkˈsploʊd/`       |
24 | |           | explode(`split(column, delimiter)`) | Combined with split: converts a string into an array, then turns it into multiple rows |                             |
25 | 
26 | ![](../png/explodeAndSplit.png)
27 | 
28 | ### Lateral view
29 | 
30 | - Pronunciation:
31 | 
32 | `lateral /ˈlætərəl/` (meaning "sideways")
33 | 
34 | - Usage:
35 | 
36 | `LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*`
37 | 
38 | **lateral view** explode(split(favors,"-")) **t2 as favor**;
39 | 
40 | "t2" is the name (table alias) of the lateral view
41 | 
42 | "favor" is the column alias of the lateral view
43 | 
44 | ![](../png/lateral_view.png)
45 | 
46 | ## Example
47 | 
48 | ### Data
49 | 
50 | (city,infos)
51 | 
52 | ```
53 | 北京	朝阳区,海淀区,其他
54 | 上海	黄浦区,徐汇区,其他
55 | ```
56 | 
57 | ### Query
58 | 
59 | ```sql
60 | -- Create the table with a tab (\t) as the field delimiter
61 | create table city_infos(city string,infos string) row format delimited fields terminated by "\t";
62 | 
63 | -- Prepare the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/lines_data.txt)
64 | 
65 | -- Load the data into the table
66 | load data local inpath "/root/yber/data/lines_data.txt" into table city_infos;
67 | 
68 | -- Query the data
69 | select city,t2.info 
70 | from city_infos 
71 | lateral view explode(split(infos,",")) t2 as info;
72 | ```
73 | 
74 | ### Result
75 | 
76 | ```
77 | city	t2.info
78 | 北京	朝阳区
79 | 北京	海淀区
80 | 北京	其他
81 | 上海	黄浦区
82 | 上海	徐汇区
83 | 上海	其他
84 | ```
85 | 
86 | 


--------------------------------------------------------------------------------
/interview-question/窗口分析.md:
--------------------------------------------------------------------------------
  1 | ## Window analysis
  2 | 
  3 | ### Window functions: aggregates (sum/max/min/avg)
  4 | 
  5 | Using SUM as the example (max, min, and avg work the same way):
  6 | 
  7 |  sum(value_col) over (partition by partition_col order by sort_col **rows between** <u>unbounded preceding</u> **and** <u>current row</u>) as pv1
  8 | 
  9 | | Case                                        | Behavior              |
 10 | | ----------------------- | --------------------- |
 11 | | rows between is omitted (order by present) | the frame defaults to: from the start of the partition to the current row |
 12 | | order by is omitted     | all values in the partition are summed |
 13 | 
 14 | The key is to understand ROWS BETWEEN, also called the WINDOW clause:
 15 | 
 16 | | Keyword             | Example                      |
 17 | | ------------------- | ---------------------------- |
 18 | | preceding: rows before | <u>3 preceding</u> (the three rows before) |
 19 | | following: rows after  | <u>1 following</u> (the one row after) |
 20 | | current row         | <u>current row</u> (the current row) |
 21 | 
 22 | | Keyword                     | Meaning            |
 23 | | --------------------------- | ------------------ |
 24 | | <u>unbounded preceding</u>  | from the first row of the partition |
 25 | | <u>unbounded following</u>  | to the last row of the partition |
 26 | 
 27 | Example:
 28 | 
 29 | ```sql
 30 |    sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, 
 31 |    sum(pv) over (partition by cookieid order by createtime) as pv2, 
 32 |    sum(pv) over (partition by cookieid) as pv3, 
 33 |    sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, 
 34 |    sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, 
 35 |    sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6
 36 | 
 37 | ```
 38 | 
 39 | ![](../png/sum.png)
 40 | 
 41 | <u>**Note: compare what follows rows between in each pv expression with the results in the figure. (Go over the differences several times until they are clear.)**</u>
 42 | 
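 43 | ### Interview question 1
 44 | 
 45 | - Requirement: write a Hive HQL statement that finds, **for each shop, the sales of the current month, the cumulative sales up to that month, and the highest monthly sales up to that month**.
 46 | 
 47 | - Data: (name, month, money)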
 48 | 
 49 |   ```
 50 |   a,01,150
 51 |   a,01,200
 52 |   b,01,1000
 53 |   b,01,800
 54 |   c,01,250
 55 |   c,01,220
 56 |   b,01,6000
 57 |   a,02,2000
 58 |   a,02,3000
 59 |   b,02,1000
 60 |   b,02,1500
 61 |   c,02,350
 62 |   c,02,280
 63 |   a,03,350
 64 |   a,03,250
 65 |   ```
 66 | 
 67 | - Create the table and load the data
 68 | 
 69 |   ```sql
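 70 |   -- Create the table with a comma (,) as the field delimiter
 71 |   create table shop(id string,month string,money int) row format delimited fields terminated by ",";
 72 |   
 73 |   -- Prepare the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/shop_data.txt)
 74 |   
 75 |   -- Load the data into the table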
 76 |   load data local inpath "/root/yber/data/shop_data.txt" into table shop;
 77 |   ```
 78 | 
 79 | - Query
 80 | 
 81 |   ```sql
 82 |   -- Two steps:
 83 |   -- Step 1: get each shop's sales for each month.
 84 |   SELECT
 85 |   id,MONTH,sum(money) AS month_money
 86 |   FROM shop
 87 |   GROUP BY id,MONTH;
 88 |   
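 89 |   -- Step 2: starting from the result of step 1,
 90 |   -- use sum as a window function to get each shop's cumulative sales up to the current month,
 91 |   -- and max as a window function to get each shop's highest monthly sales up to the current month.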
 92 |   SELECT a.id,a.MONTH,a.month_money,
 93 |   sum(month_money) over(PARTITION BY a.id ORDER BY a.MONTH) AS sum_month_money,
 94 |   max(month_money) over(PARTITION BY a.id ORDER BY a.MONTH ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT row) AS max_month_money
 95 |   from
 96 |   (SELECT id,MONTH,sum(money) AS month_money FROM shop GROUP BY id,MONTH) a;
 97 |   ```
 98 | 
 99 | - Query results
100 | 
101 |   - Step 1:
102 |   
103 |   ![](../png/shop1.png)
104 |   
105 |   - Step 2 (final result):
106 |   
107 |   ![](../png/shop2.png)
108 | 
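109 | ### Interview question 2
110 | 
111 | #### 1. Window function method
112 | 
113 | - Requirement: write a Hive HQL statement that finds, for **each user**, the **highest single-month visit count** up to **each month** and the **cumulative visit count up to that month**
114 | 
115 | - Data: (user name, month, visit count)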
116 | 
117 |   ```
118 |   A,2015-01,5
119 |   A,2015-01,15
120 |   B,2015-01,5
121 |   A,2015-01,8
122 |   B,2015-01,25
123 |   A,2015-01,5
124 |   A,2015-02,4
125 |   A,2015-02,6
126 |   B,2015-02,10
127 |   B,2015-02,5
128 |   A,2015-03,16
129 |   A,2015-03,22
130 |   B,2015-03,23
131 |   B,2015-03,10
132 |   B,2015-03,11
133 |   ```
134 | 
135 | - Expected result
136 | 
137 |   ```
138 |   user	month		month_visits	total_visits	max_visits	
139 |   A	2015-01		33			33			33
140 |   A	2015-02		10			43			33
141 |   A	2015-03		38			81			38
142 |   B	2015-01		30			30			30
143 |   B	2015-02		15			45			30
144 |   B	2015-03		44			89			44
145 |   ```
146 | 
147 |   
148 | 
149 | - Create the table and load the data
150 | 
151 |   ```sql
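152 |   -- Create the table with a comma (,) as the field delimiter
153 |   create table user_click(id string,month string,number int) row format delimited fields terminated by ",";
154 |   
155 |   -- Prepare the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/user_click_data.txt)
156 |   
157 |   -- Load the data into the table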
158 |   load data local inpath "/root/yber/data/user_click_data.txt" into table user_click;
159 |   ```
160 | 
161 | - Query
162 | 
163 |   First compute each user's total visit count for each month
164 | 
165 |   ```sql
166 |   SELECT id,MONTH,sum(number) AS month_number FROM user_click GROUP BY id,MONTH;
167 |   ```
168 | 
169 |   Then apply window functions on top of that result
170 | 
171 |   ```sql
172 |   SELECT 
173 |   a.id,
174 |   a.MONTH,
175 |   a.month_number,
176 |   sum(a.month_number) over(PARTITION BY a.id ORDER BY a.MONTH) AS sum_month_number,
177 |   max(a.month_number) over(PARTITION BY a.id ORDER BY a.MONTH ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS max_month_number
178 |   FROM 
179 |   (SELECT id,MONTH,sum(number) AS month_number FROM user_click GROUP BY id,MONTH) a
180 |   ```
181 | 
182 | - Query result
183 | 
184 |   ![](../png/user_click1.png)
185 | 
186 | #### 2. Self-join view method (more complex and harder to use)
187 | 
188 | - First compute **each user's visit count for each month** and store it in an intermediate table;
189 | 
190 |   ```sql
191 |   -- Create the intermediate table user_click_temp
192 |   create table user_click_temp(id string,month string,number int)row format delimited fields terminated by ",";
193 |   -- Insert the result of the first query into the intermediate table (user_click_temp).
194 |   insert into table user_click_temp select id,month,sum(number) from user_click group by id,month;
195 |   ```
196 | 
197 |   <img src="../png/user_click_temp1.png" style="zoom:50%;" />
198 | 
199 | - **Create** a **self-join view** of each user's monthly visit counts (**using the intermediate table**)
200 | 
201 | ```sql
202 | -- Create the self-join view
203 | create view user_click_view as select a.id aid,a.month amonth,a.number anumber,b.id bid,b.month bmonth,b.number bnumber from user_click_temp a inner join user_click_temp b on a.id=b.id;
204 | -- Query the view
205 | select * from user_click_view;
206 | ```
207 | 
208 | <img src="../png/user_click_view1.png" style="zoom:50%;" />
209 | 
210 | - Query
211 | 
212 | ```sql
213 | select 
214 | aid,
215 | amonth,
216 | anumber,
217 | max(bnumber) as max_number,
218 | sum(bnumber) as sum_number 
219 | from 
220 | user_click_view 
221 | where 
222 | amonth >= bmonth 
223 | group by 
224 | aid,amonth,anumber;
225 | ```
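The self-join query can be traced with a small plain-Python sketch (an emulation, not Hive code). It assumes the staging table holds only the three monthly totals of user B from the expected output earlier in this file.

```python
# Rows of user_click_temp for user B (id, month, number).
temp = [("B", "2015-01", 30), ("B", "2015-02", 15), ("B", "2015-03", 44)]

# Self-join on id, then keep pairs where the b-side month is not later (amonth >= bmonth).
pairs = [(a, b) for a in temp for b in temp if a[0] == b[0] and a[1] >= b[1]]

# GROUP BY aid, amonth, anumber and collect the b-side numbers of each group.
groups = {}
for a, b in pairs:
    groups.setdefault(a, []).append(b[2])

# max(bnumber) and sum(bnumber) per group, in the same column order as the query.
result = [(aid, amonth, anumber, max(nums), sum(nums))
          for (aid, amonth, anumber), nums in sorted(groups.items())]
for row in result:
    print(*row)
```

Because every a-side row is paired with all b-side rows of the same or an earlier month, aggregating the b side gives the same running maximum and cumulative sum as the window-function version.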
226 | 
227 | ![Result](../png/user_click1.png)
228 | 
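229 | Explanation (the figures below walk through the first two groups):
230 | 
231 | 1. Joining on equal id, the self-join view (user_click_view) contains 18 rows; <img src="../png/user_click_view2.png" style="zoom:50%;" />
232 | 2. The `where amonth >= bmonth` filter keeps, for each a-side row, the b-side rows of the same or an earlier month (one or more rows).
233 | 3. Within each group (aid, amonth, anumber), `sum` and `max` collapse those b-side rows into a single row: the cumulative total and the largest single-month count so far.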
234 | 
235 | ![](../png/面试题2_4.png)
236 | 
237 | ![](../png/面试题2_5.png)
238 | 
239 | ### Interview question 3
240 | 
241 | - Requirement: write a Hive HQL statement that **groups by day and mac, computes the sales total of each group, and appends it to every row**
242 | 
243 | - Data: (day, mac, color, num)
244 | 
245 |   ```
246 |   20171011	1292	金色	1
247 |   20171011	1292	金色	14
248 |   20171011	1292	金色	2
249 |   20171011	1292	金色	11
250 |   20171011	1292	黑色	2
251 |   20171011	1292	粉金	58
252 |   20171011	1292	金色	1
253 |   20171011	2013	金色	10
254 |   20171011	2013	金色	9
255 |   20171011	2013	金色	2
256 |   20171011	2013	金色	1
257 |   20171012	1292	金色	5
258 |   20171012	1292	金色	7
259 |   20171012	1292	金色	5
260 |   20171012	1292	粉金	1
261 |   20171012	2013	粉金	1
262 |   20171012	2013	金色	6
263 |   20171013	1292	黑色	1
264 |   20171013	2013	粉金	2
265 |   20171011	12460	茶花金	1
266 |   ```
267 | 
268 | - Create the table and load the data
269 | 
270 |   ```sql
271 |   -- Create the table with a tab (\t) as the field delimiter
272 |   create table if not exists mac(day string, mac string,color string,number int) row format delimited fields terminated by "\t";
273 |   
274 |   -- Put the data file on the server's local file system or on HDFS. Here it is on the local file system (/opt/zyb/data/mac.txt)
275 |   
276 |   -- Load the data into the table
277 |   load data local inpath "/opt/zyb/data/mac.txt" into table mac;
278 |   ```
279 | 
280 | - Query
281 | 
282 |   ```sql
283 |   select day,mac,color,number,sum(number) over(partition by day,mac order by day,mac) as sum_number from mac;
284 |   ```
285 | 
286 | - Query result
287 | 
288 |   ```
289 |   day     mac     color   num     sumnumber
290 |   20171011        1292    金色    1       89
291 |   20171011        1292    金色    14      89
292 |   20171011        1292    金色    2       89
293 |   20171011        1292    金色    11      89
294 |   20171011        1292    黑色    2       89
295 |   20171011        1292    粉金    58      89
296 |   20171011        1292    金色    1       89
297 |   20171011        2013    金色    2       22
298 |   20171011        2013    金色    1       22
299 |   20171011        2013    金色    9       22
300 |   20171011        2013    金色    10      22
301 |   20171011        12460   茶花金  1       1
302 |   20171012        1292    金色    5       18
303 |   20171012        1292    粉金    1       18
304 |   20171012        1292    金色    5       18
305 |   20171012        1292    金色    7       18
306 |   20171012        2013    粉金    1       7
307 |   20171012        2013    金色    6       7
308 |   20171013        1292    黑色    1       1
309 |   20171013        2013    粉金    2       2
310 |   ```
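The result above is the total of each (day, mac) group repeated on every row of that group. A minimal plain-Python sketch of that behaviour (an emulation, not Hive code), using only the 20171012 rows of the sample data:

```python
from collections import defaultdict

# (day, mac, color, number) rows for 20171012, copied from the sample data.
rows = [
    ("20171012", "1292", "金色", 5),
    ("20171012", "1292", "金色", 7),
    ("20171012", "1292", "金色", 5),
    ("20171012", "1292", "粉金", 1),
    ("20171012", "2013", "粉金", 1),
    ("20171012", "2013", "金色", 6),
]

# First pass: total per partition key (day, mac).
totals = defaultdict(int)
for day, mac, _color, number in rows:
    totals[(day, mac)] += number

# Second pass: append the partition total to every original row.
result = [(day, mac, color, number, totals[(day, mac)])
          for day, mac, color, number in rows]
for row in result:
    print(*row)
```

Unlike `group by`, no rows are collapsed: the output has as many rows as the input, which is what distinguishes a window aggregate from a grouped aggregate.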
311 | 
312 | 


--------------------------------------------------------------------------------
/interview-question/窗口分析2lag-lead.md:
--------------------------------------------------------------------------------
  1 | 
  2 | 
  3 | ## Window analysis 2: lag/lead
  4 | 
  5 | ### Window functions: previous and next rows (lag/lead)
  6 | 
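  7 | #### LAG(col,n,DEFAULT) returns the value of col from the nth row **above** the current row within the window
  8 | 
  9 | > The first argument is the column name.
 10 | > The second argument is the offset n, i.e. how many rows above (optional, defaults to 1).
 11 | > The third argument is the default value, returned when there is no row n rows above; if it is omitted, NULL is returned.
 12 | 
 13 | #### LEAD(col,n,DEFAULT) returns the value of col from the nth row **below** the current row within the window
 14 | 
 15 | > The first argument is the column name.
 16 | > The second argument is the offset n, i.e. how many rows below (optional, defaults to 1).
 17 | > The third argument is the default value, returned when there is no row n rows below; if it is omitted, NULL is returned.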
 18 | 
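The offset and default arguments can be illustrated with a small plain-Python sketch (an emulation, not Hive code). The helper names `lag` and `lead` mirror the SQL functions but are hypothetical Python functions operating on one already-ordered partition; the dates are user 2's logins from the interview question below.

```python
def lag(values, n=1, default=None):
    """Value n rows above each row, or `default` when no such row exists."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]

def lead(values, n=1, default=None):
    """Value n rows below each row, or `default` when no such row exists."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]

# One partition, already sorted by the ORDER BY column.
dates = ["2020-01-01", "2020-01-02", "2020-01-04"]

print(lag(dates))              # first row has no previous row, so it gets None
print(lead(dates, 1, "N/A"))   # last row has no next row, so it gets the default
```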
 19 | ### Date functions
 20 | 
 21 | Both functions accept dates only in <u>'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss' format</u>.
 22 | 
 23 | #### datediff(endDate, startDate)
 24 | 
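 25 | Returns the number of days from startDate to endDate (endDate minus startDate).
 26 | 
 27 | #### date_add(start_date, num_days)
 28 | 
 29 | Returns the date num_days days after start_date (a negative num_days gives an earlier date).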
 30 | 
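As a quick sanity check of the argument order, the two functions can be mimicked in plain Python (an emulation for 'yyyy-MM-dd' strings only, not Hive code; the Python helpers are hypothetical and merely reuse the Hive names).

```python
from datetime import date, timedelta

def datediff(end_date, start_date):
    """Days from start_date to end_date; the later date goes first."""
    return (date.fromisoformat(end_date) - date.fromisoformat(start_date)).days

def date_add(start_date, num_days):
    """Date num_days after start_date; a negative value moves backwards."""
    return (date.fromisoformat(start_date) + timedelta(days=num_days)).isoformat()

print(datediff("2020-01-07", "2020-01-02"))
print(date_add("2020-01-07", -2))
```

Swapping the two arguments of `datediff` flips the sign of the result, which is why the queries below always pass the later date first.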
 31 | ### Interview question 1
 32 | 
 33 | - Requirement: write a Hive HQL statement that **<u>finds the ids of users who logged in on three consecutive days</u>**
 34 | 
 35 | - Data: (user_id, login_date)
 36 | 
 37 |   ```sql
 38 |   -- Create the table and populate it
 39 |   create table login_log as
 40 |   select 1 as user_id, "2020-01-01" as login_date
 41 |   union all
 42 |   select 1 as user_id, "2020-01-02" as login_date
 43 |   union all
 44 |   select 1 as user_id, "2020-01-07" as login_date
 45 |   union all
 46 |   select 1 as user_id, "2020-01-08" as login_date
 47 |   union all
 48 |   select 1 as user_id, "2020-01-09" as login_date
 49 |   union all
 50 |   select 1 as user_id, "2020-01-10" as login_date
 51 |   union all
 52 |   select 2 as user_id, "2020-01-01" as login_date
 53 |   union all
 54 |   select 2 as user_id, "2020-01-02" as login_date
 55 |   union all
 56 |   select 2 as user_id, "2020-01-04" as login_date
 57 |   ```
 58 | 
 59 | - Inspect the table
 60 | 
 61 |   ```sql
 62 |   hive> select * from login_log;
 63 |   OK
 64 |   1	2020-01-01
 65 |   1	2020-01-02
 66 |   1	2020-01-07
 67 |   1	2020-01-08
 68 |   1	2020-01-09
 69 |   1	2020-01-10
 70 |   2	2020-01-01
 71 |   2	2020-01-02
 72 |   2	2020-01-04
 73 |   ```
 74 | 
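 75 | #### 1. lag/lead + datediff approach
 76 | 
 77 | Approach:
 78 | 
 79 | 1. Use lag/lead to bring the previous and the next login date of the same user onto each login row (result A).
 80 | 2. Use datediff to compute the gap between the current and the previous date, and between the next and the current date. If both gaps equal 1, the three dates are consecutive, i.e. the user logged in three days in a row (result B).
 81 | 3. Finally, deduplicate the result.
 82 | 
 83 | - Query (step 1)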
 84 | 
 85 |   ```sql
 86 |   -- A: select the user id, previous login date, current login date and next login date for each row.
 87 |   SELECT 
 88 |   user_id,
 89 |   -- lag(login_date) also works, because the second argument is optional and defaults to 1
 90 |   lag(login_date,1) over(PARTITION BY user_id ORDER BY login_date) AS lag_date, 
 91 |   login_date,
 92 |   -- lead(login_date) also works, because the second argument is optional and defaults to 1
 93 |   lead(login_date,1) over(PARTITION BY user_id ORDER BY login_date) AS lead_date
 94 |   FROM login_log;
 95 |   ```
 96 | 
 97 |     A:
 98 | 
 99 |   ```
100 |   1	NULL		2020-01-01	2020-01-02
101 |   1	2020-01-01	2020-01-02	2020-01-07
102 |   1	2020-01-02	2020-01-07	2020-01-08
103 |   1	2020-01-07	2020-01-08	2020-01-09
104 |   1	2020-01-08	2020-01-09	2020-01-10
105 |   1	2020-01-09	2020-01-10	NULL
106 |   2	NULL		2020-01-01	2020-01-02
107 |   2	2020-01-01	2020-01-02	2020-01-04
108 |   2	2020-01-02	2020-01-04	NULL
109 |   ```
110 | 
111 | - Query (step 2)
112 | 
113 |   ```sql
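114 |   -- B: filter with datediff (the current date is one day after the previous date, and the next date is one day after the current date), which means three consecutive login days
115 |   -- datediff(): note that the later date comes first (later minus earlier)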
116 |   SELECT 
117 |     a.user_id
118 |   FROM (
119 |     SELECT 
120 |     user_id,
121 |     lag(login_date,1) over(PARTITION BY user_id ORDER BY login_date) AS lag_date,
122 |     login_date,
123 |     lead(login_date,1) over(PARTITION BY user_id ORDER BY login_date) AS lead_date
124 |     FROM login_log
125 |   ) a
126 |   WHERE datediff(login_date,lag_date)=1 AND datediff(lead_date,login_date) =1
127 |   ```
128 | 
129 |   B:
130 | 
131 |   ```
132 |   1
133 |   1
134 |   ```
135 | 
136 | - Finally, deduplicate with distinct **or** group by.
137 | 
138 |   ```sql
139 |   SELECT 
140 |   DISTINCT a.user_id -- deduplicate with distinct
141 |   FROM (
142 |   SELECT 
143 |   user_id,
144 |   lag(login_date,1) over(PARTITION BY user_id ORDER BY login_date) AS lag_date,
145 |   login_date,
146 |   lead(login_date,1) over(PARTITION BY user_id ORDER BY login_date) AS lead_date
147 |   FROM login_log
148 |   ) a
149 |   WHERE datediff(login_date,lag_date)=1 AND datediff(lead_date,login_date) =1
150 |   ```
151 | 
152 |   Final result:
153 | 
154 |   ```
155 |   1
156 |   ```
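The lag/lead + datediff logic can be traced end to end with a plain-Python sketch (an emulation, not Hive code). It uses the nine rows of `login_log` created above and assumes at most one row per user per day.

```python
from collections import defaultdict
from datetime import date

login_log = [
    (1, "2020-01-01"), (1, "2020-01-02"), (1, "2020-01-07"),
    (1, "2020-01-08"), (1, "2020-01-09"), (1, "2020-01-10"),
    (2, "2020-01-01"), (2, "2020-01-02"), (2, "2020-01-04"),
]

# PARTITION BY user_id ORDER BY login_date
days_by_user = defaultdict(list)
for user_id, login_date in sorted(login_log):
    days_by_user[user_id].append(date.fromisoformat(login_date))

# Slide over (previous, current, next) triples; both gaps must be exactly one day.
matches = []
for user_id, days in days_by_user.items():
    for prev, cur, nxt in zip(days, days[1:], days[2:]):
        if (cur - prev).days == 1 and (nxt - cur).days == 1:
            matches.append(user_id)

result = sorted(set(matches))  # DISTINCT
print(matches, result)
```

`matches` corresponds to result B (one entry per qualifying middle day) and `result` to the deduplicated final answer.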
157 | 
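158 | #### 2. date_add + row_number approach
159 | 
160 | Approach:
161 | 
162 | 1. Use row_number to rank each user's login dates in ascending order.
163 |    - row_number yields a sequence that increases by 1, and consecutive login dates also form a sequence that increases by one day (see Figure 1 in step 2 of the query).
164 |    - So when a user's dates are consecutive (the user logged in on consecutive days), subtracting the rank from the date gives the same value for every row of that streak (see Figure 1 in step 2 of the query).
165 | 2. Group by user and by that difference (the normalized date); the row count of each group is the length of one consecutive-login streak.
166 | 
167 | - Query (step 1: row_number)
168 | 
169 | First select each user's id, login date and the rank of the login date (row_number).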
170 | 
171 | ```sql
172 | SELECT 
173 | user_id,
174 | login_date,
175 | row_number() over(PARTITION BY user_id ORDER BY login_date) AS rn -- row_number takes no arguments; do not confuse it with other window functions
176 | FROM login_log;
177 | ```
178 | 
179 | Result:
180 | 
181 | ```
182 | 1	2020-01-01	1
183 | 1	2020-01-02	2
184 | 1	2020-01-07	3
185 | 1	2020-01-08	4
186 | 1	2020-01-09	5
187 | 1	2020-01-10	6
188 | 2	2020-01-01	1
189 | 2	2020-01-02	2
190 | 2	2020-01-04	3
191 | ```
192 | 
193 | - Query (step 2: date_add)
194 | 
195 | Select user_id, login_date and the **normalized date (the same day for every row of one consecutive-login streak)**, as shown in the figure below.
196 | 
197 | ![Figure 1](../png/login_log.png)
198 | 
199 | ```sql
200 | SELECT 
201 |   a.user_id,
202 |   a.login_date, 
203 |   date_add(a.login_date,1-rn) AS con_date -- equivalent to (login_date - rn + 1); see the white labels 1 and 2 in the figure above
204 | FROM (
205 |   SELECT 
206 |   user_id,
207 |   login_date,
208 |   row_number() over(PARTITION BY user_id ORDER BY login_date) AS rn -- row_number takes no arguments; do not confuse it with other window functions
209 |   FROM login_log
210 | ) a
211 | ```
212 | 
213 |   Result:
214 | 
215 | ```
216 | a.user_id	a.login_date	con_date
217 | 1	2020-01-01	2020-01-01
218 | 1	2020-01-02	2020-01-01
219 | 1	2020-01-07	2020-01-05
220 | 1	2020-01-08	2020-01-05
221 | 1	2020-01-09	2020-01-05
222 | 1	2020-01-10	2020-01-05
223 | 2	2020-01-01	2020-01-01
224 | 2	2020-01-02	2020-01-01
225 | 2	2020-01-04	2020-01-02
226 | ```
227 | 
228 | - Final query: aggregate to get each user id together with the length of each consecutive-login streak.
229 | 
230 | ```sql
231 | SELECT b.user_id,b.con_date,count(*) AS login_number
232 | FROM (
233 | SELECT 
234 |   a.user_id,
235 |   a.login_date, 
236 |   date_add(a.login_date,1-rn) AS con_date -- equivalent to (login_date - rn + 1); see the white labels 1 and 2 in the figure above
237 | FROM (
238 |   SELECT 
239 |   user_id,
240 |   login_date,
241 |   row_number() over(PARTITION BY user_id ORDER BY login_date) AS rn -- row_number takes no arguments; do not confuse it with other window functions
242 |   FROM login_log
243 | ) a
244 | ) b
245 | GROUP BY b.user_id,b.con_date;
246 | ```
247 | 
248 | Query result
249 | 
250 | ```sql
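251 | -- Columns 1 and 3 are what we need: a user qualifies when login_number is at least 3 (for example, add HAVING count(*) >= 3 to the query).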
252 | 1	2020-01-01	2
253 | 1	2020-01-05	4
254 | 2	2020-01-01	2
255 | 2	2020-01-02	1
256 | ```
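The "date minus row number" trick is easier to see when traced by hand. The following plain-Python sketch (an emulation, not Hive code) reproduces the grouped counts above and then applies the final "at least 3" filter; it assumes at most one `login_log` row per user per day, otherwise the ranks would no longer advance in step with the dates.

```python
from collections import Counter
from datetime import date, timedelta

login_log = [
    (1, "2020-01-01"), (1, "2020-01-02"), (1, "2020-01-07"),
    (1, "2020-01-08"), (1, "2020-01-09"), (1, "2020-01-10"),
    (2, "2020-01-01"), (2, "2020-01-02"), (2, "2020-01-04"),
]

streaks = Counter()   # (user_id, con_date) -> number of login days
rank = {}             # row_number() per user
for user_id, login_date in sorted(login_log):   # PARTITION BY user_id ORDER BY login_date
    rank[user_id] = rank.get(user_id, 0) + 1
    # date_add(login_date, 1 - rn): identical for every day of one streak
    con_date = date.fromisoformat(login_date) + timedelta(days=1 - rank[user_id])
    streaks[(user_id, con_date.isoformat())] += 1

# Equivalent of HAVING count(*) >= 3, then DISTINCT user_id
qualified = sorted({user_id for (user_id, _), n in streaks.items() if n >= 3})
print(dict(streaks), qualified)
```

Compared with the lag/lead approach, this version generalizes to any streak length by changing only the threshold.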
257 | 
258 | 
259 | 
260 | If you have time, keep going; otherwise skip this part for now.
261 | 
262 | Additional content: advanced!
267 | 
268 | ```sql
269 | SELECT 
270 | b.user_id,
271 | min(b.login_date) over(PARTITION BY b.user_id,b.con_date) AS first_date,
272 | sum(b.constant) over(PARTITION BY b.user_id,b.con_date) AS login_number -- Extra 2: count the rows by summing the constant over the window
273 | FROM (
274 | SELECT 
275 |   a.user_id,
276 |   a.login_date, 
277 |   date_add(a.login_date,1-rn) AS con_date ,-- equivalent to (login_date - rn + 1); see the white labels 1 and 2 in the figure above
278 |   1 AS constant -- Extra 1: add a constant column
279 | FROM (
280 |   SELECT 
281 |   user_id,
282 |   login_date,
283 |   row_number() over(PARTITION BY user_id ORDER BY login_date) AS rn -- row_number takes no arguments; do not confuse it with other window functions
284 |   FROM login_log
285 | ) a
286 | ) b
287 | ```
288 | 
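289 | Result: the middle column is useful too; it is the first day of each consecutive-login streak. Deduplicate afterwards if needed, for example by adding `distinct` after `select`.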
290 | 
291 | ```sql
292 | -- Columns 1, 2 and 3 are all useful.
293 | b.user_id	first_date	login_number
294 | 1	2020-01-01	2
295 | 1	2020-01-01	2
296 | 1	2020-01-07	4
297 | 1	2020-01-07	4
298 | 1	2020-01-07	4
299 | 1	2020-01-07	4
300 | 2	2020-01-01	2
301 | 2	2020-01-01	2
302 | 2	2020-01-04	1
303 | ```
304 | 
305 | Deduplicating with distinct
306 | 
307 | ```sql
308 | SELECT 
309 | distinct -- deduplicate
310 | b.user_id,
311 | min(b.login_date) over(PARTITION BY b.user_id,b.con_date) AS first_date,
312 | sum(b.constant) over(PARTITION BY b.user_id,b.con_date) AS login_number
313 | FROM (
314 | SELECT 
315 |   a.user_id,
316 |   a.login_date, 
317 |   date_add(a.login_date,1-rn) AS con_date ,-- equivalent to (login_date - rn + 1); see the white labels 1 and 2 in the figure above
318 |   1 AS constant
319 | FROM (
320 |   SELECT 
321 |   user_id,
322 |   login_date,
323 |   row_number() over(PARTITION BY user_id ORDER BY login_date) AS rn -- row_number takes no arguments; do not confuse it with other window functions
324 |   FROM login_log
325 | ) a
326 | ) b
327 | 
328 | ```
329 | 
330 | Result:
331 | 
332 | ```sql
333 | b.user_id	first_date	login_number
334 | 1	2020-01-01	2
335 | 1	2020-01-07	4
336 | 2	2020-01-01	2
337 | 2	2020-01-04	1
338 | ```
339 | 
340 | 


--------------------------------------------------------------------------------
/interview-question/自连接.md:
--------------------------------------------------------------------------------
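  1 | # Self-join
  2 | 
  3 | ## Interview question 8
  4 | 
  5 | ### Requirement, data and table setup
  6 | 
  7 | - Requirement: write a Hive HQL statement that
  8 | 
  9 |   1. finds all numbers that appear at least **three times in a row** (number repeats on three consecutive rows).
 10 | 
 11 |   - The number appears three or more times.
 12 |   - The occurrences must be consecutive (three scattered occurrences do not count).
 13 | 
 14 | - Data: (id, number)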
 15 | 
 16 |   ```
 17 |   1,1
 18 |   2,1
 19 |   3,1
 20 |   4,2
 21 |   5,1
 22 |   6,2
 23 |   7,2
 24 |   8,3
 25 |   9,3
 26 |   10,3
 27 |   11,3
 28 |   12,4
 29 |   ```
 30 |   
 31 | - Create the table and load the data
 32 | 
 33 |   ```sql
 34 |   -- Create the table with a comma (,) as the field delimiter
 35 |   create table if not exists id_number(id int, number int) row format delimited fields terminated by ",";
 36 |   
 37 |   -- Put the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/id_number_data.txt)
 38 |   
 39 |   -- Load the data into the table
 40 |   load data local inpath "/root/yber/data/id_number_data.txt" into table id_number;
 41 |   ```
 42 | 
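 43 | ### Approach and implementation
 44 | 
 45 | - Approach
 46 | 
 47 |   	1. To find three consecutive occurrences, put the current row and the two rows after it (three rows in total) on one line. If the three values on that line are equal, the value appears three times in a row.
 48 | 
 49 |   	2. A number that repeats more than three times in a row produces several matching lines, so use distinct to avoid reporting it more than once. (The queries below select `a.id`, the first row of each run; select `distinct a.number` instead to return the numbers themselves.)
 50 | 
 51 | - Implementation
 52 | 
 53 |   #### Method 1: Cartesian product
 54 | 
 55 |   Note:
 56 | 
 57 |   ```sql
 58 |   -- Some configurations do not allow Cartesian products; they have to be enabled first
 59 |   -- To allow the Cartesian product: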
 60 |   set hive.mapred.mode = nonstrict;
 61 |   ```
 62 | 
 63 |   Statements:
 64 | 
 65 |   ```sql
 66 |   -- select * result (result 1)
 69 |   select * 
 70 |   from id_number a ,id_number b,id_number c  -- Cartesian product of three copies of the table
 71 |   where 
 72 |   a.id = b.id-1 and b.id = c.id-1  -- restrict the product: on each line the ids of a, b and c increase by 1 (result 1)
 73 |   ;
 74 |   
 75 |   -- Return only the id, deduplicated (result 2)
 78 |   select distinct a.id 
 79 |   from id_number a ,id_number b,id_number c -- Cartesian product of three copies of the table
 80 |   where 
 81 |   a.id = b.id-1 and b.id = c.id-1   -- restrict the product: on each line the ids of a, b and c increase by 1
 82 |   and a.number = b.number and b.number = c.number; -- additionally keep only lines whose three numbers are equal (result 2)
 83 |   ```
 84 |   
 85 |   ```
 86 |   select * result (result 1)
 87 |   a.id	a.number	b.id	b.number	c.id	c.number
 88 |   1		1			2		1			3		1
 89 |   2		1			3		1			4		2
 90 |   3		1			4		2			5		1
 91 |   4		2			5		1			6		2
 92 |   5		1			6		2			7		2
 93 |   6		2			7		2			8		3
 94 |   7		2			8		3			9		3
 95 |   8		3			9		3			10		3
 96 |   9		3			10		3			11		3
 97 |   10		3			11		3			12		4
 98 |   ```
 99 |   
100 |   ```
101 |   Ids only, deduplicated (result 2)
102 |   a.id
103 |   1
104 |   8
105 |   9
106 |   ```
107 |   
108 |   #### Method 2: explicit join (clearer, recommended)
109 |   
110 |   Statements:
111 |   
112 |   ```sql
113 |   -- Put the join conditions in the ON clauses of the explicit joins.
116 |   select *
117 |   FROM 
118 |   id_number a join id_number b on a.id = b.id-1
119 |   join id_number c on b.id = c.id-1;
120 |   
121 |   
122 |   -- Use where only for filtering (the join conditions stay in the ON clauses).
125 |   select distinct a.id -- deduplicate the resulting ids
126 |   FROM 
127 |   id_number a join id_number b on a.id = b.id-1 join id_number c on b.id = c.id-1 -- join so that the ids of a, b and c increase by 1 on each line
128 |   where a.number = b.number and a.number = c.number; -- keep only lines whose three numbers are equal
129 |   ```
130 |   
131 |   Query result:
132 |   
133 |   ```
134 |   a.id
135 |   1
136 |   8
137 |   9
138 |   ```
139 |   
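140 |   #### Method 3: window functions lag and lead
141 |   
142 |   Approach:
143 |   
144 |   	First use lag and lead to get, for each row, the `previous number`, the `number` and the `next number` (result 1).
145 |   
146 |   	Then keep the rows where the three numbers are equal.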
147 |   
148 |   ```sql
149 |   select 
150 |   id,
151 |   lag(number,1) over(order by id) as before_number,
152 |   number,
153 |   lead(number,1) over(order by id) as after_number
154 |   FROM 
155 |   id_number ;
156 |   
157 |   select * from (
158 |     select 
159 |     id,
160 |     lag(number,1) over(order by id) as before_number,
161 |     number,
162 |     lead(number,1) over(order by id) as after_number
163 |     FROM 
164 |     id_number
165 |   ) a
166 |   where before_number = number and number = after_number ;
167 |   ```
168 |   
169 |   Result 1:
170 |   
171 |   ```
172 |   id	before_number	number	after_number
173 |   1	NULL			1		1
174 |   2	1				1		1
175 |   3	1				1		2
176 |   4	1				2		1
177 |   5	2				1		2
178 |   6	1				2		2
179 |   7	2				2		3
180 |   8	2				3		3
181 |   9	3				3		3
182 |   10	3				3		3
183 |   11	3				3		4
184 |   12	3				4		NULL
185 |   ```
186 |   
187 |   Result 2:
188 |   
189 |   ```
190 |   a.id	a.before_number	a.number	a.after_number
191 |   2		1				1			1
192 |   9		3				3			3
193 |   10		3				3			3
194 |   ```
195 |   
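196 |   **Q&A: why does method 3 return different ids from methods 1 and 2?**
197 |   
198 |   Method 3 uses lag and lead to fetch the previous and the next value of each row.
199 |   
200 |   Methods 1 and 2 use a self-join to fetch the two following values of each row.
201 |   
202 |   So the id returned by method 3 is the `middle row of the three consecutive numbers`, while methods 1 and 2 return the `first row of the three consecutive numbers`. Keep the difference in mind.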
203 | 
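The difference between the two families of methods can be reproduced with a plain-Python sketch (an emulation, not Hive code) over the twelve sample rows. It assumes the ids are contiguous, so "previous row" and "id - 1" mean the same thing.

```python
id_number = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 2),
             (7, 2), (8, 3), (9, 3), (10, 3), (11, 3), (12, 4)]
number_of = dict(id_number)   # id -> number lookup, plays the role of the join

# Methods 1 and 2: match rows id, id+1, id+2 -> the FIRST id of each run is reported.
starts = [i for i, n in id_number
          if number_of.get(i + 1) == n and number_of.get(i + 2) == n]

# Method 3: compare with the previous and next row -> the MIDDLE id is reported.
middles = [i for i, n in id_number
           if number_of.get(i - 1) == n and number_of.get(i + 1) == n]

# Either list leads to the same set of numbers, which is what the question asks for.
numbers = sorted({number_of[i] for i in starts})
print(starts, middles, numbers)
```

Both id lists point into the same runs, merely shifted by one row, so the set of qualifying numbers is identical.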
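204 | ## Interview question 9
205 | 
206 | ### Requirement, data and table setup
207 | 
208 | - Requirement: write a Hive HQL statement that
209 | 
210 |   1. **finds, for each student, the best course and its score, the worst course and its score, and the average score**
211 | 
212 | - Data: (name string, score map<string,int>)
213 | 
214 |   	Two columns: the student name `name` (type string) and the student's scores `score` (type map<string,int>).
215 | 
216 |   	In the score column, each key is a course name, such as yuwen (Chinese) or shuxue (math), and each value is the score for that course (0-100).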
217 | 
218 |   ```
219 |   huangbo	yuwen:80,shuxue:89,yingyu:95
220 |   xuzheng	yuwen:70,shuxue:65,yingyu:81
221 |   wangbaoqiang	yuwen:75,shuxue:100,yingyu:76
222 |   ```
223 | 
224 | - Create the table and load the data
225 | 
226 |   ```sql
227 |   -- Create the table
228 |   create table if not exists student_score(name string, score map<string,int>) 
229 |   row format delimited fields terminated by "\t" 
230 |   collection items terminated by "," 
231 |   Map keys terminated by ":"; 
232 |   
233 |   -- Load the data
234 |   load data local inpath "/opt/zyb/data/student_score_data.txt" into table student_score;
235 |   ```
236 | 
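237 | Explanation:
238 | 
239 | 	Because of the layout of the raw data, the second field is stored as a map<string,int>. The table definition therefore has to declare two extra delimiters: the one between the key-value pairs of the map, and the one between a key and its value.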
240 | 
241 | ```sql
242 | row format delimited fields terminated by "\t" 
243 | -- field delimiter
244 | collection items terminated by "," 
245 | -- delimiter between the key-value pairs of the map (extra)
246 | Map keys terminated by ":";
247 | -- delimiter between a key and its value inside the map (extra)
248 | ```
249 | 
250 | ### Approach and implementation
251 | 
252 | - Approach
253 | 
254 | ```
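255 | 1. Use the explode function to turn the map column into rows (column to row)
256 | 
257 | 2. Use the window function row_number() on the exploded data, partitioned by name and ordered by score descending (desc)
258 | 3. Use the window function row_number() on the exploded data, partitioned by name and ordered by score ascending (asc)
259 | 4. Use the window function avg() on the exploded data, partitioned by name, to get the average score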
260 | 
261 | ```
262 | 
263 | - Implementation
264 | 
265 |   - Column to row
266 | 
267 |     Statement:
268 | 
269 |     ```sql
270 |     select 
271 |     name,
272 |     courses.course,
273 |     courses.info
274 |     from 
275 |     student_score 
276 |     lateral view explode(score) courses as course,info -- note: exploding a map requires two column aliases
277 |     ```
278 |   
279 |     Result:
280 |   
281 |     | name         | course | info |
282 |     | ------------ | ------ | ---- |
283 |     | huangbo      | yuwen  | 80   |
284 |     | huangbo      | shuxue | 89   |
285 |     | huangbo      | yingyu | 95   |
286 |     | xuzheng      | yuwen  | 70   |
287 |     | xuzheng      | shuxue | 65   |
288 |     | xuzheng      | yingyu | 81   |
289 |     | wangbaoqiang | yuwen  | 75   |
290 |     | wangbaoqiang | shuxue | 100  |
291 |     | wangbaoqiang | yingyu | 76   |
292 | 
293 |   - Apply several window functions to the exploded result
294 |   
295 |     Statement:
296 |   
297 |     ```sql
298 |     select 
299 |     name,
300 |     courses.course,
301 |     courses.info,
302 |     row_number() over(partition by name order by courses.info desc) as info_rn_desc,
303 |     row_number() over(partition by name order by courses.info asc) as info_rn_asc,
304 |     avg(courses.info) over(partition by name) as info_avg
305 |     from 
306 |     student_score 
307 |     lateral view explode(score) courses as course,info
308 |     ```
309 |   
310 |     Result:
311 |   
312 |     | name         | course   | info    | info_rn_desc | info_rn_asc | info_avg          |
313 |     | ------------ | -------- | ------- | ------------ | ----------- | ----------------- |
314 |     | huangbo      | yingyu   | 95      | 1            | 3           | 88.0              |
315 |     | huangbo      | shuxue   | 89      | 2            | 2           | 88.0              |
316 |     | huangbo      | yuwen    | 80      | 3            | 1           | 88.0              |
317 |     | wangbaoqiang | shuxue   | 100     | 1            | 3           | 83.66666666666667 |
318 |     | wangbaoqiang | yingyu   | 76      | 2            | 2           | 83.66666666666667 |
319 |     | wangbaoqiang | yuwen    | 75      | 3            | 1           | 83.66666666666667 |
320 |     | xuzheng      | yingyu   | 81      | 1            | 3           | 72.0              |
321 |     | xuzheng      | yuwen    | 70      | 2            | 2           | 72.0              |
322 |     | xuzheng      | shuxue   | 65      | 3            | 1           | 72.0              |
323 | 
324 |   - Use the two row numbers to keep each student's best and worst course and score
325 | 
326 |     ```sql
327 |     select 
328 |     *
329 |     from (
330 |     select 
331 |     name,
332 |     courses.course,
333 |     courses.info,
334 |     row_number() over(partition by name order by courses.info desc) as info_rn_desc,
335 |     row_number() over(partition by name order by courses.info asc) as info_rn_asc,
336 |     avg(courses.info) over(partition by name) as info_avg
337 |     from 
338 |     student_score 
339 |     lateral view explode(score) courses as course,info
340 |     ) a
341 |     where info_rn_desc = 1 or info_rn_asc = 1 -- OR: keep both the best and the worst score
342 |     ```
343 |     
344 |     Result:
345 |     
346 |     | name         | course | info | desc | asc  | avg               |
347 |     | ------------ | ------ | ---- | ---- | ---- | ----------------- |
348 |     | huangbo      | yingyu | 95   | 1    | 3    | 88.0              |
349 |     | huangbo      | yuwen  | 80   | 3    | 1    | 88.0              |
350 |     | wangbaoqiang | shuxue | 100  | 1    | 3    | 83.66666666666667 |
351 |     | wangbaoqiang | yuwen  | 75   | 3    | 1    | 83.66666666666667 |
352 |     | xuzheng      | yingyu | 81   | 1    | 3    | 72.0              |
353 |     | xuzheng      | shuxue | 65   | 3    | 1    | 72.0              |
354 |     
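355 |   - Combine case..when, the concat() string function and max + group by to get the final result
356 |   
357 |     Note: to put the best and the worst score on one line, this step uses the **row-to-column** technique. If it is unclear, review the row-to-column chapter.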
358 |   
359 |     ```sql
360 |     select 
361 |     a.name,
362 |     max(case when a.info_rn_desc=1 then concat(a.course,'-',a.info) else '0' end) as max_info,
363 |     max(case when a.info_rn_asc=1 then concat(a.course,'-',a.info) else '0' end) as min_info,
364 |     max(info_avg)
365 |     from (
366 |     select 
367 |     name,
368 |     courses.course,
369 |     courses.info,
370 |     row_number() over(partition by name order by courses.info desc) as info_rn_desc,
371 |     row_number() over(partition by name order by courses.info asc) as info_rn_asc,
372 |     avg(courses.info) over(partition by name) as info_avg
373 |     from 
374 |     student_score 
375 |     lateral view explode(score) courses as course,info
376 |     ) a
377 |     where info_rn_desc = 1 or info_rn_asc = 1
378 |     group by a.name
379 |     ```
380 |   
381 |     Result:
382 |   
383 |     | name         | max_info   | min_info  | avg               |
384 |     | ------------ | ---------- | --------- | ----------------- |
385 |     | huangbo      | yingyu-95  | yuwen-80  | 88.0              |
386 |     | wangbaoqiang | shuxue-100 | yuwen-75  | 83.66666666666667 |
387 |     | xuzheng      | yingyu-81  | shuxue-65 | 72.0              |
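As a cross-check of the final table, the same answer can be derived directly in plain Python (an emulation, not Hive code). The dictionary below is the sample data with each map already parsed; ties between equal scores are not handled specially, just as `row_number()` would pick one of them arbitrarily.

```python
student_score = {
    "huangbo": {"yuwen": 80, "shuxue": 89, "yingyu": 95},
    "xuzheng": {"yuwen": 70, "shuxue": 65, "yingyu": 81},
    "wangbaoqiang": {"yuwen": 75, "shuxue": 100, "yingyu": 76},
}

result = {}
for name, scores in student_score.items():
    best = max(scores, key=scores.get)     # row with info_rn_desc = 1
    worst = min(scores, key=scores.get)    # row with info_rn_asc = 1
    avg = sum(scores.values()) / len(scores)
    # concat(course, '-', info), as in the final query
    result[name] = (f"{best}-{scores[best]}", f"{worst}-{scores[worst]}", avg)

for name, row in sorted(result.items()):
    print(name, *row)
```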


--------------------------------------------------------------------------------
/interview-question/行转列.md:
--------------------------------------------------------------------------------
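  1 | # Row/column conversion
  2 | 
  3 | ## Concepts
  4 | 
  5 | ### Row to column
  6 | 
  7 | 	Convert <u>multiple rows of values that share the same key (left side)</u> **into** <u>a single row for that key with multiple columns (right side)</u>, so that in each row one key corresponds to several values.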
  8 | 
  9 | ![](../png/行转列.png)
 10 | 
 11 | ## Functions
 12 | 
 13 | ### case···when···then···else···end statement
 14 | 
 15 | [An article worth reading on this topic.](https://blog.csdn.net/konglongaa/article/details/80250253)
 16 | 
 17 | 
 18 | 
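 19 | ### collect_set(), collect_list()
 20 | 
 21 | | Function             | Purpose                                               |
 22 | | -------------------- | ----------------------------------------------------- |
 23 | | collect_set(column)  | Collects all values of the column, without duplicates |
 24 | | collect_list(column) | Collects all values of the column, keeping duplicates |
 25 | 
 26 | [Both turn a column within a group into an array; the difference is that collect_list keeps duplicates while collect_set removes them.](https://www.cnblogs.com/cc11001100/p/9043946.html)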
 27 | 
 28 | 
 29 | 
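 30 | ### array_contains(), if()
 31 | 
 32 | | Function                                   | Purpose                                                                            |
 33 | | ------------------------------------------ | ---------------------------------------------------------------------------------- |
 34 | | array_contains(array, value)               | Returns true if the array contains the value, otherwise false                      |
 35 | | if(boolean, value_if_true, value_if_false) | Returns value_if_true when the condition is true, otherwise returns value_if_false |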
 36 | 
 37 | 
 38 | 
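 39 | ## Interview question 4
 40 | 
 41 | ### Requirement, data and table setup
 42 | 
 43 | - Requirement: write a Hive HQL statement that finds **the student ids (sid) of all students whose flink score is higher than their spark score**
 44 | 
 45 | - Data: (row id `id`, student id `sid`, course `course`, score `score`)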
 46 | 
 47 |   ```
 48 |   2,1,flink,55
 49 |   3,2,spark,77
 50 |   4,2,flink,88
 51 |   5,3,spark,98
 52 |   6,3,flink,65
 53 |   7,3,hadoop,80
 54 |   ```
 55 |   
 56 | - Create the table and load the data
 57 | 
 58 |   ```sql
 59 |   -- Create the table with a comma (,) as the field delimiter
 60 |   create table student_score(id int,sid int,course string,score int) row format delimited fields terminated by ",";
 61 |   
 62 |   -- Put the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/student_score_data.txt)
 63 |   
 64 |   -- Load the data into the table
 65 |   load data local inpath "/root/yber/data/student_score_data.txt" into table student_score;
 66 |   ```
 67 | 
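 68 | ### Approach and implementation
 69 | 
 70 | - Approach
 71 | 
 72 | **The raw data** has one row per student id and course score, i.e. one student's courses and scores are spread over several rows.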
 73 | 
 74 | ```
 75 | id	sid	course	score
 76 | 1	1	spark	43
 77 | 2	1	flink	55
 78 | 3	2	spark	77
 79 | 4	2	flink	88
 80 | 5	3	spark	98
 81 | 6	3	flink	65
 82 | 7	3	hadoop	80
 83 | ```
 84 | 
 85 | If we can bring <u>all of a student's course scores onto one row</u> (i.e. **row to column**):
 86 | 
 87 | ```
 88 | sid	spark	flink	hadoop
 89 | 1	43		55		0
 90 | 2	77		88		0
 91 | 3	98		65		80
 92 | ```
 93 | 
 94 | then the answer is a simple filter on the pivoted rows:
 95 | 
 96 | ```sql
 97 | select sid from student_score where spark<flink;
 98 | ```
 99 | 
100 | - Implementation
101 | 
102 |   - case when then else end statement
103 | 
104 |     1. Use case···when to list each student's course scores as columns of one row
105 | 
106 |        ```sql
107 |        
108 |        select 
109 |        sid,
110 |        -- When course is spark, the new column spark takes the value of score; otherwise it is 0
111 |        case when course="spark" then score else 0 end as spark,
112 |        case when course="flink" then score else 0 end as flink,
113 |        case when course="hadoop" then score else 0 end as hadoop
114 |        from 
115 |        student_score ;
116 |        ```
117 | 
118 |        ![](../png/case_when1.png)
119 | 
120 |     2. Then group by sid and aggregate the per-course columns produced by the case expressions with max (compared with step 1, this adds group by + max).
121 | 
122 |        ```sql
123 |        select 
124 |        sid,
125 |        max(case when course="spark" then score else 0 end) as spark,
126 |        max(case when course="flink" then score else 0 end) as flink,
127 |        max(case when course="hadoop" then score else 0 end) as hadoop
128 |        from 
129 |        student_score 
130 |        group by sid;
131 |        ```
132 | 
133 |        ![](../png/case_when2.png)
134 | 
135 |     3. Finally, on top of this query, keep the rows where spark is lower than flink.
136 | 
137 |        ```sql
138 |        select a.sid from (
139 |        select 
140 |        sid,
141 |        max(case when course="spark" then score else 0 end) as spark,
142 |        max(case when course="flink" then score else 0 end) as flink,
143 |        max(case when course="hadoop" then score else 0 end) as hadoop
144 |        from 
145 |        student_score 
146 |        group by sid
147 |        ) a
148 |        where a.flink > a.spark
149 |        ;
150 |        ```
151 |        
152 |        Result:
153 |        
154 |        ```
155 |        1
156 |        2
157 |        ```
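The `max(case when ...)` pivot can be mirrored in plain Python (an emulation, not Hive code). The rows are the seven shown in the raw-data table of the Approach section; the default score of 0 plays the role of `else 0`.

```python
from collections import defaultdict

# (id, sid, course, score), copied from the raw-data table in the Approach section.
student_score = [
    (1, 1, "spark", 43), (2, 1, "flink", 55),
    (3, 2, "spark", 77), (4, 2, "flink", 88),
    (5, 3, "spark", 98), (6, 3, "flink", 65), (7, 3, "hadoop", 80),
]

# One row per sid with a column per course; 0 when the course is missing ("else 0").
pivot = defaultdict(lambda: {"spark": 0, "flink": 0, "hadoop": 0})
for _id, sid, course, score in student_score:
    pivot[sid][course] = max(pivot[sid][course], score)   # max(case when ...)

# where a.flink > a.spark
result = sorted(sid for sid, row in pivot.items() if row["flink"] > row["spark"])
print(dict(pivot), result)
```

Note that the `else 0` default means a student with no spark row at all would also pass the filter as long as the flink score is positive; whether that is desired depends on the requirement.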
158 | 
159 | 
160 | 
161 | 
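162 | ## Interview question 5
163 | 
164 | ### Requirement, data and table setup
165 | 
166 | - Requirement:
167 | 
168 |   	Students with ids 1, 2 and 3 each took some of the courses a, b, c, d, e, f.
169 | 
170 |   	**Write a Hive HQL statement that produces the result below, where Yes means the course was taken and No means it was not**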
171 | 
172 |   ![](../png/id_course1.png)
173 | 
174 | - Data: (id, course)
175 | 
176 |   ```
177 |   1,a
178 |   1,b
179 |   1,c
180 |   1,e
181 |   2,a
182 |   2,c
183 |   2,d
184 |   2,f
185 |   3,a
186 |   3,b
187 |   3,c
188 |   3,e
189 |   ```
190 | 
191 | - Create the table and load the data
192 | 
193 |   ```sql
194 |   -- Create the table with a comma (,) as the field delimiter
195 |   create table if not exists id_course(id int, course string) row format delimited fields terminated by ",";
196 |   
197 |   -- Put the data file on the server's local file system or on HDFS. Here it is on the local file system (/root/yber/data/id_course_data.txt)
198 |   
199 |   -- Load the data into the table
200 |   load data local inpath "/root/yber/data/id_course_data.txt" into table id_course;
201 |   ```
202 | 
203 | ### Implementation
204 | 
205 | - Step 1: list <u>all courses</u>
206 | 
207 |   **See the top of this page for how collect_set works.**
212 | 
213 |   ```sql
214 |   -- Use collect_set to get the distinct list of all courses.
215 |   select collect_set(course) as courses from id_course;
216 |   ```
217 | 
218 |   Result:
219 | 
220 |   ```
221 |   courses
222 |   ["a","b","c","e","d","f"]
223 |   ```
224 | 
225 | - Step 2: list <u>the courses taken by each id</u>
226 | 
227 |   ```sql
228 |   -- Get the courses chosen by each id
229 |   select id,collect_set(course) as user_course from id_course group by id;
230 |   ```
231 | 
232 |   Result:
233 | 
234 |   ```
235 |   id	user_course
236 |   1	["a","b","c","e"]
237 |   2	["a","c","d","f"]
238 |   3	["a","b","c","e"]
239 |   ```
240 | 
241 | - Step 3: combine the results of the first two steps (join)
242 | 
243 |   ```sql
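244 |   -- Settings
245 |   -- Strict mode rejects Cartesian products; setting this property to false allows them.
246 |   set hive.strict.checks.cartesian.product=false;
247 |   -- Enable automatic local mode (runs faster when learning and testing).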
248 |   set hive.exec.mode.local.auto=true;
249 |   ```
250 | 
251 |   ```sql
252 |   select t1.id,t1.user_course,t2.courses from 
253 |   (select id,collect_set(course) as user_course from id_course group by id) t1
254 |   join 
255 |   (select collect_set(course) as courses from id_course) t2
256 |   ```
257 | 
258 |   Result:
259 | 
260 |   ```
261 |   t1.id	t1.user_course		t2.courses
262 |   1		["a","b","c","e"]	["a","b","c","e","d","f"]
263 |   2		["a","c","d","f"]	["a","b","c","e","d","f"]
264 |   3		["a","b","c","e"]	["a","b","c","e","d","f"]
265 |   ```
266 | 
267 | - Step 4: produce the result: take each element of courses and check whether it exists in user_course.
268 | 
269 |   **See the top of this page for how array_contains works.**
274 | 
275 |   ```sql
276 |   -- This step already shows which courses each student takes.
277 |   select 
278 |   aa.id as id,
279 |   -- Check every course (indexes 0-5, i.e. courses a to f) against the student's courses. (Mind the argument order: array first, then value.)
280 |   ARRAY_CONTAINS(aa.user_course,aa.courses[0]) as a ,
281 |   ARRAY_CONTAINS(aa.user_course,aa.courses[1]) as b ,
282 |   ARRAY_CONTAINS(aa.user_course,aa.courses[2]) as c ,
283 |   ARRAY_CONTAINS(aa.user_course,aa.courses[3]) as d ,
284 |   ARRAY_CONTAINS(aa.user_course,aa.courses[4]) as e ,
285 |   ARRAY_CONTAINS(aa.user_course,aa.courses[5]) as f 
286 |   from ( 
287 |   select 
288 |   t1.id as id,
289 |   t1.user_course as user_course ,
290 |   t2.courses as courses
291 |   from 
292 |   (select id,collect_set(course) as user_course from id_course group by id) t1
293 |   join 
294 |   (select sort_array(collect_set(course)) as courses from id_course) t2 -- collect_set does not guarantee order (step 1 returned a,b,c,e,d,f), so sort to make indexes 0-5 map to a-f
295 |   ) aa
296 |   ```
297 | 
298 |   Result:
299 | 
300 |   ```
301 |   id	a	b	c	d	e	f
302 |   1	true	true	true	false	true	false
303 |   2	true	false	true	true	false	true
304 |   3	true	true	true	false	true	false
305 |   ```
306 | 
307 | - Finally, use the if function to format the result as the question requires.
308 | 
309 |   **See the top of this page for how if works.**
314 | 
315 |   ```sql
316 |   select 
317 |   aa.id as id,
318 |   -- The only difference from the previous step is the if function, which normalizes the output.
319 |   -- array_contains returns true or false;
320 |   -- the if function maps that result to Yes or No.
321 |   if(ARRAY_CONTAINS(aa.user_course,aa.courses[0]),"Yes","No") as a ,
322 |   if(ARRAY_CONTAINS(aa.user_course,aa.courses[1]),"Yes","No") as b ,
323 |   if(ARRAY_CONTAINS(aa.user_course,aa.courses[2]),"Yes","No") as c ,
324 |   if(ARRAY_CONTAINS(aa.user_course,aa.courses[3]),"Yes","No") as d ,
325 |   if(ARRAY_CONTAINS(aa.user_course,aa.courses[4]),"Yes","No") as e ,
326 |   if(ARRAY_CONTAINS(aa.user_course,aa.courses[5]),"Yes","No") as f 
327 |   from ( 
328 |   select 
329 |   t1.id as id,
330 |   t1.user_course as user_course ,
331 |   t2.courses as courses
332 |   from 
333 |   (select id,collect_set(course) as user_course from id_course group by id) t1
334 |   join 
335 |   (select sort_array(collect_set(course)) as courses from id_course) t2
336 |   ) aa
336 |   ) aa
337 |   ```
338 | 
339 |   Result:
340 | 
341 |   ```
342 |   id	a	b	c	d	e	f
343 |   1	Yes	Yes	Yes	Yes	No	No
344 |   2	Yes	No	Yes	No	Yes	Yes
345 |   3	Yes	Yes	Yes	Yes	No	No
346 |   ```
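
The argument order warned about in Step 4 is easy to check in isolation: the array to search comes first, the value to look for comes second. The sketch below uses literal arrays as hypothetical stand-ins for `user_course` and for one element of `courses`, so it should run without the `id_course` table (assuming a Hive version that allows `select` without `from`).

```sql
-- array_contains(array, value): the array is the first argument, the value the second.
-- The literal arrays are hypothetical stand-ins for user_course.
select
  -- 'a' is in the array, so this evaluates to true
  array_contains(array('a', 'b', 'c', 'd'), 'a') as has_a,
  -- 'e' is not in the array, so this evaluates to false
  array_contains(array('a', 'b', 'c', 'd'), 'e') as has_e,
  -- if() maps the boolean to the label used in the final query
  if(array_contains(array('a', 'b', 'c', 'd'), 'e'), 'Yes', 'No') as e_label;
```

Swapping the two arguments passes a string where an array is expected, which Hive should reject at compile time rather than silently return a wrong answer.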
347 | 
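When the course names are known up front (the comments in the queries above assume courses a through f), each name can be tested directly, which removes both the join and the dependence on array indexes. This is only a sketch under that assumption, not the approach the steps above take. The `demo_course` CTE and its rows are hypothetical sample data standing in for `id_course`, and only three of the six courses are shown.

```sql
-- Hypothetical sample rows; in practice, read from the real id_course table instead.
with demo_course as (
  select 1 as id, 'a' as course
  union all
  select 1 as id, 'b' as course
  union all
  select 2 as id, 'a' as course
  union all
  select 2 as id, 'e' as course
)
select
  t.id as id,
  -- Test each known course name against the student's own course set.
  if(array_contains(t.user_course, 'a'), 'Yes', 'No') as a,
  if(array_contains(t.user_course, 'b'), 'Yes', 'No') as b,
  if(array_contains(t.user_course, 'e'), 'Yes', 'No') as e
from (
  -- One row per student, with the distinct courses collected into an array.
  select id, collect_set(course) as user_course
  from demo_course
  group by id
) t;
```

With these sample rows, student 1 comes out as Yes, Yes, No and student 2 as Yes, No, Yes; the order of the two rows is not guaranteed. The trade-off is that the column list has to be edited by hand whenever a course is added, whereas the join-based version picks up the course list from the data.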
348 |   
349 | 
350 | 
351 | 
352 | 


--------------------------------------------------------------------------------
/png/case_when1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/case_when1.png


--------------------------------------------------------------------------------
/png/case_when2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/case_when2.png


--------------------------------------------------------------------------------
/png/explodeAndSplit.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/explodeAndSplit.png


--------------------------------------------------------------------------------
/png/id_course1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/id_course1.png


--------------------------------------------------------------------------------
/png/lateral_view.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/lateral_view.png


--------------------------------------------------------------------------------
/png/login_log.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/login_log.png


--------------------------------------------------------------------------------
/png/row_number.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/row_number.png


--------------------------------------------------------------------------------
/png/row_number2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/row_number2.png


--------------------------------------------------------------------------------
/png/shop1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/shop1.png


--------------------------------------------------------------------------------
/png/shop2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/shop2.png


--------------------------------------------------------------------------------
/png/split.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/split.png


--------------------------------------------------------------------------------
/png/sum.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/sum.png


--------------------------------------------------------------------------------
/png/use_rank.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/use_rank.png


--------------------------------------------------------------------------------
/png/user_click1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/user_click1.png


--------------------------------------------------------------------------------
/png/user_click_temp1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/user_click_temp1.png


--------------------------------------------------------------------------------
/png/user_click_view1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/user_click_view1.png


--------------------------------------------------------------------------------
/png/user_click_view2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/user_click_view2.png


--------------------------------------------------------------------------------
/png/列转行.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/列转行.png


--------------------------------------------------------------------------------
/png/行转列.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/行转列.png


--------------------------------------------------------------------------------
/png/面试题2_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/面试题2_2.png


--------------------------------------------------------------------------------
/png/面试题2_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/面试题2_3.png


--------------------------------------------------------------------------------
/png/面试题2_4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/面试题2_4.png


--------------------------------------------------------------------------------
/png/面试题2_5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/面试题2_5.png


--------------------------------------------------------------------------------
/png/面试题4_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/面试题4_2.png


--------------------------------------------------------------------------------
/png/面试题5_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/面试题5_1.png


--------------------------------------------------------------------------------
/png/面试题5_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zzyb/Hive_interview-question/6e0ef41eb17c17da8285c2457cb5b3b723f9bee4/png/面试题5_2.png


--------------------------------------------------------------------------------