├── .idea
│   ├── compiler.xml
│   ├── encodings.xml
│   ├── libraries
│   │   └── R_User_Library.xml
│   ├── misc.xml
│   └── sbt.xml
├── FlinkWEB.png
├── README.md
├── flink-log-analysis.iml
├── jar包.png
├── pom.xml
├── src
│   ├── main
│   │   ├── java
│   │   │   └── com
│   │   │       └── jmx
│   │   │           ├── analysis
│   │   │           │   ├── LogAnalysis.java
│   │   │           │   └── LogParse.java
│   │   │           └── bean
│   │   │               └── AccessLogRecord.java
│   │   └── resources
│   │       ├── access_log.txt
│   │       ├── log4j.properties
│   │       └── param.conf
│   └── test
│       └── java
│           └── TestLogparse.java
├── 帖子.png
├── 日志架构.png
├── 日志格式.png
├── 板块.png
├── 结果.png
└── 项目代码.png
/.idea/compiler.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/encodings.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/libraries/R_User_Library.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/sbt.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/FlinkWEB.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/FlinkWEB.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # flink-log-analysis
2 | ## Project Architecture
3 | This project builds a real-time user-behavior log analysis system from scratch on top of Flink. The overall architecture is shown below:
4 |
5 | 
6 |
7 | ## Code Structure
8 |
9 | 
10 |
11 | We first set up a forum platform and analyze the click logs produced by its users. Flume collects the Apache access logs and pushes them to Kafka; Flink then processes the logs in real time and writes the results to MySQL, where front-end applications can visualize them. Three metrics are computed:
12 |
13 | - Hot sections: the forum sections with the highest number of visits
14 | - Hot articles: the posts with the highest number of visits
15 | - Total number of visits to sections and articles from each client
16 |
17 | ## Building a Forum Platform with Discuz
18 |
19 | ### Install XAMPP
20 |
21 | - Download
22 |
23 | ```bash
24 | wget https://www.apachefriends.org/xampp-files/5.6.33/xampp-linux-x64-5.6.33-0-installer.run
25 | ```
26 |
27 | - Install
28 |
29 | ```bash
30 | # make the installer executable
31 | chmod u+x xampp-linux-x64-5.6.33-0-installer.run
32 | # run the installer
33 | ./xampp-linux-x64-5.6.33-0-installer.run
34 | ```
35 |
36 | - Configure environment variables
37 |
38 | Add the following to ~/.bash_profile:
39 |
40 | ```bash
41 | export XAMPP=/opt/lampp/
42 | export PATH=$PATH:$XAMPP:$XAMPP/bin
43 | ```
44 |
45 | - Reload the environment variables
46 |
47 | ```bash
48 | source ~/.bash_profile
49 | ```
50 |
51 | - Start XAMPP
52 |
53 | ```bash
54 | xampp restart
55 | ```
56 |
57 | - Change the MySQL root password and grant privileges
58 |
59 | ```bash
60 | # change the root password to 123qwe
61 | update mysql.user set password=PASSWORD('123qwe') where user='root';
62 | flush privileges;
63 | # allow the root user to log in remotely
64 | grant all privileges on *.* to 'root'@'%' identified by '123qwe' with grant option;
65 | flush privileges;
66 | ```
67 |
68 | ### Install Discuz
69 |
70 | - Download Discuz
71 |
72 | ```bash
73 | wget http://download.comsenz.com/DiscuzX/3.2/Discuz_X3.2_SC_UTF8.zip
74 | ```
75 |
76 | - Install
77 |
78 | ```bash
79 | # remove the existing web application
80 | rm -rf /opt/lampp/htdocs/*
81 | unzip Discuz_X3.2_SC_UTF8.zip -d /opt/lampp/htdocs/
82 | cd /opt/lampp/htdocs/
83 | mv upload/* ./
84 | # adjust directory permissions
85 | chmod 777 -R /opt/lampp/htdocs/config/
86 | chmod 777 -R /opt/lampp/htdocs/data/
87 | chmod 777 -R /opt/lampp/htdocs/uc_client/
88 | chmod 777 -R /opt/lampp/htdocs/uc_server/
89 | ```
90 |
91 | ### Basic Discuz Operations
92 |
93 | - Create custom sections
94 |   - Open the Discuz admin console: http://kms-4/admin.php
95 |   - Click the **Forum** menu at the top
96 |   - Follow the on-page prompts to create the sections you need; parent and child sections are supported
97 |
98 | 
99 |
100 | ### Discuz Tables for Posts and Sections
101 |
102 | ```sql
103 | -- log in to the ultrax database
104 | mysql -uroot -p123qwe ultrax
105 | -- table mapping post ids to titles
106 | -- tid, subject (article id, title)
107 | select tid, subject from pre_forum_post limit 10;
108 | -- fid, name (section id, section name)
109 | select fid, name from pre_forum_forum limit 40;
110 | ```
111 |
112 | After posts have been added to the various sections, the data looks like this:
113 |
114 | 
115 |
116 | ### Change the Log Format
117 |
118 | - View the access log
119 |
120 | ```bash
121 | # default log location
122 | /opt/lampp/logs/access_log
123 | # tail the log in real time
124 | tail -f /opt/lampp/logs/access_log
125 | ```
126 |
127 | - Change the log format
128 |
129 | The Apache configuration file is named httpd.conf and its full path is `/opt/lampp/etc/httpd.conf`. The default log type is **common**, which has only 7 fields. To capture more information, change the format to **combined**, which has 9 fields, as follows:
130 |
131 | ```bash
132 | # enable the combined log format
133 | CustomLog "logs/access_log" combined
134 | ```
135 |
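For reference, the **combined** format itself is defined in httpd.conf by the following `LogFormat` directive (it is usually already present in the default configuration); the `CustomLog` line above simply switches the access log to it:

```bash
# definition of the combined log format in httpd.conf
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
```
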
136 | 
137 |
138 | - Reload the configuration file
139 |
140 | ```bash
141 | xampp reload
142 | ```
143 |
144 | ### The Apache Log Format
145 |
146 | ```bash
147 | 192.168.10.1 - - [30/Aug/2020:15:53:15 +0800] "GET /forum.php?mod=forumdisplay&fid=43 HTTP/1.1" 200 30647 "http://kms-4/forum.php" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
148 | ```
149 |
150 | The log line above contains 9 fields separated by spaces. Their meanings are:
151 |
152 | ```bash
153 | 192.168.10.1 ## (1) client IP address
154 | - ## (2) client identity, "-" in this format
155 | - ## (3) client userid, "-" in this format
156 | [30/Aug/2020:15:53:15 +0800] ## (4) time at which the server finished processing the request
157 | "GET /forum.php?mod=forumdisplay&fid=43 HTTP/1.1" ## (5) request method, requested resource, protocol
158 | 200 ## (6) status code returned to the client; 200 means success
159 | 30647 ## (7) bytes returned to the client, excluding response headers; "-" if nothing was returned
160 | "http://kms-4/forum.php" ## (8) the Referer request header
161 | "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" ## (9) client browser information (User-Agent)
162 | ```
163 |
164 | This log format can be matched with the following regular expression:
165 |
166 | ```bash
167 | (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (\S+) (\S+) (\[.+?\]) (\"(.*?)\") (\d{3}) (\S+) (\"(.*?)\") (\"(.*?)\")
168 | ```
169 |
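As a quick sanity check, the expression can be tried out with `java.util.regex` against the sample line above. The sketch below is only an illustration (the class name `RegexCheck` is not part of the project); note that each quoted field contributes an extra nested capture group, so the status code ends up in group 7:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        // same expression as above, escaped for a Java string literal
        String regex = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) (\\S+) (\\S+) (\\[.+?\\]) "
                + "(\\\"(.*?)\\\") (\\d{3}) (\\S+) (\\\"(.*?)\\\") (\\\"(.*?)\\\")";
        String line = "192.168.10.1 - - [30/Aug/2020:15:53:15 +0800] "
                + "\"GET /forum.php?mod=forumdisplay&fid=43 HTTP/1.1\" 200 30647 "
                + "\"http://kms-4/forum.php\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64)\"";
        Matcher m = Pattern.compile(regex).matcher(line);
        if (m.find()) {
            System.out.println(m.group(1)); // 192.168.10.1 (client IP)
            System.out.println(m.group(6)); // GET /forum.php?mod=forumdisplay&fid=43 HTTP/1.1
            System.out.println(m.group(7)); // 200 (status code)
        }
    }
}
```
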
170 | ## Integrating Flume with Kafka
171 |
172 | Flume collects the Apache access logs and pushes them to Kafka. Start a Flume agent for the collection with the following configuration file:
173 |
174 | ```bash
175 | # the agent is named a1
176 | a1.sources = source1
177 | a1.channels = channel1
178 | a1.sinks = sink1
179 |
180 | # configure the source
181 | a1.sources.source1.type = TAILDIR
182 | a1.sources.source1.filegroups = f1
183 | a1.sources.source1.filegroups.f1 = /opt/lampp/logs/access_log
184 | a1.sources.source1.fileHeader = false
185 |
186 | # configure the sink
187 | a1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
188 | a1.sinks.sink1.brokerList=kms-2:9092,kms-3:9092,kms-4:9092
189 | a1.sinks.sink1.topic= user_access_logs
190 | a1.sinks.sink1.kafka.flumeBatchSize = 20
191 | a1.sinks.sink1.kafka.producer.acks = 1
192 | a1.sinks.sink1.kafka.producer.linger.ms = 1
193 | a1.sinks.sink1.kafka.producer.compression.type = snappy
194 |
195 | # configure the channel
196 | a1.channels.channel1.type = file
197 | a1.channels.channel1.checkpointDir = /home/kms/data/flume_data/checkpoint
198 | a1.channels.channel1.dataDirs= /home/kms/data/flume_data/data
199 |
200 | # bind the source and sink to the channel
201 | a1.sources.source1.channels = channel1
202 | a1.sinks.sink1.channel = channel1
203 |
204 | ```
205 |
206 | > Note:
207 | >
208 | > What advantages does **Taildir Source** have over **Exec Source** and **Spooling Directory Source**?
209 | >
210 | > **Taildir Source**: supports resuming from where it left off and multiple directories. Before Flume 1.6 you had to implement a custom source that recorded the read position in each file to get resumable reads.
211 | >
212 | > **Exec Source**: collects data in real time, but data is lost if Flume is not running or the shell command fails.
213 | >
214 | > **Spooling Directory Source**: monitors a directory, but does not support resuming.
215 |
216 | Note that the configuration above pushes the raw logs straight to Kafka. Alternatively, a custom Flume interceptor can pre-filter the raw logs and route different kinds of logs to different Kafka topics, as sketched below.
217 |
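As an illustration only (this class is not in the repository, and the package and class names are made up), such an interceptor could look roughly like the following. It keeps only requests that hit a section or an article page and drops everything else; it would be wired in with `a1.sources.source1.interceptors = i1` and `a1.sources.source1.interceptors.i1.type = com.jmx.flume.AccessLogFilterInterceptor$Builder`:

```java
package com.jmx.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: drop log lines that are neither section views nor article views,
 * so that only relevant events are pushed to Kafka.
 */
public class AccessLogFilterInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        String line = new String(event.getBody(), StandardCharsets.UTF_8);
        // keep only page views of a section (fid) or an article (tid)
        if (line.contains("mod=forumdisplay&fid=") || line.contains("mod=viewthread&tid=")) {
            return event;
        }
        return null; // returning null drops the event
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>();
        for (Event event : events) {
            Event kept = intercept(event);
            if (kept != null) {
                out.add(kept);
            }
        }
        return out;
    }

    @Override
    public void close() {
    }

    /** Builder used by Flume to instantiate the interceptor. */
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new AccessLogFilterInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
```
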
218 | ### Start the Flume Agent
219 |
220 | Wrap the command that starts the agent in a shell script, **start-log-collection.sh**, with the following content:
221 |
222 | ```shell
223 | #!/bin/bash
224 | echo "start log agent !!!"
225 | /opt/modules/apache-flume-1.9.0-bin/bin/flume-ng agent --conf-file /opt/modules/apache-flume-1.9.0-bin/conf/log_collection.conf --name a1 -Dflume.root.logger=INFO,console
226 | ```
227 |
228 | ### Inspect the Log Data Pushed to Kafka
229 |
230 | Wrap the console-consumer command in a shell script, **kafka-consumer.sh**, with the following content:
231 |
232 | ```bash
233 | #!/bin/bash
234 | echo "kafka consumer "
235 | bin/kafka-console-consumer.sh --bootstrap-server kms-2.apache.com:9092,kms-3.apache.com:9092,kms-4.apache.com:9092 --topic $1 --from-beginning
236 | ```
237 |
238 | Consume the data in Kafka with the following command:
239 |
240 | ```shell
241 | [kms@kms-2 kafka_2.11-2.1.0]$ ./kafka-consumer.sh user_access_logs
242 | ```
243 |
244 | ## Log Analysis and Processing Pipeline
245 |
246 | ### Create the MySQL Database and Target Tables
247 |
248 | ```sql
249 | -- per-client access statistics
250 | CREATE TABLE `client_ip_access` (
251 |   `client_ip` char(50) NOT NULL COMMENT 'client IP',
252 |   `client_access_cnt` bigint(20) NOT NULL COMMENT 'number of visits',
253 |   `statistic_time` text NOT NULL COMMENT 'statistics time',
254 |   PRIMARY KEY (`client_ip`)
255 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
256 | -- hot article statistics
257 | CREATE TABLE `hot_article` (
258 |   `article_id` int(10) NOT NULL COMMENT 'article id',
259 |   `subject` varchar(80) NOT NULL COMMENT 'article title',
260 |   `article_pv` bigint(20) NOT NULL COMMENT 'number of visits',
261 |   `statistic_time` text NOT NULL COMMENT 'statistics time',
262 |   PRIMARY KEY (`article_id`)
263 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
264 | -- hot section statistics
265 | CREATE TABLE `hot_section` (
266 |   `section_id` int(10) NOT NULL COMMENT 'section id',
267 |   `name` char(50) NOT NULL COMMENT 'section name',
268 |   `section_pv` bigint(20) NOT NULL COMMENT 'number of visits',
269 |   `statistic_time` text NOT NULL COMMENT 'statistics time',
270 |   PRIMARY KEY (`section_id`)
271 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
272 | ```
273 |
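These target tables live in a dedicated `statistics` database, whose name also appears in the JDBC URLs used later. If it does not exist yet, a minimal sketch to create it first (the utf8 charset is an assumption matching the tables above):

```sql
-- the database name matches the JDBC URLs (jdbc:mysql://kms-4:3306/statistics)
CREATE DATABASE IF NOT EXISTS statistics DEFAULT CHARACTER SET utf8;
USE statistics;
```
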
274 | ### The AccessLogRecord Class
275 |
276 | This class wraps the fields contained in one log line; there are 9 of them.
277 |
278 | ```java
279 | /**
280 |  * Uses Lombok @Data
281 |  * Wrapper class for a raw access-log record
282 |  */
283 | @Data
284 | public class AccessLogRecord {
285 |     public String clientIpAddress;   // client IP address
286 |     public String clientIdentity;    // client identity, "-" in this log format
287 |     public String remoteUser;        // authenticated user, "-" in this log format
288 |     public String dateTime;          // date, formatted as [day/month/year:hour:minute:second zone]
289 |     public String request;           // request line, e.g. `GET /foo ...`
290 |     public String httpStatusCode;    // status code, e.g. 200, 404
291 |     public String bytesSent;         // bytes sent, may be `-`
292 |     public String referer;           // referer, i.e. the page the request came from
293 |     public String userAgent;         // browser and operating system
294 | }
295 | ```
296 |
297 | ### The LogParse Class
298 |
299 | This is the log-parsing class: it matches each log line against the regular expression and, for matching lines, extracts the individual fields.
300 |
301 | ```java
302 | public class LogParse implements Serializable {
303 |
304 | // regular expression for one combined log line; each quoted field uses a single capture group so that groups 1-9 map directly onto the record fields below
305 | private String regex = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) (\\S+) (\\S+) (\\[.+?\\]) (\\\".*?\\\") (\\d{3}) (\\S+) (\\\".*?\\\") (\\\".*?\\\")";
306 | private Pattern p = Pattern.compile(regex);
307 |
308 | /*
309 | * Build the AccessLogRecord wrapper from a successful match
310 | * */
311 | public AccessLogRecord buildAccessLogRecord(Matcher matcher) {
312 | AccessLogRecord record = new AccessLogRecord();
313 | record.setClientIpAddress(matcher.group(1));
314 | record.setClientIdentity(matcher.group(2));
315 | record.setRemoteUser(matcher.group(3));
316 | record.setDateTime(matcher.group(4));
317 | record.setRequest(matcher.group(5));
318 | record.setHttpStatusCode(matcher.group(6));
319 | record.setBytesSent(matcher.group(7));
320 | record.setReferer(matcher.group(8));
321 | record.setUserAgent(matcher.group(9));
322 | return record;
323 |
324 | }
325 |
326 | /**
327 |  * @param record a single line in Apache combined log format
328 |  * @return the parsed log line wrapped in an AccessLogRecord, or null if the line does not match
329 |  */
330 | public AccessLogRecord parseRecord(String record) {
331 | Matcher matcher = p.matcher(record);
332 | if (matcher.find()) {
333 | return buildAccessLogRecord(matcher);
334 | }
335 | return null;
336 | }
337 |
338 | /**
339 |  * @param request the request line as a string, e.g. "GET /the-uri-here HTTP/1.1"
340 |  * @return a Tuple3 (requestType, uri, httpVersion); requestType is the HTTP method, e.g. GET or POST
341 |  */
342 | public Tuple3<String, String, String> parseRequestField(String request) {
343 | // the request line looks like "GET /test.php HTTP/1.1"; split it on spaces
344 | String[] arr = request.split(" ");
345 | if (arr.length == 3) {
346 | return Tuple3.of(arr[0], arr[1], arr[2]);
347 | } else {
348 | return null;
349 | }
350 | }
351 |
352 | /**
353 |  * Convert the English date in an Apache log into the yyyy-MM-dd HH:mm:ss format
354 |  *
355 |  * @param dateTime the date string from the Apache log, e.g. "[21/Jul/2009:02:48:13 -0700]"
356 |  * @return the formatted date, or "" if it cannot be parsed
357 |  */
358 | public String parseDateField(String dateTime) throws ParseException {
359 | // input date format (English month names)
360 | String inputFormat = "dd/MMM/yyyy:HH:mm:ss";
361 | // output date format
362 | String outPutFormat = "yyyy-MM-dd HH:mm:ss";
363 |
364 | String dateRegex = "\\[(.*?) .+]";
365 | Pattern datePattern = Pattern.compile(dateRegex);
366 |
367 | Matcher dateMatcher = datePattern.matcher(dateTime);
368 | if (dateMatcher.find()) {
369 | String dateString = dateMatcher.group(1);
370 | SimpleDateFormat dateInputFormat = new SimpleDateFormat(inputFormat, Locale.ENGLISH);
371 | Date date = dateInputFormat.parse(dateString);
372 |
373 | SimpleDateFormat dateOutFormat = new SimpleDateFormat(outPutFormat);
374 |
375 | String formatDate = dateOutFormat.format(date);
376 | return formatDate;
377 | } else {
378 | return "";
379 | }
380 | }
381 |
382 | /**
383 |  * Parse the request line, i.e. the URL of the visited page, e.g.
384 |  * "GET /about/forum.php?mod=viewthread&tid=5&extra=page%3D1 HTTP/1.1"
385 |  * and extract fid (the section id)
386 |  * and tid (the article id)
387 |  * @param request the request line
388 |  * @return a Tuple2 (sectionId, articleId); empty strings when not present
389 |  */
390 | public Tuple2<String, String> parseSectionIdAndArticleId(String request) {
391 | // the digits following "forumdisplay&fid=" are the section id
392 | String sectionIdRegex = "(\\?mod=forumdisplay&fid=)(\\d+)";
393 | Pattern sectionPattern = Pattern.compile(sectionIdRegex);
394 | // the digits following "tid=" are the article id
395 | String articleIdRegex = "(\\?mod=viewthread&tid=)(\\d+)";
396 | Pattern articlePattern = Pattern.compile(articleIdRegex);
397 |
398 | String[] arr = request.split(" ");
399 | String sectionId = "";
400 | String articleId = "";
401 | if (arr.length == 3) {
402 | Matcher sectionMatcher = sectionPattern.matcher(arr[1]);
403 | Matcher articleMatcher = articlePattern.matcher(arr[1]);
404 | sectionId = (sectionMatcher.find()) ? sectionMatcher.group(2) : "";
405 | articleId = (articleMatcher.find()) ? articleMatcher.group(2) : "";
406 | }
407 | return Tuple2.of(sectionId, articleId);
408 | }
409 | }
410 | ```
411 |
412 | ### The LogAnalysis Class
413 |
414 | This class contains the main log-processing logic.
415 |
416 | ```java
417 | public class LogAnalysis {
418 |
419 | public static void main(String[] args) throws Exception {
420 |
421 | StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
422 | // enable checkpointing every 5000 ms
423 | senv.enableCheckpointing(5000L);
424 | // choose a state backend
425 | // local testing:
426 | // senv.setStateBackend(new FsStateBackend("file:///E://checkpoint"));
427 | // cluster:
428 | senv.setStateBackend(new FsStateBackend("hdfs://kms-1:8020/flink-checkpoints"));
429 | // restart strategy
430 | senv.setRestartStrategy(
431 | RestartStrategies.fixedDelayRestart(3, Time.of(2, TimeUnit.SECONDS) ));
432 |
433 | EnvironmentSettings settings = EnvironmentSettings.newInstance()
434 | .useBlinkPlanner()
435 | .inStreamingMode()
436 | .build();
437 | StreamTableEnvironment tEnv = StreamTableEnvironment.create(senv, settings);
438 | // Kafka configuration
439 | Properties props = new Properties();
440 | // Kafka broker addresses
441 | props.put("bootstrap.servers", "kms-2:9092,kms-3:9092,kms-4:9092");
442 | // consumer group
443 | props.put("group.id", "log_consumer");
444 | // key deserializer for Kafka messages
445 | props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
446 | // value deserializer for Kafka messages
447 | props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
448 | props.put("auto.offset.reset", "earliest");
449 |
450 | FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
451 | "user_access_logs",
452 | new SimpleStringSchema(),
453 | props);
454 |
455 | DataStreamSource<String> logSource = senv.addSource(kafkaConsumer);
456 | // keep only valid log records
457 | DataStream<AccessLogRecord> availableAccessLog = LogAnalysis.getAvailableAccessLog(logSource);
458 | // extract [clientIP, accessDate, sectionId, articleId]
459 | DataStream<Tuple4<String, String, Integer, Integer>> fieldFromLog = LogAnalysis.getFieldFromLog(availableAccessLog);
460 | // register the stream as a temporary view named logs
461 | // with an extra computed column, proctime, used for the temporal (lookup) join
462 | tEnv.createTemporaryView("logs",
463 | fieldFromLog,
464 | $("clientIP"),
465 | $("accessDate"),
466 | $("sectionId"),
467 | $("articleId"),
468 | $("proctime").proctime());
469 |
470 | // metric 1: hot sections
471 | LogAnalysis.getHotSection(tEnv);
472 | // metric 2: hot articles
473 | LogAnalysis.getHotArticle(tEnv);
474 | // metric 3: total visits to sections and articles per client IP
475 | LogAnalysis.getClientAccess(tEnv);
476 | senv.execute("log-analysis");
477 | }
478 |
479 | /**
480 |  * Total number of visits to sections and articles per client IP
481 |  * @param tEnv
482 |  */
483 | private static void getClientAccess(StreamTableEnvironment tEnv) {
484 | // sink table
485 | // [client_ip, client_access_cnt, statistic_time]
486 | // [client IP, number of visits, statistics time]
487 | String client_ip_access_ddl = "" +
488 | "CREATE TABLE client_ip_access (\n" +
489 | " client_ip STRING ,\n" +
490 | " client_access_cnt BIGINT,\n" +
491 | " statistic_time STRING,\n" +
492 | " PRIMARY KEY (client_ip) NOT ENFORCED\n" +
493 | ")WITH (\n" +
494 | " 'connector' = 'jdbc',\n" +
495 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" +
496 | " 'table-name' = 'client_ip_access', \n" +
497 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
498 | " 'username' = 'root',\n" +
499 | " 'password' = '123qwe'\n" +
500 | ") ";
501 |
502 | tEnv.executeSql(client_ip_access_ddl);
503 |
504 | String client_ip_access_sql = "" +
505 | "INSERT INTO client_ip_access\n" +
506 | "SELECT\n" +
507 | " clientIP,\n" +
508 | " count(1) AS access_cnt,\n" +
509 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" +
510 | "FROM\n" +
511 | " logs \n" +
512 | "WHERE\n" +
513 | " articleId <> 0 \n" +
514 | " OR sectionId <> 0 \n" +
515 | "GROUP BY\n" +
516 | " clientIP "
517 | ;
518 | tEnv.executeSql(client_ip_access_sql);
519 |
520 | }
521 |
522 | /**
523 |  * Hot articles
524 |  * @param tEnv
525 |  */
526 |
527 | private static void getHotArticle(StreamTableEnvironment tEnv) {
528 | // JDBC source table
529 | // maps article ids to titles: [tid, subject] = [article id, title]
530 | String pre_forum_post_ddl = "" +
531 | "CREATE TABLE pre_forum_post (\n" +
532 | " tid INT,\n" +
533 | " subject STRING,\n" +
534 | " PRIMARY KEY (tid) NOT ENFORCED\n" +
535 | ") WITH (\n" +
536 | " 'connector' = 'jdbc',\n" +
537 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" +
538 | " 'table-name' = 'pre_forum_post', \n" +
539 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
540 | " 'username' = 'root',\n" +
541 | " 'password' = '123qwe'\n" +
542 | ")";
543 | // create the pre_forum_post source table
544 | tEnv.executeSql(pre_forum_post_ddl);
545 | // create the MySQL sink table
546 | // [article_id, subject, article_pv, statistic_time]
547 | // [article id, title, number of visits, statistics time]
548 | String hot_article_ddl = "" +
549 | "CREATE TABLE hot_article (\n" +
550 | " article_id INT,\n" +
551 | " subject STRING,\n" +
552 | " article_pv BIGINT ,\n" +
553 | " statistic_time STRING,\n" +
554 | " PRIMARY KEY (article_id) NOT ENFORCED\n" +
555 | ")WITH (\n" +
556 | " 'connector' = 'jdbc',\n" +
557 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" +
558 | " 'table-name' = 'hot_article', \n" +
559 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
560 | " 'username' = 'root',\n" +
561 | " 'password' = '123qwe'\n" +
562 | ")";
563 | tEnv.executeSql(hot_article_ddl);
564 | // insert the results into the MySQL target table
565 | String hot_article_sql = "" +
566 | "INSERT INTO hot_article\n" +
567 | "SELECT \n" +
568 | " a.articleId,\n" +
569 | " b.subject,\n" +
570 | " count(1) as article_pv,\n" +
571 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" +
572 | "FROM logs a \n" +
573 | " JOIN pre_forum_post FOR SYSTEM_TIME AS OF a.proctime as b ON a.articleId = b.tid\n" +
574 | "WHERE a.articleId <> 0\n" +
575 | "GROUP BY a.articleId,b.subject\n" +
576 | "ORDER BY count(1) desc\n" +
577 | "LIMIT 10";
578 |
579 | tEnv.executeSql(hot_article_sql);
580 |
581 | }
582 |
583 | /**
584 |  * Hot sections
585 |  *
586 |  * @param tEnv
587 |  */
588 | public static void getHotSection(StreamTableEnvironment tEnv) {
589 |
590 | // maps section ids to names: [fid, name] = [section id, section name]
591 | String pre_forum_forum_ddl = "" +
592 | "CREATE TABLE pre_forum_forum (\n" +
593 | " fid INT,\n" +
594 | " name STRING,\n" +
595 | " PRIMARY KEY (fid) NOT ENFORCED\n" +
596 | ") WITH (\n" +
597 | " 'connector' = 'jdbc',\n" +
598 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" +
599 | " 'table-name' = 'pre_forum_forum', \n" +
600 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
601 | " 'username' = 'root',\n" +
602 | " 'password' = '123qwe',\n" +
603 | " 'lookup.cache.ttl' = '10',\n" +
604 | " 'lookup.cache.max-rows' = '1000'" +
605 | ")";
606 | // create the pre_forum_forum source table
607 | tEnv.executeSql(pre_forum_forum_ddl);
608 |
609 | // create the MySQL sink table
610 | // [section_id, name, section_pv, statistic_time]
611 | // [section id, section name, number of visits, statistics time]
612 | String hot_section_ddl = "" +
613 | "CREATE TABLE hot_section (\n" +
614 | " section_id INT,\n" +
615 | " name STRING ,\n" +
616 | " section_pv BIGINT,\n" +
617 | " statistic_time STRING,\n" +
618 | " PRIMARY KEY (section_id) NOT ENFORCED \n" +
619 | ") WITH (\n" +
620 | " 'connector' = 'jdbc',\n" +
621 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" +
622 | " 'table-name' = 'hot_section', \n" +
623 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
624 | " 'username' = 'root',\n" +
625 | " 'password' = '123qwe'\n" +
626 | ")";
627 |
628 | // create the sink table hot_section
629 | tEnv.executeSql(hot_section_ddl);
630 |
631 | // hot sections:
632 | // join the log stream with the MySQL dimension table
633 | // to look up the section name
634 | String hot_section_sql = "" +
635 | "INSERT INTO hot_section\n" +
636 | "SELECT\n" +
637 | " a.sectionId,\n" +
638 | " b.name,\n" +
639 | " count(1) as section_pv,\n" +
640 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time \n" +
641 | "FROM\n" +
642 | " logs a\n" +
643 | " JOIN pre_forum_forum FOR SYSTEM_TIME AS OF a.proctime as b ON a.sectionId = b.fid \n" +
644 | "WHERE\n" +
645 | " a.sectionId <> 0 \n" +
646 | "GROUP BY a.sectionId, b.name\n" +
647 | "ORDER BY count(1) desc\n" +
648 | "LIMIT 10";
649 | // run the insert
650 | tEnv.executeSql(hot_section_sql);
651 |
652 | }
653 |
654 | /**
655 |  * Extract [clientIP, accessDate, sectionId, articleId],
656 |  * i.e. client IP, access date, section id and article id
657 |  *
658 |  * @param logRecord
659 |  * @return
660 |  */
661 | public static DataStream<Tuple4<String, String, Integer, Integer>> getFieldFromLog(DataStream<AccessLogRecord> logRecord) {
662 | DataStream<Tuple4<String, String, Integer, Integer>> fieldFromLog = logRecord.map(new MapFunction<AccessLogRecord, Tuple4<String, String, Integer, Integer>>() {
663 | @Override
664 | public Tuple4<String, String, Integer, Integer> map(AccessLogRecord accessLogRecord) throws Exception {
665 | LogParse parse = new LogParse();
666 |
667 | String clientIpAddress = accessLogRecord.getClientIpAddress();
668 | String dateTime = accessLogRecord.getDateTime();
669 | String request = accessLogRecord.getRequest();
670 | String formatDate = parse.parseDateField(dateTime);
671 | Tuple2<String, String> sectionIdAndArticleId = parse.parseSectionIdAndArticleId(request);
672 | if (formatDate.isEmpty() || sectionIdAndArticleId.equals(Tuple2.of("", ""))) {
673 |
674 | return new Tuple4<>("0.0.0.0", "0000-00-00 00:00:00", 0, 0);
675 | }
676 | Integer sectionId = sectionIdAndArticleId.f0.isEmpty() ? 0 : Integer.parseInt(sectionIdAndArticleId.f0);
677 | Integer articleId = sectionIdAndArticleId.f1.isEmpty() ? 0 : Integer.parseInt(sectionIdAndArticleId.f1);
678 | return new Tuple4<>(clientIpAddress, formatDate, sectionId, articleId);
679 | }
680 | });
681 | return fieldFromLog;
682 | }
683 |
684 | /**
685 |  * Keep only usable log records
686 |  *
687 |  * @param accessLog
688 |  * @return
689 |  */
690 | public static DataStream<AccessLogRecord> getAvailableAccessLog(DataStream<String> accessLog) {
691 | final LogParse logParse = new LogParse();
692 | // parse each raw log line into an AccessLogRecord
693 | DataStream<AccessLogRecord> filterDS = accessLog.map(new MapFunction<String, AccessLogRecord>() {
694 | @Override
695 | public AccessLogRecord map(String log) throws Exception {
696 | return logParse.parseRecord(log);
697 | }
698 | }).filter(new FilterFunction<AccessLogRecord>() {
699 | // drop records that could not be parsed
700 | @Override
701 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception {
702 | return accessLogRecord != null;
703 | }
704 | }).filter(new FilterFunction<AccessLogRecord>() {
705 | // keep only successful requests, i.e. records with HTTP status code 200
706 | @Override
707 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception {
708 | return "200".equals(accessLogRecord.getHttpStatusCode());
709 | }
710 | });
711 | return filterDS;
712 | }
713 | }
714 | ```
715 |
716 | Package the code above and upload it to the cluster. Before running the submit command, put the Hadoop dependency jar **flink-shaded-hadoop-2-uber-2.7.5-10.0.jar** into the lib folder of the Flink installation, because the state backend is on HDFS and the Flink release does not ship the Hadoop dependencies.
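
For example, assuming the jar has been downloaded to the current directory and Flink is installed under `/opt/modules/flink-1.11.1` (the path used in the submit script below), copying it into place looks like:

```bash
# copy the Hadoop uber jar into Flink's lib directory
cp flink-shaded-hadoop-2-uber-2.7.5-10.0.jar /opt/modules/flink-1.11.1/lib/
```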
717 |
718 | 
719 |
720 | Otherwise the job fails with the following error:
721 |
722 | ```bash
723 | Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Hadoop is not in the classpath/dependencies.
724 | ```
725 |
726 | ### Submit to the Cluster
727 |
728 | Write a script with the submit command:
729 |
730 | ```bash
731 | #!/bin/bash
732 | /opt/modules/flink-1.11.1/bin/flink run -m kms-1:8081 \
733 | -c com.jmx.analysis.LogAnalysis \
734 | /opt/softwares/com.jmx-1.0-SNAPSHOT.jar
735 | ```
736 |
737 | After submitting, open the Flink web UI to check the job:
738 |
739 | 
740 |
741 | Now browse the forum, click sections and posts, and watch the database change:
742 |
743 | 
744 |
745 | ## Summary
746 |
747 | This article walked through building a user-behavior log analysis system from scratch: a forum platform was set up with Discuz; Flume collects the logs it produces and pushes them to Kafka; Flink analyzes and processes the log stream; and the results are written to MySQL for visualization.
748 |
--------------------------------------------------------------------------------
/flink-log-analysis.iml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/jar包.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/jar包.png
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
1 |
2 |
5 | 4.0.0
6 |
7 | flink-log-analysis
8 | com.jmx
9 | 1.0-SNAPSHOT
10 |
11 | 1.8
12 | 1.11.1
13 | 1.8
14 | 2.3.4
15 | 2.11
16 |
17 | 1.8
18 | 1.8
19 | UTF-8
20 |
21 |
22 |
23 |
24 |
25 | org.apache.flink
26 | flink-connector-kafka_2.11
27 | ${flink.version}
28 |
29 |
30 | org.apache.flink
31 | flink-streaming-java_${scala.binary.version}
32 | ${flink.version}
33 | provided
34 |
35 |
36 | org.apache.flink
37 | flink-table-api-java-bridge_${scala.binary.version}
38 | ${flink.version}
39 | provided
40 |
41 |
42 |
43 | org.apache.flink
44 | flink-table-planner-blink_${scala.binary.version}
45 | ${flink.version}
46 | provided
47 |
48 |
49 |
50 | org.apache.flink
51 | flink-table-common
52 | ${flink.version}
53 | provided
54 |
55 |
56 | org.apache.flink
57 | flink-connector-jdbc_${scala.binary.version}
58 | ${flink.version}
59 |
60 |
61 |
62 | mysql
63 | mysql-connector-java
64 | 8.0.20
65 |
66 |
67 |
68 | org.apache.flink
69 | flink-clients_${scala.binary.version}
70 | ${flink.version}
71 |
72 |
73 | com.typesafe
74 | config
75 | 1.2.1
76 |
77 |
78 |
79 | org.slf4j
80 | slf4j-api
81 | 1.7.25
82 |
83 |
84 | org.slf4j
85 | slf4j-simple
86 | 1.7.25
87 |
88 |
89 |
90 | org.apache.kafka
91 | kafka-clients
92 | 2.1.0
93 |
94 |
95 |
96 | org.projectlombok
97 | lombok
98 | 1.16.18
99 |
100 |
101 |
102 |
103 |
104 |
105 |
--------------------------------------------------------------------------------
/src/main/java/com/jmx/analysis/LogAnalysis.java:
--------------------------------------------------------------------------------
1 | package com.jmx.analysis;
2 |
3 | import com.jmx.bean.AccessLogRecord;
4 | import org.apache.flink.api.common.functions.FilterFunction;
5 | import org.apache.flink.api.common.functions.MapFunction;
6 | import org.apache.flink.api.common.restartstrategy.RestartStrategies;
7 | import org.apache.flink.api.common.serialization.SimpleStringSchema;
8 | import org.apache.flink.api.common.time.Time;
9 | import org.apache.flink.api.java.tuple.Tuple2;
10 | import org.apache.flink.api.java.tuple.Tuple4;
11 | import org.apache.flink.runtime.state.filesystem.FsStateBackend;
12 | import org.apache.flink.streaming.api.datastream.DataStream;
13 | import org.apache.flink.streaming.api.datastream.DataStreamSource;
14 | import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
15 | import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
16 | import org.apache.flink.table.api.EnvironmentSettings;
17 | import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
18 |
19 | import java.util.Properties;
20 | import java.util.concurrent.TimeUnit;
21 |
22 | import static org.apache.flink.table.api.Expressions.$;
23 |
24 | /**
25 | * @Created with IntelliJ IDEA.
26 | * @author : jmx
27 | * @Date: 2020/8/24
28 | * @Time: 22:19
29 | *
30 | */
31 | public class LogAnalysis {
32 |
33 |
34 | public static void main(String[] args) throws Exception {
35 |
36 | StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
37 | // enable checkpointing every 5000 ms
38 | senv.enableCheckpointing(5000L);
39 | // choose a state backend
40 | // local testing:
41 | // senv.setStateBackend(new FsStateBackend("file:///E://checkpoint"));
42 | // cluster:
43 | senv.setStateBackend(new FsStateBackend("hdfs://kms-1:8020/flink-checkpoints"));
44 | // restart strategy
45 | senv.setRestartStrategy(
46 | RestartStrategies.fixedDelayRestart(3, Time.of(2, TimeUnit.SECONDS) ));
47 |
48 | EnvironmentSettings settings = EnvironmentSettings.newInstance()
49 | .useBlinkPlanner()
50 | .inStreamingMode()
51 | .build();
52 | StreamTableEnvironment tEnv = StreamTableEnvironment.create(senv, settings);
53 | // Kafka configuration
54 | Properties props = new Properties();
55 | // Kafka broker addresses
56 | props.put("bootstrap.servers", "kms-2:9092,kms-3:9092,kms-4:9092");
57 | // consumer group
58 | props.put("group.id", "log_consumer");
59 | // key deserializer for Kafka messages
60 | props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
61 | // value deserializer for Kafka messages
62 | props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
63 | props.put("auto.offset.reset", "earliest");
64 |
65 | FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
66 | "user_access_logs",
67 | new SimpleStringSchema(),
68 | props);
69 |
70 | DataStreamSource<String> logSource = senv.addSource(kafkaConsumer);
71 | // keep only valid log records
72 | DataStream<AccessLogRecord> availableAccessLog = LogAnalysis.getAvailableAccessLog(logSource);
73 | // extract [clientIP, accessDate, sectionId, articleId]
74 | DataStream<Tuple4<String, String, Integer, Integer>> fieldFromLog = LogAnalysis.getFieldFromLog(availableAccessLog);
75 | // register the stream as a temporary view named logs
76 | // with an extra computed column, proctime, used for the temporal (lookup) join
77 | tEnv.createTemporaryView("logs",
78 | fieldFromLog,
79 | $("clientIP"),
80 | $("accessDate"),
81 | $("sectionId"),
82 | $("articleId"),
83 | $("proctime").proctime());
84 |
85 | // hot sections
86 | LogAnalysis.getHotSection(tEnv);
87 | // hot articles
88 | LogAnalysis.getHotArticle(tEnv);
89 | // total visits to sections and articles per client IP
90 | LogAnalysis.getClientAccess(tEnv);
91 |
92 | senv.execute("log-analysis");
93 |
94 | }
95 |
96 | private static void getClientAccess(StreamTableEnvironment tEnv) {
97 | // sink table
98 | // [client_ip, client_access_cnt, statistic_time]
99 | // [client IP, number of visits, statistics time]
100 | String client_ip_access_ddl = "" +
101 | "CREATE TABLE client_ip_access (\n" +
102 | " client_ip STRING ,\n" +
103 | " client_access_cnt BIGINT,\n" +
104 | " statistic_time STRING,\n" +
105 | " PRIMARY KEY (client_ip) NOT ENFORCED\n" +
106 | ")WITH (\n" +
107 | " 'connector' = 'jdbc',\n" +
108 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" +
109 | " 'table-name' = 'client_ip_access', \n" +
110 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
111 | " 'username' = 'root',\n" +
112 | " 'password' = '123qwe'\n" +
113 | ") ";
114 |
115 | tEnv.executeSql(client_ip_access_ddl);
116 |
117 | String client_ip_access_sql = "" +
118 | "INSERT INTO client_ip_access\n" +
119 | "SELECT\n" +
120 | " clientIP,\n" +
121 | " count(1) AS access_cnt,\n" +
122 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" +
123 | "FROM\n" +
124 | " logs \n" +
125 | "WHERE\n" +
126 | " articleId <> 0 \n" +
127 | " OR sectionId <> 0 \n" +
128 | "GROUP BY\n" +
129 | " clientIP "
130 | ;
131 | tEnv.executeSql(client_ip_access_sql);
132 |
133 | }
134 |
135 | private static void getHotArticle(StreamTableEnvironment tEnv) {
136 | // JDBC source table
137 | // maps article ids to titles: [tid, subject] = [article id, title]
138 | String pre_forum_post_ddl = "" +
139 | "CREATE TABLE pre_forum_post (\n" +
140 | " tid INT,\n" +
141 | " subject STRING,\n" +
142 | " PRIMARY KEY (tid) NOT ENFORCED\n" +
143 | ") WITH (\n" +
144 | " 'connector' = 'jdbc',\n" +
145 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" +
146 | " 'table-name' = 'pre_forum_post', \n" +
147 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
148 | " 'username' = 'root',\n" +
149 | " 'password' = '123qwe'\n" +
150 | ")";
151 | // create the pre_forum_post source table
152 | tEnv.executeSql(pre_forum_post_ddl);
153 | // create the MySQL sink table
154 | // [article_id, subject, article_pv, statistic_time]
155 | // [article id, title, number of visits, statistics time]
156 | String hot_article_ddl = "" +
157 | "CREATE TABLE hot_article (\n" +
158 | " article_id INT,\n" +
159 | " subject STRING,\n" +
160 | " article_pv BIGINT ,\n" +
161 | " statistic_time STRING,\n" +
162 | " PRIMARY KEY (article_id) NOT ENFORCED\n" +
163 | ")WITH (\n" +
164 | " 'connector' = 'jdbc',\n" +
165 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" +
166 | " 'table-name' = 'hot_article', \n" +
167 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
168 | " 'username' = 'root',\n" +
169 | " 'password' = '123qwe'\n" +
170 | ")";
171 | tEnv.executeSql(hot_article_ddl);
172 | // insert the results into the MySQL target table
173 | String hot_article_sql = "" +
174 | "INSERT INTO hot_article\n" +
175 | "SELECT \n" +
176 | " a.articleId,\n" +
177 | " b.subject,\n" +
178 | " count(1) as article_pv,\n" +
179 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" +
180 | "FROM logs a \n" +
181 | " JOIN pre_forum_post FOR SYSTEM_TIME AS OF a.proctime as b ON a.articleId = b.tid\n" +
182 | "WHERE a.articleId <> 0\n" +
183 | "GROUP BY a.articleId,b.subject\n" +
184 | "ORDER BY count(1) desc\n" +
185 | "LIMIT 10";
186 |
187 | tEnv.executeSql(hot_article_sql);
188 |
189 | }
190 |
191 | /**
192 |  * Hot sections
193 |  *
194 |  * @param tEnv
195 |  */
196 | public static void getHotSection(StreamTableEnvironment tEnv) {
197 |
198 | // maps section ids to names: [fid, name] = [section id, section name]
199 | String pre_forum_forum_ddl = "" +
200 | "CREATE TABLE pre_forum_forum (\n" +
201 | " fid INT,\n" +
202 | " name STRING,\n" +
203 | " PRIMARY KEY (fid) NOT ENFORCED\n" +
204 | ") WITH (\n" +
205 | " 'connector' = 'jdbc',\n" +
206 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" +
207 | " 'table-name' = 'pre_forum_forum', \n" +
208 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
209 | " 'username' = 'root',\n" +
210 | " 'password' = '123qwe',\n" +
211 | " 'lookup.cache.ttl' = '10',\n" +
212 | " 'lookup.cache.max-rows' = '1000'" +
213 | ")";
214 | // create the pre_forum_forum source table
215 | tEnv.executeSql(pre_forum_forum_ddl);
216 |
217 | // create the MySQL sink table
218 | // [section_id, name, section_pv, statistic_time]
219 | // [section id, section name, number of visits, statistics time]
220 | String hot_section_ddl = "" +
221 | "CREATE TABLE hot_section (\n" +
222 | " section_id INT,\n" +
223 | " name STRING ,\n" +
224 | " section_pv BIGINT,\n" +
225 | " statistic_time STRING,\n" +
226 | " PRIMARY KEY (section_id) NOT ENFORCED \n" +
227 | ") WITH (\n" +
228 | " 'connector' = 'jdbc',\n" +
229 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" +
230 | " 'table-name' = 'hot_section', \n" +
231 | " 'driver' = 'com.mysql.jdbc.Driver',\n" +
232 | " 'username' = 'root',\n" +
233 | " 'password' = '123qwe'\n" +
234 | ")";
235 |
236 | // create the sink table hot_section
237 | tEnv.executeSql(hot_section_ddl);
238 |
239 | // hot sections:
240 | // join the log stream with the MySQL dimension table
241 | // to look up the section name
242 | String hot_section_sql = "" +
243 | "INSERT INTO hot_section\n" +
244 | "SELECT\n" +
245 | " a.sectionId,\n" +
246 | " b.name,\n" +
247 | " count(1) as section_pv,\n" +
248 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time \n" +
249 | "FROM\n" +
250 | " logs a\n" +
251 | " JOIN pre_forum_forum FOR SYSTEM_TIME AS OF a.proctime as b ON a.sectionId = b.fid \n" +
252 | "WHERE\n" +
253 | " a.sectionId <> 0 \n" +
254 | "GROUP BY a.sectionId, b.name\n" +
255 | "ORDER BY count(1) desc\n" +
256 | "LIMIT 10";
257 | // run the insert
258 | tEnv.executeSql(hot_section_sql);
259 |
260 | }
261 |
262 | /**
263 |  * Extract [clientIP, accessDate, sectionId, articleId],
264 |  * i.e. client IP, access date, section id and article id
265 |  *
266 |  * @param logRecord
267 |  * @return
268 |  */
269 | public static DataStream<Tuple4<String, String, Integer, Integer>> getFieldFromLog(DataStream<AccessLogRecord> logRecord) {
270 | DataStream<Tuple4<String, String, Integer, Integer>> fieldFromLog = logRecord.map(new MapFunction<AccessLogRecord, Tuple4<String, String, Integer, Integer>>() {
271 | @Override
272 | public Tuple4<String, String, Integer, Integer> map(AccessLogRecord accessLogRecord) throws Exception {
273 | LogParse parse = new LogParse();
274 |
275 | String clientIpAddress = accessLogRecord.getClientIpAddress();
276 | String dateTime = accessLogRecord.getDateTime();
277 | String request = accessLogRecord.getRequest();
278 | String formatDate = parse.parseDateField(dateTime);
279 | Tuple2<String, String> sectionIdAndArticleId = parse.parseSectionIdAndArticleId(request);
280 | if (formatDate.isEmpty() || sectionIdAndArticleId.equals(Tuple2.of("", ""))) {
281 |
282 | return new Tuple4<>("0.0.0.0", "0000-00-00 00:00:00", 0, 0);
283 | }
284 | Integer sectionId = sectionIdAndArticleId.f0.isEmpty() ? 0 : Integer.parseInt(sectionIdAndArticleId.f0);
285 | Integer articleId = sectionIdAndArticleId.f1.isEmpty() ? 0 : Integer.parseInt(sectionIdAndArticleId.f1);
286 | return new Tuple4<>(clientIpAddress, formatDate, sectionId, articleId);
287 | }
288 | });
289 |
290 |
291 | return fieldFromLog;
292 | }
293 |
294 | /**
295 |  * Keep only usable log records
296 |  *
297 |  * @param accessLog
298 |  * @return
299 |  */
300 | public static DataStream<AccessLogRecord> getAvailableAccessLog(DataStream<String> accessLog) {
301 | final LogParse logParse = new LogParse();
302 | // parse each raw log line into an AccessLogRecord
303 | DataStream<AccessLogRecord> filterDS = accessLog.map(new MapFunction<String, AccessLogRecord>() {
304 | @Override
305 | public AccessLogRecord map(String log) throws Exception {
306 | return logParse.parseRecord(log);
307 | }
308 | }).filter(new FilterFunction<AccessLogRecord>() {
309 | // drop records that could not be parsed
310 | @Override
311 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception {
312 | return accessLogRecord != null;
313 | }
314 | }).filter(new FilterFunction<AccessLogRecord>() {
315 | // keep only successful requests, i.e. records with HTTP status code 200
316 | @Override
317 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception {
318 | return "200".equals(accessLogRecord.getHttpStatusCode());
319 | }
320 | });
321 | return filterDS;
322 |
323 | }
324 |
325 |
326 | }
327 |
--------------------------------------------------------------------------------
/src/main/java/com/jmx/analysis/LogParse.java:
--------------------------------------------------------------------------------
1 | package com.jmx.analysis;
2 |
3 | import com.jmx.bean.AccessLogRecord;
4 | import org.apache.flink.api.java.tuple.Tuple2;
5 | import org.apache.flink.api.java.tuple.Tuple3;
6 |
7 | import java.io.Serializable;
8 | import java.text.ParseException;
9 | import java.text.SimpleDateFormat;
10 | import java.util.Date;
11 | import java.util.Locale;
12 | import java.util.regex.Matcher;
13 | import java.util.regex.Pattern;
14 |
15 | /**
16 | * @Created with IntelliJ IDEA.
17 | * @author : jmx
18 | * @Date: 2020/8/24
19 | * @Time: 22:21
20 | *
21 | */
22 | public class LogParse implements Serializable {
23 |
24 | // regular expression for one combined log line; each quoted field uses a single capture group so that groups 1-9 map directly onto the record fields below
25 | private String regex = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) (\\S+) (\\S+) (\\[.+?\\]) (\\\".*?\\\") (\\d{3}) (\\S+) (\\\".*?\\\") (\\\".*?\\\")";
26 | private Pattern p = Pattern.compile(regex);
27 |
28 | /*
29 | * Build the AccessLogRecord wrapper from a successful match
30 | * */
31 | public AccessLogRecord buildAccessLogRecord(Matcher matcher) {
32 | AccessLogRecord record = new AccessLogRecord();
33 | record.setClientIpAddress(matcher.group(1));
34 | record.setClientIdentity(matcher.group(2));
35 | record.setRemoteUser(matcher.group(3));
36 | record.setDateTime(matcher.group(4));
37 | record.setRequest(matcher.group(5));
38 | record.setHttpStatusCode(matcher.group(6));
39 | record.setBytesSent(matcher.group(7));
40 | record.setReferer(matcher.group(8));
41 | record.setUserAgent(matcher.group(9));
42 | return record;
43 |
44 | }
45 |
46 | /**
47 |  * @param record a single line in Apache combined log format
48 |  * @return the parsed log line wrapped in an AccessLogRecord, or null if the line does not match
49 |  */
50 | public AccessLogRecord parseRecord(String record) {
51 | Matcher matcher = p.matcher(record);
52 | if (matcher.find()) {
53 | return buildAccessLogRecord(matcher);
54 | }
55 | return null;
56 | }
57 |
58 | /**
59 |  * @param request the request line as a string, e.g. "GET /the-uri-here HTTP/1.1"
60 |  * @return a Tuple3 (requestType, uri, httpVersion); requestType is the HTTP method, e.g. GET or POST
61 |  */
62 | public Tuple3<String, String, String> parseRequestField(String request) {
63 | // the request line looks like "GET /test.php HTTP/1.1"; split it on spaces
64 | String[] arr = request.split(" ");
65 | if (arr.length == 3) {
66 | return Tuple3.of(arr[0], arr[1], arr[2]);
67 | } else {
68 | return null;
69 | }
70 |
71 | }
72 |
73 | /**
74 |  * Convert the English date in an Apache log into the yyyy-MM-dd HH:mm:ss format
75 |  *
76 |  * @param dateTime the date string from the Apache log, e.g. "[21/Jul/2009:02:48:13 -0700]"
77 |  * @return the formatted date, or "" if it cannot be parsed
78 |  */
79 | public String parseDateField(String dateTime) throws ParseException {
80 | // input date format (English month names)
81 | String inputFormat = "dd/MMM/yyyy:HH:mm:ss";
82 | // output date format
83 | String outPutFormat = "yyyy-MM-dd HH:mm:ss";
84 |
85 | String dateRegex = "\\[(.*?) .+]";
86 | Pattern datePattern = Pattern.compile(dateRegex);
87 |
88 | Matcher dateMatcher = datePattern.matcher(dateTime);
89 | if (dateMatcher.find()) {
90 | String dateString = dateMatcher.group(1);
91 | SimpleDateFormat dateInputFormat = new SimpleDateFormat(inputFormat, Locale.ENGLISH);
92 | Date date = dateInputFormat.parse(dateString);
93 |
94 | SimpleDateFormat dateOutFormat = new SimpleDateFormat(outPutFormat);
95 |
96 | String formatDate = dateOutFormat.format(date);
97 | return formatDate;
98 | } else {
99 | return "";
100 | }
101 | }
102 |
103 | /**
104 |  * Parse the request line, i.e. the URL of the visited page, e.g.
105 |  * "GET /about/forum.php?mod=viewthread&tid=5&extra=page%3D1 HTTP/1.1"
106 |  * and extract fid (the section id)
107 |  * and tid (the article id)
108 |  *
109 |  * @param request the request line
110 |  * @return a Tuple2 (sectionId, articleId); empty strings when not present
111 |  */
112 | public Tuple2<String, String> parseSectionIdAndArticleId(String request) {
113 | // the digits following "forumdisplay&fid=" are the section id
114 | String sectionIdRegex = "(\\?mod=forumdisplay&fid=)(\\d+)";
115 | Pattern sectionPattern = Pattern.compile(sectionIdRegex);
116 | // the digits following "tid=" are the article id
117 | String articleIdRegex = "(\\?mod=viewthread&tid=)(\\d+)";
118 | Pattern articlePattern = Pattern.compile(articleIdRegex);
119 |
120 | String[] arr = request.split(" ");
121 | String sectionId = "";
122 | String articleId = "";
123 | if (arr.length == 3) {
124 | Matcher sectionMatcher = sectionPattern.matcher(arr[1]);
125 | Matcher articleMatcher = articlePattern.matcher(arr[1]);
126 | sectionId = (sectionMatcher.find()) ? sectionMatcher.group(2) : "";
127 | articleId = (articleMatcher.find()) ? articleMatcher.group(2) : "";
128 |
129 | }
130 | return Tuple2.of(sectionId, articleId);
131 |
132 | }
133 |
134 | }
135 |
--------------------------------------------------------------------------------
/src/main/java/com/jmx/bean/AccessLogRecord.java:
--------------------------------------------------------------------------------
1 | package com.jmx.bean;
2 |
3 | import lombok.Data;
4 |
5 | /**
6 | * @Created with IntelliJ IDEA.
7 | * @author : jmx
8 | * @Date: 2020/8/24
9 | * @Time: 21:41
10 | *
11 | */
12 |
13 | /**
14 |  * Wrapper class for a raw access-log record
15 |  */
16 | @Data
17 | public class AccessLogRecord {
18 | public String clientIpAddress; // client IP address
19 | public String clientIdentity; // client identity, "-" in this log format
20 | public String remoteUser; // authenticated user, "-" in this log format
21 | public String dateTime; // date, formatted as [day/month/year:hour:minute:second zone]
22 | public String request; // request line, e.g. `GET /foo ...`
23 | public String httpStatusCode; // status code, e.g. 200, 404
24 | public String bytesSent; // bytes sent, may be `-`
25 | public String referer; // referer, i.e. the page the request came from
26 | public String userAgent; // browser and operating system
27 | }
28 |
--------------------------------------------------------------------------------
/src/main/resources/access_log.txt:
--------------------------------------------------------------------------------
1 | 127.0.0.1 - - [07/Dec/2017:19:36:27 +0800] "GET /test.php HTTP/1.1" 200 73946 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0"
2 | 192.168.1.1 - - [07/Dec/2017:19:37:51 +0800] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
3 | 192.168.1.1 - - [07/Dec/2017:19:37:51 +0800] "GET /test.php HTTP/1.1" 200 74407 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
4 | 192.168.1.1 - - [07/Dec/2017:19:37:53 +0800] "GET /favicon.ico HTTP/1.1" 404 209 "http://192.168.1.10/test.php" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
5 | 192.168.1.1 - - [07/Dec/2017:19:38:43 +0800] "-" 408 - "-" "-"
6 | 192.168.1.1 - - [07/Dec/2017:19:40:07 +0800] "GET /test.php HTTP/1.1" 200 74462 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
7 | 192.168.1.1 - - [07/Dec/2017:19:40:37 +0800] "GET / HTTP/1.1" 403 4897 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
8 | 192.168.1.1 - - [07/Dec/2017:19:41:01 +0800] "GET /abc HTTP/1.1" 404 201 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
9 | 192.168.169.50 - - [17/Feb/2012:10:09:13 +0800] "GET /favicon.ico HTTP/1.1" 404 288 "-" "360se"
10 | 192.168.169.50 - - [17/Feb/2012:10:36:26 +0800] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"
11 | 192.168.169.50 - - [17/Feb/2012:10:36:26 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"
12 | 192.168.169.50 - - [17/Feb/2012:10:09:10 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; 360SE)"
13 | 192.168.55.230 - - [24/Feb/2012:09:48:58 +0800] "GET /favicon.ico HTTP/1.1" 404 288 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
14 | 192.168.169.50 - - [24/Feb/2012:09:45:03 +0800] "GET /server-status HTTP/1.1" 404 290 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; 360SE)"
15 | 192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
16 | 192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET /icons/apache_pb.gif HTTP/1.1" 200 2326 "http://192.168.55.230/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
17 | 192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
18 | 192.168.55.230 - - [24/Feb/2012:09:49:20 +0800] "GET /server-status HTTP/1.1" 404 290 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24"
19 | 192.168.1.1 - - [12/Jan/2018:21:01:04 +0800] "GET /about/forum.php?mod=ajax&action=forumchecknew&fid=40&time=1515762003&inajax=yes HTTP/1.1" 200 64 "http://192.168.1.10/about/forum.php?mod=forumdisplay&fid=40" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
20 | 192.168.1.1 - - [12/Jan/2018:21:02:22 +0800] "GET /about/forum.php?mod=ajax&action=forumchecknew&fid=40&time=1515762111&inajax=yes HTTP/1.1" 200 64 "http://192.168.1.10/about/forum.php?mod=forumdisplay&fid=40" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
21 | 192.168.1.1 - - [12/Jan/2018:21:03:09 +0800] "GET /about/forum.php?mod=viewthread&tid=5&extra=page%3D1 HTTP/1.1" 200 36838 "http://192.168.1.10/about/forum.php?mod=forumdisplay&fid=40" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
22 | 192.168.1.1 - - [14/Jan/2018:16:47:36 +0800] "-" 408 - "-" "-"
23 |
24 | 199.180.11.91 - - [06/Mar/2019:04:22:58 +0100] "GET /robots.txt HTTP/1.1" 404 1228 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"
25 |
26 | 61.135.219.2 source IP of the request
27 | '-' remote log name (from identd, if supported)
28 | '-' remote user name
29 | [01/Jan/2014:00:02:02 +0800] request time, format [day/month/year:hour:minute:second zone]
30 | "GET /feed/ HTTP/1.0" request line, format "%m %U%q %H", i.e. "method / path / protocol"
31 | 200 status code
32 | 12306 size of the returned data
33 | "%{User-Agent}i" http_user_agent, client information
34 | "%{Referer}i" http_referer, the referring page
35 |
36 |
37 |
38 | LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
39 |
40 |
41 |
42 |
43 |
44 | 1 client IP address.
45 | 2 RFC 1413 identity determined by the client's identd process; "-" means the information is not available.
46 | 3 userid of the visitor as determined by HTTP authentication; "-" if the page is not password protected.
47 | 4 time at which the server finished processing the request.
48 | 5 the client's action / the requested resource / the protocol used.
49 | 6 status code the server returned to the client.
50 | 7 number of bytes returned to the client, excluding response headers; "-" if nothing was returned.
51 | 8 the "Referer" request header.
52 | 9 the "User-Agent" request header.
53 |
54 |
55 |
56 |
57 |
58 |
59 |
60 | (1) 183.69.210.164
61 | The client IP of the request to the Apache server; by default this first field is just the remote host's IP address, unless Apache is configured to resolve host names.
62 | (2) -
63 | This field is blank and shown as "-"; it is reserved for the visitor's identity.
64 | (3) -
65 | Also blank; it records the user's HTTP authentication identity. If the site requires authentication, this field holds the user's identity information.
66 | (4) [07/Apr/2017:09:32:39 +0800]
67 | The time of the request, format [day/month/year:hour:minute:second zone]; the trailing +0800 means the server is in the UTC+8 time zone.
68 | (5) GET /member/ HTTP/1.1
69 | The most useful field: it tells us the server received a GET request and which resource path the client asked for.
70 | (6) 302
71 | The status code the server sent back to the client; it indicates whether the request succeeded, was redirected, or hit an error. Codes starting with 2 mean success, 3 redirection, 4 a client error, 5 a server error.
72 | (7) 31
73 | The number of bytes the server sent to the client; summing this column over a period of time gives the total amount of data served in that period.
74 | (8) -
75 | Empty when the page was opened directly; otherwise it tells the server which page the request was linked from.
76 | (9) "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36
77 | (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE
78 | 2.X MetaSr 1.0"
79 | This field records the client's browser information.
80 |
81 |
82 |
83 | (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (\S+) (\S+) (\[.+?\]) (\"(.*?)\") (\d{3}) (\S+) (\"(.*?)\") (\"(.*?)\")
84 |
85 |
86 |
--------------------------------------------------------------------------------
/src/main/resources/log4j.properties:
--------------------------------------------------------------------------------
1 | ################################################################################
2 | # Licensed to the Apache Software Foundation (ASF) under one
3 | # or more contributor license agreements. See the NOTICE file
4 | # distributed with this work for additional information
5 | # regarding copyright ownership. The ASF licenses this file
6 | # to you under the Apache License, Version 2.0 (the
7 | # "License"); you may not use this file except in compliance
8 | # with the License. You may obtain a copy of the License at
9 | #
10 | # http://www.apache.org/licenses/LICENSE-2.0
11 | #
12 | # Unless required by applicable law or agreed to in writing, software
13 | # distributed under the License is distributed on an "AS IS" BASIS,
14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 | # See the License for the specific language governing permissions and
16 | # limitations under the License.
17 | ################################################################################
18 |
19 | log4j.rootLogger=WARN, console
20 |
21 | log4j.appender.console=org.apache.log4j.ConsoleAppender
22 | log4j.appender.console.layout=org.apache.log4j.PatternLayout
23 | log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
24 |
--------------------------------------------------------------------------------
/src/main/resources/param.conf:
--------------------------------------------------------------------------------
1 | kafka {
2 |
3 | topic = "user_access_logs"
4 | broker = "kms-2:9092,kms-3:9092,kms-4:9092"
5 | group = "log_consumer"
6 | offset = "earliest"
7 | key_deserializer = "org.apache.kafka.common.serialization.StringDeserializer"
8 | value_deserializer = "org.apache.kafka.common.serialization.StringDeserializer"
9 | }
10 |
11 | mysql {
12 | forum_url = "jdbc:mysql://kms-4:3306/ultrax?characterEncoding=utf8&user=root&password=123qwe"
13 | statistics_url = "jdbc:mysql://kms-4:3306/statistics?characterEncoding=utf8&user=root&password=123qwe"
14 | }
15 |
--------------------------------------------------------------------------------
/src/test/java/TestLogparse.java:
--------------------------------------------------------------------------------
1 | import com.jmx.analysis.LogParse;
2 | import com.jmx.bean.AccessLogRecord;
3 | import org.apache.flink.api.java.tuple.Tuple2;
4 |
5 | import java.util.regex.Matcher;
6 | import java.util.regex.Pattern;
7 |
8 | /**
9 | * @Created with IntelliJ IDEA.
10 | * @author : jmx
11 | * @Date: 2020/8/27
12 | * @Time: 10:23
13 | *
14 | */
15 | public class TestLogparse {
16 | public static void main(String[] args) {
17 |
18 |
19 | String log = "192.168.10.1 - - [27/Aug/2020:10:20:53 +0800] \"GET /forum.php?mod=viewthread&tid=9&extra=page%3D1 HTTP/1.1\" 200 39913 \"http://kms-4/forum.php?mod=forumdisplay&fid=41\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36\"";
20 |
21 | LogParse parse = new LogParse();
22 |
23 | AccessLogRecord accessLogRecord = parse.parseRecord(log);
24 |
25 | // System.out.println(accessLogRecord.getRequest());
26 |
27 | // the digits following "forumdisplay&fid=" are the section id
28 | String sectionIdRegex = "(\\?mod=forumdisplay&fid=)(\\d+)";
29 | Pattern sectionPattern = Pattern.compile(sectionIdRegex);
30 | // the digits following "tid=" are the article id
31 | String articleIdRegex = "(\\?mod=viewthread&tid=)(\\d+)";
32 | Pattern articlePattern = Pattern.compile(articleIdRegex);
33 |
34 | String[] arr = accessLogRecord.getRequest().split(" ");
35 | String sectionId = "";
36 | String articleId = "";
37 | if (arr.length == 3) {
38 | //System.out.println(arr[1]);
39 |
40 | Matcher sectionMatcher = sectionPattern.matcher(arr[1]);
41 | Matcher articleMatcher = articlePattern.matcher(arr[1]);
42 | //System.out.println(articleMatcher.find());
43 | // System.out.println(sectionMatcher.find());
44 | //System.out.println(articleMatcher.group(0));
45 | //System.out.println(articleMatcher.group(1));
46 | //System.out.println(articleMatcher.group(2));
47 |
48 | /* sectionId = (sectionMatcher.find()) ? sectionMatcher.group(2) : "";
49 | articleId = articleMatcher.find() ? articleMatcher.group(2) : "";*/
50 |
51 | /* if (articleMatcher.find()){
52 | articleId = articleMatcher.group(2);
53 | } else {
54 | articleId = "no";
55 | }
56 | if (sectionMatcher.find()){
57 | sectionId = sectionMatcher.group(2);
58 | }else{
59 | sectionId = "no";
60 | }
61 | */
62 | }
63 | /* System.out.println( articleId);
64 | System.out.println(sectionId);*/
65 | Tuple2<String, String> stringStringTuple2 = parse.parseSectionIdAndArticleId(accessLogRecord.getRequest());
66 | System.out.println(stringStringTuple2.f0 + stringStringTuple2.f1);
67 | System.out.println(stringStringTuple2);
68 | }
69 | }
70 |
--------------------------------------------------------------------------------
/帖子.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/帖子.png
--------------------------------------------------------------------------------
/日志架构.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/日志架构.png
--------------------------------------------------------------------------------
/日志格式.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/日志格式.png
--------------------------------------------------------------------------------
/板块.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/板块.png
--------------------------------------------------------------------------------
/结果.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/结果.png
--------------------------------------------------------------------------------
/项目代码.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/项目代码.png
--------------------------------------------------------------------------------