├── .idea ├── compiler.xml ├── encodings.xml ├── libraries │ └── R_User_Library.xml ├── misc.xml └── sbt.xml ├── FlinkWEB.png ├── README.md ├── flink-log-analysis.iml ├── jar包.png ├── pom.xml ├── src ├── main │ ├── java │ │ └── com │ │ │ └── jmx │ │ │ ├── analysis │ │ │ ├── LogAnalysis.java │ │ │ └── LogParse.java │ │ │ └── bean │ │ │ └── AccessLogRecord.java │ └── resources │ │ ├── access_log.txt │ │ ├── log4j.properties │ │ └── param.conf └── test │ └── java │ └── TestLogparse.java ├── 帖子.png ├── 日志架构.png ├── 日志格式.png ├── 板块.png ├── 结果.png └── 项目代码.png /.idea/compiler.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | -------------------------------------------------------------------------------- /.idea/encodings.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /.idea/libraries/R_User_Library.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | -------------------------------------------------------------------------------- /.idea/misc.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /.idea/sbt.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 6 | -------------------------------------------------------------------------------- /FlinkWEB.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/FlinkWEB.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # flink-log-analysis 2 | ## 项目架构图 3 | 从0到1基于Flink实现一个实时的用户行为日志分析系统,基本架构图如下: 4 | 5 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/%E6%97%A5%E5%BF%97%E6%9E%B6%E6%9E%84.png) 6 | 7 | ## 代码结构 8 | 9 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/%E9%A1%B9%E7%9B%AE%E4%BB%A3%E7%A0%81.png) 10 | 11 | 首先会先搭建一个论坛平台,对论坛平台产生的用户点击日志进行分析。然后使用Flume日志收集系统对产生的Apache日志进行收集,并将其推送到Kafka。接着我们使用Flink对日志进行实时分析处理,将处理之后的结果写入MySQL供前端应用可视化展示。本文主要实现以下三个指标计算: 12 | 13 | - 统计热门板块,即访问量最高的板块 14 | - 统计热门文章,即访问量最高的帖子文章 15 | - 统计不同客户端对版块和文章的总访问量 16 | 17 | ## 基于discuz搭建一个论坛平台 18 | 19 | ### 安装XAMPP 20 | 21 | - 下载 22 | 23 | ```bash 24 | wget https://www.apachefriends.org/xampp-files/5.6.33/xampp-linux-x64-5.6.33-0-installer.run 25 | ``` 26 | 27 | - 安装 28 | 29 | ```bash 30 | # 赋予文件执行权限 31 | chmod u+x xampp-linux-x64-5.6.33-0-installer.run 32 | # 运行安装文件 33 | ./xampp-linux-x64-5.6.33-0-installer.run 34 | ``` 35 | 36 | - 配置环境变量 37 | 38 | 将以下内容加入到 ~/.bash_profile 39 | 40 | ```bash 41 | export XAMPP=/opt/lampp/ 42 | export PATH=$PATH:$XAMPP:$XAMPP/bin 43 | ``` 44 | 45 | - 刷新环境变量 46 | 47 | ```bash 48 | source ~/.bash_profile 49 | ``` 50 | 51 | - 启动XAMPP 52 | 53 | ```bash 54 | xampp restart 55 | ``` 56 | 57 | - MySQL的root用户密码和权限修改 58 | 59 | ```bash 60 | #修改root用户密码为123qwe 61 | update mysql.user set password=PASSWORD('123qwe') where user='root'; 62 | flush privileges; 63 | #赋予root用户远程登录权限 64 | grant all privileges on *.* to 'root'@'%' identified by '123qwe' with grant option; 65 | flush privileges; 66 | ``` 67 | 68 | ### 安装Discuz 69 | 70 | - 下载discuz 71 | 72 | ```bash 73 | wget http://download.comsenz.com/DiscuzX/3.2/Discuz_X3.2_SC_UTF8.zip 74 | ``` 75 | 76 | - 安装 77 | 78 | ```bash 79 | #删除原有的web应用 80 | rm -rf /opt/lampp/htdocs/* 81 | unzip Discuz_X3.2_SC_UTF8.zip –d /opt/lampp/htdocs/ 82 | cd /opt/lampp/htdocs/ 83 | mv upload/* 84 | #修改目录权限 85 | chmod 777 -R /opt/lampp/htdocs/config/ 86 | chmod 777 -R /opt/lampp/htdocs/data/ 87 | chmod 777 -R /opt/lampp/htdocs/uc_client/ 88 | chmod 777 -R /opt/lampp/htdocs/uc_server/ 89 | ``` 90 | 91 | ### Discuz基本操作 92 | 93 | - 自定义版块 94 | - 进入discuz后台:http://kms-4/admin.php 95 | - 点击顶部的**论坛**菜单 96 | - 按照页面提示创建所需版本,可以创建父子版块 97 | 98 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/%E6%9D%BF%E5%9D%97.png) 99 | 100 | ### Discuz帖子/版块存储数据库表介 101 | 102 | ```sql 103 | -- 登录ultrax数据库 104 | mysql -uroot -p123 ultrax 105 | -- 查看包含帖子id及标题对应关系的表 106 | -- tid, subject(文章id、标题) 107 | select tid, subject from pre_forum_post limit 10; 108 | -- fid, name(版块id、标题) 109 | select fid, name from pre_forum_forum limit 40; 110 | ``` 111 | 112 | 当我们在各个板块添加帖子之后,如下所示: 113 | 114 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/%E5%B8%96%E5%AD%90.png) 115 | 116 | ### 修改日志格式 117 | 118 | - 查看访问日志 119 | 120 | ```bash 121 | # 日志默认地址 122 | /opt/lampp/logs/access_log 123 | # 实时查看日志命令 124 | tail –f /opt/lampp/logs/access_log 125 | ``` 126 | 127 | - 修改日志格式 128 | 129 | Apache配置文件名称为httpd.conf,完整路径为`/opt/lampp/etc/httpd.conf`。由于默认的日志类型为**common**类型,总共有7个字段。为了获取更多的日志信息,我们需要将其格式修改为**combined**格式,该日志格式共有9个字段。修改方式如下: 130 | 131 | ```bash 132 | # 启用组合日志文件 133 | CustomLog "logs/access_log" combined 134 | ``` 135 | 136 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/%E6%97%A5%E5%BF%97%E6%A0%BC%E5%BC%8F.png) 137 | 138 | - 重新加载配置文件 139 | 140 | ```bash 141 | xampp reload 142 | ``` 143 | 144 | ### Apache日志格式介绍 145 | 146 | ```bash 147 | 192.168.10.1 - - [30/Aug/2020:15:53:15 +0800] "GET /forum.php?mod=forumdisplay&fid=43 HTTP/1.1" 200 30647 "http://kms-4/forum.php" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" 148 | ``` 149 | 150 | 上面的日志格式共有9个字段,分别用空格隔开。每个字段的具体含义如下: 151 | 152 | ```bash 153 | 192.168.10.1 ##(1)客户端的IP地址 154 | - ## (2)客户端identity标识,该字段为"-" 155 | - ## (3)客户端userid标识,该字段为"-" 156 | [30/Aug/2020:15:53:15 +0800] ## (4)服务器完成请求处理时的时间 157 | "GET /forum.php?mod=forumdisplay&fid=43 HTTP/1.1" ## (5)请求类型 请求的资源 使用的协议 158 | 200 ## (6)服务器返回给客户端的状态码,200表示成功 159 | 30647 ## (7)返回给客户端不包括响应头的字节数,如果没有信息返回,则此项应该是"-" 160 | "http://kms-4/forum.php" ## (8)Referer请求头 161 | "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36" ## (9)客户端的浏览器信息 162 | ``` 163 | 164 | 关于上面的日志格式,可以使用正则表达式进行匹配: 165 | 166 | ```bash 167 | (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (\S+) (\S+) (\[.+?\]) (\"(.*?)\") (\d{3}) (\S+) (\"(.*?)\") (\"(.*?)\") 168 | ``` 169 | 170 | ## Flume与Kafka集成 171 | 172 | 本文使用Flume对产生的Apache日志进行收集,然后推送至Kafka。需要启动Flume agent对日志进行收集,对应的配置文件如下: 173 | 174 | ```bash 175 | # agent的名称为a1 176 | a1.sources = source1 177 | a1.channels = channel1 178 | a1.sinks = sink1 179 | 180 | # set source 181 | a1.sources.source1.type = TAILDIR 182 | a1.sources.source1.filegroups = f1 183 | a1.sources.source1.filegroups.f1 = /opt/lampp/logs/access_log 184 | a1sources.source1.fileHeader = flase 185 | 186 | # 配置sink 187 | a1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink 188 | a1.sinks.sink1.brokerList=kms-2:9092,kms-3:9092,kms-4:9092 189 | a1.sinks.sink1.topic= user_access_logs 190 | a1.sinks.sink1.kafka.flumeBatchSize = 20 191 | a1.sinks.sink1.kafka.producer.acks = 1 192 | a1.sinks.sink1.kafka.producer.linger.ms = 1 193 | a1.sinks.sink1.kafka.producer.compression.type = snappy 194 | 195 | # 配置channel 196 | a1.channels.channel1.type = file 197 | a1.channels.channel1.checkpointDir = /home/kms/data/flume_data/checkpoint 198 | a1.channels.channel1.dataDirs= /home/kms/data/flume_data/data 199 | 200 | # 配置bind 201 | a1.sources.source1.channels = channel1 202 | a1.sinks.sink1.channel = channel1 203 | 204 | ``` 205 | 206 | > 知识点: 207 | > 208 | > **Taildir Source**相比**Exec Source**、**Spooling Directory Source**的优势是什么? 209 | > 210 | > **TailDir Source**:断点续传、多目录。Flume1.6以前需要自己自定义Source记录每次读取文件位置,实现断点续传 211 | > 212 | > **Exec Source**:可以实时收集数据,但是在Flume不运行或者Shell命令出错的情况下,数据将会丢失 213 | > 214 | > **Spooling Directory Source**:监控目录,不支持断点续传 215 | 216 | 值得注意的是,上面的配置是直接将原始日志push到Kafka。除此之外,我们还可以自定义Flume的拦截器对原始日志先进行过滤处理,同时也可以实现将不同的日志push到Kafka的不同Topic中。 217 | 218 | ### 启动Flume Agent 219 | 220 | 将启动Agent的命令封装成shell脚本:**start-log-collection.sh **,脚本内容如下: 221 | 222 | ```shell 223 | #!/bin/bash 224 | echo "start log agent !!!" 225 | /opt/modules/apache-flume-1.9.0-bin/bin/flume-ng agent --conf-file /opt/modules/apache-flume-1.9.0-bin/conf/log_collection.conf --name a1 -Dflume.root.logger=INFO,console 226 | ``` 227 | 228 | ### 查看push到Kafka的日志数据 229 | 230 | 将控制台消费者命令封装成shell脚本:**kafka-consumer.sh**,脚本内容如下: 231 | 232 | ```bash 233 | #!/bin/bash 234 | echo "kafka consumer " 235 | bin/kafka-console-consumer.sh --bootstrap-server kms-2.apache.com:9092,kms-3.apache.com:9092,kms-4.apache.com:9092 --topic $1 --from-beginning 236 | ``` 237 | 238 | 使用下面命令消费Kafka中的数据: 239 | 240 | ```shell 241 | [kms@kms-2 kafka_2.11-2.1.0]$ ./kafka-consumer.sh user_access_logs 242 | ``` 243 | 244 | ## 日志分析处理流程 245 | 246 | ### 创建MySQL数据库和目标表 247 | 248 | ```sql 249 | -- 客户端访问量统计 250 | CREATE TABLE `client_ip_access` ( 251 | `client_ip` char(50) NOT NULL COMMENT '客户端ip', 252 | `client_access_cnt` bigint(20) NOT NULL COMMENT '访问次数', 253 | `statistic_time` text NOT NULL COMMENT '统计时间', 254 | PRIMARY KEY (`client_ip`) 255 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8; 256 | -- 热门文章统计 257 | CREATE TABLE `hot_article` ( 258 | `article_id` int(10) NOT NULL COMMENT '文章id', 259 | `subject` varchar(80) NOT NULL COMMENT '文章标题', 260 | `article_pv` bigint(20) NOT NULL COMMENT '访问次数', 261 | `statistic_time` text NOT NULL COMMENT '统计时间', 262 | PRIMARY KEY (`article_id`) 263 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8; 264 | -- 热门板块统计 265 | CREATE TABLE `hot_section` ( 266 | `section_id` int(10) NOT NULL COMMENT '版块id', 267 | `name` char(50) NOT NULL COMMENT '版块标题', 268 | `section_pv` bigint(20) NOT NULL COMMENT '访问次数', 269 | `statistic_time` text NOT NULL COMMENT '统计时间', 270 | PRIMARY KEY (`section_id`) 271 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8; 272 | ``` 273 | 274 | ### AccessLogRecord类 275 | 276 | 该类封装了日志所包含的字段数据,共有9个字段。 277 | 278 | ```java 279 | /** 280 | * 使用lombok 281 | * 原始日志封装类 282 | */ 283 | @Data 284 | public class AccessLogRecord { 285 | public String clientIpAddress; // 客户端ip地址 286 | public String clientIdentity; // 客户端身份标识,该字段为 `-` 287 | public String remoteUser; // 用户标识,该字段为 `-` 288 | public String dateTime; //日期,格式为[day/month/yearhourminutesecond zone] 289 | public String request; // url请求,如:`GET /foo ...` 290 | public String httpStatusCode; // 状态码,如:200; 404. 291 | public String bytesSent; // 传输的字节数,有可能是 `-` 292 | public String referer; // 参考链接,即来源页 293 | public String userAgent; // 浏览器和操作系统类型 294 | } 295 | ``` 296 | 297 | ### LogParse类 298 | 299 | 该类是日志解析类,通过正则表达式对日志进行匹配,对匹配上的日志进行按照字段解析。 300 | 301 | ```java 302 | public class LogParse implements Serializable { 303 | 304 | //构建正则表达式 305 | private String regex = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) (\\S+) (\\S+) (\\[.+?\\]) (\\\"(.*?)\\\") (\\d{3}) (\\S+) (\\\"(.*?)\\\") (\\\"(.*?)\\\")"; 306 | private Pattern p = Pattern.compile(regex); 307 | 308 | /* 309 | *构造访问日志的封装类对象 310 | * */ 311 | public AccessLogRecord buildAccessLogRecord(Matcher matcher) { 312 | AccessLogRecord record = new AccessLogRecord(); 313 | record.setClientIpAddress(matcher.group(1)); 314 | record.setClientIdentity(matcher.group(2)); 315 | record.setRemoteUser(matcher.group(3)); 316 | record.setDateTime(matcher.group(4)); 317 | record.setRequest(matcher.group(5)); 318 | record.setHttpStatusCode(matcher.group(6)); 319 | record.setBytesSent(matcher.group(7)); 320 | record.setReferer(matcher.group(8)); 321 | record.setUserAgent(matcher.group(9)); 322 | return record; 323 | 324 | } 325 | 326 | /** 327 | * @param record:record表示一条apache combined 日志 328 | * @return 解析日志记录,将解析的日志封装成一个AccessLogRecord类 329 | */ 330 | public AccessLogRecord parseRecord(String record) { 331 | Matcher matcher = p.matcher(record); 332 | if (matcher.find()) { 333 | return buildAccessLogRecord(matcher); 334 | } 335 | return null; 336 | } 337 | 338 | /** 339 | * @param request url请求,类型为字符串,类似于 "GET /the-uri-here HTTP/1.1" 340 | * @return 一个三元组(requestType, uri, httpVersion). requestType表示请求类型,如GET, POST等 341 | */ 342 | public Tuple3 parseRequestField(String request) { 343 | //请求的字符串格式为:“GET /test.php HTTP/1.1”,用空格切割 344 | String[] arr = request.split(" "); 345 | if (arr.length == 3) { 346 | return Tuple3.of(arr[0], arr[1], arr[2]); 347 | } else { 348 | return null; 349 | } 350 | } 351 | 352 | /** 353 | * 将apache日志中的英文日期转化为指定格式的中文日期 354 | * 355 | * @param dateTime 传入的apache日志中的日期字符串,"[21/Jul/2009:02:48:13 -0700]" 356 | * @return 357 | */ 358 | public String parseDateField(String dateTime) throws ParseException { 359 | // 输入的英文日期格式 360 | String inputFormat = "dd/MMM/yyyy:HH:mm:ss"; 361 | // 输出的日期格式 362 | String outPutFormat = "yyyy-MM-dd HH:mm:ss"; 363 | 364 | String dateRegex = "\\[(.*?) .+]"; 365 | Pattern datePattern = Pattern.compile(dateRegex); 366 | 367 | Matcher dateMatcher = datePattern.matcher(dateTime); 368 | if (dateMatcher.find()) { 369 | String dateString = dateMatcher.group(1); 370 | SimpleDateFormat dateInputFormat = new SimpleDateFormat(inputFormat, Locale.ENGLISH); 371 | Date date = dateInputFormat.parse(dateString); 372 | 373 | SimpleDateFormat dateOutFormat = new SimpleDateFormat(outPutFormat); 374 | 375 | String formatDate = dateOutFormat.format(date); 376 | return formatDate; 377 | } else { 378 | return ""; 379 | } 380 | } 381 | 382 | /** 383 | * 解析request,即访问页面的url信息解析 384 | * "GET /about/forum.php?mod=viewthread&tid=5&extra=page%3D1 HTTP/1.1" 385 | * 匹配出访问的fid:版本id 386 | * 以及tid:文章id 387 | * @param request 388 | * @return 389 | */ 390 | public Tuple2 parseSectionIdAndArticleId(String request) { 391 | // 匹配出前面是"forumdisplay&fid="的数字记为版块id 392 | String sectionIdRegex = "(\\?mod=forumdisplay&fid=)(\\d+)"; 393 | Pattern sectionPattern = Pattern.compile(sectionIdRegex); 394 | // 匹配出前面是"tid="的数字记为文章id 395 | String articleIdRegex = "(\\?mod=viewthread&tid=)(\\d+)"; 396 | Pattern articlePattern = Pattern.compile(articleIdRegex); 397 | 398 | String[] arr = request.split(" "); 399 | String sectionId = ""; 400 | String articleId = ""; 401 | if (arr.length == 3) { 402 | Matcher sectionMatcher = sectionPattern.matcher(arr[1]); 403 | Matcher articleMatcher = articlePattern.matcher(arr[1]); 404 | sectionId = (sectionMatcher.find()) ? sectionMatcher.group(2) : ""; 405 | articleId = (articleMatcher.find()) ? articleMatcher.group(2) : ""; 406 | } 407 | return Tuple2.of(sectionId, articleId); 408 | } 409 | } 410 | ``` 411 | 412 | ### LogAnalysis类 413 | 414 | 该类是日志处理的基本逻辑 415 | 416 | ```java 417 | public class LogAnalysis { 418 | 419 | public static void main(String[] args) throws Exception { 420 | 421 | StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment(); 422 | // 开启checkpoint,时间间隔为毫秒 423 | senv.enableCheckpointing(5000L); 424 | // 选择状态后端 425 | // 本地测试 426 | // senv.setStateBackend(new FsStateBackend("file:///E://checkpoint")); 427 | // 集群运行 428 | senv.setStateBackend(new FsStateBackend("hdfs://kms-1:8020/flink-checkpoints")); 429 | // 重启策略 430 | senv.setRestartStrategy( 431 | RestartStrategies.fixedDelayRestart(3, Time.of(2, TimeUnit.SECONDS) )); 432 | 433 | EnvironmentSettings settings = EnvironmentSettings.newInstance() 434 | .useBlinkPlanner() 435 | .inStreamingMode() 436 | .build(); 437 | StreamTableEnvironment tEnv = StreamTableEnvironment.create(senv, settings); 438 | // kafka参数配置 439 | Properties props = new Properties(); 440 | // kafka broker地址 441 | props.put("bootstrap.servers", "kms-2:9092,kms-3:9092,kms-4:9092"); 442 | // 消费者组 443 | props.put("group.id", "log_consumer"); 444 | // kafka 消息的key序列化器 445 | props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); 446 | // kafka 消息的value序列化器 447 | props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); 448 | props.put("auto.offset.reset", "earliest"); 449 | 450 | FlinkKafkaConsumer kafkaConsumer = new FlinkKafkaConsumer( 451 | "user_access_logs", 452 | new SimpleStringSchema(), 453 | props); 454 | 455 | DataStreamSource logSource = senv.addSource(kafkaConsumer); 456 | // 获取有效的日志数据 457 | DataStream availableAccessLog = LogAnalysis.getAvailableAccessLog(logSource); 458 | // 获取[clienIP,accessDate,sectionId,articleId] 459 | DataStream> fieldFromLog = LogAnalysis.getFieldFromLog(availableAccessLog); 460 | //从DataStream中创建临时视图,名称为logs 461 | // 添加一个计算字段:proctime,用于维表JOIN 462 | tEnv.createTemporaryView("logs", 463 | fieldFromLog, 464 | $("clientIP"), 465 | $("accessDate"), 466 | $("sectionId"), 467 | $("articleId"), 468 | $("proctime").proctime()); 469 | 470 | // 需求1:统计热门板块 471 | LogAnalysis.getHotSection(tEnv); 472 | // 需求2:统计热门文章 473 | LogAnalysis.getHotArticle(tEnv); 474 | // 需求3:统计不同客户端ip对版块和文章的总访问量 475 | LogAnalysis.getClientAccess(tEnv); 476 | senv.execute("log-analysisi"); 477 | } 478 | 479 | /** 480 | * 统计不同客户端ip对版块和文章的总访问量 481 | * @param tEnv 482 | */ 483 | private static void getClientAccess(StreamTableEnvironment tEnv) { 484 | // sink表 485 | // [client_ip,client_access_cnt,statistic_time] 486 | // [客户端ip,访问次数,统计时间] 487 | String client_ip_access_ddl = "" + 488 | "CREATE TABLE client_ip_access (\n" + 489 | " client_ip STRING ,\n" + 490 | " client_access_cnt BIGINT,\n" + 491 | " statistic_time STRING,\n" + 492 | " PRIMARY KEY (client_ip) NOT ENFORCED\n" + 493 | ")WITH (\n" + 494 | " 'connector' = 'jdbc',\n" + 495 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" + 496 | " 'table-name' = 'client_ip_access', \n" + 497 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 498 | " 'username' = 'root',\n" + 499 | " 'password' = '123qwe'\n" + 500 | ") "; 501 | 502 | tEnv.executeSql(client_ip_access_ddl); 503 | 504 | String client_ip_access_sql = "" + 505 | "INSERT INTO client_ip_access\n" + 506 | "SELECT\n" + 507 | " clientIP,\n" + 508 | " count(1) AS access_cnt,\n" + 509 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" + 510 | "FROM\n" + 511 | " logs \n" + 512 | "WHERE\n" + 513 | " articleId <> 0 \n" + 514 | " OR sectionId <> 0 \n" + 515 | "GROUP BY\n" + 516 | " clientIP " 517 | ; 518 | tEnv.executeSql(client_ip_access_sql); 519 | 520 | } 521 | 522 | /** 523 | * 统计热门文章 524 | * @param tEnv 525 | */ 526 | 527 | private static void getHotArticle(StreamTableEnvironment tEnv) { 528 | // JDBC数据源 529 | // 文章id及标题对应关系的表,[tid, subject]分别为:文章id和标题 530 | String pre_forum_post_ddl = "" + 531 | "CREATE TABLE pre_forum_post (\n" + 532 | " tid INT,\n" + 533 | " subject STRING,\n" + 534 | " PRIMARY KEY (tid) NOT ENFORCED\n" + 535 | ") WITH (\n" + 536 | " 'connector' = 'jdbc',\n" + 537 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" + 538 | " 'table-name' = 'pre_forum_post', \n" + 539 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 540 | " 'username' = 'root',\n" + 541 | " 'password' = '123qwe'\n" + 542 | ")"; 543 | // 创建pre_forum_post数据源 544 | tEnv.executeSql(pre_forum_post_ddl); 545 | // 创建MySQL的sink表 546 | // [article_id,subject,article_pv,statistic_time] 547 | // [文章id,标题名称,访问次数,统计时间] 548 | String hot_article_ddl = "" + 549 | "CREATE TABLE hot_article (\n" + 550 | " article_id INT,\n" + 551 | " subject STRING,\n" + 552 | " article_pv BIGINT ,\n" + 553 | " statistic_time STRING,\n" + 554 | " PRIMARY KEY (article_id) NOT ENFORCED\n" + 555 | ")WITH (\n" + 556 | " 'connector' = 'jdbc',\n" + 557 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" + 558 | " 'table-name' = 'hot_article', \n" + 559 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 560 | " 'username' = 'root',\n" + 561 | " 'password' = '123qwe'\n" + 562 | ")"; 563 | tEnv.executeSql(hot_article_ddl); 564 | // 向MySQL目标表insert数据 565 | String hot_article_sql = "" + 566 | "INSERT INTO hot_article\n" + 567 | "SELECT \n" + 568 | " a.articleId,\n" + 569 | " b.subject,\n" + 570 | " count(1) as article_pv,\n" + 571 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" + 572 | "FROM logs a \n" + 573 | " JOIN pre_forum_post FOR SYSTEM_TIME AS OF a.proctime as b ON a.articleId = b.tid\n" + 574 | "WHERE a.articleId <> 0\n" + 575 | "GROUP BY a.articleId,b.subject\n" + 576 | "ORDER BY count(1) desc\n" + 577 | "LIMIT 10"; 578 | 579 | tEnv.executeSql(hot_article_sql); 580 | 581 | } 582 | 583 | /** 584 | * 统计热门板块 585 | * 586 | * @param tEnv 587 | */ 588 | public static void getHotSection(StreamTableEnvironment tEnv) { 589 | 590 | // 板块id及其名称对应关系表,[fid, name]分别为:版块id和板块名称 591 | String pre_forum_forum_ddl = "" + 592 | "CREATE TABLE pre_forum_forum (\n" + 593 | " fid INT,\n" + 594 | " name STRING,\n" + 595 | " PRIMARY KEY (fid) NOT ENFORCED\n" + 596 | ") WITH (\n" + 597 | " 'connector' = 'jdbc',\n" + 598 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" + 599 | " 'table-name' = 'pre_forum_forum', \n" + 600 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 601 | " 'username' = 'root',\n" + 602 | " 'password' = '123qwe',\n" + 603 | " 'lookup.cache.ttl' = '10',\n" + 604 | " 'lookup.cache.max-rows' = '1000'" + 605 | ")"; 606 | // 创建pre_forum_forum数据源 607 | tEnv.executeSql(pre_forum_forum_ddl); 608 | 609 | // 创建MySQL的sink表 610 | // [section_id,name,section_pv,statistic_time] 611 | // [板块id,板块名称,访问次数,统计时间] 612 | String hot_section_ddl = "" + 613 | "CREATE TABLE hot_section (\n" + 614 | " section_id INT,\n" + 615 | " name STRING ,\n" + 616 | " section_pv BIGINT,\n" + 617 | " statistic_time STRING,\n" + 618 | " PRIMARY KEY (section_id) NOT ENFORCED \n" + 619 | ") WITH (\n" + 620 | " 'connector' = 'jdbc',\n" + 621 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" + 622 | " 'table-name' = 'hot_section', \n" + 623 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 624 | " 'username' = 'root',\n" + 625 | " 'password' = '123qwe'\n" + 626 | ")"; 627 | 628 | // 创建sink表:hot_section 629 | tEnv.executeSql(hot_section_ddl); 630 | 631 | //统计热门板块 632 | // 使用日志流与MySQL的维表数据进行JOIN 633 | // 从而获取板块名称 634 | String hot_section_sql = "" + 635 | "INSERT INTO hot_section\n" + 636 | "SELECT\n" + 637 | " a.sectionId,\n" + 638 | " b.name,\n" + 639 | " count(1) as section_pv,\n" + 640 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time \n" + 641 | "FROM\n" + 642 | " logs a\n" + 643 | " JOIN pre_forum_forum FOR SYSTEM_TIME AS OF a.proctime as b ON a.sectionId = b.fid \n" + 644 | "WHERE\n" + 645 | " a.sectionId <> 0 \n" + 646 | "GROUP BY a.sectionId, b.name\n" + 647 | "ORDER BY count(1) desc\n" + 648 | "LIMIT 10"; 649 | // 执行数据insert 650 | tEnv.executeSql(hot_section_sql); 651 | 652 | } 653 | 654 | /** 655 | * 获取[clienIP,accessDate,sectionId,articleId] 656 | * 分别为客户端ip,访问日期,板块id,文章id 657 | * 658 | * @param logRecord 659 | * @return 660 | */ 661 | public static DataStream> getFieldFromLog(DataStream logRecord) { 662 | DataStream> fieldFromLog = logRecord.map(new MapFunction>() { 663 | @Override 664 | public Tuple4 map(AccessLogRecord accessLogRecord) throws Exception { 665 | LogParse parse = new LogParse(); 666 | 667 | String clientIpAddress = accessLogRecord.getClientIpAddress(); 668 | String dateTime = accessLogRecord.getDateTime(); 669 | String request = accessLogRecord.getRequest(); 670 | String formatDate = parse.parseDateField(dateTime); 671 | Tuple2 sectionIdAndArticleId = parse.parseSectionIdAndArticleId(request); 672 | if (formatDate == "" || sectionIdAndArticleId == Tuple2.of("", "")) { 673 | 674 | return new Tuple4("0.0.0.0", "0000-00-00 00:00:00", 0, 0); 675 | } 676 | Integer sectionId = (sectionIdAndArticleId.f0 == "") ? 0 : Integer.parseInt(sectionIdAndArticleId.f0); 677 | Integer articleId = (sectionIdAndArticleId.f1 == "") ? 0 : Integer.parseInt(sectionIdAndArticleId.f1); 678 | return new Tuple4<>(clientIpAddress, formatDate, sectionId, articleId); 679 | } 680 | }); 681 | return fieldFromLog; 682 | } 683 | 684 | /** 685 | * 筛选可用的日志记录 686 | * 687 | * @param accessLog 688 | * @return 689 | */ 690 | public static DataStream getAvailableAccessLog(DataStream accessLog) { 691 | final LogParse logParse = new LogParse(); 692 | //解析原始日志,将其解析为AccessLogRecord格式 693 | DataStream filterDS = accessLog.map(new MapFunction() { 694 | @Override 695 | public AccessLogRecord map(String log) throws Exception { 696 | return logParse.parseRecord(log); 697 | } 698 | }).filter(new FilterFunction() { 699 | //过滤掉无效日志 700 | @Override 701 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception { 702 | return !(accessLogRecord == null); 703 | } 704 | }).filter(new FilterFunction() { 705 | //过滤掉状态码非200的记录,即保留请求成功的日志记录 706 | @Override 707 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception { 708 | return !accessLogRecord.getHttpStatusCode().equals("200"); 709 | } 710 | }); 711 | return filterDS; 712 | } 713 | } 714 | ``` 715 | 716 | 将上述代码打包上传到集群运行,在执行提交命令之前,需要先将Hadoop的依赖jar包放置在Flink安装目录下的lib文件下:**flink-shaded-hadoop-2-uber-2.7.5-10.0.jar**,因为我们配置了HDFS上的状态后端,而Flink的release包不含有Hadoop的依赖Jar包。 717 | 718 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/jar%E5%8C%85.png) 719 | 720 | 否则会报如下错误: 721 | 722 | ```bash 723 | Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Hadoop is not in the classpath/dependencies. 724 | ``` 725 | 726 | ### 提交到集群 727 | 728 | 编写提交命令脚本 729 | 730 | ```bash 731 | #!/bin/bash 732 | /opt/modules/flink-1.11.1/bin/flink run -m kms-1:8081 \ 733 | -c com.jmx.analysis.LogAnalysis \ 734 | /opt/softwares/com.jmx-1.0-SNAPSHOT.jar 735 | ``` 736 | 737 | 提交之后,访问Flink的Web界面,查看任务: 738 | 739 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/FlinkWEB.png) 740 | 741 | 此时访问论坛,点击板块和帖子文章,观察数据库变化: 742 | 743 | ![](https://github.com/jiamx/flink-log-analysis/blob/master/%E7%BB%93%E6%9E%9C.png) 744 | 745 | ## 总结 746 | 747 | 本文主要分享了从0到1构建一个用户行为日志分析系统。首先,基于discuz搭建了论坛平台,针对论坛产生的日志,使用Flume进行收集并push到Kafka中;接着使用Flink对其进行分析处理;最后将处理结果写入MySQL供可视化展示使用。 748 | -------------------------------------------------------------------------------- /flink-log-analysis.iml: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /jar包.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/jar包.png -------------------------------------------------------------------------------- /pom.xml: -------------------------------------------------------------------------------- 1 | 2 | 5 | 4.0.0 6 | 7 | flink-log-analysis 8 | com.jmx 9 | 1.0-SNAPSHOT 10 | 11 | 1.8 12 | 1.11.1 13 | 1.8 14 | 2.3.4 15 | 2.11 16 | 17 | 1.8 18 | 1.8 19 | UTF-8 20 | 21 | 22 | 23 | 24 | 25 | org.apache.flink 26 | flink-connector-kafka_2.11 27 | ${flink.version} 28 | 29 | 30 | org.apache.flink 31 | flink-streaming-java_${scala.binary.version} 32 | ${flink.version} 33 | provided 34 | 35 | 36 | org.apache.flink 37 | flink-table-api-java-bridge_${scala.binary.version} 38 | ${flink.version} 39 | provided 40 | 41 | 42 | 43 | org.apache.flink 44 | flink-table-planner-blink_${scala.binary.version} 45 | ${flink.version} 46 | provided 47 | 48 | 49 | 50 | org.apache.flink 51 | flink-table-common 52 | ${flink.version} 53 | provided 54 | 55 | 56 | org.apache.flink 57 | flink-connector-jdbc_${scala.binary.version} 58 | ${flink.version} 59 | 60 | 61 | 62 | mysql 63 | mysql-connector-java 64 | 8.0.20 65 | 66 | 67 | 68 | org.apache.flink 69 | flink-clients_${scala.binary.version} 70 | ${flink.version} 71 | 72 | 73 | com.typesafe 74 | config 75 | 1.2.1 76 | 77 | 78 | 79 | org.slf4j 80 | slf4j-api 81 | 1.7.25 82 | 83 | 84 | org.slf4j 85 | slf4j-simple 86 | 1.7.25 87 | 88 | 89 | 90 | org.apache.kafka 91 | kafka-clients 92 | 2.1.0 93 | 94 | 95 | 96 | org.projectlombok 97 | lombok 98 | 1.16.18 99 | 100 | 101 | 102 | 103 | 104 | 105 | -------------------------------------------------------------------------------- /src/main/java/com/jmx/analysis/LogAnalysis.java: -------------------------------------------------------------------------------- 1 | package com.jmx.analysis; 2 | 3 | import com.jmx.bean.AccessLogRecord; 4 | import org.apache.flink.api.common.functions.FilterFunction; 5 | import org.apache.flink.api.common.functions.MapFunction; 6 | import org.apache.flink.api.common.restartstrategy.RestartStrategies; 7 | import org.apache.flink.api.common.serialization.SimpleStringSchema; 8 | import org.apache.flink.api.common.time.Time; 9 | import org.apache.flink.api.java.tuple.Tuple2; 10 | import org.apache.flink.api.java.tuple.Tuple4; 11 | import org.apache.flink.runtime.state.filesystem.FsStateBackend; 12 | import org.apache.flink.streaming.api.datastream.DataStream; 13 | import org.apache.flink.streaming.api.datastream.DataStreamSource; 14 | import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment; 15 | import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer; 16 | import org.apache.flink.table.api.EnvironmentSettings; 17 | import org.apache.flink.table.api.bridge.java.StreamTableEnvironment; 18 | 19 | import java.util.Properties; 20 | import java.util.concurrent.TimeUnit; 21 | 22 | import static org.apache.flink.table.api.Expressions.$; 23 | 24 | /** 25 | *  @Created with IntelliJ IDEA. 26 | *  @author : jmx 27 | *  @Date: 2020/8/24 28 | *  @Time: 22:19 29 | *   30 | */ 31 | public class LogAnalysis { 32 | 33 | 34 | public static void main(String[] args) throws Exception { 35 | 36 | StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment(); 37 | // 开启checkpoint,时间间隔为毫秒 38 | senv.enableCheckpointing(5000L); 39 | // 选择状态后端 40 | // 本地测试 41 | // senv.setStateBackend(new FsStateBackend("file:///E://checkpoint")); 42 | // 集群运行 43 | senv.setStateBackend(new FsStateBackend("hdfs://kms-1:8020/flink-checkpoints")); 44 | // 重启策略 45 | senv.setRestartStrategy( 46 | RestartStrategies.fixedDelayRestart(3, Time.of(2, TimeUnit.SECONDS) )); 47 | 48 | EnvironmentSettings settings = EnvironmentSettings.newInstance() 49 | .useBlinkPlanner() 50 | .inStreamingMode() 51 | .build(); 52 | StreamTableEnvironment tEnv = StreamTableEnvironment.create(senv, settings); 53 | // kafka参数配置 54 | Properties props = new Properties(); 55 | // kafka broker地址 56 | props.put("bootstrap.servers", "kms-2:9092,kms-3:9092,kms-4:9092"); 57 | // 消费者组 58 | props.put("group.id", "log_consumer"); 59 | // kafka 消息的key序列化器 60 | props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); 61 | // kafka 消息的value序列化器 62 | props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); 63 | props.put("auto.offset.reset", "earliest"); 64 | 65 | FlinkKafkaConsumer kafkaConsumer = new FlinkKafkaConsumer( 66 | "user_access_logs", 67 | new SimpleStringSchema(), 68 | props); 69 | 70 | DataStreamSource logSource = senv.addSource(kafkaConsumer); 71 | // 获取有效的日志数据 72 | DataStream availableAccessLog = LogAnalysis.getAvailableAccessLog(logSource); 73 | // 获取[clienIP,accessDate,sectionId,articleId] 74 | DataStream> fieldFromLog = LogAnalysis.getFieldFromLog(availableAccessLog); 75 | //从DataStream中创建临时视图,名称为logs 76 | // 添加一个计算字段:proctime,用于维表JOIN 77 | tEnv.createTemporaryView("logs", 78 | fieldFromLog, 79 | $("clientIP"), 80 | $("accessDate"), 81 | $("sectionId"), 82 | $("articleId"), 83 | $("proctime").proctime()); 84 | 85 | // 统计热门板块 86 | LogAnalysis.getHotSection(tEnv); 87 | // 统计热门文章 88 | LogAnalysis.getHotArticle(tEnv); 89 | // 统计不同客户端ip对版块和文章的总访问量 90 | LogAnalysis.getClientAccess(tEnv); 91 | 92 | senv.execute("log-analysisi"); 93 | 94 | } 95 | 96 | private static void getClientAccess(StreamTableEnvironment tEnv) { 97 | // sink表 98 | // [client_ip,client_access_cnt,statistic_time] 99 | // [客户端ip,访问次数,统计时间] 100 | String client_ip_access_ddl = "" + 101 | "CREATE TABLE client_ip_access (\n" + 102 | " client_ip STRING ,\n" + 103 | " client_access_cnt BIGINT,\n" + 104 | " statistic_time STRING,\n" + 105 | " PRIMARY KEY (client_ip) NOT ENFORCED\n" + 106 | ")WITH (\n" + 107 | " 'connector' = 'jdbc',\n" + 108 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" + 109 | " 'table-name' = 'client_ip_access', \n" + 110 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 111 | " 'username' = 'root',\n" + 112 | " 'password' = '123qwe'\n" + 113 | ") "; 114 | 115 | tEnv.executeSql(client_ip_access_ddl); 116 | 117 | String client_ip_access_sql = "" + 118 | "INSERT INTO client_ip_access\n" + 119 | "SELECT\n" + 120 | " clientIP,\n" + 121 | " count(1) AS access_cnt,\n" + 122 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" + 123 | "FROM\n" + 124 | " logs \n" + 125 | "WHERE\n" + 126 | " articleId <> 0 \n" + 127 | " OR sectionId <> 0 \n" + 128 | "GROUP BY\n" + 129 | " clientIP " 130 | ; 131 | tEnv.executeSql(client_ip_access_sql); 132 | 133 | } 134 | 135 | private static void getHotArticle(StreamTableEnvironment tEnv) { 136 | // JDBC数据源 137 | // 文章id及标题对应关系的表,[tid, subject]分别为:文章id和标题 138 | String pre_forum_post_ddl = "" + 139 | "CREATE TABLE pre_forum_post (\n" + 140 | " tid INT,\n" + 141 | " subject STRING,\n" + 142 | " PRIMARY KEY (tid) NOT ENFORCED\n" + 143 | ") WITH (\n" + 144 | " 'connector' = 'jdbc',\n" + 145 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" + 146 | " 'table-name' = 'pre_forum_post', \n" + 147 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 148 | " 'username' = 'root',\n" + 149 | " 'password' = '123qwe'\n" + 150 | ")"; 151 | // 创建pre_forum_post数据源 152 | tEnv.executeSql(pre_forum_post_ddl); 153 | // 创建MySQL的sink表 154 | // [article_id,subject,article_pv,statistic_time] 155 | // [文章id,标题名称,访问次数,统计时间] 156 | String hot_article_ddl = "" + 157 | "CREATE TABLE hot_article (\n" + 158 | " article_id INT,\n" + 159 | " subject STRING,\n" + 160 | " article_pv BIGINT ,\n" + 161 | " statistic_time STRING,\n" + 162 | " PRIMARY KEY (article_id) NOT ENFORCED\n" + 163 | ")WITH (\n" + 164 | " 'connector' = 'jdbc',\n" + 165 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" + 166 | " 'table-name' = 'hot_article', \n" + 167 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 168 | " 'username' = 'root',\n" + 169 | " 'password' = '123qwe'\n" + 170 | ")"; 171 | tEnv.executeSql(hot_article_ddl); 172 | // 向MySQL目标表insert数据 173 | String hot_article_sql = "" + 174 | "INSERT INTO hot_article\n" + 175 | "SELECT \n" + 176 | " a.articleId,\n" + 177 | " b.subject,\n" + 178 | " count(1) as article_pv,\n" + 179 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time\n" + 180 | "FROM logs a \n" + 181 | " JOIN pre_forum_post FOR SYSTEM_TIME AS OF a.proctime as b ON a.articleId = b.tid\n" + 182 | "WHERE a.articleId <> 0\n" + 183 | "GROUP BY a.articleId,b.subject\n" + 184 | "ORDER BY count(1) desc\n" + 185 | "LIMIT 10"; 186 | 187 | tEnv.executeSql(hot_article_sql); 188 | 189 | } 190 | 191 | /** 192 | * 统计热门板块 193 | * 194 | * @param tEnv 195 | */ 196 | public static void getHotSection(StreamTableEnvironment tEnv) { 197 | 198 | // 板块id及其名称对应关系表,[fid, name]分别为:版块id和板块名称 199 | String pre_forum_forum_ddl = "" + 200 | "CREATE TABLE pre_forum_forum (\n" + 201 | " fid INT,\n" + 202 | " name STRING,\n" + 203 | " PRIMARY KEY (fid) NOT ENFORCED\n" + 204 | ") WITH (\n" + 205 | " 'connector' = 'jdbc',\n" + 206 | " 'url' = 'jdbc:mysql://kms-4:3306/ultrax',\n" + 207 | " 'table-name' = 'pre_forum_forum', \n" + 208 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 209 | " 'username' = 'root',\n" + 210 | " 'password' = '123qwe',\n" + 211 | " 'lookup.cache.ttl' = '10',\n" + 212 | " 'lookup.cache.max-rows' = '1000'" + 213 | ")"; 214 | // 创建pre_forum_forum数据源 215 | tEnv.executeSql(pre_forum_forum_ddl); 216 | 217 | // 创建MySQL的sink表 218 | // [section_id,name,section_pv,statistic_time] 219 | // [板块id,板块名称,访问次数,统计时间] 220 | String hot_section_ddl = "" + 221 | "CREATE TABLE hot_section (\n" + 222 | " section_id INT,\n" + 223 | " name STRING ,\n" + 224 | " section_pv BIGINT,\n" + 225 | " statistic_time STRING,\n" + 226 | " PRIMARY KEY (section_id) NOT ENFORCED \n" + 227 | ") WITH (\n" + 228 | " 'connector' = 'jdbc',\n" + 229 | " 'url' = 'jdbc:mysql://kms-4:3306/statistics?useUnicode=true&characterEncoding=utf-8',\n" + 230 | " 'table-name' = 'hot_section', \n" + 231 | " 'driver' = 'com.mysql.jdbc.Driver',\n" + 232 | " 'username' = 'root',\n" + 233 | " 'password' = '123qwe'\n" + 234 | ")"; 235 | 236 | // 创建sink表:hot_section 237 | tEnv.executeSql(hot_section_ddl); 238 | 239 | //统计热门板块 240 | // 使用日志流与MySQL的维表数据进行JOIN 241 | // 从而获取板块名称 242 | String hot_section_sql = "" + 243 | "INSERT INTO hot_section\n" + 244 | "SELECT\n" + 245 | " a.sectionId,\n" + 246 | " b.name,\n" + 247 | " count(1) as section_pv,\n" + 248 | " FROM_UNIXTIME(UNIX_TIMESTAMP()) AS statistic_time \n" + 249 | "FROM\n" + 250 | " logs a\n" + 251 | " JOIN pre_forum_forum FOR SYSTEM_TIME AS OF a.proctime as b ON a.sectionId = b.fid \n" + 252 | "WHERE\n" + 253 | " a.sectionId <> 0 \n" + 254 | "GROUP BY a.sectionId, b.name\n" + 255 | "ORDER BY count(1) desc\n" + 256 | "LIMIT 10"; 257 | // 执行数据insert 258 | tEnv.executeSql(hot_section_sql); 259 | 260 | } 261 | 262 | /** 263 | * 获取[clienIP,accessDate,sectionId,articleId] 264 | * 分别为客户端ip,访问日期,板块id,文章id 265 | * 266 | * @param logRecord 267 | * @return 268 | */ 269 | public static DataStream> getFieldFromLog(DataStream logRecord) { 270 | DataStream> fieldFromLog = logRecord.map(new MapFunction>() { 271 | @Override 272 | public Tuple4 map(AccessLogRecord accessLogRecord) throws Exception { 273 | LogParse parse = new LogParse(); 274 | 275 | String clientIpAddress = accessLogRecord.getClientIpAddress(); 276 | String dateTime = accessLogRecord.getDateTime(); 277 | String request = accessLogRecord.getRequest(); 278 | String formatDate = parse.parseDateField(dateTime); 279 | Tuple2 sectionIdAndArticleId = parse.parseSectionIdAndArticleId(request); 280 | if (formatDate == "" || sectionIdAndArticleId == Tuple2.of("", "")) { 281 | 282 | return new Tuple4("0.0.0.0", "0000-00-00 00:00:00", 0, 0); 283 | } 284 | Integer sectionId = (sectionIdAndArticleId.f0 == "") ? 0 : Integer.parseInt(sectionIdAndArticleId.f0); 285 | Integer articleId = (sectionIdAndArticleId.f1 == "") ? 0 : Integer.parseInt(sectionIdAndArticleId.f1); 286 | return new Tuple4<>(clientIpAddress, formatDate, sectionId, articleId); 287 | } 288 | }); 289 | 290 | 291 | return fieldFromLog; 292 | } 293 | 294 | /** 295 | * 筛选可用的日志记录 296 | * 297 | * @param accessLog 298 | * @return 299 | */ 300 | public static DataStream getAvailableAccessLog(DataStream accessLog) { 301 | final LogParse logParse = new LogParse(); 302 | //解析原始日志,将其解析为AccessLogRecord格式 303 | DataStream filterDS = accessLog.map(new MapFunction() { 304 | @Override 305 | public AccessLogRecord map(String log) throws Exception { 306 | return logParse.parseRecord(log); 307 | } 308 | }).filter(new FilterFunction() { 309 | //过滤掉无效日志 310 | @Override 311 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception { 312 | return !(accessLogRecord == null); 313 | } 314 | }).filter(new FilterFunction() { 315 | //过滤掉状态码非200的记录,即保留请求成功的日志记录 316 | @Override 317 | public boolean filter(AccessLogRecord accessLogRecord) throws Exception { 318 | return !accessLogRecord.getHttpStatusCode().equals("200"); 319 | } 320 | }); 321 | return filterDS; 322 | 323 | } 324 | 325 | 326 | } 327 | -------------------------------------------------------------------------------- /src/main/java/com/jmx/analysis/LogParse.java: -------------------------------------------------------------------------------- 1 | package com.jmx.analysis; 2 | 3 | import com.jmx.bean.AccessLogRecord; 4 | import org.apache.flink.api.java.tuple.Tuple2; 5 | import org.apache.flink.api.java.tuple.Tuple3; 6 | 7 | import java.io.Serializable; 8 | import java.text.ParseException; 9 | import java.text.SimpleDateFormat; 10 | import java.util.Date; 11 | import java.util.Locale; 12 | import java.util.regex.Matcher; 13 | import java.util.regex.Pattern; 14 | 15 | /** 16 | *  @Created with IntelliJ IDEA. 17 | *  @author : jmx 18 | *  @Date: 2020/8/24 19 | *  @Time: 22:21 20 | *   21 | */ 22 | public class LogParse implements Serializable { 23 | 24 | //构建正则表达式 25 | private String regex = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) (\\S+) (\\S+) (\\[.+?\\]) (\\\"(.*?)\\\") (\\d{3}) (\\S+) (\\\"(.*?)\\\") (\\\"(.*?)\\\")"; 26 | private Pattern p = Pattern.compile(regex); 27 | 28 | /* 29 | *构造访问日志的封装类对象 30 | * */ 31 | public AccessLogRecord buildAccessLogRecord(Matcher matcher) { 32 | AccessLogRecord record = new AccessLogRecord(); 33 | record.setClientIpAddress(matcher.group(1)); 34 | record.setClientIdentity(matcher.group(2)); 35 | record.setRemoteUser(matcher.group(3)); 36 | record.setDateTime(matcher.group(4)); 37 | record.setRequest(matcher.group(5)); 38 | record.setHttpStatusCode(matcher.group(6)); 39 | record.setBytesSent(matcher.group(7)); 40 | record.setReferer(matcher.group(8)); 41 | record.setUserAgent(matcher.group(9)); 42 | return record; 43 | 44 | } 45 | 46 | /** 47 | * @param record:record表示一条apache combined 日志 48 | * @return 解析日志记录,将解析的日志封装成一个AccessLogRecord类 49 | */ 50 | public AccessLogRecord parseRecord(String record) { 51 | Matcher matcher = p.matcher(record); 52 | if (matcher.find()) { 53 | return buildAccessLogRecord(matcher); 54 | } 55 | return null; 56 | } 57 | 58 | /** 59 | * @param request url请求,类型为字符串,类似于 "GET /the-uri-here HTTP/1.1" 60 | * @return 一个三元组(requestType, uri, httpVersion). requestType表示请求类型,如GET, POST等 61 | */ 62 | public Tuple3 parseRequestField(String request) { 63 | //请求的字符串格式为:“GET /test.php HTTP/1.1”,用空格切割 64 | String[] arr = request.split(" "); 65 | if (arr.length == 3) { 66 | return Tuple3.of(arr[0], arr[1], arr[2]); 67 | } else { 68 | return null; 69 | } 70 | 71 | } 72 | 73 | /** 74 | * 将apache日志中的英文日期转化为指定格式的中文日期 75 | * 76 | * @param dateTime 传入的apache日志中的日期字符串,"[21/Jul/2009:02:48:13 -0700]" 77 | * @return 78 | */ 79 | public String parseDateField(String dateTime) throws ParseException { 80 | // 输入的英文日期格式 81 | String inputFormat = "dd/MMM/yyyy:HH:mm:ss"; 82 | // 输出的日期格式 83 | String outPutFormat = "yyyy-MM-dd HH:mm:ss"; 84 | 85 | String dateRegex = "\\[(.*?) .+]"; 86 | Pattern datePattern = Pattern.compile(dateRegex); 87 | 88 | Matcher dateMatcher = datePattern.matcher(dateTime); 89 | if (dateMatcher.find()) { 90 | String dateString = dateMatcher.group(1); 91 | SimpleDateFormat dateInputFormat = new SimpleDateFormat(inputFormat, Locale.ENGLISH); 92 | Date date = dateInputFormat.parse(dateString); 93 | 94 | SimpleDateFormat dateOutFormat = new SimpleDateFormat(outPutFormat); 95 | 96 | String formatDate = dateOutFormat.format(date); 97 | return formatDate; 98 | } else { 99 | return ""; 100 | } 101 | } 102 | 103 | /** 104 | * 解析request,即访问页面的url信息解析 105 | * "GET /about/forum.php?mod=viewthread&tid=5&extra=page%3D1 HTTP/1.1" 106 | * 匹配出访问的fid:版本id 107 | * 以及tid:文章id 108 | * 109 | * @param request 110 | * @return 111 | */ 112 | public Tuple2 parseSectionIdAndArticleId(String request) { 113 | // 匹配出前面是"forumdisplay&fid="的数字记为版块id 114 | String sectionIdRegex = "(\\?mod=forumdisplay&fid=)(\\d+)"; 115 | Pattern sectionPattern = Pattern.compile(sectionIdRegex); 116 | // 匹配出前面是"tid="的数字记为文章id 117 | String articleIdRegex = "(\\?mod=viewthread&tid=)(\\d+)"; 118 | Pattern articlePattern = Pattern.compile(articleIdRegex); 119 | 120 | String[] arr = request.split(" "); 121 | String sectionId = ""; 122 | String articleId = ""; 123 | if (arr.length == 3) { 124 | Matcher sectionMatcher = sectionPattern.matcher(arr[1]); 125 | Matcher articleMatcher = articlePattern.matcher(arr[1]); 126 | sectionId = (sectionMatcher.find()) ? sectionMatcher.group(2) : ""; 127 | articleId = (articleMatcher.find()) ? articleMatcher.group(2) : ""; 128 | 129 | } 130 | return Tuple2.of(sectionId, articleId); 131 | 132 | } 133 | 134 | } 135 | -------------------------------------------------------------------------------- /src/main/java/com/jmx/bean/AccessLogRecord.java: -------------------------------------------------------------------------------- 1 | package com.jmx.bean; 2 | 3 | import lombok.Data; 4 | 5 | /** 6 | *  @Created with IntelliJ IDEA. 7 | *  @author : jmx 8 | *  @Date: 2020/8/24 9 | *  @Time: 21:41 10 | *   11 | */ 12 | 13 | /** 14 | * 原始日志封装类 15 | */ 16 | @Data 17 | public class AccessLogRecord { 18 | public String clientIpAddress; // 客户端ip地址 19 | public String clientIdentity; // 客户端身份标识,该字段为 `-` 20 | public String remoteUser; // 用户标识,该字段为 `-` 21 | public String dateTime; //日期,格式为[day/month/yearhourminutesecond zone] 22 | public String request; // url请求,如:`GET /foo ...` 23 | public String httpStatusCode; // 状态码,如:200; 404. 24 | public String bytesSent; // 传输的字节数,有可能是 `-` 25 | public String referer; // 参考链接,即来源页 26 | public String userAgent; // 浏览器和操作系统类型 27 | } 28 | -------------------------------------------------------------------------------- /src/main/resources/access_log.txt: -------------------------------------------------------------------------------- 1 | 127.0.0.1 - - [07/Dec/2017:19:36:27 +0800] "GET /test.php HTTP/1.1" 200 73946 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0" 2 | 192.168.1.1 - - [07/Dec/2017:19:37:51 +0800] "GET /robots.txt HTTP/1.1" 404 208 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" 3 | 192.168.1.1 - - [07/Dec/2017:19:37:51 +0800] "GET /test.php HTTP/1.1" 200 74407 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" 4 | 192.168.1.1 - - [07/Dec/2017:19:37:53 +0800] "GET /favicon.ico HTTP/1.1" 404 209 "http://192.168.1.10/test.php" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" 5 | 192.168.1.1 - - [07/Dec/2017:19:38:43 +0800] "-" 408 - "-" "-" 6 | 192.168.1.1 - - [07/Dec/2017:19:40:07 +0800] "GET /test.php HTTP/1.1" 200 74462 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" 7 | 192.168.1.1 - - [07/Dec/2017:19:40:37 +0800] "GET / HTTP/1.1" 403 4897 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" 8 | 192.168.1.1 - - [07/Dec/2017:19:41:01 +0800] "GET /abc HTTP/1.1" 404 201 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" 9 | 192.168.169.50 - - [17/Feb/2012:10:09:13 +0800] "GET /favicon.ico HTTP/1.1" 404 288 "-" "360se" 10 | 192.168.169.50 - - [17/Feb/2012:10:36:26 +0800] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0" 11 | 192.168.169.50 - - [17/Feb/2012:10:36:26 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0" 12 | 192.168.169.50 - - [17/Feb/2012:10:09:10 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; 360SE)" 13 | 192.168.55.230 - - [24/Feb/2012:09:48:58 +0800] "GET /favicon.ico HTTP/1.1" 404 288 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24" 14 | 192.168.169.50 - - [24/Feb/2012:09:45:03 +0800] "GET /server-status HTTP/1.1" 404 290 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; 360SE)" 15 | 192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET / HTTP/1.1" 403 5043 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24" 16 | 192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET /icons/apache_pb.gif HTTP/1.1" 200 2326 "http://192.168.55.230/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24" 17 | 192.168.55.230 - - [24/Feb/2012:09:49:02 +0800] "GET /icons/powered_by_rh.png HTTP/1.1" 200 1213 "http://192.168.55.230/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24" 18 | 192.168.55.230 - - [24/Feb/2012:09:49:20 +0800] "GET /server-status HTTP/1.1" 404 290 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111109 CentOS/3.6-3.el5.centos Firefox/3.6.24" 19 | 192.168.1.1 - - [12/Jan/2018:21:01:04 +0800] "GET /about/forum.php?mod=ajax&action=forumchecknew&fid=40&time=1515762003&inajax=yes HTTP/1.1" 200 64 "http://192.168.1.10/about/forum.php?mod=forumdisplay&fid=40" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" 20 | 192.168.1.1 - - [12/Jan/2018:21:02:22 +0800] "GET /about/forum.php?mod=ajax&action=forumchecknew&fid=40&time=1515762111&inajax=yes HTTP/1.1" 200 64 "http://192.168.1.10/about/forum.php?mod=forumdisplay&fid=40" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" 21 | 192.168.1.1 - - [12/Jan/2018:21:03:09 +0800] "GET /about/forum.php?mod=viewthread&tid=5&extra=page%3D1 HTTP/1.1" 200 36838 "http://192.168.1.10/about/forum.php?mod=forumdisplay&fid=40" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" 22 | 192.168.1.1 - - [14/Jan/2018:16:47:36 +0800] "-" 408 - "-" "-" 23 | 24 | 199.180.11.91 - - [06/Mar/2019:04:22:58 +0100] "GET /robots.txt HTTP/1.1" 404 1228 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)" 25 | 26 | 61.135.219.2 访问来源IP 27 | '-' 远端登录名(由identd而来,如果支持的话) 28 | '-' 远程用户名 29 | [01/Jan/2014:00:02:02 +0800] 请求时间,格式为[day/month/year:hour:minute:second zone] 30 | "GET /feed/ HTTP/1.0" 请求内容,格式为"%m %U%q %H",即"请求方法/访问路径/协议" 31 | 200 状态码 32 | 12306 返回数据大小 33 | “%{User-Agent}i” http_user_agent 客户端信息。 34 | “%{Rererer}i” http_referer 来源页。 35 | 36 | 37 | 38 | LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined 39 | 40 | 41 | 42 | 43 | 44 | 1客户端的IP地址。 45 | 2 由客户端identd进程判断的RFC1413身份(identity),输出中的符号"-"表示此处的信息无效。 46 | 3HTTP认证系统得到的访问该网页的客户标识(userid),如果网页没有设置密码保护,则此项将是"-"。 47 | 4服务器完成请求处理时的时间。 48 | 5客户的动作\请求的资源\使用的协议。 49 | 6服务器返回给客户端的状态码。 50 | 7返回给客户端的不包括响应头的字节数.如果没有信息返回,则此项应该是"-"。 51 | 8"Referer"请求头。 52 | 9User-Agent"请求头。 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | (1)183.69.210.164 61 | 这是一个请求到apache服务器的客户端ip,默认的情况下,第一项信息只是远程主机的ip地址,但我们如果需要apache查出主机的名字 62 | (2)- 63 | 这一项是空白,使用"-"来代替,这个位置是用于标注访问者的标示 64 | (3) - 65 | 这一项又是为空白,不过这项是用户记录用户HTTP的身份验证,如果某些网站要求用户进行身份雁阵,那么这一项就是记录用户的身份信息 66 | (4)==[07/Apr/2017:09:32:39 +0800] == 67 | 第四项是记录请求的时间,格式为[day/month/year:hour:minute:second zone],最后的+0800表示服务器所处的时区为东八区 68 | (5)GET /member/ HTTP/1.1 69 | 这一项是最有用的信息,它告诉我们的服务器收到的是一个GET请求,其次,是客户端请求的资源路径 70 | (6)302 71 | 是一个状态码,由服务器端发送回客户端,它告诉我们客户端的请求是否成功,或者是重定向,或者是碰到了什么样的错误,这项值为302 这项值以2开头的表示请求成功,以3开头的表示重定向,以4开头的标示客户端存在某些的错误,以5开头的标示服务器端 72 | (7)31 73 | 这项表示服务器向客户端发送了多少的字节,在日志分析统计的时侯,把这些字节加起来就可以得知服务器在某点时间内总的发送数据量是多少 74 | (8) - 75 | 没有值时是直接打开网页的原因,而有值时告诉服务器我是从哪个页面链接过来的 76 | (9) “Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 77 | (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 78 | 2.X MetaSr 1.0” 79 | 这项主要记录客户端的浏览器信息 80 | 81 | 82 | 83 | (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (\S+) (\S+) (\[.+?\]) (\"(.*?)\") (\d{3}) (\S+) (\"(.*?)\") (\"(.*?)\") 84 | 85 | 86 | -------------------------------------------------------------------------------- /src/main/resources/log4j.properties: -------------------------------------------------------------------------------- 1 | ################################################################################ 2 | # Licensed to the Apache Software Foundation (ASF) under one 3 | # or more contributor license agreements. See the NOTICE file 4 | # distributed with this work for additional information 5 | # regarding copyright ownership. The ASF licenses this file 6 | # to you under the Apache License, Version 2.0 (the 7 | # "License"); you may not use this file except in compliance 8 | # with the License. You may obtain a copy of the License at 9 | # 10 | # http://www.apache.org/licenses/LICENSE-2.0 11 | # 12 | # Unless required by applicable law or agreed to in writing, software 13 | # distributed under the License is distributed on an "AS IS" BASIS, 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | # See the License for the specific language governing permissions and 16 | # limitations under the License. 17 | ################################################################################ 18 | 19 | log4j.rootLogger=WARN, console 20 | 21 | log4j.appender.console=org.apache.log4j.ConsoleAppender 22 | log4j.appender.console.layout=org.apache.log4j.PatternLayout 23 | log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n 24 | -------------------------------------------------------------------------------- /src/main/resources/param.conf: -------------------------------------------------------------------------------- 1 | kafka { 2 | 3 | topic = "user_access_logs" 4 | broker = "kms-2:9092,kms-3:9092,kms-4:9092" 5 | group = "log_consumer" 6 | offset = "earliest" 7 | key_deserializer = "org.apache.kafka.common.serialization.StringDeserializer" 8 | value_deserializer = "org.apache.kafka.common.serialization.StringDeserializer" 9 | } 10 | 11 | mysql { 12 | forum_url = "jdbc:mysql://kms-4:3306/ultrax?characterEncoding=utf8&user=root&password=123qwe" 13 | statistics_url = "jdbc:mysql://kms-4:3306/statistics?characterEncoding=utf8&user=root&password=123qwe" 14 | } 15 | -------------------------------------------------------------------------------- /src/test/java/TestLogparse.java: -------------------------------------------------------------------------------- 1 | import com.jmx.analysis.LogParse; 2 | import com.jmx.bean.AccessLogRecord; 3 | import org.apache.flink.api.java.tuple.Tuple2; 4 | 5 | import java.util.regex.Matcher; 6 | import java.util.regex.Pattern; 7 | 8 | /** 9 | *  @Created with IntelliJ IDEA. 10 | *  @author : jmx 11 | *  @Date: 2020/8/27 12 | *  @Time: 10:23 13 | *   14 | */ 15 | public class TestLogparse { 16 | public static void main(String[] args) { 17 | 18 | 19 | String log = "192.168.10.1 - - [27/Aug/2020:10:20:53 +0800] \"GET /forum.php?mod=viewthread&tid=9&extra=page%3D1 HTTP/1.1\" 200 39913 \"http://kms-4/forum.php?mod=forumdisplay&fid=41\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36\""; 20 | 21 | LogParse parse = new LogParse(); 22 | 23 | AccessLogRecord accessLogRecord = parse.parseRecord(log); 24 | 25 | // System.out.println(accessLogRecord.getRequest()); 26 | 27 | // 匹配出前面是"forumdisplay&fid="的数字记为版块id 28 | String sectionIdRegex = "(\\?mod=forumdisplay&fid=)(\\d+)"; 29 | Pattern sectionPattern = Pattern.compile(sectionIdRegex); 30 | // 匹配出前面是"tid="的数字记为文章id 31 | String articleIdRegex = "(\\?mod=viewthread&tid=)(\\d+)"; 32 | Pattern articlePattern = Pattern.compile(articleIdRegex); 33 | 34 | String[] arr = accessLogRecord.getRequest().split(" "); 35 | String sectionId = ""; 36 | String articleId = ""; 37 | if (arr.length == 3) { 38 | //System.out.println(arr[1]); 39 | 40 | Matcher sectionMatcher = sectionPattern.matcher(arr[1]); 41 | Matcher articleMatcher = articlePattern.matcher(arr[1]); 42 | //System.out.println(articleMatcher.find()); 43 | // System.out.println(sectionMatcher.find()); 44 | //System.out.println(articleMatcher.group(0)); 45 | //System.out.println(articleMatcher.group(1)); 46 | //System.out.println(articleMatcher.group(2)); 47 | 48 | /* sectionId = (sectionMatcher.find()) ? sectionMatcher.group(2) : ""; 49 | articleId = articleMatcher.find() ? articleMatcher.group(2) : "";*/ 50 | 51 | /* if (articleMatcher.find()){ 52 | articleId = articleMatcher.group(2); 53 | } else { 54 | articleId = "no"; 55 | } 56 | if (sectionMatcher.find()){ 57 | sectionId = sectionMatcher.group(2); 58 | }else{ 59 | sectionId = "no"; 60 | } 61 | */ 62 | } 63 | /* System.out.println( articleId); 64 | System.out.println(sectionId);*/ 65 | Tuple2 stringStringTuple2 = parse.parseSectionIdAndArticleId(accessLogRecord.getRequest()); 66 | System.out.println(stringStringTuple2.f0 + stringStringTuple2.f1); 67 | System.out.println(stringStringTuple2); 68 | } 69 | } 70 | -------------------------------------------------------------------------------- /帖子.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/帖子.png -------------------------------------------------------------------------------- /日志架构.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/日志架构.png -------------------------------------------------------------------------------- /日志格式.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/日志格式.png -------------------------------------------------------------------------------- /板块.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/板块.png -------------------------------------------------------------------------------- /结果.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/结果.png -------------------------------------------------------------------------------- /项目代码.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiamx/flink-log-analysis/9e3a7411a0c1d5d7003ccf1289725aeebc0de998/项目代码.png --------------------------------------------------------------------------------