├── .gitignore
├── README.md
├── application.properties
├── pom.xml
└── src
    └── main
        ├── java
        │   └── com
        │       └── sjj
        │           ├── SprakStreamingMain.java
        │           ├── bo
        │           │   ├── OffsetInfo.java
        │           │   └── WordCount.java
        │           ├── config
        │           │   └── RedisConfig.java
        │           ├── rdd
        │           │   └── function
        │           │       ├── Accumulator.java
        │           │       ├── DStream2Row.java
        │           │       ├── Line2Word.java
        │           │       ├── MessageAndMeta.java
        │           │       ├── TupleValue.java
        │           │       └── WordTick.java
        │           ├── service
        │           │   ├── IOperateHiveWithSpark.java
        │           │   └── impl
        │           │       └── OperateHiveWithSpark.java
        │           └── util
        │               └── RedisUtil.java
        └── resources
            ├── hive-site.xml
            └── log4j2.xml

/.gitignore:
--------------------------------------------------------------------------------
 1 | # maven ignore
 2 | target/
 3 | *.jar
 4 | *.war
 5 | *.zip
 6 | *.tar
 7 | 
 8 | # eclipse ignore
 9 | .settings/
10 | .project
11 | .classpath
12 | 
13 | # idea ignore
14 | .idea/
15 | *.ipr
16 | *.iml
17 | *.iws
18 | 
19 | # temp ignore
20 | logs/
21 | *.doc
22 | *.log
23 | *.cache
24 | *.diff
25 | *.patch
26 | *.tmp
27 | metastore_db/
28 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# spark-streaming-kafka-demo
**A Spring Boot project in which Spark Streaming listens for Kafka messages, Redis records the Kafka offsets that have already been consumed, Spark counts how often each word occurs, and the results are written to a Hive table.**

### Notes

1. **Versions**
   - Kafka: 2.12-2.3.0
   - Spark: 1.6.0
   - Redis: 4.x
   - Hadoop: 2.6.0-cdh5.15.2

2. **Reading from Kafka**

The direct approach is used: instead of receiving data through a receiver, Spark periodically queries Kafka for the latest offset of each topic+partition and defines the offset range to process in each batch accordingly.

```
JavaInputDStream<String> dStream = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        String.class,
        kafkaParams,
        offsets,
        new MessageAndMeta());
```

After an RDD has been processed successfully, the offsets are written back to Redis:
```
// update the offsets
for (OffsetRange offsetRange : offsetRanges.get()) {
    setOffsetToRedis(offsetRange);
}
```

On every startup, the offsets consumed last time are first fetched from Redis:
```
// get the Kafka offsets consumed so far
getOffset(topicsSet);
```

Once the processing job has started, the Kafka consumer API reads the defined offset ranges from Kafka (much like reading files from a file system). Kafka messages are read every 60 seconds; tune this interval to match how fast data is written to Kafka. Because Spark Streaming works in micro-batches, Spark writes one data file per batch (directly to HDFS or to a Hive table), so if each batch carries too little data you end up with a large number of small files (look up the "small files problem" if it is unfamiliar). When a batch contains no messages, the Spark job would still produce an empty file; to avoid that, check that the RDD is non-empty before operating on it:

```
if (!rowRDD.isEmpty()) {
    // operate on the RDD
}
```

3. **RDD operations**

The example counts word frequencies. The functions passed to Spark transformation operators must be serializable, and if they were written as anonymous classes, the enclosing host class would also have to be serializable. For that reason the functions used by flatMap, reduceByKey, etc. are defined as standalone classes under the rdd/function package, as illustrated by the sketch below.
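For comparison, here is a rough sketch (not code from this repository) of the anonymous-class variant that this project avoids; `words` and `pairs` are illustrative names:

```
// Hypothetical anonymous-class version: the anonymous PairFunction keeps an
// implicit reference to its enclosing class, so Spark would have to serialize
// that enclosing class (here a Spring @Service) as well and would otherwise
// fail with a "Task not serializable" error.
JavaPairDStream<String, Integer> pairs = words.mapToPair(
        new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String word) {
                return new Tuple2<>(word, 1); // captures the outer instance
            }
        });
```

Defining the functions as their own top-level classes (Accumulator, Line2Word, WordTick, ...) sidesteps this, because nothing from the Spring context is captured in the closure.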
Spark writes .hive-staging files into the Hive table's data directory by default; they can be redirected to a common directory and cleaned up there periodically:

```
hiveContext.sql("set hive.exec.stagingdir = /tmp/staging/.hive-staging")
```

4. **Running**
   - Running locally

     Run the main class SprakStreamingMain directly, the same way as any Spring Boot application.
   - Running on a cluster
     - Packaging

       Spark does not support the project layout produced by the spring-boot-maven-plugin, so this project uses the maven-shade-plugin to package everything into a single fat jar. Because the cluster usually already provides the relevant jars, none of the Spark-related jars need to be packed into the fat jar; their scope is set to provided in the pom.

     - Submitting

       There are several run modes:

Master parameter | Meaning
---|---
local | Run the Spark application locally with one worker thread
local[K] | Run the Spark application locally with K worker threads
local[*] | Run the Spark application locally with as many worker threads as available
spark://HOST:PORT | Connect to the given Spark standalone cluster and run the application there
mesos://HOST:PORT | Connect to the given Mesos cluster and run the application there
yarn-client | Connect to a YARN cluster in client mode; the cluster is located via the HADOOP_CONF_DIR environment variable and the driver runs on the client
yarn-cluster | Connect to a YARN cluster in cluster mode; the cluster is located via the HADOOP_CONF_DIR environment variable and the driver also runs inside the cluster

An example submission to a YARN cluster; the Spring Boot configuration file is passed with the properties-file option:
```
spark-submit --master yarn-cluster --num-executors 2 --driver-memory 128m --executor-memory 128m --executor-cores 2 --class com.sjj.SprakStreamingMain --properties-file /root/application.properties spark-demo-boot.jar
```
--------------------------------------------------------------------------------
/application.properties:
--------------------------------------------------------------------------------
 1 | app.name=spring-boot-spark-streaming-hive-demo
 2 | logging.level.root=info
 3 | spring.profiles.active=dev
 4 | spring.cache.type=redis
 5 | spring.application.name=${app.name}
 6 | spring.redis.password=redis
 7 | spring.redis.cluster.nodes[0]=172.1.1.1:6380
 8 | spring.redis.cluster.nodes[1]=172.1.1.2:6380
 9 | spring.redis.cluster.nodes[2]=172.1.1.3:6380
10 | spring.redis.cluster.nodes[3]=172.1.1.4:6381
11 | spring.redis.cluster.nodes[4]=172.1.1.5:6381
12 | spring.redis.cluster.nodes[5]=172.1.1.6:6381
13 | spring.redis.jedis.pool.max-active=8
14 | spring.redis.jedis.pool.max-idle=8
15 | spring.redis.jedis.pool.max-wait=-1
16 | spring.redis.jedis.pool.min-idle=2
17 | kafka.broker=172.1.1.7:9092
18 | kafka.topic=spark456
19 | spark.master=local[2]
20 | spark.appName=javaDirectKafkaWordCount
21 | spark.extrClassPath=/root
22 | spark.ui.port=8082
23 | hadoop.user=hdfs
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
1 | 4 | 4.0.0 5 | com.sjj 6 | spark-streaming-kafka-demo 7 | 0.0.1-SNAPSHOT 8 | 9 | spring-boot-demo 10 | springboot2.0 integrated with spark, spark streaming and kafka 11 | 12 | 13 | org.springframework.boot 14 | spring-boot-starter-parent 15 | 2.0.3.RELEASE 16 | 17 | 18 | 19 | 1.6.0 20 | 2.10 21 | UTF-8 22 | UTF-8 23 | 24 | 25 | 26 | 27 | org.springframework.boot 28 | spring-boot-starter-actuator 29 | 30 | 31 | ch.qos.logback 32 | logback-classic 33 | 34 | 35 | org.springframework.boot 36 | spring-boot-starter-logging 37 | 38 | 39 | org.yaml 40 | snakeyaml 41 | 42 | 43 | 44 | 45 | 46 | 47 | org.springframework.boot 48 | spring-boot-configuration-processor 49 | true 50 | 51 | 52 | org.springframework 53 | spring-context-support 54 | 55 | 56 | 57 | 58 | org.apache.spark 59 | spark-core_${scala.binary.version} 60 | ${spark.version} 61 | provided 62 | 63 | 64 | 65 | org.apache.spark 66 | spark-streaming_${scala.binary.version} 67 | ${spark.version} 68 | provided 69 | 70 | 71 | org.apache.spark 72 | spark-streaming-kafka_${scala.binary.version} 73 | ${spark.version} 74 | provided 75 | 76 | 77 | 78 | 79 | org.apache.spark 80 | spark-sql_${scala.binary.version} 81 | ${spark.version} 82 | provided 83 | 84 | 85 | 86 | org.apache.spark 87 | spark-hive_${scala.binary.version} 88 | ${spark.version} 89 | provided 90 | 91 | 92 | org.codehaus.janino 93 | janino 94 | 95 | 96 | org.codehaus.janino 97 | commons-compiler 98 | 99 | 100 | 101 | 102 | 103 | 104 | org.springframework.boot 105 | spring-boot-starter-data-redis 106 | 107 | 108 | io.lettuce 109 | lettuce-core 110 | 111 | 112 | 113 | 114 | 115 | redis.clients 116 | jedis 117 | 118 | 119 | 120 | 
org.projectlombok 122 | lombok 123 | provided 124 | 125 | 126 | 127 | 128 | org.apache.commons 129 | commons-lang3 130 | 131 | 132 | org.apache.commons 133 | commons-pool2 134 | 135 | 136 | 137 | 138 | 139 | org.slf4j 140 | slf4j-log4j12 141 | 142 | 143 | 144 | 145 | 146 | spark-demo-boot 147 | 148 | 149 | src/main/resources 150 | 151 | *.* 152 | 153 | 154 | 155 | 156 | 157 | org.apache.maven.plugins 158 | maven-shade-plugin 159 | 160 | 161 | package 162 | 163 | shade 164 | 165 | 166 | false 167 | false 168 | 169 | 170 | *:* 171 | 172 | META-INF/*.SF 173 | META-INF/*.DSA 174 | META-INF/*.RSA 175 | 176 | 177 | 178 | 179 | 181 | META-INF/spring.handlers 182 | 183 | 185 | META-INF/spring.factories 186 | 187 | 189 | META-INF/spring.schemas 190 | 191 | 193 | 195 | com.sjj.SprakStreamingMain 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/SprakStreamingMain.java: -------------------------------------------------------------------------------- 1 | package com.sjj; 2 | 3 | import org.springframework.beans.factory.annotation.Autowired; 4 | import org.springframework.boot.CommandLineRunner; 5 | import org.springframework.boot.SpringApplication; 6 | import org.springframework.boot.autoconfigure.SpringBootApplication; 7 | import org.springframework.boot.Banner; 8 | import com.sjj.service.impl.OperateHiveWithSpark; 9 | 10 | import lombok.extern.slf4j.Slf4j; 11 | 12 | /** 13 | * springboot启动类 14 | */ 15 | @Slf4j 16 | @SpringBootApplication 17 | public class SprakStreamingMain implements CommandLineRunner { 18 | 19 | @Autowired 20 | private OperateHiveWithSpark wordCount; 21 | 22 | public static void main(String[] args) throws Exception { 23 | 24 | // disabled banner, don't want to see the spring logo 25 | SpringApplication app = new SpringApplication(SprakStreamingMain.class); 26 | app.setBannerMode(Banner.Mode.OFF); 27 | app.run(args); 28 | } 29 | 30 | // Put your logic here. 
31 | @Override 32 | public void run(String[] args) throws Exception { 33 | try { 34 | wordCount.launch(); 35 | } catch (Exception e) { 36 | log.error("### exception exit ###"); 37 | log.error(e.getMessage()); 38 | System.exit(1); 39 | } 40 | } 41 | } 42 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/bo/OffsetInfo.java: -------------------------------------------------------------------------------- 1 | package com.sjj.bo; 2 | 3 | import lombok.Getter; 4 | import lombok.Setter; 5 | 6 | @Setter 7 | @Getter 8 | public class OffsetInfo { 9 | private String topic; 10 | private Integer partition; 11 | private Long offset; 12 | } 13 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/bo/WordCount.java: -------------------------------------------------------------------------------- 1 | package com.sjj.bo; 2 | 3 | import lombok.Getter; 4 | import lombok.Setter; 5 | 6 | /** 7 | * 单词统计的bean 8 | * @author Tim 9 | * 10 | */ 11 | @Setter 12 | @Getter 13 | public class WordCount { 14 | private String word; 15 | private Integer count; 16 | } 17 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/config/RedisConfig.java: -------------------------------------------------------------------------------- 1 | package com.sjj.config; 2 | 3 | import org.springframework.context.annotation.Bean; 4 | import org.springframework.context.annotation.Configuration; 5 | import org.springframework.data.redis.connection.RedisConnectionFactory; 6 | import org.springframework.data.redis.core.RedisTemplate; 7 | import org.springframework.data.redis.serializer.Jackson2JsonRedisSerializer; 8 | import org.springframework.data.redis.serializer.StringRedisSerializer; 9 | 10 | import com.fasterxml.jackson.annotation.JsonAutoDetect; 11 | import com.fasterxml.jackson.annotation.PropertyAccessor; 12 | import com.fasterxml.jackson.databind.ObjectMapper; 13 | 14 | import lombok.extern.slf4j.Slf4j; 15 | 16 | /** 17 | * redis客户端Lettuce配置,springboot会根据配置文件自动创建LettuceConnectionFactory 18 | * springboot2.0默认使用Lettuce客户端,用jedis包替换掉Lettuce包,在application.yml的Redis标签下添加jedis配置,即可实现jedis客户端的切换 19 | * 20 | */ 21 | @Slf4j 22 | @Configuration 23 | public class RedisConfig{ 24 | 25 | @SuppressWarnings({ "rawtypes", "unchecked" }) 26 | @Bean 27 | public RedisTemplate redisTemplate(RedisConnectionFactory factory) { 28 | RedisTemplate template = new RedisTemplate(); 29 | template.setConnectionFactory(factory); 30 | 31 | Jackson2JsonRedisSerializer jackson2JsonRedisSerializer = new Jackson2JsonRedisSerializer(Object.class); 32 | jackson2JsonRedisSerializer.setObjectMapper(new ObjectMapper().setVisibility(PropertyAccessor.ALL, JsonAutoDetect.Visibility.ANY).enableDefaultTyping(ObjectMapper.DefaultTyping.NON_FINAL)); 33 | template.setValueSerializer(jackson2JsonRedisSerializer); 34 | 35 | //解决key出现乱码前缀\xac\xed\x00\x05t\x00\tb,这个问题实际并不影响使用,只是去Redis里查找key的时候会出现疑惑 36 | StringRedisSerializer stringRedisSerializer = new StringRedisSerializer(); 37 | template.setKeySerializer(stringRedisSerializer); 38 | template.afterPropertiesSet(); 39 | log.info(">>> Lettuce客户端创建成功"); 40 | return template; 41 | } 42 | } 43 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/rdd/function/Accumulator.java: -------------------------------------------------------------------------------- 1 | package com.sjj.rdd.function; 2 | 3 | import 
org.apache.spark.api.java.function.Function2; 4 | 5 | public class Accumulator implements Function2 { 6 | private static final long serialVersionUID = 1L; 7 | 8 | @Override 9 | public Integer call(Integer i1, Integer i2) { 10 | return i1 + i2; 11 | } 12 | } 13 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/rdd/function/DStream2Row.java: -------------------------------------------------------------------------------- 1 | package com.sjj.rdd.function; 2 | 3 | import java.util.ArrayList; 4 | import java.util.List; 5 | 6 | import org.apache.spark.api.java.function.FlatMapFunction; 7 | import org.apache.spark.sql.Row; 8 | import org.apache.spark.sql.RowFactory; 9 | 10 | import scala.Tuple2; 11 | 12 | public class DStream2Row implements FlatMapFunction, Row>{ 13 | private static final long serialVersionUID = 5481855142090322683L; 14 | 15 | @Override 16 | public Iterable call(Tuple2 t) throws Exception { 17 | List list = new ArrayList<>(); 18 | list.add(RowFactory.create(t._1, t._2)); 19 | 20 | return list; 21 | } 22 | } 23 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/rdd/function/Line2Word.java: -------------------------------------------------------------------------------- 1 | package com.sjj.rdd.function; 2 | 3 | import java.util.regex.Pattern; 4 | 5 | import org.apache.spark.api.java.function.FlatMapFunction; 6 | 7 | import com.google.common.collect.Lists; 8 | 9 | public class Line2Word implements FlatMapFunction { 10 | 11 | private static final long serialVersionUID = 1L; 12 | 13 | private static final Pattern SPACE = Pattern.compile(" "); 14 | 15 | @Override 16 | public Iterable call(String line) throws Exception { 17 | return Lists.newArrayList(SPACE.split(line)); 18 | } 19 | 20 | } 21 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/rdd/function/MessageAndMeta.java: -------------------------------------------------------------------------------- 1 | package com.sjj.rdd.function; 2 | 3 | import org.apache.spark.api.java.function.Function; 4 | import kafka.message.MessageAndMetadata; 5 | 6 | public class MessageAndMeta implements Function, String> { 7 | 8 | private static final long serialVersionUID = 1L; 9 | 10 | public String call(MessageAndMetadata messageAndMetadata) throws Exception { 11 | return messageAndMetadata.message(); 12 | } 13 | } 14 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/rdd/function/TupleValue.java: -------------------------------------------------------------------------------- 1 | package com.sjj.rdd.function; 2 | 3 | import java.util.Arrays; 4 | import org.apache.spark.api.java.function.FlatMapFunction; 5 | import scala.Tuple2; 6 | 7 | public class TupleValue implements FlatMapFunction,String> { 8 | private static final long serialVersionUID = 1L; 9 | 10 | @Override 11 | public Iterable call(Tuple2 t) throws Exception { 12 | return Arrays.asList(t._2); 13 | } 14 | } 15 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/rdd/function/WordTick.java: -------------------------------------------------------------------------------- 1 | package com.sjj.rdd.function; 2 | 3 | import org.apache.spark.api.java.function.PairFunction; 4 | 5 | import scala.Tuple2; 6 | 7 | public class WordTick implements PairFunction { 8 | 9 | private static final long serialVersionUID = 1L; 10 | 11 | 
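// Pair each word with an initial count of 1; reduceByKey later sums these counts per word.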
@Override 12 | public Tuple2 call(String s) { 13 | return new Tuple2(s, 1); 14 | } 15 | } 16 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/service/IOperateHiveWithSpark.java: -------------------------------------------------------------------------------- 1 | package com.sjj.service; 2 | 3 | public interface IOperateHiveWithSpark { 4 | void launch() throws Exception; 5 | } 6 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/service/impl/OperateHiveWithSpark.java: -------------------------------------------------------------------------------- 1 | package com.sjj.service.impl; 2 | 3 | import java.util.HashMap; 4 | import java.util.HashSet; 5 | import java.util.List; 6 | import java.util.ArrayList; 7 | import java.util.Map; 8 | import java.util.concurrent.atomic.AtomicReference; 9 | import javax.annotation.Resource; 10 | import org.apache.commons.lang3.StringUtils; 11 | import org.apache.spark.SparkConf; 12 | import org.apache.spark.api.java.JavaPairRDD; 13 | import org.apache.spark.api.java.JavaRDD; 14 | import org.apache.spark.api.java.function.Function; 15 | import org.apache.spark.api.java.function.VoidFunction; 16 | import org.apache.spark.sql.DataFrame; 17 | import org.apache.spark.sql.Row; 18 | import org.apache.spark.sql.hive.HiveContext; 19 | import org.apache.spark.sql.types.DataTypes; 20 | import org.apache.spark.sql.types.Metadata; 21 | import org.apache.spark.sql.types.StructField; 22 | import org.apache.spark.sql.types.StructType; 23 | import org.apache.spark.streaming.api.java.*; 24 | import org.apache.spark.streaming.kafka.HasOffsetRanges; 25 | import org.apache.spark.streaming.kafka.KafkaUtils; 26 | import org.apache.spark.streaming.kafka.OffsetRange; 27 | import org.springframework.beans.factory.annotation.Value; 28 | import org.springframework.stereotype.Service; 29 | 30 | import com.google.common.collect.Sets; 31 | import com.sjj.rdd.function.Accumulator; 32 | import com.sjj.rdd.function.DStream2Row; 33 | import com.sjj.rdd.function.Line2Word; 34 | import com.sjj.rdd.function.MessageAndMeta; 35 | import com.sjj.rdd.function.TupleValue; 36 | import com.sjj.rdd.function.WordTick; 37 | import com.sjj.service.IOperateHiveWithSpark; 38 | import com.sjj.util.RedisUtil; 39 | 40 | import kafka.common.TopicAndPartition; 41 | import kafka.serializer.StringDecoder; 42 | import lombok.extern.slf4j.Slf4j; 43 | 44 | import org.apache.spark.streaming.Durations; 45 | 46 | /** 47 | * spark streaming监听kafka消息,实现统计消息中单词出现次数,最后写入hive表 48 | * 49 | * @author Tim 50 | * 51 | */ 52 | @Slf4j 53 | @Service 54 | public final class OperateHiveWithSpark implements IOperateHiveWithSpark { 55 | 56 | @Value("${kafka.broker}") 57 | private String kafkaBroker; 58 | 59 | @Value("${kafka.topic}") 60 | private String kafkaTopic; 61 | 62 | @Value("${spark.master}") 63 | private String sparkMaster; 64 | 65 | @Value("${spark.appName}") 66 | private String sparkAppName; 67 | 68 | @Value("${spark.ui.port}") 69 | private String sparkUiPort; 70 | 71 | @Value("${hadoop.user}") 72 | private String hadoopUser; 73 | 74 | @Value("${hive.db.name}") 75 | private String hiveDBName; 76 | 77 | @Resource 78 | RedisUtil redisUtil; 79 | 80 | // Hold a reference to the current offset ranges, so it can be used downstream 81 | final AtomicReference offsetRanges = new AtomicReference<>(); 82 | 83 | public void launch() throws Exception { 84 | 85 | // kafka broker、topic参数校验 86 | if (StringUtils.isBlank(kafkaBroker) || 
StringUtils.isBlank(kafkaTopic)) { 87 | log.error("Usage: JavaDirectKafkaWordCount \n" 88 | + " is a list of one or more Kafka brokers\n" 89 | + " is a list of one or more kafka topics to consume from\n\n"); 90 | System.exit(1); 91 | } 92 | 93 | // 设置hadoop用户,本地环境运行需要,集群环境运行其实不需要 94 | if (StringUtils.isNotBlank(hadoopUser)) { 95 | System.setProperty("HADOOP_USER_NAME", hadoopUser); 96 | } 97 | 98 | // Spark应用的名称,可用于查看任务状态 99 | SparkConf sc = new SparkConf().setAppName(sparkAppName); 100 | 101 | // 配置spark UI的端口,默认是4040 102 | if (StringUtils.isNumeric(sparkUiPort)) { 103 | sc.set("spark.ui.port", sparkUiPort); 104 | } 105 | // 配置spark程序运行的环境,本地环境运行需要设置,集群环境在命令行以参数形式指定 106 | if (StringUtils.isNotBlank(sparkMaster)) { 107 | sc.setMaster(sparkMaster); 108 | } 109 | 110 | /* 111 | * 创建上下文,60秒一个批次读取kafka消息,streaming的微批模式,这个时间的大小会影响写到hdfs或者hive中 112 | * 文件个数,要根据kafka实际写入速度设置,避免生成太多小文件 113 | */ 114 | JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(60)); 115 | 116 | // 组装kafka参数 117 | HashSet topicsSet = Sets.newHashSet(StringUtils.split(kafkaTopic)); 118 | HashMap kafkaParams = new HashMap(); 119 | kafkaParams.put("metadata.broker.list", kafkaBroker); 120 | 121 | // 获取消费kafka的offset 122 | // Hold a reference to the current offset ranges, so it can be used downstream 123 | Map offsets = redisUtil.getOffset(topicsSet); 124 | 125 | JavaDStream messages = null; 126 | 127 | // Create direct kafka stream with brokers and topics 128 | if (offsets.isEmpty()) { 129 | JavaPairInputDStream pairDstream = KafkaUtils.createDirectStream(jssc, String.class, 130 | String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet); 131 | messages = pairDstream 132 | .transformToPair(new Function, JavaPairRDD>() { 133 | private static final long serialVersionUID = 1L; 134 | 135 | public JavaPairRDD call(JavaPairRDD rdd) throws Exception { 136 | OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges(); 137 | offsetRanges.set(offsets); 138 | return rdd; 139 | } 140 | }).flatMap(new TupleValue()); 141 | } else { 142 | JavaInputDStream dStream = KafkaUtils.createDirectStream(jssc, String.class, String.class, 143 | StringDecoder.class, StringDecoder.class, String.class, kafkaParams, offsets, new MessageAndMeta()); 144 | 145 | messages = dStream.transform(new Function, JavaRDD>() { 146 | private static final long serialVersionUID = 1L; 147 | 148 | public JavaRDD call(JavaRDD rdd) throws Exception { 149 | OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges(); 150 | offsetRanges.set(offsets); 151 | return rdd; 152 | } 153 | }); 154 | } 155 | 156 | JavaDStream words = messages.flatMap(new Line2Word()); 157 | 158 | JavaPairDStream wordCounts = words.mapToPair(new WordTick()).reduceByKey(new Accumulator()); 159 | 160 | HiveContext hiveContext = new HiveContext(jssc.sparkContext()); 161 | List schemaString = new ArrayList(); 162 | schemaString.add("word"); 163 | schemaString.add("count"); 164 | 165 | StructType schema = new StructType( 166 | new StructField[] { new StructField("word", DataTypes.StringType, false, Metadata.empty()), 167 | new StructField("count", DataTypes.IntegerType, false, Metadata.empty()) }); 168 | 169 | JavaDStream rowDStream = wordCounts.flatMap(new DStream2Row()); 170 | 171 | rowDStream.foreachRDD(new VoidFunction>() { 172 | 173 | private static final long serialVersionUID = 1L; 174 | 175 | @Override 176 | public void call(JavaRDD rowRDD) throws Exception { 177 | 178 | if (!rowRDD.isEmpty()) { 179 | log.error(">>>" + 
rowRDD.partitions().size()); 180 | 181 | hiveContext.sql("set hive.exec.stagingdir = /tmp/staging/.hive-staging"); 182 | hiveContext.sql("use " + hiveDBName); 183 | 184 | DataFrame df = hiveContext.createDataFrame(rowRDD, schema).coalesce(10); 185 | 186 | df.registerTempTable("wc"); 187 | hiveContext.sql("insert into test_wc select word,count from wc"); 188 | 189 | // 更新offset 190 | for (OffsetRange offsetRange : offsetRanges.get()) { 191 | redisUtil.setOffset(offsetRange); 192 | } 193 | 194 | } 195 | } 196 | }); 197 | 198 | wordCounts.print(); 199 | 200 | // Start the computation 201 | jssc.start(); 202 | jssc.awaitTermination(); 203 | } 204 | } 205 | -------------------------------------------------------------------------------- /src/main/java/com/sjj/util/RedisUtil.java: -------------------------------------------------------------------------------- 1 | package com.sjj.util; 2 | 3 | import java.util.HashMap; 4 | import java.util.Map; 5 | import java.util.Set; 6 | 7 | import javax.annotation.Resource; 8 | 9 | import org.apache.spark.streaming.kafka.OffsetRange; 10 | import org.springframework.data.redis.core.RedisTemplate; 11 | import org.springframework.stereotype.Component; 12 | 13 | import com.sjj.service.impl.OperateHiveWithSpark; 14 | 15 | import kafka.common.TopicAndPartition; 16 | import lombok.extern.slf4j.Slf4j; 17 | 18 | @Slf4j 19 | @Component 20 | public class RedisUtil { 21 | 22 | private static final String KAFKA_PATITION_REDIS_KEY_SUFFIX = ".partition"; 23 | private static final String KAFKA_OFFSET_REDIS_KEY_SUFFIX = ".offset"; 24 | 25 | @Resource 26 | RedisTemplate redisTemplate; 27 | 28 | public Map getOffset(Set topics) { 29 | return getOffsetFromRedis(topics); 30 | } 31 | 32 | public void setOffset(OffsetRange offsetRange) { 33 | setOffsetToRedis(offsetRange); 34 | } 35 | 36 | private Map getOffsetFromRedis(Set topics) { 37 | Map offsets = new HashMap(); 38 | 39 | for (String topic : topics) { 40 | try { 41 | Integer partation = Integer 42 | .parseInt(redisTemplate.opsForValue().get(topic + KAFKA_PATITION_REDIS_KEY_SUFFIX)); 43 | Long offset = Long.parseLong(redisTemplate.opsForValue().get(topic + KAFKA_OFFSET_REDIS_KEY_SUFFIX)); 44 | if (null != partation && null != offset) { 45 | log.info("### get kafka offset in redis for kafka ### " + topic + KAFKA_OFFSET_REDIS_KEY_SUFFIX 46 | + " >>> " + partation + " | " + offset); 47 | offsets.put(new TopicAndPartition(topic, partation), offset); 48 | } 49 | } catch (NumberFormatException e) { 50 | log.error("### Topic: " + topic + " offset exception ###"); 51 | } 52 | } 53 | return offsets; 54 | } 55 | 56 | private void setOffsetToRedis(OffsetRange offsetRange) { 57 | redisTemplate.opsForValue().set(offsetRange.topic() + KAFKA_PATITION_REDIS_KEY_SUFFIX, 58 | String.valueOf(offsetRange.partition())); 59 | redisTemplate.opsForValue().set(offsetRange.topic() + KAFKA_OFFSET_REDIS_KEY_SUFFIX, 60 | String.valueOf(offsetRange.fromOffset())); 61 | 62 | log.info("### update kafka offset in redis ### " + offsetRange.topic() + KAFKA_PATITION_REDIS_KEY_SUFFIX 63 | + " >>> " + offsetRange); 64 | 65 | } 66 | } 67 | -------------------------------------------------------------------------------- /src/main/resources/hive-site.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | hive.metastore.uris 7 | thrift://hadoop01:9083 8 | 9 | 10 | hive.metastore.client.socket.timeout 11 | 300 12 | 13 | 14 | hive.metastore.warehouse.dir 15 | /user/hive/warehouse 16 | 17 | 18 | 
hive.warehouse.subdir.inherit.perms 19 | true 20 | 21 | 22 | spark.master 23 | yarn-cluster 24 | 25 | 26 | hive.log.explain.output 27 | false 28 | 29 | 30 | hive.auto.convert.join 31 | true 32 | 33 | 34 | hive.auto.convert.join.noconditionaltask.size 35 | 20971520 36 | 37 | 38 | hive.optimize.index.filter 39 | true 40 | 41 | 42 | hive.optimize.bucketmapjoin.sortedmerge 43 | false 44 | 45 | 46 | hive.smbjoin.cache.rows 47 | 10000 48 | 49 | 50 | hive.server2.logging.operation.enabled 51 | true 52 | 53 | 54 | hive.server2.logging.operation.log.location 55 | /var/log/hive/operation_logs 56 | 57 | 58 | mapred.reduce.tasks 59 | -1 60 | 61 | 62 | hive.exec.reducers.bytes.per.reducer 63 | 67108864 64 | 65 | 66 | hive.exec.copyfile.maxsize 67 | 33554432 68 | 69 | 70 | hive.exec.reducers.max 71 | 1099 72 | 73 | 74 | hive.vectorized.groupby.checkinterval 75 | 4096 76 | 77 | 78 | hive.vectorized.groupby.flush.percent 79 | 0.1 80 | 81 | 82 | hive.compute.query.using.stats 83 | false 84 | 85 | 86 | hive.vectorized.execution.enabled 87 | true 88 | 89 | 90 | hive.vectorized.execution.reduce.enabled 91 | false 92 | 93 | 94 | hive.merge.mapfiles 95 | true 96 | 97 | 98 | hive.merge.mapredfiles 99 | false 100 | 101 | 102 | hive.cbo.enable 103 | false 104 | 105 | 106 | hive.fetch.task.conversion 107 | minimal 108 | 109 | 110 | hive.fetch.task.conversion.threshold 111 | 268435456 112 | 113 | 114 | hive.limit.pushdown.memory.usage 115 | 0.1 116 | 117 | 118 | hive.merge.sparkfiles 119 | true 120 | 121 | 122 | hive.merge.smallfiles.avgsize 123 | 16777216 124 | 125 | 126 | hive.merge.size.per.task 127 | 268435456 128 | 129 | 130 | hive.optimize.reducededuplication 131 | true 132 | 133 | 134 | hive.optimize.reducededuplication.min.reducer 135 | 4 136 | 137 | 138 | hive.map.aggr 139 | true 140 | 141 | 142 | hive.map.aggr.hash.percentmemory 143 | 0.5 144 | 145 | 146 | hive.optimize.sort.dynamic.partition 147 | false 148 | 149 | 150 | hive.execution.engine 151 | mr 152 | 153 | 154 | spark.executor.memory 155 | 3240506163 156 | 157 | 158 | spark.driver.memory 159 | 3865470566 160 | 161 | 162 | spark.executor.cores 163 | 5 164 | 165 | 166 | spark.yarn.driver.memoryOverhead 167 | 409 168 | 169 | 170 | spark.yarn.executor.memoryOverhead 171 | 545 172 | 173 | 174 | spark.dynamicAllocation.enabled 175 | true 176 | 177 | 178 | spark.dynamicAllocation.initialExecutors 179 | 1 180 | 181 | 182 | spark.dynamicAllocation.minExecutors 183 | 1 184 | 185 | 186 | spark.dynamicAllocation.maxExecutors 187 | 2147483647 188 | 189 | 190 | hive.stats.fetch.column.stats 191 | true 192 | 193 | 194 | hive.mv.files.thread 195 | 15 196 | 197 | 198 | hive.blobstore.use.blobstore.as.scratchdir 199 | false 200 | 201 | 202 | hive.load.dynamic.partitions.thread 203 | 15 204 | 205 | 206 | hive.exec.input.listing.max.threads 207 | 15 208 | 209 | 210 | hive.msck.repair.batch.size 211 | 0 212 | 213 | 214 | hive.spark.dynamic.partition.pruning.map.join.only 215 | false 216 | 217 | 218 | hive.metastore.execute.setugi 219 | true 220 | 221 | 222 | hive.support.concurrency 223 | true 224 | 225 | 226 | hive.zookeeper.quorum 227 | hadoop01 228 | 229 | 230 | hive.zookeeper.client.port 231 | 2181 232 | 233 | 234 | hive.zookeeper.namespace 235 | hive_zookeeper_namespace_hive 236 | 237 | 238 | hbase.zookeeper.quorum 239 | hadoop01 240 | 241 | 242 | hbase.zookeeper.property.clientPort 243 | 2181 244 | 245 | 246 | hive.cluster.delegation.token.store.class 247 | org.apache.hadoop.hive.thrift.MemoryTokenStore 248 | 249 | 250 | hive.metastore.fshandler.threads 251 | 15 
252 | 253 | 254 | hive.server2.thrift.min.worker.threads 255 | 5 256 | 257 | 258 | hive.server2.thrift.max.worker.threads 259 | 100 260 | 261 | 262 | hive.server2.thrift.port 263 | 10000 264 | 265 | 266 | hive.entity.capture.input.URI 267 | true 268 | 269 | 270 | hive.server2.enable.doAs 271 | true 272 | 273 | 274 | hive.server2.session.check.interval 275 | 900000 276 | 277 | 278 | hive.server2.idle.session.timeout 279 | 43200000 280 | 281 | 282 | hive.server2.idle.session.timeout_check_operation 283 | true 284 | 285 | 286 | hive.server2.idle.operation.timeout 287 | 21600000 288 | 289 | 290 | hive.server2.webui.host 291 | 0.0.0.0 292 | 293 | 294 | hive.server2.webui.port 295 | 10002 296 | 297 | 298 | hive.server2.webui.max.threads 299 | 50 300 | 301 | 302 | hive.server2.webui.use.ssl 303 | false 304 | 305 | 306 | 307 | hive.server2.use.SSL 308 | false 309 | 310 | 311 | spark.shuffle.service.enabled 312 | true 313 | 314 | 315 | hive.service.metrics.file.location 316 | /var/log/hive/metrics-hiveserver2/metrics.log 317 | 318 | 319 | hive.server2.metrics.enabled 320 | true 321 | 322 | 323 | hive.service.metrics.file.frequency 324 | 30000 325 | 326 | 327 | hive.security.authorization.enabled 328 | true 329 | 330 | 331 | hive.security.authorization.createtable.owner.grants 332 | ALL 333 | 334 | 335 | hive.security.authorization.task.factory 336 | org.apache.hadoop.hive.ql.parse.authorization.HiveAuthorizationTaskFactoryImpl 337 | 338 | 339 | -------------------------------------------------------------------------------- /src/main/resources/log4j2.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | spark-demo 6 | /tmp/logs 7 | 128 MB 8 | 9 | 10 | 11 | 12 | 13 | 14 | 16 | 17 | 19 | 20 | 21 | 22 | 26 | 28 | 29 | 31 | 33 | 34 | 35 | 36 | 38 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 50 | 52 | 53 | 55 | 57 | 58 | 59 | 60 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | --------------------------------------------------------------------------------