├── README.md
├── indexr.about.cn.md
├── indexr.about.en.md
├── indexr.about.res
│   ├── deploy.jpg
│   ├── ecosystem.jpg
│   ├── indexr_icon.png
│   ├── indexr_icon_500x500.jpg
│   ├── realtime_1.jpg
│   └── realtime_2.jpg
└── indexr_white_paper
    ├── deploy_architecture.png
    ├── ecosystem.png
    ├── indexr_icon.png
    ├── indexr_white_paper.en.md
    ├── indexr_white_paper.md
    ├── realtime_process.png
    ├── realtime_segment.png
    ├── segment_file.png
    └── table_olap.png

/README.md:
--------------------------------------------------------------------------------
sfmind
===

* [IndexR Introduction](indexr.about.en.md)
* [IndexR Introduction (Chinese edition)](indexr.about.cn.md)

--------------------------------------------------------------------------------
/indexr.about.cn.md:
--------------------------------------------------------------------------------
IndexR: A Real-time Analytical Database at the Hundred-Billion-Row Scale
===

![icon](indexr.about.res/indexr_icon.png)

## Background

The business lines of [Sunteng](http://www.sunteng.com/) connect to major media outlets and apps across the web, generating huge volumes of data. Real-time analysis of this data is used not only to monitor business growth; it also affects product quality and directly creates value. For example, campaign optimizers must track delivery quality at all times, bidding algorithms adjust their strategies in real time based on delivery data, and site owners run traffic analysis and need fast incident feedback.

These analytic requirements share several characteristics:

* **Very high volume** - The data stream reaches 1M messages/s, with tens of billions of messages ingested per day.
* **Near real time** - All data must be processed, stored, analyzed, and presented in real time, with latency of a few seconds from generation to appearing in analysis results. Traditional batch processing simply cannot meet this requirement.
* **Complete, accurate, reliable** - Sampling and approximate computation are not acceptable; the data is flowing money, so reliability requirements are very high.
* **Multidimensional analysis and ad-hoc queries** - Most query results are aggregations over one or more dimension combinations, must return quickly, and ideally have full SQL and UDF support.

Let's first look at the existing solutions:

* Relational databases such as MySQL and PostgreSQL. They generally offer very complete functionality, but cannot handle very large data volumes and perform poorly for statistical analysis; they usually serve as the real-time store in a T+1 architecture.
* K-V stores such as HBase or Redis, usually with a SQL layer on top such as Phoenix, with a streaming framework such as Spark or Storm pre-aggregating data upstream. This kind of architecture has many limitations and struggles to support complex, frequently changing business requirements. Kylin, with its offline pre-aggregation, also falls into this category.
* Databases with distinct strengths such as Infobright, Greenplum, and MemSQL, which have open-source community editions. Under certain conditions and data volumes they can satisfy specific needs, but they have notable drawbacks: some do not support updates, some are hard to operate, and some only handle small data volumes.
* Commercial databases such as HANA and Vertica, and paid cloud services. We did not go in this direction: building the analytics system on a closed third-party platform makes integration with our existing data tooling relatively difficult, and we were concerned about the impact on later extension and migration.
* The so-called time-series databases that have become popular in recent years, represented by Druid, Pinot, and InfluxDB. We studied them in depth, and even deployed one in a project, but ultimately concluded that none of them fit. Some projects are immature or place extreme demands on hardware and lack elasticity; others have significant architectural problems and proved very unstable in practice.
* 
Other open-source analytic tools, such as Impala, Drill, or SparkSQL. They generally focus on the compute layer and lack a suitable data format; they typically analyze static files and cannot analyze real-time data. Current data formats such as Parquet and ORC usually offer good scan and compression performance, but lack effective indexing and the necessary flexibility.

Since none of the existing solutions solved our problem, we finally decided to build a suitable database system ourselves, called **IndexR**. A year later, it was successfully deployed to production.

## IndexR Overview

IndexR is a distributed relational columnar database based on HDFS, specialized in fast statistical analysis of massive historical and real-time data.

* **Fast analytic queries** - IndexR uses columnar storage. For very large datasets it provides efficient indexes that filter out irrelevant data and quickly locate the relevant rows, reducing IO. It uses the excellent [Apache Drill](https://drill.apache.org/) as its upper query engine, and is especially well suited to ad-hoc OLAP queries.
* **Real-time ingestion** - IndexR supports extremely fast real-time data ingestion. As soon as data reaches an IndexR node, it can be queried. Real-time and historical data can be queried together, so the so-called T+1 architecture is no longer needed. Unlike other systems with similar features, IndexR never actively drops any data.
* **Efficient hardware utilization** - Compared with other systems, IndexR runs well on inexpensive machines. You do not need expensive SSDs, high-end CPUs, or midrange servers to get very good performance, although it does run faster on them. Although IndexR runs on the JVM, it manually manages almost all of its memory and uses carefully designed, compact data structures.
* **Highly available, scalable, manageable, simple** - As distributed systems have matured, high availability and scalability have become standard. IndexR's distinguishing feature is that its architecture is very simple and reliable, with only a handful of required configuration options.
* **Deep integration with the Hadoop ecosystem** - IndexR stores its data in HDFS, which means you can process these files with MapReduce or any other Hadoop tool. We currently provide a Hive plugin for ETL work and offline jobs; Spark integration is in progress and will be used for data mining and machine learning.
* **Highly compressed data format** - IndexR stores data in columns with a very high compression ratio, significantly reducing IO and network overhead.
* **Convenient data management** - IndexR makes it easy to import and delete data, and supports table schema changes such as adding, removing, and modifying columns.

## IndexR Architecture

### System structure

IndexR draws on and uses many excellent open-source products, playing to each component's strengths and filling in the missing pieces, forming a very simple, reliable, and efficient database system.

The IndexR system involves the following components:

* IndexR - responsible for the file storage format (indexes and data), real-time ingestion, table definition operations, query optimization, and data caching.
* A distributed compute framework (Drill/Spark) - responsible for executing queries over IndexR data, as well as other compute tasks.
* Hadoop and its surrounding tools - distributed file storage, offline batch computation, offline data management, and various offline ETL jobs. IndexR integrates tightly with Hadoop, serving as a highly compressed, self-indexing file format compatible with all Hive operations.
* Kafka - the message queue; data flows into IndexR through Kafka.
* Zookeeper - cluster state management.

![ecosystem](indexr.about.res/ecosystem.jpg)

### Deployment architecture

Deploying an IndexR system is very simple: there are no complex dependencies and no hard-to-understand node roles. If you already have a Hadoop cluster, deploying IndexR on it can usually be completed within half an hour, even without prior experience. You only need to deploy a Drill node with the IndexR plugin on every Hadoop DataNode (and NameNode); there are only a few required configuration options, and the configuration is identical on every node.

IndexR's service logic is embedded in the Drillbit process, so no extra service needs to be started.

![deploy](indexr.about.res/deploy.jpg)

### Storage structure
IndexR stores data in columns and partitions it into shards called segments. Each segment is self-describing, containing its schema, data, and indexes. Segments are normally immutable, which greatly simplifies data management and makes distributed processing easy.

### Real-time module

One of IndexR's defining features is extremely efficient real-time ingestion: data can be queried immediately, and multiple nodes can ingest simultaneously.

Data ingested in real time is held in realtime segments. Once a threshold is reached, IndexR merges them into historical segments and uploads them to HDFS, after which the data can be used and managed by offline analysis tools.

![realtime_2](indexr.about.res/realtime_2.jpg)

The realtime segment implementation is modeled on the LSM-Tree. All updates are recorded in an on-disk commitlog file; the latest data is kept in memory for fast ingestion and indexing, and is periodically dumped to disk. An IndexR process can be restarted, or even killed outright, at any time without risk of data loss.

![realtime_1](indexr.about.res/realtime_1.jpg)

## Performance

Test hardware: each node has [Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz] x 2, 60G RAM, and 7200 RPM SATA hard disks.

* **Real-time ingestion rate** - over 30K messages/s/node/table. That is, with 10 nodes each serving 10 tables, 3M messages can be consumed within one second - easily a hundred billion rows ingested in real time per day.
* **Scan speed** - A query usually reads multiple fields per row; with the help of modern CPUs and compute frameworks, multiple fields can be processed simultaneously, yielding even better performance than the numbers below:
    * cold data - 30M fields/s/node.
    * hot data - 100M fields/s/node.
    * Scan speed is roughly 2.5x that of Parquet.
* **OLAP queries** - In our production workload of roughly 100 billion rows on 20 nodes, 95% of queries complete within 3s.
    * Roughly 3~8x the speed of the Parquet format in the same Drill environment.
* **Compression ratio** - In production, roughly 10:1 relative to the CSV format; some tables even reach 20:1.
    * Compressed size is about 75% of the ORC format.

--------------------------------------------------------------------------------
/indexr.about.en.md:
--------------------------------------------------------------------------------
## Introduction

IndexR is a distributed, relational database system based on HDFS, focused on fast analysis of both massive static (historical) data and rapidly ingested realtime data. It fully supports SQL and is designed for OLAP.

* **Fast Statistic Query** - IndexR stores its data in a columnar, sorted, and highly compressed form. It provides an effective, smart, so-called rough set index, which can quickly target the relevant data and greatly reduce IO. It uses the remarkable query engine [Drill](https://drill.apache.org/docs/performance/). IndexR is highly suitable for ad-hoc OLAP.

* **Realtime Ingestion** - IndexR supports lightning-fast realtime ingestion.
As soon as events reach IndexR nodes, they are immediately queryable. Realtime data and historical data are accessed in the same query: no more lambda architecture, no more T+1. Unlike other similar systems, IndexR will **NEVER** throw away any events, and we do not even require a column called *timestamp*. Currently IndexR ingests realtime events from [Kafka](http://kafka.apache.org/); more data sources will be supported soon.

* **Hardware Efficiency** - IndexR is very cost-effective on hardware compared with similar systems and databases. It manually manages most of its memory (though it runs on the JVM), with very tight data structures.

> No more "SSD highly recommended", huge RAM, or high-speed CPUs - though they do speed things up anyway.

* **Highly Available, Scalable, Manageable and Simple** - In the modern world, HA and scalability are the baseline for a distributed system. What we want to talk about is manageability and simplicity. There is only one kind of IndexR node in the cluster, with very few required settings. Users can easily add/delete/update tables and dynamically update realtime ingestion settings.

> It is simple and natural, just like using the nice classic MySQL.

* **Deep Integration with the Hadoop Ecosystem** - IndexR stores its data on HDFS. That means you can run MapReduce jobs directly on the same data files without copying or transforming them, and manage your data with any Hadoop tool: just put files into the table directory and they are immediately queryable by IndexR. Normally we use [Hive](https://hive.apache.org/) for the ETL of static data on HDFS and for offline analysis SQL. We are working on integrating IndexR with [Spark](http://spark.apache.org/).
* **High Compression** - IndexR compresses its data at a high ratio, which saves a lot of disk space and reduces IO.
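As a rough illustration of how an ingestion path like the one above can stay durable yet immediately queryable, here is a simplified sketch (not IndexR's actual implementation; names are illustrative): each event is appended to a commitlog for durability, kept in an in-memory store that queries see at once, and flushed to disk packs after a threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch (NOT IndexR's actual code) of commitlog + memstore ingestion:
// durable append, immediately queryable in-memory rows, periodic flush to packs.
public class RealtimeSegmentSketch {
    final List<String> commitlog = new ArrayList<>();       // stands in for the on-disk commitlog
    final List<String> memStore = new ArrayList<>();        // latest rows, queryable at once
    final List<List<String>> flushedPacks = new ArrayList<>();
    final int flushThreshold;

    RealtimeSegmentSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void ingest(String event) {
        commitlog.add(event);            // durable first: a killed process replays this log
        memStore.add(event);             // immediately visible to queries
        if (memStore.size() >= flushThreshold) {
            flush();
        }
    }

    void flush() {                       // periodic dump of in-memory data to disk
        flushedPacks.add(new ArrayList<>(memStore));
        memStore.clear();
    }

    // A query sees flushed packs and the in-memory rows together.
    long countMatching(String prefix) {
        long n = memStore.stream().filter(e -> e.startsWith(prefix)).count();
        for (List<String> pack : flushedPacks) {
            n += pack.stream().filter(e -> e.startsWith(prefix)).count();
        }
        return n;
    }

    public static void main(String[] args) {
        RealtimeSegmentSketch seg = new RealtimeSegmentSketch(3);
        seg.ingest("click:alice");
        seg.ingest("click:bob");
        seg.ingest("view:carol");    // hits the threshold, triggers a flush
        seg.ingest("click:dave");    // lands in the fresh memstore
        System.out.println(seg.countMatching("click:"));  // counts flushed + in-memory rows
    }
}
```

Because every event hits the commitlog before anything else, the in-memory store can be rebuilt after a crash by replaying the log - which is why an IndexR node can be killed at any time without losing data.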
## Motivation

IndexR is inspired by many great projects, especially [Infobright](https://infobright.com/), [Pinot](https://github.com/linkedin/pinot) and [Druid](http://druid.io/). The ideas behind the rough set index and the compression originally come from Infobright's open-source version.

IndexR was originally developed by [Sunteng](http://www.sunteng.com/), a leading internet company doing business in advertising (DSP), website/mobile analytics, and data management platforms (DMP). Almost all of our products rely heavily on a fast analysis system, and we found it difficult to find a suitable solution among existing products.

* Existing SQL-on-Hadoop is either too slow or lacks an effective index. We needed a faster query engine (Drill works pretty well) and an indexed file format with low disk cost.
* A k-v based system like [Storm](http://storm.apache.org/) + [HBase](https://hbase.apache.org/), [Apache Phoenix](https://phoenix.apache.org/), or a pre-built cube like [Kylin](http://kylin.apache.org/) would not save us, because it is too inflexible. There are too many dimensions, and the query model changes week by week.
* Search engine systems, e.g. the [Lucene](https://lucene.apache.org/)-based [ElasticSearch](https://www.elastic.co/) and [Solr](http://lucene.apache.org/solr/), both support aggregation and sorting. The problem is that their data model is not cohesive enough, which costs too much RAM; besides, they offer poor performance on large data processing.
* [Druid](http://druid.io/) functionally meets most of our demands. The main problem is the [complex architecture](http://druid.io/docs/latest/design/design.html) detail exposed to the user, which makes it extremely difficult to maintain.
Besides, Druid needs to [mmap](http://druid.io/docs/latest/operations/rolling-updates.html) all files into memory before startup and query; with lots of files and without SSDs, this can take more than half an hour to complete and consume huge amounts of memory. Worse, Druid's realtime ingestion can lose data, whether from failed ingest tasks or late-timestamp events, which we could not tolerate.

> In fact, we used to run a Druid cluster in production; only after half a year of suffering did we decide to build a better one.

## Performance

Hardware: [Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz] x 2, 60G RAM, SATA HDD at 7200 RPM.

* **Realtime ingestion speed** - maximum over 30K events / second / node / table. E.g. 10 nodes each serving 10 realtime tables can consume 3M events within one second. We believe this is the best score among all similar systems.
* **Scan speed** - You may find much better performance in a real production environment, because IndexR can process multiple values at the same time with the help of modern CPUs and processing platforms (like Drill).
    * cold data: over 30 million values / second / node.
    * hot data: over 100 million values / second / node.
    * ~2.5 times as fast as Parquet.
* **OLAP query** - In our production, we see 95% of queries complete in under 3s on tables with 100 billion+ rows, on 20 nodes.
    * 3~8 times as fast as Parquet under the same Drill environment.
* **Compression** - In our production, we see an average 10:1 compression rate compared with the raw CSV format.
    * ~75% of the size of the ORC file format.

## Usage

* Supporting analytic dashboards, BI, etc.
* In areas like machine learning and time-series data collection (monitoring, statistics).
* Quickly collecting event logs into HDFS.
-------------------------------------------------------------------------------- /indexr.about.res/deploy.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr.about.res/deploy.jpg -------------------------------------------------------------------------------- /indexr.about.res/ecosystem.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr.about.res/ecosystem.jpg -------------------------------------------------------------------------------- /indexr.about.res/indexr_icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr.about.res/indexr_icon.png -------------------------------------------------------------------------------- /indexr.about.res/indexr_icon_500x500.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr.about.res/indexr_icon_500x500.jpg -------------------------------------------------------------------------------- /indexr.about.res/realtime_1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr.about.res/realtime_1.jpg -------------------------------------------------------------------------------- /indexr.about.res/realtime_2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr.about.res/realtime_2.jpg -------------------------------------------------------------------------------- 
/indexr_white_paper/deploy_architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr_white_paper/deploy_architecture.png
--------------------------------------------------------------------------------
/indexr_white_paper/ecosystem.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr_white_paper/ecosystem.png
--------------------------------------------------------------------------------
/indexr_white_paper/indexr_icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr_white_paper/indexr_icon.png
--------------------------------------------------------------------------------
/indexr_white_paper/indexr_white_paper.en.md:
--------------------------------------------------------------------------------
IndexR: Real-time, Hadoop-based Data Warehouse
====
## Summary

IndexR implements a structured data format that is tabular, indexed, and can be deployed in a distributed environment and processed in parallel. On top of this format, IndexR builds a [data warehouse](http://baike.baidu.com/view/19711.htm) system based on the Hadoop ecosystem. It supports fast statistical analysis ([OLAP](http://baike.baidu.com/view/22068.htm)) of massive datasets, and data can be imported in real time and queried without delay. IndexR is designed to solve the problems of slow analysis, data latency, and system complexity in big-data scenarios. This paper describes IndexR's design ideas, system architecture, and core technical details.
Currently the IndexR project is open source, at: [https://github.com/shunfei/IndexR](https://github.com/shunfei/IndexR).

## Introduction

Sunteng, one of the top companies in programmatic advertising, connects to major media across the whole network and produces millions of analysis records every second. These records track and describe the advertising delivery process in detail: creative impressions, clicks, registrations generated by a campaign, return visits, and so on. We need to analyze this data in real time for customer reports, placement optimization, fraud analysis, billing, and more. Users' query patterns are not fixed and cannot be predicted, and as business volume surges, data volume also increases dramatically. We needed a new technology to address these needs:

* **Large datasets, low query latency**. The query pattern is unpredictable and cannot be estimated in advance; tables generally hold more than a hundred million rows, and even specific queries with filter conditions may hit a large amount of data. The data also receives a large volume of updates, with tens of thousands of rows stored per second. Query latency must stay low: in general, within 5s, and within 1s for common high-frequency queries.
* **Real time**. Data should be reflected in analysis results within a few seconds of being generated. Timeliness is critical to some businesses, and the more real-time the data is, the greater its value.
* **Reliability, consistency, high availability**. This is some of the company's most important data; any error or inconsistency may surface directly in customer reports and damage the company's business and brand image, so correctness is critical.
* **Scalable, low-cost, easy to maintain**.
The business develops rapidly, creating new data sources and new tables, while old data cannot be deleted, which brings enormous cost and operational pressure. Typical updates such as index or column-value changes must not affect online services or introduce storage or query latency.
* **SQL support**. Full SQL support, as capable as MySQL. Not only common multidimensional analysis, but also complex analytic queries such as joins and subqueries, plus custom functions (UDF, UDAF).
* **Integration with the Hadoop ecosystem**. The vigorous development of the Hadoop ecosystem brings ever more powerful big-data processing capabilities; combining deeply with its tool chain can greatly extend the value of the system.

The IndexR data platform group was created to answer these challenges. We were unable to find a tool among current open-source products that meets all of these needs.

Products offering similar functionality today either use traditional relational database technology or build cubes to speed up queries. These approaches can bring problems such as operational difficulty, data bottlenecks, or schemas too inflexible to support business change. Some solutions use in-memory storage technology, which is expensive and does not hold a particularly large speed advantage in big-data analysis scenarios. In recent years, time-series databases have been used to solve some ingestion-latency problems, but they still fall short in query performance, usability, and scalability.

The IndexR data warehouse system builds on many excellent open-source products and draws on some carefully designed and implemented existing tools.
It stores data in HDFS, uses Zookeeper for communication and coordination within the cluster, uses Hive's convenient management to partition data, performs high-speed real-time ingestion through Kafka, and uses the excellent distributed query engine Apache Drill as its query layer. Its storage and index design draw on Infobright Community Edition and the Google Mesa paper; the compression algorithms are borrowed from Infobright, and the realtime storage takes inspiration from HBase.

This paper covers IndexR from the following angles:

* Storage format and indexing, IndexR's core modules.
* The realtime storage module: fast ingestion and zero-delay queries.
* Layering and deployment architecture: how to integrate deeply with the Hadoop ecosystem.
* Engineering implementation problems and solutions.
* Typical technology selection.
* The challenges facing data warehouses in the new environment, and IndexR's significance.

IndexR has been in stable operation at Sunteng, supporting daily real-time analysis tasks for core businesses such as the DSP and website monitoring/analysis, with a current total data volume of 30 billion+ rows in the cluster.


## Storage Format and Index Design

### Data File

IndexR stores structured data. For example, the following is a fictitious table of ad-delivery data, table A:

Column name | Data type |
--------------|-----------|
`date` | int |
`country` | string |
`campaign_id` | long |
`impressions` | long |
`clicks` | long |

The data file is called a segment; one segment stores a portion of a table, containing all its columns, as shown below.

![Segment_file](segment_file.png)

The segment file is self-describing: it contains version information, the full table definition, metadata (offsets) for each part, and the indexes. In IndexR, all columns are indexed by default.
The row order can be the natural storage order, or rows can be sorted by user-defined fields. This design simplifies the system architecture, requires no extra metadata store, is very suitable for parallel processing in a distributed environment, and also makes direct use by external systems such as Hive convenient.

Segment row data is further subdivided internally into packs, and each pack has its own index. Inside a pack, rows are stored column by column, i.e. the data of one column is stored together. This layout has great advantages for fast traversal and compression of column data; on modern general-purpose computer architectures it is cache-friendly and convenient for vectorized processing, fully exploiting modern multi-core CPUs. Segment column data uses specially optimized compression algorithms, choosing the algorithm and parameters according to the data type; the compression ratio is usually above 10:1.

In tests on real business data, each IndexR node can process 100 million fields per second. Test machine configuration: [Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz] x 2, 60G RAM, 7200 RPM SATA disk. This is on the low end of current server configurations; a more powerful CPU would give IndexR a very large performance boost.

### Index

IndexR uses a [rough set index](https://en.wikipedia.org/wiki/Rough_set), which can locate files and positions at very low cost and with high accuracy.

For example, suppose one of our data blocks (packs) holds the following data, with columns `date` (int) and `user_name` (string).
Row ID | Date | User_name |
-------|----------|-----------|
0 | 20170101 | Alice |
1 | 20170101 | Bob |
2 | 20170102 | Henry |
3 | 20170107 | Petter |
4 | 20170110 | Mary |

IndexR indexes number and string types differently; the basic idea is described here.

For a number-typed column, the minimum (min) and maximum (max) values of the column are recorded, and the interval (min~max) is split into multiple chunks, each represented by one bit. Each concrete value is then mapped into one of these chunks.

Bit | Index chunk | Value |
----|-------------------|-------|
0 | 20170101~20170102 | 1 |
1 | 20170103~20170104 | 0 |
2 | 20170105~20170106 | 0 |
3 | 20170107~20170108 | 1 |
4 | 20170109~20170110 | 1 |

As shown above, a value of 1 means that one or more rows of data fall into this chunk, and 0 means none do. We only need to store max, min, and the bit sequence (5 bits) to index this column.

Take the query

> SELECT `user_name` FROM A WHERE `date` = '20170106'

Because '20170106' belongs to chunk 2, whose bit is 0, we know that '20170106' does not exist in this pack, and the pack can be skipped directly. This is a Bloom-filter-like filter: a pack fails the check only when it does not contain the required data.

The index for string types is similar, but more complex.

Common indexes today, such as B+Tree and inverted indexes, can locate exact rows and are very effective at relatively small data volumes. But they typically do not compress particularly well - the index file is usually several times the size of the original data - and when the data volume expands to a certain extent, the cost of such an index is magnified until it cannot serve at all.
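The interval-bitmap idea above can be sketched in a few lines, using the example pack (the class and method names are illustrative, not IndexR's actual code):

```java
// Minimal sketch (NOT IndexR's actual implementation) of the rough set index
// idea: min/max plus a small bitmap over equal-width value chunks per pack.
public class RoughSetIndexSketch {
    static final int CHUNKS = 5;          // one bit per chunk, as in the example

    final long min, max;
    final boolean[] bits = new boolean[CHUNKS];

    RoughSetIndexSketch(long[] packValues) {
        long lo = Long.MAX_VALUE, hi = Long.MIN_VALUE;
        for (long v : packValues) {
            lo = Math.min(lo, v);
            hi = Math.max(hi, v);
        }
        this.min = lo;
        this.max = hi;
        for (long v : packValues) {
            bits[chunkOf(v)] = true;      // mark chunks that hold at least one row
        }
    }

    // Map a value into one of the CHUNKS equal-width intervals of [min, max].
    int chunkOf(long v) {
        if (max == min) return 0;
        int c = (int) ((v - min) * CHUNKS / (max - min + 1));
        return Math.min(c, CHUNKS - 1);
    }

    // True if the pack MAY contain the value; false means it can be skipped.
    boolean mayContain(long v) {
        if (v < min || v > max) return false;
        return bits[chunkOf(v)];
    }

    public static void main(String[] args) {
        long[] dates = {20170101, 20170101, 20170102, 20170107, 20170110};
        RoughSetIndexSketch idx = new RoughSetIndexSketch(dates);
        System.out.println(idx.mayContain(20170106));  // chunk 2 bit is 0 -> pack skipped
        System.out.println(idx.mayContain(20170101));  // chunk 0 bit is 1 -> pack scanned
    }
}
```

With only min, max, and a handful of bits per pack, `mayContain` answers "can this pack be skipped?" in constant time; like a Bloom filter, false positives are possible but false negatives are not.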
IndexR's rough set index has the advantage of being very fast; the index file is small enough to load into memory cheaply, and it still works efficiently at extreme data volumes. Because data is usually sorted and clustered, the cardinality of a column within a pack is usually small (as observed in real data), so irrelevant packs are filtered out effectively. It indexes all columns, which is ideal for exploratory business or data analysis scenarios.

## Realtime Storage

IndexR supports appending data in real time, but does not support online updates; data can be updated offline with tools such as Hive, similar to Mesa. Its ingestion is very fast - a single table on a single node can usually reach 30K messages/s - and a message can be queried immediately after it arrives at an IndexR node.

IndexR's realtime storage module uses an LSM-Tree-like structure. A commitlog file persists the incoming messages; the latest data is kept in memory and written to disk after a certain threshold is reached.

![Realtime_segment](realtime_segment.png)

In-memory data is periodically stored to disk; over time this produces many fragmented files, which are sorted and merged after reaching a certain threshold.

![Realtime_process](realtime_process.png)

Rows can be stored in natural arrival order, or sorted by specified fields - similar to the primary index in a relational database and the column family in HBase - which makes the data more cohesive and benefits queries.

Similar to Mesa, IndexR's realtime storage can, when required, organize columns into dimensions and metrics (measures) following the concepts of multidimensional analysis. Rows with identical dimensions are merged together, metrics are combined with aggregation functions (e.g.
SUM, COUNT), and tables can be organized into parent-child relationships with an original table.

![Table_olap](table_olap.png)

As shown in the figure, table B and table C can be considered children of table A. Table A has three dimensions (date, country, campaign_id) and can express the most detailed information. Table B and table C reduce the data volume by dropping dimensions, and can return query results faster.

The application layer only needs to do simple table routing. For example,

> SELECT `date`, `country`, SUM(`impressions`) FROM B WHERE `country` = 'CN' GROUP BY `date`, `country`

can be routed to table B and answered quickly. If a drill-down query is required, such as

> SELECT `campaign_id`, SUM(`impressions`) FROM A WHERE `country` = 'CN' AND `date` = '20170101' GROUP BY `campaign_id`

it is routed to table A.

This design is similar to a pre-aggregated view in a relational database. In OLAP, especially multidimensional analysis scenarios, this design is very effective.

## Architecture Design

IndexR's architecture design follows the principles of simplicity, reliability, and extensibility. It enables large-scale cluster deployments and supports thousands of nodes. IndexR's hardware cost is relatively low, and capacity can be extended linearly by adding nodes.

![Ecosystem](ecosystem.png)

Apache Drill is IndexR's query layer. Drill is a new query engine focused on SQL computation, using techniques such as code generation, vectorized processing, columnar computation, and off-heap memory (eliminating GC), specifically optimized for large datasets. It is very fast and supports standard SQL, with no migration burden. In our experience it is very stable, and its engineering quality is very high.
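The application-layer table routing described in the previous section - pick the smallest pre-aggregated table whose dimensions cover everything the query references - can be sketched as follows (hypothetical names and structure; the document leaves this routing to the application, so this is not an IndexR API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of application-layer table routing over pre-aggregated
// tables: route a query to the smallest table whose dimension set covers all
// dimensions the query references. Mirrors the A/B example tables above.
public class TableRouter {
    // Candidate tables, most aggregated (smallest) first; "A" is the detail table.
    static final Map<String, Set<String>> TABLES = new LinkedHashMap<>();
    static {
        TABLES.put("B", new HashSet<>(Arrays.asList("date", "country")));
        TABLES.put("A", new HashSet<>(Arrays.asList("date", "country", "campaign_id")));
    }

    // Pick the first table whose dimension set covers the query's dimensions.
    static String route(Set<String> queryDimensions) {
        for (Map.Entry<String, Set<String>> e : TABLES.entrySet()) {
            if (e.getValue().containsAll(queryDimensions)) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException("no table covers " + queryDimensions);
    }

    public static void main(String[] args) {
        // GROUP BY date, country -> the pre-aggregated child table B suffices.
        System.out.println(route(new HashSet<>(Arrays.asList("date", "country"))));
        // Drill-down on campaign_id -> must route to the detail table A.
        System.out.println(route(new HashSet<>(Arrays.asList("date", "country", "campaign_id"))));
    }
}
```

Ordering the candidates from most to least aggregated means the cheapest covering table wins, and the detail table serves as the fallback that can always answer.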
IndexR is mainly responsible for the storage layer and for specific query optimizations, such as predicate pushdown and limit pushdown; aggregation pushdown will also be supported in the future. IndexR assigns compute tasks to the most suitable node through a task-assignment algorithm that combines data locality, node busyness, and other factors.

HDFS stores the actual data files, and the distributed file system helps build stateless node services. Data stored in HDFS can easily be used for other complex analysis with various Hadoop tools. We integrated Hive to facilitate offline processing of the data. Because the data exists only once, on HDFS, it can be processed by multiple tools simultaneously, eliminating cumbersome synchronization steps; the 10:1 compression ratio also saves a huge amount of space.

Data is imported into IndexR through Kafka and other queues. IndexR's realtime import is very flexible: import nodes can be added or removed at any time. Its import performance is very high (30K messages/s), so the pressure of ingestion latency becomes history.

There is only one kind of node in an IndexR cluster (IndexR nodes), which simplifies deployment and maintenance and requires no partitioning of node roles. Currently IndexR is embedded as a Drill plugin in the Drillbit process.

![Deploy_architecture](deploy_architecture.png)

IndexR ships with IndexR-tool, a complete set of operational management tools. For example, table structure can be updated online, and realtime ingestion configuration can be added or modified online.


## Project Implementation Challenges

Algorithms and data structures must really land: they have to be realized through concrete engineering, and the quality of the engineering execution determines the final effect of the project.
Tall buildings do not get built from a superb blueprint without high-quality construction and suitable materials. IndexR pursues extreme performance without losing flexibility or scalability.

* **Use direct memory**. IndexR is written mainly in Java 8, and Java's heap memory and garbage collection (GC) model faces major challenges in big-data scenarios. When large memory (over 32G) is required and data is updated frequently, the JVM's GC problems become obvious, performance tends to turn unstable, and the object-instance memory model is often wasteful. In the IndexR project, all stored data and operational temporary data live off-heap, with memory allocation and release managed manually. This increases code complexity, but saves more than half the memory compared with the traditional heap model and avoids GC cost; an assignment operation over a large amount of data can typically use plain memory copies, saving many CPU cycles.
* **Make full use of modern CPUs**. IndexR's off-heap memory model is very helpful for exploiting the hardware's potential: data lives in contiguous memory blocks with no class-pointer chasing and no virtual-function overhead, CPU registers and multi-level caches are fully utilized, vectorized processing is convenient, and there is no structural conversion overhead.
* **Avoid random reads**. Disks read sequentially very fast - which is why Kafka can use disks for message queues - while random reads are relatively slow, so the bottleneck of a traditional database is generally IO. IndexR's indexes are contiguous and friendly to disk reads, and it reorganizes data to make it more cohesive. In particular, we have carefully optimized how files are read.
* **Optimize thread and IO scheduling**.
When the system is very busy, CPU contention makes the cost of thread switching non-negligible. Moreover, a database must perform network and IO operations alongside CPU-heavy work, so how tasks are scheduled and how the number of threads and tasks is arranged has a large impact on overall performance. Sometimes a single thread is both faster and cheaper than many. 164 | * **Implement critical paths in C++.** C++ has a clear efficiency advantage in scenarios that combine heavy memory manipulation with complex CPU computation, so we implemented the key performance points, such as the compression algorithms, in C++.
165 |
166 | ## Tool Selection
167 |
168 | IndexR is a new tool worth considering if your project has the following requirements, or if previous technology choices no longer meet them.
169 |
170 | Classic scenarios:
171 |
172 | * Fast statistical analysis queries on top of massive data sets.
173 | * Very fast ingestion combined with real-time analysis.
174 | * Storing very large amounts of historical detail data, for example website browsing logs, transaction records, security data, power-industry data, or IoT sensor data. Such data is usually huge, complex in content, and kept for a long time, yet users expect to run detailed queries under arbitrary conditions, or complex analyses over some range, reasonably quickly. This is where IndexR's low cost, scalability, and suitability for very large data sets pay off.
175 |
176 | Typical replacement scenarios:
177 |
178 | * MySQL, PostgreSQL and other relational databases used not only for business queries (OLTP) but also for statistical analysis, typically by running analytical queries directly on the business database.
This approach runs into performance problems as the data volume grows; in particular, slow analytical queries can severely affect business queries. Consider importing the data into IndexR for analysis, separating the analytical database from the business database. 179 | * Elasticsearch, Solr and other full-text search databases used for statistical analysis. Their defining feature is solving the indexing problem with inverted indexes. They are usually not specially optimized for statistical analysis, and memory and disk pressure becomes heavy at large data volumes. If you hit performance problems, or the data volume no longer fits, consider IndexR.
180 | * Druid, Pinot and other so-called time-series databases. They can have performance problems when query conditions hit a large amount of data, their sorting and aggregation capabilities are generally weak, and in our experience they are hard to operate and lack flexibility and extensibility, e.g. no joins or subqueries. Keeping a large amount of history also requires relatively expensive hardware. In this scenario IndexR can be a direct replacement without worrying about the business logic.
181 | * Infobright, ClickHouse and other columnar databases. Columnar databases are inherently well suited to OLAP, and IndexR is also a columnar database; the biggest difference is that IndexR is built on the Hadoop ecosystem.
182 | * Offline pre-aggregation, building cubes and storing the results in HBase or another KV store, as Kylin does. This is very effective when the workload is pure multidimensional analysis with relatively simple queries. The problem is the lack of flexibility: no exploratory analysis and no support for more complex analytical requirements.
Through table configuration IndexR achieves the effect of pre-aggregation, but the aggregation is real-time with no delay; the original or higher-dimensional data can be retained, and table routing decides which table serves each query. 183 | * Architectures that solve real-time analysis of large data volumes with a compute engine such as Impala, Presto, SparkSQL or Drill on top, and an open-source data format such as Parquet on the Hadoop storage layer. These architectures are similar to IndexR. IndexR's advantages are a more effective index design, better performance, and support for real-time ingestion with second-level latency. In a query-performance comparison against the Parquet format in the same environment, IndexR was 3 to 8 times faster or more, and IndexR has gone through major performance optimization since then, so even better results are expected.
184 | * Kudu, Phoenix and other open-source products that support OLTP scenarios while also optimizing for OLAP. It is usually hard to serve both well; we suggest splitting into a real-time store and a history store, choosing different storage solutions for data with different characteristics.
185 | * In-memory databases. Expensive.
186 |
187 | The Sunteng big-data platform group has rich experience with most of the technology choices mentioned above: we have either used these tools in production or researched and tested them in depth, which is precisely what prompted the birth of IndexR.
188 |
189 | ## Thinking and Summarizing
190 |
191 | After years of rapid development, the big-data ecosystem has gradually matured; the era when Hadoop meant nothing but running MR jobs is long gone. Once the need to analyze very large data sets was met, people gradually raised their demands for timeliness and ease of use, giving birth to new tools such as Storm and Spark.
New problems breed new challenges and offer new opportunities, while traditional data warehouse products appear powerless in the face of big data. IndexR offers a new way out of this situation. 192 |
193 | IndexR is a new-generation data warehouse system designed for OLAP scenarios. It can host very large amounts of structured data for rapid analysis and supports fast real-time ingestion. It is powerful, simple and reliable, supports large-scale cluster deployment, and integrates deeply with the Hadoop ecosystem, giving full play to the big-data toolchain.
194 |
195 | There is no longer any need to worry about analysis-capacity bottlenecks, give up classic OLAP theory, downgrade your service, or worry that business staff are unfamiliar with big-data tools. IndexR is as easy to use as MySQL: knowing SQL is enough.
196 |
197 | Since open-sourcing IndexR we have seen many use cases from different teams at home and abroad. Interestingly, some teams use it in unusual ways, for example to store extremely large volumes of complex detail data (a single-table design), or to run detailed queries over historical data. IndexR is useful not only in classic OLAP areas such as multidimensional analysis and business intelligence, but also in emerging directions such as IoT, public-opinion monitoring, and crowd-behavior analysis.
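The table-routing idea described under Tool Selection, keeping a detailed table alongside smaller pre-aggregated tables and sending each query to the smallest table that can answer it, can be sketched as follows. This is a minimal illustration, not IndexR's actual API: the class and table names (`TableRouter`, tables `A` and `B`) are hypothetical, and for simplicity it only checks dimension columns, assuming the metric columns exist in every candidate table.

```java
import java.util.*;

// Hypothetical sketch of pre-aggregation table routing: route a query to the
// smallest table whose dimension set covers every dimension the query references.
public class TableRouter {
    // Candidate tables, each with its dimension columns and approximate row count.
    static final Map<String, Set<String>> DIMS = new HashMap<>();
    static final Map<String, Long> ROWS = new HashMap<>();
    static {
        DIMS.put("A", new HashSet<>(Arrays.asList("date", "country", "campaign_id")));
        ROWS.put("A", 1_000_000_000L); // detailed table: all dimensions, most rows
        DIMS.put("B", new HashSet<>(Arrays.asList("date", "country")));
        ROWS.put("B", 10_000_000L);    // pre-aggregated child table: fewer dimensions
    }

    /** Returns the smallest covering table, or null if none covers the query. */
    public static String route(Collection<String> referencedDims) {
        String best = null;
        for (Map.Entry<String, Set<String>> e : DIMS.entrySet()) {
            boolean covers = e.getValue().containsAll(referencedDims);
            if (covers && (best == null || ROWS.get(e.getKey()) < ROWS.get(best))) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(route(Arrays.asList("date", "country")));        // B
        System.out.println(route(Arrays.asList("country", "campaign_id"))); // A
    }
}
```

A query grouping only by `date` and `country` is routed to the small pre-aggregated table, while a drill-down that also references `campaign_id` falls back to the detailed table, matching the routing behavior the text describes.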
198 |
199 | ## Contact Us
200 |
201 | ![indexr_icon](indexr_icon.png)
202 |
203 | * Contact Email: flowbehappy@gmail.com
204 | * QQ discussion group: 606666586 (IndexR discussion group)
205 | -------------------------------------------------------------------------------- /indexr_white_paper/indexr_white_paper.md: -------------------------------------------------------------------------------- 1 | IndexR: A Real-Time, Hadoop-Based Data Warehouse
2 | ====
3 | ## Abstract
4 |
5 | IndexR implements an indexed, columnar, structured data format that can be deployed in a distributed environment and processed in parallel. On top of this format, IndexR builds a [data warehouse](http://baike.baidu.com/view/19711.htm) system based on the Hadoop ecosystem that performs fast statistical analysis ([OLAP](http://baike.baidu.com/view/22068.htm)) over massive data sets, with real-time ingestion and zero query delay. IndexR is designed to solve the problems of slow analysis, data latency, and system complexity in big-data scenarios. This paper describes IndexR's design philosophy, system architecture, and core technical details.
6 |
7 | The IndexR project is now open source: [https://github.com/shunfei/indexr](https://github.com/shunfei/indexr).
8 |
9 | ## Introduction
10 |
11 | One of Sunteng's core businesses, programmatic advertising, connects to all the major media across the web and produces millions of analytical records per second. These records track and describe advertising campaigns in fine detail, such as impressions and clicks per creative, and registrations and return visits per campaign. We need to analyze this data in real time for customer reporting, campaign optimization, fraud analysis, billing, and more. Query patterns from data consumers are non-fixed and unpredictable, and data volume grows sharply with the business. We needed a new technology to meet these requirements:
12 |
13 | * **Very large data sets, low query latency.** Query patterns cannot be predicted, so pre-computation is impossible. Tables commonly exceed 100 million rows, even tens or hundreds of billions, and filter conditions may hit large amounts of data. Data keeps being updated heavily while it is queried, with tens of thousands of rows ingested per second. Query latency must stay low: generally within 5 s, and within 1 s for common high-frequency queries.
14 | * **Near real-time.** Results should reflect new data within seconds of its production. Timeliness is critical for some businesses, and the fresher the data, the more valuable it is.
15 | * **Reliability, consistency, high availability.** This is among the company's most important data; any error or inconsistency may surface directly in customer reports and harm the business and brand image. It is critical.
16 | * **Scalability, low cost, easy maintenance.** The business develops quickly, new data sources and tables appear, and old data cannot be deleted, which brings enormous cost and operational pressure. Typical updates such as adding columns or updating column values must not affect online services or introduce ingestion or query delays.
17 | * **SQL support.** Full SQL support, as convenient and powerful as MySQL: not only common multidimensional analysis but also complex analytical queries such as JOINs and subqueries, plus user-defined functions (UDF, UDAF).
18 | * **Hadoop ecosystem integration.** The flourishing Hadoop ecosystem keeps bringing more processing power to big data; deep integration with its toolchain greatly extends the value of the system.
19 |
20 | IndexR is the Sunteng big-data platform group's answer to these challenges. We could not find an existing open-source product that satisfies all of the needs above.
21 |
22 | Among products that currently offer similar functionality, some rely on traditional relational database technology or accelerate queries with pre-built cubes; these approaches can bring problems such as operational difficulty, data-volume bottlenecks, or schemas too rigid to follow business changes. Some use in-memory storage, which is costly and offers no particular speed advantage for big-data analysis. The time-series databases that have appeared in recent years solve some ingestion-latency problems but have issues with query performance, availability, and scalability.
23 |
24 |
The IndexR data warehouse system was carefully designed and implemented on top of many excellent open-source products, with reference to several existing tools. It stores data in HDFS, uses Zookeeper for communication and coordination within the cluster, uses Hive to conveniently manage partitioned data, ingests data in real time at high speed through Kafka, and uses the excellent distributed query engine Apache Drill as its query layer. Its storage and index design draws on the Infobright community edition and the Google Mesa paper, its compression algorithms borrow from Infobright, and its real-time ingestion takes inspiration from HBase and Druid. 25 |
26 | This paper presents IndexR from the following angles:
27 |
28 | * The storage format and index, IndexR's core module.
29 | * The real-time ingestion module, which achieves fast ingestion with zero query delay.
30 | * The layered structure and deployment architecture, and how IndexR integrates deeply with the Hadoop ecosystem.
31 | * Engineering problems and their solutions.
32 | * Typical technology-selection scenarios.
33 | * The challenges data warehouses face in the new environment, and the significance of IndexR.
34 |
35 | IndexR already runs stably at Sunteng, supporting real-time analysis for core businesses such as the DSP and website monitoring and analytics. The cluster ingests 30+ billion messages per day, and the total data volume is currently on the order of hundreds of billions of rows.
36 |
37 |
38 |
39 |
40 |
41 | ## Storage Format and Index Design
42 |
43 | ### Data Files
44 |
45 | IndexR stores structured data. For example, here is a fictitious ad-delivery table, Table A:
46 |
47 | column name | data type|
48 | -------------|----------|
49 | `date` | int |
50 | `country` | string |
51 | `campaign_id`| long |
52 | `impressions`| long |
53 | `clicks` | long |
54 |
55 | A data file is called a segment. One segment holds a subset of a table's rows, with all of its columns, as shown below.
56 |
57 | ![segment_file](segment_file.png)
58 |
59 | Segment files are self-describing: they contain version information, the complete table definition, metadata (offsets) for each section, and the indexes. IndexR indexes all columns by default. Rows may be kept in natural ingestion order or sorted by user-defined fields. This design simplifies the system architecture: no extra metadata store is needed, parallel processing in a distributed environment is straightforward, and external systems such as Hive can use the files directly.
60 |
61 | Within a segment, rows are further divided into packs, each with its own index. Inside a pack the rows are stored column-wise, i.e. the values of one column are laid out together. This brings great advantages for fast column traversal and for compression; on modern general-purpose computer architectures it is cache-friendly, convenient for vector processing, and makes full use of modern multi-core CPUs. Segment column data is compressed with specially optimized algorithms, choosing different algorithms and parameters per data type; the compression ratio is usually above 10:1.
62 |
63 | In tests on real business data, IndexR processes 100 million fields per second per node. Test machine: [Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz] x 2, 60G RAM, SATA 7200 RPM disk. This counts as low-end among current server configurations; a more powerful CPU brings IndexR a very large performance boost.
64 |
65 | ### Index
66 |
67 | IndexR uses a rough set index, which locates the relevant files and positions at extremely low cost and with high precision.
68 |
69 | Suppose one of our packs contains the following data, with a date column (int) and a user_name column (string).
70 |
71 | row id | date | user_name |
72 | -------|----------|-----------|
73 | 0 | 20170101 | Alice |
74 | 1 | 20170101 | Bob |
75 | 2 | 20170102 | Henry |
76 | 3 | 20170107 | Petter |
77 | 4 | 20170110 | Mary |
78 |
79 | IndexR indexes numeric and string types differently; the basic idea is described here.
80 |
81 | For numeric types, it records the column's maximum (max) and minimum (min), splits the range (max-min) into several intervals, each represented by one bit, and then maps each concrete value into these intervals.
82 |
83 | bit | index
chunk | value |
84 | ----|-------------------|-------|
85 | 0 | 20170101~20170102 | 1 |
86 | 1 | 20170103~20170104 | 0 |
87 | 2 | 20170105~20170106 | 0 |
88 | 3 | 20170107~20170108 | 1 |
89 | 4 | 20170109~20170110 | 1 |
90 |
91 | As shown, a value of 1 means one or more rows fall into that interval; 0 means none do. Storing only max, min, and the value sequence (5 bits) completes the index for this column.
92 |
93 | Consider the query
94 |
95 | >SELECT `user_name` FROM A WHERE `date` = '20170106'
96 |
97 | Since '20170106' belongs to interval 2, whose value is 0, we know '20170106' does not appear in this pack, which can therefore be skipped. This filtering works like a Bloom filter: a pack the index rules out definitely contains no matching data.
98 |
99 | The string index is similar to the numeric one, only somewhat more complex.
100 |
101 | Common indexes today include B+ tree indexes and inverted indexes, which pinpoint exact rows and work well at relatively small data volumes. They usually offer no particularly effective compression, with data files generally 1 to 3 times the size of the raw data; once the data grows past a certain point, the cost of such indexes is amplified until they can no longer serve.
102 |
103 | The strengths of IndexR's rough set index are speed and size: the index files are small enough to be loaded into memory cheaply, so it keeps working effectively at extreme data volumes. Because the data is usually sorted and cohesive, and in practice column cardinality tends to be small, this approach filters out irrelevant packs effectively. It indexes every column, making it well suited to shifting business requirements and to exploratory analysis.
104 |
105 | ## Real-Time Ingestion
106 |
107 | IndexR supports real-time appends but not online updates; data can be updated offline with tools such as Hive, a design similar to Mesa. Ingestion is very fast, typically reaching 30k messages/s per table per node, and a message can be queried the moment it arrives at an IndexR node.
108 |
109 | The real-time ingestion module uses an LSM-tree-like structure. Messages are persisted in a commitlog file, the latest data is kept in memory, and once a threshold is reached it is written to disk.
110 |
111 | ![realtime_segment](realtime_segment.png)
112 |
113 | In-memory data is periodically flushed to disk; over time this produces many small fragment files, which are compacted and merged once they exceed a threshold.
114 |
115 | ![realtime_process](realtime_process.png)
116 |
117 | Rows may be stored in natural arrival order or sorted by specified fields, similar to a primary index in a relational database or a column family in HBase; sorting makes the data more cohesive, which benefits queries greatly.
118 |
119 | Like Mesa, IndexR's real-time ingestion can, when needed, split fields into dimensions and metrics (measures) following the concepts of multidimensional analysis: rows with identical dimensions are merged, metrics are combined with an aggregation function (e.g. SUM, COUNT), and tables can be arranged in parent-child relationships.
120 |
121 | ![table_olap](table_olap.png)
122 |
123 | As shown, Table B and Table C can be regarded as children of Table A. Table A has three dimensions (date, country, campaign_id) and expresses the most detailed information; by dropping dimensions, Table B and Table C shrink the data volume and return query results faster.
124 |
125 | The application layer only needs simple table routing. For example,
126 |
127 | > SELECT `date`, `country`, SUM(`impressions`) FROM B WHERE `country` = 'CN' GROUP BY `date`, `country`
128 |
129 | can be routed to Table B for a fast answer, while a drill-down query such as
130 |
131 | > SELECT `campaign_id`, SUM(`impressions`) FROM A WHERE `country` = 'CN' and `date` = '20170101' GROUP BY `campaign_id`
132 |
133 | is routed to Table A.
134 |
135 | This design resembles pre-aggregated views in relational databases and is very effective in OLAP, especially in multidimensional analysis scenarios.
136 |
137 | ## Architecture
138 |
139 | IndexR's architecture follows the principles of simplicity, reliability, and easy scaling. It can be deployed in large clusters, supporting thousands of nodes. IndexR's hardware cost is in fact relatively low, and processing capacity scales linearly by adding nodes.
140 |
141 | ![ecosystem](ecosystem.png)
142 |
143 | Apache Drill serves as IndexR's query layer. Drill is a brand-new query engine focused on SQL computation; it uses code generation, vector processing, columnar computation, and off-heap memory (eliminating GC), with dedicated optimizations for large data sets. It is extremely fast and supports standard SQL with no migration burden. In our experience it is very stable and of high engineering quality.
144 |
145 | IndexR is mainly responsible for the storage layer and optimizes the query process, e.g. predicate pushdown and limit pushdown, with aggregation pushdown planned. Through a task-assignment algorithm that weighs data locality, node load, and other factors, IndexR assigns each computing task to the most suitable node.
146 |
147 | HDFS stores the actual data files; the distributed file system helps keep the service stateless. Data on HDFS can conveniently be processed with all kinds of Hadoop tools for other complex analyses, and we integrated Hive for offline processing. Since only a single copy of the data lives on HDFS, multiple tools can process it simultaneously, removing tedious synchronization steps and, on top of the 10:1 compression ratio, saving yet more space.
148 |
149 | Data flows into IndexR at high speed through Kafka and similar queues. Real-time ingestion is very flexible: import nodes can be added or removed at any time. With extremely high ingestion throughput (30k/s), ingestion-latency pressure is history.
150 |
151 | An IndexR cluster has only one kind of node (the IndexR node), which eases deployment and maintenance; nodes need not be divided into roles. Currently IndexR is embedded in the Drillbit process as a Drill plugin.
152 |
153 | ![deploy_architecture](deploy_architecture.png)
154 |
155 | IndexR ships with indexr-tool, a complete set of operational tools, e.g. for updating table schemas online and adding or modifying real-time ingestion configurations online.
156 |
157 | ## Engineering Challenges
158 |
159 | Algorithms and data structures only land when realized in a concrete project, and engineering quality determines the final result: a skyscraper cannot be built from a brilliant blueprint without high-quality construction and suitable materials. In its engineering IndexR pursues extreme performance without losing flexible extensibility.
160 |
161 | * Use direct memory. IndexR is written mainly in Java 8, and the JVM's heap and garbage-collection (GC) model faces serious challenges under big-data workloads. With large heaps (over 32G) and frequent data updates, GC problems become obvious, performance turns unstable, and the object-instance memory model often wastes memory. In IndexR we keep all stored data and temporary computation data off-heap and manage allocation and release by hand. This raises code complexity, but saves over half the memory of the traditional on-heap model, removes GC cost, and lets bulk assignment operations use memory copies, saving many CPU cycles.
162 | * Make full use of modern CPUs. IndexR's off-heap memory model is a great help in unlocking hardware potential: contiguous memory blocks, no class-pointer hopping, no virtual-call overhead, full use of CPU registers and multi-level caches, and easy vector processing with no structure-conversion cost.
163 | * Avoid random reads. Disks are very fast at sequential reads (which is why Kafka can use disks for its message queue) and relatively slow at random reads, which is why IO is usually the bottleneck of traditional databases. IndexR's indexing is friendly to sequential disk reads, and it reorganizes data to be more cohesive. We also optimized the file-reading paths in careful detail.
164 | * Optimize thread and IO scheduling. Under heavy load, the thread-switching cost of CPU contention becomes non-negligible, and a database must run network and IO operations alongside busy CPU tasks; how tasks are scheduled and how threads and tasks are allocated has a large effect on overall performance. Sometimes one thread is both faster and cheaper than many.
165 | * Implement critical performance points in C++. C++ has a clear efficiency advantage where heavy memory manipulation meets complex CPU computation; we implemented the key performance points, such as the compression algorithms, in C++.
166 |
167 | ## Tool Selection
168 |
169 | IndexR is a new tool. If your project has the following needs, or earlier technology choices can no longer meet them, consider using IndexR.
170 |
171 | Classic scenarios:
172 |
173 | * Fast statistical analysis queries over massive data.
174 | * Very fast ingestion plus real-time analysis.
175 | * Storing extremely large volumes of historical detail data, e.g. website browsing logs, transaction records, security data, power-industry data, or IoT sensor readings. Such data is usually huge, complex in content, and kept for a long time, yet should support reasonably fast detail queries under arbitrary conditions, or complex analysis over a given range. Here IndexR's low cost, scalability, and fitness for very large data sets shine.
176 |
177 | Typical replacement scenarios:
178 |
179 | * MySQL, PostgreSQL and other relational databases used both for business queries (OLTP) and for statistical analysis, typically running analytics straight on the business database. This hits performance problems as data grows, and analytical queries in particular can severely disturb business queries. Consider importing the data into IndexR for analysis, separating the analytical store from the business store.
180 | * Elasticsearch, Solr and other full-text search databases used for statistical analysis. Their hallmark is solving indexing with inverted indexes; they are usually not specially optimized for analytics, and memory and disk pressure is high at large volumes. If performance or capacity runs out, consider IndexR.
181 | * Druid, Pinot and other so-called time-series databases. They may struggle when query conditions hit a lot of data; sorting and aggregation are generally weak; in our experience operations are hard and flexibility and extensibility are lacking, e.g. no joins or subqueries; and holding long history needs relatively expensive hardware. IndexR can replace them directly without worrying about the business logic.
182 | * Infobright, ClickHouse and other columnar databases. Columnar databases inherently suit OLAP, and IndexR is one too; the biggest difference is that IndexR is based on the Hadoop ecosystem.
183 | * Offline pre-aggregation, cube building, with results stored in HBase or another KV store, as in Kylin. Very effective when the workload is pure multidimensional analysis with simple queries; the problem is insufficient flexibility: no exploratory analysis or more complex analytical requirements. IndexR achieves the pre-aggregation effect via table configuration, with real-time, zero-delay aggregation; the raw or higher-dimensional data can be kept, with table routing deciding which table serves a query.
184 | *
Architectures that address instant analysis of big data with compute engines such as Impala, Presto, SparkSQL or Drill on top and an open-source data format such as Parquet on the Hadoop storage layer. These are similar to IndexR. IndexR's advantages are a more effective index design, better performance, and real-time ingestion with second-level latency. In query-performance comparisons against the Parquet format in the same environment, IndexR was 3 to 8 times faster or more, and after the major performance optimization it has since undergone, even better results are expected. 185 | * Kudu, Phoenix and other open-source products that support OLTP scenarios while also optimizing for OLAP. Serving both well is usually hard; we suggest separate real-time and history stores, with storage chosen per data characteristics.
186 | * In-memory databases. Expensive.
187 |
188 | The Sunteng big-data platform group has rich experience with most of the options above: we have either used these tools in production or researched and tested them deeply, which is precisely what prompted the birth of IndexR.
189 |
190 | ## Reflections and Summary
191 |
192 | After years of rapid development, the big-data ecosystem has gradually matured; the days when Hadoop meant only running MR jobs are long gone. Once people could analyze huge data sets, they gradually demanded more timeliness and ease of use, giving rise to new tools such as Storm and Spark. New problems breed new challenges and offer new opportunities, while traditional data warehouse products look powerless against the impact of big data. IndexR offers a new line of thought and direction for this situation.
193 |
194 | IndexR is a new-generation data warehouse system designed for OLAP scenarios: it analyzes extremely large structured data sets quickly and supports fast real-time ingestion. It is powerful yet simple and reliable, supports large-scale cluster deployment, and integrates deeply with the Hadoop ecosystem to exploit the full power of big-data tools.
195 |
196 | No more worrying about analysis-capacity bottlenecks, abandoning classic OLAP theory, downgrading your service, or fearing that business staff don't know big-data tools: IndexR is as easy to use as MySQL, and knowing SQL is enough.
197 |
198 | Since open-sourcing IndexR we have seen many use cases from teams at home and abroad. Interestingly, some use it in special ways, e.g. storing extremely large (hundreds of billions of rows per table) complex detail data, or running detail queries over history. IndexR serves not only classic OLAP fields such as multidimensional analysis and business intelligence, but also emerging directions such as IoT, public-opinion monitoring, and crowd-behavior analysis.
199 |
200 | ## Contact Us
201 |
202 | ![indexr_icon](indexr_icon.png)
203 |
204 | * Contact email: indexrdb@gmail.com
205 | * QQ discussion group: 606666586 (IndexR discussion group)
206 | -------------------------------------------------------------------------------- /indexr_white_paper/realtime_process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr_white_paper/realtime_process.png -------------------------------------------------------------------------------- /indexr_white_paper/realtime_segment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr_white_paper/realtime_segment.png -------------------------------------------------------------------------------- /indexr_white_paper/segment_file.png:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr_white_paper/segment_file.png -------------------------------------------------------------------------------- /indexr_white_paper/table_olap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shunfei/sfmind/2bd9e7c5984c59a0181556650177f01689ef7b31/indexr_white_paper/table_olap.png --------------------------------------------------------------------------------