├── README.md
└── transcript
    ├── README.md
    ├── batch-job-spec.yml
    ├── rawData
    │   ├── transcript.csv
    │   └── transcript.json
    ├── transcript-schema.json
    ├── transcript-table-offline.json
    ├── transcript-table-realtime-kinesis.json
    └── transcript-table-realtime.json

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Apache Pinot Tutorial

Apache Pinot is a real-time OLAP data store that can provide ultra-low latency even at high throughput. It can ingest data from batch data sources such as Hadoop, S3, Azure, and Google Cloud Storage, or from streaming data sources such as Kafka, EventHub, and Kinesis.

Originally built at LinkedIn, Pinot can power a variety of analytical applications, such as
* Real-time dashboarding applications like Superset,
* Anomaly detection applications such as ThirdEye,
* Rich interactive user-facing analytics data products such as Company Analytics, Who Viewed My Profile, UberEats Restaurant Analytics, and many more.

Pinot is also used at companies such as Uber, Microsoft, Weibo, Factual, Slack, and many more.

This repo contains the sample files used in the tutorial video [How to setup a Pinot cluster](https://www.youtube.com/watch?v=cNnwMF0pOJ8). All the commands used in the video can be found below.

## How to set up a Pinot cluster
In the tutorial, we will set up a Pinot cluster with the following components:
* 1 Zookeeper
* 2 controllers
* 2 brokers
* 2 servers

Once the cluster is up and running, we see how to load data into Pinot and query it.
At the end, we show how Pinot is resilient to failures.

Below are all the commands used in the tutorial video.

### Prerequisites
Before we get started, make sure to go over this list of prerequisites.

| No. | Step | Link |
| --- | ---- | ---- |
| 1 | Download sample data and configs | |
| 2 | Download latest Pinot release binary | |
| 3 | Install Java 9 or higher | |
| 4 | Install Apache Maven 3.5.0 or higher | |
| 5 | Setup Zooinspector | |
| 6 | Download Apache Kafka binary 2.4 or higher | |
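If you don't already have the Pinot binary locally (prerequisite 2), the snippet below is one way to fetch and extract it. This is a sketch, not from the original tutorial: the download URL is an assumption based on the Apache dist archive layout, so verify it against the release you are using.

```
# Confirm the Java version first (this tutorial needs Java 9 or higher)
java -version

# Download and extract the Pinot 0.3.0 release binary
# (URL assumed from the Apache dist archive layout; verify before use)
wget https://archive.apache.org/dist/incubator/pinot/apache-pinot-incubating-0.3.0/apache-pinot-incubating-0.3.0-bin.tar.gz
tar -xzf apache-pinot-incubating-0.3.0-bin.tar.gz
cd apache-pinot-incubating-0.3.0-bin
```

All of the launcher-script commands below are run from this `apache-pinot-incubating-0.3.0-bin` directory.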
### Start Zookeeper
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh StartZookeeper -zkPort 2181
```

### Start Controller
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

**Controller 1**
```
bin/pinot-admin.sh StartController \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -controllerPort 9001
```

**Controller 2**
```
bin/pinot-admin.sh StartController \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -controllerPort 9002
```

### Start Broker
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

**Broker 1**
```
bin/pinot-admin.sh StartBroker \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -brokerPort 7001
```

**Broker 2**
```
bin/pinot-admin.sh StartBroker \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -brokerPort 7002
```

### Start Server
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

**Server 1**
```
bin/pinot-admin.sh StartServer \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -serverPort 8001 -serverAdminPort 8011
```

**Server 2**
```
bin/pinot-admin.sh StartServer \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -serverPort 8002 -serverAdminPort 8012
```

The cluster is set up! Explore the cluster using Zooinspector. Explore the admin endpoints using the REST API on the controller at [http://localhost:9001](http://localhost:9001).

Check out the README in the transcript example folder for steps on how to push data into the cluster.

--------------------------------------------------------------------------------
/transcript/README.md:
--------------------------------------------------------------------------------

# Uploading sample data to Pinot

## Batch

Here are the instructions to upload batch data into Pinot.

### Set BASE_DIR

```
pwd
/pinot_tutorial/pinot-tutorial/transcript
BASE_DIR=`pwd`
```

### Upload batch table config and schema
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh AddTable \
  -tableConfigFile $BASE_DIR/transcript-table-offline.json \
  -schemaFile $BASE_DIR/transcript-schema.json \
  -controllerPort 9001 \
  -exec
```

### Upload data
Make sure to replace `$BASE_DIR` in the `batch-job-spec.yml` file with the right paths in `inputDirURI` and `outputDirURI`.
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile $BASE_DIR/batch-job-spec.yml
```

Explore the data using the Query Console on the controller at [http://localhost:9001](http://localhost:9001).
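You can also verify the batch data over plain HTTP instead of the Query Console. The sketch below assumes broker 1 from this tutorial is listening on port 7001 and exposes Pinot's standard broker `/query/sql` endpoint; the column names come from `transcript-schema.json`:

```
curl -X POST -H "Content-Type: application/json" \
  -d '{"sql":"SELECT studentID, subject, score FROM transcript LIMIT 10"}' \
  http://localhost:7001/query/sql
```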
## Streaming

Here are the instructions to ingest data from a Kafka topic.

### Start Kafka
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:
```
bin/pinot-admin.sh StartKafka -zkAddress=localhost:2181/kafka -port 9876
```

### Create a topic

Download the latest release of Apache Kafka from the [Downloads](https://kafka.apache.org/quickstart#quickstart_download) page and untar it. Then, using the scripts in the Kafka download:

```
bin/kafka-topics.sh --create --bootstrap-server localhost:9876 --replication-factor 1 --partitions 2 --topic transcript-topic
```

### Upload realtime table config and schema
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh AddTable \
  -schemaFile $BASE_DIR/transcript-schema.json \
  -tableConfigFile $BASE_DIR/transcript-table-realtime.json \
  -controllerPort 9001 \
  -exec
```

The realtime table begins to ingest from the Kafka topic immediately. Let's publish some events to the Kafka topic.

### Publish data to the Kafka topic
Using the scripts in the Kafka download:
```
bin/kafka-console-producer.sh \
  --broker-list localhost:9876 \
  --topic transcript-topic < $BASE_DIR/rawData/transcript.json
```

The data should arrive in the transcript table. Explore the data using Zooinspector and the Query Console.
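To watch rows arrive one at a time, you can also publish a single event by hand with the same console producer. A sketch; the record below is illustrative, made up to match the fields in `transcript-schema.json`:

```
# Pipe one JSON record into the topic the realtime table consumes from
echo '{"studentID":299,"firstName":"Ada","lastName":"Lovelace","gender":"Female","subject":"Maths","score":4.0,"timestamp":1572678000000}' | \
  bin/kafka-console-producer.sh \
  --broker-list localhost:9876 \
  --topic transcript-topic
```

Re-running a `SELECT COUNT(*) FROM transcript` query in the Query Console should show the row count go up by one.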
--------------------------------------------------------------------------------
/transcript/batch-job-spec.yml:
--------------------------------------------------------------------------------

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '$BASE_DIR/rawData/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '$BASE_DIR/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9001'
pushJobSpec:
  pushAttempts: 1

--------------------------------------------------------------------------------
/transcript/rawData/transcript.csv:
--------------------------------------------------------------------------------

studentID,firstName,lastName,gender,subject,score,timestamp
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000

--------------------------------------------------------------------------------
/transcript/rawData/transcript.json:
--------------------------------------------------------------------------------

{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestamp":1572418800000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestamp":1572678000000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestamp":1572854400000}
{"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestamp":1572854400000}

--------------------------------------------------------------------------------
/transcript/transcript-schema.json:
--------------------------------------------------------------------------------

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "dateTimeFieldSpecs": [{
    "name": "timestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}

--------------------------------------------------------------------------------
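Once the schema above has been uploaded with `AddTable`, a quick sanity check is to read it back from the controller. A sketch, assuming the controller's standard schema endpoint and the controller port from this tutorial:

```
curl http://localhost:9001/schemas/transcript
```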
/transcript/transcript-table-offline.json:
--------------------------------------------------------------------------------

{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {}
}

--------------------------------------------------------------------------------
/transcript/transcript-table-realtime-kinesis.json:
--------------------------------------------------------------------------------

{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kinesis",
      "stream.kinesis.topic.name": "transcript-stream",
      "region": "",
      "accessKey": "",
      "secretKey": "",
      "shardIteratorType": "AFTER_SEQUENCE_NUMBER",
      "stream.kinesis.consumer.type": "lowlevel",
      "stream.kinesis.fetch.timeout.millis": "120000",
      "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
      "realtime.segment.flush.threshold.rows": "10",
      "realtime.segment.flush.threshold.time": "6h"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

--------------------------------------------------------------------------------
/transcript/transcript-table-realtime.json:
--------------------------------------------------------------------------------

{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "2"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

--------------------------------------------------------------------------------
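With the realtime table config above applied and events flowing in from Kafka, one quick way to confirm streaming ingestion end to end is an aggregation query against a broker. A sketch, assuming broker port 7001 from the cluster-setup steps and the standard broker `/query/sql` endpoint:

```
curl -X POST -H "Content-Type: application/json" \
  -d '{"sql":"SELECT subject, COUNT(*), AVG(score) FROM transcript GROUP BY subject"}' \
  http://localhost:7001/query/sql
```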