├── README.md
└── transcript
    ├── README.md
    ├── batch-job-spec.yml
    ├── rawData
    │   ├── transcript.csv
    │   └── transcript.json
    ├── transcript-schema.json
    ├── transcript-table-offline.json
    ├── transcript-table-realtime-kinesis.json
    └── transcript-table-realtime.json

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Apache Pinot Tutorial

Apache Pinot is a real-time OLAP data store that can provide ultra-low latency even at high throughput. It can ingest data from batch data sources such as Hadoop, S3, Azure, and Google Cloud Storage, or from streaming data sources such as Kafka, EventHub, and Kinesis.

Originally built at LinkedIn, Pinot can power a variety of analytical applications, such as
* Real-time dashboarding applications like Superset,
* Anomaly detection applications such as ThirdEye,
* Rich interactive user-facing analytics data products such as Company Analytics, Who Viewed My Profile, UberEats Restaurant Analytics, and many more.

Pinot is also used at companies such as Uber, Microsoft, Weibo, Factual, Slack, and many more.

This repo contains the sample files used in the tutorial video [How to setup a Pinot cluster](https://www.youtube.com/watch?v=cNnwMF0pOJ8). All the commands used in the video can be found below.

## How to set up a Pinot cluster
In the tutorial, we will set up a Pinot cluster with the following components:
* 1 Zookeeper
* 2 controllers
* 2 brokers
* 2 servers

Once the cluster is up and running, we see how to load data into Pinot and query it.
At the end, we show how Pinot is resilient to failures.

Below are all the commands used in the tutorial video.

### Prerequisites
Before we get started, make sure to go over this list of prerequisites.

| No. | Step | Link |
| --- | ---- | ---- |
| 1 | Download sample data and configs | |
| 2 | Download latest Pinot release binary | |
| 3 | Install Java 9 or higher | |
| 4 | Install Apache Maven 3.5.0 or higher | |
| 5 | Setup Zooinspector | |
| 6 | Download Apache Kafka binary 2.4 or higher | |
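If you don't already have the Pinot binary locally (prerequisite 2), the snippet below is one way to fetch and extract it. This is a sketch, not from the original tutorial: the download URL is an assumption based on the Apache dist archive layout, so verify it against the release you are using.

```
# Confirm the Java version first (this tutorial needs Java 9 or higher)
java -version

# Download and extract the Pinot 0.3.0 release binary
# (URL assumed from the Apache dist archive layout; verify before use)
wget https://archive.apache.org/dist/incubator/pinot/apache-pinot-incubating-0.3.0/apache-pinot-incubating-0.3.0-bin.tar.gz
tar -xzf apache-pinot-incubating-0.3.0-bin.tar.gz
cd apache-pinot-incubating-0.3.0-bin
```

All of the launcher-script commands below are run from this `apache-pinot-incubating-0.3.0-bin` directory.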
### Start Zookeeper
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh StartZookeeper -zkPort 2181
```

### Start Controller
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

**Controller 1**
```
bin/pinot-admin.sh StartController \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -controllerPort 9001
```

**Controller 2**
```
bin/pinot-admin.sh StartController \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -controllerPort 9002
```

### Start Broker
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

**Broker 1**
```
bin/pinot-admin.sh StartBroker \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -brokerPort 7001
```

**Broker 2**
```
bin/pinot-admin.sh StartBroker \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -brokerPort 7002
```

### Start Server
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

**Server 1**
```
bin/pinot-admin.sh StartServer \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -serverPort 8001 -serverAdminPort 8011
```

**Server 2**
```
bin/pinot-admin.sh StartServer \
  -zkAddress localhost:2181 \
  -clusterName PinotCluster \
  -serverPort 8002 -serverAdminPort 8012
```

The cluster is set up! Explore the cluster using Zooinspector. Explore the admin endpoints using the REST API on the controller at [http://localhost:9001](http://localhost:9001).

Check out the README in the transcript example folder for steps on how to push data into the cluster.

--------------------------------------------------------------------------------
/transcript/README.md:
--------------------------------------------------------------------------------

# Uploading sample data to Pinot

## Batch

Here are the instructions to upload batch data into Pinot.

### Set BASE_DIR

```
pwd
/pinot_tutorial/pinot-tutorial/transcript
BASE_DIR=`pwd`
```

### Upload batch table config and schema
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh AddTable \
  -tableConfigFile $BASE_DIR/transcript-table-offline.json \
  -schemaFile $BASE_DIR/transcript-schema.json \
  -controllerPort 9001 \
  -exec
```

### Upload data
Make sure to replace `$BASE_DIR` in the `batch-job-spec.yml` file with the right paths in `inputDirURI` and `outputDirURI`.
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh LaunchDataIngestionJob \
  -jobSpecFile $BASE_DIR/batch-job-spec.yml
```

Explore the data using the Query Console on the controller at [http://localhost:9001](http://localhost:9001).
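You can also verify the batch data over plain HTTP instead of the Query Console. The sketch below assumes broker 1 from this tutorial is listening on port 7001 and exposes Pinot's standard broker `/query/sql` endpoint; the column names come from `transcript-schema.json`:

```
curl -X POST -H "Content-Type: application/json" \
  -d '{"sql":"SELECT studentID, subject, score FROM transcript LIMIT 10"}' \
  http://localhost:7001/query/sql
```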
## Streaming

Here are the instructions to ingest data from a Kafka topic.

### Start Kafka
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:
```
bin/pinot-admin.sh StartKafka -zkAddress=localhost:2181/kafka -port 9876
```

### Create a topic

Download the latest release of Apache Kafka from the [Downloads](https://kafka.apache.org/quickstart#quickstart_download) page and untar it. Then, using the scripts in the Kafka download:

```
bin/kafka-topics.sh --create --bootstrap-server localhost:9876 --replication-factor 1 --partitions 2 --topic transcript-topic
```

### Upload realtime table config and schema
Using the launcher script in the `apache-pinot-incubating-0.3.0-bin` directory from the Pinot release:

```
bin/pinot-admin.sh AddTable \
  -schemaFile $BASE_DIR/transcript-schema.json \
  -tableConfigFile $BASE_DIR/transcript-table-realtime.json \
  -controllerPort 9001 \
  -exec
```

The realtime table begins to ingest from the Kafka topic immediately. Let's publish some events to the Kafka topic.

### Publish data to the Kafka topic
Using the scripts in the Kafka download:
```
bin/kafka-console-producer.sh \
  --broker-list localhost:9876 \
  --topic transcript-topic < $BASE_DIR/rawData/transcript.json
```

The data should arrive in the transcript table. Explore the data using Zooinspector and the Query Console.
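To watch rows arrive one at a time, you can also publish a single event by hand with the same console producer. A sketch; the record below is illustrative, made up to match the fields in `transcript-schema.json`:

```
# Pipe one JSON record into the topic the realtime table consumes from
echo '{"studentID":299,"firstName":"Ada","lastName":"Lovelace","gender":"Female","subject":"Maths","score":4.0,"timestamp":1572678000000}' | \
  bin/kafka-console-producer.sh \
  --broker-list localhost:9876 \
  --topic transcript-topic
```

Re-running a `SELECT COUNT(*) FROM transcript` query in the Query Console should show the row count go up by one.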
--------------------------------------------------------------------------------
/transcript/batch-job-spec.yml:
--------------------------------------------------------------------------------

executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '$BASE_DIR/rawData/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: '$BASE_DIR/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
tableSpec:
  tableName: 'transcript'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9001'
pushJobSpec:
  pushAttempts: 1

--------------------------------------------------------------------------------
/transcript/rawData/transcript.csv:
--------------------------------------------------------------------------------

studentID,firstName,lastName,gender,subject,score,timestamp
200,Lucy,Smith,Female,Maths,3.8,1570863600000
200,Lucy,Smith,Female,English,3.5,1571036400000
201,Bob,King,Male,Maths,3.2,1571900400000
202,Nick,Young,Male,Physics,3.6,1572418800000

--------------------------------------------------------------------------------
/transcript/rawData/transcript.json:
--------------------------------------------------------------------------------

{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"Maths","score":3.8,"timestamp":1571900400000}
{"studentID":205,"firstName":"Natalie","lastName":"Jones","gender":"Female","subject":"History","score":3.5,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Maths","score":3.2,"timestamp":1571900400000}
{"studentID":207,"firstName":"Bob","lastName":"Lewis","gender":"Male","subject":"Chemistry","score":3.6,"timestamp":1572418800000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Geography","score":3.8,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"English","score":3.5,"timestamp":1572505200000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Maths","score":3.2,"timestamp":1572678000000}
{"studentID":209,"firstName":"Jane","lastName":"Doe","gender":"Female","subject":"Physics","score":3.6,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"Maths","score":3.8,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"English","score":3.5,"timestamp":1572678000000}
{"studentID":211,"firstName":"John","lastName":"Doe","gender":"Male","subject":"History","score":3.2,"timestamp":1572854400000}
{"studentID":212,"firstName":"Nick","lastName":"Young","gender":"Male","subject":"History","score":3.6,"timestamp":1572854400000}

--------------------------------------------------------------------------------
/transcript/transcript-schema.json:
--------------------------------------------------------------------------------

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    {
      "name": "studentID",
      "dataType": "INT"
    },
    {
      "name": "firstName",
      "dataType": "STRING"
    },
    {
      "name": "lastName",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    },
    {
      "name": "subject",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "score",
      "dataType": "FLOAT"
    }
  ],
  "dateTimeFieldSpecs": [{
    "name": "timestamp",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}

--------------------------------------------------------------------------------
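Once the schema above has been uploaded with `AddTable`, a quick sanity check is to read it back from the controller. A sketch, assuming the controller's standard schema endpoint and the controller port from this tutorial:

```
curl http://localhost:9001/schemas/transcript
```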
/transcript/transcript-table-offline.json:
--------------------------------------------------------------------------------

{
  "tableName": "transcript",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "metadata": {}
}

--------------------------------------------------------------------------------
/transcript/transcript-table-realtime-kinesis.json:
--------------------------------------------------------------------------------

{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kinesis",
      "stream.kinesis.topic.name": "transcript-stream",
      "region": "",
      "accessKey": "",
      "secretKey": "",
      "shardIteratorType": "AFTER_SEQUENCE_NUMBER",
      "stream.kinesis.consumer.type": "lowlevel",
      "stream.kinesis.fetch.timeout.millis": "120000",
      "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory",
      "realtime.segment.flush.threshold.rows": "10",
      "realtime.segment.flush.threshold.time": "6h"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

--------------------------------------------------------------------------------
/transcript/transcript-table-realtime.json:
--------------------------------------------------------------------------------

{
  "tableName": "transcript",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "transcript",
    "replicasPerPartition": "2"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "transcript-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

--------------------------------------------------------------------------------
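With the realtime table config above applied and events flowing in from Kafka, one quick way to confirm streaming ingestion end to end is an aggregation query against a broker. A sketch, assuming broker port 7001 from the cluster-setup steps and the standard broker `/query/sql` endpoint:

```
curl -X POST -H "Content-Type: application/json" \
  -d '{"sql":"SELECT subject, COUNT(*), AVG(score) FROM transcript GROUP BY subject"}' \
  http://localhost:7001/query/sql
```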