├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Altinity

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DEMO DATASET setup

------

## Table of Contents

* [Introduction](#introduction)
* [Preparation](#preparation)
  * [Prepare ClickHouse Repo](#prepare-clickhouse-repo)
  * [Prepare Etalon Dataset Server Access](#prepare-etalon-dataset-server-access)
* [Install and Configure ClickHouse](#install-and-configure-clickhouse)
  * [Install ClickHouse](#install-clickhouse)
  * [Configure ClickHouse](#configure-clickhouse)
    * [Setup Users](#setup-users)
    * [Setup Dictionaries](#setup-dictionaries)
* [SSH-tunnel setup](#ssh-tunnel-setup)
* [Datasets setup](#datasets-setup)
  * [Dataset NYC Taxi Rides](#dataset-nyc-taxi-rides)
    * [Setup NYC Taxi Rides Database](#setup-nyc-taxi-rides-database)
    * [Setup NYC Taxi Rides Tables](#setup-nyc-taxi-rides-tables)
    * [Copy NYC Taxi Rides Dataset](#copy-nyc-taxi-rides-dataset)
    * [Check NYC Taxi Rides Dataset](#check-nyc-taxi-rides-dataset)
  * [Dataset STAR](#dataset-star)
    * [Setup STAR Database](#setup-star-database)
    * [Setup STAR Tables](#setup-star-tables)
    * [Copy STAR Dataset](#copy-star-dataset)
    * [Check STAR Dataset](#check-star-dataset)
  * [Dataset AIRLINE](#dataset-airline)
    * [Setup AIRLINE Database](#setup-airline-database)
    * [Setup AIRLINE Tables](#setup-airline-tables)
    * [Copy AIRLINE Dataset](#copy-airline-dataset)
    * [Check AIRLINE Dataset](#check-airline-dataset)
* [Close SSH-tunnel](#close-ssh-tunnel)
* [Conclusion](#conclusion)

------


## Introduction

All instructions in this manual were tested on Ubuntu 16.04.
There is no need to set up all the datasets from this manual - feel free to skip any of them that you don't need.
The SSH-tunnel section is provided because the 'etalon dataset server' is located behind a firewall, but you may not need this step.
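As a quick sanity check before starting, you can confirm the distro release (a minimal sketch, assuming the standard `lsb_release` tool is available, as on stock Ubuntu):

```bash
# print the distro description, e.g. "Ubuntu 16.04.3 LTS"
lsb_release -d
```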

## Preparation

### Prepare ClickHouse Repo

Ensure we have all `apt`-related tools installed
```bash
sudo apt install software-properties-common
```

Import the repository signing key from the keyserver
```bash
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
```

Figure out the distro codename
```bash
codename=`lsb_release -c|awk '{print $2}'`
echo $codename
```

Build the URL of the ClickHouse repo based on the distro's codename
```bash
REPOURL="http://repo.yandex.ru/clickhouse/$codename"
echo $REPOURL
```

Add the ClickHouse repo located at `$REPOURL`
```bash
sudo apt-add-repository "deb $REPOURL stable main"
```

Update the list of available packages
```bash
sudo apt update
```

### Prepare Etalon Dataset Server Access

You'll need to prepare:
* the hostname or IP address of the 'etalon dataset server'
* an access key in order to get SSH-access to the 'etalon dataset server'

Replace 127.0.0.1 with your 'etalon dataset server' address/hostname

```bash
# replace 127.0.0.1 with your 'etalon dataset server' address/hostname
DATASET_SERVER="127.0.0.1"
```

Ensure you either already have the access key in your `~/.ssh/` folder

```bash
ls -l ~/.ssh/
...
-rw------- 1 user user 1675 Aug  2 11:55 chdemo
...
```

or, if you don't have the access key, create the key file and store the SSH-access key in it locally

```bash
mkdir -p ~/.ssh
touch ~/.ssh/chdemo
```

Edit `~/.ssh/chdemo` and save the key in it.\
The file also has to have limited access rights
```bash
chmod 600 ~/.ssh/chdemo
```

Specify the **FULL PATH** to the access key file as an ENV variable.\
**IMPORTANT** - please do not use shortcuts like `~/.ssh/chdemo` - specify the real **FULL PATH**. Shortcuts will cause a mess when expanded in sub-shells.\
**PLEASE** - full path only.

```bash
# replace "chdemo" with the FULL PATH to your 'etalon dataset server' access key file
DATASET_SERVER_KEY_FILENAME="/home/user/.ssh/chdemo"
```

## Install and Configure ClickHouse

### Install ClickHouse

Install all ClickHouse-related packages: server, client & tools
```bash
sudo apt install clickhouse-client 'clickhouse-server*'
```

Now let's set up the installed ClickHouse

Ensure the service is down
```bash
sudo service clickhouse-server stop
```
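Optionally, confirm that no server process is left running (a quick check using standard tools; it should print nothing if the server is fully stopped):

```bash
# list any remaining clickhouse-server processes
ps ax | grep clickhouse-server | grep -v grep
```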
Create the folder where ClickHouse data will be kept.
The default is `/var/lib/clickhouse`; if that is OK, just skip this step
```bash
sudo mkdir -p /data1/clickhouse
sudo mkdir -p /data1/clickhouse/tmp
sudo chown -R clickhouse:clickhouse /data1/clickhouse
```

Create the folder where dictionary specs will be kept
```bash
sudo mkdir -p /etc/clickhouse-server/dicts
sudo chown -R clickhouse:clickhouse /etc/clickhouse-server/dicts
```

### Configure ClickHouse

Set up ClickHouse to listen on all network interfaces for both IPv4 and IPv6 \
Edit file `/etc/clickhouse-server/config.xml` \
Ensure `<listen_host>` tags have the following content:

```xml
<listen_host>::</listen_host>
<listen_host>0.0.0.0</listen_host>
```

Set up ClickHouse to keep data in the specified dirs - in case the default `/var/lib/clickhouse` is not OK\
Edit file `/etc/clickhouse-server/config.xml`\
Ensure `<path>` and `<tmp_path>` tags have the following content:

```xml
<!-- Path to data directory, with trailing slash -->
<path>/data1/clickhouse/</path>
<!-- Path to temporary data, with trailing slash -->
<tmp_path>/data1/clickhouse/tmp/</tmp_path>
```
Set up ClickHouse to look for dictionaries in the specified dir\
Edit file `/etc/clickhouse-server/config.xml`\
Ensure the `<dictionaries_config>` tag has the following content:

```xml
<dictionaries_config>/etc/clickhouse-server/dicts/*.xml</dictionaries_config>
```

#### Setup Users

Set up access for the default user from localhost only.\
Edit file `/etc/clickhouse-server/users.xml`\
Ensure the `default` user (located inside `<users>`) has `<networks>` tags specified with localhost values only

```xml
<default>
    <networks>
        <ip>::1</ip>
        <ip>127.0.0.1</ip>
    </networks>
</default>
```

Set up a read-only user for ClickHouse with access from all over the world.\
The username will be testuser and it will not have any password.\
Edit file `/etc/clickhouse-server/users.xml`\
Add a new profile called `readonly_set_settings` in the `<profiles>` section right after the `<default>` profile tag

```xml
<readonly_set_settings>
    <!-- Maximum memory usage for processing a single query, in bytes -->
    <max_memory_usage>10000000000</max_memory_usage>

    <!-- Use cache of uncompressed blocks of data -->
    <use_uncompressed_cache>0</use_uncompressed_cache>

    <!-- Which replica to use for distributed query processing -->
    <load_balancing>random</load_balancing>

    <!-- readonly=2 allows read-only queries plus changing settings within the session -->
    <readonly>2</readonly>
</readonly_set_settings>
```

Add a new `<testuser>` tag with a profile referring to the just inserted `readonly_set_settings` profile in the `<users>` section right after the `<default>` user tag:

```xml
<testuser>
    <password></password>
    <networks>
        <ip>::/0</ip>
        <ip>0.0.0.0</ip>
    </networks>
    <profile>readonly_set_settings</profile>
    <quota>default</quota>
</testuser>
```

#### Setup Dictionaries

Prepare the dictionary specifications.\
We'll need **SSH** access to the 'etalon dataset server'.\
Copy the dictionary specifications from the 'etalon dataset server' to `/etc/clickhouse-server/dicts`

```bash
cd /etc/clickhouse-server/dicts
sudo scp -i $DATASET_SERVER_KEY_FILENAME -P 2222 "root@$DATASET_SERVER:/etc/clickhouse-server/dicts/*" .
```

Ensure we have the following files in `/etc/clickhouse-server/dicts`

```bash
ls -l /etc/clickhouse-server/dicts/*
-rw-r--r-- 1 clickhouse clickhouse  831 Jul  6 06:25 taxi_zones.xml
-rw-r--r-- 1 clickhouse clickhouse 2392 Jul  6 06:26 weather.xml
```

Ensure the ClickHouse server is running
```bash
sudo service clickhouse-server restart
```
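Optionally, verify that the server is up and responding to queries:

```bash
# the server should reply with its version string
clickhouse-client -q "SELECT version();"
```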

## SSH-tunnel setup

We also need ClickHouse to have access to the 'etalon dataset server'. Since it is behind the firewall, we need to set up an SSH-tunnel for this. \
Make the local socket `127.0.0.1:9999` be forwarded on server `$DATASET_SERVER` to the local socket `127.0.0.1:9000` on that server. \
Thus, connecting to `127.0.0.1:9999`, we'll connect via **SSH** to `127.0.0.1:9000` on server `$DATASET_SERVER`

```bash
ssh -f -N -i $DATASET_SERVER_KEY_FILENAME -p 2222 root@$DATASET_SERVER -L 127.0.0.1:9999:127.0.0.1:9000
```

## Datasets setup

Now let's set up the demo datasets

### Dataset NYC Taxi Rides

Now let's set up the New York City Taxi Rides dataset

#### Setup NYC Taxi Rides Database
Create the database we'll use

```bash
clickhouse-client -q "CREATE DATABASE IF NOT EXISTS nyc_taxi_rides;"
```

#### Setup NYC Taxi Rides Tables

Drop existing tables if they already exist

```bash
clickhouse-client -q "DROP TABLE IF EXISTS nyc_taxi_rides.central_park_weather_observations;"
clickhouse-client -q "DROP TABLE IF EXISTS nyc_taxi_rides.taxi_zones;"
clickhouse-client -q "DROP TABLE IF EXISTS nyc_taxi_rides.tripdata;"
```

Create the tables we'll use

```bash
clickhouse-client -q "CREATE TABLE nyc_taxi_rides.central_park_weather_observations (
    station_id String,
    station_name String,
    weather_date Date,
    precipitation Float32,
    snow_depth Float32,
    snowfall Int32,
    max_temperature Float32,
    min_temperature Float32,
    average_wind_speed Float32
) ENGINE = MergeTree(weather_date, station_id, 8192);"

clickhouse-client -q "CREATE TABLE nyc_taxi_rides.taxi_zones (
    location_id UInt32,
    zone String,
    create_date Date DEFAULT toDate(0)
) ENGINE = MergeTree(create_date, location_id, 8192);"

clickhouse-client -q "CREATE TABLE nyc_taxi_rides.tripdata (
    pickup_date Date DEFAULT toDate(tpep_pickup_datetime),
    id UInt64,
    vendor_id String,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count Int32,
    trip_distance Float32,
    pickup_longitude Float32,
    pickup_latitude Float32,
    rate_code_id String,
    store_and_fwd_flag String,
    dropoff_longitude Float32,
    dropoff_latitude Float32,
    payment_type String,
    fare_amount String,
    extra String,
    mta_tax String,
    tip_amount String,
    tolls_amount String,
    improvement_surcharge String,
    total_amount Float32,
    pickup_location_id UInt32,
    dropoff_location_id UInt32,
    junk1 String,
    junk2 String
) ENGINE = MergeTree(pickup_date, (id, pickup_location_id, dropoff_location_id, vendor_id), 8192);"
```

#### Copy NYC Taxi Rides Dataset

Fill the newly created tables with data from the remote 'etalon dataset server'

**IMPORTANT:** This operation copies a large amount of data and takes quite a long time

```bash
clickhouse-client -q "INSERT INTO nyc_taxi_rides.central_park_weather_observations SELECT * FROM remote('127.0.0.1:9999', 'nyc_taxi_rides.central_park_weather_observations');"
clickhouse-client -q "INSERT INTO nyc_taxi_rides.taxi_zones SELECT * FROM remote('127.0.0.1:9999', 'nyc_taxi_rides.taxi_zones');"
clickhouse-client -q "INSERT INTO nyc_taxi_rides.tripdata SELECT * FROM remote('127.0.0.1:9999', 'nyc_taxi_rides.tripdata');"
```
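While the copy is running, you can watch its progress from another terminal (an optional sketch; the row count of the biggest table should keep growing):

```bash
# re-run the count every 10 seconds to watch the copy progress
watch -n 10 'clickhouse-client -q "SELECT count() FROM nyc_taxi_rides.tripdata;"'
```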

#### Check NYC Taxi Rides Dataset

After all data is copied, ensure the main tables are filled with data:

```bash
clickhouse-client -q "SELECT count() FROM nyc_taxi_rides.central_park_weather_observations;"
clickhouse-client -q "SELECT count() FROM nyc_taxi_rides.taxi_zones;"
clickhouse-client -q "SELECT count() FROM nyc_taxi_rides.tripdata;"
```

Ensure all dictionaries are healthy via

```bash
clickhouse-client -q "SELECT * FROM system.dictionaries;"
```

There should be two dictionaries, with no errors reported in their statuses

### Dataset STAR

Now let's set up the Star Observations dataset

#### Setup STAR Database

Create the database we'll use

```bash
clickhouse-client -q "CREATE DATABASE IF NOT EXISTS star;"
```

#### Setup STAR Tables

Drop existing tables if they already exist

```bash
clickhouse-client -q "DROP TABLE IF EXISTS star.starexp;"
```

Create the tables we'll use

```bash
clickhouse-client -q "CREATE TABLE star.starexp (
    antiNucleus UInt32,
    eventFile UInt32,
    eventNumber UInt32,
    eventTime Float64,
    histFile UInt32,
    multiplicity UInt32,
    NaboveLb UInt32,
    NbelowLb UInt32,
    NLb UInt32,
    primaryTracks UInt32,
    prodTime Float64,
    Pt Float32,
    runNumber UInt32,
    vertexX Float32,
    vertexY Float32,
    vertexZ Float32,
    eventDate Date DEFAULT CAST(concat(substring(toString(floor(eventTime)), 1, 4), '-', substring(toString(floor(eventTime)), 5, 2), '-', substring(toString(floor(eventTime)), 7, 2)) AS Date)
) ENGINE = MergeTree(eventDate, (eventNumber, eventTime, runNumber, eventFile, multiplicity), 8192);"
```

#### Copy STAR Dataset

Fill the newly created tables with data from the remote 'etalon dataset server'

**IMPORTANT:** This operation copies a large amount of data and takes quite a long time

```bash
clickhouse-client -q "INSERT INTO star.starexp SELECT * FROM remote('127.0.0.1:9999', 'star.starexp');"
```

#### Check STAR Dataset

After all data is copied, ensure the main tables are filled with data:

```bash
clickhouse-client -q "SELECT count() FROM star.starexp;"
```
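As an optional smoke test, here is a simple aggregation over the copied STAR data (an illustrative sketch using the columns from the `CREATE TABLE` above):

```bash
# top 10 runs by number of recorded events
clickhouse-client -q "SELECT runNumber, count() AS events FROM star.starexp GROUP BY runNumber ORDER BY events DESC LIMIT 10;"
```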

### Dataset AIRLINE

Now let's set up the AIRLINE dataset

#### Setup AIRLINE Database

Create the database we'll use

```bash
clickhouse-client -q "CREATE DATABASE IF NOT EXISTS airline;"
```

#### Setup AIRLINE Tables

Drop existing tables if they already exist

```bash
clickhouse-client -q "DROP TABLE IF EXISTS airline.ontime;"
```

Create the tables we'll use

```bash
clickhouse-client -q "CREATE TABLE IF NOT EXISTS airline.ontime (
    Year UInt16,
    Quarter UInt8,
    Month UInt8,
    DayofMonth UInt8,
    DayOfWeek UInt8,
    FlightDate Date,
    UniqueCarrier String,
    AirlineID UInt32,
    Carrier String,
    TailNum String,
    FlightNum String,
    OriginAirportID UInt32,
    OriginAirportSeqID UInt32,
    OriginCityMarketID UInt32,
    Origin String,
    OriginCityName String,
    OriginState String,
    OriginStateFips String,
    OriginStateName String,
    OriginWac UInt32,
    DestAirportID UInt32,
    DestAirportSeqID UInt32,
    DestCityMarketID UInt32,
    Dest String,
    DestCityName String,
    DestState String,
    DestStateFips String,
    DestStateName String,
    DestWac UInt32,
    CRSDepTime UInt32,
    DepTime UInt32,
    DepDelay Float32,
    DepDelayMinutes Float32,
    DepDel15 Float32,
    DepartureDelayGroups Int32,
    DepTimeBlk String,
    TaxiOut Float32,
    WheelsOff UInt32,
    WheelsOn UInt32,
    TaxiIn Float32,
    CRSArrTime UInt32,
    ArrTime UInt32,
    ArrDelay Float32,
    ArrDelayMinutes Float32,
    ArrDel15 Float32,
    ArrivalDelayGroups Int32,
    ArrTimeBlk String,
    Cancelled Float32,
    CancellationCode String,
    Diverted Float32,
    CRSElapsedTime Float32,
    ActualElapsedTime Float32,
    AirTime Float32,
    Flights Float32,
    Distance Float32,
    DistanceGroup Float32,
    CarrierDelay Float32,
    WeatherDelay Float32,
    NASDelay Float32,
    SecurityDelay Float32,
    LateAircraftDelay Float32,
    FirstDepTime String,
    TotalAddGTime String,
    LongestAddGTime String,
    DivAirportLandings String,
    DivReachedDest String,
    DivActualElapsedTime String,
    DivArrDelay String,
    DivDistance String,
    Div1Airport String,
    Div1AirportID UInt32,
    Div1AirportSeqID UInt32,
    Div1WheelsOn String,
    Div1TotalGTime String,
    Div1LongestGTime String,
    Div1WheelsOff String,
    Div1TailNum String,
    Div2Airport String,
    Div2AirportID UInt32,
    Div2AirportSeqID UInt32,
    Div2WheelsOn String,
    Div2TotalGTime String,
    Div2LongestGTime String,
    Div2WheelsOff String,
    Div2TailNum String,
    Div3Airport String,
    Div3AirportID UInt32,
    Div3AirportSeqID UInt32,
    Div3WheelsOn String,
    Div3TotalGTime String,
    Div3LongestGTime String,
    Div3WheelsOff String,
    Div3TailNum String,
    Div4Airport String,
    Div4AirportID UInt32,
    Div4AirportSeqID UInt32,
    Div4WheelsOn String,
    Div4TotalGTime String,
    Div4LongestGTime String,
    Div4WheelsOff String,
    Div4TailNum String,
    Div5Airport String,
    Div5AirportID UInt32,
    Div5AirportSeqID UInt32,
    Div5WheelsOn String,
    Div5TotalGTime String,
    Div5LongestGTime String,
    Div5WheelsOff String,
    Div5TailNum String
)
ENGINE = MergeTree(FlightDate, (FlightDate, Year, Month, DepDel15), 8192);"
```

#### Copy AIRLINE Dataset

Fill the newly created tables with data from the remote 'etalon dataset server'

**IMPORTANT:** This operation copies a large amount of data and takes quite a long time

```bash
clickhouse-client -q "INSERT INTO airline.ontime SELECT * FROM remote('127.0.0.1:9999', 'airline.ontime');"
```

#### Check AIRLINE Dataset

After all data is copied, ensure the main tables are filled with data:

```bash
clickhouse-client -q "SELECT count() FROM airline.ontime;"
```

## Close SSH-tunnel

Now let's terminate the SSH-tunnel to the 'etalon dataset server'

Find the `SSH`-tunnel process `PID`
```bash
SSHPID=`sudo netstat -antp|grep LIST|grep 9999|grep ssh|awk '{print $7}'| sed -e 's/\/.*//g'`
echo $SSHPID
```

Ensure the found `SSH` `PID` is reasonable - that it really is our `SSH`-tunnel
```bash
ps ax | grep $SSHPID | grep -v grep
```

and kill it with the kill command
```bash
kill $SSHPID
```
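Optionally, ensure the tunnel is really gone - the check below (the same `netstat` idiom used above) should print nothing:

```bash
# no process should be listening on 127.0.0.1:9999 anymore
sudo netstat -antp | grep LIST | grep 9999
```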

## Conclusion

If all steps were completed successfully, we'll have a local copy of one (or more) datasets migrated from the 'etalon dataset server'
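For example, with the NYC Taxi Rides dataset in place, you can run analytical queries right away (an illustrative sketch; adjust to whichever datasets you actually copied):

```bash
# average trip distance and number of trips per passenger count
clickhouse-client -q "SELECT passenger_count, round(avg(trip_distance), 2) AS avg_distance, count() AS trips FROM nyc_taxi_rides.tripdata GROUP BY passenger_count ORDER BY passenger_count;"
```

--------------------------------------------------------------------------------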