├── Oracle_to_Cassandra_OSchema.png ├── README.md └── Red_Hat_Oracle_VM_Notes.md /Oracle_to_Cassandra_OSchema.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simonambridge/Oracle_to_Cassandra/bfc8d7ad9d87d2272f4c84485d9dd504988d3e9e/Oracle_to_Cassandra_OSchema.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Migrating Relational Data From Oracle To DSE/Cassandra - using Spark DataFrames, SparkSQL and the spark-cassandra connector 2 | The objective of this exercise is to demonstrate how to migrate data from Oracle to DataStax Cassandra. I'll be using the DataFrame capability introduced in Apache Spark 1.3 to load data from tables in an Oracle database (12c) via Oracle's JDBC thin driver, to generate a result set, joining tables where necessary. 3 | The data will then be saved to DSE/Cassandra. 4 | 5 | This demo is not intended as an exhaustive or comprehensive tutorial but it should provide enough material to gain an understanding of the processes involved in migrating data from Oracle to Cassandra 6 | 7 |

Pre-requisites

8 |

DataStax Enterprise

9 | You'll need a working installation of DataStax Enterprise. I've used release 5.0.1 here; release 5.0.2 was the latest available at the time of publication. 10 | 11 | - Ubuntu/Debian - https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/install/installDEBdse.html 12 | - Red Hat/Fedora/CentOS/Oracle Linux - https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/install/installRHELdse.html 13 | 14 | To set up your environment, you'll also need the following resources: 15 | 16 | - Python 2.7 17 | - Java 8 18 | - For Red Hat, CentOS and Fedora, install EPEL (Extra Packages for Enterprise Linux). 19 | - An Oracle database: 20 | - In my example the database name is orcl 21 | - A running Oracle TNS listener. In my example I'm using the default port of 1521. 22 | - The ability to make a TNS connection to the Oracle database, e.g. "sqlplus user/password@service_name". 23 | - The Oracle thin JDBC driver. You can download the ojdbc JAR file from: 24 | http://www.oracle.com/technetwork/database/features/jdbc/jdbc-drivers-12c-download-1958347.html 25 | I've used ojdbc7.jar, which is certified for use with both JDK7 and JDK8. 26 | In my 12c Oracle VM, Firefox downloaded this to /app/oracle/downloads/, so you'll see that path referenced in the instructions below. 27 | 28 | 29 | As we will connect from Spark, using the Oracle JDBC driver, to the "orcl" database on TNS port 1521, and then to Cassandra, all these components must be working correctly. 30 |
31 | 32 | Now on to installing DataStax Enterprise and playing with some data! 33 |

34 |

Set Up DataStax Components

35 |
36 | Installation instructions for DSE are provided at the top of this doc. I'll show the instructions for Red Hat/CentOS/Fedora here. I'm using an Oracle Enterprise Linux VirtualBox instance. 37 |

Add The DataStax Repo

38 | As root create ```/etc/yum.repos.d/datastax.repo``` 39 |
# vi /etc/yum.repos.d/datastax.repo
  40 | 
41 | 42 |
43 | Paste in these lines: 44 |
  45 | [datastax] 
  46 | name = DataStax Repo for DataStax Enterprise
  47 | baseurl=https://datastaxrepo_gmail.com:utJVKEg4lKeaWTX@rpm.datastax.com/enterprise
  48 | enabled=1
  49 | gpgcheck=0
  50 | 
51 | 52 |

Import The DataStax Repo Key

53 |
  54 | rpm --import http://rpm.datastax.com/rpm/repo_key 
  55 | 
56 | 57 | 58 |

Install DSE Components

59 |

DSE Platform

60 |
  61 | # yum install dse-full-5.0.1-1
  62 | 
63 | 64 |

DataStax OpsCenter - I got 6.0.2.1

65 |
  66 | # yum install opscenter
  67 | 
68 | 69 |

DataStax OpsCenter Agent - I got 6.0.2.1

70 |
  71 | # yum install datastax-agent
  72 | 
73 |
74 | 75 |

Enable Search & Analytics

76 | 77 | We want to use Search (Solr) and Analytics (Spark), so we need to delete the default datacentre and restart the cluster (if it's already running) in SearchAnalytics mode. 78 |
79 | Stop the service if it's running. 80 |
  81 | #  service dse stop
  82 | Stopping DSE daemon : dse                                  [  OK  ]
  83 | 
84 |
85 | Enable Solr and Spark by changing the flag from "0" to "1" in: 86 |
  87 | # vi /etc/default/dse
  88 | 
89 | e.g.: 90 | 91 |
  92 | # Start the node in DSE Search mode
  93 | SOLR_ENABLED=1
  94 | # Start the node in Spark mode
  95 | SPARK_ENABLED=1
  96 | 
97 | 98 |
99 | Delete the old (Cassandra-only) datacentre databases if they exist: 100 |
 101 | # rm -rf /var/lib/cassandra/data/*
 102 | # rm -rf /var/lib/cassandra/saved_caches/*
 103 | # rm -rf /var/lib/cassandra/commitlog/*
 104 | # rm -rf /var/lib/cassandra/hints/*
 105 | 
106 | 107 | Remove the old system.log if it exists: 108 |
 109 | # rm -rf /var/log/cassandra/system.log 
 110 | 
111 | 112 | Now restart DSE: 113 |
 114 | $ sudo service dse restart
 115 | 
116 |
117 | 118 | After a few minutes use nodetool to check that all is up and running (check for "UN" next to the IP address): 119 |
 120 | $ nodetool status
 121 | Datacenter: SearchAnalytics
 122 | ===========================
 123 | Status=Up/Down
 124 | |/ State=Normal/Leaving/Joining/Moving
 125 | --  Address    Load       Owns    Host ID                               Token                                    Rack
 126 | UN  127.0.0.1  346.89 KB  ?       8e6fa3db-9018-47f0-96df-8c78067fddaa  6840808785095143619                      rack1
 127 | 
 128 | Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
 129 | 
130 | You should also check that you can log into cqlsh: 131 |
 132 | $ cqlsh
 133 | Connected to Test Cluster at 127.0.0.1:9042.
 134 | [cqlsh 5.0.1 | Cassandra 3.0.7.1159 | DSE 5.0.1 | CQL spec 3.4.0 | Native protocol v4]
 135 | Use HELP for help.
 136 | cqlsh> 
 137 | 
138 | Type exit in cqlsh to return to the shell prompt. 139 |
140 | 141 |

Identify Spark Master

142 | We use the new DSE 5.0 format for the dse tool to get the address of the Spark Master in our cluster. 143 | As we are using a single node for our cluster it will be no surprise that the Spark Master is also on our single node! 144 |
 145 | $ dse client-tool spark master-address
 146 | spark://127.0.0.1:7077
 147 | 
148 | OK, let's go look at our source data in the Oracle database. 149 | 150 | > If you have your own data you can substitute your schema for the HR schema shown. 151 | 152 |
153 |

Oracle Database

154 | We're going to create our source data for this exercise using the scripts supplied by Oracle. 155 | We'll use the demo HR schema. 156 |
157 |

Create HR User For The Sample Database

158 | Log into SQLPlus as the sys or system user: 159 |
 160 | $ sqlplus / as sysdba
 161 | 
 162 | SQL*Plus: Release 12.1.0.2.0 Production on Wed Sep 14 02:57:42 2016
 163 | 
 164 | Copyright (c) 1982, 2014, Oracle.  All rights reserved.
 165 | ...
 166 | SQL> 
 167 | 
168 |

Set _ORACLE_SCRIPT

169 | Run this command to allow local user IDs to be created: 170 |
 171 | SQL> alter session set "_ORACLE_SCRIPT"=true; 
 172 | 
173 | 174 |

Create HR User

175 | Create the HR user by using the Oracle-supplied script. The "@?" is a substitute for $ORACLE_HOME. 176 |
 177 | SQL> @?/demo/schema/human_resources/hr_main.sql
 178 | 
179 | 180 | Respond with these parameters 181 | - hr 182 | - users 183 | - temp 184 | - [your sys password] 185 | - $ORACLE_HOME/demo/schema/log/ 186 | 187 | You should see 188 |
 189 | PL/SQL procedure successfully completed.
 190 | 
191 |
192 | 193 |

Investigate The HR Schema

194 | 195 |

What Tables Do We Have?

196 | Log into SQL*Plus as the HR user. 197 | > If you're still logged in as sys you can change to the HR account by typing "connect hr/hr". 198 | 199 | The HR user has a simple schema consisting of 7 tables. The data is stored in a relational format, using foreign keys to look up related rows in other tables - e.g. to find which department an employee belongs to, you use the value in the EMPLOYEES.DEPARTMENT_ID column to look up the matching record in the DEPARTMENTS table. 200 | You can walk through the model using the diagram below: 201 | 202 |

203 | 204 |

205 | 206 |
 207 | $ sqlplus hr/hr
 208 | SQL> set lines 180
 209 | 
 210 | SQL> select table_name from user_tables;
 211 | 
 212 | TABLE_NAME
 213 | ------------------------
 214 | REGIONS
 215 | COUNTRIES
 216 | LOCATIONS
 217 | DEPARTMENTS
 218 | JOBS
 219 | EMPLOYEES
 220 | JOB_HISTORY
 221 | 
 222 | 7 rows selected.
 223 | 
224 | 225 | There are seven tables in the HR schema. I'm going to look at five of them. 226 | 227 |

1. Table EMPLOYEES

228 | Foreign keys on JOB_ID, DEPARTMENT_ID and MANAGER_ID. 229 | 230 |
 231 | SQL> desc employees
 232 | 
 233 |  Name                    Null?    Type
 234 |  ----------------------- -------- ----------------
 235 |  EMPLOYEE_ID             NOT NULL NUMBER(6)
 236 |  FIRST_NAME                       VARCHAR2(20)
 237 |  LAST_NAME               NOT NULL VARCHAR2(25)
 238 |  EMAIL                   NOT NULL VARCHAR2(25)
 239 |  PHONE_NUMBER                     VARCHAR2(20)
 240 |  HIRE_DATE               NOT NULL DATE
 241 |  JOB_ID                  NOT NULL VARCHAR2(10)
 242 |  SALARY                           NUMBER(8,2)
 243 |  COMMISSION_PCT                   NUMBER(2,2)
 244 |  MANAGER_ID                       NUMBER(6)
 245 |  DEPARTMENT_ID                    NUMBER(4)
 246 | 
247 | 248 |

2. Table JOBS

249 |
 250 | SQL> desc jobs
 251 | 
 252 |  Name                    Null?    Type
 253 |  ----------------------- -------- ----------------
 254 |  JOB_ID                  NOT NULL VARCHAR2(10)
 255 |  JOB_TITLE               NOT NULL VARCHAR2(35)
 256 |  MIN_SALARY                       NUMBER(6)
 257 |  MAX_SALARY                       NUMBER(6)
 258 | 
259 | 260 |

3. Table DEPARTMENTS

261 | Foreign keys on MANAGER_ID and LOCATION_ID. 262 |
 263 | SQL> desc departments
 264 | 
 265 |  Name                    Null?    Type
 266 |  ----------------------- -------- ----------------
 267 |  DEPARTMENT_ID           NOT NULL NUMBER(4)
 268 |  DEPARTMENT_NAME         NOT NULL VARCHAR2(30)
 269 |  MANAGER_ID                       NUMBER(6)
 270 |  LOCATION_ID                       NUMBER(4)
 271 | 
272 | 273 |

4. Table LOCATIONS

274 | Foreign key on COUNTRY_ID. 275 |
 276 | SQL> desc locations
 277 | 
 278 |  Name                    Null?    Type
 279 |  ----------------------- -------- ----------------
 280 |  LOCATION_ID             NOT NULL NUMBER(4)
 281 |  STREET_ADDRESS                   VARCHAR2(40)
 282 |  POSTAL_CODE                      VARCHAR2(12)
 283 |  CITY                    NOT NULL VARCHAR2(30)
 284 |  STATE_PROVINCE                   VARCHAR2(25)
 285 |  COUNTRY_ID                       CHAR(2)
 286 | 
287 | 288 |

5. Table COUNTRIES

289 | Foreign key on REGION_ID. 290 |
 291 | SQL> desc countries
 292 | 
 293 |  Name                    Null?    Type
 294 |  ----------------------- -------- ----------------
 295 |  COUNTRY_ID              NOT NULL CHAR(2)
 296 |  COUNTRY_NAME                     VARCHAR2(40)
 297 |  REGION_ID                        NUMBER
 298 | 
299 |
300 |

Explore The HR Data

301 | 302 | In SQLPlus we can select employees from the employees table, for example: 303 |
 304 | SQL> select * from employees;
 305 | 
 306 | EMPLOYEE_ID FIRST_NAME           LAST_NAME            EMAIL                PHONE_NUMBER         HIRE_DATE JOB_ID         SALARY COMMISSION_PCT MANAGER_ID DEPARTMENT_ID
 307 | ----------- -------------------- -------------------- -------------------- -------------------- --------- ---------- ---------- -------------- ---------- -------------
 308 |         120 Matthew              Weiss                MWEISS               650.123.1234         18-JUL-04 ST_MAN           8000                       100            50
 309 |         121 Adam                 Fripp                AFRIPP               650.123.2234         10-APR-05 ST_MAN           8200                       100            50
 310 |         122 Payam                Kaufling             PKAUFLIN             650.123.3234         01-MAY-03 ST_MAN           7900                       100            50
 311 |         123 Shanta               Vollman              SVOLLMAN             650.123.4234         10-OCT-05 ST_MAN           6500                       100            50
 312 |         124 Kevin                Mourgos              KMOURGOS             650.123.5234         16-NOV-07 ST_MAN           5800                       100            50
 313 |         125 Julia                Nayer                JNAYER               650.124.1214         16-JUL-05 ST_CLERK         3200                       120            50
 314 |         126 Irene                Mikkilineni          IMIKKILI             650.124.1224         28-SEP-06 ST_CLERK         2700                       120            50
 315 | 
316 | 317 | Let's walk through the schema to get familiar with the data that we're going to migrate. 318 | For a moment let's just focus on employees reporting to a manager with ID=121: 319 | 320 |
 321 | SQL> select * from employees where manager_id=121;
 322 | 
 323 | EMPLOYEE_ID FIRST_NAME           LAST_NAME            EMAIL                PHONE_NUMBER         HIRE_DATE JOB_ID         SALARY COMMISSION_PCT MANAGER_ID DEPARTMENT_ID
 324 | ----------- -------------------- -------------------- -------------------- -------------------- --------- ---------- ---------- -------------- ---------- -------------
 325 |         129 Laura                Bissot               LBISSOT              650.124.5234         20-AUG-05 ST_CLERK         3300                       121            50
 326 |         130 Mozhe                Atkinson             MATKINSO             650.124.6234         30-OCT-05 ST_CLERK         2800                       121            50
 327 |         131 James                Marlow               JAMRLOW              650.124.7234         16-FEB-05 ST_CLERK         2500                       121            50
 328 |         132 TJ                   Olson                TJOLSON              650.124.8234         10-APR-07 ST_CLERK         2100                       121            50
 329 |         184 Nandita              Sarchand             NSARCHAN             650.509.1876         27-JAN-04 SH_CLERK         4200                       121            50
 330 |         185 Alexis               Bull                 ABULL                650.509.2876         20-FEB-05 SH_CLERK         4100                       121            50
 331 |         186 Julia                Dellinger            JDELLING             650.509.3876         24-JUN-06 SH_CLERK         3400                       121            50
 332 |         187 Anthony              Cabrio               ACABRIO              650.509.4876         07-FEB-07 SH_CLERK         3000                       121            50
 333 | 
334 | 335 | A little relational reminder. Who is that manager with ID=121? 336 | 337 |
 338 | SQL> select * from employees where employee_id=121;
 339 | 
 340 | EMPLOYEE_ID FIRST_NAME           LAST_NAME            EMAIL                PHONE_NUMBER         HIRE_DATE JOB_ID         SALARY COMMISSION_PCT MANAGER_ID DEPARTMENT_ID
 341 | ----------- -------------------- -------------------- -------------------- -------------------- --------- ---------- ---------- -------------- ---------- -------------
 342 |         121 Adam                 Fripp                AFRIPP               650.123.2234         10-APR-05 ST_MAN           8200                       100            50
 343 | 
344 | 345 | ...and what is HIS job? 346 |
 347 | SQL> select * from jobs where job_id='ST_MAN';
 348 | 
 349 | JOB_ID     JOB_TITLE                           MIN_SALARY MAX_SALARY
 350 | ---------- ----------------------------------- ---------- ----------
 351 | ST_MAN     Stock Manager                             5500       8500
 352 | 
353 | 354 | ...and who is HIS boss? His MANAGER_ID is 100, so we run "select * from employees where employee_id=100": 355 | 356 |
 357 | EMPLOYEE_ID FIRST_NAME           LAST_NAME            EMAIL                PHONE_NUMBER         HIRE_DATE JOB_ID         SALARY COMMISSION_PCT MANAGER_ID DEPARTMENT_ID
 358 | ----------- -------------------- -------------------- -------------------- -------------------- --------- ---------- ---------- -------------- ---------- -------------
 359 |         100 Steven               King                 SKING                515.123.4567         17-JUN-03 AD_PRES         24000                                      90
 360 | 
361 | They work in Department=50 - what is that? 362 |
 363 | SQL> select * from departments where department_id=50;
 364 | 
 365 | DEPARTMENT_ID DEPARTMENT_NAME                MANAGER_ID LOCATION_ID
 366 | ------------- ------------------------------ ---------- -----------
 367 |            50 Shipping                              121        1500
 368 | 
369 | It's in Location 1500. Where is location=1500? 370 |
 371 | SQL> select * from locations where location_id=1500;
 372 | 
 373 | LOCATION_ID STREET_ADDRESS                           POSTAL_CODE  CITY                           STATE_PROVINCE            CO
 374 | ----------- ---------------------------------------- ------------ ------------------------------ ------------------------- --
 375 |        1500 2011 Interiors Blvd                      99236        South San Francisco            California                US
 376 | 
377 | And the CO column value is "US" - I wonder what country that is in? 378 |
 379 | SQL> select * from countries where country_id='US';
 380 | 
 381 | CO COUNTRY_NAME                              REGION_ID
 382 | -- ---------------------------------------- ----------
 383 | US United States of America                          2
 384 | 
 385 | 1 row selected.
 386 | 
387 | And the US is in the Americas: 388 |
 389 | SQL> select * from regions where region_id=2;
 390 | 
 391 |  REGION_ID REGION_NAME
 392 | ---------- -------------------------
 393 |          2 Americas
 394 | 
395 | 396 | 397 |
398 |

Using Spark To Read Oracle Data

399 | So all is looking good on the Oracle side. Now time to turn to DSE/Cassandra and Spark. 400 | 401 |

Setup Oracle ojdbc7.jar

402 | We have to add the Oracle JDBC jar file to our Spark classpath so that Spark knows how to talk to the Oracle database. 403 | 404 |
405 |

Using The Oracle JDBC Driver

406 | For this test we only need the jdbc driver file on our single (SparkMaster) node. 407 | In a bigger cluster we would need to distribute it to the slave nodes too. 408 | 409 |

Update Executor Path For ojdbc7.jar In spark-defaults.conf

410 | Add the classpath for the ojdbc7.jar file for the executors. 411 |
 412 | # vi /etc/dse/spark/spark-defaults.conf
 413 | 
414 | Add the following lines pointing to the location of your ojdbc7.jar file (/app/oracle/downloads is my location, yours may be different): 415 |
 416 | spark.driver.extraClassPath = /app/oracle/downloads/ojdbc7.jar
 417 | spark.executor.extraClassPath = /app/oracle/downloads/ojdbc7.jar
 418 | 
419 | 420 |

Restart DSE

421 |
 422 | $ sudo service dse stop
 423 | $ sudo service dse start
 424 | 
425 | 426 | 427 |

Start The Spark REPL

428 | I'm passing the path to the ojdbc7.jar file to Spark on the command line (this shouldn't be needed as the driver path is defined in the spark-defaults.conf file now, but it seems not to work without it). 429 |
 430 | $ dse spark --driver-class-path /app/oracle/downloads/ojdbc7.jar -deprecation
 431 | Welcome to
 432 |       ____              __
 433 |      / __/__  ___ _____/ /__
 434 |     _\ \/ _ \/ _ `/ __/  '_/
 435 |    /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
 436 |       /_/
 437 | Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
 438 | Type in expressions to have them evaluated.
 439 | Type :help for more information.
 440 | Initializing SparkContext with MASTER: spark://127.0.0.1:7077
 441 | Created spark context..
 442 | Spark context available as sc.
 443 | Hive context available as sqlContext. Will be initialized on first use.
 444 | 
445 |
446 | 447 |

Import Some Classes

448 | We import some classes we may or may not use... 449 |
 450 | import com.datastax.spark.connector._
 451 | import com.datastax.spark.connector.cql.CassandraConnector
 452 | import org.apache.log4j.{Level, Logger}
 453 | import org.apache.spark.{SparkConf, SparkContext}
 454 | import java.io._
 455 | 
456 |
457 |

Read Oracle Data Into A Spark DataFrame

458 | 459 | Next, load the data from the Oracle table using JDBC.

460 | For Spark versions below 1.4: 461 |

 462 | scala> val employees = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:hr/hr@localhost:1521/orcl", "dbtable" -> "employees"))
 463 | 
464 | For Spark 1.4 onwards: 465 |
 466 | scala> val employees = sqlContext.read.format("jdbc").option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl").option("driver", "oracle.jdbc.OracleDriver").option("dbtable", "employees").load()
 467 | 
468 | 469 | The Spark REPL responds successfully with: 470 |
 471 | employees: org.apache.spark.sql.DataFrame = [EMPLOYEE_ID: decimal(6,0), FIRST_NAME: string, LAST_NAME: string, EMAIL: string, PHONE_NUMBER: string, HIRE_DATE: timestamp, JOB_ID: string, SALARY: decimal(8,2), COMMISSION_PCT: decimal(2,2), MANAGER_ID: decimal(6,0), DEPARTMENT_ID: decimal(4,0)]
 472 | 
473 | All good so far. 474 | 475 |
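> As an aside, the "dbtable" option doesn't have to name a whole table - anything that is valid in a FROM clause can be used, so a subquery with an alias lets Oracle do the filtering before the rows ever reach Spark. A minimal sketch, assuming the same connection details (shipping_emps and emp_subset are just illustrative names):

```
// Push a filter down to Oracle by passing a subquery as the "dbtable" value
scala> val shipping_emps = sqlContext.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
         .option("driver", "oracle.jdbc.OracleDriver")
         .option("dbtable", "(select * from employees where department_id = 50) emp_subset")
         .load()

scala> shipping_emps.show(5)
```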

Examining The DataFrame

476 | Now that we've created a dataframe using the JDBC method shown above, we can use the dataframe method printSchema() to look at the dataframe schema. 477 | You'll notice that it looks a lot like a table. That's great because it means that we can use it to manipulate large volumes of tabular data: 478 | 479 |
 480 | scala> employees.printSchema()
 481 | root
 482 |  |-- EMPLOYEE_ID: decimal(6,0) (nullable = false)
 483 |  |-- FIRST_NAME: string (nullable = true)
 484 |  |-- LAST_NAME: string (nullable = false)
 485 |  |-- EMAIL: string (nullable = false)
 486 |  |-- PHONE_NUMBER: string (nullable = true)
 487 |  |-- HIRE_DATE: timestamp (nullable = false)
 488 |  |-- JOB_ID: string (nullable = false)
 489 |  |-- SALARY: decimal(8,2) (nullable = true)
 490 |  |-- COMMISSION_PCT: decimal(2,2) (nullable = true)
 491 |  |-- MANAGER_ID: decimal(6,0) (nullable = true)
 492 |  |-- DEPARTMENT_ID: decimal(4,0) (nullable = true)
 493 | 
494 | 495 | We can use the dataframe .show(n) method to display the first n rows of the table: 496 | 497 |
 498 | scala> employees.show(5)
 499 | +-----------+-----------+----------+--------+------------+--------------------+----------+--------+--------------+----------+-------------+
 500 | |EMPLOYEE_ID| FIRST_NAME| LAST_NAME|   EMAIL|PHONE_NUMBER|           HIRE_DATE|    JOB_ID|  SALARY|COMMISSION_PCT|MANAGER_ID|DEPARTMENT_ID|
 501 | +-----------+-----------+----------+--------+------------+--------------------+----------+--------+--------------+----------+-------------+
 502 | |        100|     Steven|      King|   SKING|515.123.4567|2003-06-16 23:00:...|   AD_PRES|24000.00|          null|      null|           90|
 503 | |        101|      Neena|   Kochhar|NKOCHHAR|515.123.4568|2005-09-20 23:00:...|     AD_VP|17000.00|          null|       100|           90|
 504 | |        102|        Lex|   De Haan| LDEHAAN|515.123.4569|2001-01-13 00:00:...|     AD_VP|17000.00|          null|       100|           90|
 505 | |        103|  Alexander|    Hunold| AHUNOLD|590.423.4567|2006-01-03 00:00:...|   IT_PROG| 9000.00|          null|       102|           60|
 506 | |        104|      Bruce|     Ernst|  BERNST|590.423.4568|2007-05-20 23:00:...|   IT_PROG| 6000.00|          null|       103|           60|
 507 | +-----------+-----------+----------+--------+------------+--------------------+----------+--------+--------------+----------+-------------+
 508 | only showing top 5 rows
 509 | 
510 | 511 | We have Employees. Now let's load some data from the Oracle table Departments: 512 | For Spark versions below 1.4: 513 |
 514 | scala> val departments = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:hr/hr@localhost:1521/orcl", "dbtable" -> "departments"))
 515 | 
516 | 517 | For Spark 1.4 onwards: 518 |
 519 | scala> val departments = sqlContext.read.format("jdbc").option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl").option("driver", "oracle.jdbc.OracleDriver").option("dbtable", "departments").load()
 520 | 
521 | 522 | REPL responds: 523 |
 524 | departments: org.apache.spark.sql.DataFrame = [DEPARTMENT_ID: decimal(4,0), DEPARTMENT_NAME: string, MANAGER_ID: decimal(6,0), LOCATION_ID: decimal(4,0)]
 525 | 
526 | 527 |

Note: JDBC Tuning

528 | There are some options available that allow you to tune how Spark uses the JDBC driver. The JDBC datasource supports partitioning - that means that you can specify how Spark will parallelize the load operation from the JDBC source. 529 | By default a JDBC load will be sequential, which is much less efficient where multiple workers are available. 530 | 531 | These options describe how to partition the table when reading in parallel from multiple workers: 532 | - partitionColumn - the numeric column used to partition the table 533 | - lowerBound - the minimum value of the partition column, used to decide the partition stride 534 | - upperBound - the maximum value of the partition column, used to decide the partition stride 535 | - numPartitions - the number of partitions (and therefore parallel read tasks) 538 | 539 | These options must **all** be specified if any of them is specified. 540 | 541 | Note that lowerBound and upperBound are only used to decide the partition stride, not to filter the rows read, and the fetchsize option controls how many rows are fetched per round trip to the database. 545 |
546 | So this might be an alternative load command for the departments table using some of those options: 547 | 548 |
 549 | scala> val departments = sqlContext.read.format("jdbc")
 550 |                   .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
 551 |                   .option("driver", "oracle.jdbc.OracleDriver")
 552 |                   .option("dbtable", "departments")
 553 |                   .option("partitionColumn", "DEPARTMENT_ID")
 554 |                   .option("lowerBound", "1")
 555 |                   .option("upperBound", "100000000")
 556 |                   .option("numPartitions", "4")
 557 |                   .option("fetchsize","1000")
 558 |                   .load()
 559 | 
560 | 561 | > Don’t create too many partitions in parallel on a large cluster, otherwise Spark might crash the external database. 562 | 563 | You can experiment with these options when you load your data on your infrastructure. 564 | Read more about this here: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases 565 | 566 |

View the Departments DataFrame

567 | Examine the schema: 568 |
 569 | scala> departments.printSchema()
 570 | root
 571 |  |-- DEPARTMENT_ID: decimal(4,0) (nullable = false)
 572 |  |-- DEPARTMENT_NAME: string (nullable = false)
 573 |  |-- MANAGER_ID: decimal(6,0) (nullable = true)
 574 |  |-- LOCATION_ID: decimal(4,0) (nullable = true)
 575 | 
576 | 577 | > At this point the JDBC statement has been validated but Spark hasn't yet checked the physical data (this is called lazy evaluation - for example, if you provide an invalid partitioning column you won't get an error message until you actually try to read the data, that is, perform an action). 578 | 579 | Now read records from the Departments DataFrame: 580 | 581 |
 582 | scala> departments.show(5)
 583 | +-------------+--------------------+----------+-----------+
 584 | |DEPARTMENT_ID|     DEPARTMENT_NAME|MANAGER_ID|LOCATION_ID|
 585 | +-------------+--------------------+----------+-----------+
 586 | |           10|      Administration|       200|       1700|
 587 | |           20|           Marketing|       201|       1800|
 588 | |           30|          Purchasing|       114|       1700|
 589 | |           40|     Human Resources|       203|       2400|
 590 | |           50|            Shipping|       121|       1500|
 591 | +-------------+--------------------+----------+-----------+
 592 | only showing top 5 rows
 593 | 
594 | What's the total number of records in the departments dataframe: 595 |
 596 | scala> departments.count()
 597 | res1: Long = 27           
 598 | 
599 |
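To illustrate the lazy-evaluation point made above, here's a minimal sketch - NO_SUCH_COLUMN is a deliberately invalid column name used purely for illustration. Defining the dataframe succeeds; the error only surfaces when an action forces Spark to read the data:

```
// The load() itself succeeds even though the partition column does not exist
scala> val bad = sqlContext.read.format("jdbc")
         .option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl")
         .option("driver", "oracle.jdbc.OracleDriver")
         .option("dbtable", "departments")
         .option("partitionColumn", "NO_SUCH_COLUMN")
         .option("lowerBound", "1")
         .option("upperBound", "1000")
         .option("numPartitions", "2")
         .load()

// Only this action triggers the Oracle "invalid identifier" error
scala> bad.count()
```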
600 | Now that we've seen how to access data in Oracle using Spark DataFrames, let's park that for a moment and head back over to Cassandra to consider how we want to store that data in the Cassandra database. 601 |
602 |

Cassandra Data Modelling

603 | Some basics: 604 | - In Cassandra the data model is driven by the queries you need to support, not by the normalised structure of the source data. 605 | - There are no joins or foreign keys in Cassandra, so data is de-normalised and often duplicated across tables. 606 | - Each table should support a specific query, with the partition key giving fast access to the data for that query. 612 | 613 | For this demonstration I will focus on the EMPLOYEES and DEPARTMENTS tables. 614 | 615 | In the relational model these tables are linked by foreign keys (for example EMPLOYEES.DEPARTMENT_ID referencing DEPARTMENTS), supported by indexes. 623 | 624 | Remember that in a relational database maintaining those indexes is a very expensive operation that becomes more expensive as the volume of data grows. In the Cassandra world we start with the queries and then design the data model to support those queries. 625 | 626 | What we are going to do: 627 | - Read the Oracle EMPLOYEES and DEPARTMENTS tables into Spark dataframes. 628 | - Transform (and where necessary join) that data in Spark so that it matches the query-driven Cassandra tables. 629 | - Write the transformed data to Cassandra using the spark-cassandra connector. 638 | 639 | Our queries are: 640 | - Query 1: retrieve an employee by EMPLOYEE_ID. 641 | - Query 2: retrieve all departments, and employees by department. 642 | - Query 3: retrieve all jobs, and employees by job. 643 | - Query 4: retrieve all managers, and employees by manager. 649 | 650 | In Cassandra we will create the following tables: 651 | - EMPLOYEES - partition key EMPLOYEE_ID (Query 1). 652 | - EMPLOYEES_BY_DEPT - partition key DEPARTMENT_ID, clustering column EMPLOYEE_ID (Query 2). 653 | - EMPLOYEES_BY_JOB and EMPLOYEES_BY_MANAGER - left as exercises for Queries 3 and 4. 659 | 660 |

Create HR KeySpace In Cassandra

661 | The first thing that we need to do in Cassandra is create a keyspace to contain the tables that we will create. I'm using a replication factor of 1 because I have one node in my development cluster. For most production deployments we recommend a multi-datacenter Active-Active HA setup across geographical regions using NetworkTopologyStrategy with RF=3. 662 | 663 | Log into cqlsh. 664 | > If you didn't change the IP defaults in cassandra.yaml then just type ```cqlsh``` - if you changed the IP to be the host IP then you may need to supply the hostname, e.g. ```cqlsh <hostname>```. 665 | 666 | From the cqlsh prompt, create the keyspace to hold our Cassandra HR tables: 667 |
 668 | CREATE KEYSPACE IF NOT EXISTS HR WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 };
 669 | USE HR;
 670 | 
671 | 672 |

Create Employees Table

673 | 674 | De-normalising and potentially duplicating data in Cassandra is not a bad thing. It's quicker than updating indexes, disk is cheap, and Cassandra writes are fast! 675 | 676 | So for Query 1 we'll move the Department, Manager and Job data out of the EMPLOYEES table and use the EMPLOYEES table purely for employee personal data. We retrieve the information that we've moved by creating tables specifically to support the other queries. 677 | We will de-normalise the relationship between Employee and Department using the table created for Query 2. 678 | See Query 2 as an example of getting data about departments and employees - we'll use EMPLOYEES_BY_DEPT to query Departments, and also to query Employees by Department. 679 | 680 |

Query 1: Query All Employees on EMPLOYEE_ID

681 | Create the table we use to hold employees in Cassandra: 682 |
 683 | DROP TABLE IF EXISTS employees;
 684 | 
 685 | CREATE TABLE employees (
 686 |  employee_id            	bigint,
 687 |  first_name             	text ,
 688 |  last_name              	text,
 689 |  email                  	text,
 690 |  phone_number           	text,
 691 |  hire_date              	text,
 692 |  salary                 	decimal,
 693 |  commission_pct         	decimal,
 694 |  PRIMARY KEY (employee_id));
 695 | 
696 | This table satisfies Query 1: we can now do an extremely fast retrieval using the partition key to access our data: 697 |
 698 | cqlsh:hr> select * from employees where employee_id=188;
 699 | 
 700 |  employee_id | commission_pct | email  | first_name | hire_date                | last_name | phone_number | salary
 701 | -------------+----------------+--------+------------+--------------------------+-----------+--------------+---------
 702 |          188 |           null | KCHUNG |      Kelly | 2005-06-14 00:00:00-0400 |     Chung | 650.505.1876 | 3800.00
 703 | 
704 | 705 |

Change Dataframe Column Names To Lower Case

706 | At this point go back to the Spark shell/REPL for the following steps. 707 | 708 | If you try to save your employee dataframe to Cassandra right now (using the Spark-Cassandra connector) it will fail with an error message saying that columns don't exist, e.g. "EMPLOYEE_ID", "FIRST_NAME" - even though they're there. This is because the connector expects the column case in the dataframe to match the column case in the Cassandra table. Unquoted column names in Cassandra are stored in lower case (as in the table we just created), so the dataframe names must match.

709 | So we need to modify our dataframe schema.... 710 | Here's a reminder of our raw employees dataframe schema again: 711 | 712 |

 713 | scala> employees.printSchema()
 714 | root
 715 |  |-- EMPLOYEE_ID: decimal(6,0) (nullable = false)
 716 |  |-- FIRST_NAME: string (nullable = true)
 717 |  |-- LAST_NAME: string (nullable = false)
 718 |  |-- EMAIL: string (nullable = false)
 719 |  |-- PHONE_NUMBER: string (nullable = true)
 720 |  |-- HIRE_DATE: timestamp (nullable = false)
 721 |  |-- JOB_ID: string (nullable = false)
 722 |  |-- SALARY: decimal(8,2) (nullable = true)
 723 |  |-- COMMISSION_PCT: decimal(2,2) (nullable = true)
 724 |  |-- MANAGER_ID: decimal(6,0) (nullable = true)
 725 |  |-- DEPARTMENT_ID: decimal(4,0) (nullable = true)
 726 | 
727 | 728 | We want to change those column names to lower case - so we create a list of column names in lower case matching the dataframe order, creating a new dataframe "emps_lc" in the process. 729 | 730 | We rename the columns in the dataframe like this: 731 | 1. Create a new collection of names: 732 |
 733 | scala> val newNames = Seq("employee_id", "first_name", "last_name", "email","phone_number","hire_date","job_Id","salary","commission_pct","manager_id","department_id")
 734 | 
 735 | newNames: Seq[String] = List(employee_id, first_name, last_name, email, phone_number, hire_date, job_Id, salary, commission_pct, manager_id, department_id)
 736 | 
737 | 2. Create a new dataframe from employees using the new column names: 738 |
 739 | scala> val emps_lc = employees.toDF(newNames: _*)
 740 | 
 741 | emps_lc: org.apache.spark.sql.DataFrame = [employee_id: decimal(6,0), first_name: string, last_name: string, email: string, phone_number: string, hire_date: timestamp, job_Id: string, salary: decimal(8,2), commission_pct: decimal(2,2), manager_id: decimal(6,0), department_id: decimal(4,0)]
 742 | 
743 | 744 | The schema in our new dataframe is in lower case - Yay! 745 |
 746 | scala> emps_lc.printSchema()
 747 | root
 748 |  |-- employee_id: decimal(6,0) (nullable = false)
 749 |  |-- first_name: string (nullable = true)
 750 |  |-- last_name: string (nullable = false)
 751 |  |-- email: string (nullable = false)
 752 |  |-- phone_number: string (nullable = true)
 753 |  |-- hire_date: timestamp (nullable = false)
 754 |  |-- job_Id: string (nullable = false)
 755 |  |-- salary: decimal(8,2) (nullable = true)
 756 |  |-- commission_pct: decimal(2,2) (nullable = true)
 757 |  |-- manager_id: decimal(6,0) (nullable = true)
 758 |  |-- department_id: decimal(4,0) (nullable = true)
 759 | 
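> As an aside, rather than typing out every column name, the same lower-casing could be done programmatically from the dataframe's own column list - a minimal sketch producing an equivalent emps_lc:

```
// Derive the lower-case names from the existing dataframe rather than listing them by hand
scala> val emps_lc = employees.toDF(employees.columns.map(_.toLowerCase): _*)
```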
760 | 761 | 762 | There are some columns in the dataframe above that we don't need for this exercise. For simplicity we will simply create a new dataframe from this one, containing just the columns that we do want to use. 763 | 764 |

Create SparkSQL Tables From The DataFrames

765 | We can manipulate the data in Spark in two ways. We can use the dataframe methods which allow for querying and filtering of data. Or we can use SparkSQL to do that. To access data using SparkSQL we need to register our dataframe as a temporary table against which we can run relational SQL queries. 766 |
767 |

Use SparkSQL To Create A DataFrame That Matches Our Target Table in Cassandra

768 | We'll select the columns that we want into a new dataframe called "emps_lc_subset". 769 | 770 | We can do this using SparkSQL against 'temporary' tables that we register, which provide an API for relational queries against the dataframes. 771 | 772 | We register them like this: 773 |
 774 | scala> employees.registerTempTable("empTable")
 775 | 
 776 | scala> departments.registerTempTable("deptTable")
 777 | 
778 | 779 | Now we can run a query using those 'tables': 780 |
 781 | scala> val emps_lc_subset = sqlContext.sql("SELECT employee_id, first_name, last_name, email, phone_number, hire_date, salary, commission_pct FROM empTable")
 782 | 
 783 | emps_lc_subset: org.apache.spark.sql.DataFrame = [employee_id: decimal(6,0), first_name: string, last_name: string, email: string, phone_number: string, hire_date: timestamp, salary: decimal(8,2), commission_pct: decimal(2,2)]
 784 | 
785 | 786 | And we now have the schema we're looking for to match the Cassandra target table for Query 1. 787 |
 788 | scala> emps_lc_subset.printSchema()
 789 | root
 790 |  |-- employee_id: decimal(6,0) (nullable = false)
 791 |  |-- first_name: string (nullable = true)
 792 |  |-- last_name: string (nullable = false)
 793 |  |-- email: string (nullable = false)
 794 |  |-- phone_number: string (nullable = true)
 795 |  |-- hire_date: timestamp (nullable = false)
 797 |  |-- salary: decimal(8,2) (nullable = true)
 798 |  |-- commission_pct: decimal(2,2) (nullable = true)
 799 | 
800 | 801 |

Use DataFrame Methods To Create A DataFrame That Matches Our Target Table in Cassandra

802 | Alternatively we could have created the same emps_lc_subset dataframe using dataframe select methods: 803 |
 804 | scala> val emps_lc_subset = employees.select("employee_id","first_name","last_name","email","phone_number","hire_date","salary","commission_pct")
 805 | 
 806 | emps_lc_subset: org.apache.spark.sql.DataFrame = [employee_id: decimal(6,0), first_name: string, last_name: string, email: string, phone_number: string, hire_date: timestamp, salary: decimal(8,2), commission_pct: decimal(2,2)]
 807 | 
808 | 809 | And to demonstrate that the results are the same: 810 |
 811 | scala> emps_lc_subset.printSchema()
 812 | root
 813 |  |-- employee_id: decimal(6,0) (nullable = false)
 814 |  |-- first_name: string (nullable = true)
 815 |  |-- last_name: string (nullable = false)
 816 |  |-- email: string (nullable = false)
 817 |  |-- phone_number: string (nullable = true)
 818 |  |-- hire_date: timestamp (nullable = false)
 819 |  |-- salary: decimal(8,2) (nullable = true)
 820 |  |-- commission_pct: decimal(2,2) (nullable = true)
 821 | 
 822 | scala> emps_lc_subset.show(5)
 823 | +-----------+----------+---------+--------+------------+--------------------+--------+--------------+
 824 | |employee_id|first_name|last_name|   email|phone_number|           hire_date|  salary|commission_pct|
 825 | +-----------+----------+---------+--------+------------+--------------------+--------+--------------+
 826 | |        100|    Steven|     King|   SKING|515.123.4567|2003-06-16 23:00:...|24000.00|          null|
 827 | |        101|     Neena|  Kochhar|NKOCHHAR|515.123.4568|2005-09-20 23:00:...|17000.00|          null|
 828 | |        102|       Lex|  De Haan| LDEHAAN|515.123.4569|2001-01-13 00:00:...|17000.00|          null|
 829 | |        103| Alexander|   Hunold| AHUNOLD|590.423.4567|2006-01-03 00:00:...| 9000.00|          null|
 830 | |        104|     Bruce|    Ernst|  BERNST|590.423.4568|2007-05-20 23:00:...| 6000.00|          null|
 831 | +-----------+----------+---------+--------+------------+--------------------+--------+--------------+
 832 | only showing top 5 rows
 833 | 
834 | 835 |

Write The Employees Dataframe To Cassandra

836 | We now have our data nicely defined in a Spark dataframe ready to be written to a table in Cassandra. The next step is to save the data to a Cassandra table using the Spark-Cassandra connector (Note: you must truncate the table first if it isn't empty): 837 | 838 |
 839 | scala> emps_lc_subset.write.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "employees", "keyspace" -> "hr")).save()
 840 | 
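As an aside, if the target table already contains data, an alternative to truncating it first is to ask the connector to append. This is a sketch based on the standard Spark SaveMode API; behaviour may vary by connector version:

```
scala> import org.apache.spark.sql.SaveMode

// Append to the existing hr.employees table instead of requiring it to be empty
scala> emps_lc_subset.write.format("org.apache.spark.sql.cassandra").mode(SaveMode.Append).options(Map( "table" -> "employees", "keyspace" -> "hr")).save()
```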
841 | 842 | And if we hop over to cqlsh we can see the records are there: 843 |
 844 | cqlsh:hr> select * from employees;
 845 | 
 846 |  employee_id | commission_pct | email    | first_name  | hire_date                | last_name   | phone_number       | salary
 847 | -------------+----------------+----------+-------------+--------------------------+-------------+--------------------+----------
 848 |          111 |           null | ISCIARRA |      Ismael | 2005-09-30 00:00:00-0400 |     Sciarra |       515.124.4369 |  7700.00
 849 |          163 |           0.15 |  DGREENE |    Danielle | 2007-03-19 00:00:00-0400 |      Greene | 011.44.1346.229268 |  9500.00
 850 |          148 |           0.30 | GCAMBRAU |      Gerald | 2007-10-15 00:00:00-0400 |   Cambrault | 011.44.1344.619268 | 11000.00
 851 |          197 |           null |  KFEENEY |       Kevin | 2006-05-23 00:00:00-0400 |      Feeney |       650.507.9822 |  3000.00
 852 |          102 |           null |  LDEHAAN |         Lex | 2001-01-13 00:00:00-0500 |     De Haan |       515.123.4569 | 17000.00
 853 |          153 |           0.20 |   COLSEN | Christopher | 2006-03-30 00:00:00-0500 |       Olsen | 011.44.1344.498718 |  8000.00
 854 |          129 |           null |  LBISSOT |       Laura | 2005-08-20 00:00:00-0400 |      Bissot |       650.124.5234 |  3300.00
 855 |          160 |           0.30 |   LDORAN |      Louise | 2005-12-15 00:00:00-0500 |       Doran | 011.44.1345.629268 |  7500.00
 856 |          107 |           null | DLORENTZ |       Diana | 2007-02-07 00:00:00-0500 |     Lorentz |       590.423.5567 |  4200.00
 857 |          136 |           null | HPHILTAN |       Hazel | 2008-02-06 00:00:00-0500 |  Philtanker |       650.127.1634 |  2200.00
 858 |          188 |           null |   KCHUNG |       Kelly | 2005-06-14 00:00:00-0400 |       Chung |       650.505.1876 |  3800.00
 859 |          134 |           null |  MROGERS |     Michael | 2006-08-26 00:00:00-0400 |      Rogers |       650.127.1834 |  2900.00
 860 |          181 |           null |  JFLEAUR |        Jean | 2006-02-23 00:00:00-0500 |      Fleaur |       650.507.9877 |  3100.00
 861 | 
862 | 863 | Note: The integrated DSE release of Cassandra and Spark also gives us the opportunity to read tables in Cassandra directly. 864 | 865 | For example, read some columns from the employees table: 866 |
 867 | scala> val emp_data = sqlContext.sql ("select employee_id, first_name, last_name from hr.employees")
 868 | emp_data: org.apache.spark.sql.DataFrame = [employee_id: bigint, first_name: string, last_name: string]
 869 | 
 870 | scala> emp_data.printSchema()
 871 | root
 872 |  |-- employee_id: long (nullable = true)
 873 |  |-- first_name: string (nullable = true)
 874 |  |-- last_name: string (nullable = true)
 875 | 
 876 | scala> emp_data.count()
 877 | res20: Long = 107
 878 | 
 879 | scala> emp_data.show(5)
 880 | +-----------+----------+---------+
 881 | |employee_id|first_name|last_name|
 882 | +-----------+----------+---------+
 883 | |        111|    Ismael|  Sciarra|
 884 | |        163|  Danielle|   Greene|
 885 | |        148|    Gerald|Cambrault|
 886 | |        197|     Kevin|   Feeney|
 887 | |        102|       Lex|  De Haan|
 888 | +-----------+----------+---------+
 889 | only showing top 5 rows
 890 | 
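The same read can also be expressed with the data source API, mirroring the format string we used for the write (emp_data2 is just an illustrative name):

```
scala> val emp_data2 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "employees", "keyspace" -> "hr")).load()

scala> emp_data2.select("employee_id", "first_name", "last_name").show(5)
```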
891 | 892 |

Query 2: Query All Departments And Employees by Department

893 | 894 |

Multi-Table Joins In SparkSQL

895 | When we migrate data from a relational database to a NoSQL database like Apache Cassandra there is invariably a need to perform some element of data transformation. Transformation typically involves duplication and de-normalisation of data and to do this we frequently want to join data between tables. Let's see how we can do that in Spark. 896 | 897 |

Create Cassandra Table EMPLOYEES_BY_DEPT

898 | To satisfy Query 2 we have a table for Employees stored by Department. 899 | DEPARTMENT_ID is the partition key and EMPLOYEE_ID is the clustering column. We will need to join the Employees and Departments tables to populate this table. 900 | 901 |
 902 | DROP TABLE IF EXISTS EMPLOYEES_BY_DEPT;
 903 | CREATE TABLE employees_by_dept (
 904 |  DEPARTMENT_ID            	bigint,
 905 |  DEPARTMENT_NAME        text static,
 906 |  EMPLOYEE_ID            bigint,
 907 |  FIRST_NAME             text ,
 908 |  LAST_NAME              text,
 909 |  PRIMARY KEY (DEPARTMENT_ID, EMPLOYEE_ID));
 910 | 
911 | 912 | 913 | > For each department we'll store the department id and the name of the department (as a partition key and a static column for that partition), and then within each department partition there will be successive clustered rows of employees in that department. 914 | 915 | We create a dataframe containing employees-by-department by joining our SparkSQL tables employees and departments on DEPARTMENT_ID. 916 | 917 | In the Spark REPL, as an example, let's join the employees and departments tables. 918 | 919 | 920 | Register the dataframes as temporary tables if not done already (this allows us to use Spark SQL on them): 921 | 922 |
 923 | scala> employees.registerTempTable("empTable")
 924 | 
 925 | scala> departments.registerTempTable("deptTable")
 926 | 
927 | 928 | Right, let's do the query joining the tables (this is so easy using SparkSQL :) ): 929 | 930 |
 931 | scala> val emp_by_dept = sqlContext.sql("SELECT d.department_id, d.department_name, e.employee_id, e.first_name, e.last_name FROM empTable e, deptTable d where e.department_id=d.department_id")
 932 | 
933 | Response: 934 |
 935 | emp_by_dept: org.apache.spark.sql.DataFrame = [department_id: decimal(4,0), department_name: string, employee_id: decimal(6,0), first_name: string, last_name: string]
 936 | 
937 | 938 | Here's the schema for the resulting dataframe: 939 |
 940 | scala> emp_by_dept.printSchema()
 941 | root
 942 |  |-- department_id: decimal(4,0) (nullable = false)
 943 |  |-- department_name: string (nullable = false)
 944 |  |-- employee_id: decimal(6,0) (nullable = false)
 945 |  |-- first_name: string (nullable = true)
 946 |  |-- last_name: string (nullable = false)
 947 | 
948 | 949 | The data is now in the format that we need to match the Cassandra target table: 950 |
 951 | scala> emp_by_dept.show(5)
 952 | +-------------+---------------+-----------+----------+-----------+              
 953 | |department_id|department_name|employee_id|first_name|  last_name|
 954 | +-------------+---------------+-----------+----------+-----------+
 955 | |           40|Human Resources|        203|     Susan|     Mavris|
 956 | |           50|       Shipping|        120|   Matthew|      Weiss|
 957 | |           50|       Shipping|        121|      Adam|      Fripp|
 958 | |           50|       Shipping|        122|     Payam|   Kaufling|
 959 | |           50|       Shipping|        123|    Shanta|    Vollman|
 960 | +-------------+---------------+-----------+----------+-----------+
 961 | only showing top 5 rows
 962 | 
963 | 964 |

Write the Employees_By_Department dataframe to Cassandra

965 | Now save the employees_by_dept data to Cassandra using the Spark-Cassandra connector: 966 |
 967 | scala> emp_by_dept.write.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "employees_by_dept", "keyspace" -> "hr")).save()
 968 | 
969 | And the records are there in Cassandra: 970 |
 971 | cqlsh:hr> select * from employees_by_dept;           
 972 | 
 973 |  department_id | employee_id | department_name | first_name  | last_name
 974 | ---------------+-------------+-----------------+-------------+-------------
 975 |             30 |         114 |      Purchasing |         Den |    Raphaely
 976 |             30 |         115 |      Purchasing |   Alexander |        Khoo
 977 |             30 |         116 |      Purchasing |      Shelli |       Baida
 978 |             30 |         117 |      Purchasing |       Sigal |      Tobias
 979 |             30 |         118 |      Purchasing |         Guy |      Himuro
 980 |             30 |         119 |      Purchasing |       Karen |  Colmenares
 981 |             20 |         201 |       Marketing |     Michael |   Hartstein
 982 |             20 |         202 |       Marketing |         Pat |         Fay
 983 |             80 |         145 |           Sales |        John |     Russell
 984 |             80 |         146 |           Sales |       Karen |    Partners
 985 | 
986 | Now we're able to retrieve data about departments: 987 |
 988 | cqlsh:hr> select distinct department_id, department_name from employees_by_dept;
 989 |  department_id | department_name
 990 | ---------------+------------------
 991 |             30 |       Purchasing
 992 |             20 |        Marketing
 993 |             80 |            Sales
 994 |             60 |               IT
 995 |            110 |       Accounting
 996 |             50 |         Shipping
 997 |             10 |   Administration
 998 |            100 |          Finance
 999 |             40 |  Human Resources
1000 |             70 | Public Relations
1001 |             90 |        Executive
1002 | 
1003 | And the second part of the query requirement was to be able to return employees by department: 1004 |
1005 | cqlsh:hr> select department_name, first_name, last_name from employees_by_dept where department_id=50;
1006 | 
1007 |  department_name | first_name | last_name
1008 | -----------------+------------+-------------
1009 |         Shipping |    Matthew |       Weiss
1010 |         Shipping |       Adam |       Fripp
1011 |         Shipping |      Payam |    Kaufling
1012 |         Shipping |     Shanta |     Vollman
1013 |         Shipping |      Kevin |     Mourgos
1014 |         Shipping |      Julia |       Nayer
1015 |         Shipping |      Irene | Mikkilineni
1016 |         Shipping |      James |      Landry
1017 |         Shipping |     Steven |      Markle
1018 |         Shipping |      Laura |      Bissot
1019 |         Shipping |      Mozhe |    Atkinson
1020 |         Shipping |      James |      Marlow
1021 | 
1022 | 
1023 | 1024 |

Query 3: Query All Jobs, And Employees By Job

1025 | Query 3 asks for a list of all job titles and descriptions. The second part of this query requirement was to be able to return Employees by Job. 1026 | With the techniques described above you should now be able to have a go at doing the same thing yourself with the Jobs and Employees tables - see the sketch below. 1027 | (Hint: replace the HR.JOBS lookup table - use EMPLOYEES_BY_JOB with a partition key on JOB_ID, clustered on EMPLOYEE_ID.) 1028 | 1029 |
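For reference, here is one possible (untested) sketch of the Spark side, assuming you have already created an hr.employees_by_job table with JOB_ID as the partition key, EMPLOYEE_ID as the clustering column and a static JOB_TITLE column, mirroring employees_by_dept:

```
// Load the JOBS table from Oracle using the same JDBC pattern as before
scala> val jobs = sqlContext.read.format("jdbc").option("url", "jdbc:oracle:thin:hr/hr@localhost:1521/orcl").option("driver", "oracle.jdbc.OracleDriver").option("dbtable", "jobs").load()

scala> jobs.registerTempTable("jobTable")

// Join employees to jobs to build the employees-by-job dataframe
scala> val emp_by_job = sqlContext.sql("SELECT j.job_id, j.job_title, e.employee_id, e.first_name, e.last_name FROM empTable e, jobTable j WHERE e.job_id = j.job_id")

// Save to the (assumed) hr.employees_by_job table
scala> emp_by_job.write.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "employees_by_job", "keyspace" -> "hr")).save()
```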

Query 4: Query All Managers, And Employees By Manager

1030 | If you're a real over-achiever, why not have a go at using the Manager column in the Employees table :) 1031 | We need to be able to query all Managers. The second part of this query requirement was to be able to return Employees by Manager. 1032 | (Hint: replace the old FK on MANAGER_ID in the EMPLOYEES table - use an EMPLOYEES_BY_MANAGER table with a partition key on MANAGER_ID, clustered on EMPLOYEE_ID.) 1033 | 1034 | -------------------------------------------------------------------------------- /Red_Hat_Oracle_VM_Notes.md: -------------------------------------------------------------------------------- 1 | # Set Up Red Hat 2 | Various steps required to configure the default Oracle-supplied VM image. 3 |
4 | 5 | ## 1.Tidy Up... 6 | 7 | The output from ifconfig gives us the current IP setup: 8 |
  9 | eth8      Link encap:Ethernet  HWaddr 08:00:27:B7:44:21  
 10 |           inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
 11 | 
12 | 13 | The default NIC is on eth8 - we need to tweak the config in ```/etc/sysconfig/network-scripts``` by creating a file for eth0. 14 | 15 | Start off by getting rid of the eth0..eth8 mappings defined in /etc/udev/rules.d/70-persistent-net.rules. 16 | 17 | Delete this file, then reboot. 18 |
 19 | # rm /etc/udev/rules.d/70-persistent-net.rules
 20 | 
21 |
22 | Now go to the directory containing the NIC config scripts: 23 |
 24 | # cd /etc/sysconfig/network-scripts/
 25 | 
26 |
27 | There is a file in there for eth8 (!) 28 |
 29 | # ls ifcfg*
 30 | ifcfg-Auto_eth8  ifcfg-lo
 31 | 
32 |
33 | 34 | Rename it - we are going back to eth0: 35 |
 36 | # mv ifcfg-Auto_eth8 ifcfg-eth0
 37 | 
38 |
39 | 40 | Change the contents of the eth0 config from this: 41 |
 42 | # vi ifcfg-eth0
 43 | TYPE=Ethernet
 44 | BOOTPROTO=dhcp
 45 | DEFROUTE=no
 46 | IPV4_FAILURE_FATAL=yes
 47 | IPV6INIT=no
 48 | NAME="Auto eth8"
 49 | UUID=260d6a27-75c8-4773-b12a-b9c8f18505d7
 50 | ONBOOT=yes
 51 | HWADDR=08:00:27:B7:44:21
 52 | PEERDNS=yes
 53 | PEERROUTES=yes
 54 | LAST_CONNECT=1470844728
 55 | 
56 |
 56 | 57 | 58 | Change it to this (we're using the VirtualBox gateway address 10.0.2.2 in there - thanks to [Justin](http://github.com/justinholmes) :) ): 59 | 60 |
 61 | DEVICE=eth0
 62 | NM_CONTROLLED=no
 63 | BOOTPROTO=dhcp
 64 | ONBOOT=yes
 65 | TYPE=Ethernet
 66 | IPV4_FAILURE_FATAL=yes
 67 | IPV6INIT=no
 68 | HWADDR=08:00:27:B7:44:21
 69 | NETMASK=255.255.255.0
 70 | GATEWAY=10.0.2.2
 71 | PEERDNS=yes
 72 | PEERROUTES=yes
 73 | 
 74 | 
75 |
 76 | 77 | We also want iptables off while we do this configuration, to make sure there is no firewall complicating the picture - check its status (and stop it with "service iptables stop" if it is running): 78 |
 79 | # service iptables status
 80 | iptables: Firewall is not running.
 81 | 
82 |
 83 | 84 | ## 2.Update The System network File 85 | 86 | Change the supplied network settings from this: 87 |
 88 | # vi /etc/sysconfig/network
 89 | NETWORKING=yes
 90 | HOSTNAME=demo.us.oracle.com
 91 | GATEWAY=10.128.0.1
 92 | # oracle-rdbms-server-11gR2-preinstall : Add NOZEROCONF=yes
 93 | NOZEROCONF=yes
 94 | 
95 | 96 | To This: 97 |
 98 | NETWORKING=yes
 99 | HOSTNAME=sampleapp
100 | GATEWAY=
101 | # oracle-rdbms-server-11gR2-preinstall : Add NOZEROCONF=yes
102 | NOZEROCONF=yes
103 | 
104 |
105 | 106 | ## 3.Update The System Hostname 107 | Change it from the command line. 108 | 109 | Check the current setting for hostname: 110 |
111 | # hostname
112 | demo.us.oracle.com
113 | 
114 |
115 | 116 | Change it to "sampleapp" (or whatever you want to call your host): 117 |
118 | # hostname sampleapp
119 | 
120 |
121 | 122 | Now check it again - it's correct: 123 |
124 | # hostname
125 | sampleapp
126 | 
127 |
128 | 129 | The HOSTNAME environment variable will be updated next time we reboot - but we can also change it here: 130 |
131 | # echo $HOSTNAME
132 | demo.us.oracle.com
133 | # HOSTNAME=sampleapp export HOSTNAME
134 | 
135 |
 136 | 137 | ## 4.Update The /etc/hosts File 138 | We need to change the IP addresses and host names in the supplied hosts file (and it's a bit untidy anyway). I will use the IP address that VBox assigns to this client (10.0.2.15): 139 |
140 | # vi /etc/hosts
141 | 
142 | 127.0.0.1 localhost.localdomain localhost
143 | ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
144 | #10.128.8.33 demo.us.oracle.com demo
145 | 10.0.2.15       sampleapp
146 | 
147 |
 148 | 149 | Check that we can now resolve our own hostname: 150 |
151 | # ping `echo $HOSTNAME`
152 | PING sampleapp (10.0.2.15) 56(84) bytes of data.
153 | 64 bytes from sampleapp (10.0.2.15): icmp_seq=1 ttl=64 time=0.020 ms
154 | 64 bytes from sampleapp (10.0.2.15): icmp_seq=2 ttl=64 time=0.037 ms
155 | 
156 |
157 | 158 | ## 5.Update the /etc/resolv.conf File 159 | We should be using something like the Google DNS for domain name resolution: 160 |
161 | # cat /etc/resolv.conf
162 | ; generated by /sbin/dhclient-script
163 | search localdomain
164 | nameserver 8.8.8.8
165 | nameserver 8.8.4.4
166 | 
167 |
168 | 169 | ## 6. Restart The Network Service 170 | As the root user restart the network service: 171 |
172 | # service network restart
173 | 
 174 | Check the new IP address and routing table (using ifconfig and route -n): 175 |
176 | eth0      Link encap:Ethernet  HWaddr 08:00:27:B7:44:21  
177 |           inet addr:171.28.136.150  Bcast:171.28.136.255  Mask:255.255.255.0
178 |           
179 | # route -n
180 | Kernel IP routing table
181 | Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
182 | 0.0.0.0         171.28.136.254  0.0.0.0         UG    0      0        0 eth0
183 | 171.28.136.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
184 | 
185 | 186 |
187 | ## 7.Check the version of Red Hat 188 | We need to know our O/S release for some of the following pre-requisites: 189 |
190 | # cat /etc/redhat-release 
191 | Red Hat Enterprise Linux Server release 6.8 (Santiago)
192 | 
193 |
194 | ## 8.Remove The Oracle Proxy from /etc/yum.conf 195 | yum is configured to use an Oracle proxy, which isn't very helpful if you want to add more stuff. 196 | Edit the yum conf file and remove the line in there specifying the Oracle Proxy: 197 |
198 | # vi /etc/yum.conf
199 | 
200 |
201 | 202 | ## 9. Install Python 2.7 Required For cqlsh 203 | 204 | We next need to install Python 2.7 to avoid the "No appropriate python interpreter found" error message that prevents cqlsh from starting. 205 | 206 | The install is referenced here: 207 | http://tecadmin.net/install-python-2-7-on-centos-rhel/# 208 |
209 | # yum install gcc
210 | # cd /usr/src
211 | # wget https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
212 | # tar xzf Python-2.7.12.tgz 
213 | # cd Python-2.7.12
214 | # ./configure
215 | # make altinstall
216 | 
217 |
218 | 219 | Check 2.7 is now installed: 220 |
221 | # python2.7 --version
222 | Python 2.7.12
223 | 
224 | ("python --version" still brings up 2.6.6) 225 | 226 |
227 | Start DSE and test that cqlsh now works: 228 |
229 | # cqlsh
230 | Warning: Timezone defined and 'pytz' module for timezone conversion not installed. Timestamps will be displayed in UTC timezone.
231 | 
232 | Connected to Test Cluster at 127.0.0.1:9042.
233 | [cqlsh 5.0.1 | Cassandra 3.0.7.1159 | DSE 5.0.1 | CQL spec 3.4.0 | Native protocol v4]
234 | Use HELP for help.
235 | cqlsh> 
236 | 
237 | We will fix the pytz warning next. 238 |
239 |
240 | 241 | ## 10. Install EPEL And Other Pre-Reqs (Mostly Python Packages) 242 | To resolve the Python time zone warning I installed EPEL (Extra Packages for Enterprise Linux) and some Python utilities. 243 |
244 | 245 | ### EPEL 246 | I found a good EPEL install link here: 247 | 248 | http://sharadchhetri.com/2014/05/30/install-pip-centos-rhel-ubuntu-debian/ 249 | 250 | First install wget: 251 |
252 | # yum install wget
253 | 
254 |
255 | 256 | Download and install EPEL for Red Hat/Oracle Linux v6: 257 |
258 | # rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
259 | 
260 | Retrieving http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
261 | warning: /var/tmp/rpm-tmp.X7MF42: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
262 | Preparing...                ########################################### [100%]
263 |    1:epel-release           ########################################### [100%]
264 | 
265 |
266 | 267 | Check it's installed: 268 |
269 | # rpm -qa | grep epel
270 | epel-release-6-8.noarch
271 | 
272 | 273 | 274 | The install creates two files: 275 | - /etc/yum.repos.d/epel.repo 276 | - /etc/yum.repos.d/epel-testing.repo 277 |
278 | 279 | ### Python Packages 280 | Now install pip to manage the Python packages we need: 281 |
282 | # yum install -y python-pip
283 | 
284 | 285 | Install easy_install (setuptools) for Python 2.7: 286 |
287 | # wget --no-check-certificate https://bootstrap.pypa.io/ez_setup.py
288 | # /usr/local/bin/python2.7 ez_setup.py
289 | # /usr/local/bin/easy_install-2.7 pip
290 | 
291 |
292 | 293 | Check pip version: 294 |
295 | # which pip
296 | /usr/local/bin/pip
297 | # pip --version
298 | pip 8.1.2 from /usr/local/lib/python2.7/site-packages/pip-8.1.2-py2.7.egg (python 2.7)
299 | 
300 |
301 | 302 | Check easy_install version: 303 |
304 | # which easy_install
305 | /usr/local/bin/easy_install
306 | # easy_install --version
307 | setuptools 27.1.2 from /usr/local/lib/python2.7/site-packages/setuptools-27.1.2-py2.7.egg (Python 2.7)
308 | 
309 | 310 | Use easy_install to install the pytz package (the missing module behind the cqlsh warning): 311 |
312 | # /usr/local/bin/easy_install-2.7 pytz
313 | 
314 | 315 |
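Alternatively, the pip we installed for Python 2.7 above should do the same job:
<pre>
# /usr/local/bin/pip install pytz
</pre>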
316 | 317 | ### All Good 318 | cqlsh now starts with no warnings: 319 | 320 |
321 | # cqlsh
322 | Connected to Test Cluster at 127.0.0.1:9042.
323 | [cqlsh 5.0.1 | Cassandra 3.0.7.1159 | DSE 5.0.1 | CQL spec 3.4.0 | Native protocol v4]
324 | Use HELP for help.
325 | cqlsh> 
326 | 
327 | 328 |
329 | 330 | ## 11. Check OS cassandra Account Details 331 | The Cassandra account was created by the installer: 332 |
333 | # cat /etc/passwd | grep cassandra
334 | cassandra:x:476:472::/var/lib/cassandra:/bin/bash
335 | 
336 |
337 | 338 | The installer also updated the ulimits for the Cassandra user: 339 | 340 |
341 | # cat /etc/security/limits.d/cassandra.conf
342 | 
343 | cassandra - memlock unlimited
344 | cassandra - nofile 100000
345 | cassandra - nproc 32768
346 | cassandra - as unlimited
347 | 
348 | If you want to log in as the cassandra user you will either need to su - from the root account, or use the root account to set a password for the cassandra account (e.g. ```passwd cassandra```) to allow interactive logins. 349 | 350 | ## 12. Check The Oracle Database Name 351 |
352 | $ sqlplus / as sysdba
353 | 
354 | SQL*Plus: Release 12.1.0.2.0 Production on Mon Sep 5 05:25:42 2016
355 | 
356 | Copyright (c) 1982, 2014, Oracle.  All rights reserved.
357 | 
358 | Connected to:
359 | Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
360 | With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
361 | 
362 | SQL> select database_name from v$database;
363 | 
364 | DATABASE_NAME
365 | ----------------------
366 | ORCL.US.ORACLE.COM
367 | 
368 | 1 row selected.
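SQL> -- optional: confirm the service name the instance registers
SQL> -- (we reference it as orcl in the TNS configuration below):
SQL> show parameter service_names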
369 | 
370 | 371 | ## 13. Set Up Oracle TNS 372 | The TNS_ADMIN environment variable should point to the location of your TNS configuration files: 373 |
374 | $ echo $TNS_ADMIN
375 | /app/oracle/db/product/12c/dbhome_1/network/admin
376 | $ ls $TNS_ADMIN
377 | listener.ora  listener.ora~  samples  shrept.lst  sqlnet.ora  tnsnames.ora  tnsnames.ora~
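$ # if TNS_ADMIN is not already set, point it at your Oracle home's network/admin
$ # directory - the path below matches this VM, so adjust it for your installation:
$ export TNS_ADMIN=/app/oracle/db/product/12c/dbhome_1/network/admin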
378 | 
379 | 380 | 381 | 382 | ### Configure listener.ora 383 | #### Set Hostname 384 | Change the hostname from demo.us.oracle.com to localhost. 385 | The config file is: 386 |
387 | /app/oracle/db/product/12c/dbhome_1/network/admin/listener.ora
388 | 
389 | 390 |
391 | # listener.ora Network Configuration File: /app/oracle/db/product/12c/dbhome_1/network/admin/listener.ora
392 | # Generated by Oracle configuration tools.
393 | 
394 | LISTENER =
395 |   (DESCRIPTION_LIST =
396 |     (DESCRIPTION =
397 |       (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
398 |       (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521))
399 |     )
400 |   )
401 | 
402 | 403 | ### Add A Service To listener.ora 404 | We need to add a static service registration so that the listener accepts connections for the orcl database. 405 | Add this section to the end of listener.ora (if the listener is already running, a ```lsnrctl reload``` afterwards will pick up the change): 406 |
407 | SID_LIST_LISTENER=
408 |   (SID_LIST=
409 |     (SID_DESC=
410 |       (GLOBAL_DBNAME=orcl)
411 |       (ORACLE_HOME=/app/oracle/db/product/12c/dbhome_1)
412 |       (SID_NAME=orcl)))
413 | 
414 | 415 | ### Configure Host Names In tnsnames.ora 416 | Change all occurrences of the hostname from demo.us.oracle.com to localhost. A quick way to do both files from the TNS_ADMIN directory (after taking backups) is ```sed -i 's/demo.us.oracle.com/localhost/g' listener.ora tnsnames.ora```. 417 | 418 |
419 | /app/oracle/db/product/12c/dbhome_1/network/admin/tnsnames.ora
420 | 
421 |
422 | # tnsnames.ora Network Configuration File: /app/oracle/db/product/12c/dbhome_1/network/admin/tnsnames.ora
423 | # Generated by Oracle configuration tools.
424 | 
425 | LISTENER_ORCL =
426 |   (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
427 | 
428 | ORCL =
429 |   (DESCRIPTION =
430 |     (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
431 |     (CONNECT_DATA =
432 |       (SERVER = DEDICATED)
433 |       (SERVICE_NAME = orcl)
434 |     )
435 |   )
436 | pdborcl =
437 |   (DESCRIPTION =
438 |     (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
439 |     (CONNECT_DATA =
440 |       (SERVER = DEDICATED)
441 |       (SERVICE_NAME = pdborcl)
442 |     )
443 |   )
444 | 
445 | 446 | ### Start the listener 447 |
448 | $ lsnrctl start
449 | 
450 | LSNRCTL for Linux: Version 12.1.0.2.0 - Production on 13-SEP-2016 17:18:26
451 | 
452 | Copyright (c) 1991, 2014, Oracle.  All rights reserved.
453 | 
454 | Starting /app/oracle/db/product/12c/dbhome_1/bin/tnslsnr: please wait...
455 | 
456 | TNSLSNR for Linux: Version 12.1.0.2.0 - Production
457 | System parameter file is /app/oracle/db/product/12c/dbhome_1/network/admin/listener.ora
458 | Log messages written to /app/oracle/db/diag/tnslsnr/sampleapp/listener/alert/log.xml
459 | Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=1521)))
460 | Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
461 | 
462 | Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521)))
463 | STATUS of the LISTENER
464 | ------------------------
465 | Alias                     LISTENER
466 | Version                   TNSLSNR for Linux: Version 12.1.0.2.0 - Production
467 | Start Date                13-SEP-2016 17:18:26
468 | Uptime                    0 days 0 hr. 0 min. 0 sec
469 | Trace Level               off
470 | Security                  ON: Local OS Authentication
471 | SNMP                      OFF
472 | Listener Parameter File   /app/oracle/db/product/12c/dbhome_1/network/admin/listener.ora
473 | Listener Log File         /app/oracle/db/diag/tnslsnr/sampleapp/listener/alert/log.xml
474 | Listening Endpoints Summary...
475 |   (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=127.0.0.1)(PORT=1521)))
476 |   (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC1521)))
477 | Services Summary...
478 | Service "orcl" has 1 instance(s).
479 |   Instance "orcl", status UNKNOWN, has 1 handler(s) for this service...
480 | The command completed successfully
481 | 
482 | 483 | ### Check tnsping 484 | 485 |
486 | $ tnsping orcl
487 | 
488 | TNS Ping Utility for Linux: Version 12.1.0.2.0 - Production on 13-SEP-2016 17:25:55
489 | 
490 | Copyright (c) 1997, 2014, Oracle.  All rights reserved.
491 | 
492 | Used parameter files:
493 | /app/oracle/db/product/12c/dbhome_1/network/admin/sqlnet.ora
494 | 
495 | 
496 | Used TNSNAMES adapter to resolve the alias
497 | Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = orcl)))
498 | OK (0 msec)
499 | 
500 | 501 | ### Check connection with sqlplus 502 |
503 | $ sqlplus hr/hr@orcl
504 | 
505 | SQL*Plus: Release 12.1.0.2.0 Production on Tue Sep 13 17:32:06 2016
506 | 
507 | Copyright (c) 1982, 2014, Oracle.  All rights reserved.
508 | 
509 | Last Successful login time: Tue Sep 13 2016 17:18:54 -05:00
510 | 
511 | Connected to:
512 | Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
513 | With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
514 | 
515 | SQL> 
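SQL> -- optional sanity check that the data we plan to migrate is visible,
SQL> -- assuming the standard HR sample schema (which includes an EMPLOYEES table):
SQL> select table_name from user_tables;
SQL> select count(*) from employees;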
516 | 
517 | 518 | ## 14. Fix VBox Shared Clipboard 519 | The shared clipboard between host and guest didn't work for me out of the box, but the following steps fix it. 520 | 521 | In VirtualBox, mount the Guest Additions CD and run the installer, entering the root password when prompted. 522 | 523 | Restart the machine. 524 | 525 | Then: 526 |
527 | # killall VBoxClient
528 | # VBoxClient-all
529 | 
530 | I find that to enable the bidirectional clipboard I have to run these last two commands every time I start the machine. 531 | 532 |
533 |
534 | 535 | --------------------------------------------------------------------------------