├── README.md ├── Kafka Cluster Setup ├── YARN Configuration.md ├── Spark Thrift Setup ├── Install Python38 Ubuntu.md ├── Yarn Dask installation with Anaconda.md ├── Install Debezium Kafka Connector for PostgreSQL using wal2json.md ├── Spark Hadoop Free Cluster Setup.md ├── Hadoop Cluster Setup.md └── Hive Setup with Spark Exection and Spark HiveContext.md /README.md: -------------------------------------------------------------------------------- 1 | # Big-Data-Installation -------------------------------------------------------------------------------- /Kafka Cluster Setup: -------------------------------------------------------------------------------- 1 | https://blog.clairvoyantsoft.com/kafka-series-3-creating-3-node-kafka-cluster-on-virtual-box-87d5edc85594 2 | -------------------------------------------------------------------------------- /YARN Configuration.md: -------------------------------------------------------------------------------- 1 | ## Config Resources ## 2 | **Default Configuration** 3 | 4 | **Node Manager Allocated Resources** 5 | 6 | **Container Allocated Resources** 7 | 8 | **Application Master Allocated Resources** 9 | 10 | ## Config Scheduler 11 | **Capacity Scheduler** 12 | 13 | **Fair Scheduler** 14 | -------------------------------------------------------------------------------- /Spark Thrift Setup: -------------------------------------------------------------------------------- 1 | https://www.programmersought.com/article/75806728859/ 2 | https://www.programmersought.com/article/34564251004/ 3 | https://www.programmersought.com/article/47542974362/ 4 | https://www.programmersought.com/article/46657462430/ 5 | https://www.programmersought.com/article/49335838484/ 6 | https://www.programmersought.com/article/45243571155/ 7 | https://www.programmersought.com/article/55482089247/ 8 | https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-thrift-server.html 9 | -------------------------------------------------------------------------------- /Install Python38 Ubuntu.md: -------------------------------------------------------------------------------- 1 | # Install Python 3.8 on Ubuntu 18.04 2 | **Install Python3.8** 3 | `sudo apt install python3.8` 4 | 5 | **Set python3 to Python3.8** 6 | Add Python3.6 & Python 3.8 to update-alternatives 7 | 8 | - `sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1` 9 | - `sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2` 10 | 11 | Update Python 3 to point to Python 3.8 12 | 13 | `sudo update-alternatives --config python3` Enter 2 for Python 3.8 14 | 15 | Test the version of python 16 | ``` 17 | python3 --version 18 | Python 3.8.0 19 | ``` 20 | 21 | Install pip3 22 | `sudo apt install python3-pip` 23 | 24 | **Fix error: ImportError: No module named apt_pkg** 25 | 26 | - `cd /usr/lib/python3/dist-packages/` 27 | - `sudo ln -s apt_pkg.cpython-36m-x86_64-linux-gnu.so apt_pkg.so` 28 | 29 | **Ensure to update Python in all Spark worker nodes to the same version** 30 | ``` 31 | Exception: Python in worker has different version 3.6 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. 
32 | ``` 33 | 34 | ``` 35 | export PYSPARK_PYTHON=python3 36 | export PYSPARK_DRIVER_PYTHON=python3 37 | ``` 38 | -------------------------------------------------------------------------------- /Yarn Dask installation with Anaconda.md: -------------------------------------------------------------------------------- 1 | ## 1. Install Anaconda 2 | Install Anaconda version 5.3.1 for supporting python version 3.7 3 | * **Download Anaconda installaion script**: 4 | ``` 5 | wget https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh 6 | ``` 7 | * **Run script file**: 8 | ``` 9 | ./home/hadoop/Anaconda3-5.3.1-Linux-x86_64.sh 10 | ``` 11 | Follow the installation on command, when done, to activating Anaconda: 12 | ``` 13 | source ~/.bashrc 14 | ``` 15 | ## 2. Create Conda environment 16 | * **Create Conda environment**: 17 | ``` 18 | conda create -n my-env python=3.7 19 | ``` 20 | _Must specific the python version, if not, Conda will automatically install the lastes version_ 21 | * **Activate Conda environment**: 22 | ``` 23 | conda activate my-env 24 | ``` 25 | * **Package environment**: 26 | Package a conda environment using conda-pack 27 | Install conda-pack if not exists: 28 | ``` 29 | conda install -c conda-forge conda-pack 30 | ``` 31 | Then package the current environment (activate first): 32 | ``` 33 | conda-pack -p /path/to/save/gz/file 34 | ``` 35 | Replicate conda environment to all nodes (Remember to edit _**.bashrc**_ file and put them in same directory) 36 | 37 | ## 3. Install dash-yarn && run first dask job 38 | * **Install dask-yarn with conda**: 39 | ``` 40 | conda install -c conda-forge dask-yarn 41 | ``` 42 | * **Submit python job with dask-yarn cli**: 43 | From terminal, run command: 44 | ``` 45 | dask-yarn submit --environment python:///opt/anaconda/envs/analytics/bin/python --worker-count 8 --worker-vcores 2 --worker-memory 4GiB dask_submit_test.py 46 | ``` 47 | _**Note:**_ Environment format is **_python:///path_to_env/bin/python_**. Get the path to environment by execute the command: 48 | ``` 49 | conda env list 50 | ``` 51 | * **Start YARN cluster**: 52 | To start a YARN cluster, create an instance of YarnCluster. This constructor takes several parameters, leave them empty to use the defaults defined in the Configuration: 53 | ``` 54 | from dask_yarn import YarnCluster 55 | import dask.dataframe as dd 56 | 57 | from dask.distributed import Client 58 | 59 | cluster = YarnCluster(environment='python:///opt/anaconda/envs/analytics/bin/python', 60 | worker_vcores=2, 61 | worker_memory="4GiB") 62 | ``` 63 | By default no workers are started on cluster creation. To change the number of workers, use the YarnCluster.scale()method. When scaling up, new workers will be requested from YARN. 
When scaling down, workers will be intelligently selected and scaled down gracefully, freeing up resources: 64 | ``` 65 | cluster.scale(4) 66 | ``` 67 | Read csv file: 68 | ``` 69 | data = dd.read_csv('hdfs://10.56.239.63:9000/user/dan/Global_Mobility_Report.csv',dtype={'sub_region_2': 'object'}) 70 | ``` 71 | This will create a dask dataframe, to show the result, use 72 | ``` 73 | data.head() 74 | ``` 75 | Converto pandas dataframe by **compute** method: 76 | ``` 77 | df = data.compute() 78 | ``` 79 | 80 | 81 | -------------------------------------------------------------------------------- /Install Debezium Kafka Connector for PostgreSQL using wal2json.md: -------------------------------------------------------------------------------- 1 | # Install Debezium Connector for PostgreSQL using wal2json plugin 2 | 3 | ## Overview 4 | Debezium’s PostgreSQL Connector can monitor and record row-level changes in the schemas of a PostgreSQL database. 5 | 6 | The first time it connects to a PostgreSQL server/cluster, it reads a consistent snapshot of all of the schemas. When that snapshot is complete, the connector continuously streams the changes that were committed to PostgreSQL 9.6 or later and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services. 7 | 8 | ## Install wal2json 9 | 10 | On Ubuntu: 11 | 12 | ```bash 13 | sudo apt install postgresql-9.6-wal2json 14 | ``` 15 | Check if `wal2json.so` have already in `/usr/lib/postgresql/9.6` 16 | 17 | ## Configure PostgreSQL Server 18 | 19 | ### Setting up libraries, WAL and replication parameters 20 | 21 | ***Find and modify*** the following lines at the end of the `postgresql.conf` PostgreSQL configuration file in order to include the plug-in at the shared libraries and to adjust some **WAL** and **streaming replication** settings. You may need to modify it, if for example you have additionally installed `shared_preload_libraries`. 22 | 23 | ``` 24 | listeners = '*' 25 | shared_preload_libraries = 'wal2json' 26 | wal_level = logical 27 | max_wal_senders = 4 28 | max_replication_slots = 4 29 | ``` 30 | 31 | 1. `max_wal_senders` tells the server that it should use a maximum of `4` separate processes for processing WAL changes 32 | 2. `max_replication_slots` tells the server that it should allow a maximum of `4` replication slots to be created for streaming WAL changes 33 | 34 | ### Setting up replication permissions 35 | 36 | In order to give a user replication permissions, define a PostgreSQL role that has _at least_ the `REPLICATION` and `LOGIN` permissions. 37 | 38 | Login to `postgres` user and run `psql` 39 | ```bash 40 | sudo -u postgres psql 41 | ``` 42 | For example: 43 | 44 | ```sql 45 | CREATE ROLE datalake REPLICATION LOGIN; 46 | ``` 47 | > However, Debezium need futher permissons to do initial snapshot. I used postgres superuser to fix this issue. You should test more to grant right permission for user role. 48 | 49 | Next modify `pg_hba.conf` to accept remote connection from debezium kafka connector 50 | 51 | ``` 52 | host replication postgres 0.0.0.0/0 trust 53 | host replication postgres ::/0 trust 54 | host all all 192.168.1.2/32 trust 55 | ``` 56 | > **Note:** Debezium is installed on 192.168.1.2 57 | 58 | ### Database Test Environment 59 | 60 | Back to `psql` console. 
61 | 62 | ```sql 63 | CREATE DATABASE test; 64 | CREATE TABLE test_table ( 65 | id char(10) NOT NULL, 66 | code char(10), 67 | PRIMARY KEY (id) 68 | ); 69 | ``` 70 | 71 | - **Create a slot** named `test_slot` for the database named `test`, using the logical output plug-in `wal2json` 72 | 73 | 74 | ```bash 75 | $ pg_recvlogical -d test --slot test_slot --create-slot -P wal2json 76 | ``` 77 | 78 | - **Begin streaming changes** from the logical replication slot `test_slot` for the database `test` 79 | 80 | 81 | ```bash 82 | $ pg_recvlogical -d test --slot test_slot --start -o pretty-print=1 -f - 83 | ``` 84 | 85 | - **Perform some basic DML** operations at `test_table` to trigger `INSERT`/`UPDATE`/`DELETE` change events 86 | 87 | 88 | _Interactive PostgreSQL terminal, SQL commands_ 89 | 90 | ```sql 91 | test=# INSERT INTO test_table (id, code) VALUES('id1', 'code1'); 92 | INSERT 0 1 93 | test=# update test_table set code='code2' where id='id1'; 94 | UPDATE 1 95 | test=# delete from test_table where id='id1'; 96 | DELETE 1 97 | ``` 98 | 99 | Upon the `INSERT`, `UPDATE` and `DELETE` events, the `wal2json` plug-in outputs the table changes as captured by `pg_recvlogical`. 100 | 101 | _Output for `INSERT` event_ 102 | 103 | ```json 104 | { 105 | "change": [ 106 | { 107 | "kind": "insert", 108 | "schema": "public", 109 | "table": "test_table", 110 | "columnnames": ["id", "code"], 111 | "columntypes": ["character(10)", "character(10)"], 112 | "columnvalues": ["id1 ", "code1 "] 113 | } 114 | ] 115 | } 116 | ``` 117 | 118 | _Output for `UPDATE` event_ 119 | 120 | ```json 121 | { 122 | "change": [ 123 | { 124 | "kind": "update", 125 | "schema": "public", 126 | "table": "test_table", 127 | "columnnames": ["id", "code"], 128 | "columntypes": ["character(10)", "character(10)"], 129 | "columnvalues": ["id1 ", "code2 "], 130 | "oldkeys": { 131 | "keynames": ["id"], 132 | "keytypes": ["character(10)"], 133 | "keyvalues": ["id1 "] 134 | } 135 | } 136 | ] 137 | } 138 | ``` 139 | 140 | _Output for `DELETE` event_ 141 | 142 | ```json 143 | { 144 | "change": [ 145 | { 146 | "kind": "delete", 147 | "schema": "public", 148 | "table": "test_table", 149 | "oldkeys": { 150 | "keynames": ["id"], 151 | "keytypes": ["character(10)"], 152 | "keyvalues": ["id1 "] 153 | } 154 | } 155 | ] 156 | } 157 | ``` 158 | 159 | When the test is finished, the slot `test_slot` for the database `test` can be removed by the following command: 160 | 161 | ```bash 162 | $ pg_recvlogical -d test --slot test_slot --drop-slot 163 | ``` 164 | 165 | ## Install Debezium Kafka Connector 166 | 167 | Download Debezium Connector: 168 | 169 | ```bash 170 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/1.2.0.Final/debezium-connector-postgres-1.2.0.Final-plugin.tar.gz 171 | ``` 172 | Unzip to /opt/kafka/connect 173 | 174 | ```bash 175 | tar -xvf debezium-connector-postgres-1.2.0.Final-plugin.tar.gz --directory=/opt/kafka/connect 176 | ``` 177 | 178 | Check if folder `debezium-connector-postgres` in `/opt/kafka/connect` 179 | 180 | ### Test connector 181 | 182 | Edit `config/connect-file-source.properties` as following: 183 | 184 | ```properties 185 | name=postgres-cdc-source 186 | connector.class=io.debezium.connector.postgresql.PostgresConnector 187 | #snapshot.mode=never 188 | tasks.max=1 189 | plugin.name=wal2json 190 | database.hostname=192.168.1.1 191 | database.port=5432 192 | database.user=postgres 193 | #database.password=postgres 194 | database.dbname=test 195 | # slot.name=test_slot 196 | 
database.server.name=fullfillment 197 | #table.whitelist=public.inventory 198 | ``` 199 | 200 | Append kafka plugin folder to `plugin.path` in file `config/connect-standalone.properties` 201 | 202 | ```properties 203 | plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,/opt/kafka/connect 204 | ``` 205 | Run kafka connector in standalone mode 206 | 207 | ```bash 208 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties 209 | ``` 210 | 211 | #### Topics Names 212 | 213 | The PostgreSQL connector writes events for all insert, update, and delete operations on a single table to a single Kafka topic. By default, the Kafka topic name is _serverName_._schemaName_._tableName_ (configuration in `connect-file-source.properties`) where _serverName_ is the logical name of the connector as specified with the `database.server.name` configuration property, _schemaName_ is the name of the database schema where the operation occurred, and _tableName_ is the name of the database table on which the operation occurred. 214 | 215 | For example, consider a PostgreSQL installation with a `postgres` database and an `inventory` schema that contains four tables: `products`, `products_on_hand`, `customers`, and `orders`. If the connector monitoring this database were given a logical server name of `fulfillment`, then the connector would produce events on these four Kafka topics: 216 | 217 | - `fulfillment.inventory.products` 218 | 219 | - `fulfillment.inventory.products_on_hand` 220 | 221 | - `fulfillment.inventory.customers` 222 | 223 | - `fulfillment.inventory.orders` 224 | 225 | 226 | If on the other hand the tables were not part of a specific schema but rather created in the default `public` PostgreSQL schema, then the name of the Kafka topics would be: 227 | 228 | - `fulfillment.public.products` 229 | 230 | - `fulfillment.public.products_on_hand` 231 | 232 | - `fulfillment.public.customers` 233 | 234 | - `fulfillment.public.orders` 235 | 236 | In this example, the topic's name is `fulfillment.public.test_table` 237 | 238 | Run: 239 | 240 | ```bash 241 | bin/kafka-topics.sh --list --bootstrap-server 192.168.1.2:9092 242 | ``` 243 | You should see `fullfillment.public.test_table` in the ouput. 244 | 245 | Start receive message from debezium: 246 | 247 | ```bash 248 | bin/kafka-console-consumer.sh --bootstrap-server 192.168.1.2:9092 --from-beginning --topic fullfillment.public.test_table 249 | ``` 250 | -------------------------------------------------------------------------------- /Spark Hadoop Free Cluster Setup.md: -------------------------------------------------------------------------------- 1 | ## 1. Prerequisite 2 | 3 | Below is the requirement items that needs to be installed/setup before installing Spark 4 | - Passwordless SSH 5 | - Java Installation and JAVA_HOME declaration in .bashrc 6 | - Hadoop Cluster Setup, with start-dfs.sh and start-yarn.sh are executed 7 | - Hostname/hosts config 8 | 9 | ## 2. Download and Install Spark 10 | 11 | In the Apache Spark Download [Website](https://spark.apache.org/downloads.html), choose your Spark release version and package type that use would like to install. 12 | 13 | Released Version: currently Spark has 2 stable version 14 | - Spark 2.x: wildly supported by many tools, like Elasticsearch, etc. 
15 | - Spark 3.x: introduce new features with performance optimization 16 | 17 | Type of packages: 18 | - Spark with Hadoop: currently Spark support 19 | - Spark without Hadoop: in case you already install Hadoop on your cluster. 20 | 21 | In this document, we will use **Spark 3.0.0 prebuilt with user-provided Apache Hadoop** 22 | 23 | ### 2.1. Download Apache Spark 24 | 25 | **On Master node (or Namenode)** 26 | The rest of the document only be executed on the Master node (Namenode) 27 | 28 | Switch to hadoop user 29 | `su hadoop` 30 | 31 | In /opt/, create spark directory and grant sudo permission 32 | `cd /opt/` 33 | `sudo mkdir spark` 34 | `sudo chown -R hadoop /opt/spark` 35 | 36 | Download Apache Spark to /home/hadoop 37 | `cd ~` 38 | `wget -O spark.tgz http://mirrors.viethosting.com/apache/spark/spark-3.0.0/spark-3.0.0-bin-without-hadoop.tgz` 39 | 40 | Untar the file into /opt/spark directory 41 | `tar -zxf spark.tgz --directory=/opt/spark --strip=1` 42 | 43 | ### 2.2. Setup Spark Environment Variable 44 | 45 | Edit the .bashrc file 46 | `nano ~/.bashrc` 47 | 48 | Add the following line to the **.bashrc** file 49 | ``` 50 | # Set Spark Environment Variable 51 | export SPARK_HOME=/opt/spark 52 | export PATH=$PATH:$SPARK_HOME/bin 53 | export PYSPARK_PYTHON=python3 54 | #export PYSPARK_DRIVER_PYTHON="jupyter" 55 | #export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889" 56 | #export SPARK_LOCAL_IP=192.168.0.1 # Your Master IP 57 | 58 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 59 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH 60 | export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop 61 | export SPARK_DIST_CLASSPATH=($HADOOP_HOME/bin/hadoop classpath) 62 | ``` 63 | 64 | ### 2.3. Edit Spark Config Files 65 | 66 | Apache Spark has some config file dedicated for specific purposes, which are: 67 | - spark-env.sh: store default environment variables for Spark 68 | - spark-defaults.conf: default config for spark-submit, including allocated resources, dependencies, proxies, etc. 69 | 70 | In spark/conf directory, make a copy of spark-env.sh.template and name it as spark-env.sh 71 | `cp spark-env.sh.template spark-env.sh` 72 | 73 | Edit the spark-env.sh 74 | `nano spark-env.sh` 75 | 76 | Add below line to the file for Spark to known the Hadoop location 77 | `export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)` 78 | 79 | Make a copy of spark-defaults.conf.template and name it as spark-defaults.conf 80 | `cp spark-defaults.conf.template spark-defaults.conf` 81 | 82 | Edit the spark-defaults.conf 83 | `nano spark-defaults.conf` 84 | Add below line to the file 85 | ``` 86 | spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 # Kafka Dependency for Spark 87 | spark.driver.extraJavaOptions -Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128 # Set Proxy for Spark 88 | ``` 89 | 90 | ### 2.4. Setup Spark Cluster Manager 91 | 92 | Spark support different Cluster Manager, including Spark Standalone Cluster Manager, Yarn, Mesos, Kubernetes. In this document, we will config for Yarn and Spark Standalone Cluster Manager 93 | 94 | #### 2.4.1. 
Spark on Yarn 95 | 96 | In the spark-default.conf, declare the spark master config by add the following line 97 | 98 | ``` 99 | spark.master yarn 100 | ``` 101 | 102 | The following config should be consider as well 103 | 104 | ``` 105 | spark.eventLog.enabled true 106 | spark.eventLog.dir hdfs://192.168.0.1:9000/spark-logs 107 | spark.driver.memory 512m 108 | spark.yarn.am.memory 512m 109 | spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider 110 | spark.history.fs.logDirectory hdfs://192.168.0.1:9000/spark-logs 111 | spark.history.fs.update.interval 10s 112 | spark.history.ui.port 18080 113 | ``` 114 | 115 | Create hdfs directory to store spark logs 116 | `hdfs dfs -mkdir /spark-logs` 117 | 118 | #### 2.4.2. Spark Standalone Cluster Manager 119 | 120 | On Spark Master node, copy the spark directory to all slave nodes 121 | 122 | `scp -r /opt/spark hadoop@node2:/opt/spark` 123 | 124 | In spark/conf directory, make a copy of slaves.template and name it as slaves 125 | `cp slaves.template slaves` 126 | 127 | Edit the spark-defaults.conf 128 | `nano slaves` 129 | 130 | Add the name of slave nodes to the file. 131 | ``` 132 | slaves1 # hostname of node2 133 | slaves2 # hostname of node3 134 | ``` 135 | 136 | **Start Spark Standalone Cluster mode** 137 | 138 | Execute the following line 139 | `$SPARK_HOME/sbin/start-all.sh` 140 | 141 | Run jps to check if Spark is running on the Master and Slave Node 142 | `jps` 143 | 144 | To stop Spark Standalone Cluster mode, do 145 | `$SPARK_HOME/sbin/stop-all.sh` 146 | 147 | ## 2.5. Spark Monitor UI 148 | 149 | **Spark Standalone Cluster mode Web UI** 150 | `master:8080` 151 | 152 | **Spark Application Web UI** 153 | `master:4040` 154 | 155 | **Master IP and port for spark-submit** 156 | `master:7077` 157 | 158 | ### Submit jobs to Spark Submit 159 | 160 | **Example** 161 | ``` 162 | spark-submit --deploy-mode client \ 163 | --class org.apache.spark.examples.SparkPi \ 164 | $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 10 165 | ``` 166 | 167 | **Run Mobile App Schemaless Consumer** 168 | ``` 169 | $SPARK_HOME/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/kafka_2_hadoop/Schemaless_Structure_Streaming.py 170 | ``` 171 | 172 | **Run HDFS dump data** 173 | ``` 174 | $SPARK_HOME/bin/spark-submit --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/dump/hdfs_sstreaming_dump_1_1.py 175 | ``` 176 | 177 | **Run Kafka dump data** 178 | ``` 179 | $SPARK_HOME/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/dump/kafka_sstreaming_dump_1_1.py 180 | ``` 181 | 182 | **Run Schema Inferno** 183 | ``` 184 | $SPARK_HOME/bin/spark-submit --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/schema_inferno/Schema_Saver.py 185 | ``` 186 | 187 | #### Note 188 | 189 | Kafka dependencies might not work for version 2.12. 
190 | Downgrade to 2.11 org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5 => org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 191 | 192 | ### Spark Application Structure 193 | 194 | **Python** 195 | ``` 196 | import os 197 | from pyspark import SparkContext, SparkConf 198 | from pyspark.sql import SparkSession, SQLContext 199 | from pyspark.sql import functions as F 200 | from pyspark.sql.types import * 201 | 202 | 203 | def asdf(path): 204 | print ... 205 | 206 | if __name__ == '__main__': 207 | # Setup Spark Config 208 | conf = SparkConf() 209 | conf.set("spark.executor.memory", "2g") 210 | conf.set("spark.executor.cores", "2") 211 | conf.set("spark.cores.max", "8") 212 | conf.set("spark.driver.extraClassPath", "/opt/oracle-client/instantclient_19_6/ojdbc8.jar") 213 | conf.set("spark.executor.extraClassPath", "/opt/oracle-client/instantclient_19_6/ojdbc8.jar") 214 | sc = SparkContext(master="spark://10.56.237.195:7077", appName="Schema Inferno", conf=conf) 215 | spark = SparkSession(sc) 216 | sqlContext = SQLContext(sc) 217 | 218 | # Generate Schema 219 | path = 'hdfs://master:9000/home/odyssey/data/raw_hdfs' 220 | temp_df = spark.read.json(f'{path}/*/part-00000') 221 | temp_df.printSchema() 222 | print("Number of records: ", temp_df.count()) 223 | print("Number of distinct records: ", temp_df.distinct().count()) 224 | #print(temp_df.filter('payload is null').count()) 225 | save_schema(path) 226 | ``` 227 | 228 | ### Log Print Control 229 | 230 | Create the log4j file from the template in spark/conf directory 231 | `cp conf/log4j.properties.template conf/log4j.properties` 232 | 233 | Replace this line 234 | `log4j.rootCategory=INFO, console` 235 | By this 236 | `log4j.rootCategory=WARN, console` 237 | -------------------------------------------------------------------------------- /Hadoop Cluster Setup.md: -------------------------------------------------------------------------------- 1 | # Concept 2 | Setup Hadoop in one node, then replicate to others. 3 | 4 | ## **1. Prerequisites** 5 | The following must be done on all node in the cluster, including installation of Java, SSH, user creation and other software utilities 6 | 7 | ### 1.1. Configure hosts and hostname on each node 8 | Here we will edit the /etc/hosts and /ect/hostname fiel, so that we can use hostname instead of IP everytime we wish to use or ping any of these servers 9 | * **Change hostname** 10 | `sudo nano /etc/hostname`
11 | Set the hostname of each node to a descriptive name (node1, node2, node3, etc.) 12 | 13 | * **Change your hosts file**
14 | `sudo nano /etc/hosts`
15 | Add the following lines in the format `IP hostname`
16 | ``` 17 | 192.168.0.1 node1 18 | 192.168.0.2 node2 19 | 192.168.0.3 node3 20 | ``` 21 | 22 | **Notice: Remember to delete/comment the following line if exists**
23 | > ```# 127.0.1.1 node1 ``` 24 | 25 | ### 1.2. Install OpenSSH 26 | `sudo apt install openssh-client`
27 | `sudo apt install openssh-server` 28 | 29 | ### 1.3. Install Java and configure the Java Environment Variable 30 | Here we use JDK 8, as it is still the most stable and widely supported version 31 | 32 | * **For Oracle Java
** 33 | `sudo add-apt-repository ppa:webupd8team/java`
34 | `sudo apt update`
35 | `sudo apt install oracle-java8-installer` 36 | 37 | * **For OpenJDK Java
** 38 | `sudo apt install openjdk-8-jdk` 39 | 40 | To verify the java version you can use the following command:
41 | `java -version` 42 | 43 | Set Java Environment Variable
44 | * **Locate where java is installed
** 45 | `update-alternatives --config java`
46 | The install path should be like this 47 | > /usr/lib/jvm/java-8-openjdk-amd64/ 48 | * **Add the JAVA_HOME variable to bashrc file:
** 49 | `nano ~/.bashrc`
50 | `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/`
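After reloading the shell, a quick sanity check (a minimal sketch; adjust the path if your JDK lives elsewhere) confirms the variable is picked up:
```
source ~/.bashrc
echo $JAVA_HOME                  # should print /usr/lib/jvm/java-8-openjdk-amd64/
$JAVA_HOME/bin/java -version     # should report a 1.8.x JDK
```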
51 | 52 | ### 1.4. Create dedicated user and group for Hadoop 53 | We will use a dedicated Hadoop user account for running Hadoop applications. While not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (security, permissions, backups, etc.). 54 | * **Create Hadoop group and Hadoop user
** 55 | `sudo addgroup hadoopgroup`
56 | `sudo adduser --ingroup hadoopgroup hadoop`
57 | `sudo adduser hadoop sudo`
58 | 59 | From this step on, we will work only as the **hadoop** user. Switch to it with:
60 | `su - hadoop`
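Before continuing, it can be worth confirming the account and group were created as intended; a small check, assuming the names used above:
```
id hadoop     # should list hadoopgroup and sudo among the user's groups
sudo -l       # confirms the hadoop user may run commands through sudo
```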
61 | 62 | ### 1.5. SSH Configuration 63 | Hadoop requires SSH access to manage its different nodes, i.e. remote machines plus your local machine. 64 | * **Generate SSH key value pair**
65 | `ssh-keygen -t rsa`
66 | `cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys`
67 | `sudo chmod 0600 ~/.ssh/authorized_keys` 68 | 69 | To check whether SSH works, run:
70 | `ssh localhost` 71 | 72 | * **Config Passwordless SSH**
73 | From each node, copy the SSH public key to the other nodes
74 | `ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node2`
75 | `ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node3`
76 | 77 | To check your passwordless SSH, try:
78 | `ssh hadoop@node2`
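With several worker nodes it is convenient to verify passwordless SSH to every host in one pass; a minimal sketch, assuming the hostnames configured earlier:
```
for host in node1 node2 node3; do
    ssh hadoop@"$host" hostname    # should print each hostname without prompting for a password
done
```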
79 | 80 | ## **2. Download and Configure Hadoop** 81 | In this article, we will install Hadoop on three machines: 82 | 83 | | IP | Host Name | Namenode | Datanode | 84 | | ------ | ------ | ------ | ------ | 85 | | 192.168.0.1 | node1 | Yes | No | 86 | | 192.168.0.2 | node2 | No | Yes | 87 | | 192.168.0.3 | node3 | No | Yes | 88 | 89 | ### 2.1. Download and set up Hadoop
90 | First, create the following directory structure under /opt/ (e.g. with `sudo mkdir -p`): 91 | ``` 92 | /opt/ 93 | |-- hadoop 94 | | |-- logs 95 | |-- hdfs 96 | | |-- datanode (if the node acts as a datanode) 97 | | |-- namenode (if the node acts as a namenode) 98 | |-- mr-history (if the node acts as a namenode) 99 | | |-- done 100 | | |-- tmp 101 | |-- yarn (if the node acts as a namenode) 102 | | |-- local 103 | | |-- logs 104 | ``` 105 | 106 | Change to the /home/hadoop directory 107 | `cd ~` 108 | 109 | Download the Hadoop installation package from its website: [https://hadoop.apache.org/releases.html](https://hadoop.apache.org/releases.html)
110 | `wget -c -O hadoop.tar.gz http://mirrors.viethosting.com/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz` 111 | 112 | Extract the file
113 | `sudo tar -xvf hadoop.tar.gz --directory=/opt/hadoop --strip 1` 114 | 115 | 116 | Assign ownership of these folders to the hadoop user
117 | `sudo chown -R hadoop /opt/hadoop`
118 | `sudo chown -R hadoop /opt/hdfs`
119 | `sudo chown -R hadoop /opt/yarn`
120 | `sudo chown -R hadoop /opt/mr-history` 121 | 122 | 123 | ### 2.2. Configuration for Namenode 124 | Inside the /opt/hadoop/etc/hadoop/ directory, edit the following files: **core-site.xml**, **hdfs-site.xml**, **yarn-site.xml**, **mapred-site.xml**. 125 | 126 | **On Namenode Server**
127 | * *core-site.xml* 128 | ``` 129 | 130 | 131 | fs.defaultFS 132 | hdfs://192.168.0.1:9000/ 133 | NameNode URI 134 | 135 | 136 | 137 | io.file.buffer.size 138 | 131072 139 | Buffer size 140 | 141 | 142 | ``` 143 | * *hdfs-site.xml* 144 | ``` 145 | 146 | 147 | dfs.namenode.name.dir 148 | file:///opt/hdfs/namenode 149 | NameNode directory for namespace and transaction logs storage. 150 | 151 | 152 | 153 | fs.checkpoint.dir 154 | file:///opt/hdfs/secnamenode 155 | Secondary Namenode Directory 156 | 157 | 158 | 159 | fs.checkpoint.edits.dir 160 | file:///opt/hdfs/secnamenode 161 | 162 | 163 | 164 | dfs.replication 165 | 2 166 | Number of replication 167 | 168 | 169 | 170 | dfs.permissions 171 | false 172 | 173 | 174 | 175 | dfs.datanode.use.datanode.hostname 176 | false 177 | 178 | 179 | 180 | dfs.namenode.datanode.registration.ip-hostname-check 181 | false 182 | 183 | 184 | ``` 185 | * *yarn-site.xml* 186 | ``` 187 | 188 | 189 | yarn.resourcemanager.hostname 190 | 192.168.0.1 191 | IP of hostname for Yarn Resource Manager Service 192 | 193 | 194 | 195 | yarn.nodemanager.aux-services 196 | mapreduce_shuffle 197 | Yarn Node Manager Aux Service 198 | 199 | 200 | 201 | yarn.nodemanager.aux-services.mapreduce.shuffle.class 202 | org.apache.hadoop.mapred.ShuffleHandler 203 | 204 | 205 | 206 | yarn.nodemanager.local-dirs 207 | file:///opt/yarn/local 208 | 209 | 210 | 211 | yarn.nodemanager.log-dirs 212 | file:///opt/yarn/logs 213 | 214 | 215 | ``` 216 | * *mapred-site.xml* 217 | ``` 218 | 219 | 220 | mapreduce.framework.name 221 | yarn 222 | MapReduce framework name 223 | 224 | 225 | 226 | mapreduce.jobhistory.address 227 | 192.168.0.1:10020 228 | Default port is 10020. 229 | 230 | 231 | 232 | mapreduce.jobhistory.webapp.address 233 | 192.168.0.1:19888 234 | MapReduce JobHistory WebUI URL 235 | 236 | 237 | 238 | mapreduce.jobhistory.intermediate-done-dir 239 | /opt/mr-history/tmp 240 | Directory where history files are written by MapReduce jobs. 241 | 242 | 243 | 244 | mapreduce.jobhistory.done-dir 245 | /opt/mr-history/done 246 | Directory where history files are managed by the MR JobHistory Server. 247 | 248 | 249 | ``` 250 | 251 | * **workers**
252 | Add the datanodes' IP to the file 253 | ``` 254 | 192.168.0.2 255 | 192.168.0.3 256 | ``` 257 | 258 | ### 2.3. Configuration for Datanode 259 | Inside the /hadoopx.y.z/etc/hadoop/ directory, edit the following files: **core-site.xml**, **hdfs-site.xml**, **yarn-site.xml**, **mapred-site.xml**. 260 | 261 | **On Datanode Server** 262 | * *core-site.xml* 263 | ``` 264 | 265 | 266 | fs.defaultFS 267 | hdfs://192.168.0.1:9000/ 268 | NameNode URI 269 | 270 | 271 | ``` 272 | * *hdfs-site.xml* 273 | ``` 274 | 275 | 276 | dfs.datanode.data.dir 277 | file:///opt/hdfs/datanode 278 | DataNode directory for namespace and transaction logs storage. 279 | 280 | 281 | 282 | dfs.replication 283 | 2 284 | Number of replication 285 | 286 | 287 | 288 | dfs.permissions 289 | false 290 | 291 | 292 | 293 | dfs.datanode.use.datanode.hostname 294 | false 295 | 296 | 297 | 298 | dfs.namenode.datanode.registration.ip-hostname-check 299 | false 300 | 301 | 302 | ``` 303 | * *yarn-site.xml* 304 | ``` 305 | 306 | 307 | yarn.resourcemanager.hostname 308 | 192.168.0.1 309 | IP of hostname for Yarn Resource Manager Service 310 | 311 | 312 | 313 | yarn.nodemanager.aux-services 314 | mapreduce_shuffle 315 | Yarn Node Manager Aux Service 316 | 317 | 318 | ``` 319 | * *mapred-site.xml* 320 | ``` 321 | 322 | 323 | mapreduce.framework.name 324 | yarn 325 | MapReduce framework name 326 | 327 | 328 | ``` 329 | 330 | For more configuration information, see [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html](url) 331 | 332 | ### 2.4. Configure Hadoop Environment Variables 333 | Add the following lines to the **.bashrc** file 334 | ``` 335 | export HADOOP_HOME=/opt/hadoop 336 | export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin 337 | export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop 338 | export HDFS_NAMENODE_USER=hadoop 339 | export HDFS_DATANODE_USER=hadoop 340 | export HDFS_SECONDARYNAMENODE_USER=hadoop 341 | export HADOOP_MAPRED_HOME=/opt/hadoop 342 | export HADOOP_COMMON_HOME=/opt/hadoop 343 | export HADOOP_HDFS_HOME=/opt/hadoop 344 | export YARN_HOME=/opt/hadoop 345 | ``` 346 | 347 | Also add the following lines to the /opt/hadoop/etc/hadoop/hadoop-env.sh file 348 | ``` 349 | export HDFS_NAMENODE_USER=hadoop 350 | export HDFS_DATANODE_USER=hadoop 351 | export HDFS_SECONDARYNAMENODE_USER=hadoop 352 | export YARN_RESOURCEMANAGER_USER=hadoop 353 | export YARN_NODEMANAGER_USER=hadoop 354 | export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 355 | export HADOOP_HOME=/opt/hadoop 356 | export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop 357 | export HADOOP_LOG_DIR=/opt/hadoop/logs 358 | ``` 359 | 360 | ## 3. Start HDFS, Yarn and monitor services on browser 361 | ### 3.1. Format the namenode, start Hadoop basic services 362 | To format the namenode, type
363 | `hdfs namenode -format`
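A successful format populates the namenode directory declared in hdfs-site.xml; a quick check, assuming the /opt/hdfs/namenode path used in this guide:
```
ls /opt/hdfs/namenode/current
# should contain VERSION plus an initial fsimage file
```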
364 | 365 | Start the HDFS services
366 | `start-dfs.sh`
367 | 368 | The output should look like this
369 | ``` 370 | Starting namenodes on [hadoop-namenode] 371 | Starting datanodes 372 | Starting secondary namenodes [hadoop-namenode] 373 | ``` 374 | 375 | To start the YARN services
376 | `start-yarn.sh`
377 | ``` 378 | Starting resourcemanager 379 | Starting nodemanagers 380 | ``` 381 | 382 | Start the MapReduce JobHistory server as a *daemon*
383 | `$HADOOP_HOME/bin/mapred --daemon start historyserver` 384 | 385 | To check whether the services started successfully, run *jps* to list the running processes 386 | 387 | * On Namenode 388 | ``` 389 | 16488 NameNode 390 | 16622 JobHistoryServer 391 | 17087 ResourceManager 392 | 17530 Jps 393 | 16829 SecondaryNameNode 394 | ``` 395 | 396 | * On Datanode 397 | ``` 398 | 2306 DataNode 399 | 2479 NodeManager 400 | 2581 Jps 401 | ``` 402 | 403 | ### 3.2. Monitor the services in a browser 404 | 405 | For Namenode of Hadoop 3.x.x
406 | `https://IP:9870` 407 | For Namenode of Hadoop 2.x.x 408 | `https://IP:50070` 409 | 410 | For Yarn
411 | `https://IP:8088` 412 | 413 | For MapReduce Job History
414 | `https://IP:19888` 415 | 416 | ## 4. Run your first HDFS command & Yarn Job 417 | ### 4.1. Put and Get Data to HDFS 418 | Create a books directory in HDFS 419 | `hdfs dfs -mkdir /books` 420 | 421 | Grab a few books from the Gutenberg project 422 | `cd ~` 423 | 424 | ``` 425 | wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt 426 | wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt 427 | wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt 428 | ``` 429 | 430 | Then put the three books through HDFS, in the booksdirectory 431 | `hdfs dfs -put alice.txt holmes.txt frankenstein.txt /books` 432 | 433 | List the contents of the book directory 434 | `hdfs dfs -ls /books` 435 | 436 | Move one of the books to the local filesystem 437 | `hdfs dfs -get /books/alice.txt` 438 | 439 | ### 4.2. Submit MapReduce Jobs to YARN 440 | YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar. The Hadoop installation package provides sample applications that can be run to test your cluster. You’ll use them to run a word count on the three books previously uploaded to HDFS. 441 | 442 | Submit a job with the sample jar to YARN. On node-master, run 443 | `yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "/books/*" output` 444 | 445 | After the job is finished, you can get the result by querying HDFS with hdfs dfs -ls output. In case of a success, the output will resemble: 446 | ``` 447 | Found 2 items 448 | -rw-r--r-- 2 hadoop supergroup 0 2019-05-31 17:21 output/_SUCCESS 449 | -rw-r--r-- 2 hadoop supergroup 789726 2019-05-31 17:21 output/part-r-00000 450 | ``` 451 | 452 | Print the result with: 453 | `hdfs dfs -cat output/part-r-00000 | less` 454 | -------------------------------------------------------------------------------- /Hive Setup with Spark Exection and Spark HiveContext.md: -------------------------------------------------------------------------------- 1 | # 1. Prerequisite 2 | 3 | Below is the requirement items that needs to be installed/setup before installing Spark 4 | - Passwordless SSH 5 | - Java Installation and JAVA_HOME declaration in .bashrc 6 | - Hostname/hosts config 7 | - Hadoop Cluster Setup, with start-dfs.sh and start-yarn.sh are executed 8 | - Spark on Yarn 9 | 10 | # 2. Download and Install Apache Hive 11 | 12 | In the Apache Hive Download [https://downloads.apache.org/hive/](https://downloads.apache.org/hive/), choose your Hive release version that use would like to install. 13 | 14 | Released Version: currently Spark has 2 stable version 15 | - Hive 2.x: This release works with Hadoop 2.x.y 16 | - Hive 3.x: This release works with Hadoop 3.x.y 17 | 18 | In this document, we will use **Hive 2.3.7 prebuilt version** 19 | 20 | ## 2.1. Download Apache Hive 21 | 22 | **On Master node (or Namenode)** 23 | The rest of the document only be executed on the Master node (Namenode) 24 | 25 | Switch to hadoop user 26 | `su hadoop` 27 | 28 | In /opt/, create Hive directory and grant sudo permission 29 | `cd /opt/` 30 | `sudo mkdir hive` 31 | `sudo chown -R hadoop /opt/hive` 32 | 33 | Download Apache Hive to /home/hadoop 34 | `cd ~` 35 | `wget -O hive.tgz https://downloads.apache.org/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz` 36 | 37 | Untar the file into /opt/spark directory 38 | `tar -zxf hive.tgz --directory=/opt/hive --strip=1` 39 | 40 | ## 2.2. 
Setup Hive Environment Variable 41 | 42 | Edit the .bashrc file 43 | `nano ~/.bashrc` 44 | 45 | Add the following line to the **.bashrc** file 46 | ``` 47 | # Hive Environment Configuration 48 | export HIVE_HOME=/opt/hive 49 | export PATH=$HIVE_HOME/bin:$PATH 50 | ``` 51 | Also Hive uses Hadoop, so you must have Hadoop in your path or your **.bashrc** must have 52 | `export HADOOP_HOME=` 53 | 54 | ## 2.3. Create Hive working directory on HDFS 55 | 56 | In addition, you must use below HDFS commands to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w before you can create a table in Hive. 57 | 58 | `hdfs dfs -mkdir /tmp` 59 | `hdfs dfs -mkdir -p /user/hive/warehouse` 60 | `hdfs dfs -chmod g+w /tmp` 61 | `hdfs dfs -chmod g+w /user/hive/warehouse` 62 | 63 | 64 | ## 2.4. Setup Hive Metastore and HiveServer2 65 | 66 | ### 2.4.1. Configuring a Remote PostgreSQL Database for the Hive Metastore 67 | 68 | In order to use Hive Metastore, a Database is required to store the Hive Metadata. Though Hive provides an embedded Database (Apache Derby), this mode should only be used for experimental purposes only. Here we will setup a remote PostgreSQL Database to run Hive Metastore. 69 | 70 | Before you can run the Hive metastore with a remote PostgreSQL database, you must configure a connector to the remote PostgreSQL database, set up the initial database schema, and configure the PostgreSQL user account for the Hive user. 71 | 72 | **Install and start PostgreSQL** 73 | Run the following to install PostgreSQL Database 74 | `sudo apt-get install postgresql` 75 | 76 | To ensure that your PostgreSQL server will be accessible over the network, you need to do some additional configuration. 77 | 78 | First you need to edit the `postgresql.conf` file. Set the `listen_addresses` property to `*`, to make sure that the PostgreSQL server starts listening on all your network interfaces. Also make sure that the `standard_conforming_strings property` is set to `off`. 79 | 80 | `sudo nano /etc/postgresql/10/main/postgresql.conf` 81 | 82 | Adjust the required properties 83 | 84 | ``` 85 | listen_addresses = '*' 86 | standard_conforming_strings off 87 | ``` 88 | 89 | You also need to configure authentication for your network in `pg_hba.conf`. You need to make sure that the PostgreSQL user that you will create later in this procedure will have access to the server from a remote host. To do this, add a new line into `pg_hba.conf` that has the following information: 90 | `sudo nano /etc/postgresql/10/main/pg_hba.conf` 91 | Add the following line to the file 92 | ``` 93 | host all all 0.0.0.0 0.0.0.0 md5 94 | ``` 95 | 96 | If the default pg_hba.conf file contains the following line: 97 | ``` 98 | host all all 127.0.0.1/32 ident 99 | ``` 100 | then the host line specifying md5 authentication shown above must be inserted before this ident line. Failing to do so might lead to the following error 101 | ``` 102 | SLF4J: Class path contains multiple SLF4J bindings. 103 | SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] 104 | SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] 105 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
106 | SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 107 | Metastore connection URL: jdbc:postgresql://localhost:5432/metastore 108 | Metastore Connection Driver : org.postgresql.Driver 109 | Metastore connection User: hive 110 | org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version. 111 | Underlying cause: org.postgresql.util.PSQLException : FATAL: Ident authentication failed for user "hive" 112 | SQL Error code: 0 113 | Use --verbose for detailed stacktrace. 114 | *** schemaTool failed *** 115 | ``` 116 | 117 | 118 | After all is done, start PostgreSQL Server 119 | `sudo service postgresql start` 120 | 121 | 122 | **Install the PostgreSQL JDBC driver** 123 | On the client, a JDBC driver is required to connect to the PostgreSQL Database. Install the JDBC and copy or link it to hive/lib folder. 124 | `sudo apt-get install libpostgresql-jdbc-java` 125 | `ln -s /usr/share/java/postgresql-jdbc4.jar /opt/hive/lib/postgresql-jdbc4.jar` 126 | 127 | **Create the metastore database and user account** 128 | Login to postgres user and start a psql 129 | `sudo -u postgres psql` 130 | 131 | Add dedicated User and Database for Hive Metastore 132 | 133 | ``` 134 | postgres=# CREATE USER hive WITH PASSWORD '123123'; 135 | postgres=# CREATE DATABASE metastore; 136 | ``` 137 | 138 | ### 2.4.2. Adjust the Hive and Hadoop config files for the Hive Metastore 139 | In /opt/hive/conf, create config file hive-site.xml from hive-default.xml.template (or not) 140 | `nano hive-site.xml` 141 | 142 | Add the following lines 143 | ``` 144 | 145 | 146 | javax.jdo.option.ConnectionURL 147 | jdbc:postgresql://localhost:5432/metastore 148 | 149 | 150 | 151 | javax.jdo.option.ConnectionDriverName 152 | org.postgresql.Driver 153 | 154 | 155 | 156 | javax.jdo.option.ConnectionUserName 157 | hive 158 | 159 | 160 | 161 | javax.jdo.option.ConnectionPassword 162 | 123123 163 | 164 | 165 | 166 | hive.metastore.warehouse.dir 167 | hdfs://192.168.0.5:9000/user/hive/warehouse 168 | 169 | 170 | ``` 171 | 172 | In order to avoid the issue `Cannot connect to hive using beeline, user root cannot impersonate anonymous` when connecting to HiveServer2 from `beeline`, from hadoop/etc/hadoop, add to the core-site.xml, with `[username]` is your local user that will be use in `beeline` 173 | ``` 174 | 175 | hadoop.proxyuser.[username].groups 176 | * 177 | 178 | 179 | hadoop.proxyuser.[username].hosts 180 | * 181 | 182 | ``` 183 | 184 | ### 2.4.3. Start Hive Metastore Server & HiveServer2 185 | 186 | Use the Hive Schema Tool to create the metastore tables. 187 | `/opt/hive/bin/schematool -dbType postgres -initSchema` 188 | 189 | 190 | The output should be like 191 | ``` 192 | Metastore connection URL: jdbc:postgresql://localhost:5432/metastore 193 | Metastore Connection Driver : org.postgresql.Driver 194 | Metastore connection User: hive 195 | Starting metastore schema initialization to 2.3.0 196 | Initialization script hive-schema-2.3.0.postgres.sql 197 | Initialization script completed 198 | schemaTool completed 199 | ``` 200 | 201 | Respectively start Hive Metastore Server and HiveServer2 services 202 | `hive --service metastore` 203 | From another connection session, start HiveServer2 204 | `hiveserver2` 205 | 206 | Run Beeline (the HiveServer2 CLI). 
207 | `beeline` 208 | Connecto to HiveServer2 209 | `!connect jdbc:hive2://localhost:10000` 210 | Input any user and password, and the result should be 211 | ``` 212 | Connecting to jdbc:hive2://localhost:10000 213 | Enter username for jdbc:hive2://localhost:10000: hadoop 214 | Enter password for jdbc:hive2://localhost:10000: ****** 215 | Connected to: Apache Hive (version 2.3.7) 216 | Driver: Hive JDBC (version 2.3.7) 217 | Transaction isolation: TRANSACTION_REPEATABLE_READ 218 | ``` 219 | 220 | From here, you can run your query on Hive 221 | 222 | For monitor, you can access HiveServer WebUI on port 10002 223 | `http://localhost:10002` 224 | 225 | ## 2.6. Setup Hive Execution Engine to Spark 226 | 227 | While MR remains the default engine for historical reasons, it is itself a historical engine and is deprecated in Hive 2 line. It may be removed without further warning. Therefore, choosing another execution engine is wise decision. Here, we will config Spark to be Hive Execution Engine. 228 | 229 | First, add the following config to `hive-site.xml` in `/opt/hive/conf` 230 | ``` 231 | 232 | hive.execution.engine 233 | spark 234 | 235 | Expects one of [mr, tez, spark] 236 | 237 | 238 | ``` 239 | 240 | Then configuring YARN to distribute an equal share of resources for jobs in the YARN cluster. 241 | ``` 242 | 243 | yarn.resourcemanager.scheduler.class 244 | org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler 245 | 246 | ``` 247 | 248 | Next, add the Spark libs to Hive's class path as below. 249 | 250 | Edit `/opt/hive/bin/hive` file (backup this file is if anything wrong happens) 251 | `cp hive hive_backup` 252 | `nano hive` 253 | 254 | Add Spark Libs to Hive 255 | ``` 256 | for f in ${SPARK_HOME}/jars/*.jar; do 257 | CLASSPATH=${CLASSPATH}:$f; 258 | done 259 | ``` 260 | 261 | Finally, upload all jars in `$SPARK_HOME/jars` to hdfs folder (for example:hdfs:///xxxx:9000/spark-jars): 262 | ``` 263 | hdfs dfs -mkdir /spark-jar 264 | hdfs dfs -put /opt/spark/jars/* /spark-jars 265 | ``` 266 | 267 | and add following in `hive-site.xml` 268 | ``` 269 | 270 | spark.yarn.jars 271 | hdfs://xxxx:9000/spark-jars/* 272 | 273 | ``` 274 | 275 | You may get below exception if you missed the CLASSPATH configuration above. 276 | ```Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable``` 277 | 278 | Another solution should be considered, though the author of this document hasn't successfully experienced yet. 
[https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive) 279 | 280 | From now on, when doing any execution, the output should be like 281 | 282 | ``` 283 | hive> insert into pokes values(2, 'hai'); 284 | Query ID = hadoop_20200718125927_5e68f5c3-66b5-4459-a145-71b44c658939 285 | Total jobs = 1 286 | Launching Job 1 out of 1 287 | In order to change the average load for a reducer (in bytes): 288 | set hive.exec.reducers.bytes.per.reducer= 289 | In order to limit the maximum number of reducers: 290 | set hive.exec.reducers.max= 291 | In order to set a constant number of reducers: 292 | set mapreduce.job.reduces= 293 | Starting Spark Job = 84d25db4-ea71-4cf1-8e17-d9018390d005 294 | Running with YARN Application = application_1595048642632_0006 295 | Kill Command = /opt/hadoop/bin/yarn application -kill application_1595048642632_0006 296 | 297 | Query Hive on Spark job[0] stages: [0] 298 | 299 | Status: Running (Hive on Spark job[0]) 300 | -------------------------------------------------------------------------------------- 301 | STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED 302 | -------------------------------------------------------------------------------------- 303 | Stage-0 ........ 0 FINISHED 1 1 0 0 0 304 | -------------------------------------------------------------------------------------- 305 | STAGES: 01/01 [==========================>>] 100% ELAPSED TIME: 7.11 s 306 | -------------------------------------------------------------------------------------- 307 | Status: Finished successfully in 7.11 seconds 308 | Loading data to table default.pokes 309 | OK 310 | Time taken: 30.264 seconds 311 | ``` 312 | 313 | 314 | 315 | ## 2.5. Connecting Apache Spark to Apache Hive 316 | ### 2.5.1 Config hive-site.xml and spark-default.sh 317 | Create `/opt/spark/conf/hive-site.xml` and define `hive.metastore.uris` configuration property (that is the thrift URL of the Hive Metastore Server). 318 | 319 | ``` 320 | 321 | 322 | hive.metastore.uris 323 | thrift://localhost:9083 324 | 325 | 326 | ``` 327 | 328 | Optionally, from `log4j.properties.template` create `log4j.properties` add the following for a more Hive low-level logging: 329 | 330 | `cp log4j.properties.template log4j.properties` 331 | 332 | ``` 333 | log4j.logger.org.apache.spark.sql.hive.HiveUtils$=ALL 334 | log4j.logger.org.apache.spark.sql.internal.SharedState=ALL 335 | log4j.logger.org.apache.spark.sql.hive.client.HiveClientImpl=ALL 336 | ``` 337 | 338 | The following config also should be set on `/opt/spark/conf/spark-default.sh` file 339 | ``` 340 | spark.master yarn 341 | spark.serializer org.apache.spark.serializer.KryoSerializer 342 | spark.eventLog.enabled true 343 | spark.eventLog.dir hdfs://192.168.0.5:9000/spark-logs 344 | spark.driver.memory 512m 345 | spark.yarn.am.memory 512m 346 | spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider 347 | spark.history.fs.logDirectory hdfs://192.168.0.5:9000/spark-logs 348 | spark.history.fs.update.interval 10s 349 | spark.history.ui.port 18080 350 | 351 | # Spark Hive Configuartions 352 | spark.sql.catalogImplementation hive 353 | ``` 354 | 355 | 356 | 357 | ### 2.5.2. Start your PySpark Hive 358 | 359 | However, since Hive has a large number of dependencies, **these dependencies are not included in the default Spark distribution**. 
If **Hive dependencies can be found on the classpath**, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive. 360 | 361 | By adding the dependencies in the code, PySpark will automatically download it from the internet. The dependencies include: 362 | 363 | `org.apache.spark:spark-hive_2.11:2.4.6` (The 2.12 version might not work) 364 | `org.apache.avro:avro-mapred:1.8.2` 365 | 366 | From any IDE, execute the following Python script: 367 | ``` 368 | import findspark 369 | findspark.init() 370 | findspark.find() 371 | import os 372 | from pyspark.sql import SparkSession 373 | from pyspark.sql import Row 374 | 375 | submit_args = '--packages org.apache.spark:spark-hive_2.11:2.4.6 --packages org.apache.avro:avro-mapred:1.8.2 pyspark-shell' 376 | if 'PYSPARK_SUBMIT_ARGS' not in os.environ: 377 | os.environ['PYSPARK_SUBMIT_ARGS'] = submit_args 378 | else: 379 | os.environ['PYSPARK_SUBMIT_ARGS'] += submit_args 380 | 381 | # warehouse_location points to the default location for managed databases and tables 382 | warehouse_location = 'hdfs://192.168.0.5:9000/user/hive/warehouse' 383 | 384 | spark = SparkSession \ 385 | .builder \ 386 | .appName("Python Spark SQL Hive integration example") \ 387 | .config("spark.sql.warehouse.dir", warehouse_location) \ 388 | .enableHiveSupport() \ 389 | .getOrCreate() 390 | ``` 391 | 392 | Then, hopefully you can query into Hive Tables. You can check the SparkSession config by 393 | 394 | `spark.sparkContext.getConf().getAll()` 395 | 396 | Such output should be there 397 | ``` 398 | ('spark.sql.warehouse.dir', 'hdfs://192.168.0.5:9000/user/hive/warehouse'), 399 | ('spark.sql.catalogImplementation', 'hive') 400 | ``` 401 | 402 | ## 2.6. Spark SQL on Hive Tables 403 | 404 | Apache Spark provide a machenism to query into Hive Tables. Try the following Python script 405 | 406 | **Select from Hive Table** 407 | ``` 408 | # With pokes is a Hive table 409 | spark.sql('select * from pokes').show() 410 | ``` 411 | 412 | **Write to Hive Table** 413 | ``` 414 | df = spark.range(10).toDF('number') 415 | df.registerTempTable('number') 416 | spark.sql('create table number as select * from number') 417 | ``` 418 | 419 | You can check whether the `number` table is available using `hive` or `beeline` 420 | ``` 421 | hive> show tables; 422 | OK 423 | number 424 | pokes 425 | values__tmp__table__1 426 | Time taken: 0.039 seconds, Fetched: 3 row(s) 427 | 428 | hive> select * from number; 429 | OK 430 | 0 431 | 1 432 | 2 433 | 3 434 | 4 435 | 5 436 | 6 437 | 7 438 | 8 439 | 9 440 | Time taken: 0.135 seconds, Fetched: 10 row(s) 441 | ``` 442 | 443 | 444 | ```python 445 | 446 | ``` 447 | 448 | # 3. References 449 | ## 3.1. PostgreSQL Setup and Config 450 | ``` 451 | https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cm_ig_extrnl_pstgrs.html#cmig_topic_5_6 452 | ``` 453 | ``` 454 | https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cdh_ig_hive_metastore_configure.html 455 | ``` 456 | --------------------------------------------------------------------------------