├── README.md
├── Kafka Cluster Setup
├── YARN Configuration.md
├── Spark Thrift Setup
├── Install Python38 Ubuntu.md
├── Yarn Dask installation with Anaconda.md
├── Install Debezium Kafka Connector for PostgreSQL using wal2json.md
├── Spark Hadoop Free Cluster Setup.md
├── Hadoop Cluster Setup.md
└── Hive Setup with Spark Exection and Spark HiveContext.md
/README.md:
--------------------------------------------------------------------------------
1 | # Big-Data-Installation
--------------------------------------------------------------------------------
/Kafka Cluster Setup:
--------------------------------------------------------------------------------
1 | https://blog.clairvoyantsoft.com/kafka-series-3-creating-3-node-kafka-cluster-on-virtual-box-87d5edc85594
2 |
--------------------------------------------------------------------------------
/YARN Configuration.md:
--------------------------------------------------------------------------------
1 | ## Config Resources ##
2 | **Default Configuration**
3 |
4 | **Node Manager Allocated Resources**
5 |
6 | **Container Allocated Resources**
7 |
8 | **Application Master Allocated Resources**
9 |
10 | ## Config Scheduler
11 | **Capacity Scheduler**
12 |
13 | **Fair Scheduler**
14 |
--------------------------------------------------------------------------------
/Spark Thrift Setup:
--------------------------------------------------------------------------------
1 | https://www.programmersought.com/article/75806728859/
2 | https://www.programmersought.com/article/34564251004/
3 | https://www.programmersought.com/article/47542974362/
4 | https://www.programmersought.com/article/46657462430/
5 | https://www.programmersought.com/article/49335838484/
6 | https://www.programmersought.com/article/45243571155/
7 | https://www.programmersought.com/article/55482089247/
8 | https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-thrift-server.html
9 |
--------------------------------------------------------------------------------
/Install Python38 Ubuntu.md:
--------------------------------------------------------------------------------
1 | # Install Python 3.8 on Ubuntu 18.04
2 | **Install Python3.8**
3 | `sudo apt install python3.8`
4 |
5 | **Set python3 to Python3.8**
6 | Add Python3.6 & Python 3.8 to update-alternatives
7 |
8 | - `sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1`
9 | - `sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2`
10 |
11 | Update Python 3 to point to Python 3.8
12 |
13 | `sudo update-alternatives --config python3`, then enter 2 to select Python 3.8
14 |
15 | Test the version of python
16 | ```
17 | python3 --version
18 | Python 3.8.0
19 | ```
20 |
21 | Install pip3
22 | `sudo apt install python3-pip`
23 |
24 | **Fix error: ImportError: No module named apt_pkg**
25 |
26 | - `cd /usr/lib/python3/dist-packages/`
27 | - `sudo ln -s apt_pkg.cpython-36m-x86_64-linux-gnu.so apt_pkg.so`
28 |
29 | **Ensure Python on all Spark worker nodes is updated to the same version**
30 | ```
31 | Exception: Python in worker has different version 3.6 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
32 | ```
33 |
34 | ```
35 | export PYSPARK_PYTHON=python3
36 | export PYSPARK_DRIVER_PYTHON=python3
37 | ```
38 |
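A quick way to confirm that the driver and the workers agree is to collect the interpreter version from each executor; a minimal sketch, assuming PySpark is installed and the cluster is reachable:
```
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()
sc = spark.sparkContext

driver_version = sys.version_info[:2]
# run a small job so every executor reports the Python version it actually uses
worker_versions = set(
    sc.parallelize(range(100), 10)
      .map(lambda _: sys.version_info[:2])
      .collect()
)
print("driver:", driver_version, "workers:", worker_versions)
spark.stop()
```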
--------------------------------------------------------------------------------
/Yarn Dask installation with Anaconda.md:
--------------------------------------------------------------------------------
1 | ## 1. Install Anaconda
2 | Install Anaconda version 5.3.1, which supports Python 3.7.
3 | * **Download the Anaconda installation script**:
4 | ```
5 | wget https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh
6 | ```
7 | * **Run script file**:
8 | ```
9 | bash /home/hadoop/Anaconda3-5.3.1-Linux-x86_64.sh
10 | ```
11 | Follow the installer prompts. When done, activate Anaconda:
12 | ```
13 | source ~/.bashrc
14 | ```
15 | ## 2. Create Conda environment
16 | * **Create Conda environment**:
17 | ```
18 | conda create -n my-env python=3.7
19 | ```
20 | _You must specify the Python version; otherwise Conda will automatically install the latest version_
21 | * **Activate Conda environment**:
22 | ```
23 | conda activate my-env
24 | ```
25 | * **Package the environment**:
26 | Package the conda environment using conda-pack.
27 | Install conda-pack if it is not already installed:
28 | ```
29 | conda install -c conda-forge conda-pack
30 | ```
31 | Then package the current environment (activate first):
32 | ```
33 | conda-pack -o /path/to/save/my-env.tar.gz
34 | ```
35 | Replicate the packed environment to all nodes (remember to edit the _**.bashrc**_ file and put it in the same directory on every node)
36 |
37 | ## 3. Install dask-yarn and run the first Dask job
38 | * **Install dask-yarn with conda**:
39 | ```
40 | conda install -c conda-forge dask-yarn
41 | ```
42 | * **Submit a Python job with the dask-yarn CLI**:
43 | From the terminal, run:
44 | ```
45 | dask-yarn submit --environment python:///opt/anaconda/envs/analytics/bin/python --worker-count 8 --worker-vcores 2 --worker-memory 4GiB dask_submit_test.py
46 | ```
47 | _**Note:**_ The environment format is **_python:///path_to_env/bin/python_**. Get the path to the environment by executing:
48 | ```
49 | conda env list
50 | ```
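The submitted script (`dask_submit_test.py` above) is not included here; below is a minimal sketch of what such a script could look like (its real contents are an assumption). Inside a `dask-yarn submit` job, the script attaches to the YARN application created for it with `YarnCluster.from_current()`:
```
import pandas as pd
import dask.dataframe as dd
from dask_yarn import YarnCluster
from dask.distributed import Client

# attach to the application that `dask-yarn submit` created for this script
cluster = YarnCluster.from_current()
client = Client(cluster)

# trivial computation to prove the workers are reachable
ddf = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=4)
print(ddf['x'].sum().compute())
```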
51 | * **Start YARN cluster**:
52 | To start a YARN cluster, create an instance of YarnCluster. This constructor takes several parameters; leave them empty to use the defaults defined in the configuration:
53 | ```
54 | from dask_yarn import YarnCluster
55 | import dask.dataframe as dd
56 |
57 | from dask.distributed import Client
58 |
59 | cluster = YarnCluster(environment='python:///opt/anaconda/envs/analytics/bin/python',
60 | worker_vcores=2,
61 | worker_memory="4GiB")
62 | ```
63 | By default no workers are started on cluster creation. To change the number of workers, use the YarnCluster.scale() method. When scaling up, new workers will be requested from YARN. When scaling down, workers will be intelligently selected and scaled down gracefully, freeing up resources:
64 | ```
65 | cluster.scale(4)
66 | ```
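The snippet above imports `Client` but does not use it; to make Dask computations actually run on the YARN workers, connect a `Client` to the cluster:
```
client = Client(cluster)
print(client)   # shows the scheduler address, dashboard link and current workers
```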
67 | Read csv file:
68 | ```
69 | data = dd.read_csv('hdfs://10.56.239.63:9000/user/dan/Global_Mobility_Report.csv',dtype={'sub_region_2': 'object'})
70 | ```
71 | This creates a Dask DataFrame; to show the first rows, use
72 | ```
73 | data.head()
74 | ```
75 | Convert to a pandas DataFrame with the **compute** method:
76 | ```
77 | df = data.compute()
78 | ```
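Note that `compute()` pulls the full result into the driver's memory. For large files it is usually better to aggregate first and only compute the reduced result; a small sketch (the column names are from the Global Mobility Report and are only illustrative):
```
# aggregate on the workers, then bring back only the small result
mean_by_country = (
    data.groupby('country_region')['retail_and_recreation_percent_change_from_baseline']
        .mean()
        .compute()
)
print(mean_by_country.head())
```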
79 |
80 |
81 |
--------------------------------------------------------------------------------
/Install Debezium Kafka Connector for PostgreSQL using wal2json.md:
--------------------------------------------------------------------------------
1 | # Install Debezium Connector for PostgreSQL using wal2json plugin
2 |
3 | ## Overview
4 | Debezium’s PostgreSQL Connector can monitor and record row-level changes in the schemas of a PostgreSQL database.
5 |
6 | The first time it connects to a PostgreSQL server/cluster, it reads a consistent snapshot of all of the schemas. When that snapshot is complete, the connector continuously streams the changes that were committed to PostgreSQL 9.6 or later and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.
7 |
8 | ## Install wal2json
9 |
10 | On Ubuntu:
11 |
12 | ```bash
13 | sudo apt install postgresql-9.6-wal2json
14 | ```
15 | Check that `wal2json.so` is present in `/usr/lib/postgresql/9.6`
16 |
17 | ## Configure PostgreSQL Server
18 |
19 | ### Setting up libraries, WAL and replication parameters
20 |
21 | ***Find and modify*** the following lines at the end of the `postgresql.conf` PostgreSQL configuration file in order to include the plug-in in the shared libraries and to adjust some **WAL** and **streaming replication** settings. Adjust `shared_preload_libraries` accordingly if other libraries are already listed there.
22 |
23 | ```
24 | listen_addresses = '*'
25 | shared_preload_libraries = 'wal2json'
26 | wal_level = logical
27 | max_wal_senders = 4
28 | max_replication_slots = 4
29 | ```
30 |
31 | 1. `max_wal_senders` tells the server that it should use a maximum of `4` separate processes for processing WAL changes
32 | 2. `max_replication_slots` tells the server that it should allow a maximum of `4` replication slots to be created for streaming WAL changes
33 |
34 | ### Setting up replication permissions
35 |
36 | In order to give a user replication permissions, define a PostgreSQL role that has _at least_ the `REPLICATION` and `LOGIN` permissions.
37 |
38 | Login to `postgres` user and run `psql`
39 | ```bash
40 | sudo -u postgres psql
41 | ```
42 | For example:
43 |
44 | ```sql
45 | CREATE ROLE datalake REPLICATION LOGIN;
46 | ```
47 | > However, Debezium needs further permissions to take the initial snapshot. I used the postgres superuser to work around this issue. You should test further to grant the right permissions to a dedicated role.
48 |
49 | Next, modify `pg_hba.conf` to accept remote connections from the Debezium Kafka connector
50 |
51 | ```
52 | host replication postgres 0.0.0.0/0 trust
53 | host replication postgres ::/0 trust
54 | host all all 192.168.1.2/32 trust
55 | ```
56 | > **Note:** Debezium is installed on 192.168.1.2
57 |
58 | ### Database Test Environment
59 |
60 | Back in the `psql` console, create the test database and table (run `\c test` after `CREATE DATABASE` so the table is created inside the new database):
61 |
62 | ```sql
63 | CREATE DATABASE test;
64 | CREATE TABLE test_table (
65 | id char(10) NOT NULL,
66 | code char(10),
67 | PRIMARY KEY (id)
68 | );
69 | ```
70 |
71 | - **Create a slot** named `test_slot` for the database named `test`, using the logical output plug-in `wal2json`
72 |
73 |
74 | ```bash
75 | $ pg_recvlogical -d test --slot test_slot --create-slot -P wal2json
76 | ```
77 |
78 | - **Begin streaming changes** from the logical replication slot `test_slot` for the database `test`
79 |
80 |
81 | ```bash
82 | $ pg_recvlogical -d test --slot test_slot --start -o pretty-print=1 -f -
83 | ```
84 |
85 | - **Perform some basic DML** operations on `test_table` to trigger `INSERT`/`UPDATE`/`DELETE` change events
86 |
87 |
88 | _Interactive PostgreSQL terminal, SQL commands_
89 |
90 | ```sql
91 | test=# INSERT INTO test_table (id, code) VALUES('id1', 'code1');
92 | INSERT 0 1
93 | test=# update test_table set code='code2' where id='id1';
94 | UPDATE 1
95 | test=# delete from test_table where id='id1';
96 | DELETE 1
97 | ```
98 |
99 | Upon the `INSERT`, `UPDATE` and `DELETE` events, the `wal2json` plug-in outputs the table changes as captured by `pg_recvlogical`.
100 |
101 | _Output for `INSERT` event_
102 |
103 | ```json
104 | {
105 | "change": [
106 | {
107 | "kind": "insert",
108 | "schema": "public",
109 | "table": "test_table",
110 | "columnnames": ["id", "code"],
111 | "columntypes": ["character(10)", "character(10)"],
112 | "columnvalues": ["id1 ", "code1 "]
113 | }
114 | ]
115 | }
116 | ```
117 |
118 | _Output for `UPDATE` event_
119 |
120 | ```json
121 | {
122 | "change": [
123 | {
124 | "kind": "update",
125 | "schema": "public",
126 | "table": "test_table",
127 | "columnnames": ["id", "code"],
128 | "columntypes": ["character(10)", "character(10)"],
129 | "columnvalues": ["id1 ", "code2 "],
130 | "oldkeys": {
131 | "keynames": ["id"],
132 | "keytypes": ["character(10)"],
133 | "keyvalues": ["id1 "]
134 | }
135 | }
136 | ]
137 | }
138 | ```
139 |
140 | _Output for `DELETE` event_
141 |
142 | ```json
143 | {
144 | "change": [
145 | {
146 | "kind": "delete",
147 | "schema": "public",
148 | "table": "test_table",
149 | "oldkeys": {
150 | "keynames": ["id"],
151 | "keytypes": ["character(10)"],
152 | "keyvalues": ["id1 "]
153 | }
154 | }
155 | ]
156 | }
157 | ```
158 |
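For reference, a small Python helper (hypothetical, not part of the setup) that turns a wal2json message like the ones above into `(kind, {column: value})` pairs:

```python
import json

def rows_from_wal2json(raw_message):
    """Yield (kind, {column: value}) pairs from a wal2json message."""
    for change in json.loads(raw_message).get("change", []):
        kind = change["kind"]
        if kind in ("insert", "update"):
            row = dict(zip(change["columnnames"], change["columnvalues"]))
        else:  # delete: only the old key columns are available
            oldkeys = change["oldkeys"]
            row = dict(zip(oldkeys["keynames"], oldkeys["keyvalues"]))
        yield kind, row
```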
159 | When the test is finished, the slot `test_slot` for the database `test` can be removed by the following command:
160 |
161 | ```bash
162 | $ pg_recvlogical -d test --slot test_slot --drop-slot
163 | ```
164 |
165 | ## Install Debezium Kafka Connector
166 |
167 | Download Debezium Connector:
168 |
169 | ```bash
170 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/1.2.0.Final/debezium-connector-postgres-1.2.0.Final-plugin.tar.gz
171 | ```
172 | Extract it to /opt/kafka/connect
173 |
174 | ```bash
175 | tar -xvf debezium-connector-postgres-1.2.0.Final-plugin.tar.gz --directory=/opt/kafka/connect
176 | ```
177 |
178 | Check that the folder `debezium-connector-postgres` is in `/opt/kafka/connect`
179 |
180 | ### Test connector
181 |
182 | Edit `config/connect-file-source.properties` as following:
183 |
184 | ```properties
185 | name=postgres-cdc-source
186 | connector.class=io.debezium.connector.postgresql.PostgresConnector
187 | #snapshot.mode=never
188 | tasks.max=1
189 | plugin.name=wal2json
190 | database.hostname=192.168.1.1
191 | database.port=5432
192 | database.user=postgres
193 | #database.password=postgres
194 | database.dbname=test
195 | # slot.name=test_slot
196 | database.server.name=fulfillment
197 | #table.whitelist=public.inventory
198 | ```
199 |
200 | Append kafka plugin folder to `plugin.path` in file `config/connect-standalone.properties`
201 |
202 | ```properties
203 | plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,/opt/kafka/connect
204 | ```
205 | Run kafka connector in standalone mode
206 |
207 | ```bash
208 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
209 | ```
210 |
211 | #### Topics Names
212 |
213 | The PostgreSQL connector writes events for all insert, update, and delete operations on a single table to a single Kafka topic. By default, the Kafka topic name is _serverName_._schemaName_._tableName_ (configuration in `connect-file-source.properties`) where _serverName_ is the logical name of the connector as specified with the `database.server.name` configuration property, _schemaName_ is the name of the database schema where the operation occurred, and _tableName_ is the name of the database table on which the operation occurred.
214 |
215 | For example, consider a PostgreSQL installation with a `postgres` database and an `inventory` schema that contains four tables: `products`, `products_on_hand`, `customers`, and `orders`. If the connector monitoring this database were given a logical server name of `fulfillment`, then the connector would produce events on these four Kafka topics:
216 |
217 | - `fulfillment.inventory.products`
218 |
219 | - `fulfillment.inventory.products_on_hand`
220 |
221 | - `fulfillment.inventory.customers`
222 |
223 | - `fulfillment.inventory.orders`
224 |
225 |
226 | If on the other hand the tables were not part of a specific schema but rather created in the default `public` PostgreSQL schema, then the name of the Kafka topics would be:
227 |
228 | - `fulfillment.public.products`
229 |
230 | - `fulfillment.public.products_on_hand`
231 |
232 | - `fulfillment.public.customers`
233 |
234 | - `fulfillment.public.orders`
235 |
236 | In this example, the topic's name is `fulfillment.public.test_table`
237 |
238 | Run:
239 |
240 | ```bash
241 | bin/kafka-topics.sh --list --bootstrap-server 192.168.1.2:9092
242 | ```
243 | You should see `fulfillment.public.test_table` in the output.
244 |
245 | Start receiving messages from Debezium:
246 |
247 | ```bash
248 | bin/kafka-console-consumer.sh --bootstrap-server 192.168.1.2:9092 --from-beginning --topic fulfillment.public.test_table
249 | ```
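The same topic can also be consumed programmatically, for example with the third-party `kafka-python` package (an assumption, not part of this setup); a minimal sketch:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "fulfillment.public.test_table",
    bootstrap_servers="192.168.1.2:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:        # tombstone records carry an empty value
        continue
    # with the default JSON converter the change data sits under "payload"
    payload = event.get("payload", event)
    print(payload.get("op"), payload.get("after"))
```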
250 |
--------------------------------------------------------------------------------
/Spark Hadoop Free Cluster Setup.md:
--------------------------------------------------------------------------------
1 | ## 1. Prerequisite
2 |
3 | Below are the items that need to be installed/set up before installing Spark
4 | - Passwordless SSH
5 | - Java Installation and JAVA_HOME declaration in .bashrc
6 | - Hadoop Cluster Setup, with start-dfs.sh and start-yarn.sh executed
7 | - Hostname/hosts config
8 |
9 | ## 2. Download and Install Spark
10 |
11 | On the Apache Spark download [website](https://spark.apache.org/downloads.html), choose the Spark release version and package type that you would like to install.
12 |
13 | Release version: Spark currently has 2 stable versions
14 | - Spark 2.x: widely supported by many tools, like Elasticsearch, etc.
15 | - Spark 3.x: introduces new features and performance optimizations
16 |
17 | Type of packages:
18 | - Spark with Hadoop: prebuilt with a specific Hadoop version bundled
19 | - Spark without Hadoop: in case you have already installed Hadoop on your cluster.
20 |
21 | In this document, we will use **Spark 3.0.0 prebuilt with user-provided Apache Hadoop**
22 |
23 | ### 2.1. Download Apache Spark
24 |
25 | **On Master node (or Namenode)**
26 | The rest of this document should only be executed on the Master node (Namenode).
27 |
28 | Switch to hadoop user
29 | `su hadoop`
30 |
31 | In /opt/, create spark directory and grant sudo permission
32 | `cd /opt/`
33 | `sudo mkdir spark`
34 | `sudo chown -R hadoop /opt/spark`
35 |
36 | Download Apache Spark to /home/hadoop
37 | `cd ~`
38 | `wget -O spark.tgz http://mirrors.viethosting.com/apache/spark/spark-3.0.0/spark-3.0.0-bin-without-hadoop.tgz`
39 |
40 | Untar the file into /opt/spark directory
41 | `tar -zxf spark.tgz --directory=/opt/spark --strip=1`
42 |
43 | ### 2.2. Setup Spark Environment Variable
44 |
45 | Edit the .bashrc file
46 | `nano ~/.bashrc`
47 |
48 | Add the following line to the **.bashrc** file
49 | ```
50 | # Set Spark Environment Variable
51 | export SPARK_HOME=/opt/spark
52 | export PATH=$PATH:$SPARK_HOME/bin
53 | export PYSPARK_PYTHON=python3
54 | #export PYSPARK_DRIVER_PYTHON="jupyter"
55 | #export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889"
56 | #export SPARK_LOCAL_IP=192.168.0.1 # Your Master IP
57 |
58 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
59 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
60 | export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop
61 | export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
62 | ```
63 |
64 | ### 2.3. Edit Spark Config Files
65 |
66 | Apache Spark has several config files dedicated to specific purposes:
67 | - spark-env.sh: store default environment variables for Spark
68 | - spark-defaults.conf: default config for spark-submit, including allocated resources, dependencies, proxies, etc.
69 |
70 | In spark/conf directory, make a copy of spark-env.sh.template and name it as spark-env.sh
71 | `cp spark-env.sh.template spark-env.sh`
72 |
73 | Edit the spark-env.sh
74 | `nano spark-env.sh`
75 |
76 | Add the line below to the file so that Spark knows the Hadoop location
77 | `export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)`
78 |
79 | Make a copy of spark-defaults.conf.template and name it as spark-defaults.conf
80 | `cp spark-defaults.conf.template spark-defaults.conf`
81 |
82 | Edit the spark-defaults.conf
83 | `nano spark-defaults.conf`
84 | Add the lines below to the file
85 | ```
86 | spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 # Kafka Dependency for Spark
87 | spark.driver.extraJavaOptions -Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128 # Set Proxy for Spark
88 | ```
89 |
90 | ### 2.4. Setup Spark Cluster Manager
91 |
92 | Spark supports different cluster managers, including the Spark Standalone Cluster Manager, YARN, Mesos, and Kubernetes. In this document, we will configure YARN and the Spark Standalone Cluster Manager.
93 |
94 | #### 2.4.1. Spark on Yarn
95 |
96 | In spark-defaults.conf, declare the Spark master config by adding the following line
97 |
98 | ```
99 | spark.master yarn
100 | ```
101 |
102 | The following configs should be considered as well
103 |
104 | ```
105 | spark.eventLog.enabled true
106 | spark.eventLog.dir hdfs://192.168.0.1:9000/spark-logs
107 | spark.driver.memory 512m
108 | spark.yarn.am.memory 512m
109 | spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
110 | spark.history.fs.logDirectory hdfs://192.168.0.1:9000/spark-logs
111 | spark.history.fs.update.interval 10s
112 | spark.history.ui.port 18080
113 | ```
114 |
115 | Create hdfs directory to store spark logs
116 | `hdfs dfs -mkdir /spark-logs`
117 |
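Once the settings above are in place, a quick smoke test from the Master node can confirm that Spark jobs really run on YARN; a minimal sketch, assuming PySpark picks up spark-defaults.conf:
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")                 # redundant if spark.master yarn is already set
         .appName("yarn-smoke-test")
         .getOrCreate())

# tiny job: the sum of 0..99 should print 4950
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()
```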
118 | #### 2.4.2. Spark Standalone Cluster Manager
119 |
120 | On Spark Master node, copy the spark directory to all slave nodes
121 |
122 | `scp -r /opt/spark hadoop@node2:/opt/spark`
123 |
124 | In spark/conf directory, make a copy of slaves.template and name it as slaves
125 | `cp slaves.template slaves`
126 |
127 | Edit the slaves file
128 | `nano slaves`
129 |
130 | Add the hostnames of the slave nodes to the file.
131 | ```
132 | slaves1 # hostname of node2
133 | slaves2 # hostname of node3
134 | ```
135 |
136 | **Start Spark Standalone Cluster mode**
137 |
138 | Execute the following line
139 | `$SPARK_HOME/sbin/start-all.sh`
140 |
141 | Run jps to check if Spark is running on the Master and Slave Node
142 | `jps`
143 |
144 | To stop Spark Standalone Cluster mode, do
145 | `$SPARK_HOME/sbin/stop-all.sh`
146 |
147 | ### 2.5. Spark Monitor UI
148 |
149 | **Spark Standalone Cluster mode Web UI**
150 | `master:8080`
151 |
152 | **Spark Application Web UI**
153 | `master:4040`
154 |
155 | **Master IP and port for spark-submit**
156 | `master:7077`
157 |
158 | ### Submit jobs to Spark Submit
159 |
160 | **Example**
161 | ```
162 | spark-submit --deploy-mode client \
163 | --class org.apache.spark.examples.SparkPi \
164 | $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 10
165 | ```
166 |
167 | **Run Mobile App Schemaless Consumer**
168 | ```
169 | $SPARK_HOME/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/kafka_2_hadoop/Schemaless_Structure_Streaming.py
170 | ```
171 |
172 | **Run HDFS dump data**
173 | ```
174 | $SPARK_HOME/bin/spark-submit --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/dump/hdfs_sstreaming_dump_1_1.py
175 | ```
176 |
177 | **Run Kafka dump data**
178 | ```
179 | $SPARK_HOME/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/dump/kafka_sstreaming_dump_1_1.py
180 | ```
181 |
182 | **Run Schema Inferno**
183 | ```
184 | $SPARK_HOME/bin/spark-submit --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/schema_inferno/Schema_Saver.py
185 | ```
186 |
187 | #### Note
188 |
189 | The Kafka dependency might not work with the Scala 2.12 build.
190 | Downgrade to 2.11: org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5 => org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5
191 |
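For context, the dump jobs above read from Kafka with Structured Streaming; a minimal sketch of such a job (broker address and topic name are only illustrative), using the spark-sql-kafka package configured earlier:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# illustrative broker and topic; replace with your own
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "192.168.1.2:9092")
      .option("subscribe", "my-topic")
      .load())

query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()
```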
192 | ### Spark Application Structure
193 |
194 | **Python**
195 | ```
196 | import os
197 | from pyspark import SparkContext, SparkConf
198 | from pyspark.sql import SparkSession, SQLContext
199 | from pyspark.sql import functions as F
200 | from pyspark.sql.types import *
201 |
202 |
203 | def save_schema(path):
204 |     print(path)  # placeholder: persist the inferred schema for the given path here
205 |
206 | if __name__ == '__main__':
207 | # Setup Spark Config
208 | conf = SparkConf()
209 | conf.set("spark.executor.memory", "2g")
210 | conf.set("spark.executor.cores", "2")
211 | conf.set("spark.cores.max", "8")
212 | conf.set("spark.driver.extraClassPath", "/opt/oracle-client/instantclient_19_6/ojdbc8.jar")
213 | conf.set("spark.executor.extraClassPath", "/opt/oracle-client/instantclient_19_6/ojdbc8.jar")
214 | sc = SparkContext(master="spark://10.56.237.195:7077", appName="Schema Inferno", conf=conf)
215 | spark = SparkSession(sc)
216 | sqlContext = SQLContext(sc)
217 |
218 | # Generate Schema
219 | path = 'hdfs://master:9000/home/odyssey/data/raw_hdfs'
220 | temp_df = spark.read.json(f'{path}/*/part-00000')
221 | temp_df.printSchema()
222 | print("Number of records: ", temp_df.count())
223 | print("Number of distinct records: ", temp_df.distinct().count())
224 | #print(temp_df.filter('payload is null').count())
225 | save_schema(path)
226 | ```
227 |
228 | ### Log Print Control
229 |
230 | Create the log4j file from the template in spark/conf directory
231 | `cp conf/log4j.properties.template conf/log4j.properties`
232 |
233 | Replace this line
234 | `log4j.rootCategory=INFO, console`
235 | with this
236 | `log4j.rootCategory=WARN, console`
237 |
--------------------------------------------------------------------------------
/Hadoop Cluster Setup.md:
--------------------------------------------------------------------------------
1 | # Concept
2 | Set up Hadoop on one node, then replicate to the others.
3 |
4 | ## **1. Prerequisites**
5 | The following must be done on all nodes in the cluster, including installation of Java, SSH, user creation and other software utilities.
6 |
7 | ### 1.1. Configure hosts and hostname on each node
8 | Here we will edit the /etc/hosts and /etc/hostname files, so that we can use hostnames instead of IPs every time we wish to use or ping any of these servers.
9 | * **Change hostname**
10 | `sudo nano /etc/hostname`
11 | Set your hostname to a relevant name (node1, node2, node3, etc.)
12 |
13 | * **Change your hosts file**
14 | `sudo nano /etc/hosts`
15 | Add the following lines in the format `IP hostname`
16 | ```
17 | 192.168.0.1 node1
18 | 192.168.0.2 node2
19 | 192.168.0.3 node3
20 | ```
21 |
22 | **Notice: Remember to delete/comment the following line if exists**
23 | > ```# 127.0.1.1 node1 ```
24 |
25 | ### 1.2. Install OpenSSH
26 | `sudo apt install openssh-client`
27 | `sudo apt install openssh-server`
28 |
29 | ### 1.3. Install Java and config Java Environment Variable
30 | Here we use JDK 8, as it is still the most stable and widely supported version.
31 |
32 | * **For Oracle Java**
33 | `sudo add-apt-repository ppa:webupd8team/java`
34 | `sudo apt update`
35 | `sudo apt install oracle-java8-installer`
36 |
37 | * **For OpenJDK Java**
38 | `sudo apt install openjdk-8-jdk`
39 |
40 | To verify the java version you can use the following command:
41 | `java -version`
42 |
43 | Set Java Environment Variable
44 | * **Locate where Java is installed**
45 | `update-alternatives --config java`
46 | The install path should be like this
47 | > /usr/lib/jvm/java-8-openjdk-amd64/
48 | * **Add the JAVA_HOME variable to the .bashrc file:**
49 | `nano ~/.bashrc`
50 | `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/`
51 |
52 | ### 1.4. Create dedicated user and group for Hadoop
53 | We will use a dedicated Hadoop user account for running Hadoop applications. While that’s not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (security, permissions, backups, etc.).
54 | * **Create Hadoop group and Hadoop user**
55 | `sudo addgroup hadoopgroup`
56 | `sudo adduser --ingroup hadoopgroup hadoop`
57 | `sudo adduser hadoop sudo`
58 |
59 | After this step we will only work on **hadoop** user. You can change your user by:
60 | `su - hadoop`
61 |
62 | ### 1.5. SSH Configuration
63 | Hadoop requires SSH access to manage its different nodes, i.e. remote machines plus your local machine.
64 | * **Generate SSH key value pair**
65 | `ssh-keygen -t rsa`
66 | `cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys`
67 | `sudo chmod 0600 ~/.ssh/authorized_keys`
68 |
69 | To check whether your SSH works, run:
70 | `ssh localhost`
71 |
72 | * **Config Passwordless SSH**
73 | From each node, copy the ssh public key to others
74 | `ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node2`
75 | `ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node3`
76 |
77 | To check your passwordless SSH, try:
78 | `ssh hadoop@node2`
79 |
80 | ## **2. Download and Configure Hadoop**
81 | In this article, we will install Hadoop on three machines
82 |
83 | | IP | Host Name | Namenode | Datanode |
84 | | ------ | ------ | ------ | ------ |
85 | | 192.168.0.1 | node1 | Yes | No |
86 | | 192.168.0.2 | node2 | No | Yes |
87 | | 192.168.0.3 | node3 | No | Yes |
88 |
89 | ### 2.1. Download and setup hadoop
90 | First, the following directories need to be created in the /opt/ directory
91 | ```
92 | /opt/
93 | |-- hadoop
94 | | |-- logs
95 | |-- hdfs
96 | | |-- datanode (if act as datanode)
97 | | |-- namenode (if act as namenode)
98 | |-- mr-history (if act as namenode)
99 | | |-- done
100 | | |-- tmp
101 | |-- yarn (if act as namenode)
102 | | |-- local
103 | | |-- logs
104 | ```
105 |
106 | Go to the /home/hadoop directory
107 | `cd ~`
108 |
109 | Download the installation Hadoop package from its website: [https://hadoop.apache.org/releases.html](https://hadoop.apache.org/releases.html)
110 | `wget -c -O hadoop.tar.gz http://mirrors.viethosting.com/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz`
111 |
112 | Extract the file
113 | `sudo tar -xvf hadoop.tar.gz --directory=/opt/hadoop --strip 1`
114 |
115 |
116 | Assign permission for user hadoop on these folders
117 | `sudo chown -R hadoop /opt/hadoop`
118 | `sudo chown -R hadoop /opt/hdfs`
119 | `sudo chown -R hadoop /opt/yarn`
120 | `sudo chown -R hadoop /opt/mr-history`
121 |
122 |
123 | ### 2.2. Configuration for Namenode
124 | Inside the /hadoopx.y.z/etc/hadoop/ directory, edit the following files: **core-site.xml**, **hdfs-site.xml**, **yarn-site.xml**, **mapred-site.xml**.
125 |
126 | **On Namenode Server**
127 | * *core-site.xml*
128 | ```
129 | <configuration>
130 |   <property>
131 |     <name>fs.defaultFS</name>
132 |     <value>hdfs://192.168.0.1:9000/</value>
133 |     <description>NameNode URI</description>
134 |   </property>
135 |
136 |   <property>
137 |     <name>io.file.buffer.size</name>
138 |     <value>131072</value>
139 |     <description>Buffer size</description>
140 |   </property>
141 | </configuration>
142 | ```
143 | * *hdfs-site.xml*
144 | ```
145 | <configuration>
146 |   <property>
147 |     <name>dfs.namenode.name.dir</name>
148 |     <value>file:///opt/hdfs/namenode</value>
149 |     <description>NameNode directory for namespace and transaction logs storage.</description>
150 |   </property>
151 |
152 |   <property>
153 |     <name>fs.checkpoint.dir</name>
154 |     <value>file:///opt/hdfs/secnamenode</value>
155 |     <description>Secondary Namenode Directory</description>
156 |   </property>
157 |
158 |   <property>
159 |     <name>fs.checkpoint.edits.dir</name>
160 |     <value>file:///opt/hdfs/secnamenode</value>
161 |   </property>
162 |
163 |   <property>
164 |     <name>dfs.replication</name>
165 |     <value>2</value>
166 |     <description>Number of replication</description>
167 |   </property>
168 |
169 |   <property>
170 |     <name>dfs.permissions</name>
171 |     <value>false</value>
172 |   </property>
173 |
174 |   <property>
175 |     <name>dfs.datanode.use.datanode.hostname</name>
176 |     <value>false</value>
177 |   </property>
178 |
179 |   <property>
180 |     <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
181 |     <value>false</value>
182 |   </property>
183 | </configuration>
184 | ```
185 | * *yarn-site.xml*
186 | ```
187 | <configuration>
188 |   <property>
189 |     <name>yarn.resourcemanager.hostname</name>
190 |     <value>192.168.0.1</value>
191 |     <description>IP of hostname for Yarn Resource Manager Service</description>
192 |   </property>
193 |
194 |   <property>
195 |     <name>yarn.nodemanager.aux-services</name>
196 |     <value>mapreduce_shuffle</value>
197 |     <description>Yarn Node Manager Aux Service</description>
198 |   </property>
199 |
200 |   <property>
201 |     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
202 |     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
203 |   </property>
204 |
205 |   <property>
206 |     <name>yarn.nodemanager.local-dirs</name>
207 |     <value>file:///opt/yarn/local</value>
208 |   </property>
209 |
210 |   <property>
211 |     <name>yarn.nodemanager.log-dirs</name>
212 |     <value>file:///opt/yarn/logs</value>
213 |   </property>
214 | </configuration>
215 | ```
216 | * *mapred-site.xml*
217 | ```
218 | <configuration>
219 |   <property>
220 |     <name>mapreduce.framework.name</name>
221 |     <value>yarn</value>
222 |     <description>MapReduce framework name</description>
223 |   </property>
224 |
225 |   <property>
226 |     <name>mapreduce.jobhistory.address</name>
227 |     <value>192.168.0.1:10020</value>
228 |     <description>Default port is 10020.</description>
229 |   </property>
230 |
231 |   <property>
232 |     <name>mapreduce.jobhistory.webapp.address</name>
233 |     <value>192.168.0.1:19888</value>
234 |     <description>MapReduce JobHistory WebUI URL</description>
235 |   </property>
236 |
237 |   <property>
238 |     <name>mapreduce.jobhistory.intermediate-done-dir</name>
239 |     <value>/opt/mr-history/tmp</value>
240 |     <description>Directory where history files are written by MapReduce jobs.</description>
241 |   </property>
242 |
243 |   <property>
244 |     <name>mapreduce.jobhistory.done-dir</name>
245 |     <value>/opt/mr-history/done</value>
246 |     <description>Directory where history files are managed by the MR JobHistory Server.</description>
247 |   </property>
248 | </configuration>
249 | ```
250 |
251 | * **workers**
252 | Add the datanodes' IP to the file
253 | ```
254 | 192.168.0.2
255 | 192.168.0.3
256 | ```
257 |
258 | ### 2.3. Configuration for Datanode
259 | Inside the /hadoopx.y.z/etc/hadoop/ directory, edit the following files: **core-site.xml**, **hdfs-site.xml**, **yarn-site.xml**, **mapred-site.xml**.
260 |
261 | **On Datanode Server**
262 | * *core-site.xml*
263 | ```
264 | <configuration>
265 |   <property>
266 |     <name>fs.defaultFS</name>
267 |     <value>hdfs://192.168.0.1:9000/</value>
268 |     <description>NameNode URI</description>
269 |   </property>
270 | </configuration>
271 | ```
272 | * *hdfs-site.xml*
273 | ```
274 | <configuration>
275 |   <property>
276 |     <name>dfs.datanode.data.dir</name>
277 |     <value>file:///opt/hdfs/datanode</value>
278 |     <description>DataNode directory for namespace and transaction logs storage.</description>
279 |   </property>
280 |
281 |   <property>
282 |     <name>dfs.replication</name>
283 |     <value>2</value>
284 |     <description>Number of replication</description>
285 |   </property>
286 |
287 |   <property>
288 |     <name>dfs.permissions</name>
289 |     <value>false</value>
290 |   </property>
291 |
292 |   <property>
293 |     <name>dfs.datanode.use.datanode.hostname</name>
294 |     <value>false</value>
295 |   </property>
296 |
297 |   <property>
298 |     <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
299 |     <value>false</value>
300 |   </property>
301 | </configuration>
302 | ```
303 | * *yarn-site.xml*
304 | ```
305 | <configuration>
306 |   <property>
307 |     <name>yarn.resourcemanager.hostname</name>
308 |     <value>192.168.0.1</value>
309 |     <description>IP of hostname for Yarn Resource Manager Service</description>
310 |   </property>
311 |
312 |   <property>
313 |     <name>yarn.nodemanager.aux-services</name>
314 |     <value>mapreduce_shuffle</value>
315 |     <description>Yarn Node Manager Aux Service</description>
316 |   </property>
317 | </configuration>
318 | ```
319 | * *mapred-site.xml*
320 | ```
321 | <configuration>
322 |   <property>
323 |     <name>mapreduce.framework.name</name>
324 |     <value>yarn</value>
325 |     <description>MapReduce framework name</description>
326 |   </property>
327 | </configuration>
328 | ```
329 |
330 | For more configuration information, see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
331 |
332 | ### 2.4. Configure Hadoop Environment Variables
333 | Add the following lines to the **.bashrc** file
334 | ```
335 | export HADOOP_HOME=/opt/hadoop
336 | export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
337 | export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
338 | export HDFS_NAMENODE_USER=hadoop
339 | export HDFS_DATANODE_USER=hadoop
340 | export HDFS_SECONDARYNAMENODE_USER=hadoop
341 | export HADOOP_MAPRED_HOME=/opt/hadoop
342 | export HADOOP_COMMON_HOME=/opt/hadoop
343 | export HADOOP_HDFS_HOME=/opt/hadoop
344 | export YARN_HOME=/opt/hadoop
345 | ```
346 |
347 | Also add the following lines to the /opt/hadoop/etc/hadoop/hadoop-env.sh file
348 | ```
349 | export HDFS_NAMENODE_USER=hadoop
350 | export HDFS_DATANODE_USER=hadoop
351 | export HDFS_SECONDARYNAMENODE_USER=hadoop
352 | export YARN_RESOURCEMANAGER_USER=hadoop
353 | export YARN_NODEMANAGER_USER=hadoop
354 | export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
355 | export HADOOP_HOME=/opt/hadoop
356 | export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
357 | export HADOOP_LOG_DIR=/opt/hadoop/logs
358 | ```
359 |
360 | ## 3. Start HDFS, Yarn and monitor services on browser
361 | ### 3.1. Format the namenode, start Hadoop basic services
362 | To format the namenode, type
363 | `hdfs namenode -format`
364 |
365 | Start the hdfs service
366 | `start-dfs.sh`
367 |
368 | The output should look like this
369 | ```
370 | Starting namenodes on [hadoop-namenode]
371 | Starting datanodes
372 | Starting secondary namenodes [hadoop-namenode]
373 | ```
374 |
375 | To start yarn service
376 | `start-yarn.sh`
377 | ```
378 | Starting resourcemanager
379 | Starting nodemanagers
380 | ```
381 |
382 | Start MapReduce JobHistory as *daemon*
383 | `$HADOOP_HOME/bin/mapred --daemon start historyserver`
384 |
385 | To check whether the services started successfully, run *jps* to list them
386 |
387 | * On Namenode
388 | ```
389 | 16488 NameNode
390 | 16622 JobHistoryServer
391 | 17087 ResourceManager
392 | 17530 Jps
393 | 16829 SecondaryNameNode
394 | ```
395 |
396 | * On Datanode
397 | ```
398 | 2306 DataNode
399 | 2479 NodeManager
400 | 2581 Jps
401 | ```
402 |
403 | ### 3.2. Monitor the services on browser
404 |
405 | For Namenode of Hadoop 3.x.x
406 | `http://IP:9870`
407 | For Namenode of Hadoop 2.x.x
408 | `http://IP:50070`
409 |
410 | For Yarn
411 | `http://IP:8088`
412 |
413 | For MapReduce Job History
414 | `http://IP:19888`
415 |
416 | ## 4. Run your first HDFS command & Yarn Job
417 | ### 4.1. Put and Get Data to HDFS
418 | Create a books directory in HDFS
419 | `hdfs dfs -mkdir /books`
420 |
421 | Grab a few books from the Gutenberg project
422 | `cd ~`
423 |
424 | ```
425 | wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
426 | wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
427 | wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt
428 | ```
429 |
430 | Then put the three books into HDFS, in the books directory
431 | `hdfs dfs -put alice.txt holmes.txt frankenstein.txt /books`
432 |
433 | List the contents of the books directory
434 | `hdfs dfs -ls /books`
435 |
436 | Copy one of the books to the local filesystem
437 | `hdfs dfs -get /books/alice.txt`
438 |
439 | ### 4.2. Submit MapReduce Jobs to YARN
440 | YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar. The Hadoop installation package provides sample applications that can be run to test your cluster. You’ll use them to run a word count on the three books previously uploaded to HDFS.
441 |
442 | Submit a job with the sample jar to YARN. On the Namenode (node1), run
443 | `yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "/books/*" output`
444 |
445 | After the job is finished, you can get the result by querying HDFS with hdfs dfs -ls output. In case of a success, the output will resemble:
446 | ```
447 | Found 2 items
448 | -rw-r--r-- 2 hadoop supergroup 0 2019-05-31 17:21 output/_SUCCESS
449 | -rw-r--r-- 2 hadoop supergroup 789726 2019-05-31 17:21 output/part-r-00000
450 | ```
451 |
452 | Print the result with:
453 | `hdfs dfs -cat output/part-r-00000 | less`
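Once Spark on YARN is set up (see the Spark setup document), roughly the same word count can be expressed in PySpark; a minimal sketch, with hdfs://node1:9000 standing in for the fs.defaultFS configured above:
```
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-wordcount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs://node1:9000/books/*")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(add))

counts.saveAsTextFile("hdfs://node1:9000/books-wordcount")
spark.stop()
```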
454 |
--------------------------------------------------------------------------------
/Hive Setup with Spark Exection and Spark HiveContext.md:
--------------------------------------------------------------------------------
1 | # 1. Prerequisite
2 |
3 | Below are the items that need to be installed/set up before installing Hive
4 | - Passwordless SSH
5 | - Java Installation and JAVA_HOME declaration in .bashrc
6 | - Hostname/hosts config
7 | - Hadoop Cluster Setup, with start-dfs.sh and start-yarn.sh executed
8 | - Spark on Yarn
9 |
10 | # 2. Download and Install Apache Hive
11 |
12 | On the Apache Hive download page [https://downloads.apache.org/hive/](https://downloads.apache.org/hive/), choose the Hive release version that you would like to install.
13 |
14 | Release version: Hive currently has 2 stable versions
15 | - Hive 2.x: This release works with Hadoop 2.x.y
16 | - Hive 3.x: This release works with Hadoop 3.x.y
17 |
18 | In this document, we will use **Hive 2.3.7 prebuilt version**
19 |
20 | ## 2.1. Download Apache Hive
21 |
22 | **On Master node (or Namenode)**
23 | The rest of this document should only be executed on the Master node (Namenode).
24 |
25 | Switch to hadoop user
26 | `su hadoop`
27 |
28 | In /opt/, create Hive directory and grant sudo permission
29 | `cd /opt/`
30 | `sudo mkdir hive`
31 | `sudo chown -R hadoop /opt/hive`
32 |
33 | Download Apache Hive to /home/hadoop
34 | `cd ~`
35 | `wget -O hive.tgz https://downloads.apache.org/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz`
36 |
37 | Untar the file into the /opt/hive directory
38 | `tar -zxf hive.tgz --directory=/opt/hive --strip=1`
39 |
40 | ## 2.2. Setup Hive Environment Variable
41 |
42 | Edit the .bashrc file
43 | `nano ~/.bashrc`
44 |
45 | Add the following line to the **.bashrc** file
46 | ```
47 | # Hive Environment Configuration
48 | export HIVE_HOME=/opt/hive
49 | export PATH=$HIVE_HOME/bin:$PATH
50 | ```
51 | Hive also uses Hadoop, so you must have Hadoop in your path or your **.bashrc** must contain
52 | `export HADOOP_HOME=`
53 |
54 | ## 2.3. Create Hive working directory on HDFS
55 |
56 | In addition, you must use the HDFS commands below to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and chmod them g+w before you can create a table in Hive.
57 |
58 | `hdfs dfs -mkdir /tmp`
59 | `hdfs dfs -mkdir -p /user/hive/warehouse`
60 | `hdfs dfs -chmod g+w /tmp`
61 | `hdfs dfs -chmod g+w /user/hive/warehouse`
62 |
63 |
64 | ## 2.4. Setup Hive Metastore and HiveServer2
65 |
66 | ### 2.4.1. Configuring a Remote PostgreSQL Database for the Hive Metastore
67 |
68 | In order to use the Hive Metastore, a database is required to store the Hive metadata. Though Hive provides an embedded database (Apache Derby), that mode should only be used for experimental purposes. Here we will set up a remote PostgreSQL database to run the Hive Metastore.
69 |
70 | Before you can run the Hive metastore with a remote PostgreSQL database, you must configure a connector to the remote PostgreSQL database, set up the initial database schema, and configure the PostgreSQL user account for the Hive user.
71 |
72 | **Install and start PostgreSQL**
73 | Run the following to install PostgreSQL Database
74 | `sudo apt-get install postgresql`
75 |
76 | To ensure that your PostgreSQL server will be accessible over the network, you need to do some additional configuration.
77 |
78 | First you need to edit the `postgresql.conf` file. Set the `listen_addresses` property to `*`, to make sure that the PostgreSQL server starts listening on all your network interfaces. Also make sure that the `standard_conforming_strings` property is set to `off`.
79 |
80 | `sudo nano /etc/postgresql/10/main/postgresql.conf`
81 |
82 | Adjust the required properties
83 |
84 | ```
85 | listen_addresses = '*'
86 | standard_conforming_strings off
87 | ```
88 |
89 | You also need to configure authentication for your network in `pg_hba.conf`. You need to make sure that the PostgreSQL user that you will create later in this procedure will have access to the server from a remote host. To do this, add a new line into `pg_hba.conf` that has the following information:
90 | `sudo nano /etc/postgresql/10/main/pg_hba.conf`
91 | Add the following line to the file
92 | ```
93 | host all all 0.0.0.0 0.0.0.0 md5
94 | ```
95 |
96 | If the default pg_hba.conf file contains the following line:
97 | ```
98 | host all all 127.0.0.1/32 ident
99 | ```
100 | then the host line specifying md5 authentication shown above must be inserted before this ident line. Failing to do so might lead to the following error
101 | ```
102 | SLF4J: Class path contains multiple SLF4J bindings.
103 | SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
104 | SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
105 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
106 | SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
107 | Metastore connection URL: jdbc:postgresql://localhost:5432/metastore
108 | Metastore Connection Driver : org.postgresql.Driver
109 | Metastore connection User: hive
110 | org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
111 | Underlying cause: org.postgresql.util.PSQLException : FATAL: Ident authentication failed for user "hive"
112 | SQL Error code: 0
113 | Use --verbose for detailed stacktrace.
114 | *** schemaTool failed ***
115 | ```
116 |
117 |
118 | After all is done, start PostgreSQL Server
119 | `sudo service postgresql start`
120 |
121 |
122 | **Install the PostgreSQL JDBC driver**
123 | On the client, a JDBC driver is required to connect to the PostgreSQL database. Install the JDBC driver and copy or link it into the hive/lib folder.
124 | `sudo apt-get install libpostgresql-jdbc-java`
125 | `ln -s /usr/share/java/postgresql-jdbc4.jar /opt/hive/lib/postgresql-jdbc4.jar`
126 |
127 | **Create the metastore database and user account**
128 | Log in as the postgres user and start psql
129 | `sudo -u postgres psql`
130 |
131 | Add dedicated User and Database for Hive Metastore
132 |
133 | ```
134 | postgres=# CREATE USER hive WITH PASSWORD '123123';
135 | postgres=# CREATE DATABASE metastore;
136 | ```
137 |
138 | ### 2.4.2. Adjust the Hive and Hadoop config files for the Hive Metastore
139 | In /opt/hive/conf, create the config file hive-site.xml (optionally starting from hive-default.xml.template)
140 | `nano hive-site.xml`
141 |
142 | Add the following lines
143 | ```
144 | <configuration>
145 |   <property>
146 |     <name>javax.jdo.option.ConnectionURL</name>
147 |     <value>jdbc:postgresql://localhost:5432/metastore</value>
148 |   </property>
149 |
150 |   <property>
151 |     <name>javax.jdo.option.ConnectionDriverName</name>
152 |     <value>org.postgresql.Driver</value>
153 |   </property>
154 |
155 |   <property>
156 |     <name>javax.jdo.option.ConnectionUserName</name>
157 |     <value>hive</value>
158 |   </property>
159 |
160 |   <property>
161 |     <name>javax.jdo.option.ConnectionPassword</name>
162 |     <value>123123</value>
163 |   </property>
164 |
165 |   <property>
166 |     <name>hive.metastore.warehouse.dir</name>
167 |     <value>hdfs://192.168.0.5:9000/user/hive/warehouse</value>
168 |   </property>
169 | </configuration>
170 | ```
171 |
172 | In order to avoid the issue `Cannot connect to hive using beeline, user root cannot impersonate anonymous` when connecting to HiveServer2 from `beeline`, add the following to core-site.xml in hadoop/etc/hadoop, where `[username]` is the local user that will be used in `beeline`:
173 | ```
174 | <property>
175 |   <name>hadoop.proxyuser.[username].groups</name>
176 |   <value>*</value>
177 | </property>
178 | <property>
179 |   <name>hadoop.proxyuser.[username].hosts</name>
180 |   <value>*</value>
181 | </property>
182 | ```
183 |
184 | ### 2.4.3. Start Hive Metastore Server & HiveServer2
185 |
186 | Use the Hive Schema Tool to create the metastore tables.
187 | `/opt/hive/bin/schematool -dbType postgres -initSchema`
188 |
189 |
190 | The output should be like
191 | ```
192 | Metastore connection URL: jdbc:postgresql://localhost:5432/metastore
193 | Metastore Connection Driver : org.postgresql.Driver
194 | Metastore connection User: hive
195 | Starting metastore schema initialization to 2.3.0
196 | Initialization script hive-schema-2.3.0.postgres.sql
197 | Initialization script completed
198 | schemaTool completed
199 | ```
200 |
201 | Start the Hive Metastore Server and HiveServer2 services:
202 | `hive --service metastore`
203 | From another connection session, start HiveServer2
204 | `hiveserver2`
205 |
206 | Run Beeline (the HiveServer2 CLI).
207 | `beeline`
208 | Connect to HiveServer2
209 | `!connect jdbc:hive2://localhost:10000`
210 | Input any user and password, and the result should be
211 | ```
212 | Connecting to jdbc:hive2://localhost:10000
213 | Enter username for jdbc:hive2://localhost:10000: hadoop
214 | Enter password for jdbc:hive2://localhost:10000: ******
215 | Connected to: Apache Hive (version 2.3.7)
216 | Driver: Hive JDBC (version 2.3.7)
217 | Transaction isolation: TRANSACTION_REPEATABLE_READ
218 | ```
219 |
220 | From here, you can run your query on Hive
221 |
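Queries can also be sent to the same HiveServer2 endpoint programmatically, for example with the third-party PyHive package (an assumption, not part of this setup); a minimal sketch:
```
from pyhive import hive  # pip install 'pyhive[hive]'

conn = hive.Connection(host='localhost', port=10000, username='hadoop')
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
print(cursor.fetchall())
conn.close()
```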
222 | For monitoring, you can access the HiveServer2 WebUI on port 10002
223 | `http://localhost:10002`
224 |
225 | ## 2.5. Setup Hive Execution Engine to Spark
226 |
227 | While MR remains the default engine for historical reasons, it is itself a historical engine and is deprecated in the Hive 2 line. It may be removed without further warning. Therefore, choosing another execution engine is a wise decision. Here, we will configure Spark to be the Hive execution engine.
228 |
229 | First, add the following config to `hive-site.xml` in `/opt/hive/conf`
230 | ```
231 | <property>
232 |   <name>hive.execution.engine</name>
233 |   <value>spark</value>
234 |   <description>
235 |     Expects one of [mr, tez, spark]
236 |   </description>
237 | </property>
238 | ```
239 |
240 | Then configure YARN (in yarn-site.xml) to distribute an equal share of resources to jobs in the YARN cluster:
241 | ```
242 | <property>
243 |   <name>yarn.resourcemanager.scheduler.class</name>
244 |   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
245 | </property>
246 | ```
247 |
248 | Next, add the Spark libs to Hive's class path as below.
249 |
250 | Edit the `/opt/hive/bin/hive` file (back up this file first in case anything goes wrong)
251 | `cp hive hive_backup`
252 | `nano hive`
253 |
254 | Add Spark Libs to Hive
255 | ```
256 | for f in ${SPARK_HOME}/jars/*.jar; do
257 | CLASSPATH=${CLASSPATH}:$f;
258 | done
259 | ```
260 |
261 | Finally, upload all jars in `$SPARK_HOME/jars` to an HDFS folder (for example hdfs://xxxx:9000/spark-jars):
262 | ```
263 | hdfs dfs -mkdir /spark-jars
264 | hdfs dfs -put /opt/spark/jars/* /spark-jars
265 | ```
266 |
267 | and add following in `hive-site.xml`
268 | ```
269 | <property>
270 |   <name>spark.yarn.jars</name>
271 |   <value>hdfs://xxxx:9000/spark-jars/*</value>
272 | </property>
273 | ```
274 |
275 | You may get below exception if you missed the CLASSPATH configuration above.
276 | ```Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable```
277 |
278 | Another solution could be considered, though the author of this document has not tried it successfully yet: [https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive)
279 |
280 | From now on, when running any query that triggers an execution, the output should look like:
281 |
282 | ```
283 | hive> insert into pokes values(2, 'hai');
284 | Query ID = hadoop_20200718125927_5e68f5c3-66b5-4459-a145-71b44c658939
285 | Total jobs = 1
286 | Launching Job 1 out of 1
287 | In order to change the average load for a reducer (in bytes):
288 | set hive.exec.reducers.bytes.per.reducer=
289 | In order to limit the maximum number of reducers:
290 | set hive.exec.reducers.max=
291 | In order to set a constant number of reducers:
292 | set mapreduce.job.reduces=
293 | Starting Spark Job = 84d25db4-ea71-4cf1-8e17-d9018390d005
294 | Running with YARN Application = application_1595048642632_0006
295 | Kill Command = /opt/hadoop/bin/yarn application -kill application_1595048642632_0006
296 |
297 | Query Hive on Spark job[0] stages: [0]
298 |
299 | Status: Running (Hive on Spark job[0])
300 | --------------------------------------------------------------------------------------
301 | STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
302 | --------------------------------------------------------------------------------------
303 | Stage-0 ........ 0 FINISHED 1 1 0 0 0
304 | --------------------------------------------------------------------------------------
305 | STAGES: 01/01 [==========================>>] 100% ELAPSED TIME: 7.11 s
306 | --------------------------------------------------------------------------------------
307 | Status: Finished successfully in 7.11 seconds
308 | Loading data to table default.pokes
309 | OK
310 | Time taken: 30.264 seconds
311 | ```
312 |
313 |
314 |
315 | ## 2.6. Connecting Apache Spark to Apache Hive
316 | ### 2.6.1. Config hive-site.xml and spark-defaults.conf
317 | Create `/opt/spark/conf/hive-site.xml` and define `hive.metastore.uris` configuration property (that is the thrift URL of the Hive Metastore Server).
318 |
319 | ```
320 | <configuration>
321 |   <property>
322 |     <name>hive.metastore.uris</name>
323 |     <value>thrift://localhost:9083</value>
324 |   </property>
325 | </configuration>
326 | ```
327 |
328 | Optionally, create `log4j.properties` from `log4j.properties.template` and add the following for more low-level Hive logging:
329 |
330 | `cp log4j.properties.template log4j.properties`
331 |
332 | ```
333 | log4j.logger.org.apache.spark.sql.hive.HiveUtils$=ALL
334 | log4j.logger.org.apache.spark.sql.internal.SharedState=ALL
335 | log4j.logger.org.apache.spark.sql.hive.client.HiveClientImpl=ALL
336 | ```
337 |
338 | The following config should also be set in the `/opt/spark/conf/spark-defaults.conf` file
339 | ```
340 | spark.master yarn
341 | spark.serializer org.apache.spark.serializer.KryoSerializer
342 | spark.eventLog.enabled true
343 | spark.eventLog.dir hdfs://192.168.0.5:9000/spark-logs
344 | spark.driver.memory 512m
345 | spark.yarn.am.memory 512m
346 | spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
347 | spark.history.fs.logDirectory hdfs://192.168.0.5:9000/spark-logs
348 | spark.history.fs.update.interval 10s
349 | spark.history.ui.port 18080
350 |
351 | # Spark Hive Configurations
352 | spark.sql.catalogImplementation hive
353 | ```
354 |
355 |
356 |
357 | ### 2.6.2. Start your PySpark Hive
358 |
359 | Note that since Hive has a large number of dependencies, **these dependencies are not included in the default Spark distribution**. If **Hive dependencies can be found on the classpath**, Spark will load them automatically. These Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
360 |
361 | By adding the dependencies in the code, PySpark will automatically download them from the internet. The dependencies include:
362 |
363 | `org.apache.spark:spark-hive_2.11:2.4.6` (The 2.12 version might not work)
364 | `org.apache.avro:avro-mapred:1.8.2`
365 |
366 | From any IDE, execute the following Python script:
367 | ```
368 | import findspark
369 | findspark.init()
370 | findspark.find()
371 | import os
372 | from pyspark.sql import SparkSession
373 | from pyspark.sql import Row
374 |
375 | submit_args = '--packages org.apache.spark:spark-hive_2.11:2.4.6,org.apache.avro:avro-mapred:1.8.2 pyspark-shell'
376 | if 'PYSPARK_SUBMIT_ARGS' not in os.environ:
377 | os.environ['PYSPARK_SUBMIT_ARGS'] = submit_args
378 | else:
379 | os.environ['PYSPARK_SUBMIT_ARGS'] += submit_args
380 |
381 | # warehouse_location points to the default location for managed databases and tables
382 | warehouse_location = 'hdfs://192.168.0.5:9000/user/hive/warehouse'
383 |
384 | spark = SparkSession \
385 | .builder \
386 | .appName("Python Spark SQL Hive integration example") \
387 | .config("spark.sql.warehouse.dir", warehouse_location) \
388 | .enableHiveSupport() \
389 | .getOrCreate()
390 | ```
391 |
392 | Then, hopefully, you can query Hive tables. You can check the SparkSession config with
393 |
394 | `spark.sparkContext.getConf().getAll()`
395 |
396 | Entries like the following should appear:
397 | ```
398 | ('spark.sql.warehouse.dir', 'hdfs://192.168.0.5:9000/user/hive/warehouse'),
399 | ('spark.sql.catalogImplementation', 'hive')
400 | ```
401 |
402 | ## 2.7. Spark SQL on Hive Tables
403 |
404 | Apache Spark provides a mechanism to query Hive tables. Try the following Python scripts.
405 |
406 | **Select from Hive Table**
407 | ```
408 | # With pokes is a Hive table
409 | spark.sql('select * from pokes').show()
410 | ```
411 |
412 | **Write to Hive Table**
413 | ```
414 | df = spark.range(10).toDF('number')
415 | df.registerTempTable('number')
416 | spark.sql('create table number as select * from number')
417 | ```
418 |
419 | You can check whether the `number` table is available using `hive` or `beeline`
420 | ```
421 | hive> show tables;
422 | OK
423 | number
424 | pokes
425 | values__tmp__table__1
426 | Time taken: 0.039 seconds, Fetched: 3 row(s)
427 |
428 | hive> select * from number;
429 | OK
430 | 0
431 | 1
432 | 2
433 | 3
434 | 4
435 | 5
436 | 6
437 | 7
438 | 8
439 | 9
440 | Time taken: 0.135 seconds, Fetched: 10 row(s)
441 | ```
442 |
443 |
447 |
448 | # 3. References
449 | ## 3.1. PostgreSQL Setup and Config
450 | - https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cm_ig_extrnl_pstgrs.html#cmig_topic_5_6
451 | - https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cdh_ig_hive_metastore_configure.html
456 |
--------------------------------------------------------------------------------