├── README.md
├── Kafka Cluster Setup
├── YARN Configuration.md
├── Spark Thrift Setup
├── Install Python38 Ubuntu.md
├── Yarn Dask installation with Anaconda.md
├── Install Debezium Kafka Connector for PostgreSQL using wal2json.md
├── Spark Hadoop Free Cluster Setup.md
├── Hadoop Cluster Setup.md
└── Hive Setup with Spark Exection and Spark HiveContext.md
/README.md:
--------------------------------------------------------------------------------
1 | # Big-Data-Installation
--------------------------------------------------------------------------------
/Kafka Cluster Setup:
--------------------------------------------------------------------------------
1 | https://blog.clairvoyantsoft.com/kafka-series-3-creating-3-node-kafka-cluster-on-virtual-box-87d5edc85594
2 |
--------------------------------------------------------------------------------
/YARN Configuration.md:
--------------------------------------------------------------------------------
1 | ## Config Resources ##
2 | **Default Configuration**
3 |
4 | **Node Manager Allocated Resources**
5 |
6 | **Container Allocated Resources**
7 |
8 | **Application Master Allocated Resources**
9 |
10 | ## Config Scheduler
11 | **Capacity Scheduler**
12 |
13 | **Fair Scheduler**
14 |
--------------------------------------------------------------------------------
/Spark Thrift Setup:
--------------------------------------------------------------------------------
1 | https://www.programmersought.com/article/75806728859/
2 | https://www.programmersought.com/article/34564251004/
3 | https://www.programmersought.com/article/47542974362/
4 | https://www.programmersought.com/article/46657462430/
5 | https://www.programmersought.com/article/49335838484/
6 | https://www.programmersought.com/article/45243571155/
7 | https://www.programmersought.com/article/55482089247/
8 | https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-thrift-server.html
9 |
--------------------------------------------------------------------------------
/Install Python38 Ubuntu.md:
--------------------------------------------------------------------------------
1 | # Install Python 3.8 on Ubuntu 18.04
2 | **Install Python3.8**
3 | `sudo apt install python3.8`
4 |
5 | **Set python3 to Python3.8**
6 | Add Python3.6 & Python 3.8 to update-alternatives
7 |
8 | - `sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1`
9 | - `sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2`
10 |
11 | Update Python 3 to point to Python 3.8
12 |
13 | `sudo update-alternatives --config python3`, then enter 2 to select Python 3.8
14 |
15 | Test the version of python
16 | ```
17 | python3 --version
18 | Python 3.8.0
19 | ```
20 |
21 | Install pip3
22 | `sudo apt install python3-pip`
23 |
24 | **Fix error: ImportError: No module named apt_pkg**
25 |
26 | - `cd /usr/lib/python3/dist-packages/`
27 | - `sudo ln -s apt_pkg.cpython-36m-x86_64-linux-gnu.so apt_pkg.so`
28 |
29 | **Ensure Python on all Spark worker nodes is updated to the same version**
30 | ```
31 | Exception: Python in worker has different version 3.6 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
32 | ```
33 |
34 | ```
35 | export PYSPARK_PYTHON=python3
36 | export PYSPARK_DRIVER_PYTHON=python3
37 | ```
38 |
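A quick way to confirm that the driver and the workers agree is to collect the interpreter version from each executor; a minimal sketch, assuming PySpark is installed and the cluster is reachable:
```
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-version-check").getOrCreate()
sc = spark.sparkContext

driver_version = sys.version_info[:2]
# run a small job so every executor reports the Python version it actually uses
worker_versions = set(
    sc.parallelize(range(100), 10)
      .map(lambda _: sys.version_info[:2])
      .collect()
)
print("driver:", driver_version, "workers:", worker_versions)
spark.stop()
```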
--------------------------------------------------------------------------------
/Yarn Dask installation with Anaconda.md:
--------------------------------------------------------------------------------
1 | ## 1. Install Anaconda
2 | Install Anaconda version 5.3.1, which supports Python 3.7.
3 | * **Download the Anaconda installation script**:
4 | ```
5 | wget https://repo.anaconda.com/archive/Anaconda3-5.3.1-Linux-x86_64.sh
6 | ```
7 | * **Run script file**:
8 | ```
9 | bash /home/hadoop/Anaconda3-5.3.1-Linux-x86_64.sh
10 | ```
11 | Follow the installer prompts. When done, activate Anaconda:
12 | ```
13 | source ~/.bashrc
14 | ```
15 | ## 2. Create Conda environment
16 | * **Create Conda environment**:
17 | ```
18 | conda create -n my-env python=3.7
19 | ```
20 | _You must specify the Python version; otherwise Conda will automatically install the latest version_
21 | * **Activate Conda environment**:
22 | ```
23 | conda activate my-env
24 | ```
25 | * **Package the environment**:
26 | Package the conda environment using conda-pack.
27 | Install conda-pack if it is not already installed:
28 | ```
29 | conda install -c conda-forge conda-pack
30 | ```
31 | Then package the current environment (activate first):
32 | ```
33 | conda-pack -o /path/to/save/my-env.tar.gz
34 | ```
35 | Replicate the packed environment to all nodes (remember to edit the _**.bashrc**_ file and put it in the same directory on every node)
36 |
37 | ## 3. Install dask-yarn and run the first Dask job
38 | * **Install dask-yarn with conda**:
39 | ```
40 | conda install -c conda-forge dask-yarn
41 | ```
42 | * **Submit a Python job with the dask-yarn CLI**:
43 | From the terminal, run:
44 | ```
45 | dask-yarn submit --environment python:///opt/anaconda/envs/analytics/bin/python --worker-count 8 --worker-vcores 2 --worker-memory 4GiB dask_submit_test.py
46 | ```
47 | _**Note:**_ The environment format is **_python:///path_to_env/bin/python_**. Get the path to the environment by executing:
48 | ```
49 | conda env list
50 | ```
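The submitted script (`dask_submit_test.py` above) is not included here; below is a minimal sketch of what such a script could look like (its real contents are an assumption). Inside a `dask-yarn submit` job, the script attaches to the YARN application created for it with `YarnCluster.from_current()`:
```
import pandas as pd
import dask.dataframe as dd
from dask_yarn import YarnCluster
from dask.distributed import Client

# attach to the application that `dask-yarn submit` created for this script
cluster = YarnCluster.from_current()
client = Client(cluster)

# trivial computation to prove the workers are reachable
ddf = dd.from_pandas(pd.DataFrame({'x': range(1000)}), npartitions=4)
print(ddf['x'].sum().compute())
```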
51 | * **Start YARN cluster**:
52 | To start a YARN cluster, create an instance of YarnCluster. This constructor takes several parameters; leave them empty to use the defaults defined in the configuration:
53 | ```
54 | from dask_yarn import YarnCluster
55 | import dask.dataframe as dd
56 |
57 | from dask.distributed import Client
58 |
59 | cluster = YarnCluster(environment='python:///opt/anaconda/envs/analytics/bin/python',
60 | worker_vcores=2,
61 | worker_memory="4GiB")
62 | ```
63 | By default no workers are started on cluster creation. To change the number of workers, use the YarnCluster.scale() method. When scaling up, new workers will be requested from YARN. When scaling down, workers will be intelligently selected and scaled down gracefully, freeing up resources:
64 | ```
65 | cluster.scale(4)
66 | ```
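The snippet above imports `Client` but does not use it; to make Dask computations actually run on the YARN workers, connect a `Client` to the cluster:
```
client = Client(cluster)
print(client)   # shows the scheduler address, dashboard link and current workers
```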
67 | Read csv file:
68 | ```
69 | data = dd.read_csv('hdfs://10.56.239.63:9000/user/dan/Global_Mobility_Report.csv',dtype={'sub_region_2': 'object'})
70 | ```
71 | This creates a Dask DataFrame; to show the first rows, use
72 | ```
73 | data.head()
74 | ```
75 | Convert to a pandas DataFrame with the **compute** method:
76 | ```
77 | df = data.compute()
78 | ```
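Note that `compute()` pulls the full result into the driver's memory. For large files it is usually better to aggregate first and only compute the reduced result; a small sketch (the column names are from the Global Mobility Report and are only illustrative):
```
# aggregate on the workers, then bring back only the small result
mean_by_country = (
    data.groupby('country_region')['retail_and_recreation_percent_change_from_baseline']
        .mean()
        .compute()
)
print(mean_by_country.head())
```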
79 |
80 |
81 |
--------------------------------------------------------------------------------
/Install Debezium Kafka Connector for PostgreSQL using wal2json.md:
--------------------------------------------------------------------------------
1 | # Install Debezium Connector for PostgreSQL using wal2json plugin
2 |
3 | ## Overview
4 | Debezium’s PostgreSQL Connector can monitor and record row-level changes in the schemas of a PostgreSQL database.
5 |
6 | The first time it connects to a PostgreSQL server/cluster, it reads a consistent snapshot of all of the schemas. When that snapshot is complete, the connector continuously streams the changes that were committed to PostgreSQL 9.6 or later and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.
7 |
8 | ## Install wal2json
9 |
10 | On Ubuntu:
11 |
12 | ```bash
13 | sudo apt install postgresql-9.6-wal2json
14 | ```
15 | Check that `wal2json.so` is present in `/usr/lib/postgresql/9.6`
16 |
17 | ## Configure PostgreSQL Server
18 |
19 | ### Setting up libraries, WAL and replication parameters
20 |
21 | ***Find and modify*** the following lines at the end of the `postgresql.conf` PostgreSQL configuration file in order to include the plug-in in the shared libraries and to adjust some **WAL** and **streaming replication** settings. Adjust `shared_preload_libraries` accordingly if other libraries are already listed there.
22 |
23 | ```
24 | listen_addresses = '*'
25 | shared_preload_libraries = 'wal2json'
26 | wal_level = logical
27 | max_wal_senders = 4
28 | max_replication_slots = 4
29 | ```
30 |
31 | 1. `max_wal_senders` tells the server that it should use a maximum of `4` separate processes for processing WAL changes
32 | 2. `max_replication_slots` tells the server that it should allow a maximum of `4` replication slots to be created for streaming WAL changes
33 |
34 | ### Setting up replication permissions
35 |
36 | In order to give a user replication permissions, define a PostgreSQL role that has _at least_ the `REPLICATION` and `LOGIN` permissions.
37 |
38 | Login to `postgres` user and run `psql`
39 | ```bash
40 | sudo -u postgres psql
41 | ```
42 | For example:
43 |
44 | ```sql
45 | CREATE ROLE datalake REPLICATION LOGIN;
46 | ```
47 | > However, Debezium needs further permissions to take the initial snapshot. I used the postgres superuser to work around this issue. You should test further to grant the right permissions to a dedicated role.
48 |
49 | Next, modify `pg_hba.conf` to accept remote connections from the Debezium Kafka connector
50 |
51 | ```
52 | host replication postgres 0.0.0.0/0 trust
53 | host replication postgres ::/0 trust
54 | host all all 192.168.1.2/32 trust
55 | ```
56 | > **Note:** Debezium is installed on 192.168.1.2
57 |
58 | ### Database Test Environment
59 |
60 | Back in the `psql` console, create the test database and table (run `\c test` after `CREATE DATABASE` so the table is created inside the new database):
61 |
62 | ```sql
63 | CREATE DATABASE test;
64 | CREATE TABLE test_table (
65 | id char(10) NOT NULL,
66 | code char(10),
67 | PRIMARY KEY (id)
68 | );
69 | ```
70 |
71 | - **Create a slot** named `test_slot` for the database named `test`, using the logical output plug-in `wal2json`
72 |
73 |
74 | ```bash
75 | $ pg_recvlogical -d test --slot test_slot --create-slot -P wal2json
76 | ```
77 |
78 | - **Begin streaming changes** from the logical replication slot `test_slot` for the database `test`
79 |
80 |
81 | ```bash
82 | $ pg_recvlogical -d test --slot test_slot --start -o pretty-print=1 -f -
83 | ```
84 |
85 | - **Perform some basic DML** operations on `test_table` to trigger `INSERT`/`UPDATE`/`DELETE` change events
86 |
87 |
88 | _Interactive PostgreSQL terminal, SQL commands_
89 |
90 | ```sql
91 | test=# INSERT INTO test_table (id, code) VALUES('id1', 'code1');
92 | INSERT 0 1
93 | test=# update test_table set code='code2' where id='id1';
94 | UPDATE 1
95 | test=# delete from test_table where id='id1';
96 | DELETE 1
97 | ```
98 |
99 | Upon the `INSERT`, `UPDATE` and `DELETE` events, the `wal2json` plug-in outputs the table changes as captured by `pg_recvlogical`.
100 |
101 | _Output for `INSERT` event_
102 |
103 | ```json
104 | {
105 | "change": [
106 | {
107 | "kind": "insert",
108 | "schema": "public",
109 | "table": "test_table",
110 | "columnnames": ["id", "code"],
111 | "columntypes": ["character(10)", "character(10)"],
112 | "columnvalues": ["id1 ", "code1 "]
113 | }
114 | ]
115 | }
116 | ```
117 |
118 | _Output for `UPDATE` event_
119 |
120 | ```json
121 | {
122 | "change": [
123 | {
124 | "kind": "update",
125 | "schema": "public",
126 | "table": "test_table",
127 | "columnnames": ["id", "code"],
128 | "columntypes": ["character(10)", "character(10)"],
129 | "columnvalues": ["id1 ", "code2 "],
130 | "oldkeys": {
131 | "keynames": ["id"],
132 | "keytypes": ["character(10)"],
133 | "keyvalues": ["id1 "]
134 | }
135 | }
136 | ]
137 | }
138 | ```
139 |
140 | _Output for `DELETE` event_
141 |
142 | ```json
143 | {
144 | "change": [
145 | {
146 | "kind": "delete",
147 | "schema": "public",
148 | "table": "test_table",
149 | "oldkeys": {
150 | "keynames": ["id"],
151 | "keytypes": ["character(10)"],
152 | "keyvalues": ["id1 "]
153 | }
154 | }
155 | ]
156 | }
157 | ```
158 |
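For reference, a small Python helper (hypothetical, not part of the setup) that turns a wal2json message like the ones above into `(kind, {column: value})` pairs:

```python
import json

def rows_from_wal2json(raw_message):
    """Yield (kind, {column: value}) pairs from a wal2json message."""
    for change in json.loads(raw_message).get("change", []):
        kind = change["kind"]
        if kind in ("insert", "update"):
            row = dict(zip(change["columnnames"], change["columnvalues"]))
        else:  # delete: only the old key columns are available
            oldkeys = change["oldkeys"]
            row = dict(zip(oldkeys["keynames"], oldkeys["keyvalues"]))
        yield kind, row
```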
159 | When the test is finished, the slot `test_slot` for the database `test` can be removed by the following command:
160 |
161 | ```bash
162 | $ pg_recvlogical -d test --slot test_slot --drop-slot
163 | ```
164 |
165 | ## Install Debezium Kafka Connector
166 |
167 | Download Debezium Connector:
168 |
169 | ```bash
170 | wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/1.2.0.Final/debezium-connector-postgres-1.2.0.Final-plugin.tar.gz
171 | ```
172 | Extract it to /opt/kafka/connect
173 |
174 | ```bash
175 | tar -xvf debezium-connector-postgres-1.2.0.Final-plugin.tar.gz --directory=/opt/kafka/connect
176 | ```
177 |
178 | Check that the folder `debezium-connector-postgres` is in `/opt/kafka/connect`
179 |
180 | ### Test connector
181 |
182 | Edit `config/connect-file-source.properties` as following:
183 |
184 | ```properties
185 | name=postgres-cdc-source
186 | connector.class=io.debezium.connector.postgresql.PostgresConnector
187 | #snapshot.mode=never
188 | tasks.max=1
189 | plugin.name=wal2json
190 | database.hostname=192.168.1.1
191 | database.port=5432
192 | database.user=postgres
193 | #database.password=postgres
194 | database.dbname=test
195 | # slot.name=test_slot
196 | database.server.name=fulfillment
197 | #table.whitelist=public.inventory
198 | ```
199 |
200 | Append kafka plugin folder to `plugin.path` in file `config/connect-standalone.properties`
201 |
202 | ```properties
203 | plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,/opt/kafka/connect
204 | ```
205 | Run kafka connector in standalone mode
206 |
207 | ```bash
208 | bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
209 | ```
210 |
211 | #### Topics Names
212 |
213 | The PostgreSQL connector writes events for all insert, update, and delete operations on a single table to a single Kafka topic. By default, the Kafka topic name is _serverName_._schemaName_._tableName_ (configuration in `connect-file-source.properties`) where _serverName_ is the logical name of the connector as specified with the `database.server.name` configuration property, _schemaName_ is the name of the database schema where the operation occurred, and _tableName_ is the name of the database table on which the operation occurred.
214 |
215 | For example, consider a PostgreSQL installation with a `postgres` database and an `inventory` schema that contains four tables: `products`, `products_on_hand`, `customers`, and `orders`. If the connector monitoring this database were given a logical server name of `fulfillment`, then the connector would produce events on these four Kafka topics:
216 |
217 | - `fulfillment.inventory.products`
218 |
219 | - `fulfillment.inventory.products_on_hand`
220 |
221 | - `fulfillment.inventory.customers`
222 |
223 | - `fulfillment.inventory.orders`
224 |
225 |
226 | If on the other hand the tables were not part of a specific schema but rather created in the default `public` PostgreSQL schema, then the name of the Kafka topics would be:
227 |
228 | - `fulfillment.public.products`
229 |
230 | - `fulfillment.public.products_on_hand`
231 |
232 | - `fulfillment.public.customers`
233 |
234 | - `fulfillment.public.orders`
235 |
236 | In this example, the topic's name is `fulfillment.public.test_table`
237 |
238 | Run:
239 |
240 | ```bash
241 | bin/kafka-topics.sh --list --bootstrap-server 192.168.1.2:9092
242 | ```
243 | You should see `fulfillment.public.test_table` in the output.
244 |
245 | Start receiving messages from Debezium:
246 |
247 | ```bash
248 | bin/kafka-console-consumer.sh --bootstrap-server 192.168.1.2:9092 --from-beginning --topic fulfillment.public.test_table
249 | ```
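The same topic can also be consumed programmatically, for example with the third-party `kafka-python` package (an assumption, not part of this setup); a minimal sketch:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "fulfillment.public.test_table",
    bootstrap_servers="192.168.1.2:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:        # tombstone records carry an empty value
        continue
    # with the default JSON converter the change data sits under "payload"
    payload = event.get("payload", event)
    print(payload.get("op"), payload.get("after"))
```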
250 |
--------------------------------------------------------------------------------
/Spark Hadoop Free Cluster Setup.md:
--------------------------------------------------------------------------------
1 | ## 1. Prerequisite
2 |
3 | Below are the items that need to be installed/set up before installing Spark
4 | - Passwordless SSH
5 | - Java Installation and JAVA_HOME declaration in .bashrc
6 | - Hadoop Cluster Setup, with start-dfs.sh and start-yarn.sh executed
7 | - Hostname/hosts config
8 |
9 | ## 2. Download and Install Spark
10 |
11 | On the Apache Spark download [website](https://spark.apache.org/downloads.html), choose the Spark release version and package type that you would like to install.
12 |
13 | Release version: Spark currently has 2 stable versions
14 | - Spark 2.x: widely supported by many tools, like Elasticsearch, etc.
15 | - Spark 3.x: introduces new features and performance optimizations
16 |
17 | Type of packages:
18 | - Spark with Hadoop: prebuilt with a specific Hadoop version bundled
19 | - Spark without Hadoop: in case you have already installed Hadoop on your cluster.
20 |
21 | In this document, we will use **Spark 3.0.0 prebuilt with user-provided Apache Hadoop**
22 |
23 | ### 2.1. Download Apache Spark
24 |
25 | **On Master node (or Namenode)**
26 | The rest of this document should only be executed on the Master node (Namenode).
27 |
28 | Switch to hadoop user
29 | `su hadoop`
30 |
31 | In /opt/, create spark directory and grant sudo permission
32 | `cd /opt/`
33 | `sudo mkdir spark`
34 | `sudo chown -R hadoop /opt/spark`
35 |
36 | Download Apache Spark to /home/hadoop
37 | `cd ~`
38 | `wget -O spark.tgz http://mirrors.viethosting.com/apache/spark/spark-3.0.0/spark-3.0.0-bin-without-hadoop.tgz`
39 |
40 | Untar the file into /opt/spark directory
41 | `tar -zxf spark.tgz --directory=/opt/spark --strip=1`
42 |
43 | ### 2.2. Setup Spark Environment Variable
44 |
45 | Edit the .bashrc file
46 | `nano ~/.bashrc`
47 |
48 | Add the following line to the **.bashrc** file
49 | ```
50 | # Set Spark Environment Variable
51 | export SPARK_HOME=/opt/spark
52 | export PATH=$PATH:$SPARK_HOME/bin
53 | export PYSPARK_PYTHON=python3
54 | #export PYSPARK_DRIVER_PYTHON="jupyter"
55 | #export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8889"
56 | #export SPARK_LOCAL_IP=192.168.0.1 # Your Master IP
57 |
58 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
59 | export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
60 | export SPARK_DIST_CLASSPATH=$HADOOP_HOME/etc/hadoop
61 | export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
62 | ```
63 |
64 | ### 2.3. Edit Spark Config Files
65 |
66 | Apache Spark has several config files dedicated to specific purposes:
67 | - spark-env.sh: store default environment variables for Spark
68 | - spark-defaults.conf: default config for spark-submit, including allocated resources, dependencies, proxies, etc.
69 |
70 | In spark/conf directory, make a copy of spark-env.sh.template and name it as spark-env.sh
71 | `cp spark-env.sh.template spark-env.sh`
72 |
73 | Edit the spark-env.sh
74 | `nano spark-env.sh`
75 |
76 | Add the line below to the file so that Spark knows the Hadoop location
77 | `export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)`
78 |
79 | Make a copy of spark-defaults.conf.template and name it as spark-defaults.conf
80 | `cp spark-defaults.conf.template spark-defaults.conf`
81 |
82 | Edit the spark-defaults.conf
83 | `nano spark-defaults.conf`
84 | Add the lines below to the file
85 | ```
86 | spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 # Kafka Dependency for Spark
87 | spark.driver.extraJavaOptions -Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128 # Set Proxy for Spark
88 | ```
89 |
90 | ### 2.4. Setup Spark Cluster Manager
91 |
92 | Spark supports different cluster managers, including the Spark Standalone Cluster Manager, YARN, Mesos, and Kubernetes. In this document, we will configure YARN and the Spark Standalone Cluster Manager.
93 |
94 | #### 2.4.1. Spark on Yarn
95 |
96 | In spark-defaults.conf, declare the Spark master config by adding the following line
97 |
98 | ```
99 | spark.master yarn
100 | ```
101 |
102 | The following configs should be considered as well
103 |
104 | ```
105 | spark.eventLog.enabled true
106 | spark.eventLog.dir hdfs://192.168.0.1:9000/spark-logs
107 | spark.driver.memory 512m
108 | spark.yarn.am.memory 512m
109 | spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
110 | spark.history.fs.logDirectory hdfs://192.168.0.1:9000/spark-logs
111 | spark.history.fs.update.interval 10s
112 | spark.history.ui.port 18080
113 | ```
114 |
115 | Create hdfs directory to store spark logs
116 | `hdfs dfs -mkdir /spark-logs`
117 |
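Once the settings above are in place, a quick smoke test from the Master node can confirm that Spark jobs really run on YARN; a minimal sketch, assuming PySpark picks up spark-defaults.conf:
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")                 # redundant if spark.master yarn is already set
         .appName("yarn-smoke-test")
         .getOrCreate())

# tiny job: the sum of 0..99 should print 4950
print(spark.sparkContext.parallelize(range(100)).sum())
spark.stop()
```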
118 | #### 2.4.2. Spark Standalone Cluster Manager
119 |
120 | On Spark Master node, copy the spark directory to all slave nodes
121 |
122 | `scp -r /opt/spark hadoop@node2:/opt/spark`
123 |
124 | In spark/conf directory, make a copy of slaves.template and name it as slaves
125 | `cp slaves.template slaves`
126 |
127 | Edit the slaves file
128 | `nano slaves`
129 |
130 | Add the hostnames of the slave nodes to the file.
131 | ```
132 | slaves1 # hostname of node2
133 | slaves2 # hostname of node3
134 | ```
135 |
136 | **Start Spark Standalone Cluster mode**
137 |
138 | Execute the following line
139 | `$SPARK_HOME/sbin/start-all.sh`
140 |
141 | Run jps to check if Spark is running on the Master and Slave Node
142 | `jps`
143 |
144 | To stop Spark Standalone Cluster mode, do
145 | `$SPARK_HOME/sbin/stop-all.sh`
146 |
147 | ### 2.5. Spark Monitor UI
148 |
149 | **Spark Standalone Cluster mode Web UI**
150 | `master:8080`
151 |
152 | **Spark Application Web UI**
153 | `master:4040`
154 |
155 | **Master IP and port for spark-submit**
156 | `master:7077`
157 |
158 | ### Submit jobs to Spark Submit
159 |
160 | **Example**
161 | ```
162 | spark-submit --deploy-mode client \
163 | --class org.apache.spark.examples.SparkPi \
164 | $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 10
165 | ```
166 |
167 | **Run Mobile App Schemaless Consumer**
168 | ```
169 | $SPARK_HOME/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/kafka_2_hadoop/Schemaless_Structure_Streaming.py
170 | ```
171 |
172 | **Run HDFS dump data**
173 | ```
174 | $SPARK_HOME/bin/spark-submit --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/dump/hdfs_sstreaming_dump_1_1.py
175 | ```
176 |
177 | **Run Kafka dump data**
178 | ```
179 | $SPARK_HOME/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=10.56.224.31 -Dhttp.proxyPort=3128 -Dhttps.proxyHost=10.56.224.31 -Dhttps.proxyPort=3128" --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/dump/kafka_sstreaming_dump_1_1.py
180 | ```
181 |
182 | **Run Schema Inferno**
183 | ```
184 | $SPARK_HOME/bin/spark-submit --master spark://10.56.237.195:7077 --deploy-mode client /home/odyssey/projects/mapp/schema_inferno/Schema_Saver.py
185 | ```
186 |
187 | #### Note
188 |
189 | The Kafka dependency might not work with the Scala 2.12 build.
190 | Downgrade to 2.11: org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5 => org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5
191 |
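For context, the dump jobs above read from Kafka with Structured Streaming; a minimal sketch of such a job (broker address and topic name are only illustrative), using the spark-sql-kafka package configured earlier:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# illustrative broker and topic; replace with your own
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "192.168.1.2:9092")
      .option("subscribe", "my-topic")
      .load())

query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()
```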
192 | ### Spark Application Structure
193 |
194 | **Python**
195 | ```
196 | import os
197 | from pyspark import SparkContext, SparkConf
198 | from pyspark.sql import SparkSession, SQLContext
199 | from pyspark.sql import functions as F
200 | from pyspark.sql.types import *
201 |
202 |
203 | def save_schema(path):
204 |     print(path)  # placeholder: persist the inferred schema for the given path here
205 |
206 | if __name__ == '__main__':
207 | # Setup Spark Config
208 | conf = SparkConf()
209 | conf.set("spark.executor.memory", "2g")
210 | conf.set("spark.executor.cores", "2")
211 | conf.set("spark.cores.max", "8")
212 | conf.set("spark.driver.extraClassPath", "/opt/oracle-client/instantclient_19_6/ojdbc8.jar")
213 | conf.set("spark.executor.extraClassPath", "/opt/oracle-client/instantclient_19_6/ojdbc8.jar")
214 | sc = SparkContext(master="spark://10.56.237.195:7077", appName="Schema Inferno", conf=conf)
215 | spark = SparkSession(sc)
216 | sqlContext = SQLContext(sc)
217 |
218 | # Generate Schema
219 | path = 'hdfs://master:9000/home/odyssey/data/raw_hdfs'
220 | temp_df = spark.read.json(f'{path}/*/part-00000')
221 | temp_df.printSchema()
222 | print("Number of records: ", temp_df.count())
223 | print("Number of distinct records: ", temp_df.distinct().count())
224 | #print(temp_df.filter('payload is null').count())
225 | save_schema(path)
226 | ```
227 |
228 | ### Log Print Control
229 |
230 | Create the log4j file from the template in spark/conf directory
231 | `cp conf/log4j.properties.template conf/log4j.properties`
232 |
233 | Replace this line
234 | `log4j.rootCategory=INFO, console`
235 | with this
236 | `log4j.rootCategory=WARN, console`
237 |
--------------------------------------------------------------------------------
/Hadoop Cluster Setup.md:
--------------------------------------------------------------------------------
1 | # Concept
2 | Set up Hadoop on one node, then replicate to the others.
3 |
4 | ## **1. Prerequisites**
5 | The following must be done on all nodes in the cluster, including installation of Java, SSH, user creation and other software utilities.
6 |
7 | ### 1.1. Configure hosts and hostname on each node
8 | Here we will edit the /etc/hosts and /etc/hostname files, so that we can use hostnames instead of IPs every time we wish to use or ping any of these servers.
9 | * **Change hostname**
10 | `sudo nano /etc/hostname`
11 | Set your hostname to a relevant name (node1, node2, node3, etc.)
12 |
13 | * **Change your hosts file**
14 | `sudo nano /etc/hosts`
15 | Add the following lines in the format `IP hostname`
16 | ```
17 | 192.168.0.1 node1
18 | 192.168.0.2 node2
19 | 192.168.0.3 node3
20 | ```
21 |
22 | **Notice: Remember to delete/comment the following line if exists**
23 | > ```# 127.0.1.1 node1 ```
24 |
25 | ### 1.2. Install OpenSSH
26 | `sudo apt install openssh-client`
27 | `sudo apt install openssh-server`
28 |
29 | ### 1.3. Install Java and config Java Environment Variable
30 | Here we use JDK 8, as it is still the most stable and widely supported version.
31 |
32 | * **For Oracle Java**
33 | `sudo add-apt-repository ppa:webupd8team/java`
34 | `sudo apt update`
35 | `sudo apt install oracle-java8-installer`
36 |
37 | * **For OpenJDK Java**
38 | `sudo apt install openjdk-8-jdk`
39 |
40 | To verify the java version you can use the following command:
41 | `java -version`
42 |
43 | Set Java Environment Variable
44 | * **Locate where Java is installed**
45 | `update-alternatives --config java`
46 | The install path should be like this
47 | > /usr/lib/jvm/java-8-openjdk-amd64/
48 | * **Add the JAVA_HOME variable to the .bashrc file:**
49 | `nano ~/.bashrc`
50 | `export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/`
51 |
52 | ### 1.4. Create dedicated user and group for Hadoop
53 | We will use a dedicated Hadoop user account for running Hadoop applications. While that’s not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (security, permissions, backups, etc.).
54 | * **Create Hadoop group and Hadoop user**
55 | `sudo addgroup hadoopgroup`
56 | `sudo adduser --ingroup hadoopgroup hadoop`
57 | `sudo adduser hadoop sudo`
58 |
59 | After this step we will only work on **hadoop** user. You can change your user by:
60 | `su - hadoop`
61 |
62 | ### 1.5. SSH Configuration
63 | Hadoop requires SSH access to manage its different nodes, i.e. remote machines plus your local machine.
64 | * **Generate SSH key value pair**
65 | `ssh-keygen -t rsa`
66 | `cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys`
67 | `sudo chmod 0600 ~/.ssh/authorized_keys`
68 |
69 | To check whether your SSH works, run:
70 | `ssh localhost`
71 |
72 | * **Config Passwordless SSH**
73 | From each node, copy the ssh public key to others
74 | `ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node2`
75 | `ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@node3`
76 |
77 | To check your passwordless SSH, try:
78 | `ssh hadoop@node2`
79 |
80 | ## **2. Download and Configure Hadoop**
81 | In this article, we will install Hadoop on three machines
82 |
83 | | IP | Host Name | Namenode | Datanode |
84 | | ------ | ------ | ------ | ------ |
85 | | 192.168.0.1 | node1 | Yes | No |
86 | | 192.168.0.2 | node2 | No | Yes |
87 | | 192.168.0.3 | node3 | No | Yes |
88 |
89 | ### 2.1. Download and setup hadoop
90 | First, the following directories need to be created in the /opt/ directory
91 | ```
92 | /opt/
93 | |-- hadoop
94 | | |-- logs
95 | |-- hdfs
96 | | |-- datanode (if act as datanode)
97 | | |-- namenode (if act as namenode)
98 | |-- mr-history (if act as namenode)
99 | | |-- done
100 | | |-- tmp
101 | |-- yarn (if act as namenode)
102 | | |-- local
103 | | |-- logs
104 | ```
105 |
106 | Go to the /home/hadoop directory
107 | `cd ~`
108 |
109 | Download the installation Hadoop package from its website: [https://hadoop.apache.org/releases.html](https://hadoop.apache.org/releases.html)
110 | `wget -c -O hadoop.tar.gz http://mirrors.viethosting.com/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz`
111 |
112 | Extract the file
113 | `sudo tar -xvf hadoop.tar.gz --directory=/opt/hadoop --strip 1`
114 |
115 |
116 | Assign permission for user hadoop on these folders
117 | `sudo chown -R hadoop /opt/hadoop`
118 | `sudo chown -R hadoop /opt/hdfs`
119 | `sudo chown -R hadoop /opt/yarn`
120 | `sudo chown -R hadoop /opt/mr-history`
121 |
122 |
123 | ### 2.2. Configuration for Namenode
124 | Inside the /hadoopx.y.z/etc/hadoop/ directory, edit the following files: **core-site.xml**, **hdfs-site.xml**, **yarn-site.xml**, **mapred-site.xml**.
125 |
126 | **On Namenode Server**
127 | * *core-site.xml*
128 | ```
129 | <configuration>
130 |   <property>
131 |     <name>fs.defaultFS</name>
132 |     <value>hdfs://192.168.0.1:9000/</value>
133 |     <description>NameNode URI</description>
134 |   </property>
135 |
136 |   <property>
137 |     <name>io.file.buffer.size</name>
138 |     <value>131072</value>
139 |     <description>Buffer size</description>
140 |   </property>
141 | </configuration>
142 | ```
143 | * *hdfs-site.xml*
144 | ```
145 | <configuration>
146 |   <property>
147 |     <name>dfs.namenode.name.dir</name>
148 |     <value>file:///opt/hdfs/namenode</value>
149 |     <description>NameNode directory for namespace and transaction logs storage.</description>
150 |   </property>
151 |
152 |   <property>
153 |     <name>fs.checkpoint.dir</name>
154 |     <value>file:///opt/hdfs/secnamenode</value>
155 |     <description>Secondary Namenode Directory</description>
156 |   </property>
157 |
158 |   <property>
159 |     <name>fs.checkpoint.edits.dir</name>
160 |     <value>file:///opt/hdfs/secnamenode</value>
161 |   </property>
162 |
163 |   <property>
164 |     <name>dfs.replication</name>
165 |     <value>2</value>
166 |     <description>Number of replication</description>
167 |   </property>
168 |
169 |   <property>
170 |     <name>dfs.permissions</name>
171 |     <value>false</value>
172 |   </property>
173 |
174 |   <property>
175 |     <name>dfs.datanode.use.datanode.hostname</name>
176 |     <value>false</value>
177 |   </property>
178 |
179 |   <property>
180 |     <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
181 |     <value>false</value>
182 |   </property>
183 | </configuration>
184 | ```
185 | * *yarn-site.xml*
186 | ```
187 | <configuration>
188 |   <property>
189 |     <name>yarn.resourcemanager.hostname</name>
190 |     <value>192.168.0.1</value>
191 |     <description>IP of hostname for Yarn Resource Manager Service</description>
192 |   </property>
193 |
194 |   <property>
195 |     <name>yarn.nodemanager.aux-services</name>
196 |     <value>mapreduce_shuffle</value>
197 |     <description>Yarn Node Manager Aux Service</description>
198 |   </property>
199 |
200 |   <property>
201 |     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
202 |     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
203 |   </property>
204 |
205 |   <property>
206 |     <name>yarn.nodemanager.local-dirs</name>
207 |     <value>file:///opt/yarn/local</value>
208 |   </property>
209 |
210 |   <property>
211 |     <name>yarn.nodemanager.log-dirs</name>
212 |     <value>file:///opt/yarn/logs</value>
213 |   </property>
214 | </configuration>
215 | ```
216 | * *mapred-site.xml*
217 | ```
218 | <configuration>
219 |   <property>
220 |     <name>mapreduce.framework.name</name>
221 |     <value>yarn</value>
222 |     <description>MapReduce framework name</description>
223 |   </property>
224 |
225 |   <property>
226 |     <name>mapreduce.jobhistory.address</name>
227 |     <value>192.168.0.1:10020</value>
228 |     <description>Default port is 10020.</description>
229 |   </property>
230 |
231 |   <property>
232 |     <name>mapreduce.jobhistory.webapp.address</name>
233 |     <value>192.168.0.1:19888</value>
234 |     <description>MapReduce JobHistory WebUI URL</description>
235 |   </property>
236 |
237 |   <property>
238 |     <name>mapreduce.jobhistory.intermediate-done-dir</name>
239 |     <value>/opt/mr-history/tmp</value>
240 |     <description>Directory where history files are written by MapReduce jobs.</description>
241 |   </property>
242 |
243 |   <property>
244 |     <name>mapreduce.jobhistory.done-dir</name>
245 |     <value>/opt/mr-history/done</value>
246 |     <description>Directory where history files are managed by the MR JobHistory Server.</description>
247 |   </property>
248 | </configuration>
249 | ```
250 |
251 | * **workers**
252 | Add the datanodes' IP to the file
253 | ```
254 | 192.168.0.2
255 | 192.168.0.3
256 | ```
257 |
258 | ### 2.3. Configuration for Datanode
259 | Inside the /hadoopx.y.z/etc/hadoop/ directory, edit the following files: **core-site.xml**, **hdfs-site.xml**, **yarn-site.xml**, **mapred-site.xml**.
260 |
261 | **On Datanode Server**
262 | * *core-site.xml*
263 | ```
264 | <configuration>
265 |   <property>
266 |     <name>fs.defaultFS</name>
267 |     <value>hdfs://192.168.0.1:9000/</value>
268 |     <description>NameNode URI</description>
269 |   </property>
270 | </configuration>
271 | ```
272 | * *hdfs-site.xml*
273 | ```
274 | <configuration>
275 |   <property>
276 |     <name>dfs.datanode.data.dir</name>
277 |     <value>file:///opt/hdfs/datanode</value>
278 |     <description>DataNode directory for namespace and transaction logs storage.</description>
279 |   </property>
280 |
281 |   <property>
282 |     <name>dfs.replication</name>
283 |     <value>2</value>
284 |     <description>Number of replication</description>
285 |   </property>
286 |
287 |   <property>
288 |     <name>dfs.permissions</name>
289 |     <value>false</value>
290 |   </property>
291 |
292 |   <property>
293 |     <name>dfs.datanode.use.datanode.hostname</name>
294 |     <value>false</value>
295 |   </property>
296 |
297 |   <property>
298 |     <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
299 |     <value>false</value>
300 |   </property>
301 | </configuration>
302 | ```
303 | * *yarn-site.xml*
304 | ```
305 | <configuration>
306 |   <property>
307 |     <name>yarn.resourcemanager.hostname</name>
308 |     <value>192.168.0.1</value>
309 |     <description>IP of hostname for Yarn Resource Manager Service</description>
310 |   </property>
311 |
312 |   <property>
313 |     <name>yarn.nodemanager.aux-services</name>
314 |     <value>mapreduce_shuffle</value>
315 |     <description>Yarn Node Manager Aux Service</description>
316 |   </property>
317 | </configuration>
318 | ```
319 | * *mapred-site.xml*
320 | ```
321 | <configuration>
322 |   <property>
323 |     <name>mapreduce.framework.name</name>
324 |     <value>yarn</value>
325 |     <description>MapReduce framework name</description>
326 |   </property>
327 | </configuration>
328 | ```
329 |
330 | For more configuration information, see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
331 |
332 | ### 2.4. Configure Hadoop Environment Variables
333 | Add the following lines to the **.bashrc** file
334 | ```
335 | export HADOOP_HOME=/opt/hadoop
336 | export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
337 | export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
338 | export HDFS_NAMENODE_USER=hadoop
339 | export HDFS_DATANODE_USER=hadoop
340 | export HDFS_SECONDARYNAMENODE_USER=hadoop
341 | export HADOOP_MAPRED_HOME=/opt/hadoop
342 | export HADOOP_COMMON_HOME=/opt/hadoop
343 | export HADOOP_HDFS_HOME=/opt/hadoop
344 | export YARN_HOME=/opt/hadoop
345 | ```
346 |
347 | Also add the following lines to the /opt/hadoop/etc/hadoop/hadoop-env.sh file
348 | ```
349 | export HDFS_NAMENODE_USER=hadoop
350 | export HDFS_DATANODE_USER=hadoop
351 | export HDFS_SECONDARYNAMENODE_USER=hadoop
352 | export YARN_RESOURCEMANAGER_USER=hadoop
353 | export YARN_NODEMANAGER_USER=hadoop
354 | export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
355 | export HADOOP_HOME=/opt/hadoop
356 | export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
357 | export HADOOP_LOG_DIR=/opt/hadoop/logs
358 | ```
359 |
360 | ## 3. Start HDFS, Yarn and monitor services on browser
361 | ### 3.1. Format the namenode, start Hadoop basic services
362 | To format the namenode, type
363 | `hdfs namenode -format`
364 |
365 | Start the hdfs service
366 | `start-dfs.sh`
367 |
368 | The output should look like this
369 | ```
370 | Starting namenodes on [hadoop-namenode]
371 | Starting datanodes
372 | Starting secondary namenodes [hadoop-namenode]
373 | ```
374 |
375 | To start yarn service
376 | `start-yarn.sh`
377 | ```
378 | Starting resourcemanager
379 | Starting nodemanagers
380 | ```
381 |
382 | Start MapReduce JobHistory as *daemon*
383 | `$HADOOP_HOME/bin/mapred --daemon start historyserver`
384 |
385 | To check whether the services started successfully, run *jps* to list them
386 |
387 | * On Namenode
388 | ```
389 | 16488 NameNode
390 | 16622 JobHistoryServer
391 | 17087 ResourceManager
392 | 17530 Jps
393 | 16829 SecondaryNameNode
394 | ```
395 |
396 | * On Datanode
397 | ```
398 | 2306 DataNode
399 | 2479 NodeManager
400 | 2581 Jps
401 | ```
402 |
403 | ### 3.2. Monitor the services on browser
404 |
405 | For Namenode of Hadoop 3.x.x
406 | `http://IP:9870`
407 | For Namenode of Hadoop 2.x.x
408 | `http://IP:50070`
409 |
410 | For Yarn
411 | `http://IP:8088`
412 |
413 | For MapReduce Job History
414 | `http://IP:19888`
415 |
416 | ## 4. Run your first HDFS command & Yarn Job
417 | ### 4.1. Put and Get Data to HDFS
418 | Create a books directory in HDFS
419 | `hdfs dfs -mkdir /books`
420 |
421 | Grab a few books from the Gutenberg project
422 | `cd ~`
423 |
424 | ```
425 | wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
426 | wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
427 | wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt
428 | ```
429 |
430 | Then put the three books into HDFS, in the books directory
431 | `hdfs dfs -put alice.txt holmes.txt frankenstein.txt /books`
432 |
433 | List the contents of the books directory
434 | `hdfs dfs -ls /books`
435 |
436 | Copy one of the books to the local filesystem
437 | `hdfs dfs -get /books/alice.txt`
438 |
439 | ### 4.2. Submit MapReduce Jobs to YARN
440 | YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar. The Hadoop installation package provides sample applications that can be run to test your cluster. You’ll use them to run a word count on the three books previously uploaded to HDFS.
441 |
442 | Submit a job with the sample jar to YARN. On the Namenode (node1), run
443 | `yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount "/books/*" output`
444 |
445 | After the job is finished, you can get the result by querying HDFS with hdfs dfs -ls output. In case of a success, the output will resemble:
446 | ```
447 | Found 2 items
448 | -rw-r--r-- 2 hadoop supergroup 0 2019-05-31 17:21 output/_SUCCESS
449 | -rw-r--r-- 2 hadoop supergroup 789726 2019-05-31 17:21 output/part-r-00000
450 | ```
451 |
452 | Print the result with:
453 | `hdfs dfs -cat output/part-r-00000 | less`
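Once Spark on YARN is set up (see the Spark setup document), roughly the same word count can be expressed in PySpark; a minimal sketch, with hdfs://node1:9000 standing in for the fs.defaultFS configured above:
```
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-wordcount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs://node1:9000/books/*")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(add))

counts.saveAsTextFile("hdfs://node1:9000/books-wordcount")
spark.stop()
```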
454 |
--------------------------------------------------------------------------------
/Hive Setup with Spark Exection and Spark HiveContext.md:
--------------------------------------------------------------------------------
1 | # 1. Prerequisite
2 |
3 | Below are the items that need to be installed/set up before installing Hive
4 | - Passwordless SSH
5 | - Java Installation and JAVA_HOME declaration in .bashrc
6 | - Hostname/hosts config
7 | - Hadoop Cluster Setup, with start-dfs.sh and start-yarn.sh executed
8 | - Spark on Yarn
9 |
10 | # 2. Download and Install Apache Hive
11 |
12 | On the Apache Hive download page [https://downloads.apache.org/hive/](https://downloads.apache.org/hive/), choose the Hive release version that you would like to install.
13 |
14 | Release version: Hive currently has 2 stable versions
15 | - Hive 2.x: This release works with Hadoop 2.x.y
16 | - Hive 3.x: This release works with Hadoop 3.x.y
17 |
18 | In this document, we will use **Hive 2.3.7 prebuilt version**
19 |
20 | ## 2.1. Download Apache Hive
21 |
22 | **On Master node (or Namenode)**
23 | The rest of this document should only be executed on the Master node (Namenode).
24 |
25 | Switch to hadoop user
26 | `su hadoop`
27 |
28 | In /opt/, create Hive directory and grant sudo permission
29 | `cd /opt/`
30 | `sudo mkdir hive`
31 | `sudo chown -R hadoop /opt/hive`
32 |
33 | Download Apache Hive to /home/hadoop
34 | `cd ~`
35 | `wget -O hive.tgz https://downloads.apache.org/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz`
36 |
37 | Untar the file into the /opt/hive directory
38 | `tar -zxf hive.tgz --directory=/opt/hive --strip=1`
39 |
40 | ## 2.2. Setup Hive Environment Variable
41 |
42 | Edit the .bashrc file
43 | `nano ~/.bashrc`
44 |
45 | Add the following line to the **.bashrc** file
46 | ```
47 | # Hive Environment Configuration
48 | export HIVE_HOME=/opt/hive
49 | export PATH=$HIVE_HOME/bin:$PATH
50 | ```
51 | Hive also uses Hadoop, so you must have Hadoop in your path or your **.bashrc** must contain
52 | `export HADOOP_HOME=`
53 |
54 | ## 2.3. Create Hive working directory on HDFS
55 |
56 | In addition, you must use the HDFS commands below to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and chmod them g+w before you can create a table in Hive.
57 |
58 | `hdfs dfs -mkdir /tmp`
59 | `hdfs dfs -mkdir -p /user/hive/warehouse`
60 | `hdfs dfs -chmod g+w /tmp`
61 | `hdfs dfs -chmod g+w /user/hive/warehouse`
62 |
63 |
64 | ## 2.4. Setup Hive Metastore and HiveServer2
65 |
66 | ### 2.4.1. Configuring a Remote PostgreSQL Database for the Hive Metastore
67 |
68 | In order to use the Hive Metastore, a database is required to store the Hive metadata. Though Hive provides an embedded database (Apache Derby), that mode should only be used for experimental purposes. Here we will set up a remote PostgreSQL database to run the Hive Metastore.
69 |
70 | Before you can run the Hive metastore with a remote PostgreSQL database, you must configure a connector to the remote PostgreSQL database, set up the initial database schema, and configure the PostgreSQL user account for the Hive user.
71 |
72 | **Install and start PostgreSQL**
73 | Run the following to install PostgreSQL Database
74 | `sudo apt-get install postgresql`
75 |
76 | To ensure that your PostgreSQL server will be accessible over the network, you need to do some additional configuration.
77 |
78 | First you need to edit the `postgresql.conf` file. Set the `listen_addresses` property to `*`, to make sure that the PostgreSQL server starts listening on all your network interfaces. Also make sure that the `standard_conforming_strings` property is set to `off`.
79 |
80 | `sudo nano /etc/postgresql/10/main/postgresql.conf`
81 |
82 | Adjust the required properties
83 |
84 | ```
85 | listen_addresses = '*'
86 | standard_conforming_strings off
87 | ```
88 |
89 | You also need to configure authentication for your network in `pg_hba.conf`. You need to make sure that the PostgreSQL user that you will create later in this procedure will have access to the server from a remote host. To do this, add a new line into `pg_hba.conf` that has the following information:
90 | `sudo nano /etc/postgresql/10/main/pg_hba.conf`
91 | Add the following line to the file
92 | ```
93 | host all all 0.0.0.0 0.0.0.0 md5
94 | ```
95 |
96 | If the default pg_hba.conf file contains the following line:
97 | ```
98 | host all all 127.0.0.1/32 ident
99 | ```
100 | then the host line specifying md5 authentication shown above must be inserted before this ident line. Failing to do so might lead to the following error
101 | ```
102 | SLF4J: Class path contains multiple SLF4J bindings.
103 | SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
104 | SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
105 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
106 | SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
107 | Metastore connection URL: jdbc:postgresql://localhost:5432/metastore
108 | Metastore Connection Driver : org.postgresql.Driver
109 | Metastore connection User: hive
110 | org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
111 | Underlying cause: org.postgresql.util.PSQLException : FATAL: Ident authentication failed for user "hive"
112 | SQL Error code: 0
113 | Use --verbose for detailed stacktrace.
114 | *** schemaTool failed ***
115 | ```
116 |
117 |
118 | After all is done, start PostgreSQL Server
119 | `sudo service postgresql start`
120 |
121 |
122 | **Install the PostgreSQL JDBC driver**
123 | On the client, a JDBC driver is required to connect to the PostgreSQL database. Install the JDBC driver and copy or link it into the hive/lib folder.
124 | `sudo apt-get install libpostgresql-jdbc-java`
125 | `ln -s /usr/share/java/postgresql-jdbc4.jar /opt/hive/lib/postgresql-jdbc4.jar`
126 |
127 | **Create the metastore database and user account**
128 | Log in as the postgres user and start psql
129 | `sudo -u postgres psql`
130 |
131 | Add dedicated User and Database for Hive Metastore
132 |
133 | ```
134 | postgres=# CREATE USER hive WITH PASSWORD '123123';
135 | postgres=# CREATE DATABASE metastore;
136 | ```
137 |
138 | ### 2.4.2. Adjust the Hive and Hadoop config files for the Hive Metastore
139 | In /opt/hive/conf, create the config file hive-site.xml (optionally starting from hive-default.xml.template)
140 | `nano hive-site.xml`
141 |
142 | Add the following lines
143 | ```
144 | <configuration>
145 |   <property>
146 |     <name>javax.jdo.option.ConnectionURL</name>
147 |     <value>jdbc:postgresql://localhost:5432/metastore</value>
148 |   </property>
149 |
150 |   <property>
151 |     <name>javax.jdo.option.ConnectionDriverName</name>
152 |     <value>org.postgresql.Driver</value>
153 |   </property>
154 |
155 |   <property>
156 |     <name>javax.jdo.option.ConnectionUserName</name>
157 |     <value>hive</value>
158 |   </property>
159 |
160 |   <property>
161 |     <name>javax.jdo.option.ConnectionPassword</name>
162 |     <value>123123</value>
163 |   </property>
164 |
165 |   <property>
166 |     <name>hive.metastore.warehouse.dir</name>
167 |     <value>hdfs://192.168.0.5:9000/user/hive/warehouse</value>
168 |   </property>
169 | </configuration>
170 | ```
171 |
172 | In order to avoid the issue `Cannot connect to hive using beeline, user root cannot impersonate anonymous` when connecting to HiveServer2 from `beeline`, add the following to core-site.xml in hadoop/etc/hadoop, where `[username]` is the local user that will be used in `beeline`:
173 | ```
174 | <property>
175 |   <name>hadoop.proxyuser.[username].groups</name>
176 |   <value>*</value>
177 | </property>
178 | <property>
179 |   <name>hadoop.proxyuser.[username].hosts</name>
180 |   <value>*</value>
181 | </property>
182 | ```
183 |
184 | ### 2.4.3. Start Hive Metastore Server & HiveServer2
185 |
186 | Use the Hive Schema Tool to create the metastore tables.
187 | `/opt/hive/bin/schematool -dbType postgres -initSchema`
188 |
189 |
190 | The output should be like
191 | ```
192 | Metastore connection URL: jdbc:postgresql://localhost:5432/metastore
193 | Metastore Connection Driver : org.postgresql.Driver
194 | Metastore connection User: hive
195 | Starting metastore schema initialization to 2.3.0
196 | Initialization script hive-schema-2.3.0.postgres.sql
197 | Initialization script completed
198 | schemaTool completed
199 | ```
200 |
201 | Start the Hive Metastore Server and HiveServer2 services:
202 | `hive --service metastore`
203 | From another connection session, start HiveServer2
204 | `hiveserver2`
205 |
206 | Run Beeline (the HiveServer2 CLI).
207 | `beeline`
208 | Connect to HiveServer2
209 | `!connect jdbc:hive2://localhost:10000`
210 | Input any user and password, and the result should be
211 | ```
212 | Connecting to jdbc:hive2://localhost:10000
213 | Enter username for jdbc:hive2://localhost:10000: hadoop
214 | Enter password for jdbc:hive2://localhost:10000: ******
215 | Connected to: Apache Hive (version 2.3.7)
216 | Driver: Hive JDBC (version 2.3.7)
217 | Transaction isolation: TRANSACTION_REPEATABLE_READ
218 | ```
219 |
220 | From here, you can run your query on Hive
221 |
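Queries can also be sent to the same HiveServer2 endpoint programmatically, for example with the third-party PyHive package (an assumption, not part of this setup); a minimal sketch:
```
from pyhive import hive  # pip install 'pyhive[hive]'

conn = hive.Connection(host='localhost', port=10000, username='hadoop')
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
print(cursor.fetchall())
conn.close()
```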
222 | For monitoring, you can access the HiveServer2 WebUI on port 10002
223 | `http://localhost:10002`
224 |
225 | ## 2.5. Setup Hive Execution Engine to Spark
226 |
227 | While MR remains the default engine for historical reasons, it is itself a historical engine and is deprecated in the Hive 2 line. It may be removed without further warning. Therefore, choosing another execution engine is a wise decision. Here, we will configure Spark to be the Hive execution engine.
228 |
229 | First, add the following config to `hive-site.xml` in `/opt/hive/conf`
230 | ```
231 | <property>
232 |   <name>hive.execution.engine</name>
233 |   <value>spark</value>
234 |   <description>
235 |     Expects one of [mr, tez, spark]
236 |   </description>
237 | </property>
238 | ```
239 |
240 | Then configure YARN (in yarn-site.xml) to distribute an equal share of resources to jobs in the YARN cluster:
241 | ```
242 | <property>
243 |   <name>yarn.resourcemanager.scheduler.class</name>
244 |   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
245 | </property>
246 | ```
247 |
248 | Next, add the Spark libs to Hive's class path as below.
249 |
250 | Edit the `/opt/hive/bin/hive` file (back up this file first in case anything goes wrong)
251 | `cp hive hive_backup`
252 | `nano hive`
253 |
254 | Add Spark Libs to Hive
255 | ```
256 | for f in ${SPARK_HOME}/jars/*.jar; do
257 | CLASSPATH=${CLASSPATH}:$f;
258 | done
259 | ```
260 |
261 | Finally, upload all jars in `$SPARK_HOME/jars` to an HDFS folder (for example hdfs://xxxx:9000/spark-jars):
262 | ```
263 | hdfs dfs -mkdir /spark-jars
264 | hdfs dfs -put /opt/spark/jars/* /spark-jars
265 | ```
266 |
267 | and add following in `hive-site.xml`
268 | ```
269 | <property>
270 |   <name>spark.yarn.jars</name>
271 |   <value>hdfs://xxxx:9000/spark-jars/*</value>
272 | </property>
273 | ```
274 |
275 | You may get below exception if you missed the CLASSPATH configuration above.
276 | ```Exception in thread "main" java.lang.NoClassDefFoundError: scala/collection/Iterable```
277 |
278 | Another solution could be considered, though the author of this document has not tried it successfully yet: [https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started#HiveonSpark:GettingStarted-ConfiguringHive)
279 |
280 | From now on, when running any query that triggers an execution, the output should look like:
281 |
282 | ```
283 | hive> insert into pokes values(2, 'hai');
284 | Query ID = hadoop_20200718125927_5e68f5c3-66b5-4459-a145-71b44c658939
285 | Total jobs = 1
286 | Launching Job 1 out of 1
287 | In order to change the average load for a reducer (in bytes):
288 | set hive.exec.reducers.bytes.per.reducer=
289 | In order to limit the maximum number of reducers:
290 | set hive.exec.reducers.max=
291 | In order to set a constant number of reducers:
292 | set mapreduce.job.reduces=
293 | Starting Spark Job = 84d25db4-ea71-4cf1-8e17-d9018390d005
294 | Running with YARN Application = application_1595048642632_0006
295 | Kill Command = /opt/hadoop/bin/yarn application -kill application_1595048642632_0006
296 |
297 | Query Hive on Spark job[0] stages: [0]
298 |
299 | Status: Running (Hive on Spark job[0])
300 | --------------------------------------------------------------------------------------
301 | STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
302 | --------------------------------------------------------------------------------------
303 | Stage-0 ........ 0 FINISHED 1 1 0 0 0
304 | --------------------------------------------------------------------------------------
305 | STAGES: 01/01 [==========================>>] 100% ELAPSED TIME: 7.11 s
306 | --------------------------------------------------------------------------------------
307 | Status: Finished successfully in 7.11 seconds
308 | Loading data to table default.pokes
309 | OK
310 | Time taken: 30.264 seconds
311 | ```
312 |
313 |
314 |
315 | ## 2.6. Connecting Apache Spark to Apache Hive
316 | ### 2.6.1. Config hive-site.xml and spark-defaults.conf
317 | Create `/opt/spark/conf/hive-site.xml` and define `hive.metastore.uris` configuration property (that is the thrift URL of the Hive Metastore Server).
318 |
319 | ```
320 | <configuration>
321 |   <property>
322 |     <name>hive.metastore.uris</name>
323 |     <value>thrift://localhost:9083</value>
324 |   </property>
325 | </configuration>
326 | ```
327 |
328 | Optionally, create `log4j.properties` from `log4j.properties.template` and add the following for more low-level Hive logging:
329 |
330 | `cp log4j.properties.template log4j.properties`
331 |
332 | ```
333 | log4j.logger.org.apache.spark.sql.hive.HiveUtils$=ALL
334 | log4j.logger.org.apache.spark.sql.internal.SharedState=ALL
335 | log4j.logger.org.apache.spark.sql.hive.client.HiveClientImpl=ALL
336 | ```
337 |
338 | The following config should also be set in the `/opt/spark/conf/spark-defaults.conf` file
339 | ```
340 | spark.master yarn
341 | spark.serializer org.apache.spark.serializer.KryoSerializer
342 | spark.eventLog.enabled true
343 | spark.eventLog.dir hdfs://192.168.0.5:9000/spark-logs
344 | spark.driver.memory 512m
345 | spark.yarn.am.memory 512m
346 | spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
347 | spark.history.fs.logDirectory hdfs://192.168.0.5:9000/spark-logs
348 | spark.history.fs.update.interval 10s
349 | spark.history.ui.port 18080
350 |
351 | # Spark Hive Configurations
352 | spark.sql.catalogImplementation hive
353 | ```
354 |
355 |
356 |
357 | ### 2.6.2. Start your PySpark Hive
358 |
359 | Note that since Hive has a large number of dependencies, **these dependencies are not included in the default Spark distribution**. If **Hive dependencies can be found on the classpath**, Spark will load them automatically. These Hive dependencies must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
360 |
361 | By adding the dependencies in the code, PySpark will automatically download them from the internet. The dependencies include:
362 |
363 | `org.apache.spark:spark-hive_2.11:2.4.6` (The 2.12 version might not work)
364 | `org.apache.avro:avro-mapred:1.8.2`
365 |
366 | From any IDE, execute the following Python script:
367 | ```
368 | import findspark
369 | findspark.init()
370 | findspark.find()
371 | import os
372 | from pyspark.sql import SparkSession
373 | from pyspark.sql import Row
374 |
375 | submit_args = '--packages org.apache.spark:spark-hive_2.11:2.4.6,org.apache.avro:avro-mapred:1.8.2 pyspark-shell'
376 | if 'PYSPARK_SUBMIT_ARGS' not in os.environ:
377 | os.environ['PYSPARK_SUBMIT_ARGS'] = submit_args
378 | else:
379 | os.environ['PYSPARK_SUBMIT_ARGS'] += submit_args
380 |
381 | # warehouse_location points to the default location for managed databases and tables
382 | warehouse_location = 'hdfs://192.168.0.5:9000/user/hive/warehouse'
383 |
384 | spark = SparkSession \
385 | .builder \
386 | .appName("Python Spark SQL Hive integration example") \
387 | .config("spark.sql.warehouse.dir", warehouse_location) \
388 | .enableHiveSupport() \
389 | .getOrCreate()
390 | ```
391 |
392 | Then, hopefully, you can query Hive tables. You can check the SparkSession config with
393 |
394 | `spark.sparkContext.getConf().getAll()`
395 |
396 | Entries like the following should appear:
397 | ```
398 | ('spark.sql.warehouse.dir', 'hdfs://192.168.0.5:9000/user/hive/warehouse'),
399 | ('spark.sql.catalogImplementation', 'hive')
400 | ```
401 |
402 | ## 2.7. Spark SQL on Hive Tables
403 |
404 | Apache Spark provides a mechanism to query Hive tables. Try the following Python scripts.
405 |
406 | **Select from Hive Table**
407 | ```
408 | # With pokes is a Hive table
409 | spark.sql('select * from pokes').show()
410 | ```
411 |
412 | **Write to Hive Table**
413 | ```
414 | df = spark.range(10).toDF('number')
415 | df.registerTempTable('number')
416 | spark.sql('create table number as select * from number')
417 | ```
418 |
419 | You can check whether the `number` table is available using `hive` or `beeline`
420 | ```
421 | hive> show tables;
422 | OK
423 | number
424 | pokes
425 | values__tmp__table__1
426 | Time taken: 0.039 seconds, Fetched: 3 row(s)
427 |
428 | hive> select * from number;
429 | OK
430 | 0
431 | 1
432 | 2
433 | 3
434 | 4
435 | 5
436 | 6
437 | 7
438 | 8
439 | 9
440 | Time taken: 0.135 seconds, Fetched: 10 row(s)
441 | ```
442 |
443 |
447 |
448 | # 3. References
449 | ## 3.1. PostgreSQL Setup and Config
450 | - https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cm_ig_extrnl_pstgrs.html#cmig_topic_5_6
451 | - https://docs.cloudera.com/documentation/enterprise/5-16-x/topics/cdh_ig_hive_metastore_configure.html
456 |
--------------------------------------------------------------------------------