├── .gitignore ├── Complete_Guide_to_Install_Ubuntu_and_JAVA_and_then_Configure_Hadoop,_MySQL,_HIVE,_Sqoop,_Flume,_Spark_on_a_Docker_Container.md ├── HiveInstallation.md ├── Makefile ├── Multi-Node Cluster on Ubuntu 24.04 (VMware).md ├── README.md ├── base ├── Dockerfile ├── bde-spark.css ├── entrypoint.sh ├── execute-step.sh ├── finish-step.sh └── wait-for-step.sh ├── code ├── HadoopWordCount │ ├── bin │ │ ├── WordCount$IntSumReducer.class │ │ ├── WordCount$TokenizerMapper.class │ │ ├── WordCount.class │ │ └── wc.jar │ └── src │ │ └── WordCount.java ├── input │ ├── About Hadoop.txt~ │ └── data.txt └── wordCount.jar ├── conf ├── beeline-log4j2.properties ├── hive-env.sh ├── hive-exec-log4j2.properties ├── hive-log4j2.properties ├── hive-site.xml ├── ivysettings.xml └── llap-daemon-log4j2.properties ├── data ├── authors.csv └── books.csv ├── datanode ├── Dockerfile └── run.sh ├── docker-compose.yml ├── ecom.md ├── entrypoint.sh ├── flume.md ├── hadoop-basic-commands.md ├── hadoop-hive.env ├── hadoop.env ├── hadoop_installation_VMware Workstation.md ├── historyserver ├── Dockerfile └── run.sh ├── master ├── Dockerfile ├── README.md └── master.sh ├── namenode ├── Dockerfile └── run.sh ├── nginx ├── Dockerfile ├── bde-hadoop.css ├── default.conf └── materialize.min.css ├── nodemanager ├── Dockerfile └── run.sh ├── police.csv ├── resourcemanager ├── Dockerfile └── run.sh ├── spark_in_action.MD ├── sqoop.md ├── startup.sh ├── students.csv ├── submit ├── Dockerfile ├── WordCount.jar └── run.sh ├── template ├── java │ ├── Dockerfile │ ├── README.md │ └── template.sh ├── python │ ├── Dockerfile │ ├── README.md │ └── template.sh └── scala │ ├── Dockerfile │ ├── README.md │ ├── build.sbt │ ├── plugins.sbt │ └── template.sh ├── wordcount.md ├── worker ├── Dockerfile ├── README.md └── worker.sh └── yarn.md /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | -------------------------------------------------------------------------------- /HiveInstallation.md: -------------------------------------------------------------------------------- 1 | # **Complete Steps to Install Apache Hive on Ubuntu** 2 | 3 | Apache Hive is a data warehouse infrastructure built on top of Hadoop. This guide will show how to install and configure Hive on **Ubuntu**. 4 | 5 | --- 6 | 7 | ## **Step 1: Install Prerequisites** 8 | Before installing Hive, ensure your system has the necessary dependencies. 9 | 10 | ### **1.1 Install Java** 11 | Hive requires Java to run. Install it if it's not already installed: 12 | ```bash 13 | sudo apt update 14 | sudo apt install default-jdk -y 15 | java -version # Verify installation 16 | ``` 17 | 18 | ### **1.2 Install Hadoop (Required for Hive)** 19 | Hive requires Hadoop to function properly. If Hadoop is not installed, install it using: 20 | 21 | ```bash 22 | sudo apt install hadoop -y 23 | hadoop version # Verify installation 24 | ``` 25 | If you need a full Hadoop setup, follow a [Hadoop installation guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html). 
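Note that a `hadoop` package is generally not available from Ubuntu's default `apt` repositories, so in practice Hadoop is usually installed from the Apache tarball as described in the linked guide. However it was installed, it is worth confirming that HDFS is actually running before continuing, because the Hive warehouse directories created in Step 4 need a live NameNode. A minimal check (assuming a standard single-node setup with the Hadoop binaries on your `PATH`):

```bash
jps                    # should list NameNode, DataNode and related daemons
hdfs dfsadmin -report  # should report at least one live DataNode
```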
26 | 27 | ### **1.3 Install wget (If not installed)** 28 | ```bash 29 | sudo apt install wget -y 30 | ``` 31 | 32 | --- 33 | 34 | ## **Step 2: Download and Install Apache Hive** 35 | ### **2.1 Download Hive** 36 | ```bash 37 | wget https://apache.root.lu/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz 38 | ``` 39 | *Check the latest version from the official Hive website:* [Apache Hive Downloads](https://hive.apache.org/downloads.html) 40 | 41 | ### **2.2 Extract Hive and Move to /opt Directory** 42 | ```bash 43 | sudo tar -xzf apache-hive-2.3.9-bin.tar.gz -C /opt 44 | sudo mv /opt/apache-hive-2.3.9-bin /opt/hive 45 | ``` 46 | 47 | --- 48 | 49 | ## **Step 3: Set Up Environment Variables** 50 | To run Hive commands globally, configure environment variables. 51 | 52 | ### **3.1 Open the `.bashrc` File** 53 | ```bash 54 | nano ~/.bashrc 55 | ``` 56 | 57 | ### **3.2 Add the Following Lines at the End** 58 | ```bash 59 | export HIVE_HOME=/opt/hive 60 | export PATH=$HIVE_HOME/bin:$PATH 61 | ``` 62 | 63 | ### **3.3 Apply the Changes** 64 | ```bash 65 | source ~/.bashrc 66 | ``` 67 | 68 | ### **3.4 Verify Hive Installation** 69 | ```bash 70 | hive --version 71 | ``` 72 | If Hive is installed correctly, it will print the version. 73 | 74 | --- 75 | 76 | ## **Step 4: Configure Hive** 77 | ### **4.1 Create Hive Directories in HDFS** 78 | ```bash 79 | hdfs dfs -mkdir -p /user/hive/warehouse 80 | hdfs dfs -chmod -R 770 /user/hive/warehouse 81 | hdfs dfs -chown -R $USER:$USER /user/hive/warehouse 82 | ``` 83 | 84 | ### **4.2 Configure `hive-site.xml`** 85 | Edit the Hive configuration file: 86 | ```bash 87 | sudo nano /opt/hive/conf/hive-site.xml 88 | ``` 89 | 90 | Add the following configurations: 91 | 92 | ```xml 93 | 94 | 95 | 96 | 97 | javax.jdo.option.ConnectionURL 98 | jdbc:derby:;databaseName=/opt/hive/metastore_db;create=true 99 | JDBC connection URL for the metastore database 100 | 101 | 102 | hive.metastore.warehouse.dir 103 | /user/hive/warehouse 104 | Location of default database for the warehouse 105 | 106 | 107 | hive.exec.scratchdir 108 | /tmp/hive 109 | Scratch directory for Hive jobs 110 | 111 | 112 | ``` 113 | 114 | Save and exit (`CTRL + X`, then `Y` and `ENTER`). 115 | 116 | --- 117 | 118 | ## **Step 5: Set Proper Permissions** 119 | ```bash 120 | sudo chown -R $USER:$USER /opt/hive 121 | sudo chmod -R 755 /opt/hive 122 | ``` 123 | 124 | --- 125 | 126 | ## **Step 6: Initialize Hive Metastore** 127 | Hive uses a database (Derby by default) to store metadata. 128 | 129 | ### **6.1 Run Schema Initialization** 130 | ```bash 131 | /opt/hive/bin/schematool -initSchema -dbType derby 132 | ``` 133 | 134 | --- 135 | 136 | ## **Step 7: Start Hive** 137 | After setup, you can now start Hive. 138 | 139 | ### **7.1 Run Hive Shell** 140 | ```bash 141 | hive 142 | ``` 143 | 144 | ### **7.2 Verify Hive is Working** 145 | Run the following command inside the Hive shell: 146 | ```sql 147 | SHOW DATABASES; 148 | ``` 149 | It should list default databases. 150 | 151 | --- 152 | 153 | ## **(Optional) Configure Hive with MySQL (For Production Use)** 154 | Using **MySQL** instead of Derby is recommended for better performance. 155 | 156 | ### **1. Install MySQL Server** 157 | ```bash 158 | sudo apt install mysql-server -y 159 | sudo systemctl start mysql 160 | sudo systemctl enable mysql 161 | ``` 162 | 163 | ### **2. 
Create a Hive Metastore Database** 164 | ```bash 165 | mysql -u root -p 166 | ``` 167 | Inside the MySQL shell, run: 168 | ```sql 169 | CREATE DATABASE metastore; 170 | CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword'; 171 | GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost'; 172 | FLUSH PRIVILEGES; 173 | EXIT; 174 | ``` 175 | 176 | ### **3. Configure Hive to Use MySQL** 177 | Edit `hive-site.xml`: 178 | ```bash 179 | nano /opt/hive/conf/hive-site.xml 180 | ``` 181 | Replace the Derby configuration with: 182 | ```xml 183 | 184 | javax.jdo.option.ConnectionURL 185 | jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true 186 | 187 | 188 | javax.jdo.option.ConnectionDriverName 189 | com.mysql.jdbc.Driver 190 | 191 | 192 | javax.jdo.option.ConnectionUserName 193 | hiveuser 194 | 195 | 196 | javax.jdo.option.ConnectionPassword 197 | hivepassword 198 | 199 | ``` 200 | 201 | ### **4. Download MySQL JDBC Driver** 202 | ```bash 203 | wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.28.tar.gz 204 | tar -xzf mysql-connector-java-8.0.28.tar.gz 205 | sudo mv mysql-connector-java-8.0.28/mysql-connector-java-8.0.28.jar /opt/hive/lib/ 206 | ``` 207 | 208 | ### **5. Reinitialize Hive Metastore** 209 | ```bash 210 | /opt/hive/bin/schematool -initSchema -dbType mysql 211 | ``` 212 | 213 | --- 214 | 215 | ## **Hive is Now Ready to Use! 🚀** 216 | With this setup, Hive is installed and ready for queries. 217 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | DOCKER_NETWORK = docker-hadoop_default 2 | ENV_FILE = hadoop.env 3 | current_branch := $(shell git rev-parse --abbrev-ref HEAD) 4 | build: 5 | docker build -t bde2020/hadoop-base:$(current_branch) ./base 6 | docker build -t bde2020/hadoop-namenode:$(current_branch) ./namenode 7 | docker build -t bde2020/hadoop-datanode:$(current_branch) ./datanode 8 | docker build -t bde2020/hadoop-resourcemanager:$(current_branch) ./resourcemanager 9 | docker build -t bde2020/hadoop-nodemanager:$(current_branch) ./nodemanager 10 | docker build -t bde2020/hadoop-historyserver:$(current_branch) ./historyserver 11 | docker build -t bde2020/hadoop-submit:$(current_branch) ./submit 12 | docker build -t bde2020/hive:$(current_branch) ./ 13 | 14 | wordcount: 15 | docker build -t hadoop-wordcount ./submit 16 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -mkdir -p /input/ 17 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -copyFromLocal -f /opt/hadoop-3.2.1/README.txt /input/ 18 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} hadoop-wordcount 19 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -cat /output/* 20 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -rm -r /output 21 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -rm -r /input 22 | -------------------------------------------------------------------------------- /Multi-Node Cluster on Ubuntu 24.04 (VMware).md: -------------------------------------------------------------------------------- 1 | # **Complete Guide: Install Hadoop Multi-Node Cluster on Ubuntu 24.04 (VMware)** 2 | This guide covers installing 
and configuring **Hadoop 3.3.6** on **two Ubuntu 24.04 virtual machines** inside **VMware Workstation**. 3 | 4 | ## **Prerequisites** 5 | 1. **Two Ubuntu 24.04 VMs** running in **VMware Workstation**. 6 | 2. **At least 4GB RAM & 50GB disk space per VM**. 7 | 3. **Static IPs for both VMs**. 8 | 4. **Java 8 or later installed**. 9 | 10 | --- 11 | 12 | # **Step 1: Configure Static IPs for Both VMs** 13 | ### **1. Check Network Interface Name** 14 | On **both VMs**, open Terminal and run: 15 | ```bash 16 | ip a 17 | ``` 18 | Find your network interface (e.g., `ens33` or `eth0`). 19 | 20 | ### **2. Edit Netplan Configuration** 21 | Run: 22 | ```bash 23 | sudo nano /etc/netplan/00-installer-config.yaml 24 | ``` 25 | For the **Master Node** (VM 1): 26 | ```yaml 27 | network: 28 | version: 2 29 | renderer: networkd 30 | ethernets: 31 | ens33: 32 | dhcp4: no 33 | addresses: 34 | - 192.168.1.100/24 35 | gateway4: 192.168.1.1 36 | nameservers: 37 | addresses: 38 | - 8.8.8.8 39 | - 8.8.4.4 40 | ``` 41 | For the **Worker Node** (VM 2): 42 | ```yaml 43 | network: 44 | version: 2 45 | renderer: networkd 46 | ethernets: 47 | ens33: 48 | dhcp4: no 49 | addresses: 50 | - 192.168.1.101/24 51 | gateway4: 192.168.1.1 52 | nameservers: 53 | addresses: 54 | - 8.8.8.8 55 | - 8.8.4.4 56 | ``` 57 | ### **3. Apply Changes** 58 | ```bash 59 | sudo netplan apply 60 | ip a # Verify new IP 61 | ``` 62 | 63 | --- 64 | 65 | # **Step 2: Install Java on Both VMs** 66 | Hadoop requires Java. Install **OpenJDK 11**: 67 | ```bash 68 | sudo apt update && sudo apt install openjdk-11-jdk -y 69 | ``` 70 | Verify installation: 71 | ```bash 72 | java -version 73 | ``` 74 | Expected output: 75 | ``` 76 | openjdk version "11.0.20" 2024-XX-XX 77 | ``` 78 | 79 | --- 80 | 81 | # **Step 3: Create Hadoop User on Both VMs** 82 | ```bash 83 | sudo adduser hadoop 84 | sudo usermod -aG sudo hadoop 85 | su - hadoop 86 | ``` 87 | 88 | --- 89 | 90 | # **Step 4: Configure SSH Access** 91 | 1. **Install SSH on Both VMs**: 92 | ```bash 93 | sudo apt install ssh -y 94 | ``` 95 | 2. **Generate SSH Keys on Master Node**: 96 | ```bash 97 | ssh-keygen -t rsa -P "" 98 | ``` 99 | 3. **Copy SSH Key to Worker Node**: 100 | ```bash 101 | ssh-copy-id hadoop@192.168.1.101 102 | ``` 103 | 4. **Test SSH Connection from Master to Worker**: 104 | ```bash 105 | ssh hadoop@192.168.1.101 106 | ``` 107 | It should log in without asking for a password. 108 | 109 | --- 110 | 111 | # **Step 5: Download and Install Hadoop** 112 | Perform the following steps **on both VMs**. 113 | 114 | ### **1. Download Hadoop** 115 | ```bash 116 | wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz 117 | tar -xvzf hadoop-3.3.6.tar.gz 118 | sudo mv hadoop-3.3.6 /usr/local/hadoop 119 | ``` 120 | 121 | ### **2. Set Environment Variables** 122 | Edit `~/.bashrc`: 123 | ```bash 124 | nano ~/.bashrc 125 | ``` 126 | Add: 127 | ```bash 128 | # Hadoop Environment Variables 129 | export HADOOP_HOME=/usr/local/hadoop 130 | export HADOOP_INSTALL=$HADOOP_HOME 131 | export HADOOP_MAPRED_HOME=$HADOOP_HOME 132 | export HADOOP_COMMON_HOME=$HADOOP_HOME 133 | export HADOOP_HDFS_HOME=$HADOOP_HOME 134 | export HADOOP_YARN_HOME=$HADOOP_HOME 135 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 136 | export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin 137 | ``` 138 | Save and apply: 139 | ```bash 140 | source ~/.bashrc 141 | ``` 142 | 143 | --- 144 | 145 | # **Step 6: Configure Hadoop** 146 | ## **1. 
Configure `hadoop-env.sh`** 147 | Edit: 148 | ```bash 149 | nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh 150 | ``` 151 | Set Java path: 152 | ```bash 153 | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 154 | ``` 155 | 156 | --- 157 | 158 | ## **2. Configure Core Site (`core-site.xml`)** 159 | Edit: 160 | ```bash 161 | nano $HADOOP_HOME/etc/hadoop/core-site.xml 162 | ``` 163 | Replace with: 164 | ```xml 165 | 166 | 167 | fs.defaultFS 168 | hdfs://master:9000 169 | 170 | 171 | ``` 172 | 173 | --- 174 | 175 | ## **3. Configure HDFS (`hdfs-site.xml`)** 176 | Edit: 177 | ```bash 178 | nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml 179 | ``` 180 | Add: 181 | ```xml 182 | 183 | 184 | dfs.replication 185 | 2 186 | 187 | 188 | dfs.name.dir 189 | file:///usr/local/hadoop/hdfs/namenode 190 | 191 | 192 | dfs.data.dir 193 | file:///usr/local/hadoop/hdfs/datanode 194 | 195 | 196 | ``` 197 | Create necessary directories: 198 | ```bash 199 | mkdir -p /usr/local/hadoop/hdfs/namenode 200 | mkdir -p /usr/local/hadoop/hdfs/datanode 201 | sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs 202 | ``` 203 | 204 | --- 205 | 206 | ## **4. Configure MapReduce (`mapred-site.xml`)** 207 | Copy template: 208 | ```bash 209 | cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml 210 | ``` 211 | Edit: 212 | ```bash 213 | nano $HADOOP_HOME/etc/hadoop/mapred-site.xml 214 | ``` 215 | Add: 216 | ```xml 217 | 218 | 219 | mapreduce.framework.name 220 | yarn 221 | 222 | 223 | ``` 224 | 225 | --- 226 | 227 | ## **5. Configure YARN (`yarn-site.xml`)** 228 | Edit: 229 | ```bash 230 | nano $HADOOP_HOME/etc/hadoop/yarn-site.xml 231 | ``` 232 | Add: 233 | ```xml 234 | 235 | 236 | yarn.nodemanager.aux-services 237 | mapreduce_shuffle 238 | 239 | 240 | ``` 241 | 242 | --- 243 | 244 | # **Step 7: Set Up Master and Worker Nodes** 245 | ## **1. Edit Hosts File on Both VMs** 246 | ```bash 247 | sudo nano /etc/hosts 248 | ``` 249 | Add: 250 | ``` 251 | 192.168.1.100 master 252 | 192.168.1.101 worker1 253 | ``` 254 | 255 | ## **2. Define Workers on Master Node** 256 | On the **Master Node**, edit: 257 | ```bash 258 | nano $HADOOP_HOME/etc/hadoop/workers 259 | ``` 260 | Add: 261 | ``` 262 | worker1 263 | ``` 264 | 265 | --- 266 | 267 | # **Step 8: Start Hadoop Cluster** 268 | ## **1. Format Namenode (Master Only)** 269 | ```bash 270 | hdfs namenode -format 271 | ``` 272 | 273 | ## **2. Start Hadoop Services (Master Only)** 274 | ```bash 275 | start-dfs.sh 276 | start-yarn.sh 277 | ``` 278 | Check running services: 279 | ```bash 280 | jps 281 | ``` 282 | Expected output: 283 | ``` 284 | NameNode 285 | DataNode 286 | ResourceManager 287 | NodeManager 288 | ``` 289 | 290 | --- 291 | 292 | # **Step 9: Verify Hadoop Cluster** 293 | ## **Check Web UI** 294 | 1. **HDFS Web UI**: 295 | 📌 **http://master:9870/** 296 | 2. **YARN Resource Manager**: 297 | 📌 **http://master:8088/** 298 | 299 | --- 300 | 301 | # **Step 10: Stop Hadoop** 302 | To stop services: 303 | ```bash 304 | stop-dfs.sh 305 | stop-yarn.sh 306 | ``` 307 | 308 | --- 309 | 310 | # **Conclusion** 311 | You have successfully set up a **Hadoop multi-node cluster** on **two Ubuntu 24.04 VMs** inside **VMware Workstation**! 
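As a final sanity check, you can confirm from the master node that the worker actually registered with both HDFS and YARN; a quick sketch (assuming the `master`/`worker1` hostnames configured above):

```bash
hdfs dfsadmin -report  # lists the DataNodes currently registered with the NameNode
yarn node -list        # lists the NodeManagers currently registered with the ResourceManager
```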
312 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![image](https://github.com/user-attachments/assets/2b0a8b29-8287-446a-8a0c-8c1820ea0971) ![image](https://github.com/user-attachments/assets/343cfd7e-73b7-4eb2-a9a4-76c31f5703c8).![image](https://github.com/user-attachments/assets/04ad8a37-c3a0-4e62-a5c4-70c023992209)![image](https://github.com/user-attachments/assets/5a5fc24a-bc9d-4cc2-aab4-b651c59197d5)![image](https://github.com/user-attachments/assets/10b26b1e-614f-4ad7-966c-505e54825680) 2 | 3 | 4 | 5 | # Docker Multi-Container Environment with Hadoop, Spark, and Hive 6 | 7 | This guide helps you set up a multi-container environment using Docker for Hadoop (HDFS), Spark, and Hive. The setup is lightweight, without the large memory requirements of a Cloudera sandbox. 8 | 9 | ## **Prerequisites** 10 | 11 | Before you begin, ensure you have the following installed: 12 | 13 | - **Docker**: [Install Docker Desktop for Windows](https://docs.docker.com/desktop/setup/install/windows-install/) 14 | 15 | - IMPORTANT: 16 | ******- Enable the "Expose daemon on tcp://localhost:2375 without TLS" option if you're using Docker Desktop for compatibility.****** 17 | 18 | ![image](https://github.com/user-attachments/assets/398451cd-46bb-4ba8-876f-9e85f8c0d632) 19 | 20 | 21 | - **Git**: [Download Git](https://git-scm.com/downloads/win) 22 | - Git is used to download the required files from a repository. 23 | 24 | Create a newfolder and open it in terminal or go inside it using CD Command 25 | 26 | ![image](https://github.com/user-attachments/assets/28602a4b-52e2-4265-bfb5-a08301fda7b8) 27 | 28 | 29 | ## **Step 1: Clone the Repository** 30 | 31 | First, clone the GitHub repository that contains the necessary Docker setup files. 32 | 33 | ```bash 34 | git clone https://github.com/lovnishverma/bigdataecosystem.git 35 | ``` 36 | 37 | [or Directly download zip from my repo](https://github.com/lovnishverma/BigDataecosystem) 38 | 39 | Navigate to the directory: 40 | 41 | ```bash 42 | cd bigdataecosystem 43 | ``` 44 | 45 | ![image](https://github.com/user-attachments/assets/e4d6a8ab-3f36-424a-bf13-9402bc1c13a2) 46 | 47 | if downloaded zip than cd bigdataecosystem-main 48 | 49 | ## **Step 2: Start the Cluster** 50 | 51 | Use Docker Compose to start the containers in the background. 52 | 53 | ```bash 54 | docker-compose up -d 55 | ``` 56 | 57 | This command will launch the Hadoop, Spark, and Hive containers. 
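If a container fails to start, its logs are the first place to look; a quick check (the service name `namenode` below matches the container name used later in this guide and may differ if you edited `docker-compose.yml`):

```bash
docker-compose logs -f namenode   # follow the NameNode's startup logs; press Ctrl+C to stop following
```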
58 | 59 | ![image](https://github.com/user-attachments/assets/8dc3ec44-84af-40f2-8056-92e5f3449919) 60 | 61 | 62 | ## **Step 3: Verify Running Containers** 63 | 64 | To check if the containers are running, use the following command: 65 | 66 | ```bash 67 | docker ps 68 | ``` 69 | ![image](https://github.com/user-attachments/assets/f6897172-d14f-462a-95dd-ba46401b5dd7) 70 | 71 | 72 | ## **Step 4: Stop and Remove Containers** 73 | 74 | When you are done, stop and remove the containers with: 75 | 76 | ```bash 77 | docker-compose down 78 | ``` 79 | ![image](https://github.com/user-attachments/assets/fd1f2298-7d65-4055-a929-12de4d01c428) 80 | 81 | 82 | ### Step 5: Access the NameNode container 83 | Enter the NameNode container to interact with Hadoop: 84 | ```bash 85 | docker exec -it namenode bash 86 | ``` 87 | ** -it refers to (interactive terminal)** 88 | --- 89 | 90 | ## **Running Hadoop Code** 91 | 92 | To View NameNode UI Visit: [http://localhost:9870/](http://localhost:9870/) 93 | 94 | ![image](https://github.com/user-attachments/assets/c4f708cb-7976-49f8-ba79-8b985bcd6a10) 95 | 96 | 97 | To View Resource Manager UI Visit [http://localhost:8088/](http://localhost:8088/) 98 | 99 | ![image](https://github.com/user-attachments/assets/a65f2495-293e-440c-8366-9e1bed605b29) 100 | 101 | 102 | ### ** MAPREDUCE WordCount program** 103 | ### Step 1: Copy the `code` folder into the container 104 | Use the following command in your windows cmd to copy the `code` folder to the container: 105 | ```bash 106 | docker cp code namenode:/ 107 | ``` 108 | 109 | ![image](https://github.com/user-attachments/assets/7acdebdc-2b20-41bf-b92d-8555091d570c) 110 | 111 | 112 | ### Step 2: Locate the `data.txt` file 113 | Inside the container, navigate to the `code/input` directory where the `data.txt` file is located. 
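For example, a quick look from inside the namenode container (assuming the `code` folder was copied to `/code` in Step 1):

```bash
cd /code/input   # directory copied into the container in Step 1
ls -l            # should show data.txt
cat data.txt     # preview the input before uploading it to HDFS
```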
114 | 115 | ### Step 3: Create directories in the Hadoop file system 116 | Run the following commands to set up directories in Hadoop's file system: 117 | ```bash 118 | hdfs dfs -mkdir /user 119 | hdfs dfs -mkdir /user/root 120 | hdfs dfs -mkdir /user/root/input 121 | ``` 122 | 123 | ### Step 4: Upload the `data.txt` file 124 | Copy `data.txt` into the Hadoop file system: 125 | ```bash 126 | hdfs dfs -put /code/input/data.txt /user/root/input 127 | ``` 128 | ![image](https://github.com/user-attachments/assets/31fadc17-1c8c-4621-bdee-39d818f3da2c) 129 | 130 | 131 | ### Step 5: Navigate to the directory containing the `wordCount.jar` file 132 | Return to the directory where the `wordCount.jar` file is located: 133 | ```bash 134 | cd /code/ 135 | ``` 136 | ![image](https://github.com/user-attachments/assets/4242e3b2-c954-4faf-ab75-825906eeafc5) 137 | 138 | 139 | ### Step 6: Execute the WordCount program 140 | 141 | To View NameNode UI Visit: [http://localhost:9870/](http://localhost:9870/) 142 | 143 | ![image](https://github.com/user-attachments/assets/20681490-0fcc-41dd-874a-8fe0376dc981) 144 | 145 | 146 | Run the WordCount program to process the input data: 147 | ```bash 148 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output 149 | ``` 150 | ![image](https://github.com/user-attachments/assets/2bafcdd5-be22-471c-bf9a-6b8a48d88d44) 151 | 152 | 153 | To View YARN Resource Manager UI Visit [http://localhost:8088/](http://localhost:8088/) 154 | 155 | ![image](https://github.com/user-attachments/assets/89f47e9f-c92f-456c-b89e-0e6025df80e2) 156 | 157 | ### Step 7: Display the output 158 | View the results of the WordCount program: 159 | ```bash 160 | hdfs dfs -cat /user/root/output/* 161 | ``` 162 | ![image](https://github.com/user-attachments/assets/8a20f77f-71bd-423b-a501-c9514ec9f825) 163 | 164 | --- 165 | 166 | **or** 167 | 168 | ```bash 169 | hdfs dfs -cat /user/root/output/part-r-00000 170 | ``` 171 | 172 | ![image](https://github.com/user-attachments/assets/a4ef5293-1018-4c5e-a314-91681d430715) 173 | 174 | 175 | ## **Summary** 176 | 177 | This guide simplifies setting up and running Hadoop on Docker. Each step ensures a smooth experience, even for beginners without a technical background. Follow the instructions carefully, and you’ll have a working Hadoop setup in no time! 178 | 179 | Certainly! Here’s the explanation of your **MapReduce process** using the input example `DOG CAT RAT`, `CAR CAR RAT`, and `DOG CAR CAT`. 180 | --- 181 | 182 | ## 🐾 **Input Data** 183 | 184 | The `data.txt` file contains the following lines: 185 | 186 | ``` 187 | DOG CAT RAT 188 | CAR CAR RAT 189 | DOG CAR CAT 190 | ``` 191 | 192 | This text file is processed by the **MapReduce WordCount program** to count the occurrences of each word. 193 | 194 | --- 195 | 196 | ## 💡 **What is MapReduce?** 197 | 198 | - **MapReduce** is a two-step process: 199 | 1. **Map Phase** 🗺️: Splits the input into key-value pairs. 200 | 2. **Reduce Phase** ➕: Combines the key-value pairs to produce the final result. 201 | 202 | It's like dividing a big task (word counting) into smaller tasks and then combining the results. 🧩 203 | 204 | --- 205 | 206 | ## 🔄 **How MapReduce Works in Your Example** 207 | 208 | ### **1. Map Phase** 🗺️ 209 | 210 | The mapper processes each line of the input file, splits it into words, and assigns each word a count of `1`. 
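Before walking through the per-line example below, note that the same map → shuffle → reduce idea can be sketched with ordinary Unix tools; this is purely a local illustration of the concept, not something Hadoop runs:

```bash
# "map": emit one word per line; "shuffle/sort": group identical words; "reduce": count each group
tr -s ' ' '\n' < data.txt | sort | uniq -c
```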
211 | 212 | For example: 213 | ``` 214 | DOG CAT RAT -> (DOG, 1), (CAT, 1), (RAT, 1) 215 | CAR CAR RAT -> (CAR, 1), (CAR, 1), (RAT, 1) 216 | DOG CAR CAT -> (DOG, 1), (CAR, 1), (CAT, 1) 217 | ``` 218 | 219 | **Mapper Output**: 220 | ``` 221 | (DOG, 1), (CAT, 1), (RAT, 1) 222 | (CAR, 1), (CAR, 1), (RAT, 1) 223 | (DOG, 1), (CAR, 1), (CAT, 1) 224 | ``` 225 | 226 | --- 227 | 228 | ### **2. Shuffle and Sort Phase** 🔄 229 | 230 | This step groups all values for the same key (word) together and sorts them. 231 | 232 | For example: 233 | ``` 234 | (CAR, [1, 1, 1]) 235 | (CAT, [1, 1]) 236 | (DOG, [1, 1]) 237 | (RAT, [1, 1]) 238 | ``` 239 | 240 | --- 241 | 242 | ### **3. Reduce Phase** ➕ 243 | 244 | The reducer sums up the counts for each word to get the total number of occurrences. 245 | 246 | **Reducer Output**: 247 | ``` 248 | CAR 3 🏎️ 249 | CAT 2 🐱 250 | DOG 2 🐶 251 | RAT 2 🐭 252 | ``` 253 | 254 | --- 255 | 256 | ### **Final Output** 📋 257 | 258 | The final word count is saved in the HDFS output directory. You can view it using: 259 | ```bash 260 | hdfs dfs -cat /user/root/output/* 261 | ``` 262 | 263 | **Result**: 264 | ``` 265 | CAR 3 266 | CAT 2 267 | DOG 2 268 | RAT 2 269 | ``` 270 | 271 | --- 272 | 273 | ## 🗂️ **HDFS Commands You Used** 274 | 275 | Here are the basic HDFS commands you used and their purpose: 276 | 277 | 1. **Upload a file to HDFS** 📤: 278 | ```bash 279 | hdfs dfs -put data.txt /user/root/input 280 | ``` 281 | - **What it does**: Uploads `data.txt` to the HDFS directory `/user/root/input`. 282 | - **Output**: No output, but the file is now in HDFS. 283 | 284 | 2. **List files in a directory** 📁: 285 | ```bash 286 | hdfs dfs -ls /user/root/input 287 | ``` 288 | - **What it does**: Lists all files in the `/user/root/input` directory. 289 | - **Output**: Something like this: 290 | ``` 291 | Found 1 items 292 | -rw-r--r-- 1 root supergroup 50 2024-12-12 /user/root/input/data.txt 293 | ``` 294 | 295 | 3. **View the contents of a file** 📄: 296 | ```bash 297 | hdfs dfs -cat /user/root/input/data.txt 298 | ``` 299 | - **What it does**: Displays the contents of the `data.txt` file in HDFS. 300 | - **Output**: 301 | ``` 302 | DOG CAT RAT 303 | CAR CAR RAT 304 | DOG CAR CAT 305 | ``` 306 | 307 | 4. **Run the MapReduce Job** 🚀: 308 | ```bash 309 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output 310 | ``` 311 | - **What it does**: Runs the WordCount program on the input directory and saves the result in the output directory. 312 | 313 | 5. **View the final output** 📊: 314 | ```bash 315 | hdfs dfs -cat /user/root/output/* 316 | ``` 317 | - **What it does**: Displays the word count results. 318 | - **Output**: 319 | ``` 320 | CAR 3 321 | CAT 2 322 | DOG 2 323 | RAT 2 324 | ``` 325 | 326 | --- 327 | 328 | ## 🛠️ **How You Utilized MapReduce** 329 | 330 | 1. **Input**: 331 | You uploaded a small text file (`data.txt`) to HDFS. 332 | 333 | 2. **Process**: 334 | The `WordCount` program processed the file using MapReduce: 335 | - The **mapper** broke the file into words and counted each occurrence. 336 | - The **reducer** aggregated the counts for each word. 337 | 338 | 3. **Output**: 339 | The results were saved in HDFS and displayed using the `cat` command. 
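If you also want a copy of the results outside HDFS (for example, to open them on your own machine), a minimal sketch (the local paths below are arbitrary):

```bash
# Inside the namenode container: copy the output directory out of HDFS
hdfs dfs -get /user/root/output /tmp/wordcount-output

# From your host terminal: copy it out of the container
docker cp namenode:/tmp/wordcount-output ./wordcount-output
```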
340 | 341 | --- 342 | 343 | ## 🧩 **Visualization of the Entire Process** 344 | 345 | ### **Input** (HDFS file): 346 | ``` 347 | DOG CAT RAT 348 | CAR CAR RAT 349 | DOG CAR CAT 350 | ``` 351 | 352 | ### **Map Phase Output** 🗺️: 353 | ``` 354 | (DOG, 1), (CAT, 1), (RAT, 1) 355 | (CAR, 1), (CAR, 1), (RAT, 1) 356 | (DOG, 1), (CAR, 1), (CAT, 1) 357 | ``` 358 | 359 | ### **Shuffle & Sort** 🔄: 360 | ``` 361 | (CAR, [1, 1, 1]) 362 | (CAT, [1, 1]) 363 | (DOG, [1, 1]) 364 | (RAT, [1, 1]) 365 | ``` 366 | 367 | ### **Reduce Phase Output** ➕: 368 | ``` 369 | CAR 3 370 | CAT 2 371 | DOG 2 372 | RAT 2 373 | ``` 374 | 375 | --- 376 | 377 | ![image](https://github.com/user-attachments/assets/a037fc47-7639-48b8-b3f7-5d9f2d5c51ac) 378 | 379 | ### 🔑 **Key Takeaways** 380 | - **MapReduce** splits the task into small, manageable pieces and processes them in parallel. 381 | - It’s ideal for large datasets but works the same for smaller ones (like your example). 382 | - Hadoop is designed for distributed systems, making it powerful for big data processing. 383 | 384 | 385 | 386 | 387 | 388 | 389 | ### . **Stopping the Containers** 390 | To stop the Docker containers when done: 391 | ```bash 392 | docker-compose down 393 | ``` 394 | This will stop and remove the containers and networks created by `docker-compose up`. 395 | 396 | ### 4. **Permissions Issue with Copying Files** 397 | If you face permission issues while copying files to containers ensure the correct directory permissions in Docker by using: 398 | ```bash 399 | docker exec -it namenode bash 400 | chmod -R 777 /your-directory 401 | ``` 402 | 403 | ### 5. **Additional Debugging Tips** 404 | Sometimes, containers might not start or might throw errors related to Hadoop configuration. A small troubleshooting section or references to common issues (e.g., insufficient memory for Hadoop) would be helpful. 405 | 406 | ### 6. **Final Output File Path** 407 | The output of the WordCount job will be written to `/user/root/output/` in HDFS. This is clearly explained, but you could also include a note that the output directory might need to be created beforehand to avoid errors. 408 | 409 | --- 410 | 411 | ### **Example Additions:** 412 | 413 | 1. **Network Issues:** 414 | ``` 415 | If you can't access the NameNode UI, ensure that your Docker container's ports are correctly exposed. For example, if you're running a local machine, the UI should be accessible via http://localhost:9870. 416 | ``` 417 | 418 | 2. **Stopping Containers:** 419 | ```bash 420 | docker-compose down # Stop and remove the containers 421 | ``` 422 | 423 | 3. **Permissions Fix:** 424 | ```bash 425 | docker exec -it namenode bash 426 | chmod -R 777 /your-directory # If you face any permission errors 427 | ``` 428 | 429 | 4. **Handling HDFS Directory Creation:** 430 | If `hdfs dfs -mkdir` gives an error, it may be because the directory already exists. 
Consider adding: 431 | ```bash 432 | hdfs dfs -rm -r /user/root/input # If the directory exists, remove it first 433 | hdfs dfs -mkdir /user/root/input 434 | ``` 435 | 436 | --- 437 | 438 | 😊 References 439 | 440 | https://data-flair.training/blogs/top-hadoop-hdfs-commands-tutorial/ 441 | 442 | https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html 443 | 444 | https://medium.com/@traininghub.io/hadoop-mapreduce-architecture-7e167e264595 445 | 446 | 447 | ## **Step 5: Set Up HDFS** 448 | 449 | ### **Upload Files to HDFS** 450 | 451 | To copy a file (e.g., `police.csv`) to the Hadoop cluster: 452 | 453 | 1. Copy the file into the namenode container: 454 | ```bash 455 | docker cp police.csv namenode:/police.csv 456 | ``` 457 | ![image](https://github.com/user-attachments/assets/496c7e6a-41d6-44d2-9557-b6004fe986c4) 458 | 459 | 460 | 2. Access the namenode container's bash shell: 461 | ```bash 462 | docker exec -it namenode bash 463 | ``` 464 | ![image](https://github.com/user-attachments/assets/d501a9b3-d2d9-4e2d-aecb-8e3eb7ccf678) 465 | 466 | 467 | 3. Create a directory in HDFS and upload the file: 468 | ```bash 469 | hdfs dfs -mkdir -p /data/crimerecord/police 470 | hdfs dfs -put /police.csv /data/crimerecord/police/ 471 | ``` 472 | ![image](https://github.com/user-attachments/assets/ab68bba9-92f2-4b15-a50e-f3ee1a0f998e) 473 | 474 | 475 | 476 | ![image](https://github.com/user-attachments/assets/6b27db66-a111-4c2f-a701-2cef8aaa3344) 477 | 478 | 479 | ### **Start Spark Shell** 480 | 481 | To interact with Spark, start the Spark shell in the master container: 482 | 483 | ```bash 484 | docker exec -it spark-master bash 485 | 486 | spark/bin/spark-shell --master spark://spark-master:7077 487 | ``` 488 | ### **Access the Spark Master UI** 489 | 490 | - Open `http://localhost:8080` in your web browser to view the Spark Master UI. 491 | - **You can monitor processes here** 492 | 493 | - ![image](https://github.com/user-attachments/assets/8fa7e525-d601-4dad-b5b4-0477d47ec4dd) 494 | 495 | 496 | ![image](https://github.com/user-attachments/assets/45765d5e-b1e7-4726-a60c-ddd5dd278c93) 497 | 498 | ![image](https://github.com/user-attachments/assets/b071335b-4928-491a-8bed-321995881d83) 499 | 500 | # **Working with Apache Spark** 501 | 502 | ## **1. Introduction to Apache Spark** 503 | 504 | - **Overview**: Apache Spark is an open-source distributed computing system known for its speed, ease of use, and general-purpose capabilities for big data processing. 505 | 506 | - **Key Features**: 507 | - Fast processing using in-memory computation. 508 | - Supports multiple languages: Scala, Python, Java, and R. 509 | - Unified framework for batch and streaming data processing. 510 | 511 | --- 512 | 513 | ## **2. Introduction to DataFrames** 514 | 515 | - **What are DataFrames?** 516 | - Distributed collections of data organized into named columns, similar to a table in a database or a DataFrame in Python's pandas. 517 | - Optimized for processing large datasets using Spark SQL. 518 | 519 | - **Key Operations**: 520 | - Creating DataFrames from structured data sources (CSV, JSON, Parquet, etc.). 521 | - Performing transformations and actions on the data. 522 | 523 | --- 524 | 525 | ## **3. Introduction to Scala for Apache Spark** 526 | 527 | - **Why Scala?** 528 | - Apache Spark is written in Scala, offering the best compatibility and performance. 529 | - Concise syntax and functional programming support. 
530 | 531 | - **Basic Syntax**: 532 | 533 | ```scala 534 | val numbers = List(1, 2, 3, 4, 5) // Creates a list of numbers. 535 | val doubled = numbers.map(_ * 2) // Doubles each element in the list using map. 536 | println(doubled) // Prints the doubled list. 537 | ``` 538 | The output will be: 539 | List(2, 4, 6, 8, 10) 540 | 541 | --- 542 | 543 | ## **4. Spark SQL** 544 | 545 | - **Need for Spark SQL**: 546 | - Provides a declarative interface to query structured data using SQL-like syntax. 547 | - Supports seamless integration with other Spark modules. 548 | - Allows for optimization through Catalyst Optimizer. 549 | 550 | - **Key Components**: 551 | - SQL Queries on DataFrames and temporary views. 552 | - Hive integration for legacy SQL workflows. 553 | - Support for structured data sources. 554 | 555 | --- 556 | ## **5. Hands-On: Spark SQL** 557 | 558 | ### **Objective**: 559 | To create DataFrames, load data from different sources, and perform transformations and SQL queries. 560 | 561 | 562 | #### **Step 1: Create DataFrames** 563 | 564 | ```scala 565 | val data = Seq( 566 | ("Alice", 30, "HR"), 567 | ("Bob", 25, "Engineering"), 568 | ("Charlie", 35, "Finance") 569 | ) 570 | 571 | val df = data.toDF("Name", "Age", "Department") 572 | 573 | df.show() 574 | ``` 575 | ![image](https://github.com/user-attachments/assets/06c2c14f-cf8e-4b38-8944-7844e75ee5d6) 576 | 577 | 578 | #### **Step 3: Perform Transformations Using Spark SQL** 579 | 580 | ```scala 581 | df.createOrReplaceTempView("employees") 582 | val result = spark.sql("SELECT Department, COUNT(*) as count FROM employees GROUP BY Department") 583 | result.show() 584 | ``` 585 | ![image](https://github.com/user-attachments/assets/c9125138-63dd-4c29-82c4-6d04bc531508) 586 | 587 | 588 | #### **Step 4: Save Transformed Data** 589 | 590 | ```scala 591 | result.write.option("header", "true").csv("hdfs://namenode:9000/output_employees") 592 | ``` 593 | 594 | Reading from HDFS: 595 | Once the data is written to HDFS, you can read it back into Spark using: 596 | 597 | ```scala 598 | val outputDF = spark.read.option("header", "true").csv("hdfs://namenode:9000/output_employees") 599 | ``` 600 | 601 | View output_employees.csv from HDFS 602 | 603 | ```scala 604 | outputDF.show() 605 | ``` 606 | ![image](https://github.com/user-attachments/assets/a4bb7af6-2ee6-485f-a306-371165e5bf37) 607 | 608 | 609 | #### **Step 5: Load Data from HDFS** 610 | 611 | ```scala 612 | // Load CSV from HDFS 613 | val df = spark.read.option("header", "false").csv("hdfs://namenode:9000/data/crimerecord/police/police.csv") 614 | df.show() 615 | ``` 616 | 617 | ![image](https://github.com/user-attachments/assets/f6dfde78-f44a-4554-9c0f-f11cb9173e6c) 618 | 619 | 620 | #### **Step 6: Scala WordCount using Apache Spark** 621 | 622 | 623 | ### Docker Command to Copy File 624 | *Copy File**: Use `docker cp` to move or create the file inside the namenode Docker container. 625 | Use the following command to copy the `data.txt` file from your local system to the Docker container: 626 | 627 | ```bash 628 | docker cp data.txt nodemanager:/data.txt 629 | ``` 630 | ![image](https://github.com/user-attachments/assets/73a84d9a-af1c-45f0-9504-a24b192e598d) 631 | 632 | *Copy File to HDFS**: Use `hdfs dfs -put` to move the file inside the HDFS filesystem. 
633 | Use the following command to put the `data.txt` file from your Docker container to HDFS: 634 | 635 | ```bash 636 | hdfs dfs -mkdir /data 637 | hdfs dfs -put data.txt /data 638 | ``` 639 | ![image](https://github.com/user-attachments/assets/b4d93f36-f1b1-4056-a4af-d4dbb418634e) 640 | 641 | **Scala WordCount program.** 642 | 643 | **WordCount Program**: The program reads the file, splits it into words, and counts the occurrences of each word. 644 | 645 | ```scala 646 | import org.apache.spark.{SparkConf} 647 | val conf = new SparkConf().setAppName("WordCountExample").setMaster("local") 648 | val input = sc.textFile("hdfs://namenode:9000/data/data.txt") 649 | val wordPairs = input.flatMap(line => line.split(" ")).map(word => (word, 1)) 650 | val wordCounts = wordPairs.reduceByKey((a, b) => a + b) 651 | wordCounts.collect().foreach { case (word, count) => 652 | println(s"$word: $count") 653 | } 654 | ``` 655 | 656 | **Output**: The word counts will be printed to the console when the program is executed. 657 | 658 | ![image](https://github.com/user-attachments/assets/428e0d99-f0e0-4edd-8f3c-4543130c8a47) 659 | 660 | 661 | **Stop Session**: 662 | 663 | ```scala 664 | sc.stop() 665 | ``` 666 | 667 | --- 668 | 669 | ## **6. Key Takeaways** 670 | 671 | - Spark SQL simplifies working with structured data. 672 | - DataFrames provide a flexible and powerful API for handling large datasets. 673 | - Apache Spark is a versatile tool for distributed data processing, offering scalability and performance. 674 | 675 | --- 676 | 677 | 678 | ![image](https://github.com/user-attachments/assets/fada1eec-5349-4382-8d1a-96940c124064) 679 | 680 | ## **Step 7: Set Up Hive** 681 | 682 | ### **Start Hive Server** 683 | 684 | Access the Hive container and start the Hive Server: 685 | 686 | ```bash 687 | docker exec -it hive-server bash 688 | ``` 689 | 690 | ```bash 691 | hive 692 | ``` 693 | 694 | Check if Hive is listening on port 10000: 695 | ![image](https://github.com/user-attachments/assets/dc1e78d4-d903-4ac5-9eaa-eff0b893d6fb) 696 | 697 | 698 | ```bash 699 | netstat -anp | grep 10000 700 | ``` 701 | ![image](https://github.com/user-attachments/assets/9ac08fd3-f515-448d-83b3-c620fa3b15c2) 702 | 703 | 704 | ### **Connect to Hive Server** 705 | 706 | Use Beeline to connect to the Hive server: 707 | 708 | ```bash 709 | beeline -u jdbc:hive2://localhost:10000 -n root 710 | ``` 711 | ![image](https://github.com/user-attachments/assets/d2dce309-0334-4a64-b8df-8cb6206b1432) 712 | 713 | 714 | Alternatively, use the following command for direct connection: 715 | 716 | ```bash 717 | beeline 718 | ``` 719 | 720 | ```bash 721 | !connect jdbc:hive2://127.0.0.1:10000 scott tiger 722 | ``` 723 | 724 | ![image](https://github.com/user-attachments/assets/77fadb1f-118e-4d15-8a78-e9783baa9690) 725 | 726 | 727 | ### **Create Database and Table in Hive** 728 | 729 | 1. Create a new Hive database: 730 | ```sql 731 | CREATE DATABASE punjab_police; 732 | USE punjab_police; 733 | ``` 734 | ![image](https://github.com/user-attachments/assets/73227817-b2d5-4df0-a392-6927750d7220) 735 | 736 | 737 | 2. 
Create a table based on the schema of the `police.csv` dataset: 738 | ```sql 739 | CREATE TABLE police_data ( 740 | Crime_ID INT, 741 | Crime_Type STRING, 742 | Location STRING, 743 | Reported_Date STRING, 744 | Status STRING 745 | ) 746 | ROW FORMAT DELIMITED 747 | FIELDS TERMINATED BY ',' 748 | STORED AS TEXTFILE; 749 | ``` 750 | ![image](https://github.com/user-attachments/assets/13faa21a-5242-4f1e-bd69-4d98dc318400) 751 | 752 | 753 | 3. Load the data into the Hive table: 754 | ```sql 755 | LOAD DATA INPATH '/data/crimerecord/police/police.csv' INTO TABLE police_data; 756 | ``` 757 | ![image](https://github.com/user-attachments/assets/e0fcbe55-d5fd-4a8c-a17b-df888204915f) 758 | 759 | 760 | ### **Query the Data in Hive** 761 | 762 | Run SQL queries to analyze the data in Hive: 763 | 764 | 1. **View the top 10 rows:** 765 | ```sql 766 | SELECT * FROM police_data LIMIT 10; 767 | ``` 768 | ![image](https://github.com/user-attachments/assets/6f189765-24f4-47db-ad70-42fbcfb4068e) 769 | 770 | 771 | 2. **Count total crimes:** 772 | ```sql 773 | SELECT COUNT(*) AS Total_Crimes FROM police_data; 774 | ``` 775 | ![image](https://github.com/user-attachments/assets/8b56a8b5-6b0b-4306-82da-4cce52b50e95) 776 | 777 | 778 | 3. **Find most common crime types:** 779 | ```sql 780 | SELECT Crime_Type, COUNT(*) AS Occurrences 781 | FROM police_data 782 | GROUP BY Crime_Type 783 | ORDER BY Occurrences DESC; 784 | ``` 785 | 786 | ![image](https://github.com/user-attachments/assets/54f000f7-36ec-4672-8bc6-996ac7b4004b) 787 | 788 | 789 | 4. **Identify locations with the highest crime rates:** 790 | ```sql 791 | SELECT Location, COUNT(*) AS Total_Crimes 792 | FROM police_data 793 | GROUP BY Location 794 | ORDER BY Total_Crimes DESC; 795 | ``` 796 | ![image](https://github.com/user-attachments/assets/fb418097-97ff-46aa-941a-4b72a0702d3d) 797 | 798 | 799 | 5. **Find unresolved cases:** 800 | ```sql 801 | SELECT Status, COUNT(*) AS Count 802 | FROM police_data 803 | WHERE Status != 'Closed' 804 | GROUP BY Status; 805 | ``` 806 | ![image](https://github.com/user-attachments/assets/9b3b32df-38c9-45bd-85dc-c4ac2b16b246) 807 | 808 | 809 | **********There you go: your private Hive server to play with.********** 810 | 811 | show databases; 812 | 813 | ![image](https://github.com/user-attachments/assets/7e8e65b1-cb98-41e2-b655-ddf941b614d5) 814 | 815 | #### **📂 Part 2: Creating a Simple Hive Project** 816 | 817 | --- 818 | 819 | ##### **🎯 Objective** 820 | We will: 821 | 1. Create a database. 822 | 2. Create a table inside the database. 823 | 3. Load data into the table. 824 | 4. Run queries to retrieve data. 825 | 826 | --- 827 | 828 | ##### **💾 Step 1: Create a Database** 829 | In the Beeline CLI: 830 | ```sql 831 | CREATE DATABASE mydb; 832 | USE mydb; 833 | ``` 834 | - 📝 *`mydb` is the name of the database. Replace it with your preferred name.* 835 | 836 | --- 837 | 838 | ##### **📋 Step 2: Create a Table** 839 | Still in the Beeline CLI, create a simple table: 840 | ```sql 841 | CREATE TABLE employees ( 842 | id INT, 843 | name STRING, 844 | age INT 845 | ); 846 | ``` 847 | - This creates a table named `employees` with columns `id`, `name`, and `age`. 
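To double-check the table definition before inserting data, you can also run the statements non-interactively through Beeline; a small sketch (reusing the connection string from earlier in this guide):

```bash
beeline -u jdbc:hive2://localhost:10000 -n root \
  -e "USE mydb; SHOW TABLES; DESCRIBE employees;"
```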
848 | 849 | --- 850 | 851 | ##### **📥 Step 3: Insert Data into the Table** 852 | Insert sample data into your table: 853 | ```sql 854 | INSERT INTO employees VALUES (1, 'Prince', 30); 855 | INSERT INTO employees VALUES (2, 'Ram Singh', 25); 856 | ``` 857 | 858 | --- 859 | 860 | ##### **🔍 Step 4: Query the Table** 861 | Retrieve data from your table: 862 | ```sql 863 | SELECT * FROM employees; 864 | ``` 865 | - Output: 866 | 867 | ![image](https://github.com/user-attachments/assets/63529cb9-c74d-453e-a4d7-9f176762a8bc) 868 | 869 | 870 | ``` 871 | +----+----------+-----+ 872 | | id | name | age | 873 | +----+----------+-----+ 874 | | 2 | Ram Singh | 25 | 875 | | 1 | Prince | 30 | 876 | +----+----------+-----+ 877 | ``` 878 | 879 | --- 880 | 881 | #### **🌟 Tips & Knowledge** 882 | 883 | 1. **What is Hive?** 884 | - Hive is a data warehouse tool on top of Hadoop. 885 | - It allows SQL-like querying over large datasets. 886 | 887 | 2. **Why Docker for Hive?** 888 | - Simplifies setup by avoiding manual configurations. 889 | - Provides a pre-configured environment for running Hive. 890 | 891 | 3. **Beeline CLI**: 892 | - A lightweight command-line tool for running Hive queries. 893 | 894 | 4. **Use Cases**: 895 | - **Data Analysis**: Run analytics on large datasets. 896 | - **ETL**: Extract, Transform, and Load data into your Hadoop ecosystem. 897 | 898 | --- 899 | 900 | #### **🎉 You're Ready!** 901 | You’ve successfully: 902 | 1. Set up Apache Hive. 903 | 2. Created and queried a sample project. 🐝 904 | 905 | ### **🐝 Apache Hive Basic Commands** 906 | 907 | Here is a collection of basic Apache Hive commands with explanations that can help you while working with Hive: 908 | 909 | --- 910 | 911 | #### **1. Database Commands** 912 | 913 | - **Show Databases:** 914 | Displays all the databases available in your Hive environment. 915 | ```sql 916 | SHOW DATABASES; 917 | ``` 918 | 919 | - **Create a Database:** 920 | Create a new database. 921 | ```sql 922 | CREATE DATABASE ; 923 | ``` 924 | Example: 925 | ```sql 926 | CREATE DATABASE mydb; 927 | ``` 928 | In Hive, you can find out which database you are currently using by running the following command: 929 | 930 | ```sql 931 | SELECT current_database(); 932 | ``` 933 | 934 | This will return the name of the database that is currently in use. 935 | 936 | Alternatively, you can use this command: 937 | 938 | ```sql 939 | USE database_name; 940 | ``` 941 | 942 | If you want to explicitly switch to a specific database or verify the database context, you can use this command before running your queries. 943 | 944 | - **Use a Database:** 945 | Switch to the specified database. 946 | ```sql 947 | USE ; 948 | ``` 949 | Example: 950 | ```sql 951 | USE mydb; 952 | ``` 953 | 954 | 955 | - **Drop a Database:** 956 | Deletes a database and its associated data. 957 | ```sql 958 | DROP DATABASE ; 959 | ``` 960 | 961 | --- 962 | 963 | #### **2. Table Commands** 964 | 965 | - **Show Tables:** 966 | List all the tables in the current database. 967 | ```sql 968 | SHOW TABLES; 969 | ``` 970 | 971 | - **Create a Table:** 972 | Define a new table with specific columns. 973 | ```sql 974 | CREATE TABLE ( 975 | column_name column_type, 976 | ... 977 | ); 978 | ``` 979 | Example: 980 | ```sql 981 | CREATE TABLE employees ( 982 | id INT, 983 | name STRING, 984 | age INT 985 | ); 986 | ``` 987 | 988 | - **Describe a Table:** 989 | Get detailed information about a table, including column names and types. 
990 | ```sql 991 | DESCRIBE ; 992 | ``` 993 | 994 | - **Drop a Table:** 995 | Deletes a table and its associated data. 996 | ```sql 997 | DROP TABLE ; 998 | ``` 999 | 1000 | - **Alter a Table:** 1001 | Modify a table structure, like adding new columns. 1002 | ```sql 1003 | ALTER TABLE ADD COLUMNS ( ); 1004 | ``` 1005 | Example: 1006 | ```sql 1007 | ALTER TABLE employees ADD COLUMNS (salary DOUBLE); 1008 | ``` 1009 | 1010 | --- 1011 | 1012 | #### **3. Data Manipulation Commands** 1013 | 1014 | - **Insert Data:** 1015 | Insert data into a table. 1016 | ```sql 1017 | INSERT INTO VALUES (, , ...); 1018 | INSERT INTO employees VALUES (1, 'Prince', 30), (2, 'Ram Singh', 25), (3, 'John Doe', 28), (4, 'Jane Smith', 32); 1019 | ``` 1020 | Example: 1021 | ```sql 1022 | INSERT INTO employees VALUES (1, 'John Doe', 30); 1023 | 1024 | ``` 1025 | 1026 | - **Select Data:** 1027 | Retrieve data from a table. 1028 | ```sql 1029 | SELECT * FROM ; 1030 | ``` 1031 | 1032 | - **Update Data:** 1033 | Update existing data in a table. 1034 | ```sql 1035 | UPDATE SET = WHERE ; 1036 | ``` 1037 | 1038 | - **Delete Data:** 1039 | Delete rows from a table based on a condition. 1040 | ```sql 1041 | DELETE FROM WHERE ; 1042 | ``` 1043 | 1044 | --- 1045 | 1046 | #### **4. Querying Commands** 1047 | 1048 | - **Select Specific Columns:** 1049 | Retrieve specific columns from a table. 1050 | ```sql 1051 | SELECT , FROM ; 1052 | ``` 1053 | 1054 | - **Filtering Data:** 1055 | Filter data based on conditions using the `WHERE` clause. 1056 | ```sql 1057 | SELECT * FROM WHERE ; 1058 | ``` 1059 | Example: 1060 | ```sql 1061 | SELECT * FROM employees WHERE age > 25; 1062 | ``` 1063 | 1064 | - **Sorting Data:** 1065 | Sort the result by a column in ascending or descending order. 1066 | ```sql 1067 | SELECT * FROM ORDER BY ASC|DESC; 1068 | ``` 1069 | Example: 1070 | ```sql 1071 | SELECT * FROM employees ORDER BY age DESC; 1072 | SELECT * FROM employees ORDER BY age ASC; 1073 | ``` 1074 | 1075 | - **Group By:** 1076 | Group data by one or more columns and aggregate it using functions like `COUNT`, `AVG`, `SUM`, etc. 1077 | ```sql 1078 | SELECT , COUNT(*) FROM GROUP BY ; 1079 | ``` 1080 | Example: 1081 | ```sql 1082 | SELECT age, COUNT(*) FROM employees GROUP BY age; 1083 | ``` 1084 | 1085 | --- 1086 | 1087 | #### **5. File Format Commands** 1088 | 1089 | - **Create External Table:** 1090 | Create a table that references data stored externally (e.g., in HDFS). 1091 | ```sql 1092 | CREATE EXTERNAL TABLE ( , ...) 1093 | ROW FORMAT DELIMITED 1094 | FIELDS TERMINATED BY '' 1095 | LOCATION ''; 1096 | ``` 1097 | Example: 1098 | ```sql 1099 | CREATE EXTERNAL TABLE employees ( 1100 | id INT, 1101 | name STRING, 1102 | age INT 1103 | ) ROW FORMAT DELIMITED 1104 | FIELDS TERMINATED BY ',' 1105 | LOCATION '/user/hive/warehouse/employees'; 1106 | ``` 1107 | 1108 | - **Load Data into Table:** 1109 | Load data from a file into an existing Hive table. 1110 | ```sql 1111 | LOAD DATA LOCAL INPATH '' INTO TABLE ; 1112 | ``` 1113 | 1114 | --- 1115 | 1116 | #### **6. Other Useful Commands** 1117 | 1118 | - **Show Current User:** 1119 | Display the current user running the Hive session. 1120 | ```sql 1121 | !whoami; 1122 | ``` 1123 | 1124 | - **Exit Hive:** 1125 | Exit from the Hive shell. 1126 | ```sql 1127 | EXIT; 1128 | ``` 1129 | 1130 | - **Set Hive Variables:** 1131 | Set Hive session variables. 1132 | ```sql 1133 | SET =; 1134 | ``` 1135 | 1136 | - **Show Hive Variables:** 1137 | Display all the set variables. 
1138 | ```sql 1139 | SET; 1140 | ``` 1141 | 1142 | - **Show the Status of Hive Jobs:** 1143 | Display the status of running queries. 1144 | ```sql 1145 | SHOW JOBS; 1146 | ``` 1147 | 1148 | --- 1149 | 1150 | #### **🌟 Tips & Best Practices** 1151 | 1152 | - **Partitioning Tables:** 1153 | When dealing with large datasets, partitioning your tables can help improve query performance. 1154 | ```sql 1155 | CREATE TABLE sales (id INT, amount DOUBLE) 1156 | PARTITIONED BY (year INT, month INT); 1157 | ``` 1158 | 1159 | - **Bucketing:** 1160 | Bucketing splits your data into a fixed number of files or "buckets." 1161 | ```sql 1162 | CREATE TABLE sales (id INT, amount DOUBLE) 1163 | CLUSTERED BY (id) INTO 4 BUCKETS; 1164 | ``` 1165 | 1166 | - **Optimization:** 1167 | Use columnar formats like `ORC` or `Parquet` for efficient storage and performance. 1168 | ```sql 1169 | CREATE TABLE sales (id INT, amount DOUBLE) 1170 | STORED AS ORC; 1171 | ``` 1172 | 1173 | These basic commands will help you interact with Hive and perform common operations like creating tables, querying data, and managing your Hive environment efficiently. 1174 | 1175 | While **Hive** and **MySQL** both use SQL-like syntax for querying data, there are some key differences in their commands, especially since Hive is designed for querying large datasets in a Hadoop ecosystem, while MySQL is a relational database management system (RDBMS). 1176 | 1177 | ##**Here’s a comparison of **Hive** and **MySQL** commands in terms of common operations:** 1178 | 1179 | ### **1. Creating Databases** 1180 | - **Hive**: 1181 | ```sql 1182 | CREATE DATABASE mydb; 1183 | ``` 1184 | 1185 | - **MySQL**: 1186 | ```sql 1187 | CREATE DATABASE mydb; 1188 | ``` 1189 | 1190 | *Both Hive and MySQL use the same syntax to create a database.* 1191 | 1192 | --- 1193 | 1194 | ### **2. Switching to a Database** 1195 | - **Hive**: 1196 | ```sql 1197 | USE mydb; 1198 | ``` 1199 | 1200 | - **MySQL**: 1201 | ```sql 1202 | USE mydb; 1203 | ``` 1204 | 1205 | *The syntax is the same for selecting a database in both systems.* 1206 | 1207 | --- 1208 | 1209 | ### **3. Creating Tables** 1210 | - **Hive**: 1211 | ```sql 1212 | CREATE TABLE employees ( 1213 | id INT, 1214 | name STRING, 1215 | age INT 1216 | ); 1217 | ``` 1218 | 1219 | - **MySQL**: 1220 | ```sql 1221 | CREATE TABLE employees ( 1222 | id INT, 1223 | name VARCHAR(255), 1224 | age INT 1225 | ); 1226 | ``` 1227 | 1228 | **Differences**: 1229 | - In Hive, **STRING** is used for text data, while in MySQL, **VARCHAR** is used. 1230 | - Hive also has some specialized data types for distributed storage and performance, like `ARRAY`, `MAP`, `STRUCT`, etc. 1231 | 1232 | --- 1233 | 1234 | ### **4. Inserting Data** 1235 | - **Hive**: 1236 | ```sql 1237 | INSERT INTO employees VALUES (1, 'John', 30); 1238 | INSERT INTO employees VALUES (2, 'Alice', 25); 1239 | ``` 1240 | 1241 | - **MySQL**: 1242 | ```sql 1243 | INSERT INTO employees (id, name, age) VALUES (1, 'John', 30); 1244 | INSERT INTO employees (id, name, age) VALUES (2, 'Alice', 25); 1245 | ``` 1246 | 1247 | **Differences**: 1248 | - Hive allows direct `INSERT INTO` with values, while MySQL explicitly lists column names in the insert statement (though this is optional in MySQL if the columns match). 1249 | 1250 | --- 1251 | 1252 | ### **5. 
Querying Data** 1253 | - **Hive**: 1254 | ```sql 1255 | SELECT * FROM employees; 1256 | ``` 1257 | 1258 | - **MySQL**: 1259 | ```sql 1260 | SELECT * FROM employees; 1261 | ``` 1262 | 1263 | *Querying data using `SELECT` is identical in both systems.* 1264 | 1265 | --- 1266 | 1267 | ### **6. Modifying Data** 1268 | - **Hive**: 1269 | Hive doesn’t support traditional **UPDATE** or **DELETE** commands directly, as it is optimized for batch processing and is more suited for append operations. However, it does support **INSERT** and **INSERT OVERWRITE** operations. 1270 | 1271 | Example of replacing data: 1272 | ```sql 1273 | INSERT OVERWRITE TABLE employees SELECT * FROM employees WHERE age > 30; 1274 | ``` 1275 | 1276 | - **MySQL**: 1277 | ```sql 1278 | UPDATE employees SET age = 31 WHERE id = 1; 1279 | DELETE FROM employees WHERE id = 2; 1280 | ``` 1281 | 1282 | **Differences**: 1283 | - Hive does not allow direct **UPDATE** or **DELETE**; instead, it uses **INSERT OVERWRITE** to modify data in batch operations. 1284 | 1285 | --- 1286 | 1287 | ### **7. Dropping Tables** 1288 | - **Hive**: 1289 | ```sql 1290 | DROP TABLE IF EXISTS employees; 1291 | ``` 1292 | 1293 | - **MySQL**: 1294 | ```sql 1295 | DROP TABLE IF EXISTS employees; 1296 | ``` 1297 | 1298 | *The syntax for dropping tables is the same in both systems.* 1299 | 1300 | --- 1301 | 1302 | ### **8. Query Performance** 1303 | - **Hive**: 1304 | - Hive is designed to run on large datasets using the Hadoop Distributed File System (HDFS), so it focuses more on **batch processing** rather than real-time queries. Query performance in Hive may be slower than MySQL because it’s optimized for scale, not for low-latency transaction processing. 1305 | 1306 | - **MySQL**: 1307 | - MySQL is an RDBMS, designed to handle **transactional workloads** with low-latency queries. It’s better suited for OLTP (Online Transaction Processing) rather than OLAP (Online Analytical Processing) workloads. 1308 | 1309 | --- 1310 | 1311 | ### **9. Indexing** 1312 | - **Hive**: 1313 | - Hive doesn’t support traditional indexing as MySQL does. However, you can create **partitioned** or **bucketed** tables in Hive to improve query performance for certain types of data. 1314 | 1315 | - **MySQL**: 1316 | - MySQL supports **indexes** (e.g., **PRIMARY KEY**, **UNIQUE**, **INDEX**) to speed up query performance on large datasets. 1317 | 1318 | --- 1319 | 1320 | ### **10. Joins** 1321 | - **Hive**: 1322 | ```sql 1323 | SELECT a.id, a.name, b.age 1324 | FROM employees a 1325 | JOIN employee_details b ON a.id = b.id; 1326 | ``` 1327 | 1328 | - **MySQL**: 1329 | ```sql 1330 | SELECT a.id, a.name, b.age 1331 | FROM employees a 1332 | JOIN employee_details b ON a.id = b.id; 1333 | ``` 1334 | 1335 | *The syntax for **JOIN** is the same in both systems.* 1336 | 1337 | --- 1338 | 1339 | ### **Summary of Key Differences**: 1340 | - **Data Types**: Hive uses types like `STRING`, `TEXT`, `BOOLEAN`, etc., while MySQL uses types like `VARCHAR`, `CHAR`, `TEXT`, etc. 1341 | - **Data Modification**: Hive does not support **UPDATE** or **DELETE** in the traditional way, and is generally used for **batch processing**. 1342 | - **Performance**: Hive is designed for querying large-scale datasets in Hadoop, so queries tend to be slower than MySQL. 1343 | - **Indexing**: Hive does not natively support indexing but can use partitioning and bucketing for performance optimization. MySQL supports indexing for faster queries. 
1344 | - **ACID Properties**: MySQL supports full ACID compliance for transactional systems, whereas Hive is not transactional by default (but can support limited ACID features starting from version 0.14 with certain configurations). 1345 | 1346 | In conclusion, while **Hive** and **MySQL** share SQL-like syntax, they are designed for very different use cases, and not all commands work the same way in both systems. 1347 | 1348 | ### **Visualize the Data (Optional)** 1349 | 1350 | Export the query results to a CSV file for analysis in visualization tools: 1351 | 1352 | ```bash 1353 | hive -e "SELECT * FROM police_data;" > police_analysis_results.csv 1354 | ``` 1355 | 1356 | You can use tools like Tableau, Excel, or Python (Matplotlib, Pandas) for data visualization. 1357 | 1358 | ## **Step 8: Configure Environment Variables (Optional)** 1359 | 1360 | If you need to customize configurations, you can specify parameters in the `hadoop.env` file or as environmental variables for services (e.g., namenode, datanode, etc.). For example: 1361 | 1362 | ```bash 1363 | CORE_CONF_fs_defaultFS=hdfs://namenode:8020 1364 | ``` 1365 | 1366 | This will be transformed into the following in the `core-site.xml` file: 1367 | 1368 | ```xml 1369 | 1370 | fs.defaultFS 1371 | hdfs://namenode:8020 1372 | 1373 | ``` 1374 | 1375 | ## **Conclusion** 1376 | 1377 | You now have a fully functional Hadoop, Spark, and Hive cluster running in Docker. This environment is great for experimenting with big data processing and analytics in a lightweight, containerized setup. 1378 | 1379 | --- 1380 | 1381 | I hope you have fun with this Hadoop-Spark-Hive cluster. 1382 | 1383 | 1384 | 1385 | ![image](https://github.com/user-attachments/assets/1347d354-a160-4cc6-8547-eb0857a72ba5) 1386 | 1387 | -------------------------------------------------------------------------------- /base/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM debian:9 2 | 3 | MAINTAINER Ivan Ermilov 4 | MAINTAINER Giannis Mouchakis 5 | 6 | RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 7 | openjdk-8-jdk \ 8 | net-tools \ 9 | curl \ 10 | netcat \ 11 | gnupg \ 12 | libsnappy-dev \ 13 | && rm -rf /var/lib/apt/lists/* 14 | 15 | ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ 16 | 17 | RUN curl -O https://dist.apache.org/repos/dist/release/hadoop/common/KEYS 18 | 19 | RUN gpg --import KEYS 20 | 21 | ENV HADOOP_VERSION 3.2.1 22 | ENV HADOOP_URL https://www.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz 23 | 24 | RUN set -x \ 25 | && curl -fSL "$HADOOP_URL" -o /tmp/hadoop.tar.gz \ 26 | && curl -fSL "$HADOOP_URL.asc" -o /tmp/hadoop.tar.gz.asc \ 27 | && gpg --verify /tmp/hadoop.tar.gz.asc \ 28 | && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \ 29 | && rm /tmp/hadoop.tar.gz* 30 | 31 | RUN ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop 32 | 33 | RUN mkdir /opt/hadoop-$HADOOP_VERSION/logs 34 | 35 | RUN mkdir /hadoop-data 36 | 37 | ENV HADOOP_HOME=/opt/hadoop-$HADOOP_VERSION 38 | ENV HADOOP_CONF_DIR=/etc/hadoop 39 | ENV MULTIHOMED_NETWORK=1 40 | ENV USER=root 41 | ENV PATH $HADOOP_HOME/bin/:$PATH 42 | 43 | ADD entrypoint.sh /entrypoint.sh 44 | 45 | RUN chmod a+x /entrypoint.sh 46 | 47 | ENTRYPOINT ["/entrypoint.sh"] 48 | -------------------------------------------------------------------------------- /base/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Set some 
sensible defaults 4 | export CORE_CONF_fs_defaultFS=${CORE_CONF_fs_defaultFS:-hdfs://`hostname -f`:8020} 5 | 6 | function addProperty() { 7 | local path=$1 8 | local name=$2 9 | local value=$3 10 | 11 | local entry="$name${value}" 12 | local escapedEntry=$(echo $entry | sed 's/\//\\\//g') 13 | sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path 14 | } 15 | 16 | function configure() { 17 | local path=$1 18 | local module=$2 19 | local envPrefix=$3 20 | 21 | local var 22 | local value 23 | 24 | echo "Configuring $module" 25 | for c in `printenv | perl -sne 'print "$1 " if m/^${envPrefix}_(.+?)=.*/' -- -envPrefix=$envPrefix`; do 26 | name=`echo ${c} | perl -pe 's/___/-/g; s/__/@/g; s/_/./g; s/@/_/g;'` 27 | var="${envPrefix}_${c}" 28 | value=${!var} 29 | echo " - Setting $name=$value" 30 | addProperty $path $name "$value" 31 | done 32 | } 33 | 34 | configure /etc/hadoop/core-site.xml core CORE_CONF 35 | configure /etc/hadoop/hdfs-site.xml hdfs HDFS_CONF 36 | configure /etc/hadoop/yarn-site.xml yarn YARN_CONF 37 | configure /etc/hadoop/httpfs-site.xml httpfs HTTPFS_CONF 38 | configure /etc/hadoop/kms-site.xml kms KMS_CONF 39 | configure /etc/hadoop/mapred-site.xml mapred MAPRED_CONF 40 | 41 | if [ "$MULTIHOMED_NETWORK" = "1" ]; then 42 | echo "Configuring for multihomed network" 43 | 44 | # HDFS 45 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0 46 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0 47 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0 48 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0 49 | addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true 50 | addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true 51 | 52 | # YARN 53 | addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0 54 | addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 55 | addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0 56 | 57 | # MAPRED 58 | addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 59 | fi 60 | 61 | if [ -n "$GANGLIA_HOST" ]; then 62 | mv /etc/hadoop/hadoop-metrics.properties /etc/hadoop/hadoop-metrics.properties.orig 63 | mv /etc/hadoop/hadoop-metrics2.properties /etc/hadoop/hadoop-metrics2.properties.orig 64 | 65 | for module in mapred jvm rpc ugi; do 66 | echo "$module.class=org.apache.hadoop.metrics.ganglia.GangliaContext31" 67 | echo "$module.period=10" 68 | echo "$module.servers=$GANGLIA_HOST:8649" 69 | done > /etc/hadoop/hadoop-metrics.properties 70 | 71 | for module in namenode datanode resourcemanager nodemanager mrappmaster jobhistoryserver; do 72 | echo "$module.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31" 73 | echo "$module.sink.ganglia.period=10" 74 | echo "$module.sink.ganglia.supportsparse=true" 75 | echo "$module.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both" 76 | echo "$module.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40" 77 | echo "$module.sink.ganglia.servers=$GANGLIA_HOST:8649" 78 | done > /etc/hadoop/hadoop-metrics2.properties 79 | fi 80 | 81 | function wait_for_it() 82 | { 83 | local serviceport=$1 84 | local service=${serviceport%%:*} 85 | local port=${serviceport#*:} 86 | local retry_seconds=5 87 | local max_try=100 88 | let i=1 89 | 90 | nc -z $service $port 91 | result=$? 
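# The loop below re-checks the port every $retry_seconds seconds and gives up (exit 1) after $max_try failed attempts.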
92 | 93 | until [ $result -eq 0 ]; do 94 | echo "[$i/$max_try] check for ${service}:${port}..." 95 | echo "[$i/$max_try] ${service}:${port} is not available yet" 96 | if (( $i == $max_try )); then 97 | echo "[$i/$max_try] ${service}:${port} is still not available; giving up after ${max_try} tries. :/" 98 | exit 1 99 | fi 100 | 101 | echo "[$i/$max_try] try in ${retry_seconds}s once again ..." 102 | let "i++" 103 | sleep $retry_seconds 104 | 105 | nc -z $service $port 106 | result=$? 107 | done 108 | echo "[$i/$max_try] $service:${port} is available." 109 | } 110 | 111 | for i in ${SERVICE_PRECONDITION[@]} 112 | do 113 | wait_for_it ${i} 114 | done 115 | 116 | exec $@ 117 | -------------------------------------------------------------------------------- /base/execute-step.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $ENABLE_INIT_DAEMON = "true" ] 4 | then 5 | echo "Execute step ${INIT_DAEMON_STEP} in pipeline" 6 | while true; do 7 | sleep 5 8 | echo -n '.' 9 | string=$(curl -sL -w "%{http_code}" -X PUT $INIT_DAEMON_BASE_URI/execute?step=$INIT_DAEMON_STEP -o /dev/null) 10 | [ "$string" = "204" ] && break 11 | done 12 | echo "Notified execution of step ${INIT_DAEMON_STEP}" 13 | fi 14 | 15 | -------------------------------------------------------------------------------- /base/finish-step.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $ENABLE_INIT_DAEMON = "true" ] 4 | then 5 | echo "Finish step ${INIT_DAEMON_STEP} in pipeline" 6 | while true; do 7 | sleep 5 8 | echo -n '.' 9 | string=$(curl -sL -w "%{http_code}" -X PUT $INIT_DAEMON_BASE_URI/finish?step=$INIT_DAEMON_STEP -o /dev/null) 10 | [ "$string" = "204" ] && break 11 | done 12 | echo "Notified finish of step ${INIT_DAEMON_STEP}" 13 | fi 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /base/wait-for-step.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $ENABLE_INIT_DAEMON = "true" ] 4 | then 5 | echo "Validating if step ${INIT_DAEMON_STEP} can start in pipeline" 6 | while true; do 7 | sleep 5 8 | echo -n '.' 
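# Poll the init daemon's canStart endpoint until it reports "true" for this step, then proceed.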
9 | string=$(curl -s $INIT_DAEMON_BASE_URI/canStart?step=$INIT_DAEMON_STEP) 10 | [ "$string" = "true" ] && break 11 | done 12 | echo "Can start step ${INIT_DAEMON_STEP}" 13 | fi 14 | -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/WordCount$IntSumReducer.class: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount$IntSumReducer.class -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/WordCount$TokenizerMapper.class: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount$TokenizerMapper.class -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/WordCount.class: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount.class -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/wc.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/wc.jar -------------------------------------------------------------------------------- /code/HadoopWordCount/src/WordCount.java: -------------------------------------------------------------------------------- 1 | import java.io.IOException; 2 | import java.util.StringTokenizer; 3 | 4 | import org.apache.hadoop.conf.Configuration; 5 | import org.apache.hadoop.fs.Path; 6 | import org.apache.hadoop.io.IntWritable; 7 | import org.apache.hadoop.io.Text; 8 | import org.apache.hadoop.mapreduce.Job; 9 | import org.apache.hadoop.mapreduce.Mapper; 10 | import org.apache.hadoop.mapreduce.Reducer; 11 | import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 12 | import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 | 14 | public class WordCount { 15 | 16 | public static class TokenizerMapper 17 | extends Mapper{ 18 | 19 | private final static IntWritable one = new IntWritable(1); 20 | private Text word = new Text(); 21 | 22 | public void map(Object key, Text value, Context context 23 | ) throws IOException, InterruptedException { 24 | StringTokenizer itr = new StringTokenizer(value.toString()); 25 | while (itr.hasMoreTokens()) { 26 | word.set(itr.nextToken()); 27 | context.write(word, one); 28 | } 29 | } 30 | } 31 | 32 | public static class IntSumReducer 33 | extends Reducer { 34 | private IntWritable result = new IntWritable(); 35 | 36 | public void reduce(Text key, Iterable values, 37 | Context context 38 | ) throws IOException, InterruptedException { 39 | int sum = 0; 40 | for (IntWritable val : values) { 41 | sum += val.get(); 42 | } 43 | result.set(sum); 44 | context.write(key, result); 45 | } 46 | } 47 | 48 | public static void main(String[] args) throws Exception { 49 | Configuration conf = new Configuration(); 50 | Job job = Job.getInstance(conf, "word count"); 51 | job.setJarByClass(WordCount.class); 52 | job.setMapperClass(TokenizerMapper.class); 53 | 
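// Reusing IntSumReducer as the combiner pre-aggregates counts on each mapper, shrinking the data shuffled to reducers.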
job.setCombinerClass(IntSumReducer.class); 54 | job.setReducerClass(IntSumReducer.class); 55 | job.setOutputKeyClass(Text.class); 56 | job.setOutputValueClass(IntWritable.class); 57 | FileInputFormat.addInputPath(job, new Path(args[0])); 58 | FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 | System.exit(job.waitForCompletion(true) ? 0 : 1); 60 | } 61 | } -------------------------------------------------------------------------------- /code/input/About Hadoop.txt~: -------------------------------------------------------------------------------- 1 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 2 | 3 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 4 | 5 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 6 | -------------------------------------------------------------------------------- /code/input/data.txt: -------------------------------------------------------------------------------- 1 | DOG CAT RAT 2 | CAR CAR RAT 3 | DOG CAR CAT 4 | -------------------------------------------------------------------------------- /code/wordCount.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/wordCount.jar -------------------------------------------------------------------------------- /conf/beeline-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. 
You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | status = INFO 18 | name = BeelineLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.hive.log.level = WARN 23 | property.hive.root.logger = console 24 | 25 | # list of all appenders 26 | appenders = console 27 | 28 | # console appender 29 | appender.console.type = Console 30 | appender.console.name = console 31 | appender.console.target = SYSTEM_ERR 32 | appender.console.layout.type = PatternLayout 33 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n 34 | 35 | # list of all loggers 36 | loggers = HiveConnection 37 | 38 | # HiveConnection logs useful info for dynamic service discovery 39 | logger.HiveConnection.name = org.apache.hive.jdbc.HiveConnection 40 | logger.HiveConnection.level = INFO 41 | 42 | # root logger 43 | rootLogger.level = ${sys:hive.log.level} 44 | rootLogger.appenderRefs = root 45 | rootLogger.appenderRef.root.ref = ${sys:hive.root.logger} 46 | -------------------------------------------------------------------------------- /conf/hive-env.sh: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | # Set Hive and Hadoop environment variables here. These variables can be used 18 | # to control the execution of Hive. It should be used by admins to configure 19 | # the Hive installation (so that users do not have to set environment variables 20 | # or set command line parameters to get correct behavior). 21 | # 22 | # The hive service being invoked (CLI/HWI etc.) is available via the environment 23 | # variable SERVICE 24 | 25 | 26 | # Hive Client memory usage can be an issue if a large number of clients 27 | # are running at the same time. 
The flags below have been useful in 28 | # reducing memory usage: 29 | # 30 | # if [ "$SERVICE" = "cli" ]; then 31 | # if [ -z "$DEBUG" ]; then 32 | # export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit" 33 | # else 34 | # export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit" 35 | # fi 36 | # fi 37 | 38 | # The heap size of the jvm stared by hive shell script can be controlled via: 39 | # 40 | # export HADOOP_HEAPSIZE=1024 41 | # 42 | # Larger heap size may be required when running queries over large number of files or partitions. 43 | # By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be 44 | # appropriate for hive server (hwi etc). 45 | 46 | 47 | # Set HADOOP_HOME to point to a specific hadoop install directory 48 | # HADOOP_HOME=${bin}/../../hadoop 49 | 50 | # Hive Configuration Directory can be controlled by: 51 | # export HIVE_CONF_DIR= 52 | 53 | # Folder containing extra ibraries required for hive compilation/execution can be controlled by: 54 | # export HIVE_AUX_JARS_PATH= 55 | -------------------------------------------------------------------------------- /conf/hive-exec-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
16 | 17 | status = INFO 18 | name = HiveExecLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.hive.log.level = INFO 23 | property.hive.root.logger = FA 24 | property.hive.query.id = hadoop 25 | property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name} 26 | property.hive.log.file = ${sys:hive.query.id}.log 27 | 28 | # list of all appenders 29 | appenders = console, FA 30 | 31 | # console appender 32 | appender.console.type = Console 33 | appender.console.name = console 34 | appender.console.target = SYSTEM_ERR 35 | appender.console.layout.type = PatternLayout 36 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n 37 | 38 | # simple file appender 39 | appender.FA.type = File 40 | appender.FA.name = FA 41 | appender.FA.fileName = ${sys:hive.log.dir}/${sys:hive.log.file} 42 | appender.FA.layout.type = PatternLayout 43 | appender.FA.layout.pattern = %d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n 44 | 45 | # list of all loggers 46 | loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX 47 | 48 | logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn 49 | logger.NIOServerCnxn.level = WARN 50 | 51 | logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO 52 | logger.ClientCnxnSocketNIO.level = WARN 53 | 54 | logger.DataNucleus.name = DataNucleus 55 | logger.DataNucleus.level = ERROR 56 | 57 | logger.Datastore.name = Datastore 58 | logger.Datastore.level = ERROR 59 | 60 | logger.JPOX.name = JPOX 61 | logger.JPOX.level = ERROR 62 | 63 | # root logger 64 | rootLogger.level = ${sys:hive.log.level} 65 | rootLogger.appenderRefs = root 66 | rootLogger.appenderRef.root.ref = ${sys:hive.root.logger} 67 | -------------------------------------------------------------------------------- /conf/hive-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
16 | 17 | status = INFO 18 | name = HiveLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.hive.log.level = INFO 23 | property.hive.root.logger = DRFA 24 | property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name} 25 | property.hive.log.file = hive.log 26 | 27 | # list of all appenders 28 | appenders = console, DRFA 29 | 30 | # console appender 31 | appender.console.type = Console 32 | appender.console.name = console 33 | appender.console.target = SYSTEM_ERR 34 | appender.console.layout.type = PatternLayout 35 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n 36 | 37 | # daily rolling file appender 38 | appender.DRFA.type = RollingFile 39 | appender.DRFA.name = DRFA 40 | appender.DRFA.fileName = ${sys:hive.log.dir}/${sys:hive.log.file} 41 | # Use %pid in the filePattern to append @ to the filename if you want separate log files for different CLI session 42 | appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%d{yyyy-MM-dd} 43 | appender.DRFA.layout.type = PatternLayout 44 | appender.DRFA.layout.pattern = %d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n 45 | appender.DRFA.policies.type = Policies 46 | appender.DRFA.policies.time.type = TimeBasedTriggeringPolicy 47 | appender.DRFA.policies.time.interval = 1 48 | appender.DRFA.policies.time.modulate = true 49 | appender.DRFA.strategy.type = DefaultRolloverStrategy 50 | appender.DRFA.strategy.max = 30 51 | 52 | # list of all loggers 53 | loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX 54 | 55 | logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn 56 | logger.NIOServerCnxn.level = WARN 57 | 58 | logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO 59 | logger.ClientCnxnSocketNIO.level = WARN 60 | 61 | logger.DataNucleus.name = DataNucleus 62 | logger.DataNucleus.level = ERROR 63 | 64 | logger.Datastore.name = Datastore 65 | logger.Datastore.level = ERROR 66 | 67 | logger.JPOX.name = JPOX 68 | logger.JPOX.level = ERROR 69 | 70 | # root logger 71 | rootLogger.level = ${sys:hive.log.level} 72 | rootLogger.appenderRefs = root 73 | rootLogger.appenderRef.root.ref = ${sys:hive.root.logger} 74 | -------------------------------------------------------------------------------- /conf/hive-site.xml: -------------------------------------------------------------------------------- 1 | 2 | 18 | 19 | -------------------------------------------------------------------------------- /conf/ivysettings.xml: -------------------------------------------------------------------------------- 1 | 17 | 18 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | -------------------------------------------------------------------------------- /conf/llap-daemon-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. 
You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | status = INFO 18 | name = LlapDaemonLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.llap.daemon.log.level = INFO 23 | property.llap.daemon.root.logger = console 24 | property.llap.daemon.log.dir = . 25 | property.llap.daemon.log.file = llapdaemon.log 26 | property.llap.daemon.historylog.file = llapdaemon_history.log 27 | property.llap.daemon.log.maxfilesize = 256MB 28 | property.llap.daemon.log.maxbackupindex = 20 29 | 30 | # list of all appenders 31 | appenders = console, RFA, HISTORYAPPENDER 32 | 33 | # console appender 34 | appender.console.type = Console 35 | appender.console.name = console 36 | appender.console.target = SYSTEM_ERR 37 | appender.console.layout.type = PatternLayout 38 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t%x] %p %c{2} : %m%n 39 | 40 | # rolling file appender 41 | appender.RFA.type = RollingFile 42 | appender.RFA.name = RFA 43 | appender.RFA.fileName = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.log.file} 44 | appender.RFA.filePattern = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.log.file}_%i 45 | appender.RFA.layout.type = PatternLayout 46 | appender.RFA.layout.pattern = %d{ISO8601} %-5p [%t%x]: %c{2} (%F:%M(%L)) - %m%n 47 | appender.RFA.policies.type = Policies 48 | appender.RFA.policies.size.type = SizeBasedTriggeringPolicy 49 | appender.RFA.policies.size.size = ${sys:llap.daemon.log.maxfilesize} 50 | appender.RFA.strategy.type = DefaultRolloverStrategy 51 | appender.RFA.strategy.max = ${sys:llap.daemon.log.maxbackupindex} 52 | 53 | # history file appender 54 | appender.HISTORYAPPENDER.type = RollingFile 55 | appender.HISTORYAPPENDER.name = HISTORYAPPENDER 56 | appender.HISTORYAPPENDER.fileName = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.historylog.file} 57 | appender.HISTORYAPPENDER.filePattern = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.historylog.file}_%i 58 | appender.HISTORYAPPENDER.layout.type = PatternLayout 59 | appender.HISTORYAPPENDER.layout.pattern = %m%n 60 | appender.HISTORYAPPENDER.policies.type = Policies 61 | appender.HISTORYAPPENDER.policies.size.type = SizeBasedTriggeringPolicy 62 | appender.HISTORYAPPENDER.policies.size.size = ${sys:llap.daemon.log.maxfilesize} 63 | appender.HISTORYAPPENDER.strategy.type = DefaultRolloverStrategy 64 | appender.HISTORYAPPENDER.strategy.max = ${sys:llap.daemon.log.maxbackupindex} 65 | 66 | # list of all loggers 67 | loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX, HistoryLogger 68 | 69 | logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn 70 | logger.NIOServerCnxn.level = WARN 71 | 72 | logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO 73 | logger.ClientCnxnSocketNIO.level = WARN 74 | 75 | logger.DataNucleus.name = DataNucleus 76 | logger.DataNucleus.level = ERROR 77 | 78 | logger.Datastore.name = Datastore 79 | logger.Datastore.level = ERROR 80 | 81 | logger.JPOX.name = JPOX 82 | logger.JPOX.level = ERROR 83 | 84 | logger.HistoryLogger.name = org.apache.hadoop.hive.llap.daemon.HistoryLogger 85 | logger.HistoryLogger.level = 
INFO 86 | logger.HistoryLogger.additivity = false 87 | logger.HistoryLogger.appenderRefs = HistoryAppender 88 | logger.HistoryLogger.appenderRef.HistoryAppender.ref = HISTORYAPPENDER 89 | 90 | # root logger 91 | rootLogger.level = ${sys:llap.daemon.log.level} 92 | rootLogger.appenderRefs = root 93 | rootLogger.appenderRef.root.ref = ${sys:llap.daemon.root.logger} 94 | -------------------------------------------------------------------------------- /data/authors.csv: -------------------------------------------------------------------------------- 1 | lname,fname 2 | Pascal,Blaise 3 | Voltaire,François 4 | Perrin,Jean-Georges 5 | Maréchal,Pierre Sylvain 6 | Karau,Holden 7 | Zaharia,Matei 8 | -------------------------------------------------------------------------------- /data/books.csv: -------------------------------------------------------------------------------- 1 | id,authorId,title,releaseDate,link 2 | 1,1,Fantastic Beasts and Where to Find Them: The Original Screenplay,11/18/16,http://amzn.to/2kup94P 3 | 2,1,"Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter, Book 1)",10/6/15,http://amzn.to/2l2lSwP 4 | 3,1,"The Tales of Beedle the Bard, Standard Edition (Harry Potter)",12/4/08,http://amzn.to/2kYezqr 5 | 4,1,"Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter, Book 2)",10/4/16,http://amzn.to/2kYhL5n 6 | 5,2,"Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple, the Coffee, and a Great Database",4/23/17,http://amzn.to/2i3mthT 7 | 6,2,"Development Tools in 2006: any Room for a 4GL-style Language?: An independent study by Jean Georges Perrin, IIUG Board Member",12/28/16,http://amzn.to/2vBxOe1 8 | 7,3,Adventures of Huckleberry Finn,5/26/94,http://amzn.to/2wOeOav 9 | 8,3,A Connecticut Yankee in King Arthur's Court,6/17/17,http://amzn.to/2x1NuoD 10 | 10,4,Jacques le Fataliste,3/1/00,http://amzn.to/2uZj2KA 11 | 11,4,Diderot Encyclopedia: The Complete Illustrations 1762-1777,,http://amzn.to/2i2zo3I 12 | 12,,A Woman in Berlin,7/11/06,http://amzn.to/2i472WZ 13 | 13,6,Spring Boot in Action,1/3/16,http://amzn.to/2hCPktW 14 | 14,6,Spring in Action: Covers Spring 4,11/28/14,http://amzn.to/2yJLyCk 15 | 15,7,Soft Skills: The software developer's life manual,12/29/14,http://amzn.to/2zNnSyn 16 | 16,8,Of Mice and Men,,http://amzn.to/2zJjXoc 17 | 17,9,"Java 8 in Action: Lambdas, Streams, and functional-style programming",8/28/14,http://amzn.to/2isdqoL 18 | 18,12,Hamlet,6/8/12,http://amzn.to/2yRbewY 19 | 19,13,Pensées,12/31/1670,http://amzn.to/2jweHOG 20 | 20,14,"Fables choisies, mises en vers par M. 
de La Fontaine",9/1/1999,http://amzn.to/2yRH10W 21 | 21,15,Discourse on Method and Meditations on First Philosophy,6/15/1999,http://amzn.to/2hwB8zc 22 | 22,12,Twelfth Night,7/1/4,http://amzn.to/2zPYnwo 23 | 23,12,Macbeth,7/1/3,http://amzn.to/2zPYnwo -------------------------------------------------------------------------------- /datanode/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:9864/ || exit 1 6 | 7 | ENV HDFS_CONF_dfs_datanode_data_dir=file:///hadoop/dfs/data 8 | RUN mkdir -p /hadoop/dfs/data 9 | VOLUME /hadoop/dfs/data 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | EXPOSE 9864 15 | 16 | CMD ["/run.sh"] 17 | -------------------------------------------------------------------------------- /datanode/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | datadir=`echo $HDFS_CONF_dfs_datanode_data_dir | perl -pe 's#file://##'` 4 | if [ ! -d $datadir ]; then 5 | echo "Datanode data directory not found: $datadir" 6 | exit 2 7 | fi 8 | 9 | $HADOOP_HOME/bin/hdfs --config $HADOOP_CONF_DIR datanode 10 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | services: 2 | namenode: 3 | image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 4 | container_name: namenode 5 | restart: always 6 | ports: 7 | - 9870:9870 8 | - 9010:9000 9 | volumes: 10 | - hadoop_namenode:/hadoop/dfs/name 11 | environment: 12 | - CLUSTER_NAME=test 13 | - CORE_CONF_fs_defaultFS=hdfs://namenode:9000 14 | env_file: 15 | - ./hadoop.env 16 | 17 | datanode: 18 | image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 19 | container_name: datanode 20 | restart: always 21 | volumes: 22 | - hadoop_datanode:/hadoop/dfs/data 23 | environment: 24 | SERVICE_PRECONDITION: "namenode:9870" 25 | CORE_CONF_fs_defaultFS: hdfs://namenode:9000 26 | ports: 27 | - "9864:9864" 28 | env_file: 29 | - ./hadoop.env 30 | 31 | resourcemanager: 32 | image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8 33 | container_name: resourcemanager 34 | restart: always 35 | environment: 36 | SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864" 37 | ports: 38 | - "8088:8088" 39 | env_file: 40 | - ./hadoop.env 41 | 42 | nodemanager1: 43 | image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8 44 | container_name: nodemanager 45 | restart: always 46 | environment: 47 | SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088" 48 | env_file: 49 | - ./hadoop.env 50 | 51 | historyserver: 52 | image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8 53 | container_name: historyserver 54 | restart: always 55 | environment: 56 | SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088" 57 | volumes: 58 | - hadoop_historyserver:/hadoop/yarn/timeline 59 | env_file: 60 | - ./hadoop.env 61 | 62 | spark-master: 63 | image: bde2020/spark-master:3.0.0-hadoop3.2 64 | container_name: spark-master 65 | depends_on: 66 | - namenode 67 | - datanode 68 | ports: 69 | - "8080:8080" 70 | - "7077:7077" 71 | environment: 72 | - INIT_DAEMON_STEP=setup_spark 73 | - CORE_CONF_fs_defaultFS=hdfs://namenode:9000 74 | 75 | spark-worker-1: 76 | image: bde2020/spark-worker:3.0.0-hadoop3.2 77 | container_name: spark-worker-1 78 | 
depends_on: 79 | - spark-master 80 | ports: 81 | - "8081:8081" 82 | environment: 83 | - "SPARK_MASTER=spark://spark-master:7077" 84 | - CORE_CONF_fs_defaultFS=hdfs://namenode:9000 85 | 86 | hive-server: 87 | image: bde2020/hive:2.3.2-postgresql-metastore 88 | container_name: hive-server 89 | depends_on: 90 | - namenode 91 | - datanode 92 | env_file: 93 | - ./hadoop-hive.env 94 | environment: 95 | HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore" 96 | SERVICE_PRECONDITION: "hive-metastore:9083" 97 | ports: 98 | - "10000:10000" 99 | 100 | hive-metastore: 101 | image: bde2020/hive:2.3.2-postgresql-metastore 102 | container_name: hive-metastore 103 | env_file: 104 | - ./hadoop-hive.env 105 | command: /opt/hive/bin/hive --service metastore 106 | environment: 107 | SERVICE_PRECONDITION: "namenode:9870 datanode:9864 hive-metastore-postgresql:5432" 108 | ports: 109 | - "9083:9083" 110 | 111 | hive-metastore-postgresql: 112 | image: bde2020/hive-metastore-postgresql:2.3.0 113 | container_name: hive-metastore-postgresql 114 | 115 | presto-coordinator: 116 | image: shawnzhu/prestodb:0.181 117 | container_name: presto-coordinator 118 | ports: 119 | - "8089:8089" 120 | 121 | volumes: 122 | hadoop_namenode: 123 | hadoop_datanode: 124 | hadoop_historyserver: 125 | 126 | -------------------------------------------------------------------------------- /ecom.md: -------------------------------------------------------------------------------- 1 | # 🚀 E-commerce Sales Data Analysis Using Hive 2 | 3 | ## 📊 Project Overview 4 | This project demonstrates how to perform **E-commerce Sales Data Analysis** using **Apache Hive** on a Hadoop ecosystem. The goal of this project is to analyze sales data, generate business insights, and understand trends in e-commerce sales. 5 | 6 | The project uses a **CSV file containing real-world simulated sales data**, which is imported into **HDFS (Hadoop Distributed File System)** and processed using **HiveQL (Hive Query Language)**. 7 | 8 | ✅ **Project Objectives:** 9 | - Import large-scale e-commerce sales data into **HDFS**. 10 | - Create Hive tables (Managed & External). 11 | - Analyze data to extract business insights like: 12 | - 💰 **Total Revenue.** 13 | - 🛒 **Best-selling products.** 14 | - 👥 **Most active customers.** 15 | - 📅 **Monthly/Yearly sales trends.** 16 | - 💵 **Most used payment methods.** 17 | - Generate useful business insights for decision-making. 18 | 19 | --- 20 | 21 | ## 📁 Dataset Information 22 | The dataset used in this project is a simulated **E-commerce Sales Data CSV file** containing the following columns: 23 | 24 | | Column Name | Description | 25 | |-----------------|------------------------------------------| 26 | | **order_date** | Date of the order | 27 | | **customer_id** | Unique ID of the customer | 28 | | **product_name** | Name of the product purchased | 29 | | **category** | Product category | 30 | | **quantity** | Number of units sold | 31 | | **price** | Price per unit | 32 | | **total_amount** | Total amount for the order | 33 | | **payment_type** | Payment method used | 34 | | **city** | Customer's city | 35 | | **state** | Customer's state | 36 | | **country** | Customer's country | 37 | 38 | 👉 **Sample Size:** 10,000 records of e-commerce transactions. 
39 | 👉 **File Type:** CSV 40 | 👉 **File Name:** `ecommerce_sales_data.csv` 41 | 42 | You can download the dataset from here: [Download E-commerce Sales Data](https://drive.google.com/file/d/1MYN0AdX6uD9kNR6UdqlCZuZCxlfmK6T6/view) 43 | 44 | --- 45 | 46 | ## 📥 Step 1: Upload Data to HDFS 47 | ### ✅ Create Directory in HDFS 48 | Run the following commands to create a directory in **HDFS**: 49 | ```bash 50 | hadoop fs -mkdir -p /user/hdfs/ecommerce_data 51 | ``` 52 | 53 | ### ✅ Upload the CSV File to HDFS 54 | ```bash 55 | hadoop fs -put /mnt/data/ecommerce_sales_data.csv /user/hdfs/ecommerce_data/ 56 | ``` 57 | 58 | Verify the upload: 59 | ```bash 60 | hadoop fs -ls /user/hdfs/ecommerce_data/ 61 | ``` 62 | You should see the file listed there. 63 | 64 | --- 65 | 66 | ## 🗄 Step 2: Create Hive Tables 67 | Now, open the **Hive shell**: 68 | ```bash 69 | hive 70 | ``` 71 | 72 | ### ✅ Create Database 73 | ```sql 74 | CREATE DATABASE ecommerce; 75 | USE ecommerce; 76 | ``` 77 | 78 | ### ✅ Create External Table 79 | We will create an **External Table** linked to our HDFS file. 80 | ```sql 81 | CREATE EXTERNAL TABLE IF NOT EXISTS sales_data ( 82 | order_date STRING, 83 | customer_id INT, 84 | product_name STRING, 85 | category STRING, 86 | quantity INT, 87 | price FLOAT, 88 | total_amount FLOAT, 89 | payment_type STRING, 90 | city STRING, 91 | state STRING, 92 | country STRING 93 | ) 94 | ROW FORMAT DELIMITED 95 | FIELDS TERMINATED BY ',' 96 | STORED AS TEXTFILE 97 | LOCATION '/user/hdfs/ecommerce_data/'; 98 | ``` 99 | 100 | ✅ **Verify the data:** 101 | ```sql 102 | SELECT * FROM sales_data LIMIT 10; 103 | ``` 104 | 105 | --- 106 | 107 | ## 💻 Step 3: Hive Queries (Data Analysis) 108 | ### 💰 1. Calculate Total Revenue 109 | ```sql 110 | SELECT SUM(total_amount) AS total_revenue 111 | FROM sales_data; 112 | ``` 113 | 👉 This query shows the **total revenue generated** by the business. 114 | 115 | --- 116 | 117 | ### 🛍 2. Find Best-Selling Products 118 | ```sql 119 | SELECT product_name, SUM(quantity) AS total_sold 120 | FROM sales_data 121 | GROUP BY product_name 122 | ORDER BY total_sold DESC 123 | LIMIT 10; 124 | ``` 125 | 👉 This query shows the **top 10 best-selling products**. 126 | 127 | --- 128 | 129 | ### 👥 3. Identify Most Active Customers 130 | ```sql 131 | SELECT customer_id, COUNT(*) AS total_orders 132 | FROM sales_data 133 | GROUP BY customer_id 134 | ORDER BY total_orders DESC 135 | LIMIT 10; 136 | ``` 137 | 👉 This query identifies the **top 10 most active customers**. 138 | 139 | --- 140 | 141 | ### 📅 4. Monthly Sales Trend 142 | ```sql 143 | SELECT substr(order_date, 1, 7) AS month, SUM(total_amount) AS monthly_revenue 144 | FROM sales_data 145 | GROUP BY substr(order_date, 1, 7) 146 | ORDER BY month; 147 | ``` 148 | 👉 This query shows the **monthly revenue trend**. 149 | 150 | --- 151 | 152 | ### 🏢 5. Top Revenue-Generating Cities 153 | ```sql 154 | SELECT city, SUM(total_amount) AS revenue 155 | FROM sales_data 156 | GROUP BY city 157 | ORDER BY revenue DESC 158 | LIMIT 5; 159 | ``` 160 | 👉 This query identifies the **top 5 revenue-generating cities**. 161 | 162 | --- 163 | 164 | ### 💵 6. Most Used Payment Type 165 | ```sql 166 | SELECT payment_type, COUNT(*) AS usage_count 167 | FROM sales_data 168 | GROUP BY payment_type 169 | ORDER BY usage_count DESC; 170 | ``` 171 | 👉 This query shows the **most preferred payment methods**. 172 | 173 | --- 174 | 175 | ## 📊 Step 4: Visualization (Optional) 176 | You can visualize the data using: 177 | - 📊 **Apache Zeppelin**. 
178 | - 📊 **Power BI / Tableau**. 179 | - 💻 **Python (Matplotlib/Seaborn)**. 180 | 181 | Example visualization in **Zeppelin:** 182 | ```sql 183 | %sql 184 | SELECT substr(order_date, 1, 7) AS month, SUM(total_amount) AS monthly_revenue 185 | FROM sales_data 186 | GROUP BY substr(order_date, 1, 7) 187 | ORDER BY month; 188 | ``` 189 | 👉 Convert it into a **Line Chart** to see monthly revenue. 190 | 191 | --- 192 | 193 | ## 📜 Step 5: Business Insights 194 | | Insight | Description | 195 | |---------|-------------| 196 | | 💰 Total Revenue | Understand the overall revenue generated. | 197 | | 🛍 Best-Selling Products | Identify which products are most popular. | 198 | | 👥 Most Active Customers | Track the most loyal customers. | 199 | | 📅 Monthly Revenue Trend | Understand peak seasons and off-seasons. | 200 | | 🏢 Revenue by City | Focus on cities generating maximum revenue. | 201 | | 💵 Payment Preference | Identify the most used payment method. | 202 | 203 | --- 204 | 205 | ## 📊 Future Scope 206 | 1. ✅ **Integrate Apache Kafka** for real-time streaming data. 207 | 2. ✅ Use **Apache Spark** to process data faster. 208 | 3. ✅ Build a **Tableau/Power BI dashboard** for live business insights. 209 | 4. ✅ Connect Hive data to **Flask/Django web app**. 210 | 211 | --- 212 | 213 | ## 💎 Conclusion 214 | This project provides a practical demonstration of: 215 | - ✅ **Big Data Processing** using Hive. 216 | - ✅ Importing data into HDFS. 217 | - ✅ Performing data analysis using HiveQL. 218 | - ✅ Generating business insights from e-commerce sales data. 219 | 220 | 👉 **Next Step:**: 221 | - ✅ Create a real-time dashboard using Zeppelin/Power BI? 222 | - ✅ Automate PDF Report Generation using Python? 223 | - ✅ Deploy this project on a web application using Flask? 224 | -------------------------------------------------------------------------------- /entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Set some sensible defaults 4 | export CORE_CONF_fs_defaultFS=${CORE_CONF_fs_defaultFS:-hdfs://`hostname -f`:8020} 5 | 6 | function addProperty() { 7 | local path=$1 8 | local name=$2 9 | local value=$3 10 | 11 | local entry="$name${value}" 12 | local escapedEntry=$(echo $entry | sed 's/\//\\\//g') 13 | sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path 14 | } 15 | 16 | function configure() { 17 | local path=$1 18 | local module=$2 19 | local envPrefix=$3 20 | 21 | local var 22 | local value 23 | 24 | echo "Configuring $module" 25 | for c in `printenv | perl -sne 'print "$1 " if m/^${envPrefix}_(.+?)=.*/' -- -envPrefix=$envPrefix`; do 26 | name=`echo ${c} | perl -pe 's/___/-/g; s/__/_/g; s/_/./g'` 27 | var="${envPrefix}_${c}" 28 | value=${!var} 29 | echo " - Setting $name=$value" 30 | addProperty $path $name "$value" 31 | done 32 | } 33 | 34 | configure /etc/hadoop/core-site.xml core CORE_CONF 35 | configure /etc/hadoop/hdfs-site.xml hdfs HDFS_CONF 36 | configure /etc/hadoop/yarn-site.xml yarn YARN_CONF 37 | configure /etc/hadoop/httpfs-site.xml httpfs HTTPFS_CONF 38 | configure /etc/hadoop/kms-site.xml kms KMS_CONF 39 | configure /etc/hadoop/mapred-site.xml mapred MAPRED_CONF 40 | configure /opt/hive/conf/hive-site.xml hive HIVE_SITE_CONF 41 | 42 | if [ "$MULTIHOMED_NETWORK" = "1" ]; then 43 | echo "Configuring for multihomed network" 44 | 45 | # HDFS 46 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0 47 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0 48 | addProperty 
/etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0 49 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0 50 | addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true 51 | addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true 52 | 53 | # YARN 54 | addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0 55 | addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 56 | addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 57 | addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0 58 | 59 | # MAPRED 60 | addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 61 | fi 62 | 63 | if [ -n "$GANGLIA_HOST" ]; then 64 | mv /etc/hadoop/hadoop-metrics.properties /etc/hadoop/hadoop-metrics.properties.orig 65 | mv /etc/hadoop/hadoop-metrics2.properties /etc/hadoop/hadoop-metrics2.properties.orig 66 | 67 | for module in mapred jvm rpc ugi; do 68 | echo "$module.class=org.apache.hadoop.metrics.ganglia.GangliaContext31" 69 | echo "$module.period=10" 70 | echo "$module.servers=$GANGLIA_HOST:8649" 71 | done > /etc/hadoop/hadoop-metrics.properties 72 | 73 | for module in namenode datanode resourcemanager nodemanager mrappmaster jobhistoryserver; do 74 | echo "$module.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31" 75 | echo "$module.sink.ganglia.period=10" 76 | echo "$module.sink.ganglia.supportsparse=true" 77 | echo "$module.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both" 78 | echo "$module.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40" 79 | echo "$module.sink.ganglia.servers=$GANGLIA_HOST:8649" 80 | done > /etc/hadoop/hadoop-metrics2.properties 81 | fi 82 | 83 | function wait_for_it() 84 | { 85 | local serviceport=$1 86 | local service=${serviceport%%:*} 87 | local port=${serviceport#*:} 88 | local retry_seconds=5 89 | local max_try=100 90 | let i=1 91 | 92 | nc -z $service $port 93 | result=$? 94 | 95 | until [ $result -eq 0 ]; do 96 | echo "[$i/$max_try] check for ${service}:${port}..." 97 | echo "[$i/$max_try] ${service}:${port} is not available yet" 98 | if (( $i == $max_try )); then 99 | echo "[$i/$max_try] ${service}:${port} is still not available; giving up after ${max_try} tries. :/" 100 | exit 1 101 | fi 102 | 103 | echo "[$i/$max_try] try in ${retry_seconds}s once again ..." 104 | let "i++" 105 | sleep $retry_seconds 106 | 107 | nc -z $service $port 108 | result=$? 109 | done 110 | echo "[$i/$max_try] $service:${port} is available." 111 | } 112 | 113 | for i in ${SERVICE_PRECONDITION[@]} 114 | do 115 | wait_for_it ${i} 116 | done 117 | 118 | exec $@ 119 | -------------------------------------------------------------------------------- /flume.md: -------------------------------------------------------------------------------- 1 | ### ✅ **What is Apache Flume in Big Data? 🚀** 2 | 3 | --- 4 | 5 | ### 💡 **Definition of Apache Flume:** 6 | 👉 **Apache Flume** is a **data ingestion tool** used to **collect, aggregate, and transfer large volumes of streaming data** (such as **log files, social media data, server logs, IoT data, etc.**) **into Hadoop (HDFS/Hive).** 7 | 8 | --- 9 | 10 | ## ✅ **Why Do We Need Apache Flume? 
🤔** 11 | ### 📊 **Problem:** 12 | Suppose you have: 13 | - ✅ **Millions of log files** generated every second from **Web Servers, IoT devices, Sensors, etc.** 14 | - ✅ Or you have **Streaming Data from Twitter, Facebook, YouTube, etc.** 15 | - ✅ Or you have **Server Logs** from your website. 16 | 17 | 👉 You want to **send this streaming data** into: 18 | - ✅ **HDFS (Hadoop File System)** for storage. 19 | - ✅ **Hive** for querying and analysis. 20 | - ✅ **HBase** for real-time access. 21 | 22 | 👉 **How will you transfer this large streaming data continuously?** 🤔 23 | 24 | --- 25 | 26 | ## ✅ **Solution: Use Apache Flume 💯** 27 | 👉 Apache Flume will **continuously capture streaming data** from: 28 | - ✅ **Web Servers (logs)** 29 | - ✅ **IoT Devices (sensor data)** 30 | - ✅ **Social Media (Twitter, Facebook)** 31 | - ✅ **Application Logs (Tomcat, Apache)** 32 | 33 | 👉 And automatically **push it into Hadoop (HDFS/Hive)** without manual work. 34 | 35 | --- 36 | 37 | ## ✅ **Where is Flume Used in Real Life? 💡** 38 | | Industry | Flume is Used For | 39 | |--------------------------|---------------------------------------------------------------------------------| 40 | | 📊 **E-commerce (Amazon, Flipkart)** | Capturing **user behavior logs**, product clicks, browsing history, etc. | 41 | | 💻 **IT Companies (Google, Facebook)** | Collecting **application logs**, crash logs, web traffic logs, etc. | 42 | | 📡 **IoT Devices (Smart Homes)** | Streaming data from **IoT devices, sensors, CCTV, etc.** | 43 | | 📜 **News Websites** | **Capturing real-time news**, logs, and content from different sources. | 44 | | 🛰️ **Social Media Platforms** | Capturing **tweets, Facebook posts, YouTube comments, etc.** | 45 | 46 | --- 47 | 48 | ## ✅ **How Does Apache Flume Work? 🚀** 49 | 👉 **Apache Flume works on a Pipeline Architecture.** 50 | 51 | ### ✔ **Pipeline = Source → Channel → Sink → Hadoop (HDFS)** 52 | | Component | What it Does | 53 | |--------------|-------------------------------------------------------------------------| 54 | | ✅ **Source** | Collects **data from source (logs, Twitter, IoT, etc.)** | 55 | | ✅ **Channel** | Temporarily stores the data (like a queue or buffer). | 56 | | ✅ **Sink** | Sends data to **HDFS, Hive, or HBase**. | 57 | | ✅ **Hadoop** | Stores the data permanently for analysis. | 58 | 59 | --- 60 | 61 | ## ✅ **Architecture of Apache Flume 🔥** 62 | Here’s how Flume works step-by-step: 63 | 64 | ``` 65 | ┌─────────────────┐ 66 | Data Source --> │ Source │ --> Captures Data (Logs, Twitter, IoT) 67 | └─────────────────┘ 68 | │ 69 | ▼ 70 | ┌─────────────────┐ 71 | Data Buffer --> │ Channel │ --> Holds data temporarily (like a Queue) 72 | └─────────────────┘ 73 | │ 74 | ▼ 75 | ┌─────────────────┐ 76 | Data Storage -->│ Sink │ --> Sends Data to HDFS, Hive, or HBase 77 | └─────────────────┘ 78 | │ 79 | ▼ 80 | ┌───────────────────────┐ 81 | Data in Hadoop│ HDFS / Hive / HBase │ 82 | └───────────────────────┘ 83 | ``` 84 | 85 | --- 86 | 87 | ## ✅ **Example of Apache Flume Use Cases 🚀** 88 | Here are some real-world use cases: 89 | 90 | --- 91 | 92 | ### ✔ **1. Capturing Web Server Logs (Access Logs, Error Logs)** 93 | Suppose you have a website with **1 Billion hits/day** like **Flipkart, Amazon, etc.**. 94 | 95 | 👉 Every hit generates a log file: 96 | ``` 97 | 2025-03-10 12:34:55 INFO User Clicked on Product ID: 2345 98 | 2025-03-10 12:35:00 INFO User Added Product ID: 2345 to Cart 99 | ``` 100 | 101 | 👉 Flume will: 102 | - ✅ **Capture these logs**. 
103 | - ✅ **Stream them to Hadoop (HDFS)** in real-time. 104 | - ✅ You can **analyze it later in Hive**. 105 | 106 | ### **Flume Configuration Example:** 107 | ```properties 108 | # Flume Agent Configuration 109 | agent1.sources = source1 110 | agent1.channels = channel1 111 | agent1.sinks = sink1 112 | 113 | # Source Configuration (Log File) 114 | agent1.sources.source1.type = exec 115 | agent1.sources.source1.command = tail -f /var/log/httpd/access.log 116 | 117 | # Channel Configuration 118 | agent1.channels.channel1.type = memory 119 | 120 | # Sink Configuration (HDFS) 121 | agent1.sinks.sink1.type = hdfs 122 | agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/logs 123 | ``` 124 | 125 | ✅ Flume will **capture log files in real-time** and push them to **HDFS**. 126 | 127 | --- 128 | 129 | ### ✔ **2. Capturing Twitter Data (Trending Hashtags)** 130 | Suppose you want to capture **live tweets** on a trending hashtag like: 131 | ``` 132 | #election2025 133 | #iphone16 134 | #IndiaWins 135 | ``` 136 | 137 | 👉 **Flume can capture these tweets** and push them to **HDFS/Hive** for analysis. 138 | 139 | ### ✅ Flume Twitter Configuration Example: 140 | ```properties 141 | # Source Configuration 142 | agent1.sources.source1.type = org.apache.flume.source.twitter.TwitterSource 143 | agent1.sources.source1.consumerKey = YOUR_CONSUMER_KEY 144 | agent1.sources.source1.consumerSecret = YOUR_CONSUMER_SECRET 145 | agent1.sources.source1.accessToken = YOUR_ACCESS_TOKEN 146 | agent1.sources.source1.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET 147 | 148 | # Sink Configuration (HDFS) 149 | agent1.sinks.sink1.type = hdfs 150 | agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/twitter 151 | ``` 152 | 153 | 👉 ✅ **Flume will capture live tweets** and push them to **HDFS**. 154 | 155 | --- 156 | 157 | ### ✔ **3. IoT Sensor Data (Smart Homes, CCTV, Temperature Sensors)** 158 | Suppose you have: 159 | - ✅ **IoT Sensors (Temperature, Humidity, CCTV)**. 160 | - ✅ You want to capture the data in real-time. 161 | 162 | 👉 Flume will: 163 | - ✅ Continuously read sensor data. 164 | - ✅ Push it to HDFS in real-time. 165 | - ✅ You can then analyze it. 166 | 167 | --- 168 | 169 | ## ✅ **Types of Flume Channels 🚀** 170 | | Channel Type | Use Case | 171 | |-----------------|-----------------------------------------------------------------| 172 | | ✅ **Memory Channel** | Fastest but not durable (if Flume crashes, data is lost). | 173 | | ✅ **File Channel** | Slower but data is saved even if Flume crashes. | 174 | | ✅ **Kafka Channel** | Highly scalable and fault-tolerant (best for production). | 175 | 176 | --- 177 | 178 | ## ✅ **Why Is Flume Better Than Manual Data Transfer? 🚀** 179 | | Feature | Manual File Transfer | Apache Flume | 180 | |--------------------------|------------------------|----------------------------------------| 181 | | **Data Transfer Speed** | Very Slow | Lightning Fast 🚀 | 182 | | **Streaming Data** | Impossible | Handles Real-time Streaming 🚀 | 183 | | **Data Loss** | High | Zero Data Loss (Fault-tolerant) | 184 | | **Automation** | Manual Effort | Fully Automated | 185 | | **Big Data Compatibility**| Not Possible | Integrates with Hadoop, Hive, HBase | 186 | 187 | --- 188 | 189 | ## ✅ **Where Does Apache Flume Send Data? 
🚀** 190 | | Data Source | Flume Can Send Data To | 191 | |--------------------------|-----------------------------------------------------| 192 | | ✅ **Log Files** | **HDFS / Hive / HBase / Kafka** | 193 | | ✅ **Social Media** | **Hive / Spark / ElasticSearch** | 194 | | ✅ **IoT Devices** | **Hadoop / MongoDB / Kafka** | 195 | | ✅ **Web Server Logs** | **HDFS / Hive / Kafka** | 196 | 197 | --- 198 | 199 | ## ✅ **Why Is Flume So Powerful? 💯** 200 | 👉 Flume can: 201 | - ✅ **Ingest Terabytes of Data/Hour.** 202 | - ✅ Handle **Millions of Streaming Logs/Second**. 203 | - ✅ Push data to **Hadoop, Hive, HBase, Kafka, etc.** 204 | - ✅ Fully Automated. 205 | - ✅ Real-time Data Processing. 206 | 207 | --- 208 | 209 | ## ✅ **🔥 Final Answer** 210 | 👉 **Apache Flume** is used for: 211 | - ✅ **Real-time streaming data capture.** 212 | - ✅ **Log file ingestion from web servers.** 213 | - ✅ **Capturing social media data (Twitter, YouTube, etc.).** 214 | - ✅ **Moving IoT data (sensors, CCTV) to Hadoop.** 215 | 216 | --- 217 | 218 | 219 | ### **Here is a complete step-by-step guide to install Apache Flume on top of your Hadoop setup and demonstrate a working example:** 220 | 221 | --- 222 | 223 | ### **Step 1: Install Apache Flume** 224 | 225 | 1. **Download Apache Flume** 226 | Visit the official Apache Flume [download page](https://flume.apache.org/download.html) or use `wget` to download the latest binary tarball directly: 227 | ```bash 228 | wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz 229 | ``` 230 | 231 | 2. **Extract the Tarball** 232 | Extract the downloaded tarball: 233 | ```bash 234 | tar -xvzf apache-flume-1.9.0-bin.tar.gz 235 | ``` 236 | 237 | 3. **Move the Folder** 238 | Move the extracted folder to `/usr/local/flume`: 239 | ```bash 240 | mv apache-flume-1.9.0-bin /usr/local/flume 241 | ``` 242 | 243 | 4. **Set Environment Variables** 244 | Add Flume to your `PATH` by editing the `~/.bashrc` file: 245 | ```bash 246 | nano ~/.bashrc 247 | ``` 248 | Add the following lines at the end of the file: 249 | ```bash 250 | export FLUME_HOME=/usr/local/flume 251 | export PATH=$PATH:$FLUME_HOME/bin 252 | ``` 253 | Reload the environment variables: 254 | ```bash 255 | source ~/.bashrc 256 | ``` 257 | 258 | 5. **Verify Installation** 259 | Check Flume's version: 260 | ```bash 261 | flume-ng version 262 | ``` 263 | ![image](https://github.com/user-attachments/assets/14fd9825-4efe-4c17-9167-3feab67710ac) 264 | 265 | --- 266 | 267 | ### **Step 2: Configure Flume** 268 | 269 | 1. Navigate to the Flume configuration directory: 270 | ```bash 271 | cd /usr/local/flume/conf 272 | ``` 273 | 274 | 2. Create a new Flume agent configuration file: 275 | ```bash 276 | nano demo-agent.conf 277 | ``` 278 | 279 | 3. 
Add the following content to define the Flume agent configuration: 280 | ```properties 281 | # Define the agent components 282 | demo.sources = source1 283 | demo.sinks = sink1 284 | demo.channels = channel1 285 | 286 | # Define the source 287 | demo.sources.source1.type = netcat 288 | demo.sources.source1.bind = localhost 289 | demo.sources.source1.port = 44444 290 | 291 | # Define the sink (HDFS) 292 | demo.sinks.sink1.type = hdfs 293 | demo.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/flume/demo 294 | demo.sinks.sink1.hdfs.fileType = DataStream 295 | 296 | # Define the channel 297 | demo.channels.channel1.type = memory 298 | demo.channels.channel1.capacity = 1000 299 | demo.channels.channel1.transactionCapacity = 100 300 | 301 | # Bind the source and sink to the channel 302 | demo.sources.source1.channels = channel1 303 | demo.sinks.sink1.channel = channel1 304 | ``` 305 | 306 | Replace `localhost` with your Hadoop Namenode hostname or IP address. 307 | you can find it using cat $HADOOP_HOME/etc/hadoop/core-site.xml 308 | --- 309 | 310 | ### **Step 3: Start Flume Agent** 311 | 312 | Run the Flume agent using the configuration file: 313 | ```bash 314 | flume-ng agent \ 315 | --conf /usr/local/flume/conf \ 316 | --conf-file /usr/local/flume/conf/demo-agent.conf \ 317 | --name demo \ 318 | -Dflume.root.logger=INFO,console 319 | ``` 320 | 321 | This starts the Flume agent with the name `demo` and logs activities to the console. 322 | 323 | ![image](https://github.com/user-attachments/assets/8dcae12e-2b1f-490b-ae07-f052040b3c7d) 324 | 325 | --- 326 | If you're facing error `bash: nc: command not found` indicates that the `netcat` (`nc`) utility is not installed in your container. Netcat is required to send data to the Flume source. 327 | 328 | ### **Steps to Resolve** 329 | 330 | 1. **Install Netcat in the Container** 331 | - Install `netcat` using the package manager inside the container: 332 | ```bash 333 | apt-get update 334 | apt-get install netcat -y 335 | ``` 336 | - Verify the installation: 337 | ```bash 338 | nc -h 339 | ``` 340 | 341 | 2. **Test the Netcat Command Again** 342 | After installing `netcat`, retry the command to send data to Flume: 343 | ```bash 344 | echo "Hello Flume Demo" | nc localhost 44444 345 | ``` 346 | 347 | 3. **Verify Data in Flume Sink** 348 | - Check the configured HDFS path or the file sink location to verify that the message has been captured by the Flume agent. 349 | 350 | --- 351 | 352 | 353 | ### **Step 4: Test Flume Data Flow** 354 | 355 | 1. **Send Data to Flume Source** 356 | Open another terminal and send data to the Netcat source using the `nc` command: 357 | ```bash 358 | echo "Hello Flume Demo" | nc localhost 44444 359 | ``` 360 | ![image](https://github.com/user-attachments/assets/1290fb3e-cdac-4265-8c3a-067265783963) 361 | 362 | Send multiple lines of data: 363 | ```bash 364 | for i in {1..5}; do echo "This is message $i" | nc localhost 44444; done 365 | ``` 366 | ![image](https://github.com/user-attachments/assets/e2cc2a42-7f26-4b6b-81cc-5102a1f39a7f) 367 | 368 | 1. **Verify Data in HDFS** 369 | Check the HDFS directory where Flume is writing data: 370 | ```bash 371 | hadoop fs -ls /user/flume/demo 372 | ``` 373 | View the ingested data files: 374 | ```bash 375 | hadoop fs -cat /user/flume/demo/* 376 | ``` 377 | 378 | You should see the messages sent via `Netcat`. 
379 | ![image](https://github.com/user-attachments/assets/9460b9d8-8ba4-4788-a318-a55bac5a27d3) 380 | 381 | --- 382 | 383 | ### **Step 5: Optional Customizations** 384 | 385 | 1. **Roll Policies** 386 | Adjust roll policies in the sink configuration: 387 | - **Roll by file size**: 388 | ```properties 389 | demo.sinks.sink1.hdfs.rollSize = 1048576 # 1MB 390 | ``` 391 | - **Roll by time interval**: 392 | ```properties 393 | demo.sinks.sink1.hdfs.rollInterval = 300 # 5 minutes 394 | ``` 395 | - **Roll by event count**: 396 | ```properties 397 | demo.sinks.sink1.hdfs.rollCount = 1000 398 | ``` 399 | 400 | 2. **Monitoring and Logging** 401 | Configure monitoring and logging in `flume-env.sh` and `log4j.properties`. 402 | 403 | --- 404 | 405 | ### **Expected Results** 406 | 407 | 1. **Flume Console Output** 408 | You will see logs showing Flume processing events and writing them to HDFS. 409 | 410 | 2. **HDFS Data** 411 | The ingested data in HDFS will look like this: 412 | ``` 413 | Hello Flume Demo 414 | This is message 1 415 | This is message 2 416 | This is message 3 417 | ``` 418 | 419 | --- 420 | 421 | ### **Troubleshooting** 422 | 423 | - **Agent Fails to Start**: 424 | Check the logs for configuration errors: 425 | ```bash 426 | cat /usr/local/flume/logs/flume.log 427 | ``` 428 | 429 | - **Data Not in HDFS**: 430 | Ensure the `namenode_host` in the sink configuration is correct and that the HDFS path is writable. 431 | 432 | --- 433 | -------------------------------------------------------------------------------- /hadoop-basic-commands.md: -------------------------------------------------------------------------------- 1 | Here's a **comprehensive list of HDFS commands** with easy-to-understand instructions for quick reference: 2 | ![image](https://github.com/user-attachments/assets/1ea3ba32-1b68-4584-b521-3f5e6f5c6ffb) 3 | 4 | --- 5 | 6 | ## 📂 **HDFS Commands - Complete Reference** 7 | 8 | --- 9 | 10 | ### **1. Basic File Operations** 11 | 12 | #### 📄 **Create a new file locally** 13 | Create a file on your local system: 14 | ```bash 15 | echo "This is a sample file" > localfile.txt 16 | ``` 17 | 18 | #### 📤 **Upload a local file to HDFS** 19 | Upload a local file to HDFS: 20 | ```bash 21 | hdfs dfs -put localfile.txt /user/hadoop/destination-path 22 | ``` 23 | 24 | #### ⬇️ **Download a file from HDFS to the local file system** 25 | Use the `-get` command to copy files from HDFS to the local system: 26 | ```bash 27 | hdfs dfs -get /path/to/hdfspath /localpath 28 | ``` 29 | 30 | #### 🖼️ **View the file content from HDFS** 31 | View the contents of a file directly without copying it: 32 | ```bash 33 | hdfs dfs -cat /path/to/file 34 | ``` 35 | 36 | #### ✍️ **Append content to an HDFS file** 37 | Append local file content to an existing file on HDFS: 38 | ```bash 39 | hdfs dfs -appendToFile localfile.txt /user/hadoop/hdfspath 40 | ``` 41 | 42 | --- 43 | 44 | ### **2. 
Directory Operations** 45 | 46 | #### 📁 **Create a directory** 47 | Create a new directory in HDFS: 48 | ```bash 49 | hdfs dfs -mkdir /path/to/directory 50 | ``` 51 | 52 | #### 🛠️ **Create multiple directories** 53 | Create multiple directories in a single command: 54 | ```bash 55 | hdfs dfs -mkdir -p /path/to/dir1 /path/to/dir2 56 | ``` 57 | 58 | #### 🧑‍💻 **Check directory usage with summary** 59 | View the disk usage of a directory in human-readable format: 60 | ```bash 61 | hdfs dfs -du -s -h /path/to/directory 62 | ``` 63 | 64 | #### 📑 **List contents of a directory** 65 | List the files in a directory on HDFS: 66 | ```bash 67 | hdfs dfs -ls /path/to/directory 68 | ``` 69 | 70 | --- 71 | 72 | ### **3. File Operations** 73 | 74 | #### ✏️ **Rename or move a file in HDFS** 75 | Rename or move a file within HDFS: 76 | ```bash 77 | hdfs dfs -mv /path/to/oldfile /path/to/newfile 78 | ``` 79 | 80 | #### 📦 **Copy a file within HDFS** 81 | Copy a file from one location in HDFS to another: 82 | ```bash 83 | hdfs dfs -cp /path/to/source /path/to/destination 84 | ``` 85 | 86 | #### 🗂️ **Count files, directories, and bytes in HDFS** 87 | Get the count of files, directories, and the total byte size in a directory: 88 | ```bash 89 | hdfs dfs -count /path/to/directory 90 | ``` 91 | 92 | #### 📝 **Display the first few lines of a file** 93 | View the first few lines of a file: 94 | ```bash 95 | hdfs dfs -head /path/to/file 96 | ``` 97 | 98 | #### 📚 **Display the last few lines of a file** 99 | View the last few lines of a file: 100 | ```bash 101 | hdfs dfs -tail /path/to/file 102 | ``` 103 | 104 | #### 🔒 **Display file checksum** 105 | Verify file integrity by checking the checksum: 106 | ```bash 107 | hdfs dfs -checksum /path/to/file 108 | ``` 109 | 110 | --- 111 | 112 | ### **4. File Permission and Ownership** 113 | 114 | #### 🔧 **Change file or directory permissions** 115 | Change the permissions of a file or directory: 116 | ```bash 117 | hdfs dfs -chmod 755 /path/to/file-or-directory 118 | ``` 119 | 120 | #### 🧑‍🔧 **Change file or directory ownership** 121 | Change the ownership of a file or directory: 122 | ```bash 123 | hdfs dfs -chown user:group /path/to/file-or-directory 124 | ``` 125 | 126 | #### 📊 **Set file replication factor** 127 | Change the replication factor of a file or directory: 128 | ```bash 129 | hdfs dfs -setrep -w 3 /path/to/file-or-directory 130 | ``` 131 | 132 | --- 133 | 134 | ### **5. Data Verification and Repair** 135 | 136 | #### 🛡️ **Verify the file checksum** 137 | Check if the file’s checksum matches its original value: 138 | ```bash 139 | hdfs dfs -checksum /path/to/file 140 | ``` 141 | 142 | #### 🛠️ **Recover corrupted blocks in HDFS** 143 | Recover corrupted files by moving or deleting bad blocks: 144 | ```bash 145 | hdfs fsck /path/to/file -move -delete 146 | ``` 147 | 148 | --- 149 | 150 | ### **6. Data Migration and Export** 151 | 152 | #### 📤 **Export a directory to the local filesystem** 153 | Copy a directory from HDFS to a local file system: 154 | ```bash 155 | hdfs dfs -get /path/to/hdfspath /localpath 156 | ``` 157 | 158 | #### 🔄 **Export a file from one HDFS directory to another** 159 | Copy a file from one HDFS location to another: 160 | ```bash 161 | hdfs dfs -cp /path/to/hdfspath /new/path/to/hdfspath 162 | ``` 163 | 164 | --- 165 | 166 | ### **7. 
File System Check** 167 | 168 | #### 🏥 **Check the health of HDFS** 169 | Perform a health check on HDFS and get details about block and file status: 170 | ```bash 171 | hdfs fsck / -files -blocks -locations 172 | ``` 173 | 174 | #### 📈 **Check block replication status** 175 | View block replication details and the location of blocks: 176 | ```bash 177 | hdfs fsck / -blocks -locations 178 | ``` 179 | 180 | --- 181 | 182 | ### **8. HDFS Admin Commands** 183 | 184 | #### 🔍 **Show HDFS file system status** 185 | Get a report on the status and health of the HDFS system: 186 | ```bash 187 | hdfs dfsadmin -report 188 | ``` 189 | 190 | #### 🛑 **Enable safemode** 191 | Enter HDFS safemode (used for maintenance operations): 192 | ```bash 193 | hdfs dfsadmin -safemode enter 194 | ``` 195 | 196 | #### 🚪 **Disable safemode** 197 | Exit from HDFS safemode: 198 | ```bash 199 | hdfs dfsadmin -safemode leave 200 | ``` 201 | 202 | #### 📊 **Check safemode status** 203 | Check if HDFS is in safemode: 204 | ```bash 205 | hdfs dfsadmin -safemode get 206 | ``` 207 | 208 | #### 🧑‍🔧 **Decommission a DataNode** 209 | Remove a DataNode from the cluster (by updating the `dfs.exclude` file): 210 | ```bash 211 | hdfs dfsadmin -refreshNodes 212 | ``` 213 | 214 | --- 215 | 216 | ### **9. YARN Commands** 217 | 218 | #### 🖥️ **Resource Manager Operations** 219 | 220 | ##### 📊 **Check cluster metrics** 221 | Get detailed metrics for the YARN cluster: 222 | ```bash 223 | yarn cluster -metrics 224 | ``` 225 | 226 | ##### 🔍 **View NodeManager details** 227 | List the details of NodeManagers in the YARN cluster: 228 | ```bash 229 | yarn node -list 230 | ``` 231 | 232 | #### 🧑‍💻 **Container Management** 233 | 234 | ##### 📋 **List containers for an application** 235 | List the containers running for a specific application: 236 | ```bash 237 | yarn container -list 238 | ``` 239 | 240 | ##### ⛔ **Kill a specific container** 241 | Terminate a running container: 242 | ```bash 243 | yarn container -kill 244 | ``` 245 | 246 | --- 247 | 248 | ### **10. General Hadoop Commands** 249 | 250 | #### 🆘 **Display all Hadoop-related commands** 251 | Get a list of all Hadoop commands: 252 | ```bash 253 | hadoop -help 254 | ``` 255 | 256 | #### 📚 **Display help for specific HDFS commands** 257 | Get detailed help for HDFS commands: 258 | ```bash 259 | hdfs dfs -help 260 | ``` 261 | 262 | #### 📄 **Display help for YARN commands** 263 | Get detailed help for YARN commands: 264 | ```bash 265 | yarn -help 266 | ``` 267 | 268 | --- 269 | 270 | ### **11. General Tips for Hadoop** 271 | 272 | - **Use aliases for commonly used commands** 273 | Save time by creating aliases for frequently used commands. Add these to your `.bashrc` or `.zshrc`: 274 | ```bash 275 | alias hls="hdfs dfs -ls" 276 | alias hput="hdfs dfs -put" 277 | alias hget="hdfs dfs -get" 278 | ``` 279 | 280 | - **Use `-help` with any Hadoop command** 281 | To learn more options and features, always try `-help` with any Hadoop command: 282 | ```bash 283 | hdfs dfs -help 284 | yarn -help 285 | hadoop -help 286 | ``` 287 | 288 | --- 289 | 290 | By following these instructions, you will be able to easily manage and manipulate files, directories, and resources in Hadoop Distributed File System (HDFS) and YARN. 
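As a quick self-check, the short session below strings several of the commands above into one end-to-end round trip (create, upload, inspect, download, clean up). The directory `/user/hadoop/demo` and the file names are only illustrative; substitute paths your HDFS user actually owns.

```bash
# Create a small local file to work with
echo "hello hdfs" > demo.txt

# Create a working directory in HDFS and upload the file
hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put demo.txt /user/hadoop/demo/

# Inspect what was written
hdfs dfs -ls /user/hadoop/demo
hdfs dfs -cat /user/hadoop/demo/demo.txt

# Copy it back to the local filesystem under a new name
hdfs dfs -get /user/hadoop/demo/demo.txt demo_copy.txt

# Remove the HDFS directory once you are done
hdfs dfs -rm -r /user/hadoop/demo
```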
291 | -------------------------------------------------------------------------------- /hadoop-hive.env: -------------------------------------------------------------------------------- 1 | HIVE_SITE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore 2 | HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName=org.postgresql.Driver 3 | HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive 4 | HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive 5 | HIVE_SITE_CONF_datanucleus_autoCreateSchema=false 6 | HIVE_SITE_CONF_hive_metastore_uris=thrift://hive-metastore:9083 7 | HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false 8 | 9 | CORE_CONF_fs_defaultFS=hdfs://namenode:9000 10 | CORE_CONF_hadoop_http_staticuser_user=root 11 | CORE_CONF_hadoop_proxyuser_hue_hosts=* 12 | CORE_CONF_hadoop_proxyuser_hue_groups=* 13 | 14 | HDFS_CONF_dfs_webhdfs_enabled=true 15 | HDFS_CONF_dfs_permissions_enabled=false 16 | 17 | YARN_CONF_yarn_log___aggregation___enable=true 18 | YARN_CONF_yarn_resourcemanager_recovery_enabled=true 19 | YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore 20 | YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate 21 | YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs 22 | YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/ 23 | YARN_CONF_yarn_timeline___service_enabled=true 24 | YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true 25 | YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true 26 | YARN_CONF_yarn_resourcemanager_hostname=resourcemanager 27 | YARN_CONF_yarn_timeline___service_hostname=historyserver 28 | YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032 29 | YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030 30 | YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031 31 | -------------------------------------------------------------------------------- /hadoop.env: -------------------------------------------------------------------------------- 1 | CORE_CONF_fs_defaultFS=hdfs://namenode:9000 2 | CORE_CONF_hadoop_http_staticuser_user=root 3 | CORE_CONF_hadoop_proxyuser_hue_hosts=* 4 | CORE_CONF_hadoop_proxyuser_hue_groups=* 5 | CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec 6 | 7 | HDFS_CONF_dfs_webhdfs_enabled=true 8 | HDFS_CONF_dfs_permissions_enabled=false 9 | HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false 10 | 11 | YARN_CONF_yarn_log___aggregation___enable=true 12 | YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/ 13 | YARN_CONF_yarn_resourcemanager_recovery_enabled=true 14 | YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore 15 | YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler 16 | YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192 17 | YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4 18 | YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate 19 | YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true 20 | YARN_CONF_yarn_resourcemanager_hostname=resourcemanager 21 | YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032 22 | 
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030 23 | YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031 24 | YARN_CONF_yarn_timeline___service_enabled=true 25 | YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true 26 | YARN_CONF_yarn_timeline___service_hostname=historyserver 27 | YARN_CONF_mapreduce_map_output_compress=true 28 | YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec 29 | YARN_CONF_yarn_nodemanager_resource_memory___mb=16384 30 | YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8 31 | YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5 32 | YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs 33 | YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle 34 | 35 | MAPRED_CONF_mapreduce_framework_name=yarn 36 | MAPRED_CONF_mapred_child_java_opts=-Xmx4096m 37 | MAPRED_CONF_mapreduce_map_memory_mb=4096 38 | MAPRED_CONF_mapreduce_reduce_memory_mb=8192 39 | MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m 40 | MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m 41 | MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/ 42 | MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/ 43 | MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/ 44 | -------------------------------------------------------------------------------- /hadoop_installation_VMware Workstation.md: -------------------------------------------------------------------------------- 1 | Guide for installing **Hadoop 3.3.6 on Ubuntu 24.04** in a **VMware virtual machine**. This guide includes **troubleshooting tips, verification steps, and SSH configuration** to ensure a **properly working** single-node (pseudo-distributed) Hadoop setup. 2 | 3 | --- 4 | 5 | # **🚀 Complete Guide to Installing Hadoop 3.3.6 on Ubuntu 24.04 (VMware)** 6 | This guide covers: 7 | ✅ Installing **Hadoop 3.3.6** on **Ubuntu 24.04** 8 | ✅ Configuring **HDFS, YARN, and MapReduce** 9 | ✅ Setting up **passwordless SSH** 10 | ✅ Ensuring **proper Java installation** 11 | ✅ Troubleshooting common issues 12 | 13 | --- 14 | 15 | ## **1️⃣ Prerequisites** 16 | Before starting, ensure: 17 | ✔ You have **Ubuntu 24.04** running in **VMware Workstation**. 18 | ✔ At least **4GB RAM**, **50GB disk space**, and **4 CPU cores** are allocated to the VM. 19 | ✔ Java **8 or later** is installed. 20 | 21 | --- 22 | 23 | ## **2️⃣ Update Ubuntu Packages** 24 | Update system packages to avoid dependency issues: 25 | ```bash 26 | sudo apt update && sudo apt upgrade -y 27 | ``` 28 | 29 | --- 30 | 31 | ## **3️⃣ Install Java (OpenJDK 11)** 32 | Hadoop requires Java. 
The recommended version is **OpenJDK 11**: 33 | ```bash 34 | sudo apt install openjdk-11-jdk -y 35 | ``` 36 | Verify installation: 37 | ```bash 38 | java -version 39 | ``` 40 | Expected output (may vary slightly): 41 | ``` 42 | openjdk version "11.0.20" 2024-XX-XX 43 | ``` 44 | **Alternative:** If you need Java 8 for compatibility, install it using: 45 | ```bash 46 | sudo apt install openjdk-8-jdk -y 47 | ``` 48 | 49 | --- 50 | 51 | ## **4️⃣ Create a Hadoop User (Optional - Skip for now)** 52 | Instead of using root or your personal user, create a dedicated **hadoop** user: 53 | ```bash 54 | sudo adduser hadoop 55 | ``` 56 | Add your user to the `sudo` group: 57 | ```bash 58 | sudo usermod -aG sudo hadoop 59 | ``` 60 | Switch to the `hadoop` user: 61 | ```bash 62 | su - hadoop 63 | ``` 64 | 65 | --- 66 | 67 | ## **5️⃣ Download & Install Hadoop 3.3.6** 68 | Navigate to the **Apache Hadoop downloads page**: 69 | 🔗 [https://hadoop.apache.org/releases.html](https://hadoop.apache.org/releases.html) 70 | 71 | Download Hadoop 3.3.6: 72 | ```bash 73 | wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz 74 | ``` 75 | Verify the file integrity (optional but recommended): 76 | ```bash 77 | sha512sum hadoop-3.3.6.tar.gz 78 | ``` 79 | Compare the hash with the one on the official website. 80 | 81 | Extract Hadoop: 82 | ```bash 83 | tar -xvzf hadoop-3.3.6.tar.gz 84 | ``` 85 | Move it to `/usr/local/`: 86 | ```bash 87 | sudo mv hadoop-3.3.6 /usr/local/hadoop 88 | ``` 89 | Set permissions: 90 | ```bash 91 | sudo chown -R $USER:$USER /usr/local/hadoop 92 | ``` 93 | 94 | --- 95 | 96 | ## **6️⃣ Configure Hadoop Environment Variables** 97 | Edit the `~/.bashrc` file: 98 | ```bash 99 | nano ~/.bashrc 100 | ``` 101 | Add these lines at the end: 102 | ```bash 103 | # Hadoop Environment Variables 104 | export HADOOP_HOME=/usr/local/hadoop 105 | export HADOOP_INSTALL=$HADOOP_HOME 106 | export HADOOP_MAPRED_HOME=$HADOOP_HOME 107 | export HADOOP_COMMON_HOME=$HADOOP_HOME 108 | export HADOOP_HDFS_HOME=$HADOOP_HOME 109 | export HADOOP_YARN_HOME=$HADOOP_HOME 110 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 111 | export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin 112 | ``` 113 | Save & exit (CTRL+X → Y → ENTER). 114 | 115 | Apply changes: 116 | ```bash 117 | source ~/.bashrc 118 | ``` 119 | Verify: 120 | ```bash 121 | echo $HADOOP_HOME 122 | ``` 123 | Expected output: `/usr/local/hadoop` 124 | 125 | --- 126 | 127 | ## **7️⃣ Configure Hadoop Core Files** 128 | ### **1️⃣ Configure `hadoop-env.sh`** 129 | Edit Hadoop environment configuration: 130 | ```bash 131 | nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh 132 | ``` 133 | Find the line: 134 | ```bash 135 | export JAVA_HOME= 136 | ``` 137 | Replace it with: (Replace `11` with `8` if you installed JDK 8) 138 | ```bash 139 | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 140 | ``` 141 | Save & exit. 142 | 143 | --- 144 | 145 | ### **2️⃣ Configure `core-site.xml`** 146 | Edit: 147 | ```bash 148 | nano $HADOOP_HOME/etc/hadoop/core-site.xml 149 | ``` 150 | Replace existing content with: 151 | ```xml 152 | 153 | 154 | fs.defaultFS 155 | hdfs://localhost:9000 156 | 157 | 158 | ``` 159 | Save & exit. 
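Once the environment variables from step 6️⃣ are loaded, you can confirm that Hadoop actually picked up this value by querying the configuration directly (no daemons need to be running for this). The expected output shown below assumes the `hdfs://localhost:9000` URI configured above.

```bash
# Print the filesystem URI Hadoop resolved from core-site.xml
hdfs getconf -confKey fs.defaultFS
# Expected output:
# hdfs://localhost:9000
```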
160 | 161 | --- 162 | 163 | ### **3️⃣ Configure `hdfs-site.xml`** 164 | Edit: 165 | ```bash 166 | nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml 167 | ``` 168 | Add: 169 | ```xml 170 | 171 | 172 | dfs.replication 173 | 1 174 | 175 | 176 | dfs.name.dir 177 | file:///usr/local/hadoop/hdfs/namenode 178 | 179 | 180 | dfs.data.dir 181 | file:///usr/local/hadoop/hdfs/datanode 182 | 183 | 184 | ``` 185 | Create necessary directories: 186 | ```bash 187 | mkdir -p /usr/local/hadoop/hdfs/namenode 188 | mkdir -p /usr/local/hadoop/hdfs/datanode 189 | ``` 190 | Set permissions: 191 | ```bash 192 | sudo chown -R $USER:$USER /usr/local/hadoop/hdfs 193 | ``` 194 | 195 | --- 196 | 197 | ### **4️⃣ Configure `mapred-site.xml`** 198 | ```bash 199 | cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml 200 | nano $HADOOP_HOME/etc/hadoop/mapred-site.xml 201 | ``` 202 | Add: 203 | ```xml 204 | 205 | 206 | mapreduce.framework.name 207 | yarn 208 | 209 | 210 | yarn.app.mapreduce.am.env 211 | HADOOP_MAPRED_HOME=/usr/local/hadoop 212 | 213 | 214 | mapreduce.map.env 215 | HADOOP_MAPRED_HOME=/usr/local/hadoop 216 | 217 | 218 | mapreduce.reduce.env 219 | HADOOP_MAPRED_HOME=/usr/local/hadoop 220 | 221 | 222 | ``` 223 | Save & exit. 224 | 225 | --- 226 | 227 | ### **5️⃣ Configure `yarn-site.xml`** 228 | ```bash 229 | nano $HADOOP_HOME/etc/hadoop/yarn-site.xml 230 | ``` 231 | Add: 232 | ```xml 233 | 234 | 235 | yarn.nodemanager.aux-services 236 | mapreduce_shuffle 237 | 238 | 239 | ``` 240 | Save & exit. 241 | 242 | --- 243 | 244 | ## **8️⃣ Configure Passwordless SSH** 245 | ```bash 246 | sudo apt install ssh -y 247 | ssh-keygen -t rsa -P "" 248 | cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 249 | chmod 600 ~/.ssh/authorized_keys 250 | ``` 251 | Test SSH: 252 | ```bash 253 | ssh localhost 254 | ``` 255 | 256 | --- 257 | 258 | ## **9️⃣ Format the Namenode & Start Hadoop** 259 | Format the Namenode: 260 | ```bash 261 | hdfs namenode -format 262 | ``` 263 | Start HDFS: 264 | ```bash 265 | start-dfs.sh 266 | ``` 267 | Start YARN: 268 | ```bash 269 | start-yarn.sh 270 | ``` 271 | Verify: 272 | ```bash 273 | jps 274 | ``` 275 | Expected output: 276 | ``` 277 | NameNode 278 | DataNode 279 | ResourceManager 280 | NodeManager 281 | ``` 282 | 283 | --- 284 | 285 | ## **✅ Verify Hadoop Installation** 286 | 📌 Open a browser and go to: 287 | ✔ HDFS Web UI → **http://localhost:9870/** 288 | ✔ YARN Web UI → **http://localhost:8088/** 289 | 290 | --- 291 | 292 | 🎉 **Congratulations!** You have successfully installed **Hadoop 3.3.6** on **Ubuntu 24.04 (VMware)**! 
😊 293 | -------------------------------------------------------------------------------- /historyserver/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:8188/ || exit 1 6 | 7 | ENV YARN_CONF_yarn_timeline___service_leveldb___timeline___store_path=/hadoop/yarn/timeline 8 | RUN mkdir -p /hadoop/yarn/timeline 9 | VOLUME /hadoop/yarn/timeline 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | EXPOSE 8188 15 | 16 | CMD ["/run.sh"] 17 | -------------------------------------------------------------------------------- /historyserver/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/yarn --config $HADOOP_CONF_DIR historyserver 4 | -------------------------------------------------------------------------------- /master/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-base:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | COPY master.sh / 6 | 7 | ENV SPARK_MASTER_PORT 7077 8 | ENV SPARK_MASTER_WEBUI_PORT 8080 9 | ENV SPARK_MASTER_LOG /spark/logs 10 | 11 | EXPOSE 8080 7077 6066 12 | 13 | CMD ["/bin/bash", "/master.sh"] 14 | -------------------------------------------------------------------------------- /master/README.md: -------------------------------------------------------------------------------- 1 | # Spark master 2 | 3 | See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark). -------------------------------------------------------------------------------- /master/master.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | export SPARK_MASTER_HOST=`hostname` 4 | 5 | . "/spark/sbin/spark-config.sh" 6 | 7 | . "/spark/bin/load-spark-env.sh" 8 | 9 | mkdir -p $SPARK_MASTER_LOG 10 | 11 | export SPARK_HOME=/spark 12 | 13 | ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out 14 | 15 | cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master \ 16 | --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG/spark-master.out 17 | -------------------------------------------------------------------------------- /namenode/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:9870/ || exit 1 6 | 7 | ENV HDFS_CONF_dfs_namenode_name_dir=file:///hadoop/dfs/name 8 | RUN mkdir -p /hadoop/dfs/name 9 | VOLUME /hadoop/dfs/name 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | EXPOSE 9870 15 | 16 | CMD ["/run.sh"] 17 | -------------------------------------------------------------------------------- /namenode/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | namedir=`echo $HDFS_CONF_dfs_namenode_name_dir | perl -pe 's#file://##'` 4 | if [ ! 
-d $namedir ]; then 5 | echo "Namenode name directory not found: $namedir" 6 | exit 2 7 | fi 8 | 9 | if [ -z "$CLUSTER_NAME" ]; then 10 | echo "Cluster name not specified" 11 | exit 2 12 | fi 13 | 14 | echo "remove lost+found from $namedir" 15 | rm -r $namedir/lost+found 16 | 17 | if [ "`ls -A $namedir`" == "" ]; then 18 | echo "Formatting namenode name directory: $namedir" 19 | $HADOOP_HOME/bin/hdfs --config $HADOOP_CONF_DIR namenode -format $CLUSTER_NAME 20 | fi 21 | 22 | $HADOOP_HOME/bin/hdfs --config $HADOOP_CONF_DIR namenode 23 | -------------------------------------------------------------------------------- /nginx/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM nginx 2 | 3 | MAINTAINER "Ivan Ermilov " 4 | 5 | COPY default.conf /etc/nginx/conf.d/default.conf 6 | COPY materialize.min.css /data/bde-css/materialize.min.css 7 | COPY bde-hadoop.css /data/bde-css/bde-hadoop.css 8 | -------------------------------------------------------------------------------- /nginx/bde-hadoop.css: -------------------------------------------------------------------------------- 1 | body { 2 | background: #F1F1F1; 3 | } 4 | 5 | body > .container { 6 | margin: 5rem auto; 7 | background: white; 8 | box-shadow: 0 2px 5px 0 rgba(0,0,0,0.16), 0 2px 10px 0 rgba(0,0,0,0.12); 9 | } 10 | 11 | header.bs-docs-nav { 12 | position: fixed; 13 | top: 0; 14 | left: 0; 15 | width: 100%; 16 | height: 3rem; 17 | border: none; 18 | background: #A94F74; 19 | box-shadow: 0 2px 5px 0 rgba(0,0,0,0.16), 0 2px 10px 0 rgba(0,0,0,0.12); 20 | } 21 | 22 | header.bs-docs-nav .navbar-brand { 23 | background: inherit; 24 | } 25 | 26 | #ui-tabs .active a { 27 | background: #B96A8B; 28 | } 29 | 30 | #ui-tabs > li > a { 31 | color: white; 32 | } 33 | 34 | .navbar-inverse .navbar-nav > .dropdown > a .caret { 35 | border-top-color: white; 36 | border-bottom-color: white; 37 | } 38 | 39 | .navbar-inverse .navbar-nav > .open > a, 40 | .navbar-inverse .navbar-nav > .open > a:hover, 41 | .navbar-inverse .navbar-nav > .open > a:focus { 42 | background-color: #B96A8B; 43 | } 44 | 45 | .dropdown-menu > li > a { 46 | color: #A94F74; 47 | } 48 | 49 | .modal-dialog .panel-success { 50 | border-color: lightgrey; 51 | } 52 | 53 | .modal-dialog .panel-heading { 54 | background-color: #A94F74 !important; 55 | } 56 | 57 | .modal-dialog .panel-heading select { 58 | margin-top: 1rem; 59 | } -------------------------------------------------------------------------------- /nginx/default.conf: -------------------------------------------------------------------------------- 1 | server { 2 | listen 80; 3 | server_name localhost; 4 | 5 | root /data; 6 | gzip on; 7 | 8 | location / { 9 | proxy_pass http://127.0.0.1:8000; 10 | proxy_set_header Accept-Encoding ""; 11 | } 12 | 13 | location /bde-css/ { 14 | } 15 | } 16 | 17 | server { 18 | listen 127.0.0.1:8000; 19 | location / { 20 | proxy_pass http://127.0.0.1:8001; 21 | sub_filter '' ' 22 | '; 23 | sub_filter_once on; 24 | proxy_set_header Accept-Encoding ""; 25 | } 26 | } 27 | 28 | server { 29 | listen 127.0.0.1:8001; 30 | gunzip on; 31 | location / { 32 | proxy_pass http://namenode:50070; 33 | proxy_set_header Accept-Encoding gzip; 34 | } 35 | } 36 | -------------------------------------------------------------------------------- /nodemanager/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f 
http://localhost:8042/ || exit 1 6 | 7 | ADD run.sh /run.sh 8 | RUN chmod a+x /run.sh 9 | 10 | EXPOSE 8042 11 | 12 | CMD ["/run.sh"] 13 | -------------------------------------------------------------------------------- /nodemanager/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/yarn --config $HADOOP_CONF_DIR nodemanager 4 | -------------------------------------------------------------------------------- /resourcemanager/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:8088/ || exit 1 6 | 7 | ADD run.sh /run.sh 8 | RUN chmod a+x /run.sh 9 | 10 | EXPOSE 8088 11 | 12 | CMD ["/run.sh"] 13 | -------------------------------------------------------------------------------- /resourcemanager/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/yarn --config $HADOOP_CONF_DIR resourcemanager 4 | -------------------------------------------------------------------------------- /spark_in_action.MD: -------------------------------------------------------------------------------- 1 | # Spark with Hadoop Usage Guide 2 | 3 | ## 1. Starting Hadoop and Spark Services 4 | 5 | Before using Spark with Hadoop, ensure all required services are running. 6 | 7 | ### Start Hadoop Services: 8 | ```bash 9 | start-dfs.sh # Start HDFS 10 | start-yarn.sh # Start YARN 11 | ``` 12 | Verify running services: 13 | ```bash 14 | jps 15 | ``` 16 | Expected output (or similar): 17 | ``` 18 | NameNode 19 | DataNode 20 | SecondaryNameNode 21 | ResourceManager 22 | NodeManager 23 | ``` 24 | 25 | ### Start Spark Services (if needed): 26 | ```bash 27 | $SPARK_HOME/sbin/start-all.sh 28 | ``` 29 | or 30 | 31 | ```bash 32 | $SPARK_HOME/sbin/start-master.sh 33 | $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077 34 | ``` 35 | 36 | ### To Stop 37 | 38 | ```bash 39 | $SPARK_HOME/sbin/stop-all.sh 40 | ``` 41 | 42 | ## 2. Running Spark Shell on Hadoop (YARN Mode) 43 | ```bash 44 | spark-shell --master yarn 45 | ``` 46 | Run basic commands in Spark shell: 47 | ```scala 48 | val rdd = sc.parallelize(Seq("Spark", "Hadoop", "Big Data")) 49 | rdd.collect().foreach(println) 50 | ``` 51 | 52 | ## 3. Running a Spark Job on Hadoop (YARN) 53 | ### Submit a Job 54 | ```bash 55 | spark-submit --master yarn --deploy-mode client \ 56 | --class org.apache.spark.examples.SparkPi \ 57 | $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.1.jar 10 58 | ``` 59 | 60 | For cluster mode: 61 | ```bash 62 | spark-submit --master yarn --deploy-mode cluster \ 63 | --class org.apache.spark.examples.SparkPi \ 64 | $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.1.jar 10 65 | ``` 66 | 67 | ## 4. Reading and Writing Data from HDFS 68 | 69 | ### Upload File to HDFS: 70 | ```bash 71 | hdfs dfs -mkdir -p /user/lovnish/test 72 | hdfs dfs -put localfile.txt /user/lovnish/test/ 73 | ``` 74 | 75 | ### Read File in Spark: 76 | ```scala 77 | val file = sc.textFile("hdfs://localhost:9000/user/lovnish/test/localfile.txt") 78 | file.collect().foreach(println) 79 | ``` 80 | 81 | ### Write Output to HDFS: 82 | ```scala 83 | file.saveAsTextFile("hdfs://localhost:9000/user/lovnish/output") 84 | ``` 85 | 86 | ## 5. 
Using Spark SQL with Hive Metastore 87 | 88 | Start Spark with Hive support: 89 | ```bash 90 | spark-shell --master yarn --conf spark.sql.catalogImplementation=hive 91 | ``` 92 | 93 | ### Create a Table: 94 | ```scala 95 | spark.sql("CREATE TABLE students (id INT, name STRING) USING hive") 96 | spark.sql("INSERT INTO students VALUES (1, 'Spark'), (2, 'Hadoop')") 97 | ``` 98 | 99 | ### Query Data: 100 | ```scala 101 | spark.sql("SELECT * FROM students").show() 102 | ``` 103 | 104 | ## 6. Running a Python (PySpark) Job 105 | 106 | ### Start PySpark: 107 | ```bash 108 | pyspark --master yarn 109 | ``` 110 | 111 | ### Run a PySpark Job: 112 | ```python 113 | from pyspark.sql import SparkSession 114 | spark = SparkSession.builder.appName("PySparkExample").getOrCreate() 115 | data = [(1, "Spark"), (2, "Hadoop")] 116 | df = spark.createDataFrame(data, ["id", "name"]) 117 | df.show() 118 | ``` 119 | 120 | ## 7. Stopping Services 121 | 122 | Stop Spark: 123 | ```bash 124 | $SPARK_HOME/sbin/stop-all.sh 125 | ``` 126 | 127 | Stop Hadoop: 128 | ```bash 129 | stop-dfs.sh 130 | stop-yarn.sh 131 | ``` 132 | 133 | ## 8. Monitoring Spark Jobs 134 | 135 | View Spark Web UI: 136 | - Standalone Mode: http://localhost:4040 137 | - YARN Mode: Run `yarn application -list` to get the Application ID, then: 138 | ```bash 139 | yarn application -status 140 | ``` 141 | 142 | ## 9. Debugging and Logs 143 | Check logs of Spark applications running on YARN: 144 | ```bash 145 | yarn logs -applicationId 146 | ``` 147 | For Hadoop logs: 148 | ```bash 149 | hdfs dfsadmin -report 150 | ``` 151 | 152 | ## 10. Hands-On: Spark SQL 153 | 154 | ### **Objective**: 155 | To create DataFrames, load data from different sources, and perform transformations and SQL queries. 156 | 157 | ### **Step 1: Setup Environment** 158 | 159 | Start Spark in Master Mode: 160 | ```bash 161 | $SPARK_HOME/sbin/start-master.sh 162 | ``` 163 | 164 | Start Spark in Worker Mode: 165 | ```bash 166 | $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077 167 | ``` 168 | 169 | Open Spark shell: 170 | ```bash 171 | spark-shell 172 | ``` 173 | 174 | ### **Step 2: Create DataFrames** 175 | ```scala 176 | val data = Seq( 177 | ("Alice", 30, "HR"), 178 | ("Bob", 25, "Engineering"), 179 | ("Charlie", 35, "Finance") 180 | ) 181 | 182 | val df = data.toDF("Name", "Age", "Department") 183 | 184 | df.show() 185 | ``` 186 | 187 | ### **Step 3: Perform Transformations Using Spark SQL** 188 | ```scala 189 | df.createOrReplaceTempView("employees") 190 | val result = spark.sql("SELECT Department, COUNT(*) as count FROM employees GROUP BY Department") 191 | result.show() 192 | ``` 193 | 194 | ### **Step 4: Save Transformed Data** 195 | ```scala 196 | result.write.option("header", "true").csv("hdfs://localhost:9000/data/output/output_employees") 197 | ``` 198 | 199 | ### **Step 5: Scala WordCount Program** 200 | ```scala 201 | import org.apache.spark.{SparkConf} 202 | val conf = new SparkConf().setAppName("WordCountExample").setMaster("local") 203 | val input = sc.textFile("hdfs://localhost:9000/data.txt") 204 | val wordPairs = input.flatMap(line => line.split(" ")).map(word => (word, 1)) 205 | val wordCounts = wordPairs.reduceByKey((a, b) => a + b) 206 | wordCounts.collect().foreach { case (word, count) => 207 | println(s"$word: $count") 208 | } 209 | ``` 210 | 211 | **Stop Session**: 212 | ```scala 213 | sc.stop() 214 | ``` 215 | 216 | --- 217 | 218 | ## **11. Key Takeaways** 219 | - Spark SQL simplifies working with structured data. 
220 | - DataFrames provide a flexible and powerful API for handling large datasets. 221 | - Apache Spark is a versatile tool for distributed data processing, offering scalability and performance. 222 | 223 | -------------------------------------------------------------------------------- /sqoop.md: -------------------------------------------------------------------------------- 1 | ### 💡 **What is Sqoop?** 2 | **Sqoop (SQL to Hadoop)** is a powerful **Big Data tool** used to **transfer data between:** 3 | - ✅ **Relational Databases (MySQL, Oracle, PostgreSQL, etc.)** 4 | - ✅ **Hadoop Ecosystem (HDFS, Hive, HBase, etc.)** 5 | 6 | --- 7 | 8 | ## ✅ **Why is Sqoop Important in Big Data?** 9 | Imagine you have **millions of records** in a **MySQL Database** (like customer data, sales data, etc.) and you want to: 10 | - **Analyze the data using Hadoop, Hive, or Spark.** 11 | - **Store the data in HDFS for distributed processing.** 12 | - **Move the processed data back to MySQL for reporting.** 13 | 14 | 👉 **Manually transferring data** from MySQL to Hadoop would be a nightmare. 15 | 👉 **But with Sqoop, you can transfer data within minutes! 🚀** 16 | 17 | --- 18 | 19 | ## 🚀 **Major Benefits of Using Sqoop** 20 | Here are the **Top 10 Benefits** of using **Sqoop in Big Data**: 21 | 22 | --- 23 | 24 | ## ✅ 1. **Easy Data Transfer from RDBMS to Hadoop (HDFS)** 25 | 👉 **Sqoop simplifies the process** of transferring large amounts of data from **MySQL, Oracle, SQL Server, etc., to HDFS.** 26 | 27 | ### Example: 28 | If you have **1 Billion rows** in MySQL and you want to **analyze** them in Hadoop, 29 | ✅ Without Sqoop → **You would write complex scripts (slow)** 30 | ✅ With Sqoop → **One command imports the data (fast)** 31 | 32 | **Command:** 33 | ```shell 34 | sqoop import \ 35 | --connect jdbc:mysql://localhost/testdb \ 36 | --username root \ 37 | --password password \ 38 | --table employees \ 39 | --target-dir /user/hdfs/employees_data 40 | ``` 41 | 42 | ✔ In just **5 minutes**, your **1 billion records** are transferred to Hadoop. 43 | 44 | --- 45 | 46 | ## ✅ 2. **Fast Data Transfer (Parallel Processing)** 47 | 👉 **Sqoop uses MapReduce internally** to transfer data from MySQL → Hadoop. 48 | 49 | ### What Happens Internally? 50 | - ✅ **Sqoop launches multiple MapReduce jobs**. 51 | - ✅ **Each MapReduce job transfers part of the data**. 52 | - ✅ **Parallel data transfer** speeds up the process. 53 | 54 | ### 🚀 Example: 55 | If you have **10 Million rows** in MySQL: 56 | - ✅ **Without Sqoop** → Takes **6 hours**. 57 | - ✅ **With Sqoop (parallel 8 mappers)** → Takes **30 minutes**. 58 | 59 | ✔ Massive speed improvement 🚀. 60 | 61 | --- 62 | 63 | ## ✅ 3. **Supports All Major Databases** 64 | 👉 Sqoop supports importing/exporting data from almost all major databases, including: 65 | - ✅ **MySQL** 66 | - ✅ **Oracle** 67 | - ✅ **PostgreSQL** 68 | - ✅ **MS SQL Server** 69 | - ✅ **DB2** 70 | - ✅ **Teradata** 71 | 72 | 👉 This means **you can use one single tool** for **all database operations**. 73 | 74 | --- 75 | 76 | ## ✅ 4. **Incremental Import (Import Only New Data)** 🚀 77 | 👉 This is a **game-changer!** 💯 78 | 79 | ### ✅ **Problem:** 80 | Suppose your MySQL database gets **new data every day**. 81 | - ❌ If you run a normal import → **It will import all data** (duplicate data). 82 | - ✅ But with **Sqoop Incremental Import**, you can **import only new data**. 
83 | 84 | ### ✅ **Example: Import Only New Data** 85 | ```shell 86 | sqoop import \ 87 | --connect jdbc:mysql://localhost/testdb \ 88 | --username root \ 89 | --password password \ 90 | --table orders \ 91 | --target-dir /user/hdfs/orders \ 92 | --incremental append \ 93 | --check-column order_date \ 94 | --last-value '2024-03-01' 95 | ``` 96 | 97 | 👉 **It will only import records after `2024-03-01`.** 98 | 99 | ### 🚀 Benefits: 100 | - ✅ No Duplicate Data. 101 | - ✅ Only New Data Comes In. 102 | - ✅ Saves Time and Resources. 103 | 104 | --- 105 | 106 | ## ✅ 5. **Incremental Export (Export Only New Data)** 💯 107 | 👉 You can also **export only new or updated data** from **Hadoop → MySQL**. 108 | 109 | ### ✅ Example: 110 | ```shell 111 | sqoop export \ 112 | --connect jdbc:mysql://localhost/testdb \ 113 | --username root \ 114 | --password password \ 115 | --table orders \ 116 | --export-dir /user/hdfs/orders \ 117 | --update-key order_id \ 118 | --update-mode allowinsert 119 | ``` 120 | 121 | 👉 This will **update old records** and **insert new records**. 🚀 122 | 123 | ✔ No duplicates, No conflicts. 💯 124 | 125 | --- 126 | 127 | ## ✅ 6. **Direct Import into Hive or HBase (No Manual Work)** 📊 128 | 👉 If you're working with **Hive (SQL-like tool for Hadoop)**, 129 | 👉 You can **directly import data into Hive tables** without any manual work. 130 | 131 | ### ✅ Example: 132 | ```shell 133 | sqoop import \ 134 | --connect jdbc:mysql://localhost/testdb \ 135 | --username root \ 136 | --password password \ 137 | --table customers \ 138 | --hive-import \ 139 | --hive-table mydatabase.customers 140 | ``` 141 | 142 | 👉 This command will: 143 | - ✅ Automatically create a Hive Table (`customers`) 144 | - ✅ Automatically load all data from MySQL to Hive. 145 | - ✅ No manual work needed. 146 | 147 | --- 148 | 149 | ## ✅ 7. **Import Large Data (TB/PB Scale) Without Crash 💥** 150 | 👉 If your **MySQL database** has **1 Billion Rows** or **2TB data**, 151 | 👉 Normal **manual export** will fail or crash. ❌ 152 | 153 | 👉 But **Sqoop can handle Terabytes or Petabytes** of data smoothly. 🚀 154 | 155 | 👉 It uses: 156 | - ✅ **Parallel Data Transfer.** 157 | - ✅ **Fault Tolerance (If one mapper fails, others continue).** 158 | - ✅ **Automatic Data Split.** 159 | 160 | --- 161 | 162 | ## ✅ 8. **Save Time and Money 💸** 163 | 👉 **Imagine transferring 1 billion records manually** via Python or CSV files. 164 | 👉 It would take **days or even weeks**. 165 | 166 | ✅ But **Sqoop transfers the data in minutes**. 167 | 168 | ### Example: 169 | | Data Size | Without Sqoop (Manual) | With Sqoop (Auto) | 170 | |----------------|---------------------|--------------------| 171 | | 1 Billion Rows | 24 Hours | **30 Minutes** 🚀 | 172 | | 10 TB Data | 5 Days | **5 Hours** 🚀 | 173 | 174 | ✔ **This saves time, infrastructure costs, and manpower.** 175 | 176 | --- 177 | 178 | ## ✅ 9. **Support for Data Warehousing (ETL Process)** 179 | 👉 **Sqoop is widely used in ETL pipelines** for: 180 | - ✅ Extracting data from MySQL → Hadoop. 181 | - ✅ Transforming data using Spark, Hive, or MapReduce. 182 | - ✅ Loading data back to MySQL → Reporting. 183 | 184 | 👉 This is a **standard data warehousing pipeline**. 185 | 186 | --- 187 | 188 | ## ✅ 10. 
**Easy Automation with Cron Job / Oozie** 189 | 👉 You can schedule **Sqoop Jobs** to run **daily, weekly, or hourly** using: 190 | - ✅ **Oozie (Big Data Scheduler)** 191 | - ✅ **Linux Cron Job** 192 | 193 | ### ✅ Example: Daily Import 194 | ```shell 195 | sqoop job --create daily_import \ 196 | --import \ 197 | --connect jdbc:mysql://localhost/testdb \ 198 | --username root \ 199 | --password password \ 200 | --table orders \ 201 | --incremental append \ 202 | --check-column order_date \ 203 | --last-value '2024-03-01' 204 | ``` 205 | 206 | ✅ Now schedule it daily using **cron job**: 207 | ```shell 208 | crontab -e 209 | ``` 210 | ```shell 211 | 0 0 * * * sqoop job --exec daily_import 212 | ``` 213 | 214 | 👉 **Automatically fetch new data daily**. 🚀 215 | 216 | --- 217 | 218 | ## ✅ **Bonus Benefits of Sqoop** 219 | | Feature | Benefit | 220 | |-----------------------------|-------------------------------------------------------------------------| 221 | | ✅ High-Speed Data Transfer | Sqoop uses **parallel processing (MapReduce)** for fast transfer. | 222 | | ✅ No Data Loss | Data is transferred **without loss or corruption.** | 223 | | ✅ Automatic Schema Mapping | Sqoop automatically maps MySQL Schema to Hive Schema. | 224 | | ✅ Easy to Use | Simple **one-line command** for import/export. | 225 | | ✅ Fault Tolerance | If one Mapper fails, others continue the process. | 226 | 227 | --- 228 | 229 | ## ✅ **So Why Do Companies Use Sqoop? 💯** 230 | | Use Case | Why Sqoop is Best 💯 | 231 | |---------------------------------|------------------------------------------------------------------| 232 | | ✅ Data Migration | Move data from MySQL → Hadoop easily. | 233 | | ✅ Data Warehousing | Automate ETL Pipelines. | 234 | | ✅ Data Archival | Archive old data from MySQL to HDFS. | 235 | | ✅ Machine Learning Data | Transfer MySQL Data → Spark, Hive for AI/ML. | 236 | | ✅ Fast Data Transfer | Transfer TBs of data in minutes. | 237 | 238 | --- 239 | 240 | ## 💯 Conclusion 🚀 241 | ### ✔ **Sqoop = Fast + Easy + Reliable** Data Transfer. 💯 242 | ### ✔ It saves **time, cost, and effort** in Big Data processing. 💯 243 | ### ✔ Highly used in **Data Engineering, ETL Pipelines, and Hadoop Projects.** 🚀 244 | 245 | --- 246 | 247 | **💡 Apache Sqoop 🚀🙂** is a tool designed for efficiently transferring bulk data between Apache Hadoop and relational databases. It allows for seamless data import and export between **Hadoop ecosystem** components (like HDFS, HBase, Hive) and relational databases (like MySQL, PostgreSQL, Oracle, SQL Server). 248 | 249 | Here is a basic **Sqoop tutorial** to help you understand how to use it for importing and exporting data: 250 | 251 | ### Prerequisites: 252 | 1. Hadoop and Sqoop should be installed on your system. 253 | 2. A relational database (e.g., MySQL) should be available to use with Sqoop. 254 | 3. Ensure the JDBC driver for the relational database is available. 255 | 256 | ### 1. **Setting up Sqoop** 257 | - Make sure **Sqoop** is installed and properly configured in your environment. 258 | - Sqoop’s installation can be verified with the following command: 259 | ```bash 260 | sqoop version 261 | ``` 262 | - If Sqoop is installed correctly, it should display its version. 263 | 264 | ### 2. **Importing Data from Relational Databases to Hadoop (HDFS)** 265 | The most common use case for Sqoop is importing data from a relational database into Hadoop's **HDFS**. 266 | 267 | #### Steps to import data: 268 | 1. 
**Create a table in the database (e.g., MySQL):** 269 | 270 | ```sql 271 | CREATE DATABASE test; 272 | CREATE USER 'sqoop_user'@'%' IDENTIFIED BY 'password123'; 273 | GRANT ALL PRIVILEGES ON testdb.* TO 'sqoop_user'@'%'; 274 | FLUSH PRIVILEGES; 275 | ``` 276 | ```sql 277 | SHOW DATABASES; 278 | ``` 279 | ```sql 280 | USE test; 281 | ``` 282 | 283 | ```sql 284 | CREATE TABLE employees ( 285 | id INT, 286 | name VARCHAR(100), 287 | age INT 288 | ); 289 | INSERT INTO employees VALUES (1, 'Love', 25); 290 | INSERT INTO employees VALUES (2, 'Ravi', 21); 291 | INSERT INTO employees VALUES (3, 'Nikshep', 22); 292 | ``` 293 | 294 | 295 | **Now get out of MYSQL Shell and then Let's get started with Apache Sqoop** 296 | 297 | **List Databases Using Sqoop**: 298 | ```bash 299 | sqoop list-databases --connect jdbc:mysql://localhost:3306 --username sqoop_user --password password123 300 | ``` 301 | 302 | 3. **Import Data Using Sqoop**: 303 | Use the following command to import data from a MySQL database to HDFS: 304 | ```bash 305 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 306 | --username your_username --password your_password \ 307 | --table employees --target-dir /user/hadoop/employees 308 | ``` 309 | 310 | Explanation: 311 | - `--connect`: JDBC URL for your database. 312 | - `--username`: Database username. 313 | - `--password`: Database password. 314 | - `--table`: The table to import. 315 | - `--target-dir`: The directory in HDFS where the data will be stored. 316 | 317 | 4. **Verify Data in HDFS**: 318 | After the import, check if the data is available in HDFS: 319 | ```bash 320 | hadoop fs -ls /user/hadoop/employees 321 | hadoop fs -cat /user/hadoop/employees/part-m-00000 322 | ``` 323 | 324 | ### 3. **Exporting Data from Hadoop (HDFS) to Relational Databases** 325 | Sqoop can also be used to export data from HDFS back into a relational database. 326 | 327 | #### Steps to export data: 328 | 1. **Create a Table in the Database for Export:** 329 | 330 | ```sql 331 | CREATE TABLE employees_export ( 332 | id INT, 333 | name VARCHAR(100), 334 | age INT 335 | ); 336 | ``` 337 | 338 | 2. **Export Data Using Sqoop**: 339 | Use the following command to export data from HDFS to a MySQL table: 340 | ```bash 341 | sqoop export --connect jdbc:mysql://localhost/employeesdb \ 342 | --username your_username --password your_password \ 343 | --table employees_export \ 344 | --export-dir /user/hadoop/employees 345 | ``` 346 | 347 | Explanation: 348 | - `--connect`: JDBC URL for the database. 349 | - `--username`: Database username. 350 | - `--password`: Database password. 351 | - `--table`: Table in the database to export the data to. 352 | - `--export-dir`: Directory in HDFS where the data to be exported resides. 353 | 354 | 3. **Verify Data in the Database**: 355 | After the export, check if the data is available in the database: 356 | ```sql 357 | SELECT * FROM employees_export; 358 | ``` 359 | 360 | ### 4. **Incremental Imports (Importing Data Increments)** 361 | Sqoop can import only the new or updated data from a table by using **incremental imports**. 
362 | 363 | #### Example of incremental import: 364 | ```bash 365 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 366 | --username your_username --password your_password \ 367 | --table employees --target-dir /user/hadoop/employees \ 368 | --incremental append --check-column id --last-value 10 369 | ``` 370 | 371 | Explanation: 372 | - `--incremental append`: Indicates that Sqoop should only import data that has changed (new rows or updated rows). 373 | - `--check-column`: The column to use for tracking changes (usually an auto-incremented column like `id`). 374 | - `--last-value`: The value of the `check-column` that was imported last time. This ensures only new or changed data is imported. 375 | 376 | ### 5. **Importing Data into Hive** 377 | Sqoop can also import data directly into **Apache Hive**, which is a data warehousing tool that sits on top of Hadoop. 378 | 379 | #### Example of importing data to Hive: 380 | ```bash 381 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 382 | --username your_username --password your_password \ 383 | --table employees --hive-import --create-hive-table \ 384 | --hive-table employees_hive 385 | ``` 386 | 387 | Explanation: 388 | - `--hive-import`: Imports the data into Hive. 389 | - `--create-hive-table`: Automatically creates the corresponding Hive table. 390 | - `--hive-table`: The Hive table to store the data. 391 | 392 | ### 6. **Job Scheduling with Sqoop** 393 | You can schedule Sqoop jobs to run at specific intervals using **Apache Oozie** or **cron jobs** for periodic data imports or exports. 394 | 395 | ### 7. **Additional Sqoop Features** 396 | - **Parallelism**: You can use **parallel imports** to split the data into multiple tasks and speed up the import/export process. 397 | ```bash 398 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 399 | --username your_username --password your_password \ 400 | --table employees --target-dir /user/hadoop/employees \ 401 | --num-mappers 4 402 | ``` 403 | 404 | - **Direct Mode**: Sqoop provides a **direct mode** for some databases like MySQL, which bypasses JDBC and uses the database's native data transfer mechanism to improve performance. 405 | ```bash 406 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 407 | --username your_username --password your_password \ 408 | --table employees --target-dir /user/hadoop/employees \ 409 | --direct 410 | ``` 411 | 412 | --- 413 | 414 | ### Conclusion 415 | Apache **Sqoop** is a powerful tool for bulk data transfers between Hadoop and relational databases. By understanding how to use Sqoop for importing, exporting, and managing data between various sources and Hadoop, you can integrate your data efficiently for further analysis, processing, or storage. 416 | 417 | In this tutorial, we covered basic Sqoop commands for importing and exporting data from a MySQL database into HDFS, as well as other advanced functionalities like incremental imports and loading data into Hive. 
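One practical refinement to the commands above: passing `--password` directly on the command line exposes the password in shell history and process listings. Sqoop also accepts a password file (or the `-P` flag to prompt interactively). The sketch below is a minimal example assuming a file named `sqoop.pwd` stored in the user's HDFS home directory; the database, credentials, and paths are the same illustrative values used earlier in this tutorial.

```bash
# Store the password in HDFS with restrictive permissions (one-time setup).
# Note: the file must not contain a trailing newline, hence echo -n.
echo -n "password123" | hdfs dfs -put - /user/hadoop/sqoop.pwd
hdfs dfs -chmod 400 /user/hadoop/sqoop.pwd

# Reference the password file instead of typing the password inline
sqoop import \
  --connect jdbc:mysql://localhost/employeesdb \
  --username sqoop_user \
  --password-file /user/hadoop/sqoop.pwd \
  --table employees \
  --target-dir /user/hadoop/employees_secure
```

The same `--password-file` option works for `sqoop export` and for saved `sqoop job` definitions, which is especially convenient for the cron- or Oozie-scheduled jobs mentioned earlier.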
418 | -------------------------------------------------------------------------------- /startup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | hadoop fs -mkdir /tmp 4 | hadoop fs -mkdir -p /user/hive/warehouse 5 | hadoop fs -chmod g+w /tmp 6 | hadoop fs -chmod g+w /user/hive/warehouse 7 | 8 | cd $HIVE_HOME/bin 9 | ./hiveserver2 --hiveconf hive.server2.enable.doAs=false 10 | -------------------------------------------------------------------------------- /students.csv: -------------------------------------------------------------------------------- 1 | 1,Lovnish,25 2 | 2,Ravikant,21 3 | 3,Nikshep,23 -------------------------------------------------------------------------------- /submit/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | COPY WordCount.jar /opt/hadoop/applications/WordCount.jar 6 | 7 | ENV JAR_FILEPATH="/opt/hadoop/applications/WordCount.jar" 8 | ENV CLASS_TO_RUN="WordCount" 9 | ENV PARAMS="/input /output" 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | CMD ["/run.sh"] 15 | -------------------------------------------------------------------------------- /submit/WordCount.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/submit/WordCount.jar -------------------------------------------------------------------------------- /submit/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/hadoop jar $JAR_FILEPATH $CLASS_TO_RUN $PARAMS 4 | -------------------------------------------------------------------------------- /template/java/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-submit:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | ENV SPARK_APPLICATION_JAR_NAME application-1.0 6 | 7 | COPY template.sh / 8 | 9 | RUN apk add --no-cache openjdk8 maven\ 10 | && chmod +x /template.sh \ 11 | && mkdir -p /app \ 12 | && mkdir -p /usr/src/app 13 | 14 | # Copy the POM-file first, for separate dependency resolving and downloading 15 | ONBUILD COPY pom.xml /usr/src/app 16 | ONBUILD RUN cd /usr/src/app \ 17 | && mvn dependency:resolve 18 | ONBUILD RUN cd /usr/src/app \ 19 | && mvn verify 20 | 21 | # Copy the source code and build the application 22 | ONBUILD COPY . /usr/src/app 23 | ONBUILD RUN cd /usr/src/app \ 24 | && mvn clean package 25 | 26 | CMD ["/bin/bash", "/template.sh"] 27 | -------------------------------------------------------------------------------- /template/java/README.md: -------------------------------------------------------------------------------- 1 | # Spark Java template 2 | 3 | The Spark Java template image serves as a base image to build your own Java application to run on a Spark cluster. See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark) for a description how to setup a Spark cluster. 4 | 5 | ### Package your application using Maven 6 | You can build and launch your Java application on a Spark cluster by extending this image with your sources. 
The template uses [Maven](https://maven.apache.org/) as build tool, so make sure you have a `pom.xml` file for your application specifying all the dependencies. 7 | 8 | The Maven `package` command must create an assembly JAR (or 'uber' JAR) containing your code and its dependencies. Spark and Hadoop dependencies should be listes as `provided`. The [Maven shade plugin](http://maven.apache.org/plugins/maven-shade-plugin/) provides a plugin to build such assembly JARs. 9 | 10 | ### Extending the Spark Java template with your application 11 | 12 | #### Steps to extend the Spark Java template 13 | 1. Create a Dockerfile in the root folder of your project (which also contains a `pom.xml`) 14 | 2. Extend the Spark Java template Docker image 15 | 3. Configure the following environment variables (unless the default value satisfies): 16 | * `SPARK_MASTER_NAME` (default: spark-master) 17 | * `SPARK_MASTER_PORT` (default: 7077) 18 | * `SPARK_APPLICATION_JAR_NAME` (default: application-1.0) 19 | * `SPARK_APPLICATION_MAIN_CLASS` (default: my.main.Application) 20 | * `SPARK_APPLICATION_ARGS` (default: "") 21 | 4. Build and run the image 22 | ``` 23 | docker build --rm=true -t bde/spark-app . 24 | docker run --name my-spark-app -e ENABLE_INIT_DAEMON=false --link spark-master:spark-master -d bde/spark-app 25 | ``` 26 | 27 | The sources in the project folder will be automatically added to `/usr/src/app` if you directly extend the Spark Java template image. Otherwise you will have to add and package the sources by yourself in your Dockerfile with the commands: 28 | 29 | COPY . /usr/src/app 30 | RUN cd /usr/src/app \ 31 | && mvn clean package 32 | 33 | If you overwrite the template's `CMD` in your Dockerfile, make sure to execute the `/template.sh` script at the end. 34 | 35 | #### Example Dockerfile 36 | ``` 37 | FROM bde2020/spark-java-template:2.4.0-hadoop2.7 38 | 39 | MAINTAINER Erika Pauwels 40 | MAINTAINER Gezim Sejdiu 41 | 42 | ENV SPARK_APPLICATION_JAR_NAME my-app-1.0-SNAPSHOT-with-dependencies 43 | ENV SPARK_APPLICATION_MAIN_CLASS eu.bde.my.Application 44 | ENV SPARK_APPLICATION_ARGS "foo bar baz" 45 | ``` 46 | 47 | #### Example application 48 | See [big-data-europe/demo-spark-sensor-data](https://github.com/big-data-europe/demo-spark-sensor-data). 49 | -------------------------------------------------------------------------------- /template/java/template.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cd /usr/src/app 4 | cp target/${SPARK_APPLICATION_JAR_NAME}.jar ${SPARK_APPLICATION_JAR_LOCATION} 5 | 6 | sh /submit.sh 7 | -------------------------------------------------------------------------------- /template/python/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-submit:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | COPY template.sh / 6 | 7 | # Copy the requirements.txt first, for separate dependency resolving and downloading 8 | ONBUILD COPY requirements.txt /app/ 9 | ONBUILD RUN cd /app \ 10 | && pip3 install -r requirements.txt 11 | 12 | # Copy the source code 13 | ONBUILD COPY . 
/app 14 | 15 | CMD ["/bin/bash", "/template.sh"] 16 | -------------------------------------------------------------------------------- /template/python/README.md: -------------------------------------------------------------------------------- 1 | # Spark Python template 2 | 3 | The Spark Python template image serves as a base image to build your own Python application to run on a Spark cluster. See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark) for a description how to setup a Spark cluster. 4 | 5 | ### Package your application using pip 6 | You can build and launch your Python application on a Spark cluster by extending this image with your sources. The template uses [pip](https://pip.pypa.io/en/stable/) to manage the dependencies of your 7 | project, so make sure you have a `requirements.txt` file in the root of your application specifying all the dependencies. 8 | 9 | ### Extending the Spark Python template with your application 10 | 11 | #### Steps to extend the Spark Python template 12 | 1. Create a Dockerfile in the root folder of your project (which also contains a `requirements.txt`) 13 | 2. Extend the Spark Python template Docker image 14 | 3. Configure the following environment variables (unless the default value satisfies): 15 | * `SPARK_MASTER_NAME` (default: spark-master) 16 | * `SPARK_MASTER_PORT` (default: 7077) 17 | * `SPARK_APPLICATION_PYTHON_LOCATION` (default: /app/app.py) 18 | * `SPARK_APPLICATION_ARGS` 19 | 4. Build and run the image 20 | ``` 21 | docker build --rm -t bde/spark-app . 22 | docker run --name my-spark-app -e ENABLE_INIT_DAEMON=false --link spark-master:spark-master -d bde/spark-app 23 | ``` 24 | 25 | The sources in the project folder will be automatically added to `/app` if you directly extend the Spark Python template image. Otherwise you will have to add the sources by yourself in your Dockerfile with the command: 26 | 27 | COPY . /app 28 | 29 | If you overwrite the template's `CMD` in your Dockerfile, make sure to execute the `/template.sh` script at the end. 
30 | 31 | #### Example Dockerfile 32 | ``` 33 | FROM bde2020/spark-python-template:2.4.0-hadoop2.7 34 | 35 | MAINTAINER You 36 | 37 | ENV SPARK_APPLICATION_PYTHON_LOCATION /app/entrypoint.py 38 | ENV SPARK_APPLICATION_ARGS "foo bar baz" 39 | ``` 40 | 41 | #### Example application 42 | Coming soon 43 | -------------------------------------------------------------------------------- /template/python/template.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | sh /submit.sh 4 | -------------------------------------------------------------------------------- /template/scala/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-submit:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | ARG SBT_VERSION 6 | ENV SBT_VERSION=${SBT_VERSION:-1.3.12} 7 | 8 | RUN wget -O - https://piccolo.link/sbt-1.3.12.tgz | gunzip | tar -x -C /usr/local 9 | 10 | ENV PATH /usr/local/sbt/bin:${PATH} 11 | 12 | WORKDIR /app 13 | 14 | # Pre-install base libraries 15 | ADD build.sbt /app/ 16 | ADD plugins.sbt /app/project/ 17 | RUN sbt update 18 | 19 | COPY template.sh / 20 | 21 | ENV SPARK_APPLICATION_MAIN_CLASS Application 22 | 23 | # Copy the build.sbt first, for separate dependency resolving and downloading 24 | ONBUILD COPY build.sbt /app/ 25 | ONBUILD COPY project /app/project 26 | ONBUILD RUN sbt update 27 | 28 | # Copy the source code and build the application 29 | ONBUILD COPY . /app 30 | ONBUILD RUN sbt clean assembly 31 | 32 | CMD ["/template.sh"] 33 | -------------------------------------------------------------------------------- /template/scala/README.md: -------------------------------------------------------------------------------- 1 | # Spark Scala template 2 | 3 | The Spark Scala template image serves as a base image to build your own Scala 4 | application to run on a Spark cluster. See 5 | [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark) 6 | for a description how to setup a Spark cluster. 7 | 8 | ## Scala Console 9 | 10 | `sbt console` will create you a Spark Context for testing your code like the 11 | spark-shell: 12 | 13 | ``` 14 | docker run -it --rm bde2020/spark-scala-template sbt console 15 | ``` 16 | 17 | You can also use directly your Docker image and test your own code that way. 18 | 19 | ## Package your application using sbt 20 | 21 | You can build and launch your Scala application on a Spark cluster by extending 22 | this image with your sources. The template uses 23 | [sbt](http://www.scala-sbt.org) as build tool, so you should take the 24 | `build.sbt` file located in this directory and the `project` directory that 25 | includes the 26 | [sbt-assembly](https://github.com/sbt/sbt-assembly). 27 | 28 | When the Docker image is built using this template, you should get a Docker 29 | image that includes a fat JAR containing your application and all its 30 | dependencies. 31 | 32 | ### Extending the Spark Scala template with your application 33 | 34 | #### Steps to extend the Spark Scala template 35 | 36 | 1. Create a Dockerfile in the root folder of your project (which also contains 37 | a `build.sbt`) 38 | 2. Extend the Spark Scala template Docker image 39 | 3. 
Configure the following environment variables (unless the default value
40 | satisfies):
41 |  * `SPARK_MASTER_NAME` (default: spark-master)
42 |  * `SPARK_MASTER_PORT` (default: 7077)
43 |  * `SPARK_APPLICATION_MAIN_CLASS` (default: Application)
44 |  * `SPARK_APPLICATION_ARGS` (default: "")
45 | 4. Build and run the image:
46 | ```
47 | docker build --rm=true -t bde/spark-app .
48 | docker run --name my-spark-app -e ENABLE_INIT_DAEMON=false --link spark-master:spark-master -d bde/spark-app
49 | ```
50 | 
51 | The sources in the project folder will be automatically added to `/usr/src/app`
52 | if you directly extend the Spark Scala template image. Otherwise you will have
53 | to add and package the sources by yourself in your Dockerfile with the
54 | commands:
55 | 
56 |     COPY . /usr/src/app
57 |     RUN cd /usr/src/app && sbt clean assembly
58 | 
59 | If you overwrite the template's `CMD` in your Dockerfile, make sure to execute
60 | the `/template.sh` script at the end.
61 | 
62 | #### Example Dockerfile
63 | 
64 | ```
65 | FROM bde2020/spark-scala-template:2.4.0-hadoop2.7
66 | 
67 | MAINTAINER Cecile Tonglet 
68 | 
69 | ENV SPARK_APPLICATION_MAIN_CLASS eu.bde.my.Application
70 | ENV SPARK_APPLICATION_ARGS "foo bar baz"
71 | ```
72 | 
73 | #### Example application
74 | 
75 | TODO
76 | 
-------------------------------------------------------------------------------- /template/scala/build.sbt: --------------------------------------------------------------------------------
1 | scalaVersion := "2.12.11"
2 | libraryDependencies ++= Seq(
3 |   "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
4 | )
5 | 
-------------------------------------------------------------------------------- /template/scala/plugins.sbt: --------------------------------------------------------------------------------
1 | addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
-------------------------------------------------------------------------------- /template/scala/template.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | SPARK_APPLICATION_JAR_LOCATION=`find /app/target -iname '*-assembly-*.jar' | head -n1`
4 | export SPARK_APPLICATION_JAR_LOCATION
5 | 
6 | if [ -z "$SPARK_APPLICATION_JAR_LOCATION" ]; then
7 |   echo "Can't find a file *-assembly-*.jar in /app/target"
8 |   exit 1
9 | fi
10 | 
11 | /submit.sh
12 | 
-------------------------------------------------------------------------------- /wordcount.md: --------------------------------------------------------------------------------
1 | To perform a **Word Count** using Hadoop, follow these steps:
2 | 
3 | ---
4 | 
5 | ### **Configure `mapred-site.xml`**
6 | ```bash
7 | nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
8 | ```
9 | Add:
10 | ```xml
11 | <configuration>
12 |   <property>
13 |     <name>mapreduce.framework.name</name>
14 |     <value>yarn</value>
15 |   </property>
16 |   <property>
17 |     <name>yarn.app.mapreduce.am.env</name>
18 |     <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
19 |   </property>
20 |   <property>
21 |     <name>mapreduce.map.env</name>
22 |     <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
23 |   </property>
24 |   <property>
25 |     <name>mapreduce.reduce.env</name>
26 |     <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
27 |   </property>
28 | </configuration>
29 | ```
30 | Save & exit.
31 | 
32 | 
33 | ## **1. Ensure Hadoop is Running**
34 | Before running the Word Count example, ensure Hadoop is running:
35 | 
36 | ```bash
37 | start-dfs.sh
38 | start-yarn.sh
39 | ```
40 | Verify with:
41 | ```bash
42 | jps
43 | ```
44 | You should see **NameNode, DataNode, ResourceManager, and NodeManager** running.
45 | 
46 | ---
47 | 
48 | ## **2. 
Upload the Input File to HDFS**
49 | If you haven't already created a directory in HDFS, do it now:
50 | 
51 | ```bash
52 | hadoop fs -mkdir -p /user/nielit/input
53 | ```
54 | 
55 | Now, upload your text file (`data.txt`):
56 | 
57 | ```bash
58 | hadoop fs -put data.txt /user/nielit/input/
59 | ```
60 | 
61 | Verify the upload:
62 | ```bash
63 | hadoop fs -ls /user/nielit/input/
64 | ```
65 | 
66 | ---
67 | 
68 | ## **3. Run the Word Count Example**
69 | Hadoop provides a built-in Word Count example. Run it with:
70 | 
71 | ```bash
72 | hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /user/nielit/input /user/nielit/output
73 | ```
74 | 
75 | > **Note:** If you installed a different Hadoop version, update `3.3.6` in the JAR file name accordingly.
76 | 
77 | ---
78 | 
79 | ![Screenshot from 2025-03-02 17-45-02](https://github.com/user-attachments/assets/44613f55-cdf0-48fc-9769-9ff7066c70e3)
80 | 
81 | ## **4. Check the Output**
82 | Once the job completes, view the output files:
83 | 
84 | ```bash
85 | hadoop fs -ls /user/nielit/output
86 | ```
87 | 
88 | The output is usually stored in `part-r-00000`. To read the results:
89 | 
90 | ```bash
91 | hadoop fs -cat /user/nielit/output/part-r-00000
92 | ```
93 | 
94 | ---
95 | 
96 | ## **5. Download the Output to Your Local System (Optional)**
97 | If you want to copy the results from HDFS to your local machine:
98 | 
99 | ```bash
100 | hadoop fs -get /user/nielit/output/part-r-00000 wordcount_output.txt
101 | cat wordcount_output.txt
102 | ```
103 | 
104 | 
105 | ![Screenshot from 2025-03-02 17-46-44](https://github.com/user-attachments/assets/5d5d3b51-708f-4ff3-8476-a518c5f8ee3e)
106 | 
107 | **METHOD 2**
108 | 
109 | You can write your own **Java MapReduce program** for word count in Hadoop. Follow these steps:
110 | 
111 | ---
112 | 
113 | ## **1. 
Create the Word Count Java Program**
114 | Create a new file:
115 | ```bash
116 | nano WordCount.java
117 | ```
118 | 
119 | Copy and paste the following Java code:
120 | 
121 | ```java
122 | import org.apache.hadoop.conf.Configuration;
123 | import org.apache.hadoop.fs.Path;
124 | import org.apache.hadoop.io.IntWritable;
125 | import org.apache.hadoop.io.Text;
126 | import org.apache.hadoop.mapreduce.Job;
127 | import org.apache.hadoop.mapreduce.Mapper;
128 | import org.apache.hadoop.mapreduce.Reducer;
129 | import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
130 | import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
131 | 
132 | import java.io.IOException;
133 | import java.util.StringTokenizer;
134 | 
135 | public class WordCount {
136 | 
137 |     public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
138 |         private final static IntWritable one = new IntWritable(1);
139 |         private Text word = new Text();
140 | 
141 |         public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
142 |             StringTokenizer itr = new StringTokenizer(value.toString());
143 |             while (itr.hasMoreTokens()) {
144 |                 word.set(itr.nextToken());
145 |                 context.write(word, one);
146 |             }
147 |         }
148 |     }
149 | 
150 |     public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
151 |         private IntWritable result = new IntWritable();
152 | 
153 |         public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
154 |             int sum = 0;
155 |             for (IntWritable val : values) {
156 |                 sum += val.get();
157 |             }
158 |             result.set(sum);
159 |             context.write(key, result);
160 |         }
161 |     }
162 | 
163 |     public static void main(String[] args) throws Exception {
164 |         Configuration conf = new Configuration();
165 |         Job job = Job.getInstance(conf, "word count");
166 |         job.setJarByClass(WordCount.class);
167 |         job.setMapperClass(TokenizerMapper.class);
168 |         job.setCombinerClass(IntSumReducer.class);
169 |         job.setReducerClass(IntSumReducer.class);
170 |         job.setOutputKeyClass(Text.class);
171 |         job.setOutputValueClass(IntWritable.class);
172 |         FileInputFormat.addInputPath(job, new Path(args[0]));
173 |         FileOutputFormat.setOutputPath(job, new Path(args[1]));
174 |         System.exit(job.waitForCompletion(true) ? 0 : 1);
175 |     }
176 | }
177 | ```
178 | 
179 | ---
180 | 
181 | ## **2. Compile the Java Code**
182 | Make sure you have Hadoop's libraries available. Use the following command to compile:
183 | 
184 | ```bash
185 | javac -classpath $(hadoop classpath) -d . WordCount.java
186 | ```
187 | 
188 | This will generate `.class` files inside the current directory.
189 | 
190 | ---
191 | 
192 | ## **3. Create a JAR File**
193 | Now, package the compiled Java files into a JAR:
194 | 
195 | ```bash
196 | jar -cvf WordCount.jar *.class
197 | ```
198 | 
199 | ---
200 | 
201 | ## **4. Upload Input File to HDFS**
202 | If not already uploaded, create an input directory and upload your text file:
203 | 
204 | ```bash
205 | hadoop fs -mkdir -p /user/nielit/input
206 | hadoop fs -put data.txt /user/nielit/input/
207 | ```
208 | 
209 | Verify:
210 | ```bash
211 | hadoop fs -ls /user/nielit/input/
212 | ```
213 | 
214 | ---
215 | 
216 | ## **5. Run Your Word Count Program**
217 | Execute your custom Word Count JAR in Hadoop:
218 | 
219 | ```bash
220 | hadoop jar WordCount.jar WordCount /user/nielit/input /user/nielit/output
221 | ```
222 | 
223 | ---
224 | 
225 | ## **6. 
View Output** 226 | After the job completes, check the output: 227 | 228 | ```bash 229 | hadoop fs -ls /user/nielit/output 230 | ``` 231 | 232 | To see the results: 233 | 234 | ```bash 235 | hadoop fs -cat /user/nielit/output/part-r-00000 236 | ``` 237 | 238 | --- 239 | 240 | ## **7. Download Output (Optional)** 241 | If you want to save the output to your local machine: 242 | 243 | ```bash 244 | hadoop fs -get /user/nielit/output/part-r-00000 wordcount_output.txt 245 | cat wordcount_output.txt 246 | ``` 247 | 248 | --- 249 | 250 | ## **Troubleshooting** 251 | - If the output directory already exists, delete it before rerunning: 252 | ```bash 253 | hadoop fs -rm -r /user/nielit/output 254 | ``` 255 | 256 | 257 | -------------------------------------------------------------------------------- /worker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-base:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | COPY worker.sh / 6 | 7 | ENV SPARK_WORKER_WEBUI_PORT 8081 8 | ENV SPARK_WORKER_LOG /spark/logs 9 | ENV SPARK_MASTER "spark://spark-master:7077" 10 | 11 | EXPOSE 8081 12 | 13 | CMD ["/bin/bash", "/worker.sh"] 14 | -------------------------------------------------------------------------------- /worker/README.md: -------------------------------------------------------------------------------- 1 | # Spark worker 2 | 3 | See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark). -------------------------------------------------------------------------------- /worker/worker.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | . "/spark/sbin/spark-config.sh" 4 | 5 | . "/spark/bin/load-spark-env.sh" 6 | 7 | mkdir -p $SPARK_WORKER_LOG 8 | 9 | export SPARK_HOME=/spark 10 | 11 | ln -sf /dev/stdout $SPARK_WORKER_LOG/spark-worker.out 12 | 13 | /spark/sbin/../bin/spark-class org.apache.spark.deploy.worker.Worker \ 14 | --webui-port $SPARK_WORKER_WEBUI_PORT $SPARK_MASTER >> $SPARK_WORKER_LOG/spark-worker.out 15 | -------------------------------------------------------------------------------- /yarn.md: -------------------------------------------------------------------------------- 1 | # Practical: Running a WordCount Job on YARN ![image](https://github.com/user-attachments/assets/04d38509-38b8-4cef-b544-4a8c566fd863) 2 | 3 | In this practical, you will run a simple WordCount job using Hadoop YARN. This exercise walks you through preparing a basic Hadoop job and running it on a YARN cluster. 4 | 5 | ### Prerequisites: 6 | 1. Docker Desktop must be up and running. 7 | 2. **YARN ResourceManager** and **NodeManager** must be up and running. 8 | 3. Hadoop should be set up correctly, with access to the HDFS file system. 9 | 4. A sample WordCount program (JAR) is ready to be executed. 10 | 11 | --- 12 | 13 | ### Step-by-Step Guide 14 | Compose container if not already running 15 | 16 | **docker-compose up -d** 17 | 18 | Copy code folder that has wordcount program to your container 19 | 20 | **docker cp code namenode:/code** 21 | 22 | ![image](https://github.com/user-attachments/assets/72fa5e86-02cb-4a09-864a-dae1256bf8cd) 23 | 24 | 25 | Then execute bash shell of namenode in intractive mode 26 | 27 | **docker exec -it namenode bash** 28 | --- 29 | 30 | ### Step-by-Step Guide 31 | 32 | #### Step 1: Upload Data to HDFS 33 | 34 | Before running a YARN job, we need some input data in HDFS. 
We will create a simple text file locally and upload it to HDFS.
36 | 
37 | 1. **Create a sample text file inside the container's terminal**:
38 |    Use the following commands to create a file called `sample.txt` with some sample text data.
39 | 
40 | ```bash
41 | echo "Ropar Chandigarh Ropar Chandigarh Punjab" > sample.txt
42 | echo "Mohali" >> sample.txt
43 | echo "Kharar" >> sample.txt
44 | ```
45 | 
46 | 
47 | 2. **Upload the text file to HDFS**:
48 |    Use the `hadoop fs -put` command to upload the file to HDFS.
49 | 
50 | ```bash
51 | hadoop fs -mkdir -p /user             # Create the parent directory in HDFS
52 | hadoop fs -mkdir -p /user/root
53 | hadoop fs -mkdir -p /user/root/input  # Create the input directory in HDFS
54 | ```
55 | 
56 | The `-p` option stands for "parent": it creates all the necessary parent directories in the specified path if they do not already exist, so the single command `hadoop fs -mkdir -p /user/root/input` would have been enough on its own.
57 | If any of the parent directories (`/user`, `/user/root`) do not exist, it creates them for you.
58 | Key feature: it does not throw an error if the directory already exists.
59 | 
60 | Now put `sample.txt` into HDFS:
61 | 
62 | ```bash
63 | hadoop fs -put sample.txt /user/root/input/   # Put sample.txt into HDFS
64 | ```
65 | 
66 | You can confirm the file is uploaded by running:
67 | 
68 | ```bash
69 | hadoop fs -ls /user/root/input/
70 | ```
71 | ![image](https://github.com/user-attachments/assets/a0e18957-a2a5-40f8-a5bf-b443da47eb67)
72 | 
73 | ![image](https://github.com/user-attachments/assets/b622d0d9-ef28-4eaa-ac7b-db7758dd390d)
74 | 
75 | ---
76 | 
77 | #### Step 2: Submit the WordCount Job to YARN
78 | 
79 | Now, we can run the WordCount job using YARN. This job will count the occurrences of each word in the input file.
80 | 
81 | 1. **Change your working directory to where the `wordCount.jar` is located**:
82 | 
83 | ```bash
84 | cd /code   # Change to the directory where wordCount.jar is stored
85 | ```
86 | ![image](https://github.com/user-attachments/assets/84b0288f-4cca-4bc8-9d62-eeb5b393ef6d)
87 | 
88 | 2. **Submit the WordCount job to YARN**:
89 |    Run the following command to submit the job:
90 | 
91 | ```bash
92 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount /user/root/input /user/root/outputfolder
93 | ```
94 | 
95 | - `wordCount.jar`: The MapReduce program (JAR file).
96 | - `/user/root/input`: The input directory in HDFS containing the `sample.txt` file.
97 | - `/user/root/outputfolder`: The output directory in HDFS where the result will be stored.
98 | 
99 | ![image](https://github.com/user-attachments/assets/f0aa28ae-c7c8-4e38-999c-a2197497c5cb)
100 | 
101 | 
102 | 3. **Check the YARN UI**:
103 |    After submitting the job, you can monitor the job through the YARN ResourceManager UI.
104 | 
105 |    - Visit the YARN ResourceManager UI at `http://localhost:8088`.
106 |    - Look for your job under the "Applications" section. You should see your job with its status (e.g., Running, Completed, etc.).
107 | 
108 | ![image](https://github.com/user-attachments/assets/f68bcf5f-e56a-420e-8f05-9ed2dcf68837)
109 | 
110 | 
111 |    - Click on your job to see more details, such as job progress, logs, and running containers.
112 | 
113 | ![image](https://github.com/user-attachments/assets/5656eb67-8c49-46c0-a9ac-0af201231972)
114 | 
115 | 
116 | ---
117 | 
118 | #### Step 3: Check the Output of the WordCount Job
119 | 
120 | Once the job finishes, you can view the results in HDFS.
121 | 1. 
**Check the output on HDFS**: 122 | To verify that the output was successfully created, run the following command: 123 | 124 | ```bash 125 | hadoop fs -ls /user/root/outputfolder 126 | ``` 127 | ![image](https://github.com/user-attachments/assets/508f4359-1b5c-462c-93a8-c1dc7048283d) 128 | 129 | 130 | You should see output files like `part-r-00000`. 131 | 132 | 2. **View the contents of the output file**: 133 | To view the WordCount results, use the following command: 134 | 135 | ```bash 136 | hadoop fs -cat /user/root/outputfolder/part-r-00000 137 | ``` 138 | 139 | The output will show the words and their respective counts, like this: 140 | 141 | ``` 142 | Chandigarh 2 143 | Kharar 1 144 | Mohali 1 145 | Punjab 1 146 | Ropar 2 147 | ``` 148 | ![image](https://github.com/user-attachments/assets/7b0ab366-71ed-49e7-b7fa-749f573f633a) 149 | 150 | --- 151 | 152 | #### Step 4: Clean Up 153 | 154 | Once you’ve completed the practical, it's good practice to clean up by deleting the output files and any unnecessary files. 155 | 156 | 1. **Remove the output directory from HDFS**: 157 | 158 | ```bash 159 | hadoop fs -rm -r /user/root/outputfolder 160 | ``` 161 | 162 | 2. **Optional**: Remove the input file from HDFS if you no longer need it. 163 | 164 | ```bash 165 | hadoop fs -rm /user/root/input/sample.txt 166 | ``` 167 | 168 | --- 169 | 170 | Yarn (Yet Another Resource Negotiator) in the Hadoop ecosystem is a resource management layer, and its commands are different from the JavaScript package manager **Yarn**. Below is a list of essential **Yarn commands for Hadoop**: 171 | 172 | --- 173 | 174 | ### **General Yarn Commands** 175 | 1. **Check Yarn Version** 176 | ```bash 177 | yarn version 178 | ``` 179 | Displays the version of Yarn installed in your Hadoop environment. 180 | 181 | 2. **Check Cluster Nodes** 182 | ```bash 183 | yarn node -list 184 | ``` 185 | Lists all the active, decommissioned, and unhealthy nodes in the cluster. 186 | 187 | 3. **Resource Manager Web UI** 188 | ```bash 189 | yarn rmadmin -getServiceState rm1 190 | ``` 191 | Checks the state of a specific Resource Manager. 192 | 193 | --- 194 | 195 | ### **Application Management** 196 | 4. **Submit an Application** 197 | ```bash 198 | yarn jar [options] 199 | ``` 200 | Submits a new application to the Yarn cluster. 201 | 202 | 5. **List Applications** 203 | ```bash 204 | yarn application -list 205 | ``` 206 | Lists all running applications on the Yarn cluster. 207 | 208 | 6. **View Application Status** 209 | ```bash 210 | yarn application -status 211 | ``` 212 | Shows the status of a specific application. 213 | Example Output:- 214 | ![image](https://github.com/user-attachments/assets/44ab74ab-f662-4d87-834c-43e812117be0) 215 | 216 | 217 | 218 | 8. **Kill an Application** 219 | ```bash 220 | yarn application -kill 221 | ``` 222 | Terminates a specific application. 223 | 224 | --- 225 | 226 | ### **Logs and Diagnostics** 227 | 8. **View Logs of an Application** 228 | ```bash 229 | yarn logs -applicationId 230 | ``` 231 | Displays logs for a specific application. 232 | 233 | 9. **Fetch Application Logs to Local System** 234 | ```bash 235 | yarn logs -applicationId > logs.txt 236 | ``` 237 | Saves application logs to a local file. 238 | 239 | --- 240 | 241 | ### **Queue Management** 242 | 10. **List Queues** 243 | ```bash 244 | yarn queue -list 245 | ``` 246 | Lists all queues available in the Yarn cluster. 247 | 248 | 11. 
**Move Application to Another Queue** 249 | ```bash 250 | yarn application -moveToQueue -appId 251 | ``` 252 | Moves a running application to a different queue. 253 | 254 | --- 255 | 256 | ### **Resource Manager Administration** 257 | 12. **Refresh Queue Configuration** 258 | ```bash 259 | yarn rmadmin -refreshQueues 260 | ``` 261 | Reloads the queue configuration without restarting the Resource Manager. 262 | 263 | 13. **Refresh Node Information** 264 | ```bash 265 | yarn rmadmin -refreshNodes 266 | ``` 267 | Updates the Resource Manager with the latest node information. 268 | 269 | 14. **Get Cluster Metrics** 270 | ```bash 271 | yarn cluster -metrics 272 | ``` 273 | Shows resource usage metrics of the Yarn cluster. 274 | 275 | 15. **Decommission a Node** 276 | ```bash 277 | yarn rmadmin -decommission 278 | ``` 279 | Marks a specific node as decommissioned. 280 | 281 | 16. **Check Cluster Status** 282 | ```bash 283 | yarn cluster -status 284 | ``` 285 | Displays overall status and health of the cluster. 286 | 287 | --- 288 | 289 | ### **Node Manager Commands** 290 | 17. **Start Node Manager** 291 | ```bash 292 | yarn nodemanager 293 | ``` 294 | Starts the Node Manager daemon. 295 | 296 | 18. **Stop Node Manager** 297 | ```bash 298 | yarn nodemanager -stop 299 | ``` 300 | Stops the Node Manager daemon. 301 | 302 | 19. **List Containers on a Node** 303 | ```bash 304 | yarn nodemanager -list 305 | ``` 306 | Lists all running containers on the Node Manager. 307 | 308 | --- 309 | 310 | ### **Debugging and Troubleshooting** 311 | 20. **View Container Logs** 312 | ```bash 313 | yarn logs -containerId -nodeAddress 314 | ``` 315 | Retrieves logs for a specific container. 316 | 317 | 21. **Check Application Environment Variables** 318 | ```bash 319 | yarn application -envs 320 | ``` 321 | Displays environment variables for a specific application. 322 | 323 | --- 324 | 325 | These commands allow you to manage applications, queues, resources, and logs effectively on a Hadoop Yarn cluster. 326 | 327 | ### Additional Tips 328 | 329 | - **Custom Jobs**: You can write your own MapReduce programs in Java and package them into a JAR file, then submit them to YARN in a similar way. 330 | - **Resource Allocation**: If you want to control how much memory or CPU your YARN job uses, you can specify resources in the command, or modify the YARN configuration files. 331 | 332 | --- 333 | 334 | ### Troubleshooting 335 | 336 | - **Job Not Starting**: If the job does not start or fails, check the logs for errors. You can view the logs from the YARN ResourceManager UI or use the following command to retrieve logs: 337 | 338 | ```bash 339 | yarn logs -applicationId 340 | ``` 341 | 342 | - **Out of Memory Errors**: If your job runs into memory issues, consider adjusting the memory allocation in the `yarn-site.xml` configuration file for your NodeManagers and ResourceManager. 343 | 344 | --- 345 | 346 | **Conclusion** 347 | This practical exercise provided a hands-on experience in running a simple MapReduce job (WordCount) on YARN. You can now submit jobs, monitor them, and view results in HDFS using the YARN ResourceManager. By following the steps outlined, you should be able to run more complex jobs and work with Hadoop in a YARN-managed environment. 348 | 349 | --- 350 | 351 | ### Instructions for Use: 352 | - Ensure your Hadoop environment (including YARN and HDFS) is properly set up before running the job. 353 | - Submit your jobs using the `hadoop jar` command and monitor their progress through the YARN UI. 
354 | - Clean up your HDFS after completing the practical exercise to maintain a clutter-free environment.
355 | 
--------------------------------------------------------------------------------