├── .gitignore
├── Complete_Guide_to_Install_Ubuntu_and_JAVA_and_then_Configure_Hadoop,_MySQL,_HIVE,_Sqoop,_Flume,_Spark_on_a_Docker_Container.md
├── HiveInstallation.md
├── Makefile
├── Multi-Node Cluster on Ubuntu 24.04 (VMware).md
├── README.md
├── base
├── Dockerfile
├── bde-spark.css
├── entrypoint.sh
├── execute-step.sh
├── finish-step.sh
└── wait-for-step.sh
├── code
├── HadoopWordCount
│ ├── bin
│ │ ├── WordCount$IntSumReducer.class
│ │ ├── WordCount$TokenizerMapper.class
│ │ ├── WordCount.class
│ │ └── wc.jar
│ └── src
│ │ └── WordCount.java
├── input
│ ├── About Hadoop.txt~
│ └── data.txt
└── wordCount.jar
├── conf
├── beeline-log4j2.properties
├── hive-env.sh
├── hive-exec-log4j2.properties
├── hive-log4j2.properties
├── hive-site.xml
├── ivysettings.xml
└── llap-daemon-log4j2.properties
├── data
├── authors.csv
└── books.csv
├── datanode
├── Dockerfile
└── run.sh
├── docker-compose.yml
├── ecom.md
├── entrypoint.sh
├── flume.md
├── hadoop-basic-commands.md
├── hadoop-hive.env
├── hadoop.env
├── hadoop_installation_VMware Workstation.md
├── historyserver
├── Dockerfile
└── run.sh
├── master
├── Dockerfile
├── README.md
└── master.sh
├── namenode
├── Dockerfile
└── run.sh
├── nginx
├── Dockerfile
├── bde-hadoop.css
├── default.conf
└── materialize.min.css
├── nodemanager
├── Dockerfile
└── run.sh
├── police.csv
├── resourcemanager
├── Dockerfile
└── run.sh
├── spark_in_action.MD
├── sqoop.md
├── startup.sh
├── students.csv
├── submit
├── Dockerfile
├── WordCount.jar
└── run.sh
├── template
├── java
│ ├── Dockerfile
│ ├── README.md
│ └── template.sh
├── python
│ ├── Dockerfile
│ ├── README.md
│ └── template.sh
└── scala
│ ├── Dockerfile
│ ├── README.md
│ ├── build.sbt
│ ├── plugins.sbt
│ └── template.sh
├── wordcount.md
├── worker
├── Dockerfile
├── README.md
└── worker.sh
└── yarn.md
/.gitignore:
--------------------------------------------------------------------------------
1 | data/
2 |
--------------------------------------------------------------------------------
/HiveInstallation.md:
--------------------------------------------------------------------------------
1 | # **Complete Steps to Install Apache Hive on Ubuntu**
2 |
3 | Apache Hive is a data warehouse infrastructure built on top of Hadoop. This guide will show how to install and configure Hive on **Ubuntu**.
4 |
5 | ---
6 |
7 | ## **Step 1: Install Prerequisites**
8 | Before installing Hive, ensure your system has the necessary dependencies.
9 |
10 | ### **1.1 Install Java**
11 | Hive requires Java to run. Install it if it's not already installed:
12 | ```bash
13 | sudo apt update
14 | sudo apt install default-jdk -y
15 | java -version # Verify installation
16 | ```
17 |
18 | ### **1.2 Install Hadoop (Required for Hive)**
19 | Hive requires a working Hadoop installation. Note that Hadoop is **not** available from Ubuntu's default `apt` repositories, so install it from the official Apache tarball and verify that it is on your `PATH`:
20 | 
21 | ```bash
22 | hadoop version   # Verify installation
23 | ```
24 | 
25 | If you need a full Hadoop setup, follow a [Hadoop installation guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html).
26 |
27 | ### **1.3 Install wget (If not installed)**
28 | ```bash
29 | sudo apt install wget -y
30 | ```
31 |
32 | ---
33 |
34 | ## **Step 2: Download and Install Apache Hive**
35 | ### **2.1 Download Hive**
36 | ```bash
37 | wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
38 | ```
39 | *Check the latest version from the official Hive website:* [Apache Hive Downloads](https://hive.apache.org/downloads.html)
40 |
41 | ### **2.2 Extract Hive and Move to /opt Directory**
42 | ```bash
43 | sudo tar -xzf apache-hive-2.3.9-bin.tar.gz -C /opt
44 | sudo mv /opt/apache-hive-2.3.9-bin /opt/hive
45 | ```
46 |
47 | ---
48 |
49 | ## **Step 3: Set Up Environment Variables**
50 | To run Hive commands globally, configure environment variables.
51 |
52 | ### **3.1 Open the `.bashrc` File**
53 | ```bash
54 | nano ~/.bashrc
55 | ```
56 |
57 | ### **3.2 Add the Following Lines at the End**
58 | ```bash
59 | export HIVE_HOME=/opt/hive
60 | export PATH=$HIVE_HOME/bin:$PATH
61 | ```
62 |
63 | ### **3.3 Apply the Changes**
64 | ```bash
65 | source ~/.bashrc
66 | ```
67 |
68 | ### **3.4 Verify Hive Installation**
69 | ```bash
70 | hive --version
71 | ```
72 | If Hive is installed correctly, it will print the version.
73 |
74 | ---
75 |
76 | ## **Step 4: Configure Hive**
77 | ### **4.1 Create Hive Directories in HDFS**
78 | ```bash
79 | hdfs dfs -mkdir -p /user/hive/warehouse
80 | hdfs dfs -chmod -R 770 /user/hive/warehouse
81 | hdfs dfs -chown -R $USER:$USER /user/hive/warehouse
82 | ```
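The `hive-site.xml` configured in the next step points the scratch directory at `/tmp/hive`. Hive usually creates it on demand, but pre-creating it in HDFS with open permissions can avoid permission errors on first use — an optional sketch:

```bash
# Optional: pre-create the Hive scratch directory referenced in hive-site.xml
hdfs dfs -mkdir -p /tmp/hive
hdfs dfs -chmod -R 1777 /tmp/hive
```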
83 |
84 | ### **4.2 Configure `hive-site.xml`**
85 | Edit the Hive configuration file:
86 | ```bash
87 | sudo nano /opt/hive/conf/hive-site.xml
88 | ```
89 |
90 | Add the following configurations:
91 |
92 | ```xml
93 | <?xml version="1.0"?>
94 | <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
95 | <configuration>
96 |   <property>
97 |     <name>javax.jdo.option.ConnectionURL</name>
98 |     <value>jdbc:derby:;databaseName=/opt/hive/metastore_db;create=true</value>
99 |     <description>JDBC connection URL for the metastore database</description>
100 |   </property>
101 |   <property>
102 |     <name>hive.metastore.warehouse.dir</name>
103 |     <value>/user/hive/warehouse</value>
104 |     <description>Location of default database for the warehouse</description>
105 |   </property>
106 |   <property>
107 |     <name>hive.exec.scratchdir</name>
108 |     <value>/tmp/hive</value>
109 |     <description>Scratch directory for Hive jobs</description>
110 |   </property>
111 | </configuration>
112 | ```
113 |
114 | Save and exit (`CTRL + X`, then `Y` and `ENTER`).
115 |
116 | ---
117 |
118 | ## **Step 5: Set Proper Permissions**
119 | ```bash
120 | sudo chown -R $USER:$USER /opt/hive
121 | sudo chmod -R 755 /opt/hive
122 | ```
123 |
124 | ---
125 |
126 | ## **Step 6: Initialize Hive Metastore**
127 | Hive uses a database (Derby by default) to store metadata.
128 |
129 | ### **6.1 Run Schema Initialization**
130 | ```bash
131 | /opt/hive/bin/schematool -initSchema -dbType derby
132 | ```
133 |
134 | ---
135 |
136 | ## **Step 7: Start Hive**
137 | After setup, you can now start Hive.
138 |
139 | ### **7.1 Run Hive Shell**
140 | ```bash
141 | hive
142 | ```
143 |
144 | ### **7.2 Verify Hive is Working**
145 | Run the following command inside the Hive shell:
146 | ```sql
147 | SHOW DATABASES;
148 | ```
149 | It should list default databases.
150 |
151 | ---
152 |
153 | ## **(Optional) Configure Hive with MySQL (For Production Use)**
154 | Using **MySQL** instead of Derby is recommended for better performance.
155 |
156 | ### **1. Install MySQL Server**
157 | ```bash
158 | sudo apt install mysql-server -y
159 | sudo systemctl start mysql
160 | sudo systemctl enable mysql
161 | ```
162 |
163 | ### **2. Create a Hive Metastore Database**
164 | ```bash
165 | mysql -u root -p
166 | ```
167 | Inside the MySQL shell, run:
168 | ```sql
169 | CREATE DATABASE metastore;
170 | CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
171 | GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost';
172 | FLUSH PRIVILEGES;
173 | EXIT;
174 | ```
175 |
176 | ### **3. Configure Hive to Use MySQL**
177 | Edit `hive-site.xml`:
178 | ```bash
179 | nano /opt/hive/conf/hive-site.xml
180 | ```
181 | Replace the Derby configuration with:
182 | ```xml
183 | <property>
184 |   <name>javax.jdo.option.ConnectionURL</name>
185 |   <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
186 | </property>
187 | <property>
188 |   <name>javax.jdo.option.ConnectionDriverName</name>
189 |   <value>com.mysql.jdbc.Driver</value>
190 | </property>
191 | <property>
192 |   <name>javax.jdo.option.ConnectionUserName</name>
193 |   <value>hiveuser</value>
194 | </property>
195 | <property>
196 |   <name>javax.jdo.option.ConnectionPassword</name>
197 |   <value>hivepassword</value>
198 | </property>
199 | ```
200 |
201 | ### **4. Download MySQL JDBC Driver**
202 | ```bash
203 | wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.28.tar.gz
204 | tar -xzf mysql-connector-java-8.0.28.tar.gz
205 | sudo mv mysql-connector-java-8.0.28/mysql-connector-java-8.0.28.jar /opt/hive/lib/
206 | ```
207 |
208 | ### **5. Reinitialize Hive Metastore**
209 | ```bash
210 | /opt/hive/bin/schematool -initSchema -dbType mysql
211 | ```
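To double-check that the schema was created against MySQL rather than Derby, `schematool` can report the recorded schema version — an optional verification step:

```bash
# Optional check: print the metastore schema version that schematool recorded
/opt/hive/bin/schematool -info -dbType mysql
```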
212 |
213 | ---
214 |
215 | ## **Hive is Now Ready to Use! 🚀**
216 | With this setup, Hive is installed and ready for queries.
217 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | DOCKER_NETWORK = docker-hadoop_default
2 | ENV_FILE = hadoop.env
3 | current_branch := $(shell git rev-parse --abbrev-ref HEAD)
4 | build:
5 | docker build -t bde2020/hadoop-base:$(current_branch) ./base
6 | docker build -t bde2020/hadoop-namenode:$(current_branch) ./namenode
7 | docker build -t bde2020/hadoop-datanode:$(current_branch) ./datanode
8 | docker build -t bde2020/hadoop-resourcemanager:$(current_branch) ./resourcemanager
9 | docker build -t bde2020/hadoop-nodemanager:$(current_branch) ./nodemanager
10 | docker build -t bde2020/hadoop-historyserver:$(current_branch) ./historyserver
11 | docker build -t bde2020/hadoop-submit:$(current_branch) ./submit
12 | docker build -t bde2020/hive:$(current_branch) ./
13 |
14 | wordcount:
15 | docker build -t hadoop-wordcount ./submit
16 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -mkdir -p /input/
17 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -copyFromLocal -f /opt/hadoop-3.2.1/README.txt /input/
18 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} hadoop-wordcount
19 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -cat /output/*
20 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -rm -r /output
21 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -rm -r /input
22 |
--------------------------------------------------------------------------------
/Multi-Node Cluster on Ubuntu 24.04 (VMware).md:
--------------------------------------------------------------------------------
1 | # **Complete Guide: Install Hadoop Multi-Node Cluster on Ubuntu 24.04 (VMware)**
2 | This guide covers installing and configuring **Hadoop 3.3.6** on **two Ubuntu 24.04 virtual machines** inside **VMware Workstation**.
3 |
4 | ## **Prerequisites**
5 | 1. **Two Ubuntu 24.04 VMs** running in **VMware Workstation**.
6 | 2. **At least 4GB RAM & 50GB disk space per VM**.
7 | 3. **Static IPs for both VMs**.
8 | 4. **Java 8 or later installed**.
9 |
10 | ---
11 |
12 | # **Step 1: Configure Static IPs for Both VMs**
13 | ### **1. Check Network Interface Name**
14 | On **both VMs**, open Terminal and run:
15 | ```bash
16 | ip a
17 | ```
18 | Find your network interface (e.g., `ens33` or `eth0`).
19 |
20 | ### **2. Edit Netplan Configuration**
21 | Run:
22 | ```bash
23 | sudo nano /etc/netplan/00-installer-config.yaml
24 | ```
25 | For the **Master Node** (VM 1):
26 | ```yaml
27 | network:
28 | version: 2
29 | renderer: networkd
30 | ethernets:
31 | ens33:
32 | dhcp4: no
33 | addresses:
34 | - 192.168.1.100/24
35 | gateway4: 192.168.1.1
36 | nameservers:
37 | addresses:
38 | - 8.8.8.8
39 | - 8.8.4.4
40 | ```
41 | For the **Worker Node** (VM 2):
42 | ```yaml
43 | network:
44 | version: 2
45 | renderer: networkd
46 | ethernets:
47 | ens33:
48 | dhcp4: no
49 | addresses:
50 | - 192.168.1.101/24
51 | gateway4: 192.168.1.1
52 | nameservers:
53 | addresses:
54 | - 8.8.8.8
55 | - 8.8.4.4
56 | ```
57 | ### **3. Apply Changes**
58 | ```bash
59 | sudo netplan apply
60 | ip a # Verify new IP
61 | ```
62 |
63 | ---
64 |
65 | # **Step 2: Install Java on Both VMs**
66 | Hadoop requires Java. Install **OpenJDK 11**:
67 | ```bash
68 | sudo apt update && sudo apt install openjdk-11-jdk -y
69 | ```
70 | Verify installation:
71 | ```bash
72 | java -version
73 | ```
74 | Expected output:
75 | ```
76 | openjdk version "11.0.20" (the exact build date and patch level will differ)
77 | ```
78 |
79 | ---
80 |
81 | # **Step 3: Create Hadoop User on Both VMs**
82 | ```bash
83 | sudo adduser hadoop
84 | sudo usermod -aG sudo hadoop
85 | su - hadoop
86 | ```
87 |
88 | ---
89 |
90 | # **Step 4: Configure SSH Access**
91 | 1. **Install SSH on Both VMs**:
92 | ```bash
93 | sudo apt install ssh -y
94 | ```
95 | 2. **Generate SSH Keys on Master Node**:
96 | ```bash
97 | ssh-keygen -t rsa -P ""
98 | ```
99 | 3. **Copy SSH Key to Worker Node**:
100 | ```bash
101 | ssh-copy-id hadoop@192.168.1.101
102 | ```
103 | 4. **Test SSH Connection from Master to Worker**:
104 | ```bash
105 | ssh hadoop@192.168.1.101
106 | ```
107 | It should log in without asking for a password.
108 |
109 | ---
110 |
111 | # **Step 5: Download and Install Hadoop**
112 | Perform the following steps **on both VMs**.
113 |
114 | ### **1. Download Hadoop**
115 | ```bash
116 | wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
117 | tar -xvzf hadoop-3.3.6.tar.gz
118 | sudo mv hadoop-3.3.6 /usr/local/hadoop
119 | ```
120 |
121 | ### **2. Set Environment Variables**
122 | Edit `~/.bashrc`:
123 | ```bash
124 | nano ~/.bashrc
125 | ```
126 | Add:
127 | ```bash
128 | # Hadoop Environment Variables
129 | export HADOOP_HOME=/usr/local/hadoop
130 | export HADOOP_INSTALL=$HADOOP_HOME
131 | export HADOOP_MAPRED_HOME=$HADOOP_HOME
132 | export HADOOP_COMMON_HOME=$HADOOP_HOME
133 | export HADOOP_HDFS_HOME=$HADOOP_HOME
134 | export HADOOP_YARN_HOME=$HADOOP_HOME
135 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
136 | export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
137 | ```
138 | Save and apply:
139 | ```bash
140 | source ~/.bashrc
141 | ```
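A quick way to confirm the variables took effect in the current shell (assuming the paths above):

```bash
# Both commands should resolve without errors if .bashrc was applied correctly
echo $HADOOP_HOME   # should print /usr/local/hadoop
hadoop version      # should report Hadoop 3.3.6 (if it complains about JAVA_HOME, that is set in the next step)
```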
142 |
143 | ---
144 |
145 | # **Step 6: Configure Hadoop**
146 | ## **1. Configure `hadoop-env.sh`**
147 | Edit:
148 | ```bash
149 | nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
150 | ```
151 | Set Java path:
152 | ```bash
153 | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
154 | ```
155 |
156 | ---
157 |
158 | ## **2. Configure Core Site (`core-site.xml`)**
159 | Edit:
160 | ```bash
161 | nano $HADOOP_HOME/etc/hadoop/core-site.xml
162 | ```
163 | Replace with:
164 | ```xml
165 | <configuration>
166 |   <property>
167 |     <name>fs.defaultFS</name>
168 |     <value>hdfs://master:9000</value>
169 |   </property>
170 | </configuration>
171 | ```
172 |
173 | ---
174 |
175 | ## **3. Configure HDFS (`hdfs-site.xml`)**
176 | Edit:
177 | ```bash
178 | nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
179 | ```
180 | Add:
181 | ```xml
182 | <configuration>
183 |   <property>
184 |     <name>dfs.replication</name>
185 |     <value>2</value>
186 |   </property>
187 |   <property>
188 |     <name>dfs.namenode.name.dir</name>
189 |     <value>file:///usr/local/hadoop/hdfs/namenode</value>
190 |   </property>
191 |   <property>
192 |     <name>dfs.datanode.data.dir</name>
193 |     <value>file:///usr/local/hadoop/hdfs/datanode</value>
194 |   </property>
195 | </configuration>
196 | ```
197 | Create necessary directories:
198 | ```bash
199 | mkdir -p /usr/local/hadoop/hdfs/namenode
200 | mkdir -p /usr/local/hadoop/hdfs/datanode
201 | sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs
202 | ```
203 |
204 | ---
205 |
206 | ## **4. Configure MapReduce (`mapred-site.xml`)**
207 | In Hadoop 3.x, `mapred-site.xml` already exists in `$HADOOP_HOME/etc/hadoop/`, so no `.template` copy is needed.
208 | 
209 | 
210 | 
211 | Edit:
212 | ```bash
213 | nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
214 | ```
215 | Add:
216 | ```xml
217 | <configuration>
218 |   <property>
219 |     <name>mapreduce.framework.name</name>
220 |     <value>yarn</value>
221 |   </property>
222 | </configuration>
223 | ```
224 |
225 | ---
226 |
227 | ## **5. Configure YARN (`yarn-site.xml`)**
228 | Edit:
229 | ```bash
230 | nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
231 | ```
232 | Add:
233 | ```xml
234 | <configuration>
235 |   <property>
236 |     <name>yarn.nodemanager.aux-services</name>
237 |     <value>mapreduce_shuffle</value>
238 |   </property>
239 | </configuration>
240 | ```
241 |
242 | ---
243 |
244 | # **Step 7: Set Up Master and Worker Nodes**
245 | ## **1. Edit Hosts File on Both VMs**
246 | ```bash
247 | sudo nano /etc/hosts
248 | ```
249 | Add:
250 | ```
251 | 192.168.1.100 master
252 | 192.168.1.101 worker1
253 | ```
254 |
255 | ## **2. Define Workers on Master Node**
256 | On the **Master Node**, edit:
257 | ```bash
258 | nano $HADOOP_HOME/etc/hadoop/workers
259 | ```
260 | Add:
261 | ```
262 | worker1
263 | ```
264 |
265 | ---
266 |
267 | # **Step 8: Start Hadoop Cluster**
268 | ## **1. Format Namenode (Master Only)**
269 | ```bash
270 | hdfs namenode -format
271 | ```
272 |
273 | ## **2. Start Hadoop Services (Master Only)**
274 | ```bash
275 | start-dfs.sh
276 | start-yarn.sh
277 | ```
278 | Check running services:
279 | ```bash
280 | jps
281 | ```
282 | Expected output on the **master** (the worker daemons run on `worker1`; run `jps` there to see `DataNode` and `NodeManager`):
283 | ```
284 | NameNode
285 | SecondaryNameNode
286 | ResourceManager
287 | Jps
288 | ```
289 |
290 | ---
291 |
292 | # **Step 9: Verify Hadoop Cluster**
293 | ## **Check Web UI**
294 | 1. **HDFS Web UI**:
295 | 📌 **http://master:9870/**
296 | 2. **YARN Resource Manager**:
297 | 📌 **http://master:8088/**
298 |
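Besides the web UIs, you can confirm from the master's command line that the worker registered (assuming the daemons from Step 8 are running):

```bash
# HDFS view: worker1 should appear under "Live datanodes"
hdfs dfsadmin -report

# YARN view: worker1's NodeManager should be listed as RUNNING
yarn node -list
```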
299 | ---
300 |
301 | # **Step 10: Stop Hadoop**
302 | To stop services:
303 | ```bash
304 | stop-dfs.sh
305 | stop-yarn.sh
306 | ```
307 |
308 | ---
309 |
310 | # **Conclusion**
311 | You have successfully set up a **Hadoop multi-node cluster** on **two Ubuntu 24.04 VMs** inside **VMware Workstation**!
312 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 |
3 |
4 |
5 | # Docker Multi-Container Environment with Hadoop, Spark, and Hive
6 |
7 | This guide helps you set up a multi-container environment using Docker for Hadoop (HDFS), Spark, and Hive. The setup is lightweight, without the large memory requirements of a Cloudera sandbox.
8 |
9 | ## **Prerequisites**
10 |
11 | Before you begin, ensure you have the following installed:
12 |
13 | - **Docker**: [Install Docker Desktop for Windows](https://docs.docker.com/desktop/setup/install/windows-install/)
14 |
15 | - **IMPORTANT:**
16 |   - Enable the "Expose daemon on tcp://localhost:2375 without TLS" option in Docker Desktop settings for compatibility.
17 |
18 | 
19 |
20 |
21 | - **Git**: [Download Git](https://git-scm.com/downloads/win)
22 | - Git is used to download the required files from a repository.
23 |
24 | Create a new folder and open it in a terminal, or navigate into it with the `cd` command.
25 |
26 | 
27 |
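Before continuing, it can help to confirm both tools respond on the command line (a quick sanity check; exact version numbers will differ):

```bash
docker --version          # Docker is installed and on PATH
docker-compose --version  # Compose is available (or: docker compose version)
git --version             # Git is installed
```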
28 |
29 | ## **Step 1: Clone the Repository**
30 |
31 | First, clone the GitHub repository that contains the necessary Docker setup files.
32 |
33 | ```bash
34 | git clone https://github.com/lovnishverma/bigdataecosystem.git
35 | ```
36 |
37 | [Or download the ZIP directly from the repository](https://github.com/lovnishverma/BigDataecosystem)
38 |
39 | Navigate to the directory:
40 |
41 | ```bash
42 | cd bigdataecosystem
43 | ```
44 |
45 | 
46 |
47 | If you downloaded the ZIP instead, use `cd bigdataecosystem-main`.
48 |
49 | ## **Step 2: Start the Cluster**
50 |
51 | Use Docker Compose to start the containers in the background.
52 |
53 | ```bash
54 | docker-compose up -d
55 | ```
56 |
57 | This command will launch the Hadoop, Spark, and Hive containers.
58 |
59 | 
60 |
61 |
62 | ## **Step 3: Verify Running Containers**
63 |
64 | To check if the containers are running, use the following command:
65 |
66 | ```bash
67 | docker ps
68 | ```
69 | 
70 |
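If an expected container is missing from the `docker ps` output, its logs usually explain why — a quick check, using the container names from this compose setup:

```bash
# Show the most recent log lines from a container that exited or keeps restarting
docker logs --tail 50 namenode
```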
71 |
72 | ## **Step 4: Stop and Remove Containers**
73 |
74 | When you are done, stop and remove the containers with:
75 |
76 | ```bash
77 | docker-compose down
78 | ```
79 | 
80 |
81 |
82 | ### Step 5: Access the NameNode container
83 | Enter the NameNode container to interact with Hadoop:
84 | ```bash
85 | docker exec -it namenode bash
86 | ```
87 | **Note:** `-it` runs the command in an interactive terminal.
88 | 
89 | ---
90 | ## **Running Hadoop Code**
91 |
92 | To View NameNode UI Visit: [http://localhost:9870/](http://localhost:9870/)
93 |
94 | 
95 |
96 |
97 | To View Resource Manager UI Visit [http://localhost:8088/](http://localhost:8088/)
98 |
99 | 
100 |
101 |
102 | ### **MapReduce WordCount Program**
103 | ### Step 1: Copy the `code` folder into the container
104 | Use the following command in your Windows command prompt to copy the `code` folder into the container:
105 | ```bash
106 | docker cp code namenode:/
107 | ```
108 |
109 | 
110 |
111 |
112 | ### Step 2: Locate the `data.txt` file
113 | Inside the container, navigate to the `code/input` directory where the `data.txt` file is located.
114 |
115 | ### Step 3: Create directories in the Hadoop file system
116 | Run the following commands to set up directories in Hadoop's file system:
117 | ```bash
118 | hdfs dfs -mkdir /user
119 | hdfs dfs -mkdir /user/root
120 | hdfs dfs -mkdir /user/root/input
121 | ```
122 |
123 | ### Step 4: Upload the `data.txt` file
124 | Copy `data.txt` into the Hadoop file system:
125 | ```bash
126 | hdfs dfs -put /code/input/data.txt /user/root/input
127 | ```
128 | 
129 |
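To confirm the upload before running the job, list the input directory (an optional check):

```bash
# data.txt should appear in the listing
hdfs dfs -ls /user/root/input
```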
130 |
131 | ### Step 5: Navigate to the directory containing the `wordCount.jar` file
132 | Return to the directory where the `wordCount.jar` file is located:
133 | ```bash
134 | cd /code/
135 | ```
136 | 
137 |
138 |
139 | ### Step 6: Execute the WordCount program
140 |
141 | To View NameNode UI Visit: [http://localhost:9870/](http://localhost:9870/)
142 |
143 | 
144 |
145 |
146 | Run the WordCount program to process the input data:
147 | ```bash
148 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output
149 | ```
150 | 
151 |
152 |
153 | To View YARN Resource Manager UI Visit [http://localhost:8088/](http://localhost:8088/)
154 |
155 | 
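While the job runs, you can also track it from the command line inside the container, in addition to the ResourceManager UI (an optional check):

```bash
# Lists YARN applications; the WordCount job should appear with its current state
yarn application -list
```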
156 |
157 | ### Step 7: Display the output
158 | View the results of the WordCount program:
159 | ```bash
160 | hdfs dfs -cat /user/root/output/*
161 | ```
162 | 
163 |
164 | ---
165 |
166 | **or**
167 |
168 | ```bash
169 | hdfs dfs -cat /user/root/output/part-r-00000
170 | ```
171 |
172 | 
173 |
174 |
175 | ## **Summary**
176 |
177 | This guide simplifies setting up and running Hadoop on Docker. Each step ensures a smooth experience, even for beginners without a technical background. Follow the instructions carefully, and you’ll have a working Hadoop setup in no time!
178 |
179 | Here is an explanation of the **MapReduce process** using the example input `DOG CAT RAT`, `CAR CAR RAT`, and `DOG CAR CAT`.
180 | 
181 | ---
182 | ## 🐾 **Input Data**
183 |
184 | The `data.txt` file contains the following lines:
185 |
186 | ```
187 | DOG CAT RAT
188 | CAR CAR RAT
189 | DOG CAR CAT
190 | ```
191 |
192 | This text file is processed by the **MapReduce WordCount program** to count the occurrences of each word.
193 |
194 | ---
195 |
196 | ## 💡 **What is MapReduce?**
197 |
198 | - **MapReduce** is a two-step process:
199 | 1. **Map Phase** 🗺️: Splits the input into key-value pairs.
200 | 2. **Reduce Phase** ➕: Combines the key-value pairs to produce the final result.
201 |
202 | It's like dividing a big task (word counting) into smaller tasks and then combining the results. 🧩
203 |
204 | ---
205 |
206 | ## 🔄 **How MapReduce Works in Your Example**
207 |
208 | ### **1. Map Phase** 🗺️
209 |
210 | The mapper processes each line of the input file, splits it into words, and assigns each word a count of `1`.
211 |
212 | For example:
213 | ```
214 | DOG CAT RAT -> (DOG, 1), (CAT, 1), (RAT, 1)
215 | CAR CAR RAT -> (CAR, 1), (CAR, 1), (RAT, 1)
216 | DOG CAR CAT -> (DOG, 1), (CAR, 1), (CAT, 1)
217 | ```
218 |
219 | **Mapper Output**:
220 | ```
221 | (DOG, 1), (CAT, 1), (RAT, 1)
222 | (CAR, 1), (CAR, 1), (RAT, 1)
223 | (DOG, 1), (CAR, 1), (CAT, 1)
224 | ```
225 |
226 | ---
227 |
228 | ### **2. Shuffle and Sort Phase** 🔄
229 |
230 | This step groups all values for the same key (word) together and sorts them.
231 |
232 | For example:
233 | ```
234 | (CAR, [1, 1, 1])
235 | (CAT, [1, 1])
236 | (DOG, [1, 1])
237 | (RAT, [1, 1])
238 | ```
239 |
240 | ---
241 |
242 | ### **3. Reduce Phase** ➕
243 |
244 | The reducer sums up the counts for each word to get the total number of occurrences.
245 |
246 | **Reducer Output**:
247 | ```
248 | CAR 3 🏎️
249 | CAT 2 🐱
250 | DOG 2 🐶
251 | RAT 2 🐭
252 | ```
253 |
254 | ---
255 |
256 | ### **Final Output** 📋
257 |
258 | The final word count is saved in the HDFS output directory. You can view it using:
259 | ```bash
260 | hdfs dfs -cat /user/root/output/*
261 | ```
262 |
263 | **Result**:
264 | ```
265 | CAR 3
266 | CAT 2
267 | DOG 2
268 | RAT 2
269 | ```
270 |
271 | ---
272 |
273 | ## 🗂️ **HDFS Commands You Used**
274 |
275 | Here are the basic HDFS commands you used and their purpose:
276 |
277 | 1. **Upload a file to HDFS** 📤:
278 | ```bash
279 | hdfs dfs -put data.txt /user/root/input
280 | ```
281 | - **What it does**: Uploads `data.txt` to the HDFS directory `/user/root/input`.
282 | - **Output**: No output, but the file is now in HDFS.
283 |
284 | 2. **List files in a directory** 📁:
285 | ```bash
286 | hdfs dfs -ls /user/root/input
287 | ```
288 | - **What it does**: Lists all files in the `/user/root/input` directory.
289 | - **Output**: Something like this:
290 | ```
291 | Found 1 items
292 | -rw-r--r-- 1 root supergroup 50 2024-12-12 /user/root/input/data.txt
293 | ```
294 |
295 | 3. **View the contents of a file** 📄:
296 | ```bash
297 | hdfs dfs -cat /user/root/input/data.txt
298 | ```
299 | - **What it does**: Displays the contents of the `data.txt` file in HDFS.
300 | - **Output**:
301 | ```
302 | DOG CAT RAT
303 | CAR CAR RAT
304 | DOG CAR CAT
305 | ```
306 |
307 | 4. **Run the MapReduce Job** 🚀:
308 | ```bash
309 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output
310 | ```
311 | - **What it does**: Runs the WordCount program on the input directory and saves the result in the output directory.
312 |
313 | 5. **View the final output** 📊:
314 | ```bash
315 | hdfs dfs -cat /user/root/output/*
316 | ```
317 | - **What it does**: Displays the word count results.
318 | - **Output**:
319 | ```
320 | CAR 3
321 | CAT 2
322 | DOG 2
323 | RAT 2
324 | ```
325 |
326 | ---
327 |
328 | ## 🛠️ **How You Utilized MapReduce**
329 |
330 | 1. **Input**:
331 | You uploaded a small text file (`data.txt`) to HDFS.
332 |
333 | 2. **Process**:
334 | The `WordCount` program processed the file using MapReduce:
335 | - The **mapper** broke the file into words and counted each occurrence.
336 | - The **reducer** aggregated the counts for each word.
337 |
338 | 3. **Output**:
339 | The results were saved in HDFS and displayed using the `cat` command.
340 |
341 | ---
342 |
343 | ## 🧩 **Visualization of the Entire Process**
344 |
345 | ### **Input** (HDFS file):
346 | ```
347 | DOG CAT RAT
348 | CAR CAR RAT
349 | DOG CAR CAT
350 | ```
351 |
352 | ### **Map Phase Output** 🗺️:
353 | ```
354 | (DOG, 1), (CAT, 1), (RAT, 1)
355 | (CAR, 1), (CAR, 1), (RAT, 1)
356 | (DOG, 1), (CAR, 1), (CAT, 1)
357 | ```
358 |
359 | ### **Shuffle & Sort** 🔄:
360 | ```
361 | (CAR, [1, 1, 1])
362 | (CAT, [1, 1])
363 | (DOG, [1, 1])
364 | (RAT, [1, 1])
365 | ```
366 |
367 | ### **Reduce Phase Output** ➕:
368 | ```
369 | CAR 3
370 | CAT 2
371 | DOG 2
372 | RAT 2
373 | ```
374 |
375 | ---
376 |
377 | 
378 |
379 | ### 🔑 **Key Takeaways**
380 | - **MapReduce** splits the task into small, manageable pieces and processes them in parallel.
381 | - It’s ideal for large datasets but works the same for smaller ones (like your example).
382 | - Hadoop is designed for distributed systems, making it powerful for big data processing.
383 |
384 |
385 |
386 |
387 |
388 |
389 | ### **Stopping the Containers**
390 | To stop the Docker containers when done:
391 | ```bash
392 | docker-compose down
393 | ```
394 | This will stop and remove the containers and networks created by `docker-compose up`.
395 |
396 | ### **Permissions Issue with Copying Files**
397 | If you face permission issues while copying files to containers, ensure the target directory inside the container has the right permissions:
398 | ```bash
399 | docker exec -it namenode bash
400 | chmod -R 777 /your-directory
401 | ```
402 |
403 | ### **Additional Debugging Tips**
404 | If containers fail to start or throw Hadoop configuration errors, inspect the logs with `docker logs <container-name>`; insufficient memory allocated to Docker is a common cause.
405 |
406 | ### **Final Output File Path**
407 | The output of the WordCount job is written to `/user/root/output/` in HDFS. Note that this directory must not already exist when the job starts; remove it with `hdfs dfs -rm -r /user/root/output` before re-running the job.
408 |
409 | ---
410 |
411 | ### **Quick Reference: Common Fixes**
412 |
413 | 1. **Network Issues:**
414 |    If you can't access the NameNode UI, ensure the container's ports are correctly exposed. When running locally, the UI should be reachable at http://localhost:9870.
415 | 
416 | 
417 |
418 | 2. **Stopping Containers:**
419 | ```bash
420 | docker-compose down # Stop and remove the containers
421 | ```
422 |
423 | 3. **Permissions Fix:**
424 | ```bash
425 | docker exec -it namenode bash
426 | chmod -R 777 /your-directory # If you face any permission errors
427 | ```
428 |
429 | 4. **Handling HDFS Directory Creation:**
430 |    If `hdfs dfs -mkdir` fails because the directory already exists, remove it first:
431 | ```bash
432 | hdfs dfs -rm -r /user/root/input # If the directory exists, remove it first
433 | hdfs dfs -mkdir /user/root/input
434 | ```
435 |
436 | ---
437 |
438 | ## 😊 References
439 |
440 | https://data-flair.training/blogs/top-hadoop-hdfs-commands-tutorial/
441 |
442 | https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
443 |
444 | https://medium.com/@traininghub.io/hadoop-mapreduce-architecture-7e167e264595
445 |
446 |
447 | ## **Step 5: Set Up HDFS**
448 |
449 | ### **Upload Files to HDFS**
450 |
451 | To copy a file (e.g., `police.csv`) to the Hadoop cluster:
452 |
453 | 1. Copy the file into the namenode container:
454 | ```bash
455 | docker cp police.csv namenode:/police.csv
456 | ```
457 | 
458 |
459 |
460 | 2. Access the namenode container's bash shell:
461 | ```bash
462 | docker exec -it namenode bash
463 | ```
464 | 
465 |
466 |
467 | 3. Create a directory in HDFS and upload the file:
468 | ```bash
469 | hdfs dfs -mkdir -p /data/crimerecord/police
470 | hdfs dfs -put /police.csv /data/crimerecord/police/
471 | ```
472 | 
473 |
474 |
475 |
476 | 
477 |
478 |
479 | ### **Start Spark Shell**
480 |
481 | To interact with Spark, start the Spark shell in the master container:
482 |
483 | ```bash
484 | docker exec -it spark-master bash
485 |
486 | spark/bin/spark-shell --master spark://spark-master:7077
487 | ```
488 | ### **Access the Spark Master UI**
489 |
490 | - Open `http://localhost:8080` in your web browser to view the Spark Master UI.
491 | - **You can monitor processes here**
492 |
493 | 
494 |
495 |
496 | 
497 |
498 | 
499 |
500 | # **Working with Apache Spark**
501 |
502 | ## **1. Introduction to Apache Spark**
503 |
504 | - **Overview**: Apache Spark is an open-source distributed computing system known for its speed, ease of use, and general-purpose capabilities for big data processing.
505 |
506 | - **Key Features**:
507 | - Fast processing using in-memory computation.
508 | - Supports multiple languages: Scala, Python, Java, and R.
509 | - Unified framework for batch and streaming data processing.
510 |
511 | ---
512 |
513 | ## **2. Introduction to DataFrames**
514 |
515 | - **What are DataFrames?**
516 | - Distributed collections of data organized into named columns, similar to a table in a database or a DataFrame in Python's pandas.
517 | - Optimized for processing large datasets using Spark SQL.
518 |
519 | - **Key Operations**:
520 | - Creating DataFrames from structured data sources (CSV, JSON, Parquet, etc.).
521 | - Performing transformations and actions on the data.
522 |
523 | ---
524 |
525 | ## **3. Introduction to Scala for Apache Spark**
526 |
527 | - **Why Scala?**
528 | - Apache Spark is written in Scala, offering the best compatibility and performance.
529 | - Concise syntax and functional programming support.
530 |
531 | - **Basic Syntax**:
532 |
533 | ```scala
534 | val numbers = List(1, 2, 3, 4, 5) // Creates a list of numbers.
535 | val doubled = numbers.map(_ * 2) // Doubles each element in the list using map.
536 | println(doubled) // Prints the doubled list.
537 | ```
538 | The output will be:
539 | List(2, 4, 6, 8, 10)
540 |
541 | ---
542 |
543 | ## **4. Spark SQL**
544 |
545 | - **Need for Spark SQL**:
546 | - Provides a declarative interface to query structured data using SQL-like syntax.
547 | - Supports seamless integration with other Spark modules.
548 | - Allows for optimization through Catalyst Optimizer.
549 |
550 | - **Key Components**:
551 | - SQL Queries on DataFrames and temporary views.
552 | - Hive integration for legacy SQL workflows.
553 | - Support for structured data sources.
554 |
555 | ---
556 | ## **5. Hands-On: Spark SQL**
557 |
558 | ### **Objective**:
559 | To create DataFrames, load data from different sources, and perform transformations and SQL queries.
560 |
561 |
562 | #### **Step 1: Create DataFrames**
563 |
564 | ```scala
565 | val data = Seq(
566 | ("Alice", 30, "HR"),
567 | ("Bob", 25, "Engineering"),
568 | ("Charlie", 35, "Finance")
569 | )
570 |
571 | val df = data.toDF("Name", "Age", "Department")
572 |
573 | df.show()
574 | ```
575 | 
576 |
577 |
578 | #### **Step 3: Perform Transformations Using Spark SQL**
579 |
580 | ```scala
581 | df.createOrReplaceTempView("employees")
582 | val result = spark.sql("SELECT Department, COUNT(*) as count FROM employees GROUP BY Department")
583 | result.show()
584 | ```
585 | 
586 |
587 |
588 | #### **Step 4: Save Transformed Data**
589 |
590 | ```scala
591 | result.write.option("header", "true").csv("hdfs://namenode:9000/output_employees")
592 | ```
593 |
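You can confirm the part files landed in HDFS from the namenode container (path as used above; an optional check):

```bash
# Spark writes the result as a directory of part-*.csv files plus a _SUCCESS marker
hdfs dfs -ls /output_employees
```
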
594 | Reading from HDFS:
595 | Once the data is written to HDFS, you can read it back into Spark using:
596 |
597 | ```scala
598 | val outputDF = spark.read.option("header", "true").csv("hdfs://namenode:9000/output_employees")
599 | ```
600 |
601 | View output_employees.csv from HDFS
602 |
603 | ```scala
604 | outputDF.show()
605 | ```
606 | 
607 |
608 |
609 | #### **Step 5: Load Data from HDFS**
610 |
611 | ```scala
612 | // Load CSV from HDFS
613 | val df = spark.read.option("header", "false").csv("hdfs://namenode:9000/data/crimerecord/police/police.csv")
614 | df.show()
615 | ```
616 |
617 | 
618 |
619 |
620 | #### **Step 6: Scala WordCount using Apache Spark**
621 |
622 |
623 | ### Docker Command to Copy File
624 | **Copy File**: Use `docker cp` to copy the file into the container (the command below targets the `nodemanager` container).
625 | Use the following command to copy the `data.txt` file from your local system to the Docker container:
626 |
627 | ```bash
628 | docker cp data.txt nodemanager:/data.txt
629 | ```
630 | 
631 |
632 | **Copy File to HDFS**: Use `hdfs dfs -put` to move the file into the HDFS filesystem.
633 | Use the following command to put the `data.txt` file from your Docker container to HDFS:
634 |
635 | ```bash
636 | hdfs dfs -mkdir /data
637 | hdfs dfs -put data.txt /data
638 | ```
639 | 
640 |
641 | **Scala WordCount program.**
642 |
643 | **WordCount Program**: The program reads the file, splits it into words, and counts the occurrences of each word.
644 |
645 | ```scala
646 | // In spark-shell a SparkContext is already available as `sc`,
647 | // so no explicit SparkConf/SparkContext setup is required here.
648 | val input = sc.textFile("hdfs://namenode:9000/data/data.txt")
649 | val wordPairs = input.flatMap(line => line.split(" ")).map(word => (word, 1))
650 | val wordCounts = wordPairs.reduceByKey((a, b) => a + b)
651 | wordCounts.collect().foreach { case (word, count) =>
652 |   println(s"$word: $count")
653 | }
654 | ```
655 |
656 | **Output**: The word counts will be printed to the console when the program is executed.
657 |
658 | 
659 |
660 |
661 | **Stop Session**:
662 |
663 | ```scala
664 | sc.stop()
665 | ```
666 |
667 | ---
668 |
669 | ## **6. Key Takeaways**
670 |
671 | - Spark SQL simplifies working with structured data.
672 | - DataFrames provide a flexible and powerful API for handling large datasets.
673 | - Apache Spark is a versatile tool for distributed data processing, offering scalability and performance.
674 |
675 | ---
676 |
677 |
678 | 
679 |
680 | ## **Step 7: Set Up Hive**
681 |
682 | ### **Start Hive Server**
683 |
684 | Access the Hive container and start the Hive Server:
685 |
686 | ```bash
687 | docker exec -it hive-server bash
688 | ```
689 |
690 | ```bash
691 | hive
692 | ```
693 |
694 | Check if Hive is listening on port 10000:
695 | 
696 |
697 |
698 | ```bash
699 | netstat -anp | grep 10000
700 | ```
701 | 
702 |
703 |
704 | ### **Connect to Hive Server**
705 |
706 | Use Beeline to connect to the Hive server:
707 |
708 | ```bash
709 | beeline -u jdbc:hive2://localhost:10000 -n root
710 | ```
711 | 
712 |
713 |
714 | Alternatively, use the following command for direct connection:
715 |
716 | ```bash
717 | beeline
718 | ```
719 |
720 | ```bash
721 | !connect jdbc:hive2://127.0.0.1:10000 scott tiger
722 | ```
723 |
724 | 
725 |
726 |
727 | ### **Create Database and Table in Hive**
728 |
729 | 1. Create a new Hive database:
730 | ```sql
731 | CREATE DATABASE punjab_police;
732 | USE punjab_police;
733 | ```
734 | 
735 |
736 |
737 | 2. Create a table based on the schema of the `police.csv` dataset:
738 | ```sql
739 | CREATE TABLE police_data (
740 | Crime_ID INT,
741 | Crime_Type STRING,
742 | Location STRING,
743 | Reported_Date STRING,
744 | Status STRING
745 | )
746 | ROW FORMAT DELIMITED
747 | FIELDS TERMINATED BY ','
748 | STORED AS TEXTFILE;
749 | ```
750 | 
751 |
752 |
753 | 3. Load the data into the Hive table:
754 | ```sql
755 | LOAD DATA INPATH '/data/crimerecord/police/police.csv' INTO TABLE police_data;
756 | ```
757 | 
758 |
759 |
760 | ### **Query the Data in Hive**
761 |
762 | Run SQL queries to analyze the data in Hive:
763 |
764 | 1. **View the top 10 rows:**
765 | ```sql
766 | SELECT * FROM police_data LIMIT 10;
767 | ```
768 | 
769 |
770 |
771 | 2. **Count total crimes:**
772 | ```sql
773 | SELECT COUNT(*) AS Total_Crimes FROM police_data;
774 | ```
775 | 
776 |
777 |
778 | 3. **Find most common crime types:**
779 | ```sql
780 | SELECT Crime_Type, COUNT(*) AS Occurrences
781 | FROM police_data
782 | GROUP BY Crime_Type
783 | ORDER BY Occurrences DESC;
784 | ```
785 |
786 | 
787 |
788 |
789 | 4. **Identify locations with the highest crime rates:**
790 | ```sql
791 | SELECT Location, COUNT(*) AS Total_Crimes
792 | FROM police_data
793 | GROUP BY Location
794 | ORDER BY Total_Crimes DESC;
795 | ```
796 | 
797 |
798 |
799 | 5. **Find unresolved cases:**
800 | ```sql
801 | SELECT Status, COUNT(*) AS Count
802 | FROM police_data
803 | WHERE Status != 'Closed'
804 | GROUP BY Status;
805 | ```
806 | 
807 |
808 |
809 | **There you go: your private Hive server to play with.** Try `SHOW DATABASES;` to confirm everything works:
810 | 
811 | 
812 |
813 | 
814 |
815 | #### **📂 Part 2: Creating a Simple Hive Project**
816 |
817 | ---
818 |
819 | ##### **🎯 Objective**
820 | We will:
821 | 1. Create a database.
822 | 2. Create a table inside the database.
823 | 3. Load data into the table.
824 | 4. Run queries to retrieve data.
825 |
826 | ---
827 |
828 | ##### **💾 Step 1: Create a Database**
829 | In the Beeline CLI:
830 | ```sql
831 | CREATE DATABASE mydb;
832 | USE mydb;
833 | ```
834 | - 📝 *`mydb` is the name of the database. Replace it with your preferred name.*
835 |
836 | ---
837 |
838 | ##### **📋 Step 2: Create a Table**
839 | Still in the Beeline CLI, create a simple table:
840 | ```sql
841 | CREATE TABLE employees (
842 | id INT,
843 | name STRING,
844 | age INT
845 | );
846 | ```
847 | - This creates a table named `employees` with columns `id`, `name`, and `age`.
848 |
849 | ---
850 |
851 | ##### **📥 Step 3: Insert Data into the Table**
852 | Insert sample data into your table:
853 | ```sql
854 | INSERT INTO employees VALUES (1, 'Prince', 30);
855 | INSERT INTO employees VALUES (2, 'Ram Singh', 25);
856 | ```
857 |
858 | ---
859 |
860 | ##### **🔍 Step 4: Query the Table**
861 | Retrieve data from your table:
862 | ```sql
863 | SELECT * FROM employees;
864 | ```
865 | - Output:
866 |
867 | 
868 |
869 |
870 | ```
871 | +----+----------+-----+
872 | | id | name | age |
873 | +----+----------+-----+
874 | | 2 | Ram Singh | 25 |
875 | | 1 | Prince | 30 |
876 | +----+----------+-----+
877 | ```
878 |
879 | ---
880 |
881 | #### **🌟 Tips & Knowledge**
882 |
883 | 1. **What is Hive?**
884 | - Hive is a data warehouse tool on top of Hadoop.
885 | - It allows SQL-like querying over large datasets.
886 |
887 | 2. **Why Docker for Hive?**
888 | - Simplifies setup by avoiding manual configurations.
889 | - Provides a pre-configured environment for running Hive.
890 |
891 | 3. **Beeline CLI**:
892 | - A lightweight command-line tool for running Hive queries.
893 |
894 | 4. **Use Cases**:
895 | - **Data Analysis**: Run analytics on large datasets.
896 | - **ETL**: Extract, Transform, and Load data into your Hadoop ecosystem.
897 |
898 | ---
899 |
900 | #### **🎉 You're Ready!**
901 | You’ve successfully:
902 | 1. Set up Apache Hive.
903 | 2. Created and queried a sample project. 🐝
904 |
905 | ### **🐝 Apache Hive Basic Commands**
906 |
907 | Here is a collection of basic Apache Hive commands with explanations that can help you while working with Hive:
908 |
909 | ---
910 |
911 | #### **1. Database Commands**
912 |
913 | - **Show Databases:**
914 | Displays all the databases available in your Hive environment.
915 | ```sql
916 | SHOW DATABASES;
917 | ```
918 |
919 | - **Create a Database:**
920 | Create a new database.
921 | ```sql
922 | CREATE DATABASE <database_name>;
923 | ```
924 | Example:
925 | ```sql
926 | CREATE DATABASE mydb;
927 | ```
928 | In Hive, you can find out which database you are currently using by running the following command:
929 |
930 | ```sql
931 | SELECT current_database();
932 | ```
933 |
934 | This will return the name of the database that is currently in use.
935 |
936 | Alternatively, you can use this command:
937 |
938 | ```sql
939 | USE database_name;
940 | ```
941 |
942 | If you want to explicitly switch to a specific database or verify the database context, you can use this command before running your queries.
943 |
944 | - **Use a Database:**
945 | Switch to the specified database.
946 | ```sql
947 | USE <database_name>;
948 | ```
949 | Example:
950 | ```sql
951 | USE mydb;
952 | ```
953 |
954 |
955 | - **Drop a Database:**
956 | Deletes a database and its associated data.
957 | ```sql
958 | DROP DATABASE <database_name>;
959 | ```
960 |
961 | ---
962 |
963 | #### **2. Table Commands**
964 |
965 | - **Show Tables:**
966 | List all the tables in the current database.
967 | ```sql
968 | SHOW TABLES;
969 | ```
970 |
971 | - **Create a Table:**
972 | Define a new table with specific columns.
973 | ```sql
974 | CREATE TABLE <table_name> (
975 |   column_name column_type,
976 |   ...
977 | );
978 | ```
979 | Example:
980 | ```sql
981 | CREATE TABLE employees (
982 | id INT,
983 | name STRING,
984 | age INT
985 | );
986 | ```
987 |
988 | - **Describe a Table:**
989 | Get detailed information about a table, including column names and types.
990 | ```sql
991 | DESCRIBE <table_name>;
992 | ```
993 |
994 | - **Drop a Table:**
995 | Deletes a table and its associated data.
996 | ```sql
997 | DROP TABLE <table_name>;
998 | ```
999 |
1000 | - **Alter a Table:**
1001 | Modify a table structure, like adding new columns.
1002 | ```sql
1003 | ALTER TABLE <table_name> ADD COLUMNS (<column_name> <column_type>);
1004 | ```
1005 | Example:
1006 | ```sql
1007 | ALTER TABLE employees ADD COLUMNS (salary DOUBLE);
1008 | ```
1009 |
1010 | ---
1011 |
1012 | #### **3. Data Manipulation Commands**
1013 |
1014 | - **Insert Data:**
1015 | Insert data into a table.
1016 | ```sql
1017 | INSERT INTO <table_name> VALUES (<value1>, <value2>, ...);
1018 | INSERT INTO employees VALUES (1, 'Prince', 30), (2, 'Ram Singh', 25), (3, 'John Doe', 28), (4, 'Jane Smith', 32);
1019 | ```
1020 | Example:
1021 | ```sql
1022 | INSERT INTO employees VALUES (1, 'John Doe', 30);
1023 |
1024 | ```
1025 |
1026 | - **Select Data:**
1027 | Retrieve data from a table.
1028 | ```sql
1029 | SELECT * FROM <table_name>;
1030 | ```
1031 |
1032 | - **Update Data:**
1033 | Update existing data in a table.
1034 | ```sql
1035 | UPDATE <table_name> SET <column_name> = <value> WHERE <condition>;
1036 | ```
1037 |
1038 | - **Delete Data:**
1039 | Delete rows from a table based on a condition.
1040 | ```sql
1041 | DELETE FROM <table_name> WHERE <condition>;
1042 | ```
1043 |
1044 | ---
1045 |
1046 | #### **4. Querying Commands**
1047 |
1048 | - **Select Specific Columns:**
1049 | Retrieve specific columns from a table.
1050 | ```sql
1051 | SELECT <column1>, <column2> FROM <table_name>;
1052 | ```
1053 |
1054 | - **Filtering Data:**
1055 | Filter data based on conditions using the `WHERE` clause.
1056 | ```sql
1057 | SELECT * FROM <table_name> WHERE <condition>;
1058 | ```
1059 | Example:
1060 | ```sql
1061 | SELECT * FROM employees WHERE age > 25;
1062 | ```
1063 |
1064 | - **Sorting Data:**
1065 | Sort the result by a column in ascending or descending order.
1066 | ```sql
1067 | SELECT * FROM <table_name> ORDER BY <column_name> ASC|DESC;
1068 | ```
1069 | Example:
1070 | ```sql
1071 | SELECT * FROM employees ORDER BY age DESC;
1072 | SELECT * FROM employees ORDER BY age ASC;
1073 | ```
1074 |
1075 | - **Group By:**
1076 | Group data by one or more columns and aggregate it using functions like `COUNT`, `AVG`, `SUM`, etc.
1077 | ```sql
1078 | SELECT <column_name>, COUNT(*) FROM <table_name> GROUP BY <column_name>;
1079 | ```
1080 | Example:
1081 | ```sql
1082 | SELECT age, COUNT(*) FROM employees GROUP BY age;
1083 | ```
1084 |
1085 | ---
1086 |
1087 | #### **5. File Format Commands**
1088 |
1089 | - **Create External Table:**
1090 | Create a table that references data stored externally (e.g., in HDFS).
1091 | ```sql
1092 | CREATE EXTERNAL TABLE <table_name> (<column_name> <column_type>, ...)
1093 | ROW FORMAT DELIMITED
1094 | FIELDS TERMINATED BY '<delimiter>'
1095 | LOCATION '<hdfs_directory_path>';
1096 | ```
1097 | Example:
1098 | ```sql
1099 | CREATE EXTERNAL TABLE employees (
1100 | id INT,
1101 | name STRING,
1102 | age INT
1103 | ) ROW FORMAT DELIMITED
1104 | FIELDS TERMINATED BY ','
1105 | LOCATION '/user/hive/warehouse/employees';
1106 | ```
1107 |
1108 | - **Load Data into Table:**
1109 | Load data from a file into an existing Hive table.
1110 | ```sql
1111 | LOAD DATA LOCAL INPATH '<file_path>' INTO TABLE <table_name>;
1112 | ```
1113 |
1114 | ---
1115 |
1116 | #### **6. Other Useful Commands**
1117 |
1118 | - **Show Current User:**
1119 | Display the current user running the Hive session.
1120 | ```sql
1121 | !whoami;
1122 | ```
1123 |
1124 | - **Exit Hive:**
1125 | Exit from the Hive shell.
1126 | ```sql
1127 | EXIT;
1128 | ```
1129 |
1130 | - **Set Hive Variables:**
1131 | Set Hive session variables.
1132 | ```sql
1133 | SET <variable_name>=<value>;
1134 | ```
1135 |
1136 | - **Show Hive Variables:**
1137 | Display all the set variables.
1138 | ```sql
1139 | SET;
1140 | ```
1141 |
1142 | - **Show the Status of Hive Jobs:**
1143 |   Hive has no `SHOW JOBS` statement; check running queries in the YARN ResourceManager UI (http://localhost:8088), or inspect current locks with:
1144 | ```sql
1145 | SHOW LOCKS;
1146 | ```
1147 |
1148 | ---
1149 |
1150 | #### **🌟 Tips & Best Practices**
1151 |
1152 | - **Partitioning Tables:**
1153 | When dealing with large datasets, partitioning your tables can help improve query performance.
1154 | ```sql
1155 | CREATE TABLE sales (id INT, amount DOUBLE)
1156 | PARTITIONED BY (year INT, month INT);
1157 | ```
1158 |
1159 | - **Bucketing:**
1160 | Bucketing splits your data into a fixed number of files or "buckets."
1161 | ```sql
1162 | CREATE TABLE sales (id INT, amount DOUBLE)
1163 | CLUSTERED BY (id) INTO 4 BUCKETS;
1164 | ```
1165 |
1166 | - **Optimization:**
1167 | Use columnar formats like `ORC` or `Parquet` for efficient storage and performance.
1168 | ```sql
1169 | CREATE TABLE sales (id INT, amount DOUBLE)
1170 | STORED AS ORC;
1171 | ```
1172 |
1173 | These basic commands will help you interact with Hive and perform common operations like creating tables, querying data, and managing your Hive environment efficiently.
1174 |
1175 | While **Hive** and **MySQL** both use SQL-like syntax for querying data, there are some key differences in their commands, especially since Hive is designed for querying large datasets in a Hadoop ecosystem, while MySQL is a relational database management system (RDBMS).
1176 |
1177 | ## **Here’s a comparison of Hive and MySQL commands for common operations**
1178 |
1179 | ### **1. Creating Databases**
1180 | - **Hive**:
1181 | ```sql
1182 | CREATE DATABASE mydb;
1183 | ```
1184 |
1185 | - **MySQL**:
1186 | ```sql
1187 | CREATE DATABASE mydb;
1188 | ```
1189 |
1190 | *Both Hive and MySQL use the same syntax to create a database.*
1191 |
1192 | ---
1193 |
1194 | ### **2. Switching to a Database**
1195 | - **Hive**:
1196 | ```sql
1197 | USE mydb;
1198 | ```
1199 |
1200 | - **MySQL**:
1201 | ```sql
1202 | USE mydb;
1203 | ```
1204 |
1205 | *The syntax is the same for selecting a database in both systems.*
1206 |
1207 | ---
1208 |
1209 | ### **3. Creating Tables**
1210 | - **Hive**:
1211 | ```sql
1212 | CREATE TABLE employees (
1213 | id INT,
1214 | name STRING,
1215 | age INT
1216 | );
1217 | ```
1218 |
1219 | - **MySQL**:
1220 | ```sql
1221 | CREATE TABLE employees (
1222 | id INT,
1223 | name VARCHAR(255),
1224 | age INT
1225 | );
1226 | ```
1227 |
1228 | **Differences**:
1229 | - In Hive, **STRING** is used for text data, while in MySQL, **VARCHAR** is used.
1230 | - Hive also has some specialized data types for distributed storage and performance, like `ARRAY`, `MAP`, `STRUCT`, etc.
1231 |
1232 | ---
1233 |
1234 | ### **4. Inserting Data**
1235 | - **Hive**:
1236 | ```sql
1237 | INSERT INTO employees VALUES (1, 'John', 30);
1238 | INSERT INTO employees VALUES (2, 'Alice', 25);
1239 | ```
1240 |
1241 | - **MySQL**:
1242 | ```sql
1243 | INSERT INTO employees (id, name, age) VALUES (1, 'John', 30);
1244 | INSERT INTO employees (id, name, age) VALUES (2, 'Alice', 25);
1245 | ```
1246 |
1247 | **Differences**:
1248 | - Hive allows direct `INSERT INTO` with values, while MySQL explicitly lists column names in the insert statement (though this is optional in MySQL if the columns match).
1249 |
1250 | ---
1251 |
1252 | ### **5. Querying Data**
1253 | - **Hive**:
1254 | ```sql
1255 | SELECT * FROM employees;
1256 | ```
1257 |
1258 | - **MySQL**:
1259 | ```sql
1260 | SELECT * FROM employees;
1261 | ```
1262 |
1263 | *Querying data using `SELECT` is identical in both systems.*
1264 |
1265 | ---
1266 |
1267 | ### **6. Modifying Data**
1268 | - **Hive**:
1269 | Hive doesn’t support traditional **UPDATE** or **DELETE** commands directly, as it is optimized for batch processing and is more suited for append operations. However, it does support **INSERT** and **INSERT OVERWRITE** operations.
1270 |
1271 | Example of replacing data:
1272 | ```sql
1273 | INSERT OVERWRITE TABLE employees SELECT * FROM employees WHERE age > 30;
1274 | ```
1275 |
1276 | - **MySQL**:
1277 | ```sql
1278 | UPDATE employees SET age = 31 WHERE id = 1;
1279 | DELETE FROM employees WHERE id = 2;
1280 | ```
1281 |
1282 | **Differences**:
1283 | - Hive does not allow direct **UPDATE** or **DELETE**; instead, it uses **INSERT OVERWRITE** to modify data in batch operations.
1284 |
1285 | ---
1286 |
1287 | ### **7. Dropping Tables**
1288 | - **Hive**:
1289 | ```sql
1290 | DROP TABLE IF EXISTS employees;
1291 | ```
1292 |
1293 | - **MySQL**:
1294 | ```sql
1295 | DROP TABLE IF EXISTS employees;
1296 | ```
1297 |
1298 | *The syntax for dropping tables is the same in both systems.*
1299 |
1300 | ---
1301 |
1302 | ### **8. Query Performance**
1303 | - **Hive**:
1304 | - Hive is designed to run on large datasets using the Hadoop Distributed File System (HDFS), so it focuses more on **batch processing** rather than real-time queries. Query performance in Hive may be slower than MySQL because it’s optimized for scale, not for low-latency transaction processing.
1305 |
1306 | - **MySQL**:
1307 | - MySQL is an RDBMS, designed to handle **transactional workloads** with low-latency queries. It’s better suited for OLTP (Online Transaction Processing) rather than OLAP (Online Analytical Processing) workloads.
1308 |
1309 | ---
1310 |
1311 | ### **9. Indexing**
1312 | - **Hive**:
1313 | - Hive doesn’t support traditional indexing as MySQL does. However, you can create **partitioned** or **bucketed** tables in Hive to improve query performance for certain types of data.
1314 |
1315 | - **MySQL**:
1316 | - MySQL supports **indexes** (e.g., **PRIMARY KEY**, **UNIQUE**, **INDEX**) to speed up query performance on large datasets.
1317 |
1318 | ---
1319 |
1320 | ### **10. Joins**
1321 | - **Hive**:
1322 | ```sql
1323 | SELECT a.id, a.name, b.age
1324 | FROM employees a
1325 | JOIN employee_details b ON a.id = b.id;
1326 | ```
1327 |
1328 | - **MySQL**:
1329 | ```sql
1330 | SELECT a.id, a.name, b.age
1331 | FROM employees a
1332 | JOIN employee_details b ON a.id = b.id;
1333 | ```
1334 |
1335 | *The syntax for **JOIN** is the same in both systems.*
1336 |
1337 | ---
1338 |
1339 | ### **Summary of Key Differences**:
1340 | - **Data Types**: Hive uses types like `STRING`, `TEXT`, `BOOLEAN`, etc., while MySQL uses types like `VARCHAR`, `CHAR`, `TEXT`, etc.
1341 | - **Data Modification**: Hive does not support **UPDATE** or **DELETE** in the traditional way, and is generally used for **batch processing**.
1342 | - **Performance**: Hive is designed for querying large-scale datasets in Hadoop, so queries tend to be slower than MySQL.
1343 | - **Indexing**: Hive does not natively support indexing but can use partitioning and bucketing for performance optimization. MySQL supports indexing for faster queries.
1344 | - **ACID Properties**: MySQL supports full ACID compliance for transactional systems, whereas Hive is not transactional by default (but can support limited ACID features starting from version 0.14 with certain configurations).
1345 |
1346 | In conclusion, while **Hive** and **MySQL** share SQL-like syntax, they are designed for very different use cases, and not all commands work the same way in both systems.
1347 |
1348 | ### **Visualize the Data (Optional)**
1349 |
1350 | Export the query results to a CSV file for analysis in visualization tools:
1351 |
1352 | ```bash
1353 | hive -e "SELECT * FROM police_data;" > police_analysis_results.csv
1354 | ```
1355 |
1356 | You can use tools like Tableau, Excel, or Python (Matplotlib, Pandas) for data visualization.
1357 |
1358 | ## **Step 8: Configure Environment Variables (Optional)**
1359 |
1360 | If you need to customize configurations, you can specify parameters in the `hadoop.env` file or as environmental variables for services (e.g., namenode, datanode, etc.). For example:
1361 |
1362 | ```bash
1363 | CORE_CONF_fs_defaultFS=hdfs://namenode:8020
1364 | ```
1365 |
1366 | This will be transformed into the following in the `core-site.xml` file:
1367 |
1368 | ```xml
1369 | <property>
1370 |   <name>fs.defaultFS</name>
1371 |   <value>hdfs://namenode:8020</value>
1372 | </property>
1373 | 
1374 |
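The same naming convention applies to the other configuration files: the variable prefix selects the file and, per the `configure` function in `base/entrypoint.sh`, underscores in the property part become dots. A hypothetical sketch of additional `hadoop.env` entries (illustrative values, not part of the shipped file):

```bash
# hadoop.env (sketch)
HDFS_CONF_dfs_replication=1                              # -> dfs.replication in hdfs-site.xml
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager  # -> yarn.resourcemanager.hostname in yarn-site.xml
```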
1375 | ## **Conclusion**
1376 |
1377 | You now have a fully functional Hadoop, Spark, and Hive cluster running in Docker. This environment is great for experimenting with big data processing and analytics in a lightweight, containerized setup.
1378 |
1379 | ---
1380 |
1381 | I hope you have fun with this Hadoop-Spark-Hive cluster.
1382 |
1383 |
1384 |
1385 | 
1386 |
1387 |
--------------------------------------------------------------------------------
/base/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM debian:9
2 |
3 | MAINTAINER Ivan Ermilov
4 | MAINTAINER Giannis Mouchakis
5 |
6 | RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
7 | openjdk-8-jdk \
8 | net-tools \
9 | curl \
10 | netcat \
11 | gnupg \
12 | libsnappy-dev \
13 | && rm -rf /var/lib/apt/lists/*
14 |
15 | ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
16 |
17 | RUN curl -O https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
18 |
19 | RUN gpg --import KEYS
20 |
21 | ENV HADOOP_VERSION 3.2.1
22 | ENV HADOOP_URL https://www.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
23 |
24 | RUN set -x \
25 | && curl -fSL "$HADOOP_URL" -o /tmp/hadoop.tar.gz \
26 | && curl -fSL "$HADOOP_URL.asc" -o /tmp/hadoop.tar.gz.asc \
27 | && gpg --verify /tmp/hadoop.tar.gz.asc \
28 | && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
29 | && rm /tmp/hadoop.tar.gz*
30 |
31 | RUN ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop
32 |
33 | RUN mkdir /opt/hadoop-$HADOOP_VERSION/logs
34 |
35 | RUN mkdir /hadoop-data
36 |
37 | ENV HADOOP_HOME=/opt/hadoop-$HADOOP_VERSION
38 | ENV HADOOP_CONF_DIR=/etc/hadoop
39 | ENV MULTIHOMED_NETWORK=1
40 | ENV USER=root
41 | ENV PATH $HADOOP_HOME/bin/:$PATH
42 |
43 | ADD entrypoint.sh /entrypoint.sh
44 |
45 | RUN chmod a+x /entrypoint.sh
46 |
47 | ENTRYPOINT ["/entrypoint.sh"]
48 |
--------------------------------------------------------------------------------
/base/entrypoint.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Set some sensible defaults
4 | export CORE_CONF_fs_defaultFS=${CORE_CONF_fs_defaultFS:-hdfs://`hostname -f`:8020}
5 |
6 | function addProperty() {
7 | local path=$1
8 | local name=$2
9 | local value=$3
10 |
11 | local entry="<property><name>$name</name><value>${value}</value></property>"
12 | local escapedEntry=$(echo $entry | sed 's/\//\\\//g')
13 | sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path
14 | }
15 |
16 | function configure() {
17 | local path=$1
18 | local module=$2
19 | local envPrefix=$3
20 |
21 | local var
22 | local value
23 |
24 | echo "Configuring $module"
25 | for c in `printenv | perl -sne 'print "$1 " if m/^${envPrefix}_(.+?)=.*/' -- -envPrefix=$envPrefix`; do
26 | name=`echo ${c} | perl -pe 's/___/-/g; s/__/@/g; s/_/./g; s/@/_/g;'`
27 | var="${envPrefix}_${c}"
28 | value=${!var}
29 | echo " - Setting $name=$value"
30 | addProperty $path $name "$value"
31 | done
32 | }
33 |
34 | configure /etc/hadoop/core-site.xml core CORE_CONF
35 | configure /etc/hadoop/hdfs-site.xml hdfs HDFS_CONF
36 | configure /etc/hadoop/yarn-site.xml yarn YARN_CONF
37 | configure /etc/hadoop/httpfs-site.xml httpfs HTTPFS_CONF
38 | configure /etc/hadoop/kms-site.xml kms KMS_CONF
39 | configure /etc/hadoop/mapred-site.xml mapred MAPRED_CONF
40 |
41 | if [ "$MULTIHOMED_NETWORK" = "1" ]; then
42 | echo "Configuring for multihomed network"
43 |
44 | # HDFS
45 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0
46 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0
47 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0
48 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0
49 | addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true
50 | addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true
51 |
52 | # YARN
53 | addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
54 | addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
55 | addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0
56 |
57 | # MAPRED
58 | addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
59 | fi
60 |
61 | if [ -n "$GANGLIA_HOST" ]; then
62 | mv /etc/hadoop/hadoop-metrics.properties /etc/hadoop/hadoop-metrics.properties.orig
63 | mv /etc/hadoop/hadoop-metrics2.properties /etc/hadoop/hadoop-metrics2.properties.orig
64 |
65 | for module in mapred jvm rpc ugi; do
66 | echo "$module.class=org.apache.hadoop.metrics.ganglia.GangliaContext31"
67 | echo "$module.period=10"
68 | echo "$module.servers=$GANGLIA_HOST:8649"
69 | done > /etc/hadoop/hadoop-metrics.properties
70 |
71 | for module in namenode datanode resourcemanager nodemanager mrappmaster jobhistoryserver; do
72 | echo "$module.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31"
73 | echo "$module.sink.ganglia.period=10"
74 | echo "$module.sink.ganglia.supportsparse=true"
75 | echo "$module.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both"
76 | echo "$module.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40"
77 | echo "$module.sink.ganglia.servers=$GANGLIA_HOST:8649"
78 | done > /etc/hadoop/hadoop-metrics2.properties
79 | fi
80 |
81 | function wait_for_it()
82 | {
83 | local serviceport=$1
84 | local service=${serviceport%%:*}
85 | local port=${serviceport#*:}
86 | local retry_seconds=5
87 | local max_try=100
88 | let i=1
89 |
90 | nc -z $service $port
91 | result=$?
92 |
93 | until [ $result -eq 0 ]; do
94 | echo "[$i/$max_try] check for ${service}:${port}..."
95 | echo "[$i/$max_try] ${service}:${port} is not available yet"
96 | if (( $i == $max_try )); then
97 | echo "[$i/$max_try] ${service}:${port} is still not available; giving up after ${max_try} tries. :/"
98 | exit 1
99 | fi
100 |
101 | echo "[$i/$max_try] try in ${retry_seconds}s once again ..."
102 | let "i++"
103 | sleep $retry_seconds
104 |
105 | nc -z $service $port
106 | result=$?
107 | done
108 | echo "[$i/$max_try] $service:${port} is available."
109 | }
110 |
111 | for i in ${SERVICE_PRECONDITION[@]}
112 | do
113 | wait_for_it ${i}
114 | done
115 |
116 | exec $@
117 |
--------------------------------------------------------------------------------
/base/execute-step.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | if [ $ENABLE_INIT_DAEMON = "true" ]
4 | then
5 | echo "Execute step ${INIT_DAEMON_STEP} in pipeline"
6 | while true; do
7 | sleep 5
8 | echo -n '.'
9 | string=$(curl -sL -w "%{http_code}" -X PUT $INIT_DAEMON_BASE_URI/execute?step=$INIT_DAEMON_STEP -o /dev/null)
10 | [ "$string" = "204" ] && break
11 | done
12 | echo "Notified execution of step ${INIT_DAEMON_STEP}"
13 | fi
14 |
15 |
--------------------------------------------------------------------------------
/base/finish-step.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | if [ $ENABLE_INIT_DAEMON = "true" ]
4 | then
5 | echo "Finish step ${INIT_DAEMON_STEP} in pipeline"
6 | while true; do
7 | sleep 5
8 | echo -n '.'
9 | string=$(curl -sL -w "%{http_code}" -X PUT $INIT_DAEMON_BASE_URI/finish?step=$INIT_DAEMON_STEP -o /dev/null)
10 | [ "$string" = "204" ] && break
11 | done
12 | echo "Notified finish of step ${INIT_DAEMON_STEP}"
13 | fi
14 |
15 |
16 |
17 |
--------------------------------------------------------------------------------
/base/wait-for-step.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | if [ $ENABLE_INIT_DAEMON = "true" ]
4 | then
5 | echo "Validating if step ${INIT_DAEMON_STEP} can start in pipeline"
6 | while true; do
7 | sleep 5
8 | echo -n '.'
9 | string=$(curl -s $INIT_DAEMON_BASE_URI/canStart?step=$INIT_DAEMON_STEP)
10 | [ "$string" = "true" ] && break
11 | done
12 | echo "Can start step ${INIT_DAEMON_STEP}"
13 | fi
14 |
--------------------------------------------------------------------------------
/code/HadoopWordCount/bin/WordCount$IntSumReducer.class:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount$IntSumReducer.class
--------------------------------------------------------------------------------
/code/HadoopWordCount/bin/WordCount$TokenizerMapper.class:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount$TokenizerMapper.class
--------------------------------------------------------------------------------
/code/HadoopWordCount/bin/WordCount.class:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount.class
--------------------------------------------------------------------------------
/code/HadoopWordCount/bin/wc.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/wc.jar
--------------------------------------------------------------------------------
/code/HadoopWordCount/src/WordCount.java:
--------------------------------------------------------------------------------
1 | import java.io.IOException;
2 | import java.util.StringTokenizer;
3 |
4 | import org.apache.hadoop.conf.Configuration;
5 | import org.apache.hadoop.fs.Path;
6 | import org.apache.hadoop.io.IntWritable;
7 | import org.apache.hadoop.io.Text;
8 | import org.apache.hadoop.mapreduce.Job;
9 | import org.apache.hadoop.mapreduce.Mapper;
10 | import org.apache.hadoop.mapreduce.Reducer;
11 | import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
12 | import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
13 |
14 | public class WordCount {
15 |
16 | public static class TokenizerMapper
17 |        extends Mapper<Object, Text, Text, IntWritable>