├── .gitignore ├── Complete_Guide_to_Install_Ubuntu_and_JAVA_and_then_Configure_Hadoop,_MySQL,_HIVE,_Sqoop,_Flume,_Spark_on_a_Docker_Container.md ├── HiveInstallation.md ├── Makefile ├── Multi-Node Cluster on Ubuntu 24.04 (VMware).md ├── README.md ├── base ├── Dockerfile ├── bde-spark.css ├── entrypoint.sh ├── execute-step.sh ├── finish-step.sh └── wait-for-step.sh ├── code ├── HadoopWordCount │ ├── bin │ │ ├── WordCount$IntSumReducer.class │ │ ├── WordCount$TokenizerMapper.class │ │ ├── WordCount.class │ │ └── wc.jar │ └── src │ │ └── WordCount.java ├── input │ ├── About Hadoop.txt~ │ └── data.txt └── wordCount.jar ├── conf ├── beeline-log4j2.properties ├── hive-env.sh ├── hive-exec-log4j2.properties ├── hive-log4j2.properties ├── hive-site.xml ├── ivysettings.xml └── llap-daemon-log4j2.properties ├── data ├── authors.csv └── books.csv ├── datanode ├── Dockerfile └── run.sh ├── docker-compose.yml ├── ecom.md ├── entrypoint.sh ├── flume.md ├── hadoop-basic-commands.md ├── hadoop-hive.env ├── hadoop.env ├── hadoop_installation_VMware Workstation.md ├── historyserver ├── Dockerfile └── run.sh ├── master ├── Dockerfile ├── README.md └── master.sh ├── namenode ├── Dockerfile └── run.sh ├── nginx ├── Dockerfile ├── bde-hadoop.css ├── default.conf └── materialize.min.css ├── nodemanager ├── Dockerfile └── run.sh ├── police.csv ├── resourcemanager ├── Dockerfile └── run.sh ├── spark_in_action.MD ├── sqoop.md ├── startup.sh ├── students.csv ├── submit ├── Dockerfile ├── WordCount.jar └── run.sh ├── template ├── java │ ├── Dockerfile │ ├── README.md │ └── template.sh ├── python │ ├── Dockerfile │ ├── README.md │ └── template.sh └── scala │ ├── Dockerfile │ ├── README.md │ ├── build.sbt │ ├── plugins.sbt │ └── template.sh ├── wordcount.md ├── worker ├── Dockerfile ├── README.md └── worker.sh └── yarn.md /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | -------------------------------------------------------------------------------- /HiveInstallation.md: -------------------------------------------------------------------------------- 1 | # **Complete Steps to Install Apache Hive on Ubuntu** 2 | 3 | Apache Hive is a data warehouse infrastructure built on top of Hadoop. This guide will show how to install and configure Hive on **Ubuntu**. 4 | 5 | --- 6 | 7 | ## **Step 1: Install Prerequisites** 8 | Before installing Hive, ensure your system has the necessary dependencies. 9 | 10 | ### **1.1 Install Java** 11 | Hive requires Java to run. Install it if it's not already installed: 12 | ```bash 13 | sudo apt update 14 | sudo apt install default-jdk -y 15 | java -version # Verify installation 16 | ``` 17 | 18 | ### **1.2 Install Hadoop (Required for Hive)** 19 | Hive requires Hadoop to function properly. If Hadoop is not installed, install it using: 20 | 21 | ```bash 22 | sudo apt install hadoop -y 23 | hadoop version # Verify installation 24 | ``` 25 | If you need a full Hadoop setup, follow a [Hadoop installation guide](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html). 
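Note that a `hadoop` package is generally not available from Ubuntu's default `apt` repositories, so in practice Hadoop is usually installed from the Apache tarball as described in the linked guide. However it was installed, it is worth confirming that HDFS is actually running before continuing, because the Hive warehouse directories created in Step 4 need a live NameNode. A minimal check (assuming a standard single-node setup with the Hadoop binaries on your `PATH`):

```bash
jps                    # should list NameNode, DataNode and related daemons
hdfs dfsadmin -report  # should report at least one live DataNode
```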
26 | 27 | ### **1.3 Install wget (If not installed)** 28 | ```bash 29 | sudo apt install wget -y 30 | ``` 31 | 32 | --- 33 | 34 | ## **Step 2: Download and Install Apache Hive** 35 | ### **2.1 Download Hive** 36 | ```bash 37 | wget https://apache.root.lu/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz 38 | ``` 39 | *Check the latest version from the official Hive website:* [Apache Hive Downloads](https://hive.apache.org/downloads.html) 40 | 41 | ### **2.2 Extract Hive and Move to /opt Directory** 42 | ```bash 43 | sudo tar -xzf apache-hive-2.3.9-bin.tar.gz -C /opt 44 | sudo mv /opt/apache-hive-2.3.9-bin /opt/hive 45 | ``` 46 | 47 | --- 48 | 49 | ## **Step 3: Set Up Environment Variables** 50 | To run Hive commands globally, configure environment variables. 51 | 52 | ### **3.1 Open the `.bashrc` File** 53 | ```bash 54 | nano ~/.bashrc 55 | ``` 56 | 57 | ### **3.2 Add the Following Lines at the End** 58 | ```bash 59 | export HIVE_HOME=/opt/hive 60 | export PATH=$HIVE_HOME/bin:$PATH 61 | ``` 62 | 63 | ### **3.3 Apply the Changes** 64 | ```bash 65 | source ~/.bashrc 66 | ``` 67 | 68 | ### **3.4 Verify Hive Installation** 69 | ```bash 70 | hive --version 71 | ``` 72 | If Hive is installed correctly, it will print the version. 73 | 74 | --- 75 | 76 | ## **Step 4: Configure Hive** 77 | ### **4.1 Create Hive Directories in HDFS** 78 | ```bash 79 | hdfs dfs -mkdir -p /user/hive/warehouse 80 | hdfs dfs -chmod -R 770 /user/hive/warehouse 81 | hdfs dfs -chown -R $USER:$USER /user/hive/warehouse 82 | ``` 83 | 84 | ### **4.2 Configure `hive-site.xml`** 85 | Edit the Hive configuration file: 86 | ```bash 87 | sudo nano /opt/hive/conf/hive-site.xml 88 | ``` 89 | 90 | Add the following configurations: 91 | 92 | ```xml 93 | 94 | 95 | 96 | 97 | javax.jdo.option.ConnectionURL 98 | jdbc:derby:;databaseName=/opt/hive/metastore_db;create=true 99 | JDBC connection URL for the metastore database 100 | 101 | 102 | hive.metastore.warehouse.dir 103 | /user/hive/warehouse 104 | Location of default database for the warehouse 105 | 106 | 107 | hive.exec.scratchdir 108 | /tmp/hive 109 | Scratch directory for Hive jobs 110 | 111 | 112 | ``` 113 | 114 | Save and exit (`CTRL + X`, then `Y` and `ENTER`). 115 | 116 | --- 117 | 118 | ## **Step 5: Set Proper Permissions** 119 | ```bash 120 | sudo chown -R $USER:$USER /opt/hive 121 | sudo chmod -R 755 /opt/hive 122 | ``` 123 | 124 | --- 125 | 126 | ## **Step 6: Initialize Hive Metastore** 127 | Hive uses a database (Derby by default) to store metadata. 128 | 129 | ### **6.1 Run Schema Initialization** 130 | ```bash 131 | /opt/hive/bin/schematool -initSchema -dbType derby 132 | ``` 133 | 134 | --- 135 | 136 | ## **Step 7: Start Hive** 137 | After setup, you can now start Hive. 138 | 139 | ### **7.1 Run Hive Shell** 140 | ```bash 141 | hive 142 | ``` 143 | 144 | ### **7.2 Verify Hive is Working** 145 | Run the following command inside the Hive shell: 146 | ```sql 147 | SHOW DATABASES; 148 | ``` 149 | It should list default databases. 150 | 151 | --- 152 | 153 | ## **(Optional) Configure Hive with MySQL (For Production Use)** 154 | Using **MySQL** instead of Derby is recommended for better performance. 155 | 156 | ### **1. Install MySQL Server** 157 | ```bash 158 | sudo apt install mysql-server -y 159 | sudo systemctl start mysql 160 | sudo systemctl enable mysql 161 | ``` 162 | 163 | ### **2. 
Create a Hive Metastore Database** 164 | ```bash 165 | mysql -u root -p 166 | ``` 167 | Inside the MySQL shell, run: 168 | ```sql 169 | CREATE DATABASE metastore; 170 | CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword'; 171 | GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost'; 172 | FLUSH PRIVILEGES; 173 | EXIT; 174 | ``` 175 | 176 | ### **3. Configure Hive to Use MySQL** 177 | Edit `hive-site.xml`: 178 | ```bash 179 | nano /opt/hive/conf/hive-site.xml 180 | ``` 181 | Replace the Derby configuration with: 182 | ```xml 183 | 184 | javax.jdo.option.ConnectionURL 185 | jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true 186 | 187 | 188 | javax.jdo.option.ConnectionDriverName 189 | com.mysql.jdbc.Driver 190 | 191 | 192 | javax.jdo.option.ConnectionUserName 193 | hiveuser 194 | 195 | 196 | javax.jdo.option.ConnectionPassword 197 | hivepassword 198 | 199 | ``` 200 | 201 | ### **4. Download MySQL JDBC Driver** 202 | ```bash 203 | wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-8.0.28.tar.gz 204 | tar -xzf mysql-connector-java-8.0.28.tar.gz 205 | sudo mv mysql-connector-java-8.0.28/mysql-connector-java-8.0.28.jar /opt/hive/lib/ 206 | ``` 207 | 208 | ### **5. Reinitialize Hive Metastore** 209 | ```bash 210 | /opt/hive/bin/schematool -initSchema -dbType mysql 211 | ``` 212 | 213 | --- 214 | 215 | ## **Hive is Now Ready to Use! 🚀** 216 | With this setup, Hive is installed and ready for queries. 217 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | DOCKER_NETWORK = docker-hadoop_default 2 | ENV_FILE = hadoop.env 3 | current_branch := $(shell git rev-parse --abbrev-ref HEAD) 4 | build: 5 | docker build -t bde2020/hadoop-base:$(current_branch) ./base 6 | docker build -t bde2020/hadoop-namenode:$(current_branch) ./namenode 7 | docker build -t bde2020/hadoop-datanode:$(current_branch) ./datanode 8 | docker build -t bde2020/hadoop-resourcemanager:$(current_branch) ./resourcemanager 9 | docker build -t bde2020/hadoop-nodemanager:$(current_branch) ./nodemanager 10 | docker build -t bde2020/hadoop-historyserver:$(current_branch) ./historyserver 11 | docker build -t bde2020/hadoop-submit:$(current_branch) ./submit 12 | docker build -t bde2020/hive:$(current_branch) ./ 13 | 14 | wordcount: 15 | docker build -t hadoop-wordcount ./submit 16 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -mkdir -p /input/ 17 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -copyFromLocal -f /opt/hadoop-3.2.1/README.txt /input/ 18 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} hadoop-wordcount 19 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -cat /output/* 20 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -rm -r /output 21 | docker run --network ${DOCKER_NETWORK} --env-file ${ENV_FILE} bde2020/hadoop-base:$(current_branch) hdfs dfs -rm -r /input 22 | -------------------------------------------------------------------------------- /Multi-Node Cluster on Ubuntu 24.04 (VMware).md: -------------------------------------------------------------------------------- 1 | # **Complete Guide: Install Hadoop Multi-Node Cluster on Ubuntu 24.04 (VMware)** 2 | This guide covers installing 
and configuring **Hadoop 3.3.6** on **two Ubuntu 24.04 virtual machines** inside **VMware Workstation**. 3 | 4 | ## **Prerequisites** 5 | 1. **Two Ubuntu 24.04 VMs** running in **VMware Workstation**. 6 | 2. **At least 4GB RAM & 50GB disk space per VM**. 7 | 3. **Static IPs for both VMs**. 8 | 4. **Java 8 or later installed**. 9 | 10 | --- 11 | 12 | # **Step 1: Configure Static IPs for Both VMs** 13 | ### **1. Check Network Interface Name** 14 | On **both VMs**, open Terminal and run: 15 | ```bash 16 | ip a 17 | ``` 18 | Find your network interface (e.g., `ens33` or `eth0`). 19 | 20 | ### **2. Edit Netplan Configuration** 21 | Run: 22 | ```bash 23 | sudo nano /etc/netplan/00-installer-config.yaml 24 | ``` 25 | For the **Master Node** (VM 1): 26 | ```yaml 27 | network: 28 | version: 2 29 | renderer: networkd 30 | ethernets: 31 | ens33: 32 | dhcp4: no 33 | addresses: 34 | - 192.168.1.100/24 35 | gateway4: 192.168.1.1 36 | nameservers: 37 | addresses: 38 | - 8.8.8.8 39 | - 8.8.4.4 40 | ``` 41 | For the **Worker Node** (VM 2): 42 | ```yaml 43 | network: 44 | version: 2 45 | renderer: networkd 46 | ethernets: 47 | ens33: 48 | dhcp4: no 49 | addresses: 50 | - 192.168.1.101/24 51 | gateway4: 192.168.1.1 52 | nameservers: 53 | addresses: 54 | - 8.8.8.8 55 | - 8.8.4.4 56 | ``` 57 | ### **3. Apply Changes** 58 | ```bash 59 | sudo netplan apply 60 | ip a # Verify new IP 61 | ``` 62 | 63 | --- 64 | 65 | # **Step 2: Install Java on Both VMs** 66 | Hadoop requires Java. Install **OpenJDK 11**: 67 | ```bash 68 | sudo apt update && sudo apt install openjdk-11-jdk -y 69 | ``` 70 | Verify installation: 71 | ```bash 72 | java -version 73 | ``` 74 | Expected output: 75 | ``` 76 | openjdk version "11.0.20" 2024-XX-XX 77 | ``` 78 | 79 | --- 80 | 81 | # **Step 3: Create Hadoop User on Both VMs** 82 | ```bash 83 | sudo adduser hadoop 84 | sudo usermod -aG sudo hadoop 85 | su - hadoop 86 | ``` 87 | 88 | --- 89 | 90 | # **Step 4: Configure SSH Access** 91 | 1. **Install SSH on Both VMs**: 92 | ```bash 93 | sudo apt install ssh -y 94 | ``` 95 | 2. **Generate SSH Keys on Master Node**: 96 | ```bash 97 | ssh-keygen -t rsa -P "" 98 | ``` 99 | 3. **Copy SSH Key to Worker Node**: 100 | ```bash 101 | ssh-copy-id hadoop@192.168.1.101 102 | ``` 103 | 4. **Test SSH Connection from Master to Worker**: 104 | ```bash 105 | ssh hadoop@192.168.1.101 106 | ``` 107 | It should log in without asking for a password. 108 | 109 | --- 110 | 111 | # **Step 5: Download and Install Hadoop** 112 | Perform the following steps **on both VMs**. 113 | 114 | ### **1. Download Hadoop** 115 | ```bash 116 | wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz 117 | tar -xvzf hadoop-3.3.6.tar.gz 118 | sudo mv hadoop-3.3.6 /usr/local/hadoop 119 | ``` 120 | 121 | ### **2. Set Environment Variables** 122 | Edit `~/.bashrc`: 123 | ```bash 124 | nano ~/.bashrc 125 | ``` 126 | Add: 127 | ```bash 128 | # Hadoop Environment Variables 129 | export HADOOP_HOME=/usr/local/hadoop 130 | export HADOOP_INSTALL=$HADOOP_HOME 131 | export HADOOP_MAPRED_HOME=$HADOOP_HOME 132 | export HADOOP_COMMON_HOME=$HADOOP_HOME 133 | export HADOOP_HDFS_HOME=$HADOOP_HOME 134 | export HADOOP_YARN_HOME=$HADOOP_HOME 135 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 136 | export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin 137 | ``` 138 | Save and apply: 139 | ```bash 140 | source ~/.bashrc 141 | ``` 142 | 143 | --- 144 | 145 | # **Step 6: Configure Hadoop** 146 | ## **1. 
Configure `hadoop-env.sh`** 147 | Edit: 148 | ```bash 149 | nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh 150 | ``` 151 | Set Java path: 152 | ```bash 153 | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 154 | ``` 155 | 156 | --- 157 | 158 | ## **2. Configure Core Site (`core-site.xml`)** 159 | Edit: 160 | ```bash 161 | nano $HADOOP_HOME/etc/hadoop/core-site.xml 162 | ``` 163 | Replace with: 164 | ```xml 165 | 166 | 167 | fs.defaultFS 168 | hdfs://master:9000 169 | 170 | 171 | ``` 172 | 173 | --- 174 | 175 | ## **3. Configure HDFS (`hdfs-site.xml`)** 176 | Edit: 177 | ```bash 178 | nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml 179 | ``` 180 | Add: 181 | ```xml 182 | 183 | 184 | dfs.replication 185 | 2 186 | 187 | 188 | dfs.name.dir 189 | file:///usr/local/hadoop/hdfs/namenode 190 | 191 | 192 | dfs.data.dir 193 | file:///usr/local/hadoop/hdfs/datanode 194 | 195 | 196 | ``` 197 | Create necessary directories: 198 | ```bash 199 | mkdir -p /usr/local/hadoop/hdfs/namenode 200 | mkdir -p /usr/local/hadoop/hdfs/datanode 201 | sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs 202 | ``` 203 | 204 | --- 205 | 206 | ## **4. Configure MapReduce (`mapred-site.xml`)** 207 | Copy template: 208 | ```bash 209 | cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml 210 | ``` 211 | Edit: 212 | ```bash 213 | nano $HADOOP_HOME/etc/hadoop/mapred-site.xml 214 | ``` 215 | Add: 216 | ```xml 217 | 218 | 219 | mapreduce.framework.name 220 | yarn 221 | 222 | 223 | ``` 224 | 225 | --- 226 | 227 | ## **5. Configure YARN (`yarn-site.xml`)** 228 | Edit: 229 | ```bash 230 | nano $HADOOP_HOME/etc/hadoop/yarn-site.xml 231 | ``` 232 | Add: 233 | ```xml 234 | 235 | 236 | yarn.nodemanager.aux-services 237 | mapreduce_shuffle 238 | 239 | 240 | ``` 241 | 242 | --- 243 | 244 | # **Step 7: Set Up Master and Worker Nodes** 245 | ## **1. Edit Hosts File on Both VMs** 246 | ```bash 247 | sudo nano /etc/hosts 248 | ``` 249 | Add: 250 | ``` 251 | 192.168.1.100 master 252 | 192.168.1.101 worker1 253 | ``` 254 | 255 | ## **2. Define Workers on Master Node** 256 | On the **Master Node**, edit: 257 | ```bash 258 | nano $HADOOP_HOME/etc/hadoop/workers 259 | ``` 260 | Add: 261 | ``` 262 | worker1 263 | ``` 264 | 265 | --- 266 | 267 | # **Step 8: Start Hadoop Cluster** 268 | ## **1. Format Namenode (Master Only)** 269 | ```bash 270 | hdfs namenode -format 271 | ``` 272 | 273 | ## **2. Start Hadoop Services (Master Only)** 274 | ```bash 275 | start-dfs.sh 276 | start-yarn.sh 277 | ``` 278 | Check running services: 279 | ```bash 280 | jps 281 | ``` 282 | Expected output: 283 | ``` 284 | NameNode 285 | DataNode 286 | ResourceManager 287 | NodeManager 288 | ``` 289 | 290 | --- 291 | 292 | # **Step 9: Verify Hadoop Cluster** 293 | ## **Check Web UI** 294 | 1. **HDFS Web UI**: 295 | 📌 **http://master:9870/** 296 | 2. **YARN Resource Manager**: 297 | 📌 **http://master:8088/** 298 | 299 | --- 300 | 301 | # **Step 10: Stop Hadoop** 302 | To stop services: 303 | ```bash 304 | stop-dfs.sh 305 | stop-yarn.sh 306 | ``` 307 | 308 | --- 309 | 310 | # **Conclusion** 311 | You have successfully set up a **Hadoop multi-node cluster** on **two Ubuntu 24.04 VMs** inside **VMware Workstation**! 
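As a final sanity check, you can confirm from the master node that the worker actually registered with both HDFS and YARN; a quick sketch (assuming the `master`/`worker1` hostnames configured above):

```bash
hdfs dfsadmin -report  # lists the DataNodes currently registered with the NameNode
yarn node -list        # lists the NodeManagers currently registered with the ResourceManager
```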
312 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![image](https://github.com/user-attachments/assets/2b0a8b29-8287-446a-8a0c-8c1820ea0971) ![image](https://github.com/user-attachments/assets/343cfd7e-73b7-4eb2-a9a4-76c31f5703c8).![image](https://github.com/user-attachments/assets/04ad8a37-c3a0-4e62-a5c4-70c023992209)![image](https://github.com/user-attachments/assets/5a5fc24a-bc9d-4cc2-aab4-b651c59197d5)![image](https://github.com/user-attachments/assets/10b26b1e-614f-4ad7-966c-505e54825680) 2 | 3 | 4 | 5 | # Docker Multi-Container Environment with Hadoop, Spark, and Hive 6 | 7 | This guide helps you set up a multi-container environment using Docker for Hadoop (HDFS), Spark, and Hive. The setup is lightweight, without the large memory requirements of a Cloudera sandbox. 8 | 9 | ## **Prerequisites** 10 | 11 | Before you begin, ensure you have the following installed: 12 | 13 | - **Docker**: [Install Docker Desktop for Windows](https://docs.docker.com/desktop/setup/install/windows-install/) 14 | 15 | - IMPORTANT: 16 | ******- Enable the "Expose daemon on tcp://localhost:2375 without TLS" option if you're using Docker Desktop for compatibility.****** 17 | 18 | ![image](https://github.com/user-attachments/assets/398451cd-46bb-4ba8-876f-9e85f8c0d632) 19 | 20 | 21 | - **Git**: [Download Git](https://git-scm.com/downloads/win) 22 | - Git is used to download the required files from a repository. 23 | 24 | Create a newfolder and open it in terminal or go inside it using CD Command 25 | 26 | ![image](https://github.com/user-attachments/assets/28602a4b-52e2-4265-bfb5-a08301fda7b8) 27 | 28 | 29 | ## **Step 1: Clone the Repository** 30 | 31 | First, clone the GitHub repository that contains the necessary Docker setup files. 32 | 33 | ```bash 34 | git clone https://github.com/lovnishverma/bigdataecosystem.git 35 | ``` 36 | 37 | [or Directly download zip from my repo](https://github.com/lovnishverma/BigDataecosystem) 38 | 39 | Navigate to the directory: 40 | 41 | ```bash 42 | cd bigdataecosystem 43 | ``` 44 | 45 | ![image](https://github.com/user-attachments/assets/e4d6a8ab-3f36-424a-bf13-9402bc1c13a2) 46 | 47 | if downloaded zip than cd bigdataecosystem-main 48 | 49 | ## **Step 2: Start the Cluster** 50 | 51 | Use Docker Compose to start the containers in the background. 52 | 53 | ```bash 54 | docker-compose up -d 55 | ``` 56 | 57 | This command will launch the Hadoop, Spark, and Hive containers. 
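If a container fails to start, its logs are the first place to look; a quick check (the service name `namenode` below matches the container name used later in this guide and may differ if you edited `docker-compose.yml`):

```bash
docker-compose logs -f namenode   # follow the NameNode's startup logs; press Ctrl+C to stop following
```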
58 | 59 | ![image](https://github.com/user-attachments/assets/8dc3ec44-84af-40f2-8056-92e5f3449919) 60 | 61 | 62 | ## **Step 3: Verify Running Containers** 63 | 64 | To check if the containers are running, use the following command: 65 | 66 | ```bash 67 | docker ps 68 | ``` 69 | ![image](https://github.com/user-attachments/assets/f6897172-d14f-462a-95dd-ba46401b5dd7) 70 | 71 | 72 | ## **Step 4: Stop and Remove Containers** 73 | 74 | When you are done, stop and remove the containers with: 75 | 76 | ```bash 77 | docker-compose down 78 | ``` 79 | ![image](https://github.com/user-attachments/assets/fd1f2298-7d65-4055-a929-12de4d01c428) 80 | 81 | 82 | ### Step 5: Access the NameNode container 83 | Enter the NameNode container to interact with Hadoop: 84 | ```bash 85 | docker exec -it namenode bash 86 | ``` 87 | ** -it refers to (interactive terminal)** 88 | --- 89 | 90 | ## **Running Hadoop Code** 91 | 92 | To View NameNode UI Visit: [http://localhost:9870/](http://localhost:9870/) 93 | 94 | ![image](https://github.com/user-attachments/assets/c4f708cb-7976-49f8-ba79-8b985bcd6a10) 95 | 96 | 97 | To View Resource Manager UI Visit [http://localhost:8088/](http://localhost:8088/) 98 | 99 | ![image](https://github.com/user-attachments/assets/a65f2495-293e-440c-8366-9e1bed605b29) 100 | 101 | 102 | ### ** MAPREDUCE WordCount program** 103 | ### Step 1: Copy the `code` folder into the container 104 | Use the following command in your windows cmd to copy the `code` folder to the container: 105 | ```bash 106 | docker cp code namenode:/ 107 | ``` 108 | 109 | ![image](https://github.com/user-attachments/assets/7acdebdc-2b20-41bf-b92d-8555091d570c) 110 | 111 | 112 | ### Step 2: Locate the `data.txt` file 113 | Inside the container, navigate to the `code/input` directory where the `data.txt` file is located. 
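For example, a quick look from inside the namenode container (assuming the `code` folder was copied to `/code` in Step 1):

```bash
cd /code/input   # directory copied into the container in Step 1
ls -l            # should show data.txt
cat data.txt     # preview the input before uploading it to HDFS
```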
114 | 115 | ### Step 3: Create directories in the Hadoop file system 116 | Run the following commands to set up directories in Hadoop's file system: 117 | ```bash 118 | hdfs dfs -mkdir /user 119 | hdfs dfs -mkdir /user/root 120 | hdfs dfs -mkdir /user/root/input 121 | ``` 122 | 123 | ### Step 4: Upload the `data.txt` file 124 | Copy `data.txt` into the Hadoop file system: 125 | ```bash 126 | hdfs dfs -put /code/input/data.txt /user/root/input 127 | ``` 128 | ![image](https://github.com/user-attachments/assets/31fadc17-1c8c-4621-bdee-39d818f3da2c) 129 | 130 | 131 | ### Step 5: Navigate to the directory containing the `wordCount.jar` file 132 | Return to the directory where the `wordCount.jar` file is located: 133 | ```bash 134 | cd /code/ 135 | ``` 136 | ![image](https://github.com/user-attachments/assets/4242e3b2-c954-4faf-ab75-825906eeafc5) 137 | 138 | 139 | ### Step 6: Execute the WordCount program 140 | 141 | To View NameNode UI Visit: [http://localhost:9870/](http://localhost:9870/) 142 | 143 | ![image](https://github.com/user-attachments/assets/20681490-0fcc-41dd-874a-8fe0376dc981) 144 | 145 | 146 | Run the WordCount program to process the input data: 147 | ```bash 148 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output 149 | ``` 150 | ![image](https://github.com/user-attachments/assets/2bafcdd5-be22-471c-bf9a-6b8a48d88d44) 151 | 152 | 153 | To View YARN Resource Manager UI Visit [http://localhost:8088/](http://localhost:8088/) 154 | 155 | ![image](https://github.com/user-attachments/assets/89f47e9f-c92f-456c-b89e-0e6025df80e2) 156 | 157 | ### Step 7: Display the output 158 | View the results of the WordCount program: 159 | ```bash 160 | hdfs dfs -cat /user/root/output/* 161 | ``` 162 | ![image](https://github.com/user-attachments/assets/8a20f77f-71bd-423b-a501-c9514ec9f825) 163 | 164 | --- 165 | 166 | **or** 167 | 168 | ```bash 169 | hdfs dfs -cat /user/root/output/part-r-00000 170 | ``` 171 | 172 | ![image](https://github.com/user-attachments/assets/a4ef5293-1018-4c5e-a314-91681d430715) 173 | 174 | 175 | ## **Summary** 176 | 177 | This guide simplifies setting up and running Hadoop on Docker. Each step ensures a smooth experience, even for beginners without a technical background. Follow the instructions carefully, and you’ll have a working Hadoop setup in no time! 178 | 179 | Certainly! Here’s the explanation of your **MapReduce process** using the input example `DOG CAT RAT`, `CAR CAR RAT`, and `DOG CAR CAT`. 180 | --- 181 | 182 | ## 🐾 **Input Data** 183 | 184 | The `data.txt` file contains the following lines: 185 | 186 | ``` 187 | DOG CAT RAT 188 | CAR CAR RAT 189 | DOG CAR CAT 190 | ``` 191 | 192 | This text file is processed by the **MapReduce WordCount program** to count the occurrences of each word. 193 | 194 | --- 195 | 196 | ## 💡 **What is MapReduce?** 197 | 198 | - **MapReduce** is a two-step process: 199 | 1. **Map Phase** 🗺️: Splits the input into key-value pairs. 200 | 2. **Reduce Phase** ➕: Combines the key-value pairs to produce the final result. 201 | 202 | It's like dividing a big task (word counting) into smaller tasks and then combining the results. 🧩 203 | 204 | --- 205 | 206 | ## 🔄 **How MapReduce Works in Your Example** 207 | 208 | ### **1. Map Phase** 🗺️ 209 | 210 | The mapper processes each line of the input file, splits it into words, and assigns each word a count of `1`. 
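Before walking through the per-line example below, note that the same map → shuffle → reduce idea can be sketched with ordinary Unix tools; this is purely a local illustration of the concept, not something Hadoop runs:

```bash
# "map": emit one word per line; "shuffle/sort": group identical words; "reduce": count each group
tr -s ' ' '\n' < data.txt | sort | uniq -c
```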
211 | 212 | For example: 213 | ``` 214 | DOG CAT RAT -> (DOG, 1), (CAT, 1), (RAT, 1) 215 | CAR CAR RAT -> (CAR, 1), (CAR, 1), (RAT, 1) 216 | DOG CAR CAT -> (DOG, 1), (CAR, 1), (CAT, 1) 217 | ``` 218 | 219 | **Mapper Output**: 220 | ``` 221 | (DOG, 1), (CAT, 1), (RAT, 1) 222 | (CAR, 1), (CAR, 1), (RAT, 1) 223 | (DOG, 1), (CAR, 1), (CAT, 1) 224 | ``` 225 | 226 | --- 227 | 228 | ### **2. Shuffle and Sort Phase** 🔄 229 | 230 | This step groups all values for the same key (word) together and sorts them. 231 | 232 | For example: 233 | ``` 234 | (CAR, [1, 1, 1]) 235 | (CAT, [1, 1]) 236 | (DOG, [1, 1]) 237 | (RAT, [1, 1]) 238 | ``` 239 | 240 | --- 241 | 242 | ### **3. Reduce Phase** ➕ 243 | 244 | The reducer sums up the counts for each word to get the total number of occurrences. 245 | 246 | **Reducer Output**: 247 | ``` 248 | CAR 3 🏎️ 249 | CAT 2 🐱 250 | DOG 2 🐶 251 | RAT 2 🐭 252 | ``` 253 | 254 | --- 255 | 256 | ### **Final Output** 📋 257 | 258 | The final word count is saved in the HDFS output directory. You can view it using: 259 | ```bash 260 | hdfs dfs -cat /user/root/output/* 261 | ``` 262 | 263 | **Result**: 264 | ``` 265 | CAR 3 266 | CAT 2 267 | DOG 2 268 | RAT 2 269 | ``` 270 | 271 | --- 272 | 273 | ## 🗂️ **HDFS Commands You Used** 274 | 275 | Here are the basic HDFS commands you used and their purpose: 276 | 277 | 1. **Upload a file to HDFS** 📤: 278 | ```bash 279 | hdfs dfs -put data.txt /user/root/input 280 | ``` 281 | - **What it does**: Uploads `data.txt` to the HDFS directory `/user/root/input`. 282 | - **Output**: No output, but the file is now in HDFS. 283 | 284 | 2. **List files in a directory** 📁: 285 | ```bash 286 | hdfs dfs -ls /user/root/input 287 | ``` 288 | - **What it does**: Lists all files in the `/user/root/input` directory. 289 | - **Output**: Something like this: 290 | ``` 291 | Found 1 items 292 | -rw-r--r-- 1 root supergroup 50 2024-12-12 /user/root/input/data.txt 293 | ``` 294 | 295 | 3. **View the contents of a file** 📄: 296 | ```bash 297 | hdfs dfs -cat /user/root/input/data.txt 298 | ``` 299 | - **What it does**: Displays the contents of the `data.txt` file in HDFS. 300 | - **Output**: 301 | ``` 302 | DOG CAT RAT 303 | CAR CAR RAT 304 | DOG CAR CAT 305 | ``` 306 | 307 | 4. **Run the MapReduce Job** 🚀: 308 | ```bash 309 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output 310 | ``` 311 | - **What it does**: Runs the WordCount program on the input directory and saves the result in the output directory. 312 | 313 | 5. **View the final output** 📊: 314 | ```bash 315 | hdfs dfs -cat /user/root/output/* 316 | ``` 317 | - **What it does**: Displays the word count results. 318 | - **Output**: 319 | ``` 320 | CAR 3 321 | CAT 2 322 | DOG 2 323 | RAT 2 324 | ``` 325 | 326 | --- 327 | 328 | ## 🛠️ **How You Utilized MapReduce** 329 | 330 | 1. **Input**: 331 | You uploaded a small text file (`data.txt`) to HDFS. 332 | 333 | 2. **Process**: 334 | The `WordCount` program processed the file using MapReduce: 335 | - The **mapper** broke the file into words and counted each occurrence. 336 | - The **reducer** aggregated the counts for each word. 337 | 338 | 3. **Output**: 339 | The results were saved in HDFS and displayed using the `cat` command. 
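If you also want a copy of the results outside HDFS (for example, to open them on your own machine), a minimal sketch (the local paths below are arbitrary):

```bash
# Inside the namenode container: copy the output directory out of HDFS
hdfs dfs -get /user/root/output /tmp/wordcount-output

# From your host terminal: copy it out of the container
docker cp namenode:/tmp/wordcount-output ./wordcount-output
```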
340 | 341 | --- 342 | 343 | ## 🧩 **Visualization of the Entire Process** 344 | 345 | ### **Input** (HDFS file): 346 | ``` 347 | DOG CAT RAT 348 | CAR CAR RAT 349 | DOG CAR CAT 350 | ``` 351 | 352 | ### **Map Phase Output** 🗺️: 353 | ``` 354 | (DOG, 1), (CAT, 1), (RAT, 1) 355 | (CAR, 1), (CAR, 1), (RAT, 1) 356 | (DOG, 1), (CAR, 1), (CAT, 1) 357 | ``` 358 | 359 | ### **Shuffle & Sort** 🔄: 360 | ``` 361 | (CAR, [1, 1, 1]) 362 | (CAT, [1, 1]) 363 | (DOG, [1, 1]) 364 | (RAT, [1, 1]) 365 | ``` 366 | 367 | ### **Reduce Phase Output** ➕: 368 | ``` 369 | CAR 3 370 | CAT 2 371 | DOG 2 372 | RAT 2 373 | ``` 374 | 375 | --- 376 | 377 | ![image](https://github.com/user-attachments/assets/a037fc47-7639-48b8-b3f7-5d9f2d5c51ac) 378 | 379 | ### 🔑 **Key Takeaways** 380 | - **MapReduce** splits the task into small, manageable pieces and processes them in parallel. 381 | - It’s ideal for large datasets but works the same for smaller ones (like your example). 382 | - Hadoop is designed for distributed systems, making it powerful for big data processing. 383 | 384 | 385 | 386 | 387 | 388 | 389 | ### . **Stopping the Containers** 390 | To stop the Docker containers when done: 391 | ```bash 392 | docker-compose down 393 | ``` 394 | This will stop and remove the containers and networks created by `docker-compose up`. 395 | 396 | ### 4. **Permissions Issue with Copying Files** 397 | If you face permission issues while copying files to containers ensure the correct directory permissions in Docker by using: 398 | ```bash 399 | docker exec -it namenode bash 400 | chmod -R 777 /your-directory 401 | ``` 402 | 403 | ### 5. **Additional Debugging Tips** 404 | Sometimes, containers might not start or might throw errors related to Hadoop configuration. A small troubleshooting section or references to common issues (e.g., insufficient memory for Hadoop) would be helpful. 405 | 406 | ### 6. **Final Output File Path** 407 | The output of the WordCount job will be written to `/user/root/output/` in HDFS. This is clearly explained, but you could also include a note that the output directory might need to be created beforehand to avoid errors. 408 | 409 | --- 410 | 411 | ### **Example Additions:** 412 | 413 | 1. **Network Issues:** 414 | ``` 415 | If you can't access the NameNode UI, ensure that your Docker container's ports are correctly exposed. For example, if you're running a local machine, the UI should be accessible via http://localhost:9870. 416 | ``` 417 | 418 | 2. **Stopping Containers:** 419 | ```bash 420 | docker-compose down # Stop and remove the containers 421 | ``` 422 | 423 | 3. **Permissions Fix:** 424 | ```bash 425 | docker exec -it namenode bash 426 | chmod -R 777 /your-directory # If you face any permission errors 427 | ``` 428 | 429 | 4. **Handling HDFS Directory Creation:** 430 | If `hdfs dfs -mkdir` gives an error, it may be because the directory already exists. 
Consider adding: 431 | ```bash 432 | hdfs dfs -rm -r /user/root/input # If the directory exists, remove it first 433 | hdfs dfs -mkdir /user/root/input 434 | ``` 435 | 436 | --- 437 | 438 | 😊 References 439 | 440 | https://data-flair.training/blogs/top-hadoop-hdfs-commands-tutorial/ 441 | 442 | https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html 443 | 444 | https://medium.com/@traininghub.io/hadoop-mapreduce-architecture-7e167e264595 445 | 446 | 447 | ## **Step 5: Set Up HDFS** 448 | 449 | ### **Upload Files to HDFS** 450 | 451 | To copy a file (e.g., `police.csv`) to the Hadoop cluster: 452 | 453 | 1. Copy the file into the namenode container: 454 | ```bash 455 | docker cp police.csv namenode:/police.csv 456 | ``` 457 | ![image](https://github.com/user-attachments/assets/496c7e6a-41d6-44d2-9557-b6004fe986c4) 458 | 459 | 460 | 2. Access the namenode container's bash shell: 461 | ```bash 462 | docker exec -it namenode bash 463 | ``` 464 | ![image](https://github.com/user-attachments/assets/d501a9b3-d2d9-4e2d-aecb-8e3eb7ccf678) 465 | 466 | 467 | 3. Create a directory in HDFS and upload the file: 468 | ```bash 469 | hdfs dfs -mkdir -p /data/crimerecord/police 470 | hdfs dfs -put /police.csv /data/crimerecord/police/ 471 | ``` 472 | ![image](https://github.com/user-attachments/assets/ab68bba9-92f2-4b15-a50e-f3ee1a0f998e) 473 | 474 | 475 | 476 | ![image](https://github.com/user-attachments/assets/6b27db66-a111-4c2f-a701-2cef8aaa3344) 477 | 478 | 479 | ### **Start Spark Shell** 480 | 481 | To interact with Spark, start the Spark shell in the master container: 482 | 483 | ```bash 484 | docker exec -it spark-master bash 485 | 486 | spark/bin/spark-shell --master spark://spark-master:7077 487 | ``` 488 | ### **Access the Spark Master UI** 489 | 490 | - Open `http://localhost:8080` in your web browser to view the Spark Master UI. 491 | - **You can monitor processes here** 492 | 493 | - ![image](https://github.com/user-attachments/assets/8fa7e525-d601-4dad-b5b4-0477d47ec4dd) 494 | 495 | 496 | ![image](https://github.com/user-attachments/assets/45765d5e-b1e7-4726-a60c-ddd5dd278c93) 497 | 498 | ![image](https://github.com/user-attachments/assets/b071335b-4928-491a-8bed-321995881d83) 499 | 500 | # **Working with Apache Spark** 501 | 502 | ## **1. Introduction to Apache Spark** 503 | 504 | - **Overview**: Apache Spark is an open-source distributed computing system known for its speed, ease of use, and general-purpose capabilities for big data processing. 505 | 506 | - **Key Features**: 507 | - Fast processing using in-memory computation. 508 | - Supports multiple languages: Scala, Python, Java, and R. 509 | - Unified framework for batch and streaming data processing. 510 | 511 | --- 512 | 513 | ## **2. Introduction to DataFrames** 514 | 515 | - **What are DataFrames?** 516 | - Distributed collections of data organized into named columns, similar to a table in a database or a DataFrame in Python's pandas. 517 | - Optimized for processing large datasets using Spark SQL. 518 | 519 | - **Key Operations**: 520 | - Creating DataFrames from structured data sources (CSV, JSON, Parquet, etc.). 521 | - Performing transformations and actions on the data. 522 | 523 | --- 524 | 525 | ## **3. Introduction to Scala for Apache Spark** 526 | 527 | - **Why Scala?** 528 | - Apache Spark is written in Scala, offering the best compatibility and performance. 529 | - Concise syntax and functional programming support. 
530 | 531 | - **Basic Syntax**: 532 | 533 | ```scala 534 | val numbers = List(1, 2, 3, 4, 5) // Creates a list of numbers. 535 | val doubled = numbers.map(_ * 2) // Doubles each element in the list using map. 536 | println(doubled) // Prints the doubled list. 537 | ``` 538 | The output will be: 539 | List(2, 4, 6, 8, 10) 540 | 541 | --- 542 | 543 | ## **4. Spark SQL** 544 | 545 | - **Need for Spark SQL**: 546 | - Provides a declarative interface to query structured data using SQL-like syntax. 547 | - Supports seamless integration with other Spark modules. 548 | - Allows for optimization through Catalyst Optimizer. 549 | 550 | - **Key Components**: 551 | - SQL Queries on DataFrames and temporary views. 552 | - Hive integration for legacy SQL workflows. 553 | - Support for structured data sources. 554 | 555 | --- 556 | ## **5. Hands-On: Spark SQL** 557 | 558 | ### **Objective**: 559 | To create DataFrames, load data from different sources, and perform transformations and SQL queries. 560 | 561 | 562 | #### **Step 1: Create DataFrames** 563 | 564 | ```scala 565 | val data = Seq( 566 | ("Alice", 30, "HR"), 567 | ("Bob", 25, "Engineering"), 568 | ("Charlie", 35, "Finance") 569 | ) 570 | 571 | val df = data.toDF("Name", "Age", "Department") 572 | 573 | df.show() 574 | ``` 575 | ![image](https://github.com/user-attachments/assets/06c2c14f-cf8e-4b38-8944-7844e75ee5d6) 576 | 577 | 578 | #### **Step 3: Perform Transformations Using Spark SQL** 579 | 580 | ```scala 581 | df.createOrReplaceTempView("employees") 582 | val result = spark.sql("SELECT Department, COUNT(*) as count FROM employees GROUP BY Department") 583 | result.show() 584 | ``` 585 | ![image](https://github.com/user-attachments/assets/c9125138-63dd-4c29-82c4-6d04bc531508) 586 | 587 | 588 | #### **Step 4: Save Transformed Data** 589 | 590 | ```scala 591 | result.write.option("header", "true").csv("hdfs://namenode:9000/output_employees") 592 | ``` 593 | 594 | Reading from HDFS: 595 | Once the data is written to HDFS, you can read it back into Spark using: 596 | 597 | ```scala 598 | val outputDF = spark.read.option("header", "true").csv("hdfs://namenode:9000/output_employees") 599 | ``` 600 | 601 | View output_employees.csv from HDFS 602 | 603 | ```scala 604 | outputDF.show() 605 | ``` 606 | ![image](https://github.com/user-attachments/assets/a4bb7af6-2ee6-485f-a306-371165e5bf37) 607 | 608 | 609 | #### **Step 5: Load Data from HDFS** 610 | 611 | ```scala 612 | // Load CSV from HDFS 613 | val df = spark.read.option("header", "false").csv("hdfs://namenode:9000/data/crimerecord/police/police.csv") 614 | df.show() 615 | ``` 616 | 617 | ![image](https://github.com/user-attachments/assets/f6dfde78-f44a-4554-9c0f-f11cb9173e6c) 618 | 619 | 620 | #### **Step 6: Scala WordCount using Apache Spark** 621 | 622 | 623 | ### Docker Command to Copy File 624 | *Copy File**: Use `docker cp` to move or create the file inside the namenode Docker container. 625 | Use the following command to copy the `data.txt` file from your local system to the Docker container: 626 | 627 | ```bash 628 | docker cp data.txt nodemanager:/data.txt 629 | ``` 630 | ![image](https://github.com/user-attachments/assets/73a84d9a-af1c-45f0-9504-a24b192e598d) 631 | 632 | *Copy File to HDFS**: Use `hdfs dfs -put` to move the file inside the HDFS filesystem. 
633 | Use the following command to put the `data.txt` file from your Docker container to HDFS: 634 | 635 | ```bash 636 | hdfs dfs -mkdir /data 637 | hdfs dfs -put data.txt /data 638 | ``` 639 | ![image](https://github.com/user-attachments/assets/b4d93f36-f1b1-4056-a4af-d4dbb418634e) 640 | 641 | **Scala WordCount program.** 642 | 643 | **WordCount Program**: The program reads the file, splits it into words, and counts the occurrences of each word. 644 | 645 | ```scala 646 | import org.apache.spark.{SparkConf} 647 | val conf = new SparkConf().setAppName("WordCountExample").setMaster("local") 648 | val input = sc.textFile("hdfs://namenode:9000/data/data.txt") 649 | val wordPairs = input.flatMap(line => line.split(" ")).map(word => (word, 1)) 650 | val wordCounts = wordPairs.reduceByKey((a, b) => a + b) 651 | wordCounts.collect().foreach { case (word, count) => 652 | println(s"$word: $count") 653 | } 654 | ``` 655 | 656 | **Output**: The word counts will be printed to the console when the program is executed. 657 | 658 | ![image](https://github.com/user-attachments/assets/428e0d99-f0e0-4edd-8f3c-4543130c8a47) 659 | 660 | 661 | **Stop Session**: 662 | 663 | ```scala 664 | sc.stop() 665 | ``` 666 | 667 | --- 668 | 669 | ## **6. Key Takeaways** 670 | 671 | - Spark SQL simplifies working with structured data. 672 | - DataFrames provide a flexible and powerful API for handling large datasets. 673 | - Apache Spark is a versatile tool for distributed data processing, offering scalability and performance. 674 | 675 | --- 676 | 677 | 678 | ![image](https://github.com/user-attachments/assets/fada1eec-5349-4382-8d1a-96940c124064) 679 | 680 | ## **Step 7: Set Up Hive** 681 | 682 | ### **Start Hive Server** 683 | 684 | Access the Hive container and start the Hive Server: 685 | 686 | ```bash 687 | docker exec -it hive-server bash 688 | ``` 689 | 690 | ```bash 691 | hive 692 | ``` 693 | 694 | Check if Hive is listening on port 10000: 695 | ![image](https://github.com/user-attachments/assets/dc1e78d4-d903-4ac5-9eaa-eff0b893d6fb) 696 | 697 | 698 | ```bash 699 | netstat -anp | grep 10000 700 | ``` 701 | ![image](https://github.com/user-attachments/assets/9ac08fd3-f515-448d-83b3-c620fa3b15c2) 702 | 703 | 704 | ### **Connect to Hive Server** 705 | 706 | Use Beeline to connect to the Hive server: 707 | 708 | ```bash 709 | beeline -u jdbc:hive2://localhost:10000 -n root 710 | ``` 711 | ![image](https://github.com/user-attachments/assets/d2dce309-0334-4a64-b8df-8cb6206b1432) 712 | 713 | 714 | Alternatively, use the following command for direct connection: 715 | 716 | ```bash 717 | beeline 718 | ``` 719 | 720 | ```bash 721 | !connect jdbc:hive2://127.0.0.1:10000 scott tiger 722 | ``` 723 | 724 | ![image](https://github.com/user-attachments/assets/77fadb1f-118e-4d15-8a78-e9783baa9690) 725 | 726 | 727 | ### **Create Database and Table in Hive** 728 | 729 | 1. Create a new Hive database: 730 | ```sql 731 | CREATE DATABASE punjab_police; 732 | USE punjab_police; 733 | ``` 734 | ![image](https://github.com/user-attachments/assets/73227817-b2d5-4df0-a392-6927750d7220) 735 | 736 | 737 | 2. 
Create a table based on the schema of the `police.csv` dataset: 738 | ```sql 739 | CREATE TABLE police_data ( 740 | Crime_ID INT, 741 | Crime_Type STRING, 742 | Location STRING, 743 | Reported_Date STRING, 744 | Status STRING 745 | ) 746 | ROW FORMAT DELIMITED 747 | FIELDS TERMINATED BY ',' 748 | STORED AS TEXTFILE; 749 | ``` 750 | ![image](https://github.com/user-attachments/assets/13faa21a-5242-4f1e-bd69-4d98dc318400) 751 | 752 | 753 | 3. Load the data into the Hive table: 754 | ```sql 755 | LOAD DATA INPATH '/data/crimerecord/police/police.csv' INTO TABLE police_data; 756 | ``` 757 | ![image](https://github.com/user-attachments/assets/e0fcbe55-d5fd-4a8c-a17b-df888204915f) 758 | 759 | 760 | ### **Query the Data in Hive** 761 | 762 | Run SQL queries to analyze the data in Hive: 763 | 764 | 1. **View the top 10 rows:** 765 | ```sql 766 | SELECT * FROM police_data LIMIT 10; 767 | ``` 768 | ![image](https://github.com/user-attachments/assets/6f189765-24f4-47db-ad70-42fbcfb4068e) 769 | 770 | 771 | 2. **Count total crimes:** 772 | ```sql 773 | SELECT COUNT(*) AS Total_Crimes FROM police_data; 774 | ``` 775 | ![image](https://github.com/user-attachments/assets/8b56a8b5-6b0b-4306-82da-4cce52b50e95) 776 | 777 | 778 | 3. **Find most common crime types:** 779 | ```sql 780 | SELECT Crime_Type, COUNT(*) AS Occurrences 781 | FROM police_data 782 | GROUP BY Crime_Type 783 | ORDER BY Occurrences DESC; 784 | ``` 785 | 786 | ![image](https://github.com/user-attachments/assets/54f000f7-36ec-4672-8bc6-996ac7b4004b) 787 | 788 | 789 | 4. **Identify locations with the highest crime rates:** 790 | ```sql 791 | SELECT Location, COUNT(*) AS Total_Crimes 792 | FROM police_data 793 | GROUP BY Location 794 | ORDER BY Total_Crimes DESC; 795 | ``` 796 | ![image](https://github.com/user-attachments/assets/fb418097-97ff-46aa-941a-4b72a0702d3d) 797 | 798 | 799 | 5. **Find unresolved cases:** 800 | ```sql 801 | SELECT Status, COUNT(*) AS Count 802 | FROM police_data 803 | WHERE Status != 'Closed' 804 | GROUP BY Status; 805 | ``` 806 | ![image](https://github.com/user-attachments/assets/9b3b32df-38c9-45bd-85dc-c4ac2b16b246) 807 | 808 | 809 | **********There you go: your private Hive server to play with.********** 810 | 811 | show databases; 812 | 813 | ![image](https://github.com/user-attachments/assets/7e8e65b1-cb98-41e2-b655-ddf941b614d5) 814 | 815 | #### **📂 Part 2: Creating a Simple Hive Project** 816 | 817 | --- 818 | 819 | ##### **🎯 Objective** 820 | We will: 821 | 1. Create a database. 822 | 2. Create a table inside the database. 823 | 3. Load data into the table. 824 | 4. Run queries to retrieve data. 825 | 826 | --- 827 | 828 | ##### **💾 Step 1: Create a Database** 829 | In the Beeline CLI: 830 | ```sql 831 | CREATE DATABASE mydb; 832 | USE mydb; 833 | ``` 834 | - 📝 *`mydb` is the name of the database. Replace it with your preferred name.* 835 | 836 | --- 837 | 838 | ##### **📋 Step 2: Create a Table** 839 | Still in the Beeline CLI, create a simple table: 840 | ```sql 841 | CREATE TABLE employees ( 842 | id INT, 843 | name STRING, 844 | age INT 845 | ); 846 | ``` 847 | - This creates a table named `employees` with columns `id`, `name`, and `age`. 
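To double-check the table definition before inserting data, you can also run the statements non-interactively through Beeline; a small sketch (reusing the connection string from earlier in this guide):

```bash
beeline -u jdbc:hive2://localhost:10000 -n root \
  -e "USE mydb; SHOW TABLES; DESCRIBE employees;"
```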
848 | 849 | --- 850 | 851 | ##### **📥 Step 3: Insert Data into the Table** 852 | Insert sample data into your table: 853 | ```sql 854 | INSERT INTO employees VALUES (1, 'Prince', 30); 855 | INSERT INTO employees VALUES (2, 'Ram Singh', 25); 856 | ``` 857 | 858 | --- 859 | 860 | ##### **🔍 Step 4: Query the Table** 861 | Retrieve data from your table: 862 | ```sql 863 | SELECT * FROM employees; 864 | ``` 865 | - Output: 866 | 867 | ![image](https://github.com/user-attachments/assets/63529cb9-c74d-453e-a4d7-9f176762a8bc) 868 | 869 | 870 | ``` 871 | +----+----------+-----+ 872 | | id | name | age | 873 | +----+----------+-----+ 874 | | 2 | Ram Singh | 25 | 875 | | 1 | Prince | 30 | 876 | +----+----------+-----+ 877 | ``` 878 | 879 | --- 880 | 881 | #### **🌟 Tips & Knowledge** 882 | 883 | 1. **What is Hive?** 884 | - Hive is a data warehouse tool on top of Hadoop. 885 | - It allows SQL-like querying over large datasets. 886 | 887 | 2. **Why Docker for Hive?** 888 | - Simplifies setup by avoiding manual configurations. 889 | - Provides a pre-configured environment for running Hive. 890 | 891 | 3. **Beeline CLI**: 892 | - A lightweight command-line tool for running Hive queries. 893 | 894 | 4. **Use Cases**: 895 | - **Data Analysis**: Run analytics on large datasets. 896 | - **ETL**: Extract, Transform, and Load data into your Hadoop ecosystem. 897 | 898 | --- 899 | 900 | #### **🎉 You're Ready!** 901 | You’ve successfully: 902 | 1. Set up Apache Hive. 903 | 2. Created and queried a sample project. 🐝 904 | 905 | ### **🐝 Apache Hive Basic Commands** 906 | 907 | Here is a collection of basic Apache Hive commands with explanations that can help you while working with Hive: 908 | 909 | --- 910 | 911 | #### **1. Database Commands** 912 | 913 | - **Show Databases:** 914 | Displays all the databases available in your Hive environment. 915 | ```sql 916 | SHOW DATABASES; 917 | ``` 918 | 919 | - **Create a Database:** 920 | Create a new database. 921 | ```sql 922 | CREATE DATABASE ; 923 | ``` 924 | Example: 925 | ```sql 926 | CREATE DATABASE mydb; 927 | ``` 928 | In Hive, you can find out which database you are currently using by running the following command: 929 | 930 | ```sql 931 | SELECT current_database(); 932 | ``` 933 | 934 | This will return the name of the database that is currently in use. 935 | 936 | Alternatively, you can use this command: 937 | 938 | ```sql 939 | USE database_name; 940 | ``` 941 | 942 | If you want to explicitly switch to a specific database or verify the database context, you can use this command before running your queries. 943 | 944 | - **Use a Database:** 945 | Switch to the specified database. 946 | ```sql 947 | USE ; 948 | ``` 949 | Example: 950 | ```sql 951 | USE mydb; 952 | ``` 953 | 954 | 955 | - **Drop a Database:** 956 | Deletes a database and its associated data. 957 | ```sql 958 | DROP DATABASE ; 959 | ``` 960 | 961 | --- 962 | 963 | #### **2. Table Commands** 964 | 965 | - **Show Tables:** 966 | List all the tables in the current database. 967 | ```sql 968 | SHOW TABLES; 969 | ``` 970 | 971 | - **Create a Table:** 972 | Define a new table with specific columns. 973 | ```sql 974 | CREATE TABLE ( 975 | column_name column_type, 976 | ... 977 | ); 978 | ``` 979 | Example: 980 | ```sql 981 | CREATE TABLE employees ( 982 | id INT, 983 | name STRING, 984 | age INT 985 | ); 986 | ``` 987 | 988 | - **Describe a Table:** 989 | Get detailed information about a table, including column names and types. 
990 | ```sql 991 | DESCRIBE ; 992 | ``` 993 | 994 | - **Drop a Table:** 995 | Deletes a table and its associated data. 996 | ```sql 997 | DROP TABLE ; 998 | ``` 999 | 1000 | - **Alter a Table:** 1001 | Modify a table structure, like adding new columns. 1002 | ```sql 1003 | ALTER TABLE ADD COLUMNS ( ); 1004 | ``` 1005 | Example: 1006 | ```sql 1007 | ALTER TABLE employees ADD COLUMNS (salary DOUBLE); 1008 | ``` 1009 | 1010 | --- 1011 | 1012 | #### **3. Data Manipulation Commands** 1013 | 1014 | - **Insert Data:** 1015 | Insert data into a table. 1016 | ```sql 1017 | INSERT INTO VALUES (, , ...); 1018 | INSERT INTO employees VALUES (1, 'Prince', 30), (2, 'Ram Singh', 25), (3, 'John Doe', 28), (4, 'Jane Smith', 32); 1019 | ``` 1020 | Example: 1021 | ```sql 1022 | INSERT INTO employees VALUES (1, 'John Doe', 30); 1023 | 1024 | ``` 1025 | 1026 | - **Select Data:** 1027 | Retrieve data from a table. 1028 | ```sql 1029 | SELECT * FROM ; 1030 | ``` 1031 | 1032 | - **Update Data:** 1033 | Update existing data in a table. 1034 | ```sql 1035 | UPDATE SET = WHERE ; 1036 | ``` 1037 | 1038 | - **Delete Data:** 1039 | Delete rows from a table based on a condition. 1040 | ```sql 1041 | DELETE FROM WHERE ; 1042 | ``` 1043 | 1044 | --- 1045 | 1046 | #### **4. Querying Commands** 1047 | 1048 | - **Select Specific Columns:** 1049 | Retrieve specific columns from a table. 1050 | ```sql 1051 | SELECT , FROM ; 1052 | ``` 1053 | 1054 | - **Filtering Data:** 1055 | Filter data based on conditions using the `WHERE` clause. 1056 | ```sql 1057 | SELECT * FROM WHERE ; 1058 | ``` 1059 | Example: 1060 | ```sql 1061 | SELECT * FROM employees WHERE age > 25; 1062 | ``` 1063 | 1064 | - **Sorting Data:** 1065 | Sort the result by a column in ascending or descending order. 1066 | ```sql 1067 | SELECT * FROM ORDER BY ASC|DESC; 1068 | ``` 1069 | Example: 1070 | ```sql 1071 | SELECT * FROM employees ORDER BY age DESC; 1072 | SELECT * FROM employees ORDER BY age ASC; 1073 | ``` 1074 | 1075 | - **Group By:** 1076 | Group data by one or more columns and aggregate it using functions like `COUNT`, `AVG`, `SUM`, etc. 1077 | ```sql 1078 | SELECT , COUNT(*) FROM GROUP BY ; 1079 | ``` 1080 | Example: 1081 | ```sql 1082 | SELECT age, COUNT(*) FROM employees GROUP BY age; 1083 | ``` 1084 | 1085 | --- 1086 | 1087 | #### **5. File Format Commands** 1088 | 1089 | - **Create External Table:** 1090 | Create a table that references data stored externally (e.g., in HDFS). 1091 | ```sql 1092 | CREATE EXTERNAL TABLE ( , ...) 1093 | ROW FORMAT DELIMITED 1094 | FIELDS TERMINATED BY '' 1095 | LOCATION ''; 1096 | ``` 1097 | Example: 1098 | ```sql 1099 | CREATE EXTERNAL TABLE employees ( 1100 | id INT, 1101 | name STRING, 1102 | age INT 1103 | ) ROW FORMAT DELIMITED 1104 | FIELDS TERMINATED BY ',' 1105 | LOCATION '/user/hive/warehouse/employees'; 1106 | ``` 1107 | 1108 | - **Load Data into Table:** 1109 | Load data from a file into an existing Hive table. 1110 | ```sql 1111 | LOAD DATA LOCAL INPATH '' INTO TABLE ; 1112 | ``` 1113 | 1114 | --- 1115 | 1116 | #### **6. Other Useful Commands** 1117 | 1118 | - **Show Current User:** 1119 | Display the current user running the Hive session. 1120 | ```sql 1121 | !whoami; 1122 | ``` 1123 | 1124 | - **Exit Hive:** 1125 | Exit from the Hive shell. 1126 | ```sql 1127 | EXIT; 1128 | ``` 1129 | 1130 | - **Set Hive Variables:** 1131 | Set Hive session variables. 1132 | ```sql 1133 | SET =; 1134 | ``` 1135 | 1136 | - **Show Hive Variables:** 1137 | Display all the set variables. 
1138 | ```sql 1139 | SET; 1140 | ``` 1141 | 1142 | - **Show the Status of Hive Jobs:** 1143 | Display the status of running queries. 1144 | ```sql 1145 | SHOW JOBS; 1146 | ``` 1147 | 1148 | --- 1149 | 1150 | #### **🌟 Tips & Best Practices** 1151 | 1152 | - **Partitioning Tables:** 1153 | When dealing with large datasets, partitioning your tables can help improve query performance. 1154 | ```sql 1155 | CREATE TABLE sales (id INT, amount DOUBLE) 1156 | PARTITIONED BY (year INT, month INT); 1157 | ``` 1158 | 1159 | - **Bucketing:** 1160 | Bucketing splits your data into a fixed number of files or "buckets." 1161 | ```sql 1162 | CREATE TABLE sales (id INT, amount DOUBLE) 1163 | CLUSTERED BY (id) INTO 4 BUCKETS; 1164 | ``` 1165 | 1166 | - **Optimization:** 1167 | Use columnar formats like `ORC` or `Parquet` for efficient storage and performance. 1168 | ```sql 1169 | CREATE TABLE sales (id INT, amount DOUBLE) 1170 | STORED AS ORC; 1171 | ``` 1172 | 1173 | These basic commands will help you interact with Hive and perform common operations like creating tables, querying data, and managing your Hive environment efficiently. 1174 | 1175 | While **Hive** and **MySQL** both use SQL-like syntax for querying data, there are some key differences in their commands, especially since Hive is designed for querying large datasets in a Hadoop ecosystem, while MySQL is a relational database management system (RDBMS). 1176 | 1177 | ##**Here’s a comparison of **Hive** and **MySQL** commands in terms of common operations:** 1178 | 1179 | ### **1. Creating Databases** 1180 | - **Hive**: 1181 | ```sql 1182 | CREATE DATABASE mydb; 1183 | ``` 1184 | 1185 | - **MySQL**: 1186 | ```sql 1187 | CREATE DATABASE mydb; 1188 | ``` 1189 | 1190 | *Both Hive and MySQL use the same syntax to create a database.* 1191 | 1192 | --- 1193 | 1194 | ### **2. Switching to a Database** 1195 | - **Hive**: 1196 | ```sql 1197 | USE mydb; 1198 | ``` 1199 | 1200 | - **MySQL**: 1201 | ```sql 1202 | USE mydb; 1203 | ``` 1204 | 1205 | *The syntax is the same for selecting a database in both systems.* 1206 | 1207 | --- 1208 | 1209 | ### **3. Creating Tables** 1210 | - **Hive**: 1211 | ```sql 1212 | CREATE TABLE employees ( 1213 | id INT, 1214 | name STRING, 1215 | age INT 1216 | ); 1217 | ``` 1218 | 1219 | - **MySQL**: 1220 | ```sql 1221 | CREATE TABLE employees ( 1222 | id INT, 1223 | name VARCHAR(255), 1224 | age INT 1225 | ); 1226 | ``` 1227 | 1228 | **Differences**: 1229 | - In Hive, **STRING** is used for text data, while in MySQL, **VARCHAR** is used. 1230 | - Hive also has some specialized data types for distributed storage and performance, like `ARRAY`, `MAP`, `STRUCT`, etc. 1231 | 1232 | --- 1233 | 1234 | ### **4. Inserting Data** 1235 | - **Hive**: 1236 | ```sql 1237 | INSERT INTO employees VALUES (1, 'John', 30); 1238 | INSERT INTO employees VALUES (2, 'Alice', 25); 1239 | ``` 1240 | 1241 | - **MySQL**: 1242 | ```sql 1243 | INSERT INTO employees (id, name, age) VALUES (1, 'John', 30); 1244 | INSERT INTO employees (id, name, age) VALUES (2, 'Alice', 25); 1245 | ``` 1246 | 1247 | **Differences**: 1248 | - Hive allows direct `INSERT INTO` with values, while MySQL explicitly lists column names in the insert statement (though this is optional in MySQL if the columns match). 1249 | 1250 | --- 1251 | 1252 | ### **5. 
Querying Data** 1253 | - **Hive**: 1254 | ```sql 1255 | SELECT * FROM employees; 1256 | ``` 1257 | 1258 | - **MySQL**: 1259 | ```sql 1260 | SELECT * FROM employees; 1261 | ``` 1262 | 1263 | *Querying data using `SELECT` is identical in both systems.* 1264 | 1265 | --- 1266 | 1267 | ### **6. Modifying Data** 1268 | - **Hive**: 1269 | Hive doesn’t support traditional **UPDATE** or **DELETE** commands directly, as it is optimized for batch processing and is more suited for append operations. However, it does support **INSERT** and **INSERT OVERWRITE** operations. 1270 | 1271 | Example of replacing data: 1272 | ```sql 1273 | INSERT OVERWRITE TABLE employees SELECT * FROM employees WHERE age > 30; 1274 | ``` 1275 | 1276 | - **MySQL**: 1277 | ```sql 1278 | UPDATE employees SET age = 31 WHERE id = 1; 1279 | DELETE FROM employees WHERE id = 2; 1280 | ``` 1281 | 1282 | **Differences**: 1283 | - Hive does not allow direct **UPDATE** or **DELETE**; instead, it uses **INSERT OVERWRITE** to modify data in batch operations. 1284 | 1285 | --- 1286 | 1287 | ### **7. Dropping Tables** 1288 | - **Hive**: 1289 | ```sql 1290 | DROP TABLE IF EXISTS employees; 1291 | ``` 1292 | 1293 | - **MySQL**: 1294 | ```sql 1295 | DROP TABLE IF EXISTS employees; 1296 | ``` 1297 | 1298 | *The syntax for dropping tables is the same in both systems.* 1299 | 1300 | --- 1301 | 1302 | ### **8. Query Performance** 1303 | - **Hive**: 1304 | - Hive is designed to run on large datasets using the Hadoop Distributed File System (HDFS), so it focuses more on **batch processing** rather than real-time queries. Query performance in Hive may be slower than MySQL because it’s optimized for scale, not for low-latency transaction processing. 1305 | 1306 | - **MySQL**: 1307 | - MySQL is an RDBMS, designed to handle **transactional workloads** with low-latency queries. It’s better suited for OLTP (Online Transaction Processing) rather than OLAP (Online Analytical Processing) workloads. 1308 | 1309 | --- 1310 | 1311 | ### **9. Indexing** 1312 | - **Hive**: 1313 | - Hive doesn’t support traditional indexing as MySQL does. However, you can create **partitioned** or **bucketed** tables in Hive to improve query performance for certain types of data. 1314 | 1315 | - **MySQL**: 1316 | - MySQL supports **indexes** (e.g., **PRIMARY KEY**, **UNIQUE**, **INDEX**) to speed up query performance on large datasets. 1317 | 1318 | --- 1319 | 1320 | ### **10. Joins** 1321 | - **Hive**: 1322 | ```sql 1323 | SELECT a.id, a.name, b.age 1324 | FROM employees a 1325 | JOIN employee_details b ON a.id = b.id; 1326 | ``` 1327 | 1328 | - **MySQL**: 1329 | ```sql 1330 | SELECT a.id, a.name, b.age 1331 | FROM employees a 1332 | JOIN employee_details b ON a.id = b.id; 1333 | ``` 1334 | 1335 | *The syntax for **JOIN** is the same in both systems.* 1336 | 1337 | --- 1338 | 1339 | ### **Summary of Key Differences**: 1340 | - **Data Types**: Hive uses types like `STRING`, `TEXT`, `BOOLEAN`, etc., while MySQL uses types like `VARCHAR`, `CHAR`, `TEXT`, etc. 1341 | - **Data Modification**: Hive does not support **UPDATE** or **DELETE** in the traditional way, and is generally used for **batch processing**. 1342 | - **Performance**: Hive is designed for querying large-scale datasets in Hadoop, so queries tend to be slower than MySQL. 1343 | - **Indexing**: Hive does not natively support indexing but can use partitioning and bucketing for performance optimization. MySQL supports indexing for faster queries. 
1344 | - **ACID Properties**: MySQL supports full ACID compliance for transactional systems, whereas Hive is not transactional by default (but can support limited ACID features starting from version 0.14 with certain configurations). 1345 | 1346 | In conclusion, while **Hive** and **MySQL** share SQL-like syntax, they are designed for very different use cases, and not all commands work the same way in both systems. 1347 | 1348 | ### **Visualize the Data (Optional)** 1349 | 1350 | Export the query results to a CSV file for analysis in visualization tools: 1351 | 1352 | ```bash 1353 | hive -e "SELECT * FROM police_data;" > police_analysis_results.csv 1354 | ``` 1355 | 1356 | You can use tools like Tableau, Excel, or Python (Matplotlib, Pandas) for data visualization. 1357 | 1358 | ## **Step 8: Configure Environment Variables (Optional)** 1359 | 1360 | If you need to customize configurations, you can specify parameters in the `hadoop.env` file or as environmental variables for services (e.g., namenode, datanode, etc.). For example: 1361 | 1362 | ```bash 1363 | CORE_CONF_fs_defaultFS=hdfs://namenode:8020 1364 | ``` 1365 | 1366 | This will be transformed into the following in the `core-site.xml` file: 1367 | 1368 | ```xml 1369 | 1370 | fs.defaultFS 1371 | hdfs://namenode:8020 1372 | 1373 | ``` 1374 | 1375 | ## **Conclusion** 1376 | 1377 | You now have a fully functional Hadoop, Spark, and Hive cluster running in Docker. This environment is great for experimenting with big data processing and analytics in a lightweight, containerized setup. 1378 | 1379 | --- 1380 | 1381 | I hope you have fun with this Hadoop-Spark-Hive cluster. 1382 | 1383 | 1384 | 1385 | ![image](https://github.com/user-attachments/assets/1347d354-a160-4cc6-8547-eb0857a72ba5) 1386 | 1387 | -------------------------------------------------------------------------------- /base/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM debian:9 2 | 3 | MAINTAINER Ivan Ermilov 4 | MAINTAINER Giannis Mouchakis 5 | 6 | RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 7 | openjdk-8-jdk \ 8 | net-tools \ 9 | curl \ 10 | netcat \ 11 | gnupg \ 12 | libsnappy-dev \ 13 | && rm -rf /var/lib/apt/lists/* 14 | 15 | ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ 16 | 17 | RUN curl -O https://dist.apache.org/repos/dist/release/hadoop/common/KEYS 18 | 19 | RUN gpg --import KEYS 20 | 21 | ENV HADOOP_VERSION 3.2.1 22 | ENV HADOOP_URL https://www.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz 23 | 24 | RUN set -x \ 25 | && curl -fSL "$HADOOP_URL" -o /tmp/hadoop.tar.gz \ 26 | && curl -fSL "$HADOOP_URL.asc" -o /tmp/hadoop.tar.gz.asc \ 27 | && gpg --verify /tmp/hadoop.tar.gz.asc \ 28 | && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \ 29 | && rm /tmp/hadoop.tar.gz* 30 | 31 | RUN ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop 32 | 33 | RUN mkdir /opt/hadoop-$HADOOP_VERSION/logs 34 | 35 | RUN mkdir /hadoop-data 36 | 37 | ENV HADOOP_HOME=/opt/hadoop-$HADOOP_VERSION 38 | ENV HADOOP_CONF_DIR=/etc/hadoop 39 | ENV MULTIHOMED_NETWORK=1 40 | ENV USER=root 41 | ENV PATH $HADOOP_HOME/bin/:$PATH 42 | 43 | ADD entrypoint.sh /entrypoint.sh 44 | 45 | RUN chmod a+x /entrypoint.sh 46 | 47 | ENTRYPOINT ["/entrypoint.sh"] 48 | -------------------------------------------------------------------------------- /base/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Set some 
sensible defaults 4 | export CORE_CONF_fs_defaultFS=${CORE_CONF_fs_defaultFS:-hdfs://`hostname -f`:8020} 5 | 6 | function addProperty() { 7 | local path=$1 8 | local name=$2 9 | local value=$3 10 | 11 | local entry="$name${value}" 12 | local escapedEntry=$(echo $entry | sed 's/\//\\\//g') 13 | sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path 14 | } 15 | 16 | function configure() { 17 | local path=$1 18 | local module=$2 19 | local envPrefix=$3 20 | 21 | local var 22 | local value 23 | 24 | echo "Configuring $module" 25 | for c in `printenv | perl -sne 'print "$1 " if m/^${envPrefix}_(.+?)=.*/' -- -envPrefix=$envPrefix`; do 26 | name=`echo ${c} | perl -pe 's/___/-/g; s/__/@/g; s/_/./g; s/@/_/g;'` 27 | var="${envPrefix}_${c}" 28 | value=${!var} 29 | echo " - Setting $name=$value" 30 | addProperty $path $name "$value" 31 | done 32 | } 33 | 34 | configure /etc/hadoop/core-site.xml core CORE_CONF 35 | configure /etc/hadoop/hdfs-site.xml hdfs HDFS_CONF 36 | configure /etc/hadoop/yarn-site.xml yarn YARN_CONF 37 | configure /etc/hadoop/httpfs-site.xml httpfs HTTPFS_CONF 38 | configure /etc/hadoop/kms-site.xml kms KMS_CONF 39 | configure /etc/hadoop/mapred-site.xml mapred MAPRED_CONF 40 | 41 | if [ "$MULTIHOMED_NETWORK" = "1" ]; then 42 | echo "Configuring for multihomed network" 43 | 44 | # HDFS 45 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0 46 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0 47 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0 48 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0 49 | addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true 50 | addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true 51 | 52 | # YARN 53 | addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0 54 | addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 55 | addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0 56 | 57 | # MAPRED 58 | addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 59 | fi 60 | 61 | if [ -n "$GANGLIA_HOST" ]; then 62 | mv /etc/hadoop/hadoop-metrics.properties /etc/hadoop/hadoop-metrics.properties.orig 63 | mv /etc/hadoop/hadoop-metrics2.properties /etc/hadoop/hadoop-metrics2.properties.orig 64 | 65 | for module in mapred jvm rpc ugi; do 66 | echo "$module.class=org.apache.hadoop.metrics.ganglia.GangliaContext31" 67 | echo "$module.period=10" 68 | echo "$module.servers=$GANGLIA_HOST:8649" 69 | done > /etc/hadoop/hadoop-metrics.properties 70 | 71 | for module in namenode datanode resourcemanager nodemanager mrappmaster jobhistoryserver; do 72 | echo "$module.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31" 73 | echo "$module.sink.ganglia.period=10" 74 | echo "$module.sink.ganglia.supportsparse=true" 75 | echo "$module.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both" 76 | echo "$module.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40" 77 | echo "$module.sink.ganglia.servers=$GANGLIA_HOST:8649" 78 | done > /etc/hadoop/hadoop-metrics2.properties 79 | fi 80 | 81 | function wait_for_it() 82 | { 83 | local serviceport=$1 84 | local service=${serviceport%%:*} 85 | local port=${serviceport#*:} 86 | local retry_seconds=5 87 | local max_try=100 88 | let i=1 89 | 90 | nc -z $service $port 91 | result=$? 
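# The loop below re-checks the port every $retry_seconds seconds and gives up (exit 1) after $max_try failed attempts.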
92 | 93 | until [ $result -eq 0 ]; do 94 | echo "[$i/$max_try] check for ${service}:${port}..." 95 | echo "[$i/$max_try] ${service}:${port} is not available yet" 96 | if (( $i == $max_try )); then 97 | echo "[$i/$max_try] ${service}:${port} is still not available; giving up after ${max_try} tries. :/" 98 | exit 1 99 | fi 100 | 101 | echo "[$i/$max_try] try in ${retry_seconds}s once again ..." 102 | let "i++" 103 | sleep $retry_seconds 104 | 105 | nc -z $service $port 106 | result=$? 107 | done 108 | echo "[$i/$max_try] $service:${port} is available." 109 | } 110 | 111 | for i in ${SERVICE_PRECONDITION[@]} 112 | do 113 | wait_for_it ${i} 114 | done 115 | 116 | exec $@ 117 | -------------------------------------------------------------------------------- /base/execute-step.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $ENABLE_INIT_DAEMON = "true" ] 4 | then 5 | echo "Execute step ${INIT_DAEMON_STEP} in pipeline" 6 | while true; do 7 | sleep 5 8 | echo -n '.' 9 | string=$(curl -sL -w "%{http_code}" -X PUT $INIT_DAEMON_BASE_URI/execute?step=$INIT_DAEMON_STEP -o /dev/null) 10 | [ "$string" = "204" ] && break 11 | done 12 | echo "Notified execution of step ${INIT_DAEMON_STEP}" 13 | fi 14 | 15 | -------------------------------------------------------------------------------- /base/finish-step.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $ENABLE_INIT_DAEMON = "true" ] 4 | then 5 | echo "Finish step ${INIT_DAEMON_STEP} in pipeline" 6 | while true; do 7 | sleep 5 8 | echo -n '.' 9 | string=$(curl -sL -w "%{http_code}" -X PUT $INIT_DAEMON_BASE_URI/finish?step=$INIT_DAEMON_STEP -o /dev/null) 10 | [ "$string" = "204" ] && break 11 | done 12 | echo "Notified finish of step ${INIT_DAEMON_STEP}" 13 | fi 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /base/wait-for-step.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $ENABLE_INIT_DAEMON = "true" ] 4 | then 5 | echo "Validating if step ${INIT_DAEMON_STEP} can start in pipeline" 6 | while true; do 7 | sleep 5 8 | echo -n '.' 
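# Poll the init daemon's canStart endpoint until it reports "true" for this step, then proceed.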
9 | string=$(curl -s $INIT_DAEMON_BASE_URI/canStart?step=$INIT_DAEMON_STEP) 10 | [ "$string" = "true" ] && break 11 | done 12 | echo "Can start step ${INIT_DAEMON_STEP}" 13 | fi 14 | -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/WordCount$IntSumReducer.class: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount$IntSumReducer.class -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/WordCount$TokenizerMapper.class: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount$TokenizerMapper.class -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/WordCount.class: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/WordCount.class -------------------------------------------------------------------------------- /code/HadoopWordCount/bin/wc.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/HadoopWordCount/bin/wc.jar -------------------------------------------------------------------------------- /code/HadoopWordCount/src/WordCount.java: -------------------------------------------------------------------------------- 1 | import java.io.IOException; 2 | import java.util.StringTokenizer; 3 | 4 | import org.apache.hadoop.conf.Configuration; 5 | import org.apache.hadoop.fs.Path; 6 | import org.apache.hadoop.io.IntWritable; 7 | import org.apache.hadoop.io.Text; 8 | import org.apache.hadoop.mapreduce.Job; 9 | import org.apache.hadoop.mapreduce.Mapper; 10 | import org.apache.hadoop.mapreduce.Reducer; 11 | import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 12 | import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 | 14 | public class WordCount { 15 | 16 | public static class TokenizerMapper 17 | extends Mapper{ 18 | 19 | private final static IntWritable one = new IntWritable(1); 20 | private Text word = new Text(); 21 | 22 | public void map(Object key, Text value, Context context 23 | ) throws IOException, InterruptedException { 24 | StringTokenizer itr = new StringTokenizer(value.toString()); 25 | while (itr.hasMoreTokens()) { 26 | word.set(itr.nextToken()); 27 | context.write(word, one); 28 | } 29 | } 30 | } 31 | 32 | public static class IntSumReducer 33 | extends Reducer { 34 | private IntWritable result = new IntWritable(); 35 | 36 | public void reduce(Text key, Iterable values, 37 | Context context 38 | ) throws IOException, InterruptedException { 39 | int sum = 0; 40 | for (IntWritable val : values) { 41 | sum += val.get(); 42 | } 43 | result.set(sum); 44 | context.write(key, result); 45 | } 46 | } 47 | 48 | public static void main(String[] args) throws Exception { 49 | Configuration conf = new Configuration(); 50 | Job job = Job.getInstance(conf, "word count"); 51 | job.setJarByClass(WordCount.class); 52 | job.setMapperClass(TokenizerMapper.class); 53 | 
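// Reusing IntSumReducer as the combiner pre-aggregates counts on each mapper, shrinking the data shuffled to reducers.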
job.setCombinerClass(IntSumReducer.class); 54 | job.setReducerClass(IntSumReducer.class); 55 | job.setOutputKeyClass(Text.class); 56 | job.setOutputValueClass(IntWritable.class); 57 | FileInputFormat.addInputPath(job, new Path(args[0])); 58 | FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 | System.exit(job.waitForCompletion(true) ? 0 : 1); 60 | } 61 | } -------------------------------------------------------------------------------- /code/input/About Hadoop.txt~: -------------------------------------------------------------------------------- 1 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 2 | 3 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 4 | 5 | The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. 6 | -------------------------------------------------------------------------------- /code/input/data.txt: -------------------------------------------------------------------------------- 1 | DOG CAT RAT 2 | CAR CAR RAT 3 | DOG CAR CAT 4 | -------------------------------------------------------------------------------- /code/wordCount.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/code/wordCount.jar -------------------------------------------------------------------------------- /conf/beeline-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. 
You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | status = INFO 18 | name = BeelineLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.hive.log.level = WARN 23 | property.hive.root.logger = console 24 | 25 | # list of all appenders 26 | appenders = console 27 | 28 | # console appender 29 | appender.console.type = Console 30 | appender.console.name = console 31 | appender.console.target = SYSTEM_ERR 32 | appender.console.layout.type = PatternLayout 33 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n 34 | 35 | # list of all loggers 36 | loggers = HiveConnection 37 | 38 | # HiveConnection logs useful info for dynamic service discovery 39 | logger.HiveConnection.name = org.apache.hive.jdbc.HiveConnection 40 | logger.HiveConnection.level = INFO 41 | 42 | # root logger 43 | rootLogger.level = ${sys:hive.log.level} 44 | rootLogger.appenderRefs = root 45 | rootLogger.appenderRef.root.ref = ${sys:hive.root.logger} 46 | -------------------------------------------------------------------------------- /conf/hive-env.sh: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | # Set Hive and Hadoop environment variables here. These variables can be used 18 | # to control the execution of Hive. It should be used by admins to configure 19 | # the Hive installation (so that users do not have to set environment variables 20 | # or set command line parameters to get correct behavior). 21 | # 22 | # The hive service being invoked (CLI/HWI etc.) is available via the environment 23 | # variable SERVICE 24 | 25 | 26 | # Hive Client memory usage can be an issue if a large number of clients 27 | # are running at the same time. 
The flags below have been useful in 28 | # reducing memory usage: 29 | # 30 | # if [ "$SERVICE" = "cli" ]; then 31 | # if [ -z "$DEBUG" ]; then 32 | # export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit" 33 | # else 34 | # export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit" 35 | # fi 36 | # fi 37 | 38 | # The heap size of the jvm stared by hive shell script can be controlled via: 39 | # 40 | # export HADOOP_HEAPSIZE=1024 41 | # 42 | # Larger heap size may be required when running queries over large number of files or partitions. 43 | # By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be 44 | # appropriate for hive server (hwi etc). 45 | 46 | 47 | # Set HADOOP_HOME to point to a specific hadoop install directory 48 | # HADOOP_HOME=${bin}/../../hadoop 49 | 50 | # Hive Configuration Directory can be controlled by: 51 | # export HIVE_CONF_DIR= 52 | 53 | # Folder containing extra ibraries required for hive compilation/execution can be controlled by: 54 | # export HIVE_AUX_JARS_PATH= 55 | -------------------------------------------------------------------------------- /conf/hive-exec-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
16 | 17 | status = INFO 18 | name = HiveExecLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.hive.log.level = INFO 23 | property.hive.root.logger = FA 24 | property.hive.query.id = hadoop 25 | property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name} 26 | property.hive.log.file = ${sys:hive.query.id}.log 27 | 28 | # list of all appenders 29 | appenders = console, FA 30 | 31 | # console appender 32 | appender.console.type = Console 33 | appender.console.name = console 34 | appender.console.target = SYSTEM_ERR 35 | appender.console.layout.type = PatternLayout 36 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n 37 | 38 | # simple file appender 39 | appender.FA.type = File 40 | appender.FA.name = FA 41 | appender.FA.fileName = ${sys:hive.log.dir}/${sys:hive.log.file} 42 | appender.FA.layout.type = PatternLayout 43 | appender.FA.layout.pattern = %d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n 44 | 45 | # list of all loggers 46 | loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX 47 | 48 | logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn 49 | logger.NIOServerCnxn.level = WARN 50 | 51 | logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO 52 | logger.ClientCnxnSocketNIO.level = WARN 53 | 54 | logger.DataNucleus.name = DataNucleus 55 | logger.DataNucleus.level = ERROR 56 | 57 | logger.Datastore.name = Datastore 58 | logger.Datastore.level = ERROR 59 | 60 | logger.JPOX.name = JPOX 61 | logger.JPOX.level = ERROR 62 | 63 | # root logger 64 | rootLogger.level = ${sys:hive.log.level} 65 | rootLogger.appenderRefs = root 66 | rootLogger.appenderRef.root.ref = ${sys:hive.root.logger} 67 | -------------------------------------------------------------------------------- /conf/hive-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 
16 | 17 | status = INFO 18 | name = HiveLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.hive.log.level = INFO 23 | property.hive.root.logger = DRFA 24 | property.hive.log.dir = ${sys:java.io.tmpdir}/${sys:user.name} 25 | property.hive.log.file = hive.log 26 | 27 | # list of all appenders 28 | appenders = console, DRFA 29 | 30 | # console appender 31 | appender.console.type = Console 32 | appender.console.name = console 33 | appender.console.target = SYSTEM_ERR 34 | appender.console.layout.type = PatternLayout 35 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n 36 | 37 | # daily rolling file appender 38 | appender.DRFA.type = RollingFile 39 | appender.DRFA.name = DRFA 40 | appender.DRFA.fileName = ${sys:hive.log.dir}/${sys:hive.log.file} 41 | # Use %pid in the filePattern to append @ to the filename if you want separate log files for different CLI session 42 | appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%d{yyyy-MM-dd} 43 | appender.DRFA.layout.type = PatternLayout 44 | appender.DRFA.layout.pattern = %d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n 45 | appender.DRFA.policies.type = Policies 46 | appender.DRFA.policies.time.type = TimeBasedTriggeringPolicy 47 | appender.DRFA.policies.time.interval = 1 48 | appender.DRFA.policies.time.modulate = true 49 | appender.DRFA.strategy.type = DefaultRolloverStrategy 50 | appender.DRFA.strategy.max = 30 51 | 52 | # list of all loggers 53 | loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX 54 | 55 | logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn 56 | logger.NIOServerCnxn.level = WARN 57 | 58 | logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO 59 | logger.ClientCnxnSocketNIO.level = WARN 60 | 61 | logger.DataNucleus.name = DataNucleus 62 | logger.DataNucleus.level = ERROR 63 | 64 | logger.Datastore.name = Datastore 65 | logger.Datastore.level = ERROR 66 | 67 | logger.JPOX.name = JPOX 68 | logger.JPOX.level = ERROR 69 | 70 | # root logger 71 | rootLogger.level = ${sys:hive.log.level} 72 | rootLogger.appenderRefs = root 73 | rootLogger.appenderRef.root.ref = ${sys:hive.root.logger} 74 | -------------------------------------------------------------------------------- /conf/hive-site.xml: -------------------------------------------------------------------------------- 1 | 2 | 18 | 19 | -------------------------------------------------------------------------------- /conf/ivysettings.xml: -------------------------------------------------------------------------------- 1 | 17 | 18 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | -------------------------------------------------------------------------------- /conf/llap-daemon-log4j2.properties: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one 2 | # or more contributor license agreements. See the NOTICE file 3 | # distributed with this work for additional information 4 | # regarding copyright ownership. The ASF licenses this file 5 | # to you under the Apache License, Version 2.0 (the 6 | # "License"); you may not use this file except in compliance 7 | # with the License. 
You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | 17 | status = INFO 18 | name = LlapDaemonLog4j2 19 | packages = org.apache.hadoop.hive.ql.log 20 | 21 | # list of properties 22 | property.llap.daemon.log.level = INFO 23 | property.llap.daemon.root.logger = console 24 | property.llap.daemon.log.dir = . 25 | property.llap.daemon.log.file = llapdaemon.log 26 | property.llap.daemon.historylog.file = llapdaemon_history.log 27 | property.llap.daemon.log.maxfilesize = 256MB 28 | property.llap.daemon.log.maxbackupindex = 20 29 | 30 | # list of all appenders 31 | appenders = console, RFA, HISTORYAPPENDER 32 | 33 | # console appender 34 | appender.console.type = Console 35 | appender.console.name = console 36 | appender.console.target = SYSTEM_ERR 37 | appender.console.layout.type = PatternLayout 38 | appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} [%t%x] %p %c{2} : %m%n 39 | 40 | # rolling file appender 41 | appender.RFA.type = RollingFile 42 | appender.RFA.name = RFA 43 | appender.RFA.fileName = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.log.file} 44 | appender.RFA.filePattern = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.log.file}_%i 45 | appender.RFA.layout.type = PatternLayout 46 | appender.RFA.layout.pattern = %d{ISO8601} %-5p [%t%x]: %c{2} (%F:%M(%L)) - %m%n 47 | appender.RFA.policies.type = Policies 48 | appender.RFA.policies.size.type = SizeBasedTriggeringPolicy 49 | appender.RFA.policies.size.size = ${sys:llap.daemon.log.maxfilesize} 50 | appender.RFA.strategy.type = DefaultRolloverStrategy 51 | appender.RFA.strategy.max = ${sys:llap.daemon.log.maxbackupindex} 52 | 53 | # history file appender 54 | appender.HISTORYAPPENDER.type = RollingFile 55 | appender.HISTORYAPPENDER.name = HISTORYAPPENDER 56 | appender.HISTORYAPPENDER.fileName = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.historylog.file} 57 | appender.HISTORYAPPENDER.filePattern = ${sys:llap.daemon.log.dir}/${sys:llap.daemon.historylog.file}_%i 58 | appender.HISTORYAPPENDER.layout.type = PatternLayout 59 | appender.HISTORYAPPENDER.layout.pattern = %m%n 60 | appender.HISTORYAPPENDER.policies.type = Policies 61 | appender.HISTORYAPPENDER.policies.size.type = SizeBasedTriggeringPolicy 62 | appender.HISTORYAPPENDER.policies.size.size = ${sys:llap.daemon.log.maxfilesize} 63 | appender.HISTORYAPPENDER.strategy.type = DefaultRolloverStrategy 64 | appender.HISTORYAPPENDER.strategy.max = ${sys:llap.daemon.log.maxbackupindex} 65 | 66 | # list of all loggers 67 | loggers = NIOServerCnxn, ClientCnxnSocketNIO, DataNucleus, Datastore, JPOX, HistoryLogger 68 | 69 | logger.NIOServerCnxn.name = org.apache.zookeeper.server.NIOServerCnxn 70 | logger.NIOServerCnxn.level = WARN 71 | 72 | logger.ClientCnxnSocketNIO.name = org.apache.zookeeper.ClientCnxnSocketNIO 73 | logger.ClientCnxnSocketNIO.level = WARN 74 | 75 | logger.DataNucleus.name = DataNucleus 76 | logger.DataNucleus.level = ERROR 77 | 78 | logger.Datastore.name = Datastore 79 | logger.Datastore.level = ERROR 80 | 81 | logger.JPOX.name = JPOX 82 | logger.JPOX.level = ERROR 83 | 84 | logger.HistoryLogger.name = org.apache.hadoop.hive.llap.daemon.HistoryLogger 85 | logger.HistoryLogger.level = 
INFO 86 | logger.HistoryLogger.additivity = false 87 | logger.HistoryLogger.appenderRefs = HistoryAppender 88 | logger.HistoryLogger.appenderRef.HistoryAppender.ref = HISTORYAPPENDER 89 | 90 | # root logger 91 | rootLogger.level = ${sys:llap.daemon.log.level} 92 | rootLogger.appenderRefs = root 93 | rootLogger.appenderRef.root.ref = ${sys:llap.daemon.root.logger} 94 | -------------------------------------------------------------------------------- /data/authors.csv: -------------------------------------------------------------------------------- 1 | lname,fname 2 | Pascal,Blaise 3 | Voltaire,François 4 | Perrin,Jean-Georges 5 | Maréchal,Pierre Sylvain 6 | Karau,Holden 7 | Zaharia,Matei 8 | -------------------------------------------------------------------------------- /data/books.csv: -------------------------------------------------------------------------------- 1 | id,authorId,title,releaseDate,link 2 | 1,1,Fantastic Beasts and Where to Find Them: The Original Screenplay,11/18/16,http://amzn.to/2kup94P 3 | 2,1,"Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter, Book 1)",10/6/15,http://amzn.to/2l2lSwP 4 | 3,1,"The Tales of Beedle the Bard, Standard Edition (Harry Potter)",12/4/08,http://amzn.to/2kYezqr 5 | 4,1,"Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter, Book 2)",10/4/16,http://amzn.to/2kYhL5n 6 | 5,2,"Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple, the Coffee, and a Great Database",4/23/17,http://amzn.to/2i3mthT 7 | 6,2,"Development Tools in 2006: any Room for a 4GL-style Language?: An independent study by Jean Georges Perrin, IIUG Board Member",12/28/16,http://amzn.to/2vBxOe1 8 | 7,3,Adventures of Huckleberry Finn,5/26/94,http://amzn.to/2wOeOav 9 | 8,3,A Connecticut Yankee in King Arthur's Court,6/17/17,http://amzn.to/2x1NuoD 10 | 10,4,Jacques le Fataliste,3/1/00,http://amzn.to/2uZj2KA 11 | 11,4,Diderot Encyclopedia: The Complete Illustrations 1762-1777,,http://amzn.to/2i2zo3I 12 | 12,,A Woman in Berlin,7/11/06,http://amzn.to/2i472WZ 13 | 13,6,Spring Boot in Action,1/3/16,http://amzn.to/2hCPktW 14 | 14,6,Spring in Action: Covers Spring 4,11/28/14,http://amzn.to/2yJLyCk 15 | 15,7,Soft Skills: The software developer's life manual,12/29/14,http://amzn.to/2zNnSyn 16 | 16,8,Of Mice and Men,,http://amzn.to/2zJjXoc 17 | 17,9,"Java 8 in Action: Lambdas, Streams, and functional-style programming",8/28/14,http://amzn.to/2isdqoL 18 | 18,12,Hamlet,6/8/12,http://amzn.to/2yRbewY 19 | 19,13,Pensées,12/31/1670,http://amzn.to/2jweHOG 20 | 20,14,"Fables choisies, mises en vers par M. 
de La Fontaine",9/1/1999,http://amzn.to/2yRH10W 21 | 21,15,Discourse on Method and Meditations on First Philosophy,6/15/1999,http://amzn.to/2hwB8zc 22 | 22,12,Twelfth Night,7/1/4,http://amzn.to/2zPYnwo 23 | 23,12,Macbeth,7/1/3,http://amzn.to/2zPYnwo -------------------------------------------------------------------------------- /datanode/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:9864/ || exit 1 6 | 7 | ENV HDFS_CONF_dfs_datanode_data_dir=file:///hadoop/dfs/data 8 | RUN mkdir -p /hadoop/dfs/data 9 | VOLUME /hadoop/dfs/data 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | EXPOSE 9864 15 | 16 | CMD ["/run.sh"] 17 | -------------------------------------------------------------------------------- /datanode/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | datadir=`echo $HDFS_CONF_dfs_datanode_data_dir | perl -pe 's#file://##'` 4 | if [ ! -d $datadir ]; then 5 | echo "Datanode data directory not found: $datadir" 6 | exit 2 7 | fi 8 | 9 | $HADOOP_HOME/bin/hdfs --config $HADOOP_CONF_DIR datanode 10 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | services: 2 | namenode: 3 | image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8 4 | container_name: namenode 5 | restart: always 6 | ports: 7 | - 9870:9870 8 | - 9010:9000 9 | volumes: 10 | - hadoop_namenode:/hadoop/dfs/name 11 | environment: 12 | - CLUSTER_NAME=test 13 | - CORE_CONF_fs_defaultFS=hdfs://namenode:9000 14 | env_file: 15 | - ./hadoop.env 16 | 17 | datanode: 18 | image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8 19 | container_name: datanode 20 | restart: always 21 | volumes: 22 | - hadoop_datanode:/hadoop/dfs/data 23 | environment: 24 | SERVICE_PRECONDITION: "namenode:9870" 25 | CORE_CONF_fs_defaultFS: hdfs://namenode:9000 26 | ports: 27 | - "9864:9864" 28 | env_file: 29 | - ./hadoop.env 30 | 31 | resourcemanager: 32 | image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8 33 | container_name: resourcemanager 34 | restart: always 35 | environment: 36 | SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864" 37 | ports: 38 | - "8088:8088" 39 | env_file: 40 | - ./hadoop.env 41 | 42 | nodemanager1: 43 | image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8 44 | container_name: nodemanager 45 | restart: always 46 | environment: 47 | SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088" 48 | env_file: 49 | - ./hadoop.env 50 | 51 | historyserver: 52 | image: bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8 53 | container_name: historyserver 54 | restart: always 55 | environment: 56 | SERVICE_PRECONDITION: "namenode:9000 namenode:9870 datanode:9864 resourcemanager:8088" 57 | volumes: 58 | - hadoop_historyserver:/hadoop/yarn/timeline 59 | env_file: 60 | - ./hadoop.env 61 | 62 | spark-master: 63 | image: bde2020/spark-master:3.0.0-hadoop3.2 64 | container_name: spark-master 65 | depends_on: 66 | - namenode 67 | - datanode 68 | ports: 69 | - "8080:8080" 70 | - "7077:7077" 71 | environment: 72 | - INIT_DAEMON_STEP=setup_spark 73 | - CORE_CONF_fs_defaultFS=hdfs://namenode:9000 74 | 75 | spark-worker-1: 76 | image: bde2020/spark-worker:3.0.0-hadoop3.2 77 | container_name: spark-worker-1 78 | 
depends_on: 79 | - spark-master 80 | ports: 81 | - "8081:8081" 82 | environment: 83 | - "SPARK_MASTER=spark://spark-master:7077" 84 | - CORE_CONF_fs_defaultFS=hdfs://namenode:9000 85 | 86 | hive-server: 87 | image: bde2020/hive:2.3.2-postgresql-metastore 88 | container_name: hive-server 89 | depends_on: 90 | - namenode 91 | - datanode 92 | env_file: 93 | - ./hadoop-hive.env 94 | environment: 95 | HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore" 96 | SERVICE_PRECONDITION: "hive-metastore:9083" 97 | ports: 98 | - "10000:10000" 99 | 100 | hive-metastore: 101 | image: bde2020/hive:2.3.2-postgresql-metastore 102 | container_name: hive-metastore 103 | env_file: 104 | - ./hadoop-hive.env 105 | command: /opt/hive/bin/hive --service metastore 106 | environment: 107 | SERVICE_PRECONDITION: "namenode:9870 datanode:9864 hive-metastore-postgresql:5432" 108 | ports: 109 | - "9083:9083" 110 | 111 | hive-metastore-postgresql: 112 | image: bde2020/hive-metastore-postgresql:2.3.0 113 | container_name: hive-metastore-postgresql 114 | 115 | presto-coordinator: 116 | image: shawnzhu/prestodb:0.181 117 | container_name: presto-coordinator 118 | ports: 119 | - "8089:8089" 120 | 121 | volumes: 122 | hadoop_namenode: 123 | hadoop_datanode: 124 | hadoop_historyserver: 125 | 126 | -------------------------------------------------------------------------------- /ecom.md: -------------------------------------------------------------------------------- 1 | # 🚀 E-commerce Sales Data Analysis Using Hive 2 | 3 | ## 📊 Project Overview 4 | This project demonstrates how to perform **E-commerce Sales Data Analysis** using **Apache Hive** on a Hadoop ecosystem. The goal of this project is to analyze sales data, generate business insights, and understand trends in e-commerce sales. 5 | 6 | The project uses a **CSV file containing real-world simulated sales data**, which is imported into **HDFS (Hadoop Distributed File System)** and processed using **HiveQL (Hive Query Language)**. 7 | 8 | ✅ **Project Objectives:** 9 | - Import large-scale e-commerce sales data into **HDFS**. 10 | - Create Hive tables (Managed & External). 11 | - Analyze data to extract business insights like: 12 | - 💰 **Total Revenue.** 13 | - 🛒 **Best-selling products.** 14 | - 👥 **Most active customers.** 15 | - 📅 **Monthly/Yearly sales trends.** 16 | - 💵 **Most used payment methods.** 17 | - Generate useful business insights for decision-making. 18 | 19 | --- 20 | 21 | ## 📁 Dataset Information 22 | The dataset used in this project is a simulated **E-commerce Sales Data CSV file** containing the following columns: 23 | 24 | | Column Name | Description | 25 | |-----------------|------------------------------------------| 26 | | **order_date** | Date of the order | 27 | | **customer_id** | Unique ID of the customer | 28 | | **product_name** | Name of the product purchased | 29 | | **category** | Product category | 30 | | **quantity** | Number of units sold | 31 | | **price** | Price per unit | 32 | | **total_amount** | Total amount for the order | 33 | | **payment_type** | Payment method used | 34 | | **city** | Customer's city | 35 | | **state** | Customer's state | 36 | | **country** | Customer's country | 37 | 38 | 👉 **Sample Size:** 10,000 records of e-commerce transactions. 
39 | 👉 **File Type:** CSV 40 | 👉 **File Name:** `ecommerce_sales_data.csv` 41 | 42 | You can download the dataset from here: [Download E-commerce Sales Data](https://drive.google.com/file/d/1MYN0AdX6uD9kNR6UdqlCZuZCxlfmK6T6/view) 43 | 44 | --- 45 | 46 | ## 📥 Step 1: Upload Data to HDFS 47 | ### ✅ Create Directory in HDFS 48 | Run the following commands to create a directory in **HDFS**: 49 | ```bash 50 | hadoop fs -mkdir -p /user/hdfs/ecommerce_data 51 | ``` 52 | 53 | ### ✅ Upload the CSV File to HDFS 54 | ```bash 55 | hadoop fs -put /mnt/data/ecommerce_sales_data.csv /user/hdfs/ecommerce_data/ 56 | ``` 57 | 58 | Verify the upload: 59 | ```bash 60 | hadoop fs -ls /user/hdfs/ecommerce_data/ 61 | ``` 62 | You should see the file listed there. 63 | 64 | --- 65 | 66 | ## 🗄 Step 2: Create Hive Tables 67 | Now, open the **Hive shell**: 68 | ```bash 69 | hive 70 | ``` 71 | 72 | ### ✅ Create Database 73 | ```sql 74 | CREATE DATABASE ecommerce; 75 | USE ecommerce; 76 | ``` 77 | 78 | ### ✅ Create External Table 79 | We will create an **External Table** linked to our HDFS file. 80 | ```sql 81 | CREATE EXTERNAL TABLE IF NOT EXISTS sales_data ( 82 | order_date STRING, 83 | customer_id INT, 84 | product_name STRING, 85 | category STRING, 86 | quantity INT, 87 | price FLOAT, 88 | total_amount FLOAT, 89 | payment_type STRING, 90 | city STRING, 91 | state STRING, 92 | country STRING 93 | ) 94 | ROW FORMAT DELIMITED 95 | FIELDS TERMINATED BY ',' 96 | STORED AS TEXTFILE 97 | LOCATION '/user/hdfs/ecommerce_data/'; 98 | ``` 99 | 100 | ✅ **Verify the data:** 101 | ```sql 102 | SELECT * FROM sales_data LIMIT 10; 103 | ``` 104 | 105 | --- 106 | 107 | ## 💻 Step 3: Hive Queries (Data Analysis) 108 | ### 💰 1. Calculate Total Revenue 109 | ```sql 110 | SELECT SUM(total_amount) AS total_revenue 111 | FROM sales_data; 112 | ``` 113 | 👉 This query shows the **total revenue generated** by the business. 114 | 115 | --- 116 | 117 | ### 🛍 2. Find Best-Selling Products 118 | ```sql 119 | SELECT product_name, SUM(quantity) AS total_sold 120 | FROM sales_data 121 | GROUP BY product_name 122 | ORDER BY total_sold DESC 123 | LIMIT 10; 124 | ``` 125 | 👉 This query shows the **top 10 best-selling products**. 126 | 127 | --- 128 | 129 | ### 👥 3. Identify Most Active Customers 130 | ```sql 131 | SELECT customer_id, COUNT(*) AS total_orders 132 | FROM sales_data 133 | GROUP BY customer_id 134 | ORDER BY total_orders DESC 135 | LIMIT 10; 136 | ``` 137 | 👉 This query identifies the **top 10 most active customers**. 138 | 139 | --- 140 | 141 | ### 📅 4. Monthly Sales Trend 142 | ```sql 143 | SELECT substr(order_date, 1, 7) AS month, SUM(total_amount) AS monthly_revenue 144 | FROM sales_data 145 | GROUP BY substr(order_date, 1, 7) 146 | ORDER BY month; 147 | ``` 148 | 👉 This query shows the **monthly revenue trend**. 149 | 150 | --- 151 | 152 | ### 🏢 5. Top Revenue-Generating Cities 153 | ```sql 154 | SELECT city, SUM(total_amount) AS revenue 155 | FROM sales_data 156 | GROUP BY city 157 | ORDER BY revenue DESC 158 | LIMIT 5; 159 | ``` 160 | 👉 This query identifies the **top 5 revenue-generating cities**. 161 | 162 | --- 163 | 164 | ### 💵 6. Most Used Payment Type 165 | ```sql 166 | SELECT payment_type, COUNT(*) AS usage_count 167 | FROM sales_data 168 | GROUP BY payment_type 169 | ORDER BY usage_count DESC; 170 | ``` 171 | 👉 This query shows the **most preferred payment methods**. 172 | 173 | --- 174 | 175 | ## 📊 Step 4: Visualization (Optional) 176 | You can visualize the data using: 177 | - 📊 **Apache Zeppelin**. 
178 | - 📊 **Power BI / Tableau**. 179 | - 💻 **Python (Matplotlib/Seaborn)**. 180 | 181 | Example visualization in **Zeppelin:** 182 | ```sql 183 | %sql 184 | SELECT substr(order_date, 1, 7) AS month, SUM(total_amount) AS monthly_revenue 185 | FROM sales_data 186 | GROUP BY substr(order_date, 1, 7) 187 | ORDER BY month; 188 | ``` 189 | 👉 Convert it into a **Line Chart** to see monthly revenue. 190 | 191 | --- 192 | 193 | ## 📜 Step 5: Business Insights 194 | | Insight | Description | 195 | |---------|-------------| 196 | | 💰 Total Revenue | Understand the overall revenue generated. | 197 | | 🛍 Best-Selling Products | Identify which products are most popular. | 198 | | 👥 Most Active Customers | Track the most loyal customers. | 199 | | 📅 Monthly Revenue Trend | Understand peak seasons and off-seasons. | 200 | | 🏢 Revenue by City | Focus on cities generating maximum revenue. | 201 | | 💵 Payment Preference | Identify the most used payment method. | 202 | 203 | --- 204 | 205 | ## 📊 Future Scope 206 | 1. ✅ **Integrate Apache Kafka** for real-time streaming data. 207 | 2. ✅ Use **Apache Spark** to process data faster. 208 | 3. ✅ Build a **Tableau/Power BI dashboard** for live business insights. 209 | 4. ✅ Connect Hive data to **Flask/Django web app**. 210 | 211 | --- 212 | 213 | ## 💎 Conclusion 214 | This project provides a practical demonstration of: 215 | - ✅ **Big Data Processing** using Hive. 216 | - ✅ Importing data into HDFS. 217 | - ✅ Performing data analysis using HiveQL. 218 | - ✅ Generating business insights from e-commerce sales data. 219 | 220 | 👉 **Next Step:**: 221 | - ✅ Create a real-time dashboard using Zeppelin/Power BI? 222 | - ✅ Automate PDF Report Generation using Python? 223 | - ✅ Deploy this project on a web application using Flask? 224 | -------------------------------------------------------------------------------- /entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Set some sensible defaults 4 | export CORE_CONF_fs_defaultFS=${CORE_CONF_fs_defaultFS:-hdfs://`hostname -f`:8020} 5 | 6 | function addProperty() { 7 | local path=$1 8 | local name=$2 9 | local value=$3 10 | 11 | local entry="$name${value}" 12 | local escapedEntry=$(echo $entry | sed 's/\//\\\//g') 13 | sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path 14 | } 15 | 16 | function configure() { 17 | local path=$1 18 | local module=$2 19 | local envPrefix=$3 20 | 21 | local var 22 | local value 23 | 24 | echo "Configuring $module" 25 | for c in `printenv | perl -sne 'print "$1 " if m/^${envPrefix}_(.+?)=.*/' -- -envPrefix=$envPrefix`; do 26 | name=`echo ${c} | perl -pe 's/___/-/g; s/__/_/g; s/_/./g'` 27 | var="${envPrefix}_${c}" 28 | value=${!var} 29 | echo " - Setting $name=$value" 30 | addProperty $path $name "$value" 31 | done 32 | } 33 | 34 | configure /etc/hadoop/core-site.xml core CORE_CONF 35 | configure /etc/hadoop/hdfs-site.xml hdfs HDFS_CONF 36 | configure /etc/hadoop/yarn-site.xml yarn YARN_CONF 37 | configure /etc/hadoop/httpfs-site.xml httpfs HTTPFS_CONF 38 | configure /etc/hadoop/kms-site.xml kms KMS_CONF 39 | configure /etc/hadoop/mapred-site.xml mapred MAPRED_CONF 40 | configure /opt/hive/conf/hive-site.xml hive HIVE_SITE_CONF 41 | 42 | if [ "$MULTIHOMED_NETWORK" = "1" ]; then 43 | echo "Configuring for multihomed network" 44 | 45 | # HDFS 46 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0 47 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0 48 | addProperty 
/etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0 49 | addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0 50 | addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true 51 | addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true 52 | 53 | # YARN 54 | addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0 55 | addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 56 | addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0 57 | addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0 58 | 59 | # MAPRED 60 | addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0 61 | fi 62 | 63 | if [ -n "$GANGLIA_HOST" ]; then 64 | mv /etc/hadoop/hadoop-metrics.properties /etc/hadoop/hadoop-metrics.properties.orig 65 | mv /etc/hadoop/hadoop-metrics2.properties /etc/hadoop/hadoop-metrics2.properties.orig 66 | 67 | for module in mapred jvm rpc ugi; do 68 | echo "$module.class=org.apache.hadoop.metrics.ganglia.GangliaContext31" 69 | echo "$module.period=10" 70 | echo "$module.servers=$GANGLIA_HOST:8649" 71 | done > /etc/hadoop/hadoop-metrics.properties 72 | 73 | for module in namenode datanode resourcemanager nodemanager mrappmaster jobhistoryserver; do 74 | echo "$module.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31" 75 | echo "$module.sink.ganglia.period=10" 76 | echo "$module.sink.ganglia.supportsparse=true" 77 | echo "$module.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both" 78 | echo "$module.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40" 79 | echo "$module.sink.ganglia.servers=$GANGLIA_HOST:8649" 80 | done > /etc/hadoop/hadoop-metrics2.properties 81 | fi 82 | 83 | function wait_for_it() 84 | { 85 | local serviceport=$1 86 | local service=${serviceport%%:*} 87 | local port=${serviceport#*:} 88 | local retry_seconds=5 89 | local max_try=100 90 | let i=1 91 | 92 | nc -z $service $port 93 | result=$? 94 | 95 | until [ $result -eq 0 ]; do 96 | echo "[$i/$max_try] check for ${service}:${port}..." 97 | echo "[$i/$max_try] ${service}:${port} is not available yet" 98 | if (( $i == $max_try )); then 99 | echo "[$i/$max_try] ${service}:${port} is still not available; giving up after ${max_try} tries. :/" 100 | exit 1 101 | fi 102 | 103 | echo "[$i/$max_try] try in ${retry_seconds}s once again ..." 104 | let "i++" 105 | sleep $retry_seconds 106 | 107 | nc -z $service $port 108 | result=$? 109 | done 110 | echo "[$i/$max_try] $service:${port} is available." 111 | } 112 | 113 | for i in ${SERVICE_PRECONDITION[@]} 114 | do 115 | wait_for_it ${i} 116 | done 117 | 118 | exec $@ 119 | -------------------------------------------------------------------------------- /flume.md: -------------------------------------------------------------------------------- 1 | ### ✅ **What is Apache Flume in Big Data? 🚀** 2 | 3 | --- 4 | 5 | ### 💡 **Definition of Apache Flume:** 6 | 👉 **Apache Flume** is a **data ingestion tool** used to **collect, aggregate, and transfer large volumes of streaming data** (such as **log files, social media data, server logs, IoT data, etc.**) **into Hadoop (HDFS/Hive).** 7 | 8 | --- 9 | 10 | ## ✅ **Why Do We Need Apache Flume? 
🤔** 11 | ### 📊 **Problem:** 12 | Suppose you have: 13 | - ✅ **Millions of log files** generated every second from **Web Servers, IoT devices, Sensors, etc.** 14 | - ✅ Or you have **Streaming Data from Twitter, Facebook, YouTube, etc.** 15 | - ✅ Or you have **Server Logs** from your website. 16 | 17 | 👉 You want to **send this streaming data** into: 18 | - ✅ **HDFS (Hadoop File System)** for storage. 19 | - ✅ **Hive** for querying and analysis. 20 | - ✅ **HBase** for real-time access. 21 | 22 | 👉 **How will you transfer this large streaming data continuously?** 🤔 23 | 24 | --- 25 | 26 | ## ✅ **Solution: Use Apache Flume 💯** 27 | 👉 Apache Flume will **continuously capture streaming data** from: 28 | - ✅ **Web Servers (logs)** 29 | - ✅ **IoT Devices (sensor data)** 30 | - ✅ **Social Media (Twitter, Facebook)** 31 | - ✅ **Application Logs (Tomcat, Apache)** 32 | 33 | 👉 And automatically **push it into Hadoop (HDFS/Hive)** without manual work. 34 | 35 | --- 36 | 37 | ## ✅ **Where is Flume Used in Real Life? 💡** 38 | | Industry | Flume is Used For | 39 | |--------------------------|---------------------------------------------------------------------------------| 40 | | 📊 **E-commerce (Amazon, Flipkart)** | Capturing **user behavior logs**, product clicks, browsing history, etc. | 41 | | 💻 **IT Companies (Google, Facebook)** | Collecting **application logs**, crash logs, web traffic logs, etc. | 42 | | 📡 **IoT Devices (Smart Homes)** | Streaming data from **IoT devices, sensors, CCTV, etc.** | 43 | | 📜 **News Websites** | **Capturing real-time news**, logs, and content from different sources. | 44 | | 🛰️ **Social Media Platforms** | Capturing **tweets, Facebook posts, YouTube comments, etc.** | 45 | 46 | --- 47 | 48 | ## ✅ **How Does Apache Flume Work? 🚀** 49 | 👉 **Apache Flume works on a Pipeline Architecture.** 50 | 51 | ### ✔ **Pipeline = Source → Channel → Sink → Hadoop (HDFS)** 52 | | Component | What it Does | 53 | |--------------|-------------------------------------------------------------------------| 54 | | ✅ **Source** | Collects **data from source (logs, Twitter, IoT, etc.)** | 55 | | ✅ **Channel** | Temporarily stores the data (like a queue or buffer). | 56 | | ✅ **Sink** | Sends data to **HDFS, Hive, or HBase**. | 57 | | ✅ **Hadoop** | Stores the data permanently for analysis. | 58 | 59 | --- 60 | 61 | ## ✅ **Architecture of Apache Flume 🔥** 62 | Here’s how Flume works step-by-step: 63 | 64 | ``` 65 | ┌─────────────────┐ 66 | Data Source --> │ Source │ --> Captures Data (Logs, Twitter, IoT) 67 | └─────────────────┘ 68 | │ 69 | ▼ 70 | ┌─────────────────┐ 71 | Data Buffer --> │ Channel │ --> Holds data temporarily (like a Queue) 72 | └─────────────────┘ 73 | │ 74 | ▼ 75 | ┌─────────────────┐ 76 | Data Storage -->│ Sink │ --> Sends Data to HDFS, Hive, or HBase 77 | └─────────────────┘ 78 | │ 79 | ▼ 80 | ┌───────────────────────┐ 81 | Data in Hadoop│ HDFS / Hive / HBase │ 82 | └───────────────────────┘ 83 | ``` 84 | 85 | --- 86 | 87 | ## ✅ **Example of Apache Flume Use Cases 🚀** 88 | Here are some real-world use cases: 89 | 90 | --- 91 | 92 | ### ✔ **1. Capturing Web Server Logs (Access Logs, Error Logs)** 93 | Suppose you have a website with **1 Billion hits/day** like **Flipkart, Amazon, etc.**. 94 | 95 | 👉 Every hit generates a log file: 96 | ``` 97 | 2025-03-10 12:34:55 INFO User Clicked on Product ID: 2345 98 | 2025-03-10 12:35:00 INFO User Added Product ID: 2345 to Cart 99 | ``` 100 | 101 | 👉 Flume will: 102 | - ✅ **Capture these logs**. 
103 | - ✅ **Stream them to Hadoop (HDFS)** in real-time. 104 | - ✅ You can **analyze it later in Hive**. 105 | 106 | ### **Flume Configuration Example:** 107 | ```properties 108 | # Flume Agent Configuration 109 | agent1.sources = source1 110 | agent1.channels = channel1 111 | agent1.sinks = sink1 112 | 113 | # Source Configuration (Log File) 114 | agent1.sources.source1.type = exec 115 | agent1.sources.source1.command = tail -f /var/log/httpd/access.log 116 | 117 | # Channel Configuration 118 | agent1.channels.channel1.type = memory 119 | 120 | # Sink Configuration (HDFS) 121 | agent1.sinks.sink1.type = hdfs 122 | agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/logs 123 | ``` 124 | 125 | ✅ Flume will **capture log files in real-time** and push them to **HDFS**. 126 | 127 | --- 128 | 129 | ### ✔ **2. Capturing Twitter Data (Trending Hashtags)** 130 | Suppose you want to capture **live tweets** on a trending hashtag like: 131 | ``` 132 | #election2025 133 | #iphone16 134 | #IndiaWins 135 | ``` 136 | 137 | 👉 **Flume can capture these tweets** and push them to **HDFS/Hive** for analysis. 138 | 139 | ### ✅ Flume Twitter Configuration Example: 140 | ```properties 141 | # Source Configuration 142 | agent1.sources.source1.type = org.apache.flume.source.twitter.TwitterSource 143 | agent1.sources.source1.consumerKey = YOUR_CONSUMER_KEY 144 | agent1.sources.source1.consumerSecret = YOUR_CONSUMER_SECRET 145 | agent1.sources.source1.accessToken = YOUR_ACCESS_TOKEN 146 | agent1.sources.source1.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET 147 | 148 | # Sink Configuration (HDFS) 149 | agent1.sinks.sink1.type = hdfs 150 | agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/twitter 151 | ``` 152 | 153 | 👉 ✅ **Flume will capture live tweets** and push them to **HDFS**. 154 | 155 | --- 156 | 157 | ### ✔ **3. IoT Sensor Data (Smart Homes, CCTV, Temperature Sensors)** 158 | Suppose you have: 159 | - ✅ **IoT Sensors (Temperature, Humidity, CCTV)**. 160 | - ✅ You want to capture the data in real-time. 161 | 162 | 👉 Flume will: 163 | - ✅ Continuously read sensor data. 164 | - ✅ Push it to HDFS in real-time. 165 | - ✅ You can then analyze it. 166 | 167 | --- 168 | 169 | ## ✅ **Types of Flume Channels 🚀** 170 | | Channel Type | Use Case | 171 | |-----------------|-----------------------------------------------------------------| 172 | | ✅ **Memory Channel** | Fastest but not durable (if Flume crashes, data is lost). | 173 | | ✅ **File Channel** | Slower but data is saved even if Flume crashes. | 174 | | ✅ **Kafka Channel** | Highly scalable and fault-tolerant (best for production). | 175 | 176 | --- 177 | 178 | ## ✅ **Why Is Flume Better Than Manual Data Transfer? 🚀** 179 | | Feature | Manual File Transfer | Apache Flume | 180 | |--------------------------|------------------------|----------------------------------------| 181 | | **Data Transfer Speed** | Very Slow | Lightning Fast 🚀 | 182 | | **Streaming Data** | Impossible | Handles Real-time Streaming 🚀 | 183 | | **Data Loss** | High | Zero Data Loss (Fault-tolerant) | 184 | | **Automation** | Manual Effort | Fully Automated | 185 | | **Big Data Compatibility**| Not Possible | Integrates with Hadoop, Hive, HBase | 186 | 187 | --- 188 | 189 | ## ✅ **Where Does Apache Flume Send Data? 
🚀** 190 | | Data Source | Flume Can Send Data To | 191 | |--------------------------|-----------------------------------------------------| 192 | | ✅ **Log Files** | **HDFS / Hive / HBase / Kafka** | 193 | | ✅ **Social Media** | **Hive / Spark / ElasticSearch** | 194 | | ✅ **IoT Devices** | **Hadoop / MongoDB / Kafka** | 195 | | ✅ **Web Server Logs** | **HDFS / Hive / Kafka** | 196 | 197 | --- 198 | 199 | ## ✅ **Why Is Flume So Powerful? 💯** 200 | 👉 Flume can: 201 | - ✅ **Ingest Terabytes of Data/Hour.** 202 | - ✅ Handle **Millions of Streaming Logs/Second**. 203 | - ✅ Push data to **Hadoop, Hive, HBase, Kafka, etc.** 204 | - ✅ Fully Automated. 205 | - ✅ Real-time Data Processing. 206 | 207 | --- 208 | 209 | ## ✅ **🔥 Final Answer** 210 | 👉 **Apache Flume** is used for: 211 | - ✅ **Real-time streaming data capture.** 212 | - ✅ **Log file ingestion from web servers.** 213 | - ✅ **Capturing social media data (Twitter, YouTube, etc.).** 214 | - ✅ **Moving IoT data (sensors, CCTV) to Hadoop.** 215 | 216 | --- 217 | 218 | 219 | ### **Here is a complete step-by-step guide to install Apache Flume on top of your Hadoop setup and demonstrate a working example:** 220 | 221 | --- 222 | 223 | ### **Step 1: Install Apache Flume** 224 | 225 | 1. **Download Apache Flume** 226 | Visit the official Apache Flume [download page](https://flume.apache.org/download.html) or use `wget` to download the latest binary tarball directly: 227 | ```bash 228 | wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz 229 | ``` 230 | 231 | 2. **Extract the Tarball** 232 | Extract the downloaded tarball: 233 | ```bash 234 | tar -xvzf apache-flume-1.9.0-bin.tar.gz 235 | ``` 236 | 237 | 3. **Move the Folder** 238 | Move the extracted folder to `/usr/local/flume`: 239 | ```bash 240 | mv apache-flume-1.9.0-bin /usr/local/flume 241 | ``` 242 | 243 | 4. **Set Environment Variables** 244 | Add Flume to your `PATH` by editing the `~/.bashrc` file: 245 | ```bash 246 | nano ~/.bashrc 247 | ``` 248 | Add the following lines at the end of the file: 249 | ```bash 250 | export FLUME_HOME=/usr/local/flume 251 | export PATH=$PATH:$FLUME_HOME/bin 252 | ``` 253 | Reload the environment variables: 254 | ```bash 255 | source ~/.bashrc 256 | ``` 257 | 258 | 5. **Verify Installation** 259 | Check Flume's version: 260 | ```bash 261 | flume-ng version 262 | ``` 263 | ![image](https://github.com/user-attachments/assets/14fd9825-4efe-4c17-9167-3feab67710ac) 264 | 265 | --- 266 | 267 | ### **Step 2: Configure Flume** 268 | 269 | 1. Navigate to the Flume configuration directory: 270 | ```bash 271 | cd /usr/local/flume/conf 272 | ``` 273 | 274 | 2. Create a new Flume agent configuration file: 275 | ```bash 276 | nano demo-agent.conf 277 | ``` 278 | 279 | 3. 
Add the following content to define the Flume agent configuration: 280 | ```properties 281 | # Define the agent components 282 | demo.sources = source1 283 | demo.sinks = sink1 284 | demo.channels = channel1 285 | 286 | # Define the source 287 | demo.sources.source1.type = netcat 288 | demo.sources.source1.bind = localhost 289 | demo.sources.source1.port = 44444 290 | 291 | # Define the sink (HDFS) 292 | demo.sinks.sink1.type = hdfs 293 | demo.sinks.sink1.hdfs.path = hdfs://localhost:9000/user/flume/demo 294 | demo.sinks.sink1.hdfs.fileType = DataStream 295 | 296 | # Define the channel 297 | demo.channels.channel1.type = memory 298 | demo.channels.channel1.capacity = 1000 299 | demo.channels.channel1.transactionCapacity = 100 300 | 301 | # Bind the source and sink to the channel 302 | demo.sources.source1.channels = channel1 303 | demo.sinks.sink1.channel = channel1 304 | ``` 305 | 306 | Replace `localhost` with your Hadoop Namenode hostname or IP address. 307 | you can find it using cat $HADOOP_HOME/etc/hadoop/core-site.xml 308 | --- 309 | 310 | ### **Step 3: Start Flume Agent** 311 | 312 | Run the Flume agent using the configuration file: 313 | ```bash 314 | flume-ng agent \ 315 | --conf /usr/local/flume/conf \ 316 | --conf-file /usr/local/flume/conf/demo-agent.conf \ 317 | --name demo \ 318 | -Dflume.root.logger=INFO,console 319 | ``` 320 | 321 | This starts the Flume agent with the name `demo` and logs activities to the console. 322 | 323 | ![image](https://github.com/user-attachments/assets/8dcae12e-2b1f-490b-ae07-f052040b3c7d) 324 | 325 | --- 326 | If you're facing error `bash: nc: command not found` indicates that the `netcat` (`nc`) utility is not installed in your container. Netcat is required to send data to the Flume source. 327 | 328 | ### **Steps to Resolve** 329 | 330 | 1. **Install Netcat in the Container** 331 | - Install `netcat` using the package manager inside the container: 332 | ```bash 333 | apt-get update 334 | apt-get install netcat -y 335 | ``` 336 | - Verify the installation: 337 | ```bash 338 | nc -h 339 | ``` 340 | 341 | 2. **Test the Netcat Command Again** 342 | After installing `netcat`, retry the command to send data to Flume: 343 | ```bash 344 | echo "Hello Flume Demo" | nc localhost 44444 345 | ``` 346 | 347 | 3. **Verify Data in Flume Sink** 348 | - Check the configured HDFS path or the file sink location to verify that the message has been captured by the Flume agent. 349 | 350 | --- 351 | 352 | 353 | ### **Step 4: Test Flume Data Flow** 354 | 355 | 1. **Send Data to Flume Source** 356 | Open another terminal and send data to the Netcat source using the `nc` command: 357 | ```bash 358 | echo "Hello Flume Demo" | nc localhost 44444 359 | ``` 360 | ![image](https://github.com/user-attachments/assets/1290fb3e-cdac-4265-8c3a-067265783963) 361 | 362 | Send multiple lines of data: 363 | ```bash 364 | for i in {1..5}; do echo "This is message $i" | nc localhost 44444; done 365 | ``` 366 | ![image](https://github.com/user-attachments/assets/e2cc2a42-7f26-4b6b-81cc-5102a1f39a7f) 367 | 368 | 1. **Verify Data in HDFS** 369 | Check the HDFS directory where Flume is writing data: 370 | ```bash 371 | hadoop fs -ls /user/flume/demo 372 | ``` 373 | View the ingested data files: 374 | ```bash 375 | hadoop fs -cat /user/flume/demo/* 376 | ``` 377 | 378 | You should see the messages sent via `Netcat`. 
379 | ![image](https://github.com/user-attachments/assets/9460b9d8-8ba4-4788-a318-a55bac5a27d3) 380 | 381 | --- 382 | 383 | ### **Step 5: Optional Customizations** 384 | 385 | 1. **Roll Policies** 386 | Adjust roll policies in the sink configuration: 387 | - **Roll by file size**: 388 | ```properties 389 | demo.sinks.sink1.hdfs.rollSize = 1048576 # 1MB 390 | ``` 391 | - **Roll by time interval**: 392 | ```properties 393 | demo.sinks.sink1.hdfs.rollInterval = 300 # 5 minutes 394 | ``` 395 | - **Roll by event count**: 396 | ```properties 397 | demo.sinks.sink1.hdfs.rollCount = 1000 398 | ``` 399 | 400 | 2. **Monitoring and Logging** 401 | Configure monitoring and logging in `flume-env.sh` and `log4j.properties`. 402 | 403 | --- 404 | 405 | ### **Expected Results** 406 | 407 | 1. **Flume Console Output** 408 | You will see logs showing Flume processing events and writing them to HDFS. 409 | 410 | 2. **HDFS Data** 411 | The ingested data in HDFS will look like this: 412 | ``` 413 | Hello Flume Demo 414 | This is message 1 415 | This is message 2 416 | This is message 3 417 | ``` 418 | 419 | --- 420 | 421 | ### **Troubleshooting** 422 | 423 | - **Agent Fails to Start**: 424 | Check the logs for configuration errors: 425 | ```bash 426 | cat /usr/local/flume/logs/flume.log 427 | ``` 428 | 429 | - **Data Not in HDFS**: 430 | Ensure the `namenode_host` in the sink configuration is correct and that the HDFS path is writable. 431 | 432 | --- 433 | -------------------------------------------------------------------------------- /hadoop-basic-commands.md: -------------------------------------------------------------------------------- 1 | Here's a **comprehensive list of HDFS commands** with easy-to-understand instructions for quick reference: 2 | ![image](https://github.com/user-attachments/assets/1ea3ba32-1b68-4584-b521-3f5e6f5c6ffb) 3 | 4 | --- 5 | 6 | ## 📂 **HDFS Commands - Complete Reference** 7 | 8 | --- 9 | 10 | ### **1. Basic File Operations** 11 | 12 | #### 📄 **Create a new file locally** 13 | Create a file on your local system: 14 | ```bash 15 | echo "This is a sample file" > localfile.txt 16 | ``` 17 | 18 | #### 📤 **Upload a local file to HDFS** 19 | Upload a local file to HDFS: 20 | ```bash 21 | hdfs dfs -put localfile.txt /user/hadoop/destination-path 22 | ``` 23 | 24 | #### ⬇️ **Download a file from HDFS to the local file system** 25 | Use the `-get` command to copy files from HDFS to the local system: 26 | ```bash 27 | hdfs dfs -get /path/to/hdfspath /localpath 28 | ``` 29 | 30 | #### 🖼️ **View the file content from HDFS** 31 | View the contents of a file directly without copying it: 32 | ```bash 33 | hdfs dfs -cat /path/to/file 34 | ``` 35 | 36 | #### ✍️ **Append content to an HDFS file** 37 | Append local file content to an existing file on HDFS: 38 | ```bash 39 | hdfs dfs -appendToFile localfile.txt /user/hadoop/hdfspath 40 | ``` 41 | 42 | --- 43 | 44 | ### **2. 
Directory Operations** 45 | 46 | #### 📁 **Create a directory** 47 | Create a new directory in HDFS: 48 | ```bash 49 | hdfs dfs -mkdir /path/to/directory 50 | ``` 51 | 52 | #### 🛠️ **Create multiple directories** 53 | Create multiple directories in a single command: 54 | ```bash 55 | hdfs dfs -mkdir -p /path/to/dir1 /path/to/dir2 56 | ``` 57 | 58 | #### 🧑‍💻 **Check directory usage with summary** 59 | View the disk usage of a directory in human-readable format: 60 | ```bash 61 | hdfs dfs -du -s -h /path/to/directory 62 | ``` 63 | 64 | #### 📑 **List contents of a directory** 65 | List the files in a directory on HDFS: 66 | ```bash 67 | hdfs dfs -ls /path/to/directory 68 | ``` 69 | 70 | --- 71 | 72 | ### **3. File Operations** 73 | 74 | #### ✏️ **Rename or move a file in HDFS** 75 | Rename or move a file within HDFS: 76 | ```bash 77 | hdfs dfs -mv /path/to/oldfile /path/to/newfile 78 | ``` 79 | 80 | #### 📦 **Copy a file within HDFS** 81 | Copy a file from one location in HDFS to another: 82 | ```bash 83 | hdfs dfs -cp /path/to/source /path/to/destination 84 | ``` 85 | 86 | #### 🗂️ **Count files, directories, and bytes in HDFS** 87 | Get the count of files, directories, and the total byte size in a directory: 88 | ```bash 89 | hdfs dfs -count /path/to/directory 90 | ``` 91 | 92 | #### 📝 **Display the first few lines of a file** 93 | View the first few lines of a file: 94 | ```bash 95 | hdfs dfs -head /path/to/file 96 | ``` 97 | 98 | #### 📚 **Display the last few lines of a file** 99 | View the last few lines of a file: 100 | ```bash 101 | hdfs dfs -tail /path/to/file 102 | ``` 103 | 104 | #### 🔒 **Display file checksum** 105 | Verify file integrity by checking the checksum: 106 | ```bash 107 | hdfs dfs -checksum /path/to/file 108 | ``` 109 | 110 | --- 111 | 112 | ### **4. File Permission and Ownership** 113 | 114 | #### 🔧 **Change file or directory permissions** 115 | Change the permissions of a file or directory: 116 | ```bash 117 | hdfs dfs -chmod 755 /path/to/file-or-directory 118 | ``` 119 | 120 | #### 🧑‍🔧 **Change file or directory ownership** 121 | Change the ownership of a file or directory: 122 | ```bash 123 | hdfs dfs -chown user:group /path/to/file-or-directory 124 | ``` 125 | 126 | #### 📊 **Set file replication factor** 127 | Change the replication factor of a file or directory: 128 | ```bash 129 | hdfs dfs -setrep -w 3 /path/to/file-or-directory 130 | ``` 131 | 132 | --- 133 | 134 | ### **5. Data Verification and Repair** 135 | 136 | #### 🛡️ **Verify the file checksum** 137 | Check if the file’s checksum matches its original value: 138 | ```bash 139 | hdfs dfs -checksum /path/to/file 140 | ``` 141 | 142 | #### 🛠️ **Recover corrupted blocks in HDFS** 143 | Recover corrupted files by moving or deleting bad blocks: 144 | ```bash 145 | hdfs fsck /path/to/file -move -delete 146 | ``` 147 | 148 | --- 149 | 150 | ### **6. Data Migration and Export** 151 | 152 | #### 📤 **Export a directory to the local filesystem** 153 | Copy a directory from HDFS to a local file system: 154 | ```bash 155 | hdfs dfs -get /path/to/hdfspath /localpath 156 | ``` 157 | 158 | #### 🔄 **Export a file from one HDFS directory to another** 159 | Copy a file from one HDFS location to another: 160 | ```bash 161 | hdfs dfs -cp /path/to/hdfspath /new/path/to/hdfspath 162 | ``` 163 | 164 | --- 165 | 166 | ### **7. 
File System Check** 167 | 168 | #### 🏥 **Check the health of HDFS** 169 | Perform a health check on HDFS and get details about block and file status: 170 | ```bash 171 | hdfs fsck / -files -blocks -locations 172 | ``` 173 | 174 | #### 📈 **Check block replication status** 175 | View block replication details and the location of blocks: 176 | ```bash 177 | hdfs fsck / -blocks -locations 178 | ``` 179 | 180 | --- 181 | 182 | ### **8. HDFS Admin Commands** 183 | 184 | #### 🔍 **Show HDFS file system status** 185 | Get a report on the status and health of the HDFS system: 186 | ```bash 187 | hdfs dfsadmin -report 188 | ``` 189 | 190 | #### 🛑 **Enable safemode** 191 | Enter HDFS safemode (used for maintenance operations): 192 | ```bash 193 | hdfs dfsadmin -safemode enter 194 | ``` 195 | 196 | #### 🚪 **Disable safemode** 197 | Exit from HDFS safemode: 198 | ```bash 199 | hdfs dfsadmin -safemode leave 200 | ``` 201 | 202 | #### 📊 **Check safemode status** 203 | Check if HDFS is in safemode: 204 | ```bash 205 | hdfs dfsadmin -safemode get 206 | ``` 207 | 208 | #### 🧑‍🔧 **Decommission a DataNode** 209 | Remove a DataNode from the cluster (by updating the `dfs.exclude` file): 210 | ```bash 211 | hdfs dfsadmin -refreshNodes 212 | ``` 213 | 214 | --- 215 | 216 | ### **9. YARN Commands** 217 | 218 | #### 🖥️ **Resource Manager Operations** 219 | 220 | ##### 📊 **Check cluster metrics** 221 | Get detailed metrics for the YARN cluster: 222 | ```bash 223 | yarn cluster -metrics 224 | ``` 225 | 226 | ##### 🔍 **View NodeManager details** 227 | List the details of NodeManagers in the YARN cluster: 228 | ```bash 229 | yarn node -list 230 | ``` 231 | 232 | #### 🧑‍💻 **Container Management** 233 | 234 | ##### 📋 **List containers for an application** 235 | List the containers running for a specific application: 236 | ```bash 237 | yarn container -list 238 | ``` 239 | 240 | ##### ⛔ **Kill a specific container** 241 | Terminate a running container: 242 | ```bash 243 | yarn container -kill 244 | ``` 245 | 246 | --- 247 | 248 | ### **10. General Hadoop Commands** 249 | 250 | #### 🆘 **Display all Hadoop-related commands** 251 | Get a list of all Hadoop commands: 252 | ```bash 253 | hadoop -help 254 | ``` 255 | 256 | #### 📚 **Display help for specific HDFS commands** 257 | Get detailed help for HDFS commands: 258 | ```bash 259 | hdfs dfs -help 260 | ``` 261 | 262 | #### 📄 **Display help for YARN commands** 263 | Get detailed help for YARN commands: 264 | ```bash 265 | yarn -help 266 | ``` 267 | 268 | --- 269 | 270 | ### **11. General Tips for Hadoop** 271 | 272 | - **Use aliases for commonly used commands** 273 | Save time by creating aliases for frequently used commands. Add these to your `.bashrc` or `.zshrc`: 274 | ```bash 275 | alias hls="hdfs dfs -ls" 276 | alias hput="hdfs dfs -put" 277 | alias hget="hdfs dfs -get" 278 | ``` 279 | 280 | - **Use `-help` with any Hadoop command** 281 | To learn more options and features, always try `-help` with any Hadoop command: 282 | ```bash 283 | hdfs dfs -help 284 | yarn -help 285 | hadoop -help 286 | ``` 287 | 288 | --- 289 | 290 | By following these instructions, you will be able to easily manage and manipulate files, directories, and resources in Hadoop Distributed File System (HDFS) and YARN. 
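As a quick self-check, the short session below strings several of the commands above into one end-to-end round trip (create, upload, inspect, download, clean up). The directory `/user/hadoop/demo` and the file names are only illustrative; substitute paths your HDFS user actually owns.

```bash
# Create a small local file to work with
echo "hello hdfs" > demo.txt

# Create a working directory in HDFS and upload the file
hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put demo.txt /user/hadoop/demo/

# Inspect what was written
hdfs dfs -ls /user/hadoop/demo
hdfs dfs -cat /user/hadoop/demo/demo.txt

# Copy it back to the local filesystem under a new name
hdfs dfs -get /user/hadoop/demo/demo.txt demo_copy.txt

# Remove the HDFS directory once you are done
hdfs dfs -rm -r /user/hadoop/demo
```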
291 | -------------------------------------------------------------------------------- /hadoop-hive.env: -------------------------------------------------------------------------------- 1 | HIVE_SITE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore 2 | HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName=org.postgresql.Driver 3 | HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive 4 | HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive 5 | HIVE_SITE_CONF_datanucleus_autoCreateSchema=false 6 | HIVE_SITE_CONF_hive_metastore_uris=thrift://hive-metastore:9083 7 | HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false 8 | 9 | CORE_CONF_fs_defaultFS=hdfs://namenode:9000 10 | CORE_CONF_hadoop_http_staticuser_user=root 11 | CORE_CONF_hadoop_proxyuser_hue_hosts=* 12 | CORE_CONF_hadoop_proxyuser_hue_groups=* 13 | 14 | HDFS_CONF_dfs_webhdfs_enabled=true 15 | HDFS_CONF_dfs_permissions_enabled=false 16 | 17 | YARN_CONF_yarn_log___aggregation___enable=true 18 | YARN_CONF_yarn_resourcemanager_recovery_enabled=true 19 | YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore 20 | YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate 21 | YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs 22 | YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/ 23 | YARN_CONF_yarn_timeline___service_enabled=true 24 | YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true 25 | YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true 26 | YARN_CONF_yarn_resourcemanager_hostname=resourcemanager 27 | YARN_CONF_yarn_timeline___service_hostname=historyserver 28 | YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032 29 | YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030 30 | YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031 31 | -------------------------------------------------------------------------------- /hadoop.env: -------------------------------------------------------------------------------- 1 | CORE_CONF_fs_defaultFS=hdfs://namenode:9000 2 | CORE_CONF_hadoop_http_staticuser_user=root 3 | CORE_CONF_hadoop_proxyuser_hue_hosts=* 4 | CORE_CONF_hadoop_proxyuser_hue_groups=* 5 | CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec 6 | 7 | HDFS_CONF_dfs_webhdfs_enabled=true 8 | HDFS_CONF_dfs_permissions_enabled=false 9 | HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false 10 | 11 | YARN_CONF_yarn_log___aggregation___enable=true 12 | YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/ 13 | YARN_CONF_yarn_resourcemanager_recovery_enabled=true 14 | YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore 15 | YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler 16 | YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192 17 | YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4 18 | YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate 19 | YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true 20 | YARN_CONF_yarn_resourcemanager_hostname=resourcemanager 21 | YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032 22 | 
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030 23 | YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031 24 | YARN_CONF_yarn_timeline___service_enabled=true 25 | YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true 26 | YARN_CONF_yarn_timeline___service_hostname=historyserver 27 | YARN_CONF_mapreduce_map_output_compress=true 28 | YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec 29 | YARN_CONF_yarn_nodemanager_resource_memory___mb=16384 30 | YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8 31 | YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5 32 | YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs 33 | YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle 34 | 35 | MAPRED_CONF_mapreduce_framework_name=yarn 36 | MAPRED_CONF_mapred_child_java_opts=-Xmx4096m 37 | MAPRED_CONF_mapreduce_map_memory_mb=4096 38 | MAPRED_CONF_mapreduce_reduce_memory_mb=8192 39 | MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m 40 | MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m 41 | MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/ 42 | MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/ 43 | MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.2.1/ 44 | -------------------------------------------------------------------------------- /hadoop_installation_VMware Workstation.md: -------------------------------------------------------------------------------- 1 | Guide for installing **Hadoop 3.3.6 on Ubuntu 24.04** in a **VMware virtual machine**. This guide includes **troubleshooting tips, verification steps, and SSH configuration** to ensure a **properly working** single-node (pseudo-distributed) Hadoop setup. 2 | 3 | --- 4 | 5 | # **🚀 Complete Guide to Installing Hadoop 3.3.6 on Ubuntu 24.04 (VMware)** 6 | This guide covers: 7 | ✅ Installing **Hadoop 3.3.6** on **Ubuntu 24.04** 8 | ✅ Configuring **HDFS, YARN, and MapReduce** 9 | ✅ Setting up **passwordless SSH** 10 | ✅ Ensuring **proper Java installation** 11 | ✅ Troubleshooting common issues 12 | 13 | --- 14 | 15 | ## **1️⃣ Prerequisites** 16 | Before starting, ensure: 17 | ✔ You have **Ubuntu 24.04** running in **VMware Workstation**. 18 | ✔ At least **4GB RAM**, **50GB disk space**, and **4 CPU cores** are allocated to the VM. 19 | ✔ Java **8 or later** is installed. 20 | 21 | --- 22 | 23 | ## **2️⃣ Update Ubuntu Packages** 24 | Update system packages to avoid dependency issues: 25 | ```bash 26 | sudo apt update && sudo apt upgrade -y 27 | ``` 28 | 29 | --- 30 | 31 | ## **3️⃣ Install Java (OpenJDK 11)** 32 | Hadoop requires Java. 
The recommended version is **OpenJDK 11**: 33 | ```bash 34 | sudo apt install openjdk-11-jdk -y 35 | ``` 36 | Verify installation: 37 | ```bash 38 | java -version 39 | ``` 40 | Expected output (may vary slightly): 41 | ``` 42 | openjdk version "11.0.20" 2024-XX-XX 43 | ``` 44 | **Alternative:** If you need Java 8 for compatibility, install it using: 45 | ```bash 46 | sudo apt install openjdk-8-jdk -y 47 | ``` 48 | 49 | --- 50 | 51 | ## **4️⃣ Create a Hadoop User (Optional - Skip for now)** 52 | Instead of using root or your personal user, create a dedicated **hadoop** user: 53 | ```bash 54 | sudo adduser hadoop 55 | ``` 56 | Add your user to the `sudo` group: 57 | ```bash 58 | sudo usermod -aG sudo hadoop 59 | ``` 60 | Switch to the `hadoop` user: 61 | ```bash 62 | su - hadoop 63 | ``` 64 | 65 | --- 66 | 67 | ## **5️⃣ Download & Install Hadoop 3.3.6** 68 | Navigate to the **Apache Hadoop downloads page**: 69 | 🔗 [https://hadoop.apache.org/releases.html](https://hadoop.apache.org/releases.html) 70 | 71 | Download Hadoop 3.3.6: 72 | ```bash 73 | wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz 74 | ``` 75 | Verify the file integrity (optional but recommended): 76 | ```bash 77 | sha512sum hadoop-3.3.6.tar.gz 78 | ``` 79 | Compare the hash with the one on the official website. 80 | 81 | Extract Hadoop: 82 | ```bash 83 | tar -xvzf hadoop-3.3.6.tar.gz 84 | ``` 85 | Move it to `/usr/local/`: 86 | ```bash 87 | sudo mv hadoop-3.3.6 /usr/local/hadoop 88 | ``` 89 | Set permissions: 90 | ```bash 91 | sudo chown -R $USER:$USER /usr/local/hadoop 92 | ``` 93 | 94 | --- 95 | 96 | ## **6️⃣ Configure Hadoop Environment Variables** 97 | Edit the `~/.bashrc` file: 98 | ```bash 99 | nano ~/.bashrc 100 | ``` 101 | Add these lines at the end: 102 | ```bash 103 | # Hadoop Environment Variables 104 | export HADOOP_HOME=/usr/local/hadoop 105 | export HADOOP_INSTALL=$HADOOP_HOME 106 | export HADOOP_MAPRED_HOME=$HADOOP_HOME 107 | export HADOOP_COMMON_HOME=$HADOOP_HOME 108 | export HADOOP_HDFS_HOME=$HADOOP_HOME 109 | export HADOOP_YARN_HOME=$HADOOP_HOME 110 | export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop 111 | export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin 112 | ``` 113 | Save & exit (CTRL+X → Y → ENTER). 114 | 115 | Apply changes: 116 | ```bash 117 | source ~/.bashrc 118 | ``` 119 | Verify: 120 | ```bash 121 | echo $HADOOP_HOME 122 | ``` 123 | Expected output: `/usr/local/hadoop` 124 | 125 | --- 126 | 127 | ## **7️⃣ Configure Hadoop Core Files** 128 | ### **1️⃣ Configure `hadoop-env.sh`** 129 | Edit Hadoop environment configuration: 130 | ```bash 131 | nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh 132 | ``` 133 | Find the line: 134 | ```bash 135 | export JAVA_HOME= 136 | ``` 137 | Replace it with: (Replace `11` with `8` if you installed JDK 8) 138 | ```bash 139 | export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 140 | ``` 141 | Save & exit. 142 | 143 | --- 144 | 145 | ### **2️⃣ Configure `core-site.xml`** 146 | Edit: 147 | ```bash 148 | nano $HADOOP_HOME/etc/hadoop/core-site.xml 149 | ``` 150 | Replace existing content with: 151 | ```xml 152 | 153 | 154 | fs.defaultFS 155 | hdfs://localhost:9000 156 | 157 | 158 | ``` 159 | Save & exit. 
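Once the environment variables from step 6️⃣ are loaded, you can confirm that Hadoop actually picked up this value by querying the configuration directly (no daemons need to be running for this). The expected output shown below assumes the `hdfs://localhost:9000` URI configured above.

```bash
# Print the filesystem URI Hadoop resolved from core-site.xml
hdfs getconf -confKey fs.defaultFS
# Expected output:
# hdfs://localhost:9000
```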
160 | 161 | --- 162 | 163 | ### **3️⃣ Configure `hdfs-site.xml`** 164 | Edit: 165 | ```bash 166 | nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml 167 | ``` 168 | Add: 169 | ```xml 170 | 171 | 172 | dfs.replication 173 | 1 174 | 175 | 176 | dfs.name.dir 177 | file:///usr/local/hadoop/hdfs/namenode 178 | 179 | 180 | dfs.data.dir 181 | file:///usr/local/hadoop/hdfs/datanode 182 | 183 | 184 | ``` 185 | Create necessary directories: 186 | ```bash 187 | mkdir -p /usr/local/hadoop/hdfs/namenode 188 | mkdir -p /usr/local/hadoop/hdfs/datanode 189 | ``` 190 | Set permissions: 191 | ```bash 192 | sudo chown -R $USER:$USER /usr/local/hadoop/hdfs 193 | ``` 194 | 195 | --- 196 | 197 | ### **4️⃣ Configure `mapred-site.xml`** 198 | ```bash 199 | cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml 200 | nano $HADOOP_HOME/etc/hadoop/mapred-site.xml 201 | ``` 202 | Add: 203 | ```xml 204 | 205 | 206 | mapreduce.framework.name 207 | yarn 208 | 209 | 210 | yarn.app.mapreduce.am.env 211 | HADOOP_MAPRED_HOME=/usr/local/hadoop 212 | 213 | 214 | mapreduce.map.env 215 | HADOOP_MAPRED_HOME=/usr/local/hadoop 216 | 217 | 218 | mapreduce.reduce.env 219 | HADOOP_MAPRED_HOME=/usr/local/hadoop 220 | 221 | 222 | ``` 223 | Save & exit. 224 | 225 | --- 226 | 227 | ### **5️⃣ Configure `yarn-site.xml`** 228 | ```bash 229 | nano $HADOOP_HOME/etc/hadoop/yarn-site.xml 230 | ``` 231 | Add: 232 | ```xml 233 | 234 | 235 | yarn.nodemanager.aux-services 236 | mapreduce_shuffle 237 | 238 | 239 | ``` 240 | Save & exit. 241 | 242 | --- 243 | 244 | ## **8️⃣ Configure Passwordless SSH** 245 | ```bash 246 | sudo apt install ssh -y 247 | ssh-keygen -t rsa -P "" 248 | cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 249 | chmod 600 ~/.ssh/authorized_keys 250 | ``` 251 | Test SSH: 252 | ```bash 253 | ssh localhost 254 | ``` 255 | 256 | --- 257 | 258 | ## **9️⃣ Format the Namenode & Start Hadoop** 259 | Format the Namenode: 260 | ```bash 261 | hdfs namenode -format 262 | ``` 263 | Start HDFS: 264 | ```bash 265 | start-dfs.sh 266 | ``` 267 | Start YARN: 268 | ```bash 269 | start-yarn.sh 270 | ``` 271 | Verify: 272 | ```bash 273 | jps 274 | ``` 275 | Expected output: 276 | ``` 277 | NameNode 278 | DataNode 279 | ResourceManager 280 | NodeManager 281 | ``` 282 | 283 | --- 284 | 285 | ## **✅ Verify Hadoop Installation** 286 | 📌 Open a browser and go to: 287 | ✔ HDFS Web UI → **http://localhost:9870/** 288 | ✔ YARN Web UI → **http://localhost:8088/** 289 | 290 | --- 291 | 292 | 🎉 **Congratulations!** You have successfully installed **Hadoop 3.3.6** on **Ubuntu 24.04 (VMware)**! 
😊 293 | -------------------------------------------------------------------------------- /historyserver/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:8188/ || exit 1 6 | 7 | ENV YARN_CONF_yarn_timeline___service_leveldb___timeline___store_path=/hadoop/yarn/timeline 8 | RUN mkdir -p /hadoop/yarn/timeline 9 | VOLUME /hadoop/yarn/timeline 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | EXPOSE 8188 15 | 16 | CMD ["/run.sh"] 17 | -------------------------------------------------------------------------------- /historyserver/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/yarn --config $HADOOP_CONF_DIR historyserver 4 | -------------------------------------------------------------------------------- /master/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-base:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | COPY master.sh / 6 | 7 | ENV SPARK_MASTER_PORT 7077 8 | ENV SPARK_MASTER_WEBUI_PORT 8080 9 | ENV SPARK_MASTER_LOG /spark/logs 10 | 11 | EXPOSE 8080 7077 6066 12 | 13 | CMD ["/bin/bash", "/master.sh"] 14 | -------------------------------------------------------------------------------- /master/README.md: -------------------------------------------------------------------------------- 1 | # Spark master 2 | 3 | See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark). -------------------------------------------------------------------------------- /master/master.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | export SPARK_MASTER_HOST=`hostname` 4 | 5 | . "/spark/sbin/spark-config.sh" 6 | 7 | . "/spark/bin/load-spark-env.sh" 8 | 9 | mkdir -p $SPARK_MASTER_LOG 10 | 11 | export SPARK_HOME=/spark 12 | 13 | ln -sf /dev/stdout $SPARK_MASTER_LOG/spark-master.out 14 | 15 | cd /spark/bin && /spark/sbin/../bin/spark-class org.apache.spark.deploy.master.Master \ 16 | --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG/spark-master.out 17 | -------------------------------------------------------------------------------- /namenode/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:9870/ || exit 1 6 | 7 | ENV HDFS_CONF_dfs_namenode_name_dir=file:///hadoop/dfs/name 8 | RUN mkdir -p /hadoop/dfs/name 9 | VOLUME /hadoop/dfs/name 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | EXPOSE 9870 15 | 16 | CMD ["/run.sh"] 17 | -------------------------------------------------------------------------------- /namenode/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | namedir=`echo $HDFS_CONF_dfs_namenode_name_dir | perl -pe 's#file://##'` 4 | if [ ! 
-d $namedir ]; then 5 | echo "Namenode name directory not found: $namedir" 6 | exit 2 7 | fi 8 | 9 | if [ -z "$CLUSTER_NAME" ]; then 10 | echo "Cluster name not specified" 11 | exit 2 12 | fi 13 | 14 | echo "remove lost+found from $namedir" 15 | rm -r $namedir/lost+found 16 | 17 | if [ "`ls -A $namedir`" == "" ]; then 18 | echo "Formatting namenode name directory: $namedir" 19 | $HADOOP_HOME/bin/hdfs --config $HADOOP_CONF_DIR namenode -format $CLUSTER_NAME 20 | fi 21 | 22 | $HADOOP_HOME/bin/hdfs --config $HADOOP_CONF_DIR namenode 23 | -------------------------------------------------------------------------------- /nginx/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM nginx 2 | 3 | MAINTAINER "Ivan Ermilov " 4 | 5 | COPY default.conf /etc/nginx/conf.d/default.conf 6 | COPY materialize.min.css /data/bde-css/materialize.min.css 7 | COPY bde-hadoop.css /data/bde-css/bde-hadoop.css 8 | -------------------------------------------------------------------------------- /nginx/bde-hadoop.css: -------------------------------------------------------------------------------- 1 | body { 2 | background: #F1F1F1; 3 | } 4 | 5 | body > .container { 6 | margin: 5rem auto; 7 | background: white; 8 | box-shadow: 0 2px 5px 0 rgba(0,0,0,0.16), 0 2px 10px 0 rgba(0,0,0,0.12); 9 | } 10 | 11 | header.bs-docs-nav { 12 | position: fixed; 13 | top: 0; 14 | left: 0; 15 | width: 100%; 16 | height: 3rem; 17 | border: none; 18 | background: #A94F74; 19 | box-shadow: 0 2px 5px 0 rgba(0,0,0,0.16), 0 2px 10px 0 rgba(0,0,0,0.12); 20 | } 21 | 22 | header.bs-docs-nav .navbar-brand { 23 | background: inherit; 24 | } 25 | 26 | #ui-tabs .active a { 27 | background: #B96A8B; 28 | } 29 | 30 | #ui-tabs > li > a { 31 | color: white; 32 | } 33 | 34 | .navbar-inverse .navbar-nav > .dropdown > a .caret { 35 | border-top-color: white; 36 | border-bottom-color: white; 37 | } 38 | 39 | .navbar-inverse .navbar-nav > .open > a, 40 | .navbar-inverse .navbar-nav > .open > a:hover, 41 | .navbar-inverse .navbar-nav > .open > a:focus { 42 | background-color: #B96A8B; 43 | } 44 | 45 | .dropdown-menu > li > a { 46 | color: #A94F74; 47 | } 48 | 49 | .modal-dialog .panel-success { 50 | border-color: lightgrey; 51 | } 52 | 53 | .modal-dialog .panel-heading { 54 | background-color: #A94F74 !important; 55 | } 56 | 57 | .modal-dialog .panel-heading select { 58 | margin-top: 1rem; 59 | } -------------------------------------------------------------------------------- /nginx/default.conf: -------------------------------------------------------------------------------- 1 | server { 2 | listen 80; 3 | server_name localhost; 4 | 5 | root /data; 6 | gzip on; 7 | 8 | location / { 9 | proxy_pass http://127.0.0.1:8000; 10 | proxy_set_header Accept-Encoding ""; 11 | } 12 | 13 | location /bde-css/ { 14 | } 15 | } 16 | 17 | server { 18 | listen 127.0.0.1:8000; 19 | location / { 20 | proxy_pass http://127.0.0.1:8001; 21 | sub_filter '' ' 22 | '; 23 | sub_filter_once on; 24 | proxy_set_header Accept-Encoding ""; 25 | } 26 | } 27 | 28 | server { 29 | listen 127.0.0.1:8001; 30 | gunzip on; 31 | location / { 32 | proxy_pass http://namenode:50070; 33 | proxy_set_header Accept-Encoding gzip; 34 | } 35 | } 36 | -------------------------------------------------------------------------------- /nodemanager/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f 
http://localhost:8042/ || exit 1 6 | 7 | ADD run.sh /run.sh 8 | RUN chmod a+x /run.sh 9 | 10 | EXPOSE 8042 11 | 12 | CMD ["/run.sh"] 13 | -------------------------------------------------------------------------------- /nodemanager/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/yarn --config $HADOOP_CONF_DIR nodemanager 4 | -------------------------------------------------------------------------------- /resourcemanager/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | HEALTHCHECK CMD curl -f http://localhost:8088/ || exit 1 6 | 7 | ADD run.sh /run.sh 8 | RUN chmod a+x /run.sh 9 | 10 | EXPOSE 8088 11 | 12 | CMD ["/run.sh"] 13 | -------------------------------------------------------------------------------- /resourcemanager/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/yarn --config $HADOOP_CONF_DIR resourcemanager 4 | -------------------------------------------------------------------------------- /spark_in_action.MD: -------------------------------------------------------------------------------- 1 | # Spark with Hadoop Usage Guide 2 | 3 | ## 1. Starting Hadoop and Spark Services 4 | 5 | Before using Spark with Hadoop, ensure all required services are running. 6 | 7 | ### Start Hadoop Services: 8 | ```bash 9 | start-dfs.sh # Start HDFS 10 | start-yarn.sh # Start YARN 11 | ``` 12 | Verify running services: 13 | ```bash 14 | jps 15 | ``` 16 | Expected output (or similar): 17 | ``` 18 | NameNode 19 | DataNode 20 | SecondaryNameNode 21 | ResourceManager 22 | NodeManager 23 | ``` 24 | 25 | ### Start Spark Services (if needed): 26 | ```bash 27 | $SPARK_HOME/sbin/start-all.sh 28 | ``` 29 | or 30 | 31 | ```bash 32 | $SPARK_HOME/sbin/start-master.sh 33 | $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077 34 | ``` 35 | 36 | ### To Stop 37 | 38 | ```bash 39 | $SPARK_HOME/sbin/stop-all.sh 40 | ``` 41 | 42 | ## 2. Running Spark Shell on Hadoop (YARN Mode) 43 | ```bash 44 | spark-shell --master yarn 45 | ``` 46 | Run basic commands in Spark shell: 47 | ```scala 48 | val rdd = sc.parallelize(Seq("Spark", "Hadoop", "Big Data")) 49 | rdd.collect().foreach(println) 50 | ``` 51 | 52 | ## 3. Running a Spark Job on Hadoop (YARN) 53 | ### Submit a Job 54 | ```bash 55 | spark-submit --master yarn --deploy-mode client \ 56 | --class org.apache.spark.examples.SparkPi \ 57 | $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.1.jar 10 58 | ``` 59 | 60 | For cluster mode: 61 | ```bash 62 | spark-submit --master yarn --deploy-mode cluster \ 63 | --class org.apache.spark.examples.SparkPi \ 64 | $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.1.jar 10 65 | ``` 66 | 67 | ## 4. Reading and Writing Data from HDFS 68 | 69 | ### Upload File to HDFS: 70 | ```bash 71 | hdfs dfs -mkdir -p /user/lovnish/test 72 | hdfs dfs -put localfile.txt /user/lovnish/test/ 73 | ``` 74 | 75 | ### Read File in Spark: 76 | ```scala 77 | val file = sc.textFile("hdfs://localhost:9000/user/lovnish/test/localfile.txt") 78 | file.collect().foreach(println) 79 | ``` 80 | 81 | ### Write Output to HDFS: 82 | ```scala 83 | file.saveAsTextFile("hdfs://localhost:9000/user/lovnish/output") 84 | ``` 85 | 86 | ## 5. 
Using Spark SQL with Hive Metastore 87 | 88 | Start Spark with Hive support: 89 | ```bash 90 | spark-shell --master yarn --conf spark.sql.catalogImplementation=hive 91 | ``` 92 | 93 | ### Create a Table: 94 | ```scala 95 | spark.sql("CREATE TABLE students (id INT, name STRING) USING hive") 96 | spark.sql("INSERT INTO students VALUES (1, 'Spark'), (2, 'Hadoop')") 97 | ``` 98 | 99 | ### Query Data: 100 | ```scala 101 | spark.sql("SELECT * FROM students").show() 102 | ``` 103 | 104 | ## 6. Running a Python (PySpark) Job 105 | 106 | ### Start PySpark: 107 | ```bash 108 | pyspark --master yarn 109 | ``` 110 | 111 | ### Run a PySpark Job: 112 | ```python 113 | from pyspark.sql import SparkSession 114 | spark = SparkSession.builder.appName("PySparkExample").getOrCreate() 115 | data = [(1, "Spark"), (2, "Hadoop")] 116 | df = spark.createDataFrame(data, ["id", "name"]) 117 | df.show() 118 | ``` 119 | 120 | ## 7. Stopping Services 121 | 122 | Stop Spark: 123 | ```bash 124 | $SPARK_HOME/sbin/stop-all.sh 125 | ``` 126 | 127 | Stop Hadoop: 128 | ```bash 129 | stop-dfs.sh 130 | stop-yarn.sh 131 | ``` 132 | 133 | ## 8. Monitoring Spark Jobs 134 | 135 | View Spark Web UI: 136 | - Standalone Mode: http://localhost:4040 137 | - YARN Mode: Run `yarn application -list` to get the Application ID, then: 138 | ```bash 139 | yarn application -status 140 | ``` 141 | 142 | ## 9. Debugging and Logs 143 | Check logs of Spark applications running on YARN: 144 | ```bash 145 | yarn logs -applicationId 146 | ``` 147 | For Hadoop logs: 148 | ```bash 149 | hdfs dfsadmin -report 150 | ``` 151 | 152 | ## 10. Hands-On: Spark SQL 153 | 154 | ### **Objective**: 155 | To create DataFrames, load data from different sources, and perform transformations and SQL queries. 156 | 157 | ### **Step 1: Setup Environment** 158 | 159 | Start Spark in Master Mode: 160 | ```bash 161 | $SPARK_HOME/sbin/start-master.sh 162 | ``` 163 | 164 | Start Spark in Worker Mode: 165 | ```bash 166 | $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077 167 | ``` 168 | 169 | Open Spark shell: 170 | ```bash 171 | spark-shell 172 | ``` 173 | 174 | ### **Step 2: Create DataFrames** 175 | ```scala 176 | val data = Seq( 177 | ("Alice", 30, "HR"), 178 | ("Bob", 25, "Engineering"), 179 | ("Charlie", 35, "Finance") 180 | ) 181 | 182 | val df = data.toDF("Name", "Age", "Department") 183 | 184 | df.show() 185 | ``` 186 | 187 | ### **Step 3: Perform Transformations Using Spark SQL** 188 | ```scala 189 | df.createOrReplaceTempView("employees") 190 | val result = spark.sql("SELECT Department, COUNT(*) as count FROM employees GROUP BY Department") 191 | result.show() 192 | ``` 193 | 194 | ### **Step 4: Save Transformed Data** 195 | ```scala 196 | result.write.option("header", "true").csv("hdfs://localhost:9000/data/output/output_employees") 197 | ``` 198 | 199 | ### **Step 5: Scala WordCount Program** 200 | ```scala 201 | import org.apache.spark.{SparkConf} 202 | val conf = new SparkConf().setAppName("WordCountExample").setMaster("local") 203 | val input = sc.textFile("hdfs://localhost:9000/data.txt") 204 | val wordPairs = input.flatMap(line => line.split(" ")).map(word => (word, 1)) 205 | val wordCounts = wordPairs.reduceByKey((a, b) => a + b) 206 | wordCounts.collect().foreach { case (word, count) => 207 | println(s"$word: $count") 208 | } 209 | ``` 210 | 211 | **Stop Session**: 212 | ```scala 213 | sc.stop() 214 | ``` 215 | 216 | --- 217 | 218 | ## **11. Key Takeaways** 219 | - Spark SQL simplifies working with structured data. 
220 | - DataFrames provide a flexible and powerful API for handling large datasets. 221 | - Apache Spark is a versatile tool for distributed data processing, offering scalability and performance. 222 | 223 | -------------------------------------------------------------------------------- /sqoop.md: -------------------------------------------------------------------------------- 1 | ### 💡 **What is Sqoop?** 2 | **Sqoop (SQL to Hadoop)** is a powerful **Big Data tool** used to **transfer data between:** 3 | - ✅ **Relational Databases (MySQL, Oracle, PostgreSQL, etc.)** 4 | - ✅ **Hadoop Ecosystem (HDFS, Hive, HBase, etc.)** 5 | 6 | --- 7 | 8 | ## ✅ **Why is Sqoop Important in Big Data?** 9 | Imagine you have **millions of records** in a **MySQL Database** (like customer data, sales data, etc.) and you want to: 10 | - **Analyze the data using Hadoop, Hive, or Spark.** 11 | - **Store the data in HDFS for distributed processing.** 12 | - **Move the processed data back to MySQL for reporting.** 13 | 14 | 👉 **Manually transferring data** from MySQL to Hadoop would be a nightmare. 15 | 👉 **But with Sqoop, you can transfer data within minutes! 🚀** 16 | 17 | --- 18 | 19 | ## 🚀 **Major Benefits of Using Sqoop** 20 | Here are the **Top 10 Benefits** of using **Sqoop in Big Data**: 21 | 22 | --- 23 | 24 | ## ✅ 1. **Easy Data Transfer from RDBMS to Hadoop (HDFS)** 25 | 👉 **Sqoop simplifies the process** of transferring large amounts of data from **MySQL, Oracle, SQL Server, etc., to HDFS.** 26 | 27 | ### Example: 28 | If you have **1 Billion rows** in MySQL and you want to **analyze** them in Hadoop, 29 | ✅ Without Sqoop → **You would write complex scripts (slow)** 30 | ✅ With Sqoop → **One command imports the data (fast)** 31 | 32 | **Command:** 33 | ```shell 34 | sqoop import \ 35 | --connect jdbc:mysql://localhost/testdb \ 36 | --username root \ 37 | --password password \ 38 | --table employees \ 39 | --target-dir /user/hdfs/employees_data 40 | ``` 41 | 42 | ✔ In just **5 minutes**, your **1 billion records** are transferred to Hadoop. 43 | 44 | --- 45 | 46 | ## ✅ 2. **Fast Data Transfer (Parallel Processing)** 47 | 👉 **Sqoop uses MapReduce internally** to transfer data from MySQL → Hadoop. 48 | 49 | ### What Happens Internally? 50 | - ✅ **Sqoop launches multiple MapReduce jobs**. 51 | - ✅ **Each MapReduce job transfers part of the data**. 52 | - ✅ **Parallel data transfer** speeds up the process. 53 | 54 | ### 🚀 Example: 55 | If you have **10 Million rows** in MySQL: 56 | - ✅ **Without Sqoop** → Takes **6 hours**. 57 | - ✅ **With Sqoop (parallel 8 mappers)** → Takes **30 minutes**. 58 | 59 | ✔ Massive speed improvement 🚀. 60 | 61 | --- 62 | 63 | ## ✅ 3. **Supports All Major Databases** 64 | 👉 Sqoop supports importing/exporting data from almost all major databases, including: 65 | - ✅ **MySQL** 66 | - ✅ **Oracle** 67 | - ✅ **PostgreSQL** 68 | - ✅ **MS SQL Server** 69 | - ✅ **DB2** 70 | - ✅ **Teradata** 71 | 72 | 👉 This means **you can use one single tool** for **all database operations**. 73 | 74 | --- 75 | 76 | ## ✅ 4. **Incremental Import (Import Only New Data)** 🚀 77 | 👉 This is a **game-changer!** 💯 78 | 79 | ### ✅ **Problem:** 80 | Suppose your MySQL database gets **new data every day**. 81 | - ❌ If you run a normal import → **It will import all data** (duplicate data). 82 | - ✅ But with **Sqoop Incremental Import**, you can **import only new data**. 
83 | 84 | ### ✅ **Example: Import Only New Data** 85 | ```shell 86 | sqoop import \ 87 | --connect jdbc:mysql://localhost/testdb \ 88 | --username root \ 89 | --password password \ 90 | --table orders \ 91 | --target-dir /user/hdfs/orders \ 92 | --incremental append \ 93 | --check-column order_date \ 94 | --last-value '2024-03-01' 95 | ``` 96 | 97 | 👉 **It will only import records after `2024-03-01`.** 98 | 99 | ### 🚀 Benefits: 100 | - ✅ No Duplicate Data. 101 | - ✅ Only New Data Comes In. 102 | - ✅ Saves Time and Resources. 103 | 104 | --- 105 | 106 | ## ✅ 5. **Incremental Export (Export Only New Data)** 💯 107 | 👉 You can also **export only new or updated data** from **Hadoop → MySQL**. 108 | 109 | ### ✅ Example: 110 | ```shell 111 | sqoop export \ 112 | --connect jdbc:mysql://localhost/testdb \ 113 | --username root \ 114 | --password password \ 115 | --table orders \ 116 | --export-dir /user/hdfs/orders \ 117 | --update-key order_id \ 118 | --update-mode allowinsert 119 | ``` 120 | 121 | 👉 This will **update old records** and **insert new records**. 🚀 122 | 123 | ✔ No duplicates, No conflicts. 💯 124 | 125 | --- 126 | 127 | ## ✅ 6. **Direct Import into Hive or HBase (No Manual Work)** 📊 128 | 👉 If you're working with **Hive (SQL-like tool for Hadoop)**, 129 | 👉 You can **directly import data into Hive tables** without any manual work. 130 | 131 | ### ✅ Example: 132 | ```shell 133 | sqoop import \ 134 | --connect jdbc:mysql://localhost/testdb \ 135 | --username root \ 136 | --password password \ 137 | --table customers \ 138 | --hive-import \ 139 | --hive-table mydatabase.customers 140 | ``` 141 | 142 | 👉 This command will: 143 | - ✅ Automatically create a Hive Table (`customers`) 144 | - ✅ Automatically load all data from MySQL to Hive. 145 | - ✅ No manual work needed. 146 | 147 | --- 148 | 149 | ## ✅ 7. **Import Large Data (TB/PB Scale) Without Crash 💥** 150 | 👉 If your **MySQL database** has **1 Billion Rows** or **2TB data**, 151 | 👉 Normal **manual export** will fail or crash. ❌ 152 | 153 | 👉 But **Sqoop can handle Terabytes or Petabytes** of data smoothly. 🚀 154 | 155 | 👉 It uses: 156 | - ✅ **Parallel Data Transfer.** 157 | - ✅ **Fault Tolerance (If one mapper fails, others continue).** 158 | - ✅ **Automatic Data Split.** 159 | 160 | --- 161 | 162 | ## ✅ 8. **Save Time and Money 💸** 163 | 👉 **Imagine transferring 1 billion records manually** via Python or CSV files. 164 | 👉 It would take **days or even weeks**. 165 | 166 | ✅ But **Sqoop transfers the data in minutes**. 167 | 168 | ### Example: 169 | | Data Size | Without Sqoop (Manual) | With Sqoop (Auto) | 170 | |----------------|---------------------|--------------------| 171 | | 1 Billion Rows | 24 Hours | **30 Minutes** 🚀 | 172 | | 10 TB Data | 5 Days | **5 Hours** 🚀 | 173 | 174 | ✔ **This saves time, infrastructure costs, and manpower.** 175 | 176 | --- 177 | 178 | ## ✅ 9. **Support for Data Warehousing (ETL Process)** 179 | 👉 **Sqoop is widely used in ETL pipelines** for: 180 | - ✅ Extracting data from MySQL → Hadoop. 181 | - ✅ Transforming data using Spark, Hive, or MapReduce. 182 | - ✅ Loading data back to MySQL → Reporting. 183 | 184 | 👉 This is a **standard data warehousing pipeline**. 185 | 186 | --- 187 | 188 | ## ✅ 10. 
**Easy Automation with Cron Job / Oozie** 189 | 👉 You can schedule **Sqoop Jobs** to run **daily, weekly, or hourly** using: 190 | - ✅ **Oozie (Big Data Scheduler)** 191 | - ✅ **Linux Cron Job** 192 | 193 | ### ✅ Example: Daily Import 194 | ```shell 195 | sqoop job --create daily_import \ 196 | --import \ 197 | --connect jdbc:mysql://localhost/testdb \ 198 | --username root \ 199 | --password password \ 200 | --table orders \ 201 | --incremental append \ 202 | --check-column order_date \ 203 | --last-value '2024-03-01' 204 | ``` 205 | 206 | ✅ Now schedule it daily using **cron job**: 207 | ```shell 208 | crontab -e 209 | ``` 210 | ```shell 211 | 0 0 * * * sqoop job --exec daily_import 212 | ``` 213 | 214 | 👉 **Automatically fetch new data daily**. 🚀 215 | 216 | --- 217 | 218 | ## ✅ **Bonus Benefits of Sqoop** 219 | | Feature | Benefit | 220 | |-----------------------------|-------------------------------------------------------------------------| 221 | | ✅ High-Speed Data Transfer | Sqoop uses **parallel processing (MapReduce)** for fast transfer. | 222 | | ✅ No Data Loss | Data is transferred **without loss or corruption.** | 223 | | ✅ Automatic Schema Mapping | Sqoop automatically maps MySQL Schema to Hive Schema. | 224 | | ✅ Easy to Use | Simple **one-line command** for import/export. | 225 | | ✅ Fault Tolerance | If one Mapper fails, others continue the process. | 226 | 227 | --- 228 | 229 | ## ✅ **So Why Do Companies Use Sqoop? 💯** 230 | | Use Case | Why Sqoop is Best 💯 | 231 | |---------------------------------|------------------------------------------------------------------| 232 | | ✅ Data Migration | Move data from MySQL → Hadoop easily. | 233 | | ✅ Data Warehousing | Automate ETL Pipelines. | 234 | | ✅ Data Archival | Archive old data from MySQL to HDFS. | 235 | | ✅ Machine Learning Data | Transfer MySQL Data → Spark, Hive for AI/ML. | 236 | | ✅ Fast Data Transfer | Transfer TBs of data in minutes. | 237 | 238 | --- 239 | 240 | ## 💯 Conclusion 🚀 241 | ### ✔ **Sqoop = Fast + Easy + Reliable** Data Transfer. 💯 242 | ### ✔ It saves **time, cost, and effort** in Big Data processing. 💯 243 | ### ✔ Highly used in **Data Engineering, ETL Pipelines, and Hadoop Projects.** 🚀 244 | 245 | --- 246 | 247 | **💡 Apache Sqoop 🚀🙂** is a tool designed for efficiently transferring bulk data between Apache Hadoop and relational databases. It allows for seamless data import and export between **Hadoop ecosystem** components (like HDFS, HBase, Hive) and relational databases (like MySQL, PostgreSQL, Oracle, SQL Server). 248 | 249 | Here is a basic **Sqoop tutorial** to help you understand how to use it for importing and exporting data: 250 | 251 | ### Prerequisites: 252 | 1. Hadoop and Sqoop should be installed on your system. 253 | 2. A relational database (e.g., MySQL) should be available to use with Sqoop. 254 | 3. Ensure the JDBC driver for the relational database is available. 255 | 256 | ### 1. **Setting up Sqoop** 257 | - Make sure **Sqoop** is installed and properly configured in your environment. 258 | - Sqoop’s installation can be verified with the following command: 259 | ```bash 260 | sqoop version 261 | ``` 262 | - If Sqoop is installed correctly, it should display its version. 263 | 264 | ### 2. **Importing Data from Relational Databases to Hadoop (HDFS)** 265 | The most common use case for Sqoop is importing data from a relational database into Hadoop's **HDFS**. 266 | 267 | #### Steps to import data: 268 | 1. 
**Create a table in the database (e.g., MySQL):** 269 | 270 | ```sql 271 | CREATE DATABASE test; 272 | CREATE USER 'sqoop_user'@'%' IDENTIFIED BY 'password123'; 273 | GRANT ALL PRIVILEGES ON testdb.* TO 'sqoop_user'@'%'; 274 | FLUSH PRIVILEGES; 275 | ``` 276 | ```sql 277 | SHOW DATABASES; 278 | ``` 279 | ```sql 280 | USE test; 281 | ``` 282 | 283 | ```sql 284 | CREATE TABLE employees ( 285 | id INT, 286 | name VARCHAR(100), 287 | age INT 288 | ); 289 | INSERT INTO employees VALUES (1, 'Love', 25); 290 | INSERT INTO employees VALUES (2, 'Ravi', 21); 291 | INSERT INTO employees VALUES (3, 'Nikshep', 22); 292 | ``` 293 | 294 | 295 | **Now get out of MYSQL Shell and then Let's get started with Apache Sqoop** 296 | 297 | **List Databases Using Sqoop**: 298 | ```bash 299 | sqoop list-databases --connect jdbc:mysql://localhost:3306 --username sqoop_user --password password123 300 | ``` 301 | 302 | 3. **Import Data Using Sqoop**: 303 | Use the following command to import data from a MySQL database to HDFS: 304 | ```bash 305 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 306 | --username your_username --password your_password \ 307 | --table employees --target-dir /user/hadoop/employees 308 | ``` 309 | 310 | Explanation: 311 | - `--connect`: JDBC URL for your database. 312 | - `--username`: Database username. 313 | - `--password`: Database password. 314 | - `--table`: The table to import. 315 | - `--target-dir`: The directory in HDFS where the data will be stored. 316 | 317 | 4. **Verify Data in HDFS**: 318 | After the import, check if the data is available in HDFS: 319 | ```bash 320 | hadoop fs -ls /user/hadoop/employees 321 | hadoop fs -cat /user/hadoop/employees/part-m-00000 322 | ``` 323 | 324 | ### 3. **Exporting Data from Hadoop (HDFS) to Relational Databases** 325 | Sqoop can also be used to export data from HDFS back into a relational database. 326 | 327 | #### Steps to export data: 328 | 1. **Create a Table in the Database for Export:** 329 | 330 | ```sql 331 | CREATE TABLE employees_export ( 332 | id INT, 333 | name VARCHAR(100), 334 | age INT 335 | ); 336 | ``` 337 | 338 | 2. **Export Data Using Sqoop**: 339 | Use the following command to export data from HDFS to a MySQL table: 340 | ```bash 341 | sqoop export --connect jdbc:mysql://localhost/employeesdb \ 342 | --username your_username --password your_password \ 343 | --table employees_export \ 344 | --export-dir /user/hadoop/employees 345 | ``` 346 | 347 | Explanation: 348 | - `--connect`: JDBC URL for the database. 349 | - `--username`: Database username. 350 | - `--password`: Database password. 351 | - `--table`: Table in the database to export the data to. 352 | - `--export-dir`: Directory in HDFS where the data to be exported resides. 353 | 354 | 3. **Verify Data in the Database**: 355 | After the export, check if the data is available in the database: 356 | ```sql 357 | SELECT * FROM employees_export; 358 | ``` 359 | 360 | ### 4. **Incremental Imports (Importing Data Increments)** 361 | Sqoop can import only the new or updated data from a table by using **incremental imports**. 
362 | 363 | #### Example of incremental import: 364 | ```bash 365 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 366 | --username your_username --password your_password \ 367 | --table employees --target-dir /user/hadoop/employees \ 368 | --incremental append --check-column id --last-value 10 369 | ``` 370 | 371 | Explanation: 372 | - `--incremental append`: Indicates that Sqoop should only import data that has changed (new rows or updated rows). 373 | - `--check-column`: The column to use for tracking changes (usually an auto-incremented column like `id`). 374 | - `--last-value`: The value of the `check-column` that was imported last time. This ensures only new or changed data is imported. 375 | 376 | ### 5. **Importing Data into Hive** 377 | Sqoop can also import data directly into **Apache Hive**, which is a data warehousing tool that sits on top of Hadoop. 378 | 379 | #### Example of importing data to Hive: 380 | ```bash 381 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 382 | --username your_username --password your_password \ 383 | --table employees --hive-import --create-hive-table \ 384 | --hive-table employees_hive 385 | ``` 386 | 387 | Explanation: 388 | - `--hive-import`: Imports the data into Hive. 389 | - `--create-hive-table`: Automatically creates the corresponding Hive table. 390 | - `--hive-table`: The Hive table to store the data. 391 | 392 | ### 6. **Job Scheduling with Sqoop** 393 | You can schedule Sqoop jobs to run at specific intervals using **Apache Oozie** or **cron jobs** for periodic data imports or exports. 394 | 395 | ### 7. **Additional Sqoop Features** 396 | - **Parallelism**: You can use **parallel imports** to split the data into multiple tasks and speed up the import/export process. 397 | ```bash 398 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 399 | --username your_username --password your_password \ 400 | --table employees --target-dir /user/hadoop/employees \ 401 | --num-mappers 4 402 | ``` 403 | 404 | - **Direct Mode**: Sqoop provides a **direct mode** for some databases like MySQL, which bypasses JDBC and uses the database's native data transfer mechanism to improve performance. 405 | ```bash 406 | sqoop import --connect jdbc:mysql://localhost/employeesdb \ 407 | --username your_username --password your_password \ 408 | --table employees --target-dir /user/hadoop/employees \ 409 | --direct 410 | ``` 411 | 412 | --- 413 | 414 | ### Conclusion 415 | Apache **Sqoop** is a powerful tool for bulk data transfers between Hadoop and relational databases. By understanding how to use Sqoop for importing, exporting, and managing data between various sources and Hadoop, you can integrate your data efficiently for further analysis, processing, or storage. 416 | 417 | In this tutorial, we covered basic Sqoop commands for importing and exporting data from a MySQL database into HDFS, as well as other advanced functionalities like incremental imports and loading data into Hive. 
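One practical refinement to the commands above: passing `--password` directly on the command line exposes the password in shell history and process listings. Sqoop also accepts a password file (or the `-P` flag to prompt interactively). The sketch below is a minimal example assuming a file named `sqoop.pwd` stored in the user's HDFS home directory; the database, credentials, and paths are the same illustrative values used earlier in this tutorial.

```bash
# Store the password in HDFS with restrictive permissions (one-time setup).
# Note: the file must not contain a trailing newline, hence echo -n.
echo -n "password123" | hdfs dfs -put - /user/hadoop/sqoop.pwd
hdfs dfs -chmod 400 /user/hadoop/sqoop.pwd

# Reference the password file instead of typing the password inline
sqoop import \
  --connect jdbc:mysql://localhost/employeesdb \
  --username sqoop_user \
  --password-file /user/hadoop/sqoop.pwd \
  --table employees \
  --target-dir /user/hadoop/employees_secure
```

The same `--password-file` option works for `sqoop export` and for saved `sqoop job` definitions, which is especially convenient for the cron- or Oozie-scheduled jobs mentioned earlier.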
418 | -------------------------------------------------------------------------------- /startup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | hadoop fs -mkdir /tmp 4 | hadoop fs -mkdir -p /user/hive/warehouse 5 | hadoop fs -chmod g+w /tmp 6 | hadoop fs -chmod g+w /user/hive/warehouse 7 | 8 | cd $HIVE_HOME/bin 9 | ./hiveserver2 --hiveconf hive.server2.enable.doAs=false 10 | -------------------------------------------------------------------------------- /students.csv: -------------------------------------------------------------------------------- 1 | 1,Lovnish,25 2 | 2,Ravikant,21 3 | 3,Nikshep,23 -------------------------------------------------------------------------------- /submit/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/hadoop-base:2.0.0-hadoop3.2.1-java8 2 | 3 | MAINTAINER Ivan Ermilov 4 | 5 | COPY WordCount.jar /opt/hadoop/applications/WordCount.jar 6 | 7 | ENV JAR_FILEPATH="/opt/hadoop/applications/WordCount.jar" 8 | ENV CLASS_TO_RUN="WordCount" 9 | ENV PARAMS="/input /output" 10 | 11 | ADD run.sh /run.sh 12 | RUN chmod a+x /run.sh 13 | 14 | CMD ["/run.sh"] 15 | -------------------------------------------------------------------------------- /submit/WordCount.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lovnishverma/bigdataecosystem/50b2fc2e1138de61698eff94c48da229b1dd3363/submit/WordCount.jar -------------------------------------------------------------------------------- /submit/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | $HADOOP_HOME/bin/hadoop jar $JAR_FILEPATH $CLASS_TO_RUN $PARAMS 4 | -------------------------------------------------------------------------------- /template/java/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-submit:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | ENV SPARK_APPLICATION_JAR_NAME application-1.0 6 | 7 | COPY template.sh / 8 | 9 | RUN apk add --no-cache openjdk8 maven\ 10 | && chmod +x /template.sh \ 11 | && mkdir -p /app \ 12 | && mkdir -p /usr/src/app 13 | 14 | # Copy the POM-file first, for separate dependency resolving and downloading 15 | ONBUILD COPY pom.xml /usr/src/app 16 | ONBUILD RUN cd /usr/src/app \ 17 | && mvn dependency:resolve 18 | ONBUILD RUN cd /usr/src/app \ 19 | && mvn verify 20 | 21 | # Copy the source code and build the application 22 | ONBUILD COPY . /usr/src/app 23 | ONBUILD RUN cd /usr/src/app \ 24 | && mvn clean package 25 | 26 | CMD ["/bin/bash", "/template.sh"] 27 | -------------------------------------------------------------------------------- /template/java/README.md: -------------------------------------------------------------------------------- 1 | # Spark Java template 2 | 3 | The Spark Java template image serves as a base image to build your own Java application to run on a Spark cluster. See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark) for a description how to setup a Spark cluster. 4 | 5 | ### Package your application using Maven 6 | You can build and launch your Java application on a Spark cluster by extending this image with your sources. 
The template uses [Maven](https://maven.apache.org/) as build tool, so make sure you have a `pom.xml` file for your application specifying all the dependencies. 7 | 8 | The Maven `package` command must create an assembly JAR (or 'uber' JAR) containing your code and its dependencies. Spark and Hadoop dependencies should be listes as `provided`. The [Maven shade plugin](http://maven.apache.org/plugins/maven-shade-plugin/) provides a plugin to build such assembly JARs. 9 | 10 | ### Extending the Spark Java template with your application 11 | 12 | #### Steps to extend the Spark Java template 13 | 1. Create a Dockerfile in the root folder of your project (which also contains a `pom.xml`) 14 | 2. Extend the Spark Java template Docker image 15 | 3. Configure the following environment variables (unless the default value satisfies): 16 | * `SPARK_MASTER_NAME` (default: spark-master) 17 | * `SPARK_MASTER_PORT` (default: 7077) 18 | * `SPARK_APPLICATION_JAR_NAME` (default: application-1.0) 19 | * `SPARK_APPLICATION_MAIN_CLASS` (default: my.main.Application) 20 | * `SPARK_APPLICATION_ARGS` (default: "") 21 | 4. Build and run the image 22 | ``` 23 | docker build --rm=true -t bde/spark-app . 24 | docker run --name my-spark-app -e ENABLE_INIT_DAEMON=false --link spark-master:spark-master -d bde/spark-app 25 | ``` 26 | 27 | The sources in the project folder will be automatically added to `/usr/src/app` if you directly extend the Spark Java template image. Otherwise you will have to add and package the sources by yourself in your Dockerfile with the commands: 28 | 29 | COPY . /usr/src/app 30 | RUN cd /usr/src/app \ 31 | && mvn clean package 32 | 33 | If you overwrite the template's `CMD` in your Dockerfile, make sure to execute the `/template.sh` script at the end. 34 | 35 | #### Example Dockerfile 36 | ``` 37 | FROM bde2020/spark-java-template:2.4.0-hadoop2.7 38 | 39 | MAINTAINER Erika Pauwels 40 | MAINTAINER Gezim Sejdiu 41 | 42 | ENV SPARK_APPLICATION_JAR_NAME my-app-1.0-SNAPSHOT-with-dependencies 43 | ENV SPARK_APPLICATION_MAIN_CLASS eu.bde.my.Application 44 | ENV SPARK_APPLICATION_ARGS "foo bar baz" 45 | ``` 46 | 47 | #### Example application 48 | See [big-data-europe/demo-spark-sensor-data](https://github.com/big-data-europe/demo-spark-sensor-data). 49 | -------------------------------------------------------------------------------- /template/java/template.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cd /usr/src/app 4 | cp target/${SPARK_APPLICATION_JAR_NAME}.jar ${SPARK_APPLICATION_JAR_LOCATION} 5 | 6 | sh /submit.sh 7 | -------------------------------------------------------------------------------- /template/python/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-submit:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | COPY template.sh / 6 | 7 | # Copy the requirements.txt first, for separate dependency resolving and downloading 8 | ONBUILD COPY requirements.txt /app/ 9 | ONBUILD RUN cd /app \ 10 | && pip3 install -r requirements.txt 11 | 12 | # Copy the source code 13 | ONBUILD COPY . 
/app 14 | 15 | CMD ["/bin/bash", "/template.sh"] 16 | -------------------------------------------------------------------------------- /template/python/README.md: -------------------------------------------------------------------------------- 1 | # Spark Python template 2 | 3 | The Spark Python template image serves as a base image to build your own Python application to run on a Spark cluster. See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark) for a description how to setup a Spark cluster. 4 | 5 | ### Package your application using pip 6 | You can build and launch your Python application on a Spark cluster by extending this image with your sources. The template uses [pip](https://pip.pypa.io/en/stable/) to manage the dependencies of your 7 | project, so make sure you have a `requirements.txt` file in the root of your application specifying all the dependencies. 8 | 9 | ### Extending the Spark Python template with your application 10 | 11 | #### Steps to extend the Spark Python template 12 | 1. Create a Dockerfile in the root folder of your project (which also contains a `requirements.txt`) 13 | 2. Extend the Spark Python template Docker image 14 | 3. Configure the following environment variables (unless the default value satisfies): 15 | * `SPARK_MASTER_NAME` (default: spark-master) 16 | * `SPARK_MASTER_PORT` (default: 7077) 17 | * `SPARK_APPLICATION_PYTHON_LOCATION` (default: /app/app.py) 18 | * `SPARK_APPLICATION_ARGS` 19 | 4. Build and run the image 20 | ``` 21 | docker build --rm -t bde/spark-app . 22 | docker run --name my-spark-app -e ENABLE_INIT_DAEMON=false --link spark-master:spark-master -d bde/spark-app 23 | ``` 24 | 25 | The sources in the project folder will be automatically added to `/app` if you directly extend the Spark Python template image. Otherwise you will have to add the sources by yourself in your Dockerfile with the command: 26 | 27 | COPY . /app 28 | 29 | If you overwrite the template's `CMD` in your Dockerfile, make sure to execute the `/template.sh` script at the end. 
30 | 31 | #### Example Dockerfile 32 | ``` 33 | FROM bde2020/spark-python-template:2.4.0-hadoop2.7 34 | 35 | MAINTAINER You 36 | 37 | ENV SPARK_APPLICATION_PYTHON_LOCATION /app/entrypoint.py 38 | ENV SPARK_APPLICATION_ARGS "foo bar baz" 39 | ``` 40 | 41 | #### Example application 42 | Coming soon 43 | -------------------------------------------------------------------------------- /template/python/template.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | sh /submit.sh 4 | -------------------------------------------------------------------------------- /template/scala/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-submit:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | ARG SBT_VERSION 6 | ENV SBT_VERSION=${SBT_VERSION:-1.3.12} 7 | 8 | RUN wget -O - https://piccolo.link/sbt-1.3.12.tgz | gunzip | tar -x -C /usr/local 9 | 10 | ENV PATH /usr/local/sbt/bin:${PATH} 11 | 12 | WORKDIR /app 13 | 14 | # Pre-install base libraries 15 | ADD build.sbt /app/ 16 | ADD plugins.sbt /app/project/ 17 | RUN sbt update 18 | 19 | COPY template.sh / 20 | 21 | ENV SPARK_APPLICATION_MAIN_CLASS Application 22 | 23 | # Copy the build.sbt first, for separate dependency resolving and downloading 24 | ONBUILD COPY build.sbt /app/ 25 | ONBUILD COPY project /app/project 26 | ONBUILD RUN sbt update 27 | 28 | # Copy the source code and build the application 29 | ONBUILD COPY . /app 30 | ONBUILD RUN sbt clean assembly 31 | 32 | CMD ["/template.sh"] 33 | -------------------------------------------------------------------------------- /template/scala/README.md: -------------------------------------------------------------------------------- 1 | # Spark Scala template 2 | 3 | The Spark Scala template image serves as a base image to build your own Scala 4 | application to run on a Spark cluster. See 5 | [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark) 6 | for a description how to setup a Spark cluster. 7 | 8 | ## Scala Console 9 | 10 | `sbt console` will create you a Spark Context for testing your code like the 11 | spark-shell: 12 | 13 | ``` 14 | docker run -it --rm bde2020/spark-scala-template sbt console 15 | ``` 16 | 17 | You can also use directly your Docker image and test your own code that way. 18 | 19 | ## Package your application using sbt 20 | 21 | You can build and launch your Scala application on a Spark cluster by extending 22 | this image with your sources. The template uses 23 | [sbt](http://www.scala-sbt.org) as build tool, so you should take the 24 | `build.sbt` file located in this directory and the `project` directory that 25 | includes the 26 | [sbt-assembly](https://github.com/sbt/sbt-assembly). 27 | 28 | When the Docker image is built using this template, you should get a Docker 29 | image that includes a fat JAR containing your application and all its 30 | dependencies. 31 | 32 | ### Extending the Spark Scala template with your application 33 | 34 | #### Steps to extend the Spark Scala template 35 | 36 | 1. Create a Dockerfile in the root folder of your project (which also contains 37 | a `build.sbt`) 38 | 2. Extend the Spark Scala template Docker image 39 | 3. 
Configure the following environment variables (unless the default value
40 | satisfies):
41 |  * `SPARK_MASTER_NAME` (default: spark-master)
42 |  * `SPARK_MASTER_PORT` (default: 7077)
43 |  * `SPARK_APPLICATION_MAIN_CLASS` (default: Application)
44 |  * `SPARK_APPLICATION_ARGS` (default: "")
45 | 4. Build and run the image:
46 | ```
47 | docker build --rm=true -t bde/spark-app .
48 | docker run --name my-spark-app -e ENABLE_INIT_DAEMON=false --link spark-master:spark-master -d bde/spark-app
49 | ```
50 | 
51 | The sources in the project folder will be automatically added to `/usr/src/app`
52 | if you directly extend the Spark Scala template image. Otherwise you will have
53 | to add and package the sources by yourself in your Dockerfile with the
54 | commands:
55 | 
56 |     COPY . /usr/src/app
57 |     RUN cd /usr/src/app && sbt clean assembly
58 | 
59 | If you overwrite the template's `CMD` in your Dockerfile, make sure to execute
60 | the `/template.sh` script at the end.
61 | 
62 | #### Example Dockerfile
63 | 
64 | ```
65 | FROM bde2020/spark-scala-template:2.4.0-hadoop2.7
66 | 
67 | MAINTAINER Cecile Tonglet 
68 | 
69 | ENV SPARK_APPLICATION_MAIN_CLASS eu.bde.my.Application
70 | ENV SPARK_APPLICATION_ARGS "foo bar baz"
71 | ```
72 | 
73 | #### Example application
74 | 
75 | TODO
76 | 
-------------------------------------------------------------------------------- /template/scala/build.sbt: --------------------------------------------------------------------------------
1 | scalaVersion := "2.12.11"
2 | libraryDependencies ++= Seq(
3 |   "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
4 | )
5 | 
-------------------------------------------------------------------------------- /template/scala/plugins.sbt: --------------------------------------------------------------------------------
1 | addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
-------------------------------------------------------------------------------- /template/scala/template.sh: --------------------------------------------------------------------------------
1 | #!/bin/bash
2 | 
3 | SPARK_APPLICATION_JAR_LOCATION=`find /app/target -iname '*-assembly-*.jar' | head -n1`
4 | export SPARK_APPLICATION_JAR_LOCATION
5 | 
6 | if [ -z "$SPARK_APPLICATION_JAR_LOCATION" ]; then
7 |   echo "Can't find a file *-assembly-*.jar in /app/target"
8 |   exit 1
9 | fi
10 | 
11 | /submit.sh
12 | 
-------------------------------------------------------------------------------- /wordcount.md: --------------------------------------------------------------------------------
1 | To perform a **Word Count** using Hadoop, follow these steps:
2 | 
3 | ---
4 | 
5 | ### **Configure `mapred-site.xml`**
6 | ```bash
7 | nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
8 | ```
9 | Add:
10 | ```xml
11 | <configuration>
12 |   <property>
13 |     <name>mapreduce.framework.name</name>
14 |     <value>yarn</value>
15 |   </property>
16 |   <property>
17 |     <name>yarn.app.mapreduce.am.env</name>
18 |     <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
19 |   </property>
20 |   <property>
21 |     <name>mapreduce.map.env</name>
22 |     <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
23 |   </property>
24 |   <property>
25 |     <name>mapreduce.reduce.env</name>
26 |     <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
27 |   </property>
28 | </configuration>
29 | ```
30 | Save & exit.
31 | 
32 | 
33 | ## **1. Ensure Hadoop is Running**
34 | Before running the Word Count example, ensure Hadoop is running:
35 | 
36 | ```bash
37 | start-dfs.sh
38 | start-yarn.sh
39 | ```
40 | Verify with:
41 | ```bash
42 | jps
43 | ```
44 | You should see **NameNode, DataNode, ResourceManager, and NodeManager** running.
45 | 
46 | ---
47 | 
48 | ## **2. 
Upload the Input File to HDFS**
49 | If you haven't already created a directory in HDFS, do it now:
50 | 
51 | ```bash
52 | hadoop fs -mkdir -p /user/nielit/input
53 | ```
54 | 
55 | Now, upload your text file (`data.txt`):
56 | 
57 | ```bash
58 | hadoop fs -put data.txt /user/nielit/input/
59 | ```
60 | 
61 | Verify the upload:
62 | ```bash
63 | hadoop fs -ls /user/nielit/input/
64 | ```
65 | 
66 | ---
67 | 
68 | ## **3. Run the Word Count Example**
69 | Hadoop provides a built-in Word Count example. Run it with:
70 | 
71 | ```bash
72 | hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /user/nielit/input /user/nielit/output
73 | ```
74 | 
75 | > **Note:** If you installed a different Hadoop version, update `3.3.6` in the JAR file name accordingly.
76 | 
77 | ---
78 | 
79 | ![Screenshot from 2025-03-02 17-45-02](https://github.com/user-attachments/assets/44613f55-cdf0-48fc-9769-9ff7066c70e3)
80 | 
81 | ## **4. Check the Output**
82 | Once the job completes, view the output files:
83 | 
84 | ```bash
85 | hadoop fs -ls /user/nielit/output
86 | ```
87 | 
88 | The output is usually stored in `part-r-00000`. To read the results:
89 | 
90 | ```bash
91 | hadoop fs -cat /user/nielit/output/part-r-00000
92 | ```
93 | 
94 | ---
95 | 
96 | ## **5. Download the Output to Your Local System (Optional)**
97 | If you want to copy the results from HDFS to your local machine:
98 | 
99 | ```bash
100 | hadoop fs -get /user/nielit/output/part-r-00000 wordcount_output.txt
101 | cat wordcount_output.txt
102 | ```
103 | 
104 | 
105 | ![Screenshot from 2025-03-02 17-46-44](https://github.com/user-attachments/assets/5d5d3b51-708f-4ff3-8476-a518c5f8ee3e)
106 | 
107 | **METHOD 2**
108 | 
109 | You can write your own **Java MapReduce program** for word count in Hadoop. Follow these steps:
110 | 
111 | ---
112 | 
113 | ## **1. 
Create the Word Count Java Program**
114 | Create a new file:
115 | ```bash
116 | nano WordCount.java
117 | ```
118 | 
119 | Copy and paste the following Java code:
120 | 
121 | ```java
122 | import org.apache.hadoop.conf.Configuration;
123 | import org.apache.hadoop.fs.Path;
124 | import org.apache.hadoop.io.IntWritable;
125 | import org.apache.hadoop.io.Text;
126 | import org.apache.hadoop.mapreduce.Job;
127 | import org.apache.hadoop.mapreduce.Mapper;
128 | import org.apache.hadoop.mapreduce.Reducer;
129 | import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
130 | import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
131 | 
132 | import java.io.IOException;
133 | import java.util.StringTokenizer;
134 | 
135 | public class WordCount {
136 | 
137 |     public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
138 |         private final static IntWritable one = new IntWritable(1);
139 |         private Text word = new Text();
140 | 
141 |         public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
142 |             StringTokenizer itr = new StringTokenizer(value.toString());
143 |             while (itr.hasMoreTokens()) {
144 |                 word.set(itr.nextToken());
145 |                 context.write(word, one);
146 |             }
147 |         }
148 |     }
149 | 
150 |     public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
151 |         private IntWritable result = new IntWritable();
152 | 
153 |         public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
154 |             int sum = 0;
155 |             for (IntWritable val : values) {
156 |                 sum += val.get();
157 |             }
158 |             result.set(sum);
159 |             context.write(key, result);
160 |         }
161 |     }
162 | 
163 |     public static void main(String[] args) throws Exception {
164 |         Configuration conf = new Configuration();
165 |         Job job = Job.getInstance(conf, "word count");
166 |         job.setJarByClass(WordCount.class);
167 |         job.setMapperClass(TokenizerMapper.class);
168 |         job.setCombinerClass(IntSumReducer.class);
169 |         job.setReducerClass(IntSumReducer.class);
170 |         job.setOutputKeyClass(Text.class);
171 |         job.setOutputValueClass(IntWritable.class);
172 |         FileInputFormat.addInputPath(job, new Path(args[0]));
173 |         FileOutputFormat.setOutputPath(job, new Path(args[1]));
174 |         System.exit(job.waitForCompletion(true) ? 0 : 1);
175 |     }
176 | }
177 | ```
178 | 
179 | ---
180 | 
181 | ## **2. Compile the Java Code**
182 | Make sure you have Hadoop's libraries available. Use the following command to compile:
183 | 
184 | ```bash
185 | javac -classpath $(hadoop classpath) -d . WordCount.java
186 | ```
187 | 
188 | This will generate `.class` files inside the current directory.
189 | 
190 | ---
191 | 
192 | ## **3. Create a JAR File**
193 | Now, package the compiled Java files into a JAR:
194 | 
195 | ```bash
196 | jar -cvf WordCount.jar *.class
197 | ```
198 | 
199 | ---
200 | 
201 | ## **4. Upload Input File to HDFS**
202 | If not already uploaded, create an input directory and upload your text file:
203 | 
204 | ```bash
205 | hadoop fs -mkdir -p /user/nielit/input
206 | hadoop fs -put data.txt /user/nielit/input/
207 | ```
208 | 
209 | Verify:
210 | ```bash
211 | hadoop fs -ls /user/nielit/input/
212 | ```
213 | 
214 | ---
215 | 
216 | ## **5. Run Your Word Count Program**
217 | Execute your custom Word Count JAR in Hadoop:
218 | 
219 | ```bash
220 | hadoop jar WordCount.jar WordCount /user/nielit/input /user/nielit/output
221 | ```
222 | 
223 | ---
224 | 
225 | ## **6. 
View Output** 226 | After the job completes, check the output: 227 | 228 | ```bash 229 | hadoop fs -ls /user/nielit/output 230 | ``` 231 | 232 | To see the results: 233 | 234 | ```bash 235 | hadoop fs -cat /user/nielit/output/part-r-00000 236 | ``` 237 | 238 | --- 239 | 240 | ## **7. Download Output (Optional)** 241 | If you want to save the output to your local machine: 242 | 243 | ```bash 244 | hadoop fs -get /user/nielit/output/part-r-00000 wordcount_output.txt 245 | cat wordcount_output.txt 246 | ``` 247 | 248 | --- 249 | 250 | ## **Troubleshooting** 251 | - If the output directory already exists, delete it before rerunning: 252 | ```bash 253 | hadoop fs -rm -r /user/nielit/output 254 | ``` 255 | 256 | 257 | -------------------------------------------------------------------------------- /worker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM bde2020/spark-base:3.0.0-hadoop3.2 2 | 3 | LABEL maintainer="Gezim Sejdiu , Giannis Mouchakis " 4 | 5 | COPY worker.sh / 6 | 7 | ENV SPARK_WORKER_WEBUI_PORT 8081 8 | ENV SPARK_WORKER_LOG /spark/logs 9 | ENV SPARK_MASTER "spark://spark-master:7077" 10 | 11 | EXPOSE 8081 12 | 13 | CMD ["/bin/bash", "/worker.sh"] 14 | -------------------------------------------------------------------------------- /worker/README.md: -------------------------------------------------------------------------------- 1 | # Spark worker 2 | 3 | See [big-data-europe/docker-spark README](https://github.com/big-data-europe/docker-spark). -------------------------------------------------------------------------------- /worker/worker.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | . "/spark/sbin/spark-config.sh" 4 | 5 | . "/spark/bin/load-spark-env.sh" 6 | 7 | mkdir -p $SPARK_WORKER_LOG 8 | 9 | export SPARK_HOME=/spark 10 | 11 | ln -sf /dev/stdout $SPARK_WORKER_LOG/spark-worker.out 12 | 13 | /spark/sbin/../bin/spark-class org.apache.spark.deploy.worker.Worker \ 14 | --webui-port $SPARK_WORKER_WEBUI_PORT $SPARK_MASTER >> $SPARK_WORKER_LOG/spark-worker.out 15 | -------------------------------------------------------------------------------- /yarn.md: -------------------------------------------------------------------------------- 1 | # Practical: Running a WordCount Job on YARN ![image](https://github.com/user-attachments/assets/04d38509-38b8-4cef-b544-4a8c566fd863) 2 | 3 | In this practical, you will run a simple WordCount job using Hadoop YARN. This exercise walks you through preparing a basic Hadoop job and running it on a YARN cluster. 4 | 5 | ### Prerequisites: 6 | 1. Docker Desktop must be up and running. 7 | 2. **YARN ResourceManager** and **NodeManager** must be up and running. 8 | 3. Hadoop should be set up correctly, with access to the HDFS file system. 9 | 4. A sample WordCount program (JAR) is ready to be executed. 10 | 11 | --- 12 | 13 | ### Step-by-Step Guide 14 | Compose container if not already running 15 | 16 | **docker-compose up -d** 17 | 18 | Copy code folder that has wordcount program to your container 19 | 20 | **docker cp code namenode:/code** 21 | 22 | ![image](https://github.com/user-attachments/assets/72fa5e86-02cb-4a09-864a-dae1256bf8cd) 23 | 24 | 25 | Then execute bash shell of namenode in intractive mode 26 | 27 | **docker exec -it namenode bash** 28 | --- 29 | 30 | ### Step-by-Step Guide 31 | 32 | #### Step 1: Upload Data to HDFS 33 | 34 | Before running a YARN job, we need some input data in HDFS. 
We will create a simple text file locally and upload it to HDFS.
36 | 
37 | 1. **Create a sample text file inside the container's terminal**:
38 |    Use the following commands to create a file called `sample.txt` with some sample text data.
39 | 
40 | ```bash
41 | echo "Ropar Chandigarh Ropar Chandigarh Punjab" > sample.txt
42 | echo "Mohali" >> sample.txt
43 | echo "Kharar" >> sample.txt
44 | ```
45 | 
46 | 
47 | 2. **Upload the text file to HDFS**:
48 |    Use the `hadoop fs -put` command to upload the file to HDFS.
49 | 
50 | ```bash
51 | hadoop fs -mkdir -p /user             # Create the parent directory in HDFS
52 | hadoop fs -mkdir -p /user/root
53 | hadoop fs -mkdir -p /user/root/input  # Create the input directory in HDFS
54 | ```
55 | 
56 | The `-p` option stands for "parent": it creates all the necessary parent directories in the specified path if they do not already exist, so the single command `hadoop fs -mkdir -p /user/root/input` would have been enough on its own.
57 | If any of the parent directories (`/user`, `/user/root`) do not exist, it creates them for you.
58 | Key feature: it does not throw an error if the directory already exists.
59 | 
60 | Now put `sample.txt` into HDFS:
61 | 
62 | ```bash
63 | hadoop fs -put sample.txt /user/root/input/   # Put sample.txt into HDFS
64 | ```
65 | 
66 | You can confirm the file is uploaded by running:
67 | 
68 | ```bash
69 | hadoop fs -ls /user/root/input/
70 | ```
71 | ![image](https://github.com/user-attachments/assets/a0e18957-a2a5-40f8-a5bf-b443da47eb67)
72 | 
73 | ![image](https://github.com/user-attachments/assets/b622d0d9-ef28-4eaa-ac7b-db7758dd390d)
74 | 
75 | ---
76 | 
77 | #### Step 2: Submit the WordCount Job to YARN
78 | 
79 | Now, we can run the WordCount job using YARN. This job will count the occurrences of each word in the input file.
80 | 
81 | 1. **Change your working directory to where the `wordCount.jar` is located**:
82 | 
83 | ```bash
84 | cd /code   # Change to the directory where wordCount.jar is stored
85 | ```
86 | ![image](https://github.com/user-attachments/assets/84b0288f-4cca-4bc8-9d62-eeb5b393ef6d)
87 | 
88 | 2. **Submit the WordCount job to YARN**:
89 |    Run the following command to submit the job:
90 | 
91 | ```bash
92 | hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount /user/root/input /user/root/outputfolder
93 | ```
94 | 
95 | - `wordCount.jar`: The MapReduce program (JAR file).
96 | - `/user/root/input`: The input directory in HDFS containing the `sample.txt` file.
97 | - `/user/root/outputfolder`: The output directory in HDFS where the result will be stored.
98 | 
99 | ![image](https://github.com/user-attachments/assets/f0aa28ae-c7c8-4e38-999c-a2197497c5cb)
100 | 
101 | 
102 | 3. **Check the YARN UI**:
103 |    After submitting the job, you can monitor the job through the YARN ResourceManager UI.
104 | 
105 |    - Visit the YARN ResourceManager UI at `http://localhost:8088`.
106 |    - Look for your job under the "Applications" section. You should see your job with its status (e.g., Running, Completed, etc.).
107 | 
108 | ![image](https://github.com/user-attachments/assets/f68bcf5f-e56a-420e-8f05-9ed2dcf68837)
109 | 
110 | 
111 |    - Click on your job to see more details, such as job progress, logs, and running containers.
112 | 
113 | ![image](https://github.com/user-attachments/assets/5656eb67-8c49-46c0-a9ac-0af201231972)
114 | 
115 | 
116 | ---
117 | 
118 | #### Step 3: Check the Output of the WordCount Job
119 | 
120 | Once the job finishes, you can view the results in HDFS.
121 | 1. 
**Check the output on HDFS**: 122 | To verify that the output was successfully created, run the following command: 123 | 124 | ```bash 125 | hadoop fs -ls /user/root/outputfolder 126 | ``` 127 | ![image](https://github.com/user-attachments/assets/508f4359-1b5c-462c-93a8-c1dc7048283d) 128 | 129 | 130 | You should see output files like `part-r-00000`. 131 | 132 | 2. **View the contents of the output file**: 133 | To view the WordCount results, use the following command: 134 | 135 | ```bash 136 | hadoop fs -cat /user/root/outputfolder/part-r-00000 137 | ``` 138 | 139 | The output will show the words and their respective counts, like this: 140 | 141 | ``` 142 | Chandigarh 2 143 | Kharar 1 144 | Mohali 1 145 | Punjab 1 146 | Ropar 2 147 | ``` 148 | ![image](https://github.com/user-attachments/assets/7b0ab366-71ed-49e7-b7fa-749f573f633a) 149 | 150 | --- 151 | 152 | #### Step 4: Clean Up 153 | 154 | Once you’ve completed the practical, it's good practice to clean up by deleting the output files and any unnecessary files. 155 | 156 | 1. **Remove the output directory from HDFS**: 157 | 158 | ```bash 159 | hadoop fs -rm -r /user/root/outputfolder 160 | ``` 161 | 162 | 2. **Optional**: Remove the input file from HDFS if you no longer need it. 163 | 164 | ```bash 165 | hadoop fs -rm /user/root/input/sample.txt 166 | ``` 167 | 168 | --- 169 | 170 | Yarn (Yet Another Resource Negotiator) in the Hadoop ecosystem is a resource management layer, and its commands are different from the JavaScript package manager **Yarn**. Below is a list of essential **Yarn commands for Hadoop**: 171 | 172 | --- 173 | 174 | ### **General Yarn Commands** 175 | 1. **Check Yarn Version** 176 | ```bash 177 | yarn version 178 | ``` 179 | Displays the version of Yarn installed in your Hadoop environment. 180 | 181 | 2. **Check Cluster Nodes** 182 | ```bash 183 | yarn node -list 184 | ``` 185 | Lists all the active, decommissioned, and unhealthy nodes in the cluster. 186 | 187 | 3. **Resource Manager Web UI** 188 | ```bash 189 | yarn rmadmin -getServiceState rm1 190 | ``` 191 | Checks the state of a specific Resource Manager. 192 | 193 | --- 194 | 195 | ### **Application Management** 196 | 4. **Submit an Application** 197 | ```bash 198 | yarn jar [options] 199 | ``` 200 | Submits a new application to the Yarn cluster. 201 | 202 | 5. **List Applications** 203 | ```bash 204 | yarn application -list 205 | ``` 206 | Lists all running applications on the Yarn cluster. 207 | 208 | 6. **View Application Status** 209 | ```bash 210 | yarn application -status 211 | ``` 212 | Shows the status of a specific application. 213 | Example Output:- 214 | ![image](https://github.com/user-attachments/assets/44ab74ab-f662-4d87-834c-43e812117be0) 215 | 216 | 217 | 218 | 8. **Kill an Application** 219 | ```bash 220 | yarn application -kill 221 | ``` 222 | Terminates a specific application. 223 | 224 | --- 225 | 226 | ### **Logs and Diagnostics** 227 | 8. **View Logs of an Application** 228 | ```bash 229 | yarn logs -applicationId 230 | ``` 231 | Displays logs for a specific application. 232 | 233 | 9. **Fetch Application Logs to Local System** 234 | ```bash 235 | yarn logs -applicationId > logs.txt 236 | ``` 237 | Saves application logs to a local file. 238 | 239 | --- 240 | 241 | ### **Queue Management** 242 | 10. **List Queues** 243 | ```bash 244 | yarn queue -list 245 | ``` 246 | Lists all queues available in the Yarn cluster. 247 | 248 | 11. 
**Move Application to Another Queue** 249 | ```bash 250 | yarn application -moveToQueue -appId 251 | ``` 252 | Moves a running application to a different queue. 253 | 254 | --- 255 | 256 | ### **Resource Manager Administration** 257 | 12. **Refresh Queue Configuration** 258 | ```bash 259 | yarn rmadmin -refreshQueues 260 | ``` 261 | Reloads the queue configuration without restarting the Resource Manager. 262 | 263 | 13. **Refresh Node Information** 264 | ```bash 265 | yarn rmadmin -refreshNodes 266 | ``` 267 | Updates the Resource Manager with the latest node information. 268 | 269 | 14. **Get Cluster Metrics** 270 | ```bash 271 | yarn cluster -metrics 272 | ``` 273 | Shows resource usage metrics of the Yarn cluster. 274 | 275 | 15. **Decommission a Node** 276 | ```bash 277 | yarn rmadmin -decommission 278 | ``` 279 | Marks a specific node as decommissioned. 280 | 281 | 16. **Check Cluster Status** 282 | ```bash 283 | yarn cluster -status 284 | ``` 285 | Displays overall status and health of the cluster. 286 | 287 | --- 288 | 289 | ### **Node Manager Commands** 290 | 17. **Start Node Manager** 291 | ```bash 292 | yarn nodemanager 293 | ``` 294 | Starts the Node Manager daemon. 295 | 296 | 18. **Stop Node Manager** 297 | ```bash 298 | yarn nodemanager -stop 299 | ``` 300 | Stops the Node Manager daemon. 301 | 302 | 19. **List Containers on a Node** 303 | ```bash 304 | yarn nodemanager -list 305 | ``` 306 | Lists all running containers on the Node Manager. 307 | 308 | --- 309 | 310 | ### **Debugging and Troubleshooting** 311 | 20. **View Container Logs** 312 | ```bash 313 | yarn logs -containerId -nodeAddress 314 | ``` 315 | Retrieves logs for a specific container. 316 | 317 | 21. **Check Application Environment Variables** 318 | ```bash 319 | yarn application -envs 320 | ``` 321 | Displays environment variables for a specific application. 322 | 323 | --- 324 | 325 | These commands allow you to manage applications, queues, resources, and logs effectively on a Hadoop Yarn cluster. 326 | 327 | ### Additional Tips 328 | 329 | - **Custom Jobs**: You can write your own MapReduce programs in Java and package them into a JAR file, then submit them to YARN in a similar way. 330 | - **Resource Allocation**: If you want to control how much memory or CPU your YARN job uses, you can specify resources in the command, or modify the YARN configuration files. 331 | 332 | --- 333 | 334 | ### Troubleshooting 335 | 336 | - **Job Not Starting**: If the job does not start or fails, check the logs for errors. You can view the logs from the YARN ResourceManager UI or use the following command to retrieve logs: 337 | 338 | ```bash 339 | yarn logs -applicationId 340 | ``` 341 | 342 | - **Out of Memory Errors**: If your job runs into memory issues, consider adjusting the memory allocation in the `yarn-site.xml` configuration file for your NodeManagers and ResourceManager. 343 | 344 | --- 345 | 346 | **Conclusion** 347 | This practical exercise provided a hands-on experience in running a simple MapReduce job (WordCount) on YARN. You can now submit jobs, monitor them, and view results in HDFS using the YARN ResourceManager. By following the steps outlined, you should be able to run more complex jobs and work with Hadoop in a YARN-managed environment. 348 | 349 | --- 350 | 351 | ### Instructions for Use: 352 | - Ensure your Hadoop environment (including YARN and HDFS) is properly set up before running the job. 353 | - Submit your jobs using the `hadoop jar` command and monitor their progress through the YARN UI. 
354 | - Clean up your HDFS after completing the practical exercise to maintain a clutter-free environment.
355 | 
--------------------------------------------------------------------------------