├── centos7de
    ├── Vagrantfile
    └── README.md
├── centos7spark
    ├── Vagrantfile
    └── README.md
└── README.md


/centos7de/Vagrantfile:
--------------------------------------------------------------------------------
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "centos7de"

  config.vm.provider "virtualbox" do |vb|
    # Customize the amount of memory on the VM:
    vb.memory = "1024"
  end
end
--------------------------------------------------------------------------------
/centos7spark/Vagrantfile:
--------------------------------------------------------------------------------
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "itversity/centos7spark"
  # Forward Jupyter (8888) and the Spark application UI (4040) to the host.
  config.vm.network "forwarded_port", guest: 8888, host: 8888
  config.vm.network "forwarded_port", guest: 4040, host: 4040

  config.vm.provider "virtualbox" do |vb|
    # Customize the CPU count and the amount of memory on the VM:
    vb.cpus = "2"
    vb.memory = "4096"
  end
end
--------------------------------------------------------------------------------
/centos7de/README.md:
--------------------------------------------------------------------------------
## Data Engineering Box

This Vagrant box provides the following to learn Data Engineering using programming languages such as Python.

* Python 3.7
* Postgres
* MySQL
* MongoDB
* Jupyter Notebook
* Port 8888 is exposed as 8888 on the host.

## Setup Process

Once the repository is cloned, go to centos7de by running `cd centos7de`.

* Bring up the virtual machine - Run `vagrant up`.
* Connect to the virtual machine - Run `vagrant ssh`.
* Run `jupyter notebook --ip 0.0.0.0`. You will see a link similar to **http://localhost:8888**. It will prompt for the token, which can be copied from the terminal and pasted once Jupyter Notebook is started.
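
If **http://localhost:8888** is not reachable from the host, one possible cause is that the port is not forwarded. The fragment below is a minimal sketch of how the `centos7de` Vagrantfile could forward it, reusing the `forwarded_port` rule from the `centos7spark` Vagrantfile. It assumes the box does not already define this rule; if it does, the extra line is unnecessary.

```ruby
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "centos7de"

  # Assumption: only needed when the box itself does not forward this port.
  # Maps Jupyter Notebook's port 8888 in the guest to port 8888 on the host.
  config.vm.network "forwarded_port", guest: 8888, host: 8888

  config.vm.provider "virtualbox" do |vb|
    vb.memory = "1024"
  end
end
```

After editing the Vagrantfile, running `vagrant reload` restarts the virtual machine so that the new network setting takes effect.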
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ITVersity-Boxes
Repository for all ITVersity Vagrant Boxes.

## Setting up

Here are the instructions to use this repository. More Vagrant Boxes will be added over time.
* Clone this repository `git clone https://github.com/dgadiraju/itversity-boxes.git`
* Go to the respective directory and follow the instructions provided for each of the Vagrant Boxes below.

## Data Engineering

As part of [centos7de](https://github.com/dgadiraju/itversity-boxes/tree/master/centos7de), we provide the capabilities to learn Data Engineering using Python as the programming language.

## Learning Spark

As part of [centos7spark](https://github.com/dgadiraju/itversity-boxes/tree/master/centos7spark), we provide the capabilities to learn Spark.

* It contains Hadoop (HDFS and YARN).
* It also has a sample dataset.
* Spark 2.4.5 is provided as part of the image.
* Follow the instructions provided as part of the [README](https://github.com/dgadiraju/itversity-boxes/tree/master/centos7spark).
--------------------------------------------------------------------------------
/centos7spark/README.md:
--------------------------------------------------------------------------------
# Learning Spark

You can clone the repository, go to this folder, and follow the instructions below to set up the Virtual Machine and use it.

## Technologies Provided

Here are the technologies that are set up on this virtual machine.

* Python 2.7
* Python 3.6
* Hadoop 3.2.1
* Spark 2.4.5

## Usages

Here are some of the usages of the Virtual Machine.

* Learn Spark using Jupyter Lab environment.
* Prepare for CCA 175 Spark and Hadoop Developer Certification using Python or Scala.

In case you don't have enough resources to practice, you can sign up for our [labs](https://labs.itversity.com).

## Starting Virtual Machine

Here are the instructions to start the Virtual Machine.

* Make sure you are inside the folder `centos7spark`.
* Run `vagrant up` to bring up the Virtual Machine.
* You can connect to the Virtual Machine by running `vagrant ssh`.
* By default you will be logged in as user `vagrant`.
* Most of the files in the virtual machine are owned by `vagrant`.
* `vagrant` can sudo as root and can perform any task in the virtual machine.

## Starting Services

Here are the instructions to start the services.

* HDFS - `start-dfs.sh`
* YARN - `start-yarn.sh`
* You can run `jps` to see all the relevant processes running.
* You can also run `hdfs dfs -ls /user/vagrant` to confirm HDFS is up and running.

## Stopping Virtual Machine

Here are the instructions to stop the virtual machine.

* You need to make sure that all the services are gracefully stopped.
* HDFS - `stop-dfs.sh`
* YARN - `stop-yarn.sh`
* You can validate by running `jps`. It should not list any of the HDFS or YARN components.
* You can leave the virtual machine by running the `exit` command.
* Once you are back in the `centos7spark` folder, you can run `vagrant halt` to bring down the virtual machine.

## Using retail_db dataset

As part of the virtual machine, the data is made available in the local file system of the Virtual Machine.

* Location: `/data/retail_db`.
* We can copy `/data/retail_db` into the HDFS location `/public` by running `hdfs dfs -put /data/retail_db /public`.
* You can validate by running `hdfs dfs -ls /public/retail_db`.
You should see the folders related to retail_db such as **orders**, **order_items** etc.

## Launching Jupyter Lab

The lab is set up with Jupyter Lab with the following kernels.

* Python 3
* Pyspark 2
* Apache Toree - Scala (Spark)
* Apache Toree - SQL (Spark SQL)

Here are the instructions to run Jupyter Lab.

* Run `jupyter lab --ip 0.0.0.0`
* Go to the browser and enter - `http://localhost:8888`
* Copy the token that is generated on the terminal, paste it into the login page, and log in.
* You will see bare minimum content to begin with.

## Using Content from GitHub

At ITVersity, we not only provide the Virtual Machine, we also provide content to practice. Here are the details about using the content to practice Spark, especially for CCA 175.

* Click [here](https://github.com/dgadiraju/itversity-books/) to visit the official repository for all ITVersity Notebooks.
* Visit the appropriate folders and click on the Notebook.
* Once the Notebook is opened, you can copy and paste the code into the virtual machine or any other environment you want to practice in.
* Keep in mind that you might have to make some adjustments, such as using the appropriate paths of your environment to access the data.

## Accessing Content Locally

On top of providing the Virtual Machine, we are also planning to open source our high quality content. Here are the instructions to access it in the Virtual Machine.

* Make sure you are in the virtual machine under `/home/vagrant`.
* If you run `ls -ltr` you will see a folder named `itversity-books`.
* You can go to the folder by running `cd itversity-books`.
* Run `git pull` to get the latest content. Keep in mind that it might replace existing books if you made any changes.

## Spark - Getting Started

TBD
--------------------------------------------------------------------------------