├── centos7de
│   ├── Vagrantfile
│   └── README.md
├── centos7spark
│   ├── Vagrantfile
│   └── README.md
└── README.md


/centos7de/Vagrantfile:
--------------------------------------------------------------------------------
 1 | # -*- mode: ruby -*-
 2 | # vi: set ft=ruby :
 3 | 
 4 | Vagrant.configure("2") do |config|
 5 |   config.vm.box = "centos7de"
 6 |   config.vm.network "forwarded_port", guest: 8888, host: 8888 # Jupyter Notebook, per README.md
 7 |   config.vm.provider "virtualbox" do |vb|
 8 |     # Customize the amount of memory on the VM:
 9 |     vb.memory = "1024"
10 |   end
11 | end
12 | 


--------------------------------------------------------------------------------
/centos7spark/Vagrantfile:
--------------------------------------------------------------------------------
 1 | # -*- mode: ruby -*-
 2 | # vi: set ft=ruby :
 3 | 
 4 | Vagrant.configure("2") do |config|
 5 |   config.vm.box = "itversity/centos7spark"
 6 |   config.vm.network "forwarded_port", guest: 8888, host: 8888 # Jupyter Lab
 7 |   config.vm.network "forwarded_port", guest: 4040, host: 4040 # Spark application UI (default port)
 8 | 
 9 |   config.vm.provider "virtualbox" do |vb|
10 |     vb.cpus = "2"
11 |     vb.memory = "4096"
12 |   end
13 | end
14 | 


--------------------------------------------------------------------------------
/centos7de/README.md:
--------------------------------------------------------------------------------
 1 | ## Data Engineering Box
 2 | 
 3 | This Vagrant box provides the following to learn Data Engineering using programming languages like Python.
 4 | 
 5 | * Python 3.7
 6 | * Postgres
 7 | * MySQL
 8 | * MongoDB
 9 | * Jupyter Notebook
10 | * Guest port 8888 (Jupyter Notebook) is exposed on the host as port 8888.
11 | 
12 | ## Setup Process
13 | 
14 | Once the repository is cloned, go to the `centos7de` directory by running `cd centos7de`.
15 | 
16 | * Bring up the virtual machine - Run `vagrant up`.
17 | * Connect to the virtual machine - Run `vagrant ssh`.
18 | * Run `jupyter notebook --ip 0.0.0.0`. You will see a link something like this - **http://localhost:8888**. The page prompts for a token, which can be copied from the terminal output once Jupyter Notebook has started.
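
The first two steps above can be wrapped in a small shell function on the host. This is only a sketch: `de_box_up` is a hypothetical name, and it assumes `vagrant` is on the `PATH` and that the argument points at the folder holding the Vagrantfile.

```shell
# Hypothetical helper (not part of this repository): boot the box and open a shell in it.
de_box_up() {
  box_dir="${1:-centos7de}"   # folder that holds the Vagrantfile
  # Fail early with a clear message instead of letting vagrant search parent folders.
  [ -f "$box_dir/Vagrantfile" ] || { echo "no Vagrantfile in $box_dir" >&2; return 1; }
  # Run in a subshell so the caller's working directory is left unchanged.
  ( cd "$box_dir" && vagrant up && vagrant ssh )
}
```

From the repository root, `de_box_up centos7de` would bring the machine up and log in; `jupyter notebook --ip 0.0.0.0` is then run inside that session.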
19 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ITVersity-Boxes
 2 | Repository for all ITVersity Vagrant Boxes.
 3 | 
 4 | ## Setting up
 5 | 
 6 | Here are the instructions to use this repository. More Vagrant boxes will be added over time.
 7 | * Clone this repository: `git clone https://github.com/dgadiraju/itversity-boxes.git`
 8 | * Go to the respective directory and follow the instructions provided below for that Vagrant box.
 9 | 
10 | ## Data Engineering
11 | 
12 | As part of [centos7de](https://github.com/dgadiraju/itversity-boxes/tree/master/centos7de), we will provide the capabilities to learn Data Engineering using Python as the programming language.
13 | 
14 | ## Learning Spark
15 | 
16 | As part of [centos7spark](https://github.com/dgadiraju/itversity-boxes/tree/master/centos7spark), we will provide the capabilities to learn Spark.
17 | 
18 | * It contains Hadoop (HDFS and YARN).
19 | * It also has a sample dataset.
20 | * Spark 2.4.5 is provided as part of the image.
21 | * Follow the instructions provided as part of the [README](https://github.com/dgadiraju/itversity-boxes/tree/master/centos7spark).
22 | 


--------------------------------------------------------------------------------
/centos7spark/README.md:
--------------------------------------------------------------------------------
 1 | # Learning Spark
 2 | 
 3 | Clone the repository, go into this folder, and follow the instructions below to set up and use the virtual machine.
 4 | 
 5 | ## Technologies Provided
 6 | 
 7 | Here are the technologies that are set up on this virtual machine.
 8 | 
 9 | * Python 2.7
10 | * Python 3.6
11 | * Hadoop 3.2.1
12 | * Spark 2.4.5
13 | 
14 | ## Usages
15 | 
16 | Here are some of the ways you can use the virtual machine.
17 | 
18 | * Learn Spark using Jupyter Lab environment.
19 | * Prepare for CCA 175 Spark and Hadoop Developer Certification using Python or Scala.
20 | 
21 | In case you don't have enough resources to practice, you can sign up for our [labs](https://labs.itversity.com).
22 | 
23 | ## Starting Virtual Machine
24 | 
25 | Here are the instructions to start the virtual machine.
26 | 
27 | * Make sure you are inside the folder `centos7spark`.
28 | * Run `vagrant up` to bring up the virtual machine.
29 | * You can connect to the virtual machine by running `vagrant ssh`.
30 | * By default you will be logged in as user `vagrant`.
31 | * Most of the files in the virtual machine are owned by `vagrant`.
32 | * `vagrant` can use `sudo` to run commands as root, so it can perform any task in the virtual machine.
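
To confirm the login user and `sudo` access from the host without opening an interactive session, something like the following could be used. It is a sketch under assumptions: `check_vm_user` is a hypothetical name, the machine is already up, and `sudo` inside the box does not ask for a password (the usual Vagrant convention, not verified for this box).

```shell
# Hypothetical helper: run from the centos7spark folder on the host.
check_vm_user() {
  # `vagrant ssh -c` runs a single command in the guest.
  # `sudo -n` fails instead of prompting if a password would be required.
  out="$(vagrant ssh -c 'whoami; sudo -n whoami' 2>/dev/null | tr -d '\r')"
  [ "$out" = "vagrant
root" ] && echo "ok: logged in as vagrant, sudo works"
}
```

The function returns a non-zero status when the output differs, so it can be chained as `vagrant up && check_vm_user`.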
33 | 
34 | ## Starting Services
35 | 
36 | Here are the instructions to start the services.
37 | 
38 | * HDFS - `start-dfs.sh`
39 | * YARN - `start-yarn.sh`
40 | * You can run `jps` to see all the relevant processes running.
41 | * You can also run `hdfs dfs -ls /user/vagrant` to confirm HDFS is up and running.
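
The `jps` check above can be automated with a short function that reports which daemons are missing. This is a sketch: `missing_daemons` is a hypothetical name, and the list of daemon names assumes a single-node setup that runs NameNode, DataNode, ResourceManager and NodeManager.

```shell
# Hypothetical helper: print each expected Hadoop daemon that jps does not report.
missing_daemons() {
  running="$(jps)"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # jps prints "<pid> <name>"; anchoring on " <name>$" keeps
    # SecondaryNameNode from being counted as NameNode.
    printf '%s\n' "$running" | grep -q " $daemon\$" || echo "$daemon"
  done
}
```

Inside the virtual machine, `start-dfs.sh && start-yarn.sh && missing_daemons` prints nothing when all four daemons are running.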
42 | 
43 | ## Stopping Virtual Machine
44 | 
45 | Here are the instructions to stop the virtual machine.
46 | 
47 | * You need to make sure that all the services are gracefully stopped.
48 |   * HDFS - `stop-dfs.sh`
49 |   * YARN - `stop-yarn.sh`
50 | * You can validate by running `jps`; it should not list any of the HDFS or YARN components.
51 | * You can leave the virtual machine by running the `exit` command.
52 | * Once you are back in the `centos7spark` folder, run `vagrant halt` to bring down the virtual machine.
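
The in-VM part of this shutdown can be sketched as one function that stops the services and then checks `jps`. The name `stop_hadoop` is hypothetical, and the daemon names in the pattern are the standard HDFS and YARN process names, assumed to be what this box runs.

```shell
# Hypothetical helper: run inside the virtual machine before `exit` and `vagrant halt`.
stop_hadoop() {
  # Same order as the list above: HDFS first, then YARN.
  stop-dfs.sh && stop-yarn.sh || return 1
  # After a clean stop, jps should list none of these processes.
  if jps | grep -Eq ' (NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager)$'; then
    echo "Hadoop daemons still running" >&2
    return 1
  fi
  echo "all Hadoop daemons stopped"
}
```

A typical sequence would be `stop_hadoop && exit`, followed by `vagrant halt` on the host.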
53 | 
54 | ## Using retail_db dataset
55 | 
56 | The data is available on the local file system of the virtual machine.
57 | 
58 | * Location: `/data/retail_db`.
59 | * We can copy `/data/retail_db` into the HDFS location `/public` by running `hdfs dfs -put /data/retail_db /public`.
60 | * You can validate by running `hdfs dfs -ls /public/retail_db`. You should see the folders related to retail_db, such as **orders**, **order_items**, etc.
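
The copy and the validation can be combined into a function that is safe to run more than once. This is a sketch: `load_retail_db` is a hypothetical name, and it assumes `/public` may not exist yet, so it creates the directory first (otherwise `-put` would name the copy `/public` itself rather than `/public/retail_db`).

```shell
# Hypothetical helper: copy a local dataset folder into HDFS, then list it.
load_retail_db() {
  src="${1:-/data/retail_db}"   # local folder inside the virtual machine
  dest="${2:-/public}"          # HDFS parent directory
  name="$(basename "$src")"
  if hdfs dfs -test -d "$dest/$name"; then
    # -put refuses to overwrite, so skip the copy when the data is already there.
    echo "$dest/$name already exists, skipping copy"
  else
    hdfs dfs -mkdir -p "$dest" && hdfs dfs -put "$src" "$dest" || return 1
  fi
  hdfs dfs -ls "$dest/$name"
}
```

Called without arguments it uses the paths from the list above; HDFS has to be running first.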
61 | 
62 | ## Launching Jupyter Lab
63 | 
64 | The lab is set up with Jupyter Lab with the following kernels.
65 | 
66 | * Python 3
67 | * Pyspark 2
68 | * Apache Toree - Scala (Spark)
69 | * Apache Toree - SQL (Spark SQL)
70 | 
71 | Here are the instructions to run Jupyter Lab.
72 | 
73 | * Run `jupyter lab --ip 0.0.0.0`
74 | * Go to the browser and enter - `http://localhost:8888`
75 | * Copy the token that is printed on the terminal, paste it into the login page, and log in.
76 | * You will see bare minimum content to begin with.
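
If the token has scrolled out of view, it can usually be recovered from a second `vagrant ssh` session. The sketch below assumes the installed Jupyter Lab runs on the classic notebook server, where `jupyter notebook list` prints the running server URLs with their tokens; newer releases use `jupyter server list` instead. `jupyter_token` is a hypothetical name.

```shell
# Hypothetical helper: print the token of the first running Jupyter server.
jupyter_token() {
  # Expected line shape: http://0.0.0.0:8888/?token=<token> :: <notebook dir>
  jupyter notebook list \
    | sed -n 's/.*[?&]token=\([0-9a-zA-Z]*\).*/\1/p' \
    | head -n 1
}
```

The printed value is what the login page at `http://localhost:8888` asks for.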
77 | 
78 | ## Using Content from GitHub
79 | 
80 | At ITVersity, we provide not only the virtual machine but also content to practice with. Here are the details about using the content to practice Spark, especially for CCA 175.
81 | 
82 | * Click [here](https://github.com/dgadiraju/itversity-books/) to visit the official repository for all ITVersity Notebooks.
83 | * Browse to the appropriate folder and click on the Notebook.
84 | * Once the Notebook is opened, you can copy and paste the code into the virtual machine or any other environment you want to practice in.
85 | * Keep in mind that you might have to make some adjustments, such as using the appropriate paths for your environment to access the data.
86 | 
87 | ## Accessing Content Locally
88 | 
89 | In addition to providing the virtual machine, we are also planning to open source our high-quality content. Here are the instructions to access it in the virtual machine.
90 | 
91 | * Make sure you are in the virtual machine under `/home/vagrant`.
92 | * If you run `ls -ltr` you will see a folder named `itversity-books`.
93 | * You can get into the folder by running `cd itversity-books`.
94 | * Run `git pull` to get the latest content. Keep in mind that it might replace existing books if you have made any changes.
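
Because pulling can collide with notebooks you have edited, one cautious approach is to check for local changes first. This is a sketch, not part of the box: `update_books` is a hypothetical name, and the default path assumes the clone lives at `$HOME/itversity-books` as described above.

```shell
# Hypothetical helper: pull the latest notebooks only when there are no local edits.
update_books() {
  repo="${1:-$HOME/itversity-books}"
  # `git status --porcelain` prints one line per modified or untracked file.
  if [ -n "$(git -C "$repo" status --porcelain)" ]; then
    echo "local changes in $repo; commit or back them up before pulling" >&2
    return 1
  fi
  # --ff-only updates the checkout without creating a merge commit.
  git -C "$repo" pull --ff-only
}
```

If the function reports local changes, copying the edited notebooks to another folder before pulling keeps your work safe.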
95 | 
96 | ## Spark - Getting Started
97 | 
98 | TBD
99 | 


--------------------------------------------------------------------------------