├── .gitignore
├── AWS
    ├── EKS.md
    ├── EMR.md
    ├── README.md
    └── S3.md
├── Git.md
├── Helm
    ├── Jupyterhub.md
    └── README.md
├── Linux.md
├── PySpark.md
├── Python.md
├── PythonScript
    ├── json_load.py
    ├── re_date_time.py
    ├── read_With_custom_schema.py
    ├── read_from_gitlab.py
    ├── read_parquet_file.py
    └── read_part_file.py
├── README.md
├── mongodb.md
├── pyspark
    ├── encrypt_decryt.py
    ├── profiler.sh
    ├── pySparkApp
    │   ├── Makefile
    │   ├── README.md
    │   ├── dist
    │   │   └── foo.zip
    │   ├── foo
    │   │   ├── __init__.py
    │   │   └── foo.py
    │   └── main.py
    ├── read_hive_table.py
    └── when_otherwise.py
└── terraform.md

/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | .DS_Store

--------------------------------------------------------------------------------
/AWS/EKS.md:
--------------------------------------------------------------------------------
1 | ## EKS
2 | ### Delete multiple pods
3 | ```bash
4 | $ kubectl get pods -n default | grep Running | cut -d' ' -f 1 | xargs kubectl delete pod -n default
5 | ```
6 | 
7 | ### See which pod is running on which node
8 | ```bash
9 | $ kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name -n stage
10 | ```
11 | 
12 | ### Adding WebIdentity role to a ServiceAccount
13 | ```bash
14 | $ kubectl annotate serviceaccount -n <namespace> <service_account_name> \
15 | eks.amazonaws.com/role-arn=arn:aws:iam::<account_id>:role/<role_name>
16 | ```
17 | 
18 | ### Get SparkUI for a Spark Pod
19 | ```bash
20 | $ kubectl port-forward <pod_name> 4040:4040 --namespace <namespace>
21 | ```
22 | 
23 | ### Find a file in Linux
24 | ```bash
25 | $ sudo find . -name <file_name>
26 | ```
27 | 
28 | ### Set default namespace for kubectl commands
29 | ```bash
30 | $ kubectl config set-context --current --namespace=<namespace>
31 | # Validate it
32 | $ kubectl config view --minify | grep namespace:
33 | ```
34 | 
35 | ### Configure EKS
36 | ```bash
37 | $ export KUBECONFIG=$KUBECONFIG:~/.kube/config-nonprod
38 | $ aws eks update-kubeconfig --name <cluster_name> --region us-east-1
39 | ```
40 | 
41 | ### Get all pods in an EKS cluster
42 | ```bash
43 | $ kubectl get pods --all-namespaces
44 | ```
45 | 
46 | ### Get airflow UI for airflow pod
47 | ```bash
48 | $ kubectl port-forward <pod_name> 8080:8080
49 | ```

--------------------------------------------------------------------------------
/AWS/EMR.md:
--------------------------------------------------------------------------------
1 | ## EMR
2 | ### Connect to EMR Cluster from CLI
3 | 
4 | `ssh -i file.pem hadoop@ip.ip.ip.ip`
5 | 
6 | ### Screen
7 | 
8 | * Create a new screen: `screen -S aranjan`
9 | * Go into a specific screen: `screen -rd aranjan`
10 | * Nested screen: `screen -t aman`
11 | 
12 | ### Secure copy a file from local to Hadoop
13 | 
14 | `scp -i file.pem file_path hadoop@ip.ip.ip.ip:/home/hadoop/`

--------------------------------------------------------------------------------
/AWS/README.md:
--------------------------------------------------------------------------------
1 | # Data Engineer's Essential Commands
2 | 
3 | * Linux: [Link](../Linux.md)
4 | * Python: [Link](../Python.md)
5 | * PySpark: [Link](../PySpark.md)
6 | * **AWS**:
7 |   * 
EKS: [Link](EKS.md) 8 | * EMR: [Link](EMR.md) 9 | * S3: [Link](S3.md) 10 | * Terraform: [Link](../terraform.md) 11 | * Git: [Link](../Git.md) 12 | * Helm: [Link](../Helm) 13 | * Jupyterhub: [Link](../Helm/Jupyterhub.md) 14 | -------------------------------------------------------------------------------- /AWS/S3.md: -------------------------------------------------------------------------------- 1 | # AWS 2 | 3 | ## S3 4 | ### Rename files in S3 5 | 6 | `for f in $(aws s3api list-objects --bucket bucket_name --prefix "key/" --delimiter "/" | grep 097 | cut -d : -f 2 | cut -d \" -f 2); do aws s3 mv s3://bucket_name/$f s3://bucket_name/${f/%/.csv.gz}; done ` 7 | 8 | ### Sync files at two S3 locations 9 | 10 | `aws s3 sync s3_source s3_target --recursive` 11 | 12 | ### Download bucket object from s3 13 | `aws s3 cp s3://.. . --recursive` 14 | 15 | -------------------------------------------------------------------------------- /Git.md: -------------------------------------------------------------------------------- 1 | # Git 2 | 3 | ## Table of Contents 4 | 5 | ### Set up ▶️ 6 | This commands will be useful to set up your project/directory 7 | 1) [.gitignore](#ignoringFiles) 8 | 2) [Create branch](#creatingBranches) 9 | 10 | ### Lifecycle 🔄 11 | This commands will be useful in the lifecycle project 12 | 1) [Modifying Commits](#modifyingCommits) 13 | 14 | ----------- 15 | 16 | [.gitignore](https://www.toptal.com/developers/gitignore) 17 | ``` 18 | # ignore all .a files 19 | *.a 20 | 21 | # but do track lib.a, even though you're ignoring .a files above 22 | !lib.a 23 | 24 | # only ignore the TODO file in the current directory, not subdir/TODO 25 | /TODO 26 | 27 | # ignore all files in any directory named build 28 | build/ 29 | 30 | # ignore doc/notes.txt, but not doc/server/arch.txt 31 | doc/*.txt 32 | 33 | # ignore all .pdf files in the doc/ directory and any of its subdirectories 34 | doc/**/*.pdf 35 | 36 | ``` 37 | 38 | ## Create Branch 39 | 40 | 1. First, check in which branch you are right now: 41 | ``` 42 | git branch 43 | ``` 44 | You'll get a list of the branchs available, the branch with an ```*``` next to it, is the branch you are working on. 45 | 46 | 2. Then, create a new branch: 47 | ``` 48 | git branch name_of_the_branch_to_create 49 | ``` 50 | 51 | 3. Move to the new branch created: 52 | ``` 53 | git checkout name_of_the_branch_to_create 54 | ``` 55 | 56 | 4. You can do step 2 and 3 by one command: 57 | ``` 58 | git checkout -b name_of_the_branch_to_create 59 | ``` 60 | ----------- 61 | 62 | [Modifying Commits](https://classroom.udacity.com/courses/ud123/lessons/f02167ad-3ba7-40e0-a157-e5320a5b0dc8/concepts/e176503b-3eae-4b22-a1b3-2953bab3d5e5) 63 | 64 | ### Changing the last commit message 65 | `$ git commit --amend` 66 | 67 | ### Add files to last commit 68 | * edit the file(s) 69 | * save the file(s) 70 | * stage the file(s) 71 | * and run `git commit --amend` 72 | 73 | ### Reverse a previously made commit, undo the changes 74 | `$ git revert ` 75 | 76 | ### [Reset vs Revert](https://classroom.udacity.com/courses/ud123/lessons/f02167ad-3ba7-40e0-a157-e5320a5b0dc8/concepts/fed81eb7-49b4-4129-9f6b-8201e0796fd8) 77 | At first glance, resetting might seem coincidentally close to reverting, but they are actually quite different. Reverting creates a new commit that reverts or undos a previous commit. Resetting, on the other hand, erases commits! 
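A minimal sketch of the difference (`<SHA>` is a placeholder for the commit you want to undo):

```
# revert: creates a new commit that undoes <SHA>; history is preserved
$ git revert <SHA>

# reset: moves the branch pointer back one commit and drops it from history
$ git reset --hard HEAD~1
```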
78 | 79 | Before I do any resetting, I usually create a backup branch on the most-recent commit so that I can get back to the commits if I make a mistake. 80 | 81 | ### Change the git upstream url from SSH to HTTPS or back 82 | `$ git remote set-url origin git@github.......url` -------------------------------------------------------------------------------- /Helm/Jupyterhub.md: -------------------------------------------------------------------------------- 1 | ### JupyterHub 2 | ``` 3 | helm repo add stable https://kubernetes-charts.storage.googleapis.com 4 | helm repo update 5 | helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/ 6 | helm upgrade --install jupyterhub/jupyterhub --namespace --version=0.9.0 --values config.yaml 7 | helm uninstall -n 8 | ``` -------------------------------------------------------------------------------- /Helm/README.md: -------------------------------------------------------------------------------- 1 | # Data Enginner's Essential Commands 2 | 3 | * Linux: [Link](../Linux.md) 4 | * Python: [Link](../Python.md) 5 | * PySpark: [Link](../PySpark.md) 6 | * AWS: [Link](../AWS/) 7 | * EKS: [Link](../AWS/EKS.md) 8 | * EMR: [Link](../AWS/EMR.md) 9 | * S3: [Link](../AWS/S3.md) 10 | * Terraform: [Link](../terraform.md) 11 | * Git: [Link](../Git.md) 12 | * **Helm**: 13 | * Jupyterhub: [Link](Jupyterhub.md) -------------------------------------------------------------------------------- /Linux.md: -------------------------------------------------------------------------------- 1 | # Linux 2 | 3 | ## Table of contents 4 | 5 | 1) [Run a job in background on Linux](#runAJobInBack) 6 | 2) [tar](#tar) 7 | 3) [gunzip](#gunzip) 8 | 4) [Add extension to a file](#addExtensionToAFile) 9 | 5) [Find a file in linux](#findAFile) 10 | 6) [Check IP to whitelist](#checkIpToWhitelist) 11 | 7) [Outputs Geographical Information, regarding an ip_address](#outputGeographicalInfo) 12 | 9) [Show the available space on the mounted filesystems](#showFilesystemSpace) 13 | 10) [Show the size of a file or a folder](#showSizeOfFile) 14 | 11) [List the conents of a file in a numbered fashion](#listContentsOfAFile) 15 | 12) [Search Bash history](#searchBashHistory) 16 | 13) [Moves the cursor to the beginning of the line](#moveCursorToTheBeginning) 17 | 14) [Moves the cursor to the end of the line](#moveCursorToTheEndOfLine) 18 | 15) [Run previous ran command](#runPreviousCommand) 19 | 16) [Find word in a file](#findWordInAFile) 20 | 17) [Get 1st N rows from a file](#get1stNrows) 21 | 22 | #### Run a job in background on Linux 23 | 24 | ```bash 25 | $ nohup command > my.log 2>&1 & 26 | ``` 27 | 28 | #### Check disk usage for directory 29 | ```bash 30 | $ sudo du -h --max-depth=1 | sort -h 31 | $ sudo du -h --max-depth=4 ./log | sort -h 32 | ``` 33 | #### tar 34 | 35 | * Create a tar archive 36 | ```bash 37 | $ tar -cvf 38 | ``` 39 | * -c - create 40 | * -v - verbose 41 | * -f - the filename of the tar archive 42 | 43 | * Extract tar archives 44 | ```bash 45 | $ tar -xvf 46 | ``` 47 | * -x - extract 48 | * -v - verbose 49 | * -f - the filename of tar archive to extract 50 | 51 | * Create archives and compress with tar 52 | ```bash 53 | $ tar -czvf ` 54 | ``` 55 | * -c - create 56 | * -z - zip 57 | * -v - verbose 58 | * -f - the filename of the compressed file 59 | 60 | * Uncompress using tar 61 | ```bash 62 | $ tar -xzvf 63 | ``` 64 | 65 | * -xz - uncompress and extract 66 | * -v - verbose 67 | * -f - the filename of the compressed file 68 | #### gunzip 69 | 70 | * Compress a file with 
gzip 71 | ```bash 72 | gzip 73 | ``` 74 | 75 | * Extract .gz file 76 | ```bash 77 | gunzip 78 | ``` 79 | 80 | #### Add extension to file 81 | ```bash 82 | $ for f in *; do mv "$f" "$f.gz"; done 83 | ``` 84 | 85 | #### Find a file in Linux 86 | ```bash 87 | $ sudo find . -name 88 | ``` 89 | 90 | #### Check IP to whitelist 91 | ```bash 92 | $ curl ifconfig.me ; echo 93 | ``` 94 | 95 | #### Outputs Geographical Information, regarding an ip_address 96 | For current system's ip address: 97 | 98 | ```bash 99 | $ curl ipinfo.io 100 | ``` 101 | 102 | For any specific ip address: 103 | 104 | ```bash 105 | $ curl ipinfo.io/ 106 | ``` 107 | 108 | #### Show the available space on the mounted filesystems 109 | ```bash 110 | $ df -h 111 | ``` 112 | 113 | #### Show the size of a file or a folder 114 | ```bash 115 | $ du -sh 116 | ``` 117 | 118 | #### List the conents of a file in a numbered fashion 119 | ```bash 120 | $ nl 121 | ``` 122 | 123 | #### Search Bash history 124 | ```bash 125 | Ctrl+r 126 | ``` 127 | 128 | #### Moves the cursor to the beginning of the line 129 | ```bash 130 | Ctrl+a 131 | ``` 132 | 133 | #### Moves the cursor to the end of the line 134 | ```bash 135 | Ctrl+e 136 | ``` 137 | 138 | #### Runs previous ran command 139 | ```bash 140 | $ !! 141 | ``` 142 | 143 | #### Find word in a file 144 | ```bash 145 | $ grep 146 | ``` 147 | 148 | #### Get 1st N rows from a file 149 | ```bash 150 | !head -5 file_name.csv 151 | ``` 152 | 153 | -------------------------------------------------------------------------------- /PySpark.md: -------------------------------------------------------------------------------- 1 | # PySpark 2 | 3 | [PySpark Examples Official Doc link](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html) 4 | 5 | ### Create Dataframe 6 | ``` 7 | spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate() 8 | 9 | columns = ["letter","number"] 10 | data = [(l, n) for l, n in zip("abcdefghijklmnopqrstuvwxyz", "12345678912345678912345678")] 11 | rdd = spark.sparkContext.parallelize(data) 12 | df = rdd.toDF(columns) 13 | ``` 14 | 15 | ### Read csv file in Spark with Schema Inference 16 | ``` 17 | df = spark.read.csv("file_path/file_name.csv", header=True, inferSchema=True) 18 | ``` 19 | 20 | ### Run SQL Query on Spark DF 21 | ``` 22 | df.createOrReplaceTempView("table_name") 23 | df_new = spark.sql("Select * from table_name") 24 | df_new.show() 25 | ``` 26 | 27 | ### Get number of partitions 28 | ``` 29 | if dataframe: 30 | df.rdd.getNumPartitions() 31 | if RDD 32 | rdd.getNumPartitions() 33 | ``` 34 | 35 | ### Repartition dataframe into "n" partitions 36 | * Partitions has unequal distribution of data(Fast since less suffeling, cann't increase number of partitions) = `df.coalesce(n)` 37 | * Partitoins has equal distribution of data(Slow since more suffeling) = `df.repartition(n)` 38 | 39 | ### Drop Columns from Spark DataFrame 40 | 41 | `df = df.drop("query_type").drop("_c0").drop("_c01")` 42 | 43 | ### Write data to S3 44 | 45 | `df.write.mode("overwrite").format("csv").option("compression", ".gzip").save(output_s3_path_in_string, header=True)` 46 | 47 | ### Validation Steps 48 | ```python 49 | >>> df_new = spark.read.csv("s3://leadid-sandbox/aranjan/mysql_leads_new") 50 | 51 | >>> df_old = spark.read.option("delimiter", "\x01").csv("s3://..") 52 | 53 | >>> df_new.count() 54 | 55 | >>> df_intersect = df_old.intersect(df_new) 56 | 57 | >>> df_subtract = df_old.subtract(df_intersect) 58 | 59 | >>> df_new.filter(df_new._c0 == 
"").filter(df_new._c1 == "").head() 60 | ``` 61 | 62 | ### Select records with specific string in columns 63 | 64 | `df.filter(lower(col("_c0")).contains('%string_to_find%')).head()` 65 | 66 | ### Get max/min/mean value for a column 67 | ``` 68 | max_value = df.agg({"_c0": "max"}).collect()[0] 69 | mean_value = df.agg({"_c0": "mean"}).collect()[0] 70 | min_value = df.agg({"_c0": "min"}).collect()[0] 71 | df.select("_c0").rdd.min()[0] 72 | df.select("_c0").rdd.max()[0] 73 | ``` 74 | ### Arithmetic Operation on Columns (-, +, %, /, **) 75 | ``` 76 | df = df.withColumn("new_col", df._c0 * df._c1) 77 | df = df.withColumn("new_col", df._c0 + 100) 78 | df = df.withColumn("new_col", df._c0 + lit(100)) 79 | ``` 80 | 81 | ### Spark Configuration 82 | In a cluster with 10 nodes with each node(16 cores and 64GB RAM) 83 | * Assign 5 core per executors => --executor-cores = 5 (for good HDFS throughput) 84 | * Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 16-1 = 15 85 | So, Total available of cores in cluster = 15 x 10 = 150 86 | * Number of available executors = (total cores/num-cores-per-executor) = 150/5 = 30 87 | * Leaving 1 executor for ApplicationManager => --num-executors = 29 88 | * Number of executors per node = 30/10 = 3 89 | * Memory per executor = 64GB/3 = 21GB 90 | * Counting off heap overhead = 7% of 21GB = 3GB. So, actual --executor-memory = 21 - 3 = 18GB 91 | 92 | ### Setup Colab to run PySpark 93 | 1. As a first step, Let's setup Spark on your Colab environment. Run the cell below! 94 | ```[Python] 95 | !pip install pyspark 96 | !pip install -U -q PyDrive 97 | !apt install openjdk-8-jdk-headless -qq 98 | import os 99 | os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" 100 | ``` 101 | 2. Import some of the libraries usually needed by our workload. 102 | ```[Python] 103 | import pyspark 104 | from pyspark.sql import * 105 | from pyspark.sql.types import * 106 | from pyspark.sql.functions import * 107 | from pyspark import SparkContext, SparkConf 108 | ``` 109 | 3. Initialize the Spark context, 110 | ```[Python] 111 | # Create the session 112 | conf = SparkConf().set("spark.ui.port", "4050") 113 | 114 | # Create the context 115 | sc = pyspark.SparkContext(conf=conf) 116 | spark = SparkSession.builder.getOrCreate() 117 | 118 | spark 119 | ``` 120 | 4. If you are running this Colab on the Google hosted runtime, the cell below will create a ngrok tunnel which will allow you to still check the Spark UI. 121 | ```[Python] 122 | !wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip 123 | ! rm -rf ngrok 124 | !unzip ngrok-stable-linux-amd64.zip 125 | get_ipython().system_raw('./ngrok http 4050 &') 126 | !curl -s http://localhost:4040/api/tunnels | python3 -c \ 127 | "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])" 128 | ``` 129 | 5. 
Test Spark installation 130 | ```[Python] 131 | import pyspark 132 | print(pyspark.__version__) 133 | spark = SparkSession.builder.master("local[*]").getOrCreate() 134 | # Test the spark 135 | df = spark.createDataFrame([{"hello": "world"} for x in range(1000)]) 136 | 137 | df.show(3, False) 138 | ``` 139 | -------------------------------------------------------------------------------- /Python.md: -------------------------------------------------------------------------------- 1 | # Python3 2 | 3 | ## Table of Contents 4 | 5 | 1) [Check your Python version](#checkYourPythonVersion) 6 | 2) [Run Python unittest](#runPythonUnittest) 7 | 3) [Install Python Libraries](#installPythonLibraries) 8 | 4) [Managing Requirements](#managingRequirements) 9 | 5) [Prettify print yaml files](#PrettifyfyPrintYamlFiles) 10 | 6) [Prettify print json files](#PrettifyfyPrintJsonFiles) 11 | 7) [Read file from Gitlab private repo](Python%20Script/read_from_gitlab.py) 12 | 8) [Read parquet file](Python%20Script/read_parquet_file.py) 13 | 14 | ### Create a server 15 | ```bash 16 | python3 -m http.server 8000 17 | ``` 18 | 19 | ### Check your Python version 20 | ```bash 21 | $ python3 -V 22 | ``` 23 | 24 | ### Run Python unittest 25 | ```bash 26 | $ python3 -m unittest discover -s /path/to/unittest/lambdas 27 | ``` 28 | 29 | ### Install Python Libraries 30 | 31 | ```bash 32 | $ python3 -m pip install --user --upgrade "" 33 | ``` 34 | 35 | ### Managing Requirements 36 | 37 | **Create a Virtual Environment** 38 | 39 | ```bash 40 | $ python3 -m venv venv 41 | ``` 42 | 43 | **Activate a Virtual Environment** - _Linux_ 44 | ```bash 45 | $ source venv/bin/activate 46 | ``` 47 | 48 | 49 | **Creating a requirements.txt file** 50 | 51 | ```bash 52 | $ python3 -m pip freeze > requirements.txt 53 | ``` 54 | 55 | **Installing a requirements.txt file** 56 | 57 | ```bash 58 | $ python3 -m pip install -r requirements.txt 59 | ``` 60 | 61 | ### Prettify print yaml files 62 | ```bash 63 | $ python3 -c 'import yaml;print(yaml.safe_load(open("")))'` 64 | ``` 65 | 66 | ### Prettify print json files 67 | ```bash 68 | $ python3 -m json.tool ` 69 | ``` -------------------------------------------------------------------------------- /PythonScript/json_load.py: -------------------------------------------------------------------------------- 1 | import json 2 | # json_str = {"cohort_name":["STP", "ETS", "ETB|CARDED", "ETB", "NSTP", "RI", "Non", "Buzz", "Non Buzz1"], 3 | # "absolute_count":[738453, 5957908, 1089618, 10967636, 87032, 2868654, 119403, 1357280], "matched_records_%":[ 4 | # 3.1849111946251667, 25.696161957154807, 4.699468437483611, 47.30287056180148, 0.3753647030895907, 12.372362544544153, 5 | # 0.5149792219299384, 5.853881379371262]} 6 | 7 | x = "{'cohort_name': ['ETB', 'ETS', 'Non Buzz1', 'NSTP', 'RI', 'Non Buzz', 'STP'], 'absolute_count': [12092944, 5940231, 1291809, 84929, 2665341, 116642, 516990], 'matched_records_%': [53.252035348629605, 26.15817878516806, 5.688561737462595, 0.3739901640265401, 11.736995817408216, 0.5136403432559395, 2.2765978040490404]}" 8 | y = "{cohort_name=[Non Buzz1, NSTP, ETS, Non Buzz, STP, ETB|CARDED, RI, ETB], absolute_count=[1357280, 87032, 5957908, 119403, 738453, 1089618, 2868654, 10967636], matched_records_%=[5.853881379371262, 0.3753647030895907, 25.696161957154807, 0.5149792219299384, 3.1849111946251667, 4.699468437483611, 12.372362544544153, 47.30287056180148]}" 9 | x = x.replace("\'", "\"") 10 | print(x) 11 | 12 | print(json.loads(x)) 13 | 
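# Sketch of a more robust approach (not part of the original flow above):
# ast.literal_eval parses the single-quoted dict string directly, so the manual
# quote replacement is unnecessary. It does not handle the Java-style
# "key=value" format stored in `y`.
import ast

z = "{'cohort_name': ['STP', 'ETS'], 'absolute_count': [738453, 5957908]}"
print(json.dumps(ast.literal_eval(z)))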
-------------------------------------------------------------------------------- /PythonScript/re_date_time.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import time 3 | import pytz 4 | 5 | to_day = datetime.datetime.today() 6 | print("Current Date and time :|:", to_day) 7 | print("Current Date and time :|:", datetime.datetime.now()) 8 | print("Current Date :|:", to_day.date()) 9 | print("Today's Date :|:", to_day.day) 10 | print("Current Month :|:", to_day.month) 11 | print("Current Year :|:", to_day.date().year) 12 | print("Week of the day :|:", to_day.date().isoweekday()) 13 | print("Date 5 days back :|:", datetime.date.today() - datetime.timedelta(days=5)) 14 | print("Date 5 days after :|:", datetime.datetime.now() + datetime.timedelta(days=5)) 15 | print("Seconds remaining in my coming BDay :|:", (datetime.datetime(year=2021, month=10, day=4, hour=0, minute=0) - to_day).seconds) 16 | print("Seconds consumed in running this program :|:", (datetime.datetime.now() - to_day).microseconds) 17 | print("Current Time :|:", to_day.time()) 18 | print("Current Time minute", datetime.datetime.now().time().minute) 19 | print("Current Time hour", datetime.datetime.now().time().minute) 20 | print("Current Time seconds", datetime.datetime.now().time().second) 21 | print("Current Time microseconds", to_day.time().microsecond) 22 | 23 | 24 | print("Current Epoch time\t", time.time(), "Seconds") 25 | print("Current Epoch time\t", time.time_ns(), "Nanoseconds") 26 | 27 | utc_dt = datetime.datetime(2021, 10, 4, 12, 44, 56, 10, tzinfo=pytz.utc) 28 | print("Time Zone aware date time\t", utc_dt) 29 | 30 | current_utc_dt = datetime.datetime.now(tz=pytz.utc) 31 | print("Current UTC Time Zone\t", current_utc_dt) 32 | 33 | current_utc_dt = datetime.datetime.now(tz=pytz.utc) 34 | print("Convert UTC to India Time zone \t", current_utc_dt.astimezone(tz=pytz.timezone("Asia/Calcutta"))) 35 | 36 | curr_dt = datetime.datetime.today() 37 | print("Naive time to timezone aware \t", curr_dt, pytz.timezone("Asia/Calcutta").localize(curr_dt)) 38 | print("ISO format \t", curr_dt.isoformat()) 39 | -------------------------------------------------------------------------------- /PythonScript/read_With_custom_schema.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql.types import StructType,StructField, StringType 2 | 3 | customSchema = StructType().add("col1", StringType(), True).add("col2", StringType(), True).add("col3", StringType(), True) 4 | 5 | df = spark.read.format("csv").option("header", "true").schema(customSchema).load("path_to_files") -------------------------------------------------------------------------------- /PythonScript/read_from_gitlab.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import yaml 3 | 4 | 5 | def read_yaml_from_gitlab(): 6 | """ 7 | Read any file content from Gitlab 8 | Gitlab File Path = path/to/file.yaml 9 | Gitlab Project_Id = 123 10 | :return: Resurce Yaml Content 11 | :rtype: dict 12 | """ 13 | project_id = "****************" 14 | url = "https:///api/v4/projects/{project_id}/repository/files/{f}/raw?ref=".format( 15 | project_id=project_id, f="path%2Fto%2Ffile%2Eyaml") 16 | resp = requests.get(url, headers={"Private-Token": "********************"}) 17 | content_file = resp.content 18 | content_file.decode("utf-8") 19 | return yaml.safe_load(content_file) 20 | 
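# Usage sketch (assumes the project id, token and GitLab host above are filled in).
# urllib.parse.quote shows how to build the URL-encoded file path that the
# GitLab API expects, instead of hard-coding it.
if __name__ == "__main__":
    from urllib.parse import quote

    encoded_path = quote("path/to/file.yaml", safe="")
    print(encoded_path)  # path%2Fto%2Ffile.yaml
    print(read_yaml_from_gitlab())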
--------------------------------------------------------------------------------
/PythonScript/read_parquet_file.py:
--------------------------------------------------------------------------------
1 | import pyarrow.parquet as pq
2 | 
3 | # Read a parquet file into a pandas DataFrame and peek at the first rows
4 | table = pq.read_table("<file_name>.parquet")
5 | df = table.to_pandas()
6 | print(df.head())
7 | # print(df.columns)

--------------------------------------------------------------------------------
/PythonScript/read_part_file.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import SparkSession
2 | 
3 | # Count the total number of records spread across numbered part files (part-00000, part-00001, ...)
4 | spark = SparkSession.builder.getOrCreate()
5 | 
6 | count = 0
7 | for i in range(162):
8 |     file = "0"*(5-len(str(i)))+str(i)
9 |     print("path/*/part-{}".format(file))
10 |     rdd = spark.sparkContext.textFile("path/part-{}".format(file))
11 |     temp = rdd.count()
12 |     print(temp)
13 |     count += temp
14 | 
15 | print(count)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Engineer's Essential Commands
2 | 
3 | * Linux: [Link](Linux.md)
4 | * Python: [Link](Python.md)
5 | * PySpark: [Link](PySpark.md)
6 | * AWS: [Link](AWS)
7 |   * EKS: [Link](AWS/EKS.md)
8 |   * EMR: [Link](AWS/EMR.md)
9 |   * S3: [Link](AWS/S3.md)
10 | * Terraform: [Link](terraform.md)
11 | * Git: [Link](Git.md)
12 | * Helm: [Link](Helm)
13 |   * Jupyterhub: [Link](Helm/Jupyterhub.md)
14 | 
15 | ---
16 | 
**Want to contribute?** 17 |

18 | 
19 | * The commands should not be copy-pasted from any source in bulk.
20 | * Only add commands that you use frequently but that may be unknown to other developers.
21 | 
22 |   Example: `pwd`, `ls`, etc. are not allowed
23 | * Follow the structure and don't forget to embed any reference links, either in the heading or in the command description.
24 | * Put it inside a directory if applicable
25 | * Give a proper heading
26 | * Use markdown syntax for [block code or inline code](https://github.com/adam-p/markdown-here/wiki/Markdown-Here-Cheatsheet#code) to embed commands
27 | * If the command heading is not sufficient to explain the usage, give a one-line explanation with an example.
28 | * I would be happy to accept your pull request even if you add `one` good command rather than `ten` not-so-good commands.
29 | 

30 |
31 | -------------------------------------------------------------------------------- /mongodb.md: -------------------------------------------------------------------------------- 1 | To have launchd start mongodb/brew/mongodb-community now and restart at login: 2 | 3 | `brew services start mongodb/brew/mongodb-community` 4 | 5 | Or, if you don't want/need a background service you can just run: 6 | 7 | `mongod --config /opt/homebrew/etc/mongod.conf` 8 | -------------------------------------------------------------------------------- /pyspark/encrypt_decryt.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql.functions import lit, udf 2 | from cryptography.fernet import Fernet 3 | from pyspark.sql.types import StringType 4 | from pyspark.sql import SparkSession, Row 5 | 6 | 7 | def encrypt_val(clear_text, MASTER_KEY): 8 | f = Fernet(MASTER_KEY) 9 | clear_text_b = bytes(clear_text, 'utf-8') 10 | cipher_text = f.encrypt(clear_text_b) 11 | cipher_text = str(cipher_text.decode('ascii')) 12 | return cipher_text 13 | 14 | 15 | def decrypt_val(cipher_text, MASTER_KEY): 16 | f = Fernet(MASTER_KEY) 17 | clear_val = f.decrypt(cipher_text.encode()).decode() 18 | return clear_val 19 | 20 | 21 | if __name__ == '__main__': 22 | spark = SparkSession.builder.appName("Test Job").getOrCreate() 23 | 24 | # df = spark.read.csv("sample_file.csv", header=True) 25 | # create a list of rows with the data 26 | data = [ 27 | Row("John Doe", "IT", "201901", 50000), 28 | Row("Jane Doe", "HR", "202001", 60000), 29 | Row("Bob Smith", "IT", "201905", 70000), 30 | Row("Alice Smith", "HR", "202011", 80000), 31 | Row("James Johnson", "IT", "202002", 90000), 32 | Row("Emily Johnson", "HR", "201310", 100000), 33 | Row("David Williams", "IT", "201511", 110000), 34 | Row("Samantha Williams", "HR", "202207", 120000), 35 | Row("Charles Brown", "IT", "202101", 130000), 36 | Row("Ashley Brown", "HR", "202111", 140000) 37 | ] 38 | 39 | # create the DataFrame 40 | df = spark.createDataFrame(data, schema) 41 | 42 | encrypt = udf(encrypt_val, StringType()) 43 | decrypt = udf(decrypt_val, StringType()) 44 | 45 | encryptionKey = Fernet.generate_key() 46 | 47 | df = df.withColumn("first_name_encrypted", encrypt(df.employee_name, lit(encryptionKey))) 48 | df = df.withColumn("first_name_decrypted", decrypt(df.employee_name, lit(encryptionKey))) 49 | 50 | df.show() 51 | # df.coalesce(1).write.mode("overwrite").csv("output", header=True) 52 | -------------------------------------------------------------------------------- /pyspark/profiler.sh: -------------------------------------------------------------------------------- 1 | spark-submit --master local[2] \ 2 | --jar sparklens-0.3.0-s_2.11.jar \ 3 | --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \ 4 | main.py 5 | 6 | spark-submit --master local[2] \ 7 | --conf spark.jars=jvm-profiler-1.0.0.jar \ 8 | --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \ 9 | main.py 10 | 11 | spark-submit --master yarn --queue DE --deploy-mode cluster \ 12 | --conf spark.jars=jvm-profiler-1.0.0.jar \ 13 | --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \ 14 | main.py 15 | 16 | spark-submit --master yarn --queue DE --deploy-mode cluster \ 17 | --files ./babar-agent-0.2.0-SNAPSHOT.jar \ 18 | --conf spark.executor.extraJavaOptions="-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=2560],ProcFSProfiler" \ 19 | main.py 
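# A variant of the first command above (sketch): spark-submit has no --jar flag,
# so the sparklens jar would normally be attached with --jars (or --packages),
# with the listener still enabled via spark.extraListeners.
spark-submit --master local[2] \
  --jars sparklens-0.3.0-s_2.11.jar \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  main.py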
-------------------------------------------------------------------------------- /pyspark/pySparkApp/Makefile: -------------------------------------------------------------------------------- 1 | build: 2 | python3 setup.py sdist 3 | rm -r foo.egg-info 4 | 5 | build_zip: 6 | mkdir -p dist/ 7 | rsync -av foo dist/ 8 | cd dist ; zip -r foo.zip . * ; cd .. 9 | #rm -fr dist/foo -------------------------------------------------------------------------------- /pyspark/pySparkApp/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/pySparkApp/README.md -------------------------------------------------------------------------------- /pyspark/pySparkApp/dist/foo.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/pySparkApp/dist/foo.zip -------------------------------------------------------------------------------- /pyspark/pySparkApp/foo/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/pySparkApp/foo/__init__.py -------------------------------------------------------------------------------- /pyspark/pySparkApp/foo/foo.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | 3 | 4 | class Foo: 5 | def __init__(self, app_name): 6 | self.spark = SparkSession.builder.appName(app_name).getOrCreate() 7 | 8 | def get_source_df(self): 9 | simple_data = [("James", "Sales", "NY", 90000, 34, 10000), 10 | ("Michael", "Sales", "NY", 86000, 56, 20000), 11 | ("Robert", "Sales", "CA", 81000, 30, 23000), 12 | ("Maria", "Finance", "CA", 90000, 24, 23000), 13 | ("Raman", "Finance", "CA", 99000, 40, 24000), 14 | ("Scott", "Finance", "NY", 83000, 36, 19000), 15 | ("Jen", "Finance", "NY", 79000, 53, 15000), 16 | ("Jeff", "Marketing", "CA", 80000, 25, 18000), 17 | ("Kumar", "Marketing", "NY", 91000, 50, 21000) 18 | ] 19 | 20 | schema = ["employee_name", "department", "state", "salary", "age", "bonus"] 21 | return self.spark.createDataFrame(data=simple_data, schema=schema) 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /pyspark/pySparkApp/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | spark-submit --master local --py-files foo.zip main.py 3 | """ 4 | 5 | from pyspark.sql import functions as F 6 | from foo.foo import Foo 7 | 8 | 9 | def do_transform(df): 10 | df.groupBy("department").sum("salary").show(truncate=False) 11 | 12 | df.groupBy("department").count().show(truncate=False) 13 | 14 | df.groupBy("department", "state") \ 15 | .sum("salary", "bonus") \ 16 | .show(truncate=False) 17 | 18 | df.groupBy("department") \ 19 | .agg(F.sum("salary").alias("sum_salary"), 20 | F.avg("salary").alias("avg_salary"), 21 | F.sum("bonus").alias("sum_bonus"), 22 | F.max("bonus").alias("max_bonus") 23 | ) \ 24 | .show(truncate=False) 25 | 26 | df.groupBy("department") \ 27 | .agg(F.sum("salary").alias("sum_salary"), 28 | F.avg("salary").alias("avg_salary"), 29 | F.sum("bonus").alias("sum_bonus"), 30 | F.max("bonus").alias("max_bonus")).\ 31 | where(F.col("sum_bonus") >= 50000) \ 32 | 
.show(truncate=False) 33 | 34 | 35 | if __name__ == '__main__': 36 | foo = Foo("PySparkPackagingTest") 37 | df = foo.get_source_df() 38 | do_transform(df) 39 | -------------------------------------------------------------------------------- /pyspark/read_hive_table.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/read_hive_table.py -------------------------------------------------------------------------------- /pyspark/when_otherwise.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql import functions as F 3 | 4 | if __name__ == '__main__': 5 | spark = SparkSession.builder.getOrCreate() 6 | data = [('James', 'Smith', 'M', 30), 7 | ('Anna', 'Rose', 'F', 41), 8 | ('Robert', 'Williams', 'O', 62), 9 | ] 10 | 11 | columns = ["firstname", "lastname", "gender", "salary"] 12 | df = spark.createDataFrame(data=data, schema=columns) 13 | df = df.withColumn("Sex", F.when(df.gender == "M", "Male").when(df.gender == "F", "Female").otherwise(None)) 14 | df.show() 15 | -------------------------------------------------------------------------------- /terraform.md: -------------------------------------------------------------------------------- 1 | 2 | ### Terraform 3 | ``` 4 | terraform init -var-file="terraform.tfvars" "-lock=false" 5 | 6 | terraform plan -var-file="terraform.tfvars" "-lock=false" 7 | 8 | terraform apply -var-file="terraform.tfvars" "-lock=false" 9 | 10 | terraform destroy -var-file="terraform.tfvars" "-lock=false" 11 | ``` 12 | 13 | ``` 14 | terragrunt init 15 | 16 | terragrunt plan 17 | 18 | terragrunt apply 19 | 20 | terragrunt destroy 21 | ``` 22 | --------------------------------------------------------------------------------