├── .gitignore
├── AWS
    ├── EKS.md
    ├── EMR.md
    ├── README.md
    └── S3.md
├── Git.md
├── Helm
    ├── Jupyterhub.md
    └── README.md
├── Linux.md
├── PySpark.md
├── Python.md
├── PythonScript
    ├── json_load.py
    ├── re_date_time.py
    ├── read_With_custom_schema.py
    ├── read_from_gitlab.py
    ├── read_parquet_file.py
    └── read_part_file.py
├── README.md
├── mongodb.md
├── pyspark
    ├── encrypt_decryt.py
    ├── profiler.sh
    ├── pySparkApp
    │   ├── Makefile
    │   ├── README.md
    │   ├── dist
    │   │   └── foo.zip
    │   ├── foo
    │   │   ├── __init__.py
    │   │   └── foo.py
    │   └── main.py
    ├── read_hive_table.py
    └── when_otherwise.py
└── terraform.md

/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | .DS_Store

--------------------------------------------------------------------------------
/AWS/EKS.md:
--------------------------------------------------------------------------------
1 | ## EKS
2 | ### Delete multiple pods
3 | ```bash
4 | $ kubectl get pods -n default | grep Running | cut -d' ' -f 1 | xargs kubectl delete pod -n default
5 | ```
6 | 
7 | ### See which pod is running on which node
8 | ```bash
9 | $ kubectl get pod -o=custom-columns=NODE:.spec.nodeName,NAME:.metadata.name -n stage
10 | ```
11 | 
12 | ### Adding WebIdentity role to a ServiceAccount
13 | ```bash
14 | $ kubectl annotate serviceaccount -n <namespace> <service_account_name> \
15 | eks.amazonaws.com/role-arn=arn:aws:iam::<account_id>:role/<role_name>
16 | ```
17 | 
18 | ### Get SparkUI for a Spark Pod
19 | ```bash
20 | $ kubectl port-forward <pod_name> 4040:4040 --namespace <namespace>
21 | ```
22 | 
23 | ### Find a file in Linux
24 | ```bash
25 | $ sudo find . -name <file_name>
26 | ```
27 | 
28 | ### Set default namespace for kubectl commands
29 | ```bash
30 | $ kubectl config set-context --current --namespace=<namespace>
31 | # Validate it
32 | $ kubectl config view --minify | grep namespace:
33 | ```
34 | 
35 | ### Configure EKS
36 | ```bash
37 | $ export KUBECONFIG=$KUBECONFIG:~/.kube/config-nonprod
38 | $ aws eks update-kubeconfig --name <cluster_name> --region us-east-1
39 | ```
40 | 
41 | ### Get all pods in an EKS cluster
42 | ```bash
43 | $ kubectl get pods --all-namespaces
44 | ```
45 | 
46 | ### Get airflow UI for airflow pod
47 | ```bash
48 | $ kubectl port-forward <pod_name> 8080:8080
49 | ```

--------------------------------------------------------------------------------
/AWS/EMR.md:
--------------------------------------------------------------------------------
1 | ## EMR
2 | ### Connect to EMR Cluster from CLI
3 | 
4 | `ssh -i file.pem hadoop@ip.ip.ip.ip`
5 | 
6 | ### Screen
7 | 
8 | * Create a new screen: `screen -S aranjan`
9 | * Go into a specific screen: `screen -rd aranjan`
10 | * Nested screen: `screen -t aman`
11 | 
12 | ### Secure copy a file from local to Hadoop
13 | 
14 | `scp -i file.pem file_path hadoop@ip.ip.ip.ip:/home/hadoop/`

--------------------------------------------------------------------------------
/AWS/README.md:
--------------------------------------------------------------------------------
1 | # Data Engineer's Essential Commands
2 | 
3 | * Linux: [Link](../Linux.md)
4 | * Python: [Link](../Python.md)
5 | * PySpark: [Link](../PySpark.md)
6 | * **AWS**:
7 |   * 
EKS: [Link](EKS.md) 8 | * EMR: [Link](EMR.md) 9 | * S3: [Link](S3.md) 10 | * Terraform: [Link](../terraform.md) 11 | * Git: [Link](../Git.md) 12 | * Helm: [Link](../Helm) 13 | * Jupyterhub: [Link](../Helm/Jupyterhub.md) 14 | -------------------------------------------------------------------------------- /AWS/S3.md: -------------------------------------------------------------------------------- 1 | # AWS 2 | 3 | ## S3 4 | ### Rename files in S3 5 | 6 | `for f in $(aws s3api list-objects --bucket bucket_name --prefix "key/" --delimiter "/" | grep 097 | cut -d : -f 2 | cut -d \" -f 2); do aws s3 mv s3://bucket_name/$f s3://bucket_name/${f/%/.csv.gz}; done ` 7 | 8 | ### Sync files at two S3 locations 9 | 10 | `aws s3 sync s3_source s3_target --recursive` 11 | 12 | ### Download bucket object from s3 13 | `aws s3 cp s3://.. . --recursive` 14 | 15 | -------------------------------------------------------------------------------- /Git.md: -------------------------------------------------------------------------------- 1 | # Git 2 | 3 | ## Table of Contents 4 | 5 | ### Set up ▶️ 6 | This commands will be useful to set up your project/directory 7 | 1) [.gitignore](#ignoringFiles) 8 | 2) [Create branch](#creatingBranches) 9 | 10 | ### Lifecycle 🔄 11 | This commands will be useful in the lifecycle project 12 | 1) [Modifying Commits](#modifyingCommits) 13 | 14 | ----------- 15 | 16 | [.gitignore](https://www.toptal.com/developers/gitignore) 17 | ``` 18 | # ignore all .a files 19 | *.a 20 | 21 | # but do track lib.a, even though you're ignoring .a files above 22 | !lib.a 23 | 24 | # only ignore the TODO file in the current directory, not subdir/TODO 25 | /TODO 26 | 27 | # ignore all files in any directory named build 28 | build/ 29 | 30 | # ignore doc/notes.txt, but not doc/server/arch.txt 31 | doc/*.txt 32 | 33 | # ignore all .pdf files in the doc/ directory and any of its subdirectories 34 | doc/**/*.pdf 35 | 36 | ``` 37 | 38 | ## Create Branch 39 | 40 | 1. First, check in which branch you are right now: 41 | ``` 42 | git branch 43 | ``` 44 | You'll get a list of the branchs available, the branch with an ```*``` next to it, is the branch you are working on. 45 | 46 | 2. Then, create a new branch: 47 | ``` 48 | git branch name_of_the_branch_to_create 49 | ``` 50 | 51 | 3. Move to the new branch created: 52 | ``` 53 | git checkout name_of_the_branch_to_create 54 | ``` 55 | 56 | 4. You can do step 2 and 3 by one command: 57 | ``` 58 | git checkout -b name_of_the_branch_to_create 59 | ``` 60 | ----------- 61 | 62 | [Modifying Commits](https://classroom.udacity.com/courses/ud123/lessons/f02167ad-3ba7-40e0-a157-e5320a5b0dc8/concepts/e176503b-3eae-4b22-a1b3-2953bab3d5e5) 63 | 64 | ### Changing the last commit message 65 | `$ git commit --amend` 66 | 67 | ### Add files to last commit 68 | * edit the file(s) 69 | * save the file(s) 70 | * stage the file(s) 71 | * and run `git commit --amend` 72 | 73 | ### Reverse a previously made commit, undo the changes 74 | `$ git revert ` 75 | 76 | ### [Reset vs Revert](https://classroom.udacity.com/courses/ud123/lessons/f02167ad-3ba7-40e0-a157-e5320a5b0dc8/concepts/fed81eb7-49b4-4129-9f6b-8201e0796fd8) 77 | At first glance, resetting might seem coincidentally close to reverting, but they are actually quite different. Reverting creates a new commit that reverts or undos a previous commit. Resetting, on the other hand, erases commits! 
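A minimal sketch of the difference (`<SHA>` is a placeholder for the commit you want to undo):

```
# revert: creates a new commit that undoes <SHA>; history is preserved
$ git revert <SHA>

# reset: moves the branch pointer back one commit and drops it from history
$ git reset --hard HEAD~1
```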
78 | 79 | Before I do any resetting, I usually create a backup branch on the most-recent commit so that I can get back to the commits if I make a mistake. 80 | 81 | ### Change the git upstream url from SSH to HTTPS or back 82 | `$ git remote set-url origin git@github.......url` -------------------------------------------------------------------------------- /Helm/Jupyterhub.md: -------------------------------------------------------------------------------- 1 | ### JupyterHub 2 | ``` 3 | helm repo add stable https://kubernetes-charts.storage.googleapis.com 4 | helm repo update 5 | helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/ 6 | helm upgrade --install jupyterhub/jupyterhub --namespace --version=0.9.0 --values config.yaml 7 | helm uninstall -n 8 | ``` -------------------------------------------------------------------------------- /Helm/README.md: -------------------------------------------------------------------------------- 1 | # Data Enginner's Essential Commands 2 | 3 | * Linux: [Link](../Linux.md) 4 | * Python: [Link](../Python.md) 5 | * PySpark: [Link](../PySpark.md) 6 | * AWS: [Link](../AWS/) 7 | * EKS: [Link](../AWS/EKS.md) 8 | * EMR: [Link](../AWS/EMR.md) 9 | * S3: [Link](../AWS/S3.md) 10 | * Terraform: [Link](../terraform.md) 11 | * Git: [Link](../Git.md) 12 | * **Helm**: 13 | * Jupyterhub: [Link](Jupyterhub.md) -------------------------------------------------------------------------------- /Linux.md: -------------------------------------------------------------------------------- 1 | # Linux 2 | 3 | ## Table of contents 4 | 5 | 1) [Run a job in background on Linux](#runAJobInBack) 6 | 2) [tar](#tar) 7 | 3) [gunzip](#gunzip) 8 | 4) [Add extension to a file](#addExtensionToAFile) 9 | 5) [Find a file in linux](#findAFile) 10 | 6) [Check IP to whitelist](#checkIpToWhitelist) 11 | 7) [Outputs Geographical Information, regarding an ip_address](#outputGeographicalInfo) 12 | 9) [Show the available space on the mounted filesystems](#showFilesystemSpace) 13 | 10) [Show the size of a file or a folder](#showSizeOfFile) 14 | 11) [List the conents of a file in a numbered fashion](#listContentsOfAFile) 15 | 12) [Search Bash history](#searchBashHistory) 16 | 13) [Moves the cursor to the beginning of the line](#moveCursorToTheBeginning) 17 | 14) [Moves the cursor to the end of the line](#moveCursorToTheEndOfLine) 18 | 15) [Run previous ran command](#runPreviousCommand) 19 | 16) [Find word in a file](#findWordInAFile) 20 | 17) [Get 1st N rows from a file](#get1stNrows) 21 | 22 | #### Run a job in background on Linux 23 | 24 | ```bash 25 | $ nohup command > my.log 2>&1 & 26 | ``` 27 | 28 | #### Check disk usage for directory 29 | ```bash 30 | $ sudo du -h --max-depth=1 | sort -h 31 | $ sudo du -h --max-depth=4 ./log | sort -h 32 | ``` 33 | #### tar 34 | 35 | * Create a tar archive 36 | ```bash 37 | $ tar -cvf 38 | ``` 39 | * -c - create 40 | * -v - verbose 41 | * -f - the filename of the tar archive 42 | 43 | * Extract tar archives 44 | ```bash 45 | $ tar -xvf 46 | ``` 47 | * -x - extract 48 | * -v - verbose 49 | * -f - the filename of tar archive to extract 50 | 51 | * Create archives and compress with tar 52 | ```bash 53 | $ tar -czvf ` 54 | ``` 55 | * -c - create 56 | * -z - zip 57 | * -v - verbose 58 | * -f - the filename of the compressed file 59 | 60 | * Uncompress using tar 61 | ```bash 62 | $ tar -xzvf 63 | ``` 64 | 65 | * -xz - uncompress and extract 66 | * -v - verbose 67 | * -f - the filename of the compressed file 68 | #### gunzip 69 | 70 | * Compress a file with 
gzip 71 | ```bash 72 | gzip 73 | ``` 74 | 75 | * Extract .gz file 76 | ```bash 77 | gunzip 78 | ``` 79 | 80 | #### Add extension to file 81 | ```bash 82 | $ for f in *; do mv "$f" "$f.gz"; done 83 | ``` 84 | 85 | #### Find a file in Linux 86 | ```bash 87 | $ sudo find . -name 88 | ``` 89 | 90 | #### Check IP to whitelist 91 | ```bash 92 | $ curl ifconfig.me ; echo 93 | ``` 94 | 95 | #### Outputs Geographical Information, regarding an ip_address 96 | For current system's ip address: 97 | 98 | ```bash 99 | $ curl ipinfo.io 100 | ``` 101 | 102 | For any specific ip address: 103 | 104 | ```bash 105 | $ curl ipinfo.io/ 106 | ``` 107 | 108 | #### Show the available space on the mounted filesystems 109 | ```bash 110 | $ df -h 111 | ``` 112 | 113 | #### Show the size of a file or a folder 114 | ```bash 115 | $ du -sh 116 | ``` 117 | 118 | #### List the conents of a file in a numbered fashion 119 | ```bash 120 | $ nl 121 | ``` 122 | 123 | #### Search Bash history 124 | ```bash 125 | Ctrl+r 126 | ``` 127 | 128 | #### Moves the cursor to the beginning of the line 129 | ```bash 130 | Ctrl+a 131 | ``` 132 | 133 | #### Moves the cursor to the end of the line 134 | ```bash 135 | Ctrl+e 136 | ``` 137 | 138 | #### Runs previous ran command 139 | ```bash 140 | $ !! 141 | ``` 142 | 143 | #### Find word in a file 144 | ```bash 145 | $ grep 146 | ``` 147 | 148 | #### Get 1st N rows from a file 149 | ```bash 150 | !head -5 file_name.csv 151 | ``` 152 | 153 | -------------------------------------------------------------------------------- /PySpark.md: -------------------------------------------------------------------------------- 1 | # PySpark 2 | 3 | [PySpark Examples Official Doc link](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html) 4 | 5 | ### Create Dataframe 6 | ``` 7 | spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate() 8 | 9 | columns = ["letter","number"] 10 | data = [(l, n) for l, n in zip("abcdefghijklmnopqrstuvwxyz", "12345678912345678912345678")] 11 | rdd = spark.sparkContext.parallelize(data) 12 | df = rdd.toDF(columns) 13 | ``` 14 | 15 | ### Read csv file in Spark with Schema Inference 16 | ``` 17 | df = spark.read.csv("file_path/file_name.csv", header=True, inferSchema=True) 18 | ``` 19 | 20 | ### Run SQL Query on Spark DF 21 | ``` 22 | df.createOrReplaceTempView("table_name") 23 | df_new = spark.sql("Select * from table_name") 24 | df_new.show() 25 | ``` 26 | 27 | ### Get number of partitions 28 | ``` 29 | if dataframe: 30 | df.rdd.getNumPartitions() 31 | if RDD 32 | rdd.getNumPartitions() 33 | ``` 34 | 35 | ### Repartition dataframe into "n" partitions 36 | * Partitions has unequal distribution of data(Fast since less suffeling, cann't increase number of partitions) = `df.coalesce(n)` 37 | * Partitoins has equal distribution of data(Slow since more suffeling) = `df.repartition(n)` 38 | 39 | ### Drop Columns from Spark DataFrame 40 | 41 | `df = df.drop("query_type").drop("_c0").drop("_c01")` 42 | 43 | ### Write data to S3 44 | 45 | `df.write.mode("overwrite").format("csv").option("compression", ".gzip").save(output_s3_path_in_string, header=True)` 46 | 47 | ### Validation Steps 48 | ```python 49 | >>> df_new = spark.read.csv("s3://leadid-sandbox/aranjan/mysql_leads_new") 50 | 51 | >>> df_old = spark.read.option("delimiter", "\x01").csv("s3://..") 52 | 53 | >>> df_new.count() 54 | 55 | >>> df_intersect = df_old.intersect(df_new) 56 | 57 | >>> df_subtract = df_old.subtract(df_intersect) 58 | 59 | >>> df_new.filter(df_new._c0 == 
"").filter(df_new._c1 == "").head() 60 | ``` 61 | 62 | ### Select records with specific string in columns 63 | 64 | `df.filter(lower(col("_c0")).contains('%string_to_find%')).head()` 65 | 66 | ### Get max/min/mean value for a column 67 | ``` 68 | max_value = df.agg({"_c0": "max"}).collect()[0] 69 | mean_value = df.agg({"_c0": "mean"}).collect()[0] 70 | min_value = df.agg({"_c0": "min"}).collect()[0] 71 | df.select("_c0").rdd.min()[0] 72 | df.select("_c0").rdd.max()[0] 73 | ``` 74 | ### Arithmetic Operation on Columns (-, +, %, /, **) 75 | ``` 76 | df = df.withColumn("new_col", df._c0 * df._c1) 77 | df = df.withColumn("new_col", df._c0 + 100) 78 | df = df.withColumn("new_col", df._c0 + lit(100)) 79 | ``` 80 | 81 | ### Spark Configuration 82 | In a cluster with 10 nodes with each node(16 cores and 64GB RAM) 83 | * Assign 5 core per executors => --executor-cores = 5 (for good HDFS throughput) 84 | * Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 16-1 = 15 85 | So, Total available of cores in cluster = 15 x 10 = 150 86 | * Number of available executors = (total cores/num-cores-per-executor) = 150/5 = 30 87 | * Leaving 1 executor for ApplicationManager => --num-executors = 29 88 | * Number of executors per node = 30/10 = 3 89 | * Memory per executor = 64GB/3 = 21GB 90 | * Counting off heap overhead = 7% of 21GB = 3GB. So, actual --executor-memory = 21 - 3 = 18GB 91 | 92 | ### Setup Colab to run PySpark 93 | 1. As a first step, Let's setup Spark on your Colab environment. Run the cell below! 94 | ```[Python] 95 | !pip install pyspark 96 | !pip install -U -q PyDrive 97 | !apt install openjdk-8-jdk-headless -qq 98 | import os 99 | os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" 100 | ``` 101 | 2. Import some of the libraries usually needed by our workload. 102 | ```[Python] 103 | import pyspark 104 | from pyspark.sql import * 105 | from pyspark.sql.types import * 106 | from pyspark.sql.functions import * 107 | from pyspark import SparkContext, SparkConf 108 | ``` 109 | 3. Initialize the Spark context, 110 | ```[Python] 111 | # Create the session 112 | conf = SparkConf().set("spark.ui.port", "4050") 113 | 114 | # Create the context 115 | sc = pyspark.SparkContext(conf=conf) 116 | spark = SparkSession.builder.getOrCreate() 117 | 118 | spark 119 | ``` 120 | 4. If you are running this Colab on the Google hosted runtime, the cell below will create a ngrok tunnel which will allow you to still check the Spark UI. 121 | ```[Python] 122 | !wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip 123 | ! rm -rf ngrok 124 | !unzip ngrok-stable-linux-amd64.zip 125 | get_ipython().system_raw('./ngrok http 4050 &') 126 | !curl -s http://localhost:4040/api/tunnels | python3 -c \ 127 | "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])" 128 | ``` 129 | 5. 
Test Spark installation 130 | ```[Python] 131 | import pyspark 132 | print(pyspark.__version__) 133 | spark = SparkSession.builder.master("local[*]").getOrCreate() 134 | # Test the spark 135 | df = spark.createDataFrame([{"hello": "world"} for x in range(1000)]) 136 | 137 | df.show(3, False) 138 | ``` 139 | -------------------------------------------------------------------------------- /Python.md: -------------------------------------------------------------------------------- 1 | # Python3 2 | 3 | ## Table of Contents 4 | 5 | 1) [Check your Python version](#checkYourPythonVersion) 6 | 2) [Run Python unittest](#runPythonUnittest) 7 | 3) [Install Python Libraries](#installPythonLibraries) 8 | 4) [Managing Requirements](#managingRequirements) 9 | 5) [Prettify print yaml files](#PrettifyfyPrintYamlFiles) 10 | 6) [Prettify print json files](#PrettifyfyPrintJsonFiles) 11 | 7) [Read file from Gitlab private repo](Python%20Script/read_from_gitlab.py) 12 | 8) [Read parquet file](Python%20Script/read_parquet_file.py) 13 | 14 | ### Create a server 15 | ```bash 16 | python3 -m http.server 8000 17 | ``` 18 | 19 | ### Check your Python version 20 | ```bash 21 | $ python3 -V 22 | ``` 23 | 24 | ### Run Python unittest 25 | ```bash 26 | $ python3 -m unittest discover -s /path/to/unittest/lambdas 27 | ``` 28 | 29 | ### Install Python Libraries 30 | 31 | ```bash 32 | $ python3 -m pip install --user --upgrade "" 33 | ``` 34 | 35 | ### Managing Requirements 36 | 37 | **Create a Virtual Environment** 38 | 39 | ```bash 40 | $ python3 -m venv venv 41 | ``` 42 | 43 | **Activate a Virtual Environment** - _Linux_ 44 | ```bash 45 | $ source venv/bin/activate 46 | ``` 47 | 48 | 49 | **Creating a requirements.txt file** 50 | 51 | ```bash 52 | $ python3 -m pip freeze > requirements.txt 53 | ``` 54 | 55 | **Installing a requirements.txt file** 56 | 57 | ```bash 58 | $ python3 -m pip install -r requirements.txt 59 | ``` 60 | 61 | ### Prettify print yaml files 62 | ```bash 63 | $ python3 -c 'import yaml;print(yaml.safe_load(open("")))'` 64 | ``` 65 | 66 | ### Prettify print json files 67 | ```bash 68 | $ python3 -m json.tool ` 69 | ``` -------------------------------------------------------------------------------- /PythonScript/json_load.py: -------------------------------------------------------------------------------- 1 | import json 2 | # json_str = {"cohort_name":["STP", "ETS", "ETB|CARDED", "ETB", "NSTP", "RI", "Non", "Buzz", "Non Buzz1"], 3 | # "absolute_count":[738453, 5957908, 1089618, 10967636, 87032, 2868654, 119403, 1357280], "matched_records_%":[ 4 | # 3.1849111946251667, 25.696161957154807, 4.699468437483611, 47.30287056180148, 0.3753647030895907, 12.372362544544153, 5 | # 0.5149792219299384, 5.853881379371262]} 6 | 7 | x = "{'cohort_name': ['ETB', 'ETS', 'Non Buzz1', 'NSTP', 'RI', 'Non Buzz', 'STP'], 'absolute_count': [12092944, 5940231, 1291809, 84929, 2665341, 116642, 516990], 'matched_records_%': [53.252035348629605, 26.15817878516806, 5.688561737462595, 0.3739901640265401, 11.736995817408216, 0.5136403432559395, 2.2765978040490404]}" 8 | y = "{cohort_name=[Non Buzz1, NSTP, ETS, Non Buzz, STP, ETB|CARDED, RI, ETB], absolute_count=[1357280, 87032, 5957908, 119403, 738453, 1089618, 2868654, 10967636], matched_records_%=[5.853881379371262, 0.3753647030895907, 25.696161957154807, 0.5149792219299384, 3.1849111946251667, 4.699468437483611, 12.372362544544153, 47.30287056180148]}" 9 | x = x.replace("\'", "\"") 10 | print(x) 11 | 12 | print(json.loads(x)) 13 | 
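# Sketch of a more robust approach (not part of the original flow above):
# ast.literal_eval parses the single-quoted dict string directly, so the manual
# quote replacement is unnecessary. It does not handle the Java-style
# "key=value" format stored in `y`.
import ast

z = "{'cohort_name': ['STP', 'ETS'], 'absolute_count': [738453, 5957908]}"
print(json.dumps(ast.literal_eval(z)))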
-------------------------------------------------------------------------------- /PythonScript/re_date_time.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import time 3 | import pytz 4 | 5 | to_day = datetime.datetime.today() 6 | print("Current Date and time :|:", to_day) 7 | print("Current Date and time :|:", datetime.datetime.now()) 8 | print("Current Date :|:", to_day.date()) 9 | print("Today's Date :|:", to_day.day) 10 | print("Current Month :|:", to_day.month) 11 | print("Current Year :|:", to_day.date().year) 12 | print("Week of the day :|:", to_day.date().isoweekday()) 13 | print("Date 5 days back :|:", datetime.date.today() - datetime.timedelta(days=5)) 14 | print("Date 5 days after :|:", datetime.datetime.now() + datetime.timedelta(days=5)) 15 | print("Seconds remaining in my coming BDay :|:", (datetime.datetime(year=2021, month=10, day=4, hour=0, minute=0) - to_day).seconds) 16 | print("Seconds consumed in running this program :|:", (datetime.datetime.now() - to_day).microseconds) 17 | print("Current Time :|:", to_day.time()) 18 | print("Current Time minute", datetime.datetime.now().time().minute) 19 | print("Current Time hour", datetime.datetime.now().time().minute) 20 | print("Current Time seconds", datetime.datetime.now().time().second) 21 | print("Current Time microseconds", to_day.time().microsecond) 22 | 23 | 24 | print("Current Epoch time\t", time.time(), "Seconds") 25 | print("Current Epoch time\t", time.time_ns(), "Nanoseconds") 26 | 27 | utc_dt = datetime.datetime(2021, 10, 4, 12, 44, 56, 10, tzinfo=pytz.utc) 28 | print("Time Zone aware date time\t", utc_dt) 29 | 30 | current_utc_dt = datetime.datetime.now(tz=pytz.utc) 31 | print("Current UTC Time Zone\t", current_utc_dt) 32 | 33 | current_utc_dt = datetime.datetime.now(tz=pytz.utc) 34 | print("Convert UTC to India Time zone \t", current_utc_dt.astimezone(tz=pytz.timezone("Asia/Calcutta"))) 35 | 36 | curr_dt = datetime.datetime.today() 37 | print("Naive time to timezone aware \t", curr_dt, pytz.timezone("Asia/Calcutta").localize(curr_dt)) 38 | print("ISO format \t", curr_dt.isoformat()) 39 | -------------------------------------------------------------------------------- /PythonScript/read_With_custom_schema.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql.types import StructType,StructField, StringType 2 | 3 | customSchema = StructType().add("col1", StringType(), True).add("col2", StringType(), True).add("col3", StringType(), True) 4 | 5 | df = spark.read.format("csv").option("header", "true").schema(customSchema).load("path_to_files") -------------------------------------------------------------------------------- /PythonScript/read_from_gitlab.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import yaml 3 | 4 | 5 | def read_yaml_from_gitlab(): 6 | """ 7 | Read any file content from Gitlab 8 | Gitlab File Path = path/to/file.yaml 9 | Gitlab Project_Id = 123 10 | :return: Resurce Yaml Content 11 | :rtype: dict 12 | """ 13 | project_id = "****************" 14 | url = "https:///api/v4/projects/{project_id}/repository/files/{f}/raw?ref=".format( 15 | project_id=project_id, f="path%2Fto%2Ffile%2Eyaml") 16 | resp = requests.get(url, headers={"Private-Token": "********************"}) 17 | content_file = resp.content 18 | content_file.decode("utf-8") 19 | return yaml.safe_load(content_file) 20 | 
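# Usage sketch (assumes the project id, token and GitLab host above are filled in).
# urllib.parse.quote shows how to build the URL-encoded file path that the
# GitLab API expects, instead of hard-coding it.
if __name__ == "__main__":
    from urllib.parse import quote

    encoded_path = quote("path/to/file.yaml", safe="")
    print(encoded_path)  # path%2Fto%2Ffile.yaml
    print(read_yaml_from_gitlab())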
--------------------------------------------------------------------------------
/PythonScript/read_parquet_file.py:
--------------------------------------------------------------------------------
1 | import pyarrow.parquet as pq
2 | 
3 | # Read a parquet file into a pandas DataFrame and peek at the first rows
4 | table = pq.read_table("<file_name>.parquet")
5 | df = table.to_pandas()
6 | print(df.head())
7 | # print(df.columns)

--------------------------------------------------------------------------------
/PythonScript/read_part_file.py:
--------------------------------------------------------------------------------
1 | from pyspark.sql import SparkSession
2 | 
3 | # Count the total number of records spread across numbered part files (part-00000, part-00001, ...)
4 | spark = SparkSession.builder.getOrCreate()
5 | 
6 | count = 0
7 | for i in range(162):
8 |     file = "0"*(5-len(str(i)))+str(i)
9 |     print("path/*/part-{}".format(file))
10 |     rdd = spark.sparkContext.textFile("path/part-{}".format(file))
11 |     temp = rdd.count()
12 |     print(temp)
13 |     count += temp
14 | 
15 | print(count)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Engineer's Essential Commands
2 | 
3 | * Linux: [Link](Linux.md)
4 | * Python: [Link](Python.md)
5 | * PySpark: [Link](PySpark.md)
6 | * AWS: [Link](AWS)
7 |   * EKS: [Link](AWS/EKS.md)
8 |   * EMR: [Link](AWS/EMR.md)
9 |   * S3: [Link](AWS/S3.md)
10 | * Terraform: [Link](terraform.md)
11 | * Git: [Link](Git.md)
12 | * Helm: [Link](Helm)
13 |   * Jupyterhub: [Link](Helm/Jupyterhub.md)
14 | 
15 | ---
16 | 
**Want to contribute?** 17 |

18 | 
19 | * The commands should not be copy-pasted from any source in bulk.
20 | * Only add commands that you use frequently but that may be unknown to other developers.
21 | 
22 |   Example: `pwd`, `ls`, etc. are not allowed
23 | * Follow the structure and don't forget to embed any reference links, either in the heading or in the command description.
24 | * Put it inside a directory if applicable
25 | * Give a proper heading
26 | * Use markdown syntax for [block code or inline code](https://github.com/adam-p/markdown-here/wiki/Markdown-Here-Cheatsheet#code) to embed commands
27 | * If the command heading is not sufficient to explain the usage, give a one-line explanation with an example.
28 | * I would be happy to accept your pull request even if you add `one` good command rather than `ten` not-so-good commands.
29 | 

30 |
31 | -------------------------------------------------------------------------------- /mongodb.md: -------------------------------------------------------------------------------- 1 | To have launchd start mongodb/brew/mongodb-community now and restart at login: 2 | 3 | `brew services start mongodb/brew/mongodb-community` 4 | 5 | Or, if you don't want/need a background service you can just run: 6 | 7 | `mongod --config /opt/homebrew/etc/mongod.conf` 8 | -------------------------------------------------------------------------------- /pyspark/encrypt_decryt.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql.functions import lit, udf 2 | from cryptography.fernet import Fernet 3 | from pyspark.sql.types import StringType 4 | from pyspark.sql import SparkSession, Row 5 | 6 | 7 | def encrypt_val(clear_text, MASTER_KEY): 8 | f = Fernet(MASTER_KEY) 9 | clear_text_b = bytes(clear_text, 'utf-8') 10 | cipher_text = f.encrypt(clear_text_b) 11 | cipher_text = str(cipher_text.decode('ascii')) 12 | return cipher_text 13 | 14 | 15 | def decrypt_val(cipher_text, MASTER_KEY): 16 | f = Fernet(MASTER_KEY) 17 | clear_val = f.decrypt(cipher_text.encode()).decode() 18 | return clear_val 19 | 20 | 21 | if __name__ == '__main__': 22 | spark = SparkSession.builder.appName("Test Job").getOrCreate() 23 | 24 | # df = spark.read.csv("sample_file.csv", header=True) 25 | # create a list of rows with the data 26 | data = [ 27 | Row("John Doe", "IT", "201901", 50000), 28 | Row("Jane Doe", "HR", "202001", 60000), 29 | Row("Bob Smith", "IT", "201905", 70000), 30 | Row("Alice Smith", "HR", "202011", 80000), 31 | Row("James Johnson", "IT", "202002", 90000), 32 | Row("Emily Johnson", "HR", "201310", 100000), 33 | Row("David Williams", "IT", "201511", 110000), 34 | Row("Samantha Williams", "HR", "202207", 120000), 35 | Row("Charles Brown", "IT", "202101", 130000), 36 | Row("Ashley Brown", "HR", "202111", 140000) 37 | ] 38 | 39 | # create the DataFrame 40 | df = spark.createDataFrame(data, schema) 41 | 42 | encrypt = udf(encrypt_val, StringType()) 43 | decrypt = udf(decrypt_val, StringType()) 44 | 45 | encryptionKey = Fernet.generate_key() 46 | 47 | df = df.withColumn("first_name_encrypted", encrypt(df.employee_name, lit(encryptionKey))) 48 | df = df.withColumn("first_name_decrypted", decrypt(df.employee_name, lit(encryptionKey))) 49 | 50 | df.show() 51 | # df.coalesce(1).write.mode("overwrite").csv("output", header=True) 52 | -------------------------------------------------------------------------------- /pyspark/profiler.sh: -------------------------------------------------------------------------------- 1 | spark-submit --master local[2] \ 2 | --jar sparklens-0.3.0-s_2.11.jar \ 3 | --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \ 4 | main.py 5 | 6 | spark-submit --master local[2] \ 7 | --conf spark.jars=jvm-profiler-1.0.0.jar \ 8 | --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \ 9 | main.py 10 | 11 | spark-submit --master yarn --queue DE --deploy-mode cluster \ 12 | --conf spark.jars=jvm-profiler-1.0.0.jar \ 13 | --conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \ 14 | main.py 15 | 16 | spark-submit --master yarn --queue DE --deploy-mode cluster \ 17 | --files ./babar-agent-0.2.0-SNAPSHOT.jar \ 18 | --conf spark.executor.extraJavaOptions="-javaagent:./babar-agent-0.2.0-SNAPSHOT.jar=StackTraceProfiler,JVMProfiler[reservedMB=2560],ProcFSProfiler" \ 19 | main.py 
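# A variant of the first command above (sketch): spark-submit has no --jar flag,
# so the sparklens jar would normally be attached with --jars (or --packages),
# with the listener still enabled via spark.extraListeners.
spark-submit --master local[2] \
  --jars sparklens-0.3.0-s_2.11.jar \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  main.py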
-------------------------------------------------------------------------------- /pyspark/pySparkApp/Makefile: -------------------------------------------------------------------------------- 1 | build: 2 | python3 setup.py sdist 3 | rm -r foo.egg-info 4 | 5 | build_zip: 6 | mkdir -p dist/ 7 | rsync -av foo dist/ 8 | cd dist ; zip -r foo.zip . * ; cd .. 9 | #rm -fr dist/foo -------------------------------------------------------------------------------- /pyspark/pySparkApp/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/pySparkApp/README.md -------------------------------------------------------------------------------- /pyspark/pySparkApp/dist/foo.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/pySparkApp/dist/foo.zip -------------------------------------------------------------------------------- /pyspark/pySparkApp/foo/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/pySparkApp/foo/__init__.py -------------------------------------------------------------------------------- /pyspark/pySparkApp/foo/foo.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | 3 | 4 | class Foo: 5 | def __init__(self, app_name): 6 | self.spark = SparkSession.builder.appName(app_name).getOrCreate() 7 | 8 | def get_source_df(self): 9 | simple_data = [("James", "Sales", "NY", 90000, 34, 10000), 10 | ("Michael", "Sales", "NY", 86000, 56, 20000), 11 | ("Robert", "Sales", "CA", 81000, 30, 23000), 12 | ("Maria", "Finance", "CA", 90000, 24, 23000), 13 | ("Raman", "Finance", "CA", 99000, 40, 24000), 14 | ("Scott", "Finance", "NY", 83000, 36, 19000), 15 | ("Jen", "Finance", "NY", 79000, 53, 15000), 16 | ("Jeff", "Marketing", "CA", 80000, 25, 18000), 17 | ("Kumar", "Marketing", "NY", 91000, 50, 21000) 18 | ] 19 | 20 | schema = ["employee_name", "department", "state", "salary", "age", "bonus"] 21 | return self.spark.createDataFrame(data=simple_data, schema=schema) 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /pyspark/pySparkApp/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | spark-submit --master local --py-files foo.zip main.py 3 | """ 4 | 5 | from pyspark.sql import functions as F 6 | from foo.foo import Foo 7 | 8 | 9 | def do_transform(df): 10 | df.groupBy("department").sum("salary").show(truncate=False) 11 | 12 | df.groupBy("department").count().show(truncate=False) 13 | 14 | df.groupBy("department", "state") \ 15 | .sum("salary", "bonus") \ 16 | .show(truncate=False) 17 | 18 | df.groupBy("department") \ 19 | .agg(F.sum("salary").alias("sum_salary"), 20 | F.avg("salary").alias("avg_salary"), 21 | F.sum("bonus").alias("sum_bonus"), 22 | F.max("bonus").alias("max_bonus") 23 | ) \ 24 | .show(truncate=False) 25 | 26 | df.groupBy("department") \ 27 | .agg(F.sum("salary").alias("sum_salary"), 28 | F.avg("salary").alias("avg_salary"), 29 | F.sum("bonus").alias("sum_bonus"), 30 | F.max("bonus").alias("max_bonus")).\ 31 | where(F.col("sum_bonus") >= 50000) \ 32 | 
.show(truncate=False) 33 | 34 | 35 | if __name__ == '__main__': 36 | foo = Foo("PySparkPackagingTest") 37 | df = foo.get_source_df() 38 | do_transform(df) 39 | -------------------------------------------------------------------------------- /pyspark/read_hive_table.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/arverma/TowardsDataEngineering/9e0cb54bc4604826154a3afce2c0e74f60d0c06d/pyspark/read_hive_table.py -------------------------------------------------------------------------------- /pyspark/when_otherwise.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.sql import functions as F 3 | 4 | if __name__ == '__main__': 5 | spark = SparkSession.builder.getOrCreate() 6 | data = [('James', 'Smith', 'M', 30), 7 | ('Anna', 'Rose', 'F', 41), 8 | ('Robert', 'Williams', 'O', 62), 9 | ] 10 | 11 | columns = ["firstname", "lastname", "gender", "salary"] 12 | df = spark.createDataFrame(data=data, schema=columns) 13 | df = df.withColumn("Sex", F.when(df.gender == "M", "Male").when(df.gender == "F", "Female").otherwise(None)) 14 | df.show() 15 | -------------------------------------------------------------------------------- /terraform.md: -------------------------------------------------------------------------------- 1 | 2 | ### Terraform 3 | ``` 4 | terraform init -var-file="terraform.tfvars" "-lock=false" 5 | 6 | terraform plan -var-file="terraform.tfvars" "-lock=false" 7 | 8 | terraform apply -var-file="terraform.tfvars" "-lock=false" 9 | 10 | terraform destroy -var-file="terraform.tfvars" "-lock=false" 11 | ``` 12 | 13 | ``` 14 | terragrunt init 15 | 16 | terragrunt plan 17 | 18 | terragrunt apply 19 | 20 | terragrunt destroy 21 | ``` 22 | --------------------------------------------------------------------------------