# Hadoop Cheat Sheet (Spark 2.2.0)

## WinSCP
Transfer files to Flux (and then to HDFS).

## PuTTY
Host: `cavium-thunderx.arc-ts.umich.edu`
Duo two-factor login is required.

## Terminal
Action|Command
---|---
Set the Python path|`export PYSPARK_PYTHON=/usr/bin/python`
Submit a job|`spark-submit --master yarn --queue default filename`
Load the PySpark interactive shell|`pyspark --master yarn --queue default`
Load PySpark with options|`pyspark --master yarn --queue default --num-executors 20 --executor-memory 5g --executor-cores 5`
List previous commands|`history`
Check space used against the `/home` quota|`du -sh /home/caoa`

A minimal job script you could pass to `spark-submit` is sketched in the examples at the end of this README.

## Useful Hadoop-specific commands

Apache Hadoop FileSystem Shell documentation:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html

See Tip #2 below.

Action|Command
---|---
List directory|`hdfs dfs -ls`
Copy file to HDFS|`hdfs dfs [-put][-copyFromLocal] filename destination`
Overwrite an existing file on HDFS|`hdfs dfs [-put][-copyFromLocal] -f filename destination`
Get file from HDFS|`hdfs dfs [-get][-copyToLocal] filename destination`
Rename file on HDFS|`hdfs dfs -mv oldname newname`
Delete file|`hdfs dfs -rm filename`
Delete directory|`hdfs dfs -rm -r directory`
Get folder size|`hdfs dfs -du -s folder`
Delete file and skip trash|`hdfs dfs -rm -skipTrash filename`
Delete directory and skip trash|`hdfs dfs -rm -r -skipTrash directory`
Empty trash bin (superuser privilege is required)|`hdfs dfs -expunge`
Parallelize get/put commands with GNU Parallel|`hdfs dfs -ls decahose.202009{01,02}* \| parallel -j 10 --progress hdfs dfs -get {} directory`

## Tips
1. You need to be in the main directory when running `spark-submit` commands.
2. The `hdfs dfs` prefix is preferable to the legacy `hadoop fs`.
3. To exit the PySpark interactive shell, type `exit()` or press Ctrl-D.

## Accessing a Jupyter Notebook on a remote machine (Linux) from another computer's browser (Windows)
This page details one way to do it: https://hsaghir.github.io/data_science/jupyter-notebook-on-a-remote-machine-linux/

OR

Follow these directions:
https://github.com/caocscar/twitter-decahose-pyspark#using-jupyter-notebook-with-pyspark

## Problems

### Error
`Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions`

### Solution
```
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
export PYSPARK_PYTHON=/sw/dsi/centos7/x86-64/Anaconda3-5.0.1/bin/python
```
Then restart the Spark job or interactive shell. A quick way to check which Python version the workers are actually using is sketched in the last example below.
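
## Example: a minimal PySpark job for `spark-submit`
As a companion to the `spark-submit` row in the Terminal table, here is a minimal sketch of a submittable job. The filename `wordcount.py` and the input path `sample.txt` are placeholders, not files that exist on the cluster; substitute your own.

```python
# wordcount.py -- minimal sketch of a job to submit with:
#   spark-submit --master yarn --queue default wordcount.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wordcount').getOrCreate()
sc = spark.sparkContext

# 'sample.txt' is a placeholder; relative paths resolve against your HDFS home.
lines = sc.textFile('sample.txt')
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Print the ten most frequent words to the driver's stdout.
for word, n in counts.takeOrdered(10, key=lambda x: -x[1]):
    print(word, n)

spark.stop()
```
Per Tip #1, run the `spark-submit` command from your main directory.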
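
## Example: reading and writing HDFS data from PySpark
The `hdfs dfs` commands above manage files from the shell; this sketch shows the PySpark side of the same round trip. It assumes you are in the interactive shell (where the SparkSession `spark` is predefined), and `mydata.csv` / `mydata_parquet` are placeholder names.

```python
# Run inside the pyspark shell, where `spark` already exists.
# Relative paths resolve against your HDFS home directory.
df = spark.read.csv('mydata.csv', header=True, inferSchema=True)  # placeholder file
df.printSchema()
df.show(5)

# Write back to HDFS as Parquet; verify afterwards with `hdfs dfs -ls`.
df.write.mode('overwrite').parquet('mydata_parquet')
```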
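
## Example: checking driver vs. worker Python versions
If you hit the version-mismatch exception in the Problems section, this sketch (run in the interactive shell, where `sc` is predefined) compares the driver's Python version with what the YARN executors report.

```python
import sys

# Version on the driver.
print('driver :', sys.version_info[:3])

def worker_version(_):
    # Re-import on the worker so the function is self-contained when shipped.
    import sys
    return sys.version_info[:3]

# Versions on the workers; the set should contain a single entry that
# matches the driver once PYSPARK_PYTHON is set correctly.
print('workers:', set(sc.parallelize(range(8), 8).map(worker_version).collect()))
```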