# Hadoop Cheat Sheet (Spark 2.2.0)

## WinSCP
Transfer files to Flux (and then to HDFS).

## PuTTY
Host: `cavium-thunderx.arc-ts.umich.edu`
Duo two-factor login is required.

## Terminal
Action|Command
---|---
Set the Python path|`export PYSPARK_PYTHON=/usr/bin/python`
Submit a job|`spark-submit --master yarn --queue default filename`
Load the PySpark interactive shell|`pyspark --master yarn --queue default`
Load PySpark with options|`pyspark --master yarn --queue default --num-executors 20 --executor-memory 5g --executor-cores 5`
List previous commands|`history`
Check space used against the `/home` quota|`du -sh /home/caoa`

A minimal job script you could pass to `spark-submit` is sketched in the examples at the end of this README.

## Useful Hadoop-specific commands

Apache Hadoop FileSystem Shell documentation:
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html

See Tip #2 below.

Action|Command
---|---
List directory|`hdfs dfs -ls`
Copy file to HDFS|`hdfs dfs [-put][-copyFromLocal] filename destination`
Overwrite an existing file on HDFS|`hdfs dfs [-put][-copyFromLocal] -f filename destination`
Get file from HDFS|`hdfs dfs [-get][-copyToLocal] filename destination`
Rename file on HDFS|`hdfs dfs -mv oldname newname`
Delete file|`hdfs dfs -rm filename`
Delete directory|`hdfs dfs -rm -r directory`
Get folder size|`hdfs dfs -du -s folder`
Delete file and skip trash|`hdfs dfs -rm -skipTrash filename`
Delete directory and skip trash|`hdfs dfs -rm -r -skipTrash directory`
Empty trash bin (superuser privilege is required)|`hdfs dfs -expunge`
Parallelize get/put commands with GNU Parallel|`hdfs dfs -ls decahose.202009{01,02}* \| parallel -j 10 --progress hdfs dfs -get {} directory`

## Tips
1. You need to be in the main directory when running `spark-submit` commands.
2. The `hdfs dfs` prefix is preferable to the legacy `hadoop fs`.
3. To exit the PySpark interactive shell, type `exit()` or press Ctrl-D.

## Accessing a Jupyter Notebook on a remote machine (Linux) from another computer's browser (Windows)
This page details one way to do it: https://hsaghir.github.io/data_science/jupyter-notebook-on-a-remote-machine-linux/

OR

Follow these directions:
https://github.com/caocscar/twitter-decahose-pyspark#using-jupyter-notebook-with-pyspark

## Problems

### Error
`Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions`

### Solution
```
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
export PYSPARK_PYTHON=/sw/dsi/centos7/x86-64/Anaconda3-5.0.1/bin/python
```
Then restart the Spark job or interactive shell. A quick way to check which Python version the workers are actually using is sketched in the last example below.
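
## Example: a minimal PySpark job for `spark-submit`
As a companion to the `spark-submit` row in the Terminal table, here is a minimal sketch of a submittable job. The filename `wordcount.py` and the input path `sample.txt` are placeholders, not files that exist on the cluster; substitute your own.

```python
# wordcount.py -- minimal sketch of a job to submit with:
#   spark-submit --master yarn --queue default wordcount.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('wordcount').getOrCreate()
sc = spark.sparkContext

# 'sample.txt' is a placeholder; relative paths resolve against your HDFS home.
lines = sc.textFile('sample.txt')
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Print the ten most frequent words to the driver's stdout.
for word, n in counts.takeOrdered(10, key=lambda x: -x[1]):
    print(word, n)

spark.stop()
```
Per Tip #1, run the `spark-submit` command from your main directory.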
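
## Example: reading and writing HDFS data from PySpark
The `hdfs dfs` commands above manage files from the shell; this sketch shows the PySpark side of the same round trip. It assumes you are in the interactive shell (where the SparkSession `spark` is predefined), and `mydata.csv` / `mydata_parquet` are placeholder names.

```python
# Run inside the pyspark shell, where `spark` already exists.
# Relative paths resolve against your HDFS home directory.
df = spark.read.csv('mydata.csv', header=True, inferSchema=True)  # placeholder file
df.printSchema()
df.show(5)

# Write back to HDFS as Parquet; verify afterwards with `hdfs dfs -ls`.
df.write.mode('overwrite').parquet('mydata_parquet')
```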
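
## Example: checking driver vs. worker Python versions
If you hit the version-mismatch exception in the Problems section, this sketch (run in the interactive shell, where `sc` is predefined) compares the driver's Python version with what the YARN executors report.

```python
import sys

# Version on the driver.
print('driver :', sys.version_info[:3])

def worker_version(_):
    # Re-import on the worker so the function is self-contained when shipped.
    import sys
    return sys.version_info[:3]

# Versions on the workers; the set should contain a single entry that
# matches the driver once PYSPARK_PYTHON is set correctly.
print('workers:', set(sc.parallelize(range(8), 8).map(worker_version).collect()))
```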