├── .classpath
├── .project
├── README.md
├── html
│   ├── hdfs_sunburst.html
│   └── hdfs_sunburst_small.html
├── pom.xml
├── scripts
│   └── samples.sh
├── src
│   └── main
│       └── java
│           └── com
│               └── github
│                   └── gbraccialli
│                       └── hdfs
│                           ├── DirectoryContentsUtils.java
│                           ├── HDFSConfigUtils.java
│                           └── PathInfo.java
├── target
│   └── gbraccialli-hdfs-utils-with-dependencies.jar
└── zeppelin
    ├── hdfs-d3.json
    └── note.json
/README.md:
--------------------------------------------------------------------------------
### 1- Use Zeppelin notebook
Import the following notebook URL into Zeppelin:

https://raw.githubusercontent.com/gbraccialli/HdfsUtils/master/zeppelin/hdfs-d3.json

### [Live Preview here](https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2dicmFjY2lhbGxpL0hkZnNVdGlscy9tYXN0ZXIvemVwcGVsaW4vbm90ZS5qc29u)
### 2- Build from source, run from the command line, and use the html file
### Building
```sh
git clone https://github.com/gbraccialli/HdfsUtils.git
cd HdfsUtils
mvn clean package
```
### Basic usage
```sh
java -jar target/gbraccialli-hdfs-utils-with-dependencies.jar \
    --path=/ \
    --maxLevelThreshold=-1 \
    --minSizeThreshold=-1 \
    --showFiles=false \
    --verbose=true > out.json
```
### Visualizing
Open html/hdfs_sunburst.html in your browser and point it to the .json file created in the previous step, or copy/paste the json content using the load options on the right.

PS: Chrome has a security constraint that prevents it from loading local files; use one of the following options:
- Use the zeppelin notebook (described above)
- Use Safari
- Enable Chrome local file access: [instructions here](http://stackoverflow.com/questions/18586921/how-to-launch-html-using-chrome-at-allow-file-access-from-files-mode)
- Publish the json on a webserver and use the full URL (see the sketch below)
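For the webserver option, any static file server works; here is a minimal sketch using Python's built-in http.server module (the port and directory are just examples, assuming the out.json produced in the previous step):
```sh
# serve the directory containing out.json on http://localhost:8000 (Python 3)
cd /path/to/dir/with/out.json
python3 -m http.server 8000
# then load http://localhost:8000/out.json from hdfs_sunburst.html
```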

### Command line options:
#### --confDir=
//path-to-conf-dir
//directory containing the hadoop config files, defaults to /etc/hadoop/conf

#### --maxLevelThreshold=
//-1 or a valid int
//max number of directory levels to drill down. -1 means no limit. For example, maxLevelThreshold=3 means the drill down stops after 3 levels of subdirectories.

#### --minSizeThreshold=
//-1 or a valid long
//min number of bytes a directory must contain for the drill down to continue. -1 means no limit. For example, minSizeThreshold=1000000 means only directories larger than 1000000 bytes are drilled down.

#### --showFiles=
//true or false
//whether to show information about individual files. showFiles=false shows only summary information about the files in each directory/subdirectory.

#### --exclude=
//path1,path2,...
//directories to exclude from the drill down. For example, /tmp/,/user/ suppresses information about those directories.

#### --doAs=
//username (hdfs, for example)
//on a non-kerberized cluster you can set the user that performs the hdfs operations; using hdfs avoids permission issues. On a kerberized cluster, grant read access to the user performing this operation (you can use Ranger for this).

#### --verbose=
//true or false
//when true, prints processing info to System.err (does not apply to zeppelin)

#### --path=
//path where the analysis starts

## Special thanks to:
- [Dave Patton](https://github.com/dp1140a), who first created [HDP-Viz](https://github.com/dp1140a/HDP-Viz), where I got inspired and copied lots of code
- [Ali Bajwa](https://github.com/abajwa-hw), who created the [ambari stack for Dave's project](https://github.com/abajwa-hw/hdpviz) (and helped me get it working)
- [David Streever](https://github.com/dstreev), who created (or forked) [hdfs-cli](https://github.com/dstreev/hdfs-cli), from which I also copied lots of code
--------------------------------------------------------------------------------