├── .github
│   └── workflows
│       └── java8.yml
├── .gitignore
├── .travis.yml
├── LICENSE.txt
├── README.md
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── com
    │           └── google
    │               └── code
    │                   └── externalsorting
    │                       ├── BinaryFileBuffer.java
    │                       ├── ExternalSort.java
    │                       ├── IOStringStack.java
    │                       ├── StringSizeEstimator.java
    │                       └── csv
    │                           ├── CSVRecordBuffer.java
    │                           ├── CsvExternalSort.java
    │                           ├── CsvSortOptions.java
    │                           └── SizeEstimator.java
    └── test
        ├── java
        │   └── com
        │       └── google
        │           └── code
        │               └── externalsorting
        │                   ├── ExternalSortTest.java
        │                   └── csv
        │                       └── CsvExternalSortTest.java
        └── resources
            ├── externalSorting.csv
            ├── externalSortingSemicolon.csv
            ├── externalSortingTabs.csv
            ├── issue44.csv
            ├── nonLatinSorting.csv
            ├── test-file-1.csv
            ├── test-file-1.txt
            └── test-file-2.txt

/.github/workflows/java8.yml:
--------------------------------------------------------------------------------
name: Java CI

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up JDK 1.8
      uses: actions/setup-java@v1
      with:
        java-version: 1.8
    - name: Build with Maven
      run: mvn -B package
    - name: Test with Maven
      run: mvn test

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/target/
*.iml
.idea

--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: java

jdk:
  - openjdk8
  - openjdk12

install: true

branches:
  only:
    - master

script: mvn clean test jacoco:report

after_success:
  - mvn coveralls:report

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
This code is in the public domain. You can take it, modify it, and use it in your commercial projects without attribution. We encourage you, however, to acknowledge this package whenever possible and to contribute your bug fixes and reports.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Externalsortinginjava
==========================================================
[![][maven img]][maven]
[![][license img]][license]
[![docs-badge][]][docs]
![Java CI](https://github.com/lemire/externalsortinginjava/workflows/Java%20CI/badge.svg)

External-Memory Sorting in Java: useful to sort very large files using multiple cores and an external-memory algorithm.


The 0.1 versions of the library are compatible with Java 6 and above. Versions 0.2 and above
require at least Java 8.

This code is used in [Apache Jackrabbit Oak](https://github.com/apache/jackrabbit-oak) as well as in [Apache Beam](https://github.com/apache/beam) and in [Spotify scio](https://github.com/spotify/scio).

Code sample
------------

```java
import java.io.File;
import com.google.code.externalsorting.ExternalSort;

//... inputfile: input file name
//... outputfile: output file name
// next command sorts the lines from inputfile to outputfile
// (mergeSortedFiles returns the number of lines written as a long)
long numLinesWritten = ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(new File(inputfile)), new File(outputfile));
// you can also provide a custom string comparator, see API
```


Code sample (CSV)
------------

For sorting CSV files, it might be more convenient to use `CsvExternalSort`.

```java
import java.io.File;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import com.google.code.externalsorting.csv.CsvExternalSort;
import com.google.code.externalsorting.csv.CsvSortOptions;

// provide a comparator (here: compare records by their first column)
Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
//... file: the input file (a java.io.File)
//... outputfile: the output file (a java.io.File)
//...provide sort options
CsvSortOptions sortOptions = new CsvSortOptions
    .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
    .charset(Charset.defaultCharset())
    .distinct(false)
    .numHeader(1)
    .skipHeader(false)
    .format(CSVFormat.DEFAULT)
    .build();
// container to store the header lines
ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();

// next two lines sort the lines from file to outputfile
List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
// at this point you can access header if you'd like.
int numWrittenLines = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);

```

The `numHeader` parameter is the number of header lines in the CSV file (typically 0 or 1), and the `skipHeader` parameter indicates whether you would like to exclude these lines from the parsing.
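
To make the idea of header lines concrete, here is a minimal in-memory sketch that uses only the JDK. It sets the first `numHeader` lines aside, sorts the rest, and puts the header back on top. `HeaderAwareSort` and `sortKeepingHeader` are hypothetical names chosen for this illustration and are not part of this library; the library's own handling of headers (including what `skipHeader` does) may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not part of externalsortinginjava.
final class HeaderAwareSort {

    // Sorts every line after the first numHeader lines and returns the
    // header lines, unchanged and in order, followed by the sorted body.
    static List<String> sortKeepingHeader(List<String> lines, int numHeader,
            Comparator<String> cmp) {
        // Clamp so that a negative or oversized numHeader cannot fail.
        int split = Math.max(0, Math.min(numHeader, lines.size()));
        List<String> header = new ArrayList<>(lines.subList(0, split));
        List<String> body = new ArrayList<>(lines.subList(split, lines.size()));
        body.sort(cmp);
        List<String> result = new ArrayList<>(header);
        result.addAll(body);
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("name,age", "carol,41", "alice,30", "bob,25");
        for (String line : sortKeepingHeader(lines, 1, Comparator.naturalOrder())) {
            System.out.println(line);
        }
    }
}
```

With `numHeader` set to 0, the sketch sorts every line, which corresponds to a file without a header row.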

API Documentation
-----------------

http://www.javadoc.io/doc/com.google.code.externalsortinginjava/externalsortinginjava/




Maven dependency
-----------------


You can download the jar files from the Maven central repository:
https://repo1.maven.org/maven2/com/google/code/externalsortinginjava/externalsortinginjava/

You can also specify the dependency in the Maven "pom.xml" file:

```xml
<dependencies>
    <dependency>
        <groupId>com.google.code.externalsortinginjava</groupId>
        <artifactId>externalsortinginjava</artifactId>
        <version>[0.6.0,)</version>
    </dependency>
</dependencies>
```

How to build
-----------------

- get the java jdk
- Install Maven 2
- mvn install - builds jar (requires signing)
- or mvn package - builds jar (does not require signing)
- mvn test - runs tests



[maven img]:https://maven-badges.herokuapp.com/maven-central/com.googlecode.javaewah/JavaEWAH/badge.svg
[maven]:http://search.maven.org/#search%7Cga%7C1%7Cexternalsortinginjava

[license]:LICENSE.txt
[license img]:https://img.shields.io/badge/License-Apache%202-blue.svg


[docs-badge]:https://img.shields.io/badge/API-docs-blue.svg?style=flat-square
[docs]:http://www.javadoc.io/doc/com.google.code.externalsortinginjava/externalsortinginjava/

--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
1 | 2 | 4.0.0 3 | com.google.code.externalsortinginjava 4 | externalsortinginjava 5 | jar 6 | 0.6.3-SNAPSHOT 7 | externalsortinginjava 8 | http://github.com/lemire/externalsortinginjava/ 9 | Sometimes, you want to sort large file without first loading them into memory. The solution is to use External Sorting. You divide the files into small blocks, sort each block in RAM, and then merge the result.
10 | 11 | Many database engines and the Unix sort command support external sorting. But what if you want to avoid a database? Or what if you want to sort in a non-lexicographic order? Or maybe you just want a simple external sorting example? 12 | 13 | When we could not find such a simple program, we wrote one. 14 | 15 | UTF-8 16 | 1.8 17 | 1.8 18 | 1.8 19 | 20 | 21 | GitHub Issue Tracking 22 | https://github.com/lemire/externalsortinginjava/issues 23 | 24 | 25 | org.sonatype.oss 26 | oss-parent 27 | 5 28 | 29 | 30 | 31 | Public Domain 32 | http://creativecommons.org/licenses/publicdomain 33 | repo 34 | This code is in the public domain. You can take it, modify it, and use it in your commercial projects without attribution. We encourage you, however, to acknowledge this package whenever possible and to contribute your bug fixes and reports. 35 | 36 | 37 | 38 | 39 | 40 | junit 41 | junit 42 | 4.13.1 43 | 44 | 45 | 46 | com.github.jbellis 47 | jamm 48 | 0.3.1 49 | 50 | 51 | 52 | org.apache.commons 53 | commons-csv 54 | 1.9.0 55 | 56 | 57 | 58 | 59 | 60 | 61 | junit 62 | junit 63 | test 64 | 65 | 66 | com.github.jbellis 67 | jamm 68 | test 69 | 70 | 71 | 72 | org.apache.commons 73 | commons-csv 74 | 75 | 76 | 77 | 78 | 79 | 80 | maven-dependency-plugin 81 | 82 | 83 | 84 | copy-dependencies 85 | 86 | 87 | ${project.build.directory}/lib 88 | 89 | 90 | 91 | 92 | 93 | org.jacoco 94 | jacoco-maven-plugin 95 | 0.7.8 96 | 97 | 98 | prepare-agent 99 | 100 | prepare-agent 101 | 102 | 103 | 104 | 105 | 106 | org.eluder.coveralls 107 | coveralls-maven-plugin 108 | 3.2.1 109 | 110 | 111 | org.apache.maven.plugins 112 | maven-compiler-plugin 113 | 3.5.1 114 | 115 | ${java.target.version} 116 | ${java.target.version} 117 | 118 | 119 | 120 | 121 | org.apache.maven.plugins 122 | maven-surefire-plugin 123 | 2.19.1 124 | 125 | 126 | **/*Spec.* 127 | **/*Test.* 128 | **/*Benchmark.java 129 | 130 | -javaagent:${project.build.directory}/lib/jamm-0.3.1.jar 131 | 132 | 133 | 134 | 
org.apache.maven.plugins 135 | maven-jar-plugin 136 | 2.6 137 | 138 | 139 | 140 | true 141 | com.google.code.externalsorting.ExternalSort 142 | 143 | 144 | 145 | 146 | 147 | maven-release-plugin 148 | 2.5.3 149 | 150 | deploy 151 | 152 | 153 | 154 | org.apache.felix 155 | maven-bundle-plugin 156 | 2.3.7 157 | true 158 | 159 | 160 | com.googlecode.javaewah.* 161 | * 162 | 163 | 164 | 165 | 166 | org.apache.maven.plugins 167 | maven-gpg-plugin 168 | 1.6 169 | 170 | 171 | sign-artifacts 172 | verify 173 | 174 | sign 175 | 176 | 177 | 178 | 179 | 180 | org.apache.maven.plugins 181 | maven-javadoc-plugin 182 | 2.10.4 183 | 184 | 8 185 | 186 | 187 | 188 | attach-javadocs 189 | 190 | jar 191 | 192 | 193 | 194 | 195 | 196 | org.apache.maven.plugins 197 | maven-source-plugin 198 | 3.0.1 199 | 200 | 201 | attach-sources 202 | 203 | jar 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | scm:git:git@github.com:lemire/externalsortinginjava.git 212 | scm:git:git@github.com:lemire/externalsortinginjava.git 213 | scm:git:git@github.com:lemire/externalsortinginjava.git 214 | HEAD 215 | 216 | 217 | -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/BinaryFileBuffer.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.IOException; 5 | 6 | /** 7 | * This is essentially a thin wrapper on top of a BufferedReader... which keeps 8 | * the last line in memory. 
9 | * 10 | */ 11 | public final class BinaryFileBuffer implements IOStringStack { 12 | public BinaryFileBuffer(BufferedReader r) throws IOException { 13 | this.fbr = r; 14 | reload(); 15 | } 16 | public void close() throws IOException { 17 | this.fbr.close(); 18 | } 19 | 20 | public boolean empty() { 21 | return this.cache == null; 22 | } 23 | 24 | public String peek() { 25 | return this.cache; 26 | } 27 | 28 | public String pop() throws IOException { 29 | String answer = peek().toString();// make a copy 30 | reload(); 31 | return answer; 32 | } 33 | 34 | private void reload() throws IOException { 35 | this.cache = this.fbr.readLine(); 36 | } 37 | 38 | private BufferedReader fbr; 39 | 40 | private String cache; 41 | 42 | } -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/ExternalSort.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting; 2 | 3 | // filename: ExternalSort.java 4 | import java.io.BufferedReader; 5 | import java.io.BufferedWriter; 6 | import java.io.EOFException; 7 | import java.io.File; 8 | import java.io.FileInputStream; 9 | import java.io.FileOutputStream; 10 | import java.io.IOException; 11 | import java.io.InputStream; 12 | import java.io.InputStreamReader; 13 | import java.io.OutputStream; 14 | import java.io.OutputStreamWriter; 15 | import java.nio.charset.Charset; 16 | import java.util.stream.Collectors; 17 | import java.util.ArrayList; 18 | import java.util.Collections; 19 | import java.util.Iterator; 20 | import java.util.Comparator; 21 | import java.util.List; 22 | import java.util.PriorityQueue; 23 | import java.util.zip.Deflater; 24 | import java.util.zip.GZIPInputStream; 25 | import java.util.zip.GZIPOutputStream; 26 | 27 | /** 28 | * Goal: offer a generic external-memory sorting program in Java. 
29 | * 30 | * It must be : - hackable (easy to adapt) - scalable to large files - sensibly 31 | * efficient. 32 | * 33 | * This software is in the public domain. 34 | * 35 | * Usage: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt 36 | * 37 | * You can change the default maximal number of temporary files with the -t 38 | * flag: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt 39 | * -t 3 40 | * 41 | * For very large files, you might want to use an appropriate flag to allocate 42 | * more memory to the Java VM: java -Xms2G 43 | * com/google/code/externalsorting/ExternalSort somefile.txt out.txt 44 | * 45 | * By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon 46 | * Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, 47 | * Amit Jain, Thomas Mueller, Jerry Yang, First published: April 2010 originally posted at 48 | * http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/ 49 | */ 50 | public class ExternalSort { 51 | 52 | 53 | private static void displayUsage() { 54 | System.out 55 | .println("java com.google.code.externalsorting.ExternalSort inputfile outputfile"); 56 | System.out.println("Flags are:"); 57 | System.out.println("-v or --verbose: verbose output"); 58 | System.out.println("-d or --distinct: prune duplicate lines"); 59 | System.out 60 | .println("-t or --maxtmpfiles (followed by an integer): specify an upper bound on the number of temporary files"); 61 | System.out 62 | .println("-c or --charset (followed by a charset code): specify the character set to use (for sorting)"); 63 | System.out 64 | .println("-z or --gzip: use compression for the temporary files"); 65 | System.out 66 | .println("-H or --header (followed by an integer): ignore the first few lines"); 67 | System.out 68 | .println("-s or --store (followed by a path): where to store the temporary files"); 69 | System.out.println("-h or --help: display this message"); 70 | } 71 | 72 | /** 73 
| * This method calls the garbage collector and then returns the free 74 | * memory. This avoids problems with applications where the GC hasn't 75 | * reclaimed memory and reports no available memory. 76 | * 77 | * @return available memory 78 | */ 79 | public static long estimateAvailableMemory() { 80 | System.gc(); 81 | // http://stackoverflow.com/questions/12807797/java-get-available-memory 82 | Runtime r = Runtime.getRuntime(); 83 | long allocatedMemory = r.totalMemory() - r.freeMemory(); 84 | long presFreeMemory = r.maxMemory() - allocatedMemory; 85 | return presFreeMemory; 86 | } 87 | 88 | /** 89 | * we divide the file into small blocks. If the blocks are too small, we 90 | * shall create too many temporary files. If they are too big, we shall 91 | * be using too much memory. 92 | * 93 | * @param sizeoffile how much data (in bytes) can we expect 94 | * @param maxtmpfiles how many temporary files can we create (e.g., 1024) 95 | * @param maxMemory Maximum memory to use (in bytes) 96 | * @return the estimate 97 | */ 98 | public static long estimateBestSizeOfBlocks(final long sizeoffile, 99 | final int maxtmpfiles, final long maxMemory) { 100 | // we don't want to open up much more than maxtmpfiles temporary 101 | // files, better run 102 | // out of memory first. 103 | long blocksize = sizeoffile / maxtmpfiles 104 | + (sizeoffile % maxtmpfiles == 0 ? 0 : 1); 105 | 106 | // on the other hand, we don't want to create many temporary 107 | // files 108 | // for naught. If blocksize is smaller than half the free 109 | // memory, grow it. 
110 | if (blocksize < maxMemory / 2) { 111 | blocksize = maxMemory / 2; 112 | } 113 | return blocksize; 114 | } 115 | 116 | /** 117 | * @param args command line argument 118 | * @throws IOException generic IO exception 119 | */ 120 | public static void main(final String[] args) throws IOException { 121 | boolean verbose = false; 122 | boolean distinct = false; 123 | int maxtmpfiles = DEFAULTMAXTEMPFILES; 124 | Charset cs = Charset.defaultCharset(); 125 | String inputfile = null, outputfile = null; 126 | File tempFileStore = null; 127 | boolean usegzip = false; 128 | boolean parallel = true; 129 | int headersize = 0; 130 | for (int param = 0; param < args.length; ++param) { 131 | if (args[param].equals("-v") 132 | || args[param].equals("--verbose")) { 133 | verbose = true; 134 | } else if ((args[param].equals("-h") || args[param] 135 | .equals("--help"))) { 136 | displayUsage(); 137 | return; 138 | } else if ((args[param].equals("-d") || args[param] 139 | .equals("--distinct"))) { 140 | distinct = true; 141 | } else if ((args[param].equals("-t") || args[param] 142 | .equals("--maxtmpfiles")) 143 | && args.length > param + 1) { 144 | param++; 145 | maxtmpfiles = Integer.parseInt(args[param]); 146 | if (maxtmpfiles < 0) { 147 | System.err 148 | .println("maxtmpfiles should be positive"); 149 | } 150 | } else if ((args[param].equals("-c") || args[param] 151 | .equals("--charset")) 152 | && args.length > param + 1) { 153 | param++; 154 | cs = Charset.forName(args[param]); 155 | } else if ((args[param].equals("-z") || args[param] 156 | .equals("--gzip"))) { 157 | usegzip = true; 158 | } else if ((args[param].equals("-H") || args[param] 159 | .equals("--header")) && args.length > param + 1) { 160 | param++; 161 | headersize = Integer.parseInt(args[param]); 162 | if (headersize < 0) { 163 | System.err 164 | .println("headersize should be positive"); 165 | } 166 | } else if ((args[param].equals("-s") || args[param] 167 | .equals("--store")) && args.length > param + 1) { 168 
| param++; 169 | tempFileStore = new File(args[param]); 170 | } else { 171 | if (inputfile == null) { 172 | inputfile = args[param]; 173 | } else if (outputfile == null) { 174 | outputfile = args[param]; 175 | } else { 176 | System.out.println("Unparsed: " 177 | + args[param]); 178 | } 179 | } 180 | } 181 | if (outputfile == null) { 182 | System.out 183 | .println("please provide input and output file names"); 184 | displayUsage(); 185 | return; 186 | } 187 | Comparator comparator = defaultcomparator; 188 | List l = sortInBatch(new File(inputfile), comparator, 189 | maxtmpfiles, cs, tempFileStore, distinct, headersize, 190 | usegzip, parallel); 191 | if (verbose) { 192 | System.out 193 | .println("created " + l.size() + " tmp files"); 194 | } 195 | mergeSortedFiles(l, new File(outputfile), comparator, cs, 196 | distinct, false, usegzip); 197 | } 198 | 199 | /** 200 | * This merges several BinaryFileBuffer to an output writer. 201 | * 202 | * @param fbw A buffer where we write the data. 203 | * @param cmp A comparator object that tells us how to sort the 204 | * lines. 205 | * @param distinct Pass true if duplicate lines should be 206 | * discarded. 207 | * @param buffers 208 | * Where the data should be read. 209 | * @return The number of lines sorted. 
210 | * @throws IOException generic IO exception 211 | * 212 | */ 213 | public static long mergeSortedFiles(BufferedWriter fbw, 214 | final Comparator cmp, boolean distinct, 215 | List buffers) throws IOException { 216 | PriorityQueue pq = new PriorityQueue<>( 217 | 11, new Comparator() { 218 | @Override 219 | public int compare(IOStringStack i, 220 | IOStringStack j) { 221 | return cmp.compare(i.peek(), j.peek()); 222 | } 223 | }); 224 | for (IOStringStack bfb : buffers) { 225 | if (!bfb.empty()) { 226 | pq.add(bfb); 227 | } 228 | } 229 | long numLinesWritten = 0; 230 | try { 231 | if (!distinct) { 232 | while (pq.size() > 0) { 233 | IOStringStack bfb = pq.poll(); 234 | String r = bfb.pop(); 235 | fbw.write(r); 236 | fbw.newLine(); 237 | ++numLinesWritten; 238 | if (bfb.empty()) { 239 | bfb.close(); 240 | } else { 241 | pq.add(bfb); // add it back 242 | } 243 | } 244 | } else { 245 | String lastLine = null; 246 | if(pq.size() > 0) { 247 | IOStringStack bfb = pq.poll(); 248 | lastLine = bfb.pop(); 249 | fbw.write(lastLine); 250 | fbw.newLine(); 251 | ++numLinesWritten; 252 | if (bfb.empty()) { 253 | bfb.close(); 254 | } else { 255 | pq.add(bfb); // add it back 256 | } 257 | } 258 | while (pq.size() > 0) { 259 | IOStringStack bfb = pq.poll(); 260 | String r = bfb.pop(); 261 | // Skip duplicate lines 262 | if (cmp.compare(r, lastLine) != 0) { 263 | fbw.write(r); 264 | fbw.newLine(); 265 | lastLine = r; 266 | ++numLinesWritten; 267 | } 268 | if (bfb.empty()) { 269 | bfb.close(); 270 | } else { 271 | pq.add(bfb); // add it back 272 | } 273 | } 274 | } 275 | } finally { 276 | fbw.close(); 277 | for (IOStringStack bfb : pq) { 278 | bfb.close(); 279 | } 280 | } 281 | return numLinesWritten; 282 | 283 | } 284 | 285 | 286 | /** 287 | * This merges a bunch of temporary flat files 288 | * 289 | * @param files The {@link List} of sorted {@link File}s to be merged. 290 | * @param outputfile The output {@link File} to merge the results to. 
291 | * @return The number of lines sorted. 292 | * @throws IOException generic IO exception 293 | */ 294 | public static long mergeSortedFiles(List files, File outputfile) 295 | throws IOException { 296 | return mergeSortedFiles(files, outputfile, defaultcomparator, 297 | Charset.defaultCharset()); 298 | } 299 | 300 | /** 301 | * This merges a bunch of temporary flat files 302 | * 303 | * @param files The {@link List} of sorted {@link File}s to be merged. 304 | * @param outputfile The output {@link File} to merge the results to. 305 | * @param cmp The {@link Comparator} to use to compare 306 | * {@link String}s. 307 | * @return The number of lines sorted. 308 | * @throws IOException generic IO exception 309 | */ 310 | public static long mergeSortedFiles(List files, File outputfile, 311 | final Comparator cmp) throws IOException { 312 | return mergeSortedFiles(files, outputfile, cmp, 313 | Charset.defaultCharset()); 314 | } 315 | 316 | /** 317 | * This merges a bunch of temporary flat files 318 | * 319 | * @param files The {@link List} of sorted {@link File}s to be merged. 320 | * @param outputfile The output {@link File} to merge the results to. 321 | * @param cmp The {@link Comparator} to use to compare 322 | * {@link String}s. 323 | * @param distinct Pass true if duplicate lines should be 324 | * discarded. 325 | * @return The number of lines sorted. 326 | * @throws IOException generic IO exception 327 | */ 328 | public static long mergeSortedFiles(List files, File outputfile, 329 | final Comparator cmp, boolean distinct) 330 | throws IOException { 331 | return mergeSortedFiles(files, outputfile, cmp, 332 | Charset.defaultCharset(), distinct); 333 | } 334 | 335 | /** 336 | * This merges a bunch of temporary flat files 337 | * 338 | * @param files The {@link List} of sorted {@link File}s to be merged. 339 | * @param outputfile The output {@link File} to merge the results to. 340 | * @param cmp The {@link Comparator} to use to compare 341 | * {@link String}s. 
342 | * @param cs The {@link Charset} to be used for the byte to 343 | * character conversion. 344 | * @return The number of lines sorted. 345 | * @throws IOException generic IO exception 346 | */ 347 | public static long mergeSortedFiles(List files, File outputfile, 348 | final Comparator cmp, Charset cs) throws IOException { 349 | return mergeSortedFiles(files, outputfile, cmp, cs, false); 350 | } 351 | 352 | /** 353 | * This merges a bunch of temporary flat files 354 | * 355 | * @param files The {@link List} of sorted {@link File}s to be merged. 356 | * @param distinct Pass true if duplicate lines should be 357 | * discarded. 358 | * @param outputfile The output {@link File} to merge the results to. 359 | * @param cmp The {@link Comparator} to use to compare 360 | * {@link String}s. 361 | * @param cs The {@link Charset} to be used for the byte to 362 | * character conversion. 363 | * @return The number of lines sorted. 364 | * @throws IOException generic IO exception 365 | * @since v0.1.2 366 | */ 367 | public static long mergeSortedFiles(List files, File outputfile, 368 | final Comparator cmp, Charset cs, boolean distinct) 369 | throws IOException { 370 | return mergeSortedFiles(files, outputfile, cmp, cs, distinct, 371 | false, false); 372 | } 373 | 374 | /** 375 | * This merges a bunch of temporary flat files 376 | * 377 | * @param files The {@link List} of sorted {@link File}s to be merged. 378 | * @param distinct Pass true if duplicate lines should be 379 | * discarded. 380 | * @param outputfile The output {@link File} to merge the results to. 381 | * @param cmp The {@link Comparator} to use to compare 382 | * {@link String}s. 383 | * @param cs The {@link Charset} to be used for the byte to 384 | * character conversion. 385 | * @param append Pass true if result should append to 386 | * {@link File} instead of overwrite. Default to be false 387 | * for overloading methods. 
388 | * @param usegzip assumes we used gzip compression for temporary files 389 | * @return The number of lines sorted. 390 | * @throws IOException generic IO exception 391 | * @since v0.1.4 392 | */ 393 | public static long mergeSortedFiles(List files, File outputfile, 394 | final Comparator cmp, Charset cs, boolean distinct, 395 | boolean append, boolean usegzip) throws IOException { 396 | ArrayList bfbs = new ArrayList<>(); 397 | for (File f : files) { 398 | final int BUFFERSIZE = 2048; 399 | InputStream in = new FileInputStream(f); 400 | BufferedReader br; 401 | if (usegzip) { 402 | br = new BufferedReader( 403 | new InputStreamReader( 404 | new GZIPInputStream(in, 405 | BUFFERSIZE), cs)); 406 | } else { 407 | br = new BufferedReader(new InputStreamReader( 408 | in, cs)); 409 | } 410 | 411 | BinaryFileBuffer bfb = new BinaryFileBuffer(br); 412 | bfbs.add(bfb); 413 | } 414 | BufferedWriter fbw = new BufferedWriter(new OutputStreamWriter( 415 | new FileOutputStream(outputfile, append), cs)); 416 | long rowcounter = mergeSortedFiles(fbw, cmp, distinct, bfbs); 417 | for (File f : files) { 418 | f.delete(); 419 | } 420 | return rowcounter; 421 | } 422 | 423 | /** 424 | * This merges a bunch of temporary flat files 425 | * 426 | * @param files The {@link List} of sorted {@link File}s to be merged. 427 | * @param distinct Pass true if duplicate lines should be 428 | * discarded. 429 | * @param fbw The output {@link BufferedWriter} to merge the results to. 430 | * @param cmp The {@link Comparator} to use to compare 431 | * {@link String}s. 432 | * @param cs The {@link Charset} to be used for the byte to 433 | * character conversion. 434 | * @param usegzip assumes we used gzip compression for temporary files 435 | * @return The number of lines sorted. 
436 | * @throws IOException generic IO exception 437 | * @since v0.1.4 438 | */ 439 | public static long mergeSortedFiles(List files, BufferedWriter fbw, 440 | final Comparator cmp, Charset cs, boolean distinct, 441 | boolean usegzip) throws IOException { 442 | ArrayList bfbs = new ArrayList<>(); 443 | for (File f : files) { 444 | final int BUFFERSIZE = 2048; 445 | if (f.length() == 0) { 446 | continue; 447 | } 448 | InputStream in = new FileInputStream(f); 449 | BufferedReader br; 450 | if (usegzip) { 451 | br = new BufferedReader( 452 | new InputStreamReader( 453 | new GZIPInputStream(in, 454 | BUFFERSIZE), cs)); 455 | } else { 456 | br = new BufferedReader(new InputStreamReader( 457 | in, cs)); 458 | } 459 | 460 | BinaryFileBuffer bfb = new BinaryFileBuffer(br); 461 | bfbs.add(bfb); 462 | } 463 | long numLinesWritten = mergeSortedFiles(fbw, cmp, distinct, bfbs); 464 | for (File f : files) { 465 | f.delete(); 466 | } 467 | return numLinesWritten; 468 | } 469 | 470 | /** 471 | * This sorts a file (input) to an output file (output) using default 472 | * parameters 473 | * 474 | * @param input source file 475 | * 476 | * @param output output file 477 | * @throws IOException generic IO exception 478 | */ 479 | public static void sort(final File input, final File output) 480 | throws IOException { 481 | ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(input), 482 | output); 483 | } 484 | 485 | /** 486 | * This sorts a file (input) to an output file (output) using customized comparator 487 | * 488 | * @param input source file 489 | * 490 | * @param output output file 491 | * 492 | * @param cmp The {@link Comparator} to use to compare 493 | * {@link String}s. 
494 | * @throws IOException generic IO exception 495 | */ 496 | public static void sort(final File input, final File output, final Comparator cmp) 497 | throws IOException { 498 | ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(input, cmp), 499 | output, cmp); 500 | } 501 | 502 | /** 503 | * Sort a list and save it to a temporary file 504 | * 505 | * @return the file containing the sorted data 506 | * @param tmplist data to be sorted 507 | * @param cmp string comparator 508 | * @param cs charset to use for output (can use 509 | * Charset.defaultCharset()) 510 | * @param tmpdirectory location of the temporary files (set to null for 511 | * default location) 512 | * @throws IOException generic IO exception 513 | */ 514 | public static File sortAndSave(List tmplist, 515 | Comparator cmp, Charset cs, File tmpdirectory) 516 | throws IOException { 517 | return sortAndSave(tmplist, cmp, cs, tmpdirectory, false, false, true); 518 | } 519 | 520 | /** 521 | * Sort a list and save it to a temporary file 522 | * 523 | * @return the file containing the sorted data 524 | * @param tmplist data to be sorted 525 | * @param cmp string comparator 526 | * @param cs charset to use for output (can use 527 | * Charset.defaultCharset()) 528 | * @param tmpdirectory location of the temporary files (set to null for 529 | * default location) 530 | * @param distinct Pass true if duplicate lines should be 531 | * discarded. 
532 | * @param usegzip set to true if you are using gzip compression for the 533 | * temporary files 534 | * @param parallel set to true when sorting in parallel 535 | * @throws IOException generic IO exception 536 | */ 537 | public static File sortAndSave(List tmplist, 538 | Comparator cmp, Charset cs, File tmpdirectory, 539 | boolean distinct, boolean usegzip, boolean parallel) throws IOException { 540 | if (parallel) { 541 | tmplist = tmplist.parallelStream().sorted(cmp).collect(Collectors.toCollection(ArrayList::new)); 542 | } else { 543 | Collections.sort(tmplist, cmp); 544 | } 545 | File newtmpfile = File.createTempFile("sortInBatch", 546 | "flatfile", tmpdirectory); 547 | newtmpfile.deleteOnExit(); 548 | OutputStream out = new FileOutputStream(newtmpfile); 549 | int ZIPBUFFERSIZE = 2048; 550 | if (usegzip) { 551 | out = new GZIPOutputStream(out, ZIPBUFFERSIZE) { 552 | { 553 | this.def.setLevel(Deflater.BEST_SPEED); 554 | } 555 | }; 556 | } 557 | try (BufferedWriter fbw = new BufferedWriter(new OutputStreamWriter( 558 | out, cs))) { 559 | if (!distinct) { 560 | for (String r : tmplist) { 561 | fbw.write(r); 562 | fbw.newLine(); 563 | } 564 | } else { 565 | String lastLine = null; 566 | Iterator i = tmplist.iterator(); 567 | if(i.hasNext()) { 568 | lastLine = i.next(); 569 | fbw.write(lastLine); 570 | fbw.newLine(); 571 | } 572 | while (i.hasNext()) { 573 | String r = i.next(); 574 | // Skip duplicate lines 575 | if (cmp.compare(r, lastLine) != 0) { 576 | fbw.write(r); 577 | fbw.newLine(); 578 | lastLine = r; 579 | } 580 | } 581 | } 582 | } 583 | return newtmpfile; 584 | } 585 | 586 | /** 587 | * This will simply load the file by blocks of lines, then sort them 588 | * in-memory, and write the result to temporary files that have to be 589 | * merged later. 
590 | * 591 | * @param fbr data source 592 | * @param datalength estimated data volume (in bytes) 593 | * @return a list of temporary flat files 594 | * @throws IOException generic IO exception 595 | */ 596 | public static List sortInBatch(final BufferedReader fbr, 597 | final long datalength) throws IOException { 598 | return sortInBatch(fbr, datalength, defaultcomparator, 599 | DEFAULTMAXTEMPFILES, estimateAvailableMemory(), 600 | Charset.defaultCharset(), null, false, 0, false, true); 601 | } 602 | 603 | /** 604 | * This will simply load the file by blocks of lines, then sort them 605 | * in-memory, and write the result to temporary files that have to be 606 | * merged later. 607 | * 608 | * @param fbr data source 609 | * @param datalength estimated data volume (in bytes) 610 | * @param cmp string comparator 611 | * @param distinct Pass true if duplicate lines should be 612 | * discarded. 613 | * @return a list of temporary flat files 614 | * @throws IOException generic IO exception 615 | */ 616 | public static List sortInBatch(final BufferedReader fbr, 617 | final long datalength, final Comparator cmp, 618 | final boolean distinct) throws IOException { 619 | return sortInBatch(fbr, datalength, cmp, DEFAULTMAXTEMPFILES, 620 | estimateAvailableMemory(), Charset.defaultCharset(), 621 | null, distinct, 0, false, true); 622 | } 623 | 624 | /** 625 | * This will simply load the file by blocks of lines, then sort them 626 | * in-memory, and write the result to temporary files that have to be 627 | * merged later. 
628 | * 629 | * @param fbr data source 630 | * @param datalength estimated data volume (in bytes) 631 | * @param cmp string comparator 632 | * @param maxtmpfiles maximal number of temporary files 633 | * @param maxMemory maximum amount of memory to use (in bytes) 634 | * @param cs character set to use (can use 635 | * Charset.defaultCharset()) 636 | * @param tmpdirectory location of the temporary files (set to null for 637 | * default location) 638 | * @param distinct Pass true if duplicate lines should be 639 | * discarded. 640 | * @param numHeader number of lines to preclude before sorting starts 641 | * @param usegzip use gzip compression for the temporary files 642 | * @param parallel sort in parallel 643 | * @return a list of temporary flat files 644 | * @throws IOException generic IO exception 645 | */ 646 | public static List sortInBatch(final BufferedReader fbr, 647 | final long datalength, final Comparator cmp, 648 | final int maxtmpfiles, long maxMemory, final Charset cs, 649 | final File tmpdirectory, final boolean distinct, 650 | final int numHeader, final boolean usegzip, final boolean parallel) 651 | throws IOException { 652 | List files = new ArrayList<>(); 653 | long blocksize = estimateBestSizeOfBlocks(datalength, 654 | maxtmpfiles, maxMemory);// in 655 | // bytes 656 | 657 | try { 658 | List tmplist = new ArrayList<>(); 659 | String line = ""; 660 | try { 661 | int counter = 0; 662 | while (line != null) { 663 | long currentblocksize = 0;// in bytes 664 | while ((currentblocksize < blocksize) 665 | && ((line = fbr.readLine()) != null)) { 666 | // as long as you have enough 667 | // memory 668 | if (counter < numHeader) { 669 | counter++; 670 | continue; 671 | } 672 | tmplist.add(line); 673 | currentblocksize += StringSizeEstimator 674 | .estimatedSizeOf(line); 675 | } 676 | files.add(sortAndSave(tmplist, cmp, cs, 677 | tmpdirectory, distinct, usegzip, parallel)); 678 | tmplist.clear(); 679 | } 680 | } catch (EOFException oef) { 681 | if 
(tmplist.size() > 0) { 682 | files.add(sortAndSave(tmplist, cmp, cs, 683 | tmpdirectory, distinct, usegzip, parallel)); 684 | tmplist.clear(); 685 | } 686 | } 687 | } finally { 688 | fbr.close(); 689 | } 690 | return files; 691 | } 692 | 693 | /** 694 | * This will simply load the file by blocks of lines, then sort them 695 | * in-memory, and write the result to temporary files that have to be 696 | * merged later. 697 | * 698 | * @param file some flat file 699 | * @return a list of temporary flat files 700 | * @throws IOException generic IO exception 701 | */ 702 | public static List sortInBatch(File file) throws IOException { 703 | return sortInBatch(file, defaultcomparator); 704 | } 705 | 706 | /** 707 | * This will simply load the file by blocks of lines, then sort them 708 | * in-memory, and write the result to temporary files that have to be 709 | * merged later. 710 | * 711 | * @param file some flat file 712 | * @param cmp string comparator 713 | * @return a list of temporary flat files 714 | * @throws IOException generic IO exception 715 | */ 716 | public static List sortInBatch(File file, Comparator cmp) 717 | throws IOException { 718 | return sortInBatch(file, cmp, false); 719 | } 720 | 721 | /** 722 | * This will simply load the file by blocks of lines, then sort them 723 | * in-memory, and write the result to temporary files that have to be 724 | * merged later. 725 | * 726 | * @param file some flat file 727 | * @param cmp string comparator 728 | * @param distinct Pass true if duplicate lines should be 729 | * discarded. 
730 | * @return a list of temporary flat files 731 | * @throws IOException generic IO exception 732 | */ 733 | public static List<File> sortInBatch(File file, Comparator<String> cmp, 734 | boolean distinct) throws IOException { 735 | return sortInBatch(file, cmp, DEFAULTMAXTEMPFILES, 736 | Charset.defaultCharset(), null, distinct); 737 | } 738 | 739 | /** 740 | * This will simply load the file by blocks of lines, then sort them 741 | * in-memory, and write the result to temporary files that have to be 742 | * merged later. This overload uses DEFAULTMAXTEMPFILES as the bound on the number of temporary 743 | * files that will be created. 744 | * 745 | * @param file some flat file 746 | * @param cmp string comparator 747 | * @param tmpdirectory location of the temporary files (set to null for 748 | * default location) 749 | * @param distinct Pass true if duplicate lines should be 750 | * discarded. 751 | * @param numHeader number of header lines to skip before sorting starts 752 | * @return a list of temporary flat files 753 | * @throws IOException generic IO exception 754 | */ 755 | public static List<File> sortInBatch(File file, Comparator<String> cmp, 756 | File tmpdirectory, 757 | boolean distinct, int numHeader) 758 | throws IOException { 759 | return sortInBatch(file, cmp, DEFAULTMAXTEMPFILES, 760 | Charset.defaultCharset(), tmpdirectory, distinct, 761 | numHeader); 762 | } 763 | 764 | /** 765 | * This will simply load the file by blocks of lines, then sort them 766 | * in-memory, and write the result to temporary files that have to be 767 | * merged later. You can specify a bound on the number of temporary 768 | * files that will be created.
769 | * 770 | * @param file some flat file 771 | * @param cmp string comparator 772 | * @param maxtmpfiles maximal number of temporary files 773 | * @param cs character set to use (can use 774 | * Charset.defaultCharset()) 775 | * @param tmpdirectory location of the temporary files (set to null for 776 | * default location) 777 | * @param distinct Pass true if duplicate lines should be 778 | * discarded. 779 | * @return a list of temporary flat files 780 | * @throws IOException generic IO exception 781 | */ 782 | public static List sortInBatch(File file, Comparator cmp, 783 | int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct) 784 | throws IOException { 785 | return sortInBatch(file, cmp, maxtmpfiles, cs, tmpdirectory, 786 | distinct, 0); 787 | } 788 | 789 | /** 790 | * This will simply load the file by blocks of lines, then sort them 791 | * in-memory, and write the result to temporary files that have to be 792 | * merged later. You can specify a bound on the number of temporary 793 | * files that will be created. 794 | * 795 | * @param file some flat file 796 | * @param cmp string comparator 797 | * @param cs character set to use (can use 798 | * Charset.defaultCharset()) 799 | * @param tmpdirectory location of the temporary files (set to null for 800 | * default location) 801 | * @param distinct Pass true if duplicate lines should be 802 | * discarded. 
803 | * @param numHeader number of lines to preclude before sorting starts 804 | * @return a list of temporary flat files 805 | * @throws IOException generic IO exception 806 | */ 807 | public static List sortInBatch(File file, Comparator cmp, 808 | Charset cs, File tmpdirectory, 809 | boolean distinct, int numHeader) 810 | throws IOException { 811 | BufferedReader fbr = new BufferedReader(new InputStreamReader( 812 | new FileInputStream(file), cs)); 813 | return sortInBatch(fbr, file.length(), cmp, DEFAULTMAXTEMPFILES, 814 | estimateAvailableMemory(), cs, tmpdirectory, distinct, 815 | numHeader, false, true); 816 | } 817 | 818 | /** 819 | * This will simply load the file by blocks of lines, then sort them 820 | * in-memory, and write the result to temporary files that have to be 821 | * merged later. You can specify a bound on the number of temporary 822 | * files that will be created. 823 | * 824 | * @param file some flat file 825 | * @param cmp string comparator 826 | * @param maxtmpfiles maximal number of temporary files 827 | * @param cs character set to use (can use 828 | * Charset.defaultCharset()) 829 | * @param tmpdirectory location of the temporary files (set to null for 830 | * default location) 831 | * @param distinct Pass true if duplicate lines should be 832 | * discarded. 
833 | * @param numHeader number of lines to preclude before sorting starts 834 | * @return a list of temporary flat files 835 | * @throws IOException generic IO exception 836 | */ 837 | public static List sortInBatch(File file, Comparator cmp, 838 | int maxtmpfiles, Charset cs, File tmpdirectory, 839 | boolean distinct, int numHeader) 840 | throws IOException { 841 | BufferedReader fbr = new BufferedReader(new InputStreamReader( 842 | new FileInputStream(file), cs)); 843 | return sortInBatch(fbr, file.length(), cmp, maxtmpfiles, 844 | estimateAvailableMemory(), cs, tmpdirectory, distinct, 845 | numHeader, false, true); 846 | } 847 | 848 | /** 849 | * This will simply load the file by blocks of lines, then sort them 850 | * in-memory, and write the result to temporary files that have to be 851 | * merged later. You can specify a bound on the number of temporary 852 | * files that will be created. 853 | * 854 | * @param file some flat file 855 | * @param cmp string comparator 856 | * @param maxtmpfiles maximal number of temporary files 857 | * @param cs character set to use (can use 858 | * Charset.defaultCharset()) 859 | * @param tmpdirectory location of the temporary files (set to null for 860 | * default location) 861 | * @param distinct Pass true if duplicate lines should be 862 | * discarded. 
863 | * @param numHeader number of lines to preclude before sorting starts 864 | * @param usegzip use gzip compression for the temporary files 865 | * @return a list of temporary flat files 866 | * @throws IOException generic IO exception 867 | */ 868 | public static List sortInBatch(File file, Comparator cmp, 869 | int maxtmpfiles, Charset cs, File tmpdirectory, 870 | boolean distinct, int numHeader, boolean usegzip) 871 | throws IOException { 872 | BufferedReader fbr = new BufferedReader(new InputStreamReader( 873 | new FileInputStream(file), cs)); 874 | return sortInBatch(fbr, file.length(), cmp, maxtmpfiles, 875 | estimateAvailableMemory(), cs, tmpdirectory, distinct, 876 | numHeader, usegzip, true); 877 | } 878 | 879 | /** 880 | * This will simply load the file by blocks of lines, then sort them 881 | * in-memory, and write the result to temporary files that have to be 882 | * merged later. You can specify a bound on the number of temporary 883 | * files that will be created. 884 | * 885 | * @param file some flat file 886 | * @param cmp string comparator 887 | * @param maxtmpfiles maximal number of temporary files 888 | * @param cs character set to use (can use 889 | * Charset.defaultCharset()) 890 | * @param tmpdirectory location of the temporary files (set to null for 891 | * default location) 892 | * @param distinct Pass true if duplicate lines should be 893 | * discarded. 
894 | * @param numHeader number of lines to preclude before sorting starts 895 | * @param usegzip use gzip compression for the temporary files 896 | * @param parallel whether to sort in parallel 897 | * @return a list of temporary flat files 898 | * @throws IOException generic IO exception 899 | */ 900 | public static List sortInBatch(File file, Comparator cmp, 901 | int maxtmpfiles, Charset cs, File tmpdirectory, 902 | boolean distinct, int numHeader, boolean usegzip, boolean parallel) 903 | throws IOException { 904 | BufferedReader fbr = new BufferedReader(new InputStreamReader( 905 | new FileInputStream(file), cs)); 906 | return sortInBatch(fbr, file.length(), cmp, maxtmpfiles, 907 | estimateAvailableMemory(), cs, tmpdirectory, distinct, 908 | numHeader, usegzip, parallel); 909 | } 910 | 911 | /** 912 | * default comparator between strings. 913 | */ 914 | public static Comparator defaultcomparator = new Comparator() { 915 | @Override 916 | public int compare(String r1, String r2) { 917 | return r1.compareTo(r2); 918 | } 919 | }; 920 | 921 | /** 922 | * Default maximal number of temporary files allowed. 923 | */ 924 | public static final int DEFAULTMAXTEMPFILES = 1024; 925 | 926 | } 927 | -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/IOStringStack.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting; 2 | 3 | import java.io.IOException; 4 | 5 | /** 6 | * General interface to abstract away BinaryFileBuffer 7 | * so that users of the library can roll their own. 
8 | */ 9 | public interface IOStringStack { 10 | public void close() throws IOException; 11 | 12 | public boolean empty(); 13 | 14 | public String peek(); 15 | 16 | public String pop() throws IOException; 17 | 18 | } -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/StringSizeEstimator.java: -------------------------------------------------------------------------------- 1 | /** 2 | * 3 | */ 4 | package com.google.code.externalsorting; 5 | 6 | /** 7 | * Simple class used to estimate memory usage. 8 | * 9 | * @author Eleftherios Chetzakis 10 | * 11 | */ 12 | public final class StringSizeEstimator { 13 | 14 | private static int OBJ_HEADER; 15 | private static int ARR_HEADER; 16 | private static int INT_FIELDS = 12; 17 | private static int OBJ_REF; 18 | private static int OBJ_OVERHEAD; 19 | private static boolean IS_64_BIT_JVM; 20 | 21 | /** 22 | * Private constructor to prevent instantiation. 23 | */ 24 | private StringSizeEstimator() { 25 | } 26 | 27 | /** 28 | * Class initializations. 29 | */ 30 | static { 31 | // By default we assume 64 bit JVM 32 | // (defensive approach since we will get 33 | // larger estimations in case we are not sure) 34 | IS_64_BIT_JVM = true; 35 | // check the system property "sun.arch.data.model" 36 | // not very safe, as it might not work for all JVM implementations 37 | // nevertheless the worst thing that might happen is that the JVM is 32bit 38 | // but we assume its 64bit, so we will be counting a few extra bytes per string object 39 | // no harm done here since this is just an approximation. 
40 | String arch = System.getProperty("sun.arch.data.model"); 41 | if (arch != null) { 42 | if (arch.contains("32")) { 43 | // If exists and is 32 bit then we assume a 32bit JVM 44 | IS_64_BIT_JVM = false; 45 | } 46 | } 47 | // The sizes below are a bit rough as we don't take into account 48 | // advanced JVM options such as compressed oops 49 | // however if our calculation is not accurate it'll be a bit over 50 | // so there is no danger of an out of memory error because of this. 51 | OBJ_HEADER = IS_64_BIT_JVM ? 16 : 8; 52 | ARR_HEADER = IS_64_BIT_JVM ? 24 : 12; 53 | OBJ_REF = IS_64_BIT_JVM ? 8 : 4; 54 | OBJ_OVERHEAD = OBJ_HEADER + INT_FIELDS + OBJ_REF + ARR_HEADER; 55 | 56 | } 57 | 58 | /** 59 | * Estimates the size of a {@link String} object in bytes. 60 | * 61 | * This function was designed with the following goals in mind (in order of importance) : 62 | * 63 | * First goal is speed: this function is called repeatedly and it should 64 | * execute in not much more than a nanosecond. 65 | * 66 | * Second goal is to never underestimate (as it would lead to memory shortage and a crash). 67 | * 68 | * Third goal is to never overestimate too much (say within a factor of two), as it would 69 | * mean that we are leaving much of the RAM underutilized. 70 | * 71 | * @param s The string to estimate memory footprint. 72 | * @return The estimated size in bytes. 
73 | */ 74 | public static long estimatedSizeOf(String s) { 75 | return (s.length() * 2) + OBJ_OVERHEAD; 76 | } 77 | 78 | } 79 | -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/csv/CSVRecordBuffer.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting.csv; 2 | 3 | import java.io.IOException; 4 | import java.util.Iterator; 5 | 6 | import org.apache.commons.csv.CSVParser; 7 | import org.apache.commons.csv.CSVRecord; 8 | 9 | public class CSVRecordBuffer { 10 | 11 | private Iterator<CSVRecord> iterator; 12 | 13 | private CSVParser parser; 14 | 15 | private CSVRecord cache; 16 | 17 | public CSVRecordBuffer(CSVParser parser) throws IOException, ClassNotFoundException { 18 | this.iterator = parser.iterator(); 19 | this.parser = parser; 20 | reload(); 21 | } 22 | 23 | public void close() throws IOException { 24 | this.parser.close(); 25 | } 26 | 27 | public boolean empty() { 28 | return this.cache == null; 29 | } 30 | 31 | public CSVRecord peek() { 32 | return this.cache; 33 | } 34 | 35 | // Return the cached record and advance to the next one 36 | public CSVRecord pop() throws IOException, ClassNotFoundException { 37 | CSVRecord answer = peek();// keep a reference to the current record 38 | reload(); 39 | return answer; 40 | } 41 | 42 | // Get the next in line 43 | private void reload() throws IOException, ClassNotFoundException { 44 | this.cache = this.iterator.hasNext() ?
this.iterator.next() : null; 45 | } 46 | } 47 | -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/csv/CsvExternalSort.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting.csv; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.BufferedWriter; 5 | import java.io.File; 6 | import java.io.FileInputStream; 7 | import java.io.FileOutputStream; 8 | import java.io.IOException; 9 | import java.io.InputStream; 10 | import java.io.InputStreamReader; 11 | import java.io.OutputStreamWriter; 12 | import java.io.Writer; 13 | import java.util.ArrayList; 14 | import java.util.Collections; 15 | import java.util.Comparator; 16 | import java.util.List; 17 | import java.util.PriorityQueue; 18 | import java.util.concurrent.atomic.AtomicLong; 19 | import java.util.logging.Level; 20 | import java.util.logging.Logger; 21 | 22 | import org.apache.commons.csv.CSVFormat; 23 | import org.apache.commons.csv.CSVParser; 24 | import org.apache.commons.csv.CSVPrinter; 25 | import org.apache.commons.csv.CSVRecord; 26 | 27 | public class CsvExternalSort { 28 | 29 | private static final Logger LOG = Logger.getLogger(CsvExternalSort.class.getName()); 30 | 31 | private CsvExternalSort() { 32 | throw new UnsupportedOperationException("Unable to instantiate utility class"); 33 | } 34 | 35 | /** 36 | * This method calls the garbage collector and then returns the free memory. 37 | * This avoids problems with applications where the GC hasn't reclaimed memory 38 | * and reports no available memory. 39 | * 40 | * @return available memory 41 | */ 42 | public static long estimateAvailableMemory() { 43 | System.gc(); 44 | return Runtime.getRuntime().freeMemory(); 45 | } 46 | 47 | /** 48 | * we divide the file into small blocks. If the blocks are too small, we shall 49 | * create too many temporary files. 
If they are too big, we shall be using too 50 | * much memory. 51 | * 52 | * @param sizeoffile how much data (in bytes) can we expect 53 | * @param maxtmpfiles how many temporary files can we create (e.g., 1024) 54 | * @param maxMemory Maximum memory to use (in bytes) 55 | * @return the estimate 56 | */ 57 | public static long estimateBestSizeOfBlocks(final long sizeoffile, final int maxtmpfiles, final long maxMemory) { 58 | // we don't want to open up much more than maxtmpfiles temporary 59 | // files, better run 60 | // out of memory first. 61 | long blocksize = sizeoffile / maxtmpfiles + (sizeoffile % maxtmpfiles == 0 ? 0 : 1); 62 | 63 | // on the other hand, we don't want to create many temporary 64 | // files 65 | // for naught. If blocksize is smaller than one sixth of 66 | // maxMemory, grow it. 67 | if (blocksize < maxMemory / 6) { 68 | blocksize = maxMemory / 6; 69 | } 70 | return blocksize; 71 | } 72 | 73 | public static int mergeSortedFiles(BufferedWriter fbw, final CsvSortOptions sortOptions, List<CSVRecordBuffer> bfbs, List<CSVRecord> header) 74 | throws IOException, ClassNotFoundException { 75 | PriorityQueue<CSVRecordBuffer> pq = new PriorityQueue<CSVRecordBuffer>(11, new Comparator<CSVRecordBuffer>() { 76 | @Override 77 | public int compare(CSVRecordBuffer i, CSVRecordBuffer j) { 78 | return sortOptions.getComparator().compare(i.peek(), j.peek()); 79 | } 80 | }); 81 | for (CSVRecordBuffer bfb : bfbs) 82 | if (!bfb.empty()) 83 | pq.add(bfb); 84 | int numWrittenLines = 0; 85 | CSVPrinter printer = new CSVPrinter(fbw, sortOptions.getFormat()); 86 | if(!
sortOptions.isSkipHeader()) { 87 | for(CSVRecord r: header) { 88 | printer.printRecord(r); 89 | } 90 | } 91 | CSVRecord lastLine = null; 92 | try { 93 | while (pq.size() > 0) { 94 | CSVRecordBuffer bfb = pq.poll(); 95 | CSVRecord r = bfb.pop(); 96 | // Skip duplicate lines 97 | if (sortOptions.isDistinct() && checkDuplicateLine(r, lastLine)) { 98 | } else { 99 | printer.printRecord(r); 100 | lastLine = r; 101 | ++numWrittenLines; 102 | } 103 | if (bfb.empty()) { 104 | bfb.close(); 105 | } else { 106 | pq.add(bfb); // add it back 107 | } 108 | } 109 | } finally { 110 | printer.close(); 111 | fbw.close(); 112 | for (CSVRecordBuffer bfb : pq) 113 | bfb.close(); 114 | } 115 | 116 | return numWrittenLines; 117 | } 118 | 119 | public static int mergeSortedFiles(List files, File outputfile, final CsvSortOptions sortOptions, 120 | boolean append, List header) throws IOException, ClassNotFoundException { 121 | 122 | List bfbs = new ArrayList(); 123 | for (File f : files) { 124 | InputStream in = new FileInputStream(f); 125 | BufferedReader fbr = new BufferedReader(new InputStreamReader(in, sortOptions.getCharset())); 126 | CSVParser parser = new CSVParser(fbr, sortOptions.getFormat()); 127 | CSVRecordBuffer bfb = new CSVRecordBuffer(parser); 128 | bfbs.add(bfb); 129 | } 130 | 131 | BufferedWriter fbw = new BufferedWriter( 132 | new OutputStreamWriter(new FileOutputStream(outputfile, append), sortOptions.getCharset())); 133 | 134 | int numWrittenLines = mergeSortedFiles(fbw, sortOptions, bfbs, header); 135 | for (File f : files) { 136 | if (!f.delete()) { 137 | LOG.log(Level.WARNING, String.format("The file %s was not deleted", f.getName())); 138 | } 139 | } 140 | 141 | return numWrittenLines; 142 | } 143 | 144 | public static List sortInBatch(long size_in_byte, final BufferedReader fbr, final File tmpdirectory, 145 | final CsvSortOptions sortOptions, List header) throws IOException { 146 | 147 | List files = new ArrayList(); 148 | long blocksize = 
estimateBestSizeOfBlocks(size_in_byte, sortOptions.getMaxTmpFiles(), 149 | sortOptions.getMaxMemory());// in 150 | // bytes 151 | AtomicLong currentBlock = new AtomicLong(0); 152 | List tmplist = new ArrayList(); 153 | 154 | try (CSVParser parser = new CSVParser(fbr, sortOptions.getFormat())) { 155 | parser.spliterator().forEachRemaining(e -> { 156 | if (e.getRecordNumber() <= sortOptions.getNumHeader()) { 157 | header.add(e); 158 | } else { 159 | tmplist.add(e); 160 | currentBlock.addAndGet(SizeEstimator.estimatedSizeOf(e)); 161 | } 162 | if (currentBlock.get() >= blocksize) { 163 | try { 164 | files.add(sortAndSave(tmplist, tmpdirectory, sortOptions)); 165 | } catch (IOException e1) { 166 | LOG.log(Level.WARNING, String.format("Error during the sort in batch"), e1); 167 | } 168 | tmplist.clear(); 169 | currentBlock.getAndSet(0); 170 | } 171 | }); 172 | } 173 | if (!tmplist.isEmpty()) { 174 | files.add(sortAndSave(tmplist, tmpdirectory, sortOptions)); 175 | } 176 | 177 | return files; 178 | } 179 | 180 | public static File sortAndSave(List tmplist, File tmpdirectory, final CsvSortOptions sortOptions) throws IOException { 181 | Collections.sort(tmplist, sortOptions.getComparator()); 182 | File newtmpfile = File.createTempFile("sortInBatch", "flatfile", tmpdirectory); 183 | newtmpfile.deleteOnExit(); 184 | 185 | CSVRecord lastLine = null; 186 | try (Writer writer = new OutputStreamWriter(new FileOutputStream(newtmpfile), sortOptions.getCharset()); 187 | CSVPrinter printer = new CSVPrinter(new BufferedWriter(writer), sortOptions.getFormat());) { 188 | for (CSVRecord r : tmplist) { 189 | // Skip duplicate lines 190 | if (sortOptions.isDistinct() && checkDuplicateLine(r, lastLine)) { 191 | } else { 192 | printer.printRecord(r); 193 | lastLine = r; 194 | } 195 | } 196 | } 197 | 198 | return newtmpfile; 199 | } 200 | 201 | private static boolean checkDuplicateLine(CSVRecord currentLine, CSVRecord lastLine) { 202 | if (lastLine == null || currentLine == null) { 203 | 
return false; 204 | } 205 | if (currentLine.size() != lastLine.size()) { return false; } // records of different widths cannot be duplicates 206 | for (int i = 0; i < currentLine.size(); i++) { 207 | if (!currentLine.get(i).equals(lastLine.get(i))) { 208 | return false; 209 | } 210 | } 211 | return true; 212 | } 213 | 214 | public static List<File> sortInBatch(File file, File tmpdirectory, final CsvSortOptions sortOptions, List<CSVRecord> header) 215 | throws IOException { 216 | try (BufferedReader fbr = new BufferedReader( 217 | new InputStreamReader(new FileInputStream(file), sortOptions.getCharset()))) { 218 | return sortInBatch(file.length(), fbr, tmpdirectory, sortOptions, header); 219 | } 220 | } 221 | 222 | /** 223 | * Default maximal number of temporary files allowed. 224 | */ 225 | public static final int DEFAULTMAXTEMPFILES = 1024; 226 | 227 | } 228 | -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/csv/CsvSortOptions.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting.csv; 2 | 3 | import org.apache.commons.csv.CSVFormat; 4 | import org.apache.commons.csv.CSVRecord; 5 | 6 | import java.nio.charset.Charset; 7 | import java.util.Comparator; 8 | 9 | /** 10 | * Parameters for csv sorting 11 | */ 12 | public class CsvSortOptions { 13 | private final Comparator<CSVRecord> comparator; 14 | private final int maxTmpFiles; 15 | private final long maxMemory; 16 | private final Charset charset; 17 | 18 | private final boolean distinct; 19 | private final int numHeader; //number of header rows in the input file 20 | private final boolean skipHeader; //if true, the header is not written to the output file 21 | private final CSVFormat format; 22 | 23 | public Comparator<CSVRecord> getComparator() { 24 | return comparator; 25 | } 26 | 27 | public int getMaxTmpFiles() { 28 | return maxTmpFiles; 29 | } 30 | 31 | public long getMaxMemory() { 32 | return maxMemory; 33 | } 34 | 35 | public Charset getCharset() { 36 | return charset; 37 | } 38 | 39 | public boolean isDistinct() { 40 | return
distinct; 41 | } 42 | 43 | public int getNumHeader() { 44 | return numHeader; 45 | } 46 | 47 | public boolean isSkipHeader() { 48 | return skipHeader; 49 | } 50 | 51 | public CSVFormat getFormat() { 52 | return format; 53 | } 54 | 55 | public static class Builder { 56 | //mandatory params 57 | private final Comparator<CSVRecord> cmp; 58 | private final int maxTmpFiles; 59 | private final long maxMemory; 60 | 61 | //optional params with default values 62 | private Charset cs = Charset.defaultCharset(); 63 | private boolean distinct = false; 64 | private int numHeader = 0; 65 | private boolean skipHeader = true; 66 | private CSVFormat format = CSVFormat.DEFAULT; 67 | 68 | public Builder(Comparator<CSVRecord> cmp, int maxTmpFiles, long maxMemory) { 69 | this.cmp = cmp; 70 | this.maxTmpFiles = maxTmpFiles; 71 | this.maxMemory = maxMemory; 72 | } 73 | 74 | public Builder charset(Charset value){ 75 | cs = value; 76 | return this; 77 | } 78 | 79 | public Builder distinct(boolean value){ 80 | distinct = value; 81 | return this; 82 | } 83 | 84 | public Builder numHeader(int value){ 85 | numHeader = value; 86 | return this; 87 | } 88 | 89 | public Builder skipHeader(boolean value){ 90 | skipHeader = value; 91 | return this; 92 | } 93 | 94 | public Builder format(CSVFormat value){ 95 | format = value; 96 | return this; 97 | } 98 | 99 | 100 | public CsvSortOptions build(){ 101 | return new CsvSortOptions(this); 102 | } 103 | } 104 | 105 | private CsvSortOptions(Builder builder){ 106 | this.comparator = builder.cmp; 107 | this.maxTmpFiles = builder.maxTmpFiles; 108 | this.maxMemory = builder.maxMemory; 109 | this.charset = builder.cs; 110 | this.distinct = builder.distinct; 111 | this.numHeader = builder.numHeader; 112 | this.skipHeader = builder.skipHeader; 113 | this.format = builder.format; 114 | } 115 | 116 | } 117 | -------------------------------------------------------------------------------- /src/main/java/com/google/code/externalsorting/csv/SizeEstimator.java:
-------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting.csv; 2 | 3 | public final class SizeEstimator { 4 | 5 | private static int OBJ_HEADER; 6 | private static int ARR_HEADER; 7 | private static int INT_FIELDS = 12; 8 | private static int OBJ_REF; 9 | private static int OBJ_OVERHEAD; 10 | private static boolean IS_64_BIT_JVM; 11 | 12 | private SizeEstimator() { 13 | 14 | } 15 | 16 | /** 17 | * Class initializations. 18 | */ 19 | static { 20 | // By default we assume 64 bit JVM 21 | // (defensive approach since we will get 22 | // larger estimations in case we are not sure) 23 | IS_64_BIT_JVM = true; 24 | // check the system property "sun.arch.data.model" 25 | // not very safe, as it might not work for all JVM implementations 26 | // nevertheless the worst thing that might happen is that the JVM is 32bit 27 | // but we assume its 64bit, so we will be counting a few extra bytes per string object 28 | // no harm done here since this is just an approximation. 29 | String arch = System.getProperty("sun.arch.data.model"); 30 | if (arch != null) { 31 | if (arch.indexOf("32") != -1) { 32 | // If exists and is 32 bit then we assume a 32bit JVM 33 | IS_64_BIT_JVM = false; 34 | } 35 | } 36 | // The sizes below are a bit rough as we don't take into account 37 | // advanced JVM options such as compressed oops 38 | // however if our calculation is not accurate it'll be a bit over 39 | // so there is no danger of an out of memory error because of this. 40 | OBJ_HEADER = IS_64_BIT_JVM ? 16 : 8; 41 | ARR_HEADER = IS_64_BIT_JVM ? 24 : 12; 42 | OBJ_REF = IS_64_BIT_JVM ? 8 : 4; 43 | OBJ_OVERHEAD = OBJ_HEADER + INT_FIELDS + OBJ_REF + ARR_HEADER; 44 | 45 | } 46 | 47 | /** 48 | * Estimates the size of a object in bytes. 49 | * 50 | * @param s The string to estimate memory footprint. 51 | * @return The estimated size in bytes. 
52 | */ 53 | public static long estimatedSizeOf(Object s) { 54 | return ((long) (s.toString().length() * 2) + OBJ_OVERHEAD); 55 | } 56 | } 57 | -------------------------------------------------------------------------------- /src/test/java/com/google/code/externalsorting/ExternalSortTest.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting; 2 | 3 | import static com.google.code.externalsorting.ExternalSort.defaultcomparator; 4 | import static org.junit.Assert.assertArrayEquals; 5 | import static org.junit.Assert.assertEquals; 6 | import static org.junit.Assert.assertNotNull; 7 | import static org.junit.Assert.assertTrue; 8 | import static org.junit.Assert.assertFalse; 9 | 10 | import java.io.*; 11 | import java.nio.channels.FileChannel; 12 | import java.nio.charset.Charset; 13 | import java.nio.charset.StandardCharsets; 14 | import java.nio.file.Files; 15 | import java.nio.file.Path; 16 | import java.nio.file.StandardOpenOption; 17 | import java.util.*; 18 | import java.util.concurrent.atomic.AtomicBoolean; 19 | import java.util.stream.Collectors; 20 | import java.util.stream.IntStream; 21 | 22 | import org.junit.After; 23 | import org.junit.Before; 24 | import org.junit.Ignore; 25 | import org.junit.Test; 26 | import org.github.jamm.*; 27 | 28 | /** 29 | * Unit test for simple App. 
30 | */ 31 | @SuppressWarnings({"static-method","javadoc"}) 32 | public class ExternalSortTest { 33 | private static final String TEST_FILE1_TXT = "test-file-1.txt"; 34 | private static final String TEST_FILE2_TXT = "test-file-2.txt"; 35 | private static final String TEST_FILE1_CSV = "test-file-1.csv"; 36 | private static final String[] EXPECTED_SORT_RESULTS = { "a", "b", "b", "e", "f", 37 | "i", "m", "o", "u", "u", "x", "y", "z" 38 | }; 39 | private static final String[] EXPECTED_MERGE_RESULTS = {"a", "a", "b", "c", "c", "d", "e", "e", "f", "g", "g","h", "i", "j", "k"}; 40 | private static final String[] EXPECTED_MERGE_DISTINCT_RESULTS = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"}; 41 | private static final String[] EXPECTED_HEADER_RESULTS = {"HEADER, HEADER", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"}; 42 | private static final String[] EXPECTED_DISTINCT_RESULTS = { "a", "b", "e", 43 | "f", "i", "m", "o", "u", "x", "y", "z" 44 | }; 45 | private static final String[] SAMPLE = { "f", "m", "b", "e", "i", "o", "u", 46 | "x", "a", "y", "z", "b", "u" 47 | }; 48 | 49 | private File file1; 50 | private File file2; 51 | private File csvFile; 52 | private List fileList; 53 | 54 | /** 55 | * @throws Exception 56 | */ 57 | @Before 58 | public void setUp() throws Exception { 59 | this.fileList = new ArrayList(3); 60 | this.file1 = new File(this.getClass().getClassLoader() 61 | .getResource(TEST_FILE1_TXT).toURI()); 62 | this.file2 = new File(this.getClass().getClassLoader() 63 | .getResource(TEST_FILE2_TXT).toURI()); 64 | this.csvFile = new File(this.getClass().getClassLoader() 65 | .getResource(TEST_FILE1_CSV).toURI()); 66 | 67 | File tmpFile1 = new File(this.file1.getPath().toString()+".tmp"); 68 | File tmpFile2 = new File(this.file2.getPath().toString()+".tmp"); 69 | 70 | copyFile(this.file1, tmpFile1); 71 | copyFile(this.file2, tmpFile2); 72 | 73 | this.fileList.add(tmpFile1); 74 | this.fileList.add(tmpFile2); 75 | } 76 | 77 | /** 78 | * @throws 
Exception 79 | */ 80 | @After 81 | public void tearDown() throws Exception { 82 | this.file1 = null; 83 | this.file2 = null; 84 | this.csvFile = null; 85 | for(File f:this.fileList) { 86 | f.delete(); 87 | } 88 | this.fileList.clear(); 89 | this.fileList = null; 90 | } 91 | 92 | private static void copyFile(File sourceFile, File destFile) throws IOException { 93 | if (!destFile.exists()) { 94 | destFile.createNewFile(); 95 | } 96 | 97 | try (FileInputStream fis = new FileInputStream(sourceFile); 98 | FileChannel source = fis.getChannel(); 99 | FileOutputStream fos = new FileOutputStream(destFile); 100 | FileChannel destination = fos.getChannel()) { 101 | destination.transferFrom(source, 0, source.size()); 102 | } 103 | 104 | } 105 | 106 | public static int estimateTotalSize(String[] mystrings) { 107 | int total = 0; 108 | for (String s : mystrings) { 109 | total += StringSizeEstimator.estimatedSizeOf(s); 110 | } 111 | return total; 112 | } 113 | 114 | public static void oneRoundOfStringSizeEstimation() { 115 | // could use JMH for better results but this should do 116 | final int N = 1024; 117 | String [] mystrings = new String[1024]; 118 | for(int k = 0; k < N ; ++k ) { 119 | mystrings[k] = Integer.toString(k); 120 | } 121 | final int repeat = 1000; 122 | long bef, aft, diff; 123 | long bestdiff = Long.MAX_VALUE; 124 | int bogus = 0; 125 | for(int t = 0 ; t < repeat; ++t ) { 126 | bef = System.nanoTime(); 127 | bogus += estimateTotalSize(mystrings); 128 | aft = System.nanoTime(); 129 | diff = aft - bef; 130 | if(diff < bestdiff) bestdiff = diff; 131 | } 132 | System.out.println("#ignore = "+bogus); 133 | System.out.println("[performance] String size estimator uses "+bestdiff * 1.0 / N + " ns per string"); 134 | } 135 | 136 | @Test 137 | public void stringSizeEstimator() { 138 | for(int k = 0; k < 10; ++k) { 139 | oneRoundOfStringSizeEstimation(); 140 | } 141 | } 142 | 143 | @Test 144 | public void displayTest() throws Exception { 145 | ExternalSort.main(new 
String[]{}); // check that it does not crash 146 | } 147 | 148 | 149 | @Test 150 | public void mainTest() throws Exception { 151 | ExternalSort.main(new String[]{"-h"}); // check that it does not crash 152 | ExternalSort.main(new String[]{""});// check that it does not crash 153 | ExternalSort.main(new String[]{"-v"}); // check that it does not crash 154 | File f1 = File.createTempFile("tmp", "unit"); 155 | File f2 = File.createTempFile("tmp", "unit"); 156 | f1.deleteOnExit(); 157 | f2.deleteOnExit(); 158 | writeStringToFile(f1, "oh"); 159 | ExternalSort.main(new String[]{"-v","-d","-t","5000","-c","ascii","-z","-H","1","-s",".",f1.toString(),f2.toString()}); 160 | } 161 | 162 | @Test 163 | public void testEmptyFiles() throws Exception { 164 | File f1 = File.createTempFile("tmp", "unit"); 165 | File f2 = File.createTempFile("tmp", "unit"); 166 | f1.deleteOnExit(); 167 | f2.deleteOnExit(); 168 | ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(f1),f2); 169 | if (f2.length() != 0) throw new RuntimeException("empty files should end up empty"); 170 | } 171 | 172 | @Test 173 | public void testMergeSortedFiles() throws Exception { 174 | String line; 175 | 176 | Comparator<String> cmp = new Comparator<String>() { 177 | @Override 178 | public int compare(String o1, String o2) { 179 | return o1.compareTo(o2); 180 | } 181 | }; 182 | File out = File.createTempFile("test_results", ".tmp", null); 183 | out.deleteOnExit(); 184 | ExternalSort.mergeSortedFiles(this.fileList, out, cmp, 185 | Charset.defaultCharset(), false); 186 | 187 | List<String> result = new ArrayList<>(); 188 | try (BufferedReader bf = new BufferedReader(new FileReader(out))) { 189 | while ((line = bf.readLine()) != null) { 190 | result.add(line); 191 | } 192 | } 193 | assertArrayEquals(Arrays.toString(result.toArray()), EXPECTED_MERGE_RESULTS, 194 | result.toArray()); 195 | } 196 | 197 | @Test 198 | public void testMergeSortedFiles_Distinct() throws Exception { 199 | String line; 200 | 201 | 202 | Comparator<String> cmp = new 
Comparator<String>() { 203 | @Override 204 | public int compare(String o1, String o2) { 205 | return o1.compareTo(o2); 206 | } 207 | }; 208 | File out = File.createTempFile("test_results", ".tmp", null); 209 | out.deleteOnExit(); 210 | long numLinesWritten = ExternalSort.mergeSortedFiles(this.fileList, out, cmp, 211 | Charset.defaultCharset(), true); 212 | 213 | List<String> result = new ArrayList<>(); 214 | try (BufferedReader bf = new BufferedReader(new FileReader(out))) { 215 | while ((line = bf.readLine()) != null) { 216 | result.add(line); 217 | } 218 | } 219 | 220 | assertEquals(11, numLinesWritten); 221 | assertArrayEquals(Arrays.toString(result.toArray()), EXPECTED_MERGE_DISTINCT_RESULTS, 222 | result.toArray()); 223 | } 224 | 225 | @Test 226 | public void testMergeSortedFiles_Append() throws Exception { 227 | String line; 228 | 229 | Comparator<String> cmp = new Comparator<String>() { 230 | @Override 231 | public int compare(String o1, String o2) 232 | { 233 | return o1.compareTo(o2); 234 | } 235 | }; 236 | 237 | File out = File.createTempFile("test_results", ".tmp", null); 238 | out.deleteOnExit(); 239 | writeStringToFile(out, "HEADER, HEADER\n"); 240 | 241 | ExternalSort.mergeSortedFiles(this.fileList, out, cmp, Charset.defaultCharset(), true, true, false); 242 | 243 | List<String> result = new ArrayList<>(); 244 | try (BufferedReader bf = new BufferedReader(new FileReader(out))) { 245 | while ((line = bf.readLine()) != null) { 246 | result.add(line); 247 | } 248 | } 249 | assertArrayEquals(Arrays.toString(result.toArray()), EXPECTED_HEADER_RESULTS, result.toArray()); 250 | } 251 | 252 | @Test 253 | public void testSortAndSave() throws Exception { 254 | File f; 255 | String line; 256 | 257 | List<String> sample = Arrays.asList(SAMPLE); 258 | Comparator<String> cmp = new Comparator<String>() { 259 | @Override 260 | public int compare(String o1, String o2) { 261 | return o1.compareTo(o2); 262 | } 263 | }; 264 | f = ExternalSort.sortAndSave(sample, cmp, Charset.defaultCharset(), 265 | null, false, false, true); 266 | 
assertNotNull(f); 267 | assertTrue(f.exists()); 268 | assertTrue(f.length() > 0); 269 | List<String> result = new ArrayList<>(); 270 | try (BufferedReader bf = new BufferedReader(new FileReader(f))) { 271 | while ((line = bf.readLine()) != null) { 272 | result.add(line); 273 | } 274 | } 275 | assertArrayEquals(Arrays.toString(result.toArray()), EXPECTED_SORT_RESULTS, 276 | result.toArray()); 277 | } 278 | 279 | @Test 280 | public void testSortAndSave_Distinct() throws Exception { 281 | File f; 282 | String line; 283 | 284 | BufferedReader bf; 285 | List<String> sample = Arrays.asList(SAMPLE); 286 | Comparator<String> cmp = new Comparator<String>() { 287 | @Override 288 | public int compare(String o1, String o2) { 289 | return o1.compareTo(o2); 290 | } 291 | }; 292 | 293 | f = ExternalSort.sortAndSave(sample, cmp, Charset.defaultCharset(), 294 | null, true, false, true); 295 | assertNotNull(f); 296 | assertTrue(f.exists()); 297 | assertTrue(f.length() > 0); 298 | bf = new BufferedReader(new FileReader(f)); 299 | 300 | List<String> result = new ArrayList<>(); 301 | while ((line = bf.readLine()) != null) { 302 | result.add(line); 303 | } 304 | bf.close(); 305 | assertArrayEquals(Arrays.toString(result.toArray()), 306 | EXPECTED_DISTINCT_RESULTS, result.toArray()); 307 | } 308 | 309 | @Test 310 | public void testSortInBatch() throws Exception { 311 | Comparator<String> cmp = new Comparator<String>() { 312 | @Override 313 | public int compare(String o1, String o2) { 314 | return o1.compareTo(o2); 315 | } 316 | }; 317 | 318 | List<File> listOfFiles = ExternalSort.sortInBatch(this.csvFile, cmp, ExternalSort.DEFAULTMAXTEMPFILES, Charset.defaultCharset(), null, false, 1, false, true); 319 | assertEquals(1, listOfFiles.size()); 320 | 321 | ArrayList<String> result = readLines(listOfFiles.get(0)); 322 | assertArrayEquals(Arrays.toString(result.toArray()),EXPECTED_MERGE_DISTINCT_RESULTS, result.toArray()); 323 | } 324 | 325 | /** 326 | * Sample case to sort csv file. 
327 | * @throws Exception 328 | * 329 | */ 330 | @Test 331 | public void testCSVSorting() throws Exception { 332 | testCSVSortingWithParams(false); 333 | testCSVSortingWithParams(true); 334 | } 335 | 336 | /** 337 | * Sample case to sort csv file. 338 | * @param usegzip use compression for temporary files 339 | * @throws Exception 340 | * 341 | */ 342 | public void testCSVSortingWithParams(boolean usegzip) throws Exception { 343 | 344 | File out = File.createTempFile("test_results", ".tmp", null); 345 | out.deleteOnExit(); 346 | Comparator<String> cmp = new Comparator<String>() { 347 | @Override 348 | public int compare(String o1, String o2) 349 | { 350 | return o1.compareTo(o2); 351 | } 352 | }; 353 | 354 | String head; 355 | try ( // read header 356 | FileReader fr = new FileReader(this.csvFile)) { 357 | try (Scanner scan = new Scanner(fr)) { 358 | head = scan.nextLine(); 359 | } 360 | } 361 | // write to the file 362 | writeStringToFile(out, head+"\n"); 363 | 364 | // omit the first line, which is the header.. 
365 | List<File> listOfFiles = ExternalSort.sortInBatch(this.csvFile, cmp, ExternalSort.DEFAULTMAXTEMPFILES, Charset.defaultCharset(), null, false, 1, usegzip, true); 366 | 367 | // now merge with append 368 | ExternalSort.mergeSortedFiles(listOfFiles, out, cmp, Charset.defaultCharset(), false, true, usegzip); 369 | 370 | ArrayList<String> result = readLines(out); 371 | 372 | assertEquals(12, result.size()); 373 | assertArrayEquals(Arrays.toString(result.toArray()),EXPECTED_HEADER_RESULTS, result.toArray()); 374 | 375 | } 376 | 377 | public static ArrayList<String> readLines(File f) throws IOException { 378 | ArrayList<String> answer; 379 | try (BufferedReader r = new BufferedReader(new FileReader(f))) { 380 | answer = new ArrayList<>(); 381 | String line; 382 | while ((line = r.readLine()) != null) { 383 | answer.add(line); 384 | } 385 | } 386 | return answer; 387 | } 388 | 389 | public static void writeStringToFile(File f, String s) throws IOException { 390 | try (FileOutputStream out = new FileOutputStream(f)) { 391 | out.write(s.getBytes()); 392 | } 393 | } 394 | 395 | /** 396 | * Sort a text file with more than {@link Integer#MAX_VALUE} lines. 
397 | * 398 | * @throws IOException 399 | */ 400 | @Ignore("This test takes too long to execute") 401 | @Test 402 | public void sortVeryLargeFile() throws IOException { 403 | final Path veryLargeFile = getTestFile(); 404 | final Path outputFile = Files.createTempFile("Merged-File", ".tmp"); 405 | final long numLinesWritten = ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(veryLargeFile.toFile()), outputFile.toFile()); 406 | final long expectedLines = 2148L * 1000000L; 407 | assertEquals(expectedLines, numLinesWritten); 408 | } 409 | 410 | @Ignore("This test takes too long to execute") 411 | @Test 412 | public void sortVeryLargeFileWhenDistinctEnabled() throws IOException { 413 | boolean distinctEnabled = true; 414 | final Path veryLargeFile = getTestFile(); 415 | final File outputFile = Files.createTempFile("Merged-File", ".tmp").toFile(); 416 | List<File> veryLargeSortBatch = ExternalSort.sortInBatch(veryLargeFile.toFile()); 417 | 418 | long numLinesWritten = ExternalSort.mergeSortedFiles(veryLargeSortBatch, outputFile, defaultcomparator, distinctEnabled); 419 | 420 | assertEquals(1 /* 😸 */, numLinesWritten); 421 | } 422 | 423 | /** 424 | * Generate a test file with 2148 million lines. 
425 | * 426 | * @throws IOException 427 | */ 428 | private Path getTestFile() throws IOException { 429 | System.out.println("Temp File Creation: Started"); 430 | final Path path = Files.createTempFile("IntegrationTestFile", ".txt"); 431 | final List<String> idList = new ArrayList<>(); 432 | final int saneLimit = 1000000; 433 | IntStream.range(0, saneLimit) 434 | .forEach(i -> idList.add("A")); 435 | final String content = idList.stream().collect(Collectors.joining("\n")); 436 | Files.write(path, content.getBytes(StandardCharsets.UTF_8), StandardOpenOption.TRUNCATE_EXISTING); 437 | final String newLine = "\n"; 438 | IntStream.range(1, 2148) 439 | .forEach(i -> { 440 | try { 441 | Files.write(path, newLine.getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND); 442 | Files.write(path, content.getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND); 443 | } catch (IOException e) { 444 | throw new RuntimeException(e.getMessage()); 445 | } 446 | }); 447 | System.out.println("Temp File Creation: Finished"); 448 | return path; 449 | } 450 | 451 | /** 452 | * Sort with a custom comparator. 453 | * @throws IOException 454 | */ 455 | @Test 456 | public void sortWithCustomComparator() throws IOException { 457 | Random rand = new Random(); 458 | final Path path = Files.createTempFile("TestCsvWithLongIds", ".csv"); 459 | final Path pathSorted = Files.createTempFile("TestCsvWithLongIdsSorted", ".csv"); 460 | Set<Long> sortedIds = new TreeSet<>(); 461 | try (FileWriter fw = new FileWriter(path.toFile()); 462 | BufferedWriter bw = new BufferedWriter(fw)) { 463 | for (int i = 0; i < 1000; ++i) { 464 | long nextLong = rand.nextLong(); 465 | sortedIds.add(nextLong); 466 | bw.write(String.format("%d,%s\n", nextLong, UUID.randomUUID().toString())); 467 | } 468 | } 469 | AtomicBoolean wasCalled = new AtomicBoolean(false); 470 | ExternalSort.sort(path.toFile(), pathSorted.toFile(), (lhs, rhs) -> { 471 | Long lhsLong = lhs.indexOf(',') == -1 ? 
Long.MAX_VALUE : Long.parseLong(lhs.split(",")[0]); 472 | Long rhsLong = rhs.indexOf(',') == -1 ? Long.MAX_VALUE : Long.parseLong(rhs.split(",")[0]); 473 | wasCalled.set(true); 474 | return lhsLong.compareTo(rhsLong); 475 | }); 476 | assertTrue("The custom comparator was not called!", wasCalled.get()); 477 | Iterator<Long> idIter = sortedIds.iterator(); 478 | try (FileReader fr = new FileReader(pathSorted.toFile()); 479 | BufferedReader bw = new BufferedReader(fr)) { 480 | String nextLine = bw.readLine(); 481 | Long lhsLong = nextLine.indexOf(',') == -1 ? Long.MAX_VALUE : Long.parseLong(nextLine.split(",")[0]); 482 | Long nextId = idIter.next(); 483 | assertEquals(lhsLong, nextId); 484 | } 485 | } 486 | 487 | @Test 488 | public void lowMaxMemory() throws IOException { 489 | String unsortedContent = 490 | "Val1,Data2,Data3,Data4\r\n" + 491 | "Val2,Data2,Data4,Data5\r\n" + 492 | "Val1,Data2,Data3,Data5\r\n" + 493 | "Val2,Data2,Data6,Data7\r\n"; 494 | InputStream bis = new ByteArrayInputStream(unsortedContent.getBytes(StandardCharsets.UTF_8)); 495 | File tmpDirectory = Files.createTempDirectory("sort").toFile(); 496 | tmpDirectory.deleteOnExit(); 497 | 498 | BufferedReader inputReader = new BufferedReader(new InputStreamReader(bis, StandardCharsets.UTF_8)); 499 | List<File> tmpSortedFiles = ExternalSort.sortInBatch( 500 | inputReader, 501 | unsortedContent.length(), 502 | ExternalSort.defaultcomparator, 503 | Integer.MAX_VALUE, // use an unlimited number of temp files 504 | 100, // max memory 505 | StandardCharsets.UTF_8, 506 | tmpDirectory, 507 | false, // no distinct 508 | 0, // no header lines to skip 509 | false, // don't use gzip 510 | true); // parallel 511 | File tmpOutputFile = File.createTempFile("merged", "", tmpDirectory); 512 | tmpOutputFile.deleteOnExit(); 513 | BufferedWriter outputWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream((tmpOutputFile)))); 514 | ExternalSort.mergeSortedFiles( 515 | tmpSortedFiles, 516 | outputWriter, 517 | 
ExternalSort.defaultcomparator, 518 | StandardCharsets.UTF_8, 519 | false, // no distinct 520 | false); // don't use gzip 521 | 522 | for (File tmpSortedFile: tmpSortedFiles) { 523 | assertFalse(tmpSortedFile.exists()); 524 | } 525 | } 526 | } 527 | -------------------------------------------------------------------------------- /src/test/java/com/google/code/externalsorting/csv/CsvExternalSortTest.java: -------------------------------------------------------------------------------- 1 | package com.google.code.externalsorting.csv; 2 | 3 | import org.apache.commons.csv.CSVFormat; 4 | import org.apache.commons.csv.CSVRecord; 5 | import org.junit.After; 6 | import org.junit.Test; 7 | 8 | import java.io.BufferedReader; 9 | import java.io.File; 10 | import java.io.FileReader; 11 | import java.io.IOException; 12 | import java.nio.charset.Charset; 13 | import java.nio.charset.StandardCharsets; 14 | import java.nio.file.Files; 15 | import java.nio.file.Paths; 16 | import java.util.Comparator; 17 | import java.util.HashMap; 18 | import java.util.List; 19 | import java.util.ArrayList; 20 | import java.util.Map; 21 | 22 | import static org.junit.Assert.assertEquals; 23 | import static org.junit.Assert.assertNotEquals; 24 | 25 | 26 | public class CsvExternalSortTest { 27 | private static final String FILE_CSV = "externalSorting.csv"; 28 | private static final String FILE_UNICODE_CSV = "nonLatinSorting.csv"; 29 | 30 | private static final String FILE_CSV_WITH_TABS = "externalSortingTabs.csv"; 31 | private static final String FILE_CSV_WITH_SEMICOOLONS = "externalSortingSemicolon.csv"; 32 | private static final char SEMICOLON = ';'; 33 | 34 | File outputfile; 35 | 36 | @Test 37 | public void testMultiLineFile() throws IOException, ClassNotFoundException { 38 | String path = this.getClass().getClassLoader().getResource(FILE_CSV).getPath(); 39 | 40 | File file = new File(path); 41 | 42 | outputfile = new File("outputSort1.csv"); 43 | 44 | Comparator<CSVRecord> comparator = (op1, op2) -> 
op1.get(0) 45 | .compareTo(op2.get(0)); 46 | 47 | CsvSortOptions sortOptions = new CsvSortOptions 48 | .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory()) 49 | .charset(Charset.defaultCharset()) 50 | .distinct(false) 51 | .numHeader(1) 52 | .skipHeader(true) 53 | .format(CSVFormat.DEFAULT) 54 | .build(); 55 | ArrayList<CSVRecord> header = new ArrayList<CSVRecord>(); 56 | 57 | List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header); 58 | 59 | assertEquals(1, sortInBatch.size()); 60 | 61 | int mergeSortedFiles = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header); 62 | 63 | assertEquals(4, mergeSortedFiles); 64 | 65 | BufferedReader reader = new BufferedReader(new FileReader(outputfile)); 66 | String readLine = reader.readLine(); 67 | 68 | assertEquals("6,this wont work in other systems,3", readLine); 69 | reader.close(); 70 | } 71 | 72 | 73 | @Test 74 | public void testIssue44() throws Exception { 75 | String path = this.getClass().getClassLoader().getResource("issue44.csv").getPath(); 76 | List<String> input_lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8); 77 | int input_length = input_lines.size(); 78 | 79 | File file = new File(path); 80 | 81 | outputfile = new File("issue44_output.csv"); 82 | 83 | Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0) 84 | .compareTo(op2.get(0)); 85 | CSVFormat f = CSVFormat.DEFAULT; 86 | f = f.withQuote(null); 87 | 88 | CsvSortOptions sortOptions = new CsvSortOptions 89 | .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory()) 90 | .charset(StandardCharsets.UTF_8) 91 | .distinct(false) 92 | .numHeader(1) 93 | .skipHeader(true) 94 | .format(CSVFormat.DEFAULT) 95 | .build(); 96 | ArrayList<CSVRecord> header = new ArrayList<CSVRecord>(); 97 | List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header); 98 | 99 | 100 | CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, 
header); 101 | 102 | List<String> lines = Files.readAllLines(Paths.get(outputfile.getPath()), StandardCharsets.UTF_8); 103 | for(String a : lines) { 104 | System.out.println(a); 105 | } 106 | System.out.println("Sorted lines = "+lines.size()); 107 | System.out.println("Input lines (with header) = "+input_length); 108 | assertEquals(lines.size(), input_length - 1); 109 | } 110 | 111 | 112 | @Test 113 | public void testNonLatin() throws Exception { 114 | String path = this.getClass().getClassLoader().getResource(FILE_UNICODE_CSV).getPath(); 115 | 116 | File file = new File(path); 117 | 118 | outputfile = new File("unicode_output.csv"); 119 | 120 | Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0) 121 | .compareTo(op2.get(0)); 122 | 123 | CsvSortOptions sortOptions = new CsvSortOptions 124 | .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory()) 125 | .charset(StandardCharsets.UTF_8) 126 | .distinct(false) 127 | .numHeader(1) 128 | .skipHeader(true) 129 | .format(CSVFormat.DEFAULT) 130 | .build(); 131 | ArrayList<CSVRecord> header = new ArrayList<CSVRecord>(); 132 | List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header); 133 | 134 | assertEquals(1, sortInBatch.size()); 135 | 136 | int numLinesWritten = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header); 137 | 138 | assertEquals(5, numLinesWritten); 139 | 140 | List<String> lines = Files.readAllLines(Paths.get(outputfile.getPath()), StandardCharsets.UTF_8); 141 | 142 | assertEquals("2,זה רק טקסט אחי לקריאה קשה,8", lines.get(0)); 143 | assertEquals("5,هذا هو النص إخوانه فقط من الصعب القراءة,3", lines.get(1)); 144 | assertEquals("6,это не будет работать в других системах,3", lines.get(2)); 145 | } 146 | 147 | 148 | @Test 149 | public void testCVSFormat() throws Exception { 150 | Map<CSVFormat, Pair> map = new HashMap<CSVFormat, Pair>(){{ 151 | put(CSVFormat.MYSQL, new Pair(FILE_CSV_WITH_TABS, "6 \"this wont work in other systems\" 3")); 152 | put(CSVFormat.EXCEL.withDelimiter(SEMICOLON), 
new Pair(FILE_CSV_WITH_SEMICOOLONS, "6;this wont work in other systems;3")); 153 | }}; 154 | 155 | for (Map.Entry<CSVFormat, Pair> format : map.entrySet()){ 156 | String path = this.getClass().getClassLoader().getResource(format.getValue().getFileName()).getPath(); 157 | 158 | File file = new File(path); 159 | 160 | outputfile = new File("outputSort1.csv"); 161 | 162 | Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0) 163 | .compareTo(op2.get(0)); 164 | 165 | CsvSortOptions sortOptions = new CsvSortOptions 166 | .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory()) 167 | .charset(Charset.defaultCharset()) 168 | .distinct(false) 169 | .numHeader(1) 170 | .skipHeader(true) 171 | .format(format.getKey()) 172 | .build(); 173 | ArrayList<CSVRecord> header = new ArrayList<CSVRecord>(); 174 | List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header); 175 | 176 | assertEquals(1, sortInBatch.size()); 177 | 178 | int numLinesWritten = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, false, header); 179 | 180 | assertEquals(4, numLinesWritten); 181 | 182 | List<String> lines = Files.readAllLines(outputfile.toPath()); 183 | 184 | assertEquals(format.getValue().getExpected(), lines.get(0)); 185 | assertEquals(4, lines.size()); 186 | } 187 | } 188 | 189 | @Test 190 | public void testMultiLineFileWthHeader() throws IOException, ClassNotFoundException { 191 | String path = this.getClass().getClassLoader().getResource(FILE_CSV).getPath(); 192 | 193 | File file = new File(path); 194 | 195 | outputfile = new File("outputSort1.csv"); 196 | 197 | Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0) 198 | .compareTo(op2.get(0)); 199 | 200 | CsvSortOptions sortOptions = new CsvSortOptions 201 | .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory()) 202 | .charset(Charset.defaultCharset()) 203 | .distinct(false) 204 | .numHeader(1) 205 | .skipHeader(false) 206 | .format(CSVFormat.DEFAULT) 207 | .build(); 
208 | ArrayList<CSVRecord> header = new ArrayList<CSVRecord>(); 209 | List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header); 210 | 211 | assertEquals(1, sortInBatch.size()); 212 | 213 | CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header); 214 | 215 | List<String> lines = Files.readAllLines(outputfile.toPath(), sortOptions.getCharset()); 216 | 217 | assertEquals("personId,text,ishired", lines.get(0)); 218 | assertEquals("6,this wont work in other systems,3", lines.get(1)); 219 | assertEquals("6,this wont work in other systems,3", lines.get(2)); 220 | assertEquals("7,My Broken Text will break you all,1", lines.get(3)); 221 | assertEquals("8,this is only bro text for hard read,2", lines.get(4)); 222 | assertEquals(5, lines.size()); 223 | 224 | } 225 | 226 | @Test 227 | public void testNumLinesWrittenIfDistinctEnabled() throws IOException, ClassNotFoundException { 228 | boolean distinctEnabled = true; 229 | String path = this.getClass().getClassLoader().getResource(FILE_CSV).getPath(); 230 | File file = new File(path); 231 | outputfile = new File("outputSort1.csv"); 232 | 233 | Comparator<CSVRecord> comparator = Comparator.comparing(op -> op.get(0)); 234 | 235 | CsvSortOptions sortOptions = new CsvSortOptions 236 | .Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory()) 237 | .charset(Charset.defaultCharset()) 238 | .distinct(distinctEnabled) 239 | .numHeader(1) 240 | .skipHeader(true) 241 | .format(CSVFormat.DEFAULT) 242 | .build(); 243 | ArrayList<CSVRecord> header = new ArrayList<CSVRecord>(); 244 | 245 | List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header); 246 | 247 | int numLinesWritten = CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header); 248 | 249 | BufferedReader reader = new BufferedReader(new FileReader(outputfile)); 250 | 251 | assertEquals(1, sortInBatch.size()); 252 | assertEquals(3, numLinesWritten); 253 | 254 | String firstLine = reader.readLine(); 255 
| assertEquals("6,this wont work in other systems,3", firstLine); 256 | 257 | String secondLine = reader.readLine(); 258 | assertNotEquals(firstLine, secondLine); 259 | 260 | reader.close(); 261 | } 262 | 263 | @After 264 | public void onTearDown() { 265 | if(outputfile.exists()) { 266 | outputfile.delete(); 267 | } 268 | } 269 | 270 | private class Pair { 271 | private String fileName; 272 | private String expected; 273 | 274 | public Pair(String fileName, String expected) { 275 | this.fileName = fileName; 276 | this.expected = expected; 277 | } 278 | 279 | public String getFileName() { 280 | return fileName; 281 | } 282 | 283 | public void setFileName(String fileName) { 284 | this.fileName = fileName; 285 | } 286 | 287 | public String getExpected() { 288 | return expected; 289 | } 290 | 291 | public void setExpected(String expected) { 292 | this.expected = expected; 293 | } 294 | } 295 | } 296 | -------------------------------------------------------------------------------- /src/test/resources/externalSorting.csv: -------------------------------------------------------------------------------- 1 | personId,text,ishired 2 | 7,"My Broken Text will break you all",1 3 | 8,"this is only bro text for hard read",2 4 | 6,"this wont work in other systems",3 5 | 6,"this wont work in other systems",3 -------------------------------------------------------------------------------- /src/test/resources/externalSortingSemicolon.csv: -------------------------------------------------------------------------------- 1 | personId;text;ishired 2 | 7;"My Broken Text will break you all";1 3 | 8;"this is only bro text for hard read";2 4 | 6;"this wont work in other systems";3 5 | 6;"this wont work in other systems";3 -------------------------------------------------------------------------------- /src/test/resources/externalSortingTabs.csv: -------------------------------------------------------------------------------- 1 | personId text ishired 2 | 7 "My Broken Text will break you 
all" 1 3 | 8 "this is only bro text for hard read" 2 4 | 6 "this wont work in other systems" 3 5 | 6 "this wont work in other systems" 3 -------------------------------------------------------------------------------- /src/test/resources/issue44.csv: -------------------------------------------------------------------------------- 1 | "author","title" 2 | "Dan Simmons","Hyperion" 3 | "Dan Simmons","Hyperion" 4 | "Dan Simmons","Hyperion" 5 | "Dan Simmons","Hyperion" 6 | "Dan Simmons","Hyperion" 7 | "Dan Simmons","Hyperion" 8 | "Douglas Adams","The Hitchhiker's Guide to the Galaxy" 9 | "Douglas Adams","The Hitchhiker's Guide to the Galaxy" 10 | "Douglas Adams","The Hitchhiker's Guide to the Galaxy 11 | hello" 12 | "Douglas Adams","The Hitchhiker's Guide to the Galaxy" 13 | "Douglas Adams","The Hitchhiker's Guide to the Galaxy" 14 | "Douglas Adams","The Hitchhiker's Guide to the Galaxy" 15 | "Douglas Adams","The Hitchhiker's Guide to the Galaxy" 16 | "Dan ""The Man"" Simmons","Hyperion" -------------------------------------------------------------------------------- /src/test/resources/nonLatinSorting.csv: -------------------------------------------------------------------------------- 1 | personId,text,ishired 2 | 7,"My Broken Text will break you all",1 3 | 2,"זה רק טקסט אחי לקריאה קשה",8 4 | 6,"это не будет работать в других системах",3 5 | 6,"это не будет работать в других системах",3 6 | 5,"هذا هو النص إخوانه فقط من الصعب القراءة",3 -------------------------------------------------------------------------------- /src/test/resources/test-file-1.csv: -------------------------------------------------------------------------------- 1 | HEADER, HEADER 2 | a 3 | b 4 | k 5 | c 6 | d 7 | i 8 | j 9 | e 10 | h 11 | f 12 | g 13 | -------------------------------------------------------------------------------- /src/test/resources/test-file-1.txt: -------------------------------------------------------------------------------- 1 | a 2 | b 3 | c 4 | d 5 | e 6 | f 7 | g 8 | h 
-------------------------------------------------------------------------------- /src/test/resources/test-file-2.txt: -------------------------------------------------------------------------------- 1 | a 2 | c 3 | e 4 | g 5 | i 6 | j 7 | k --------------------------------------------------------------------------------