├── .gitignore
├── CHANGES.txt
├── LICENSE.html
├── README.txt
├── pom.xml
└── src
    ├── examples
    │   └── clojure
    │       └── clojure_hadoop
    │           └── examples
    │               ├── wordcount1.clj
    │               ├── wordcount2.clj
    │               ├── wordcount3.clj
    │               ├── wordcount4.clj
    │               └── wordcount5.clj
    ├── main
    │   ├── assembly
    │   │   ├── dist.xml
    │   │   ├── examples.xml
    │   │   └── job.xml
    │   └── clojure
    │       └── clojure_hadoop
    │           ├── config.clj
    │           ├── defjob.clj
    │           ├── gen.clj
    │           ├── imports.clj
    │           ├── job.clj
    │           ├── load.clj
    │           └── wrap.clj
    └── test
        └── clojure
            └── clojure_hadoop
                └── test_imports.clj
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
target
--------------------------------------------------------------------------------
/CHANGES.txt:
--------------------------------------------------------------------------------
Changes in Version 1.1.0:

* Additional configuration options for defjob and the command line:
  output-key, output-value, map-output-key, map-output-value,
  compress-output, output-compressor, compression-type.

* Renamed configuration options inputformat and outputformat to
  input-format and output-format, respectively.

* Added example wordcount5 showing the new configuration options.


Version 1.0.0: Initial Release
--------------------------------------------------------------------------------
/LICENSE.html:
--------------------------------------------------------------------------------
Eclipse Public License - Version 1.0

Eclipse Public License - v 1.0

THE ACCOMPANYING PROGRAM IS PROVIDED UNDER THE TERMS OF THIS ECLIPSE
PUBLIC LICENSE ("AGREEMENT"). ANY USE, REPRODUCTION OR DISTRIBUTION OF
THE PROGRAM CONSTITUTES RECIPIENT'S ACCEPTANCE OF THIS AGREEMENT.

1. DEFINITIONS

"Contribution" means:

a) in the case of the initial Contributor, the initial code and
documentation distributed under this Agreement, and

b) in the case of each subsequent Contributor:

i) changes to the Program, and

ii) additions to the Program;

where such changes and/or additions to the Program originate from and
are distributed by that particular Contributor. A Contribution
'originates' from a Contributor if it was added to the Program by such
Contributor itself or anyone acting on such Contributor's behalf.
Contributions do not include additions to the Program which: (i) are
separate modules of software distributed in conjunction with the
Program under their own license agreement, and (ii) are not derivative
works of the Program.

"Contributor" means any person or entity that distributes the Program.

"Licensed Patents" mean patent claims licensable by a Contributor
which are necessarily infringed by the use or sale of its Contribution
alone or when combined with the Program.

"Program" means the Contributions distributed in accordance with this
Agreement.

"Recipient" means anyone who receives the Program under this
Agreement, including all Contributors.

2. GRANT OF RIGHTS

a) Subject to the terms of this Agreement, each Contributor hereby
grants Recipient a non-exclusive, worldwide, royalty-free copyright
license to reproduce, prepare derivative works of, publicly display,
publicly perform, distribute and sublicense the Contribution of such
Contributor, if any, and such derivative works, in source code and
object code form.

b) Subject to the terms of this Agreement, each Contributor hereby
grants Recipient a non-exclusive, worldwide, royalty-free patent
license under Licensed Patents to make, use, sell, offer to sell,
import and otherwise transfer the Contribution of such Contributor, if
any, in source code and object code form. This patent license shall
apply to the combination of the Contribution and the Program if, at
the time the Contribution is added by the Contributor, such addition
of the Contribution causes such combination to be covered by the
Licensed Patents. The patent license shall not apply to any other
combinations which include the Contribution. No hardware per se is
licensed hereunder.

c) Recipient understands that although each Contributor grants the
licenses to its Contributions set forth herein, no assurances are
provided by any Contributor that the Program does not infringe the
patent or other intellectual property rights of any other entity. Each
Contributor disclaims any liability to Recipient for claims brought by
any other entity based on infringement of intellectual property rights
or otherwise. As a condition to exercising the rights and licenses
granted hereunder, each Recipient hereby assumes sole responsibility
to secure any other intellectual property rights needed, if any. For
example, if a third party patent license is required to allow
Recipient to distribute the Program, it is Recipient's responsibility
to acquire that license before distributing the Program.

d) Each Contributor represents that to its knowledge it has sufficient
copyright rights in its Contribution, if any, to grant the copyright
license set forth in this Agreement.

3. REQUIREMENTS

A Contributor may choose to distribute the Program in object code form
under its own license agreement, provided that:

a) it complies with the terms and conditions of this Agreement; and

b) its license agreement:

i) effectively disclaims on behalf of all Contributors all warranties
and conditions, express and implied, including warranties or
conditions of title and non-infringement, and implied warranties or
conditions of merchantability and fitness for a particular purpose;

ii) effectively excludes on behalf of all Contributors all liability
for damages, including direct, indirect, special, incidental and
consequential damages, such as lost profits;

iii) states that any provisions which differ from this Agreement are
offered by that Contributor alone and not by any other party; and

iv) states that source code for the Program is available from such
Contributor, and informs licensees how to obtain it in a reasonable
manner on or through a medium customarily used for software exchange.

When the Program is made available in source code form:

a) it must be made available under this Agreement; and

b) a copy of this Agreement must be included with each copy of the
Program.

Contributors may not remove or alter any copyright notices contained
within the Program.

Each Contributor must identify itself as the originator of its
Contribution, if any, in a manner that reasonably allows subsequent
Recipients to identify the originator of the Contribution.

4. COMMERCIAL DISTRIBUTION

Commercial distributors of software may accept certain
responsibilities with respect to end users, business partners and the
like. While this license is intended to facilitate the commercial use
of the Program, the Contributor who includes the Program in a
commercial product offering should do so in a manner which does not
create potential liability for other Contributors. Therefore, if a
Contributor includes the Program in a commercial product offering,
such Contributor ("Commercial Contributor") hereby agrees to defend
and indemnify every other Contributor ("Indemnified Contributor")
against any losses, damages and costs (collectively "Losses") arising
from claims, lawsuits and other legal actions brought by a third party
against the Indemnified Contributor to the extent caused by the acts
or omissions of such Commercial Contributor in connection with its
distribution of the Program in a commercial product offering. The
obligations in this section do not apply to any claims or Losses
relating to any actual or alleged intellectual property infringement.
In order to qualify, an Indemnified Contributor must: a) promptly
notify the Commercial Contributor in writing of such claim, and b)
allow the Commercial Contributor to control, and cooperate with the
Commercial Contributor in, the defense and any related settlement
negotiations. The Indemnified Contributor may participate in any such
claim at its own expense.

For example, a Contributor might include the Program in a commercial
product offering, Product X. That Contributor is then a Commercial
Contributor. If that Commercial Contributor then makes performance
claims, or offers warranties related to Product X, those performance
claims and warranties are such Commercial Contributor's responsibility
alone. Under this section, the Commercial Contributor would have to
defend claims against the other Contributors related to those
performance claims and warranties, and if a court requires any other
Contributor to pay any damages as a result, the Commercial Contributor
must pay those damages.

5. NO WARRANTY

EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, THE PROGRAM IS
PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY
WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY
OR FITNESS FOR A PARTICULAR PURPOSE. Each Recipient is solely
responsible for determining the appropriateness of using and
distributing the Program and assumes all risks associated with its
exercise of rights under this Agreement, including but not limited to
the risks and costs of program errors, compliance with applicable
laws, damage to or loss of data, programs or equipment, and
unavailability or interruption of operations.

6. DISCLAIMER OF LIABILITY

EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, NEITHER RECIPIENT NOR
ANY CONTRIBUTORS SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING
WITHOUT LIMITATION LOST PROFITS), HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OR
DISTRIBUTION OF THE PROGRAM OR THE EXERCISE OF ANY RIGHTS GRANTED
HEREUNDER, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

7. GENERAL

If any provision of this Agreement is invalid or unenforceable under
applicable law, it shall not affect the validity or enforceability of
the remainder of the terms of this Agreement, and without further
action by the parties hereto, such provision shall be reformed to the
minimum extent necessary to make such provision valid and enforceable.

If Recipient institutes patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that
the Program itself (excluding combinations of the Program with other
software or hardware) infringes such Recipient's patent(s), then such
Recipient's rights granted under Section 2(b) shall terminate as of
the date such litigation is filed.

All Recipient's rights under this Agreement shall terminate if it
fails to comply with any of the material terms or conditions of this
Agreement and does not cure such failure in a reasonable period of
time after becoming aware of such noncompliance. If all Recipient's
rights under this Agreement terminate, Recipient agrees to cease use
and distribution of the Program as soon as reasonably practicable.
However, Recipient's obligations under this Agreement and any licenses
granted by Recipient relating to the Program shall continue and
survive.

Everyone is permitted to copy and distribute copies of this Agreement,
but in order to avoid inconsistency the Agreement is copyrighted and
may only be modified in the following manner. The Agreement Steward
reserves the right to publish new versions (including revisions) of
this Agreement from time to time. No one other than the Agreement
Steward has the right to modify this Agreement. The Eclipse Foundation
is the initial Agreement Steward. The Eclipse Foundation may assign
the responsibility to serve as the Agreement Steward to a suitable
separate entity. Each new version of the Agreement will be given a
distinguishing version number. The Program (including Contributions)
may always be distributed subject to the version of the Agreement
under which it was received. In addition, after a new version of the
Agreement is published, Contributor may elect to distribute the
Program (including its Contributions) under the new version. Except as
expressly stated in Sections 2(a) and 2(b) above, Recipient receives
no rights or licenses to the intellectual property of any Contributor
under this Agreement, whether expressly, by implication, estoppel or
otherwise. All rights in the Program not expressly granted under this
Agreement are reserved.

This Agreement is governed by the laws of the State of New York and
the intellectual property laws of the United States of America. No
party to this Agreement will bring a legal action under this Agreement
more than one year after the cause of action arose. Each party waives
its rights to a jury trial in any resulting litigation.

--------------------------------------------------------------------------------
/README.txt:
--------------------------------------------------------------------------------
An UP-TO-DATE fork with more recent maintenance is here:
https://github.com/alexott/clojure-hadoop


clojure-hadoop

A library to assist in writing Hadoop MapReduce jobs in Clojure.

by Stuart Sierra
http://stuartsierra.com/

For stable releases, see
http://stuartsierra.com/software/clojure-hadoop

For more information
on Clojure, http://clojure.org/
on Hadoop, http://hadoop.apache.org/

Also see my presentation about this library at
http://vimeo.com/7669741


Copyright (c) Stuart Sierra, 2009. All rights reserved. The use and
distribution terms for this software are covered by the Eclipse Public
License 1.0 (http://opensource.org/licenses/eclipse-1.0.php), which
can be found in the file LICENSE.html at the root of this
distribution. By using this software in any fashion, you are agreeing
to be bound by the terms of this license. You must not remove this
notice, or any other, from this software.



DEPENDENCIES

This library requires the Java 6 JDK, http://java.sun.com/

Building from source requires Apache Maven 2, http://maven.apache.org/



BUILDING

If you downloaded the library distribution as a .zip or .tar file,
everything is pre-built and there is nothing you need to do.

If you downloaded the sources from Git, you need to run the build with
Maven. In the top-level directory of this project, run:

    mvn assembly:assembly

This compiles and builds the JAR files.
You can find these files in the "target" directory (replace ${VERSION}
with the current version number of this library):

clojure-hadoop-${VERSION}-examples.jar :

    This JAR contains all dependencies, including all of Hadoop
    0.18.3. You can use this JAR to run the example MapReduce jobs
    from the command line. This file is ONLY for running the
    examples.

clojure-hadoop-${VERSION}-job.jar :

    This JAR contains the clojure-hadoop libraries and Clojure 1.0.
    It is suitable for inclusion in the "lib" directory of a JAR file
    submitted as a Hadoop job.

clojure-hadoop-${VERSION}.jar :

    This JAR contains ONLY the clojure-hadoop libraries. It can be
    placed in the "lib" directory of a JAR file submitted as a Hadoop
    job; that JAR must also include the Clojure 1.0 JAR.



RUNNING THE EXAMPLES

After building, copy the file from

    target/clojure-hadoop-${VERSION}-examples.jar

to something short, like "examples.jar". Each of the *.clj files in
the src/examples directory contains instructions for running that
example.



USING THE LIBRARY IN HADOOP

After building, include the "clojure-hadoop-${VERSION}-job.jar" file
in the lib/ directory of the JAR you submit as your Hadoop job.



DEPENDING ON THE LIBRARY WITH MAVEN

You can depend on clojure-hadoop in your Maven 2 projects by adding
the following lines to your pom.xml:

    <dependencies>
      ...
      <dependency>
        <groupId>com.stuartsierra</groupId>
        <artifactId>clojure-hadoop</artifactId>
        <version>${VERSION}</version>
      </dependency>
      ...
    </dependencies>
    <repositories>
      <repository>
        <id>stuartsierra-releases</id>
        <name>Stuart Sierra's personal Maven 2 release repository</name>
        <url>http://stuartsierra.com/maven2</url>
      </repository>
      <repository>
        <id>stuartsierra-snapshots</id>
        <name>Stuart Sierra's personal Maven 2 SNAPSHOT repository</name>
        <url>http://stuartsierra.com/m2snapshots</url>
      </repository>
    </repositories>



USING THE LIBRARY

This library provides different layers of abstraction away from the
raw Hadoop API.

Layer 1: clojure-hadoop.imports

    Provides convenience functions for importing the many classes and
    interfaces in the Hadoop API.

Layer 2: clojure-hadoop.gen

    Provides gen-class macros to generate the multiple classes needed
    for a MapReduce job. See the example file "wordcount1.clj" for a
    demonstration of these macros.

Layer 3: clojure-hadoop.wrap

    Provides wrapper functions that automatically convert between
    Hadoop Text objects and Clojure data structures. See the example
    file "wordcount2.clj" for a demonstration of these wrappers.

Layer 4: clojure-hadoop.job

    Provides a complete implementation of a Hadoop MapReduce job that
    can be dynamically configured to use any Clojure functions in the
    map and reduce phases. See the example file "wordcount3.clj" for
    a demonstration of this usage.

Layer 5: clojure-hadoop.defjob

    A convenient macro to configure MapReduce jobs with Clojure code.
    See the example files "wordcount4.clj" and "wordcount5.clj" for
    demonstrations of this macro.
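As a quick illustration of Layers 4 and 5, the map and reduce
functions that clojure-hadoop.job and defjob consume are ordinary
Clojure functions over plain data. The sketch below lifts my-map and
my-reduce verbatim from the wordcount3/wordcount4 examples in this
repository and exercises them directly, with no Hadoop classes
involved (the namespace name wordcount-sketch is invented for this
sketch):

```clojure
;; A plain-Clojure mapper and reducer of the kind clojure-hadoop.job
;; expects: the mapper returns a seq of [key value] pairs, and the
;; reducer receives a function that returns a lazy seq of the values
;; collected for one key.
(ns wordcount-sketch
  (:import (java.util StringTokenizer)))

(defn my-map [key value]
  (map (fn [token] [token 1])
       (enumeration-seq (StringTokenizer. value))))

(defn my-reduce [key values-fn]
  [[key (reduce + (values-fn))]])

;; Exercising the functions directly, outside Hadoop:
(my-map nil "to be or not to be")
;; => (["to" 1] ["be" 1] ["or" 1] ["not" 1] ["to" 1] ["be" 1])

(my-reduce "to" (fn [] [1 1]))
;; => [["to" 2]]
```

Under defjob, these same two functions plus a few keyword options
(:map, :map-reader, :reduce, :input-format) become a complete job
definition, as wordcount4 demonstrates.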
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.stuartsierra</groupId>
  <artifactId>clojure-hadoop</artifactId>
  <packaging>jar</packaging>
  <version>1.2.0-SNAPSHOT</version>
  <name>clojure-hadoop</name>
  <url>http://github.com/stuartsierra/clojure-hadoop</url>
  <developers>
    <developer>
      <id>stuartsierra</id>
      <name>Stuart Sierra</name>
      <email>mail@stuartsierra.com</email>
      <url>http://www.stuartsierra.com/</url>
    </developer>
  </developers>
  <licenses>
    <license>
      <name>Eclipse Public License 1.0</name>
      <url>http://opensource.org/licenses/eclipse-1.0.php</url>
      <distribution>repo</distribution>
      <comments>Same license as Clojure</comments>
    </license>
  </licenses>
  <dependencies>
    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core-with-dependencies</artifactId>
      <version>0.18.3</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/job.xml</descriptor>
            <descriptor>src/main/assembly/examples.xml</descriptor>
            <descriptor>src/main/assembly/dist.xml</descriptor>
          </descriptors>
        </configuration>
      </plugin>
      <plugin>
        <groupId>com.theoryinpractise</groupId>
        <artifactId>clojure-maven-plugin</artifactId>
        <version>1.0</version>
        <configuration>
          <sourceDirectories>
            <sourceDirectory>src/main/clojure</sourceDirectory>
            <sourceDirectory>src/examples/clojure</sourceDirectory>
          </sourceDirectories>
          <testSourceDirectories>
            <testSourceDirectory>src/test/clojure</testSourceDirectory>
          </testSourceDirectories>
        </configuration>
        <executions>
          <execution>
            <id>compile-clojure</id>
            <phase>compile</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
    <extensions>
      <extension>
        <groupId>org.apache.maven.wagon</groupId>
        <artifactId>wagon-ftp</artifactId>
        <version>1.0-beta-6</version>
      </extension>
    </extensions>
  </build>
  <repositories>
    <repository>
      <id>stuartsierra-releases</id>
      <name>Stuart Sierra's personal Maven 2 release repository</name>
      <url>http://stuartsierra.com/maven2</url>
    </repository>
  </repositories>
  <distributionManagement>
    <repository>
      <id>stuartsierra-releases</id>
      <name>Stuart Sierra's personal Maven 2 release repository</name>
      <url>ftp://stuartsierra.com/public_html/stuartsierra/maven2</url>
    </repository>
    <snapshotRepository>
      <id>stuartsierra-snapshots</id>
      <name>Stuart Sierra's personal Maven 2 SNAPSHOT repository</name>
      <url>ftp://stuartsierra.com/public_html/stuartsierra/m2snapshots</url>
    </snapshotRepository>
  </distributionManagement>
</project>
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount1.clj:
--------------------------------------------------------------------------------
;; wordcount1 -- low-level MapReduce example
;;
;; This namespace demonstrates how to use the lower layers of
;; abstraction provided by the clojure-hadoop library.
;;
;; This is the example word count program used in the Hadoop MapReduce
;; tutorials.  As you can see, it is very similar to the Java code, and
;; uses the Hadoop API directly.
;;
;; We have to call gen-job-classes and gen-main-method, then define the
;; three functions mapper-map, reducer-reduce, and tool-run.
;;
;; To run this example, first compile it (see instructions in
;; README.txt), then run this command (all one line):
;;
;;   java -cp examples.jar \
;;        clojure_hadoop.examples.wordcount1 \
;;        README.txt out1
;;
;; This will count the instances of each word in README.txt and write
;; the results to out1/part-00000.


(ns clojure-hadoop.examples.wordcount1
  (:require [clojure-hadoop.gen :as gen]
            [clojure-hadoop.imports :as imp])
  (:import (java.util StringTokenizer)
           (org.apache.hadoop.util Tool)))

(imp/import-io)     ;; for Text, LongWritable
(imp/import-fs)     ;; for Path
(imp/import-mapred) ;; for JobConf, JobClient

(gen/gen-job-classes) ;; generates Tool, Mapper, and Reducer classes
(gen/gen-main-method) ;; generates the Tool.main method

(defn mapper-map
  "This is our implementation of the Mapper.map method.  The key and
  value arguments are sub-classes of Hadoop's Writable interface, so
  we have to convert them to strings or some other type before we can
  use them.  Likewise, we have to call the OutputCollector.collect
  method with objects that are sub-classes of Writable."
  [this key value #^OutputCollector output reporter]
  (doseq [word (enumeration-seq (StringTokenizer. (str value)))]
    (.collect output (Text. word) (LongWritable. 1))))

(defn reducer-reduce
  "This is our implementation of the Reducer.reduce method.  The key
  argument is a sub-class of Hadoop's Writable, but 'values' is a Java
  Iterator that returns successive values.  We have to use
  iterator-seq to get a Clojure sequence from the Iterator.

  Beware, however, that Hadoop re-uses a single object for every
  object returned by the Iterator.  So when you get an object from the
  iterator, you must extract its value (as we do here with the 'get'
  method) immediately, before accepting the next value from the
  iterator.  That is, you cannot hang on to past values from the
  iterator."
  [this key values #^OutputCollector output reporter]
  (let [sum (reduce + (map (fn [#^LongWritable v] (.get v))
                           (iterator-seq values)))]
    (.collect output key (LongWritable. sum))))

(defn tool-run
  "This is our implementation of the Tool.run method.  args are the
  command-line arguments as a Java array of strings.  We have to
  create a JobConf object, set all the MapReduce job parameters, then
  call the JobClient.runJob static method on it.

  This method must return zero on success or Hadoop will report that
  the job failed."
  [#^Tool this args]
  (doto (JobConf. (.getConf this) (.getClass this))
    (.setJobName "wordcount1")
    (.setOutputKeyClass Text)
    (.setOutputValueClass LongWritable)
    (.setMapperClass (Class/forName "clojure_hadoop.examples.wordcount1_mapper"))
    (.setReducerClass (Class/forName "clojure_hadoop.examples.wordcount1_reducer"))
    (.setInputFormat TextInputFormat)
    (.setOutputFormat TextOutputFormat)
    (FileInputFormat/setInputPaths #^String (first args))
    (FileOutputFormat/setOutputPath (Path. (second args)))
    (JobClient/runJob))
  0)
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount2.clj:
--------------------------------------------------------------------------------
;; wordcount2 -- wrapped MapReduce example
;;
;; This namespace demonstrates how to use the function wrappers
;; provided by the clojure-hadoop library.
;;
;; As in the wordcount1 example, we have to call gen-job-classes and
;; gen-main-method, then define the three functions mapper-map,
;; reducer-reduce, and tool-run.
;;
;; mapper-map uses the wrap-map function.  This allows us to write our
;; mapper as a simple, pure-Clojure function.  Converting between
;; Hadoop types, and dealing with the Hadoop APIs, are handled by the
;; wrapper.  We give it a function that returns a sequence of pairs,
;; and a pre-defined reader that accepts a Hadoop [LongWritable, Text]
;; pair.  The default writer function writes keys and values as Hadoop
;; Text objects rendered with pr-str.
;;
;; reducer-reduce similarly uses the wrap-reduce function.  However,
;; rather than passing the sequence of values directly to the
;; function, wrap-reduce will pass a *function* that *returns* a lazy
;; sequence of values.  Because this sequence may be very large, you
;; must be careful never to bind it to a local variable.  Basically,
;; you should only use the values-fn in one of Clojure's sequence
;; functions such as map, filter, or reduce.
;;
;; To run this example, first compile it (see instructions in
;; README.txt), then run this command (all one line):
;;
;;   java -cp examples.jar \
;;        clojure_hadoop.examples.wordcount2 \
;;        README.txt out2
;;
;; This will count the instances of each word in README.txt and write
;; the results to out2/part-00000.
;;
;; Notice that, in the output file, the words are enclosed in double
;; quotation marks.  That's because they are being printed as readable
;; strings by Clojure, as with 'pr'.


(ns clojure-hadoop.examples.wordcount2
  (:require [clojure-hadoop.gen :as gen]
            [clojure-hadoop.imports :as imp]
            [clojure-hadoop.wrap :as wrap])
  (:import (java.util StringTokenizer)
           (org.apache.hadoop.util Tool)))

(imp/import-io)     ;; for Text
(imp/import-fs)     ;; for Path
(imp/import-mapred) ;; for JobConf, JobClient

(gen/gen-job-classes)
(gen/gen-main-method)

(def mapper-map
  (wrap/wrap-map
   (fn [key value]
     (map (fn [token] [token 1])
          (enumeration-seq (StringTokenizer. value))))
   wrap/int-string-map-reader))

(def reducer-reduce
  (wrap/wrap-reduce
   (fn [key values-fn]
     [[key (reduce + (values-fn))]])))

(defn tool-run [#^Tool this args]
  (doto (JobConf. (.getConf this) (.getClass this))
    (.setJobName "wordcount2")
    (.setOutputKeyClass Text)
    (.setOutputValueClass Text)
    (.setMapperClass (Class/forName "clojure_hadoop.examples.wordcount2_mapper"))
    (.setReducerClass (Class/forName "clojure_hadoop.examples.wordcount2_reducer"))
    (.setInputFormat TextInputFormat)
    (.setOutputFormat TextOutputFormat)
    (FileInputFormat/setInputPaths #^String (first args))
    (FileOutputFormat/setOutputPath (Path. (second args)))
    (JobClient/runJob))
  0)
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount3.clj:
--------------------------------------------------------------------------------
;; wordcount3 -- example for use with clojure-hadoop.job
;;
;; This example wordcount program is very different from the first
;; two.  As you can see, it defines only two functions, doesn't import
;; anything, and doesn't generate any classes.
;;
;; This example is designed to be run with the clojure-hadoop.job
;; library, which allows you to run a MapReduce job that can be
;; configured to use any Clojure functions as the mapper and reducer.
;;
;; After compiling (see README.txt), run the example like this
;; (all on one line):
;;
;;   java -cp examples.jar clojure_hadoop.job \
;;        -input README.txt \
;;        -output out3 \
;;        -map clojure-hadoop.examples.wordcount3/my-map \
;;        -map-reader clojure-hadoop.wrap/int-string-map-reader \
;;        -reduce clojure-hadoop.examples.wordcount3/my-reduce \
;;        -input-format text
;;
;; The output is a Hadoop SequenceFile.  You can view the output
;; with (all one line):
;;
;;   java -cp examples.jar org.apache.hadoop.fs.FsShell \
;;        -text out3/part-00000
;;
;; clojure_hadoop.job (note the underscore instead of a dash, because
;; we are calling it as a Java class) provides classes for Tool,
;; Mapper, and Reducer, which are dynamically configured on the
;; command line.
;;
;; The argument to -map is a namespace-qualified Clojure symbol.  It
;; names the function that will be used as a mapper.  We need to
;; specify the -map-reader function as well because we are not using
;; the default reader (which reads pr'd Clojure data structures).
;;
;; The argument to -reduce is also a namespace-qualified Clojure
;; symbol.
;;
;; We also have to specify the input and output paths, and specify the
;; non-default input-format as 'text', because README.txt is a text
;; file.
;;
;; Run clojure_hadoop.job without any arguments for a brief summary of
;; the options.  See src/clojure_hadoop/job.clj and
;; src/clojure_hadoop/config.clj for more configuration options.


(ns clojure-hadoop.examples.wordcount3
  (:import (java.util StringTokenizer)))

(defn my-map [key value]
  (map (fn [token] [token 1])
       (enumeration-seq (StringTokenizer. value))))

(defn my-reduce [key values-fn]
  [[key (reduce + (values-fn))]])
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount4.clj:
--------------------------------------------------------------------------------
;; wordcount4 -- example defjob
;;
;; This example wordcount program is similar to wordcount3, but it
;; includes a job definition function created with defjob.
;;
;; defjob parses its options to create a job configuration map
;; suitable for clojure-hadoop.config.
;;
;; defjob defines an ordinary function, with the given name ("job" in
;; this example), which returns the job configuration map.
;;
;; We can specify the job definition function on the command line to
;; clojure_hadoop.job, adding or overriding any additional arguments
;; at the command line.
;;
;; After compiling (see README.txt), run the example like this
;; (all on one line):
;;
;;   java -cp examples.jar clojure_hadoop.job \
;;        -job clojure-hadoop.examples.wordcount4/job \
;;        -input README.txt -output out4
;;
;; The output is a Hadoop SequenceFile.  You can view the output
;; with (all one line):
;;
;;   java -cp examples.jar org.apache.hadoop.fs.FsShell \
;;        -text out4/part-00000


(ns clojure-hadoop.examples.wordcount4
  (:require [clojure-hadoop.wrap :as wrap]
            [clojure-hadoop.defjob :as defjob])
  (:import (java.util StringTokenizer)))

(defn my-map [key value]
  (map (fn [token] [token 1])
       (enumeration-seq (StringTokenizer. value))))

(defn my-reduce [key values-fn]
  [[key (reduce + (values-fn))]])

(defjob/defjob job
  :map my-map
  :map-reader wrap/int-string-map-reader
  :reduce my-reduce
  :input-format :text)
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount5.clj:
--------------------------------------------------------------------------------
;; wordcount5 -- example customized defjob
;;
;; This example wordcount program uses defjob like wordcount4, but it
;; includes some more configuration options that make it more
;; efficient.
;;
;; In the default configuration (wordcount4), everything is passed to
;; Hadoop as a Text and converted by the Clojure reader and printer.
;; By adding configuration options, this example works more closely
;; with Hadoop types like LongWritable.  In order to do that it must
;; define custom reader and writer functions, and specify the output
;; key/value types in the defjob configuration.
;;
;; After compiling (see README.txt), run the example like this
;; (all on one line):
;;
;;   java -cp examples.jar clojure_hadoop.job \
;;        -job clojure-hadoop.examples.wordcount5/job \
;;        -input README.txt -output out5
;;
;; The output is plain text, written to out5/part-00000.
;;
;; Notice that the strings in the output are not quoted.  In effect,
;; we have come full circle to wordcount1, while maintaining the
;; separation between the mapper/reducer functions and the
;; reader/writer functions.


(ns clojure-hadoop.examples.wordcount5
  (:require [clojure-hadoop.wrap :as wrap]
            [clojure-hadoop.defjob :as defjob]
            [clojure-hadoop.imports :as imp])
  (:import (java.util StringTokenizer)))

(imp/import-io)     ;; for Text, LongWritable
(imp/import-mapred) ;; for OutputCollector

(defn my-map [key value]
  (map (fn [token] [token 1])
       (enumeration-seq (StringTokenizer. value))))

(defn my-reduce [key values-fn]
  [[key (reduce + (values-fn))]])

(defn string-long-writer [#^OutputCollector output
                          #^String key value]
  (.collect output (Text. key) (LongWritable. value)))

(defn string-long-reduce-reader [#^Text key wvalues]
  [(.toString key)
   (fn [] (map (fn [#^LongWritable v] (.get v))
               (iterator-seq wvalues)))])

(defjob/defjob job
  :map my-map
  :map-reader wrap/int-string-map-reader
  :map-writer string-long-writer
  :reduce my-reduce
  :reduce-reader string-long-reduce-reader
  :reduce-writer string-long-writer
  :output-key Text
  :output-value LongWritable
  :input-format :text
  :output-format :text
  :compress-output false)
--------------------------------------------------------------------------------
/src/main/assembly/dist.xml:
--------------------------------------------------------------------------------
<assembly>
  <id>dist</id>
  <formats>
    <format>zip</format>
    <format>tar.gz</format>
    <format>tar.bz2</format>
  </formats>
  <fileSets>
    <fileSet>
      <directory>${project.basedir}</directory>
      <outputDirectory>/</outputDirectory>
      <useDefaultExcludes>true</useDefaultExcludes>
      <includes>
        <include>README.*</include>
        <include>LICENSE.*</include>
        <include>NOTICE.*</include>
        <include>CHANGES.*</include>
        <include>pom.xml</include>
        <include>src/**</include>
        <include>target/*.jar</include>
      </includes>
    </fileSet>
  </fileSets>
</assembly>
--------------------------------------------------------------------------------
/src/main/assembly/examples.xml:
-------------------------------------------------------------------------------- 1 | 4 | examples 5 | 6 | jar 7 | 8 | false 9 | 10 | 11 | true 12 | runtime 13 | 14 | 15 | -------------------------------------------------------------------------------- /src/main/assembly/job.xml: -------------------------------------------------------------------------------- 1 | 4 | job 5 | 6 | jar 7 | 8 | false 9 | 10 | 11 | true 12 | runtime 13 | 14 | org.apache.hadoop:hadoop-core-with-dependencies 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /src/main/clojure/clojure_hadoop/config.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.config 2 | (:require [clojure-hadoop.imports :as imp] 3 | [clojure-hadoop.load :as load]) 4 | (:import (org.apache.hadoop.io.compress 5 | DefaultCodec GzipCodec LzoCodec))) 6 | 7 | ;; This file defines configuration options for clojure-hadoop. 8 | ;; 9 | ;; The SAME options may be given either on the command line (to 10 | ;; clojure_hadoop.job) or in a call to defjob. 11 | ;; 12 | ;; In defjob, option names are keywords. Values are symbols or 13 | ;; keywords. Symbols are resolved as functions or classes. Keywords 14 | ;; are converted to Strings. 15 | ;; 16 | ;; On the command line, option names are preceded by "-". 17 | ;; 18 | ;; Options are defined as methods of the conf multimethod. 19 | ;; Documentation for individual options appears with each method, 20 | ;; below. 21 | 22 | (imp/import-io) 23 | (imp/import-fs) 24 | (imp/import-mapred) 25 | (imp/import-mapred-lib) 26 | 27 | (defn- #^String as-str [s] 28 | (cond (keyword? s) (name s) 29 | (class? s) (.getName #^Class s) 30 | (fn? s) (throw (Exception. 
"Cannot use function as value; use a symbol.")) 31 | :else (str s))) 32 | 33 | (defmulti conf (fn [jobconf key value] key)) 34 | 35 | (defmethod conf :job [jobconf key value] 36 | (let [f (load/load-name value)] 37 | (doseq [[k v] (f)] 38 | (conf jobconf k v)))) 39 | 40 | ;; Job input paths, separated by commas, as a String. 41 | (defmethod conf :input [#^JobConf jobconf key value] 42 | (FileInputFormat/setInputPaths jobconf (as-str value))) 43 | 44 | ;; Job output path, as a String. 45 | (defmethod conf :output [#^JobConf jobconf key value] 46 | (FileOutputFormat/setOutputPath jobconf (Path. (as-str value)))) 47 | 48 | ;; When true or "true", deletes output path before starting. 49 | (defmethod conf :replace [#^JobConf jobconf key value] 50 | (when (= (as-str value) "true") 51 | (.set jobconf "clojure-hadoop.job.replace" "true"))) 52 | 53 | ;; The mapper function. May be a class name or a Clojure function as 54 | ;; namespace/symbol. May also be "identity" for IdentityMapper. 55 | (defmethod conf :map [#^JobConf jobconf key value] 56 | (let [value (as-str value)] 57 | (cond 58 | (= "identity" value) 59 | (.setMapperClass jobconf IdentityMapper) 60 | 61 | (.contains value "/") 62 | (.set jobconf "clojure-hadoop.job.map" value) 63 | 64 | :else 65 | (.setMapperClass jobconf (Class/forName value))))) 66 | 67 | ;; The reducer function. May be a class name or a Clojure function as 68 | ;; namespace/symbol. May also be "identity" for IdentityReducer or 69 | ;; "none" for no reduce stage. 
70 | (defmethod conf :reduce [#^JobConf jobconf key value] 71 | (let [value (as-str value)] 72 | (cond 73 | (= "identity" value) 74 | (.setReducerClass jobconf IdentityReducer) 75 | 76 | (= "none" value) 77 | (.setNumReduceTasks jobconf 0) 78 | 79 | (.contains value "/") 80 | (.set jobconf "clojure-hadoop.job.reduce" value) 81 | 82 | :else 83 | (.setReducerClass jobconf (Class/forName value))))) 84 | 85 | ;; The mapper reader function, converts Hadoop Writable types to 86 | ;; native Clojure types. 87 | (defmethod conf :map-reader [#^JobConf jobconf key value] 88 | (.set jobconf "clojure-hadoop.job.map.reader" (as-str value))) 89 | 90 | ;; The mapper writer function; converts native Clojure types to Hadoop 91 | ;; Writable types. 92 | (defmethod conf :map-writer [#^JobConf jobconf key value] 93 | (.set jobconf "clojure-hadoop.job.map.writer" (as-str value))) 94 | 95 | ;; The mapper output key class; used when the mapper writer outputs 96 | ;; types different from the job output. 97 | (defmethod conf :map-output-key [#^JobConf jobconf key value] 98 | (.setMapOutputKeyClass jobconf (Class/forName value))) 99 | 100 | ;; The mapper output value class; used when the mapper writer outputs 101 | ;; types different from the job output. 102 | (defmethod conf :map-output-value [#^JobConf jobconf key value] 103 | (.setMapOutputValueClass jobconf (Class/forName value))) 104 | 105 | ;; The job output key class. 106 | (defmethod conf :output-key [#^JobConf jobconf key value] 107 | (.setOutputKeyClass jobconf (Class/forName value))) 108 | 109 | ;; The job output value class. 110 | (defmethod conf :output-value [#^JobConf jobconf key value] 111 | (.setOutputValueClass jobconf (Class/forName value))) 112 | 113 | ;; The reducer reader function, converts Hadoop Writable types to 114 | ;; native Clojure types. 
115 | (defmethod conf :reduce-reader [#^JobConf jobconf key value] 116 | (.set jobconf "clojure-hadoop.job.reduce.reader" (as-str value))) 117 | 118 | ;; The reducer writer function; converts native Clojure types to 119 | ;; Hadoop Writable types. 120 | (defmethod conf :reduce-writer [#^JobConf jobconf key value] 121 | (.set jobconf "clojure-hadoop.job.reduce.writer" (as-str value))) 122 | 123 | ;; The input file format. May be a class name or "text" for 124 | ;; TextInputFormat, "kvtext" for KeyValueTextInputFormat, "seq" for 125 | ;; SequenceFileInputFormat. 126 | (defmethod conf :input-format [#^JobConf jobconf key value] 127 | (let [value (as-str value)] 128 | (cond 129 | (= "text" value) 130 | (.setInputFormat jobconf TextInputFormat) 131 | 132 | (= "kvtext" value) 133 | (.setInputFormat jobconf KeyValueTextInputFormat) 134 | 135 | (= "seq" value) 136 | (.setInputFormat jobconf SequenceFileInputFormat) 137 | 138 | :else 139 | (.setInputFormat jobconf (Class/forName value))))) 140 | 141 | ;; The output file format. May be a class name or "text" for 142 | ;; TextOutputFormat, "seq" for SequenceFileOutputFormat. 143 | (defmethod conf :output-format [#^JobConf jobconf key value] 144 | (let [value (as-str value)] 145 | (cond 146 | (= "text" value) 147 | (.setOutputFormat jobconf TextOutputFormat) 148 | 149 | (= "seq" value) 150 | (.setOutputFormat jobconf SequenceFileOutputFormat) 151 | 152 | :else 153 | (.setOutputFormat jobconf (Class/forName value))))) 154 | 155 | ;; If true, compress job output files. 156 | (defmethod conf :compress-output [#^JobConf jobconf key value] 157 | (cond 158 | (= "true" (as-str value)) 159 | (FileOutputFormat/setCompressOutput jobconf true) 160 | 161 | (= "false" (as-str value)) 162 | (FileOutputFormat/setCompressOutput jobconf false) 163 | 164 | :else 165 | (throw (Exception. "compress-output value must be true or false")))) 166 | 167 | ;; Codec to use for compressing job output files. 
168 | (defmethod conf :output-compressor [#^JobConf jobconf key value] 169 | (cond 170 | (= "default" (as-str value)) 171 | (FileOutputFormat/setOutputCompressorClass 172 | jobconf DefaultCodec) 173 | 174 | (= "gzip" (as-str value)) 175 | (FileOutputFormat/setOutputCompressorClass 176 | jobconf GzipCodec) 177 | 178 | (= "lzo" (as-str value)) 179 | (FileOutputFormat/setOutputCompressorClass 180 | jobconf LzoCodec) 181 | 182 | :else 183 | (FileOutputFormat/setOutputCompressorClass 184 | jobconf (Class/forName value)))) 185 | 186 | ;; Type of compression to use for sequence files. 187 | (defmethod conf :compression-type [#^JobConf jobconf key value] 188 | (cond 189 | (= "block" (as-str value)) 190 | (SequenceFileOutputFormat/setOutputCompressionType 191 | jobconf SequenceFile$CompressionType/BLOCK) 192 | 193 | (= "none" (as-str value)) 194 | (SequenceFileOutputFormat/setOutputCompressionType 195 | jobconf SequenceFile$CompressionType/NONE) 196 | 197 | (= "record" (as-str value)) 198 | (SequenceFileOutputFormat/setOutputCompressionType 199 | jobconf SequenceFile$CompressionType/RECORD))) 200 | 201 | (defn parse-command-line-args [#^JobConf jobconf args] 202 | (when (empty? args) 203 | (throw (Exception. "Missing required options."))) 204 | (when-not (even? (count args)) 205 | (throw (Exception. "Number of options must be even."))) 206 | (doseq [[k v] (partition 2 args)] 207 | (conf jobconf (keyword (subs k 1)) v))) 208 | 209 | (defn print-usage [] 210 | (println "Usage: java -cp [jars...] clojure_hadoop.job [options...] 211 | Required options are: 212 | -input comma-separated input paths 213 | -output output path 214 | -map mapper function, as namespace/name or class name 215 | -reduce reducer function, as namespace/name or class name 216 | OR 217 | -job job definition function, as namespace/name 218 | 219 | Mapper or reducer function may also be \"identity\". 220 | Reducer function may also be \"none\". 
221 | 222 | Other available options are: 223 | -input-format Class name or \"text\" or \"seq\" (SeqFile) 224 | -output-format Class name or \"text\" or \"seq\" (SeqFile) 225 | -output-key Class for job output key 226 | -output-value Class for job output value 227 | -map-output-key Class for intermediate Mapper output key 228 | -map-output-value Class for intermediate Mapper output value 229 | -map-reader Mapper reader function, as namespace/name 230 | -map-writer Mapper writer function, as namespace/name 231 | -reduce-reader Reducer reader function, as namespace/name 232 | -reduce-writer Reducer writer function, as namespace/name 233 | -name Job name 234 | -replace If \"true\", deletes output dir before start 235 | -compress-output If \"true\", compress job output files 236 | -output-compressor Compression class or \"gzip\",\"lzo\",\"default\" 237 | -compression-type For seqfiles, compress \"block\",\"record\",\"none\" 238 | ")) 239 | 240 | -------------------------------------------------------------------------------- /src/main/clojure/clojure_hadoop/defjob.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.defjob 2 | (:require [clojure-hadoop.job :as job])) 3 | 4 | (defn- full-name 5 | "Returns the fully-qualified name for a symbol s, either a class or 6 | a var, resolved in the current namespace." 7 | [s] 8 | (if-let [v (resolve s)] 9 | (cond (var? v) (let [m (meta v)] 10 | (str (ns-name (:ns m)) \/ 11 | (name (:name m)))) 12 | (class? v) (.getName #^Class v)) 13 | (throw (Exception. (str "Symbol not found: " s))))) 14 | 15 | (defmacro defjob 16 | "Defines a job function. Options are the same as those in 17 | clojure-hadoop.config. 18 | 19 | A job function may be given as the -job argument to 20 | clojure_hadoop.job to run a job." 21 | [sym & options] 22 | (let [args (reduce (fn [m [k v]] 23 | (assoc m k 24 | (cond (keyword? v) (name v) 25 | (string? v) v 26 | (symbol? v) (full-name v) 27 | (instance? 
Boolean v) (str v) 28 | :else (throw (Exception. "defjob arguments must be strings, symbols, or keywords"))))) 29 | {} (apply hash-map options))] 30 | `(defn ~sym [] ~args))) 31 | -------------------------------------------------------------------------------- /src/main/clojure/clojure_hadoop/gen.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.gen 2 | ;;#^{:doc "Class-generation helpers for writing Hadoop jobs in Clojure."} 3 | ) 4 | 5 | (defmacro gen-job-classes 6 | "Creates gen-class forms for Hadoop job classes from the current 7 | namespace. Now you only need to write three functions: 8 | 9 | (defn mapper-map [this key value output reporter] ...) 10 | 11 | (defn reducer-reduce [this key values output reporter] ...) 12 | 13 | (defn tool-run [& args] ...) 14 | 15 | The first two functions are the standard map/reduce functions in any 16 | Hadoop job. 17 | 18 | The third function, tool-run, will be called by the Hadoop framework 19 | to start your job, with the arguments from the command line. It 20 | should set up the JobConf object and call JobClient/runJob, then 21 | return zero on success. 22 | 23 | You must also call gen-main-method to create the main method. 24 | 25 | After compiling your namespace, you can run it as a Hadoop job using 26 | the standard Hadoop command-line tools." 
27 | [] 28 | (let [the-name (.replace (str (ns-name *ns*)) \- \_)] 29 | `(do 30 | (gen-class 31 | :name ~the-name 32 | :extends "org.apache.hadoop.conf.Configured" 33 | :implements ["org.apache.hadoop.util.Tool"] 34 | :prefix "tool-" 35 | :main true) 36 | (gen-class 37 | :name ~(str the-name "_mapper") 38 | :extends "org.apache.hadoop.mapred.MapReduceBase" 39 | :implements ["org.apache.hadoop.mapred.Mapper"] 40 | :prefix "mapper-") 41 | (gen-class 42 | :name ~(str the-name "_reducer") 43 | :extends "org.apache.hadoop.mapred.MapReduceBase" 44 | :implements ["org.apache.hadoop.mapred.Reducer"] 45 | :prefix "reducer-")))) 46 | 47 | (defn gen-main-method 48 | "Adds a standard main method, named tool-main, to the current 49 | namespace." 50 | [] 51 | (let [the-name (.replace (str (ns-name *ns*)) \- \_)] 52 | (intern *ns* 'tool-main 53 | (fn [& args] 54 | (System/exit 55 | (org.apache.hadoop.util.ToolRunner/run 56 | (new org.apache.hadoop.conf.Configuration) 57 | (. (Class/forName the-name) newInstance) 58 | (into-array String args))))))) 59 | -------------------------------------------------------------------------------- /src/main/clojure/clojure_hadoop/imports.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.imports 2 | ;;#^{:doc "Functions to import entire packages under org.apache.hadoop."} 3 | ) 4 | 5 | (defn import-io 6 | "Imports all classes/interfaces/exceptions from the package 7 | org.apache.hadoop.io into the current namespace." 
8 | [] 9 | (import '(org.apache.hadoop.io RawComparator 10 | SequenceFile$Sorter$RawKeyValueIterator SequenceFile$ValueBytes 11 | Stringifier Writable WritableComparable WritableFactory 12 | AbstractMapWritable ArrayFile ArrayFile$Reader ArrayFile$Writer 13 | ArrayWritable BooleanWritable BooleanWritable$Comparator BytesWritable 14 | BytesWritable$Comparator ByteWritable ByteWritable$Comparator 15 | CompressedWritable DataInputBuffer DataOutputBuffer DefaultStringifier 16 | DoubleWritable DoubleWritable$Comparator FloatWritable 17 | FloatWritable$Comparator GenericWritable InputBuffer LongWritable 18 | LongWritable$Comparator IOUtils IOUtils$NullOutputStream LongWritable 19 | LongWritable$Comparator LongWritable$DecreasingComparator MapFile 20 | MapFile$Reader MapFile$Writer MapWritable MD5Hash MD5Hash$Comparator 21 | NullWritable NullWritable$Comparator ObjectWritable OutputBuffer 22 | SequenceFile SequenceFile$Metadata SequenceFile$Reader 23 | SequenceFile$Sorter SequenceFile$Writer SetFile SetFile$Reader 24 | SetFile$Writer SortedMapWritable Text Text$Comparator 25 | TwoDArrayWritable UTF8 UTF8$Comparator VersionedWritable VLongWritable 26 | VLongWritable WritableComparator WritableFactories WritableName 27 | WritableUtils SequenceFile$CompressionType MultipleIOException 28 | VersionMismatchException))) 29 | 30 | (defn import-io-compress 31 | "Imports all classes/interfaces/exceptions from the package 32 | org.apache.hadoop.io.compress into the current namespace." 33 | [] 34 | (import '(org.apache.hadoop.io.compress CompressionCodec Compressor 35 | Decompressor BlockCompressorStream BlockDecompressorStream CodecPool 36 | CompressionCodecFactory CompressionInputStream CompressionOutputStream 37 | CompressorStream DecompressorStream DefaultCodec GzipCodec 38 | GzipCodec$GzipInputStream GzipCodec$GzipOutputStream))) 39 | 40 | (defn import-fs 41 | "Imports all classes/interfaces/exceptions from the package 42 | org.apache.hadoop.fs into the current namespace." 
43 | [] 44 | (import '(org.apache.hadoop.fs PathFilter PositionedReadable 45 | Seekable Syncable BlockLocation BufferedFSInputStream 46 | ChecksumFileSystem ContentSummary DF DU FileStatus FileSystem 47 | FileSystem$Statistics FileUtil FileUtil$HardLink FilterFileSystem 48 | FSDataInputStream FSDataOutputStream FSInputChecker FSInputStream 49 | FSOutputSummer FsShell FsUrlStreamHandlerFactory HarFileSystem 50 | InMemoryFileSystem LocalDirAllocator LocalFileSystem Path 51 | RawLocalFileSystem Trash ChecksumException FSError))) 52 | 53 | (defn import-mapred 54 | "Imports all classes/interfaces/exceptions from the package 55 | org.apache.hadoop.mapred into the current namespace." 56 | [] 57 | (import '(org.apache.hadoop.mapred InputFormat InputSplit 58 | JobConfigurable JobHistory$Listener Mapper MapRunnable OutputCollector 59 | OutputFormat Partitioner RawKeyValueIterator RecordReader RecordWriter 60 | Reducer Reporter RunningJob SequenceFileInputFilter$Filter 61 | ClusterStatus Counters Counters$Counter Counters$Group 62 | DefaultJobHistoryParser FileInputFormat FileOutputFormat FileSplit ID 63 | IsolationRunner JobClient JobConf JobEndNotifier JobHistory 64 | JobHistory$HistoryCleaner JobHistory$JobInfo JobHistory$MapAttempt 65 | JobHistory$ReduceAttempt JobHistory$Task JobHistory$TaskAttempt JobID 66 | JobProfile JobStatus JobTracker KeyValueLineRecordReader 67 | KeyValueTextInputFormat LineRecordReader LineRecordReader$LineReader 68 | MapFileOutputFormat MapReduceBase MapRunner MultiFileInputFormat 69 | MultiFileSplit OutputLogFilter SequenceFileAsBinaryInputFormat 70 | SequenceFileAsBinaryInputFormat$SequenceFileAsBinaryRecordReader 71 | SequenceFileAsBinaryOutputFormat 72 | SequenceFileAsBinaryOutputFormat$WritableValueBytes 73 | SequenceFileAsTextInputFormat SequenceFileAsTextRecordReader 74 | SequenceFileInputFilter SequenceFileInputFilter$FilterBase 75 | SequenceFileInputFilter$MD5Filter 76 | SequenceFileInputFilter$PercentFilter 77 | 
SequenceFileInputFilter$RegexFilter SequenceFileInputFormat 78 | SequenceFileOutputFormat SequenceFileRecordReader TaskAttemptID 79 | TaskCompletionEvent TaskID TaskLog TaskLogAppender TaskLogServlet 80 | TaskReport TaskTracker TaskTracker$MapOutputServlet TextInputFormat 81 | TextOutputFormat TextOutputFormat$LineRecordWriter 82 | JobClient$TaskStatusFilter JobHistory$Keys JobHistory$RecordTypes 83 | JobHistory$Values JobPriority JobTracker$State 84 | TaskCompletionEvent$Status TaskLog$LogName FileAlreadyExistsException 85 | InvalidFileTypeException InvalidInputException InvalidJobConfException 86 | JobTracker$IllegalStateException))) 87 | 88 | (defn import-mapred-lib 89 | "Imports all classes/interfaces/exceptions from the package 90 | org.apache.hadoop.mapred.lib into the current namespace." 91 | [] 92 | (import '(org.apache.hadoop.mapred.lib FieldSelectionMapReduce 93 | HashPartitioner IdentityMapper IdentityReducer InverseMapper 94 | KeyFieldBasedPartitioner LongSumReducer MultipleOutputFormat 95 | MultipleSequenceFileOutputFormat MultipleTextOutputFormat 96 | MultithreadedMapRunner NLineInputFormat NullOutputFormat RegexMapper 97 | TokenCountMapper))) 98 | 99 | 100 | -------------------------------------------------------------------------------- /src/main/clojure/clojure_hadoop/job.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.job 2 | (:require [clojure-hadoop.gen :as gen] 3 | [clojure-hadoop.imports :as imp] 4 | [clojure-hadoop.wrap :as wrap] 5 | [clojure-hadoop.config :as config] 6 | [clojure-hadoop.load :as load]) 7 | (:import (org.apache.hadoop.util Tool))) 8 | 9 | (imp/import-io) 10 | (imp/import-io-compress) 11 | (imp/import-fs) 12 | (imp/import-mapred) 13 | 14 | (gen/gen-job-classes) 15 | (gen/gen-main-method) 16 | 17 | (def #^JobConf *jobconf* nil) 18 | 19 | (def #^{:private true} method-fn-name 20 | {"map" "mapper-map" 21 | "reduce" "reducer-reduce"}) 22 | 23 | (def #^{:private true} 
wrapper-fn 24 | {"map" wrap/wrap-map 25 | "reduce" wrap/wrap-reduce}) 26 | 27 | (def #^{:private true} default-reader 28 | {"map" wrap/clojure-map-reader 29 | "reduce" wrap/clojure-reduce-reader}) 30 | 31 | (defn- configure-functions 32 | "Preps the mapper or reducer with a Clojure function read from the 33 | job configuration. Called from Mapper.configure and 34 | Reducer.configure." 35 | [type #^JobConf jobconf] 36 | (alter-var-root (var *jobconf*) (fn [_] jobconf)) 37 | (let [function (load/load-name (.get jobconf (str "clojure-hadoop.job." type))) 38 | reader (if-let [v (.get jobconf (str "clojure-hadoop.job." type ".reader"))] 39 | (load/load-name v) 40 | (default-reader type)) 41 | writer (if-let [v (.get jobconf (str "clojure-hadoop.job." type ".writer"))] 42 | (load/load-name v) 43 | wrap/clojure-writer)] 44 | (assert (fn? function)) 45 | (alter-var-root (ns-resolve (the-ns 'clojure-hadoop.job) 46 | (symbol (method-fn-name type))) 47 | (fn [_] ((wrapper-fn type) function reader writer))))) 48 | 49 | ;;; CREATING AND CONFIGURING JOBS 50 | 51 | (defn- parse-command-line [jobconf args] 52 | (try 53 | (config/parse-command-line-args jobconf args) 54 | (catch Exception e 55 | (prn e) 56 | (config/print-usage) 57 | (System/exit 1)))) 58 | 59 | (defn- handle-replace-option [#^JobConf jobconf] 60 | (when (= "true" (.get jobconf "clojure-hadoop.job.replace")) 61 | (let [fs (FileSystem/get jobconf) 62 | output (FileOutputFormat/getOutputPath jobconf)] 63 | (.delete fs output true)))) 64 | 65 | (defn- set-default-config [#^JobConf jobconf] 66 | (doto jobconf 67 | (.setJobName "clojure_hadoop.job") 68 | (.setOutputKeyClass Text) 69 | (.setOutputValueClass Text) 70 | (.setMapperClass (Class/forName "clojure_hadoop.job_mapper")) 71 | (.setReducerClass (Class/forName "clojure_hadoop.job_reducer")) 72 | (.setInputFormat SequenceFileInputFormat) 73 | (.setOutputFormat SequenceFileOutputFormat) 74 | (FileOutputFormat/setCompressOutput true) 75 | 
(SequenceFileOutputFormat/setOutputCompressionType 76 | SequenceFile$CompressionType/BLOCK))) 77 | 78 | (defn run 79 | "Runs a Hadoop job given the JobConf object." 80 | [jobconf] 81 | (doto jobconf 82 | (handle-replace-option) 83 | (JobClient/runJob))) 84 | 85 | 86 | ;;; MAPPER METHODS 87 | 88 | (defn mapper-configure [this jobconf] 89 | (configure-functions "map" jobconf)) 90 | 91 | (defn mapper-map [this wkey wvalue output reporter] 92 | (throw (Exception. "Mapper function not defined."))) 93 | 94 | ;;; REDUCER METHODS 95 | 96 | (defn reducer-configure [this jobconf] 97 | (configure-functions "reduce" jobconf)) 98 | 99 | (defn reducer-reduce [this wkey wvalues output reporter] 100 | (throw (Exception. "Reducer function not defined."))) 101 | 102 | ;;; TOOL METHODS 103 | 104 | (defn tool-run [#^Tool this args] 105 | (doto (JobConf. (.getConf this) (.getClass this)) 106 | (set-default-config) 107 | (parse-command-line args) 108 | (run)) 109 | 0) 110 | 111 | -------------------------------------------------------------------------------- /src/main/clojure/clojure_hadoop/load.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.load) 2 | 3 | (defn load-name 4 | "Loads and returns the value of a namespace-qualified string naming 5 | a symbol. If the namespace is not currently loaded it will be 6 | require'd." 
7 | [#^String s] 8 | (let [[ns-name fn-name] (.split s "/")] 9 | (when-not (find-ns (symbol ns-name)) 10 | (require (symbol ns-name))) 11 | (assert (find-ns (symbol ns-name))) 12 | (deref (resolve (symbol ns-name fn-name))))) 13 | 14 | -------------------------------------------------------------------------------- /src/main/clojure/clojure_hadoop/wrap.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.wrap 2 | ;;#^{:doc "Map/Reduce wrappers that set up common input/output 3 | ;;conversions for Clojure jobs."} 4 | (:require [clojure-hadoop.imports :as imp])) 5 | 6 | (imp/import-io) 7 | (imp/import-mapred) 8 | 9 | (declare #^Reporter *reporter*) 10 | 11 | (defn string-map-reader 12 | "Returns a [key value] pair by calling .toString on the Writable key 13 | and value." 14 | [#^Writable wkey #^Writable wvalue] 15 | [(.toString wkey) (.toString wvalue)]) 16 | 17 | (defn int-string-map-reader [#^LongWritable wkey #^Writable wvalue] 18 | [(.get wkey) (.toString wvalue)]) 19 | 20 | (defn clojure-map-reader 21 | "Returns a [key value] pair by calling read-string on the string 22 | representations of the Writable key and value." 23 | [#^Writable wkey #^Writable wvalue] 24 | [(read-string (.toString wkey)) (read-string (.toString wvalue))]) 25 | 26 | (defn clojure-reduce-reader 27 | "Returns a [key seq-of-values] pair by calling read-string on the 28 | string representations of the Writable key and values." 29 | [#^Writable wkey wvalues] 30 | [(read-string (.toString wkey)) 31 | (fn [] (map (fn [#^Writable v] (read-string (.toString v))) 32 | (iterator-seq wvalues)))]) 33 | 34 | (defn clojure-writer 35 | "Sends key and value to the OutputCollector by calling pr-str on key 36 | and value and wrapping them in Hadoop Text objects." 37 | [#^OutputCollector output key value] 38 | (binding [*print-dup* true] 39 | (.collect output (Text. (pr-str key)) (Text. 
(pr-str value))))) 40 | 41 | (defn wrap-map 42 | "Returns a function implementing the Mapper.map interface. 43 | 44 | f is a function of two arguments, key and value. 45 | 46 | f must return a *sequence* of *pairs* like 47 | [[key1 value1] [key2 value2] ...] 48 | 49 | When f is called, *reporter* is bound to the Hadoop Reporter. 50 | 51 | reader is a function that receives the Writable key and value from 52 | Hadoop and returns a [key value] pair for f. 53 | 54 | writer is a function that receives each [key value] pair returned by 55 | f and sends the appropriately-typed arguments to the Hadoop 56 | OutputCollector. 57 | 58 | If not given, reader and writer default to clojure-map-reader and 59 | clojure-writer, respectively." 60 | ([f] (wrap-map f clojure-map-reader clojure-writer)) 61 | ([f reader] (wrap-map f reader clojure-writer)) 62 | ([f reader writer] 63 | (fn [this wkey wvalue output reporter] 64 | (binding [*reporter* reporter] 65 | (doseq [pair (apply f (reader wkey wvalue))] 66 | (apply writer output pair)))))) 67 | 68 | (defn wrap-reduce 69 | "Returns a function implementing the Reducer.reduce interface. 70 | 71 | f is a function of two arguments. First argument is the key, second 72 | argument is a function, which takes no arguments and returns a lazy 73 | sequence of values. 74 | 75 | f must return a *sequence* of *pairs* like 76 | [[key1 value1] [key2 value2] ...] 77 | 78 | When f is called, *reporter* is bound to the Hadoop Reporter. 79 | 80 | reader is a function that receives the Writable key and value from 81 | Hadoop and returns a [key values-function] pair for f. 82 | 83 | writer is a function that receives each [key value] pair returned by 84 | f and sends the appropriately-typed arguments to the Hadoop 85 | OutputCollector. 86 | 87 | If not given, reader and writer default to clojure-reduce-reader and 88 | clojure-writer, respectively." 
89 | ([f] (wrap-reduce f clojure-reduce-reader clojure-writer)) 90 | ([f writer] (wrap-reduce f clojure-reduce-reader writer)) 91 | ([f reader writer] 92 | (fn [this wkey wvalues output reporter] 93 | (binding [*reporter* reporter] 94 | (doseq [pair (apply f (reader wkey wvalues))] 95 | (apply writer output pair)))))) 96 | -------------------------------------------------------------------------------- /src/test/clojure/clojure_hadoop/test_imports.clj: -------------------------------------------------------------------------------- 1 | (ns clojure-hadoop.test-imports 2 | (:require [clojure-hadoop.imports :as imp]) 3 | (:use clojure.test)) 4 | 5 | (deftest test-imports 6 | (imp/import-io) 7 | (imp/import-io-compress) 8 | (imp/import-fs) 9 | (imp/import-mapred) 10 | (imp/import-mapred-lib) 11 | (imp/import-util)) 12 | --------------------------------------------------------------------------------