THE ACCOMPANYING PROGRAM IS PROVIDED UNDER THE TERMS OF THIS ECLIPSE
PUBLIC LICENSE ("AGREEMENT"). ANY USE, REPRODUCTION OR DISTRIBUTION
OF THE PROGRAM CONSTITUTES RECIPIENT'S ACCEPTANCE OF THIS AGREEMENT.

1. DEFINITIONS

"Contribution" means:

a) in the case of the initial Contributor, the initial code and
documentation distributed under this Agreement, and

b) in the case of each subsequent Contributor:

i) changes to the Program, and

ii) additions to the Program;

where such changes and/or additions to the Program originate from and
are distributed by that particular Contributor. A Contribution
'originates' from a Contributor if it was added to the Program by such
Contributor itself or anyone acting on such Contributor's behalf.
Contributions do not include additions to the Program which: (i) are
separate modules of software distributed in conjunction with the
Program under their own license agreement, and (ii) are not derivative
works of the Program.

"Contributor" means any person or entity that distributes the Program.

"Licensed Patents" mean patent claims licensable by a Contributor
which are necessarily infringed by the use or sale of its Contribution
alone or when combined with the Program.

"Program" means the Contributions distributed in accordance with this
Agreement.

"Recipient" means anyone who receives the Program under this
Agreement, including all Contributors.

2. GRANT OF RIGHTS

a) Subject to the terms of this Agreement, each Contributor hereby
grants Recipient a non-exclusive, worldwide, royalty-free copyright
license to reproduce, prepare derivative works of, publicly display,
publicly perform, distribute and sublicense the Contribution of such
Contributor, if any, and such derivative works, in source code and
object code form.

b) Subject to the terms of this Agreement, each Contributor hereby
grants Recipient a non-exclusive, worldwide, royalty-free patent
license under Licensed Patents to make, use, sell, offer to sell,
import and otherwise transfer the Contribution of such Contributor,
if any, in source code and object code form. This patent license
shall apply to the combination of the Contribution and the Program
if, at the time the Contribution is added by the Contributor, such
addition of the Contribution causes such combination to be covered by
the Licensed Patents. The patent license shall not apply to any other
combinations which include the Contribution. No hardware per se is
licensed hereunder.

c) Recipient understands that although each Contributor grants the
licenses to its Contributions set forth herein, no assurances are
provided by any Contributor that the Program does not infringe the
patent or other intellectual property rights of any other entity.
Each Contributor disclaims any liability to Recipient for claims
brought by any other entity based on infringement of intellectual
property rights or otherwise. As a condition to exercising the rights
and licenses granted hereunder, each Recipient hereby assumes sole
responsibility to secure any other intellectual property rights
needed, if any. For example, if a third party patent license is
required to allow Recipient to distribute the Program, it is
Recipient's responsibility to acquire that license before
distributing the Program.

d) Each Contributor represents that to its knowledge it has
sufficient copyright rights in its Contribution, if any, to grant the
copyright license set forth in this Agreement.

3. REQUIREMENTS

A Contributor may choose to distribute the Program in object code
form under its own license agreement, provided that:

a) it complies with the terms and conditions of this Agreement; and

b) its license agreement:

i) effectively disclaims on behalf of all Contributors all warranties
and conditions, express and implied, including warranties or
conditions of title and non-infringement, and implied warranties or
conditions of merchantability and fitness for a particular purpose;

ii) effectively excludes on behalf of all Contributors all liability
for damages, including direct, indirect, special, incidental and
consequential damages, such as lost profits;

iii) states that any provisions which differ from this Agreement are
offered by that Contributor alone and not by any other party; and

iv) states that source code for the Program is available from such
Contributor, and informs licensees how to obtain it in a reasonable
manner on or through a medium customarily used for software exchange.

When the Program is made available in source code form:

a) it must be made available under this Agreement; and

b) a copy of this Agreement must be included with each copy of the
Program.

Contributors may not remove or alter any copyright notices contained
within the Program.

Each Contributor must identify itself as the originator of its
Contribution, if any, in a manner that reasonably allows subsequent
Recipients to identify the originator of the Contribution.

4. COMMERCIAL DISTRIBUTION

Commercial distributors of software may accept certain
responsibilities with respect to end users, business partners and the
like. While this license is intended to facilitate the commercial use
of the Program, the Contributor who includes the Program in a
commercial product offering should do so in a manner which does not
create potential liability for other Contributors. Therefore, if a
Contributor includes the Program in a commercial product offering,
such Contributor ("Commercial Contributor") hereby agrees to defend
and indemnify every other Contributor ("Indemnified Contributor")
against any losses, damages and costs (collectively "Losses") arising
from claims, lawsuits and other legal actions brought by a third
party against the Indemnified Contributor to the extent caused by the
acts or omissions of such Commercial Contributor in connection with
its distribution of the Program in a commercial product offering. The
obligations in this section do not apply to any claims or Losses
relating to any actual or alleged intellectual property infringement.
In order to qualify, an Indemnified Contributor must: a) promptly
notify the Commercial Contributor in writing of such claim, and b)
allow the Commercial Contributor to control, and cooperate with the
Commercial Contributor in, the defense and any related settlement
negotiations. The Indemnified Contributor may participate in any such
claim at its own expense.

For example, a Contributor might include the Program in a commercial
product offering, Product X. That Contributor is then a Commercial
Contributor. If that Commercial Contributor then makes performance
claims, or offers warranties related to Product X, those performance
claims and warranties are such Commercial Contributor's
responsibility alone. Under this section, the Commercial Contributor
would have to defend claims against the other Contributors related to
those performance claims and warranties, and if a court requires any
other Contributor to pay any damages as a result, the Commercial
Contributor must pay those damages.

5. NO WARRANTY

EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, THE PROGRAM IS
PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT LIMITATION, ANY
WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY
OR FITNESS FOR A PARTICULAR PURPOSE. Each Recipient is solely
responsible for determining the appropriateness of using and
distributing the Program and assumes all risks associated with its
exercise of rights under this Agreement, including but not limited to
the risks and costs of program errors, compliance with applicable
laws, damage to or loss of data, programs or equipment, and
unavailability or interruption of operations.

6. DISCLAIMER OF LIABILITY

EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, NEITHER RECIPIENT
NOR ANY CONTRIBUTORS SHALL HAVE ANY LIABILITY FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING WITHOUT LIMITATION LOST PROFITS), HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
THE USE OR DISTRIBUTION OF THE PROGRAM OR THE EXERCISE OF ANY RIGHTS
GRANTED HEREUNDER, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.

7. GENERAL

If any provision of this Agreement is invalid or unenforceable under
applicable law, it shall not affect the validity or enforceability of
the remainder of the terms of this Agreement, and without further
action by the parties hereto, such provision shall be reformed to the
minimum extent necessary to make such provision valid and
enforceable.

If Recipient institutes patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that
the Program itself (excluding combinations of the Program with other
software or hardware) infringes such Recipient's patent(s), then such
Recipient's rights granted under Section 2(b) shall terminate as of
the date such litigation is filed.

All Recipient's rights under this Agreement shall terminate if it
fails to comply with any of the material terms or conditions of this
Agreement and does not cure such failure in a reasonable period of
time after becoming aware of such noncompliance. If all Recipient's
rights under this Agreement terminate, Recipient agrees to cease use
and distribution of the Program as soon as reasonably practicable.
However, Recipient's obligations under this Agreement and any
licenses granted by Recipient relating to the Program shall continue
and survive.

Everyone is permitted to copy and distribute copies of this
Agreement, but in order to avoid inconsistency the Agreement is
copyrighted and may only be modified in the following manner. The
Agreement Steward reserves the right to publish new versions
(including revisions) of this Agreement from time to time. No one
other than the Agreement Steward has the right to modify this
Agreement. The Eclipse Foundation is the initial Agreement Steward.
The Eclipse Foundation may assign the responsibility to serve as the
Agreement Steward to a suitable separate entity. Each new version of
the Agreement will be given a distinguishing version number. The
Program (including Contributions) may always be distributed subject
to the version of the Agreement under which it was received. In
addition, after a new version of the Agreement is published,
Contributor may elect to distribute the Program (including its
Contributions) under the new version. Except as expressly stated in
Sections 2(a) and 2(b) above, Recipient receives no rights or
licenses to the intellectual property of any Contributor under this
Agreement, whether expressly, by implication, estoppel or otherwise.
All rights in the Program not expressly granted under this Agreement
are reserved.

This Agreement is governed by the laws of the State of New York and
the intellectual property laws of the United States of America. No
party to this Agreement will bring a legal action under this
Agreement more than one year after the cause of action arose. Each
party waives its rights to a jury trial in any resulting litigation.
--------------------------------------------------------------------------------
/README.txt:
--------------------------------------------------------------------------------
UP-TO-DATE fork with more recent maintenance is here:
https://github.com/alexott/clojure-hadoop


clojure-hadoop

A library to assist in writing Hadoop MapReduce jobs in Clojure.

by Stuart Sierra
http://stuartsierra.com/

For stable releases, see
http://stuartsierra.com/software/clojure-hadoop

For more information
on Clojure, http://clojure.org/
on Hadoop, http://hadoop.apache.org/

Also see my presentation about this library at
http://vimeo.com/7669741


Copyright (c) Stuart Sierra, 2009. All rights reserved. The use and
distribution terms for this software are covered by the Eclipse Public
License 1.0 (http://opensource.org/licenses/eclipse-1.0.php) which can
be found in the file LICENSE.html at the root of this distribution.
By using this software in any fashion, you are agreeing to be bound by
the terms of this license. You must not remove this notice, or any
other, from this software.



DEPENDENCIES

This library requires the Java 6 JDK, http://java.sun.com/

Building from source requires Apache Maven 2, http://maven.apache.org/



BUILDING

If you downloaded the library distribution as a .zip or .tar file,
everything is pre-built and there is nothing you need to do.

If you downloaded the sources from Git, then you need to run the build
with Maven. In the top-level directory of this project, run:

    mvn assembly:assembly

This compiles and builds the JAR files.

You can find these files in the "target" directory (replace ${VERSION}
with the current version number of this library):

clojure-hadoop-${VERSION}-examples.jar :

    This JAR contains all dependencies, including all of Hadoop
    0.18.3. You can use this JAR to run the example MapReduce jobs
    from the command line. This file is ONLY for running the examples.

clojure-hadoop-${VERSION}-job.jar :

    This JAR contains the clojure-hadoop libraries and Clojure 1.0.
    It is suitable for inclusion in the "lib" directory of a JAR file
    submitted as a Hadoop job.

clojure-hadoop-${VERSION}.jar :

    This JAR contains ONLY the clojure-hadoop libraries. It can be
    placed in the "lib" directory of a JAR file submitted as a Hadoop
    job; that JAR must also include the Clojure 1.0 JAR.



RUNNING THE EXAMPLES

After building, copy the file from

    target/clojure-hadoop-${VERSION}-examples.jar

to something short, like "examples.jar". Each of the *.clj files in
the src/examples directory contains instructions for running that
example.



USING THE LIBRARY IN HADOOP

After building, include the "clojure-hadoop-${VERSION}-job.jar" file
in the lib/ directory of the JAR you submit as your Hadoop job.
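The layout that paragraph describes can be sketched as follows. This
is only an illustration: the directory and file names below ("job/",
"my_app", the stub JAR name) are assumptions, and the empty stub files
stand in for your real compiled classes and the built job JAR.

```shell
# Sketch of the tree that gets packaged into the JAR you submit.
# All files created here are empty stand-ins for real artifacts.
mkdir -p job/lib job/my_app
touch job/my_app/my_job.class        # your compiled job classes
touch job/lib/clojure-hadoop-job.jar # the -job.jar described above
# With a JDK on the PATH, the submit JAR would then be built with:
#   jar cf myjob.jar -C job .
ls job/lib
```

Hadoop adds the JARs found under lib/ inside the submitted JAR to the
task classpath, which is why the clojure-hadoop job JAR (and, if you
use the plain clojure-hadoop JAR instead, the Clojure JAR as well)
must sit in that directory.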


DEPENDING ON THE LIBRARY WITH MAVEN

You can depend on clojure-hadoop in your Maven 2 projects by adding
the following lines to your pom.xml:

    ...
    <dependencies>
      ...
      <dependency>
        <groupId>com.stuartsierra</groupId>
        <artifactId>clojure-hadoop</artifactId>
        <version>${VERSION}</version>
      </dependency>
      ...
    </dependencies>
    ...
    <repositories>
      ...
      <repository>
        <id>stuartsierra-releases</id>
        <name>Stuart Sierra's personal Maven 2 release repository</name>
        <url>http://stuartsierra.com/maven2</url>
      </repository>
      <repository>
        <id>stuartsierra-snapshots</id>
        <name>Stuart Sierra's personal Maven 2 SNAPSHOT repository</name>
        <url>http://stuartsierra.com/m2snapshots</url>
      </repository>
      ...
    </repositories>

USING THE LIBRARY

This library provides different layers of abstraction away from the
raw Hadoop API.

Layer 1: clojure-hadoop.imports

    Provides convenience functions for importing the many classes and
    interfaces in the Hadoop API.

Layer 2: clojure-hadoop.gen

    Provides gen-class macros to generate the multiple classes needed
    for a MapReduce job. See the example file "wordcount1.clj" for a
    demonstration of these macros.

Layer 3: clojure-hadoop.wrap

    Provides wrapper functions that automatically convert between
    Hadoop Text objects and Clojure data structures. See the example
    file "wordcount2.clj" for a demonstration of these wrappers.

Layer 4: clojure-hadoop.job

    Provides a complete implementation of a Hadoop MapReduce job that
    can be dynamically configured to use any Clojure functions in the
    map and reduce phases. See the example file "wordcount3.clj" for
    a demonstration of this usage.

Layer 5: clojure-hadoop.defjob

    A convenient macro to configure MapReduce jobs with Clojure code.
    See the example files "wordcount4.clj" and "wordcount5.clj" for
    demonstrations of this macro.
--------------------------------------------------------------------------------
/pom.xml:
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.stuartsierra</groupId>
  <artifactId>clojure-hadoop</artifactId>
  <packaging>jar</packaging>
  <version>1.2.0-SNAPSHOT</version>
  <name>clojure-hadoop</name>
  <url>http://github.com/stuartsierra/clojure-hadoop</url>
  <developers>
    <developer>
      <id>stuartsierra</id>
      <name>Stuart Sierra</name>
      <email>mail@stuartsierra.com</email>
      <url>http://www.stuartsierra.com/</url>
    </developer>
  </developers>
  <licenses>
    <license>
      <name>Eclipse Public License 1.0</name>
      <url>http://opensource.org/licenses/eclipse-1.0.php</url>
      <distribution>repo</distribution>
      <comments>Same license as Clojure</comments>
    </license>
  </licenses>
  <dependencies>
    <dependency>
      <groupId>org.clojure</groupId>
      <artifactId>clojure</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core-with-dependencies</artifactId>
      <version>0.18.3</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/job.xml</descriptor>
            <descriptor>src/main/assembly/examples.xml</descriptor>
            <descriptor>src/main/assembly/dist.xml</descriptor>
          </descriptors>
        </configuration>
      </plugin>
      <plugin>
        <groupId>com.theoryinpractise</groupId>
        <artifactId>clojure-maven-plugin</artifactId>
        <version>1.0</version>
        <configuration>
          <sourceDirectories>
            <sourceDirectory>src/main/clojure</sourceDirectory>
            <sourceDirectory>src/examples/clojure</sourceDirectory>
          </sourceDirectories>
          <testSourceDirectories>
            <testSourceDirectory>src/test/clojure</testSourceDirectory>
          </testSourceDirectories>
        </configuration>
        <executions>
          <execution>
            <id>compile-clojure</id>
            <phase>compile</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
    <extensions>
      <extension>
        <groupId>org.apache.maven.wagon</groupId>
        <artifactId>wagon-ftp</artifactId>
        <version>1.0-beta-6</version>
      </extension>
    </extensions>
  </build>
  <repositories>
    <repository>
      <id>stuartsierra-releases</id>
      <name>Stuart Sierra's personal Maven 2 release repository</name>
      <url>http://stuartsierra.com/maven2</url>
    </repository>
  </repositories>
  <distributionManagement>
    <repository>
      <id>stuartsierra-releases</id>
      <name>Stuart Sierra's personal Maven 2 release repository</name>
      <url>ftp://stuartsierra.com/public_html/stuartsierra/maven2</url>
    </repository>
    <snapshotRepository>
      <id>stuartsierra-snapshots</id>
      <name>Stuart Sierra's personal Maven 2 SNAPSHOT repository</name>
      <url>ftp://stuartsierra.com/public_html/stuartsierra/m2snapshots</url>
    </snapshotRepository>
  </distributionManagement>
</project>
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount1.clj:
--------------------------------------------------------------------------------
;; wordcount1 -- low-level MapReduce example
;;
;; This namespace demonstrates how to use the lower layers of
;; abstraction provided by the clojure-hadoop library.
;;
;; This is the example word count program used in the Hadoop MapReduce
;; tutorials. As you can see, it is very similar to the Java code, and
;; uses the Hadoop API directly.
;;
;; We have to call gen-job-classes and gen-main-method, then define the
;; three functions mapper-map, reducer-reduce, and tool-run.
;;
;; To run this example, first compile it (see instructions in
;; README.txt), then run this command (all one line):
;;
;;   java -cp examples.jar \
;;        clojure_hadoop.examples.wordcount1 \
;;        README.txt out1
;;
;; This will count the instances of each word in README.txt and write
;; the results to out1/part-00000


(ns clojure-hadoop.examples.wordcount1
  (:require [clojure-hadoop.gen :as gen]
            [clojure-hadoop.imports :as imp])
  (:import (java.util StringTokenizer)
           (org.apache.hadoop.util Tool)))

(imp/import-io)     ;; for Text, LongWritable
(imp/import-fs)     ;; for Path
(imp/import-mapred) ;; for JobConf, JobClient

(gen/gen-job-classes) ;; generates Tool, Mapper, and Reducer classes
(gen/gen-main-method) ;; generates Tool.main method

(defn mapper-map
  "This is our implementation of the Mapper.map method. The key and
  value arguments are sub-classes of Hadoop's Writable interface, so
  we have to convert them to strings or some other type before we can
  use them. Likewise, we have to call the OutputCollector.collect
  method with objects that are sub-classes of Writable."
  [this key value #^OutputCollector output reporter]
  (doseq [word (enumeration-seq (StringTokenizer. (str value)))]
    (.collect output (Text. word) (LongWritable. 1))))

(defn reducer-reduce
  "This is our implementation of the Reducer.reduce method. The key
  argument is a sub-class of Hadoop's Writable, but 'values' is a Java
  Iterator that returns successive values. We have to use
  iterator-seq to get a Clojure sequence from the Iterator.

  Beware, however, that Hadoop re-uses a single object for every
  object returned by the Iterator. So when you get an object from the
  iterator, you must extract its value (as we do here with the 'get'
  method) immediately, before accepting the next value from the
  iterator. That is, you cannot hang on to past values from the
  iterator."
  [this key values #^OutputCollector output reporter]
  (let [sum (reduce + (map (fn [#^LongWritable v] (.get v))
                           (iterator-seq values)))]
    (.collect output key (LongWritable. sum))))

(defn tool-run
  "This is our implementation of the Tool.run method. args are the
  command-line arguments as a Java array of strings. We have to
  create a JobConf object, set all the MapReduce job parameters, then
  call the JobClient.runJob static method on it.

  This method must return zero on success or Hadoop will report that
  the job failed."
  [#^Tool this args]
  (doto (JobConf. (.getConf this) (.getClass this))
    (.setJobName "wordcount1")
    (.setOutputKeyClass Text)
    (.setOutputValueClass LongWritable)
    (.setMapperClass (Class/forName "clojure_hadoop.examples.wordcount1_mapper"))
    (.setReducerClass (Class/forName "clojure_hadoop.examples.wordcount1_reducer"))
    (.setInputFormat TextInputFormat)
    (.setOutputFormat TextOutputFormat)
    (FileInputFormat/setInputPaths (first args))
    (FileOutputFormat/setOutputPath (Path. (second args)))
    (JobClient/runJob))
  0)
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount2.clj:
--------------------------------------------------------------------------------
;; wordcount2 -- wrapped MapReduce example
;;
;; This namespace demonstrates how to use the function wrappers
;; provided by the clojure-hadoop library.
;;
;; As in the wordcount1 example, we have to call gen-job-classes and
;; gen-main-method, then define the three functions mapper-map,
;; reducer-reduce, and tool-run.
;;
;; mapper-map uses the wrap-map function. This allows us to write our
;; mapper as a simple, pure-Clojure function. Converting between
;; Hadoop types, and dealing with the Hadoop APIs, are handled by the
;; wrapper. We give it a function that returns a sequence of pairs,
;; and a pre-defined reader that accepts a Hadoop [LongWritable, Text]
;; pair. The default writer function writes keys and values as Hadoop
;; Text objects rendered with pr-str.
;;
;; reducer-reduce similarly uses the wrap-reduce function. However,
;; rather than passing the sequence of values directly to the
;; function, wrap-reduce will pass a *function* that *returns* a lazy
;; sequence of values. Because this sequence may be very large, you
;; must be careful never to bind it to a local variable. Basically,
;; you should only use the values-fn in one of Clojure's sequence
;; functions such as map, filter, or reduce.
;;
;; To run this example, first compile it (see instructions in
;; README.txt), then run this command (all one line):
;;
;;   java -cp examples.jar \
;;        clojure_hadoop.examples.wordcount2 \
;;        README.txt out2
;;
;; This will count the instances of each word in README.txt and write
;; the results to out2/part-00000
;;
;; Notice that, in the output file, the words are enclosed in double
;; quotation marks. That's because they are being printed as readable
;; strings by Clojure, as with 'pr'.


(ns clojure-hadoop.examples.wordcount2
  (:require [clojure-hadoop.gen :as gen]
            [clojure-hadoop.imports :as imp]
            [clojure-hadoop.wrap :as wrap])
  (:import (java.util StringTokenizer)
           (org.apache.hadoop.util Tool)))

(imp/import-io)     ;; for Text
(imp/import-fs)     ;; for Path
(imp/import-mapred) ;; for JobConf, JobClient

(gen/gen-job-classes)
(gen/gen-main-method)

(def mapper-map
  (wrap/wrap-map
   (fn [key value]
     (map (fn [token] [token 1])
          (enumeration-seq (StringTokenizer. value))))
   wrap/int-string-map-reader))

(def reducer-reduce
  (wrap/wrap-reduce
   (fn [key values-fn]
     [[key (reduce + (values-fn))]])))

(defn tool-run [#^Tool this args]
  (doto (JobConf. (.getConf this) (.getClass this))
    (.setJobName "wordcount2")
    (.setOutputKeyClass Text)
    (.setOutputValueClass Text)
    (.setMapperClass (Class/forName "clojure_hadoop.examples.wordcount2_mapper"))
    (.setReducerClass (Class/forName "clojure_hadoop.examples.wordcount2_reducer"))
    (.setInputFormat TextInputFormat)
    (.setOutputFormat TextOutputFormat)
    (FileInputFormat/setInputPaths #^String (first args))
    (FileOutputFormat/setOutputPath (Path. (second args)))
    (JobClient/runJob))
  0)
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount3.clj:
--------------------------------------------------------------------------------
;; wordcount3 -- example for use with clojure-hadoop.job
;;
;; This example wordcount program is very different from the first
;; two. As you can see, it defines only two functions, imports
;; nothing from Hadoop, and doesn't generate any classes.
;;
;; This example is designed to be run with the clojure-hadoop.job
;; library, which allows you to run a MapReduce job that can be
;; configured to use any Clojure functions as the mapper and reducer.
;;
;; After compiling (see README.txt), run the example like this
;; (all on one line):
;;
;;   java -cp examples.jar clojure_hadoop.job \
;;        -input README.txt \
;;        -output out3 \
;;        -map clojure-hadoop.examples.wordcount3/my-map \
;;        -map-reader clojure-hadoop.wrap/int-string-map-reader \
;;        -reduce clojure-hadoop.examples.wordcount3/my-reduce \
;;        -input-format text
;;
;; The output is a Hadoop SequenceFile. You can view the output
;; with (all one line):
;;
;;   java -cp examples.jar org.apache.hadoop.fs.FsShell \
;;        -text out3/part-00000
;;
;; clojure_hadoop.job (note the underscore instead of a dash, because
;; we are calling it as a Java class) provides classes for Tool,
;; Mapper, and Reducer, which are dynamically configured on the command
;; line.
;;
;; The argument to -map is a namespace-qualified Clojure symbol. It
;; names the function that will be used as a mapper. We need to
;; specify the -map-reader function as well because we are not using
;; the default reader (which reads pr'd Clojure data structures).
;;
;; The argument to -reduce is also a namespace-qualified Clojure
;; symbol.
;;
;; We also have to specify the input and output paths, and specify the
;; non-default input-format as 'text', because README.txt is a text
;; file.
;;
;; Run clojure_hadoop.job without any arguments for a brief summary of
;; the options. See src/clojure_hadoop/job.clj and
;; src/clojure_hadoop/config.clj for more configuration options.


(ns clojure-hadoop.examples.wordcount3
  (:import (java.util StringTokenizer)))

(defn my-map [key value]
  (map (fn [token] [token 1])
       (enumeration-seq (StringTokenizer. value))))

(defn my-reduce [key values-fn]
  [[key (reduce + (values-fn))]])
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount4.clj:
--------------------------------------------------------------------------------
;; wordcount4 -- example defjob
;;
;; This example wordcount program is similar to wordcount3, but it
;; includes a job definition function created with defjob.
;;
;; defjob parses its options to create a job configuration map
;; suitable for clojure-hadoop.config.
;;
;; defjob defines an ordinary function, with the given name ("job" in
;; this example), which returns the job configuration map.
;;
;; We can specify the job definition function on the command line to
;; clojure_hadoop.job, adding or overriding any additional arguments
;; at the command line.
;;
;; After compiling (see README.txt), run the example like this
;; (all on one line):
;;
;;   java -cp examples.jar clojure_hadoop.job \
;;        -job clojure-hadoop.examples.wordcount4/job \
;;        -input README.txt -output out4
;;
;; The output is a Hadoop SequenceFile. You can view the output
;; with (all one line):
;;
;;   java -cp examples.jar org.apache.hadoop.fs.FsShell \
;;        -text out4/part-00000


(ns clojure-hadoop.examples.wordcount4
  (:require [clojure-hadoop.wrap :as wrap]
            [clojure-hadoop.defjob :as defjob])
  (:import (java.util StringTokenizer)))

(defn my-map [key value]
  (map (fn [token] [token 1])
       (enumeration-seq (StringTokenizer. value))))

(defn my-reduce [key values-fn]
  [[key (reduce + (values-fn))]])

(defjob/defjob job
  :map my-map
  :map-reader wrap/int-string-map-reader
  :reduce my-reduce
  :input-format :text)
--------------------------------------------------------------------------------
/src/examples/clojure/clojure_hadoop/examples/wordcount5.clj:
--------------------------------------------------------------------------------
1 | ;; wordcount5 -- example customized defjob
2 | ;;
3 | ;; This example wordcount program uses defjob like wordcount4, but it
4 | ;; includes some more configuration options that make it more
5 | ;; efficient.
6 | ;;
7 | ;; In the default configuration (wordcount4), everything is passed to
8 | ;; Hadoop as a Text and converted by the Clojure reader and printer.
9 | ;; By adding configuration options, this example works more closely
10 | ;; with Hadoop types like LongWritable. In order to do that it must
11 | ;; define custom reader and writer functions, and specify the output
12 | ;; key/value types in the defjob configuration.
13 | ;;
14 | ;; After compiling (see README.txt), run the example like this
15 | ;; (all on one line):
16 | ;;
17 | ;; java -cp examples.jar clojure_hadoop.job \
18 | ;; -job clojure-hadoop.examples.wordcount5/job \
19 | ;; -input README.txt -output out5
20 | ;;
21 | ;; The output is plain text, written to out5/part-00000
22 | ;;
23 | ;; Notice that the strings in the output are not quoted. In effect,
24 | ;; we have come full circle to wordcount1, while maintaining the
25 | ;; separation between the mapper/reducer functions and the
26 | ;; reader/writer functions.
27 |
28 |
29 | (ns clojure-hadoop.examples.wordcount5
30 | (:require [clojure-hadoop.wrap :as wrap]
31 | [clojure-hadoop.defjob :as defjob]
32 | [clojure-hadoop.imports :as imp])
33 | (:import (java.util StringTokenizer)))
34 |
35 | (imp/import-io) ;; for Text, LongWritable
36 | (imp/import-mapred) ;; for OutputCollector
37 |
38 | (defn my-map [key value]
39 | (map (fn [token] [token 1])
40 | (enumeration-seq (StringTokenizer. value))))
41 |
42 | (defn my-reduce [key values-fn]
43 | [[key (reduce + (values-fn))]])
44 |
45 | (defn string-long-writer [#^OutputCollector output
46 | #^String key value]
47 | (.collect output (Text. key) (LongWritable. value)))
48 |
49 | (defn string-long-reduce-reader [#^Text key wvalues]
50 | [(.toString key)
51 | (fn [] (map (fn [#^LongWritable v] (.get v))
52 | (iterator-seq wvalues)))])
53 |
54 | (defjob/defjob job
55 | :map my-map
56 | :map-reader wrap/int-string-map-reader
57 | :map-writer string-long-writer
58 | :reduce my-reduce
59 | :reduce-reader string-long-reduce-reader
60 | :reduce-writer string-long-writer
61 | :output-key Text
62 | :output-value LongWritable
63 | :input-format :text
64 | :output-format :text
65 | :compress-output false)
66 |
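;; Sketch of the custom writer's effect: each reduce result is
;; collected as (.collect output (Text. "word") (LongWritable. 7)),
;; which TextOutputFormat writes as the key, a tab, and the value:
;;
;;   word    7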
67 |
--------------------------------------------------------------------------------
/src/main/assembly/dist.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <assembly>
3 |   <id>dist</id>
4 |   <formats>
5 |     <format>zip</format>
6 |     <format>tar.gz</format>
7 |     <format>tar.bz2</format>
8 |   </formats>
9 |   <fileSets>
10 |     <fileSet>
11 |       <directory>${project.basedir}</directory>
12 |       <outputDirectory>/</outputDirectory>
13 |       <useDefaultExcludes>true</useDefaultExcludes>
14 |       <includes>
15 |         <include>README.*</include>
16 |         <include>LICENSE.*</include>
17 |         <include>NOTICE.*</include>
18 |         <include>CHANGES.*</include>
19 |         <include>pom.xml</include>
20 |         <include>src/**</include>
21 |         <include>target/*.jar</include>
22 |       </includes>
23 |     </fileSet>
24 |   </fileSets>
25 | </assembly>
--------------------------------------------------------------------------------
/src/main/assembly/examples.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <assembly>
3 |   <id>examples</id>
4 |   <formats>
5 |     <format>jar</format>
6 |   </formats>
7 |   <includeBaseDirectory>false</includeBaseDirectory>
8 |   <dependencySets>
9 |     <dependencySet>
10 |       <unpack>true</unpack>
11 |       <scope>runtime</scope>
12 |     </dependencySet>
13 |   </dependencySets>
14 | </assembly>
--------------------------------------------------------------------------------
/src/main/assembly/job.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <assembly>
3 |   <id>job</id>
4 |   <formats>
5 |     <format>jar</format>
6 |   </formats>
7 |   <includeBaseDirectory>false</includeBaseDirectory>
8 |   <dependencySets>
9 |     <dependencySet>
10 |       <unpack>true</unpack>
11 |       <scope>runtime</scope>
12 |       <excludes>
13 |         <exclude>org.apache.hadoop:hadoop-core-with-dependencies</exclude>
14 |       </excludes>
15 |     </dependencySet>
16 |   </dependencySets>
17 | </assembly>
--------------------------------------------------------------------------------
/src/main/clojure/clojure_hadoop/config.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.config
2 | (:require [clojure-hadoop.imports :as imp]
3 | [clojure-hadoop.load :as load])
4 | (:import (org.apache.hadoop.io.compress
5 | DefaultCodec GzipCodec LzoCodec)))
6 |
7 | ;; This file defines configuration options for clojure-hadoop.
8 | ;;
9 | ;; The SAME options may be given either on the command line (to
10 | ;; clojure_hadoop.job) or in a call to defjob.
11 | ;;
12 | ;; In defjob, option names are keywords. Values are symbols or
13 | ;; keywords. Symbols are resolved as functions or classes. Keywords
14 | ;; are converted to Strings.
15 | ;;
16 | ;; On the command line, option names are preceded by "-".
17 | ;;
18 | ;; Options are defined as methods of the conf multimethod.
19 | ;; Documentation for individual options appears with each method,
20 | ;; below.
21 |
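;; For example, these two configurations are equivalent (namespace and
;; classpath here are hypothetical):
;;
;;   (defjob/defjob job :map my-map :input-format :text ...)
;;
;;   java -cp ... clojure_hadoop.job -map my.ns/my-map -input-format text ...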
22 | (imp/import-io)
23 | (imp/import-fs)
24 | (imp/import-mapred)
25 | (imp/import-mapred-lib)
26 |
27 | (defn- #^String as-str [s]
28 | (cond (keyword? s) (name s)
29 | (class? s) (.getName #^Class s)
30 | (fn? s) (throw (Exception. "Cannot use function as value; use a symbol."))
31 | :else (str s)))
32 |
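;; Examples, following directly from the cond branches above:
;;
;;   (as-str :text)                      ;;=> "text"
;;   (as-str org.apache.hadoop.io.Text)  ;;=> "org.apache.hadoop.io.Text"
;;   (as-str "seq")                      ;;=> "seq"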
33 | (defmulti conf (fn [jobconf key value] key))
34 |
35 | (defmethod conf :job [jobconf key value]
36 | (let [f (load/load-name value)]
37 | (doseq [[k v] (f)]
38 | (conf jobconf k v))))
39 |
40 | ;; Job input paths, separated by commas, as a String.
41 | (defmethod conf :input [#^JobConf jobconf key value]
42 | (FileInputFormat/setInputPaths jobconf (as-str value)))
43 |
44 | ;; Job output path, as a String.
45 | (defmethod conf :output [#^JobConf jobconf key value]
46 | (FileOutputFormat/setOutputPath jobconf (Path. (as-str value))))
47 |
48 | ;; When true or "true", deletes output path before starting.
49 | (defmethod conf :replace [#^JobConf jobconf key value]
50 | (when (= (as-str value) "true")
51 | (.set jobconf "clojure-hadoop.job.replace" "true")))
52 |
53 | ;; The mapper function. May be a class name or a Clojure function as
54 | ;; namespace/symbol. May also be "identity" for IdentityMapper.
55 | (defmethod conf :map [#^JobConf jobconf key value]
56 | (let [value (as-str value)]
57 | (cond
58 | (= "identity" value)
59 | (.setMapperClass jobconf IdentityMapper)
60 |
61 | (.contains value "/")
62 | (.set jobconf "clojure-hadoop.job.map" value)
63 |
64 | :else
65 | (.setMapperClass jobconf (Class/forName value)))))
66 |
67 | ;; The reducer function. May be a class name or a Clojure function as
68 | ;; namespace/symbol. May also be "identity" for IdentityReducer or
69 | ;; "none" for no reduce stage.
70 | (defmethod conf :reduce [#^JobConf jobconf key value]
71 | (let [value (as-str value)]
72 | (cond
73 | (= "identity" value)
74 | (.setReducerClass jobconf IdentityReducer)
75 |
76 | (= "none" value)
77 | (.setNumReduceTasks jobconf 0)
78 |
79 | (.contains value "/")
80 | (.set jobconf "clojure-hadoop.job.reduce" value)
81 |
82 | :else
83 | (.setReducerClass jobconf (Class/forName value)))))
84 |
85 | ;; The mapper reader function, converts Hadoop Writable types to
86 | ;; native Clojure types.
87 | (defmethod conf :map-reader [#^JobConf jobconf key value]
88 | (.set jobconf "clojure-hadoop.job.map.reader" (as-str value)))
89 |
90 | ;; The mapper writer function; converts native Clojure types to Hadoop
91 | ;; Writable types.
92 | (defmethod conf :map-writer [#^JobConf jobconf key value]
93 | (.set jobconf "clojure-hadoop.job.map.writer" (as-str value)))
94 |
95 | ;; The mapper output key class; used when the mapper writer outputs
96 | ;; types different from the job output.
97 | (defmethod conf :map-output-key [#^JobConf jobconf key value]
98 | (.setMapOutputKeyClass jobconf (Class/forName (as-str value))))
99 |
100 | ;; The mapper output value class; used when the mapper writer outputs
101 | ;; types different from the job output.
102 | (defmethod conf :map-output-value [#^JobConf jobconf key value]
103 | (.setMapOutputValueClass jobconf (Class/forName (as-str value))))
104 |
105 | ;; The job output key class.
106 | (defmethod conf :output-key [#^JobConf jobconf key value]
107 | (.setOutputKeyClass jobconf (Class/forName (as-str value))))
108 |
109 | ;; The job output value class.
110 | (defmethod conf :output-value [#^JobConf jobconf key value]
111 | (.setOutputValueClass jobconf (Class/forName (as-str value))))
112 |
113 | ;; The reducer reader function, converts Hadoop Writable types to
114 | ;; native Clojure types.
115 | (defmethod conf :reduce-reader [#^JobConf jobconf key value]
116 | (.set jobconf "clojure-hadoop.job.reduce.reader" (as-str value)))
117 |
118 | ;; The reducer writer function; converts native Clojure types to
119 | ;; Hadoop Writable types.
120 | (defmethod conf :reduce-writer [#^JobConf jobconf key value]
121 | (.set jobconf "clojure-hadoop.job.reduce.writer" (as-str value)))
122 |
123 | ;; The input file format. May be a class name or "text" for
124 | ;; TextInputFormat, "kvtext" for KeyValueTextInputFormat, "seq" for
125 | ;; SequenceFileInputFormat.
126 | (defmethod conf :input-format [#^JobConf jobconf key value]
127 | (let [value (as-str value)]
128 | (cond
129 | (= "text" value)
130 | (.setInputFormat jobconf TextInputFormat)
131 |
132 | (= "kvtext" value)
133 | (.setInputFormat jobconf KeyValueTextInputFormat)
134 |
135 | (= "seq" value)
136 | (.setInputFormat jobconf SequenceFileInputFormat)
137 |
138 | :else
139 | (.setInputFormat jobconf (Class/forName value)))))
140 |
141 | ;; The output file format. May be a class name or "text" for
142 | ;; TextOutputFormat, "seq" for SequenceFileOutputFormat.
143 | (defmethod conf :output-format [#^JobConf jobconf key value]
144 | (let [value (as-str value)]
145 | (cond
146 | (= "text" value)
147 | (.setOutputFormat jobconf TextOutputFormat)
148 |
149 | (= "seq" value)
150 | (.setOutputFormat jobconf SequenceFileOutputFormat)
151 |
152 | :else
153 | (.setOutputFormat jobconf (Class/forName value)))))
154 |
155 | ;; If true, compress job output files.
156 | (defmethod conf :compress-output [#^JobConf jobconf key value]
157 | (cond
158 | (= "true" (as-str value))
159 | (FileOutputFormat/setCompressOutput jobconf true)
160 |
161 | (= "false" (as-str value))
162 | (FileOutputFormat/setCompressOutput jobconf false)
163 |
164 | :else
165 | (throw (Exception. "compress-output value must be true or false"))))
166 |
167 | ;; Codec to use for compressing job output files.
168 | (defmethod conf :output-compressor [#^JobConf jobconf key value]
169 | (cond
170 | (= "default" (as-str value))
171 | (FileOutputFormat/setOutputCompressorClass
172 | jobconf DefaultCodec)
173 |
174 | (= "gzip" (as-str value))
175 | (FileOutputFormat/setOutputCompressorClass
176 | jobconf GzipCodec)
177 |
178 | (= "lzo" (as-str value))
179 | (FileOutputFormat/setOutputCompressorClass
180 | jobconf LzoCodec)
181 |
182 | :else
183 | (FileOutputFormat/setOutputCompressorClass
184 | jobconf (Class/forName value))))
185 |
186 | ;; Type of compression to use for sequence files.
187 | (defmethod conf :compression-type [#^JobConf jobconf key value]
188 | (cond
189 | (= "block" (as-str value))
190 | (SequenceFileOutputFormat/setOutputCompressionType
191 | jobconf SequenceFile$CompressionType/BLOCK)
192 |
193 | (= "none" (as-str value))
194 | (SequenceFileOutputFormat/setOutputCompressionType
195 | jobconf SequenceFile$CompressionType/NONE)
196 |
197 | (= "record" (as-str value))
198 | (SequenceFileOutputFormat/setOutputCompressionType
199 | jobconf SequenceFile$CompressionType/RECORD)))
200 |
201 | (defn parse-command-line-args [#^JobConf jobconf args]
202 | (when (empty? args)
203 | (throw (Exception. "Missing required options.")))
204 | (when-not (even? (count args))
205 | (throw (Exception. "Number of options must be even.")))
206 | (doseq [[k v] (partition 2 args)]
207 | (conf jobconf (keyword (subs k 1)) v)))
208 |
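;; For example, the argument vector ["-input" "in.txt" "-output" "out"]
;; is parsed into the calls (conf jobconf :input "in.txt") and
;; (conf jobconf :output "out").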
209 | (defn print-usage []
210 | (println "Usage: java -cp [jars...] clojure_hadoop.job [options...]
211 | Required options are:
212 | -input comma-separated input paths
213 | -output output path
214 | -map mapper function, as namespace/name or class name
215 | -reduce reducer function, as namespace/name or class name
216 | OR
217 | -job job definition function, as namespace/name
218 |
219 | Mapper or reducer function may also be \"identity\".
220 | Reducer function may also be \"none\".
221 |
222 | Other available options are:
223 | -input-format Class name or \"text\", \"kvtext\", or \"seq\" (SeqFile)
224 | -output-format Class name or \"text\" or \"seq\" (SeqFile)
225 | -output-key Class for job output key
226 | -output-value Class for job output value
227 | -map-output-key Class for intermediate Mapper output key
228 | -map-output-value Class for intermediate Mapper output value
229 | -map-reader Mapper reader function, as namespace/name
230 | -map-writer Mapper writer function, as namespace/name
231 | -reduce-reader Reducer reader function, as namespace/name
232 | -reduce-writer Reducer writer function, as namespace/name
233 | -name Job name
234 | -replace If \"true\", deletes output dir before start
235 | -compress-output If \"true\", compress job output files
236 | -output-compressor Compression class or \"gzip\",\"lzo\",\"default\"
237 | -compression-type For seqfiles, compress \"block\",\"record\",\"none\"
238 | "))
239 |
240 |
--------------------------------------------------------------------------------
/src/main/clojure/clojure_hadoop/defjob.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.defjob
2 | (:require [clojure-hadoop.job :as job]))
3 |
4 | (defn- full-name
5 | "Returns the fully-qualified name for a symbol s, either a class or
6 | a var, resolved in the current namespace."
7 | [s]
8 | (if-let [v (resolve s)]
9 | (cond (var? v) (let [m (meta v)]
10 | (str (ns-name (:ns m)) \/
11 | (name (:name m))))
12 | (class? v) (.getName #^Class v))
13 | (throw (Exception. (str "Symbol not found: " s)))))
14 |
15 | (defmacro defjob
16 | "Defines a job function. Options are the same as those in
17 | clojure-hadoop.config.
18 |
19 | A job function may be given as the -job argument to
20 | clojure_hadoop.job to run a job."
21 | [sym & options]
22 | (let [args (reduce (fn [m [k v]]
23 | (assoc m k
24 | (cond (keyword? v) (name v)
25 | (string? v) v
26 | (symbol? v) (full-name v)
27 | (instance? Boolean v) (str v)
28 | :else (throw (Exception. "defjob arguments must be strings, symbols, keywords, or booleans")))))
29 | {} (apply hash-map options))]
30 | `(defn ~sym [] ~args)))
31 |
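;; Sketch of the expansion (names hypothetical):
;;
;;   (defjob job :map my-map :input-format :text)
;;
;; expands, roughly, to
;;
;;   (defn job [] {:map "my.ns/my-map", :input-format "text"})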
--------------------------------------------------------------------------------
/src/main/clojure/clojure_hadoop/gen.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.gen
2 | ;;#^{:doc "Class-generation helpers for writing Hadoop jobs in Clojure."}
3 | )
4 |
5 | (defmacro gen-job-classes
6 | "Creates gen-class forms for Hadoop job classes from the current
7 | namespace. Now you only need to write three functions:
8 |
9 | (defn mapper-map [this key value output reporter] ...)
10 |
11 | (defn reducer-reduce [this key values output reporter] ...)
12 |
13 | (defn tool-run [this args] ...)
14 |
15 | The first two functions are the standard map/reduce functions in any
16 | Hadoop job.
17 |
18 | The third function, tool-run, will be called by the Hadoop framework
19 | to start your job, with the arguments from the command line. It
20 | should set up the JobConf object and call JobClient/runJob, then
21 | return zero on success.
22 |
23 | You must also call gen-main-method to create the main method.
24 |
25 | After compiling your namespace, you can run it as a Hadoop job using
26 | the standard Hadoop command-line tools."
27 | []
28 | (let [the-name (.replace (str (ns-name *ns*)) \- \_)]
29 | `(do
30 | (gen-class
31 | :name ~the-name
32 | :extends "org.apache.hadoop.conf.Configured"
33 | :implements ["org.apache.hadoop.util.Tool"]
34 | :prefix "tool-"
35 | :main true)
36 | (gen-class
37 | :name ~(str the-name "_mapper")
38 | :extends "org.apache.hadoop.mapred.MapReduceBase"
39 | :implements ["org.apache.hadoop.mapred.Mapper"]
40 | :prefix "mapper-")
41 | (gen-class
42 | :name ~(str the-name "_reducer")
43 | :extends "org.apache.hadoop.mapred.MapReduceBase"
44 | :implements ["org.apache.hadoop.mapred.Reducer"]
45 | :prefix "reducer-"))))
46 |
47 | (defn gen-main-method
48 | "Adds a standard main method, named tool-main, to the current
49 | namespace."
50 | []
51 | (let [the-name (.replace (str (ns-name *ns*)) \- \_)]
52 | (intern *ns* 'tool-main
53 | (fn [& args]
54 | (System/exit
55 | (org.apache.hadoop.util.ToolRunner/run
56 | (new org.apache.hadoop.conf.Configuration)
57 | (. (Class/forName the-name) newInstance)
58 | (into-array String args)))))))
59 |
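;; A minimal consumer of these helpers (sketch; function bodies elided):
;;
;;   (ns my.job (:use clojure-hadoop.gen))
;;   (gen-job-classes)
;;   (gen-main-method)
;;   (defn mapper-map [this key value output reporter] ...)
;;   (defn reducer-reduce [this key values output reporter] ...)
;;   (defn tool-run [this args] ... 0)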
--------------------------------------------------------------------------------
/src/main/clojure/clojure_hadoop/imports.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.imports
2 | ;;#^{:doc "Functions to import entire packages under org.apache.hadoop."}
3 | )
4 |
5 | (defn import-io
6 | "Imports all classes/interfaces/exceptions from the package
7 | org.apache.hadoop.io into the current namespace."
8 | []
9 | (import '(org.apache.hadoop.io RawComparator
10 | SequenceFile$Sorter$RawKeyValueIterator SequenceFile$ValueBytes
11 | Stringifier Writable WritableComparable WritableFactory
12 | AbstractMapWritable ArrayFile ArrayFile$Reader ArrayFile$Writer
13 | ArrayWritable BooleanWritable BooleanWritable$Comparator BytesWritable
14 | BytesWritable$Comparator ByteWritable ByteWritable$Comparator
15 | CompressedWritable DataInputBuffer DataOutputBuffer DefaultStringifier
16 | DoubleWritable DoubleWritable$Comparator FloatWritable
18 | FloatWritable$Comparator GenericWritable InputBuffer IOUtils
19 | IOUtils$NullOutputStream LongWritable LongWritable$Comparator
20 | LongWritable$DecreasingComparator MapFile
20 | MapFile$Reader MapFile$Writer MapWritable MD5Hash MD5Hash$Comparator
21 | NullWritable NullWritable$Comparator ObjectWritable OutputBuffer
22 | SequenceFile SequenceFile$Metadata SequenceFile$Reader
23 | SequenceFile$Sorter SequenceFile$Writer SetFile SetFile$Reader
24 | SetFile$Writer SortedMapWritable Text Text$Comparator
25 | TwoDArrayWritable UTF8 UTF8$Comparator VersionedWritable
26 | VLongWritable WritableComparator WritableFactories WritableName
27 | WritableUtils SequenceFile$CompressionType MultipleIOException
28 | VersionMismatchException)))
29 |
30 | (defn import-io-compress
31 | "Imports all classes/interfaces/exceptions from the package
32 | org.apache.hadoop.io.compress into the current namespace."
33 | []
34 | (import '(org.apache.hadoop.io.compress CompressionCodec Compressor
35 | Decompressor BlockCompressorStream BlockDecompressorStream CodecPool
36 | CompressionCodecFactory CompressionInputStream CompressionOutputStream
37 | CompressorStream DecompressorStream DefaultCodec GzipCodec
38 | GzipCodec$GzipInputStream GzipCodec$GzipOutputStream)))
39 |
40 | (defn import-fs
41 | "Imports all classes/interfaces/exceptions from the package
42 | org.apache.hadoop.fs into the current namespace."
43 | []
44 | (import '(org.apache.hadoop.fs PathFilter PositionedReadable
45 | Seekable Syncable BlockLocation BufferedFSInputStream
46 | ChecksumFileSystem ContentSummary DF DU FileStatus FileSystem
47 | FileSystem$Statistics FileUtil FileUtil$HardLink FilterFileSystem
48 | FSDataInputStream FSDataOutputStream FSInputChecker FSInputStream
49 | FSOutputSummer FsShell FsUrlStreamHandlerFactory HarFileSystem
50 | InMemoryFileSystem LocalDirAllocator LocalFileSystem Path
51 | RawLocalFileSystem Trash ChecksumException FSError)))
52 |
53 | (defn import-mapred
54 | "Imports all classes/interfaces/exceptions from the package
55 | org.apache.hadoop.mapred into the current namespace."
56 | []
57 | (import '(org.apache.hadoop.mapred InputFormat InputSplit
58 | JobConfigurable JobHistory$Listener Mapper MapRunnable OutputCollector
59 | OutputFormat Partitioner RawKeyValueIterator RecordReader RecordWriter
60 | Reducer Reporter RunningJob SequenceFileInputFilter$Filter
61 | ClusterStatus Counters Counters$Counter Counters$Group
62 | DefaultJobHistoryParser FileInputFormat FileOutputFormat FileSplit ID
63 | IsolationRunner JobClient JobConf JobEndNotifier JobHistory
64 | JobHistory$HistoryCleaner JobHistory$JobInfo JobHistory$MapAttempt
65 | JobHistory$ReduceAttempt JobHistory$Task JobHistory$TaskAttempt JobID
66 | JobProfile JobStatus JobTracker KeyValueLineRecordReader
67 | KeyValueTextInputFormat LineRecordReader LineRecordReader$LineReader
68 | MapFileOutputFormat MapReduceBase MapRunner MultiFileInputFormat
69 | MultiFileSplit OutputLogFilter SequenceFileAsBinaryInputFormat
70 | SequenceFileAsBinaryInputFormat$SequenceFileAsBinaryRecordReader
71 | SequenceFileAsBinaryOutputFormat
72 | SequenceFileAsBinaryOutputFormat$WritableValueBytes
73 | SequenceFileAsTextInputFormat SequenceFileAsTextRecordReader
74 | SequenceFileInputFilter SequenceFileInputFilter$FilterBase
75 | SequenceFileInputFilter$MD5Filter
76 | SequenceFileInputFilter$PercentFilter
77 | SequenceFileInputFilter$RegexFilter SequenceFileInputFormat
78 | SequenceFileOutputFormat SequenceFileRecordReader TaskAttemptID
79 | TaskCompletionEvent TaskID TaskLog TaskLogAppender TaskLogServlet
80 | TaskReport TaskTracker TaskTracker$MapOutputServlet TextInputFormat
81 | TextOutputFormat TextOutputFormat$LineRecordWriter
82 | JobClient$TaskStatusFilter JobHistory$Keys JobHistory$RecordTypes
83 | JobHistory$Values JobPriority JobTracker$State
84 | TaskCompletionEvent$Status TaskLog$LogName FileAlreadyExistsException
85 | InvalidFileTypeException InvalidInputException InvalidJobConfException
86 | JobTracker$IllegalStateException)))
87 |
88 | (defn import-mapred-lib
89 | "Imports all classes/interfaces/exceptions from the package
90 | org.apache.hadoop.mapred.lib into the current namespace."
91 | []
92 | (import '(org.apache.hadoop.mapred.lib FieldSelectionMapReduce
93 | HashPartitioner IdentityMapper IdentityReducer InverseMapper
94 | KeyFieldBasedPartitioner LongSumReducer MultipleOutputFormat
95 | MultipleSequenceFileOutputFormat MultipleTextOutputFormat
96 | MultithreadedMapRunner NLineInputFormat NullOutputFormat RegexMapper
97 | TokenCountMapper)))
98 |
;; Minimal version of import-util (clojure-hadoop.test-imports calls it);
;; the class list here is a selection covering the org.apache.hadoop.util
;; classes used in this project plus a few common ones.
(defn import-util
  "Imports classes/interfaces from the package org.apache.hadoop.util
  into the current namespace."
  []
  (import '(org.apache.hadoop.util GenericOptionsParser Progressable
            ReflectionUtils StringUtils Tool ToolRunner VersionInfo)))
99 |
100 |
--------------------------------------------------------------------------------
/src/main/clojure/clojure_hadoop/job.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.job
2 | (:require [clojure-hadoop.gen :as gen]
3 | [clojure-hadoop.imports :as imp]
4 | [clojure-hadoop.wrap :as wrap]
5 | [clojure-hadoop.config :as config]
6 | [clojure-hadoop.load :as load])
7 | (:import (org.apache.hadoop.util Tool)))
8 |
9 | (imp/import-io)
10 | (imp/import-io-compress)
11 | (imp/import-fs)
12 | (imp/import-mapred)
13 |
14 | (gen/gen-job-classes)
15 | (gen/gen-main-method)
16 |
17 | (def #^JobConf *jobconf* nil)
18 |
19 | (def #^{:private true} method-fn-name
20 | {"map" "mapper-map"
21 | "reduce" "reducer-reduce"})
22 |
23 | (def #^{:private true} wrapper-fn
24 | {"map" wrap/wrap-map
25 | "reduce" wrap/wrap-reduce})
26 |
27 | (def #^{:private true} default-reader
28 | {"map" wrap/clojure-map-reader
29 | "reduce" wrap/clojure-reduce-reader})
30 |
31 | (defn- configure-functions
32 | "Preps the mapper or reducer with a Clojure function read from the
33 | job configuration. Called from Mapper.configure and
34 | Reducer.configure."
35 | [type #^JobConf jobconf]
36 | (alter-var-root (var *jobconf*) (fn [_] jobconf))
37 | (let [function (load/load-name (.get jobconf (str "clojure-hadoop.job." type)))
38 | reader (if-let [v (.get jobconf (str "clojure-hadoop.job." type ".reader"))]
39 | (load/load-name v)
40 | (default-reader type))
41 | writer (if-let [v (.get jobconf (str "clojure-hadoop.job." type ".writer"))]
42 | (load/load-name v)
43 | wrap/clojure-writer)]
44 | (assert (fn? function))
45 | (alter-var-root (ns-resolve (the-ns 'clojure-hadoop.job)
46 | (symbol (method-fn-name type)))
47 | (fn [_] ((wrapper-fn type) function reader writer)))))
48 |
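;; For example, given "-map my.ns/my-map" on the command line (namespace
;; hypothetical), the JobConf maps "clojure-hadoop.job.map" to
;; "my.ns/my-map"; configure-functions loads that function and rebinds
;; mapper-map to (wrap/wrap-map the-function reader writer).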
49 | ;;; CREATING AND CONFIGURING JOBS
50 |
51 | (defn- parse-command-line [jobconf args]
52 | (try
53 | (config/parse-command-line-args jobconf args)
54 | (catch Exception e
55 | (prn e)
56 | (config/print-usage)
57 | (System/exit 1))))
58 |
59 | (defn- handle-replace-option [#^JobConf jobconf]
60 | (when (= "true" (.get jobconf "clojure-hadoop.job.replace"))
61 | (let [fs (FileSystem/get jobconf)
62 | output (FileOutputFormat/getOutputPath jobconf)]
63 | (.delete fs output true))))
64 |
65 | (defn- set-default-config [#^JobConf jobconf]
66 | (doto jobconf
67 | (.setJobName "clojure_hadoop.job")
68 | (.setOutputKeyClass Text)
69 | (.setOutputValueClass Text)
70 | (.setMapperClass (Class/forName "clojure_hadoop.job_mapper"))
71 | (.setReducerClass (Class/forName "clojure_hadoop.job_reducer"))
72 | (.setInputFormat SequenceFileInputFormat)
73 | (.setOutputFormat SequenceFileOutputFormat)
74 | (FileOutputFormat/setCompressOutput true)
75 | (SequenceFileOutputFormat/setOutputCompressionType
76 | SequenceFile$CompressionType/BLOCK)))
77 |
78 | (defn run
79 | "Runs a Hadoop job given the JobConf object."
80 | [jobconf]
81 | (doto jobconf
82 | (handle-replace-option)
83 | (JobClient/runJob)))
84 |
85 |
86 | ;;; MAPPER METHODS
87 |
88 | (defn mapper-configure [this jobconf]
89 | (configure-functions "map" jobconf))
90 |
91 | (defn mapper-map [this wkey wvalue output reporter]
92 | (throw (Exception. "Mapper function not defined.")))
93 |
94 | ;;; REDUCER METHODS
95 |
96 | (defn reducer-configure [this jobconf]
97 | (configure-functions "reduce" jobconf))
98 |
99 | (defn reducer-reduce [this wkey wvalues output reporter]
100 | (throw (Exception. "Reducer function not defined.")))
101 |
102 | ;;; TOOL METHODS
103 |
104 | (defn tool-run [#^Tool this args]
105 | (doto (JobConf. (.getConf this) (.getClass this))
106 | (set-default-config)
107 | (parse-command-line args)
108 | (run))
109 | 0)
110 |
111 |
--------------------------------------------------------------------------------
/src/main/clojure/clojure_hadoop/load.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.load)
2 |
3 | (defn load-name
4 | "Loads and returns the value of a namespace-qualified string naming
5 | a symbol. If the namespace is not currently loaded it will be
6 | require'd."
7 | [#^String s]
8 | (let [[ns-name fn-name] (.split s "/")]
9 | (when-not (find-ns (symbol ns-name))
10 | (require (symbol ns-name)))
11 | (assert (find-ns (symbol ns-name)))
12 | (deref (resolve (symbol ns-name fn-name)))))
13 |
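;; For example:
;;
;;   (load-name "clojure-hadoop.wrap/clojure-writer")
;;
;; requires clojure-hadoop.wrap if it is not already loaded, then
;; returns the function bound to clojure-hadoop.wrap/clojure-writer.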
14 |
--------------------------------------------------------------------------------
/src/main/clojure/clojure_hadoop/wrap.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.wrap
2 | ;;#^{:doc "Map/Reduce wrappers that set up common input/output
3 | ;;conversions for Clojure jobs."}
4 | (:require [clojure-hadoop.imports :as imp]))
5 |
6 | (imp/import-io)
7 | (imp/import-mapred)
8 |
9 | (declare #^Reporter *reporter*)
10 |
11 | (defn string-map-reader
12 | "Returns a [key value] pair by calling .toString on the Writable key
13 | and value."
14 | [#^Writable wkey #^Writable wvalue]
15 | [(.toString wkey) (.toString wvalue)])
16 |
17 | (defn int-string-map-reader [#^LongWritable wkey #^Writable wvalue]
18 | [(.get wkey) (.toString wvalue)])
19 |
20 | (defn clojure-map-reader
21 | "Returns a [key value] pair by calling read-string on the string
22 | representations of the Writable key and value."
23 | [#^Writable wkey #^Writable wvalue]
24 | [(read-string (.toString wkey)) (read-string (.toString wvalue))])
25 |
26 | (defn clojure-reduce-reader
27 | "Returns a [key seq-of-values] pair by calling read-string on the
28 | string representations of the Writable key and values."
29 | [#^Writable wkey wvalues]
30 | [(read-string (.toString wkey))
31 | (fn [] (map (fn [#^Writable v] (read-string (.toString v)))
32 | (iterator-seq wvalues)))])
33 |
34 | (defn clojure-writer
35 | "Sends key and value to the OutputCollector by calling pr-str on key
36 | and value and wrapping them in Hadoop Text objects."
37 | [#^OutputCollector output key value]
38 | (binding [*print-dup* true]
39 | (.collect output (Text. (pr-str key)) (Text. (pr-str value)))))
40 |
41 | (defn wrap-map
42 | "Returns a function implementing the Mapper.map interface.
43 |
44 | f is a function of two arguments, key and value.
45 |
46 | f must return a *sequence* of *pairs* like
47 | [[key1 value1] [key2 value2] ...]
48 |
49 | When f is called, *reporter* is bound to the Hadoop Reporter.
50 |
51 | reader is a function that receives the Writable key and value from
52 | Hadoop and returns a [key value] pair for f.
53 |
54 | writer is a function that receives each [key value] pair returned by
55 | f and sends the appropriately-typed arguments to the Hadoop
56 | OutputCollector.
57 |
58 | If not given, reader and writer default to clojure-map-reader and
59 | clojure-writer, respectively."
60 | ([f] (wrap-map f clojure-map-reader clojure-writer))
61 | ([f reader] (wrap-map f reader clojure-writer))
62 | ([f reader writer]
63 | (fn [this wkey wvalue output reporter]
64 | (binding [*reporter* reporter]
65 | (doseq [pair (apply f (reader wkey wvalue))]
66 | (apply writer output pair))))))
67 |
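;; For example, a word-count mapper suitable for wrap-map with
;; string-map-reader, splitting the line on whitespace:
;;
;;   (defn my-map [key #^String value]
;;     (map (fn [word] [word 1]) (.split value "\\s+")))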
68 | (defn wrap-reduce
69 | "Returns a function implementing the Reducer.reduce interface.
70 |
71 | f is a function of two arguments. First argument is the key, second
72 | argument is a function, which takes no arguments and returns a lazy
73 | sequence of values.
74 |
75 | f must return a *sequence* of *pairs* like
76 | [[key1 value1] [key2 value2] ...]
77 |
78 | When f is called, *reporter* is bound to the Hadoop Reporter.
79 |
80 | reader is a function that receives the Writable key and value from
81 | Hadoop and returns a [key values-function] pair for f.
82 |
83 | writer is a function that receives each [key value] pair returned by
84 | f and sends the appropriately-typed arguments to the Hadoop
85 | OutputCollector.
86 |
87 | If not given, reader and writer default to clojure-reduce-reader and
88 | clojure-writer, respectively."
89 | ([f] (wrap-reduce f clojure-reduce-reader clojure-writer))
90 | ([f writer] (wrap-reduce f clojure-reduce-reader writer))
91 | ([f reader writer]
92 | (fn [this wkey wvalues output reporter]
93 | (binding [*reporter* reporter]
94 | (doseq [pair (apply f (reader wkey wvalues))]
95 | (apply writer output pair))))))
96 |
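;; For example, a summing reducer suitable for wrap-reduce (taken from
;; the wordcount examples):
;;
;;   (defn my-reduce [key values-fn]
;;     [[key (reduce + (values-fn))]])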
--------------------------------------------------------------------------------
/src/test/clojure/clojure_hadoop/test_imports.clj:
--------------------------------------------------------------------------------
1 | (ns clojure-hadoop.test-imports
2 | (:require [clojure-hadoop.imports :as imp])
3 | (:use clojure.test))
4 |
5 | (deftest test-imports
6 | (imp/import-io)
7 | (imp/import-io-compress)
8 | (imp/import-fs)
9 | (imp/import-mapred)
10 | (imp/import-mapred-lib)
11 | (imp/import-util))
12 |
--------------------------------------------------------------------------------