├── .gitignore
├── profiles.clj
├── .travis.yml
├── ReleaseNotes.md
├── project.clj
├── test
│   └── pallet_hadoop
│       └── node_test.clj
├── README.md
├── dev
│   └── log4j.xml
├── PLANS.md
└── src
    └── pallet_hadoop
        └── node.clj
/.gitignore:
--------------------------------------------------------------------------------
1 | .#
2 | pom.xml
3 | .cake
4 | *jar
5 | lib
6 | log/
7 | classes
8 | docs
9 | autodoc
10 | .#*
11 | *#*
12 | logs/
13 | .lein-failures
14 |
--------------------------------------------------------------------------------
/profiles.clj:
--------------------------------------------------------------------------------
1 | {:dev {:plugins [[lein-pallet-release "0.1.6"]],
2 | :pallet-release {:url "https://pbors:${GH_TOKEN}@github.com/pallet/pallet-hadoop.git",
3 | :branch "master"}}}
4 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: clojure
2 | lein: lein2
3 | before_script:
4 | - lein2 version
5 | script: lein2 test
6 | after_success:
7 | - lein2 pallet-release push
8 | env:
9 | global:
10 | secure: pCYoBX17CLTJypmVpAovJnQ4ZmBqRGbHpib65wmgsXItEZQ92QX5brTS8G/C7lkNg+Cq9VrhwoeZpZ4bZ9kNAK3J2N20OpX/QZpBSDxNxiptQaKoFCRsoTDk6hoGZkY/HUYyrN7vv5P0NKUGExjRNzpCTjgTz59zUR/5NXLdqS8=
11 |
--------------------------------------------------------------------------------
/ReleaseNotes.md:
--------------------------------------------------------------------------------
1 | ## 0.3.4
2 |
3 | - updated library dependencies
4 |
5 | - update the pallet-jclouds dependency, and change the group name to com.palletops
6 |
7 | - prepare for lein-pallet-release
8 |
9 | - Update dependencies to a newer pallet/jclouds combination
10 |
11 | - allow configuring between Apache and Cloudera distributions
12 |
13 | - Prepare for next release development
14 |
15 |
--------------------------------------------------------------------------------
/project.clj:
--------------------------------------------------------------------------------
1 | (defproject com.palletops/pallet-hadoop "0.3.4"
2 | :description "Pallet meets Hadoop."
3 | :dev-resources-path "dev"
4 | :url "http://github.com/pallet/pallet-hadoop"
5 | :scm {:url "git@github.com:pallet/pallet-hadoop.git"}
6 | :repositories {"sonatype" "https://oss.sonatype.org/content/repositories/releases/"}
7 | :dependencies [[org.clojure/clojure "1.4.0"]
8 | [org.cloudhoist/pallet "0.7.5"]
9 | [org.cloudhoist/hadoop "0.7.0"]
10 | [org.cloudhoist/java "0.5.1"]
11 | [org.cloudhoist/automated-admin-user "0.5.0"]]
12 | :dev-dependencies [[org.apache.jclouds/jclouds-all "1.7.1"]
13 | [org.apache.jclouds.driver/jclouds-jsch "1.7.1"]
14 | [org.apache.jclouds.driver/jclouds-slf4j "1.7.1"]
15 | [com.palletops/pallet-jclouds "1.7.1"]
16 | [ch.qos.logback/logback-classic "1.0.1"]
17 | [ch.qos.logback/logback-core "1.0.1"]])
18 |
--------------------------------------------------------------------------------
/test/pallet_hadoop/node_test.clj:
--------------------------------------------------------------------------------
1 | (ns pallet-hadoop.node-test
2 | (:use [pallet-hadoop.node] :reload)
3 | (:use [clojure.test]))
4 |
5 | (def test-cluster
6 | (cluster-spec :private
7 | {:master (node-group [:jobtracker :namenode])
8 | :slaves (slave-group 1)}
9 | :base-machine-spec {:os-family :ubuntu
10 | :os-version-matches "10.10"
11 | :os-64-bit true}
12 | :base-props {:mapred-site {:mapred.task.timeout 300000
13 | :mapred.reduce.tasks 50
14 | :mapred.tasktracker.map.tasks.maximum 15
15 | :mapred.tasktracker.reduce.tasks.maximum 15}}))
16 |
17 | (deftest merge-to-vec-test
18 | (are [in-seqs out-vec] (let [out (apply merge-to-vec in-seqs)]
19 | (and (vector? out)
20 | (= (set out-vec)
21 | (set out))))
22 | [[1 2] [5 4] [2]] [1 2 4 5]
23 | [[1 2] [4 5]] [1 2 4 5]
24 | [["face" 2] [2 1]] [1 2 "face"]))
25 |
26 | (deftest set-vals-test
27 | (is (= {:key1 0 :key2 0} (set-vals {:key1 "face" :key2 8} 0))))
28 |
29 | (deftest expand-aliases-test
30 | (is (= [:datanode :tasktracker] (expand-aliases [:slavenode :datanode])))
31 | (is (= [:datanode :tasktracker :namenode] (expand-aliases [:slavenode :namenode])))
32 | (is (= [:datanode :namenode] (expand-aliases [:datanode :namenode]))))
33 |
34 | (deftest master?-test
35 | (is (= false (master? [:cake])))
36 | (is (= true (master? [:jobtracker]))))
37 |
38 | (deftest roles->tags-test
39 | (is (= [:master :master] (roles->tags [:jobtracker :namenode]
40 | (:nodedefs test-cluster)))))
41 |
42 | (deftest slave-group-test
43 | (is (thrown? AssertionError (slave-group)))
44 | (are [opts result] (= result (apply slave-group opts))
45 | [3]
46 | {:node {:roles [:slavenode] :spec {} :props {}} :count 3}
47 |
48 | [1 :props {:mapred-site {:prop "val"}}]
49 | {:node {:roles [:slavenode]
50 | :spec {}
51 | :props {:mapred-site {:prop "val"}}}
52 | :count 1}))
53 |
54 |
55 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Pallet-Hadoop #
2 |
3 | Hadoop Cluster Management with Intelligent Defaults.
4 |
5 | ## Background ##
6 |
7 | Hadoop is an Apache Java framework that allows for distributed
8 | processing of enormous datasets across large clusters. It combines a
9 | computation engine based on
10 | [MapReduce](http://en.wikipedia.org/wiki/MapReduce) with
11 | [HDFS](http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html), a
12 | distributed filesystem based on the [Google File
13 | System](http://en.wikipedia.org/wiki/Google_File_System).
14 |
15 | Abstraction layers such as
16 | [Cascading](https://github.com/cwensel/cascading) (for Java) and
17 | [Cascalog](https://github.com/nathanmarz/cascalog) (for
18 | [Clojure](http://clojure.org/)) make writing MapReduce queries quite
19 | nice. Indeed, running Hadoop locally with Cascalog [couldn't be
20 | easier](http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html).
21 |
22 | Unfortunately, graduating one's MapReduce jobs to the cluster level
23 | isn't so easy. Amazon's [Elastic
24 | MapReduce](http://aws.amazon.com/elasticmapreduce/) is a great option
25 | for getting up and running fast; but what to do if you want to
26 | configure your own cluster?
27 |
28 | After surveying existing tools, I decided to write my own layer over
29 | [Pallet](https://github.com/pallet/pallet), a wonderful cloud
30 | provisioning library written in Clojure. Pallet runs on top of
31 | [jclouds](https://github.com/jclouds/jclouds), which allows Pallet to
32 | define its operations independently of any one cloud provider. Switching
33 | between clouds involves a change of login credentials, nothing more.
34 |
35 | ## Getting Started ##
36 |
37 | To include pallet-hadoop in your project, add the following lines to
38 | `:dev-dependencies` in your `project.clj` file:
39 |
40 | ```clojure
41 | [com.palletops/pallet-hadoop "0.3.5"]
42 | [com.palletops/pallet-jclouds "1.7.1"]
43 | [org.apache.jclouds/jclouds-all "1.7.1"]
44 | [org.apache.jclouds.driver/jclouds-sshj "1.7.1"]
45 | [org.apache.jclouds.driver/jclouds-slf4j "1.7.1"]
46 | [ch.qos.logback/logback-classic "1.0.1"]
47 | [ch.qos.logback/logback-core "1.0.1"]
48 | ```
49 |
50 | You'll also need to add the Sonatype repository, to get access to
51 | Pallet. Add this k-v pair to your `project.clj` file:
52 |
53 | ```clojure
54 | :repositories {"sonatype" "https://oss.sonatype.org/content/repositories/releases/"}
55 | ```
56 | For a detailed example of how to run Pallet-Hadoop, see the [example
57 | project](https://github.com/pallet/pallet-hadoop-example). For more
58 | detailed information on the project's design, see [the project
59 | wiki](https://github.com/pallet/pallet-hadoop/wiki).
60 |
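For a quick sketch of the API, adapted from the comment block at the
end of `pallet-hadoop.node` (the `:aws` service assumes credentials
configured in `~/.pallet/config.clj`):

```clojure
(use 'pallet-hadoop.node)
(use 'pallet.compute)

;; Compute service, read from ~/.pallet/config.clj.
(def ec2-service (service :aws))

;; One master node (jobtracker + namenode) and one slave node.
(def example-cluster
  (cluster-spec :private
                {:master (node-group [:jobtracker :namenode])
                 :slaves (slave-group 1)}
                :base-machine-spec {:os-family :ubuntu
                                    :os-version-matches "10.10"
                                    :os-64-bit true}))

;; Boot the nodes, then start the hadoop daemons.
(boot-cluster example-cluster :compute ec2-service)
(start-cluster example-cluster :compute ec2-service)
```
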
61 | Pallet-Hadoop version `0.3.4-SNAPSHOT` uses Pallet `0.7.5`, jclouds
62 | `1.7.1`, and Clojure `1.4` or later.
63 |
--------------------------------------------------------------------------------
/dev/log4j.xml:
--------------------------------------------------------------------------------
<!-- log4j.xml content not captured in this dump -->
--------------------------------------------------------------------------------
/PLANS.md:
--------------------------------------------------------------------------------
1 | # Pallet & Cascalog: Future Features
2 |
3 | (All page numbers refer to locations in [Hadoop, the Definitive Guide, 2nd ed.](http://oreilly.com/catalog/0636920010388).)
4 |
5 | ### Network Topology Optimization
6 |
7 | [rack awareness example...](http://www.matejunkie.com/how-to-kick-off-hadoops-rack-awareness/)
8 |
9 | Page 248 discusses how to map out a custom network topology on Hadoop using script-based mapping. Essentially, we need to write a script that will take a variable number of IP addresses and return the corresponding network locations. I'm not sure how we can do this effectively with a pre-written script. Maybe we could use stevedore to generate a script based on all existing nodes in the cluster? Check the "Hadoop, the Definitive Guide" source code for an example script.
10 |
11 | The other option would be to implement DNS to Switch mapping:
12 |
13 |     public interface DNSToSwitchMapping {
14 |       public List<String> resolve(List<String> names);
15 |     }
16 |
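If we go the interface route, a rough Clojure sketch might look like
this (hypothetical only; it assumes a static ip->rack table and
`org.apache.hadoop.net.DNSToSwitchMapping` on the classpath):

    ;; Hypothetical sketch -- a static mapping via reify.
    (defn static-mapping
      [ip->rack]
      (reify org.apache.hadoop.net.DNSToSwitchMapping
        (resolve [_ names]
          ;; default any unknown address, as hadoop does
          (mapv #(get ip->rack % "/default-rack") names))))
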
17 | * Figure out whether we want to go the script route, or the interface
18 | route. (Toni, what's easiest here, based on pallet's capabilities?)
19 | * What IP addresses are we going to receive, public or private? I'm
20 | guessing private, but it'd be nice to have an example to reference,
21 | here.
22 |
23 | ### Cluster Tags
24 |
25 | Each cluster the user creates should have a specific tag... every node that gets created should be accessed by that tag. I think that pallet can do this now, just not sure how. (`samcluster` should be different from `tonicluster`, and our commands on nodes of the same names shouldn't interfere with each other.)
26 |
27 | Some functions I'd like:
28 |
29 |     (hadoop-property :clustertag "mapred.job.tasks")
30 |     ;=> 10
31 |
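A rough (hypothetical) version keyed by the cluster map that
`cluster-spec` builds today, reading base-level props only, might be
(`test-cluster` is the cluster defined in `node_test.clj`):

    ;; Hypothetical sketch: property lookup on a cluster-spec map.
    (defn hadoop-property [cluster prop-path]
      (get-in (:base-props cluster) prop-path))

    (hadoop-property test-cluster [:mapred-site :mapred.reduce.tasks])
    ;=> 50
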
32 | ### Cluster Balancing
33 |
34 | Page 284. When we change the cluster size in some way, we need to run balancer on, I believe, the namenode.
35 |
36 | * Where do we need to run `balancer`?
37 |
38 | `balancer` runs until it's finished, and doesn't get in the way of much. Here's the code for `bin/start-balancer.sh`:
39 |
40 | bin=`dirname "$0"`
41 | bin=`cd "$bin"; pwd`
42 |
43 | . "$bin"/hadoop-config.sh
44 |
45 | # Start balancer daemon.
46 |
47 | "$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start balancer $@
48 |
49 | ### Better Ways of transferring in, out of HDFS
50 |
51 | It would be nice to have a good interface for `distcp`, plus a few commands like `bin/hadoop fs -getmerge`. This isn't as important for Cascalog and Cascading, since various custom and supplied taps take care of everything here.
52 |
53 | ### SSH Password Protection
54 |
55 | Page 251. We can use ssh-agent to get rid of the need to supply a password when logging in.
56 |
57 | ## Hadoop Configuration Files
58 |
59 | Turns out that the hostnames in the masters file are used to start up a secondary namenode. Weird!
60 |
61 | ### Different Node Classes
62 |
63 | We should provide the ability to have a class of node, with a number of different images, all sitting under the same class.
64 |
65 | The first place we can use this is for clusters of spot and non-spot nodes... then, spot nodes of varying prices. Beyond that, we might want some machines to have different capabilities than others, for whatever reason. (One can imagine a case where a fixed number of nodes are running, backed by EBS, hosting something like ElephantDB in addition to the hadoop namenode and jobtracker processes... the cluster can scale elastically beyond those nodes, but only with non-EBS-backed instances.)
66 |
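A hypothetical sketch of two slave classes built from the existing
`node-group` API (`:spot-price` is an assumed spec key, not one
pallet-hadoop currently reads):

    {:core-slaves (node-group [:slavenode] 2
                              :spec {:os-family :ubuntu})
     :spot-slaves (node-group [:slavenode] 10
                              :spec {:os-family :ubuntu
                                     :spot-price 0.08})}
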
67 | ### Metrics Support!
68 |
69 | See page 286. We might add support for [Ganglia](http://ganglia.info/), or FileContext. This would require proper modification of `conf/hadoop-metrics.properties`.
70 |
71 | ### Hadoop Logging Setup
72 |
73 | Page 142... thinking here about customization of `conf/log4j.properties`.
74 |
75 | ### Support for different Master Node Scenarios
76 |
77 | Page 254. The three masters, in any given cluster, will be the namenode, the jobtracker, and the secondary namenode (optional). They can run on 1-3 machines, in any combination. I don't think we'll ever want more than one of each. And, of course, the startup order's important!
78 |
79 | ### Other.
80 |
81 | NOTES:
82 |
83 | A cluster should take in a map of arguments (ip-type, for example)
84 | and a map of node descriptions, including base nodes for each node
85 | type, and output a cluster object. We should have a layer of
86 | abstraction on top of nodes, etc.
87 |
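For reference, this is roughly the shape that `cluster-spec` and
`node-group` in `pallet-hadoop.node` now give us:

    (cluster-spec :private
                  {:master (node-group [:namenode :jobtracker])
                   :slaves (slave-group 2)})
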
88 | #### NOTES ON HOSTNAME RESOLUTION
89 |
90 | It seems like this is an issue a number of folks are having. We
91 | need to populate /etc/hosts to skip DNS resolution if we're going to
92 | work on local machines. On EC2, I think we can get around this issue
93 | by using the public DNS address.
94 |
95 | Some discussion here on a way to short circuit DNS --
96 | http://www.travishegner.com/2009/06/hadoop-020-on-ubuntu-server-904-jaunty.html
97 |
98 | But do we want that, really?
99 |
100 | Looks like we need to do /etc/hosts internally -- we could probably
101 | do this externally as well, with Amazon's public DNS names and
102 | private IP addresses.
103 |
104 | From here:
105 | https://twiki.grid.iu.edu/bin/view/Storage/HadoopUnderstanding
106 |
107 | For the namenode etc. to be virtualized, you must be able to access
108 | them through DNS, or /etc/hosts.
109 |
110 | From HDFS-default --
111 | http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
112 |
113 | `dfs.datanode.dns.nameserver` -- The host name or IP address of the
114 | name server (DNS) which a DataNode should use to determine the host
115 | name used by the NameNode for communication and display purposes.
116 |
117 | More support for using external hostnames on EC2
118 | http://getsatisfaction.com/cloudera/topics/hadoop_configuring_a_slaves_hostname
119 |
120 | How to get hadoop running without DNS --
121 | http://db.tmtec.biz/blogs/index.php/get-hadoop-up-and-running-without-dns
122 |
123 | Using /etc/hosts as default --
124 | http://www.linuxquestions.org/questions/linux-server-73/how-to-setup-nslookups-queries-using-etc-hosts-as-the-default-654882/
125 |
126 | And, most clearly:
127 |
128 | http://www.cloudera.com/blog/2008/12/securing-a-hadoop-cluster-through-a-gateway/
129 |
130 | One “gotcha” of Hadoop is that the HDFS instance has a canonical name
131 | associated with it, based on the DNS name of the machine — not its IP
132 | address. If you provide an IP address for the fs.default.name, it will
133 | reverse-DNS this back to a DNS name, then subsequent connections will
134 | perform a forward-DNS lookup on the canonical DNS name.
135 |
136 | OTHER NOTES
137 |
138 | * [Hadoop cluster tips and tricks](http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/)
139 | * [Discussion of rack awareness](http://hadoop.apache.org/common/docs/r0.19.2/cluster_setup.html#Configuration+Files)
140 | * [Hadoop tutorial](http://developer.yahoo.com/hadoop/tutorial/module7.html)
141 |
142 | #### KEY NOTES
143 |
144 | From Noll link:
145 | http://www.mail-archive.com/common-user@hadoop.apache.org/msg00170.html
146 | http://search-hadoop.com/m/PcJ6xnNrSo1/Error+reading+task+output+http/v=threaded
147 | From a note here:
148 | http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#confmasters-master-only
149 |
150 | So, we can probably do this with /etc/hosts.
151 |
152 | ### More Notes
153 |
154 | Okay, here's the good stuff. We're trying to get a system up and
155 | running that can configure a persistent hadoop cluster.
156 |
157 | To act as the hadoop user:
158 |
159 | sudo su - hadoop
160 |
161 | With jclouds beta-9, I'm getting all sorts of errors. In config, we
162 | need to make sure we're using aws-ec2, not just ec2. Also, cake-pallet
163 | adds pallet as a dependency, which forces jclouds beta-8... that
164 | doesn't work if we're trying to play in beta-9's world.
165 |
166 | Either I have to go straight back to beta-8, with cake-pallet and no
167 | dependencies excluded...
168 |
169 | ## Configuring Proxy
170 |
171 | Compile squid from scratch:
172 |
173 | ./configure --enable-removal-policies="heap,lru"
174 |
175 | Then give the guys my configuration file, from my MacBook.
176 |
177 | TODO -- figure out how to get the proper user permissions for the
178 | squid user!
179 |
180 | Run `squid -z` the first time. `squid -N` runs in no-daemon mode.
181 |
182 | [Squid Config Basics](http://www.deckle.co.za/squid-users-guide/Squid_Configuration_Basics)
183 | [Starting Squid Guide](http://www.deckle.co.za/squid-users-guide/Starting_Squid)
184 |
185 | ## Configuring VMFest!
186 |
187 | Link over to [Toni's instructions](https://gist.github.com/867526) on
188 | how to test this bad boy.
189 |
190 | #### ERRORS with virtualbox
191 |
192 | http://forums.virtualbox.org/viewtopic.php?f=6&t=24383
193 |
--------------------------------------------------------------------------------
/src/pallet_hadoop/node.clj:
--------------------------------------------------------------------------------
1 | (ns pallet-hadoop.node
2 | (:use [pallet.crate.automated-admin-user :only (automated-admin-user)]
3 | [pallet.extensions :only (phase def-phase-fn)]
4 | [pallet.crate.java :only (java)]
5 | [pallet.core :only (make-node lift converge)]
6 | [pallet.compute :only (running? primary-ip private-ip nodes-by-tag nodes)]
7 | [clojure.set :only (union)])
8 | (:require [pallet.core :as core]
9 | [pallet.action.package :as package]
10 | [pallet.crate.hadoop :as h]
11 | [pallet.configure :as config]))
12 |
13 | ;; ## Hadoop Cluster Configuration
14 | (def user-config (:hadoop (config/pallet-config)))
15 |
16 | ;; ### Utilities
17 |
18 | (defn merge-to-vec
19 | "Returns a vector representation of the union of all supplied
20 | items. Entries in xs can be collections or individual items. For
21 | example,
22 |
23 | (merge-to-vec [1 2] :help 2 1)
24 | => [1 2 :help]"
25 | [& xs]
26 | (->> xs
27 | (map #(if (coll? %) (set %) #{%}))
28 | (apply (comp vec union))))
29 |
30 | (defn set-vals
31 | "Sets all entries of the supplied map equal to the supplied value."
32 | [map val]
33 | (zipmap (keys map)
34 | (repeat val)))
35 |
36 | ;; ### Defaults
37 |
38 | (def
39 | ^{:doc "Map between hadoop aliases and the roles for which they
40 |   stand. `:slavenode` acts as an alias for nodes that function as both
41 | datanodes and tasktrackers."}
42 | hadoop-aliases
43 | {:slavenode [:datanode :tasktracker]})
44 |
45 | (defn expand-aliases
46 | "Returns a sequence of hadoop roles, with any aliases replaced by
47 | the corresponding roles. `:slavenode` is the only alias, currently,
48 | and expands out to `:datanode` and `:tasktracker`."
49 | [roles]
50 | (->> roles
51 | (replace hadoop-aliases)
52 | (apply merge-to-vec)))
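;; For example (REPL sketch; the expected value matches node-test):
;;   (expand-aliases [:slavenode :namenode])
;;   ;=> [:datanode :tasktracker :namenode]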
53 |
54 | (def
55 | ^{:doc "Set of all hadoop `master` level tags. Used to assign
56 | default counts to master nodes, and to make sure that no more than
57 | one of each exists."}
58 | hadoop-masters
59 | #{:namenode :jobtracker})
60 |
61 | (defn master?
62 | "Predicate to determine whether or not the supplied sequence of
63 | roles contains a master node tag."
64 | [roleseq]
65 | (boolean (some hadoop-masters roleseq)))
66 |
67 | (def
68 | ^{:doc "Hadoop requires certain ports to be accessible, as discussed
69 | [here](http://goo.gl/nKk3B) by the folks at Cloudera. We provide
70 | sets of ports that can be merged based on the hadoop roles that some
71 | node-spec wants to use."}
72 | hadoop-ports
73 | {:default #{22 80}
74 | :namenode #{50070 8020}
75 | :datanode #{50075 50010 50020}
76 | :jobtracker #{50030 8021}
77 | :tasktracker #{50060}
78 | :secondarynamenode #{50090 50105}})
79 |
80 | (def role->phase-map
81 | {:default #{:bootstrap
82 | :reinstall
83 | :configure
84 | :reconfigure
85 | :authorize-jobtracker}
86 | :namenode #{:start-namenode}
87 | :datanode #{:start-hdfs}
88 | :jobtracker #{:publish-ssh-key :start-jobtracker}
89 | :tasktracker #{:start-mapred}})
90 |
91 | (defn roles->tags
92 | "Accepts sequence of hadoop roles and a map of `tag, node-group`
93 | pairs and returns a sequence of the corresponding node tags. Every
94 | role must exist in the supplied node-def map to make it past the
95 | postcondition."
96 | [role-seq node-defs]
97 | {:post [(= (count %)
98 | (count role-seq))]}
99 | (remove nil?
100 | (for [role role-seq]
101 | (some (fn [[tag def]]
102 | (when (some #{role} (get-in def [:node :roles]))
103 | tag))
104 | node-defs))))
105 |
106 | (defn roles->phases
107 | "Converts a sequence of hadoop roles into a sequence of pallet
108 | phases required by a node trying to take on each of these roles."
109 | [roles]
110 | (->> roles (mapcat role->phase-map) distinct vec))
111 |
112 | (defn hadoop-phases
113 | "Returns a map of all possible hadoop phases. IP-type specifies..."
114 | [{:keys [nodedefs ip-type]} properties]
115 | (let [[jt-tag nn-tag] (roles->tags [:jobtracker :namenode] nodedefs)
116 | configure (phase (h/configure ip-type nn-tag jt-tag properties))]
117 | {:bootstrap automated-admin-user
118 | :configure (phase (package/package-manager :update)
119 | ;(package/package-manager :upgrade)
120 | (java :openjdk)
121 | (h/install (or (:distro user-config) :cloudera))
122 | configure)
123 | :reinstall (phase (h/install (or (:distro user-config) :cloudera))
124 | configure)
125 | :reconfigure configure
126 | :publish-ssh-key h/publish-ssh-key
127 | :authorize-jobtracker (phase (h/authorize-tag jt-tag))
128 | :start-mapred h/task-tracker
129 | :start-hdfs h/data-node
130 | :start-jobtracker h/job-tracker
131 | :start-namenode (phase (h/name-node "/tmp/node-name/data"))}))
132 |
133 | (defn hadoop-machine-spec
134 | "Generates a pallet node spec for the supplied hadoop node,
135 | merging together the given base map with properties required to
136 | support the attached hadoop roles."
137 | [{:keys [spec roles]}]
138 | (let [ports (->> roles (mapcat hadoop-ports) distinct vec)]
139 | (merge-with merge-to-vec
140 | spec
141 | {:inbound-ports ports})))
142 |
143 | (defn hadoop-server-spec
144 | "Returns a map of all hadoop phases. `hadoop-server-spec` currently
145 | doesn't compose with existing hadoop phases. This will change soon."
146 | [cluster {:keys [props roles]}]
147 | (select-keys (hadoop-phases cluster props)
148 | (roles->phases roles)))
149 |
150 | (defn merge-node
151 | "Returns a new hadoop node map generated by merging the supplied
152 | node into the base specs defined by the supplied cluster."
153 | [cluster node]
154 | {:post [(some (partial contains? role->phase-map) (:roles %))]}
155 | (let [{:keys [base-machine-spec base-props]} cluster
156 | {:keys [spec roles props]} node]
157 | {:spec (merge base-machine-spec spec)
158 | :props (h/merge-config base-props props)
159 | :roles (-> roles
160 | (conj :default)
161 | expand-aliases)}))
162 |
163 | (defn hadoop-spec
164 | "Generates a pallet representation of a hadoop node, built from the
165 | supplied cluster and the supplied hadoop node map -- see
166 | `node-group` for construction details. (`hadoop-spec` is similar to
167 | `pallet.core/defnode`, sans binding.)"
168 | [cluster tag node]
169 | (let [node (merge-node cluster node)]
170 | (apply core/make-node
171 | tag
172 | (hadoop-machine-spec node)
173 | (apply concat (hadoop-server-spec cluster node)))))
174 |
175 | (defn node-group
176 | "Generates a map representation of a Hadoop node. For example:
177 |
178 | (node-group [:slavenode] 10)
179 | => {:node {:roles [:tasktracker :datanode]
180 | :spec {}
181 | :props {}}
182 | :count 10}"
183 | [role-seq & [count & {:keys [spec props]}]]
184 | {:pre [(if (master? role-seq)
185 | (or (nil? count) (= count 1))
186 | count)]}
187 | {:node {:roles role-seq
188 | :spec (or spec {})
189 | :props (or props {})}
190 | :count (or count 1)})
191 |
192 | (def slave-group (partial node-group [:slavenode]))
193 |
194 | (defn cluster-spec
195 | "Generates a data representation of a hadoop cluster.
196 |
197 |   ip-type: `:public` or `:private`. Hadoop keeps track of
198 |   jobtracker and namenode identity via IP address; this option toggles
199 |   the type of IP address used. (EC2 requires `:private`, while a local
200 |   cluster running on virtual machines will require `:public`.)"
201 | [ip-type nodedefs & {:as options}]
202 | {:pre [(#{:public :private} ip-type)]}
203 | (merge {:base-machine-spec {}
204 | :base-props {}}
205 | options
206 | {:ip-type ip-type
207 | :nodedefs nodedefs}))
208 |
209 | (defn cluster->node-map
210 | "Converts a cluster to `node-map` represention, for use in a call to
211 | `pallet.core/converge`."
212 | [cluster]
213 | (into {}
214 | (for [[tag {:keys [count node]}] (:nodedefs cluster)]
215 | [(hadoop-spec cluster tag node) count])))
216 |
217 | (defn cluster->node-set
218 | "Converts a cluster to `node-set` represention, for use in a call to
219 | `pallet.core/lift`."
220 | [cluster]
221 | (keys (cluster->node-map cluster)))
222 |
223 | ;; ### Cluster Level Converge and Lift
224 |
225 | (defn converge-cluster
226 | "Identical to `pallet.core/converge`, with `cluster` taking the
227 | place of `node-map`."
228 | [cluster & options]
229 | (apply core/converge
230 | (cluster->node-map cluster)
231 | options))
232 |
233 | (defn lift-cluster
234 | "Identical to `pallet.core/lift`, with `cluster` taking the
235 | place of `node-set`."
236 | [cluster & options]
237 | (apply core/lift
238 | (cluster->node-set cluster)
239 | options))
240 |
241 | (defn boot-cluster
242 | "Launches all nodes in the supplied cluster, installs hadoop and
243 | enables SSH access between jobtracker and all other nodes. See
244 | `pallet.core/converge` for details on acceptable options."
245 | [cluster & options]
246 | (apply converge-cluster
247 | cluster
248 | :phase [:configure
249 | :publish-ssh-key
250 | :authorize-jobtracker]
251 | options))
252 |
253 | (defn start-cluster
254 | "Starts up all hadoop services on the supplied cluster. See
255 | `pallet.core/lift` for details on acceptable options. (All are valid
256 |   except `:phase`, for now.)"
257 | [cluster & options]
258 | (apply lift-cluster
259 | cluster
260 | :phase [:start-namenode
261 | :start-hdfs
262 | :start-jobtracker
263 | :start-mapred]
264 | options))
265 |
266 | (defn kill-cluster
267 | "Converges cluster with counts of zero for all nodes, shutting down
268 | everything. See `pallet.core/converge` for details on acceptable
269 | options."
270 | [cluster & options]
271 | (apply core/converge
272 | (-> (cluster->node-map cluster)
273 | (set-vals 0))
274 | options))
275 |
276 | ;; helper functions
277 |
278 | (defn master-ip
279 | "Returns a string containing the IP address of the master node
280 | instantiated in the service."
281 | [service tag-kwd ip-type]
282 | ;; We need to make sure we only check for running nodes, as if you
283 | ;; rebuild the cluster EC2 will report both running and terminated
284 | ;; nodes for quite a while.
285 | (when-let [master-node (first
286 | (tag-kwd (nodes-by-tag
287 | (filter running? (nodes service)))))]
288 | (case ip-type
289 | :private (private-ip master-node)
290 | :public (primary-ip master-node))))
291 |
292 | (defn jobtracker-ip
293 | "Returns a string containing the IP address of the jobtracker node
294 | instantiated in the service."
295 | [ip-type service]
296 | (master-ip service :jobtracker ip-type))
297 |
298 | (defn namenode-ip
299 | "Returns a string containing the IP address of the namenode node
300 |   instantiated in the service, if there is one."
301 | [ip-type service]
302 | (master-ip service :namenode ip-type))
303 |
304 | (comment
305 | "This'll get you started; for a more detailed introduction, please
306 | head over to https://github.com/pallet/pallet-hadoop-example."
307 |
308 | (use 'pallet-hadoop.node)
309 | (use 'pallet.compute)
310 |
311 | ;; We can define our compute service here...
312 | (def ec2-service
313 | (compute-service "aws-ec2"
314 | :identity "ec2-access-key-id"
315 | :credential "ec2-secret-access-key"))
316 |
317 | ;; Or, we can get this from a config file, in
318 | ;; `~/.pallet/config.clj`.
319 | (def ec2-service
320 | (service :aws))
321 |
322 | (def example-cluster
323 | (cluster-spec :private
324 | {:jobtracker (node-group [:jobtracker :namenode])
325 | :slaves (slave-group 1)}
326 | :base-machine-spec {:os-family :ubuntu
327 | :os-version-matches "10.10"
328 | :os-64-bit true}
329 | :base-props {:mapred-site {:mapred.task.timeout 300000
330 | :mapred.reduce.tasks 3
331 | :mapred.tasktracker.map.tasks.maximum 3
332 | :mapred.tasktracker.reduce.tasks.maximum 3
333 | :mapred.child.java.opts "-Xms1024m"}}))
334 |
335 | (boot-cluster example-cluster :compute ec2-service)
336 | (start-cluster example-cluster :compute ec2-service))
337 |
--------------------------------------------------------------------------------