├── .gitignore ├── LICENSE ├── README.md ├── hive.md ├── mapreduce.md └── setup.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | lib/ 17 | lib64/ 18 | parts/ 19 | sdist/ 20 | var/ 21 | *.egg-info/ 22 | .installed.cfg 23 | *.egg 24 | 25 | # PyInstaller 26 | # Usually these files are written by a python script from a template 27 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 28 | *.manifest 29 | *.spec 30 | 31 | # Installer logs 32 | pip-log.txt 33 | pip-delete-this-directory.txt 34 | 35 | # Unit test / coverage reports 36 | htmlcov/ 37 | .tox/ 38 | .coverage 39 | .cache 40 | nosetests.xml 41 | coverage.xml 42 | 43 | # Translations 44 | *.mo 45 | *.pot 46 | 47 | # Django stuff: 48 | *.log 49 | 50 | # Sphinx documentation 51 | docs/_build/ 52 | 53 | # PyBuilder 54 | target/ 55 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Irmak Sirer 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Hadoop MapReduce with Python and Hive 2 | 3 | A tutorial for writing a MapReduce program for Hadoop in python, and using Hive to do MapReduce with SQL-like queries. 4 | 5 | This uses the Hadoop Streaming API with python to teach the basics of using the MapReduce framework. 6 | The main idea and structure is based on [Michael G. Noll's great tutorial](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/). However, that tutorial is outdated and quite few of the steps do not work anymore, both in setting up and running Hadoop. This is an updated and expanded tutorial, combined with a Hive tutorial. 7 | 8 | You can write map and reduce functions in python, and use them with Hadoop's streaming API as shown here. This gives you a lot of flexibility. 
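For a flavor of what that looks like, here is a stripped-down sketch of a streaming mapper — Hadoop feeds it lines of text on stdin and it emits tab-separated key/value pairs on stdout (the full, working word-count mapper and reducer are in [mapreduce.md](mapreduce.md)):

```python
#!/usr/bin/env python
# Minimal streaming-mapper sketch: emit "word<TAB>1" for every token on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%i" % (word, 1)
```

Hadoop sorts these pairs by key between the map and reduce stages, so a reducer only has to add up the counts for runs of identical words.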
9 | 10 | In many cases, however, the information you are trying to get from the data distributed on the cluster can be expressed in terms of a SQL query. Hive is a program that takes a SQL query like this, automatically builds map and reduce jobs, runs them and returns the results. It makes using MapReduce really simple, all you need is some familiarity with basic SQL queries. 11 | 12 | This tutorial shows both of these ways of using Hadoop. 13 | 14 | ## Let's Begin 15 | 16 | [Setup your Hadoop cluster](setup.md) 17 | 18 | [Write your first Python MapReduce program](mapreduce.md) 19 | 20 | [Use Hive to make MapReduce queries](hive.md) 21 | 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /hive.md: -------------------------------------------------------------------------------- 1 | ## Using Hive 2 | 3 | #### Install and set up Hive 4 | 5 | ssh to your cloud computer and switch to the hduser. Go to the hduser's home. 6 | 7 | ```bash 8 | $ su hduser 9 | $ cd 10 | ``` 11 | Let's download Hive. 12 | 13 | ```bash 14 | $ wget http://mirror.cogentco.com/pub/apache/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz 15 | ``` 16 | This is a zipped file. Extract it. 17 | 18 | ```bash 19 | $ tar -xzvf apache-hive-0.14.0-bin.tar.gz 20 | ``` 21 | Now, in hduser's home directory, you have a directory called `apache-hive-0.14.0-bin`. 22 | We are going to edit your `.bashrc` file to make an environment varible called HIVE_HOME 23 | so that Hadoop and Hive know where it lives (like we did with the Hadoop setup). 24 | 25 | ```bash 26 | $ emacs ~/.bashrc 27 | ``` 28 | To the end of the `.bashrc` file that we are editing, add the following lines 29 | 30 | ```bash 31 | export HIVE_HOME=/home/hduser/apache-hive-0.14.0-bin 32 | export PATH=$PATH:$HIVE_HOME/bin 33 | ``` 34 | Save and close. Now we need to run the `.bashrc` to make sure HIVE_HOME is defined. 35 | 36 | ```bash 37 | $ source ~/.bashrc 38 | ``` 39 | That's it. Now we installed Hive. We have one last setup step left. 40 | Hive basically behaves like a SQL database that lives on the hdfs. 41 | To be able to do that, it needs a temporary files folder and a place to store the underlying data (in the hdfs). 42 | We need to create these directories in the hdfs and give the necessary permissions. 43 | > Obviously, to be able to make changes to the hdfs, make sure your hadoop cluster is up and running. 44 | > Do `jps` to check if the namenode, secondary namenode and the datanode are up. 45 | > If not, you need to start them with the `start-dfs.sh` first, you can check to hadoop tutorial to see how we did that. 46 | 47 | ```bash 48 | $ hdfs dfs -mkdir -p /tmp 49 | $ hdfs dfs -mkdir -p /user/hive/warehouse 50 | $ hdfs dfs -chmod g+w /tmp 51 | $ hdfs dfs -chmod g+w /user/hive/warehouse 52 | ``` 53 | Aaand, your Hive is ready to rock your world. You can run it by typing 54 | 55 | ``` 56 | $ hive 57 | ``` 58 | You should get a hive prompt, like this: 59 | 60 | ``` 61 | hive> _ 62 | ``` 63 | Hive's syntax is (almost) identical to SQL. So let's load up some data and use it. First, exit: 64 | 65 | ``` 66 | hive> exit; 67 | ``` 68 | 69 | #### Download some baseball data to play with 70 | 71 | You should be back at your regular prompt now. Let's download some baseball data. 72 | ```bash 73 | $ wget http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip 74 | ``` 75 | This is a zipped file with a bunch of csv files, each is a sql table. 76 | These tables are full of baseball statistics from 2013. 
Let's unzip this. 77 | First we need to switch back to a user with sudo powers so we can install unzip, 78 | then switch back to hduser and unzip it. 79 | (Of course, switch `irmak` below with your own username) 80 | 81 | ```bash 82 | $ su irmak 83 | $ sudo apt-get install unzip 84 | $ su hduser 85 | $ mkdir baseballdata 86 | $ unzip lahman-csv_2014-02-14.zip -d baseballdata 87 | ``` 88 | 89 | #### First look & cleanup of the data 90 | 91 | Now you have a bunch of csv files in the `baseballdata` directory. 92 | You can think of each csv as a table in a baseball database. 93 | Let's create one Hive table and read a csv into that table. 94 | This is the exact analog of loading a csv into a sql table. 95 | Let's do this with the `baseballdata/Master.csv`. 96 | First, take a look at that csv file. 97 | 98 | ```bash 99 | $ head baseballdata/Master.csv 100 | ``` 101 | 102 | You should see this: 103 | 104 | ```Text 105 | playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID 106 | aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,205,75,R,R,2004-04-06,2013-09-28,aardd001,aardsda01 107 | aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01 108 | aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01 109 | aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01 110 | abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01 111 | abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2013-09-27,abadf001,abadfe01 112 | abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01 113 | abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01 114 | abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01 115 | ``` 116 | 117 | Ok. We need to create at table with these following column headers. 118 | To be able to read it better, let's print each column name in a line. 119 | ```bash 120 | $ head -n 1 baseballdata/Master.csv | tr ',' '\n' 121 | ``` 122 | You should see 123 | 124 | ```Text 125 | playerID 126 | birthYear 127 | birthMonth 128 | birthDay 129 | birthCountry 130 | birthState 131 | birthCity 132 | deathYear 133 | deathMonth 134 | deathDay 135 | deathCountry 136 | deathState 137 | deathCity 138 | nameFirst 139 | nameLast 140 | nameGiven 141 | weight 142 | height 143 | bats 144 | throws 145 | debut 146 | finalGame 147 | retroID 148 | bbrefID 149 | ``` 150 | What did we do up there? `head` shows us only several lines at the beginning of a file. 151 | The option `-n 1` tells it to show only the first line (`-n 5` would have shown the first five). 152 | The output of this is a line of column headers separated by commas. 153 | We pipe this output into `tr ',' '\n'`, which converts (or *tr*anslates) every `,` character into a newline character (`\n`). 154 | That way, we get a new line every time there was a comma. 155 | Ok, great. This will help us construct the table. 
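(If you would rather do this kind of quick inspection from Python, a rough equivalent of the `head -n 1 ... | tr ',' '\n'` pipeline, using only the standard library, is:)

```python
import csv

# Read just the header row of Master.csv and print one column name per line.
with open('baseballdata/Master.csv') as f:
    header = next(csv.reader(f))

for column in header:
    print column
```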
156 | One last thing we need to do is to remove the first line from the file, though, to make it easier to upload it to Hive. 157 | We can use `tail` for this, which, just like head, only shows several lines of a file, but the **last** lines instead of the first. 158 | `tail -n 4` shows the last 4 lines, for example. `tail -n +8` shows all lines including and after the 8th line. 159 | So, to get rid of the first line, we want `tail -n +2`. Let's pipe the output into head to check if it will indeed work: 160 | 161 | ```bash 162 | $ tail -n +2 baseballdata/Master.csv | head 163 | ``` 164 | should show 165 | ```Text 166 | aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,205,75,R,R,2004-04-06,2013-09-28,aardd001,aardsda01 167 | aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01 168 | aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01 169 | aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01 170 | abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01 171 | abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2013-09-27,abadf001,abadfe01 172 | abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01 173 | abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01 174 | abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01 175 | abbeych01,1866,10,14,USA,NE,Falls City,1926,4,27,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01 176 | ``` 177 | Looks like it's working. So let's write this into a temporary file and then overwrite the original with this new temp file so Master.csv no longer has the header line. 178 | ```bash 179 | $ tail -n +2 baseballdata/Master.csv > tmp && mv tmp baseballdata/Master.csv 180 | ``` 181 | The `&&` means do the first part first, and when it finished, do what follows the `&&`. 182 | 183 | Ok. we removed the header. 
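(The same cleanup can also be done from Python; a minimal sketch that drops the header row and rewrites the file in place — equivalent to the `tail -n +2 ... && mv` one-liner above — would be:)

```python
# Drop the first (header) line of Master.csv and rewrite the file in place.
path = 'baseballdata/Master.csv'

with open(path) as f:
    lines = f.readlines()

with open(path, 'w') as f:
    f.writelines(lines[1:])   # everything except the header row
```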
Let's make sure we did 184 | ```bash 185 | $ head baseballdata/Master.csv 186 | aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,205,75,R,R,2004-04-06,2013-09-28,aardd001,aardsda01 187 | aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01 188 | aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01 189 | aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01 190 | abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01 191 | abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2013-09-27,abadf001,abadfe01 192 | abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01 193 | abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01 194 | abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01 195 | abbeych01,1866,10,14,USA,NE,Falls City,1926,4,27,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01 196 | ``` 197 | 198 | #### Upload data to Hive 199 | 200 | Indeed it's gone. Alright. Let's upload this to hive. First, we need to upload it to hdfs. 201 | (of course, change `irmak` to whichever directory you have in hdfs) 202 | ```bash 203 | $ hdfs dfs -mkdir -p /user/irmak/baseballdata 204 | $ hdfs dfs -put baseballdata/Master.csv /user/irmak/baseballdata 205 | ``` 206 | We created a new directory in hsfs and uploaded the csv to it. 207 | Let's make sure it's there. 208 | ```bash 209 | $ hdfs dfs -ls /user/irmak/baseballdata 210 | Found 1 items 211 | -rw-r--r-- 1 hduser supergroup 2422684 2015-03-11 22:12 /user/irmak/baseballdata/Master.csv 212 | ``` 213 | It is. Awesome. Time to run hive 214 | ```bash 215 | $ hive 216 | 217 | Logging initialized using configuration in jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-common-0.14.0.jar!/hive-log4j.properties 218 | SLF4J: Class path contains multiple SLF4J bindings. 219 | SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] 220 | SLF4J: Found binding in [jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class] 221 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 222 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 223 | hive> 224 | ``` 225 | We now have the Hive prompt. (Ignore the SLF4J warning, it's an unimportant logging thing). 
226 | Let's create the table 227 | 228 | ```sql 229 | hive> CREATE TABLE IF NOT EXISTS Master 230 | (playerID STRING, 231 | birthYear INT, 232 | birthMonth INT, 233 | birthDay INT, 234 | birthCountry STRING, 235 | birthState STRING, 236 | birthCity STRING, 237 | deathYear INT, 238 | deathMonth INT, 239 | deathDay INT, 240 | deathCountry STRING, 241 | deathState STRING, 242 | deathCity STRING, 243 | nameFirst STRING, 244 | nameLast STRING, 245 | nameGiven STRING, 246 | weight INT, 247 | height INT, 248 | bats STRING, 249 | throws STRING, 250 | debut STRING, 251 | finalGame STRING, 252 | retroID STRING, 253 | bbrefID STRING) 254 | COMMENT 'Master Player Table' 255 | ROW FORMAT DELIMITED 256 | FIELDS TERMINATED BY ',' 257 | STORED AS TEXTFILE; 258 | OK 259 | Time taken: 1.752 seconds 260 | ``` 261 | And let's load the data 262 | ```sql 263 | hive> LOAD DATA INPATH '/user/irmak/baseballdata/Master.csv' OVERWRITE INTO TABLE Master; 264 | Loading data to table default.master 265 | Table default.master stats: [numFiles=1, numRows=0, totalSize=2422684, rawDataSize=0] 266 | OK 267 | Time taken: 1.166 seconds 268 | ``` 269 | And it's in! 270 | 271 | #### Use Hive to make queries over the distributed data 272 | 273 | We now have a Hive table. The best part of hive is, when you make a query (that most of the time looks **exactly** like a sql query), Hive automatically creates the map and reduce tasks, runs them over the hadoop cluster, and gives you the answer, without you having to worry about any of it. If your question is easily represented in the form of a sql query, Hive will take care of all the dirty work for you. The table might be spread over thousands of computers, but you don't need to think hard about that at all. 274 | 275 | Let's start easy. Let's find out how many players we have in this table. 276 | ```sql 277 | hive> SELECT COUNT(playerid) FROM Master; 278 | Query ID = hduser_20150311224646_00211363-82a3-49f0-aac2-b0d6abb4caf9 279 | Total jobs = 1 280 | Launching Job 1 out of 1 281 | Number of reduce tasks determined at compile time: 1 282 | In order to change the average load for a reducer (in bytes): 283 | set hive.exec.reducers.bytes.per.reducer= 284 | In order to limit the maximum number of reducers: 285 | set hive.exec.reducers.max= 286 | In order to set a constant number of reducers: 287 | set mapreduce.job.reduces= 288 | Job running in-process (local Hadoop) 289 | Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0 290 | 2015-03-11 22:46:52,488 Stage-1 map = 100%, reduce = 100% 291 | Ended Job = job_local747935831_0002 292 | MapReduce Jobs Launched: 293 | Stage-Stage-1: HDFS Read: 9690750 HDFS Write: 0 SUCCESS 294 | Total MapReduce CPU Time Spent: 0 msec 295 | OK 296 | 18354 297 | Time taken: 2.424 seconds, Fetched: 1 row(s) 298 | ``` 299 | As you can see, hive reports on the job it is setting up and the map reduce tasks that job entails, reports on the progress, and finally gives the result: There are 18354 players. 300 | 301 | Let's do something that would require more involved mapper and reducer functions, but is pretty straightforward with Hive. Let's get the weight distribution of the players in this table. 302 | ```sql 303 | hive> SELECT weight, count(playerID) FROM Master GROUP BY weight; 304 | Query ID = hduser_20150311223636_6eda794b-8400-4054-9fce-b1080af16f99 305 | Total jobs = 1 306 | Launching Job 1 out of 1 307 | Number of reduce tasks not specified. 
Estimated from input data size: 1 308 | In order to change the average load for a reducer (in bytes): 309 | set hive.exec.reducers.bytes.per.reducer= 310 | In order to limit the maximum number of reducers: 311 | set hive.exec.reducers.max= 312 | In order to set a constant number of reducers: 313 | set mapreduce.job.reduces= 314 | Job running in-process (local Hadoop) 315 | Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0 316 | 2015-03-11 22:36:44,510 Stage-1 map = 0%, reduce = 0% 317 | 2015-03-11 22:36:45,753 Stage-1 map = 100%, reduce = 0% 318 | 2015-03-11 22:36:46,797 Stage-1 map = 100%, reduce = 100% 319 | Ended Job = job_local1584830750_0001 320 | MapReduce Jobs Launched: 321 | Stage-Stage-1: HDFS Read: 4845382 HDFS Write: 0 SUCCESS 322 | Total MapReduce CPU Time Spent: 0 msec 323 | OK 324 | NULL 877 325 | 65 1 326 | 120 3 327 | 125 4 328 | 126 2 329 | 127 1 330 | 128 1 331 | 129 2 332 | 130 11 333 | 132 1 334 | 133 1 335 | 134 1 336 | 135 16 337 | 136 4 338 | 137 2 339 | 138 9 340 | 139 2 341 | 140 59 342 | 141 3 343 | 142 15 344 | 143 9 345 | 144 7 346 | 145 94 347 | 146 6 348 | 147 10 349 | 148 30 350 | 149 8 351 | 150 279 352 | 151 8 353 | 152 33 354 | 153 16 355 | 154 41 356 | 155 317 357 | 156 34 358 | 157 40 359 | 158 78 360 | 159 14 361 | 160 794 362 | 161 18 363 | 162 61 364 | 163 52 365 | 164 53 366 | 165 996 367 | 166 27 368 | 167 60 369 | 168 191 370 | 169 29 371 | 170 1273 372 | 171 16 373 | 172 113 374 | 173 54 375 | 174 69 376 | 175 1532 377 | 176 66 378 | 177 25 379 | 178 188 380 | 179 28 381 | 180 1604 382 | 181 21 383 | 182 75 384 | 183 61 385 | 184 45 386 | 185 1587 387 | 186 71 388 | 187 108 389 | 188 75 390 | 189 24 391 | 190 1444 392 | 191 12 393 | 192 62 394 | 193 49 395 | 194 33 396 | 195 1082 397 | 196 38 398 | 197 36 399 | 198 52 400 | 199 2 401 | 200 1012 402 | 201 7 403 | 202 18 404 | 203 19 405 | 204 19 406 | 205 634 407 | 206 5 408 | 207 18 409 | 208 22 410 | 209 10 411 | 210 620 412 | 211 4 413 | 212 15 414 | 213 6 415 | 214 5 416 | 215 463 417 | 216 5 418 | 217 9 419 | 218 12 420 | 219 3 421 | 220 412 422 | 221 3 423 | 222 4 424 | 223 4 425 | 225 243 426 | 226 5 427 | 227 2 428 | 228 6 429 | 230 189 430 | 233 2 431 | 234 3 432 | 235 112 433 | 237 3 434 | 240 105 435 | 241 1 436 | 242 1 437 | 243 1 438 | 244 2 439 | 245 48 440 | 250 52 441 | 254 1 442 | 255 22 443 | 257 1 444 | 260 21 445 | 265 8 446 | 269 1 447 | 270 8 448 | 275 8 449 | 280 5 450 | 283 1 451 | 285 3 452 | 290 2 453 | 295 2 454 | 310 1 455 | 320 1 456 | Time taken: 9.265 seconds, Fetched: 132 row(s) 457 | ``` 458 | As you can see, a simple GROUP BY statement takes care of everything. Easier than writing and executing specific mapreduce functions. 459 | 460 | In this manner, you can do sql-like queries over tons of data that live in the hdfs in a distributed state. Since hdfs and mapreduce have overheads, it will not be as fast as a sql query on data that fits a single machine, but you now get the answers in parallel, and are able to do sql queries over hundreds of terabytes of data. 461 | 462 | #### Join example in Hive 463 | 464 | Let's upload another table and see how joins work. Salaries.csv has four columns: year, team, league, player, salary. It only has salary information for after 1984, but it's pretty extensive. 
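(Before moving on, it is worth pausing on how much work that one `GROUP BY weight` query saved. Written by hand for Hadoop Streaming, the same histogram would need something like the two hypothetical scripts sketched below, plus the whole job-submission command from the [MapReduce tutorial](mapreduce.md).)

```python
#!/usr/bin/env python
# Hypothetical weight_mapper.py: emit "weight<TAB>1" for each row of the
# header-less Master.csv. Weight is the 17th comma-separated field; a naive
# comma split is good enough for a sketch (a csv reader would be more robust).
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    weight = fields[16] or 'NULL'   # empty weights show up as NULL, as in Hive's output
    print "%s\t%i" % (weight, 1)
```

```python
#!/usr/bin/env python
# Hypothetical weight_reducer.py: input arrives sorted by key, so consecutive
# identical weights can simply be totalled up (same pattern as count_reducer.py).
import sys

current_weight, current_count = None, 0

for line in sys.stdin:
    weight, count = line.split('\t')
    if weight == current_weight:
        current_count += int(count)
    else:
        if current_weight is not None:
            print "%s\t%i" % (current_weight, current_count)
        current_weight, current_count = weight, int(count)

if current_weight is not None:
    print "%s\t%i" % (current_weight, current_count)
```

With Hive, all of that (plus the job bookkeeping) collapses into the single `GROUP BY` above.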
465 | 466 | Let's remove the header and upload it to hdfs 467 | ```bash 468 | hive> exit; 469 | $ tail -n +2 baseballdata/Salaries.csv > tmp && mv tmp baseballdata/Salaries.csv 470 | $ hdfs dfs -put baseballdata/Salaries.csv /user/irmak/baseballdata 471 | ``` 472 | Switch to hive, create the table and load the data. 473 | ```sql 474 | $ hive 475 | 476 | Logging initialized using configuration in jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-common-0.14.0.jar!/hive-log4j.properties 477 | SLF4J: Class path contains multiple SLF4J bindings. 478 | SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] 479 | SLF4J: Found binding in [jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class] 480 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 481 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 482 | hive> CREATE TABLE IF NOT EXISTS Salaries 483 | (yearID INT, teamID STRING, lgID STRING, playerID STRING, salary INT) 484 | COMMENT 'Salary Table for Players' 485 | ROW FORMAT DELIMITED 486 | FIELDS TERMINATED BY ',' 487 | STORED AS TEXTFILE; 488 | OK 489 | Time taken: 2.502 seconds 490 | hive> LOAD DATA INPATH '/user/irmak/baseballdata/Salaries.csv' OVERWRITE INTO TABLE Salaries; 491 | Loading data to table default.salaries 492 | Table default.salaries stats: [numFiles=1, numRows=0, totalSize=724918, rawDataSize=0] 493 | OK 494 | Time taken: 2.237 seconds 495 | hive> SHOW TABLES; 496 | OK 497 | master 498 | salaries 499 | Time taken: 0.203 seconds, Fetched: 2 row(s) 500 | ``` 501 | Mission accomplished. We have two tables now: `Master` and `Salaries`. (By the way, you should have noticed by now that nothing in Hive is case sensitive). 502 | 503 | Let's do a somewhat more complicated query that involves two tables. Let's take a look at the upper end of the weight distribution among the players and their salaries. Here is the breakdown of the query: 504 | > 505 | > ```sql 506 | > SELECT Salaries.yearID, Master.nameFirst, Master.nameLast, Master.weight, Salaries.salary 507 | > ``` 508 | > This is what we want to read: The first & last name of the player, their weight, and their salary at a specific year. Salary and year comes from the salary table and the rest from the master table. 509 | > 510 | > ```sql 511 | > FROM Master JOIN Salaries ON (Master.playerID = Salaries.playerID) 512 | > ``` 513 | > This is how we combine the information from both tables. We want the row for a player to connect with the salary rows for that player. Note that there are multiple rows for the same player in the Salaries table (for multiple years). 514 | > 515 | > ```sql 516 | > WHERE Master.weight > 270; 517 | > ``` 518 | > Only show the players who weigh more than 270 pounds. Also note that we don't have yearly weights, but a single weight statistic for each player (that is reported in the Master table). 519 | > 520 | 521 | So, let's put this query together and execute it. 522 | ```sql 523 | hive> SELECT Salaries.yearID, Master.nameFirst, Master.nameLast, Master.weight, Salaries.salary FROM Master JOIN Salaries ON (Master.playerID = Salaries.playerID) WHERE Master.weight > 270; 524 | Query ID = hduser_20150312000707_f2a73817-d862-4080-9d23-8c0e77960e65 525 | Total jobs = 1 526 | SLF4J: Class path contains multiple SLF4J bindings. 
527 | SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] 528 | SLF4J: Found binding in [jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class] 529 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 530 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 531 | Execution log at: /tmp/hduser/hduser_20150312000707_f2a73817-d862-4080-9d23-8c0e77960e65.log 532 | 2015-03-12 12:08:02 Starting to launch local task to process map join; maximum memory = 518979584 533 | 2015-03-12 12:08:05 Dump the side-table for tag: 1 with group count: 4668 into file: file:/tmp/hduser/734bcc7b-78b9-46d9-a7cc-868c9ded365d/hive_2015-03-12_00-07-53_654_5923233654914816047-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile41--.hashtable 534 | 2015-03-12 12:08:06 Uploaded 1 File to: file:/tmp/hduser/734bcc7b-78b9-46d9-a7cc-868c9ded365d/hive_2015-03-12_00-07-53_654_5923233654914816047-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile41--.hashtable (396345 bytes) 535 | 2015-03-12 12:08:06 End of local task; Time Taken: 3.41 sec. 536 | Execution completed successfully 537 | MapredLocal task succeeded 538 | Launching Job 1 out of 1 539 | Number of reduce tasks is set to 0 since there's no reduce operator 540 | Job running in-process (local Hadoop) 541 | Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0 542 | 2015-03-12 00:08:09,069 Stage-3 map = 100%, reduce = 0% 543 | Ended Job = job_local414824063_0005 544 | MapReduce Jobs Launched: 545 | Stage-Stage-3: HDFS Read: 12113427 HDFS Write: 0 SUCCESS 546 | Total MapReduce CPU Time Spent: 0 msec 547 | OK 548 | 2007 Jonathan Broxton 310 390000 549 | 2008 Jonathan Broxton 310 454000 550 | 2009 Jonathan Broxton 310 1825000 551 | 2010 Jonathan Broxton 310 4000000 552 | 2011 Jonathan Broxton 310 7000000 553 | 2012 Jonathan Broxton 310 4000000 554 | 2013 Jonathan Broxton 310 4000000 555 | 2012 Jose Ceda 280 480000 556 | 2002 Adam Dunn 285 250000 557 | 2003 Adam Dunn 285 400000 558 | 2004 Adam Dunn 285 445000 559 | 2005 Adam Dunn 285 4600000 560 | 2006 Adam Dunn 285 7500000 561 | 2007 Adam Dunn 285 10500000 562 | 2008 Adam Dunn 285 13000000 563 | 2009 Adam Dunn 285 8000000 564 | 2010 Adam Dunn 285 12000000 565 | 2011 Adam Dunn 285 12000000 566 | 2012 Adam Dunn 285 14000000 567 | 2013 Adam Dunn 285 15000000 568 | 2006 Prince Fielder 275 329500 569 | 2007 Prince Fielder 275 415000 570 | 2008 Prince Fielder 275 670000 571 | 2009 Prince Fielder 275 7000000 572 | 2010 Prince Fielder 275 11000000 573 | 2011 Prince Fielder 275 15500000 574 | 2012 Prince Fielder 275 23000000 575 | 2013 Prince Fielder 275 23000000 576 | 2006 Bobby Jenks 275 340000 577 | 2007 Bobby Jenks 275 400000 578 | 2008 Bobby Jenks 275 550000 579 | 2009 Bobby Jenks 275 5600000 580 | 2010 Bobby Jenks 275 7500000 581 | 2011 Bobby Jenks 275 6000000 582 | 2012 Bobby Jenks 275 6000000 583 | 2003 Seth McClung 280 300000 584 | 2004 Seth McClung 280 302500 585 | 2005 Seth McClung 280 320000 586 | 2006 Seth McClung 280 343000 587 | 2008 Seth McClung 280 750000 588 | 2009 Seth McClung 280 1662500 589 | 2009 Jeff Niemann 285 1290000 590 | 2010 Jeff Niemann 285 1032000 591 | 2011 Jeff Niemann 285 903000 592 | 2012 Jeff Niemann 285 2750000 593 | 2007 Chad Paronto 285 420000 594 | 2002 Calvin Pickering 283 200000 595 | 2005 Calvin Pickering 283 323500 596 | 2007 Renyel Pinto 280 380000 597 | 2008 Renyel 
Pinto 280 391500 598 | 2009 Renyel Pinto 280 404000 599 | 2010 Renyel Pinto 280 1075000 600 | 2002 Jon Rauch 290 200000 601 | 2006 Jon Rauch 290 335000 602 | 2007 Jon Rauch 290 455000 603 | 2008 Jon Rauch 290 1200000 604 | 2009 Jon Rauch 290 2525000 605 | 2010 Jon Rauch 290 2900000 606 | 2011 Jon Rauch 290 3500000 607 | 2012 Jon Rauch 290 3500000 608 | 2013 Jon Rauch 290 1000000 609 | 2002 CC Sabathia 290 700000 610 | 2003 CC Sabathia 290 1100000 611 | 2004 CC Sabathia 290 2700000 612 | 2005 CC Sabathia 290 5250000 613 | 2006 CC Sabathia 290 7000000 614 | 2007 CC Sabathia 290 8750000 615 | 2008 CC Sabathia 290 11000000 616 | 2009 CC Sabathia 290 15285714 617 | 2010 CC Sabathia 290 24285714 618 | 2011 CC Sabathia 290 24285714 619 | 2012 CC Sabathia 290 23000000 620 | 2013 CC Sabathia 290 24285714 621 | 2002 Carlos Silva 280 200000 622 | 2003 Carlos Silva 280 310000 623 | 2004 Carlos Silva 280 340000 624 | 2005 Carlos Silva 280 1750000 625 | 2006 Carlos Silva 280 3200000 626 | 2007 Carlos Silva 280 4325000 627 | 2008 Carlos Silva 280 8250000 628 | 2009 Carlos Silva 280 12250000 629 | 2010 Carlos Silva 280 12750000 630 | 1996 Dmitri Young 295 109000 631 | 1997 Dmitri Young 295 155000 632 | 1998 Dmitri Young 295 215000 633 | 1999 Dmitri Young 295 375000 634 | 2000 Dmitri Young 295 1950000 635 | 2001 Dmitri Young 295 3500000 636 | 2002 Dmitri Young 295 5500000 637 | 2003 Dmitri Young 295 6750000 638 | 2004 Dmitri Young 295 7750000 639 | 2005 Dmitri Young 295 8000000 640 | 2006 Dmitri Young 295 8000000 641 | 2007 Dmitri Young 295 500000 642 | 2008 Dmitri Young 295 5000000 643 | 2009 Dmitri Young 295 5000000 644 | 2003 Carlos Zambrano 275 340000 645 | 2004 Carlos Zambrano 275 450000 646 | 2005 Carlos Zambrano 275 3760000 647 | 2006 Carlos Zambrano 275 6500000 648 | 2007 Carlos Zambrano 275 12400000 649 | 2008 Carlos Zambrano 275 16000000 650 | 2009 Carlos Zambrano 275 18750000 651 | 2010 Carlos Zambrano 275 18875000 652 | 2011 Carlos Zambrano 275 18875000 653 | 2012 Carlos Zambrano 275 19000000 654 | Time taken: 15.451 seconds, Fetched: 106 row(s) 655 | ``` 656 | Done. By joining tables, you can build some pretty complicated queries, which Hive will automatically execute with MapReduce. 657 | 658 | #### More resources 659 | 660 | [You can find the documentation for Hive commands here](https://cwiki.apache.org/confluence/display/Hive/LanguageManual). 661 | 662 | [And here is another tutorial with more examples](https://cwiki.apache.org/confluence/display/Hive/Tutorial) 663 | -------------------------------------------------------------------------------- /mapreduce.md: -------------------------------------------------------------------------------- 1 | ## MapReduce with Python 2 | 3 | #### Start the cluster! 4 | 5 | /usr/local/hadoop/sbin/start-dfs.sh 6 | 7 | Yes!!! It’s running. You can check the report on the cluster at this 8 | address on your web browser: 9 | 10 | http://:50070 11 | 12 | (Replace with the actual ip, like 167.214.312.54) 13 | On the terminal, 14 | 15 | jps 16 | 17 | will show you that `DataNode`, `NameNode` and `SecondaryNameNode` are running. 18 | 19 | #### Let’s stop it 20 | 21 | /usr/local/hadoop/sbin/stop-dfs.sh 22 | 23 | Now go check 24 | 25 | http://:50070 26 | 27 | It shouldn’t be there anymore! 28 | 29 | #### Get data 30 | 31 | Alright, let’s put some data in. 32 | 33 | Let’s make a directory for these 34 | 35 | mkdir -p /home/hduser/textdata 36 | 37 | First we’ll start with putting the data into our normal data system. 
38 | If you have some text files, you can use them for this. 39 | If not, here are three ebooks (plain text `utf-8` encoding) you can 40 | `wget`: 41 | 42 | Ulyses by James Joyce 43 | [http://www.gutenberg.org/cache/epub/4300/pg4300.txt][1] 44 | 45 | Notebooks of Leonardo Da Vinci 46 | [http://www.gutenberg.org/cache/epub/5000/pg5000.txt][2] 47 | 48 | The Outline of Science by J Arthur Thomson 49 | [http://www.gutenberg.org/cache/epub/20417/pg20417.txt][3] 50 | 51 | (For example, to get these, you can do this: 52 | 53 | cd /home/hduser/textdata 54 | wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt 55 | wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt 56 | wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt 57 | 58 | ) 59 | 60 | #### Put data in hdfs 61 | 62 | First, let’s start the cluster again! 63 | 64 | /usr/local/hadoop/sbin/start-dfs.sh 65 | 66 | make some directories **in the hadoop distributed file system!** 67 | 68 | hdfs dfs -mkdir /user/ 69 | hdfs dfs -mkdir /user/irmak/ 70 | 71 | Of course replace `irmak` with your own username. 72 | Let’s check that they exist 73 | 74 | hdfs dfs -ls / 75 | hdfs dfs -ls /user/ 76 | 77 | Yay! 78 | 79 | Ok, put some data in 80 | 81 | hdfs dfs -put /home/hduser/textdata/* /user/irmak 82 | 83 | Check and make sure it is in the hdfs 84 | 85 | hdfs dfs -ls /user/irmak 86 | 87 | Yay! 88 | 89 | ####Our mapper and reducer 90 | 91 | Our mapper `count_mapper.py` includes the following code: 92 | ```python 93 | #!/usr/bin/env python 94 | 95 | import sys 96 | from textblob import TextBlob 97 | 98 | for line in sys.stdin: 99 | line = line.decode('utf-8') 100 | words = TextBlob(line).words 101 | for word in words: 102 | word = word.encode('utf-8') 103 | print "%s\t%i" % (word, 1) 104 | ``` 105 | 106 | And our reducer `count_reducer.py` looks like this: 107 | ```python 108 | #!/usr/bin/env python 109 | 110 | import sys 111 | 112 | current_word = None 113 | current_count = 0 114 | word = None 115 | 116 | for line in sys.stdin: 117 | word, count = line.split('\t') 118 | count = int(count) 119 | if word == current_word: 120 | current_count += count 121 | else: 122 | if current_word: 123 | print '%s\t%i' % (current_word, current_count) 124 | current_word = word 125 | current_count = count 126 | 127 | if current_word == word: 128 | print '%s\t%i' % (current_word, current_count) 129 | ``` 130 | 131 | Before running these codes, we need to make sure that textblob has its nltk corpora downloaded, so that it can work without an error. To do that, execute this on the command line (as the hduser): 132 | 133 | python -m textblob.download_corpora 134 | 135 | ####Let's run it! 136 | 137 | Before giving the following command, don't forget to replace the `/user/irmak` path (in the hdfs) with your own version, and the paths to `count_mapper.py` and `count_reducer.py` (in your droplet's local filesystem) with your own versions. 138 | 139 | hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file /home/hduser/count_mapper.py -mapper /home/hduser/count_mapper.py -file /home/hduser/count_reducer.py -reducer /home/hduser/count_reducer.py -input /user/irmak/* -output /user/irmak/book-output 140 | 141 | Booom ! It's running. 142 | 143 | #### Looking at the output 144 | Once it's done, 145 | 146 | hdfs dfs -ls /user/irmak/book-output 147 | 148 | should show that there is a `_SUCCESS` file (showing we did it!) and 149 | another file called `part-00000` 150 | 151 | This `part-00000` is our output. 
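As an aside: since the mapper and reducer are ordinary scripts that read stdin and write stdout, you can also sanity-check them locally, without Hadoop, by chaining them through a pipe. A rough sketch (the paths are the ones used earlier in this tutorial; `local_counts.txt` is just a hypothetical output file):

```python
import subprocess

# Local dry run of the same pipeline, no Hadoop involved:
#   book -> count_mapper.py -> sort -> count_reducer.py -> local_counts.txt
with open('/home/hduser/textdata/pg4300.txt') as book, \
     open('local_counts.txt', 'w') as out:
    mapper = subprocess.Popen(['python', '/home/hduser/count_mapper.py'],
                              stdin=book, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(['sort'],
                              stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(['python', '/home/hduser/count_reducer.py'],
                               stdin=sorter.stdout, stdout=out)
    mapper.stdout.close()   # let the mapper receive SIGPIPE if sort exits early
    sorter.stdout.close()   # likewise for sort if the reducer exits early
    reducer.wait()
```

This only processes one book on one machine, of course — the point is just to catch bugs in the scripts quickly before submitting a job. Now, back to the real output sitting in the hdfs.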
To look in: 152 | 153 | hdfs dfs -cat /user/irmak/book-output/part-00000 154 | 155 | or just 156 | 157 | hdfs dfs -cat /user/irmak/book-output/* 158 | 159 | will show the output of our job! 160 | 161 | If you want to see the most common words, run: 162 | 163 | hdfs dfs -cat /user/irmak/book-output/* | sort -rnk2 | less 164 | 165 | ########Note: 166 | If something went wrong when you ran your mapreduce job, you fix something and want to run it again, it will throw a different error, saying that the book-output directory already exists in hdfs. This error is thrown to avoid overwriting previous results. If you want to just rerun it anyway, you need to delete the output first, so it can be created again: 167 | 168 | hdfs dfs -rm -r /user/irmak/book-output 169 | 170 | 171 | 172 | [1]: http://www.gutenberg.org/cache/epub/4300/pg4300.txt 173 | [2]: http://www.gutenberg.org/cache/epub/5000/pg5000.txt 174 | [3]: http://www.gutenberg.org/cache/epub/20417/pg20417.txt 175 | -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | ### Hadoop installation and setup on an Ubuntu server 2 | 3 | Create a cloud server (through a service such as AWS, Rackspace, Digital Ocean, etc.), ssh to it, and follow the white rabbit below. 4 | 5 | #### Install TextBlob 6 | 7 | We're going to use it in our text processing, so make sure you have textblob in there. 8 | 9 | sudo pip install textblob 10 | 11 | #### Install Java 7 12 | 13 | sudo apt-get install python-software-properties 14 | sudo add-apt-repository ppa:webupd8team/java 15 | sudo apt-get update 16 | sudo apt-get install oracle-jdk7-installer 17 | 18 | The java install will ask a few straightforward questions, just answer 19 | them. 20 | 21 | ####Check that java version is 1.7 22 | 23 | java -version 24 | 25 | 26 | ####Create a Hadoop user 27 | 28 | sudo addgroup hadoop 29 | sudo adduser --ingroup hadoop hduser 30 | 31 | This will ask for a password, give it one. Each user in unix has a 32 | password. You will use that when you switch to that user. 33 | 34 | Make an ssh key so hadoop can connect to machines with ssh without entering a password every time. 35 | 36 | su hduser 37 | ssh-keygen -t rsa -P "" 38 | 39 | (hit enter when asked where to save the key) 40 | 41 | Add the key to recognized keys in target computers (same as localhost 42 | in this tutorial case) 43 | 44 | cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 45 | 46 | Add key to recognized keys list and test that it works without a 47 | password 48 | 49 | ssh localhost 50 | 51 | (type yes and enter when it asks about adding the key print to the 52 | known keys list) 53 | 54 | 55 | ####Download and Install Hadoop 56 | 57 | wget http://psg.mtu.edu/pub/apache/hadoop/common/stable/hadoop-2.6.0.tar.gz 58 | 59 | su irmak 60 | 61 | (switch to your own user, you’ll need some sudo here) 62 | 63 | sudo tar xvzf hadoop-2.6.0.tar.gz 64 | 65 | sudo mv hadoop-2.6.0 /usr/local/hadoop 66 | 67 | cd /usr/local 68 | 69 | sudo chown -R hduser:hadoop hadoop 70 | 71 | #### Change bashrc settings 72 | 73 | su hduser 74 | 75 | emacs ~/.bashrc 76 | 77 | You can use another editor, too, of course. (If you don't have emacs, 78 | and do not want to use another editor you can install emacs with 79 | *apt-get install emacs*). 
Add the following lines:

    # Environment variable for Hadoop location, include bin in the path
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin

    # Environment variable for Java location
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle

    # Hadoop related aliases
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"

Exit emacs (`Ctrl-x Ctrl-s` to save, `Ctrl-x Ctrl-c` to exit). Great, now these will run every time you connect to the server, but let's also make sure they apply now. Type this in your terminal:

    source ~/.bashrc

#### Create the place to put HDFS on and tell Hadoop where it is

    su irmak

(We need some more sudo stuff so switch back to yourself for now)

    sudo mkdir -p /app/hadoop/tmp

    sudo chown -R hduser:hadoop /app/hadoop/tmp

    su hduser

(back to hduser to edit the configuration files)

    emacs /usr/local/hadoop/etc/hadoop/core-site.xml

Between `<configuration>` and `</configuration>` put this in:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation.
  The uri's scheme determines the config property (fs.SCHEME.impl)
  naming the FileSystem implementation class. The uri's authority
  is used to determine the host, port, etc. for a filesystem.
  </description>
</property>
```

Ok, now another one.

    emacs /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

Between `<configuration>` and `</configuration>` put this in:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
```

And the last one:

    emacs /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Between `<configuration>` and `</configuration>` put this in:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if replication
  is not specified in create time.
  </description>
</property>
```

Also tell hadoop where java 7 is:

    emacs /usr/local/hadoop/etc/hadoop/hadoop-env.sh

And at the very end, append this line:

    export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Save, quit, and we're good.

#### Format the HDFS (hadoop filesystem)

    hdfs namenode -format

## STOP HERE.

Take a breath. Your setup is complete. The rest is actually using hadoop.

--------------------------------------------------------------------------------