├── .gitignore ├── LICENSE ├── README.md ├── hive.md ├── mapreduce.md └── setup.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | lib/ 17 | lib64/ 18 | parts/ 19 | sdist/ 20 | var/ 21 | *.egg-info/ 22 | .installed.cfg 23 | *.egg 24 | 25 | # PyInstaller 26 | # Usually these files are written by a python script from a template 27 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 28 | *.manifest 29 | *.spec 30 | 31 | # Installer logs 32 | pip-log.txt 33 | pip-delete-this-directory.txt 34 | 35 | # Unit test / coverage reports 36 | htmlcov/ 37 | .tox/ 38 | .coverage 39 | .cache 40 | nosetests.xml 41 | coverage.xml 42 | 43 | # Translations 44 | *.mo 45 | *.pot 46 | 47 | # Django stuff: 48 | *.log 49 | 50 | # Sphinx documentation 51 | docs/_build/ 52 | 53 | # PyBuilder 54 | target/ 55 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Irmak Sirer 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Hadoop MapReduce with Python and Hive 2 | 3 | A tutorial for writing a MapReduce program for Hadoop in python, and using Hive to do MapReduce with SQL-like queries. 4 | 5 | This uses the Hadoop Streaming API with python to teach the basics of using the MapReduce framework. 6 | The main idea and structure is based on [Michael G. Noll's great tutorial](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/). However, that tutorial is outdated and quite few of the steps do not work anymore, both in setting up and running Hadoop. This is an updated and expanded tutorial, combined with a Hive tutorial. 7 | 8 | You can write map and reduce functions in python, and use them with Hadoop's streaming API as shown here. This gives you a lot of flexibility. 
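For a flavor of what that looks like, here is a stripped-down sketch of a streaming mapper — Hadoop feeds it lines of text on stdin and it emits tab-separated key/value pairs on stdout (the full, working word-count mapper and reducer are in [mapreduce.md](mapreduce.md)):

```python
#!/usr/bin/env python
# Minimal streaming-mapper sketch: emit "word<TAB>1" for every token on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%i" % (word, 1)
```

Hadoop sorts these pairs by key between the map and reduce stages, so a reducer only has to add up the counts for runs of identical words.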
9 | 10 | In many cases, however, the information you are trying to get from the data distributed on the cluster can be expressed in terms of a SQL query. Hive is a program that takes a SQL query like this, automatically builds map and reduce jobs, runs them and returns the results. It makes using MapReduce really simple, all you need is some familiarity with basic SQL queries. 11 | 12 | This tutorial shows both of these ways of using Hadoop. 13 | 14 | ## Let's Begin 15 | 16 | [Setup your Hadoop cluster](setup.md) 17 | 18 | [Write your first Python MapReduce program](mapreduce.md) 19 | 20 | [Use Hive to make MapReduce queries](hive.md) 21 | 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /hive.md: -------------------------------------------------------------------------------- 1 | ## Using Hive 2 | 3 | #### Install and set up Hive 4 | 5 | ssh to your cloud computer and switch to the hduser. Go to the hduser's home. 6 | 7 | ```bash 8 | $ su hduser 9 | $ cd 10 | ``` 11 | Let's download Hive. 12 | 13 | ```bash 14 | $ wget http://mirror.cogentco.com/pub/apache/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz 15 | ``` 16 | This is a zipped file. Extract it. 17 | 18 | ```bash 19 | $ tar -xzvf apache-hive-0.14.0-bin.tar.gz 20 | ``` 21 | Now, in hduser's home directory, you have a directory called `apache-hive-0.14.0-bin`. 22 | We are going to edit your `.bashrc` file to make an environment varible called HIVE_HOME 23 | so that Hadoop and Hive know where it lives (like we did with the Hadoop setup). 24 | 25 | ```bash 26 | $ emacs ~/.bashrc 27 | ``` 28 | To the end of the `.bashrc` file that we are editing, add the following lines 29 | 30 | ```bash 31 | export HIVE_HOME=/home/hduser/apache-hive-0.14.0-bin 32 | export PATH=$PATH:$HIVE_HOME/bin 33 | ``` 34 | Save and close. Now we need to run the `.bashrc` to make sure HIVE_HOME is defined. 35 | 36 | ```bash 37 | $ source ~/.bashrc 38 | ``` 39 | That's it. Now we installed Hive. We have one last setup step left. 40 | Hive basically behaves like a SQL database that lives on the hdfs. 41 | To be able to do that, it needs a temporary files folder and a place to store the underlying data (in the hdfs). 42 | We need to create these directories in the hdfs and give the necessary permissions. 43 | > Obviously, to be able to make changes to the hdfs, make sure your hadoop cluster is up and running. 44 | > Do `jps` to check if the namenode, secondary namenode and the datanode are up. 45 | > If not, you need to start them with the `start-dfs.sh` first, you can check to hadoop tutorial to see how we did that. 46 | 47 | ```bash 48 | $ hdfs dfs -mkdir -p /tmp 49 | $ hdfs dfs -mkdir -p /user/hive/warehouse 50 | $ hdfs dfs -chmod g+w /tmp 51 | $ hdfs dfs -chmod g+w /user/hive/warehouse 52 | ``` 53 | Aaand, your Hive is ready to rock your world. You can run it by typing 54 | 55 | ``` 56 | $ hive 57 | ``` 58 | You should get a hive prompt, like this: 59 | 60 | ``` 61 | hive> _ 62 | ``` 63 | Hive's syntax is (almost) identical to SQL. So let's load up some data and use it. First, exit: 64 | 65 | ``` 66 | hive> exit; 67 | ``` 68 | 69 | #### Download some baseball data to play with 70 | 71 | You should be back at your regular prompt now. Let's download some baseball data. 72 | ```bash 73 | $ wget http://seanlahman.com/files/database/lahman-csv_2014-02-14.zip 74 | ``` 75 | This is a zipped file with a bunch of csv files, each is a sql table. 76 | These tables are full of baseball statistics from 2013. 
Let's unzip this. 77 | First we need to switch back to a user with sudo powers so we can install unzip, 78 | then switch back to hduser and unzip it. 79 | (Of course, switch `irmak` below with your own username) 80 | 81 | ```bash 82 | $ su irmak 83 | $ sudo apt-get install unzip 84 | $ su hduser 85 | $ mkdir baseballdata 86 | $ unzip lahman-csv_2014-02-14.zip -d baseballdata 87 | ``` 88 | 89 | #### First look & cleanup of the data 90 | 91 | Now you have a bunch of csv files in the `baseballdata` directory. 92 | You can think of each csv as a table in a baseball database. 93 | Let's create one Hive table and read a csv into that table. 94 | This is the exact analog of loading a csv into a sql table. 95 | Let's do this with the `baseballdata/Master.csv`. 96 | First, take a look at that csv file. 97 | 98 | ```bash 99 | $ head baseballdata/Master.csv 100 | ``` 101 | 102 | You should see this: 103 | 104 | ```Text 105 | playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID 106 | aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,205,75,R,R,2004-04-06,2013-09-28,aardd001,aardsda01 107 | aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01 108 | aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01 109 | aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01 110 | abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01 111 | abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2013-09-27,abadf001,abadfe01 112 | abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01 113 | abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01 114 | abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01 115 | ``` 116 | 117 | Ok. We need to create at table with these following column headers. 118 | To be able to read it better, let's print each column name in a line. 119 | ```bash 120 | $ head -n 1 baseballdata/Master.csv | tr ',' '\n' 121 | ``` 122 | You should see 123 | 124 | ```Text 125 | playerID 126 | birthYear 127 | birthMonth 128 | birthDay 129 | birthCountry 130 | birthState 131 | birthCity 132 | deathYear 133 | deathMonth 134 | deathDay 135 | deathCountry 136 | deathState 137 | deathCity 138 | nameFirst 139 | nameLast 140 | nameGiven 141 | weight 142 | height 143 | bats 144 | throws 145 | debut 146 | finalGame 147 | retroID 148 | bbrefID 149 | ``` 150 | What did we do up there? `head` shows us only several lines at the beginning of a file. 151 | The option `-n 1` tells it to show only the first line (`-n 5` would have shown the first five). 152 | The output of this is a line of column headers separated by commas. 153 | We pipe this output into `tr ',' '\n'`, which converts (or *tr*anslates) every `,` character into a newline character (`\n`). 154 | That way, we get a new line every time there was a comma. 155 | Ok, great. This will help us construct the table. 
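(If you would rather do this kind of quick inspection from Python, a rough equivalent of the `head -n 1 ... | tr ',' '\n'` pipeline, using only the standard library, is:)

```python
import csv

# Read just the header row of Master.csv and print one column name per line.
with open('baseballdata/Master.csv') as f:
    header = next(csv.reader(f))

for column in header:
    print column
```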
156 | One last thing we need to do is to remove the first line from the file, though, to make it easier to upload it to Hive. 157 | We can use `tail` for this, which, just like head, only shows several lines of a file, but the **last** lines instead of the first. 158 | `tail -n 4` shows the last 4 lines, for example. `tail -n +8` shows all lines including and after the 8th line. 159 | So, to get rid of the first line, we want `tail -n +2`. Let's pipe the output into head to check if it will indeed work: 160 | 161 | ```bash 162 | $ tail -n +2 baseballdata/Master.csv | head 163 | ``` 164 | should show 165 | ```Text 166 | aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,205,75,R,R,2004-04-06,2013-09-28,aardd001,aardsda01 167 | aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01 168 | aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01 169 | aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01 170 | abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01 171 | abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2013-09-27,abadf001,abadfe01 172 | abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01 173 | abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01 174 | abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01 175 | abbeych01,1866,10,14,USA,NE,Falls City,1926,4,27,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01 176 | ``` 177 | Looks like it's working. So let's write this into a temporary file and then overwrite the original with this new temp file so Master.csv no longer has the header line. 178 | ```bash 179 | $ tail -n +2 baseballdata/Master.csv > tmp && mv tmp baseballdata/Master.csv 180 | ``` 181 | The `&&` means do the first part first, and when it finished, do what follows the `&&`. 182 | 183 | Ok. we removed the header. 
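(The same cleanup can also be done from Python; a minimal sketch that drops the header row and rewrites the file in place — equivalent to the `tail -n +2 ... && mv` one-liner above — would be:)

```python
# Drop the first (header) line of Master.csv and rewrite the file in place.
path = 'baseballdata/Master.csv'

with open(path) as f:
    lines = f.readlines()

with open(path, 'w') as f:
    f.writelines(lines[1:])   # everything except the header row
```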
Let's make sure we did 184 | ```bash 185 | $ head baseballdata/Master.csv 186 | aardsda01,1981,12,27,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,205,75,R,R,2004-04-06,2013-09-28,aardd001,aardsda01 187 | aaronha01,1934,2,5,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180,72,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01 188 | aaronto01,1939,8,5,USA,AL,Mobile,1984,8,16,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190,75,R,R,1962-04-10,1971-09-26,aarot101,aaronto01 189 | aasedo01,1954,9,8,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190,75,R,R,1977-07-26,1990-10-03,aased001,aasedo01 190 | abadan01,1972,8,25,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184,73,L,L,2001-09-10,2006-04-13,abada001,abadan01 191 | abadfe01,1985,12,17,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,220,73,L,L,2010-07-28,2013-09-27,abadf001,abadfe01 192 | abadijo01,1854,11,4,USA,PA,Philadelphia,1905,5,17,USA,NJ,Pemberton,John,Abadie,John W.,192,72,R,R,1875-04-26,1875-06-10,abadj101,abadijo01 193 | abbated01,1877,4,15,USA,PA,Latrobe,1957,1,6,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170,71,R,R,1897-09-04,1910-09-15,abbae101,abbated01 194 | abbeybe01,1869,11,11,USA,VT,Essex,1962,6,11,USA,VT,Colchester,Bert,Abbey,Bert Wood,175,71,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01 195 | abbeych01,1866,10,14,USA,NE,Falls City,1926,4,27,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169,68,L,L,1893-08-16,1897-08-19,abbec101,abbeych01 196 | ``` 197 | 198 | #### Upload data to Hive 199 | 200 | Indeed it's gone. Alright. Let's upload this to hive. First, we need to upload it to hdfs. 201 | (of course, change `irmak` to whichever directory you have in hdfs) 202 | ```bash 203 | $ hdfs dfs -mkdir -p /user/irmak/baseballdata 204 | $ hdfs dfs -put baseballdata/Master.csv /user/irmak/baseballdata 205 | ``` 206 | We created a new directory in hsfs and uploaded the csv to it. 207 | Let's make sure it's there. 208 | ```bash 209 | $ hdfs dfs -ls /user/irmak/baseballdata 210 | Found 1 items 211 | -rw-r--r-- 1 hduser supergroup 2422684 2015-03-11 22:12 /user/irmak/baseballdata/Master.csv 212 | ``` 213 | It is. Awesome. Time to run hive 214 | ```bash 215 | $ hive 216 | 217 | Logging initialized using configuration in jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-common-0.14.0.jar!/hive-log4j.properties 218 | SLF4J: Class path contains multiple SLF4J bindings. 219 | SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] 220 | SLF4J: Found binding in [jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class] 221 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 222 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 223 | hive> 224 | ``` 225 | We now have the Hive prompt. (Ignore the SLF4J warning, it's an unimportant logging thing). 
226 | Let's create the table 227 | 228 | ```sql 229 | hive> CREATE TABLE IF NOT EXISTS Master 230 | (playerID STRING, 231 | birthYear INT, 232 | birthMonth INT, 233 | birthDay INT, 234 | birthCountry STRING, 235 | birthState STRING, 236 | birthCity STRING, 237 | deathYear INT, 238 | deathMonth INT, 239 | deathDay INT, 240 | deathCountry STRING, 241 | deathState STRING, 242 | deathCity STRING, 243 | nameFirst STRING, 244 | nameLast STRING, 245 | nameGiven STRING, 246 | weight INT, 247 | height INT, 248 | bats STRING, 249 | throws STRING, 250 | debut STRING, 251 | finalGame STRING, 252 | retroID STRING, 253 | bbrefID STRING) 254 | COMMENT 'Master Player Table' 255 | ROW FORMAT DELIMITED 256 | FIELDS TERMINATED BY ',' 257 | STORED AS TEXTFILE; 258 | OK 259 | Time taken: 1.752 seconds 260 | ``` 261 | And let's load the data 262 | ```sql 263 | hive> LOAD DATA INPATH '/user/irmak/baseballdata/Master.csv' OVERWRITE INTO TABLE Master; 264 | Loading data to table default.master 265 | Table default.master stats: [numFiles=1, numRows=0, totalSize=2422684, rawDataSize=0] 266 | OK 267 | Time taken: 1.166 seconds 268 | ``` 269 | And it's in! 270 | 271 | #### Use Hive to make queries over the distributed data 272 | 273 | We now have a Hive table. The best part of hive is, when you make a query (that most of the time looks **exactly** like a sql query), Hive automatically creates the map and reduce tasks, runs them over the hadoop cluster, and gives you the answer, without you having to worry about any of it. If your question is easily represented in the form of a sql query, Hive will take care of all the dirty work for you. The table might be spread over thousands of computers, but you don't need to think hard about that at all. 274 | 275 | Let's start easy. Let's find out how many players we have in this table. 276 | ```sql 277 | hive> SELECT COUNT(playerid) FROM Master; 278 | Query ID = hduser_20150311224646_00211363-82a3-49f0-aac2-b0d6abb4caf9 279 | Total jobs = 1 280 | Launching Job 1 out of 1 281 | Number of reduce tasks determined at compile time: 1 282 | In order to change the average load for a reducer (in bytes): 283 | set hive.exec.reducers.bytes.per.reducer= 284 | In order to limit the maximum number of reducers: 285 | set hive.exec.reducers.max= 286 | In order to set a constant number of reducers: 287 | set mapreduce.job.reduces= 288 | Job running in-process (local Hadoop) 289 | Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0 290 | 2015-03-11 22:46:52,488 Stage-1 map = 100%, reduce = 100% 291 | Ended Job = job_local747935831_0002 292 | MapReduce Jobs Launched: 293 | Stage-Stage-1: HDFS Read: 9690750 HDFS Write: 0 SUCCESS 294 | Total MapReduce CPU Time Spent: 0 msec 295 | OK 296 | 18354 297 | Time taken: 2.424 seconds, Fetched: 1 row(s) 298 | ``` 299 | As you can see, hive reports on the job it is setting up and the map reduce tasks that job entails, reports on the progress, and finally gives the result: There are 18354 players. 300 | 301 | Let's do something that would require more involved mapper and reducer functions, but is pretty straightforward with Hive. Let's get the weight distribution of the players in this table. 302 | ```sql 303 | hive> SELECT weight, count(playerID) FROM Master GROUP BY weight; 304 | Query ID = hduser_20150311223636_6eda794b-8400-4054-9fce-b1080af16f99 305 | Total jobs = 1 306 | Launching Job 1 out of 1 307 | Number of reduce tasks not specified. 
Estimated from input data size: 1 308 | In order to change the average load for a reducer (in bytes): 309 | set hive.exec.reducers.bytes.per.reducer= 310 | In order to limit the maximum number of reducers: 311 | set hive.exec.reducers.max= 312 | In order to set a constant number of reducers: 313 | set mapreduce.job.reduces= 314 | Job running in-process (local Hadoop) 315 | Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0 316 | 2015-03-11 22:36:44,510 Stage-1 map = 0%, reduce = 0% 317 | 2015-03-11 22:36:45,753 Stage-1 map = 100%, reduce = 0% 318 | 2015-03-11 22:36:46,797 Stage-1 map = 100%, reduce = 100% 319 | Ended Job = job_local1584830750_0001 320 | MapReduce Jobs Launched: 321 | Stage-Stage-1: HDFS Read: 4845382 HDFS Write: 0 SUCCESS 322 | Total MapReduce CPU Time Spent: 0 msec 323 | OK 324 | NULL 877 325 | 65 1 326 | 120 3 327 | 125 4 328 | 126 2 329 | 127 1 330 | 128 1 331 | 129 2 332 | 130 11 333 | 132 1 334 | 133 1 335 | 134 1 336 | 135 16 337 | 136 4 338 | 137 2 339 | 138 9 340 | 139 2 341 | 140 59 342 | 141 3 343 | 142 15 344 | 143 9 345 | 144 7 346 | 145 94 347 | 146 6 348 | 147 10 349 | 148 30 350 | 149 8 351 | 150 279 352 | 151 8 353 | 152 33 354 | 153 16 355 | 154 41 356 | 155 317 357 | 156 34 358 | 157 40 359 | 158 78 360 | 159 14 361 | 160 794 362 | 161 18 363 | 162 61 364 | 163 52 365 | 164 53 366 | 165 996 367 | 166 27 368 | 167 60 369 | 168 191 370 | 169 29 371 | 170 1273 372 | 171 16 373 | 172 113 374 | 173 54 375 | 174 69 376 | 175 1532 377 | 176 66 378 | 177 25 379 | 178 188 380 | 179 28 381 | 180 1604 382 | 181 21 383 | 182 75 384 | 183 61 385 | 184 45 386 | 185 1587 387 | 186 71 388 | 187 108 389 | 188 75 390 | 189 24 391 | 190 1444 392 | 191 12 393 | 192 62 394 | 193 49 395 | 194 33 396 | 195 1082 397 | 196 38 398 | 197 36 399 | 198 52 400 | 199 2 401 | 200 1012 402 | 201 7 403 | 202 18 404 | 203 19 405 | 204 19 406 | 205 634 407 | 206 5 408 | 207 18 409 | 208 22 410 | 209 10 411 | 210 620 412 | 211 4 413 | 212 15 414 | 213 6 415 | 214 5 416 | 215 463 417 | 216 5 418 | 217 9 419 | 218 12 420 | 219 3 421 | 220 412 422 | 221 3 423 | 222 4 424 | 223 4 425 | 225 243 426 | 226 5 427 | 227 2 428 | 228 6 429 | 230 189 430 | 233 2 431 | 234 3 432 | 235 112 433 | 237 3 434 | 240 105 435 | 241 1 436 | 242 1 437 | 243 1 438 | 244 2 439 | 245 48 440 | 250 52 441 | 254 1 442 | 255 22 443 | 257 1 444 | 260 21 445 | 265 8 446 | 269 1 447 | 270 8 448 | 275 8 449 | 280 5 450 | 283 1 451 | 285 3 452 | 290 2 453 | 295 2 454 | 310 1 455 | 320 1 456 | Time taken: 9.265 seconds, Fetched: 132 row(s) 457 | ``` 458 | As you can see, a simple GROUP BY statement takes care of everything. Easier than writing and executing specific mapreduce functions. 459 | 460 | In this manner, you can do sql-like queries over tons of data that live in the hdfs in a distributed state. Since hdfs and mapreduce have overheads, it will not be as fast as a sql query on data that fits a single machine, but you now get the answers in parallel, and are able to do sql queries over hundreds of terabytes of data. 461 | 462 | #### Join example in Hive 463 | 464 | Let's upload another table and see how joins work. Salaries.csv has four columns: year, team, league, player, salary. It only has salary information for after 1984, but it's pretty extensive. 
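(Before moving on, it is worth pausing on how much work that one `GROUP BY weight` query saved. Written by hand for Hadoop Streaming, the same histogram would need something like the two hypothetical scripts sketched below, plus the whole job-submission command from the [MapReduce tutorial](mapreduce.md).)

```python
#!/usr/bin/env python
# Hypothetical weight_mapper.py: emit "weight<TAB>1" for each row of the
# header-less Master.csv. Weight is the 17th comma-separated field; a naive
# comma split is good enough for a sketch (a csv reader would be more robust).
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    weight = fields[16] or 'NULL'   # empty weights show up as NULL, as in Hive's output
    print "%s\t%i" % (weight, 1)
```

```python
#!/usr/bin/env python
# Hypothetical weight_reducer.py: input arrives sorted by key, so consecutive
# identical weights can simply be totalled up (same pattern as count_reducer.py).
import sys

current_weight, current_count = None, 0

for line in sys.stdin:
    weight, count = line.split('\t')
    if weight == current_weight:
        current_count += int(count)
    else:
        if current_weight is not None:
            print "%s\t%i" % (current_weight, current_count)
        current_weight, current_count = weight, int(count)

if current_weight is not None:
    print "%s\t%i" % (current_weight, current_count)
```

With Hive, all of that (plus the job bookkeeping) collapses into the single `GROUP BY` above.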
465 | 466 | Let's remove the header and upload it to hdfs 467 | ```bash 468 | hive> exit; 469 | $ tail -n +2 baseballdata/Salaries.csv > tmp && mv tmp baseballdata/Salaries.csv 470 | $ hdfs dfs -put baseballdata/Salaries.csv /user/irmak/baseballdata 471 | ``` 472 | Switch to hive, create the table and load the data. 473 | ```sql 474 | $ hive 475 | 476 | Logging initialized using configuration in jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-common-0.14.0.jar!/hive-log4j.properties 477 | SLF4J: Class path contains multiple SLF4J bindings. 478 | SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] 479 | SLF4J: Found binding in [jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class] 480 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 481 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 482 | hive> CREATE TABLE IF NOT EXISTS Salaries 483 | (yearID INT, teamID STRING, lgID STRING, playerID STRING, salary INT) 484 | COMMENT 'Salary Table for Players' 485 | ROW FORMAT DELIMITED 486 | FIELDS TERMINATED BY ',' 487 | STORED AS TEXTFILE; 488 | OK 489 | Time taken: 2.502 seconds 490 | hive> LOAD DATA INPATH '/user/irmak/baseballdata/Salaries.csv' OVERWRITE INTO TABLE Salaries; 491 | Loading data to table default.salaries 492 | Table default.salaries stats: [numFiles=1, numRows=0, totalSize=724918, rawDataSize=0] 493 | OK 494 | Time taken: 2.237 seconds 495 | hive> SHOW TABLES; 496 | OK 497 | master 498 | salaries 499 | Time taken: 0.203 seconds, Fetched: 2 row(s) 500 | ``` 501 | Mission accomplished. We have two tables now: `Master` and `Salaries`. (By the way, you should have noticed by now that nothing in Hive is case sensitive). 502 | 503 | Let's do a somewhat more complicated query that involves two tables. Let's take a look at the upper end of the weight distribution among the players and their salaries. Here is the breakdown of the query: 504 | > 505 | > ```sql 506 | > SELECT Salaries.yearID, Master.nameFirst, Master.nameLast, Master.weight, Salaries.salary 507 | > ``` 508 | > This is what we want to read: The first & last name of the player, their weight, and their salary at a specific year. Salary and year comes from the salary table and the rest from the master table. 509 | > 510 | > ```sql 511 | > FROM Master JOIN Salaries ON (Master.playerID = Salaries.playerID) 512 | > ``` 513 | > This is how we combine the information from both tables. We want the row for a player to connect with the salary rows for that player. Note that there are multiple rows for the same player in the Salaries table (for multiple years). 514 | > 515 | > ```sql 516 | > WHERE Master.weight > 270; 517 | > ``` 518 | > Only show the players who weigh more than 270 pounds. Also note that we don't have yearly weights, but a single weight statistic for each player (that is reported in the Master table). 519 | > 520 | 521 | So, let's put this query together and execute it. 522 | ```sql 523 | hive> SELECT Salaries.yearID, Master.nameFirst, Master.nameLast, Master.weight, Salaries.salary FROM Master JOIN Salaries ON (Master.playerID = Salaries.playerID) WHERE Master.weight > 270; 524 | Query ID = hduser_20150312000707_f2a73817-d862-4080-9d23-8c0e77960e65 525 | Total jobs = 1 526 | SLF4J: Class path contains multiple SLF4J bindings. 
527 | SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] 528 | SLF4J: Found binding in [jar:file:/home/hduser/apache-hive-0.14.0-bin/lib/hive-jdbc-0.14.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class] 529 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 530 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 531 | Execution log at: /tmp/hduser/hduser_20150312000707_f2a73817-d862-4080-9d23-8c0e77960e65.log 532 | 2015-03-12 12:08:02 Starting to launch local task to process map join; maximum memory = 518979584 533 | 2015-03-12 12:08:05 Dump the side-table for tag: 1 with group count: 4668 into file: file:/tmp/hduser/734bcc7b-78b9-46d9-a7cc-868c9ded365d/hive_2015-03-12_00-07-53_654_5923233654914816047-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile41--.hashtable 534 | 2015-03-12 12:08:06 Uploaded 1 File to: file:/tmp/hduser/734bcc7b-78b9-46d9-a7cc-868c9ded365d/hive_2015-03-12_00-07-53_654_5923233654914816047-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile41--.hashtable (396345 bytes) 535 | 2015-03-12 12:08:06 End of local task; Time Taken: 3.41 sec. 536 | Execution completed successfully 537 | MapredLocal task succeeded 538 | Launching Job 1 out of 1 539 | Number of reduce tasks is set to 0 since there's no reduce operator 540 | Job running in-process (local Hadoop) 541 | Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0 542 | 2015-03-12 00:08:09,069 Stage-3 map = 100%, reduce = 0% 543 | Ended Job = job_local414824063_0005 544 | MapReduce Jobs Launched: 545 | Stage-Stage-3: HDFS Read: 12113427 HDFS Write: 0 SUCCESS 546 | Total MapReduce CPU Time Spent: 0 msec 547 | OK 548 | 2007 Jonathan Broxton 310 390000 549 | 2008 Jonathan Broxton 310 454000 550 | 2009 Jonathan Broxton 310 1825000 551 | 2010 Jonathan Broxton 310 4000000 552 | 2011 Jonathan Broxton 310 7000000 553 | 2012 Jonathan Broxton 310 4000000 554 | 2013 Jonathan Broxton 310 4000000 555 | 2012 Jose Ceda 280 480000 556 | 2002 Adam Dunn 285 250000 557 | 2003 Adam Dunn 285 400000 558 | 2004 Adam Dunn 285 445000 559 | 2005 Adam Dunn 285 4600000 560 | 2006 Adam Dunn 285 7500000 561 | 2007 Adam Dunn 285 10500000 562 | 2008 Adam Dunn 285 13000000 563 | 2009 Adam Dunn 285 8000000 564 | 2010 Adam Dunn 285 12000000 565 | 2011 Adam Dunn 285 12000000 566 | 2012 Adam Dunn 285 14000000 567 | 2013 Adam Dunn 285 15000000 568 | 2006 Prince Fielder 275 329500 569 | 2007 Prince Fielder 275 415000 570 | 2008 Prince Fielder 275 670000 571 | 2009 Prince Fielder 275 7000000 572 | 2010 Prince Fielder 275 11000000 573 | 2011 Prince Fielder 275 15500000 574 | 2012 Prince Fielder 275 23000000 575 | 2013 Prince Fielder 275 23000000 576 | 2006 Bobby Jenks 275 340000 577 | 2007 Bobby Jenks 275 400000 578 | 2008 Bobby Jenks 275 550000 579 | 2009 Bobby Jenks 275 5600000 580 | 2010 Bobby Jenks 275 7500000 581 | 2011 Bobby Jenks 275 6000000 582 | 2012 Bobby Jenks 275 6000000 583 | 2003 Seth McClung 280 300000 584 | 2004 Seth McClung 280 302500 585 | 2005 Seth McClung 280 320000 586 | 2006 Seth McClung 280 343000 587 | 2008 Seth McClung 280 750000 588 | 2009 Seth McClung 280 1662500 589 | 2009 Jeff Niemann 285 1290000 590 | 2010 Jeff Niemann 285 1032000 591 | 2011 Jeff Niemann 285 903000 592 | 2012 Jeff Niemann 285 2750000 593 | 2007 Chad Paronto 285 420000 594 | 2002 Calvin Pickering 283 200000 595 | 2005 Calvin Pickering 283 323500 596 | 2007 Renyel Pinto 280 380000 597 | 2008 Renyel 
Pinto 280 391500 598 | 2009 Renyel Pinto 280 404000 599 | 2010 Renyel Pinto 280 1075000 600 | 2002 Jon Rauch 290 200000 601 | 2006 Jon Rauch 290 335000 602 | 2007 Jon Rauch 290 455000 603 | 2008 Jon Rauch 290 1200000 604 | 2009 Jon Rauch 290 2525000 605 | 2010 Jon Rauch 290 2900000 606 | 2011 Jon Rauch 290 3500000 607 | 2012 Jon Rauch 290 3500000 608 | 2013 Jon Rauch 290 1000000 609 | 2002 CC Sabathia 290 700000 610 | 2003 CC Sabathia 290 1100000 611 | 2004 CC Sabathia 290 2700000 612 | 2005 CC Sabathia 290 5250000 613 | 2006 CC Sabathia 290 7000000 614 | 2007 CC Sabathia 290 8750000 615 | 2008 CC Sabathia 290 11000000 616 | 2009 CC Sabathia 290 15285714 617 | 2010 CC Sabathia 290 24285714 618 | 2011 CC Sabathia 290 24285714 619 | 2012 CC Sabathia 290 23000000 620 | 2013 CC Sabathia 290 24285714 621 | 2002 Carlos Silva 280 200000 622 | 2003 Carlos Silva 280 310000 623 | 2004 Carlos Silva 280 340000 624 | 2005 Carlos Silva 280 1750000 625 | 2006 Carlos Silva 280 3200000 626 | 2007 Carlos Silva 280 4325000 627 | 2008 Carlos Silva 280 8250000 628 | 2009 Carlos Silva 280 12250000 629 | 2010 Carlos Silva 280 12750000 630 | 1996 Dmitri Young 295 109000 631 | 1997 Dmitri Young 295 155000 632 | 1998 Dmitri Young 295 215000 633 | 1999 Dmitri Young 295 375000 634 | 2000 Dmitri Young 295 1950000 635 | 2001 Dmitri Young 295 3500000 636 | 2002 Dmitri Young 295 5500000 637 | 2003 Dmitri Young 295 6750000 638 | 2004 Dmitri Young 295 7750000 639 | 2005 Dmitri Young 295 8000000 640 | 2006 Dmitri Young 295 8000000 641 | 2007 Dmitri Young 295 500000 642 | 2008 Dmitri Young 295 5000000 643 | 2009 Dmitri Young 295 5000000 644 | 2003 Carlos Zambrano 275 340000 645 | 2004 Carlos Zambrano 275 450000 646 | 2005 Carlos Zambrano 275 3760000 647 | 2006 Carlos Zambrano 275 6500000 648 | 2007 Carlos Zambrano 275 12400000 649 | 2008 Carlos Zambrano 275 16000000 650 | 2009 Carlos Zambrano 275 18750000 651 | 2010 Carlos Zambrano 275 18875000 652 | 2011 Carlos Zambrano 275 18875000 653 | 2012 Carlos Zambrano 275 19000000 654 | Time taken: 15.451 seconds, Fetched: 106 row(s) 655 | ``` 656 | Done. By joining tables, you can build some pretty complicated queries, which Hive will automatically execute with MapReduce. 657 | 658 | #### More resources 659 | 660 | [You can find the documentation for Hive commands here](https://cwiki.apache.org/confluence/display/Hive/LanguageManual). 661 | 662 | [And here is another tutorial with more examples](https://cwiki.apache.org/confluence/display/Hive/Tutorial) 663 | -------------------------------------------------------------------------------- /mapreduce.md: -------------------------------------------------------------------------------- 1 | ## MapReduce with Python 2 | 3 | #### Start the cluster! 4 | 5 | /usr/local/hadoop/sbin/start-dfs.sh 6 | 7 | Yes!!! It’s running. You can check the report on the cluster at this 8 | address on your web browser: 9 | 10 | http://:50070 11 | 12 | (Replace with the actual ip, like 167.214.312.54) 13 | On the terminal, 14 | 15 | jps 16 | 17 | will show you that `DataNode`, `NameNode` and `SecondaryNameNode` are running. 18 | 19 | #### Let’s stop it 20 | 21 | /usr/local/hadoop/sbin/stop-dfs.sh 22 | 23 | Now go check 24 | 25 | http://:50070 26 | 27 | It shouldn’t be there anymore! 28 | 29 | #### Get data 30 | 31 | Alright, let’s put some data in. 32 | 33 | Let’s make a directory for these 34 | 35 | mkdir -p /home/hduser/textdata 36 | 37 | First we’ll start with putting the data into our normal data system. 
38 | If you have some text files, you can use them for this. 39 | If not, here are three ebooks (plain text `utf-8` encoding) you can 40 | `wget`: 41 | 42 | Ulyses by James Joyce 43 | [http://www.gutenberg.org/cache/epub/4300/pg4300.txt][1] 44 | 45 | Notebooks of Leonardo Da Vinci 46 | [http://www.gutenberg.org/cache/epub/5000/pg5000.txt][2] 47 | 48 | The Outline of Science by J Arthur Thomson 49 | [http://www.gutenberg.org/cache/epub/20417/pg20417.txt][3] 50 | 51 | (For example, to get these, you can do this: 52 | 53 | cd /home/hduser/textdata 54 | wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt 55 | wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt 56 | wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt 57 | 58 | ) 59 | 60 | #### Put data in hdfs 61 | 62 | First, let’s start the cluster again! 63 | 64 | /usr/local/hadoop/sbin/start-dfs.sh 65 | 66 | make some directories **in the hadoop distributed file system!** 67 | 68 | hdfs dfs -mkdir /user/ 69 | hdfs dfs -mkdir /user/irmak/ 70 | 71 | Of course replace `irmak` with your own username. 72 | Let’s check that they exist 73 | 74 | hdfs dfs -ls / 75 | hdfs dfs -ls /user/ 76 | 77 | Yay! 78 | 79 | Ok, put some data in 80 | 81 | hdfs dfs -put /home/hduser/textdata/* /user/irmak 82 | 83 | Check and make sure it is in the hdfs 84 | 85 | hdfs dfs -ls /user/irmak 86 | 87 | Yay! 88 | 89 | ####Our mapper and reducer 90 | 91 | Our mapper `count_mapper.py` includes the following code: 92 | ```python 93 | #!/usr/bin/env python 94 | 95 | import sys 96 | from textblob import TextBlob 97 | 98 | for line in sys.stdin: 99 | line = line.decode('utf-8') 100 | words = TextBlob(line).words 101 | for word in words: 102 | word = word.encode('utf-8') 103 | print "%s\t%i" % (word, 1) 104 | ``` 105 | 106 | And our reducer `count_reducer.py` looks like this: 107 | ```python 108 | #!/usr/bin/env python 109 | 110 | import sys 111 | 112 | current_word = None 113 | current_count = 0 114 | word = None 115 | 116 | for line in sys.stdin: 117 | word, count = line.split('\t') 118 | count = int(count) 119 | if word == current_word: 120 | current_count += count 121 | else: 122 | if current_word: 123 | print '%s\t%i' % (current_word, current_count) 124 | current_word = word 125 | current_count = count 126 | 127 | if current_word == word: 128 | print '%s\t%i' % (current_word, current_count) 129 | ``` 130 | 131 | Before running these codes, we need to make sure that textblob has its nltk corpora downloaded, so that it can work without an error. To do that, execute this on the command line (as the hduser): 132 | 133 | python -m textblob.download_corpora 134 | 135 | ####Let's run it! 136 | 137 | Before giving the following command, don't forget to replace the `/user/irmak` path (in the hdfs) with your own version, and the paths to `count_mapper.py` and `count_reducer.py` (in your droplet's local filesystem) with your own versions. 138 | 139 | hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file /home/hduser/count_mapper.py -mapper /home/hduser/count_mapper.py -file /home/hduser/count_reducer.py -reducer /home/hduser/count_reducer.py -input /user/irmak/* -output /user/irmak/book-output 140 | 141 | Booom ! It's running. 142 | 143 | #### Looking at the output 144 | Once it's done, 145 | 146 | hdfs dfs -ls /user/irmak/book-output 147 | 148 | should show that there is a `_SUCCESS` file (showing we did it!) and 149 | another file called `part-00000` 150 | 151 | This `part-00000` is our output. 
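As an aside: since the mapper and reducer are ordinary scripts that read stdin and write stdout, you can also sanity-check them locally, without Hadoop, by chaining them through a pipe. A rough sketch (the paths are the ones used earlier in this tutorial; `local_counts.txt` is just a hypothetical output file):

```python
import subprocess

# Local dry run of the same pipeline, no Hadoop involved:
#   book -> count_mapper.py -> sort -> count_reducer.py -> local_counts.txt
with open('/home/hduser/textdata/pg4300.txt') as book, \
     open('local_counts.txt', 'w') as out:
    mapper = subprocess.Popen(['python', '/home/hduser/count_mapper.py'],
                              stdin=book, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(['sort'],
                              stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(['python', '/home/hduser/count_reducer.py'],
                               stdin=sorter.stdout, stdout=out)
    mapper.stdout.close()   # let the mapper receive SIGPIPE if sort exits early
    sorter.stdout.close()   # likewise for sort if the reducer exits early
    reducer.wait()
```

This only processes one book on one machine, of course — the point is just to catch bugs in the scripts quickly before submitting a job. Now, back to the real output sitting in the hdfs.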
To look in: 152 | 153 | hdfs dfs -cat /user/irmak/book-output/part-00000 154 | 155 | or just 156 | 157 | hdfs dfs -cat /user/irmak/book-output/* 158 | 159 | will show the output of our job! 160 | 161 | If you want to see the most common words, run: 162 | 163 | hdfs dfs -cat /user/irmak/book-output/* | sort -rnk2 | less 164 | 165 | ########Note: 166 | If something went wrong when you ran your mapreduce job, you fix something and want to run it again, it will throw a different error, saying that the book-output directory already exists in hdfs. This error is thrown to avoid overwriting previous results. If you want to just rerun it anyway, you need to delete the output first, so it can be created again: 167 | 168 | hdfs dfs -rm -r /user/irmak/book-output 169 | 170 | 171 | 172 | [1]: http://www.gutenberg.org/cache/epub/4300/pg4300.txt 173 | [2]: http://www.gutenberg.org/cache/epub/5000/pg5000.txt 174 | [3]: http://www.gutenberg.org/cache/epub/20417/pg20417.txt 175 | -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | ### Hadoop installation and setup on an Ubuntu server 2 | 3 | Create a cloud server (through a service such as AWS, Rackspace, Digital Ocean, etc.), ssh to it, and follow the white rabbit below. 4 | 5 | #### Install TextBlob 6 | 7 | We're going to use it in our text processing, so make sure you have textblob in there. 8 | 9 | sudo pip install textblob 10 | 11 | #### Install Java 7 12 | 13 | sudo apt-get install python-software-properties 14 | sudo add-apt-repository ppa:webupd8team/java 15 | sudo apt-get update 16 | sudo apt-get install oracle-jdk7-installer 17 | 18 | The java install will ask a few straightforward questions, just answer 19 | them. 20 | 21 | ####Check that java version is 1.7 22 | 23 | java -version 24 | 25 | 26 | ####Create a Hadoop user 27 | 28 | sudo addgroup hadoop 29 | sudo adduser --ingroup hadoop hduser 30 | 31 | This will ask for a password, give it one. Each user in unix has a 32 | password. You will use that when you switch to that user. 33 | 34 | Make an ssh key so hadoop can connect to machines with ssh without entering a password every time. 35 | 36 | su hduser 37 | ssh-keygen -t rsa -P "" 38 | 39 | (hit enter when asked where to save the key) 40 | 41 | Add the key to recognized keys in target computers (same as localhost 42 | in this tutorial case) 43 | 44 | cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 45 | 46 | Add key to recognized keys list and test that it works without a 47 | password 48 | 49 | ssh localhost 50 | 51 | (type yes and enter when it asks about adding the key print to the 52 | known keys list) 53 | 54 | 55 | ####Download and Install Hadoop 56 | 57 | wget http://psg.mtu.edu/pub/apache/hadoop/common/stable/hadoop-2.6.0.tar.gz 58 | 59 | su irmak 60 | 61 | (switch to your own user, you’ll need some sudo here) 62 | 63 | sudo tar xvzf hadoop-2.6.0.tar.gz 64 | 65 | sudo mv hadoop-2.6.0 /usr/local/hadoop 66 | 67 | cd /usr/local 68 | 69 | sudo chown -R hduser:hadoop hadoop 70 | 71 | #### Change bashrc settings 72 | 73 | su hduser 74 | 75 | emacs ~/.bashrc 76 | 77 | You can use another editor, too, of course. (If you don't have emacs, 78 | and do not want to use another editor you can install emacs with 79 | *apt-get install emacs*). 
Add the following lines:

    # Environment variable for Hadoop location, include bin in the path
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin

    # Environment variable for Java location
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle

    # Hadoop related aliases
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"

Exit emacs (`Ctrl-x Ctrl-s` to save, `Ctrl-x Ctrl-c` to exit). Great, now these will run every time you connect to the server, but let's also make sure they apply now. Type this in your terminal:

    source ~/.bashrc

#### Create the place to put HDFS on and tell Hadoop where it is

    su irmak

(We need some more sudo stuff so switch back to yourself for now)

    sudo mkdir -p /app/hadoop/tmp

    sudo chown -R hduser:hadoop /app/hadoop/tmp

    su hduser

(back to hduser to edit the configuration files)

    emacs /usr/local/hadoop/etc/hadoop/core-site.xml

Between `<configuration>` and `</configuration>` put this in:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation.
  The uri's scheme determines the config property (fs.SCHEME.impl)
  naming the FileSystem implementation class. The uri's authority
  is used to determine the host, port, etc. for a filesystem.
  </description>
</property>
```

Ok, now another one.

    emacs /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

Between `<configuration>` and `</configuration>` put this in:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
```

And the last one:

    emacs /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Between `<configuration>` and `</configuration>` put this in:

```xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
  can be specified when the file is created. The default is used if replication
  is not specified in create time.
  </description>
</property>
```

Also tell hadoop where java 7 is:

    emacs /usr/local/hadoop/etc/hadoop/hadoop-env.sh

And at the very end, append this line:

    export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Save, quit, and we're good.

#### Format the HDFS (hadoop filesystem)

    hdfs namenode -format

## STOP HERE.

Take a breath. Your setup is complete. The rest is actually using hadoop.

--------------------------------------------------------------------------------