├── LICENSE ├── README.md ├── conf └── hive │ └── hive-site.xml ├── examples └── ngrams │ ├── hive_query_ngrams.q │ ├── hive_table_create.sh │ ├── ngram_hdfs_load.sh │ ├── ngram_setup.sh │ └── pig_query_ngrams.pig ├── project_properties.sh └── scripts ├── common_utils.sh ├── install-packages-on-master__at__host.sh ├── package_utils.sh ├── packages-delete-from-gcs__at__host.sh ├── packages-to-gcs__at__host.sh ├── setup-hdfs-for-hdtools__at__master.sh ├── setup-packages__at__master.sh └── setup-ssh-keys__at__master.sh /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Apache Hive and Pig on Google Compute Engine 2 | ============================================ 3 | 4 | Copyright 5 | --------- 6 | 7 | Copyright 2013 Google Inc. All Rights Reserved. 8 | 9 | Licensed under the Apache License, Version 2.0 (the "License"); 10 | you may not use this file except in compliance with the License. 
11 | You may obtain a copy of the License at 12 | 13 | [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0) 14 | 15 | Unless required by applicable law or agreed to in writing, software 16 | distributed under the License is distributed on an "AS IS" BASIS, 17 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 18 | See the License for the specific language governing permissions and 19 | limitations under the License. 20 | 21 | 22 | Disclaimer 23 | ---------- 24 | 25 | This sample application is not an official Google product. 26 | 27 | 28 | Summary 29 | ------- 30 | 31 | This sample application can be used to install Apache Hive and/or Pig 32 | onto a Hadoop master instance on Google Compute Engine. 33 | 34 | Prerequisites 35 | ------------- 36 | 37 | The sample application should be run from an Internet-connected machine 38 | such as a workstation/laptop. 39 | 40 | The application assumes you have a Google Cloud Project created and that 41 | [Google Cloud Storage](https://developers.google.com/storage/docs/signup) and 42 | [Google Compute Engine](https://developers.google.com/compute/docs/signup) 43 | services are enabled on the project. 44 | 45 | The application uses 46 | [gsutil](https://developers.google.com/storage/docs/gsutil) and 47 | [gcutil](https://developers.google.com/compute/docs/gcutil), 48 | command line tools for Google Cloud Storage and Google Compute Engine 49 | respectively. Make sure to have the latest version of these 50 | [Cloud SDK](https://developers.google.com/cloud/sdk/) tools installed 51 | and added to your PATH environment variable. 52 | 53 | ##### Default project 54 | 55 | The default project of the Cloud SDK tools must be set to the project 56 | of the Hadoop cluster. 57 | 58 | Use the [gcloud](https://developers.google.com/cloud/sdk/gcloud) command 59 | to change the default project of the Cloud SDK: 60 | 61 | gcloud config set project 62 | 63 | ##### Hadoop cluster 64 | 65 | You should already have a Hadoop cluster running on Google Compute Engine. 66 | This sample application has been tested with clusters created using both 67 | of the following: 68 | 69 | * [Solutions Cluster for Hadoop](https://github.com/GoogleCloudPlatform/solutions-google-compute-engine-cluster-for-hadoop) 70 | * [Bash Script Quickstart for Hadoop](https://developers.google.com/hadoop/setting-up-a-hadoop-cluster) 71 | 72 | In addition to a running Google Compute Engine cluster, the sample application 73 | requires the user running the installation scripts and the Google 74 | Compute Engine instances to have authorized access to a Google 75 | Cloud Storage bucket. 76 | 77 | ##### Cloud Storage bucket 78 | 79 | Create a Google Cloud Storage bucket. This can be done by one of: 80 | 81 | * Using an existing bucket. 82 | If you used either of the above Hadoop cluster bring-up packages, you may 83 | use the same bucket. 84 | * Creating a new bucket from the "Cloud Storage" page on the project page of 85 | [Developers Console](https://console.developers.google.com/) 86 | * Creating a new bucket with the 87 | [gsutil command line tool](https://developers.google.com/storage/docs/gsutil): 88 | 89 | gsutil mb gs:// 90 | 91 | Make sure to create the bucket in the same Google Cloud project as that 92 | specified in the "Default project" section above. 93 | 94 | Package Downloads 95 | ----------------- 96 | 97 | This sample application can be used to install just one or both 98 | of the packages discussed here. 
99 | Which packages are installed will be driven by copying the respective
100 | tool's package archive into the "packages/_toolname_" subdirectory
101 | of the sample app prior to running the installation scripts.
102 |
103 | ### Hive Package Setup
104 | Create a directory for the Hive package as a subdirectory of the sample application:
105 |
106 |     mkdir -p packages/hive
107 |
108 | Download Hive from
109 | [hive.apache.org](http://hive.apache.org/downloads.html)
110 | and copy the gzipped tar file
111 | into the `packages/hive/` subdirectory. Testing of this sample application
112 | was performed with `hive-0.11.0.tar.gz` and `hive-0.12.0.tar.gz`.
113 |
114 | ### Pig Package Setup
115 | Create a directory for the Pig package as a subdirectory of the sample application:
116 |
117 |     mkdir -p packages/pig
118 |
119 | Download Pig from
120 | [pig.apache.org](http://pig.apache.org/releases.html)
121 | and copy the gzipped tar file
122 | into the `packages/pig/` subdirectory. Testing of this sample application
123 | was performed with `pig-0.11.1.tar.gz` and `pig-0.12.0.tar.gz`.
124 |
125 | If installing both tools, the packages subdirectory will
126 | now appear as:
127 |
128 |     packages/
129 |       hive/
130 |         hive-0.12.0.tar.gz
131 |       pig/
132 |         pig-0.12.0.tar.gz
133 |
134 | Enter project details into properties file
135 | ------------------------------------------
136 | Edit the file `project_properties.sh` found in the root directory of the
137 | sample application.
138 |
139 | Update the `GCS_PACKAGE_BUCKET` value with the name of the Google
140 | Cloud Storage bucket associated with your Hadoop project, such as:
141 |
142 |     readonly GCS_PACKAGE_BUCKET=myproject-bucket
143 |
144 | Update the `ZONE` value with the Compute Engine zone associated with your
145 | Hadoop master instance, such as:
146 |
147 |     readonly ZONE=us-central1-a
148 |
149 | Update the `MASTER` value with the name of the Compute Engine
150 | Hadoop master instance associated with your project, such as:
151 |
152 |     readonly MASTER=myproject-hm
153 |
154 | Update the `HADOOP_HOME` value with the full directory path where
155 | Hadoop is installed on the Hadoop master instance, such as:
156 |
157 |     readonly HADOOP_HOME=/home/hadoop/hadoop
158 |
159 | or
160 |
161 |     readonly HADOOP_HOME=/home/hadoop/hadoop-install
162 |
163 | The sample application will create a system user named `hdpuser` on the
164 | Hadoop master instance. The software will be installed into that user's
165 | home directory, `/home/hdpuser`.
166 |
167 | If you would like to use a different username or install the software
168 | into a different directory, then update the `HDP_USER`, `HDP_USER_HOME`,
169 | and `MASTER_INSTALL_DIR` values in `project_properties.sh`.
170 |
171 | Push packages to cloud storage
172 | ------------------------------
173 | From the root directory where this sample application has been installed,
174 | run:
175 |
176 |     $ ./scripts/packages-to-gcs__at__host.sh
177 |
178 | This command will push the packages tree structure up to the
179 | `GCS_PACKAGE_BUCKET` configured above.
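If you would like to confirm the upload before moving on, you can list the bucket
contents with `gsutil`. This is an optional check, not part of the original
instructions; the bucket name below is illustrative, and `hdp_tools` is the default
`GCS_PACKAGE_DIR` set in `project_properties.sh`:

    $ gsutil ls gs://myproject-bucket/hdp_tools/packages/hive/
    gs://myproject-bucket/hdp_tools/packages/hive/hive-0.12.0.tar.gz

    $ gsutil ls gs://myproject-bucket/hdp_tools/packages/pig/
    gs://myproject-bucket/hdp_tools/packages/pig/pig-0.12.0.tar.gz

You should see the same tarballs you copied into the local `packages/` tree.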
180 | 181 | Run installation onto Hadoop master 182 | ----------------------------------- 183 | From the root directory where this sample application has been installed, 184 | run: 185 | 186 | $ ./scripts/install-packages-on-master__at__host.sh 187 | 188 | This command will perform the installation onto the Hadoop master instance, 189 | including the following operations: 190 | 191 | * Create the `hdpuser` 192 | * Install software packages 193 | * Set user privileges in the Google Compute Engine instance filesystem 194 | * Set user privileges in the Hadoop File System (HDFS) 195 | * Set up SSH keys for the `hdpuser` 196 | 197 | Tests 198 | ----- 199 | On successful installation, the script will emit the appropriate command 200 | to connect to the `hdpuser` over SSH. 201 | 202 | Once connected to the Hadoop master, the following steps will help verify 203 | a functioning installation. 204 | 205 | The examples here use the `/etc/passwd` file, which contains no actual 206 | passwords and is publicly readable. 207 | 208 | ### Hive Test 209 | On the Hadoop master instance, under the `hdpuser` copy the 210 | `/etc/passwd` file from the file system into HDFS: 211 | 212 | $ hadoop fs -put /etc/passwd /tmp 213 | 214 | Start the Hive shell with the following command: 215 | 216 | $ hive 217 | 218 | At the Hive shell prompt enter: 219 | 220 | CREATE TABLE passwd ( 221 | user STRING, 222 | dummy STRING, 223 | uid INT, 224 | gid INT, 225 | name STRING, 226 | home STRING, 227 | shell STRING 228 | ) 229 | ROW FORMAT DELIMITED 230 | FIELDS TERMINATED BY ':' 231 | STORED AS TEXTFILE; 232 | 233 | LOAD DATA INPATH '/tmp/passwd' 234 | OVERWRITE INTO TABLE passwd; 235 | 236 | SELECT shell, COUNT(*) shell_count 237 | FROM passwd 238 | GROUP BY shell 239 | ORDER BY shell_count DESC; 240 | 241 | This should start a MapReduce job sequence to first group and sum 242 | the shell types and secondly to sort the results. 243 | 244 | The query results will be emitted to the console and should look 245 | something like: 246 | 247 | /bin/bash 23 248 | /usr/sbin/nologin 21 249 | /bin/sync 1 250 | 251 | To drop the table enter: 252 | 253 | DROP TABLE passwd; 254 | 255 | Note that you do NOT need to remove the `passwd` file from HDFS. 256 | The LOAD DATA command will have _moved_ the file from `/tmp/passwd` 257 | to `/user/hdpuser/warehouse`. Dropping the `passwd` table will 258 | remove the file from `/user/hdpuser/warehouse`. 259 | 260 | To exit the Hive shell enter: 261 | 262 | exit; 263 | 264 | ### Pig Test 265 | On the Hadoop master instance, under the `hdpuser` copy the 266 | `/etc/passwd` file from the file system into HDFS: 267 | 268 | $ hadoop fs -put /etc/passwd /tmp 269 | 270 | Start the Pig shell with the following command: 271 | 272 | $ pig 273 | 274 | At the Pig shell prompt enter: 275 | 276 | data = LOAD '/tmp/passwd' 277 | USING PigStorage(':') 278 | AS (user:CHARARRAY, dummy:CHARARRAY, uid:INT, gid:INT, 279 | name:CHARARRAY, home:CHARARRAY, shell:CHARARRAY); 280 | grp = GROUP data BY (shell); 281 | counts = FOREACH grp GENERATE 282 | FLATTEN(group), COUNT(data) AS shell_count:LONG; 283 | res = ORDER counts BY shell_count DESC; 284 | DUMP res; 285 | 286 | This should start a MapReduce job sequence to first group and sum 287 | the shell types, secondly to sample the results for the subsequent 288 | sort job. 
289 |
290 | The query results will be emitted to the console and should look
291 | something like:
292 |
293 |     (/bin/bash,23)
294 |     (/usr/sbin/nologin,21)
295 |     (/bin/sync,1)
296 |
297 | To exit the Pig shell enter:
298 |
299 |     quit;
300 |
301 | When tests are completed, the passwd file can be removed with:
302 |
303 |     $ hadoop fs -rm /tmp/passwd
304 |
305 |
306 | Post-installation cleanup
307 | -------------------------
308 | After installation, the software packages can be removed from Google Cloud Storage.
309 |
310 | From the root directory where this sample application has been installed,
311 | run:
312 |
313 |     ./scripts/packages-delete-from-gcs__at__host.sh
314 |
315 | Appendix A
316 | ----------
317 | Using MySQL for the Hive Metastore
318 |
319 | The default Hive installation uses a local Derby database to store Hive
320 | meta information (table and column names, column types, etc.).
321 | This local database is fine for single-user/single-session usage, but a common
322 | setup is to configure Hive to use a MySQL database instance.
323 |
324 | The following sections describe how to set up Hive to use a MySQL database,
325 | either using Google Cloud SQL or a self-installed and managed MySQL database
326 | on the Google Compute Engine cluster master instance.
327 |
328 | ### Google Cloud SQL
329 |
330 | [Google Cloud SQL](https://developers.google.com/cloud-sql/) is a
331 | MySQL database service on the Google Cloud Platform. Using Cloud SQL
332 | removes the need to install and maintain MySQL on Google Compute Engine.
333 |
334 | The instructions here assume that you will use the native MySQL JDBC driver
335 | and not the legacy Google JDBC driver.
336 |
337 | #### Create and configure the Google Cloud SQL instance
338 |
339 | In the [Developers Console](https://console.developers.google.com/),
340 | create a Google Cloud SQL instance, as described at
341 | [Getting Started](https://developers.google.com/cloud-sql/docs/before_you_begin).
342 | When creating the Cloud SQL instance, be sure to:
343 |
344 | * Select **Specify Compute Engine zone** for the **Preferred Location**
345 | * Select the **GCE Zone** of your Hadoop master instance
346 | * Select **Assign IP Address**
347 | * Add the external IP address of the Hadoop master instance to
348 |   **Authorized IP Addresses**
349 |   (you can also assign an IP address after creating the Cloud SQL instance).
350 |
351 | After the Cloud SQL instance is created:
352 |
353 | * Select the instance link from the Cloud SQL instance list
354 | * Go to the *Access Control* tab
355 | * Enter a root password in the appropriate field (make a note of it; you will need it for the steps below)
356 | * Note the *IP address* assigned to the Cloud SQL instance;
357 |   it will be used in the Hive configuration file as explained below.
358 |
359 | #### Install the MySQL client
360 |
361 | Connect to the Hadoop master instance:
362 |
363 |     gcutil ssh <master-instance-name>
364 |
365 | Use the `apt-get` package manager to install the MySQL client:
366 |
367 |     sudo apt-get install --yes -f
368 |     sudo apt-get install --yes mysql-client
369 |
370 |
371 | #### Create the Cloud SQL database
372 |
373 | Launch the mysql client to connect to the Google Cloud SQL instance,
374 | replacing `<cloud-sql-ip>` with the assigned IP address of the
375 | Cloud SQL instance:
376 |
377 |     mysql --host=<cloud-sql-ip> --user=root --password
378 |
379 | The `--password` flag causes mysql to prompt for a password. Enter your root user password for the Cloud SQL instance.
380 |
381 |
382 | Create the database `hivemeta`.
Note that Hive requires the database
383 | to use `latin1` character encoding.
384 |
385 |     CREATE DATABASE hivemeta CHARSET latin1;
386 |
387 | #### Configure database user and grant privileges
388 |
389 | Create the database user `hdpuser`:
390 |
391 |     CREATE USER hdpuser IDENTIFIED BY 'hdppassword';
392 |
393 | [You should select your own password here in place of _hdppassword_.]
394 |
395 | Issue grants on the `hivemeta` database to `hdpuser`:
396 |
397 |     GRANT ALL PRIVILEGES ON hivemeta.* TO hdpuser;
398 |
399 | #### Install the MySQL native JDBC driver
400 |
401 | Connect to the Hadoop master instance:
402 |
403 |     gcutil ssh <master-instance-name>
404 |
405 | Use the `apt-get` package manager to install the MySQL JDBC driver:
406 |
407 |     sudo apt-get install --yes libmysql-java
408 |
409 | #### Configure Hive to use Cloud SQL
410 |
411 | Connect to the Hadoop master instance as the user `hdpuser`.
412 |
413 | Add the JDBC driver JAR file to Hive's CLASSPATH.
414 | The simplest method is to copy the file to the `hive/lib/` directory:
415 |
416 |     cp /usr/share/java/mysql-connector-java.jar hive/lib/
417 |
418 | Update the `hive/conf/hive-site.xml` file to connect to
419 | the Google Cloud SQL database. Add the following configuration,
420 | replacing `<cloud-sql-ip>` with the assigned IP address of the
421 | Cloud SQL instance, and replacing `hdppassword` with the database
422 | user password set earlier:
423 |
424 |     <property>
425 |       <name>javax.jdo.option.ConnectionURL</name>
426 |       <value>jdbc:mysql://<cloud-sql-ip>/hivemeta?createDatabaseIfNotExist=true</value>
427 |     </property>
428 |     <property>
429 |       <name>javax.jdo.option.ConnectionDriverName</name>
430 |       <value>com.mysql.jdbc.Driver</value>
431 |     </property>
432 |     <property>
433 |       <name>javax.jdo.option.ConnectionUserName</name>
434 |       <value>hdpuser</value>
435 |     </property>
436 |     <property>
437 |       <name>javax.jdo.option.ConnectionPassword</name>
438 |       <value>hdppassword</value>
439 |     </property>
440 |
441 | As a password has been added to this configuration file, it is recommended that
442 | you make the file readable and writeable only by the `hdpuser`:
443 |
444 |     chmod 600 hive/conf/hive-site.xml
445 |
446 | When run, Hive will now be able to use the Google Cloud SQL database as its
447 | metastore. The metastore database will be created with the first Hive DDL
448 | operation. For troubleshooting, check the Hive log at `/tmp/hdpuser/hive.log`.
449 |
450 | ### MySQL on Google Compute Engine
451 |
452 | MySQL can be installed and run on Google Compute Engine. The instructions
453 | here are for installing MySQL on the Hadoop master instance. MySQL could
454 | also be installed on a separate Google Compute Engine instance.
455 |
456 | #### Install the MySQL server
457 |
458 | Connect to the Hadoop master instance:
459 |
460 |     gcutil ssh <master-instance-name>
461 |
462 | Use the `apt-get` package manager to install MySQL:
463 |
464 |     sudo apt-get install --yes -f
465 |     sudo apt-get install --yes mysql-server
466 |
467 | #### Create MySQL database
468 |
469 | Create a database for the Hive metastore with mysqladmin:
470 |
471 |     sudo mysqladmin create hivemeta
472 |
473 | When completed, create a user for the Hive metastore.
474 |
475 | #### Configure database user and grant privileges
476 |
477 | Launch mysql:
478 |
479 |     sudo mysql
480 |
481 | At the MySQL shell prompt, issue:
482 |
483 |     CREATE USER hdpuser@localhost IDENTIFIED BY 'hdppassword';
484 |     GRANT ALL PRIVILEGES ON hivemeta.* TO hdpuser@localhost;
485 |
486 | [You should select your own password here in place of _hdppassword_.]
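Optionally, you can verify the new account before configuring Hive (this check is
not part of the original instructions). Reconnect to MySQL as `hdpuser`, entering
the password you chose above, and list its grants:

    $ mysql --user=hdpuser --password hivemeta
    mysql> SHOW GRANTS FOR CURRENT_USER();
    mysql> exit

If the grant on `hivemeta.*` is listed, the metastore database is ready for the
Hive configuration in the following steps.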
487 |
488 | #### Install the MySQL native JDBC driver
489 |
490 | Use the `apt-get` package manager to install the MySQL JDBC driver:
491 |
492 |     sudo apt-get install --yes libmysql-java
493 |
494 | #### Configure Hive to use MySQL
495 |
496 | Connect to the Hadoop master instance as the user `hdpuser`.
497 |
498 | Add the JDBC driver JAR file to Hive's CLASSPATH.
499 | The simplest method is to copy the file to the `hive/lib/` directory:
500 |
501 |     cp /usr/share/java/mysql-connector-java.jar hive/lib/
502 |
503 | Update the `hive/conf/hive-site.xml` file to connect to
504 | the local MySQL database. Add the following configuration,
505 | replacing `hdppassword` with the database user password set earlier:
506 |
507 |     <property>
508 |       <name>javax.jdo.option.ConnectionURL</name>
509 |       <value>jdbc:mysql://localhost/hivemeta?createDatabaseIfNotExist=true</value>
510 |     </property>
511 |     <property>
512 |       <name>javax.jdo.option.ConnectionDriverName</name>
513 |       <value>com.mysql.jdbc.Driver</value>
514 |     </property>
515 |     <property>
516 |       <name>javax.jdo.option.ConnectionUserName</name>
517 |       <value>hdpuser</value>
518 |     </property>
519 |     <property>
520 |       <name>javax.jdo.option.ConnectionPassword</name>
521 |       <value>hdppassword</value>
522 |     </property>
523 |
524 | As a password has been added to this configuration file, it is recommended that
525 | you make the file readable and writeable only by the `hdpuser`:
526 |
527 |     chmod 600 ~hdpuser/hive/conf/hive-site.xml
528 |
529 | When run, Hive will now be able to use the MySQL database as its metastore.
530 | The metastore database will be created with the first Hive DDL operation.
531 | For troubleshooting, check the Hive log at `/tmp/hdpuser/hive.log`.
532 |
533 | Appendix B
534 | ----------
535 | This section lists some useful file locations on the Hadoop master
536 | instance:
537 |
538 | |---------------------|---------------------------------|
539 | | File Description    | Path                            |
540 | |---------------------|---------------------------------|
541 | | Hadoop binaries     | /home/[hadoop]/hadoop-<version> |
542 | | Hive binaries       | /home/[hdpuser]/hive-<version>  |
543 | | Pig binaries        | /home/[hdpuser]/pig-<version>   |
544 | | HDFS NameNode Files | /hadoop/hdfs/name               |
545 | | HDFS DataNode Files | /hadoop/hdfs/data               |
546 | | Hadoop Logs         | /var/log/hadoop                 |
547 | | MapReduce Logs      | /var/log/hadoop                 |
548 | | Hive Logs           | /tmp/[hdpuser]                  |
549 | | Pig Logs            | [Pig launch directory]          |
550 | |---------------------|---------------------------------|
551 |
--------------------------------------------------------------------------------
/conf/hive/hive-site.xml:
--------------------------------------------------------------------------------
1 | <?xml version="1.0"?>
2 | <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
3 | <!--
4 |   Copyright 2013 Google Inc. All Rights Reserved.
5 |
6 |   Licensed under the Apache License, Version 2.0 (the "License");
7 |   you may not use this file except in compliance with the License.
8 |   You may obtain a copy of the License at
9 |
10 |     http://www.apache.org/licenses/LICENSE-2.0
11 |
12 |   Unless required by applicable law or agreed to in writing, software
13 |   distributed under the License is distributed on an "AS IS" BASIS,
14 |   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 |   See the License for the specific language governing permissions and
16 |   limitations under the License.
17 |
18 |
19 | -->
20 |
21 | <configuration>
22 |   <property>
23 |     <name>hive.metastore.warehouse.dir</name>
24 |     <value>/user/${user.name}/warehouse</value>
25 |     <description>location of default database for the warehouse</description>
26 |   </property>
27 |
28 | </configuration>
29 |
30 |
--------------------------------------------------------------------------------
/examples/ngrams/hive_query_ngrams.q:
--------------------------------------------------------------------------------
1 | --
2 | -- Copyright 2013 Google Inc. All Rights Reserved.
3 | --
4 | -- Licensed under the Apache License, Version 2.0 (the "License");
5 | -- you may not use this file except in compliance with the License.
6 | -- You may obtain a copy of the License at
7 | --
8 | --     http://www.apache.org/licenses/LICENSE-2.0
9 | --
10 | -- Unless required by applicable law or agreed to in writing, software
11 | -- distributed under the License is distributed on an "AS IS" BASIS,
12 | -- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | -- See the License for the specific language governing permissions and 14 | -- 15 | 16 | -- 17 | -- This script is intended to be run from the Hive shell: 18 | -- 19 | -- hive> source hive_query_ngrams.q; 20 | -- 21 | -- or from the operating system shell: 22 | -- 23 | -- $ hive -f hive_query_ngrams.q 24 | -- 25 | -- The result of this query is a table of records indicating the count 26 | -- of occurrences of the words "radio" and "television" in the Google 27 | -- ngrams corpora for each year since 1920. 28 | -- 29 | -- This query ensures that a record exists in the result for every year 30 | -- since 1920, even if there were no instances of a given word. 31 | -- In practice this is unnecessary as radio and television both occur 32 | -- more than once in the data set for every year since 1920. 33 | -- 34 | -- The structure of this query is to join three distinct subqueries (on year): 35 | -- y: list of years since 1920 (implicitly ordered by the DISTINCT operation) 36 | -- r: sum of instances of the word "radio" for each year since 1920 37 | -- t: sum of instances of the word "television" for each year since 1920 38 | -- 39 | 40 | SELECT y.year AS year, 41 | r.instance_count AS radio, t.instance_count AS television, 42 | CAST(r.instance_count AS DOUBLE)/(r.instance_count + t.instance_count) 43 | AS pct 44 | FROM 45 | (SELECT DISTINCT year AS year FROM 46 | (SELECT distinct year from 1gram where prefix = 'r' and year >= 1920 47 | UNION ALL 48 | SELECT distinct year from 1gram where prefix = 't' and year >= 1920) y_all) 49 | y 50 | JOIN 51 | (SELECT LOWER(word) AS ngram_col, year, SUM(instance_count) AS instance_count 52 | FROM 1gram 53 | WHERE LOWER(word) = 'radio' AND prefix='r' AND (year >= 1920) 54 | GROUP BY LOWER(word), year) r 55 | ON y.year = r.year 56 | JOIN 57 | (SELECT LOWER(word) AS ngram_col, year, SUM(instance_count) AS instance_count 58 | FROM 1gram 59 | WHERE LOWER(word) = 'television' AND prefix='t' AND (year >= 1920) 60 | GROUP BY LOWER(word), year) t 61 | ON y.year = t.year 62 | ORDER BY year; 63 | 64 | EXIT; 65 | 66 | -- 67 | -- This is a simplified version of the above which eliminates the explicit 68 | -- generation of the "year" list. It assumes (correctly) that the word 69 | -- "television" appears every year that "radio" does. 70 | -- This query is listed here for reference and educational purposes only. 71 | -- 72 | -- SELECT a.year, a.instance_count, b.instance_count, 73 | -- CAST(a.instance_count AS DOUBLE)/(a.instance_count + b.instance_count) 74 | -- FROM 75 | -- (SELECT LOWER(word) AS ngram_col, year, SUM(instance_count) AS instance_count 76 | -- FROM 1gram 77 | -- WHERE LOWER(word) = 'radio' AND prefix='r' AND (year >= 1920) 78 | -- GROUP BY LOWER(word), year) a 79 | -- JOIN 80 | -- (SELECT LOWER(word) AS ngram_col, year, SUM(instance_count) AS instance_count 81 | -- FROM 1gram 82 | -- WHERE LOWER(word) = 'television' AND prefix='t' AND (year >= 1920) 83 | -- GROUP BY LOWER(word), year) b 84 | -- ON a.year = b.year 85 | -- ORDER BY year; 86 | -- 87 | -------------------------------------------------------------------------------- /examples/ngrams/hive_table_create.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at
7 | #
8 | #      http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | #
17 | # This script is intended to be run from the Unix command line
18 | # on an instance with Hive installed (and the hive executable
19 | # available in the user PATH).
20 | #
21 | # It is assumed that one has already run the shell script
22 | # ngram_hdfs_load.sh, which will have downloaded the associated
23 | # ngram data and deposited it into HDFS under /user/hdpuser/ngrams/
24 | #
25 | # This script will create a table ("1gram") and then load each
26 | # file into a separate partition within the table.
27 | #
28 |
29 | set -o errexit
30 | set -o nounset
31 |
32 | # Select what to install
33 | readonly SCRIPT_DIR=$(dirname $0)
34 | source $SCRIPT_DIR/ngram_setup.sh
35 |
36 | # Create the table if it does not already exist
37 | hive << END_CREATE
38 | CREATE TABLE IF NOT EXISTS $NGRAMS (
39 |   word STRING,
40 |   year INT,
41 |   instance_count INT,
42 |   book_count INT
43 | )
44 | PARTITIONED BY (prefix STRING)
45 | ROW FORMAT DELIMITED
46 | FIELDS TERMINATED BY '\t'
47 | STORED AS TEXTFILE
48 | ;
49 | EXIT
50 | ;
51 | END_CREATE
52 |
53 | # Get the list of files to put into the table
54 | FILE_PATTERN=$(printf $SOURCE_FORMAT $NGRAMS "" "")
55 | FILE_LIST=$($HDFS_CMD -ls $HDFS_DIR | grep $FILE_PATTERN | awk '{ print $8 }')
56 | for filepath in $FILE_LIST; do
57 |   filename=$(basename $filepath)
58 |   prefix=${filename##$FILE_PATTERN}
59 |
60 |   hive --silent << END_LOAD
61 | LOAD DATA INPATH '$HDFS_DIR/$filename'
62 | OVERWRITE INTO TABLE $NGRAMS
63 | PARTITION (prefix='$prefix')
64 | ;
65 | EXIT
66 | ;
67 | END_LOAD
68 | done
69 |
70 | echo "Data loaded into hive table $NGRAMS"
71 |
--------------------------------------------------------------------------------
/examples/ngrams/ngram_hdfs_load.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Copyright 2013 Google Inc. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | #      http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | # This script is intended to be run from the command line on the
17 | # Hadoop master node by the hdpuser.
18 | #
19 | #   $ ngram_hdfs_load.sh [--N=n]
20 | #
21 | # where N defaults to 1 and indicates which ngram dataset to download.
22 | #
23 | # The script will download the zipped ngram files from a public
24 | # Google Cloud Storage bucket to a local temporary directory,
25 | # decompress the files, and then insert them into HDFS under the
26 | # hdfs:///user/hdpuser/ngrams/ directory.
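# For example (an illustrative session, not part of the original header), to
# stage the default 1-gram data set and then the 2-gram data set:
#
#   $ ./ngram_hdfs_load.sh
#   $ ./ngram_hdfs_load.sh --N=2
#
# The --N flag is parsed in ngram_setup.sh. The download is restartable:
# re-running the script skips files that are already staged locally or
# already present in HDFS.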
27 | # 28 | 29 | set -o errexit 30 | set -o nounset 31 | 32 | # Select what to install 33 | readonly SCRIPT_DIR=$(dirname $0) 34 | source $SCRIPT_DIR/ngram_setup.sh 35 | 36 | # Generate Partition List 37 | PARTITION_LIST=$(echo {a..z}) 38 | 39 | for ((i = 1; i < N; i++)); do 40 | NEW_LIST="" 41 | for curr in $PARTITION_LIST; do 42 | curr=$(echo ${curr}{a..z}) 43 | NEW_LIST="$NEW_LIST $curr" 44 | done 45 | PARTITION_LIST="$NEW_LIST" 46 | done 47 | 48 | # Start Downloads Into Stage_dir 49 | mkdir -p $STAGE_DIR 50 | echo "Building $NGRAMS download list" 51 | SOURCE_LIST="" 52 | for partition in $PARTITION_LIST; do 53 | filename_gz=$(printf $SOURCE_FORMAT $NGRAMS ${partition} ".gz") 54 | filename=${filename_gz%%.gz} 55 | 56 | stage_gz=$STAGE_DIR/$filename_gz 57 | stage=$STAGE_DIR/$filename 58 | 59 | # To make this restartable, we check what needs to be downloaded 60 | # before doing so... 61 | 62 | if [[ -e $stage_gz ]]; then 63 | echo "$filename_gz already downloaded" 64 | continue 65 | fi 66 | 67 | if [[ -e $stage ]]; then 68 | echo "$filename_gz downloaded and decompressed" 69 | continue 70 | fi 71 | 72 | if $HDFS_CMD -test -e $HDFS_DIR/$filename; then 73 | echo "HDFS: $HDFS_DIR/$filename already inserted" 74 | continue 75 | fi 76 | 77 | SOURCE_LIST="$SOURCE_LIST $filename_gz" 78 | done 79 | 80 | # Note that this process could be batched up into one call to 81 | # gsutil -m, but in practice it had little performance impact 82 | # and impacted the restartability (as partial files could be 83 | # left in the stage dir). 84 | if [[ -n $SOURCE_LIST ]]; then 85 | echo "Downloading $NGRAMS files from Cloud Storage to $STAGE_DIR" 86 | for filename_gz in $SOURCE_LIST; do 87 | remote=$SOURCE_LOCATION/$filename_gz 88 | stage_gz=$STAGE_DIR/$filename_gz 89 | 90 | gsutil cp $remote ${stage_gz}.tmp && \ 91 | mv ${stage_gz}.tmp $stage_gz 92 | done 93 | fi 94 | 95 | # Decompress 96 | set +o errexit 97 | COMPRESSED_FILES=$(cd $STAGE_DIR && /bin/ls -1 *.gz 2>/dev/null) 98 | set -o errexit 99 | if [[ -n $COMPRESSED_FILES ]]; then 100 | echo "Decompressing $NGRAMS files in $STAGE_DIR" 101 | 102 | for filename in $COMPRESSED_FILES; do 103 | start_sec=$(date +%s) 104 | 105 | echo -n "Decompress $filename" 106 | gunzip $STAGE_DIR/$filename 107 | 108 | end_sec=$(date +%s) 109 | time_sec=$((end_sec - $start_sec)) 110 | 111 | echo " in $time_sec seconds" 112 | done 113 | fi 114 | 115 | # Insert into HDFS 116 | if ! $HDFS_CMD -test -e $HDFS_DIR; then 117 | $HDFS_CMD -mkdir $HDFS_DIR 118 | fi 119 | 120 | UNCOMPRESSED_FILES=$(cd $STAGE_DIR && /bin/ls -1 --ignore *.gz 2>/dev/null) 121 | if [[ -n $UNCOMPRESSED_FILES ]]; then 122 | echo "Inserting $NGRAMS files into HDFS: $HDFS_DIR/" 123 | 124 | for filename in $UNCOMPRESSED_FILES; do 125 | start_sec=$(date +%s) 126 | 127 | echo -n "Insert to HDFS: $HDFS_DIR/$filename" 128 | $HDFS_CMD -put $STAGE_DIR/$filename $HDFS_DIR/${filename} && \ 129 | rm $STAGE_DIR/$filename 130 | 131 | end_sec=$(date +%s) 132 | time_sec=$((end_sec - $start_sec)) 133 | 134 | echo " in $time_sec seconds" 135 | done 136 | fi 137 | 138 | -------------------------------------------------------------------------------- /examples/ngrams/ngram_setup.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | # Utility script, sourced by both ngram_hdfs_load.sh and hive_table_create.sh 17 | # This script will set a series of constants, some based on the choice 18 | # of the command line "N" value (defaults to 1). N indicates the ngram 19 | # dataset to download and copy into HDFS. 20 | 21 | readonly SOURCE_FORMAT="googlebooks-eng-all-%s-20120701-%s%s" 22 | readonly SOURCE_LOCATION="gs://books/ngrams/books" 23 | 24 | # The "hadoop" executable should be in the user path 25 | readonly HDFS_CMD="hadoop fs" 26 | 27 | # What to install: 1gram by default 28 | N=1 29 | 30 | # Now parse command line arguments 31 | while [[ $# -ne 0 ]]; do 32 | case "$1" in 33 | --N=*) 34 | N=${1#--N=} 35 | shift 36 | ;; 37 | --help) 38 | N= 39 | shift 40 | ;; 41 | *) 42 | esac 43 | done 44 | 45 | if [[ ! $N -ge 1 ]]; then 46 | echo "usage $(basename $0): --N=" 47 | exit 1 48 | fi 49 | 50 | # Now set constants based on the selection of N 51 | readonly NGRAMS="${N}gram" 52 | readonly HDFS_DIR="ngrams/$NGRAMS" 53 | readonly STAGE_DIR="/hadoop/tmp/$USER/ngrams/$NGRAMS" 54 | 55 | -------------------------------------------------------------------------------- /examples/ngrams/pig_query_ngrams.pig: -------------------------------------------------------------------------------- 1 | /* 2 | Copyright 2013 Google Inc. All Rights Reserved. 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | */ 15 | 16 | /* 17 | This script is intended to be run from the Pig shell (grunt): 18 | 19 | grunt> exec pig_query_ngrams.pig 20 | 21 | or from the operating system shell: 22 | 23 | $ pig -f pig_query_ngrams.q 24 | 25 | The result of this pipeline is a relation of tuples indicating the count 26 | of occurrences of the words "radio" and "television" in the Google 27 | ngrams corpora for each year since 1920. 28 | 29 | This pipeline ensures that a record exists in the result for every year 30 | since 1920, even if there were no instances of a given word. 31 | In practice this is unnecessary as radio and television both occur 32 | more than once in the data set for every year since 1920. 
33 | */ 34 | 35 | /* Default directory should be /user/ */ 36 | 37 | /* Load the "r" records */ 38 | data1 = LOAD './ngrams/1gram/googlebooks-eng-all-1gram-20120701-r' 39 | USING PigStorage() 40 | AS (ngram:CHARARRAY, year:INT, instance_count:INT, book_count:INT); 41 | 42 | /* Filter only records for "radio" */ 43 | flt1 = FILTER data1 BY LOWER(ngram) == 'radio' AND year >= 1920; 44 | 45 | /* Group all instances of [Rr]adio by year */ 46 | grp1 = GROUP flt1 BY (LOWER(ngram), year); 47 | 48 | /* Sum the count of occurrences (by year) */ 49 | res1 = FOREACH grp1 GENERATE FLATTEN(group), 50 | SUM(flt1.instance_count) AS instance_count:LONG; 51 | 52 | 53 | /* Load the "t" records */ 54 | data2 = LOAD './ngrams/1gram/googlebooks-eng-all-1gram-20120701-t' 55 | USING PigStorage() 56 | AS (ngram:CHARARRAY, year:INT, instance_count:INT, book_count:INT); 57 | 58 | /* Filter only records for "television" */ 59 | flt2 = FILTER data2 BY LOWER(ngram) == 'television' AND year >= 1920; 60 | 61 | /* Group all instances of [Tt]elevision */ 62 | grp2 = GROUP flt2 BY (LOWER(ngram), year); 63 | 64 | /* Sum the count of occurrences (by year) */ 65 | res2 = FOREACH grp2 GENERATE FLATTEN(group), 66 | SUM(flt2.instance_count) AS instance_count:LONG; 67 | 68 | /* 69 | res1 and res2 contain the occurrences for radio and television 70 | respectively by year. To generate the results in a form: 71 | 72 | year radio televison 73 | 1920 15523 133 74 | 1921 18688 28 75 | 76 | 77 | generate a relation containing all "years" and then OUTER JOIN 78 | back to it. This isn't strictly necessary as the radio and television 79 | records exist for all years 1920-2008 (and so a simple JOIN of 80 | the two resultsets would suffice in practice). 81 | */ 82 | 83 | /* Ensure that we have all years represented. */ 84 | years1 = FOREACH res1 GENERATE year; 85 | years2 = FOREACH res2 GENERATE year; 86 | years_all = UNION years1, years2; 87 | 88 | /* Filter unique year values - implicitly orders the results */ 89 | years = DISTINCT years_all; 90 | 91 | /* Join radio records to the years relation */ 92 | j1 = JOIN years BY year LEFT OUTER, res1 BY year; 93 | /* Join television records to the years/radio relation */ 94 | j2 = JOIN j1 BY years::group::year LEFT OUTER, res2 BY year; 95 | 96 | /* 97 | Generate a simple relation showing 98 | year, radio_count, television_count, radio_pct 99 | where radio_pct is the percentage of the occurrences of 100 | "radio" and "television" that were "radio". 101 | */ 102 | res = FOREACH j2 GENERATE j1::years::group::year AS year:INT, 103 | j1::res1::instance_count AS radio:INT, 104 | res2::instance_count AS television:INT, 105 | (double)j1::res1::instance_count/ 106 | ((double)j1::res1::instance_count + 107 | (double)res2::instance_count) AS radio_pct:DOUBLE; 108 | 109 | /* Dump it */ 110 | dump res; 111 | -------------------------------------------------------------------------------- /project_properties.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at
7 | #
8 | #      http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | # Begin: edit these values to set up your cluster
17 | # GCS bucket for packages
18 | readonly GCS_PACKAGE_BUCKET={{{{ bucket_name }}}}
19 | # Zone of the Hadoop master instance
20 | readonly ZONE={{{{ zone_id }}}}
21 | # Hadoop master instance name
22 | readonly MASTER={{{{ master_hostname }}}}
23 |
24 | # Subdirectory in Cloud Storage where packages are pushed at initial setup
25 | readonly GCS_PACKAGE_DIR=hdp_tools
26 |
27 | # Full GCS URIs of the Pig and Hive tarballs, if packages-to-gcs__at__host.sh
28 | # is used; alternatively, these can be set to other pre-existing GCS paths
29 | readonly SUPPORTED_HDPTOOLS="hive pig"
30 | readonly TARBALL_BASE="gs://$GCS_PACKAGE_BUCKET/$GCS_PACKAGE_DIR/packages"
31 | readonly HIVE_TARBALL_URI="$TARBALL_BASE/hive/hive-*.tar.gz"
32 | readonly PIG_TARBALL_URI="$TARBALL_BASE/pig/pig-*.tar.gz"
33 |
34 | # Directory on master where hadoop is installed
35 | readonly HADOOP_HOME=/home/hadoop/hadoop
36 |
37 | # Set to the major version of hadoop ("1" or "2")
38 | readonly HADOOP_MAJOR_VERSION="1"
39 |
40 | # Hadoop username and group on Compute Engine Cluster
41 | readonly HADOOP_USER=hadoop
42 | readonly HADOOP_GROUP=hadoop
43 |
44 | # Hadoop client username on Compute Engine Cluster
45 | readonly HDP_USER=hdpuser
46 |
47 | # Directory on master where packages are installed
48 | readonly HDP_USER_HOME=/home/hdpuser
49 | readonly MASTER_INSTALL_DIR=/home/hdpuser
50 |
51 | # End: edit these values to set up your cluster
52 |
53 |
54 | # Begin: constants used throughout the solution
55 |
56 | # Subdirectory where package files (tar.gz) are stored
57 | readonly PACKAGES_DIR=packages
58 |
59 | # Subdirectory where scripts are stored
60 | readonly SCRIPTS_DIR=scripts
61 |
62 | # Subdirectory on master where we pull down package files
63 | readonly MASTER_PACKAGE_DIR=/tmp/hdp_tools
64 |
65 | # User tmp dir in HDFS
66 | readonly HDFS_TMP_DIR="/tmp"
67 |
68 | # Hadoop temp dir (hadoop.tmp.dir)
69 | readonly HADOOP_TMP_DIR="/hadoop/tmp"
70 |
71 | # End: constants used throughout the solution
72 |
--------------------------------------------------------------------------------
/scripts/common_utils.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | # Copyright 2013 Google Inc. All Rights Reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | #      http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | 16 | set -o nounset 17 | set -o errexit 18 | 19 | function emit() { 20 | echo -e "$@" 21 | } 22 | readonly -f emit 23 | 24 | function die() { 25 | echo -e "$@" >&2 26 | exit 1 27 | } 28 | readonly -f die 29 | 30 | -------------------------------------------------------------------------------- /scripts/install-packages-on-master__at__host.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | set -o nounset 17 | set -o errexit 18 | 19 | readonly SCRIPTDIR=$(dirname $0) 20 | 21 | # Pull in global properties 22 | source project_properties.sh 23 | 24 | # Pull in common functions 25 | source $SCRIPTDIR/common_utils.sh 26 | 27 | # Files to push to master; place project_properties.sh in the same directory 28 | # as the other scripts 29 | readonly SCRIPT_FILES_TO_PUSH="\ 30 | project_properties.sh \ 31 | $SCRIPTS_DIR/common_utils.sh \ 32 | $SCRIPTS_DIR/package_utils.sh \ 33 | $SCRIPTS_DIR/setup-hdfs-for-hdtools__at__master.sh \ 34 | $SCRIPTS_DIR/setup-packages__at__master.sh \ 35 | $SCRIPTS_DIR/setup-ssh-keys__at__master.sh \ 36 | " 37 | readonly MASTER_PACKAGE_SUBDIRS="\ 38 | $MASTER_PACKAGE_DIR/$SCRIPTS_DIR \ 39 | $MASTER_PACKAGE_DIR/conf/hive \ 40 | $MASTER_PACKAGE_DIR/ssh-key 41 | " 42 | 43 | # Ensure permissions on the script files before we push them 44 | chmod 755 $SCRIPT_FILES_TO_PUSH 45 | 46 | # Create the destination directory on the master 47 | emit "" 48 | emit "Ensuring setup directories exist on master:" 49 | gcutil ssh --zone=$ZONE --ssh_arg -t $MASTER sudo -i \ 50 | "rm -rf $MASTER_PACKAGE_DIR && \ 51 | mkdir -p $MASTER_PACKAGE_SUBDIRS" 52 | 53 | # Push the setup script to the master 54 | emit "" 55 | emit "Pushing the setup scripts to the master:" 56 | gcutil push --zone=$ZONE $MASTER \ 57 | $SCRIPT_FILES_TO_PUSH $MASTER_PACKAGE_DIR/$SCRIPTS_DIR 58 | 59 | # Push configuration to the master 60 | emit "" 61 | emit "Pushing configuration to the master:" 62 | gcutil push --zone=$ZONE $MASTER \ 63 | conf/hive/* $MASTER_PACKAGE_DIR/conf/hive 64 | 65 | # Execute the setup script on the master 66 | emit "" 67 | emit "Launching the user and package setup script on the master:" 68 | gcutil ssh --zone=$ZONE --ssh_arg -t $MASTER \ 69 | sudo $MASTER_PACKAGE_DIR/$SCRIPTS_DIR/setup-packages__at__master.sh 70 | 71 | # Execute the HDFS setup script on the master 72 | emit "" 73 | emit "Launching the HDFS setup script on the master:" 74 | gcutil ssh --zone=$ZONE --ssh_arg -t $MASTER \ 75 | sudo \ 76 | $MASTER_PACKAGE_DIR/$SCRIPTS_DIR/setup-hdfs-for-hdtools__at__master.sh 77 | 78 | # Set up SSH keys for the user 79 | emit "" 80 | emit "Generating SSH keys for user $HDP_USER" 81 | 82 | readonly KEY_DIR=./ssh-key 83 | mkdir -p $KEY_DIR 84 | rm -f $KEY_DIR/$HDP_USER $KEY_DIR/${HDP_USER}.pub 85 | 86 | ssh-keygen -t rsa -P '' -f $KEY_DIR/$HDP_USER 87 | chmod o+r $KEY_DIR/${HDP_USER}.pub 88 | emit "Pushing 
SSH keys for user $HDP_USER to $MASTER" 89 | gcutil push --zone=$ZONE $MASTER \ 90 | $KEY_DIR/${HDP_USER}.pub $MASTER_PACKAGE_DIR/ssh-key/ 91 | emit "Adding SSH public key for user $HDP_USER to authorized_keys" 92 | gcutil ssh --zone=$ZONE --ssh_arg -t $MASTER \ 93 | sudo sudo -u $HDP_USER -i \ 94 | $MASTER_PACKAGE_DIR/$SCRIPTS_DIR/setup-ssh-keys__at__master.sh \ 95 | $MASTER_PACKAGE_DIR/ssh-key 96 | 97 | MASTER_IP=$(gcutil getinstance --zone=$ZONE $MASTER | \ 98 | awk -F '|' \ 99 | '$2 ~ / *external-ip */ { gsub(/[ ]*/, "", $3); print $3 }') 100 | 101 | emit "" 102 | emit "***" 103 | emit "SSH keys generated locally to:" 104 | emit " Public key: $KEY_DIR/$HDP_USER.pub" 105 | emit " Private key: $KEY_DIR/$HDP_USER" 106 | emit "" 107 | emit "Public key installed on $MASTER to ~$HDP_USER/.ssh/authorized_keys" 108 | emit "" 109 | emit "You may now ssh to user $HDP_USER@$MASTER with:" 110 | emit " ssh -i $KEY_DIR/$HDP_USER $HDP_USER@$MASTER_IP" 111 | emit "***" 112 | 113 | emit "" 114 | emit "Installation complete" 115 | -------------------------------------------------------------------------------- /scripts/package_utils.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | set -o nounset 17 | set -o errexit 18 | 19 | function pkgutil_get_list() { 20 | local pkg_dir="$1" 21 | 22 | find $pkg_dir -mindepth 2 -maxdepth 2 | sort 23 | } 24 | readonly -f pkgutil_get_list 25 | 26 | function pkgutil_pkg_name() { 27 | local pkg_dir="$1" 28 | local pkg="$2" 29 | 30 | # Strip the "package" directory 31 | local pkg_stripped=${pkg#$pkg_dir/} 32 | 33 | # Get the query-tool specific directory name 34 | echo ${pkg_stripped%/*} 35 | } 36 | readonly -f pkgutil_pkg_name 37 | 38 | function pkgutil_pkg_file() { 39 | local pkg_dir="$1" 40 | local pkg="$2" 41 | 42 | # Return just the filename 43 | echo ${pkg##*/} 44 | } 45 | readonly -f pkgutil_pkg_file 46 | 47 | function pkgutil_emit_list() { 48 | local pkg_dir="$1" 49 | local pkg_list="$2" 50 | 51 | emit "" 52 | emit "Discovered packages:" 53 | for pkg in $pkg_list; do 54 | # Get the query-tool specific directory name 55 | local pkg_name=$(pkgutil_pkg_name $pkg_dir $pkg) 56 | 57 | # Get the name of the zip file 58 | local pkg_file=$(pkgutil_pkg_file $pkg_dir $pkg) 59 | 60 | emit " $pkg_name ($pkg_file)" 61 | done 62 | } 63 | readonly -f pkgutil_emit_list 64 | 65 | -------------------------------------------------------------------------------- /scripts/packages-delete-from-gcs__at__host.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | # packages-delete-from-gcs 17 | # This script removes the Hadoop query tool packages from Google Cloud 18 | # Storage which were uploaded by packages-to-gcs__at__host.sh 19 | 20 | set -o nounset 21 | set -o errexit 22 | 23 | readonly SCRIPTDIR=$(dirname $0) 24 | 25 | # Pull in global properties 26 | source project_properties.sh 27 | 28 | # Pull in common functions 29 | source $SCRIPTDIR/common_utils.sh 30 | 31 | # Remove packages from GCS 32 | emit "" 33 | emit "Removing packages:" 34 | gsutil rm -R -f gs://$GCS_PACKAGE_BUCKET/$GCS_PACKAGE_DIR 35 | 36 | emit "" 37 | emit "Package removal complete" 38 | 39 | -------------------------------------------------------------------------------- /scripts/packages-to-gcs__at__host.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | # packages-to-gcs 17 | # This script examines the Hadoop tools packages directory for a list 18 | # of packages to push to Google Cloud Storage. 19 | # 20 | # All packages should be found in the "packages" subdirectory. 
21 | # The required format is for the package name to be a subdirectory 22 | # and the associated TAR.GZ file to be inside the package subdirectory: 23 | # packages/ 24 | # hive/ 25 | # hive-0.10.0.tar.gz 26 | # pig/ 27 | # pig-0.11.1.tar.gz 28 | 29 | set -o nounset 30 | set -o errexit 31 | 32 | readonly SCRIPTDIR=$(dirname $0) 33 | 34 | # Pull in global properties 35 | source project_properties.sh 36 | 37 | # Pull in common functions 38 | source $SCRIPTDIR/common_utils.sh 39 | source $SCRIPTDIR/package_utils.sh 40 | 41 | # The resulting PACKAGE_LIST will contain one entry per package where 42 | # the entry is of the form "package_dir/package/gzip" 43 | # (for example packages/hive/hive-0.10.0.tar.gz) 44 | PACKAGE_LIST=$(pkgutil_get_list $PACKAGES_DIR) 45 | if [[ -z $PACKAGE_LIST ]]; then 46 | die "No package found in $PACKAGES_DIR subdirectory" 47 | fi 48 | 49 | # Emit package list 50 | pkgutil_emit_list "$PACKAGES_DIR" "$PACKAGE_LIST" 51 | 52 | # Push packages to GCS 53 | emit "" 54 | emit "Pushing packages to gs://$GCS_PACKAGE_BUCKET/$GCS_PACKAGE_DIR/:" 55 | gsutil -m cp -R $PACKAGES_DIR gs://$GCS_PACKAGE_BUCKET/$GCS_PACKAGE_DIR/ 56 | 57 | emit "" 58 | emit "Package upload complete" 59 | 60 | -------------------------------------------------------------------------------- /scripts/setup-hdfs-for-hdtools__at__master.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | set -o nounset 17 | set -o errexit 18 | 19 | SCRIPT=$(basename $0) 20 | SCRIPTDIR=$(dirname $0) 21 | 22 | source $SCRIPTDIR/project_properties.sh 23 | source $SCRIPTDIR/common_utils.sh 24 | 25 | readonly HDFS_CMD="sudo -u $HADOOP_USER -i $HADOOP_HOME/bin/hadoop fs" 26 | readonly HDFS_ROOT_USER="$HADOOP_USER" 27 | 28 | function hdfs_mkdir () { 29 | local dir=$1 30 | local owner=${2:-} 31 | local permissions=${3:-} 32 | 33 | emit " Checking directory $dir" 34 | if !
$HDFS_CMD -test -d $dir 2> /dev/null; then 35 | emit " Creating directory $dir" 36 | $HDFS_CMD -mkdir $dir 37 | fi 38 | 39 | if [[ -n "$owner" ]]; then 40 | emit " Ensuring owner $owner" 41 | $HDFS_CMD -chown $owner $dir 42 | fi 43 | 44 | if [[ -n "$permissions" ]]; then 45 | emit " Ensuring permissions $permissions" 46 | $HDFS_CMD -chmod $permissions $dir 47 | fi 48 | } 49 | readonly -f hdfs_mkdir 50 | 51 | emit "" 52 | emit "*** Begin: $SCRIPT running on master $(hostname) ***" 53 | 54 | # Ensure that /tmp exists (it should) and is fully accessible 55 | hdfs_mkdir "$HDFS_TMP_DIR" "$HDFS_ROOT_USER" "777" 56 | 57 | # Create a hive-specific scratch space in /tmp for the hdpuser 58 | hdfs_mkdir "$HDFS_TMP_DIR/hive-$HDP_USER" "$HDP_USER" 59 | 60 | # Create a warehouse directory (hive) for the hdpuser 61 | hdfs_mkdir "/user" "$HDFS_ROOT_USER" 62 | hdfs_mkdir "/user/$HDP_USER" "$HDP_USER" 63 | hdfs_mkdir "/user/$HDP_USER/warehouse" "$HDP_USER" 64 | 65 | # Create a mapreduce staging directory for the hdpuser 66 | if [[ "${HADOOP_MAJOR_VERSION}" == "2" ]]; then 67 | hdfs_mkdir "/hadoop/mapreduce" "$HADOOP_USER" "o+rw" 68 | hdfs_mkdir "/hadoop/mapreduce/staging" "$HADOOP_USER" "o+rw" 69 | hdfs_mkdir "/hadoop/mapreduce/staging/history" "$HADOOP_USER" "777" 70 | hdfs_mkdir "/hadoop/mapreduce/staging/$HDP_USER" "$HDP_USER" 71 | else 72 | hdfs_mkdir "$HADOOP_TMP_DIR/mapred/staging/$HDP_USER" "$HDP_USER" 73 | fi 74 | 75 | emit "" 76 | emit "*** End: $SCRIPT running on master $(hostname) ***" 77 | emit "" 78 | 79 | -------------------------------------------------------------------------------- /scripts/setup-packages__at__master.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | set -o nounset 17 | set -o errexit 18 | 19 | SCRIPT=$(basename $0) 20 | SCRIPTDIR=$(dirname $0) 21 | 22 | source $SCRIPTDIR/project_properties.sh 23 | source $SCRIPTDIR/common_utils.sh 24 | source $SCRIPTDIR/package_utils.sh 25 | 26 | # BEGIN: Package-specific setup functions 27 | 28 | function setup_pkg_generic() { 29 | local pkg_dir="$1" 30 | local pkg_name="$2" 31 | local pkg_file="$3" 32 | local install_dir="$4" 33 | 34 | emit "Installing $pkg_name from $pkg_dir/$pkg_name/$pkg_file" 35 | 36 | local target_dir=${pkg_file%.tar.gz} 37 | emit " Exploding $pkg_file into $install_dir" 38 | emit " Logging to $pkg_dir/${target_dir}.log" 39 | if ! 
tar xvfz $pkg_dir/$pkg_name/$pkg_file \ 40 | -C $install_dir > $MASTER_PACKAGE_DIR/${target_dir}.log; then 41 | exit 1 42 | fi 43 | 44 | emit " Creating soft link '$pkg_name' to $target_dir" 45 | (cd $install_dir; ln -f -s $target_dir $pkg_name) 46 | } 47 | readonly -f setup_pkg_generic 48 | 49 | function setup_pkg_hive() { 50 | setup_pkg_generic $@ 51 | 52 | # Move configuration 53 | emit "" 54 | emit "Moving Hive configuration" 55 | mv $MASTER_PACKAGE_DIR/conf/hive/* $MASTER_INSTALL_DIR/hive/conf 56 | chown $HDP_USER:$HADOOP_GROUP -R $MASTER_INSTALL_DIR/hive/conf 57 | chmod 600 $MASTER_INSTALL_DIR/hive/conf/hive-site.xml 58 | 59 | # Increase the heap size when using the GCS connector 60 | if $($HADOOP_HOME/bin/hadoop org.apache.hadoop.conf.Configuration \ 61 | | grep "fs.gs.impl" &> /dev/null); then 62 | emit "Detected use of GCS connector; increasing Hive heap size" 63 | 64 | echo "export HADOOP_HEAPSIZE=1024" >> $MASTER_INSTALL_DIR/hive/conf/hive-env.sh 65 | fi 66 | } 67 | readonly -f setup_pkg_hive 68 | 69 | function setup_pkg_pig() { 70 | setup_pkg_generic $@ 71 | } 72 | readonly -f setup_pkg_pig 73 | 74 | # END: Package-specific setup functions 75 | 76 | emit "" 77 | emit "*** Begin: $SCRIPT running on master $(hostname) ***" 78 | 79 | # Set up a "hadoop user" and add it to the hadoop group 80 | emit "" 81 | emit "Checking for $HDP_USER" 82 | if $(id -u $HDP_USER &> /dev/null); then 83 | emit "$HDP_USER already exists" 84 | usermod --gid $HADOOP_GROUP $HDP_USER 85 | else 86 | emit "Creating user $HDP_USER in group $HADOOP_GROUP" 87 | useradd --gid $HADOOP_GROUP --shell /bin/bash -m $HDP_USER 88 | fi 89 | 90 | # Set up login environment 91 | # Source our own file from the .profile so that we can overwrite with impunity 92 | if ! test -e $HDP_USER_HOME/.profile || \ 93 | ! grep --silent profile_hdtools $HDP_USER_HOME/.profile; then 94 | emit "Setting up $HDP_USER_HOME/.profile" 95 | echo "" >> $HDP_USER_HOME/.profile 96 | echo "# Pull in hadoop tool setup" >> $HDP_USER_HOME/.profile 97 | echo "source .profile_hdtools" >> $HDP_USER_HOME/.profile 98 | fi 99 | 100 | # Set common environment variables 101 | emit "Setting up $HDP_USER_HOME/.profile_hdtools" 102 | echo "" >| $HDP_USER_HOME/.profile_hdtools 103 | JAVA_HOME=$(which java | xargs readlink -f | sed -E "s/\/(jre|jdk).*\/bin\/java$//") 104 | echo "export JAVA_HOME=$JAVA_HOME" >> $HDP_USER_HOME/.profile_hdtools 105 | echo "export HADOOP_PREFIX=$HADOOP_HOME" >> $HDP_USER_HOME/.profile_hdtools 106 | echo "export PATH=\$HADOOP_PREFIX/bin:\"\$PATH\"" >> $HDP_USER_HOME/.profile_hdtools 107 | 108 | # Pull down packages and unzip them into the install directory 109 | # (rather than, say, /usr/local).
110 | emit "" 111 | emit "Copying package files from Cloud Storage" 112 | for tool in $SUPPORTED_HDPTOOLS; do 113 | tool_uri=$(echo ${tool}_TARBALL_URI | tr '[:lower:]' '[:upper:]') 114 | 115 | mkdir -p $MASTER_PACKAGE_DIR/$PACKAGES_DIR/${tool} 116 | gsutil -q cp ${!tool_uri} $MASTER_PACKAGE_DIR/$PACKAGES_DIR/${tool} 117 | done 118 | 119 | PACKAGE_LIST=$(pkgutil_get_list $MASTER_PACKAGE_DIR/$PACKAGES_DIR) 120 | if [[ -z $PACKAGE_LIST ]]; then 121 | die "No package found in $MASTER_PACKAGE_DIR/$PACKAGES_DIR" 122 | fi 123 | 124 | pkgutil_emit_list "$MASTER_PACKAGE_DIR/$PACKAGES_DIR" "$PACKAGE_LIST" 125 | 126 | # Call package-specific functions to do the install 127 | emit "" 128 | emit "Installing packages into install directory $MASTER_INSTALL_DIR" 129 | 130 | for pkg in $PACKAGE_LIST; do 131 | # Get the query-tool specific directory name 132 | pkg_name=$(pkgutil_pkg_name $MASTER_PACKAGE_DIR/$PACKAGES_DIR $pkg) 133 | pkg_upper=$(echo "$pkg_name" | tr '[a-z]' '[A-Z]') 134 | 135 | # Get the name of the zip file 136 | pkg_file=$(pkgutil_pkg_file $MASTER_PACKAGE_DIR/$PACKAGES_DIR $pkg) 137 | 138 | # Unzip the package file(s) 139 | setup_pkg_${pkg_name} $MASTER_PACKAGE_DIR/$PACKAGES_DIR \ 140 | $pkg_name $pkg_file $MASTER_INSTALL_DIR 141 | 142 | # For each package, set <PACKAGE>_HOME=/home/<hdpuser>/<package> in the .profile_hdtools 143 | echo "export ${pkg_upper}_HOME=\$HOME/$pkg_name" >> $HDP_USER_HOME/.profile_hdtools 144 | echo "export PATH=\$${pkg_upper}_HOME/bin:\"\$PATH\"" >> $HDP_USER_HOME/.profile_hdtools 145 | done 146 | 147 | emit "Setting user:group ownership on $MASTER_INSTALL_DIR to $HDP_USER:$HADOOP_GROUP" 148 | chown -R $HDP_USER:$HADOOP_GROUP $MASTER_INSTALL_DIR 149 | 150 | # Set group write permissions on the /hadoop/tmp directory. 151 | # Depending on the version of the Hadoop solution, this may already be done. 152 | emit "Setting group write permissions on hadoop temporary directory" 153 | 154 | mkdir -p $HADOOP_TMP_DIR 155 | chown $HADOOP_USER:$HADOOP_GROUP $HADOOP_TMP_DIR 156 | chmod g+w $HADOOP_TMP_DIR 157 | 158 | emit "" 159 | emit "*** End: $SCRIPT running on master $(hostname) ***" 160 | emit "" 161 | 162 | -------------------------------------------------------------------------------- /scripts/setup-ssh-keys__at__master.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Copyright 2013 Google Inc. All Rights Reserved. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | # This script runs on the Hadoop master node as the target user ($HDP_USER). 17 | # It is assumed that a public key file for the user has been pushed 18 | # onto the master node and the location of that file is the first argument 19 | # to the script.
20 | 21 | set -o nounset 22 | set -o errexit 23 | 24 | readonly SCRIPT=$(basename $0) 25 | readonly SCRIPTDIR=$(dirname $0) 26 | 27 | # Pull in global properties 28 | source $SCRIPTDIR/project_properties.sh 29 | source $SCRIPTDIR/common_utils.sh 30 | 31 | if [[ $# -lt 1 ]]; then 32 | die "usage: $0 <keys-dir>" 33 | fi 34 | 35 | KEY_DIR=$1; shift 36 | KEY_FILE=$KEY_DIR/${USER}.pub 37 | 38 | if [[ ! -e $KEY_FILE ]]; then 39 | die "Public key file not found: $KEY_FILE" 40 | fi 41 | 42 | # Ensure that the .ssh directory and authorized_keys file exist 43 | if [[ ! -e $HOME/.ssh/authorized_keys ]]; then 44 | mkdir -p $HOME/.ssh 45 | chmod 700 $HOME/.ssh 46 | 47 | touch $HOME/.ssh/authorized_keys 48 | chmod 600 $HOME/.ssh/authorized_keys 49 | fi 50 | 51 | # Add the public key file for the user to authorized_keys 52 | emit "Updating $HOME/.ssh/authorized_keys" 53 | (echo "# Added $(date)" && cat $KEY_FILE) >> $HOME/.ssh/authorized_keys 54 | 55 | --------------------------------------------------------------------------------
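Every script above sources project_properties.sh from the repository root. That file is part of the tree but is not reproduced in this section; the variable names below are simply the ones the scripts reference, and the values are illustrative placeholders only, a sketch of what has to be configured rather than the actual file:

    # project_properties.sh (illustrative placeholder values; edit for your project)
    readonly ZONE=us-central1-a                # placeholder Compute Engine zone
    readonly MASTER=hadoop-master              # placeholder master instance name
    readonly SCRIPTS_DIR=scripts
    readonly PACKAGES_DIR=packages
    readonly GCS_PACKAGE_BUCKET=my-bucket      # placeholder GCS bucket
    readonly GCS_PACKAGE_DIR=hdp-packages      # placeholder object prefix
    readonly HDP_USER=hdpuser
    readonly HDP_USER_HOME=/home/$HDP_USER
    readonly HADOOP_USER=hadoop                # placeholder; match the cluster's Hadoop account
    readonly HADOOP_GROUP=hadoop
    readonly HADOOP_HOME=/home/hadoop/hadoop-install   # placeholder
    readonly HADOOP_MAJOR_VERSION=1            # placeholder; "2" selects the Hadoop 2 staging layout
    readonly HADOOP_TMP_DIR=/hadoop/tmp        # placeholder
    readonly HDFS_TMP_DIR=/tmp
    readonly MASTER_PACKAGE_DIR=/tmp/hdp_setup # placeholder staging directory on the master
    readonly MASTER_INSTALL_DIR=$HDP_USER_HOME # assumed to be the hdpuser home; setup-packages writes $HOME-relative paths
    readonly SUPPORTED_HDPTOOLS="hive pig"
    readonly HIVE_TARBALL_URI=gs://$GCS_PACKAGE_BUCKET/$GCS_PACKAGE_DIR/packages/hive/hive-0.10.0.tar.gz
    readonly PIG_TARBALL_URI=gs://$GCS_PACKAGE_BUCKET/$GCS_PACKAGE_DIR/packages/pig/pig-0.11.1.tar.gz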
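Taken together, the three __at__host scripts are meant to be run from the repository root on a workstation where gcutil and gsutil are already authenticated against the target project. A minimal end-to-end pass, assuming project_properties.sh has been filled in and the Hive and Pig tarballs are staged under packages/ as described in the packages-to-gcs__at__host.sh header, would look roughly like this:

    # 1. Upload the hive/ and pig/ tarballs from packages/ to Google Cloud Storage.
    ./scripts/packages-to-gcs__at__host.sh

    # 2. Push the setup scripts and Hive configuration to the master, install the
    #    packages, prepare HDFS, and generate an SSH key pair for $HDP_USER.
    ./scripts/install-packages-on-master__at__host.sh

    # 3. (Optional) Remove the uploaded tarballs from Cloud Storage once the
    #    master has copied them down.
    ./scripts/packages-delete-from-gcs__at__host.sh

When it finishes, install-packages-on-master__at__host.sh prints the matching ssh command for logging in as $HDP_USER with the key pair it generates under ./ssh-key.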
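The helpers in scripts/package_utils.sh encode the packages/<tool>/<tarball> contract used by both the upload and install steps: pkgutil_get_list finds entries exactly two levels below the package directory, pkgutil_pkg_name keeps the tool directory, and pkgutil_pkg_file keeps the tarball name. A small, hypothetical check script (not part of the repository; run from the repository root) shows how they decompose an entry before anything is uploaded:

    #!/bin/bash
    # check-packages.sh (hypothetical helper): preview what pkgutil_* would discover.
    set -o nounset
    set -o errexit

    source scripts/common_utils.sh    # provides emit/die
    source scripts/package_utils.sh   # provides the pkgutil_* helpers

    readonly PKG_DIR=packages

    PKG_LIST=$(pkgutil_get_list $PKG_DIR)
    if [[ -z $PKG_LIST ]]; then
      die "No packages found under $PKG_DIR"
    fi

    for pkg in $PKG_LIST; do
      # packages/hive/hive-0.10.0.tar.gz prints as "hive: hive-0.10.0.tar.gz"
      echo "$(pkgutil_pkg_name $PKG_DIR $pkg): $(pkgutil_pkg_file $PKG_DIR $pkg)"
    done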