├── Chapter 01
    └── Chapter 01_Codes.rar
├── Chapter 03
    └── Chapter 03_Codes.rar
├── Chapter 04
    └── Chapter 04_Codes.rar
├── Chapter 05
    └── Chapter 05_Codes.rar
├── Chapter 06
    └── Chapter 06_Codes.rar
├── Chapter 07
    └── Chapter 07_Codes.rar
├── Chapter 08
    └── Chapter 08_Codes.rar
├── Chapter 09
    └── Chapter 09_Codes.rar
├── .gitattributes
├── .gitignore
├── LICENSE
└── README.md

/Chapter 01/Chapter 01_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 01/Chapter 01_Codes.rar
--------------------------------------------------------------------------------
/Chapter 03/Chapter 03_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 03/Chapter 03_Codes.rar
--------------------------------------------------------------------------------
/Chapter 04/Chapter 04_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 04/Chapter 04_Codes.rar
--------------------------------------------------------------------------------
/Chapter 05/Chapter 05_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 05/Chapter 05_Codes.rar
--------------------------------------------------------------------------------
/Chapter 06/Chapter 06_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 06/Chapter 06_Codes.rar
--------------------------------------------------------------------------------
/Chapter 07/Chapter 07_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 07/Chapter 07_Codes.rar
--------------------------------------------------------------------------------
/Chapter 08/Chapter 08_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 08/Chapter 08_Codes.rar
--------------------------------------------------------------------------------
/Chapter 09/Chapter 09_Codes.rar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/Large-Scale-Machine-Learning-with-Spark/HEAD/Chapter 09/Chapter 09_Codes.rar
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 | 
4 | # Custom for Visual Studio
5 | *.cs diff=csharp
6 | 
7 | # Standard to msysgit
8 | *.doc diff=astextplain
9 | *.DOC diff=astextplain
10 | *.docx diff=astextplain
11 | *.DOCX diff=astextplain
12 | *.dot diff=astextplain
13 | *.DOT diff=astextplain
14 | *.pdf diff=astextplain
15 | *.PDF diff=astextplain
16 | *.rtf diff=astextplain
17 | *.RTF diff=astextplain
18 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Windows image file caches
2 | Thumbs.db
3 | ehthumbs.db
4 | 
5 | # Folder config file
6 | Desktop.ini
7 | 
8 | # Recycle Bin used on file shares
9 | $RECYCLE.BIN/
10 | 
11 | # Windows Installer files
12 | *.cab
13 | *.msi
14 | *.msm
15 | *.msp
16 | 
17 | # Windows shortcuts
18 | *.lnk
19 | 
20 | #
=========================
21 | # Operating System Files
22 | # =========================
23 | 
24 | # OSX
25 | # =========================
26 | 
27 | .DS_Store
28 | .AppleDouble
29 | .LSOverride
30 | 
31 | # Thumbnails
32 | ._*
33 | 
34 | # Files that might appear in the root of a volume
35 | .DocumentRevisions-V100
36 | .fseventsd
37 | .Spotlight-V100
38 | .TemporaryItems
39 | .Trashes
40 | .VolumeIcon.icns
41 | 
42 | # Directories potentially created on remote AFP share
43 | .AppleDB
44 | .AppleDesktop
45 | Network Trash Folder
46 | Temporary Items
47 | .apdisk
48 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2016 Packt
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Large Scale Machine Learning with Spark
2 | This is the code repository for [Large Scale Machine Learning with Spark](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-spark?utm_source=github&utm_medium=repository&utm_campaign=9781783288519), published by Packt. It contains all the supporting project files necessary to work through the book from start to finish.
3 | ## Instructions and Navigations
4 | All of the code is organized into folders, one per chapter. Each folder is named after its chapter (for example, Chapter 01) and holds that chapter's code as a .rar archive.
5 | 
6 | 
7 | 
8 | The code will look like the following:
9 | ```
10 | // Requires org.apache.spark.sql.SparkSession and org.apache.spark.rdd.RDD.
11 | // Create (or reuse) a SparkSession that runs locally on all available cores.
12 | SparkSession spark = SparkSession
13 |     .builder()
14 |     .appName("JavaFPGrowthExample")
15 |     .master("local[*]")
16 |     .config("spark.sql.warehouse.dir", "E:/Exp/")
17 |     .getOrCreate();
18 | 
19 | // Creating a simple RDD from the input dataset looks as follows.
20 | 
21 | // textFile(path, minPartitions) returns an RDD<String> with one element per line.
22 | String fileName = "input/dataset.txt";
23 | RDD<String> data = spark.sparkContext().textFile(fileName, 1);
24 | 
25 | ```
26 | 
27 | ### Software requirements:
28 | 
29 | The following software is required for chapters 1-8 and 10: Spark 2.0.0 (or higher), Hadoop 2.7
30 | (or higher), Java (JDK and JRE) 1.7+/1.8+, Scala 2.11.x (or higher), Python 2.6+/3.4+, R 3.1+,
31 | and RStudio 0.99.879 (or higher). Eclipse Mars or Luna (latest) can be used as the IDE.
32 | In addition, the Maven Eclipse plugin (2.9 or higher), the Maven compiler plugin for Eclipse (2.3.2 or
33 | higher), and the Maven assembly plugin for Eclipse (2.4.1 or higher) are required. Most
34 | importantly, reuse the pom.xml file provided with Packt's supplements and adjust the
35 | versions and APIs mentioned above to match your installation.
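The versions listed above map onto Maven coordinates roughly as in the fragment below. This is a minimal sketch, not the pom.xml shipped with the book's supplements, whose contents are not shown in this repository listing. It assumes Spark 2.0.0 built against Scala 2.11 and a Java 1.8 toolchain. The property names and the choice of the three Spark modules are assumptions to adjust per chapter.

```xml
<!-- Hypothetical fragment for illustration; merge into the provided pom.xml as needed. -->
<properties>
  <!-- Assumed versions, taken from the software requirements above. -->
  <spark.version>2.0.0</spark.version>
  <scala.binary.version>2.11</scala.binary.version>
</properties>

<dependencies>
  <!-- Core engine (RDD API). -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <!-- SparkSession, DataFrame, and Dataset APIs. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <!-- Machine learning library. -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <!-- Compiler plugin at the minimum version the README lists. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.3.2</version>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```

Keeping the Spark and Scala versions in properties means that moving to a newer release only requires changing two values.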
36 | 
37 | For Chapter 9, Advanced Machine Learning with Streaming and Graph Data, almost all of the
38 | software listed above is required. The exception is the Twitter data collection example,
39 | which is shown in Spark 1.6.1. Therefore, Spark 1.6.1 or 1.6.2 is also required, along with
40 | the Maven-friendly pom.xml file.
41 | 
42 | ### Operating system requirements:
43 | 
44 | Spark can be run on a number of operating systems, including Windows, Mac OS, and
45 | Linux. However, Linux distributions are preferable (including Debian, Ubuntu, Fedora,
46 | RHEL, CentOS, and so on). For Ubuntu, for example, a complete 64-bit installation of
47 | 14.04 (LTS) or 15.04 is recommended, or alternatively VMware Player 12 or VirtualBox.
48 | For Windows, XP/7/8/10 is recommended, and for Mac OS X, 10.4.7 or later.
49 | 
50 | ### Hardware requirements:
51 | 
52 | To work with Spark smoothly, a machine with at least a Core i3 or Core i5 processor is
53 | recommended. For the best results, a Core i7 provides faster data processing and
54 | scalability, together with at least 8 GB of RAM (recommended) for standalone mode
55 | and at least 32 GB of RAM for a single VM, or more for a cluster. You also need enough storage to
56 | run heavy jobs (depending on the size of the data you will be handling), preferably at least
57 | 50 GB of free disk space (for standalone mode and for the SQL warehouse).
58 | 
59 | ## Related Products
60 | * [Fast Data Processing with Spark](https://www.packtpub.com/big-data-and-business-intelligence/fast-data-processing-spark?utm_source=github&utm_medium=repository&utm_campaign=9781782167068)
61 | 
62 | * [Machine Learning with Spark](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-spark?utm_source=github&utm_medium=repository&utm_campaign=9781783288519)
63 | 
64 | * [Large Scale Machine Learning with Python](https://www.packtpub.com/big-data-and-business-intelligence/large-scale-machine-learning-python?utm_source=github&utm_medium=repository&utm_campaign=9781785887215)
65 | ### Suggestions and Feedback
66 | [Click here](https://docs.google.com/forms/d/e/1FAIpQLSe5qwunkGf6PUvzPirPDtuy1Du5Rlzew23UBp2S-P3wB-GcwQ/viewform) if you have any feedback or suggestions.
67 | 
--------------------------------------------------------------------------------