Hint
\n",
48 | "\n",
49 | " \n",
50 | "Check out this:\n",
51 | " \n",
52 | "```python\n",
53 | "tokenizer = Tokenizer(inputCol=\"sentence\", outputCol=\"words\")\n",
54 | "countTokens = udf(lambda words: len(words), IntegerType())\n",
55 | "wordDataFrame = tokenizer.transform(sentenceDataFrame)\n",
56 | "\n",
57 | "ngram = NGram(n=2, inputCol=\"words\", outputCol=\"ngrams\")\n",
58 | "\n",
59 | "ngramDataFrame = ngram.transform(wordDataFrame)\n",
60 | "ngramDataFrame.select(\"ngrams\").show(truncate=False)\n",
61 | " \n",
62 | "```\n",
63 | "
\n",
64 | " "
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": 5,
70 | "id": "adc10f2a",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "from pyspark.ml.feature import NGram\n",
75 | "from pyspark.ml.feature import Tokenizer\n",
76 | "from pyspark.sql.functions import col, udf\n",
77 | "from pyspark.sql.types import IntegerType\n",
78 | "\n",
79 | "sentenceDataFrame = spark.createDataFrame([\n",
80 | " (0, \"Hi I heard about Spark \"),\n",
81 | " (1, \"I wish, wish Java, Java could\"),\n",
82 | " (2, \"Logistic regression, regression models\")\n",
83 | "], [\"id\", \"sentence\"])\n",
84 | "\n",
85 | "# your solution goes here\n",
86 | "# ...\n",
87 | "\n"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "id": "9e0eef96",
93 | "metadata": {},
94 | "source": [
95 | "### ✅ **Task 2 :** PCA\n",
96 | "\n",
97 | "Principal component analysis (PCA) is a popular technique for analyzing large datasets that contain a high number of dimensions/features per observation.\n",
98 | "\n",
99 | "Take the DataFrame and reduce the dimensions of its vectors using `PCA`.\n",
100 | "\n",
101 | "<details><summary>Click here to see the Solution</summary>\n",
104 | "\n",
105 | " \n",
106 | "Here is one approach:\n",
107 | " \n",
108 | "```python\n",
109 | "pca = PCA(k=3, inputCol=\"features\", outputCol=\"pcaFeatures\")\n",
110 | "model = pca.fit(df)\n",
111 | "\n",
112 | "result = model.transform(df).select(\"pcaFeatures\")\n",
113 | "result.show(truncate=False)\n",
114 | "```\n",
115 | "</details>\n",
116 | " \n"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 6,
122 | "id": "aa857bef",
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "from pyspark.ml.feature import PCA\n",
127 | "from pyspark.ml.linalg import Vectors\n",
128 | "\n",
129 | "data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),\n",
130 | " (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),\n",
131 | " (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]\n",
132 | "df = spark.createDataFrame(data, [\"features\"])\n",
133 | "\n",
134 | "# your solution goes here\n",
135 | "# ...\n"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": null,
141 | "id": "0566e5f0",
142 | "metadata": {},
143 | "outputs": [],
144 | "source": []
145 | }
146 | ],
147 | "metadata": {
148 | "kernelspec": {
149 | "display_name": "Python 3",
150 | "language": "python",
151 | "name": "python3"
152 | },
153 | "language_info": {
154 | "codemirror_mode": {
155 | "name": "ipython",
156 | "version": 3
157 | },
158 | "file_extension": ".py",
159 | "mimetype": "text/x-python",
160 | "name": "python",
161 | "nbconvert_exporter": "python",
162 | "pygments_lexer": "ipython3",
163 | "version": "3.9.4"
164 | }
165 | },
166 | "nbformat": 4,
167 | "nbformat_minor": 5
168 | }
169 |
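The `NGram` transformer used in the first task above simply slides a window of `n` consecutive tokens over the output of `Tokenizer` and joins each window with a space. For intuition, here is a plain-Python sketch of that windowing (the `ngrams` function name is illustrative, not Spark's API; the Spark solution stays in the hidden hint above):

```python
def ngrams(words, n=2):
    """Return all n-grams over a token list, each joined by a single
    space -- the same string form pyspark.ml.feature.NGram emits."""
    # A sentence with fewer than n tokens yields no n-grams.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    tokens = "hi i heard about spark".split()
    print(ngrams(tokens, n=2))
    # five tokens -> four bigrams
```

With `n=2` this mirrors what `NGram(n=2).transform(wordDataFrame)` puts in the `ngrams` column for each tokenized sentence.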
--------------------------------------------------------------------------------
/notebooks/week-3.0-batch_deployment.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "0767a21b",
6 | "metadata": {},
7 | "source": [
8 | "# ⭐ Scaling Machine Learning in Three Weeks course\n",
9 | "## Week 3: Deployment\n",
10 | "\n",
11 | "\n",
12 | "**Prerequisite**\n",
13 | "Run the notebooks `week-3.0-data-prep-for-training` and `week-3.0-evaluate-and-automate-pipelines.ipynb` first.\n",
14 | "\n",
15 | "\n",
16 | "In this exercise, you will work with:\n",
17 | " * model deployment in a batch setting\n",
18 | "\n",
19 | "\n",
20 | "\n",
21 | "\n",
22 | "This exercise is part of the [Scaling Machine Learning with Spark book](https://learning.oreilly.com/library/view/scaling-machine-learning/9781098106812/)\n",
23 | "available on the O'Reilly platform or on [Amazon](https://amzn.to/3WgHQvd).\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 17,
29 | "id": "c686250b",
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "import mlflow\n",
34 | "import mlflow.spark\n",
35 | "from pyspark.sql.types import ArrayType, StringType\n",
36 | "from pyspark.sql.functions import col, struct\n",
37 | "from pyspark.ml.regression import LinearRegression, LinearRegressionModel\n",
38 | "from pyspark.sql import SparkSession \n"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 18,
44 | "id": "e02a2c88",
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "spark = SparkSession.builder \\\n",
49 | " .master('local[*]') \\\n",
50 | " .appName(\"deployment\") \\\n",
51 | " .getOrCreate()"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "id": "f53c3595",
57 | "metadata": {},
58 | "source": [
59 | "### ✅ **Task 1 :** Move the model from the models folder to best_model\n",
60 | " \n",
61 | "Now that we have a model that gives good results, it's time to move it to the next phase."
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 19,
67 | "id": "32939824",
68 | "metadata": {},
69 | "outputs": [],
70 | "source": [
71 | "model_path = \"../models/linearRegression_model\""
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 20,
77 | "id": "84b485b5",
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "restored_mllib_model = LinearRegressionModel.load(model_path)\n"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": 21,
87 | "id": "44068a22",
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "restored_mllib_model.save(\"../models/best_model\")"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "id": "724fd2d4",
97 | "metadata": {},
98 | "source": [
99 | "### ✅ **Task 2 :** Use the model for prediction in production\n",
100 | "\n",
101 | "Imagine the best_model has been deployed to production.\n",
102 | "That means a new app is going to load the model and leverage it with Spark,\n",
103 | "so there is now a production DataFrame to score.\n",
104 | "\n",
105 | "Write the functionality to load the model and use it to predict on the production DataFrame in a batch setting."
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 22,
111 | "id": "5c53c794",
112 | "metadata": {},
113 | "outputs": [],
114 | "source": [
115 | "# your code goes here\n",
116 | "# ..."
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "id": "92035982",
122 | "metadata": {},
123 | "source": [
124 | "How is it different from what you have done so far? \n",
125 | "\n",
126 | "Share your response in the chat!"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "id": "e4572fde",
133 | "metadata": {},
134 | "outputs": [],
135 | "source": []
136 | }
137 | ],
138 | "metadata": {
139 | "kernelspec": {
140 | "display_name": "Python 3",
141 | "language": "python",
142 | "name": "python3"
143 | },
144 | "language_info": {
145 | "codemirror_mode": {
146 | "name": "ipython",
147 | "version": 3
148 | },
149 | "file_extension": ".py",
150 | "mimetype": "text/x-python",
151 | "name": "python",
152 | "nbconvert_exporter": "python",
153 | "pygments_lexer": "ipython3",
154 | "version": "3.9.4"
155 | }
156 | },
157 | "nbformat": 4,
158 | "nbformat_minor": 5
159 | }
160 |
--------------------------------------------------------------------------------
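For Task 2 of the deployment notebook, the Spark-specific calls (`LinearRegressionModel.load`, `model.transform`) already appear earlier in that notebook; what the task really exercises is the batch pattern around them: load the model once at startup, then score each incoming DataFrame. A runnable, Spark-free sketch of that pattern (the `BatchScorer` name and the stub model are illustrative assumptions, not part of the course code):

```python
class BatchScorer:
    """Load a model once, then reuse it to score every batch.

    In the notebook the loader would be
    LinearRegressionModel.load(model_path) and scoring would be
    model.transform(production_df); a plain callable stands in for
    the model here so the pattern runs anywhere.
    """

    def __init__(self, load_model):
        # Load exactly once, at startup -- not once per batch.
        self.model = load_model()

    def score_batch(self, batch):
        # Apply the model to each record, keeping the input alongside
        # its prediction, analogous to adding a "prediction" column.
        return [(x, self.model(x)) for x in batch]


if __name__ == "__main__":
    # Stub for the persisted model: a fixed linear function y = 2x + 1.
    scorer = BatchScorer(lambda: (lambda x: 2 * x + 1))
    print(scorer.score_batch([0.0, 1.5, 3.0]))
    # -> [(0.0, 1.0), (1.5, 4.0), (3.0, 7.0)]
```

The key difference from the training notebooks is that the model is treated as a read-only artifact: production code only loads and applies it, never fits it.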