├── .gitignore
├── README.md
├── build.sbt
├── project
│   ├── build.properties
│   └── plugins.sbt
└── src
    ├── main
    │   └── scala
    │       └── HttpRequest.scala
    └── test
        └── scala
            └── MainApp.scala

/.gitignore:
--------------------------------------------------------------------------------
# bloop and metals
.bloop
.bsp

# metals
project/metals.sbt
.metals

# vs code
.vscode

# scala 3
.tasty

# sbt
project/project/
project/target/
target/

# eclipse
build/
.classpath
.project
.settings
.worksheet
bin/
.cache

# intellij idea
*.log
*.iml
*.ipr
*.iws
.idea

# mac
.DS_Store

# other?
.history
.scala_dependencies
.cache-main

# general
*.class
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# How to execute a REST API call on Apache Spark the Right Way - Scala

Note: This repository is a duplicate of another that I have created ([https://github.com/jamesshocking/Spark-REST-API-UDF](https://github.com/jamesshocking/Spark-REST-API-UDF)), albeit this one demonstrates how to execute REST API calls from Apache Spark using Scala. The other repository is for those looking for the Python version of this code.

### Note
Oct 2022 - Since originally writing this demo, the server behind the example URL https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json has been terminating all requests. The end result is that it will appear as if the code isn't working. The problem is that when the OkHttp client executes the request, the remote server terminates it and an exception is thrown. The code is still valid, but I recommend trying it with a different endpoint.

## Introduction

Apache Spark is a wonderful invention that can solve a great many problems. Its flexibility and adaptability give it great power, but also the opportunity for big mistakes. One such mistake is executing code on the driver that you thought would run in a distributed way on the workers. A common example is executing code outside of the context of a DataFrame.

For example, when you execute code similar to:

```scala
val s = "Scala is amazing"
print(s)
```

Apache Spark will execute the code on the driver, and not a worker. This isn't a problem with such a simple command, but what happens when you need to download large amounts of data via a REST API service? In this README and the accompanying demo code, I am using the OkHttp3 library [https://square.github.io/okhttp/](https://square.github.io/okhttp/). For those needing to request an Auth Token to access a REST API, OkHttp greatly simplifies the process.

```scala
val client: OkHttpClient = new OkHttpClient();

val headerBuilder = new Headers.Builder
val headers = headerBuilder
  .add("content-type", "application/json")
  .build

val result = try {
  val request = new Request.Builder()
    .url(url)
    .headers(headers)
    .build();

  val response: Response = client.newCall(request).execute()
  response.body().string()
}
catch {
  case _: Throwable => "Something went wrong"
}

print(result)
```

If we execute the code above, it will be executed on the Driver. If I were to create a loop with multiple API requests, there would be no parallelism and no scaling, leaving a huge dependency on the Driver. This approach cripples Apache Spark and leaves it no better than a single-threaded program. To take advantage of Apache Spark's scaling and distribution, an alternative solution must be sought.
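To make the problem concrete, here is a minimal sketch of the driver-only loop described above. The endpoints are hypothetical; the point is simply that every call runs sequentially in the driver JVM and no worker is involved:

```scala
import okhttp3.{OkHttpClient, Request}

val client = new OkHttpClient()

// Hypothetical endpoints; in a real job this could be thousands of URLs
val urls = Seq(
  "https://api.example.com/vehicles?page=1",
  "https://api.example.com/vehicles?page=2"
)

// Each request blocks the driver in turn; nothing is distributed to the workers
val responses = urls.map { url =>
  val request = new Request.Builder().url(url).build()
  client.newCall(request).execute().body().string()
}
```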
The solution is to use a UDF coupled with a withColumn statement. This example demonstrates how to create a DataFrame in which each row represents a single request to the REST service. A UDF (User Defined Function) is used to encapsulate the HTTP request, returning a structured column that represents the REST API response, which can then be sliced and diced using the likes of explode and other built-in DataFrame functions (or collapsed, see [https://github.com/jamesshocking/collapse-spark-dataframe](https://github.com/jamesshocking/collapse-spark-dataframe)).

With the advent of Apache Spark 3.0, the untyped variant of the udf function (udf(AnyRef, DataType)) has been deprecated ([Migrating from Apache Spark 2.x to 3.0](https://spark.apache.org/docs/3.0.0-preview/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)). The example code assumes Apache Spark 3.0, but review the code comments for the 2.x way of implementing the UDF.

## The Solution

For the sake of brevity I am assuming that a SparkSession has been created and assigned to a variable called spark. In addition, for this example I will be using the OkHttp HTTP library.

The solution assumes that you need to consume data from a REST API, which you will be calling multiple times to get the data that you need. In order to take advantage of the parallelism that Apache Spark offers, each REST API call will be encapsulated by a UDF, which is bound to a DataFrame. Each row in the DataFrame will represent a single call to the REST API service. Once an action is executed on the DataFrame, the result from each individual REST API call will be appended to each row as a structured data type.

To demonstrate the mechanism, I will be using a free US Government REST API service that returns the makes of USA vehicles [https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json](https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json).

### Start by declaring your imports:

```scala
import okhttp3.{Headers, OkHttpClient, Request, Response}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{col, udf, from_json, explode}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}
```
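These imports assume that both Spark SQL and OkHttp are already on the classpath. In this repository they are pulled in by build.sbt (reproduced in full further down); the relevant lines are essentially:

```scala
// build.sbt (excerpt): the artifacts that the imports above rely on
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.9.0"
```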
### Now declare a function that will execute our REST API call

Use the OkHttp library to execute either an HTTP GET or POST. There is nothing special about this function; it simply returns the REST service response as a String, wrapped in an Option.

```scala
def ExecuteHttpGet(url: String) : Option[String] = {

  val client: OkHttpClient = new OkHttpClient();

  val headerBuilder = new Headers.Builder
  val headers = headerBuilder
    .add("content-type", "application/json")
    .build

  val result = try {
    val request = new Request.Builder()
      .url(url)
      .headers(headers)
      .build();

    val response: Response = client.newCall(request).execute()
    response.body().string()
  }
  catch {
    case _: Throwable => null
  }

  Option[String](result)
}
```

### Define the response schema and the UDF

This is one of the parts of Apache Spark that I really like. I can pick and choose which values I want from the JSON returned by the REST API call. All I have to do is declare the parts of the JSON that I want in a schema, which will be used by the from_json function.

```scala
val restApiSchema = StructType(List(
  StructField("Count", IntegerType, true),
  StructField("Message", StringType, true),
  StructField("SearchCriteria", StringType, true),
  StructField("Results", ArrayType(
    StructType(List(
      StructField("Make_ID", IntegerType, true),
      StructField("Make_Name", StringType, true)
    ))
  ), true)
))
```

Next I declare the UDF, making sure to set the return type as a String. This ensures that the new column holding the UDF result is ready to be passed to from_json. After executing the UDF and then from_json, the row and column in the DataFrame will contain a structured object rather than plain JSON-formatted text.

```scala
val executeRestApiUDF = udf(new UDF1[String, String] {
  override def call(url: String) = {
    ExecuteHttpGet(url).getOrElse("")
  }
}, StringType)
```

### Create the Request DataFrame and Execute

The final piece is to create a DataFrame where each row represents a single REST API call. The number of columns in the DataFrame is up to you, but you will need at least one to host the URL and/or parameters required to execute the REST API call. There are a number of ways to create a DataFrame from a Sequence, but I am going to use a case class.

Using the US Government's free-to-access vehicle-make REST service, we would create a DataFrame as follows:

```scala
case class RestAPIRequest (url: String)

val restApiCallsToMake = Seq(RestAPIRequest("https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json"))
val source_df = restApiCallsToMake.toDF()
```

The case class is used to define the columns of the DataFrame, and using the toDF method that Spark adds to the Seq (available via an import of the SQL context's implicits object), we end up with a DataFrame where each row represents a separate API request.

All being well, the DataFrame will look like:

| url |
| ------------- |
| https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json |
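The toDF call relies on the implicits import being in scope (import spark.implicits._, as shown in MainApp.scala, where spark is the assumed SparkSession). It is also worth noting that a single-row DataFrame gives Spark nothing to parallelise; in practice you would generate one request row per page or per resource. A small sketch, assuming a hypothetical paged endpoint:

```scala
import spark.implicits._   // brings toDF into scope for local collections

// Hypothetical paged endpoint: one request row per page gives Spark many independent calls to distribute
val pagedRequests = (1 to 50).map(page => RestAPIRequest(s"https://api.example.com/vehicles?format=json&page=$page"))
val paged_source_df = pagedRequests.toDF()
```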
Finally, we can use withColumn on the DataFrame to execute the UDF and the REST API call, before using a second withColumn to convert the response String into a structured object.

```scala
val execute_df = source_df
  .withColumn("result", executeRestApiUDF(col("url")))
  .withColumn("result", from_json(col("result"), restApiSchema))
```

As Spark is lazy, the UDF will only execute once an action like count() or show() is executed against the DataFrame. Spark will distribute the API calls amongst all of the workers before returning results such as the following (how much really runs in parallel depends on partitioning; see the closing note at the end of this README):

| url | result |
| -------------| --------|
| https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json | [9773, Response r...] |

The REST service returns a number of attributes and we're only interested in the one identified as Results (i.e. result.Results), which happens to be an array. Using the explode function and a select, we can output all of the Make IDs and Names returned by the service.

```scala
execute_df.select(explode(col("result.Results")).alias("makes"))
  .select(col("makes.Make_ID"), col("makes.Make_Name"))
  .show
```

You would see:

| Make_ID| Make_Name|
|---------------|--------------------|
| 440| ASTON MARTIN|
| 441| TESLA|
| 442| JAGUAR|
| 443| MASERATI|
| 444| LAND ROVER|
| 445| ROLLS ROYCE|
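One last practical note, referenced above: the degree of parallelism you actually get depends on how the request rows are partitioned, because rows in the same partition are processed by a single task. With many request rows, one option (sketched here with an illustrative partition count) is to repartition before applying the UDF:

```scala
// Spread the request rows across more tasks so the HTTP calls can run concurrently
val distributed_df = source_df
  .repartition(8)   // illustrative value: balance executor cores against the API's rate limits
  .withColumn("result", executeRestApiUDF(col("url")))
  .withColumn("result", from_json(col("result"), restApiSchema))
```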
--------------------------------------------------------------------------------
/build.sbt:
--------------------------------------------------------------------------------
name := "apache_spark_consume_api_scala"

version := "0.1"

scalaVersion := "2.12.10"

idePackagePrefix := Some("com.gastecka.demo")

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"

// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.0.1" % "provided"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"

// https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-sqs
libraryDependencies += "com.amazonaws" % "aws-java-sdk-sqs" % "1.11.1001"

libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.9.0"
--------------------------------------------------------------------------------
/project/build.properties:
--------------------------------------------------------------------------------
sbt.version = 1.5.5
--------------------------------------------------------------------------------
/project/plugins.sbt:
--------------------------------------------------------------------------------
addSbtPlugin("org.jetbrains" % "sbt-ide-settings" % "1.1.0")
--------------------------------------------------------------------------------
/src/main/scala/HttpRequest.scala:
--------------------------------------------------------------------------------
package com.gastecka.demo

import java.io.IOException
import okhttp3.{Headers, OkHttpClient, Request, Response}

class HttpRequest {

  // Executes an HTTP GET against the given URL and returns the response body,
  // or None if the request fails for any reason
  def ExecuteHttpGet(url: String) : Option[String] = {

    val client: OkHttpClient = new OkHttpClient();

    val headerBuilder = new Headers.Builder
    val headers = headerBuilder
      .add("content-type", "application/json")
      .build

    val result = try {
      val request = new Request.Builder()
        .url(url)
        .headers(headers)
        .build();

      val response: Response = client.newCall(request).execute()
      response.body().string()
    }
    catch {
      case _: Throwable => null
    }

    Option[String](result)
  }

}
--------------------------------------------------------------------------------
/src/test/scala/MainApp.scala:
--------------------------------------------------------------------------------
package com.gastecka.demo

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{col, udf, from_json, explode}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}

object MainApp extends App {

  val appName: String = "main-app-test"

  // create the SparkSession (the commented-out lines show the older SparkContext approach)
  //val conf = new SparkConf().setMaster("local[*]").setAppName(appName)
  //val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder()
    .appName(appName)
    .master("local[*]")
    .getOrCreate()

  // required for toDF on Seq
  import spark.implicits._

  // Define the UDF that will execute the HTTP(S) GET for the REST API.
  // Note: the untyped udf(AnyRef, DataType) variant was deprecated with Spark 3.0.
  // For those using Spark pre 3.0, uncomment the following code, which creates a
  // local function that executes the HTTP GET and registers it directly:
  //val executeRestApi = (url: String) => {
  //  val httpRequest = new HttpRequest;
  //  httpRequest.ExecuteHttpGet(url)
  //}
  //
  // val executeRestApiUDF = udf(executeRestApi, restApiSchema)
  //
  // The following UDF code is for those using Spark 3.0 and thereafter
  val executeRestApiUDF = udf(new UDF1[String, String] {
    override def call(url: String) = {
      val httpRequest = new HttpRequest;
      httpRequest.ExecuteHttpGet(url).getOrElse("")
    }
  }, StringType)

  // Let's set up an example test:
  // create the DataFrame to bind the UDF to
  case class RestAPIRequest (url: String)

  val restApiCallsToMake = Seq(RestAPIRequest("https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json"))
  val source_df = restApiCallsToMake.toDF()

  // Define the schema used to format the REST response. This will be used by from_json
  val restApiSchema = StructType(List(
    StructField("Count", IntegerType, true),
    StructField("Message", StringType, true),
    StructField("SearchCriteria", StringType, true),
    StructField("Results", ArrayType(
      StructType(List(
        StructField("Make_ID", IntegerType, true),
        StructField("Make_Name", StringType, true)
      ))
    ), true)
  ))

  // add the UDF column, and a column to parse the output to
  // a structure that we can interrogate on the DataFrame
  val execute_df = source_df
    .withColumn("result", executeRestApiUDF(col("url")))
    .withColumn("result", from_json(col("result"), restApiSchema))

  // call an action on the DataFrame to execute the UDF and
  // process the results
  execute_df.select(explode(col("result.Results")).alias("makes"))
    .select(col("makes.Make_ID"), col("makes.Make_Name"))
    .show
}
--------------------------------------------------------------------------------