├── .gitignore
├── README.md
├── build.sbt
├── project
│   ├── build.properties
│   └── plugins.sbt
└── src
    ├── main
    │   └── scala
    │       └── HttpRequest.scala
    └── test
        └── scala
            └── MainApp.scala

/.gitignore:
--------------------------------------------------------------------------------
# bloop and metals
.bloop
.bsp

# metals
project/metals.sbt
.metals

# vs code
.vscode

# scala 3
.tasty

# sbt
project/project/
project/target/
target/

# eclipse
build/
.classpath
.project
.settings
.worksheet
bin/
.cache

# intellij idea
*.log
*.iml
*.ipr
*.iws
.idea

# mac
.DS_Store

# other?
.history
.scala_dependencies
.cache-main

# general
*.class
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# How to execute a REST API call on Apache Spark the Right Way - Scala

Note: This repository is a duplicate of another that I have created ([https://github.com/jamesshocking/Spark-REST-API-UDF](https://github.com/jamesshocking/Spark-REST-API-UDF)), albeit this one demonstrates how to execute REST API calls from Apache Spark using Scala. The other repository is for those looking for the Python version of this code.

### Note
Oct 2022 - Since originally writing this demo, the server behind the example URL https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json has been terminating all requests. The end result is that it will appear as if the code isn't working. The problem is that when the OkHttp client executes the request, the remote server terminates it and an exception is thrown. The code is still valid, but I recommend trying it with a different endpoint.

## Introduction

Apache Spark is a wonderful invention that can solve a great many problems. Its flexibility and adaptability give it great power, but also the opportunity for big mistakes. One such mistake is executing code on the driver that you thought would run in a distributed way on the workers. A common example is executing code outside of the context of a DataFrame.

For example, when you execute code similar to:

```scala
val s = "Scala is amazing"
print(s)
```

Apache Spark will execute the code on the driver, and not a worker. This isn't a problem with such a simple command, but what happens when you need to download large amounts of data via a REST API service? In this README and the accompanying demo code, I am using the OkHttp3 library [https://square.github.io/okhttp/](https://square.github.io/okhttp/). For those needing to request an Auth Token to access a REST API, OkHttp greatly simplifies the process.

```scala
val client: OkHttpClient = new OkHttpClient();

val headerBuilder = new Headers.Builder
val headers = headerBuilder
  .add("content-type", "application/json")
  .build

val result = try {
  val request = new Request.Builder()
    .url(url)
    .headers(headers)
    .build();

  val response: Response = client.newCall(request).execute()
  response.body().string()
}
catch {
  case _: Throwable => "Something went wrong"
}

print(result)
```

If we execute the code above, it will be executed on the Driver. If I were to create a loop with multiple API requests, there would be no parallelism and no scaling, leaving a huge dependency on the Driver. This approach cripples Apache Spark and leaves it no better than a single-threaded program. To take advantage of Apache Spark's scaling and distribution, an alternative solution must be sought.
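To make the problem concrete, here is a minimal sketch of the driver-only loop described above. The endpoints are hypothetical; the point is simply that every call runs sequentially in the driver JVM and no worker is involved:

```scala
import okhttp3.{OkHttpClient, Request}

val client = new OkHttpClient()

// Hypothetical endpoints; in a real job this could be thousands of URLs
val urls = Seq(
  "https://api.example.com/vehicles?page=1",
  "https://api.example.com/vehicles?page=2"
)

// Each request blocks the driver in turn; nothing is distributed to the workers
val responses = urls.map { url =>
  val request = new Request.Builder().url(url).build()
  client.newCall(request).execute().body().string()
}
```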
The solution is to use a UDF coupled with a withColumn statement. This example demonstrates how to create a DataFrame in which each row represents a single request to the REST service. A UDF (User Defined Function) is used to encapsulate the HTTP request, returning a structured column that represents the REST API response, which can then be sliced and diced using the likes of explode and other built-in DataFrame functions (or collapsed, see [https://github.com/jamesshocking/collapse-spark-dataframe](https://github.com/jamesshocking/collapse-spark-dataframe)).

With the advent of Apache Spark 3.0, the untyped variant of the udf function (udf(AnyRef, DataType)) has been deprecated ([Migrating from Apache Spark 2.x to 3.0](https://spark.apache.org/docs/3.0.0-preview/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)). The example code assumes Apache Spark 3.0, but review the code comments for the 2.x way of implementing the UDF.

## The Solution

For the sake of brevity I am assuming that a SparkSession has been created and assigned to a variable called spark. In addition, for this example I will be using the OkHttp HTTP library.

The solution assumes that you need to consume data from a REST API, which you will be calling multiple times to get the data that you need. In order to take advantage of the parallelism that Apache Spark offers, each REST API call will be encapsulated by a UDF, which is bound to a DataFrame. Each row in the DataFrame will represent a single call to the REST API service. Once an action is executed on the DataFrame, the result from each individual REST API call will be appended to each row as a structured data type.

To demonstrate the mechanism, I will be using a free US Government REST API service that returns the makes of USA vehicles [https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json](https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json).

### Start by declaring your imports:

```scala
import okhttp3.{Headers, OkHttpClient, Request, Response}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{col, udf, from_json, explode}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}
```
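These imports assume that both Spark SQL and OkHttp are already on the classpath. In this repository they are pulled in by build.sbt (reproduced in full further down); the relevant lines are essentially:

```scala
// build.sbt (excerpt): the artifacts that the imports above rely on
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.9.0"
```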
### Now declare a function that will execute our REST API call

Use the OkHttp library to execute either an HTTP GET or POST. There is nothing special about this function; it simply returns the REST service response as a String, wrapped in an Option.

```scala
def ExecuteHttpGet(url: String) : Option[String] = {

  val client: OkHttpClient = new OkHttpClient();

  val headerBuilder = new Headers.Builder
  val headers = headerBuilder
    .add("content-type", "application/json")
    .build

  val result = try {
    val request = new Request.Builder()
      .url(url)
      .headers(headers)
      .build();

    val response: Response = client.newCall(request).execute()
    response.body().string()
  }
  catch {
    case _: Throwable => null
  }

  Option[String](result)
}
```

### Define the response schema and the UDF

This is one of the parts of Apache Spark that I really like. I can pick and choose which values I want from the JSON returned by the REST API call. All I have to do is declare the parts of the JSON that I want in a schema, which will be used by the from_json function.

```scala
val restApiSchema = StructType(List(
  StructField("Count", IntegerType, true),
  StructField("Message", StringType, true),
  StructField("SearchCriteria", StringType, true),
  StructField("Results", ArrayType(
    StructType(List(
      StructField("Make_ID", IntegerType, true),
      StructField("Make_Name", StringType, true)
    ))
  ), true)
))
```

Next I declare the UDF, making sure to set the return type as a String. This ensures that the new column holding the UDF result is ready to be passed to from_json. After executing the UDF and then from_json, the row and column in the DataFrame will contain a structured object rather than plain JSON-formatted text.

```scala
val executeRestApiUDF = udf(new UDF1[String, String] {
  override def call(url: String) = {
    ExecuteHttpGet(url).getOrElse("")
  }
}, StringType)
```

### Create the Request DataFrame and Execute

The final piece is to create a DataFrame where each row represents a single REST API call. The number of columns in the DataFrame is up to you, but you will need at least one to host the URL and/or parameters required to execute the REST API call. There are a number of ways to create a DataFrame from a Sequence, but I am going to use a case class.

Using the US Government's free-to-access vehicle-make REST service, we would create a DataFrame as follows:

```scala
case class RestAPIRequest (url: String)

val restApiCallsToMake = Seq(RestAPIRequest("https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json"))
val source_df = restApiCallsToMake.toDF()
```

The case class is used to define the columns of the DataFrame, and using the toDF method that Spark adds to the Seq (available via an import of the SQL context's implicits object), we end up with a DataFrame where each row represents a separate API request.

All being well, the DataFrame will look like:

| url |
| ------------- |
| https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json |
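The toDF call relies on the implicits import being in scope (import spark.implicits._, as shown in MainApp.scala, where spark is the assumed SparkSession). It is also worth noting that a single-row DataFrame gives Spark nothing to parallelise; in practice you would generate one request row per page or per resource. A small sketch, assuming a hypothetical paged endpoint:

```scala
import spark.implicits._   // brings toDF into scope for local collections

// Hypothetical paged endpoint: one request row per page gives Spark many independent calls to distribute
val pagedRequests = (1 to 50).map(page => RestAPIRequest(s"https://api.example.com/vehicles?format=json&page=$page"))
val paged_source_df = pagedRequests.toDF()
```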
Finally, we can use withColumn on the DataFrame to execute the UDF and the REST API call, before using a second withColumn to convert the response String into a structured object.

```scala
val execute_df = source_df
  .withColumn("result", executeRestApiUDF(col("url")))
  .withColumn("result", from_json(col("result"), restApiSchema))
```

As Spark is lazy, the UDF will only execute once an action like count() or show() is executed against the DataFrame. Spark will distribute the API calls amongst all of the workers before returning results such as the following (how much really runs in parallel depends on partitioning; see the closing note at the end of this README):

| url | result |
| -------------| --------|
| https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json | [9773, Response r...] |

The REST service returns a number of attributes and we're only interested in the one identified as Results (i.e. result.Results), which happens to be an array. Using the explode function and a select, we can output all of the Make IDs and Names returned by the service.

```scala
execute_df.select(explode(col("result.Results")).alias("makes"))
  .select(col("makes.Make_ID"), col("makes.Make_Name"))
  .show
```

You would see:

| Make_ID| Make_Name|
|---------------|--------------------|
| 440| ASTON MARTIN|
| 441| TESLA|
| 442| JAGUAR|
| 443| MASERATI|
| 444| LAND ROVER|
| 445| ROLLS ROYCE|
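One last practical note, referenced above: the degree of parallelism you actually get depends on how the request rows are partitioned, because rows in the same partition are processed by a single task. With many request rows, one option (sketched here with an illustrative partition count) is to repartition before applying the UDF:

```scala
// Spread the request rows across more tasks so the HTTP calls can run concurrently
val distributed_df = source_df
  .repartition(8)   // illustrative value: balance executor cores against the API's rate limits
  .withColumn("result", executeRestApiUDF(col("url")))
  .withColumn("result", from_json(col("result"), restApiSchema))
```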
--------------------------------------------------------------------------------
/build.sbt:
--------------------------------------------------------------------------------
name := "apache_spark_consume_api_scala"

version := "0.1"

scalaVersion := "2.12.10"

idePackagePrefix := Some("com.gastecka.demo")

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"

// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.0.1" % "provided"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"

// https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-sqs
libraryDependencies += "com.amazonaws" % "aws-java-sdk-sqs" % "1.11.1001"

libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.9.0"
--------------------------------------------------------------------------------
/project/build.properties:
--------------------------------------------------------------------------------
sbt.version = 1.5.5
--------------------------------------------------------------------------------
/project/plugins.sbt:
--------------------------------------------------------------------------------
addSbtPlugin("org.jetbrains" % "sbt-ide-settings" % "1.1.0")
--------------------------------------------------------------------------------
/src/main/scala/HttpRequest.scala:
--------------------------------------------------------------------------------
package com.gastecka.demo

import java.io.IOException
import okhttp3.{Headers, OkHttpClient, Request, Response}

class HttpRequest {

  // Executes an HTTP GET against the given URL and returns the response body,
  // or None if the request fails for any reason
  def ExecuteHttpGet(url: String) : Option[String] = {

    val client: OkHttpClient = new OkHttpClient();

    val headerBuilder = new Headers.Builder
    val headers = headerBuilder
      .add("content-type", "application/json")
      .build

    val result = try {
      val request = new Request.Builder()
        .url(url)
        .headers(headers)
        .build();

      val response: Response = client.newCall(request).execute()
      response.body().string()
    }
    catch {
      case _: Throwable => null
    }

    Option[String](result)
  }

}
--------------------------------------------------------------------------------
/src/test/scala/MainApp.scala:
--------------------------------------------------------------------------------
package com.gastecka.demo

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{col, udf, from_json, explode}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}

object MainApp extends App {

  val appName: String = "main-app-test"

  // create the SparkSession (the commented-out lines show the older SparkContext approach)
  //val conf = new SparkConf().setMaster("local[*]").setAppName(appName)
  //val sc = new SparkContext(conf)
  val spark = SparkSession
    .builder()
    .appName(appName)
    .master("local[*]")
    .getOrCreate()

  // required for toDF on Seq
  import spark.implicits._

  // Define the UDF that will execute the HTTP(S) GET for the REST API.
  // Note: the untyped udf(AnyRef, DataType) variant was deprecated with Spark 3.0.
  // For those using Spark pre 3.0, uncomment the following code, which creates a
  // local function that executes the HTTP GET and registers it directly:
  //val executeRestApi = (url: String) => {
  //  val httpRequest = new HttpRequest;
  //  httpRequest.ExecuteHttpGet(url)
  //}
  //
  // val executeRestApiUDF = udf(executeRestApi, restApiSchema)
  //
  // The following UDF code is for those using Spark 3.0 and thereafter
  val executeRestApiUDF = udf(new UDF1[String, String] {
    override def call(url: String) = {
      val httpRequest = new HttpRequest;
      httpRequest.ExecuteHttpGet(url).getOrElse("")
    }
  }, StringType)

  // Let's set up an example test:
  // create the DataFrame to bind the UDF to
  case class RestAPIRequest (url: String)

  val restApiCallsToMake = Seq(RestAPIRequest("https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json"))
  val source_df = restApiCallsToMake.toDF()

  // Define the schema used to format the REST response. This will be used by from_json
  val restApiSchema = StructType(List(
    StructField("Count", IntegerType, true),
    StructField("Message", StringType, true),
    StructField("SearchCriteria", StringType, true),
    StructField("Results", ArrayType(
      StructType(List(
        StructField("Make_ID", IntegerType, true),
        StructField("Make_Name", StringType, true)
      ))
    ), true)
  ))

  // add the UDF column, and a column to parse the output to
  // a structure that we can interrogate on the DataFrame
  val execute_df = source_df
    .withColumn("result", executeRestApiUDF(col("url")))
    .withColumn("result", from_json(col("result"), restApiSchema))

  // call an action on the DataFrame to execute the UDF and
  // process the results
  execute_df.select(explode(col("result.Results")).alias("makes"))
    .select(col("makes.Make_ID"), col("makes.Make_Name"))
    .show
}
--------------------------------------------------------------------------------