├── LICENSE ├── README.md ├── build.gradle ├── img ├── 1024px-Google-BigQuery-Logo.svg.png ├── Bizzabo-Logo-SVG-No-Space.svg ├── airflow.png ├── beam.png ├── bigquery.png ├── bizzabo.svg ├── dataflow.png ├── elastic-elasticsearch-logo-png-transparent.png └── elasticsearch.png └── src └── main ├── kotlin └── com │ └── bizzabo │ └── dataPipelineFromElasticsearchToBigquery │ └── Application.kt └── python ├── dataPipelineFromElasticsearchToBigquery_dag.py └── dataPipelineFromElasticsearchToBigquery_options.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Bizzabo 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 |
3 | <!-- Project and technology logos from the img/ folder are rendered here. -->
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 | ![Kotlin](https://img.shields.io/badge/Kotlin-1.3-green.svg)
14 | ![Python](https://img.shields.io/badge/python-v3-green.svg)
15 |
16 | ![Apache Airflow](https://img.shields.io/badge/Apache_Airflow-blue.svg)
17 | ![Elasticsearch](https://img.shields.io/badge/Elasticsearch-blue.svg)
18 | ![Google Cloud Dataflow](https://img.shields.io/badge/Google_Cloud_Dataflow-blue.svg)
19 | ![Apache Beam](https://img.shields.io/badge/Apache_Beam-blue.svg)
20 | ![Google BigQuery](https://img.shields.io/badge/Google_BigQuery-blue.svg)
21 |
22 | ![License](https://img.shields.io/badge/license-MIT-yellow.svg)
23 |
24 | # Elasticsearch to BigQuery Data Pipeline
25 | #### A generic data pipeline which maps Elasticsearch documents to Google BigQuery table rows using Apache Airflow and Google Cloud Dataflow
26 |
27 | * [About](#about)
28 | * [Getting Started](#getting-started)
29 | * [Prerequisites](#prerequisites)
30 | * [Arguments](#arguments)
31 |   * [Pipeline Arguments](#pipeline-arguments)
32 |   * [Elasticsearch Arguments](#elasticsearch-arguments)
33 |   * [Google Cloud Arguments](#google-cloud-arguments)
34 | * [Airflow Options](#airflow-options)
35 | * [Deployment](#deployment)
36 | * [Built With](#built-with)
37 | * [Project Status](#project-status)
38 | * [License](#license)
39 |
40 | ## About
41 | This application was developed to address the need for an ETL process which would do the following:
42 | * Retrieve documents from Elasticsearch,
43 | * Transform said documents, and
44 | * Write them as rows to BigQuery.
45 | The application runs on Dataflow and is triggered periodically by Airflow.
46 |
47 | ## Getting Started
48 | The application performs the following steps:
49 | 1. Gather application arguments
50 | 2. Create the data pipeline with the provided arguments
51 | 3. Generate a query according to the provided arguments
52 | 4. Create a BigQuery table reference
53 | 5. Create a BigQuery table field schema
54 | 6. Apply actions to the pipeline (read documents from Elasticsearch, transform the documents into table rows, write the rows to BigQuery)
55 |
56 | ## Prerequisites
57 | Before running this application, a few things need to be set up:
58 | 1. The application will attempt to connect to Elasticsearch and gauge the number of documents that will be processed.
59 | This requires Elasticsearch to be accessible to Dataflow.
60 | If your Elasticsearch cluster is behind a firewall, network modifications may be required to prevent the application from failing to access Elasticsearch and therefore failing altogether.
61 | 2. The application requires BigQuery to contain a table with the correct name and schema, as defined in the setFields function.
62 | If said table does not exist, writing to BigQuery will fail. An example table-creation command is included at the end of this README.
63 |
64 | ## Running
65 | ### CLI or IDE Execution
66 |
67 | When the application is executed, a job is created in Dataflow and the application is run with the provided arguments. An example invocation is included at the end of this README.
68 | Monitoring of the job can be done via Dataflow's web console.
69 | ### Airflow Execution
70 | When the relevant DAG is triggered, the application jar is executed along with any arguments provided by the DAG.
71 | Monitoring of the job can be done via Dataflow's web console or via Airflow's web console.
72 |
73 | ## Arguments
74 | The application's arguments can be divided into three categories:
75 | ### Pipeline Arguments
76 | * queryType - determines which type of query will be used to retrieve documents from Elasticsearch.
77 |   Possible values:
78 |   * daysAgo - the query will return documents modified between "daysBeforeStart" and "daysBeforeEnd".
79 |   * betweenDates - the query will return documents modified between "beginDate" and "endDate".
80 |   * withSearchParam - the query will return all of the documents in Elasticsearch which meet the criterion specified by "paramName" and "paramValue".
81 |   * everything - the query will return all of the documents in Elasticsearch.
82 | * beginDate - a YYYYMMDD-formatted string that determines the lower bound for when the document was modified.
83 | * endDate - a YYYYMMDD-formatted string that determines the upper bound for when the document was modified.
84 | * daysBeforeStart - an integer that determines the lower bound for how many days ago the document was modified.
85 | * daysBeforeEnd - an integer that determines the upper bound for how many days ago the document was modified.
86 | * paramName - the name of the parameter to be used as a criterion in the query.
87 | * paramValue - the value of the parameter to be used as a criterion in the query.
88 |
89 | ### Elasticsearch Arguments
90 | * batchSize - the Elasticsearch result batch size.
91 | * connectTimeout - the Elasticsearch connection timeout duration.
92 | * index - the Elasticsearch index to be queried against.
93 | * socketAndRetryTimeout - the Elasticsearch socket and retry timeout duration.
94 | * source - the URL and port of the Elasticsearch instance to be queried against.
95 | * type - the Elasticsearch document type.
96 |
97 | ### Google Cloud Arguments
98 | * datasetId - BigQuery dataset ID.
99 | * diskSizeGb - Dataflow worker disk size in GB.
100 | * enableCloudDebugger - boolean indicator of whether to enable Cloud Debugger.
101 | * gcpTempLocation - Dataflow temporary file storage location.
102 | * network - Google Cloud VPC network name.
103 | * numWorkers - number of Dataflow workers.
104 | * project - Google Cloud Platform project name.
105 | * projectId - Google Cloud Platform project ID.
106 | * region - Google Cloud Platform VPC network region.
107 | * serviceAccount - Google Cloud Platform service account.
108 | * subnetwork - Google Cloud Platform VPC subnetwork.
109 | * tableId - BigQuery table ID.
110 | * tempLocation - Dataflow pipeline temporary file storage location.
111 | * usePublicIps - boolean indicator of whether Dataflow should use public IP addresses.
112 |
113 | Note: any argument which is not passed to the application will fall back to a default value (see the populateDefaultArgMap function in Application.kt).
114 |
115 | ## Airflow Options
116 | All of the arguments available to the application may be set by Airflow. A number of additional options are available when running via Airflow:
117 | * autoscalingAlgorithm - Dataflow autoscaling algorithm.
118 | * partitionType - Dataflow partition type.
119 |
120 | ## Deployment
121 | In order to deploy the application, it must be built into a fat jar so that all of its dependencies are accessible to Dataflow at runtime.
122 | If you plan on running the application using Airflow, the jar must also be uploaded to an accessible location in Google Cloud Storage. Example build and upload commands are included at the end of this README.
123 |
124 | ## Built With
125 | The application is built with Gradle.
126 |
127 | ## Project Status
128 | The project is currently in production and runs periodically as part of Bizzabo's data pipeline.
129 |
130 | ## License
131 | This project is licensed under the MIT License - see the LICENSE file for details.
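
## Example Commands
The snippets below are illustrative sketches rather than part of the project itself; every project, dataset, table, bucket and host name is a placeholder taken from the defaults in Application.kt and the Airflow options file, so substitute your own values.

### Creating the destination BigQuery table
The Prerequisites section expects the destination table to exist with the schema defined in the setFields function. One possible way to create it, assuming the `bq` CLI is installed and authenticated, is sketched below. Inline schema definitions create NULLABLE columns, so supply a JSON schema file instead if you want the REQUIRED modes used in setFields.

```bash
# Placeholder project, dataset and table names; the schema mirrors setFields in Application.kt.
# Inline schemas default to NULLABLE; use a JSON schema file to mark fields REQUIRED.
bq mk --table gcpProjectId:datasetId.table \
  id:INTEGER,first_name:STRING,last_name:STRING,address:STRING,birthday:TIMESTAMP,person_json:STRING,created:TIMESTAMP,modified:TIMESTAMP
```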
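
### Building and uploading the jar
The Deployment section calls for a fat jar that is reachable by Dataflow, and by Airflow if the DAG is used. A minimal sketch using the Shadow plugin configured in build.gradle and gsutil is shown below; the destination path matches the jar location referenced in the DAG definition, and the exact name of the jar produced under build/libs depends on your Gradle project name and version.

```bash
# Build the fat jar with the Shadow plugin declared in build.gradle.
./gradlew clean shadowJar

# Copy it to the GCS location the Airflow DAG points at (placeholder bucket).
gsutil cp build/libs/*-all.jar \
  gs://dataPipelineFromElasticsearchToBigquery/lib/dataPipelineFromElasticsearchToBigquery.jar
```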
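
### Running the fat jar directly
A direct invocation might look like the following. The jar file name here is an assumption that depends on your Gradle project name and version. Arguments are passed as `--key=value` pairs, and any argument that is omitted falls back to the defaults in populateDefaultArgMap.

```bash
# All values are placeholders copied from the defaults in Application.kt.
java -jar build/libs/dataPipelineFromElasticsearchToBigquery-0.1.0-SNAPSHOT-all.jar \
  --queryType=daysAgo \
  --daysBeforeStart=1 \
  --daysBeforeEnd=0 \
  --source=http://elasticsearch.data.source.com:9200 \
  --index=elasticsearchIndex \
  --type=documentType \
  --project=gcpProject \
  --projectId=gcpProjectId \
  --datasetId=datasetId \
  --tableId=table \
  --region=gcpRegion \
  --serviceAccount=service@account.iam.gserviceaccount.com \
  --tempLocation=gs://dataPipelineFromElasticsearchToBigquery/tempLocation/ \
  --gcpTempLocation=gs://dataPipelineFromElasticsearchToBigquery/gcpTempLocation/
```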
132 |
--------------------------------------------------------------------------------
/build.gradle:
--------------------------------------------------------------------------------
1 | buildscript {
2 |     ext {
3 |         kotlinVersion = '1.2.51'
4 |         beamSdksJavaIoElasticsearch = '2.15.0'
5 |         googleCloudDataflowJavaSdkVersion = '2.5.0'
6 |         googleApiClientVersion = '1.30.2'
7 |         slf4jVersion = '1.7.25'
8 |         kotlinLoggingVersion = '1.5.4'
9 |     }
10 |
11 |     repositories {
12 |         mavenCentral()
13 |     }
14 |     dependencies {
15 |         classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlinVersion"
16 |     }
17 | }
18 |
19 | plugins {
20 |     id 'com.github.johnrengelman.shadow' version '5.1.0'
21 |     id 'java'
22 |     id 'distribution'
23 |     id 'application'
24 | }
25 |
26 | group 'com.bizzabo.dataPipelineFromElasticsearchToBigquery'
27 | version '0.1.0-SNAPSHOT'
28 | apply plugin: 'kotlin'
29 | mainClassName = 'com.bizzabo.dataPipelineFromElasticsearchToBigquery.ApplicationKt'
30 | sourceCompatibility = 1.8
31 |
32 | repositories {
33 |     mavenCentral()
34 | }
35 |
36 | dependencies {
37 |     compile "org.jetbrains.kotlin:kotlin-stdlib-jdk8:$kotlinVersion"
38 |     compile "org.apache.beam:beam-sdks-java-io-google-cloud-platform:${beamSdksJavaIoElasticsearch}"
39 |     compile "org.apache.beam:beam-sdks-java-io-elasticsearch:${beamSdksJavaIoElasticsearch}"
40 |     compile "org.apache.beam:beam-sdks-java-core:${beamSdksJavaIoElasticsearch}"
41 |     compile "org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:${beamSdksJavaIoElasticsearch}"
42 |     compile("com.google.api-client:google-api-client:${googleApiClientVersion}") {
43 |         force = true
44 |     }
45 |     compile 'com.google.cloud:google-cloud-storage:1.90.0'
46 |
47 |     compile "org.slf4j:slf4j-api:${slf4jVersion}"
48 |     compile "org.slf4j:slf4j-jdk14:${slf4jVersion}"
49 |     compile "io.github.microutils:kotlin-logging:$kotlinLoggingVersion"
50 |
51 |     compile "com.google.guava:guava:23.6-jre"
52 |     compile "org.apache.httpcomponents:httpcore:4.4.8"
53 |     compile "org.apache.beam:beam-runners-google-cloud-dataflow-java:${beamSdksJavaIoElasticsearch}"
54 | }
55 |
56 | compileKotlin {
57 |     kotlinOptions.jvmTarget = '1.8'
58 | }
59 |
60 | shadowJar {
61 |     mergeServiceFiles()
62 | }
--------------------------------------------------------------------------------
/img/1024px-Google-BigQuery-Logo.svg.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bizzabo/elasticsearch_to_bigquery_data_pipeline/eaaf78e04b80324657f34845565aa528e5d3466c/img/1024px-Google-BigQuery-Logo.svg.png
--------------------------------------------------------------------------------
/img/Bizzabo-Logo-SVG-No-Space.svg:
--------------------------------------------------------------------------------
1 | SVG-2Artboard 1
--------------------------------------------------------------------------------
/img/airflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bizzabo/elasticsearch_to_bigquery_data_pipeline/eaaf78e04b80324657f34845565aa528e5d3466c/img/airflow.png
--------------------------------------------------------------------------------
/img/beam.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bizzabo/elasticsearch_to_bigquery_data_pipeline/eaaf78e04b80324657f34845565aa528e5d3466c/img/beam.png
--------------------------------------------------------------------------------
/img/bigquery.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bizzabo/elasticsearch_to_bigquery_data_pipeline/eaaf78e04b80324657f34845565aa528e5d3466c/img/bigquery.png
--------------------------------------------------------------------------------
/img/bizzabo.svg:
--------------------------------------------------------------------------------
1 | SVG-2Artboard 1
--------------------------------------------------------------------------------
/img/dataflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bizzabo/elasticsearch_to_bigquery_data_pipeline/eaaf78e04b80324657f34845565aa528e5d3466c/img/dataflow.png
--------------------------------------------------------------------------------
/img/elastic-elasticsearch-logo-png-transparent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bizzabo/elasticsearch_to_bigquery_data_pipeline/eaaf78e04b80324657f34845565aa528e5d3466c/img/elastic-elasticsearch-logo-png-transparent.png
--------------------------------------------------------------------------------
/img/elasticsearch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bizzabo/elasticsearch_to_bigquery_data_pipeline/eaaf78e04b80324657f34845565aa528e5d3466c/img/elasticsearch.png
--------------------------------------------------------------------------------
/src/main/kotlin/com/bizzabo/dataPipelineFromElasticsearchToBigquery/Application.kt:
--------------------------------------------------------------------------------
1 | package com.bizzabo.dataPipelineFromElasticsearchToBigquery
2 |
3 | import com.google.api.services.bigquery.model.TableFieldSchema
4 | import com.google.api.services.bigquery.model.TableReference
5 | import com.google.api.services.bigquery.model.TableRow
6 | import com.google.api.services.bigquery.model.TableSchema
7 | import com.google.gson.Gson
8 | import com.google.gson.GsonBuilder
9 | import com.google.gson.reflect.TypeToken
10 | import mu.KotlinLogging
11 | import org.apache.beam.runners.dataflow.DataflowRunner
12 | import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions
13 | import org.apache.beam.sdk.Pipeline
14 | import org.apache.beam.sdk.coders.StringUtf8Coder
15 | import org.apache.beam.sdk.io.FileSystems
16 | import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO
17 | import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO
18 | import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder
19 | import org.apache.beam.sdk.options.PipelineOptions
20 | import org.apache.beam.sdk.options.PipelineOptionsFactory
21 | import org.apache.beam.sdk.transforms.MapElements
22 | import org.apache.beam.sdk.transforms.SimpleFunction
23 | import org.apache.beam.sdk.values.TypeDescriptor
24 | import org.joda.time.DateTimeZone
25 | import org.joda.time.LocalDate
26 | import org.joda.time.format.DateTimeFormat
27 | import org.joda.time.format.ISODateTimeFormat
28 |
29 | private val logger = KotlinLogging.logger {}
30 |
31 | fun main(args: Array<String>) {
32 |
33 |     val argMap = generateArgMap(args)
34 |     val options = createPipelineOptions(argMap)
35 |     FileSystems.setDefaultPipelineOptions(options)
36 |     val pipeline = Pipeline.create(options)
37 |     pipeline.coderRegistry.registerCoderForClass(TableRow::class.java, TableRowJsonCoder.of())
38 |
39 |     val query: String = generateQuery(argMap)
40 |
41 |     val tableReference = tableReference(argMap)
42 |
43 |     val fields = setFields()
44 |
45 |     pipeline.apply("Read from ES", readFromES(argMap, query))
46 |             .setCoder(StringUtf8Coder.of())
47 |             .apply("MapToTableRow", mapToTableRow())
48 |             .setCoder(TableRowJsonCoder.of())
49 |             .apply("Write to BQ", writeToBQ(tableReference, fields))
50 |     pipeline.run()
51 | }
52 |
53 | private fun generateQuery(argMap: HashMap<String, String>): String {
54 |     return when (argMap["queryType"]) {
55 |         "daysAgo" -> {
56 |             logger.info("Query will return documents modified between dates provided as days ago")
57 |             queryModifiedBetweenDates(
58 |                     getDaysAgoDateAsLong(argMap["daysBeforeStart"]!!.toInt()),
59 |                     getDaysAgoDateAsLong(argMap["daysBeforeEnd"]!!.toInt()))
60 |         }
61 |         "betweenDates" -> {
62 |             logger.info("Query will return documents modified between dates provided as dates")
63 |             queryModifiedBetweenDates(
64 |                     getDateAsLong(argMap["beginDate"]!!),
65 |                     getDateAsLong(argMap["endDate"]!!))
66 |         }
67 |         "everything" -> {
68 |             logger.info("Query will return all documents without constraints")
69 |             queryAllBuilder()
70 |         }
71 |         "withSearchParam" -> {
72 |             logger.info("Query will return documents which meet provided condition")
73 |             queryBuilder(
74 |                     argMap["paramName"]!!,
75 |                     argMap["paramValue"]!!)
76 |         }
77 |         else -> {
78 |             logger.info("Query type not provided or invalid. Query will return documents modified yesterday")
79 |             queryModifiedBetweenDates(
80 |                     getDaysAgoDateAsLong(1),
81 |                     getDaysAgoDateAsLong(0))
82 |         }
83 |     }
84 | }
85 |
86 | private fun tableReference(argMap: HashMap<String, String>): TableReference? {
87 |     return TableReference()
88 |             .set("projectId", argMap["projectId"]!!)
89 |             .set("datasetId", argMap["datasetId"]!!)
90 |             .set("tableId", argMap["tableId"]!!)
91 | }
92 |
93 | private fun generateArgMap(args: Array<String>): HashMap<String, String> {
94 |     val optionMap: MutableMap<String, String> = HashMap()
95 |     val defaultArgMap: MutableMap<String, String> = HashMap()
96 |
97 |     populateDefaultArgMap(defaultArgMap)
98 |
99 |     for (arg in args) {
100 |         optionMap[arg.substringBefore('=').substringAfter("--")] = arg.substringAfter('=')
101 |     }
102 |
103 |     val builder = StringBuilder()
104 |     for (key in defaultArgMap.keys) {
105 |         if (!optionMap.contains(key)) {
106 |             builder.append("Mandatory option \"$key\" was not passed as argument. Default value: \"" + defaultArgMap[key] + "\" will be used instead.\n")
107 |             optionMap[key] = defaultArgMap[key].toString()
108 |         }
109 |     }
110 |
111 |     builder.append("Pipeline will be created with the following options:\n")
112 |     for (key in optionMap.keys) {
113 |         builder.append("\t$key: " + optionMap[key] + "\n")
114 |     }
115 |     logger.info(builder.toString())
116 |     return optionMap as HashMap<String, String>
117 | }
118 |
119 | private fun populateDefaultArgMap(defaultArgMap: MutableMap<String, String>) {
120 |     defaultArgMap["batchSize"] = "5000"
121 |     defaultArgMap["beginDate"] = "20190101"
122 |     defaultArgMap["connectTimeout"] = "5000"
123 |     defaultArgMap["datasetId"] = "datasetId"
124 |     defaultArgMap["daysBeforeEnd"] = "0"
125 |     defaultArgMap["daysBeforeStart"] = "1"
126 |     defaultArgMap["diskSizeGb"] = "100"
127 |     defaultArgMap["enableCloudDebugger"] = "true"
128 |     defaultArgMap["endDate"] = "20190102"
129 |     defaultArgMap["gcpTempLocation"] = "gs://dataPipelineFromElasticsearchToBigquery/gcpTempLocation/"
130 |     defaultArgMap["index"] = "elasticsearchIndex"
131 |     defaultArgMap["network"] = "gcp_network"
132 |     defaultArgMap["numWorkers"] = "1000"
133 |     defaultArgMap["paramName"] = "attributes.paramName.raw"
134 |     defaultArgMap["paramValue"] = "Zohar"
135 |     defaultArgMap["project"] = "gcpProject"
136 |     defaultArgMap["projectId"] = "gcpProjectId"
137 |     defaultArgMap["queryType"] = "yesterday"
138 |     defaultArgMap["region"] = "gcpRegion"
139 |     defaultArgMap["serviceAccount"] = "service@account.iam.gserviceaccount.com"
140 |     defaultArgMap["socketAndRetryTimeout"] = "90000"
141 |     defaultArgMap["source"] = "http://elasticsearch.data.source.com:9200"
142 |     defaultArgMap["subnetwork"] = "regions/gcpRegion/subnetworks/subNetwork"
143 |     defaultArgMap["tableId"] = "table"
144 |     defaultArgMap["tempLocation"] = "gs://dataPipelineFromElasticsearchToBigquery/tempLocation/"
145 |     defaultArgMap["type"] = "documentType"
146 |     defaultArgMap["usePublicIps"] = "false"
147 | }
148 |
149 | private fun setFields(): java.util.ArrayList<TableFieldSchema> {
150 |     return arrayListOf(
151 |             TableFieldSchema().setName("id").setType("INTEGER").setMode("REQUIRED"),
152 |             TableFieldSchema().setName("first_name").setType("STRING").setMode("REQUIRED"),
153 |             TableFieldSchema().setName("last_name").setType("STRING").setMode("REQUIRED"),
154 |             TableFieldSchema().setName("address").setType("STRING").setMode("REQUIRED"),
155 |             TableFieldSchema().setName("birthday").setType("TIMESTAMP").setMode("REQUIRED"),
156 |             TableFieldSchema().setName("person_json").setType("STRING").setMode("REQUIRED"),
157 |             TableFieldSchema().setName("created").setType("TIMESTAMP").setMode("NULLABLE"),
158 |             TableFieldSchema().setName("modified").setType("TIMESTAMP").setMode("NULLABLE")
159 |     )
160 | }
161 |
162 | private fun mapToTableRow() = MapElements.into(TypeDescriptor.of(TableRow::class.java))
163 |         .via(ContactStringToTableRow())
164 |
165 | private fun writeToBQ(tableReference: TableReference?, fields: ArrayList<TableFieldSchema>): BigQueryIO.Write<TableRow>? {
166 |     return BigQueryIO.writeTableRows()
167 |             .to(tableReference)
168 |             .optimizedWrites()
169 |             .withSchema(TableSchema().setFields(fields))
170 |             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
171 |             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
172 | }
173 |
174 | private fun readFromES(argMap: HashMap<String, String>, query: String): ElasticsearchIO.Read? {
175 |     return ElasticsearchIO.read()
176 |             .withConnectionConfiguration(ElasticsearchIO.ConnectionConfiguration
177 |                     .create(arrayOf(argMap["source"]), argMap["index"], argMap["type"])
178 |                     .withConnectTimeout(argMap["connectTimeout"]!!.toInt())
179 |                     .withSocketAndRetryTimeout(argMap["socketAndRetryTimeout"]!!.toInt()))
180 |             .withBatchSize(argMap["batchSize"]!!.toLong())
181 |             .withQuery(query)
182 | }
183 |
184 | private fun getDaysAgoDateAsLong(days: Int): Long {
185 |     return LocalDate.now().minusDays(days).toDateTimeAtStartOfDay(DateTimeZone.UTC).millis
186 | }
187 |
188 | private fun getDateAsLong(date: String): Long {
189 |     return DateTimeFormat.forPattern("YYYYMMdd").parseDateTime(date).millis
190 | }
191 |
192 | private fun createPipelineOptions(argMap: HashMap<String, String>): PipelineOptions {
193 |     val options = PipelineOptionsFactory.`as`(DataflowPipelineOptions::class.java)
194 |     options.project = argMap["project"]
195 |     options.tempLocation = argMap["tempLocation"]
196 |     options.gcpTempLocation = argMap["gcpTempLocation"]
197 |     options.serviceAccount = argMap["serviceAccount"]
198 |     options.region = argMap["region"]
199 |     options.network = argMap["network"]
200 |     options.subnetwork = argMap["subnetwork"]
201 |     options.usePublicIps = argMap["usePublicIps"]!!.toBoolean()
202 |     options.numWorkers = argMap["numWorkers"]!!.toInt()
203 |     options.diskSizeGb = argMap["diskSizeGb"]!!.toInt()
204 |
205 |     options.runner = DataflowRunner::class.java
206 |     if (argMap["enableCloudDebugger"]!!.toBoolean()) {
207 |         options.enableCloudDebugger = true // enable the Cloud Debugger flag on the Dataflow options
208 |     }
209 |     return options
210 | }
211 |
212 | private fun queryAllBuilder(): String {
213 |     return """
214 |         {
215 |           "query": {
216 |             "match_all" : {}
217 |           }
218 |         }
219 |     """.trimIndent()
220 | }
221 |
222 | private fun queryModifiedBetweenDates(beginDate: Long, endDate: Long): String {
223 |     return """
224 |         {
225 |           "query": {
226 |             "bool": {
227 |               "must": {
228 |                 "match_all": {}
229 |               },
230 |               "filter": {
231 |                 "bool": {
232 |                   "must": {
233 |                     "range": {
234 |                       "modified": {
235 |                         "from": $beginDate,
236 |                         "to": $endDate,
237 |                         "include_lower": true,
238 |                         "include_upper": false
239 |                       }
240 |                     }
241 |                   }
242 |                 }
243 |               }
244 |             }
245 |           }
246 |         }
247 |     """.trimIndent()
248 | }
249 |
250 | private fun queryBuilder(paramName: String, paramValue: String): String {
251 |     return """
252 |         {
253 |           "query": {
254 |             "bool": {
255 |               "must": [
256 |                 { "match": { "$paramName": "$paramValue" }}
257 |               ]
258 |             }
259 |           }
260 |         }
261 |     """.trimIndent()
262 | }
263 |
264 | class ContactStringToTableRow : SimpleFunction<String, TableRow>() {
265 |     override fun apply(input: String): TableRow {
266 |         val gson: Gson = GsonBuilder().create()
267 |         val parsedMap: Map<String, Any> = gson.fromJson(input, object : TypeToken<Map<String, Any>>() {}.type)
268 |         return TableRow()
269 |                 .set("id", parsedMap["id"].toString().toDouble().toLong())
270 |                 .set("first_name", parsedMap["first_name"].toString())
271 |                 .set("last_name", parsedMap["last_name"].toString())
272 |                 .set("address", parsedMap["address"].toString())
273 |                 .set("birthday", ISODateTimeFormat.dateTime().print((parsedMap["birthday"].toString().toDouble().toLong())))
274 |                 .set("person_json", input)
275 |                 .set("created", ISODateTimeFormat.dateTime().print((parsedMap["created"].toString().toDouble().toLong())))
276 |                 .set("modified", ISODateTimeFormat.dateTime().print((parsedMap["modified"].toString().toDouble().toLong())))
277 |     }
278 | }
279 |
--------------------------------------------------------------------------------
/src/main/python/dataPipelineFromElasticsearchToBigquery_dag.py:
--------------------------------------------------------------------------------
1 | from datetime import timedelta
2 |
3 | from datetime import datetime
4 | from airflow import DAG
5 | from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator
6 | from airflow.operators.dummy_operator import DummyOperator
7 | from dataPipelineFromElasticsearchToBigquery_options import options
8 |
9 |
10 | def merge_dicts(*dict_args):
11 |     result = {}
12 |     for dictionary in dict_args:
13 |         result.update(dictionary)
14 |     return result
15 |
16 |
17 | all_options = merge_dicts(
18 |     options['run_options'],
19 |     options['pipeline_options'],
20 |     options['gcp_options'],
21 |     options['es_options']
22 | )
23 |
24 | dag_args = {
25 |     'owner': 'Zohar',
26 |     'depends_on_past': False,
27 |     'start_date':
28 |         datetime(2019, 9, 1),
29 |     'email': ['dataPipelineFromElasticsearchToBigquery@company.com'],
30 |     'email_on_failure': False,
31 |     'email_on_retry': False,
32 |     'retries': 3,
33 |     'retry_delay': timedelta(minutes=5),
34 |     'dataflow_default_options': {
35 |         'project': 'gcpProject',
36 |         'zone': 'gcpNetworkZone',
37 |         'stagingLocation': 'gs://dataPipelineFromElasticsearchToBigquery/airflowStaging/'
38 |     }
39 | }
40 |
41 | dag = DAG('dataPipelineFromElasticsearchToBigquery-dag', default_args=dag_args, catchup=False)
42 |
43 | start = DummyOperator(task_id='start', dag=dag)
44 |
45 | task = DataFlowJavaOperator(
46 |     task_id='daily-dataPipelineFromElasticsearchToBigquery-task',
47 |     jar='gs://dataPipelineFromElasticsearchToBigquery/lib/dataPipelineFromElasticsearchToBigquery.jar',
48 |     options=all_options,
49 |     dag=dag)
50 |
51 | start >> task
52 |
--------------------------------------------------------------------------------
/src/main/python/dataPipelineFromElasticsearchToBigquery_options.py:
--------------------------------------------------------------------------------
1 | options = {
2 |     'run_options': {
3 |         'daysBeforeStart': '1',
4 |         'daysBeforeEnd': '0',
5 |         'paramName': 'attributes.paramName.raw',
6 |         'paramValue': 'Zohar',
7 |         'queryType': 'betweenDates',
8 |         'beginDate': '{{ yesterday_ds_nodash }}',
9 |         'endDate': '{{ ds_nodash }}'
10 |     },
11 |     'pipeline_options': {
12 |         'autoscalingAlgorithm': 'BASIC',
13 |         'partitionType': 'DAY',
14 |         'project': 'gcpProject',
15 |         'tempLocation': 'gs://dataPipelineFromElasticsearchToBigquery/tempLocation/',
16 |         'gcpTempLocation': 'gs://dataPipelineFromElasticsearchToBigquery/gcpTempLocation/',
17 |         'serviceAccount': 'service@account.iam.gserviceaccount.com',
18 |         'region': 'gcpRegion',
19 |         'network': 'gcpNetwork',
20 |         'subnetwork': 'regions/gcpRegion/subnetworks/subNetwork',
21 |         'usePublicIps': 'false',
22 |         'numWorkers': '1000',
23 |         'diskSizeGb': '100',
24 |         'enableCloudDebugger': 'true'
25 |     },
26 |     'gcp_options': {
27 |         'datasetId': 'datasetId',
28 |         'projectId': 'gcpProjectId',
29 |         'tableId': 'table'
30 |     },
31 |     'es_options': {
32 |         'connectTimeout': '5000',
33 |         'index': 'elasticsearchIndex',
34 |         'source': 'http://elasticsearch.data.source.com:9200',
35 |         'type': 'documentType',
36 |         'socketAndRetryTimeout': '90000',
37 |         'batchSize': '5000'
38 |     }
39 | }
40 |
--------------------------------------------------------------------------------
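
A note on running the DAG above: once the DAG file and its options module are on the scheduler's DAG path and the jar is in the GCS location it references, the DAG can be triggered manually for a test run. The commands below are a sketch assuming the Airflow 1.x CLI, matching the airflow.contrib import used in the DAG; the command names differ in Airflow 2.

```bash
# Confirm the DAG was parsed, then start a manual run (Airflow 1.x CLI).
airflow list_dags
airflow trigger_dag dataPipelineFromElasticsearchToBigquery-dag
```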