├── CloudFunctions_Dataprep.png
├── README.md
├── _config.yml
├── export_import_dataprep_flow.py
├── gcs_trigger_dataprep_job.py
├── job-result-google-bigquery.py
├── job-result-google-sheet.js
├── publishing_googlesheet.js
└── trifactalogo.png

/CloudFunctions_Dataprep.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/victorcouste/google-cloudfunctions-dataprep/9d869c42edd53865bcf4e55ff371e82ee88b6473/CloudFunctions_Dataprep.png
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Google Cloud Functions for Cloud Dataprep

[Google Cloud Functions](https://cloud.google.com/functions) examples for [Cloud Dataprep](https://cloud.google.com/dataprep)

- **[gcs_trigger_dataprep_job.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/gcs_trigger_dataprep_job.py)** : Background Python function that triggers a Dataprep job when a file is created in a Google Cloud Storage bucket folder. The Dataprep job is started with a REST API call that passes the new file name as a parameter. Implementation details in the blog post [How to Automate a Cloud Dataprep Pipeline When a File Arrives](https://medium.com/google-cloud/how-to-automate-a-cloud-dataprep-pipeline-when-a-file-arrives-9b85f2745a09).

- **[job-result-google-sheet.js](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/job-result-google-sheet.js)** : HTTP Node.js function that writes Dataprep job result information (id, status) to a Google Sheet, together with the recipe name, a link to the job page and a link to the PDF of the result profile. This HTTP Cloud Function is called from a Dataprep Webhook when a job finishes (success or failure); see the example webhook payload after this list. Implementation details in the blog post [Leverage Cloud Functions and APIs to Monitor Cloud Dataprep Jobs Status in a Google Sheet](https://towardsdatascience.com/leverage-cloud-functions-and-apis-to-monitor-cloud-dataprep-jobs-status-in-a-google-sheet-b412ee2b9acc).

- **[publishing_googlesheet.js](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/publishing_googlesheet.js)** : HTTP Node.js function that publishes a Dataprep output to a Google Sheet. The sheet name is built from the default single CSV file name generated in GCS plus the Dataprep job id. In the Cloud Function code, you need to update your [Dataprep Access Token](https://docs.trifacta.com/display/DP/Access+Tokens+Page) (used to call the REST API) and the [Google Spreadsheet ID](https://developers.google.com/sheets/api/guides/concepts#spreadsheet_id). This Cloud Function can be triggered when a Dataprep job finishes via a [Dataprep Webhook](https://docs.trifacta.com/display/DP/Create+Flow+Webhook+Task).

- **[job-result-google-bigquery.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/job-result-google-bigquery.py)** : HTTP Python function that writes Dataprep job result information (id, status) to a Google BigQuery table, together with the output dataset name (recipe name), the Google user and a link to the job page. This HTTP Cloud Function is called from a Dataprep Webhook when a job finishes (success or failure). Implementation details in the blog post [Monitor your BigQuery Data Warehouse Dataprep Pipeline with Data Studio](https://medium.com/google-cloud/monitor-your-bigquery-data-warehouse-dataprep-pipeline-with-data-studio-8e46b2beda1).

- **[export_import_dataprep_flow.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/export_import_dataprep_flow.py)** : HTTP Python function to export a Dataprep flow from one project and import it into another project, with the option to save the flow package (zip file) to, or read it from, a GCS bucket folder.

- **[Update Google Cloud Data Catalog](https://victorcouste.github.io/google-data-catalog-dataprep/)** : A Cloud Function to create or update Google Cloud Data Catalog tags on BigQuery tables with Cloud Dataprep metadata and column profiles.

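
For reference, the two job-monitoring functions above expect a small JSON body from the Dataprep Webhook that calls them. Below is a minimal sketch of simulating such a call in Python; the Cloud Function URLs are placeholders for your own deployments, the status value is illustrative, and the field names are the ones the code reads:

```python
import requests

# Placeholder trigger URLs of your deployed HTTP Cloud Functions.
BIGQUERY_FN_URL = "https://REGION-PROJECT.cloudfunctions.net/publish_bigquery"
SHEET_FN_URL = "https://REGION-PROJECT.cloudfunctions.net/jobresultgsheet"

# job-result-google-bigquery.py reads 'job_id' and 'job_status' from the JSON body.
requests.post(BIGQUERY_FN_URL, json={"job_id": "1234567", "job_status": "Complete"})

# job-result-google-sheet.js reads 'jobid' and 'jobstatus' from the request body.
requests.post(SHEET_FN_URL, json={"jobid": "1234567", "jobstatus": "Complete"})
```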


Google Cloud Functions [https://cloud.google.com/functions](https://cloud.google.com/functions)

Cloud Dataprep by Trifacta [https://cloud.google.com/dataprep](https://cloud.google.com/dataprep)

Cloud Dataprep Standard API [https://api.trifacta.com/dataprep-standard](https://api.trifacta.com/dataprep-standard)

Cloud Dataprep Premium API [https://api.trifacta.com/dataprep-premium](https://api.trifacta.com/dataprep-premium)
--------------------------------------------------------------------------------

/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-minimal
--------------------------------------------------------------------------------

/export_import_dataprep_flow.py:
--------------------------------------------------------------------------------
import requests
import json
from google.cloud import storage

def import_export_dataprep_flow(request):
    """Responds to any HTTP request.
    Args:
        request (flask.Request): HTTP request object.
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`.
    """
    request_json = request.get_json()
    if request_json and 'flowid' in request_json:
        dataprep_flowid = request_json['flowid']
    else:
        return 'No FlowId to export'

    #dataprep_flowid=9999999

    print('FlowId {} to export/import'.format(dataprep_flowid))

    # Access token of the source project, used to export the flow
    dataprep_export_auth_token = 'xxxxxxxxx'
    dataprep_exportflow_endpoint = 'https://api.clouddataprep.com/v4/flows/{}/package'.format(dataprep_flowid)
    dataprep_exportflow_headers = {"Authorization": "Bearer " + dataprep_export_auth_token}

    resp_export = requests.get(
        url=dataprep_exportflow_endpoint,
        headers=dataprep_exportflow_headers
    )
    print('Export Flow Status Code : {}'.format(resp_export.status_code))

    # Option to save the flow package in a GCS folder
    flowfile_path = "flows/flow_{}.zip".format(dataprep_flowid)
    storage_client = storage.Client()
    bucket = storage_client.bucket("dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce")
    blob = bucket.blob(flowfile_path)
    blob.upload_from_string(resp_export.content, content_type="application/zip")

    # Option to get the flow package from a GCS folder
    #flowfile = blob.download_as_string()

    # Get the flow package directly from the export
    flowfile = resp_export.content

    # Access token of the target project, used to import the flow
    dataprep_import_auth_token = 'yyyyyyy'
    dataprep_importflow_endpoint = 'https://api.clouddataprep.com/v4/flows/package'
    dataprep_importflow_headers = {"Authorization": "Bearer " + dataprep_import_auth_token}
    dataprep_importflow_files = {"archive": ("flow.zip", flowfile)}

    resp_import = requests.post(
        url=dataprep_importflow_endpoint,
        headers=dataprep_importflow_headers,
        files=dataprep_importflow_files
    )

    print('Import flow Status Code : {}'.format(resp_import.status_code))
    print('Result Import: {}'.format(resp_import.json()))

    return 'FlowId {} export/import'.format(dataprep_flowid)
--------------------------------------------------------------------------------
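
The export/import function expects the flow to copy in a `flowid` field of the JSON request body. A minimal sketch of invoking it once it is deployed as an HTTP Cloud Function; the URL and flow id are placeholders:

```python
import requests

# Placeholder trigger URL of the deployed import_export_dataprep_flow function.
FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/import_export_dataprep_flow"

# The function reads 'flowid' from the JSON body, exports that flow with the
# export token and imports the package with the import token configured in the code.
resp = requests.post(FUNCTION_URL, json={"flowid": 9999999})
print(resp.status_code, resp.text)
```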

/gcs_trigger_dataprep_job.py:
--------------------------------------------------------------------------------
import os
import requests
import json

def dataprep_job_gcs_trigger(event, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    Args:
        event (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event."""

    head_tail = os.path.split(event['name'])
    newfilename = head_tail[1]
    newfilepath = head_tail[0]

    # Dataprep access token and id of the recipe (wrangled dataset) to run
    dataprep_auth_token = 'xxxxxxxxxxxxxxx'
    dataprep_jobid = 99999999

    # Only react to new files created in the 'landingzone' folder
    if context.event_type == 'google.storage.object.finalize' and newfilepath == 'landingzone':

        print('Run Dataprep job on new file: {}'.format(newfilename))

        dataprep_runjob_endpoint = 'https://api.clouddataprep.com/v4/jobGroups'
        dataprep_job_param = {
            "wrangledDataset": {"id": dataprep_jobid},
            "runParameters": {"overrides": {"data": [{"key": "FileName", "value": newfilename}]}}
        }
        print('Run Dataprep job param: {}'.format(dataprep_job_param))
        dataprep_headers = {
            "Content-Type": "application/json",
            "Authorization": "Bearer " + dataprep_auth_token
        }

        resp = requests.post(
            url=dataprep_runjob_endpoint,
            headers=dataprep_headers,
            data=json.dumps(dataprep_job_param)
        )

        print('Status Code : {}'.format(resp.status_code))
        print('Result : {}'.format(resp.json()))

    return 'End File event for {}'.format(newfilename)
--------------------------------------------------------------------------------
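
One way to check the trigger logic without deploying it is to call `dataprep_job_gcs_trigger` with a hand-built event and context that mimic a `google.storage.object.finalize` notification. A minimal sketch; the file name is made up, and the call will hit the Dataprep API with whatever token and job id are configured in the code:

```python
from types import SimpleNamespace

import gcs_trigger_dataprep_job

# Fake 'finalize' event for a file landing in the 'landingzone' folder of the bucket.
event = {"name": "landingzone/sales_2020.csv"}
context = SimpleNamespace(event_type="google.storage.object.finalize")

print(gcs_trigger_dataprep_job.dataprep_job_gcs_trigger(event, context))
```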

/job-result-google-bigquery.py:
--------------------------------------------------------------------------------
import requests
import json
from google.cloud import bigquery
from datetime import datetime

def publish_bigquery(request):

    request_json = request.get_json()
    if request_json and 'job_id' in request_json:
        job_id = request_json['job_id']
        job_status = request_json['job_status']
    else:
        return 'No Job Id to publish'

    dataprep_auth_token = 'xxxxxxxx'
    dataprep_headers = {"Authorization": "Bearer " + dataprep_auth_token}

    print('Dataprep Job ID {} and Status {}'.format(job_id, job_status))

    job_url = "https://clouddataprep.com/jobs/" + job_id
    job_result_profile = "https://clouddataprep.com/v4/jobGroups/" + job_id + "/pdfResults"

    dataprep_job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/" + job_id + "?embed=wrangledDataset.recipe,creator,jobs"

    resp = requests.get(
        url=dataprep_job_endpoint,
        headers=dataprep_headers
    )
    job_object = resp.json()
    print('Status Code Get Job: {}'.format(resp.status_code))
    #print('Result : {}'.format(job_object))

    output_name = job_object["wrangledDataset"]["recipe"]["name"]
    print('Output Name : {}'.format(output_name))

    user = job_object["creator"]["email"]
    print('User : {}'.format(user))

    createdAt = job_object["jobs"]["data"][0]["createdAt"]

    # Find the "wrangle" job type, executed with Dataflow
    for job in job_object["jobs"]["data"]:
        if job["jobType"] == "wrangle":
            dataflow_jobid = job["cpJobId"]
            print('Dataflow jobId : {}'.format(dataflow_jobid))
            # Datetime of the last job
            updatedAt = job["updatedAt"]

    start_job = datetime.strptime(createdAt, "%Y-%m-%dT%H:%M:%S.000Z")
    end_job = datetime.strptime(updatedAt, "%Y-%m-%dT%H:%M:%S.000Z")
    job_duration = (end_job - start_job)
    print('Duration : {}'.format(job_duration))

    datetime_string = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

    # Instantiates a client
    bigquery_client = bigquery.Client()

    # Prepares a reference to the dataset
    dataset_ref = bigquery_client.dataset('default')

    table_ref = dataset_ref.table('dataprep_jobs')
    table = bigquery_client.get_table(table_ref)  # API call
    row_to_insert = [{
        "job_run_date": datetime_string,
        "job_id": int(job_id),
        "output_name": output_name,
        "job_status": job_status,
        "job_url": job_url,
        "user": user,
        "dataflow_job_id": dataflow_jobid,
        "job_duration": str(job_duration)
    }]
    errors = bigquery_client.insert_rows(table, row_to_insert)  # API request
    assert errors == []

    return 'JobId {} - {} - {} published in BigQuery'.format(job_id, job_status, output_name)
--------------------------------------------------------------------------------
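
job-result-google-bigquery.py assumes a `default.dataprep_jobs` table already exists. A minimal sketch of creating it with the BigQuery Python client, with a schema derived from the row the function inserts; the project id is a placeholder and the column types are an assumption:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Columns match the row built in publish_bigquery; the types are assumed.
schema = [
    bigquery.SchemaField("job_run_date", "DATETIME"),
    bigquery.SchemaField("job_id", "INTEGER"),
    bigquery.SchemaField("output_name", "STRING"),
    bigquery.SchemaField("job_status", "STRING"),
    bigquery.SchemaField("job_url", "STRING"),
    bigquery.SchemaField("user", "STRING"),
    bigquery.SchemaField("dataflow_job_id", "STRING"),
    bigquery.SchemaField("job_duration", "STRING"),
]

table = bigquery.Table("your-project.default.dataprep_jobs", schema=schema)
client.create_table(table)
```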

/job-result-google-sheet.js:
--------------------------------------------------------------------------------
const {google} = require('googleapis');
const request = require('sync-request');

exports.jobresultgsheet = async (req, res) => {

  var jobID = req.body.jobid;
  var jobStatus = req.body.jobstatus;

  var jobURL = "https://clouddataprep.com/jobs/" + jobID;

  // Spreadsheet HYPERLINK formula pointing to the PDF results profile of the job
  var jobProfileFormula = '=HYPERLINK("https://clouddataprep.com/v4/jobGroups/' + jobID + '/pdfResults","Profile PDF")';

  var DataprepToken = "eyJhbGciOiJSUzI.................7VQLSPH3mteFmQfOPBCrJPqGWErQ";

  // ------------------ GET DATAPREP JOB OBJECT --------------------------------

  var job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/" + jobID + "?embed=wrangledDataset";

  var res_job = request('GET', job_endpoint, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + DataprepToken
    },
  });
  var jsonjob = JSON.parse(res_job.getBody());
  var recipeID = jsonjob.wrangledDataset.id;
  console.log("Recipe ID : " + recipeID);

  // ------------------ GET DATAPREP RECIPE OBJECT --------------------------------

  var recipe_endpoint = "https://api.clouddataprep.com/v4/wrangledDatasets/" + recipeID;

  var res_recipe = request('GET', recipe_endpoint, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + DataprepToken
    },
  });
  var jsonrecipe = JSON.parse(res_recipe.getBody());
  var recipeName = jsonrecipe.name;
  console.log("Recipe Name : " + recipeName);

  // ------------------ ADD ALL RESULTS TO A GOOGLE SHEET --------------------------------

  // block on auth + getting the sheets API object
  const auth = await google.auth.getClient({
    scopes: ["https://www.googleapis.com/auth/spreadsheets"]
  });
  const sheetsAPI = await google.sheets({ version: "v4", auth });
  const JobSheetId = "1X63lT7...........VbwiDN0wm3SKx-Ro";

  sheetsAPI.spreadsheets.values.append({
    key: "AIza............0qu8qlXUA",
    spreadsheetId: JobSheetId,
    range: 'A1:F1',
    valueInputOption: 'USER_ENTERED',
    insertDataOption: 'INSERT_ROWS',
    resource: {
      values: [
        [new Date().toISOString().replace('T', ' ').substr(0, 19), jobID, recipeName, jobStatus, jobURL, jobProfileFormula]
      ],
    },
  }, (err, response) => {
    if (err) res.send(err)
  })
  res.status(200).send("job " + jobID + " " + jobStatus);
  console.log("job " + jobID + " " + jobStatus);
}
--------------------------------------------------------------------------------

/publishing_googlesheet.js:
--------------------------------------------------------------------------------
const request = require('then-request');
const {google} = require('googleapis');
const {Storage} = require("@google-cloud/storage");

exports.publish_gsheet = async (req, res) => {

  const DataprepJobID = req.body.jobid;

  console.log("DataprepJobID : " + DataprepJobID);

  const spreadsheetId = "1WiGd.........4tuoc";

  const DataprepToken = "eyJhbGc........bcOwTQ";

  // block on auth + getting the sheets API object
  const auth = await google.auth.getClient({
    scopes: [
      "https://www.googleapis.com/auth/spreadsheets",
      "https://www.googleapis.com/auth/devstorage.read_only"
    ]
  });
  const sheetsAPI = google.sheets({version: 'v4', auth});

  // ------------------ GET DATAPREP JOB AND CSV FILE NAME GENERATED IN GCS --------------------------------

  const dataprep_job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/" + DataprepJobID + "?embed=jobs.fileWriterJob.writeSetting";

  var res_job = await request('GET', dataprep_job_endpoint, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + DataprepToken
    },
  });

  const jsonresult = JSON.parse(res_job.getBody());

  var outputFileURI = "";
  for (key in jsonresult.jobs.data) {
    if (jsonresult.jobs.data[key].jobType == "filewriter") {
      outputFileURI = jsonresult.jobs.data[key].writeSetting.path;
    }
  }

  //gs://dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce/vcoustenoble@trifacta.com/jobrun/Sales_Data_small.csv
  //console.log("outputFileURI : "+outputFileURI);

  const outputFilepathArray = outputFileURI.split('/');

  const outputBucket = outputFilepathArray[2];
  console.log("Bucket : " + outputBucket);

  var outputFilepath = '';
  for (key in outputFilepathArray) {
    if (key > 2) {
      outputFilepath = outputFilepath + outputFilepathArray[key] + '/';
    }
  }
  outputFilepath = outputFilepath.slice(0, -1);
  console.log("Output Filepath : " + outputFilepath);

  const filename = outputFilepathArray.slice(-1).toString();
  //console.log("Filename : "+filename);
  const sheetName = filename.slice(0, -4) + "_" + DataprepJobID;
  console.log("Sheet Name : " + sheetName);

  const FileData = await readCSVContent(outputBucket, outputFilepath);

  const sheetid = await createEmptySheet(sheetName, spreadsheetId);
  await populateAndStyle(FileData, sheetid, spreadsheetId);

  res.send(`Spreadsheet ${sheetName} created`);

  // ------------------ READ CSV FILE CONTENT FROM GCS --------------------------------

  function readCSVContent(mybucket, myfilepath) {
    return new Promise((resolve, reject) => {
      const storage = new Storage();
      const bucket = storage.bucket(mybucket);
      const file = bucket.file(myfilepath);

      let fileContents = Buffer.from('');

      file.createReadStream()
        .on('error', function(err) {
          reject('The Storage API returned an error: ' + err);
        })
        .on('data', function(chunk) {
          fileContents = Buffer.concat([fileContents, chunk]);
        })
        .on('end', function() {
          let content = fileContents.toString('utf8');
          //console.log("CSV content read as string : " + content );
          resolve(content);
        });
    });
  }

  // ------------------ CREATE EMPTY NEW SHEET --------------------------------

  function createEmptySheet(MySheetName, Myspreadsheetid) {
    return new Promise((resolve, reject) => {

      const emptySheetParams = {
        spreadsheetId: Myspreadsheetid,
        resource: {
          requests: [
            {
              addSheet: {
                properties: {
                  title: MySheetName,
                  index: 1,
                  gridProperties: {
                    rowCount: 10,
                    columnCount: 10,
                    frozenRowCount: 1
                  }
                }
              }
            }
          ]
        }
      };
      sheetsAPI.spreadsheets.batchUpdate(emptySheetParams, function(err, response) {
        if (err) {
          reject("The Sheets API returned an error: " + err);
        } else {
          const sheetId = response.data.replies[0].addSheet.properties.sheetId;
          console.log("Created empty sheet: " + sheetId);
          resolve(sheetId);
        }
      });
    });
  }

  // ------------------ WRITE DATA IN THE NEW EMPTY SHEET --------------------------------

  function populateAndStyle(FileData, MySheetId, MySpreadsheetId) {
    return new Promise((resolve, reject) => {
      // Using 'batchUpdate' allows for multiple 'requests' to be sent in a single batch.
      // Populate the sheet referenced by its ID with the data received (a CSV string)
      // Style: set first row font size to 11 and to Bold. Exercise left for the reader: resize columns
      const dataAndStyle = {
        spreadsheetId: MySpreadsheetId,
        resource: {
          requests: [
            {
              pasteData: {
                coordinate: {
                  sheetId: MySheetId,
                  rowIndex: 0,
                  columnIndex: 0
                },
                data: FileData,
                delimiter: ","
              }
            },
            {
              repeatCell: {
                range: {
                  sheetId: MySheetId,
                  startRowIndex: 0,
                  endRowIndex: 1
                },
                cell: {
                  userEnteredFormat: {
                    textFormat: {
                      fontSize: 11,
                      bold: true
                    }
                  }
                },
                fields: "userEnteredFormat(textFormat)"
              }
            }
          ]
        }
      };

      sheetsAPI.spreadsheets.batchUpdate(dataAndStyle, function(err, response) {
        if (err) {
          reject("The Sheets API returned an error: " + err);
        } else {
          console.log(MySheetId + " sheet populated with " + FileData.length + " characters of CSV data and column style set.");
          resolve();
        }
      });
    });
  }

}
--------------------------------------------------------------------------------

/trifactalogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/victorcouste/google-cloudfunctions-dataprep/9d869c42edd53865bcf4e55ff371e82ee88b6473/trifactalogo.png
--------------------------------------------------------------------------------