├── CloudFunctions_Dataprep.png
├── README.md
├── _config.yml
├── export_import_dataprep_flow.py
├── gcs_trigger_dataprep_job.py
├── job-result-google-bigquery.py
├── job-result-google-sheet.js
├── publishing_googlesheet.js
└── trifactalogo.png

/CloudFunctions_Dataprep.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/victorcouste/google-cloudfunctions-dataprep/9d869c42edd53865bcf4e55ff371e82ee88b6473/CloudFunctions_Dataprep.png
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Google Cloud Functions for Cloud Dataprep

[Google Cloud Functions](https://cloud.google.com/functions) examples for [Cloud Dataprep](https://cloud.google.com/dataprep)

- **[gcs_trigger_dataprep_job.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/gcs_trigger_dataprep_job.py)** : Background Python function that triggers a Dataprep job when a file is created in a Google Cloud Storage bucket folder. The Dataprep job is started with a REST API call that passes the new file name as a parameter. Implementation details in the blog post [How to Automate a Cloud Dataprep Pipeline When a File Arrives](https://medium.com/google-cloud/how-to-automate-a-cloud-dataprep-pipeline-when-a-file-arrives-9b85f2745a09).

- **[job-result-google-sheet.js](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/job-result-google-sheet.js)** : HTTP Node.js function that writes Dataprep job result information (id, status) to a Google Sheet, together with the recipe name, a link to the job page and a link to the PDF of the result profile. This HTTP Cloud Function is called from a Dataprep Webhook when a job finishes (success or failure); see the example webhook payload after this list. Implementation details in the blog post [Leverage Cloud Functions and APIs to Monitor Cloud Dataprep Jobs Status in a Google Sheet](https://towardsdatascience.com/leverage-cloud-functions-and-apis-to-monitor-cloud-dataprep-jobs-status-in-a-google-sheet-b412ee2b9acc).

- **[publishing_googlesheet.js](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/publishing_googlesheet.js)** : HTTP Node.js function that publishes a Dataprep output to a Google Sheet. The sheet name is built from the default single CSV file name generated in GCS plus the Dataprep job id. In the Cloud Function code, you need to update your [Dataprep Access Token](https://docs.trifacta.com/display/DP/Access+Tokens+Page) (used to call the REST API) and the [Google Spreadsheet ID](https://developers.google.com/sheets/api/guides/concepts#spreadsheet_id). This Cloud Function can be triggered when a Dataprep job finishes via a [Dataprep Webhook](https://docs.trifacta.com/display/DP/Create+Flow+Webhook+Task).

- **[job-result-google-bigquery.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/job-result-google-bigquery.py)** : HTTP Python function that writes Dataprep job result information (id, status) to a Google BigQuery table, together with the output dataset name (recipe name), the Google user and a link to the job page. This HTTP Cloud Function is called from a Dataprep Webhook when a job finishes (success or failure). Implementation details in the blog post [Monitor your BigQuery Data Warehouse Dataprep Pipeline with Data Studio](https://medium.com/google-cloud/monitor-your-bigquery-data-warehouse-dataprep-pipeline-with-data-studio-8e46b2beda1).

- **[export_import_dataprep_flow.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/export_import_dataprep_flow.py)** : HTTP Python function to export a Dataprep flow from one project and import it into another project, with the option to save the flow package (zip file) to, or read it from, a GCS bucket folder.

- **[Update Google Cloud Data Catalog](https://victorcouste.github.io/google-data-catalog-dataprep/)** : A Cloud Function to create or update Google Cloud Data Catalog tags on BigQuery tables with Cloud Dataprep metadata and column profiles.

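
For reference, the two job-monitoring functions above expect a small JSON body from the Dataprep Webhook that calls them. Below is a minimal sketch of simulating such a call in Python; the Cloud Function URLs are placeholders for your own deployments, the status value is illustrative, and the field names are the ones the code reads:

```python
import requests

# Placeholder trigger URLs of your deployed HTTP Cloud Functions.
BIGQUERY_FN_URL = "https://REGION-PROJECT.cloudfunctions.net/publish_bigquery"
SHEET_FN_URL = "https://REGION-PROJECT.cloudfunctions.net/jobresultgsheet"

# job-result-google-bigquery.py reads 'job_id' and 'job_status' from the JSON body.
requests.post(BIGQUERY_FN_URL, json={"job_id": "1234567", "job_status": "Complete"})

# job-result-google-sheet.js reads 'jobid' and 'jobstatus' from the request body.
requests.post(SHEET_FN_URL, json={"jobid": "1234567", "jobstatus": "Complete"})
```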


Google Cloud Functions [https://cloud.google.com/functions](https://cloud.google.com/functions)

Cloud Dataprep by Trifacta [https://cloud.google.com/dataprep](https://cloud.google.com/dataprep)

Cloud Dataprep Standard API [https://api.trifacta.com/dataprep-standard](https://api.trifacta.com/dataprep-standard)

Cloud Dataprep Premium API [https://api.trifacta.com/dataprep-premium](https://api.trifacta.com/dataprep-premium)
--------------------------------------------------------------------------------

/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-minimal
--------------------------------------------------------------------------------

/export_import_dataprep_flow.py:
--------------------------------------------------------------------------------
import requests
import json
from google.cloud import storage

def import_export_dataprep_flow(request):
    """Responds to any HTTP request.
    Args:
        request (flask.Request): HTTP request object.
    Returns:
        The response text, or any set of values that can be turned into a
        Response object using `make_response`.
    """
    request_json = request.get_json()
    if request_json and 'flowid' in request_json:
        dataprep_flowid = request_json['flowid']
    else:
        return 'No FlowId to export'

    #dataprep_flowid=9999999

    print('FlowId {} to export/import'.format(dataprep_flowid))

    # Access token of the source project, used to export the flow
    dataprep_export_auth_token = 'xxxxxxxxx'
    dataprep_exportflow_endpoint = 'https://api.clouddataprep.com/v4/flows/{}/package'.format(dataprep_flowid)
    dataprep_exportflow_headers = {"Authorization": "Bearer " + dataprep_export_auth_token}

    resp_export = requests.get(
        url=dataprep_exportflow_endpoint,
        headers=dataprep_exportflow_headers
    )
    print('Export Flow Status Code : {}'.format(resp_export.status_code))

    # Option to save the flow package in a GCS folder
    flowfile_path = "flows/flow_{}.zip".format(dataprep_flowid)
    storage_client = storage.Client()
    bucket = storage_client.bucket("dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce")
    blob = bucket.blob(flowfile_path)
    blob.upload_from_string(resp_export.content, content_type="application/zip")

    # Option to get the flow package from a GCS folder
    #flowfile = blob.download_as_string()

    # Get the flow package directly from the export
    flowfile = resp_export.content

    # Access token of the target project, used to import the flow
    dataprep_import_auth_token = 'yyyyyyy'
    dataprep_importflow_endpoint = 'https://api.clouddataprep.com/v4/flows/package'
    dataprep_importflow_headers = {"Authorization": "Bearer " + dataprep_import_auth_token}
    dataprep_importflow_files = {"archive": ("flow.zip", flowfile)}

    resp_import = requests.post(
        url=dataprep_importflow_endpoint,
        headers=dataprep_importflow_headers,
        files=dataprep_importflow_files
    )

    print('Import flow Status Code : {}'.format(resp_import.status_code))
    print('Result Import: {}'.format(resp_import.json()))

    return 'FlowId {} export/import'.format(dataprep_flowid)
--------------------------------------------------------------------------------
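
The export/import function expects the flow to copy in a `flowid` field of the JSON request body. A minimal sketch of invoking it once it is deployed as an HTTP Cloud Function; the URL and flow id are placeholders:

```python
import requests

# Placeholder trigger URL of the deployed import_export_dataprep_flow function.
FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/import_export_dataprep_flow"

# The function reads 'flowid' from the JSON body, exports that flow with the
# export token and imports the package with the import token configured in the code.
resp = requests.post(FUNCTION_URL, json={"flowid": 9999999})
print(resp.status_code, resp.text)
```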

/gcs_trigger_dataprep_job.py:
--------------------------------------------------------------------------------
import os
import requests
import json

def dataprep_job_gcs_trigger(event, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    Args:
        event (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event."""

    head_tail = os.path.split(event['name'])
    newfilename = head_tail[1]
    newfilepath = head_tail[0]

    # Dataprep access token and id of the recipe (wrangled dataset) to run
    dataprep_auth_token = 'xxxxxxxxxxxxxxx'
    dataprep_jobid = 99999999

    # Only react to new files created in the 'landingzone' folder
    if context.event_type == 'google.storage.object.finalize' and newfilepath == 'landingzone':

        print('Run Dataprep job on new file: {}'.format(newfilename))

        dataprep_runjob_endpoint = 'https://api.clouddataprep.com/v4/jobGroups'
        dataprep_job_param = {
            "wrangledDataset": {"id": dataprep_jobid},
            "runParameters": {"overrides": {"data": [{"key": "FileName", "value": newfilename}]}}
        }
        print('Run Dataprep job param: {}'.format(dataprep_job_param))
        dataprep_headers = {
            "Content-Type": "application/json",
            "Authorization": "Bearer " + dataprep_auth_token
        }

        resp = requests.post(
            url=dataprep_runjob_endpoint,
            headers=dataprep_headers,
            data=json.dumps(dataprep_job_param)
        )

        print('Status Code : {}'.format(resp.status_code))
        print('Result : {}'.format(resp.json()))

    return 'End File event for {}'.format(newfilename)
--------------------------------------------------------------------------------
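
One way to check the trigger logic without deploying it is to call `dataprep_job_gcs_trigger` with a hand-built event and context that mimic a `google.storage.object.finalize` notification. A minimal sketch; the file name is made up, and the call will hit the Dataprep API with whatever token and job id are configured in the code:

```python
from types import SimpleNamespace

import gcs_trigger_dataprep_job

# Fake 'finalize' event for a file landing in the 'landingzone' folder of the bucket.
event = {"name": "landingzone/sales_2020.csv"}
context = SimpleNamespace(event_type="google.storage.object.finalize")

print(gcs_trigger_dataprep_job.dataprep_job_gcs_trigger(event, context))
```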

/job-result-google-bigquery.py:
--------------------------------------------------------------------------------
import requests
import json
from google.cloud import bigquery
from datetime import datetime

def publish_bigquery(request):

    request_json = request.get_json()
    if request_json and 'job_id' in request_json:
        job_id = request_json['job_id']
        job_status = request_json['job_status']
    else:
        return 'No Job Id to publish'

    dataprep_auth_token = 'xxxxxxxx'
    dataprep_headers = {"Authorization": "Bearer " + dataprep_auth_token}

    print('Dataprep Job ID {} and Status {}'.format(job_id, job_status))

    job_url = "https://clouddataprep.com/jobs/" + job_id
    job_result_profile = "https://clouddataprep.com/v4/jobGroups/" + job_id + "/pdfResults"

    dataprep_job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/" + job_id + "?embed=wrangledDataset.recipe,creator,jobs"

    resp = requests.get(
        url=dataprep_job_endpoint,
        headers=dataprep_headers
    )
    job_object = resp.json()
    print('Status Code Get Job: {}'.format(resp.status_code))
    #print('Result : {}'.format(job_object))

    output_name = job_object["wrangledDataset"]["recipe"]["name"]
    print('Output Name : {}'.format(output_name))

    user = job_object["creator"]["email"]
    print('User : {}'.format(user))

    createdAt = job_object["jobs"]["data"][0]["createdAt"]

    # Find the "wrangle" job type, executed with Dataflow
    for job in job_object["jobs"]["data"]:
        if job["jobType"] == "wrangle":
            dataflow_jobid = job["cpJobId"]
            print('Dataflow jobId : {}'.format(dataflow_jobid))
            # Datetime of the last job
            updatedAt = job["updatedAt"]

    start_job = datetime.strptime(createdAt, "%Y-%m-%dT%H:%M:%S.000Z")
    end_job = datetime.strptime(updatedAt, "%Y-%m-%dT%H:%M:%S.000Z")
    job_duration = (end_job - start_job)
    print('Duration : {}'.format(job_duration))

    datetime_string = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

    # Instantiates a client
    bigquery_client = bigquery.Client()

    # Prepares a reference to the dataset
    dataset_ref = bigquery_client.dataset('default')

    table_ref = dataset_ref.table('dataprep_jobs')
    table = bigquery_client.get_table(table_ref)  # API call
    row_to_insert = [{
        "job_run_date": datetime_string,
        "job_id": int(job_id),
        "output_name": output_name,
        "job_status": job_status,
        "job_url": job_url,
        "user": user,
        "dataflow_job_id": dataflow_jobid,
        "job_duration": str(job_duration)
    }]
    errors = bigquery_client.insert_rows(table, row_to_insert)  # API request
    assert errors == []

    return 'JobId {} - {} - {} published in BigQuery'.format(job_id, job_status, output_name)
--------------------------------------------------------------------------------
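
job-result-google-bigquery.py assumes a `default.dataprep_jobs` table already exists. A minimal sketch of creating it with the BigQuery Python client, with a schema derived from the row the function inserts; the project id is a placeholder and the column types are an assumption:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Columns match the row built in publish_bigquery; the types are assumed.
schema = [
    bigquery.SchemaField("job_run_date", "DATETIME"),
    bigquery.SchemaField("job_id", "INTEGER"),
    bigquery.SchemaField("output_name", "STRING"),
    bigquery.SchemaField("job_status", "STRING"),
    bigquery.SchemaField("job_url", "STRING"),
    bigquery.SchemaField("user", "STRING"),
    bigquery.SchemaField("dataflow_job_id", "STRING"),
    bigquery.SchemaField("job_duration", "STRING"),
]

table = bigquery.Table("your-project.default.dataprep_jobs", schema=schema)
client.create_table(table)
```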

/job-result-google-sheet.js:
--------------------------------------------------------------------------------
const {google} = require('googleapis');
const request = require('sync-request');

exports.jobresultgsheet = async (req, res) => {

  var jobID = req.body.jobid;
  var jobStatus = req.body.jobstatus;

  var jobURL = "https://clouddataprep.com/jobs/" + jobID;

  // Spreadsheet HYPERLINK formula pointing to the PDF results profile of the job
  var jobProfileFormula = '=HYPERLINK("https://clouddataprep.com/v4/jobGroups/' + jobID + '/pdfResults","Profile PDF")';

  var DataprepToken = "eyJhbGciOiJSUzI.................7VQLSPH3mteFmQfOPBCrJPqGWErQ";

  // ------------------ GET DATAPREP JOB OBJECT --------------------------------

  var job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/" + jobID + "?embed=wrangledDataset";

  var res_job = request('GET', job_endpoint, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + DataprepToken
    },
  });
  var jsonjob = JSON.parse(res_job.getBody());
  var recipeID = jsonjob.wrangledDataset.id;
  console.log("Recipe ID : " + recipeID);

  // ------------------ GET DATAPREP RECIPE OBJECT --------------------------------

  var recipe_endpoint = "https://api.clouddataprep.com/v4/wrangledDatasets/" + recipeID;

  var res_recipe = request('GET', recipe_endpoint, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + DataprepToken
    },
  });
  var jsonrecipe = JSON.parse(res_recipe.getBody());
  var recipeName = jsonrecipe.name;
  console.log("Recipe Name : " + recipeName);

  // ------------------ ADD ALL RESULTS TO A GOOGLE SHEET --------------------------------

  // block on auth + getting the sheets API object
  const auth = await google.auth.getClient({
    scopes: ["https://www.googleapis.com/auth/spreadsheets"]
  });
  const sheetsAPI = await google.sheets({ version: "v4", auth });
  const JobSheetId = "1X63lT7...........VbwiDN0wm3SKx-Ro";

  sheetsAPI.spreadsheets.values.append({
    key: "AIza............0qu8qlXUA",
    spreadsheetId: JobSheetId,
    range: 'A1:F1',
    valueInputOption: 'USER_ENTERED',
    insertDataOption: 'INSERT_ROWS',
    resource: {
      values: [
        [new Date().toISOString().replace('T', ' ').substr(0, 19), jobID, recipeName, jobStatus, jobURL, jobProfileFormula]
      ],
    },
  }, (err, response) => {
    if (err) res.send(err)
  })
  res.status(200).send("job " + jobID + " " + jobStatus);
  console.log("job " + jobID + " " + jobStatus);
}
--------------------------------------------------------------------------------

/publishing_googlesheet.js:
--------------------------------------------------------------------------------
const request = require('then-request');
const {google} = require('googleapis');
const {Storage} = require("@google-cloud/storage");

exports.publish_gsheet = async (req, res) => {

  const DataprepJobID = req.body.jobid;

  console.log("DataprepJobID : " + DataprepJobID);

  const spreadsheetId = "1WiGd.........4tuoc";

  const DataprepToken = "eyJhbGc........bcOwTQ";

  // block on auth + getting the sheets API object
  const auth = await google.auth.getClient({
    scopes: [
      "https://www.googleapis.com/auth/spreadsheets",
      "https://www.googleapis.com/auth/devstorage.read_only"
    ]
  });
  const sheetsAPI = google.sheets({version: 'v4', auth});

  // ------------------ GET DATAPREP JOB AND CSV FILE NAME GENERATED IN GCS --------------------------------

  const dataprep_job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/" + DataprepJobID + "?embed=jobs.fileWriterJob.writeSetting";

  var res_job = await request('GET', dataprep_job_endpoint, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + DataprepToken
    },
  });

  const jsonresult = JSON.parse(res_job.getBody());

  var outputFileURI = "";
  for (key in jsonresult.jobs.data) {
    if (jsonresult.jobs.data[key].jobType == "filewriter") {
      outputFileURI = jsonresult.jobs.data[key].writeSetting.path;
    }
  }

  //gs://dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce/vcoustenoble@trifacta.com/jobrun/Sales_Data_small.csv
  //console.log("outputFileURI : "+outputFileURI);

  const outputFilepathArray = outputFileURI.split('/');

  const outputBucket = outputFilepathArray[2];
  console.log("Bucket : " + outputBucket);

  var outputFilepath = '';
  for (key in outputFilepathArray) {
    if (key > 2) {
      outputFilepath = outputFilepath + outputFilepathArray[key] + '/';
    }
  }
  outputFilepath = outputFilepath.slice(0, -1);
  console.log("Output Filepath : " + outputFilepath);

  const filename = outputFilepathArray.slice(-1).toString();
  //console.log("Filename : "+filename);
  const sheetName = filename.slice(0, -4) + "_" + DataprepJobID;
  console.log("Sheet Name : " + sheetName);

  const FileData = await readCSVContent(outputBucket, outputFilepath);

  const sheetid = await createEmptySheet(sheetName, spreadsheetId);
  await populateAndStyle(FileData, sheetid, spreadsheetId);

  res.send(`Spreadsheet ${sheetName} created`);

  // ------------------ READ CSV FILE CONTENT FROM GCS --------------------------------

  function readCSVContent(mybucket, myfilepath) {
    return new Promise((resolve, reject) => {
      const storage = new Storage();
      const bucket = storage.bucket(mybucket);
      const file = bucket.file(myfilepath);

      let fileContents = Buffer.from('');

      file.createReadStream()
        .on('error', function(err) {
          reject('The Storage API returned an error: ' + err);
        })
        .on('data', function(chunk) {
          fileContents = Buffer.concat([fileContents, chunk]);
        })
        .on('end', function() {
          let content = fileContents.toString('utf8');
          //console.log("CSV content read as string : " + content );
          resolve(content);
        });
    });
  }

  // ------------------ CREATE EMPTY NEW SHEET --------------------------------

  function createEmptySheet(MySheetName, Myspreadsheetid) {
    return new Promise((resolve, reject) => {

      const emptySheetParams = {
        spreadsheetId: Myspreadsheetid,
        resource: {
          requests: [
            {
              addSheet: {
                properties: {
                  title: MySheetName,
                  index: 1,
                  gridProperties: {
                    rowCount: 10,
                    columnCount: 10,
                    frozenRowCount: 1
                  }
                }
              }
            }
          ]
        }
      };
      sheetsAPI.spreadsheets.batchUpdate(emptySheetParams, function(err, response) {
        if (err) {
          reject("The Sheets API returned an error: " + err);
        } else {
          const sheetId = response.data.replies[0].addSheet.properties.sheetId;
          console.log("Created empty sheet: " + sheetId);
          resolve(sheetId);
        }
      });
    });
  }

  // ------------------ WRITE DATA IN THE NEW EMPTY SHEET --------------------------------

  function populateAndStyle(FileData, MySheetId, MySpreadsheetId) {
    return new Promise((resolve, reject) => {
      // Using 'batchUpdate' allows for multiple 'requests' to be sent in a single batch.
      // Populate the sheet referenced by its ID with the data received (a CSV string)
      // Style: set first row font size to 11 and to Bold. Exercise left for the reader: resize columns
      const dataAndStyle = {
        spreadsheetId: MySpreadsheetId,
        resource: {
          requests: [
            {
              pasteData: {
                coordinate: {
                  sheetId: MySheetId,
                  rowIndex: 0,
                  columnIndex: 0
                },
                data: FileData,
                delimiter: ","
              }
            },
            {
              repeatCell: {
                range: {
                  sheetId: MySheetId,
                  startRowIndex: 0,
                  endRowIndex: 1
                },
                cell: {
                  userEnteredFormat: {
                    textFormat: {
                      fontSize: 11,
                      bold: true
                    }
                  }
                },
                fields: "userEnteredFormat(textFormat)"
              }
            }
          ]
        }
      };

      sheetsAPI.spreadsheets.batchUpdate(dataAndStyle, function(err, response) {
        if (err) {
          reject("The Sheets API returned an error: " + err);
        } else {
          console.log(MySheetId + " sheet populated with " + FileData.length + " characters of CSV data and column style set.");
          resolve();
        }
      });
    });
  }

}
--------------------------------------------------------------------------------

/trifactalogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/victorcouste/google-cloudfunctions-dataprep/9d869c42edd53865bcf4e55ff371e82ee88b6473/trifactalogo.png
--------------------------------------------------------------------------------