├── CloudFunctions_Dataprep.png
├── README.md
├── _config.yml
├── export_import_dataprep_flow.py
├── gcs_trigger_dataprep_job.py
├── job-result-google-bigquery.py
├── job-result-google-sheet.js
├── publishing_googlesheet.js
└── trifactalogo.png
/CloudFunctions_Dataprep.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/victorcouste/google-cloudfunctions-dataprep/9d869c42edd53865bcf4e55ff371e82ee88b6473/CloudFunctions_Dataprep.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Google Cloud Functions for Cloud Dataprep
2 |
3 |
4 |
5 | [Google Cloud Functions](https://cloud.google.com/functions) examples for [Cloud Dataprep](https://cloud.google.com/dataprep)
6 |
7 | - **[gcs_trigger_dataprep_job.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/gcs_trigger_dataprep_job.py)** : Background Python function that triggers a Dataprep job when a file is created in a Google Cloud Storage bucket folder. The Dataprep job is started with a REST API call, with the new file passed as a parameter. Implementation details are in the blog post [How to Automate a Cloud Dataprep Pipeline When a File Arrives](https://medium.com/google-cloud/how-to-automate-a-cloud-dataprep-pipeline-when-a-file-arrives-9b85f2745a09).
8 |
9 | - **[job-result-google-sheet.js](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/job-result-google-sheet.js)** : HTTP Node.js function that writes a Dataprep job's result info (id, status) to a Google Sheet, along with the recipe name, a link to the job page, and a link to the PDF of the result's profile. This HTTP Cloud Function is called from a Dataprep webhook when a job finishes (success or failure). Implementation details are in the blog post [Leverage Cloud Functions and APIs to Monitor Cloud Dataprep Jobs Status in a Google Sheet](https://towardsdatascience.com/leverage-cloud-functions-and-apis-to-monitor-cloud-dataprep-jobs-status-in-a-google-sheet-b412ee2b9acc).
10 |
11 | - **[publishing_googlesheet.js](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/publishing_googlesheet.js)** : HTTP Node.js function that publishes a Dataprep output to a Google Sheet. The new sheet's name is based on the default single CSV file name generated in GCS plus the Dataprep job id. In the Cloud Function code, you need to update your [Dataprep Access Token](https://docs.trifacta.com/display/DP/Access+Tokens+Page) (to call the REST API) and the [Google Spreadsheet ID](https://developers.google.com/sheets/api/guides/concepts#spreadsheet_id). This Cloud Function can be triggered when a Dataprep job finishes via a [Dataprep Webhook](https://docs.trifacta.com/display/DP/Create+Flow+Webhook+Task).
12 |
13 | - **[job-result-google-bigquery.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/job-result-google-bigquery.py)** : HTTP Python function that writes a Dataprep job's result info (id, status) to a Google BigQuery table, along with the output dataset name (recipe name), the Google user, and a link to the job page. This HTTP Cloud Function is called from a Dataprep webhook when a job finishes (success or failure). Implementation details are in the blog post [Monitor your BigQuery Data Warehouse Dataprep Pipeline with Data Studio](https://medium.com/google-cloud/monitor-your-bigquery-data-warehouse-dataprep-pipeline-with-data-studio-8e46b2beda1).
14 |
15 | - **[export_import_dataprep_flow.py](https://github.com/victorcouste/google-cloudfunctions-dataprep/blob/master/export_import_dataprep_flow.py)** : HTTP Python function that exports a Dataprep flow from one project and imports it into another, with the option to save or read the flow package (zip file) in a GCS bucket folder.
16 |
17 | - **[Update Google Cloud Data Catalog](https://victorcouste.github.io/google-data-catalog-dataprep/)** : A Cloud Function to create or update Google Cloud Data Catalog tags on BigQuery tables with Cloud Dataprep metadata and column profiles.
18 |
19 |
20 |
21 |
22 | Google Cloud Functions [https://cloud.google.com/functions](https://cloud.google.com/functions)
23 |
24 | Cloud Dataprep by Trifacta [https://cloud.google.com/dataprep](https://cloud.google.com/dataprep)
25 |
26 | Cloud Dataprep Standard API [https://api.trifacta.com/dataprep-standard](https://api.trifacta.com/dataprep-standard)
27 |
28 | Cloud Dataprep Premium API [https://api.trifacta.com/dataprep-premium](https://api.trifacta.com/dataprep-premium)
29 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-minimal
--------------------------------------------------------------------------------
/export_import_dataprep_flow.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | from google.cloud import storage
4 |
5 | def import_export_dataprep_flow(request):
6 | """Responds to any HTTP request.
7 | Args:
8 | request (flask.Request): HTTP request object.
9 | Returns:
10 | The response text or any set of values that can be turned into a
11 | Response object using
12 |         `make_response`.
13 | """
14 | request_json = request.get_json()
15 | if request_json and 'flowid' in request_json:
16 | dataprep_flowid = request_json['flowid']
17 | else:
18 | return 'No FlowId to export'
19 |
20 | #dataprep_flowid=9999999
21 |
22 | print('FlowId {} to export/import'.format(dataprep_flowid))
23 |
24 |     dataprep_export_auth_token = 'xxxxxxxxx'  # access token for the source (export) project
25 |     dataprep_exportflow_endpoint = 'https://api.clouddataprep.com/v4/flows/{}/package'.format(dataprep_flowid)
26 |     dataprep_exportflow_headers = {"Authorization": "Bearer "+dataprep_export_auth_token}
27 |
28 | resp_export = requests.get(
29 | url=dataprep_exportflow_endpoint,
30 | headers=dataprep_exportflow_headers
31 | )
32 | print('Export Flow Status Code : {}'.format(resp_export.status_code))
33 |
34 | # Option to save Flow package in a GCS folder
35 | flowfile_path="flows/flow_{}.zip".format(dataprep_flowid)
36 | storage_client = storage.Client()
37 |     bucket = storage_client.bucket("dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce")  # replace with your own bucket
38 | blob = bucket.blob(flowfile_path)
39 | blob.upload_from_string(resp_export.content,content_type="application/zip")
40 |
41 | # Option to get Flow package from a GCS folder
42 | #flowfile = blob.download_as_string()
43 |
44 | # Get Flow package directly from the export
45 | flowfile = resp_export.content
46 |
47 |     dataprep_import_auth_token = 'yyyyyyy'  # access token for the destination (import) project
48 |     dataprep_importflow_endpoint = 'https://api.clouddataprep.com/v4/flows/package'
49 |     dataprep_importflow_headers = {"Authorization": "Bearer "+dataprep_import_auth_token}
50 | dataprep_importflow_files={"archive": ("flow.zip", flowfile)}
51 |
52 | resp_import = requests.post(
53 | url=dataprep_importflow_endpoint,
54 | headers=dataprep_importflow_headers,
55 | files=dataprep_importflow_files
56 | )
57 |
58 | print('Import flow Status Code : {}'.format(resp_import.status_code))
59 | print('Result Import: {}'.format(resp_import.json()))
60 |
61 | return 'FlowId {} export/import'.format(dataprep_flowid)
62 |
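Once deployed as an HTTP Cloud Function, the flow export/import can be exercised with a simple POST carrying the flow id in the JSON body. A minimal sketch, assuming a placeholder function URL and a hypothetical flow id:

```python
import requests

# Placeholder URL of the deployed HTTP Cloud Function (replace with your own deployment).
FUNCTION_URL = "https://REGION-PROJECT_ID.cloudfunctions.net/import_export_dataprep_flow"

# The function reads a "flowid" key from the JSON body (see request_json['flowid'] above).
payload = {"flowid": 123456}  # hypothetical flow id

resp = requests.post(FUNCTION_URL, json=payload)
print(resp.status_code, resp.text)
```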
--------------------------------------------------------------------------------
/gcs_trigger_dataprep_job.py:
--------------------------------------------------------------------------------
1 | import os
2 | import requests
3 | import json
4 |
5 | def dataprep_job_gcs_trigger(event, context):
6 |
7 | """Background Cloud Function to be triggered by Cloud Storage.
8 | Args:
9 | event (dict): The Cloud Functions event payload.
10 | context (google.cloud.functions.Context): Metadata of triggering event."""
11 |
12 | head_tail = os.path.split(event['name'])
13 | newfilename = head_tail[1]
14 | newfilepath = head_tail[0]
15 |
16 |     dataprep_auth_token = 'xxxxxxxxxxxxxxx'  # Dataprep access token used for the API call
17 |     dataprep_jobid = 99999999  # id passed as wrangledDataset to the jobGroups API
18 |
19 | if context.event_type == 'google.storage.object.finalize' and newfilepath == 'landingzone':
20 |
21 | print('Run Dataprep job on new file: {}'.format(newfilename))
22 |
23 | dataprep_runjob_endpoint = 'https://api.clouddataprep.com/v4/jobGroups'
24 |         dataprep_job_param = {
25 |             "wrangledDataset": {"id": dataprep_jobid},
26 |             "runParameters": {"overrides": {"data": [{"key": "FileName","value": newfilename}]}}
27 |         }
28 |         print('Run Dataprep job param: {}'.format(dataprep_job_param))
29 |         dataprep_headers = {
30 |             "Content-Type":"application/json",
31 |             "Authorization": "Bearer "+dataprep_auth_token
32 | }
33 |
34 | resp = requests.post(
35 | url=dataprep_runjob_endpoint,
36 | headers=dataprep_headers,
37 |             data=json.dumps(dataprep_job_param)
38 | )
39 |
40 | print('Status Code : {}'.format(resp.status_code))
41 | print('Result : {}'.format(resp.json()))
42 |
43 |     return 'End file event for {}'.format(newfilename)
44 |
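A minimal local smoke test, assuming the file is importable as a module named gcs_trigger_dataprep_job: call the function directly with a fake Storage event (only the name key is read) and a stub context carrying the event type. Note that it will still issue the real Dataprep API call with whatever token is configured above.

```python
from types import SimpleNamespace

# Assumes this file is on the Python path as gcs_trigger_dataprep_job.
from gcs_trigger_dataprep_job import dataprep_job_gcs_trigger

# Fake "object finalize" event for a file landing in the landingzone/ folder (hypothetical name).
event = {"name": "landingzone/new_sales_file.csv"}
context = SimpleNamespace(event_type="google.storage.object.finalize")

print(dataprep_job_gcs_trigger(event, context))
```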
--------------------------------------------------------------------------------
/job-result-google-bigquery.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | from google.cloud import bigquery
4 | from datetime import datetime
5 |
6 | def publish_bigquery(request):
7 |
8 | request_json = request.get_json()
9 | if request_json and 'job_id' in request_json:
10 | job_id = request_json['job_id']
11 | job_status=request_json['job_status']
12 | else:
13 | return 'No Job Id to publish'
14 |
15 |     dataprep_auth_token='xxxxxxxx'  # Dataprep access token used for the API call
16 |     dataprep_headers = {"Authorization": "Bearer "+dataprep_auth_token}
17 |
18 | print('Dataprep Job ID {} and Status {}'.format(job_id,job_status))
19 |
20 |     job_url="https://clouddataprep.com/jobs/"+job_id
21 | job_result_profile="https://clouddataprep.com/v4/jobGroups/"+job_id+"/pdfResults"
22 |
23 | dataprep_job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/"+job_id+"?embed=wrangledDataset.recipe,creator,jobs"
24 |
25 | resp = requests.get(
26 | url=dataprep_job_endpoint,
27 | headers=dataprep_headers
28 | )
29 | job_object=resp.json()
30 | print('Status Code Get Job: {}'.format(resp.status_code))
31 | #print('Result : {}'.format(job_object))
32 |
33 | output_name = job_object["wrangledDataset"]["recipe"]["name"]
34 | print('Output Name : {}'.format(output_name))
35 |
36 | user = job_object["creator"]["email"]
37 | print('User : {}'.format(user))
38 |
39 | createdAt = job_object["jobs"]["data"][0]["createdAt"]
40 |
41 | # Find "wrangle" job type, executed with Dataflow
42 | for job in job_object["jobs"]["data"]:
43 | if job["jobType"]=="wrangle":
44 | dataflow_jobid = job["cpJobId"]
45 |
46 | print('Dataflow jobId : {}'.format(dataflow_jobid))
47 |
48 | # Datetime of last job
49 | updatedAt=job["updatedAt"]
50 |
51 |     start_job = datetime.strptime(createdAt, "%Y-%m-%dT%H:%M:%S.%fZ")
52 |     end_job = datetime.strptime(updatedAt, "%Y-%m-%dT%H:%M:%S.%fZ")
53 | job_duration = (end_job - start_job)
54 | print('Duration : {}'.format(job_duration))
55 |
56 | datetime_string = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
57 |
58 | # Instantiates a client
59 | bigquery_client = bigquery.Client()
60 |
61 | # Prepares a reference to the dataset
62 | dataset_ref = bigquery_client.dataset('default')
63 |
64 | table_ref = dataset_ref.table('dataprep_jobs')
65 | table = bigquery_client.get_table(table_ref) # API call
66 | row_to_insert = [{
67 | "job_run_date":datetime_string,
68 | "job_id":int(job_id),
69 | "output_name":output_name,
70 | "job_status":job_status,
71 | "job_url":job_url,
72 | "user":user,
73 | "dataflow_job_id":dataflow_jobid,
74 | "job_duration":str(job_duration)
75 | }]
76 | errors = bigquery_client.insert_rows(table, row_to_insert) # API request
77 | assert errors == []
78 |
79 | return 'JobId {} - {} - {} published in BigQuery'.format(job_id,job_status,output_name)
80 |
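The insert above targets a dataprep_jobs table in a dataset named default, which must exist beforehand. A sketch of one possible schema, with field names taken from row_to_insert and types assumed:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Field names mirror row_to_insert above; the types are assumptions.
schema = [
    bigquery.SchemaField("job_run_date", "DATETIME"),
    bigquery.SchemaField("job_id", "INTEGER"),
    bigquery.SchemaField("output_name", "STRING"),
    bigquery.SchemaField("job_status", "STRING"),
    bigquery.SchemaField("job_url", "STRING"),
    bigquery.SchemaField("user", "STRING"),
    bigquery.SchemaField("dataflow_job_id", "STRING"),
    bigquery.SchemaField("job_duration", "STRING"),
]

# "your-project" is a placeholder project id.
table = bigquery.Table("your-project.default.dataprep_jobs", schema=schema)
client.create_table(table)
```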
--------------------------------------------------------------------------------
/job-result-google-sheet.js:
--------------------------------------------------------------------------------
1 | const {google} = require('googleapis');
2 | const request = require('sync-request');
3 |
4 | exports.jobresultgsheet = async (req, res) => {
5 |
6 | var jobID = req.body.jobid;
7 | var jobStatus = req.body.jobstatus;
8 |
9 | var jobURL = "https://clouddataprep.com/jobs/"+jobID;
10 |
11 |   var jobProfileFormula = '=LIEN_HYPERTEXTE("https://clouddataprep.com/v4/jobGroups/'+jobID+'/pdfResults";"Profile PDF")'; // LIEN_HYPERTEXTE is the French-locale HYPERLINK formula
12 |
13 | var DataprepToken = "eyJhbGciOiJSUzI.................7VQLSPH3mteFmQfOPBCrJPqGWErQ";
14 |
15 | // ------------------ GET DATAPREP JOB OBJECT --------------------------------
16 |
17 | var job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/"+jobID+"?embed=wrangledDataset";
18 |
19 | var res_job = request('GET', job_endpoint, {
20 | headers: {
21 | 'Content-Type': 'application/json',
22 | 'Authorization': 'Bearer '+ DataprepToken
23 | },
24 | });
25 | var jsonjob = JSON.parse(res_job.getBody());
26 | var recipeID = jsonjob.wrangledDataset.id;
27 | console.log("Recipe ID : "+recipeID);
28 |
29 | // ------------------ GET DATAPREP RECIPE OBJECT --------------------------------
30 |
31 | var recipe_endpoint = "https://api.clouddataprep.com/v4/wrangledDatasets/"+recipeID;
32 |
33 | var res_recipe = request('GET', recipe_endpoint, {
34 | headers: {
35 | 'Content-Type': 'application/json',
36 | 'Authorization': 'Bearer '+ DataprepToken
37 | },
38 | });
39 | var jsonrecipe = JSON.parse(res_recipe.getBody());
40 |   var recipeName = jsonrecipe.name;
41 | console.log("Recipe Name : "+recipeName);
42 |
43 | // ------------------ ADD ALL RESULTS TO A GOOGLE SHEET --------------------------------
44 |
45 | // block on auth + getting the sheets API object
46 | const auth = await google.auth.getClient({
47 | scopes: ["https://www.googleapis.com/auth/spreadsheets"]
48 | });
49 | const sheetsAPI = await google.sheets({ version: "v4", auth });
50 | const JobSheetId = "1X63lT7...........VbwiDN0wm3SKx-Ro";
51 |
52 | sheetsAPI.spreadsheets.values.append({
53 | key:"AIza............0qu8qlXUA",
54 | spreadsheetId: JobSheetId,
55 | range: 'A1:F1',
56 | valueInputOption: 'USER_ENTERED',
57 | insertDataOption: 'INSERT_ROWS',
58 | resource: {
59 | values: [
60 |         [new Date().toISOString().replace('T', ' ').substr(0, 19), jobID, recipeName, jobStatus, jobURL, jobProfileFormula]
61 | ],
62 | },
63 | }, (err, response) => {
64 |     if (err) console.error("The Sheets API returned an error: " + err);
65 | })
66 | res.status(200).send("job "+jobID+" "+jobStatus);
67 | console.log("job "+jobID+" "+jobStatus);
68 | }
69 |
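This function is normally invoked by a Dataprep webhook, but it can also be tested with a plain HTTP POST carrying the two fields read from req.body (jobid and jobstatus). A sketch in Python, with a placeholder function URL and hypothetical values:

```python
import requests

# Placeholder URL of the deployed HTTP Cloud Function (replace with your own).
FUNCTION_URL = "https://REGION-PROJECT_ID.cloudfunctions.net/jobresultgsheet"

# Keys match req.body.jobid and req.body.jobstatus; the values are hypothetical.
payload = {"jobid": "1234567", "jobstatus": "Complete"}

resp = requests.post(FUNCTION_URL, json=payload)
print(resp.status_code, resp.text)  # expected: "job 1234567 Complete"
```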
--------------------------------------------------------------------------------
/publishing_googlesheet.js:
--------------------------------------------------------------------------------
1 | const request = require('then-request');
2 | const {google} = require('googleapis');
3 | const {Storage} = require("@google-cloud/storage");
4 |
5 | exports.publish_gsheet = async (req, res) => {
6 |
7 | const DataprepJobID = req.body.jobid;
8 |
9 | console.log("DataprepJobID : "+DataprepJobID);
10 |
11 | const spreadsheetId = "1WiGd.........4tuoc";
12 |
13 | const DataprepToken ="eyJhbGc........bcOwTQ";
14 |
15 | // block on auth + getting the sheets API object
16 | const auth = await google.auth.getClient({
17 | scopes: [
18 | "https://www.googleapis.com/auth/spreadsheets",
19 | "https://www.googleapis.com/auth/devstorage.read_only"
20 | ]
21 | });
22 | const sheetsAPI = google.sheets({version: 'v4',auth});
23 |
24 | // ------------------ GET DATAPREP JOB AND CSV FILE NAME GENERATED IN GCS --------------------------------
25 |
26 | const dataprep_job_endpoint = "https://api.clouddataprep.com/v4/jobGroups/"+DataprepJobID+"?embed=jobs.fileWriterJob.writeSetting";
27 |
28 | var res_job = await request('GET', dataprep_job_endpoint, {
29 | headers: {
30 | 'Content-Type': 'application/json',
31 | 'Authorization': 'Bearer '+ DataprepToken
32 | },
33 | });
34 |
35 | const jsonresult = JSON.parse(res_job.getBody());
36 |
37 |   var outputFileURI="";
38 |   for (const key in jsonresult.jobs.data) {
39 | if (jsonresult.jobs.data[key].jobType == "filewriter") {
40 | outputFileURI = jsonresult.jobs.data[key].writeSetting.path;
41 | }
42 | };
43 |
44 | //gs://dataprep-staging-0b9ad034-9473-4777-98f1-0f3e643d0dce/vcoustenoble@trifacta.com/jobrun/Sales_Data_small.csv
45 | //console.log("outputFileURI : "+outputFileURI);
46 |
47 | const outputFilepathArray = outputFileURI.split('/');
48 |
49 | const outputBucket=outputFilepathArray[2];
50 | console.log("Bucket : "+outputBucket);
51 |
52 | var outputFilepath='';
53 |   for (const key in outputFilepathArray) {
54 | if (key > 2) {
55 | outputFilepath = outputFilepath + outputFilepathArray[key]+'/';
56 | }
57 | };
58 | outputFilepath=outputFilepath.slice(0,-1);
59 | console.log("Output Filepath : "+outputFilepath);
60 |
61 | const filename = outputFilepathArray.slice(-1).toString();
62 | //console.log("Filename : "+filename);
63 | const sheetName = filename.slice(0,-4)+"_"+DataprepJobID;
64 | console.log("Sheet Name : "+sheetName);
65 |
66 | const FileData = await readCSVContent(outputBucket,outputFilepath);
67 |
68 |   const sheetid = await createEmptySheet(sheetName,spreadsheetId);
69 | await populateAndStyle(FileData,sheetid,spreadsheetId);
70 |
71 | res.send(`Spreadsheet ${sheetName} created`);
72 |
73 | // ------------------ READ CSV FILE CONTENT FROM GCS --------------------------------
74 |
75 | function readCSVContent(mybucket,myfilepath) {
76 | return new Promise((resolve, reject) => {
77 | const storage = new Storage();
78 | const bucket = storage.bucket(mybucket);
79 | const file = bucket.file(myfilepath);
80 |
81 | let fileContents = Buffer.from('');
82 |
83 | file.createReadStream()
84 | .on('error', function(err) {
85 | reject('The Storage API returned an error: ' + err);
86 | })
87 | .on('data', function(chunk) {
88 | fileContents = Buffer.concat([fileContents, chunk]);
89 | })
90 | .on('end', function() {
91 | let content = fileContents.toString('utf8');
92 | //console.log("CSV content read as string : " + content );
93 | resolve(content);
94 | });
95 | });
96 | }
97 |
98 | // ------------------ CREATE EMPTY NEW SHEET --------------------------------
99 |
100 | function createEmptySheet(MySheetName,Myspreadsheetid) {
101 | return new Promise((resolve, reject) => {
102 |
103 | const emptySheetParams = {
104 | spreadsheetId: Myspreadsheetid,
105 | resource: {
106 | requests: [
107 | {
108 | addSheet: {
109 | properties: {
110 | title: MySheetName,
111 | index: 1,
112 | gridProperties: {
113 | rowCount: 10,
114 | columnCount: 10,
115 | frozenRowCount: 1
116 | }
117 | }
118 | }
119 | }
120 | ]
121 | }
122 | };
123 | sheetsAPI.spreadsheets.batchUpdate( emptySheetParams, function(err, response) {
124 | if (err) {
125 | reject("The Sheets API returned an error: " + err);
126 | } else {
127 | const sheetId = response.data.replies[0].addSheet.properties.sheetId;
128 | console.log("Created empty sheet: " + sheetId);
129 | resolve(sheetId);
130 | }
131 | });
132 | });
133 | }
134 |
135 | // ------------------ WRITE DATA IN THE NEW EMPTY SHEET --------------------------------
136 |
137 | function populateAndStyle(FileData,MySheetId,MySpreadsheetId) {
138 | return new Promise((resolve, reject) => {
139 | // Using 'batchUpdate' allows for multiple 'requests' to be sent in a single batch.
140 | // Populate the sheet referenced by its ID with the data received (a CSV string)
141 | // Style: set first row font size to 11 and to Bold. Exercise left for the reader: resize columns
142 | const dataAndStyle = {
143 | spreadsheetId: MySpreadsheetId,
144 | resource: {
145 | requests: [
146 | {
147 | pasteData: {
148 | coordinate: {
149 | sheetId: MySheetId,
150 | rowIndex: 0,
151 | columnIndex: 0
152 | },
153 | data: FileData,
154 | delimiter: ","
155 | }
156 | },
157 | {
158 | repeatCell: {
159 | range: {
160 | sheetId: MySheetId,
161 | startRowIndex: 0,
162 | endRowIndex: 1
163 | },
164 | cell: {
165 | userEnteredFormat: {
166 | textFormat: {
167 | fontSize: 11,
168 | bold: true
169 | }
170 | }
171 | },
172 | fields: "userEnteredFormat(textFormat)"
173 | }
174 | }
175 | ]
176 | }
177 | };
178 |
179 | sheetsAPI.spreadsheets.batchUpdate(dataAndStyle, function(err, response) {
180 | if (err) {
181 | reject("The Sheets API returned an error: " + err);
182 | } else {
183 |           console.log(MySheetId + " sheet populated with " + FileData.length + " characters of CSV data and header style set.");
184 | resolve();
185 | }
186 | });
187 | });
188 | }
189 |
190 | }
191 |
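For reference, the sheet-naming logic above re-expressed as a short Python sketch; the URI and job id are illustrative, the real values coming from the filewriter job's writeSetting.path and the webhook payload:

```python
# Example CSV output URI of the shape shown in the code comment above.
output_file_uri = "gs://your-dataprep-staging-bucket/user@example.com/jobrun/Sales_Data_small.csv"
dataprep_job_id = "1234567"  # hypothetical job id

parts = output_file_uri.split("/")
bucket = parts[2]                       # GCS bucket name
file_path = "/".join(parts[3:])         # object path inside the bucket
sheet_name = parts[-1][:-4] + "_" + dataprep_job_id  # strip ".csv" and append the job id

print(bucket)      # your-dataprep-staging-bucket
print(file_path)   # user@example.com/jobrun/Sales_Data_small.csv
print(sheet_name)  # Sales_Data_small_1234567
```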
--------------------------------------------------------------------------------
/trifactalogo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/victorcouste/google-cloudfunctions-dataprep/9d869c42edd53865bcf4e55ff371e82ee88b6473/trifactalogo.png
--------------------------------------------------------------------------------