├── CONTRIBUTING.md ├── ContributorAgreement.txt ├── LICENSE ├── README.md ├── SUPPORT.md ├── discover.py ├── dsccnfg └── config.txt ├── dscdonl └── dscdonl.txt ├── dscextr └── dscextr.txt ├── dscwh └── dscwh.txt ├── log └── logs.txt └── sql └── sql.txt /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We'd love to accept your patches and contributions to this project. There are 4 | just a few small guidelines you need to follow. 5 | 6 | ## Contributor License Agreement 7 | 8 | Contributions to this project must be accompanied by a signed 9 | [Contributor Agreement](ContributorAgreement.txt). 10 | You (or your employer) retain the copyright to your contribution, 11 | this simply gives us permission to use and redistribute your contributions as 12 | part of the project. 13 | 14 | ## Code reviews 15 | 16 | All submissions, including submissions by project members, require review. We 17 | use GitHub pull requests for this purpose. Consult 18 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more 19 | information on using pull requests. 20 | -------------------------------------------------------------------------------- /ContributorAgreement.txt: -------------------------------------------------------------------------------- 1 | Contributor Agreement 2 | 3 | Version 1.1 4 | 5 | Contributions to this software are accepted only when they are 6 | properly accompanied by a Contributor Agreement. The Contributor 7 | Agreement for this software is the Developer's Certificate of Origin 8 | 1.1 (DCO) as provided with and required for accepting contributions 9 | to the Linux kernel. 10 | 11 | In each contribution proposed to be included in this software, the 12 | developer must include a "sign-off" that denotes consent to the 13 | terms of the Developer's Certificate of Origin. The sign-off is 14 | a line of text in the description that accompanies the change, 15 | certifying that you have the right to provide the contribution 16 | to be included. For changes provided in source code control (for 17 | example, via a Git pull request) the sign-off must be included in 18 | the commit message in source code control. For changes provided 19 | in email or issue tracking, the sign-off must be included in the 20 | email or the issue, and the sign-off will be incorporated into the 21 | permanent commit message if the contribution is accepted into the 22 | official source code. 23 | 24 | If you can certify the below: 25 | 26 | Developer's Certificate of Origin 1.1 27 | 28 | By making a contribution to this project, I certify that: 29 | 30 | (a) The contribution was created in whole or in part by me and I 31 | have the right to submit it under the open source license 32 | indicated in the file; or 33 | 34 | (b) The contribution is based upon previous work that, to the best 35 | of my knowledge, is covered under an appropriate open source 36 | license and I have the right under that license to submit that 37 | work with modifications, whether created in whole or in part 38 | by me, under the same open source license (unless I am 39 | permitted to submit under a different license), as indicated 40 | in the file; or 41 | 42 | (c) The contribution was provided directly to me by some other 43 | person who certified (a), (b) or (c) and I have not modified 44 | it. 
45 | 46 | (d) I understand and agree that this project and the contribution 47 | are public and that a record of the contribution (including all 48 | personal information I submit with it, including my sign-off) is 49 | maintained indefinitely and may be redistributed consistent with 50 | this project or the open source license(s) involved. 51 | 52 | then you just add a line saying 53 | 54 | Signed-off-by: Random J Developer 55 | 56 | using your real name (sorry, no pseudonyms or anonymous contributions.) -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. 
For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. 
You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. (Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SAS Customer Intelligence 360 Download Client: Python 2 | 3 | ## Overview 4 | This Python script enables you to download cloud-hosted data tables from SAS Customer Intelligence 360. 5 | 6 | The script can perform the following tasks: 7 | * Download the following data marts: `detail`, `dbtReport`, and `snapshot (previously identity)`. 8 | * Specify a time range to be downloaded. 
9 | * Automatically unzip the download packages and create CSV files with header rows and field delimiters.
10 | * Keep track of all initiated downloads. This lets you download a delta from the last complete download and append it to one file per table.
11 | 
12 | This topic contains the following sections:
13 | * [Configuration](#configuration)
14 | * [Using the Download Script](#using-the-download-script)
15 | * [Considerations](#considerations)
16 | * [Running the script](#running-the-script)
17 | * [Examples](#examples)
18 | * [Contributing](#contributing)
19 | * [License](#license)
20 | * [Additional Resources](#additional-resources)
21 | 
22 | 
23 | 
24 | ## Configuration
25 | 1. Install Python (version 3 or later) from https://www.python.org/.
26 | 
27 | **Tip:** Select the option to add Python to your PATH variable. If you choose the advanced installation option, make sure to install the pip utility.
28 | 
29 | 2. Make sure the following modules are installed for Python: `argparse`, `backoff`, `base64`, `codecs`, `csv`, `gzip`, `json`, `os`,
30 | `pandas` (version 1.3.0 or later), `PyJWT`, `requests`, `sys`, `time`, and `tqdm`.
31 | 
32 | In most cases, many of the modules are installed by default. To list all packages that are installed with Python
33 | (through pip or by default), use this command:
34 | ```python -c "help('modules')"```
35 | 
36 | **Tip:** In most situations, you can install the non-default packages with this command:
37 | ```pip install backoff pandas PyJWT requests tqdm```
38 | 
39 | 
40 | 3. Create an access point in SAS Customer Intelligence 360.
41 | 1. From the user interface, navigate to **General Settings** > **External Access** > **Access Points**.
42 | 2. Create a new access point if one does not exist.
43 | 3. Get the following information from the access point:
44 | ```
45 | External gateway address: e.g. https://extapigwservice-/marketingGateway
46 | Name: ci360_agent
47 | Tenant ID: abc123-ci360-tenant-id-xyz
48 | Client secret: ABC123ci360clientSecretXYZ
49 | ```
50 | 4. Download the Python script from this repository and save it to your local machine.
51 | 
52 | 5. In the `./dsccnfg/config.txt` file, set the following variables for your tenant:
53 | ```
54 | agentName = ci360_agent
55 | tenantId = abc123-ci360-tenant-id-xyz
56 | secret = ABC123ci360clientSecretXYZ
57 | baseUrl = https://extapigwservice-/marketingGateway/discoverService/dataDownload/eventData/
58 | ```
59 | 
60 | 6. Verify the installation by running the following command from a command prompt:
61 | ```py discover.py -h```
62 | 
63 | 
64 | ## Using the Download Script
65 | 
66 | ### Considerations
67 | Before starting a download, note the following:
68 | * When you use the option to create a CSV, choose a delimiter that is not present in the source data.
69 | * If data resets were processed and you download data in append mode, the old data is not deleted.
70 | The new reset data for the same time period will be appended to the file.
71 | * If you download data using schema 1 and then use append mode to download data using schema 6, the data is appended based on schema 6. However, the header rows in the existing file will not be updated.
72 | 
73 | ### Running the Script
74 | 
75 | 1. Open a command prompt.
76 | 2. Run the discover.py script with parameter values that are based on the tables that you want to download.
For example, to download the detail tables with a start and end date range, you can run the following command:
77 | ```
78 | py discover.py -m detail -st 2019-12-01T00 -et 2019-12-01T12
79 | ```
80 | 
81 | ---
82 | **Note:** On Unix-like systems and Macs, the `py` or `python` command might default to Python 2 if that version is installed. Uninstall earlier versions of Python, or explicitly call Python 3 when you run the script, as in this example:
83 | ```
84 | python3 discover.py -m detail -st 2019-12-01T00 -et 2019-12-01T12
85 | ```
86 | 
87 | You can verify which version runs by default with the following command: `python --version`
88 | 
89 | ---
90 | 
91 | 
92 | 
93 | These are the parameters to use when you run the discover.py script:
94 | 
95 | | Parameter | Description |
96 | | :---------- | :-----------------|
97 | | -h | Displays the help |
98 | | -m | The table set to download. Use one of these values:
  • detail (This value downloads the Detail mart tables and the partitioned CDM tables: cdm_contact_history and cdm_response_history.)
  • dbtReport
  • snapshot (for CDM tables that are not partitioned, identity tables, and metadata tables)
| 99 | | -svn | Specify a specific schema of tables to download. | 100 | | -st | The start value in this datetime format: `yyyy-mm-ddThh` | 101 | | -et | The end value in this datetime format: `yyyy-mm-ddThh` | 102 | | -ct | The category of tables to download. When the parameter is not specified, you download tables for all the categories that you have a license to access.

To download tables from a specific category, you can use one of these values:
  • cdm
  • discover
  • engagedigital
  • engagedirect
  • engagemetadata
  • engagemobile
  • engageweb
  • engageemail
  • optoutdata
  • plan

For more information, see [Schemas and Categories](https://go.documentation.sas.com/?cdcId=cintcdc&cdcVersion=production.a&docsetId=cintag&docsetTarget=dat-export-api-sch.htm).|
103 | | -d | Download only the changes (the delta) from the previous download. Set the value to `yes` or `no`. |
104 | | -l | For partitioned tables, specify a limit of partitions to download. For example, `-l 150` downloads only the first 150 partitions of a specific set.|
105 | | -a | Append the download to the existing files. Set the value to `yes` or `no`. |
106 | | -cf | Create a CSV file from the download tables. Set the value to `yes` or `no`. |
107 | | -cd | Specify a delimiter other than the default (the pipe character). |
108 | | -ch | Include a column header in the first row. Set the value to `yes` or `no`. |
109 | | -cl | Clean up the downloaded .zip files. By default, the files are deleted, but you can set this parameter to `no` to keep them. |
110 | 
111 | **Note:** The start and end ranges are used only for the script's first run. After the first run, the download history is stored in the dsccnfg directory. To force the script to use the start date and end date variables, delete or move the history information.
112 | 
113 | In addition, the values in the dataRangeStartTimeStamp and dataRangeEndTimeStamp columns in the download history tables are in the UTC time zone. The values in the download_dttm column are in the local time zone.
114 | 
115 | ### Examples
116 | 
117 | * Download the detail tables:
118 | ```py discover.py -m detail```
119 | 
120 | * Download the discover base tables:
121 | ```py discover.py -m dbtReport```
122 | 
123 | * Download the snapshot tables:
124 | ```py discover.py -m snapshot```
125 | 
126 | * Download the complete set of the CDM tables (both partitioned tables and non-partitioned tables):
127 | ```py discover.py -m snapshot -ct cdm```
128 | ```py discover.py -m detail -ct cdm```
129 | 
130 | * Download the detail tables (with only the delta from the last download), create a CSV file, and append to the existing files:
131 | ```py discover.py -m detail -d yes -cf yes -a yes```
132 | 
133 | * Download the detail tables for the specific time range from start hour (`-st`) to end hour (`-et`):
134 | ```py discover.py -m detail -st 2019-12-01T00 -et 2019-12-01T12```
135 | 
136 | * Download the discover base tables, create a CSV file, use the ";" (semicolon) delimiter, and include a column header in
137 | the first row:
138 | ```py discover.py -m dbtReport -cf yes -cd ";" -ch yes```
139 | 
140 | * This example is similar to the previous example, but the option `-cl no` keeps the downloaded zip files in the download
141 | folder:
142 | ```py discover.py -m dbtReport -cf yes -cd ";" -ch yes -cl no```
143 | 
144 | * Download the detail tables with a specific schema (`-svn`), and specify a limit (`-l`) to download only the most recent
145 | 150 partitions:
146 | ```py discover.py -m detail -svn 3 -l 150 -cf yes -cd "," -ch yes```
147 | 
148 | * Download the Plan data tables, create a CSV file, use the ";" (semicolon) delimiter, and include a column header in
149 | the first row:
150 | ```py discover.py -m snapshot -ct plan -svn 5 -cf yes -cd ";" -ch yes```
151 | 
152 | These commands can also be run from a small wrapper script; see the sketch after the License section below.
153 | 
154 | ## Contributing
155 | 
156 | We welcome your contributions! Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on how to submit contributions to this project.
157 | 
158 | 
159 | 
160 | ## License
161 | 
162 | This project is licensed under the [Apache 2.0 License](LICENSE).
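
## Scripted Download Example

The command-line examples above can also be scheduled from a small wrapper script. The sketch below is not part of this repository and is only illustrative: it shells out to `discover.py` with the documented `-m`, `-d`, `-cf`, and `-a` options, and it assumes Python 3, that it is run from the directory that contains `discover.py`, and that `./dsccnfg/config.txt` is already configured.

```
#!/usr/bin/env python3
# Illustrative wrapper (not part of this repository): run a delta download
# and append the results to the existing per-table CSV files.
import subprocess
import sys
from datetime import datetime

def run_delta_download(mart="detail"):
    # Mirrors the documented example: py discover.py -m detail -d yes -cf yes -a yes
    # Note: -d yes requires that a download history already exists in dsccnfg/.
    cmd = [sys.executable, "discover.py", "-m", mart,
           "-d", "yes", "-cf", "yes", "-a", "yes"]
    print(datetime.now().isoformat(), "running:", " ".join(cmd))
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(run_delta_download())
```

Using `sys.executable` keeps the wrapper and `discover.py` on the same interpreter, which avoids the Python 2 versus Python 3 ambiguity noted earlier.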
163 | 164 | 165 | 166 | ## Additional Resources 167 | For more information, see [Downloading Data Tables with the REST API](https://go.documentation.sas.com/?softwareId=ONEMKTMID&softwareVersion=production.a&softwareContextId=DownloadDataTables) in the Help Center for SAS Customer Intelligence 360. 168 | -------------------------------------------------------------------------------- /SUPPORT.md: -------------------------------------------------------------------------------- 1 | ## Support 2 | 3 | We use GitHub for tracking bugs and feature requests. Please submit a GitHub issue or pull request for support. -------------------------------------------------------------------------------- /discover.py: -------------------------------------------------------------------------------- 1 | #! python3 2 | ''' 3 | Copyright © 2019, SAS Institute Inc., Cary, NC, USA. All Rights Reserved. 4 | SPDX-License-Identifier: Apache-2.0 5 | Created on Nov 3, 2017 6 | last update on Feb 5, 2025 7 | Version 0.9 8 | @authors: Mathias Bouten , Shashikant Deore 9 | NOTE: This Discover Download script should help to better understand 10 | the download API and can be used as a base to start interacting 11 | with the API and download collected customer information. It is 12 | not officially supported by SAS. 13 | ''' 14 | 15 | import requests, base64 16 | import json, jwt, gzip, csv, codecs 17 | import os, sys, argparse, time 18 | from argparse import RawTextHelpFormatter 19 | from datetime import datetime, timedelta 20 | from tqdm import tqdm 21 | from urllib.parse import urlsplit 22 | import pandas 23 | import backoff 24 | 25 | #version and update date 26 | version = 'V0.92' 27 | updateDate = '05 Feb 2025' 28 | downloadClient = 'ci360pythonV0.92' 29 | 30 | # default values 31 | limit = "20" 32 | csvflag = 'no' 33 | delimiter = '|' 34 | csvheader = 'yes' 35 | append = 'no' 36 | sohDelimiter = "\x01" 37 | progressbar = 'no' 38 | allhourstatus='true' 39 | #subhourrange="60" 40 | #autoreset = 'yes' 41 | dayOffset = "60" 42 | max_retry_attempts = 4 43 | 44 | # folders 45 | dir_log = 'log/' 46 | dir_csv = 'dscwh/' 47 | dir_zip = 'dscdonl/' 48 | dir_config = 'dsccnfg/' 49 | dir_extr = 'dscextr/' 50 | dir_sql = 'sql/' 51 | 52 | # global variables 53 | querystring = {} 54 | resetQueryString = {} 55 | gSql = '' 56 | gSqlInsert = '' 57 | cleanFiles = 'yes' 58 | responseText ='' 59 | 60 | ##### functions ##### 61 | 62 | # function to do the version specific changes to the already created objects if any 63 | def versionUpdate(): 64 | 65 | # Change#1 : update download_history_detail.csv & download_history_detail.csv to add dataRangeProcessingStatus column 66 | ColumnDelimiter=';' 67 | for martNm in ('detail','dbtReport'): 68 | historyFile='download_history_' + martNm 69 | historyFilePath = dir_config + historyFile + '.csv' 70 | if fileExists(historyFilePath): 71 | # read first line and see if it contains the required number of columns 72 | with open(historyFilePath) as f: 73 | historyHeader = f.readlines()[1] 74 | f.close 75 | 76 | countofDelmHeader = historyHeader.count(ColumnDelimiter) 77 | if not ( countofDelmHeader == 3 ) : 78 | 79 | backupFile=logFileNmtimeStamped(historyFile,fmt='{filename}_%Y%m%dT%H%M%S') 80 | backupFilePath=dir_config + backupFile + '.csv' 81 | 82 | logger(' backing up ' + historyFile + ' as ' + backupFile , 'n', True) 83 | # back up the existing file 84 | with open(historyFilePath) as hf: 85 | with open(backupFilePath, "w") as bkf: 86 | for line in hf: 87 | bkf.write(line) 88 | 89 | logger(' updating 
' + historyFile + ' to new version' , 'n', True) 90 | # from the backup re-write existing history file with required number of columns in header and in data lines 91 | headerline = 'dataRangeStart;dataRangeEnd;download_dttm;dataRangeProcessingStatus' + "\n" 92 | rows = 0 93 | with open(backupFilePath) as bkf: 94 | with open(historyFilePath, "w") as hf: 95 | for line in bkf: 96 | rows = rows+1 97 | if (rows == 1): 98 | hf.write(headerline) 99 | else: 100 | # append delimiter to line 101 | newline = line.replace('\n','') + ColumnDelimiter + "\n" 102 | hf.write(newline) 103 | 104 | def getNextDataRangeStart(): 105 | # set nextDataRangeStart = lastDataRangeEnd + 1 ms 106 | historyFile = dir_config + 'download_history_' + martName + '.csv' 107 | 108 | try: 109 | with open(historyFile) as f: 110 | last = f.readlines()[-1] 111 | 112 | lastDataRangeEnd = last.split(';',3)[1] 113 | adjustedTime = datetime.strptime(lastDataRangeEnd, '%Y-%m-%dT%H:%M:%S.%fZ') 114 | adjustedTime += pandas.to_timedelta(1, unit='ms') 115 | adjustedTimeStr = adjustedTime.strftime('%Y-%m-%dT%H:%M:%S.000Z') 116 | return adjustedTimeStr 117 | except FileNotFoundError as e: 118 | print('\n', e) 119 | raise SystemExit('\nFATAL: When you use the -d parameter, a history file must exist.') 120 | 121 | def logFileNmtimeStamped(filename, fmt='{filename}_%Y%m%dT%H%M%S.log'): 122 | #return datetime.datetime.now().strftime(fmt).format(filename=filename) 123 | return datetime.now().strftime(fmt).format(filename=filename) 124 | 125 | def readConfig(configFile): 126 | keys = {} 127 | seperator = '=' 128 | with open(configFile) as f: 129 | for line in f: 130 | if line.startswith('#') == False and seperator in line: 131 | # Find the name and value by splitting the string 132 | name, value = line.split(seperator, 1) 133 | # Assign key value pair to dict 134 | keys[name.strip()] = value.strip() 135 | return keys 136 | 137 | def printDownloadDetails(json_data): 138 | 139 | #if there is a message attribute in json response log the message 140 | if json_data.get('message') : 141 | logger('WARNING:' + str(json_data['message']),'n' ) 142 | 143 | if martName == 'identity' or martName == 'snapshot': 144 | logger(' download of dataMart snapshot','n') 145 | else: 146 | TotalDownloadPackages=json_data['count'] 147 | CurrentPageDownloadPackages=len(json_data['items']) 148 | logger(' download of dataMart ' + martName \ 149 | + ' - total downloads ' + str(TotalDownloadPackages) + ' package(s)' \ 150 | + ' - downloading ' + str(CurrentPageDownloadPackages) + ' package(s)','n') 151 | 152 | # logging 153 | logger(' request URL: ' + url, 'n', False) 154 | logger(' config: ' + json.dumps(config), 'n', False) 155 | logger(' arguments: ' + str(args),'n',False) 156 | 157 | def printResetDetails(json_data): 158 | # print details of the json response like number of total reset packages & number of reset package on current page 159 | TotalResetPackages=json_data['count'] 160 | CurrentPageResetPackages=len(json_data['items']) 161 | logger(' reset of dataMart ' + martName \ 162 | + ' - total resets ' + str(TotalResetPackages) + ' package(s)' \ 163 | + ' - current page resets ' + str(CurrentPageResetPackages) + ' package(s)','n') 164 | 165 | # logging 166 | logger(' request URL: ' + url, 'n', False) 167 | logger(' config: ' + json.dumps(config), 'n', False) 168 | logger(' arguments: ' + str(args),'n',False) 169 | 170 | # extract the query parameters from url & print ? 
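# The reset and download responses used below are JSON objects of roughly this
# shape (inferred from the fields accessed in loopThroughDownloadPackages and
# loopThroughResetPackages; see the SAS CI360 download API documentation for
# the authoritative schema):
#   {
#     "count": <total number of packages>,
#     "items": [
#       { "dataRangeStartTimeStamp": "...",
#         "dataRangeEndTimeStamp": "...",
#         "dataRangeProcessingStatus": "...",  # download responses, optional
#         "schemaUrl": "...", "entities": [...],  # download responses
#         "resetCompletedTimeStamp": "...", "downloadUrl": "/..." }  # reset responses
#     ],
#     "links": [ { "rel": "next", "href": "/..." } ]
#   }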
171 | 172 | 173 | def createDiscoverAPIUrl(config): 174 | baseUrl = config['baseUrl'] 175 | if martName == 'detail': 176 | url = baseUrl + 'detail/partitionedData' 177 | elif martName == 'dbtReport': 178 | url = baseUrl + 'dbtReport' 179 | elif martName == 'identity' or martName =='snapshot': 180 | url = baseUrl + 'detail/nonPartitionedData' 181 | else: 182 | print('Error: wrong martName ') 183 | sys.exit() 184 | return url 185 | 186 | def logger(line, action, console=True): 187 | #logfile = dir_log + 'discover_download.log' 188 | logfilePath = dir_log + logfile 189 | nowDttm = str(datetime.now()) 190 | with open(logfilePath,'a') as log: 191 | if action == 'n': 192 | log.write('\n' + nowDttm + ': ' + line) 193 | if console == True: 194 | print('\n' + line, sep='', end='', flush=True) 195 | elif action == 'a': 196 | log.write(line) 197 | if console == True: 198 | print(line, sep='', end='', flush=True) 199 | 200 | def log_retry_attempt(details): 201 | #print ("Backing off {wait:0.1f} seconds afters {tries} tries " 202 | # "calling function {target} with args {args} and kwargs " 203 | # "{kwargs}".format(**details)) 204 | logger(' Caught retryable error after ' + str(details["tries"]) + ' tries. Waiting ' + str(round(details["wait"],2)) + ' more seconds then retrying...', 'n', True) 205 | logger(' responseText: ' + str(responseText) ,'n', True) 206 | 207 | def after_all_retries(details): 208 | _, exception, _ = sys.exc_info() 209 | logger(' error executing ' + str(details["target"]),'n',True) 210 | logger(' exception ' + str(exception) ,'n', True) 211 | sys.exit(1) 212 | 213 | @backoff.on_exception( 214 | backoff.expo 215 | ,requests.exceptions.RequestException 216 | ,max_tries=max_retry_attempts 217 | ,factor=5 218 | ,on_backoff=log_retry_attempt 219 | ,on_giveup=after_all_retries 220 | #,jitter=backoff.full_jitter 221 | #,giveup=lambda e: e.response is not None and e.response.status_code < 500 222 | #,max_time=30 223 | ) 224 | def getResetUrls(url): 225 | global responseText 226 | resetQueryString["agentName"] = config['agentName'] 227 | if martName == 'dbtReport' : 228 | resetQueryString["martType"] = 'dbt-report' 229 | else: 230 | resetQueryString["martType"] = martName 231 | 232 | resetQueryString["dayOffset"] = dayOffset 233 | 234 | getURLs_start = time.time() # track get download URL request time 235 | 236 | logger(' send reset request to Discover API with querystring: ','n') 237 | logger(' ' + json.dumps(resetQueryString),'n') 238 | response = requests.request("GET", url, headers=headers, params=resetQueryString) 239 | # to test retry meachanism force the response status code 240 | # response.status_code=409 241 | #print_response(response) 242 | responseText=response.text 243 | response.raise_for_status() 244 | 245 | getURLs_end = time.time() # track get download URL request time 246 | getURLs_duration = round((getURLs_end - getURLs_start),2) 247 | logger(' getResetUrls request duration: ' + str(getURLs_duration) + ' seconds','n') 248 | 249 | json_data = json.loads(response.text) 250 | 251 | response_file = dir_config + 'ResetResponse.json' 252 | with open(response_file, "w") as f: 253 | f.write(json.dumps(json_data, indent=4, sort_keys=True)) 254 | 255 | if 'error' in json_data: 256 | print('\n Error: ' + json_data['error'] + " - " + json_data['message']) 257 | print('\n Check connection details! 
\n') 258 | sys.exit() 259 | 260 | return json_data 261 | 262 | # Function to log the request information to console if required 263 | def print_request(req): 264 | print('HTTP/1.1 {method} {url}\n{headers}\n\n{body}'.format( 265 | method=req.method, 266 | url=req.url, 267 | headers='\n'.join('{}: {}'.format(k, v) for k, v in req.headers.items()), 268 | body=req.body, 269 | )) 270 | 271 | def print_response(res): 272 | print('HTTP/1.1 {status_code}\n{headers}\n\n{body}'.format( 273 | status_code=res.status_code, 274 | headers='\n'.join('{}: {}'.format(k, v) for k, v in res.headers.items()), 275 | body=res.content, 276 | )) 277 | Response_info= 'HTTP/1.1 {status_code}\n{headers}\n\n{body}'.format(status_code=res.status_code, headers='\n'.join('{}: {}'.format(k, v) for k, v in res.headers.items()),body=res.content) 278 | logger(Response_info, 'n') 279 | 280 | @backoff.on_exception(backoff.expo,requests.exceptions.RequestException,max_tries=max_retry_attempts,factor=5,on_backoff=log_retry_attempt,on_giveup=after_all_retries) 281 | def getDownloadUrls(url): 282 | global responseText 283 | querystring["limit"] = limit 284 | querystring["agentName"] = config['agentName'] 285 | getURLs_start = time.time() # track get download URL request time 286 | 287 | logger(' send download request to Discover API with querystring: ','n') 288 | logger(' ' + json.dumps(querystring),'n') 289 | response = requests.request("GET", url, headers=headers, params=querystring) 290 | # to test retry meachanism force the response status code 291 | # response.status_code=409 292 | responseText=response.text 293 | 294 | response.raise_for_status() 295 | 296 | 297 | getURLs_end = time.time() # track get download URL request time 298 | getURLs_duration = round((getURLs_end - getURLs_start),2) 299 | logger(' getDownloadUrls request duration: ' + str(getURLs_duration) + ' seconds','n') 300 | 301 | json_data = json.loads(response.text) 302 | 303 | response_file = dir_config + 'response.json' 304 | with open(response_file, "w") as f: 305 | f.write(json.dumps(json_data, indent=4, sort_keys=True)) 306 | 307 | if 'error' in json_data: 308 | print('\n Error: ' + json_data['error'] + " - " + json_data['message']) 309 | print('\n Check connection details! 
\n') 310 | sys.exit() 311 | 312 | return json_data 313 | 314 | def createDiscoverResetAPIUrl(config): 315 | baseUrl = config['baseUrl'] 316 | if martName == 'detail': 317 | #url = baseUrl + 'partitionedData/resets' 318 | url = baseUrl + 'partitionedData/resets' 319 | elif martName == 'dbtReport': 320 | #url = baseUrl + 'dbtReport/resets' 321 | url = baseUrl + 'partitionedData/resets' 322 | else: 323 | print('Error: wrong martName ') 324 | sys.exit() 325 | return url 326 | 327 | def createDiscoverAPIUrlFromHref(config,href): 328 | # function to generate the reset API 329 | baseUrl = config['baseUrl'] 330 | baseUrlHost = "{0.scheme}://{0.netloc}".format(urlsplit(baseUrl)) 331 | url= baseUrlHost + href 332 | return url 333 | 334 | @backoff.on_exception(backoff.expo,requests.exceptions.RequestException,max_tries=max_retry_attempts,factor=5,on_backoff=log_retry_attempt,on_giveup=after_all_retries) 335 | def getSchema( url, tablename ): 336 | global responseText 337 | global gSql 338 | global gSqlInsert 339 | r = requests.get(url) 340 | # to test retry meachanism force the response status code 341 | # r.status_code=409 342 | responseText=r.text 343 | r.raise_for_status() 344 | 345 | json_meta = json.loads(r.text) 346 | columnHeader = '' 347 | sqlTable = 'create table ' + tablename + '(' 348 | sqlInsert = 'insert into ' + tablename + ' values (' 349 | sqlColumn = '' 350 | sqlInsertColumn = '' 351 | 352 | for item in json_meta: 353 | meta_table = item['table_name'] 354 | if tablename.lower() == meta_table.lower(): 355 | column = item['column_name'] 356 | columnType = item['column_type'] 357 | sqlColumn = sqlColumn + '\n ' + column + ' ' + columnType + ', ' 358 | sqlInsertColumn = sqlInsertColumn + '%s,' 359 | columnHeader = columnHeader + column + delimiter 360 | 361 | #finish create table statement 362 | gSql = gSql + sqlTable + sqlColumn[:-2] + ');\n\n' 363 | gSqlInsert = sqlInsert + sqlInsertColumn[:-1] + ')' 364 | 365 | #remove last delimiter and return line 366 | return columnHeader[:-len(delimiter)] 367 | 368 | @backoff.on_exception(backoff.expo,requests.exceptions.RequestException,max_tries=max_retry_attempts,factor=5,on_backoff=log_retry_attempt,on_giveup=after_all_retries) 369 | def downloadWithProgress( url, outputfile, writeType): 370 | global responseText 371 | r = requests.get(url, stream=True) 372 | # to test retry meachanism force the response status code 373 | # r.status_code=409 374 | #responseText=r.text 375 | r.raise_for_status() 376 | # Total size in bytes. 
377 | file_size = int(r.headers.get('content-length', 0)) 378 | block_size = 1024 379 | wrote = 0 380 | with open(outputfile, writeType) as f: 381 | #for data in tqdm(iterable = r.iter_content(block_size), total= file_size/block_size , unit='KB', unit_scale=True, leave=True): 382 | for i in tqdm(range(file_size), ncols = 100, unit='KB'): 383 | data = r.raw.read(block_size) # read content block in bytes 384 | wrote = wrote + len(data) # update actual number of written bytes 385 | i = i + block_size # update iterator to continue loop from right point 386 | f.write(data) # write data to file 387 | f.flush() 388 | if file_size != 0 and wrote != file_size: 389 | print("ERROR, something went wrong during download - try again") 390 | 391 | @backoff.on_exception(backoff.expo,requests.exceptions.RequestException,max_tries=max_retry_attempts,factor=5,on_backoff=log_retry_attempt,on_giveup=after_all_retries) 392 | def download( url, outputfile, writeType): 393 | global responseText 394 | r = requests.get(url) 395 | # to test retry meachanism force the response status code 396 | #r.status_code=409 397 | #responseText=r.text 398 | r.raise_for_status() 399 | #with open(outputfile, writeType) as f: 400 | # f.write(r.content) # write data to file 401 | 402 | # Retry for PermissionError in the open statement 403 | retry_attempts = 5 404 | for attempt in range(retry_attempts): 405 | try: 406 | with open(outputfile, writeType) as f: 407 | f.write(r.content) # Write data to file 408 | break # If the operation succeeds, exit the loop 409 | except PermissionError: 410 | if attempt < retry_attempts - 1: # Don't sleep after the last attempt 411 | print(f"PermissionError occurred. Retry {attempt + 1}/{retry_attempts} after 1 second.") 412 | time.sleep(2) # Wait for 1 second before retrying 413 | else: 414 | raise # If it fails after the max retries, raise the exception 415 | 416 | def unzipFile(in_file,out_file,in_delimiter,out_delimiter,header): 417 | #read file line by line and write columns as per schema 418 | error = 0 419 | errorMsg = '' 420 | 421 | with gzip.open(in_file, "rb") as in_f, \ 422 | open(out_file, "wb") as out_f: 423 | 424 | # go line by line and make changes in line to match header columns and data columns 425 | rows = 0 426 | for line in in_f: 427 | line2=str(line, 'utf-8') 428 | rows = rows+1 429 | # when its first row check if number of header cols are different from number of data columns 430 | if rows == 1: 431 | countofDelmHeader = header.count(out_delimiter) 432 | countofDelmData = str(line2).count(in_delimiter) 433 | 434 | # when schema is old but datafile is newer version then remove the extra columns 435 | if countofDelmHeader < countofDelmData : 436 | #split the fields into list 437 | fields = line2.split(in_delimiter) 438 | #limit the list to required number of columns as per header 439 | #join fileds and create a line string again 440 | line2=in_delimiter.join(fields[0:countofDelmHeader + 1]) + "\n" 441 | # when schema is newer but datafile is older version then add the extra empty columns 442 | elif countofDelmHeader > countofDelmData : 443 | # remove the existing newline char ..add the empty columns and in the end of line add the new line char 444 | line2 = line2.replace('\n','') + in_delimiter*(countofDelmHeader - countofDelmData) + "\n" 445 | 446 | try: 447 | #out_f.write(line2.replace(in_delimiter, in_delimiter).encode()) 448 | out_f.write(line2.encode()) 449 | except (UnicodeEncodeError) as e: 450 | error = error+1 451 | errorMsg = errorMsg +'\nerror in row: ' + str(rows) + ' 
- ' + str(e) 452 | 453 | logger("...unzipped file with " + str(rows) + " rows - errors: " + str(error),'a') 454 | 455 | def createCSV(in_file, out_file, in_delimiter, out_delimiter, header): 456 | #read unzipped file line by line and replace delimiter 457 | error = 0 458 | errorMsg = '' 459 | with codecs.open(in_file, 'r','utf-8') as in_f, \ 460 | codecs.open(out_file, 'w','utf-8') as out_f: 461 | # print column header line in csv file if flag is yes 462 | if csvheader == 'yes': 463 | out_f.write(header+"\n") 464 | # go line by line and replace delimiter 465 | rows = 0 466 | for line in in_f: 467 | rows = rows+1 468 | try: 469 | out_f.write(line.replace(in_delimiter, out_delimiter)) 470 | except (UnicodeEncodeError) as e: 471 | error = error+1 472 | errorMsg = errorMsg +'\nerror in row: ' + str(rows) + ' - ' + str(e) 473 | 474 | logger("...CSV created with " + str(rows) + " rows - errors: " + str(error),'a') 475 | 476 | if error != 0: 477 | logError(in_file, errorMsg) 478 | logger(" Error during csv creation process - see separate log file", 'n') 479 | 480 | if cleanFiles == 'yes': 481 | os.remove(in_file) 482 | 483 | def createSingleTableFiles( entity, schemaUrl ): 484 | name = entity['entityName'] 485 | tablefile = dir_csv+name+'.csv' 486 | if not os.path.exists(tablefile): 487 | header = getSchema(schemaUrl, name) 488 | with open(tablefile,'w') as f: 489 | f.write(header+"\n") 490 | 491 | def appendCSV(in_file, out_file, in_delimiter, out_delimiter): 492 | #read unzipped file line by line and replace delimiter 493 | error = 0 494 | errorMsg = '' 495 | with codecs.open(in_file, 'r','utf-8') as in_f, \ 496 | codecs.open(out_file, 'a','utf-8') as out_f: 497 | # go line by line and replace delimiter 498 | rows = 0 499 | for line in in_f: 500 | rows = rows+1 501 | try: 502 | out_f.write(line.replace(in_delimiter, out_delimiter)) 503 | except (UnicodeEncodeError) as e: 504 | error = error+1 505 | errorMsg = errorMsg +'\nerror in row: ' + str(rows) + ' - ' + str(e) 506 | 507 | logger("...appended " + str(rows) + " rows - errors: " + str(error),'a') 508 | 509 | if error != 0: 510 | logError(in_file, errorMsg) 511 | logger(" Error during append process - see separate log file", 'n') 512 | 513 | if cleanFiles == 'yes': 514 | os.remove(in_file) 515 | 516 | def logError(file, message): 517 | errorLog = dir_log + 'error_' + file + '.log' 518 | with open(errorLog, 'a') as f: 519 | f.write(message) 520 | 521 | def downloadEntity( entity, schemaUrl, prefix ): 522 | name = entity['entityName'] 523 | header = getSchema(schemaUrl, name) 524 | zippedFile = dir_extr+prefix+name+'.gz' 525 | unzippedFile = dir_zip+prefix+name+'.soh' 526 | csvFile = dir_csv+prefix+name+'.csv' 527 | sqlFile = dir_sql+prefix+'create_tables_'+martName+'.sql' 528 | tablefile = dir_csv+name+'.csv' 529 | items = len(entity['dataUrlDetails']) 530 | logger(' ' + name + ' - total items: ' + str(items),'n') 531 | 532 | i=0 533 | for dataUrlDetail in entity['dataUrlDetails']: 534 | i=i+1 535 | 536 | # when its first file in the hour create a new file else append to existing file 537 | if (i==1): 538 | writeType='wb' 539 | else: 540 | writeType='ab' 541 | 542 | url = dataUrlDetail['url'] 543 | logger(' item#: ' + str(i) , 'n') 544 | if progressbar == 'yes': 545 | #print(" - download with progress bar") 546 | #downloadWithProgress( url, zippedFile, "ab") 547 | downloadWithProgress( url, zippedFile, writeType) 548 | elif progressbar == 'no': 549 | #print(" - download without progress bar") 550 | #download( url, zippedFile, "ab") 551 | 
download( url, zippedFile, writeType) 552 | 553 | #unzip downloaded file 554 | ''' sinshd - created a new function to do line by line unzipping 555 | with gzip.open(zippedFile, "rb") as zipped, \ 556 | open(unzippedFile, "wb") as unzipped: 557 | #read zipped data 558 | unzipped_content = zipped.read() 559 | #save unzipped data into file 560 | unzipped.write(unzipped_content) 561 | logger("...unzipped",'a') 562 | ''' 563 | unzipFile(zippedFile,unzippedFile,sohDelimiter,delimiter,header) 564 | 565 | #create CSV file - replace SOH delimiter 566 | if csvflag == 'yes' and append == 'no': 567 | #sinshd - 568 | createCSV(unzippedFile, csvFile, sohDelimiter, delimiter, header) 569 | #createCSV2(unzippedFile, csvFile, sohDelimiter, delimiter, header) 570 | elif csvflag == 'yes' and append == 'yes': 571 | createSingleTableFiles(entity, schemaUrl) 572 | appendCSV(unzippedFile, tablefile, sohDelimiter, delimiter) 573 | 574 | 575 | #remove zipped file 576 | if cleanFiles == 'yes': 577 | os.remove(zippedFile) 578 | 579 | with open(sqlFile,'w') as f: 580 | f.write(gSql) 581 | 582 | return 583 | 584 | def logHistory(dataRangeStart, dataRangeEnd,dataRangeProcessingStatus): 585 | 586 | if resetInProgress == True : 587 | historyFile = dir_config + 'reset_download_history_' + martName + '.csv' 588 | else: 589 | historyFile = dir_config + 'download_history_' + martName + '.csv' 590 | 591 | #historyFile = dir_config + 'download_history_' + martName + '.csv' 592 | nowDttm = str(datetime.now()) 593 | 594 | # if martName = detail or dbtReport 595 | headerline = 'dataRangeStart;dataRangeEnd;download_dttm;dataRangeProcessingStatus' 596 | recordline = dataRangeStart + ';' + dataRangeEnd + ';' + nowDttm + ';' + dataRangeProcessingStatus 597 | 598 | # open file - if not exist create it with headerline 599 | if not os.path.exists(historyFile): 600 | with open(historyFile, 'w') as f: 601 | f.write(headerline+"\n") 602 | 603 | # append rows to history file 604 | with open(historyFile, 'a') as f: 605 | f.write(recordline+"\n") 606 | 607 | def logResetRange(dataRangeStart=None, dataRangeEnd=None, resetCompleted_dttm=None): 608 | historyFile = dir_config + 'reset_range_' + martName + '.csv' 609 | # if martName = detail or dbtReport 610 | headerline = 'dataRangeStart;dataRangeEnd;resetCompleted_dttm;download_dttm' 611 | 612 | # open file - if not exist create it with headerline 613 | if not os.path.exists(historyFile): 614 | with open(historyFile, 'w') as f: 615 | f.write(headerline+"\n") 616 | 617 | # append line only when dataRangeStart is set to something 618 | # this way we can call logResetRange to just create an header row only 619 | if not (dataRangeStart == None): 620 | nowDttm = str(datetime.now()) 621 | recordline = dataRangeStart + ';' + dataRangeEnd + ';' + resetCompleted_dttm + ';' + nowDttm 622 | # append rows to history file 623 | with open(historyFile, 'a') as f: 624 | f.write(recordline+"\n") 625 | 626 | def logHistorySnapshot(entity): 627 | historyFile = dir_config +'download_history_snapshot.csv' 628 | nowDttm = str(datetime.now()) 629 | lastModifiedTimestamp = entity['dataUrlDetails'][0]['lastModifiedTimestamp'] 630 | 631 | headerline = 'entityName;lastModifiedTimestamp;download_dttm' 632 | recordline = entity['entityName'] + ';' + lastModifiedTimestamp + ';' + nowDttm 633 | 634 | # open file - if not exist create it with headerline 635 | if not os.path.exists(historyFile): 636 | with open(historyFile, 'w') as f: 637 | f.write(headerline+"\n") 638 | 639 | # append rows to history file 640 | with 
open(historyFile, 'a') as f: 641 | f.write(recordline+"\n") 642 | 643 | def readDownloadHistory(historyFile): 644 | # function to read the mart history file as dataframe 645 | # this will be later used to do lookups 646 | # e.g. to check if the history records exits 647 | # historyFile = dir_config + 'download_history_' + martName + '.csv' 648 | df = pandas.read_csv(historyFile 649 | ,sep=';' 650 | ,header=0 651 | ,names=['dataRangeStart','dataRangeEnd','download_dttm','dataRangeProcessingStatus'] 652 | ,parse_dates =['dataRangeStart','dataRangeEnd','download_dttm']) 653 | return df 654 | 655 | def readResetRange(resetFile): 656 | # function to read the mart reset file as dataframe 657 | # this will be later used for lookups 658 | # e.g. to check if the reset records exits 659 | # resetFile = dir_config + 'reset_range_' + martName + '.csv' 660 | 661 | # weflower 2025-02-05: pandas 1.3.0+ deprecated error_bad_lines and replaced with on_bad_lines 662 | # https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html 663 | 664 | df = pandas.read_csv(resetFile 665 | ,sep=';' 666 | ,on_bad_lines='warn' 667 | ,header=0 668 | ,names=['dataRangeStart','dataRangeEnd','resetCompleted_dttm','download_dttm'] 669 | ,parse_dates =['dataRangeStart','dataRangeEnd','resetCompleted_dttm','download_dttm']) 670 | return df 671 | 672 | def fileExists( fileName ): 673 | # function to check if the input fileName exists 674 | fileExists=False 675 | if os.path.exists(fileName): 676 | fileExists=True 677 | return fileExists 678 | 679 | def loopThroughDownloadPackages(url): 680 | 681 | # call CI360 Discover API to get Reset URLs 682 | json_data = getDownloadUrls(url) 683 | 684 | # print details of the json response like number of total download packages & number of download package on current page 685 | printDownloadDetails(json_data) 686 | 687 | #loop through download packages 688 | packageNumber=0 689 | for item in json_data['items']: 690 | schemaUrl = item['schemaUrl'] 691 | prefix = '' 692 | 693 | # only for detail and dbtReport data mart display the ranges 694 | if martName == 'detail' or martName == 'dbtReport': 695 | rangeStartDt = item['dataRangeStartTimeStamp'] 696 | rangeStart = rangeStartDt.replace(':','-').replace('.000Z','') 697 | rangeEndDt = item['dataRangeEndTimeStamp'] 698 | rangeEnd = rangeEndDt.replace(':','-').replace('.999Z','') 699 | 700 | if not is_json_key_present(item, 'dataRangeProcessingStatus'): 701 | processingStatus = '' 702 | else: 703 | processingStatus = item['dataRangeProcessingStatus'] 704 | 705 | prefix = rangeStart+ "_" 706 | packageNumber=packageNumber+1 707 | str_packageNumber = str(packageNumber) 708 | # add a zero infront of package number if number is lower 10 709 | if packageNumber < 10: 710 | str_packageNumber = '0' + str_packageNumber 711 | 712 | logger('********** Tables of package ' + str_packageNumber \ 713 | + ' - start: ' + str(rangeStart) + ' **********', 'n') 714 | # dataRangeProcessingStatus : DATA_AVAILABLE NO_DATA ERROR RESET_INPROGRESS 715 | logger(' dataRangeProcessingStatus : ' + processingStatus , 'n') 716 | 717 | for entity in json_data['items'][packageNumber-1]['entities']: 718 | ###createSingleTableFiles(entity, schemaUrl) 719 | downloadEntity(entity, schemaUrl, prefix) 720 | if martName == 'identity' or martName == 'snapshot' : 721 | logHistorySnapshot(entity) 722 | 723 | if martName == 'detail' or martName == 'dbtReport': 724 | logHistory(rangeStartDt, rangeEndDt,processingStatus) 725 | 726 | logger('********** Finished Downloading Current Page 
**********', 'n') 727 | 728 | for link in json_data['links']: 729 | if link['rel'] == 'next' : 730 | nextHref = link['href'] 731 | #form the next url using href 732 | url=createDiscoverAPIUrlFromHref(config,nextHref) 733 | loopThroughDownloadPackages(url) 734 | 735 | def is_json_key_present(json, key): 736 | try: 737 | buf = json[key] 738 | except KeyError: 739 | return False 740 | 741 | return True 742 | 743 | def loopThroughResetPackages(url): 744 | 745 | # call CI360 Discover API to get Reset URLs 746 | json_data = getResetUrls(url) 747 | # print details of the json response like number of total reset packages & number of reset package on current page 748 | printResetDetails(json_data) 749 | 750 | #loop through reset packages 751 | packageNumber = 0 752 | for item in json_data['items']: 753 | packageNumber=packageNumber + 1 754 | dataRangeStart = item['dataRangeStartTimeStamp'] 755 | dataRangeEnd = item['dataRangeEndTimeStamp'] 756 | resetCompleted_dttm = item['resetCompletedTimeStamp'] 757 | 758 | str_packageNumber = str(packageNumber) 759 | # add a zero infront of package number if number is lower 10 760 | if packageNumber < 10: 761 | str_packageNumber = '0' + str_packageNumber 762 | logger('********** Reset of package ' + str_packageNumber \ 763 | + ' - start: ' + str(dataRangeStart) \ 764 | + ' - end: ' + str(dataRangeEnd) \ 765 | + ' - resetCompleted_dttm: ' + str(resetCompleted_dttm) + ' **********' , 'n') 766 | 767 | # get the download packages from the current downloadUrl and loop through all download packages 768 | downloadUrlHref=item['downloadUrl'] 769 | downloadUrl=createDiscoverAPIUrlFromHref(config,downloadUrlHref) 770 | 771 | logger(' checking if reset range exists in download history... ', 'n') 772 | # check if the input range exists in download history 773 | # create a dataframe with the filtered data from download history dataframe 774 | hist_dataRange_df = download_history_df[(download_history_df['dataRangeStart']==dataRangeStart)] 775 | if (len(hist_dataRange_df.index) > 0 ): 776 | logger('exists ', 'a') 777 | # check if reset range is already downloaded (record exists in reset history) 778 | logger(' checking if reset range exists in reset history... 
', 'n') 779 | reset_dataRange_df = reset_range_df[(reset_range_df['dataRangeStart']==dataRangeStart) 780 | &(reset_range_df['dataRangeEnd']==dataRangeEnd) 781 | &(reset_range_df['resetCompleted_dttm']==resetCompleted_dttm)] 782 | if (len(reset_dataRange_df.index) == 0 ): 783 | logger(' does not exists ..starting download', 'a') 784 | # download reseted data 785 | loopThroughDownloadPackages(downloadUrl) 786 | # add the reset entry into reset history table 787 | logResetRange(dataRangeStart, dataRangeEnd, resetCompleted_dttm) 788 | else: 789 | logger(' exists..skipping reset ', 'a') 790 | else: 791 | logger(' does not exists..skipping reset ', 'a') 792 | # check if reset range is already downloaded (record exists in reset history) 793 | 794 | logger('********** Finished Reset of packages on current page **********', 'n') 795 | for link in json_data['links']: 796 | if link['rel'] == 'next' : 797 | nextResetRangeHref = link['href'] 798 | #form the next url using href 799 | url=createDiscoverAPIUrlFromHref(config,nextResetRangeHref) 800 | loopThroughResetPackages(url) 801 | 802 | 803 | ############################################### 804 | 805 | # set dynamic log file name to create a new log file in each run 806 | #logfile = dir_log + 'discover_download.log' 807 | #logfile = dir_log + logFileNmtimeStamped('discover_' + martName) 808 | #print ('logfile:',logfile) 809 | #logger('CI360 DISCOVER Download API (' + version + ') - last updated '+ updateDate,'n') 810 | 811 | #check command line arguments 812 | parser = argparse.ArgumentParser(description= 813 | 'Download CI360 Discover data for a specific data mart. \ 814 | \nOptional you can download data for a specific time range. \ 815 | \nOptional you can transform the downloaded data into a specific CSV file.\n \ 816 | \nExamples: py discover.py -m detail \ 817 | \n py discover.py -m detail -d yes -cf yes -a yes \ 818 | \n py discover.py -m detail -d yes -cf yes -a yes -pb yes \ 819 | \n py discover.py -m detail -l 2 \ 820 | \n py discover.py -m detail -st 2017-12-07T10 -et 2017-12-07T12 \ 821 | \n py discover.py -m dbtReport -cf yes -cd ";" -ch yes \ 822 | \n py discover.py -m snapshot ' 823 | , formatter_class=RawTextHelpFormatter) 824 | 825 | parser.add_argument('-m', action='store', dest='mart', type=str, 826 | help='enter dataMart: detail, dbtReport or snapshot', required=True) 827 | parser.add_argument('-l', action='store', dest='limit', type=int, 828 | help='enter a limit: ie. 30 - default 20', required=False) 829 | parser.add_argument('-cd', action='store', dest='delimiter', type=str, 830 | help='enter a csv delimiter - default | (pipe)', required=False) 831 | parser.add_argument('-cf', action='store', dest='csvflag', type=str, 832 | help='create csv: yes or no - default no', required=False) 833 | parser.add_argument('-ch', action='store', dest='csvheader', type=str, 834 | help='csv column header row: yes or no - default yes', required=False) 835 | parser.add_argument('-st', action='store', dest='start', type=str, 836 | help='enter start time: ie. 2017-11-07T10', required=False) 837 | parser.add_argument('-et', action='store', dest='end', type=str, 838 | help='enter end time: ie. 
839 | parser.add_argument('-a', action='store', dest='append', type=str,
840 |                     help='append to one file: yes or no - default no', required=False)
841 | parser.add_argument('-d', action='store', dest='delta', type=str,
842 |                     help='download delta: yes or no - default no', required=False)
843 | parser.add_argument('-cl', action='store', dest='clean', type=str,
844 |                     help='clean zip files: yes or no - default yes', required=False)
845 | parser.add_argument('-pb', action='store', dest='progressbar', type=str,
846 |                     help='show progress bar: yes or no - default no', required=False)
847 | 
848 | # added 2018-11-21 by Mathias Bouten - new API features
849 | #parser.add_argument('-ahs', action='store', dest='allhourstatus', type=str,
850 | #                    help='enter includeAllHoursStatus flag: ie. true - default false', required=False)
851 | parser.add_argument('-shr', action='store', dest='subhourrange', type=str, default=60,
852 |                     help='enter subHourlyDataRangeInMinutes: e.g. 10', required=False)
853 | parser.add_argument('-svn', action='store', dest='schemaversion', type=str, default=1,
854 |                     help='enter schemaVersion: e.g. 3 - default 1', required=False)
855 | parser.add_argument('-ar', action='store', dest='autoreset', type=str, default='yes',
856 |                     help='perform reset before download: yes or no - default yes', required=False)
857 | parser.add_argument('-ct', action='store', dest='category', type=str, default='discover',
858 |                     help='category to download: e.g. discover, engagedirect - default discover', required=False)
859 | # added 2020-01-27 - sinshd - new API features - test mode download
860 | parser.add_argument('-tm', action='store', dest='testmode', type=str,
861 |                     help='test mode download: e.g. PLANTESTMODE', required=False)
862 | 
863 | args = parser.parse_args()
864 | 
865 | if args.mart is not None:
866 |     martName = args.mart
867 |     download_msg = 'all'
868 |     print(' datamart: ' + martName)
869 | if args.limit is not None:
870 |     limit = str(args.limit)
871 |     querystring["limit"] = limit
872 |     print(' limit: ' + limit)
873 | if args.delimiter is not None:
874 |     delimiter = str(args.delimiter)
875 |     print(' delimiter: ' + delimiter)
876 | if args.csvflag is not None:
877 |     csvflag = str(args.csvflag)
878 |     print(' csvflag: ' + csvflag)
879 | if args.csvheader is not None:
880 |     csvheader = str(args.csvheader)
881 |     print(' csvheader: ' + csvheader)
882 | if args.clean is not None:
883 |     cleanFiles = str(args.clean)
884 |     print(' cleanFiles: ' + cleanFiles)
885 | if args.progressbar is not None:
886 |     progressbar = str(args.progressbar)
887 |     print(' progressbar: ' + progressbar)
888 | if args.start is not None:
889 |     dataRangeStartTimeStamp = str(args.start) + ":00:00.000Z"
890 |     querystring["dataRangeStartTimeStamp"] = dataRangeStartTimeStamp
891 |     print(' start: ' + dataRangeStartTimeStamp)
892 | if args.end is not None:
893 |     dataRangeEndTimeStamp = str(args.end) + ":00:00.000Z"
894 |     querystring["dataRangeEndTimeStamp"] = dataRangeEndTimeStamp
895 |     print(' end: ' + dataRangeEndTimeStamp)
896 | if args.append is not None:
897 |     append = str(args.append)
898 |     print(' append: ' + append)
899 | if args.delta is not None:
900 |     # sinshd - use the new function to get max(lastEnd) + 1 instead of last(start) + 1 hour,
901 |     # as the latter can create a gap when the download is in minute-level mode
902 |     # dataRangeStartTimeStamp = getLastDataRangeStart()
903 |     dataRangeStartTimeStamp = getNextDataRangeStart()
904 |     querystring["dataRangeStartTimeStamp"] = dataRangeStartTimeStamp
905 |     print(' start: ' + dataRangeStartTimeStamp)
906 |     # sinshd - getDataRangeEndOfNow uses the system clock, which can limit the data even though it is available in the source
907 |     # instead, do not set the end; by default the API returns data up to the current date/time when only the start time is set
908 |     #dataRangeEndTimeStamp = getDataRangeEndOfNow()
909 |     #querystring["dataRangeEndTimeStamp"] = dataRangeEndTimeStamp
910 |     #print(' end: ' + dataRangeEndTimeStamp)
911 | 
912 | # added 2018-11-21 by Mathias Bouten - new API features
913 | 
914 | # sinshd - set the allhourstatus = true by default
915 | #if args.allhourstatus is not None:
916 | #    allhourstatus = str(args.allhourstatus)
917 | querystring["includeAllHourStatus"] = allhourstatus
918 | print(' includeAllHourStatus: ' + allhourstatus)
919 | 
920 | if args.subhourrange is not None:
921 |     subhourrange = str(args.subhourrange)
922 |     querystring["subHourlyDataRangeInMinutes"] = subhourrange
923 |     print(' subHourlyDataRangeInMinutes: ' + subhourrange)
924 | 
925 | if args.schemaversion is not None:
926 |     schemaversion = str(args.schemaversion)
927 |     querystring["schemaVersion"] = schemaversion
928 |     print(' schemaVersion: ' + schemaversion)
929 | 
930 | if args.autoreset is not None:
931 |     autoreset = str(args.autoreset)
932 |     print(' autoreset: ' + autoreset)
933 | 
934 | if args.category is not None:
935 |     category = str(args.category)
936 |     querystring["category"] = category
937 |     print(' category: ' + category)
938 | 
939 | if args.testmode is not None:
940 |     testmode = str(args.testmode)
941 |     querystring["code"] = testmode
942 |     print(' testmode: ' + testmode)
943 | 
944 | querystring["downloadClient"] = downloadClient
945 | print(' downloadClient: ' + downloadClient)
946 | ################### START SCRIPT #####################
947 | 
948 | # call version update function to update existing files
949 | 
950 | # set dynamic log file name to create a new log file in each run
951 | fileNm = 'discover_' + martName
952 | logfile = logFileNmtimeStamped(fileNm)
953 | print ('logfile:',logfile)
954 | 
955 | logger('CI360 DISCOVER Download API (' + version + ') - last updated '+ updateDate,'n')
956 | 
957 | # track start time
958 | start = time.time()
959 | 
960 | # make any changes as we change the versions
961 | versionUpdate()
962 | 
963 | resetInProgress=False
964 | 
965 | # read config file
966 | config = readConfig(dir_config + 'config.txt')
967 | 
968 | # PyJWT returns str type for the jwt.encode() function: https://pyjwt.readthedocs.io/en/latest/changelog.html#improve-typings
969 | # For backwards compatibility with older PyJWT releases, decode the token or return it as-is based on its type.
970 | def decodeToken(token):
971 |     if (type(token)) == bytes:
972 |         return bytes.decode(token)
973 |     else:
974 |         return token
975 | 
976 | # Generate the JWT
977 | encodedSecret = base64.b64encode(bytes(config['secret'], 'utf-8'))
978 | token = jwt.encode({'clientID': config['tenantId']}, encodedSecret, algorithm='HS256')
979 | #print('\nJWT token: ' + bytes.decode(token))
980 | headers = {'authorization': "Bearer "+ decodeToken(token),'cache-control': "no-cache"}
981 | 
982 | # modify discover Reset API URL
983 | #url = createDiscoverResetAPIUrl(config)
984 | 
985 | # do reset if autoreset is set to yes
986 | if martName == 'detail' or martName == 'dbtReport':
987 |     if autoreset == 'yes' :
988 |         # modify discover Reset API URL
989 |         url = createDiscoverResetAPIUrl(config)
990 | 
991 |         # set resetInProgress=True to indicate we are now in reset mode.
992 |         # when resetInProgress, record download history to reset_download_history_mart.csv
993 |         resetInProgress=True
994 |         # check if the mart download history exists; if not, there is no need to run the resets
995 |         # this avoids running resets when nothing has been downloaded yet
996 |         logger(' starting resets','n')
997 |         historyFile = dir_config + 'download_history_' + martName + '.csv'
998 |         logger(' checking if ' + historyFile + ' exists ...','n')
999 |         if fileExists(historyFile) :
1000 |             logger(' found ', 'a')
1001 |             # store the download history into a dataframe
1002 |             download_history_df=readDownloadHistory(historyFile)
1003 | 
1004 |             resetFile = dir_config + 'reset_range_' + martName + '.csv'
1005 |             logger(' checking if ' + resetFile + ' exists ...','n')
1006 |             if fileExists(resetFile) :
1007 |                 logger(' found ', 'a')
1008 |             else:
1009 |                 logger(' not found...', 'a')
1010 |                 logger(' creating ', 'a')
1011 |                 logResetRange()
1012 |             # store the reset history into a dataframe
1013 |             reset_range_df=readResetRange(resetFile)
1014 |             # Start looping through reset packages
1015 |             loopThroughResetPackages(url)
1016 |         else:
1017 |             logger(' not found... skipping reset ', 'a')
1018 |         logger(' finished resets','n')
1019 |         resetInProgress=False
1020 | 
1021 | # modify discover API URL
1022 | url = createDiscoverAPIUrl(config)
1023 | 
1024 | # Start looping through download packages
1025 | logger(' starting downloads','n')
1026 | loopThroughDownloadPackages(url)
1027 | logger(' finished downloads','n')
1028 | 
1029 | # track end time and calculate duration
1030 | end = time.time()
1031 | duration = round((end - start),2)
1032 | 
1033 | logger('Done - execution time: ' + str(duration) + ' seconds','n')
1034 | 
1035 | print('\n')
1036 | 
--------------------------------------------------------------------------------
/dsccnfg/config.txt:
--------------------------------------------------------------------------------
1 | # Enter the agent name configured in the CI360 GUI
2 | agentName = ci360_agent
3 | 
4 | # Enter the tenantId of your CI360 tenant; you can find it under General in the CI360 GUI
5 | tenantId = abc123-ci360-tenant-id-xyz
6 | 
7 | # Enter the secret of the agent that you created in the CI360 GUI
8 | secret = ABC123ci360clientSecretXYZ
9 | 
10 | # CI360 Download URL
11 | baseUrl = https://extapigwservice-/marketingGateway/discoverService/dataDownload/eventData/
12 | 
13 | 
14 | 
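For reference, here is a minimal sketch of how discover.py turns these settings into the authorization header it sends to the API. It assumes the PyJWT package is installed; read_config is a simplified stand-in for the script's own readConfig parser, not the actual implementation.

import base64
import jwt  # PyJWT

# Simplified stand-in for discover.py's readConfig(): parses "key = value" lines.
def read_config(path):
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#') and '=' in line:
                key, value = line.split('=', 1)
                config[key.strip()] = value.strip()
    return config

config = read_config('dsccnfg/config.txt')

# Same construction as in discover.py: base64-encode the agent secret and
# sign a payload containing the tenant id with HS256.
encodedSecret = base64.b64encode(bytes(config['secret'], 'utf-8'))
token = jwt.encode({'clientID': config['tenantId']}, encodedSecret, algorithm='HS256')
token = token.decode() if isinstance(token, bytes) else token  # older PyJWT releases return bytes

headers = {'authorization': 'Bearer ' + token, 'cache-control': 'no-cache'}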
--------------------------------------------------------------------------------
/dscdonl/dscdonl.txt:
--------------------------------------------------------------------------------
1 | This directory will be used for storing the downloaded files from the cloud.
--------------------------------------------------------------------------------
/dscextr/dscextr.txt:
--------------------------------------------------------------------------------
1 | This directory will be used for storing the extracted files from the cloud.
--------------------------------------------------------------------------------
/dscwh/dscwh.txt:
--------------------------------------------------------------------------------
1 | This directory will be used for storing the extracted files from the cloud.
--------------------------------------------------------------------------------
/log/logs.txt:
--------------------------------------------------------------------------------
1 | This directory will be used for storing the log files.
--------------------------------------------------------------------------------
/sql/sql.txt:
--------------------------------------------------------------------------------
1 | This directory will be used for storing the SQL files.
--------------------------------------------------------------------------------
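As a convenience, the sketch below shows one way to check download progress outside of discover.py by reading the per-mart download history file. Only the dataRangeStart column name and the download_history_<mart>.csv naming are taken from discover.py; the dsccnfg/ location and the comma delimiter are assumptions, so adjust them to match your setup.

import pandas as pd

def latest_downloaded_range_start(mart_name, config_dir='dsccnfg/'):
    # discover.py records each download in download_history_<mart>.csv and
    # filters that history on its 'dataRangeStart' column.
    history_file = config_dir + 'download_history_' + mart_name + '.csv'
    history_df = pd.read_csv(history_file)
    return history_df['dataRangeStart'].max()

if __name__ == '__main__':
    print(latest_downloaded_range_start('detail'))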