├── .gitignore ├── README.md ├── Tableau_calculation_extractor.ipynb ├── Tableau_calculation_extractor_with_mermaid.ipynb ├── Tableau_calculation_extractor_with_mermaid.py ├── excelgenerator.py └── output_examples ├── TheMoodsofMidgarWillSutton_CALCS_only.pdf └── TheMoodsofMidgarWillSutton_CALCS_only.xlsx /.gitignore: -------------------------------------------------------------------------------- 1 | #temporary input and output folders 2 | inputs 3 | outputs 4 | 5 | .idea 6 | 7 | #Unzipped tableau workbooks 8 | inputs/Image 9 | inputs/Data 10 | inputs/to use later 11 | 12 | #Archived files from main directory 13 | archive 14 | 15 | 16 | # Byte-compiled / optimized / DLL files 17 | __pycache__/ 18 | *.py[cod] 19 | *$py.class 20 | 21 | # C extensions 22 | *.so 23 | 24 | # Distribution / packaging 25 | .Python 26 | build/ 27 | develop-eggs/ 28 | dist/ 29 | downloads/ 30 | eggs/ 31 | .eggs/ 32 | lib/ 33 | lib64/ 34 | parts/ 35 | sdist/ 36 | var/ 37 | wheels/ 38 | pip-wheel-metadata/ 39 | share/python-wheels/ 40 | *.egg-info/ 41 | .installed.cfg 42 | *.egg 43 | MANIFEST 44 | 45 | # PyInstaller 46 | # Usually these files are written by a python script from a template 47 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 48 | *.manifest 49 | *.spec 50 | 51 | # Installer logs 52 | pip-log.txt 53 | pip-delete-this-directory.txt 54 | 55 | # Unit test / coverage reports 56 | htmlcov/ 57 | .tox/ 58 | .nox/ 59 | .coverage 60 | .coverage.* 61 | .cache 62 | nosetests.xml 63 | coverage.xml 64 | *.cover 65 | *.py,cover 66 | .hypothesis/ 67 | .pytest_cache/ 68 | 69 | # Translations 70 | *.mo 71 | *.pot 72 | 73 | # Django stuff: 74 | *.log 75 | local_settings.py 76 | db.sqlite3 77 | db.sqlite3-journal 78 | 79 | # Flask stuff: 80 | instance/ 81 | .webassets-cache 82 | 83 | # Scrapy stuff: 84 | .scrapy 85 | 86 | # Sphinx documentation 87 | docs/_build/ 88 | 89 | # PyBuilder 90 | target/ 91 | 92 | # Jupyter Notebook 93 | .ipynb_checkpoints 94 | 95 | # IPython 96 | profile_default/ 97 | ipython_config.py 98 | 99 | # pyenv 100 | .python-version 101 | 102 | # pipenv 103 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 104 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 105 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 106 | # install all needed dependencies. 107 | #Pipfile.lock 108 | 109 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 110 | __pypackages__/ 111 | 112 | # Celery stuff 113 | celerybeat-schedule 114 | celerybeat.pid 115 | 116 | # SageMath parsed files 117 | *.sage.py 118 | 119 | # Environments 120 | .env 121 | .venv 122 | env/ 123 | venv/ 124 | ENV/ 125 | env.bak/ 126 | venv.bak/ 127 | 128 | # Spyder project settings 129 | .spyderproject 130 | .spyproject 131 | 132 | # Rope project settings 133 | .ropeproject 134 | 135 | # mkdocs documentation 136 | /site 137 | 138 | # mypy 139 | .mypy_cache/ 140 | .dmypy.json 141 | dmypy.json 142 | 143 | # Pyre type checker 144 | .pyre/ 145 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # tableauCalculationExport 2 | 3 | ## What this code does 4 | - This code will extract all Calculated Fields, Default Fields and Parameters from a Tableau workbook and export them into an Excel and PDF file. 
5 | - The code will also generate a Mermaid diagram showing the lineage between fields. The diagram will be exported into an HTML file that you can open in your PC's internet browser. 6 | - Note that the Lineage Diagram will only show relationships between USED fields (ie. Default (datasource) fields that are NOT used in a Calculated Field will NOT come up in the diagram). 7 | 8 | ## Limitations and Important Considerations 9 | - The latest version of the code will only work on **twbx** files (packaged Tableau files). 10 | - The code is only available for **Windows** systems (as it needs the win32com.client package to generate the Excel file). 11 | 12 | ## Getting Started 13 | - Please make sure you have a **working Python environment** and that you have installed the following packages/libraries (either via pip install or conda install - please Google the steps to install each package, as some are either pip or Conda specific) 14 | - win32com.client 15 | - [tableaudocumentapi](https://tableau.github.io/document-api-python/docs/) 16 | - pandas 17 | - Jupyter Notebook 18 | - Some modules should already come with your Python installation (depending on which Python version you are using), but if for some reason they're not present in your Python env, please make sure you get them too 19 | - pathlib 20 | 21 | ## Downloading the Code and Setting up your working directory 22 | 23 | **Before starting on this section, please make sure you've installed Python and any dependencies into your Python environment (ie. the libraries and packages detailed in the previous section)** 24 | 25 | 1. Download ALL the code into your preferred directory (ie. a folder on your PC). 26 | - Make sure the **excelgenerator.py** file is in the SAME directory as the ipynb or py file you want to run (ie. Tableau_calculation_extractor_with_mermaid.ipynb) 27 | 28 | 2. In your working directory, create an empty "/inputs" and an "/outputs" folder. Your working directory should look like this: 29 | 30 | Note that an "/inputs" folder means that you will create a folder called "inputs" inside your working directory. From here onwards I will use "/inputs" and "inputs" interchangeably (same for outputs). 31 | 32 | ![image](https://github.com/user-attachments/assets/62ec66c6-0db6-495a-9063-8b603fe66d17) 33 | 34 | 35 | 36 | 3. Once you have a Tableau packaged workbook (twbx file) that you want to analyse, save it in the "inputs" folder. 37 | 38 | 4. Run the **Calculation Extractor code** (ie. Tableau_calculation_extractor_with_mermaid.ipynb or Tableau_calculation_extractor_with_mermaid.py, depending on which version you want to run - either the Jupyter Notebook or the py file - both are meant to have the same functionality) 39 | 40 | 5. Check the "/outputs" folder for the code outputs - you should now have a PDF, an Excel and an HTML file with the results from the Calculation Extraction process (the PDF and Excel files) and the Lineage Creation process (the HTML file). 41 | 42 | ### Running the code again (eg. to analyse a new workbook) 43 | At the moment the code **only handles one twbx file from the inputs folder** per run; if two or more twbx files are found there, only one of them will be analysed (see the sketch below). In future versions I will add file handling so that more than one twbx file can be analysed at a time - you're also welcome to submit a PR if you'd like to contribute this feature! 
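For reference, the file selection is just a directory listing. Below is a minimal sketch of the logic the notebooks use (the folder name and twbx filter match the code; the loop body here is illustrative only):

```python
import os
from os.path import isfile, join

input_path = "inputs"
mypath = "./{}".format(input_path)

# keep only files (not folders) inside inputs/ whose name ends in .twbx
input_files = [f for f in os.listdir(mypath)
               if isfile(join(mypath, f)) and f[-5:] == '.twbx']

# the notebooks then loop over this list, so when several twbx files are
# present the workbook variables simply keep the values from the LAST
# iteration; keeping a single workbook in inputs/ makes the run deterministic
for workbook_file in input_files:
    print("workbook found:", workbook_file)
```

Extending that loop so every listed workbook is processed, rather than only the last one, is the natural starting point for a multi-file PR.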
44 | 45 | - Before analysing a new workbook (once already saved to the "/inputs" folder), remove any OTHER files from the "/inputs" folder (eg. any previous workbook you have already analysed), and only leave the one workbook you want to analyse. 46 | - You can now run the Calculation Extractor code. 47 | - You don't need to worry about emptying the "/outputs" folder - it simply stores the outputs from every run of the Calculation Extractor code, so outputs will accumulate as more runs occur. 48 | 49 | 50 | # Troubleshooting and Help 51 | As this is a personal project, I am not providing any IT support for this code. However, if you have any questions that are NOT answered above, feel free to reach out to nana7milana@gmail.com. 52 | I will aim to reply within one or two weeks, but if I don't, feel free to send me a reminder. 53 | Thanks for checking out my code! 54 | 55 | Ana 56 | 57 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /Tableau_calculation_extractor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# version 2.35" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import pandas as pd\n", 17 | "import numpy as np\n", 18 | "import os, re, sys, pathlib, zipfile\n", 19 | "import win32com.client\n", 20 | "import xml.etree.ElementTree as ET\n", 21 | "import tableaudocumentapi\n", 22 | "\n", 23 | "from tableaudocumentapi import Workbook\n", 24 | "from os.path import isfile, join" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## Input folder - Find if there is a twbx or twb file in the folder\n", 32 | "- if there is a twbx, unzip it to create a twb, then work with this\n", 33 | "- if there's only a twb, work with this" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "scrolled": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "input_path = \"inputs\"\n", 45 | "output_path = \"outputs\"\n", 46 | "\n", 47 | "mypath = \"./{}\".format(input_path)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": { 54 | "scrolled": false 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "#only gets files and not directories within the inputs folder -https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory\n", 59 | "f = [f for f in os.listdir(mypath) if isfile(join(mypath, f)) and f[-5:] == '.twbx'] \n", 60 | "f" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "def removeSpecialCharFromStr(spstring):\n", 70 | " \n", 71 | " return ''.join(e for e in spstring if e.isalnum())" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "for i in f: \n", 81 | " \n", 82 | " if i[-5:] == '.twbx':\n", 83 | " sp_packagedWorkbook = i[:len(i)-5]\n", 84 | " print(sp_packagedWorkbook)\n", 85 | " packagedWorkbook = removeSpecialCharFromStr(sp_packagedWorkbook)+'.twbx'\n", 86 | " print(packagedWorkbook)\n", 87 | " \n", 88 | " old_file = join(input_path, sp_packagedWorkbook+'.twbx')\n", 89 | " new_file = join(input_path, packagedWorkbook)\n", 90 | " os.rename(old_file, 
new_file)\n", 91 | " \n", 92 | " with zipfile.ZipFile(input_path+\"/\"+packagedWorkbook, 'r') as zip_ref:\n", 93 | " zip_ref.extractall(input_path+\"/\")\n", 94 | " else:\n", 95 | " packagedWorkbook = \"\"\n", 96 | " \n", 97 | "for i in [f for f in os.listdir(mypath) if isfile(join(mypath, f))] :\n", 98 | " \n", 99 | " if i[-4:] == '.twb':\n", 100 | " sp_unpackagedWorkbook = i[:len(i)-4]\n", 101 | " unpackedWorkbook = removeSpecialCharFromStr(sp_unpackagedWorkbook)+'.twb' \n", 102 | " \n", 103 | " old_file = join(input_path, sp_unpackagedWorkbook+'.twb')\n", 104 | " new_file = join(input_path, unpackedWorkbook)\n", 105 | " os.rename(old_file, new_file)\n", 106 | "\n", 107 | "print('\\n')\n", 108 | "print('packaged workbook: ' + packagedWorkbook)\n", 109 | "print('unpackaged workbook: ' + unpackedWorkbook)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "tableauFile = input_path+\"/\"+unpackedWorkbook\n", 119 | "tableauFile" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "packagedTableauFile = input_path+\"/\"+packagedWorkbook\n", 129 | "packagedTableauFile" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "#substring to be used when naming the exported data\n", 139 | "\n", 140 | "tableau_name_substring = packagedWorkbook.replace(\".twbx\",\"\")[:30]\n", 141 | "tableau_name_substring" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# Parse xml to get all calculations" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": null, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "tree = ET.parse(tableauFile)\n", 158 | "root = tree.getroot()\n", 159 | "\n", 160 | "collator1 = []\n", 161 | "calcNames = []\n", 162 | "calcCaptions = []\n", 163 | "\n", 164 | "for_findall = [\"./datasources/datasource/column\", \"./worksheets/worksheet/table/view/datasource-dependencies/column\"]\n", 165 | "\n", 166 | "for pathy in for_findall:\n", 167 | " for elem in root.findall(pathy):\n", 168 | "\n", 169 | " dict_temp = {}\n", 170 | "\n", 171 | " if (elem.findall('calculation')) != []: #only get nodes where there is a calculation\n", 172 | " try:\n", 173 | " dict_temp['caption'] = elem.attrib['caption']\n", 174 | " calcCaptions.append(elem.attrib['caption'])\n", 175 | " except:\n", 176 | " dict_temp['caption'] = elem.attrib['name'] #DEPRECATED #'MISSING'\n", 177 | " calcCaptions.append(elem.attrib['name']) #DEPRECATED append('MISSING')\n", 178 | "\n", 179 | " dict_temp['datatype'] = elem.attrib['datatype']\n", 180 | " dict_temp['name'] = elem.attrib['name']\n", 181 | "\n", 182 | " f2 = (elem.attrib['name']).replace(']','')\n", 183 | " f2 = f2.replace('[', '')\n", 184 | " calcNames.append(f2)\n", 185 | "\n", 186 | " try: #this part evaluates for a parameter\n", 187 | " paramExists = elem.attrib['param-domain-type']\n", 188 | " dict_temp['isParameter'] = 'yes'\n", 189 | " dict_temp['formula'] = 'NA'\n", 190 | "\n", 191 | " except: #this part is for calculations only (not parameters)\n", 192 | " dict_temp['isParameter'] = 'no'\n", 193 | "\n", 194 | " try:\n", 195 | " for calc in elem.findall('calculation'):\n", 196 | " dict_temp['formula'] = calc.attrib['formula']\n", 197 | " except:\n", 198 | "\n", 199 | " 
dict_temp['formula'] = 'NA'\n", 200 | "\n", 201 | " collator1.append(dict_temp)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "calcDict = dict(zip(calcNames, calcCaptions))\n", 211 | "calcDict" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "def default_to_friendly_names(formulaList):\n", 221 | "\n", 222 | " for i in formulaList:\n", 223 | " for tableauName, friendlyName in calcDict.items():\n", 224 | " i['formula'] = (i['formula']).replace(tableauName, friendlyName)\n", 225 | " \n", 226 | " return formulaList" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": { 233 | "scrolled": true 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "collator1 = default_to_friendly_names(collator1)\n", 238 | "collator1[0:2]" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "scrolled": true 246 | }, 247 | "outputs": [], 248 | "source": [ 249 | "df = pd.DataFrame(collator1)\n", 250 | "df = df[['caption', 'datatype', 'formula', 'isParameter', 'name']]\n", 251 | "df.columns = ['CalculationName', 'DataType', 'Formula', 'isParameter', 'RawName']\n", 252 | "\n", 253 | "df = df.drop_duplicates()\n", 254 | "\n", 255 | "df = df.sort_values(by=['isParameter','CalculationName'])\n", 256 | "df = df.reset_index(drop=True)\n", 257 | "df" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "# Getting all filters for all worksheets" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": null, 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "tree = ET.parse(tableauFile)\n", 274 | "root = tree.getroot()\n", 275 | "\n", 276 | "filters_in_sheet = []\n", 277 | "context = []\n", 278 | "collatelist = []\n", 279 | "\n", 280 | "for worskheet in root.findall(\"./worksheets/worksheet\"):\n", 281 | " \n", 282 | " tempdict = {}\n", 283 | " c = 0\n", 284 | " \n", 285 | " for filt in worskheet.findall('table/view/filter'):\n", 286 | "\n", 287 | " calcfromfilter = filt.attrib['column'] \n", 288 | " pat = '(?<=\\:)(.*?)(?=\\:)' \n", 289 | " string_cleaned = calcfromfilter.split('].[')[1].replace(']','')\n", 290 | " \n", 291 | " tempdict['field'] = calcfromfilter\n", 292 | " tempdict['formula'] = calcfromfilter\n", 293 | " tempdict['counter'] = c\n", 294 | " tempdict['sheetname'] = worskheet.attrib['name']\n", 295 | " \n", 296 | " try:\n", 297 | " st1 = re.findall(pat,string_cleaned)[0]\n", 298 | " tempdict['field'] = st1\n", 299 | " tempdict['formula'] = st1\n", 300 | " collatelist.append(tempdict)\n", 301 | " \n", 302 | " except:\n", 303 | " st2 = string_cleaned.replace(':','')\n", 304 | " tempdict['field'] = st2\n", 305 | " tempdict['formula'] = st2\n", 306 | " collatelist.append(tempdict)\n", 307 | "\n", 308 | " try:\n", 309 | " tempdict['context'] = filt.attrib['context']\n", 310 | " except:\n", 311 | " tempdict['context'] = 'False'\n", 312 | " \n", 313 | " c = c + 1\n", 314 | " tempdict = {}\n", 315 | " \n", 316 | "collatelist[0:2]" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": null, 322 | "metadata": {}, 323 | "outputs": [], 324 | "source": [ 325 | "collatelist = default_to_friendly_names(collatelist)\n", 326 | "collatelist[0:2]" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | 
"execution_count": null, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [ 335 | "try: \n", 336 | " df1 = pd.DataFrame(collatelist)\n", 337 | "\n", 338 | " df1 = df1[['sheetname', 'formula', 'context', 'field']]\n", 339 | " df1.columns = ['Sheet Name', 'FilterField', 'Context filter', 'FilterField_RawName']\n", 340 | "\n", 341 | " print(df1.head(2))\n", 342 | "except:\n", 343 | " print('error with df1')" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "# Extracting rows and cols for each sheet" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": { 357 | "scrolled": true 358 | }, 359 | "outputs": [], 360 | "source": [ 361 | "collecteddata = []\n", 362 | "\n", 363 | "for worksheet in root.findall(\"./worksheets/worksheet\"):\n", 364 | "\n", 365 | " argumentstopass = ['rows', 'cols']\n", 366 | " \n", 367 | " for i in argumentstopass: \n", 368 | " \n", 369 | " internaldict = {}\n", 370 | "\n", 371 | " internaldict['sheetname'] = worksheet.attrib['name']\n", 372 | " internaldict['type'] = i\n", 373 | " \n", 374 | " formulahere = worksheet.findall('table/'+i)[0].text\n", 375 | " internaldict['formula'] = formulahere\n", 376 | " \n", 377 | " collecteddata.append(internaldict)\n", 378 | " \n", 379 | "collecteddata[0:2]" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": null, 385 | "metadata": { 386 | "scrolled": true 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "for i in collecteddata:\n", 391 | "\n", 392 | " try:\n", 393 | " pattern = '\\:.*?\\:'\n", 394 | " pat = '(?<=\\:)(.*?)(?=\\:)'\n", 395 | "\n", 396 | " calculationsWithColon = re.findall(pattern,i['formula']) \n", 397 | " calcsWithoutColon = []\n", 398 | "\n", 399 | " for n in calculationsWithColon:\n", 400 | " oneCalcWithoutColon = re.findall(pat,n)[0]\n", 401 | "\n", 402 | " calcsWithoutColon.append(oneCalcWithoutColon)\n", 403 | " \n", 404 | " i['extracted formulas'] = calcsWithoutColon\n", 405 | " \n", 406 | " except:\n", 407 | " i['extracted formulas'] = []\n", 408 | " \n", 409 | " newcalcs = []\n", 410 | " formulas_to_process = i['extracted formulas']\n", 411 | " \n", 412 | " for n in formulas_to_process:\n", 413 | " \n", 414 | " for tableauName, friendlyName in calcDict.items():\n", 415 | " \n", 416 | " n = n.replace(tableauName, friendlyName)\n", 417 | " \n", 418 | " newcalcs.append(n)\n", 419 | " \n", 420 | " #version 2.35 added this part to check for longitude or latitute in the formula\n", 421 | " #separate to other try/except as long/lat appear in a different string structure so cannot analyse with above regex\n", 422 | " try:\n", 423 | " if \"Longitude (generated)\" in i['formula']:\n", 424 | " newcalcs.append(\"Longitude (generated)\")\n", 425 | " elif \"Latitude (generated)\" in i['formula']:\n", 426 | " newcalcs.append(\"Latitude (generated)\")\n", 427 | " except:\n", 428 | " dummy = 0\n", 429 | " \n", 430 | " i['processed formulas'] = newcalcs\n", 431 | "\n", 432 | "collecteddata" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "metadata": { 439 | "scrolled": true 440 | }, 441 | "outputs": [], 442 | "source": [ 443 | "df2 = pd.DataFrame(collecteddata)\n", 444 | "df2 = df2[['extracted formulas', 'formula', 'processed formulas', 'sheetname', 'type']]\n", 445 | "df2 = df2.drop(columns=['formula', 'extracted formulas'])\n", 446 | "df2 = df2.pivot(index='sheetname', columns='type', values='processed formulas')\n", 447 | "df2 = 
df2.reset_index()\n", 448 | "df2" 449 | ] 450 | }, 451 | { 452 | "cell_type": "markdown", 453 | "metadata": {}, 454 | "source": [ 455 | "# Doc API" 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "# All default fields - DOC API" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": { 469 | "scrolled": true 470 | }, 471 | "outputs": [], 472 | "source": [ 473 | "packagedTableauFile" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": { 480 | "scrolled": true 481 | }, 482 | "outputs": [], 483 | "source": [ 484 | "#get all fields in workbook\n", 485 | "sourceTWBX = Workbook(packagedTableauFile)\n", 486 | "\n", 487 | "collator = []\n", 488 | "calcID = []\n", 489 | "calcID2 = []\n", 490 | "calcNames = []\n", 491 | "\n", 492 | "c = 0\n", 493 | "\n", 494 | "worksheets = sourceTWBX.worksheets\n", 495 | "\n", 496 | "#the per-worksheet loop is intentionally left out here, so that ALL fields are listed once\n", 497 | " \n", 498 | "for datasource in sourceTWBX.datasources:\n", 499 | "\n", 500 | " for count, field in enumerate(datasource.fields.values()):\n", 501 | "\n", 502 | " #if worksheet in field.worksheets: #removed this part so all fields are listed, as otherwise some fields were missed out\n", 503 | "\n", 504 | " dict_temp = {}\n", 505 | " dict_temp['counter'] = c\n", 506 | " dict_temp['worksheet'] = field.worksheets #list of the sheets where this field appears\n", 507 | " dict_temp['datasource_name'] = datasource.name\n", 508 | " dict_temp['field_WHOLE'] = field\n", 509 | " dict_temp['field_name'] = field.name\n", 510 | " dict_temp['field_caption'] = field.caption\n", 511 | " dict_temp['field_calculation'] = field.calculation\n", 512 | " dict_temp['field_id'] = field.id\n", 513 | " dict_temp['field_datatype'] = field.datatype\n", 514 | "\n", 515 | "\n", 516 | " if field.calculation is not None:\n", 517 | " calcID.append(field.id)\n", 518 | " calcNames.append(field.name)\n", 519 | "\n", 520 | " f2 = (field.id).replace(']','')\n", 521 | " f2 = f2.replace('[', '')\n", 522 | " calcID2.append(f2)\n", 523 | "\n", 524 | " c = c + 1\n", 525 | "\n", 526 | " collator.append(dict_temp)" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": {}, 533 | "outputs": [], 534 | "source": [ 535 | "calcDict = dict(zip(calcID, calcNames))\n", 536 | "calcDict2 = dict(zip(calcID2, calcNames)) #raw fields without any []\n", 537 | "\n", 538 | "def default_to_friendly_names2(formulaList,fieldToConvert, dictToUse):\n", 539 | "\n", 540 | " for i in formulaList:\n", 541 | " for tableauName, friendlyName in dictToUse.items():\n", 542 | " try:\n", 543 | " i[fieldToConvert] = (i[fieldToConvert]).replace(tableauName, friendlyName)\n", 544 | " except:\n", 545 | " a = 0\n", 546 | " \n", 547 | " return formulaList" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": null, 553 | "metadata": {}, 554 | "outputs": [], 555 | "source": [ 556 | "def f(row):\n", 557 | " if row['field_calculation'] is None:\n", 558 | " val = 'Datasource field'\n", 559 | " else:\n", 560 | " val = 'Calculated field'\n", 561 | " return val" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": null, 567 | "metadata": {}, 568 | "outputs": [], 569 | "source": [ 570 | "default_to_friendly_names2(collator,'field_calculation',calcDict)\n", 571 | "\n", 572 | "df_API_all = pd.DataFrame(collator)\n", 573 | "df_API_all['field_type'] = df_API_all.apply(f, axis=1)\n", 574 | "\n", 575 
| "df_API_all.head()" 576 | ] 577 | }, 578 | { 579 | "cell_type": "code", 580 | "execution_count": null, 581 | "metadata": {}, 582 | "outputs": [], 583 | "source": [ 584 | "df_defaultFields = df_API_all[df_API_all['field_type'] == 'Datasource field'][['field_id', 'field_caption','field_datatype', 'datasource_name']].drop_duplicates().copy()\n", 585 | "\n", 586 | "df_defaultFields['prefOrder'] = np.where(df_defaultFields['field_caption'].isnull(), 0, 1)\n", 587 | "df_defaultFields['field_id2'] = df_defaultFields['field_id'].str.replace('[','')\n", 588 | "df_defaultFields['field_id2'] = df_defaultFields['field_id2'].str.replace(']','')\n", 589 | "\n", 590 | "df_defaultFields = df_defaultFields.sort_values(by = ['field_id2'])\n", 591 | "#https://stackoverflow.com/questions/63271050/use-drop-duplicates-in-pandas-df-but-choose-keep-column-based-on-a-preference-li\n", 592 | "preference_list=[1,0]\n", 593 | "\n", 594 | "df_defaultFields[\"prefOrder\"] = pd.Categorical(df_defaultFields[\"prefOrder\"], categories=preference_list, ordered=True)\n", 595 | "\n", 596 | "df_defaultFields = df_defaultFields.sort_values([\"field_id2\",\"prefOrder\"]).drop_duplicates(\"field_id2\")\n", 597 | "df_defaultFields = df_defaultFields.drop('prefOrder', axis=1)\n", 598 | "df_defaultFields = df_defaultFields.drop('field_id2', axis=1)\n", 599 | "df_defaultFields.head(2)" 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "# Parameters" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": { 613 | "scrolled": true 614 | }, 615 | "outputs": [], 616 | "source": [ 617 | "colsToUse = ['field_id', 'field_name', 'field_calculation', 'field_caption','field_datatype', 'datasource_name' ]\n", 618 | "dfAPIParameters = df_API_all[colsToUse][df_API_all['datasource_name']=='Parameters'].drop_duplicates().copy()\n", 619 | "\n", 620 | "dfAPIParameters" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": null, 626 | "metadata": {}, 627 | "outputs": [], 628 | "source": [ 629 | "df = df.merge(dfAPIParameters[['field_id','field_calculation']], left_on='RawName', right_on = 'field_id', how='left')\n", 630 | "\n", 631 | "df[\"Formula\"] = np.where(df[\"Formula\"] == \"NA\", df['field_calculation'], df[\"Formula\"])\n", 632 | "df = df.drop(columns=['field_id', 'field_calculation'])\n", 633 | "df" 634 | ] 635 | }, 636 | { 637 | "cell_type": "markdown", 638 | "metadata": {}, 639 | "source": [ 640 | "# Sheet - all field dependencies, not just the explicitly used fields" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": null, 646 | "metadata": {}, 647 | "outputs": [], 648 | "source": [ 649 | "#df_api_insheet\n", 650 | "sourceTWBX = Workbook(packagedTableauFile)\n", 651 | "\n", 652 | "collator_sheet_dependencies = []\n", 653 | "\n", 654 | "c = 0\n", 655 | "\n", 656 | "worksheets = sourceTWBX.worksheets\n", 657 | "\n", 658 | "for worksheet in worksheets:\n", 659 | " \n", 660 | " for datasource in sourceTWBX.datasources:\n", 661 | " \n", 662 | " for count, field in enumerate(datasource.fields.values()):\n", 663 | " \n", 664 | " if worksheet in field.worksheets: #to see if only fields that appear in sheets are listed, else last df is too large\n", 665 | " \n", 666 | " dict_temp = {}\n", 667 | " dict_temp['counter'] = c\n", 668 | " dict_temp['worksheet'] = worksheet\n", 669 | " dict_temp['datasource_name'] = datasource.name\n", 670 | " dict_temp['field_WHOLE'] = field\n", 671 | " 
dict_temp['field_name'] = field.name\n", 672 | " dict_temp['field_caption'] = field.caption\n", 673 | " dict_temp['field_calculation'] = field.calculation\n", 674 | " dict_temp['field_id'] = field.id\n", 675 | " dict_temp['field_datatype'] = field.datatype\n", 676 | " \n", 677 | " c = c + 1\n", 678 | " \n", 679 | " collator_sheet_dependencies.append(dict_temp)" 680 | ] 681 | }, 682 | { 683 | "cell_type": "code", 684 | "execution_count": null, 685 | "metadata": { 686 | "scrolled": false 687 | }, 688 | "outputs": [], 689 | "source": [ 690 | "#default_to_friendly_names2(collator_sheet_dependencies, 'field_calculation',calcDict)\n", 691 | "\n", 692 | "df_api_insheet = pd.DataFrame(collator_sheet_dependencies)\n", 693 | "df_api_insheet['field_type'] = df_api_insheet.apply(f, axis=1)\n", 694 | "df_api_insheet.head()" 695 | ] 696 | }, 697 | { 698 | "cell_type": "code", 699 | "execution_count": null, 700 | "metadata": { 701 | "scrolled": true 702 | }, 703 | "outputs": [], 704 | "source": [ 705 | "df_sheetDependencies = df_api_insheet.copy()\n", 706 | "preference_list=[1,0]\n", 707 | "\n", 708 | "df_sheetDependencies['prefOrder'] = np.where(df_sheetDependencies['field_caption'].isnull(), 0, 1)\n", 709 | "\n", 710 | "df_sheetDependencies['field_id2'] = df_sheetDependencies['field_id'].str.replace('[','')\n", 711 | "df_sheetDependencies['field_id2'] = df_sheetDependencies['field_id2'].str.replace(']','')\n", 712 | "\n", 713 | "df_sheetDependencies[\"prefOrder\"] = pd.Categorical(df_sheetDependencies[\"prefOrder\"], categories=preference_list, ordered=True)\n", 714 | "df_sheetDependencies = df_sheetDependencies.sort_values([\"field_id2\",\\\n", 715 | " \"prefOrder\"]).drop_duplicates(subset=[\"field_id2\", \"worksheet\"])\n", 716 | "\n", 717 | "df_sheetDependencies = df_sheetDependencies.drop(\\\n", 718 | " columns=['prefOrder', 'field_id2', 'counter', 'field_caption', 'field_WHOLE', \\\n", 719 | " 'field_calculation', 'field_id'])\n", 720 | "\n", 721 | "df_sheetDependencies = df_sheetDependencies[['worksheet', 'field_name', 'field_datatype', \\\n", 722 | " 'field_type', 'datasource_name']].sort_values(by = ['worksheet', 'field_type', 'field_name'])\n", 723 | "df_sheetDependencies.head()" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": {}, 729 | "source": [ 730 | "# General workbook description" 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": {}, 737 | "outputs": [], 738 | "source": [ 739 | "sourceTWBX = Workbook(packagedTableauFile)" 740 | ] 741 | }, 742 | { 743 | "cell_type": "code", 744 | "execution_count": null, 745 | "metadata": {}, 746 | "outputs": [], 747 | "source": [ 748 | "collate_list = []\n", 749 | "\n", 750 | "for dash in sourceTWBX.dashboards:\n", 751 | " dicti = {}\n", 752 | " \n", 753 | " dicti['type'] = 'dashboard'\n", 754 | " # print(format(dash))\n", 755 | " dicti['name'] = format(dash)\n", 756 | " \n", 757 | " collate_list.append(dicti)\n", 758 | " \n", 759 | "for data in sourceTWBX.datasources:\n", 760 | " dicti = {}\n", 761 | " \n", 762 | " dicti['type'] = 'datasource'\n", 763 | " dicti['name'] = format(data.name)\n", 764 | " # print(format(data.name))\n", 765 | " \n", 766 | " collate_list.append(dicti)\n", 767 | " \n", 768 | "for data in sourceTWBX.worksheets:\n", 769 | " dicti = {}\n", 770 | " \n", 771 | " dicti['type'] = 'sheet'\n", 772 | " dicti['name'] = format(data)\n", 773 | " # print(format(data))\n", 774 | " \n", 775 | " collate_list.append(dicti)" 776 | ] 777 | }, 778 | { 779 | "cell_type": 
"code", 780 | "execution_count": null, 781 | "metadata": {}, 782 | "outputs": [], 783 | "source": [ 784 | "df_workbookdec = pd.DataFrame(collate_list)\n", 785 | "df_workbookdec = df_workbookdec[['type', 'name']]\n", 786 | "df_workbookdec.head(2)" 787 | ] 788 | }, 789 | { 790 | "cell_type": "code", 791 | "execution_count": null, 792 | "metadata": { 793 | "scrolled": false 794 | }, 795 | "outputs": [], 796 | "source": [ 797 | "df_workbookdec_counts = df_workbookdec.groupby(['type']).count().reset_index()\n", 798 | "df_workbookdec_counts" 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": null, 804 | "metadata": {}, 805 | "outputs": [], 806 | "source": [ 807 | "#count parameters and calc fields, based on xml scraping\n", 808 | "parameterCount = len(df[df['isParameter'] == 'yes'])\n", 809 | "calcFieldCount = len(df[df['isParameter'] != 'yes'])" 810 | ] 811 | }, 812 | { 813 | "cell_type": "code", 814 | "execution_count": null, 815 | "metadata": {}, 816 | "outputs": [], 817 | "source": [ 818 | "new_row1 = {'type':'parameter', 'name':parameterCount}\n", 819 | "new_row2 = {'type':'calculated field', 'name':calcFieldCount}\n", 820 | "\n", 821 | "toappend = [new_row1, new_row2]\n", 822 | "\n", 823 | "for i in toappend:\n", 824 | "#append row to the dataframe\n", 825 | " df_workbookdec_counts = df_workbookdec_counts.append(i, ignore_index=True)\n", 826 | "\n", 827 | "df_workbookdec_counts.columns = ['type', 'count']\n", 828 | "df_workbookdec_counts" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "## Generating an excel file from a df (so the excel rows/cols can be formatted), then turning the excel into a pdf" 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": null, 841 | "metadata": {}, 842 | "outputs": [], 843 | "source": [ 844 | "cwd = os.getcwd()\n", 845 | "path_string = pathlib.Path(cwd).resolve().__str__() + \"\\{}\"" 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "metadata": {}, 851 | "source": [ 852 | "- Loading the file names and output locations for the excel and pdfs to be produced" 853 | ] 854 | }, 855 | { 856 | "cell_type": "code", 857 | "execution_count": null, 858 | "metadata": {}, 859 | "outputs": [], 860 | "source": [ 861 | "name_to_use = tableau_name_substring \n", 862 | "\n", 863 | "newFileName = 'outputs\\{}'.format(name_to_use)\n", 864 | "excelName = newFileName + \".xlsx\"\n", 865 | "pdfName = newFileName + \".pdf\"\n", 866 | "print(pdfName)\n", 867 | "\n", 868 | "excel_path = path_string.format(excelName)\n", 869 | "path_to_pdf = path_string.format(pdfName)" 870 | ] 871 | }, 872 | { 873 | "cell_type": "markdown", 874 | "metadata": {}, 875 | "source": [ 876 | "- Functions to format the excel files" 877 | ] 878 | }, 879 | { 880 | "cell_type": "code", 881 | "execution_count": null, 882 | "metadata": {}, 883 | "outputs": [], 884 | "source": [ 885 | "#colors to be used in each sheet\n", 886 | "c1 = '#f4dfa4'\n", 887 | "c2 = '#ffc8b3'\n", 888 | "c3 = '#fff0b3'\n", 889 | "c4 = '#d5dfb9'\n", 890 | "c5 = '#d1c5d3'\n", 891 | "c6 = '#bfd9d7'" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": null, 897 | "metadata": {}, 898 | "outputs": [], 899 | "source": [ 900 | "def mainCol(colNumber, color):\n", 901 | " format_mainCol = workbook.add_format({'text_wrap': True, 'bold': True})\n", 902 | " format_mainCol.set_align('vcenter')\n", 903 | " format_mainCol.set_bg_color(color)\n", 904 | " format_mainCol.set_border(1)\n", 905 | " 
worksheet.set_column(colNumber,colNumber,20,format_mainCol)\n", 906 | " return worksheet" 907 | ] 908 | }, 909 | { 910 | "cell_type": "code", 911 | "execution_count": null, 912 | "metadata": {}, 913 | "outputs": [], 914 | "source": [ 915 | "def normalCol(colNumber, colWidth):\n", 916 | " format2 = workbook.add_format({'text_wrap': True})\n", 917 | " format2.set_align('vcenter')\n", 918 | " format2.set_border(1)\n", 919 | " worksheet.set_column(colNumber,colNumber,colWidth,format2)\n", 920 | " return worksheet" 921 | ] 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "metadata": {}, 926 | "source": [ 927 | "- Creation of excel file" 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "execution_count": null, 933 | "metadata": {}, 934 | "outputs": [], 935 | "source": [ 936 | "#modify this part if you want to add more information/dfs to be saved as a separate sheet in excel\n", 937 | "\n", 938 | "dfs_to_use = [{'excelSheetTitle': 'Dashboard, datasource and sheet details', 'df_to_use':df_workbookdec, 'mainColWidth':'' , \n", 939 | " 'normalColWidth': [30], 'sheetName': 'GeneralDetails', 'footer': 'Data_1 (DOC API)', 'papersize':9, 'color': c1} , \n", 940 | " \n", 941 | " {'excelSheetTitle': 'Overall counts of dashboards, datasources and sheets', 'df_to_use':df_workbookdec_counts, 'mainColWidth':'' , \n", 942 | " 'normalColWidth': [10], 'sheetName': 'GeneralCounts', 'footer': 'Data_2 (DOC API + XML)', 'papersize':9, 'color': c1},\n", 943 | " \n", 944 | " {'excelSheetTitle': 'Default fields from all datasources', 'df_to_use':df_defaultFields, 'mainColWidth':'' , \n", 945 | " 'normalColWidth': [20,20,40], 'sheetName': 'DefaultFields', 'footer': 'Data_3 (XML extraction)', 'papersize':9, 'color': c2},\n", 946 | " \n", 947 | " {'excelSheetTitle': 'Calculated fields and parameters', 'df_to_use':df, 'mainColWidth':'' , \n", 948 | " 'normalColWidth': [10,50,10,20], 'sheetName': 'CalculatedFields', 'footer': 'Data_4 (XML extraction + DOC API for Param value)', \n", 949 | " 'papersize':9, 'color': c3},\n", 950 | " \n", 951 | " {'excelSheetTitle': 'Filters used in each sheet', 'df_to_use':df1, 'mainColWidth':'' , \n", 952 | " 'normalColWidth': [20,20,40], 'sheetName': 'Filters', 'footer': 'Data_5 (XML extraction)', 'papersize':9, 'color': c4},\n", 953 | " \n", 954 | " {'excelSheetTitle': 'Metrics used in Columns and Rows, for each sheet', 'df_to_use':df2, 'mainColWidth':'' , \n", 955 | " 'normalColWidth': [30,40], 'sheetName': 'RowsAndCols', 'footer': 'Data_6 (XML extraction)', 'papersize':9, 'color': c5},\n", 956 | " \n", 957 | " {'excelSheetTitle': 'Sheet dependencies on default fields, calculated fields and parameters', 'df_to_use':df_sheetDependencies, 'mainColWidth':'' , \n", 958 | " 'normalColWidth': [30,15,25,30], 'sheetName': 'SheetDependencies', 'footer': 'Data_7 (DOC API)', 'papersize':8, 'color': c6}\n", 959 | " ]\n", 960 | "\n", 961 | "#papersize: a3 = 8, a4 = 9" 962 | ] 963 | }, 964 | { 965 | "cell_type": "code", 966 | "execution_count": null, 967 | "metadata": {}, 968 | "outputs": [], 969 | "source": [ 970 | "writer = pd.ExcelWriter(excelName, engine = 'xlsxwriter')\n", 971 | "\n", 972 | "#code to create each sheet in excel, with the specified df and formatting each sheet as per requirements\n", 973 | "#also adds a header and footer to each sheet\n", 974 | "#all the info to be replaced below (ie. 
for each df) comes form the dfs_to_use list of dictionaries\n", 975 | "\n", 976 | "for x in dfs_to_use:\n", 977 | " excelSheetTitle = x['excelSheetTitle']\n", 978 | " df_to_use = x['df_to_use']\n", 979 | " normalColWidth = x['normalColWidth']\n", 980 | " sheetName = x['sheetName']\n", 981 | " papersize = x['papersize']\n", 982 | " footer = x['footer']\n", 983 | " color = x['color']\n", 984 | "\n", 985 | " df_to_use.to_excel(writer, sheet_name = sheetName, index=False)\n", 986 | " \n", 987 | " workbook=writer.book\n", 988 | " worksheet = writer.sheets[sheetName]\n", 989 | "\n", 990 | " worksheet = mainCol(0, color)\n", 991 | " \n", 992 | " ws = 1\n", 993 | " for i in normalColWidth:\n", 994 | " worksheet = normalCol(ws,i)\n", 995 | " ws = ws + 1\n", 996 | "\n", 997 | " worksheet.set_paper(papersize) # a4\n", 998 | " worksheet.fit_to_pages(1,0) # fit to 1 page wide, n long\n", 999 | " worksheet.repeat_rows(0) # repeat the first row\n", 1000 | " \n", 1001 | " header_x = '&C&\"Arial,Bold\"&10{}'.format(excelSheetTitle)\n", 1002 | " footer_x = '&L{}&CPage &P of &N'.format(footer)\n", 1003 | "\n", 1004 | " worksheet.set_header(header_x)\n", 1005 | " worksheet.set_footer(footer_x)\n", 1006 | "\n", 1007 | "writer.save()" 1008 | ] 1009 | }, 1010 | { 1011 | "cell_type": "markdown", 1012 | "metadata": {}, 1013 | "source": [ 1014 | "- Creation of pdf from excel file" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "code", 1019 | "execution_count": null, 1020 | "metadata": {}, 1021 | "outputs": [], 1022 | "source": [ 1023 | "#this creates an index to list each excel sheet, based on the number of sheets that were created before\n", 1024 | "\n", 1025 | "for_ws_index_list = []\n", 1026 | "for i in range(len(dfs_to_use)):\n", 1027 | " for_ws_index_list.append(i+1)" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "code", 1032 | "execution_count": null, 1033 | "metadata": {}, 1034 | "outputs": [], 1035 | "source": [ 1036 | "excel = win32com.client.Dispatch(\"Excel.Application\")\n", 1037 | "excel.Visible = False\n", 1038 | "\n", 1039 | "wb = excel.Workbooks.Open(excel_path)\n", 1040 | "\n", 1041 | "#print all the excel sheets into a single pdf\n", 1042 | "ws_index_list = for_ws_index_list\n", 1043 | "wb.Worksheets(ws_index_list).Select()\n", 1044 | "wb.ActiveSheet.ExportAsFixedFormat(0, path_to_pdf)\n", 1045 | "wb.Close()\n", 1046 | "excel.Quit()" 1047 | ] 1048 | } 1049 | ], 1050 | "metadata": { 1051 | "kernelspec": { 1052 | "display_name": "Python 3 (ipykernel)", 1053 | "language": "python", 1054 | "name": "python3" 1055 | }, 1056 | "language_info": { 1057 | "codemirror_mode": { 1058 | "name": "ipython", 1059 | "version": 3 1060 | }, 1061 | "file_extension": ".py", 1062 | "mimetype": "text/x-python", 1063 | "name": "python", 1064 | "nbconvert_exporter": "python", 1065 | "pygments_lexer": "ipython3", 1066 | "version": "3.7.13" 1067 | } 1068 | }, 1069 | "nbformat": 4, 1070 | "nbformat_minor": 2 1071 | } 1072 | -------------------------------------------------------------------------------- /Tableau_calculation_extractor_with_mermaid.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# version 3.1\n", 10 | "\n", 11 | "import pandas as pd\n", 12 | "import os, re, sys\n", 13 | "import string\n", 14 | "import webbrowser\n", 15 | "\n", 16 | "from tableaudocumentapi import Workbook\n", 17 | "from os.path import isfile, join\n", 18 | "\n", 
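"#excelgenerator is the companion module in this repo; it wraps the Excel formatting and Excel-to-PDF export helpers used further below\n",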
19 | "import excelgenerator as exg\n", 20 | "\n", 21 | "pd.set_option('display.max_columns', None)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "## File Handling\n", 29 | "\n", 30 | "- this version of code will only work with twbx files" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": { 37 | "scrolled": true 38 | }, 39 | "outputs": [], 40 | "source": [ 41 | "input_path = \"inputs\"\n", 42 | "output_path = \"outputs\"\n", 43 | "\n", 44 | "mypath = \"./{}\".format(input_path) #./ points to \"this path\" as a relative path\n", 45 | "\n", 46 | "mypath" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "scrolled": true 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "#only gets files and not directories within the inputs folder -https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory\n", 58 | "input_files = [f for f in os.listdir(mypath) if isfile(join(mypath, f)) and f[-5:] == '.twbx'] \n", 59 | "#input_files.pop()\n", 60 | "input_files" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "def removeSpecialCharFromStr(spstring):\n", 70 | " \n", 71 | "# \"\"\"\n", 72 | "# input: string\n", 73 | "# output: new string, without any special char\n", 74 | "# \"\"\"\n", 75 | " \n", 76 | " return ''.join(e for e in spstring if e.isalnum())\n", 77 | "\n", 78 | "def removeSpecialCharFromStr_leaveSpaces(spstring):\n", 79 | " \n", 80 | " return ''.join(e for e in spstring if (e.isalnum() or e ==' '))\n", 81 | "\n", 82 | "def remove_sp_char_then_turn_spaces_into_underscore(string_to_convert):\n", 83 | " filtered_string = re.sub(r'[^a-zA-Z0-9\\s_]', '', string_to_convert).replace(' ', \"_\")\n", 84 | " return filtered_string\n", 85 | "\n", 86 | "def remove_sp_char_leave_undescore_square_brackets(string_to_convert):\n", 87 | " filtered_string = re.sub(r'[^a-zA-Z0-9\\s._\\[\\]]', '', string_to_convert).replace(' ', \"_\")\n", 88 | " return filtered_string\n", 89 | "\n", 90 | "def find_twbx_file(inputfile):\n", 91 | " \n", 92 | "# \"\"\"\n", 93 | "# input: any input file\n", 94 | "# output: returns the file name without any special char for a twxb file if one is found, else returns empty string\n", 95 | "# \"\"\"\n", 96 | "\n", 97 | " if inputfile[-5:] == '.twbx':\n", 98 | " sp_packagedWorkbook = inputfile[:len(inputfile)-5]\n", 99 | " \n", 100 | " packagedWorkbook = removeSpecialCharFromStr(sp_packagedWorkbook)+'.twbx'\n", 101 | " \n", 102 | " old_file = join(input_path, sp_packagedWorkbook+'.twbx')\n", 103 | " new_file = join(input_path, packagedWorkbook)\n", 104 | " os.rename(old_file, new_file)\n", 105 | "\n", 106 | " else:\n", 107 | " packagedWorkbook = \"\" \n", 108 | " \n", 109 | " return packagedWorkbook" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": {}, 116 | "outputs": [], 117 | "source": [ 118 | "for i in input_files: \n", 119 | " packagedWorkbook = find_twbx_file(i)\n", 120 | " print('Packaged workbook (no sp char): ' + packagedWorkbook)\n", 121 | "\n", 122 | " #substring to be used when naming the exported data, NEEDS A PACKAGED WORKBOOK TO EXIST, OTHERWISE IT WILL GIVE AN EMPTY STRING\n", 123 | " tableau_name_substring = packagedWorkbook.replace(\".twbx\",\"\")[:30]\n", 124 | " print('\\nOutput docs name (word/pdf): ' + tableau_name_substring)\n", 125 | " \n", 126 | 
"packagedTableauFile_relPath = input_path+\"/\"+packagedWorkbook\n", 127 | "packagedTableauFile_relPath" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "# Doc API" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "%%capture \n", 144 | "\n", 145 | "#get all fields in workbook\n", 146 | "TWBX_Workbook = Workbook(packagedTableauFile_relPath)\n", 147 | "\n", 148 | "collator = []\n", 149 | "calcID = []\n", 150 | "calcID2 = []\n", 151 | "calcNames = []\n", 152 | "\n", 153 | "c = 0\n", 154 | " \n", 155 | "for datasource in TWBX_Workbook.datasources:\n", 156 | " datasource_name = datasource.name\n", 157 | " datasource_caption = datasource.caption if datasource.caption else datasource_name\n", 158 | "\n", 159 | " for count, field in enumerate(datasource.fields.values()):\n", 160 | " dict_temp = {\n", 161 | " 'counter': c,\n", 162 | " 'datasource_name': datasource_name,\n", 163 | " 'datasource_caption': datasource_caption,\n", 164 | " 'alias': field.alias,\n", 165 | " 'field_calculation': field.calculation,\n", 166 | " 'field_calculation_bk': field.calculation,\n", 167 | " 'field_caption': field.caption,\n", 168 | " 'field_datatype': field.datatype,\n", 169 | " 'field_def_agg': field.default_aggregation,\n", 170 | " 'field_desc': field.description,\n", 171 | " 'field_hidden': field.hidden,\n", 172 | " 'field_id': field.id,\n", 173 | " 'field_is_nominal': field.is_nominal,\n", 174 | " 'field_is_ordinal': field.is_ordinal,\n", 175 | " 'field_is_quantitative': field.is_quantitative,\n", 176 | " 'field_name': field.name,\n", 177 | " 'field_role': field.role,\n", 178 | " 'field_type': field.type,\n", 179 | " 'field_worksheets': field.worksheets,\n", 180 | " 'field_WHOLE': field\n", 181 | " }\n", 182 | "\n", 183 | " if field.calculation is not None:\n", 184 | " calcID.append(field.id)\n", 185 | " calcNames.append(field.name)\n", 186 | "\n", 187 | " f2 = field.id.replace(']', '').replace('[', '')\n", 188 | " calcID2.append(f2)\n", 189 | "\n", 190 | " c += 1\n", 191 | " collator.append(dict_temp)" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "def default_to_friendly_names2(formulaList,fieldToConvert, dictToUse):\n", 201 | "\n", 202 | " for i in formulaList:\n", 203 | " for tableauName, friendlyName in dictToUse.items():\n", 204 | " try:\n", 205 | " i[fieldToConvert] = (i[fieldToConvert]).replace(tableauName, friendlyName)\n", 206 | " except:\n", 207 | " a = 0\n", 208 | " \n", 209 | " return formulaList\n", 210 | "\n", 211 | "\n", 212 | "def category_field_type(row):\n", 213 | " if row['datasource_name'] == 'Parameters':\n", 214 | " val = 'Parameters'\n", 215 | " elif row['field_calculation'] == None:\n", 216 | " val = 'Default_Field'\n", 217 | " else:\n", 218 | " val = 'Calculated_Field'\n", 219 | " return val\n", 220 | "\n", 221 | "def compare_fields(row):\n", 222 | " if row['field_id'] == row['field_id2']:\n", 223 | " val = 0\n", 224 | " else:\n", 225 | " val = 1\n", 226 | " return val" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": null, 232 | "metadata": { 233 | "scrolled": false 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "calcDict = dict(zip(calcID, calcNames))\n", 238 | "calcDict2 = dict(zip(calcID2, calcNames)) #raw fields without any []\n", 239 | "\n", 240 | "collator = 
default_to_friendly_names2(collator,'field_calculation',calcDict2)\n", 241 | "\n", 242 | "df_API_all = pd.DataFrame(collator)\n", 243 | "df_API_all['field_type'] = df_API_all.apply(category_field_type, axis=1)\n", 244 | "\n", 245 | "preference_list=['Parameters', 'Calculated_Field', 'Default_Field']\n", 246 | "df_API_all[\"field_type\"] = pd.Categorical(df_API_all[\"field_type\"], categories=preference_list, ordered=True)\n", 247 | "\n", 248 | "#get rid of duplicates for parameters, so only parameters from the explicit Parameters datasource are kept (as they are also listed again under the name of any other datasources)\n", 249 | "df_API_all = df_API_all.sort_values([\"field_id\",\"field_type\"]).drop_duplicates([\"field_id\", 'field_calculation']) \n", 250 | "\n", 251 | "df_API_all['field_id2'] = df_API_all['field_id'].str.replace(r'[\\[\\]]', '', regex=True)\n", 252 | "\n", 253 | "df_API_all['comparison'] = df_API_all.apply(compare_fields, axis=1)\n", 254 | "df_API_all = df_API_all[df_API_all['comparison'] == 1]\n", 255 | "\n", 256 | "df_API_all = df_API_all.drop(['field_id2', 'comparison'], axis=1)\n", 257 | "df_API_all.sort_values(['datasource_name', 'field_type', 'counter', 'field_name'])\n", 258 | "\n", 259 | "df1 = df_API_all[[ 'field_name', 'field_datatype','field_type', 'field_calculation', 'field_id', 'datasource_caption']].copy()\n", 260 | "\n", 261 | "preference_list=[ 'Default_Field', 'Parameters', 'Calculated_Field']\n", 262 | "df1[\"field_type\"] = pd.Categorical(df1[\"field_type\"], categories=preference_list, ordered=True)\n", 263 | "df1 = df1.sort_values(['field_type'])\n", 264 | "\n", 265 | "df1.columns = ['Field_Name', 'DataType', 'Type', 'Calculation', 'Field_ID', 'Datasource']\n", 266 | "\n", 267 | "df1['Field_Name'] = df1['Field_Name'].str.replace(r'[\\[\\]]', '', regex=True)\n", 268 | "\n", 269 | "df1" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "## Generating an excel file from a df (so the excel rows/cols can be formatted), then turning the excel into a pdf" 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": null, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "#modify this part if you want to add more information/dfs to be saved as a separate sheet in excel\n", 286 | "\n", 287 | "dfs_to_use = [{'excelSheetTitle': 'All fields extracted from DOC API', 'df_to_use':df1, 'mainColWidth':'' , \n", 288 | " 'normalColWidth': [10,15,50,20, 25], 'sheetName': 'GeneralDetails', 'footer': 'Data_1 (DOC API)', 'papersize':9, 'color': '#fff0b3'} \n", 289 | " \n", 290 | " ]\n", 291 | "\n", 292 | "#papersize: a3 = 8, a4 = 9" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": { 299 | "scrolled": true 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "path_excel_file_to_create, path_pdf_file_to_create = exg.create_new_file_paths(tableau_name_substring+'_CALCS_only')\n", 304 | "\n", 305 | "exg.create_excel_from_dfs(dfs_to_use, path_excel_file_to_create)\n", 306 | "\n", 307 | "exg.create_pdf_from_excel(path_excel_file_to_create, path_pdf_file_to_create, dfs_to_use)" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "# Start of mermaid module" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "def first_char_checker(cell_value):\n", 324 | " if cell_value[0] != '[':\n", 325 | " 
cell_value = '__' + cell_value + '__'\n", 326 | " else:\n", 327 | " cell_value = cell_value.replace('[', '__')\n", 328 | " cell_value = cell_value.replace(']', '__')\n", 329 | "\n", 330 | " return cell_value\n", 331 | "\n", 332 | "\n", 333 | "#define abc list to use during mermaid creation\n", 334 | "\n", 335 | "abc=list(string.ascii_uppercase)\n", 336 | "collated_abc = []\n", 337 | "\n", 338 | "for i in abc:\n", 339 | " for j in abc:\n", 340 | " collated_abc.append(i+j)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "def_fields = df1[df1['Type'] == 'Default_Field']['Field_ID'].copy().apply(remove_sp_char_leave_undescore_square_brackets)\n", 350 | "\n", 351 | "abc_touse = collated_abc[0:len(def_fields)]\n", 352 | "\n", 353 | "def_fields_final = pd.DataFrame(list(zip(def_fields.tolist(), abc_touse)))\n", 354 | "def_fields_final['aa'] = def_fields_final.apply(lambda row: first_char_checker(row[0]), axis=1)\n", 355 | "def_fields_final['default_field'] = def_fields_final.apply(lambda row: '_st_' + row['aa'] + '_en_', axis=1)\n", 356 | "\n", 357 | "mapping_dict_friendly_names = dict(zip(def_fields_final[0].tolist(), abc_touse))\n", 358 | "mapping_dict = dict(zip(def_fields_final['aa'].tolist(), abc_touse))\n", 359 | "\n", 360 | "def_fields_final" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": { 367 | "scrolled": true 368 | }, 369 | "outputs": [], 370 | "source": [ 371 | "created_calc = df_API_all[df_API_all['field_type'] != 'Default_Field']\\\n", 372 | " [['field_name', 'field_id', 'field_calculation', 'field_calculation_bk']].copy()\n", 373 | "\n", 374 | "nlsi = ['x___' + i for i in collated_abc]\n", 375 | "nlsi_to_use = nlsi[0:len(created_calc)]\n", 376 | "\n", 377 | "created_calc['field_name'] = created_calc['field_name'].apply(remove_sp_char_leave_undescore_square_brackets)\n", 378 | "created_calc['aa'] = created_calc.apply(lambda row: first_char_checker(row['field_id']), axis=1)\n", 379 | "created_calc['calc_field'] = created_calc.apply(lambda row: '_st_' + row['aa'] + '_en_', axis=1)\n", 380 | "created_calc['field_calculation_bk'] = created_calc['field_calculation_bk'].str.replace(r'[\\[\\]]', '__', regex=True)\n", 381 | "\n", 382 | "created_calc" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": null, 388 | "metadata": {}, 389 | "outputs": [], 390 | "source": [ 391 | "calc_map_dict = dict(zip(created_calc['aa'].to_list(), nlsi_to_use))\n", 392 | "calc_map_dict" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": { 399 | "scrolled": true 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "created_calc['shorthand_abc'] = created_calc['aa'].map(calc_map_dict)\n", 404 | "created_calc.sort_values(by='shorthand_abc', inplace = True)\n", 405 | "created_calc" 406 | ] 407 | }, 408 | { 409 | "cell_type": "code", 410 | "execution_count": null, 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "# function to add suffixes to duplicate values\n", 415 | "def differentiate_duplicates(series):\n", 416 | " counts = series.groupby(series).cumcount() \n", 417 | " return series + counts.astype(str).replace('0', '')\n", 418 | "\n", 419 | "# differentiate field names that have duplicate values (eg. 
calc field Index appears twice in workbook, now it will be Index, Index1)\n", 420 | "created_calc['field_name'] = differentiate_duplicates(created_calc['field_name'])\n", 421 | "\n", 422 | "created_calc" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": null, 428 | "metadata": { 429 | "scrolled": true 430 | }, 431 | "outputs": [], 432 | "source": [ 433 | "calc_map_dict_friendly_names = dict(zip(created_calc['field_name'], created_calc['shorthand_abc'] ))\n", 434 | "calc_map_dict_friendly_names" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "def create_mermaid_paths(df, field_type):\n", 444 | " \n", 445 | " c = 0\n", 446 | " t_collator = []\n", 447 | "\n", 448 | " for i in df['aa']:\n", 449 | "\n", 450 | " print('\\n______________________' + field_type.upper() + ' TO ANALYSE ________________________: ' + i + '\\n')\n", 451 | "\n", 452 | " try:\n", 453 | " tlist = created_calc[created_calc['field_calculation_bk'].str.contains(i, regex=False) == True]['aa'].to_list()\n", 454 | " except:\n", 455 | " tlist = []\n", 456 | "\n", 457 | " if len(tlist) != 0:\n", 458 | " print('LIST PRINTING:\\n\\n' + str(tlist))\n", 459 | "\n", 460 | " for x in tlist:\n", 461 | " newdict = {}\n", 462 | "\n", 463 | " newdict['count'] = c\n", 464 | " newdict['starting'] = i\n", 465 | " newdict['ending'] = x\n", 466 | "\n", 467 | " newdict['path_mermaid'] = i + \" --> \" + x\n", 468 | "\n", 469 | " print('\\n' + str(c) + ' ******************NEW DICT PRINTING ********************** \\n\\n' + str(newdict))\n", 470 | "\n", 471 | " t_collator.append(newdict)\n", 472 | "\n", 473 | " c = c + 1\n", 474 | " \n", 475 | " return t_collator" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": null, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "t_collator_def_fields = create_mermaid_paths(def_fields_final, 'default_field')\n", 485 | "t_collator_def_fields" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [ 494 | "t_collator_calcs = create_mermaid_paths(created_calc, 'calculation')\n", 495 | "t_collator_calcs" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": null, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "###############################\n", 505 | "#replace the full names of fields and calcs for their abbrv letters, to make the mermaid code leaner\n", 506 | "\n", 507 | "for default_field, mapping_letter in mapping_dict.items():\n", 508 | " for i in t_collator_def_fields:\n", 509 | " i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter)\n", 510 | "\n", 511 | "for default_field, mapping_letter in calc_map_dict.items():\n", 512 | " for i in t_collator_def_fields:\n", 513 | " i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter)\n", 514 | "\n", 515 | "t_collator_def_fields\n", 516 | "##############################\n", 517 | "\n", 518 | "##############################\n", 519 | "# replace the full names of fields and calcs for their abbrv letters, to make the mermaid code leaner\n", 520 | "\n", 521 | "for default_field, mapping_letter in mapping_dict.items():\n", 522 | " for i in t_collator_calcs:\n", 523 | " i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter)\n", 524 | "\n", 525 | "for default_field, mapping_letter in calc_map_dict.items():\n", 
526 | " for i in t_collator_calcs:\n", 527 | " i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter)\n", 528 | "\n", 529 | "t_collator_calcs\n", 530 | "##############################" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": null, 536 | "metadata": {}, 537 | "outputs": [], 538 | "source": [ 539 | "new_list_a = ['']\n", 540 | "fields_list = ['']\n", 541 | "\n", 542 | "new_list_a.extend([i['path_mermaid'] for i in t_collator_calcs])\n", 543 | "new_list_a.extend([i['path_mermaid'] for i in t_collator_def_fields])\n", 544 | "\n", 545 | "################################\n", 546 | "#find the unique nodes within the a --> b mermaid paths in new_list_a (eg. a and b)\n", 547 | "c = []\n", 548 | "\n", 549 | "for i in new_list_a:\n", 550 | " print(i)\n", 551 | " c.append(i.split(' --> ')[0])\n", 552 | "\n", 553 | " try:\n", 554 | " c.append(i.split(' --> ')[1])\n", 555 | " except:\n", 556 | " pass\n", 557 | "\n", 558 | "c.pop(0)\n", 559 | "s = set(c)\n", 560 | "c = list(s)\n", 561 | "##############################\n", 562 | "\n", 563 | "for i, d in mapping_dict_friendly_names.items():\n", 564 | " if d in c:\n", 565 | " if i[0] != '[':\n", 566 | " print(d + \"[\" + i + \"]\")\n", 567 | " fields_list.append(d + \"[\" + i + \"]:::foo\")\n", 568 | " else:\n", 569 | " print(d + i)\n", 570 | " fields_list.append(d + i + ':::foo')\n", 571 | "\n", 572 | "for i, d in calc_map_dict_friendly_names.items():\n", 573 | " if d in c:\n", 574 | " print(d + \"[\" + i + \"]\")\n", 575 | " fields_list.append(d + \"[\" + i + \"]\")\n", 576 | " \n", 577 | "superfinallist = fields_list + new_list_a\n", 578 | "superfinallist" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": null, 584 | "metadata": { 585 | "scrolled": true 586 | }, 587 | "outputs": [], 588 | "source": [ 589 | "mermaid_diagram_code = \\\n", 590 | "\"\"\"\n", 591 | "flowchart LR\n", 592 | " classDef foo fill:#f9f,stroke:#333,stroke-width:1px{}\n", 593 | "\"\"\".format(\"\\n\\t\".join(superfinallist))\n", 594 | "\n", 595 | "print(mermaid_diagram_code)" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": null, 601 | "metadata": {}, 602 | "outputs": [], 603 | "source": [ 604 | "### Create html which will display the mermaid diagram\n", 605 | "\n", 606 | "\n", 607 | "html_base = \"\"\"\n", 608 | "\n", 609 | "\n", 610 | "\n", 611 | "\n", 612 | " \n", 613 | " \n", 614 | " \"\"\" + tableau_name_substring + \" Calculation Lineage\" + \"\"\"\n", 615 | " \n", 616 | " \n", 620 | "\n", 621 | "\n", 622 | "
<h1>\"\"\" + tableau_name_substring + \" Calculation Lineage\" + \"\"\"</h1>\n", 623 | "    \n", 624 | "    <pre class=\"mermaid\">\"\"\" + mermaid_diagram_code + \"\"\"</pre>
\n", 625 | "\n", 626 | "\n", 627 | "\"\"\"\n", 628 | "\n", 629 | "print('\\n ______________________________ START_OF_HTML ______________________________')\n", 630 | "print(html_base)\n", 631 | "print('\\n ______________________________ END_OF_HTML ______________________________')\n", 632 | "\n", 633 | "\n", 634 | "### Output html string to a local file, then open it on the web browser (this bit was done with help of chatgpt)\n", 635 | "\n", 636 | "# Specify the file path\n", 637 | "file_path = 'outputs\\mermaid_diagram_{}.html'.format(tableau_name_substring)\n", 638 | "\n", 639 | "# Write the string to an HTML file\n", 640 | "with open(file_path, 'w') as file:\n", 641 | " file.write(html_base)\n", 642 | "\n", 643 | "print(\"HTML content successfully written to {}\".format(file_path))\n", 644 | "\n", 645 | "# Open the HTML file in the default web browser\n", 646 | "webbrowser.open('file://' + os.path.realpath(file_path))\n", 647 | "\n", 648 | "### end of chatgpt code" 649 | ] 650 | } 651 | ], 652 | "metadata": { 653 | "kernelspec": { 654 | "display_name": "Python 3 (ipykernel)", 655 | "language": "python", 656 | "name": "python3" 657 | }, 658 | "language_info": { 659 | "codemirror_mode": { 660 | "name": "ipython", 661 | "version": 3 662 | }, 663 | "file_extension": ".py", 664 | "mimetype": "text/x-python", 665 | "name": "python", 666 | "nbconvert_exporter": "python", 667 | "pygments_lexer": "ipython3", 668 | "version": "3.7.13" 669 | } 670 | }, 671 | "nbformat": 4, 672 | "nbformat_minor": 2 673 | } 674 | -------------------------------------------------------------------------------- /Tableau_calculation_extractor_with_mermaid.py: -------------------------------------------------------------------------------- 1 | # version 3.0 2 | 3 | import pandas as pd, os, re, string, webbrowser 4 | 5 | from tableaudocumentapi import Workbook 6 | from os.path import isfile, join 7 | 8 | import excelgenerator as exg 9 | 10 | pd.set_option('display.max_columns', None) 11 | 12 | 13 | # ## File Handling 14 | # 15 | # - this version of code will only work with twbx files 16 | 17 | input_path = "inputs" 18 | output_path = "outputs" 19 | 20 | mypath = "./{}".format(input_path) #./ points to "this path" as a relative path 21 | 22 | #only gets files and not directories within the inputs folder -https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory 23 | input_files = [f for f in os.listdir(mypath) if isfile(join(mypath, f))] 24 | input_files 25 | 26 | 27 | 28 | def removeSpecialCharFromStr(spstring): 29 | 30 | # """ 31 | # input: string 32 | # output: new string, without any special char 33 | # """ 34 | 35 | return ''.join(e for e in spstring if e.isalnum()) 36 | 37 | 38 | def removeSpecialCharFromStr_leaveSpaces(spstring): 39 | 40 | return ''.join(e for e in spstring if (e.isalnum() or e ==' ')) 41 | 42 | 43 | def remove_sp_char_then_turn_spaces_into_underscore(string_to_convert): 44 | filtered_string = re.sub(r'[^a-zA-Z0-9\s_]', '', string_to_convert).replace(' ', "_") 45 | return filtered_string 46 | 47 | 48 | def remove_sp_char_leave_undescore_square_brackets(string_to_convert): 49 | filtered_string = re.sub(r'[^a-zA-Z0-9\s_\[\]]', '', string_to_convert).replace(' ', "_") 50 | return filtered_string 51 | 52 | def find_twbx_file(inputfile): 53 | 54 | # """ 55 | # input: any input file 56 | # output: returns the file name without any special char for a twxb file if one is found, else returns empty string 57 | # """ 58 | 59 | if inputfile[-5:] == '.twbx': 60 | sp_packagedWorkbook = 
inputfile[:len(inputfile)-5] 61 |  62 |         packagedWorkbook = removeSpecialCharFromStr(sp_packagedWorkbook)+'.twbx' 63 |  64 |         old_file = join(input_path, sp_packagedWorkbook+'.twbx') 65 |         new_file = join(input_path, packagedWorkbook) 66 |         os.rename(old_file, new_file) 67 |  68 |     else: 69 |         packagedWorkbook = "" 70 |  71 |     return packagedWorkbook 72 |  73 |  74 | for i in input_files: 75 |     packagedWorkbook = find_twbx_file(i) 76 |     print('Packaged workbook (no sp char): ' + packagedWorkbook) 77 |  78 | #substring to be used when naming the exported data, NEEDS A PACKAGED WORKBOOK TO EXIST, OTHERWISE IT WILL GIVE AN EMPTY STRING 79 | tableau_name_substring = packagedWorkbook.replace(".twbx","")[:30] 80 | print('\nOutput docs name (word/pdf): ' + tableau_name_substring) 81 |  82 | packagedTableauFile_relPath = input_path+"/"+packagedWorkbook 83 |  84 |  85 | # # Doc API 86 |  87 | # get all fields in workbook 88 | TWBX_Workbook = Workbook(packagedTableauFile_relPath) 89 |  90 | collator = [] 91 | calcID = [] 92 | calcID2 = [] 93 | calcNames = [] 94 |  95 | c = 0 96 |  97 | for datasource in TWBX_Workbook.datasources: 98 |     datasource_name = datasource.name 99 |     datasource_caption = datasource.caption if datasource.caption else datasource_name 100 |  101 |     for count, field in enumerate(datasource.fields.values()): 102 |         dict_temp = { 103 |             'counter': c, 104 |             'datasource_name': datasource_name, 105 |             'datasource_caption': datasource_caption, 106 |             'alias': field.alias, 107 |             'field_calculation': field.calculation, 108 |             'field_calculation_bk': field.calculation, 109 |             'field_caption': field.caption, 110 |             'field_datatype': field.datatype, 111 |             'field_def_agg': field.default_aggregation, 112 |             'field_desc': field.description, 113 |             'field_hidden': field.hidden, 114 |             'field_id': field.id, 115 |             'field_is_nominal': field.is_nominal, 116 |             'field_is_ordinal': field.is_ordinal, 117 |             'field_is_quantitative': field.is_quantitative, 118 |             'field_name': field.name, 119 |             'field_role': field.role, 120 |             'field_type': field.type, 121 |             'field_worksheets': field.worksheets, 122 |             'field_WHOLE': field 123 |         } 124 |  125 |         if field.calculation is not None: 126 |             calcID.append(field.id) 127 |             calcNames.append(field.name) 128 |  129 |             f2 = field.id.replace(']', '').replace('[', '') 130 |             calcID2.append(f2) 131 |  132 |         c += 1 133 |         collator.append(dict_temp) 134 |  135 |  136 |  137 | def default_to_friendly_names2(formulaList,fieldToConvert, dictToUse): 138 |  139 |     for i in formulaList: 140 |         for tableauName, friendlyName in dictToUse.items(): 141 |             try: 142 |                 i[fieldToConvert] = (i[fieldToConvert]).replace(tableauName, friendlyName) 143 |             except: 144 |                 pass  # field_calculation is None for default fields, so .replace() raises 145 |  146 |     return formulaList 147 |  148 |  149 | def category_field_type(row): 150 |     if row['datasource_name'] == 'Parameters': 151 |         val = 'Parameters' 152 |     elif row['field_calculation'] is None: 153 |         val = 'Default_Field' 154 |     else: 155 |         val = 'Calculated_Field' 156 |     return val 157 |  158 | def compare_fields(row): 159 |     if row['field_id'] == row['field_id2']: 160 |         val = 0 161 |     else: 162 |         val = 1 163 |     return val 164 |  165 |  166 | calcDict = dict(zip(calcID, calcNames)) 167 | calcDict2 = dict(zip(calcID2, calcNames)) #raw fields without any [] 168 |  169 | collator = default_to_friendly_names2(collator,'field_calculation',calcDict2) 170 |  171 | df_API_all = pd.DataFrame(collator) 172 | df_API_all['field_type'] = df_API_all.apply(category_field_type, axis=1) 173 |  174 | preference_list=['Parameters', 'Calculated_Field', 'Default_Field'] 175 | df_API_all["field_type"] = 
pd.Categorical(df_API_all["field_type"], categories=preference_list, ordered=True) 176 | 177 | #get rid of duplicates for parameters, so only parameters from the explicit Parameters datasource are kept (as they are also listed again under the name of any other datasources) 178 | df_API_all = df_API_all.sort_values(["field_id","field_type"]).drop_duplicates(["field_id", 'field_calculation']) 179 | 180 | df_API_all['field_id2'] = df_API_all['field_id'].str.replace(r'[\[\]]', '', regex=True) 181 | 182 | df_API_all['comparison'] = df_API_all.apply(compare_fields, axis=1) 183 | df_API_all = df_API_all[df_API_all['comparison'] == 1] 184 | 185 | df_API_all = df_API_all.drop(['field_id2', 'comparison'], axis=1) 186 | df_API_all.sort_values(['datasource_name', 'field_type', 'counter', 'field_name']) 187 | 188 | df1 = df_API_all[[ 'field_name', 'field_datatype','field_type', 'field_calculation', 'field_id', 'datasource_caption']].copy() 189 | 190 | preference_list=[ 'Default_Field', 'Parameters', 'Calculated_Field'] 191 | df1["field_type"] = pd.Categorical(df1["field_type"], categories=preference_list, ordered=True) 192 | df1 = df1.sort_values(['field_type']) 193 | 194 | df1.columns = ['Field_Name', 'DataType', 'Type', 'Calculation', 'Field_ID', 'Datasource'] 195 | 196 | df1['Field_Name'] = df1['Field_Name'].str.replace(r'[\[\]]', '', regex=True) 197 | 198 | 199 | 200 | # ## Generating an excel file from a df (so the excel rows/cols can be formatted), then turning the excel into a pdf 201 | 202 | #modify this part if you want to add more information/dfs to be saved as a separate sheet in excel 203 | 204 | dfs_to_use = [{'excelSheetTitle': 'All fields extracted from DOC API', 'df_to_use':df1, 'mainColWidth':'' , 205 | 'normalColWidth': [10,15,50,20, 25], 'sheetName': 'GeneralDetails', 'footer': 'Data_1 (DOC API)', 'papersize':9, 'color': '#fff0b3'} 206 | 207 | ] 208 | 209 | #papersize: a3 = 8, a4 = 9 210 | 211 | 212 | 213 | path_excel_file_to_create, path_pdf_file_to_create = exg.create_new_file_paths(tableau_name_substring+'_CALCS_only') 214 | 215 | exg.create_excel_from_dfs(dfs_to_use, path_excel_file_to_create) 216 | 217 | exg.create_pdf_from_excel(path_excel_file_to_create, path_pdf_file_to_create, dfs_to_use) 218 | 219 | 220 | # # Start of mermaid module 221 | 222 | # In[23]: 223 | 224 | 225 | def first_char_checker(cell_value): 226 | if cell_value[0] != '[': 227 | cell_value = '__' + cell_value + '__' 228 | else: 229 | cell_value = cell_value.replace('[', '__') 230 | cell_value = cell_value.replace(']', '__') 231 | 232 | return cell_value 233 | 234 | 235 | #define abc list to use during mermaid creation 236 | 237 | abc=list(string.ascii_uppercase) 238 | collated_abc = [] 239 | 240 | for i in abc: 241 | for j in abc: 242 | collated_abc.append(i+j) 243 | 244 | 245 | # In[24]: 246 | 247 | 248 | def_fields = df1[df1['Type'] == 'Default_Field']['Field_ID'].copy().apply(remove_sp_char_leave_undescore_square_brackets) 249 | 250 | abc_touse = collated_abc[0:len(def_fields)] 251 | 252 | def_fields_final = pd.DataFrame(list(zip(def_fields.tolist(), abc_touse))) 253 | def_fields_final['aa'] = def_fields_final.apply(lambda row: first_char_checker(row[0]), axis=1) 254 | def_fields_final['default_field'] = def_fields_final.apply(lambda row: '_st_' + row['aa'] + '_en_', axis=1) 255 | 256 | mapping_dict_friendly_names = dict(zip(def_fields_final[0].tolist(), abc_touse)) 257 | mapping_dict = dict(zip(def_fields_final['aa'].tolist(), abc_touse)) 258 | 259 | 260 | created_calc = 
df_API_all[df_API_all['field_type'] != 'Default_Field'][['field_name', 'field_id', 'field_calculation', 'field_calculation_bk']].copy() 261 | 262 | nlsi = ['x___' + i for i in collated_abc] 263 | nlsi_to_use = nlsi[0:len(created_calc)] 264 | 265 | created_calc['field_name'] = created_calc['field_name'].apply(remove_sp_char_leave_undescore_square_brackets) 266 | created_calc['aa'] = created_calc.apply(lambda row: first_char_checker(row['field_id']), axis=1) 267 | created_calc['calc_field'] = created_calc.apply(lambda row: '_st_' + row['aa'] + '_en_', axis=1) 268 | created_calc['field_calculation_bk'] = created_calc['field_calculation_bk'].str.replace(r'[\[\]]', '__', regex=True) 269 | 270 | calc_map_dict_friendly_names = dict(zip(created_calc['field_name'].to_list(), nlsi_to_use)) 271 | calc_map_dict = dict(zip(created_calc['aa'].to_list(), nlsi_to_use)) 272 | 273 | 274 | def create_mermaid_paths(df, field_type): 275 | 276 | c = 0 277 | t_collator = [] 278 | 279 | for i in df['aa']: 280 | 281 | print('\n______________________' + field_type.upper() + ' TO ANALYSE ________________________: ' + i + '\n') 282 | 283 | try: 284 | tlist = created_calc[created_calc['field_calculation_bk'].str.contains(i, regex=False) == True]['aa'].to_list() 285 | except: 286 | tlist = [] 287 | 288 | if len(tlist) != 0: 289 | print('LIST PRINTING:\n\n' + str(tlist)) 290 | 291 | for x in tlist: 292 | newdict = {} 293 | 294 | newdict['count'] = c 295 | newdict['starting'] = i 296 | newdict['ending'] = x 297 | 298 | newdict['path_mermaid'] = i + " --> " + x 299 | 300 | print('\n' + str(c) + ' ******************NEW DICT PRINTING ********************** \n\n' + str(newdict)) 301 | 302 | t_collator.append(newdict) 303 | 304 | c = c + 1 305 | 306 | return t_collator 307 | 308 | 309 | 310 | t_collator_def_fields = create_mermaid_paths(def_fields_final, 'default_field') 311 | 312 | 313 | t_collator_calcs = create_mermaid_paths(created_calc, 'calculation') 314 | 315 | 316 | 317 | ############################### 318 | #replace the full names of fields and calcs for their abbrv letters, to make the mermaid code leaner 319 | 320 | for default_field, mapping_letter in mapping_dict.items(): 321 | for i in t_collator_def_fields: 322 | i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter) 323 | 324 | for default_field, mapping_letter in calc_map_dict.items(): 325 | for i in t_collator_def_fields: 326 | i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter) 327 | 328 | 329 | ############################## 330 | 331 | ############################## 332 | # replace the full names of fields and calcs for their abbrv letters, to make the mermaid code leaner 333 | 334 | for default_field, mapping_letter in mapping_dict.items(): 335 | for i in t_collator_calcs: 336 | i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter) 337 | 338 | for default_field, mapping_letter in calc_map_dict.items(): 339 | for i in t_collator_calcs: 340 | i['path_mermaid'] = i['path_mermaid'].replace(default_field, mapping_letter) 341 | 342 | 343 | ############################## 344 | 345 | 346 | new_list_a = [''] 347 | fields_list = [''] 348 | 349 | new_list_a.extend([i['path_mermaid'] for i in t_collator_calcs]) 350 | new_list_a.extend([i['path_mermaid'] for i in t_collator_def_fields]) 351 | 352 | ################################ 353 | #find the unique nodes within the a --> b mermaid paths in new_list_a (eg. 
a and b) 354 | c = [] 355 | 356 | for i in new_list_a: 357 | print(i) 358 | c.append(i.split(' --> ')[0]) 359 | 360 | try: 361 | c.append(i.split(' --> ')[1]) 362 | except: 363 | pass 364 | 365 | c.pop(0) 366 | s = set(c) 367 | c = list(s) 368 | ############################## 369 | 370 | for i, d in mapping_dict_friendly_names.items(): 371 | if d in c: 372 | if i[0] != '[': 373 | print(d + "[" + i + "]") 374 | fields_list.append(d + "[" + i + "]:::foo") 375 | else: 376 | print(d + i) 377 | fields_list.append(d + i + ':::foo') 378 | 379 | for i, d in calc_map_dict_friendly_names.items(): 380 | if d in c: 381 | print(d + "[" + i + "]") 382 | fields_list.append(d + "[" + i + "]") 383 | 384 | superfinallist = fields_list + new_list_a 385 | 386 | 387 | 388 | mermaid_diagram_code = """ 389 | flowchart LR 390 | classDef foo fill:#f9f,stroke:#333,stroke-width:1px{} 391 | """.format("\n\t".join(superfinallist)) 392 | 393 | print(mermaid_diagram_code) 394 | 395 | 396 | ### Create html which will display the mermaid diagram 397 | 398 | html_base = """ 399 | 400 | 401 | 402 | 403 | 404 | 405 | """ + tableau_name_substring + " Calculation Lineage" + """ 406 | 407 | 411 | 412 | 413 |
<h1>""" + tableau_name_substring + " Calculation Lineage" + """</h1> 414 |  415 |     <pre class="mermaid">""" + mermaid_diagram_code + """</pre>
416 | 417 | 418 | """ 419 | 420 | print('\n ______________________________ START_OF_HTML ______________________________') 421 | print(html_base) 422 | print('\n ______________________________ END_OF_HTML ______________________________') 423 | 424 | 425 | 426 | ### Output html string to a local file, then open it on the web browser (this bit was done with help of chatgpt) 427 | 428 | # Specify the file path 429 | file_path = 'outputs\mermaid_diagram_{}.html'.format(tableau_name_substring) 430 | 431 | # Write the string to an HTML file 432 | with open(file_path, 'w') as file: 433 | file.write(html_base) 434 | 435 | print("HTML content successfully written to {}".format(file_path)) 436 | 437 | # Open the HTML file in the default web browser 438 | webbrowser.open('file://' + os.path.realpath(file_path)) 439 | 440 | ### end of code block done with help of chatgpt 441 | 442 | -------------------------------------------------------------------------------- /excelgenerator.py: -------------------------------------------------------------------------------- 1 | import os, pathlib 2 | import win32com.client 3 | import pandas as pd 4 | 5 | 6 | def create_new_file_paths(tableau_name_substring): 7 | 8 | cwd = os.getcwd() 9 | path_string = pathlib.Path(cwd).resolve().__str__() + "\{}" 10 | 11 | print(path_string) 12 | 13 | newFileName = 'outputs\{}'.format(tableau_name_substring) 14 | 15 | excel_path = path_string.format(newFileName + ".xlsx") 16 | path_to_pdf = path_string.format(newFileName + ".pdf") 17 | 18 | print(excel_path) 19 | print(path_to_pdf) 20 | 21 | return (excel_path, path_to_pdf) 22 | 23 | 24 | def mainCol(colNumber, color, writer, sheetName): 25 | 26 | workbook = writer.book 27 | worksheet = writer.sheets[sheetName] 28 | 29 | format_mainCol = workbook.add_format({'text_wrap': True, 'bold': True}) 30 | format_mainCol.set_align('vcenter') 31 | format_mainCol.set_bg_color(color) 32 | format_mainCol.set_border(1) 33 | worksheet.set_column(colNumber,colNumber,20,format_mainCol) 34 | return worksheet 35 | 36 | 37 | def normalCol(colNumber, colWidth, writer, sheetName): 38 | 39 | workbook = writer.book 40 | worksheet = writer.sheets[sheetName] 41 | 42 | 43 | format2 = workbook.add_format({'text_wrap': True}) 44 | format2.set_align('vcenter') 45 | format2.set_border(1) 46 | worksheet.set_column(colNumber,colNumber,colWidth,format2) 47 | return worksheet 48 | 49 | 50 | def create_excel_from_dfs(dfs_to_use, excel_path): 51 | 52 | writer = pd.ExcelWriter(excel_path, engine='xlsxwriter') 53 | 54 | # input: any number of dfs 55 | # output: an excel file with one excel sheet per df 56 | 57 | # code to create each sheet in excel, with the specified df and formatting each sheet as per requirements 58 | # also adds a header and footer to each sheet 59 | # all the info to be replaced below (ie. 
for each df) comes from the dfs_to_use list of dictionaries 60 |  61 |     for x in dfs_to_use: 62 |         excelSheetTitle = x['excelSheetTitle'] 63 |         df_to_use = x['df_to_use'] 64 |         normalColWidth = x['normalColWidth'] 65 |         sheetName = x['sheetName'] 66 |         papersize = x['papersize'] 67 |         footer = x['footer'] 68 |         color = x['color'] 69 |  70 |         df_to_use.to_excel(writer, sheet_name=sheetName, index=False) 71 |  72 |         worksheet = mainCol(colNumber = 0, color = color, writer=writer, sheetName=sheetName) 73 |  74 |         ws = 1 75 |         for i in normalColWidth: #iterates through each column 76 |             worksheet = normalCol(ws, i, writer=writer, sheetName=sheetName) 77 |             ws = ws + 1 78 |  79 |         worksheet.set_paper(papersize) # papersize: a3 = 8, a4 = 9 80 |         worksheet.fit_to_pages(1, 0) # fit to 1 page wide, n long 81 |         worksheet.repeat_rows(0) # repeat the first row 82 |  83 |         header_x = '&C&"Arial,Bold"&10{}'.format(excelSheetTitle) 84 |         footer_x = '&L{}&CPage &P of &N'.format(footer) 85 |  86 |         worksheet.set_header(header_x) 87 |         worksheet.set_footer(footer_x) 88 |  89 |     #writer.save() 90 |     writer.close() 91 |  92 |  93 | def create_pdf_from_excel(path_excel, path_pdf, dfs_to_use): 94 |  95 |  96 |     # this creates an index to list each excel sheet, based on the number of sheets that were created before 97 |  98 |     for_ws_index_list = [] 99 |     for i in range(len(dfs_to_use)): 100 |         for_ws_index_list.append(i + 1) 101 |  102 |     excel = win32com.client.Dispatch("Excel.Application") 103 |     excel.Visible = False 104 |  105 |     wb = excel.Workbooks.Open(path_excel) 106 |  107 |     #print all the excel sheets into a single pdf 108 |     ws_index_list = for_ws_index_list 109 |     wb.Worksheets(ws_index_list).Select() 110 |     wb.ActiveSheet.ExportAsFixedFormat(0, path_pdf) 111 |     wb.Close() 112 |     excel.Quit() 113 | -------------------------------------------------------------------------------- /output_examples/TheMoodsofMidgarWillSutton_CALCS_only.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scinana/tableauCalculationExport/165989e7d9967fe1c3810624ad942f47fe2436e3/output_examples/TheMoodsofMidgarWillSutton_CALCS_only.pdf -------------------------------------------------------------------------------- /output_examples/TheMoodsofMidgarWillSutton_CALCS_only.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/scinana/tableauCalculationExport/165989e7d9967fe1c3810624ad942f47fe2436e3/output_examples/TheMoodsofMidgarWillSutton_CALCS_only.xlsx --------------------------------------------------------------------------------
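
Editor's note on the Mermaid HTML scaffold: the `html_base` string appears twice above (in the notebook and in the .py script), and its markup was partially garbled in this dump. Below is a minimal, self-contained sketch of the kind of page the script builds around `mermaid_diagram_code`. The tag choices (`<h1>`, `<pre class="mermaid">`) and the CDN-based ESM import are assumptions drawn from the standard mermaid.js embedding pattern, not a verbatim copy of the repo's markup; the sample diagram and output file name are likewise hypothetical.

```python
# Hedged sketch: wrap generated Mermaid code in a standalone HTML page.
# Assumes mermaid 10's documented ESM embedding; tag choices are inferred.
mermaid_diagram_code = """
flowchart LR
    classDef foo fill:#f9f,stroke:#333,stroke-width:1px
    AA[Sales]:::foo
    x___AA[Profit Ratio]
    AA --> x___AA
"""
tableau_name_substring = "ExampleWorkbook"  # hypothetical workbook name

html_base = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>""" + tableau_name_substring + " Calculation Lineage" + """</title>
  <script type="module">
    import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
    mermaid.initialize({ startOnLoad: True and true if False else {'startOnLoad': 1} });
  </script>
</head>
<body>
  <h1>""" + tableau_name_substring + " Calculation Lineage" + """</h1>
  <pre class="mermaid">""" + mermaid_diagram_code + """</pre>
</body>
</html>
"""

# Write the page next to the script's other outputs and open it, mirroring
# the webbrowser.open() step used in the repo.
with open('outputs/mermaid_diagram_example.html', 'w') as f:
    f.write(html_base)
```

mermaid.js scans the page for elements with the `mermaid` class on load, so the `<pre class="mermaid">` wrapper is what actually triggers rendering of the flowchart text.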