├── .gitignore ├── .readthedocs.yaml ├── LICENSE.md ├── README.md ├── TARDIS.svg ├── data ├── Pinched_torus.txt.gz └── Wedged_spheres_2D.txt.gz ├── docs ├── .gitignore ├── Makefile ├── make.bat └── source │ ├── api.rst │ ├── conf.py │ ├── data.rst │ ├── index.rst │ └── utils.rst ├── examples └── simple_numpy.py ├── output ├── Gudhi_Computational_Effort.txt └── Ripser_Computational_Effort.txt ├── poetry.lock ├── pyproject.toml ├── scripts ├── es.sh ├── ipsc.sh ├── pinched_torus.sh ├── vision.sh └── wedged_spheres.sh └── tardis ├── __init__.py ├── analyse_euclidicity.py ├── api.py ├── cli.py ├── data.py ├── euclidicity.py ├── make_pinched_torus.py ├── make_wedged_spheres.py ├── make_wedged_spheres_varying_dim.py ├── mat_to_npy.py ├── persistent_homology.py ├── shapes.py ├── utils.py └── visualise_data.py /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.swp 3 | 4 | data/ 5 | 6 | # Using this for `pyenv` and related tools. It can be ignored since we 7 | # do not enforce a specific version in practice. 8 | .python-version 9 | 10 | __pycache__/ 11 | -------------------------------------------------------------------------------- /.readthedocs.yaml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | build: 4 | # Required to get access to more recent Python versions. 5 | image: testing 6 | 7 | sphinx: 8 | configuration: docs/source/conf.py 9 | 10 | python: 11 | version: 3.9 12 | install: 13 | - method: pip 14 | path: . 15 | extra_requirements: 16 | - docs 17 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | Copyright (c) 2022 Julius von Rohrscheidt and Bastian Rieck 2 | 3 | Redistribution and use in source and binary forms, with or without 4 | modification, are permitted provided that the following conditions are 5 | met: 6 | 7 | 1. Redistributions of source code must retain the above copyright 8 | notice, this list of conditions and the following disclaimer. 9 | 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 14 | 3. Neither the name of the copyright holder nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS 19 | IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 20 | TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 21 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 22 | HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 23 | SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED 24 | TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 25 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 26 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 27 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 28 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
29 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # TARDIS: Topological Algorithms for Robust DIscovery of Singularities 2 | 3 | [![arXiv](https://img.shields.io/badge/arXiv-2210.00069-b31b1b.svg)](https://arxiv.org/abs/2210.00069) [![Maintainability](https://api.codeclimate.com/v1/badges/4656850a9d0eb2f85b6e/maintainability)](https://codeclimate.com/github/aidos-lab/TARDIS/maintainability) ![GitHub contributors](https://img.shields.io/github/contributors/aidos-lab/TARDIS) ![GitHub](https://img.shields.io/github/license/aidos-lab/TARDIS) ![Read the Docs](https://img.shields.io/readthedocs/tardis-tda) 4 | 5 | ![TARDIS icon](./TARDIS.svg) 6 | 7 | This is the code for our [ICML paper on topology-driven singularity analysis](https://proceedings.mlr.press/v202/von-rohrscheidt23a.html): 8 | 9 | ```bibtex 10 | @inproceedings{vonRohrscheidt23a, 11 | title = {Topological Singularity Detection at Multiple Scales}, 12 | author = {von Rohrscheidt, Julius and Rieck, Bastian}, 13 | year = 2023, 14 | booktitle = {Proceedings of the 40th International Conference on Machine Learning}, 15 | publisher = {PMLR}, 16 | series = {Proceedings of Machine Learning Research}, 17 | number = 202, 18 | pages = {35175--35197}, 19 | editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan}, 20 | abstract = {The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the `manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.} 21 | } 22 | ``` 23 | 24 | ## Installation 25 | 26 | Our code has been tested with Python 3.8 and Python 3.9 under macOS 27 | and Linux. Other Python versions *may* not support all dependencies. 28 | 29 | The recommended way to install the project is via [`poetry`](https://python-poetry.org/). 30 | If this is available, installation should work very quickly: 31 | 32 |     $ poetry install 33 | 34 | Recent versions of `pip` should also be capable of installing the 35 | project directly: 36 | 37 |     $ pip install . 38 | 39 | ## Experiments 40 | 41 | To reproduce the main experiments in our paper, we ship synthetic data 42 | sets in the repository and provide automated downloads for 43 | the computer vision data sets (`MNIST` and `FashionMNIST`). For 44 | simplicity, we suggest reproducing the experiments with 45 | synthetic point clouds first, as they run quickly even on a standard 46 | desktop computer. 47 | 48 | All experiments make use of the script `cli.py`, which provides 49 | a command-line interface to our framework. Given input parameters for 50 | the local annuli, this script will calculate Euclidicity values as 51 | described in the paper. For simplicity, all output is 52 | written to `stdout`, i.e. the standard output of your terminal, and 53 | needs to be redirected to a file for subsequent analysis. 54 |
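For example, scores can be captured in a file and then inspected with the bundled analysis helper (the parameter values below are purely illustrative):

    $ cd tardis
    $ python cli.py ../data/Pinched_torus.txt.gz -q 100 -r 0.05 -R 0.45 -s 0.2 -S 0.6 > ../output/scores.csv
    $ python analyse_euclidicity.py ../output/scores.csv

The second command is optional; it prints summary statistics of the resulting Euclidicity distribution and shows diagnostic plots.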
55 | We will subsequently provide the precise commands to reproduce the 56 | experiments; readers are invited to take a look at the code in `cli.py` 57 | or call `python cli.py --help` to see what additional options 58 | are available for processing data. 59 | 60 | ### Pinched torus 61 | 62 | Run the following commands from the root directory of the repository: 63 | 64 |     $ cd tardis 65 |     $ python cli.py ../data/Pinched_torus.txt.gz -q 500 -r 0.05 -R 0.45 -s 0.2 -S 0.6 > ../output/Pinched_torus.txt 66 | 67 | This will sample 500 query points from the point cloud and output their $x, y, z$ 68 | coordinates, followed by our Euclidicity score for each point. 69 | 70 | ### Wedged spheres (with automated parameter selection) 71 | 72 | **Warning**: this example might require a long runtime on an ordinary 73 | machine. We ran this on our cluster (see also the [`scripts`](https://github.com/aidos-lab/TARDIS/tree/main/scripts) 74 | folder in the root directory). 75 | 76 | Run the following commands from the root directory of the repository: 77 | 78 |     $ cd tardis 79 |     $ python cli.py -k 100 -q 2000 -d 2 --num-steps 20 ../data/Wedged_spheres_2D.txt.gz > ../output/Wedged_spheres_2D.txt 80 | 81 | This will make use of the automated parameter selection procedure based 82 | on nearest neighbours. Notice that this example uses more query 83 | points; it is of course possible to adjust this parameter. 84 | 85 | ## API & examples 86 | 87 | Check out the [examples folder](https://github.com/aidos-lab/TARDIS/tree/main/examples) for some code snippets that 88 | demonstrate how to use TARDIS in your own code. They all make use of the 89 | [preliminary API](https://github.com/aidos-lab/TARDIS/blob/main/tardis/api.py). 90 |
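As a quick orientation, here is a condensed sketch of the API (mirroring `examples/simple_numpy.py`; the parameter values are illustrative):

```python
import numpy as np

from tardis import calculate_euclidicity

# Random point cloud; scales are estimated from k nearest neighbours,
# and one Euclidicity score per point is returned.
X = np.random.default_rng(42).normal(size=(100, 3))
scores = calculate_euclidicity(X, max_dim=3, n_steps=5, k=10)
```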
91 | ## License 92 | 93 | Our code is released under a BSD-3-Clause license. This license 94 | essentially permits you to freely use our code as desired, integrate it 95 | into your projects, and much more---provided you acknowledge the 96 | original authors. Please refer to [LICENSE.md](./LICENSE.md) for more 97 | information. 98 | 99 | ## Issues 100 | 101 | This project is maintained by members of the [AIDOS Lab](https://github.com/aidos-lab). 102 | Please open an [issue](https://github.com/aidos-lab/TARDIS/issues) in 103 | case you encounter any problems. 104 | -------------------------------------------------------------------------------- /TARDIS.svg: -------------------------------------------------------------------------------- [SVG markup omitted: TARDIS project logo] -------------------------------------------------------------------------------- /data/Pinched_torus.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aidos-lab/TARDIS/df0ef5d9567efb0b33616452aa9b6b439f16ae0d/data/Pinched_torus.txt.gz -------------------------------------------------------------------------------- /data/Wedged_spheres_2D.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aidos-lab/TARDIS/df0ef5d9567efb0b33616452aa9b6b439f16ae0d/data/Wedged_spheres_2D.txt.gz -------------------------------------------------------------------------------- /docs/.gitignore: -------------------------------------------------------------------------------- 1 | build/ 2 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line, and also 5 | # from the environment for the first two. 6 | SPHINXOPTS ?= 7 | SPHINXBUILD ?= sphinx-build 8 | SOURCEDIR = source 9 | BUILDDIR = build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 |
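# Example invocation (an illustrative addition, not part of the original
# Makefile): building the HTML documentation goes through the catch-all
# target above, i.e.
#
#     make html
#
# which expands to `sphinx-build -M html source build`.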
-------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=source 11 | set BUILDDIR=build 12 | 13 | %SPHINXBUILD% >NUL 2>NUL 14 | if errorlevel 9009 ( 15 | echo. 16 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 17 | echo.installed, then set the SPHINXBUILD environment variable to point 18 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 19 | echo.may add the Sphinx directory to PATH. 20 | echo. 21 | echo.If you don't have Sphinx installed, grab it from 22 | echo.https://www.sphinx-doc.org/ 23 | exit /b 1 24 | ) 25 | 26 | if "%1" == "" goto help 27 | 28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 29 | goto end 30 | 31 | :help 32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% 33 | 34 | :end 35 | popd 36 | -------------------------------------------------------------------------------- /docs/source/api.rst: -------------------------------------------------------------------------------- 1 | tardis.api 2 | ========== 3 | 4 | .. automodule:: tardis.api 5 | :members: 6 | -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | project = "TARDIS" 2 | copyright = "2023, Julius von Rohrscheidt and Bastian Rieck" 3 | author = "Julius von Rohrscheidt and Bastian Rieck" 4 | 5 | extensions = [ 6 | "sphinx.ext.autodoc", 7 | "sphinx.ext.napoleon", 8 | "sphinx.ext.linkcode", 9 | ] 10 | 11 | # Ensure that member functions are documented. These are sane defaults. 12 | autodoc_default_options = { 13 | "members": True, 14 | "member-order": "bysource", 15 | "special-members": "__init__", 16 | "undoc-members": True, 17 | "exclude-members": "__weakref__", 18 | } 19 | 20 | templates_path = ["_templates"] 21 | exclude_patterns = [] 22 | 23 | # Tries to assign some semantic meaning to arguments provided with 24 | # single backticks, such as `x`. This way, we can ignore `func` and 25 | # `class` targets etc. (They still work, though!) 26 | default_role = "obj" 27 | 28 | html_theme = "furo" 29 | html_logo = "../../TARDIS.svg" 30 | html_static_path = ["_static"] 31 | 32 | # Ensures that modules are sorted correctly. Since they all pertain to 33 | # the same package, the prefix itself can be ignored. 34 | modindex_common_prefix = ["tardis."] 35 | 36 | html_theme_options = { 37 | "source_repository": "https://github.com/aidos-lab/TARDIS/", 38 | "source_branch": "main", 39 | "source_directory": "docs/source/", 40 | } 41 | 42 | 43 | # Specifies how to actually find the sources of the modules. Ensures 44 | # that people can jump to files in the repository directly. 45 | def linkcode_resolve(domain, info): 46 | # Let's frown on global imports and do everything locally as much as 47 | # we can. 48 | import sys 49 | import tardis 50 | 51 | if domain != "py": 52 | return None 53 | if not info["module"]: 54 | return None 55 | 56 | # Attempt to identify the source file belonging to an `info` object.
57 | # This code is adapted from the Sphinx configuration of `numpy`; see 58 | # https://github.com/numpy/numpy/blob/main/doc/source/conf.py. 59 | def find_source_file(module): 60 | obj = sys.modules[module] 61 | 62 | for part in info["fullname"].split("."): 63 | obj = getattr(obj, part) 64 | 65 | import inspect 66 | import os 67 | 68 | fn = inspect.getsourcefile(obj) 69 | fn = os.path.relpath(fn, start=os.path.dirname(tardis.__file__)) 70 | 71 | source, lineno = inspect.getsourcelines(obj) 72 | return fn, lineno, lineno + len(source) - 1 73 | 74 | try: 75 | module = info["module"] 76 | source = find_source_file(module) 77 | except Exception: 78 | source = None 79 | 80 | root = f"https://github.com/aidos-lab/TARDIS/tree/main/{project.lower()}/" 81 | 82 | if source is not None: 83 | fn, start, end = source 84 | return root + f"{fn}#L{start}-L{end}" 85 | else: 86 | return None 87 | -------------------------------------------------------------------------------- /docs/source/data.rst: -------------------------------------------------------------------------------- 1 | tardis.data 2 | =========== 3 | 4 | .. automodule:: tardis.data 5 | :members: 6 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | TARDIS: Topological Algorithms for Robust DIscovery of Singularities 2 | ==================================================================== 3 | 4 | The manifold hypothesis drives most of modern machine learning research, 5 | but what if you are **not** dealing with a manifold but a more complicated space? 6 | TARDIS uses a topology-driven approach to identify singularities in 7 | high-dimensional data sets at multiple scales, giving you a better 8 | overview of what is in your data. 9 | 10 | How can TARDIS help you? 11 | ------------------------ 12 | 13 | * Find out whether your data set contains singular regions, i.e. regions 14 | that are not adequately described by Euclidean space. 15 | 16 | * Discover whether dimensionality reduction algorithms are embedding 17 | your data correctly or resulting in distortion. 18 | 19 | * Assess the overall complexity of your data set in an unsupervised 20 | fashion. 21 | 22 | Interested? 23 | ----------- 24 | 25 | Read more about TARDIS in our `ICML paper `_ 26 | and consider citing us: 27 | 28 | .. code-block:: bibtex 29 | 30 | @inproceedings{vonRohrscheidt23a, 31 | title = {Topological Singularity Detection at Multiple Scales}, 32 | author = {von Rohrscheidt, Julius and Rieck, Bastian}, 33 | year = 2023, 34 | booktitle = {Proceedings of the 40th International Conference on Machine Learning}, 35 | publisher = {PMLR}, 36 | series = {Proceedings of Machine Learning Research}, 37 | number = 202, 38 | pages = {35175--35197}, 39 | editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan}, 40 | abstract = {The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. 
We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the `manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.} 41 | } 42 | 43 | Documentation 44 | ------------- 45 | 46 | Please find the API documentation and the module documentation below. As 47 | with a lot of academic code, TARDIS is a constant work in progress. Your 48 | contributions are more than welcome! 49 | 50 | .. toctree:: 51 | :maxdepth: 2 52 | :caption: Contents: 53 | 54 | .. toctree:: 55 | :maxdepth: 2 56 | :caption: Modules 57 | 58 | api 59 | data 60 | utils 61 | 62 | Indices and tables 63 | ================== 64 | 65 | * :ref:`genindex` 66 | * :ref:`modindex` 67 | * :ref:`search` 68 | -------------------------------------------------------------------------------- /docs/source/utils.rst: -------------------------------------------------------------------------------- 1 | tardis.utils 2 | ============ 3 | 4 | .. automodule:: tardis.utils 5 | :members: 6 | -------------------------------------------------------------------------------- /examples/simple_numpy.py: -------------------------------------------------------------------------------- 1 | """Simple example of integrating TARDIS and ``numpy``.""" 2 | 3 | 4 | import numpy as np 5 | 6 | from tardis import calculate_euclidicity 7 | 8 | 9 | if __name__ == "__main__": 10 | rng = np.random.default_rng(42) 11 | 12 | # This is the same data set that will also be used for the 13 | # estimation of Euclidicity later on. 14 | X = rng.normal(size=(100, 3)) 15 | 16 | # Only get Euclidicity values. By default, no dimensions will be 17 | # returned (they are always computed, though). Use `n_steps` for 18 | # controlling the scale traversal. 19 | euclidicity = calculate_euclidicity( 20 | X, r=0.01, R=0.25, s=0.05, S=0.5, max_dim=3, n_steps=5 21 | ) 22 | 23 | # Get both Euclidicity and the persistent intrinsic dimension (PID) 24 | # of each data point. 25 | euclidicity, persistent_intrinsic_dimension = calculate_euclidicity( 26 | X, 27 | r=0.01, 28 | R=0.25, 29 | s=0.05, 30 | S=0.5, 31 | max_dim=3, 32 | n_steps=5, 33 | return_dimensions=True, 34 | ) 35 | 36 | # We can also just specify a number of neighbours to use for scale 37 | # estimation. Note that most of the parameters specified below can 38 | # be considered optional since we provide useful defaults. 39 | euclidicity = calculate_euclidicity( 40 | X, 41 | max_dim=3, 42 | n_steps=5, 43 | k=10, 44 | ) 45 | 46 | # Finally, let's calculate Euclidicity with respect to *another* 47 | # data set. 
48 | Y = rng.normal(size=(10, 3)) 49 | euclidicity = calculate_euclidicity( 50 | X, Y, r=0.01, R=0.25, s=0.05, S=0.5, max_dim=3 51 | ) 52 | -------------------------------------------------------------------------------- /output/Gudhi_Computational_Effort.txt: -------------------------------------------------------------------------------- 1 | --- Gudhi Rips Complex --- 10 samples --- dimension 1 --- 0.0020189285278320312 seconds --- 2 | --- Gudhi Rips Complex --- 20 samples --- dimension 1 --- 0.0025501251220703125 seconds --- 3 | --- Gudhi Rips Complex --- 30 samples --- dimension 1 --- 0.0033788681030273438 seconds --- 4 | --- Gudhi Rips Complex --- 40 samples --- dimension 1 --- 0.007442951202392578 seconds --- 5 | --- Gudhi Rips Complex --- 50 samples --- dimension 1 --- 0.006020069122314453 seconds --- 6 | --- Gudhi Rips Complex --- 60 samples --- dimension 1 --- 0.010352849960327148 seconds --- 7 | --- Gudhi Rips Complex --- 10 samples --- dimension 2 --- 0.0004889965057373047 seconds --- 8 | --- Gudhi Rips Complex --- 20 samples --- dimension 2 --- 0.002521038055419922 seconds --- 9 | --- Gudhi Rips Complex --- 30 samples --- dimension 2 --- 0.011078834533691406 seconds --- 10 | --- Gudhi Rips Complex --- 40 samples --- dimension 2 --- 0.033782005310058594 seconds --- 11 | --- Gudhi Rips Complex --- 50 samples --- dimension 2 --- 0.08360719680786133 seconds --- 12 | --- Gudhi Rips Complex --- 60 samples --- dimension 2 --- 0.18638396263122559 seconds --- 13 | --- Gudhi Rips Complex --- 10 samples --- dimension 3 --- 0.0005481243133544922 seconds --- 14 | --- Gudhi Rips Complex --- 20 samples --- dimension 3 --- 0.009124040603637695 seconds --- 15 | --- Gudhi Rips Complex --- 30 samples --- dimension 3 --- 0.07537388801574707 seconds --- 16 | --- Gudhi Rips Complex --- 40 samples --- dimension 3 --- 0.31710100173950195 seconds --- 17 | --- Gudhi Rips Complex --- 50 samples --- dimension 3 --- 1.0923190116882324 seconds --- 18 | --- Gudhi Rips Complex --- 60 samples --- dimension 3 --- 3.1666109561920166 seconds --- 19 | --- Gudhi Rips Complex --- 10 samples --- dimension 4 --- 0.0009720325469970703 seconds --- 20 | --- Gudhi Rips Complex --- 20 samples --- dimension 4 --- 0.028330087661743164 seconds --- 21 | --- Gudhi Rips Complex --- 30 samples --- dimension 4 --- 0.3949770927429199 seconds --- 22 | --- Gudhi Rips Complex --- 40 samples --- dimension 4 --- 2.8597021102905273 seconds --- 23 | --- Gudhi Rips Complex --- 50 samples --- dimension 4 --- 14.362553834915161 seconds --- 24 | --- Gudhi Rips Complex --- 60 samples --- dimension 4 --- 61.31955695152283 seconds --- 25 | --- Gudhi Rips Complex --- 10 samples --- dimension 5 --- 0.0011510848999023438 seconds --- 26 | --- Gudhi Rips Complex --- 20 samples --- dimension 5 --- 0.07502388954162598 seconds --- 27 | --- Gudhi Rips Complex --- 30 samples --- dimension 5 --- 2.104665994644165 seconds --- 28 | --- Gudhi Rips Complex --- 40 samples --- dimension 5 --- 26.206981897354126 seconds --- 29 | --- Gudhi Rips Complex --- 50 samples --- dimension 5 --- 192.25888419151306 seconds --- 30 | --- Gudhi Rips Complex --- 60 samples --- dimension 5 --- 4449.7761080265045 seconds --- 31 | -------------------------------------------------------------------------------- /output/Ripser_Computational_Effort.txt: -------------------------------------------------------------------------------- 1 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 1 --- 0.0019309520721435547 seconds --- 2 | --- Ripser Rips Complex w/ Edge 
Collapse --- 20 samples --- dimension 1 --- 0.002407073974609375 seconds --- 3 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 1 --- 0.00831294059753418 seconds --- 4 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 1 --- 0.007644176483154297 seconds --- 5 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 1 --- 0.021908998489379883 seconds --- 6 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 1 --- 0.030781030654907227 seconds --- 7 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 2 --- 0.0009131431579589844 seconds --- 8 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 2 --- 0.0012538433074951172 seconds --- 9 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 2 --- 0.0011789798736572266 seconds --- 10 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 2 --- 0.016334056854248047 seconds --- 11 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 2 --- 0.03236794471740723 seconds --- 12 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 2 --- 0.05398416519165039 seconds --- 13 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 3 --- 0.0006580352783203125 seconds --- 14 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 3 --- 0.002480030059814453 seconds --- 15 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 3 --- 0.00840306282043457 seconds --- 16 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 3 --- 0.03389477729797363 seconds --- 17 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 3 --- 0.17017388343811035 seconds --- 18 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 3 --- 0.1088259220123291 seconds --- 19 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 4 --- 0.0008890628814697266 seconds --- 20 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 4 --- 0.0017061233520507812 seconds --- 21 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 4 --- 0.0023801326751708984 seconds --- 22 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 4 --- 0.060665130615234375 seconds --- 23 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 4 --- 0.6652989387512207 seconds --- 24 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 4 --- 2.826575994491577 seconds --- 25 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 5 --- 0.0010030269622802734 seconds --- 26 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 5 --- 0.001631021499633789 seconds --- 27 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 5 --- 0.03177905082702637 seconds --- 28 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 5 --- 0.18613076210021973 seconds --- 29 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 5 --- 2.020855188369751 seconds --- 30 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 5 --- 14.891887187957764 seconds --- 31 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 6 --- 0.0009930133819580078 seconds --- 32 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 6 --- 0.0017299652099609375 seconds --- 33 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 
6 --- 0.038159847259521484 seconds --- 34 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 6 --- 1.1811788082122803 seconds --- 35 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 6 --- 5.688662052154541 seconds --- 36 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 6 --- 24.082331895828247 seconds --- 37 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 7 --- 0.00086212158203125 seconds --- 38 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 7 --- 0.001519918441772461 seconds --- 39 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 7 --- 0.00822901725769043 seconds --- 40 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 7 --- 2.1794939041137695 seconds --- 41 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 7 --- 72.07182788848877 seconds --- 42 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 7 --- 561.868953704834 seconds --- 43 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 8 --- 0.0025250911712646484 seconds --- 44 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 8 --- 0.0015439987182617188 seconds --- 45 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 8 --- 0.0831460952758789 seconds --- 46 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 8 --- 0.9906179904937744 seconds --- 47 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 8 --- 10.50025987625122 seconds --- 48 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 8 --- 916.2965040206909 seconds --- 49 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 9 --- 0.0034601688385009766 seconds --- 50 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 9 --- 0.0012841224670410156 seconds --- 51 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 9 --- 0.09065103530883789 seconds --- 52 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 9 --- 0.7031102180480957 seconds --- 53 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 9 --- 157.24873232841492 seconds --- 54 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 9 --- 661.64568400383 seconds --- 55 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 10 --- 0.004229068756103516 seconds --- 56 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 10 --- 0.002311229705810547 seconds --- 57 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 10 --- 0.03548407554626465 seconds --- 58 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 10 --- 5.440964937210083 seconds --- 59 | --- Ripser Rips Complex w/ Edge Collapse --- 50 samples --- dimension 10 --- 25.291701078414917 seconds --- 60 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 10 --- 791.482873916626 seconds --- 61 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 11 --- 0.0065288543701171875 seconds --- 62 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 11 --- 0.002384185791015625 seconds --- 63 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 11 --- 0.0969080924987793 seconds --- 64 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 11 --- 1.355767011642456 seconds --- 65 | --- Ripser 
Rips Complex w/ Edge Collapse --- 50 samples --- dimension 11 --- 430.8150792121887 seconds --- 66 | --- Ripser Rips Complex w/ Edge Collapse --- 60 samples --- dimension 11 --- 3043.677969932556 seconds --- 67 | --- Ripser Rips Complex w/ Edge Collapse --- 10 samples --- dimension 12 --- 0.010491132736206055 seconds --- 68 | --- Ripser Rips Complex w/ Edge Collapse --- 20 samples --- dimension 12 --- 0.002953052520751953 seconds --- 69 | --- Ripser Rips Complex w/ Edge Collapse --- 30 samples --- dimension 12 --- 0.025474071502685547 seconds --- 70 | --- Ripser Rips Complex w/ Edge Collapse --- 40 samples --- dimension 12 --- 13.660001039505005 seconds --- 71 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "tardis" 3 | version = "0.1.0" 4 | description = "TARDIS: Topological Algorithms for Robust DIscovery of Singularities" 5 | authors = ["Julius von Rohrscheidt ", "Bastian Rieck "] 6 | readme = "README.md" 7 | 8 | [tool.poetry.dependencies] 9 | python = ">=3.9,<3.12" 10 | gudhi = "^3.6.0" 11 | numpy = "^1.23.2" 12 | scikit-learn = "^1.1.2" 13 | tqdm = "^4.64.1" 14 | joblib = "^1.2.0" 15 | colorlog = "^6.7.0" 16 | pandas = "^1.5.0" 17 | matplotlib = "^3.6.0" 18 | seaborn = "^0.12.0" 19 | torch = "^1.12.1" 20 | torchvision = "^0.13.1" 21 | scipy = "^1.9.1" 22 | phate = "^1.0.9" 23 | pot = "^0.8.2" 24 | giotto-ph = "^0.2.2" 25 | sphinx = "^7.0.1" 26 | furo = "^2023.5.20" 27 | 28 | [tool.black] 29 | line-length = 79 30 | 31 | [build-system] 32 | requires = ["poetry-core"] 33 | build-backend = "poetry.core.masonry.api" 34 | -------------------------------------------------------------------------------- /scripts/es.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # es.sh: easy submit --- submit a job quickly to the cluster 4 | # 5 | # This script tries to provide a 'fire and forget' solution for 6 | # submitting CPU jobs to the cluster. The parameters may not be 7 | # perfect, but it's probably sufficient for most cases. 8 | 9 | NAME="$1" 10 | CMD="$2" 11 | 12 | if [ -z "$2" ]; then 13 | NAME="easy-submit" 14 | CMD=$1 15 | fi 16 | 17 | if [ -z "$CMD" ]; then 18 | echo "Usage: $0 [NAME] COMMAND" 19 | echo " Specify at least the command to run." 
20 | exit 1 21 | fi 22 | 23 | MAIL_USER=${USER}@helmholtz-muenchen.de 24 | 25 | sbatch -p cpu_p \ 26 | -J ${NAME} \ 27 | -o "${NAME}_%j.out" \ 28 | --cpus-per-task=32 \ 29 | --mem=16G \ 30 | --mail-type=END,FAIL \ 31 | --mail-user=${MAIL_USER} \ 32 | --nice=10000 \ 33 | --wrap "${CMD}" -------------------------------------------------------------------------------- /scripts/ipsc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for POINTS in 1000 2500 5000 10000; do 4 | ./es.sh IPSC-${POINTS} "poetry run python ../tardis/cli.py ../data/ipsc.npz --seed 42 -q ${POINTS} -d 16 -o ../output/ipsc_d16_q${POINTS}_seed42.csv" 5 | done 6 | -------------------------------------------------------------------------------- /scripts/pinched_torus.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for S in 0.5 0.55 0.60 0.65 0.75; do 4 | ./es.sh PINCHED_TORUS "poetry run python ../tardis/cli.py ../data/Pinched_torus.txt.gz -r 0.05 -R 0.45 -s 0.2 -S ${S} > ../output/Pinched_torus_S${S}.txt" 5 | done 6 | -------------------------------------------------------------------------------- /scripts/vision.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | for NUM in $(seq 1 5); do 4 | ./es.sh MNIST "poetry run python ../tardis/cli.py -d 10 --num-steps 20 MNIST > ../output/MNIST_${NUM}.txt" 5 | ./es.sh FashionMNIST "poetry run python ../tardis/cli.py -d 10 --num-steps 20 FashionMNIST > ../output/FashionMNIST_${NUM}.txt" 6 | done 7 | 8 | -------------------------------------------------------------------------------- /scripts/wedged_spheres.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | ./es.sh WEDGED-SPHERES-2 "poetry run python ../tardis/cli.py -r 0.05 -R 0.25 -s 0.1 -S 0.5 -d 2 --num-steps 20 ../data/Wedged_spheres_2D.txt.gz > ../output/Wedged_spheres_2D.txt" 4 | -------------------------------------------------------------------------------- /tardis/__init__.py: -------------------------------------------------------------------------------- 1 | from .api import calculate_euclidicity 2 | from .utils import estimate_scales 3 | -------------------------------------------------------------------------------- /tardis/analyse_euclidicity.py: -------------------------------------------------------------------------------- 1 | """Basic statistical analysis of Euclidicity scores. 2 | 3 | This is a helper script for analysing Euclidicity scores. It generates 4 | plots of the summary statistics and performs Tukey's range test. 5 | """ 6 | 7 | import argparse 8 | import os 9 | 10 | import numpy as np 11 | import pandas as pd 12 | 13 | import matplotlib.pyplot as plt 14 | import seaborn as sns 15 | 16 | from scipy.stats import tukey_hsd 17 | 18 | 19 | def detect_outliers(data): 20 | """Detect outliers based on IQR criterion.""" 21 | # Simple outlier detection: flag everything outside the interval 22 | # [q1 - 1.5 * IQR, q3 + 1.5 * IQR].
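# Worked example (illustrative): with Q1 = 1.0 and Q3 = 3.0, we have
# IQR = 2.0, so values above 3.0 + 1.5 * 2.0 = 6.0 or below
# 1.0 - 1.5 * 2.0 = -2.0 are flagged as outliers.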
23 | iqr = np.subtract(*np.percentile(data, [75, 25])) 24 | q1 = np.percentile(data, 25) 25 | q3 = np.percentile(data, 75) 26 | 27 | print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}") 28 | 29 | upper = data > q3 + 1.5 * iqr 30 | lower = data < q1 - 1.5 * iqr 31 | 32 | print("- Found", upper.sum(), "upper outliers") 33 | print("- Found", lower.sum(), "lower outliers") 34 | 35 | 36 | def print_summary_statistics(data): 37 | """Print some summary statistics.""" 38 | print( 39 | f"max = {np.max(data):.2f}, " 40 | f"mean = {np.mean(data):.2f}, " 41 | f"min = {np.min(data):.2f}", 42 | ) 43 | 44 | 45 | if __name__ == "__main__": 46 | parser = argparse.ArgumentParser() 47 | parser.add_argument("FILE", nargs="+", help="Input filename(s)") 48 | 49 | args = parser.parse_args() 50 | 51 | n_files = len(args.FILE) 52 | fig, axes = plt.subplots(nrows=2, ncols=n_files, squeeze=False) 53 | 54 | distributions = [] 55 | 56 | for ( 57 | col, 58 | filename, 59 | ) in enumerate(args.FILE): 60 | print(f"Processing {filename}") 61 | 62 | if filename.endswith(".csv"): 63 | df = pd.read_csv(filename) 64 | df = df.drop("persistent_intrinsic_dimension", axis="columns") 65 | X = df.to_numpy() 66 | elif filename.endswith(".npz"): 67 | X = np.load(filename)["arr_0"] 68 | else: 69 | X = np.loadtxt(filename) 70 | 71 | # Skip empty files because they lead to problems in the 72 | # downstream analysis. 73 | if len(X) == 0: 74 | continue 75 | 76 | euclidicity = X[:, -1].flatten() 77 | 78 | distributions.append(np.asarray(euclidicity)) 79 | 80 | detect_outliers(euclidicity) 81 | print_summary_statistics(euclidicity) 82 | 83 | axes[0, col].set_title(os.path.splitext(os.path.basename(filename))[0]) 84 | 85 | sns.histplot(data=euclidicity, kde=True, ax=axes[0, col]) 86 | sns.violinplot(data=euclidicity, ax=axes[1, col]) 87 | sns.stripplot(data=euclidicity, ax=axes[1, col], color="black", size=1) 88 | 89 | # We can only do this with more than one distribution, but even for 90 | # a single distribution, we can show the respective plot. 91 | if len(distributions) > 1: 92 | print(tukey_hsd(*distributions)) 93 | 94 | plt.show() 95 | -------------------------------------------------------------------------------- /tardis/api.py: -------------------------------------------------------------------------------- 1 | """Main entry point for API calls to TARDIS. 2 | 3 | This module collects API calls to TARDIS. Each exported function should 4 | facilitate using TARDIS for data analysis. Users that need fine-grained 5 | control are encouraged to build their own functions. 6 | """ 7 | 8 | import joblib 9 | 10 | import numpy as np 11 | 12 | from tardis.euclidicity import Euclidicity 13 | from tardis.utils import estimate_scales 14 | 15 | 16 | def calculate_euclidicity( 17 | X, 18 | Y=None, 19 | max_dim=2, 20 | n_steps=10, 21 | r=None, 22 | R=None, 23 | s=None, 24 | S=None, 25 | k=20, 26 | n_jobs=1, 27 | return_dimensions=False, 28 | ): 29 | """Convenience function for calculating Euclidicity of a point cloud. 30 | 31 | This function provides the most convenient interface for calculating 32 | Euclidicity of a point cloud. Internally, this function will use the 33 | best and fastest Euclidicity calculation, but this comes at the cost 34 | of configurability. 
35 | 36 | Parameters are analogous to those of the `Euclidicity` class: `X` is the input point cloud, while `Y` optionally provides separate query points; if `Y` is `None`, Euclidicity is calculated for all points of `X`. The radii `r`, `R`, `s`, and `S` describe the inner and outer annulus scales; if any of them is missing, suitable scales are estimated from the `k` nearest neighbours of each query point. `max_dim` bounds the dimension of the persistent homology calculations, `n_steps` controls the scale traversal, and `n_jobs` sets the number of parallel jobs. The function returns an array with one Euclidicity score per query point; if `return_dimensions` is set, an array of persistent intrinsic dimension (PID) values is returned as well. 37 | """ 38 | r_, R_, s_, S_ = r, R, s, S 39 | query_points = X if Y is None else Y 40 | 41 | if all([x is not None for x in [r_, R_, s_, S_]]): 42 | scales = [dict()] * len(query_points) 43 | else: 44 | scales = estimate_scales(X, query_points, k) 45 | 46 | euclidicity = Euclidicity( 47 | max_dim=max_dim, 48 | n_steps=n_steps, 49 | r=r_, 50 | R=R_, 51 | s=s_, 52 | S=S_, 53 | method="ripser", 54 | data=X, 55 | ) 56 | 57 | def _process(x, scale=None): 58 | scores, dimensions = euclidicity(X, x, **scale) 59 | 60 | score = np.mean(np.nan_to_num(scores)) 61 | dimension = np.mean(dimensions) 62 | 63 | return score, dimension 64 | 65 | output = joblib.Parallel(n_jobs=n_jobs)( 66 | joblib.delayed(_process)(x, scale) 67 | for x, scale in zip(query_points, scales) 68 | ) 69 | 70 | euclidicity = np.asarray([e for (e, _) in output]) 71 | persistent_intrinsic_dimension = np.asarray([d for (_, d) in output]) 72 | 73 | if return_dimensions: 74 | return euclidicity, persistent_intrinsic_dimension 75 | else: 76 | return euclidicity 77 | -------------------------------------------------------------------------------- /tardis/cli.py: -------------------------------------------------------------------------------- 1 | """Command-line interface for Euclidicity calculations. 2 | 3 | This script is the main command-line interface for our Euclidicity 4 | calculations. It supports loading various input formats, for which 5 | it will calculate Euclidicity scores. 6 | """ 7 | 8 | import argparse 9 | import colorlog 10 | import functools 11 | import joblib 12 | import os 13 | 14 | import numpy as np 15 | import pandas as pd 16 | 17 | from tardis.euclidicity import Euclidicity 18 | 19 | from tardis.shapes import sample_from_annulus 20 | from tardis.shapes import sample_from_constant_curvature_annulus 21 | 22 | from tardis.utils import load_data 23 | from tardis.utils import estimate_scales 24 | 25 | 26 | def setup(): 27 | """Perform logging and argument parsing setup. 28 | 29 | Sets up the command-line interface for subsequent usage so that we 30 | do not clutter up the actual Euclidicity calculations. 31 | 32 | Returns 33 | ------- 34 | Tuple of logger and parsed arguments 35 | """ 36 | handler = colorlog.StreamHandler() 37 | handler.setFormatter( 38 | colorlog.ColoredFormatter("%(log_color)s%(levelname)-.1s: %(message)s") 39 | ) 40 | 41 | logger = colorlog.getLogger() 42 | logger.addHandler(handler) 43 | logger.setLevel(colorlog.INFO) 44 | 45 | parser = argparse.ArgumentParser() 46 | parser.add_argument( 47 | "INPUT", 48 | type=str, 49 | help="Input point cloud or name of data set to load. If this points " 50 | "to an existing file, the file is loaded. Else the input is treated " 51 | "as the name of a (vision) data set.", 52 | ) 53 | parser.add_argument( 54 | "-o", 55 | "--output", 56 | type=str, 57 | help="Output file (optional). If not set, data will be printed to 58 | standard output.
If set, will guess the output format based " 59 | "on the file extension.", 60 | ) 61 | 62 | euclidicity_group = parser.add_argument_group("Euclidicity calculations") 63 | 64 | euclidicity_group.add_argument( 65 | "-k", 66 | "--num-neighbours", 67 | default=50, 68 | type=int, 69 | help="Number of neighbours for parameter estimation", 70 | ) 71 | euclidicity_group.add_argument( 72 | "-d", 73 | "--dimension", 74 | default=2, 75 | type=int, 76 | help="Known or estimated intrinsic dimension", 77 | ) 78 | euclidicity_group.add_argument( 79 | "-r", 80 | type=float, 81 | help="Minimum inner radius of annulus", 82 | ) 83 | euclidicity_group.add_argument( 84 | "-R", 85 | type=float, 86 | help="Maximum inner radius of annulus", 87 | ) 88 | euclidicity_group.add_argument( 89 | "-s", 90 | type=float, 91 | help="Minimum outer radius of annulus", 92 | ) 93 | euclidicity_group.add_argument( 94 | "-S", 95 | type=float, 96 | help="Maximum outer radius of annulus", 97 | ) 98 | euclidicity_group.add_argument( 99 | "--num-steps", 100 | default=10, 101 | type=int, 102 | help="Number of steps for annulus sampling", 103 | ) 104 | parser.add_argument( 105 | "-f", 106 | "--fixed-annulus", 107 | action="store_true", 108 | help="If set, compare to fixed annulus (disables Euclidean sampling)", 109 | ) 110 | 111 | sampling_group = parser.add_argument_group("Sampling") 112 | 113 | sampling_group.add_argument( 114 | "-b", 115 | "--batch-size", 116 | default=10000, 117 | type=int, 118 | help="Number of points to sample from input data", 119 | ) 120 | sampling_group.add_argument( 121 | "-q", 122 | "--num-query-points", 123 | default=1000, 124 | type=int, 125 | help="Number of query points for Euclidicity calculations", 126 | ) 127 | sampling_group.add_argument( 128 | "--seed", 129 | type=int, 130 | help="Random number generator seed for reproducible experiments", 131 | ) 132 | 133 | experimental_group = parser.add_argument_group("Experimental") 134 | 135 | experimental_group.add_argument( 136 | "--curvature", 137 | "-K", 138 | type=float, 139 | default=None, 140 | help="If set, change model space from Euclidean annulus to 2D disk of " 141 | "constant curvature.", 142 | ) 143 | 144 | # TODO: Check for compatibility of different settings. We cannot 145 | # sample from different spaces if we also use a fixed annulus. 146 | args = parser.parse_args() 147 | return logger, args 148 | 149 | 150 | if __name__ == "__main__": 151 | logger, args = setup() 152 | 153 | if args.seed is not None: 154 | logger.info(f"Using pre-defined seed {args.seed}") 155 | 156 | rng = np.random.default_rng(args.seed) 157 | 158 | X, query_points = load_data( 159 | args.INPUT, 160 | args.batch_size, 161 | args.num_query_points, 162 | seed=rng, 163 | ) 164 | 165 | r, R, s, S = args.r, args.R, args.s, args.S 166 | k = args.num_neighbours 167 | 168 | # Check whether we have to perform scale estimation on a per-point 169 | # basis. If not, we just supply an empty dict. 170 | if all([x is not None for x in [r, R, s, S]]): 171 | logger.info( 172 | f"Using global scales r = {r:.2f}, R = {R:.2f}, " 173 | f"s = {s:.2f}, S = {S:.2f}" 174 | ) 175 | 176 | scales = [dict()] * len(query_points) 177 | else: 178 | logger.info( 179 | f"Performing scale estimation with k = {k} since no " 180 | f"parameters have been provided by the client." 
181 | ) 182 | 183 | scales = estimate_scales(X, query_points, k) 184 | 185 | max_dim = args.dimension 186 | n_steps = args.num_steps 187 | 188 | logger.info(f"Maximum dimension: {max_dim}") 189 | logger.info(f"Number of steps for local sampling: {n_steps}") 190 | 191 | # Choose a sampling procedure for the inner comparison of sampled 192 | # annuli from the data space with model spaces. 193 | if args.fixed_annulus: 194 | logger.info("Using fixed annulus comparison") 195 | model_sample_fn = None 196 | elif args.curvature is not None: 197 | logger.info("Using constant-curvature model space") 198 | model_sample_fn = functools.partial( 199 | sample_from_constant_curvature_annulus, K=args.curvature 200 | ) 201 | else: 202 | logger.info("Using Euclidean annulus model space") 203 | model_sample_fn = sample_from_annulus 204 | 205 | euclidicity = Euclidicity( 206 | max_dim=max_dim, 207 | n_steps=n_steps, 208 | r=args.r, 209 | R=args.R, 210 | s=args.s, 211 | S=args.S, 212 | method="ripser", 213 | data=X, 214 | model_sample_fn=model_sample_fn, 215 | ) 216 | 217 | def _process(x, scale=None): 218 | scores, dimensions = euclidicity(X, x, **scale) 219 | 220 | # Aggregate over all scores that we find. We could pick 221 | # a different aggregation here! 222 | score = np.mean(np.nan_to_num(scores)) 223 | dimension = np.mean(dimensions) 224 | 225 | return score, dimension 226 | 227 | output = joblib.Parallel(n_jobs=-1)( 228 | joblib.delayed(_process)(x, scale) 229 | for x, scale in zip(query_points, scales) 230 | ) 231 | 232 | df = pd.DataFrame( 233 | output, columns=["euclidicity", "persistent_intrinsic_dimension"] 234 | ) 235 | 236 | df = pd.concat([pd.DataFrame(query_points).add_prefix("X"), df], axis=1) 237 | 238 | if args.output is None: 239 | print(df.to_csv(index=False)) 240 | else: 241 | extension = os.path.splitext(args.output)[1] 242 | if extension == ".tsv": 243 | df.to_csv(args.output, index=False, sep="\t") 244 | elif extension == ".csv": 245 | df.to_csv(args.output, index=False) 246 | elif extension == ".npy": 247 | np.save(args.output, df) 248 | elif extension == ".npz": 249 | np.savez(args.output, df) 250 | -------------------------------------------------------------------------------- /tardis/data.py: -------------------------------------------------------------------------------- 1 | """Provides samples of more complicated data sets.""" 2 | 3 | from torchvision import datasets 4 | from torchvision import transforms 5 | 6 | from torch.utils.data import DataLoader 7 | 8 | 9 | def sample_vision_data_set(name, n_samples): 10 | """Sample vision data set. 11 | 12 | Parameters 13 | ---------- 14 | name : str 15 | Name of the data set. Currently, only "MNIST" and "FashionMNIST" 16 | are supported here. 17 | 18 | n_samples : int 19 | Number of samples to retrieve. 
20 | 21 | Returns 22 | ------- 23 | np.array 24 | Sampled data points 25 | """ 26 | assert name in ["MNIST", "FashionMNIST"] 27 | 28 | # Convert images to raw [0, 1] tensors; no additional normalisation 29 | # is applied. 30 | transform = transforms.ToTensor() 31 | 32 | if name == "MNIST": 33 | cls = datasets.MNIST 34 | elif name == "FashionMNIST": 35 | cls = datasets.FashionMNIST 36 | 37 | data = cls(root="../data", train=True, download=True, transform=transform) 38 | 39 | data_loader = DataLoader(dataset=data, batch_size=n_samples, shuffle=True) 40 | 41 | X, _ = next(iter(data_loader)) 42 | X = X.reshape(n_samples, -1) 43 | X = X.numpy() 44 | 45 | return X 46 |
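# Illustrative usage sketch (an addition, not part of the original
# module); assumes the script is run from within the `tardis` folder so
# that the relative `../data` root resolves. Downloads MNIST on first use.
if __name__ == "__main__":
    X = sample_vision_data_set("MNIST", n_samples=64)
    print(X.shape)  # (64, 784), since the 28 x 28 images are flattened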
75 | """ 76 | self.r = r 77 | self.R = R 78 | self.s = s 79 | self.S = S 80 | 81 | self.n_steps = n_steps 82 | self.max_dim = max_dim 83 | 84 | self.model_sample_fn = model_sample_fn 85 | 86 | if method == "gudhi": 87 | self.vr = GUDHI() 88 | elif method == "ripser": 89 | self.vr = Ripser() 90 | else: 91 | raise RuntimeError("No persistent homology calculation selected.") 92 | 93 | # Prepare KD tree to speed up annulus calculations. We make this 94 | # configurable to permit both types of workflows. 95 | if data is not None: 96 | self.tree = KDTree(data) 97 | else: 98 | self.tree = None 99 | 100 | def __call__(self, X, x, **kwargs): 101 | """Calculate Euclidicity of a specific point. 102 | 103 | Parameters 104 | ---------- 105 | X : np.array or tensor of shape ``(N, d)`` 106 | Input data set. Must be compatible with the persistent 107 | homology calculations. 108 | 109 | x : np.array, tensor, or iterable of shape ``(d, )`` 110 | Input point. 111 | 112 | Other Parameters 113 | ---------------- 114 | r : float, optional 115 | Minimum inner radius of annulus. Will default to global `r` 116 | parameter if not set. 117 | 118 | R : float, optional 119 | Maximum inner radius of annulus. Will default to global `R` 120 | parameter if not set. 121 | 122 | s : float, optional 123 | Minimum outer radius of annulus. Will default to global `s` 124 | parameter if not set. 125 | 126 | S : float, optional 127 | Maximum outer radius of annulus. Will default to global `S` 128 | parameter if not set. 129 | 130 | Returns 131 | ------- 132 | Tuple of np.array, np.array 133 | 1D array containing Euclidicity estimates. The length of the 134 | array depends on the number of scales. The second array will 135 | contain the persistent intrinsic dimension (PID) values. 136 | """ 137 | r = kwargs.get("r", self.r) 138 | R = kwargs.get("R", self.R) 139 | s = kwargs.get("s", self.s) 140 | S = kwargs.get("S", self.S) 141 | 142 | bottleneck_distances = [] 143 | dimensions = [] 144 | 145 | for r in np.linspace(r, R, self.n_steps): 146 | for s in np.linspace(s, S, self.n_steps): 147 | if r < s: 148 | dist, dim = self._calculate_euclidicity( 149 | r, s, X, x, self.max_dim 150 | ) 151 | 152 | bottleneck_distances.append(dist) 153 | dimensions.append(dim) 154 | 155 | return np.asarray(bottleneck_distances), np.asarray(dimensions) 156 | 157 | # Auxiliary method for performing the 'heavy lifting' when it comes 158 | # to Euclidicity calculations. 159 | def _calculate_euclidicity(self, r, s, X, x, d): 160 | if self.tree is not None: 161 | inner_indices = self.tree.query_radius(x.reshape(1, -1), r)[0] 162 | outer_indices = self.tree.query_radius(x.reshape(1, -1), s)[0] 163 | 164 | annulus_indices = np.setdiff1d(outer_indices, inner_indices) 165 | annulus = X[annulus_indices] 166 | else: 167 | annulus = np.asarray( 168 | [ 169 | np.asarray(p) 170 | for p in X 171 | if np.linalg.norm(x - p) <= s 172 | and np.linalg.norm(x - p) >= r 173 | ] 174 | ) 175 | 176 | barcodes, max_dim = self.vr(annulus, d) 177 | 178 | if max_dim < 0: 179 | return np.nan, max_dim 180 | 181 | if self.model_sample_fn is not None: 182 | euclidean_annulus = self.model_sample_fn( 183 | n=len(annulus), r=r, R=s, d=d 184 | ) 185 | barcodes_euclidean, _ = self.vr(euclidean_annulus, d) 186 | 187 | # No sampling function has been specified. Compare to a fixed 188 | # annulus with known persistent homology. 189 | # 190 | # TODO: Technically, the single feature should be put into 191 | # a persistence diagram of the right dimension. 
--------------------------------------------------------------------------------
/tardis/make_pinched_torus.py:
--------------------------------------------------------------------------------
1 | """Create "pinched torus" data set.
2 | 
3 | Usage:
4 |     python make_pinched_torus.py > Pinched_torus.txt
5 | """
6 | 
7 | import sys
8 | 
9 | from math import pi
10 | from math import cos
11 | from math import sin
12 | 
13 | import numpy as np
14 | 
15 | m = 512
16 | o = 512
17 | R = 10
18 | r = 1
19 | k = 0.5
20 | 
21 | # Gap size in angular coordinates. This is to be seen as the radius for
22 | # which the 'pinch' is relevant. (Currently unused; kept for reference.)
23 | gap_size = pi / 180.0 * 90
24 | 
25 | X = list()
26 | Y = list()
27 | Z = list()
28 | 
29 | for i in range(m):
30 |     for j in range(o):
31 |         phi = 2 * pi * i / (m - 1)
32 |         theta = 2 * pi * j / (o - 1)
33 | 
34 |         x = (R + r * cos(theta) * cos(k * phi)) * cos(phi)
35 |         y = (R + r * cos(theta) * cos(k * phi)) * sin(phi)
36 |         z = r * sin(theta) * cos(k * phi)
37 | 
38 |         X.append(x)
39 |         Y.append(y)
40 |         Z.append(z)
41 | 
42 | X = np.vstack((X, Y, Z)).T
43 | np.savetxt(sys.stdout, X)
44 | 
--------------------------------------------------------------------------------
/tardis/make_wedged_spheres.py:
--------------------------------------------------------------------------------
1 | """Create "wedged spheres" data set.
2 | 
3 | Usage:
4 |     python make_wedged_spheres.py -d 2 > Wedged_spheres_2D.txt
5 | """
6 | 
7 | import argparse
8 | import sys
9 | 
10 | import numpy as np
11 | 
12 | from tardis.shapes import sample_from_wedged_spheres
13 | 
14 | 
15 | if __name__ == "__main__":
16 |     parser = argparse.ArgumentParser()
17 | 
18 |     parser.add_argument(
19 |         "-d", "--dimension", default=2, type=int, help="Intrinsic dimension"
20 |     )
21 |     parser.add_argument(
22 |         "-n",
23 |         "--num-samples",
24 |         default=10000,
25 |         type=int,
26 |         help="Number of samples",
27 |     )
28 | 
29 |     args = parser.parse_args()
30 | 
31 |     X = sample_from_wedged_spheres(args.num_samples, args.dimension)
32 |     np.savetxt(sys.stdout, X)
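33 | 
34 | # Example invocation (the output name mirrors the data set shipped in
35 | # this repository; `-d`/`-n` are the flags defined above):
36 | #
37 | #     python make_wedged_spheres.py -d 2 -n 10000 > Wedged_spheres_2D.txt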
--------------------------------------------------------------------------------
/tardis/make_wedged_spheres_varying_dim.py:
--------------------------------------------------------------------------------
1 | """Create "wedged spheres of possibly different dimensions" data set.
2 | 
3 | Usage:
4 |     python make_wedged_spheres_varying_dim.py > Wedged_spheres_varying_dim.txt
5 | """
6 | 
7 | import argparse
8 | import sys
9 | 
10 | import numpy as np
11 | 
12 | from tardis.shapes import sample_from_wedged_sphere_varying_dim
13 | 
14 | 
15 | if __name__ == "__main__":
16 |     parser = argparse.ArgumentParser()
17 | 
18 |     parser.add_argument(
19 |         "-d1",
20 |         "--dimension1",
21 |         default=1,
22 |         type=int,
23 |         help="Intrinsic dimension of first sphere (must not exceed d2)",
24 |     )
25 |     parser.add_argument(
26 |         "-d2",
27 |         "--dimension2",
28 |         default=2,
29 |         type=int,
30 |         help="Intrinsic dimension of second sphere",
31 |     )
32 |     parser.add_argument(
33 |         "-n",
34 |         "--num-samples",
35 |         default=10000,
36 |         type=int,
37 |         help="Number of samples",
38 |     )
39 | 
40 |     args = parser.parse_args()
41 | 
42 |     X = sample_from_wedged_sphere_varying_dim(
43 |         args.num_samples, args.dimension1, args.dimension2
44 |     )
45 |     np.savetxt(sys.stdout, X)
46 | 
--------------------------------------------------------------------------------
/tardis/mat_to_npy.py:
--------------------------------------------------------------------------------
1 | """Convert .mat to .npz file(s).
2 | 
3 | Usage:
4 |     python mat_to_npy.py INPUT
5 | """
6 | 
7 | import os
8 | import sys
9 | 
10 | import numpy as np
11 | import scipy.io as sio
12 | 
13 | fname = sys.argv[1]
14 | X = sio.loadmat(fname)
15 | X = X["data"]
16 | 
17 | fname = os.path.splitext(fname)[0]
18 | 
19 | # `np.savez` appends ".npz" automatically, so check for the actual
20 | # output file instead of the bare stem.
21 | if not os.path.exists(fname + ".npz"):
22 |     np.savez(fname, data=X)
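23 | 
24 | # Example (hypothetical file name): given `scan.mat` containing a
25 | # variable `data`, this writes `scan.npz`, which can be read back via
26 | # `np.load("scan.npz")["data"]`, e.g. by `tardis.utils.load_data`:
27 | #
28 | #     python mat_to_npy.py scan.mat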
--------------------------------------------------------------------------------
/tardis/persistent_homology.py:
--------------------------------------------------------------------------------
1 | """Wrappers for persistent homology calculations.
2 | 
3 | The purpose of this module is to provide wrappers for the persistent
4 | homology calculations. This is to ensure that the returned shapes of
5 | barcodes etc. are always consistent regardless of any implementation
6 | details.
7 | """
8 | 
9 | import gudhi as gd
10 | import numpy as np
11 | 
12 | from gph import ripser_parallel
13 | 
14 | 
15 | class GUDHI:
16 |     """Wrapper for GUDHI persistent homology calculations."""
17 | 
18 |     def __call__(self, X, max_dim):
19 |         """Calculate persistent homology.
20 | 
21 |         Parameters
22 |         ----------
23 |         X : np.array of shape ``(N, d)``
24 |             Input data set.
25 | 
26 |         max_dim : int
27 |             Maximum dimension for calculations
28 | 
29 |         Returns
30 |         -------
31 |         Tuple of np.array, int
32 |             Full barcode (persistence diagram) of the data set, together
33 |             with the maximum dimension of its features. Returns ``None``
34 |             and ``-1``, respectively, if the barcode is empty.
35 |         """
36 |         barcodes = (
37 |             gd.RipsComplex(points=X)
38 |             .create_simplex_tree(max_dimension=max_dim)
39 |             .persistence()
40 |         )
41 | 
42 |         if len(barcodes) == 0:
43 |             return None, -1
44 | 
45 |         # TODO: Check whether this is *always* a feature of non-zero
46 |         # persistence.
47 |         max_dim = np.max([d for d, _ in barcodes])
48 | 
49 |         # TODO: We are throwing away dimensionality information; it is
50 |         # thus possible that we are matching across different dimensions
51 |         # in any distance calculation.
52 |         barcodes = np.asarray([np.array(x) for _, x in barcodes])
53 | 
54 |         return barcodes, max_dim
55 | 
56 |     def distance(self, D1, D2):
57 |         """Calculate bottleneck distance between two persistence diagrams."""
58 |         return gd.bottleneck_distance(D1, D2)
59 | 
60 | 
61 | class Ripser:
62 |     """Wrapper for Ripser-based persistent homology calculations."""
63 | 
64 |     def __init__(self, stack_diagrams=True):
65 |         """Initialise wrapper.
66 | 
67 |         Parameters
68 |         ----------
69 |         stack_diagrams : bool
70 |             If set, stacks diagrams of all dimensions into a single
71 |             diagram before calculating distances. Otherwise, distances
72 |             are calculated per dimension, and their maximum is reported.
73 |         """
74 |         self.stack_diagrams = stack_diagrams
75 | 
76 |         if self.stack_diagrams:
77 | 
78 |             def distance_fn(D1, D2):
79 |                 return gd.bottleneck_distance(D1, D2)
80 | 
81 |         else:
82 | 
83 |             def distance_fn(diagrams1, diagrams2):
84 |                 values = [
85 |                     gd.bottleneck_distance(D1, D2)
86 |                     for D1, D2 in zip(diagrams1, diagrams2)
87 |                 ]
88 |                 return np.max(values)
89 | 
90 |         self.distance = distance_fn
91 | 
92 |     def __call__(self, X, max_dim):
93 |         """Calculate persistent homology; see `GUDHI.__call__`."""
94 |         if len(X) == 0:
95 |             return [], -1
96 | 
97 |         diagrams = ripser_parallel(X, maxdim=max_dim, collapse_edges=True)
98 | 
99 |         diagrams = diagrams["dgms"]
100 |         max_dim = np.max([d for d, D in enumerate(diagrams) if len(D) > 0])
101 | 
102 |         if self.stack_diagrams:
103 |             # `np.row_stack` was removed in NumPy 2.0, so we use its
104 |             # equivalent `np.vstack` here.
105 |             diagrams = np.vstack(diagrams)
106 | 
107 |         return diagrams, max_dim
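108 | 
109 | # Usage sketch (illustrative only): persistent homology of a circle;
110 | # comparing a diagram with itself yields bottleneck distance zero.
111 | #
112 | #     from tardis.shapes import sample_from_sphere
113 | #
114 | #     X = sample_from_sphere(n=200, d=1, seed=0)
115 | #     vr = Ripser()
116 | #     diagrams, max_dim = vr(X, 1)
117 | #     print(max_dim, vr.distance(diagrams, diagrams))  # expected: 1 0.0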
--------------------------------------------------------------------------------
/tardis/shapes.py:
--------------------------------------------------------------------------------
1 | """Shape sampling methods."""
2 | 
3 | import numpy as np
4 | 
5 | 
6 | def sample_from_annulus(n, r, R, d=2, seed=None):
7 |     """Sample points from an annulus.
8 | 
9 |     This function samples `n` points from an annulus with inner radius
10 |     `r` and outer radius `R`.
11 | 
12 |     Parameters
13 |     ----------
14 |     n : int
15 |         Number of points to sample
16 | 
17 |     r : float
18 |         Inner radius of annulus
19 | 
20 |     R : float
21 |         Outer radius of annulus
22 | 
23 |     d : int
24 |         Dimension of the annulus. Technically, for higher dimensions, we
25 |         should call the resulting space a "hyperspherical shell." Notice
26 |         that the algorithm for sampling points in higher dimensions uses
27 |         rejection sampling, so its efficiency decreases as the dimension
28 |         increases.
29 | 
30 |     seed : int, instance of `np.random.Generator`, or `None`
31 |         Seed for the random number generator, or an instance of such
32 |         a generator. If set to `None`, the default random number
33 |         generator will be used.
34 | 
35 |     Returns
36 |     -------
37 |     np.array of shape `(n, d)`
38 |         Array containing sampled coordinates.
39 |     """
40 |     if r >= R:
41 |         raise RuntimeError(
42 |             "Inner radius must be strictly less than outer radius"
43 |         )
44 | 
45 |     rng = np.random.default_rng(seed)
46 | 
47 |     if d == 2:
48 |         thetas = rng.uniform(0, 2 * np.pi, n)
49 | 
50 |         # Need to sample based on squared radii to account for density
51 |         # differences.
52 |         radii = np.sqrt(rng.uniform(r**2, R**2, n))
53 | 
54 |         X = np.column_stack((radii * np.cos(thetas), radii * np.sin(thetas)))
55 |     else:
56 |         X = np.empty((0, d))
57 | 
58 |         while True:
59 |             sample = sample_from_ball(n, d, r=R, seed=rng)
60 |             norms = np.sqrt(np.sum(np.abs(sample) ** 2, axis=-1))
61 | 
62 |             X = np.vstack((X, sample[norms >= r]))
63 | 
64 |             if len(X) >= n:
65 |                 X = X[:n, :]
66 |                 break
67 | 
68 |     return X
69 | 
70 | 
71 | def sample_from_ball(n=100, d=2, r=1, seed=None):
72 |     """Sample `n` data points from a `d`-ball in `d` dimensions.
73 | 
74 |     Parameters
75 |     ----------
76 |     n : int
77 |         Number of data points in ball.
78 | 
79 |     d : int
80 |         Dimension of the ball. Notice that there is an inherent shift in
81 |         dimension if you compare a ball to a sphere.
82 | 
83 |     r : float
84 |         Radius of ball.
85 | 
86 |     seed : int, instance of `np.random.Generator`, or `None`
87 |         Seed for the random number generator, or an instance of such
88 |         a generator. If set to `None`, the default random number
89 |         generator will be used.
90 | 
91 |     Returns
92 |     -------
93 |     np.array of shape `(n, d)`
94 |         Array of sampled coordinates.
95 | 
96 |     References
97 |     ----------
98 |     .. [Voelker2017] A. Voelker et al., Efficiently sampling vectors and
99 |        coordinates from the $n$-sphere and $n$-ball, Technical Report,
100 |        2017. http://compneuro.uwaterloo.ca/files/publications/voelker.2017.pdf
101 |     """
102 |     rng = np.random.default_rng(seed)
103 | 
104 |     # This algorithm was originally described in the following blog
105 |     # post:
106 |     #
107 |     # http://extremelearning.com.au/how-to-generate-uniformly-random-points
108 |     # -on-n-spheres-and-n-balls/
109 |     #
110 |     # It's mind-boggling that this works but it's true!
111 |     U = rng.normal(size=(n, d + 2))
112 |     norms = np.sqrt(np.sum(np.abs(U) ** 2, axis=-1))
113 |     U = r * U / norms[:, np.newaxis]
114 |     X = U[:, 0:d]
115 | 
116 |     return np.asarray(X)
117 | 
118 | 
119 | def sample_from_sphere(n=100, d=2, r=1, noise=None, seed=None):
120 |     """Sample `n` data points from a `d`-sphere in `d + 1` dimensions.
121 | 
122 |     Parameters
123 |     ----------
124 |     n : int
125 |         Number of data points in shape.
126 | 
127 |     d : int
128 |         Dimension of the sphere.
129 | 
130 |     r : float
131 |         Radius of sphere.
132 | 
133 |     noise : float or None
134 |         Optional noise factor. If set, data coordinates will be
135 |         perturbed by a standard normal distribution, scaled by
136 |         `noise`.
137 | 
138 |     seed : int, instance of `np.random.Generator`, or `None`
139 |         Seed for the random number generator, or an instance of such
140 |         a generator. If set to `None`, the default random number
141 |         generator will be used.
142 | 
143 |     Returns
144 |     -------
145 |     np.array of shape `(n, d + 1)`
146 |         Array of sampled coordinates.
147 | 
148 |     Notes
149 |     -----
150 |     This function was originally authored by Nathaniel Saul as part of
151 |     the `tadasets` package. [tadasets]_
152 | 
153 |     References
154 |     ----------
155 |     .. [tadasets] https://github.com/scikit-tda/tadasets
156 |     """
157 |     rng = np.random.default_rng(seed)
158 |     data = rng.standard_normal((n, d + 1))
159 | 
160 |     # Normalize points to the sphere
161 |     data = r * data / np.sqrt(np.sum(data**2, 1)[:, None])
162 | 
163 |     if noise:
164 |         data += noise * rng.standard_normal(data.shape)
165 | 
166 |     return np.asarray(data)
167 | 
168 | 
169 | def sample_from_wedged_spheres(n=100, d=2, r=1, noise=None, seed=None):
170 |     """Sample points from two wedged spheres.
171 | 
172 |     Parameters
173 |     ----------
174 |     n : int
175 |         Number of points to sample *per sphere*.
176 | 
177 |     d : int
178 |         Intrinsic dimension of spheres. The ambient dimension will be
179 |         ``d + 1``.
180 | 
181 |     r : float
182 |         Radius of spheres
183 | 
184 |     noise : float or None
185 |         If set, will be used as a scale factor for random perturbations
186 |         of the positions of points, following a standard normal
187 |         distribution.
188 | 
189 |     seed : int, instance of `np.random.Generator`, or `None`
190 |         Seed for the random number generator, or an instance of such
191 |         a generator. If set to `None`, the default random number
192 |         generator will be used.
193 | 
194 |     Returns
195 |     -------
196 |     np.array of shape `(2 * n, d + 1)`
197 |         Array of sampled coordinates.
198 |     """
199 |     rng = np.random.default_rng(seed)
200 | 
201 |     data1 = rng.standard_normal((n, d + 1))
202 |     data1 = r * data1 / np.sqrt(np.sum(data1**2, 1)[:, None])
203 | 
204 |     data2 = rng.standard_normal((n, d + 1))
205 |     data2 = (
206 |         r * data2 / np.sqrt(np.sum(data2**2, 1)[:, None])
207 |     ) + np.concatenate((np.array([2 * r]), np.zeros(data2.shape[1] - 1)))
208 | 
209 |     X = np.concatenate((data1, data2))
210 | 
211 |     if noise:
212 |         X += noise * rng.standard_normal(X.shape)
213 | 
214 |     return X
215 | 
216 | 
217 | def sample_from_wedged_sphere_varying_dim(n=100, d1=1, d2=2, r=1, noise=None):
218 |     """Sample points from two wedged spheres of possibly different dimensions.
219 | 
220 |     This function permits sampling from wedged spheres of different
221 |     dimensions, thus making it possible to, for instance, combine a
222 |     circle with an ordinary 2D sphere.
223 | 
224 |     Parameters
225 |     ----------
226 |     n : int
227 |         Number of points to sample *per sphere*.
228 | 
229 |     d1 : int
230 |         Intrinsic dimension of first sphere. The ambient dimension will
231 |         be ``d2 + 1``; this requires ``d1 <= d2``.
232 | 
233 |     d2 : int
234 |         Intrinsic dimension of second sphere. The ambient dimension will
235 |         be ``d2 + 1``.
236 | 
237 |     r : float
238 |         Radius of spheres
239 | 
240 |     noise : float or None
241 |         If set, will be used as a scale factor for random perturbations
242 |         of the positions of points, following a standard normal
243 |         distribution.
244 |     """
245 |     assert d1 <= d2, "First dimension must not exceed second dimension"
246 | 
247 |     data1 = np.random.randn(n, d1 + 1)
248 |     data1 = r * data1 / np.sqrt(np.sum(data1**2, 1)[:, None])
249 |     zeros = np.zeros((len(data1), d2 - d1))
250 |     data1 = np.concatenate((data1, zeros), axis=1)
251 | 
252 |     data2 = np.random.randn(n, d2 + 1)
253 |     data2 = (
254 |         r * data2 / np.sqrt(np.sum(data2**2, 1)[:, None])
255 |     ) + np.concatenate((np.array([2 * r]), np.zeros(data2.shape[1] - 1)))
256 | 
257 |     data = np.concatenate((data1, data2))
258 |     if noise:
259 |         data += noise * np.random.randn(*data.shape)
260 | 
261 |     return data
262 | 
263 | 
264 | def sample_from_constant_curvature_annulus(n, K, r, R, seed=None, **kwargs):
265 |     """Sample points from an annulus of constant curvature.
266 | 
267 |     Uses rejection sampling on `sample_from_constant_curvature_disk`;
268 |     additional keyword arguments are accepted but ignored.
269 | 
270 |     Parameters
271 |     ----------
272 |     n : int
273 |         Number of points to sample
274 | 
275 |     K : float
276 |         Curvature of the underlying disk.
277 | 
278 |     r : float
279 |         Inner radius of annulus
280 | 
281 |     R : float
282 |         Outer radius of annulus
283 | 
284 |     seed : int, instance of `np.random.Generator`, or `None`
285 |         Seed for the random number generator, or an instance of such
286 |         a generator. If set to `None`, the default random number
287 |         generator will be used.
288 | 
289 |     Returns
290 |     -------
291 |     np.array of shape `(n, 2)`
292 |         Array containing sampled coordinates.
293 |     """
294 |     rng = np.random.default_rng(seed)
295 |     X = np.empty((0, 2))
296 | 
297 |     while True:
298 |         sample = sample_from_constant_curvature_disk(n, K=K, r=R, seed=rng)
299 |         norms = np.sqrt(np.sum(np.abs(sample) ** 2, axis=-1))
300 | 
301 |         X = np.vstack((X, sample[norms >= r]))
302 | 
303 |         if len(X) >= n:
304 |             X = X[:n, :]
305 |             break
306 | 
307 |     return X
308 | 
309 | 
310 | def sample_from_constant_curvature_disk(n, K=0.0, r=1.0, seed=None):
311 |     """Sample from a disk of constant curvature.
312 | 
313 |     Parameters
314 |     ----------
315 |     n : int
316 |         Number of points to sample
317 | 
318 |     K : float
319 |         Curvature of the respective disk. When positive, must be less
320 |         than or equal to 2.
321 | 
322 |     r : float
323 |         Radius of disk.
324 | 
325 |     seed : int, instance of `np.random.Generator`, or `None`
326 |         Seed for the random number generator, or an instance of such
327 |         a generator. If set to `None`, the default random number
328 |         generator will be used.
329 | 
330 |     Returns
331 |     -------
332 |     np.array of shape `(n, 2)`
333 |         Array containing sampled coordinates.
334 |     """
335 |     rng = np.random.default_rng(seed)
336 | 
337 |     theta = rng.uniform(0, 2 * np.pi, n)
338 |     u = rng.uniform(0, 1, n)
339 | 
340 |     # Sample from Euclidean disk; we could also get this result with
341 |     # other routines from this module, but implementing this here makes
342 |     # everything more self-contained.
343 |     if K == 0.0:
344 |         radii = np.sqrt(u)
345 | 
346 |     # Hyperbolic case (negative curvature)
347 |     elif K < 0.0:
348 |         radii = np.multiply(np.sqrt(u), np.sinh(np.sqrt(-K) / 2.0))
349 |         radii = np.multiply(2.0 / np.sqrt(-K), np.arcsinh(radii))
350 | 
351 |     # Spherical case (positive curvature)
352 |     else:
353 |         assert K <= 2
354 | 
355 |         radii = np.multiply(np.sqrt(u), np.sin(np.sqrt(K) / 2.0))
356 |         radii = np.multiply(2.0 / np.sqrt(K), np.arcsin(radii))
357 | 
358 |     x = np.multiply(r * radii, np.cos(theta))
359 |     y = np.multiply(r * radii, np.sin(theta))
360 |     return np.vstack([x, y]).T
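361 | 
362 | # Usage sketch (illustrative only): draw samples from some of the
363 | # shapes above; the expected output shapes are noted per call.
364 | #
365 | #     X = sample_from_wedged_spheres(n=1000, d=2, seed=0)   # (2000, 3)
366 | #     A = sample_from_annulus(500, r=1.0, R=2.0, seed=0)    # (500, 2)
367 | #     D = sample_from_constant_curvature_disk(500, K=-1.0)  # (500, 2)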
--------------------------------------------------------------------------------
/tardis/utils.py:
--------------------------------------------------------------------------------
1 | """Utilities module.
2 | 
3 | This module collects some utility functions, making them accessible to
4 | a wider number of modules.
5 | """
6 | 
7 | import logging
8 | import os
9 | 
10 | import numpy as np
11 | 
12 | from sklearn.neighbors import KDTree
13 | 
14 | from tardis.data import sample_vision_data_set
15 | 
16 | 
17 | def load_data(filename, batch_size, n_query_points, seed=None):
18 |     """Load data from filename, depending on input type.
19 | 
20 |     Parameters
21 |     ----------
22 |     filename : str
23 |         If this points to a file name, the function will attempt to load
24 |         said file and parse it. Else, the function will consider this as
25 |         the name of a data set to load.
26 | 
27 |     batch_size : int
28 |         Number of points to sample from data set.
29 | 
30 |     n_query_points : int
31 |         Number of points to use for the subsequent Euclidicity
32 |         calculations. It is possible to use the full data set.
33 | 
34 |     seed : int, instance of `np.random.Generator`, or `None`
35 |         Seed for the random number generator, or an instance of such
36 |         a generator. If set to `None`, the default random number
37 |         generator will be used.
38 | 
39 |     Returns
40 |     -------
41 |     Tuple of np.array, np.array
42 |         The (subsampled) data set along with its query points is
43 |         returned.
44 |     """
45 |     if os.path.exists(filename):
46 |         ext = os.path.splitext(filename)[1]
47 |         if ext == ".txt" or ext == ".gz":
48 |             X = np.loadtxt(filename)
49 |         elif ext == ".npz":
50 |             X = np.load(filename)["data"]
51 |         else:
52 |             # Fail early: previously, an unknown extension would lead
53 |             # to an undefined variable further down.
54 |             raise RuntimeError(f"Unable to handle input file {filename}")
55 |     else:
56 |         X = sample_vision_data_set(filename, batch_size)
57 | 
58 |     logger = logging.getLogger()
59 | 
60 |     logger.info(f"Sampling a batch of {batch_size} points")
61 |     logger.info(f"Using {n_query_points} query points")
62 | 
63 |     rng = np.random.default_rng(seed)
64 | 
65 |     X = X[rng.choice(X.shape[0], batch_size, replace=False)]
66 |     query_points = X[rng.choice(X.shape[0], n_query_points, replace=False)]
67 | 
68 |     return X, query_points
69 | 
70 | 
71 | def estimate_scales(X, query_points, k_max):
72 |     """Perform simple scale estimation of the data set.
73 | 
74 |     Parameters
75 |     ----------
76 |     X : np.array of shape ``(N, d)``
77 |         Input data set.
78 | 
79 |     query_points : np.array of shape ``(M, d)``
80 |         Query points for which to estimate scales.
81 | 
82 |     k_max : int
83 |         Maximum number of neighbours to consider for the local scale
84 |         estimation.
85 | 
86 |     Returns
87 |     -------
88 |     List of dict
89 |         A list of dictionaries consisting of the minimum and maximum
90 |         inner and outer radius, respectively.
91 |     """
92 |     tree = KDTree(X)
93 |     distances, _ = tree.query(query_points, k=k_max, return_distance=True)
94 | 
95 |     # Ignore the distance to the point itself, as we know that one
96 |     # already.
97 |     distances = distances[:, 1:]
98 | 
99 |     scales = [
100 |         {
101 |             "r": dist[0],
102 |             "R": dist[round(k_max / 3)],
103 |             "s": dist[round(k_max / 3)],
104 |             "S": dist[-1],
105 |         }
106 |         for dist in distances
107 |     ]
108 | 
109 |     return scales
110 | 
--------------------------------------------------------------------------------
/tardis/visualise_data.py:
--------------------------------------------------------------------------------
1 | """Basic visualisation of Euclidicity.
2 | 
3 | This is a helper script for visualising Euclidicity scores of
4 | high-dimensional point clouds.
5 | """
6 | 
7 | import argparse
8 | import os
9 | 
10 | import numpy as np
11 | import pandas as pd
12 | 
13 | import phate
14 | 
15 | import matplotlib.pyplot as plt
16 | import seaborn as sns
17 | 
18 | 
19 | if __name__ == "__main__":
20 |     parser = argparse.ArgumentParser()
21 | 
22 |     parser.add_argument("FILE", nargs="+", type=str, help="Input filename(s)")
23 |     parser.add_argument(
24 |         "-o",
25 |         "--output",
26 |         help="Output directory. If set, will store embedded point clouds.",
27 |         type=str,
28 |     )
29 | 
30 |     args = parser.parse_args()
31 | 
32 |     n_files = len(args.FILE)
33 | 
34 |     sns.set_theme(style="darkgrid")
35 |     fig, axes = plt.subplots(ncols=n_files)
36 | 
37 |     if n_files == 1:
38 |         axes = [axes]
39 | 
40 |     # Following the parameters of the original PHATE publication. We set
41 |     # a random state to ensure that the output remains reproducible.
42 |     emb = phate.PHATE(decay=10, t=50, random_state=42)
43 | 
44 |     for filename, ax in zip(args.FILE, axes):
45 |         if (ext := os.path.splitext(filename)[1]) == ".csv":
46 |             df = pd.read_csv(filename)
47 |             df = df.drop("persistent_intrinsic_dimension", axis="columns")
48 |             X = df.to_numpy()
49 |         elif ext == ".npz":
50 |             X = np.load(filename)["arr_0"]
51 |         else:
52 |             X = np.loadtxt(filename)
53 | 
54 |         y = X[:, -1].flatten()
55 | 
56 |         iqr = np.subtract(*np.percentile(y, [75, 25]))
57 |         q3 = np.percentile(y, 75)
58 | 
59 |         # Remove Euclidicity scores. Our implementation adds them to the
60 |         # last column of the data.
61 |         X = X[:, :-1]
62 | 
63 |         X_emb = emb.fit_transform(X)
64 | 
65 |         scatter = ax.scatter(
66 |             x=X_emb[:, 0],
67 |             y=X_emb[:, 1],
68 |             c=y,
69 |             alpha=0.5,
70 |             s=1.0,
71 |             # Try to highlight outliers a little bit better.
72 |             vmax=q3 + 1.5 * iqr,
73 |         )
74 |         fig.colorbar(scatter, ax=ax)
75 | 
76 |         if args.output is not None:
77 |             out_filename = os.path.basename(filename)
78 |             out_filename = os.path.splitext(out_filename)[0] + ".csv"
79 |             out_filename = os.path.join(args.output, out_filename)
80 | 
81 |             X_out = np.hstack((X_emb, y.reshape(-1, 1)))
82 | 
83 |             # Suppress the default "# " prefix of `np.savetxt` so that
84 |             # the header stays a valid CSV header line.
85 |             np.savetxt(
86 |                 out_filename,
87 |                 X_out,
88 |                 fmt="%.4f",
89 |                 delimiter=",",
90 |                 header="x,y,euclidicity",
91 |                 comments="",
92 |             )
93 | 
94 |     plt.show()
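95 | 
96 | # End-to-end sketch (illustrative only; all parameters below are
97 | # hypothetical): estimate scales, compute Euclidicity, and store the
98 | # scores next to the query points for this script to visualise.
99 | #
100 | #     from tardis.euclidicity import Euclidicity
101 | #     from tardis.utils import estimate_scales, load_data
102 | #
103 | #     X, queries = load_data("data/Pinched_torus.txt.gz", 500, 100)
104 | #     scales = estimate_scales(X, queries, k_max=50)
105 | #     euclidicity = Euclidicity(max_dim=2, data=X, method="ripser")
106 | #     scores = [
107 | #         np.nanmax(euclidicity(X, x, **scale)[0])
108 | #         for x, scale in zip(queries, scales)
109 | #     ]
110 | #     np.savetxt("scores.txt", np.hstack((queries, np.c_[scores])))
--------------------------------------------------------------------------------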