├── __init__.py ├── .gitmodules ├── makefiles └── .gitkeep ├── notebooks └── .gitkeep ├── src ├── __init__.py ├── functional │ ├── numpy.py │ ├── __init__.py │ ├── pandas.py │ └── common.py ├── models │ ├── __init__.py │ ├── ensembles.py │ └── base.py ├── motifs │ ├── __init__.py │ ├── features.py │ └── models.py ├── utils │ ├── __init__.py │ ├── exceptions.py │ ├── label_helpers.py │ ├── misc.py │ ├── loaders.py │ └── decorators.py ├── evaluation │ ├── __init__.py │ └── classification.py ├── features │ ├── __init__.py │ ├── ecdf_features.py │ ├── statistical_features.py │ └── statistical_features_impl.py ├── transformers │ ├── __init__.py │ ├── source_selector.py │ ├── body_grav_filter.py │ ├── resample.py │ └── window.py ├── visualisations │ ├── __init__.py │ └── umap_embedding.py ├── datasets │ ├── __init__.py │ ├── uschad.py │ ├── anguita2013.py │ ├── pamap2.py │ └── base.py ├── keys.py ├── meta.py └── base.py ├── tables ├── modalities.md ├── pipelines.md ├── locations.md ├── visualisations.md ├── transformers.md ├── features.md ├── models.md ├── activities.md └── datasets.md ├── metadata ├── indices │ ├── index.yaml │ ├── label.yaml │ ├── split.yaml │ └── target.yaml ├── data_partitions │ ├── loso.yaml │ ├── deployable.yaml │ └── predefined.yaml ├── modality.yaml ├── tasks │ ├── har.yaml │ └── localisation.yaml └── datasets │ ├── sphere_challenge.yaml_ │ ├── uschad.yaml │ ├── anguita2013.yaml │ └── pamap2.yaml ├── .flake8 ├── requirements.txt ├── docs ├── getting-started.rst ├── commands.rst ├── index.rst ├── make.bat ├── Makefile └── conf.py ├── pyproject.toml ├── Pipfile ├── .pre-commit-config.yaml ├── LICENSE ├── .gitignore ├── make_download.py ├── har_basic.py ├── har_chain.py ├── har_ensemble_avg.py ├── make_tables.py ├── har_zero.py ├── Makefile └── README.md /__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /makefiles/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /notebooks/.gitkeep: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/functional/numpy.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/models/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/motifs/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/utils/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tables/modalities.md: 
-------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /metadata/indices/index.yaml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /metadata/indices/label.yaml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /metadata/indices/split.yaml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /metadata/indices/target.yaml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/evaluation/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/features/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/functional/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/functional/pandas.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/transformers/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/visualisations/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /metadata/data_partitions/loso.yaml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /metadata/data_partitions/deployable.yaml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /metadata/data_partitions/predefined.yaml: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /src/utils/exceptions.py: -------------------------------------------------------------------------------- 1 | class ModalityNotPresentError(KeyError): 2 | pass 3 | -------------------------------------------------------------------------------- /.flake8: -------------------------------------------------------------------------------- 1 | [flake8] 2 | ignore = E203, E266, E501, W503, F403, F401, W293 3 | max-line-length = 120 4 | max-complexity = 18 5 | select = B,C,E,F,W,T4,B9 6 | -------------------------------------------------------------------------------- /tables/pipelines.md: -------------------------------------------------------------------------------- 1 | | Index | Pipelines | value | 2 | | ----- | ----- | ----- | 3 | | 0 | stat_feat | None | 4 | | 1 | ecdf_11 | None | 5 | | 2 | ecdf_21 | 
None | 6 | -------------------------------------------------------------------------------- /tables/locations.md: -------------------------------------------------------------------------------- 1 | | Index | Locations | value | 2 | | ----- | ----- | ----- | 3 | | 0 | wrist | 0 | 4 | | 1 | chest | 1 | 5 | | 2 | ankle | 2 | 6 | | 3 | waist | 3 | 7 | -------------------------------------------------------------------------------- /src/datasets/__init__.py: -------------------------------------------------------------------------------- 1 | from src.datasets.anguita2013 import anguita2013 2 | from src.datasets.base import Dataset 3 | from src.datasets.pamap2 import pamap2 4 | from src.datasets.uschad import uschad 5 | -------------------------------------------------------------------------------- /tables/visualisations.md: -------------------------------------------------------------------------------- 1 | | Index | Visualisations | value | 2 | | ----- | ----- | ----- | 3 | | 0 | umap_embedding | [Link 1](https://arxiv.org/abs/1802.03426), [Link 2](https://github.com/lmcinnes/umap) | 4 | -------------------------------------------------------------------------------- /tables/transformers.md: -------------------------------------------------------------------------------- 1 | | Index | Transformers | value | 2 | | ----- | ----- | ----- | 3 | | 0 | body_grav_filter | {'resamples': False} | 4 | | 1 | resample | None | 5 | | 2 | window | {'resamples': True} | 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | numpy 3 | pandas 4 | numpy 5 | matplotlib 6 | seaborn 7 | python-dotenv 8 | scikit-learn 9 | joblib 10 | ipython 11 | jupyter 12 | pyyaml 13 | loguru 14 | tqdm 15 | git+https://github.com/njtwomey/mldb.git#egg=mldb 16 | spectrum 17 | click 18 | umap-learn 19 | -------------------------------------------------------------------------------- /docs/getting-started.rst: -------------------------------------------------------------------------------- 1 | Getting started 2 | =============== 3 | 4 | This is where you describe how to get set up on a clean install, including the 5 | commands necessary to get the raw data (using the `sync_data_from_s3` command, 6 | for example), and then how to make the cleaned, final data sets. 
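A minimal sketch of what that could look like for this repository, assuming the
`requirements.txt`, `make_download.py` and `har_basic.py` files at the project
root are used as the entry points::

    pip install -r requirements.txt    # install the Python dependencies
    python make_download.py            # download and unzip the raw datasets
    python har_basic.py                # run the basic HAR pipeline end to end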
7 | -------------------------------------------------------------------------------- /metadata/modality.yaml: -------------------------------------------------------------------------------- 1 | # Wearable modalities 2 | - accel 3 | - gyro 4 | - mag 5 | 6 | # Ambient sensor modalities 7 | - humidity 8 | - temperature 9 | - pir 10 | - light 11 | - electricity 12 | - rssi 13 | 14 | # RGBD modalities 15 | - rgbd_kitchen 16 | - rgbd_hall 17 | - rgbd_lr 18 | -------------------------------------------------------------------------------- /src/functional/common.py: -------------------------------------------------------------------------------- 1 | def sorted_node_values(nodes): 2 | return [nodes[key] for key in sorted(nodes.keys())] 3 | 4 | 5 | def node_itemgetter(item): 6 | def itemgetter_func(df): 7 | return df[item] 8 | 9 | itemgetter_func.__name__ = f"get_{item}" 10 | 11 | return itemgetter_func 12 | -------------------------------------------------------------------------------- /tables/features.md: -------------------------------------------------------------------------------- 1 | | Index | Features | value | 2 | | ----- | ----- | ----- | 3 | | 0 | statistical_features | [Link 1](https://pdfs.semanticscholar.org/83de/43bc849ad3d9579ccf540e6fe566ef90a58e.pdf) | 4 | | 1 | ecdf | [Link 1](https://dl.acm.org/citation.cfm?id=2494353), [Link 2](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.725.908&rep=rep1&type=pdf) | 5 | -------------------------------------------------------------------------------- /tables/models.md: -------------------------------------------------------------------------------- 1 | | Index | Models | value | 2 | | ----- | ----- | ----- | 3 | | 0 | scale_log_reg | [Link 1](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [Link 2](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html) | 4 | | 1 | random_forest | [Link 1](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) | 5 | -------------------------------------------------------------------------------- /docs/commands.rst: -------------------------------------------------------------------------------- 1 | Commands 2 | ======== 3 | 4 | The Makefile contains the central entry points for common tasks related to this project. 5 | 6 | Syncing data to S3 7 | ^^^^^^^^^^^^^^^^^^ 8 | 9 | * `make sync_data_to_s3` will use `aws s3 sync` to recursively sync files in `data/` up to `s3://[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')/data/`. 10 | * `make sync_data_from_s3` will use `aws s3 sync` to recursively sync files from `s3://[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')/data/` to `data/`. 11 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | .. har_datasets documentation master file, created by 2 | sphinx-quickstart. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | har_datasets documentation! 7 | ============================================== 8 | 9 | Contents: 10 | 11 | .. 
toctree:: 12 | :maxdepth: 2 13 | 14 | getting-started 15 | commands 16 | 17 | 18 | 19 | Indices and tables 20 | ================== 21 | 22 | * :ref:`genindex` 23 | * :ref:`modindex` 24 | * :ref:`search` 25 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.black] 2 | line-length = 120 3 | target-version = ['py38'] 4 | include = '\.pyi?$' 5 | exclude = ''' 6 | 7 | ( 8 | /( 9 | \.eggs # exclude a few common directories in the 10 | | \.git # root of the project 11 | | \.hg 12 | | \.mypy_cache 13 | | \.tox 14 | | \.venv 15 | | _build 16 | | buck-out 17 | | build 18 | | dist 19 | )/ 20 | | foo.py # also separately exclude a file named foo.py in 21 | # the root of the project 22 | ) 23 | ''' 24 | -------------------------------------------------------------------------------- /src/utils/label_helpers.py: -------------------------------------------------------------------------------- 1 | __all__ = ["normalise_labels"] 2 | 3 | 4 | def normalise_labels(ll): 5 | """ 6 | 7 | Args: 8 | ll: 9 | 10 | Returns: 11 | 12 | """ 13 | if "walk" in ll: 14 | return "walk" 15 | elif "elevator" in ll: 16 | return "stand" 17 | elif ll in {"lie", "sleep"}: 18 | return "lie" 19 | elif ll in {"vacuum", "iron", "laundry", "clean"}: 20 | return "chores" 21 | elif ll in {"run", "jump", "rope_jump", "soccer", "cycle"}: 22 | return "sport" 23 | return ll 24 | -------------------------------------------------------------------------------- /metadata/tasks/har.yaml: -------------------------------------------------------------------------------- 1 | # Sedentary activities 2 | - sit 3 | - stand 4 | - lie 5 | - watch_tv 6 | - elevator_up 7 | - elevator_down 8 | - sleep 9 | 10 | # Walking activities 11 | - walk 12 | - walk_up 13 | - walk_down 14 | - walk_left 15 | - walk_right 16 | - walk_nordic 17 | 18 | # Household 19 | - iron 20 | - laundry 21 | - clean 22 | - vacuum 23 | 24 | # Sport/exercise 25 | - run 26 | - cycle 27 | - soccer 28 | - rope_jump 29 | - jump 30 | 31 | # Other 32 | - work_computer 33 | - drive_car 34 | 35 | # Catchall 36 | - other 37 | - none 38 | -------------------------------------------------------------------------------- /Pipfile: -------------------------------------------------------------------------------- 1 | [[source]] 2 | url = "https://pypi.org/simple" 3 | verify_ssl = true 4 | name = "pypi" 5 | 6 | [packages] 7 | requests = "*" 8 | numpy = "*" 9 | pandas = "*" 10 | matplotlib = "*" 11 | seaborn = "*" 12 | python-dotenv = "*" 13 | scikit-learn = "*" 14 | joblib = "*" 15 | ipython = "*" 16 | jupyter = "*" 17 | loguru = "*" 18 | tqdm = "*" 19 | mldb = {git = "https://github.com/njtwomey/mldb.git"} 20 | spectrum = "*" 21 | click = "*" 22 | umap-learn = "*" 23 | PyYAML = "*" 24 | pygraphviz = "*" 25 | 26 | [dev-packages] 27 | 28 | [requires] 29 | python_version = "3.8" 30 | -------------------------------------------------------------------------------- /src/utils/misc.py: -------------------------------------------------------------------------------- 1 | import json 2 | import random 3 | 4 | import numpy as np 5 | 6 | 7 | __all__ = ["randomised_order", "NumpyEncoder"] 8 | 9 | 10 | def randomised_order(iterable): 11 | iterable = list(iterable) 12 | random.shuffle(iterable) 13 | yield from iterable 14 | 15 | 16 | class NumpyEncoder(json.JSONEncoder): 17 | def default(self, obj): 18 | if isinstance(obj, np.integer): 19 | return int(obj) 20 | elif 
isinstance(obj, np.floating): 21 | return float(obj) 22 | elif isinstance(obj, np.ndarray): 23 | return obj.tolist() 24 | else: 25 | return super(NumpyEncoder, self).default(obj) 26 | -------------------------------------------------------------------------------- /tables/activities.md: -------------------------------------------------------------------------------- 1 | | Index | Activities | value | 2 | | ----- | ----- | ----- | 3 | | 0 | walk | 0 | 4 | | 1 | walk_up | 1 | 5 | | 2 | walk_down | 2 | 6 | | 3 | sit | 3 | 7 | | 4 | stand | 4 | 8 | | 5 | lie | 5 | 9 | | 6 | run | 6 | 10 | | 7 | cycle | 7 | 11 | | 8 | walk_nordic | 8 | 12 | | 9 | watch_tv | 9 | 13 | | 10 | work_computer | 10 | 14 | | 11 | drive_car | 11 | 15 | | 12 | vacuum | 12 | 16 | | 13 | iron | 13 | 17 | | 14 | laundry | 14 | 18 | | 15 | clean | 15 | 19 | | 16 | soccer | 16 | 20 | | 17 | rope_jump | 17 | 21 | | 18 | other | 18 | 22 | | 19 | walk_left | 19 | 23 | | 20 | walk_right | 20 | 24 | | 21 | jump | 21 | 25 | | 22 | sleep | 22 | 26 | | 23 | elevator_up | 23 | 27 | | 24 | elevator_down | 24 | 28 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | repos: 2 | - repo: https://github.com/ambv/black 3 | rev: stable 4 | hooks: 5 | - id: black 6 | language_version: python3.8 7 | - repo: https://gitlab.com/pycqa/flake8 8 | rev: 3.7.9 9 | hooks: 10 | - id: flake8 11 | - repo: https://github.com/asottile/reorder_python_imports 12 | rev: v2.4.0 13 | hooks: 14 | - id: reorder-python-imports 15 | - repo: https://github.com/pre-commit/pre-commit-hooks 16 | rev: v3.4.0 17 | hooks: 18 | - id: trailing-whitespace 19 | - id: end-of-file-fixer 20 | - id: check-docstring-first 21 | - id: check-json 22 | - id: check-added-large-files 23 | - id: check-yaml 24 | - id: debug-statements 25 | - id: requirements-txt-fixer 26 | -------------------------------------------------------------------------------- /metadata/tasks/localisation.yaml: -------------------------------------------------------------------------------- 1 | - floor_5 2 | - floor_4 3 | - floor_3 4 | - floor_2 5 | - floor_1 6 | - floor_0 7 | - basement_1 8 | - basement_2 9 | - basement_3 10 | - basement_4 11 | - basement_5 12 | 13 | - hallway_1 14 | - hallway_2 15 | - hallway_3 16 | - hallway_4 17 | - hallway_5 18 | 19 | - bathroom_1 20 | - bathroom_2 21 | - bathroom_3 22 | - bathroom_4 23 | - bathroom_5 24 | 25 | - toilet_1 26 | - toilet_2 27 | - toilet_3 28 | - toilet_4 29 | - toilet_5 30 | 31 | - living_1 32 | - living_2 33 | - living_3 34 | - living_4 35 | - living_5 36 | 37 | - bedroom_1 38 | - bedroom_2 39 | - bedroom_3 40 | - bedroom_4 41 | - bedroom_5 42 | 43 | - dining_room_1 44 | - dining_room_2 45 | - dining_room_3 46 | - dining_room_4 47 | - dining_room_5 48 | 49 | - stairs_1 50 | - stairs_2 51 | - stairs_3 52 | - stairs_4 53 | - stairs_5 54 | 55 | - kitchen 56 | - study 57 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | The MIT License (MIT) 3 | Copyright (c) 2019, Niall Twomey 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 6 | associated documentation files (the "Software"), to deal in the Software without restriction, including 7 | without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 8 
| copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the 9 | following conditions: 10 | 11 | The above copyright notice and this permission notice shall be included in all copies or substantial 12 | portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT 15 | LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN 16 | NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 17 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 18 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 19 | 20 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | .DS_Store 5 | # C extensions 6 | *.so 7 | 8 | *.lock 9 | 10 | devel.py 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | *.lock 30 | 31 | niall_*.py 32 | 33 | # PyInstaller 34 | # Usually these files are written by a python script from a template 35 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 36 | *.manifest 37 | *.spec 38 | 39 | # Installer logs 40 | pip-log.txt 41 | pip-delete-this-directory.txt 42 | 43 | # Unit test / coverage reports 44 | htmlcov/ 45 | .tox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *,cover 52 | 53 | # Translations 54 | *.mo 55 | *.pot 56 | 57 | # Django stuff: 58 | *.log 59 | 60 | # Sphinx documentation 61 | docs/_build/ 62 | 63 | # PyBuilder 64 | target/ 65 | 66 | # DotEnv configuration 67 | .env 68 | 69 | # Database 70 | *.db 71 | *.rdb 72 | 73 | # Pycharm 74 | .idea 75 | 76 | # Jupyter NB Checkpoints 77 | .ipynb_checkpoints/ 78 | 79 | # exclude data from source control by default 80 | /data/ 81 | -------------------------------------------------------------------------------- /src/features/ecdf_features.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from src.functional.common import sorted_node_values 4 | 5 | 6 | __all__ = [ 7 | "ecdf", 8 | ] 9 | 10 | 11 | def ecdf(parent, n_components): 12 | root = parent / f"feat='ecdf'-k={n_components}" 13 | 14 | for key, node in parent.outputs.items(): 15 | root.instantiate_node( 16 | key=f"{key}-ecdf", backend="numpy", func=calc_ecdf, kwargs=dict(n_components=n_components, data=node), 17 | ) 18 | 19 | return root.instantiate_node( 20 | key="features", func=np.concatenate, args=[sorted_node_values(root.outputs)], kwargs=dict(axis=1) 21 | ) 22 | 23 | 24 | def calc_ecdf(data, n_components): 25 | return np.asarray([ecdf_rep(datum, n_components) for datum in data]) 26 | 27 | 28 | def ecdf_rep(data, components): 29 | """ 30 | Taken from: https://github.com/nhammerla/ecdfRepresentation/blob/master/python/ecdfRep.py 31 | 32 | Parameters 33 | ---------- 34 | data 35 | components 36 | 37 | Returns 38 | ------- 39 | 40 | """ 41 | 42 | m = data.mean(0) 43 | data = np.sort(data, axis=0) 44 | data = data[np.int32(np.around(np.linspace(0, data.shape[0] - 1, num=components))), :] 45 | data 
= data.flatten() 46 | return np.hstack((data, m)) 47 | -------------------------------------------------------------------------------- /src/motifs/features.py: -------------------------------------------------------------------------------- 1 | from mldb import NodeWrapper 2 | 3 | from src.features.ecdf_features import ecdf 4 | from src.features.statistical_features import statistical_features 5 | from src.transformers.body_grav_filter import body_grav_filter 6 | from src.transformers.resample import resample 7 | from src.transformers.source_selector import source_selector 8 | from src.transformers.window import window 9 | 10 | 11 | def get_windowed_wearables( 12 | dataset: NodeWrapper, modality: str, location: str, fs_new: float, win_len: float, win_inc: float 13 | ): 14 | selected_sources = source_selector(parent=dataset, modality=modality, location=location) 15 | wear_resampled = resample(parent=selected_sources, fs_new=fs_new) 16 | wear_filtered = body_grav_filter(parent=wear_resampled) 17 | wear_windowed = window(parent=wear_filtered, win_len=win_len, win_inc=win_inc) 18 | return wear_windowed 19 | 20 | 21 | def get_features(feat_name: str, windowed_data: NodeWrapper): 22 | if feat_name == "statistical": 23 | features = statistical_features(parent=windowed_data) 24 | elif feat_name == "ecdf": 25 | features = ecdf(parent=windowed_data, n_components=21) 26 | else: 27 | raise ValueError 28 | 29 | assert isinstance(features, NodeWrapper) 30 | 31 | return features 32 | -------------------------------------------------------------------------------- /src/keys.py: -------------------------------------------------------------------------------- 1 | __all__ = ["Key"] 2 | 3 | from pathlib import Path 4 | from typing import Union 5 | 6 | 7 | class Key(object): 8 | def __init__(self, key): 9 | self.key = validate_key(key) 10 | 11 | def __len__(self) -> int: 12 | return len(self.key) 13 | 14 | def __str__(self) -> str: 15 | return self.key 16 | 17 | def __repr__(self) -> str: 18 | key = self.key 19 | return f"Key({key=})" 20 | 21 | def __eq__(self, other: Union["Key", Path, str]) -> bool: 22 | if isinstance(other, Path): 23 | return self.key == other.name 24 | if isinstance(other, Key): 25 | return self.key == other.key 26 | if isinstance(other, str): 27 | return self.key == other 28 | raise NotImplementedError 29 | 30 | def __hash__(self) -> int: 31 | return hash(self.key) 32 | 33 | def __contains__(self, key) -> bool: 34 | validate_key(key) 35 | return key in self.key 36 | 37 | def __lt__(self, other) -> bool: 38 | return self.key < other.key 39 | 40 | 41 | def validate_key(key: Union[Path, Key, str]): 42 | if isinstance(key, Path): 43 | return key.name 44 | if isinstance(key, Key): 45 | return key.key 46 | if isinstance(key, str): 47 | return key 48 | raise ValueError(f"Unsupported key type: expected str or Key, got {type(key)} (value: {key})") 49 | -------------------------------------------------------------------------------- /metadata/datasets/sphere_challenge.yaml_: -------------------------------------------------------------------------------- 1 | #name: "sphere_challenge" 2 | #author: "Twomey" 3 | #paper_name: "The SPHERE challenge: Activity recognition with multimodal sensor data" 4 | #venue: "arXiv preprint arXiv:1603.00797" 5 | #bibtex: 6 | # - "@article{twomey2016sphere,title={The SPHERE challenge: Activity recognition with multimodal sensor data},author={Twomey, Niall and Diethe, Tom and Kull, Meelis and Song, Hao and Camplani, Massimo and Hannuna, Sion and Fafoutis, Xenofon and 
Zhu, Ni and Woznowski, Pete and Flach, Peter and Craddock, Ian},journal={arXiv preprint arXiv:1603.00797},year={2016}}" 7 | #paper_urls: 8 | # - "https://arxiv.org/pdf/1603.00797" 9 | #year: 2016 10 | #description_urls: 11 | # - "https://www.irc-sphere.ac.uk/sphere-challenge/home" 12 | # - "https://arxiv.org/abs/1603.00797" 13 | #download_urls: 14 | # - "https://data.bris.ac.uk/datasets/8gccwpx47rav19vk8x4xapcog/8gccwpx47rav19vk8x4xapcog.zip" 15 | #fs: 20 16 | #num_subjects: 9 17 | #num_activities: 12 18 | #missing: true 19 | #tasks: 20 | # har: 21 | # target_transform: 22 | # lie: 1 23 | # sit: 2 24 | # stand: 3 25 | # walk: 4 26 | # run: 5 27 | # cycle: 6 28 | # walk_nordic: 7 29 | # watch_tv: 9 30 | # work_computer: 10 31 | # drive_car: 11 32 | # walk_up: 12 33 | # walk_down: 13 34 | # vacuum: 16 35 | # iron: 17 36 | # laundry: 18 37 | # clean: 19 38 | # soccer: 20 39 | # rope_jump: 24 40 | # other: 0 41 | # evaluation: 42 | # - "probabilistic_targets" 43 | -------------------------------------------------------------------------------- /tables/datasets.md: -------------------------------------------------------------------------------- 1 | | First Author | Paper Name | Dataset Name | Description | Missing data | Download Links | Year | Sampling Rate | Device Locations | Device Modalities | Num Subjects | Num Activities | Activities | 2 | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | 3 | | Anguita | A Public Domain Dataset for Human Activity Recognition Using Smartphones | anguita2013 | [Link 1](http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) | None | [Link 1](https://pdfs.semanticscholar.org/83de/43bc849ad3d9579ccf540e6fe566ef90a58e.pdf) | 2013 | 50 | waist | accel, gyro, mag | 30 | 6 | walk, walk_up, walk_down, sit, stand, lie | 4 | | Reiss | Introducing a new benchmarked dataset for activity monitoring | pamap2 | [Link 1](http://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring), [Link 2](http://archive.ics.uci.edu/ml/machine-learning-databases/00231/readme.pdf) | None | [Link 1](https://ieeexplore.ieee.org/document/6246152/), [Link 2](https://www.researchgate.net/publication/235348485_Introducing_a_New_Benchmarked_Dataset_for_Activity_Monitoring) | 2012 | 100 | wrist, chest, ankle | accel, gyro, mag | 9 | 12 | lie, sit, stand, walk, run, cycle, walk_nordic, watch_tv, work_computer, drive_car, walk_up, walk_down, vacuum, iron, laundry, clean, soccer, rope_jump, other | 5 | | Zhang | USC-HAD: A Daily Activity Dataset for Ubiquitous Activity Recognition Using Wearable Sensors | uschad | [Link 1](http://sipi.usc.edu/had/) | None | [Link 1](http://sipi.usc.edu/had/mi_ubicomp_sagaware12.pdf) | 2012 | 100 | waist | accel, gyro | 14 | 12 | walk, walk_left, walk_right, walk_up, walk_down, run, jump, sit, stand, sleep, elevator_up, elevator_down | 6 | -------------------------------------------------------------------------------- /src/transformers/source_selector.py: -------------------------------------------------------------------------------- 1 | from loguru import logger 2 | from numpy import concatenate 3 | 4 | from src.base import get_ancestral_metadata 5 | from src.keys import Key 6 | 7 | __all__ = [ 8 | "source_selector", 9 | ] 10 | 11 | 12 | def do_select_feats(**nodes): 13 | keys = sorted(nodes.keys()) 14 | return concatenate([nodes[key] for key in keys], axis=1) 15 | 16 | 17 | def source_selector(parent, modality="all", location="all"): 18 | locations_set = 
set(get_ancestral_metadata(parent, "locations")) 19 | assert location in locations_set or location == "all", f"Location {location} not in {locations_set}" 20 | 21 | modality_set = set(get_ancestral_metadata(parent, "modalities")) 22 | assert modality in modality_set or modality == "all", f"Modality {modality} not in {modality_set}" 23 | 24 | loc, mod = location, modality 25 | root = parent / f"{loc=}-{mod=}" 26 | 27 | # Prepare a set of viable outputs 28 | valid_locations = set() 29 | for pair in parent.meta["sources"]: 30 | loc, mod = pair["loc"], pair["mod"] 31 | good_location = location == "all" or pair["loc"] == location 32 | good_modality = modality == "all" or pair["mod"] == modality 33 | if good_location and good_modality: 34 | valid_locations.update({Key(f"{loc=}-{mod=}")}) 35 | 36 | # Aggregate all relevant sources 37 | selected = 0 38 | for key, node in parent.outputs.items(): 39 | if key in valid_locations: 40 | selected += 1 41 | root.acquire_node(key=key, node=node) 42 | 43 | if not selected: 44 | logger.exception(f"No wearable keys found in {sorted(parent.outputs.keys())}") 45 | raise KeyError 46 | 47 | return root 48 | -------------------------------------------------------------------------------- /make_download.py: -------------------------------------------------------------------------------- 1 | import zipfile 2 | from os import makedirs 3 | from os.path import basename 4 | from os.path import exists 5 | from os.path import join 6 | from os.path import split 7 | from os.path import splitext 8 | 9 | import requests 10 | from loguru import logger 11 | from tqdm import tqdm 12 | 13 | from src.meta import DatasetMeta 14 | from src.utils.loaders import iter_dataset_paths 15 | 16 | 17 | def unzip_data(zip_path, in_name, out_name): 18 | if exists(join(zip_path, out_name)): 19 | return 20 | with zipfile.ZipFile(join(zip_path, in_name), "r") as fil: 21 | fil.extractall(zip_path) 22 | 23 | 24 | def download_and_save(url, path, force=False, chunk_size=2 ** 12): 25 | response = requests.get(url, stream=True) 26 | fname = join(path, split(url)[1]) 27 | desc = f"Downloading {fname}..." 
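    # Skip the download when the target file already exists, unless force=True.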
28 | if exists(fname): 29 | if not force: 30 | return 31 | chunks = tqdm(response.iter_content(chunk_size=chunk_size), desc=basename(desc)) 32 | with open(fname, "wb") as fil: 33 | for chunk in chunks: 34 | fil.write(chunk) 35 | 36 | 37 | def download_dataset(dataset_meta_path): 38 | dataset = DatasetMeta(dataset_meta_path) 39 | if not exists(dataset.zip_path): 40 | makedirs(dataset.zip_path) 41 | for ii, url in enumerate(dataset.meta["download_urls"]): 42 | logger.info("\t{}/{} {}".format(ii + 1, len(dataset.meta["download_urls"]), url)) 43 | download_and_save(url=url, path=dataset.zip_path) 44 | zip_name = basename(dataset.meta["download_urls"][0]) 45 | unzip_path = join(dataset.zip_path, splitext(zip_name)[0]) 46 | unzip_data(zip_path=dataset.zip_path, in_name=zip_name, out_name=unzip_path) 47 | 48 | 49 | def main(): 50 | for dataset_meta_path in iter_dataset_paths(): 51 | logger.info(f"Downloading {dataset_meta_path}") 52 | download_dataset(dataset_meta_path) 53 | 54 | 55 | if __name__ == "__main__": 56 | main() 57 | -------------------------------------------------------------------------------- /src/visualisations/umap_embedding.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as pl 2 | import pandas as pd 3 | import seaborn as sns 4 | from mldb import NodeWrapper 5 | from sklearn.pipeline import Pipeline 6 | from sklearn.preprocessing import StandardScaler 7 | from umap import UMAP 8 | 9 | from src.base import ExecutionGraph 10 | 11 | # from src.utils.label_helpers import normalise_labels 12 | 13 | sns.set_style("darkgrid") 14 | sns.set_context("paper") 15 | 16 | 17 | def learn_umap(data): 18 | umap = Pipeline([("scale", StandardScaler()), ("embed", UMAP(n_neighbors=50, verbose=True))]) 19 | umap.fit(data) 20 | return umap 21 | 22 | 23 | def embed_umap(label, data, model): 24 | embedding = model.transform(data) 25 | 26 | # Need to re-label with a new dataframe since the categories in the normalised label 27 | # set are different to those in the full set. 
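    # (The normalise_labels re-mapping on the commented-out line below is currently
    # disabled, so the raw task targets are plotted unchanged.)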
28 | # label = pd.DataFrame(label.target.apply(normalise_labels)).astype("category") 29 | label = pd.DataFrame(label["target"]).astype("category") 30 | 31 | fig, ax = pl.subplots(1, 1, figsize=(10, 10)) 32 | 33 | labels = label.target.values 34 | unique_labels = label.target.unique() 35 | colours = sns.color_palette(n_colors=unique_labels.shape[0]) 36 | for ll, cc in zip(unique_labels, colours): 37 | if ll == "other": 38 | continue 39 | inds = labels == ll 40 | ax.scatter(embedding[inds, 0], embedding[inds, 1], color=cc, label=ll, s=5, alpha=0.75) 41 | pl.legend(fontsize="x-large", markerscale=3) 42 | pl.tight_layout() 43 | 44 | return fig 45 | 46 | 47 | def umap_embedding(features: NodeWrapper, task_name): 48 | parent: ExecutionGraph = features.graph 49 | umap_model = parent.instantiate_orphan_node(func=learn_umap, kwargs=dict(data=features),) 50 | viz = parent.instantiate_node( 51 | key=f"umap-visualisation", 52 | func=embed_umap, 53 | backend="png", 54 | kwargs=dict(data=features, model=umap_model, label=parent[task_name]), 55 | ) 56 | return viz 57 | -------------------------------------------------------------------------------- /metadata/datasets/uschad.yaml: -------------------------------------------------------------------------------- 1 | name: "uschad" 2 | author: "Zhang" 3 | paper_name: "USC-HAD: A Daily Activity Dataset for Ubiquitous Activity Recognition Using Wearable Sensors" 4 | venue: "Proceedings of the 2012 ACM Conference on Ubiquitous Computing" 5 | bibtex: 6 | - "@inproceedings{zhang2012usc,title={USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors},author={Zhang, Mi and Sawchuk, Alexander A},booktitle={Proceedings of the 2012 ACM Conference on Ubiquitous Computing},pages={1036--1043},year={2012},organization={ACM}}" 7 | paper_urls: 8 | - "http://sipi.usc.edu/had/mi_ubicomp_sagaware12.pdf" 9 | year: 2012 10 | description_urls: 11 | - "http://sipi.usc.edu/had/" 12 | download_urls: 13 | - "http://sipi.usc.edu/had/USC-HAD.zip" 14 | fs: 100 15 | num_subjects: 14 16 | num_activities: 12 17 | missing: null 18 | modalities: 19 | - accel 20 | - gyro 21 | locations: 22 | - waist 23 | sources: 24 | - loc: waist 25 | mod: accel 26 | - loc: waist 27 | mod: gyro 28 | tasks: 29 | har: 30 | evaluation: classification 31 | target_transform: 32 | walk: 1 33 | walk_left: 2 34 | walk_right: 3 35 | walk_up: 4 36 | walk_down: 5 37 | run: 6 38 | jump: 7 39 | sit: 8 40 | stand: 9 41 | sleep: 10 42 | elevator_up: 11 43 | elevator_down: 12 44 | data_partitions: 45 | predefined: 46 | - fold_0 47 | loso: 48 | - fold_0 49 | - fold_1 50 | - fold_2 51 | - fold_3 52 | - fold_4 53 | - fold_5 54 | - fold_6 55 | - fold_7 56 | - fold_8 57 | - fold_9 58 | - fold_10 59 | - fold_11 60 | - fold_12 61 | - fold_13 62 | - fold_14 63 | - fold_15 64 | - fold_16 65 | - fold_17 66 | - fold_18 67 | - fold_19 68 | - fold_20 69 | - fold_21 70 | - fold_22 71 | - fold_23 72 | - fold_24 73 | - fold_25 74 | - fold_26 75 | - fold_27 76 | - fold_28 77 | - fold_29 78 | deployable: 79 | - fold_0 80 | -------------------------------------------------------------------------------- /har_basic.py: -------------------------------------------------------------------------------- 1 | from src.base import get_ancestral_metadata 2 | from src.motifs.features import get_features 3 | from src.motifs.features import get_windowed_wearables 4 | from src.motifs.models import get_classifier 5 | from src.utils.loaders import dataset_importer 6 | from src.utils.misc import randomised_order 7 | from 
src.visualisations.umap_embedding import umap_embedding 8 | 9 | 10 | def basic_har( 11 | # 12 | # Dataset 13 | dataset_name="pamap2", 14 | # 15 | # Representation sources 16 | modality="all", 17 | location="all", 18 | # 19 | # Task/split 20 | task_name="har", 21 | data_partition="predefined", 22 | # 23 | # Windowification 24 | fs_new=33, 25 | win_len=3, 26 | win_inc=1, 27 | # 28 | # Features 29 | feat_name="ecdf", 30 | clf_name="rf", 31 | # 32 | # Embedding visualisation 33 | viz=False, 34 | evaluate=False, 35 | ): 36 | dataset = dataset_importer(dataset_name) 37 | 38 | # Resample, filter and window the raw sensor data 39 | wear_windowed = get_windowed_wearables( 40 | dataset=dataset, modality=modality, location=location, fs_new=fs_new, win_len=win_len, win_inc=win_inc 41 | ) 42 | 43 | # Extract features 44 | features = get_features(feat_name=feat_name, windowed_data=wear_windowed) 45 | 46 | # Visualise the feature embeddings 47 | if viz: 48 | umap_embedding(features, task_name=task_name).evaluate() 49 | 50 | # Get classifier params 51 | models = dict() 52 | train_test_splits = get_ancestral_metadata(features, "data_partitions")[data_partition] 53 | for train_test_split in randomised_order(train_test_splits): 54 | models[train_test_split] = get_classifier( 55 | clf_name=clf_name, 56 | feature_node=features, 57 | task_name=task_name, 58 | data_partition=data_partition, 59 | evaluate=evaluate, 60 | train_test_split=train_test_split, 61 | ) 62 | 63 | return features, models 64 | 65 | 66 | if __name__ == "__main__": 67 | basic_har(viz=True, evaluate=True) 68 | -------------------------------------------------------------------------------- /metadata/datasets/anguita2013.yaml: -------------------------------------------------------------------------------- 1 | name: "anguita2013" 2 | author: "Anguita" 3 | paper_name: "A Public Domain Dataset for Human Activity Recognition Using Smartphones" 4 | venue: "European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning" 5 | bibtex: 6 | - "@inproceedings{anguita2013public,title={A public domain dataset for human activity recognition using smartphones.},author={Anguita, Davide and Ghio, Alessandro and Oneto, Luca and Parra, Xavier and Reyes-Ortiz, Jorge Luis},booktitle={Esann},year={2013}}" 7 | paper_urls: 8 | - "https://pdfs.semanticscholar.org/83de/43bc849ad3d9579ccf540e6fe566ef90a58e.pdf" 9 | year: 2013 10 | description_urls: 11 | - "http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones" 12 | download_urls: 13 | - "http://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip" 14 | - "http://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.names" 15 | fs: 50 16 | missing: null 17 | num_subjects: 30 18 | num_activities: 6 19 | modalities: 20 | - accel 21 | - gyro 22 | locations: 23 | - waist 24 | sources: 25 | - loc: waist 26 | mod: accel 27 | - loc: waist 28 | mod: gyro 29 | tasks: 30 | har: 31 | evaluation: 32 | - classification 33 | target_transform: 34 | walk: 1 35 | walk_up: 2 36 | walk_down: 3 37 | sit: 4 38 | stand: 5 39 | lie: 6 40 | data_partitions: 41 | predefined: 42 | - fold_0 43 | loso: 44 | - fold_0 45 | - fold_1 46 | - fold_2 47 | - fold_3 48 | - fold_4 49 | - fold_5 50 | - fold_6 51 | - fold_7 52 | - fold_8 53 | - fold_9 54 | - fold_10 55 | - fold_11 56 | - fold_12 57 | - fold_13 58 | - fold_14 59 | - fold_15 60 | - fold_16 61 | - fold_17 62 | - fold_18 63 | - fold_19 64 | - fold_20 65 | - fold_21 66 | - fold_22 67 | - 
fold_23 68 | - fold_24 69 | - fold_25 70 | - fold_26 71 | - fold_27 72 | - fold_28 73 | - fold_29 74 | deployable: 75 | - fold_0 76 | -------------------------------------------------------------------------------- /src/transformers/body_grav_filter.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy import signal 3 | 4 | from src.base import get_ancestral_metadata 5 | from src.utils.decorators import PartitionByTrial 6 | 7 | __all__ = [ 8 | "body_grav_filter", 9 | ] 10 | 11 | 12 | def filter_signal(data, filter_order, cutoff, fs, btype, axis=0): 13 | ba = signal.butter(filter_order, cutoff / fs / 2, btype=btype) 14 | 15 | mu = np.mean(data, axis=0, keepdims=True) 16 | 17 | dd = signal.filtfilt(ba[0], ba[1], data - mu, axis=axis) + mu 18 | 19 | return dd 20 | 21 | 22 | def body_filt(index, data, **kwargs): 23 | filt = filter_signal(data, btype="high", **kwargs) 24 | assert np.isfinite(filt).all() 25 | return filt 26 | 27 | 28 | def grav_filt(index, data, **kwargs): 29 | filt = filter_signal(data, btype="low", **kwargs) 30 | assert np.isfinite(filt).all() 31 | return filt 32 | 33 | 34 | def body_jerk_filt(index, data, **kwargs): 35 | filt = body_filt(index, data, **kwargs) 36 | jerk = np.empty(filt.shape, dtype=filt.dtype) 37 | jerk[0] = 0 38 | jerk[1:] = filt[1:] - filt[:-1] 39 | assert np.isfinite(filt).all() 40 | return jerk 41 | 42 | 43 | def body_grav_filter(parent): 44 | root = parent / "body_grav_filter" 45 | 46 | kwargs = dict(fs=get_ancestral_metadata(root, "fs"), filter_order=3, cutoff=0.3) 47 | 48 | for key, node in parent.outputs.items(): 49 | filt = "body" 50 | root.instantiate_node( 51 | key=f"{key}-{filt=}", 52 | func=PartitionByTrial(func=body_filt), 53 | backend="none", 54 | kwargs=dict(data=node, index=parent.index["index"], **kwargs), 55 | ) 56 | 57 | filt = "body_jerk" 58 | root.instantiate_node( 59 | key=f"{key}-{filt=}", 60 | func=PartitionByTrial(func=body_jerk_filt), 61 | backend="none", 62 | kwargs=dict(data=node, index=parent.index["index"], **kwargs), 63 | ) 64 | 65 | if "accel" in key: 66 | filt = "grav" 67 | root.instantiate_node( 68 | key=f"{key}-{filt=}", 69 | func=PartitionByTrial(func=grav_filt), 70 | backend="none", 71 | kwargs=dict(data=node, index=parent.index["index"], **kwargs), 72 | ) 73 | 74 | return root 75 | -------------------------------------------------------------------------------- /metadata/datasets/pamap2.yaml: -------------------------------------------------------------------------------- 1 | name: "pamap2" 2 | author: "Reiss" 3 | paper_name: "Introducing a new benchmarked dataset for activity monitoring" 4 | venue: "IEEE 16th International Symposium on Wearable Computers" 5 | bibtex: 6 | - "@inproceedings{reiss2012introducing,title={Introducing a new benchmarked dataset for activity monitoring},author={Reiss, Attila and Stricker, Didier},booktitle={2012 16th International Symposium on Wearable Computers},pages={108--109},year={2012},organization={IEEE}}" 7 | paper_urls: 8 | - "https://ieeexplore.ieee.org/document/6246152/" 9 | - "https://www.researchgate.net/publication/235348485_Introducing_a_New_Benchmarked_Dataset_for_Activity_Monitoring" 10 | year: 2012 11 | description_urls: 12 | - "http://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring" 13 | - "http://archive.ics.uci.edu/ml/machine-learning-databases/00231/readme.pdf" 14 | download_urls: 15 | - "http://archive.ics.uci.edu/ml/machine-learning-databases/00231/PAMAP2_Dataset.zip" 16 | - 
"http://archive.ics.uci.edu/ml/machine-learning-databases/00231/readme.pdf" 17 | fs: 100 18 | num_subjects: 9 19 | num_activities: 12 20 | missing: null 21 | modalities: 22 | - accel 23 | - gyro 24 | - mag 25 | locations: 26 | - ankle 27 | - chest 28 | - wrist 29 | sources: 30 | - loc: ankle 31 | mod: accel 32 | - loc: ankle 33 | mod: gyro 34 | - loc: ankle 35 | mod: mag 36 | - loc: chest 37 | mod: accel 38 | - loc: chest 39 | mod: gyro 40 | - loc: chest 41 | mod: mag 42 | - loc: wrist 43 | mod: accel 44 | - loc: wrist 45 | mod: gyro 46 | - loc: wrist 47 | mod: mag 48 | tasks: 49 | har: 50 | evaluation: classification 51 | target_transform: 52 | lie: 1 53 | sit: 2 54 | stand: 3 55 | walk: 4 56 | run: 5 57 | cycle: 6 58 | walk_nordic: 7 59 | watch_tv: 9 60 | work_computer: 10 61 | drive_car: 11 62 | walk_up: 12 63 | walk_down: 13 64 | vacuum: 16 65 | iron: 17 66 | laundry: 18 67 | clean: 19 68 | soccer: 20 69 | rope_jump: 24 70 | other: 0 71 | data_partitions: 72 | predefined: 73 | - fold_0 74 | loso: 75 | - fold_0 76 | - fold_1 77 | - fold_2 78 | - fold_3 79 | - fold_4 80 | - fold_5 81 | - fold_6 82 | - fold_7 83 | - fold_8 84 | deployable: 85 | - fold_0 86 | -------------------------------------------------------------------------------- /har_chain.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from har_basic import basic_har 4 | from src.base import get_ancestral_metadata 5 | from src.motifs.models import get_classifier 6 | from src.utils.misc import randomised_order 7 | 8 | 9 | def har_chain( 10 | test_dataset="anguita2013", 11 | fs_new=33, 12 | win_len=3, 13 | win_inc=1, 14 | task_name="har", 15 | data_partition="predefined", 16 | feat_name="ecdf", 17 | clf_name="sgd", 18 | evaluate=False, 19 | ): 20 | # Make metadata for the experiment 21 | kwargs = dict( 22 | fs_new=fs_new, win_len=win_len, win_inc=win_inc, task_name=task_name, feat_name=feat_name, clf_name=clf_name 23 | ) 24 | 25 | dataset_alignment = dict( 26 | anguita2013=dict(dataset_name="anguita2013", location="waist", modality="accel"), 27 | pamap2=dict(dataset_name="pamap2", location="chest", modality="accel"), 28 | uschad=dict(dataset_name="uschad", location="waist", modality="accel"), 29 | ) 30 | 31 | # Extract the representation for the test dataset 32 | test_dataset = dataset_alignment.pop(test_dataset) 33 | features, test_models = basic_har(data_partition="predefined", **test_dataset, **kwargs) 34 | 35 | # Instantiate the root directory 36 | root = features.graph / f"chained-from-{sorted(dataset_alignment.keys())}" 37 | 38 | # Build up the list of models from aux datasets 39 | auxiliary_models = {train_test_split: [model] for train_test_split, model in test_models.items()} 40 | for model_name, model_kwargs in dataset_alignment.items(): 41 | aux_features, aux_models = basic_har(data_partition="deployable", **model_kwargs, **kwargs) 42 | for fi, mi in aux_models.items(): 43 | auxiliary_models[fi].append(mi) 44 | 45 | models = dict() 46 | 47 | # Perform the chaining 48 | train_test_splits = get_ancestral_metadata(features, "data_partitions")[data_partition] 49 | for train_test_split in randomised_order(train_test_splits): 50 | aux_probs = [features] + [model.predict_proba(features) for model in auxiliary_models[train_test_split]] 51 | prob_features = root.instantiate_orphan_node(func=np.concatenate, args=[aux_probs], kwargs=dict(axis=1)) 52 | 53 | models[train_test_split] = get_classifier( 54 | clf_name=clf_name, 55 | feature_node=prob_features, 56 | 
task_name=task_name, 57 | data_partition=data_partition, 58 | train_test_split=train_test_split, 59 | evaluate=evaluate, 60 | ) 61 | 62 | return features, models 63 | 64 | 65 | if __name__ == "__main__": 66 | har_chain(evaluate=True) 67 | -------------------------------------------------------------------------------- /src/features/statistical_features.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from src.base import get_ancestral_metadata 4 | from src.features.statistical_features_impl import f_feat 5 | from src.features.statistical_features_impl import t_feat 6 | from src.functional.common import sorted_node_values 7 | 8 | __all__ = ["statistical_features"] 9 | 10 | 11 | def statistical_features(parent): 12 | """ 13 | There are two feature categories defined here: 14 | 1. Time domain 15 | 2. Frequency domain 16 | 17 | And these get mapped from transformed data from two sources: 18 | 1. Acceleration 19 | 2. Gyroscope 20 | 21 | Assuming these two sources have gone through some body/gravity transformations 22 | (eg from src.transformations.body_grav_filt) there will actually be several 23 | more sources, eg: 24 | 1. accel-body 25 | 2. accel-body-jerk 26 | 3. accel-body-jerk 27 | 4. accel-grav 28 | 5. gyro-body 29 | 6. gyro-body-jerk 30 | 7. gyro-body-jerk 31 | 32 | With more data sources this list will grows quickly. 33 | 34 | The feature types (time and frequency domain) are mapped to the transformed 35 | sources in a particular way. For example, the frequency domain features are 36 | *not* calculated on the gravity data sources. The loop below iterates through 37 | all of the outputs of the previous node in the graph, and the logic within 38 | the loop manages the correct mapping of functions to sources. 39 | 40 | Consult with the dataset table (tables/datasets.md) and see anguita2013 for 41 | details. 
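    Concretely, every windowed source key below gains a `-feat='td'` node, and a
    `-feat='fd'` node is added for all streams except the accelerometer gravity
    component (gyroscope and magnetometer streams receive both feature types).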
42 | """ 43 | 44 | root = parent / "statistical_features" 45 | 46 | fs = get_ancestral_metadata(root, "fs") 47 | 48 | accel_key = "mod='accel'" 49 | gyro_key = "mod='gyro'" 50 | mag_key = "mod='mag'" 51 | 52 | for key, node in parent.outputs.items(): 53 | key_td = f"{key}-feat='td'" 54 | key_fd = f"{key}-feat='fd'" 55 | 56 | t_kwargs = dict(data=node) 57 | f_kwargs = dict(data=node, fs=fs) 58 | 59 | if accel_key in key: 60 | root.instantiate_node(key=key_td, func=t_feat, kwargs=t_kwargs, backend="numpy") 61 | if "grav" not in key: 62 | root.instantiate_node(key=key_fd, func=f_feat, kwargs=f_kwargs, backend="numpy") 63 | if gyro_key in key or mag_key in key: 64 | root.instantiate_node(key=key_td, func=t_feat, kwargs=t_kwargs, backend="numpy") 65 | root.instantiate_node(key=key_fd, func=f_feat, kwargs=f_kwargs, backend="numpy") 66 | 67 | return root.instantiate_node( 68 | key="features", func=np.concatenate, args=[sorted_node_values(root.outputs)], kwargs=dict(axis=1) 69 | ) 70 | -------------------------------------------------------------------------------- /src/transformers/resample.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from scipy import signal 3 | 4 | from src.base import get_ancestral_metadata 5 | from src.utils.decorators import PartitionByTrial 6 | 7 | __all__ = [ 8 | "resample", 9 | ] 10 | 11 | 12 | def resample_data(index, data, fs_old, fs_new): 13 | if fs_old == fs_new: 14 | return data 15 | 16 | n_samples = int(data.shape[0] * fs_new / fs_old) 17 | 18 | return signal.resample(data, n_samples, axis=0) 19 | 20 | 21 | def align_metadata(t_old, t_new): 22 | assert np.all((t_new[1:] - t_new[:-1]) > 0) 23 | assert np.all((t_old[1:] - t_old[:-1]) > 0) 24 | assert t_old[-1] == t_new[-1] 25 | assert t_old[0] == t_new[0] 26 | 27 | inds = [0] 28 | 29 | for ti, target in enumerate(t_new[1:], start=1): 30 | best, best_ind, last_ind = np.inf, None, inds[-1] 31 | for ii in range(last_ind, len(t_old)): 32 | diff_ii = abs(target - t_old[ii]) 33 | is_best = diff_ii < best 34 | if is_best: 35 | best, best_ind = diff_ii, ii 36 | if (not is_best) or (ii == (len(t_old) - 1)): 37 | inds.append(best_ind) 38 | break 39 | 40 | inds = np.asarray(inds) 41 | 42 | return inds 43 | 44 | 45 | def resample_metadata(index, data, fs_old, fs_new, is_index): 46 | if fs_old == fs_new: 47 | return data 48 | 49 | n_samples = int(data.shape[0] * fs_new / fs_old) 50 | t_old = index.time.values 51 | t_new = np.linspace(t_old[0], t_old[-1], n_samples) 52 | 53 | inds = align_metadata(t_old, t_new) 54 | df = data.iloc[inds].copy() 55 | if is_index: 56 | df.loc[:, "time"] = t_new 57 | df = df.astype(data.dtypes) 58 | return df 59 | 60 | 61 | def resample(parent, fs_new): 62 | fs_old = get_ancestral_metadata(parent, "fs") 63 | 64 | root = parent / f"{fs_new}Hz" 65 | root.meta.insert("fs", fs_new) 66 | 67 | kwargs = dict(fs_old=fs_old, fs_new=fs_new) 68 | 69 | if fs_old != fs_new: 70 | # Only compute indexes and outputs if the sample rate has changed 71 | for key, node in parent.index.items(): 72 | root.instantiate_node( 73 | key=key, 74 | backend="pandas", 75 | func=PartitionByTrial(resample_metadata), 76 | kwargs=dict(index=parent.index["index"], data=node, is_index="index" in str(key), **kwargs), 77 | ) 78 | 79 | for key, node in parent.outputs.items(): 80 | root.instantiate_node( 81 | key=key, 82 | func=PartitionByTrial(resample_data), 83 | kwargs=dict(index=parent.index["index"], data=node, **kwargs), 84 | ) 85 | 86 | return root 87 | 
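# A worked illustration (values chosen purely for illustration) of the
# nearest-sample alignment performed by `align_metadata` above:
#
#   t_old = np.array([0.0, 0.1, 0.2, 0.3, 0.4])   # original timestamps
#   t_new = np.linspace(0.0, 0.4, 3)               # resampled grid: [0.0, 0.2, 0.4]
#   align_metadata(t_old, t_new)                   # -> array([0, 2, 4])
#
# `resample_metadata` uses these indices to carry the index and label metadata
# through a change of sampling rate and, for the index itself, rewrites its
# `time` column onto the new grid.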
-------------------------------------------------------------------------------- /src/models/ensembles.py: -------------------------------------------------------------------------------- 1 | from typing import Dict 2 | from typing import List 3 | from typing import Optional 4 | from typing import Sized 5 | from typing import Union 6 | 7 | import numpy as np 8 | from sklearn.base import BaseEstimator 9 | from sklearn.preprocessing import LabelEncoder 10 | 11 | 12 | class PrefittedVotingClassifier(BaseEstimator): 13 | def __init__( 14 | self, 15 | estimators: List[Union[BaseEstimator]], 16 | voting: str = "soft", 17 | weights: Optional[Sized] = None, 18 | verbose: bool = False, 19 | strict: bool = True, 20 | ): 21 | assert weights is None or len(weights) == len(estimators) 22 | 23 | self.estimators = estimators 24 | self.voting = voting 25 | self.weights = weights 26 | self.verbose = verbose 27 | self.strict = strict 28 | self.le_ = None 29 | self.classes_ = None 30 | 31 | def transform(self, X): 32 | weights = self.weights 33 | if weights is None: 34 | weights = np.ones(len(self.estimators)) / len(self.estimators) 35 | return [est.predict_proba(X) * ww for ww, (_, est) in zip(weights, self.estimators)] 36 | 37 | def predict_proba(self, X): 38 | return sum(self.transform(X)) 39 | 40 | def predict(self, X): 41 | probs = self.predict_proba(X) 42 | inds = np.argmax(probs, axis=1) 43 | return self.classes_[inds] 44 | 45 | def fit(self, X, y, sample_weight=None): 46 | self.le_ = LabelEncoder().fit(y) 47 | self.classes_ = self.le_.classes_ 48 | for name, est in self.estimators: 49 | if self.strict: 50 | assert np.all( 51 | est.classes_ == self.classes_ 52 | ), f"Model classes ({self.classes_}) not aligned with {name}: {est.classes_=}" 53 | return self 54 | 55 | def score(self, X, y): 56 | return np.mean(self.predict(X) == y) 57 | 58 | 59 | class ZeroShotVotingClassifier(PrefittedVotingClassifier): 60 | def __init__( 61 | self, 62 | estimators: List[Union[BaseEstimator]], 63 | label_alignment: Dict[str, str], 64 | voting: str = "soft", 65 | weights: Optional[Sized] = None, 66 | verbose: bool = False, 67 | ): 68 | super().__init__(estimators=estimators, voting=voting, weights=weights, verbose=verbose, strict=False) 69 | self.label_alignment = label_alignment 70 | 71 | def predict_proba(self, X): 72 | out = np.zeros((X.shape[0], self.classes_.shape[0])) 73 | self_lookup = dict(zip(self.classes_, range(len(self.classes_)))) 74 | for (_, estimator), transformed in zip(self.estimators, self.transform(X)): 75 | for fi, (name, col) in enumerate(zip(estimator.classes_, transformed.T)): 76 | out[:, self_lookup[self.label_alignment[name]]] += col 77 | return out 78 | 79 | def predict(self, X): 80 | probs = self.predict_proba(X) 81 | inds = np.argmax(probs, axis=1) 82 | return self.classes_[inds] 83 | -------------------------------------------------------------------------------- /src/datasets/uschad.py: -------------------------------------------------------------------------------- 1 | from os.path import join 2 | 3 | import numpy as np 4 | import pandas as pd 5 | from scipy.io import loadmat 6 | from tqdm import trange 7 | 8 | from src.datasets.base import Dataset 9 | from src.utils.decorators import fold_decorator 10 | from src.utils.decorators import index_decorator 11 | from src.utils.decorators import label_decorator 12 | 13 | __all__ = ["uschad"] 14 | 15 | 16 | class uschad(Dataset): 17 | def __init__(self): 18 | super(uschad, self).__init__(name=self.__class__.__name__) 19 | 20 | @label_decorator 
21 | def build_label(self, task, *args, **kwargs): 22 | def callback(ii, sub_id, act_id, trial_id, data): 23 | return np.zeros((data.shape[0], 1)) + act_id 24 | 25 | return ( 26 | self.meta.inv_lookup[task], 27 | uschad_iterator(self.unzip_path, callback=callback, desc=f"{self.identifier} Labels"), 28 | ) 29 | 30 | @fold_decorator 31 | def build_predefined(self, *args, **kwargs): 32 | def callback(ii, sub_id, act_id, trial_id, data): 33 | return np.tile(["train", "test"][sub_id > 10], (data.shape[0], 1)) 34 | 35 | return uschad_iterator(self.unzip_path, callback=callback, desc=f"{self.identifier} Folds") 36 | 37 | @index_decorator 38 | def build_index(self, *args, **kwargs): 39 | def callback(ii, sub_id, act_id, trial_id, data): 40 | return np.c_[ 41 | np.zeros((data.shape[0], 1)) + sub_id, 42 | np.zeros((data.shape[0], 1)) + ii, 43 | np.arange(data.shape[0]) / self.meta["fs"], 44 | ] 45 | 46 | return uschad_iterator( 47 | self.unzip_path, callback=callback, columns=["subject", "trial", "time"], desc=f"{self.identifier} Index", 48 | ) 49 | 50 | def build_data(self, loc, mod, *args, **kwargs): 51 | cols = dict(accel=[0, 1, 2], gyro=[3, 4, 5])[mod] 52 | 53 | def callback(ii, sub_id, act_id, trial_id, data): 54 | return data[:, cols] 55 | 56 | scale = dict(accel=1.0, gyro=2 * np.pi / 360) 57 | 58 | data = uschad_iterator(self.unzip_path, callback=callback, desc=f"Data ({loc}-{mod})") 59 | 60 | return data.values * scale[mod] 61 | 62 | 63 | def uschad_iterator(path, columns=None, cols=None, callback=None, desc=None): 64 | data_list = [] 65 | 66 | ii = 0 67 | 68 | for sub_id in trange(1, 14 + 1, desc=desc): 69 | for act_id in range(1, 12 + 1): 70 | for trail_id in range(1, 5 + 1): 71 | fname = join(path, f"Subject{sub_id}", f"a{act_id}t{trail_id}.mat") 72 | 73 | data = loadmat(fname)["sensor_readings"] 74 | 75 | if callback: 76 | data = callback(ii, sub_id, act_id, trail_id, data) 77 | elif cols: 78 | data = data[cols] 79 | else: 80 | raise ValueError 81 | 82 | data_list.extend(data) 83 | ii += 1 84 | 85 | df = pd.DataFrame(data_list) 86 | if columns: 87 | df.columns = columns 88 | return df 89 | -------------------------------------------------------------------------------- /har_ensemble_avg.py: -------------------------------------------------------------------------------- 1 | from mldb import NodeWrapper 2 | 3 | from src.base import ExecutionGraph 4 | from src.base import get_ancestral_metadata 5 | from src.models.base import BasicScorer 6 | from src.models.base import ClassifierWrapper 7 | from src.models.ensembles import PrefittedVotingClassifier 8 | from src.motifs.features import get_features 9 | from src.motifs.features import get_windowed_wearables 10 | from src.motifs.models import get_classifier 11 | from src.utils.loaders import dataset_importer 12 | from src.utils.misc import randomised_order 13 | 14 | 15 | def ensemble_classifier( 16 | task_name: str, 17 | feat_name: str, 18 | data_partition: str, 19 | train_test_split: str, 20 | windowed_data: NodeWrapper, 21 | clf_names, 22 | evaluate=False, 23 | ): 24 | features = get_features(feat_name=feat_name, windowed_data=windowed_data) 25 | 26 | graph: ExecutionGraph = features.graph / f"ensemble-over={sorted(clf_names)}" / task_name / train_test_split 27 | 28 | estimators = list() 29 | for clf_name in randomised_order(clf_names): 30 | estimators.append( 31 | [ 32 | f"{clf_name=}", 33 | get_classifier( 34 | feature_node=features, 35 | clf_name=clf_name, 36 | task_name=task_name, 37 | data_partition=data_partition, 38 | 
train_test_split=train_test_split, 39 | ).model, 40 | ] 41 | ) 42 | 43 | estimator = graph.instantiate_node( 44 | key=f"PrefittedVotingClassifier-{feat_name}".lower(), 45 | func=PrefittedVotingClassifier, 46 | kwargs=dict(estimators=estimators, voting="soft", verbose=10), 47 | ) 48 | 49 | model = ClassifierWrapper( 50 | parent=graph, 51 | estimator=estimator, 52 | features=features, 53 | task=features.graph[task_name], 54 | split=graph.get_split_series(data_partition=data_partition, train_test_split=train_test_split), 55 | scorer=BasicScorer(), 56 | evaluate=evaluate, 57 | ) 58 | 59 | return model 60 | 61 | 62 | def basic_ensemble( 63 | dataset_name="anguita2013", 64 | modality="all", 65 | location="all", 66 | task_name="har", 67 | feat_name="ecdf", 68 | data_partition="predefined", 69 | fs_new=33, 70 | win_len=3, 71 | win_inc=1, 72 | ): 73 | dataset = dataset_importer(dataset_name) 74 | 75 | windowed_data = get_windowed_wearables( 76 | dataset=dataset, modality=modality, location=location, fs_new=fs_new, win_len=win_len, win_inc=win_inc 77 | ) 78 | 79 | models = dict() 80 | 81 | train_test_splits = get_ancestral_metadata(windowed_data, "data_partitions")[data_partition] 82 | for train_test_split in randomised_order(train_test_splits): 83 | models[train_test_split] = ensemble_classifier( 84 | feat_name=feat_name, 85 | task_name=task_name, 86 | data_partition=data_partition, 87 | windowed_data=windowed_data, 88 | train_test_split=train_test_split, 89 | clf_names=["sgd", "lr", "rf"], 90 | evaluate=True, 91 | ) 92 | 93 | return models 94 | 95 | 96 | if __name__ == "__main__": 97 | basic_ensemble() 98 | -------------------------------------------------------------------------------- /src/datasets/anguita2013.py: -------------------------------------------------------------------------------- 1 | from os.path import join 2 | 3 | import numpy as np 4 | import pandas as pd 5 | 6 | from src.datasets.base import Dataset 7 | from src.utils.decorators import fold_decorator 8 | from src.utils.decorators import index_decorator 9 | from src.utils.decorators import label_decorator 10 | from src.utils.loaders import load_csv_data 11 | 12 | __all__ = ["anguita2013"] 13 | 14 | WIN_LEN = 128 15 | 16 | 17 | class anguita2013(Dataset): 18 | def __init__(self): 19 | super(anguita2013, self).__init__(name=self.__class__.__name__, unzip_path=lambda s: s.replace("%20", " ")) 20 | 21 | @label_decorator 22 | def build_label(self, task, *args, **kwargs): 23 | labels = [] 24 | for fold in ("train", "test"): 25 | fold_labels = load_csv_data(join(self.unzip_path, fold, f"y_{fold}.txt")) 26 | labels.extend([l for l in fold_labels for _ in range(WIN_LEN)]) 27 | return self.meta.inv_lookup[task], pd.DataFrame(dict(labels=labels)) 28 | 29 | @fold_decorator 30 | def build_predefined(self, *args, **kwargs): 31 | fold = [] 32 | fold.extend( 33 | ["train" for _ in load_csv_data(join(self.unzip_path, "train", "y_train.txt")) for _ in range(WIN_LEN)] 34 | ) 35 | fold.extend( 36 | ["test" for _ in load_csv_data(join(self.unzip_path, "test", "y_test.txt")) for _ in range(WIN_LEN)] 37 | ) 38 | return pd.DataFrame(fold) 39 | 40 | @index_decorator 41 | def build_index(self, *args, **kwargs): 42 | sub = [] 43 | for fold in ("train", "test"): 44 | sub.extend(load_csv_data(join(self.unzip_path, fold, f"subject_{fold}.txt"))) 45 | index = pd.DataFrame( 46 | dict( 47 | subject=[si for si in sub for _ in range(WIN_LEN)], 48 | trial=build_seq_list(subs=sub, win_len=WIN_LEN), 49 | time=build_time(subs=sub, win_len=WIN_LEN, 
fs=float(self.meta.meta["fs"])), 50 | ) 51 | ) 52 | return index 53 | 54 | def build_data(self, loc, mod, *args, **kwargs): 55 | x_data = [] 56 | y_data = [] 57 | z_data = [] 58 | modality = dict(accel="acc", gyro="gyro")[mod] 59 | for fold in ("train", "test"): 60 | for l, d in zip((x_data, y_data, z_data), ("x", "y", "z")): 61 | ty = ["body", "total"][modality in {"accel", "acc"}] 62 | acc = load_csv_data( 63 | join(self.unzip_path, fold, "Inertial Signals", f"{ty}_{modality}_{d}_{fold}.txt"), astype="np", 64 | ) 65 | l.extend(acc.ravel().tolist()) 66 | return np.c_[x_data, y_data, z_data] 67 | 68 | 69 | def build_time(subs, win_len, fs): 70 | win = np.arange(win_len, dtype=float) / fs 71 | inc = win_len / fs 72 | t = [] 73 | prev_sub = subs[0] 74 | for curr_sub in subs: 75 | if curr_sub != prev_sub: 76 | win = np.arange(win_len, dtype=float) / fs 77 | t.extend(win) 78 | win += inc 79 | prev_sub = curr_sub 80 | return t 81 | 82 | 83 | def build_seq_list(subs, win_len): 84 | seq = [] 85 | si = 0 86 | last_sub = subs[0] 87 | for prev_sub in subs: 88 | if prev_sub != last_sub: 89 | si += 1 90 | seq.extend([si] * win_len) 91 | last_sub = prev_sub 92 | return seq 93 | -------------------------------------------------------------------------------- /make_tables.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import os 3 | 4 | from src.meta import DatasetMeta 5 | from src.utils.loaders import build_path 6 | from src.utils.loaders import get_yaml_file_list 7 | from src.utils.loaders import load_metadata 8 | from src.utils.loaders import load_yaml 9 | 10 | 11 | def make_links(links, desc="Link"): 12 | return ", ".join("[{} {}]({})".format(desc, ii, url) for ii, url in enumerate(links, start=1)) 13 | 14 | 15 | def make_dataset_row(dataset): 16 | # modalities = sorted(set([mn for ln, lm in self.meta['locations'].items() for mn, mv in lm.items() if mv])) 17 | 18 | data = [ 19 | dataset.meta["author"], 20 | dataset.meta["paper_name"], 21 | dataset.name, 22 | make_links(links=dataset.meta["description_urls"], desc="Link"), 23 | dataset.meta.get("missing", ""), 24 | make_links(links=dataset.meta["paper_urls"], desc="Link"), 25 | dataset.meta["year"], 26 | dataset.meta["fs"], 27 | ", ".join(dataset.meta["locations"].keys()), 28 | ", ".join(dataset.meta["modalities"]), 29 | dataset.meta["num_subjects"], 30 | dataset.meta["num_activities"], 31 | ", ".join(dataset.meta["activities"].keys()), 32 | ] 33 | 34 | return ( 35 | ( 36 | f"| First Author | Paper Name | Dataset Name | Description | Missing data " 37 | f"| Download Links | Year | Sampling Rate | Device Locations | Device Modalities " 38 | f"| Num Subjects | Num Activities | Activities | " 39 | ), 40 | "| {} |".format(" | ".join(["-----"] * len(data))), 41 | "| {} |".format(" | ".join(map(str, data))), 42 | ) 43 | 44 | 45 | def load_datasets_metadata(): 46 | return [load_yaml(path) for path in get_yaml_file_list("datasets", stem=False)] 47 | 48 | 49 | def main(): 50 | # Ensure the paths exist 51 | root = build_path("tables") 52 | if not os.path.exists(root): 53 | os.makedirs(root) 54 | 55 | # Current list of datasets 56 | lines = [] 57 | datasets = load_datasets_metadata() 58 | for dataset in datasets: 59 | dataset = DatasetMeta(dataset) 60 | head, space, line = make_dataset_row(dataset) 61 | lines.append(line) 62 | with open(build_path("tables", "datasets.md"), "w") as fil: 63 | fil.write("{}\n".format(head)) 64 | fil.write("{}\n".format(space)) 65 | for line in lines: 66 | 
fil.write("{}\n".format(line)) 67 | 68 | # Iterate over the other data tables 69 | dims = [ 70 | "activities", 71 | "features", 72 | "locations", 73 | "models", 74 | "pipelines", 75 | "transformers", 76 | "visualisations", 77 | ] 78 | 79 | for dim in dims: 80 | with open(build_path("tables", f"{dim}.md"), "w") as fil: 81 | data = load_metadata(f"{dim}.yaml") 82 | fil.write(f"| Index | {dim[0].upper()}{dim[1:].lower()} | value | \n") 83 | fil.write(f"| ----- | ----- | ----- | \n") 84 | if isinstance(data, dict): 85 | for ki, (key, value) in enumerate(data.items()): 86 | if isinstance(value, dict) and "description" in value: 87 | value = make_links(value["description"]) 88 | fil.write(f"| {ki} | {key} | {value} | \n") 89 | 90 | 91 | if __name__ == "__main__": 92 | main() 93 | -------------------------------------------------------------------------------- /src/datasets/pamap2.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | from os.path import join 3 | 4 | import numpy as np 5 | import pandas as pd 6 | from tqdm import tqdm 7 | 8 | from src.datasets.base import Dataset 9 | from src.utils.decorators import fold_decorator 10 | from src.utils.decorators import index_decorator 11 | from src.utils.decorators import label_decorator 12 | 13 | __all__ = [ 14 | "pamap2", 15 | ] 16 | 17 | 18 | class pamap2(Dataset): 19 | def __init__(self): 20 | super(pamap2, self).__init__(name=self.__class__.__name__, unzip_path=lambda p: join(p, "Protocol")) 21 | 22 | @label_decorator 23 | def build_label(self, task, *args, **kwargs): 24 | df = pd.DataFrame(iter_pamap2_subs(path=self.unzip_path, cols=[1], desc=f"{self.identifier} Labels")) 25 | 26 | return self.meta.inv_lookup[task], df 27 | 28 | @fold_decorator 29 | def build_predefined(self, *args, **kwargs): 30 | def folder(sid, data): 31 | return np.zeros(data.shape[0]) + sid 32 | 33 | df = iter_pamap2_subs( 34 | path=self.unzip_path, cols=[1], desc=f"{self.identifier} Folds", callback=folder, columns=["fold"], 35 | ).astype(int) 36 | 37 | lookup = { 38 | 1: "train", 39 | 2: "train", 40 | 3: "test", 41 | 4: "train", 42 | 5: "train", 43 | 6: "test", 44 | 7: "train", 45 | 8: "train", 46 | 9: "test", 47 | } 48 | 49 | return df.assign(fold_0=df["fold"].apply(lookup.__getitem__))[["fold_0"]].astype("category") 50 | 51 | @index_decorator 52 | def build_index(self, *args, **kwargs): 53 | def indexer(sid, data): 54 | subject = np.zeros(data.shape[0])[:, None] + sid 55 | trial = np.zeros(data.shape[0])[:, None] + sid 56 | return np.concatenate((subject, trial, data), axis=1) 57 | 58 | df = iter_pamap2_subs( 59 | path=self.unzip_path, 60 | cols=[0], 61 | desc=f"{self.identifier} Index", 62 | callback=indexer, 63 | columns=["subject", "trial", "time"], 64 | ).astype(dict(subject=int, trial=int, time=float)) 65 | 66 | return df 67 | 68 | def build_data(self, loc, mod, *args, **kwargs): 69 | offset = dict(wrist=3, chest=20, ankle=37)[loc] + dict(accel=1, gyro=7, mag=10)[mod] 70 | 71 | df = iter_pamap2_subs( 72 | path=self.unzip_path, 73 | cols=list(range(offset, offset + 3)), 74 | desc=f"Parsing {mod} at {loc}", 75 | columns=["x", "y", "z"], 76 | ).astype(float) 77 | 78 | scale = dict(accel=9.80665, gyro=np.pi * 2.0, mag=1.0)[mod] 79 | 80 | return df.values / scale 81 | 82 | 83 | def iter_pamap2_subs(path, cols, desc, columns=None, callback=None, n_subjects=9): 84 | data = [] 85 | 86 | for sid in tqdm(range(1, n_subjects + 1), desc=desc): 87 | datum = pd.read_csv(join(path, 
f"subject10{sid}.dat"), delim_whitespace=True, header=None, usecols=cols).fillna( 88 | method="ffill" 89 | ) 90 | assert np.isfinite(datum.values).all() 91 | if callback: 92 | data.extend(callback(sid, datum.values)) 93 | else: 94 | data.extend(datum.values) 95 | df = pd.DataFrame(data) 96 | if columns: 97 | df.columns = columns 98 | return df 99 | -------------------------------------------------------------------------------- /src/motifs/models.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from mldb import NodeWrapper 3 | from sklearn.decomposition import PCA 4 | from sklearn.ensemble import RandomForestClassifier 5 | from sklearn.linear_model import LogisticRegressionCV 6 | from sklearn.linear_model import SGDClassifier 7 | from sklearn.pipeline import Pipeline 8 | from sklearn.preprocessing import MinMaxScaler 9 | from sklearn.preprocessing import Normalizer 10 | from sklearn.preprocessing import StandardScaler 11 | from sklearn.svm import SVC 12 | 13 | from src.base import ExecutionGraph 14 | from src.base import get_ancestral_metadata 15 | from src.models.base import BasicScorer 16 | from src.models.base import ClassifierWrapper 17 | 18 | 19 | def make_classifier_node(root: ExecutionGraph, features: NodeWrapper, clf_name: str): 20 | if clf_name == "sgd": 21 | steps = [ 22 | # ('imputation', SimpleImputer()), 23 | ("scaling", StandardScaler()), 24 | ("pca", PCA(n_components=0.95)), 25 | ("clf", SGDClassifier(loss="log")), 26 | ] 27 | 28 | param_grid = dict(clf__alpha=np.logspace(-5, 5, 11)) 29 | 30 | elif clf_name == "lr": 31 | steps = [ 32 | # ('imputation', SimpleImputer()), 33 | ("scaling", StandardScaler()), 34 | ("pca", PCA(n_components=0.95)), 35 | ("clf", LogisticRegressionCV(max_iter=1000)), 36 | ] 37 | 38 | param_grid = dict(clf__penalty=["l2"], clf__max_iter=[100]) 39 | 40 | elif clf_name == "rf": 41 | steps = [ 42 | # ('imputation', SimpleImputer()), 43 | ("scaling", StandardScaler()), 44 | ("pca", PCA(n_components=0.95)), 45 | ("clf", RandomForestClassifier()), 46 | ] 47 | 48 | param_grid = dict(clf__n_estimators=[10, 30, 100]) 49 | 50 | elif clf_name == "svc": 51 | steps = [ 52 | # ('imputation', SimpleImputer()), 53 | ("scaling", Normalizer()), 54 | ("pca", PCA(n_components=0.95)), 55 | ("minmax", MinMaxScaler(feature_range=(-1, 1))), 56 | ("clf", SVC(kernel="rbf", probability=True)), 57 | ] 58 | 59 | param_grid = dict(clf__gamma=["scale", "auto"], clf__C=np.logspace(-3, 3, 7),) 60 | 61 | else: 62 | raise ValueError(f"Classifier {clf_name} selected, but this is not supported.") 63 | 64 | key = ",".join([str(nn) for _, nn in steps]).lower() 65 | estimator = root.instantiate_node(key=key, func=Pipeline, kwargs=dict(steps=steps)) 66 | 67 | return estimator, param_grid 68 | 69 | 70 | def get_classifier( 71 | feature_node: NodeWrapper, 72 | clf_name: str, 73 | task_name: str, 74 | data_partition: str, 75 | train_test_split: str, 76 | evaluate: bool = False, 77 | ) -> ClassifierWrapper: 78 | # Value checks 79 | assert task_name in get_ancestral_metadata(feature_node, "tasks") 80 | assert data_partition in get_ancestral_metadata(feature_node, "data_partitions") 81 | assert train_test_split in get_ancestral_metadata(feature_node, "data_partitions")[data_partition] 82 | 83 | root: ExecutionGraph = feature_node.graph / task_name / data_partition / train_test_split 84 | 85 | # Instantiate the classifier 86 | estimator, param_grid = make_classifier_node(root=root, features=feature_node, clf_name=clf_name) 87 | 88 | # 
Instantiate the classifier 89 | model = ClassifierWrapper( 90 | parent=root, 91 | estimator=estimator, 92 | param_grid=param_grid, 93 | features=feature_node, 94 | task=feature_node.graph[task_name], 95 | split=root.get_split_series(data_partition=data_partition, train_test_split=train_test_split), 96 | scorer=BasicScorer(), 97 | evaluate=evaluate, 98 | ) 99 | 100 | return model 101 | -------------------------------------------------------------------------------- /src/evaluation/classification.py: -------------------------------------------------------------------------------- 1 | from collections import Counter 2 | 3 | import numpy as np 4 | from loguru import logger 5 | from scipy.special import logsumexp 6 | from sklearn import metrics 7 | 8 | 9 | __all__ = ["evaluate_data_split"] 10 | 11 | 12 | def evaluate_data_split(split, targets, estimator, prob_predictions): 13 | res = dict() 14 | predictions = estimator.classes_[prob_predictions.argmax(axis=1)] 15 | for tr_val_te in split.unique(): 16 | inds = split == tr_val_te 17 | yy, pp, ss = targets[inds], predictions[inds], prob_predictions[inds] 18 | res[tr_val_te] = dict() 19 | res[tr_val_te] = _classification_perf_metrics( 20 | model=estimator, labels=np.asarray(yy).ravel(), predictions=np.asarray(pp), scores=np.asarray(ss), 21 | ) 22 | if hasattr(estimator, "cv_results_"): 23 | res["xval"] = estimator.cv_results_ 24 | return res 25 | 26 | 27 | def _get_class_names(model): 28 | if hasattr(model, "classes_"): 29 | return model.classes_ 30 | logger.exception(TypeError(f"The classes member cannot be extracted from this object: {model}")) 31 | 32 | 33 | def _classification_perf_metrics(labels, model, predictions, scores): 34 | cols = _get_class_names(model) 35 | 36 | def score_metrics(name, func, labels_, scores_, **kwargs): 37 | unique_labels = np.unique(labels_) 38 | lookup = dict(zip(unique_labels, range(unique_labels.shape[0]))) 39 | scores_ = scores_[:, unique_labels] 40 | if scores_.shape[1] == 1: 41 | return 1.0 42 | scores_ /= scores_.sum(axis=1, keepdims=True) 43 | if scores_.shape[1] == 2: 44 | scores_ = scores_[:, 1] 45 | return { 46 | f"{name}_{average}": func( 47 | y_true=np.asarray([lookup[label] for label in labels_]), y_score=scores_, average=average, **kwargs, 48 | ) 49 | for average in ("macro", "weighted") 50 | } 51 | 52 | def prediction_metrics(name, func): 53 | return { 54 | f"{name}_{average}": func(y_true=labels, y_pred=predictions, average=average) 55 | for average in ("macro", "micro", "weighted") 56 | } 57 | 58 | label_lookup = dict(zip(model.classes_, range(model.classes_.shape[0]))) 59 | probs = np.exp(scores - logsumexp(scores, axis=1, keepdims=True)) 60 | label_ind = np.asarray([label_lookup[ll] for ll in labels]) 61 | 62 | label_counts = Counter(labels) 63 | 64 | res = dict( 65 | class_names=cols, 66 | num_instances=len(labels), 67 | label_counts=dict(label_counts.items()), 68 | class_prior={kk: vv / len(labels) for kk, vv in label_counts.items()}, 69 | accuracy=metrics.accuracy_score(labels, predictions), 70 | confusion_matrix=metrics.confusion_matrix(labels, predictions), 71 | **score_metrics("auroc_ovo", metrics.roc_auc_score, label_ind, probs, multi_class="ovo"), 72 | **score_metrics("auroc_ovr", metrics.roc_auc_score, label_ind, probs, multi_class="ovr"), 73 | **prediction_metrics("f1", metrics.f1_score), 74 | **prediction_metrics("precision", metrics.precision_score), 75 | **prediction_metrics("recall", metrics.recall_score), 76 | per_class_metrics=dict(), 77 | ) 78 | 79 | if len(cols) > 2: 80 | for col 
in np.unique(labels): 81 | yi = label_lookup[col] 82 | yy_i = labels == col 83 | y_hat_i = predictions == col 84 | res["per_class_metrics"][col] = dict( 85 | index=yi, 86 | label=col, 87 | count=yy_i.sum(), 88 | class_prior=yy_i.mean(), 89 | accuracy=metrics.accuracy_score(yy_i, y_hat_i), 90 | auroc=metrics.roc_auc_score(yy_i, y_hat_i), 91 | f1=metrics.f1_score(yy_i, y_hat_i), 92 | precision=metrics.precision_score(yy_i, y_hat_i), 93 | recall=metrics.recall_score(yy_i, y_hat_i), 94 | confusion_matrix=metrics.confusion_matrix(yy_i, y_hat_i), 95 | ) 96 | 97 | return res 98 | -------------------------------------------------------------------------------- /src/meta.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | from loguru import logger 4 | 5 | from src.utils.loaders import get_env 6 | from src.utils.loaders import load_yaml 7 | from src.utils.loaders import metadata_path 8 | 9 | 10 | __all__ = [ 11 | "DatasetMeta", 12 | "BaseMeta", 13 | "HARMeta", 14 | "ModalityMeta", 15 | "DatasetMeta", 16 | ] 17 | 18 | 19 | class BaseMeta(object): 20 | def __init__(self, path, *args, **kwargs): 21 | self.path = Path(path) 22 | self.name = self.path.stem 23 | self.meta = dict() 24 | 25 | if path: 26 | try: 27 | meta = load_yaml(path) 28 | 29 | if meta is None: 30 | logger.info(f'The content metadata module "{self.name}" from {path} is empty. Assigning empty dict') 31 | meta = dict() 32 | else: 33 | if not isinstance(meta, dict): 34 | logger.warning(f"Metadata not of type dict loaded: {meta}") 35 | 36 | self.meta = meta 37 | 38 | except FileNotFoundError: 39 | # logger.warning(f'The metadata file for "{self.name}" was not found.') 40 | pass 41 | 42 | def __getitem__(self, item): 43 | if item not in self.meta: 44 | logger.exception(KeyError(f"{item} not found in {self.__class__.__name__}")) 45 | return self.meta[item] 46 | 47 | def __contains__(self, item): 48 | return item in self.meta 49 | 50 | def __repr__(self): 51 | return f"<{self.name} {self.meta.__repr__()}>" 52 | 53 | def keys(self): 54 | return self.meta.keys() 55 | 56 | def values(self): 57 | return self.meta.values() 58 | 59 | def items(self): 60 | return self.meta.items() 61 | 62 | def insert(self, key, value): 63 | assert key not in self.meta 64 | self.meta[key] = value 65 | 66 | 67 | """ 68 | Non-functional metadata 69 | """ 70 | 71 | 72 | class HARMeta(BaseMeta): 73 | def __init__(self, path, *args, **kwargs): 74 | super(HARMeta, self).__init__(path=metadata_path("tasks", "har.yaml"), *args, **kwargs) 75 | 76 | 77 | class LocalisationMeta(BaseMeta): 78 | def __init__(self, path, *args, **kwargs): 79 | super(LocalisationMeta, self).__init__(path=metadata_path("tasks", "localisation.yaml"), *args, **kwargs) 80 | 81 | 82 | class ModalityMeta(BaseMeta): 83 | def __init__(self, path, *args, **kwargs): 84 | super(ModalityMeta, self).__init__(name=metadata_path("modality.yaml"), *args, **kwargs) 85 | 86 | 87 | class DatasetMeta(BaseMeta): 88 | def __init__(self, path, *args, **kwargs): 89 | if isinstance(path, str): 90 | path = Path("metadata", "datasets", f"{path}.yaml") 91 | 92 | assert path.exists() 93 | 94 | super(DatasetMeta, self).__init__(path=path, *args, **kwargs) 95 | 96 | if "fs" not in self.meta: 97 | logger.exception(KeyError(f'The file {path} does not contain the key "fs"')) 98 | 99 | self.inv_lookup = dict() 100 | 101 | for task_name in self.meta["tasks"].keys(): 102 | task_label_file = metadata_path("tasks", f"{task_name}.yaml") 103 | task_labels = 
load_yaml(task_label_file) 104 | dataset_labels = self.meta["tasks"][task_name]["target_transform"] 105 | if not set(dataset_labels.keys()).issubset(set(task_labels)): 106 | logger.exception( 107 | ValueError( 108 | f"The following labels from dataset {path} are not accounted for in {task_label_file}: " 109 | f"{set(dataset_labels.keys()).difference(task_labels.keys())}" 110 | ) 111 | ) 112 | self.inv_lookup[task_name] = {dataset_labels[kk]: kk for kk, vv in dataset_labels.items()} 113 | 114 | @property 115 | def fs(self): 116 | return float(self.meta["fs"]) 117 | 118 | @property 119 | def zip_path(self): 120 | return get_env("ZIP_ROOT") / self.name 121 | -------------------------------------------------------------------------------- /har_zero.py: -------------------------------------------------------------------------------- 1 | from operator import itemgetter 2 | from typing import Dict 3 | from typing import List 4 | from typing import Tuple 5 | 6 | from mldb import NodeWrapper 7 | 8 | from har_basic import basic_har 9 | from src.base import ExecutionGraph 10 | from src.base import get_ancestral_metadata 11 | from src.models.base import BasicScorer 12 | from src.models.base import ClassifierWrapper 13 | from src.models.ensembles import ZeroShotVotingClassifier 14 | from src.utils.misc import randomised_order 15 | 16 | 17 | def make_zero_shot_classifier( 18 | feat_name, 19 | features, 20 | estimators: List[Tuple[str, NodeWrapper]], 21 | task_name, 22 | data_partition, 23 | train_test_split, 24 | label_alignment: Dict[str, str], 25 | evaluate: bool = True, 26 | ): 27 | clf_names = sorted(map(itemgetter(0), estimators)) 28 | 29 | graph: ExecutionGraph = features.graph / f"zero-shot-from={sorted(clf_names)}" / task_name / data_partition / train_test_split 30 | 31 | estimator = graph.instantiate_node( 32 | key=f"ZeroShotVotingClassifier-{feat_name}".lower(), 33 | func=ZeroShotVotingClassifier, 34 | kwargs=dict(estimators=estimators, voting="soft", verbose=10, label_alignment=label_alignment), 35 | ) 36 | 37 | model = ClassifierWrapper( 38 | parent=graph, 39 | estimator=estimator, 40 | features=features, 41 | task=features.graph[task_name], 42 | split=graph.get_split_series(data_partition=data_partition, train_test_split=train_test_split), 43 | scorer=BasicScorer(), 44 | evaluate=evaluate, 45 | ) 46 | 47 | return model 48 | 49 | 50 | def har_zero( 51 | fs_new: float = 33, 52 | win_len: float = 3, 53 | win_inc: float = 1, 54 | clf_name: str = "sgd", 55 | task_name: str = "har", 56 | dataset_partition: str = "predefined", 57 | feat_name: str = "statistical", 58 | evaluate: bool = False, 59 | ): 60 | kwargs = dict( 61 | fs_new=fs_new, win_len=win_len, win_inc=win_inc, task_name=task_name, feat_name=feat_name, clf_name=clf_name 62 | ) 63 | 64 | external_datasets = dict( 65 | pamap2=dict(dataset_name="pamap2", location="chest", modality="accel"), 66 | uschad=dict(dataset_name="uschad", location="waist", modality="accel"), 67 | ) 68 | 69 | test_dataset = dict(dataset_name="anguita2013", location="waist", modality="accel") 70 | 71 | label_alignment = dict( 72 | cycle="walk", 73 | elevator_down="stand", 74 | elevator_up="stand", 75 | iron="stand", 76 | jump="walk", 77 | other="walk", 78 | rope_jump="walk", 79 | run="walk", 80 | sit="sit", 81 | sleep="lie", 82 | stand="stand", 83 | vacuum="walk", 84 | walk="walk", 85 | walk_down="walk_down", 86 | walk_left="walk", 87 | walk_nordic="walk", 88 | walk_right="walk", 89 | walk_up="walk_up", 90 | lie="lie", 91 | ) 92 | 93 | features, test_models = 
basic_har(data_partition="predefined", **test_dataset, **kwargs) 94 | 95 | auxiliary_models = dict() 96 | for model_name, model_kwargs in external_datasets.items(): 97 | aux_features, aux_models = basic_har(data_partition="deployable", **model_kwargs, **kwargs) 98 | auxiliary_models[model_name] = aux_models 99 | 100 | models = dict() 101 | 102 | train_test_splits = get_ancestral_metadata(features, "data_partitions")[dataset_partition] 103 | for train_test_split in randomised_order(train_test_splits): 104 | models[train_test_split] = make_zero_shot_classifier( 105 | estimators=[(mn, mm[train_test_split].model) for mn, mm in auxiliary_models.items()], 106 | feat_name=feat_name, 107 | features=features, 108 | task_name=task_name, 109 | train_test_split=train_test_split, 110 | data_partition=dataset_partition, 111 | evaluate=evaluate, 112 | label_alignment=label_alignment, 113 | ) 114 | 115 | return models 116 | 117 | 118 | if __name__ == "__main__": 119 | har_zero(evaluate=True) 120 | -------------------------------------------------------------------------------- /src/datasets/base.py: -------------------------------------------------------------------------------- 1 | from os.path import basename 2 | from os.path import join 3 | from os.path import splitext 4 | 5 | import pandas as pd 6 | 7 | from src.base import ExecutionGraph 8 | from src.base import get_ancestral_metadata 9 | from src.meta import DatasetMeta 10 | 11 | __all__ = [ 12 | "Dataset", 13 | ] 14 | 15 | 16 | def validate_split_names(split_name, split_df, split_cols): 17 | meta_set = set(split_cols) 18 | df_set = set(split_df.columns) 19 | if not df_set.issubset(meta_set): 20 | raise ValueError( 21 | f"Split dataframe columns ({df_set}) not subset of metadata ({meta_set}) for the split type {split_name}." 
22 | ) 23 | 24 | 25 | class Dataset(ExecutionGraph): 26 | def __init__(self, name, *args, **kwargs): 27 | super(Dataset, self).__init__(name=f"datasets/{name}", meta=DatasetMeta(name)) 28 | 29 | def load_meta(*args, **kwargs): 30 | return self.meta.meta 31 | 32 | load_meta.__name__ = name 33 | 34 | metadata = self.instantiate_node(key=f"{name}-metadata", backend="yaml", func=load_meta, kwargs=dict()) 35 | 36 | zip_name = kwargs.get("unzip_path", lambda x: x)(splitext(basename(self.meta.meta["download_urls"][0]))[0]) 37 | self.unzip_path = join(self.meta.zip_path, splitext(zip_name)[0]) 38 | 39 | index = self.instantiate_node( 40 | key="index", func=self.build_index, backend="pandas", kwargs=dict(path=self.unzip_path, metatdata=metadata), 41 | ) 42 | 43 | # Build the indexes 44 | self.instantiate_node( 45 | key="predefined", 46 | func=self.build_predefined, 47 | backend="pandas", 48 | kwargs=dict(path=self.unzip_path, metatdata=metadata), 49 | ) 50 | 51 | data_partitions = get_ancestral_metadata(self, "data_partitions") 52 | 53 | self.instantiate_node( 54 | key="loso", 55 | func=self.build_loso, 56 | backend="pandas", 57 | kwargs=dict(index=index, columns=data_partitions["loso"]), 58 | ) 59 | 60 | self.instantiate_node( 61 | key="deployable", 62 | func=self.build_deployable, 63 | backend="pandas", 64 | kwargs=dict(index=index, columns=data_partitions["deployable"]), 65 | ) 66 | 67 | tasks = get_ancestral_metadata(self, "tasks") 68 | for task in tasks: 69 | self.instantiate_node( 70 | key=task, 71 | func=self.build_label, 72 | backend="pandas", 73 | kwargs=dict(path=self.unzip_path, task=task, inv_lookup=self.meta.inv_lookup[task], metatdata=metadata), 74 | ) 75 | 76 | # Build list of outputs 77 | for placement_modality in self.meta["sources"]: 78 | loc = placement_modality["loc"] 79 | mod = placement_modality["mod"] 80 | 81 | self.instantiate_node( 82 | key=f"{loc=}-{mod=}", 83 | func=self.build_data, 84 | backend="numpy", 85 | kwargs=dict(loc=loc, mod=mod, metadata=metadata), 86 | ) 87 | 88 | def build_label(self, *args, **kwargs): 89 | raise NotImplementedError 90 | 91 | def build_index(self, *args, **kwargs): 92 | raise NotImplementedError 93 | 94 | def build_data(self, loc, mod, *args, **kwargs): 95 | raise NotImplementedError 96 | 97 | def build_predefined(self, *args, **kwargs): 98 | raise NotImplementedError 99 | 100 | def build_deployable(self, index, columns): 101 | splits = pd.DataFrame({"fold_0": ["train"] * len(index)}, dtype="category") 102 | validate_split_names(split_name="deployable", split_df=splits, split_cols=columns) 103 | return splits 104 | 105 | def build_loso(self, index, columns): 106 | splits = pd.DataFrame( 107 | { 108 | f"fold_{ki}": index.trial.apply(lambda tt: ["train", "test"][tt == kk]) 109 | for ki, kk in enumerate(index.trial.unique()) 110 | } 111 | ) 112 | validate_split_names(split_name="loso", split_df=splits, split_cols=columns) 113 | return splits 114 | -------------------------------------------------------------------------------- /src/utils/loaders.py: -------------------------------------------------------------------------------- 1 | from os import environ 2 | from pathlib import Path 3 | 4 | import pandas as pd 5 | import yaml 6 | from dotenv import find_dotenv 7 | from dotenv import load_dotenv 8 | from loguru import logger 9 | 10 | 11 | __all__ = [ 12 | # Generic 13 | "load_csv_data", 14 | "load_metadata", 15 | "build_path", 16 | "get_yaml_file_list", 17 | "iter_dataset_paths", 18 | "iter_task_paths", 19 | "metadata_path", 20 | # Metadata 
loaders 21 | "load_task_metadata", 22 | "load_modality_metadata", 23 | # Module importers 24 | "dataset_importer", 25 | "transformer_importer", 26 | "feature_importer", 27 | "pipeline_importer", 28 | "model_importer", 29 | "visualisation_importer", 30 | "load_yaml", 31 | ] 32 | 33 | 34 | def get_env(key) -> Path: 35 | load_dotenv(find_dotenv()) 36 | return Path(environ[key]) 37 | 38 | 39 | # Root directory of the project 40 | def get_project_root() -> Path: 41 | return get_env("PROJECT_ROOT") 42 | 43 | 44 | # For building file structure 45 | def build_path(*args) -> Path: 46 | path = get_env("DATA_ROOT").joinpath(*args) 47 | return path 48 | 49 | 50 | def metadata_path(*args) -> Path: 51 | path = get_project_root() / "metadata" 52 | return path.joinpath(*args) 53 | 54 | 55 | # Generic CSV loader 56 | def load_csv_data(fname, astype="list"): 57 | df = pd.read_csv(fname, delim_whitespace=True, header=None) 58 | 59 | if astype in {"dataframe", "pandas", "pd"}: 60 | return df 61 | if astype in {"values", "np", "numpy"}: 62 | return df.values 63 | if astype == "list": 64 | return df.values.ravel().tolist() 65 | 66 | logger.exception(ValueError(f"Un-implemented type specification: {astype}")) 67 | 68 | 69 | # YAML file loaders 70 | def iter_files(path, suffix, stem=False): 71 | fil_iter = filter(lambda fil: fil.suffix == suffix, path.iterdir()) 72 | if stem: 73 | yield from map(lambda fil: fil.stem, fil_iter) 74 | yield from map(lambda fil: path / fil, fil_iter) 75 | 76 | 77 | def iter_dataset_paths(): 78 | return iter_files(path=metadata_path("datasets"), suffix=".yaml", stem=False) 79 | 80 | 81 | def iter_task_paths(): 82 | return iter_files(path=metadata_path("tasks"), suffix=".yaml", stem=False) 83 | 84 | 85 | def load_yaml(filename): 86 | with open(filename, "r") as fil: 87 | return yaml.load(fil, Loader=yaml.SafeLoader) 88 | 89 | 90 | # Metadata 91 | def load_metadata(*args): 92 | return load_yaml(metadata_path(*args)) 93 | 94 | 95 | def load_task_metadata(task_name): 96 | return load_metadata("task", f"{task_name}.yaml") 97 | 98 | 99 | # Dataset metadata 100 | def load_modality_metadata(): 101 | return load_metadata("modality.yaml") 102 | 103 | 104 | # 105 | 106 | 107 | def get_yaml_file_list(*args, stem=False): 108 | path = metadata_path(*args) 109 | fil_iter = iter_files(path=path, suffix=".yaml", stem=stem) 110 | return list(fil_iter) 111 | 112 | 113 | # Module importers 114 | 115 | 116 | def module_importer(module_path, class_name, *args, **kwargs): 117 | """ 118 | 119 | Args: 120 | module_path: 121 | class_name: 122 | *args: 123 | **kwargs: 124 | 125 | Returns: 126 | 127 | """ 128 | m = __import__(module_path, fromlist=[class_name]) 129 | c = getattr(m, class_name) 130 | return c(*args, **kwargs) 131 | 132 | 133 | def dataset_importer(class_name, *args, **kwargs): 134 | """ 135 | 136 | Args: 137 | class_name: 138 | *args: 139 | **kwargs: 140 | 141 | Returns: 142 | 143 | """ 144 | return module_importer(module_path="src.datasets", class_name=class_name, *args, **kwargs) 145 | 146 | 147 | def feature_importer(class_name, *args, **kwargs): 148 | """ 149 | 150 | Args: 151 | class_name: 152 | *args: 153 | **kwargs: 154 | 155 | Returns: 156 | 157 | """ 158 | return module_importer(module_path="src.features", class_name=class_name, *args, **kwargs) 159 | 160 | 161 | def transformer_importer(class_name, *args, **kwargs): 162 | """ 163 | 164 | Args: 165 | class_name: 166 | *args: 167 | **kwargs: 168 | 169 | Returns: 170 | 171 | """ 172 | return module_importer(module_path="src.transformers", 
class_name=class_name, *args, **kwargs) 173 | 174 | 175 | def pipeline_importer(class_name, *args, **kwargs): 176 | """ 177 | 178 | Args: 179 | class_name: 180 | *args: 181 | **kwargs: 182 | 183 | Returns: 184 | 185 | """ 186 | return module_importer(module_path="src.pipelines", class_name=class_name, *args, **kwargs) 187 | 188 | 189 | def model_importer(class_name, *args, **kwargs): 190 | """ 191 | 192 | Args: 193 | class_name: 194 | *args: 195 | **kwargs: 196 | 197 | Returns: 198 | 199 | """ 200 | return module_importer(module_path="src.models", class_name=class_name, *args, **kwargs) 201 | 202 | 203 | def visualisation_importer(class_name, *args, **kwargs): 204 | """ 205 | 206 | Args: 207 | class_name: 208 | *args: 209 | **kwargs: 210 | 211 | Returns: 212 | 213 | """ 214 | return module_importer(module_path="src.visualisations", class_name=class_name, *args, **kwargs) 215 | -------------------------------------------------------------------------------- /src/features/statistical_features_impl.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.signal 3 | import scipy.stats 4 | from spectrum import arburg 5 | 6 | 7 | __all__ = [ 8 | "mad", 9 | "sma", 10 | "energy", 11 | "autoreg", 12 | "corr", 13 | "td_entropy", 14 | "fd_entropy", 15 | "mean_freq", 16 | "bands_energy", 17 | "t_feat", 18 | "f_feat", 19 | ] 20 | 21 | """ 22 | TIME DOMAIN FEATURES 23 | """ 24 | 25 | 26 | def mad(data, axis): 27 | return np.median(np.abs(data), axis=axis) 28 | 29 | 30 | def sma(data, axis): 31 | return np.abs(data).sum(tuple(np.arange(1, data.ndim)))[:, None] 32 | 33 | 34 | def energy(data, axis): 35 | return np.power(data, 2).mean(axis=axis) 36 | 37 | 38 | def autoreg(data, axis): 39 | def _autoreg(datum): 40 | order = 4 41 | try: 42 | coef, _, _ = arburg(datum, order) 43 | coef = coef.real.tolist() 44 | except ValueError: 45 | coef = [0] * order 46 | return coef 47 | 48 | ar = np.asarray([[_autoreg(data[jj, :, ii]) for ii in range(data.shape[2])] for jj in range(data.shape[0])]) 49 | 50 | return ar.reshape(ar.shape[0], -1) 51 | 52 | 53 | def corr(data, axis): 54 | inds = np.tril_indices(3, k=-1) 55 | cor = np.asarray([np.corrcoef(datum.T)[inds] for datum in data]) 56 | return cor 57 | 58 | 59 | def td_entropy(data, axis, bins=16): 60 | bins = np.linspace(-4, 4, bins) 61 | 62 | def _td_entropy(datum): 63 | ent = [] 64 | for ci in range(datum.shape[1]): 65 | pp, bb = np.histogram(datum[:, ci], bins, density=True) 66 | ent.append(scipy.stats.entropy(pp * (bb[1:] - bb[:-1]), base=2)) 67 | return ent 68 | 69 | H = np.asarray([_td_entropy(datum) for datum in data]) 70 | 71 | return H 72 | 73 | 74 | """ 75 | FREQUENCY DOMAIN FEATURES 76 | """ 77 | 78 | 79 | def fd_entropy(psd, axis, td=False): 80 | H = scipy.stats.entropy((psd / psd.sum(axis=axis)[:, None, :]).transpose(1, 0, 2), base=2) 81 | return H 82 | 83 | 84 | def mean_freq(freq, spec, axis): 85 | return (spec * freq[None, :, None]).sum(axis=axis) 86 | 87 | 88 | def bands_energy(freq, spec, axis): 89 | # Based on window of 2.56 seconds sampled at 50 Hz: 128 samples 90 | orig_freqs = np.fft.fftfreq(128, 1 / 50)[:64] 91 | orig_band_inds = np.asarray( 92 | [ 93 | orig_freqs[[ll - 1, uu - 1]] 94 | for ll, uu in [ 95 | [1, 8], 96 | [9, 16], 97 | [17, 24], 98 | [25, 32], 99 | [33, 40], 100 | [41, 48], 101 | [49, 56], 102 | [57, 64], 103 | [1, 16], 104 | [17, 32], 105 | [22, 48], 106 | [49, 64], 107 | [1, 24], 108 | [25, 48], 109 | ] 110 | ] 111 | ) 112 | 113 | # Generate the inds 114 | bands 
= np.asarray([(freq > ll) & (freq <= uu) for ll, uu in orig_band_inds]).T 115 | 116 | # Compute the sum with tensor multiplication 117 | band_energy = np.einsum("ijk,kl->ijl", spec.transpose(0, 2, 1), bands).transpose(0, 2, 1) 118 | band_energy = band_energy.reshape(band_energy.shape[0], -1) 119 | 120 | return band_energy 121 | 122 | 123 | def add_magnitude(data): 124 | assert isinstance(data, np.ndarray) 125 | return np.concatenate((data, np.sqrt(np.power(data, 2).sum(axis=2, keepdims=True)) - 1), axis=-1) 126 | 127 | 128 | """ 129 | Time and frequency feature interfaces 130 | """ 131 | 132 | 133 | def t_feat(data): 134 | data = add_magnitude(data) 135 | features = [ 136 | f(data, axis=1) 137 | for f in [ 138 | np.mean, # 3 (cumsum: 3) 139 | np.std, # 3 (cumsum: 6) 140 | mad, # 3 (cumsum: 9) 141 | np.max, # 3 (cumsum: 12) 142 | np.min, # 3 (cumsum: 15) 143 | sma, # 1 --- (cumsum: 16) 144 | energy, # 3 --- (cumsum: 19) 145 | scipy.stats.iqr, # 3 (cumsum: 22) 146 | td_entropy, # 3 (cumsum: 25) 147 | # autoreg, # 12 (cumsum: 37) 148 | corr, # 3 (cumsum: 40) 149 | ] 150 | ] 151 | 152 | feats = np.concatenate(features, axis=1) 153 | return feats 154 | 155 | 156 | def f_feat(data, fs): 157 | data = add_magnitude(data) 158 | 159 | freq, spec = scipy.signal.periodogram(data, fs=fs, axis=1) 160 | spec_normed = spec / spec.sum(axis=1)[:, None, :] 161 | 162 | features = [ 163 | f(spec, axis=1) 164 | for f in [ 165 | np.mean, # 3 (cumsum: 3) 166 | np.std, # 3 (cumsum: 6) 167 | mad, # 3 (cumsum: 9) 168 | np.max, # 3 (cumsum: 12) 169 | np.min, # 3 (cumsum: 15) 170 | sma, # 1 (cumsum: 16) 171 | energy, # 3 (cumsum: 19) 172 | scipy.stats.iqr, # 3 (cumsum: 22) 173 | fd_entropy, # 3 (cumsum: 25) 174 | np.argmax, # 3 (cumsum: 28) 175 | scipy.stats.skew, # 3 (cumsum: 31) 176 | scipy.stats.kurtosis, # 3 (cumsum: 34) 177 | ] 178 | ] 179 | 180 | features += [ 181 | mean_freq(freq, spec_normed, axis=1), # 3 (cumsum: 37) 182 | bands_energy(freq, spec_normed, axis=1), # 42 (cumsum: 79) (not on mag) 183 | ] 184 | 185 | feats = np.concatenate(features, axis=1) 186 | return feats 187 | -------------------------------------------------------------------------------- /src/transformers/window.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from loguru import logger 4 | 5 | from src.base import get_ancestral_metadata 6 | from src.utils.decorators import PartitionByTrial 7 | 8 | 9 | __all__ = [ 10 | "window", 11 | ] 12 | 13 | 14 | def window_data(index, data, fs, win_len, win_inc): 15 | assert data.ndim == 2 16 | 17 | win_len = int(win_len * fs) 18 | win_inc = int(win_inc * fs) 19 | 20 | data_windowed = sliding_window_rect(data, win_len, win_inc) 21 | 22 | if data.shape[0] // win_len == 0: 23 | raise ValueError 24 | elif data.shape[0] // win_len == 1: 25 | return data_windowed[None, ...] 
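    # Note: sliding_window() drops singleton axes, so the branch below restores the
    # channel axis for single-channel inputs, keeping every output three-dimensional
    # with the (n_windows, win_len, n_channels) layout.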
26 | elif data_windowed.ndim == 2: 27 | return data_windowed[..., None] 28 | assert data_windowed.ndim == 3 29 | return data_windowed 30 | 31 | 32 | def window_index(index, data, fs, win_len, win_inc, slice_at="middle"): 33 | assert isinstance(data, pd.DataFrame) 34 | data_windowed = window_data(index=index, data=data.values, fs=fs, win_len=win_len, win_inc=win_inc) 35 | ind = dict(start=0, middle=data_windowed.shape[1] // 2, end=-1)[slice_at] 36 | df = pd.DataFrame(data_windowed[:, ind, :], columns=data.columns) 37 | df = df.astype(data.dtypes) 38 | return df 39 | 40 | 41 | def window(parent, win_len, win_inc): 42 | root = parent / f"{win_len=:03.2f}-{win_inc=:03.2f}" 43 | 44 | fs = get_ancestral_metadata(root, "fs") 45 | 46 | kwargs = dict(index=parent.index["index"], win_len=win_len, win_inc=win_inc, fs=fs) 47 | 48 | # Build index outputs 49 | for key, node in parent.index.items(): 50 | root.instantiate_node( 51 | key=key, func=PartitionByTrial(window_index), kwargs=dict(data=node, **kwargs), backend="pandas" 52 | ) 53 | 54 | # Build Data outputs 55 | for key, node in parent.outputs.items(): 56 | root.instantiate_node( 57 | key=key, func=PartitionByTrial(window_data), kwargs=dict(data=node, **kwargs), backend="none", 58 | ) 59 | 60 | return root 61 | 62 | 63 | def norm_shape(shape): 64 | try: 65 | i = int(shape) 66 | return (i,) 67 | except TypeError: 68 | # shape was not a number 69 | pass 70 | 71 | try: 72 | t = tuple(shape) 73 | return t 74 | except TypeError: 75 | # shape was not iterable 76 | pass 77 | 78 | logger.exception(TypeError("shape must be an int, or a tuple of ints")) 79 | 80 | 81 | def sliding_window(a, ws, ss=None, flatten=True): 82 | """ 83 | based on: https://stackoverflow.com/questions/22685274 84 | 85 | Return a sliding window over a in any number of dimensions 86 | 87 | Parameters 88 | ---------- 89 | a : ndarray 90 | an n-dimensional numpy array 91 | ws : int, tuple 92 | an int (a is 1D) or tuple (a is 2D or greater) representing the size of 93 | each dimension of the window 94 | ss : int, tuple 95 | an int (a is 1D) or tuple (a is 2D or greater) representing the amount 96 | to slide the window in each dimension. If not specified, it defaults to ws. 97 | flatten : book 98 | if True, all slices are flattened, otherwise, there is an extra dimension 99 | for each dimension of the input. 100 | 101 | Returns 102 | ------- 103 | strided : ndarray 104 | an array containing each n-dimensional window from a 105 | """ 106 | 107 | if None is ss: 108 | # ss was not provided. the windows will not overlap in any direction. 109 | ss = ws 110 | ws = norm_shape(ws) 111 | ss = norm_shape(ss) 112 | 113 | # convert ws, ss, and a.shape to numpy arrays so that we can do math in every 114 | # dimension at once. 115 | ws = np.array(ws) 116 | ss = np.array(ss) 117 | shape = np.array(a.shape) 118 | 119 | # ensure that ws, ss, and a.shape all have the same number of dimensions 120 | ls = [len(shape), len(ws), len(ss)] 121 | if 1 != len(set(ls)): 122 | logger.exception(ValueError(f"a.shape, ws and ss must all have the same length. They were {ls}")) 123 | 124 | # ensure that ws is smaller than a in every dimension 125 | if np.any(ws > shape): 126 | logger.exception( 127 | ValueError( 128 | f"ws cannot be larger than a in any dimension. a.shape was %s and " "ws was {(str(a.shape), str(ws))}" 129 | ) 130 | ) 131 | 132 | # how many slices will there be in each dimension? 
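    # For example, with a.shape == (1000, 3), ws == (150, 3) and ss == (50, 3):
    # ((shape - ws) // ss) + 1 -> (18, 1), the strided view is built with shape
    # (18, 1, 150, 3), and the flatten step below collapses it to (18, 150, 3).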
133 | newshape = norm_shape(((shape - ws) // ss) + 1) 134 | # the shape of the strided array will be the number of slices in each dimension 135 | # plus the shape of the window (tuple addition) 136 | newshape += norm_shape(ws) 137 | # the strides tuple will be the array's strides multiplied by step size, plus 138 | # the array's strides (tuple addition) 139 | newstrides = norm_shape(np.array(a.strides) * ss) + a.strides 140 | strided = np.lib.stride_tricks.as_strided(a, shape=newshape, strides=newstrides) 141 | if not flatten: 142 | return strided 143 | 144 | # Collapse strided so that it has one more dimension than the window. I.e., 145 | # the new array is a flat list of slices. 146 | meat = len(ws) if ws.shape else 0 147 | firstdim = (np.product(newshape[:-meat]),) if ws.shape else () 148 | dim = firstdim + (newshape[-meat:]) 149 | dim = list(filter(lambda i: i != 1, dim)) 150 | 151 | return strided.reshape(dim) 152 | 153 | 154 | def sliding_window_rect(data, length, increment): 155 | length = (length, data.shape[1]) 156 | increment = (increment, data.shape[1]) 157 | 158 | return sliding_window(data, length, increment) 159 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: clean data lint requirements sync_data_to_s3 sync_data_from_s3 2 | 3 | ################################################################################# 4 | # GLOBALS # 5 | ################################################################################# 6 | 7 | PROJECT_DIR := $(shell dirname $(realpath $(lastword $(MAKEFILE_LIST)))) 8 | BUCKET = [OPTIONAL] your-bucket-for-syncing-data (do not include 's3://') 9 | PROFILE = default 10 | PROJECT_NAME = har_datasets 11 | PYTHON_INTERPRETER = pipenv run 12 | MAKE = /usr/bin/make 13 | 14 | ifeq (,$(shell which conda)) 15 | HAS_CONDA=False 16 | else 17 | HAS_CONDA=True 18 | endif 19 | 20 | ################################################################################# 21 | # COMMANDS # 22 | ################################################################################# 23 | 24 | ## Install Python Dependencies 25 | docs: 26 | ifeq (True,$(HAS_CONDA)) 27 | conda install --file requirements.txt 28 | else 29 | pip install -r requirements.txt 30 | endif 31 | 32 | ## Install Python Dependencies 33 | requirements: 34 | ifeq (True,$(HAS_CONDA)) 35 | conda install --file requirements.txt 36 | else 37 | pip install -r requirements.txt 38 | endif 39 | 40 | ## Make Dataset Table 41 | tables: 42 | $(PYTHON_INTERPRETER) make_tables.py 43 | 44 | ## Download the raw data 45 | download: 46 | $(PYTHON_INTERPRETER) make_download.py 47 | 48 | ## Delete all compiled Python files 49 | clean: 50 | find . -name "*.pyc" -delete 51 | 52 | ## Lint using flake8 53 | lint: 54 | ## F401 module imported but unused 55 | ## F403 ‘from module import *’ used; unable to detect undefined names 56 | ## W293 blank line contains whitespace 57 | flake8 . 
\ 58 | --exclude=data/,docs/,*__init__.py \ 59 | --ignore=W293,F401,F403 \ 60 | --max-line-length=120 61 | 62 | ## Upload Data to S3 63 | sync_data_to_s3: 64 | ifeq (default,$(PROFILE)) 65 | aws s3 sync data/ s3://$(BUCKET)/data/ 66 | else 67 | aws s3 sync data/ s3://$(BUCKET)/data/ --profile $(PROFILE) 68 | endif 69 | 70 | ## Download Data from S3 71 | sync_data_from_s3: 72 | ifeq (default,$(PROFILE)) 73 | aws s3 sync s3://$(BUCKET)/data/ data/ 74 | else 75 | aws s3 sync s3://$(BUCKET)/data/ data/ --profile $(PROFILE) 76 | endif 77 | 78 | ## Set up python interpreter environment 79 | create_environment: 80 | ifeq (True,$(HAS_CONDA)) 81 | @echo ">>> Detected conda, creating conda environment." 82 | ifeq (3,$(findstring 3,$(PYTHON_INTERPRETER))) 83 | conda create --name $(PROJECT_NAME) python=3 84 | else 85 | conda create --name $(PROJECT_NAME) python=2.7 86 | endif 87 | @echo ">>> New conda env created. Activate with:\nsource activate $(PROJECT_NAME)" 88 | else 89 | @pip install -q virtualenv virtualenvwrapper 90 | @echo ">>> Installing virtualenvwrapper if not already intalled.\nMake sure the following lines are in shell startup file\n\ 91 | export WORKON_HOME=$$HOME/.virtualenvs\nexport PROJECT_HOME=$$HOME/Devel\nsource /usr/local/bin/virtualenvwrapper.sh\n" 92 | @bash -c "source `which virtualenvwrapper.sh`;mkvirtualenv $(PROJECT_NAME) --python=$(PYTHON_INTERPRETER)" 93 | @echo ">>> New virtualenv created. Activate with:\nworkon $(PROJECT_NAME)" 94 | endif 95 | 96 | ## Test python environment is setup correctly 97 | test_environment: 98 | $(PYTHON_INTERPRETER) test_environment.py 99 | 100 | ################################################################################# 101 | # PROJECT RULES # 102 | ################################################################################# 103 | 104 | 105 | 106 | ################################################################################# 107 | # Self Documenting Commands # 108 | ################################################################################# 109 | 110 | .DEFAULT_GOAL := show-help 111 | 112 | # Inspired by 113 | # sed script explained: 114 | # /^##/: 115 | # * save line in hold space 116 | # * purge line 117 | # * Loop: 118 | # * append newline + line to hold space 119 | # * go to next line 120 | # * if line starts with doc comment, strip comment character off and loop 121 | # * remove target prerequisites 122 | # * append hold space (+ newline) to line 123 | # * replace newline plus comments by `---` 124 | # * print line 125 | # Separate expressions are necessary because labels cannot be delimited by 126 | # semicolon; see 127 | .PHONY: show-help 128 | show-help: 129 | @echo "$$(tput bold)Available rules:$$(tput sgr0)" 130 | @echo 131 | @sed -n -e "/^## / { \ 132 | h; \ 133 | s/.*//; \ 134 | :doc" \ 135 | -e "H; \ 136 | n; \ 137 | s/^## //; \ 138 | t doc" \ 139 | -e "s/:.*//; \ 140 | G; \ 141 | s/\\n## /---/; \ 142 | s/\\n/ /g; \ 143 | p; \ 144 | }" ${MAKEFILE_LIST} \ 145 | | LC_ALL='C' sort --ignore-case \ 146 | | awk -F '---' \ 147 | -v ncol=$$(tput cols) \ 148 | -v indent=19 \ 149 | -v col_on="$$(tput setaf 6)" \ 150 | -v col_off="$$(tput sgr0)" \ 151 | '{ \ 152 | printf "%s%*s%s ", col_on, -indent, $$1, col_off; \ 153 | n = split($$2, words, " "); \ 154 | line_length = ncol - indent; \ 155 | for (i = 1; i <= n; i++) { \ 156 | line_length -= length(words[i]) + 1; \ 157 | if (line_length <= 0) { \ 158 | line_length = ncol - indent - length(words[i]) - 1; \ 159 | printf "\n%*s ", -indent, " "; \ 160 | } \ 161 | printf 
"%s ", words[i]; \ 162 | } \ 163 | printf "\n"; \ 164 | }' \ 165 | | more $(shell test $(shell uname) = Darwin && echo '--no-init --raw-control-chars') 166 | -------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | REM Command file for Sphinx documentation 4 | 5 | if "%SPHINXBUILD%" == "" ( 6 | set SPHINXBUILD=sphinx-build 7 | ) 8 | set BUILDDIR=_build 9 | set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . 10 | set I18NSPHINXOPTS=%SPHINXOPTS% . 11 | if NOT "%PAPER%" == "" ( 12 | set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% 13 | set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% 14 | ) 15 | 16 | if "%1" == "" goto help 17 | 18 | if "%1" == "help" ( 19 | :help 20 | echo.Please use `make ^` where ^ is one of 21 | echo. html to make standalone HTML files 22 | echo. dirhtml to make HTML files named index.html in directories 23 | echo. singlehtml to make a single large HTML file 24 | echo. pickle to make pickle files 25 | echo. json to make JSON files 26 | echo. htmlhelp to make HTML files and a HTML help project 27 | echo. qthelp to make HTML files and a qthelp project 28 | echo. devhelp to make HTML files and a Devhelp project 29 | echo. epub to make an epub 30 | echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter 31 | echo. text to make text files 32 | echo. man to make manual pages 33 | echo. texinfo to make Texinfo files 34 | echo. gettext to make PO message catalogs 35 | echo. changes to make an overview over all changed/added/deprecated items 36 | echo. linkcheck to check all external links for integrity 37 | echo. doctest to run all doctests embedded in the documentation if enabled 38 | goto end 39 | ) 40 | 41 | if "%1" == "clean" ( 42 | for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i 43 | del /q /s %BUILDDIR%\* 44 | goto end 45 | ) 46 | 47 | if "%1" == "html" ( 48 | %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html 49 | if errorlevel 1 exit /b 1 50 | echo. 51 | echo.Build finished. The HTML pages are in %BUILDDIR%/html. 52 | goto end 53 | ) 54 | 55 | if "%1" == "dirhtml" ( 56 | %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml 57 | if errorlevel 1 exit /b 1 58 | echo. 59 | echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. 60 | goto end 61 | ) 62 | 63 | if "%1" == "singlehtml" ( 64 | %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml 65 | if errorlevel 1 exit /b 1 66 | echo. 67 | echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. 68 | goto end 69 | ) 70 | 71 | if "%1" == "pickle" ( 72 | %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle 73 | if errorlevel 1 exit /b 1 74 | echo. 75 | echo.Build finished; now you can process the pickle files. 76 | goto end 77 | ) 78 | 79 | if "%1" == "json" ( 80 | %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json 81 | if errorlevel 1 exit /b 1 82 | echo. 83 | echo.Build finished; now you can process the JSON files. 84 | goto end 85 | ) 86 | 87 | if "%1" == "htmlhelp" ( 88 | %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp 89 | if errorlevel 1 exit /b 1 90 | echo. 91 | echo.Build finished; now you can run HTML Help Workshop with the ^ 92 | .hhp project file in %BUILDDIR%/htmlhelp. 93 | goto end 94 | ) 95 | 96 | if "%1" == "qthelp" ( 97 | %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp 98 | if errorlevel 1 exit /b 1 99 | echo. 
100 | echo.Build finished; now you can run "qcollectiongenerator" with the ^ 101 | .qhcp project file in %BUILDDIR%/qthelp, like this: 102 | echo.^> qcollectiongenerator %BUILDDIR%\qthelp\har_datasets.qhcp 103 | echo.To view the help file: 104 | echo.^> assistant -collectionFile %BUILDDIR%\qthelp\har_datasets.ghc 105 | goto end 106 | ) 107 | 108 | if "%1" == "devhelp" ( 109 | %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp 110 | if errorlevel 1 exit /b 1 111 | echo. 112 | echo.Build finished. 113 | goto end 114 | ) 115 | 116 | if "%1" == "epub" ( 117 | %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub 118 | if errorlevel 1 exit /b 1 119 | echo. 120 | echo.Build finished. The epub file is in %BUILDDIR%/epub. 121 | goto end 122 | ) 123 | 124 | if "%1" == "latex" ( 125 | %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex 126 | if errorlevel 1 exit /b 1 127 | echo. 128 | echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. 129 | goto end 130 | ) 131 | 132 | if "%1" == "text" ( 133 | %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text 134 | if errorlevel 1 exit /b 1 135 | echo. 136 | echo.Build finished. The text files are in %BUILDDIR%/text. 137 | goto end 138 | ) 139 | 140 | if "%1" == "man" ( 141 | %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man 142 | if errorlevel 1 exit /b 1 143 | echo. 144 | echo.Build finished. The manual pages are in %BUILDDIR%/man. 145 | goto end 146 | ) 147 | 148 | if "%1" == "texinfo" ( 149 | %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo 150 | if errorlevel 1 exit /b 1 151 | echo. 152 | echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. 153 | goto end 154 | ) 155 | 156 | if "%1" == "gettext" ( 157 | %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale 158 | if errorlevel 1 exit /b 1 159 | echo. 160 | echo.Build finished. The message catalogs are in %BUILDDIR%/locale. 161 | goto end 162 | ) 163 | 164 | if "%1" == "changes" ( 165 | %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes 166 | if errorlevel 1 exit /b 1 167 | echo. 168 | echo.The overview file is in %BUILDDIR%/changes. 169 | goto end 170 | ) 171 | 172 | if "%1" == "linkcheck" ( 173 | %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck 174 | if errorlevel 1 exit /b 1 175 | echo. 176 | echo.Link check complete; look for any errors in the above output ^ 177 | or in %BUILDDIR%/linkcheck/output.txt. 178 | goto end 179 | ) 180 | 181 | if "%1" == "doctest" ( 182 | %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest 183 | if errorlevel 1 exit /b 1 184 | echo. 185 | echo.Testing of doctests in the sources finished, look at the ^ 186 | results in %BUILDDIR%/doctest/output.txt. 187 | goto end 188 | ) 189 | 190 | :end 191 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | PAPER = 8 | BUILDDIR = _build 9 | 10 | # Internal variables. 11 | PAPEROPT_a4 = -D latex_paper_size=a4 12 | PAPEROPT_letter = -D latex_paper_size=letter 13 | ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 14 | # the i18n builder cannot share the environment and doctrees with the others 15 | I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
16 | 17 | .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext 18 | 19 | help: 20 | @echo "Please use \`make ' where is one of" 21 | @echo " html to make standalone HTML files" 22 | @echo " dirhtml to make HTML files named index.html in directories" 23 | @echo " singlehtml to make a single large HTML file" 24 | @echo " pickle to make pickle files" 25 | @echo " json to make JSON files" 26 | @echo " htmlhelp to make HTML files and a HTML help project" 27 | @echo " qthelp to make HTML files and a qthelp project" 28 | @echo " devhelp to make HTML files and a Devhelp project" 29 | @echo " epub to make an epub" 30 | @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" 31 | @echo " latexpdf to make LaTeX files and run them through pdflatex" 32 | @echo " text to make text files" 33 | @echo " man to make manual pages" 34 | @echo " texinfo to make Texinfo files" 35 | @echo " info to make Texinfo files and run them through makeinfo" 36 | @echo " gettext to make PO message catalogs" 37 | @echo " changes to make an overview of all changed/added/deprecated items" 38 | @echo " linkcheck to check all external links for integrity" 39 | @echo " doctest to run all doctests embedded in the documentation (if enabled)" 40 | 41 | clean: 42 | -rm -rf $(BUILDDIR)/* 43 | 44 | html: 45 | $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html 46 | @echo 47 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." 48 | 49 | dirhtml: 50 | $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml 51 | @echo 52 | @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." 53 | 54 | singlehtml: 55 | $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml 56 | @echo 57 | @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 58 | 59 | pickle: 60 | $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle 61 | @echo 62 | @echo "Build finished; now you can process the pickle files." 63 | 64 | json: 65 | $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json 66 | @echo 67 | @echo "Build finished; now you can process the JSON files." 68 | 69 | htmlhelp: 70 | $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp 71 | @echo 72 | @echo "Build finished; now you can run HTML Help Workshop with the" \ 73 | ".hhp project file in $(BUILDDIR)/htmlhelp." 74 | 75 | qthelp: 76 | $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp 77 | @echo 78 | @echo "Build finished; now you can run "qcollectiongenerator" with the" \ 79 | ".qhcp project file in $(BUILDDIR)/qthelp, like this:" 80 | @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/har_datasets.qhcp" 81 | @echo "To view the help file:" 82 | @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/har_datasets.qhc" 83 | 84 | devhelp: 85 | $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp 86 | @echo 87 | @echo "Build finished." 88 | @echo "To view the help file:" 89 | @echo "# mkdir -p $$HOME/.local/share/devhelp/har_datasets" 90 | @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/har_datasets" 91 | @echo "# devhelp" 92 | 93 | epub: 94 | $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub 95 | @echo 96 | @echo "Build finished. The epub file is in $(BUILDDIR)/epub." 97 | 98 | latex: 99 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 100 | @echo 101 | @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." 
102 | @echo "Run \`make' in that directory to run these through (pdf)latex" \ 103 | "(use \`make latexpdf' here to do that automatically)." 104 | 105 | latexpdf: 106 | $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex 107 | @echo "Running LaTeX files through pdflatex..." 108 | $(MAKE) -C $(BUILDDIR)/latex all-pdf 109 | @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 110 | 111 | text: 112 | $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text 113 | @echo 114 | @echo "Build finished. The text files are in $(BUILDDIR)/text." 115 | 116 | man: 117 | $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man 118 | @echo 119 | @echo "Build finished. The manual pages are in $(BUILDDIR)/man." 120 | 121 | texinfo: 122 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 123 | @echo 124 | @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." 125 | @echo "Run \`make' in that directory to run these through makeinfo" \ 126 | "(use \`make info' here to do that automatically)." 127 | 128 | info: 129 | $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo 130 | @echo "Running Texinfo files through makeinfo..." 131 | make -C $(BUILDDIR)/texinfo info 132 | @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." 133 | 134 | gettext: 135 | $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale 136 | @echo 137 | @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." 138 | 139 | changes: 140 | $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes 141 | @echo 142 | @echo "The overview file is in $(BUILDDIR)/changes." 143 | 144 | linkcheck: 145 | $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck 146 | @echo 147 | @echo "Link check complete; look for any errors in the above output " \ 148 | "or in $(BUILDDIR)/linkcheck/output.txt." 149 | 150 | doctest: 151 | $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest 152 | @echo "Testing of doctests in the sources finished, look at the " \ 153 | "results in $(BUILDDIR)/doctest/output.txt." 
154 | -------------------------------------------------------------------------------- /src/models/base.py: -------------------------------------------------------------------------------- 1 | from typing import Any 2 | from typing import Dict 3 | from typing import Optional 4 | 5 | import numpy as np 6 | import pandas as pd 7 | from loguru import logger 8 | from mldb import NodeWrapper 9 | from sklearn.base import BaseEstimator 10 | from sklearn.model_selection import GridSearchCV 11 | from sklearn.model_selection import GroupKFold 12 | 13 | from src.base import ExecutionGraph 14 | from src.evaluation.classification import evaluate_data_split 15 | 16 | __all__ = ["instantiate_and_fit", "ClassifierWrapper", "BasicScorer"] 17 | 18 | 19 | def instantiate_and_fit( 20 | index: pd.DataFrame, 21 | fold: pd.DataFrame, 22 | X: np.ndarray, 23 | y: pd.DataFrame, 24 | estimator: BaseEstimator, 25 | n_splits: int = 5, 26 | param_grid: Optional[Dict[str, Any]] = None, 27 | ) -> BaseEstimator: 28 | assert fold.shape[0] == index.shape[0] 29 | assert fold.shape[0] == X.shape[0] 30 | assert fold.shape[0] == y.shape[0] 31 | 32 | fold_vals = fold.ravel() 33 | 34 | train_inds = fold_vals == "train" 35 | val_inds = fold_vals == "val" 36 | 37 | if val_inds.sum(): 38 | raise NotImplementedError("Explicit validation indices not yet supported.") 39 | 40 | y = y.values.ravel() 41 | 42 | nan_row, nan_col = np.nonzero(np.isnan(X) | np.isinf(X)) 43 | if len(nan_row): 44 | logger.warning(f"Setting {len(nan_row)} NaN elements to zero before fitting {estimator}.") 45 | X[nan_row, nan_col] = 0 46 | 47 | logger.info(f"Fitting {estimator} on data (shape: {X.shape})") 48 | 49 | if param_grid is not None: 50 | group_k_fold = GroupKFold(n_splits=n_splits).split(X[train_inds], y[train_inds], index.trial.values[train_inds]) 51 | 52 | grid_search = GridSearchCV(estimator=estimator, param_grid=param_grid, verbose=10, cv=list(group_k_fold)) 53 | grid_search.fit(X[train_inds], y[train_inds]) 54 | 55 | return grid_search.best_estimator_ 56 | 57 | estimator.fit(X[train_inds], y[train_inds]) 58 | return estimator 59 | 60 | 61 | # noinspection PyPep8Naming 62 | class BasicScorer(object): 63 | def fit(self, estimator: Any, X: np.ndarray, y: np.ndarray): 64 | return estimator.fit(X, y) 65 | 66 | def score(self, estimator: Any, X: np.ndarray, y: np.ndarray): 67 | return estimator.score(X, y) 68 | 69 | def transform(self, estimator: Any, X: np.ndarray): 70 | return estimator.transform(X) 71 | 72 | def decision_function(self, estimator: Any, X: np.ndarray): 73 | return estimator.predict_proba(X) 74 | 75 | def predict(self, estimator: Any, X: np.ndarray): 76 | return estimator.predict(X) 77 | 78 | def predict_proba(self, estimator: Any, X: np.ndarray): 79 | return estimator.predict_proba(X) 80 | 81 | def predict_log_proba(self, estimator: Any, X: np.ndarray): 82 | return estimator.predict_proba(X) 83 | 84 | 85 | # noinspection PyPep8Naming 86 | class ClassifierWrapper(ExecutionGraph): 87 | def __init__( 88 | self, 89 | parent: ExecutionGraph, 90 | features: NodeWrapper, 91 | split: NodeWrapper, 92 | task: NodeWrapper, 93 | estimator: NodeWrapper, 94 | param_grid: Optional[Dict[str, Any]] = None, 95 | scorer: Optional[BasicScorer] = None, 96 | evaluate: bool = False, 97 | ): 98 | assert isinstance(parent, ExecutionGraph) 99 | assert isinstance(features, NodeWrapper) 100 | assert isinstance(split, NodeWrapper) 101 | assert isinstance(task, NodeWrapper) 102 | assert isinstance(estimator, NodeWrapper) 103 | 104 | super(ClassifierWrapper, 
self).__init__(parent=parent, name=f"estimator={str(estimator.name.name)}") 105 | 106 | self.features = features 107 | self.split = split 108 | self.task = task 109 | 110 | self.scorer = BasicScorer() if scorer is None else scorer 111 | 112 | model = self.instantiate_node( 113 | key="model", 114 | func=instantiate_and_fit, 115 | backend="sklearn", 116 | kwargs=dict( 117 | X=features, y=task, index=self["index"], fold=self.split, estimator=estimator, param_grid=param_grid, 118 | ), 119 | ) 120 | 121 | results = self.get_or_create( 122 | key="results", 123 | func=evaluate_data_split, 124 | backend="json", 125 | kwargs=dict(split=split, targets=task, estimator=model, prob_predictions=self.predict_proba(features)), 126 | ) 127 | 128 | if evaluate: 129 | self.dump_graph() 130 | model.evaluate() 131 | results.evaluate() 132 | 133 | @property 134 | def model(self): 135 | return self["model"] 136 | 137 | @property 138 | def results(self): 139 | return self["results"] 140 | 141 | def fit(self, X, y) -> NodeWrapper: 142 | logger.warning(f"it looks like you're attempting to re-fit a model on new data - is this the intent?") 143 | return self.instantiate_orphan_node(func=self.scorer.fit, kwargs=dict(estimator=self["model"], X=X, y=y),) 144 | 145 | def score(self, X, y) -> NodeWrapper: 146 | return self.instantiate_orphan_node(func=self.scorer.score, kwargs=dict(estimator=self["model"], X=X, y=y)) 147 | 148 | def transform(self, X) -> NodeWrapper: 149 | return self.instantiate_orphan_node(func=self.scorer.transform, kwargs=dict(estimator=self["model"], X=X)) 150 | 151 | def predict(self, X) -> NodeWrapper: 152 | return self.instantiate_orphan_node(func=self.scorer.predict, kwargs=dict(estimator=self["model"], X=X)) 153 | 154 | def decision_function(self, X) -> NodeWrapper: 155 | return self.instantiate_orphan_node( 156 | func=self.scorer.decision_function, kwargs=dict(estimator=self["model"], X=X) 157 | ) 158 | 159 | def predict_proba(self, X) -> NodeWrapper: 160 | return self.instantiate_orphan_node(func=self.scorer.predict_proba, kwargs=dict(estimator=self["model"], X=X)) 161 | 162 | def predict_log_proba(self, X) -> NodeWrapper: 163 | return self.instantiate_orphan_node( 164 | func=self.scorer.predict_log_proba, kwargs=dict(estimator=self["model"], X=X) 165 | ) 166 | -------------------------------------------------------------------------------- /src/utils/decorators.py: -------------------------------------------------------------------------------- 1 | from functools import partial 2 | from functools import update_wrapper 3 | 4 | import numpy as np 5 | import pandas as pd 6 | from loguru import logger 7 | from pandas.api.types import is_categorical_dtype 8 | from tqdm import tqdm 9 | 10 | from src.utils.exceptions import ModalityNotPresentError 11 | from src.utils.loaders import dataset_importer 12 | 13 | 14 | __all__ = [ 15 | "index_decorator", 16 | "fold_decorator", 17 | "label_decorator", 18 | "PartitionByTrial", 19 | "partitioning_decorator", 20 | ] 21 | 22 | 23 | class DecoratorBase(object): 24 | def __init__(self, func): 25 | update_wrapper(self, func) 26 | self.func = func 27 | 28 | def __get__(self, obj, objtype): 29 | return partial(self.__call__, obj) 30 | 31 | def __call__(self, *args, **kwargs): 32 | return self.func(*args, **kwargs) 33 | 34 | 35 | class LabelDecorator(DecoratorBase): 36 | def __init__(self, func): 37 | super(LabelDecorator, self).__init__(func) 38 | 39 | def __call__(self, *args, **kwargs): 40 | df = super(LabelDecorator, self).__call__(*args, **kwargs) 41 | 42 | # 
TODO/FIXME: remove this strange pattern 43 | if isinstance(df, tuple): 44 | inv_lookup, df = df 45 | df = pd.DataFrame(df) 46 | for ci in df.columns: 47 | df[ci] = df[ci].apply(lambda ll: inv_lookup[ll]) 48 | 49 | assert len(df.columns) == 1 50 | 51 | df = pd.DataFrame(df) 52 | df.columns = [f"target" for _ in range(len(df.columns))] 53 | if is_categorical_dtype(df["target"]): 54 | df = df.astype(dict(target="category")) 55 | 56 | return df 57 | 58 | 59 | class FoldDecorator(DecoratorBase): 60 | def __init__(self, func): 61 | super(FoldDecorator, self).__init__(func) 62 | 63 | def __call__(self, *args, **kwargs): 64 | df = super(FoldDecorator, self).__call__(*args, **kwargs) 65 | if isinstance(df.columns, pd.RangeIndex): 66 | df.columns = [f"fold_{fi}" for fi in range(len(df.columns))] 67 | df = df.astype({col: "category" for col in df.columns}) 68 | return df 69 | 70 | 71 | class IndexDecorator(DecoratorBase): 72 | def __init__(self, func): 73 | super(IndexDecorator, self).__init__(func) 74 | 75 | def __call__(self, *args, **kwargs): 76 | df = super(IndexDecorator, self).__call__(*args, **kwargs) 77 | df.columns = ["subject", "trial", "time"] 78 | return df.astype(dict(subject="category", trial="category", time=float)) 79 | 80 | 81 | def infer_data_type(data): 82 | if isinstance(data, np.ndarray): 83 | return "numpy" 84 | elif isinstance(data, pd.DataFrame): 85 | return "pandas" 86 | 87 | logger.exception( 88 | TypeError(f"Unsupported data type in infer_data_type ({type(data)}), currently only {{numpy, pandas}}") 89 | ) 90 | 91 | 92 | def slice_data_type(data, inds, data_type_name): 93 | if data_type_name == "numpy": 94 | return data[inds] 95 | elif data_type_name == "pandas": 96 | return data.loc[inds] 97 | 98 | logger.exception( 99 | TypeError(f"Unsupported data type in slice_data_type ({type(data)}), currently only {{numpy, pandas}}") 100 | ) 101 | 102 | 103 | def concat_data_type(datas, data_type_name): 104 | if data_type_name == "numpy": 105 | return np.concatenate(datas, axis=0) 106 | elif data_type_name == "pandas": 107 | df = pd.concat(datas, axis=0) 108 | return df.reset_index(drop=True) 109 | 110 | logger.exception( 111 | TypeError(f"Unsupported data type in concat_data_type ({type(datas)}), currently only {{numpy, pandas}}") 112 | ) 113 | 114 | 115 | class PartitionByTrial(DecoratorBase): 116 | """ 117 | 118 | """ 119 | 120 | def __init__(self, func): 121 | super(PartitionByTrial, self).__init__(func=func) 122 | 123 | def __call__(self, index, data, *args, **kwargs): 124 | if index.shape[0] != data.shape[0]: 125 | logger.exception( 126 | ValueError( 127 | f"The data and index should have the same length " 128 | "with index: {index.shape}; and data: {data.shape}" 129 | ) 130 | ) 131 | output = [] 132 | trials = index.trial.unique() 133 | data_type = infer_data_type(data) 134 | for trial in tqdm(trials): 135 | inds = index.trial == trial 136 | index_ = index.loc[inds] 137 | data_ = slice_data_type(data, inds, data_type) 138 | vals = self.func(index=index_, data=data_, *args, **kwargs) 139 | opdt = infer_data_type(vals) 140 | if opdt != data_type: 141 | logger.exception( 142 | ValueError( 143 | f"The data type of {self.func} should be the same as the input {data_type} " 144 | f"but instead got {opdt}" 145 | ) 146 | ) 147 | output.append(vals) 148 | return concat_data_type(output, data_type) 149 | 150 | 151 | class RequiredModalities(DecoratorBase): 152 | def __init__(self, func, *modalities): 153 | super(RequiredModalities, self).__init__(func=func) 154 | 155 | 
self.required_modalities = set(modalities) 156 | 157 | def __call__(self, dataset, *args, **kwargs): 158 | dataset = dataset_importer(dataset) 159 | dataset_modalities = dataset.meta.modalities 160 | for required_modality in self.required_modalities: 161 | if required_modality not in dataset_modalities: 162 | logger.exception( 163 | ModalityNotPresentError( 164 | f"The modality {required_modality} is required by the function {self.func}. " 165 | f"However, the dataset {dataset} does not have {required_modality}. The " 166 | f"available modalities are: {dataset_modalities})" 167 | ) 168 | ) 169 | 170 | super(self, RequiredModalities).__call__(dataset, *args, **kwargs) 171 | 172 | 173 | required_modalities = RequiredModalities 174 | 175 | 176 | label_decorator = LabelDecorator 177 | index_decorator = IndexDecorator 178 | fold_decorator = FoldDecorator 179 | partitioning_decorator = PartitionByTrial 180 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # har_datasets documentation build configuration file, created by 4 | # sphinx-quickstart. 5 | # 6 | # This file is execfile()d with the current directory set to its containing dir. 7 | # 8 | # Note that not all possible configuration values are present in this 9 | # autogenerated file. 10 | # 11 | # All configuration values have a default; values that are commented out 12 | # serve to show the default. 13 | 14 | import os 15 | import sys 16 | 17 | # If extensions (or modules to document with autodoc) are in another directory, 18 | # add these directories to sys.path here. If the directory is relative to the 19 | # documentation root, use os.path.abspath to make it absolute, like shown here. 20 | # sys.path.insert(0, os.path.abspath('.')) 21 | 22 | # -- General configuration ----------------------------------------------------- 23 | 24 | # If your documentation needs a minimal Sphinx version, state it here. 25 | # needs_sphinx = '1.0' 26 | 27 | # Add any Sphinx extension module names here, as strings. They can be extensions 28 | # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 29 | extensions = [] 30 | 31 | # Add any paths that contain templates here, relative to this directory. 32 | templates_path = ["_templates"] 33 | 34 | # The suffix of source filenames. 35 | source_suffix = ".rst" 36 | 37 | # The encoding of source files. 38 | # source_encoding = 'utf-8-sig' 39 | 40 | # The master toctree document. 41 | master_doc = "index" 42 | 43 | # General information about the project. 44 | project = "har_datasets" 45 | 46 | # The version info for the project you're documenting, acts as replacement for 47 | # |version| and |release|, also used in various other places throughout the 48 | # built documents. 49 | # 50 | # The short X.Y version. 51 | version = "0.1" 52 | # The full version, including alpha/beta/rc tags. 53 | release = "0.1" 54 | 55 | # The language for content autogenerated by Sphinx. Refer to documentation 56 | # for a list of supported languages. 57 | # language = None 58 | 59 | # There are two options for replacing |today|: either, you set today to some 60 | # non-false value, then it is used: 61 | # today = '' 62 | # Else, today_fmt is used as the format for a strftime call. 63 | # today_fmt = '%B %d, %Y' 64 | 65 | # List of patterns, relative to source directory, that match files and 66 | # directories to ignore when looking for source files. 
67 | exclude_patterns = ["_build"] 68 | 69 | # The reST default role (used for this markup: `text`) to use for all documents. 70 | # default_role = None 71 | 72 | # If true, '()' will be appended to :func: etc. cross-reference text. 73 | # add_function_parentheses = True 74 | 75 | # If true, the current module name will be prepended to all description 76 | # unit titles (such as .. function::). 77 | # add_module_names = True 78 | 79 | # If true, sectionauthor and moduleauthor directives will be shown in the 80 | # output. They are ignored by default. 81 | # show_authors = False 82 | 83 | # The name of the Pygments (syntax highlighting) style to use. 84 | pygments_style = "sphinx" 85 | 86 | # A list of ignored prefixes for module index sorting. 87 | # modindex_common_prefix = [] 88 | 89 | 90 | # -- Options for HTML output --------------------------------------------------- 91 | 92 | # The theme to use for HTML and HTML Help pages. See the documentation for 93 | # a list of builtin themes. 94 | html_theme = "default" 95 | 96 | # Theme options are theme-specific and customize the look and feel of a theme 97 | # further. For a list of options available for each theme, see the 98 | # documentation. 99 | # html_theme_options = {} 100 | 101 | # Add any paths that contain custom themes here, relative to this directory. 102 | # html_theme_path = [] 103 | 104 | # The name for this set of Sphinx documents. If None, it defaults to 105 | # " v documentation". 106 | # html_title = None 107 | 108 | # A shorter title for the navigation bar. Default is the same as html_title. 109 | # html_short_title = None 110 | 111 | # The name of an image file (relative to this directory) to place at the top 112 | # of the sidebar. 113 | # html_logo = None 114 | 115 | # The name of an image file (within the static path) to use as favicon of the 116 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 117 | # pixels large. 118 | # html_favicon = None 119 | 120 | # Add any paths that contain custom static files (such as style sheets) here, 121 | # relative to this directory. They are copied after the builtin static files, 122 | # so a file named "default.css" will overwrite the builtin "default.css". 123 | html_static_path = ["_static"] 124 | 125 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 126 | # using the given strftime format. 127 | # html_last_updated_fmt = '%b %d, %Y' 128 | 129 | # If true, SmartyPants will be used to convert quotes and dashes to 130 | # typographically correct entities. 131 | # html_use_smartypants = True 132 | 133 | # Custom sidebar templates, maps document names to template names. 134 | # html_sidebars = {} 135 | 136 | # Additional templates that should be rendered to pages, maps page names to 137 | # template names. 138 | # html_additional_pages = {} 139 | 140 | # If false, no module index is generated. 141 | # html_domain_indices = True 142 | 143 | # If false, no index is generated. 144 | # html_use_index = True 145 | 146 | # If true, the index is split into individual pages for each letter. 147 | # html_split_index = False 148 | 149 | # If true, links to the reST sources are added to the pages. 150 | # html_show_sourcelink = True 151 | 152 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 153 | # html_show_sphinx = True 154 | 155 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 
156 | # html_show_copyright = True 157 | 158 | # If true, an OpenSearch description file will be output, and all pages will 159 | # contain a tag referring to it. The value of this option must be the 160 | # base URL from which the finished HTML is served. 161 | # html_use_opensearch = '' 162 | 163 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 164 | # html_file_suffix = None 165 | 166 | # Output file base name for HTML help builder. 167 | htmlhelp_basename = "har_datasetsdoc" 168 | 169 | 170 | # -- Options for LaTeX output -------------------------------------------------- 171 | 172 | latex_elements = { 173 | # The paper size ('letterpaper' or 'a4paper'). 174 | # 'papersize': 'letterpaper', 175 | # The font size ('10pt', '11pt' or '12pt'). 176 | # 'pointsize': '10pt', 177 | # Additional stuff for the LaTeX preamble. 178 | # 'preamble': '', 179 | } 180 | 181 | # Grouping the document tree into LaTeX files. List of tuples 182 | # (source start file, target name, title, author, documentclass [howto/manual]). 183 | latex_documents = [ 184 | ("index", "har_datasets.tex", "har_datasets Documentation", "Niall Twomey", "manual"), 185 | ] 186 | 187 | # The name of an image file (relative to this directory) to place at the top of 188 | # the title page. 189 | # latex_logo = None 190 | 191 | # For "manual" documents, if this is true, then toplevel headings are parts, 192 | # not chapters. 193 | # latex_use_parts = False 194 | 195 | # If true, show page references after internal links. 196 | # latex_show_pagerefs = False 197 | 198 | # If true, show URL addresses after external links. 199 | # latex_show_urls = False 200 | 201 | # Documents to append as an appendix to all manuals. 202 | # latex_appendices = [] 203 | 204 | # If false, no module index is generated. 205 | # latex_domain_indices = True 206 | 207 | 208 | # -- Options for manual page output -------------------------------------------- 209 | 210 | # One entry per manual page. List of tuples 211 | # (source start file, name, description, authors, manual section). 212 | man_pages = [("index", "har_datasets", "har_datasets Documentation", ["Niall Twomey"], 1)] 213 | 214 | # If true, show URL addresses after external links. 215 | # man_show_urls = False 216 | 217 | 218 | # -- Options for Texinfo output ------------------------------------------------ 219 | 220 | # Grouping the document tree into Texinfo files. List of tuples 221 | # (source start file, target name, title, author, 222 | # dir menu entry, description, category) 223 | texinfo_documents = [ 224 | ( 225 | "index", 226 | "har_datasets", 227 | "har_datasets Documentation", 228 | "Niall Twomey", 229 | "har_datasets", 230 | "A collection of human activity recognition (HAR) datasets, complete with metadata, consistent labels, and processing engine for unified HAR analysis. ", 231 | "Miscellaneous", 232 | ), 233 | ] 234 | 235 | # Documents to append as an appendix to all manuals. 236 | # texinfo_appendices = [] 237 | 238 | # If false, no module index is generated. 239 | # texinfo_domain_indices = True 240 | 241 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 
242 | # texinfo_show_urls = 'footnote' 243 | -------------------------------------------------------------------------------- /src/base.py: -------------------------------------------------------------------------------- 1 | from functools import lru_cache 2 | from functools import partial 3 | from operator import itemgetter 4 | from pathlib import Path 5 | from typing import Any 6 | from typing import Callable 7 | from typing import Dict 8 | from typing import Iterable 9 | from typing import List 10 | from typing import Optional 11 | from typing import Tuple 12 | from typing import Union 13 | 14 | import pygraphviz as pgv 15 | from loguru import logger 16 | from mldb import ComputationGraph 17 | from mldb import FileLockExistsException 18 | from mldb import NodeWrapper 19 | from mldb.backends import Backend 20 | from mldb.backends import JsonBackend 21 | from mldb.backends import NumpyBackend 22 | from mldb.backends import PandasBackend 23 | from mldb.backends import PickleBackend 24 | from mldb.backends import PNGBackend 25 | from mldb.backends import ScikitLearnBackend 26 | from mldb.backends import VolatileBackend 27 | from mldb.backends import YamlBackend 28 | 29 | from src.functional.common import node_itemgetter 30 | from src.keys import Key 31 | from src.meta import BaseMeta 32 | from src.utils.decorators import DecoratorBase 33 | from src.utils.loaders import build_path 34 | from src.utils.loaders import get_yaml_file_list 35 | from src.utils.misc import NumpyEncoder 36 | from src.utils.misc import randomised_order 37 | 38 | 39 | __all__ = ["ExecutionGraph", "get_ancestral_metadata"] 40 | 41 | 42 | INDEX_FILES_SET = set( 43 | get_yaml_file_list("indices", stem=True) 44 | + get_yaml_file_list("tasks", stem=True) 45 | + get_yaml_file_list("data_partitions", stem=True) 46 | ) 47 | 48 | DATA_ROOT: Path = build_path() 49 | 50 | BACKEND_DICT = dict( 51 | none=VolatileBackend(), 52 | pickle=PickleBackend(DATA_ROOT), 53 | pandas=PandasBackend(DATA_ROOT), 54 | numpy=NumpyBackend(DATA_ROOT), 55 | json=JsonBackend(DATA_ROOT, cls=NumpyEncoder), 56 | sklearn=ScikitLearnBackend(DATA_ROOT), 57 | png=PNGBackend(DATA_ROOT), 58 | yaml=YamlBackend(DATA_ROOT), 59 | ) 60 | 61 | 62 | @lru_cache(2 ** 16) 63 | def is_index_key(key: Optional[Union[Key, str]]) -> bool: 64 | if key is None: 65 | return False 66 | if isinstance(key, Key): 67 | key = key.key 68 | assert isinstance(key, str) 69 | return key in INDEX_FILES_SET 70 | 71 | 72 | def validate_meta(meta: Union[BaseMeta, Path, str], name: Union[Path, str]) -> BaseMeta: 73 | if isinstance(meta, BaseMeta): 74 | return meta 75 | elif isinstance(meta, (str, Path)): 76 | return BaseMeta(path=meta) 77 | elif isinstance(name, (str, Path)): 78 | return BaseMeta(path=name) 79 | 80 | logger.exception(f"Ambiguous metadata specification with {name=} and {meta=}") 81 | 82 | raise ValueError 83 | 84 | 85 | def validate_backend(backend: Optional[str], key: Optional[Union[str, Key]] = None) -> Backend: 86 | if is_index_key(key): 87 | if backend != "pandas": 88 | logger.warning(f"Backend for node {key} is not pandas - setting value to 'pandas'.") 89 | backend = "pandas" 90 | 91 | else: 92 | if backend is None: 93 | backend = "none" 94 | 95 | if backend not in BACKEND_DICT: 96 | logger.exception(f"Backend ({backend}) not in known list ({sorted(BACKEND_DICT.keys())})") 97 | raise KeyError 98 | 99 | return BACKEND_DICT[backend] 100 | 101 | 102 | def relative_node_name(identifier: Path, key: Union[Key, str]) -> Path: 103 | assert isinstance(key, (Key, str)) 104 | return 
identifier / str(key) 105 | 106 | 107 | def absolute_node_name(identifier: Path, key: Union[Key, str]) -> Path: 108 | return DATA_ROOT / relative_node_name(identifier=identifier, key=key) 109 | 110 | 111 | class NodeGroup(object): 112 | def __init__(self, graph: "ExecutionGraph"): 113 | self.graph = graph 114 | 115 | def __repr__(self) -> str: 116 | graph_name = self.graph.name 117 | nodes = sorted(map(str, self.keys())) 118 | return f"{self.__class__.__name__}({graph_name=}, {nodes=})" 119 | 120 | def __getitem__(self, key: Union[Key, str]) -> NodeWrapper: 121 | assert self.validate_key(key) 122 | 123 | key = Key(key) 124 | 125 | if key in self.graph.nodes: 126 | return self.graph.nodes[key] 127 | 128 | if self.graph.parent is not None: 129 | return self.parent_group[key] 130 | 131 | logger.exception(f"Unable to find key '{key}' in graph - reached root.") 132 | 133 | raise KeyError 134 | 135 | def keys(self) -> Iterable[Union[Key, str]]: 136 | yield from map(itemgetter(0), self.items()) 137 | 138 | def values(self) -> Iterable[NodeWrapper]: 139 | yield from map(itemgetter(1), self.items()) 140 | 141 | def items(self) -> Iterable[Tuple[Union[Key, str], NodeWrapper]]: 142 | keys = [key for key in self.graph.nodes.keys() if self.validate_key(key)] 143 | 144 | if len(keys) == 0: 145 | yield from self.parent_group.items() 146 | 147 | else: 148 | for key in keys: 149 | yield key, self.graph.nodes[key] 150 | 151 | def validate_key(self, key: Union[Key, str]) -> bool: 152 | raise NotImplementedError 153 | 154 | @property 155 | def parent_group(self) -> "NodeGroup": 156 | raise NotImplementedError 157 | 158 | 159 | class OutputGroup(NodeGroup): 160 | def validate_key(self, key: Union[Key, str]) -> bool: 161 | return not is_index_key(key) 162 | 163 | @property 164 | def parent_group(self) -> "OutputGroup": 165 | return self.graph.parent.outputs 166 | 167 | 168 | class IndexGroup(NodeGroup): 169 | def validate_key(self, key): 170 | return is_index_key(key) 171 | 172 | @property 173 | def parent_group(self) -> "IndexGroup": 174 | return self.graph.parent.index 175 | 176 | @property 177 | def index(self): 178 | return self["index"] 179 | 180 | @property 181 | def har(self): 182 | return self["har"] 183 | 184 | @property 185 | def localisation(self): 186 | return self["localisation"] 187 | 188 | @property 189 | def predefined(self): 190 | return self["predefined"] 191 | 192 | @property 193 | def loso(self): 194 | return self["loso"] 195 | 196 | @property 197 | def deployable(self): 198 | return self["deployable"] 199 | 200 | 201 | class ExecutionGraph(ComputationGraph): 202 | def __init__(self, name, parent=None, meta=None): 203 | super(ExecutionGraph, self).__init__(name=name) 204 | self.meta = validate_meta(meta=meta, name=name) 205 | self.parent: Optional["ExecutionGraph"] = parent 206 | self.index = IndexGroup(graph=self) 207 | self.outputs = OutputGroup(graph=self) 208 | 209 | # NODE CREATION/ACQUISITION 210 | 211 | def instantiate_orphan_node( 212 | self, 213 | func: Callable, 214 | args: Optional[Union[Any, List[Any], Tuple[Any]]] = None, 215 | kwargs: Optional[Dict[str, Any]] = None, 216 | ) -> NodeWrapper: 217 | return self.make_node(name=None, func=func, backend=None, args=args, kwargs=kwargs, cache=False) 218 | 219 | def instantiate_node( 220 | self, 221 | key: Union[Key, str], 222 | func: Callable, 223 | args: Optional[Union[Any, List[Any], Tuple[Any]]] = None, 224 | kwargs: Optional[Dict[str, Any]] = None, 225 | backend: Optional[str] = None, 226 | force_add: bool = False, 227 | ) -> 
NodeWrapper: 228 | key = Key(key) 229 | if not force_add: 230 | assert key not in self.nodes 231 | name = absolute_node_name(identifier=self.identifier, key=key) 232 | backend = validate_backend(backend, key) 233 | return self.make_node(name=name, key=key, func=func, backend=backend, args=args, kwargs=kwargs) 234 | 235 | def get_or_create( 236 | self, 237 | key: Union[Key, str], 238 | func: Callable, 239 | args: Optional[Union[Any, Tuple[Any]]] = None, 240 | kwargs: Optional[Dict[str, Any]] = None, 241 | backend: Optional[str] = None, 242 | ) -> NodeWrapper: 243 | key = Key(key) 244 | if key in self.nodes: 245 | return self.nodes[key] 246 | return self.instantiate_node(key=key, func=func, backend=backend, args=args, kwargs=kwargs) 247 | 248 | def acquire_node(self, node: NodeWrapper, key: Optional[Union[Key, str]] = None) -> None: 249 | if key is None: 250 | raise NotImplementedError 251 | key = Key(key) 252 | if key in self.nodes: 253 | raise KeyError(f"Cannot acquire {key} since a node of this name's already in {self.nodes.keys()} of {self}") 254 | self.nodes[key] = node 255 | 256 | def __getitem__(self, key: Union[Key, str]) -> NodeWrapper: 257 | if is_index_key(key): 258 | return self.index[key] 259 | return self.outputs[key] 260 | 261 | # Some convenience functions 262 | 263 | def get_split_series(self, data_partition: str, train_test_split: str) -> NodeWrapper: 264 | return self.instantiate_node( 265 | key=f"{data_partition=}-{train_test_split=}", 266 | func=node_itemgetter(train_test_split), 267 | backend="pandas", 268 | args=self.index[data_partition], 269 | ) 270 | 271 | # BRANCHING 272 | 273 | def make_child(self, name: Union[Key, str], meta: Tuple[Path, str] = None) -> "ExecutionGraph": 274 | return ExecutionGraph(name=name, parent=self, meta=meta) 275 | 276 | def make_sibling(self) -> "ExecutionGraph": 277 | logger.warning("Making siblings not tested - may be buggy!") 278 | return ExecutionGraph(name=self.name, parent=self.parent, meta=self.meta) 279 | 280 | def __truediv__(self, name: Union[Key, str]) -> "ExecutionGraph": 281 | return self.make_child(name=name) 282 | 283 | # EVALUATION 284 | 285 | @property 286 | def identifier(self) -> Path: 287 | if self.parent is None: 288 | return Path(self.name) 289 | return self.parent.identifier / self.name 290 | 291 | def dump_graph(self) -> None: 292 | dump_graph(graph=self, filepath=absolute_node_name(identifier=self.identifier, key="graph.pdf")) 293 | 294 | def evaluate(self, force: bool = False) -> Dict[str, Any]: 295 | output = dict() 296 | for key in randomised_order(self.nodes.keys()): 297 | node = self.nodes[key] 298 | try: 299 | output[key] = node.evaluate() 300 | except FileLockExistsException: 301 | logger.warning(f"Skipping evaluation of {node.name} as it's already being computed.") 302 | return output 303 | 304 | # @staticmethod 305 | # def build_root(): 306 | # return ExecutionGraph(name="datasets") 307 | 308 | # @staticmethod 309 | # def zip_root(): 310 | # return ExecutionGraph("zips") 311 | 312 | 313 | def dump_graph(graph: Union[NodeWrapper, ExecutionGraph, ComputationGraph], filepath: Path): 314 | nodes = dict() 315 | edges = [] 316 | 317 | if isinstance(graph, NodeWrapper): 318 | consume_nodes(nodes, edges, graph) 319 | elif isinstance(graph, ExecutionGraph): 320 | for _, node in graph.outputs.items(): 321 | consume_nodes(nodes, edges, node) 322 | else: 323 | raise TypeError 324 | 325 | nodes = {str(kk): vv for kk, vv in nodes.items()} 326 | edges = list(map(lambda rr: list(map(str, rr)), edges)) 327 | 328 | G = 
pgv.AGraph(directed=True, strict=True, rankdir="LR") 329 | for node_id, node_name in nodes.items(): 330 | G.add_node(node_id, label=node_name) 331 | G.add_edges_from(edges) 332 | try: 333 | G.layout("dot") 334 | filepath.parent.mkdir(exist_ok=True, parents=True) 335 | G.draw(filepath) 336 | G.close() 337 | except ValueError as ex: 338 | logger.exception(f"Unable to save dot file {filepath}: {ex}") 339 | 340 | return nodes, edges 341 | 342 | 343 | def get_all_sources(node: NodeWrapper): 344 | sources = [] 345 | 346 | def resolve(nn): 347 | if isinstance(nn, NodeWrapper): 348 | sources.append(nn) 349 | elif isinstance(nn, (list, tuple)): 350 | for ni in nn: 351 | resolve(ni) 352 | elif isinstance(nn, dict): 353 | for ni in nn.values(): 354 | resolve(ni) 355 | 356 | resolve(node.args) 357 | resolve(node.kwargs) 358 | 359 | return sources 360 | 361 | 362 | def consume_nodes(nodes: Dict[str, str], edges: List[Tuple[str, str]], ptr: NodeWrapper): 363 | def add_node(node): 364 | node_name = node.name 365 | func = node.func 366 | if isinstance(func, partial): 367 | func = node.func.func.__self__.func 368 | elif isinstance(node.func, DecoratorBase): 369 | func = func.func 370 | func_name = func.__name__ 371 | if node_name not in nodes: 372 | if isinstance(node.name, Path): 373 | name = f"{func_name} =>\n{node.name.stem}" 374 | else: 375 | name = func_name 376 | nodes[node_name] = f"{name}" 377 | return node_name 378 | 379 | add_node(ptr) 380 | 381 | for source_node in get_all_sources(ptr): 382 | source_name = add_node(source_node) 383 | edges.append((source_name, ptr.name)) 384 | consume_nodes(nodes, edges, source_node) 385 | 386 | 387 | def get_ancestral_metadata(graph: Union[NodeWrapper, ExecutionGraph], key: str): 388 | if isinstance(graph, NodeWrapper): 389 | graph = graph.graph 390 | if graph.meta is None: 391 | logger.exception(f'The key "{key}" cannot be found in "{graph}"') 392 | raise KeyError 393 | if key in graph.meta: 394 | return graph.meta[key] 395 | if graph.parent is None: 396 | logger.exception(f'The key "{key}" cannot be found in the ancestry of "{graph}"') 397 | raise ValueError 398 | return get_ancestral_metadata(graph.parent, key) 399 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | This repository aims to provide a unified interface to wearable-based Human Activity Recognition (HAR) datasets. The philosophy is to acquire many datasets from a wide variety of recording conditions and to translate these into a consistent data format in order to more easily address open questions on feature extraction/representation learning, meta/transfer learning, and active learning, amongst other tasks. Ultimately, I aim to create a home for more easily understanding the stability, strengths and weaknesses of the state of the art in HAR. 4 | 5 | # Setup 6 | 7 | ## Virtual environment 8 | 9 | It is good practice to use a virtual environment when working with this repository. I have recently been using [miniconda](https://docs.conda.io/en/latest/miniconda.html) as my Python management system; it works much like Anaconda. The following commands use `pipenv` to create a new environment, activate it, and install the requirements (including development dependencies) into that environment. 10 | 11 | ```bash 12 | pipenv install --python 3.8 --skip-lock --dev 13 | pipenv shell 14 | pre-commit install 15 | ``` 16 | 17 | ## dotenv 18 | 19 | Several global variables are required for this library to work.
I set these up with the [dotenv](https://pypi.org/project/python-dotenv/) library. This searches for a file called `.env` that should be found in the project root. It then loads environment variables called `PROJECT_ROOT`, `ZIP_ROOT` and `BUILD_ROOT`. In my system, these are set up roughly as follows. 20 | 21 | ```bash 22 | export PROJECT_ROOT="/users/username/workspace/har_datasets" 23 | export ZIP_ROOT="/users/username/workspace/har_datasets/data/zip" 24 | export BUILD_ROOT="/users/username/workspace/har_datasets/data/build" 25 | ``` 26 | 27 | # Data Format 28 | 29 | The data from all datasets listed in this project are converted into one consistent format that consists of four key elements: 30 | 31 | 1. the train/validation/test fold definition file; 32 | 2. the label file; 33 | 3. the data file; and 34 | 4. an index file. 35 | 36 | Note that the serialisation format used in this repository stores data on a per-sample basis. This means that each of the files listed above will have the same number of rows. 37 | 38 | ## Index File 39 | 40 | The following columns are required for the index file: 41 | 42 | ``` 43 | subject, trial, time 44 | ``` 45 | 46 | `subject` defines a subject identifier, `trial` allows for different trials to be specified (eg it can distinguish data from subjects who perform a task several times), and `time` defines the time (absolute or relative). Subject and trial should be integers, but need not be contiguous. Although time can be considered unnecessary in many applications (especially if the recording was done in a controlled environment or following a script), it is added here to allow for the detection of missing data (missing time stamps) and time-of-day features (if `time` represents epoch time, for example). 47 | 48 | This file must have three columns only. 49 | 50 | ## Task Files 51 | 52 | The following structure is required for the task files: 53 | 54 | ``` 55 | label_vals 56 | ``` 57 | 58 | This file must have at least one column. In general, it is expected that the column will be a list of strings (where each string corresponds to the target). This is not a requirement, however, and the label values may be vector-valued. It is important that the correct model and evaluation criteria are associated with the task. 59 | 60 | ## Data File 61 | 62 | The data format is quite simple: 63 | 64 | ``` 65 | x, y, z 66 | ``` 67 | 68 | where `x`, `y` and `z` correspond to the axes of the wearable. By default, different files are created for each modality (ie accelerometer, gyroscope and magnetometer) and for each location (eg wrist, waist). For example, if one accelerometer is on the wrist, a file called `accel-wrist` will be created for it. There is no restriction on the number of columns in this file, but we expect that more often than not there will be three columns, one per axis of the device. 69 | 70 | This file must have at least one column. 71 | 72 | ## Fold Definitions 73 | 74 | Train and test folds are defined by the columns of this file: 75 | 76 | ``` 77 | fold_1 78 | -1 79 | -1 80 | -1 81 | 0 82 | 0 83 | 0 84 | 1 85 | 1 86 | 1 87 | ``` 88 | 89 | The behaviour of these folds is based on scikit-learn's [PredefinedSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html). Additional folds can (if necessary) be defined by adding supplementary columns to this file. For example, if doing 10 times 10-fold cross-validation, 10 fold identifiers would be contained in each of the 10 columns. A sketch of how a single fold column maps onto `PredefinedSplit` is given below.
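This is a minimal sketch, intended only to illustrate the mapping: the column values are the toy ones from the example above, and in a real pipeline the column would be read from the fold definition file rather than hard-coded.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy fold column from the example above: rows marked -1 are always used for
# training, while 0 and 1 each define one held-out test fold.
fold_1 = np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1])

splitter = PredefinedSplit(test_fold=fold_1)
for train_idx, test_idx in splitter.split():
    print("train:", train_idx, "test:", test_idx)
# train: [0 1 2 6 7 8] test: [3 4 5]
# train: [0 1 2 3 4 5] test: [6 7 8]
```

Samples marked `-1` never appear in a test split, which is how rows reserved purely for training are expressed.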
90 | 91 | This file must have at least one column. 92 | 93 | Several special fold definitions are also supported. `LOSO` performs leave-one-subject-out cross-validation, and `deployable` learns models on all of the data with the expectation that the model will be deployed outside of the scope of the pipeline that created it. 94 | 95 | # Contributing 96 | 97 | I hope to receive pull requests adding new datasets, processing methods, features, and models to this repository. Requests are likely to be accepted once the exact data format, feature extraction, modelling and evaluation interfaces are relatively stable. 98 | 99 | ## Contributing Datasets 100 | 101 | 1. Create a new [yaml](https://en.m.wikipedia.org/wiki/YAML) file in the `metadata/datasets` directory and fill out the information as accurately as possible. Follow the styles and detail given in the entries named `anguita2013`, `pamap2` and `uschad`. The accuracy of the entered metadata will be strictly moderated before a submission is accepted. Note: 102 | - The name of the file and the `name` field in the yaml file must be lower case. 103 | - List all sensor modalities in the dataset in the `modalities` field. The modality names should be consistent with the values found in `metadata/modality.yaml`. 104 | - List all sensor placements in the dataset in the `placements` field. The placement names should be consistent with the values found in `metadata/placement.yaml`. 105 | - List all outputs in the dataset in the `sources` field. For example, if a data source arrives from an accelerometer placed on the wrist, add a dict entry like `{"placement": "wrist", "modality": "accel"}`. This can be tedious, but there is great value in doing this. 106 | - If the dataset introduces a new task, add a new file `metadata/tasks/<task>.yaml` (where `<task>` is the name of the task). List all new target names in this file (see `metadata/tasks/har.yaml` for an example). 107 | - If the dataset introduces a new target to an existing task, add it to the end of the existing `metadata/tasks/<task>.yaml`. 108 | - If the sensor has been placed on a new location, add it to the end of `metadata/placement.yaml`. 109 | - If the sensor is of a new modality, add it to the end of `metadata/modality.yaml`. 110 | 2. Run `make table`. This will update the dataset table in the `tables` directory. Ensure this command executes successfully and verify that the entered information is accurate. 111 | 3. Run `make data`. This will download the archive automatically based on the URLs provided in the `download_urls` field from step 1 above. 112 | 4. Copy the file `src/datasets/__new__.py` to `src/datasets/<name>.py` (where `<name>` is the dataset name defined in step 1 above). The purpose of this file is to translate the data into the expected format described in the sections above. In particular, separate files with the wearable data, annotated labels, pre-defined folds, and index files are required. Use the implementations of the aforementioned datasets (`anguita2013`, `pamap2` and `uschad`) in `src/datasets` as examples of how this has been achieved. 113 | 114 | ## Contributing Pipelines 115 | 116 | (Under construction. See `examples/basic_har.py` for basic examples.) 117 | 118 | ## Contributing Models 119 | 120 | (Under construction. See `src/models/sklearn/basic.py` for basic examples.) 121 | 122 | 123 | # Datasets 124 | 125 | The following table enumerates the datasets that are under consideration for inclusion in this repository.
126 | 127 | | First Author | Dataset Name | Paper (URL) | Data Description (URL) | Data Download (URL) | Year | fs | Accel | Gyro | Mag | #Subjects | #Activities | Notes | 128 | | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | 129 | | Banos | banos2012 | [A benchmark dataset to evaluate sensor displacement in activity recognition](http://www.orestibanos.com/paper_files/banos_ubicomp_2012.pdf) | [Description](http://archive.ics.uci.edu/ml/datasets/REALDISP+Activity+Recognition+Dataset) | [Download](http://archive.ics.uci.edu/ml/machine-learning-databases/00305/realistic_sensor_displacement.zip) | 2012 | 50 | yes | yes | yes | 17 | 33 | | 130 | | Banos | banos2015 | [mHealthDroid: a novel framework for agile development of mobile health applications](https://link.springer.com/chapter/10.1007/978-3-319-13105-4_14) | [Description](http://archive.ics.uci.edu/ml/datasets/mhealth+dataset) | [Download](http://archive.ics.uci.edu/ml/machine-learning-databases/00319/MHEALTHDATASET.zip) | 2015 | 50 | yes | yes | yes | 10 | 12 | | 131 | | Barshan | barshan2014 | [Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units](https://ieeexplore.ieee.org/abstract/document/8130901/) | [Description](https://archive.ics.uci.edu/ml/datasets/daily+and+sports+activities) | [Download](https://archive.ics.uci.edu/ml/machine-learning-databases/00256/data.zip) | 2014 | 25 | yes | yes | yes | 8 | 19 | | 132 | | Bruno | bruno2013 | [Analysis of Human Behavior Recognition Algorithms based on Acceleration Data](https://www.researchgate.net/profile/Barbara_Bruno2/publication/261415865_Analysis_of_human_behavior_recognition_algorithms_based_on_acceleration_data/links/53d001320cf25dc05cfca025.pdf) | [Description](DescriptionURL) | [Download](DownloadURL) | 2013 | 32 | yes | | | 16 | 14 | Notes | 133 | | Casale | casale2015 | [Personalization and user verification in wearable systems using biometric walking patterns](https://dl.acm.org/citation.cfm?id=2339117) | | | 2012 | 52 | yes | | | 7 | 15 | | 134 | | Chen | utdmhad | [UTD-MHAD: A Multimodal Dataset for Human Action Recognition Utilizing a Depth Camera and a Wearable Inertial Sensor](https://ieeexplore.ieee.org/abstract/document/7350781) | [Description](https://www.utdallas.edu/~kehtar/UTD-MHAD.html) | [Download](http://www.utdallas.edu/~kehtar/UTD-MAD/Inertial.zip) | 2015 | 50 | yes | yes | | 9 | 21 | | 135 | | Chavarriaga | opportunity | [The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition](https://www.sciencedirect.com/science/article/pii/S0167865512004205) | [Description](https://archive.ics.uci.edu/ml/datasets/opportunity+activity+recognition) | [Download](https://archive.ics.uci.edu/ml/machine-learning-databases/00226/OpportunityUCIDataset.zip) | 2012 | 30 | yes | yes | yes | 12 | 7 | Several annotation tracks. 
| 136 | | Chereshnev | hugadb | [HuGaDB: Human Gait Database for Activity Recognition from Wearable Inertial Sensor Networks](https://link.springer.com/chapter/10.1007/978-3-319-73013-4_12) | [Description](https://github.com/romanchereshnev/HuGaDB) | [Download](https://www.dropbox.com/s/7nb9g650i5m9k6c/HuGaDB.zip?dl=0) | 2017 | ~56 | yes | yes | | 18 | 12 | | 137 | | Kwapisz | wisdm | [Activity Recognition using Cell Phone Accelerometers](http://www.cis.fordham.edu/wisdm/includes/files/sensorKDD-2010.pdf) | [Description](http://www.cis.fordham.edu/wisdm/dataset.php) | [Download](http://www.cis.fordham.edu/wisdm/includes/datasets/latest/WISDM_ar_latest.tar.gz) | 2012 | 20 | yes | | | 29 | 6 | | 138 | | Micucci | micucci2017 | [UniMiB SHAR: A Dataset for Human Activity Recognition Using Acceleration Data from Smartphones](https://www.mdpi.com/2076-3417/7/10/1101/html) | [Description](http://www.sal.disco.unimib.it/technologies/unimib-shar/) | [Download](https://www.dropbox.com/s/x2fpfqj0bpf8ep6/UniMiB-SHAR.zip?dl=0) | 2017 | 50 | yes | | | 30 | 8 | Notes | 139 | | Ortiz | ortiz2015 | [Human Activity Recognition on Smartphones with Awareness of Basic Activities and Postural Transitions](https://link.springer.com/chapter/10.1007/978-3-319-11179-7_23) | [Description](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based%20Recognition%20of%20Human%20Activities%20and%20Postural%20Transitions) | [Download](http://archive.ics.uci.edu/ml/machine-learning-databases/00341/HAPT%20Data%20Set.zip) | 2015 | 50 | yes | yes | | ? | 7 | With postural transitions | 140 | | Shoaib | shoaib2014 | [Fusion of Smartphone Motion Sensors for Physical Activity Recognition](https://www.mdpi.com/1424-8220/14/6/10146) | [Description](https://www.researchgate.net/publication/266384007_Sensors_Activity_Recognition_DataSet) | [Download](https://www.researchgate.net/profile/Muhammad_Shoaib20/publication/266384007_Sensors_Activity_Recognition_DataSet/data/542e9d260cf277d58e8ec40c/Sensors-Activity-Recognition-DataSet-Shoaib.rar) | 2014 | 50 | yes | yes | yes | 7 | 7 | | 141 | | Siirtola | siirtola2012 | [Recognizing human activities user-independently on smartphones based on accelerometer data](https://dialnet.unirioja.es/servlet/articulo?codigo=3954593) | [Description](http://www.oulu.fi/bisg/node/40364) | [Download](http://www.ee.oulu.fi/research/neurogroup/opendata/OpenHAR.zip) | 2012 | 40 | yes | | | 7 | 5 | | 142 | | Stisen | stisen2015 | [Smart Devices are Different: Assessing and MitigatingMobile Sensing Heterogeneities for Activity Recognition](https://dl.acm.org/citation.cfm?id=2809718) | [Description](https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition) | [Download](https://archive.ics.uci.edu/ml/machine-learning-databases/00344/Activity%20recognition%20exp.zip) | 2015 | 50-200 | yes | | | 9 | 6 | | 143 | | Sztyler | sztyler2016 | [On-body localization of wearable devices: An investigation of position-aware activity recognition](https://ieeexplore.ieee.org/document/7456521) | [Description](http://sensor.informatik.uni-mannheim.de/index.html#dataset_realworld) | [Download](http://wifo5-14.informatik.uni-mannheim.de/sensor/dataset/realworld2016/realworld2016_dataset.zip) | 2016 | 50 | yes | yes | yes | 15 | 8 | Many other sensors also (video, light, sound, etc) | 144 | | Twomey | spherechallenge | [The SPHERE Challenge: Activity Recognition with Multimodal Sensor Data](https://arxiv.org/abs/1603.00797) | [Description](https://data.bris.ac.uk/data/dataset/8gccwpx47rav19vk8x4xapcog) | 
[Download](https://data.bris.ac.uk/datasets/8gccwpx47rav19vk8x4xapcog/8gccwpx47rav19vk8x4xapcog.zip) | 2016 | 20 | yes | | | 20 | 20 | | 145 | | Ugulino | ugulino2012 | [Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements](https://link.springer.com/chapter/10.1007/978-3-642-34459-6_6) | [Description](http://groupware.les.inf.puc-rio.br/har) | [Download](http://groupware.les.inf.puc-rio.br/static/har/SystematicReview-RIS-Format.zip) | 2012 | 50 | yes | | | 4 | 5 | | 146 | | Vavoulas | mobiact | [The MobiAct Dataset: Recognition of Activities of Daily Living using Smartphones](http://www.scitepress.org/Papers/2016/57924/57924.pdf) | [Description](https://bmi.teicrete.gr/en/the-mobifall-and-mobiact-datasets-2/) | [Fill out this form to download](https://bmi.hmu.gr/the-mobifall-and-mobiact-datasets-2/) | 2016 | 100 | yes | | | 57 | 9 | | 147 | 148 | 149 | 150 | 151 | 152 | 153 | # Project Structure 154 | 155 | This project follows the [DataScience CookieCutter](https://drivendata.github.io/cookiecutter-data-science/) template with the aim of facilitating reproducible models and results. the majority of commands are executed with the `make` command, and we also provide a high-level data loading interface. 156 | 157 | --------------------------------------------------------------------------------
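As a quick illustration of the `make`-driven workflow described in the project structure section, a typical session might look like the following. This is a minimal sketch only: it assumes the environment has been set up as in the Setup section, that a `.env` file is in place, and that the `data` and `table` targets referenced in the contributing guide exist on your checkout.

```bash
# Set up and enter the environment (see the Setup section)
pipenv install --python 3.8 --skip-lock --dev
pipenv shell

# With no target, make prints the self-documenting list of available rules
make

# Targets referenced in the contributing guide: download the raw archives
# and regenerate the dataset tables
make data
make table
```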