├── requirements.txt ├── MANIFEST.in ├── docs ├── source │ ├── LICENSE.rst │ ├── README.rst │ ├── CHANGELOG.rst │ ├── images │ │ ├── dataset_structure.png │ │ └── package_data_and_metadata_into_beautiful_box.png │ ├── citing_dtool.rst │ ├── configuring_storage_brokers.rst │ ├── tagging_datasets.rst │ ├── index.rst │ ├── publishing_a_dataset.rst │ ├── configuring_the_dtool_cache_directory.rst │ ├── python_api.rst │ ├── configuring_user_name_and_email.rst │ ├── creating_plugins.rst │ ├── installation_notes.rst │ ├── annotating_datasets.rst │ ├── configuring_a_custom_readme_template.rst │ ├── philosophy.rst │ ├── working_with_overlays.rst │ ├── conf.py │ ├── quick_start_guide.rst │ └── working_with_datasets.rst ├── Makefile └── make.bat ├── icons └── 22x22 │ └── dtool_logo.png ├── .gitignore ├── tests ├── test_dtool_package.py └── __init__.py ├── setup.cfg ├── tox.ini ├── .github ├── dependabot.yml └── workflows │ ├── test.yml │ └── publish.yml ├── dtool └── __init__.py ├── LICENSE.rst ├── pyproject.toml ├── README.rst └── CHANGELOG.rst /requirements.txt: -------------------------------------------------------------------------------- 1 | -e . 2 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include README.rst 2 | include LICENSE.rst 3 | -------------------------------------------------------------------------------- /docs/source/LICENSE.rst: -------------------------------------------------------------------------------- 1 | .. include:: ../../LICENSE.rst 2 | -------------------------------------------------------------------------------- /docs/source/README.rst: -------------------------------------------------------------------------------- 1 | .. include:: ../../README.rst 2 | -------------------------------------------------------------------------------- /docs/source/CHANGELOG.rst: -------------------------------------------------------------------------------- 1 | .. 
include:: ../../CHANGELOG.rst 2 | -------------------------------------------------------------------------------- /icons/22x22/dtool_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jic-dtool/dtool/HEAD/icons/22x22/dtool_logo.png -------------------------------------------------------------------------------- /docs/source/images/dataset_structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jic-dtool/dtool/HEAD/docs/source/images/dataset_structure.png -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.swp 3 | *.egg-info 4 | 5 | .coverage 6 | .eggs 7 | .tox 8 | .pytest_cache 9 | env 10 | build 11 | dist 12 | 13 | dtool/version.py 14 | -------------------------------------------------------------------------------- /tests/test_dtool_package.py: -------------------------------------------------------------------------------- 1 | """Test the dtool package.""" 2 | 3 | 4 | def test_version_is_string(): 5 | import dtool 6 | assert isinstance(dtool.__version__, str) 7 | -------------------------------------------------------------------------------- /docs/source/images/package_data_and_metadata_into_beautiful_box.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jic-dtool/dtool/HEAD/docs/source/images/package_data_and_metadata_into_beautiful_box.png -------------------------------------------------------------------------------- /docs/source/citing_dtool.rst: -------------------------------------------------------------------------------- 1 | Citing dtool 2 | ============ 3 | 4 | Olsson TSG, Hartley M. 2019. Lightweight data management with dtool. PeerJ 7:e6562 https://doi.org/10.7717/peerj.6562 5 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [flake8] 2 | exclude=env*,.tox,.git,*.egg,build,docs 3 | 4 | [tool:pytest] 5 | testpaths = tests 6 | addopts = --cov=dtool 7 | #addopts = -x --pdb 8 | 9 | [coverage:run] 10 | source = dtool 11 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist=py27,py3,flake8 3 | 4 | [testenv] 5 | deps=pytest 6 | pytest-cov 7 | mock 8 | pytest-mock 9 | coverage 10 | -r{toxinidir}/requirements.txt 11 | commands=py.test 12 | 13 | [testenv:flake8] 14 | deps=flake8 15 | commands=flake8 16 | -------------------------------------------------------------------------------- /docs/source/configuring_storage_brokers.rst: -------------------------------------------------------------------------------- 1 | Configuring storage brokers 2 | =========================== 3 | 4 | Some remote storage brokers require extra configuration to enable 5 | authentication. 6 | 7 | The command below configures access to an Azure storage container named 8 | ``jicinformatics``:: 9 | 10 | $ dtool config azure set jicinformatics the-secret-token 11 | the-secret-token 12 | 13 | For information on other storage brokers, have a look at their documentation 14 | and/or use ``dtool config --help`` to get more information. 
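Like other ``dtool config`` subcommands, the command above persists its value in the ``~/.config/dtool/dtool.json`` file. As a rough sketch, the file might afterwards contain something like the JSON below; note that the exact key name used by the Azure storage broker is an assumption here, so check the dtool-azure documentation for the authoritative name::

    {
        "DTOOL_AZURE_ACCOUNT_KEY_jicinformatics": "the-secret-token"
    }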
15 | -------------------------------------------------------------------------------- /.github/dependabot.yml: -------------------------------------------------------------------------------- 1 | # To get started with Dependabot version updates, you'll need to specify which 2 | # package ecosystems to update and where the package manifests are located. 3 | # Please see the documentation for all configuration options: 4 | # https://help.github.com/github/administering-a-repository/configuration-options-for-dependency-updates 5 | 6 | version: 2 7 | updates: 8 | # Maintain dependencies for GitHub Actions 9 | - package-ecosystem: "github-actions" 10 | directory: "/" 11 | schedule: 12 | interval: "daily" 13 | 14 | - package-ecosystem: "pip" 15 | directory: "/" 16 | schedule: 17 | interval: "daily" -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SPHINXPROJ = dtool 8 | SOURCEDIR = source 9 | BUILDDIR = build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | -------------------------------------------------------------------------------- /docs/source/tagging_datasets.rst: -------------------------------------------------------------------------------- 1 | Tagging datasets 2 | ================ 3 | 4 | It is possible to tag datasets with labels. 5 | 6 | To tag a dataset with the label "rnaseq" one would use the command below:: 7 | 8 | $ dtool tag set rnaseq 9 | 10 | It is possible to add more than one tag to a dataset. The command below 11 | adds the tag "A.thaliana":: 12 | 13 | $ dtool tag set A.thaliana 14 | 15 | To list tags one would use the command below:: 16 | 17 | $ dtool tag ls 18 | 19 | This would produce the output:: 20 | 21 | A.thaliana 22 | rnaseq 23 | 24 | It is possible to delete a tag that has been added to a dataset:: 25 | 26 | 27 | $ dtool tag delete A.thaliana
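Tags can also be read and written programmatically. Below is a minimal sketch that assumes the tagging methods exposed by the ``dtoolcore`` Python API (``put_tag``, ``list_tags`` and ``delete_tag``); the dataset URI is illustrative.

.. code-block:: python

    from dtoolcore import DataSet

    dataset = DataSet.from_uri("file:///Users/olssont/my_dataset")
    dataset.put_tag("rnaseq")      # equivalent to: dtool tag set
    print(dataset.list_tags())     # equivalent to: dtool tag ls
    dataset.delete_tag("rnaseq")   # equivalent to: dtool tag delete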
-------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | dtool: Manage Scientific Data 2 | ============================= 3 | 4 | Make your data more resilient, portable and easy to work with by packaging 5 | files & metadata into self-contained datasets. 6 | 7 | .. toctree:: 8 | :maxdepth: 2 9 | 10 | README 11 | installation_notes 12 | philosophy 13 | quick_start_guide 14 | working_with_datasets 15 | tagging_datasets 16 | annotating_datasets 17 | working_with_overlays 18 | configuring_user_name_and_email 19 | configuring_the_dtool_cache_directory 20 | configuring_a_custom_readme_template 21 | configuring_storage_brokers 22 | publishing_a_dataset 23 | python_api 24 | creating_plugins 25 | citing_dtool 26 | CHANGELOG 27 | LICENSE 28 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- 1 | """Test fixtures.""" 2 | 3 | import os 4 | import shutil 5 | import tempfile 6 | 7 | import pytest 8 | 9 | _HERE = os.path.dirname(__file__) 10 | 11 | 12 | @pytest.fixture 13 | def chdir_fixture(request): 14 | d = tempfile.mkdtemp() 15 | curdir = os.getcwd() 16 | os.chdir(d) 17 | 18 | @request.addfinalizer 19 | def teardown(): 20 | os.chdir(curdir) 21 | shutil.rmtree(d) 22 | 23 | 24 | @pytest.fixture 25 | def tmp_dir_fixture(request): 26 | d = tempfile.mkdtemp() 27 | 28 | @request.addfinalizer 29 | def teardown(): 30 | shutil.rmtree(d) 31 | return d 32 | 33 | 34 | @pytest.fixture 35 | def local_tmp_dir_fixture(request): 36 | d = tempfile.mkdtemp(dir=_HERE) 37 | 38 | @request.addfinalizer 39 | def teardown(): 40 | shutil.rmtree(d) 41 | return d 42 | -------------------------------------------------------------------------------- /docs/make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=source 11 | set BUILDDIR=build 12 | set SPHINXPROJ=dtool 13 | 14 | if "%1" == "" goto help 15 | 16 | %SPHINXBUILD% >NUL 2>NUL 17 | if errorlevel 9009 ( 18 | echo. 19 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 20 | echo.installed, then set the SPHINXBUILD environment variable to point 21 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 22 | echo.may add the Sphinx directory to PATH. 23 | echo. 24 | echo.If you don't have Sphinx installed, grab it from 25 | echo.http://sphinx-doc.org/ 26 | exit /b 1 27 | ) 28 | 29 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 30 | goto end 31 | 32 | :help 33 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 34 | 35 | :end 36 | popd 37 | -------------------------------------------------------------------------------- /docs/source/publishing_a_dataset.rst: -------------------------------------------------------------------------------- 1 | Publishing a dataset 2 | ==================== 3 | 4 | It is possible to publish datasets hosted in AWS S3 and Microsoft Azure 5 | Storage. A dataset is published by making it accessible via the HTTP(S) 6 | protocol. 7 | 8 | .. warning:: A published dataset is accessible by anyone in the world with an 9 | internet connection! 10 | 11 | .. code-block:: none 12 | 13 | $ dtool publish s3://dtool-demo/ba92a5fa-d3b4-4f10-bcb9-947f62e652db 14 | Dataset accessible at https://dtool-demo.s3.amazonaws.com/ba92a5fa-d3b4-4f10-bcb9-947f62e652db 15 | 16 | The URL returned by the ``dtool publish`` command can be used to interact with the dataset. 17 | 18 | .. 
code-block:: none 19 | 20 | $ dtool summary https://dtool-demo.s3.amazonaws.com/ba92a5fa-d3b4-4f10-bcb9-947f62e652db 21 | name: hypocotyl3 22 | uuid: ba92a5fa-d3b4-4f10-bcb9-947f62e652db 23 | creator_username: olssont 24 | number_of_items: 339 25 | size: 86.7MiB 26 | frozen_at: 2018-09-12 27 | 28 | -------------------------------------------------------------------------------- /docs/source/configuring_the_dtool_cache_directory.rst: -------------------------------------------------------------------------------- 1 | Configuring the dtool cache directory 2 | ===================================== 3 | 4 | When fetching an item from a dataset stored in object storage, the file 5 | gets stored in a cache directory. The default cache directory is:: 6 | 7 | ~/.cache/dtool 8 | 9 | You may want to configure this cache to be in a different location. This can be achieved using the ``dtool config cache`` command:: 10 | 11 | $ mkdir /tmp/dtool 12 | $ dtool config cache /tmp/dtool 13 | 14 | It is also possible to override both the default and the configured cache 15 | directory by exporting the environment variable ``DTOOL_CACHE_DIRECTORY``. 16 | This can be useful when using local SSD on a compute cluster:: 17 | 18 | 19 | $ mkdir /local/ssd/dtool 20 | $ export DTOOL_CACHE_DIRECTORY=/local/ssd/dtool 21 | 22 | 23 | .. warning:: There is no automatic mechanism built into dtool to clear up the 24 | cache. It can therefore grow very large if you are working with 25 | lots of datasets in object storage. 26 | -------------------------------------------------------------------------------- /dtool/__init__.py: -------------------------------------------------------------------------------- 1 | """dtool package.""" 2 | 3 | import logging 4 | 5 | logger = logging.getLogger(__name__) 6 | 7 | # workaround for diverging python versions: 8 | try: 9 | from importlib.metadata import version, PackageNotFoundError 10 | logger.debug("imported version, PackageNotFoundError from importlib.metadata") 11 | except ModuleNotFoundError: 12 | from importlib_metadata import version, PackageNotFoundError 13 | logger.debug("imported version, PackageNotFoundError from importlib_metadata") 14 | 15 | # first, try to determine dynamic version at runtime 16 | try: 17 | __version__ = version(__name__) 18 | logger.debug("Determined version %s via importlib_metadata.version", __version__) 19 | except PackageNotFoundError: 20 | # if that fails, check for static version file written by setuptools_scm 21 | try: 22 | from .version import version as __version__ 23 | logger.debug("Determined version %s from autogenerated dtool/version.py", __version__) 24 | except Exception as e: 25 | logger.debug("All efforts to determine version failed: %s", e) 26 | __version__ = None -------------------------------------------------------------------------------- /LICENSE.rst: -------------------------------------------------------------------------------- 1 | MIT License 2 | =========== 3 | 4 | Copyright (c) 2017 Tjelvar Olsson 5 | 6 | Permission is hereby granted, free of charge, to any person obtaining a copy 7 | of this software and associated documentation files (the "Software"), to deal 8 | in the Software without restriction, including without limitation the rights 9 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 10 | copies of the Software, and to permit persons to whom the Software is 11 | furnished to do so, subject to the following conditions: 12 | 13 | The above copyright notice and this permission notice shall be 
included in all 14 | copies or substantial portions of the Software. 15 | 16 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 17 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 18 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 19 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 20 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 22 | SOFTWARE. 23 | -------------------------------------------------------------------------------- /docs/source/python_api.rst: -------------------------------------------------------------------------------- 1 | Python API 2 | ========== 3 | 4 | The ``dtool`` command line tool is built using the Python API in `dtoolcore 5 | <https://github.com/jic-dtool/dtoolcore>`_. This API can also be used to create 6 | and interact with datasets directly. 7 | 8 | Below is an example showing how to load a dataset from a URI and use it to 9 | print out a list of all the data item identifiers in the dataset. 10 | 11 | .. code-block:: python 12 | 13 | >>> from dtoolcore import DataSet 14 | >>> dataset = DataSet.from_uri("bgi-sequencing-12345") 15 | >>> for i in dataset.identifiers: 16 | ... print(i) 17 | ... 18 | 1c10766c4a29536bc648260f456202091e2f57b4 19 | fbcc24bed36128535a263b74b2e138d7cc43e90c 20 | 9ca330a84f3dbbdd457a860b5e3c21c917743dd6 21 | 3dce23b901709a24cfbb974b70c1ef132af10a67 22 | 78e7f1507da598e9f6a02810c1f846cfc24fb8ad 23 | 42f43f49b74ef7f901010965aae71170c9fd3ef6 24 | ab069337b0f86cdad899d57e8de63d5b2b680c85 25 | b55ae3fbe6081eb2ed4ed2c4ea316dbeb943ea2c 26 | 27 | More information on how to make use of the Python API can be found in the 28 | `dtoolcore documentation <https://dtoolcore.readthedocs.io>`_. 29 | -------------------------------------------------------------------------------- /.github/workflows/test.yml: -------------------------------------------------------------------------------- 1 | name: test 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - master 8 | tags: 9 | - '*' 10 | pull_request: 11 | 12 | jobs: 13 | test: 14 | runs-on: ubuntu-latest 15 | 16 | strategy: 17 | matrix: 18 | python-version: ['3.7', '3.8', '3.9', '3.10', '3.11', '3.12'] 19 | 20 | steps: 21 | - name: Git checkout 22 | uses: actions/checkout@v3 23 | 24 | - name: Set up python3 ${{ matrix.python-version }} 25 | uses: actions/setup-python@v2 26 | with: 27 | python-version: ${{ matrix.python-version }} 28 | 29 | - name: Install requirements 30 | run: | 31 | python -m pip install --upgrade pip 32 | pip install flake8 33 | pip install .[test] 34 | pip list 35 | 36 | - name: Test with pytest 37 | run: | 38 | pytest -sv 39 | 40 | - name: Lint with flake8 41 | run: | 42 | # stop the build if there are Python syntax errors or undefined names 43 | flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics 44 | # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide 45 | flake8 . 
--count --exit-zero --max-complexity=12 --max-line-length=127 --statistics -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools>=42", "setuptools_scm[toml]>=6.3"] 3 | build-backend = "setuptools.build_meta" 4 | 5 | [project] 6 | name = "dtool" 7 | description = "dtool command line client for managing data" 8 | readme = "README.rst" 9 | license = {file = "LICENSE.rst"} 10 | authors = [ 11 | {name = "Tjelvar Olsson", email = "tjelvar.olsson@gmail.com"} 12 | ] 13 | dynamic = ["version"] 14 | dependencies = [ 15 | "dtoolcore==3.18.3", 16 | "dtool-cli==0.7.1", 17 | "dtool-create==0.23.4", 18 | "dtool-info==0.16.2", 19 | "dtool-symlink==0.3.1", 20 | "dtool-http==0.5.1", 21 | "dtool-config==0.4.1", 22 | "dtool-overlay==0.3.1", 23 | "dtool-annotation==0.1.1", 24 | "dtool-tag==0.1.1" 25 | ] 26 | 27 | [project.optional-dependencies] 28 | test = [ 29 | "pytest", 30 | "pytest-cov" 31 | ] 32 | docs = [ 33 | "sphinx", 34 | "sphinx_rtd_theme" 35 | ] 36 | 37 | [project.urls] 38 | Documentation = "https://dtool.readthedocs.io" 39 | Repository = "https://github.com/jic-dtool/dtool" 40 | Changelog = "https://github.com/jic-dtool/dtool/blob/master/CHANGELOG.rst" 41 | 42 | [tool.setuptools_scm] 43 | version_scheme = "guess-next-dev" 44 | local_scheme = "no-local-version" 45 | write_to = "dtool/version.py" 46 | 47 | [tool.setuptools] 48 | packages = ["dtool"] 49 | -------------------------------------------------------------------------------- /docs/source/configuring_user_name_and_email.rst: -------------------------------------------------------------------------------- 1 | Configuring user name and email 2 | =============================== 3 | 4 | When running the ``dtool readme interactive`` command the default name and email 5 | address are ``Your Name`` and ``you@example.com``. 6 | 7 | :: 8 | 9 | $ dtool readme interactive my_dataset 10 | description [Dataset description]: 11 | project [Project name]: 12 | confidential [False]: 13 | personally_identifiable_information [False]: 14 | name [Your Name]: 15 | email [you@example.com]: 16 | username [olssont]: 17 | creation_date [2017-12-14]: 18 | 19 | These defaults can be changed by configuring the user name and email address. 20 | 21 | :: 22 | 23 | $ dtool config user name "Care A. Bout-Data" 24 | Care A. Bout-Data 25 | $ dtool config user email researcher@famous.uni.ac.uk 26 | researcher@famous.uni.ac.uk 27 | 28 | 29 | 30 | Rerunning the previous ``dtool readme interactive`` command now gives updated 31 | defaults when prompting for input. 32 | 33 | :: 34 | 35 | $ dtool readme interactive my_dataset 36 | description [Dataset description]: 37 | project [Project name]: 38 | confidential [False]: 39 | personally_identifiable_information [False]: 40 | name [Care A. 
Bout-Data]: 41 | email [researcher@famous.uni.ac.uk]: 42 | username [olssont]: 43 | creation_date [2017-12-14]: 44 | -------------------------------------------------------------------------------- /.github/workflows/publish.yml: -------------------------------------------------------------------------------- 1 | name: publish 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - master 8 | tags: 9 | - '*' 10 | 11 | jobs: 12 | build: 13 | 14 | runs-on: ubuntu-latest 15 | 16 | steps: 17 | - uses: actions/checkout@v4 18 | with: 19 | fetch-depth: 0 20 | 21 | - name: Set up Python 3.12 22 | uses: actions/setup-python@v5 23 | with: 24 | python-version: 3.12 25 | 26 | - name: Install requirements 27 | run: | 28 | pip install --upgrade build 29 | pip install --upgrade setuptools wheel setuptools-scm[toml] 30 | pip list 31 | 32 | - name: Package distribution 33 | run: | 34 | python -m build 35 | 36 | - name: Publish distribution to Test PyPI 37 | uses: pypa/gh-action-pypi-publish@release/v1 38 | continue-on-error: true 39 | with: 40 | user: __token__ 41 | password: ${{ secrets.test_pypi_password }} 42 | repository-url: https://test.pypi.org/legacy/ 43 | verbose: true 44 | skip-existing: true 45 | 46 | - name: Publish distribution to PyPI 47 | if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags') 48 | uses: pypa/gh-action-pypi-publish@release/v1 49 | with: 50 | user: __token__ 51 | password: ${{ secrets.pypi_password }} 52 | verbose: true 53 | -------------------------------------------------------------------------------- /docs/source/creating_plugins.rst: -------------------------------------------------------------------------------- 1 | Creating plugins 2 | ================ 3 | 4 | It is possible to create plugins for the ``dtool`` command line tool. There are 5 | two different types of plugins: command line tools and backend storage brokers. 6 | The former allows a developer to add custom extensions to the ``dtool`` 7 | command. The latter allows a developer to create an interface for talking to a 8 | new type of storage. One could for example create a storage broker to interface 9 | with `Amazon S3 <https://aws.amazon.com/s3/>`_ object storage. 10 | 11 | 12 | Extending the ``dtool`` command line tool 13 | ----------------------------------------- 14 | 15 | Information on how to extend the ``dtool`` command line tool is available in 16 | the README file of `dtool-cli <https://github.com/jic-dtool/dtool-cli>`_. 17 | 18 | Concrete examples making use of this plugin system are: 19 | 20 | - `dtool-create <https://github.com/jic-dtool/dtool-create>`_ 21 | - `dtool-info <https://github.com/jic-dtool/dtool-info>`_ 22 | 23 | 24 | Creating an interface to a new type of storage 25 | ---------------------------------------------- 26 | 27 | Below are the steps required to create a storage broker for allowing ``dtool`` 28 | to interact with a new backend. A concrete example making use of this plugin 29 | system is `dtool-irods <https://github.com/jic-dtool/dtool-irods>`_. 30 | 31 | 1. Examine the code in ``dtoolcore.storagebroker.DiskStorageBroker``. 32 | 2. Create a Python class for your storage, e.g. ``MyStorageBroker`` 33 | 3. Add a ``MyStorageBroker.key`` attribute to the class; this key is used to 34 | look up an appropriate storage broker when interacting with a dataset 35 | 4. Add a ``dtoolcore.FileHasher`` instance that matches the hashing algorithm 36 | used by your storage to your ``MyStorageBroker.hasher`` attribute 37 | 5. Add implementations for all the public functions in the 38 | ``dtoolcore.storagebroker.DiskStorageBroker`` class to ``MyStorageBroker`` 39 | 6. Expose the ``MyStorageBroker`` class as a ``dtool.storage_brokers`` 40 | entry point, e.g. 
add a section along the lines of the one below to the 41 | ``setup.py`` file:: 42 | 43 | entry_points={ 44 | "dtool.storage_brokers": [ 45 | "MyStorageBroker=my_dtool_storage_plugin:MyStorageBroker", 46 | ], 47 | }, 48 | 
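To make these steps concrete, below is a minimal sketch of what such a class could look like. It is illustrative only: the ``key`` value and module layout are assumptions, the import of ``FileHasher`` and ``md5sum_hexdigest`` follows the pattern used by existing storage broker plugins, and the definitive list of public methods to implement should be taken from ``dtoolcore.storagebroker.DiskStorageBroker`` itself.

.. code-block:: python

    from dtoolcore.filehasher import FileHasher, md5sum_hexdigest


    class MyStorageBroker(object):

        # Key used to look up this storage broker when interacting
        # with a dataset, e.g. via URIs of the form mystorage://...
        key = "mystorage"

        # Hasher matching the hashing algorithm used by the storage.
        hasher = FileHasher(md5sum_hexdigest)

        def __init__(self, uri, config_path=None):
            self.uri = uri

        # Implementations of the public DiskStorageBroker methods
        # (e.g. get_text, put_text, put_item, iter_item_handles)
        # go here.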
-------------------------------------------------------------------------------- /docs/source/installation_notes.rst: -------------------------------------------------------------------------------- 1 | Installation notes 2 | ================== 3 | 4 | dtool is a Python package that is pip installable. 5 | 6 | Make sure that ``pip``, ``setuptools`` and ``wheel`` are up to date. 7 | This is a requirement of one of the dependencies (``ruamel.yaml``). 8 | 9 | .. code-block:: none 10 | 11 | $ pip install -U pip setuptools wheel 12 | 13 | dtool can then be installed using ``pip``. 14 | 15 | .. code-block:: none 16 | 17 | $ pip install dtool 18 | 19 | 20 | Adding support for S3 object storage 21 | ------------------------------------ 22 | 23 | Install the ``dtool-s3`` package using ``pip``. 24 | 25 | .. code-block:: none 26 | 27 | $ pip install dtool-s3 28 | 29 | To configure Amazon S3 credentials see the README file in the `dtool-s3 30 | <https://github.com/jic-dtool/dtool-s3>`_ GitHub repository. 31 | 32 | 33 | Adding support for Azure storage 34 | -------------------------------- 35 | 36 | Install the ``dtool-azure`` package using ``pip``. 37 | 38 | .. code-block:: none 39 | 40 | $ pip install dtool-azure 41 | 42 | To configure Microsoft Azure credentials see the README file in the 43 | `dtool-azure <https://github.com/jic-dtool/dtool-azure>`_ GitHub repository. 44 | 45 | 46 | Adding support for ECS S3 object storage 47 | ---------------------------------------- 48 | 49 | Install the ``dtool-ecs`` package using ``pip``. 50 | 51 | .. code-block:: none 52 | 53 | $ pip install dtool-ecs 54 | 55 | To configure ECS S3 object storage credentials see the README file in the 56 | `dtool-ecs <https://github.com/jic-dtool/dtool-ecs>`_ GitHub repository. 57 | 58 | 59 | Adding support for iRODS storage 60 | -------------------------------- 61 | 62 | Install the ``dtool-irods`` package using ``pip``. 63 | 64 | .. code-block:: none 65 | 66 | $ pip install dtool-irods 67 | 68 | .. warning:: In order to be able to use the iRODS backend storage 69 | you will need to install the iCommands. Linux packages 70 | can be downloaded from `irods.org/download 71 | <https://irods.org/download/>`_. On Mac OSX these can 72 | be installed using the brew package manager:: 73 | 74 | $ brew install irods 75 | 76 | For more details see the `dtool-irods 77 | <https://github.com/jic-dtool/dtool-irods>`_ GitHub repository. 78 | -------------------------------------------------------------------------------- /docs/source/annotating_datasets.rst: -------------------------------------------------------------------------------- 1 | Annotating datasets 2 | =================== 3 | 4 | It is possible to annotate a dataset with so-called key/value pairs. Such 5 | key/value annotations are intended to make it easy to add and access specific 6 | metadata at a per dataset level. 7 | 8 | The difference between annotations and the descriptive metadata is that the 9 | former is easier to work with in a programmatic fashion. The descriptive 10 | metadata, stored in the dataset's README content, is more free form. It is 11 | non-trivial to access specific pieces of information from the descriptive 12 | metadata in the dataset's README content, whereas a dtool annotation can be 13 | easily accessed by its name (key). 14 | 15 | To create an annotation using the dtool CLI one would use the ``dtool annotation 16 | set`` command. For example, to annotate a dataset with a "project" one would use 17 | the command:: 18 | 19 | $ dtool annotation set project world-peace 20 | 21 | To access the "project" annotation one would use the ``dtool annotation get`` command:: 22 | 23 | $ dtool annotation get project 24 | world-peace 25 | 26 | Annotations set using ``dtool annotation set`` are strings by default. It is possible 27 | to set the type to ``int``, ``float``, and ``bool`` using the ``--type`` option. For 28 | example, to annotate a dataset with a "stars" rating one could use the command:: 29 | 30 | $ dtool annotation set --type int stars 3 31 | 32 | For more complex data structures one can set the type to ``json``. For example:: 33 | 34 | $ dtool annotation set --type json params '{"x": 3.4, "y": 5.6}' 35 | 36 | It is possible to list all the annotations of a dataset:: 37 | 38 | $ dtool annotation ls 39 | params {"x": 3.4, "y": 5.6} 40 | project world-peace 41 | stars 3 42 | 43 | To update an annotation one can use the ``dtool annotation set`` command again. 44 | For example, to show that a dataset is really fantastic one could increase its 45 | star rating to 5:: 46 | 47 | $ dtool annotation set stars 5 --type int 48 | $ dtool annotation get stars 49 | 5 50 | 51 | .. warning:: 52 | There are restrictions on the characters and the length of the keys. They have to 53 | match the regular expression ``^[a-zA-Z.-_]*$`` and be 80 characters or less. 
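Annotations can also be read and written from Python. The sketch below assumes the annotation methods exposed by the ``dtoolcore`` Python API (``put_annotation``, ``get_annotation`` and ``list_annotation_names``); the dataset URI is illustrative.

.. code-block:: python

    from dtoolcore import DataSet

    dataset = DataSet.from_uri("file:///Users/olssont/my_dataset")
    dataset.put_annotation("stars", 5)        # equivalent to: dtool annotation set
    print(dataset.get_annotation("stars"))    # 5
    print(dataset.list_annotation_names())    # equivalent to: dtool annotation ls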
-------------------------------------------------------------------------------- /docs/source/configuring_a_custom_readme_template.rst: -------------------------------------------------------------------------------- 1 | Configuring a custom README template 2 | ==================================== 3 | 4 | When running the ``dtool readme interactive`` command one is prompted to enter 5 | the default descriptive metadata shown below. 6 | 7 | :: 8 | 9 | $ dtool readme interactive my_dataset 10 | description [Dataset description]: 11 | project [Project name]: 12 | confidential [False]: 13 | personally_identifiable_information [False]: 14 | name [Your Name]: 15 | email [you@example.com]: 16 | username [olssont]: 17 | creation_date [2017-12-14]: 18 | 19 | It is possible to configure the required metadata prompted for by the 20 | ``dtool readme interactive`` command. This requires the creation of a 21 | README template file making use of the YAML file format. 22 | 23 | The default template is shown below. 24 | 25 | .. code-block:: yaml 26 | 27 | --- 28 | description: Dataset description 29 | project: Project name 30 | confidential: False 31 | personally_identifiable_information: False 32 | owners: 33 | - name: {DTOOL_USER_FULL_NAME} 34 | email: {DTOOL_USER_EMAIL} 35 | username: {username} 36 | creation_date: {date} 37 | # links: 38 | # - http://dx.doi.org/your_doi 39 | # - http://github.com/your_code_repository 40 | # budget_codes: 41 | # - E.g. CCBS1H10S 42 | 43 | To create a custom template that also prompts for a species definition one 44 | could create the file ``~/custom_dtool_readme.yml`` with the content below. 45 | 46 | .. code-block:: yaml 47 | 48 | --- 49 | description: Dataset description 50 | project: Project name 51 | species: A. thaliana 52 | confidential: False 53 | personally_identifiable_information: False 54 | owners: 55 | - name: {DTOOL_USER_FULL_NAME} 56 | email: {DTOOL_USER_EMAIL} 57 | username: {username} 58 | creation_date: {date} 59 | 60 | To configure dtool to make use of this template one can use the ``dtool config readme-template`` command:: 61 | 62 | $ dtool config readme-template ~/custom_dtool_readme.yml 63 | 64 | The ``dtool config readme-template`` command sets the 65 | ``DTOOL_README_TEMPLATE_FPATH`` key in the ``~/.config/dtool/dtool.json`` file. 66 | Alternatively one can make use of the ``DTOOL_README_TEMPLATE_FPATH`` 67 | environment variable:: 68 | 69 | $ export DTOOL_README_TEMPLATE_FPATH=~/custom_dtool_readme.yml 70 | 71 | Re-running the previous ``dtool readme interactive`` command now includes a prompt for the species and the default value ``A. thaliana``:: 72 | 73 | $ dtool readme interactive my_dataset 74 | description [Dataset description]: 75 | project [Project name]: 76 | species [A. thaliana]: 77 | confidential [False]: 78 | personally_identifiable_information [False]: 79 | name [Your Name]: 80 | email [you@example.com]: 81 | username [olssont]: 82 | creation_date [2017-12-14]: 83 | 84 | 85 | -------------------------------------------------------------------------------- /docs/source/philosophy.rst: -------------------------------------------------------------------------------- 1 | Philosophy - what is dtool? 2 | =========================== 3 | 4 | What problem is dtool solving? 5 | ------------------------------ 6 | 7 | Managing data as a collection of individual files is hard. Analysing that data 8 | will require that certain sets of files are present, understanding it requires 9 | suitable metadata, and copying or moving it while keeping its integrity is 10 | difficult. 11 | 12 | dtool solves this problem by packaging a collection of files and accompanying 13 | metadata into a self-contained and unified whole: a dataset. 14 | 15 | When metadata is kept separate from the data, for example in an Excel spreadsheet 16 | with links to the data files, it becomes difficult to reorganise the data 17 | without fear of breaking the links between the data and the metadata. By 18 | encapsulating both the data files and associated metadata in a dataset one is 19 | free to move the dataset around at will. The high level organisation of 20 | datasets can therefore evolve over time as data management processes change. 21 | 22 | dtool also solves an issue of trust. By including file hashes as metadata 23 | it is possible to verify the integrity of a dataset after it has been moved to 24 | a new location or when coming back to a dataset after a period of time. 25 | 26 | It is possible to discover and access both metadata and data files in a 27 | dataset. It is therefore easy to create scripts and pipelines to process the 28 | items, or a subset of items, in a dataset. 29 | 30 | 31 | What is a "dtool dataset"? 32 | -------------------------- 33 | 34 | Briefly, a dtool dataset consists of: 35 | 36 | - The files added to the dataset, known as the dataset "items" 37 | - Metadata used to describe the dataset as a whole 38 | - Metadata describing the items in the dataset 39 | 40 | The exact details of how this data and metadata are stored depend on the 41 | "backend" (the type of storage used). In other words a dataset is stored 42 | differently on a local file system than it is in an Amazon S3 object 43 | store. However, the ``dtool`` commands and the Python API for interacting with 44 | datasets are the same for all backends. 45 | 
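The Python sketch below illustrates this point (the URIs are illustrative, and the ``item_properties`` method is assumed to be the programmatic counterpart of the ``dtool item properties`` command shown later in this documentation).

.. code-block:: python

    from dtoolcore import DataSet

    # A dataset on local disk and a dataset in S3 object storage are
    # loaded and queried with exactly the same calls.
    for uri in (
        "file:///Users/olssont/my_dataset",
        "s3://dtool-demo/ba92a5fa-d3b4-4f10-bcb9-947f62e652db",
    ):
        dataset = DataSet.from_uri(uri)
        for identifier in dataset.identifiers:
            print(dataset.item_properties(identifier)["relpath"])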
46 | 47 | What does a dtool dataset look like on local disk? 48 | -------------------------------------------------- 49 | 50 | Below is the structure of a fictional dataset containing three items from an 51 | RNA sequencing experiment. 52 | 53 | .. code-block:: none 54 | 55 | $ tree ~/my_dataset 56 | /Users/olssont/my_dataset 57 | ├── README.yml 58 | └── data 59 | ├── rna_seq_reads_1.fq.gz 60 | ├── rna_seq_reads_2.fq.gz 61 | └── rna_seq_reads_3.fq.gz 62 | 63 | The ``README.yml`` file is where metadata used to describe the whole dataset is 64 | stored. The items of the dataset are stored in the directory named ``data``. 65 | 66 | There is also hidden metadata, stored as plain text files, in a directory named 67 | ``.dtool``. This should not be edited directly by the user. 68 | 69 | .. image:: images/dataset_structure.png 70 | 71 | 72 | How does one create a dtool dataset? 73 | ------------------------------------ 74 | 75 | This happens in stages: 76 | 77 | 1. One creates a so-called "proto dataset" 78 | 2. One adds data and metadata to this proto dataset 79 | 3. One converts the proto dataset into a dataset by "freezing" it 80 | 81 | Once a proto dataset is "frozen" it is simply referred to as a dataset and it 82 | is no longer possible to modify the data in it. In other words it is not 83 | possible to add or remove items from a dataset or to alter any of the items in 84 | a dataset. 85 | 86 | The process can be likened to creating an open box (the proto dataset), putting 87 | items (data) into it, sticking a label (metadata) on it, and closing the box 88 | (freezing the dataset). 89 | 90 | .. image:: images/package_data_and_metadata_into_beautiful_box.png 91 | 92 | 93 | Give me more details! 94 | --------------------- 95 | 96 | An in-depth discussion of dtool can be found in the paper 97 | `Lightweight data management with dtool <https://doi.org/10.7717/peerj.6562>`_. 98 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | dtool: Manage Scientific Data 2 | ============================= 3 | 4 | .. |dtool| image:: https://github.com/jic-dtool/dtool/blob/master/icons/22x22/dtool_logo.png?raw=True 5 | :height: 20px 6 | :target: https://github.com/jic-dtool/dtool 7 | 8 | .. |pypi| image:: https://badge.fury.io/py/dtool.svg 9 | :target: http://badge.fury.io/py/dtool 10 | :alt: PyPi package 11 | 12 | .. |test| image:: https://img.shields.io/github/actions/workflow/status/jic-dtool/dtool/test.yml?branch=master&label=tests 13 | :target: https://github.com/jic-dtool/dtool/actions/workflows/test.yml 14 | 15 | .. |docs| image:: https://readthedocs.org/projects/dtool/badge/?version=latest 16 | :target: https://readthedocs.org/projects/dtool?badge=latest 17 | :alt: Documentation Status 18 | 19 | |dtool| |pypi| |test| |docs| 20 | 21 | *Make your data more resilient, portable and easy to work with by packaging 22 | files & metadata into self-contained datasets.* 23 | 24 | - Documentation: http://dtool.readthedocs.io 25 | - Paper: https://doi.org/10.7717/peerj.6562 26 | - Free software: MIT License 27 | 28 | Overview 29 | -------- 30 | 31 | dtool is a suite of software for managing scientific data and making it 32 | accessible programmatically. It consists of a command line interface ``dtool`` 33 | and a Python API: `dtoolcore <https://github.com/jic-dtool/dtoolcore>`_. 
34 | 35 | The ``dtool`` command line interface allows one to organise files into datasets 36 | and to move datasets between different storage solutions, for example from 37 | local disk to remote object storage. Importantly, it also provides methods to 38 | verify that the transfer has been successful. 39 | 40 | The Python API gives complete access to the data and metadata in a dataset. It 41 | makes it easy to create scripts for processing the items, or a subset of items, 42 | in a dataset. The Python API also allows datasets to be constructed 43 | programmatically. 44 | 45 | dtool is extensible, meaning that it is possible to create plugins both for 46 | adding functionality to the command line interface and for creating interfaces 47 | to custom storage backends. 48 | 49 | The ``dtool`` Python package is a meta package that installs the packages: 50 | 51 | - `dtoolcore <https://github.com/jic-dtool/dtoolcore>`_ - core API 52 | - `dtool-cli <https://github.com/jic-dtool/dtool-cli>`_ - CLI plugin scaffold 53 | - `dtool-annotation <https://github.com/jic-dtool/dtool-annotation>`_ - CLI commands for working with dataset annotations 54 | - `dtool-config <https://github.com/jic-dtool/dtool-config>`_ - CLI commands for configuring dtool 55 | - `dtool-create <https://github.com/jic-dtool/dtool-create>`_ - CLI commands for creating datasets 56 | - `dtool-info <https://github.com/jic-dtool/dtool-info>`_ - CLI commands for getting information about datasets 57 | - `dtool-overlay <https://github.com/jic-dtool/dtool-overlay>`_ - CLI commands for working with per item metadata stored as overlays 58 | - `dtool-symlink <https://github.com/jic-dtool/dtool-symlink>`_ - storage broker interface allowing symlinking to data 59 | - `dtool-http <https://github.com/jic-dtool/dtool-http>`_ - storage broker interface allowing read only access to datasets over HTTP 60 | - `dtool-tag <https://github.com/jic-dtool/dtool-tag>`_ - CLI commands for tagging datasets 61 | 62 | Installation:: 63 | 64 | $ pip install dtool 65 | 66 | There are support packages for several object storage solutions: 67 | 68 | - `dtool-s3 <https://github.com/jic-dtool/dtool-s3>`_ - storage broker interface to S3 object storage 69 | - `dtool-smb <https://github.com/jic-dtool/dtool-smb>`_ - storage broker interface to smb network share 70 | - `dtool-azure <https://github.com/jic-dtool/dtool-azure>`_ - storage broker interface to Azure Storage 71 | - `dtool-ecs <https://github.com/jic-dtool/dtool-ecs>`_ - storage broker interface to ECS S3 object storage 72 | - `dtool-irods <https://github.com/jic-dtool/dtool-irods>`_ - storage broker interface to iRODS 73 | 74 | If you have access to Amazon S3, Microsoft Azure, ECS S3 or iRODS storage you may also want to install support for these:: 75 | 76 | 77 | $ pip install dtool-s3 dtool-azure dtool-ecs dtool-irods 78 | 79 | Usage:: 80 | 81 | $ dtool create my-awesome-dataset 82 | Created proto dataset file:///Users/olssont/my-awesome-dataset 83 | Next steps: 84 | 1. Add raw data, eg: 85 | dtool add item my_file.txt file:///Users/olssont/my-awesome-dataset 86 | Or use your system commands, e.g: 87 | mv my_data_directory /Users/olssont/my-awesome-dataset/data/ 88 | 2. Add descriptive metadata, e.g: 89 | dtool readme interactive file:///Users/olssont/my-awesome-dataset 90 | 3. Convert the proto dataset into a dataset: 91 | dtool freeze file:///Users/olssont/my-awesome-dataset 92 | -------------------------------------------------------------------------------- /docs/source/working_with_overlays.rst: -------------------------------------------------------------------------------- 1 | Working with overlays 2 | ===================== 3 | 4 | Overlays provide a means to store and access per item metadata. 5 | 6 | Displaying a table with all per item metadata 7 | --------------------------------------------- 8 | 9 | It is possible to display all the per item metadata as a CSV table using the 10 | command ``dtool overlays show``. 11 | 12 | .. 
code-block:: none 13 | 14 | $ dtool overlays show http://bit.ly/Ecoli-reads-minified 15 | identifiers,pair_id,is_read1,useful_name,relpaths 16 | 8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,False,ERR022075,ERR022075_2.fastq.gz 17 | 9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,True,ERR022075,ERR022075_1.fastq.gz 18 | 19 | The dataset above has three overlays named: ``pair_id``, ``is_read1``, and 20 | ``useful_name``. The columns named ``identifiers`` and ``relpaths`` are 21 | reported for bookkeeping purposes. 22 | 23 | 24 | Accessing an overlay value of a specific dataset item 25 | ------------------------------------------------------ 26 | 27 | It is possible to access the value stored in an overlay for a specific 28 | item using the command ``dtool item overlay``. 29 | 30 | .. code-block:: none 31 | 32 | $ dtool item overlay \ 33 | is_read1 \ 34 | http://bit.ly/Ecoli-reads-minified \ 35 | 9760280dc6313d3bb598fa03c5931a7f037d7ffc 36 | True 37 | 
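Overlays can also be read programmatically. The sketch below assumes ``dtoolcore``'s ``get_overlay`` method, which returns a dictionary mapping item identifiers to overlay values; it mirrors the ``dtool item overlay`` command above.

.. code-block:: python

    from dtoolcore import DataSet

    dataset = DataSet.from_uri("http://bit.ly/Ecoli-reads-minified")
    is_read1 = dataset.get_overlay("is_read1")
    print(is_read1["9760280dc6313d3bb598fa03c5931a7f037d7ffc"])  # True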
38 | 39 | Creating overlays 40 | ----------------- 41 | 42 | Overlay creation happens in two steps. 43 | 44 | 1. Create a template overlay CSV file using the format above 45 | 2. Use the template to write all overlays in the template to the dataset 46 | 47 | Creating overlay templates 48 | ^^^^^^^^^^^^^^^^^^^^^^^^^^ 49 | 50 | A starting template can be created using the ``dtool overlays show`` command. 51 | For a dataset with no overlays this will result in a table with the columns 52 | ``identifiers`` and ``relpaths``. The table will have one row for each item in 53 | the dataset. One can then add columns for the overlays one would wish to 54 | create. 55 | 56 | However, in many cases one would want to use metadata in the items' relpaths to 57 | generate a starting CSV template. This can be achieved using the commands: 58 | 59 | - ``dtool overlays template parse`` 60 | - ``dtool overlays template glob`` 61 | - ``dtool overlays template pairs`` 62 | 63 | Consider for example the dataset below:: 64 | 65 | $ dtool ls http://bit.ly/Ecoli-reads-minified 66 | 8bda245a8cd526673aab775f90206c8b67d196af ERR022075_2.fastq.gz 67 | 9760280dc6313d3bb598fa03c5931a7f037d7ffc ERR022075_1.fastq.gz 68 | 69 | The command below could be used to generate a template for the overlays 70 | "useful_name" and "read":: 71 | 72 | $ dtool overlays template parse \ 73 | http://bit.ly/Ecoli-reads-minified \ 74 | '{useful_name}_{read:d}.fastq.gz' 75 | 76 | This results in the CSV output below:: 77 | 78 | identifiers,read,useful_name,relpaths 79 | 8bda245a8cd526673aab775f90206c8b67d196af,2,ERR022075,ERR022075_2.fastq.gz 80 | 9760280dc6313d3bb598fa03c5931a7f037d7ffc,1,ERR022075,ERR022075_1.fastq.gz 81 | 82 | To ignore a variable element when parsing one can use unnamed curly braces. The 83 | command below for example only generates the overlay "useful_name":: 84 | 85 | $ dtool overlays template parse \ 86 | http://bit.ly/Ecoli-reads-minified \ 87 | '{useful_name}_{:d}.fastq.gz' 88 | identifiers,useful_name,relpaths 89 | 8bda245a8cd526673aab775f90206c8b67d196af,ERR022075,ERR022075_2.fastq.gz 90 | 9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075,ERR022075_1.fastq.gz 91 | 92 | 93 | Sometimes one simply wants to create a boolean overlay based on whether or not 94 | a particular file matches a glob pattern. The command below can be used to 95 | create a CSV template for an overlay named ``is_read1``:: 96 | 97 | 98 | $ dtool overlays template glob \ 99 | http://bit.ly/Ecoli-reads-minified \ 100 | is_read1 \ 101 | '*1.fastq.gz' 102 | identifiers,is_read1,relpaths 103 | 8bda245a8cd526673aab775f90206c8b67d196af,False,ERR022075_2.fastq.gz 104 | 9760280dc6313d3bb598fa03c5931a7f037d7ffc,True,ERR022075_1.fastq.gz 105 | 106 | Sometimes it is useful to be able to find pairs of items, for example when 107 | dealing with genomic sequencing data that has forward and reverse reads. 108 | 109 | One can create a "pair_id" overlay CSV template for this dataset using the 110 | command below:: 111 | 112 | $ dtool overlays template pairs http://bit.ly/Ecoli-reads-minified .fastq.gz 113 | identifiers,pair_id,relpaths 114 | 8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075_2.fastq.gz 115 | 9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,ERR022075_1.fastq.gz 116 | 117 | In the above the suffix ".fastq.gz" is used to extract the prefix ``ERR022075_`` 118 | that is used to find matching pairs. 119 | 120 | 121 | Writing an overlay template to a dataset 122 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 123 | 124 | Once one has an overlay template CSV file one can write this to a dataset:: 125 | 126 | $ dtool overlays write overlays.csv 127 | 128 | 129 | Further reading 130 | --------------- 131 | 132 | For more information see the documentation at https://github.com/jic-dtool/dtool-overlay 133 | -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # This file is execfile()d with the current directory set to its 4 | # containing dir. 5 | # 6 | # Note that not all possible configuration values are present in this 7 | # autogenerated file. 8 | # 9 | # All configuration values have a default; values that are commented out 10 | # serve to show the default. 11 | 12 | # If extensions (or modules to document with autodoc) are in another directory, 13 | # add these directories to sys.path here. If the directory is relative to the 14 | # documentation root, use os.path.abspath to make it absolute, like shown here. 15 | # 16 | import os 17 | # import sys 18 | # sys.path.insert(0, os.path.abspath('.')) 19 | 20 | 21 | # -- General configuration ------------------------------------------------ 22 | 23 | # If your documentation needs a minimal Sphinx version, state it here. 24 | # 25 | # needs_sphinx = '1.0' 26 | 27 | # Add any Sphinx extension module names here, as strings. They can be 28 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 29 | # ones. 30 | extensions = [ 31 | 'sphinx.ext.autodoc', 32 | 'sphinx.ext.doctest', 33 | 'sphinx.ext.viewcode'] 34 | 35 | # Add any paths that contain templates here, relative to this directory. 36 | templates_path = ['_templates'] 37 | 38 | # The suffix(es) of source filenames. 39 | # You can specify multiple suffix as a list of string: 40 | # 41 | # source_suffix = ['.rst', '.md'] 42 | source_suffix = '.rst' 43 | 44 | # The master toctree document. 45 | master_doc = 'index' 46 | 47 | # General information about the project. 
48 | project = u"dtool" 49 | copyright = u"2017, Tjelvar Olsson" 50 | author = u"Tjelvar Olsson" 51 | repo_name = u"dtool" 52 | 53 | # The version info for the project you're documenting, acts as replacement for 54 | # |version| and |release|, also used in various other places throughout the 55 | # built documents. 56 | # 57 | # The short X.Y version. 58 | version = u"3.26.2" 59 | # The full version, including alpha/beta/rc tags. 60 | release = version 61 | 62 | # The language for content autogenerated by Sphinx. Refer to documentation 63 | # for a list of supported languages. 64 | # 65 | # This is also used if you do content translation via gettext catalogs. 66 | # Usually you set "language" from the command line for these cases. 67 | language = None 68 | 69 | # List of patterns, relative to source directory, that match files and 70 | # directories to ignore when looking for source files. 71 | # This patterns also effect to html_static_path and html_extra_path 72 | exclude_patterns = [] 73 | 74 | # The name of the Pygments (syntax highlighting) style to use. 75 | pygments_style = 'sphinx' 76 | 77 | # If true, `todo` and `todoList` produce output, else they produce nothing. 78 | todo_include_todos = False 79 | 80 | 81 | # -- Options for HTML output ---------------------------------------------- 82 | 83 | # The theme to use for HTML and HTML Help pages. See the documentation for 84 | # a list of builtin themes. 85 | # 86 | html_theme = 'default' 87 | 88 | # Set the readthedocs theme. 89 | on_rtd = os.environ.get('READTHEDOCS', None) == 'True' 90 | 91 | if not on_rtd: # only import and set the theme if we're building docs locally 92 | print('using readthedocs theme...') 93 | import sphinx_rtd_theme 94 | html_theme = 'sphinx_rtd_theme' 95 | html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] 96 | # otherwise, readthedocs.org uses their theme by default, so no need to specify 97 | # it 98 | 99 | # Theme options are theme-specific and customize the look and feel of a theme 100 | # further. For a list of options available for each theme, see the 101 | # documentation. 102 | # 103 | # html_theme_options = {} 104 | 105 | # Add any paths that contain custom static files (such as style sheets) here, 106 | # relative to this directory. They are copied after the builtin static files, 107 | # so a file named "default.css" will overwrite the builtin "default.css". 108 | html_static_path = ['_static'] 109 | 110 | 111 | # -- Options for HTMLHelp output ------------------------------------------ 112 | 113 | # Output file base name for HTML help builder. 114 | htmlhelp_basename = '{}doc'.format(repo_name) 115 | 116 | 117 | # -- Options for LaTeX output --------------------------------------------- 118 | 119 | latex_elements = { 120 | # The paper size ('letterpaper' or 'a4paper'). 121 | # 122 | # 'papersize': 'letterpaper', 123 | 124 | # The font size ('10pt', '11pt' or '12pt'). 125 | # 126 | # 'pointsize': '10pt', 127 | 128 | # Additional stuff for the LaTeX preamble. 129 | # 130 | # 'preamble': '', 131 | 132 | # Latex figure (float) alignment 133 | # 134 | # 'figure_align': 'htbp', 135 | } 136 | 137 | # Grouping the document tree into LaTeX files. List of tuples 138 | # (source start file, target name, title, 139 | # author, documentclass [howto, manual, or own class]). 
140 | latex_documents = [ 141 | (master_doc, '{}.tex'.format(repo_name), 142 | u'{} Documentation'.format(repo_name), 143 | author, 'manual'), 144 | ] 145 | 146 | 147 | # -- Options for manual page output --------------------------------------- 148 | 149 | # One entry per manual page. List of tuples 150 | # (source start file, name, description, authors, manual section). 151 | man_pages = [ 152 | (master_doc, repo_name, u'{} Documentation'.format(repo_name), 153 | [author], 1) 154 | ] 155 | 156 | 157 | # -- Options for Texinfo output ------------------------------------------- 158 | 159 | # Grouping the document tree into Texinfo files. List of tuples 160 | # (source start file, target name, title, author, 161 | # dir menu entry, description, category) 162 | texinfo_documents = [ 163 | (master_doc, repo_name, u'{} Documentation'.format(repo_name), 164 | author, repo_name, u'Manage scientific data', 165 | 'Miscellaneous'), 166 | ] 167 | -------------------------------------------------------------------------------- /docs/source/quick_start_guide.rst: -------------------------------------------------------------------------------- 1 | Quick start guide 2 | ================= 3 | 4 | This quick start guide shows how the ``dtool`` command line tool can be used to 5 | accomplish some common data management tasks. 6 | 7 | Organising files into a dataset on local disk 8 | --------------------------------------------- 9 | 10 | In this scenario one simply wants to organise one or more files into a dataset 11 | in the file system on the local computer. 12 | 13 | When working on local disk a dataset is simply a standardised directory layout 14 | combined with some hidden files used to annotate the dataset and its items. 15 | 16 | The first step is to create a "proto" dataset. The command below creates a 17 | dataset named ``fishers-iris-data`` in the current working directory. 18 | 19 | .. code-block:: none 20 | 21 | $ dtool create fishers-iris-data 22 | 23 | One can now add files to the dataset by moving/copying them to the 24 | ``fishers-iris-data/data`` directory, or by using the built in ``dtool add 25 | item`` command. In the example below the file ``iris.csv`` is added to the 26 | proto dataset. 27 | 28 | .. code-block:: none 29 | 30 | $ touch iris.csv 31 | $ dtool add item iris.csv fishers-iris-data 32 | 33 | Metadata describing the data is as important as the data itself. Metadata 34 | describing the dataset is stored in the file ``fishers-iris-data/README.yml``. 35 | An easy way to add content to this file is to use the ``dtool readme 36 | interactive`` command, which will prompt for input regarding the dataset. 37 | 38 | .. code-block:: none 39 | 40 | $ dtool readme interactive fishers-iris-data 41 | description [Dataset description]: Fisher's classic iris data, but with an empty file :( 42 | project [Project name]: dtool demo 43 | confidential [False]: 44 | personally_identifiable_information [False]: 45 | name [Your Name]: Tjelvar Olsson 46 | email [olssont@nbi.ac.uk]: 47 | username [olssont]: 48 | creation_date [2017-10-06]: 49 | Updated readme 50 | To edit the readme using your default editor: 51 | dtool readme edit fishers-iris-data 52 | 53 | Finally, to convert the proto dataset into a dataset one uses the ``dtool 54 | freeze`` command. 55 | 56 | .. 
code-block:: none 57 | 58 | $ dtool freeze fishers-iris-data 59 | Generating manifest [####################################] 100% iris.csv 60 | Dataset frozen fishers-iris-data 61 | 62 | 63 | Copying data from an external hard drive to remote storage as a dataset 64 | ----------------------------------------------------------------------- 65 | 66 | Genome sequencing generates large volumes of data, which are often sent from 67 | the sequencing company to the user by posting an external hard drive. When 68 | backing up such data on a remote storage system one does not want to have to 69 | reorganise the data before copying it to the remote storage system. 70 | 71 | In this case one can create a "symlink" dataset and copy that to the remote 72 | storage. A symlink dataset is a dataset where the data directory is a symlink 73 | to another location, for example the data directory on the external hard drive. 74 | 75 | .. code-block:: none 76 | 77 | $ dtool create bgi-sequencing-12345 --symlink-path /mnt/external-hard-drive 78 | 79 | Again, adding metadata to the dataset is vital. 80 | 81 | .. code-block:: none 82 | 83 | $ dtool readme interactive bgi-sequencing-12345 84 | 85 | One can then convert the proto dataset into a dataset by "freezing" it. 86 | 87 | .. code-block:: none 88 | 89 | $ dtool freeze bgi-sequencing-12345 90 | 91 | It is now time to copy the dataset to the remote storage. The command below 92 | assumes that one has credentials set up to write to the Amazon S3 bucket 93 | ``dtool-demo``. The command copies the local dataset to the S3 ``dtool-demo`` 94 | bucket. 95 | 96 | .. code-block:: none 97 | 98 | $ dtool cp bgi-sequencing-12345 s3://dtool-demo/ 99 | 100 | The command above returns feedback on the URI used to identify the dataset in 101 | the remote storage. In this case 102 | ``s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16``. 103 | 104 | The URI used to identify the dataset uses the UUID of the dataset rather than 105 | the dataset's name. This is to avoid name clashes in the object storage. 106 | 107 | Finally, one may want to confirm that the data transfer was successful. This 108 | can be achieved using the ``dtool diff`` command, which should show no 109 | differences if the transfer was successful. 110 | 111 | .. code-block:: none 112 | 113 | $ dtool diff bgi-sequencing-12345 s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16 114 | 115 | By default only identifiers and file sizes are compared. To check file hashes 116 | make use of the ``--full`` option. 117 | 118 | .. warning:: When comparing datasets, identifiers, sizes and hashes are 119 | compared. When checking that the hashes are identical the hashes 120 | for the first dataset are recalculated using the hashing algorithm 121 | of the reference dataset (the second). If the dataset in S3 had 122 | been specified as the first argument then all the files would have 123 | had to be downloaded to the local disk before calculating 124 | their hashes, which would have made the command slower. 125 | 126 | 127 | Copying a dataset from remote storage to local disk 128 | --------------------------------------------------- 129 | 130 | After having copied a dataset to a remote storage system one may have deleted 131 | the copy on the local disk. In this case one may want to be able to get the 132 | dataset back onto the local disk. 133 | 134 | 135 | This can be achieved using the ``dtool cp`` command. The command below copies 136 | the dataset in S3 to the current working directory. 137 | 138 | .. 
Note that on the local disk the dataset will use the name of the dataset rather
than the UUID, in this example ``bgi-sequencing-12345``.

Again, one can verify the data transfer using the ``dtool diff`` command.

.. code-block:: none

    $ dtool diff bgi-sequencing-12345 s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16
--------------------------------------------------------------------------------
/docs/source/working_with_datasets.rst:
--------------------------------------------------------------------------------
Working with datasets
=====================

Listing datasets
----------------

It is possible to list all datasets in a directory or in an S3 bucket
using the ``dtool ls`` command.

.. code-block:: none

    $ dtool ls ~/my_datasets
    bgi-sequencing-12345
    file:///Users/olssont/my_datasets/bgi-sequencing-12345
    drone-images
    file:///Users/olssont/my_datasets/drone-images
    fishers-iris-data
    file:///Users/olssont/my_datasets/fishers-iris-data
    my_rnaseq_data
    file:///Users/olssont/my_datasets/my_rnaseq_data

.. tip:: When using this command proto datasets are highlighted in red.

.. tip:: The ``dtool ls`` command takes a URI. As such it can be used to list
   the datasets in remote storage locations. The example below lists all
   the datasets in the S3 bucket named ``dtool-demo``::

       $ dtool ls s3://dtool-demo/


Generating an inventory of datasets
-----------------------------------

It is possible to generate CSV/TSV/HTML inventories of datasets in a directory
or in another base URI such as an Amazon S3 bucket. For example, the command
below is used to generate an HTML report of all the datasets in the
``s3://dtool-demo/`` bucket.

.. code-block:: none

    $ dtool inventory --format html s3://dtool-demo/ > inventory.html


Verifying a dataset has not been modified since freezing it
-----------------------------------------------------------

A dtool dataset has metadata listing its items and their hashes. This
information can be used to verify that a dataset is in the same state as it was
when it was frozen.

In the example below the dataset has been corrupted in three ways.

1. The file ``rna_seq_reads_4.fq.gz`` has been added to it
2. The file ``rna_seq_reads_3.fq.gz`` has been deleted from it
3. The content of the file ``rna_seq_reads_1.fq.gz`` has been modified

.. code-block:: none

    $ dtool verify ~/my_datasets/my_rnaseq_data
    Unknown item: 49919bdae83011b96bf54d984735e24c4419feb5 rna_seq_reads_4.fq.gz
    Missing item: 72b24007759c0086a316d13838021c2571853a16 rna_seq_reads_3.fq.gz

By default only identifiers and file sizes are compared. To check file hashes
make use of the ``--full`` option.

.. code-block:: none

    $ dtool verify --full ~/my_datasets/my_rnaseq_data
    Unknown item: 49919bdae83011b96bf54d984735e24c4419feb5 rna_seq_reads_4.fq.gz
    Missing item: 72b24007759c0086a316d13838021c2571853a16 rna_seq_reads_3.fq.gz
    Altered item: d4e065787eab480e9cbd2bac6988bc7717464c83 rna_seq_reads_1.fq.gz


Displaying the README descriptive metadata
------------------------------------------

To display the README metadata used to describe the dataset one can make use of
the ``dtool readme show`` command.

.. code-block:: none

    $ dtool readme show ~/my_datasets/chrX-rna-seq
    ---
    description: RNA-seq sample data
    creation_date: 2017-11-20
    ftp: "ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol/"
    doi: "10.1038/nprot.2016.095"

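The README content can also be read programmatically. A short sketch,
assuming ``DataSet.from_uri`` and ``get_readme_content`` from the
``dtoolcore`` package behave as their names suggest; the URI is illustrative
and PyYAML is assumed to be available for parsing the returned YAML string.

.. code-block:: python

    # Sketch: load a dataset and parse its descriptive metadata.
    import yaml  # PyYAML, assumed available
    from dtoolcore import DataSet

    dataset = DataSet.from_uri("file:///Users/olssont/my_datasets/chrX-rna-seq")
    metadata = yaml.safe_load(dataset.get_readme_content())
    print(metadata["description"])
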

Reporting summary information about a dataset
---------------------------------------------

One often wants to find out how many items are in a dataset and what their
total size is. This can be achieved using the ``dtool summary`` command.

.. code-block:: none

    $ dtool summary ~/my_datasets/drone-images
    name: drone-images
    uuid: c2542c2b-d149-4f73-84bc-741bf9af918f
    creator_username: hartleym
    number_of_items: 59
    size: 152.5MiB
    frozen_at: 2017-09-19


Listing the item identifiers in a dataset
-----------------------------------------

To list all the item identifiers in a dataset one can use the ``dtool
identifiers`` command.

.. code-block:: none

    $ dtool identifiers ~/my_datasets/my_rnaseq_data
    b0f92a668d24a3015692b0869e2b7590a62a380c
    72b24007759c0086a316d13838021c2571853a16
    d4e065787eab480e9cbd2bac6988bc7717464c83


.. tip:: Using ``dtool ls`` on a dataset URI results in a list of item
   identifiers and relpaths::

       $ dtool ls ~/my_datasets/my_rnaseq_data
       b0f92a668d24a3015692b0869e2b7590a62a380c - rna_seq_reads_2.fq.gz
       72b24007759c0086a316d13838021c2571853a16 - rna_seq_reads_3.fq.gz
       d4e065787eab480e9cbd2bac6988bc7717464c83 - rna_seq_reads_1.fq.gz


Finding out the size of an item in a dataset
--------------------------------------------

To find the size of a specific item in a dataset one can use the ``dtool item
properties`` command. The command below accesses the properties of the item
with the identifier ``58f50508c42a56919376132e36b693e9815dbd0c``.

.. code-block:: none

    $ dtool item properties ~/my_datasets/drone-images 58f50508c42a56919376132e36b693e9815dbd0c
    {
      "relpath": "IMG_8585.JPG",
      "size_in_bytes": 2716446,
      "utc_timestamp": 1505818439.0,
      "hash": "dbcb0d6f22ec660fa4ac33b3d74556f3"
    }

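The same properties can be accessed from Python. A brief sketch, assuming
``DataSet.item_properties`` returns the dictionary shown above:

.. code-block:: python

    # Sketch: look up one item's properties via the dtoolcore Python API.
    from dtoolcore import DataSet

    dataset = DataSet.from_uri("file:///Users/olssont/my_datasets/drone-images")
    props = dataset.item_properties("58f50508c42a56919376132e36b693e9815dbd0c")
    print(props["size_in_bytes"])  # e.g. 2716446
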

Accessing the content of an item in a dataset
---------------------------------------------

When all files are on local disk getting access to them is trivial. However,
when files are located in some object storage system in the cloud, access may
be less trivial.

dtool solves this problem by providing a method that returns an absolute path
on local disk, with the promise that the requested file will be available at
that path when the call returns.

The dtool command line interface makes this call available as the command
``dtool item fetch``.

Below is an example of this command being used on a local disk file storage.

.. code-block:: none

    $ dtool item fetch ~/my_datasets/drone-images 58f50508c42a56919376132e36b693e9815dbd0c
    /Users/olssont/my_datasets/drone-images/data/IMG_8585.JPG

Below is an example of this command being used on a dataset in the S3 bucket
``dtool-demo``.

.. code-block:: none

    $ dtool item fetch s3://dtool-demo/1e47c076-2eb0-43b2-b219-fc7d419f1f16 3dce23b901709a24cfbb974b70c1ef132af10a67
    /Users/olssont/.cache/dtool/s3/1e47c076-2eb0-43b2-b219-fc7d419f1f16/3dce23b901709a24cfbb974b70c1ef132af10a67.txt


Processing all the items in a dataset
-------------------------------------

By combining the use of ``dtool identifiers`` and ``dtool item fetch`` it is
possible to create basic Bash scripts to process all the items in a dataset.

.. code-block:: none

    $ DS_URI=~/my_datasets/my_rnaseq_data
    $ for ITEM_ID in `dtool identifiers $DS_URI`;
    > do ITEM_FPATH=`dtool item fetch $DS_URI $ITEM_ID`;
    > echo $ITEM_FPATH;
    > done
    /Users/olssont/my_datasets/my_rnaseq_data/data/rna_seq_reads_2.fq.gz
    /Users/olssont/my_datasets/my_rnaseq_data/data/rna_seq_reads_3.fq.gz
    /Users/olssont/my_datasets/my_rnaseq_data/data/rna_seq_reads_1.fq.gz

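The equivalent loop can be written in Python. The sketch below assumes that
``DataSet.item_content_abspath`` is the method backing ``dtool item fetch``,
i.e. that it fetches remote items into the local cache and returns a local
path.

.. code-block:: python

    # Sketch: process every item in a dataset via the dtoolcore Python API.
    from dtoolcore import DataSet

    dataset = DataSet.from_uri("file:///Users/olssont/my_datasets/my_rnaseq_data")
    for identifier in dataset.identifiers:
        fpath = dataset.item_content_abspath(identifier)
        print(fpath)
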
--------------------------------------------------------------------------------
/CHANGELOG.rst:
--------------------------------------------------------------------------------
CHANGELOG
=========

This project uses `semantic versioning <https://semver.org>`_.
This change log uses principles from `keep a changelog <https://keepachangelog.com>`_.

[Unreleased]
------------

Added
^^^^^


Changed
^^^^^^^


Deprecated
^^^^^^^^^^


Removed
^^^^^^^


Fixed
^^^^^


Security
^^^^^^^^

[3.27.0] - 2024-07-04
---------------------

Added
^^^^^


Changed
^^^^^^^

- Pinned ``dtoolcore`` version to 3.18.3, where copying tags has been fixed
- Embedded dtool icon in ``README.rst``
- Replaced ``setup.py`` with ``pyproject.toml``
- Dynamic versioning from scm tag


[3.26.2] - 2022-02-20
---------------------

Fixed
^^^^^

- Fixed defect where the "frozen_at" administrative metadata changed in the
  destination dataset when a dataset was being copied.
  Many thanks to `Johannes L. Hörmann `_
  and `Lars Pastewka `_ for bug reports,
  design discussions and code contributions.
  See:
  https://github.com/jic-dtool/dtoolcore/issues/20
- Improved handling of Windows paths with drive letters where the
  dataset is located in a drive different to that of the working
  directory, see https://github.com/jic-dtool/dtoolcore/pull/23


[3.26.1] - 2021-06-23
---------------------

Fixed
^^^^^

- License files now included in releases thanks to Jan Janssen (https://github.com/jan-janssen)


[3.26.0] - 2021-04-11
---------------------

Added
^^^^^

- ``dtoolcore.iter_datasets_in_base_uri`` helper function
- ``dtoolcore.iter_proto_datasets_in_base_uri`` helper function

Fixed
^^^^^

- Fixed defect in ``dtool readme interactive`` command when the readme template contains a date.
  Thanks to Lars Pastewka.
- Fixed defect in "dtool readme interactive" where the default date of today was
  not updated when using "{{ date }}" in the readme template. See
  https://github.com/jic-dtool/dtool-create/issues/24
  Thanks to Antoine Sanner.
- Fixed issue where "dtool readme edit" opened the file with a ".txt" extension
  rather than a ".yml" extension. See:
  https://github.com/jic-dtool/dtool-cli/issues/3
  Thanks to Antoine Sanner.


[3.25.0] - 2020-03-25
---------------------

Added support for tags from the dtool CLI.

Added
^^^^^

- The CLI command 'dtool tag set'
- The CLI command 'dtool tag ls'
- The CLI command 'dtool tag delete'


[3.24.0] - 2020-03-23
---------------------

Added Python API support for tags.

Added
^^^^^

- Added ``dtoolcore._BaseDataSet.put_tag()`` method
- Added ``dtoolcore._BaseDataSet.delete_tag()`` method
- Added ``dtoolcore._BaseDataSet.list_tags()`` method
- Added ``dtoolcore.storagebroker.BaseStorageBroker.delete_key()`` method
- Added ``dtoolcore.storagebroker.BaseStorageBroker.get_tag_key()`` method
- Added ``dtoolcore.storagebroker.BaseStorageBroker.list_tags()`` method
- Added ``dtoolcore.storagebroker.BaseStorageBroker.put_tag()`` method
- Added ``dtoolcore.storagebroker.BaseStorageBroker.delete_tag()`` method
- Added ``dtoolcore.storagebroker.DiskStorageBroker.delete_key()`` method
- Added ``dtoolcore.storagebroker.DiskStorageBroker.get_tag_key()`` method
- Added ``dtoolcore.storagebroker.DiskStorageBroker.list_tags()`` method
- Default cache directory changed from ``~/.cache/dtool/http`` to
  ``~/.cache/dtool``

Fixed
^^^^^

- Cache environment variable changed from DTOOL_HTTP_CACHE_DIRECTORY to
  DTOOL_CACHE_DIRECTORY


[3.23.0] - 2020-02-28
---------------------

Added
^^^^^

- Add ``dtool readme validate`` command
- Ability to update descriptive metadata in README of frozen datasets
  when using ``dtool readme write``

Fixed
^^^^^

- Fixed several defects in how URIs were parsed and generated on Windows.


[3.22.0] - 2020-02-06
---------------------

Improved Python API for creating datasets.

Added
^^^^^

- dtoolcore.create_proto_dataset() helper function
- dtoolcore.create_derived_proto_dataset() helper function
- dtoolcore.DataSetCreator helper context manager class
- dtoolcore.DerivedDataSetCreator helper context manager class

Fixed
^^^^^

- Fixed defect where using ``DTOOL_NUM_PROCESSES`` > 1 resulted in
  a cPickle.PicklingError on some storage brokers. Multiprocessing
  is now only used if the storage broker supports it.


[3.21.1] - 2020-01-23
---------------------

- Fixed defect where 'dtool verify' calculated hashes even when the '-f/--full'
  option was not specified. The 'dtool verify' command now runs more quickly.


[3.21.0] - 2020-01-21
---------------------

Added
^^^^^

- Ability to use multiple processes (cores) to generate item properties for
  manifest files in parallel. Set the environment variable
  ``DTOOL_NUM_PROCESSES`` to specify the number of processes to use.

Fixed
^^^^^

- Included .dtool/annotations directory in DiskStorageBroker self-description file


[3.20.0] - 2019-10-31
---------------------

*New feature: Dataset annotation*

Dataset annotations are intended to make it easy to add and access specific
metadata at a per dataset level.

The difference between annotations and the descriptive metadata is that the
former is easier to work with in a programmatic fashion. The descriptive
metadata, stored in the dataset's README content, is more free form. It is
non-trivial to access specific pieces of information from the descriptive
metadata in the dataset's README content, whereas a dtool annotation can be
easily accessed by its name.

Added
^^^^^

- Added ``dtool annotation set`` command
- Added ``dtool annotation get`` command
- Added ``dtool annotation ls`` command


[3.19.0] - 2019-09-12
---------------------

Added
^^^^^

- Added sorting of items by relpath to 'dtool ls'

Fixed
^^^^^

- Fixed formatting of 'dtool ls' from using two spaces to using
  one tab to make it easier to work with command line tools such as ``cut``
- Fixed ordering of lines in overlay CSV template from being sorted by the
  identifier to being ordered by the relpath


[3.18.0] - 2019-09-06
---------------------

Added
^^^^^

- Added 'dtool overlays show' command
- Added 'dtool overlays write' command
- Added 'dtool overlays template parse' command
- Added 'dtool overlays template glob' command
- Added 'dtool overlays template pairs' command


Deprecated
^^^^^^^^^^

- Deprecated 'dtool overlay ls'
- Deprecated 'dtool overlay show'


[3.17.0] - 2019-08-06
---------------------

Added
^^^^^

- Added support for host name in file URI.
- Added ``dtool status`` command for working out if a dataset is frozen or not
- Added ``dtool uri`` command for expanding absolute and relative paths into
  proper URIs


[3.16.0] - 2019-07-12
---------------------

Added
^^^^^

- Added more debug logging
- Added ``dtool config ecs ls`` command to list ECS base URIs that have been
  configured
- Added support for configuring access to ECS buckets in multiple namespaces

Fixed
^^^^^

- The ``dtool config azure ls`` command now returns base URIs rather than
  container names


[3.15.0] - 2019-04-26
---------------------

Added
^^^^^

- ``dtool config readme-template`` CLI command for configuring the path to a
  custom readme template
- ``dtoolcore._BaseDataSet.base_uri`` property
- ``dtoolcore.storagebroker.BaseStorageBroker.generate_base_uri`` method
- ``dtoolcore.utils.DEFAULT_CACHE_PATH`` global helper variable
- ``dtoolcore.utils.get_config_value_from_file`` helper function
- ``dtoolcore.utils.write_config_value_to_file`` helper function


Changed
^^^^^^^

- ``dtool config cache`` now works with one unified cache directory for all
  storage brokers
- Started using a unified environment variable to specify the cache directory:
  ``DTOOL_CACHE_DIRECTORY``
- Default cache directory changed to ``~/.cache/dtool``

Fixed
^^^^^

- Fixed defect when username was supplied as two separate strings to
  ``dtool config user name`` in CLI


[3.14.1] - 2018-12-12
---------------------

Fixed
^^^^^

- Fixed the ``dtool config azure set`` help text


[3.14.0] - 2018-11-21
---------------------

Added
^^^^^

- Added ``dtool publish`` command
- Added ``-f/--format`` option to ``dtool summary`` command to enable output in
  JSON format
- Added sorting of CSV/TSV/HTML inventories by dataset name


Changed
^^^^^^^

- Changed default output of ``dtool summary`` to be human-readable YAML


[3.13.0] - 2018-11-13
---------------------

Added
^^^^^

- Added support for Windows! :)
- Added ``dtool config`` command


[3.12.0] - 2018-09-25
---------------------

Added
^^^^^

- Added ``dtool uuid`` command
- Added ``dtool item relpath`` command


[3.11.0] - 2018-09-20
---------------------

Added
^^^^^

- ``dtool cp`` to replace ``dtool copy``
- ``dtool readme write`` to write readme from file or stdin
- ``dtool item overlay`` command


Deprecated
^^^^^^^^^^

- ``dtool copy`` in favour of ``dtool cp``


Removed
^^^^^^^

- Removed ``created_at`` field from default README template


Fixed
^^^^^

- Defect in ``dtool create`` when providing a relative path to the
  ``--symlink-path`` option
- Python 2 defect in dealing with unicode in README.yml file when using
  ``dtool readme edit``


[3.10.0] - 2018-09-11
---------------------

Added
^^^^^

- ``dtoolcore.filehasher.hashsum_digest`` helper function
- ``dtoolcore.filehasher.md5sum_digest`` helper function


Changed
^^^^^^^

- Renamed ``dtoolcore.filehasher.hashsum`` to the more descriptive
  ``dtoolcore.filehasher.hashsum_hexdigest``

Fixed
^^^^^

- Dealt with an issue in how ruamel.yaml handles float values


[3.9.0] - 2018-08-03
--------------------

Added
^^^^^

- Added ability to update the name of a frozen dataset from the ``dtool`` CLI
- Added ``update_name`` method to ``DataSet`` class (previously only available
  on ``ProtoDataSet`` class)


[3.8.0] - 2018-07-31
--------------------

Dataset name validation.

Added
^^^^^

- ``dtoolcore.generate_admin_metadata`` function raises
  ``dtoolcore.DtoolCoreInvalidNameError`` if an invalid name is provided
- ``dtoolcore.utils.name_is_valid`` utility function for checking sanity of
  dataset names
- Validation of dataset name upon creation using dtool CLI
- Validation of dataset name when updating it using dtool CLI

Fixed
^^^^^

- Fixed defect where ``dtool ls -q`` was listing dataset names rather than URIs,
  making it impossible to process datasets in a BASE_URI programmatically
- Make ``SymlinkStorageBroker`` compatible with dtoolcore 3.4.0


[3.7.0] - 2018-07-26
--------------------

Storage broker base class redesign and refactoring.

Added
^^^^^

- Ability to update descriptive metadata in README of frozen datasets
- Validation that the descriptive metadata provided by the
  ``dtool readme edit`` command is valid YAML
- Added ``dtoolcore.storagebroker.BaseStorageBroker``
- Added logging to the reusable ``BaseStorageBroker`` methods
- ``get_text`` new method on ``BaseStorageBroker`` class
- ``put_text`` new method on ``BaseStorageBroker`` class
- ``get_admin_metadata_key`` new method on ``BaseStorageBroker`` class
- ``get_readme_key`` new method on ``BaseStorageBroker`` class
- ``get_manifest_key`` new method on ``BaseStorageBroker`` class
- ``get_overlay_key`` new method on ``BaseStorageBroker`` class
- ``get_structure_key`` new method on ``BaseStorageBroker`` class
- ``get_dtool_readme_key`` new method on ``BaseStorageBroker`` class
- ``get_size_in_bytes`` new method on ``BaseStorageBroker`` class
- ``get_utc_timestamp`` new method on ``BaseStorageBroker`` class
- ``get_hash`` new method on ``BaseStorageBroker`` class
- ``get_relpath`` new method on ``BaseStorageBroker`` class
- ``update_readme`` new method on ``BaseStorageBroker`` class
- ``DataSet.put_readme`` method that can be used to update descriptive metadata
  in (frozen) dataset README whilst keeping a copy of the historical README
  content
- Add ``storage_broker_version`` key to structure parameters

Fixed
^^^^^

- Stop ``copy_resume`` function calculating hashes unnecessarily
- Fixed the documentation of the ``dtool verify`` command


[3.6.2] - 2018-07-10
--------------------

Fixed
^^^^^

- Default config file now set in ``dtoolcore.utils.get_config_value`` if not provided in caller


[3.6.1] - 2018-07-09
--------------------

Fixed
^^^^^

- Made download to DTOOL_HTTP_CACHE_DIRECTORY more robust
- Added ability to deal with redirects to enable working with shortened URLs


[3.6.0] - 2018-07-05
--------------------

Added
^^^^^

- Bundling of ``dtool-http`` package

Removed
^^^^^^^

- Bundling of ``dtool-irods`` package
- Bundling of ``dtool-s3`` package


[3.5.0] - 2018-06-06
--------------------

Added
^^^^^

- Pre-checks to 'dtool freeze' command to ensure that there is no rogue content
  in the base of disk datasets
- Added rogue content validation check to DiskStorageBroker.pre_freeze hook


[3.4.0] - 2018-05-24
--------------------

Added
^^^^^

- Pre-checks to 'dtool freeze' command to ensure that the item handles are sane, i.e.
  that they do not contain newline characters
- Pre-checks to 'dtool freeze' command to ensure that there are not too many
  items in the proto dataset, defaulting to fewer than 10000


[3.3.1] - 2018-05-18
--------------------

Fixed
^^^^^

- Defect where the inventory HTML template was not included in the Python package on PyPI


[3.3.0] - 2018-05-18
--------------------

Added
^^^^^

- Add "created_at" key to the administrative metadata
- ``dtool inventory`` command for generating csv/tsv/html inventories of collections
  of datasets
- Added support for ``-h`` flag as well as ``--help``
- Added timestamp to logging output

Fixed
^^^^^

- Improved handling of URIs in validation code
- Fixed defect where running ``dtool item properties`` with an invalid identifier
  resulted in a KeyError exception being propagated to the user
- Fixed defect where ``dtool verify`` did not compare file sizes
- Fixed timestamp defect in DiskStorageBroker


[3.2.1] - 2018-05-01
--------------------

Fixed
^^^^^

- Fixed issue arising from a file being put into iRODS and the connection
  breaking before the appropriate metadata could be set on the file in iRODS.
  See also: https://github.com/jic-dtool/dtool-irods/issues/7


[3.2.0] - 2018-02-09
--------------------

Release to make it easier to create symlink datasets in an automated fashion.

Changed
^^^^^^^

- Simplified the way to specify the symbolic link path in the
  SymLinkStorageBroker
- The path to the data when creating a symlink dataset is now specified using the
  ``-s/--symlink-path`` option rather than being something that is prompted for.
  This makes it easier to create symlink datasets in an automated fashion.


[3.1.0] - 2018-02-05
--------------------

Added
^^^^^

- ``--resume`` option to ``dtool copy`` command
- ``--quiet`` and ``--verbose`` options to ``dtool ls`` and improved formatting
- Add ``dtoolcore.copy_resume`` function


[3.0.0] - 2018-01-18
--------------------

This release makes use of the dtoolcore version 3.0.0 API, which improves the
handling of URIs and adds more metadata describing the structure of datasets.

Another major feature of this release is the addition of an S3 storage broker
that can be used to interact with Amazon's S3 object storage.

Added
^^^^^

- AWS S3 object storage broker
- Writing of ``.dtool/structure.json`` file to the DiskStorageBroker; a file
  for describing the structure of the dtool dataset in a computer readable format
- Writing of ``.dtool/README.txt`` file to the DiskStorageBroker; a file
  for describing the structure of the dtool dataset in a human readable format
- Writing of ``.dtool/structure.json`` file to the IrodsStorageBroker; a file
  for describing the structure of the dtool dataset in a computer readable format
- Writing of ``.dtool/README.txt`` file to the IrodsStorageBroker; a file
  for describing the structure of the dtool dataset in a human readable format


Changed
^^^^^^^

- Make use of dtoolcore version 3 API


Fixed
^^^^^

- Removed the historical ``dtool_readme`` key/value pair from the
  administrative metadata (in the .dtool/dtool file)


[2.4.0] - 2017-12-14
--------------------

Added
^^^^^

- Ability to specify a custom README.yml template file path.
- Ability to configure the full user name for the README.yml template using
  ``DTOOL_USER_FULL_NAME``

Fixed
^^^^^

- Made ``.dtool/manifest.json`` content created by DiskStorageBroker human
  readable by adding new lines and indentation to the JSON formatting.
- Made the DiskStorageBroker.list_overlay_names method more robust. It no
  longer falls over if the ``.dtool/overlays`` directory has been lost, e.g. by
  cloning a dataset with no overlays from a Git repository.
- Fixed defect where an incorrect URI would get set on the dataset when using
  ``DataSet.from_path`` class method on a relative path
- Made the YAML output prettier by adding more indentation.
- Replaced hardcoded ``nbi.ac.uk`` email with configurable ``DTOOL_USER_EMAIL``
  in the default README.yml template.
- Fixed ``IrodsStorageBroker.generate_uri`` class method
- Made ``.dtool/manifest.json`` content created by IrodsStorageBroker human
  readable by adding new lines and indentation to the JSON formatting.
- Added rule to catch ``CAT_INVALID_USER`` string for giving a more informative
  error message when iRODS authentication times out


[2.3.2] - 2017-10-25
--------------------

Fixed
^^^^^

- Fixed issue where the symbolic link was not fully resolved when creating
  a symlink dataset that used the terminal to prompt for the data directory


[2.3.1] - 2017-10-25
--------------------

Fixed
^^^^^

- More graceful exit if one presses Cancel in file browser when creating a
  symlink dataset
- Data directory now falls back on click command line prompt if TkInter has
  issues when creating a symlink dataset


[2.3.0] - 2017-10-23
--------------------

Added
^^^^^

- ``pre_freeze_hook`` to the storage broker interface, called at the beginning
  of the ``ProtoDataSet.freeze`` method.
- ``--quiet`` flag to ``dtool create`` command
- ``dtool overlay ls`` command to list the overlays in a dataset
- ``dtool overlay show`` command to show the content of a specific overlay


Changed
^^^^^^^

- Improved speed of freezing a dataset in iRODS by making use of
  caches to reduce the number of calls made to iRODS during this
  process
- ``dtool copy`` now specifies target location using URI rather than
  using the ``--prefix`` and ``--storage`` arguments


Fixed
^^^^^

- Made the ``DiskStorageBroker.create_structure`` method more robust
- More informative error message when iRODS has not been configured
- More informative error message when iRODS authentication times out
- Stopped client hanging when iRODS authentication has timed out
- The storage broker's ``put_item`` method now returns the relpath
- Made the ``IrodsStorageBroker.create_structure`` method more
  robust by checking if the parent collection exists
- Made error handling in ``dtool create`` more specific
- Added propagation of the original error message when a ``StorageBrokerOSError``
  is caught in ``dtool create``


[2.2.0] - 2017-10-09
--------------------

Added
^^^^^

- ``dtool ls`` can now be used to list the relpaths of the items in a dataset
- ``-f/--full`` flag to ``dtool diff`` command to include checking of file
  hashes
- ``-f/--full`` flag to ``dtool verify`` command to include checking of file
  hashes


Changed
^^^^^^^

- ``dtool ls`` now works with URIs rather than with prefix and storage arguments
- ``dtool diff`` now only compares identifiers and file sizes by default
- ``dtool verify`` now only compares identifiers and file sizes by default


Fixed
^^^^^

- Made ``DiskStorageBroker.list_dataset_uris`` class method more robust


[2.1.2] - 2017-10-05
--------------------

Fixed
^^^^^

- Set the correct dependency to actually get the fix reported in 2.1.1

[2.1.1] - 2017-10-05
--------------------

Fixed
^^^^^

- Fixed defect in iRODS storage broker where files with white space resulted in
  broken identifiers


[2.1.0] - 2017-10-04
--------------------

Added
^^^^^

- ``dtool readme show`` command that returns the readme content
- ``--quiet`` flag to ``dtool copy`` command

Changed
^^^^^^^

- Improved the ``dtool readme --help`` output

Fixed
^^^^^

- Progress bar now shows information on individual items being processed
- ``dtool ls`` now works with relative paths
- Fixed defect where ``IrodsStorageBroker.put_item`` raised SystemError when
  trying to overwrite an existing file


[2.0.2] - 2017-09-25
--------------------

Fixed
^^^^^

- Better validation of input in terms of base vs proto vs frozen dataset URIs
- Fixed bug where copy created an intermediate proto dataset that
  self-identified as a frozen dataset.
- Fixed potential bug where a copy could convert a proto dataset to
  a dataset before all its overlays had been copied over
- Fixed type of "frozen_at" time stamp in admin metadata: from string to float


[2.0.1] - 2017-09-20
--------------------

Fixed
^^^^^

- Made version requirements of dtool sub-packages explicit

[2.0.0] - 2017-09-14
--------------------

Initial release of ``dtool`` as a meta package.
--------------------------------------------------------------------------------