├── AUTHORS.md ├── LICENSE ├── README.md ├── codemeta.json ├── git-rdm ├── paper.bib ├── paper.md ├── requirements.txt └── setup.py /AUTHORS.md: -------------------------------------------------------------------------------- 1 | # Authors 2 | 3 | * Christian T. Jacobs 4 | * Alexandros Avdis 5 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2016 Christian T. Jacobs, Alexandros Avdis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 6 | 7 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 8 | 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 10 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Git-RDM 2 | 3 | Git-RDM is a Research Data Management (RDM) plugin for the [Git](https://git-scm.com/) version control system. 
It interfaces Git with data hosting services to manage the curation of version-controlled files using persistent, citable repositories. This facilitates the sharing of research outputs and encourages a more open workflow within the research community. 4 | 5 | Much like the standard Git commands, Git-RDM allows users to add/remove files within a 'publication staging area'. When ready, users can publish these staged files from the command line to a data repository hosted by either Figshare or Zenodo. Details of the files and their associated publication(s) are then recorded in a local SQLite database, including the specific Git revision (in the form of a SHA-1 hash), publication date/time, and the DOI, such that a full history of data publication is maintained. 6 | 7 | ## Dependencies 8 | 9 | Git-RDM mostly relies on the Python standard library and, of course, Git. However, two extra modules are needed: 10 | 11 | * [GitPython](https://gitpython.readthedocs.io), to access the Git repository's information. 12 | * [PyRDM](https://github.com/pyrdm/pyrdm), to handle the publishing of files. 13 | 14 | Both of these dependencies can be installed via `pip` using 15 | 16 | ``` 17 | sudo pip install -r requirements.txt 18 | ``` 19 | 20 | Note that once PyRDM is installed, you will need to set up Figshare/Zenodo authentication tokens and copy them into the PyRDM configuration file in order to publish your data. See the [PyRDM documentation](https://pyrdm.readthedocs.io/en/latest/getting_started.html) for instructions on how to do this. 
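Before installing Git-RDM itself, it can be useful to confirm that both extra modules are importable. The following is a minimal sketch and not part of Git-RDM: the helper name `missing_modules` is hypothetical, and the import names `git` (GitPython) and `pyrdm` (PyRDM) are taken from the imports at the top of the `git-rdm` script.

```python
def missing_modules(names):
    """Return the subset of `names` that cannot be imported.

    Hypothetical helper for illustration; it is not provided by Git-RDM.
    """
    missing = []
    for name in names:
        try:
            # Attempt the import; only a failed import counts as "missing".
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # 'git' is the import name of GitPython, 'pyrdm' that of PyRDM.
    missing = missing_modules(["git", "pyrdm"])
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("GitPython and PyRDM are importable.")
```

Run the check with the same Python interpreter that will run `git-rdm`, since modules installed for one interpreter are not necessarily visible to another.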
21 | 22 | ## Installing 23 | 24 | After downloading or cloning this software using 25 | 26 | ``` 27 | git clone https://github.com/ctjacobs/git-rdm.git 28 | ``` 29 | 30 | a system-wide installation can be achieved by navigating to the git-rdm directory 31 | 32 | ``` 33 | cd git-rdm 34 | ``` 35 | 36 | and running 37 | 38 | ``` 39 | sudo python setup.py install 40 | ``` 41 | 42 | Alternatively, a local user installation can be achieved using 43 | 44 | ``` 45 | python setup.py install --prefix=/path/to/custom/install/directory 46 | ``` 47 | 48 | and adding `/path/to/custom/install/directory/bin` to the `PATH` environment variable: 49 | 50 | ``` 51 | export PATH=$PATH:/path/to/custom/install/directory/bin 52 | ``` 53 | 54 | Once Git-RDM is installed, Git should automatically detect the plugin and recognise the `rdm` command; for example, run `git rdm -h` to list the RDM-related subcommands described in the Usage section below. 55 | 56 | ## Usage 57 | 58 | The Git-RDM plugin comes with several subcommands. The following subsections demonstrate, with examples, how to use each of them. 59 | 60 | ### git rdm init 61 | 62 | In order to start using Git-RDM, the command `git rdm init` must first be run within the Git repository containing the data files to be published. This creates a new directory called `.rdm` containing a database file `publications.db`. All data publication details are stored within this file. Note that this command is similar to `git init` which initialises a new Git repository and creates the `.git` control directory. As an example, consider the `test` directory below, containing files `test1.txt`, `test2.txt` and `test3.png`: 63 | 64 | ``` 65 | ~/test $ git rdm init 66 | ~/test $ ls -lrta 67 | total 68 68 | drwx------ 60 christian christian 20480 Jun 12 23:39 .. 
69 | -rw-r--r-- 1 christian christian 5 Jun 12 23:39 test1.txt 70 | -rw-r--r-- 1 christian christian 5 Jun 12 23:39 test2.txt 71 | -rw-r--r-- 1 christian christian 5 Jun 12 23:39 test3.png 72 | drwxr-xr-x 7 christian christian 4096 Jun 12 23:40 .git 73 | drwxr-xr-x 4 christian christian 4096 Jun 12 23:40 . 74 | drwxr-xr-x 2 christian christian 4096 Jun 12 23:40 .rdm 75 | ``` 76 | 77 | ### git rdm add 78 | 79 | Once the RDM database has been initialised, data files may be added to the 'publication staging area' using `git rdm add` as follows: 80 | 81 | ``` 82 | ~/test $ git rdm add test* 83 | ~/test $ git rdm ls 84 | git-rdm INFO: Files staged for publishing: 85 | git-rdm INFO: /home/christian/test/test1.txt 86 | git-rdm INFO: /home/christian/test/test3.png 87 | git-rdm INFO: /home/christian/test/test2.txt 88 | ``` 89 | 90 | The file being added for publication must first have been committed within the Git repository, otherwise Git-RDM will refuse to add it. 91 | 92 | ### git rdm rm 93 | 94 | Files can also be removed from the publication staging area using `git rdm rm`: 95 | 96 | ``` 97 | ~/test $ git rdm rm test* 98 | ``` 99 | 100 | ### git rdm publish 101 | 102 | Once all the files are ready to be published, the `git rdm publish` command can be used to publish the files to a data repository hosted by a particular service. The hosting service must be specified as an argument, and can be either `figshare` or `zenodo`. Support for new services can be readily added by extending the [PyRDM library](https://pyrdm.readthedocs.io). Some basic publication information is obtained from the user, for example the title, description, and keyword metadata. PyRDM then interfaces with the hosting service and publishes the data files: 103 | 104 | ``` 105 | ~/test $ git rdm publish figshare 106 | Private publication? (y/n): y 107 | git-rdm INFO: Publishing as a private repository... 
108 | Title: Test Article 109 | Description: Testing 110 | Tags/keywords (in list format ["a", "b", "c"]): ["hello", "world"] 111 | pyrdm.figshare INFO: Testing Figshare authentication... 112 | pyrdm.figshare DEBUG: Server returned response 200 113 | pyrdm.figshare INFO: Authentication test successful. 114 | 115 | pyrdm.publisher INFO: Publishing data... 116 | pyrdm.publisher INFO: Creating new fileset... 117 | pyrdm.publisher INFO: Adding category... 118 | pyrdm.publisher INFO: Fileset created with ID: 3428222 and DOI: 10.6084/m9.figshare.3428222 119 | pyrdm.publisher DEBUG: The following files have been marked for uploading: ['/home/christian/test/test1.txt', '/home/christian/test/test3.png', '/home/christian/test/test2.txt'] 120 | pyrdm.publisher INFO: Uploading /home/christian/test/test1.txt... 121 | pyrdm.publisher INFO: Uploading /home/christian/test/test3.png... 122 | pyrdm.publisher INFO: Uploading /home/christian/test/test2.txt... 123 | pyrdm.publisher INFO: All files successfully uploaded. 124 | ``` 125 | 126 | The publication information is stored in the local database, and can be viewed using `git rdm ls`. Note that Git-RDM currently publishes the files using the current `HEAD` revision of the Git repository, and not the revision at which the files were first added using `git rdm add`. 127 | 128 | ### git rdm ls 129 | 130 | `git rdm ls` is used to list and keep track of which data files have been published, and which files are still in the staging area. 
By default, each file is listed, followed by any DOIs associated with it: 131 | 132 | ``` 133 | ~/test $ git rdm ls 134 | git-rdm INFO: Published files: 135 | git-rdm INFO: /home/christian/test/test1.txt 136 | git-rdm INFO: 10.6084/m9.figshare.3428222 (2016-06-13 @ 00:29:03, revision '1eeccabba810b8c91eef82e692713fdb05ca4a32') 137 | git-rdm INFO: /home/christian/test/test2.txt 138 | git-rdm INFO: 10.6084/m9.figshare.3428222 (2016-06-13 @ 00:29:03, revision '1eeccabba810b8c91eef82e692713fdb05ca4a32') 139 | git-rdm INFO: /home/christian/test/test3.png 140 | git-rdm INFO: 10.6084/m9.figshare.3428222 (2016-06-13 @ 00:29:03, revision '1eeccabba810b8c91eef82e692713fdb05ca4a32') 141 | ``` 142 | 143 | Users can also choose to list each DOI first, followed by the files associated with it: 144 | 145 | ``` 146 | ~/test $ git rdm ls --by-doi 147 | git-rdm INFO: Published files: 148 | git-rdm INFO: 10.6084/m9.figshare.3428222 149 | git-rdm INFO: /home/christian/test/test1.txt (2016-06-13 @ 00:29:03, revision '1eeccabba810b8c91eef82e692713fdb05ca4a32') 150 | git-rdm INFO: /home/christian/test/test3.png (2016-06-13 @ 00:29:03, revision '1eeccabba810b8c91eef82e692713fdb05ca4a32') 151 | git-rdm INFO: /home/christian/test/test2.txt (2016-06-13 @ 00:29:03, revision '1eeccabba810b8c91eef82e692713fdb05ca4a32') 152 | ``` 153 | 154 | To check the raw, unformatted contents of the entire publications database, use the `--raw` flag: 155 | 156 | ``` 157 | ~/test $ git rdm ls --raw 158 | git-rdm INFO: Database dump: 159 | git-rdm INFO: id, path, date, time, sha, pid, doi 160 | git-rdm INFO: 13, /home/christian/test/test1.txt, 2016-06-13, 00:29:03.016951, 1eeccabba810b8c91eef82e692713fdb05ca4a32, 3428222, 10.6084/m9.figshare.3428222 161 | git-rdm INFO: 14, /home/christian/test/test3.png, 2016-06-13, 00:29:03.016951, 1eeccabba810b8c91eef82e692713fdb05ca4a32, 3428222, 10.6084/m9.figshare.3428222 162 | git-rdm INFO: 15, /home/christian/test/test2.txt, 2016-06-13, 
00:29:03.016951, 1eeccabba810b8c91eef82e692713fdb05ca4a32, 3428222, 10.6084/m9.figshare.3428222 163 | ``` 164 | 165 | ### git rdm show 166 | 167 | The full publication record maintained by the data repository service can be shown using `git rdm show`. It expects two arguments: the name of the hosting service (`figshare` or `zenodo`) and the publication ID. For example, for the publication whose Figshare publication ID is 3428222 (and DOI is `10.6084/m9.figshare.3428222`), the (truncated) output is: 168 | 169 | ``` 170 | ~/test $ git rdm show figshare 3428222 171 | pyrdm.figshare INFO: Testing Figshare authentication... 172 | pyrdm.figshare DEBUG: Server returned response 200 173 | pyrdm.figshare INFO: Authentication test successful. 174 | 175 | git-rdm INFO: { 176 | "authors": [ 177 | { 178 | "full_name": "Christian T. Jacobs", 179 | "id": 554577, 180 | "is_active": true, 181 | "orcid_id": "0000-0002-0034-4650", 182 | "url_name": "Christian_T_Jacobs" 183 | } 184 | ], 185 | "categories": [ 186 | { 187 | "id": 2, 188 | "title": "Uncategorized" 189 | } 190 | ], 191 | "citation": "Jacobs, Christian T. (): Test Article. figshare.\n 10.6084/m9.figshare.3428222\n Retrieved: 23 32, Jun 12, 2016 (GMT)", 192 | "confidential_reason": "", 193 | "created_date": "2016-06-12T23:28:54Z", 194 | "custom_fields": [], 195 | "defined_type": 4, 196 | "description": "Testing", 197 | "doi": "10.6084/m9.figshare.3428222", 198 | ``` 199 | 200 | ## License 201 | This software is released under the MIT license. See the file called `LICENSE` for more information. 202 | 203 | ## Citing 204 | 205 | If you use Git-RDM during the course of your research, please consider citing the following paper: 206 | 207 | * C. T. Jacobs, A. Avdis (2016). Git-RDM: A research data management plugin for the Git version control system. 
*The Journal of Open Source Software*, 1(2), DOI: [10.21105/joss.00029](http://dx.doi.org/10.21105/joss.00029) 208 | 209 | ## Contact 210 | Please send any questions or comments about Git-RDM via email to . 211 | 212 | Any bugs should be reported using the project's [issue tracker](http://github.com/ctjacobs/git-rdm/issues). If possible, please run Git-RDM with debugging enabled using the `-d` flag after `git rdm` (e.g. `git rdm -d publish figshare`) and provide the full output. 213 | 214 | Contributions are welcome and should be made via a pull request. 215 | -------------------------------------------------------------------------------- /codemeta.json: -------------------------------------------------------------------------------- 1 | { 2 | "@context": "https://raw.githubusercontent.com/mbjones/codemeta/master/codemeta.jsonld", 3 | "@type": "Code", 4 | "author": [ 5 | 6 | ], 7 | "identifier": "", 8 | "codeRepository": "https://github.com/ctjacobs/git-rdm", 9 | "datePublished": "2016-06-19", 10 | "dateModified": "2016-06-19", 11 | "dateCreated": "2016-06-19", 12 | "description": "Git-RDM is a research data management plugin for the Git version control system.", 13 | "keywords": "research data management, git, version control, digital curation, data, publishing, figshare, zenodo", 14 | "license": "MIT", 15 | "title": "Git-RDM", 16 | "version": "v1.0.1" 17 | } -------------------------------------------------------------------------------- /git-rdm: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Git-RDM is released under the MIT license. 4 | 5 | # The MIT License (MIT) 6 | 7 | # Copyright (c) 2016 Christian T. 
Jacobs, Alexandros Avdis 8 | 9 | # Permission is hereby granted, free of charge, to any person obtaining a 10 | # copy of this software and associated documentation files (the 11 | # "Software"), to deal in the Software without restriction, including 12 | # without limitation the rights to use, copy, modify, merge, publish, 13 | # distribute, sublicense, and/or sell copies of the Software, and to 14 | # permit persons to whom the Software is furnished to do so, subject to 15 | # the following conditions: 16 | 17 | # The above copyright notice and this permission notice shall be included 18 | # in all copies or substantial portions of the Software. 19 | 20 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS 21 | # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 22 | # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 23 | # IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY 24 | # CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, 25 | # TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 26 | # SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 27 | 28 | import os, os.path, sys 29 | import itertools 30 | import argparse 31 | import git 32 | import logging 33 | import sqlite3 as sqlite 34 | import datetime 35 | import glob 36 | import json 37 | import subprocess 38 | 39 | _LOG = logging.getLogger(__name__) 40 | _HANDLER = logging.StreamHandler() 41 | _LOG.addHandler(_HANDLER) 42 | _HANDLER.setFormatter(logging.Formatter('%(module)s %(levelname)s: %(message)s')) 43 | del(_HANDLER) 44 | _LOG.setLevel(logging.INFO) 45 | 46 | try: 47 | from pyrdm.publisher import Publisher 48 | except ImportError: 49 | _LOG.exception("Could not import the PyRDM library necessary for research data management.") 50 | sys.exit(1) 51 | 52 | PUBLICATIONS_TABLE = "publications" 53 | 54 | class GitRDM: 55 | 56 | """ The Git-RDM plugin for the Git version control system. 
""" 57 | 58 | def __init__(self): 59 | """ Open the Git repository. """ 60 | 61 | try: 62 | self.repo = git.Repo(".", search_parent_directories=True) 63 | _LOG.debug("The Git working directory is: %s" % self.repo.working_dir) 64 | except git.InvalidGitRepositoryError: 65 | _LOG.exception("Not in a Git version controlled repository.") 66 | sys.exit(1) 67 | 68 | return 69 | 70 | def initialise(self): 71 | """ Initialise the RDM control directory and set up the SQL database of published files (and files staged for publication). """ 72 | 73 | # Create the RDM control directory if it doesn't already exist. 74 | rdm_control_directory = self.repo.working_dir + "/.rdm" 75 | if not os.path.exists(rdm_control_directory): 76 | os.makedirs(rdm_control_directory) 77 | 78 | # Set up the SQLite database. 79 | self.db_connect() 80 | 81 | # If this file already exists, then skip this step. 82 | if self.db_exists(): 83 | response = raw_input("The publications database already exists. Do you want to overwrite it? (y/n)\n") 84 | if response == "y" or response == "Y": 85 | _LOG.info("Overwriting...") 86 | with self.connection: 87 | c = self.connection.cursor() 88 | query = "DROP TABLE %s" % PUBLICATIONS_TABLE 89 | c.execute(query) 90 | elif response == "n" or response == "N": 91 | _LOG.info("Not overwriting.") 92 | return 93 | else: 94 | _LOG.error("Unknown response '%s'. Not overwriting." % response) 95 | return 96 | 97 | # Set up publication table columns. 98 | with self.connection: 99 | c = self.connection.cursor() 100 | query = "CREATE TABLE %s (id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT, date TEXT, time TEXT, sha TEXT, pid TEXT, doi TEXT)" % PUBLICATIONS_TABLE 101 | c.execute(query) 102 | 103 | # Disconnect. 104 | self.db_disconnect() 105 | 106 | return 107 | 108 | def add(self, paths): 109 | """ Add a desired file to the 'publication' staging area. 110 | 111 | :arg paths: A string, or list of strings, of absolute or relative paths to files to be added. 
112 | """ 113 | 114 | self.db_connect() 115 | 116 | # Expand and get the absolute paths 117 | expanded_paths = self.expand_paths(paths) 118 | _LOG.debug("Expanded paths: %s", str(expanded_paths)) 119 | 120 | # Check that the file/files being added is/are actually under version control. 121 | skipped = [] 122 | for f in expanded_paths: 123 | found = False 124 | for b in self.repo.tree().traverse(): 125 | if b.abspath == f: 126 | found = True 127 | break 128 | if not found: 129 | _LOG.error("Could not add file '%s' for publishing because it is not under Git version control. Skipping..." % f) 130 | skipped.append(f) 131 | continue 132 | 133 | # Remove the skipped files 134 | expanded_paths = list(set(expanded_paths).difference(skipped)) 135 | _LOG.debug("Expanded paths after skipped files removed: %s", str(expanded_paths)) 136 | 137 | # Check that the file has not been added for publication already. If not, then add it to the database. 138 | for f in expanded_paths: 139 | with self.connection: 140 | query = "SELECT * FROM %s WHERE path=? AND doi IS NULL" % PUBLICATIONS_TABLE 141 | c = self.connection.cursor() 142 | c.execute(query, [f]) 143 | result = c.fetchall() 144 | if len(result) > 0: 145 | _LOG.warning("File '%s' has been added already. Skipping..." % os.path.abspath(f)) 146 | else: 147 | query = "INSERT INTO %s VALUES (NULL, ?, NULL, NULL, NULL, NULL, NULL)" % PUBLICATIONS_TABLE 148 | c = self.connection.cursor() 149 | c.execute(query, [f]) 150 | 151 | self.db_disconnect() 152 | return 153 | 154 | def rm(self, paths): 155 | """ Remove a desired file from the 'publication' staging area. 156 | 157 | :arg paths: A string, or list of strings, of absolute or relative paths to files to be removed. 158 | """ 159 | 160 | self.db_connect() 161 | 162 | # Expand and get the absolute paths 163 | expanded_paths = self.expand_paths(paths) 164 | 165 | query = "DELETE FROM %s WHERE path=? 
AND doi IS NULL" % PUBLICATIONS_TABLE 166 | with self.connection: 167 | c = self.connection.cursor() 168 | for f in expanded_paths: 169 | c.execute(query, [f]) 170 | 171 | self.db_disconnect() 172 | 173 | return 174 | 175 | def ls(self, path=None, by_doi=True, raw=False): 176 | """ List all published files, and files to be published. If a file has multiple DOIs associated with it then they are all listed together. 177 | 178 | :arg path: The relative/absolute path to the file to be listed. 179 | :arg bool by_doi: List the files by DOI first (i.e. each DOI will be listed at the top level, with names of the files associated with that DOI underneath). 180 | :arg bool raw: Print out the entire SQL database table without any formatting. 181 | """ 182 | 183 | self.db_connect() 184 | 185 | # Raw database dump of all the rows 186 | if raw: 187 | _LOG.info("Database dump:") 188 | 189 | query = "SELECT * FROM %s" % PUBLICATIONS_TABLE 190 | 191 | with self.connection: 192 | c = self.connection.cursor() 193 | c.execute(query) 194 | result = c.fetchall() 195 | if len(result) > 0: 196 | # Column names 197 | names = list(map(lambda x: x[0], c.description)) 198 | _LOG.info(", ".join(names)) 199 | 200 | # Rows 201 | for r in result: 202 | values = [str(r[name]) for name in names] 203 | _LOG.info(", ".join(values)) 204 | 205 | return 206 | 207 | # If a path is provided, then specify all the publications/DOIs associated with that path. 208 | if path: 209 | path = os.path.abspath(path) # Get the absolute path 210 | query = "SELECT * FROM %s WHERE path=? AND doi IS NOT NULL" % PUBLICATIONS_TABLE 211 | with self.connection: 212 | c = self.connection.cursor() 213 | c.execute(query, [path]) 214 | result = c.fetchall() 215 | if len(result) > 0: 216 | _LOG.info(path) 217 | for r in result: 218 | _LOG.info("\t" + str(r["doi"]) + " (" + str(r["date"]) + " @ " + str(r["time"]).split(".")[0] + ", revision '%s')" % r["sha"]) 219 | return 220 | 221 | # List all published files. 
222 | if by_doi: 223 | query = "SELECT * FROM %s WHERE doi IS NOT NULL ORDER BY doi" % PUBLICATIONS_TABLE 224 | else: 225 | query = "SELECT * FROM %s WHERE doi IS NOT NULL ORDER BY path" % PUBLICATIONS_TABLE 226 | 227 | with self.connection: 228 | c = self.connection.cursor() 229 | c.execute(query) 230 | result = c.fetchall() 231 | 232 | if len(result) > 0: 233 | _LOG.info("Published files:") 234 | 235 | if by_doi: 236 | for doi, publication_iter in itertools.groupby(result, key=lambda r: r[len(r)-1]): 237 | publication_list = list(publication_iter) 238 | _LOG.info("\t" + doi) 239 | for p in publication_list: 240 | _LOG.info("\t\t" + str(p[1]) + " (" + str(p[2]) + " @ " + str(p[3]).split(".")[0] + ", revision '%s')" % p[4]) 241 | else: 242 | for path, publication_iter in itertools.groupby(result, key=lambda r: r[1]): 243 | publication_list = list(publication_iter) 244 | _LOG.info("\t" + path) 245 | for p in publication_list: 246 | _LOG.info("\t\t" + str(p[len(p)-1]) + " (" + str(p[2]) + " @ " + str(p[3]).split(".")[0] + ", revision '%s')" % p[4]) 247 | 248 | # List all files staged for publishing. 249 | query = "SELECT * FROM %s WHERE doi IS NULL" % PUBLICATIONS_TABLE 250 | with self.connection: 251 | c = self.connection.cursor() 252 | c.execute(query) 253 | result = c.fetchall() 254 | if len(result) > 0: 255 | _LOG.info("Files staged for publishing:") 256 | for r in result: 257 | _LOG.info("\t"+r["path"]) 258 | 259 | self.db_disconnect() 260 | 261 | return 262 | 263 | def show(self, service, pid): 264 | """ Show details of a particular publication. 265 | 266 | :arg str service: The service with which the repository is hosted. 267 | :arg pid: The publication ID. 
268 | """ 269 | 270 | publisher = Publisher(service=service) 271 | 272 | if service == "figshare": 273 | try: 274 | publication_details = publisher.figshare.get_article_details(article_id=pid) 275 | except Exception: # Fall back to looking up a private article. 276 | publication_details = publisher.figshare.get_article_details(article_id=pid, private=True) 277 | elif service == "zenodo": 278 | publication_details = publisher.zenodo.retrieve_deposition(deposition_id=int(pid)) 279 | else: 280 | _LOG.error("Unknown service '%s'" % service) 281 | return 282 | 283 | # Pretty print 284 | try: 285 | _LOG.info(json.dumps(publication_details, sort_keys=True, indent=4, separators=(',', ': '))) 286 | except (TypeError, ValueError): # Not JSON-serialisable; print as-is. 287 | _LOG.info(publication_details) 288 | 289 | return 290 | 291 | def publish(self, service, pid=None): 292 | """ Publish the desired files. 293 | 294 | :arg str service: The repository hosting service with which to publish the files. 295 | :arg pid: An (optional) existing publication ID. If this is not None, then a new version of the repository will be created. 296 | """ 297 | 298 | self.db_connect() 299 | 300 | # Find all files without a DOI (and assume these are in the publication staging area). 301 | with self.connection: 302 | query = "SELECT * FROM %s WHERE doi IS NULL" % PUBLICATIONS_TABLE 303 | c = self.connection.cursor() 304 | c.execute(query) 305 | to_publish = c.fetchall() 306 | 307 | if not to_publish: 308 | _LOG.warning("No files selected for publication.") 309 | return 310 | 311 | # Does the user need to commit any modified files first? 312 | modified_files = subprocess.check_output(['git', 'diff', '--name-only']).split() 313 | for i in range(len(modified_files)): 314 | # Get the absolute path 315 | modified_files[i] = self.repo.working_dir + "/" + modified_files[i] 316 | _LOG.debug("Modified files: %s" % str(modified_files)) 317 | 318 | # We only care if the uncommitted changes apply to files in the 'publishing staging area'. 
319 | overlap = False 320 | for f in to_publish: 321 | if f["path"] in modified_files: 322 | overlap = True 323 | if self.repo.is_dirty() and overlap: 324 | _LOG.error("Uncommitted changes exist in the repository. Please commit these changes before trying to publish any files.") 325 | return 326 | 327 | # Get the minimal amount of metadata needed to publish from the user. 328 | response = raw_input("Private publication? (y/n): ") 329 | if response == "y" or response == "Y": 330 | _LOG.info("Publishing as a private repository...") 331 | private = True 332 | elif response == "n" or response == "N": 333 | _LOG.info("Publishing as a public repository...") 334 | private = False 335 | else: 336 | _LOG.error("Unknown response '%s'. Not publishing." % response) 337 | return 338 | 339 | parameters = self.get_publication_parameters() 340 | 341 | # Publish to the repository hosting service. 342 | publisher = Publisher(service=service) 343 | pid, doi = publisher.publish_data(parameters, pid=pid, private=private) 344 | 345 | # Update the publications database by adding the DOIs and publication IDs to the previously-staged files. 346 | with self.connection: 347 | c = self.connection.cursor() 348 | query = "UPDATE %s SET doi=? WHERE doi IS NULL" % (PUBLICATIONS_TABLE) 349 | c.execute(query, [doi]) 350 | query = "UPDATE %s SET pid=? WHERE pid IS NULL" % (PUBLICATIONS_TABLE) 351 | c.execute(query, [pid]) 352 | query = "UPDATE %s SET date=? WHERE date IS NULL" % (PUBLICATIONS_TABLE) 353 | c.execute(query, [str(datetime.datetime.now().date())]) 354 | query = "UPDATE %s SET time=? WHERE time IS NULL" % (PUBLICATIONS_TABLE) 355 | c.execute(query, [str(datetime.datetime.now().time())]) 356 | query = "UPDATE %s SET sha=? 
WHERE sha IS NULL" % (PUBLICATIONS_TABLE) 357 | c.execute(query, [str(self.repo.head.object.hexsha)]) 358 | 359 | self.db_disconnect() 360 | 361 | return 362 | 363 | def get_publication_parameters(self): 364 | """ Prompt the user for the parameters required for publication. 365 | 366 | :rtype: dict 367 | :returns: The parameters required for publication. 368 | """ 369 | 370 | parameters = {} 371 | 372 | title = raw_input("Title: ") 373 | description = raw_input("Description: ") 374 | tags = raw_input("Tags/keywords (in list format [\"a\", \"b\", \"c\"]): ") 375 | if tags: 376 | tags = eval(tags) # NOTE: eval() executes whatever is typed; only enter a plain list literal. 377 | else: 378 | tags = [] 379 | 380 | # Get the list of file paths. 381 | files = self.db_select_unpublished() 382 | 383 | parameters = {"title":title, "description":description, "category":"Uncategorized", "tag_name":tags, "files":files} 384 | 385 | return parameters 386 | 387 | def db_connect(self): 388 | """ Create a connection to the publications database. """ 389 | 390 | _LOG.debug("Attempting to connect to publication database...") 391 | path = self.repo.working_dir + "/.rdm/publications.db" 392 | try: 393 | self.connection = sqlite.connect(path) 394 | self.connection.row_factory = sqlite.Row 395 | _LOG.debug("Connected successfully!") 396 | except sqlite.Error as e: 397 | _LOG.exception("Could not connect to the publication database. Check read permissions? Check that the .rdm directory exists? If it doesn't, run the 'git rdm init' command.") 398 | sys.exit(1) 399 | return 400 | 401 | def db_disconnect(self): 402 | """ Destroy the existing connection to the publications database. """ 403 | 404 | if(self.connection): 405 | self.connection.close() 406 | return 407 | 408 | def db_search_by_path(self, path): 409 | """ Return the first database row whose path matches the given path, or None if there is no match or a database error occurs. """ 410 | 411 | try: 412 | with self.connection: 413 | c = self.connection.cursor() 414 | query = "SELECT * FROM %s WHERE path=?" 
% PUBLICATIONS_TABLE 415 | c.execute(query, [path]) 416 | return c.fetchone() # This path is a unique absolute path. 417 | except sqlite.Error as e: 418 | _LOG.exception(e) 419 | return None 420 | 421 | def db_exists(self): 422 | """ Return True if the publications table exists in the database, otherwise return False. """ 423 | 424 | with self.connection: 425 | c = self.connection.cursor() 426 | c.execute("SELECT EXISTS(SELECT 1 FROM sqlite_master WHERE name=?)", [PUBLICATIONS_TABLE]) 427 | exists = c.fetchone() 428 | if(exists[0] == 1): 429 | return True 430 | else: 431 | return False 432 | 433 | def db_select_all(self): 434 | """ Return all rows of the database. 435 | 436 | :rtype: list 437 | :returns: All rows of the database. 438 | """ 439 | 440 | query = "SELECT * FROM %s" % PUBLICATIONS_TABLE 441 | with self.connection: 442 | c = self.connection.cursor() 443 | c.execute(query) 444 | return c.fetchall() 445 | 446 | def db_select_unpublished(self): 447 | """ Return the paths of all rows of the database in which the DOI field is NULL. 448 | 449 | :rtype: list 450 | :returns: The file paths of all rows of the database in which the DOI field is NULL. 451 | """ 452 | 453 | query = "SELECT * FROM %s WHERE doi IS NULL" % PUBLICATIONS_TABLE 454 | with self.connection: 455 | c = self.connection.cursor() 456 | c.execute(query) 457 | result = c.fetchall() 458 | 459 | paths = [] 460 | for r in result: 461 | paths.append(str(r["path"])) 462 | return paths 463 | 464 | def expand_paths(self, paths): 465 | """ Expand and get the absolute paths. 466 | 467 | :arg paths: A string, or list of strings, of absolute or relative paths to files. 468 | :rtype: list 469 | :returns: A list of expanded paths. 
470 | """ 471 | 472 | expanded_paths = [] 473 | if isinstance(paths, str): # A single path 474 | expanded = glob.glob(paths) 475 | for e in expanded: 476 | expanded_paths.append(os.path.abspath(e)) 477 | elif isinstance(paths, list): # Multiple paths 478 | for p in paths: 479 | expanded = glob.glob(p) 480 | for e in expanded: 481 | expanded_paths.append(os.path.abspath(e)) 482 | else: 483 | _LOG.error("Unknown input type for 'paths'; expected a string or a list of strings.") 484 | return expanded_paths 485 | 486 | if(__name__ == "__main__"): 487 | # Command line arguments 488 | parser = argparse.ArgumentParser(prog="git-rdm") 489 | parser.add_argument("-d", "--debug", action="store_true", default=False, help="Enable debugging.") 490 | 491 | # Subparsers 492 | subparsers = parser.add_subparsers(help="The subcommand of 'git rdm'.", dest='subcommand') 493 | 494 | # 'git rdm init' 495 | init_parser = subparsers.add_parser("init", help="Initialise the .rdm control directory and publication database.") 496 | 497 | # 'git rdm add' 498 | add_parser = subparsers.add_parser("add", help="Add a file to the publishing staging area.") 499 | add_parser.add_argument("path", nargs='+', help="The path(s) to the file(s) to be added.", action="store", type=str) 500 | 501 | # 'git rdm rm' 502 | rm_parser = subparsers.add_parser("rm", help="Remove a file from the publishing staging area.") 503 | rm_parser.add_argument("path", nargs='+', help="The path(s) to the file(s) to be removed.", action="store", type=str) 504 | 505 | # 'git rdm ls' 506 | ls_parser = subparsers.add_parser("ls", help="List the published files, and files staged for publishing.") 507 | ls_parser.add_argument("--path", required=False, help="The path to the file to list.", action="store", type=str, default=None) 508 | ls_parser.add_argument("--by-doi", help="Group files by DOI.", action="store_true", default=False, dest="by_doi") 509 | ls_parser.add_argument("--raw", help="Print out the entire SQL database table without any formatting.", 
                           action="store_true", default=False, dest="raw")

    # 'git rdm publish'
    publish_parser = subparsers.add_parser("publish", help="Publish the files in the publishing staging area.")
    publish_parser.add_argument("service", help="The service with which to publish.", action="store", type=str)
    publish_parser.add_argument("--pid", required=False, help="The ID of an existing publication. This will result in a new version of the publication being created. The DOI will stay the same.", action="store", default=None, type=str)

    # 'git rdm show'
    show_parser = subparsers.add_parser("show", help="Show details about a particular publication.")
    show_parser.add_argument("service", help="The hosting service of the publication.", action="store", type=str)
    show_parser.add_argument("pid", help="The ID of the publication.", action="store", type=str)

    # Parse all arguments
    args = parser.parse_args()

    # Output debugging messages to a file
    if(args.debug):
        logger = logging.getLogger(__name__)
        logger.setLevel(logging.DEBUG)

    # Execute the desired subcommand
    rdm = GitRDM()
    if args.subcommand == "init":
        rdm.initialise()
    elif args.subcommand == "add":
        rdm.add(args.path)
    elif args.subcommand == "rm":
        rdm.rm(args.path)
    elif args.subcommand == "ls":
        rdm.ls(path=args.path, by_doi=args.by_doi, raw=args.raw)
    elif args.subcommand == "publish":
        rdm.publish(service=args.service, pid=args.pid)
    elif args.subcommand == "show":
        rdm.show(service=args.service, pid=args.pid)
    else:
        _LOG.error("Unknown git-rdm subcommand '%s'" % args.subcommand)

--------------------------------------------------------------------------------
/paper.bib:
--------------------------------------------------------------------------------
% This file was created with JabRef 2.10b2.
% Encoding: ISO8859_1


@Book{ChaconStraub_2014,
  Title = {{Pro Git}},
  Author = {Chacon, S. and Straub, B.},
  Publisher = {Apress},
  Year = {2014},
  Edition = {2nd}
}

@Article{Jacobs_etal_2014,
  Title = {{PyRDM: A Python-based library for automating the management and online publication of scientific software and data}},
  Author = {Jacobs, C. T. and Avdis, A. and Gorman, G. J. and Piggott, M. D.},
  Journal = {{Journal of Open Research Software}},
  Year = {2014},
  Number = {1},
  Pages = {e28},
  Volume = {2},
  Doi = {10.5334/jors.bj}
}


@TECHREPORT{RS_2012,
  title = "Science as an open enterprise: Open data for open science",
  author = "{Royal Society}",
  month = jun,
  year = 2012
}


@TECHREPORT{RCUK_2015,
  title = "Guidance on best practice in the management of research data",
  author = "{Research Councils UK}",
  month = jul,
  year = 2015
}

@techreport{ISO26324_2012,
  author = {{Technical Committee ISO/TC 46 (Information and documentation), Subcommittee SC 9 (Identification and description)}},
  title = {{ISO 26324:2012 Information and documentation -- Digital object identifier system}},
  institution = {International Organization for Standardization},
  year = {2012}
}

--------------------------------------------------------------------------------
/paper.md:
--------------------------------------------------------------------------------
---
title: 'Git-RDM: A research data management plugin for the Git version control system'
tags:
  - Git
  - Research Data Management
  - plugin
  - version control
  - figshare
  - Zenodo
  - digital curation
  - Digital Object Identifiers
authors:
  - name: Christian T.
      Jacobs
    orcid: 0000-0002-0034-4650
    affiliation: University of Southampton
  - name: Alexandros Avdis
    orcid: 0000-0002-2695-3358
    affiliation: Imperial College London
date: 16 June 2016
bibliography: paper.bib
---

# Summary

Many research funding agencies [@RCUK_2015] and research societies [@RS_2012] are increasingly requiring that data from at least publicly funded research be made openly available, and with clear citations that describe provenance. These requirements have led to the proliferation of institutional repositories with universities maintaining a handful of data services, but also repository services capable of minting a persistent and citable Digital Object Identifier (DOI) [@ISO26324_2012] for every published item. Figshare (figshare.com) and Zenodo (zenodo.org) are examples of the latter. Alongside data, software is also increasingly seen as a research output. This viewpoint necessitates not just open-source publication of code, but also provenance and attribution. While a DOI is an identifier of static items, many research teams use version control systems and services to organise their collective efforts and publish output, be that code or data. Popular examples include Git [@ChaconStraub_2014] and GitHub (github.com).

Git-RDM is a Research Data Management (RDM) plugin for the Git version control system. It interfaces Git with data hosting services to manage the curation of version controlled files using persistent, citable repositories. This facilitates the sharing of research outputs and encourages a more open workflow within the research community.

Much like the standard Git commands, Git-RDM allows users to add/remove files within a 'publication staging area'. When ready, users can readily publish these staged files to a data repository hosted either by Figshare or Zenodo via the command line; this curation step is handled by the PyRDM library [@Jacobs_etal_2014].
Details of the files and their associated publication(s) are then recorded in a local SQLite database, including the specific Git revision (in the form of a SHA-1 hash), publication date/time, and the DOI, such that a full history of data publication is maintained.

# References

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
GitPython
-e git+https://github.com/pyrdm/pyrdm.git#egg=pyrdm

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

# Git-RDM is released under the MIT license.

# The MIT License (MIT)

# Copyright (c) 2016 Christian T. Jacobs, Alexandros Avdis

# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:

# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

from distutils.core import setup

setup(name='git-rdm',
      version='1.0.1',
      description='Git-RDM is a research data management plugin for the Git version control system.',
      author='Christian T. Jacobs, Alexandros Avdis',
      author_email='C.T.Jacobs@soton.ac.uk',
      url='https://github.com/ctjacobs/git-rdm',
      scripts=["git-rdm"],
      classifiers=[
          'Environment :: Console',
          'Intended Audience :: Science/Research',
          'Natural Language :: English',
          'Programming Language :: Python',
      ]
      )

--------------------------------------------------------------------------------