├── .dvc
│   ├── .gitignore
│   └── config
├── README.md
├── blog
│   ├── .gitignore
│   └── cats-dogs.dvc
├── dvc-course
│   ├── .gitignore
│   └── hymenoptera_data.dvc
├── dvc.yaml
├── get-started
│   ├── .gitignore
│   └── data.xml.dvc
├── github-issues
│   ├── .gitignore
│   └── dataset.parquet.dvc
├── images
│   ├── .gitignore
│   ├── dvc-logo-outlines.png.dvc
│   ├── owl_sticker.png.dvc
│   └── owl_sticker.svg.dvc
├── tutorials
│   ├── nlp
│   │   ├── .gitignore
│   │   ├── Posts.xml.zip.dvc
│   │   └── pipeline.zip.dvc
│   └── versioning
│       ├── .gitignore
│       ├── data.zip.dvc
│       └── new-labels.zip.dvc
├── use-cases
│   ├── .gitignore
│   ├── cats-dogs.dvc
│   └── pool_data.dvc
└── workshop
    ├── .gitignore
    ├── README.md
    ├── dvc_discord_channel.csv.dvc
    └── satellite-data.dvc

/.dvc/.gitignore:
--------------------------------------------------------------------------------
1 | /lock
2 | /config.local
3 | /updater
4 | /updater.lock
5 | /state-journal
6 | /state-wal
7 | /state
8 | /cache
9 | /tmp
10 |
--------------------------------------------------------------------------------
/.dvc/config:
--------------------------------------------------------------------------------
1 | [core]
2 |     remote = storage
3 | ['remote "storage"']
4 |     url = https://remote.dvc.org/dataset-registry
5 | ['remote "docs/dvc"']
6 |     url = https://remote.dvc.org/docs/dvc
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # DVC Dataset Registry
2 |
3 | This _[DVC Data Registry]_ is a centralized place to manage raw data files for
4 | use in other example DVC projects, such as
5 | https://github.com/iterative/example-get-started.
6 |
7 | [dvc data registry]: https://dvc.org/doc/use-cases/data-registry
8 |
9 | ## Installation
10 |
11 | Start by cloning the project:
12 |
13 | ```console
14 | $ git clone https://github.com/iterative/dataset-registry
15 | $ cd dataset-registry
16 | ```
17 |
18 | This DVC project comes with a preconfigured DVC
19 | [remote storage](https://man.dvc.org/remote) to hold all of the datasets. This
20 | is a read-only HTTP remote.
21 |
22 | ```console
23 | $ dvc remote list
24 | storage https://remote.dvc.org/dataset-registry
25 | ```
26 |
27 | **Important**: To be able to push to the default remote, overwrite it with:
28 |
29 | ```console
30 | $ dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry
31 | ```
32 |
33 | > This requires having configured corresponding S3 credentials locally.
34 |
35 | ## Testing data synchronization locally
36 |
37 | If you'd like to test commands like [`dvc push`](https://man.dvc.org/push),
38 | which require write access to the remote storage, the easiest way is to set
39 | up a "local remote" on your file system:
40 |
41 | > This kind of remote is located in the local file system, but is external to
42 | > the DVC project.
43 |
44 | ```console
45 | $ mkdir -p /tmp/dvc-storage
46 | $ dvc remote add local /tmp/dvc-storage
47 | ```
48 |
49 | You should now be able to run:
50 |
51 | ```console
52 | $ dvc push -r local
53 | ```
54 |
55 | ## Datasets
56 |
57 | The folder structure of this project groups datasets by the
58 | external projects they pertain to.
59 | After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
60 | under DVC control, the workspace should look like this:
61 |
62 |
63 | ```console
64 | $ tree
65 | .
66 | ├── README.md
67 | ├── get-started
68 | │   └── data.xml.dvc # Dataset used in iterative/example-get-started
69 | ├── mnist
70 | │   └── raw.dvc # Dataset used in iterative/dvc-get-started
71 | └── fashion-mnist
72 |     └── raw.dvc # Dataset used in iterative/dvc-get-started
73 | ```
74 |
--------------------------------------------------------------------------------
/blog/.gitignore:
--------------------------------------------------------------------------------
1 | /cats-dogs
2 |
--------------------------------------------------------------------------------
/blog/cats-dogs.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - md5: 84453a3661e17405f48087e6cf409c51.dir
3 |   size: 12368201
4 |   nfiles: 542
5 |   path: cats-dogs
6 |   hash: md5
7 |
--------------------------------------------------------------------------------
/dvc-course/.gitignore:
--------------------------------------------------------------------------------
1 | /hymenoptera_data
2 |
--------------------------------------------------------------------------------
/dvc-course/hymenoptera_data.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - md5: 70257f9e4b0b3ec72deeed5e1b880c3b.dir
3 |   size: 47404368
4 |   nfiles: 401
5 |   path: hymenoptera_data
6 |   hash: md5
7 |
--------------------------------------------------------------------------------
/dvc.yaml:
--------------------------------------------------------------------------------
1 | artifacts:
2 |   get-started-data:
3 |     path: get-started/data.xml
4 |     type: dataset
5 |     desc: 'Stack Overflow questions'
6 |     labels:
7 |       - nlp
8 |       - classification
9 |
--------------------------------------------------------------------------------
/get-started/.gitignore:
--------------------------------------------------------------------------------
1 | /data.xml
2 |
--------------------------------------------------------------------------------
/get-started/data.xml.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - path: data.xml
3 |   md5: 22a1a2931c8370d3aeedd7183606fd7f
4 |   size: 14445097
5 |   desc: 10K StackOverflow posts (1/3 is R lang)
6 |   remote: docs/dvc
7 |   hash: md5
8 |
--------------------------------------------------------------------------------
/github-issues/.gitignore:
--------------------------------------------------------------------------------
1 | /dataset.parquet
2 |
--------------------------------------------------------------------------------
/github-issues/dataset.parquet.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - md5: 572e23808b1fd6fe6b2b628feebc890c.dir
3 |   size: 12731887
4 |   nfiles: 1254
5 |   path: dataset.parquet
6 |   hash: md5
7 |
--------------------------------------------------------------------------------
/images/.gitignore:
--------------------------------------------------------------------------------
1 | /dvc-logo-outlines.png
2 | /owl_sticker.png
3 | /owl_sticker.svg
4 |
--------------------------------------------------------------------------------
/images/dvc-logo-outlines.png.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - path: dvc-logo-outlines.png
3 |   md5: 5846bf188572bbc78a10124f36e92631
4 |   size: 35981
5 |   hash: md5
6 |
--------------------------------------------------------------------------------
/images/owl_sticker.png.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - path: owl_sticker.png
3 |   md5: fa8b9e82c893eb401f30c353bd550ada
4 |   size: 120357
5 |   hash: md5
6 |
--------------------------------------------------------------------------------
/images/owl_sticker.svg.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - path: owl_sticker.svg
3 |   md5: 52e64ff3607484b4e0d798a2f61b0cc7
4 |   size: 5920
5 |   hash: md5
6 |
--------------------------------------------------------------------------------
/tutorials/nlp/.gitignore:
--------------------------------------------------------------------------------
1 | /pipeline.zip
2 | /Posts.xml.zip
3 |
--------------------------------------------------------------------------------
/tutorials/nlp/Posts.xml.zip.dvc:
--------------------------------------------------------------------------------
1 | wdir: ../..
2 | outs:
3 | - path: tutorials/nlp/Posts.xml.zip
4 |   md5: ce68b98d82545628782c66192c96f2d2
5 |   size: 10567868
6 |   remote: docs/dvc
7 |   hash: md5
8 |
--------------------------------------------------------------------------------
/tutorials/nlp/pipeline.zip.dvc:
--------------------------------------------------------------------------------
1 | wdir: ../..
2 | outs:
3 | - path: tutorials/nlp/pipeline.zip
4 |   md5: 1d2070ee188fc5e4d94ad920e6cc82aa
5 |   size: 4687
6 |   remote: docs/dvc
7 |   hash: md5
8 |
--------------------------------------------------------------------------------
/tutorials/versioning/.gitignore:
--------------------------------------------------------------------------------
1 | /data.zip
2 | /new-labels.zip
3 |
--------------------------------------------------------------------------------
/tutorials/versioning/data.zip.dvc:
--------------------------------------------------------------------------------
1 | wdir: ../..
2 | outs:
3 | - path: tutorials/versioning/data.zip
4 |   md5: fa9c0eb4173d86695b4e800219651360
5 |   size: 41122310
6 |   remote: docs/dvc
7 |   hash: md5
8 |
--------------------------------------------------------------------------------
/tutorials/versioning/new-labels.zip.dvc:
--------------------------------------------------------------------------------
1 | wdir: ../..
2 | outs:
3 | - path: tutorials/versioning/new-labels.zip
4 |   md5: 2eaa473159443e75e6fb7b29e56c0787
5 |   size: 22950744
6 |   remote: docs/dvc
7 |   hash: md5
8 |
--------------------------------------------------------------------------------
/use-cases/.gitignore:
--------------------------------------------------------------------------------
1 | /cats-dogs
2 | /pool_data
3 |
--------------------------------------------------------------------------------
/use-cases/cats-dogs.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - path: cats-dogs
3 |   md5: 22e3f61e52c0ba45334d973244efc155.dir
4 |   size: 64128504
5 |   nfiles: 2800
6 |   remote: docs/dvc
7 |   hash: md5
8 |
--------------------------------------------------------------------------------
/use-cases/pool_data.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - md5: 14d187e749ee5614e105741c719fa185.dir
3 |   size: 18999874
4 |   nfiles: 183
5 |   hash: md5
6 |   path: pool_data
7 |
--------------------------------------------------------------------------------
/workshop/.gitignore:
--------------------------------------------------------------------------------
1 | /satellite-data
2 | /dvc_discord_channel.csv
3 |
--------------------------------------------------------------------------------
/workshop/README.md:
--------------------------------------------------------------------------------
1 | ## Dataset description
2 |
3 | The dataset is used in the [VS Code workshop](https://github.com/iterative/VSCode-DVC-Workshop)
4 | for predicting satellite kinematic orbit and running several experiments.
5 |
6 | The dataset contains information about RSOs (artificial objects that are in
7 | orbit around the Earth) and is used to predict orbit trajectories. Each row
8 | represents an orbit observation of a satellite. There are 600
9 | satellites that have multiple observations.
10 |
11 | ### Acknowledgments
12 |
13 | The dataset is published on
14 | [Kaggle](https://www.kaggle.com/datasets/idawoodjee/predict-the-positions-and-speeds-of-600-satellites)
15 | and was collected by the Russian Astronomical Science Centre.
--------------------------------------------------------------------------------
/workshop/dvc_discord_channel.csv.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - md5: c99631b236e0daa26c7658382c91c1f3
3 |   size: 9265543
4 |   hash: md5
5 |   path: dvc_discord_channel.csv
6 |
--------------------------------------------------------------------------------
/workshop/satellite-data.dvc:
--------------------------------------------------------------------------------
1 | outs:
2 | - md5: 04dd843f68ab4819fadd53fba5aacdc3.dir
3 |   size: 171226948
4 |   nfiles: 4
5 |   path: satellite-data
6 |   hash: md5
7 |
--------------------------------------------------------------------------------
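
The single-file `.dvc` pointers in this registry (for example `get-started/data.xml.dvc`) record a `path`, an `md5`, and a `size` for the tracked file. The sketch below shows one way a downloaded file might be checked against its pointer with standard tools. It rests on several assumptions that this repository does not state: that `hash: md5` means a plain MD5 of the file's bytes, that `md5sum` from GNU coreutils is available (macOS ships `md5` instead), and that the pointer has a single output with no `wdir` and no `.dir` suffix. The function name `check_dvc_md5` is hypothetical and not part of DVC.

```shell
#!/bin/sh
# check_dvc_md5 FILE.dvc
# Compare the md5 recorded in a single-file .dvc pointer with the MD5 of the
# tracked file that sits next to it.
# Assumptions (not taken from DVC itself): one output per pointer, no `wdir`,
# not a `.dir` entry, and `md5sum` is on PATH.
check_dvc_md5() {
    dvc_file=$1
    dir=$(dirname "$dvc_file")
    # Extract the first `path:` and `md5:` values from the YAML with sed.
    path=$(sed -n 's/^[ -]*path: *//p' "$dvc_file" | head -n 1)
    expected=$(sed -n 's/^[ -]*md5: *//p' "$dvc_file" | head -n 1)
    # Hash the tracked file and keep only the digest column.
    actual=$(md5sum "$dir/$path" | cut -d ' ' -f 1)
    if [ "$expected" = "$actual" ]; then
        echo "OK $path"
    else
        echo "MISMATCH $path"
        return 1
    fi
}
```

Once the data has been pulled, a call such as `check_dvc_md5 get-started/data.xml.dvc` would print `OK data.xml` if the assumptions above hold. Directory outputs (hashes ending in `.dir`) need a different approach, because their hash does not describe a single file.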