├── .gitignore
├── Gemfile
├── LICENSE.txt
├── README.md
└── git-scraper-extractor


/.gitignore:
--------------------------------------------------------------------------------
Gemfile.lock

--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
# frozen_string_literal: true
source "https://rubygems.org"


gem "rugged", "~> 1.1.0"
gem "tty-prompt", "~> 0.23.0"
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2021 Drew Breunig

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# git-scraper-extractor

`git-scraper-extractor` is a handy tool for your gitscraping repositories.

What is [gitscraping](https://simonwillison.net/2020/Oct/9/git-scraping/)? We'll let Simon Willison, who coined the term, explain:

>The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data—The @nyt_diff Twitter account tracks changes made to New York Times headlines for example, which offers a fascinating insight into that publication’s editorial process.
>
>We already have a great tool for efficiently tracking changes to text over time: Git. And GitHub Actions (and other CI systems) make it easy to create a scraper that runs every few minutes, records the current state of a resource and records changes to that resource over time in the commit history.

`git-scraper-extractor` is a little tool for extracting multiple versions of a file from your Git repository into separate, timestamped files. After your gitscraping repository has been updating a JSON or CSV file for a while, use `git-scraper-extractor` to walk the commit history and write the version of that file from each commit into a separate file. Then load those files into the tool of your choice.

## Usage

It's simple.
Clone this repo, `cd` into the directory and run:

`$ bundle install`

`$ ./git-scraper-extractor /path/to/repo /path/to/output`

--------------------------------------------------------------------------------
/git-scraper-extractor:
--------------------------------------------------------------------------------
#!/usr/bin/env ruby

# Load the gems pinned in the Gemfile, plus the stdlib helper used below.
require 'bundler/setup'
require 'fileutils'
require 'rugged'
require 'tty-prompt'

#
# Option Handling
#
def exit_with_help
  puts "Usage: git-scraper-extractor /path/to/repo /path/to/output"
  exit
end

# Get args
if ARGV.empty? || (ARGV[0] == "--help")
  exit_with_help
end
repo_path = ARGV.shift
dest_path = ARGV.first ? ARGV.shift : "./"

# Validate and/or create the output path
unless File.directory?(dest_path)
  if File.exist?(dest_path)
    puts "The target output destination exists but is not a directory."
    exit_with_help
  end
  FileUtils.mkdir_p(dest_path)
end

# Check for input validity
unless File.directory?(repo_path)
  exit_with_help
end

#
# Load the repo
#
repo = Rugged::Repository.discover(repo_path)
head_ref = repo.head

# Get the files that might be interesting.
# Only top-level blobs are considered; the dots are escaped and the pattern is
# anchored so that only real file extensions match.
interesting_files = []
head_ref.target.tree.each_blob do |blob|
  interesting_files << blob[:name] if /\.(json|csv|tsv|txt|yaml)\z/.match?(blob[:name])
end

# Ask the user which they want to extract
all_option_string = "All of the above"
prompt = TTY::Prompt.new
selections = prompt.multi_select("What files would you like to extract? (Use space to select)", [interesting_files, all_option_string].flatten)
selections = interesting_files if selections.include?(all_option_string)

#
# Extract the files
#
walker = Rugged::Walker.new(repo)
walker.push(head_ref.target_id)
walker.each do |commit|
  # One output file per commit, e.g. data_2021-03-04T05:06.json
  timestamp = commit.time.strftime("%FT%R")
  commit.tree.each_blob do |o|
    selections.each do |selected_filename|
      filetype = File.extname(selected_filename)
      # Strip the extension so it is not repeated in the output filename
      basename = File.basename(selected_filename, filetype)
      if selected_filename == o[:name]
        obj = repo.lookup(o[:oid])
        filename = "#{basename}_#{timestamp}#{filetype}"
        File.open(File.join(dest_path, filename), "w") { |f| f.write(obj.read_raw.data) }
      end
    end
  end
end

--------------------------------------------------------------------------------
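
The file-matching and file-naming steps of the script can be tried on their own, without a repository or the `rugged` gem. The sketch below is only an illustration: `DATA_FILE_PATTERN`, `data_file?` and `versioned_filename` are hypothetical names that are not part of the tool. It assumes extensions are matched with escaped dots anchored at the end of the name, and that the timestamp is placed between the file stem and its extension.

```ruby
# Hypothetical helpers for illustration only; they are not part of the script.

# Assumed pattern: match only names that end in one of the data extensions.
DATA_FILE_PATTERN = /\.(json|csv|tsv|txt|yaml)\z/

# True when the name ends in one of the extensions above.
def data_file?(name)
  DATA_FILE_PATTERN.match?(name)
end

# Build "<stem>_<timestamp><ext>" using the script's "%FT%R" time format.
def versioned_filename(name, time)
  ext = File.extname(name)
  stem = File.basename(name, ext)
  "#{stem}_#{time.strftime('%FT%R')}#{ext}"
end

puts data_file?("data.json")   # prints "true"
puts data_file?("datajson")    # prints "false"
puts versioned_filename("data.json", Time.utc(2021, 3, 4, 5, 6))
# prints "data_2021-03-04T05:06.json"
```

Because `%R` puts a colon in the name, these filenames may need a different time format on filesystems that do not allow colons.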