├── .gitignore
├── Gemfile
├── LICENSE.txt
├── README.md
├── Rakefile
├── bin
│   └── staticizer
├── lib
│   ├── staticizer.rb
│   └── staticizer
│       ├── command.rb
│       ├── crawler.rb
│       └── version.rb
├── staticizer.gemspec
└── tests
    ├── crawler_test.rb
    └── fake_page.html

/.gitignore:
--------------------------------------------------------------------------------
*.gem
*.rbc
.bundle
.config
.yardoc
Gemfile.lock
InstalledFiles
_yardoc
coverage
doc/
lib/bundler/man
pkg
rdoc
spec/reports
test/tmp
test/version_tmp
tmp

--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
source 'https://rubygems.org'

# Specify your gem's dependencies in staticizer.gemspec
gemspec

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
Copyright (c) 2014 Conor Hunt

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Staticizer

A tool to create a static version of a website for hosting on S3.

## Rationale

One of our clients needed a reliable emergency backup for a website. If the
website goes down, this backup is available with reduced functionality.

S3 and Route 53 provide a great way to host a static emergency backup for a website.
See this article: http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html.
In our experience it works well and is incredibly cheap. Our average-sized website,
with a few hundred pages and assets, costs less than US$1 a month to host.

We tried existing tools (httrack, wget) to crawl and create a static version
of the site to upload to S3, but we found that they did not work well with S3 hosting.
We wanted the site uploaded to S3 to respond to the *exact* same URLs (where possible) as
the existing site. This way, when the site goes down, incoming links from Google search
results etc. will still work.

## TODO

* Ability to specify AWS credentials via a file or environment variables
* Tests!
* Decide what to do with URLs that have query strings. Currently they are crawled and uploaded to S3, but those keys cannot be accessed directly, e.g. http://squaremill.com/file?test=1 is uploaded with the key `file?test=1`, which can only be accessed by encoding the ? as %3F (file%3Ftest=1)
* Create a 404 file on S3
* Provide the option to rewrite absolute URLs to relative URLs so that hosting can work on a different domain
* Multithread the crawler
* Check for too many redirects
* Provide regex options to control which URLs are scraped
* Better handling of incorrect server mime types (e.g. server returns text/plain for CSS instead of text/css)
* Provide more options for uploading (upload via scp, ftp, custom, etc.). Split out saving/uploading into an interface
* Handle large files in a more memory-efficient way by streaming uploads/downloads

## Installation

Add this line to your application's Gemfile:

    gem 'staticizer'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install staticizer

## Command line usage

Staticizer can be used through the command line tool or by requiring the library.

### Crawl a website and write to disk

    staticizer http://squaremill.com --output-dir=/tmp/crawl

### Crawl a website and upload to AWS

    staticizer http://squaremill.com --aws-s3-bucket=squaremill.com --aws-access-key=HJFJS5gSJHMDZDFFSSDQQ --aws-secret-key=HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s

### Crawl a website and allow several domains to be crawled

    staticizer http://squaremill.com --valid-domains=squaremill.com,www.squaremill.com,img.squaremill.com
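
### Crawl a website without writing any files

The flags above can be combined. As a rough sketch (the log path here is arbitrary), the following crawls a site and logs everything it fetches, including errors such as 404s, without writing anything to disk or S3:

    staticizer http://squaremill.com --skip-write --valid-domains=squaremill.com,www.squaremill.com --log-file=/tmp/staticizer.log -v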

## Code Usage

For all these examples you must first:

    require 'staticizer'

### Crawl a website and upload to AWS

This will only crawl URLs in the squaremill.com domain:

    s = Staticizer::Crawler.new("http://squaremill.com",
      :aws => {
        :region => "us-west-1",
        :endpoint => "http://s3.amazonaws.com",
        :bucket_name => "www.squaremill.com",
        :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
        :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
      }
    )
    s.crawl

### Crawl a website and write to disk

    s = Staticizer::Crawler.new("http://squaremill.com", :output_dir => "/tmp/crawl")
    s.crawl

### Crawl a website and make all pages contain a 'noindex' meta tag

    s = Staticizer::Crawler.new("http://squaremill.com",
      :output_dir => "/tmp/crawl",
      :process_body => lambda {|body, uri, opts|
        # not the best regex, but it will do for our use
        body = body.gsub(/<meta name="robots"[^>]+>/i, '')
        body = body.gsub(/<head>/i, "<head>\n<meta name=\"robots\" content=\"noindex\">")
        body
      }
    )
    s.crawl

### Crawl a website and rewrite all non-www URLs to www

    s = Staticizer::Crawler.new("http://squaremill.com",
      :aws => {
        :region => "us-west-1",
        :endpoint => "http://s3.amazonaws.com",
        :bucket_name => "www.squaremill.com",
        :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
        :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
      },
      :filter_url => lambda do |url, info|
        # Only crawl the URL if it matches squaremill.com or www.squaremill.com
        if url =~ %r{https?://(www\.)?squaremill\.com}
          # Rewrite non-www URLs to www
          return url.gsub(%r{https?://(www\.)?squaremill\.com}, "http://www.squaremill.com")
        end
        # returning nil here prevents the URL from being crawled
      end
    )
    s.crawl

## Crawler Options

* :aws - hash of connection options passed to the aws-sdk gem
* :filter_url - lambda called for each discovered URL; return the URL (it can be modified) to crawl it, or nil to skip it
* :output_dir - directory to write to when saving a site to disk; created if it does not exist
* :logger - a logger object responding to the usual Ruby Logger methods
* :log_level - log level, defaults to Logger::INFO
* :valid_domains - array of domains that should be crawled; URLs on domains not in this list are ignored
* :process_body - lambda called to pre-process the body of a page before it is written out
* :skip_write - don't write retrieved files to disk or S3, just crawl the site (can be used to find 404s etc.)
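
For example, a minimal sketch that combines several of these options to audit a site for broken links without writing anything (the log path is arbitrary):

    require 'staticizer'

    s = Staticizer::Crawler.new("http://squaremill.com",
      :skip_write    => true,                          # crawl only, don't save anything
      :valid_domains => ["squaremill.com", "www.squaremill.com"],
      :logger        => Logger.new("/tmp/staticizer_audit.log"),
      :log_level     => Logger::DEBUG                  # 404s and SSL failures are logged as errors
    )
    s.crawl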

## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

--------------------------------------------------------------------------------
/Rakefile:
--------------------------------------------------------------------------------
require "bundler/gem_tasks"
require 'rake/testtask'

Rake::TestTask.new do |t|
  t.libs << "tests"
  t.test_files = FileList['tests/*_test.rb']
end

--------------------------------------------------------------------------------
/bin/staticizer:
--------------------------------------------------------------------------------
#!/usr/bin/env ruby

lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)

require 'staticizer'
require 'staticizer/command'

options, initial_page = Staticizer::Command.parse(ARGV)
s = Staticizer::Crawler.new(initial_page, options)
s.crawl

--------------------------------------------------------------------------------
/lib/staticizer.rb:
--------------------------------------------------------------------------------
require_relative "staticizer/version"
require_relative 'staticizer/crawler'

module Staticizer
  def Staticizer.crawl(url, options = {}, &block)
    crawler = Staticizer::Crawler.new(url, options)
    crawler.crawl
  end
end

--------------------------------------------------------------------------------
/lib/staticizer/command.rb:
--------------------------------------------------------------------------------
require 'optparse'
require 'logger'

module Staticizer
  class Command
    # Parse command line arguments and print out any errors
    def Command.parse(args)
      options = {}
      initial_page = nil

      parser = OptionParser.new do |opts|
        opts.banner = "Usage: staticizer initial_url [options]\nExample: staticizer http://squaremill.com --output-dir=/tmp/crawl"

        opts.separator ""
        opts.separator "Specific options:"

        opts.on("--aws-s3-bucket [STRING]", "Name of S3 bucket to write to") do |v|
          options[:aws] ||= {}
          options[:aws][:bucket_name] = v
        end

        opts.on("--aws-region [STRING]", "AWS Region of S3 bucket") do |v|
          options[:aws] ||= {}
          options[:aws][:region] = v
        end

        opts.on("--aws-access-key [STRING]", "AWS Access Key ID") do |v|
          options[:aws] ||= {}
          options[:aws][:access_key_id] = v
        end

        opts.on("--aws-secret-key [STRING]", "AWS Secret Access Key") do |v|
          options[:aws] ||= {}
          options[:aws][:secret_access_key] = v
        end

        opts.on("-d", "--output-dir [DIRECTORY]", "Write crawl to disk in this directory, will be created if it does not exist") do |v|
          options[:output_dir] = v
        end

        opts.on("-v", "--verbose", "Run verbosely (sets log level to Logger::DEBUG)") do |v|
          options[:log_level] = Logger::DEBUG
        end

        opts.on("--log-level [NUMBER]", "Set log level 0 = most verbose to 4 = least verbose") do |v|
          options[:log_level] = v.to_i
        end

        opts.on("--log-file [PATH]", "Log file to write to") do |v|
          options[:logger] = Logger.new(v)
        end

        opts.on("--skip-write", "Don't write out files to disk or S3") do |v|
          options[:skip_write] = true
        end

        opts.on("--valid-domains x,y,z", Array, "Comma separated list of domains that should be crawled, other domains will be ignored") do |v|
          options[:valid_domains] = v
        end

        opts.on_tail("-h", "--help", "Show this message") do
          puts opts
          exit
        end
      end

      begin
        parser.parse!(args)
        initial_page = args.shift
        raise ArgumentError, "Need to specify an initial URL to start the crawl" unless initial_page
      rescue StandardError => e
        puts e
        puts parser
        exit(1)
      end

      return options, initial_page
    end
  end
end

--------------------------------------------------------------------------------
/lib/staticizer/crawler.rb:
--------------------------------------------------------------------------------
require 'net/http'
require 'fileutils'
require 'nokogiri'
require 'aws-sdk'
require 'logger'

module Staticizer
  class Crawler
    attr_reader :url_queue
    attr_accessor :output_dir

    def initialize(initial_page, opts = {})
      if initial_page.nil?
        raise ArgumentError, "Initial page required"
      end

      @opts = opts.dup
      @url_queue = []
      @processed_urls = []
      @output_dir = @opts[:output_dir] || File.expand_path("crawl/")
      @log = @opts[:logger] || Logger.new(STDOUT)
      @log.level = @opts[:log_level] || Logger::INFO

      if @opts[:aws]
        bucket_name = @opts[:aws].delete(:bucket_name)
        Aws.config.update(@opts[:aws])
        @s3_bucket = Aws::S3::Resource.new.bucket(bucket_name)
      end

      if @opts[:valid_domains].nil?
        uri = URI.parse(initial_page)
        @opts[:valid_domains] ||= [uri.host]
      end

      if @opts[:process_body]
        @process_body = @opts[:process_body]
      end

      add_url(initial_page)
    end

    def log_level
      @log.level
    end

    def log_level=(level)
      @log.level = level
    end

    def crawl
      @log.info("Starting crawl")
      while(@url_queue.length > 0)
        url, info = @url_queue.shift
        @processed_urls << url
        process_url(url, info)
      end
      @log.info("Finished crawl")
    end

    def extract_hrefs(doc, base_uri)
      doc.xpath("//a/@href").map {|href| make_absolute(base_uri, href) }
    end

    def extract_images(doc, base_uri)
      doc.xpath("//img/@src").map {|src| make_absolute(base_uri, src) }
    end

    def extract_links(doc, base_uri)
      doc.xpath("//link/@href").map {|href| make_absolute(base_uri, href) }
    end

    def extract_videos(doc, base_uri)
      doc.xpath("//video").map do |video|
        # Use a relative xpath so we only pick up sources belonging to this video tag
        sources = video.xpath(".//source/@src").map {|src| make_absolute(base_uri, src) }
        poster = video.attributes["poster"]
        poster = make_absolute(base_uri, poster.to_s) if poster
        [poster, sources]
      end.flatten.uniq.compact
    end

    def extract_scripts(doc, base_uri)
      doc.xpath("//script/@src").map {|src| make_absolute(base_uri, src) }
    end

    def extract_css_urls(css, base_uri)
      css.scan(/url\(\s*['"]?(.+?)['"]?\s*\)/).map {|src| make_absolute(base_uri, src[0]) }
    end

    def add_urls(urls, info = {})
      urls.compact.uniq.each {|url| add_url(url, info.dup) }
    end

    def make_absolute(base_uri, href)
      # base_uri may be a URI object or a string - normalize it and strip any query
      dup_uri = URI.parse(base_uri.to_s)
      dup_uri.query = nil
      if href.to_s =~ %r{\Ahttps?://}i
        href.to_s.gsub(" ", "+")
      else
        URI::join(dup_uri.to_s, href.to_s).to_s
      end
    rescue StandardError => e
      @log.error "Could not make absolute #{base_uri} - #{href}"
      nil
    end

    def add_url(url, info = {})
      if @opts[:filter_url]
        url = @opts[:filter_url].call(url, info)
        return if url.nil?
      else
        regex = "(#{@opts[:valid_domains].map {|d| Regexp.escape(d) }.join(")|(")})"
        return if url !~ %r{^https?://#{regex}}
      end

      url = url.sub(/#.*$/,'') # strip off any fragments
      return if @url_queue.index {|u| u[0] == url } || @processed_urls.include?(url)
      @url_queue << [url, info]
    end

    def save_page(response, uri)
      return if @opts[:skip_write]
      if @opts[:aws]
        save_page_to_aws(response, uri)
      else
        save_page_to_disk(response, uri)
      end
    end

    def save_page_to_disk(response, uri)
      path = uri.path
      path += "?#{uri.query}" if uri.query

      path_segments = path.scan(%r{[^/]*/})
      filename = path.include?("/") ? path[path.rindex("/")+1..-1] : path

      current = @output_dir
      FileUtils.mkdir_p(current) unless File.exist?(current)

      # Create all the directories necessary for this file
      path_segments.each do |segment|
        current = File.join(current, "#{segment}").sub(%r{/$},'')
        if File.file?(current)
          # If we are trying to create a directory and there already is a file
          # with the same name add a .d to the file since we can't create
          # a directory and file with the same name in the file system
          dirfile = current + ".d"
          FileUtils.mv(current, dirfile)
          FileUtils.mkdir(current)
          FileUtils.cp(dirfile, File.join(current, "/index.html"))
        elsif !File.exist?(current)
          FileUtils.mkdir(current)
        end
      end

      body = response.respond_to?(:read_body) ? response.read_body : response
      body = process_body(body, uri, {})
      outfile = File.join(current, "/#{filename}")
      if filename == ""
        indexfile = File.join(outfile, "/index.html")
        @log.info "Saving #{indexfile}"
        File.open(indexfile, "wb") {|f| f << body }
      elsif File.directory?(outfile)
        dirfile = outfile + ".d"
        @log.info "Saving #{dirfile}"
        File.open(dirfile, "wb") {|f| f << body }
        FileUtils.cp(dirfile, File.join(outfile, "/index.html"))
      else
        @log.info "Saving #{outfile}"
        File.open(outfile, "wb") {|f| f << body }
      end
    end

    def save_page_to_aws(response, uri)
      key = uri.path
      key += "?#{uri.query}" if uri.query
      key = key.gsub(%r{/$},"/index.html")
      key = key.gsub(%r{^/},"")
      key = "index.html" if key == ""
      # Upload this file directly to S3
      opts = {:acl => "public-read"}
      opts[:content_type] = (response['content-type'] rescue nil) || "text/html"
      @log.info "Uploading #{key} to s3 with content type #{opts[:content_type]}"
      if response.respond_to?(:read_body)
        body = process_body(response.read_body, uri, opts)
        @s3_bucket.object(key).put(opts.merge(body: body))
      else
        body = process_body(response, uri, opts)
        @s3_bucket.object(key).put(opts.merge(body: body))
      end
    end

    def process_success(response, parsed_uri)
      url = parsed_uri.to_s
      if @opts[:filter_process]
        return if @opts[:filter_process].call(response, parsed_uri)
      end
      case response['content-type']
      when /css/
        save_page(response, parsed_uri)
        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
      when /html/
        save_page(response, parsed_uri)
        doc = Nokogiri::HTML(response.body)
        add_urls(extract_links(doc, url), {:type_hint => "link"})
        add_urls(extract_scripts(doc, url), {:type_hint => "script"})
        add_urls(extract_images(doc, url), {:type_hint => "image"})
        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
        add_urls(extract_videos(doc, parsed_uri), {:type_hint => "video"})
        add_urls(extract_hrefs(doc, url), {:type_hint => "href"}) unless @opts[:single_page]
      else
        save_page(response, parsed_uri)
      end
    end

    # If we hit a redirect we save the redirect as a meta refresh page
    # TODO: for AWS S3 hosting we could instead create a redirect?
    def process_redirect(url, destination_url)
      body = "<!DOCTYPE html><html><head>" \
             "<meta http-equiv=\"refresh\" content=\"0; url=#{destination_url}\">" \
             "</head><body>You are being redirected to #{destination_url}.</body></html>"
      save_page(body, url)
    end

    def process_body(body, uri, opts)
      if @process_body
        body = @process_body.call(body, uri, opts)
      end
      body
    end

    # Fetch a URI and save it to disk
    def process_url(url, info)
      @http_connections ||= {}
      parsed_uri = URI(url)

      @log.debug "Fetching #{parsed_uri}"

      # Attempt to use an already open Net::HTTP connection
      key = parsed_uri.host + parsed_uri.port.to_s
      connection = @http_connections[key]
      if connection.nil?
        connection = Net::HTTP.new(parsed_uri.host, parsed_uri.port)
        connection.use_ssl = true if parsed_uri.scheme.downcase == "https"
        @http_connections[key] = connection
      end

      request = Net::HTTP::Get.new(parsed_uri.request_uri)
      begin
        connection.request(request) do |response|
          case response
          when Net::HTTPSuccess
            process_success(response, parsed_uri)
          when Net::HTTPRedirection
            redirect_url = response['location']
            @log.debug "Processing redirect to #{redirect_url}"
            process_redirect(parsed_uri, redirect_url)
            add_url(redirect_url)
          else
            @log.error "Error #{response.code}:#{response.message} fetching url #{url}"
          end
        end
      rescue OpenSSL::SSL::SSLError => e
        @log.error "SSL Error #{e.message} fetching url #{url}"
      rescue Errno::ECONNRESET => e
        @log.error "Error #{e.class}:#{e.message} fetching url #{url}"
      end
    end

  end
end

--------------------------------------------------------------------------------
/lib/staticizer/version.rb:
--------------------------------------------------------------------------------
module Staticizer
  VERSION = "0.0.7"
end

--------------------------------------------------------------------------------
/staticizer.gemspec:
--------------------------------------------------------------------------------
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'staticizer/version'

Gem::Specification.new do |spec|
  spec.name = "staticizer"
  spec.version = Staticizer::VERSION
  spec.authors = ["Conor Hunt"]
  spec.email = ["conor.hunt+git@gmail.com"]
  spec.description = %q{A tool to create a static version of a website for hosting on S3.
Can be used to create a cheap emergency backup version of a dynamic website.}
  spec.summary = %q{A tool to create a static version of a website for hosting on S3.}
  spec.homepage = "https://github.com/SquareMill/staticizer"
  spec.license = "MIT"

  spec.files = `git ls-files`.split($/)
  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  spec.add_development_dependency "bundler", "~> 1.3"
  spec.add_development_dependency "rake"
  spec.add_development_dependency "webmock"

  spec.add_runtime_dependency 'nokogiri'
  spec.add_runtime_dependency 'aws-sdk'
end

--------------------------------------------------------------------------------
/tests/crawler_test.rb:
--------------------------------------------------------------------------------
require 'minitest/autorun'
require 'ostruct'

lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)

require 'staticizer'

class TestFilePaths < MiniTest::Unit::TestCase
  def setup
    @crawler = Staticizer::Crawler.new("http://test.com")
    @crawler.log_level = Logger::FATAL
    @fake_page = File.read(File.expand_path(File.dirname(__FILE__) + "/fake_page.html"))
  end

  def test_save_page_to_disk
    fake_response = OpenStruct.new(:read_body => "test", :body => "test")
    file_paths = {
      "http://test.com" => "index.html",
      "http://test.com/" => "index.html",
      "http://test.com/asdfdf/dfdf" => "/asdfdf/dfdf",
      "http://test.com/asdfdf/dfdf/" => ["/asdfdf/dfdf","/asdfdf/dfdf/index.html"],
      "http://test.com/asdfad/asdffd.test" => "/asdfad/asdffd.test",
      "http://test.com/?asdfsd=12312" => "/?asdfsd=12312",
      "http://test.com/asdfad/asdffd.test?123=sdff" => "/asdfad/asdffd.test?123=sdff",
    }

    # TODO: Stub out file system using https://github.com/defunkt/fakefs?
    outputdir = "/tmp/staticizer_crawl_test"
    FileUtils.rm_rf(outputdir)
    @crawler.output_dir = outputdir

    file_paths.each do |k,v|
      @crawler.save_page_to_disk(fake_response, URI.parse(k))
      [v].flatten.each do |file|
        expected = File.expand_path(outputdir + "/#{file}")
        assert File.exist?(expected), "File #{expected} not created for url #{k}"
      end
    end
  end

  def test_save_page_to_aws
  end

  def test_add_url_with_valid_domains
    test_url = "http://test.com/test"
    @crawler.add_url(test_url)
    assert(@crawler.url_queue[-1] == [test_url, {}], "URL #{test_url} not added to queue")
  end

  def test_add_url_with_filter
  end

  def test_initialize_options
  end

  def test_process_url
  end

  def test_make_absolute
  end

  def test_link_extraction
  end

  def test_href_extraction
  end

  def test_css_extraction
  end

  def test_css_url_extraction
  end

  def test_image_extraction
  end

  def test_script_extraction
  end
end

--------------------------------------------------------------------------------
/tests/fake_page.html:
--------------------------------------------------------------------------------
"Square Mill really took the time to understand our business and think strategically about how we want to engage and communicate with our entrepreneurs online. Together, their small team is responsive, nimble and efficient and has the deep design and technical chops to back it up."
Christina Lee, Operating Partner at KPCB