├── .gitignore
├── Gemfile
├── LICENSE.txt
├── README.md
├── Rakefile
├── bin
│   └── staticizer
├── lib
│   ├── staticizer.rb
│   └── staticizer
│       ├── command.rb
│       ├── crawler.rb
│       └── version.rb
├── staticizer.gemspec
└── tests
    ├── crawler_test.rb
    └── fake_page.html

/.gitignore:
--------------------------------------------------------------------------------
*.gem
*.rbc
.bundle
.config
.yardoc
Gemfile.lock
InstalledFiles
_yardoc
coverage
doc/
lib/bundler/man
pkg
rdoc
spec/reports
test/tmp
test/version_tmp
tmp

--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
source 'https://rubygems.org'

# Specify your gem's dependencies in staticizer.gemspec
gemspec

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
Copyright (c) 2014 Conor Hunt

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Staticizer

A tool to create a static version of a website for hosting on S3.

## Rationale

One of our clients needed a reliable emergency backup for a website. If the
website goes down, this backup is available with reduced functionality.

S3 and Route 53 provide a great way to host a static emergency backup for a website.
See this article: http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html.
In our experience it works well and is incredibly cheap. Our average-sized website,
with a few hundred pages and assets, costs less than US$1 a month to host.

We tried existing tools (httrack, wget) to crawl and create a static version
of the site to upload to S3, but we found that they did not work well with S3 hosting.
We wanted the site uploaded to S3 to respond to the *exact* same URLs (where possible) as
the existing site. This way, when the site goes down, incoming links from Google search
results etc. will still work.

## TODO

* Ability to specify AWS credentials via a file or environment variables
* Tests!
* Decide what to do with URLs that have query strings. Currently they are crawled and uploaded to S3, but those keys cannot be accessed directly, e.g. http://squaremill.com/file?test=1 is uploaded with the key `file?test=1`, which can only be accessed by encoding the ? as %3F (file%3Ftest=1)
* Create a 404 file on S3
* Provide the option to rewrite absolute URLs to relative URLs so that hosting can work on a different domain
* Multithread the crawler
* Check for too many redirects
* Provide regex options to control which URLs are scraped
* Better handling of incorrect server mime types (e.g. server returns text/plain for CSS instead of text/css)
* Provide more options for uploading (upload via scp, ftp, custom, etc.). Split out saving/uploading into an interface
* Handle large files in a more memory-efficient way by streaming uploads/downloads

## Installation

Add this line to your application's Gemfile:

    gem 'staticizer'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install staticizer

## Command line usage

Staticizer can be used through the command line tool or by requiring the library.

### Crawl a website and write to disk

    staticizer http://squaremill.com --output-dir=/tmp/crawl

### Crawl a website and upload to AWS

    staticizer http://squaremill.com --aws-s3-bucket=squaremill.com --aws-access-key=HJFJS5gSJHMDZDFFSSDQQ --aws-secret-key=HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s

### Crawl a website and allow several domains to be crawled

    staticizer http://squaremill.com --valid-domains=squaremill.com,www.squaremill.com,img.squaremill.com
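
### Crawl a website without writing any files

The flags above can be combined. As a rough sketch (the log path here is arbitrary), the following crawls a site and logs everything it fetches, including errors such as 404s, without writing anything to disk or S3:

    staticizer http://squaremill.com --skip-write --valid-domains=squaremill.com,www.squaremill.com --log-file=/tmp/staticizer.log -v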

## Code Usage

For all these examples you must first:

    require 'staticizer'

### Crawl a website and upload to AWS

This will only crawl URLs in the squaremill.com domain:

    s = Staticizer::Crawler.new("http://squaremill.com",
      :aws => {
        :region => "us-west-1",
        :endpoint => "http://s3.amazonaws.com",
        :bucket_name => "www.squaremill.com",
        :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
        :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
      }
    )
    s.crawl

### Crawl a website and write to disk

    s = Staticizer::Crawler.new("http://squaremill.com", :output_dir => "/tmp/crawl")
    s.crawl

### Crawl a website and make all pages contain a 'noindex' meta tag

    s = Staticizer::Crawler.new("http://squaremill.com",
      :output_dir => "/tmp/crawl",
      :process_body => lambda {|body, uri, opts|
        # not the best regex, but it will do for our use
        body = body.gsub(/<meta name="robots"[^>]+>/i, '')
        body = body.gsub(/<head>/i, "<head>\n<meta name=\"robots\" content=\"noindex\">")
        body
      }
    )
    s.crawl

### Crawl a website and rewrite all non-www URLs to www

    s = Staticizer::Crawler.new("http://squaremill.com",
      :aws => {
        :region => "us-west-1",
        :endpoint => "http://s3.amazonaws.com",
        :bucket_name => "www.squaremill.com",
        :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
        :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
      },
      :filter_url => lambda do |url, info|
        # Only crawl the URL if it matches squaremill.com or www.squaremill.com
        if url =~ %r{https?://(www\.)?squaremill\.com}
          # Rewrite non-www URLs to www
          return url.gsub(%r{https?://(www\.)?squaremill\.com}, "http://www.squaremill.com")
        end
        # returning nil here prevents the URL from being crawled
      end
    )
    s.crawl

## Crawler Options

* :aws - hash of connection options passed to the aws-sdk gem
* :filter_url - lambda called for each discovered URL; return the URL (it can be modified) to crawl it, or nil to skip it
* :output_dir - directory to write to when saving a site to disk; created if it does not exist
* :logger - a logger object responding to the usual Ruby Logger methods
* :log_level - log level, defaults to Logger::INFO
* :valid_domains - array of domains that should be crawled; URLs on domains not in this list are ignored
* :process_body - lambda called to pre-process the body of a page before it is written out
* :skip_write - don't write retrieved files to disk or S3, just crawl the site (can be used to find 404s etc.)
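
For example, a minimal sketch that combines several of these options to audit a site for broken links without writing anything (the log path is arbitrary):

    require 'staticizer'

    s = Staticizer::Crawler.new("http://squaremill.com",
      :skip_write    => true,                          # crawl only, don't save anything
      :valid_domains => ["squaremill.com", "www.squaremill.com"],
      :logger        => Logger.new("/tmp/staticizer_audit.log"),
      :log_level     => Logger::DEBUG                  # 404s and SSL failures are logged as errors
    )
    s.crawl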

## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

--------------------------------------------------------------------------------
/Rakefile:
--------------------------------------------------------------------------------
require "bundler/gem_tasks"
require 'rake/testtask'

Rake::TestTask.new do |t|
  t.libs << "tests"
  t.test_files = FileList['tests/*_test.rb']
end

--------------------------------------------------------------------------------
/bin/staticizer:
--------------------------------------------------------------------------------
#!/usr/bin/env ruby

lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)

require 'staticizer'
require 'staticizer/command'

options, initial_page = Staticizer::Command.parse(ARGV)
s = Staticizer::Crawler.new(initial_page, options)
s.crawl

--------------------------------------------------------------------------------
/lib/staticizer.rb:
--------------------------------------------------------------------------------
require_relative "staticizer/version"
require_relative 'staticizer/crawler'

module Staticizer
  def Staticizer.crawl(url, options = {}, &block)
    crawler = Staticizer::Crawler.new(url, options)
    crawler.crawl
  end
end

--------------------------------------------------------------------------------
/lib/staticizer/command.rb:
--------------------------------------------------------------------------------
require 'optparse'
require 'logger'

module Staticizer
  class Command
    # Parse command line arguments and print out any errors
    def Command.parse(args)
      options = {}
      initial_page = nil

      parser = OptionParser.new do |opts|
        opts.banner = "Usage: staticizer initial_url [options]\nExample: staticizer http://squaremill.com --output-dir=/tmp/crawl"

        opts.separator ""
        opts.separator "Specific options:"

        opts.on("--aws-s3-bucket [STRING]", "Name of S3 bucket to write to") do |v|
          options[:aws] ||= {}
          options[:aws][:bucket_name] = v
        end

        opts.on("--aws-region [STRING]", "AWS Region of S3 bucket") do |v|
          options[:aws] ||= {}
          options[:aws][:region] = v
        end

        opts.on("--aws-access-key [STRING]", "AWS Access Key ID") do |v|
          options[:aws] ||= {}
          options[:aws][:access_key_id] = v
        end

        opts.on("--aws-secret-key [STRING]", "AWS Secret Access Key") do |v|
          options[:aws] ||= {}
          options[:aws][:secret_access_key] = v
        end

        opts.on("-d", "--output-dir [DIRECTORY]", "Write crawl to disk in this directory, will be created if it does not exist") do |v|
          options[:output_dir] = v
        end

        opts.on("-v", "--verbose", "Run verbosely (sets log level to Logger::DEBUG)") do |v|
          options[:log_level] = Logger::DEBUG
        end

        opts.on("--log-level [NUMBER]", "Set log level 0 = most verbose to 4 = least verbose") do |v|
          options[:log_level] = v.to_i
        end

        opts.on("--log-file [PATH]", "Log file to write to") do |v|
          options[:logger] = Logger.new(v)
        end

        opts.on("--skip-write", "Don't write out files to disk or S3") do |v|
          options[:skip_write] = true
        end

        opts.on("--valid-domains x,y,z", Array, "Comma separated list of domains that should be crawled, other domains will be ignored") do |v|
          options[:valid_domains] = v
        end

        opts.on_tail("-h", "--help", "Show this message") do
          puts opts
          exit
        end
      end

      begin
        parser.parse!(args)
        initial_page = args.shift
        raise ArgumentError, "Need to specify an initial URL to start the crawl" unless initial_page
      rescue StandardError => e
        puts e
        puts parser
        exit(1)
      end

      return options, initial_page
    end
  end
end

--------------------------------------------------------------------------------
/lib/staticizer/crawler.rb:
--------------------------------------------------------------------------------
require 'net/http'
require 'fileutils'
require 'nokogiri'
require 'aws-sdk'
require 'logger'

module Staticizer
  class Crawler
    attr_reader :url_queue
    attr_accessor :output_dir

    def initialize(initial_page, opts = {})
      if initial_page.nil?
        raise ArgumentError, "Initial page required"
      end

      @opts = opts.dup
      @url_queue = []
      @processed_urls = []
      @output_dir = @opts[:output_dir] || File.expand_path("crawl/")
      @log = @opts[:logger] || Logger.new(STDOUT)
      @log.level = @opts[:log_level] || Logger::INFO

      if @opts[:aws]
        bucket_name = @opts[:aws].delete(:bucket_name)
        Aws.config.update(@opts[:aws])
        @s3_bucket = Aws::S3::Resource.new.bucket(bucket_name)
      end

      if @opts[:valid_domains].nil?
        uri = URI.parse(initial_page)
        @opts[:valid_domains] ||= [uri.host]
      end

      if @opts[:process_body]
        @process_body = @opts[:process_body]
      end

      add_url(initial_page)
    end

    def log_level
      @log.level
    end

    def log_level=(level)
      @log.level = level
    end

    def crawl
      @log.info("Starting crawl")
      while(@url_queue.length > 0)
        url, info = @url_queue.shift
        @processed_urls << url
        process_url(url, info)
      end
      @log.info("Finished crawl")
    end

    def extract_hrefs(doc, base_uri)
      doc.xpath("//a/@href").map {|href| make_absolute(base_uri, href) }
    end

    def extract_images(doc, base_uri)
      doc.xpath("//img/@src").map {|src| make_absolute(base_uri, src) }
    end

    def extract_links(doc, base_uri)
      doc.xpath("//link/@href").map {|href| make_absolute(base_uri, href) }
    end

    def extract_videos(doc, base_uri)
      doc.xpath("//video").map do |video|
        # Use a relative xpath so we only pick up sources belonging to this video tag
        sources = video.xpath(".//source/@src").map {|src| make_absolute(base_uri, src) }
        poster = video.attributes["poster"]
        poster = make_absolute(base_uri, poster.to_s) if poster
        [poster, sources]
      end.flatten.uniq.compact
    end

    def extract_scripts(doc, base_uri)
      doc.xpath("//script/@src").map {|src| make_absolute(base_uri, src) }
    end

    def extract_css_urls(css, base_uri)
      css.scan(/url\(\s*['"]?(.+?)['"]?\s*\)/).map {|src| make_absolute(base_uri, src[0]) }
    end

    def add_urls(urls, info = {})
      urls.compact.uniq.each {|url| add_url(url, info.dup) }
    end

    def make_absolute(base_uri, href)
      # base_uri may be a URI object or a string - normalize it and strip any query
      dup_uri = URI.parse(base_uri.to_s)
      dup_uri.query = nil
      if href.to_s =~ %r{\Ahttps?://}i
        href.to_s.gsub(" ", "+")
      else
        URI::join(dup_uri.to_s, href.to_s).to_s
      end
    rescue StandardError => e
      @log.error "Could not make absolute #{base_uri} - #{href}"
      nil
    end

    def add_url(url, info = {})
      if @opts[:filter_url]
        url = @opts[:filter_url].call(url, info)
        return if url.nil?
      else
        regex = "(#{@opts[:valid_domains].map {|d| Regexp.escape(d) }.join(")|(")})"
        return if url !~ %r{^https?://#{regex}}
      end

      url = url.sub(/#.*$/,'') # strip off any fragments
      return if @url_queue.index {|u| u[0] == url } || @processed_urls.include?(url)
      @url_queue << [url, info]
    end

    def save_page(response, uri)
      return if @opts[:skip_write]
      if @opts[:aws]
        save_page_to_aws(response, uri)
      else
        save_page_to_disk(response, uri)
      end
    end

    def save_page_to_disk(response, uri)
      path = uri.path
      path += "?#{uri.query}" if uri.query

      path_segments = path.scan(%r{[^/]*/})
      filename = path.include?("/") ? path[path.rindex("/")+1..-1] : path

      current = @output_dir
      FileUtils.mkdir_p(current) unless File.exist?(current)

      # Create all the directories necessary for this file
      path_segments.each do |segment|
        current = File.join(current, "#{segment}").sub(%r{/$},'')
        if File.file?(current)
          # If we are trying to create a directory and there already is a file
          # with the same name add a .d to the file since we can't create
          # a directory and file with the same name in the file system
          dirfile = current + ".d"
          FileUtils.mv(current, dirfile)
          FileUtils.mkdir(current)
          FileUtils.cp(dirfile, File.join(current, "/index.html"))
        elsif !File.exist?(current)
          FileUtils.mkdir(current)
        end
      end

      body = response.respond_to?(:read_body) ? response.read_body : response
      body = process_body(body, uri, {})
      outfile = File.join(current, "/#{filename}")
      if filename == ""
        indexfile = File.join(outfile, "/index.html")
        @log.info "Saving #{indexfile}"
        File.open(indexfile, "wb") {|f| f << body }
      elsif File.directory?(outfile)
        dirfile = outfile + ".d"
        @log.info "Saving #{dirfile}"
        File.open(dirfile, "wb") {|f| f << body }
        FileUtils.cp(dirfile, File.join(outfile, "/index.html"))
      else
        @log.info "Saving #{outfile}"
        File.open(outfile, "wb") {|f| f << body }
      end
    end

    def save_page_to_aws(response, uri)
      key = uri.path
      key += "?#{uri.query}" if uri.query
      key = key.gsub(%r{/$},"/index.html")
      key = key.gsub(%r{^/},"")
      key = "index.html" if key == ""
      # Upload this file directly to S3
      opts = {:acl => "public-read"}
      opts[:content_type] = (response['content-type'] rescue nil) || "text/html"
      @log.info "Uploading #{key} to s3 with content type #{opts[:content_type]}"
      if response.respond_to?(:read_body)
        body = process_body(response.read_body, uri, opts)
        @s3_bucket.object(key).put(opts.merge(body: body))
      else
        body = process_body(response, uri, opts)
        @s3_bucket.object(key).put(opts.merge(body: body))
      end
    end

    def process_success(response, parsed_uri)
      url = parsed_uri.to_s
      if @opts[:filter_process]
        return if @opts[:filter_process].call(response, parsed_uri)
      end
      case response['content-type']
      when /css/
        save_page(response, parsed_uri)
        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
      when /html/
        save_page(response, parsed_uri)
        doc = Nokogiri::HTML(response.body)
        add_urls(extract_links(doc, url), {:type_hint => "link"})
        add_urls(extract_scripts(doc, url), {:type_hint => "script"})
        add_urls(extract_images(doc, url), {:type_hint => "image"})
        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
        add_urls(extract_videos(doc, parsed_uri), {:type_hint => "video"})
        add_urls(extract_hrefs(doc, url), {:type_hint => "href"}) unless @opts[:single_page]
      else
        save_page(response, parsed_uri)
      end
    end

    # If we hit a redirect we save the redirect as a meta refresh page
    # TODO: for AWS S3 hosting we could instead create a redirect?
    def process_redirect(url, destination_url)
      body = "<!DOCTYPE html><html><head>" \
             "<meta http-equiv=\"refresh\" content=\"0; url=#{destination_url}\">" \
             "</head><body>You are being redirected to #{destination_url}.</body></html>"
      save_page(body, url)
    end

    def process_body(body, uri, opts)
      if @process_body
        body = @process_body.call(body, uri, opts)
      end
      body
    end

    # Fetch a URI and save it to disk
    def process_url(url, info)
      @http_connections ||= {}
      parsed_uri = URI(url)

      @log.debug "Fetching #{parsed_uri}"

      # Attempt to use an already open Net::HTTP connection
      key = parsed_uri.host + parsed_uri.port.to_s
      connection = @http_connections[key]
      if connection.nil?
        connection = Net::HTTP.new(parsed_uri.host, parsed_uri.port)
        connection.use_ssl = true if parsed_uri.scheme.downcase == "https"
        @http_connections[key] = connection
      end

      request = Net::HTTP::Get.new(parsed_uri.request_uri)
      begin
        connection.request(request) do |response|
          case response
          when Net::HTTPSuccess
            process_success(response, parsed_uri)
          when Net::HTTPRedirection
            redirect_url = response['location']
            @log.debug "Processing redirect to #{redirect_url}"
            process_redirect(parsed_uri, redirect_url)
            add_url(redirect_url)
          else
            @log.error "Error #{response.code}:#{response.message} fetching url #{url}"
          end
        end
      rescue OpenSSL::SSL::SSLError => e
        @log.error "SSL Error #{e.message} fetching url #{url}"
      rescue Errno::ECONNRESET => e
        @log.error "Error #{e.class}:#{e.message} fetching url #{url}"
      end
    end

  end
end

--------------------------------------------------------------------------------
/lib/staticizer/version.rb:
--------------------------------------------------------------------------------
module Staticizer
  VERSION = "0.0.7"
end

--------------------------------------------------------------------------------
/staticizer.gemspec:
--------------------------------------------------------------------------------
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'staticizer/version'

Gem::Specification.new do |spec|
  spec.name = "staticizer"
  spec.version = Staticizer::VERSION
  spec.authors = ["Conor Hunt"]
  spec.email = ["conor.hunt+git@gmail.com"]
  spec.description = %q{A tool to create a static version of a website for hosting on S3.
Can be used to create a cheap emergency backup version of a dynamic website.}
  spec.summary = %q{A tool to create a static version of a website for hosting on S3.}
  spec.homepage = "https://github.com/SquareMill/staticizer"
  spec.license = "MIT"

  spec.files = `git ls-files`.split($/)
  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  spec.add_development_dependency "bundler", "~> 1.3"
  spec.add_development_dependency "rake"
  spec.add_development_dependency "webmock"

  spec.add_runtime_dependency 'nokogiri'
  spec.add_runtime_dependency 'aws-sdk'
end

--------------------------------------------------------------------------------
/tests/crawler_test.rb:
--------------------------------------------------------------------------------
require 'minitest/autorun'
require 'ostruct'

lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)

require 'staticizer'

class TestFilePaths < MiniTest::Unit::TestCase
  def setup
    @crawler = Staticizer::Crawler.new("http://test.com")
    @crawler.log_level = Logger::FATAL
    @fake_page = File.read(File.expand_path(File.dirname(__FILE__) + "/fake_page.html"))
  end

  def test_save_page_to_disk
    fake_response = OpenStruct.new(:read_body => "test", :body => "test")
    file_paths = {
      "http://test.com" => "index.html",
      "http://test.com/" => "index.html",
      "http://test.com/asdfdf/dfdf" => "/asdfdf/dfdf",
      "http://test.com/asdfdf/dfdf/" => ["/asdfdf/dfdf","/asdfdf/dfdf/index.html"],
      "http://test.com/asdfad/asdffd.test" => "/asdfad/asdffd.test",
      "http://test.com/?asdfsd=12312" => "/?asdfsd=12312",
      "http://test.com/asdfad/asdffd.test?123=sdff" => "/asdfad/asdffd.test?123=sdff",
    }

    # TODO: Stub out file system using https://github.com/defunkt/fakefs?
    outputdir = "/tmp/staticizer_crawl_test"
    FileUtils.rm_rf(outputdir)
    @crawler.output_dir = outputdir

    file_paths.each do |k,v|
      @crawler.save_page_to_disk(fake_response, URI.parse(k))
      [v].flatten.each do |file|
        expected = File.expand_path(outputdir + "/#{file}")
        assert File.exist?(expected), "File #{expected} not created for url #{k}"
      end
    end
  end

  def test_save_page_to_aws
  end

  def test_add_url_with_valid_domains
    test_url = "http://test.com/test"
    @crawler.add_url(test_url)
    assert(@crawler.url_queue[-1] == [test_url, {}], "URL #{test_url} not added to queue")
  end

  def test_add_url_with_filter
  end

  def test_initialize_options
  end

  def test_process_url
  end

  def test_make_absolute
  end

  def test_link_extraction
  end

  def test_href_extraction
  end

  def test_css_extraction
  end

  def test_css_url_extraction
  end

  def test_image_extraction
  end

  def test_script_extraction
  end
end

--------------------------------------------------------------------------------
/tests/fake_page.html:
--------------------------------------------------------------------------------
"Square Mill really took the time to understand our business and think strategically about how we want to engage and communicate with our entrepreneurs online. Together, their small team is responsive, nimble and efficient and has the deep design and technical chops to back it up."
Christina Lee, Operating Partner at KPCB