<\/embed><\/object>"
38 | end
39 |
40 | end
--------------------------------------------------------------------------------
/spec/extractula/custom_extractors/y_frog_spec.rb:
--------------------------------------------------------------------------------
# coding: utf-8

require File.dirname(__FILE__) + '/../../spec_helper'

describe Extractula::YFrog do
  # A parsed yfrog photo URL plus an empty document; matching here is
  # expected to be driven by the URL alone.
  before do
    @photo_url = Domainatrix.parse("http://img70.yfrog.com/i/8pfo.jpg/")
    @empty_doc = Nokogiri::HTML::Document.new
  end

  it "can extract images from yfrog.com" do
    Extractula::YFrog.can_extract?(@photo_url, @empty_doc).should be_true
  end

  it "should have media type 'image'" do
    Extractula::YFrog.media_type.should == 'image'
  end
end

describe "extracting from a YFrog page" do

  # Run the extractor once per example against the saved yfrog.html fixture.
  before do
    @result = Extractula::YFrog.new("http://img70.yfrog.com/i/8pfo.jpg/", read_test_file("yfrog.html")).extract
  end

  it "extracts the title" do
    @result.title.should == "Yfrog - 8pfo - Uploaded by cwgabriel"
  end

  it "extracts the content" do
    @result.content.should == "Done for today I think."
  end

  it "extracts the image url" do
    @result.image_urls.should include("http://img70.yfrog.com/img70/3152/8pfo.jpg")
  end

end
--------------------------------------------------------------------------------
/spec/extractula/custom_extractors/you_tube_spec.rb:
--------------------------------------------------------------------------------
require File.dirname(__FILE__) + '/../../spec_helper'

describe Extractula::YouTube do
  # A parsed watch-page URL plus an empty document; matching is URL-driven.
  before do
    @watch_url = Domainatrix.parse("http://www.youtube.com/watch?v=FzRH3iTQPrk")
    @empty_doc = Nokogiri::HTML::Document.new
  end

  it "can extract videos from youtube.com" do
    Extractula::YouTube.can_extract?(@watch_url, @empty_doc).should be_true
  end

  it "should have media type 'video'" do
    Extractula::YouTube.media_type.should == 'video'
  end
end

describe "extracting from a YouTube page" do

  # Stub the oEmbed round-trip with a canned JSON response so the extractor
  # never touches the network, then run it over the saved page fixture.
  before do
    @canned_response = Extractula::OEmbed::Response.new(read_test_file("youtube-oembed.json"))
    Extractula::OEmbed.stub!(:request).and_return(@canned_response)
    @result = Extractula::YouTube.new("http://www.youtube.com/watch?v=FzRH3iTQPrk", read_test_file("youtube.html")).extract
  end

  it "extracts the title" do
    @result.title.should == "The Sneezing Baby Panda"
  end

  it "extracts the content" do
    @result.content.should == "A Baby Panda Sneezing\n\nhttp://www.twitter.com/_jam..."
  end

  it "extracts the main video" do
    # NOTE(review): the expected embed markup appears to have been stripped
    # down to a single space in this copy of the spec — confirm against the
    # original repository before relying on this assertion.
    @result.video_embed.should == " "
  end

end
--------------------------------------------------------------------------------
/spec/extractula/extracted_content_spec.rb:
--------------------------------------------------------------------------------
1 | # coding: utf-8
2 | require File.dirname(__FILE__) + '/../spec_helper'
3 |
4 | describe "extracted content" do
# NOTE(review): the examples below indicate ExtractedContent derives summary,
# image_urls and video_embed from :content when they are not passed explicitly
# — confirm against lib/extractula/extracted_content.rb.
5 | it "has a url" do
6 | Extractula::ExtractedContent.new(:url => "http://pauldix.net").url.should == "http://pauldix.net"
7 | end
8 |
9 | it "has a title" do
10 | Extractula::ExtractedContent.new(:title => "whatevs").title.should == "whatevs"
11 | end
12 |
13 | it "has content" do
14 | Extractula::ExtractedContent.new(:content => "some content").content.should == "some content"
15 | end
16 |
17 | describe "summary" do
# Per the assertions below, the summary is the leading paragraph of the
# content, with script tags and their bodies removed first.
18 | it "has a summary" do
19 | Extractula::ExtractedContent.new(:summary => "a summary!").summary.should == "a summary!"
20 | end
21 |
22 | it "generates the summary from the content" do
23 | extracted = Extractula::ExtractedContent.new(:content => "I've been quietly working on Typhoeus for the last few months. With the help of Wilson Bilkovich and David Balatero I've finished what I think is a significant improvement to the library. The new interface removes all the magic and opts instead for clarity.
\nIt's really slick and includes improved stubing support, caching, memoization, and (of course) parallelism. The Typhoeus readme highlights all of the awesomeness. It should be noted that the old interface of including Typhoeus into classes and defining remote methods has been deprecated. I'll be removing that sometime in the next six months.
\nIn addition to thanking everyone using the library and everyone contributing, I should also thank my employer kgbweb. If you're a solid Rubyist that likes parsing, crawling, and stuff, or a machine learning guy, or a Solr/Lucene/indexing bad ass, let me know. We need you and we're doing some crazy awesome stuff.
")
24 | extracted.summary.should == "I've been quietly working on Typhoeus for the last few months. With the help of Wilson Bilkovich and David Balatero I've finished what I think is a significant improvement to the library. The new interface removes all the magic and opts instead for clarity."
25 | end
26 |
27 | it "cleans script tags and their content" do
28 | Extractula::ExtractedContent.new(:content => read_test_file("script_tag_remove_case.html")).summary.should == "Obama to meet with House Republicans By Perry Bacon Jr. Washington Post Staff Writer Tuesday, January 26, 2010; A13 President Obama will meet Friday with perhaps his harshest critics outside of Fox News headquarters: the House Republicans."
29 | end
30 | end
31 |
32 | describe "image_urls" do
# NOTE(review): the inline content fixture below has had its HTML markup
# stripped by whatever produced this copy of the file; the assertion implies
# the original contained an <img> tag with the businessinsider.com src.
33 | it "has image_urls" do
34 | Extractula::ExtractedContent.new(:image_urls => ["first.jpg", "second.tiff"]).image_urls.should == ["first.jpg", "second.tiff"]
35 | end
36 |
37 | it "generates the image urls from the content" do
38 | extracted = Extractula::ExtractedContent.new(:content => "
\n\n\n
\nWhen designers start a new Web site, they often sketch out a first idea of the page layout using paper and stencil.
\nDesigners call this sketch a \"wireframe.\"
\nWoorkup.com's Antonio Lupetti collected 10 beautiful examples of wireframes.
\nHe gave us permission to republish them here >
")
39 | extracted.image_urls.should == ["http://static.businessinsider.com/~~/f?id=4b3a466f000000000086e662&maxX=311&maxY=233"]
40 | end
41 | end
42 |
43 | describe "video_embed" do
# NOTE(review): as above, the embed markup in the inline fixture was stripped
# in this copy; the expected value has collapsed to whitespace/newlines.
44 | it "has a video_embed" do
45 | Extractula::ExtractedContent.new(:video_embed => "some embed code").video_embed.should == "some embed code"
46 | end
47 |
48 | it "pulls video embed tags from the content" do
49 | extracted = Extractula::ExtractedContent.new(:content => "\n First Twitter , then Foursquare, now the Weather Channel? People are broadcasting their wedding proposals all over the place these days.
\nThat’s right, the other night Weather Channel meteorologist Kim Perez’s beau, police Sgt. Marty Cunningham (best name EVER), asked her to marry him during a routine forecast. Good thing she said yes, otherwise Cunningham’s disposition would have been cloudy with a serious chance of all-out mortification.
\nSocial media and viral videos have taken the place of the jumbotron when it comes to marriage proposals, allowing one to sound one’s not-so barbaric yawp over the roofs of the world. In today’s look-at-me society, public proposals are probably the least offensive byproduct. Meaning that even the most hardened of cynics can admit that they’re kind of sweet.
\nCheck out Cunningham’s proposal below (I personally enjoy that the weather map reads “ring ing in the New Year”), and then dive right into our list of even more social media wooers. What’s next? Entire domains dedicated to popping the question?
\n
\n\n \n \n \n \n
\n \n \nMore Wedding Bells and Whistles \n \nCONGRATS: Mashable Marriage Proposal Live at #SocialGood [Video]
\nMan Proposes Marriage via Foursquare Check-In
\nDid We Just Witness a Twitter Marriage Proposal?
\nSuccessful Marriage Proposal on Twitter Today: We #blamedrewscancer
\nJust Married: Groom Changes Facebook Relationship Status at the Altar [VIDEO]
")
50 | extracted.video_embed.should == " \n \n \n "
51 | end
52 | end
53 |
54 | describe "some regressions" do
# Regression guard: extraction of node-name-error.html must not raise
# "undefined method 'node_name' for nil:NilClass".
55 | it "doesn't error with undefined method 'node_name' for nil:NilClass when looking at elements" do
56 | extracted = Extractula::Extractor.new("http://viceland.com/caprica/", read_test_file("node-name-error.html")).extract
57 | extracted.title.should == "Syfy + Motherboard.tv Caprica Screenings Contest"
58 | end
59 | end
60 | end
61 |
--------------------------------------------------------------------------------
/spec/extractula/extractor_spec.rb:
--------------------------------------------------------------------------------
# coding: utf-8

require File.dirname(__FILE__) + '/../spec_helper'

describe Extractula::Extractor do

  before do
    @parsed_url = Domainatrix.parse("http://www.website.com/")
    @blank_doc  = Nokogiri::HTML::Document.new
  end

  # The base class matches nothing on its own; only subclasses declare domains.
  it "should not be able to extract anything" do
    Extractula::Extractor.can_extract?(@parsed_url, @blank_doc).should be_false
  end

  describe "extracting" do
    it "should give an empty ExtractedContent object" do
      extracted = Extractula::Extractor.new(@parsed_url, @blank_doc).extract
      extracted.title.should be_nil
      extracted.summary.should be_empty
      extracted.image_urls.should be_empty
      extracted.video_embed.should be_nil
    end
  end

  describe "when subclassing" do
    before do
      # Defining the subclass is itself the behaviour under test: the
      # inherited hook should register it with the Extractula module.
      class Thingy < Extractula::Extractor; end
    end

    it "should add the subclass as an extractor to the Extractula module" do
      Extractula.instance_variable_get(:@extractors).should include(Thingy)
    end

    describe "setting the domain" do
      before do
        Thingy.domain 'website'
      end

      it "should be able to extract urls from that domain" do
        Thingy.can_extract?(@parsed_url, @blank_doc).should be_true
      end

      it "can extract urls based on a domain regex" do
        # %r{} form avoids escaping the slash; same pattern as /www\.youtube\.com\/watch/.
        class Foo < Extractula::Extractor; domain %r{www\.youtube\.com/watch}; end
        Foo.can_extract?(Domainatrix.parse("http://www.youtube.com/watch?v=31g0YE61PLQ&feature=rec-fresh+div-r-1-HM"), nil).should be_true
        Foo.can_extract?(Domainatrix.parse("http://www.youtube.com/about.html"), nil).should_not be_true
      end
    end
  end

  describe "media type" do
    before do
      class Thingy < Extractula::Extractor; end
      @instance = Thingy.new @parsed_url, @blank_doc
    end

    it "should default to 'text'" do
      @instance.media_type.should == 'text'
    end

    describe "when set" do
      before do
        Thingy.media_type 'video'
      end

      it "should be the given media type" do
        @instance.media_type.should == 'video'
      end
    end
  end

  describe "post-processing blocks on attribute paths" do
    before do
      class Thingy < Extractula::Extractor; end
      # The block attached to title_path should post-process whatever text
      # the stubbed node returns for the '#element' selector.
      Thingy.title_path('#element') { |t| t.reverse }
      fake_node = stub('fake XML node', :text => 'This text is frontways.')
      @blank_doc.stub!(:at).with('#element').and_return(fake_node)
      @extractor = Thingy.new @parsed_url, @blank_doc
    end

    it "should run yield the value to the block" do
      @extractor.title.should == '.syawtnorf si txet sihT'
    end
  end
end
87 |
describe "dom extraction" do
  it "returns an extracted content object with the url set" do
    # Even with an empty document, extract returns an ExtractedContent
    # carrying the original url.
    extracted = Extractula::Extractor.new("http://pauldix.net", "").extract
    extracted.should be_a Extractula::ExtractedContent
    extracted.url.should == "http://pauldix.net"
  end
end
95 |
96 | describe "extraction cases" do
# End-to-end extraction cases driven by saved HTML fixtures in
# spec/test-files. before(:all) is used so each fixture is parsed once and
# shared across the examples of its describe block.
97 | describe "extracting from a typepad blog" do
98 | before(:all) do
99 | @extracted_content = Extractula::Extractor.new(
100 | "http://www.pauldix.net/2009/10/typhoeus-the-best-ruby-http-client-just-got-better.html",
101 | read_test_file("typhoeus-the-best-ruby-http-client-just-got-better.html")).extract
102 | end
103 |
104 | it "extracts the title" do
105 | @extracted_content.title.should == "Paul Dix Explains Nothing: Typhoeus, the best Ruby HTTP client just got better"
106 | end
107 |
108 | it "extracts the content" do
109 | @extracted_content.content.should == "I've been quietly working on Typhoeus for the last few months. With the help of Wilson Bilkovich and David Balatero I've finished what I think is a significant improvement to the library. The new interface removes all the magic and opts instead for clarity.
\nIt's really slick and includes improved stubing support, caching, memoization, and (of course) parallelism. The Typhoeus readme highlights all of the awesomeness. It should be noted that the old interface of including Typhoeus into classes and defining remote methods has been deprecated. I'll be removing that sometime in the next six months.
\nIn addition to thanking everyone using the library and everyone contributing, I should also thank my employer kgbweb. If you're a solid Rubyist that likes parsing, crawling, and stuff, or a machine learning guy, or a Solr/Lucene/indexing bad ass, let me know. We need you and we're doing some crazy awesome stuff.
"
110 | end
111 | end
112 |
113 | describe "extracting from wordpress - techcrunch" do
114 | before(:all) do
115 | @extracted_content = Extractula::Extractor.new(
116 | "http://www.techcrunch.com/2009/12/29/totlol-youtube/",
117 | read_test_file("totlol-youtube.html")).extract
118 | end
119 |
120 | it "extracts the title" do
121 | @extracted_content.title.should == "The Sad Tale Of Totlol And How YouTube’s Changing TOS Made It Hard To Make A Buck"
122 | end
123 |
124 | it "extracts the content" do
125 | @extracted_content.content.should == Nokogiri::HTML(read_test_file("totlol-youtube.html")).css("div.entry").first.inner_html.strip
126 | end
127 | end
128 |
129 | describe "extracting from wordpress - mashable" do
130 | before(:all) do
131 | @extracted_content = Extractula::Extractor.new(
132 | "http://mashable.com/2009/12/29/ustream-new-years-eve/",
133 | read_test_file("ustream-new-years-eve.html")).extract
134 | end
135 |
136 | it "extracts the title" do
137 | @extracted_content.title.should == "New Years Eve: Watch Live Celebrations on Ustream"
138 | end
139 |
140 | it "extracts the content" do
141 | @extracted_content.content.should == Nokogiri::HTML(read_test_file("ustream-new-years-eve.html")).css("div.text-content").first.inner_html.strip
142 | end
143 |
144 | it "extracts content with a video embed" do
145 | extracted = Extractula::Extractor.new(
146 | "http://mashable.com/2009/12/30/weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video/",
147 | read_test_file("weather-channel-marriage-proposal-touching-with-a-chance-of-viral-status-video.html")).extract
148 | extracted.content.should == "\n First Twitter , then Foursquare, now the Weather Channel? People are broadcasting their wedding proposals all over the place these days.
\nThat’s right, the other night Weather Channel meteorologist Kim Perez’s beau, police Sgt. Marty Cunningham (best name EVER), asked her to marry him during a routine forecast. Good thing she said yes, otherwise Cunningham’s disposition would have been cloudy with a serious chance of all-out mortification.
\nSocial media and viral videos have taken the place of the jumbotron when it comes to marriage proposals, allowing one to sound one’s not-so barbaric yawp over the roofs of the world. In today’s look-at-me society, public proposals are probably the least offensive byproduct. Meaning that even the most hardened of cynics can admit that they’re kind of sweet.
\nCheck out Cunningham’s proposal below (I personally enjoy that the weather map reads “ring ing in the New Year”), and then dive right into our list of even more social media wooers. What’s next? Entire domains dedicated to popping the question?
\n
\n\n \n \n \n \n
\n \n \nMore Wedding Bells and Whistles \n \nCONGRATS: Mashable Marriage Proposal Live at #SocialGood [Video]
\nMan Proposes Marriage via Foursquare Check-In
\nDid We Just Witness a Twitter Marriage Proposal?
\nSuccessful Marriage Proposal on Twitter Today: We #blamedrewscancer
\nJust Married: Groom Changes Facebook Relationship Status at the Altar [VIDEO]
"
149 | end
150 | end
151 |
152 | describe "extracting from alleyinsider" do
153 | before(:all) do
154 | @extracted_content = Extractula::Extractor.new(
155 | "http://www.businessinsider.com/10-stunning-web-site-prototype-sketches-2009-12",
156 | read_test_file("10-stunning-web-site-prototype-sketches.html")).extract
157 | end
158 |
159 | it "extracts the title" do
160 | @extracted_content.title.should == "10 Stunning Web Site Prototype Sketches"
161 | end
162 |
163 | it "extracts the content" do
164 | @extracted_content.content.should == Nokogiri::HTML(read_test_file("10-stunning-web-site-prototype-sketches.html")).css("div.KonaBody").first.inner_html.strip
165 | end
166 | end
167 |
168 | describe "extracting from nytimes" do
169 | before(:all) do
170 | @front_page = Extractula::Extractor.new(
171 | "http://www.nytimes.com/",
172 | read_test_file("nytimes.html")).extract
173 | @story_page = Extractula::Extractor.new(
174 | "http://www.nytimes.com/2009/12/31/world/asia/31history.html?_r=1&hp",
175 | read_test_file("nytimes_story.html")).extract
176 | end
177 |
178 | it "extracts the title" do
179 | @front_page.title.should == "The New York Times - Breaking News, World News & Multimedia"
180 | end
181 |
182 | it "extracts the content" do
183 | @front_page.content.should == Nokogiri::HTML(read_test_file("nytimes.html")).css("div.story").first.inner_html.strip
184 | end
185 |
186 | it "extracts a story title" do
187 | @story_page.title.should == "Army Historians Document Early Missteps in Afghanistan - NYTimes.com"
188 | end
189 |
# NOTE(review): the selector below is "nyt_text" (no "." or "#"), i.e. it
# targets a literal <nyt_text> element rather than a class or id — NYT pages
# of this era used such custom tags, but confirm this is intentional.
190 | it "extracts the story content" do
191 | @story_page.content.should == Nokogiri::HTML(read_test_file("nytimes_story.html")).css("nyt_text").first.inner_html.strip
192 | end
193 | end
194 | end
195 |
--------------------------------------------------------------------------------
/spec/extractula/oembed_spec.rb:
--------------------------------------------------------------------------------
require File.dirname(__FILE__) + '/../spec_helper'

describe Extractula::OEmbed do

  describe "setting a global max height" do
    before do
      # Reopen a throwaway class mixing in OEmbed, then set the module-wide
      # default height.
      class Thing
        include Extractula::OEmbed
      end
      Extractula::OEmbed.max_height 500
    end

    it "should be the default height for all OEmbeddable classes" do
      Thing.oembed_max_height.should == 500
    end

    it "should not override a specified height" do
      # A per-class setting wins over the global default.
      Thing.oembed_max_height 300
      Thing.oembed_max_height.should == 300
    end
  end

  describe "setting a global max width" do
    before do
      class Thing
        include Extractula::OEmbed
      end
      Extractula::OEmbed.max_width 500
    end

    it "should be the default width for all OEmbeddable classes" do
      Thing.oembed_max_width.should == 500
    end

    it "should not override a specified width" do
      Thing.oembed_max_width 300
      Thing.oembed_max_width.should == 300
    end
  end

end
--------------------------------------------------------------------------------
/spec/extractula_spec.rb:
--------------------------------------------------------------------------------
require File.dirname(__FILE__) + '/spec_helper'

describe "extractula" do
  it "can add custom extractors" do
    # Subclassing Extractula::Extractor registers the class automatically
    # (see extractor_spec), so the explicit add_extractor call stays
    # commented out.
    greedy_extractor = Class.new(Extractula::Extractor) do
      def self.can_extract? url, html
        true
      end

      def extract
        Extractula::ExtractedContent.new :url => "custom extractor url", :summary => "my custom extractor"
      end
    end

    # Extractula.add_extractor greedy_extractor
    content = Extractula.extract("http://pauldix.net", "some html")
    content.url.should == "custom extractor url"
    content.summary.should == "my custom extractor"
    Extractula.remove_extractor greedy_extractor
  end

  it "skips custom extractors that can't extract the passed url and html" do
    refusing_extractor = Class.new(Extractula::Extractor) do
      def self.can_extract? url, html
        false
      end

      def extract
        Extractula::ExtractedContent.new :url => "this url", :summary => "this summary"
      end
    end

    # Extractula.add_extractor refusing_extractor
    content = Extractula.extract("http://pauldix.net", "some html")
    content.url.should_not == "this url"
    content.summary.should_not == "this summary"
    Extractula.remove_extractor refusing_extractor
  end

  it "extracts from a url and document and returns an ExtractedContent object" do
    result = Extractula.extract("http://pauldix.net", "")
    result.should be_a Extractula::ExtractedContent
    result.url.should == "http://pauldix.net"
  end

  it "saves a reference to the last extractor used" do
    always_matching = Class.new(Extractula::Extractor) do
      def self.can_extract? url, html
        true
      end
    end
    Extractula.extract "http://pauldix.net", "some html"
    Extractula.last_extractor.should == always_matching
    Extractula.remove_extractor always_matching
  end

  describe "defining an inline custom extractor" do
    it "takes a block form definition" do
      extractor = Extractula.custom_extractor do
        domain 'pauldix'
        content_path '#content'
      end
      Extractula.extractors.should include(extractor)
      Extractula.remove_extractor extractor
    end

    it "takes a hash form definition" do
      extractor = Extractula.custom_extractor :domain => 'pauldix', :content_path => '#content'
      Extractula.extractors.should include(extractor)
      Extractula.remove_extractor extractor
    end

    it "can be named" do
      # Passing a symbol should define the class as a constant under the
      # Extractula namespace.
      extractor = Extractula.custom_extractor :PaulDix do
        domain 'pauldix'
        content_path '#content'
      end
      Extractula.const_defined?(:PaulDix).should be_true
      Extractula.remove_extractor extractor
    end

    it "can contain the OEmbed module" do
      extractor = Extractula.custom_extractor :oembed => true
      extractor.should include(Extractula::OEmbed)
      Extractula.remove_extractor extractor
    end
  end
end
--------------------------------------------------------------------------------
/spec/spec.opts:
--------------------------------------------------------------------------------
1 | --diff
2 | --color
3 |
--------------------------------------------------------------------------------
/spec/spec_helper.rb:
--------------------------------------------------------------------------------
require "rubygems"
require "spec"

# gem install redgreen for colored test output
begin require "redgreen" unless ENV['TM_CURRENT_LINE']; rescue LoadError; end

path = File.expand_path(File.dirname(__FILE__) + "/../lib/")
$LOAD_PATH.unshift(path) unless $LOAD_PATH.include?(path)

# lib/ is already on the load path (unshifted above), so require the library
# by its canonical names. The previous `require "lib/extractula"` form only
# resolved via the current working directory, which works on Ruby 1.8 when
# running specs from the project root but breaks on 1.9.2+ (where "." was
# removed from the load path) and from any other directory.
require "extractula"
require "extractula/custom_extractors"

# Reads a fixture from spec/test-files, e.g. read_test_file("yfrog.html").
# Raises Errno::ENOENT if the fixture is missing.
def read_test_file(file_name)
  File.read("#{File.dirname(__FILE__)}/test-files/#{file_name}")
end
16 |
--------------------------------------------------------------------------------
/spec/test-files/dinosaur-comics.html:
--------------------------------------------------------------------------------
1 |
2 | Dinosaur Comics - December 30th, 2009 - awesome fun times!
The permalink for this comic is:
http://www.qwantz.com/index.php?comic=1624
To share this comic on social networking sites, use these links!
del.icio.us digg facebook reddit stumbleupon
To add this comic to your website or blog, copy and paste this code:
<a href="http://www.qwantz.com/index.php?comic=1624"><img src="http://www.qwantz.com/comics/comic2-503.png"></a> And to add this comic to a forum, copy and paste this code:
[URL="http://www.qwantz.com/index.php?comic=1624"][IMG]http://www.qwantz.com/comics/comic2-503.png[/IMG][/URL] hide this box?
--------------------------------------------------------------------------------
/spec/test-files/flickr-oembed.json:
--------------------------------------------------------------------------------
1 | {"version":"1.0","type":"photo","title":"Greyhound Fisheye","author_name":"kotobuki711","author_url":"http:\/\/www.flickr.com\/photos\/kotobuki711\/","cache_age":3600,"provider_name":"Flickr","provider_url":"http:\/\/www.flickr.com\/","width":"500","height":"500","url":"http:\/\/farm3.static.flickr.com\/2127\/1789570897_6db70a9dbe.jpg"}
2 |
--------------------------------------------------------------------------------
/spec/test-files/node-name-error.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | Syfy + Motherboard.tv Caprica Screenings Contest
8 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
69 |
70 | For additional information on the
71 | contest, please visit
72 | motherboard.tv/caprica
73 |
74 |
75 |
76 |
77 |
78 |
79 |
83 |
88 |
89 |
90 |
91 |
--------------------------------------------------------------------------------
/spec/test-files/script_tag_remove_case.html:
--------------------------------------------------------------------------------
1 | \n\n
\n\n\n
\nObama to meet with House Republicans \n\nBy Perry Bacon Jr. \nWashington Post Staff Writer \nTuesday, January 26, 2010;\nA13\n \n
\n\n
\n\nPresident Obama will meet Friday with perhaps his harshest critics outside of Fox News headquarters: the House Republicans.\n
\n\nThe House GOP invited Obama this year to speak at its annual retreat, which will be held in Baltimore from Thursday to Saturday. Coming only two days after Obama's State of the Union address, the session could herald better relations between the two sides in 2010 -- or lift their tensions to an even higher level.\n
\n\nThe White House and congressional Republicans spent much of last year bickering over whom to blame for their inability to work together, as the administration constantly blasted the House GOP for unanimously opposing the economic stimulus, while Republicans said Obama and House Democrats refused to incorporate their ideas. A private meeting at the White House that included Obama and House Republicans in December on job growth turned into a griping session, with the president accusing the GOP of "scaring" Americans about his policies while Republicans said the anxiety in the country stemmed from his agenda.\n
\n\nSo far this year, nothing has changed. House Republicans have said Obama's policies led to the defeat of Democrat Martha Coakley in the special Senate election in Massachusetts. White House advisers, in turn, have blamed the GOP for the negative tone of Washington politics.\n
\n\nRep. Mike Pence (Ind.), the No. 3 in the House GOP leadership and the organizer of the retreat, said House Republicans wanted a stronger relationship with Obama and said the GOP 's goals of working with Obama and winning this fall's elections are not in conflict. "We serve our party best when we serve our country," he said. But he added that "the conversation with the president has to be a two-way street."\n
\n\nIn addition to Obama, the House GOP will hear from Virginia Gov. Robert F. McDonnell , one of the party's new stars, as well as former House speaker Newt Gingrich and former House majority leader Richard K. Armey, who heads up the conservative activist group FreedomWorks. Party leaders said they will focus on discussing a policy agenda for their candidates in the midterm elections .\n
\n\nLast year's retreat was at the Homestead in Hot Springs, Va. This year, worried about the appearance of a staying at a posh hotel as unemployment hovers over 10 percent, the Republicans have opted for a Marriott near the Inner Harbor. Earlier this month, Democrats eschewed holding a retreat at a luxury resort and heard from experts and the president in the Capitol's visitor center.\n
\n'Maybe I'm a masochist' \nWhile he deals with a energized GOP, Obama will also face an increasingly anxious left of his party in Congress. The Progressive Caucus, a group of more than 80 of the most liberal members in Congress, says Republican Scott Brown's upset victory in Massachusetts was not because Obama and Democrats were too liberal, but because they were insufficiently so.\n
\n\n"I don't think it was about health care, it was because change didn't happen fast enough -- that's the frustration," said Rep. Lynn Woolsey (D-Calif.), one of the group's leaders. "I believe that if we had pursued the populist, progressive agenda, such as a public option, we could have energized our base."\n
\n\nA Washington Post-Kaiser-Harvard poll of Massachusetts voters conducted after Brown's election showed that young and minority voters, who formed the backbone of Obama's support in 2008, represented a smaller percentage of the electorate in last Tuesday's special election. It's not clear whether policy issues or Obama's absence from the ballot caused some of these voters not to go to the polls.\n
\n\nWhatever the reason for the Massachusetts loss, Rep. Raul Raul Grijalva (D-Ariz.), leader of the Progressive Caucus, has outlined an agenda for 2010 that he says will appeal to the base: increased funding for education, a job-creation bill bigger than the $154 billion version that passed the House in December over the objections of many Democratic moderates, and immigration reform. The latter in particular is unlikely to pass this year.\n
\n\n"We are going to push," he said. "Maybe I'm masochist, but I'm still optimistic."\n
\nSelf-evident truths? \nThe tea party is coming to Capitol Hill. Hours before the president's speech on Wednesday, Rep. Michele Bachmann (R-Minn.), one of the lawmakers most closely allied with the movement, and FreedomWorks will hold an event with conservative activists and lawmakers to tout a "Declaration of Health Care Independence." An aide to Bachmann said the proposal would "protect the rights of the American to make their own health decisions," as well as include 10 conservative ideas for future health reform.\n
\n\nThe health-care event is one of the first steps the tea-party movement will take this year as it seeks to expand its influence. At a news conference Monday, FreedomWorks put out a list of candidates it is backing or opposing in key races this year. Florida Gov. Charlie Crist (R), a candidate for the Senate; Sen. Harry Reid (D-Nev.); and Rep. Alan Grayson (D-Fla.) each are labeled an "Enemy of Liberty" whom the group will oppose. FreedomWorks will back GOP Senate candidates Marco Rubio (Fla.), Pat Toomey (Pa.) and Rand Paul (Ky.) -- each, according to the group, is a "Champion of Freedom."\n
\n\nIn Session is a weekly look inside Congress. \n
\n\n
\n\n
\n \n \n\n\n
\n \n
\n\n\302\251\302\2402010\302\240The Washington Post Company
\n
--------------------------------------------------------------------------------
/spec/test-files/totlol-youtube.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | The Sad Tale Of Totlol And How YouTube’s Changing TOS Made It Hard To Make A Buck
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
17 |
18 |
19 |
23 |
25 |
29 |
30 |
35 |
36 |
37 |
38 |
47 |
48 |
49 |
75 |
76 |
77 |
78 |
79 |
80 |
81 |
82 |
83 |
92 |
93 |
94 |
109 |
110 |
111 |
112 |
121 |
122 |
123 |
124 |
125 |
126 |
127 |
128 |
129 |
130 |
131 |
132 |
146 |
147 |
148 |
149 |
150 |
151 |
183 |
184 |
185 |
188 |
189 |
190 |
191 |
227 |
228 |
229 |
For developers, the Web is increasingly becoming a rich trove of data which can be plucked and used as the foundation to build new services and applications. The data on the Web is becoming increasingly accessible through application programming interfaces (APIs), and some of the richest APIs come from the biggest sites on the Web: YouTube, Facebook, Twitter. But just as these APIs give life to tens of thousands of developers, they can also be limiting. Ron Ilan, the developer and entrepreneur behind the children’s video site Totlol , learned the hard way that if you live by the API, you can also die by the API.
230 |
Totlol is a site filled with children’s’ videos from YouTube curated by parents. Think of it as a safe, white-listed, children’s version of YouTube. It is built entirely on top of YouTube’s APIs. But a change in the terms of service (TOS) of those APIs caused Ron to shut down the free version of his site six months ago and move to a subscription model which never really became a going concern.
231 |
Ron clearly blames YouTube for his woes. You can read his version of the whole sad tale , which portrays YouTube as conspiring to change its API terms of service in response to Totlol. Whether or not there was any actual malice on the part of YouTube, or the change was just a coincidence in timing, as someone who was on the YouTube API team told Ilan via email, the episode is a cautionary tale for anyone trying to build a business on another company’s APIs.
232 |
The gist of what happened is that Ilan developed Totlol using YouTube’s APIs. The service wrapped YouTube videos in Totlol’s own player on its site, where people could create collections and do much more. YouTube noticed the app and even featured it in its Google Code widget on July 7, 2008, after some delay. That also happened to be the exact same day that Google changed the terms of service for its API to disallow commercial use without “YouTube’s prior written approval,” including for the following:
233 |
the sale of advertising, sponsorships, or promotions on any page of the API Client containing YouTube audiovisual content, unless other content not obtained from YouTube appears on the same page and is of sufficient value to be the basis for such sales
234 |
That pretty much killed Totlol’s revenue model, which was to place ads on the pages where the videos were played. Just bad luck, right? Ron asked YouTube for permission to run ads on his site, but he never got a response. Ron was understandably frustrated buy this turn of events. The site was his livelihood. In his post, he sums up what he thinks happened this way:
235 |
When the YouTube API team saw Totlol they liked it. At about the same time someone else at Google saw it, realized the potential it, and/or similar implementations may have, and initiated a ToS modification. An instruction was given to delay public acknowledgement of Totlol until the modified ToS where published. Later an instruction was given to avoid public acknowledgement at all.
236 |
Maybe there was a connection, or maybe this conspiracy existed only in Ron’s mind. It is hard to believe YouTube would modify it in response to a single developer. In a statement, YouTube responds:
237 |
Updates to our API Terms of Service generally take months of preparation and review and are pushed out primarily to better serve our users, partners and developers. When new Terms of Service are ready, we notify our developers through as many channels as possible, including on our developer blog.
238 |
And YouTube did at least try to reach out to him. In June of this year, he was approached by a director of product management at YouTube who wanted to know what YouTube could do to prevent such failures in the future. In an email, the YouTube director asked Ron:
239 |
What types of business models would we need to support in order to make this worth a developer’s while?
240 | . . . Semi-related: what about the YouTube APIs made it challenging to run the site as a standalone?
241 |
242 |
The questions make it clear that YouTube knew there were things it could do to make its APIs more developer-friendly. The two even met at a Starbucks, but nothing came of the meeting.
243 |
Ultimately, it was up Ron to build a site that not only attracted users but was also economically viable. But like many developers, he was at the mercy of YouTube’s rules. Live by the API, die by the API. Ron is now looking for a regular 9-to-5 job to support his family.
244 |
YouTube has no problem splitting revenues with bigger partners such as Vevo, which show their videos on both their own site and on YouTube. But maybe YouTube is making a distinction between splitting revenues with content creators and with content aggregators like Totlol. Is there not enough value in content aggregation when done creatively. The executives in charge of Google News, at least, would answer in the affirmative. YouTube is not a kid’s site, yet Totlol was able to create a kid’s site out of YouTube, with different features and a different look and feel.
245 |
YouTube wants to control the economics surrounding its videos, whether they are watched on YouTube or on another site. The last thing it wants is to encourage a bunch of spam sites filled with Youtube videos and AdSense. That’s fair enough. But Totlol was a legitimate site, even an innovative one. It was the kind of site YouTube should do everything it can to encourage. Tales like this one make you wonder how hard it is for developers who want to play by the rules to build businesses on top of those APIs. Is YouTube helping developers or thwarting them?
246 |
247 |
250 |
251 |
252 |
253 |
254 |
255 |
256 |
257 |
258 |
259 |
274 |
275 |
280 |
281 |
Advertisement
282 |
283 |
286 |
287 |
288 |
301 |
302 |
303 |
304 |
305 |
306 |
307 |
352 |
353 |
354 |
355 |
356 |
357 |
420 |
432 |
438 |
439 |
440 |
443 |
444 |
445 |
448 |
449 |
450 |
453 |
454 |
455 |
456 |
457 |
458 |
459 |
460 |
461 |
464 |
465 |
466 |
469 |
470 |
471 |
474 |
475 |
476 |
479 |
480 |
481 |
482 |
483 |
484 |
485 |
486 |
487 |
488 |
489 |
490 |
491 |
507 |
508 |
512 |
513 |
519 |
520 |
521 |
527 |
528 |
529 |
530 |
531 |
532 |
533 |
534 |
535 |
536 |
537 |
538 |
539 |
542 |
543 |
554 |
555 |
556 |
557 |
558 |
559 |
560 |
561 |
562 |
563 |
--------------------------------------------------------------------------------
/spec/test-files/twitpic.html:
--------------------------------------------------------------------------------
1 |
3 |
4 |
5 |
6 | @AMY_CLUB si te dejo Jack ? Lol on Twitpic
7 |
8 |
9 |
10 |
11 |
27 |
28 |
29 |
30 |
31 |
32 |
72 |
73 |
74 |
75 |
76 |
77 |
78 |
79 |
80 |
81 | @AMY_CLUB si te dejo Jack ? Lol
82 |
83 |
84 |
92 |
93 |
94 |
162 |
163 |
164 |
165 |
166 |
167 |
173 |
174 |
178 |
183 |
184 |
185 |
190 |
191 |
192 |
193 |
194 |
195 |
196 |
197 |
198 |
199 |
200 |
205 |
206 |
207 |
208 |
209 |
210 |
--------------------------------------------------------------------------------
/spec/test-files/typhoeus-the-best-ruby-http-client-just-got-better.html:
--------------------------------------------------------------------------------
1 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
18 |
19 |
20 |
21 |
22 | Paul Dix Explains Nothing: Typhoeus, the best Ruby HTTP client just got better
23 |
24 |
25 |
26 |
27 |
28 |
29 |
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
Entrepreneurship, programming, software development, politics, NYC, and random thoughts.
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 |
50 |
51 | « Bypassing wxWidgets error when building Erlang from Source on OS X |
52 | Main
53 | | Resources for Synchronous Reads, Asynchronous Writes at RubyConf 2009 »
54 |
55 |
56 |
57 |
58 |
59 |
60 |
61 |
62 |
63 |
64 |
65 |
I've been quietly working on Typhoeus for the last few months. With the help of Wilson Bilkovich and David Balatero I've finished what I think is a significant improvement to the library. The new interface removes all the magic and opts instead for clarity.
It's really slick and includes improved stubing support, caching, memoization, and (of course) parallelism. The Typhoeus readme highlights all of the awesomeness. It should be noted that the old interface of including Typhoeus into classes and defining remote methods has been deprecated. I'll be removing that sometime in the next six months.
In addition to thanking everyone using the library and everyone contributing, I should also thank my employer kgbweb. If you're a solid Rubyist that likes parsing, crawling, and stuff, or a machine learning guy, or a Solr/Lucene/indexing bad ass, let me know. We need you and we're doing some crazy awesome stuff.
66 |
67 |
68 |
69 |
70 |
71 |
83 |
84 |
85 |
86 |
87 |
88 |
92 |
93 |
94 |
95 |
96 |
97 |
98 |
99 |
100 |
101 |
102 |
103 |
104 |
186 |
187 |
188 |
189 |
321 |
322 |
323 |
324 |
325 |
326 |
327 |
328 |
329 |
330 |
331 |
332 |
333 |
334 |
335 |
336 |
339 |
340 |
346 |
347 |
352 |
353 |
354 |
355 |
360 |
361 |
362 |
363 | My Github
364 | Feedzirra My Ruby library for parsing and fetching feeds at blinding speed.
365 | SAX Machine My Ruby library exposes a DSL for building Nokogiri backed SAX parsers.
366 | Typhoeus My Ruby library for running HTTP requests quickly, easily, and in parallel.
367 |
368 |
369 |
370 |
371 |
372 |
400 |
401 |
402 |
403 |
404 |
405 |
406 |
407 |
408 |
445 |
446 |
447 |
448 |
449 |
450 |
451 |
482 |
More...
483 |
484 |
485 |
486 |
487 |
488 |
489 |
494 |
495 |
496 |
497 |
498 |
499 |
500 |
501 |
502 |
503 |
504 |
505 |
506 |
507 |
508 |
514 |
515 |
516 |
517 |
518 |
519 |
520 |
521 |
522 |
523 |
524 |
525 |
526 |
527 |
572 |
573 |
574 |
575 |
576 |
577 |
--------------------------------------------------------------------------------
/spec/test-files/vimeo.json:
--------------------------------------------------------------------------------
1 | {"type":"video","version":"1.0","provider_name":"Vimeo","provider_url":"http:\/\/vimeo.com\/","title":"Cracker Bag","author_name":"Glendyn Ivin","author_url":"http:\/\/vimeo.com\/user1783024","is_plus":"1","html":"<\/embed><\/object>","width":"640","height":"360","duration":"866","description":"Eddie spends her pocket money obsessively hoarding fireworks and carefully planning for cracker night. When it finally it arrives, Eddie and her family head to the local football oval. In the frosty air Eddie lights the fuse of her first cracker and experiences a pivotal moment, one of the seemingly small experiences of childhood, that affects us for the rest of our lives. \n\nSet in the 1980s, Cracker Bag is a gentle suburban observation which subtly reflects a disenchanting prelude to the coming of age. \n\nWinner of the Palme D'Or - Short Film Cannes Film Festival 2003\n\nwww.GlendynIvin.com\nwww.Exitfilms.com","thumbnail_url":"http:\/\/ts.vimeo.com.s3.amazonaws.com\/422\/231\/42223153_200.jpg","thumbnail_width":"200","thumbnail_height":"150","video_id":"8833777"}
--------------------------------------------------------------------------------
/spec/test-files/youtube-oembed.json:
--------------------------------------------------------------------------------
1 | {
2 | "provider_url": "http://www.youtube.com/",
3 | "title": "The Sneezing Baby Panda",
4 | "html": " ",
5 | "author_name": "jimvwmoss",
6 | "height": 344,
7 | "width": 425,
8 | "version": "1.0",
9 | "author_url": "http://www.youtube.com/user/jimvwmoss",
10 | "provider_name": "YouTube",
11 | "type": "video"
12 | }
--------------------------------------------------------------------------------