Page not found
Sorry, but the page you were trying to get to does not exist. You may want to try searching this site using the sidebar or using our API Reference page to find what you were looking for.
├── config
│   ├── dev.exs
│   ├── prod.exs
│   ├── test.exs
│   └── config.exs
├── lib
│   ├── scrapex.ex
│   └── scrapex
│       ├── gen_spider
│       │   ├── README.md
│       │   ├── response.ex
│       │   └── request.ex
│       ├── selector.ex
│       ├── spider
│       │   └── webscraper.ex
│       └── gen_spider.ex
├── logo.png
├── .gitignore
├── doc
│   ├── assets
│   │   └── logo.png
│   ├── fonts
│   │   ├── icomoon.eot
│   │   ├── icomoon.ttf
│   │   ├── icomoon.woff
│   │   └── icomoon.svg
│   ├── index.html
│   ├── dist
│   │   ├── sidebar_items.js
│   │   └── app.css
│   ├── 404.html
│   ├── Scrapex.html
│   ├── extra-api-reference.html
│   ├── extra-readme.html
│   ├── Scrapex.GenSpider.Response.html
│   └── Scrapex.Selector.html
├── test
│   ├── scrapex_test.exs
│   ├── sample_pages
│   │   ├── e-commerce
│   │   │   └── static
│   │   │       ├── computers
│   │   │       │   ├── index_files
│   │   │       │   │   └── cart2.png
│   │   │       │   ├── index.html
│   │   │       │   ├── tablets
│   │   │       │   │   └── index.html
│   │   │       │   └── laptops
│   │   │       │       └── index.html
│   │   │       ├── index.html
│   │   │       └── phones
│   │   │           ├── index.html
│   │   │           └── touch
│   │   │               └── index.html
│   │   └── example.com.html
│   ├── test_helper.exs
│   └── scrapex
│       ├── spider
│       │   ├── example_test.exs
│       │   ├── webscraper.csv
│       │   └── webscraper_test.exs
│       ├── selector_test.exs
│       └── gen_spider_test.exs
├── LICENSE
├── mix.lock
├── mix.exs
└── README.md

/config/dev.exs:
--------------------------------------------------------------------------------
use Mix.Config

--------------------------------------------------------------------------------
/lib/scrapex.ex:
--------------------------------------------------------------------------------
defmodule Scrapex do
end

--------------------------------------------------------------------------------
/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sntran/scrapex/HEAD/logo.png

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/_build
/deps
erl_crash.dump
*.ez
.DS_Store
*.beam

--------------------------------------------------------------------------------
/doc/assets/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sntran/scrapex/HEAD/doc/assets/logo.png

--------------------------------------------------------------------------------
/doc/fonts/icomoon.eot:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sntran/scrapex/HEAD/doc/fonts/icomoon.eot

--------------------------------------------------------------------------------
/doc/fonts/icomoon.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sntran/scrapex/HEAD/doc/fonts/icomoon.ttf

--------------------------------------------------------------------------------
/doc/fonts/icomoon.woff:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sntran/scrapex/HEAD/doc/fonts/icomoon.woff

--------------------------------------------------------------------------------
/config/prod.exs:
--------------------------------------------------------------------------------
use Mix.Config

# Do not print debug messages in production
config :logger, level: :info

--------------------------------------------------------------------------------
/config/test.exs:
--------------------------------------------------------------------------------
use Mix.Config

# Print only warnings and errors during test
config :logger, level: :warn

--------------------------------------------------------------------------------
/test/scrapex_test.exs:
--------------------------------------------------------------------------------
defmodule ScrapexTest do
  use ExUnit.Case

  test "the truth" do
    assert 1 + 1 == 2
  end
end

--------------------------------------------------------------------------------
/test/sample_pages/e-commerce/static/computers/index_files/cart2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sntran/scrapex/HEAD/test/sample_pages/e-commerce/static/computers/index_files/cart2.png

--------------------------------------------------------------------------------
/doc/index.html:
--------------------------------------------------------------------------------
This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.
A behaviour module for implementing a web data extractor
Utilities for working with responses returned from GenSpider
Utilities for extracting data from markup language
An open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
Write the rules to extract the data and let Scrapex do the rest.
Extensible by design: plug in new functionality easily without having to touch the core.
Written in Elixir; runs on Linux, Windows, Mac, BSD, and embedded devices.
alias Scrapex.GenSpider

defmodule StackOverflowSpider do
  use GenSpider
  import Scrapex.Selector

  # Collects the link of every question on the listing page, requests
  # each question page, and waits for its parsed result.
  def parse(response, state) do
    result =
      response.body
      |> select(".question-summary h3 a")
      |> extract("href")
      |> Enum.map(fn href ->
        GenSpider.Response.url_join(response, href)
        |> GenSpider.request(&parse_question/1)
        |> GenSpider.await
      end)

    {:ok, result, state}
  end

  # Extracts the title, body, vote count, and tags of a single question.
  defp parse_question({:ok, response}) do
    html = response.body
    [title] = html |> select("h1 a") |> extract()
    question = html |> select(".question")
    [body] = question |> select(".post-text") |> extract()
    [votes] = question |> select(".vote-count-post") |> extract()
    tags = question |> select(".post-tag") |> extract()

    %{title: title, body: body, votes: votes, tags: tags}
  end
end

urls = ["http://stackoverflow.com/questions?sort=votes"]
opts = [name: :webscrapper, urls: urls]
{:ok, spider} = GenSpider.start_link(StackOverflowSpider, [], opts)
questions = GenSpider.export(spider)
#=> "[{} | _]"
GenSpider behaviour.
parse/2 callback.
parse/2
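The callback contract used by the spider above, `parse/2` returning `{:ok, result, state}`, can be illustrated with a plain Elixir behaviour. This is only a sketch: `SpiderSketch`, `SpiderSketch.run/3`, and `WordSpider` are hypothetical names invented for illustration, the response is assumed to be a simple map with a `:body` key, and none of this is Scrapex's actual definition of GenSpider.

```elixir
defmodule SpiderSketch do
  # Hypothetical behaviour for illustration only; not Scrapex's GenSpider.
  @callback parse(response :: map(), state :: term()) ::
              {:ok, result :: list(), new_state :: term()}

  # Hands an already-fetched response to the callback module and
  # returns only the extracted result.
  def run(module, response, state) do
    {:ok, result, _new_state} = module.parse(response, state)
    result
  end
end

defmodule WordSpider do
  @behaviour SpiderSketch

  @impl true
  def parse(%{body: body}, state) do
    # Stand-in for real extraction: split the body on whitespace.
    {:ok, String.split(body), state}
  end
end

IO.inspect(SpiderSketch.run(WordSpider, %{body: "hello scrapex"}, []))
```

Keeping the state in the return tuple, as the README example does, lets a callback thread data between calls without the framework knowing what that data is.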
Utilities for working with responses returned from GenSpider.
Join a path relative to the response’s URL
t :: %Scrapex.GenSpider.Response{url: binary, body: binary}

url_join(t, binary) :: binary

Join a path relative to the response’s URL.

iex> alias Scrapex.GenSpider.Response
iex> response = %Response{url: "http://www.scrapex.com/subfolder"}
iex> Response.url_join(response, "/subfolder2")
"http://www.scrapex.com/subfolder2"
iex> Response.url_join(response, "subsubfolder")
"http://www.scrapex.com/subfolder/subsubfolder"
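The joining behaviour shown in the examples above can be approximated with the standard library alone. The sketch below is not the Scrapex implementation: `UrlJoinSketch` and `join/2` are hypothetical names, and it works on plain URL strings rather than a `Response` struct. It assumes that a relative path without a leading slash should be appended under the base path, as in the second doctest, so it adds a trailing slash to the base path before calling `URI.merge/2`. As far as I know, `URI.merge/2` on its own follows RFC 3986 and would replace the last path segment instead.

```elixir
defmodule UrlJoinSketch do
  # Hypothetical helper for illustration only; not part of Scrapex.
  def join(base, relative) when is_binary(base) and is_binary(relative) do
    base
    |> URI.parse()
    |> ensure_trailing_slash()
    |> URI.merge(relative)
    |> URI.to_string()
  end

  # Treats the base path as a directory so that relative segments are
  # appended to it rather than replacing its last segment.
  defp ensure_trailing_slash(%URI{path: path} = uri) do
    path = path || "/"

    if String.ends_with?(path, "/") do
      %{uri | path: path}
    else
      %{uri | path: path <> "/"}
    end
  end
end

base = "http://www.scrapex.com/subfolder"
IO.puts(UrlJoinSketch.join(base, "/subfolder2"))
IO.puts(UrlJoinSketch.join(base, "subsubfolder"))
```

Parsing the base URL first, instead of appending a slash to the raw string, keeps any query string on the base from being corrupted.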
Welcome to WebScraper e-commerce site. You can use this site for training to learn how to use the Web Scraper. Items listed here are not for sale.
Utilities for extracting data from markup language.
Attribute of a node
A tree of HTML nodes, or a node itself if only one
Name of the tag or attribute
Extracts content or attribute value for a selection
Generates a selection for a particular selector
Extracts content or attribute value for a selection.
Generates a selection for a particular selector.
The return value is a Selector.t
15.6", Core i5-4200U, 4GB, 750GB, Radeon HD8670M 2GB, Windows
15.6", Core i5-4200U, 8GB, 1TB, Radeon R7 M265, Windows 8.1