├── README-german.md
├── README.md
├── examples
│   ├── demo_01.php
│   ├── demo_02.php
│   ├── demo_03.php
│   ├── demo_04.php
│   ├── demo_05.php
│   ├── demo_06.php
│   ├── demo_07.php
│   ├── demo_08.php
│   ├── demo_09.php
│   ├── demo_10.php
│   ├── demo_11.php
│   ├── demo_12.php
│   ├── demo_data.htm
│   ├── demo_xml.xml
│   └── query_examples.txt
├── htmlsql.class.php
└── snoopy.class.php
/README-german.md:
--------------------------------------------------------------------------------
1 | htmlSQL - Version 0.5
2 | =====================
3 |
4 | htmlSQL is an experimental PHP class that lets you access HTML
5 | elements through an SQL-like syntax. This means you no longer
6 | need complicated functions to extract specific tags;
7 | instead, you simply run a query
8 | like this:
9 |
10 |     SELECT href,title FROM a WHERE $class == "liste"
11 |            ^ HTML attrib.  ^       ^ query (can also be empty)
12 |              to be         ^
13 |              returned      ^ HTML tags to search
14 |                              "*" is allowed here = all tags
15 |
16 | This query returns an array of all links with the attribute
17 | class="liste".
18 |
19 | All HTTP connections in htmlSQL use the Snoopy class
20 | (package version 1.2.3 - URL: http://sourceforge.net/projects/snoopy/).
21 | However, Snoopy is not needed for "file" or "string" queries.
22 | All Snoopy-related documents (e.g. copyright info, readme, etc.)
23 | are in the "snoopy_data/" subdirectory.
24 |
25 |
26 | Installation / Usage
27 | --------------------
28 |
29 | To use htmlSQL in your own projects, you only need to load the
30 | two files "snoopy.class.php" and "htmlsql.class.php"
31 | (with include or, for example, require). After that, htmlSQL can be
32 | used as shown in the examples (see the examples/ folder). This should
33 | not be too hard :-)
34 |
35 |
36 | Background / History
37 | --------------------
38 |
39 | I had the idea for this class while extracting data from a website,
40 | when I noticed that the functions and source code often repeat
41 | themselves. That gave me the idea to simplify the whole thing and
42 | develop a universal class for it.
43 |
44 |
45 | Warning
46 | -------
47 |
48 | The `eval()` function is used for the queries. Therefore, all
49 | visitor-dependent data, such as IDs, should be checked and filtered
50 | where necessary; otherwise it would be possible to execute malicious PHP code.
51 | Never trust user input!
52 |
53 |
54 | Todo
55 | ----
56 |
57 | - Improve the internal HTML parser
58 | - Develop a dedicated query system instead of relying on PHP's own
59 | (the eval() solution is not really pretty)
60 | - More error checks
61 | - Unit tests
62 | - LIMIT function (as in SQL)
63 |
64 |
65 | Use cases for htmlSQL
66 | ---------------------
67 |
68 | - Reading data from other websites
69 | - HTML-based databases?
70 | - Reading XML data
71 |
72 |
73 | Author
74 | ------
75 |
76 | - [Jonas John](http://www.jonasjohn.de/)
77 |
78 |
79 | License
80 | -------
81 |
82 | htmlSQL uses a modified BSD license, which is quite permissive.
83 | The license text is in "htmlsql.class.php".
84 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | htmlSQL - Version 0.5
2 | =====================
3 |
4 | htmlSQL is an experimental PHP library that lets you access HTML values with an SQL-like syntax.
5 | This means that you don't have to write complex functions or regular expressions to extract specific values.
6 |
7 | **htmlSQL queries look like this:**
8 |
9 |     SELECT href,title FROM a WHERE $class == "list"
10 |            ^ Attributes    ^       ^ search query (can be empty)
11 |              to return     ^
12 |                            ^ HTML tag to search in
13 |                              "*" is possible = all tags
14 |
15 | This query should return an array with all links that contain the attribute `class="list"`.
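
For comparison, the same selection can be written against PHP's bundled DOM extension. This is only a sketch of what the query computes, not a description of how htmlSQL works internally; the sample markup and the helper name `select_links()` are made up for illustration.

```php
<?php
// Sketch only: select_links() and the sample markup are hypothetical.
// It mirrors: SELECT href,title FROM a WHERE $class == "list"

function select_links($html, $class)
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world HTML.
    @$doc->loadHTML($html);

    $rows = array();
    foreach ($doc->getElementsByTagName('a') as $a) {    // FROM a
        if ($a->getAttribute('class') == $class) {       // WHERE $class == "list"
            $rows[] = array(                             // SELECT href,title
                'href'  => $a->getAttribute('href'),
                'title' => $a->getAttribute('title'),
            );
        }
    }
    return $rows;
}

$html = '<html><body>'
      . '<a class="list" href="/a.htm" title="A">A</a>'
      . '<a class="nav" href="/b.htm" title="B">B</a>'
      . '</body></html>';

print_r(select_links($html, 'list'));
```

Only the first link is returned, because the comparison is an exact match on the whole `class` attribute value.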
16 |
17 |
18 | Project Discontinued
19 | --------------------
20 |
21 | htmlSQL was an experiment I did in 2006. I no longer support or extend the library; this repository exists for historical purposes only. Feel free to fork, modify, and study the source code. If you need a reliable library for data scraping, I recommend one of the projects listed below.
22 |
23 | Related projects:
24 |
25 | * PHP: [phpQuery](http://code.google.com/p/phpquery/), [SimpleXML](http://www.php.net/simplexml), [DOM](http://www.php.net/dom)
26 | * Perl: [WWW::Mechanize](http://search.cpan.org/dist/WWW-Mechanize/), [pQuery](http://search.cpan.org/~ingy/pQuery-0.07/lib/pQuery.pm)
27 | * Python: [Scrapy](http://scrapy.org/), [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)
28 | * JavaScript: [node.js](http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs)
29 | * .NET: [Html Agility Pack](http://htmlagilitypack.codeplex.com/)
30 |
31 | Related links:
32 |
33 | * [Stack Overflow: Options for HTML scraping?](http://stackoverflow.com/questions/2861/options-for-html-scraping)
34 | * [Stack Overflow: HTML Scraping in PHP](http://stackoverflow.com/questions/34120/html-scraping-in-php)
35 | * [Hacker News: PHP class to query the web by an SQL like language](http://news.ycombinator.com/item?id=2097008)
36 | * [Hacker News: Ask YC: What do you scrape? How do you scrape?](http://news.ycombinator.com/item?id=159025)
37 |
38 |
39 | Requirements
40 | ------------
41 |
42 | - Any flavor of PHP4+ should do
43 | - [Snoopy PHP class - Version 1.2.3](http://sourceforge.net/projects/snoopy/) (optional - required for web transfers)
44 | You can find all Snoopy-related documents (copyright, readme, etc.) in the snoopy_data/ subdirectory.
45 |
46 |
47 | Usage
48 | -----
49 |
50 | Just include the "snoopy.class.php" and the "htmlsql.class.php" files
51 | into your PHP scripts and look at the examples to get an idea of how
52 | to use the htmlSQL library. It should be very simple :-)
53 |
54 |
55 | Background / idea
56 | -----------------
57 |
58 | I had this idea while extracting some data from a website. I realized
59 | that the algorithms and functions used to extract links and other tags
60 | are often the same, so I decided to combine them into a universally
61 | usable library. While drinking a coffee and thinking about that, I
62 | thought it would be cool to access HTML elements by using SQL. So I
63 | started creating this library...
64 |
65 |
66 | Warning
67 | -------
68 |
69 | The WHERE statement is evaluated with the `eval()` function, so anything
70 | placed in it runs as PHP code. Check and filter all user data before it reaches a query.
71 | Never trust any user input!
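
One way to follow this advice is to accept only a narrow whitelist of characters before a value is interpolated into the WHERE statement. The sketch below assumes the user value is meant to be a class name; `build_class_query()` is a hypothetical helper and not part of htmlSQL.

```php
<?php
// Sketch only: build_class_query() is a hypothetical helper.

function build_class_query($user_value)
{
    // Accept letters, digits, "_" and "-" only. Quotes, backslashes, "$",
    // ";" and parentheses are rejected, so the value cannot break out of the
    // string literal that eval() will later see. \A and \z anchor the whole
    // string, which also rejects a trailing newline.
    if (!preg_match('/\A[A-Za-z0-9_-]{1,64}\z/', $user_value)) {
        return false;
    }
    return 'SELECT href,title FROM a WHERE $class == "' . $user_value . '"';
}

var_dump(build_class_query('list'));                    // a query string
var_dump(build_class_query('x" || system("id") || "')); // bool(false)
```

Rejecting unexpected input outright is simpler to reason about than trying to escape it for a string that is later passed to `eval()`.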
72 |
73 |
74 | Todo
75 | ----
76 |
77 | * Enhance the HTML parser
78 | * Test htmlSQL with invalid and bad HTML files
79 | * Replace the ugly `eval()` method for the WHERE statement with a custom evaluator
80 | * Add more error checks
81 | * Add unit tests
82 | * Add a LIMIT function like in SQL
83 |
84 |
85 | Author
86 | ------
87 |
88 | * [Jonas John](http://www.jonasjohn.de/)
89 |
90 |
91 | License
92 | -------
93 |
94 | htmlSQL uses a modified BSD license; the full license text is in "htmlsql.class.php".
95 |
--------------------------------------------------------------------------------
/examples/demo_01.php:
--------------------------------------------------------------------------------
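1 | <?php
2 |
3 | /*
4 | ** NOTE: lines 1-15 are reconstructed. The include paths and the
5 | ** class name `htmlsql` are assumptions based on the README.
6 | */
7 |
8 | include_once('../snoopy.class.php');
9 | include_once('../htmlsql.class.php');
10 |
11 | $wsql = new htmlsql();
12 |
13 |
14 | // connect to a URL
15 | if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/')){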
16 | print 'Error while connecting: ' . $wsql->error;
17 | exit;
18 | }
19 |
20 | /* execute a query:
21 |
22 | This query extracts all links with the classname = nav_item
23 | */
24 | if (!$wsql->query('SELECT * FROM a WHERE $class == "nav_item"')){
25 | print "Query error: " . $wsql->error;
26 | exit;
27 | }
28 |
29 | // show results:
30 | foreach($wsql->fetch_array() as $row){
31 |
32 | print_r($row);
33 |
34 | /*
35 | $row is an array and looks like this:
36 | Array (
37 | [href] => /feedback.htm
38 | [class] => nav_item
39 | [tagname] => a
40 | [text] => Feedback
41 | )
42 | */
43 |
44 | }
45 |
46 | ?>
--------------------------------------------------------------------------------
/examples/demo_02.php:
--------------------------------------------------------------------------------
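1 | <?php
2 |
3 | /*
4 | ** NOTE: lines 1-15 are reconstructed. The include paths and the
5 | ** class name `htmlsql` are assumptions based on the README.
6 | */
7 |
8 | include_once('../snoopy.class.php');
9 | include_once('../htmlsql.class.php');
10 |
11 | $wsql = new htmlsql();
12 |
13 |
14 | // connect to a file
15 | if (!$wsql->connect('file', 'demo_data.htm')){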
16 | print 'Error while connecting: ' . $wsql->error;
17 | exit;
18 | }
19 |
20 | /* execute a query:
21 |
22 | This query extracts all links from the document
23 | and just returns href (as url) and text
24 | */
25 | if (!$wsql->query('SELECT href as url, text FROM a')){
26 | print "Query error: " . $wsql->error;
27 | exit;
28 | }
29 |
30 | // show results:
31 | foreach($wsql->fetch_array() as $row){
32 |
33 | print "Link-URL: " . $row['url'] . "\n";
34 | print "Link-Text: " . trim($row['text']) . "\n\n";
35 |
36 | }
37 |
38 | ?>
--------------------------------------------------------------------------------
/examples/demo_03.php:
--------------------------------------------------------------------------------
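1 | <?php
2 |
3 | /*
4 | ** NOTE: lines 1-15 are reconstructed. The include paths and the
5 | ** class name `htmlsql` are assumptions based on the README.
6 | */
7 |
8 | include_once('../snoopy.class.php');
9 | include_once('../htmlsql.class.php');
10 |
11 | $wsql = new htmlsql();
12 |
13 |
14 | // connect to a file
15 | if (!$wsql->connect('file', 'demo_data.htm')){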
16 | print 'Error while connecting: ' . $wsql->error;
17 | exit;
18 | }
19 |
20 | /* execute a query:
21 |
22 | This query searches in all tags for the id == header and returns
23 | the tag
24 | */
25 | if (!$wsql->query('SELECT * FROM * WHERE $id == "header"')){
26 | print "Query error: " . $wsql->error;
27 | exit;
28 | }
29 |
30 | // show results:
31 | foreach($wsql->fetch_array() as $row){
32 |
33 | print_r($row);
34 |
35 | }
36 |
37 | ?>
--------------------------------------------------------------------------------
/examples/demo_04.php:
--------------------------------------------------------------------------------
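1 | <?php
2 |
3 | /*
4 | ** NOTE: lines 1-15 are reconstructed. The include paths and the
5 | ** class name `htmlsql` are assumptions based on the README.
6 | */
7 |
8 | include_once('../snoopy.class.php');
9 | include_once('../htmlsql.class.php');
10 |
11 | $wsql = new htmlsql();
12 |
13 |
14 | // connect to a URL
15 | if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/links.htm')){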
16 | print 'Error while connecting: ' . $wsql->error;
17 | exit;
18 | }
19 |
20 | /* execute a query:
21 |
22 | This query returns all links of a document that start with http://
23 | */
24 | if (!$wsql->query('SELECT * FROM a WHERE preg_match("/^http:\/\//", $href)')){
25 | print "Query error: " . $wsql->error;
26 | exit;
27 | }
28 |
29 | // show results:
30 | foreach($wsql->fetch_array() as $row){
31 |
32 | print_r($row);
33 |
34 | }
35 |
36 | ?>
--------------------------------------------------------------------------------
/examples/demo_05.php:
--------------------------------------------------------------------------------
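1 | <?php
2 |
3 | /*
4 | ** NOTE: lines 1-15 are reconstructed. The include paths and the
5 | ** class name `htmlsql` are assumptions based on the README.
6 | */
7 |
8 | include_once('../snoopy.class.php');
9 | include_once('../htmlsql.class.php');
10 |
11 | $wsql = new htmlsql();
12 |
13 |
14 | // connect to a URL
15 | if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/links.htm')){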
16 | print 'Error while connecting: ' . $wsql->error;
17 | exit;
18 | }
19 |
20 | /* execute a query:
21 |
22 | This query returns all links of a document that do not start with /
23 | ( / = internal links)
24 | */
25 | if (!$wsql->query('SELECT * FROM a WHERE substr($href,0,1) != "/"')){
26 | print "Query error: " . $wsql->error;
27 | exit;
28 | }
29 |
30 | // fetch results as object and format as HTML links:
31 | foreach($wsql->fetch_objects() as $obj){
32 |
33 | print '<a href="' . $obj->href . '">' . $obj->text . '</a><br />';
34 | print "\n";
35 |
36 | }
37 |
38 | ?>
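
The WHERE conditions in demo_04 and demo_05 are ordinary PHP expressions, so their behaviour can be checked in isolation. The sketch below applies both conditions to a few hypothetical href values and does not use htmlSQL itself. Note that the demo_05 condition also lets through anything else that does not begin with a slash, such as a `mailto:` link.

```php
<?php
// Sketch only: hypothetical href values standing in for the $href
// attribute of each link.
$hrefs = array('http://example.com/', '/feedback.htm', 'mailto:me@example.com');

// demo_04 condition: preg_match("/^http:\/\//", $href)
function starts_with_http($href)
{
    return preg_match('/^http:\/\//', $href) === 1;
}

// demo_05 condition: substr($href,0,1) != "/"
function is_not_internal($href)
{
    return substr($href, 0, 1) != "/";
}

// array_values() renumbers the keys that array_filter() preserves.
print_r(array_values(array_filter($hrefs, 'starts_with_http')));
print_r(array_values(array_filter($hrefs, 'is_not_internal')));
```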
--------------------------------------------------------------------------------
/examples/demo_06.php:
--------------------------------------------------------------------------------
1 | link1 foobar ';
14 | $some_html .= 'link2
" this returns all snippet links, and no other links
27 | ** (like header or navigation links)
28 | */
29 |
30 | $wsql->isolate_content('');
31 |
32 | /*
33 | other examples:
34 |
35 | $wsql->isolate_content('', '');
36 | $wsql->isolate_content('', '');
37 | */
38 |
39 | /* execute a query:
40 |
41 | This query returns all links:
42 | */
43 | if (!$wsql->query('SELECT * FROM a')){
44 | print "Query error: " . $wsql->error;
45 | exit;
46 | }
47 |
48 | // fetch results as array
49 | foreach($wsql->fetch_array() as $row){
50 |
51 | print_r($row);
52 |
53 | }
54 |
55 | ?>
--------------------------------------------------------------------------------
/examples/demo_11.php:
--------------------------------------------------------------------------------
1 | <?php
2 |
3 | /*
4 | ** NOTE: lines 1-15 are reconstructed. The include paths and the
5 | ** class name `htmlsql` are assumptions based on the README.
6 | */
7 |
8 | include_once('../snoopy.class.php');
9 | include_once('../htmlsql.class.php');
10 |
11 | $wsql = new htmlsql();
12 |
13 |
14 | // connect to a file
15 | if (!$wsql->connect('file', 'demo_xml.xml')){
16 | print 'Error while connecting: ' . $wsql->error;
17 | exit;
18 | }
19 |
20 | /* execute a query:
21 |
22 | This query returns the id, name and password of all active users
23 | */
24 | if (!$wsql->query('SELECT id, name, password FROM user WHERE $status == "active"')){
25 | print "Query error: " . $wsql->error;
26 | exit;
27 | }
28 |
29 | // fetch results as array
30 | foreach($wsql->fetch_array() as $row){
31 |
32 | print_r($row);
33 |
34 | }
35 |
36 | ?>
--------------------------------------------------------------------------------
/examples/demo_12.php:
--------------------------------------------------------------------------------
1 | <?php
2 |
3 | /*
4 | ** NOTE: lines 1-15 are reconstructed. The include paths and the
5 | ** class name `htmlsql` are assumptions based on the README.
6 | */
7 |
8 | // Snoopy handles the HTTP transfer for 'url' connections
9 | include_once('../snoopy.class.php');
10 | include_once('../htmlsql.class.php');
11 |
12 | $wsql = new htmlsql();
13 |
14 |
15 | // set a custom user agent:
16 | $wsql->set_user_agent('MyAgentName/0.9');
17 |
18 | // set a new referer:
19 | $wsql->set_referer('http://www.jonasjohn.de/custom/referer/');
20 |
21 |
22 | // connect to a URL
23 | if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/')){
24 | print 'Error while connecting: ' . $wsql->error;
25 | exit;
26 | }
27 |
28 | /* execute a query:
29 |
30 | This query returns all links:
31 | */
32 | if (!$wsql->query('SELECT * FROM a')){
33 | print "Query error: " . $wsql->error;
34 | exit;
35 | }
36 |
37 | // fetch results as array
38 | foreach($wsql->fetch_array() as $row){
39 |
40 | print_r($row);
41 |
42 | }
43 |
44 | ?>
--------------------------------------------------------------------------------
/examples/demo_data.htm:
--------------------------------------------------------------------------------
1 |
3 |
4 |
10 |
11 |
12 |
13 |
29 | Skip to content...
30 |
31 | 32 | 51 | 52 | 53 |
83 |
84 | Hello and welcome to the personal website of Jonas John. This is my personal
85 | web playground, I use it to present myself and to create some experimental
86 | things. Have fun!
87 |
88 |
89 |
90 |
91 |
92 |
93 |
98 | News (May 04, 2006):
99 | I published the third version of my website. Now it's almost
100 | completely translated in English. Just a few texts left.
101 |
102 |
103 |
104 | News archive...
105 |
106 |
120 |
121 |
Lab
122 |
123 |
124 | Look on this page to get some informations about my
125 | web projects and software that I made.
126 |
127 |
132 |
133 |
Photos
134 |
135 |
136 | Here you find a few photos I made. I'm an amateur photographer,
137 | so don't expect too much ;-)
138 |
139 |
147 |
148 |
Adblock F. Generator
149 |
150 |
151 | This Adblock Plus Filterset Generator allows you to create your own customized
152 | filterlist for the Firefox Plugin "Adblock Plus". Just check or uncheck
153 | the filters you want.
154 |
155 |
156 |
161 |
162 |
Codedump
163 |
164 |
165 | Here you can find around 70 code snippets for different topics.
166 | The snippet languages are PHP, JavaScript, HTML, Perl and Python.
167 | You can use them freely in your projects (public domain).
168 |
169 |