├── README-german.md
├── README.md
├── examples
    ├── demo_01.php
    ├── demo_02.php
    ├── demo_03.php
    ├── demo_04.php
    ├── demo_05.php
    ├── demo_06.php
    ├── demo_07.php
    ├── demo_08.php
    ├── demo_09.php
    ├── demo_10.php
    ├── demo_11.php
    ├── demo_12.php
    ├── demo_data.htm
    ├── demo_xml.xml
    └── query_examples.txt
├── htmlsql.class.php
└── snoopy.class.php

/README-german.md:
--------------------------------------------------------------------------------
htmlSQL - Version 0.5
=====================

htmlSQL is an experimental PHP class that lets you access HTML
elements through an SQL-like syntax. This means you no longer need
complicated functions to extract specific tags; you simply run a
query like this:

    SELECT href,title FROM a WHERE $class == "liste"
           ^ HTML attributes ^     ^ condition (may be empty)
             to return       ^
                             ^ HTML tags to search
                               "*" is allowed here = all tags

This query returns an array of all links with the attribute class="liste".

All HTTP connections in htmlSQL use the Snoopy class
(package version 1.2.3 - URL: http://sourceforge.net/projects/snoopy/).
Snoopy is not needed for "file" or "string" queries, however.
All Snoopy-related documents (e.g. copyright information, readme, etc.)
are located in the "snoopy_data/" subdirectory.


Installation / Usage
--------------------

To use htmlSQL in your own projects, you only need to load the two
files "snoopy.class.php" and "htmlsql.class.php" (with include or,
for example, require). After that, htmlSQL can be used as shown in the
examples (see the examples/ folder). This should not be too difficult :-)


Background / History
--------------------

I had the idea for this class while extracting data from a web page and
noticing that the functions and source code often repeat themselves. That
gave me the idea to simplify the whole thing and develop a universal
class for it.


Warning
-------

The `eval()` function is used for the queries. All visitor-dependent data,
such as IDs, should therefore be checked and, where necessary, filtered;
otherwise it would be possible to execute malicious PHP code.
Never trust user input!


Todo
----

- Improve the internal HTML parser
- Develop a dedicated query system instead of relying on PHP's own
  (the eval() solution is not really pretty)
- More error checks
- Unit tests
- LIMIT function (as in SQL)


Use cases for htmlSQL
---------------------

- Reading data from other websites
- HTML-based databases?
- Reading XML data


Author
------

- [Jonas John](http://www.jonasjohn.de/)


License
-------

htmlSQL uses a modified BSD license, which is fairly permissive.
The license text is in "htmlsql.class.php".

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
htmlSQL - Version 0.5
=====================

htmlSQL is an experimental PHP library that lets you access HTML values through an SQL-like syntax.
This means that you don't have to write complex functions or regular expressions to extract specific values.

**htmlSQL queries look like this:**

    SELECT href,title FROM a WHERE $class == "list"
           ^ Attributes    ^       ^ search query (can be empty)
             to return     ^
                           ^ HTML tag to search in
                             "*" is possible = all tags

This query returns an array with all links that have the attribute `class="list"`.


Project Discontinued
--------------------

htmlSQL was an experiment I did in 2006. I'm no longer supporting or extending the library; this repository exists only for historical purposes. Feel free to fork, modify and study the source code. If you need a reliable library for data scraping, I recommend using other modules.

Related projects:

* PHP: [phpQuery](http://code.google.com/p/phpquery/), [SimpleXML](http://www.php.net/simplexml), [DOM](http://www.php.net/dom)
* Perl: [WWW::Mechanize](http://search.cpan.org/dist/WWW-Mechanize/), [pQuery](http://search.cpan.org/~ingy/pQuery-0.07/lib/pQuery.pm)
* Python: [Scrapy](http://scrapy.org/), [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)
* JavaScript: [node.js](http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs)
* .NET: [Html Agility Pack](http://htmlagilitypack.codeplex.com/)

Related links:

* [Stack Overflow: Options for HTML scraping?](http://stackoverflow.com/questions/2861/options-for-html-scraping)
* [Stack Overflow: HTML Scraping in PHP](http://stackoverflow.com/questions/34120/html-scraping-in-php)
* [Hacker News: PHP class to query the web by an SQL like language](http://news.ycombinator.com/item?id=2097008)
* [Hacker News: Ask YC: What do you scrape? How do you scrape?](http://news.ycombinator.com/item?id=159025)


Requirements
------------

- Any flavor of PHP4+ should do
- [Snoopy PHP class - Version 1.2.3](http://sourceforge.net/projects/snoopy/) (optional - required for web transfers).
  You can find all Snoopy-related documents (copyright, readme, etc.) in the snoopy_data/ subdirectory.


Usage
-----

Just include the "snoopy.class.php" and the "htmlsql.class.php" files
in your PHP scripts and look at the examples to get an idea of how
to use the htmlSQL library. It should be very simple :-)


Background / idea
-----------------

I had this idea while extracting some data from a website. I realized
that the algorithms and functions to extract links and other tags are
often the same, so I had the idea to combine all these functions into a
universally usable library. While drinking a coffee and thinking about
that, I thought it would be cool to access HTML elements by using SQL.
So I started creating this library...


Warning
-------

The `eval()` function is used for the WHERE statement. Make sure that all
user data is checked and filtered against malicious PHP code.
Never trust any user input!


Todo
----

* Enhance the HTML parser
* Test htmlSQL with invalid and bad HTML files
* Replace the ugly `eval()` method for the WHERE statement with a custom method
* Add more error checks
* Add unit tests
* Add a LIMIT function like in SQL


Author
------

* [Jonas John](http://www.jonasjohn.de/)


License
-------

htmlSQL uses a modified BSD license; you can find the full license text in "htmlsql.class.php".
--------------------------------------------------------------------------------
/examples/demo_01.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query extracts all links with the classname = nav_item
*/
if (!$wsql->query('SELECT * FROM a WHERE $class == "nav_item"')){
    print "Query error: " . $wsql->error;
    exit;
}

// show results:
foreach($wsql->fetch_array() as $row){

    print_r($row);

    /*
    $row is an array and looks like this:
    Array (
        [href] => /feedback.htm
        [class] => nav_item
        [tagname] => a
        [text] => Feedback
    )
    */

}

?>
--------------------------------------------------------------------------------
/examples/demo_02.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a file
if (!$wsql->connect('file', 'demo_data.htm')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query extracts all links from the document
   and just returns href (as url) and text
*/
if (!$wsql->query('SELECT href as url, text FROM a')){
    print "Query error: " . $wsql->error;
    exit;
}

// show results:
foreach($wsql->fetch_array() as $row){

    print "Link-URL: " . $row['url'] . "\n";
    print "Link-Text: " . trim($row['text']) . "\n\n";

}

?>
--------------------------------------------------------------------------------
/examples/demo_03.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a file
if (!$wsql->connect('file', 'demo_data.htm')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query searches all tags for id == header and returns
   the matching tag
*/
if (!$wsql->query('SELECT * FROM * WHERE $id == "header"')){
    print "Query error: " . $wsql->error;
    exit;
}

// show results:
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_04.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/links.htm')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query returns all links of a document that start with http://
*/
if (!$wsql->query('SELECT * FROM a WHERE preg_match("/^http:\/\//", $href)')){
    print "Query error: " . $wsql->error;
    exit;
}

// show results:
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_05.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/links.htm')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query returns all links of a document that do not start with /
   ( / = internal links)
*/
if (!$wsql->query('SELECT * FROM a WHERE substr($href,0,1) != "/"')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch results as objects and format them as HTML links
// (the markup inside this string was lost; reconstructed from the comment above):
foreach($wsql->fetch_objects() as $obj){

    print '<a href="'.$obj->href.'">'.$obj->text.'</a><br>';
    print "\n";

}

?>
--------------------------------------------------------------------------------
/examples/demo_06.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

// the original markup inside these strings was lost; the hrefs are placeholders
$some_html  = '<a href="link1.htm">link1</a> foobar ';
$some_html .= '<a href="link2.htm">link2</a><br>';

$wsql = new htmlsql();

// connect to a string
if (!$wsql->connect('string', $some_html)){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query returns all links of the given HTML
*/
if (!$wsql->query('SELECT * FROM a')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch results as array and output them:
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_07.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/browse/lang/php/')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query finds all links whose URL starts with /snippets and whose text
   starts with "array_" => so all links to array functions will be returned
*/
if (!$wsql->query('SELECT * FROM a WHERE preg_match("/^\/snippets/i", $href) and preg_match("/^array_/i", $text)')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch results as array and output them:
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_08.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/rss/')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   select the text attribute (alias for the tag content) from the <item> tag
*/
if (!$wsql->query('SELECT text FROM item')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch all results as objects:
foreach($wsql->fetch_objects() as $obj){

    // create a new htmlsql object:
    $sub_wsql = new htmlsql();

    // connect to the content:
    $sub_wsql->connect('string', $obj->text);

    // fetch all attributes of all tags:
    if (!$sub_wsql->query('SELECT * FROM *')){
        // report the error of the sub-query object, not of the outer one
        print "Query error: " . $sub_wsql->error;
        exit;
    }

    // this "special" function converts tagnames to keys
    $sub_wsql->convert_tagname_to_key();

    /* this function converts an array that looks like this:

        $array[0]['tagname'] = 'title';
        $array[0]['text'] = 'example 1';

        $array[1]['tagname'] = 'link';
        $array[1]['text'] = 'http://www.example.org/';

        $array[2]['tagname'] = 'description';
        $array[2]['text'] = 'description bla';
        $array[2]['fulltext'] = '1'; // additional attribute

        -> to:

        $array['title']['text'] = 'example 1';

        $array['link']['text'] = 'http://www.example.org/';

        $array['description']['text'] = 'description bla';
        $array['description']['fulltext'] = '1'; // additional attribute

        this makes the array easier to access

    */

    // fetch item as array:
    $item = $sub_wsql->fetch_array();

    // format the extracted links as HTML links and output them
    // (the markup inside these strings was lost; reconstructed from this comment):
    print "<a href=\"" . $item['link']['text'] . "\">";
    print $item['title']['text'] . "</a><br>\n";

    // also available:
    // description, pubDate

}

?>
--------------------------------------------------------------------------------
/examples/demo_09.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

// restricts the search process to the content between
// <body> and </body>
// this also works with other tags like: head or html, or table
$wsql->select('body');

/*
    other examples:

    $wsql->select('div',3); <-- selects the third <div>

    $wsql->select('table',0); <-- selects the first table
                          ^ default is also = 0
*/


/* execute a query:

   This query returns all <h1> headers
*/
if (!$wsql->query('SELECT * FROM h1')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch results as array
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_10.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/*
** The isolate_content function works like the select function,
** but you can specify custom HTML parts; the content between
** these two strings will be used for the query process.
**
** In this case we select all content between the "New snippets"
** heading and the tag that follows the snippet list. This returns
** all snippet links, and no other links (like header or navigation links).
**
** The tags of the two marker strings were lost; "<h1>...</h1>" and
** "</div>" below are placeholders, only the text "New snippets" is original.
*/

$wsql->isolate_content('<h1>New snippets</h1>', '</div>');

/*
    other examples:

    $wsql->isolate_content('', '');
    $wsql->isolate_content('', '');
*/

/* execute a query:

   This query returns all links:
*/
if (!$wsql->query('SELECT * FROM a')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch results as array
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_11.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// connect to a file
if (!$wsql->connect('file', 'demo_xml.xml')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query returns the id, name and password of all active users
*/
if (!$wsql->query('SELECT id, name, password FROM user WHERE $status == "active"')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch results as array
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_12.php:
--------------------------------------------------------------------------------
<?php

// include paths assumed from the repository layout
include_once("../snoopy.class.php");
include_once("../htmlsql.class.php");

$wsql = new htmlsql();

// set a custom user agent:
$wsql->set_user_agent('MyAgentName/0.9');

// set a new referer:
$wsql->set_referer('http://www.jonasjohn.de/custom/referer/');


// connect to a URL
if (!$wsql->connect('url', 'http://codedump.jonasjohn.de/')){
    print 'Error while connecting: ' . $wsql->error;
    exit;
}

/* execute a query:

   This query returns all links:
*/
if (!$wsql->query('SELECT * FROM a')){
    print "Query error: " . $wsql->error;
    exit;
}

// fetch results as array
foreach($wsql->fetch_array() as $row){

    print_r($row);

}

?>
--------------------------------------------------------------------------------
/examples/demo_data.htm:
--------------------------------------------------------------------------------
jonasjohn.de: startpage

¬ welcome to...
the personal website of jonas john!

Hello and welcome to the personal website of Jonas John. This is my personal
web playground, I use it to present myself and to create some experimental
things. Have fun!

News (May 04, 2006):
I published the third version of my website. Now it's almost
completely translated in English. Just a few texts left.

News archive...

What do I find here?

Lab
Look on this page to get some informations about my
web projects and software that I made.

Photos
Here you find a few photos I made. I'm an amateur photographer,
so don't expect too much ;-)

Adblock F. Generator
This Adblock Plus Filterset Generator allows you to create your own customized
filterlist for the Firefox Plugin "Adblock Plus". Just check or uncheck
the filters you want.

Codedump
Here you can find around 70 code snippets for different topics.
The snippet languages are PHP, JavaScript, HTML, Perl and Python.
You can use them freely in your projects (public domain).
--------------------------------------------------------------------------------
/examples/demo_xml.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/examples/query_examples.txt:
--------------------------------------------------------------------------------

Some query examples for copy & paste ;-)


SELECT * FROM h1
^ select all <h1> tags


SELECT * FROM a
^ select all links


SELECT * FROM td
^ select all <td>'s


SELECT href as url, text FROM a
^ return href as url and text as text from all links


SELECT * FROM a WHERE preg_match("/^http:\/\//", $href)
^ find all external links


SELECT * FROM a WHERE preg_match("/^\/snippets/i", $href) and preg_match("/^array_/i", $text)
^ find all links starting with /snippets and with a link text starting with "array_"


SELECT * FROM *
^ select all attributes of all tags ;-)


SELECT id, name, password FROM user WHERE $status == "active"
^ select all <user> tags where status="active" (for XML files)


SELECT * FROM * WHERE $id == "header"
^ return all tags with the $id = header


SELECT * FROM a WHERE substr($href,0,1) != "/"
^ select links with URLs that do not start with / (mainly external links)


SELECT * FROM * WHERE $class == "nav_item"
^ select all tags with the class = nav_item


SELECT * FROM a WHERE ($href == "foo.htm" and $title == "foo") or ($title == "bar")
^ complex query

--------------------------------------------------------------------------------
/htmlsql.class.php:
--------------------------------------------------------------------------------
<?php

/*
    0.4 -> 0.5 (May 07, 2006):
    - Renamed the project from webSQL to htmlSQL because webSQL already exists
    - Added more error checks
    - Added the convert_tagname_to_key function and fixed a few issues

    0.1 -> 0.4 (April 2006):
    - Created main parts of the library

*/

class htmlsql {

    // configuration:

    // htmlSQL version:
    var $version = '0.5';

    // referer and user agent:
    var $referer = '';
    var $user_agent = 'htmlSQL/0.5';


    // these are filled on runtime:
    // (don't touch them)

    // holds snoopy object:
74 | var $snoopy = NULL; 75 | 76 | // the results array is stored in here: 77 | var $results = array(); 78 | 79 | // the results objects are stored in here: 80 | var $results_objects = NULL; 81 | 82 | // the error message gets stored in here: 83 | var $error = ''; 84 | 85 | // the downloaded page is stored in here: 86 | var $page = ''; 87 | 88 | 89 | /* 90 | ** init_snoopy 91 | ** 92 | ** initializes the snoopy class 93 | */ 94 | 95 | function init_snoopy(){ 96 | $this->snoopy = new Snoopy(); 97 | $this->snoopy->agent = $this->user_agent; 98 | $this->snoopy->referer = $this->referer; 99 | } 100 | 101 | 102 | /* 103 | ** set_user_agent 104 | ** 105 | ** set a custom user agent 106 | */ 107 | 108 | function set_user_agent($u){ 109 | $this->user_agent = $u; 110 | } 111 | 112 | 113 | 114 | /* 115 | ** set_referer 116 | ** 117 | ** sets the referer 118 | */ 119 | 120 | function set_referer($r){ 121 | $this->referer = $r; 122 | } 123 | 124 | 125 | /* 126 | ** _get_between 127 | ** 128 | ** returns the content between $start and $end 129 | */ 130 | 131 | function _get_between($content,$start,$end){ 132 | $r = explode($start, $content); 133 | if (isset($r[1])){ 134 | $r = explode($end, $r[1]); 135 | return $r[0]; 136 | } 137 | return ''; 138 | } 139 | 140 | 141 | /* 142 | ** connect 143 | ** 144 | ** connects to a data source (url, file or string) 145 | */ 146 | 147 | function connect($type, $resource){ 148 | 149 | if ($type == 'url'){ 150 | return $this->_fetch_url($resource); 151 | } 152 | else if ($type == 'file') { 153 | 154 | if (!file_exists($resource)){ 155 | $this->error = 'The given file "'.$resource.' 
does not exist!'; 156 | return false; 157 | } 158 | 159 | $this->page = file_get_contents($resource); 160 | return true; 161 | } 162 | else if ($type == 'string') { 163 | $this->page = $resource; 164 | return true; 165 | } 166 | 167 | return false; 168 | } 169 | 170 | 171 | /* 172 | ** _fetch_url 173 | ** 174 | ** downloads the given URL with snoopy 175 | */ 176 | 177 | function _fetch_url($url){ 178 | 179 | $parsed_url = parse_url($url); 180 | 181 | if (!isset($parsed_url['scheme']) or $parsed_url['scheme'] != 'http'){ 182 | $this->error = 'Unsupported URL sheme given, please just use "HTTP".'; 183 | return false; 184 | } 185 | if (!isset($parsed_url['host']) or $parsed_url['host'] == ''){ 186 | $this->error = 'Invalid URL given!'; 187 | return false; 188 | } 189 | 190 | $host = $parsed_url['host']; 191 | $host .= (isset($parsed_url['port']) and !empty($parsed_url['port'])) ? ':'.$parsed_url['port'] : ''; 192 | $path = (isset($parsed_url['path']) and !empty($parsed_url['path'])) ? $parsed_url['path'] : '/'; 193 | $path .= (isset($parsed_url['query']) and !empty($parsed_url['query'])) ? '?'.$parsed_url['query'] : ''; 194 | 195 | $url = 'http://' . $host . $path; 196 | 197 | $this->init_snoopy(); 198 | 199 | if($this->snoopy->fetch($url)){ 200 | 201 | $this->page = $this->snoopy->results; 202 | 203 | // empty buffer: 204 | $this->snoopy->results = ''; 205 | } 206 | else { 207 | $this->error = 'Could not establish a connection to the given URL!'; 208 | return false; 209 | } 210 | 211 | return true; 212 | } 213 | 214 | 215 | /* 216 | ** _extract_all_tags 217 | ** 218 | ** 219 | */ 220 | 221 | function _extract_all_tags($html, &$tag_names, &$tag_attributes, &$tag_values, $depth=0){ 222 | 223 | // stop endless loops -> ugly... 
224 | if ($depth > 99999) return; 225 | 226 | preg_match_all('/<([a-z0-9\-]+)(.*?)>((.*?)<\/\1>)?/is', $html, $m); 227 | 228 | if (count($m[0]) != 0){ 229 | for ($t=0; $t < count($m[0]); $t++){ 230 | 231 | $tag_names[] = trim($m[1][$t]); 232 | $tag_attributes[] = trim($m[2][$t]); 233 | $tag_values[] = trim($m[4][$t]); 234 | 235 | // go deeper: 236 | if (trim($m[4][$t]) != '' and preg_match('/<[a-z0-9\-]+.*?>/is', $m[4][$t])){ 237 | $this->_extract_all_tags($m[4][$t], $tag_names, $tag_attributes, $tag_values, $depth+1); 238 | } 239 | 240 | } 241 | } 242 | 243 | } 244 | 245 | 246 | /* 247 | ** isolate_content 248 | ** 249 | ** isolates the content to a specific part 250 | */ 251 | 252 | function isolate_content($start,$end){ 253 | 254 | $this->page = $this->_get_between($this->page, $start, $end); 255 | 256 | } 257 | 258 | 259 | /* 260 | ** select 261 | ** 262 | ** restricts the content of a specific tag 263 | */ 264 | 265 | function select($tagname, $num=0){ 266 | $num++; 267 | 268 | if ($tagname != ''){ 269 | 270 | preg_match('/<'.$tagname.'.*?>(.*?)<\/'.$tagname.'>/is', $this->page, $m); 271 | 272 | if (isset($m[$num]) and !empty($m[$num])){ 273 | $this->page = $m[$num]; 274 | } 275 | else { 276 | $this->error = 'Could not select tag: "'.$tagname.'('.$num.')"!'; 277 | return false; 278 | } 279 | } 280 | return true; 281 | } 282 | 283 | 284 | /* 285 | ** get_content 286 | ** 287 | ** returns the content of an request 288 | */ 289 | 290 | function get_content(){ 291 | return $this->page; 292 | } 293 | 294 | 295 | /* 296 | ** _clean_array 297 | ** 298 | ** 299 | */ 300 | 301 | function _clean_array($arr){ 302 | $new = array(); 303 | for ($x=0; $x < count($arr); $x++){ 304 | $arr[$x] = trim($arr[$x]); 305 | if ($arr[$x] != ''){ $new[] = $arr[$x]; } 306 | } 307 | return $new; 308 | } 309 | 310 | 311 | /* 312 | ** _test_tag 313 | ** 314 | ** 315 | */ 316 | 317 | function _test_tag($tag_attributes, $if_term){ 318 | 319 | preg_match_all('/\$([a-z0-9_\-]+)/i', $if_term, 
$m); 320 | if (isset($m[1])){ 321 | for ($x=0; $x < count($m[1]); $x++){ 322 | $varname = $m[1][$x]; 323 | $$varname = ''; 324 | } 325 | } 326 | 327 | $new_list = array(); 328 | while (list($k,$v) = each($tag_attributes)){ 329 | $k = preg_replace('/[^a-z0-9_\-]/i', '', $k); 330 | if ($k != ''){ $new_list[$k] = $v; } 331 | } 332 | unset($tag_attributes); 333 | 334 | extract($new_list); 335 | 336 | $r = false; 337 | if (@eval('$r = ('.$if_term.');') === false){ 338 | $this->error = 'The WHERE statement is invalid (eval() failed)!'; 339 | return false; 340 | } 341 | 342 | return $r; 343 | } 344 | 345 | 346 | /* 347 | ** _match_tags 348 | ** 349 | ** 350 | */ 351 | 352 | function _match_tags(&$results, &$return_values, &$where_term, &$tag_attributes, &$tag_values, &$tag_names){ 353 | 354 | $search_mode = ''; $search_attribute = ''; $search_term = ''; 355 | 356 | /* 357 | ** parse: 358 | ** 359 | ** href LIKE ".htm" 360 | ** class = "foo" 361 | */ 362 | 363 | $where_term = trim($where_term); 364 | 365 | $search_mode = ($where_term == '') ? 'match_all' : 'eval'; 366 | 367 | for ($x=0; $x < count($tag_attributes); $x++){ 368 | 369 | $tag_attributes[$x] = $this->parse_attributes($tag_attributes[$x]); 370 | 371 | if (is_array($tag_names)){ 372 | $tag_attributes[$x]['tagname'] = isset($tag_names[$x]) ? $tag_names[$x] : ''; 373 | } 374 | else { $tag_attributes[$x]['tagname'] = $tag_names; } // string 375 | 376 | $tag_attributes[$x]['text'] = isset($tag_values[$x]) ? 
$tag_values[$x] : ''; 377 | 378 | if ($search_mode == 'eval'){ 379 | 380 | if ($this->_test_tag($tag_attributes[$x], $where_term)){ 381 | $this->_add_result($results, $return_values, $tag_attributes[$x]); 382 | } 383 | 384 | } 385 | else if ($search_mode == 'match_all'){ 386 | $this->_add_result($results, $return_values, $tag_attributes[$x]); 387 | } 388 | } 389 | } 390 | 391 | 392 | /* 393 | ** query 394 | ** 395 | ** performs a query 396 | */ 397 | 398 | function query($term){ 399 | 400 | // query results are stored in here: 401 | $results = array(); 402 | $this->results = NULL; 403 | $this->results_objects = NULL; 404 | 405 | $term = trim($term); 406 | if ($term == ''){ 407 | $this->error = 'Empty query given!'; 408 | return false; 409 | } 410 | 411 | // match query: 412 | preg_match('/^SELECT (.*?) FROM (.*)$/i', $term, $m); 413 | 414 | // parse returns values 415 | // SELECT * FROM ... 416 | // SELECT foo,bar FROM ... 417 | $return_values = isset($m[1]) ? trim($m[1]) : '*'; 418 | if ($return_values != '*'){ 419 | $return_values = explode(',', strtolower($return_values)); 420 | $return_values = $this->_clean_array($return_values); 421 | } 422 | 423 | // match from and where part: 424 | // 425 | // ... FROM * WHERE $id=="one" 426 | // ... FROM a WHERE $class=="red" 427 | // ... FROM a 428 | // ... FROM * 429 | $last = isset($m[2]) ? trim($m[2]) : ''; 430 | 431 | $search_term = ''; 432 | $where_term = ''; 433 | 434 | if (preg_match('/^(.*?) 
WHERE (.*?)$/i', $last, $m)){ 435 | $search_term = trim($m[1]); 436 | $where_term = trim($m[2]); 437 | } 438 | else { 439 | $search_term = $last; 440 | } 441 | 442 | // find tags 443 | 444 | if ($search_term == '*'){ 445 | // search all 446 | 447 | $tag_names = array(); 448 | $tag_attributes = array(); 449 | $tag_values = array(); 450 | 451 | $html = $this->page; 452 | 453 | $this->_extract_all_tags($html, $tag_names, $tag_attributes, $tag_values); 454 | 455 | $this->_match_tags($results, $return_values, $where_term, $tag_attributes, $tag_values, $tag_names); 456 | } 457 | else { 458 | 459 | // search term is a tag 460 | 461 | $tagname = trim($search_term); 462 | 463 | $tag_attributes = array(); 464 | $tag_values = array(); 465 | 466 | $regexp = '<'.$tagname.'([ \t].*?|)>((.*?)<\/'.$tagname.'>)?'; 467 | preg_match_all('/'.$regexp.'/is', $this->page, $m); 468 | 469 | if (count($m[0]) != 0){ 470 | $tag_attributes = $m[1]; 471 | $tag_values = $m[3]; 472 | } 473 | 474 | $this->_match_tags($results, $return_values, $where_term, $tag_attributes, $tag_values, $tagname); 475 | } 476 | 477 | $this->results = $results; 478 | 479 | // was there a error during the search process? 
        return ($this->error == '');
    }

    /*
    ** convert_tagname_to_key
    **
    ** converts the tagname to the array key
    */

    function convert_tagname_to_key(){

        $new_array = array();
        $tag_name = '';

        // foreach replaces while(list(..) = each(..)):
        // each() was deprecated in PHP 7.2 and removed in PHP 8.0
        foreach ($this->results as $key => $val){

            if (isset($val['tagname'])){
                $tag_name = $val['tagname'];
                unset($val['tagname']);
            }
            else { $tag_name = '(empty)'; }

            // results sharing a tag name overwrite each other
            $new_array[$tag_name] = $val;

        }

        $this->results = $new_array;
    }


    /*
    ** fetch_array
    **
    ** returns the results as an array
    */

    function fetch_array(){
        return $this->results;
    }


    /*
    ** _array2object
    **
    ** converts an array to an object
    */

    function _array2object($array) {

        if (is_array($array)) {

            $obj = new StdClass();

            foreach ($array as $key => $val){
                $obj->$key = $val;
            }

        }
        else { $obj = $array; }

        return $obj;
    }


    /*
    ** fetch_objects
    **
    ** returns the results as objects
    */

    function fetch_objects(){

        if ($this->results_objects == NULL){

            $results = array();

            // foreach replaces reset() + each() (each() was removed in PHP 8.0)
            foreach ($this->results as $key => $val){
                $results[$key] = $this->_array2object($val);
            }

            $this->results_objects = $results;
        }

        return $this->results_objects;
    }

    /*
    ** get_result_count
    **
    ** returns the number of results
    */

    function get_result_count(){
        return count($this->results);
    }


    /*
    ** _add_result
    **
    ** adds one matched tag to the results, either with all of its
    ** attributes (SELECT *) or only the selected ones ("foo as bar"
    ** stores attribute foo under the key bar)
    */

    function _add_result(&$results, $return_values, $tag_attributes){

        if ($return_values == '*'){
            $results[] = $tag_attributes;
        }
        else if (is_array($return_values)){

            $new_result = array();

            reset($return_values);
            for ($t=0; $t < count($return_values); $t++){

                $_tagname = explode(' as ', $return_values[$t]);
                $_caption = $return_values[$t];

                if (count($_tagname) != 1){
                    $_caption = trim($_tagname[1]);
                    $_tagname = trim($_tagname[0]);
                }
                else { $_tagname = $_caption; }

                $new_result[$_caption] = isset($tag_attributes[$_tagname]) ? $tag_attributes[$_tagname] : '';
            }
            $results[] = $new_result;
        }
    }


    /*
    ** parse_attributes
    **
    ** parses HTML attributes and returns an array
    */

    function parse_attributes($attrib){

        // the appended '>' terminates a trailing unquoted value or key
        $attrib .= '>';

        $mode = 'search_key';
        $tmp = '';
        $current_key = '';

        $attributes = array();

        $length = strlen($attrib);

        for ($x=0; $x < $length; $x++){

            $char = $attrib[$x];

            if ($char == '=' and $mode == 'search_key'){
                $current_key = trim($tmp);
                $tmp = '';
                $mode = 'value';
            }
            else if ($mode == 'search_key' and preg_match('/[\s>]/', $char)){
                // attribute without a value, e.g. "checked"
                $current_key = strtolower(trim($tmp));
                if ($current_key != ''){ $attributes[$current_key] = ''; }
                $tmp = ''; $current_key = '';
            }
            else if ($mode == 'value' and $char == '"'){ $mode = 'find_value_ending_a'; }
            else if ($mode == 'value' and $char == '\''){ $mode = 'find_value_ending_b'; }
            else if ($mode == 'value'){ $tmp .= $char; $mode = 'find_value_ending_c'; }
            else if (
                ($mode == 'find_value_ending_a' and $char == '"') or
                ($mode == 'find_value_ending_b' and $char == '\'') or
                ($mode == 'find_value_ending_c' and preg_match('/[\s>]/', $char))
            ){

                // end of a double-quoted, single-quoted or unquoted value
                $mode = 'search_key';

                if ($current_key != ''){
                    $current_key = strtolower($current_key);
                    $attributes[$current_key] = $tmp;
                }
                $tmp = '';
            }
            else { $tmp .= $char; }
        }

        // a value that was still open at the end of the string
        if ($mode != 'search_key' and $current_key != ''){
            $current_key = strtolower($current_key);
            $attributes[$current_key] = trim(preg_replace('/>+$/', '', $tmp));
        }

        return $attributes;

    }

}

?>

--------------------------------------------------------------------------------
/snoopy.class.php:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hxseven/htmlSQL/165b7fa7e3a984cf1bfdb466c1c8bf9739ada9cf/snoopy.class.php
--------------------------------------------------------------------------------
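
The way `query()` in htmlsql.class.php splits a query string can be tried in isolation. The following is only a minimal sketch: `split_htmlsql_query()` is a hypothetical helper that is not part of htmlSQL. It reuses the two regular expressions from `query()`; the cleanup of the SELECT list (trimming and dropping empty entries) is an assumption about what `_clean_array()` does.

```php
<?php
// Hypothetical helper, not part of htmlSQL: splits a query string into its
// SELECT, FROM and WHERE parts using the same two patterns as query().
function split_htmlsql_query($term)
{
    $term = trim($term);

    // everything between SELECT and FROM, then the remainder of the query
    preg_match('/^SELECT (.*?) FROM (.*)$/i', $term, $m);

    $select = isset($m[1]) ? trim($m[1]) : '*';
    if ($select != '*') {
        // query() lowercases the list and calls _clean_array();
        // trimming and removing empty entries is an assumption about that method
        $parts = array_map('trim', explode(',', strtolower($select)));
        $select = array_values(array_filter($parts, function ($v) {
            return $v !== '';
        }));
    }

    // the remainder is either "tag WHERE condition" or just "tag"
    $rest = isset($m[2]) ? trim($m[2]) : '';
    $from = $rest;
    $where = '';
    if (preg_match('/^(.*?) WHERE (.*?)$/i', $rest, $w)) {
        $from = trim($w[1]);
        $where = trim($w[2]);
    }

    return array('select' => $select, 'from' => $from, 'where' => $where);
}

// the example query from the README
$parts = split_htmlsql_query('SELECT href,title FROM a WHERE $class == "liste"');
```

For the README query this yields the attribute list `href`, `title`, the tag name `a` and the condition `$class == "liste"`. The condition is returned untouched because htmlSQL later evaluates it with `eval()`, which is why visitor-supplied values must never be placed in a query unchecked.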