├── TODO.md
├── CHANGELOG.md
├── README.md
├── LICENSE
├── index.php
└── simple_html_dom.php

/TODO.md:
--------------------------------------------------------------------------------
TODO
----

* Clean JSON
* Clean XML

--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
23-04-2018
==========

* Treat HTML as text/xml to prevent free hosts from injecting scripts
* Remove two warnings
* HTML now has to allow for debugging messages and for the XML validation in browsers to work (for example Firefox)
* Handle format datatext as text/plain; charset=utf-8

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
LyricsCore
==========

LyricsCore is a lyrics API written in PHP. It fetches the lyrics text from many lyrics websites, as listed below. These lyrics can then be used in any application by sending a request to the (hosted) LyricsCore API.

Examples
========

The LyricsCore API can be used like this from the (Linux) command line:

    FORMAT=text FILENAME="Roxette - Dangerous" php index.php

This request outputs the lyrics text only, properly formatted. If you want something more, you can opt for the output format xml:

    FORMAT=xml FILENAME="Roxette - Dangerous" php index.php

outputs the following values, each wrapped in its own XML element:

    Roxette
    Dangerous
    MetroLyrics

    Here comes the lyrics text. This text still has the original HTML formatting.

    Of course there are many paragraphs
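
The json format can be consumed in a similar way. The sketch below assumes the response has the keys that index.php passes to `json_encode()` (`debug`, `artist`, `title`, `source`, `url`, `lyrics`) and that `lyrics` still contains HTML markup, as it does for xml. The sample response and the helper name `parse_lyricscore_json` are made up for illustration; they are not part of LyricsCore.

```php
<?php
// Placeholder response, shaped like the array index.php passes to json_encode().
// The values are invented for illustration and are not real API output.
$response = <<<'JSON'
{
    "debug": [],
    "artist": "Roxette",
    "title": "Dangerous",
    "source": "MetroLyrics",
    "url": "",
    "lyrics": "<p>First paragraph</p>\n<p>Second paragraph</p>"
}
JSON;

// Hypothetical helper: decodes a LyricsCore JSON response and adds a
// plain-text copy of the lyrics under the (assumed) key "lyrics_text".
function parse_lyricscore_json(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new RuntimeException("Not valid JSON: " . json_last_error_msg());
    }
    // Strip the HTML markup, much like get_text_from_unclean_html() in index.php.
    $data["lyrics_text"] = trim(strip_tags(html_entity_decode($data["lyrics"] ?? "")));
    return $data;
}

$song = parse_lyricscore_json($response);
echo $song["artist"], " - ", $song["title"], " (", $song["source"], ")\n";
echo $song["lyrics_text"], "\n";
```

Keeping the raw `lyrics` value next to the stripped copy lets an application choose between rendering the HTML and showing plain text.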

Hosted PHP API
==============

It is advised that you host the API yourself and keep up to date with the changes made. That way you get the best possible coverage of songs, and you are not affected by downtime on lyricscore.eu5.org or lyricscore.cmshost.nl.

You can host the API yourself by uploading index.php and simple_html_dom.php to a folder on your FTP host.

If index.php and simple_html_dom.php are uploaded to lyricscore.eu5.org/api/v1/, you can use the API as follows:

    http://lyricscore.eu5.org/api/v1/?filename=Roxette%20-%20Dangerous

I strongly advise having multiple hosted LyricsCore APIs at your disposal, in case one goes down.

If you don't want to host the API yourself, you can use these URLs:
* http://lyricscore.eu5.org/api/v1/
* http://lyricscore.cmshost.nl/api/v1/

Parameters and values
=====================

Passing input values:
* artist (must be used together with title)
* title (must be used together with artist)
* filename (can be any filename)

Example filenames:
* "Roxette - Dangerous"
* "Roxette - Dangerous.mp4"
* "Roxette - Dangerous (1983)"
* "Roxette - Dangerous (1983).mp4"
* "Roxette - Dangerous (anything).anything"
* "Roxette-Dangerous.mp3"
* "Roxette-Dangerous"

Controlling the output:
* format ["", "datatext", "text", "xml", "json"] (empty means html; use one of the four defined formats in your application)
* mode ["", "debug"] (empty means normal operation, so the debug mode is disabled)

Attention:
* Use lower case values for the format and mode parameters, e.g. "json" or "debug"
* Use CAPITAL LETTERS for the parameter names when calling LyricsCore from the command line, e.g. "FORMAT" or "MODE"

Usage in an external program
============================

It is advised that the application using this API reads the metadata from the music file and passes it to the API using the artist and title parameters. If metadata is not available, passing a file name will suffice.

A file name cannot contain the "&" character, but you can replace it with "and" before passing the file name to the LyricsCore API. There are no other known limitations.

Debug mode
==========

There is a built-in debug mode. It shows every request made by the LyricsCore API, which helps to determine why a filename (artist+title) does not return the correct lyrics text. The debug mode can be enabled by setting the parameter "mode" to "debug".

    http://lyricscore.eu5.org/api/v1/?filename=Patti%20Austin%20and%20James%20Ingram%20-%20Baby%20Come%20To%20Me&mode=debug

returns the output:

    DEBUG: Patti Austin and James Ingram
    DEBUG: MetroLyrics: artist_metro (patti-austin-and-james-ingram) contains and
    DEBUG: first artist: http://www.metrolyrics.com/baby-come-to-me-lyrics-patti-austin.html

    Here comes the lyrics text.

The debug mode can be useful to see what goes wrong and where.
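
An external PHP program could build such request URLs along the following lines. This is a minimal sketch, not part of LyricsCore: the helper name `build_lyricscore_url` and its signature are hypothetical, and it only applies the rules stated above (artist and title together, or a filename with "&" replaced by "and", lower case format and mode values).

```php
<?php
// Hypothetical helper, not part of LyricsCore: builds a request URL for a
// hosted LyricsCore endpoint from either artist+title or a file name.
function build_lyricscore_url(string $base, array $input, string $format = "json", bool $debug = false): string
{
    $artist   = $input["artist"] ?? "";
    $title    = $input["title"] ?? "";
    $filename = $input["filename"] ?? "";

    $params = [];
    if ($artist !== "" && $title !== "") {
        // artist and title must be used together
        $params["artist"] = $artist;
        $params["title"]  = $title;
    } elseif ($filename !== "") {
        // "&" is not supported in file names, so replace it by "and" first
        $params["filename"] = str_replace("&", "and", $filename);
    } else {
        throw new InvalidArgumentException("Pass artist and title, or a filename");
    }

    // format and mode values have to be lower case
    $params["format"] = strtolower($format);
    if ($debug) {
        $params["mode"] = "debug";
    }

    // PHP_QUERY_RFC3986 encodes spaces as %20, matching the example URLs
    return $base . "?" . http_build_query($params, "", "&", PHP_QUERY_RFC3986);
}

echo build_lyricscore_url(
    "http://lyricscore.eu5.org/api/v1/",
    ["filename" => "Patti Austin & James Ingram - Baby Come To Me"],
    "text",
    true
), "\n";
```

Preferring artist and title over the filename follows the advice to pass metadata when it is available and to fall back to the file name otherwise.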

Websites
========

These websites are supported:
- MetroLyrics
- LyricsMania
- Lyrics.com
- SonicHits
- AZLyrics
- LyricsMode
- MusixMatch
- Golyr.de
- Songteksten.net

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
GNU GENERAL PUBLIC LICENSE
Version 2, June 1991

Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.

Preamble

The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users. This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it. (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.) You can apply it to
your programs, too.

When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
30 | These restrictions translate to certain responsibilities for you if you 31 | distribute copies of the software, or if you modify it. 32 | 33 | For example, if you distribute copies of such a program, whether 34 | gratis or for a fee, you must give the recipients all the rights that 35 | you have. You must make sure that they, too, receive or can get the 36 | source code. And you must show them these terms so they know their 37 | rights. 38 | 39 | We protect your rights with two steps: (1) copyright the software, and 40 | (2) offer you this license which gives you legal permission to copy, 41 | distribute and/or modify the software. 42 | 43 | Also, for each author's protection and ours, we want to make certain 44 | that everyone understands that there is no warranty for this free 45 | software. If the software is modified by someone else and passed on, we 46 | want its recipients to know that what they have is not the original, so 47 | that any problems introduced by others will not reflect on the original 48 | authors' reputations. 49 | 50 | Finally, any free program is threatened constantly by software 51 | patents. We wish to avoid the danger that redistributors of a free 52 | program will individually obtain patent licenses, in effect making the 53 | program proprietary. To prevent this, we have made it clear that any 54 | patent must be licensed for everyone's free use or not licensed at all. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | GNU GENERAL PUBLIC LICENSE 60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 61 | 62 | 0. This License applies to any program or other work which contains 63 | a notice placed by the copyright holder saying it may be distributed 64 | under the terms of this General Public License. 
The "Program", below, 65 | refers to any such program or work, and a "work based on the Program" 66 | means either the Program or any derivative work under copyright law: 67 | that is to say, a work containing the Program or a portion of it, 68 | either verbatim or with modifications and/or translated into another 69 | language. (Hereinafter, translation is included without limitation in 70 | the term "modification".) Each licensee is addressed as "you". 71 | 72 | Activities other than copying, distribution and modification are not 73 | covered by this License; they are outside its scope. The act of 74 | running the Program is not restricted, and the output from the Program 75 | is covered only if its contents constitute a work based on the 76 | Program (independent of having been made by running the Program). 77 | Whether that is true depends on what the Program does. 78 | 79 | 1. You may copy and distribute verbatim copies of the Program's 80 | source code as you receive it, in any medium, provided that you 81 | conspicuously and appropriately publish on each copy an appropriate 82 | copyright notice and disclaimer of warranty; keep intact all the 83 | notices that refer to this License and to the absence of any warranty; 84 | and give any other recipients of the Program a copy of this License 85 | along with the Program. 86 | 87 | You may charge a fee for the physical act of transferring a copy, and 88 | you may at your option offer warranty protection in exchange for a fee. 89 | 90 | 2. You may modify your copy or copies of the Program or any portion 91 | of it, thus forming a work based on the Program, and copy and 92 | distribute such modifications or work under the terms of Section 1 93 | above, provided that you also meet all of these conditions: 94 | 95 | a) You must cause the modified files to carry prominent notices 96 | stating that you changed the files and the date of any change. 
97 | 98 | b) You must cause any work that you distribute or publish, that in 99 | whole or in part contains or is derived from the Program or any 100 | part thereof, to be licensed as a whole at no charge to all third 101 | parties under the terms of this License. 102 | 103 | c) If the modified program normally reads commands interactively 104 | when run, you must cause it, when started running for such 105 | interactive use in the most ordinary way, to print or display an 106 | announcement including an appropriate copyright notice and a 107 | notice that there is no warranty (or else, saying that you provide 108 | a warranty) and that users may redistribute the program under 109 | these conditions, and telling the user how to view a copy of this 110 | License. (Exception: if the Program itself is interactive but 111 | does not normally print such an announcement, your work based on 112 | the Program is not required to print an announcement.) 113 | 114 | These requirements apply to the modified work as a whole. If 115 | identifiable sections of that work are not derived from the Program, 116 | and can be reasonably considered independent and separate works in 117 | themselves, then this License, and its terms, do not apply to those 118 | sections when you distribute them as separate works. But when you 119 | distribute the same sections as part of a whole which is a work based 120 | on the Program, the distribution of the whole must be on the terms of 121 | this License, whose permissions for other licensees extend to the 122 | entire whole, and thus to each and every part regardless of who wrote it. 123 | 124 | Thus, it is not the intent of this section to claim rights or contest 125 | your rights to work written entirely by you; rather, the intent is to 126 | exercise the right to control the distribution of derivative or 127 | collective works based on the Program. 
128 | 129 | In addition, mere aggregation of another work not based on the Program 130 | with the Program (or with a work based on the Program) on a volume of 131 | a storage or distribution medium does not bring the other work under 132 | the scope of this License. 133 | 134 | 3. You may copy and distribute the Program (or a work based on it, 135 | under Section 2) in object code or executable form under the terms of 136 | Sections 1 and 2 above provided that you also do one of the following: 137 | 138 | a) Accompany it with the complete corresponding machine-readable 139 | source code, which must be distributed under the terms of Sections 140 | 1 and 2 above on a medium customarily used for software interchange; or, 141 | 142 | b) Accompany it with a written offer, valid for at least three 143 | years, to give any third party, for a charge no more than your 144 | cost of physically performing source distribution, a complete 145 | machine-readable copy of the corresponding source code, to be 146 | distributed under the terms of Sections 1 and 2 above on a medium 147 | customarily used for software interchange; or, 148 | 149 | c) Accompany it with the information you received as to the offer 150 | to distribute corresponding source code. (This alternative is 151 | allowed only for noncommercial distribution and only if you 152 | received the program in object code or executable form with such 153 | an offer, in accord with Subsection b above.) 154 | 155 | The source code for a work means the preferred form of the work for 156 | making modifications to it. For an executable work, complete source 157 | code means all the source code for all modules it contains, plus any 158 | associated interface definition files, plus the scripts used to 159 | control compilation and installation of the executable. 
However, as a 160 | special exception, the source code distributed need not include 161 | anything that is normally distributed (in either source or binary 162 | form) with the major components (compiler, kernel, and so on) of the 163 | operating system on which the executable runs, unless that component 164 | itself accompanies the executable. 165 | 166 | If distribution of executable or object code is made by offering 167 | access to copy from a designated place, then offering equivalent 168 | access to copy the source code from the same place counts as 169 | distribution of the source code, even though third parties are not 170 | compelled to copy the source along with the object code. 171 | 172 | 4. You may not copy, modify, sublicense, or distribute the Program 173 | except as expressly provided under this License. Any attempt 174 | otherwise to copy, modify, sublicense or distribute the Program is 175 | void, and will automatically terminate your rights under this License. 176 | However, parties who have received copies, or rights, from you under 177 | this License will not have their licenses terminated so long as such 178 | parties remain in full compliance. 179 | 180 | 5. You are not required to accept this License, since you have not 181 | signed it. However, nothing else grants you permission to modify or 182 | distribute the Program or its derivative works. These actions are 183 | prohibited by law if you do not accept this License. Therefore, by 184 | modifying or distributing the Program (or any work based on the 185 | Program), you indicate your acceptance of this License to do so, and 186 | all its terms and conditions for copying, distributing or modifying 187 | the Program or works based on it. 188 | 189 | 6. 
Each time you redistribute the Program (or any work based on the 190 | Program), the recipient automatically receives a license from the 191 | original licensor to copy, distribute or modify the Program subject to 192 | these terms and conditions. You may not impose any further 193 | restrictions on the recipients' exercise of the rights granted herein. 194 | You are not responsible for enforcing compliance by third parties to 195 | this License. 196 | 197 | 7. If, as a consequence of a court judgment or allegation of patent 198 | infringement or for any other reason (not limited to patent issues), 199 | conditions are imposed on you (whether by court order, agreement or 200 | otherwise) that contradict the conditions of this License, they do not 201 | excuse you from the conditions of this License. If you cannot 202 | distribute so as to satisfy simultaneously your obligations under this 203 | License and any other pertinent obligations, then as a consequence you 204 | may not distribute the Program at all. For example, if a patent 205 | license would not permit royalty-free redistribution of the Program by 206 | all those who receive copies directly or indirectly through you, then 207 | the only way you could satisfy both it and this License would be to 208 | refrain entirely from distribution of the Program. 209 | 210 | If any portion of this section is held invalid or unenforceable under 211 | any particular circumstance, the balance of the section is intended to 212 | apply and the section as a whole is intended to apply in other 213 | circumstances. 214 | 215 | It is not the purpose of this section to induce you to infringe any 216 | patents or other property right claims or to contest validity of any 217 | such claims; this section has the sole purpose of protecting the 218 | integrity of the free software distribution system, which is 219 | implemented by public license practices. 
Many people have made 220 | generous contributions to the wide range of software distributed 221 | through that system in reliance on consistent application of that 222 | system; it is up to the author/donor to decide if he or she is willing 223 | to distribute software through any other system and a licensee cannot 224 | impose that choice. 225 | 226 | This section is intended to make thoroughly clear what is believed to 227 | be a consequence of the rest of this License. 228 | 229 | 8. If the distribution and/or use of the Program is restricted in 230 | certain countries either by patents or by copyrighted interfaces, the 231 | original copyright holder who places the Program under this License 232 | may add an explicit geographical distribution limitation excluding 233 | those countries, so that distribution is permitted only in or among 234 | countries not thus excluded. In such case, this License incorporates 235 | the limitation as if written in the body of this License. 236 | 237 | 9. The Free Software Foundation may publish revised and/or new versions 238 | of the General Public License from time to time. Such new versions will 239 | be similar in spirit to the present version, but may differ in detail to 240 | address new problems or concerns. 241 | 242 | Each version is given a distinguishing version number. If the Program 243 | specifies a version number of this License which applies to it and "any 244 | later version", you have the option of following the terms and conditions 245 | either of that version or of any later version published by the Free 246 | Software Foundation. If the Program does not specify a version number of 247 | this License, you may choose any version ever published by the Free Software 248 | Foundation. 249 | 250 | 10. If you wish to incorporate parts of the Program into other free 251 | programs whose distribution conditions are different, write to the author 252 | to ask for permission. 
For software which is copyrighted by the Free 253 | Software Foundation, write to the Free Software Foundation; we sometimes 254 | make exceptions for this. Our decision will be guided by the two goals 255 | of preserving the free status of all derivatives of our free software and 256 | of promoting the sharing and reuse of software generally. 257 | 258 | NO WARRANTY 259 | 260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN 262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 268 | REPAIR OR CORRECTION. 269 | 270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 278 | POSSIBILITY OF SUCH DAMAGES. 
279 | 280 | END OF TERMS AND CONDITIONS 281 | 282 | How to Apply These Terms to Your New Programs 283 | 284 | If you develop a new program, and you want it to be of the greatest 285 | possible use to the public, the best way to achieve this is to make it 286 | free software which everyone can redistribute and change under these terms. 287 | 288 | To do so, attach the following notices to the program. It is safest 289 | to attach them to the start of each source file to most effectively 290 | convey the exclusion of warranty; and each file should have at least 291 | the "copyright" line and a pointer to where the full notice is found. 292 | 293 | {description} 294 | Copyright (C) {year} {fullname} 295 | 296 | This program is free software; you can redistribute it and/or modify 297 | it under the terms of the GNU General Public License as published by 298 | the Free Software Foundation; either version 2 of the License, or 299 | (at your option) any later version. 300 | 301 | This program is distributed in the hope that it will be useful, 302 | but WITHOUT ANY WARRANTY; without even the implied warranty of 303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 304 | GNU General Public License for more details. 305 | 306 | You should have received a copy of the GNU General Public License along 307 | with this program; if not, write to the Free Software Foundation, Inc., 308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 309 | 310 | Also add information on how to contact you by electronic and paper mail. 311 | 312 | If the program is interactive, make it output a short notice like this 313 | when it starts in an interactive mode: 314 | 315 | Gnomovision version 69, Copyright (C) year name of author 316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 317 | This is free software, and you are welcome to redistribute it 318 | under certain conditions; type `show c' for details. 
319 | 320 | The hypothetical commands `show w' and `show c' should show the appropriate 321 | parts of the General Public License. Of course, the commands you use may 322 | be called something other than `show w' and `show c'; they could even be 323 | mouse-clicks or menu items--whatever suits your program. 324 | 325 | You should also get your employer (if you work as a programmer) or your 326 | school, if any, to sign a "copyright disclaimer" for the program, if 327 | necessary. Here is a sample; alter the names: 328 | 329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 330 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 331 | 332 | {signature of Ty Coon}, 1 April 1989 333 | Ty Coon, President of Vice 334 | 335 | This General Public License does not permit incorporating your program into 336 | proprietary programs. If your program is a subroutine library, you may 337 | consider it more useful to permit linking proprietary applications with the 338 | library. If this is what you want to do, use the GNU Lesser General 339 | Public License instead of this License. 
340 | -------------------------------------------------------------------------------- /index.php: -------------------------------------------------------------------------------- 1 | "; 43 | } 44 | 45 | if($filename == ""){ 46 | $lyrics = get_lyrics($artist, $title); 47 | }else{ 48 | $artist = get_artist($filename); 49 | $title = get_title($filename); 50 | $lyrics = get_lyrics($artist, $title); 51 | //$lyrics = fetch_lyrics("https://www.lyricsmania.com/compendium_lyrics_elder.html"); 52 | /*if(trim($lyrics) == ""){ 53 | $artist_new = $title; 54 | $title = $artist; 55 | $artist = $artist_new; 56 | $lyrics = get_lyrics($artist, $title); 57 | }*/ 58 | } 59 | 60 | //$lyrics = fetch_lyrics("http://www.lyrics.com/hello-lyrics-adele.html"); 61 | 62 | switch ($format) { 63 | case "xml": 64 | /*if($source == "LyricsMania"){ 65 | $lyrics = str_replace(["\r\n", "\r", "\n"], "
", $lyrics); 66 | }*/ 67 | 68 | //$lyrics = str_replace("

", "\n\n", $lyrics); 69 | $lyrics = str_replace("\n", "
", get_text_from_unclean_html($lyrics)); 70 | print "" . str_replace("&", "&", $artist) . "". str_replace("&", "&", $title) ."$source$url$lyrics"; 71 | break; 72 | case "json": 73 | $data = array( 74 | 'debug' => $debugmsgs, 75 | 'artist' => $artist, 76 | 'title' => $title, 77 | 'source' => $source, 78 | 'url' => $url, 79 | 'lyrics' => preg_replace("/[\n\r]/","\n",$lyrics) 80 | ); 81 | print json_encode($data, JSON_PRETTY_PRINT); 82 | break; 83 | case "text": 84 | print get_text_from_unclean_html($lyrics); 85 | break; 86 | case "datatext": 87 | print $source . "\n" . get_text_from_unclean_html($lyrics); 88 | break; 89 | default: 90 | if($source == "LyricsMania"){ 91 | $lyrics = str_replace(["\r\n", "\r", "\n"], "
", $lyrics); 92 | } 93 | print $lyrics . ""; 94 | break; 95 | } 96 | 97 | function get_text_from_unclean_html($unclean){ 98 | global $source; 99 | 100 | $unclean = str_replace("

", "\n\n", $unclean); 101 | 102 | return trim(strip_tags(html_entity_decode($unclean))); 103 | } 104 | 105 | function get_artist($filename){ 106 | $filename = str_replace(chr(38), "", $filename); 107 | 108 | //$filename = str_replace("%20"," ", $filename); // replace space 109 | $filename = str_replace("%2C", ",", $filename); // replace comma 110 | $filename = str_replace("%26", "and", $filename); // replace & by and (note: you still need to encode the parameter before sending it!) 111 | $filename = str_replace("&", "and", $filename); // replace & by and 112 | $filename = str_replace("+", " ", $filename); // replace + by space 113 | 114 | $filename = str_replace("%27", "", $filename); //replace single quote by nothing 115 | $filename = str_replace("'", "", $filename); //replace single quote by nothing 116 | 117 | $filename = preg_replace('/\\.[^.\\s]{3,4}$/', '', $filename); // remove extension 118 | $filename = str_replace(" _ ", " and ", $filename); //replace underscore with spaces by and 119 | $filename = str_replace("!", "", $filename); //replace exclamation mark by nothing 120 | $filename = str_replace("ç", "c", $filename); 121 | 122 | $filename = preg_replace('/Ft\..*\-/',"-", $filename); // remove Ft. 123 | $filename = preg_replace('/ft\..*\-/',"-", $filename); // remove ft. 
124 | $filename = preg_replace('/Feat.*\-/',"-", $filename); // remove Feat 125 | $filename = preg_replace('/feat.*\-/',"-", $filename); // remove feat 126 | $filename = preg_replace('/Featuring.*\-/', "-", $filename); // remove featuring 127 | $filename = preg_replace('/featuring.*\-/', "-", $filename); // remove featuring 128 | 129 | $filename = preg_replace('/\[.*\]/', '', $filename); // remove square brackets 130 | 131 | $spacepos = strpos($filename, " "); 132 | $hyphenpos = strpos($filename, "-"); 133 | 134 | if($hyphenpos == false){ 135 | debug_print("hyphenpos is $hyphenpos"); 136 | return ""; // invalid things 137 | } 138 | 139 | if($spacepos == false){ 140 | if(substr_count($filename, "_") > 1){ 141 | debug_print("number of underscores > 1 (artist)"); 142 | $filename = str_replace("_", " ", $filename); 143 | }else{ 144 | debug_print("spacepos artist is $spacepos"); 145 | return ""; // invalid things (for example "-IRememberYou") 146 | } 147 | } 148 | 149 | $filename = str_replace("- Official Video", "", $filename); 150 | $filename = str_replace("Official Video", "", $filename); 151 | 152 | if($hyphenpos + 1 < $spacepos){ 153 | $oldpos = $hyphenpos; 154 | $hyphenpos = strpos($filename, "-", $spacepos); // was not the right hyphen (a-ha) 155 | 156 | if ($hyphenpos == false){ 157 | $hyphenpos = $oldpos; 158 | } 159 | } 160 | 161 | $amount = 2; 162 | $space_after_pos = strpos($filename, " ", $hyphenpos); 163 | 164 | if($space_after_pos == false){ 165 | //no space after hyphenpos (Artist-Title) 166 | $amount = 1; 167 | }else{ 168 | //space after hyphenpos (Artist- Title) 169 | //equal to or greather than (was space_after_pos > hyphenpos + 1) 170 | if($space_after_pos > $hyphenpos - 1){ 171 | $amount = 0; // was 1 172 | } 173 | } 174 | 175 | // position brace 176 | $filenamesub = substr($filename, 0, $hyphenpos); 177 | $positionbrace = strpos($filenamesub, "("); 178 | if($positionbrace > 0){ 179 | $filename = substr($filename, 0, $positionbrace); 180 | } 181 | 
182 | $hyphenpostwo = strpos($filename, "-", $hyphenpos + 1); 183 | if($hyphenpostwo == false){ 184 | $artist_local = trim(substr($filename, 0, $hyphenpos - $amount)); 185 | return $artist_local; 186 | }else{ 187 | if($hyphenpostwo > strlen($filename) - 5){ 188 | $artist_local = trim(substr($filename, 0, $hyphenpos - $amount)); 189 | return $artist_local; //Wham! - Wake Me Up Before You Go-Go 190 | }else{ 191 | return trim(substr($filename, 0, $hyphenpostwo - $amount)); //Olivia Newton-John 192 | } 193 | } 194 | } 195 | 196 | function get_title($filename){ 197 | $filename = str_replace("!", "", $filename); //replace exclamation mark by nothing 198 | $filename = preg_replace('/\\.[^.\\s]{3,4}$/', '', $filename); // remove extension 199 | $filename = preg_replace('/\[.*\]/', '', $filename); // remove square brackets 200 | 201 | //$filename_featuring = $filename; 202 | 203 | $pos = strpos($filename, "-"); 204 | $spacepos = strpos($filename, " "); 205 | 206 | if($pos == false){ 207 | return ""; // invalid things 208 | } 209 | 210 | if($spacepos == false){ 211 | if(substr_count($filename, "_") > 1){ 212 | debug_print("number of underscores > 1 (title)"); 213 | $filename = str_replace("_", " ", $filename); 214 | }else{ 215 | debug_print("spacepos artist is $spacepos"); 216 | return ""; // invalid things (for example "-IRememberYou") 217 | } 218 | } 219 | 220 | $filename = str_replace(" - Lyrics", "", $filename); 221 | $filename = str_replace(" with Lyrics","", $filename); // remove with lyrics 222 | $filename = str_replace("_ll","'ll", $filename); 223 | $filename = str_replace("w_ lyrics","", $filename); // remove w_ lyrics 224 | $filename = str_replace(" Lyrics", "", $filename); 225 | $filename = str_replace(" HD", "", $filename); 226 | $filename = str_replace(" HQ", "", $filename); 227 | $filename = str_replace("- Official Video", "", $filename); 228 | $filename = str_replace("Official Video", "", $filename); 229 | $filename = str_replace("official video", "", 
$filename); 230 | $filename = str_replace("Music Video", "", $filename); 231 | //$filename = str_replace(" Live", "", $filename); 232 | 233 | // remove featuring 234 | $filename_lower = strtolower($filename); 235 | if(strpos($filename_lower, "ft.", $pos)){ 236 | $filename = preg_replace('/Ft\..*/',"", $filename); // remove Ft. 237 | $filename = preg_replace('/ft\..*/',"", $filename); // remove ft. 238 | } 239 | if(strpos($filename_lower, "feat", $pos)){ 240 | $filename = preg_replace('/Feat.*/',"", $filename); // remove Feat 241 | $filename = preg_replace('/feat.*/',"", $filename); // remove feat 242 | } 243 | if(strpos($filename_lower, "featuring.", $pos)){ 244 | $filename = preg_replace('/Featuring.*/', "", $filename); // remove Featuring 245 | $filename = preg_replace('/featuring.*/', "", $filename); // remove featuring 246 | } 247 | 248 | $hyphenpostwo = strpos($filename, "-", $pos + 1); 249 | if($hyphenpostwo > -1){ 250 | if($hyphenpostwo > strlen($filename) - 5){ 251 | //Wham! - Wake Me Up Before You Go-Go 252 | //debug_print("Wham! 
- Wake Me Up Before You Go-Go match"); 253 | }else{ 254 | //debug_print("Olivia Newton-John match"); 255 | $pos = $hyphenpostwo; //Olivia Newton-John 256 | } 257 | } 258 | 259 | $amount = 2; 260 | $space_after_pos = strpos($filename, " ", $pos); 261 | 262 | //debug_print("space_after_pos is $space_after_pos and pos is $pos"); 263 | 264 | if($space_after_pos == false){ 265 | //no space after pos 266 | $amount = 1; 267 | }else{ 268 | //was +1 269 | if($space_after_pos > $pos - 1){ 270 | $amount = 1; 271 | } 272 | if($space_after_pos == $pos - 1){ 273 | $amount = 0; // added 274 | } 275 | } 276 | $parpos = strpos($filename, "("); 277 | if($parpos > $pos){ // implies $parpos > 0 278 | // in the title part, remove it 279 | $filename = substr($filename, 0, $parpos); 280 | } 281 | 282 | return trim(substr($filename, $pos + $amount)); 283 | } 284 | 285 | function get_parameter($parametername){ 286 | $parameter = isset($_GET[$parametername]) ? $_GET[$parametername] : ''; 287 | if($parameter == ""){ 288 | $parameter = getenv(strtoupper($parametername)); 289 | } 290 | return $parameter; 291 | } 292 | 293 | function debug_print($message){ 294 | debug_print_importance($message, "debug"); 295 | } 296 | 297 | function debug_print_importance($message, $importance){ 298 | global $debugmsgs; 299 | $extrainfo = get_parameter("extrainfo"); 300 | 301 | if($importance == "extrainfo"){ 302 | if($extrainfo != "true"){ 303 | return; 304 | } 305 | } 306 | 307 | $mode = get_parameter("mode"); 308 | $format = get_parameter("format"); 309 | if($mode == "debug"){ 310 | switch($format){ 311 | case "xml": 312 | print "$message"; 313 | break; 314 | case "text": 315 | print "$message\n"; 316 | break; 317 | case "json": 318 | $debugmsgs[] = $message; 319 | break; 320 | default: 321 | print "DEBUG: $message
";
                break;
        }
    }
}

function get_lyrics($artist_x, $title_x){
    global $source;
    global $url;

    // initialise the flags and names that the later fallback branches compare against,
    // so they are never read while undefined
    $tried_together_and = false;
    $tried_together_withdash = false;
    $first_artist_url = "";
    $first_artist_name = "";

    if($title_x == "" || $artist_x == ""){
        return "";
    }

    $title_x = trim($title_x);
    $artist_x = trim($artist_x);

    $title_x = str_replace(" ", "_", $title_x);
    $artist_x = str_replace(' _ ', ' and ', $artist_x); // Womack _ Womack - Friends
    $artist_x = str_replace(" ", "_", $artist_x);

    $title_x = strtolower($title_x);
    $artist_x = strtolower($artist_x);

    $title_x = str_replace("-", "_", $title_x);
    $original_title = $title_x;
    //$title_x = str_replace($title_x, '[^%w_]',''); TODO: looks like a Lua pattern left over from a port (it would strip every character that is not alphanumeric or an underscore)
    $title_x = str_replace('&','and', $title_x);
    $artist_x = str_replace('&', 'and', $artist_x);
    //$artist_x = str_replace($artist_x, '_&_','_and_');
    $artist_x = str_replace('_&_','_', $artist_x);
    $artist_x = str_replace('.','', $artist_x);
    $artist_x = str_replace('ó', 'o', $artist_x); // Róisín Murphy
    $artist_x = str_replace('í', 'i', $artist_x); // Róisín Murphy

    $artist_metro = str_replace('-','', $artist_x); //a-ha is aha on metrolyrics, but a_ha on lyricsmode
    $artist_metro = str_replace('_','-', $artist_metro);

    $artist_x = str_replace('-','_', $artist_x);
    //$artist_x = str_replace('[^%w_]','', $artist_x);

    $metrotitle = str_replace('_','-', $title_x);

    $url = "";
    $lyric_string = "";

    if(is_lyric_page($lyric_string) == false){
        $metrourl = "http://www.metrolyrics.com/$metrotitle-lyrics-$artist_metro.html";
        //lyric_string = fetch_lyrics(metrourl)

        $artist_and_location = strpos($artist_metro, "-and-");

        if($artist_and_location > -1){
            $artist_and_location = strpos($artist_metro, "and", $artist_and_location - 2);
            if ($artist_and_location > -1){
                debug_print("MetroLyrics: artist_metro 
($artist_metro) contains and"); 377 | if(is_lyric_page($lyric_string) == false){ 378 | if(strlen($artist_metro) - $artist_and_location < 14){ 379 | //quick code path to reduce the number of false tries 380 | //probably not two separate artists, but one with a & in the name 381 | //together 382 | $lyric_string = fetch_lyrics($metrourl); 383 | $tried_together_and = true; 384 | 385 | if(is_lyric_page($lyric_string) == false){ 386 | // together with a dash 387 | $new_artist_metro = substr($artist_metro, 0, $artist_and_location - 1) . substr($artist_metro, $artist_and_location + 3); // removed "-" . 388 | $url = "http://www.metrolyrics.com/$metrotitle-lyrics-$new_artist_metro.html"; //must be the same as above 389 | debug_print("metrolyrics together with a dash: " . $url); 390 | 391 | $lyric_string = fetch_lyrics($url); 392 | $tried_together_withdash = true; 393 | } 394 | } 395 | } 396 | 397 | //$first_artist_url 398 | if(is_lyric_page($lyric_string) == false){ 399 | //first artist 400 | $new_artist_metro = substr($artist_metro, 0, $artist_and_location - 1); 401 | $url = "http://www.metrolyrics.com/$metrotitle-lyrics-$new_artist_metro.html"; 402 | $first_artist_url = $url; 403 | debug_print("first artist: $url"); 404 | $lyric_string = fetch_lyrics($url); 405 | } 406 | if(is_lyric_page($lyric_string) == false){ 407 | //second artist 408 | $new_artist_metro = substr($artist_metro, $artist_and_location + 4); 409 | $url = "http://www.metrolyrics.com/$metrotitle-lyrics-$new_artist_metro.html"; 410 | if($first_artist_url != $url){ 411 | $lyric_string = fetch_lyrics($url); 412 | } 413 | } 414 | 415 | if(is_lyric_page($lyric_string) == false && $tried_together_withdash == false){ 416 | //together with a dash 417 | // VLC: was 1 418 | $new_artist_metro = substr($artist_metro, 0, $artist_and_location - 1) . "-" . 
substr($artist_metro, $artist_and_location + 3); 419 | $url = "http://www.metrolyrics.com/$metrotitle-lyrics-$new_artist_metro.html"; 420 | $lyric_string = fetch_lyrics($url); 421 | } 422 | if(is_lyric_page($lyric_string) == false){ 423 | //try again without and between artists 424 | // VLC: 1 -> 0 425 | // VLc: 2 -> 1 426 | $new_artist_metro = substr($artist_metro, 0, $artist_and_location - 1) . substr($artist_metro, $artist_and_location + 3); // VLC: 4 -> 3 427 | $url = "http://www.metrolyrics.com/$metrotitle-lyrics-$new_artist_metro.html"; 428 | debug_print("try again without and between artists: " . $url); 429 | $lyric_string = fetch_lyrics($url); 430 | } 431 | 432 | if(is_lyric_page($lyric_string) == false && $tried_together_and == false){ 433 | //together 434 | $lyric_string = fetch_lyrics($metrourl); 435 | } 436 | } 437 | }else{ 438 | debug_print("MetroLyrics (normal): $metrourl"); 439 | $lyric_string = fetch_lyrics($metrourl); 440 | 441 | if(is_lyric_page($lyric_string) == false && strpos($artist_metro, 'the-') > -1){ 442 | $new_artist_metro = str_replace('the-','', $artist_metro); 443 | $url="http://www.metrolyrics.com/$metrotitle-lyrics-$new_artist_metro.html"; 444 | $lyric_string = fetch_lyrics($url); 445 | debug_print("MetroLyrics (normal): match THE at $url"); 446 | } 447 | } 448 | } 449 | //best coverage, but a bit slow to put first 450 | /*if(is_lyric_page($lyric_string) == false){ 451 | $url = "http://sonichits.com/video/$artist_x/" . str_replace("-", "_", $original_title); 452 | $lyric_string = fetch_lyrics($url); 453 | }*/ 454 | 455 | $artist_and_location = strpos($artist_x, "_and_"); 456 | if($artist_and_location){ 457 | $artist_and_location = strpos($artist_x, "and", $artist_and_location - 2); 458 | } 459 | 460 | if($artist_and_location > -1){ 461 | if(is_lyric_page($lyric_string) == false){ 462 | $new_artist_x = substr($artist_x, 0, $artist_and_location - 1) . substr($artist_x, $artist_and_location + 3); // . 
"_" 463 | $url = "http://www.lyricsmode.com/lyrics/".substr($new_artist_x, 0, 1)."/".$new_artist_x."/".$title_x.".html"; 464 | debug_print("lyricsmode1: $url"); 465 | $lyric_string = fetch_lyrics($url); 466 | } 467 | } 468 | 469 | if(is_lyric_page($lyric_string) == false){ 470 | if($artist_and_location > -1){ 471 | $artist_and_location = strpos($artist_x, "and", $artist_and_location - 2); 472 | //try again without and (first artist) 473 | $new_artist_x = substr($artist_x, 0, $artist_and_location - 1); 474 | $first_artist_name = $new_artist_x; 475 | $url = "http://www.lyricsmode.com/lyrics/".substr($new_artist_x, 0, 1)."/".$new_artist_x."/".$title_x.".html"; 476 | debug_print("lyricsmode2: $url"); 477 | $lyric_string = fetch_lyrics($url); 478 | } 479 | if(is_lyric_page($lyric_string) == false){ 480 | $url = "http://www.lyricsmode.com/lyrics/".substr($artist_x, 0, 1)."/".$artist_x."/".$title_x.".html"; 481 | debug_print("lyricsmode3: $url"); 482 | $lyric_string = fetch_lyrics($url); 483 | } 484 | } 485 | 486 | if(is_lyric_page($lyric_string) == false){ 487 | $artist_and_location = strpos($artist_x, "_and_"); 488 | if($artist_and_location > -1){ 489 | $artist_and_location = strpos($artist_x, "and", $artist_and_location - 2); 490 | //try again without and (second artist) 491 | $new_artist_x = substr($artist_x, $artist_and_location + 4); //length of and + 1 492 | if($first_artist_name == $new_artist_x){ 493 | // try again without and (before: do nothing) 494 | // Womack & Womack - MPB 495 | $new_artist_x = substr($artist_x, 0, $artist_and_location - 1) . "_" . 
                substr($artist_x, $artist_and_location + 3);
                $url = "http://www.lyricsmode.com/lyrics/".substr($new_artist_x, 0, 1)."/".$new_artist_x."/".$title_x.".html";
                debug_print("lyricsmode4: $url");
                $lyric_string = fetch_lyrics($url);
            }else{
                $url = "http://www.lyricsmode.com/lyrics/".substr($new_artist_x, 0, 1)."/".$new_artist_x."/".$title_x.".html";
                debug_print("lyricsmode5: $url");
                $lyric_string = fetch_lyrics($url);
            }
        }
    }

    if(is_lyric_page($lyric_string) == false){
        $title_dec_loc = strpos($title_x, "twee");
        if($title_dec_loc > -1){
            //try again, replacing the Dutch word "twee" (two) with the digit 2
            $new_title_x = str_replace("twee", "2", $title_x);
            $url = "http://www.lyricsmode.com/lyrics/".substr($artist_x, 0,1)."/".$artist_x."/".$new_title_x.".html";
            debug_print("lyricsmode6: $url");
            $lyric_string = fetch_lyrics($url);
        }
    }

    if(is_lyric_page($lyric_string) == false){
        $artist_the_location = strpos($artist_x, "the");
        if($artist_the_location > -1){
            debug_print("lyricsmode: try again without the");
            $new_artist_x = str_replace("the_", "", $artist_x);
            $url = "http://www.lyricsmode.com/lyrics/".substr($new_artist_x, 0, 1)."/".$new_artist_x."/".$title_x.".html";
            debug_print("lyricsmode7: $url");
            $lyric_string = fetch_lyrics($url);
        }
    }


    if(is_lyric_page($lyric_string) && $source=="LyricsMode"){
        //LyricsMode has some problems with encoding, fix these before showing
        $lyric_string = str_replace("ґ", "'", $lyric_string); //replace ґt with 't (the "%" was a Lua escape and would have been inserted literally here)
        $lyric_string = str_replace("й", "é", $lyric_string); //French
        $lyric_string = str_replace("к", "ê", $lyric_string); //French
        $lyric_string = str_replace("и", "è", $lyric_string); //French
        $lyric_string = str_replace("ы", "û", $lyric_string); //French
        $lyric_string = str_replace("њ", "œ", $lyric_string); //French
        $lyric_string = str_replace("д", "ä", $lyric_string); //German

        //cleanup first lines
        // NOTE: $artist and $title are not defined inside this function (only $source and $url are declared global),
        // so both strings below end up empty unless that is changed
        $lower_artist_name = strtolower($artist); // (artist:get_text()) TODO: verify if this code block works
        $lower_title = strtolower($title); //title:get_text())
        $lower_lyric_string = strtolower($lyric_string);
        // default to "not found" so the checks below never read an undefined variable
        $pos_author = false;
        $pos_title = false;
        if($lower_artist_name != ""){
            $pos_author = strpos($lower_lyric_string, $lower_artist_name);
        }
        if($lower_title != ""){
            $pos_title = strpos($lower_lyric_string, $lower_title);
        }
        $pos_newline = strpos($lower_lyric_string, "\n");

        // TODO: verify if this works
        //check if the first line is empty
        if($pos_newline){
            if($pos_newline == 1){
                $lyric_string = substr($lyric_string, $pos_newline+1); //remove the first line (=empty line)
            }
        }

        if($pos_author){
            //remove author name from first line(s)
            if($pos_author < $pos_newline){
                //contains
                $lyric_string = substr($lyric_string, $pos_newline+1); //remove the first line (=artist name)
            }
        }

        if($pos_title){
            //remove title from first line(s)
            //$pos_newline = strpos($lyric_string, "\n", $pos_newline+1);
            if($pos_title < $pos_newline){
                //contains
                $lyric_string = substr($lyric_string, $pos_newline+1); //remove the first line (=title)
            }
        }

        if($pos_newline){
            $pos_new_newline = strpos($lyric_string, "\n", $pos_newline+1);

            if($pos_new_newline == $pos_newline+1){
                //next line is empty
                $lyric_string = substr($lyric_string, $pos_new_newline+1);
            }
        }
    }

    if(is_lyric_page($lyric_string) == false){
        $url = "http://www.golyr.de/" . str_replace("_","-", $artist_x) . "/songtext-" . 
        str_replace("_", "-", $title_x);
        debug_print("golyr.de: $url");
        $lyric_string = fetch_lyrics($url);
    }

    if(is_lyric_page($lyric_string) == false){
        $artist_az = str_replace("_", "", $artist_x);
        $title_az = str_replace("_", "", $title_x);

        $url = "http://www.azlyrics.com/lyrics/" . $artist_az . "/" . $title_az .".html";
        debug_print("azlyrics.com: $url");
        $lyric_string = fetch_lyrics($url);

        // strpos() is zero-based: a leading "the" is at position 0 (the old "== 1" check matched one character too late)
        if(is_lyric_page($lyric_string) == false && strpos($artist_az, 'the') === 0){
            $new_artist_az = str_replace('the-','', $artist_az);
            $url = "http://www.azlyrics.com/lyrics/" . substr($new_artist_az, 3) . "/" . $title_az .".html";
            $lyric_string = fetch_lyrics($url);
            debug_print("azlyrics.com (THE): $url");
        }
        if(is_lyric_page($lyric_string) == false && strpos($artist_az, 'and') > 0){
            $andlocation = strpos($artist_az, "and");
            $new_artist_az = substr($artist_az, 0, $andlocation); // preg_replace('and.*', '', $artist_az);
            $url = "http://www.azlyrics.com/lyrics/" . $new_artist_az . "/" . $title_az .".html";
            $lyric_string = fetch_lyrics($url);
            debug_print("azlyrics.com (without AND part): $url");
        }
    }

    if(is_lyric_page($lyric_string) == false){
        $url = "http://www.lyrics.com/".str_replace("_", "-", $title_x)."-lyrics-".str_replace("_", "-", $artist_x).".html";
        debug_print("lyrics.com: $url");
        $lyric_string = fetch_lyrics($url);
    }


    if(is_lyric_page($lyric_string) == false){
        $url = "http://www.lyricsmania.com/" . 
str_replace("-", "_", $title_x)."_lyrics_$artist_x.html"; 625 | debug_print("lyricsmania.com: $url"); 626 | $lyric_string = fetch_lyrics($url); 627 | } 628 | 629 | $title_x_normal = str_replace("_", "-", $title_x); 630 | if(is_lyric_page($lyric_string) == false){ 631 | //http://songteksten.net/search/title.html?q=climbing+to+the+top&type=title 632 | $url = "http://songteksten.net/search/title.html?q=" . str_replace("-", "+", $title_x)."&type=title"; 633 | debug_print("songteksten.net: $url"); 634 | $data = file_get_contents($url); 635 | 636 | $posurl = strrpos($data, "http://songteksten.net/lyric"); 637 | $middlelinkpos = strpos($data, '"', $posurl); 638 | 639 | $url = substr($data, $posurl, $middlelinkpos-$posurl); 640 | debug_print($url . " with title " . $title_x_normal); 641 | 642 | if(strpos($url, $title_x_normal) == true){ 643 | $lyric_string = fetch_lyrics($url); 644 | } 645 | } 646 | 647 | 648 | if(is_lyric_page($lyric_string) == false){ 649 | $source = ""; 650 | } 651 | 652 | return clean_lyrics($lyric_string); 653 | } 654 | 655 | function is_lyric_page($lyric_string){ 656 | if($lyric_string==""){ 657 | return false; 658 | } 659 | 660 | $licensing = "We are not in a position to display these lyrics due to licensing restrictions. Sorry for the inconvenience."; 661 | if(strpos($lyric_string, $licensing) > -1){ 662 | debug_print($licensing); 663 | return false; 664 | } 665 | $dailylimit = "You've reached the daily limit of 10 videos. 
Log in to watch more"; 666 | if(strpos($lyric_string, $dailylimit) > -1){ 667 | debug_print($dailylimit); 668 | return false; 669 | } 670 | $dailylimit = "Daily limit reached for Sonic Hits"; 671 | if(strpos($lyric_string, $dailylimit) > -1){ 672 | debug_print_importance($dailylimit, "extrainfo"); 673 | return false; 674 | } 675 | 676 | if(strpos($lyric_string, "Select your carrier...") > -1){ 677 | debug_print("Low quality lyrics"); 678 | return false; 679 | } 680 | 681 | if(strpos($lyric_string, "No lyrics found for this song") > -1){ 682 | debug_print("Data dump: $lyric_string"); 683 | return false; 684 | } 685 | 686 | if(strlen($lyric_string) < 40){ 687 | debug_print("(info) Partial page / not a valid lyrics page"); 688 | debug_print("Data dump: $lyric_string"); 689 | return false; 690 | } 691 | 692 | return true; 693 | } 694 | 695 | function fetch_lyrics($url){ 696 | global $source; 697 | 698 | $metrolyrics = strpos($url, 'metrolyrics'); 699 | $lyricsmania = strpos($url, 'lyricsmania'); 700 | $lyricscom = strpos($url, 'lyrics.com'); 701 | $sonichits = strpos($url, 'sonichits'); 702 | $azlyrics = strpos($url, 'azlyrics'); 703 | $lyricsmode = strpos($url, 'lyricsmode'); 704 | $musixmatch = strpos($url, 'musixmatch'); 705 | $golyr = strpos($url, 'golyr'); 706 | $songteksten = strpos($url, 'songteksten.net'); 707 | 708 | $mode = get_parameter("mode"); 709 | if($mode != "debug") 710 | error_reporting(E_ERROR | E_PARSE); 711 | $data = file_get_contents($url); 712 | 713 | if($metrolyrics){ 714 | $source="MetroLyrics"; 715 | 716 | if($data == "") return ""; 717 | $html = str_get_html($data); 718 | 719 | //$lyrics_body_text = $html->find('div[id=lyrics-body-text]'); 720 | 721 | $verses = $html->find('p[class=verse]');; 722 | 723 | $metrolyrics_text = ""; 724 | foreach ($verses as &$verse) { 725 | $metrolyrics_text = $metrolyrics_text . str_replace("
", "\n", $verse); 726 | } 727 | 728 | return str_replace("\n ", "\n", $metrolyrics_text); 729 | } 730 | 731 | if($lyricsmania){ 732 | $source="LyricsMania"; 733 | 734 | $data = str_replace("

", "", $data); 735 | $data = str_replace("
\t", "", $data); // for p402_premium 736 | $data = str_replace("
", "\r\n", $data); 737 | $data = str_replace("\n\r\n", "\r\n", $data); 738 | $data = str_replace("\r\r\n", "\r\n", $data); 739 | 740 | $html = str_get_html($data, true, true, DEFAULT_TARGET_CHARSET, false); 741 | $lyricsbody = $html->find('div[class=lyrics-body]', 0); 742 | 743 | $lyricsbody->find('#video-musictory', 0)->outertext = ''; 744 | $lyricsbody->find('div[class=fb-quote]', 0)->outertext = ''; 745 | $lyricsbody->find('script', 0)->outertext = ''; 746 | 747 | $html->save(); 748 | 749 | return $lyricsbody; 750 | } 751 | 752 | if($lyricsmode){ 753 | $source="LyricsMode"; 754 | $identifier = '

'; 755 | $a = strpos($data, $identifier); 756 | if($a == false){ 757 | return ""; 758 | } 759 | $b = strpos($data, "

", $a) + 4; // +4 includes

, which is a workaround for a missing letter (http://www.lyricsmode.com/lyrics/c/céline_dion/think_twice.html) (is compensated for by converting unclean html to clean text to clean html) 760 | $lengthofidentifier = strlen($identifier); 761 | $lyricsmode_result = substr($data, $a, $b-$a); 762 | 763 | return $lyricsmode_result; 764 | } 765 | 766 | if($sonichits){ 767 | $source="Sonic Hits"; 768 | // TODO: verify if this works 769 | 770 | // You've reached the daily limit of 10 videos. Log in to watch more 771 | $dailylimit = strpos($data, "You've reached the daily limit of 10 videos. Log in to watch more"); 772 | if($dailylimit){ 773 | return "Daily limit reached for Sonic Hits"; 774 | } 775 | 776 | $a = strpos($data, '

', $a); 786 | if($position == false){ 787 | return ""; 788 | } 789 | 790 | $contributedby = strpos($data, "Contributed by", $a); 791 | $lyricsc = strpos($data, "Lyrics", $a); 792 | 793 | if($contributedby == false && $lyricsc == false){ 794 | return ""; 795 | } 796 | if($contributedby){ 797 | echo "contributed by"; 798 | $b = strpos($data, "
", $a); 799 | } 800 | if($lyricsc){ 801 | echo "lyricscopy"; 802 | $b = strpos($data, "
", $lyricsc-10); 803 | } 804 | 805 | if($b == false){ 806 | return ""; 807 | } 808 | 809 | return substr($data, $position+strlen("

"),$b-1-$position); 810 | } 811 | 812 | if($golyr){ 813 | $source="Golyr"; 814 | $a = strpos($data, '
", $position); 852 | 853 | return substr($data, $position+5,$b-1-($position+5)); 854 | } 855 | if($songteksten){ 856 | $source = "Songteksten.net"; 857 | $a = strpos($data, 'body_right'); 858 | 859 | if($a == false){ 860 | return ""; 861 | } 862 | $a = strpos($data, ''); 863 | if($a == false){ 864 | return ""; 865 | } 866 | 867 | $endofstring = 'div'; 868 | $position = strpos($data, $endofstring, $a); 869 | if($position == false){ 870 | return ""; 871 | } 872 | return substr($data, $a + 5, $position-($a+6)); 873 | } 874 | 875 | if($lyricscom){ 876 | $source="Lyrics.com"; 877 | $a = strpos($data, 'itemprop="description'); 878 | if($a == false){ 879 | return ""; 880 | } 881 | $b = strpos($data, "---", $a); 882 | return substr($data, $a+strlen('itemprop="description">'),$b-($a+strlen('itemprop="description">'))); 883 | } 884 | 885 | return ""; 886 | } 887 | 888 | function clean_lyrics($lyrics_text_input){ 889 | $lyrics_text_input = trim_all($lyrics_text_input, "\\x00-\\x09"); 890 | $lyrics_text_input = trim_all($lyrics_text_input, "\\x0B-\\x0C"); 891 | $lyrics_text_input = trim_all($lyrics_text_input, "\\x0E-\\x1F"); 892 | 893 | $lyrics_text_input = trim_all($lyrics_text_input, "\\x0D\\x0A", "\n"); // \r\n 894 | 895 | return $lyrics_text_input; 896 | } 897 | 898 | // http://pageconfig.com/post/remove-undesired-characters-with-trim_all-php 899 | function trim_all( $str , $what = NULL , $with = ' ' ) 900 | { 901 | if( $what === NULL ) 902 | { 903 | // Character Decimal Use 904 | // "\0" 0 Null Character 905 | // "\t" 9 Tab 906 | // "\n" 10 New line 907 | // "\x0B" 11 Vertical Tab 908 | // "\r" 13 New Line in Mac 909 | // " " 32 Space 910 | 911 | //$what = "\\x00-\\x20"; //all white-spaces and control chars 912 | 913 | $what = "\\x09"; 914 | } 915 | 916 | return trim( preg_replace( "/[".$what."]+/" , $with , $str ) , $what ); 917 | } 918 | -------------------------------------------------------------------------------- /simple_html_dom.php: 
--------------------------------------------------------------------------------
size is the "real" number of bytes the dom was created from.
 * but for most purposes, it's a really good estimation.
 * Paperg - Added the forceTagsClosed to the dom constructor. Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.
 * Allow the user to tell us how much they trust the html.
 * Paperg add the text and plaintext to the selectors for the find syntax. plaintext implies text in the innertext of a node. text implies that the tag is a text node.
 * This allows for us to find tags based on the text they contain.
 * Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.
 * Paperg: added parse_charset so that we know about the character set of the source document.
 * NOTE: If the user's system has a routine called get_last_retrieve_url_contents_content_type available, we will assume it's returning the content-type header from the
 * last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.
 *
 * Found infinite loop in the case of broken html in restore_noise. Rewrote to protect from that.
 * PaperG (John Schlick) Added get_display_size for "IMG" tags.
 *
 * Licensed under The MIT License
 * Redistributions of files must retain the above copyright notice.
 *
 * @author S.C. Chen
 * @author John Schlick
 * @author Rus Carroll
 * @version 1.5 ($Rev: 210 $)
 * @package PlaceLocalInclude
 * @subpackage simple_html_dom
 */

/**
 * All of the Defines for the classes below.
 * @author S.C. 
Chen
 */
define('HDOM_TYPE_ELEMENT', 1);
define('HDOM_TYPE_COMMENT', 2);
define('HDOM_TYPE_TEXT', 3);
define('HDOM_TYPE_ENDTAG', 4);
define('HDOM_TYPE_ROOT', 5);
define('HDOM_TYPE_UNKNOWN', 6);
define('HDOM_QUOTE_DOUBLE', 0);
define('HDOM_QUOTE_SINGLE', 1);
define('HDOM_QUOTE_NO', 3);
define('HDOM_INFO_BEGIN', 0);
define('HDOM_INFO_END', 1);
define('HDOM_INFO_QUOTE', 2);
define('HDOM_INFO_SPACE', 3);
define('HDOM_INFO_TEXT', 4);
define('HDOM_INFO_INNER', 5);
define('HDOM_INFO_OUTER', 6);
define('HDOM_INFO_ENDSPACE',7);
define('DEFAULT_TARGET_CHARSET', 'UTF-8');
define('DEFAULT_BR_TEXT', "\r\n");
define('DEFAULT_SPAN_TEXT', " ");
define('MAX_FILE_SIZE', 600000);
// helper functions
// -----------------------------------------------------------------------------
// get html dom from file
// $maxlen is defined in the code as PHP_STREAM_COPY_ALL which is defined as -1.
// $offset defaults to 0: since PHP 7.1 a negative offset seeks from the end of the stream, which remote files do not support.
function file_get_html($url, $use_include_path = false, $context=null, $offset = 0, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retrieve_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}

// get html dom from string
function str_get_html($str, $lowercase=true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    if (empty($str) || strlen($str) > MAX_FILE_SIZE)
    {
        $dom->clear();
        return false;
    }
    $dom->load($str, $lowercase, $stripRN);
    return $dom;
}

// dump html dom tree
function dump_html_tree($node, $show_attr=true, $deep=0)
{
    // pass the caller's options through (previously the node itself was passed as $show_attr)
    $node->dump($show_attr, $deep);
}


/**
 * simple html dom node
 * PaperG - added ability for "find" routine to lowercase the value of the selector.
 * PaperG - added $tag_start to track the start position of the tag in the total byte index
 *
 * @package PlaceLocalInclude
 */
class simple_html_dom_node
{
    public $nodetype = HDOM_TYPE_TEXT;
    public $tag = 'text';
    public $attr = array();
    public $children = array();
    public $nodes = array();
    public $parent = null;
    // The "info" array - see HDOM_INFO_... for what each element contains.
    public $_ = array();
    public $tag_start = 0;
    private $dom = null;

    function __construct($dom)
    {
        $this->dom = $dom;
        $dom->nodes[] = $this;
    }

    function __destruct()
    {
        $this->clear();
    }

    function __toString()
    {
        return $this->outertext();
    }

    // clean up memory due to php5 circular references memory leak...
    function clear()
    {
        $this->dom = null;
        $this->nodes = null;
        $this->parent = null;
        $this->children = null;
    }

    // dump node's tree
    function dump($show_attr=true, $deep=0)
    {
        $lead = str_repeat(' ', $deep);

        echo $lead.$this->tag;
        if ($show_attr && count($this->attr)>0)
        {
            echo '(';
            foreach ($this->attr as $k=>$v)
                echo "[$k]=>\"".$this->$k.'", ';
            echo ')';
        }
        echo "\n";

        if ($this->nodes)
        {
            foreach ($this->nodes as $c)
            {
                $c->dump($show_attr, $deep+1);
            }
        }
    }


    // Debugging function to dump a single dom node with a bunch of information about it.
    function dump_node($echo=true)
    {

        $string = $this->tag;
        if (count($this->attr)>0)
        {
            $string .= '(';
            foreach ($this->attr as $k=>$v)
            {
                $string .= "[$k]=>\"".$this->$k.'", ';
            }
            $string .= ')';
        }
        if (count($this->_)>0)
        {
            $string .= ' $_ (';
            foreach ($this->_ as $k=>$v)
            {
                if (is_array($v))
                {
                    $string .= "[$k]=>(";
                    foreach ($v as $k2=>$v2)
                    {
                        $string .= "[$k2]=>\"".$v2.'", ';
                    }
                    $string .= ")";
                } else {
                    $string .= "[$k]=>\"".$v.'", ';
                }
            }
            $string .= ")";
        }

        if (isset($this->text))
        {
            $string .= " text: (" . $this->text . ")";
        }

        $string .= " HDOM_INNER_INFO: '";
        // use $this here: $node is not defined in this method, so the inner info was never printed
        if (isset($this->_[HDOM_INFO_INNER]))
        {
            $string .= $this->_[HDOM_INFO_INNER] . "'";
        }
        else
        {
            $string .= ' NULL ';
        }

        $string .= " children: " . count($this->children);
        $string .= " nodes: " . count($this->nodes);
        $string .= " tag_start: " . 
$this->tag_start;
        $string .= "\n";

        if ($echo)
        {
            echo $string;
            return;
        }
        else
        {
            return $string;
        }
    }

    // returns the parent of node
    // If a node is passed in, it will reset the parent of the current node to that one.
    function parent($parent=null)
    {
        // I am SURE that this doesn't work properly.
        // It fails to unset the current node from its current parent's nodes or children list first.
        if ($parent !== null)
        {
            $this->parent = $parent;
            $this->parent->nodes[] = $this;
            $this->parent->children[] = $this;
        }

        return $this->parent;
    }

    // verify that node has children
    function has_child()
    {
        return !empty($this->children);
    }

    // returns children of node
    function children($idx=-1)
    {
        if ($idx===-1)
        {
            return $this->children;
        }
        if (isset($this->children[$idx]))
        {
            return $this->children[$idx];
        }
        return null;
    }

    // returns the first child of node
    function first_child()
    {
        if (count($this->children)>0)
        {
            return $this->children[0];
        }
        return null;
    }

    // returns the last child of node
    function last_child()
    {
        if (($count=count($this->children))>0)
        {
            return $this->children[$count-1];
        }
        return null;
    }

    // returns the next sibling of node
    function next_sibling()
    {
        if ($this->parent===null)
        {
            return null;
        }

        $idx = 0;
        $count = count($this->parent->children);
        while ($idx<$count && $this!==$this->parent->children[$idx])
        {
            ++$idx;
        }
        if (++$idx>=$count)
        {
            return null;
        }
        return $this->parent->children[$idx];
    }

    // returns the previous sibling of node
    function 
prev_sibling() 322 | { 323 | if ($this->parent===null) return null; 324 | $idx = 0; 325 | $count = count($this->parent->children); 326 | while ($idx<$count && $this!==$this->parent->children[$idx]) 327 | ++$idx; 328 | if (--$idx<0) return null; 329 | return $this->parent->children[$idx]; 330 | } 331 | 332 | // function to locate a specific ancestor tag in the path to the root. 333 | function find_ancestor_tag($tag) 334 | { 335 | global $debug_object; 336 | if (is_object($debug_object)) { $debug_object->debug_log_entry(1); } 337 | 338 | // Start by including ourselves in the comparison. 339 | $returnDom = $this; 340 | 341 | while (!is_null($returnDom)) 342 | { 343 | if (is_object($debug_object)) { $debug_object->debug_log(2, "Current tag is: " . $returnDom->tag); } 344 | 345 | if ($returnDom->tag == $tag) 346 | { 347 | break; 348 | } 349 | $returnDom = $returnDom->parent; 350 | } 351 | return $returnDom; 352 | } 353 | 354 | // get dom node's inner html 355 | function innertext() 356 | { 357 | if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER]; 358 | if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 359 | 360 | $ret = ''; 361 | foreach ($this->nodes as $n) 362 | $ret .= $n->outertext(); 363 | return $ret; 364 | } 365 | 366 | // get dom node's outer text (with tag) 367 | function outertext() 368 | { 369 | global $debug_object; 370 | if (is_object($debug_object)) 371 | { 372 | $text = ''; 373 | if ($this->tag == 'text') 374 | { 375 | if (!empty($this->text)) 376 | { 377 | $text = " with text: " . $this->text; 378 | } 379 | } 380 | $debug_object->debug_log(1, 'Innertext of tag: ' . $this->tag . 
$text); 381 | } 382 | 383 | if ($this->tag==='root') return $this->innertext(); 384 | 385 | // trigger callback 386 | if ($this->dom && $this->dom->callback!==null) 387 | { 388 | call_user_func_array($this->dom->callback, array($this)); 389 | } 390 | 391 | if (isset($this->_[HDOM_INFO_OUTER])) return $this->_[HDOM_INFO_OUTER]; 392 | if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 393 | 394 | // render begin tag 395 | if ($this->dom && $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]) 396 | { 397 | $ret = $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]->makeup(); 398 | } else { 399 | $ret = ""; 400 | } 401 | 402 | // render inner text 403 | if (isset($this->_[HDOM_INFO_INNER])) 404 | { 405 | // If it's a br tag... don't return the HDOM_INNER_INFO that we may or may not have added. 406 | if ($this->tag != "br") 407 | { 408 | $ret .= $this->_[HDOM_INFO_INNER]; 409 | } 410 | } else { 411 | if ($this->nodes) 412 | { 413 | foreach ($this->nodes as $n) 414 | { 415 | $ret .= $this->convert_text($n->outertext()); 416 | } 417 | } 418 | } 419 | 420 | // render end tag 421 | if (isset($this->_[HDOM_INFO_END]) && $this->_[HDOM_INFO_END]!=0) 422 | $ret .= '</'.$this->tag.'>'; 423 | return $ret; 424 | } 425 | 426 | // get dom node's plain text 427 | function text() 428 | { 429 | if (isset($this->_[HDOM_INFO_INNER])) return $this->_[HDOM_INFO_INNER]; 430 | switch ($this->nodetype) 431 | { 432 | case HDOM_TYPE_TEXT: return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 433 | case HDOM_TYPE_COMMENT: return ''; 434 | case HDOM_TYPE_UNKNOWN: return ''; 435 | } 436 | if (strcasecmp($this->tag, 'script')===0) return ''; 437 | if (strcasecmp($this->tag, 'style')===0) return ''; 438 | 439 | $ret = ''; 440 | // In rare cases, (always node type 1 or HDOM_TYPE_ELEMENT - observed for some span tags, and some p tags) $this->nodes is set to NULL. 441 | // NOTE: This indicates that there is a problem where it's set to NULL without a clear happening. 
442 | // WHY is this happening? 443 | if (!is_null($this->nodes)) 444 | { 445 | foreach ($this->nodes as $n) 446 | { 447 | $ret .= $this->convert_text($n->text()); 448 | } 449 | 450 | // If this node is a span... add a space at the end of it so multiple spans don't run into each other. This is plaintext after all. 451 | if ($this->tag == "span") 452 | { 453 | $ret .= $this->dom->default_span_text; 454 | } 455 | 456 | 457 | } 458 | return $ret; 459 | } 460 | 461 | function xmltext() 462 | { 463 | $ret = $this->innertext(); 464 | $ret = str_ireplace('<![CDATA[', '', $ret); 465 | $ret = str_replace(']]>', '', $ret); 466 | return $ret; 467 | } 468 | 469 | // build node's text with tag 470 | function makeup() 471 | { 472 | // text, comment, unknown 473 | if (isset($this->_[HDOM_INFO_TEXT])) return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]); 474 | 475 | $ret = '<'.$this->tag; 476 | $i = -1; 477 | 478 | foreach ($this->attr as $key=>$val) 479 | { 480 | ++$i; 481 | 482 | // skip removed attribute 483 | if ($val===null || $val===false) 484 | continue; 485 | 486 | $ret .= $this->_[HDOM_INFO_SPACE][$i][0]; 487 | //no value attr: nowrap, checked selected... 488 | if ($val===true) 489 | $ret .= $key; 490 | else { 491 | switch ($this->_[HDOM_INFO_QUOTE][$i]) 492 | { 493 | case HDOM_QUOTE_DOUBLE: $quote = '"'; break; 494 | case HDOM_QUOTE_SINGLE: $quote = '\''; break; 495 | default: $quote = ''; 496 | } 497 | $ret .= $key.$this->_[HDOM_INFO_SPACE][$i][1].'='.$this->_[HDOM_INFO_SPACE][$i][2].$quote.$val.$quote; 498 | } 499 | } 500 | $ret = $this->dom->restore_noise($ret); 501 | return $ret . $this->_[HDOM_INFO_ENDSPACE] . '>'; 502 | } 503 | 504 | // find elements by css selector 505 | //PaperG - added ability for find to lowercase the value of the selector. 
506 | function find($selector, $idx=null, $lowercase=false) 507 | { 508 | $selectors = $this->parse_selector($selector); 509 | if (($count=count($selectors))===0) return array(); 510 | $found_keys = array(); 511 | 512 | // find each selector 513 | for ($c=0; $c<$count; ++$c) 514 | { 515 | // The change on the below line was documented on the sourceforge code tracker id 2788009 516 | // used to be: if (($levle=count($selectors[0]))===0) return array(); 517 | if (($levle=count($selectors[$c]))===0) return array(); 518 | if (!isset($this->_[HDOM_INFO_BEGIN])) return array(); 519 | 520 | $head = array($this->_[HDOM_INFO_BEGIN]=>1); 521 | 522 | // handle descendant selectors, no recursive! 523 | for ($l=0; $l<$levle; ++$l) 524 | { 525 | $ret = array(); 526 | foreach ($head as $k=>$v) 527 | { 528 | $n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k]; 529 | //PaperG - Pass this optional parameter on to the seek function. 530 | $n->seek($selectors[$c][$l], $ret, $lowercase); 531 | } 532 | $head = $ret; 533 | } 534 | 535 | foreach ($head as $k=>$v) 536 | { 537 | if (!isset($found_keys[$k])) 538 | { 539 | $found_keys[$k] = 1; 540 | } 541 | } 542 | } 543 | 544 | // sort keys 545 | ksort($found_keys); 546 | 547 | $found = array(); 548 | foreach ($found_keys as $k=>$v) 549 | $found[] = $this->dom->nodes[$k]; 550 | 551 | // return nth-element or array 552 | if (is_null($idx)) return $found; 553 | else if ($idx<0) $idx = count($found) + $idx; 554 | return (isset($found[$idx])) ? $found[$idx] : null; 555 | } 556 | 557 | // seek for given conditions 558 | // PaperG - added parameter to allow for case insensitive testing of the value of a selector. 
559 | protected function seek($selector, &$ret, $lowercase=false) 560 | { 561 | global $debug_object; 562 | if (is_object($debug_object)) { $debug_object->debug_log_entry(1); } 563 | 564 | list($tag, $key, $val, $exp, $no_key) = $selector; 565 | 566 | // xpath index 567 | if ($tag && $key && is_numeric($key)) 568 | { 569 | $count = 0; 570 | foreach ($this->children as $c) 571 | { 572 | if ($tag==='*' || $tag===$c->tag) { 573 | if (++$count==$key) { 574 | $ret[$c->_[HDOM_INFO_BEGIN]] = 1; 575 | return; 576 | } 577 | } 578 | } 579 | return; 580 | } 581 | 582 | $end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0; 583 | if ($end==0) { 584 | $parent = $this->parent; 585 | while (!isset($parent->_[HDOM_INFO_END]) && $parent!==null) { 586 | $end -= 1; 587 | $parent = $parent->parent; 588 | } 589 | $end += $parent->_[HDOM_INFO_END]; 590 | } 591 | 592 | for ($i=$this->_[HDOM_INFO_BEGIN]+1; $i<$end; ++$i) { 593 | $node = $this->dom->nodes[$i]; 594 | 595 | $pass = true; 596 | 597 | if ($tag==='*' && !$key) { 598 | if (in_array($node, $this->children, true)) 599 | $ret[$i] = 1; 600 | continue; 601 | } 602 | 603 | // compare tag 604 | if ($tag && $tag!=$node->tag && $tag!=='*') {$pass=false;} 605 | // compare key 606 | if ($pass && $key) { 607 | if ($no_key) { 608 | if (isset($node->attr[$key])) $pass=false; 609 | } else { 610 | if (($key != "plaintext") && !isset($node->attr[$key])) $pass=false; 611 | } 612 | } 613 | // compare value 614 | if ($pass && $key && $val && $val!=='*') { 615 | // If they have told us that this is a "plaintext" search then we want the plaintext of the node - right? 616 | if ($key == "plaintext") { 617 | // $node->plaintext actually returns $node->text(); 618 | $nodeKeyValue = $node->text(); 619 | } else { 620 | // this is a normal search, we want the value of that attribute of the tag. 621 | $nodeKeyValue = $node->attr[$key]; 622 | } 623 | if (is_object($debug_object)) {$debug_object->debug_log(2, "testing node: " . $node->tag . 
" for attribute: " . $key . $exp . $val . " where nodes value is: " . $nodeKeyValue);} 624 | 625 | //PaperG - If lowercase is set, do a case insensitive test of the value of the selector. 626 | if ($lowercase) { 627 | $check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue)); 628 | } else { 629 | $check = $this->match($exp, $val, $nodeKeyValue); 630 | } 631 | if (is_object($debug_object)) {$debug_object->debug_log(2, "after match: " . ($check ? "true" : "false"));} 632 | 633 | // handle multiple class 634 | if (!$check && strcasecmp($key, 'class')===0) { 635 | foreach (explode(' ',$node->attr[$key]) as $k) { 636 | // Without this, there were cases where leading, trailing, or double spaces lead to our comparing blanks - bad form. 637 | if (!empty($k)) { 638 | if ($lowercase) { 639 | $check = $this->match($exp, strtolower($val), strtolower($k)); 640 | } else { 641 | $check = $this->match($exp, $val, $k); 642 | } 643 | if ($check) break; 644 | } 645 | } 646 | } 647 | if (!$check) $pass = false; 648 | } 649 | if ($pass) $ret[$i] = 1; 650 | unset($node); 651 | } 652 | // It's passed by reference so this is actually what this function returns. 
653 | if (is_object($debug_object)) {$debug_object->debug_log(1, "EXIT - ret: ", $ret);} 654 | } 655 | 656 | protected function match($exp, $pattern, $value) { 657 | global $debug_object; 658 | if (is_object($debug_object)) {$debug_object->debug_log_entry(1);} 659 | 660 | switch ($exp) { 661 | case '=': 662 | return ($value===$pattern); 663 | case '!=': 664 | return ($value!==$pattern); 665 | case '^=': 666 | return preg_match("/^".preg_quote($pattern,'/')."/", $value); 667 | case '$=': 668 | return preg_match("/".preg_quote($pattern,'/')."$/", $value); 669 | case '*=': 670 | if ($pattern[0]=='/') { 671 | return preg_match($pattern, $value); 672 | } 673 | return preg_match("/".$pattern."/i", $value); 674 | } 675 | return false; 676 | } 677 | 678 | protected function parse_selector($selector_string) { 679 | global $debug_object; 680 | if (is_object($debug_object)) {$debug_object->debug_log_entry(1);} 681 | 682 | // pattern of CSS selectors, modified from mootools 683 | // Paperg: Add the colon to the attrbute, so that it properly finds like google does. 684 | // Note: if you try to look at this attribute, yo MUST use getAttribute since $dom->x:y will fail the php syntax check. 685 | // Notice the \[ starting the attbute? and the @? following? This implies that an attribute can begin with an @ sign that is not captured. 686 | // This implies that an html attribute specifier may start with an @ sign that is NOT captured by the expression. 687 | // farther study is required to determine of this should be documented or removed. 688 | // $pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is"; 689 | $pattern = "/([\w-:\*]*)(?:\#([\w-]+)|\.([\w-]+))?(?:\[@?(!?[\w-:]+)(?:([!*^$]?=)[\"']?(.*?)[\"']?)?\])?([\/, ]+)/is"; 690 | preg_match_all($pattern, trim($selector_string).' 
', $matches, PREG_SET_ORDER); 691 | if (is_object($debug_object)) {$debug_object->debug_log(2, "Matches Array: ", $matches);} 692 | 693 | $selectors = array(); 694 | $result = array(); 695 | //print_r($matches); 696 | 697 | foreach ($matches as $m) { 698 | $m[0] = trim($m[0]); 699 | if ($m[0]==='' || $m[0]==='/' || $m[0]==='//') continue; 700 | // for browser generated xpath 701 | if ($m[1]==='tbody') continue; 702 | 703 | list($tag, $key, $val, $exp, $no_key) = array($m[1], null, null, '=', false); 704 | if (!empty($m[2])) {$key='id'; $val=$m[2];} 705 | if (!empty($m[3])) {$key='class'; $val=$m[3];} 706 | if (!empty($m[4])) {$key=$m[4];} 707 | if (!empty($m[5])) {$exp=$m[5];} 708 | if (!empty($m[6])) {$val=$m[6];} 709 | 710 | // convert to lowercase 711 | if ($this->dom->lowercase) {$tag=strtolower($tag); $key=strtolower($key);} 712 | //elements that do NOT have the specified attribute 713 | if (isset($key[0]) && $key[0]==='!') {$key=substr($key, 1); $no_key=true;} 714 | 715 | $result[] = array($tag, $key, $val, $exp, $no_key); 716 | if (trim($m[7])===',') { 717 | $selectors[] = $result; 718 | $result = array(); 719 | } 720 | } 721 | if (count($result)>0) 722 | $selectors[] = $result; 723 | return $selectors; 724 | } 725 | 726 | function __get($name) 727 | { 728 | if (isset($this->attr[$name])) 729 | { 730 | return $this->convert_text($this->attr[$name]); 731 | } 732 | switch ($name) 733 | { 734 | case 'outertext': return $this->outertext(); 735 | case 'innertext': return $this->innertext(); 736 | case 'plaintext': return $this->text(); 737 | case 'xmltext': return $this->xmltext(); 738 | default: return array_key_exists($name, $this->attr); 739 | } 740 | } 741 | 742 | function __set($name, $value) 743 | { 744 | global $debug_object; 745 | if (is_object($debug_object)) {$debug_object->debug_log_entry(1);} 746 | 747 | switch ($name) 748 | { 749 | case 'outertext': return $this->_[HDOM_INFO_OUTER] = $value; 750 | case 'innertext': 751 | if 
(isset($this->_[HDOM_INFO_TEXT])) return $this->_[HDOM_INFO_TEXT] = $value; 752 | return $this->_[HDOM_INFO_INNER] = $value; 753 | } 754 | if (!isset($this->attr[$name])) 755 | { 756 | $this->_[HDOM_INFO_SPACE][] = array(' ', '', ''); 757 | $this->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE; 758 | } 759 | $this->attr[$name] = $value; 760 | } 761 | 762 | function __isset($name) 763 | { 764 | switch ($name) 765 | { 766 | case 'outertext': return true; 767 | case 'innertext': return true; 768 | case 'plaintext': return true; 769 | } 770 | //no value attr: nowrap, checked selected... 771 | return (array_key_exists($name, $this->attr)) ? true : isset($this->attr[$name]); 772 | } 773 | 774 | function __unset($name) { 775 | if (isset($this->attr[$name])) 776 | unset($this->attr[$name]); 777 | } 778 | 779 | // PaperG - Function to convert the text from one character set to another if the two sets are not the same. 780 | function convert_text($text) 781 | { 782 | global $debug_object; 783 | if (is_object($debug_object)) {$debug_object->debug_log_entry(1);} 784 | 785 | $converted_text = $text; 786 | 787 | $sourceCharset = ""; 788 | $targetCharset = ""; 789 | 790 | if ($this->dom) 791 | { 792 | $sourceCharset = strtoupper($this->dom->_charset); 793 | $targetCharset = strtoupper($this->dom->_target_charset); 794 | } 795 | if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . 
$targetCharset);} 796 | 797 | if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0)) 798 | { 799 | // Check if the reported encoding could have been incorrect and the text is actually already UTF-8 800 | if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text))) 801 | { 802 | $converted_text = $text; 803 | } 804 | else 805 | { 806 | $converted_text = iconv($sourceCharset, $targetCharset, $text); 807 | } 808 | } 809 | 810 | // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output. 811 | if ($targetCharset == 'UTF-8') 812 | { 813 | if (substr($converted_text, 0, 3) == "\xef\xbb\xbf") 814 | { 815 | $converted_text = substr($converted_text, 3); 816 | } 817 | if (substr($converted_text, -3) == "\xef\xbb\xbf") 818 | { 819 | $converted_text = substr($converted_text, 0, -3); 820 | } 821 | } 822 | 823 | return $converted_text; 824 | } 825 | 826 | /** 827 | * Returns true if $string is valid UTF-8 and false otherwise. 
828 | * 829 | * @param mixed $str String to be tested 830 | * @return boolean 831 | */ 832 | static function is_utf8($str) 833 | { 834 | $c=0; $b=0; 835 | $bits=0; 836 | $len=strlen($str); 837 | for($i=0; $i<$len; $i++) 838 | { 839 | $c=ord($str[$i]); 840 | if($c >= 128) 841 | { 842 | if(($c >= 254)) return false; 843 | elseif($c >= 252) $bits=6; 844 | elseif($c >= 248) $bits=5; 845 | elseif($c >= 240) $bits=4; 846 | elseif($c >= 224) $bits=3; 847 | elseif($c >= 192) $bits=2; 848 | else return false; 849 | if(($i+$bits) > $len) return false; 850 | while($bits > 1) 851 | { 852 | $i++; 853 | $b=ord($str[$i]); 854 | if($b < 128 || $b > 191) return false; 855 | $bits--; 856 | } 857 | } 858 | } 859 | return true; 860 | } 861 | /* 862 | function is_utf8($string) 863 | { 864 | //this is buggy 865 | return (utf8_encode(utf8_decode($string)) == $string); 866 | } 867 | */ 868 | 869 | /** 870 | * Function to try a few tricks to determine the displayed size of an img on the page. 871 | * NOTE: This will ONLY work on an IMG tag. Returns FALSE on all other tag types. 872 | * 873 | * @author John Schlick 874 | * @version April 19 2012 875 | * @return array an array containing the 'height' and 'width' of the image on the page or -1 if we can't figure it out. 876 | */ 877 | function get_display_size() 878 | { 879 | global $debug_object; 880 | 881 | $width = -1; 882 | $height = -1; 883 | 884 | if ($this->tag !== 'img') 885 | { 886 | return false; 887 | } 888 | 889 | // See if there is a height or width attribute in the tag itself. 890 | if (isset($this->attr['width'])) 891 | { 892 | $width = $this->attr['width']; 893 | } 894 | 895 | if (isset($this->attr['height'])) 896 | { 897 | $height = $this->attr['height']; 898 | } 899 | 900 | // Now look for an inline style. 901 | if (isset($this->attr['style'])) 902 | { 903 | // Thanks to user gnarf from stackoverflow for this regular expression. 
904 | $attributes = array(); 905 | preg_match_all("/([\w-]+)\s*:\s*([^;]+)\s*;?/", $this->attr['style'], $matches, PREG_SET_ORDER); 906 | foreach ($matches as $match) { 907 | $attributes[$match[1]] = $match[2]; 908 | } 909 | 910 | // If there is a width in the style attributes: 911 | if (isset($attributes['width']) && $width == -1) 912 | { 913 | // check that the last two characters are px (pixels) 914 | if (strtolower(substr($attributes['width'], -2)) == 'px') 915 | { 916 | $proposed_width = substr($attributes['width'], 0, -2); 917 | // Now make sure that it's an integer and not something stupid. 918 | if (filter_var($proposed_width, FILTER_VALIDATE_INT)) 919 | { 920 | $width = $proposed_width; 921 | } 922 | } 923 | } 924 | 925 | // If there is a height in the style attributes: 926 | if (isset($attributes['height']) && $height == -1) 927 | { 928 | // check that the last two characters are px (pixels) 929 | if (strtolower(substr($attributes['height'], -2)) == 'px') 930 | { 931 | $proposed_height = substr($attributes['height'], 0, -2); 932 | // Now make sure that it's an integer and not something stupid. 933 | if (filter_var($proposed_height, FILTER_VALIDATE_INT)) 934 | { 935 | $height = $proposed_height; 936 | } 937 | } 938 | } 939 | 940 | } 941 | 942 | // Future enhancement: 943 | // Look in the tag to see if there is a class or id specified that has a height or width attribute to it. 944 | 945 | // Far future enhancement 946 | // Look at all the parent tags of this image to see if they specify a class or id that has an img selector that specifies a height or width 947 | // Note that in this case, the class or id will have the img subselector for it to apply to the image. 948 | 949 | // ridiculously far future development 950 | // If the class or id is specified in a SEPARATE css file that's not on the page, go get it and do what we were just doing for the ones on the page. 
951 | 952 | $result = array('height' => $height, 953 | 'width' => $width); 954 | return $result; 955 | } 956 | 957 | // camel naming conventions 958 | function getAllAttributes() {return $this->attr;} 959 | function getAttribute($name) {return $this->__get($name);} 960 | function setAttribute($name, $value) {$this->__set($name, $value);} 961 | function hasAttribute($name) {return $this->__isset($name);} 962 | function removeAttribute($name) {$this->__set($name, null);} 963 | function getElementById($id) {return $this->find("#$id", 0);} 964 | function getElementsById($id, $idx=null) {return $this->find("#$id", $idx);} 965 | function getElementByTagName($name) {return $this->find($name, 0);} 966 | function getElementsByTagName($name, $idx=null) {return $this->find($name, $idx);} 967 | function parentNode() {return $this->parent();} 968 | function childNodes($idx=-1) {return $this->children($idx);} 969 | function firstChild() {return $this->first_child();} 970 | function lastChild() {return $this->last_child();} 971 | function nextSibling() {return $this->next_sibling();} 972 | function previousSibling() {return $this->prev_sibling();} 973 | function hasChildNodes() {return $this->has_child();} 974 | function nodeName() {return $this->tag;} 975 | function appendChild($node) {$node->parent($this); return $node;} 976 | 977 | } 978 | 979 | /** 980 | * simple html dom parser 981 | * Paperg - in the find routine: allow us to specify that we want case insensitive testing of the value of the selector. 982 | * Paperg - change $size from protected to public so we can easily access it 983 | * Paperg - added ForceTagsClosed in the constructor which tells us whether we trust the html or not. Default is to NOT trust it. 
984 | * 985 | * @package PlaceLocalInclude 986 | */ 987 | class simple_html_dom 988 | { 989 | public $root = null; 990 | public $nodes = array(); 991 | public $callback = null; 992 | public $lowercase = false; 993 | // Used to keep track of how large the text was when we started. 994 | public $original_size; 995 | public $size; 996 | protected $pos; 997 | protected $doc; 998 | protected $char; 999 | protected $cursor; 1000 | protected $parent; 1001 | protected $noise = array(); 1002 | protected $token_blank = " \t\r\n"; 1003 | protected $token_equal = ' =/>'; 1004 | protected $token_slash = " />\r\n\t"; 1005 | protected $token_attr = ' >'; 1006 | // Note that this is referenced by a child node, and so it needs to be public for that node to see this information. 1007 | public $_charset = ''; 1008 | public $_target_charset = ''; 1009 | protected $default_br_text = ""; 1010 | public $default_span_text = ""; 1011 | 1012 | // use isset instead of in_array, performance boost about 30%... 1013 | protected $self_closing_tags = array('img'=>1, 'br'=>1, 'input'=>1, 'meta'=>1, 'link'=>1, 'hr'=>1, 'base'=>1, 'embed'=>1, 'spacer'=>1); 1014 | protected $block_tags = array('root'=>1, 'body'=>1, 'form'=>1, 'div'=>1, 'span'=>1, 'table'=>1); 1015 | // Known sourceforge issue #2977341 1016 | // B tags that are not closed cause us to return everything to the end of the document. 
1017 | protected $optional_closing_tags = array( 1018 | 'tr'=>array('tr'=>1, 'td'=>1, 'th'=>1), 1019 | 'th'=>array('th'=>1), 1020 | 'td'=>array('td'=>1), 1021 | 'li'=>array('li'=>1), 1022 | 'dt'=>array('dt'=>1, 'dd'=>1), 1023 | 'dd'=>array('dd'=>1, 'dt'=>1), 1024 | 'dl'=>array('dd'=>1, 'dt'=>1), 1025 | 'p'=>array('p'=>1), 1026 | 'nobr'=>array('nobr'=>1), 1027 | 'b'=>array('b'=>1), 1028 | 'option'=>array('option'=>1), 1029 | ); 1030 | 1031 | function __construct($str=null, $lowercase=true, $forceTagsClosed=true, $target_charset=DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 1032 | { 1033 | if ($str) 1034 | { 1035 | if (preg_match("/^http:\/\//i",$str) || is_file($str)) 1036 | { 1037 | $this->load_file($str); 1038 | } 1039 | else 1040 | { 1041 | $this->load($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText); 1042 | } 1043 | } 1044 | // Forcing tags to be closed implies that we don't trust the html, but it can lead to parsing errors if we SHOULD trust the html. 1045 | if (!$forceTagsClosed) { 1046 | $this->optional_closing_tags=array(); 1047 | } 1048 | $this->_target_charset = $target_charset; 1049 | } 1050 | 1051 | function __destruct() 1052 | { 1053 | $this->clear(); 1054 | } 1055 | 1056 | // load html from string 1057 | function load($str, $lowercase=true, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT) 1058 | { 1059 | global $debug_object; 1060 | 1061 | // prepare 1062 | $this->prepare($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText); 1063 | // strip out cdata 1064 | $this->remove_noise("'<!\[CDATA\[(.*?)\]\]>'is", true); 1065 | // strip out comments 1066 | $this->remove_noise("'<!--(.*?)-->'is"); 1067 | // Per sourceforge http://sourceforge.net/tracker/?func=detail&aid=2949097&group_id=218559&atid=1044037 1068 | // Script tags removal now precedes style tag removal. 1069 | // strip out