├── .gitignore ├── LICENSE ├── README.md ├── pystemon.py ├── pystemon.yaml ├── requirements.txt └── user-agents.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[co] 2 | 3 | # Packages 4 | *.egg 5 | *.egg-info 6 | dist 7 | build 8 | eggs 9 | parts 10 | bin 11 | var 12 | sdist 13 | develop-eggs 14 | .installed.cfg 15 | 16 | # Installer logs 17 | pip-log.txt 18 | 19 | # Unit test / coverage reports 20 | .coverage 21 | .tox 22 | 23 | # Project specifics 24 | /.project 25 | /.pydevproject 26 | /archive 27 | /alerts 28 | 29 | proxies.txt 30 | pystemon.yaml 31 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU AFFERO GENERAL PUBLIC LICENSE 2 | Version 3, 19 November 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | Preamble 9 | 10 | The GNU Affero General Public License is a free, copyleft license for 11 | software and other kinds of works, specifically designed to ensure 12 | cooperation with the community in the case of network server software. 13 | 14 | The licenses for most software and other practical works are designed 15 | to take away your freedom to share and change the works. By contrast, 16 | our General Public Licenses are intended to guarantee your freedom to 17 | share and change all versions of a program--to make sure it remains free 18 | software for all its users. 19 | 20 | When we speak of free software, we are referring to freedom, not 21 | price. 
Our General Public Licenses are designed to make sure that you 22 | have the freedom to distribute copies of free software (and charge for 23 | them if you wish), that you receive source code or can get it if you 24 | want it, that you can change the software or use pieces of it in new 25 | free programs, and that you know you can do these things. 26 | 27 | Developers that use our General Public Licenses protect your rights 28 | with two steps: (1) assert copyright on the software, and (2) offer 29 | you this License which gives you legal permission to copy, distribute 30 | and/or modify the software. 31 | 32 | A secondary benefit of defending all users' freedom is that 33 | improvements made in alternate versions of the program, if they 34 | receive widespread use, become available for other developers to 35 | incorporate. Many developers of free software are heartened and 36 | encouraged by the resulting cooperation. However, in the case of 37 | software used on network servers, this result may fail to come about. 38 | The GNU General Public License permits making a modified version and 39 | letting the public access it on a server without ever releasing its 40 | source code to the public. 41 | 42 | The GNU Affero General Public License is designed specifically to 43 | ensure that, in such cases, the modified source code becomes available 44 | to the community. It requires the operator of a network server to 45 | provide the source code of the modified version running there to the 46 | users of that server. Therefore, public use of a modified version, on 47 | a publicly accessible server, gives the public access to the source 48 | code of the modified version. 49 | 50 | An older license, called the Affero General Public License and 51 | published by Affero, was designed to accomplish similar goals. 
This is 52 | a different license, not a version of the Affero GPL, but Affero has 53 | released a new version of the Affero GPL which permits relicensing under 54 | this license. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | TERMS AND CONDITIONS 60 | 61 | 0. Definitions. 62 | 63 | "This License" refers to version 3 of the GNU Affero General Public License. 64 | 65 | "Copyright" also means copyright-like laws that apply to other kinds of 66 | works, such as semiconductor masks. 67 | 68 | "The Program" refers to any copyrightable work licensed under this 69 | License. Each licensee is addressed as "you". "Licensees" and 70 | "recipients" may be individuals or organizations. 71 | 72 | To "modify" a work means to copy from or adapt all or part of the work 73 | in a fashion requiring copyright permission, other than the making of an 74 | exact copy. The resulting work is called a "modified version" of the 75 | earlier work or a work "based on" the earlier work. 76 | 77 | A "covered work" means either the unmodified Program or a work based 78 | on the Program. 79 | 80 | To "propagate" a work means to do anything with it that, without 81 | permission, would make you directly or secondarily liable for 82 | infringement under applicable copyright law, except executing it on a 83 | computer or modifying a private copy. Propagation includes copying, 84 | distribution (with or without modification), making available to the 85 | public, and in some countries other activities as well. 86 | 87 | To "convey" a work means any kind of propagation that enables other 88 | parties to make or receive copies. Mere interaction with a user through 89 | a computer network, with no transfer of a copy, is not conveying. 
90 | 91 | An interactive user interface displays "Appropriate Legal Notices" 92 | to the extent that it includes a convenient and prominently visible 93 | feature that (1) displays an appropriate copyright notice, and (2) 94 | tells the user that there is no warranty for the work (except to the 95 | extent that warranties are provided), that licensees may convey the 96 | work under this License, and how to view a copy of this License. If 97 | the interface presents a list of user commands or options, such as a 98 | menu, a prominent item in the list meets this criterion. 99 | 100 | 1. Source Code. 101 | 102 | The "source code" for a work means the preferred form of the work 103 | for making modifications to it. "Object code" means any non-source 104 | form of a work. 105 | 106 | A "Standard Interface" means an interface that either is an official 107 | standard defined by a recognized standards body, or, in the case of 108 | interfaces specified for a particular programming language, one that 109 | is widely used among developers working in that language. 110 | 111 | The "System Libraries" of an executable work include anything, other 112 | than the work as a whole, that (a) is included in the normal form of 113 | packaging a Major Component, but which is not part of that Major 114 | Component, and (b) serves only to enable use of the work with that 115 | Major Component, or to implement a Standard Interface for which an 116 | implementation is available to the public in source code form. A 117 | "Major Component", in this context, means a major essential component 118 | (kernel, window system, and so on) of the specific operating system 119 | (if any) on which the executable work runs, or a compiler used to 120 | produce the work, or an object code interpreter used to run it. 
121 | 122 | The "Corresponding Source" for a work in object code form means all 123 | the source code needed to generate, install, and (for an executable 124 | work) run the object code and to modify the work, including scripts to 125 | control those activities. However, it does not include the work's 126 | System Libraries, or general-purpose tools or generally available free 127 | programs which are used unmodified in performing those activities but 128 | which are not part of the work. For example, Corresponding Source 129 | includes interface definition files associated with source files for 130 | the work, and the source code for shared libraries and dynamically 131 | linked subprograms that the work is specifically designed to require, 132 | such as by intimate data communication or control flow between those 133 | subprograms and other parts of the work. 134 | 135 | The Corresponding Source need not include anything that users 136 | can regenerate automatically from other parts of the Corresponding 137 | Source. 138 | 139 | The Corresponding Source for a work in source code form is that 140 | same work. 141 | 142 | 2. Basic Permissions. 143 | 144 | All rights granted under this License are granted for the term of 145 | copyright on the Program, and are irrevocable provided the stated 146 | conditions are met. This License explicitly affirms your unlimited 147 | permission to run the unmodified Program. The output from running a 148 | covered work is covered by this License only if the output, given its 149 | content, constitutes a covered work. This License acknowledges your 150 | rights of fair use or other equivalent, as provided by copyright law. 151 | 152 | You may make, run and propagate covered works that you do not 153 | convey, without conditions so long as your license otherwise remains 154 | in force. 
You may convey covered works to others for the sole purpose 155 | of having them make modifications exclusively for you, or provide you 156 | with facilities for running those works, provided that you comply with 157 | the terms of this License in conveying all material for which you do 158 | not control copyright. Those thus making or running the covered works 159 | for you must do so exclusively on your behalf, under your direction 160 | and control, on terms that prohibit them from making any copies of 161 | your copyrighted material outside their relationship with you. 162 | 163 | Conveying under any other circumstances is permitted solely under 164 | the conditions stated below. Sublicensing is not allowed; section 10 165 | makes it unnecessary. 166 | 167 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 168 | 169 | No covered work shall be deemed part of an effective technological 170 | measure under any applicable law fulfilling obligations under article 171 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or 172 | similar laws prohibiting or restricting circumvention of such 173 | measures. 174 | 175 | When you convey a covered work, you waive any legal power to forbid 176 | circumvention of technological measures to the extent such circumvention 177 | is effected by exercising rights under this License with respect to 178 | the covered work, and you disclaim any intention to limit operation or 179 | modification of the work as a means of enforcing, against the work's 180 | users, your or third parties' legal rights to forbid circumvention of 181 | technological measures. 182 | 183 | 4. Conveying Verbatim Copies. 
184 | 185 | You may convey verbatim copies of the Program's source code as you 186 | receive it, in any medium, provided that you conspicuously and 187 | appropriately publish on each copy an appropriate copyright notice; 188 | keep intact all notices stating that this License and any 189 | non-permissive terms added in accord with section 7 apply to the code; 190 | keep intact all notices of the absence of any warranty; and give all 191 | recipients a copy of this License along with the Program. 192 | 193 | You may charge any price or no price for each copy that you convey, 194 | and you may offer support or warranty protection for a fee. 195 | 196 | 5. Conveying Modified Source Versions. 197 | 198 | You may convey a work based on the Program, or the modifications to 199 | produce it from the Program, in the form of source code under the 200 | terms of section 4, provided that you also meet all of these conditions: 201 | 202 | a) The work must carry prominent notices stating that you modified 203 | it, and giving a relevant date. 204 | 205 | b) The work must carry prominent notices stating that it is 206 | released under this License and any conditions added under section 207 | 7. This requirement modifies the requirement in section 4 to 208 | "keep intact all notices". 209 | 210 | c) You must license the entire work, as a whole, under this 211 | License to anyone who comes into possession of a copy. This 212 | License will therefore apply, along with any applicable section 7 213 | additional terms, to the whole of the work, and all its parts, 214 | regardless of how they are packaged. This License gives no 215 | permission to license the work in any other way, but it does not 216 | invalidate such permission if you have separately received it. 
217 | 218 | d) If the work has interactive user interfaces, each must display 219 | Appropriate Legal Notices; however, if the Program has interactive 220 | interfaces that do not display Appropriate Legal Notices, your 221 | work need not make them do so. 222 | 223 | A compilation of a covered work with other separate and independent 224 | works, which are not by their nature extensions of the covered work, 225 | and which are not combined with it such as to form a larger program, 226 | in or on a volume of a storage or distribution medium, is called an 227 | "aggregate" if the compilation and its resulting copyright are not 228 | used to limit the access or legal rights of the compilation's users 229 | beyond what the individual works permit. Inclusion of a covered work 230 | in an aggregate does not cause this License to apply to the other 231 | parts of the aggregate. 232 | 233 | 6. Conveying Non-Source Forms. 234 | 235 | You may convey a covered work in object code form under the terms 236 | of sections 4 and 5, provided that you also convey the 237 | machine-readable Corresponding Source under the terms of this License, 238 | in one of these ways: 239 | 240 | a) Convey the object code in, or embodied in, a physical product 241 | (including a physical distribution medium), accompanied by the 242 | Corresponding Source fixed on a durable physical medium 243 | customarily used for software interchange. 
244 | 245 | b) Convey the object code in, or embodied in, a physical product 246 | (including a physical distribution medium), accompanied by a 247 | written offer, valid for at least three years and valid for as 248 | long as you offer spare parts or customer support for that product 249 | model, to give anyone who possesses the object code either (1) a 250 | copy of the Corresponding Source for all the software in the 251 | product that is covered by this License, on a durable physical 252 | medium customarily used for software interchange, for a price no 253 | more than your reasonable cost of physically performing this 254 | conveying of source, or (2) access to copy the 255 | Corresponding Source from a network server at no charge. 256 | 257 | c) Convey individual copies of the object code with a copy of the 258 | written offer to provide the Corresponding Source. This 259 | alternative is allowed only occasionally and noncommercially, and 260 | only if you received the object code with such an offer, in accord 261 | with subsection 6b. 262 | 263 | d) Convey the object code by offering access from a designated 264 | place (gratis or for a charge), and offer equivalent access to the 265 | Corresponding Source in the same way through the same place at no 266 | further charge. You need not require recipients to copy the 267 | Corresponding Source along with the object code. If the place to 268 | copy the object code is a network server, the Corresponding Source 269 | may be on a different server (operated by you or a third party) 270 | that supports equivalent copying facilities, provided you maintain 271 | clear directions next to the object code saying where to find the 272 | Corresponding Source. Regardless of what server hosts the 273 | Corresponding Source, you remain obligated to ensure that it is 274 | available for as long as needed to satisfy these requirements. 
275 | 276 | e) Convey the object code using peer-to-peer transmission, provided 277 | you inform other peers where the object code and Corresponding 278 | Source of the work are being offered to the general public at no 279 | charge under subsection 6d. 280 | 281 | A separable portion of the object code, whose source code is excluded 282 | from the Corresponding Source as a System Library, need not be 283 | included in conveying the object code work. 284 | 285 | A "User Product" is either (1) a "consumer product", which means any 286 | tangible personal property which is normally used for personal, family, 287 | or household purposes, or (2) anything designed or sold for incorporation 288 | into a dwelling. In determining whether a product is a consumer product, 289 | doubtful cases shall be resolved in favor of coverage. For a particular 290 | product received by a particular user, "normally used" refers to a 291 | typical or common use of that class of product, regardless of the status 292 | of the particular user or of the way in which the particular user 293 | actually uses, or expects or is expected to use, the product. A product 294 | is a consumer product regardless of whether the product has substantial 295 | commercial, industrial or non-consumer uses, unless such uses represent 296 | the only significant mode of use of the product. 297 | 298 | "Installation Information" for a User Product means any methods, 299 | procedures, authorization keys, or other information required to install 300 | and execute modified versions of a covered work in that User Product from 301 | a modified version of its Corresponding Source. The information must 302 | suffice to ensure that the continued functioning of the modified object 303 | code is in no case prevented or interfered with solely because 304 | modification has been made. 
305 | 306 | If you convey an object code work under this section in, or with, or 307 | specifically for use in, a User Product, and the conveying occurs as 308 | part of a transaction in which the right of possession and use of the 309 | User Product is transferred to the recipient in perpetuity or for a 310 | fixed term (regardless of how the transaction is characterized), the 311 | Corresponding Source conveyed under this section must be accompanied 312 | by the Installation Information. But this requirement does not apply 313 | if neither you nor any third party retains the ability to install 314 | modified object code on the User Product (for example, the work has 315 | been installed in ROM). 316 | 317 | The requirement to provide Installation Information does not include a 318 | requirement to continue to provide support service, warranty, or updates 319 | for a work that has been modified or installed by the recipient, or for 320 | the User Product in which it has been modified or installed. Access to a 321 | network may be denied when the modification itself materially and 322 | adversely affects the operation of the network or violates the rules and 323 | protocols for communication across the network. 324 | 325 | Corresponding Source conveyed, and Installation Information provided, 326 | in accord with this section must be in a format that is publicly 327 | documented (and with an implementation available to the public in 328 | source code form), and must require no special password or key for 329 | unpacking, reading or copying. 330 | 331 | 7. Additional Terms. 332 | 333 | "Additional permissions" are terms that supplement the terms of this 334 | License by making exceptions from one or more of its conditions. 335 | Additional permissions that are applicable to the entire Program shall 336 | be treated as though they were included in this License, to the extent 337 | that they are valid under applicable law. 
If additional permissions 338 | apply only to part of the Program, that part may be used separately 339 | under those permissions, but the entire Program remains governed by 340 | this License without regard to the additional permissions. 341 | 342 | When you convey a copy of a covered work, you may at your option 343 | remove any additional permissions from that copy, or from any part of 344 | it. (Additional permissions may be written to require their own 345 | removal in certain cases when you modify the work.) You may place 346 | additional permissions on material, added by you to a covered work, 347 | for which you have or can give appropriate copyright permission. 348 | 349 | Notwithstanding any other provision of this License, for material you 350 | add to a covered work, you may (if authorized by the copyright holders of 351 | that material) supplement the terms of this License with terms: 352 | 353 | a) Disclaiming warranty or limiting liability differently from the 354 | terms of sections 15 and 16 of this License; or 355 | 356 | b) Requiring preservation of specified reasonable legal notices or 357 | author attributions in that material or in the Appropriate Legal 358 | Notices displayed by works containing it; or 359 | 360 | c) Prohibiting misrepresentation of the origin of that material, or 361 | requiring that modified versions of such material be marked in 362 | reasonable ways as different from the original version; or 363 | 364 | d) Limiting the use for publicity purposes of names of licensors or 365 | authors of the material; or 366 | 367 | e) Declining to grant rights under trademark law for use of some 368 | trade names, trademarks, or service marks; or 369 | 370 | f) Requiring indemnification of licensors and authors of that 371 | material by anyone who conveys the material (or modified versions of 372 | it) with contractual assumptions of liability to the recipient, for 373 | any liability that these contractual assumptions directly impose on 
374 | those licensors and authors. 375 | 376 | All other non-permissive additional terms are considered "further 377 | restrictions" within the meaning of section 10. If the Program as you 378 | received it, or any part of it, contains a notice stating that it is 379 | governed by this License along with a term that is a further 380 | restriction, you may remove that term. If a license document contains 381 | a further restriction but permits relicensing or conveying under this 382 | License, you may add to a covered work material governed by the terms 383 | of that license document, provided that the further restriction does 384 | not survive such relicensing or conveying. 385 | 386 | If you add terms to a covered work in accord with this section, you 387 | must place, in the relevant source files, a statement of the 388 | additional terms that apply to those files, or a notice indicating 389 | where to find the applicable terms. 390 | 391 | Additional terms, permissive or non-permissive, may be stated in the 392 | form of a separately written license, or stated as exceptions; 393 | the above requirements apply either way. 394 | 395 | 8. Termination. 396 | 397 | You may not propagate or modify a covered work except as expressly 398 | provided under this License. Any attempt otherwise to propagate or 399 | modify it is void, and will automatically terminate your rights under 400 | this License (including any patent licenses granted under the third 401 | paragraph of section 11). 402 | 403 | However, if you cease all violation of this License, then your 404 | license from a particular copyright holder is reinstated (a) 405 | provisionally, unless and until the copyright holder explicitly and 406 | finally terminates your license, and (b) permanently, if the copyright 407 | holder fails to notify you of the violation by some reasonable means 408 | prior to 60 days after the cessation. 
409 | 410 | Moreover, your license from a particular copyright holder is 411 | reinstated permanently if the copyright holder notifies you of the 412 | violation by some reasonable means, this is the first time you have 413 | received notice of violation of this License (for any work) from that 414 | copyright holder, and you cure the violation prior to 30 days after 415 | your receipt of the notice. 416 | 417 | Termination of your rights under this section does not terminate the 418 | licenses of parties who have received copies or rights from you under 419 | this License. If your rights have been terminated and not permanently 420 | reinstated, you do not qualify to receive new licenses for the same 421 | material under section 10. 422 | 423 | 9. Acceptance Not Required for Having Copies. 424 | 425 | You are not required to accept this License in order to receive or 426 | run a copy of the Program. Ancillary propagation of a covered work 427 | occurring solely as a consequence of using peer-to-peer transmission 428 | to receive a copy likewise does not require acceptance. However, 429 | nothing other than this License grants you permission to propagate or 430 | modify any covered work. These actions infringe copyright if you do 431 | not accept this License. Therefore, by modifying or propagating a 432 | covered work, you indicate your acceptance of this License to do so. 433 | 434 | 10. Automatic Licensing of Downstream Recipients. 435 | 436 | Each time you convey a covered work, the recipient automatically 437 | receives a license from the original licensors, to run, modify and 438 | propagate that work, subject to this License. You are not responsible 439 | for enforcing compliance by third parties with this License. 440 | 441 | An "entity transaction" is a transaction transferring control of an 442 | organization, or substantially all assets of one, or subdividing an 443 | organization, or merging organizations. 
If propagation of a covered 444 | work results from an entity transaction, each party to that 445 | transaction who receives a copy of the work also receives whatever 446 | licenses to the work the party's predecessor in interest had or could 447 | give under the previous paragraph, plus a right to possession of the 448 | Corresponding Source of the work from the predecessor in interest, if 449 | the predecessor has it or can get it with reasonable efforts. 450 | 451 | You may not impose any further restrictions on the exercise of the 452 | rights granted or affirmed under this License. For example, you may 453 | not impose a license fee, royalty, or other charge for exercise of 454 | rights granted under this License, and you may not initiate litigation 455 | (including a cross-claim or counterclaim in a lawsuit) alleging that 456 | any patent claim is infringed by making, using, selling, offering for 457 | sale, or importing the Program or any portion of it. 458 | 459 | 11. Patents. 460 | 461 | A "contributor" is a copyright holder who authorizes use under this 462 | License of the Program or a work on which the Program is based. The 463 | work thus licensed is called the contributor's "contributor version". 464 | 465 | A contributor's "essential patent claims" are all patent claims 466 | owned or controlled by the contributor, whether already acquired or 467 | hereafter acquired, that would be infringed by some manner, permitted 468 | by this License, of making, using, or selling its contributor version, 469 | but do not include claims that would be infringed only as a 470 | consequence of further modification of the contributor version. For 471 | purposes of this definition, "control" includes the right to grant 472 | patent sublicenses in a manner consistent with the requirements of 473 | this License. 
474 | 475 | Each contributor grants you a non-exclusive, worldwide, royalty-free 476 | patent license under the contributor's essential patent claims, to 477 | make, use, sell, offer for sale, import and otherwise run, modify and 478 | propagate the contents of its contributor version. 479 | 480 | In the following three paragraphs, a "patent license" is any express 481 | agreement or commitment, however denominated, not to enforce a patent 482 | (such as an express permission to practice a patent or covenant not to 483 | sue for patent infringement). To "grant" such a patent license to a 484 | party means to make such an agreement or commitment not to enforce a 485 | patent against the party. 486 | 487 | If you convey a covered work, knowingly relying on a patent license, 488 | and the Corresponding Source of the work is not available for anyone 489 | to copy, free of charge and under the terms of this License, through a 490 | publicly available network server or other readily accessible means, 491 | then you must either (1) cause the Corresponding Source to be so 492 | available, or (2) arrange to deprive yourself of the benefit of the 493 | patent license for this particular work, or (3) arrange, in a manner 494 | consistent with the requirements of this License, to extend the patent 495 | license to downstream recipients. "Knowingly relying" means you have 496 | actual knowledge that, but for the patent license, your conveying the 497 | covered work in a country, or your recipient's use of the covered work 498 | in a country, would infringe one or more identifiable patents in that 499 | country that you have reason to believe are valid. 
500 | 501 | If, pursuant to or in connection with a single transaction or 502 | arrangement, you convey, or propagate by procuring conveyance of, a 503 | covered work, and grant a patent license to some of the parties 504 | receiving the covered work authorizing them to use, propagate, modify 505 | or convey a specific copy of the covered work, then the patent license 506 | you grant is automatically extended to all recipients of the covered 507 | work and works based on it. 508 | 509 | A patent license is "discriminatory" if it does not include within 510 | the scope of its coverage, prohibits the exercise of, or is 511 | conditioned on the non-exercise of one or more of the rights that are 512 | specifically granted under this License. You may not convey a covered 513 | work if you are a party to an arrangement with a third party that is 514 | in the business of distributing software, under which you make payment 515 | to the third party based on the extent of your activity of conveying 516 | the work, and under which the third party grants, to any of the 517 | parties who would receive the covered work from you, a discriminatory 518 | patent license (a) in connection with copies of the covered work 519 | conveyed by you (or copies made from those copies), or (b) primarily 520 | for and in connection with specific products or compilations that 521 | contain the covered work, unless you entered into that arrangement, 522 | or that patent license was granted, prior to 28 March 2007. 523 | 524 | Nothing in this License shall be construed as excluding or limiting 525 | any implied license or other defenses to infringement that may 526 | otherwise be available to you under applicable patent law. 527 | 528 | 12. No Surrender of Others' Freedom. 529 | 530 | If conditions are imposed on you (whether by court order, agreement or 531 | otherwise) that contradict the conditions of this License, they do not 532 | excuse you from the conditions of this License. 
If you cannot convey a 533 | covered work so as to satisfy simultaneously your obligations under this 534 | License and any other pertinent obligations, then as a consequence you may 535 | not convey it at all. For example, if you agree to terms that obligate you 536 | to collect a royalty for further conveying from those to whom you convey 537 | the Program, the only way you could satisfy both those terms and this 538 | License would be to refrain entirely from conveying the Program. 539 | 540 | 13. Remote Network Interaction; Use with the GNU General Public License. 541 | 542 | Notwithstanding any other provision of this License, if you modify the 543 | Program, your modified version must prominently offer all users 544 | interacting with it remotely through a computer network (if your version 545 | supports such interaction) an opportunity to receive the Corresponding 546 | Source of your version by providing access to the Corresponding Source 547 | from a network server at no charge, through some standard or customary 548 | means of facilitating copying of software. This Corresponding Source 549 | shall include the Corresponding Source for any work covered by version 3 550 | of the GNU General Public License that is incorporated pursuant to the 551 | following paragraph. 552 | 553 | Notwithstanding any other provision of this License, you have 554 | permission to link or combine any covered work with a work licensed 555 | under version 3 of the GNU General Public License into a single 556 | combined work, and to convey the resulting work. The terms of this 557 | License will continue to apply to the part which is the covered work, 558 | but the work with which it is combined will remain governed by version 559 | 3 of the GNU General Public License. 560 | 561 | 14. Revised Versions of this License. 562 | 563 | The Free Software Foundation may publish revised and/or new versions of 564 | the GNU Affero General Public License from time to time. 
Such new versions 565 | will be similar in spirit to the present version, but may differ in detail to 566 | address new problems or concerns. 567 | 568 | Each version is given a distinguishing version number. If the 569 | Program specifies that a certain numbered version of the GNU Affero General 570 | Public License "or any later version" applies to it, you have the 571 | option of following the terms and conditions either of that numbered 572 | version or of any later version published by the Free Software 573 | Foundation. If the Program does not specify a version number of the 574 | GNU Affero General Public License, you may choose any version ever published 575 | by the Free Software Foundation. 576 | 577 | If the Program specifies that a proxy can decide which future 578 | versions of the GNU Affero General Public License can be used, that proxy's 579 | public statement of acceptance of a version permanently authorizes you 580 | to choose that version for the Program. 581 | 582 | Later license versions may give you additional or different 583 | permissions. However, no additional obligations are imposed on any 584 | author or copyright holder as a result of your choosing to follow a 585 | later version. 586 | 587 | 15. Disclaimer of Warranty. 588 | 589 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY 590 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT 591 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY 592 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, 593 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 594 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM 595 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF 596 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 597 | 598 | 16. Limitation of Liability. 
599 | 600 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 601 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS 602 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY 603 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE 604 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF 605 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD 606 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), 607 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF 608 | SUCH DAMAGES. 609 | 610 | 17. Interpretation of Sections 15 and 16. 611 | 612 | If the disclaimer of warranty and limitation of liability provided 613 | above cannot be given local legal effect according to their terms, 614 | reviewing courts shall apply local law that most closely approximates 615 | an absolute waiver of all civil liability in connection with the 616 | Program, unless a warranty or assumption of liability accompanies a 617 | copy of the Program in return for a fee. 618 | 619 | END OF TERMS AND CONDITIONS 620 | 621 | How to Apply These Terms to Your New Programs 622 | 623 | If you develop a new program, and you want it to be of the greatest 624 | possible use to the public, the best way to achieve this is to make it 625 | free software which everyone can redistribute and change under these terms. 626 | 627 | To do so, attach the following notices to the program. It is safest 628 | to attach them to the start of each source file to most effectively 629 | state the exclusion of warranty; and each file should have at least 630 | the "copyright" line and a pointer to where the full notice is found. 
631 | 632 | <one line to give the program's name and a brief idea of what it does.> 633 | Copyright (C) <year> <name of author> 634 | 635 | This program is free software: you can redistribute it and/or modify 636 | it under the terms of the GNU Affero General Public License as published by 637 | the Free Software Foundation, either version 3 of the License, or 638 | (at your option) any later version. 639 | 640 | This program is distributed in the hope that it will be useful, 641 | but WITHOUT ANY WARRANTY; without even the implied warranty of 642 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 643 | GNU Affero General Public License for more details. 644 | 645 | You should have received a copy of the GNU Affero General Public License 646 | along with this program. If not, see <http://www.gnu.org/licenses/>. 647 | 648 | Also add information on how to contact you by electronic and paper mail. 649 | 650 | If your software can interact with users remotely through a computer 651 | network, you should also make sure that it provides a way for users to 652 | get its source. For example, if your program is a web application, its 653 | interface could display a "Source" link that leads users to an archive 654 | of the code. There are many ways you could offer source, and different 655 | solutions will be better for different programs; see section 13 for the 656 | specific requirements. 657 | 658 | You should also get your employer (if you work as a programmer) or school, 659 | if any, to sign a "copyright disclaimer" for the program, if necessary. 660 | For more information on this, and how to apply and follow the GNU AGPL, see 661 | <http://www.gnu.org/licenses/>. 
662 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | pystemon 2 | ======== 3 | Monitoring tool for PasteBin-alike sites written in Python 4 | 5 | Copyleft AGPLv3 - Christophe Vandeplas - christophe@vandeplas.com 6 | Feel free to use the code, but please share the changes you've made 7 | 8 | Features: 9 | --------- 10 | * search for regular expressions in pasties 11 | * flexible design, minimal effort to add another paste* site 12 | * use custom download functions for complex pastie sites 13 | * uses multiple threads per unique site to download the pastes 14 | * waits a random time (within a range) before downloading the latest pastes, time customizable per site 15 | * (optional) only trigger on X hits in the same pastie 16 | * (optional) exclude matching pasties if exclusion regex matches 17 | * (optional) allow additional email recipients per search pattern 18 | * (optional) uses random User-Agents 19 | * (optional) uses random proxies 20 | * removes a proxy if it is unreliable (fails 5 times) 21 | * (optional) compress saved files with Gzip. 
(no zip to limit external dependencies) 22 | 23 | Python Dependencies 24 | ------------------- 25 | * PyYAML 26 | * requests 27 | * redis 28 | 29 | Limitations: 30 | ------------ 31 | * Only HTTP proxies are allowed 32 | * Only HTTP urls will use proxies 33 | 34 | Usage 35 | ------ 36 | ``` 37 | Usage: pystemon.py [options] 38 | Options: 39 | -h, --help show this help message and exit 40 | -c FILE, --config=FILE 41 | load configuration from file 42 | -d, --daemon runs in background as a daemon (NOT IMPLEMENTED) 43 | -s, --stats display statistics about the running threads (NOT IMPLEMENTED) 44 | -v outputs more information 45 | 46 | Default configuration file: /etc/pystemon.yaml or pystemon.yaml in current directory 47 | ``` 48 | -------------------------------------------------------------------------------- /pystemon.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding: utf-8 3 | 4 | ''' 5 | @author: Christophe Vandeplas 6 | @copyright: AGPLv3 7 | http://www.gnu.org/licenses/agpl.html 8 | 9 | To be implemented: 10 | - FIXME set all the config options in the class variables 11 | - FIXME validate parsing of config file 12 | - FIXME use syslog logging 13 | - TODO runs as a daemon in background 14 | - TODO save files in separate directories depending on the day/week/month. 
Try to avoid duplicate files 15 | ''' 16 | 17 | try: 18 | from queue import Queue 19 | except ImportError: 20 | from Queue import Queue 21 | from collections import deque 22 | from datetime import datetime 23 | try: 24 | from email.mime.multipart import MIMEMultipart 25 | except ImportError: 26 | from email.MIMEMultipart import MIMEMultipart 27 | 28 | try: 29 | from email.mime.text import MIMEText 30 | except ImportError: 31 | from email.MIMEText import MIMEText 32 | import gzip 33 | import hashlib 34 | import logging.handlers 35 | import optparse 36 | import os 37 | import random 38 | import re 39 | import smtplib 40 | import socket 41 | import sys 42 | import traceback 43 | import threading 44 | import time 45 | import urllib 46 | import urllib2 47 | import httplib 48 | import ssl 49 | from io import open 50 | import requests 51 | 52 | try: 53 | from urllib.error import HTTPError, URLError 54 | except ImportError: 55 | from urllib2 import HTTPError, URLError 56 | 57 | try: 58 | import redis 59 | except ImportError: 60 | exit('ERROR: Cannot import the redis Python library. Are you sure it is installed?') 61 | 62 | try: 63 | import yaml 64 | except ImportError: 65 | exit('ERROR: Cannot import the yaml Python library. 
Are you sure it is installed?') 66 | 67 | try: 68 | if sys.version_info < (2, 7): 69 | exit('You need python version 2.7 or newer.') 70 | except Exception as exc: 71 | exit('You need python version 2.7 or newer.') 72 | 73 | retries_paste = 3 74 | retries_client = 5 75 | retries_server = 100 76 | 77 | socket.setdefaulttimeout(10) # set a default timeout of 10 seconds to download the page (default = unlimited) 78 | true_socket = socket.socket 79 | 80 | 81 | def make_bound_socket(source_ip): 82 | def bound_socket(*a, **k): 83 | sock = true_socket(*a, **k) 84 | sock.bind((source_ip, 0)) 85 | return sock 86 | return bound_socket 87 | 88 | 89 | class PastieSite(threading.Thread): 90 | ''' 91 | Instances of these threads are responsible for downloading the list of 92 | the most recent pastes and adding them to the download queue. 93 | ''' 94 | def __init__(self, name, download_url, archive_url, archive_regex): 95 | threading.Thread.__init__(self) 96 | self.kill_received = False 97 | 98 | self.name = name 99 | self.download_url = download_url 100 | self.archive_url = archive_url 101 | self.archive_regex = archive_regex 102 | try: 103 | self.ip_addr = yamlconfig['network']['ip'] 104 | socket.socket = make_bound_socket(self.ip_addr) 105 | except Exception as exc: 106 | logger.debug("Using default IP address") 107 | 108 | self.save_dir = yamlconfig['archive']['dir'] + os.sep + name 109 | self.archive_dir = yamlconfig['archive']['dir-all'] + os.sep + name 110 | if yamlconfig['archive']['save'] and not os.path.exists(self.save_dir): 111 | os.makedirs(self.save_dir) 112 | if yamlconfig['archive']['save-all'] and not os.path.exists(self.archive_dir): 113 | os.makedirs(self.archive_dir) 114 | self.archive_compress = yamlconfig['archive']['compress'] 115 | self.update_max = 30 # TODO set by config file 116 | self.update_min = 10 # TODO set by config file 117 | self.pastie_classname = None 118 | self.seen_pasties = deque('', 1000) # max number of pasties ids in memory 119 | 120 | 
def run(self): 121 | while not self.kill_received: 122 | sleep_time = random.randint(self.update_min, self.update_max) 123 | try: 124 | # grabs site from queue 125 | logger.info( 126 | 'Downloading list of new pastes from {name}. ' 127 | 'Will check again in {time} seconds'.format( 128 | name=self.name, time=sleep_time)) 129 | # get the list of last pasties, but reverse it 130 | # so we first have the old entries and then the new ones 131 | last_pasties = self.get_last_pasties() 132 | if last_pasties: 133 | for pastie in reversed(last_pasties): 134 | queues[self.name].put(pastie) # add pastie to queue 135 | logger.info("Found {amount} new pasties for site {site}. There are now {qsize} pasties to be downloaded.".format(amount=len(last_pasties), 136 | site=self.name, 137 | qsize=queues[self.name].qsize())) 138 | # catch unknown errors 139 | except Exception as e: 140 | msg = 'Thread for {name} crashed unexpectedly, '\ 141 | 'recovering...: {e}'.format(name=self.name, e=e) 142 | logger.error(msg) 143 | logger.debug(traceback.format_exc()) 144 | time.sleep(sleep_time) 145 | 146 | def get_last_pasties(self): 147 | # reset the pasties list 148 | pasties = [] 149 | # populate queue with data 150 | response = download_url(self.archive_url) 151 | htmlPage = response.text 152 | if not htmlPage: 153 | logger.warning("No HTML content for page {url}".format(url=self.archive_url)) 154 | return False 155 | pasties_ids = re.findall(self.archive_regex, htmlPage) 156 | if pasties_ids: 157 | for pastie_id in pasties_ids: 158 | # check if the pastie was already downloaded 159 | # and remember that we've seen it 160 | if self.seen_pastie(pastie_id): 161 | # do not append the seen things again in the queue 162 | continue 163 | # pastie was not downloaded yet. 
Add it to the queue 164 | if self.pastie_classname: 165 | class_name = globals()[self.pastie_classname] 166 | pastie = class_name(self, pastie_id) 167 | else: 168 | pastie = Pastie(self, pastie_id) 169 | pasties.append(pastie) 170 | return pasties 171 | if "DOES NOT HAVE ACCESS" in htmlPage.encode('utf8'): 172 | print("Problem with configured IP address") 173 | 174 | logger.error("No last pasties matches for regular expression site:{site} regex:{regex}. Error in your regex? Dumping htmlPage \n {html}".format(site=self.name, regex=self.archive_regex, html=htmlPage.encode('utf8'))) 175 | return False 176 | 177 | def seen_pastie(self, pastie_id): 178 | ''' check if the pastie was already downloaded. ''' 179 | # first look in memory if we have already seen this pastie 180 | if self.seen_pasties.count(pastie_id): 181 | return True 182 | # look on the filesystem. # LATER remove this filesystem lookup as it will give problems on long term 183 | if yamlconfig['archive']['save-all']: 184 | # check if the pastie was already saved on the disk 185 | if os.path.exists(verify_directory_exists(self.archive_dir) + os.sep + self.pastie_id_to_filename(pastie_id)): 186 | return True 187 | # TODO look in the database if it was already seen 188 | 189 | def seen_pastie_and_remember(self, pastie): 190 | ''' 191 | Check if the pastie was already downloaded 192 | and remember that we've seen it 193 | ''' 194 | seen = False 195 | if self.seen_pastie(pastie.id): 196 | seen = True 197 | else: 198 | # We have not yet seen the pastie. 199 | # Keep in memory that we've seen it using 200 | # appendleft for performance reasons. 
201 | # (faster later when we iterate over the deque) 202 | self.seen_pasties.appendleft(pastie.id) 203 | # add / update the pastie in the database 204 | if db: 205 | db.queue.put(pastie) 206 | return seen 207 | 208 | def pastie_id_to_filename(self, pastie_id): 209 | filename = pastie_id.replace('/', '_') 210 | if self.archive_compress: 211 | filename = filename + ".gz" 212 | return filename 213 | 214 | 215 | def verify_directory_exists(directory): 216 | d = datetime.now() 217 | year = str(d.year) 218 | month = str(d.month) 219 | # prefix month and day with "0" if it is only one digit 220 | if len(month) < 2: 221 | month = "0" + month 222 | day = str(d.day) 223 | if len(day) < 2: 224 | day = "0" + day 225 | fullpath = directory + os.sep + year + os.sep + month + os.sep + day 226 | if not os.path.isdir(fullpath): 227 | os.makedirs(fullpath) 228 | return fullpath 229 | 230 | 231 | class Pastie(): 232 | def __init__(self, site, pastie_id): 233 | self.site = site 234 | self.id = pastie_id 235 | self.pastie_content = None 236 | self.matches = [] 237 | self.md5 = None 238 | self.url = self.site.download_url.format(id=self.id) 239 | self.public = False 240 | 241 | def hash_pastie(self): 242 | if self.pastie_content: 243 | try: 244 | self.md5 = hashlib.md5(self.pastie_content).hexdigest() 245 | logger.debug('Pastie {site} {id} has md5: "{md5}"'.format(site=self.site.name, id=self.id, md5=self.md5)) 246 | except Exception as e: 247 | logger.error('Pastie {site} {id} md5 problem: {e}'.format(site=self.site.name, id=self.id, e=e)) 248 | 249 | def fetch_pastie(self): 250 | response = download_url(self.url) 251 | self.pastie_content = response.content 252 | return self.pastie_content 253 | 254 | def save_pastie(self, directory): 255 | if not self.pastie_content: 256 | raise SystemExit('BUG: Content not set, cannot save') 257 | full_path = verify_directory_exists(directory) + os.sep + self.site.pastie_id_to_filename(self.id) 258 | if yamlconfig['redis']['queue']: 259 | r = 
redis.StrictRedis(host=yamlconfig['redis']['server'], port=yamlconfig['redis']['port'], db=yamlconfig['redis']['database']) 260 | if self.site.archive_compress: 261 | f = gzip.open(full_path, 'w') 262 | f.write(self.pastie_content.encode('utf8')) 263 | f.flush() 264 | os.fsync(f.fileno()) 265 | f.close() 266 | else: 267 | f = open(full_path, 'w') 268 | f.write(self.pastie_content.encode('utf8')) 269 | f.flush() 270 | os.fsync(f.fileno()) 271 | f.close() 272 | if yamlconfig['redis']['queue']: 273 | time.sleep(3) 274 | r.lpush('pastes', full_path) 275 | # with gzip.open(full_path, 'wb') as f: 276 | # f.write(self.pastie_content) 277 | # if yamlconfig['redis']['queue']: 278 | # r.lpush('pastes', full_path) 279 | # else: 280 | # with open(full_path, 'wb') as f: 281 | # f.write(self.pastie_content) 282 | # if yamlconfig['redis']['queue']: 283 | # r.lpush('pastes', full_path) 284 | 285 | def fetch_and_process_pastie(self): 286 | # double check if the pastie was already downloaded, 287 | # and remember that we've seen it 288 | if self.site.seen_pastie(self.id): 289 | return None 290 | # download pastie 291 | self.fetch_pastie() 292 | # save the pastie on the disk 293 | if self.pastie_content: 294 | # take checksum 295 | self.hash_pastie() 296 | # keep in memory that the pastie was seen successfully 297 | self.site.seen_pastie_and_remember(self) 298 | # Save pastie to archive dir if configured 299 | if yamlconfig['archive']['save-all']: 300 | self.save_pastie(self.site.archive_dir) 301 | # search for data in pastie 302 | self.search_content() 303 | return self.pastie_content 304 | 305 | def search_content(self): 306 | if not self.pastie_content: 307 | raise SystemExit('BUG: Content not set, cannot search') 308 | return False 309 | # search for the regexes in the htmlPage 310 | for regex in yamlconfig['search']: 311 | # LATER first compile regex, then search using compiled version 312 | regex_flags = re.IGNORECASE 313 | if 'regex-flags' in regex: 314 | regex_flags = 
eval(regex['regex-flags']) 315 | m = re.findall(regex['search'].encode(), self.pastie_content, regex_flags) 316 | if m: 317 | # the regex matches the text 318 | # ignore if not enough counts 319 | if 'count' in regex and len(m) < int(regex['count']): 320 | continue 321 | # ignore if exclude 322 | if 'exclude' in regex and re.search(regex['exclude'].encode(), self.pastie_content, regex_flags): 323 | continue 324 | # we have a match, add to match list 325 | self.matches.append(regex) 326 | if 'public' in regex: 327 | self.public = regex['public'] 328 | else: 329 | self.public = False 330 | if self.matches: 331 | self.action_on_match() 332 | 333 | def action_on_match(self): 334 | msg = 'Found hit for {matches} in pastie {url}'.format( 335 | matches=self.matches_to_text(), url=self.url) 336 | logger.info(msg) 337 | # store info in DB 338 | if db: 339 | db.queue.put(self) 340 | # Save pastie to disk if configured 341 | if yamlconfig['archive']['save']: 342 | self.save_pastie(self.site.save_dir) 343 | # Send email alert if configured 344 | if yamlconfig['email']['alert']: 345 | self.send_email_alert() 346 | 347 | def matches_to_text(self): 348 | descriptions = [] 349 | for match in self.matches: 350 | if 'description' in match: 351 | descriptions.append(match['description']) 352 | else: 353 | descriptions.append(match['search']) 354 | if descriptions: 355 | return '[{}]'.format(', '.join(descriptions)) 356 | else: 357 | return '' 358 | 359 | def matches_to_regex(self): 360 | descriptions = [] 361 | for match in self.matches: 362 | descriptions.append(match['search']) 363 | if descriptions: 364 | return '[{}]'.format(', '.join(descriptions)) 365 | else: 366 | return '' 367 | 368 | def send_email_alert(self): 369 | msg = MIMEMultipart() 370 | if self.public: 371 | alert = "Found hit for {matches} in pastie {url}".format(matches=self.matches_to_text(), url=self.url) 372 | else: 373 | alert = "Found hit in pastie 
{url}".format(url=self.url) 374 | # headers 375 | msg['Subject'] = yamlconfig['email']['subject'].format(subject=alert) 376 | msg['From'] = yamlconfig['email']['from'] 377 | # build the list of recipients 378 | recipients = [] 379 | recipients.append(yamlconfig['email']['to']) # first the global alert email 380 | for match in self.matches: # per match, the custom additional email 381 | if 'to' in match and match['to']: 382 | recipients.extend(match['to'].split(",")) 383 | msg['Bcc'] = ','.join(recipients) # here the list needs to be comma separated 384 | # message body including full paste rather than attaching it 385 | message = ''' 386 | I found a hit for a regular expression on one of the pastebin sites. 387 | 388 | The site where the paste came from : {site} 389 | The original paste was located here: {url} 390 | And the regular expressions that matched: [redacted] 391 | 392 | Below (after newline) is the content of the pastie: 393 | 394 | {content} 395 | 396 | '''.format(site=self.site.name, url=self.url, content=self.pastie_content.encode('utf8')) 397 | # '''.format(site=self.site.name, url=self.url, matches=self.matches_to_regex(), content=self.pastie_content.encode('utf8')) 398 | msg.attach(MIMEText(message)) 399 | # send out the mail 400 | try: 401 | s = smtplib.SMTP(yamlconfig['email']['server'], yamlconfig['email']['port']) 402 | # login to the SMTP server if configured 403 | if 'username' in yamlconfig['email'] and yamlconfig['email']['username']: 404 | s.login(yamlconfig['email']['username'], yamlconfig['email']['password']) 405 | # send the mail 406 | s.sendmail(yamlconfig['email']['from'], recipients, msg.as_string()) 407 | s.close() 408 | except smtplib.SMTPException as e: 409 | logger.error("ERROR: unable to send email: {0}".format(e)) 410 | except Exception as e: 411 | logger.error("ERROR: unable to send email. 
Are your email settings correct?: {e}".format(e=e)) 412 | 413 | 414 | class ThreadPasties(threading.Thread): 415 | ''' 416 | Instances of these threads are responsible for downloading the pastes 417 | found in the queue. 418 | ''' 419 | def __init__(self, queue, queue_name): 420 | threading.Thread.__init__(self) 421 | self.queue = queue 422 | self.name = queue_name 423 | self.kill_received = False 424 | 425 | def run(self): 426 | while not self.kill_received: 427 | try: 428 | # grabs pastie from queue 429 | pastie = self.queue.get() 430 | pastie_content = pastie.fetch_and_process_pastie() 431 | logger.debug("Queue {name} size: {size}".format( 432 | size=self.queue.qsize(), name=self.name)) 433 | if pastie_content: 434 | logger.debug( 435 | "Saved new pastie from {0} " 436 | "with id {1}".format(self.name, pastie.id)) 437 | else: 438 | # pastie already downloaded OR error ? 439 | pass 440 | # signals to queue job is done 441 | self.queue.task_done() 442 | # catch unknown errors 443 | except Exception as e: 444 | msg = "ThreadPasties for {name} crashed unexpectedly, "\ 445 | "recovering...: {e}".format(name=self.name, e=e) 446 | logger.error(msg) 447 | logger.debug(traceback.format_exc()) 448 | 449 | 450 | def main(): 451 | global queues 452 | global threads 453 | global db 454 | queues = {} 455 | threads = [] 456 | 457 | # start a thread to handle the DB data 458 | db = None 459 | if yamlconfig['db'] and yamlconfig['db']['sqlite3'] and yamlconfig['db']['sqlite3']['enable']: 460 | try: 461 | global sqlite3 462 | import sqlite3 463 | except Exception as exc: 464 | exit('ERROR: Cannot import the sqlite3 Python library. Are you sure it is compiled in python?') 465 | db = Sqlite3Database(yamlconfig['db']['sqlite3']['file']) 466 | db.setDaemon(True) 467 | threads.append(db) 468 | db.start() 469 | # test() 470 | # Build array of enabled sites. 
471 | sites_enabled = [] 472 | for site in yamlconfig['site']: 473 | if yamlconfig['site'][site]['enable']: 474 | print("Site: {} is enabled, adding to pool...".format(site)) 475 | sites_enabled.append(site) 476 | elif not yamlconfig['site'][site]['enable']: 477 | print("Site: {} is disabled.".format(site)) 478 | else: 479 | print("Site: {} is not enabled or disabled in config file. We just assume it disabled.".format(site)) 480 | # spawn a pool of threads per PastieSite, and pass them a queue instance 481 | for site in sites_enabled: 482 | queues[site] = Queue() 483 | for i in range(yamlconfig['threads']): 484 | t = ThreadPasties(queues[site], site) 485 | t.setDaemon(True) 486 | threads.append(t) 487 | t.start() 488 | 489 | # build threads to download the last pasties 490 | for site_name in sites_enabled: 491 | t = PastieSite(site_name, 492 | yamlconfig['site'][site_name]['download-url'], 493 | yamlconfig['site'][site_name]['archive-url'], 494 | yamlconfig['site'][site_name]['archive-regex']) 495 | if 'update-min' in yamlconfig['site'][site_name] and yamlconfig['site'][site_name]['update-min']: 496 | t.update_min = yamlconfig['site'][site_name]['update-min'] 497 | if 'update-max' in yamlconfig['site'][site_name] and yamlconfig['site'][site_name]['update-max']: 498 | t.update_max = yamlconfig['site'][site_name]['update-max'] 499 | if 'pastie-classname' in yamlconfig['site'][site_name] and yamlconfig['site'][site_name]['pastie-classname']: 500 | t.pastie_classname = yamlconfig['site'][site_name]['pastie-classname'] 501 | threads.append(t) 502 | t.setDaemon(True) 503 | t.start() 504 | 505 | # wait while all the threads are running and someone sends CTRL+C 506 | while True: 507 | try: 508 | for t in threads: 509 | t.join(1) 510 | except KeyboardInterrupt: 511 | print('') 512 | print("Ctrl-c received! 
Sending kill to threads...") 513 | for t in threads: 514 | t.kill_received = True 515 | exit(0) # quit immediately 516 | 517 | 518 | user_agents_list = [] 519 | 520 | 521 | def load_user_agents_from_file(filename): 522 | global user_agents_list 523 | try: 524 | f = open(filename) 525 | except Exception as e: 526 | logger.error('Configuration problem: user-agent-file "{file}" not found or not readable: {e}'.format(file=filename, e=e)) 527 | for line in f: 528 | line = line.strip() 529 | if line: 530 | user_agents_list.append(line) 531 | logger.debug('Found {count} UserAgents in file "{file}"'.format(file=filename, count=len(user_agents_list))) 532 | 533 | 534 | def get_random_user_agent(): 535 | global user_agents_list 536 | if user_agents_list: 537 | return random.choice(user_agents_list) 538 | return None 539 | 540 | 541 | proxies_failed = [] 542 | proxies_lock = threading.Lock() 543 | proxies_list = [] 544 | 545 | 546 | def load_proxies_from_file(filename): 547 | global proxies_list 548 | try: 549 | f = open(filename) 550 | except Exception as e: 551 | logger.error('Configuration problem: proxyfile "{file}" not found or not readable: {e}'.format(file=filename, e=e)) 552 | for line in f: 553 | line = line.strip() 554 | if line: # LATER verify if the proxy line has the correct structure 555 | proxies_list.append(line) 556 | logger.debug('Found {count} proxies in file "{file}"'.format(file=filename, count=len(proxies_list))) 557 | 558 | 559 | def get_random_proxy(): 560 | global proxies_list 561 | proxy = None 562 | proxies_lock.acquire() 563 | if proxies_list: 564 | proxy = random.choice(proxies_list) 565 | proxies_lock.release() 566 | return proxy 567 | 568 | 569 | def failed_proxy(proxy): 570 | proxies_failed.append(proxy) 571 | if proxies_failed.count(proxy) >= 2 and proxies_list.count(proxy) >= 1: 572 | logger.info("Removing proxy {0} from proxy list because of too many errors.".format(proxy)) 573 | proxies_lock.acquire() 574 | proxies_list.remove(proxy) 575 
| proxies_lock.release() 576 | 577 | 578 | class NoRedirectHandler(urllib2.HTTPRedirectHandler): 579 | ''' 580 | This class is only necessary to not follow HTTP redirects in webpages. 581 | It is used by the download_url() function 582 | ''' 583 | def http_error_302(self, req, fp, code, msg, headers): 584 | infourl = urllib2.addinfourl(fp, headers, req.get_full_url()) 585 | infourl.status = code 586 | infourl.code = code 587 | return infourl 588 | http_error_301 = http_error_303 = http_error_307 = http_error_302 589 | 590 | 591 | class TLS1Connection(httplib.HTTPSConnection): 592 | """Like HTTPSConnection but more specific""" 593 | def __init__(self, host, **kwargs): 594 | httplib.HTTPSConnection.__init__(self, host, **kwargs) 595 | 596 | def connect(self): 597 | """Overrides HTTPSConnection.connect to specify TLS version""" 598 | # Standard implementation from HTTPSConnection, which is not 599 | # designed for extension, unfortunately 600 | sock = socket.create_connection( 601 | (self.host, self.port), 602 | self.timeout, self.source_address) 603 | if getattr(self, '_tunnel_host', None): 604 | self.sock = sock 605 | self._tunnel() 606 | 607 | # This is the only difference; default wrap_socket uses SSLv23 608 | self.sock = ssl.wrap_socket( 609 | sock, self.key_file, self.cert_file, 610 | ssl_version=ssl.PROTOCOL_TLSv1) 611 | 612 | 613 | class TLS1Handler(urllib2.HTTPSHandler): 614 | """Like HTTPSHandler but more specific""" 615 | def __init__(self): 616 | urllib2.HTTPSHandler.__init__(self) 617 | 618 | def https_open(self, req): 619 | return self.do_open(TLS1Connection, req) 620 | 621 | 622 | def download_url(url, data=None, cookie=None, loop_client=0, loop_server=0, loop_paste=0): 623 | # Client errors (40x): if more than 5 recursions, give up on URL (used for the 404 case) 624 | if loop_client >= retries_client: 625 | return None 626 | # Server errors (50x): if more than 100 recursions, give up on URL 627 | if loop_server >= retries_server: 628 | return None 629 
| 630 | session = requests.Session() 631 | random_proxy = get_random_proxy() 632 | if random_proxy: 633 | session.proxies = {'http': random_proxy} 634 | user_agent = get_random_user_agent() 635 | session.headers.update({'User-Agent': get_random_user_agent(), 'Accept-Charset': 'utf-8'}) 636 | if cookie: 637 | session.headers.update({'Cookie': cookie}) 638 | if data: 639 | session.headers.update(data) 640 | logger.debug('Downloading url: {url} with proxy: {proxy} and user-agent: {ua}'.format(url=url, proxy=random_proxy, ua=user_agent)) 641 | try: 642 | opener = None 643 | # urllib2.install_opener(urllib2.build_opener(TLS1Handler())) 644 | 645 | # Random Proxy if set in config 646 | random_proxy = get_random_proxy() 647 | if random_proxy: 648 | proxyh = urllib2.ProxyHandler({'http': random_proxy}) 649 | opener = urllib2.build_opener(proxyh, NoRedirectHandler()) 650 | # We need to create an opener if it didn't exist yet 651 | if not opener: 652 | opener = urllib2.build_opener(NoRedirectHandler()) 653 | # Random User-Agent if set in config 654 | user_agent = get_random_user_agent() 655 | opener.addheaders = [('Accept-Charset', 'utf-8')] 656 | if user_agent: 657 | opener.addheaders.append(('User-Agent', user_agent)) 658 | if cookie: 659 | opener.addheaders.append(('Cookie', cookie)) 660 | logger.debug( 661 | 'Downloading url: {url} with proxy: {proxy} and user-agent: {ua}'.format( 662 | url=url, proxy=random_proxy, ua=user_agent)) 663 | if data: 664 | response = opener.open(url, data) 665 | else: 666 | response = opener.open(url) 667 | htmlPage = unicode(response.read(), errors='replace') 668 | if 'File is not ready for scraping yet. Try again in 1 minute.' 
in htmlPage: 669 | if loop_paste >= retries_paste: 670 | logger.warning("Tried to scrape too early for {url}, giving up and saving current content".format(url=url)) 671 | return htmlPage, response.headers 672 | else: 673 | loop_paste += 1 674 | logger.warning("Tried to scrape too early for {url}, trying again in 60s ({nb}/{total})".format(url=url, nb=loop_paste, total=retries_paste)) 675 | time.sleep(60) 676 | return download_url(url, loop_paste=loop_paste) 677 | return htmlPage, response.headers 678 | except urllib2.HTTPError as e: 679 | failed_proxy(random_proxy) 680 | logger.warning("!!Proxy error on {url} for proxy {proxy}.".format(url=url, proxy=random_proxy)) 681 | if 404 == e.code: 682 | htmlPage = e.read() 683 | logger.warning("404 from proxy received for {url}. Waiting 1 minute".format(url=url)) 684 | time.sleep(60) 685 | loop_client += 1 686 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_client, total=retries_client, url=url)) 687 | return download_url(url, loop_client=loop_client) 688 | if 500 == e.code: 689 | htmlPage = e.read() 690 | logger.warning("500 from proxy received for {url}. Waiting 1 minute".format(url=url)) 691 | time.sleep(60) 692 | loop_server += 1 693 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 694 | return download_url(url, loop_server=loop_server) 695 | if 504 == e.code: 696 | htmlPage = e.read() 697 | logger.warning("504 from proxy received for {url}. Waiting 1 minute".format(url=url)) 698 | time.sleep(60) 699 | loop_server += 1 700 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 701 | return download_url(url, loop_server=loop_server) 702 | if 502 == e.code: 703 | htmlPage = e.read() 704 | logger.warning("502 from proxy received for {url}. 
Waiting 1 minute".format(url=url)) 705 | time.sleep(60) 706 | loop_server += 1 707 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 708 | return download_url(url, loop_server=loop_server) 709 | if 403 == e.code: 710 | htmlPage = e.read() 711 | if 'Please slow down' in htmlPage or 'has temporarily blocked your computer' in htmlPage or 'blocked' in htmlPage: 712 | logger.warning("Slow down message received for {url}. Waiting 1 minute".format(url=url)) 713 | time.sleep(60) 714 | loop_server += 1 715 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 716 | return download_url(url, loop_server=loop_server) 717 | if 504 == e.code: 718 | htmlPage = e.read() 719 | logger.warning("504 from proxy received for {url}. Waiting 1 minute".format(url=url)) 720 | time.sleep(60) 721 | loop_server += 1 722 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 723 | return download_url(url, loop_server=loop_server) 724 | if 502 == e.code: 725 | htmlPage = e.read() 726 | logger.warning("502 from proxy received for {url}. Waiting 1 minute".format(url=url)) 727 | time.sleep(60) 728 | loop_server += 1 729 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 730 | return download_url(url, loop_server=loop_server) 731 | if 403 == e.code: 732 | htmlPage = e.read() 733 | if 'Please slow down' in htmlPage or 'has temporarily blocked your computer' in htmlPage or 'blocked' in htmlPage: 734 | logger.warning("Slow down message received for {url}. 
Waiting 1 minute".format(url=url)) 735 | time.sleep(60) 736 | return download_url(url) 737 | logger.warning("ERROR: HTTP Error ##### {e} ######################## {url}".format(e=e, url=url)) 738 | return None 739 | return response 740 | except URLError as e: 741 | logger.debug("ERROR: URL Error ##### {e} ######################## ".format(e=e, url=url)) 742 | if random_proxy: # remove proxy from the list if needed 743 | failed_proxy(random_proxy) 744 | logger.warning("Failed to download the page {url} because of proxy error {proxy}. Trying again.".format(url=url, proxy=random_proxy)) 745 | loop_server += 1 746 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 747 | return download_url(url, loop_server=loop_server) 748 | if 'timed out' in e.reason: 749 | logger.warning("Timed out or slow down for {url}. Waiting 1 minute".format(url=url)) 750 | loop_server += 1 751 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 752 | time.sleep(60) 753 | return download_url(url, loop_server=loop_server) 754 | return None 755 | except socket.timeout: 756 | logger.debug("ERROR: timeout ############################# " + url) 757 | if random_proxy: # remove proxy from the list if needed 758 | failed_proxy(random_proxy) 759 | logger.warning("Failed to download the page because of socket error {0} trying again.".format(url)) 760 | loop_server += 1 761 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 762 | return download_url(url, loop_server=loop_server) 763 | return None 764 | except Exception as e: 765 | failed_proxy(random_proxy) 766 | logger.warning("Failed to download the page because of other HTTPlib error proxy error {0} trying again.".format(url)) 767 | loop_server += 1 768 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url)) 769 | return download_url(url, 
loop_server=loop_server) 770 | # logger.error("ERROR: Other HTTPlib error: {e}".format(e=e)) 771 | # return None, None 772 | # do NOT try to download the url again here, as we might end in enless loop 773 | 774 | 775 | class Sqlite3Database(threading.Thread): 776 | def __init__(self, filename): 777 | threading.Thread.__init__(self) 778 | self.kill_received = False 779 | self.queue = Queue() 780 | self.filename = filename 781 | self.db_conn = None 782 | self.c = None 783 | 784 | def run(self): 785 | self.db_conn = sqlite3.connect(self.filename) 786 | # create the db if it doesn't exist 787 | self.c = self.db_conn.cursor() 788 | try: 789 | # LATER maybe create a table per site. Lookups will be faster as less text-searching is needed 790 | self.c.execute(''' 791 | CREATE TABLE IF NOT EXISTS pasties ( 792 | site TEXT, 793 | id TEXT, 794 | md5 TEXT, 795 | url TEXT, 796 | local_path TEXT, 797 | timestamp DATE, 798 | matches TEXT 799 | )''') 800 | self.db_conn.commit() 801 | except sqlite3.DatabaseError as e: 802 | logger.error('Problem with the SQLite database {0}: {1}'.format(self.filename, e)) 803 | return None 804 | # loop over the queue 805 | while not self.kill_received: 806 | try: 807 | # grabs pastie from queue 808 | pastie = self.queue.get() 809 | # add the pastie to the DB 810 | self.add_or_update(pastie) 811 | # signals to queue job is done 812 | self.queue.task_done() 813 | # catch unknown errors 814 | except Exception as e: 815 | logger.error("Thread for SQLite crashed unexpectectly, recovering...: {e}".format(e=e)) 816 | logger.debug(traceback.format_exc()) 817 | 818 | def add_or_update(self, pastie): 819 | data = {'site': pastie.site.name, 820 | 'id': pastie.id 821 | } 822 | self.c.execute('SELECT count(id) FROM pasties WHERE site=:site AND id=:id', data) 823 | pastie_in_db = self.c.fetchone() 824 | # logger.debug('State of Database for pastie {site} {id} - {state}'.format(site=pastie.site.name, id=pastie.id, state=pastie_in_db)) 825 | if pastie_in_db and 
pastie_in_db[0]: 826 | self.update(pastie) 827 | else: 828 | self.add(pastie) 829 | 830 | def add(self, pastie): 831 | try: 832 | data = {'site': pastie.site.name, 833 | 'id': pastie.id, 834 | 'md5': pastie.md5, 835 | 'url': pastie.url, 836 | 'local_path': pastie.site.archive_dir + os.sep + pastie.site.pastie_id_to_filename(pastie.id), 837 | 'timestamp': datetime.now(), 838 | 'matches': pastie.matches_to_text() 839 | } 840 | self.c.execute('INSERT INTO pasties VALUES (:site, :id, :md5, :url, :local_path, :timestamp, :matches)', data) 841 | self.db_conn.commit() 842 | except sqlite3.DatabaseError as e: 843 | logger.error('Cannot add pastie {site} {id} in the SQLite database: {error}'.format(site=pastie.site.name, id=pastie.id, error=e)) 844 | logger.debug('Added pastie {site} {id} in the SQLite database.'.format(site=pastie.site.name, id=pastie.id)) 845 | 846 | def update(self, pastie): 847 | try: 848 | data = {'site': pastie.site.name, 849 | 'id': pastie.id, 850 | 'md5': pastie.md5, 851 | 'url': pastie.url, 852 | 'local_path': pastie.site.archive_dir + os.sep + pastie.site.pastie_id_to_filename(pastie.id), 853 | 'timestamp': datetime.now(), 854 | 'matches': pastie.matches_to_text() 855 | } 856 | self.c.execute('''UPDATE pasties SET md5 = :md5, 857 | url = :url, 858 | local_path = :local_path, 859 | timestamp = :timestamp, 860 | matches = :matches 861 | WHERE site = :site AND id = :id''', data) 862 | self.db_conn.commit() 863 | except sqlite3.DatabaseError as e: 864 | logger.error('Cannot add pastie {site} {id} in the SQLite database: {error}'.format(site=pastie.site.name, id=pastie.id, error=e)) 865 | logger.debug('Updated pastie {site} {id} in the SQLite database.'.format(site=pastie.site.name, id=pastie.id)) 866 | 867 | 868 | def parse_config_file(configfile): 869 | global yamlconfig 870 | try: 871 | yamlconfig = yaml.load(open(configfile)) 872 | except yaml.YAMLError as exc: 873 | logger.error("Error in configuration file:") 874 | if hasattr(exc, 
'problem_mark'): 875 | mark = exc.problem_mark 876 | logger.error("error position: (%s:%s)" % (mark.line + 1, mark.column + 1)) 877 | exit(1) 878 | # TODO verify validity of config parameters 879 | for includes in yamlconfig.get("includes", []): 880 | yamlconfig.update(yaml.load(open(includes))) 881 | if yamlconfig['proxy']['random']: 882 | load_proxies_from_file(yamlconfig['proxy']['file']) 883 | if yamlconfig['user-agent']['random']: 884 | load_user_agents_from_file(yamlconfig['user-agent']['file']) 885 | # if yamlconfig['redis']['queue']: 886 | # import redis 887 | 888 | 889 | if __name__ == "__main__": 890 | global logger 891 | parser = optparse.OptionParser("usage: %prog [options]") 892 | parser.add_option("-c", "--config", dest="config", 893 | help="load configuration from file", metavar="FILE") 894 | parser.add_option("-d", "--daemon", action="store_true", dest="daemon", 895 | help="runs in background as a daemon (NOT IMPLEMENTED)") 896 | parser.add_option("-s", "--stats", action="store_true", dest="stats", 897 | help="display statistics about the running threads (NOT IMPLEMENTED)") 898 | parser.add_option("-v", action="store_true", dest="verbose", 899 | help="outputs more information") 900 | 901 | (options, args) = parser.parse_args() 902 | 903 | if not options.config: 904 | # try to read out the default configuration files if -c option is not set 905 | if os.path.isfile('/etc/pystemon.yaml'): 906 | options.config = '/etc/pystemon.yaml' 907 | if os.path.isfile('pystemon.yaml'): 908 | options.config = 'pystemon.yaml' 909 | filename = sys.argv[0] 910 | config_file = filename.replace('.py', '.yaml') 911 | if os.path.isfile(config_file): 912 | options.config = config_file 913 | if not os.path.isfile(options.config): 914 | parser.error('Configuration file not found. 
Please create /etc/pystemon.yaml, pystemon.yaml or specify a config file using the -c option.') 915 | exit(1) 916 | 917 | logger = logging.getLogger('pystemon') 918 | logger.setLevel(logging.DEBUG) 919 | hdlr = logging.StreamHandler(sys.stdout) 920 | formatter = logging.Formatter('[%(asctime)s] %(message)s') 921 | hdlr.setFormatter(formatter) 922 | logger.addHandler(hdlr) 923 | if options.verbose: 924 | logger.setLevel(logging.DEBUG) 925 | 926 | if options.daemon: 927 | # send logging to syslog if using daemon 928 | logger.addHandler(logging.handlers.SysLogHandler(facility=logging.handlers.SysLogHandler.LOG_DAEMON)) 929 | # FIXME run application in background 930 | 931 | parse_config_file(options.config) 932 | # run the software 933 | main() 934 | -------------------------------------------------------------------------------- /pystemon.yaml: -------------------------------------------------------------------------------- 1 | #network: # Network settings 2 | # ip: '1.1.1.1' # Specify source IP address if you want to bind on a specific one 3 | 4 | archive: 5 | save: yes # Keep 6 | save-all: no # Keep a copy of all pasties 7 | dir: "alerts" # Directory where matching pasties should be kept 8 | dir-all: "archive" # Directory where all pasties should be kept (if save-all is set to yes) 9 | compress: yes # Store the pasties compressed 10 | 11 | db: 12 | sqlite3: # Store information about the pastie in a database 13 | enable: no # Activate this DB engine # NOT FULLY IMPLEMENTED 14 | file: 'db.sqlite3' # The filename of the database 15 | 16 | redis: 17 | queue: no # Toggle PUSH to redis queue 18 | server: "localhost" 19 | port: 6379 20 | database: 10 21 | 22 | email: 23 | alert: no # Enable/disable email alerts 24 | from: alert@example.com 25 | to: alert@example.com 26 | server: 127.0.0.1 # Address of the server (hostname or IP) 27 | port: 25 # Outgoing SMTP port: 25, 587, ... 28 | username: '' # (optional) Username for authentication. Leave blank for no authentication. 
  password: ''            # (optional) Password for authentication. Leave blank for no authentication.
  subject: '[pystemon] - {subject}'

#####
# Definition of regular expressions to search for in the pasties
#
search:
#  - description: ''    # (optional) A human readable description. (used in alerting)
#    search: ''         # The regular expression to search for
#    count: ''          # (optional) How many hits should it have to be interesting?
#    exclude: ''        # (optional) Do not alert if this regular expression matches
#    regex-flags: ''    # (optional) Regular expression flags to give to the find function.
#                       #            Default = re.IGNORECASE
#                       #            Set to 0 to have no flags set
#                       #            See http://docs.python.org/2/library/re.html#re.DEBUG for more info.
#                       #            Warning: when setting this the default is overridden
#                       #            example: 're.MULTILINE + re.DOTALL + re.IGNORECASE'
#    to: ''             # (optional) Additional recipients for email alert, comma separated list

  - search: '[^a-zA-Z0-9]example\.com'
  - search: '[^a-zA-Z0-9]foobar\.com'
  - description: 'Download (non-porn)'
    search: 'download'
    exclude: 'porn|sex|teen'
    count: 4

#####
# Configuration section for the paste sites
#
threads: 1              # number of download threads per site
site:
#  example.com:
#    archive-url:       # the url where the list of last pasties is present
#                       # example: 'http://pastebin.com/archive'
#    archive-regex:     # a regular expression to extract the pastie-id from the page.
#                       # do not forget the () to extract the pastie-id
#                       # example: '.+'
#    download-url:      # url for the raw pastie.
#                       # Should contain {id} on the place where the ID of the pastie needs to be placed
#                       # example: 'http://pastebin.com/raw.php?i={id}'
#    update-max: 40     # every X seconds check for new updates to see if new pasties are available
#    update-min: 30     # a random number will be chosen between these two numbers
#    pastie-classname:  # OPTIONAL: The name of a custom Class that inherits from Pastie
#                       # This is practical for sites that require custom fetchPastie() functions

  pastebin.com:
    enable: yes
    archive-url: 'http://pastebin.com/archive'
    archive-regex: '.+'
    download-url: 'http://pastebin.com/raw.php?i={id}'
    update-max: 50
    update-min: 40

  # See https://pastebin.com/api_scraping_faq , you will need a pro account on pastebin
  # You need to whitelist your source IP: "Our scraping API is only available for LIFETIME PRO members, and only for those who have their IP whitelisted."
  pastebin.com_pro:
    enable: no
    archive-url: 'https://scrape.pastebin.com/api_scraping.php?limit=500'
    archive-regex: '"key": "(.+)",'
    download-url: 'http://pastebin.com/api_scrape_item.php?i={id}'
    update-max: 50
    update-min: 40

  slexy.org:
    enable: yes
    archive-url: 'http://slexy.org/recent'
    archive-regex: 'View paste'
    download-url: 'http://slexy.org/raw/{id}'

  gist.github.com:
    enable: no
    archive-url: 'https://gist.github.com/gists'
    archive-regex: 'gist'
    download-url: 'https://raw.github.com/gist/{id}'

  codepad.org:
    enable: no
    archive-url: 'http://codepad.org/recent'
    archive-regex: 'view'
    download-url: 'http://codepad.org/{id}/raw.txt'

  safebin.net:          # FIXME not finished
    enable: no
    archive-url: 'http://safebin.net/?archive'
    archive-regex: ''
    download-url: 'http://safebin.net/{id}'
    update-max: 60
    update-min: 50


# TODO
# http://www.safebin.net/           # more complex site
# http://www.heypasteit.com/        # http://www.heypasteit.com/clip/0IZA => incremental

# http://hastebin.com/              # no list of last pastes
# http://sebsauvage.net/paste/      # no list of last pastes
# http://tny.cz/                    # no list of last pastes
# https://pastee.org/               # no list of last pastes
# http://paste2.org/                # no list of last pastes
# http://0bin.net/                  # no list of last pastes
# http://markable.in/               # no list of last pastes


#####
# Configuration section to configure proxies
# Currently only HTTP proxies are permitted
#
proxy:
  random: no
  file: 'proxies.txt'

#####
# Configuration section for User-Agents
#
user-agent:
  random: no
  file: 'user-agents.txt'

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pyyaml
redis
requests
--------------------------------------------------------------------------------
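The retry handling in `download_url` in pystemon.py re-enters the function recursively after each failure, passing an incremented counter. As a rough illustration only, the same "try, log, wait, try again, give up after N attempts" policy can be written as a bounded loop. This is a minimal sketch in Python 3, not the project's code: `fetch_with_retries`, its `fetch` callable and the injectable `sleep` parameter are hypothetical names introduced for the example.

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay=60.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_retries attempts have failed.

    `fetch` is a hypothetical stand-in for the real HTTP request: any callable
    that returns the page content or raises an exception.
    Returns the content on success, or None once the retries are exhausted.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose, like a catch-all handler
            print("Retry {nb}/{total} for {url}: {error}".format(
                nb=attempt, total=max_retries, url=url, error=error))
            # Only wait when another attempt will follow
            if attempt < max_retries:
                sleep(delay)
    return None
```

A caller would pass its own request function, for example `fetch_with_retries(my_fetch, 'http://example.com/raw/abc', max_retries=5)`, where `my_fetch` is whatever performs the download. Making `sleep` a parameter is a design choice for this sketch: it lets the waiting be replaced in tests without patching `time.sleep`.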