├── .gitignore
├── LICENSE
├── README.md
├── pystemon.py
├── pystemon.yaml
├── requirements.txt
└── user-agents.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | *.py[co]
2 |
3 | # Packages
4 | *.egg
5 | *.egg-info
6 | dist
7 | build
8 | eggs
9 | parts
10 | bin
11 | var
12 | sdist
13 | develop-eggs
14 | .installed.cfg
15 |
16 | # Installer logs
17 | pip-log.txt
18 |
19 | # Unit test / coverage reports
20 | .coverage
21 | .tox
22 |
23 | # Project specifics
24 | /.project
25 | /.pydevproject
26 | /archive
27 | /alerts
28 |
29 | proxies.txt
30 | pystemon.yaml
31 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU AFFERO GENERAL PUBLIC LICENSE
2 | Version 3, 19 November 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 | Preamble
9 |
10 | The GNU Affero General Public License is a free, copyleft license for
11 | software and other kinds of works, specifically designed to ensure
12 | cooperation with the community in the case of network server software.
13 |
14 | The licenses for most software and other practical works are designed
15 | to take away your freedom to share and change the works. By contrast,
16 | our General Public Licenses are intended to guarantee your freedom to
17 | share and change all versions of a program--to make sure it remains free
18 | software for all its users.
19 |
20 | When we speak of free software, we are referring to freedom, not
21 | price. Our General Public Licenses are designed to make sure that you
22 | have the freedom to distribute copies of free software (and charge for
23 | them if you wish), that you receive source code or can get it if you
24 | want it, that you can change the software or use pieces of it in new
25 | free programs, and that you know you can do these things.
26 |
27 | Developers that use our General Public Licenses protect your rights
28 | with two steps: (1) assert copyright on the software, and (2) offer
29 | you this License which gives you legal permission to copy, distribute
30 | and/or modify the software.
31 |
32 | A secondary benefit of defending all users' freedom is that
33 | improvements made in alternate versions of the program, if they
34 | receive widespread use, become available for other developers to
35 | incorporate. Many developers of free software are heartened and
36 | encouraged by the resulting cooperation. However, in the case of
37 | software used on network servers, this result may fail to come about.
38 | The GNU General Public License permits making a modified version and
39 | letting the public access it on a server without ever releasing its
40 | source code to the public.
41 |
42 | The GNU Affero General Public License is designed specifically to
43 | ensure that, in such cases, the modified source code becomes available
44 | to the community. It requires the operator of a network server to
45 | provide the source code of the modified version running there to the
46 | users of that server. Therefore, public use of a modified version, on
47 | a publicly accessible server, gives the public access to the source
48 | code of the modified version.
49 |
50 | An older license, called the Affero General Public License and
51 | published by Affero, was designed to accomplish similar goals. This is
52 | a different license, not a version of the Affero GPL, but Affero has
53 | released a new version of the Affero GPL which permits relicensing under
54 | this license.
55 |
56 | The precise terms and conditions for copying, distribution and
57 | modification follow.
58 |
59 | TERMS AND CONDITIONS
60 |
61 | 0. Definitions.
62 |
63 | "This License" refers to version 3 of the GNU Affero General Public License.
64 |
65 | "Copyright" also means copyright-like laws that apply to other kinds of
66 | works, such as semiconductor masks.
67 |
68 | "The Program" refers to any copyrightable work licensed under this
69 | License. Each licensee is addressed as "you". "Licensees" and
70 | "recipients" may be individuals or organizations.
71 |
72 | To "modify" a work means to copy from or adapt all or part of the work
73 | in a fashion requiring copyright permission, other than the making of an
74 | exact copy. The resulting work is called a "modified version" of the
75 | earlier work or a work "based on" the earlier work.
76 |
77 | A "covered work" means either the unmodified Program or a work based
78 | on the Program.
79 |
80 | To "propagate" a work means to do anything with it that, without
81 | permission, would make you directly or secondarily liable for
82 | infringement under applicable copyright law, except executing it on a
83 | computer or modifying a private copy. Propagation includes copying,
84 | distribution (with or without modification), making available to the
85 | public, and in some countries other activities as well.
86 |
87 | To "convey" a work means any kind of propagation that enables other
88 | parties to make or receive copies. Mere interaction with a user through
89 | a computer network, with no transfer of a copy, is not conveying.
90 |
91 | An interactive user interface displays "Appropriate Legal Notices"
92 | to the extent that it includes a convenient and prominently visible
93 | feature that (1) displays an appropriate copyright notice, and (2)
94 | tells the user that there is no warranty for the work (except to the
95 | extent that warranties are provided), that licensees may convey the
96 | work under this License, and how to view a copy of this License. If
97 | the interface presents a list of user commands or options, such as a
98 | menu, a prominent item in the list meets this criterion.
99 |
100 | 1. Source Code.
101 |
102 | The "source code" for a work means the preferred form of the work
103 | for making modifications to it. "Object code" means any non-source
104 | form of a work.
105 |
106 | A "Standard Interface" means an interface that either is an official
107 | standard defined by a recognized standards body, or, in the case of
108 | interfaces specified for a particular programming language, one that
109 | is widely used among developers working in that language.
110 |
111 | The "System Libraries" of an executable work include anything, other
112 | than the work as a whole, that (a) is included in the normal form of
113 | packaging a Major Component, but which is not part of that Major
114 | Component, and (b) serves only to enable use of the work with that
115 | Major Component, or to implement a Standard Interface for which an
116 | implementation is available to the public in source code form. A
117 | "Major Component", in this context, means a major essential component
118 | (kernel, window system, and so on) of the specific operating system
119 | (if any) on which the executable work runs, or a compiler used to
120 | produce the work, or an object code interpreter used to run it.
121 |
122 | The "Corresponding Source" for a work in object code form means all
123 | the source code needed to generate, install, and (for an executable
124 | work) run the object code and to modify the work, including scripts to
125 | control those activities. However, it does not include the work's
126 | System Libraries, or general-purpose tools or generally available free
127 | programs which are used unmodified in performing those activities but
128 | which are not part of the work. For example, Corresponding Source
129 | includes interface definition files associated with source files for
130 | the work, and the source code for shared libraries and dynamically
131 | linked subprograms that the work is specifically designed to require,
132 | such as by intimate data communication or control flow between those
133 | subprograms and other parts of the work.
134 |
135 | The Corresponding Source need not include anything that users
136 | can regenerate automatically from other parts of the Corresponding
137 | Source.
138 |
139 | The Corresponding Source for a work in source code form is that
140 | same work.
141 |
142 | 2. Basic Permissions.
143 |
144 | All rights granted under this License are granted for the term of
145 | copyright on the Program, and are irrevocable provided the stated
146 | conditions are met. This License explicitly affirms your unlimited
147 | permission to run the unmodified Program. The output from running a
148 | covered work is covered by this License only if the output, given its
149 | content, constitutes a covered work. This License acknowledges your
150 | rights of fair use or other equivalent, as provided by copyright law.
151 |
152 | You may make, run and propagate covered works that you do not
153 | convey, without conditions so long as your license otherwise remains
154 | in force. You may convey covered works to others for the sole purpose
155 | of having them make modifications exclusively for you, or provide you
156 | with facilities for running those works, provided that you comply with
157 | the terms of this License in conveying all material for which you do
158 | not control copyright. Those thus making or running the covered works
159 | for you must do so exclusively on your behalf, under your direction
160 | and control, on terms that prohibit them from making any copies of
161 | your copyrighted material outside their relationship with you.
162 |
163 | Conveying under any other circumstances is permitted solely under
164 | the conditions stated below. Sublicensing is not allowed; section 10
165 | makes it unnecessary.
166 |
167 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
168 |
169 | No covered work shall be deemed part of an effective technological
170 | measure under any applicable law fulfilling obligations under article
171 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
172 | similar laws prohibiting or restricting circumvention of such
173 | measures.
174 |
175 | When you convey a covered work, you waive any legal power to forbid
176 | circumvention of technological measures to the extent such circumvention
177 | is effected by exercising rights under this License with respect to
178 | the covered work, and you disclaim any intention to limit operation or
179 | modification of the work as a means of enforcing, against the work's
180 | users, your or third parties' legal rights to forbid circumvention of
181 | technological measures.
182 |
183 | 4. Conveying Verbatim Copies.
184 |
185 | You may convey verbatim copies of the Program's source code as you
186 | receive it, in any medium, provided that you conspicuously and
187 | appropriately publish on each copy an appropriate copyright notice;
188 | keep intact all notices stating that this License and any
189 | non-permissive terms added in accord with section 7 apply to the code;
190 | keep intact all notices of the absence of any warranty; and give all
191 | recipients a copy of this License along with the Program.
192 |
193 | You may charge any price or no price for each copy that you convey,
194 | and you may offer support or warranty protection for a fee.
195 |
196 | 5. Conveying Modified Source Versions.
197 |
198 | You may convey a work based on the Program, or the modifications to
199 | produce it from the Program, in the form of source code under the
200 | terms of section 4, provided that you also meet all of these conditions:
201 |
202 | a) The work must carry prominent notices stating that you modified
203 | it, and giving a relevant date.
204 |
205 | b) The work must carry prominent notices stating that it is
206 | released under this License and any conditions added under section
207 | 7. This requirement modifies the requirement in section 4 to
208 | "keep intact all notices".
209 |
210 | c) You must license the entire work, as a whole, under this
211 | License to anyone who comes into possession of a copy. This
212 | License will therefore apply, along with any applicable section 7
213 | additional terms, to the whole of the work, and all its parts,
214 | regardless of how they are packaged. This License gives no
215 | permission to license the work in any other way, but it does not
216 | invalidate such permission if you have separately received it.
217 |
218 | d) If the work has interactive user interfaces, each must display
219 | Appropriate Legal Notices; however, if the Program has interactive
220 | interfaces that do not display Appropriate Legal Notices, your
221 | work need not make them do so.
222 |
223 | A compilation of a covered work with other separate and independent
224 | works, which are not by their nature extensions of the covered work,
225 | and which are not combined with it such as to form a larger program,
226 | in or on a volume of a storage or distribution medium, is called an
227 | "aggregate" if the compilation and its resulting copyright are not
228 | used to limit the access or legal rights of the compilation's users
229 | beyond what the individual works permit. Inclusion of a covered work
230 | in an aggregate does not cause this License to apply to the other
231 | parts of the aggregate.
232 |
233 | 6. Conveying Non-Source Forms.
234 |
235 | You may convey a covered work in object code form under the terms
236 | of sections 4 and 5, provided that you also convey the
237 | machine-readable Corresponding Source under the terms of this License,
238 | in one of these ways:
239 |
240 | a) Convey the object code in, or embodied in, a physical product
241 | (including a physical distribution medium), accompanied by the
242 | Corresponding Source fixed on a durable physical medium
243 | customarily used for software interchange.
244 |
245 | b) Convey the object code in, or embodied in, a physical product
246 | (including a physical distribution medium), accompanied by a
247 | written offer, valid for at least three years and valid for as
248 | long as you offer spare parts or customer support for that product
249 | model, to give anyone who possesses the object code either (1) a
250 | copy of the Corresponding Source for all the software in the
251 | product that is covered by this License, on a durable physical
252 | medium customarily used for software interchange, for a price no
253 | more than your reasonable cost of physically performing this
254 | conveying of source, or (2) access to copy the
255 | Corresponding Source from a network server at no charge.
256 |
257 | c) Convey individual copies of the object code with a copy of the
258 | written offer to provide the Corresponding Source. This
259 | alternative is allowed only occasionally and noncommercially, and
260 | only if you received the object code with such an offer, in accord
261 | with subsection 6b.
262 |
263 | d) Convey the object code by offering access from a designated
264 | place (gratis or for a charge), and offer equivalent access to the
265 | Corresponding Source in the same way through the same place at no
266 | further charge. You need not require recipients to copy the
267 | Corresponding Source along with the object code. If the place to
268 | copy the object code is a network server, the Corresponding Source
269 | may be on a different server (operated by you or a third party)
270 | that supports equivalent copying facilities, provided you maintain
271 | clear directions next to the object code saying where to find the
272 | Corresponding Source. Regardless of what server hosts the
273 | Corresponding Source, you remain obligated to ensure that it is
274 | available for as long as needed to satisfy these requirements.
275 |
276 | e) Convey the object code using peer-to-peer transmission, provided
277 | you inform other peers where the object code and Corresponding
278 | Source of the work are being offered to the general public at no
279 | charge under subsection 6d.
280 |
281 | A separable portion of the object code, whose source code is excluded
282 | from the Corresponding Source as a System Library, need not be
283 | included in conveying the object code work.
284 |
285 | A "User Product" is either (1) a "consumer product", which means any
286 | tangible personal property which is normally used for personal, family,
287 | or household purposes, or (2) anything designed or sold for incorporation
288 | into a dwelling. In determining whether a product is a consumer product,
289 | doubtful cases shall be resolved in favor of coverage. For a particular
290 | product received by a particular user, "normally used" refers to a
291 | typical or common use of that class of product, regardless of the status
292 | of the particular user or of the way in which the particular user
293 | actually uses, or expects or is expected to use, the product. A product
294 | is a consumer product regardless of whether the product has substantial
295 | commercial, industrial or non-consumer uses, unless such uses represent
296 | the only significant mode of use of the product.
297 |
298 | "Installation Information" for a User Product means any methods,
299 | procedures, authorization keys, or other information required to install
300 | and execute modified versions of a covered work in that User Product from
301 | a modified version of its Corresponding Source. The information must
302 | suffice to ensure that the continued functioning of the modified object
303 | code is in no case prevented or interfered with solely because
304 | modification has been made.
305 |
306 | If you convey an object code work under this section in, or with, or
307 | specifically for use in, a User Product, and the conveying occurs as
308 | part of a transaction in which the right of possession and use of the
309 | User Product is transferred to the recipient in perpetuity or for a
310 | fixed term (regardless of how the transaction is characterized), the
311 | Corresponding Source conveyed under this section must be accompanied
312 | by the Installation Information. But this requirement does not apply
313 | if neither you nor any third party retains the ability to install
314 | modified object code on the User Product (for example, the work has
315 | been installed in ROM).
316 |
317 | The requirement to provide Installation Information does not include a
318 | requirement to continue to provide support service, warranty, or updates
319 | for a work that has been modified or installed by the recipient, or for
320 | the User Product in which it has been modified or installed. Access to a
321 | network may be denied when the modification itself materially and
322 | adversely affects the operation of the network or violates the rules and
323 | protocols for communication across the network.
324 |
325 | Corresponding Source conveyed, and Installation Information provided,
326 | in accord with this section must be in a format that is publicly
327 | documented (and with an implementation available to the public in
328 | source code form), and must require no special password or key for
329 | unpacking, reading or copying.
330 |
331 | 7. Additional Terms.
332 |
333 | "Additional permissions" are terms that supplement the terms of this
334 | License by making exceptions from one or more of its conditions.
335 | Additional permissions that are applicable to the entire Program shall
336 | be treated as though they were included in this License, to the extent
337 | that they are valid under applicable law. If additional permissions
338 | apply only to part of the Program, that part may be used separately
339 | under those permissions, but the entire Program remains governed by
340 | this License without regard to the additional permissions.
341 |
342 | When you convey a copy of a covered work, you may at your option
343 | remove any additional permissions from that copy, or from any part of
344 | it. (Additional permissions may be written to require their own
345 | removal in certain cases when you modify the work.) You may place
346 | additional permissions on material, added by you to a covered work,
347 | for which you have or can give appropriate copyright permission.
348 |
349 | Notwithstanding any other provision of this License, for material you
350 | add to a covered work, you may (if authorized by the copyright holders of
351 | that material) supplement the terms of this License with terms:
352 |
353 | a) Disclaiming warranty or limiting liability differently from the
354 | terms of sections 15 and 16 of this License; or
355 |
356 | b) Requiring preservation of specified reasonable legal notices or
357 | author attributions in that material or in the Appropriate Legal
358 | Notices displayed by works containing it; or
359 |
360 | c) Prohibiting misrepresentation of the origin of that material, or
361 | requiring that modified versions of such material be marked in
362 | reasonable ways as different from the original version; or
363 |
364 | d) Limiting the use for publicity purposes of names of licensors or
365 | authors of the material; or
366 |
367 | e) Declining to grant rights under trademark law for use of some
368 | trade names, trademarks, or service marks; or
369 |
370 | f) Requiring indemnification of licensors and authors of that
371 | material by anyone who conveys the material (or modified versions of
372 | it) with contractual assumptions of liability to the recipient, for
373 | any liability that these contractual assumptions directly impose on
374 | those licensors and authors.
375 |
376 | All other non-permissive additional terms are considered "further
377 | restrictions" within the meaning of section 10. If the Program as you
378 | received it, or any part of it, contains a notice stating that it is
379 | governed by this License along with a term that is a further
380 | restriction, you may remove that term. If a license document contains
381 | a further restriction but permits relicensing or conveying under this
382 | License, you may add to a covered work material governed by the terms
383 | of that license document, provided that the further restriction does
384 | not survive such relicensing or conveying.
385 |
386 | If you add terms to a covered work in accord with this section, you
387 | must place, in the relevant source files, a statement of the
388 | additional terms that apply to those files, or a notice indicating
389 | where to find the applicable terms.
390 |
391 | Additional terms, permissive or non-permissive, may be stated in the
392 | form of a separately written license, or stated as exceptions;
393 | the above requirements apply either way.
394 |
395 | 8. Termination.
396 |
397 | You may not propagate or modify a covered work except as expressly
398 | provided under this License. Any attempt otherwise to propagate or
399 | modify it is void, and will automatically terminate your rights under
400 | this License (including any patent licenses granted under the third
401 | paragraph of section 11).
402 |
403 | However, if you cease all violation of this License, then your
404 | license from a particular copyright holder is reinstated (a)
405 | provisionally, unless and until the copyright holder explicitly and
406 | finally terminates your license, and (b) permanently, if the copyright
407 | holder fails to notify you of the violation by some reasonable means
408 | prior to 60 days after the cessation.
409 |
410 | Moreover, your license from a particular copyright holder is
411 | reinstated permanently if the copyright holder notifies you of the
412 | violation by some reasonable means, this is the first time you have
413 | received notice of violation of this License (for any work) from that
414 | copyright holder, and you cure the violation prior to 30 days after
415 | your receipt of the notice.
416 |
417 | Termination of your rights under this section does not terminate the
418 | licenses of parties who have received copies or rights from you under
419 | this License. If your rights have been terminated and not permanently
420 | reinstated, you do not qualify to receive new licenses for the same
421 | material under section 10.
422 |
423 | 9. Acceptance Not Required for Having Copies.
424 |
425 | You are not required to accept this License in order to receive or
426 | run a copy of the Program. Ancillary propagation of a covered work
427 | occurring solely as a consequence of using peer-to-peer transmission
428 | to receive a copy likewise does not require acceptance. However,
429 | nothing other than this License grants you permission to propagate or
430 | modify any covered work. These actions infringe copyright if you do
431 | not accept this License. Therefore, by modifying or propagating a
432 | covered work, you indicate your acceptance of this License to do so.
433 |
434 | 10. Automatic Licensing of Downstream Recipients.
435 |
436 | Each time you convey a covered work, the recipient automatically
437 | receives a license from the original licensors, to run, modify and
438 | propagate that work, subject to this License. You are not responsible
439 | for enforcing compliance by third parties with this License.
440 |
441 | An "entity transaction" is a transaction transferring control of an
442 | organization, or substantially all assets of one, or subdividing an
443 | organization, or merging organizations. If propagation of a covered
444 | work results from an entity transaction, each party to that
445 | transaction who receives a copy of the work also receives whatever
446 | licenses to the work the party's predecessor in interest had or could
447 | give under the previous paragraph, plus a right to possession of the
448 | Corresponding Source of the work from the predecessor in interest, if
449 | the predecessor has it or can get it with reasonable efforts.
450 |
451 | You may not impose any further restrictions on the exercise of the
452 | rights granted or affirmed under this License. For example, you may
453 | not impose a license fee, royalty, or other charge for exercise of
454 | rights granted under this License, and you may not initiate litigation
455 | (including a cross-claim or counterclaim in a lawsuit) alleging that
456 | any patent claim is infringed by making, using, selling, offering for
457 | sale, or importing the Program or any portion of it.
458 |
459 | 11. Patents.
460 |
461 | A "contributor" is a copyright holder who authorizes use under this
462 | License of the Program or a work on which the Program is based. The
463 | work thus licensed is called the contributor's "contributor version".
464 |
465 | A contributor's "essential patent claims" are all patent claims
466 | owned or controlled by the contributor, whether already acquired or
467 | hereafter acquired, that would be infringed by some manner, permitted
468 | by this License, of making, using, or selling its contributor version,
469 | but do not include claims that would be infringed only as a
470 | consequence of further modification of the contributor version. For
471 | purposes of this definition, "control" includes the right to grant
472 | patent sublicenses in a manner consistent with the requirements of
473 | this License.
474 |
475 | Each contributor grants you a non-exclusive, worldwide, royalty-free
476 | patent license under the contributor's essential patent claims, to
477 | make, use, sell, offer for sale, import and otherwise run, modify and
478 | propagate the contents of its contributor version.
479 |
480 | In the following three paragraphs, a "patent license" is any express
481 | agreement or commitment, however denominated, not to enforce a patent
482 | (such as an express permission to practice a patent or covenant not to
483 | sue for patent infringement). To "grant" such a patent license to a
484 | party means to make such an agreement or commitment not to enforce a
485 | patent against the party.
486 |
487 | If you convey a covered work, knowingly relying on a patent license,
488 | and the Corresponding Source of the work is not available for anyone
489 | to copy, free of charge and under the terms of this License, through a
490 | publicly available network server or other readily accessible means,
491 | then you must either (1) cause the Corresponding Source to be so
492 | available, or (2) arrange to deprive yourself of the benefit of the
493 | patent license for this particular work, or (3) arrange, in a manner
494 | consistent with the requirements of this License, to extend the patent
495 | license to downstream recipients. "Knowingly relying" means you have
496 | actual knowledge that, but for the patent license, your conveying the
497 | covered work in a country, or your recipient's use of the covered work
498 | in a country, would infringe one or more identifiable patents in that
499 | country that you have reason to believe are valid.
500 |
501 | If, pursuant to or in connection with a single transaction or
502 | arrangement, you convey, or propagate by procuring conveyance of, a
503 | covered work, and grant a patent license to some of the parties
504 | receiving the covered work authorizing them to use, propagate, modify
505 | or convey a specific copy of the covered work, then the patent license
506 | you grant is automatically extended to all recipients of the covered
507 | work and works based on it.
508 |
509 | A patent license is "discriminatory" if it does not include within
510 | the scope of its coverage, prohibits the exercise of, or is
511 | conditioned on the non-exercise of one or more of the rights that are
512 | specifically granted under this License. You may not convey a covered
513 | work if you are a party to an arrangement with a third party that is
514 | in the business of distributing software, under which you make payment
515 | to the third party based on the extent of your activity of conveying
516 | the work, and under which the third party grants, to any of the
517 | parties who would receive the covered work from you, a discriminatory
518 | patent license (a) in connection with copies of the covered work
519 | conveyed by you (or copies made from those copies), or (b) primarily
520 | for and in connection with specific products or compilations that
521 | contain the covered work, unless you entered into that arrangement,
522 | or that patent license was granted, prior to 28 March 2007.
523 |
524 | Nothing in this License shall be construed as excluding or limiting
525 | any implied license or other defenses to infringement that may
526 | otherwise be available to you under applicable patent law.
527 |
528 | 12. No Surrender of Others' Freedom.
529 |
530 | If conditions are imposed on you (whether by court order, agreement or
531 | otherwise) that contradict the conditions of this License, they do not
532 | excuse you from the conditions of this License. If you cannot convey a
533 | covered work so as to satisfy simultaneously your obligations under this
534 | License and any other pertinent obligations, then as a consequence you may
535 | not convey it at all. For example, if you agree to terms that obligate you
536 | to collect a royalty for further conveying from those to whom you convey
537 | the Program, the only way you could satisfy both those terms and this
538 | License would be to refrain entirely from conveying the Program.
539 |
540 | 13. Remote Network Interaction; Use with the GNU General Public License.
541 |
542 | Notwithstanding any other provision of this License, if you modify the
543 | Program, your modified version must prominently offer all users
544 | interacting with it remotely through a computer network (if your version
545 | supports such interaction) an opportunity to receive the Corresponding
546 | Source of your version by providing access to the Corresponding Source
547 | from a network server at no charge, through some standard or customary
548 | means of facilitating copying of software. This Corresponding Source
549 | shall include the Corresponding Source for any work covered by version 3
550 | of the GNU General Public License that is incorporated pursuant to the
551 | following paragraph.
552 |
553 | Notwithstanding any other provision of this License, you have
554 | permission to link or combine any covered work with a work licensed
555 | under version 3 of the GNU General Public License into a single
556 | combined work, and to convey the resulting work. The terms of this
557 | License will continue to apply to the part which is the covered work,
558 | but the work with which it is combined will remain governed by version
559 | 3 of the GNU General Public License.
560 |
561 | 14. Revised Versions of this License.
562 |
563 | The Free Software Foundation may publish revised and/or new versions of
564 | the GNU Affero General Public License from time to time. Such new versions
565 | will be similar in spirit to the present version, but may differ in detail to
566 | address new problems or concerns.
567 |
568 | Each version is given a distinguishing version number. If the
569 | Program specifies that a certain numbered version of the GNU Affero General
570 | Public License "or any later version" applies to it, you have the
571 | option of following the terms and conditions either of that numbered
572 | version or of any later version published by the Free Software
573 | Foundation. If the Program does not specify a version number of the
574 | GNU Affero General Public License, you may choose any version ever published
575 | by the Free Software Foundation.
576 |
577 | If the Program specifies that a proxy can decide which future
578 | versions of the GNU Affero General Public License can be used, that proxy's
579 | public statement of acceptance of a version permanently authorizes you
580 | to choose that version for the Program.
581 |
582 | Later license versions may give you additional or different
583 | permissions. However, no additional obligations are imposed on any
584 | author or copyright holder as a result of your choosing to follow a
585 | later version.
586 |
587 | 15. Disclaimer of Warranty.
588 |
589 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
590 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
591 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
592 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
593 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
594 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
595 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
596 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
597 |
598 | 16. Limitation of Liability.
599 |
600 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
601 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
602 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
603 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
604 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
605 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
606 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
607 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
608 | SUCH DAMAGES.
609 |
610 | 17. Interpretation of Sections 15 and 16.
611 |
612 | If the disclaimer of warranty and limitation of liability provided
613 | above cannot be given local legal effect according to their terms,
614 | reviewing courts shall apply local law that most closely approximates
615 | an absolute waiver of all civil liability in connection with the
616 | Program, unless a warranty or assumption of liability accompanies a
617 | copy of the Program in return for a fee.
618 |
619 | END OF TERMS AND CONDITIONS
620 |
621 | How to Apply These Terms to Your New Programs
622 |
623 | If you develop a new program, and you want it to be of the greatest
624 | possible use to the public, the best way to achieve this is to make it
625 | free software which everyone can redistribute and change under these terms.
626 |
627 | To do so, attach the following notices to the program. It is safest
628 | to attach them to the start of each source file to most effectively
629 | state the exclusion of warranty; and each file should have at least
630 | the "copyright" line and a pointer to where the full notice is found.
631 |
632 |
633 | Copyright (C)
634 |
635 | This program is free software: you can redistribute it and/or modify
636 | it under the terms of the GNU Affero General Public License as published by
637 | the Free Software Foundation, either version 3 of the License, or
638 | (at your option) any later version.
639 |
640 | This program is distributed in the hope that it will be useful,
641 | but WITHOUT ANY WARRANTY; without even the implied warranty of
642 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
643 | GNU Affero General Public License for more details.
644 |
645 | You should have received a copy of the GNU Affero General Public License
646 | along with this program. If not, see <http://www.gnu.org/licenses/>.
647 |
648 | Also add information on how to contact you by electronic and paper mail.
649 |
650 | If your software can interact with users remotely through a computer
651 | network, you should also make sure that it provides a way for users to
652 | get its source. For example, if your program is a web application, its
653 | interface could display a "Source" link that leads users to an archive
654 | of the code. There are many ways you could offer source, and different
655 | solutions will be better for different programs; see section 13 for the
656 | specific requirements.
657 |
658 | You should also get your employer (if you work as a programmer) or school,
659 | if any, to sign a "copyright disclaimer" for the program, if necessary.
660 | For more information on this, and how to apply and follow the GNU AGPL, see
661 | <http://www.gnu.org/licenses/>.
662 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | pystemon
2 | ========
3 | Monitoring tool for PasteBin-alike sites written in Python
4 |
5 | Copyleft AGPLv3 - Christophe Vandeplas - christophe@vandeplas.com
6 | Feel free to use the code, but please share the changes you've made
7 |
8 | Features:
9 | ---------
10 | * search for regular expressions in pasties
11 | * flexible design, minimal effort to add another paste* site
12 | * use custom download functions for complex pastie sites
13 | * uses multiple threads per unique site to download the pastes
14 | * waits a random time (within a range) before downloading the latest pastes, time customizable per site
15 | * (optional) only trigger on X hits in the same pastie
16 | * (optional) exclude matching pasties if exclusion regex matches
17 | * (optional) allow additional email recipients per search pattern
18 | * (optional) uses random User-Agents
19 | * (optional) uses random proxies
20 | * removes a proxy if it is unreliable (fails 5 times)
21 | * (optional) compress saved files with Gzip. (no zip to limit external dependencies)
22 |
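The optional `count` and `exclude` behaviour listed above can be sketched in a few lines. This is a minimal, standalone Python 3 illustration rather than the code from `pystemon.py`: the function name `rule_triggers` is hypothetical, and a rule is assumed to be a dict with a `search` key plus optional `count` and `exclude` keys, mirroring the entries under `search` in the YAML configuration.

```python
import re


def rule_triggers(rule, content, flags=re.IGNORECASE):
    """Return True if `content` should raise an alert for `rule`.

    `rule` is assumed to look like one entry of the `search` list:
    {'search': <regex>, 'count': <min hits>, 'exclude': <regex>}
    where 'count' and 'exclude' are optional.
    """
    hits = re.findall(rule['search'], content, flags)
    if not hits:
        return False
    # (optional) only trigger on X hits in the same pastie
    if 'count' in rule and len(hits) < int(rule['count']):
        return False
    # (optional) drop the match if the exclusion regex also matches
    if 'exclude' in rule and re.search(rule['exclude'], content, flags):
        return False
    return True


rule = {'search': r'password', 'count': 2, 'exclude': r'example\.com'}
print(rule_triggers(rule, "password: a\nPassword: b"))       # True
print(rule_triggers(rule, "password: a"))                    # False
print(rule_triggers(rule, "password password example.com"))  # False
```

Matching is case-insensitive by default here, which is why `Password` counts as a second hit in the first call.
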
23 | Python Dependencies
24 | -------------------
25 | * PyYAML
26 | * requests
27 | * redis
28 |
29 | Limitations:
30 | ------------
31 | * Only HTTP proxies are allowed
32 | * Only HTTP urls will use proxies
33 |
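The second limitation follows from how a proxy is attached to a request: only an entry for the `http` scheme is set, so `https://` URLs are fetched directly. A minimal sketch, assuming a `requests`-style proxy mapping; the helper name `build_proxies` and the proxy address are hypothetical.

```python
import random


def build_proxies(proxy_list):
    """Return a requests-style proxy mapping that covers only http:// URLs."""
    if not proxy_list:
        return {}  # no proxies configured: connect directly
    # Only the 'http' scheme is mapped, so https:// requests bypass the proxy.
    return {'http': random.choice(proxy_list)}


# Hypothetical proxy address, for illustration only.
print(build_proxies(['http://127.0.0.1:8080']))
```

The resulting mapping can be passed as the `proxies` argument of `requests.get`, or assigned to the `proxies` attribute of a `requests.Session`.
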
34 | Usage
35 | ------
36 | ```
37 | Usage: pystemon.py [options]
38 | Options:
39 | -h, --help show this help message and exit
40 | -c FILE, --config=FILE
41 | load configuration from file
42 | -d, --daemon runs in background as a daemon (NOT IMPLEMENTED)
43 | -s, --stats display statistics about the running threads (NOT IMPLEMENTED)
44 | -v outputs more information
45 |
46 | Default configuration file: /etc/pystemon.yaml or pystemon.yaml in current directory
47 | ```
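
The default lookup described in the help text could be implemented along these lines. This is a sketch only: the helper name `find_config` is hypothetical, and checking `/etc/pystemon.yaml` before the current directory is an assumption, since the help text does not say which location takes precedence.

```python
import os


def find_config(explicit=None, candidates=('/etc/pystemon.yaml', 'pystemon.yaml')):
    """Return the configuration file to load, or None if none is found.

    A path given with -c/--config wins; otherwise the first existing
    candidate is used (the order of the candidates is an assumption).
    """
    if explicit:
        return explicit
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


print(find_config('custom.yaml'))  # an explicit path is returned unchanged
```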
48 |
--------------------------------------------------------------------------------
/pystemon.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # encoding: utf-8
3 |
4 | '''
5 | @author: Christophe Vandeplas
6 | @copyright: AGPLv3
7 | http://www.gnu.org/licenses/agpl.html
8 |
9 | To be implemented:
10 | - FIXME set all the config options in the class variables
11 | - FIXME validate parsing of config file
12 | - FIXME use syslog logging
13 | - TODO runs as a daemon in background
14 | - TODO save files in separate directories depending on the day/week/month. Try to avoid duplicate files
15 | '''
16 |
17 | try:
18 | from queue import Queue
19 | except ImportError:
20 | from Queue import Queue
21 | from collections import deque
22 | from datetime import datetime
23 | try:
24 | from email.mime.multipart import MIMEMultipart
25 | except ImportError:
26 | from email.MIMEMultipart import MIMEMultipart
27 |
28 | try:
29 | from email.mime.text import MIMEText
30 | except ImportError:
31 | from email.MIMEText import MIMEText
32 | import gzip
33 | import hashlib
34 | import logging.handlers
35 | import optparse
36 | import os
37 | import random
38 | import re
39 | import smtplib
40 | import socket
41 | import sys
42 | import traceback
43 | import threading
44 | import time
45 | import urllib
46 | import urllib2
47 | import httplib
48 | import ssl
49 | from io import open
50 | import requests
51 |
52 | # try:
53 | # from urllib.error import HTTPError, URLError
54 | # except ImportError:
55 | # from urllib2 import HTTPError, URLError
56 |
57 | try:
58 | import redis
59 | except ImportError:
60 | exit('ERROR: Cannot import the redis Python library. Are you sure it is installed?')
61 |
62 | try:
63 | import yaml
64 | except ImportError:
65 | exit('ERROR: Cannot import the yaml Python library. Are you sure it is installed?')
66 |
67 | try:
68 | if sys.version_info < (2, 7):
69 | exit('You need python version 2.7 or newer.')
70 | except Exception as exc:
71 | exit('You need python version 2.7 or newer.')
72 |
73 | retries_paste = 3
74 | retries_client = 5
75 | retries_server = 100
76 |
77 | socket.setdefaulttimeout(10) # set a default timeout of 10 seconds to download the page (default = unlimited)
78 | true_socket = socket.socket
79 |
80 |
81 | def make_bound_socket(source_ip):
82 | def bound_socket(*a, **k):
83 | sock = true_socket(*a, **k)
84 | sock.bind((source_ip, 0))
85 | return sock
86 | return bound_socket
87 |
88 |
89 | class PastieSite(threading.Thread):
90 | '''
91 | Instances of these threads are responsible for downloading the list of
92 | the most recent pastes and adding them to the download queue.
93 | '''
94 | def __init__(self, name, download_url, archive_url, archive_regex):
95 | threading.Thread.__init__(self)
96 | self.kill_received = False
97 |
98 | self.name = name
99 | self.download_url = download_url
100 | self.archive_url = archive_url
101 | self.archive_regex = archive_regex
102 | try:
103 | self.ip_addr = yamlconfig['network']['ip']
104 | socket.socket = make_bound_socket(self.ip_addr)
105 | except Exception as exc:
106 | logger.debug("Using default IP address")
107 |
108 | self.save_dir = yamlconfig['archive']['dir'] + os.sep + name
109 | self.archive_dir = yamlconfig['archive']['dir-all'] + os.sep + name
110 | if yamlconfig['archive']['save'] and not os.path.exists(self.save_dir):
111 | os.makedirs(self.save_dir)
112 | if yamlconfig['archive']['save-all'] and not os.path.exists(self.archive_dir):
113 | os.makedirs(self.archive_dir)
114 | self.archive_compress = yamlconfig['archive']['compress']
115 | self.update_max = 30 # TODO set by config file
116 | self.update_min = 10 # TODO set by config file
117 | self.pastie_classname = None
118 | self.seen_pasties = deque('', 1000) # max number of pasties ids in memory
119 |
120 | def run(self):
121 | while not self.kill_received:
122 | sleep_time = random.randint(self.update_min, self.update_max)
123 | try:
124 | # grabs site from queue
125 | logger.info(
126 | 'Downloading list of new pastes from {name}. '
127 | 'Will check again in {time} seconds'.format(
128 | name=self.name, time=sleep_time))
129 | # get the list of last pasties, but reverse it
130 | # so we first have the old entries and then the new ones
131 | last_pasties = self.get_last_pasties()
132 | if last_pasties:
133 | for pastie in reversed(last_pasties):
134 | queues[self.name].put(pastie) # add pastie to queue
135 | logger.info("Found {amount} new pasties for site {site}. There are now {qsize} pasties to be downloaded.".format(amount=len(last_pasties),
136 | site=self.name,
137 | qsize=queues[self.name].qsize()))
138 | # catch unknown errors
139 | except Exception as e:
140 | msg = 'Thread for {name} crashed unexpectedly, '\
141 | 'recovering...: {e}'.format(name=self.name, e=e)
142 | logger.error(msg)
143 | logger.debug(traceback.format_exc())
144 | time.sleep(sleep_time)
145 |
146 | def get_last_pasties(self):
147 | # reset the pasties list
148 | pasties = []
149 | # populate queue with data
150 | response = download_url(self.archive_url)
151 | htmlPage = response.text if response is not None else None  # download_url() returns None once its retries are exhausted
152 | if not htmlPage:
153 | logger.warning("No HTML content for page {url}".format(url=self.archive_url))
154 | return False
155 | pasties_ids = re.findall(self.archive_regex, htmlPage)
156 | if pasties_ids:
157 | for pastie_id in pasties_ids:
158 | # check if the pastie was already downloaded
159 | # and remember that we've seen it
160 | if self.seen_pastie(pastie_id):
161 | # do not append the seen things again in the queue
162 | continue
163 | # pastie was not downloaded yet. Add it to the queue
164 | if self.pastie_classname:
165 | class_name = globals()[self.pastie_classname]
166 | pastie = class_name(self, pastie_id)
167 | else:
168 | pastie = Pastie(self, pastie_id)
169 | pasties.append(pastie)
170 | return pasties
171 | if "DOES NOT HAVE ACCESS" in htmlPage.encode('utf8'):
172 | print("Problem with configured IP address")
173 |
174 | logger.error("No last pasties matches for regular expression site:{site} regex:{regex}. Error in your regex? Dumping htmlPage \n {html}".format(site=self.name, regex=self.archive_regex, html=htmlPage.encode('utf8')))
175 | return False
176 |
177 | def seen_pastie(self, pastie_id):
178 | ''' check if the pastie was already downloaded. '''
179 | # first look in memory if we have already seen this pastie
180 | if self.seen_pasties.count(pastie_id):
181 | return True
182 | # look on the filesystem. # LATER remove this filesystem lookup as it will give problems on long term
183 | if yamlconfig['archive']['save-all']:
184 | # check if the pastie was already saved on the disk
185 | if os.path.exists(verify_directory_exists(self.archive_dir) + os.sep + self.pastie_id_to_filename(pastie_id)):
186 | return True
187 | # TODO look in the database if it was already seen
188 |
189 | def seen_pastie_and_remember(self, pastie):
190 | '''
191 | Check if the pastie was already downloaded
192 | and remember that we've seen it
193 | '''
194 | seen = False
195 | if self.seen_pastie(pastie.id):
196 | seen = True
197 | else:
198 | # We have not yet seen the pastie.
199 | # Keep in memory that we've seen it using
200 | # appendleft for performance reasons.
201 | # (faster later when we iterate over the deque)
202 | self.seen_pasties.appendleft(pastie.id)
203 | # add / update the pastie in the database
204 | if db:
205 | db.queue.put(pastie)
206 | return seen
207 |
208 | def pastie_id_to_filename(self, pastie_id):
209 | filename = pastie_id.replace('/', '_')
210 | if self.archive_compress:
211 | filename = filename + ".gz"
212 | return filename
213 |
214 |
215 | def verify_directory_exists(directory):
216 | d = datetime.now()
217 | year = str(d.year)
218 | month = str(d.month)
219 | # prefix month and day with "0" if it is only one digit
220 | if len(month) < 2:
221 | month = "0" + month
222 | day = str(d.day)
223 | if len(day) < 2:
224 | day = "0" + day
225 | fullpath = directory + os.sep + year + os.sep + month + os.sep + day
226 | if not os.path.isdir(fullpath):
227 | os.makedirs(fullpath)
228 | return fullpath
229 |
230 |
231 | class Pastie():
232 | def __init__(self, site, pastie_id):
233 | self.site = site
234 | self.id = pastie_id
235 | self.pastie_content = None
236 | self.matches = []
237 | self.md5 = None
238 | self.url = self.site.download_url.format(id=self.id)
239 | self.public = False
240 |
241 | def hash_pastie(self):
242 | if self.pastie_content:
243 | try:
244 | self.md5 = hashlib.md5(self.pastie_content).hexdigest()
245 | logger.debug('Pastie {site} {id} has md5: "{md5}"'.format(site=self.site.name, id=self.id, md5=self.md5))
246 | except Exception as e:
247 | logger.error('Pastie {site} {id} md5 problem: {e}'.format(site=self.site.name, id=self.id, e=e))
248 |
249 | def fetch_pastie(self):
250 | response = download_url(self.url)
251 | self.pastie_content = response.content if response is not None else None  # download_url() returns None once its retries are exhausted
252 | return self.pastie_content
253 |
254 |     def save_pastie(self, directory):
255 |         if not self.pastie_content:
256 |             raise SystemExit('BUG: Content not set, cannot save')
257 |         full_path = verify_directory_exists(directory) + os.sep + self.site.pastie_id_to_filename(self.id)
258 |         if yamlconfig['redis']['queue']:
259 |             r = redis.StrictRedis(host=yamlconfig['redis']['server'], port=yamlconfig['redis']['port'], db=yamlconfig['redis']['database'])
260 |         if self.site.archive_compress:
261 |             f = gzip.open(full_path, 'wb')
262 |             f.write(self.pastie_content)  # raw response bytes, written unchanged
263 |             f.flush()
264 |             os.fsync(f.fileno())
265 |             f.close()
266 |         else:
267 |             f = open(full_path, 'wb')  # binary mode: io.open in text mode rejects byte strings
268 |             f.write(self.pastie_content)
269 |             f.flush()
270 |             os.fsync(f.fileno())
271 |             f.close()
272 |         if yamlconfig['redis']['queue']:
273 |             time.sleep(3)
274 |             r.lpush('pastes', full_path)
275 | # with gzip.open(full_path, 'wb') as f:
276 | # f.write(self.pastie_content)
277 | # if yamlconfig['redis']['queue']:
278 | # r.lpush('pastes', full_path)
279 | # else:
280 | # with open(full_path, 'wb') as f:
281 | # f.write(self.pastie_content)
282 | # if yamlconfig['redis']['queue']:
283 | # r.lpush('pastes', full_path)
284 |
285 | def fetch_and_process_pastie(self):
286 | # double check if the pastie was already downloaded,
287 | # and remember that we've seen it
288 | if self.site.seen_pastie(self.id):
289 | return None
290 | # download pastie
291 | self.fetch_pastie()
292 | # save the pastie on the disk
293 | if self.pastie_content:
294 | # take checksum
295 | self.hash_pastie()
296 | # keep in memory that the pastie was seen successfully
297 | self.site.seen_pastie_and_remember(self)
298 | # Save pastie to archive dir if configured
299 | if yamlconfig['archive']['save-all']:
300 | self.save_pastie(self.site.archive_dir)
301 | # search for data in pastie
302 | self.search_content()
303 | return self.pastie_content
304 |
305 | def search_content(self):
306 | if not self.pastie_content:
307 | raise SystemExit('BUG: Content not set, cannot search')
308 | return False
309 | # search for the regexes in the htmlPage
310 | for regex in yamlconfig['search']:
311 | # LATER first compile regex, then search using compiled version
312 | regex_flags = re.IGNORECASE
313 | if 'regex-flags' in regex:
314 | regex_flags = eval(regex['regex-flags'])
315 | m = re.findall(regex['search'].encode(), self.pastie_content, regex_flags)
316 | if m:
317 | # the regex matches the text
318 | # ignore if not enough counts
319 | if 'count' in regex and len(m) < int(regex['count']):
320 | continue
321 | # ignore if exclude
322 | if 'exclude' in regex and re.search(regex['exclude'].encode(), self.pastie_content, regex_flags):
323 | continue
324 | # we have a match, add to match list
325 | self.matches.append(regex)
326 | if 'public' in regex:
327 | self.public = regex['public']
328 | else:
329 | self.public = False
330 | if self.matches:
331 | self.action_on_match()
332 |
333 | def action_on_match(self):
334 | msg = 'Found hit for {matches} in pastie {url}'.format(
335 | matches=self.matches_to_text(), url=self.url)
336 | logger.info(msg)
337 | # store info in DB
338 | if db:
339 | db.queue.put(self)
340 | # Save pastie to disk if configured
341 | if yamlconfig['archive']['save']:
342 | self.save_pastie(self.site.save_dir)
343 | # Send email alert if configured
344 | if yamlconfig['email']['alert']:
345 | self.send_email_alert()
346 |
347 |     def matches_to_text(self):
348 |         descriptions = []
349 |         for match in self.matches:
350 |             if 'description' in match:
351 |                 descriptions.append(match['description'])
352 |             else:
353 |                 descriptions.append(match['search'])
354 |         if descriptions:
355 |             return '[{}]'.format(', '.join(descriptions))  # a list has no decode(); join the strings directly
356 |         else:
357 |             return ''
358 |
359 |     def matches_to_regex(self):
360 |         descriptions = []
361 |         for match in self.matches:
362 |             descriptions.append(match['search'])
363 |         if descriptions:
364 |             return '[{}]'.format(', '.join(descriptions))  # a list has no decode(); join the strings directly
365 |         else:
366 |             return ''
367 |
368 | def send_email_alert(self):
369 | msg = MIMEMultipart()
370 | if self.public:
371 | alert = "Found hit for {matches} in pastie {url}".format(matches=self.matches_to_text(), url=self.url)
372 | else:
373 | alert = "Found hit in pastie {url}".format(url=self.url)
374 | # headers
375 | msg['Subject'] = yamlconfig['email']['subject'].format(subject=alert)
376 | msg['From'] = yamlconfig['email']['from']
377 | # build the list of recipients
378 | recipients = []
379 | recipients.append(yamlconfig['email']['to']) # first the global alert email
380 | for match in self.matches: # per match, the custom additional email
381 | if 'to' in match and match['to']:
382 | recipients.extend(match['to'].split(","))
383 | msg['Bcc'] = ','.join(recipients) # here the list needs to be comma separated
384 | # message body including full paste rather than attaching it
385 | message = '''
386 | I found a hit for a regular expression on one of the pastebin sites.
387 |
388 | The site where the paste came from : {site}
389 | The original paste was located here: {url}
390 | And the regular expressions that matched: [redacted]
391 |
392 | Below (after newline) is the content of the pastie:
393 |
394 | {content}
395 |
396 | '''.format(site=self.site.name, url=self.url, content=self.pastie_content.encode('utf8'))
397 | # '''.format(site=self.site.name, url=self.url, matches=self.matches_to_regex(), content=self.pastie_content.encode('utf8'))
398 | msg.attach(MIMEText(message))
399 | # send out the mail
400 | try:
401 | s = smtplib.SMTP(yamlconfig['email']['server'], yamlconfig['email']['port'])
402 | # login to the SMTP server if configured
403 | if 'username' in yamlconfig['email'] and yamlconfig['email']['username']:
404 | s.login(yamlconfig['email']['username'], yamlconfig['email']['password'])
405 | # send the mail
406 | s.sendmail(yamlconfig['email']['from'], recipients, msg.as_string())
407 | s.close()
408 | except smtplib.SMTPException as e:
409 | logger.error("ERROR: unable to send email: {0}".format(e))
410 | except Exception as e:
411 | logger.error("ERROR: unable to send email. Are your email settings correct?: {e}".format(e=e))
412 |
413 |
414 | class ThreadPasties(threading.Thread):
415 | '''
416 | Instances of these threads are responsible for downloading the pastes
417 | found in the queue.
418 | '''
419 | def __init__(self, queue, queue_name):
420 | threading.Thread.__init__(self)
421 | self.queue = queue
422 | self.name = queue_name
423 | self.kill_received = False
424 |
425 | def run(self):
426 | while not self.kill_received:
427 | try:
428 | # grabs pastie from queue
429 | pastie = self.queue.get()
430 | pastie_content = pastie.fetch_and_process_pastie()
431 | logger.debug("Queue {name} size: {size}".format(
432 | size=self.queue.qsize(), name=self.name))
433 | if pastie_content:
434 | logger.debug(
435 | "Saved new pastie from {0} "
436 | "with id {1}".format(self.name, pastie.id))
437 | else:
438 | # pastie already downloaded OR error ?
439 | pass
440 | # signals to queue job is done
441 | self.queue.task_done()
442 | # catch unknown errors
443 | except Exception as e:
444 | msg = "ThreadPasties for {name} crashed unexpectedly, "\
445 | "recovering...: {e}".format(name=self.name, e=e)
446 | logger.error(msg)
447 | logger.debug(traceback.format_exc())
448 |
449 |
450 | def main():
451 | global queues
452 | global threads
453 | global db
454 | queues = {}
455 | threads = []
456 |
457 | # start a thread to handle the DB data
458 | db = None
459 | if yamlconfig['db'] and yamlconfig['db']['sqlite3'] and yamlconfig['db']['sqlite3']['enable']:
460 | try:
461 | global sqlite3
462 | import sqlite3
463 | except Exception as exc:
464 | exit('ERROR: Cannot import the sqlite3 Python library. Are you sure it is compiled in python?')
465 | db = Sqlite3Database(yamlconfig['db']['sqlite3']['file'])
466 | db.setDaemon(True)
467 | threads.append(db)
468 | db.start()
469 | # test()
470 | # Build array of enabled sites.
471 | sites_enabled = []
472 | for site in yamlconfig['site']:
473 | if yamlconfig['site'][site]['enable']:
474 | print("Site: {} is enabled, adding to pool...".format(site))
475 | sites_enabled.append(site)
476 | elif not yamlconfig['site'][site]['enable']:
477 | print("Site: {} is disabled.".format(site))
478 | else:
479 | print("Site: {} is not enabled or disabled in config file. We just assume it disabled.".format(site))
480 | # spawn a pool of threads per PastieSite, and pass them a queue instance
481 | for site in sites_enabled:
482 | queues[site] = Queue()
483 | for i in range(yamlconfig['threads']):
484 | t = ThreadPasties(queues[site], site)
485 | t.setDaemon(True)
486 | threads.append(t)
487 | t.start()
488 |
489 | # build threads to download the last pasties
490 | for site_name in sites_enabled:
491 | t = PastieSite(site_name,
492 | yamlconfig['site'][site_name]['download-url'],
493 | yamlconfig['site'][site_name]['archive-url'],
494 | yamlconfig['site'][site_name]['archive-regex'])
495 | if 'update-min' in yamlconfig['site'][site_name] and yamlconfig['site'][site_name]['update-min']:
496 | t.update_min = yamlconfig['site'][site_name]['update-min']
497 | if 'update-max' in yamlconfig['site'][site_name] and yamlconfig['site'][site_name]['update-max']:
498 | t.update_max = yamlconfig['site'][site_name]['update-max']
499 | if 'pastie-classname' in yamlconfig['site'][site_name] and yamlconfig['site'][site_name]['pastie-classname']:
500 | t.pastie_classname = yamlconfig['site'][site_name]['pastie-classname']
501 | threads.append(t)
502 | t.setDaemon(True)
503 | t.start()
504 |
505 | # wait while all the threads are running and someone sends CTRL+C
506 | while True:
507 | try:
508 | for t in threads:
509 | t.join(1)
510 | except KeyboardInterrupt:
511 | print('')
512 | print("Ctrl-c received! Sending kill to threads...")
513 | for t in threads:
514 | t.kill_received = True
515 | exit(0) # quit immediately
516 |
517 |
518 | user_agents_list = []
519 |
520 |
521 | def load_user_agents_from_file(filename):
522 |     global user_agents_list
523 |     try:
524 |         f = open(filename)
525 |     except Exception as e:
526 |         logger.error('Configuration problem: user-agent-file "{file}" not found or not readable: {e}'.format(file=filename, e=e))
527 |         return  # nothing to read; avoids a NameError on the undefined file handle
528 |     for line in f:
529 |         if line.strip():
530 |             user_agents_list.append(line.strip())
531 |     logger.debug('Found {count} UserAgents in file "{file}"'.format(file=filename, count=len(user_agents_list)))
532 |
533 |
534 | def get_random_user_agent():
535 | global user_agents_list
536 | if user_agents_list:
537 | return random.choice(user_agents_list)
538 | return None
539 |
540 |
541 | proxies_failed = []
542 | proxies_lock = threading.Lock()
543 | proxies_list = []
544 |
545 |
546 | def load_proxies_from_file(filename):
547 |     global proxies_list
548 |     try:
549 |         f = open(filename)
550 |     except Exception as e:
551 |         logger.error('Configuration problem: proxyfile "{file}" not found or not readable: {e}'.format(file=filename, e=e))
552 |         return  # nothing to read; avoids a NameError on the undefined file handle
553 |     for line in f:
554 |         if line.strip():  # LATER verify if the proxy line has the correct structure
555 |             proxies_list.append(line.strip())
556 |     logger.debug('Found {count} proxies in file "{file}"'.format(file=filename, count=len(proxies_list)))
557 |
558 |
559 | def get_random_proxy():
560 | global proxies_list
561 | proxy = None
562 | proxies_lock.acquire()
563 | if proxies_list:
564 | proxy = random.choice(proxies_list)
565 | proxies_lock.release()
566 | return proxy
567 |
568 |
569 | def failed_proxy(proxy):
570 | proxies_failed.append(proxy)
571 | if proxies_failed.count(proxy) >= 2 and proxies_list.count(proxy) >= 1:
572 | logger.info("Removing proxy {0} from proxy list because of too many errors.".format(proxy))
573 | proxies_lock.acquire()
574 | proxies_list.remove(proxy)
575 | proxies_lock.release()
576 |
577 |
578 | class NoRedirectHandler(urllib2.HTTPRedirectHandler):
579 | '''
580 | This class is only necessary to not follow HTTP redirects in webpages.
581 | It is used by the download_url() function
582 | '''
583 | def http_error_302(self, req, fp, code, msg, headers):
584 | infourl = urllib2.addinfourl(fp, headers, req.get_full_url())
585 | infourl.status = code
586 | infourl.code = code
587 | return infourl
588 | http_error_301 = http_error_303 = http_error_307 = http_error_302
589 |
590 |
591 | class TLS1Connection(httplib.HTTPSConnection):
592 | """Like HTTPSConnection but more specific"""
593 | def __init__(self, host, **kwargs):
594 | httplib.HTTPSConnection.__init__(self, host, **kwargs)
595 |
596 | def connect(self):
597 | """Overrides HTTPSConnection.connect to specify TLS version"""
598 | # Standard implementation from HTTPSConnection, which is not
599 | # designed for extension, unfortunately
600 | sock = socket.create_connection(
601 | (self.host, self.port),
602 | self.timeout, self.source_address)
603 | if getattr(self, '_tunnel_host', None):
604 | self.sock = sock
605 | self._tunnel()
606 |
607 | # This is the only difference; default wrap_socket uses SSLv23
608 | self.sock = ssl.wrap_socket(
609 | sock, self.key_file, self.cert_file,
610 | ssl_version=ssl.PROTOCOL_TLSv1)
611 |
612 |
613 | class TLS1Handler(urllib2.HTTPSHandler):
614 | """Like HTTPSHandler but more specific"""
615 | def __init__(self):
616 | urllib2.HTTPSHandler.__init__(self)
617 |
618 | def https_open(self, req):
619 | return self.do_open(TLS1Connection, req)
620 |
621 |
622 | def download_url(url, data=None, cookie=None, loop_client=0, loop_server=0, loop_paste=0):
623 | # Client errors (40x): if more than 5 recursions, give up on URL (used for the 404 case)
624 | if loop_client >= retries_client:
625 | return None
626 | # Server errors (50x): if more than 100 recursions, give up on URL
627 | if loop_server >= retries_server:
628 | return None
629 |
630 | session = requests.Session()
631 | random_proxy = get_random_proxy()
632 | if random_proxy:
633 | session.proxies = {'http': random_proxy}
634 | user_agent = get_random_user_agent()
635 | session.headers.update({'User-Agent': user_agent, 'Accept-Charset': 'utf-8'})
636 | if cookie:
637 | session.headers.update({'Cookie': cookie})
638 | if data:
639 | session.headers.update(data)
640 | logger.debug('Downloading url: {url} with proxy: {proxy} and user-agent: {ua}'.format(url=url, proxy=random_proxy, ua=user_agent))
641 | try:
642 | opener = None
643 | # urllib2.install_opener(urllib2.build_opener(TLS1Handler()))
644 |
645 | # Random Proxy if set in config
646 | random_proxy = get_random_proxy()
647 | if random_proxy:
648 | proxyh = urllib2.ProxyHandler({'http': random_proxy})
649 | opener = urllib2.build_opener(proxyh, NoRedirectHandler())
650 | # We need to create an opener if it didn't exist yet
651 | if not opener:
652 | opener = urllib2.build_opener(NoRedirectHandler())
653 | # Random User-Agent if set in config
654 | user_agent = get_random_user_agent()
655 | opener.addheaders = [('Accept-Charset', 'utf-8')]
656 | if user_agent:
657 | opener.addheaders.append(('User-Agent', user_agent))
658 | if cookie:
659 | opener.addheaders.append(('Cookie', cookie))
660 | logger.debug(
661 | 'Downloading url: {url} with proxy: {proxy} and user-agent: {ua}'.format(
662 | url=url, proxy=random_proxy, ua=user_agent))
663 | if data:
664 | response = opener.open(url, data)
665 | else:
666 | response = opener.open(url)
667 | htmlPage = unicode(response.read(), errors='replace')
668 | if 'File is not ready for scraping yet. Try again in 1 minute.' in htmlPage:
669 | if loop_paste >= retries_paste:
670 | logger.warning("Tried to scrape too early for {url}, giving up and saving current content".format(url=url))
671 | return htmlPage, response.headers
672 | else:
673 | loop_paste += 1
674 | logger.warning("Tried to scrape too early for {url}, trying again in 60s ({nb}/{total})".format(url=url, nb=loop_paste, total=retries_paste))
675 | time.sleep(60)
676 | return download_url(url, loop_paste=loop_paste)
677 | return htmlPage, response.headers
678 | except urllib2.HTTPError, e:
679 | failed_proxy(random_proxy)
680 | logger.warning("!!Proxy error on {url} for proxy {proxy}.".format(url=url, proxy=random_proxy))
681 | if 404 == e.code:
682 | htmlPage = e.read()
683 | logger.warning("404 from proxy received for {url}. Waiting 1 minute".format(url=url))
684 | time.sleep(60)
685 | loop_client += 1
686 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_client, total=retries_client, url=url))
687 | return download_url(url, loop_client=loop_client)
688 | if 500 == e.code:
689 | htmlPage = e.read()
690 | logger.warning("500 from proxy received for {url}. Waiting 1 minute".format(url=url))
691 | time.sleep(60)
692 | loop_server += 1
693 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
694 | return download_url(url, loop_server=loop_server)
695 | if 504 == e.code:
696 | htmlPage = e.read()
697 | logger.warning("504 from proxy received for {url}. Waiting 1 minute".format(url=url))
698 | time.sleep(60)
699 | loop_server += 1
700 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
701 | return download_url(url, loop_server=loop_server)
702 | if 502 == e.code:
703 | htmlPage = e.read()
704 | logger.warning("502 from proxy received for {url}. Waiting 1 minute".format(url=url))
705 | time.sleep(60)
706 | loop_server += 1
707 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
708 | return download_url(url, loop_server=loop_server)
709 | if 403 == e.code:
710 | htmlPage = e.read()
711 | if 'Please slow down' in htmlPage or 'has temporarily blocked your computer' in htmlPage or 'blocked' in htmlPage:
712 | logger.warning("Slow down message received for {url}. Waiting 1 minute".format(url=url))
713 | time.sleep(60)
714 | loop_server += 1
715 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
716 | return download_url(url, loop_server=loop_server)
717 | if 504 == e.code:
718 | htmlPage = e.read()
719 | logger.warning("504 from proxy received for {url}. Waiting 1 minute".format(url=url))
720 | time.sleep(60)
721 | loop_server += 1
722 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
723 | return download_url(url, loop_server=loop_server)
724 | if 502 == e.code:
725 | htmlPage = e.read()
726 | logger.warning("502 from proxy received for {url}. Waiting 1 minute".format(url=url))
727 | time.sleep(60)
728 | loop_server += 1
729 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
730 | return download_url(url, loop_server=loop_server)
731 | if 403 == e.code:
732 | htmlPage = e.read()
733 | if 'Please slow down' in htmlPage or 'has temporarily blocked your computer' in htmlPage or 'blocked' in htmlPage:
734 | logger.warning("Slow down message received for {url}. Waiting 1 minute".format(url=url))
735 | time.sleep(60)
736 | return download_url(url)
737 | logger.warning("ERROR: HTTP Error ##### {e} ######################## {url}".format(e=e, url=url))
738 | return None
739 | return response
740 | except URLError as e:
741 | logger.debug("ERROR: URL Error ##### {e} ######################## ".format(e=e, url=url))
742 | if random_proxy: # remove proxy from the list if needed
743 | failed_proxy(random_proxy)
744 | logger.warning("Failed to download the page {url} because of proxy error {proxy}. Trying again.".format(url=url, proxy=random_proxy))
745 | loop_server += 1
746 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
747 | return download_url(url, loop_server=loop_server)
748 | if 'timed out' in str(e.reason):  # e.reason may be an exception object rather than a string
749 | logger.warning("Timed out or slow down for {url}. Waiting 1 minute".format(url=url))
750 | loop_server += 1
751 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
752 | time.sleep(60)
753 | return download_url(url, loop_server=loop_server)
754 | return None
755 | except socket.timeout:
756 | logger.debug("ERROR: timeout ############################# " + url)
757 | if random_proxy: # remove proxy from the list if needed
758 | failed_proxy(random_proxy)
759 | logger.warning("Failed to download the page {0} because of a socket timeout, trying again.".format(url))
760 | loop_server += 1
761 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
762 | return download_url(url, loop_server=loop_server)
763 | return None
764 | except Exception as e:
765 | failed_proxy(random_proxy)
766 | logger.warning("Failed to download the page {0} because of another HTTPlib or proxy error, trying again.".format(url))
767 | loop_server += 1
768 | logger.warning("Retry {nb}/{total} for {url}".format(nb=loop_server, total=retries_server, url=url))
769 | return download_url(url, loop_server=loop_server)
770 | # logger.error("ERROR: Other HTTPlib error: {e}".format(e=e))
771 | # return None, None
772 | # do NOT try to download the url again here, as we might end in an endless loop
773 |
774 |
775 | class Sqlite3Database(threading.Thread):
776 | def __init__(self, filename):
777 | threading.Thread.__init__(self)
778 | self.kill_received = False
779 | self.queue = Queue()
780 | self.filename = filename
781 | self.db_conn = None
782 | self.c = None
783 |
784 | def run(self):
785 | self.db_conn = sqlite3.connect(self.filename)
786 | # create the db if it doesn't exist
787 | self.c = self.db_conn.cursor()
788 | try:
789 | # LATER maybe create a table per site. Lookups will be faster as less text-searching is needed
790 | self.c.execute('''
791 | CREATE TABLE IF NOT EXISTS pasties (
792 | site TEXT,
793 | id TEXT,
794 | md5 TEXT,
795 | url TEXT,
796 | local_path TEXT,
797 | timestamp DATE,
798 | matches TEXT
799 | )''')
800 | self.db_conn.commit()
801 | except sqlite3.DatabaseError as e:
802 | logger.error('Problem with the SQLite database {0}: {1}'.format(self.filename, e))
803 | return None
804 | # loop over the queue
805 | while not self.kill_received:
806 | try:
807 | # grabs pastie from queue
808 | pastie = self.queue.get()
809 | # add the pastie to the DB
810 | self.add_or_update(pastie)
811 | # signals to queue job is done
812 | self.queue.task_done()
813 | # catch unknown errors
814 | except Exception as e:
815 | logger.error("Thread for SQLite crashed unexpectedly, recovering...: {e}".format(e=e))
816 | logger.debug(traceback.format_exc())
817 |
818 | def add_or_update(self, pastie):
819 | data = {'site': pastie.site.name,
820 | 'id': pastie.id
821 | }
822 | self.c.execute('SELECT count(id) FROM pasties WHERE site=:site AND id=:id', data)
823 | pastie_in_db = self.c.fetchone()
824 | # logger.debug('State of Database for pastie {site} {id} - {state}'.format(site=pastie.site.name, id=pastie.id, state=pastie_in_db))
825 | if pastie_in_db and pastie_in_db[0]:
826 | self.update(pastie)
827 | else:
828 | self.add(pastie)
829 |
830 | def add(self, pastie):
831 | try:
832 | data = {'site': pastie.site.name,
833 | 'id': pastie.id,
834 | 'md5': pastie.md5,
835 | 'url': pastie.url,
836 | 'local_path': pastie.site.archive_dir + os.sep + pastie.site.pastie_id_to_filename(pastie.id),
837 | 'timestamp': datetime.now(),
838 | 'matches': pastie.matches_to_text()
839 | }
840 | self.c.execute('INSERT INTO pasties VALUES (:site, :id, :md5, :url, :local_path, :timestamp, :matches)', data)
841 | self.db_conn.commit()
842 | except sqlite3.DatabaseError as e:
843 | logger.error('Cannot add pastie {site} {id} in the SQLite database: {error}'.format(site=pastie.site.name, id=pastie.id, error=e))
844 | logger.debug('Added pastie {site} {id} in the SQLite database.'.format(site=pastie.site.name, id=pastie.id))
845 |
846 | def update(self, pastie):
847 | try:
848 | data = {'site': pastie.site.name,
849 | 'id': pastie.id,
850 | 'md5': pastie.md5,
851 | 'url': pastie.url,
852 | 'local_path': pastie.site.archive_dir + os.sep + pastie.site.pastie_id_to_filename(pastie.id),
853 | 'timestamp': datetime.now(),
854 | 'matches': pastie.matches_to_text()
855 | }
856 | self.c.execute('''UPDATE pasties SET md5 = :md5,
857 | url = :url,
858 | local_path = :local_path,
859 | timestamp = :timestamp,
860 | matches = :matches
861 | WHERE site = :site AND id = :id''', data)
862 | self.db_conn.commit()
863 | except sqlite3.DatabaseError as e:
864 | logger.error('Cannot update pastie {site} {id} in the SQLite database: {error}'.format(site=pastie.site.name, id=pastie.id, error=e))
865 | logger.debug('Updated pastie {site} {id} in the SQLite database.'.format(site=pastie.site.name, id=pastie.id))
866 |
867 |
868 | def parse_config_file(configfile):
869 | global yamlconfig
870 | try:
871 | yamlconfig = yaml.safe_load(open(configfile))
872 | except yaml.YAMLError as exc:
873 | logger.error("Error in configuration file:")
874 | if hasattr(exc, 'problem_mark'):
875 | mark = exc.problem_mark
876 | logger.error("error position: (%s:%s)" % (mark.line + 1, mark.column + 1))
877 | exit(1)
878 | # TODO verify validity of config parameters
879 | for includes in yamlconfig.get("includes", []):
880 | yamlconfig.update(yaml.safe_load(open(includes)))
881 | if yamlconfig['proxy']['random']:
882 | load_proxies_from_file(yamlconfig['proxy']['file'])
883 | if yamlconfig['user-agent']['random']:
884 | load_user_agents_from_file(yamlconfig['user-agent']['file'])
885 | # if yamlconfig['redis']['queue']:
886 | # import redis
887 |
888 |
889 | if __name__ == "__main__":
890 | global logger
891 | parser = optparse.OptionParser("usage: %prog [options]")
892 | parser.add_option("-c", "--config", dest="config",
893 | help="load configuration from file", metavar="FILE")
894 | parser.add_option("-d", "--daemon", action="store_true", dest="daemon",
895 | help="runs in background as a daemon (NOT IMPLEMENTED)")
896 | parser.add_option("-s", "--stats", action="store_true", dest="stats",
897 | help="display statistics about the running threads (NOT IMPLEMENTED)")
898 | parser.add_option("-v", action="store_true", dest="verbose",
899 | help="outputs more information")
900 |
901 | (options, args) = parser.parse_args()
902 |
903 | if not options.config:
904 | # try to read out the default configuration files if -c option is not set
905 | if os.path.isfile('/etc/pystemon.yaml'):
906 | options.config = '/etc/pystemon.yaml'
907 | if os.path.isfile('pystemon.yaml'):
908 | options.config = 'pystemon.yaml'
909 | filename = sys.argv[0]
910 | config_file = filename.replace('.py', '.yaml')
911 | if os.path.isfile(config_file):
912 | options.config = config_file
913 | if not options.config or not os.path.isfile(options.config):
914 | parser.error('Configuration file not found. Please create /etc/pystemon.yaml, pystemon.yaml or specify a config file using the -c option.')
915 | exit(1)
916 |
917 | logger = logging.getLogger('pystemon')
918 | logger.setLevel(logging.INFO)
919 | hdlr = logging.StreamHandler(sys.stdout)
920 | formatter = logging.Formatter('[%(asctime)s] %(message)s')
921 | hdlr.setFormatter(formatter)
922 | logger.addHandler(hdlr)
923 | if options.verbose:
924 | logger.setLevel(logging.DEBUG)
925 |
926 | if options.daemon:
927 | # send logging to syslog if using daemon
928 | logger.addHandler(logging.handlers.SysLogHandler(facility=logging.handlers.SysLogHandler.LOG_DAEMON))
929 | # FIXME run application in background
930 |
931 | parse_config_file(options.config)
932 | # run the software
933 | main()
934 |
--------------------------------------------------------------------------------
/pystemon.yaml:
--------------------------------------------------------------------------------
1 | #network: # Network settings
2 | # ip: '1.1.1.1' # Specify source IP address if you want to bind on a specific one
3 |
4 | archive:
5 | save: yes # Keep a copy of pasties that match a search
6 | save-all: no # Keep a copy of all pasties
7 | dir: "alerts" # Directory where matching pasties should be kept
8 | dir-all: "archive" # Directory where all pasties should be kept (if save-all is set to yes)
9 | compress: yes # Store the pasties compressed
10 |
11 | db:
12 | sqlite3: # Store information about the pastie in a database
13 | enable: no # Activate this DB engine # NOT FULLY IMPLEMENTED
14 | file: 'db.sqlite3' # The filename of the database
15 |
16 | redis:
17 | queue: no # Toggle PUSH to redis queue
18 | server: "localhost"
19 | port: 6379
20 | database: 10
21 |
22 | email:
23 | alert: no # Enable/disable email alerts
24 | from: alert@example.com
25 | to: alert@example.com
26 | server: 127.0.0.1 # Address of the server (hostname or IP)
27 | port: 25 # Outgoing SMTP port: 25, 587, ...
28 | username: '' # (optional) Username for authentication. Leave blank for no authentication.
29 | password: '' # (optional) Password for authentication. Leave blank for no authentication.
30 | subject: '[pystemon] - {subject}'
31 |
32 | #####
33 | # Definition of regular expressions to search for in the pasties
34 | #
35 | search:
36 | # - description: '' # (optional) A human readable description. (used in alerting)
37 | # search: '' # The regular expression to search for
38 | # count: '' # (optional) How many hits should it have to be interesting?
39 | # exclude: '' # (optional) Do not alert if this regular expression matches
40 | # regex-flags: '' # (optional) Regular expression flags to give to the find function.
41 | # # Default = re.IGNORECASE
42 | # # Set to 0 to have no flags set
43 | # # See http://docs.python.org/2/library/re.html#re.DEBUG for more info.
44 | # # Warning: when setting this the default is overridden
45 | # # example: 're.MULTILINE + re.DOTALL + re.IGNORECASE'
46 | # to: '' # (optional) Additional recipients for email alert, comma separated list
47 |
48 | - search: '[^a-zA-Z0-9]example\.com'
49 | - search: '[^a-zA-Z0-9]foobar\.com'
50 | - description: 'Download (non-porn)'
51 | search: 'download'
52 | exclude: 'porn|sex|teen'
53 | count: 4
54 |
55 | #####
56 | # Configuration section for the paste sites
57 | #
58 | threads: 1 # number of download threads per site
59 | site:
60 | # example.com:
61 | # archive-url: # the url where the list of last pasties is present
62 | # # example: 'http://pastebin.com/archive'
63 | # archive-regex: # a regular expression to extract the pastie-id from the page.
64 | # # do not forget the () to extract the pastie-id
65 | # # example: '.+'
66 | # download-url: # url for the raw pastie.
67 | # # Should contain {id} on the place where the ID of the pastie needs to be placed
68 | # # example: 'http://pastebin.com/raw.php?i={id}'
69 | # update-max: 40 # every X seconds check for new updates to see if new pasties are available
70 | # update-min: 30 # a random number will be chosen between these two numbers
71 | # pastie-classname: # OPTIONAL: The name of a custom Class that inherits from Pastie
72 | # # This is practical for sites that require custom fetchPastie() functions
73 |
74 | pastebin.com:
75 | enable: yes
76 | archive-url: 'http://pastebin.com/archive'
77 | archive-regex: '.+'
78 | download-url: 'http://pastebin.com/raw.php?i={id}'
79 | update-max: 50
80 | update-min: 40
81 |
82 | # See https://pastebin.com/api_scraping_faq ; you will need a Pro account on Pastebin.
83 | # You also need to whitelist your source IP: "Our scraping API is only available for LIFETIME PRO members, and only for those who have their IP whitelisted."
84 | pastebin.com_pro:
85 | enable: no
86 | archive-url: 'https://scrape.pastebin.com/api_scraping.php?limit=500'
87 | archive-regex: '"key": "(.+)",'
88 | download-url: 'http://pastebin.com/api_scrape_item.php?i={id}'
89 | update-max: 50
90 | update-min: 40
91 |
92 | slexy.org:
93 | enable: yes
94 | archive-url: 'http://slexy.org/recent'
95 | archive-regex: 'View paste'
96 | download-url: 'http://slexy.org/raw/{id}'
97 |
98 | gist.github.com:
99 | enable: no
100 | archive-url: 'https://gist.github.com/gists'
101 | archive-regex: 'gist'
102 | download-url: 'https://raw.github.com/gist/{id}'
103 |
104 | codepad.org:
105 | enable: no
106 | archive-url: 'http://codepad.org/recent'
107 | archive-regex: 'view'
108 | download-url: 'http://codepad.org/{id}/raw.txt'
109 |
110 | safebin.net: # FIXME not finished
111 | enable: no
112 | archive-url: 'http://safebin.net/?archive'
113 | archive-regex: ''
114 | download-url: 'http://safebin.net/{id}'
115 | update-max: 60
116 | update-min: 50
117 |
118 |
119 | # TODO
120 | # http://www.safebin.net/ # more complex site
121 | # http://www.heypasteit.com/ # http://www.heypasteit.com/clip/0IZA => incremental
122 |
123 | # http://hastebin.com/ # no list of last pastes
124 | # http://sebsauvage.net/paste/ # no list of last pastes
125 | # http://tny.cz/ # no list of last pastes
126 | # https://pastee.org/ # no list of last pastes
127 | # http://paste2.org/ # no list of last pastes
128 | # http://0bin.net/ # no list of last pastes
129 | # http://markable.in/ # no list of last pastes
130 |
131 |
132 | #####
133 | # Configuration section to configure proxies
134 | # Currently only HTTP proxies are permitted
135 | #
136 | proxy:
137 | random: no
138 | file: 'proxies.txt'
139 |
140 | #####
141 | # Configuration section for User-Agents
142 | #
143 | user-agent:
144 | random: no
145 | file: 'user-agents.txt'
146 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pyyaml
2 | redis
3 | requests
4 |
--------------------------------------------------------------------------------