├── .idea
│   ├── .gitignore
│   ├── bilibili_spider.iml
│   ├── encodings.xml
│   ├── inspectionProfiles
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   └── vcs.xml
├── LICENSE
├── README.md
├── bilibili_spider.log
├── bilibili_spider
│   ├── main_window.py
│   ├── models
│   │   ├── __init__.py
│   │   └── comments.py
│   ├── pages
│   │   ├── __init__.py
│   │   ├── crawl_page.py
│   │   ├── home_page.py
│   │   ├── search_page.py
│   │   └── settings_page.py
│   ├── spiders
│   │   ├── __init__.py
│   │   └── comment_spider.py
│   └── utils
│       ├── __init__.py
│       ├── config.py
│       ├── cookie_helper.py
│       └── db_handler.py
├── main.py
├── readme.md
└── requirements.txt
/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /shelf/
3 | /workspace.xml
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 | Preamble
9 |
10 | The GNU General Public License is a free, copyleft license for
11 | software and other kinds of works.
12 |
13 | The licenses for most software and other practical works are designed
14 | to take away your freedom to share and change the works. By contrast,
15 | the GNU General Public License is intended to guarantee your freedom to
16 | share and change all versions of a program--to make sure it remains free
17 | software for all its users. We, the Free Software Foundation, use the
18 | GNU General Public License for most of our software; it applies also to
19 | any other work released this way by its authors. You can apply it to
20 | your programs, too.
21 |
22 | When we speak of free software, we are referring to freedom, not
23 | price. Our General Public Licenses are designed to make sure that you
24 | have the freedom to distribute copies of free software (and charge for
25 | them if you wish), that you receive source code or can get it if you
26 | want it, that you can change the software or use pieces of it in new
27 | free programs, and that you know you can do these things.
28 |
29 | To protect your rights, we need to prevent others from denying you
30 | these rights or asking you to surrender the rights. Therefore, you have
31 | certain responsibilities if you distribute copies of the software, or if
32 | you modify it: responsibilities to respect the freedom of others.
33 |
34 | For example, if you distribute copies of such a program, whether
35 | gratis or for a fee, you must pass on to the recipients the same
36 | freedoms that you received. You must make sure that they, too, receive
37 | or can get the source code. And you must show them these terms so they
38 | know their rights.
39 |
40 | Developers that use the GNU GPL protect your rights with two steps:
41 | (1) assert copyright on the software, and (2) offer you this License
42 | giving you legal permission to copy, distribute and/or modify it.
43 |
44 | For the developers' and authors' protection, the GPL clearly explains
45 | that there is no warranty for this free software. For both users' and
46 | authors' sake, the GPL requires that modified versions be marked as
47 | changed, so that their problems will not be attributed erroneously to
48 | authors of previous versions.
49 |
50 | Some devices are designed to deny users access to install or run
51 | modified versions of the software inside them, although the manufacturer
52 | can do so. This is fundamentally incompatible with the aim of
53 | protecting users' freedom to change the software. The systematic
54 | pattern of such abuse occurs in the area of products for individuals to
55 | use, which is precisely where it is most unacceptable. Therefore, we
56 | have designed this version of the GPL to prohibit the practice for those
57 | products. If such problems arise substantially in other domains, we
58 | stand ready to extend this provision to those domains in future versions
59 | of the GPL, as needed to protect the freedom of users.
60 |
61 | Finally, every program is threatened constantly by software patents.
62 | States should not allow patents to restrict development and use of
63 | software on general-purpose computers, but in those that do, we wish to
64 | avoid the special danger that patents applied to a free program could
65 | make it effectively proprietary. To prevent this, the GPL assures that
66 | patents cannot be used to render the program non-free.
67 |
68 | The precise terms and conditions for copying, distribution and
69 | modification follow.
70 |
71 | TERMS AND CONDITIONS
72 |
73 | 0. Definitions.
74 |
75 | "This License" refers to version 3 of the GNU General Public License.
76 |
77 | "Copyright" also means copyright-like laws that apply to other kinds of
78 | works, such as semiconductor masks.
79 |
80 | "The Program" refers to any copyrightable work licensed under this
81 | License. Each licensee is addressed as "you". "Licensees" and
82 | "recipients" may be individuals or organizations.
83 |
84 | To "modify" a work means to copy from or adapt all or part of the work
85 | in a fashion requiring copyright permission, other than the making of an
86 | exact copy. The resulting work is called a "modified version" of the
87 | earlier work or a work "based on" the earlier work.
88 |
89 | A "covered work" means either the unmodified Program or a work based
90 | on the Program.
91 |
92 | To "propagate" a work means to do anything with it that, without
93 | permission, would make you directly or secondarily liable for
94 | infringement under applicable copyright law, except executing it on a
95 | computer or modifying a private copy. Propagation includes copying,
96 | distribution (with or without modification), making available to the
97 | public, and in some countries other activities as well.
98 |
99 | To "convey" a work means any kind of propagation that enables other
100 | parties to make or receive copies. Mere interaction with a user through
101 | a computer network, with no transfer of a copy, is not conveying.
102 |
103 | An interactive user interface displays "Appropriate Legal Notices"
104 | to the extent that it includes a convenient and prominently visible
105 | feature that (1) displays an appropriate copyright notice, and (2)
106 | tells the user that there is no warranty for the work (except to the
107 | extent that warranties are provided), that licensees may convey the
108 | work under this License, and how to view a copy of this License. If
109 | the interface presents a list of user commands or options, such as a
110 | menu, a prominent item in the list meets this criterion.
111 |
112 | 1. Source Code.
113 |
114 | The "source code" for a work means the preferred form of the work
115 | for making modifications to it. "Object code" means any non-source
116 | form of a work.
117 |
118 | A "Standard Interface" means an interface that either is an official
119 | standard defined by a recognized standards body, or, in the case of
120 | interfaces specified for a particular programming language, one that
121 | is widely used among developers working in that language.
122 |
123 | The "System Libraries" of an executable work include anything, other
124 | than the work as a whole, that (a) is included in the normal form of
125 | packaging a Major Component, but which is not part of that Major
126 | Component, and (b) serves only to enable use of the work with that
127 | Major Component, or to implement a Standard Interface for which an
128 | implementation is available to the public in source code form. A
129 | "Major Component", in this context, means a major essential component
130 | (kernel, window system, and so on) of the specific operating system
131 | (if any) on which the executable work runs, or a compiler used to
132 | produce the work, or an object code interpreter used to run it.
133 |
134 | The "Corresponding Source" for a work in object code form means all
135 | the source code needed to generate, install, and (for an executable
136 | work) run the object code and to modify the work, including scripts to
137 | control those activities. However, it does not include the work's
138 | System Libraries, or general-purpose tools or generally available free
139 | programs which are used unmodified in performing those activities but
140 | which are not part of the work. For example, Corresponding Source
141 | includes interface definition files associated with source files for
142 | the work, and the source code for shared libraries and dynamically
143 | linked subprograms that the work is specifically designed to require,
144 | such as by intimate data communication or control flow between those
145 | subprograms and other parts of the work.
146 |
147 | The Corresponding Source need not include anything that users
148 | can regenerate automatically from other parts of the Corresponding
149 | Source.
150 |
151 | The Corresponding Source for a work in source code form is that
152 | same work.
153 |
154 | 2. Basic Permissions.
155 |
156 | All rights granted under this License are granted for the term of
157 | copyright on the Program, and are irrevocable provided the stated
158 | conditions are met. This License explicitly affirms your unlimited
159 | permission to run the unmodified Program. The output from running a
160 | covered work is covered by this License only if the output, given its
161 | content, constitutes a covered work. This License acknowledges your
162 | rights of fair use or other equivalent, as provided by copyright law.
163 |
164 | You may make, run and propagate covered works that you do not
165 | convey, without conditions so long as your license otherwise remains
166 | in force. You may convey covered works to others for the sole purpose
167 | of having them make modifications exclusively for you, or provide you
168 | with facilities for running those works, provided that you comply with
169 | the terms of this License in conveying all material for which you do
170 | not control copyright. Those thus making or running the covered works
171 | for you must do so exclusively on your behalf, under your direction
172 | and control, on terms that prohibit them from making any copies of
173 | your copyrighted material outside their relationship with you.
174 |
175 | Conveying under any other circumstances is permitted solely under
176 | the conditions stated below. Sublicensing is not allowed; section 10
177 | makes it unnecessary.
178 |
179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180 |
181 | No covered work shall be deemed part of an effective technological
182 | measure under any applicable law fulfilling obligations under article
183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184 | similar laws prohibiting or restricting circumvention of such
185 | measures.
186 |
187 | When you convey a covered work, you waive any legal power to forbid
188 | circumvention of technological measures to the extent such circumvention
189 | is effected by exercising rights under this License with respect to
190 | the covered work, and you disclaim any intention to limit operation or
191 | modification of the work as a means of enforcing, against the work's
192 | users, your or third parties' legal rights to forbid circumvention of
193 | technological measures.
194 |
195 | 4. Conveying Verbatim Copies.
196 |
197 | You may convey verbatim copies of the Program's source code as you
198 | receive it, in any medium, provided that you conspicuously and
199 | appropriately publish on each copy an appropriate copyright notice;
200 | keep intact all notices stating that this License and any
201 | non-permissive terms added in accord with section 7 apply to the code;
202 | keep intact all notices of the absence of any warranty; and give all
203 | recipients a copy of this License along with the Program.
204 |
205 | You may charge any price or no price for each copy that you convey,
206 | and you may offer support or warranty protection for a fee.
207 |
208 | 5. Conveying Modified Source Versions.
209 |
210 | You may convey a work based on the Program, or the modifications to
211 | produce it from the Program, in the form of source code under the
212 | terms of section 4, provided that you also meet all of these conditions:
213 |
214 | a) The work must carry prominent notices stating that you modified
215 | it, and giving a relevant date.
216 |
217 | b) The work must carry prominent notices stating that it is
218 | released under this License and any conditions added under section
219 | 7. This requirement modifies the requirement in section 4 to
220 | "keep intact all notices".
221 |
222 | c) You must license the entire work, as a whole, under this
223 | License to anyone who comes into possession of a copy. This
224 | License will therefore apply, along with any applicable section 7
225 | additional terms, to the whole of the work, and all its parts,
226 | regardless of how they are packaged. This License gives no
227 | permission to license the work in any other way, but it does not
228 | invalidate such permission if you have separately received it.
229 |
230 | d) If the work has interactive user interfaces, each must display
231 | Appropriate Legal Notices; however, if the Program has interactive
232 | interfaces that do not display Appropriate Legal Notices, your
233 | work need not make them do so.
234 |
235 | A compilation of a covered work with other separate and independent
236 | works, which are not by their nature extensions of the covered work,
237 | and which are not combined with it such as to form a larger program,
238 | in or on a volume of a storage or distribution medium, is called an
239 | "aggregate" if the compilation and its resulting copyright are not
240 | used to limit the access or legal rights of the compilation's users
241 | beyond what the individual works permit. Inclusion of a covered work
242 | in an aggregate does not cause this License to apply to the other
243 | parts of the aggregate.
244 |
245 | 6. Conveying Non-Source Forms.
246 |
247 | You may convey a covered work in object code form under the terms
248 | of sections 4 and 5, provided that you also convey the
249 | machine-readable Corresponding Source under the terms of this License,
250 | in one of these ways:
251 |
252 | a) Convey the object code in, or embodied in, a physical product
253 | (including a physical distribution medium), accompanied by the
254 | Corresponding Source fixed on a durable physical medium
255 | customarily used for software interchange.
256 |
257 | b) Convey the object code in, or embodied in, a physical product
258 | (including a physical distribution medium), accompanied by a
259 | written offer, valid for at least three years and valid for as
260 | long as you offer spare parts or customer support for that product
261 | model, to give anyone who possesses the object code either (1) a
262 | copy of the Corresponding Source for all the software in the
263 | product that is covered by this License, on a durable physical
264 | medium customarily used for software interchange, for a price no
265 | more than your reasonable cost of physically performing this
266 | conveying of source, or (2) access to copy the
267 | Corresponding Source from a network server at no charge.
268 |
269 | c) Convey individual copies of the object code with a copy of the
270 | written offer to provide the Corresponding Source. This
271 | alternative is allowed only occasionally and noncommercially, and
272 | only if you received the object code with such an offer, in accord
273 | with subsection 6b.
274 |
275 | d) Convey the object code by offering access from a designated
276 | place (gratis or for a charge), and offer equivalent access to the
277 | Corresponding Source in the same way through the same place at no
278 | further charge. You need not require recipients to copy the
279 | Corresponding Source along with the object code. If the place to
280 | copy the object code is a network server, the Corresponding Source
281 | may be on a different server (operated by you or a third party)
282 | that supports equivalent copying facilities, provided you maintain
283 | clear directions next to the object code saying where to find the
284 | Corresponding Source. Regardless of what server hosts the
285 | Corresponding Source, you remain obligated to ensure that it is
286 | available for as long as needed to satisfy these requirements.
287 |
288 | e) Convey the object code using peer-to-peer transmission, provided
289 | you inform other peers where the object code and Corresponding
290 | Source of the work are being offered to the general public at no
291 | charge under subsection 6d.
292 |
293 | A separable portion of the object code, whose source code is excluded
294 | from the Corresponding Source as a System Library, need not be
295 | included in conveying the object code work.
296 |
297 | A "User Product" is either (1) a "consumer product", which means any
298 | tangible personal property which is normally used for personal, family,
299 | or household purposes, or (2) anything designed or sold for incorporation
300 | into a dwelling. In determining whether a product is a consumer product,
301 | doubtful cases shall be resolved in favor of coverage. For a particular
302 | product received by a particular user, "normally used" refers to a
303 | typical or common use of that class of product, regardless of the status
304 | of the particular user or of the way in which the particular user
305 | actually uses, or expects or is expected to use, the product. A product
306 | is a consumer product regardless of whether the product has substantial
307 | commercial, industrial or non-consumer uses, unless such uses represent
308 | the only significant mode of use of the product.
309 |
310 | "Installation Information" for a User Product means any methods,
311 | procedures, authorization keys, or other information required to install
312 | and execute modified versions of a covered work in that User Product from
313 | a modified version of its Corresponding Source. The information must
314 | suffice to ensure that the continued functioning of the modified object
315 | code is in no case prevented or interfered with solely because
316 | modification has been made.
317 |
318 | If you convey an object code work under this section in, or with, or
319 | specifically for use in, a User Product, and the conveying occurs as
320 | part of a transaction in which the right of possession and use of the
321 | User Product is transferred to the recipient in perpetuity or for a
322 | fixed term (regardless of how the transaction is characterized), the
323 | Corresponding Source conveyed under this section must be accompanied
324 | by the Installation Information. But this requirement does not apply
325 | if neither you nor any third party retains the ability to install
326 | modified object code on the User Product (for example, the work has
327 | been installed in ROM).
328 |
329 | The requirement to provide Installation Information does not include a
330 | requirement to continue to provide support service, warranty, or updates
331 | for a work that has been modified or installed by the recipient, or for
332 | the User Product in which it has been modified or installed. Access to a
333 | network may be denied when the modification itself materially and
334 | adversely affects the operation of the network or violates the rules and
335 | protocols for communication across the network.
336 |
337 | Corresponding Source conveyed, and Installation Information provided,
338 | in accord with this section must be in a format that is publicly
339 | documented (and with an implementation available to the public in
340 | source code form), and must require no special password or key for
341 | unpacking, reading or copying.
342 |
343 | 7. Additional Terms.
344 |
345 | "Additional permissions" are terms that supplement the terms of this
346 | License by making exceptions from one or more of its conditions.
347 | Additional permissions that are applicable to the entire Program shall
348 | be treated as though they were included in this License, to the extent
349 | that they are valid under applicable law. If additional permissions
350 | apply only to part of the Program, that part may be used separately
351 | under those permissions, but the entire Program remains governed by
352 | this License without regard to the additional permissions.
353 |
354 | When you convey a copy of a covered work, you may at your option
355 | remove any additional permissions from that copy, or from any part of
356 | it. (Additional permissions may be written to require their own
357 | removal in certain cases when you modify the work.) You may place
358 | additional permissions on material, added by you to a covered work,
359 | for which you have or can give appropriate copyright permission.
360 |
361 | Notwithstanding any other provision of this License, for material you
362 | add to a covered work, you may (if authorized by the copyright holders of
363 | that material) supplement the terms of this License with terms:
364 |
365 | a) Disclaiming warranty or limiting liability differently from the
366 | terms of sections 15 and 16 of this License; or
367 |
368 | b) Requiring preservation of specified reasonable legal notices or
369 | author attributions in that material or in the Appropriate Legal
370 | Notices displayed by works containing it; or
371 |
372 | c) Prohibiting misrepresentation of the origin of that material, or
373 | requiring that modified versions of such material be marked in
374 | reasonable ways as different from the original version; or
375 |
376 | d) Limiting the use for publicity purposes of names of licensors or
377 | authors of the material; or
378 |
379 | e) Declining to grant rights under trademark law for use of some
380 | trade names, trademarks, or service marks; or
381 |
382 | f) Requiring indemnification of licensors and authors of that
383 | material by anyone who conveys the material (or modified versions of
384 | it) with contractual assumptions of liability to the recipient, for
385 | any liability that these contractual assumptions directly impose on
386 | those licensors and authors.
387 |
388 | All other non-permissive additional terms are considered "further
389 | restrictions" within the meaning of section 10. If the Program as you
390 | received it, or any part of it, contains a notice stating that it is
391 | governed by this License along with a term that is a further
392 | restriction, you may remove that term. If a license document contains
393 | a further restriction but permits relicensing or conveying under this
394 | License, you may add to a covered work material governed by the terms
395 | of that license document, provided that the further restriction does
396 | not survive such relicensing or conveying.
397 |
398 | If you add terms to a covered work in accord with this section, you
399 | must place, in the relevant source files, a statement of the
400 | additional terms that apply to those files, or a notice indicating
401 | where to find the applicable terms.
402 |
403 | Additional terms, permissive or non-permissive, may be stated in the
404 | form of a separately written license, or stated as exceptions;
405 | the above requirements apply either way.
406 |
407 | 8. Termination.
408 |
409 | You may not propagate or modify a covered work except as expressly
410 | provided under this License. Any attempt otherwise to propagate or
411 | modify it is void, and will automatically terminate your rights under
412 | this License (including any patent licenses granted under the third
413 | paragraph of section 11).
414 |
415 | However, if you cease all violation of this License, then your
416 | license from a particular copyright holder is reinstated (a)
417 | provisionally, unless and until the copyright holder explicitly and
418 | finally terminates your license, and (b) permanently, if the copyright
419 | holder fails to notify you of the violation by some reasonable means
420 | prior to 60 days after the cessation.
421 |
422 | Moreover, your license from a particular copyright holder is
423 | reinstated permanently if the copyright holder notifies you of the
424 | violation by some reasonable means, this is the first time you have
425 | received notice of violation of this License (for any work) from that
426 | copyright holder, and you cure the violation prior to 30 days after
427 | your receipt of the notice.
428 |
429 | Termination of your rights under this section does not terminate the
430 | licenses of parties who have received copies or rights from you under
431 | this License. If your rights have been terminated and not permanently
432 | reinstated, you do not qualify to receive new licenses for the same
433 | material under section 10.
434 |
435 | 9. Acceptance Not Required for Having Copies.
436 |
437 | You are not required to accept this License in order to receive or
438 | run a copy of the Program. Ancillary propagation of a covered work
439 | occurring solely as a consequence of using peer-to-peer transmission
440 | to receive a copy likewise does not require acceptance. However,
441 | nothing other than this License grants you permission to propagate or
442 | modify any covered work. These actions infringe copyright if you do
443 | not accept this License. Therefore, by modifying or propagating a
444 | covered work, you indicate your acceptance of this License to do so.
445 |
446 | 10. Automatic Licensing of Downstream Recipients.
447 |
448 | Each time you convey a covered work, the recipient automatically
449 | receives a license from the original licensors, to run, modify and
450 | propagate that work, subject to this License. You are not responsible
451 | for enforcing compliance by third parties with this License.
452 |
453 | An "entity transaction" is a transaction transferring control of an
454 | organization, or substantially all assets of one, or subdividing an
455 | organization, or merging organizations. If propagation of a covered
456 | work results from an entity transaction, each party to that
457 | transaction who receives a copy of the work also receives whatever
458 | licenses to the work the party's predecessor in interest had or could
459 | give under the previous paragraph, plus a right to possession of the
460 | Corresponding Source of the work from the predecessor in interest, if
461 | the predecessor has it or can get it with reasonable efforts.
462 |
463 | You may not impose any further restrictions on the exercise of the
464 | rights granted or affirmed under this License. For example, you may
465 | not impose a license fee, royalty, or other charge for exercise of
466 | rights granted under this License, and you may not initiate litigation
467 | (including a cross-claim or counterclaim in a lawsuit) alleging that
468 | any patent claim is infringed by making, using, selling, offering for
469 | sale, or importing the Program or any portion of it.
470 |
471 | 11. Patents.
472 |
473 | A "contributor" is a copyright holder who authorizes use under this
474 | License of the Program or a work on which the Program is based. The
475 | work thus licensed is called the contributor's "contributor version".
476 |
477 | A contributor's "essential patent claims" are all patent claims
478 | owned or controlled by the contributor, whether already acquired or
479 | hereafter acquired, that would be infringed by some manner, permitted
480 | by this License, of making, using, or selling its contributor version,
481 | but do not include claims that would be infringed only as a
482 | consequence of further modification of the contributor version. For
483 | purposes of this definition, "control" includes the right to grant
484 | patent sublicenses in a manner consistent with the requirements of
485 | this License.
486 |
487 | Each contributor grants you a non-exclusive, worldwide, royalty-free
488 | patent license under the contributor's essential patent claims, to
489 | make, use, sell, offer for sale, import and otherwise run, modify and
490 | propagate the contents of its contributor version.
491 |
492 | In the following three paragraphs, a "patent license" is any express
493 | agreement or commitment, however denominated, not to enforce a patent
494 | (such as an express permission to practice a patent or covenant not to
495 | sue for patent infringement). To "grant" such a patent license to a
496 | party means to make such an agreement or commitment not to enforce a
497 | patent against the party.
498 |
499 | If you convey a covered work, knowingly relying on a patent license,
500 | and the Corresponding Source of the work is not available for anyone
501 | to copy, free of charge and under the terms of this License, through a
502 | publicly available network server or other readily accessible means,
503 | then you must either (1) cause the Corresponding Source to be so
504 | available, or (2) arrange to deprive yourself of the benefit of the
505 | patent license for this particular work, or (3) arrange, in a manner
506 | consistent with the requirements of this License, to extend the patent
507 | license to downstream recipients. "Knowingly relying" means you have
508 | actual knowledge that, but for the patent license, your conveying the
509 | covered work in a country, or your recipient's use of the covered work
510 | in a country, would infringe one or more identifiable patents in that
511 | country that you have reason to believe are valid.
512 |
513 | If, pursuant to or in connection with a single transaction or
514 | arrangement, you convey, or propagate by procuring conveyance of, a
515 | covered work, and grant a patent license to some of the parties
516 | receiving the covered work authorizing them to use, propagate, modify
517 | or convey a specific copy of the covered work, then the patent license
518 | you grant is automatically extended to all recipients of the covered
519 | work and works based on it.
520 |
521 | A patent license is "discriminatory" if it does not include within
522 | the scope of its coverage, prohibits the exercise of, or is
523 | conditioned on the non-exercise of one or more of the rights that are
524 | specifically granted under this License. You may not convey a covered
525 | work if you are a party to an arrangement with a third party that is
526 | in the business of distributing software, under which you make payment
527 | to the third party based on the extent of your activity of conveying
528 | the work, and under which the third party grants, to any of the
529 | parties who would receive the covered work from you, a discriminatory
530 | patent license (a) in connection with copies of the covered work
531 | conveyed by you (or copies made from those copies), or (b) primarily
532 | for and in connection with specific products or compilations that
533 | contain the covered work, unless you entered into that arrangement,
534 | or that patent license was granted, prior to 28 March 2007.
535 |
536 | Nothing in this License shall be construed as excluding or limiting
537 | any implied license or other defenses to infringement that may
538 | otherwise be available to you under applicable patent law.
539 |
540 | 12. No Surrender of Others' Freedom.
541 |
542 | If conditions are imposed on you (whether by court order, agreement or
543 | otherwise) that contradict the conditions of this License, they do not
544 | excuse you from the conditions of this License. If you cannot convey a
545 | covered work so as to satisfy simultaneously your obligations under this
546 | License and any other pertinent obligations, then as a consequence you may
547 | not convey it at all. For example, if you agree to terms that obligate you
548 | to collect a royalty for further conveying from those to whom you convey
549 | the Program, the only way you could satisfy both those terms and this
550 | License would be to refrain entirely from conveying the Program.
551 |
552 | 13. Use with the GNU Affero General Public License.
553 |
554 | Notwithstanding any other provision of this License, you have
555 | permission to link or combine any covered work with a work licensed
556 | under version 3 of the GNU Affero General Public License into a single
557 | combined work, and to convey the resulting work. The terms of this
558 | License will continue to apply to the part which is the covered work,
559 | but the special requirements of the GNU Affero General Public License,
560 | section 13, concerning interaction through a network will apply to the
561 | combination as such.
562 |
563 | 14. Revised Versions of this License.
564 |
565 | The Free Software Foundation may publish revised and/or new versions of
566 | the GNU General Public License from time to time. Such new versions will
567 | be similar in spirit to the present version, but may differ in detail to
568 | address new problems or concerns.
569 |
570 | Each version is given a distinguishing version number. If the
571 | Program specifies that a certain numbered version of the GNU General
572 | Public License "or any later version" applies to it, you have the
573 | option of following the terms and conditions either of that numbered
574 | version or of any later version published by the Free Software
575 | Foundation. If the Program does not specify a version number of the
576 | GNU General Public License, you may choose any version ever published
577 | by the Free Software Foundation.
578 |
579 | If the Program specifies that a proxy can decide which future
580 | versions of the GNU General Public License can be used, that proxy's
581 | public statement of acceptance of a version permanently authorizes you
582 | to choose that version for the Program.
583 |
584 | Later license versions may give you additional or different
585 | permissions. However, no additional obligations are imposed on any
586 | author or copyright holder as a result of your choosing to follow a
587 | later version.
588 |
589 | 15. Disclaimer of Warranty.
590 |
591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599 |
600 | 16. Limitation of Liability.
601 |
602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610 | SUCH DAMAGES.
611 |
612 | 17. Interpretation of Sections 15 and 16.
613 |
614 | If the disclaimer of warranty and limitation of liability provided
615 | above cannot be given local legal effect according to their terms,
616 | reviewing courts shall apply local law that most closely approximates
617 | an absolute waiver of all civil liability in connection with the
618 | Program, unless a warranty or assumption of liability accompanies a
619 | copy of the Program in return for a fee.
620 |
621 | END OF TERMS AND CONDITIONS
622 |
623 | How to Apply These Terms to Your New Programs
624 |
625 | If you develop a new program, and you want it to be of the greatest
626 | possible use to the public, the best way to achieve this is to make it
627 | free software which everyone can redistribute and change under these terms.
628 |
629 | To do so, attach the following notices to the program. It is safest
630 | to attach them to the start of each source file to most effectively
631 | state the exclusion of warranty; and each file should have at least
632 | the "copyright" line and a pointer to where the full notice is found.
633 |
634 | <one line to give the program's name and a brief idea of what it does.>
635 | Copyright (C) <year> <name of author>
636 |
637 | This program is free software: you can redistribute it and/or modify
638 | it under the terms of the GNU General Public License as published by
639 | the Free Software Foundation, either version 3 of the License, or
640 | (at your option) any later version.
641 |
642 | This program is distributed in the hope that it will be useful,
643 | but WITHOUT ANY WARRANTY; without even the implied warranty of
644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
645 | GNU General Public License for more details.
646 |
647 | You should have received a copy of the GNU General Public License
648 | along with this program. If not, see <https://www.gnu.org/licenses/>.
649 |
650 | Also add information on how to contact you by electronic and paper mail.
651 |
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 |
655 | <program> Copyright (C) <year> <name of author>
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 |
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 |
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | <https://www.gnu.org/licenses/>.
668 |
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | <https://www.gnu.org/licenses/why-not-lgpl.html>.
675 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
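1 | # 🎯 Bilibili Comment Spider
2 |
3 | A desktop application built with PyQt6 for crawling Bilibili video comments. It supports batch crawling of video comments, data storage, and smart management, and its modern dark-themed interface is designed for a smooth user experience.
4 |
5 | ## ✨ Key Features
6 |
7 | The project has a modular design and offers the following features:
8 |
9 | - 🖥️ A modern dark-themed graphical interface with a simple, intuitive design
10 | - 🚀 Crawls comments for videos identified by BV or av number, with automatic video ID detection
11 | - 💾 Stores data in a local SQLite database with incremental updates
12 | - 🔍 Multi-dimensional comment search with several filtering options
13 | - 📊 Built-in statistics and analysis for tracking crawl progress in real time
14 | - ⚙️ Configurable crawler parameters for flexible control of the crawl strategy
15 | - 🔒 Integrated Cookie management, including automatic retrieval from the browser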
16 |
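The automatic video ID detection mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `extract_video_id` implementation: the function body and both regular expressions are assumptions (a BV ID is assumed to be `BV` followed by 10 alphanumeric characters, an av ID to be `av` followed by digits).

```python
import re
from typing import Optional

# Hypothetical patterns for illustration; real Bilibili URLs may take other forms.
_BV_PATTERN = re.compile(r"(BV[0-9A-Za-z]{10})")
_AV_PATTERN = re.compile(r"av(\d+)", re.IGNORECASE)


def extract_video_id(url: str) -> Optional[str]:
    """Return a 'BV...' or 'av...' ID found in url, or None if neither is present."""
    bv_match = _BV_PATTERN.search(url)
    if bv_match:
        return bv_match.group(1)
    av_match = _AV_PATTERN.search(url)
    if av_match:
        # Normalize the prefix to lowercase 'av' regardless of the input casing.
        return f"av{av_match.group(1)}"
    return None
```

The BV pattern is checked first, so a URL that contains both forms resolves to the BV ID.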
17 | ## 🛠️ Development Environment
18 |
19 | - Python 3.8+
20 | - PyQt6
21 | - SQLite3
22 | - Requests
23 | - Selenium
24 |
25 | ## 📦 Installation
26 |
27 | 1. Clone the project locally:
28 | ```bash
29 | git clone https://github.com/Roinflam/bilibili-spider.git
30 | cd bilibili-spider
31 | ```
32 |
33 | 2. Install the dependencies:
34 | ```bash
35 | pip install -r requirements.txt
36 | ```
37 |
38 | 3. Run the program:
39 | ```bash
40 | python main.py
41 | ```
42 |
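43 | ## 💡 Features in Detail
44 |
45 | The program consists of four core modules:
46 |
47 | **Comment Crawling**
48 | - Crawls comments directly from a video URL
49 | - Configurable page count and request interval
50 | - Incremental updates of comment data
51 |
52 | **Data Management**
53 | - Local storage in a SQLite database
54 | - Automatic deduplication and updates
55 | - Data backup
56 |
57 | **Search**
58 | - Comment search across multiple dimensions
59 | - Flexible sorting
60 | - Quick lookup of comment content
61 |
62 | **System Settings**
63 | - Cookie configuration management
64 | - Custom crawler parameters
65 | - Integrated database management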
66 |
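The deduplication and incremental-update behavior of the data management module could look something like the sketch below. The table schema, the `open_db` helper, and this `save_comment` signature are hypothetical; only the return convention (1 for a new row, 2 for an updated row) mirrors how `crawl_page.py` interprets the result of `db_handler.save_comment`.

```python
import sqlite3

# Hypothetical, simplified schema; the project's real tables are not shown here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS comments (
    comment_id TEXT PRIMARY KEY,
    content    TEXT NOT NULL,
    like_count INTEGER NOT NULL DEFAULT 0
)
"""

INSERTED, UPDATED = 1, 2


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and make sure the comments table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_comment(conn: sqlite3.Connection, comment_id: str,
                 content: str, like_count: int) -> int:
    """Insert a new comment (returns 1) or update an existing one (returns 2)."""
    row = conn.execute(
        "SELECT 1 FROM comments WHERE comment_id = ?", (comment_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO comments (comment_id, content, like_count) VALUES (?, ?, ?)",
            (comment_id, content, like_count),
        )
        result = INSERTED
    else:
        conn.execute(
            "UPDATE comments SET content = ?, like_count = ? WHERE comment_id = ?",
            (content, like_count, comment_id),
        )
        result = UPDATED
    conn.commit()
    return result
```

Using the comment ID as the primary key means that crawling the same video again refreshes existing rows (for example, a changed like count) rather than duplicating them.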
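67 | ## 📝 Usage Notes
68 |
69 | Keep the following in mind when using the program:
70 |
71 | 1. On first use, configure a login Cookie in the system settings
72 | 2. Set a reasonable crawl delay to avoid sending requests too frequently
73 | 3. Enable automatic backup when crawling large amounts of data
74 | 4. The program is for study and research only; do not use it for commercial purposes
75 | 5. Follow Bilibili's rules and use the crawler responsibly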
76 |
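As an illustration of the crawl-delay advice above, a randomized pause between requests might be written like this. The helper name `polite_sleep` and its parameters are assumptions for this sketch; the 1 to 3 second default range matches the `random.uniform(1, 3)` delay used in `crawl_page.py`.

```python
import random
import time
from typing import Callable


def polite_sleep(min_seconds: float = 1.0, max_seconds: float = 3.0,
                 sleep: Callable[[float], None] = time.sleep) -> float:
    """Pause for a random duration in [min_seconds, max_seconds] and return it.

    The sleep function is injectable so the helper can be tested without waiting.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("expected 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` once per page keeps the request rate irregular and bounded, so requests are less likely to arrive too frequently.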
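77 | ## 📄 License
78 |
79 | This project is open-sourced under the GPLv3 license. It is intended for study and research only; commercial use is prohibited.
80 |
81 | ## 🤝 About the Author
82 |
83 | - GitHub: [Roinflam](https://github.com/Roinflam)
84 | - Contributions via Issues or Pull Requests are welcome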
85 |
--------------------------------------------------------------------------------
/bilibili_spider.log:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Roinflam/bilibili_spider/957ed38cb390d132d6a6aa46b19aeb008899d0d3/bilibili_spider.log
--------------------------------------------------------------------------------
/bilibili_spider/main_window.py:
--------------------------------------------------------------------------------
1 | # main_window.py
2 |
3 | from PyQt6.QtWidgets import (QMainWindow, QWidget, QVBoxLayout, QTabWidget,
4 |                              QLabel, QStatusBar, QMessageBox)
5 | from PyQt6.QtCore import Qt
6 | import logging
7 |
8 | from bilibili_spider.utils.config import Config
9 | from bilibili_spider.utils.db_handler import DatabaseHandler
10 | from bilibili_spider.pages.home_page import HomePage
11 | from bilibili_spider.pages.crawl_page import CrawlPage
12 | from bilibili_spider.pages.search_page import SearchPage
13 | from bilibili_spider.pages.settings_page import SettingsPage
14 |
15 |
16 | class MainWindow(QMainWindow):
17 |     def __init__(self):
18 |         super().__init__()
19 |         self.logger = logging.getLogger('BilibiliSpider')
20 |         self.init_backend()
21 |         self.init_ui()
22 |
23 |     def init_backend(self):
24 |         try:
25 |             self.config = Config()
26 |             self.db_handler = DatabaseHandler('bilibili_comments.db')
27 |             self.spider = None
28 |             self.logger.info("Backend components initialized successfully")
29 |         except Exception as e:
30 |             self.logger.error(f"Backend initialization failed: {str(e)}")
31 |             QMessageBox.critical(self, "Error", "Initialization failed; please check the configuration and database")
32 |             raise
33 |
34 |     def init_ui(self):
35 |         self.setWindowTitle("Bilibili Comment Spider")
36 |         self.resize(1200, 800)
37 |         self.setMinimumSize(1000, 600)
38 |         self.setup_style()
39 |
40 |         try:
41 |             central_widget = QWidget()
42 |             self.setCentralWidget(central_widget)
43 |             main_layout = QVBoxLayout(central_widget)
44 |             main_layout.setContentsMargins(0, 0, 0, 0)
45 |
46 |             self.tab_widget = QTabWidget()
47 |             self.setup_tab_style()
48 |
49 |             # Create the pages
50 |             self.home_page = HomePage(self.db_handler)
51 |             self.crawl_page = CrawlPage(self.db_handler, self.config)
52 |             self.search_page = SearchPage(self.db_handler)
53 |             self.settings_page = SettingsPage(self.config, self.db_handler)
54 |
55 |             self.tab_widget.addTab(self.home_page, "Home")
56 |             self.tab_widget.addTab(self.crawl_page, "Crawl Comments")
57 |             self.tab_widget.addTab(self.search_page, "Search Comments")
58 |             self.tab_widget.addTab(self.settings_page, "Settings")
59 |
60 |             main_layout.addWidget(self.tab_widget)
61 |
62 |             self.home_page.connect_buttons(self.tab_widget)
63 |
64 |             self.logger.info("UI initialized")
65 |
66 |         except Exception as e:
67 |             self.logger.error(f"UI initialization failed: {str(e)}")
68 |             QMessageBox.critical(self, "Error", "UI initialization failed")
69 |             raise
70 |
71 |     def setup_tab_style(self):
72 |         self.tab_widget.setStyleSheet("""
73 |             QTabWidget::pane {
74 |                 border: none;
75 |                 background: #1e1e1e;
76 |             }
77 |             QTabBar::tab {
78 |                 background: #2d2d2d;
79 |                 color: white;
80 |                 padding: 12px 25px;
81 |                 margin: 0px 2px 0px 0px;
82 |                 font-size: 14px;
83 |             }
84 |             QTabBar::tab:selected {
85 |                 background: #0078d4;
86 |                 font-weight: bold;
87 |             }
88 |             QTabBar::tab:hover:!selected {
89 |                 background: #3d3d3d;
90 |             }
91 |         """)
92 |
93 |     def setup_style(self):
94 |         self.setStyleSheet("""
95 |             QMainWindow {
96 |                 background-color: #1e1e1e;
97 |             }
98 |             QWidget {
99 |                 color: white;
100 |             }
101 |             QScrollBar:vertical {
102 |                 background: #2d2d2d;
103 |                 width: 12px;
104 |                 margin: 0px;
105 |             }
106 |             QScrollBar::handle:vertical {
107 |                 background: #404040;
108 |                 min-height: 30px;
109 |                 border-radius: 6px;
110 |             }
111 |             QScrollBar::handle:vertical:hover {
112 |                 background: #505050;
113 |             }
114 |         """)
115 |
116 |     def closeEvent(self, event):
117 |         try:
118 |             self.logger.info("Closing the application...")
119 |             # Add cleanup code here
120 |             event.accept()
121 |         except Exception as e:
122 |             self.logger.error(f"Error while closing the application: {str(e)}")
123 |             event.accept()
124 |
--------------------------------------------------------------------------------
/bilibili_spider/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Roinflam/bilibili_spider/957ed38cb390d132d6a6aa46b19aeb008899d0d3/bilibili_spider/models/__init__.py
--------------------------------------------------------------------------------
/bilibili_spider/models/comments.py:
--------------------------------------------------------------------------------
1 | """
2 | Comment data model
3 | """
4 | from datetime import datetime
5 |
6 |
7 | class Comment:
8 |     """
9 |     Data model class for a single comment
10 |     """
11 |
12 |     def __init__(self, video_id, video_title, comment_id, user_name, content,
13 |                  publish_time, like_count, replies=None):
14 |         """Initialize a comment object"""
15 |         self.video_id = video_id
16 |         self.video_title = video_title
17 |         self.comment_id = comment_id
18 |         self.user_name = user_name
19 |         self.content = content
20 |         self.publish_time = publish_time
21 |         self.like_count = like_count
22 |         self.replies = replies or []
--------------------------------------------------------------------------------
/bilibili_spider/pages/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Roinflam/bilibili_spider/957ed38cb390d132d6a6aa46b19aeb008899d0d3/bilibili_spider/pages/__init__.py
--------------------------------------------------------------------------------
/bilibili_spider/pages/crawl_page.py:
--------------------------------------------------------------------------------
1 | import random
2 | import time
3 | from datetime import datetime
4 |
5 | import requests
6 | from PyQt6.QtCore import Qt, QThread, pyqtSignal
7 | from PyQt6.QtWidgets import (QWidget, QVBoxLayout, QHBoxLayout, QLabel,
8 |                              QPushButton, QLineEdit, QSpinBox, QTextEdit,
9 |                              QProgressBar, QFrame, QMessageBox)
10 |
11 | from bilibili_spider.models.comments import Comment
12 | from bilibili_spider.spiders.comment_spider import BilibiliSpider
14 |
15 |
16 | class CrawlWorker(QThread):
17 |     """
18 |     @description: Crawl worker thread that runs the comment crawl in the background
19 |     """
20 |     progress = pyqtSignal(str)  # emits progress messages
21 |     error = pyqtSignal(str)
22 |     comment_received = pyqtSignal(dict)  # emits the data of a single comment
23 |     finished = pyqtSignal(dict)  # note: shadows QThread's built-in finished signal
24 |
25 |     def __init__(self, spider, url, max_pages):
26 |         """
27 |         @description: Initialize the crawl worker thread
28 |         @param {BilibiliSpider} spider - spider instance
29 |         @param {str} url - URL of the video to crawl
30 |         @param {int} max_pages - maximum number of pages to crawl
31 |         """
32 |         super().__init__()
33 |         self.spider = spider
34 |         self.url = url
35 |         self.max_pages = max_pages
36 |         self.is_running = False
37 |
38 |     def run(self):
39 |         """
40 |         @description: Main method that performs the crawl
41 |         """
42 |         self.is_running = True
43 |         try:
44 |             video_id = self.spider.extract_video_id(self.url)
45 |             if not video_id:
46 |                 self.error.emit("Could not extract a video ID from the URL")
47 |                 return
48 |
49 |             # Fetch the video title
50 |             video_title = ""
51 |             if video_id.startswith('BV'):
52 |                 view_url = f'https://api.bilibili.com/x/web-interface/view?bvid={video_id}'
53 |             else:
54 |                 view_url = f'https://api.bilibili.com/x/web-interface/view?aid={video_id.lstrip("av")}'
55 |
56 |             try:
57 |                 response = requests.get(view_url, headers=self.spider.headers, timeout=10)
58 |                 response.raise_for_status()
59 |                 data = response.json()
60 |                 if data['code'] == 0:
61 |                     video_title = data['data']['title']
62 |                     self.progress.emit(f"Got video title: {video_title}")
63 |                 else:  # report through the error signal so the UI re-enables the start button
64 |                     self.error.emit(f"Failed to get video title: {data.get('message', 'unknown error')}")
65 |                     return
66 |             except Exception as e:
67 |                 self.error.emit(f"Failed to get video title: {str(e)}")
68 |                 return
69 |
70 |             self.progress.emit(f"Starting to crawl comments for video {video_id}...")
71 |             current_page = 1
72 |             total_comments = 0
73 |
74 |             while current_page <= self.max_pages and self.is_running:
75 |                 self.progress.emit(f"Crawling page {current_page}...")
76 |
77 |                 api_url = self.spider.get_api_url(video_id, current_page)
78 |                 if not api_url:
79 |                     break
80 |
81 |                 try:
82 |                     response = requests.get(api_url, headers=self.spider.headers, timeout=10)
83 |                     response.raise_for_status()
84 |                     data = response.json()
85 |
86 |                     if data['code'] != 0:
87 |                         self.progress.emit(f"API returned an error: {data.get('message', 'unknown error')}")
88 |                         break
89 |
90 |                     replies = data['data'].get('replies', [])
91 |                     if not replies:
92 |                         self.progress.emit("No more comments")
93 |                         break
94 |
95 |                     for reply in replies:
96 |                         comment_data = {
97 |                             'video_id': video_id,
98 |                             'video_title': video_title,
99 |                             'comment_id': str(reply['rpid']),
100 |                             'user_name': reply['member']['uname'],
101 |                             'content': reply['content']['message'],
102 |                             'publish_time': datetime.fromtimestamp(
103 |                                 reply['ctime']
104 |                             ).strftime('%Y-%m-%d %H:%M:%S'),
105 |                             'like_count': reply['like'],
106 |                             'replies': []
107 |                         }
108 |
109 |                         # Handle sub-replies
110 |                         if reply.get('replies'):
111 |                             for sub_reply in reply['replies']:
112 |                                 reply_data = {
113 |                                     'user_name': sub_reply['member']['uname'],
114 |                                     'content': sub_reply['content']['message'],
115 |                                     'time': datetime.fromtimestamp(
116 |                                         sub_reply['ctime']
117 |                                     ).strftime('%Y-%m-%d %H:%M:%S')
118 |                                 }
119 |                                 comment_data['replies'].append(reply_data)
120 |
121 |                         # Emit the signal for a single comment
122 |                         self.comment_received.emit(comment_data)
123 |                         total_comments += 1
124 |                         self.progress.emit(f"Fetched {total_comments} comments so far")
125 |
126 |                 except requests.exceptions.RequestException as e:
127 |                     # Fall through to the delay and move on; retrying here would loop forever
128 |                     self.progress.emit(f"Request failed, skipping page {current_page}: {str(e)}")
129 |
130 |                 current_page += 1
131 |                 time.sleep(random.uniform(1, 3))  # random delay between requests
132 |
133 |             if self.is_running:
134 |                 self.finished.emit({
135 |                     'video_id': video_id,
136 |                     'total_comments': total_comments
137 |                 })
138 |
139 |         except Exception as e:
140 |             self.error.emit(str(e))
141 |         finally:
142 |             self.is_running = False
143 |
144 |     def stop(self):
145 |         """
146 |         @description: Stop the crawl
147 |         """
148 |         self.is_running = False
149 |
150 |
151 | class StyledFrame(QFrame):
152 |     """
153 |     @description: Panel widget with custom styling
154 |     """
155 |     def __init__(self, title="", parent=None):
156 |         """
157 |         @description: Initialize the styled panel
158 |         @param {str} title - panel title
159 |         @param {QWidget} parent - parent widget
160 |         """
161 |         super().__init__(parent)
162 |         self.setStyleSheet("""
163 |             StyledFrame {
164 |                 background-color: #2d2d2d;
165 |                 border-radius: 8px;
166 |                 padding: 15px;
167 |                 margin: 5px;
168 |             }
169 |         """)
170 |         self.layout = QVBoxLayout(self)
171 |         self.layout.setSpacing(10)
172 |
173 |         if title:
174 |             label = QLabel(title)
175 |             label.setStyleSheet("""
176 |                 font-size: 16px;
177 |                 font-weight: bold;
178 |                 color: white;
179 |                 padding: 5px;
180 |                 margin-bottom: 10px;
181 |             """)
182 |             self.layout.addWidget(label)
183 |
184 |
185 | class CrawlPage(QWidget):
186 |     """
187 |     @description: Comment crawl page with video URL input and crawl controls
188 |     """
189 |     def __init__(self, db_handler, config):
190 |         """
191 |         @description: Initialize the crawl page
192 |         @param {DatabaseHandler} db_handler - database handler instance
193 |         @param {Config} config - configuration manager instance
194 |         """
195 |         super().__init__()
196 |         self.db_handler = db_handler
197 |         self.config = config
198 |         self.crawl_worker = None
199 |
200 |         # Create the spider instance
201 |         cookie, _ = self.db_handler.get_valid_cookie()
202 |         if cookie and self.config.set_cookie(cookie):
203 |             self.spider = BilibiliSpider(self.config)
204 |         else:
205 |             self.spider = None
206 |
207 |         self.init_ui()
208 |
209 |     def init_ui(self):
210 |         """
211 |         @description: Build the user interface
212 |         """
213 |         layout = QVBoxLayout(self)
214 |         layout.setContentsMargins(20, 20, 20, 20)
215 |         layout.setSpacing(15)
216 |
217 |         # URL input area
218 |         url_frame = StyledFrame("Video URL")
219 |         url_layout = QHBoxLayout()
220 |         url_layout.setContentsMargins(5, 5, 5, 5)
221 |
222 |         self.url_input = QLineEdit()
223 |         self.url_input.setPlaceholderText("Enter a Bilibili video URL")
224 |         self.url_input.setStyleSheet("""
225 |             QLineEdit {
226 |                 padding: 8px;
227 |                 border: 1px solid #3d3d3d;
228 |                 border-radius: 4px;
229 |                 background-color: #1e1e1e;
230 |                 color: white;
231 |                 font-size: 14px;
232 |             }
233 |             QLineEdit:focus {
234 |                 border: 1px solid #0078d4;
235 |             }
236 |         """)
237 |
238 |         url_layout.addWidget(self.url_input)
239 |         url_frame.layout.addLayout(url_layout)
240 |         layout.addWidget(url_frame)
241 |
242 |         # Crawl control area
243 |         control_frame = StyledFrame("Crawl Controls")
244 |         control_layout = QHBoxLayout()
245 |         control_layout.setContentsMargins(5, 5, 5, 5)
246 |         control_layout.setAlignment(Qt.AlignmentFlag.AlignLeft)
247 |
248 |         page_label = QLabel("Pages:")
249 |         page_label.setStyleSheet("""
250 |             QLabel {
251 |                 color: white;
252 |                 font-size: 14px;
253 |                 padding-right: 10px;
254 |             }
255 |         """)
256 |         page_label.setFixedWidth(80)
257 |         control_layout.addWidget(page_label)
258 |
259 |         # Page count input (with a minimum height so it is not clipped)
260 |         self.page_spinbox = QSpinBox()
261 |         self.page_spinbox.setRange(1, 1000000)
262 |         self.page_spinbox.setValue(10)
263 |         self.page_spinbox.setMinimumWidth(120)
264 |         self.page_spinbox.setMinimumHeight(27)  # enforce a minimum height
265 |         self.page_spinbox.setAlignment(Qt.AlignmentFlag.AlignCenter)
266 |         self.page_spinbox.setButtonSymbols(QSpinBox.ButtonSymbols.UpDownArrows)
267 |         self.page_spinbox.setStyleSheet("""
268 |             QSpinBox {
269 |                 padding: 8px;
270 |                 border: 1px solid #3d3d3d;
271 |                 border-radius: 4px;
272 |                 background-color: #2d2d2d;
273 |                 color: white;
274 |                 font-size: 14px;
275 |                 min-height: 27px;
276 |             }
277 |             QSpinBox::up-button, QSpinBox::down-button {
278 |                 width: 20px;
279 |                 height: 18px;
280 |                 background: #404040;
281 |                 border: none;
282 |                 subcontrol-origin: border;
283 |             }
284 |             QSpinBox::up-button {
285 |                 subcontrol-position: top right;
286 |             }
287 |             QSpinBox::down-button {
288 |                 subcontrol-position: bottom right;
289 |             }
290 |             QSpinBox::up-button:hover, QSpinBox::down-button:hover {
291 |                 background: #505050;
292 |             }
293 |             QSpinBox::up-button:pressed, QSpinBox::down-button:pressed {
294 |                 background: #606060;
295 |             }
296 |         """)
297 |
298 |         control_layout.addWidget(self.page_spinbox)
299 |         control_layout.addStretch()
300 |
301 |         # Start button
302 |         self.start_button = QPushButton("Start Crawl")
303 |         self.start_button.setStyleSheet("""
304 |             QPushButton {
305 |                 padding: 8px 30px;
306 |                 background-color: #0078d4;
307 |                 color: white;
308 |                 border: none;
309 |                 border-radius: 4px;
310 |                 font-weight: bold;
311 |                 font-size: 14px;
312 |                 min-width: 120px;
313 |                 min-height: 40px;
314 |             }
315 |             QPushButton:hover {
316 |                 background-color: #1184db;
317 |             }
318 |             QPushButton:pressed {
319 |                 background-color: #006abc;
320 |             }
321 |             QPushButton:disabled {
322 |                 background-color: #666666;
323 |             }
324 |         """)
325 |
326 |         control_layout.addWidget(self.start_button)
327 |         control_frame.layout.addLayout(control_layout)
328 |         layout.addWidget(control_frame)
329 |
330 |         # Log display area
331 |         log_frame = StyledFrame("Run Log")
332 |         log_frame.layout.setContentsMargins(5, 5, 5, 5)
333 |
334 |         self.log_text = QTextEdit()
335 |         self.log_text.setReadOnly(True)
336 |         self.log_text.setStyleSheet("""
337 |             QTextEdit {
338 |                 border: 1px solid #3d3d3d;
339 |                 border-radius: 4px;
340 |                 background-color: #1e1e1e;
341 |                 color: white;
342 |                 padding: 8px;
343 |                 font-size: 14px;
344 |             }
345 |         """)
346 |
347 |         log_frame.layout.addWidget(self.log_text)
348 |         layout.addWidget(log_frame)
349 |
350 |         # Connect signals
351 |         self.start_button.clicked.connect(self.start_crawl)
352 |
353 |     def add_log(self, message):
354 |         """
355 |         @description: Append a message to the log display area
356 |         @param {str} message - log message
357 |         """
358 |         timestamp = datetime.now().strftime('%H:%M:%S')
359 |         self.log_text.append(f"[{timestamp}] {message}")
360 |         self.log_text.verticalScrollBar().setValue(
361 |             self.log_text.verticalScrollBar().maximum()
362 |         )
363 |
364 |     def handle_comment(self, comment_data):
365 |         """
366 |         @description: Process a single comment and save it to the database
367 |         @param {dict} comment_data - comment data dictionary
368 |         """
369 |         try:
370 |             comment = Comment(
371 |                 video_id=comment_data['video_id'],
372 |                 video_title=comment_data['video_title'],
373 |                 comment_id=comment_data['comment_id'],
374 |                 user_name=comment_data['user_name'],
375 |                 content=comment_data['content'],
376 |                 publish_time=comment_data['publish_time'],
377 |                 like_count=comment_data['like_count'],
378 |                 replies=comment_data['replies']
379 |             )
380 |
381 |             result = self.db_handler.save_comment(comment)
382 |             status = "new" if result == 1 else "updated" if result == 2 else "failed"
383 |             # Write to the log area
384 |             self.add_log(f"[{status}] {comment_data['user_name']}: {comment_data['content']}")
385 |             # Write to the console
386 |             print(f"[{status}] {comment_data['user_name']}: {comment_data['content']}")
387 |
388 |         except Exception as e:
389 |             self.add_log(f"Failed to process comment: {str(e)}")
390 |
391 |     def handle_error(self, error_message):
392 |         """
393 |         @description: Handle an error raised during the crawl
394 |         @param {str} error_message - error message
395 |         """
396 |         self.add_log(f"Crawl failed: {error_message}")
397 |         QMessageBox.critical(self, "Error", f"An error occurred while crawling: {error_message}")
398 |         self.start_button.setEnabled(True)
399 |
400 |     def handle_crawl_finished(self, result):
401 |         """
402 |         @description: Handle the crawl-finished event
403 |         @param {dict} result - crawl result data
404 |         """
405 |         try:
406 |             if not result:
407 |                 return
408 |
409 |             video_id = result.get('video_id')
410 |             total_comments = result.get('total_comments', 0)
411 |
412 |             self.add_log(f"Crawl finished! Fetched {total_comments} comments in total")
413 |             QMessageBox.information(
414 |                 self,
415 |                 "Success",
416 |                 f"Crawled the comments of video {video_id}\nFetched {total_comments} comments in total"
417 |             )
418 |
419 |         except Exception as e:
420 |             self.handle_error(str(e))
421 |         finally:
422 |             self.start_button.setEnabled(True)
423 |
424 |     def start_crawl(self):
425 |         """
426 |         @description: Start crawling comments
427 |         """
428 |         # Reload the Cookie and rebuild the spider
429 |         cookie, _ = self.db_handler.get_valid_cookie()
430 |         if cookie and self.config.set_cookie(cookie):
431 |             self.spider = BilibiliSpider(self.config)
432 |         else:
433 |             self.spider = None
434 |
435 |         if not self.spider:
436 |             QMessageBox.warning(self, "Notice", "Please set a Cookie first")
437 |             return
438 |
439 |         url = self.url_input.text().strip()
440 |         if not url:
441 |             QMessageBox.warning(self, "Notice", "Please enter a video URL")
442 |             return
443 |
444 |         try:
445 |             self.start_button.setEnabled(False)
446 |             self.crawl_worker = CrawlWorker(self.spider, url, self.page_spinbox.value())
447 |             self.crawl_worker.progress.connect(self.add_log)
448 |             self.crawl_worker.error.connect(self.handle_error)
449 |             self.crawl_worker.comment_received.connect(self.handle_comment)
450 |             self.crawl_worker.finished.connect(self.handle_crawl_finished)
451 |             self.crawl_worker.start()
452 |
453 |         except Exception as e:
454 |             self.handle_error(str(e))
--------------------------------------------------------------------------------
/bilibili_spider/pages/home_page.py:
--------------------------------------------------------------------------------
1 | from PyQt6.QtWidgets import (QWidget, QVBoxLayout, QHBoxLayout, QLabel,
2 |                              QPushButton, QFrame, QSpacerItem, QSizePolicy)
3 | from PyQt6.QtCore import Qt, QSize, QThread, pyqtSignal, QTimer
4 | from PyQt6.QtGui import QFont
5 |
6 |
7 | class UpdateStatsWorker(QThread):
8 |     stats_updated = pyqtSignal(dict)
9 |
10 |     def __init__(self, db_handler):
11 |         super().__init__()
12 |         self.db_handler = db_handler
13 |         self.is_running = False
14 |
15 |     def run(self):
16 |         self.is_running = True
17 |         try:
18 |             stats = self.db_handler.get_statistics()
19 |             if self.is_running:
20 |                 self.stats_updated.emit(stats)
21 |         except Exception as e:
22 |             print(f"Failed to update statistics: {str(e)}")
23 |         finally:
24 |             self.is_running = False
25 |
26 |     def stop(self):
27 |         self.is_running = False
28 |
29 |
30 | class StyledCard(QFrame):
31 |     def __init__(self, title, content, parent=None):
32 |         super().__init__(parent)
33 |         self.setStyleSheet("""
34 |             StyledCard {
35 |                 background-color: #2d2d2d;
36 |                 border-radius: 10px;
37 |                 padding: 15px;
38 |                 margin: 5px;
39 |             }
40 |             QLabel {
41 |                 color: white;
42 |             }
43 |         """)
44 |
45 |         layout = QVBoxLayout(self)
46 |         layout.setSpacing(10)
47 |
48 |         # Title
49 |         title_label = QLabel(title)
50 |         title_label.setStyleSheet("font-size: 18px; font-weight: bold; color: #0078d4;")
51 |         title_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
52 |
53 |         # Content (kept as an attribute so update_content can reach it directly)
54 |         self.content_label = QLabel(content)
55 |         self.content_label.setStyleSheet("font-size: 24px; font-weight: bold; color: white;")
56 |         self.content_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
57 |
58 |         layout.addWidget(title_label)
59 |         layout.addWidget(self.content_label)
60 |
61 |     def update_content(self, content):
62 |         """Update the card's content text"""
63 |         self.content_label.setText(content)
67 |
68 |
69 | class HomePage(QWidget):
70 | def __init__(self, db_handler):
71 | super().__init__()
72 | self.db_handler = db_handler
73 | self.stats_worker = None
74 | self.init_ui()
75 |
76 | # 创建定时器
77 | self.update_timer = QTimer()
78 | self.update_timer.setInterval(1000) # 1秒
79 | self.update_timer.timeout.connect(self.start_update_stats)
80 | self.update_timer.start()
81 |
82 | def init_ui(self):
83 | layout = QVBoxLayout(self)
84 | layout.setSpacing(20)
85 | layout.setContentsMargins(30, 30, 30, 30)
86 |
87 | # 标题区域
88 | title_widget = QWidget()
89 | title_layout = QVBoxLayout(title_widget)
90 |
91 | welcome_label = QLabel("欢迎使用B站评论爬虫")
92 | welcome_label.setFont(QFont("Arial", 24, QFont.Weight.Bold))
93 | welcome_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
94 | welcome_label.setStyleSheet("color: #0078d4; margin: 20px;")
95 |
96 | subtitle_label = QLabel("一个强大的B站评论数据采集工具")
97 | subtitle_label.setFont(QFont("Arial", 14))
98 | subtitle_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
99 | subtitle_label.setStyleSheet("color: #cccccc; margin-bottom: 20px;")
100 |
101 | title_layout.addWidget(welcome_label)
102 | title_layout.addWidget(subtitle_label)
103 |
104 | layout.addWidget(title_widget)
105 |
106 | # 统计信息区域
107 | stats_widget = QWidget()
108 | stats_layout = QHBoxLayout(stats_widget)
109 |
110 | # 加载实时统计数据
111 | stats = self.db_handler.get_statistics()
112 |
113 | self.total_comments = StyledCard("总评论数", str(stats['total_comments']))
114 | self.total_videos = StyledCard("视频数", str(stats['total_videos']))
115 | self.total_users = StyledCard("评论用户数", str(stats['total_users']))
116 |
117 | stats_layout.addWidget(self.total_comments)
118 | stats_layout.addWidget(self.total_videos)
119 | stats_layout.addWidget(self.total_users)
120 |
121 | layout.addWidget(stats_widget)
122 |
123 | # 功能区域
124 | feature_label = QLabel("快速入口")
125 | feature_label.setFont(QFont("Arial", 18, QFont.Weight.Bold))
126 | feature_label.setStyleSheet("color: white; margin: 20px 0;")
127 | feature_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
128 | layout.addWidget(feature_label)
129 |
130 | # Feature buttons
131 | buttons_widget = QWidget()
132 | buttons_layout = QHBoxLayout(buttons_widget)
133 | buttons_layout.setSpacing(20)
134 |
135 | self.start_crawl_btn = self.create_feature_button("开始爬取", "立即开始爬取B站视频评论")
136 | self.search_btn = self.create_feature_button("评论查询", "搜索和查看已爬取的评论")
137 | self.settings_btn = self.create_feature_button("系统设置", "配置Cookie和系统参数")
138 |
139 | buttons_layout.addWidget(self.start_crawl_btn)
140 | buttons_layout.addWidget(self.search_btn)
141 | buttons_layout.addWidget(self.settings_btn)
142 |
143 | layout.addWidget(buttons_widget)
144 | layout.addStretch()
145 |
146 | # Footer label showing the version
147 | author_label = QLabel("版本: 1.0.0")
148 | author_label.setAlignment(Qt.AlignmentFlag.AlignRight)
149 | author_label.setStyleSheet("color: #666666; margin: 0px;")
150 | layout.addWidget(author_label)
151 |
152 | def create_feature_button(self, title, description):
153 | button = QPushButton()
154 | button.setMinimumSize(QSize(300, 140))
155 | button.setStyleSheet("""
156 | QPushButton {
157 | background-color: #2d2d2d;
158 | border: none;
159 | border-radius: 10px;
160 | /* CSS transitions are not supported by Qt style sheets, so none is declared */
161 | }
162 | QPushButton:hover {
163 | background-color: #3d3d3d;
164 | }
165 | QPushButton:pressed {
166 | background-color: #404040;
167 | }
168 | """)
169 |
170 | content_widget = QWidget(button)
171 | content_widget.setGeometry(button.rect())
172 | content_layout = QVBoxLayout(content_widget)
173 | content_layout.setContentsMargins(20, 20, 20, 20)
174 | content_layout.setSpacing(10)
175 |
176 | title_label = QLabel(title)
177 | title_label.setFont(QFont("Arial", 16, QFont.Weight.Bold))
178 | title_label.setStyleSheet("color: #0078d4;")
179 | title_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
180 |
181 | desc_label = QLabel(description)
182 | desc_label.setStyleSheet("color: #cccccc;")
183 | desc_label.setWordWrap(True)
184 | desc_label.setAlignment(Qt.AlignmentFlag.AlignCenter)
185 |
186 | content_layout.addStretch()
187 | content_layout.addWidget(title_label)
188 | content_layout.addWidget(desc_label)
189 | content_layout.addStretch()
190 |
191 | # Keep the content widget sized to the button
192 | button.resizeEvent = lambda e: content_widget.setGeometry(button.rect())
193 |
194 | return button
195 |
196 | def connect_buttons(self, tab_widget):
197 | """Connect each button to its corresponding tab."""
198 | self.start_crawl_btn.clicked.connect(lambda: tab_widget.setCurrentIndex(1))
199 | self.search_btn.clicked.connect(lambda: tab_widget.setCurrentIndex(2))
200 | self.settings_btn.clicked.connect(lambda: tab_widget.setCurrentIndex(3))
201 |
202 | def start_update_stats(self):
203 | """Start an asynchronous statistics refresh."""
204 | if self.stats_worker and self.stats_worker.isRunning():
205 | # The previous refresh is still running: skip this tick instead of blocking the UI thread in wait()
206 | return
207 |
208 | self.stats_worker = UpdateStatsWorker(self.db_handler)
209 | self.stats_worker.stats_updated.connect(self.handle_stats_updated)
210 | self.stats_worker.start()
211 |
212 | def handle_stats_updated(self, stats):
213 | """Apply a finished statistics update to the cards."""
214 | self.total_comments.update_content(str(stats['total_comments']))
215 | self.total_videos.update_content(str(stats['total_videos']))
216 | self.total_users.update_content(str(stats['total_users']))
--------------------------------------------------------------------------------
/bilibili_spider/pages/search_page.py:
--------------------------------------------------------------------------------
1 | from PyQt6.QtWidgets import (QWidget, QVBoxLayout, QHBoxLayout, QLabel,
2 | QPushButton, QTableWidget, QTableWidgetItem,
3 | QFrame, QComboBox, QLineEdit, QApplication,
4 | QHeaderView, QMessageBox, QToolTip, QGraphicsOpacityEffect)
5 | from PyQt6.QtCore import Qt, QThread, pyqtSignal, QTimer, QPropertyAnimation
6 | from PyQt6.QtGui import QColor, QCursor
7 | import json
8 |
9 |
10 | class SearchWorker(QThread):
11 | finished = pyqtSignal(list)
12 | error = pyqtSignal(str)
13 |
14 | def __init__(self, db_handler, query_type, search_text='', sort_field='publish_time', sort_order='DESC'):
15 | super().__init__()
16 | self.db_handler = db_handler
17 | self.query_type = query_type
18 | self.search_text = search_text
19 | self.sort_field = sort_field
20 | self.sort_order = sort_order
21 |
22 | def run(self):
23 | try:
24 | results = self.db_handler.query_comments_batch(
25 | self.query_type,
26 | self.search_text,
27 | batch_size=1000,
28 | offset=0,
29 | sort_by=self.sort_field,
30 | sort_order=self.sort_order
31 | )
32 | self.finished.emit(results)
33 | except Exception as e:
34 | self.error.emit(str(e))
35 |
36 | class StyledFrame(QFrame):
37 | def __init__(self, title="", parent=None):
38 | super().__init__(parent)
39 | self.setStyleSheet("""
40 | StyledFrame {
41 | background-color: #2d2d2d;
42 | border-radius: 8px;
43 | padding: 15px;
44 | margin: 5px;
45 | }
46 | """)
47 | self.layout = QVBoxLayout(self)
48 | self.layout.setSpacing(10)
49 |
50 | if title:
51 | label = QLabel(title)
52 | label.setStyleSheet("""
53 | font-size: 16px;
54 | font-weight: bold;
55 | color: white;
56 | padding: 5px;
57 | margin-bottom: 10px;
58 | """)
59 | self.layout.addWidget(label)
60 |
61 |
62 | class SearchPage(QWidget):
63 | def __init__(self, db_handler):
64 | super().__init__()
65 | self.db_handler = db_handler
66 | self.search_worker = None
67 | self.sort_field = 'publish_time'
68 | self.sort_order = 'DESC'
69 | self.floating_tip = FloatingTip(self)
70 | self.init_ui()
71 |
72 | def init_ui(self):
73 | layout = QVBoxLayout(self)
74 | layout.setContentsMargins(20, 20, 20, 20)
75 | layout.setSpacing(15)
76 |
77 | # Search controls
78 | search_frame = StyledFrame("搜索控制")
79 | search_layout = QHBoxLayout()
80 | search_layout.setSpacing(10)
81 | search_layout.setContentsMargins(10, 5, 10, 5)
82 |
83 | # Search type selector
84 | self.search_type = QComboBox()
85 | self.search_type.addItems([
86 | "查看全部评论",
87 | "按视频ID搜索",
88 | "按视频标题搜索",
89 | "按用户名搜索",
90 | "按评论内容搜索"
91 | ])
92 | self.search_type.setStyleSheet("""
93 | QComboBox {
94 | padding: 8px;
95 | border: 1px solid #3d3d3d;
96 | border-radius: 4px;
97 | background-color: #1e1e1e;
98 | color: white;
99 | min-width: 150px;
100 | font-size: 14px;
101 | }
102 | QComboBox::drop-down {
103 | border: none;
104 | width: 30px;
105 | }
106 | QComboBox::down-arrow {
107 | image: url(resources/down_arrow.png);
108 | width: 12px;
109 | height: 12px;
110 | }
111 | QComboBox QAbstractItemView {
112 | background-color: #1e1e1e;
113 | border: 1px solid #3d3d3d;
114 | selection-background-color: #0078d4;
115 | color: white;
116 | }
117 | """)
118 |
119 | # Search input
120 | self.search_input = QLineEdit()
121 | self.search_input.setEnabled(False)
122 | self.search_input.setPlaceholderText("查看全部评论无需输入搜索内容")
123 | self.search_input.setStyleSheet("""
124 | QLineEdit {
125 | padding: 8px;
126 | border: 1px solid #3d3d3d;
127 | border-radius: 4px;
128 | background-color: #1e1e1e;
129 | color: white;
130 | font-size: 14px;
131 | }
132 | QLineEdit:focus {
133 | border: 1px solid #0078d4;
134 | }
135 | QLineEdit:disabled {
136 | background-color: #2d2d2d;
137 | color: #888888;
138 | }
139 | """)
140 |
141 | # Search button
142 | self.search_button = QPushButton("搜索")
143 | self.search_button.setStyleSheet("""
144 | QPushButton {
145 | padding: 8px 20px;
146 | background-color: #0078d4;
147 | color: white;
148 | border: none;
149 | border-radius: 4px;
150 | font-weight: bold;
151 | font-size: 14px;
152 | min-width: 100px;
153 | }
154 | QPushButton:hover {
155 | background-color: #1184db;
156 | }
157 | QPushButton:pressed {
158 | background-color: #006abc;
159 | }
160 | QPushButton:disabled {
161 | background-color: #666666;
162 | }
163 | """)
164 |
165 | search_layout.addWidget(self.search_type)
166 | search_layout.addWidget(self.search_input)
167 | search_layout.addWidget(self.search_button)
168 |
169 | search_frame.layout.addLayout(search_layout)
170 | layout.addWidget(search_frame)
171 |
172 | # Results table
173 | results_frame = StyledFrame("搜索结果")
174 | results_frame.layout.setContentsMargins(10, 5, 10, 5)
175 |
176 | self.result_table = QTableWidget()
177 | self.result_table.setColumnCount(8)
178 | self.result_table.setHorizontalHeaderLabels([
179 | '视频ID', '视频标题', '用户名', '评论内容', '发布时间',
180 | '点赞数', '回复数', '更新时间'
181 | ])
182 |
183 | self.setup_table_style()
184 | results_frame.layout.addWidget(self.result_table)
185 | layout.addWidget(results_frame)
186 |
187 | # Connect signals
188 | self.search_button.clicked.connect(self.start_search)
189 | self.search_type.currentIndexChanged.connect(self.on_search_type_changed)
190 | self.result_table.horizontalHeader().sectionClicked.connect(self.handle_sort_click)
191 | self.result_table.cellDoubleClicked.connect(self.copy_cell_content)
192 |
193 | def setup_table_style(self):
194 | self.result_table.setStyleSheet("""
195 | QTableWidget {
196 | background-color: #1e1e1e;
197 | border: 1px solid #3d3d3d;
198 | gridline-color: #3d3d3d;
199 | color: white;
200 | }
201 | QTableWidget::item {
202 | padding: 8px;
203 | border-bottom: 1px solid #3d3d3d;
204 | color: white;
205 | }
206 | QTableWidget::item:hover {
207 | background-color: #3d3d3d;
208 | }
209 | QTableWidget::item:selected {
210 | background-color: #404040;
211 | color: white;
212 | }
213 | QHeaderView::section {
214 | background-color: #2d2d2d;
215 | padding: 12px 8px;
216 | border: none;
217 | border-right: 1px solid #3d3d3d;
218 | border-bottom: 1px solid #3d3d3d;
219 | font-weight: bold;
220 | color: white;
221 | font-size: 14px;
222 | }
223 | QHeaderView::section:hover {
224 | background-color: #3d3d3d;
225 | /* "cursor" is not a Qt style sheet property; call setCursor() on the header to change the pointer */
226 | }
227 | QToolTip {
228 | background-color: #2d2d2d;
229 | color: white;
230 | border: 1px solid #3d3d3d;
231 | padding: 5px;
232 | }
233 | """)
234 |
235 | header = self.result_table.horizontalHeader()
236 | header.setSectionResizeMode(QHeaderView.ResizeMode.Interactive)
237 |
238 | column_widths = {
239 | 0: 120, # video ID
240 | 1: 200, # video title
241 | 2: 150, # username
242 | 3: 305, # comment text
243 | 4: 150, # publish time
244 | 5: 80, # like count
245 | 6: 80, # reply count
246 | 7: 150 # update time
247 | }
248 |
249 | for col, width in column_widths.items():
250 | self.result_table.setColumnWidth(col, width)
251 |
252 | self.result_table.verticalHeader().setVisible(False)
253 | self.result_table.setSelectionBehavior(QTableWidget.SelectionBehavior.SelectRows)
254 | self.result_table.setSelectionMode(QTableWidget.SelectionMode.SingleSelection)
255 |
256 | def copy_cell_content(self, row, col):
257 | try:
258 | import re # local import; only needed to strip highlight markup
259 | content = ''
260 | cell_widget = self.result_table.cellWidget(row, col)
261 | if cell_widget and isinstance(cell_widget, QLabel):
262 | content = re.sub(r'<[^>]+>', '', cell_widget.text()) # drop the rich-text tags (assumed to be simple inline tags)
263 | else:
264 | item = self.result_table.item(row, col)
265 | if item:
266 | content = item.text()
267 |
268 | if content:
269 | clipboard = QApplication.clipboard()
270 | clipboard.setText(content)
271 |
272 | # Get the current cursor position
273 | cursor_pos = QCursor.pos()
274 | tip = FloatingTip(self)
275 | tip.showTip("已复制到剪贴板", cursor_pos, 2000)
276 |
277 | except Exception as e:
278 | print(f"复制内容失败: {str(e)}")
279 |
280 | def on_search_type_changed(self, index):
281 | self.search_input.setEnabled(index != 0)
282 | if index == 0:
283 | self.search_input.clear()
284 | self.search_input.setPlaceholderText("查看全部评论无需输入搜索内容")
285 | else:
286 | self.search_input.setPlaceholderText("请输入搜索内容...")
287 |
288 | def handle_sort_click(self, column_index):
289 | sort_mapping = {
290 | 4: 'publish_time', # publish-time column
291 | 5: 'like_count', # like-count column
292 | 6: 'replies' # reply-count column
293 | }
294 |
295 | if column_index in sort_mapping:
296 | if self.sort_field == sort_mapping[column_index]:
297 | self.sort_order = 'ASC' if self.sort_order == 'DESC' else 'DESC'
298 | else:
299 | self.sort_field = sort_mapping[column_index]
300 | self.sort_order = 'DESC'
301 |
302 | self.start_search()
303 |
304 | def start_search(self):
305 | if hasattr(self, 'search_worker') and self.search_worker:
306 | self.search_worker.wait()
307 | self.search_worker.deleteLater()
308 | self.search_worker = None
309 |
310 | query_type = str(self.search_type.currentIndex() + 1)
311 | search_text = self.search_input.text().strip()
312 |
313 | if query_type != '1' and not search_text:
314 | return
315 |
316 | self.result_table.setRowCount(0)
317 | self.search_button.setEnabled(False)
318 | self.search_button.setText("正在查询...")
319 |
320 | self.search_worker = SearchWorker(
321 | self.db_handler,
322 | query_type,
323 | search_text,
324 | self.sort_field,
325 | self.sort_order
326 | )
327 | self.search_worker.finished.connect(self.handle_search_results)
328 | self.search_worker.error.connect(self.handle_search_error)
329 | self.search_worker.start()
330 |
331 | def highlight_text(self, text, search_text):
332 | if not search_text or search_text.lower() not in text.lower():
333 | return text
334 |
335 | label = QLabel()
336 | index = text.lower().find(search_text.lower())
337 | before = text[:index]
338 | matched = text[index:index + len(search_text)]
339 | after = text[index + len(search_text):]
340 |
341 | label.setText(f"{before}<span style='background-color: #0078d4;'>{matched}</span>{after}") # assumed highlight style (the UI's accent colour)
342 | label.setTextFormat(Qt.TextFormat.RichText)
343 | return label
344 |
345 | def handle_search_results(self, results):
346 | try:
347 | self.result_table.setRowCount(0)
348 | query_type = str(self.search_type.currentIndex() + 1)
349 | search_text = self.search_input.text().strip()
350 |
351 | for row_data in results:
352 | row = self.result_table.rowCount()
353 | self.result_table.insertRow(row)
354 |
355 | for col, data in enumerate(row_data):
356 | text = str(data)
357 | if len(text) > 100 and col == 3:
358 | text = text[:97] + "..."
359 |
360 | if search_text and (
361 | (query_type == '2' and col == 0) or # video ID
362 | (query_type == '3' and col == 1) or # video title
363 | (query_type == '4' and col == 2) or # username
364 | (query_type == '5' and col == 3) # comment text
365 | ):
366 | highlighted = self.highlight_text(text, search_text)
367 | if isinstance(highlighted, QLabel):
368 | self.result_table.setCellWidget(row, col, highlighted)
369 | continue
370 |
371 | item = QTableWidgetItem(text)
372 | if col == 6:
373 | replies = json.loads(data) if data else []
374 | item.setText(str(len(replies)))
375 | item.setTextAlignment(Qt.AlignmentFlag.AlignCenter)
376 |
377 | item.setFlags(item.flags() & ~Qt.ItemFlag.ItemIsEditable)
378 | self.result_table.setItem(row, col, item)
379 |
380 | self.result_table.setRowHeight(row, 40)
381 |
382 | except Exception as e:
383 | self.handle_search_error(str(e))
384 | finally:
385 | self.search_button.setEnabled(True)
386 | self.search_button.setText("搜索")
387 |
388 | def handle_search_error(self, error_message):
389 | self.search_button.setEnabled(True)
390 | self.search_button.setText("搜索")
391 | QMessageBox.critical(self, "错误", f"搜索失败: {error_message}")
392 |
393 |
394 | class FloatingTip(QLabel):
395 | def __init__(self, parent=None):
396 | super().__init__(parent)
397 | self.setStyleSheet("""
398 | background-color: rgba(45, 45, 45, 0.95);
399 | color: white;
400 | border: 1px solid #3d3d3d;
401 | padding: 8px 12px;
402 | border-radius: 4px;
403 | """)
404 | self.setAlignment(Qt.AlignmentFlag.AlignCenter)
405 | self.hide()
406 |
407 | # Fade-out animation
408 | self.opacity = QGraphicsOpacityEffect(self)
409 | self.setGraphicsEffect(self.opacity)
410 | self.anim = QPropertyAnimation(self.opacity, b"opacity")
411 | self.anim.setDuration(200) # 200 ms fade
412 | self.anim.finished.connect(self.hide) # connect once here, not on every fade
413 |
414 | self.fade_timer = QTimer(self)
415 | self.fade_timer.timeout.connect(self.start_fade)
416 |
417 | def showTip(self, text, pos, duration=3000):
418 | self.setText(text)
419 | self.adjustSize()
420 |
421 | self.move(pos.x() - 250, pos.y() - 90)
422 |
423 | # Reset the opacity and show the tip
424 | self.opacity.setOpacity(1)
425 | self.show()
426 |
427 | # Start the countdown to the fade-out
428 | self.fade_timer.start(duration)
429 |
430 | def start_fade(self):
431 | self.fade_timer.stop()
432 | self.anim.setStartValue(1)
433 | self.anim.setEndValue(0)
434 | self.anim.start()
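
Note on `highlight_text` and `copy_cell_content` above: the table renders matched text as rich text in a `QLabel` and later has to recover plain text for the clipboard. The sketch below shows one way to keep the two directions consistent without any Qt dependency. The helper names `highlight_html` and `strip_highlight`, the `<span>` markup, and the default colour are assumptions made for illustration and are not taken from the repository.

```python
import html
import re


def highlight_html(text: str, needle: str, color: str = "#0078d4") -> str:
    """Return text as HTML with each case-insensitive match of needle wrapped in a span.

    All user-supplied text is HTML-escaped so that a comment containing '<' or '&'
    cannot be misread as markup when the label uses rich text.
    """
    if not needle:
        return html.escape(text)
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(
            f'<span style="background-color: {color};">'
            f"{html.escape(match.group())}</span>"
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def strip_highlight(markup: str) -> str:
    """Invert highlight_html: remove the span tags, then undo the escaping."""
    return html.unescape(re.sub(r"</?span[^>]*>", "", markup))
```

A label would show `highlight_html(text, search_text)` with `Qt.TextFormat.RichText`, and a copy action would pass the label's text through `strip_highlight` before writing it to the clipboard. Unlike a single `find`, this variant marks every occurrence rather than only the first.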
--------------------------------------------------------------------------------
/bilibili_spider/pages/settings_page.py:
--------------------------------------------------------------------------------
1 | from PyQt6.QtWidgets import (QWidget, QVBoxLayout, QHBoxLayout, QLabel,
2 | QPushButton, QLineEdit, QTextEdit, QFrame,
3 | QMessageBox, QSpinBox)
4 | from PyQt6.QtCore import Qt
5 | from datetime import datetime
6 |
7 | from bilibili_spider.utils.cookie_helper import CookieHelper
8 |
9 |
10 | class StyledFrame(QFrame):
11 | """
12 | @description: Custom styled panel widget
13 | """
14 |
15 | def __init__(self, title="", parent=None):
16 | """
17 | @description: Initialize the styled panel
18 | @param {str} title - panel title
19 | @param {QWidget} parent - parent widget
20 | """
21 | super().__init__(parent)
22 | self.setStyleSheet("""
23 | StyledFrame {
24 | background-color: #2d2d2d;
25 | border-radius: 8px;
26 | padding: 15px;
27 | margin: 5px;
28 | }
29 | """)
30 | self.layout = QVBoxLayout(self)
31 | self.layout.setSpacing(10)
32 |
33 | if title:
34 | label = QLabel(title)
35 | label.setStyleSheet("""
36 | font-size: 16px;
37 | font-weight: bold;
38 | color: white;
39 | padding: 5px;
40 | margin-bottom: 10px;
41 | """)
42 | self.layout.addWidget(label)
43 |
44 |
45 | class SettingsPage(QWidget):
46 | """
47 | @description: Settings page providing Cookie management, crawler configuration, and database management
48 | """
49 |
50 | def __init__(self, config, db_handler):
51 | """
52 | @description: Initialize the settings page
53 | @param {Config} config - configuration manager instance
54 | @param {DatabaseHandler} db_handler - database handler instance
55 | """
56 | super().__init__()
57 | self.config = config
58 | self.db_handler = db_handler
59 | self.cookie_helper = None
60 | self.init_ui()
61 | self.load_settings()
62 |
63 | def init_ui(self):
64 | """
65 | @description: Build the user interface
66 | """
67 | layout = QVBoxLayout(self)
68 | layout.setContentsMargins(20, 20, 20, 20)
69 | layout.setSpacing(15)
70 |
71 | # Cookie settings area
72 | cookie_frame = StyledFrame("Cookie 管理")
73 | cookie_frame.layout.setContentsMargins(10, 5, 10, 5)
74 |
75 | # Cookie status display
76 | status_layout = QHBoxLayout()
77 | status_layout.setContentsMargins(0, 5, 0, 5)
78 | self.cookie_status_label = QLabel("当前状态: 未设置")
79 | self.cookie_status_label.setStyleSheet("""
80 | color: #ff9900;
81 | font-weight: bold;
82 | font-size: 14px;
83 | padding: 5px;
84 | """)
85 | status_layout.addWidget(self.cookie_status_label)
86 | cookie_frame.layout.addLayout(status_layout)
87 |
88 | # Cookie input area
89 | self.cookie_input = QTextEdit()
90 | self.cookie_input.setPlaceholderText(
91 | "请输入B站Cookie...\n"
92 | "提示:Cookie中必须包含以下字段:\n"
93 | "- SESSDATA\n"
94 | "- bili_jct\n"
95 | "- DedeUserID"
96 | )
97 | self.cookie_input.setStyleSheet("""
98 | QTextEdit {
99 | background-color: #1e1e1e;
100 | color: white;
101 | border: 1px solid #3d3d3d;
102 | border-radius: 4px;
103 | padding: 10px;
104 | font-size: 14px;
105 | }
106 | """)
107 | self.cookie_input.setMaximumHeight(150)
108 | cookie_frame.layout.addWidget(self.cookie_input)
109 |
110 | # Cookie action buttons
111 | button_layout = QHBoxLayout()
112 | button_layout.setSpacing(10)
113 |
114 | button_styles = """
115 | QPushButton {
116 | padding: 8px 20px;
117 | background-color: #0078d4;
118 | color: white;
119 | border: none;
120 | border-radius: 4px;
121 | font-weight: bold;
122 | font-size: 14px;
123 | min-width: 120px;
124 | min-height: 40px;
125 | }
126 | QPushButton:hover {
127 | background-color: #1184db;
128 | }
129 | QPushButton:pressed {
130 | background-color: #006abc;
131 | }
132 | """
133 |
134 | # Create and add the buttons
135 | for btn_text, btn_action in [
136 | ("快速获取Cookie", self.show_cookie_helper),
137 | ("验证Cookie", self.validate_cookie),
138 | ("保存Cookie", self.save_cookie),
139 | ("清除Cookie", self.clear_cookie)
140 | ]:
141 | btn = QPushButton(btn_text)
142 | btn.setStyleSheet(button_styles)
143 | btn.clicked.connect(btn_action)
144 | button_layout.addWidget(btn)
145 |
146 | cookie_frame.layout.addLayout(button_layout)
147 | layout.addWidget(cookie_frame)
148 |
149 | # Crawler configuration area
150 | crawler_frame = StyledFrame("爬虫设置")
151 | crawler_frame.layout.setContentsMargins(10, 5, 10, 5)
152 |
153 | # Delay settings
154 | delay_layout = QHBoxLayout()
155 | delay_layout.setContentsMargins(0, 5, 0, 5)
156 |
157 | delay_label = QLabel("请求延迟范围(秒):")
158 | delay_label.setStyleSheet("""
159 | QLabel {
160 | color: white;
161 | font-size: 14px;
162 | padding: 5px;
163 | }
164 | """)
165 | delay_layout.addWidget(delay_label)
166 |
167 | # SpinBox style, tuned to fix a height issue
168 | spinbox_style = """
169 | QSpinBox {
170 | padding: 8px;
171 | border: 1px solid #3d3d3d;
172 | border-radius: 4px;
173 | background-color: #2d2d2d;
174 | color: white;
175 | font-size: 14px;
176 | min-height: 27px;
177 | }
178 | QSpinBox::up-button, QSpinBox::down-button {
179 | width: 20px;
180 | height: 18px;
181 | background: #404040;
182 | border: none;
183 | subcontrol-origin: border;
184 | }
185 | QSpinBox::up-button {
186 | subcontrol-position: top right;
187 | }
188 | QSpinBox::down-button {
189 | subcontrol-position: bottom right;
190 | }
191 | QSpinBox::up-button:hover, QSpinBox::down-button:hover {
192 | background: #505050;
193 | }
194 | QSpinBox::up-button:pressed, QSpinBox::down-button:pressed {
195 | background: #606060;
196 | }
197 | """
198 |
199 | # Minimum delay input
200 | self.min_delay = QSpinBox()
201 | self.min_delay.setRange(0, 100)
202 | self.min_delay.setValue(self.config.DELAY_MIN)
203 | self.min_delay.setMinimumWidth(80)
204 | self.min_delay.setMinimumHeight(27) # enforce a minimum height
205 | self.min_delay.setAlignment(Qt.AlignmentFlag.AlignCenter)
206 | self.min_delay.setStyleSheet(spinbox_style)
207 |
208 | delay_layout.addWidget(self.min_delay)
209 |
210 | delay_to_label = QLabel("到")
211 | delay_to_label.setStyleSheet("""
212 | QLabel {
213 | color: white;
214 | font-size: 14px;
215 | padding: 5px;
216 | }
217 | """)
218 | delay_layout.addWidget(delay_to_label)
219 |
220 | # Maximum delay input
221 | self.max_delay = QSpinBox()
222 | self.max_delay.setRange(0, 100)
223 | self.max_delay.setValue(self.config.DELAY_MAX)
224 | self.max_delay.setMinimumWidth(80)
225 | self.max_delay.setMinimumHeight(27) # enforce a minimum height
226 | self.max_delay.setAlignment(Qt.AlignmentFlag.AlignCenter)
227 | self.max_delay.setStyleSheet(spinbox_style)
228 |
229 | delay_layout.addWidget(self.max_delay)
230 | delay_layout.addStretch()
231 |
232 | crawler_frame.layout.addLayout(delay_layout)
233 |
234 | # Retry settings
235 | retry_layout = QHBoxLayout()
236 | retry_layout.setContentsMargins(0, 5, 0, 5)
237 |
238 | retry_label = QLabel("最大重试次数:")
239 | retry_label.setStyleSheet("""
240 | QLabel {
241 | color: white;
242 | font-size: 14px;
243 | padding: 5px;
244 | }
245 | """)
246 | retry_layout.addWidget(retry_label)
247 |
248 | # Retry count input
249 | self.max_retries = QSpinBox()
250 | self.max_retries.setRange(0, 10)
251 | self.max_retries.setValue(self.config.MAX_RETRIES)
252 | self.max_retries.setMinimumWidth(80)
253 | self.max_retries.setMinimumHeight(40) # enforce a minimum height
254 | self.max_retries.setAlignment(Qt.AlignmentFlag.AlignCenter)
255 | self.max_retries.setStyleSheet(spinbox_style)
256 |
257 | retry_layout.addWidget(self.max_retries)
258 | retry_layout.addStretch()
259 |
260 | crawler_frame.layout.addLayout(retry_layout)
261 | layout.addWidget(crawler_frame)
262 |
263 | # Database management area
264 | db_frame = StyledFrame("数据库管理")
265 | db_frame.layout.setContentsMargins(10, 5, 10, 5)
266 |
267 | # Database status
268 | self.db_status_label = QLabel("数据库状态: 正常")
269 | self.db_status_label.setStyleSheet("""
270 | color: #00cc00;
271 | font-weight: bold;
272 | font-size: 14px;
273 | padding: 5px;
274 | """)
275 | db_frame.layout.addWidget(self.db_status_label)
276 |
277 | # Database action buttons
278 | db_button_layout = QHBoxLayout()
279 | db_button_layout.setSpacing(10)
280 |
281 | # Clear-database button
282 | self.clear_db_button = QPushButton("清空数据库")
283 | self.clear_db_button.setStyleSheet("""
284 | QPushButton {
285 | padding: 8px 20px;
286 | background-color: #d83b01;
287 | color: white;
288 | border: none;
289 | border-radius: 4px;
290 | font-weight: bold;
291 | font-size: 14px;
292 | min-width: 120px;
293 | min-height: 27px;
294 | }
295 | QPushButton:hover {
296 | background-color: #e9553e;
297 | }
298 | QPushButton:pressed {
299 | background-color: #c63502;
300 | }
301 | """)
302 |
303 | # Backup-database button
304 | self.backup_db_button = QPushButton("备份数据库")
305 | self.backup_db_button.setStyleSheet("""
306 | QPushButton {
307 | padding: 8px 20px;
308 | background-color: #107c10;
309 | color: white;
310 | border: none;
311 | border-radius: 4px;
312 | font-weight: bold;
313 | font-size: 14px;
314 | min-width: 120px;
315 | min-height: 40px;
316 | }
317 | QPushButton:hover {
318 | background-color: #13981c;
319 | }
320 | QPushButton:pressed {
321 | background-color: #0e6a0e;
322 | }
323 | """)
324 |
325 | db_button_layout.addWidget(self.clear_db_button)
326 | db_button_layout.addWidget(self.backup_db_button)
327 | db_button_layout.addStretch()
328 |
329 | db_frame.layout.addLayout(db_button_layout)
330 | layout.addWidget(db_frame)
331 |
332 | # Connect signals
333 | self.clear_db_button.clicked.connect(self.clear_database)
334 | self.backup_db_button.clicked.connect(self.backup_database)
335 |
336 | # Persist the settings whenever a value changes
337 | self.min_delay.valueChanged.connect(self.save_settings)
338 | self.max_delay.valueChanged.connect(self.save_settings)
339 | self.max_retries.valueChanged.connect(self.save_settings)
340 |
341 | layout.addStretch()
342 |
343 | def show_cookie_helper(self):
344 | """
345 | @description: Show the Cookie helper tool
346 | """
347 | try:
348 | self.cookie_helper = CookieHelper(self.config, self.db_handler)
349 | self.cookie_helper.cookie_ready.connect(self.on_cookie_received)
350 | except Exception as e:
351 | QMessageBox.critical(self, "错误", f"启动Cookie获取工具失败: {str(e)}")
352 |
353 | def on_cookie_received(self, cookie):
354 | """
355 | @description: Handle a received Cookie
356 | @param {str} cookie - the Cookie string that was obtained
357 | """
358 | try:
359 | # Show it in the input box
360 | self.cookie_input.setText(cookie)
361 | # Save the Cookie automatically
362 | if self.config.set_cookie(cookie):
363 | self.db_handler.save_cookie(cookie)
364 | self.load_settings()
365 | QMessageBox.information(self, "成功", "Cookie已获取并保存!")
366 | except Exception as e:
367 | QMessageBox.critical(self, "错误", f"处理Cookie失败: {str(e)}")
368 |
369 | def load_settings(self):
370 | """
371 | @description: Load the current settings
372 | """
373 | try:
374 | cookie, need_update = self.db_handler.get_valid_cookie()
375 | if cookie:
376 | self.cookie_input.setText(cookie)
377 | if need_update:
378 | self.cookie_status_label.setText("当前状态: Cookie即将过期")
379 | self.cookie_status_label.setStyleSheet("""
380 | color: #ff9900;
381 | font-weight: bold;
382 | font-size: 14px;
383 | padding: 5px;
384 | """)
385 | else:
386 | self.cookie_status_label.setText("当前状态: Cookie有效")
387 | self.cookie_status_label.setStyleSheet("""
388 | color: #00cc00;
389 | font-weight: bold;
390 | font-size: 14px;
391 | padding: 5px;
392 | """)
393 | else:
394 | self.cookie_status_label.setText("当前状态: 未设置Cookie")
395 | self.cookie_status_label.setStyleSheet("""
396 | color: #ff0000;
397 | font-weight: bold;
398 | font-size: 14px;
399 | padding: 5px;
400 | """)
401 |
402 | except Exception as e:
403 | QMessageBox.critical(self, "错误", f"加载设置失败: {str(e)}")
404 |
405 | def validate_cookie(self):
406 | """
407 | @description: Validate the Cookie
408 | """
409 | cookie = self.cookie_input.toPlainText().strip()
410 | if not cookie:
411 | QMessageBox.warning(self, "提示", "请输入Cookie")
412 | return
413 |
414 | try:
415 | if self.config.validate_cookie(cookie):
416 | QMessageBox.information(self, "成功", "Cookie格式验证通过")
417 | else:
418 | QMessageBox.warning(self, "错误", "Cookie格式不正确或缺少必要字段")
419 | except Exception as e:
420 | QMessageBox.critical(self, "错误", f"验证Cookie失败: {str(e)}")
421 |
422 | def save_cookie(self):
423 | """
424 | @description: Save the Cookie to the database
425 | """
426 | cookie = self.cookie_input.toPlainText().strip()
427 | if not cookie:
428 | QMessageBox.warning(self, "提示", "请输入Cookie")
429 | return
430 |
431 | try:
432 | # Clear the old Cookie first
433 | self.config.clear_cookie()
434 |
435 | # Set the new Cookie
436 | if self.config.set_cookie(cookie):
437 | self.db_handler.save_cookie(cookie)
438 | self.load_settings() # reload the status display
439 | QMessageBox.information(self, "成功", "Cookie保存成功")
440 | else:
441 | QMessageBox.warning(self, "错误", "Cookie格式不正确或缺少必要字段")
442 | except Exception as e:
443 | QMessageBox.critical(self, "错误", f"保存Cookie失败: {str(e)}")
444 |
445 | def clear_cookie(self):
446 | """
447 | @description: Clear the stored Cookie
448 | """
449 | reply = QMessageBox.question(
450 | self, "确认", "确定要清除Cookie吗?",
451 | QMessageBox.StandardButton.Yes | QMessageBox.StandardButton.No
452 | )
453 |
454 | if reply == QMessageBox.StandardButton.Yes:
455 | try:
456 | self.cookie_input.clear()
457 | self.config.clear_cookie()
458 | self.db_handler.clear_cookies()
459 | self.load_settings()
460 | QMessageBox.information(self, "成功", "Cookie已清除")
461 | except Exception as e:
462 | QMessageBox.critical(self, "错误", f"清除Cookie失败: {str(e)}")
463 |
464 | def clear_database(self):
465 | """
466 | @description: Clear the database
467 | """
468 | reply = QMessageBox.question(
469 | self, "确认",
470 | "确定要清空数据库吗?此操作将删除所有已爬取的评论数据!",
471 | QMessageBox.StandardButton.Yes | QMessageBox.StandardButton.No
472 | )
473 |
474 | if reply == QMessageBox.StandardButton.Yes:
475 | try:
476 | self.db_handler.clear_database()
477 | QMessageBox.information(self, "成功", "数据库已清空")
478 | except Exception as e:
479 | QMessageBox.critical(self, "错误", f"清空数据库失败: {str(e)}")
480 |
481 | def backup_database(self):
482 | """
483 | @description: Back up the database (not yet implemented)
484 | """
485 | QMessageBox.information(self, "提示", "数据库备份功能开发中...")
486 |
487 | def closeEvent(self, event):
488 | """
489 | @description: Handle the window close event
490 | @param {QCloseEvent} event - the close event
491 | """
492 | if self.cookie_helper:
493 | self.cookie_helper.cleanup()
494 | self.cookie_helper = None
495 | event.accept()
496 |
497 | def save_settings(self):
498 | """
499 | @description: Save the crawler settings to the config object
500 | """
501 | try:
502 | self.config.DELAY_MIN = self.min_delay.value()
503 | self.config.DELAY_MAX = self.max_delay.value()
504 | self.config.MAX_RETRIES = self.max_retries.value()
505 | except Exception as e:
506 | QMessageBox.critical(self, "错误", f"保存设置失败: {str(e)}")
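
Note on the Cookie checks above: the input placeholder states that a Cookie must contain `SESSDATA`, `bili_jct`, and `DedeUserID`, and `validate_cookie` delegates the check to `self.config.validate_cookie`, whose body is not shown in this file. A minimal sketch of such a field check follows. It is an assumption about what the check could look like, not the project's implementation, and the names `parse_cookie` and `missing_cookie_fields` are hypothetical.

```python
# Field names taken from the placeholder text in SettingsPage.init_ui.
REQUIRED_COOKIE_FIELDS = ("SESSDATA", "bili_jct", "DedeUserID")


def parse_cookie(cookie: str) -> dict:
    """Split a 'name1=value1; name2=value2' Cookie string into a dict.

    Only the first '=' separates name from value, so values that contain
    '=' themselves are preserved. Parts without '=' are ignored.
    """
    fields = {}
    for part in cookie.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            fields[name] = value
    return fields


def missing_cookie_fields(cookie: str) -> list:
    """Return the required field names that are absent or empty."""
    fields = parse_cookie(cookie)
    return [name for name in REQUIRED_COOKIE_FIELDS if not fields.get(name)]
```

Returning the list of missing names, rather than a bare boolean, would let the warning dialog tell the user which field to add. This only checks the format; it does not confirm that the server still accepts the session.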
--------------------------------------------------------------------------------
/bilibili_spider/spiders/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Roinflam/bilibili_spider/957ed38cb390d132d6a6aa46b19aeb008899d0d3/bilibili_spider/spiders/__init__.py
--------------------------------------------------------------------------------
/bilibili_spider/spiders/comment_spider.py:
--------------------------------------------------------------------------------
1 | # bilibili_spider/spiders/comment_spider.py
2 |
3 | import re
4 | import time
5 | import random
6 | import logging
7 | import requests
8 | from datetime import datetime
9 | import json
10 |
11 |
12 | class BilibiliSpider:
13 | """Bilibili comment crawler."""
14 |
15 | def __init__(self, config):
16 | """Initialize the crawler instance
17 |
18 | @param {Config} config - configuration object
19 | """
20 | self.headers = config.get_headers()
21 | self.config = config
22 |
23 | logging.basicConfig(
24 | level=logging.INFO,
25 | format='%(asctime)s - %(levelname)s - %(message)s'
26 | )
27 | self.logger = logging.getLogger(__name__)
28 |
29 | def extract_video_id(self, url):
30 | """Extract the video ID from a URL
31 |
32 | @param {string} url - video URL
33 | @return {string} - video ID, or None if no ID is found
34 | """
35 | patterns = [
36 | r'BV\w{10}', # BV ID format
37 | r'av\d+' # av ID format
38 | ]
39 |
40 | for pattern in patterns:
41 | match = re.search(pattern, url)
42 | if match:
43 | return match.group()
44 | return None
45 |
46 | def get_api_url(self, video_id, page=1):
47 | """Build the comment API URL
48 |
49 | @param {string} video_id - video ID
50 | @param {int} page - page number
51 | @return {string} - API URL, or None if the BV ID cannot be resolved
52 | """
53 | if video_id.startswith('BV'):
54 | self.logger.info(f"正在处理BV号: {video_id}")
55 | try:
56 | view_url = f'https://api.bilibili.com/x/web-interface/view?bvid={video_id}'
57 | response = requests.get(view_url, headers=self.headers)
58 | response.raise_for_status()
59 | data = response.json()
60 | if data['code'] == 0:
61 | aid = data['data']['aid']
62 | self.logger.info(f"获取到aid: {aid}")
63 | else:
64 | self.logger.error(f"获取aid失败: {data['message']}")
65 | return None
66 | except Exception as e:
67 | self.logger.error(f"转换BV号失败: {str(e)}")
68 | return None
69 | elif video_id.startswith('av'):
70 | aid = video_id[2:]
71 | else:
72 | aid = video_id
73 |
74 | return f'http://api.bilibili.com/x/v2/reply?pn={page}&type=1&oid={aid}&sort=2'
75 |
76 | def crawl_video_comments(self, url, max_pages=10):
77 | """爬取视频评论"""
78 | self.logger.info(f"开始爬取视频评论: {url}")
79 | video_id = self.extract_video_id(url)
80 | if not video_id:
81 | self.logger.error("无法从URL中提取视频ID")
82 | return []
83 |
84 | # 获取视频标题
85 | video_title = ""
86 | try:
87 | if video_id.startswith('BV'):
88 | view_url = f'https://api.bilibili.com/x/web-interface/view?bvid={video_id}'
89 | else:
90 | view_url = f'https://api.bilibili.com/x/web-interface/view?aid={video_id.lstrip("av")}'
91 |
92 | response = requests.get(view_url, headers=self.headers)
93 | response.raise_for_status()
94 | data = response.json()
95 | if data['code'] == 0:
96 | video_title = data['data']['title']
97 | self.logger.info(f"获取到视频标题: {video_title}")
98 | except Exception as e:
99 | self.logger.error(f"获取视频标题失败: {str(e)}")
100 | video_title = "未知标题"
101 |
102 | all_comments = []
103 | current_page = 1
104 |
105 | while current_page <= max_pages:
106 | try:
107 | self.logger.info(f"正在爬取第 {current_page} 页")
108 |
109 | api_url = self.get_api_url(video_id, current_page)
110 | if not api_url:
111 | break
112 |
113 | response = requests.get(api_url, headers=self.headers)
114 | response.raise_for_status()
115 | data = response.json()
116 |
117 | if data['code'] != 0:
118 | self.logger.error(f"API返回错误: {data.get('message', '未知错误')}")
119 | break
120 |
121 | replies = data['data'].get('replies', [])
122 | if not replies:
123 | self.logger.info("没有更多评论了")
124 | break
125 |
126 | for reply in replies:
127 | try:
128 | comment_data = {
129 | 'comment_id': str(reply['rpid']),
130 | 'video_id': video_id,
131 | 'video_title': video_title,
132 | 'user_name': reply['member']['uname'],
133 | 'content': reply['content']['message'],
134 | 'publish_time': datetime.fromtimestamp(
135 | reply['ctime']
136 | ).strftime('%Y-%m-%d %H:%M:%S'),
137 | 'like_count': reply['like'],
138 | 'replies': []
139 | }
140 |
141 | # 处理子回复
142 | if reply.get('replies'):
143 | for sub_reply in reply['replies']:
144 | reply_data = {
145 | 'user_name': sub_reply['member']['uname'],
146 | 'content': sub_reply['content']['message'],
147 | 'time': datetime.fromtimestamp(
148 | sub_reply['ctime']
149 | ).strftime('%Y-%m-%d %H:%M:%S')
150 | }
151 | comment_data['replies'].append(reply_data)
152 |
153 | all_comments.append(comment_data)
154 | except Exception as e:
155 | self.logger.error(f"处理评论数据失败: {str(e)}")
156 | continue
157 |
158 | self.logger.info(f"第 {current_page} 页爬取完成,获取到 {len(replies)} 条评论")
159 |
160 | delay = random.uniform(self.config.DELAY_MIN, self.config.DELAY_MAX)
161 | self.logger.debug(f"等待 {delay:.2f} 秒后继续...")
162 | time.sleep(delay)
163 |
164 | current_page += 1
165 |
166 | except requests.exceptions.RequestException as e:
167 | self.logger.error(f"网络请求失败: {str(e)}")
168 | break
169 | except json.JSONDecodeError as e:
170 | self.logger.error(f"解析响应数据失败: {str(e)}")
171 | break
172 | except Exception as e:
173 | self.logger.error(f"爬取过程出错: {str(e)}")
174 | break
175 |
176 | self.logger.info(f"爬取完成,共获取 {len(all_comments)} 条评论")
177 | return all_comments
178 |
179 | def test_cookie(self):
180 | """测试Cookie是否有效
181 |
182 | @return {bool} - Cookie是否有效
183 | """
184 | try:
185 | test_url = 'http://api.bilibili.com/x/web-interface/nav'
186 | response = requests.get(test_url, headers=self.headers)
187 | data = response.json()
188 |
189 | if data['code'] == 0:
190 | user_name = data['data'].get('uname', '')
191 | self.logger.info(f"Cookie有效,当前用户: {user_name}")
192 | return True
193 | else:
194 | self.logger.warning(f"Cookie无效: {data.get('message', '未知错误')}")
195 | return False
196 |
197 | except Exception as e:
198 | self.logger.error(f"测试Cookie失败: {str(e)}")
199 | return False
--------------------------------------------------------------------------------
/bilibili_spider/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Roinflam/bilibili_spider/957ed38cb390d132d6a6aa46b19aeb008899d0d3/bilibili_spider/utils/__init__.py
--------------------------------------------------------------------------------
/bilibili_spider/utils/config.py:
--------------------------------------------------------------------------------
# bilibili_spider/utils/config.py

import logging


class Config:
    """Configuration class that manages the settings the spider needs."""

    def __init__(self):
        """Initialize the configuration."""
        # Base request headers
        self.base_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6',
            # Only advertise encodings that requests decodes out of the box;
            # br and zstd need optional packages that requirements.txt does not list
            'Accept-Encoding': 'gzip, deflate',
            'sec-ch-ua': '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Windows"',
            'sec-fetch-dest': 'document',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-site': 'none',
            'sec-fetch-user': '?1',
            'upgrade-insecure-requests': '1',
            'cache-control': 'no-cache',
            'pragma': 'no-cache'
        }

        # Headers currently in use
        self.headers = self.base_headers.copy()

        # Spider settings
        self.DELAY_MIN = 3    # minimum delay in seconds
        self.DELAY_MAX = 7    # maximum delay in seconds
        self.MAX_RETRIES = 3  # maximum number of retries
        self.MAX_PAGES = 10   # default maximum number of pages to crawl

        # Cookie settings
        self.cookie = None
        self._cookie_valid = False

        # Configure logging
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)

    def validate_cookie(self, cookie):
        """Check that the cookie is well formed and contains the required fields.

        @param {string} cookie - cookie string to validate
        @return {bool} - whether the cookie is valid
        """
        if not cookie:
            self.logger.info("Cookie is empty")
            return False

        try:
            # Parse the cookie string into a dictionary
            cookie_dict = {}
            for item in cookie.split(';'):
                if '=' in item:
                    name, value = item.strip().split('=', 1)
                    cookie_dict[name.strip()] = value.strip()

            # Check the required fields
            required_fields = ['SESSDATA', 'bili_jct', 'DedeUserID']
            for field in required_fields:
                if field not in cookie_dict:
                    self.logger.info(f"Missing required field: {field}")
                    return False
                if not cookie_dict[field]:
                    self.logger.info(f"Field value is empty: {field}")
                    return False

            return True

        except Exception as e:
            self.logger.error(f"Cookie validation failed: {str(e)}")
            return False

    def set_cookie(self, cookie):
        """Set the cookie and update the request headers.

        @param {string} cookie - cookie string
        @return {bool} - whether the cookie was set
        """
        # An empty cookie clears the current one
        if not cookie:
            self.clear_cookie()
            return False

        # Validate the cookie first
        if not self.validate_cookie(cookie):
            return False

        try:
            # Reset the headers
            self.headers = self.base_headers.copy()

            # Set the cookie-related fields
            self.cookie = cookie
            self.headers['Cookie'] = cookie
            self.headers['Origin'] = 'https://www.bilibili.com'
            self.headers['Host'] = 'api.bilibili.com'
            self.headers['Referer'] = 'https://www.bilibili.com'

            self._cookie_valid = True
            self.logger.info("Cookie set successfully")
            return True

        except Exception as e:
            self.logger.error(f"Failed to set cookie: {str(e)}")
            self.clear_cookie()
            return False

    def clear_cookie(self):
        """Clear all cookie-related state."""
        self.cookie = None
        self._cookie_valid = False
        self.headers = self.base_headers.copy()
        self.logger.info("Cookie cleared")

    def has_valid_cookie(self):
        """Check whether a valid cookie is set.

        @return {bool} - whether a valid cookie is set
        """
        return bool(self.cookie and self._cookie_valid)

    def get_headers(self):
        """Get the request headers.

        @return {dict} - copy of the full request header dictionary
        """
        if not self.has_valid_cookie():
            self.logger.warning("No valid cookie is currently set")
        return self.headers.copy()
--------------------------------------------------------------------------------
/bilibili_spider/utils/cookie_helper.py:
--------------------------------------------------------------------------------
# bilibili_spider/utils/cookie_helper.py

from PyQt6.QtCore import QObject, pyqtSignal
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import threading
import logging
import time


class CookieHelper(QObject):
    """Helper that obtains a cookie through a browser login."""
    cookie_ready = pyqtSignal(str)

    def __init__(self, config, db_handler):
        super().__init__()
        self.config = config
        self.db_handler = db_handler
        self.driver = None
        self.is_running = False

        # Check the browser environment, then start the browser
        if self.check_browser_environment():
            self.start_browser_thread()

    def check_browser_environment(self):
        """Check which Selenium browser bindings are available.

        Note: this only verifies that the Selenium option classes can be
        imported, not that the browser itself is installed.
        """
        try:
            # Try Edge first
            from selenium.webdriver.edge.options import Options as EdgeOptions
            self.browser_type = 'edge'
            return True
        except ImportError:
            try:
                # Fall back to Chrome
                from selenium.webdriver.chrome.options import Options as ChromeOptions
                self.browser_type = 'chrome'
                return True
            except ImportError:
                logging.error("No usable browser detected")
                return False

    def start_browser_thread(self):
        """Start the browser in a new thread."""
        self.is_running = True
        threading.Thread(target=self.run_browser, daemon=True).start()

    def run_browser(self):
        """Run the browser and watch for the login cookie."""
        try:
            # Create the browser instance that matches the detected type
            if self.browser_type == 'edge':
                from selenium.webdriver.edge.options import Options
                options = Options()
                options.add_argument('--start-maximized')
                self.driver = webdriver.Edge(options=options)
            else:
                from selenium.webdriver.chrome.options import Options
                options = Options()
                options.add_argument('--start-maximized')
                self.driver = webdriver.Chrome(options=options)

            # Open the Bilibili home page
            self.driver.get('https://www.bilibili.com')

            try:
                # Wait for the login button, then click it
                wait = WebDriverWait(self.driver, 10)
                login_button = wait.until(
                    EC.presence_of_element_located((By.CLASS_NAME, "header-login-entry"))
                )
                login_button.click()

                # Poll the browser cookies until the login fields appear
                while self.is_running:
                    cookies = self.driver.get_cookies()
                    cookie_dict = {cookie['name']: cookie['value'] for cookie in cookies}

                    required_fields = ['SESSDATA', 'bili_jct', 'DedeUserID']
                    if all(field in cookie_dict for field in required_fields):
                        cookie_str = '; '.join([f"{k}={v}" for k, v in cookie_dict.items()])

                        if self.config.validate_cookie(cookie_str):
                            self.cookie_ready.emit(cookie_str)
                            break

                    time.sleep(2)

            except TimeoutException:
                logging.error("Failed to load the login page")

        except Exception as e:
            logging.error(f"Failed to start the browser: {str(e)}")

        finally:
            self.cleanup()

    def cleanup(self):
        """Release resources."""
        self.is_running = False
        if self.driver:
            try:
                self.driver.quit()
            except Exception:
                # The browser may already be closed; nothing more to release
                pass
            self.driver = None
--------------------------------------------------------------------------------
/bilibili_spider/utils/db_handler.py:
--------------------------------------------------------------------------------
# bilibili_spider/utils/db_handler.py

"""Database utilities."""

import sqlite3
import json
import logging
from datetime import datetime, timedelta
from contextlib import contextmanager


class DatabaseHandler:
    """Database handler responsible for comment data and cookie management."""

    def __init__(self, db_file):
        """Initialize the database handler.

        @param {string} db_file - path to the database file
        """
        self.db_file = db_file
        self.logger = self._setup_logger()
        self.init_db()

    @contextmanager
    def get_connection(self):
        """Yield a database connection and always close it afterwards."""
        conn = None
        try:
            conn = sqlite3.connect(self.db_file)
            yield conn
        finally:
            if conn:
                conn.close()

    def _setup_logger(self):
        """Set up the logger.

        @return {Logger} - configured logger
        """
        logger = logging.getLogger('BilibiliSpider')
        logger.setLevel(logging.INFO)

        if not logger.handlers:
            handler = logging.StreamHandler()
            handler.setLevel(logging.INFO)
            formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
            handler.setFormatter(formatter)
            logger.addHandler(handler)

        return logger

    def init_db(self):
        """Create the database tables if they do not exist."""
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                # Comments table
                cursor.execute('''
                    CREATE TABLE IF NOT EXISTS comments (
                        id INTEGER PRIMARY KEY AUTOINCREMENT,
                        video_id TEXT NOT NULL,
                        video_title TEXT NOT NULL,
                        comment_id TEXT NOT NULL UNIQUE,
                        user_name TEXT NOT NULL,
                        content TEXT NOT NULL,
                        publish_time TEXT NOT NULL,
                        like_count INTEGER DEFAULT 0,
                        replies TEXT,
                        create_time TEXT NOT NULL,
                        update_time TEXT NOT NULL
                    )
                ''')

                # Cookie management table
                cursor.execute('''
                    CREATE TABLE IF NOT EXISTS cookie_manager (
                        id INTEGER PRIMARY KEY AUTOINCREMENT,
                        cookie TEXT NOT NULL,
                        create_time TEXT NOT NULL,
                        expire_time TEXT NOT NULL,
                        last_check_time TEXT NOT NULL,
                        is_valid INTEGER DEFAULT 1,
                        UNIQUE(cookie)
                    )
                ''')

                conn.commit()
                self.logger.info("Database schema initialized")

        except Exception as e:
            self.logger.error(f"Failed to initialize the database: {str(e)}")
            raise

    def save_cookie(self, cookie, expire_days=30):
        """Save a cookie.

        @param {string} cookie - cookie string
        @param {int} expire_days - cookie lifetime in days
        """
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                # Mark every existing cookie as invalid
                cursor.execute('UPDATE cookie_manager SET is_valid = 0')

                # Compute the current time and the expiry time
                current_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                expire_time = (datetime.now() + timedelta(days=expire_days)).strftime('%Y-%m-%d %H:%M:%S')

                # Insert the new cookie
                cursor.execute('''
                    INSERT OR REPLACE INTO cookie_manager (
                        cookie, create_time, expire_time, last_check_time, is_valid
                    ) VALUES (?, ?, ?, ?, 1)
                ''', (cookie, current_time, expire_time, current_time))

                conn.commit()
                self.logger.info("Cookie saved to the database")

        except Exception as e:
            self.logger.error(f"Failed to save cookie: {str(e)}")
            raise

    def get_valid_cookie(self):
        """Get the current valid cookie.

        @return {tuple} - (cookie string, whether it needs to be refreshed)
        """
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                # Query the most recent valid cookie
                cursor.execute('''
                    SELECT cookie, expire_time, create_time
                    FROM cookie_manager
                    WHERE is_valid = 1
                    ORDER BY create_time DESC
                    LIMIT 1
                ''')

                result = cursor.fetchone()
                if not result:
                    self.logger.info("No valid cookie found in the database")
                    return None, True

                cookie, expire_time, create_time = result
                self.logger.info(f"Found cookie record, created at: {create_time}")

                # Parse the expiry time; refresh is needed within 3 days of expiry
                expire_time = datetime.strptime(expire_time, '%Y-%m-%d %H:%M:%S')
                current_time = datetime.now()
                need_update = current_time + timedelta(days=3) > expire_time

                if cookie:
                    self.logger.info("Valid cookie retrieved")

                return cookie, need_update

        except Exception as e:
            self.logger.error(f"Failed to get cookie: {str(e)}")
            raise

    def clear_cookies(self):
        """Delete all cookie records."""
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                cursor.execute('DELETE FROM cookie_manager')
                conn.commit()
                self.logger.info("All cookie records cleared")

        except Exception as e:
            self.logger.error(f"Failed to clear cookies: {str(e)}")
            raise

    def save_comment(self, comment):
        """Insert a comment, or update it if it already exists.

        @param {Comment} comment - comment object
        @return {int} - 1 if inserted, 2 if updated, 0 on failure
        """
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()
                current_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

                # Check whether the comment already exists
                cursor.execute('SELECT id FROM comments WHERE comment_id = ?', (comment.comment_id,))
                exists = cursor.fetchone()

                if exists:
                    # Update the existing comment
                    cursor.execute('''
                        UPDATE comments
                        SET user_name = ?,
                            content = ?,
                            publish_time = ?,
                            like_count = ?,
                            replies = ?,
                            video_title = ?,
                            update_time = ?
                        WHERE comment_id = ?
                    ''', (
                        comment.user_name,
                        comment.content,
                        comment.publish_time,
                        comment.like_count,
                        json.dumps(comment.replies, ensure_ascii=False),
                        comment.video_title,
                        current_time,
                        comment.comment_id
                    ))
                    conn.commit()
                    return 2  # updated
                else:
                    # Insert a new comment
                    cursor.execute('''
                        INSERT INTO comments (
                            video_id, video_title, comment_id, user_name, content,
                            publish_time, like_count, replies, create_time, update_time
                        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                    ''', (
                        comment.video_id,
                        comment.video_title,
                        comment.comment_id,
                        comment.user_name,
                        comment.content,
                        comment.publish_time,
                        comment.like_count,
                        json.dumps(comment.replies, ensure_ascii=False),
                        current_time,
                        current_time
                    ))
                    conn.commit()
                    return 1  # inserted

        except Exception as e:
            self.logger.error(f"Failed to save comment: {str(e)}")
            return 0  # failed

    def query_comments_batch(self, query_type, search_text='', batch_size=100, offset=0, sort_by='publish_time',
                             sort_order='DESC'):
        """Query comments one batch at a time.

        @param {string} query_type - '2' video ID, '3' video title, '4' user name, '5' content; anything else returns all
        @param {string} search_text - text to match with LIKE
        @param {int} batch_size - number of rows per batch
        @param {int} offset - row offset of the batch
        @param {string} sort_by - 'publish_time', 'like_count' or 'replies'
        @param {string} sort_order - 'ASC' or 'DESC'
        @return {list} - list of result rows
        """
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                base_sql = """
                    SELECT video_id, video_title, user_name, content, publish_time,
                           like_count, replies, update_time
                    FROM comments
                    {where_clause}
                    ORDER BY {sort_field} {sort_order}
                    LIMIT ? OFFSET ?
                """

                # Only whitelisted expressions are ever formatted into the SQL text
                valid_sort_fields = {
                    'publish_time': 'publish_time',
                    'like_count': 'like_count',
                    'replies': 'json_array_length(replies)'
                }

                sort_field = valid_sort_fields.get(sort_by, 'publish_time')
                sort_order = 'DESC' if sort_order.upper() == 'DESC' else 'ASC'

                where_clause = {
                    '2': "WHERE video_id LIKE ?",     # search by video ID
                    '3': "WHERE video_title LIKE ?",  # search by video title
                    '4': "WHERE user_name LIKE ?",    # search by user name
                    '5': "WHERE content LIKE ?"       # search by comment content
                }.get(query_type, "")

                if where_clause:
                    params = (f'%{search_text}%', batch_size, offset)
                else:
                    params = (batch_size, offset)

                sql = base_sql.format(
                    where_clause=where_clause,
                    sort_field=sort_field,
                    sort_order=sort_order
                )

                cursor.execute(sql, params)
                results = cursor.fetchall()
                return results

        except Exception as e:
            self.logger.error(f"Failed to query comments in batches: {str(e)}")
            raise

    def clear_database(self):
        """Delete all comment data from the database."""
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                cursor.execute('DELETE FROM comments')
                # Reset the AUTOINCREMENT counter; SQL string literals take single quotes
                cursor.execute("DELETE FROM sqlite_sequence WHERE name = 'comments'")

                conn.commit()
                self.logger.info("Comment data cleared from the database")

        except Exception as e:
            self.logger.error(f"Failed to clear the database: {str(e)}")
            raise

    def get_statistics(self):
        """Get database statistics.

        @return {dict} - dictionary of statistics
        """
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                # Total number of comments
                cursor.execute('SELECT COUNT(*) FROM comments')
                total_comments = cursor.fetchone()[0]

                # Number of videos
                cursor.execute('SELECT COUNT(DISTINCT video_id) FROM comments')
                total_videos = cursor.fetchone()[0]

                # Number of users
                cursor.execute('SELECT COUNT(DISTINCT user_name) FROM comments')
                total_users = cursor.fetchone()[0]

                # Time the most recent comment was stored
                cursor.execute('SELECT MAX(create_time) FROM comments')
                latest_comment = cursor.fetchone()[0]

                return {
                    'total_comments': total_comments,
                    'total_videos': total_videos,
                    'total_users': total_users,
                    'latest_comment': latest_comment
                }

        except Exception as e:
            self.logger.error(f"Failed to get statistics: {str(e)}")
            return {
                'total_comments': 0,
                'total_videos': 0,
                'total_users': 0,
                'latest_comment': None
            }

    def export_comments(self, video_id=None, format='json'):
        """Export comment data.

        @param {string} video_id - video ID; when None, all comments are exported
        @param {string} format - export format, json or csv (csv is not implemented yet)
        @return {string} - exported data as a string
        """
        try:
            with self.get_connection() as conn:
                cursor = conn.cursor()

                if video_id:
                    cursor.execute('''
                        SELECT * FROM comments
                        WHERE video_id = ?
                        ORDER BY publish_time DESC
                    ''', (video_id,))
                else:
                    cursor.execute('SELECT * FROM comments ORDER BY publish_time DESC')

                results = cursor.fetchall()

                if format == 'json':
                    # Convert the query results to JSON
                    columns = [description[0] for description in cursor.description]
                    data = []
                    for row in results:
                        item = dict(zip(columns, row))
                        # replies is nullable, so only decode it when a value is stored
                        if item.get('replies'):
                            item['replies'] = json.loads(item['replies'])
                        data.append(item)
                    return json.dumps(data, ensure_ascii=False, indent=2)

                elif format == 'csv':
                    # TODO: implement CSV export
                    raise NotImplementedError("CSV export is not implemented yet")

                else:
                    raise ValueError(f"Unsupported export format: {format}")

        except Exception as e:
            self.logger.error(f"Failed to export comments: {str(e)}")
            raise
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
# main.py

import sys
import logging
from PyQt6.QtWidgets import QApplication
from bilibili_spider.main_window import MainWindow


def setup_logger():
    logger = logging.getLogger('BilibiliSpider')

    # Only add a handler if none is attached yet
    if not logger.handlers:
        logger.setLevel(logging.INFO)
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)

    # Prevent duplicate log output
    logger.propagate = False
    return logger


def main():
    # Create the logger outside the try block so the except clause can always use it
    logger = setup_logger()
    try:
        app = QApplication(sys.argv)
        window = MainWindow()
        window.show()
        logger.info("Application started successfully")
        sys.exit(app.exec())
    except Exception as e:
        logger.error(f"Application failed to start: {str(e)}")
        sys.exit(1)


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
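# 🎯 Bilibili Comment Spider

A desktop application for crawling Bilibili comments, built with PyQt6. It crawls video comments in batches, stores them locally, and provides tools for managing the collected data, all behind a dark-themed interface.

## ✨ Key Features

The project is modular and offers the following features:

- 🖥️ Dark-themed graphical interface with a simple, intuitive layout
- 🚀 Crawls comments for videos identified by BV or av number, extracting the video ID from the URL automatically
- 💾 Stores data in a local SQLite database with incremental updates
- 🔍 Searches comments along several dimensions with multiple filters
- 📊 Built-in statistics for keeping track of crawl progress
- ⚙️ Configurable crawler parameters for controlling the crawl strategy
- 🔒 Integrated cookie management, including automatic capture from a browser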
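
The video ID recognition mentioned in the feature list can be sketched as a standalone function. This is a minimal sketch that reuses the two regular expressions from `comment_spider.py`; the sample URLs are placeholders chosen for illustration, not videos referenced by this project.

```python
import re

# Same patterns as comment_spider.py: "BV" plus 10 word characters, or "av" plus digits
VIDEO_ID_PATTERNS = (r'BV\w{10}', r'av\d+')


def extract_video_id(url):
    """Return the first BV or av identifier found in url, or None."""
    for pattern in VIDEO_ID_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group()
    return None


# Placeholder URLs for illustration only
print(extract_video_id('https://www.bilibili.com/video/BV1xx411c7mD?p=1'))
print(extract_video_id('https://www.bilibili.com/video/av170001'))
print(extract_video_id('https://www.bilibili.com/'))
```

Because the patterns are searched rather than anchored, any query string or path suffix after the ID is ignored, and a URL without an ID yields `None`.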

## 🛠️ Development Environment

- Python 3.8+
- PyQt6
- SQLite3
- Requests
- Selenium

## 📦 Installation

1. Clone the repository:
   ```bash
   git clone https://github.com/Roinflam/bilibili_spider.git
   cd bilibili_spider
   ```

2. Install the dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Run the program:
   ```bash
   python main.py
   ```

## 💡 Features in Detail

The application consists of four core modules:

**Comment crawling**
- Crawls comments directly from a video URL
- Configurable page count and request interval
- Incremental updates of comment data
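
The page limit and request interval can be illustrated with a small loop. This is a sketch under stated assumptions: `crawl_pages` and `fetch_page` are hypothetical names, and unlike `crawl_video_comments` in `comment_spider.py` it skips the pause after the final page. The `sleep` parameter exists only so the example runs without actually waiting.

```python
import random
import time


def crawl_pages(fetch_page, max_pages, delay_min, delay_max, sleep=time.sleep):
    """Call fetch_page(page) for pages 1..max_pages with a random pause between pages.

    Stops early as soon as fetch_page returns an empty list.
    """
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break
        results.extend(items)
        if page < max_pages:
            # A random delay in [delay_min, delay_max] seconds spreads requests out
            sleep(random.uniform(delay_min, delay_max))
    return results


# Hypothetical stand-in for the comment API: two pages of data, then nothing
fake_pages = {1: ['comment a', 'comment b'], 2: ['comment c']}
comments = crawl_pages(lambda page: fake_pages.get(page, []), max_pages=10,
                       delay_min=3, delay_max=7, sleep=lambda seconds: None)
print(comments)
```

The default bounds of 3 and 7 seconds match `DELAY_MIN` and `DELAY_MAX` in `config.py`.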

**Data management**
- Local storage in a SQLite database
- Automatic deduplication and updates
- Data backup
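
The deduplication and update behaviour can be sketched against an in-memory database. This is a simplified sketch, assuming a reduced `comments` table with only three data columns; it follows the same check-then-insert-or-update flow and the same return codes (1 for inserted, 2 for updated) as `save_comment` in `db_handler.py`.

```python
import sqlite3


def save_comment(conn, comment_id, content, like_count):
    """Insert a new comment or update the stored one; return 1 (inserted) or 2 (updated)."""
    exists = conn.execute(
        'SELECT id FROM comments WHERE comment_id = ?', (comment_id,)
    ).fetchone()
    if exists:
        conn.execute(
            'UPDATE comments SET content = ?, like_count = ? WHERE comment_id = ?',
            (content, like_count, comment_id),
        )
        status = 2
    else:
        conn.execute(
            'INSERT INTO comments (comment_id, content, like_count) VALUES (?, ?, ?)',
            (comment_id, content, like_count),
        )
        status = 1
    conn.commit()
    return status


# Reduced schema for illustration; the real table has more columns
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        comment_id TEXT NOT NULL UNIQUE,
        content TEXT NOT NULL,
        like_count INTEGER DEFAULT 0
    )
''')
first = save_comment(conn, '1001', 'first version', 1)
second = save_comment(conn, '1001', 'edited version', 5)
rows = conn.execute('SELECT comment_id, content, like_count FROM comments').fetchall()
print(first, second, rows)
```

Crawling the same video twice therefore refreshes like counts and edited text instead of creating duplicate rows.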

**Search**
- Comment search along several dimensions
- Flexible sorting
- Fast lookup of comment content
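
Searching with sorting can be sketched as follows. This is a minimal sketch, assuming a reduced table and the hypothetical names `search_comments`, `SEARCH_COLUMNS`, and `SORT_FIELDS`; like `query_comments_batch` in `db_handler.py`, it formats only whitelisted column names into the SQL text and passes the search text as a bound parameter.

```python
import sqlite3

# Whitelists: only these names are ever interpolated into the SQL string
SEARCH_COLUMNS = {'video_id', 'video_title', 'user_name', 'content'}
SORT_FIELDS = {'publish_time', 'like_count'}


def search_comments(conn, column, text, sort_by='publish_time', descending=True,
                    limit=100, offset=0):
    """Return (user_name, content, like_count) rows whose column contains text."""
    if column not in SEARCH_COLUMNS:
        raise ValueError(f'unsupported search column: {column}')
    sort_field = sort_by if sort_by in SORT_FIELDS else 'publish_time'
    order = 'DESC' if descending else 'ASC'
    sql = (
        'SELECT user_name, content, like_count FROM comments '
        f'WHERE {column} LIKE ? ORDER BY {sort_field} {order} LIMIT ? OFFSET ?'
    )
    return conn.execute(sql, (f'%{text}%', limit, offset)).fetchall()


# Sample rows invented for illustration
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE comments (
        video_id TEXT, video_title TEXT, user_name TEXT,
        content TEXT, publish_time TEXT, like_count INTEGER
    )
''')
conn.executemany('INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)', [
    ('BV1', 'Demo', 'alice', 'good video', '2024-01-01 10:00:00', 3),
    ('BV1', 'Demo', 'bob', 'not good', '2024-01-02 10:00:00', 9),
    ('BV1', 'Demo', 'carol', 'meh', '2024-01-03 10:00:00', 1),
])
found = search_comments(conn, 'content', 'good', sort_by='like_count')
print(found)
```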

**System settings**
- Cookie configuration and management
- Custom crawler parameters
- Integrated database management
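
The cookie check behind the settings page can be sketched as a pure function. This is a sketch, assuming the hypothetical helper names `parse_cookie` and `missing_cookie_fields`; the three required field names are the ones `validate_cookie` in `config.py` looks for, and the cookie values below are fake placeholders.

```python
# Fields that config.py requires before it accepts a cookie
REQUIRED_FIELDS = ('SESSDATA', 'bili_jct', 'DedeUserID')


def parse_cookie(cookie):
    """Parse a 'name=value; name=value' cookie string into a dictionary."""
    pairs = {}
    for item in cookie.split(';'):
        if '=' in item:
            name, value = item.strip().split('=', 1)
            pairs[name.strip()] = value.strip()
    return pairs


def missing_cookie_fields(cookie):
    """Return the required fields that are absent or empty in the cookie string."""
    pairs = parse_cookie(cookie or '')
    return [field for field in REQUIRED_FIELDS if not pairs.get(field)]


# Fake placeholder values, not real credentials
print(missing_cookie_fields('SESSDATA=abc; bili_jct=def; DedeUserID=42'))
print(missing_cookie_fields('SESSDATA=abc; bili_jct='))
```

Returning the list of missing fields, rather than a bare boolean, makes it easier to tell the user which part of a pasted cookie is incomplete.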
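
## 📝 Usage Notes

Keep the following in mind when using the application:

1. Before first use, configure a login cookie in the system settings
2. Set a reasonable crawl delay to avoid sending requests too frequently
3. Enable automatic backup when crawling large amounts of data
4. The program is for study and research only; do not use it commercially
5. Follow Bilibili's rules and use the crawler responsibly

## 📄 License

This project is released under the GPLv3 license. It is intended for study and research only; commercial use is prohibited.

## 🤝 About the Author

- GitHub: [Roinflam](https://github.com/Roinflam)
- Contributions are welcome through Issues or Pull Requests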
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
# Basic
requests>=2.31.0
beautifulsoup4>=4.12.2

# GUI
PyQt6>=6.6.1
PyQt6-WebEngine>=6.6.0

# Browser automation
selenium~=4.27.1

# Dev tools
black>=23.11.0
pylint>=3.0.2
--------------------------------------------------------------------------------