├── url.txt
├── config.toml
├── go.mod
├── makefile
├── README.md
├── .github
    └── workflows
    │   └── go.yml
├── template
    ├── template1.html
    └── template2.html
├── LICENSE
├── .gitignore
├── template.go
├── main.go
├── go.sum
├── post_processing.go
├── fetch.go
├── parse_test.go
├── type.go
└── parse.go


/url.txt:
--------------------------------------------------------------------------------
https://tieba.baidu.com/p/7201761174
http://tieba.baidu.com/mo/m?kz=6212415344
file:///example/test0.json?tid=6212415344


--------------------------------------------------------------------------------
/config.toml:
--------------------------------------------------------------------------------
numFetcher = 10
numParser = 50
numRenderer = 5
# "template1.html": minimal output template
# "template2.html": replaces thumbnails with high-resolution images
templateName = "template1.html"
retryPeriod = 10
highResImage = true
storeExternalResource = true
# Set these by hand to get past Baidu's security verification
userAgent = ""
cookieString = ""

# Show user nicknames
showNickname = true


--------------------------------------------------------------------------------
/go.mod:
--------------------------------------------------------------------------------
module github.com/hjhee/tiebaSpider

go 1.23

toolchain go1.23.1

require (
	github.com/PuerkitoBio/goquery v1.10.0
	github.com/fsnotify/fsnotify v1.7.0
	github.com/pelletier/go-toml v1.9.5
	golang.org/x/net v0.30.0
)

require (
	github.com/andybalholm/cascadia v1.3.2 // indirect
	golang.org/x/sys v0.26.0 // indirect
)


--------------------------------------------------------------------------------
/makefile:
--------------------------------------------------------------------------------
.DEFAULT_GOAL := build

GITVER = `git describe --tags HEAD`

build:
	@go build -ldflags "-X main.version=${GITVER}"

clean:
	@go clean

.PHONY: git-tree-check
git-tree-check:
# $(shell ...) runs the command; a bare $(git diff --stat) is an empty variable reference, so the check never fired
ifneq ($(shell git diff --stat),)
	$(warning "git tree is not clean")
endif

win: git-tree-check
	@echo ver: ${GITVER}
	@GOOS="windows" go build -ldflags "-X main.version=${GITVER}"
	@zip win64.zip template/*.html tiebaSpider.exe LICENSE README.md url.txt config.toml


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# tiebaSpider

The program fetches every comment in a Baidu Tieba thread, including all nested replies (楼中楼), and saves them locally as HTML and JSON. Posts on consecutive floors by the same author are merged for easier reading.

The threads to fetch are listed one per line in `url.txt`. The program reads `url.txt` from its own directory and crawls the thread behind each URL in turn. Besides HTTP URLs, the file protocol is also supported; for the format, see the URLs already present in `url.txt`. The file protocol is mainly useful for verifying the program or adjusting the HTML template styling. Each extracted thread is saved as `file_{thread title}.{json,html}` in the `output` folder under the program's directory. If local image saving is enabled, fetched resources are saved in the `res_{thread title}` folder.

## Features

The program is written in Go and uses goroutines to fetch, parse, and render pages concurrently. The number of goroutines of each kind can be adjusted in `config.toml`.

- Supports all nested replies
- Supports WAP-version Tieba links

The following features can also be enabled through the configuration file:

- Switch templates to set the output HTML style
- Replace image links with high-resolution originals
- Save images locally
- Set Cookie and User-Agent to handle security verification

## Templates

- The saved HTML file is defined by the HTML template `template/template1.html`. You can edit this file to change the generated HTML, for example to improve the appearance or to embed JavaScript that filters posts by author (such as showing only the original poster). For the data available to templates, see the `TemplateField` definition in `type.go`; for the template syntax, see the official Go documentation.

- `template/template2.html` demonstrates how a template can use JavaScript to replace thumbnails with links to high-resolution images.


--------------------------------------------------------------------------------
/.github/workflows/go.yml:
--------------------------------------------------------------------------------
name: Go

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:

  build:
    if: github.repository == 'hjhee/tiebaSpider'
    name: Build
    runs-on: ubuntu-latest
    steps:

    - name: Set up Go
      uses: actions/setup-go@v5
      with:
        go-version: '>=1.23.1'

    - name: Check out code into the Go module directory
      uses: actions/checkout@v4

    - name: Get dependencies
      run: |
        go get -v -t -d ./...
        if [ -f Gopkg.toml ]; then
            curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh
            dep ensure
        fi

    - name: Build
      run: GOOS=windows GOARCH=amd64 go build -v .

    - name: Create build artifacts
      uses: actions/upload-artifact@v4
      with:
        name: win64
        path: |
          template/
          LICENSE
          README.md
          url.txt
          tiebaSpider.exe
          config.toml

--------------------------------------------------------------------------------
/template/template1.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |