├── .gitignore ├── .gitmodules ├── .npmrc ├── .prettierrc.js ├── .travis.yml ├── README.md ├── _config.yml ├── package.json ├── roadmaps.json ├── scaffolds ├── draft.md ├── page.md └── post.md ├── script ├── buildRoadmaps.js ├── buildTutorials.js ├── deploy.js ├── downloadRepos.js ├── movePublicToRoot.js └── utils │ ├── buildRoadmap.js │ ├── buildTutorial.js │ └── index.js ├── source ├── 404 │ └── index.md ├── FAQ │ └── index.md ├── _data │ ├── iconfont.eot │ ├── iconfont.json │ ├── iconfont.svg │ ├── iconfont.ttf │ ├── iconfont.woff │ ├── iconfont.woff2 │ └── styles.styl ├── _posts │ ├── @10aaf06.md │ ├── @2629c44.md │ ├── @34dvBzFh6.md │ ├── @4675l54tY.md │ ├── @5e3e0d2.md │ ├── @5e60587.md │ ├── @70fc9b9.md │ ├── @auY0siFek.md │ ├── @d5269af.md │ ├── @e215d5a.md │ ├── @pRtgJQ4NP.md │ ├── @sO4iOISav.md │ ├── @sQN91Mviv.md │ └── @uXOOfFmhS.md ├── about │ └── index.md ├── categories │ └── index.md ├── images │ ├── avatars │ │ ├── bldtp.png │ │ ├── cebuzhun.png │ │ ├── crxk.jpg │ │ ├── mRcfps.jpg │ │ ├── manyipai.png │ │ ├── pftom.jpg │ │ ├── tuture-dev.jpg │ │ ├── zw.png │ │ └── zwmxs.png │ ├── logos │ │ ├── Angular.svg │ │ ├── Go.svg │ │ ├── Java.svg │ │ ├── Nodejs.svg │ │ ├── Python.svg │ │ ├── React.svg │ │ └── Vue.svg │ └── social │ │ └── wechat.png ├── roadmaps │ └── index.md ├── schedule │ └── index.md ├── sitemap │ └── index.md └── tags │ └── index.md ├── tutorials.json └── yarn.lock /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | Thumbs.db 3 | db.json 4 | *.log 5 | node_modules/ 6 | public/ 7 | .deploy*/ 8 | .vscode 9 | .env 10 | 11 | source/_posts/* 12 | !source/_posts/@*.md 13 | source/images/covers 14 | source/roadmaps/* 15 | !source/roadmaps/index.md 16 | 17 | repos 18 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "themes/next"] 2 | path = 
themes/next 3 | url = https://github.com/tuture-dev/hexo-next-theme.git 4 | -------------------------------------------------------------------------------- /.npmrc: -------------------------------------------------------------------------------- 1 | package-lock=false -------------------------------------------------------------------------------- /.prettierrc.js: -------------------------------------------------------------------------------- 1 | module.exports = { 2 | bracketSpacing: true, 3 | singleQuote: true, 4 | jsxBracketSameLine: true, 5 | trailingComma: 'all', 6 | printWidth: 80, 7 | }; 8 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | notifications: 2 | webhooks: 3 | urls: 4 | - https://open.feishu.cn/officialapp/notify/6ce8fd9560a63f21cd1b28abdec2fe5573fb9bdab3625ca879920e6b93f3f7a8 5 | on_success: always # default: always 6 | on_failure: always # default: always 7 | on_start: never # default: never 8 | on_cancel: always # default: always 9 | on_error: always # default: always 10 | 11 | language: node_js 12 | 13 | dist: xenial 14 | 15 | node_js: 16 | - 12 17 | 18 | env: 19 | - ID_DIGITS=7 20 | 21 | before_script: 22 | - npm i -g @tuture/cli 23 | 24 | script: 25 | - yarn 26 | - yarn download 27 | - yarn build:roadmaps 28 | - yarn build:tutorials 29 | - yarn clean 30 | - yarn algolia 31 | - yarn build 32 | - node script/deploy.js 33 | 34 | cache: yarn 35 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Tuture Community (图雀社区) Main Site 2 | 3 | This project is a Hexo blog that collects high-quality, hands-on tutorials contributed by the community and written with the [Tuture](https://github.com/tuture-dev/tuture) tool. 4 | 5 | > **🇨🇳 Currently only Chinese tutorials are supported!** 6 | 7 | ## Viewing the site locally 8 | 9 | First make sure tuture is installed locally; if it is not, install it with `npm install -g tuture`. Then clone the repository, including all Git submodules: 10 | 11 | 
```bash 12 | $ git clone --recurse-submodules https://github.com/tuture-dev/hub.git 13 | ``` 14 | 15 | Enter the repository and install the npm dependencies: 16 | 17 | ```bash 18 | $ cd hub 19 | $ npm install 20 | ``` 21 | 22 | Download all roadmaps and tutorials, then build them (this takes a few minutes): 23 | 24 | ```bash 25 | $ npm run download 26 | $ npm run build:roadmaps 27 | $ npm run build:tutorials 28 | ``` 29 | 30 | Finally, start the Hexo server: 31 | 32 | ```bash 33 | $ npm start 34 | ``` 35 | 36 | Then visit `localhost:4000` (the default port of `hexo server`) to view the Tuture Community site locally. (⚠️ Note: search does not work locally.) 37 | 38 | ## FAQs 39 | 40 | We have answered the most common questions; see the [Tuture Community FAQ](https://tuture.co/FAQ/). 41 | 42 | ## Contributing tutorials 43 | 44 | First of all, thank you for choosing to share a tutorial! Sharing is easy: please read the [guide to sharing tutorials](https://docs.tuture.co/guide/sharing.html). 45 | 46 | ## Follow us 47 | 48 | Want to hear about new tutorials as soon as they are published? Follow our WeChat official account: 49 | 50 | ![](https://tuture.co/uploads/wechat-qcode.png) 51 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | # Hexo Configuration 2 | ## Docs: https://hexo.io/docs/configuration.html 3 | ## Source: https://github.com/hexojs/hexo/ 4 | 5 | # Site 6 | title: 图雀社区 7 | subtitle: 汇集精彩的实战技术教程 8 | description: 图雀社区是一个供大家分享用 Tuture 写作工具完成的教程的一个平台。在这里,读者们可以尽情享受高质量且免费的实战教程,并能与作者和其他读者互动和讨论;而作者们也可以借此传播他们的技术知识,宣传他们的开源项目,找到自己输出内容的受众,加速技术的传播。 9 | keywords: 图雀社区,Tuture,Vue.js实战教程,微信小程序,Kotlin,React Native,Webpack,MVVM,React.js,Node.js,Redux,Django,MongoDB,Docker,JavaScript,Java,Go,Kubernetes,Nuxt,vue-router,react-router,小程序,跨端开发,Taro,react hooks,redux-saga,learn by doing,Web 前端实战教程,后端实战教程,小程序实战教程,移动端实战教程 10 | author: 图雀社区 11 | language: zh-CN 12 | timezone: 13 | 14 | # URL 15 | ## If your site is put in a subdirectory, set url as 'http://yoursite.com/child' and root as '/child/' 16 | url: https://tuture.co/ 17 | root: / 18 | permalink: :year/:month/:day/:title/ 19 | permalink_defaults: 20 | 21 | # Directory 22 | source_dir: source 23 | public_dir: public 24 | tag_dir: tags 25 | archive_dir: archives 26 | category_dir: categories 27 | code_dir: 
downloads/code 28 | i18n_dir: :lang 29 | skip_render: 30 | 31 | # Writing 32 | new_post_name: :title.md # File name of new posts 33 | default_layout: post 34 | titlecase: false # Transform title into titlecase 35 | external_link: true # Open external links in new tab 36 | filename_case: 0 37 | render_drafts: false 38 | post_asset_folder: true 39 | relative_link: false 40 | future: true 41 | highlight: 42 | enable: true 43 | line_number: false 44 | auto_detect: false 45 | tab_replace: 46 | 47 | # Home page setting 48 | # path: Root path for your blogs index page. (default = '') 49 | # per_page: Posts displayed per page. (0 = disable pagination) 50 | # order_by: Posts order. (Order by date descending by default) 51 | index_generator: 52 | path: "" 53 | per_page: 10 54 | order_by: -date 55 | 56 | # Category & Tag 57 | default_category: uncategorized 58 | category_map: 59 | tag_map: 60 | 61 | # Date / Time format 62 | ## Hexo uses Moment.js to parse and display date 63 | ## You can customize the date format as defined in 64 | ## http://momentjs.com/docs/#/displaying/format/ 65 | date_format: YYYY-MM-DD 66 | time_format: HH:mm:ss 67 | 68 | # Pagination 69 | ## Set per_page to 0 to disable pagination 70 | per_page: 10 71 | pagination_dir: page 72 | 73 | # Extensions 74 | ## Plugins: https://hexo.io/plugins/ 75 | ## Themes: https://hexo.io/themes/ 76 | theme: next 77 | 78 | # Deployment 79 | ## Docs: https://hexo.io/docs/deployment.html 80 | deploy: 81 | - type: git 82 | repo: git@github.com:tutureproject/tutureproject.github.io.git 83 | branch: master 84 | - type: git 85 | repo: git@github.com:tutureproject/tutureproject.github.io.git 86 | branch: src 87 | extend_dirs: / 88 | ignore_hidden: false 89 | ignore_pattern: 90 | public: . 
91 | 92 | algolia: 93 | applicationID: 73BSUE4RKU 94 | apiKey: 0b7cce26a4734cb760080065b3c4a4a1 95 | indexName: Tuture 96 | chunkSize: 5000 97 | 98 | # Post wordcount display settings 99 | # Dependencies: https://github.com/theme-next/hexo-symbols-count-time 100 | symbols_count_time: 101 | symbols: true 102 | time: true 103 | total_symbols: true 104 | total_time: true 105 | 106 | baidusitemap: 107 | path: baidusitemap.xml 108 | 109 | nofollow: 110 | enable: true 111 | exclude: 112 | - https://docs.tuture.co/ 113 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "tuture-hub", 3 | "version": "0.0.0", 4 | "private": true, 5 | "hexo": { 6 | "version": "4.2.0" 7 | }, 8 | "scripts": { 9 | "start": "hexo server", 10 | "clean": "hexo clean", 11 | "download": "node script/downloadRepos.js", 12 | "build:roadmaps": "node script/buildRoadmaps.js", 13 | "build:tutorials": "node script/buildTutorials.js", 14 | "build": "hexo generate", 15 | "algolia": "hexo algolia" 16 | }, 17 | "dependencies": { 18 | "ali-oss": "^6.5.1", 19 | "chalk": "^3.0.0", 20 | "fs-extra": "^8.1.0", 21 | "hexo": "^4.1.0", 22 | "hexo-algolia": "^1.3.1", 23 | "hexo-autonofollow": "^1.0.1", 24 | "hexo-deployer-git": "^2.0.0", 25 | "hexo-generator-archive": "^0.1.5", 26 | "hexo-generator-baidu-sitemap": "^0.1.6", 27 | "hexo-generator-category": "^0.1.3", 28 | "hexo-generator-feed": "^2.0.0", 29 | "hexo-generator-index": "^0.2.1", 30 | "hexo-generator-sitemap": "^1.2.0", 31 | "hexo-generator-tag": "^0.2.0", 32 | "hexo-renderer-ejs": "^0.3.1", 33 | "hexo-renderer-marked": "^1.0.1", 34 | "hexo-renderer-stylus": "^0.3.3", 35 | "hexo-server": "^0.3.3", 36 | "hexo-symbols-count-time": "^0.6.1", 37 | "hexo-util": "^1.7.0", 38 | "js-yaml": "^3.13.1", 39 | "p-limit": "^2.2.2", 40 | "p-retry": "^4.2.0" 41 | } 42 | } 
-------------------------------------------------------------------------------- /roadmaps.json: -------------------------------------------------------------------------------- 1 | { 2 | "root": "repos/roadmaps", 3 | "sources": [ 4 | { 5 | "name": "node", 6 | "git": "https://github.com/tuture-dev/nodejs-roadmap.git" 7 | }, 8 | { 9 | "name": "react", 10 | "git": "https://github.com/tuture-dev/react-roadmap.git" 11 | } 12 | ] 13 | } 14 | -------------------------------------------------------------------------------- /scaffolds/draft.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: {{ title }} 3 | tags: 4 | --- 5 | -------------------------------------------------------------------------------- /scaffolds/page.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: {{ title }} 3 | date: {{ date }} 4 | --- 5 | -------------------------------------------------------------------------------- /scaffolds/post.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: {{ title }} 3 | date: {{ date }} 4 | tags: 5 | keywords: 6 | description: 7 | --- 8 | -------------------------------------------------------------------------------- /script/buildRoadmaps.js: -------------------------------------------------------------------------------- 1 | const path = require('path'); 2 | const { listSubdirectories, buildRoadmap } = require('./utils'); 3 | 4 | const root = process.cwd(); 5 | const roadmapsRoot = path.resolve('repos', 'roadmaps'); 6 | 7 | console.log('Building roadmaps ...'); 8 | listSubdirectories(roadmapsRoot).forEach(p => buildRoadmap(p)); 9 | 10 | process.chdir(root); 11 | -------------------------------------------------------------------------------- /script/buildTutorials.js: -------------------------------------------------------------------------------- 1 | const cp = require('child_process'); 2 | const path = 
require('path'); 3 | const { listSubdirectories, buildTutorial } = require('./utils'); 4 | 5 | const root = process.cwd(); 6 | const tutorialsRoot = path.resolve('repos', 'tutorials'); 7 | 8 | console.log('Tuture Version'); 9 | console.log(cp.execSync('tuture -v').toString()); 10 | 11 | console.log('\nBuilding tutorials ...'); 12 | listSubdirectories(tutorialsRoot).forEach(p => buildTutorial(p)); 13 | 14 | process.chdir(root); 15 | -------------------------------------------------------------------------------- /script/deploy.js: -------------------------------------------------------------------------------- 1 | const chalk = require('chalk'); 2 | const fs = require('fs'); 3 | const path = require('path'); 4 | const pRetry = require('p-retry'); // declared in package.json; 'retry' is not 5 | const pLimit = require('p-limit'); 6 | const OSS = require('ali-oss'); 7 | 8 | const client = new OSS({ 9 | accessKeyId: process.env.ACCESS_KEY_ID, 10 | accessKeySecret: process.env.ACCESS_KEY_SECRET, 11 | bucket: process.env.BUCKET_NAME, 12 | region: process.env.BUCKET_REGION, 13 | }); 14 | 15 | const distDir = 'public'; 16 | const filePaths = []; 17 | 18 | function walk(dirName) { 19 | const files = fs.readdirSync(dirName); 20 | 21 | files.forEach(file => { 22 | const fullPath = path.join(dirName, file); 23 | const stat = fs.statSync(fullPath); 24 | 25 | if (stat.isDirectory()) { 26 | walk(fullPath); 27 | } else { 28 | filePaths.push(fullPath); 29 | } 30 | }); 31 | } 32 | 33 | walk(distDir); 34 | 35 | const limit = pLimit(2); 36 | 37 | const uploadTasks = filePaths.map(filePath => 38 | limit(async () => { 39 | await pRetry(() => client.put(filePath.slice(distDir.length + 1), filePath), { retries: 3 }); 40 | console.log(`Upload ${filePath} successfully.`); 41 | }), 42 | ); 43 | 44 | (async () => { 45 | await Promise.all(uploadTasks); 46 | console.log('Upload complete!'); 47 | })().catch(err => { console.error(chalk.red(`Upload failed: ${err.message}`)); process.exit(1); }); 48 | -------------------------------------------------------------------------------- /script/downloadRepos.js: 
-------------------------------------------------------------------------------- 1 | const chalk = require('chalk'); 2 | const path = require('path'); 3 | const fs = require('fs-extra'); 4 | const { exec } = require('child_process'); 5 | 6 | const roadmaps = require('../roadmaps.json'); 7 | const tutorials = require('../tutorials.json'); 8 | 9 | function log(status, message) { 10 | let output; 11 | switch (status) { 12 | case 'info': 13 | output = `${chalk.blue('[INFO]')} ${message}`; 14 | break; 15 | case 'warning': 16 | output = `${chalk.yellow('[WARNING]')} ${message}`; 17 | break; 18 | case 'success': 19 | output = `${chalk.green('[SUCCESS]')} ${message}`; 20 | break; 21 | case 'fail': 22 | output = `${chalk.red('[FAIL]')} ${message}`; 23 | break; 24 | default: 25 | throw new Error(`Unsupported status: ${status}`); 26 | } 27 | 28 | console.log(output); 29 | } 30 | 31 | function downloadAll(object) { 32 | const { root, sources } = object; 33 | log('info', `Downloading ${root} with ${sources.length} sources ...`); 34 | 35 | const tasks = sources.map( 36 | source => 37 | new Promise((resolve, reject) => { 38 | const repoPath = path.join(root, source.name); 39 | log('info', `Starting to download ${repoPath} ...`); 40 | 41 | // 如果已经下载过,则删除重新下载 42 | if (fs.existsSync(repoPath)) { 43 | log('warning', `Deleting ${repoPath} and re-download.`); 44 | fs.removeSync(repoPath); 45 | } 46 | 47 | exec(`git clone ${source.git} ${repoPath}`, err => { 48 | if (err) { 49 | log('fail', `Download failed: ${repoPath}!`); 50 | reject(err.message); 51 | } else { 52 | log('success', `Finished ${repoPath}!`); 53 | resolve(); 54 | } 55 | }); 56 | }), 57 | ); 58 | 59 | Promise.all(tasks) 60 | .then(() => { 61 | log('success', `Download ${root} complete!`); 62 | }) 63 | .catch(err => { 64 | log('fail', `Download ${root} failed: ${err}`); 65 | }); 66 | } 67 | 68 | downloadAll(roadmaps); 69 | downloadAll(tutorials); 70 | 
-------------------------------------------------------------------------------- /script/movePublicToRoot.js: -------------------------------------------------------------------------------- 1 | const fs = require('fs-extra'); 2 | const path = require('path'); 3 | 4 | const buildDir = 'public'; 5 | const excludedFiles = ['.git', '.gitignore', 'node_modules', buildDir]; 6 | 7 | if (!fs.existsSync(buildDir)) { 8 | console.log('Build directory not ready. Stopping.'); 9 | process.exit(1); 10 | } 11 | 12 | fs.readdirSync('.') 13 | .filter(fname => excludedFiles.indexOf(fname) < 0) 14 | .forEach(fname => fs.removeSync(fname)); 15 | 16 | fs.readdirSync(buildDir).forEach(fname => 17 | fs.moveSync(path.join(buildDir, fname), fname, { overwrite: true }), 18 | ); 19 | -------------------------------------------------------------------------------- /script/utils/buildRoadmap.js: -------------------------------------------------------------------------------- 1 | const fs = require('fs-extra'); 2 | const path = require('path'); 3 | 4 | // Root path of this project. 5 | const root = process.cwd(); 6 | 7 | // Path to hexo roadmap pages. 8 | const roadmapsDir = path.join(root, 'source', 'roadmaps'); 9 | 10 | function buildRoadmap(roadmapPath) { 11 | process.chdir(roadmapPath); 12 | console.log(`Working on ${process.cwd()}.`); 13 | 14 | if (!fs.existsSync('README.md')) { 15 | console.log('Not a valid roadmap, skipping.'); 16 | process.chdir(root); 17 | return; 18 | } 19 | 20 | const roadmapName = path.parse(roadmapPath).name; 21 | let content = fs.readFileSync('README.md').toString(); 22 | const frontmatter = fs.readFileSync('frontmatter.yml').toString(); 23 | 24 | // Remove markdown TOC. 25 | content = content.replace(/## 目录[\w\W]+## 入门/, '## 入门'); 26 | 27 | // Remove original h1. 28 | content = content.replace(/<h1[\w\W]+?<\/h1>/, ''); 29 | 30 | // Prepend front matter. 31 | content = `${frontmatter}\n${content}`; 32 | 33 | // Save roadmap to target directory. 34 | const targetDir = path.join(roadmapsDir, roadmapName); 35 | fs.ensureDirSync(targetDir); 36 | fs.writeFileSync(path.join(targetDir, 'index.md'), content); 37 | 38 | // Copy assets directory. 39 | fs.copySync('assets', path.join(targetDir, 'assets')); 40 | } 41 | 42 | module.exports = buildRoadmap; 43 | -------------------------------------------------------------------------------- /script/utils/buildTutorial.js: -------------------------------------------------------------------------------- 1 | const fs = require('fs-extra'); 2 | const path = require('path'); 3 | const cp = require('child_process'); 4 | const yaml = require('js-yaml'); 5 | 6 | // Root path of this project. 7 | const root = process.cwd(); 8 | 9 | // Path to hexo posts. 10 | const postsDir = path.join(root, 'source', '_posts'); 11 | 12 | // Sub-directory for storing markdowns of each tutorial. 13 | 14 | const workspace = '.tuture'; 15 | 16 | const collectionPath = path.join(workspace, 'collection.json'); 17 | 18 | const buildDir = path.join(workspace, 'build'); 19 | 20 | /** 21 | * Function for adjusting markdown content in place. 22 | */ 23 | function adjustContent(markdownPath, info) { 24 | const { cover, id } = info; 25 | 26 | let content = fs.readFileSync(markdownPath).toString(); 27 | 28 | // Set the lang of all vue code blocks to html, 29 | // since highlight.js doesn't support vue syntax. 30 | content = content.replace(/```vue/g, '```html'); 31 | 32 | // Replace tsx to ts. 33 | content = content.replace(/```tsx/g, '```ts'); 34 | 35 | fs.writeFileSync(markdownPath, content); 36 | } 37 | 38 | /** 39 | * Build a single hexo post. 
40 | */ 41 | function buildSingleArticle(article) { 42 | const { name, id, cover } = article; 43 | const truncatedId = id 44 | .toString() 45 | .slice(0, Number(process.env.ID_DIGITS) || 7); 46 | const mdPath = path.join(buildDir, `${name}.md`); 47 | 48 | adjustContent(mdPath, { cover, id: truncatedId }); 49 | 50 | fs.copySync(mdPath, path.join(postsDir, `${truncatedId}.md`), { 51 | overwrite: true, 52 | }); 53 | } 54 | 55 | /** 56 | * Build tutorials and move them into hexo posts directory. 57 | */ 58 | function buildTutorial(tuturePath) { 59 | process.chdir(tuturePath); 60 | console.log(`\nWorking on ${process.cwd()}.`); 61 | 62 | // Build tutorial as usual. 63 | cp.execSync('tuture reload && tuture build --hexo'); 64 | console.log('Build complete.'); 65 | 66 | const collection = JSON.parse(fs.readFileSync(collectionPath).toString()); 67 | const idDigits = Number(process.env.ID_DIGITS) || 7; 68 | const convertId = (id) => id.toString().slice(0, idDigits); 69 | 70 | collection.articles.forEach((article) => buildSingleArticle(article)); 71 | 72 | console.log(`Finished ${process.cwd()}.`); 73 | process.chdir(root); 74 | } 75 | 76 | module.exports = buildTutorial; 77 | -------------------------------------------------------------------------------- /script/utils/index.js: -------------------------------------------------------------------------------- 1 | const fs = require('fs-extra'); 2 | const path = require('path'); 3 | 4 | function listSubdirectories(root) { 5 | return fs 6 | .readdirSync(root) 7 | .map(p => path.resolve(root, p)) 8 | .filter(p => fs.lstatSync(p).isDirectory()); 9 | } 10 | 11 | exports.listSubdirectories = listSubdirectories; 12 | 13 | exports.buildRoadmap = require('./buildRoadmap'); 14 | exports.buildTutorial = require('./buildTutorial'); 15 | -------------------------------------------------------------------------------- /source/404/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 
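404 3 | date: 1970-01-01 00:00:00 4 | comments: false 5 | --- 6 | -------------------------------------------------------------------------------- /source/FAQ/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Frequently Asked Questions 3 | date: 2019-11-23 10:38:31 4 | --- 5 | 6 | **Q: What kind of community is the Tuture Community (图雀社区)?** 7 | 8 | A: Simply put, the Tuture Community is a platform for sharing tutorials written with the [Tuture](https://github.com/tuture-dev/tuture) writing tool. Readers can enjoy high-quality hands-on tutorials here and discuss them with the authors and other readers; authors can use it to spread their technical knowledge and promote their open-source projects. 9 | 10 | 11 | **Q: What exactly is this Tuture writing tool?** 12 | 13 | A: The idea behind Tuture is simple: it turns your Git repository into the "skeleton" of a tutorial. All you then need to do is add the "flesh", that is, explain your code changes (Git diffs). The idea may sound ordinary, but after repeated practice we found that writing this way saves a lot of the time otherwise spent organizing code and lets authors focus on structuring the project and explaining concepts. Tutorials written with Tuture are quick to write (even a large hands-on project takes only a few days) and of reliable quality (the code is not mistyped, and readers can always type along). If you cannot wait to try the Tuture writing tool, visit our [documentation](https://www.yuque.com/tuture/product-manuals), where a concise, easy-to-follow tutorial shows you how to use it. 14 | 15 | 16 | **Q: There are already so many tech communities. What is special about the Tuture Community?** 17 | 18 | A: As noted above, the Tuture Community is driven by the [Tuture](https://github.com/tuture-dev/tuture) writing tool, which defines our focus: **text tutorials that explain code thoroughly, each corresponding exactly to one Git repository, so you can type along from start to finish and end up with a project that actually runs**. We believe this "learn by doing" approach maximizes learning efficiency, helps you pick up a technology quickly, and gives you a real sense of achievement. 19 | 20 | 21 | **Q: Will the Tuture Community charge for its content?** 22 | 23 | A: We promise that all content is **free forever** for all readers. 24 | 25 | 26 | **Q: If the content is free forever, how do you attract authors?** 27 | 28 | A: Paid knowledge is common, yet it is **contradictory**: authors want better pay for high-quality content, while readers want to obtain it at the lowest cost. Free content certainly attracts readers, but what attracts authors? Our approach is for the founders to first produce high-quality hands-on tutorials and publish them on the major platforms to build a loyal readership. Articles submitted afterwards are published with the author's byline retained, which helps each author promote their articles and projects and build popularity and reputation quickly, without a long wait. 29 | 30 | 31 | **Q: I am an author and want to submit a tutorial to Tuture. What should I do?** 32 | 33 | A: First of all, thank you for planning to join the family of Tuture authors! Follow our [installation guide](https://www.yuque.com/tuture/product-manuals/installation) to set up the writing tool, then read ["Getting Started with Writing"](https://www.yuque.com/tuture/product-manuals/initialization) to learn how to write a tutorial, and finally see ["Sharing Tutorials"](https://www.yuque.com/tuture/product-manuals/sharing) to learn how to publish your article to the Tuture Community. We hope to help your articles, projects, knowledge, and experience earn the recognition they deserve! 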
34 | 35 | **Q: I am an ordinary reader and want to help the Tuture Community. What should I do?** 36 | 37 | A: Thank you once again! Just search for "图雀社区" in WeChat and follow our official account. If you use Juejin, Zhihu, Jianshu, or CSDN, you can also support us on those platforms: 38 | 39 | - Tuture Community official website: [link](http://tuture.co/) 40 | - WeChat official account: [QR code](https://tuture.co/images/social/wechat.png) 41 | - Juejin column: [link](https://juejin.im/user/5b33414351882574b9694d28) 42 | - Zhihu column: [link](https://www.zhihu.com/people/tuture-dev/activities) 43 | - CSDN: [link](https://tuture.blog.csdn.net/) 44 | 45 | Of course, recommending the Tuture Community to your friends would be even better. And remember to "come back often"! 46 | 47 | 48 | **Q: What does the name "图雀" mean? Is it 图灵 (Turing) + 语雀 (Yuque)?** 49 | 50 | A: 图雀 can be "图灵 + 语雀", it can be "a sparrow with grand ambitions", and it can be the transliteration of Tuture, the writing tool. A thousand people see a thousand different 图雀, so its meaning is up to you. 51 | -------------------------------------------------------------------------------- /source/_data/iconfont.eot: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tuture-dev/hub/0a5c20a6764793c0ac4c027f0f5980179ce284b3/source/_data/iconfont.eot -------------------------------------------------------------------------------- /source/_data/iconfont.json: -------------------------------------------------------------------------------- 1 | { 2 | "id": "", 3 | "name": "", 4 | "font_family": "iconfont", 5 | "css_prefix_text": "icon-", 6 | "description": "", 7 | "glyphs": [ 8 | { 9 | "icon_id": "4565635", 10 | "name": "jianshu", 11 | "font_class": "jianshu", 12 | "unicode": "e60c", 13 | "unicode_decimal": 58892 14 | }, 15 | { 16 | "icon_id": "6773372", 17 | "name": "zhihu (1)", 18 | "font_class": "zhihu", 19 | "unicode": "e688", 20 | "unicode_decimal": 59016 21 | }, 22 | { 23 | "icon_id": "10633713", 24 | "name": "we-chat", 25 | "font_class": "wechat", 26 | "unicode": "e502", 27 | "unicode_decimal": 58626 28 | }, 29 | { 30 | "icon_id": "11588751", 31 | "name": "csdn", 32 | "font_class": "csdn", 33 | "unicode": "e515", 34 | "unicode_decimal": 58645 35 | }, 36 | { 37 | "icon_id": "12150404", 38 | "name": "juejin", 39 | "font_class": "juejin", 40 | "unicode": "e503", 41 | "unicode_decimal": 58627 42 
| } 43 | ] 44 | } 45 | -------------------------------------------------------------------------------- /source/_data/iconfont.svg: -------------------------------------------------------------------------------- 1 | [SVG markup not preserved; the only surviving text is "Created by iconfont"] -------------------------------------------------------------------------------- /source/_data/iconfont.ttf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tuture-dev/hub/0a5c20a6764793c0ac4c027f0f5980179ce284b3/source/_data/iconfont.ttf -------------------------------------------------------------------------------- /source/_data/iconfont.woff: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tuture-dev/hub/0a5c20a6764793c0ac4c027f0f5980179ce284b3/source/_data/iconfont.woff -------------------------------------------------------------------------------- /source/_data/iconfont.woff2: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tuture-dev/hub/0a5c20a6764793c0ac4c027f0f5980179ce284b3/source/_data/iconfont.woff2 -------------------------------------------------------------------------------- /source/_data/styles.styl: -------------------------------------------------------------------------------- 1 | @font-face {font-family: "iconfont"; 2 | src: url('iconfont.eot?t=1576828227650'); /* IE9 */ 3 | src: url('iconfont.eot?t=1576828227650#iefix') format('embedded-opentype'), /* IE6-IE8 */ 4 | 
url('data:application/x-font-woff2;charset=utf-8;base64,d09GMgABAAAAAAcYAAsAAAAADNwAAAbLAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHEIGVgCDQAqMXIloATYCJAMYCw4ABCAFhG0HUhuQClGUT1KG7Mdh7DwJxahUroyLl1gJC6pwEw/Q/b4/M3dubCdF4noUJS1gJK2twBJta4kX9k+wN4Tb/m9F5Kw+wqgCBKvASIQR2mPoi6rE9zv9P7cMNt38lLCzEcci5qWzk1MtmnvXQQvtSExn4pVXJyIcHDNdJA4sb21ziSyLatoAJzig7g+w6fD1H+QBce6v+gvsbAs6knCbgHrDIiDL6uJqwJFIqgAFR1azHnB8ZimFFFS6ImZnCkDmAUIVp6O3AHBu/nz4BeICB4jyDEgrt3byjCDnHXVMw8YNjQONOgJI6nOC2UeGGUAinMa6T2C2csZU18e5cYsj+I1KFN+xz24OgWPa0BDwjrryEbnrNcp/8sjkkiASCiCpGxG7Fst5R1WQ8S6O07gBj8SHgEfgY5pZJPVFjeEBIBaoPUPsoLOU4EsUskalZ6iIRwhS2GwJ118oECACMRnmAgkWS//EMVs2GST/Pvt/Pz3UXKFwvJXVYfPSxeP8fBQtwQJXC2HaPsRLMS7mCuKtBQUIC25F1Fh+ablqy5ibWsb+mePQyAPGfFYJXHAtN2/+lEDKlW4j5LVwNWBYfXkhTCjKg1wsFtP5IdRjkkvYVciYf2ktonMaic+xOFAia7ou0JZMR/ielBLNbOvoY57yE2O+oSHYcdWS3u6jM1e6rPGDRzW7fhqLT92DBGvDrTSvqvIIiDVpiSwdxsPTIZICvwnIqYYUr7gYPYtmL37WPALjitB7tTNL/iPh0/q9SfdTfqz9V/FLK+/uHIT7J7vLl8/XB14W+1n0VvVX91+x/z9Mqi/WJNosl+Tu/dNoeRFcwo6H0Jd2WFd2xpbLLbob7PkuEUU9Zt8/c/Vd8tEfSdUS6cjIijTtUkFJaAJz21nxBpxKdRy9V207JSuXditjBpWfBp0lFZqn48pVa450eJpzgZW9Y5NZICGImXuZ2QxsyRKdfHubGtuRI/P5Elbv7ADpvM/f3s3MH32CODnu3ukD2VmfeBxlNJVT5jZdud5QOTT2OQ4h/hyMEGpy95/V4OiTB87LJXg0sF8H9P6BFl7cJ/HoM8RNhd9fmQec5j1FH+Kn/OCEeEjDCOnm0p7n59zOYxumzvnLx2PDnYmPwVRY4Ifw1s3oMgZ7KTteWdcjExfIw5aNRdYo7QqpiywocNp91+AKQqQYvOe6pN6b7xxjYPe0IgJ/GZfAFvWiDWO+uifz1Swfw6T1+cygELWNeYp0JofnWbknHyXMl6VmTfEQFbMm+ExOngxG8OoavidDzh5JbECgj/Atsh4MGyd/3HAA60XJghqmFzo9eXNF3ojXdWG5XWtm7jGIc9rQiofA//9AHVZxYN5Z1CS/9W+daU1wv3tYzVY+uJNVagKltpwQRWnQQWlOkWJ0jrRjnmVuqGdp30RFCq0L0etKJYUBI4PEBcgDDnkt6Hjv757ucU5SkatW69NgTCT84ipBwYgS7il2UWiUh5XsVkoKZJWgbFWqkpxuclPIBSPKPUM93OA3p51dbkrspeTLE8LtL8l38f6qRwC+7BAev+EsUnlG3ANDLTdxkQi/KQt0oSZ3X9TJMcRHxCTI9V3zBmXixrM5kgkRXkXsHPd165feJXbeINeVTYZogfph33+kKm/EBB9KWf4LicliNn5NXBhTEnZao/3l1zav673jx1VU/gbJLfGCMVzOj5LyIJYyIqE8Blfkci5Oejdx/3pSjifSzXuvzb3/k0oRsyi57k5S/TKzn0yKViIkqL2rVbUWtZboS1SsLSXkaQ8+jf485rP750P3ZQ654z7wBdSk+94mBJMCUFuHvcNaAaj9ijnQv9/EWd6OW0Ut4ozfOPyn
M/AsXaM0heWQ49zjAHidNHJL3jtnOjUk2XHkKwNLkUSpKkyPAdJ9uTqFC8DmllXWy35q6G/3pi5tV27Ji63tciEqDUGmMoJM2BnINZiFQmUO6k3T7G/QgYYJiQuYMpmAoNVeiJo9QabVbWTCvoZct++g0BowqLcdXE5sMBasPCVN6RnKCNuGQ6uhp9uk2BieVVhJme2detorM8k6irZZAzA2KiaY0FDdFF3HGlu/OY5hTNBE93RBNXEf1dnZA3vpnnbKwERZGKY3LTraFLc2ytDTBXiW0Ch6jM5uBLUZzm9l0KObaTCDF/r+ShQzu056dErHD78OhWZjHR6KFSUmB1Kj7s7VsS/lNv3M4jDoNIFqPa1HF0iNTkqnHO0B9cZPa0cxYESxlAj3ShON/Ux51VHta7u2strm6yp9oAgogiE4QiAshA3KA6u+22axEyMsVv0y1gBlsOgZ3GAzdrPa7VS7tRsAAA==') format('woff2'), 5 | url('iconfont.woff?t=1576828227650') format('woff'), 6 | url('iconfont.ttf?t=1576828227650') format('truetype'), /* chrome, firefox, opera, Safari, Android, iOS 4.2+ */ 7 | url('iconfont.svg?t=1576828227650#iconfont') format('svg'); /* iOS 4.1- */ 8 | } 9 | 10 | .fa-custom { 11 | font-family: "iconfont" !important; 12 | font-size: 16px; 13 | font-style: normal; 14 | -webkit-font-smoothing: antialiased; 15 | -moz-osx-font-smoothing: grayscale; 16 | } 17 | 18 | .jianshu:before { 19 | content: "\e60c"; 20 | } 21 | 22 | .zhihu:before { 23 | content: "\e688"; 24 | } 25 | 26 | .wechat:before { 27 | content: "\e502"; 28 | } 29 | 30 | .csdn:before { 31 | content: "\e515"; 32 | } 33 | 34 | .juejin:before { 35 | content: "\e503"; 36 | } 37 | -------------------------------------------------------------------------------- /source/_posts/@10aaf06.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "爬虫养成记--千军万马来相见(详解多线程)" 3 | description: "在上篇教程《爬虫养成记--顺藤摸瓜回首掏(女生定制篇)》中我们通过分析网页之间的联系,串起一条线,从而爬取大量的小哥哥图片,但是一张一张的爬取速度未免也有些太慢,在本篇教程中将会与大家分享提高爬虫速率的神奇技能——多线程。" 4 | tags: ["爬虫"] 5 | categories: ["后端", "Python", "入门"] 6 | date: 2020-03-23T00:00:00.509Z 7 | photos: 8 | - https://static.tuture.co/c/%4010aaf06.md/spider-3-cover.jpg 9 | --- 10 | 11 |
12 | Author: crxk
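21 | 22 | ## Recap 23 | In the previous tutorial, [爬虫养成记--顺藤摸瓜回首掏(女生定制篇)](https://blog.csdn.net/crxk_/article/details/104892652), we analyzed how the pages link to one another, followed that thread, and crawled a large number of pictures of handsome guys. Fetching them one at a time, however, is **rather slow**, so this tutorial shares a handy technique for **speeding up the crawler**: multithreading. 24 | 25 | ## Where is the slowness? 26 | First we draw the crawler we wrote earlier as a flowchart, a more intuitive way to find its speed bottleneck. In the *program flowchart* below, the red arrows mark the steps the program performs to fetch a single picture. 27 | ![Program flowchart](https://static.tuture.co/c/@10aaf06.md/171022950b5c6541.png) 28 | In most programming languages, statements run synchronously by default (JavaScript code, by contrast, relies heavily on asynchronous callbacks), so in a Python program the next statement starts only after the previous one has finished. The flowchart shows the same thing: the pictures on page two start downloading only after page one has been fetched, and the program moves on to the next gallery only after every picture in the current gallery has been handled. The blue arrows in the *serial flowchart* show this process: 29 | ![Serial flowchart](https://static.tuture.co/c/@10aaf06.md/171022950c6003b4.png) 30 | 31 | As the chart shows, whenever the program reaches a fork, that is, enters a for loop, each task in the loop (traversing a gallery or downloading a picture) has to wait for the task before it to finish. This **waiting** is what slows the program down. 32 | 33 | It is like queuing for food in a canteen. With a single window and one minute per student, the last of 100 students has to wait behind 99 others: serving everyone takes 100 minutes, and the time all students spend in line adds up to 1 + 2 + 3 + … + 99 + 100 = 5050 minutes. If the canteen opens 100 windows at once, all 100 students are served within 1 minute, so the elapsed time shrinks 100-fold and the combined time in line drops from 5050 minutes to 100. 34 | ## How do we speed up? 35 | Today's computers have multi-core CPUs, a bit like Nezha with his three heads and six arms, and can work on several things at once. A crawler spends most of its time waiting on network and file I/O, and that waiting can overlap across threads. If we change the **serial** execution order above into **parallel** execution (see the *parallel flowchart* below), most of the **waiting** disappears and the program becomes dramatically faster. 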
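To make the canteen analogy concrete, here is a minimal sketch that is not part of the original crawler. It assumes each "download" can be simulated with `time.sleep`, and the names `fake_download`, `run_serial`, and `run_threaded` are hypothetical, introduced only for this illustration. It runs the same tasks once serially and once with one `threading.Thread` per task, and measures the elapsed time of each.

```python
import threading
import time


def fake_download(results, index, delay):
    """Hypothetical stand-in for one network request: wait, then mark the task done."""
    time.sleep(delay)
    results[index] = True


def run_serial(count, delay):
    """Run the tasks one after another and return (elapsed seconds, results)."""
    results = [False] * count
    start = time.perf_counter()
    for index in range(count):
        fake_download(results, index, delay)
    return time.perf_counter() - start, results


def run_threaded(count, delay):
    """Run each task in its own thread and return (elapsed seconds, results)."""
    results = [False] * count
    threads = [
        threading.Thread(target=fake_download, args=(results, index, delay))
        for index in range(count)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    # join() waits for every thread, so the timing covers all tasks.
    for thread in threads:
        thread.join()
    return time.perf_counter() - start, results


if __name__ == "__main__":
    serial_time, _ = run_serial(10, 0.1)
    threaded_time, _ = run_threaded(10, 0.1)
    print("serial:   %.2f s" % serial_time)
    print("threaded: %.2f s" % threaded_time)
```

The serial run should take roughly `count * delay` seconds, while the threaded run should take only a little more than one `delay`, because the waits overlap. The sketch calls `join()` only so that it can time the whole batch; the thread classes in this tutorial start their threads without joining them.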
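36 | ![Parallel flowchart](https://static.tuture.co/c/@10aaf06.md/171022951129ac47.png) 37 | ### From a single thread to multiple threads 38 | Single thread = serial 39 | In the serial flowchart, the red and blue arrows join head to tail, each link following the previous one. This is serial execution. 40 | 41 | Multiple threads = parallel 42 | In the parallel flowchart, the red arrows split into branches at every fork, and the branches run together. This is parallel execution. 43 | 44 | Of course, a program cannot run in parallel from the very start: serial execution is the foundation of parallel execution, and the two complement each other. Multithreading becomes useful only where the program branches (enters a for loop), where a thread can be started for each branch to speed things up. Beginners can use a blunt rule of thumb: **no for loop, no need for multithreading**. Readers with some programming experience can think of it this way: **when the program performs a time-consuming operation, start another thread to handle it**. Typical time-consuming operations are file I/O and network I/O. 45 | 46 | ## Hands-on practice 47 | ### Define a thread class 48 | Python 3 provides the [threading](https://www.runoob.com/python3/python3-multithreading.html) module for building multithreaded programs. We first use it to define our own thread class, which removes the waiting involved in traversing the galleries. 49 | #### Thread ID 50 | The program starts many threads while it runs. To make them easier to manage later, add a threadID parameter to the thread class's constructor and give each thread a unique ID. 51 | #### Arguments for the target method 52 | A thread class is usually defined so that the thread can run a time-consuming method, so its constructor must accept the arguments that method needs. For example, the handleTitleLinks class exists to run [getBoys()](https://blog.csdn.net/crxk_/article/details/104892652) (see the complete code at the end of this article). getBoys() takes a title link as its argument, so the constructor of handleTitleLinks must also receive a link. 53 | #### Call the target method 54 | The thread class needs a run() method that calls the target method with the stored arguments. 55 | ```python 56 | class handleTitleLinks (threading.Thread): 57 | def __init__(self,threadID,link): 58 | threading.Thread.__init__(self) 59 | self.threadID = threadID 60 | self.link = link 61 | def run(self): 62 | print ("start handleTitleLinks:" + self.threadID) 63 | getBoys(self.link) 64 | print ("exit handleTitleLinks:" + self.threadID) 65 | ``` 66 | ### Replace the target method with a thread object 67 | With the thread class defined, find the call to the time-consuming target method and replace it with a thread object that is instantiated and started. 68 | 69 | ```python 70 | def main(): 71 | baseUrl = "https://www.nanrentu.cc/sgtp/" 72 | response = requests.get(baseUrl,headers=headers) 73 | if response.status_code == 200: 74 | with open("index.html",'w',encoding="utf-8") as f: 75 | f.write(response.text) 76 | doc = pq(response.text) 77 | # Get the title links of all galleries 78 | titleLinks = doc('.h-piclist > li > a').items() 79 | # Iterate over these links 80 | for link in titleLinks: 81 | # Replace the target method: start a thread instead 82 | handleTitleLinks(uuid.uuid1().hex,link).start() 83 | # getBoys(link) 84 | ``` 85 | ### Rinse and repeat 86 | 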
We have defined a thread that handles each gallery, but handling a gallery branches again (see the parallel execution diagram) to download the images in it. We therefore define a second thread for downloading images, that is, a thread that replaces getImg().

```python
class handleGetImg(threading.Thread):
    def __init__(self, threadID, url):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.url = url  # page URL passed on to getImg()

    def run(self):
        print("start handleGetImg:" + self.threadID)
        getImg(self.url)
        print("exit handleGetImg:" + self.threadID)
```
### The complete code after the change:

```python
#!/usr/bin/python3
import os
import threading
import uuid

import requests
from pyquery import PyQuery as pq

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'cookie': 'UM_distinctid=170a5a00fa25bf-075185606c88b7-396d7407-100200-170a5a00fa3507; CNZZDATA1274895726=1196969733-1583323670-%7C1583925652; Hm_lvt_45e50d2aec057f43a3112beaf7f00179=1583326696,1583756661,1583926583; Hm_lpvt_45e50d2aec057f43a3112beaf7f00179=1583926583'
}


def saveImage(imgUrl, name):
    imgResponse = requests.get(imgUrl)
    # Make sure the output directory exists before writing to it
    os.makedirs("学习文件", exist_ok=True)
    fileName = "学习文件/%s.jpg" % name
    if imgResponse.status_code == 200:
        # The with block closes the file automatically
        with open(fileName, 'wb') as f:
            f.write(imgResponse.content)


# Find the image on the given page and download it
def getImg(url):
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        doc = pq(res.text)
        imgSrc = doc('.info-pic-list > a > img').attr('src')
        print(imgSrc)
        # Skip pages on which no image was found
        if imgSrc:
            saveImage(imgSrc, uuid.uuid1().hex)


# Iterate over the page links of one gallery
def getPic(urlArray):
    for url in urlArray:
        # Replace the function call with a thread
        handleGetImg(uuid.uuid1().hex, url).start()
        # getImg(url)


def createUrl(indexUrl, allPage):
    baseUrl = indexUrl.split('.html')[0]
    # The first page has no suffix; pages 2..allPage end in _2 ... _allPage
    urlArray = [indexUrl]
    for i in range(2, allPage + 1):
        tempUrl = baseUrl + "_" + str(i) + ".html"
        urlArray.append(tempUrl)
    return urlArray


def getBoys(link):
    # Step 1: get the link to the first page of the gallery
    picIndex = link.attr('href')
    # Step 2: open the first page, extract the last-page link, work out the page count
    res = requests.get(picIndex, headers=headers)
    print("当前正在抓取的 picIndex: " + picIndex)
    if res.status_code == 200:
        with open("picIndex.html", 'w', encoding="utf-8") as f:
            f.write(res.text)
        doc = pq(res.text)
        lastLink = doc('.page > ul > li:nth-last-child(2) > a').attr('href')
        # Split the string to get the total number of pages
        if lastLink is None:
            return
        # Split on ".html" and take the first item of the result
        temp = lastLink.split('.html')[0]
        # Split on the underscore, take the second item, convert it to a number
        allPage = int(temp.split('_')[1])
        # Step 3: build the URLs from the first and last links
        urlArray = createUrl(picIndex, allPage)
        # Step 4: save the images, done
        getPic(urlArray)


def main():
    baseUrl = "https://www.nanrentu.cc/sgtp/"
    response = requests.get(baseUrl, headers=headers)
    if response.status_code == 200:
        with open("index.html", 'w', encoding="utf-8") as f:
            f.write(response.text)
        doc = pq(response.text)
        # Get the title links of all galleries
        titleLinks = doc('.h-piclist > li > a').items()
        # Iterate over the links
        for link in titleLinks:
            # Replace the function call: start a thread instead
            handleTitleLinks(uuid.uuid1().hex, link).start()
            # getBoys(link)


# Thread class that handles one gallery link
class handleTitleLinks(threading.Thread):
    def __init__(self, threadID, link):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.link = link

    def run(self):
        print("start handleTitleLinks:" + self.threadID)
        getBoys(self.link)
        print("exit handleTitleLinks:" + self.threadID)


# Thread class that downloads one image
class handleGetImg(threading.Thread):
    def __init__(self, threadID, url):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.url = url

    def run(self):
        print("start handleGetImg:" + self.threadID)
        getImg(self.url)
        print("exit handleGetImg:" + self.threadID)


if __name__ == "__main__":
    main()
```
## Performance comparison
![Time for 100 images, single thread](https://static.tuture.co/c/@10aaf06.md/1710229514ea4fa3.png)
![Time for 100 images, multiple threads](https://static.tuture.co/c/@10aaf06.md/171022951555675d.png)
![Time for 200 images, multiple threads](https://static.tuture.co/c/@10aaf06.md/171022951599c2eb.png)

Because of network fluctuations, multithreading does not deliver the full theoretical speedup. Still, it clearly makes the program much faster, and the more data there is, the more visible the effect.

## Summary
This brings the 爬虫养成记 series to a pause. Starting from zero, we learned step by step how to fetch a web page and pick out the images to download, then how to analyze the links between pages to reach more images, and finally how to use multithreading to make the program run faster.

I hope these three posts give you some inspiration and a feel for the methods and ideas behind analyzing, designing, and implementing a crawler, so that you can apply them to write the crawler you need. Keep going! 🆙💪

## Coming up
Stay tuned for 爬虫进阶记~
--------------------------------------------------------------------------------
/source/_posts/@2629c44.md:
--------------------------------------------------------------------------------
---
title: "爬虫养成记--顺藤摸瓜回首掏(女生定制篇)"
description: "在上篇教程[爬虫养成记——先跨进这个精彩的世界(女生定制篇)]中我们已经可以将所有小哥哥的封面照片抓取下来,但仅仅是封面图片在质量和数量上怎么能满足小仙女们的要求呢?在本篇教程中,我们串起一根姻缘“线”,来把这一系列的小哥哥们都收入囊中。"
tags: ["爬虫"]
categories: ["后端", "Python", "入门"]
date: 2020-03-16T00:00:00.509Z
photos:
  - https://static.tuture.co/c/%402629c44.md/spider-2-cover.jpg
---


crxk

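
## Put on makeup before going out

Girls heading out on a date know that light or heavy makeup can both look just right. A crawler heading out to fetch data needs to dress up too. Otherwise, how would it get all those heartthrobs to follow it home?

A crawler's "makeup" is not the "primer --> foundation --> concealer --> loose powder --> eyebrows --> lipstick" routine. Its purpose is to convince the target site that the visitor is a living person rather than a crawler. People visit websites through a browser, so the crawler only needs to imitate a browser. Let's see what "makeup" a browser puts on when it opens a page.

![8GMVwd.png](https://static.tuture.co/c/@2629c44.md/170e1504e83f6e61.png)

Open Chrome, open DevTools, switch to the Network tab, and visit https://www.nanrentu.cc/sgtp/ . A long list of requests appears in DevTools. Which one do we need? Recall the previous post: we first fetch the HTML document and then extract the image links from it. So the goal is clear: find the request with which the browser fetched that HTML document.

Chrome knows that this many requests can be bewildering, so it groups them by type. Click the Doc filter at the top, then click the single request listed there to see the details of the request that fetched the HTML document. The part we care about is Request Headers. A browser talks to the server over HTTP to get information, and a crawler gets information by imitating the browser's HTTP requests. The most important thing to imitate is the Request Headers.

### The "bottles and jars" of HTTP
Show a guy the bottles and jars a girl uses for makeup and he will probably be lost: this is BB cream, that is liquid foundation, then loose powder, eye shadow, concealer, not to mention lipstick in every shade. A girl looking at all the fields of an HTTP request may feel just as confused. In fact this is much simpler than cosmetics, and a basic understanding is enough to give the crawler a polished look.

```
:authority: www.nanrentu.cc
:method: GET        // HTTP/2 pseudo-header: request method
:path: /sgtp/       // HTTP/2 pseudo-header: request path
:scheme: https      // HTTP/2 pseudo-header: URL scheme (protocol)
// Accepted content types
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
// Accepted content encodings
accept-encoding: gzip, deflate, br
// Accepted languages
accept-language: zh-CN,zh;q=0.9
// Cache control: cached copies must be revalidated, i.e. the client wants an up-to-date resource
cache-control: max-age=0
cookie: UM_distinctid=170a5a00fa25bf-075185606c88b7-396d7407-100200-170a5a00fa3507; Hm_lvt_45e50d2aec057f43a3112beaf7f00179=1583326696,1583756661; CNZZDATA1274895726=1196969733-1583323670-%7C1583752625; Hm_lpvt_45e50d2aec057f43a3112beaf7f00179=1583756721
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
// Tells the server that the client prefers an encrypted (HTTPS) response
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36
```
The crawler does not need all of these. This site's anti-crawling measures are weak, so for now the two key headers are enough.
- cookie: a piece of text stored in the browser, which sometimes carries authentication data and other request-specific information
- user-agent: identifies the tool that sent the request

For details on User-Agent, see the post [谈谈 UserAgent 字符串的规律和伪造方法](https://juejin.im/entry/59cf793a51882550b219567b).

Other sites may require more of these headers. The reason for going into this much detail is to show why **disguising the crawler** matters and how to obtain the information needed for the disguise.



```python
import requests

baseUrl = "https://www.nanrentu.cc/sgtp/"

# Build a dictionary named headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'cookie': 'UM_distinctid=170a5a00fa25bf-075185606c88b7-396d7407-100200-170a5a00fa3507; CNZZDATA1274895726=1196969733-1583323670-%7C1583925652; Hm_lvt_45e50d2aec057f43a3112beaf7f00179=1583326696,1583756661,1583926583; Hm_lpvt_45e50d2aec057f43a3112beaf7f00179=1583926583'
}
# Send the request together with the request headers
response = requests.get(baseUrl, headers=headers)
```



## Following the vine to the melon
A website is made up of many pages, and the pages are full of hyperlinks leading from one page to another. If we want more handsome guys, we first have to work out which hyperlinks tie them together. Then we can follow the vine to the melon.

![超连接元素.png](https://static.tuture.co/c/@2629c44.md/170e1504eb4fc80c.png)

When the mouse hovers over a title, the title changes color, which shows that the element is a hyperlink. Clicking the title makes the browser open the page in a new tab. Note the pagination links at the bottom: these elements tie the whole gallery together.

![8EBD9U.png](https://static.tuture.co/c/@2629c44.md/170e1504e6d8d6a4.png)

Click "末页" (last page) and watch how the URL changes.

URL of the last page: https://www.nanrentu.cc/sgtp/36805_7.html

URL of the first page: https://www.nanrentu.cc/sgtp/36805.html

This looks promising: the last page's URL has an extra "_7" compared with the first page's. Next, open page 2, page 3, and so on, and watch the URL. The result is the following table.


Page | URL
---|---
First page | https://www.nanrentu.cc/sgtp/36805.html
Page 2 | https://www.nanrentu.cc/sgtp/36805_2.html
Page 3 | https://www.nanrentu.cc/sgtp/36805_3.html
Page 4 | https://www.nanrentu.cc/sgtp/36805_4.html
Page 5 | https://www.nanrentu.cc/sgtp/36805_5.html
Page 6 | https://www.nanrentu.cc/sgtp/36805_6.html
Page 7 | https://www.nanrentu.cc/sgtp/36805_7.html

Clicking through a few more galleries shows the same pattern. That makes things much clearer: we now know where this "vine" begins and ends, so we can write the code that lets the crawler start "picking melons".
### Step 1: extract the title links
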
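This works the same way as described in the previous post: open DevTools, switch to the Elements tab, and start exploring.

![8Ech4I.png](https://static.tuture.co/c/@2629c44.md/170e1504516e307a.png)

### Step 2: extract the last-page link and work out the number of pages

![8ERtu8.png](https://static.tuture.co/c/@2629c44.md/170e1504e2553120.png)

Looking at the HTML structure, we can see that the `<li>` tag containing the last-page link is, within its parent element\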