├── .gitignore
├── LICENSE
├── README.md
├── index.js
├── media
│   ├── all-johns.png
│   ├── copy-selector.png
│   ├── desertious.jpg
│   ├── num-results-new.png
│   ├── num-results.png
│   └── whoa.png
├── models
│   └── user.js
├── package-lock.json
├── package.json
└── screenshots
    └── github.png

/.gitignore:
--------------------------------------------------------------------------------

node_modules/
creds.js
db/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Emad Ehsan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Puppeteer and Chrome Headless: From Getting Started to Web Scraping

**This is a translation of the [original English article on GitHub](https://github.com/emadehsan/thal) by [@emadehsan](https://github.com/emadehsan)**

**and of [his](https://medium.com/@e_mad_ehsan) [original English article on Medium](https://medium.com/@e_mad_ehsan/getting-started-with-puppeteer-and-chrome-headless-for-web-scrapping-6bf5979dee3e)**

![A Desert in painters perception](./media/desertious.jpg)

[`Puppeteer`](https://github.com/GoogleChrome/puppeteer) is the official headless Chrome tool from the Google Chrome team. Because of that official announcement, many automation testing libraries in the industry have stopped being maintained, including **[PhantomJS](http://phantomjs.org/)**. The **[Selenium IDE for Firefox](https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/)** project has also been discontinued for lack of maintainers.

> Translator's note: I could not find any announcement that PhantomJS or Selenium IDE for Firefox had stopped maintenance, but neither project has published a new release in more than two years. Meanwhile another project, [Chromeless](https://github.com/graphcool/chromeless), which only started in May of this year, already has more than 10k stars on GitHub and is still very active.

Since Chrome leads the browser market, **Chrome Headless** is bound to become the industry benchmark for **automated testing** of web applications. So I put together this getting-started guide on using **Chrome Headless** for `web scraping`.

## TL;DR

In this article we will use `Chrome Headless`, `Puppeteer`, `Node` and `MongoDB` to scrape GitHub: log in, then extract and save users' email addresses. Don't worry about GitHub's rate limiting; this article gives you strategies for it based on Chrome Headless and Node. Also, keep an eye on the `Puppeteer` [documentation](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md), because the project is still under development and the API is not very stable.

## Getting started

Before we start, we need to install the following tools. Visit their websites and install them.

* [Node 8.+](https://nodejs.org)
* [MongoDB](http://mongodb.com)

> Translator's note: Puppeteer requires at least Node v6.4.0, but because this article makes heavy use of `async/await`, you need Node v7.6.0 or later.

## Initialize the project

Every project starts with creating a folder.

```
$ mkdir thal
$ cd thal
```

Initialize NPM and fill in the necessary information.

```
$ npm init
```

Install `Puppeteer`. Since `Puppeteer` is not yet stable and is updated daily, you can install it directly from its GitHub [repository](https://github.com/GoogleChrome/puppeteer) if you want the latest features.

```
$ npm i --save puppeteer
```

Puppeteer bundles its own Chrome / Chromium to make sure it works headlessly. So whenever you install or update Puppeteer, it downloads a specific Chrome version.

## Coding

We will start by taking a screenshot of a page. This is the code from their documentation.

### Page screenshot

```js
const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://github.com');
  await page.screenshot({path: 'screenshots/github.png'});

  await browser.close();
}

run();
```

If this is your first time using `Node` 7 or 8, you may not be familiar with the `async` and `await` keywords. Put simply, an `async` function returns a Promise, and when that Promise resolves it yields whatever you returned. When you need to call it as if it were a synchronous function, use `await`.

Save the code above as `index.js` in the project directory and run

```
$ node index.js
```

The screenshot will then be saved in the `screenshots/` directory.

> Translator's note: If you did not clone this repo, you may hit an error because the `screenshots` directory does not exist. Create it manually first, or use [mkdirp](https://www.npmjs.com/package/mkdirp).

![GitHub](./screenshots/github.png)

### Log in to GitHub

If you search GitHub for *john* and click the Users tab, you will see a list of users with their names.

![Johns](./media/all-johns.png)

Some users have made their email publicly visible and some have not. But the reason you cannot see any email at all is that you are not logged in. So let's log in with the help of the [Puppeteer documentation](https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md).

> Translator's note: The image above shows what you see after logging in.

Add a `creds.js` file to the project root. I strongly recommend signing up for a new GitHub account with a throwaway email address, otherwise GitHub may block your main account.

```js
module.exports = {
  username: '',
  password: ''
};
```

Also add a `.gitignore` file with the following content:

```txt
node_modules/
creds.js
```

#### Launch in non-headless mode

Passing `headless: false` in the options object to Puppeteer's `launch` method starts the browser with its GUI, so you can debug visually.

```js
const browser = await puppeteer.launch({
  headless: false
});
```

Navigate to the login page

```js
await page.goto('https://github.com/login');
```

Open [https://github.com/login](https://github.com/login) in your browser. Right-click the input box below **Username or email address** (translator's note: and choose **Inspect**). In the developer tools, right-click the highlighted code and choose `Copy` > `Copy selector`.

![Copy dom element selector](./media/copy-selector.png)

Put the copied value into the following constant

```js
const USERNAME_SELECTOR = '#login_field'; // "#login_field" is the copied value
```

Repeat the steps above for the **Password** input box and the **Sign in** button, and you get the following:

```js
// dom element selectors
const USERNAME_SELECTOR = '#login_field';
const PASSWORD_SELECTOR = '#password';
const BUTTON_SELECTOR = '#login > form > div.auth-form-body.mt-3 > input.btn.btn-primary.btn-block';
```

#### Log in

Puppeteer provides a `click` method for clicking DOM elements and a `type` method for entering text. Below we fill in the credentials, click the login button, and wait for the redirect.

First, require the `creds.js` file.

```js
const CREDS = require('./creds');
```

Then

```js
// up to puppeteer@0.11 you had to click first and then type
// await page.click(USERNAME_SELECTOR);
// await page.type(CREDS.username);
// await page.click(PASSWORD_SELECTOR);
// await page.type(CREDS.password);

// from puppeteer@0.12 on, page.type takes the selector to type into
await page.type(USERNAME_SELECTOR, CREDS.username);
await page.type(PASSWORD_SELECTOR, CREDS.password);
await page.click(BUTTON_SELECTOR);

await page.waitForNavigation();
```

### Search GitHub

Now we are logged in! We could click the search box, type into it, and click the Users tab on the results page. But there is an easier way: search requests are usually GET requests, so everything is sent through the URL. So type `john` into the search box manually, click the Users tab, and copy the URL from the address bar. It will be

```js
const searchUrl = 'https://github.com/search?q=john&type=Users&utf8=%E2%9C%93';
```

Tweak it a little

```js
const userToSearch = 'john';
const searchUrl = 'https://github.com/search?q=' + userToSearch + '&type=Users&utf8=%E2%9C%93';
```

Let's navigate to that page and check whether the search really works.
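
One caveat with the string concatenation above: a search term that contains a space or an `&` would produce a malformed URL. The following is a minimal sketch of a safer way to build the same URL, assuming a hypothetical helper named `buildSearchUrl` (it is not part of the original repo) and Node's built-in `URLSearchParams` class from the `url` module. The optional `p` parameter is the page number GitHub uses in its search URLs.

```javascript
const { URLSearchParams } = require('url');

// Hypothetical helper, not part of the original repo:
// builds the GitHub user-search URL with properly encoded query parameters.
function buildSearchUrl(userToSearch, pageNumber) {
  const params = new URLSearchParams({
    q: userToSearch,
    type: 'Users',
    utf8: '✓' // encoded as %E2%9C%93, matching the URL copied from the browser
  });
  // Only add the page parameter when a page number is given
  if (pageNumber !== undefined) {
    params.set('p', String(pageNumber));
  }
  return 'https://github.com/search?' + params.toString();
}

console.log(buildSearchUrl('john'));
console.log(buildSearchUrl('john smith', 2));
```

For `'john'` this yields the same URL as the concatenated version, so `const searchUrl = buildSearchUrl(userToSearch);` could stand in for it. Either way, `searchUrl` is then passed to `page.goto`: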

```js
await page.goto(searchUrl);
await page.waitFor(2*1000);
```

### Extract email addresses

> Translator's note: This section is not a literal translation, because the translator did not use the author's approach.

Our goal is to collect each user's `username` and `email`. In Chrome's developer tools we can see that each user's information sits inside a `<div>` with the class `user-list-item`.

One way to extract content from elements is the `evaluate` method of `Page` or `ElementHandle`, because it runs inside the browser's context. Once we have navigated to the search results page, `page.evaluate` lets us fetch all the user-info divs.

```js
const USER_LIST_INFO_SELECTOR = '.user-list-item';
const users = await page.evaluate((sel) => {
  const $els = document.querySelectorAll(sel);
  // ...
}, USER_LIST_INFO_SELECTOR);
```

Iterate over `$els` above and keep using selectors to pull out the information. Note that these selectors are relative to the user-info div rather than copied directly as before. With a little CSS knowledge they should be easy to read.

```js
const USER_LIST_INFO_SELECTOR = '.user-list-item';
const USER_LIST_USERNAME_SELECTOR = '.user-list-info>a:nth-child(1)';
const USER_LIST_EMAIL_SELECTOR = '.user-list-info>.user-list-meta .muted-link';

const users = await page.evaluate((sInfo, sName, sEmail) => {
  return Array.prototype.slice.apply(document.querySelectorAll(sInfo))
    .map($userListItem => {
      // username
      const username = $userListItem.querySelector(sName).innerText;
      // email
      const $email = $userListItem.querySelector(sEmail);
      const email = $email ? $email.innerText : undefined;
      return {
        username,
        email,
      };
    })
    // not every user shows an email
    .filter(u => !!u.email);
}, USER_LIST_INFO_SELECTOR, USER_LIST_USERNAME_SELECTOR, USER_LIST_EMAIL_SELECTOR);

console.log(users);
```

Now when you run `node index.js`, you will see Chrome pop up and perform the steps above automatically, then print each `username` with its `email` to the command line.

### Iterate over all pages

First we need to work out the page number of the last page of search results. At the top of the results page you can see that, at the time I translated this article, there were **70,134 users**.

**Fun fact: if you compare with the screenshot above, you will see that 371 new *john*s have joined GitHub in the past two days.**

![Number of search items](./media/num-results-new.png)

Copy the selector for the user count from the developer tools. We will write a new function outside `run` to get the number of pages.

```js
async function getNumPages(page) {
  const NUM_USER_SELECTOR = '#js-pjax-container .codesearch-results h3';

  let inner = await page.evaluate((sel) => {
    return document.querySelector(sel).innerHTML;
  }, NUM_USER_SELECTOR);

  // format: "69,803 users"; strip every thousands separator, not just the first
  inner = inner.replace(/,/g, '').replace(' users', '');
  const numUsers = parseInt(inner, 10);
  console.log('numUsers: ', numUsers);

  /*
   * GitHub shows 10 results per page
   */
  const numPages = Math.ceil(numUsers / 10);
  return numPages;
}
```

At the bottom of the search results page, if you hover over a page-number button you can see that it is a link to the next page. For example, the link to the second page is `https://github.com/search?p=2&q=john&type=Users&utf8=%E2%9C%93`. Notice that `p=2` is a URL query parameter, which lets us jump to a specific page.

After wrapping the scraping code above in a loop over the page numbers, the code looks like this:

```js
const numPages = await getNumPages(page);
console.log('Numpages: ', numPages);

for (let h = 1; h <= numPages; h++) {
  // go to the given page number
  await page.goto(`${searchUrl}&p=${h}`);
  // scrape
  const users = await page.evaluate((sInfo, sName, sEmail) => {
    return Array.prototype.slice.apply(document.querySelectorAll(sInfo))
      .map($userListItem => {
        // username
        const username = $userListItem.querySelector(sName).innerText;
        // email
        const $email = $userListItem.querySelector(sEmail);
        const email = $email ? $email.innerText : undefined;
        return {
          username,
          email,
        };
      })
      // not every user shows an email
      .filter(u => !!u.email);
  }, USER_LIST_INFO_SELECTOR, USER_LIST_USERNAME_SELECTOR, USER_LIST_EMAIL_SELECTOR);

  users.map(({username, email}) => {
    // TODO: save the user
    console.log(username, '->', email);
  });
}
```

### Save to MongoDB

That is the end of the `Puppeteer` part. Next we use `mongoose` to store the information above in `MongoDB`. It is an [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping), or more precisely a library that makes it easy to store and retrieve information from the database.

```
$ npm i --save mongoose
```

[MongoDB](https://www.mongodb.com/download-center/community) is a schema-less NoSQL database, but with Mongoose we can make it follow some rules. First we need to create a `Model`, which represents a `Collection` in MongoDB. Create a `models` folder, then create a `user.js` file inside it and add the following code, which defines the structure of the collection. From then on, whatever we put into `User` will follow this structure.

```js
const mongoose = require('mongoose');

const userSchema = new mongoose.Schema({
  username: String,
  email: String,
  dateCrawled: Date
});

module.exports = mongoose.model('User', userSchema);
```

Now we can start putting data into the database. We do not want duplicate emails in the database, so we only insert emails that have never appeared before; otherwise we just update the existing entry. For this we use mongoose's `Model.findOneAndUpdate` method.

Back in `index.js`, require the dependencies:

```js
const mongoose = require('mongoose');
const User = require('./models/user');
```

Then we create another function to **upsert** (update or insert) a user instance.

```js
function upsertUser(userObj) {

  const DB_URL = 'mongodb://localhost/thal';
  if (mongoose.connection.readyState === 0) {
    mongoose.connect(DB_URL);
  }

  // if this email exists, update the entry, don't insert
  const conditions = {
    email: userObj.email
  };
  const options = {
    upsert: true,
    new: true,
    setDefaultsOnInsert: true
  };

  User.findOneAndUpdate(conditions, userObj, options, (err, result) => {
    if (err) {
      throw err;
    }
  });
}
```

Start the MongoDB server. Replace the earlier `// TODO: save the user` comment with the following code.

```js
upsertUser({
  username: username,
  email: email,
  dateCrawled: new Date()
});
```

To check that these users were really saved, run the following in the mongo shell:

```
$ mongo
> use thal
> db.users.find().pretty()
```

You will see that multiple users have been added, which means it worked.

> Translator's note: Use `db.users.find().pretty().length()` to see how many entries were scraped.

## Finally

Chrome Headless and Puppeteer open a new era for web scraping and automated testing, and Chrome Headless even supports WebGL! You can deploy your scraper to the cloud and then sit back. Of course, remember to remove the `headless: false` option before deploying to a server.

* While scraping, you may get blocked by GitHub's rate limiting.

![Whoa](./media/whoa.png)

* Also, I found that you cannot go beyond page 100.

> Translator's note: I scraped 100 pages without being blocked. From page 101 on you get a 404 page, so iterating through the page links at the bottom of the page might be more sensible.

## Closing words

The vast desert bears witness to the struggles and sacrifices of the people who `traverse` these enormous sands. [**Thal**](https://en.wikipedia.org/wiki/Thal_Desert) is a desert in Pakistan spanning several districts, including my hometown Bhakkar. It is much like searching for data on the `Internet` today. That is why I named this repo `Thal`. If you like it, please share it with others. If you have any suggestions, please comment here or contact the original author directly at [@e_mad_ehsan](https://twitter.com/e_mad_ehsan). He would be happy to hear from you.

> Translator's note: For the Chinese version, feel free to open an [issue](https://github.com/csbun/thal/issues) or a PR.

--------------------------------------------------------------------------------
/index.js:
--------------------------------------------------------------------------------
const puppeteer = require('puppeteer');
const mongoose = require('mongoose');
const CREDS = require('./creds');
const User = require('./models/user');

async function run() {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();

  // await page.goto('https://github.com');
  // await page.screenshot({path: 'screenshots/github.png'});

  await page.goto('https://github.com/login');

  // dom element selectors
  const USERNAME_SELECTOR = '#login_field';
  const PASSWORD_SELECTOR = '#password';
  const BUTTON_SELECTOR = '#login > form > div.auth-form-body.mt-3 > input.btn.btn-primary.btn-block';

  await page.type(USERNAME_SELECTOR, CREDS.username);
  await page.type(PASSWORD_SELECTOR, CREDS.password);
  await page.click(BUTTON_SELECTOR);
  await page.waitForNavigation();

  const userToSearch = 'john';
  const searchUrl = 'https://github.com/search?q=' + userToSearch + '&type=Users&utf8=%E2%9C%93';
  // let searchUrl = 'https://github.com/search?utf8=%E2%9C%93&q=bashua&type=Users';

  await page.goto(searchUrl);
  await page.waitFor(2 * 1000);

  const USER_LIST_INFO_SELECTOR = '.user-list-item';
  const USER_LIST_USERNAME_SELECTOR = '.user-list-info>a:nth-child(1)';
  const USER_LIST_EMAIL_SELECTOR = '.user-list-info>.user-list-meta .muted-link';

  const numPages = await getNumPages(page);
  console.log('Numpages: ', numPages);

  for (let h = 1; h <= numPages; h++) {
    // go to the given page number
    await page.goto(`${searchUrl}&p=${h}`);
    // scrape
    const users = await page.evaluate((sInfo, sName, sEmail) => {
      return Array.prototype.slice.apply(document.querySelectorAll(sInfo))
        .map($userListItem => {
          // username
          const username = $userListItem.querySelector(sName).innerText;
          // email
          const $email = $userListItem.querySelector(sEmail);
          const email = $email ? $email.innerText : undefined;
          return {
            username,
            email,
          };
        })
        // not every user shows an email
        .filter(u => !!u.email);
    }, USER_LIST_INFO_SELECTOR, USER_LIST_USERNAME_SELECTOR, USER_LIST_EMAIL_SELECTOR);

    users.map(({username, email}) => {
      // save the user
      upsertUser({
        username: username,
        email: email,
        dateCrawled: new Date()
      });
    });
  }

  // close puppeteer
  await browser.close();

  // TODO: upsertUser is asynchronous and is not awaited here; this call only checks whether MongoDB holds any data
  showAllCounts();
}

/**
 * Get the number of pages
 * @param {Page} page search results page
 * @return {number} total number of pages
 */
async function getNumPages(page) {
  const NUM_USER_SELECTOR = '#js-pjax-container .codesearch-results h3';

  let inner = await page.evaluate((sel) => {
    return document.querySelector(sel).innerHTML;
  }, NUM_USER_SELECTOR);

  // format: "69,803 users"; strip every thousands separator, not just the first
  inner = inner.replace(/,/g, '').replace(' users', '');
  const numUsers = parseInt(inner, 10);
  console.log('numUsers: ', numUsers);

  /*
   * GitHub shows 10 results per page
   */
  const numPages = Math.ceil(numUsers / 10);
  return numPages;
}

/**
 * Initialize MongoDB
 */
function initMongoDB() {
  if (mongoose.connection.readyState === 0) {
    const DB_URL = 'mongodb://localhost/thal';
    mongoose.connect(DB_URL);
  }
}

/**
 * Insert or update a user
 * @param {object} userObj user information
 */
function upsertUser(userObj) {
  initMongoDB();
  // if this email exists, update the entry, don't insert
  const conditions = {
    email: userObj.email
  };
  const options = {
    upsert: true,
    new: true,
    setDefaultsOnInsert: true
  };

  User.findOneAndUpdate(conditions, userObj, options, (err, result) => {
    if (err) {
      throw err;
    }
  });
}

/**
 * Look up and print how many users have been saved so far
 */
function showAllCounts() {
  initMongoDB();
  User.count({}, function (err, count) {
    if (err) {
      console.error(err);
      return;
    }
    console.log('==== There are %d users saved ====', count);
  });
}

run();

--------------------------------------------------------------------------------
/media/all-johns.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csbun/thal/ea8dde2fdc7a89b6f5e15452f0d1629aef4f8059/media/all-johns.png

--------------------------------------------------------------------------------
/media/copy-selector.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csbun/thal/ea8dde2fdc7a89b6f5e15452f0d1629aef4f8059/media/copy-selector.png

--------------------------------------------------------------------------------
/media/desertious.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csbun/thal/ea8dde2fdc7a89b6f5e15452f0d1629aef4f8059/media/desertious.jpg

--------------------------------------------------------------------------------
/media/num-results-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csbun/thal/ea8dde2fdc7a89b6f5e15452f0d1629aef4f8059/media/num-results-new.png

--------------------------------------------------------------------------------
/media/num-results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csbun/thal/ea8dde2fdc7a89b6f5e15452f0d1629aef4f8059/media/num-results.png

--------------------------------------------------------------------------------
/media/whoa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/csbun/thal/ea8dde2fdc7a89b6f5e15452f0d1629aef4f8059/media/whoa.png -------------------------------------------------------------------------------- /models/user.js: -------------------------------------------------------------------------------- 1 | const mongoose = require('mongoose'); 2 | 3 | const userSchema = new mongoose.Schema({ 4 | username: String, 5 | email: String, 6 | dateCrawled: Date 7 | }); 8 | 9 | module.exports = mongoose.model('User', userSchema); 10 | -------------------------------------------------------------------------------- /package-lock.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "thal", 3 | "version": "1.0.0", 4 | "lockfileVersion": 1, 5 | "requires": true, 6 | "dependencies": { 7 | "agent-base": { 8 | "version": "4.2.1", 9 | "resolved": "https://registry.npmjs.org/agent-base/-/agent-base-4.2.1.tgz", 10 | "integrity": "sha512-JVwXMr9nHYTUXsBFKUqhJwvlcYU/blreOEUkhNR2eXZIvwd+c+o5V4MgDPKWnMS/56awN3TRzIP+KoPn+roQtg==", 11 | "requires": { 12 | "es6-promisify": "^5.0.0" 13 | } 14 | }, 15 | "async": { 16 | "version": "2.6.1", 17 | "resolved": "https://registry.npmjs.org/async/-/async-2.6.1.tgz", 18 | "integrity": "sha512-fNEiL2+AZt6AlAw/29Cr0UDe4sRAHCpEHh54WMz+Bb7QfNcFw4h3loofyJpLeQs4Yx7yuqu/2dLgM5hKOs6HlQ==", 19 | "requires": { 20 | "lodash": "^4.17.10" 21 | } 22 | }, 23 | "async-limiter": { 24 | "version": "1.0.0", 25 | "resolved": "https://registry.npmjs.org/async-limiter/-/async-limiter-1.0.0.tgz", 26 | "integrity": "sha512-jp/uFnooOiO+L211eZOoSyzpOITMXx1rBITauYykG3BRYPu8h0UcxsPNB04RR5vo4Tyz3+ay17tR6JVf9qzYWg==" 27 | }, 28 | "balanced-match": { 29 | "version": "1.0.0", 30 | "resolved": "https://registry.npmjs.org/balanced-match/-/balanced-match-1.0.0.tgz", 31 | "integrity": "sha1-ibTRmasr7kneFk6gK4nORi1xt2c=" 32 | }, 33 | "bluebird": { 34 | "version": "3.5.1", 35 | "resolved": "https://registry.npmjs.org/bluebird/-/bluebird-3.5.1.tgz", 36 | 
"integrity": "sha512-MKiLiV+I1AA596t9w1sQJ8jkiSr5+ZKi0WKrYGUn6d1Fx+Ij4tIj+m2WMQSGczs5jZVxV339chE8iwk6F64wjA==" 37 | }, 38 | "brace-expansion": { 39 | "version": "1.1.11", 40 | "resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-1.1.11.tgz", 41 | "integrity": "sha512-iCuPHDFgrHX7H2vEI/5xpz07zSHB00TpugqhmYtVmMO6518mCuRMoOYFldEBl0g187ufozdaHgWKcYFb61qGiA==", 42 | "requires": { 43 | "balanced-match": "^1.0.0", 44 | "concat-map": "0.0.1" 45 | } 46 | }, 47 | "bson": { 48 | "version": "1.1.0", 49 | "resolved": "https://registry.npmjs.org/bson/-/bson-1.1.0.tgz", 50 | "integrity": "sha512-9Aeai9TacfNtWXOYarkFJRW2CWo+dRon+fuLZYJmvLV3+MiUp0bEI6IAZfXEIg7/Pl/7IWlLaDnhzTsD81etQA==" 51 | }, 52 | "buffer-from": { 53 | "version": "1.1.1", 54 | "resolved": "https://registry.npmjs.org/buffer-from/-/buffer-from-1.1.1.tgz", 55 | "integrity": "sha512-MQcXEUbCKtEo7bhqEs6560Hyd4XaovZlO/k9V3hjVUF/zwW7KBVdSK4gIt/bzwS9MbR5qob+F5jusZsb0YQK2A==" 56 | }, 57 | "concat-map": { 58 | "version": "0.0.1", 59 | "resolved": "https://registry.npmjs.org/concat-map/-/concat-map-0.0.1.tgz", 60 | "integrity": "sha1-2Klr13/Wjfd5OnMDajug1UBdR3s=" 61 | }, 62 | "concat-stream": { 63 | "version": "1.6.2", 64 | "resolved": "https://registry.npmjs.org/concat-stream/-/concat-stream-1.6.2.tgz", 65 | "integrity": "sha512-27HBghJxjiZtIk3Ycvn/4kbJk/1uZuJFfuPEns6LaEvpvG1f0hTea8lilrouyo9mVc2GWdcEZ8OLoGmSADlrCw==", 66 | "requires": { 67 | "buffer-from": "^1.0.0", 68 | "inherits": "^2.0.3", 69 | "readable-stream": "^2.2.2", 70 | "typedarray": "^0.0.6" 71 | } 72 | }, 73 | "core-util-is": { 74 | "version": "1.0.2", 75 | "resolved": "https://registry.npmjs.org/core-util-is/-/core-util-is-1.0.2.tgz", 76 | "integrity": "sha1-tf1UIgqivFq1eqtxQMlAdUUDwac=" 77 | }, 78 | "debug": { 79 | "version": "3.1.0", 80 | "resolved": "https://registry.npmjs.org/debug/-/debug-3.1.0.tgz", 81 | "integrity": "sha512-OX8XqP7/1a9cqkxYw2yXss15f26NKWBpDXQd0/uK/KPqdQhxbPa994hnzjcE2VqQpDslf55723cKPUOGSmMY3g==", 82 | "requires": { 
83 | "ms": "2.0.0" 84 | } 85 | }, 86 | "es6-promise": { 87 | "version": "4.2.5", 88 | "resolved": "https://registry.npmjs.org/es6-promise/-/es6-promise-4.2.5.tgz", 89 | "integrity": "sha512-n6wvpdE43VFtJq+lUDYDBFUwV8TZbuGXLV4D6wKafg13ldznKsyEvatubnmUe31zcvelSzOHF+XbaT+Bl9ObDg==" 90 | }, 91 | "es6-promisify": { 92 | "version": "5.0.0", 93 | "resolved": "http://registry.npmjs.org/es6-promisify/-/es6-promisify-5.0.0.tgz", 94 | "integrity": "sha1-UQnWLz5W6pZ8S2NQWu8IKRyKUgM=", 95 | "requires": { 96 | "es6-promise": "^4.0.3" 97 | } 98 | }, 99 | "extract-zip": { 100 | "version": "1.6.7", 101 | "resolved": "https://registry.npmjs.org/extract-zip/-/extract-zip-1.6.7.tgz", 102 | "integrity": "sha1-qEC0uK9kAyZMjbV/Txp0Mz74H+k=", 103 | "requires": { 104 | "concat-stream": "1.6.2", 105 | "debug": "2.6.9", 106 | "mkdirp": "0.5.1", 107 | "yauzl": "2.4.1" 108 | }, 109 | "dependencies": { 110 | "debug": { 111 | "version": "2.6.9", 112 | "resolved": "https://registry.npmjs.org/debug/-/debug-2.6.9.tgz", 113 | "integrity": "sha512-bC7ElrdJaJnPbAP+1EotYvqZsb3ecl5wi6Bfi6BJTUcNowp6cvspg0jXznRTKDjm/E7AdgFBVeAPVMNcKGsHMA==", 114 | "requires": { 115 | "ms": "2.0.0" 116 | } 117 | } 118 | } 119 | }, 120 | "fd-slicer": { 121 | "version": "1.0.1", 122 | "resolved": "https://registry.npmjs.org/fd-slicer/-/fd-slicer-1.0.1.tgz", 123 | "integrity": "sha1-i1vL2ewyfFBBv5qwI/1nUPEXfmU=", 124 | "requires": { 125 | "pend": "~1.2.0" 126 | } 127 | }, 128 | "fs.realpath": { 129 | "version": "1.0.0", 130 | "resolved": "https://registry.npmjs.org/fs.realpath/-/fs.realpath-1.0.0.tgz", 131 | "integrity": "sha1-FQStJSMVjKpA20onh8sBQRmU6k8=" 132 | }, 133 | "glob": { 134 | "version": "7.1.3", 135 | "resolved": "https://registry.npmjs.org/glob/-/glob-7.1.3.tgz", 136 | "integrity": "sha512-vcfuiIxogLV4DlGBHIUOwI0IbrJ8HWPc4MU7HzviGeNho/UJDfi6B5p3sHeWIQ0KGIU0Jpxi5ZHxemQfLkkAwQ==", 137 | "requires": { 138 | "fs.realpath": "^1.0.0", 139 | "inflight": "^1.0.4", 140 | "inherits": "2", 141 | "minimatch": "^3.0.4", 142 | 
"once": "^1.3.0", 143 | "path-is-absolute": "^1.0.0" 144 | } 145 | }, 146 | "https-proxy-agent": { 147 | "version": "2.2.1", 148 | "resolved": "https://registry.npmjs.org/https-proxy-agent/-/https-proxy-agent-2.2.1.tgz", 149 | "integrity": "sha512-HPCTS1LW51bcyMYbxUIOO4HEOlQ1/1qRaFWcyxvwaqUS9TY88aoEuHUY33kuAh1YhVVaDQhLZsnPd+XNARWZlQ==", 150 | "requires": { 151 | "agent-base": "^4.1.0", 152 | "debug": "^3.1.0" 153 | }, 154 | "dependencies": { 155 | "debug": { 156 | "version": "3.2.6", 157 | "resolved": "https://registry.npmjs.org/debug/-/debug-3.2.6.tgz", 158 | "integrity": "sha512-mel+jf7nrtEl5Pn1Qx46zARXKDpBbvzezse7p7LqINmdoIk8PYP5SySaxEmYv6TZ0JyEKA1hsCId6DIhgITtWQ==", 159 | "requires": { 160 | "ms": "^2.1.1" 161 | } 162 | }, 163 | "ms": { 164 | "version": "2.1.1", 165 | "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.1.tgz", 166 | "integrity": "sha512-tgp+dl5cGk28utYktBsrFqA7HKgrhgPsg6Z/EfhWI4gl1Hwq8B/GmY/0oXZ6nF8hDVesS/FpnYaD/kOWhYQvyg==" 167 | } 168 | } 169 | }, 170 | "inflight": { 171 | "version": "1.0.6", 172 | "resolved": "https://registry.npmjs.org/inflight/-/inflight-1.0.6.tgz", 173 | "integrity": "sha1-Sb1jMdfQLQwJvJEKEHW6gWW1bfk=", 174 | "requires": { 175 | "once": "^1.3.0", 176 | "wrappy": "1" 177 | } 178 | }, 179 | "inherits": { 180 | "version": "2.0.3", 181 | "resolved": "https://registry.npmjs.org/inherits/-/inherits-2.0.3.tgz", 182 | "integrity": "sha1-Yzwsg+PaQqUC9SRmAiSA9CCCYd4=" 183 | }, 184 | "isarray": { 185 | "version": "1.0.0", 186 | "resolved": "https://registry.npmjs.org/isarray/-/isarray-1.0.0.tgz", 187 | "integrity": "sha1-u5NdSFgsuhaMBoNJV6VKPgcSTxE=" 188 | }, 189 | "kareem": { 190 | "version": "2.3.0", 191 | "resolved": "https://registry.npmjs.org/kareem/-/kareem-2.3.0.tgz", 192 | "integrity": "sha512-6hHxsp9e6zQU8nXsP+02HGWXwTkOEw6IROhF2ZA28cYbUk4eJ6QbtZvdqZOdD9YPKghG3apk5eOCvs+tLl3lRg==" 193 | }, 194 | "lodash": { 195 | "version": "4.17.11", 196 | "resolved": "https://registry.npmjs.org/lodash/-/lodash-4.17.11.tgz", 197 | 
"integrity": "sha512-cQKh8igo5QUhZ7lg38DYWAxMvjSAKG0A8wGSVimP07SIUEK2UO+arSRKbRZWtelMtN5V0Hkwh5ryOto/SshYIg==" 198 | }, 199 | "lodash.get": { 200 | "version": "4.4.2", 201 | "resolved": "https://registry.npmjs.org/lodash.get/-/lodash.get-4.4.2.tgz", 202 | "integrity": "sha1-LRd/ZS+jHpObRDjVNBSZ36OCXpk=" 203 | }, 204 | "memory-pager": { 205 | "version": "1.1.0", 206 | "resolved": "https://registry.npmjs.org/memory-pager/-/memory-pager-1.1.0.tgz", 207 | "integrity": "sha512-Mf9OHV/Y7h6YWDxTzX/b4ZZ4oh9NSXblQL8dtPCOomOtZciEHxePR78+uHFLLlsk01A6jVHhHsQZZ/WcIPpnzg==", 208 | "optional": true 209 | }, 210 | "mime": { 211 | "version": "2.3.1", 212 | "resolved": "https://registry.npmjs.org/mime/-/mime-2.3.1.tgz", 213 | "integrity": "sha512-OEUllcVoydBHGN1z84yfQDimn58pZNNNXgZlHXSboxMlFvgI6MXSWpWKpFRra7H1HxpVhHTkrghfRW49k6yjeg==" 214 | }, 215 | "minimatch": { 216 | "version": "3.0.4", 217 | "resolved": "https://registry.npmjs.org/minimatch/-/minimatch-3.0.4.tgz", 218 | "integrity": "sha512-yJHVQEhyqPLUTgt9B83PXu6W3rx4MvvHvSUvToogpwoGDOUQ+yDrR0HRot+yOCdCO7u4hX3pWft6kWBBcqh0UA==", 219 | "requires": { 220 | "brace-expansion": "^1.1.7" 221 | } 222 | }, 223 | "minimist": { 224 | "version": "0.0.8", 225 | "resolved": "http://registry.npmjs.org/minimist/-/minimist-0.0.8.tgz", 226 | "integrity": "sha1-hX/Kv8M5fSYluCKCYuhqp6ARsF0=" 227 | }, 228 | "mkdirp": { 229 | "version": "0.5.1", 230 | "resolved": "http://registry.npmjs.org/mkdirp/-/mkdirp-0.5.1.tgz", 231 | "integrity": "sha1-MAV0OOrGz3+MR2fzhkjWaX11yQM=", 232 | "requires": { 233 | "minimist": "0.0.8" 234 | } 235 | }, 236 | "mongodb": { 237 | "version": "3.1.8", 238 | "resolved": "https://registry.npmjs.org/mongodb/-/mongodb-3.1.8.tgz", 239 | "integrity": "sha512-yNKwYxQ6m00NV6+pMoWoheFTHSQVv1KkSrfOhRDYMILGWDYtUtQRqHrFqU75rmPIY8hMozVft8zdC4KYMWaM3Q==", 240 | "requires": { 241 | "mongodb-core": "3.1.7", 242 | "safe-buffer": "^5.1.2" 243 | }, 244 | "dependencies": { 245 | "safe-buffer": { 246 | "version": "5.1.2", 247 | "resolved": 
"https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", 248 | "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" 249 | } 250 | } 251 | }, 252 | "mongodb-core": { 253 | "version": "3.1.7", 254 | "resolved": "https://registry.npmjs.org/mongodb-core/-/mongodb-core-3.1.7.tgz", 255 | "integrity": "sha512-YffpSrLmgFNmrvkGx+yX00KyBNk64C0BalfEn6vHHkXtcMUGXw8nxrMmhq5eXPLLlYeBpD/CsgNxE2Chf0o4zQ==", 256 | "requires": { 257 | "bson": "^1.1.0", 258 | "require_optional": "^1.0.1", 259 | "safe-buffer": "^5.1.2", 260 | "saslprep": "^1.0.0" 261 | }, 262 | "dependencies": { 263 | "safe-buffer": { 264 | "version": "5.1.2", 265 | "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", 266 | "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" 267 | } 268 | } 269 | }, 270 | "mongoose": { 271 | "version": "5.3.10", 272 | "resolved": "https://registry.npmjs.org/mongoose/-/mongoose-5.3.10.tgz", 273 | "integrity": "sha512-h2cW/vR/7UFOAlOoGMpyWdXE75fvfC61TdX63tXnz8L95OU5p7Lj11FxZoznBKjfBPGUk79tmMz6zxPLyBkClQ==", 274 | "requires": { 275 | "async": "2.6.1", 276 | "bson": "~1.1.0", 277 | "kareem": "2.3.0", 278 | "lodash.get": "4.4.2", 279 | "mongodb": "3.1.8", 280 | "mongodb-core": "3.1.7", 281 | "mongoose-legacy-pluralize": "1.0.2", 282 | "mpath": "0.5.1", 283 | "mquery": "3.2.0", 284 | "ms": "2.0.0", 285 | "regexp-clone": "0.0.1", 286 | "safe-buffer": "5.1.2", 287 | "sliced": "1.0.1" 288 | }, 289 | "dependencies": { 290 | "safe-buffer": { 291 | "version": "5.1.2", 292 | "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", 293 | "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" 294 | } 295 | } 296 | }, 297 | "mongoose-legacy-pluralize": { 298 | "version": "1.0.2", 299 | "resolved": 
"https://registry.npmjs.org/mongoose-legacy-pluralize/-/mongoose-legacy-pluralize-1.0.2.tgz", 300 | "integrity": "sha512-Yo/7qQU4/EyIS8YDFSeenIvXxZN+ld7YdV9LqFVQJzTLye8unujAWPZ4NWKfFA+RNjh+wvTWKY9Z3E5XM6ZZiQ==" 301 | }, 302 | "mpath": { 303 | "version": "0.5.1", 304 | "resolved": "https://registry.npmjs.org/mpath/-/mpath-0.5.1.tgz", 305 | "integrity": "sha512-H8OVQ+QEz82sch4wbODFOz+3YQ61FYz/z3eJ5pIdbMEaUzDqA268Wd+Vt4Paw9TJfvDgVKaayC0gBzMIw2jhsg==" 306 | }, 307 | "mquery": { 308 | "version": "3.2.0", 309 | "resolved": "https://registry.npmjs.org/mquery/-/mquery-3.2.0.tgz", 310 | "integrity": "sha512-qPJcdK/yqcbQiKoemAt62Y0BAc0fTEKo1IThodBD+O5meQRJT/2HSe5QpBNwaa4CjskoGrYWsEyjkqgiE0qjhg==", 311 | "requires": { 312 | "bluebird": "3.5.1", 313 | "debug": "3.1.0", 314 | "regexp-clone": "0.0.1", 315 | "safe-buffer": "5.1.2", 316 | "sliced": "1.0.1" 317 | }, 318 | "dependencies": { 319 | "safe-buffer": { 320 | "version": "5.1.2", 321 | "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.2.tgz", 322 | "integrity": "sha512-Gd2UZBJDkXlY7GbJxfsE8/nvKkUEU1G38c1siN6QP6a9PT9MmHB8GnpscSmMJSoF8LOIrt8ud/wPtojys4G6+g==" 323 | } 324 | } 325 | }, 326 | "ms": { 327 | "version": "2.0.0", 328 | "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", 329 | "integrity": "sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=" 330 | }, 331 | "once": { 332 | "version": "1.4.0", 333 | "resolved": "https://registry.npmjs.org/once/-/once-1.4.0.tgz", 334 | "integrity": "sha1-WDsap3WWHUsROsF9nFC6753Xa9E=", 335 | "requires": { 336 | "wrappy": "1" 337 | } 338 | }, 339 | "path-is-absolute": { 340 | "version": "1.0.1", 341 | "resolved": "http://registry.npmjs.org/path-is-absolute/-/path-is-absolute-1.0.1.tgz", 342 | "integrity": "sha1-F0uSaHNVNP+8es5r9TpanhtcX18=" 343 | }, 344 | "pend": { 345 | "version": "1.2.0", 346 | "resolved": "https://registry.npmjs.org/pend/-/pend-1.2.0.tgz", 347 | "integrity": "sha1-elfrVQpng/kRUzH89GY9XI4AelA=" 348 | }, 349 | "progress": { 350 | "version": "2.0.1", 351 | 
"resolved": "https://registry.npmjs.org/progress/-/progress-2.0.1.tgz", 352 | "integrity": "sha512-OE+a6vzqazc+K6LxJrX5UPyKFvGnL5CYmq2jFGNIBWHpc4QyE49/YOumcrpQFJpfejmvRtbJzgO1zPmMCqlbBg==" 353 | }, 354 | "proxy-from-env": { 355 | "version": "1.0.0", 356 | "resolved": "https://registry.npmjs.org/proxy-from-env/-/proxy-from-env-1.0.0.tgz", 357 | "integrity": "sha1-M8UDmPcOp+uW0h97gXYwpVeRx+4=" 358 | }, 359 | "puppeteer": { 360 | "version": "1.10.0", 361 | "resolved": "https://registry.npmjs.org/puppeteer/-/puppeteer-1.10.0.tgz", 362 | "integrity": "sha512-3i28X/ucX8t3eL4TZA60FLMOQNKqudFSOGDHr0cT7T4dE027CrcS885aAqjdxNybhMPliM5yImNsKJ6SQrPzhw==", 363 | "requires": { 364 | "debug": "^3.1.0", 365 | "extract-zip": "^1.6.6", 366 | "https-proxy-agent": "^2.2.1", 367 | "mime": "^2.0.3", 368 | "progress": "^2.0.0", 369 | "proxy-from-env": "^1.0.0", 370 | "rimraf": "^2.6.1", 371 | "ws": "^5.1.1" 372 | }, 373 | "dependencies": { 374 | "debug": { 375 | "version": "3.2.6", 376 | "resolved": "https://registry.npmjs.org/debug/-/debug-3.2.6.tgz", 377 | "integrity": "sha512-mel+jf7nrtEl5Pn1Qx46zARXKDpBbvzezse7p7LqINmdoIk8PYP5SySaxEmYv6TZ0JyEKA1hsCId6DIhgITtWQ==", 378 | "requires": { 379 | "ms": "^2.1.1" 380 | } 381 | }, 382 | "ms": { 383 | "version": "2.1.1", 384 | "resolved": "https://registry.npmjs.org/ms/-/ms-2.1.1.tgz", 385 | "integrity": "sha512-tgp+dl5cGk28utYktBsrFqA7HKgrhgPsg6Z/EfhWI4gl1Hwq8B/GmY/0oXZ6nF8hDVesS/FpnYaD/kOWhYQvyg==" 386 | } 387 | } 388 | }, 389 | "readable-stream": { 390 | "version": "2.3.6", 391 | "resolved": "http://registry.npmjs.org/readable-stream/-/readable-stream-2.3.6.tgz", 392 | "integrity": "sha512-tQtKA9WIAhBF3+VLAseyMqZeBjW0AHJoxOtYqSUZNJxauErmLbVm2FW1y+J/YA9dUrAC39ITejlZWhVIwawkKw==", 393 | "requires": { 394 | "core-util-is": "~1.0.0", 395 | "inherits": "~2.0.3", 396 | "isarray": "~1.0.0", 397 | "process-nextick-args": "~2.0.0", 398 | "safe-buffer": "~5.1.1", 399 | "string_decoder": "~1.1.1", 400 | "util-deprecate": "~1.0.1" 401 | }, 402 | 
"dependencies": { 403 | "process-nextick-args": { 404 | "version": "2.0.0", 405 | "resolved": "https://registry.npmjs.org/process-nextick-args/-/process-nextick-args-2.0.0.tgz", 406 | "integrity": "sha512-MtEC1TqN0EU5nephaJ4rAtThHtC86dNN9qCuEhtshvpVBkAW5ZO7BASN9REnF9eoXGcRub+pFuKEpOHE+HbEMw==" 407 | }, 408 | "string_decoder": { 409 | "version": "1.1.1", 410 | "resolved": "https://registry.npmjs.org/string_decoder/-/string_decoder-1.1.1.tgz", 411 | "integrity": "sha512-n/ShnvDi6FHbbVfviro+WojiFzv+s8MPMHBczVePfUpDJLwoLT0ht1l4YwBCbi8pJAveEEdnkHyPyTP/mzRfwg==", 412 | "requires": { 413 | "safe-buffer": "~5.1.0" 414 | } 415 | } 416 | } 417 | }, 418 | "regexp-clone": { 419 | "version": "0.0.1", 420 | "resolved": "https://registry.npmjs.org/regexp-clone/-/regexp-clone-0.0.1.tgz", 421 | "integrity": "sha1-p8LgmJH9vzj7sQ03b7cwA+aKxYk=" 422 | }, 423 | "require_optional": { 424 | "version": "1.0.1", 425 | "resolved": "https://registry.npmjs.org/require_optional/-/require_optional-1.0.1.tgz", 426 | "integrity": "sha512-qhM/y57enGWHAe3v/NcwML6a3/vfESLe/sGM2dII+gEO0BpKRUkWZow/tyloNqJyN6kXSl3RyyM8Ll5D/sJP8g==", 427 | "requires": { 428 | "resolve-from": "^2.0.0", 429 | "semver": "^5.1.0" 430 | } 431 | }, 432 | "resolve-from": { 433 | "version": "2.0.0", 434 | "resolved": "https://registry.npmjs.org/resolve-from/-/resolve-from-2.0.0.tgz", 435 | "integrity": "sha1-lICrIOlP+h2egKgEx+oUdhGWa1c=" 436 | }, 437 | "rimraf": { 438 | "version": "2.6.2", 439 | "resolved": "https://registry.npmjs.org/rimraf/-/rimraf-2.6.2.tgz", 440 | "integrity": "sha512-lreewLK/BlghmxtfH36YYVg1i8IAce4TI7oao75I1g245+6BctqTVQiBP3YUJ9C6DQOXJmkYR9X9fCLtCOJc5w==", 441 | "requires": { 442 | "glob": "^7.0.5" 443 | } 444 | }, 445 | "safe-buffer": { 446 | "version": "5.1.1", 447 | "resolved": "https://registry.npmjs.org/safe-buffer/-/safe-buffer-5.1.1.tgz", 448 | "integrity": "sha1-iTMSr2myEj3vcfV4iQAWce6yyFM=" 449 | }, 450 | "saslprep": { 451 | "version": "1.0.2", 452 | "resolved": 
"https://registry.npmjs.org/saslprep/-/saslprep-1.0.2.tgz", 453 | "integrity": "sha512-4cDsYuAjXssUSjxHKRe4DTZC0agDwsCqcMqtJAQPzC74nJ7LfAJflAtC1Zed5hMzEQKj82d3tuzqdGNRsLJ4Gw==", 454 | "optional": true, 455 | "requires": { 456 | "sparse-bitfield": "^3.0.3" 457 | } 458 | }, 459 | "semver": { 460 | "version": "5.6.0", 461 | "resolved": "https://registry.npmjs.org/semver/-/semver-5.6.0.tgz", 462 | "integrity": "sha512-RS9R6R35NYgQn++fkDWaOmqGoj4Ek9gGs+DPxNUZKuwE183xjJroKvyo1IzVFeXvUrvmALy6FWD5xrdJT25gMg==" 463 | }, 464 | "sliced": { 465 | "version": "1.0.1", 466 | "resolved": "https://registry.npmjs.org/sliced/-/sliced-1.0.1.tgz", 467 | "integrity": "sha1-CzpmK10Ewxd7GSa+qCsD+Dei70E=" 468 | }, 469 | "sparse-bitfield": { 470 | "version": "3.0.3", 471 | "resolved": "https://registry.npmjs.org/sparse-bitfield/-/sparse-bitfield-3.0.3.tgz", 472 | "integrity": "sha1-/0rm5oZWBWuks+eSqzM004JzyhE=", 473 | "optional": true, 474 | "requires": { 475 | "memory-pager": "^1.0.2" 476 | } 477 | }, 478 | "typedarray": { 479 | "version": "0.0.6", 480 | "resolved": "https://registry.npmjs.org/typedarray/-/typedarray-0.0.6.tgz", 481 | "integrity": "sha1-hnrHTjhkGHsdPUfZlqeOxciDB3c=" 482 | }, 483 | "util-deprecate": { 484 | "version": "1.0.2", 485 | "resolved": "https://registry.npmjs.org/util-deprecate/-/util-deprecate-1.0.2.tgz", 486 | "integrity": "sha1-RQ1Nyfpw3nMnYvvS1KKJgUGaDM8=" 487 | }, 488 | "wrappy": { 489 | "version": "1.0.2", 490 | "resolved": "https://registry.npmjs.org/wrappy/-/wrappy-1.0.2.tgz", 491 | "integrity": "sha1-tSQ9jz7BqjXxNkYFvA0QNuMKtp8=" 492 | }, 493 | "ws": { 494 | "version": "5.2.2", 495 | "resolved": "https://registry.npmjs.org/ws/-/ws-5.2.2.tgz", 496 | "integrity": "sha512-jaHFD6PFv6UgoIVda6qZllptQsMlDEJkTQcybzzXDYM1XO9Y8em691FGMPmM46WGyLU4z9KMgQN+qrux/nhlHA==", 497 | "requires": { 498 | "async-limiter": "~1.0.0" 499 | } 500 | }, 501 | "yauzl": { 502 | "version": "2.4.1", 503 | "resolved": "https://registry.npmjs.org/yauzl/-/yauzl-2.4.1.tgz", 504 | 
"integrity": "sha1-lSj0QtqxsihOWLQ3m7GU4i4MQAU=", 505 | "requires": { 506 | "fd-slicer": "~1.0.1" 507 | } 508 | } 509 | } 510 | } 511 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "thal", 3 | "version": "1.0.0", 4 | "description": "Web Scraper with Chrome Headless & Puppeteer", 5 | "main": "index.js", 6 | "scripts": { 7 | "mongo": "mkdir -p db && mongod --dbpath ./db", 8 | "start": "node index.js", 9 | "test": "echo \"Error: no test specified\" && exit 1" 10 | }, 11 | "repository": { 12 | "type": "git", 13 | "url": "git+https://github.com/emadehsan/thal.git" 14 | }, 15 | "keywords": [ 16 | "Puppeteer", 17 | "Chrome", 18 | "Headless", 19 | "Scraping", 20 | "Thal" 21 | ], 22 | "author": "Emad Ehsan", 23 | "license": "ISC", 24 | "bugs": { 25 | "url": "https://github.com/emadehsan/thal/issues" 26 | }, 27 | "homepage": "https://github.com/emadehsan/thal#readme", 28 | "dependencies": { 29 | "mongoose": "^5.3.10", 30 | "puppeteer": "^1.10.0" 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /screenshots/github.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/csbun/thal/ea8dde2fdc7a89b6f5e15452f0d1629aef4f8059/screenshots/github.png --------------------------------------------------------------------------------