--------------------------------------------------------------------------------
/front/src/pages/ScraperPage.vue:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/front/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | .thumbs.db
3 | node_modules
4 | .yarn
5 | # Quasar core related directories
6 | .quasar
7 | /dist
8 |
9 | # Cordova related directories and files
10 | /src-cordova/node_modules
11 | /src-cordova/platforms
12 | /src-cordova/plugins
13 | /src-cordova/www
14 |
15 | # Capacitor related directories and files
16 | /src-capacitor/www
17 | /src-capacitor/node_modules
18 |
19 | # Log files
20 | npm-debug.log*
21 | yarn-debug.log*
22 | yarn-error.log*
23 |
24 | # Editor directories and files
25 | .idea
26 | *.suo
27 | *.ntvs*
28 | *.njsproj
29 | *.sln
30 |
--------------------------------------------------------------------------------
/docs/middleware/index.rst:
--------------------------------------------------------------------------------
1 | #################
2 | Middleware
3 | #################
4 |
5 | **Sneakpeek** allows you to run arbitrary code before a request is sent and after the response has been received.
6 | This can be helpful if you have some common logic you want to reuse across your scrapers.
7 |
8 | The following middleware plugins are already implemented:
9 |
10 | .. toctree::
11 | :maxdepth: 2
12 |
13 |
14 | rate_limiter_middleware
15 | robots_txt_middleware
16 | user_agent_injecter_middleware
17 | proxy_middleware
18 | requests_logging_middleware
19 | new_middleware
20 |
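21 | You can also write your own middleware by subclassing ``BaseMiddleware`` and
22 | overriding the hooks you need. Here is a minimal sketch (the hook signatures
23 | mirror ``sneakpeek.middleware.base``; the header name and value are purely
24 | illustrative):
25 |
26 | .. code-block:: python3
27 |
28 |     from typing import Any
29 |
30 |     from aiohttp import ClientResponse
31 |
32 |     from sneakpeek.middleware.base import BaseMiddleware
33 |     from sneakpeek.scraper.model import Request
34 |
35 |     class StampMiddleware(BaseMiddleware):
36 |         """Adds a custom header to every outgoing request (illustration only)."""
37 |
38 |         @property
39 |         def name(self) -> str:
40 |             return "stamp"
41 |
42 |         async def on_request(self, request: Request, config: Any | None) -> Request:
43 |             if not request.headers:
44 |                 request.headers = {}
45 |             request.headers["X-Stamp"] = "sneakpeek"
46 |             return request
47 |
48 |         async def on_response(
49 |             self, request: Request, response: ClientResponse, config: Any | None = None
50 |         ) -> ClientResponse:
51 |             return response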
--------------------------------------------------------------------------------
/front/src/pages/ErrorNotFound.vue:
--------------------------------------------------------------------------------
1 | [component markup stripped; visible page text: "404" and "Oops. Nothing is here..."]
--------------------------------------------------------------------------------
/docs/Makefile:
--------------------------------------------------------------------------------
1 | # Minimal makefile for Sphinx documentation
2 | #
3 |
4 | # You can set these variables from the command line, and also
5 | # from the environment for the first two.
6 | SPHINXOPTS ?=
7 | SPHINXBUILD ?= sphinx-build
8 | SOURCEDIR = .
9 | BUILDDIR = _build
10 |
11 | # Put it first so that "make" without argument is like "make help".
12 | help:
13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14 |
15 | .PHONY: help Makefile
16 |
17 | # Catch-all target: route all unknown targets to Sphinx using the new
18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19 | %: Makefile
20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
21 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Suggest an idea for this project
4 | title: "[FEATURE]"
5 | labels: enhancement
6 | assignees: flulemon
7 |
8 | ---
9 |
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 |
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 |
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 |
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 |
--------------------------------------------------------------------------------
/front/README.md:
--------------------------------------------------------------------------------
1 | # Sneakpeek (sneakpeek-front)
2 |
3 | A toolbox to create scrapers
4 |
5 | ## Install the dependencies
6 | ```bash
7 | yarn
8 | # or
9 | npm install
10 | ```
11 |
12 | ### Start the app in development mode (hot-code reloading, error reporting, etc.)
13 | ```bash
14 | quasar dev
15 | ```
16 |
17 |
18 | ### Lint the files
19 | ```bash
20 | yarn lint
21 | # or
22 | npm run lint
23 | ```
24 |
25 |
26 | ### Format the files
27 | ```bash
28 | yarn format
29 | # or
30 | npm run format
31 | ```
32 |
33 |
34 |
35 | ### Build the app for production
36 | ```bash
37 | quasar build
38 | ```
39 |
40 | ### Customize the configuration
41 | See [Configuring quasar.config.js](https://v2.quasar.dev/quasar-cli-vite/quasar-config-js).
42 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: "[BUG]"
5 | labels: bug
6 | assignees: flulemon
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **To Reproduce**
14 | Steps to reproduce the behavior:
15 |
16 | **Expected behavior**
17 | A clear and concise description of what you expected to happen.
18 |
19 | **Screenshots**
20 | If applicable, add screenshots to help explain your problem.
21 |
22 | **Environment (please complete the following information):**
23 | - OS: [e.g. iOS]
24 | - Python version: [e.g. 3.10]
25 | - Package version: [e.g. 0.1.4]
26 |
27 | **Additional context**
28 | Add any other context about the problem here.
29 |
--------------------------------------------------------------------------------
/front/index.html:
--------------------------------------------------------------------------------
1 | [markup stripped; document title template: <%= productName %>]
--------------------------------------------------------------------------------
/front/jsconfig.json:
--------------------------------------------------------------------------------
1 | {
2 | "compilerOptions": {
3 | "baseUrl": ".",
4 | "paths": {
5 | "src/*": [
6 | "src/*"
7 | ],
8 | "app/*": [
9 | "*"
10 | ],
11 | "components/*": [
12 | "src/components/*"
13 | ],
14 | "layouts/*": [
15 | "src/layouts/*"
16 | ],
17 | "pages/*": [
18 | "src/pages/*"
19 | ],
20 | "assets/*": [
21 | "src/assets/*"
22 | ],
23 | "boot/*": [
24 | "src/boot/*"
25 | ],
26 | "stores/*": [
27 | "src/stores/*"
28 | ],
29 | "vue$": [
30 | "node_modules/vue/dist/vue.runtime.esm-bundler.js"
31 | ]
32 | }
33 | },
34 | "exclude": [
35 | "dist",
36 | ".quasar",
37 | "node_modules"
38 | ]
39 | }
--------------------------------------------------------------------------------
/front/src/components/ScraperJobStatusChip.vue:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
1 | {
2 | "python.formatting.provider": "black",
3 | "python.linting.flake8Enabled": true,
4 | "python.linting.enabled": true,
5 | "editor.formatOnSave": true,
6 | "editor.codeActionsOnSave": {
7 | "source.organizeImports": true,
8 | "source.fixAll.eslint": true
9 | },
10 | "python.testing.pytestArgs": [],
11 | "python.testing.unittestEnabled": false,
12 | "python.testing.pytestEnabled": true,
13 | "eslint.validate": ["javascript", "javascriptreact", "typescript", "vue"],
14 | "editor.bracketPairColorization.enabled": true,
15 | "editor.guides.bracketPairs": true,
16 | "editor.defaultFormatter": "esbenp.prettier-vscode",
17 | "typescript.tsdk": "node_modules/typescript/lib",
18 | "[xml]": {
19 | "editor.defaultFormatter": "redhat.vscode-xml"
20 | },
21 | "esbonio.sphinx.confDir": ""
22 | }
23 |
--------------------------------------------------------------------------------
/front/src/router/routes.js:
--------------------------------------------------------------------------------
1 |
2 | const routes = [
3 | {
4 | name: 'Homepage',
5 | path: '/',
6 | component: () => import('layouts/MainLayout.vue'),
7 | children: [
8 | { name: 'ScrapersPage', path: '', component: () => import('src/pages/ScrapersPage.vue') },
9 | { name: 'NewScraperPage', path: 'new', component: () => import('src/pages/NewScraperPage.vue') },
10 | { name: 'ScraperIde', path: 'ide', component: () => import('src/pages/ScraperIde.vue') },
11 | { name: 'ScraperPage', path: 'scraper/:id', component: () => import('src/pages/ScraperPage.vue'), props: true },
12 | ]
13 | },
14 |
15 | // Always leave this as last one,
16 | // but you can also remove it
17 | {
18 | path: '/:catchAll(.*)*',
19 | component: () => import('pages/ErrorNotFound.vue')
20 | }
21 | ]
22 |
23 | export default routes
24 |
--------------------------------------------------------------------------------
/front/src/css/quasar.variables.scss:
--------------------------------------------------------------------------------
1 | // Quasar SCSS (& Sass) Variables
2 | // --------------------------------------------------
3 | // To customize the look and feel of this app, you can override
4 | // the Sass/SCSS variables found in Quasar's source Sass/SCSS files.
5 |
6 | // Check documentation for full list of Quasar variables
7 |
8 | // Your own variables (that are declared here) and Quasar's own
9 | // ones will be available out of the box in your .vue/.scss/.sass files
10 |
11 | // It's highly recommended to change the default colors
12 | // to match your app's branding.
13 | // Tip: Use the "Theme Builder" on Quasar's documentation website.
14 |
15 | $primary : #1976d2;
16 | $secondary : #c2c2c2;
17 | $accent : #9C27B0;
18 |
19 | $dark : #3b3535;
20 | $dark-page : #121212;
21 |
22 | $positive : #37994e;
23 | $negative : #c22b3c;
24 | $info : #add5de;
25 | $warning : #ff9b29;
26 |
--------------------------------------------------------------------------------
/front/src/components/PriorityChip.vue:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/front/postcss.config.js:
--------------------------------------------------------------------------------
1 | /* eslint-disable */
2 | // https://github.com/michael-ciniawsky/postcss-load-config
3 |
4 | module.exports = {
5 | plugins: [
6 | // https://github.com/postcss/autoprefixer
7 | require('autoprefixer')({
8 | overrideBrowserslist: [
9 | 'last 4 Chrome versions',
10 | 'last 4 Firefox versions',
11 | 'last 4 Edge versions',
12 | 'last 4 Safari versions',
13 | 'last 4 Android versions',
14 | 'last 4 ChromeAndroid versions',
15 | 'last 4 FirefoxAndroid versions',
16 | 'last 4 iOS versions'
17 | ]
18 | })
19 |
20 | // https://github.com/elchininet/postcss-rtlcss
21 | // If you want to support RTL css, then
22 | // 1. yarn/npm install postcss-rtlcss
23 | // 2. optionally set quasar.config.js > framework > lang to an RTL language
24 | // 3. uncomment the following line:
25 | // require('postcss-rtlcss')
26 | ]
27 | }
28 |
--------------------------------------------------------------------------------
/docs/make.bat:
--------------------------------------------------------------------------------
1 | @ECHO OFF
2 |
3 | pushd %~dp0
4 |
5 | REM Command file for Sphinx documentation
6 |
7 | if "%SPHINXBUILD%" == "" (
8 | set SPHINXBUILD=sphinx-build
9 | )
10 | set SOURCEDIR=.
11 | set BUILDDIR=_build
12 |
13 | if "%1" == "" goto help
14 |
15 | %SPHINXBUILD% >NUL 2>NUL
16 | if errorlevel 9009 (
17 | echo.
18 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
19 | echo.installed, then set the SPHINXBUILD environment variable to point
20 | echo.to the full path of the 'sphinx-build' executable. Alternatively you
21 | echo.may add the Sphinx directory to PATH.
22 | echo.
23 | echo.If you don't have Sphinx installed, grab it from
24 | echo.https://www.sphinx-doc.org/
25 | exit /b 1
26 | )
27 |
28 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
29 | goto end
30 |
31 | :help
32 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
33 |
34 | :end
35 | popd
36 |
--------------------------------------------------------------------------------
/sneakpeek/session_loggers/base.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from abc import ABC, abstractmethod
3 | from typing import Any
4 |
5 | from pydantic import BaseModel
6 |
7 | FIELDS_TO_LOG = [
8 | "levelname",
9 | "msg",
10 | "filename",
11 | "lineno",
12 | "name",
13 | "funcName",
14 | "task_id",
15 | "task_name",
16 | "task_handler",
17 | "asctime",
18 | "headers",
19 | "kwargs",
20 | "request",
21 | "response",
22 | ]
23 |
24 |
25 | def get_fields_to_log(record: logging.LogRecord) -> dict[str, Any]:
26 | return {
27 | field: value
28 | for field in FIELDS_TO_LOG
29 | if (value := getattr(record, field, None)) is not None
30 | }
31 |
32 |
33 | class LogLine(BaseModel):
34 | id: str
35 | data: dict[str, Any]
36 |
37 |
38 | class SessionLogger(ABC, logging.Handler):
39 | @abstractmethod
40 | async def read(
41 | self,
42 | task_id: str,
43 | last_log_line_id: str | None = None,
44 | max_lines: int = 100,
45 |     ) -> list[LogLine]:
46 | ...
47 |
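48 | # Illustration (not part of the original module): get_fields_to_log keeps
49 | # only the whitelisted attributes of a LogRecord that are set and non-None, e.g.
50 | #
51 | #   record = logging.LogRecord("scraper", logging.INFO, "run.py", 10, "started", None, None)
52 | #   get_fields_to_log(record)
53 | #   # -> {"levelname": "INFO", "msg": "started", "filename": "run.py", "lineno": 10, "name": "scraper"}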
--------------------------------------------------------------------------------
/front/src/router/index.js:
--------------------------------------------------------------------------------
1 | import { route } from 'quasar/wrappers'
2 | import { createRouter, createMemoryHistory, createWebHistory, createWebHashHistory } from 'vue-router'
3 | import routes from './routes'
4 |
5 | /*
6 | * If not building with SSR mode, you can
7 | * directly export the Router instantiation;
8 | *
9 | * The function below can be async too; either use
10 | * async/await or return a Promise which resolves
11 | * with the Router instance.
12 | */
13 |
14 | export default route(function (/* { store, ssrContext } */) {
15 | const createHistory = process.env.SERVER
16 | ? createMemoryHistory
17 | : (process.env.VUE_ROUTER_MODE === 'history' ? createWebHistory : createWebHashHistory)
18 |
19 | const Router = createRouter({
20 | scrollBehavior: () => ({ left: 0, top: 0 }),
21 | routes,
22 |
23 | // Leave this as is and make changes in quasar.conf.js instead!
24 | // quasar.conf.js -> build -> vueRouterMode
25 | // quasar.conf.js -> build -> publicPath
26 | history: createHistory(process.env.VUE_ROUTER_BASE)
27 | })
28 |
29 | return Router
30 | })
31 |
--------------------------------------------------------------------------------
/sneakpeek/middleware/parser.py:
--------------------------------------------------------------------------------
1 | import re
2 | from dataclasses import dataclass
3 |
4 | from sneakpeek.middleware.base import BaseMiddleware
5 |
6 |
7 | @dataclass
8 | class RegexMatch:
9 | """Regex match"""
10 |
11 | full_match: str #: Full regular expression match
12 | groups: dict[str, str] #: Regular expression group matches
13 |
14 |
15 | class ParserMiddleware(BaseMiddleware):
16 | """Parser middleware provides parsing utilities"""
17 |
18 | @property
19 | def name(self) -> str:
20 | return "parser"
21 |
22 | def regex(
23 | self,
24 | text: str,
25 | pattern: str,
26 | flags: re.RegexFlag = re.UNICODE | re.MULTILINE | re.IGNORECASE,
27 | ) -> list[RegexMatch]:
28 | """Find matches in the text using regular expression
29 |
30 | Args:
31 | text (str): Text to search in
32 | pattern (str): Regular expression
33 | flags (re.RegexFlag, optional): Regular expression flags. Defaults to re.UNICODE | re.MULTILINE | re.IGNORECASE.
34 |
35 | Returns:
36 | list[RegexMatch]: Matches found in the text
37 | """
38 | return [
39 | RegexMatch(full_match=match.group(0), groups=match.groupdict())
40 | for match in re.finditer(pattern, text, flags)
41 | ]
42 |
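43 | # Illustrative usage (not part of the original module):
44 | if __name__ == "__main__":
45 |     matches = ParserMiddleware().regex(
46 |         text="price: 42 USD",
47 |         pattern=r"price: (?P<amount>\d+) (?P<currency>\w+)",
48 |     )
49 |     print(matches[0].full_match)  # price: 42 USD
50 |     print(matches[0].groups)  # {'amount': '42', 'currency': 'USD'}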
--------------------------------------------------------------------------------
/front/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "sneakpeek-front",
3 | "version": "0.2.2",
4 | "description": "A toolbox to create scrapers",
5 | "productName": "Sneakpeek",
6 | "author": "Dan Yazovsky ",
7 | "private": true,
8 | "scripts": {
9 | "lint": "eslint --ext .js,.vue ./",
10 | "format": "prettier --write \"**/*.{js,vue,scss,html,md,json}\" --ignore-path .gitignore",
11 | "test": "echo \"No test specified\" && exit 0",
12 | "dev": "quasar dev",
13 | "build": "quasar build"
14 | },
15 | "dependencies": {
16 | "@quasar/extras": "^1.0.0",
17 | "axios": "^1.3.5",
18 | "json-editor-vue": "^0.10.5",
19 | "monaco-editor-vue3": "^0.1.6",
20 | "monaco-editor-webpack-plugin": "^7.0.1",
21 | "quasar": "^2.6.0",
22 | "vanilla-jsoneditor": "^0.16.1",
23 | "vscode-ws-jsonrpc": "^3.0.0",
24 | "vue": "^3.0.0",
25 | "vue-router": "^4.0.0"
26 | },
27 | "devDependencies": {
28 | "@quasar/app-vite": "^1.0.0",
29 | "autoprefixer": "^10.4.2",
30 | "eslint": "^8.10.0",
31 | "eslint-config-prettier": "^8.1.0",
32 | "eslint-plugin-vue": "^9.0.0",
33 | "postcss": "^8.4.14",
34 | "prettier": "^2.5.1"
35 | },
36 | "engines": {
37 | "node": "^18 || ^16 || ^14.19",
38 | "npm": ">= 6.13.4",
39 | "yarn": ">= 1.21.1"
40 | }
41 | }
42 |
--------------------------------------------------------------------------------
/sneakpeek/scraper/task_handler.py:
--------------------------------------------------------------------------------
1 | from sneakpeek.queue.model import Task, TaskHandlerABC
2 | from sneakpeek.scraper.model import (
3 | SCRAPER_PERIODIC_TASK_HANDLER_NAME,
4 | Scraper,
5 | ScraperHandler,
6 | ScraperRunnerABC,
7 | ScraperStorageABC,
8 | UnknownScraperHandlerError,
9 | )
10 |
11 |
12 | class ScraperTaskHandler(TaskHandlerABC):
13 | def __init__(
14 | self,
15 | scraper_handlers: list[ScraperHandler],
16 | runner: ScraperRunnerABC,
17 | storage: ScraperStorageABC,
18 | ) -> None:
19 | self.scraper_handlers = {handler.name: handler for handler in scraper_handlers}
20 | self.runner = runner
21 | self.storage = storage
22 |
23 |     def name(self) -> str:
24 | return SCRAPER_PERIODIC_TASK_HANDLER_NAME
25 |
26 | async def process(self, task: Task) -> str:
27 | scraper = await self.storage.get_scraper(task.task_name)
28 | handler = self._get_handler(scraper)
29 | return await self.runner.run(handler, scraper)
30 |
31 | def _get_handler(self, scraper: Scraper) -> ScraperHandler:
32 | if scraper.handler not in self.scraper_handlers:
33 | raise UnknownScraperHandlerError(
34 | f"Unknown scraper handler '{scraper.handler}'"
35 | )
36 | return self.scraper_handlers[scraper.handler]
37 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | ## Contributing
2 |
3 | 1. File an issue to notify the maintainers about what you're working on.
4 | 2. Fork the repo, develop and test your code changes, add docs.
5 | 3. Make sure that your commit messages clearly describe the changes.
6 | 4. Send a pull request.
7 |
8 | ## File an Issue
9 |
10 | Use the issue tracker to start the discussion. It is possible that someone
11 | else is already working on your idea, your approach is not quite right, or that
12 | the functionality exists already. The ticket you file in the issue tracker will
13 | be used to hash that all out.
14 |
15 | ## Running tests, building package and docs
16 |
17 | Use the `make` targets: `make test` runs the test suite and `make build` builds
18 | the package. The documentation is built with Sphinx via the Makefile in the
19 | `docs/` directory (e.g. `make html` from within `docs/`).
21 |
22 | ## Make the Pull Request
23 |
24 | Once you have made all your changes, added tests, and updated the documentation, run the tests and build the package:
25 |
26 | ```
27 | make test
28 | make build
29 | ```
30 |
31 | Once everything succeeds, open a pull request to merge your changes into the
32 | repository's main branch.
33 |
34 | Be sure to reference the original issue in the pull request.
35 | Expect some back-and-forth regarding style and compliance with these
36 | rules.
37 |
--------------------------------------------------------------------------------
/docs/index.rst:
--------------------------------------------------------------------------------
1 | #################
2 | Overview
3 | #################
4 |
5 | **Sneakpeek** is a platform to author, schedule and monitor scrapers in an easy, fast and extensible way.
6 | It's a great choice for scrapers that have complex custom scraping logic that needs
7 | to run on a regular basis.
8 |
9 | Key features
10 | ############
11 |
12 | - Horizontally scalable
13 | - Robust scraper scheduler and priority task queue
14 | - Multiple storage implementations to persist scrapers' configs, tasks, logs, etc.
15 | - JSON RPC API to manage the platform programmatically
16 | - Useful UI to manage all of your scrapers
17 | - Scraper IDE that lets you develop scrapers right in your browser
18 | - Easily extendable via middleware
19 |
20 | Demo
21 | ####
22 |
23 | `Here's a demo project <https://github.com/flulemon/sneakpeek-demo>`_ which uses the **Sneakpeek** framework.
24 |
25 | You can also run the demo using Docker:
26 |
27 | .. code-block:: bash
28 |
29 | docker run -it --rm -p 8080:8080 flulemon/sneakpeek-demo
30 |
31 |
32 | Once it has started head over to http://localhost:8080 to play around with it.
33 |
34 | Table of contents
35 | ==================
36 |
37 | .. toctree::
38 | :maxdepth: 2
39 |
40 | self
41 | quick_start
42 | local_debugging
43 | design
44 | deployment
45 | middleware/index
46 | api
47 |
48 | Indices
49 | ==================
50 | * :ref:`genindex`
51 | * :ref:`modindex`
52 | * :ref:`search`
--------------------------------------------------------------------------------
/docs/api.rst:
--------------------------------------------------------------------------------
1 | #################
2 | API
3 | #################
4 |
5 | .. automodule:: sneakpeek.server
6 | .. automodule:: sneakpeek.queue.model
7 | .. automodule:: sneakpeek.scheduler.model
8 | .. automodule:: sneakpeek.scraper.model
9 | .. automodule:: sneakpeek.queue.queue
10 | .. automodule:: sneakpeek.queue.consumer
11 | .. automodule:: sneakpeek.queue.in_memory_storage
12 | .. automodule:: sneakpeek.queue.redis_storage
13 | .. automodule:: sneakpeek.queue.tasks
14 | .. automodule:: sneakpeek.scheduler.scheduler
15 | .. automodule:: sneakpeek.scheduler.in_memory_lease_storage
16 | .. automodule:: sneakpeek.scheduler.redis_lease_storage
17 | .. automodule:: sneakpeek.scraper.context
18 | .. automodule:: sneakpeek.scraper.runner
19 | .. automodule:: sneakpeek.scraper.task_handler
20 | .. automodule:: sneakpeek.scraper.redis_storage
21 | .. automodule:: sneakpeek.scraper.in_memory_storage
22 | .. automodule:: sneakpeek.scraper.dynamic_scraper_handler
23 | .. automodule:: sneakpeek.middleware.base
24 | .. automodule:: sneakpeek.middleware.parser
25 | .. automodule:: sneakpeek.middleware.proxy_middleware
26 | .. automodule:: sneakpeek.middleware.rate_limiter_middleware
27 | .. automodule:: sneakpeek.middleware.requests_logging_middleware
28 | .. automodule:: sneakpeek.middleware.robots_txt_middleware
29 | .. automodule:: sneakpeek.middleware.user_agent_injecter_middleware
30 | .. automodule:: sneakpeek.api
31 | .. automodule:: sneakpeek.logging
32 | .. automodule:: sneakpeek.metrics
--------------------------------------------------------------------------------
/docs/deployment.rst:
--------------------------------------------------------------------------------
1 | ##################
2 | Deployment options
3 | ##################
4 |
5 | There are multiple options for deploying your scrapers, depending on your requirements:
6 |
7 | =============================
8 | One replica that does it all
9 | =============================
10 |
11 | This is a good option if:
12 |
13 | * you can tolerate some downtime
14 | * you don't need to host thousands of scrapers that can be dynamically changed by users
15 | * you don't care if you lose the information about the scraper jobs
16 |
17 | In this case all you need to do is to:
18 |
19 | * define a list of scrapers in the code (just like in the :doc:`tutorial <quick_start>`)
20 | * use in-memory storage
21 |
22 | ======================
23 | Using external storage
24 | ======================
25 |
26 | If you use external storage (e.g. Redis or an RDBMS) for the jobs queue and lease storage, you'll be able:
27 |
28 | * to scale workers horizontally until the queue, storage or scheduler becomes a bottleneck
29 | * to run secondary replicas of the scheduler, so that when the primary dies there are fallback options
30 |
31 | If you also use the external storage to store scrapers, you'll be able to dynamically
32 | add, delete and update scrapers via the UI or the JsonRPC API.
33 |
34 | Note that each **Sneakpeek** server by default runs the worker, scheduler and API services,
35 | but it's possible to run only one role at a time, so you can scale the
36 | services independently.
37 |
38 |
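39 | For illustration, here is a minimal sketch of how the two lease storage
40 | implementations are constructed (the queue and scraper storages live in the
41 | analogous ``sneakpeek.queue.*`` and ``sneakpeek.scraper.*`` modules):
42 |
43 | .. code-block:: python3
44 |
45 |     from redis.asyncio import Redis
46 |
47 |     from sneakpeek.scheduler.in_memory_lease_storage import InMemoryLeaseStorage
48 |     from sneakpeek.scheduler.redis_lease_storage import RedisLeaseStorage
49 |
50 |     # Single replica that can tolerate losing state: in-memory storage
51 |     lease_storage = InMemoryLeaseStorage()
52 |
53 |     # Horizontally scalable deployment: external (Redis) storage
54 |     lease_storage = RedisLeaseStorage(Redis(host="localhost", port=6379))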
--------------------------------------------------------------------------------
/sneakpeek/middleware/base.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from traceback import format_exc
3 | from typing import Any, Type, TypeVar
4 |
5 | from aiohttp import ClientResponse
6 | from pydantic import BaseModel
7 | from typing_extensions import override
8 |
9 | from sneakpeek.scraper.model import Middleware, MiddlewareConfig, Request
10 |
11 | logger = logging.getLogger(__name__)
12 |
13 | _TBaseModel = TypeVar("_TBaseModel", bound=BaseModel)
14 |
15 |
16 | def parse_config_from_obj(
17 | config: Any | None,
18 | plugin_name: str,
19 | config_type: Type[_TBaseModel],
20 | default_config: _TBaseModel,
21 | ) -> _TBaseModel:
22 | if not config:
23 | return default_config
24 | try:
25 | return config_type.parse_obj(config)
26 | except Exception as e:
27 |         logger.warning(f"Failed to parse config for plugin '{plugin_name}': {e}")
28 | logger.debug(f"Traceback: {format_exc()}")
29 | return default_config
30 |
31 |
32 | class BaseMiddleware(Middleware):
33 |     @property
34 |     def name(self) -> str:
35 |         raise NotImplementedError()
36 |
37 | @override
38 | async def on_request(
39 | self,
40 | request: Request,
41 | config: Any | None,
42 | ) -> Request:
43 | return request
44 |
45 |     @override
46 |     async def on_response(
47 |         self,
48 |         request: Request,
49 |         response: ClientResponse,
50 |         config: MiddlewareConfig | None = None,
51 |     ) -> ClientResponse:
52 |         return response
52 |
--------------------------------------------------------------------------------
/LICENCE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2023, Daniil Iazovskii
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without
5 | modification, are permitted provided that the following conditions are met:
6 |
7 | * Redistributions of source code must retain the above copyright notice, this
8 | list of conditions and the following disclaimer.
9 |
10 | * Redistributions in binary form must reproduce the above copyright notice,
11 | this list of conditions and the following disclaimer in the documentation
12 | and/or other materials provided with the distribution.
13 |
14 | * Neither the name of the copyright holder nor the names of its
15 | contributors may be used to endorse or promote products derived from
16 | this software without specific prior written permission.
17 |
18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/.vscode/launch.json:
--------------------------------------------------------------------------------
1 | {
2 | // Use IntelliSense to learn about possible attributes.
3 | // Hover to view descriptions of existing attributes.
4 | // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
5 | "version": "0.2.1",
6 | "configurations": [
7 | {
8 | "name": "python",
9 | "type": "python",
10 | "request": "launch",
11 | "program": "${file}",
12 | "console": "integratedTerminal",
13 | "justMyCode": true,
14 | "env": {
15 | "PYTHONPATH": "${workspaceFolder}"
16 | }
17 | },
18 | {
19 | "name": "Run all",
20 | "type": "python",
21 | "request": "launch",
22 | "program": "${workspaceFolder}/sneakpeek/app.py",
23 | "args": ["--api", "--scheduler", "--worker"],
24 | "console": "integratedTerminal",
25 | "justMyCode": true,
26 | "env": {
27 | "PYTHONPATH": "${workspaceFolder}"
28 | }
29 | },
30 | {
31 | "name": "Run demo",
32 | "type": "python",
33 | "request": "launch",
34 | "program": "${workspaceFolder}/demo/app.py",
35 | "console": "integratedTerminal",
36 | "justMyCode": true,
37 | "env": {
38 | "PYTHONPATH": "${workspaceFolder}"
39 | }
40 | },
41 | {
42 | "name": "Run demo (local handler)",
43 | "type": "python",
44 | "request": "launch",
45 | "program": "${workspaceFolder}/demo/demo_scraper.py",
46 | "console": "integratedTerminal",
47 | "justMyCode": true,
48 | "env": {
49 | "PYTHONPATH": "${workspaceFolder}"
50 | }
51 | }
52 | ]
53 | }
54 |
--------------------------------------------------------------------------------
/sneakpeek/scraper/ephemeral_scraper_task_handler.py:
--------------------------------------------------------------------------------
1 | from pydantic import BaseModel
2 |
3 | from sneakpeek.queue.model import Task, TaskHandlerABC
4 | from sneakpeek.scraper.model import (
5 | EPHEMERAL_SCRAPER_TASK_HANDLER_NAME,
6 | ScraperConfig,
7 | ScraperHandler,
8 | ScraperRunnerABC,
9 | UnknownScraperHandlerError,
10 | )
11 |
12 |
13 | class EphemeralScraperTask(BaseModel):
14 | scraper_handler: str
15 | scraper_config: ScraperConfig
16 | scraper_state: str | None = None
17 |
18 |
19 | class EphemeralScraperTaskHandler(TaskHandlerABC):
20 | def __init__(
21 | self,
22 | scraper_handlers: list[ScraperHandler],
23 | runner: ScraperRunnerABC,
24 | ) -> None:
25 | self.scraper_handlers = {handler.name: handler for handler in scraper_handlers}
26 | self.runner = runner
27 |
28 |     def name(self) -> str:
29 | return EPHEMERAL_SCRAPER_TASK_HANDLER_NAME
30 |
31 | async def process(self, task: Task) -> str:
32 | config = EphemeralScraperTask.parse_raw(task.payload)
33 | handler = self._get_handler(config.scraper_handler)
34 | return await self.runner.run_ephemeral(
35 | handler,
36 | config.scraper_config,
37 | config.scraper_state,
38 | )
39 |
40 | def _get_handler(self, scraper_handler: str) -> ScraperHandler:
41 | if scraper_handler not in self.scraper_handlers:
42 | raise UnknownScraperHandlerError(
43 | f"Unknown scraper handler '{scraper_handler}'"
44 | )
45 | return self.scraper_handlers[scraper_handler]
46 |
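47 | # Illustration (not part of the original module): the payload for this handler
48 | # is a serialized EphemeralScraperTask, e.g.
49 | #
50 | #   payload = EphemeralScraperTask(
51 | #       scraper_handler="dynamic_scraper",
52 | #       scraper_config=ScraperConfig(params={"source_code": "..."}),
53 | #   ).json()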
--------------------------------------------------------------------------------
/docs/local_debugging.rst:
--------------------------------------------------------------------------------
1 | ################################
2 | Local handler debugging
3 | ################################
4 |
5 | You can easily test a handler without running a full-featured server. Here's how you can do that for the ``DemoScraper`` that we developed in the :doc:`tutorial <quick_start>`.
6 |
7 | Add this import at the beginning of the file:
8 |
9 | .. code-block:: python3
10 |
11 | from sneakpeek.scraper.runner import ScraperRunner
12 |
13 |
14 | And add the following lines to the end of the file:
15 |
16 |
17 | .. code-block:: python3
18 |
19 |
20 | async def main():
21 | result = await ScraperRunner.debug_handler(
22 | DemoScraper(),
23 | config=ScraperConfig(
24 | params=DemoScraperParams(
25 | start_url="https://www.ycombinator.com/",
26 | max_pages=20,
27 | ).dict(),
28 | ),
29 | middlewares=[
30 | RequestsLoggingMiddleware(),
31 | ],
32 | )
33 | logging.info(f"Finished scraper with result: {result}")
34 |
35 | if __name__ == "__main__":
36 | asyncio.run(main())
37 |
38 |
39 | ``ScraperRunner.debug_handler`` takes the following arguments:
40 |
41 | 1. An instance of your scraper handler
42 | 2. Scraper config
43 | 3. **[Optional]** Middleware to be used by the handler (:doc:`see the full list of middleware here <middleware/index>`)
44 |
45 | Now you can run your handler as an ordinary Python script. Given it's in the ``demo_scraper.py`` file, you can use:
46 |
47 | .. code-block:: bash
48 |
49 | python3 demo_scraper.py
50 |
--------------------------------------------------------------------------------
/docs/middleware/requests_logging_middleware.rst:
--------------------------------------------------------------------------------
1 | ##############################
2 | Requests logging middleware
3 | ##############################
4 |
5 | Requests logging middleware logs all requests being made and received responses.
6 |
7 | Configuration of the middleware is defined in :py:class:`RequestsLoggingMiddlewareConfig `.
8 |
9 | How to configure middleware for the :py:class:`SneakpeekServer ` (will be used globally for all requests):
10 |
11 | .. code-block:: python3
12 |
13 | from sneakpeek.middleware.requests_logging_middleware import RequestsLoggingMiddleware, RequestsLoggingMiddlewareConfig
14 |
15 | server = SneakpeekServer.create(
16 | ...
17 | middleware=[
18 | RequestsLoggingMiddleware(
19 | RequestsLoggingMiddlewareConfig(
20 | log_request=True,
21 | log_response=True,
22 | )
23 | )
24 | ],
25 | )
26 |
27 |
28 | How to override middleware settings for a given scraper:
29 |
30 | .. code-block:: python3
31 |
32 | from sneakpeek.middleware.requests_logging_middleware import RequestsLoggingMiddlewareConfig
33 |
34 | scraper = Scraper(
35 | ...
36 | config=ScraperConfig(
37 | ...
38 | middleware={
39 | "requests_logging": RequestsLoggingMiddlewareConfig(
40 | log_request=True,
41 | log_response=False,
42 | )
43 | }
44 | ),
45 | )
46 |
--------------------------------------------------------------------------------
/docs/middleware/robots_txt_middleware.rst:
--------------------------------------------------------------------------------
1 | #########################
2 | Robots.txt
3 | #########################
4 |
5 | Robots.txt middleware can log and optionally block requests that are disallowed by the website's robots.txt.
6 | If robots.txt is unavailable (e.g. the request returns a 5xx code), all requests are allowed.
7 |
8 | Configuration of the middleware is defined in :py:class:`RobotsTxtMiddlewareConfig `.
9 |
10 | How to configure middleware for the :py:class:`SneakpeekServer ` (will be used globally for all requests):
11 |
12 | .. code-block:: python3
13 |
14 | from sneakpeek.middleware.robots_txt_middleware import RobotsTxtMiddleware, RobotsTxtMiddlewareConfig, RobotsTxtViolationStrategy
15 |
16 | server = SneakpeekServer.create(
17 | ...
18 | middleware=[
19 |         RobotsTxtMiddleware(
20 |             RobotsTxtMiddlewareConfig(
21 | violation_strategy = RobotsTxtViolationStrategy.THROW,
22 | )
23 | )
24 | ],
25 | )
26 |
27 |
28 | How to override middleware settings for a given scraper:
29 |
30 | .. code-block:: python3
31 |
32 | from sneakpeek.middleware.robots_txt_middleware import RobotsTxtMiddlewareConfig, RobotsTxtViolationStrategy
34 |
35 | scraper = Scraper(
36 | ...
37 | config=ScraperConfig(
38 | ...
39 | middleware={
40 | "robots_txt": ProxyMiddlewareConfig(
41 | violation_strategy = RobotsTxtViolationStrategy.LOG,
42 | )
43 | }
44 | ),
45 | )
46 |
--------------------------------------------------------------------------------
/docs/middleware/proxy_middleware.rst:
--------------------------------------------------------------------------------
1 | #########################
2 | Proxy middleware
3 | #########################
4 |
5 | Proxy middleware automatically sets proxy arguments for all HTTP requests.
6 | Configuration of the middleware is defined in :py:class:`ProxyMiddlewareConfig `.
7 |
8 | How to configure middleware for the :py:class:`SneakpeekServer ` (will be used globally for all requests):
9 |
10 | .. code-block:: python3
11 |
12 | from aiohttp import BasicAuth
13 | from sneakpeek.middleware.proxy_middleware import ProxyMiddleware, ProxyMiddlewareConfig
14 |
15 | server = SneakpeekServer.create(
16 | ...
17 | middleware=[
18 | ProxyMiddleware(
19 | ProxyMiddlewareConfig(
20 | proxy = "http://example.proxy.com:3128",
21 | proxy_auth = BasicAuth(login="mylogin", password="securepassword"),
22 | )
23 | )
24 | ],
25 | )
26 |
27 |
28 | How to override middleware settings for a given scraper:
29 |
30 | .. code-block:: python3
31 |
32 | from aiohttp import BasicAuth
33 | from sneakpeek.middleware.proxy_middleware import ProxyMiddlewareConfig
34 |
35 | scraper = Scraper(
36 | ...
37 | config=ScraperConfig(
38 | ...
39 | middleware={
40 | "proxy": ProxyMiddlewareConfig(
41 | proxy = "http://example.proxy.com:3128",
42 | proxy_auth = BasicAuth(login="mylogin", password="securepassword"),
43 | )
44 | }
45 | ),
46 | )
47 |
--------------------------------------------------------------------------------
/docs/middleware/user_agent_injecter_middleware.rst:
--------------------------------------------------------------------------------
1 | #########################
2 | User Agent injector
3 | #########################
4 |
5 | This middleware automatically adds ``User-Agent`` header if it's not present.
6 | It uses the ``fake-useragent`` library to generate realistic User-Agent values.
7 |
8 | Configuration of the middleware is defined in :py:class:`UserAgentInjecterMiddlewareConfig `.
9 |
10 | How to configure middleware for the :py:class:`SneakpeekServer ` (will be used globally for all requests):
11 |
12 | .. code-block:: python3
13 |
14 | from sneakpeek.middleware.user_agent_injecter_middleware import UserAgentInjecterMiddleware, UserAgentInjecterMiddlewareConfig
15 |
16 | server = SneakpeekServer.create(
17 | ...
18 | middleware=[
19 | UserAgentInjecterMiddleware(
20 | UserAgentInjecterMiddlewareConfig(
21 | use_external_data = True,
22 | browsers = ["chrome", "firefox"],
23 | )
24 | )
25 | ],
26 | )
27 |
28 |
29 | How to override middleware settings for a given scraper:
30 |
31 | .. code-block:: python3
32 |
33 | from sneakpeek.middleware.user_agent_injecter_middleware import UserAgentInjecterMiddlewareConfig
34 |
35 | scraper = Scraper(
36 | ...
37 | config=ScraperConfig(
38 | ...
39 | middleware={
40 | "user_agent_injecter": UserAgentInjecterMiddlewareConfig(
41 | use_external_data = False,
42 | browsers = ["chrome", "firefox"],
43 | )
44 | }
45 | ),
46 | )
47 |
--------------------------------------------------------------------------------
/front/quasar.config.js:
--------------------------------------------------------------------------------
1 | const { configure } = require('quasar/wrappers');
2 | const MonacoWebpackPlugin = require('monaco-editor-webpack-plugin');
3 |
4 | module.exports = configure(function (ctx) {
5 | return {
6 | eslint: {
7 | warnings: true,
8 | errors: true
9 | },
10 | boot: [
11 | ],
12 | css: [
13 | 'app.scss'
14 | ],
15 | extras: [
16 | 'fontawesome-v6',
17 | 'roboto-font',
18 | 'material-icons',
19 | ],
20 | build: {
21 | target: {
22 | browser: [ 'es2019', 'edge88', 'firefox78', 'chrome87', 'safari13.1' ],
23 | node: 'node16'
24 | },
25 | distDir: '../sneakpeek/static/ui/',
26 | vueRouterMode: 'hash',
27 | env: {
28 | JSONRPC_ENDPOINT: ctx.dev ? 'http://localhost:8080/api/v1/jsonrpc' : '/api/v1/jsonrpc',
29 | },
30 | chainWebpack: config => {
31 | config.plugin('monaco-editor').use(MonacoWebpackPlugin, [
32 | {
33 | languages: ['python', 'javascript', 'html', 'xml']
34 | }
35 | ])
36 | }
37 | },
38 | devServer: {
39 | open: true
40 | },
41 | framework: {
42 | config: {
43 | dark: "auto",
44 | notify: {
45 | position: "bottom"
46 | }
47 | },
48 | plugins: [
49 | "Notify",
50 | "SessionStorage",
51 | ]
52 | },
53 | ssr: {
54 | pwa: false,
55 | prodPort: 3000,
56 | middlewares: [
57 | 'render'
58 | ]
59 | },
60 | pwa: {
61 | workboxMode: 'generateSW',
62 | injectPwaMetaTags: true,
63 | swFilename: 'sw.js',
64 | manifestFilename: 'manifest.json',
65 | useCredentialsForManifestTag: false,
66 | },
67 | capacitor: {
68 | hideSplashscreen: true
69 | },
70 | }
71 | });
72 |
--------------------------------------------------------------------------------
/sneakpeek/middleware/proxy_middleware.py:
--------------------------------------------------------------------------------
1 | from typing import Any
2 |
3 | from aiohttp import BasicAuth
5 | from pydantic import BaseModel
6 | from typing_extensions import override
7 | from yarl import URL
8 |
9 | from sneakpeek.middleware.base import BaseMiddleware, parse_config_from_obj
10 | from sneakpeek.scraper.model import Request
11 |
12 |
13 | class ProxyMiddlewareConfig(BaseModel):
14 | """Proxy middleware config"""
15 |
16 | proxy: str | URL | None = None #: Proxy URL
17 | proxy_auth: BasicAuth | None = None #: Proxy authentication info to use
18 |
19 | class Config:
20 | arbitrary_types_allowed = True
21 |
22 |
23 | class ProxyMiddleware(BaseMiddleware):
24 | """Proxy middleware automatically sets proxy arguments for all HTTP requests."""
25 |
26 | def __init__(self, default_config: ProxyMiddlewareConfig | None = None) -> None:
27 | self._default_config = default_config or ProxyMiddlewareConfig()
32 |
33 | @property
34 | def name(self) -> str:
35 | return "proxy"
36 |
37 | @override
38 | async def on_request(
39 | self,
40 | request: Request,
41 | config: Any | None,
42 | ) -> Request:
43 | config = parse_config_from_obj(
44 | config,
45 | self.name,
46 | ProxyMiddlewareConfig,
47 | self._default_config,
48 | )
49 | if not request.kwargs:
50 | request.kwargs = {}
51 | if config.proxy:
52 | request.kwargs["proxy"] = config.proxy
53 | if config.proxy_auth:
54 | request.kwargs["proxy_auth"] = config.proxy_auth
55 | return request
56 |
--------------------------------------------------------------------------------
/.github/workflows/ci.yml:
--------------------------------------------------------------------------------
1 | name: CI
2 |
3 | on: push
4 |
5 | jobs:
6 | ci:
7 |     name: Build and publish Python package to PyPI
8 | runs-on: "ubuntu-latest"
9 | permissions:
10 | id-token: write
11 | strategy:
12 | fail-fast: false
13 | matrix:
14 | python-version: ["3.10"]
15 | poetry-version: ["1.4.2"]
16 | node-version: ["18.16.0"]
17 | steps:
18 | - uses: actions/checkout@v3
19 |
20 | - name: Set up Python
21 | uses: actions/setup-python@v4
22 | with:
23 | python-version: ${{ matrix.python-version }}
24 |
25 | - name: Set up Poetry
26 | uses: abatilo/actions-poetry@v2
27 | with:
28 | poetry-version: ${{ matrix.poetry-version }}
29 |
30 | - name: Set Node.js
31 | uses: actions/setup-node@v3
32 | with:
33 | node-version: ${{ matrix.node-version }}
34 |
35 | - name: Run install
36 | run: make install
37 |
38 | - name: Run tests
39 | run: make test
40 |
41 | - name: Tests coverage
42 | run: make coverage
43 |
44 | - name: Upload coverage reports to Codecov
45 | uses: codecov/codecov-action@v3
46 | with:
47 | token: ${{ secrets.CODECOV_TOKEN }}
48 | files: ./coverage.xml
49 | verbose: true
50 |
51 | - name: Build package
52 | run: make build
53 |
54 | - name: Publish package to Test PyPI
55 | if: startsWith(github.ref, 'refs/tags')
56 | uses: pypa/gh-action-pypi-publish@release/v1
57 | with:
58 | repository-url: https://test.pypi.org/legacy/
59 | skip-existing: true
60 |
61 | - name: Publish package to PyPI
62 | if: startsWith(github.ref, 'refs/tags')
63 | uses: pypa/gh-action-pypi-publish@release/v1
64 | with:
65 | password: ${{ secrets.PYPI_API_TOKEN }}
66 |
--------------------------------------------------------------------------------
/sneakpeek/scheduler/redis_lease_storage.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime, timedelta
2 |
3 | from redis.asyncio import Redis
4 |
5 | from sneakpeek.metrics import count_invocations, measure_latency
6 | from sneakpeek.scheduler.model import Lease, LeaseStorageABC
7 |
8 |
9 | class RedisLeaseStorage(LeaseStorageABC):
10 | """Redis storage for leases. Should only be used for development purposes"""
11 |
12 | def __init__(self, redis: Redis) -> None:
13 | """
14 | Args:
15 | redis (Redis): Async redis client
16 | """
17 | self._redis = redis
18 |
19 | @count_invocations(subsystem="storage")
20 | @measure_latency(subsystem="storage")
21 | async def maybe_acquire_lease(
22 | self,
23 | lease_name: str,
24 | owner_id: str,
25 | acquire_for: timedelta,
26 | ) -> Lease | None:
27 | lease_key = f"lease:{lease_name}"
28 | existing_lease = await self._redis.get(lease_key)
29 | result = None
30 | if not existing_lease or existing_lease.decode() == owner_id:
31 | result = await self._redis.set(
32 | f"lease:{lease_name}",
33 | owner_id,
34 | ex=acquire_for,
35 | )
36 | return (
37 | Lease(
38 | name=lease_name,
39 | owner_id=owner_id,
40 | acquired=datetime.utcnow(),
41 | acquired_until=datetime.utcnow() + acquire_for,
42 | )
43 | if result
44 | else None
45 | )
46 |
47 | @count_invocations(subsystem="storage")
48 | @measure_latency(subsystem="storage")
49 | async def release_lease(self, lease_name: str, owner_id: str) -> None:
50 | lease_owner = await self._redis.get(f"lease:{lease_name}")
51 |         if lease_owner and lease_owner.decode() == owner_id:
52 | await self._redis.delete(f"lease:{lease_name}")
53 |
--------------------------------------------------------------------------------
/docs/conf.py:
--------------------------------------------------------------------------------
1 | # Configuration file for the Sphinx documentation builder.
2 | #
3 | # This file only contains a selection of the most common options. For a full
4 | # list see the documentation:
5 | # https://www.sphinx-doc.org/en/master/usage/configuration.html
6 |
7 | # -- Path setup --------------------------------------------------------------
8 |
9 | # If extensions (or modules to document with autodoc) are in another directory,
10 | # add these directories to sys.path here. If the directory is relative to the
11 | # documentation root, use os.path.abspath to make it absolute, like shown here.
12 | #
13 | import os
14 | import sys
15 |
16 | print(os.path.abspath(".."))
17 | sys.path.insert(0, os.path.abspath(".."))
18 |
19 |
20 | # -- Project information -----------------------------------------------------
21 |
22 | project = "Sneakpeek"
23 | copyright = "2023, Dan Yazovsky"
24 | author = "Dan Yazovsky"
25 | version = "0.2"
26 | release = "0.2.2"
27 | extensions = ["sphinx.ext.autodoc", "sphinx.ext.coverage", "sphinx.ext.napoleon"]
28 | templates_path = ["_templates"]
29 | language = "en"
30 | exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
31 | html_static_path = ["_static"]
32 | autoclass_content = "both"
33 | html_theme = "sphinx_rtd_theme"
34 | html_theme_options = {
35 | "analytics_id": "G-3EW8JNTBHC",
36 | "logo_only": False,
37 | "display_version": True,
38 | "prev_next_buttons_location": "bottom",
39 | "style_external_links": False,
40 | "vcs_pageview_mode": "display_github",
41 | "collapse_navigation": False,
42 | "sticky_navigation": True,
43 | "navigation_depth": 4,
44 | "includehidden": True,
45 | "titles_only": True,
46 | }
47 | github_url = "https://github.com/flulemon/sneakpeek"
48 | highlight_language = "python3"
49 | pygments_style = "sphinx"
50 |
51 | autodoc_default_options = {
52 | "members": True,
53 | "show-inheritance": True,
54 | }
55 | autodoc_typehints = "both"
56 |
--------------------------------------------------------------------------------
/sneakpeek/queue/tasks.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 | from sneakpeek.queue.model import QueueABC, Task, TaskHandlerABC, TaskPriority
4 | from sneakpeek.scheduler.model import (
5 | PeriodicTask,
6 | StaticPeriodicTasksStorage,
7 | TaskSchedule,
8 | generate_id,
9 | )
10 |
11 | KILL_DEAD_TASKS_TASK_NAME = "internal::queue::kill_dead_tasks"
12 | DELETE_OLD_TASKS_TASK_NAME = "internal::queue::delete_old_tasks"
13 |
14 |
15 | class KillDeadTasksHandler(TaskHandlerABC):
16 | def __init__(self, queue: QueueABC) -> None:
17 | self.queue = queue
18 |
19 |     def name(self) -> str:
20 | return KILL_DEAD_TASKS_TASK_NAME
21 |
22 | async def process(self, task: Task) -> str:
23 | killed = await self.queue.kill_dead_tasks()
24 | return json.dumps(
25 | {
26 | "success": True,
27 | "killed": [item.id for item in killed],
28 | },
29 | indent=4,
30 | )
31 |
32 |
33 | class DeleteOldTasksHandler(TaskHandlerABC):
34 | def __init__(self, queue: QueueABC) -> None:
35 | self.queue = queue
36 |
37 |     def name(self) -> str:
38 | return DELETE_OLD_TASKS_TASK_NAME
39 |
40 | async def process(self, task: Task) -> str:
41 | await self.queue.delete_old_tasks()
42 | return json.dumps({"success": True}, indent=4)
43 |
44 |
45 | queue_periodic_tasks = StaticPeriodicTasksStorage(
46 | tasks=[
47 | PeriodicTask(
48 | id=generate_id(),
49 | name=KILL_DEAD_TASKS_TASK_NAME,
50 | handler=KILL_DEAD_TASKS_TASK_NAME,
51 | priority=TaskPriority.NORMAL,
52 | payload="",
53 | schedule=TaskSchedule.EVERY_HOUR,
54 | ),
55 | PeriodicTask(
56 | id=generate_id(),
57 | name=DELETE_OLD_TASKS_TASK_NAME,
58 | handler=DELETE_OLD_TASKS_TASK_NAME,
59 | priority=TaskPriority.NORMAL,
60 | payload="",
61 | schedule=TaskSchedule.EVERY_HOUR,
62 | ),
63 | ]
64 | )
65 |
--------------------------------------------------------------------------------
/sneakpeek/scheduler/in_memory_lease_storage.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from asyncio import Lock
3 | from datetime import datetime, timedelta
4 |
5 | from sneakpeek.metrics import count_invocations, measure_latency
6 | from sneakpeek.scheduler.model import Lease, LeaseStorageABC
7 |
8 |
9 | class InMemoryLeaseStorage(LeaseStorageABC):
10 | """In memory storage for leases. Should only be used for development purposes"""
11 |
12 | def __init__(self) -> None:
13 | self._logger = logging.getLogger(__name__)
14 | self._lock = Lock()
15 | self._leases: dict[str, Lease] = {}
16 |
17 | def _can_acquire_lease(self, lease_name: str, owner_id: str) -> bool:
18 | existing_lease = self._leases.get(lease_name)
19 | return (
20 | not existing_lease
21 | or existing_lease.acquired_until < datetime.utcnow()
22 | or existing_lease.owner_id == owner_id
23 | )
24 |
25 | @count_invocations(subsystem="storage")
26 | @measure_latency(subsystem="storage")
27 | async def maybe_acquire_lease(
28 | self,
29 | lease_name: str,
30 | owner_id: str,
31 | acquire_for: timedelta,
32 | ) -> Lease | None:
33 | async with self._lock:
34 | if self._can_acquire_lease(lease_name, owner_id):
35 | self._leases[lease_name] = Lease(
36 | name=lease_name,
37 | owner_id=owner_id,
38 | acquired=datetime.utcnow(),
39 | acquired_until=datetime.utcnow() + acquire_for,
40 | )
41 | return self._leases[lease_name]
42 | return None
43 |
44 | @count_invocations(subsystem="storage")
45 | @measure_latency(subsystem="storage")
46 | async def release_lease(self, lease_name: str, owner_id: str) -> None:
47 | async with self._lock:
48 | if lease_name not in self._leases:
49 | return
50 | if self._can_acquire_lease(lease_name, owner_id):
51 | del self._leases[lease_name]
52 |
--------------------------------------------------------------------------------
/sneakpeek/middleware/user_agent_injecter_middleware.py:
--------------------------------------------------------------------------------
1 | from typing import Any
2 |
3 | from fake_useragent import UserAgent
4 | from pydantic import BaseModel
5 | from typing_extensions import override
6 |
7 | from sneakpeek.middleware.base import BaseMiddleware, parse_config_from_obj
8 | from sneakpeek.scraper.model import Request
9 |
10 |
11 | class UserAgentInjecterMiddlewareConfig(BaseModel):
12 | """Middleware configuration"""
13 |
14 | #: Whether to use external data as a fallback
15 | use_external_data: bool = True
16 |
17 | #: List of browsers which are used to generate user agents
18 | browsers: list[str] = ["chrome", "edge", "firefox", "safari", "opera"]
19 |
20 |
21 | class UserAgentInjecterMiddleware(BaseMiddleware):
22 | """
23 | This middleware automatically adds ``User-Agent`` header if it's not present.
24 |     It uses the ``fake-useragent`` library to generate realistic user agents.
25 | """
26 |
27 | def __init__(
28 | self, default_config: UserAgentInjecterMiddlewareConfig | None = None
29 | ) -> None:
30 | self._default_config = default_config or UserAgentInjecterMiddlewareConfig()
31 | self._user_agents = UserAgent(
32 | use_external_data=self._default_config.use_external_data,
33 | browsers=self._default_config.browsers,
34 | )
35 |
36 | @property
37 | def name(self) -> str:
38 | return "user_agent_injecter"
39 |
40 | @override
41 | async def on_request(
42 | self,
43 | request: Request,
44 | config: Any | None,
45 | ) -> Request:
46 | config = parse_config_from_obj(
47 | config,
48 | self.name,
49 | UserAgentInjecterMiddlewareConfig,
50 | self._default_config,
51 | )
52 | if (request.headers or {}).get("User-Agent"):
53 | return request
54 | if not request.headers:
55 | request.headers = {}
56 | request.headers["User-Agent"] = self._user_agents.random
57 | return request
58 |
--------------------------------------------------------------------------------
/sneakpeek/scraper/dynamic_scraper_handler.py:
--------------------------------------------------------------------------------
1 | import inspect
2 | import json
3 | from typing import Any, Awaitable, Callable, Mapping
4 |
5 | from pydantic import BaseModel
6 | from typing_extensions import override
7 |
8 | from sneakpeek.scraper.model import ScraperContextABC, ScraperHandler
9 |
10 |
11 | class DynamicScraperParams(BaseModel):
12 | source_code: str
13 | args: list[Any] | None = None
14 | kwargs: Mapping[str, Any] | None = None
15 |
16 |
17 | class DynamicScraperHandler(ScraperHandler):
18 | @property
19 | def name(self) -> str:
20 | return "dynamic_scraper"
21 |
22 | def compile(self, source_code: str) -> Callable[..., Awaitable[None]]:
23 |         bytecode = compile(source=source_code, filename="<string>", mode="exec")
24 | session_globals = {}
25 | exec(bytecode, session_globals)
26 | if "context" in session_globals:
27 | raise SyntaxError("`context` is a reserved keyword")
28 | if "handler" not in session_globals:
29 | raise SyntaxError("Expected source code to define a `handler` function")
30 | handler = session_globals["handler"]
31 | if not inspect.iscoroutinefunction(handler):
32 |             raise SyntaxError("Expected `handler` to be an async function")
33 | if handler.__code__.co_argcount == 0:
34 | raise SyntaxError(
35 | "Expected `handler` to have at least one argument: `context: ScraperContext`"
36 | )
37 | return handler
38 |
39 | @override
40 | async def run(self, context: ScraperContextABC) -> str:
41 | params = DynamicScraperParams.parse_obj(context.params)
42 | handler = self.compile(params.source_code)
43 | result = await handler(context, *(params.args or []), **(params.kwargs or {}))
44 | if result is None:
45 | return "No result was returned"
46 | if isinstance(result, str):
47 | return result
48 | try:
49 | return json.dumps(result, indent=4)
50 | except TypeError as ex:
51 | return f"Failed to serialize result with error: {ex}"
52 |
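53 | # Illustration (not part of the original module) of source code this handler
54 | # accepts: it must define an async `handler` taking at least a `context` argument.
55 | EXAMPLE_SOURCE = '''
56 | async def handler(context):
57 |     return "done"
58 | '''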
--------------------------------------------------------------------------------
/front/src/components/ScraperCard.vue:
--------------------------------------------------------------------------------
1 | [component markup stripped; visible fragments: "{{ scraper.name }}", "Failed to load scraper. Try to refresh.", "{{ error }}"]
--------------------------------------------------------------------------------
/sneakpeek/session_loggers/redis_logger.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from asyncio import AbstractEventLoop, get_event_loop
3 | from copy import copy
4 | from dataclasses import dataclass
5 | from datetime import datetime, timedelta
6 | from typing import Any
7 |
8 | from redis.asyncio import Redis
9 |
10 | from sneakpeek.session_loggers.base import SessionLogger, get_fields_to_log
11 |
12 | MAX_BUFFER_AGE = timedelta(seconds=5)
13 |
14 |
15 | @dataclass
16 | class _LogRecord:
17 | task_id: str
18 | data: Any
19 |
20 |
21 | class RedisLoggerHandler(SessionLogger):
22 | def __init__(
23 | self,
24 | redis: Redis,
25 | loop: AbstractEventLoop | None = None,
26 | max_buffer_size: int = 10,
27 | max_buffer_age: timedelta = MAX_BUFFER_AGE,
28 | ) -> None:
29 | super().__init__()
30 | self.redis = redis
31 | self.loop = loop
32 | self.max_buffer_size = max_buffer_size
33 | self.max_buffer_age = max_buffer_age
34 | self.buffer: list[_LogRecord] = []
35 | self.last_flush = datetime.min
36 |
37 | async def _write_to_log(self, messages: list[_LogRecord]) -> None:
38 | for message in messages:
39 | await self.redis.xadd(name=message.task_id, fields=message.data)
40 |
41 | def flush(self):
42 | """
43 |         Flush buffered log records to Redis once the buffer is full or older than the max buffer age.
44 | """
45 | if not self.buffer:
46 | return
47 | if (
48 | len(self.buffer) < self.max_buffer_size
49 | and datetime.utcnow() - self.last_flush < self.max_buffer_age
50 | ):
51 | return
52 |         self.acquire()
53 |         try:
54 |             loop = self.loop or get_event_loop()
55 |             loop.create_task(self._write_to_log(copy(self.buffer)))
56 |             self.last_flush = datetime.utcnow()
57 |         finally:
58 |             self.buffer.clear()
59 |             self.release()
60 |
61 |     def emit(self, record: logging.LogRecord) -> None:
62 |         if not getattr(record, "task_id", None):
63 |             return
64 |
65 |         self.buffer.append(
66 |             _LogRecord(
67 |                 task_id=record.task_id,
68 |                 data=get_fields_to_log(record),
69 |             )
70 |         )
71 |         self.flush()
72 |
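73 | # Usage sketch (not executed): attach the handler to a logger and tag records
74 | # with a `task_id` so they are routed to a per-task Redis stream; records
75 | # without a `task_id` are ignored by `emit`. `Redis.from_url` is standard
76 | # redis-py; everything else is defined in this module.
77 | #
78 | # import logging
79 | # from redis.asyncio import Redis
80 | #
81 | # logger = logging.getLogger("sneakpeek.session")
82 | # logger.addHandler(RedisLoggerHandler(Redis.from_url("redis://localhost:6379")))
83 | # logger.info("scraping started", extra={"task_id": "task-42"})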
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Sneakpeek
2 |
3 | 
4 | [](https://badge.fury.io/py/sneakpeek-py)
5 | [](https://pepy.tech/project/sneakpeek-py)
6 | [](https://sneakpeek-py.readthedocs.io/en/latest/?badge=latest)
7 | [](https://codecov.io/gh/flulemon/sneakpeek)
8 |
9 | **Sneakpeek** is a platform to author, schedule and monitor scrapers in an easy, fast and extensible way.
10 | It's well suited for scrapers with complex, site-specific scraping logic that needs
11 | to run on a regular basis.
12 |
13 | ## Key features
14 |
15 | - Horizontally scalable
16 | - Robust scraper scheduler and priority task queue
17 | - Multiple storage implementations to persist scrapers' configs, tasks, logs, etc.
18 | - JSON-RPC API to manage the platform programmatically
19 | - Useful UI to manage all of your scrapers
20 | - Scraper IDE that lets you develop scrapers right in your browser
21 | - Easily extendable via middleware
22 |
23 | ## Demo
24 |
25 | [Here's a demo project](https://github.com/flulemon/sneakpeek-demo) which uses **Sneakpeek** framework.
26 |
27 | You can also run the demo using Docker:
28 |
29 | ```bash
30 | docker run -it --rm -p 8080:8080 flulemon/sneakpeek-demo
31 | ```
32 |
33 | Once it has started, head over to http://localhost:8080 to play around with it.
34 |
35 | ## Documentation
36 |
37 | For the full documentation, please visit [sneakpeek-py.readthedocs.io](https://sneakpeek-py.readthedocs.io/en/latest/).
38 |
39 | ## Contributing
40 |
41 | Please take a look at our [contributing](https://github.com/flulemon/sneakpeek/blob/main/CONTRIBUTING.md) guidelines if you're interested in helping!
42 |
43 | ## Future plans
44 |
45 | - Headful and headless browser engine middleware (Selenium and Playwright)
46 | - SQL and AmazonDB storage implementations
47 | - Advanced monitoring of scraper health
48 |
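49 | ## Example: a minimal scraper handler
50 |
51 | Scrapers are implemented as handlers that expose a `name` and an async `run`
52 | method (see `sneakpeek/scraper/dynamic_scraper_handler.py` for a complete
53 | implementation). The sketch below only shows the shape of a handler; how it is
54 | registered and scheduled is covered in the documentation.
55 |
56 | ```python
57 | from sneakpeek.scraper.model import ScraperContextABC, ScraperHandler
58 |
59 |
60 | class MyScraperHandler(ScraperHandler):
61 |     @property
62 |     def name(self) -> str:
63 |         return "my_scraper"
64 |
65 |     async def run(self, context: ScraperContextABC) -> str:
66 |         # `context.params` carries the parameters configured for the scraper
67 |         return f"Scraped with params: {context.params}"
68 | ```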
--------------------------------------------------------------------------------
/sneakpeek/scheduler/tests/test_lease_storage.py:
--------------------------------------------------------------------------------
1 | import asyncio
2 | from datetime import timedelta
3 |
4 | import pytest
5 | from fakeredis.aioredis import FakeRedis
6 |
7 | from sneakpeek.scheduler.in_memory_lease_storage import InMemoryLeaseStorage
8 | from sneakpeek.scheduler.model import LeaseStorageABC
9 | from sneakpeek.scheduler.redis_lease_storage import RedisLeaseStorage
10 |
11 | NON_EXISTENT_SCRAPER_ID = 10001
12 |
13 |
14 | @pytest.fixture
15 | def in_memory_storage() -> LeaseStorageABC:
16 | return InMemoryLeaseStorage()
17 |
18 |
19 | @pytest.fixture
20 | def redis_storage() -> LeaseStorageABC:
21 | return RedisLeaseStorage(FakeRedis())
22 |
23 |
24 | @pytest.fixture(
25 | params=[
26 | pytest.lazy_fixture(in_memory_storage.__name__),
27 | pytest.lazy_fixture(redis_storage.__name__),
28 | ]
29 | )
30 | def storage(request) -> LeaseStorageABC:
31 | yield request.param
32 |
33 |
34 | @pytest.mark.asyncio
35 | async def test_lease(storage: LeaseStorageABC):
36 | lease_name_1 = "test_lease_1"
37 | lease_name_2 = "test_lease_2"
38 | owner_1 = "owner_id_1"
39 | owner_2 = "owner_id_2"
40 | owner_1_acquire_until = timedelta(seconds=1)
41 | owner_2_acquire_until = timedelta(seconds=5)
42 |
43 | # initial acquire
44 | assert (
45 | await storage.maybe_acquire_lease(lease_name_1, owner_1, owner_1_acquire_until)
46 | is not None
47 | )
48 | # another lease can be acquired
49 | assert (
50 | await storage.maybe_acquire_lease(lease_name_2, owner_2, owner_2_acquire_until)
51 | is not None
52 | )
53 |     # lease is held, so another owner cannot acquire it
54 | assert (
55 | await storage.maybe_acquire_lease(lease_name_1, owner_2, owner_2_acquire_until)
56 | is None
57 | )
58 | # owner can re-acquire
59 | assert (
60 | await storage.maybe_acquire_lease(lease_name_1, owner_1, owner_1_acquire_until)
61 | is not None
62 | )
63 |
64 |     # lease expires and can be acquired by another owner
65 | await asyncio.sleep(1)
66 | assert (
67 | await storage.maybe_acquire_lease(lease_name_1, owner_2, owner_2_acquire_until)
68 | is not None
69 | )
70 |
--------------------------------------------------------------------------------
/docs/middleware/rate_limiter_middleware.rst:
--------------------------------------------------------------------------------
1 | #########################
2 | Rate limiter
3 | #########################
4 |
5 | Rate limiter implements the `leaky bucket algorithm <https://en.wikipedia.org/wiki/Leaky_bucket>`_
6 | to limit the number of requests made to each host. If a request is rate limited, the middleware
7 | can either raise an exception or wait until the request would no longer be limited.
8 |
9 | Configuration of the middleware is defined in :py:class:`RateLimiterMiddlewareConfig <sneakpeek.middleware.rate_limiter_middleware.RateLimiterMiddlewareConfig>`.
10 |
11 | How to configure the middleware for the :py:class:`SneakpeekServer` (it will be used globally for all requests):
12 |
13 | .. code-block:: python3
14 |
15 |     from datetime import timedelta
16 |
17 |     from sneakpeek.middleware.rate_limiter_middleware import (
18 |         RateLimitedStrategy,
19 |         RateLimiterMiddleware,
20 |         RateLimiterMiddlewareConfig,
21 |     )
22 |
23 |     server = SneakpeekServer.create(
24 |         ...
25 |         middleware=[
26 |             RateLimiterMiddleware(
27 |                 RateLimiterMiddlewareConfig(
28 |                     # maximum number of requests in a given time window
29 |                     max_requests=60,
30 |                     # wait until the request is no longer rate limited
31 |                     rate_limited_strategy=RateLimitedStrategy.WAIT,
32 |                     # only 60 requests per host are allowed within 1 minute
33 |                     time_window=timedelta(minutes=1),
34 |                 )
35 |             )
36 |         ],
37 |     )
38 |
39 |
40 | How to override the middleware settings for a given scraper:
41 |
42 | .. code-block:: python3
43 |
44 |     from datetime import timedelta
45 |
46 |     from sneakpeek.middleware.rate_limiter_middleware import (
47 |         RateLimitedStrategy,
48 |         RateLimiterMiddlewareConfig,
49 |     )
50 |
51 |     scraper = Scraper(
52 |         ...
53 |         config=ScraperConfig(
54 |             ...
55 |             middleware={
56 |                 "rate_limiter": RateLimiterMiddlewareConfig(
57 |                     # maximum number of requests in a given time window
58 |                     max_requests=120,
59 |                     # raise RateLimiterException if the request is rate limited
60 |                     rate_limited_strategy=RateLimitedStrategy.THROW,
61 |                     # only 120 requests per host are allowed within 1 minute
62 |                     time_window=timedelta(minutes=1),
63 |                 )
64 |             }
65 |         ),
66 |     )
67 |
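68 | With the global configuration above, each host receives at most 60 requests per
69 | minute (one per second on average); requests over that budget wait until the
70 | bucket drains. When the ``THROW`` strategy is used instead, scraper code should
71 | be prepared to handle the error itself. A minimal sketch -- assuming
72 | ``RateLimiterException`` is importable from the same module and using a
73 | hypothetical ``make_request`` helper:
74 |
75 | .. code-block:: python3
76 |
77 |     from sneakpeek.middleware.rate_limiter_middleware import RateLimiterException
78 |
79 |     try:
80 |         response = await make_request()  # any request that goes through the middleware
81 |     except RateLimiterException:
82 |         # the per-host budget is exhausted: back off, retry later or skip
83 |         ...
84 |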
--------------------------------------------------------------------------------
/front/src/components/TaskLogs.vue:
--------------------------------------------------------------------------------
1 |
2 |
3 |