26 | {{ super() }}
27 | {%- endblock -%}
28 |
--------------------------------------------------------------------------------
/doc/source/admin_ui.rst:
--------------------------------------------------------------------------------
1 | Administration interface
2 | ========================
3 |
4 | To reach the administration user interface, you first need to authenticate by clicking the |user_menu_button| button, then
5 | selecting ``Log in``.
6 |
7 | .. |user_menu_button| image:: ../../tests/robotframework/screenshots/user_menu_button.png
8 | :class: sosse-inline-screenshot
9 |
10 | The default user name and password are both ``admin``. After submitting, the administration interface can be reached
11 | from the |conf_menu_button| menu, by selecting ``Administration``.
12 |
13 | .. |conf_menu_button| image:: ../../tests/robotframework/screenshots/conf_menu_button.png
14 | :class: sosse-inline-screenshot
15 |
16 | .. image:: ../../tests/robotframework/screenshots/admin_ui.png
17 | :class: sosse-screenshot
18 |
--------------------------------------------------------------------------------
/doc/source/administration.rst:
--------------------------------------------------------------------------------
1 | Administration
2 | ==============
3 |
4 | .. toctree::
5 | :maxdepth: 2
6 | :caption: Contents:
7 |
8 | admin_ui.rst
9 | crawl/new_url.rst
10 | crawl/queue.rst
11 | crawl/crawlers.rst
12 | crawl/analytics.rst
13 | crawl/policies.rst
14 | crawl/recursion_depth.rst
15 | crawl/feeds.rst
16 | documents.rst
17 | tags.rst
18 | domain_settings.rst
19 | cookies.rst
20 | webhooks.rst
21 | excluded_urls.rst
22 | search_engines.rst
23 | permissions.rst
24 |
--------------------------------------------------------------------------------
/doc/source/cli.rst:
--------------------------------------------------------------------------------
1 | Command Line Interface
2 | ======================
3 |
4 | SOSSE provides a ``sosse-admin`` command that is based on the
5 | `Django management command `_.
6 | It can be called with ``sosse-admin help`` to list all available commands, and ``sosse-admin help <command>`` to get
7 | help on a specific command. The help for SOSSE-specific commands is also provided below:
8 |
9 | .. include:: cli_generated.rst
10 |
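As a quick illustration of the two help invocations described at the top of this page, a session might look like the sketch below. It assumes ``sosse-admin`` is available in the ``PATH``, and uses ``migrate`` only as an example of a command name:

.. code-block:: shell

   # List every available command
   sosse-admin help

   # Show the help of a single command, here "migrate"
   sosse-admin help migrate
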
--------------------------------------------------------------------------------
/doc/source/config_file.rst:
--------------------------------------------------------------------------------
1 | Configuration file reference
2 | ============================
3 |
4 | SOSSE can be configured through the configuration file ``/etc/sosse/sosse.conf``. Configuration variables are grouped in
5 | 3 sections, depending on which component they affect. Modifying any of these options requires restarting the crawlers or
6 | the web interface.
7 |
8 | .. note::
9 | Configuration options can also be set using environment variables by prefixing with ``SOSSE_``.
10 | For example, the proxy option of the crawler can be set by setting the ``SOSSE_PROXY`` environment variable.
11 | Environment variable options have higher precedence than options from the configuration file.
12 |
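As an illustration of the environment variable mechanism, the proxy could be overridden for a single shell session as sketched below. The proxy URL is a placeholder, and the exact value format expected by the option is an assumption here; the generated reference remains the authority:

.. code-block:: shell

   # Placeholder proxy URL, replace it with a real value
   export SOSSE_PROXY="http://proxy.example.com:8080"

   # Processes started from this shell inherit the variable, which then
   # takes precedence over the proxy option of /etc/sosse/sosse.conf
   echo "Proxy override: $SOSSE_PROXY"

The crawlers or the web interface still need to be restarted from an environment where the variable is set for the change to apply.
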
13 | .. include:: config_file_generated.rst
14 |
--------------------------------------------------------------------------------
/doc/source/cookies.rst:
--------------------------------------------------------------------------------
1 | 🍪 Cookies
2 | ==========
3 |
4 | Cookies stored by the crawlers can be seen from the :doc:`../admin_ui`, by clicking on ``Cookies``.
5 |
6 | .. image:: ../../tests/robotframework/screenshots/cookies_list.png
7 | :class: sosse-screenshot
8 |
9 | You can find which cookie applies to a specific web page by typing its URL in the search bar.
10 |
11 | Cookies import
12 | --------------
13 |
14 | Cookies can be imported using the ``Import cookies`` link.
15 |
16 | .. image:: ../../tests/robotframework/screenshots/cookies_import.png
17 | :class: sosse-screenshot
18 |
19 | They should be entered using the `Netscape cookie format `_.
20 | Tools like `Cookie editor `_ can be used to export cookies from a browser in a compatible manner.
21 |
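For reference, each cookie line in the Netscape format holds seven tab-separated fields: domain, subdomain flag, path, HTTPS-only flag, expiry as a Unix timestamp, name and value. The sketch below writes such a file from a shell; the file name, domain and cookie values are made up for illustration, and whether the import form needs the leading comment line is not confirmed here:

.. code-block:: shell

   # Conventional header line of Netscape cookie files
   printf '# Netscape HTTP Cookie File\n' > cookies.txt

   # Fields: domain, include subdomains, path, HTTPS only, expiry, name, value
   printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
       '.example.com' 'TRUE' '/' 'TRUE' '2147483647' 'session_id' 'abc123' >> cookies.txt

   cat cookies.txt

The resulting content could then be entered in the import form shown above.
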
--------------------------------------------------------------------------------
/doc/source/crawl/analytics.rst:
--------------------------------------------------------------------------------
1 | 📊 Analytics
2 | ============
3 |
4 | .. image:: ../../../tests/robotframework/screenshots/analytics.png
5 | :class: sosse-screenshot
6 |
7 | The analytics page shows global information about indexed pages. It can be reached by clicking ``📊 Analytics`` from
8 | the |conf_menu_button| menu, or in the :doc:`../admin_ui`.
9 |
10 | .. |conf_menu_button| image:: ../../../tests/robotframework/screenshots/conf_menu_button.png
11 | :class: sosse-inline-screenshot
12 |
--------------------------------------------------------------------------------
/doc/source/crawl/crawlers.rst:
--------------------------------------------------------------------------------
1 | 🕷 Crawlers
2 | ===========
3 |
4 | The crawlers page displays real-time information on crawler processes. It can be accessed from
5 | the :doc:`../admin_ui`, by selecting ``🕷 Crawlers``.
6 |
7 | .. image:: ../../../tests/robotframework/screenshots/crawlers.png
8 | :class: sosse-screenshot
9 |
--------------------------------------------------------------------------------
/doc/source/crawl/feeds.rst:
--------------------------------------------------------------------------------
1 | Atom and RSS feeds
2 | ==================
3 |
4 | SOSSE can crawl `Atom `_ and
5 | `RSS `_ feeds. This is useful for crawling websites that are updated often while skipping
6 | already indexed pages. To index a syndication feed, it needs to be :doc:`queued explicitly `.
7 |
8 | .. note::
9 | The SOSSE crawler does not recurse into feeds declared in the ``<link>`` element of webpages. To crawl a feed, the URL of the XML feed must be added to the crawl queue manually.
10 |
11 | Caching for news aggregators
12 | ----------------------------
13 |
14 | By crawling syndication feeds, SOSSE can be used as an offline archive for news aggregator 🐊 software. After the XML
15 | feed is indexed, archived pages from the feed can be registered in the aggregator using the
16 | :ref:`atom feed ` generated by SOSSE. This can be done using the
17 | :doc:`search parameters <../user/search>`:
18 |
19 | - Leave the keyword field empty
20 | - Set a search parameter to ``Keep`` ``Linked by url`` ``Equal to``, and use the URL of the XML feed as the value
21 | - Sort results by ``First crawled descending``
22 |
23 | .. image:: ../../../tests/robotframework/screenshots/syndication_feed.png
24 | :class: sosse-screenshot
25 |
--------------------------------------------------------------------------------
/doc/source/crawl/new_url.rst:
--------------------------------------------------------------------------------
1 | 🌐 Crawl a new URL
2 | ==================
3 |
4 | From the |conf_menu_button| menu or the :doc:`../admin_ui`, click ``🌐 Crawl a new URL`` to queue one or more URLs.
5 | They are crawled as soon as a worker is available.
6 |
7 | .. |conf_menu_button| image:: ../../../tests/robotframework/screenshots/conf_menu_button.png
8 | :class: sosse-inline-screenshot
9 |
10 | .. image:: ../../../tests/robotframework/screenshots/crawl_new_url.png
11 | :class: sosse-screenshot
12 |
13 | By default, only the URLs queued for crawling will be visited. The crawler will not recurse into discovered links unless
14 | explicitly configured.
15 |
16 | To control how pages are indexed and whether recursion occurs, update the relevant settings in :doc:`policies`.
17 |
18 | After submitting a URL, the next page shows the :doc:`queue`.
19 |
--------------------------------------------------------------------------------
/doc/source/crawl/queue.rst:
--------------------------------------------------------------------------------
1 | ✔ Crawl queue
2 | =============
3 |
4 | The crawl queue page displays real-time information on documents being crawled. It can be accessed from
5 | the |conf_menu_button| menu, or from the :doc:`../admin_ui`, by selecting ``✔ Crawl queue``.
6 |
7 | .. |conf_menu_button| image:: ../../../tests/robotframework/screenshots/conf_menu_button.png
8 | :class: sosse-inline-screenshot
9 |
10 | .. image:: ../../../tests/robotframework/screenshots/crawl_queue.png
11 | :class: sosse-screenshot
12 |
--------------------------------------------------------------------------------
/doc/source/crawl/recursion_depth.rst:
--------------------------------------------------------------------------------
1 | Recursive crawling
2 | ==================
3 |
4 | SOSSE can recursively crawl all the pages it finds, or the recursion depth can be limited, which is useful when crawling
5 | large or public websites.
6 |
7 | No limit recursion
8 | -------------------
9 |
10 | Recursing with no limit is achieved by using a policy with a :ref:`Recursion ` set to
11 | ``Crawl all pages``.
12 |
13 | For example, a full domain can be extracted with 2 policies:
14 |
15 | * A policy for the domain with a ``URL regex`` that matches the domain, and ``Recursion`` set to ``Crawl all pages``
16 |
17 | * A default policy with a ``Recursion`` set to ``Never crawl`` (the default)
18 |
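Before saving the domain policy, a candidate ``URL regex`` can be sanity-checked against a few sample URLs from a shell. The sketch below uses ``grep -E`` with a made-up pattern for ``example.com``; the regex dialect used by SOSSE may differ from POSIX extended syntax in some details, so this is only a rough check:

.. code-block:: shell

   # Hypothetical pattern matching every page of example.com
   regex='^https://example\.com/'

   # Only the first two URLs belong to the domain, so only they are printed
   printf '%s\n' \
       'https://example.com/' \
       'https://example.com/blog/post-1' \
       'https://other.example.org/' | grep -E "$regex"

URLs that are not printed would fall through to the default policy, where ``Recursion`` is set to ``Never crawl``.
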
19 | Limited recursion
20 | -----------------
21 |
22 | Crawling pages up to a certain level can be simply achieved by setting the :ref:`Recursion ` to
23 | ``Depending on depth`` and setting the ``Recursion depth`` when :doc:`queueing the initial URL `.
24 |
25 | .. image:: ../../../tests/robotframework/screenshots/crawl_on_depth_add.png
26 | :class: sosse-screenshot
27 |
28 | Partial limited recursion
29 | -------------------------
30 |
31 | A mixed approach is also possible, by setting a :ref:`Recursion ` to ``Depending on depth`` in
32 | one policy, and setting it to ``Crawl all pages`` in another, together with a positive ``Recursion depth``.
33 |
34 | For example, one could crawl all of Wikipedia, and crawl external links up to 2 levels with the following policies:
35 |
36 | * A policy for Wikipedia, with ``Recursion depth`` of 2:
37 |
38 | .. image:: ../../../tests/robotframework/screenshots/policy_all.png
39 | :class: sosse-screenshot
40 |
41 | * A default policy with a ``Depending on depth`` condition:
42 |
43 | .. image:: ../../../tests/robotframework/screenshots/policy_on_depth.png
44 | :class: sosse-screenshot
45 |
--------------------------------------------------------------------------------
/doc/source/crawl_guidelines.rst:
--------------------------------------------------------------------------------
1 | Guidelines for Ethical Use
2 | ==========================
3 |
4 | When using a web crawler or scraper, it’s important to be responsible and ethical. Here are some quick tips to keep in
5 | mind:
6 |
7 | **Get Permission First**
8 | ------------------------
9 |
10 | Before crawling a site, make sure you’re allowed to access it. Although your crawler may have the ability to ignore
11 | ``robots.txt`` or modify the ``User-Agent``, **always respect the site owner’s preferences**:
12 |
13 | - Read the site’s terms of service to see if scraping is allowed.
14 | - If you’re unsure, consider reaching out to the site owner for permission.
15 |
16 | **Crawl Responsibly & Respect the Environment**
17 | -----------------------------------------------
18 |
19 | Crawling can impact both website performance and the environment. Here’s how to do it responsibly:
20 | - **Avoid Overloading Servers**: Don’t make too many requests at once or crawl the same pages repeatedly.
21 |
22 | - **Use Data Dumps**: If available, use downloadable data dumps (e.g., `Kiwix `_)
23 | instead of crawling, which helps reduce server load and saves resources.
24 |
25 | - **Consider Environmental Impact**: Crawling consumes energy. Keep your crawls efficient—only collect the data you
26 | need, and avoid unnecessary large downloads like media files.
27 |
28 | - **Use APIs When Available**: If the website provides an API, prefer using it instead of crawling, as APIs are
29 | optimized for data access and reduce server load.
30 |
31 | - **Prefer Generating Scripts with AI**: When possible, use AI to generate scripts for structured data extraction
32 | rather than parsing unstructured pages, which can be less efficient and error-prone.
33 |
34 | **Respect the Web**
35 | -------------------
36 |
37 | Ethical scraping is all about respect:
38 |
39 | - Be transparent and let site owners know if you're crawling their content.
40 | - Avoid scraping personal or sensitive information unless explicitly allowed.
41 | - Follow copyright laws and properly attribute sources.
42 |
43 | For more information, see `Is Web Scraping Legal? `_.
44 |
--------------------------------------------------------------------------------
/doc/source/domain_settings.rst:
--------------------------------------------------------------------------------
1 | 🕸 Domain Settings
2 | ==================
3 |
4 | Domain level parameters can be reached from the :doc:`../admin_ui`, by clicking on ``Domain settings``.
5 |
6 | .. image:: ../../tests/robotframework/screenshots/domain_setting.png
7 | :class: sosse-screenshot
8 |
9 | Domain settings are automatically created during crawling, but can also be created or updated manually.
10 |
11 | Browse mode
12 | """""""""""
13 |
14 | When the policy's :ref:`Default browse mode ` is set to ``Detect``, the ``Browse mode`` option of
15 | the domain defines which browsing method to use. When its value is ``Detect``, the browsing mode is detected the next
16 | time the page is accessed, and this option is switched to either ``Chromium``, ``Firefox`` or ``Python Requests``.
17 |
18 | .. _domain_ignore_robots:
19 |
20 | Ignore robots.txt
21 | """""""""""""""""
22 |
23 | By default the crawler will honor the ``robots.txt`` 🤖 of the domain and follow its rules depending on the
24 | :ref:`User Agent `. When enabled, this option will ignore any ``robots.txt`` rule and crawl
25 | pages of the domain unconditionally.
26 |
27 | Robots.txt status
28 | """""""""""""""""
29 |
30 | One of:
31 |
32 | * ``Unknown``: the file has not been processed yet
33 | * ``Empty``: there is no ``robots.txt`` or it's empty
34 | * ``Loaded``: the file has been successfully loaded
35 |
36 | Robots.txt allow/disallow rules
37 | """""""""""""""""""""""""""""""
38 |
39 | This contains the rules relevant to the crawlers' :ref:`User Agent `.
40 |
--------------------------------------------------------------------------------
/doc/source/excluded_urls.rst:
--------------------------------------------------------------------------------
1 | 🔗 Excluded URLs
2 | ================
3 |
4 | The excluded URLs list can be reached from the :doc:`../admin_ui`, by clicking on ``Excluded URLs``.
5 |
6 | .. image:: ../../tests/robotframework/screenshots/excluded_url.png
7 | :class: sosse-screenshot
8 |
9 | This stores URLs that will always be skipped by the crawlers.
10 |
--------------------------------------------------------------------------------
/doc/source/guides.rst:
--------------------------------------------------------------------------------
1 | Guides
2 | ======
3 |
4 | .. toctree::
5 | :maxdepth: 2
6 | :caption: Contents:
7 |
8 | crawl_guidelines.rst
9 | guides/search.rst
10 | guides/archive.rst
11 | guides/download.rst
12 | guides/feed_website_monitor.rst
13 | guides/authentication.rst
14 | guides/captcha.rst
15 |
--------------------------------------------------------------------------------
/doc/source/guides/authentication_browser_inspect.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biolds/sosse/efe38e1b1dcb975fa8d77eeade941aa43339a1db/doc/source/guides/authentication_browser_inspect.png
--------------------------------------------------------------------------------
/doc/source/guides/captcha.rst:
--------------------------------------------------------------------------------
1 | Dealing with Captchas
2 | =====================
3 |
4 | User agent
5 | ----------
6 |
7 | By default, the crawlers send HTTP requests with a ``SOSSE``
8 | `User agent HTTP header `_. This can sometimes lead websites to flag the
9 | crawler as a robot and display a Captcha. To mitigate this, SOSSE can use the
10 | `Fake user-agent `_ library to simulate a real browser user agent.
11 | This can be achieved with the following options in the configuration file:
12 |
13 | * :ref:`user_agent`: uncomment the option and make it empty
14 | * :ref:`fake_user_agent_browser`,
15 | :ref:`fake_user_agent_os`,
16 | :ref:`fake_user_agent_platform`: these control how the user agent is generated.
17 | It's probably best to set the ``fake_user_agent_platform`` to ``pc`` as some websites may change their rendering on
18 | mobile platforms.
19 |
20 | Cookies
21 | -------
22 |
23 | The captcha can be manually validated in a browser, then cookies can be exported and imported in SOSSE, see the
24 | :doc:`Cookies<../cookies>` documentation.
25 |
--------------------------------------------------------------------------------
/doc/source/guides/search.rst:
--------------------------------------------------------------------------------
1 | Website Search
2 | ==============
3 |
4 | SOSSE allows you to crawl a website and search its pages for specific keywords. This process involves configuring
5 | a :doc:`Crawl Policy <../crawl/policies>` to define how the site is crawled, followed by searching for the desired
6 | content.
7 |
8 | Creating a Crawl Policy
9 | -----------------------
10 |
11 | Crawl policies control how SOSSE accesses and logs website content. This section covers key settings; for full details,
12 | see the :doc:`Crawl Policies <../crawl/policies>` documentation.
13 |
14 | By default, the crawler processes only directly queued pages. Enabling recursion ensures linked pages are also crawled:
15 |
16 | - In the ``⚡ Crawl`` tab, enter a regular expression to match URLs for crawling.
17 | - In the ``🔖 Archive`` tab, disable ``Archive content`` if you only need to search pages without archiving.
18 | - In the ``🕑 Recurrence`` tab, adjust the crawl frequency as needed.
19 |
20 | .. note::
21 | By default, SOSSE archives pages, detects if a browser is required for rendering, and adjusts crawl frequency based
22 | on site updates. Modify the policy to optimize crawl speed or reduce disk usage.
23 |
24 | .. image:: ../../../tests/robotframework/screenshots/guide_search_policy.png
25 | :class: sosse-screenshot
26 |
27 | Starting the Crawl
28 | ------------------
29 |
30 | To begin crawling, go to the :doc:`Crawl a new URL <../crawl/new_url>` page and enter the site's homepage URL.
31 |
32 | Review the parameters, then click ``Confirm``. SOSSE will crawl the site and log pages matching the Crawl Policy.
33 |
34 | .. note::
35 | If pages aren’t crawled as expected, check whether the site’s ``robots.txt`` file is blocking the crawler.
36 | *Bypass it only if authorized.* You can review this setting in the :doc:`../domain_settings` for the website.
37 |
38 | Searching the Website
39 | ---------------------
40 |
41 | Once crawling is complete, search for keywords directly from the homepage.
42 |
43 | For advanced search options, see the :doc:`search parameters <../user/search>` documentation.
44 |
45 | Additional Resources
46 | --------------------
47 |
48 | - See :doc:`../crawl/recursion_depth` for advanced crawling strategies.
49 | - Explore the :doc:`../guides` for further assistance.
50 |
--------------------------------------------------------------------------------
/doc/source/index.rst:
--------------------------------------------------------------------------------
1 | .. SOSSE documentation master file, created by
2 | sphinx-quickstart on Mon Apr 17 13:06:50 2023.
3 | You can adapt this file completely to your liking, but it should at least
4 | contain the root `toctree` directive.
5 |
6 | SOSSE's documentation!
7 | ======================
8 |
9 | .. toctree::
10 | :maxdepth: 2
11 | :caption: 🐾 Contents:
12 |
13 | introduction.rst
14 | install.rst
15 | administration.rst
16 | guides.rst
17 | config_file.rst
18 | cli.rst
19 | user_doc.rst
20 | screenshots.rst
21 |
22 | .. toctree::
23 | :maxdepth: 1
24 |
25 | CHANGELOG.md
26 |
27 | Indices and tables
28 | ==================
29 |
30 | * :ref:`genindex`
31 | * :ref:`modindex`
32 | * :ref:`search`
33 |
--------------------------------------------------------------------------------
/doc/source/install.rst:
--------------------------------------------------------------------------------
1 | Installation
2 | ============
3 |
4 | SOSSE can be installed in a few different ways 🦨:
5 |
6 | .. toctree::
7 | :maxdepth: 2
8 | :caption: Contents:
9 |
10 | install/debian.rst
11 | install/debian_upgrades.rst
12 | install/pip.rst
13 | install/pip_upgrades.rst
14 | install/docker.rst
15 | install/docker_upgrades.rst
16 | install/docker_compose.rst
17 | install/docker_compose_upgrades.rst
18 |
--------------------------------------------------------------------------------
/doc/source/install/database.rst.template:
--------------------------------------------------------------------------------
1 | Database connection parameters can be changed in the ``/etc/sosse/sosse.conf`` file. More information about each variable is available in the :doc:`../config_file`.
2 |
3 | Database creation
4 | """""""""""""""""
5 |
6 | The PostgreSQL database can be created with the commands:
7 |
8 | .. code-block:: shell
9 |
10 | su - postgres -c "psql --command=\"CREATE USER sosse WITH PASSWORD 'CHANGE ME';\""
11 | su - postgres -c "psql --command=\"CREATE DATABASE sosse OWNER sosse;\""
12 |
13 | Replace ``sosse`` and ``CHANGE ME`` with an appropriate username and password, and set them in the ``/etc/sosse/sosse.conf`` configuration file.
14 |
15 | Database schema
16 | """""""""""""""
17 |
18 | The initial database data can be injected with the following commands:
19 |
20 | .. code-block:: shell
21 |
22 | |sosse-admin| migrate
23 | |sosse-admin| update_se
24 |
25 | A default ``admin`` user with password ``admin`` can be created with:
26 |
27 | .. code-block:: shell
28 |
29 | |sosse-admin| default_admin
30 |
--------------------------------------------------------------------------------
/doc/source/install/debian_upgrades.rst:
--------------------------------------------------------------------------------
1 | Debian upgrades
2 | ===============
3 |
4 | The Debian package installed following the :doc:`debian` documentation can be upgraded by running a regular:
5 |
6 | .. code-block:: shell
7 |
8 | apt-get upgrade
9 |
10 | It is recommended to make a backup of the database before upgrading.
11 |
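A backup could be taken with ``pg_dump`` before running the upgrade, as in the sketch below. It assumes a local PostgreSQL server and a database named ``sosse``, as in the database creation example of the installation pages; the output file name is arbitrary:

.. code-block:: shell

   # Dump the "sosse" database to a dated SQL file in the current directory
   # (the redirection is performed by the calling shell, not by postgres)
   su - postgres -c "pg_dump sosse" > "sosse-backup-$(date +%F).sql"
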
--------------------------------------------------------------------------------
/doc/source/install/docker.rst:
--------------------------------------------------------------------------------
1 | Running in Docker
2 | =================
3 |
4 | The latest stable version of SOSSE can be run in Docker with the command:
5 |
6 | .. code-block:: shell
7 |
8 | docker run -p 8005:80 --mount source=sosse_postgres,destination=/var/lib/postgresql \
9 | --mount source=sosse_var,destination=/var/lib/sosse biolds/sosse:latest
10 |
11 | This would start an instance of SOSSE on port 8005, and would persist data in the ``sosse_postgres`` and
12 | ``sosse_var`` `Docker volumes `_.
13 |
14 | You may also locally mount other directories to access their content, with the following flags:
15 |
16 | * ``--volume $PWD/sosse-conf:/etc/sosse/``: mounting an empty directory as ``/etc/sosse/`` will create default
17 | configuration files in it. You can then edit them and restart Docker to make the changes effective.
18 | * ``--volume $PWD/sosse-log:/var/log/sosse/``: mounting this directory would let you access log files.
19 |
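Put together, a command combining the data volumes with both optional mounts might look like the sketch below. It only reuses the flags listed above; creating the two local directories beforehand is an added precaution rather than a documented requirement:

.. code-block:: shell

   # Local directories that will receive the configuration and log files
   mkdir -p sosse-conf sosse-log

   docker run -p 8005:80 \
       --mount source=sosse_postgres,destination=/var/lib/postgresql \
       --mount source=sosse_var,destination=/var/lib/sosse \
       --volume "$PWD/sosse-conf:/etc/sosse/" \
       --volume "$PWD/sosse-log:/var/log/sosse/" \
       biolds/sosse:latest
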
20 | Next steps
21 | ----------
22 |
23 | You can now point your browser to connect to the port 8005 and log in with the user ``admin`` and the password
24 | ``admin``. For more information about the configuration, you can follow the :doc:`../administration` pages,
25 | or follow :doc:`../guides/search` to start indexing documents.
26 |
--------------------------------------------------------------------------------
/doc/source/install/docker_compose.rst:
--------------------------------------------------------------------------------
1 | Running in Docker-compose
2 | =========================
3 |
4 | To run the latest version of SOSSE with docker-compose, you need to download the latest version of the
5 | ``docker-compose.yml`` file from the SOSSE repository in a dedicated directory:
6 |
7 | .. code-block:: shell
8 |
9 | mkdir sosse
10 | cd sosse
11 | curl https://raw.githubusercontent.com/biolds/sosse/refs/heads/stable/docker-compose.yml > docker-compose.yml
12 |
13 | Review its content, then run the following command to start SOSSE:
14 |
15 | .. code-block:: shell
16 |
17 | docker-compose up -d
18 |
19 | By default, this would start an instance of SOSSE on port 8005.
20 |
21 | Next steps
22 | ----------
23 |
24 | You can now point your browser to connect to the port 8005 and log in with the user ``admin`` and the password
25 | ``admin``. For more information about the configuration, you can follow the :doc:`../administration` pages,
26 | or follow :doc:`../guides/search` to start indexing documents.
27 |
--------------------------------------------------------------------------------
/doc/source/install/docker_compose_upgrades.rst:
--------------------------------------------------------------------------------
1 | Docker-compose upgrades
2 | =======================
3 |
4 | The Docker-compose version installed following the :doc:`docker_compose` documentation can be upgraded by running:
5 |
6 | .. code-block:: shell
7 |
8 | docker-compose pull
9 | docker-compose down
10 | docker-compose up -d --force-recreate
11 |
12 | It is recommended to make a backup of the database before upgrading.
13 |
--------------------------------------------------------------------------------
/doc/source/install/docker_upgrades.rst:
--------------------------------------------------------------------------------
1 | Docker upgrades
2 | ===============
3 |
4 | The Docker version installed following the :doc:`docker` documentation can be upgraded by running:
5 |
6 | .. code-block:: shell
7 |
8 | docker pull biolds/sosse:latest
9 |
10 | It is recommended to make a backup of the database before upgrading.
11 |
--------------------------------------------------------------------------------
/doc/source/install/pip_upgrades.rst:
--------------------------------------------------------------------------------
1 | Pip upgrades
2 | ============
3 |
4 | The Pip packages installed following the :doc:`pip` documentation can be upgraded by running:
5 |
6 | .. code-block:: shell
7 |
8 | pip install --upgrade sosse
9 |
10 | It is recommended to make a backup of the database before upgrading.
11 |
12 | When the upgrade is done, the following commands need to be run to update the data:
13 |
14 | .. code-block:: shell
15 |
16 | sosse-admin collectstatic --noinput --clear
17 | sosse-admin migrate
18 | sosse-admin update_se
19 |
--------------------------------------------------------------------------------
/doc/source/introduction.rst:
--------------------------------------------------------------------------------
1 | SOSSE Documentation
2 | ===================
3 |
4 | Welcome to the official SOSSE documentation page! Here, you'll find everything you need to get started, from
5 | installation guides to community support. Explore the links below to dive into the world of SOSSE and make the most out
6 | of the platform.
7 |
8 | 🌐 `Official Website `_
9 |
10 | Visit the official website for the latest updates, announcements, and resources on SOSSE. Stay connected with all
11 | the details about the project.
12 |
13 | 🛠️ :doc:`Installation Guide `
14 |
15 | Looking to install SOSSE? The installation guide will walk you through setting up SOSSE on your machine using various
16 | methods, including Docker and other configurations for persistence.
17 |
18 | 🌍 :doc:`Guidelines for Ethical Use `
19 |
20 | When using SOSSE for web crawling or scraping, it's important to follow ethical guidelines and best practices to avoid
21 | overloading servers, violating site terms of service, or causing damage to the websites being crawled. Please review the
22 | ethical guidelines to ensure responsible usage.
23 |
24 | 📚 :doc:`Guides `
25 |
26 | For detailed guides on key features like search, crawling, archiving, file downloads, and more, visit the SOSSE guides
27 | page.
28 |
29 | 📚 :doc:`Documentation Index `
30 |
31 | For comprehensive documentation on all features, configurations, and usage, visit the SOSSE documentation index. This is
32 | your one-stop resource for learning everything about the platform.
33 |
34 | 💻 `GitHub Project Page `_
35 |
36 | Check out the official SOSSE project on GitHub for access to the source code, issue tracking, and collaboration.
37 | Feel free to contribute, report bugs, or browse the code!
38 |
39 | 🎮 `Join the Discord Community `_
40 |
41 | Join the SOSSE community on Discord! Whether you have questions, want to share your ideas, or need help with
42 | troubleshooting, our Discord server is the perfect place to connect with other users and contributors.
43 |
44 | Thank you for being a part of the SOSSE community! Whether you’re just getting started or need advanced help, all of
45 | these resources are here to assist you.
46 |
--------------------------------------------------------------------------------
/doc/source/permissions.rst:
--------------------------------------------------------------------------------
1 | 👥 Permissions
2 | ==============
3 |
4 | Crawl Permissions
5 | -----------------
6 |
7 | User management and group editing can be done from the :doc:`../admin_ui`, by clicking on ``Users`` or ``Groups``.
8 | Thanks to the `Django framework `_, fine-grained permissions can be defined by group and
9 | by user.
10 |
11 | .. image:: ../../tests/robotframework/screenshots/permissions.png
12 | :class: sosse-screenshot
13 |
14 | Permissions are set by the type of objects that can be modified through the :doc:`admin_ui`. Some of these permissions
15 | also grant access to other parts of the user interface:
16 |
17 | - ``Can add document``: Grants access to the :doc:`🌐 Crawl a new URL ` page.
18 | - ``Can change document``: Grants access to document actions such as ``Crawl now``, ``Remove from crawl queue``,
19 | ``Convert screens to JPEG``.
20 | - ``Can view crawler stats``: Grants access to the :doc:`✔ Crawl queue ` page and
21 | :doc:`🕷 Crawlers ` page.
22 | - ``Can change crawler stats``: Grants access to the ``Pause`` and ``Resume`` crawler buttons in the
23 | :doc:`✔ Crawl queue ` page and :doc:`🕷 Crawlers ` page.
24 |
25 | Search Permissions
26 | ------------------
27 |
28 | By default, search requires users to be authenticated, but :ref:`anonymous searches `
29 | can be enabled with the related option.
30 |
--------------------------------------------------------------------------------
/doc/source/screenshots.rst:
--------------------------------------------------------------------------------
1 | Screenshots
2 | ===========
3 |
4 | .. figure:: ../../tests/robotframework/screenshots/search.png
5 | :class: sosse-screenshot
6 |
7 | :doc:`Search results `
8 |
9 | .. raw:: html
10 |
11 |
12 |
13 |
14 | .. figure:: ../../tests/robotframework/screenshots/guide_download_archive_html.png
15 | :class: sosse-screenshot
16 |
17 | :doc:`Offline browsing `
18 |
19 | .. raw:: html
20 |
21 |
22 |
23 |
24 | .. figure:: ../../tests/robotframework/screenshots/archive_download.png
25 | :class: sosse-screenshot
26 |
27 | :doc:`File scraping `
28 |
29 | .. raw:: html
30 |
31 |
32 |
33 |
34 | .. figure:: ../../tests/robotframework/screenshots/analytics.png
35 | :class: sosse-screenshot
36 |
37 | :doc:`Index analytics `
38 |
39 | .. raw:: html
40 |
41 |
42 |
43 |
44 | .. figure:: ../../tests/robotframework/screenshots/history.png
45 | :class: sosse-screenshot
46 |
47 | :doc:`Search history `
48 |
49 | .. raw:: html
50 |
51 |
52 |
53 |
54 | .. figure:: ../../tests/robotframework/screenshots/crawl_queue.png
55 | :class: sosse-screenshot
56 |
57 | :doc:`Real-time crawling status `
58 |
59 | .. raw:: html
60 |
61 |
62 |
63 |
64 | .. figure:: ../../tests/robotframework/screenshots/crawl_policy_decision_no_hilight.png
65 | :class: sosse-screenshot
66 |
67 | :doc:`Crawl Policies setup `
68 |
69 | .. raw:: html
70 |
71 |
72 |
73 |
74 | .. figure:: ../../tests/robotframework/screenshots/browsable_home.png
75 | :class: sosse-screenshot
76 |
77 | :doc:`Archive browsing `
78 |
--------------------------------------------------------------------------------
/doc/source/search_engines.rst:
--------------------------------------------------------------------------------
1 | 🔍 External Search Engines
2 | ==========================
3 |
4 | The list of :doc:`user/shortcuts` can be reached from the :doc:`../admin_ui`, by clicking on ``Search engines``.
5 |
6 | .. image:: ../../tests/robotframework/screenshots/search_engines_list.png
7 | :class: sosse-screenshot
8 |
9 | New search engines can be added manually, or imported with the :ref:`CLI ` from an `Open Search Description `_ formatted file.
10 |
11 | .. image:: ../../tests/robotframework/screenshots/search_engine.png
12 | :class: sosse-screenshot
13 |
14 | This form defines the shortcut used to redirect to the external search engine. If you add a search engine, please consider adding it to the list of `included search engines `_ and opening a pull request (also works on `GitHub `_).
15 |
--------------------------------------------------------------------------------
/doc/source/tags.rst:
--------------------------------------------------------------------------------
1 | ⭐ Tags
2 | =======
3 |
4 | The tagging system allows for efficient searching and categorization of documents by associating them with tags. Tags
5 | can be assigned to documents during the crawling process based on :doc:`Crawl Policies `, or they can
6 | be manually added or edited in the :doc:`Archive page ` of Documents.
7 |
8 | Tags can be accessed by clicking **Tags** from the :doc:`../admin_ui`.
9 |
10 | .. image:: ../../tests/robotframework/screenshots/tags_list.png
11 | :class: sosse-screenshot
12 |
13 | Tags can be modified through the admin interface by selecting a tag and updating its properties:
14 |
15 | .. image:: ../../tests/robotframework/screenshots/edit_tag.png
16 | :class: sosse-screenshot
17 |
18 | Editable Fields:
19 |
20 | - Name: The label of the tag.
21 | - Parent: Allows organizing tags into a hierarchical structure by selecting a parent tag.
22 | - Documents: A link to the admin interface showing all documents associated with the tag.
23 | - Crawl Policies: A link to the admin interface showing all crawl policies that assign this tag.
24 |
--------------------------------------------------------------------------------
/doc/source/user/archive.rst:
--------------------------------------------------------------------------------
1 | Offline browsing, archived pages
2 | ================================
3 |
4 | Archived pages can be accessed from the search results by clicking the ``archive`` link.
5 |
6 | .. image:: ../../../tests/robotframework/screenshots/archive_header.png
7 | :class: sosse-screenshot
8 |
9 | When the :doc:`Crawl Policy <../crawl/policies>` has ``🔖 Archive content`` or ``📷 Take screenshots`` enabled,
10 | the archive page shows the rendered content and links to other indexed pages can be clicked:
11 |
12 | .. image:: ../../../tests/robotframework/screenshots/archive_screenshot.png
13 | :class: sosse-screenshot
14 |
15 | The ``✏️ Text`` link opens the text version of the page. The ``📚 Word weights`` link shows the weight of
16 | stemmed words in the page; these weights are used to calculate the score of the page in the :doc:`search results `.
17 |
--------------------------------------------------------------------------------
/doc/source/user/history.rst:
--------------------------------------------------------------------------------
1 | History
2 | =======
3 |
4 | The history page shows the search history of the logged-in user.
5 |
6 | .. image:: ../../../tests/robotframework/screenshots/history.png
7 | :class: sosse-screenshot
8 |
9 | Clicking the |delete_button| button deletes the search entry.
10 |
11 | .. |delete_button| image:: ../../../tests/robotframework/screenshots/history_delete.png
12 | :class: sosse-inline-screenshot
13 |
14 | Clicking the |delete_all_button| button deletes the whole search history of the user.
15 |
16 | .. |delete_all_button| image:: ../../../tests/robotframework/screenshots/history_delete_all.png
17 | :class: sosse-inline-screenshot
18 |
--------------------------------------------------------------------------------
/doc/source/user/profile.rst:
--------------------------------------------------------------------------------
1 | Profile
2 | ===========
3 |
4 | To reach the Profile user interface, click the |user_menu_button| button, then select ``Profile``.
5 |
6 | .. |user_menu_button| image:: ../../../tests/robotframework/screenshots/user_menu_button.png
7 | :class: sosse-inline-screenshot
8 |
9 | .. image:: ../../../tests/robotframework/screenshots/profile.png
10 | :class: sosse-screenshot
11 |
12 | Profile data is stored in the browser's
13 | `Local storage `_, so it is not shared across
14 | users, devices, or browsers.
15 |
16 | Theme
17 | -----
18 |
19 | The theme option lets you choose a light theme, a dark theme, or automatic switching based on the browser
20 | configuration.
21 |
22 | Search terms parsing language
23 | -----------------------------
24 |
25 | This defines the default language used to read the search terms typed in the search bar. SOSSE uses
26 | `PostgreSQL's Full Text Search `_ feature which uses
27 | this parameter to make searches more intelligent than simple word matches.
28 |
29 | Results by page
30 | ---------------
31 |
32 | The number of search results displayed on one page.
33 |
34 | .. _pref_principal_link:
35 |
36 | Search result main links point to the archive
37 | ---------------------------------------------
38 |
39 | When enabled, search result links point to the :doc:`archive versions ` of pages. ``source`` links are
40 | displayed to access original websites.
41 |
42 | When disabled, search result links point to original websites. ``archive`` links are displayed to access
43 | :doc:`archive versions `.
44 |
45 | .. _pref_online_mode:
46 |
47 | Online mode
48 | -----------
49 |
50 | When :ref:`Online detection ` is set up, searching locally or online can be overridden.
51 |
52 | .. image:: ../../../tests/robotframework/screenshots/online_mode.png
53 | :class: sosse-screenshot
54 |
55 | Next to the user menu, a dot displays the status of the online mode:
56 |
57 | .. image:: ../../../tests/robotframework/screenshots/online_mode_status.png
58 | :class: sosse-screenshot
59 |
60 | * Green for online
61 | * Orange for offline
62 | * Purple when ``Force online`` is selected
63 | * Blue when ``Force local`` is selected
64 |
--------------------------------------------------------------------------------
/doc/source/user/rest_api.rst:
--------------------------------------------------------------------------------
1 | Rest API
2 | ========
3 |
4 | .. image:: ../../../tests/robotframework/screenshots/swagger.png
5 | :class: sosse-screenshot
6 |
7 | A REST API is available and can be explored with a `Swagger `_ user interface. To open it, click the |user_menu_button| button, then select ``Rest API``.
8 |
9 | .. |user_menu_button| image:: ../../../tests/robotframework/screenshots/user_menu_button.png
10 | :class: sosse-inline-screenshot
11 |
--------------------------------------------------------------------------------
/doc/source/user/shortcut_list.rst:
--------------------------------------------------------------------------------
1 | Search Engine shortcut defaults
2 | ===============================
3 |
4 | You can find below the list of :doc:`shortcuts` defined by default. You can add new ones by following the
5 | :doc:`../search_engines` documentation.
6 |
7 | .. include:: shortcut_list_generated.rst
8 |
--------------------------------------------------------------------------------
/doc/source/user/shortcuts.rst:
--------------------------------------------------------------------------------
1 | External search engine shortcuts
2 | ================================
3 |
4 | .. image:: ../../../tests/robotframework/screenshots/shortcut.png
5 | :class: sosse-screenshot
6 |
7 | In the search bar, shortcuts can be used to search on an external search engine. In the screenshot above, the search terms
8 | ``!b cats`` would redirect to the `Brave Search `_ search engine, searching for ``cats`` 🐈.
9 |
10 | The default list of shortcuts is available in the :doc:`shortcut_list` page. New search engines can be added in the
11 | :doc:`administration UI <../search_engines>`.
12 |
13 | The special character (``!`` by default) used to trigger the shortcut can be modified in the
14 | :ref:`configuration `.
15 |
16 | SOSSE can redirect to an external search engine by default when the option
17 | :ref:`default_search_redirect ` is set. In this case, SOSSE internal searches can still be
18 | reached using the shortcut defined by :ref:`sosse_shortcut `.
19 |
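The redirection described above can be sketched in a few lines of Python. This is only an illustration, not SOSSE's implementation: the ``SHORTCUTS`` table and the ``resolve_shortcut`` helper are hypothetical, the table holds a single example entry, and the sketch assumes that the first word matching a known shortcut triggers the redirect. The ``{searchTerms}`` placeholder follows the Open Search Description template convention.

```python
from urllib.parse import quote_plus

# Hypothetical shortcut table with one example entry; the real list is
# managed in the administration UI.
SHORTCUTS = {"b": "https://search.brave.com/search?q={searchTerms}"}


def resolve_shortcut(query, special_char="!", shortcuts=SHORTCUTS):
    """Return the external search URL for a query, or None without a shortcut."""
    words = query.split()
    for index, word in enumerate(words):
        name = word[len(special_char):]
        if word.startswith(special_char) and name in shortcuts:
            # Everything except the shortcut word becomes the search terms.
            terms = " ".join(words[:index] + words[index + 1:])
            return shortcuts[name].replace("{searchTerms}", quote_plus(terms))
    return None


print(resolve_shortcut("!b cats"))
```

A query without a known shortcut returns ``None``, which a caller could treat as "search locally".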
--------------------------------------------------------------------------------
/doc/source/user_doc.rst:
--------------------------------------------------------------------------------
1 | User documentation
2 | ==================
3 |
4 | .. toctree::
5 | :maxdepth: 2
6 | :caption: Contents:
7 |
8 | user/search.rst
9 | user/shortcuts.rst
10 | user/shortcut_list.rst
11 | user/profile.rst
12 | user/history.rst
13 | user/archive.rst
14 | user/rest_api.rst
15 |
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
1 | services:
2 | sosse:
3 | image: biolds/sosse:pip-compose
4 | container_name: sosse_app
5 | depends_on:
6 | - postgres
7 | environment:
8 | # Available configuration variables can be found on https://sosse.readthedocs.io/en/stable/config_file.html
9 | # any option can be set by using the SOSSE_ prefix
10 | - SOSSE_DB_NAME=sosse_db
11 | - SOSSE_DB_USER=sosse_user
12 | - SOSSE_DB_PASS=sosse_password
13 | - SOSSE_DB_HOST=postgres
14 | ports:
15 | - "8000:80"
16 | volumes:
17 | - sosse_data:/var/lib/sosse
18 | restart: always
19 |
20 | postgres:
21 | image: postgres:latest
22 | container_name: sosse_db
23 | environment:
24 | POSTGRES_USER: sosse_user
25 | POSTGRES_PASSWORD: sosse_password
26 | POSTGRES_DB: sosse_db
27 | ports:
28 | - "5432:5432"
29 | volumes:
30 | - postgres_data:/var/lib/postgresql/data
31 | restart: always
32 |
33 | volumes:
34 | sosse_data:
35 | postgres_data:
36 |
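The comment in the ``environment`` section above says that any configuration option can be set with the ``SOSSE_`` prefix. A minimal sketch of that convention follows, assuming an option name is the lowercased variable name without the prefix (consistent with ``SOSSE_DB_PASS`` and the ``db_pass`` option, but not verified against SOSSE's settings code). The ``options_from_env`` helper is a hypothetical name.

```python
import os


def options_from_env(environ=None, prefix="SOSSE_"):
    """Collect prefixed environment variables as lowercase option names."""
    environ = os.environ if environ is None else environ
    return {
        # Assumption: SOSSE_DB_NAME maps to the db_name option.
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


example = {"SOSSE_DB_NAME": "sosse_db", "SOSSE_DB_HOST": "postgres", "PATH": "/usr/bin"}
print(options_from_env(example))
```

Variables without the prefix, such as ``PATH`` in the example, are ignored.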
--------------------------------------------------------------------------------
/docker/Makefile:
--------------------------------------------------------------------------------
1 | .PHONY: build push
2 |
3 | build:
4 | docker pull debian:bookworm
5 | $(MAKE) -C debian-base build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
6 | $(MAKE) -C debian build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
7 | $(MAKE) -C debian-pkg build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
8 | $(MAKE) -C debian-test build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
9 | $(MAKE) -C pip-base build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
10 | $(MAKE) -C pip-compose build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
11 | $(MAKE) -C pip-release build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
12 | $(MAKE) -C pip-test build APT_PROXY=$(APT_PROXY) PIP_INDEX_URL=$(PIP_INDEX_URL) PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST)
13 | $(MAKE) -C doc build
14 | $(MAKE) -C docker build
15 |
16 | push:
17 | @for a in $$(ls); do \
18 | if [ -d $$a ]; then \
19 | $(MAKE) -C $$a push; \
20 | fi; \
21 | done;
22 |
--------------------------------------------------------------------------------
/docker/Makefile.common:
--------------------------------------------------------------------------------
1 | DOCKER_NAME=$(shell pwd | sed -e s_^.*/__)
2 | #APT_PROXY=http://192.168.3.2:3142/
3 | #PIP_INDEX_URL=http://192.168.3.3:5000/index/
4 | #PIP_TRUSTED_HOST=192.168.3.3
5 |
6 | .PHONY: _build push
7 |
8 | push:
9 | docker push biolds/sosse:$(DOCKER_NAME)
10 |
11 | _build:
12 | docker build --build-arg APT_PROXY=$(APT_PROXY) --build-arg PIP_INDEX_URL=$(PIP_INDEX_URL) --build-arg PIP_TRUSTED_HOST=$(PIP_TRUSTED_HOST) -t biolds/sosse:$(DOCKER_NAME) .
13 |
14 | %:
15 | $(MAKE) -f ../Makefile.common _$@
16 |
--------------------------------------------------------------------------------
/docker/README.md:
--------------------------------------------------------------------------------
1 | - debian: Docker image using the Sosse Debian package, for testing purposes only
2 | - debian-pkg: image that builds the Debian package
3 | - debian-test FROM debian: image used in the GitLab CI to run some tests (unit tests, static checks, etc.)
4 | - doc: image used to build the documentation (for testing only, the published doc is built on Read the Docs)
5 | - docker FROM pip-test: image used to rebuild the Docker package to upgrade packages on Docker Hub
6 | - pip-base: base image for the pip-test and pip-release images
7 | - pip-compose: official image for Docker Compose
8 | - pip-release: official Docker image
9 | - pip-test: image used to test the pip package
10 |
--------------------------------------------------------------------------------
/docker/debian-base/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM debian:bookworm
2 | ADD control /
3 | RUN apt-get update && \
4 | grep ^Depends: /control | sed -e "s/.*},//" -e "s/,//g" | xargs apt-get install -y && \
5 | apt-get clean autoclean && \
6 | rm -rf /control /var/lib/cache /var/lib/log /usr/share/doc /usr/share/man
7 |
--------------------------------------------------------------------------------
/docker/debian-base/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
3 | .PHONY: build
4 |
5 | build:
6 | cp ../../debian/control .
7 | $(MAKE) -f ../Makefile.common _build
8 |
--------------------------------------------------------------------------------
/docker/debian-pkg/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM debian:bookworm
2 | ARG APT_PROXY=
3 | ARG PIP_INDEX_URL=
4 | ARG PIP_TRUSTED_HOST=
5 | RUN test -z "$APT_PROXY" || (echo "Acquire::http::Proxy \"$APT_PROXY\";" > /etc/apt/apt.conf.d/proxy.conf)
6 | RUN apt update && \
7 | apt upgrade -y && \
8 | apt install -y make build-essential python3-dev devscripts cdbs dh-python python3-setuptools curl gnupg2 npm && \
9 | apt-get clean autoclean && \
10 | apt-get autoremove --yes && \
11 | rm -rf /var/lib/cache /var/lib/log /usr/share/doc /usr/share/man
12 | RUN test -z "$APT_PROXY" || rm /etc/apt/apt.conf.d/proxy.conf
13 |
--------------------------------------------------------------------------------
/docker/debian-pkg/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
--------------------------------------------------------------------------------
/docker/debian-test/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM biolds/sosse:debian
2 | ARG APT_PROXY=
3 | ARG PIP_INDEX_URL=
4 | ARG PIP_TRUSTED_HOST=
5 | RUN test -z "$APT_PROXY" || (echo "Acquire::http::Proxy \"$APT_PROXY\";" > /etc/apt/apt.conf.d/proxy.conf)
6 | RUN apt update
7 | RUN apt purge -y sosse
8 | # Remove python3-coverage after version 1.13 is released
9 | RUN apt install -y python3-coverage python3-virtualenv flake8 sudo jq make git rsync
10 | RUN /etc/init.d/postgresql start && \
11 | su - postgres -c "psql --command 'ALTER USER sosse WITH SUPERUSER;'" && \
12 | /etc/init.d/postgresql stop
13 | RUN git clone --depth=1 https://gitlab.com/biolds1/httpbin.git /root/httpbin && \
14 | cd /root/httpbin/httpbin && \
15 | python3 manage.py migrate && \
16 | python3 manage.py shell -c "from django.contrib.auth.models import User ; u = User.objects.create(username='admin', is_superuser=True, is_staff=True) ; u.set_password('admin') ; u.save()"
17 | ADD requirements.txt /tmp
18 | RUN virtualenv /robotframework-venv && /robotframework-venv/bin/pip install -r /tmp/requirements.txt && /robotframework-venv/bin/pip cache purge
19 | RUN mkdir -p /var/lib/sosse/screenshots && git clone --depth=1 https://github.com/GurvanKervern/dummy-static-website /var/lib/sosse/screenshots/website
20 | RUN test -z "$APT_PROXY" || rm /etc/apt/apt.conf.d/proxy.conf
21 |
22 | # Pre-commit installation
23 | RUN virtualenv /pre-commit-venv && /pre-commit-venv/bin/pip install pre-commit
24 | RUN mkdir -p /tmp/pre-commit && cd /tmp/pre-commit && git init
25 | ADD pre-commit-config.yaml /tmp/pre-commit/.pre-commit-config.yaml
26 | RUN cd /tmp/pre-commit && \
27 | /pre-commit-venv/bin/pre-commit autoupdate && \
28 | /pre-commit-venv/bin/pre-commit run -a
29 |
--------------------------------------------------------------------------------
/docker/debian-test/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
3 | .PHONY: build
4 |
5 | build:
6 | cp ../../tests/robotframework/requirements.txt .
7 | cp ../../.pre-commit-config.yaml pre-commit-config.yaml
8 | $(MAKE) -f ../Makefile.common _build
9 |
--------------------------------------------------------------------------------
/docker/debian/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
--------------------------------------------------------------------------------
/docker/doc/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM debian:bookworm
2 | ARG APT_PROXY=
3 | ARG PIP_INDEX_URL=
4 | ARG PIP_TRUSTED_HOST=
5 | RUN test -z "$APT_PROXY" || (echo "Acquire::http::Proxy \"$APT_PROXY\";" > /etc/apt/apt.conf.d/proxy.conf)
6 | RUN apt update
7 | RUN apt upgrade -y
8 | RUN apt install -y virtualenv jq curl
9 | RUN virtualenv /opt/sosse-doc
10 | ADD requirements.txt requirements-rtd.txt /tmp/
11 | RUN /opt/sosse-doc/bin/pip install -r /tmp/requirements.txt && /opt/sosse-doc/bin/pip install -r /tmp/requirements-rtd.txt
12 | RUN test -z "$APT_PROXY" || rm /etc/apt/apt.conf.d/proxy.conf
13 |
--------------------------------------------------------------------------------
/docker/doc/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
3 | .PHONY: build
4 |
5 | build:
6 | cp ../../doc/requirements.txt .
7 | $(MAKE) -f ../Makefile.common _build
8 |
--------------------------------------------------------------------------------
/docker/doc/requirements-rtd.txt:
--------------------------------------------------------------------------------
1 | alabaster>=0.7,<0.8,!=0.7.5
2 | commonmark==0.9.1
3 | mock==1.0.1
4 | pillow
5 | readthedocs-sphinx-ext<2.3
6 | recommonmark==0.5.0
7 | sphinx
8 | sphinx-rtd-theme
9 |
--------------------------------------------------------------------------------
/docker/docker/Dockerfile:
--------------------------------------------------------------------------------
1 | # inherits pip-test to build the pip pkg
2 | FROM biolds/sosse:pip-test
3 | RUN apt-get update
4 | RUN apt-get install -y ca-certificates make
5 | RUN install -m 0755 -d /etc/apt/keyrings
6 | RUN curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
7 | RUN chmod a+r /etc/apt/keyrings/docker.asc
8 | RUN echo \
9 | "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
10 | $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
11 | tee /etc/apt/sources.list.d/docker.list > /dev/null
12 | RUN apt-get update
13 | RUN apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
14 |
--------------------------------------------------------------------------------
/docker/docker/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
--------------------------------------------------------------------------------
/docker/pg_run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -x
2 | test -e /var/lib/postgresql/15 || tar -x -p -C / -f /tmp/postgres_sosse.tar.gz
3 |
4 | /etc/init.d/postgresql start
5 |
6 | export SOSSE_DB_HOST=localhost
7 |
8 | exec bash /run.sh
9 |
--------------------------------------------------------------------------------
/docker/pip-base/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM debian:bookworm
2 | ARG APT_PROXY=
3 | ARG PIP_INDEX_URL=
4 | ARG PIP_TRUSTED_HOST=
5 | RUN test -z "$APT_PROXY" || (echo "Acquire::http::Proxy \"$APT_PROXY\";" > /etc/apt/apt.conf.d/proxy.conf)
6 | RUN apt-get update
7 | RUN apt-get upgrade -y
8 | RUN apt-get install -y sudo python3-pip python3-dev python3-venv build-essential libpq-dev libmagic1 nginx chromium chromium-driver firefox-esr fonts-noto unifont virtualenv npm
9 | RUN test -z "$APT_PROXY" || rm /etc/apt/apt.conf.d/proxy.conf
10 |
--------------------------------------------------------------------------------
/docker/pip-base/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
--------------------------------------------------------------------------------
/docker/pip-compose/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM biolds/sosse:pip-base
2 | ARG PIP_INDEX_URL=
3 | ARG PIP_TRUSTED_HOST=
4 | RUN apt-get update
5 | RUN apt-get install -y postgresql-client # for pg_isready
6 | RUN virtualenv /venv
7 | RUN /venv/bin/pip install sosse uwsgi && /venv/bin/pip cache purge
8 | RUN mkdir -p /etc/sosse/ /etc/sosse_src/ /var/log/sosse /var/log/uwsgi
9 | ADD uwsgi.* /etc/sosse_src/
10 | ADD sosse.conf /etc/nginx/sites-enabled/default
11 | RUN chown -R root:www-data /etc/sosse /etc/sosse_src && chmod 750 /etc/sosse_src/ && chmod 640 /etc/sosse_src/*
12 | RUN mkdir /var/www/.cache /var/www/.mozilla
13 | RUN chown www-data:www-data /var/www/.cache /var/www/.mozilla
14 | ADD run.sh /
15 | RUN chmod +x /run.sh
16 |
17 | USER root
18 | CMD ["/usr/bin/bash", "/run.sh"]
19 |
--------------------------------------------------------------------------------
/docker/pip-compose/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
3 | .PHONY: build
4 |
5 | build:
6 | cp ../run.sh .
7 | cp ../../debian/uwsgi.* ../../debian/sosse.conf .
8 | $(MAKE) -f ../Makefile.common _build
9 |
--------------------------------------------------------------------------------
/docker/pip-release/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM biolds/sosse:pip-compose
2 | ARG PIP_INDEX_URL=
3 | ARG PIP_TRUSTED_HOST=
4 | ADD run.sh pg_run.sh /
5 | RUN chmod +x /run.sh /pg_run.sh
6 | RUN apt-get update && apt-get install -y postgresql && apt-get clean
7 |
8 | WORKDIR /
9 | USER postgres
10 | RUN /etc/init.d/postgresql start && \
11 | (until pg_isready; do sleep 1; done) && \
12 | psql --command "CREATE USER sosse WITH PASSWORD 'sosse';" && \
13 | createdb -O sosse sosse && \
14 | /etc/init.d/postgresql stop && \
15 | tar -c -p -C / -f /tmp/postgres_sosse.tar.gz /var/lib/postgresql
16 |
17 | USER root
18 | CMD ["/usr/bin/bash", "/pg_run.sh"]
19 |
--------------------------------------------------------------------------------
/docker/pip-release/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
3 | .PHONY: build
4 |
5 | build:
6 | cp ../pg_run.sh ../run.sh .
7 | cp ../../debian/uwsgi.* ../../debian/sosse.conf .
8 | $(MAKE) -f ../Makefile.common _build
9 |
--------------------------------------------------------------------------------
/docker/pip-test/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM biolds/sosse:pip-base
2 | ARG APT_PROXY=
3 | RUN test -z "$APT_PROXY" || (echo "Acquire::http::Proxy \"$APT_PROXY\";" > /etc/apt/apt.conf.d/proxy.conf)
4 | RUN apt update
5 | RUN apt install -y firefox-esr wget jq make git postgresql rsync curl python3-django python3-pil
6 | RUN test -z "$APT_PROXY" || rm /etc/apt/apt.conf.d/proxy.conf
7 | RUN wget https://github.com/mozilla/geckodriver/releases/download/v0.35.0/geckodriver-v0.35.0-linux64.tar.gz -O /tmp/gecko.tar.gz && \
8 | tar xvzf /tmp/gecko.tar.gz && \
9 | mv geckodriver /usr/local/bin/
10 | RUN mkdir -p /var/lib/sosse/screenshots && git clone --depth=1 https://github.com/GurvanKervern/dummy-static-website /var/lib/sosse/screenshots/website
11 | RUN git clone --depth=1 https://gitlab.com/biolds1/httpbin.git /root/httpbin && \
12 | cd /root/httpbin/httpbin && \
13 | python3 manage.py migrate && \
14 | python3 manage.py shell -c "from django.contrib.auth.models import User ; u = User.objects.create(username='admin', is_superuser=True, is_staff=True) ; u.set_password('admin') ; u.save()"
15 |
--------------------------------------------------------------------------------
/docker/pip-test/Makefile:
--------------------------------------------------------------------------------
1 | include ../Makefile.common
2 |
--------------------------------------------------------------------------------
/docker/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -x
2 | until pg_isready --host "$SOSSE_DB_HOST" ; do
3 | sleep 1
4 | done
5 |
6 | test -e /etc/sosse/sosse.conf || /venv/bin/sosse-admin default_conf | sed -e "s/^#db_pass.*/db_pass=sosse/" -e "s/^#\(chromium_options=.*\)$/\\1 --no-sandbox --disable-dev-shm-usage/" >/etc/sosse_src/sosse.conf
7 | test -e /etc/sosse/sosse.conf || cp -p /etc/sosse_src/* /etc/sosse/
8 | mkdir -p /run/sosse /var/log/sosse /var/lib/sosse/html/
9 | touch /var/log/sosse/{debug.log,main.log,crawler.log,uwsgi.log,webserver.log,webhooks.log}
10 | chown -R www-data:www-data /run/sosse /var/log/sosse/ /var/lib/sosse
11 |
12 | /venv/bin/sosse-admin migrate
13 | /venv/bin/sosse-admin collectstatic --noinput
14 | /venv/bin/sosse-admin update_se
15 | /venv/bin/sosse-admin default_admin
16 | /venv/bin/uwsgi --uid www-data --gid www-data --ini /etc/sosse/uwsgi.ini --logto /var/log/sosse/uwsgi.log &
17 | /etc/init.d/nginx start
18 | sudo --preserve-env -u www-data /venv/bin/sosse-admin crawl &
19 | tail -F /var/log/sosse/crawler.log
20 |
--------------------------------------------------------------------------------
/package-lock.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "sosse",
3 | "lockfileVersion": 3,
4 | "requires": true,
5 | "packages": {
6 | "": {
7 | "dependencies": {
8 | "chart.js": "^4.4.2",
9 | "chartjs-adapter-luxon": "^1.3.1",
10 | "luxon": "^3.4.4",
11 | "swagger-ui-dist": "^5.12.0"
12 | }
13 | },
14 | "node_modules/@kurkle/color": {
15 | "version": "0.3.2",
16 | "resolved": "https://registry.npmjs.org/@kurkle/color/-/color-0.3.2.tgz",
17 | "integrity": "sha512-fuscdXJ9G1qb7W8VdHi+IwRqij3lBkosAm4ydQtEmbY58OzHXqQhvlxqEkoz0yssNVn38bcpRWgA9PP+OGoisw=="
18 | },
19 | "node_modules/chart.js": {
20 | "version": "4.4.2",
21 | "resolved": "https://registry.npmjs.org/chart.js/-/chart.js-4.4.2.tgz",
22 | "integrity": "sha512-6GD7iKwFpP5kbSD4MeRRRlTnQvxfQREy36uEtm1hzHzcOqwWx0YEHuspuoNlslu+nciLIB7fjjsHkUv/FzFcOg==",
23 | "dependencies": {
24 | "@kurkle/color": "^0.3.0"
25 | },
26 | "engines": {
27 | "pnpm": ">=8"
28 | }
29 | },
30 | "node_modules/chartjs-adapter-luxon": {
31 | "version": "1.3.1",
32 | "resolved": "https://registry.npmjs.org/chartjs-adapter-luxon/-/chartjs-adapter-luxon-1.3.1.tgz",
33 | "integrity": "sha512-yxHov3X8y+reIibl1o+j18xzrcdddCLqsXhriV2+aQ4hCR66IYFchlRXUvrJVoxglJ380pgytU7YWtoqdIgqhg==",
34 | "peerDependencies": {
35 | "chart.js": ">=3.0.0",
36 | "luxon": ">=1.0.0"
37 | }
38 | },
39 | "node_modules/luxon": {
40 | "version": "3.4.4",
41 | "resolved": "https://registry.npmjs.org/luxon/-/luxon-3.4.4.tgz",
42 | "integrity": "sha512-zobTr7akeGHnv7eBOXcRgMeCP6+uyYsczwmeRCauvpvaAltgNyTbLH/+VaEAPUeWBT+1GuNmz4wC/6jtQzbbVA==",
43 | "engines": {
44 | "node": ">=12"
45 | }
46 | },
47 | "node_modules/swagger-ui-dist": {
48 | "version": "5.12.0",
49 | "resolved": "https://registry.npmjs.org/swagger-ui-dist/-/swagger-ui-dist-5.12.0.tgz",
50 | "integrity": "sha512-Rt1xUpbHulJVGbiQjq9yy9/r/0Pg6TmpcG+fXTaMePDc8z5WUw4LfaWts5qcNv/8ewPvBIbY7DKq7qReIKNCCQ=="
51 | }
52 | }
53 | }
54 |
--------------------------------------------------------------------------------
/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "dependencies": {
3 | "chart.js": "^4.4.2",
4 | "chartjs-adapter-luxon": "^1.3.1",
5 | "luxon": "^3.4.4",
6 | "swagger-ui-dist": "^5.12.0"
7 | }
8 | }
9 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools", "setuptools-scm"]
3 | build-backend = "setuptools.build_meta"
4 |
5 | [project]
6 | name = "sosse"
7 | authors = [{ name = "Laurent Defert", email = "laurent_defert@yahoo.fr" }]
8 | readme = "README.md"
9 | description = "Selenium Open Source Search Engine"
10 | requires-python = ">=3.9"
11 | keywords = ["search engine", "crawler"]
12 | license = "AGPL-3.0-only"
13 | classifiers = ["Framework :: Django", "Programming Language :: Python :: 3"]
14 |
15 | dynamic = ["version", "dependencies"]
16 |
17 | [tool.setuptools]
18 | packages = [
19 | "se",
20 | "se.deps.linkpreview",
21 | "se.deps.linkpreview.linkpreview",
22 | "se.deps.linkpreview.linkpreview.preview",
23 | "se.deps.fake-useragent",
24 | "se.deps.fake-useragent.src",
25 | "se.deps.fake-useragent.src.fake_useragent",
26 | "se.migrations",
27 | "se.management",
28 | "se.management.commands",
29 | "sosse",
30 | ]
31 |
32 | [tool.setuptools.package-data]
33 | se = ["*.html", "*.svg", "*.js", "*.css", "*.json"]
34 |
35 | [tool.setuptools.dynamic]
36 | version = { attr = "sosse.settings.SOSSE_VERSION_TAG" }
37 | dependencies = { file = ["requirements.txt"] }
38 |
39 | [tool.autopep8]
40 | max_line_length = 1000
41 |
42 | [tool.ruff]
43 | line-length = 120
44 |
45 | [tool.doc8]
46 | # Ignore include failure since we include generated files
47 | ignore = ["D000"]
48 | max-line-length = 120
49 |
50 | [tool.isort]
51 | profile = "black"
52 |
53 | [project.scripts]
54 | sosse-admin = "sosse.sosse_admin:main"
55 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | bs4
2 | cssutils
3 | defusedxml
4 | django<4
5 | django-filter
6 | django-treebeard
7 | django-uwsgi
8 | djangorestframework
9 | drf-spectacular
10 | feedparser
11 | html5lib
12 | langdetect
13 | lxml
14 | markdown
15 | pillow
16 | psutil
17 | psycopg2-binary
18 | publicsuffix2
19 | python-magic
20 | requests
21 | selenium<4.9
22 |
--------------------------------------------------------------------------------
/se/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biolds/sosse/efe38e1b1dcb975fa8d77eeade941aa43339a1db/se/__init__.py
--------------------------------------------------------------------------------
/se/about.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from .views import UserView
17 |
18 |
19 | class AboutView(UserView):
20 |     template_name = "se/about.html"
21 |     title = "About"
22 |
--------------------------------------------------------------------------------
/se/analytics.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from .models import WorkerStats
17 | from .views import AdminView
18 |
19 |
20 | class AnalyticsView(AdminView):
21 |     template_name = "admin/analytics.html"
22 |     permission_required = set()
23 |     title = "Analytics"
24 |
25 |     def get_context_data(self, **kwargs):
26 |         context = super().get_context_data(**kwargs)
27 |         if self.request.user.has_perm("se.view_crawlerstats"):
28 |             context["crawlers_count"] = WorkerStats.objects.count()
29 |         return context
30 |
--------------------------------------------------------------------------------
/se/apps.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from django.apps import AppConfig
17 | from django.contrib.admin.apps import AdminConfig
18 |
19 |
20 | class SEConfig(AppConfig):
21 |     name = "se"
22 |     verbose_name = "Crawling"
23 |     default_auto_field = "django.db.models.AutoField"
24 |
25 |
26 | class SEAdminConfig(AdminConfig):
27 |     default_site = "se.admin.get_admin"
28 |
--------------------------------------------------------------------------------
/se/crawlers.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 |
17 | from .models import WorkerStats
18 | from .views import AdminView
19 |
20 |
21 | class CrawlersOperationMixin:
22 |     def get_permission_required(self):
23 |         if self.request.method == "POST":
24 |             return {"se.change_crawlerstats"}
25 |         return super().get_permission_required()
26 |
27 |     def post(self, request):
28 |         if "pause" in request.POST:
29 |             WorkerStats.objects.update(state="paused")
30 |         if "resume" in request.POST:
31 |             WorkerStats.objects.update(state="running")
32 |             WorkerStats.wake_up()
33 |         return self.get(request)
34 |
35 |
36 | class CrawlersContentView(AdminView):
37 |     template_name = "admin/crawlers_content.html"
38 |     permission_required = "se.view_crawlerstats"
39 |     admin_site = None
40 |
41 |     def __init__(self, *args, **kwargs):
42 |         self.admin_site = kwargs.pop("admin_site")
43 |         super().__init__(*args, **kwargs)
44 |
45 |     def get_context_data(self, **kwargs):
46 |         context = super().get_context_data(**kwargs)
47 |         crawlers = WorkerStats.live_state()
48 |         running_count = [c for c in crawlers if c.state != "exited"]
49 |         return context | {
50 |             "crawlers": crawlers,
51 |             "running_count": running_count,
52 |             "pause": WorkerStats.objects.filter(state="paused").count() == 0,
53 |         }
54 |
55 |
56 | class CrawlersView(CrawlersOperationMixin, CrawlersContentView):
57 |     title = "Crawlers"
58 |     template_name = "admin/crawlers.html"
59 |
--------------------------------------------------------------------------------
/se/download.py:
--------------------------------------------------------------------------------
1 | # Copyright 2024-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | import os
17 | from urllib.parse import unquote
18 |
19 | from django.conf import settings
20 | from django.views.generic import TemplateView
21 |
22 | from .archive import ArchiveMixin
23 | from .html_asset import HTMLAsset
24 | from .utils import mimetype_icon
25 | from .views import RedirectException
26 |
27 |
28 | class DownloadView(ArchiveMixin, TemplateView):
29 |     template_name = "se/download.html"
30 |     view_name = "download"
31 |
32 |     def get_context_data(self, *args, **kwargs) -> dict:
33 |         url = self._url_from_request()
34 |         asset = HTMLAsset.objects.filter(url=url).order_by("download_date").last()
35 |
36 |         if not asset or not os.path.exists(settings.SOSSE_HTML_SNAPSHOT_DIR + asset.filename):
37 |             raise RedirectException(self.doc.get_absolute_url())
38 |
39 |         asset_path = settings.SOSSE_HTML_SNAPSHOT_DIR + asset.filename
40 |
41 |         filename = url.rstrip("/").rsplit("/", 1)[1]
42 |         filename = unquote(filename)
43 |         if "." in filename:
44 |             filename = filename.rsplit(".", 1)[0]
45 |
46 |         extension = asset.filename.rsplit(".", 1)[1]
47 |         filename = f"{filename}.{extension}"
48 |
49 |         context = super().get_context_data()
50 |         return context | {
51 |             "url": self.request.build_absolute_uri(settings.SOSSE_HTML_SNAPSHOT_URL) + asset.filename,
52 |             "filename": filename,
53 |             "filesize": os.path.getsize(asset_path),
54 |             "icon": mimetype_icon(self.doc.mimetype),
55 |             "mimebase": self.doc.mimetype.split("/", 1)[0],
56 |         }
57 |
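The download name shown to the user combines the last path segment of the document URL with the extension of the stored asset. A minimal standalone sketch of that derivation, assuming a hypothetical helper name `download_filename` (it is not part of SOSSE) and the same assumption the view makes, that the asset filename contains a dot:

```python
from urllib.parse import unquote


def download_filename(url: str, asset_filename: str) -> str:
    """Hypothetical helper mirroring the filename logic of DownloadView."""
    # Take the last path segment of the URL, ignoring a trailing slash.
    name = url.rstrip("/").rsplit("/", 1)[1]
    # Decode percent-escapes such as %20.
    name = unquote(name)
    # Drop the extension advertised by the URL, if any.
    if "." in name:
        name = name.rsplit(".", 1)[0]
    # Use the extension of the file actually stored on disk instead.
    extension = asset_filename.rsplit(".", 1)[1]
    return f"{name}.{extension}"


print(download_filename("https://example.com/files/My%20Report.final.pdf", "ab/cdef.pdf"))
```

Only the final extension of the URL segment is replaced, so a name such as `My Report.final.pdf` keeps its inner dot.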
--------------------------------------------------------------------------------
/se/favicon.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from django.http import HttpResponse
17 | from django.shortcuts import get_object_or_404
18 | from django.views.generic import View
19 |
20 | from .models import FavIcon
21 | from .views import SosseLoginRequiredMixin
22 |
23 |
24 | class FavIconView(SosseLoginRequiredMixin, View):
25 |     def get(self, request, favicon_id):
26 |         fav = get_object_or_404(FavIcon, id=favicon_id)
27 |         return HttpResponse(fav.content, content_type=fav.mimetype)
28 |
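Python resolves `dispatch` from left to right across base classes, so the position of a login mixin relative to the view class decides whether its access check runs at all. The sketch below illustrates this with plain stand-in classes. `View` and `LoginRequiredMixin` here are simplified, hypothetical substitutes and not Django's classes, and the request is modelled as a dict:

```python
class View:
    """Stand-in for a base view: dispatch handles the request directly."""

    def dispatch(self, request):
        return "handled"


class LoginRequiredMixin:
    """Stand-in for an access mixin: checks first, then defers to the next class."""

    def dispatch(self, request):
        if not request.get("authenticated"):
            return "denied"
        return super().dispatch(request)


class MixinLast(View, LoginRequiredMixin):
    # View.dispatch is found first and never calls super(), so the check is skipped.
    pass


class MixinFirst(LoginRequiredMixin, View):
    # The mixin's dispatch runs first and guards the view.
    pass


anonymous = {}
print(MixinLast().dispatch(anonymous))
print(MixinFirst().dispatch(anonymous))
```

This is why access mixins are conventionally listed leftmost in the base class list.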
--------------------------------------------------------------------------------
/se/login.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 |
17 | from django.conf import settings
18 | from django.contrib.auth import REDIRECT_FIELD_NAME
19 | from django.contrib.auth.mixins import UserPassesTestMixin
20 | from django.contrib.auth.views import LoginView
21 |
22 |
23 | class SosseLoginRequiredMixin(UserPassesTestMixin):
24 |     login_url = None
25 |     redirect_field_name = REDIRECT_FIELD_NAME
26 |
27 |     def test_func(self):
28 |         if settings.SOSSE_ANONYMOUS_SEARCH:
29 |             return True
30 |         return self.request.user.is_authenticated
31 |
32 |
33 | class SELoginView(LoginView):
34 |     template_name = "admin/login.html"
35 |
--------------------------------------------------------------------------------
/se/management/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biolds/sosse/efe38e1b1dcb975fa8d77eeade941aa43339a1db/se/management/__init__.py
--------------------------------------------------------------------------------
/se/management/commands/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biolds/sosse/efe38e1b1dcb975fa8d77eeade941aa43339a1db/se/management/commands/__init__.py
--------------------------------------------------------------------------------
/se/management/commands/clear_html_archive.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from django.core.management.base import BaseCommand
17 |
18 | from ...html_asset import HTMLAsset
19 |
20 |
21 | class Command(BaseCommand):
22 |     help = "Clears archived HTML snapshots."
23 |
24 |     def handle(self, *args, **options):
25 |         self.stdout.write("Clearing archive, please wait...")
26 |         HTMLAsset.objects.update(download_date=None)
27 |         self.stdout.write("Done.")
28 |
--------------------------------------------------------------------------------
/se/management/commands/default_admin.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | import sys
17 |
18 | from django.contrib.auth.models import User
19 | from django.core.management.base import BaseCommand
20 |
21 |
22 | class Command(BaseCommand):
23 |     help = "Creates a default ``admin`` superuser with ``admin`` password,\ndoes nothing if at least one user already exists in the database."
24 |
25 |     def handle(self, *args, **options):
26 |         if User.objects.count() != 0:
27 |             self.stdout.write("The database already has a user, skipping default user creation")
28 |             sys.exit(0)
29 |
30 |         user = User.objects.create(username="admin", is_superuser=True, is_staff=True, is_active=True)
31 |         user.set_password("admin")
32 |         user.save()
33 |         self.stdout.write('Default user "admin", with password "admin" was created')
34 |
--------------------------------------------------------------------------------
/se/management/commands/default_conf.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from django.core.management.base import BaseCommand
17 |
18 | from sosse.conf import Conf
19 |
20 |
21 | class Command(BaseCommand):
22 |     help = "Outputs default configuration file to stdout."
23 |
24 |     def handle(self, *args, **options):
25 |         self.stdout.write(Conf.generate_default())
26 |
--------------------------------------------------------------------------------
/se/management/commands/generate_secret.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from django.core.management.base import BaseCommand
17 | from django.core.management.utils import get_random_secret_key
18 |
19 |
20 | class Command(BaseCommand):
21 |     help = "Generates a secret key to set in the configuration."
22 |     doc = "Generates a secret key that can be used in the :ref:`Configuration file `."
23 |
24 |     def handle(self, *args, **options):
25 |         # Escape % to avoid value interpolation in the conf file
26 |         # (https://docs.python.org/3/library/configparser.html#interpolation-of-values)
27 |         self.stdout.write(get_random_secret_key().replace("%", "%%"))
28 |
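The `%` doubling matters because `configparser` applies `%`-interpolation by default when a value is read back. A small sketch of the round trip using only the standard library. The section name `common`, the option name `secret_key`, and both helper functions are assumptions made for illustration, not SOSSE's actual configuration layout:

```python
import configparser


def escape_for_conf(value: str) -> str:
    """Double every % so the default interpolation reads it back literally."""
    return value.replace("%", "%%")


def roundtrip(secret: str) -> str:
    """Write the escaped secret into an in-memory conf and read it back."""
    parser = configparser.ConfigParser()
    # Hypothetical section and option names, for illustration only.
    parser.read_string(f"[common]\nsecret_key = {escape_for_conf(secret)}\n")
    return parser["common"]["secret_key"]


secret = "ab%cd%(x)s"
print(roundtrip(secret) == secret)

# Without escaping, reading the value fails during interpolation.
raw = configparser.ConfigParser()
raw.read_string("[common]\nsecret_key = ab%cd\n")
try:
    raw["common"]["secret_key"]
except configparser.InterpolationSyntaxError as exc:
    print("unescaped value rejected:", type(exc).__name__)
```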
--------------------------------------------------------------------------------
/se/management/commands/load_se.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from django.core.management.base import BaseCommand
17 |
18 | from ...models import SearchEngine
19 |
20 |
21 | class Command(BaseCommand):
22 |     help = "Loads a search engine definition from an OpenSearch Description formatted XML file."
23 |     doc = """Loads a :doc:`user/shortcuts` from an `OpenSearch Description `_ formatted XML file.
24 |
25 | Most search engines provide such a file, defined in the HTML of their web page.
26 | It can be found inside a ``<link>`` element below the ``<head>`` tag, for example `Brave Search `_ defines it as:
27 |
28 | .. code-block:: html
29 |
30 |
31 | """
32 |
33 |     def add_arguments(self, parser):
34 |         parser.add_argument(
35 |             "opensearch_file",
36 |             nargs=1,
37 |             type=str,
38 |             help="OpenSearch Description formatted XML file.",
39 |         )
40 |
41 |     def handle(self, *args, **options):
42 |         SearchEngine.parse_xml_file(options["opensearch_file"][0])
43 |
--------------------------------------------------------------------------------
/se/migrations/0003_sosse_1_1_0.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | # Generated by Django 3.2.12 on 2023-05-29 08:59
17 |
18 | from django.db import migrations, models
19 |
20 |
21 | class Migration(migrations.Migration):
22 |     dependencies = [
23 |         ("se", "0002_search_vector"),
24 |     ]
25 |
26 |     operations = [
27 |         migrations.AddField(
28 |             model_name="document",
29 |             name="show_on_homepage",
30 |             field=models.BooleanField(default=False, help_text="Display this document on the homepage"),
31 |         ),
32 |     ]
33 |
--------------------------------------------------------------------------------
/se/migrations/0004_sosse_1_2_0.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | # Generated by Django 3.2.19 on 2023-07-04 12:28
17 |
18 | from django.db import migrations, models
19 |
20 |
21 | class Migration(migrations.Migration):
22 |     dependencies = [
23 |         ("se", "0003_sosse_1_1_0"),
24 |     ]
25 |
26 |     operations = [
27 |         migrations.AddField(
28 |             model_name="crawlpolicy",
29 |             name="remove_nav_elements",
30 |             field=models.CharField(
31 |                 choices=[("yes", "Yes"), ("no", "No")],
32 |                 default="yes",
33 |                 help_text="Remove navigation related elements",
34 |                 max_length=4,
35 |             ),
36 |         ),
37 |     ]
38 |
--------------------------------------------------------------------------------
/se/migrations/0007_sosse_1_5_0.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | # Generated by Django 3.2.19 on 2023-08-26 19:20
17 |
18 | from django.db import migrations, models
19 |
20 |
21 | class Migration(migrations.Migration):
22 |     dependencies = [
23 |         ("se", "0006_sosse_1_4_0"),
24 |     ]
25 |
26 |     operations = [
27 |         migrations.AlterField(
28 |             model_name="crawlpolicy",
29 |             name="auth_login_url_re",
30 |             field=models.TextField(
31 |                 blank=True,
32 |                 help_text="A redirection to an URL matching the regexp will trigger authentication",
33 |                 null=True,
34 |                 verbose_name="Login URL regexp",
35 |             ),
36 |         ),
37 |     ]
38 |
--------------------------------------------------------------------------------
/se/migrations/0010_sosse_1_8_0.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | # Generated by Django 3.2.19 on 2023-11-11 21:28
17 |
18 | from django.db import migrations, models
19 |
20 |
21 | class Migration(migrations.Migration):
22 |     dependencies = [
23 |         ("se", "0009_sosse_1_7_0"),
24 |     ]
25 |
26 |     operations = [
27 |         migrations.AddField(
28 |             model_name="excludedurl",
29 |             name="starting_with",
30 |             field=models.BooleanField(
31 |                 default=False,
32 |                 help_text="Exclude all urls starting with the url pattern",
33 |             ),
34 |         ),
35 |         migrations.AddField(
36 |             model_name="link",
37 |             name="in_nav",
38 |             field=models.BooleanField(default=False),
39 |         ),
40 |     ]
41 |
--------------------------------------------------------------------------------
/se/migrations/0011_sosse_1_9_0.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | # Generated by Django 3.2.19 on 2024-01-22 11:26
17 |
18 | from django.db import migrations
19 |
20 |
21 | class Migration(migrations.Migration):
22 |     dependencies = [
23 |         ("se", "0010_sosse_1_8_0"),
24 |     ]
25 |
26 |     operations = [
27 |         migrations.RenameField(
28 |             model_name="crawlpolicy",
29 |             old_name="condition",
30 |             new_name="recursion",
31 |         ),
32 |         migrations.RenameField(
33 |             model_name="crawlpolicy",
34 |             old_name="crawl_depth",
35 |             new_name="recursion_depth",
36 |         ),
37 |     ]
38 |
--------------------------------------------------------------------------------
/se/migrations/0012_sosse_1_10_0.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | # Generated by Django 3.2.19 on 2024-06-30 09:45
17 |
18 | from django.db import migrations, models
19 |
20 |
21 | class Migration(migrations.Migration):
22 |     dependencies = [
23 |         ("se", "0011_sosse_1_9_0"),
24 |     ]
25 |
26 |     operations = [
27 |         migrations.AddField(
28 |             model_name="crawlpolicy",
29 |             name="hide_documents",
30 |             field=models.BooleanField(default=False, help_text="Hide documents from search results"),
31 |         ),
32 |         migrations.AddField(
33 |             model_name="document",
34 |             name="hidden",
35 |             field=models.BooleanField(default=False, help_text="Hide this document from search results"),
36 |         ),
37 |         migrations.AddField(
38 |             model_name="crawlpolicy",
39 |             name="enabled",
40 |             field=models.BooleanField(default=True),
41 |         ),
42 |     ]
43 |
--------------------------------------------------------------------------------
/se/migrations/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biolds/sosse/efe38e1b1dcb975fa8d77eeade941aa43339a1db/se/migrations/__init__.py
--------------------------------------------------------------------------------
/se/opensearch.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | from django.views.generic import TemplateView
17 |
18 |
19 | class OpensearchView(TemplateView):
20 |     template_name = "se/opensearch.xml"
21 |     content_type = "application/xml"
22 |
23 |     def get_context_data(self, **kwargs):
24 |         context = super().get_context_data(**kwargs)
25 |         return context | {"url": self.request.build_absolute_uri("/").rstrip("/")}
26 |
--------------------------------------------------------------------------------
/se/profile.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see .
15 |
16 | import json
17 |
18 | from .document import Document
19 | from .views import UserView
20 |
21 |
22 | class ProfileView(UserView):
23 |     template_name = "se/profile.html"
24 |     title = "Profile"
25 |
26 |     def get_context_data(self, **kwargs):
27 |         context = super().get_context_data(**kwargs)
28 |         return context | {"supported_langs": json.dumps(Document.get_supported_lang_dict())}
29 |
--------------------------------------------------------------------------------
/se/resources.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see <https://www.gnu.org/licenses/>.
15 |
16 | from .views import UserView
17 |
18 |
19 | class ResourcesView(UserView):
20 |     template_name = "se/resources.html"
21 |     title = "Resources"
22 |
--------------------------------------------------------------------------------
/se/rest_permissions.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2024 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see <https://www.gnu.org/licenses/>.
15 |
16 | from django.conf import settings
17 | from rest_framework import permissions
18 |
19 |
20 | class LoginRequiredPermission(permissions.BasePermission):
21 |     def has_permission(self, request, _):
22 |         if settings.SOSSE_ANONYMOUS_SEARCH:
23 |             return True
24 |         return request.user.is_authenticated
25 |
26 |
27 | class IsSuperUserOrStaff(permissions.BasePermission):
28 |     def has_permission(self, request, _):
29 |         return request.user and (request.user.is_superuser or request.user.is_staff)
30 |
31 |
32 | class DjangoModelPermissionsRW(permissions.DjangoModelPermissions):
33 |     """Permission checking class that checks Django model permissions.
34 |
35 |     Contrary to DjangoModelPermissions, this class also checks for read
36 |     permissions.
37 |     """
38 |
39 |     perms_map = permissions.DjangoModelPermissions.perms_map | {
40 |         "GET": ["%(app_label)s.view_%(model_name)s"],
41 |         "HEAD": ["%(app_label)s.view_%(model_name)s"],
42 |         "OPTIONS": ["%(app_label)s.view_%(model_name)s"],
43 |     }
44 |
--------------------------------------------------------------------------------
/se/screenshot.py:
--------------------------------------------------------------------------------
1 | # Copyright 2022-2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see <https://www.gnu.org/licenses/>.
15 |
16 | from django.conf import settings
17 | from django.views.generic import TemplateView
18 |
19 | from .archive import ArchiveMixin
20 |
21 |
22 | class ScreenshotView(ArchiveMixin, TemplateView):
23 |     template_name = "se/embed.html"
24 |     view_name = "screenshot"
25 |
26 |     def get_context_data(self, *args, **kwargs):
27 |         context = super().get_context_data(*args, **kwargs)
28 |         return context | {
29 |             "url": self.request.build_absolute_uri("/screenshot_full/") + self._url_from_request(),
30 |             "allow_scripts": True,
31 |         }
32 |
33 |
34 | class ScreenshotFullView(ArchiveMixin, TemplateView):
35 |     template_name = "se/screenshot_full.html"
36 |     view_name = "screenshot_full"
37 |
38 |     def get_context_data(self, *args, **kwargs):
39 |         context = super().get_context_data()
40 |         return context | {
41 |             "screenshot": settings.SOSSE_SCREENSHOTS_URL + "/" + self.doc.image_name(),
42 |             "screenshot_size": self.doc.screenshot_size.split("x"),
43 |             "screenshot_format": self.doc.screenshot_format,
44 |             "screenshot_mime": ("image/png" if self.doc.screenshot_format == "png" else "image/jpeg"),
45 |             "links": self.doc.links_to.filter(screen_pos__isnull=False).order_by("link_no"),
46 |             "screens": range(self.doc.screenshot_count),
47 |         }
48 |
--------------------------------------------------------------------------------
/se/search_redirect.py:
--------------------------------------------------------------------------------
1 | # Copyright 2025 Laurent Defert
2 | #
3 | # This file is part of SOSSE.
4 | #
5 | # SOSSE is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
6 | # General Public License as published by the Free Software Foundation, either version 3 of the
7 | # License, or (at your option) any later version.
8 | #
9 | # SOSSE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
10 | # the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | # See the GNU Affero General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Affero General Public License along with SOSSE.
14 | # If not, see <https://www.gnu.org/licenses/>.
15 |
16 | from urllib.parse import quote_plus
17 |
18 | from django.conf import settings
19 | from django.views.generic import TemplateView
20 |
21 | from .login import SosseLoginRequiredMixin
22 |
23 |
24 | class SearchRedirectView(SosseLoginRequiredMixin, TemplateView):
25 |     template_name = "se/search_redirect.html"
26 |
27 |     def get_context_data(self, **kwargs):
28 |         context = super().get_context_data(**kwargs)
29 |         return context | {
30 |             "url": self.request.build_absolute_uri("/"),
31 |             "q": quote_plus(self.request.GET.get("q", "")),
32 |             "settings": settings,
33 |         }
34 |
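As an illustration of the ``q`` context value built above, here is a small standalone sketch of what ``quote_plus`` does to a raw query string. ``encode_query`` is a hypothetical helper, and the plain ``dict`` stands in for ``request.GET``; neither is part of SOSSE.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_query(params):
    """Hypothetical helper: mirror the view's handling of the "q" parameter.

    `params` is a plain dict standing in for request.GET. A missing "q"
    falls back to an empty string, as in SearchRedirectView.
    """
    return quote_plus(params.get("q", ""))


if __name__ == "__main__":
    encoded = encode_query({"q": "c++ & rust"})
    # Spaces become "+", while "+" and "&" are percent-encoded so that the
    # value can be embedded safely in another URL's query string.
    print(encoded)
    print(unquote_plus(encoded))
```

Encoding the value on the server side means the template can append it to a URL without the query being split on characters such as ``&``.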
--------------------------------------------------------------------------------
/se/static/se/admin-webhooks.js:
--------------------------------------------------------------------------------
1 | function test_webhook() {
2 |   var webhookData = {};
3 |   var form = document.getElementById("webhook_form");
4 |
5 |   form.querySelectorAll("input, select, textarea").forEach(function (input) {
6 |     if (input.name && input.id.startsWith("id_")) {
7 |       webhookData[input.name] = input.value;
8 |     }
9 |   });
10 |
11 |   var resultDiv = document.getElementById("webhook_test_result");
12 |   if (!resultDiv) {
13 |     resultDiv = document.createElement("div");
14 |     resultDiv.id = "webhook_test_result";
15 |     resultDiv.style.width = "100%";
16 |     resultDiv.style.marginTop = "10px";
17 |
18 |     var webhookTestField = document.getElementById("webhook_test_button");
19 |     webhookTestField.parentElement.appendChild(resultDiv);
20 |   }
21 |
22 |   resultDiv.innerHTML = "Processing request...";
23 |   var payload = JSON.stringify(webhookData);
24 |
25 |   fetch("/api/webhook/test_trigger/?as_html=1", {
26 |     method: "POST",
27 |     headers: {
28 |       "Content-Type": "application/json",
29 |       "X-CSRFToken": document.querySelector("[name=csrfmiddlewaretoken]").value,
30 |     },
31 |     body: payload,
32 |   })
33 |     .then((response) => {
34 |       console.log(response);
35 |       if (response.status !== 200) {
36 |         throw new Error(
37 |           `HTTP error! status: ${response.status} : ${response.statusText}`,
38 |         );
39 |       }
40 |       // Return the promise so that body-reading errors reach the catch handler
41 |       return response.text().then((body) => {
42 |         resultDiv.innerHTML = body;
43 |       });
44 |     })
45 |     .catch((error) => {
46 |       // A <div> has no "value" property: display the error as text content
47 |       resultDiv.textContent = `Error: ${error}`;
48 |       console.error("Error:", error);
49 |     });
48 | }
49 |
--------------------------------------------------------------------------------
/se/static/se/discord-symbol.svg:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/se/static/se/github-mark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biolds/sosse/efe38e1b1dcb975fa8d77eeade941aa43339a1db/se/static/se/github-mark.png
--------------------------------------------------------------------------------
/se/static/se/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/biolds/sosse/efe38e1b1dcb975fa8d77eeade941aa43339a1db/se/static/se/logo.png
--------------------------------------------------------------------------------
/se/static/se/screenshot.js:
--------------------------------------------------------------------------------
1 | let images, links;
2 |
3 | function resize() {
4 |   // width with implicit margin
5 |   const w_width = document.body.getBoundingClientRect().width;
6 |   const ratio = w_width / screen_width;
7 |
8 |   for (let i = 0; i < images.length; i++) {
9 |     const img = images[i];
10 |     img.style.width = `${screen_width * ratio}px`;
11 |   }
12 |
13 |   for (let i = 0; i < links.length; i++) {
14 |     const link = links[i];
15 |     const [elemLeft, elemTop, elemWidth, elemHeight] = link.dataset.loc.split(",");
16 |     link.style.left = elemLeft * ratio + "px";
17 |     link.style.top = elemTop * ratio + "px";
18 |     link.style.width = elemWidth * ratio + "px";
19 |     link.style.height = elemHeight * ratio + "px";
20 |   }
21 | }
22 |
23 | document.addEventListener("DOMContentLoaded", function (event) {
24 |   links = document.querySelectorAll("#screenshots > a");
25 |   images = document.querySelectorAll("#screenshots > img");
26 |
27 |   window.addEventListener("resize", function () {
28 |     resize();
29 |   });
30 |
31 |   resize();
32 |
33 |   // Work-around in case the initial resize() was done while no image was loaded yet
34 |   setTimeout(resize, 300);
35 | });
36 |
--------------------------------------------------------------------------------
/se/templates/admin/base_site.html:
--------------------------------------------------------------------------------
1 | {% extends 'admin/base.html' %}
2 | {% load static %}
3 |
4 | {% block title %}SOSSE · Configuration{% endblock %}
5 |
--------------------------------------------------------------------------------
/se/templates/admin/change_form.html:
--------------------------------------------------------------------------------
1 | {% extends "admin/change_form.html" %}
2 | {% load i18n admin_urls static admin_modify %}
3 |
4 | {% block breadcrumbs %}
5 |
43 | {% if doc.redirect_url %}
44 |
45 | This page redirects to {{ doc.redirect_url }} · 🌍
46 | {% endif %}
47 | {% if doc.too_many_redirects %}
48 |
49 | Redirection was not followed, because the crawler was redirected too many times
50 | {% endif %}
51 |
5 | This file was not {% if method == 'mimetype' %}saved{% else %}downloaded{% endif %} because its {{ method }} matched the exclusion regex of assets to download.
6 |
7 | {% if crawl_policy and 'se.view_crawlpolicy' in perms %}
8 |
9 | The crawl policy {{ crawl_policy }} can be modified to make this file available.
10 |
11 | {% endif %}
12 | {% endblock %}
13 |
--------------------------------------------------------------------------------
/se/templates/se/info_fallback.html:
--------------------------------------------------------------------------------
1 | {% if doc and not doc.crawl_last %}
2 |
3 | This page has not been crawled yet.
4 |
5 | {% elif doc and doc.robotstxt_rejected %}
6 |
7 | Crawling this page was rejected by a robots.txt rule.
8 |
13 | {% include "se/components/tag_action.html" with text="📝 Edit" href=view_tags_href %}
14 |
15 |
16 | {% include "se/components/tag_action.html" with text="✏️ Create" href=create_tag_href %}
17 |
18 | {% endif %}
19 | Selected:
20 |
21 | {% for tag in tags %}
22 | {% include "se/components/tag.html" with suffix="-edit" on_delete=tag.js_add_tag_onclick classes="tag-select" bold=True %}
23 | {% endfor %}
24 | {% include "se/components/tag_action.html" with id="clear_selected_tags" text="⨉ Clear" onclick="clear_tags()" %}
25 |
26 |
27 | {% endif %}
28 |
29 |
30 |
31 | {% for tag in root_tags %}
32 | {# one div per root tag to make the grid layout #}
33 |
34 | {% for child in tag.descendants %}
35 |
{# this div makes sure the tag takes the full width of the grid layout's panel #}
36 | {% include "se/components/tag.html" with tag=child with_padding=True with_counters=True onclick=child.js_add_tag_onclick cursor_pointer=True %}
37 |