25 |
26 |
500 Internal Server Error
27 |
Sorry; something went wrong.
28 |
29 |
30 |
31 |
5 |
Wikilink
6 |
7 | The Wikilink tool helps program organisers and organisations track external links on Wikimedia projects. While
8 | MediaWiki has the ability to search existing
9 | links, at the time of writing there is no way to easily monitor link additions and removals over time. The
10 | tool was built primarily for The Wikipedia Library's use case. Publishers donate access to Wikipedia editors,
11 | and while it was possible to monitor the total number of links over time, there was no simple way to investigate
12 | that data further - to find out where links were being added, who was adding them, or in the case of a drop
13 | in link numbers, why those links were removed.
14 |
15 |
16 |
Using the tool
17 |
18 | There are two primary views into the data - the 'program' level and 'organisation' level.
19 |
20 |
Programs
21 |
22 | Programs are collections of organisations. Program pages provide a high level overview of the link additions
23 | and removals for many organisations in one place. If you have partnerships with multiple organisations,
24 | the program pages can provide data about their data in aggregate for reporting purposes.
25 |
26 |
Organisations
27 |
28 | Organisation pages provide data relevant to an individual organisation. Organisations can have multiple
29 | collections of tracked URLs - these could be different websites or simply different URL patterns. Results
30 | for each collection are presented individually. Additionally, each collection can have multiple URLs. This is
31 | useful primarily in the case that a website has moved; both URLs can continue to be tracked in the same place.
32 |
33 |
34 |
Data collection
35 |
36 | Two sets of data are collected: Link events and totals
37 |
38 |
Link events
39 |
40 | A
41 | script is always monitoring the
42 | page-links-change
43 | event stream; when a link tracked by Wikilink is added or removed, the data is stored in Wikilink's database.
44 |
45 |
46 | The event stream reports link additions and removals from all Wikimedia projects and languages, and tracks
47 | events from all namespaces. If a link is changed, it will register both an addition (the new URL) and a removal
48 | (the old URL). Editing the same URL multiple times in one edit will only send a single event.
49 |
50 |
51 | Please be aware there is currently a known bug with the
52 | event stream whereby some additional events are being sent related to template transclusions.
53 |
54 |
Link totals
55 |
56 | The tool also tracks the total number of links to each tracked URL on a weekly basis. These totals are
57 | retrieved from the externallinks table.
58 | Currently, these totals only consider Wikipedia projects, however they do cover every language. Unlike with the
59 | event stream, queries have to be made against each project's database individually, and it is therefore
60 | prohibitive to collect total data for every Wikimedia project.
61 |
62 |
63 | {% endblock %}
64 |
--------------------------------------------------------------------------------
/extlinks/aggregates/migrations/0012_programtopuserstotal_programtopprojectstotal_and_more.py:
--------------------------------------------------------------------------------
1 | # Generated by Django 4.2.20 on 2025-03-26 18:18
2 |
3 | from django.db import migrations, models
4 | import django.db.models.deletion
5 |
6 |
7 | class Migration(migrations.Migration):
8 |
9 | dependencies = [
10 | ('organisations', '0008_alter_collection_id_alter_organisation_id_and_more'),
11 | ('programs', '0003_alter_program_id'),
12 | ('aggregates', '0011_aggregate_composite_indexes'),
13 | ]
14 |
15 | operations = [
16 | migrations.CreateModel(
17 | name='ProgramTopUsersTotal',
18 | fields=[
19 | ('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
20 | ('username', models.CharField(max_length=235)),
21 | ('full_date', models.DateField()),
22 | ('on_user_list', models.BooleanField(default=False)),
23 | ('total_links_added', models.PositiveIntegerField()),
24 | ('total_links_removed', models.PositiveIntegerField()),
25 | ('created_at', models.DateTimeField(auto_now_add=True)),
26 | ('updated_at', models.DateTimeField(auto_now=True)),
27 | ('program', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='programs.program')),
28 | ],
29 | options={
30 | 'indexes': [models.Index(fields=['program_id', 'full_date', 'username'], name='aggregates__program_885240_idx'), models.Index(fields=['program_id', 'username'], name='aggregates__program_5e05d9_idx')],
31 | },
32 | ),
33 | migrations.CreateModel(
34 | name='ProgramTopProjectsTotal',
35 | fields=[
36 | ('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
37 | ('project_name', models.CharField(max_length=32)),
38 | ('full_date', models.DateField()),
39 | ('on_user_list', models.BooleanField(default=False)),
40 | ('total_links_added', models.PositiveIntegerField()),
41 | ('total_links_removed', models.PositiveIntegerField()),
42 | ('created_at', models.DateTimeField(auto_now_add=True)),
43 | ('updated_at', models.DateTimeField(auto_now=True)),
44 | ('program', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='programs.program')),
45 | ],
46 | options={
47 | 'indexes': [models.Index(fields=['program_id', 'full_date', 'project_name'], name='aggregates__program_ef06a4_idx'), models.Index(fields=['program_id', 'project_name'], name='aggregates__program_84ed52_idx')],
48 | },
49 | ),
50 | migrations.CreateModel(
51 | name='ProgramTopOrganisationsTotal',
52 | fields=[
53 | ('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
54 | ('full_date', models.DateField()),
55 | ('on_user_list', models.BooleanField(default=False)),
56 | ('total_links_added', models.PositiveIntegerField()),
57 | ('total_links_removed', models.PositiveIntegerField()),
58 | ('created_at', models.DateTimeField(auto_now_add=True)),
59 | ('updated_at', models.DateTimeField(auto_now=True)),
60 | ('organisation', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='organisations.organisation')),
61 | ('program', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='programs.program')),
62 | ],
63 | options={
64 | 'indexes': [models.Index(fields=['program_id', 'full_date', 'organisation_id'], name='aggregates__program_0db533_idx'), models.Index(fields=['program_id', 'organisation_id'], name='aggregates__program_fae7db_idx')],
65 | },
66 | ),
67 | ]
68 |
--------------------------------------------------------------------------------
/extlinks/links/migrations/0001_initial.py:
--------------------------------------------------------------------------------
1 | # Generated by Django 2.2 on 2019-05-20 14:01
2 |
3 | from django.db import migrations, models
4 | import django.db.models.deletion
5 |
6 |
7 | class Migration(migrations.Migration):
8 |
9 | initial = True
10 |
11 | dependencies = [
12 | ("organisations", "0001_initial"),
13 | ]
14 |
15 | operations = [
16 | migrations.CreateModel(
17 | name="URLPattern",
18 | fields=[
19 | (
20 | "id",
21 | models.AutoField(
22 | auto_created=True,
23 | primary_key=True,
24 | serialize=False,
25 | verbose_name="ID",
26 | ),
27 | ),
28 | ("url", models.CharField(max_length=60)),
29 | (
30 | "collection",
31 | models.ForeignKey(
32 | null=True,
33 | on_delete=django.db.models.deletion.SET_NULL,
34 | related_name="url",
35 | to="organisations.Collection",
36 | ),
37 | ),
38 | ],
39 | options={
40 | "verbose_name_plural": "URL patterns",
41 | "verbose_name": "URL pattern",
42 | },
43 | ),
44 | migrations.CreateModel(
45 | name="LinkSearchTotal",
46 | fields=[
47 | (
48 | "id",
49 | models.AutoField(
50 | auto_created=True,
51 | primary_key=True,
52 | serialize=False,
53 | verbose_name="ID",
54 | ),
55 | ),
56 | ("date", models.DateField(auto_now_add=True)),
57 | ("total", models.PositiveIntegerField()),
58 | (
59 | "url",
60 | models.ForeignKey(
61 | null=True,
62 | on_delete=django.db.models.deletion.SET_NULL,
63 | to="links.URLPattern",
64 | ),
65 | ),
66 | ],
67 | options={
68 | "verbose_name_plural": "LinkSearch totals",
69 | "verbose_name": "LinkSearch total",
70 | },
71 | ),
72 | migrations.CreateModel(
73 | name="LinkEvent",
74 | fields=[
75 | (
76 | "id",
77 | models.AutoField(
78 | auto_created=True,
79 | primary_key=True,
80 | serialize=False,
81 | verbose_name="ID",
82 | ),
83 | ),
84 | ("link", models.CharField(max_length=2083)),
85 | ("timestamp", models.DateTimeField()),
86 | ("domain", models.CharField(max_length=32)),
87 | ("username", models.CharField(max_length=255)),
88 | ("rev_id", models.PositiveIntegerField(null=True)),
89 | ("user_id", models.PositiveIntegerField()),
90 | ("page_title", models.CharField(max_length=255)),
91 | ("page_namespace", models.IntegerField()),
92 | ("event_id", models.CharField(max_length=36)),
93 | ("change", models.IntegerField(choices=[(0, "Removed"), (1, "Added")])),
94 | ("on_user_list", models.BooleanField(default=False)),
95 | (
96 | "url",
97 | models.ManyToManyField(
98 | related_name="linkevent", to="links.URLPattern"
99 | ),
100 | ),
101 | ],
102 | options={
103 | "get_latest_by": "timestamp",
104 | },
105 | ),
106 | ]
107 |
--------------------------------------------------------------------------------
/.github/workflows/dockerpublish.yml:
--------------------------------------------------------------------------------
1 | name: Docker
2 |
3 | on:
4 | push:
5 | # Publish `master` as Docker `latest` image.
6 | branches:
7 | - master
8 | - staging
9 |
10 | # Run tests for any PRs.
11 | pull_request:
12 |
13 | jobs:
14 | # Run tests.
15 | test:
16 | # Ensure latest python image is mirrored before running tests.
17 | runs-on: ubuntu-latest
18 | steps:
19 | - uses: actions/checkout@v4
20 | - name: Build and Start Images
21 | run: |
22 | cp template.env .env
23 | docker compose up -d --build
24 | - name: Run tests
25 | run: |
26 | docker compose exec -T externallinks /app/bin/django_wait_for_db.sh python django_wait_for_migrations.py test
27 |
28 | # Push images to quay.io/wikipedialibrary.
29 | push:
30 | # Ensure test job passes before pushing images.
31 | needs: test
32 | runs-on: ubuntu-latest
33 | if: github.event_name == 'push'
34 |
35 | steps:
36 | - uses: actions/checkout@v4
37 |
38 | - name: Log into quay.io
39 | run: echo "${{ secrets.CR_PASSWORD }}" | docker login quay.io -u ${{ secrets.CR_USERNAME }} --password-stdin
40 |
41 | - name: Build Images
42 | run: |
43 | cp template.env .env
44 | docker compose build
45 |
46 | - name: Set branch tag
47 | id: branch
48 | run: |
49 | # Strip git ref prefix from version
50 | branch_tag=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,')
51 |
52 | # Strip "v" prefix from tag name
53 | [[ "${{ github.ref }}" == "refs/tags/"* ]] && branch_tag=$(echo $branch_tag | sed -e 's/^v//')
54 |
55 | # preprend with "branch_" so we know what the tag means by looking at it.
56 | branch_tag="branch_${branch_tag}"
57 |
58 | echo ::set-output name=tag::$(echo $branch_tag)
59 |
60 | - name: Set commit tag
61 | id: commit
62 | run: |
63 | # The short git commit object name.
64 | commit_tag=${GITHUB_SHA::8}
65 |
66 | # prepend with "commit_" so we know what the tag means by looking at it.
67 | commit_tag="commit_${commit_tag}"
68 |
69 | echo ::set-output name=tag::$(echo $commit_tag)
70 |
71 | - name: Push externallinks image to quay.io/wikipedialibrary
72 | run: |
73 | # The image name represents both the local image name and the remote image repository.
74 | image_name=quay.io/wikipedialibrary/externallinks
75 | branch_tag=${{ steps.branch.outputs.tag }}
76 | commit_tag=${{ steps.commit.outputs.tag }}
77 |
78 | docker tag ${image_name}:latest ${image_name}:${branch_tag}
79 | docker tag ${image_name}:latest ${image_name}:${commit_tag}
80 | docker push ${image_name}:${branch_tag}
81 | docker push ${image_name}:${commit_tag}
82 |
83 | - name: Push eventstream image to quay.io/wikipedialibrary
84 | run: |
85 | # The image name represents both the local image name and the remote image repository.
86 | image_name=quay.io/wikipedialibrary/eventstream
87 | branch_tag=${{ steps.branch.outputs.tag }}
88 | commit_tag=${{ steps.commit.outputs.tag }}
89 |
90 | docker tag ${image_name}:latest ${image_name}:${branch_tag}
91 | docker tag ${image_name}:latest ${image_name}:${commit_tag}
92 | docker push ${image_name}:${branch_tag}
93 | docker push ${image_name}:${commit_tag}
94 |
95 | - name: Push externallinks_cron image to quay.io/wikipedialibrary
96 | run: |
97 | # The image name represents both the local image name and the remote image repository.
98 | image_name=quay.io/wikipedialibrary/externallinks_cron
99 | branch_tag=${{ steps.branch.outputs.tag }}
100 | commit_tag=${{ steps.commit.outputs.tag }}
101 |
102 | docker tag ${image_name}:latest ${image_name}:${branch_tag}
103 | docker tag ${image_name}:latest ${image_name}:${commit_tag}
104 | docker push ${image_name}:${branch_tag}
105 | docker push ${image_name}:${commit_tag}
106 |
--------------------------------------------------------------------------------
/nginx.conf:
--------------------------------------------------------------------------------
1 | map $http_x_forwarded_proto $web_proxy_scheme {
2 | default $scheme;
3 | https https;
4 | }
5 |
6 | map $http_user_agent $limit_bots {
7 | default "";
8 | ~*(GoogleBot|bingbot|YandexBot|mj12bot|Apache-HttpClient|Adsbot|Barkrowler|FacebookBot|dotbot|Googlebot|Bytespider|SemrushBot|AhrefsBot|Amazonbot|GPTBot|DotBot) $binary_remote_addr;
9 | }
10 |
11 | ## Testing the request method
12 | # Only GET and HEAD are caching safe.
13 | map $request_method $no_cache_method {
14 | default 1;
15 | HEAD 0;
16 | GET 0;
17 | }
18 |
19 | ## Testing for Cache-Control header
20 | # Only checking for no-cache because chrome annoyingly sets max-age=0 when hitting enter in the address bar.
21 | map $http_cache_control $no_cache_control {
22 | default 0;
23 | no-cache 1;
24 | }
25 |
26 | ## Testing for the session cookie being present
27 | map $http_cookie $no_cache_session {
28 | default 0;
29 | ~sessionid 1; # Django session cookie
30 | }
31 |
32 | ## proxy caching settings.
33 | proxy_cache_path /var/lib/nginx/cache levels=1:2 keys_zone=cache:8m max_size=10g inactive=10m;
34 | proxy_cache_key "$scheme$proxy_host$uri$is_args$args$http_accept_language";
35 | proxy_cache_lock on;
36 | proxy_cache_use_stale error timeout invalid_header updating http_500 http_502 http_503 http_504;
37 |
38 | # remote address is a joke here since we don't have x-forwarded-for
39 | limit_req_zone $limit_bots zone=bots:10m rate=1r/s;
40 | limit_req_zone $binary_remote_addr zone=one:10m rate=500r/s;
41 |
42 | upstream django_server {
43 | server externallinks:8000 fail_timeout=0;
44 | }
45 |
46 | server {
47 | listen 80 deferred;
48 | client_max_body_size 4G;
49 | server_name wikilink.wmflabs.org;
50 | keepalive_timeout 5;
51 |
52 | # Definied explicitly to avoid caching
53 | location /healthcheck/link_event {
54 | # Rate limit
55 | limit_req zone=bots burst=2 nodelay;
56 | limit_req zone=one burst=1000 nodelay;
57 | limit_req_status 429;
58 | # Proxy
59 | proxy_set_header X-Forwarded-Proto $web_proxy_scheme;
60 | proxy_set_header Host $http_host;
61 | proxy_redirect off;
62 | proxy_pass http://django_server;
63 | }
64 |
65 | location = /robots.txt {
66 | add_header Content-Type text/plain;
67 | alias /app/robots.txt;
68 | }
69 |
70 | location / {
71 | root /app/;
72 | expires 30d;
73 |
74 | if ($http_user_agent ~* (GoogleBot|bingbot|YandexBot|mj12bot|Apache-HttpClient|Adsbot|Barkrowler|FacebookBot|dotbot|Bytespider|SemrushBot|AhrefsBot|Amazonbot|GPTBot) ) {
75 | return 403;
76 | }
77 | location /admin/links/ {
78 | try_files $uri @django-admin-slow;
79 | }
80 | # checks for static file, if not found proxy to app
81 | try_files $uri @django;
82 | }
83 |
84 | location @django {
85 | # Cache
86 | proxy_cache_valid 200 301 302 401 403 404 1d;
87 | proxy_cache_bypass $http_pragma $no_cache_method $no_cache_control $no_cache_session;
88 | proxy_cache_revalidate on;
89 | proxy_cache cache;
90 | add_header X-Cache-Status $upstream_cache_status;
91 | # Rate limit
92 | limit_req zone=bots burst=2 nodelay;
93 | limit_req zone=one burst=1000 nodelay;
94 | limit_req_status 429;
95 | # Proxy
96 | proxy_set_header X-Forwarded-Proto $web_proxy_scheme;
97 | proxy_set_header Host $http_host;
98 | proxy_redirect off;
99 | proxy_pass http://django_server;
100 | }
101 |
102 | location @django-admin-slow {
103 | # https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_send_timeout
104 | proxy_connect_timeout 120s;
105 | proxy_send_timeout 120s;
106 | proxy_read_timeout 120s;
107 | # https://nginx.org/en/docs/http/ngx_http_core_module.html#send_timeout
108 | send_timeout 120s;
109 | keepalive_timeout 120s;
110 | # Cache
111 | proxy_cache_valid 200 301 302 401 403 404 1d;
112 | proxy_cache_bypass $http_pragma $no_cache_method $no_cache_control $no_cache_session;
113 | proxy_cache_revalidate on;
114 | proxy_cache cache;
115 | add_header X-Cache-Status $upstream_cache_status;
116 | # Rate limit
117 | limit_req zone=bots burst=2 nodelay;
118 | limit_req zone=one burst=1000 nodelay;
119 | limit_req_status 429;
120 | # Proxy
121 | proxy_set_header X-Forwarded-Proto $web_proxy_scheme;
122 | proxy_set_header Host $http_host;
123 | proxy_redirect off;
124 | proxy_pass http://django_server;
125 | }
126 |
127 | proxy_intercept_errors on;
128 | error_page 500 501 502 503 504 505 506 /500.html;
129 |
130 | location = /500.html {
131 | root /app/500;
132 | internal;
133 | }
134 | }
135 |
--------------------------------------------------------------------------------
/extlinks/settings/base.py:
--------------------------------------------------------------------------------
1 | """
2 | Django settings for extlinks project.
3 | """
4 |
5 | import os
6 | from pathlib import Path
7 |
8 |
9 | SECRET_KEY = os.environ["SECRET_KEY"]
10 | # Usually we'd define this relative to the settings file, but we're always
11 | # starting from /app in Docker.
12 | BASE_DIR = "/app"
13 |
14 | ALLOWED_HOSTS = ["127.0.0.1", "localhost", "0.0.0.0"]
15 |
16 | # Application definition
17 |
18 | INSTALLED_APPS = [
19 | "django.contrib.admin",
20 | "django.contrib.auth",
21 | "django.contrib.contenttypes",
22 | "django.contrib.sessions",
23 | "django.contrib.messages",
24 | "django.contrib.staticfiles",
25 | "extlinks.common",
26 | "extlinks.healthcheck",
27 | "extlinks.links",
28 | "extlinks.organisations",
29 | "extlinks.programs",
30 | "extlinks.aggregates",
31 | "django_extensions",
32 | ]
33 |
34 | MIDDLEWARE = [
35 | "django.middleware.security.SecurityMiddleware",
36 | "django.contrib.sessions.middleware.SessionMiddleware",
37 | "django.middleware.common.CommonMiddleware",
38 | "django.middleware.csrf.CsrfViewMiddleware",
39 | "django.contrib.auth.middleware.AuthenticationMiddleware",
40 | "django.contrib.messages.middleware.MessageMiddleware",
41 | "django.middleware.clickjacking.XFrameOptionsMiddleware",
42 | ]
43 |
44 | ROOT_URLCONF = "extlinks.urls"
45 |
46 | TEMPLATES = [
47 | {
48 | "BACKEND": "django.template.backends.django.DjangoTemplates",
49 | "DIRS": [os.path.join(BASE_DIR, "extlinks", "templates")],
50 | "APP_DIRS": True,
51 | "OPTIONS": {
52 | "context_processors": [
53 | "django.template.context_processors.debug",
54 | "django.template.context_processors.request",
55 | "django.contrib.auth.context_processors.auth",
56 | "django.contrib.messages.context_processors.messages",
57 | ],
58 | },
59 | },
60 | ]
61 |
62 | WSGI_APPLICATION = "extlinks.wsgi.application"
63 |
64 | # Database
65 | # https://docs.djangoproject.com/en/4.2/ref/settings/#databases
66 |
67 | DATABASES = {
68 | "default": {
69 | "ENGINE": "django.db.backends.mysql",
70 | "NAME": os.environ["MYSQL_DATABASE"],
71 | "USER": "root",
72 | "PASSWORD": os.environ["MYSQL_ROOT_PASSWORD"],
73 | "HOST": "db",
74 | "PORT": "3306",
75 | "OPTIONS": {"charset": "utf8mb4"},
76 | "CONN_MAX_AGE": None,
77 | "CONN_HEALTH_CHECKS": True,
78 | }
79 | }
80 |
81 | # Password validation
82 | # https://docs.djangoproject.com/en/4.2/ref/settings/#auth-password-validators
83 |
84 | AUTH_PASSWORD_VALIDATORS = [
85 | {
86 | "NAME": "django.contrib.auth.password_validation.UserAttributeSimilarityValidator",
87 | },
88 | {
89 | "NAME": "django.contrib.auth.password_validation.MinimumLengthValidator",
90 | },
91 | {
92 | "NAME": "django.contrib.auth.password_validation.CommonPasswordValidator",
93 | },
94 | {
95 | "NAME": "django.contrib.auth.password_validation.NumericPasswordValidator",
96 | },
97 | {
98 | "NAME": "django.contrib.auth.password_validation.MinimumLengthValidator",
99 | },
100 | {
101 | "NAME": "django.contrib.auth.password_validation.CommonPasswordValidator",
102 | },
103 | {
104 | "NAME": "django.contrib.auth.password_validation.NumericPasswordValidator",
105 | },
106 | ]
107 |
108 | # Internationalization
109 | # https://docs.djangoproject.com/en/4.2/topics/i18n/
110 |
111 | LANGUAGE_CODE = "en-us"
112 |
113 | TIME_ZONE = "UTC"
114 |
115 | USE_I18N = True
116 |
117 | USE_L10N = True
118 |
119 | USE_TZ = True
120 |
121 | # Cache
122 |
123 | CACHES = {
124 | "default": {
125 | "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
126 | "LOCATION": "cache:11211",
127 | "TIMEOUT": 600,
128 | "OPTIONS": {
129 | "no_delay": True,
130 | "ignore_exc": True,
131 | "max_pool_size": 4,
132 | "use_pooling": True,
133 | },
134 | }
135 | }
136 |
137 | # Static files (CSS, JavaScript, Images)
138 | # https://docs.djangoproject.com/en/4.2/howto/static-files/
139 |
140 | STATIC_URL = "/static/"
141 | STATIC_ROOT = os.path.join(BASE_DIR, "static")
142 |
143 | # EMAIL CONFIGURATION
144 | # ------------------------------------------------------------------------------
145 | EMAIL_BACKEND = "django.core.mail.backends.console.EmailBackend"
146 | EMAIL_HOST = os.environ.get("DJANGO_EMAIL_HOST", "localhost")
147 | EMAIL_PORT = 25
148 | EMAIL_HOST_USER = ""
149 | EMAIL_HOST_PASSWORD = ""
150 | EMAIL_USE_TLS = False
151 |
152 | DEFAULT_AUTO_FIELD = "django.db.models.BigAutoField"
153 |
--------------------------------------------------------------------------------
/extlinks/links/management/commands/remove_ezproxy_collection.py:
--------------------------------------------------------------------------------
1 | from django.contrib.contenttypes.models import ContentType
2 | from extlinks.common.management.commands import BaseCommand
3 | from django.core.management import call_command
4 |
5 | from extlinks.aggregates.models import (
6 | LinkAggregate,
7 | PageProjectAggregate,
8 | UserAggregate,
9 | )
10 | from extlinks.links.models import URLPattern, LinkEvent
11 | from extlinks.organisations.models import Organisation, Collection
12 |
13 |
14 | class Command(BaseCommand):
15 | help = "Deletes the EZProxy collection and organisation and reassigns those LinkEvents to new URLPatterns"
16 |
17 | def _handle(self, *args, **options):
18 | ezproxy_org = self._get_ezproxy_organisation()
19 | ezproxy_collection = self._get_ezproxy_collection()
20 | url_patterns = ezproxy_collection.get_url_patterns().all()
21 |
22 | linkevents = LinkEvent.objects.get_queryset()
23 | for url_pattern in url_patterns:
24 | linkevents.filter(object_id=url_pattern.id)
25 | collections = Collection.objects.all()
26 | self._process_linkevents_collections(linkevents, collections)
27 | self._delete_aggregates_ezproxy(ezproxy_org, ezproxy_collection, url_patterns)
28 |
29 | def _get_ezproxy_organisation(self):
30 | """
31 | Gets the EZProxy organisation, or returns None if it's already been deleted
32 |
33 | Parameters
34 | ----------
35 |
36 | Returns
37 | -------
38 | Organisation object or None
39 | """
40 | if Organisation.objects.filter(name="Wikipedia Library OCLC EZProxy").exists():
41 | return Organisation.objects.get(name="Wikipedia Library OCLC EZProxy")
42 |
43 | return None
44 |
45 | def _get_ezproxy_collection(self):
46 | """
47 | Gets the EZProxy collection, or returns None if it's already been deleted
48 |
49 | Parameters
50 | ----------
51 |
52 | Returns
53 | -------
54 | Collection object or None
55 | """
56 | if Collection.objects.filter(name="EZProxy").exists():
57 | return Collection.objects.get(name="EZProxy")
58 |
59 | return None
60 |
61 | def _get_ezproxy_url_patterns(self, collection):
62 | """
63 | Gets the EZProxy collection, or returns None if it's already been deleted
64 |
65 | Parameters
66 | ----------
67 | collection: The collection the URLPatterns belong to
68 |
69 | Returns
70 | -------
71 | URLPattern object or None
72 | """
73 | if collection and URLPattern.objects.filter(collection=collection).exists():
74 | return URLPattern.objects.get(collection=collection)
75 |
76 | return None
77 |
78 | def _delete_aggregates_ezproxy(self, ezproxy_org, ezproxy_collection, url_patterns):
79 | """
80 | Deletes any aggregate with the EZProxy collection and organisation,
81 | then deletes the collection, organisation and url patterns
82 |
83 | Parameters
84 | ----------
85 | ezproxy_org: Organisation
86 | The organisation to filter and delete the aggregates tables and that
87 | will later be deleted
88 |
89 | ezproxy_collection: Collection
90 | The collection to filter and delete the aggregates tables and that
91 | will later be deleted
92 |
93 | url_patterns: URLPattern
94 | The EZProxy URLPatterns that will be deleted
95 |
96 | Returns
97 | -------
98 |
99 | """
100 | LinkAggregate.objects.filter(
101 | organisation=ezproxy_org, collection=ezproxy_collection
102 | ).delete()
103 | PageProjectAggregate.objects.filter(
104 | organisation=ezproxy_org, collection=ezproxy_collection
105 | ).delete()
106 | UserAggregate.objects.filter(
107 | organisation=ezproxy_org, collection=ezproxy_collection
108 | ).delete()
109 |
110 | url_patterns.delete()
111 | ezproxy_collection.delete()
112 | ezproxy_org.delete()
113 |
114 | def _process_linkevents_collections(self, linkevents, collections):
115 | """
116 | Loops through all collections to get their url patterns. If a linkevent
117 | link coincides with a URLPattern, it is added to that LinkEvent. That way,
118 | it will be counted when the aggregates commands are run again
119 |
120 | Parameters
121 | ----------
122 | linkevents: Queryset[LinkEvent]
123 | LinkEvent that have no URLPatterns assigned (therefore no collection assigned)
124 |
125 | collections: Queryset[Collection]
126 | All of the collections
127 |
128 | Returns
129 | -------
130 |
131 | """
132 | for collection in collections:
133 | linkevents_changed = 0
134 | collection_urls = collection.get_url_patterns()
135 | for url_pattern in collection_urls:
136 | for linkevent in linkevents:
137 | proxy_url = url_pattern.url.replace(".", "-")
138 | if url_pattern.url in linkevent.link or proxy_url in linkevent.link:
139 | url_pattern.link_events.add(linkevent)
140 | url_pattern.save()
141 | linkevents_changed += 1
142 | if linkevents_changed > 0:
143 | # There have been changes to this collection, so we must delete
144 | # the aggregates tables for that collection and run the commands
145 | # for it
146 | LinkAggregate.objects.filter(collection=collection).delete()
147 | PageProjectAggregate.objects.filter(collection=collection).delete()
148 | UserAggregate.objects.filter(collection=collection).delete()
149 |
150 | call_command("fill_link_aggregates", collections=[collection.pk])
151 | call_command("fill_pageproject_aggregates", collections=[collection.pk])
152 | call_command("fill_user_aggregates", collections=[collection.pk])
153 |
--------------------------------------------------------------------------------
/extlinks/links/models.py:
--------------------------------------------------------------------------------
1 | import hashlib
2 | import logging
3 | from datetime import date
4 |
5 | from django.contrib.contenttypes.fields import GenericRelation, GenericForeignKey
6 | from django.contrib.contenttypes.models import ContentType
7 | from django.core.cache import cache
8 | from django.db import models
9 | from django.db.models.signals import post_save
10 | from django.dispatch import receiver
11 | from django.utils.functional import cached_property
12 |
13 | logger = logging.getLogger("django")
14 |
15 |
16 |
17 | class URLPatternManager(models.Manager):
18 | models.CharField.register_lookup(models.functions.Length)
19 | def cached(self):
20 | cached_patterns = cache.get('url_pattern_cache')
21 | if not cached_patterns:
22 | cached_patterns = self.all()
23 | logger.info('set url_pattern_cache')
24 | cache.set('url_pattern_cache', cached_patterns, None)
25 | return cached_patterns
26 |
27 | def matches(self, link):
28 | # All URL patterns matching this link
29 | tracked_urls = self.cached()
30 | return [
31 | pattern
32 | for pattern in tracked_urls
33 | if pattern.url in link or pattern.get_proxied_url in link
34 | ]
35 |
36 | class URLPattern(models.Model):
37 | class Meta:
38 | app_label = "links"
39 | verbose_name = "URL pattern"
40 | verbose_name_plural = "URL patterns"
41 |
42 | objects = URLPatternManager()
43 | # This doesn't have to look like a 'real' URL so we'll use a CharField.
44 | url = models.CharField(max_length=150)
45 | link_events = GenericRelation("LinkEvent",
46 | null=True,
47 | blank=True,
48 | default=None,
49 | related_query_name="url_pattern",
50 | on_delete=models.SET_NULL)
51 | collection = models.ForeignKey(
52 | "organisations.Collection",
53 | null=True,
54 | on_delete=models.SET_NULL,
55 | related_name="url",
56 | )
57 | collections = models.ManyToManyField(
58 | "organisations.Collection", related_name="urlpatterns"
59 | )
60 |
61 | def __str__(self):
62 | return self.url
63 |
64 | @cached_property
65 | def get_proxied_url(self):
66 | # This isn't everything that happens, but it's good enough
67 | # for us to make a decision about whether we have a match.
68 | return self.url.replace(".", "-")
69 |
70 |
71 | @receiver(post_save, sender=URLPattern)
72 | def delete_url_pattern_cache(sender, instance, **kwargs):
73 | if cache.delete("url_pattern_cache"):
74 | logger.info("delete url_pattern_cache")
75 |
76 |
77 | class LinkSearchTotal(models.Model):
78 | class Meta:
79 | app_label = "links"
80 | verbose_name = "LinkSearch total"
81 | verbose_name_plural = "LinkSearch totals"
82 | # We only want one record for each URL on any particular date
83 | constraints = [
84 | models.UniqueConstraint(fields=["url", "date"], name="unique_date_total")
85 | ]
86 |
87 | url = models.ForeignKey(URLPattern, null=True, on_delete=models.SET_NULL)
88 |
89 | date = models.DateField(default=date.today)
90 | total = models.PositiveIntegerField()
91 |
92 |
93 | class LinkEvent(models.Model):
94 | """
95 | Stores data from the page-links-change EventStream
96 |
97 | https://stream.wikimedia.org/?doc#!/Streams/get_v2_stream_page_links_change
98 | """
99 |
100 | class Meta:
101 | app_label = "links"
102 | get_latest_by = "timestamp"
103 | indexes = [
104 | models.Index(
105 | fields=[
106 | "hash_link_event_id",
107 | ]
108 | ),
109 | models.Index(
110 | fields=[
111 | "timestamp",
112 | ]
113 | ),
114 | models.Index(fields=["content_type", "object_id"]),
115 | ]
116 | url = models.ManyToManyField(URLPattern, related_name="linkevent")
117 | # URLs should have a max length of 2083
118 | link = models.CharField(max_length=2083)
119 | timestamp = models.DateTimeField()
120 | domain = models.CharField(max_length=32, db_index=True)
121 | content_type = models.ForeignKey(ContentType, on_delete=models.SET_NULL, related_name="content_type", null=True)
122 | object_id = models.PositiveIntegerField(null=True)
123 | content_object = GenericForeignKey("content_type", "object_id")
124 |
125 | username = models.ForeignKey(
126 | "organisations.User",
127 | null=True,
128 | on_delete=models.SET_NULL,
129 | )
130 | # rev_id has null=True because some tracked revisions don't have a
131 | # revision ID, like page moves.
132 | rev_id = models.PositiveIntegerField(null=True)
133 | # IPs have no user_id, so this can be blank too.
134 | user_id = models.PositiveIntegerField(null=True)
135 | page_title = models.CharField(max_length=255)
136 | page_namespace = models.IntegerField()
137 | event_id = models.CharField(max_length=36)
138 | user_is_bot = models.BooleanField(default=False)
139 | hash_link_event_id = models.CharField(max_length=256, blank=True)
140 |
141 | # Were links added or removed?
142 | REMOVED = 0
143 | ADDED = 1
144 |
145 | CHANGE_CHOICES = (
146 | (REMOVED, "Removed"),
147 | (ADDED, "Added"),
148 | )
149 |
150 | change = models.IntegerField(choices=CHANGE_CHOICES, db_index=True)
151 |
152 | # Flags whether this event was from a user on the user list for the
153 | # organisation tracking its URL.
154 | on_user_list = models.BooleanField(default=False)
155 |
156 | @property
157 | def get_organisation(self):
158 | url_pattern = URLPattern.objects.all()
159 | for url_pattern in url_pattern:
160 | link_events = url_pattern.link_events.all()
161 | if self in link_events:
162 | return url_pattern.collection.organisation
163 |
164 | def save(self, **kwargs):
165 | link_event_id = self.link + self.event_id
166 | hash = hashlib.sha256()
167 | hash.update(link_event_id.encode("utf-8"))
168 | self.hash_link_event_id = hash.hexdigest()
169 | super().save(**kwargs)
170 |
--------------------------------------------------------------------------------
/extlinks/aggregates/storage.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import gzip
3 | import itertools
4 | import json
5 | import logging
6 | import os
7 | import re
8 |
9 | from typing import Callable, Dict, Hashable, Iterable, List, Optional, Set
10 |
11 | from django.core.cache import cache
12 | from django.db.models import Q
13 |
14 | from extlinks.common.helpers import extract_queryset_filter
15 | from extlinks.common.swift import (
16 | batch_download_files,
17 | get_object_list,
18 | swift_connection,
19 | )
20 |
21 | logger = logging.getLogger("django")
22 |
23 | DEFAULT_EXPIRATION_SECS = 60 * 60
24 |
25 |
26 | def get_archive_list(prefix: str, expiration=DEFAULT_EXPIRATION_SECS) -> List[Dict]:
27 | """
28 | Gets a list of all available archives in object storage.
29 | """
30 |
31 | key = f"{prefix}_archive_list"
32 |
33 | # Retrieves the list from cache if possible.
34 | archives = cache.get(key)
35 | if archives:
36 | return json.loads(archives)
37 |
38 | # Download and cache the archive list if one wasn't available in the cache.
39 | try:
40 | archives = get_object_list(
41 | swift_connection(), os.environ.get("SWIFT_CONTAINER_AGGREGATES", "archive-aggregates"), f"{prefix}_"
42 | )
43 | cache.set(key, json.dumps(archives), expiration)
44 | except RuntimeError:
45 | # Swift is optional so return an empty list if it's not set up.
46 | return []
47 |
48 | return archives
49 |
50 |
51 | def get_archives(
52 | archives: Iterable[str], expiration=DEFAULT_EXPIRATION_SECS
53 | ) -> Dict[str, bytes]:
54 | """
55 | Retrieves the requested archives from objects storage or cache.
56 | """
57 |
58 | # Retrieve as many of the archives from cache as possible.
59 | archives = list(archives)
60 | result = cache.get_many(archives)
61 |
62 | # Identify missing archives that were not available in cache.
63 | missing = set()
64 | for archive in archives:
65 | if archive not in result:
66 | missing.add(archive)
67 |
68 | # Download and cache missing archives.
69 | if len(missing) > 0:
70 | downloaded_archives = batch_download_files(
71 | swift_connection(), os.environ.get("SWIFT_CONTAINER_AGGREGATES", "archive-aggregates"), archives
72 | )
73 | cache.set_many(downloaded_archives, expiration)
74 | result |= downloaded_archives
75 |
76 | return result
77 |
78 |
79 | def decode_archive(archive: bytes) -> List[Dict]:
80 | """
81 | Decodes a gzipped archive into a list of dictionaries (row records).
82 | """
83 | if archive is None or not isinstance(archive, (bytes, bytearray)):
84 | return []
85 |
86 | decompressed_archive = gzip.decompress(archive)
87 | if decompressed_archive is None or not isinstance(decompressed_archive, str):
88 | return []
89 |
90 | return json.loads(decompressed_archive)
91 |
92 |
93 | def download_aggregates(
94 | prefix: str,
95 | queryset_filter: Q,
96 | from_date: Optional[datetime.date] = None,
97 | to_date: Optional[datetime.date] = None,
98 | ) -> List[Dict]:
99 | """
100 | Find and download archives needed to augment aggregate results from the DB.
101 |
102 | This function tries its best to apply the passed in Django queryset to the
103 | records it returns. This function supports filtering by collection, user
104 | list, and date ranges.
105 | """
106 |
107 | extracted_filters = extract_queryset_filter(queryset_filter)
108 | collection_id = extracted_filters["collection"].pk
109 | on_user_list = extracted_filters.get("on_user_list", False)
110 |
111 | if from_date is None:
112 | from_date = extracted_filters.get("full_date__gte")
113 | if isinstance(from_date, str):
114 | from_date = datetime.datetime.strptime(from_date, "%Y-%m-%d").date()
115 |
116 | if to_date is None:
117 | to_date = extracted_filters.get("full_date__lte")
118 | if isinstance(to_date, str):
119 | to_date = datetime.datetime.strptime(to_date, "%Y-%m-%d").date()
120 |
121 | # We're only returning objects that match the following pattern. The
122 | # archive filenames use the following naming convention:
123 | #
124 | # {prefix}_{organisation}_{collection}_{full_date}_{on_user_list}.json.gz
125 | pattern = (
126 | rf"^{prefix}_([0-9]+)_([0-9]+)_([0-9]+-[0-9]{{2}}-[0-9]{{2}})_([01])\.json\.gz$"
127 | )
128 |
129 | # Identify archives that need to be downloaded from object storage
130 | # because they are not available in the database.
131 | archives = []
132 | for archive in get_archive_list(prefix):
133 | details = re.search(pattern, archive["name"])
134 | if not details:
135 | continue
136 |
137 | archive_collection_id = int(details.group(2))
138 | archive_date = datetime.datetime.strptime(details.group(3), "%Y-%m-%d").date()
139 | archive_on_user_list = bool(int(details.group(4)))
140 |
141 | # Filter out archives that don't match the queryset filter.
142 | if (
143 | (archive_collection_id != collection_id)
144 | or (on_user_list != archive_on_user_list)
145 | or (to_date and archive_date > to_date)
146 | or (from_date and archive_date < from_date)
147 | ):
148 | continue
149 |
150 | archives.append(archive)
151 |
152 | # Bail out if there's nothing to download.
153 | if len(archives) == 0:
154 | return []
155 |
156 | # Download and decompress the archives from object storage.
157 | unflattened_records = (
158 | (record["fields"] for record in decode_archive(contents))
159 | for contents in get_archives(archive["name"] for archive in archives).values()
160 | )
161 |
162 | # Each archive has its own records and are grouped together in a
163 | # two-dimensional array. Merge them all together.
164 | return list(itertools.chain(*unflattened_records))
165 |
166 |
167 | def calculate_totals(
168 | records: Iterable[Dict],
169 | group_by: Optional[Callable[[Dict], Hashable]] = None,
170 | ) -> List[Dict]:
171 | """
172 | Caclulate the totals of the passed in records.
173 | """
174 |
175 | totals = {}
176 |
177 | for record in records:
178 | key = group_by(record) if group_by else "_default"
179 |
180 | if key in totals:
181 | totals[key]["total_links_added"] += record["total_links_added"]
182 | totals[key]["total_links_removed"] += record["total_links_removed"]
183 | totals[key]["links_diff"] += (
184 | record["total_links_added"] - record["total_links_removed"]
185 | )
186 | else:
187 | totals[key] = record.copy()
188 | totals[key]["links_diff"] = (
189 | record["total_links_added"] - record["total_links_removed"]
190 | )
191 |
192 | return list(totals.values())
193 |
194 |
195 | def find_unique(
196 | records: Iterable[Dict],
197 | group_by: Callable[[Dict], Hashable],
198 | ) -> Set[Hashable]:
199 | """
200 | Find all distinct values in the given records.
201 | """
202 |
203 | values = set()
204 |
205 | for record in records:
206 | values.add(group_by(record))
207 |
208 | return values
209 |
--------------------------------------------------------------------------------
/extlinks/healthcheck/views.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import os
3 | import glob
4 |
5 | from datetime import timedelta
6 |
7 | from django.http import JsonResponse
8 | from django.views import View
9 | from django.utils.decorators import method_decorator
10 | from django.utils.timezone import now
11 | from django.views.decorators.cache import cache_page
12 |
13 | from extlinks.aggregates.models import (
14 | LinkAggregate,
15 | UserAggregate,
16 | PageProjectAggregate,
17 | )
18 | from extlinks.links.models import LinkEvent, LinkSearchTotal
19 | from extlinks.organisations.models import Organisation
20 |
21 |
22 | def get_most_recent(aggregate, monthly=False) -> datetime.date | None:
23 | try:
24 | if monthly:
25 | return aggregate.objects.filter(day=0).latest("full_date").full_date
26 | else:
27 | return aggregate.objects.exclude(day=0).latest("full_date").full_date
28 | except aggregate.DoesNotExist:
29 | pass
30 |
31 |
32 | @method_decorator(cache_page(60 * 1), name="dispatch")
33 | class LinkEventHealthCheckView(View):
34 | """
35 | Healthcheck that passes only if the latest link event is less than a day old
36 | """
37 |
38 | def get(self, request, *args, **kwargs):
39 | status_code = 500
40 | status_msg = "error"
41 | try:
42 | latest_linkevent_datetime = LinkEvent.objects.all().latest().timestamp
43 | cutoff_datetime = now() - timedelta(days=1)
44 | if latest_linkevent_datetime > cutoff_datetime:
45 | status_code = 200
46 | status_msg = "ok"
47 | else:
48 | status_msg = "out of date"
49 | except LinkEvent.DoesNotExist:
50 | status_code = 404
51 | status_msg = "not found"
52 | response = JsonResponse({"status": status_msg})
53 | response.status_code = status_code
54 | return response
55 |
56 |
57 | @method_decorator(cache_page(60 * 1), name="dispatch")
58 | class AggregatesCronHealthCheckView(View):
59 | """
60 | Healthcheck that passes only if the link aggregate jobs have all run successfully in the last 2 days
61 | """
62 |
63 | def get(self, request, *args, **kwargs):
64 | status_code = 500
65 | status_msg = "error"
66 |
67 | try:
68 | latest_link_aggregates_cron_endtime = get_most_recent(LinkAggregate)
69 | latest_user_aggregates_cron_endtime = get_most_recent(UserAggregate)
70 | latest_pageproject_aggregates_cron_endtime = get_most_recent(
71 | PageProjectAggregate
72 | )
73 |
74 | cutoff_datetime = (now() - timedelta(days=2)).date()
75 | if latest_link_aggregates_cron_endtime < cutoff_datetime:
76 | status_msg = "out of date"
77 | elif latest_user_aggregates_cron_endtime < cutoff_datetime:
78 | status_msg = "out of date"
79 | elif latest_pageproject_aggregates_cron_endtime < cutoff_datetime:
80 | status_msg = "out of date"
81 | else:
82 | status_code = 200
83 | status_msg = "ok"
84 | except:
85 | status_code = 404
86 | status_msg = "not found"
87 |
88 | response = JsonResponse({"status": status_msg})
89 | response.status_code = status_code
90 | return response
91 |
92 |
93 | @method_decorator(cache_page(60 * 1), name="dispatch")
94 | class MonthlyAggregatesCronHealthCheckView(View):
95 | """
96 | Healthcheck that passes only if the monthly aggregate jobs have all run successfully in the last month
97 | """
98 |
99 | def get(self, request, *args, **kwargs):
100 | status_code = 500
101 | status_msg = "error"
102 | try:
103 | latest_link_aggregates_cron_endtime = get_most_recent(LinkAggregate, True)
104 | latest_user_aggregates_cron_endtime = get_most_recent(UserAggregate, True)
105 | latest_pageproject_aggregates_cron_endtime = get_most_recent(
106 | PageProjectAggregate, True
107 | )
108 | # Monthly jobs may take some time to run, let's give 35 days to make sure
109 | cutoff_datetime = (now() - timedelta(days=35)).date()
110 | if latest_link_aggregates_cron_endtime < cutoff_datetime:
111 | status_msg = "out of date"
112 | elif latest_user_aggregates_cron_endtime < cutoff_datetime:
113 | status_msg = "out of date"
114 | elif latest_pageproject_aggregates_cron_endtime < cutoff_datetime:
115 | status_msg = "out of date"
116 | else:
117 | status_code = 200
118 | status_msg = "ok"
119 | except:
120 | status_code = 404
121 | status_msg = "not found"
122 | response = JsonResponse({"status": status_msg})
123 | response.status_code = status_code
124 | return response
125 |
126 |
127 | @method_decorator(cache_page(60 * 1), name="dispatch")
128 | class CommonCronHealthCheckView(View):
129 | """
130 | Healthcheck that passes only if a backup file has been created in the last 3 days
131 | """
132 |
133 | def get(self, request, *args, **kwargs):
134 | status_code = 500
135 | status_msg = "out of date"
136 |
137 | for i in range(3):
138 | date = now() - timedelta(days=i)
139 | filename = "links_linkevent_{}_*.json.gz".format(date.strftime("%Y%m%d"))
140 | filepath = os.path.join(os.environ["HOST_BACKUP_DIR"], filename)
141 |
142 | if bool(glob.glob(filepath)):
143 | status_code = 200
144 | status_msg = "ok"
145 | break
146 |
147 | response = JsonResponse({"status": status_msg})
148 | response.status_code = status_code
149 | return response
150 |
151 |
152 | @method_decorator(cache_page(60 * 1), name="dispatch")
153 | class LinksCronHealthCheckView(View):
154 | """
155 | Healthcheck that passes only if the links jobs have all run successfully in the last 9 days
156 | """
157 |
158 | def get(self, request, *args, **kwargs):
159 | status_code = 500
160 | status_msg = "error"
161 | try:
162 | latest_total_links_endtime = LinkSearchTotal.objects.latest("date").date
163 | cutoff_datetime = now().date() - timedelta(days=9)
164 | if latest_total_links_endtime < cutoff_datetime:
165 | status_msg = "out of date"
166 | else:
167 | status_code = 200
168 | status_msg = "ok"
169 | except:
170 | status_code = 404
171 | status_msg = "not found"
172 | response = JsonResponse({"status": status_msg})
173 | response.status_code = status_code
174 | return response
175 |
176 |
177 | @method_decorator(cache_page(60 * 1), name="dispatch")
178 | class OrganizationsCronHealthCheckView(View):
179 | """
180 | Healthcheck that passes only if the Organizations jobs have all run successfully in the last 2 hours
181 | """
182 |
183 | def get(self, request, *args, **kwargs):
184 | status_code = 500
185 | status_msg = "error"
186 | try:
187 | latest_user_lists_endtime = Organisation.objects.latest(
188 | "username_list_updated"
189 | ).username_list_updated
190 | cutoff_datetime = now() - timedelta(hours=2)
191 | if latest_user_lists_endtime < cutoff_datetime:
192 | status_msg = "out of date"
193 | else:
194 | status_code = 200
195 | status_msg = "ok"
196 | except:
197 | status_code = 404
198 | status_msg = "not found"
199 | response = JsonResponse({"status": status_msg})
200 | response.status_code = status_code
201 | return response
202 |
--------------------------------------------------------------------------------
/extlinks/aggregates/management/commands/fill_link_aggregates.py:
--------------------------------------------------------------------------------
1 | from datetime import date, timedelta, datetime
2 |
3 | from extlinks.common.management.commands import BaseCommand
4 | from django.core.management.base import CommandError
5 | from django.db import transaction, close_old_connections
6 | from django.db.models import Count, Q
7 | from django.db.models.functions import Cast
8 | from django.db.models.fields import DateField
9 |
10 | from ...models import LinkAggregate
11 | from extlinks.links.models import LinkEvent, URLPattern
12 | from extlinks.organisations.models import Collection
13 |
14 |
15 | class Command(BaseCommand):
16 | help = "Adds aggregated data into the LinkAggregate table"
17 |
18 | def add_arguments(self, parser):
19 | # Named (optional) arguments
20 | parser.add_argument(
21 | "--collections",
22 | nargs="+",
23 | type=int,
24 | help="A list of collection IDs that will be processed instead of every collection",
25 | )
26 |
27 | def _handle(self, *args, **options):
28 | if options["collections"]:
29 | for col_id in options["collections"]:
30 | collection = (
31 | Collection.objects.filter(pk=col_id, organisation__isnull=False)
32 | .prefetch_related("url")
33 | .first()
34 | )
35 | if collection is None:
36 | raise CommandError(f"Collection '{col_id}' does not exist")
37 |
38 | link_event_filter = self._get_linkevent_filter(collection)
39 | self._process_single_collection(link_event_filter, collection)
40 | else:
41 | # Looping through all collections
42 | link_event_filter = self._get_linkevent_filter()
43 | collections = Collection.objects.exclude(
44 | organisation__isnull=True
45 | ).prefetch_related("url")
46 |
47 | for collection in collections:
48 | self._process_single_collection(link_event_filter, collection)
49 |
50 | close_old_connections()
51 |
52 | def _get_linkevent_filter(self, collection=None):
53 | """
54 | This function checks if there is information in the LinkAggregate table
55 | to see what filters it should apply to the link events further on in the
56 | process
57 |
58 | Parameters
59 | ----------
60 | collection : Collection|None
61 | A collection to filter the LinkAggregate table. Is None by default
62 |
63 | Returns
64 | -------
65 | Q object
66 | """
67 | today = date.today()
68 | yesterday = today - timedelta(days=1)
69 |
70 | if collection is not None:
71 | linkaggregate_filter = Q(collection=collection)
72 | else:
73 | linkaggregate_filter = Q()
74 |
75 | latest_aggregated_link_date = (
76 | LinkAggregate.objects.filter(linkaggregate_filter)
77 | .order_by("full_date")
78 | .last()
79 | )
80 |
81 | if latest_aggregated_link_date is not None:
82 | latest_datetime = datetime(
83 | latest_aggregated_link_date.full_date.year,
84 | latest_aggregated_link_date.full_date.month,
85 | latest_aggregated_link_date.full_date.day,
86 | 0,
87 | 0,
88 | 0,
89 | )
90 | link_event_filter = Q(
91 | timestamp__lte=today,
92 | timestamp__gte=latest_datetime,
93 | )
94 | else:
95 | # There are no link aggregates, getting all LinkEvents from yesterday and backwards
96 | link_event_filter = Q(timestamp__lte=yesterday)
97 |
98 | return link_event_filter
99 |
100 | def _process_single_collection(self, link_event_filter, collection):
101 | """
102 | This function loops through all url patterns in a collection to check on
103 | new link events filtered by the dates passed in link_event_filter
104 |
105 | Parameters
106 | ----------
107 | link_event_filter : Q
108 | A Q query object to filter LinkEvents by. If the LinkAggregate table
109 | is empty, it will query all LinkEvents. If it has data, it will query
110 | by the latest date in the table and today
111 |
112 | collection: Collection
113 | A specific collection to fetch all link events
114 |
115 | Returns
116 | -------
117 | None
118 | """
119 | url_patterns = collection.get_url_patterns()
120 | if len(url_patterns) == 0:
121 | url_patterns = URLPattern.objects.filter(collection=collection).all()
122 | for url_pattern in url_patterns:
123 | link_events_with_annotated_timestamp = url_pattern.link_events.annotate(
124 | timestamp_date=Cast("timestamp", DateField())
125 | ).distinct()
126 | link_events = (
127 | link_events_with_annotated_timestamp.values(
128 | "timestamp_date", "on_user_list"
129 | )
130 | .filter(link_event_filter)
131 | .annotate(
132 | links_added=Count(
133 | "pk",
134 | filter=Q(change=LinkEvent.ADDED),
135 | distinct=True,
136 | ),
137 | links_removed=Count(
138 | "pk", filter=Q(change=LinkEvent.REMOVED), distinct=True
139 | ),
140 | )
141 | )
142 | self._fill_link_aggregates(link_events, collection)
143 |
144 | def _fill_link_aggregates(self, link_events, collection):
145 | """
146 | This function loops through all link events in a URLPattern of a collection
147 | to check if a LinkAggregate with prior information exists.
148 | If a LinkAggregate exists, it checks if there have been any changes to the
149 | links added and links removed sums. If there are any changes, then the
150 | LinkAggregate row is updated.
151 |
152 | Parameters
153 | ----------
154 | link_events : list(LinkEvent)
155 | A list of filtered and annotated LinkEvents that contains the sum of
156 | all links added and removed on a certain date
157 | collection: Collection
158 | The collection the LinkEvents came from. Will be used to fill the
159 | LinkAggregate table
160 |
161 | Returns
162 | -------
163 | None
164 | """
165 | for link_event in link_events:
166 | # Granulation level for the daily aggregation.
167 | # Changing this filter should also impact the monthly
168 | # aggregation in `fill_monthly_link_aggregates.py`
169 | existing_link_aggregate = (
170 | LinkAggregate.objects.filter(
171 | organisation=collection.organisation,
172 | collection=collection,
173 | full_date=link_event["timestamp_date"],
174 | on_user_list=link_event["on_user_list"],
175 | )
176 | .exclude(day=0)
177 | .first()
178 | )
179 | if existing_link_aggregate is not None:
180 | if (
181 | existing_link_aggregate.total_links_added
182 | != link_event["links_added"]
183 | or existing_link_aggregate.total_links_removed
184 | != link_event["links_removed"]
185 | ):
186 | # Updating the total links added and removed
187 | existing_link_aggregate.total_links_added = link_event[
188 | "links_added"
189 | ]
190 | existing_link_aggregate.total_links_removed = link_event[
191 | "links_removed"
192 | ]
193 | existing_link_aggregate.save()
194 | else:
195 | # Create a new link aggregate
196 | with transaction.atomic():
197 | LinkAggregate.objects.create(
198 | organisation=collection.organisation,
199 | collection=collection,
200 | full_date=link_event["timestamp_date"],
201 | total_links_added=link_event["links_added"],
202 | total_links_removed=link_event["links_removed"],
203 | on_user_list=link_event["on_user_list"],
204 | )
205 |
--------------------------------------------------------------------------------
/extlinks/common/helpers.py:
--------------------------------------------------------------------------------
1 | import calendar
2 | from datetime import date, timedelta
3 | from itertools import islice
4 | from typing import Any, Dict
5 |
6 | from django.db.models import Avg, Q
7 | from django.db.models.functions import TruncMonth
8 |
9 | from logging import getLogger
10 |
11 | logger = getLogger("django")
12 |
13 |
14 | def get_month_average(average_data, check_date):
15 | for avg_data in average_data:
16 | if avg_data["month"] == check_date:
17 | return avg_data["average"]
18 |
19 | return 0
20 |
21 |
22 | def get_linksearchtotal_data_by_time(queryset, start_date=None, end_date=None):
23 | """
24 | Calculates per-unit-time data from a queryset of LinkSearchTotal objects
25 |
26 | Given a queryset of LinkSearchTotal objects, returns the totals
27 | per month.
28 |
29 | Returns two lists: dates and totals
30 | """
31 | if queryset:
32 | earliest_date = queryset.earliest("date").date
33 | earliest_date = start_date if start_date is not None else earliest_date
34 | current_date = end_date if end_date is not None else date.today()
35 | linksearch_data = []
36 | dates = []
37 | has_real_data_flags = []
38 |
39 | average_month_data = (
40 | queryset.annotate(month=TruncMonth("date"))
41 | .values("month")
42 | .annotate(average=Avg("total"))
43 | )
44 |
45 | while current_date >= earliest_date:
46 | month_first = current_date.replace(day=1)
47 | this_month_avg = get_month_average(average_month_data, month_first)
48 |
49 | linksearch_data.append(round(this_month_avg))
50 | dates.append(month_first.strftime("%Y-%m-%d"))
51 | has_real_data_flags.append(round(this_month_avg) != 0)
52 |
53 | # Figure out what the last month is regardless of today's date
54 | current_date = month_first - timedelta(days=1)
55 |
56 | if dates and linksearch_data and has_real_data_flags:
57 | as_of_date = None
58 | # If a month has no data for some reason, we should use whatever
59 | # figure we have for the previous month
60 | for i, data in enumerate(linksearch_data):
61 | linksearch_data_length = len(linksearch_data)
62 | if data == 0 and i != linksearch_data_length - 1:
63 | for j in range(i + 1, linksearch_data_length):
64 | if linksearch_data[j] != 0:
65 | linksearch_data[i] = linksearch_data[j]
66 | has_real_data_flags[i] = False
67 | break
68 | dates_reversed = dates[::-1]
69 | for date_str, is_real in zip(dates_reversed, has_real_data_flags[::-1]):
70 | if is_real:
71 | as_of_date = date_str
72 |
73 | link_search_data_reversed = linksearch_data[::-1]
74 | return dates_reversed, link_search_data_reversed, date.fromisoformat(as_of_date).strftime("%B %Y")
75 | else:
76 | return [], [], []
77 | else:
78 | return [], [], []
79 |
80 |
81 | def filter_linksearchtotals(queryset, filter_dict):
82 | """
83 | Adds filter conditions to a LinkSearchTotal queryset based on form results.
84 |
85 | queryset -- a LinkSearchTotal queryset
86 | filter_dict -- a dictionary of data from the user filter form
87 |
88 | Returns a queryset
89 | """
90 | if "start_date" in filter_dict:
91 | start_date = filter_dict["start_date"]
92 | if start_date:
93 | queryset = queryset.filter(date__gte=start_date)
94 |
95 | if "end_date" in filter_dict:
96 | end_date = filter_dict["end_date"]
97 | if end_date:
98 | queryset = queryset.filter(date__lte=end_date)
99 |
100 | return queryset
101 |
102 |
103 | def build_queryset_filters(form_data, collection_or_organisations):
104 | """
105 | This function parses a filter dictionary and creates Q object to filter
106 | the aggregates tables by
107 |
108 | Parameters
109 | ----------
110 | form_data: dict
111 | If the filter form has valid filters, then there will be a dictionary
112 | to filter the aggregates tables by dates or if a user is part of a user
113 | list
114 |
115 | collection_or_organisations : dict
116 | A dictionary that will have either a collection or a set of
117 | organisations to filter by.
118 |
119 | Returns
120 | -------
121 | Q : A Q object which will filter the aggregates queries
122 | """
123 | start_date = None
124 | end_date = None
125 | start_date_filter = Q()
126 | end_date_filter = Q()
127 | limit_to_user_list_filter = Q()
128 | # The aggregates queries will always be filtered by organisation
129 | if "organisations" in collection_or_organisations:
130 | collection_or_organisation_filter = Q(
131 | organisation__in=collection_or_organisations["organisations"]
132 | )
133 | elif "program" in collection_or_organisations:
134 | collection_or_organisation_filter = Q(
135 | program=collection_or_organisations["program"]
136 | )
137 | elif "linkevents" in collection_or_organisations:
138 | collection_or_organisation_filter = Q()
139 | else:
140 | collection_or_organisation_filter = Q(
141 | collection=collection_or_organisations["collection"]
142 | )
143 |
144 | if "start_date" in form_data:
145 | start_date = form_data["start_date"]
146 | if start_date:
147 | if "linkevents" in collection_or_organisations:
148 | start_date_filter = Q(timestamp__gte=start_date)
149 | else:
150 | start_date_filter = Q(full_date__gte=start_date)
151 | if "end_date" in form_data:
152 | end_date = form_data["end_date"]
153 | # The end date must not be greater than today's date
154 | if end_date:
155 | if "linkevents" in collection_or_organisations:
156 | end_date_filter = Q(timestamp__lte=end_date)
157 | else:
158 | end_date_filter = Q(full_date__lte=end_date)
159 |
160 | if "limit_to_user_list" in form_data:
161 | limit_to_user_list = form_data["limit_to_user_list"]
162 | if limit_to_user_list:
163 | limit_to_user_list_filter = Q(on_user_list=True)
164 |
165 | if start_date and end_date:
166 | # If the start date is greater tham the end date, it won't filter
167 | # by date
168 | if start_date >= end_date:
169 | return collection_or_organisation_filter & limit_to_user_list_filter
170 |
171 | return (
172 | collection_or_organisation_filter
173 | & limit_to_user_list_filter
174 | & start_date_filter
175 | & end_date_filter
176 | )
177 |
178 |
179 | def extract_queryset_filter(queryset_filter: Q, filters=None) -> Dict[str, Any]:
180 | """
181 | Extract fields from a queryset filter that works for simple querysets.
182 |
183 | This is used by functions that need to query from multiple different
184 | datasources with the same filters. In-particular when querying data from
185 | both the database and object storage.
186 | """
187 |
188 | if filters is None:
189 | filters = {}
190 |
191 | for child in queryset_filter.children:
192 | if isinstance(child, Q):
193 | extract_queryset_filter(child, filters=filters)
194 | elif isinstance(child, tuple):
195 | key, value = child
196 | filters[key] = value
197 |
198 | return filters
199 |
200 |
201 | def batch_iterator(iterable, size=1000):
202 | """
203 | This yields successive batches from an iterable (memory-efficient).
204 |
205 | Used for large queries that use `.iterator()` for efficiency.
206 | Instead of loading all data into memory at once, this function
207 | retrieves items lazily in fixed-size batches.
208 |
209 | Parameters
210 | ----------
211 | iterable : Iterator
212 | An iterable object, typically a Django QuerySet with `.iterator()`,
213 | that returns items one by one in a memory-efficient manner.
214 |
215 | size : int
216 | The maximum number of items to include in each batch.
217 |
218 | Returns
219 | -------
220 | Iterator[List]
221 | An iterator that yields lists containing at most `size` items
222 | per batch.
223 | """
224 | iterator = iter(iterable)
225 | while batch := list(islice(iterator, size)):
226 | yield batch
227 |
228 |
229 | def last_day(date: date) -> int:
230 | """
231 | Finds the last day of the month for the given date.
232 | """
233 |
234 | return calendar.monthrange(date.year, date.month)[1]
235 |
--------------------------------------------------------------------------------