├── .gitignore
├── LICENSE
├── README.md
├── examples
│   ├── images
│   │   ├── cli_dark.png
│   │   ├── rewrite.png
│   │   ├── transition.png
│   │   ├── viewer_light.png
│   │   └── viewer_stats_light.png
│   └── madness.py
├── poetry.lock
├── pyproject.toml
└── yark
    ├── __init__.py
    ├── __main__.py
    ├── archiver
    │   └── archive.py
    ├── channel.py
    ├── cli.py
    ├── errors.py
    ├── reporter.py
    ├── templates
    │   ├── base.html
    │   ├── channel.html
    │   ├── index.html
    │   └── video.html
    ├── utils.py
    ├── video.py
    └── viewer.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode
2 | demo/
3 | **/.DS*
4 | dist/
5 | __pycache__/
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2022 Owen Griffiths
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Yark
2 |
3 | YouTube archiving made simple.
4 |
5 |
6 |
7 |
8 |
9 |
11 |
12 | ## Installation
13 |
14 | To install Yark, simply download [Python 3.9+](https://www.python.org/downloads/) and [FFmpeg](https://ffmpeg.org/) (optional), then run the following:
15 |
16 | ```shell
17 | $ pip3 install yark
18 | ```
19 |
20 | ## Managing your Archive
21 |
22 | Once you've installed Yark, think of a name for your archive (e.g., "foobar") and copy the target channel's URL:
23 |
24 | ```shell
25 | $ yark new foobar https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA
26 | ```
27 |
28 | Now that you've created the archive, you can tell Yark to download all videos and metadata using the refresh command:
29 |
30 | ```shell
31 | $ yark refresh foobar
32 | ```
33 |
34 | Once everything has been downloaded, Yark will automatically give you a status report of what's changed since the last refresh:
35 |
36 |
37 |
38 | ## Viewing your Archive
39 |
40 | Viewing your archive is easy; just type `view` with your archive's name:
41 |
42 | ```shell
43 | $ yark view foobar
44 | ```
45 |
46 | This will pop up an offline website in your browser letting you watch all videos 🚀
47 |
48 |
49 |
50 | Under each video is a rich history report filled with timelines and graphs, as well as a noting feature which lets you add timestamped and permalinked comments 👐
51 |
52 |
53 |
54 | Light and dark modes are both available and automatically apply based on the system's theme.
55 |
56 | ## Details
57 |
58 | Here are some things to keep in mind when using Yark; the good and the bad:
59 |
60 | - Don't create a new archive again if you just want to update it; Yark accumulates all new metadata for you via timestamps
61 | - Feel free to suggest new features via the issues tab on this repository
62 | - Scheduling isn't a feature just yet, please use [`cron`](https://en.wikipedia.org/wiki/Cron) or something similar!
63 |
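Until scheduling lands, a periodic refresh is easy to script yourself. Below is a minimal sketch which shells out to the `yark` CLI on a fixed interval; the `runner` parameter is only a hypothetical injection point so the loop can be exercised without actually invoking the CLI:

```python
import subprocess
import time


def refresh_on_interval(
    archive: str,
    interval_seconds: float,
    iterations: int,
    runner=subprocess.run,
) -> None:
    """Runs `yark refresh <archive>` once per interval, `iterations` times"""
    for i in range(iterations):
        runner(["yark", "refresh", archive], check=True)
        # Sleep between refreshes, but not after the final one
        if i + 1 < iterations:
            time.sleep(interval_seconds)


# Example: refresh the "foobar" archive three times, an hour apart
# refresh_on_interval("foobar", 60 * 60, 3)
```

For unattended use, a `cron` entry (or a systemd timer) calling `yark refresh foobar` directly remains the simplest option.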
64 | ## Archive Format
65 |
66 | The archive format itself is simple: a directory-based structure with a core metadata file, plus all thumbnail/video data stored as ordinary files in their own directories:
67 |
68 | - `[name]/` – Your self-contained archive
69 | - `yark.json` – Archive file with all metadata
70 | - `yark.bak` – Backup archive file to protect against data damage
71 | - `videos/` – Directory containing all known videos
72 | - `[id].*` – Files containing video data for YouTube videos
73 | - `thumbnails/` – Directory containing all known thumbnails
74 | - `[hash].png` – Thumbnail files, each named by its hash
75 |
76 | It's best to take a few minutes to familiarize yourself with your archive by looking through any files that interest you; everything is quite readable.
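As a sketch of how approachable the format is, the snippet below writes a toy `yark.json` (a heavily simplified stand-in; real archives carry much more metadata per video) and reads it straight back with nothing but the standard library:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Build a toy archive directory with a minimal metadata file;
# this mirrors only the top-level shape, not the full schema
with TemporaryDirectory() as tmp:
    archive = Path(tmp) / "foobar"
    archive.mkdir()
    metadata = {
        "version": 3,
        "url": "https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA",
        "videos": [],
        "livestreams": [],
        "shorts": [],
    }
    (archive / "yark.json").write_text(json.dumps(metadata))

    # Reading it back is plain JSON, no special tooling needed
    loaded = json.loads((archive / "yark.json").read_text())
    print(loaded["version"], loaded["url"])
```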
77 |
--------------------------------------------------------------------------------
/examples/images/cli_dark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/cli_dark.png
--------------------------------------------------------------------------------
/examples/images/rewrite.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/rewrite.png
--------------------------------------------------------------------------------
/examples/images/transition.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/transition.png
--------------------------------------------------------------------------------
/examples/images/viewer_light.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/viewer_light.png
--------------------------------------------------------------------------------
/examples/images/viewer_stats_light.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/viewer_stats_light.png
--------------------------------------------------------------------------------
/examples/madness.py:
--------------------------------------------------------------------------------
1 | from yark import Channel, DownloadConfig
2 | from pathlib import Path
3 |
4 | # Create a new channel
5 | channel = Channel.new(
6 | Path("demo"), "https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA"
7 | )
8 |
9 | # Refresh only metadata and commit to file
10 | channel.metadata()
11 | channel.commit()
12 |
13 | # Load the channel back up from file for the fun of it
14 | channel = Channel.load(Path("demo"))
15 |
16 | # Print all the video ids of the channel
17 | print(", ".join([video.id for video in channel.videos]))
18 |
19 | # Get a cool video I made and print its description
20 | video = channel.search("annp92OPZgQ")
21 | print(video.description.current())
22 |
23 | # Download the 5 most recent videos and 10 most recent shorts
24 | config = DownloadConfig()
25 | config.max_videos = 5
26 | config.max_shorts = 10
27 | config.submit()
28 | channel.download(config)
29 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [tool.poetry]
2 | name = "yark"
3 | version = "1.2.12"
4 | description = "YouTube archiving made simple."
5 | authors = ["Owen Griffiths "]
6 | license = "MIT"
7 | readme = "README.md"
8 | repository = "https://github.com/owez/yark"
9 | classifiers = [
10 | "Topic :: System :: Archiving",
11 | "Topic :: System :: Archiving :: Backup",
12 | "Topic :: Multimedia :: Video",
13 | ]
14 | include = [{ path = "templates/*" }]
15 |
16 | [tool.poetry.dependencies]
17 | python = "^3.9"
18 | Flask = "^2.3.1"
19 | requests = "^2.28.2"
20 | colorama = "^0.4.6"
21 | yt-dlp = "2024.10.07"
22 | progress = "^1.6"
23 |
24 | [tool.poetry.scripts]
25 | yark = "yark.cli:_cli"
26 |
27 | [tool.poetry.group.dev.dependencies]
28 | mypy = "^0.991"
29 | poethepoet = "^0.18.1"
30 | types-colorama = "^0.4.15.11"
31 | types-requests = "^2.28.11.17"
32 | black = "^22.12.0"
33 |
34 | [build-system]
35 | requires = ["poetry-core>=1.0.0"]
36 | build-backend = "poetry.core.masonry.api"
37 |
--------------------------------------------------------------------------------
/yark/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Yark
3 | ====
4 |
5 | YouTube archiving made simple.
6 |
7 | Commonly-used
8 | -------------
9 |
10 | - `Channel`
11 | - `DownloadConfig`
12 | - `Video`
13 | - `Element`
14 | - `Note`
15 | - `Thumbnail`
16 | - `viewer()`
17 | - `ArchiveNotFoundException`
18 | - `VideoNotFoundException`
19 | - `NoteNotFoundException`
20 | - `TimestampException`
21 |
22 | Beware that using Yark as a library is currently experimental and breaking changes here are not tracked!
23 | """
24 |
25 | from .channel import Channel, DownloadConfig
26 | from .video import Video, Element, Note, Thumbnail
27 | from .viewer import viewer
28 | from .errors import (
29 | ArchiveNotFoundException,
30 | VideoNotFoundException,
31 | NoteNotFoundException,
32 | TimestampException,
33 | )
34 |
--------------------------------------------------------------------------------
/yark/__main__.py:
--------------------------------------------------------------------------------
1 | """Main runner for those using `python3 -m yark` instead of the proper `yark` script poetry provides"""
2 |
3 | from .cli import _cli
4 |
5 | _cli()
6 |
--------------------------------------------------------------------------------
/yark/archiver/archive.py:
--------------------------------------------------------------------------------
1 | """Archive management with metadata/video downloading core"""
2 |
3 | from __future__ import annotations
4 | from datetime import datetime
5 | import json
6 | from pathlib import Path
7 | import time
8 | from yt_dlp import YoutubeDL, DownloadError # type: ignore
9 | import sys
10 | from .reporter import Reporter
11 | from ..errors import ArchiveNotFoundException, MetadataFailException
12 | from .video.video import Video, Videos
13 | from .comment_author import CommentAuthor
14 | from typing import Optional, Any
15 | from .config import Config, YtDlpSettings
16 | from .converter import Converter
17 | from .migrator import _migrate
18 | from ..utils import ARCHIVE_COMPAT, _log_err
19 | from dataclasses import dataclass
20 | import logging
21 |
22 | RawMetadata = dict[str, Any]
23 | """Raw metadata downloaded from yt-dlp to be parsed"""
24 |
25 |
26 | @dataclass(init=False)
27 | class Archive:
28 | path: Path
29 | url: str
30 | version: int
31 | videos: Videos
32 | livestreams: Videos
33 | shorts: Videos
34 | reporter: Reporter
35 | comment_authors: dict[str, CommentAuthor]
36 |
37 | def __init__(
38 | self,
39 | path: Path,
40 | url: str,
41 | version: int = ARCHIVE_COMPAT,
42 | videos: Videos | None = None,
43 | livestreams: Videos | None = None,
44 | shorts: Videos | None = None,
45 | comment_authors: dict[str, CommentAuthor] = {},
46 | ) -> None:
47 | self.path = path
48 | self.url = url
49 | self.version = version
50 | self.videos = Videos(self) if videos is None else videos
51 | self.livestreams = Videos(self) if livestreams is None else livestreams
52 | self.shorts = Videos(self) if shorts is None else shorts
53 | self.reporter = Reporter(self)
54 | self.comment_authors = comment_authors
55 |
56 | @staticmethod
57 | def load(path: Path) -> Archive:
58 | """Loads existing archive from path"""
59 | # Check existence
60 | path = Path(path)
61 | archive_name = path.name
62 | logging.info(f"Loading {archive_name} archive")
63 | if not path.exists():
64 | raise ArchiveNotFoundException(path)
65 |
66 | # Load config
67 | encoded = json.load(open(path / "yark.json", "r"))
68 |
69 | # Check version before fully decoding and exit if wrong
70 | archive_version = encoded["version"]
71 | if archive_version != ARCHIVE_COMPAT:
72 | encoded = _migrate(
73 | archive_version, ARCHIVE_COMPAT, encoded, path, archive_name
74 | )
75 |
76 | # Decode and return
77 | return Archive._from_archive_o(encoded, path)
78 |
79 | def metadata_download(self, config: Config) -> RawMetadata:
80 | """Downloads raw metadata for further parsing"""
81 | logging.info(f"Downloading raw metadata for {self}")
82 |
83 | # Get settings
84 | settings = config.settings_md()
85 |
86 | # Pull metadata from youtube
87 | with YoutubeDL(settings) as ydl:
88 | for i in range(3):
89 | try:
90 | res: RawMetadata = ydl.extract_info(self.url, download=False)
91 | return res
92 | except Exception as exception:
93 | # Report error
94 | retrying = i != 2
95 | _download_error("metadata", exception, retrying)
96 |
97 | # Log retrying message
98 | if retrying:
99 | logging.warning(f"Retrying metadata download ({i+1}/3)")
100 |
101 | # Couldn't download after all retries
102 | raise MetadataFailException()
103 |
104 | def metadata_parse(self, config: Config, metadata: RawMetadata) -> None:
105 | """Updates current archive by parsing the raw downloaded metadata"""
106 | logging.info(f"Parsing downloaded metadata for {self}")
107 |
108 | # Make buckets to normalize different types of videos
109 | videos = []
110 | livestreams = []
111 | shorts = []
112 |
113 | # Videos only (basic channel or playlist)
114 | if "entries" not in metadata["entries"][0]:
115 | videos = metadata["entries"]
116 |
117 | # Videos and at least one other (livestream/shorts)
118 | else:
119 | for entry in metadata["entries"]:
120 | # Find the kind of category this is; youtube formats these as 3 playlists
121 | kind = entry["title"].split(" - ")[-1].lower()
122 |
123 | # Plain videos
124 | if kind == "videos":
125 | videos = entry["entries"]
126 |
127 | # Livestreams
128 | elif kind == "live":
129 | livestreams = entry["entries"]
130 |
131 | # Shorts
132 | elif kind == "shorts":
133 | shorts = entry["entries"]
134 |
135 | # Unknown 4th kind; youtube might've updated
136 | else:
137 | _log_err(f"Unknown video kind '{kind}' found", True)
138 |
139 | # Parse metadata
140 | self._metadata_parse_videos("video", config, videos, self.videos)
141 | self._metadata_parse_videos("livestream", config, livestreams, self.livestreams)
142 | self._metadata_parse_videos("shorts", config, shorts, self.shorts)
143 |
144 | # Go through each and report deleted
145 | self._report_deleted(self.videos)
146 | self._report_deleted(self.livestreams)
147 | self._report_deleted(self.shorts)
148 |
149 | def _metadata_parse_videos(
150 | self,
151 | kind: str,
152 | config: Config,
153 | entries: list[dict[str, Any]],
154 | videos: Videos,
155 | ) -> None:
156 | """Parses metadata for a category of video into its `videos` bucket"""
157 | logging.debug(f"Parsing through {kind} for {self}")
158 |
159 | # Parse each video
160 | for entry in entries:
161 | self._metadata_parse_video(config, entry, videos)
162 |
163 | # Sort videos by newest
164 | videos.sort()
165 |
166 | def _metadata_parse_video(
167 | self, config: Config, entry: dict[str, Any], videos: Videos
168 | ) -> None:
169 | """Parses metadata for one video, creating it or updating it depending on the `videos` already in the bucket"""
170 | id = entry["id"]
171 | logging.debug(f"Parsing video {id} metadata for {self}")
172 |
173 | # Skip video if there's no formats available; happens with upcoming videos/livestreams
174 | if "formats" not in entry or len(entry["formats"]) == 0:
175 | return
176 |
177 | # Update the video if it already exists in the bucket
178 | found_video = videos.inner.get(id)
179 | if found_video is not None:
180 | found_video.update(config, entry)
181 | return
182 |
183 | # Otherwise add it as a new video
184 | video = Video.new(config, self, entry)
185 | videos.inner[video.id] = video
186 | self.reporter.added.append(video)
192 |
193 | def download(self, config: Config) -> bool:
194 | """Downloads all videos which haven't already been downloaded, returning if anything was downloaded"""
195 | logging.debug(f"Downloading curated videos for {self}")
196 |
197 | # Prepare; clean out old part files and get settings
198 | self._clean_parts()
199 | settings = config.settings_dl(self.path)
200 |
201 | # Retry downloading 5 times in total for all videos
202 | anything_downloaded = True
203 | for i in range(5):
204 | # Try to curate a list and download videos on it
205 | try:
206 | # Curate list of non-downloaded videos
207 | not_downloaded = self._curate(config)
208 |
209 | # Return if there's nothing to download
210 | if len(not_downloaded) == 0:
211 | anything_downloaded = False
212 | return False
213 |
214 | # Launch core to download all curated videos
215 | self._download_launch(settings, not_downloaded)
216 |
217 | # Stop if we've got them all
218 | break
219 |
220 | # Report error and retry/stop
221 | except Exception as exception:
222 | _download_error("videos", exception, i != 4)
223 |
224 | # End by converting any downloaded but unsupported video file formats
225 | if anything_downloaded:
226 | converter = Converter(self.path / "videos")
227 | converter.run()
228 |
229 | # Say that something was downloaded
230 | return True
231 |
232 | def _download_launch(
233 | self, settings: YtDlpSettings, not_downloaded: list[Video]
234 | ) -> None:
235 | """Downloads all `not_downloaded` videos passed into it whilst automatically handling private videos; this is the core of the downloader"""
236 | # Continuously try to download after private/deleted videos are found
237 | # This block gives the downloader all the curated videos and skips/reports deleted videos by filtering their exceptions
238 | while True:
239 | # Download from curated list then exit the optimistic loop
240 | try:
241 | urls = [video.url() for video in not_downloaded]
242 | with YoutubeDL(settings) as ydl:
243 | ydl.download(urls)
244 | break
245 |
246 | # Special handling for private/deleted videos which are archived, if not we raise again
247 | except DownloadError as exception:
248 | new_not_downloaded = self._download_exception_handle(
249 | not_downloaded, exception
250 | )
251 | if new_not_downloaded is not None:
252 | not_downloaded = new_not_downloaded
253 |
254 | def _download_exception_handle(
255 | self, not_downloaded: list[Video], exception: DownloadError
256 | ) -> Optional[list[Video]]:
257 | """Handle for failed downloads if there's a special private/deleted video"""
258 | # Set new list for not downloaded to return later
259 | new_not_downloaded = None
260 |
261 | # Video is privated or deleted
262 | if (
263 | "Private video" in exception.msg
264 | or "This video has been removed by the uploader" in exception.msg
265 | ):
266 | # Skip video from curated and get it as a return
267 | new_not_downloaded, video = _skip_video(not_downloaded, "deleted")
268 |
269 | # If this is a new occurrence then set it & report
270 | # This will only happen if it's deleted after getting metadata, like in a dry run
271 | if video.deleted.current() == False:
272 | self.reporter.deleted.append(video)
273 | video.deleted.update(None, True)
274 |
275 | # User hasn't got ffmpeg installed and youtube hasn't got format 22
276 | # NOTE: see #55 to learn more
277 | # NOTE: sadly yt-dlp doesn't let us access yt_dlp.utils.ContentTooShortError so we check msg
278 | elif " bytes, expected " in exception.msg:
279 | # Skip video from curated
280 | new_not_downloaded, _ = _skip_video(
281 | not_downloaded,
282 | "no format found; please download ffmpeg!",
283 | True,
284 | )
285 |
286 | # Nevermind, normal exception
287 | else:
288 | raise exception
289 |
290 | # Return
291 | return new_not_downloaded
292 |
293 | def _curate(self, config: Config) -> list[Video]:
294 | """Curate videos which aren't downloaded and return their urls"""
295 |
296 | def curate_list(videos: Videos, maximum: Optional[int]) -> list[Video]:
297 | """Curates the videos inside the provided `videos` list to its local maximum"""
298 | # Make a list for the videos
299 | found_videos = []
300 |
301 | # Add all undownloaded videos because there's no maximum
302 | if maximum is None:
303 | found_videos = list(
304 | [video for video in videos.inner.values() if not video.downloaded()]
305 | )
306 |
307 | # Cut available videos to maximum if present for deterministic getting
308 | else:
309 | # Fix the maximum to the length so we don't try to get more than there is
310 | fixed_maximum = min(max(len(videos.inner) - 1, 0), maximum)
311 |
312 | # Set the available videos to this fixed maximum
313 | values = list(videos.inner.values())
314 | for ind in range(fixed_maximum):
315 | # Get video
316 | video = values[ind]
317 |
318 | # Save video if it's not been downloaded yet
319 | if not video.downloaded():
320 | found_videos.append(video)
321 |
322 | # Return
323 | return found_videos
324 |
325 | # Curate
326 | not_downloaded = []
327 | not_downloaded.extend(curate_list(self.videos, config.max_videos))
328 | not_downloaded.extend(curate_list(self.livestreams, config.max_livestreams))
329 | not_downloaded.extend(curate_list(self.shorts, config.max_shorts))
330 |
331 | # Return
332 | return not_downloaded
333 |
334 | def commit(self, backup: bool = False) -> None:
335 | """Commits (saves) archive to path; do this once you've finished all of your transactions"""
336 | # Save backup if explicitly wanted
337 | if backup:
338 | self._backup()
339 |
340 | # Directories
341 | logging.info(f"Committing {self} to file")
342 | paths = [self.path, self.path / "images", self.path / "videos"]
343 | for path in paths:
344 | if not path.exists():
345 | path.mkdir()
346 |
347 | # Config
348 | with open(self.path / "yark.json", "w+") as file:
349 | json.dump(self._to_archive_o(), file)
350 |
351 | def _report_deleted(self, videos: Videos) -> None:
352 | """Goes through a video category and reports & marks as deleted any videos which weren't found in the fetched metadata and aren't already known to be deleted"""
353 | for video in videos.inner.values():
354 | if video.deleted.current() == False and not video.known_not_deleted:
355 | self.reporter.deleted.append(video)
356 | video.deleted.update(None, True)
357 |
358 | def _clean_parts(self) -> None:
359 | """Cleans out old temporary `.part` files left behind by downloads which were stopped midway, if present"""
360 | # Make a bucket for found files
361 | deletion_bucket: list[Path] = []
362 |
363 | # Scan through and find part files
364 | videos = self.path / "videos"
365 | deletion_bucket.extend([file for file in videos.glob("*.part")])
366 | deletion_bucket.extend([file for file in videos.glob("*.ytdl")])
367 |
368 | # Log and delete if there are part files present
369 | if len(deletion_bucket) != 0:
370 | logging.info("Cleaning out previous temporary files..")
371 | for file in deletion_bucket:
372 | file.unlink()
373 |
374 | def _backup(self) -> None:
375 | """Creates a backup of the existing `yark.json` file in path as `yark.bak` with added comments"""
376 | logging.info(f"Creating a backup for {self} as yark.bak")
377 |
378 | # Get current archive path
379 | ARCHIVE_PATH = self.path / "yark.json"
380 |
381 | # Skip backing up if the archive doesn't exist
382 | if not ARCHIVE_PATH.exists():
383 | return
384 |
385 | # Open original archive to copy
386 | with open(self.path / "yark.json", "r") as file_archive:
387 | # Add comment information to backup file
388 | save = f"// Backup of a Yark archive, dated {datetime.utcnow().isoformat()}\n// Remove these comments and rename to 'yark.json' to restore\n{file_archive.read()}"
389 |
390 | # Save new information into a new backup
391 | with open(self.path / "yark.bak", "w+") as file_backup:
392 | file_backup.write(save)
393 |
394 | @staticmethod
395 | def _from_archive_o(encoded: dict[str, Any], path: Path) -> Archive:
396 | """Decodes object dict from archive which is being loaded back up"""
397 |
398 | # Initiate archive
399 | archive = Archive(path, encoded["url"], encoded["version"])
400 |
401 | # Decode id & body style comment authors
402 | # NOTE: needed above video decoding for comments
403 | for id in encoded["comment_authors"].keys():
404 | archive.comment_authors[id] = CommentAuthor._from_archive_ib(
405 | archive, id, encoded["comment_authors"][id]
406 | )
407 |
408 | # Load up videos/livestreams/shorts
409 | archive.videos = Videos._from_archive_o(archive, encoded["videos"])
410 | archive.livestreams = Videos._from_archive_o(archive, encoded["livestreams"])
411 | archive.shorts = Videos._from_archive_o(archive, encoded["shorts"])
412 |
413 | # Return
414 | return archive
415 |
416 | def _to_archive_o(self) -> dict[str, Any]:
417 | """Converts all archive data to a object dict to commit"""
418 | # Encode comment authors
419 | comment_authors = {}
420 | for id in self.comment_authors.keys():
421 | comment_authors[id] = self.comment_authors[id]._to_archive_b()
422 |
423 | # Basics
424 | payload = {
425 | "version": self.version,
426 | "url": self.url,
427 | "videos": self.videos._to_archive_o(),
428 | "livestreams": self.livestreams._to_archive_o(),
429 | "shorts": self.shorts._to_archive_o(),
430 | "comment_authors": comment_authors,
431 | }
432 |
433 | # Return
434 | return payload
435 |
436 | def __repr__(self) -> str:
437 | return self.path.name
438 |
439 |
440 | def _skip_video(
441 | videos: list[Video],
442 | reason: str,
443 | warning: bool = False,
444 | ) -> tuple[list[Video], Video]:
445 | """Skips the first undownloaded video in `videos`; make sure there's at least one to skip, otherwise an exception will be thrown"""
446 | # Find first undownloaded video
447 | for ind, video in enumerate(videos):
448 | if not video.downloaded():
449 | # Tell the user we're skipping over it
450 | if warning:
451 | logging.warning(
452 | f"Skipping video {video.id} download for {video.archive} ({reason})"
453 | )
454 | else:
455 | logging.info(
456 | f"Skipping video {video.id} download for {video.archive} ({reason})"
457 | )
458 |
459 | # Set videos to skip over this one
460 | videos = videos[ind + 1 :]
461 |
462 | # Return the corrected list and the video found
463 | return videos, video
464 |
465 | # Shouldn't happen, see docs
466 | raise Exception(
467 | "We expected to skip a video and return it but nothing to skip was found"
468 | )
469 |
470 |
471 | def _download_error(
472 | archive_name: str, exception: DownloadError, retrying: bool
473 | ) -> None:
474 | """Logs errors depending on what kind of download error occurred"""
475 | # Default message
476 | msg = (
477 | f"Unknown error whilst downloading {archive_name}, details below:\n{exception}"
478 | )
479 |
480 | # Types of errors
481 | ERRORS = [
482 | "<urlopen error [Errno 8] nodename nor servname provided, or not known>",
483 | "500",
484 | "Got error: The read operation timed out",
485 | "No such file or directory",
486 | "HTTP Error 404: Not Found",
487 | "<urlopen error timed out>",
488 | "Did not get any data blocks",
489 | ]
490 |
491 | # Download errors
492 | if type(exception) == DownloadError:
493 | # Server connection
494 | if ERRORS[0] in exception.msg or ERRORS[5] in exception.msg:
495 | msg = "Issue connecting with YouTube's servers"
496 |
497 | # Server fault
498 | elif ERRORS[1] in exception.msg:
499 | msg = "Fault with YouTube's servers"
500 |
501 | # Timeout
502 | elif ERRORS[2] in exception.msg:
503 | msg = "Timed out trying to download video"
504 |
505 | # Video deleted whilst downloading
506 | elif ERRORS[3] in exception.msg:
507 | msg = "Video deleted whilst downloading"
508 |
509 | # Target not found, might need to retry with alternative route
510 | elif ERRORS[4] in exception.msg:
511 | msg = "Couldn't find target by its id"
512 |
513 | # Random timeout; not sure if it's user-end or youtube-end
514 | elif ERRORS[5] in exception.msg:
515 | msg = "Timed out trying to reach YouTube"
516 |
517 | # Log error
518 | suffix = ", retrying in a few seconds.." if retrying else ""
519 | logging.warning(msg + suffix)
520 |
521 | # Wait if retrying, exit if failed
522 | if retrying:
523 | time.sleep(5)
524 | else:
525 | _log_err(f"Sorry, failed to download {archive_name}", True)
526 | sys.exit(1)
527 |
--------------------------------------------------------------------------------
/yark/channel.py:
--------------------------------------------------------------------------------
1 | """Channel and overall archive management with downloader"""
2 |
3 | from __future__ import annotations
4 | from datetime import datetime
5 | import json
6 | from pathlib import Path
7 | import time
8 | from yt_dlp import YoutubeDL, DownloadError # type: ignore
9 | from colorama import Style, Fore
10 | import sys
11 | from .reporter import Reporter
12 | from .errors import ArchiveNotFoundException, _err_msg, VideoNotFoundException
13 | from .video import Video, Element
14 | from typing import Any
15 | import time
16 | from progress.spinner import PieSpinner
17 | from concurrent.futures import ThreadPoolExecutor
18 | import time
19 |
20 | ARCHIVE_COMPAT = 3
21 | """
22 | Version of Yark archives which this script is capable of properly parsing
23 |
24 | - Version 1 was the initial format and had all the basic information you can see in the viewer now
25 | - Version 2 introduced livestreams and shorts into the mix, as well as making the channel id into a simple url
26 | - Version 3 was a minor change to introduce a deleted tag so we have full reporting capability
27 |
28 | Some of these breaking versions are large changes and some are relatively small.
29 | We don't check if a value exists or not in the archive format out of precedent
30 | and we don't have optionally-present values, meaning that any new tags are a
31 | breaking change to the format. The only downside to this is that the migrator
32 | gets a line or two of extra code every breaking change. This is much better than
33 | having way more complexity in the archiver decoding system itself.
34 | """
35 |
36 | from typing import Optional
37 |
38 |
39 | class DownloadConfig:
40 | max_videos: Optional[int]
41 | max_livestreams: Optional[int]
42 | max_shorts: Optional[int]
43 | skip_download: bool
44 | skip_metadata: bool
45 | format: Optional[str]
46 |
47 | def __init__(self) -> None:
48 | self.max_videos = None
49 | self.max_livestreams = None
50 | self.max_shorts = None
51 | self.skip_download = False
52 | self.skip_metadata = False
53 | self.format = None
54 |
55 | def submit(self) -> None:
56 | """Submits configuration; this normalises unset maximums to 0 whenever any maximum is given"""
57 | # Adjust remaining maximums if one is given
58 | no_maximums = (
59 | self.max_videos is None
60 | and self.max_livestreams is None
61 | and self.max_shorts is None
62 | )
63 | if not no_maximums:
64 | if self.max_videos is None:
65 | self.max_videos = 0
66 | if self.max_livestreams is None:
67 | self.max_livestreams = 0
68 | if self.max_shorts is None:
69 | self.max_shorts = 0
70 |
71 | # If all are 0, it's equivalent to skipping downloads
72 | if self.max_videos == 0 and self.max_livestreams == 0 and self.max_shorts == 0:
73 | print(
74 | Fore.YELLOW
75 | + "Using the skip downloads option is recommended over setting maximums to 0"
76 | + Fore.RESET
77 | )
78 | self.skip_download = True
79 |
80 |
81 | class VideoLogger:
82 | @staticmethod
83 | def downloading(d):
84 | """Progress hook for video downloading"""
85 | # Get video's id
86 | id = d["info_dict"]["id"]
87 |
88 | # Downloading percent
89 | if d["status"] == "downloading":
90 | percent = d["_percent_str"].strip()
91 | print(
92 | Style.DIM
93 | + f" • Downloading {id}, at {percent}.. "
94 | + Style.NORMAL,
95 | end="\r",
96 | )
99 |
100 | # Finished a video's download
101 | elif d["status"] == "finished":
102 | print(Style.DIM + f" • Downloaded {id} " + Style.NORMAL)
103 |
104 | def debug(self, msg):
105 | """Debug log messages, ignored"""
106 | pass
107 |
108 | def info(self, msg):
109 | """Info log messages, ignored"""
110 | pass
111 |
112 | def warning(self, msg):
113 | """Warning log messages, ignored"""
114 | pass
115 |
116 | def error(self, msg):
117 | """Error log messages, ignored"""
118 | pass
119 |
120 |
121 | class Channel:
122 | path: Path
123 | version: int
124 | url: str
125 | videos: list[Video]
126 | livestreams: list[Video]
127 | shorts: list[Video]
128 | reporter: Reporter
129 |
130 | @staticmethod
131 | def new(path: Path, url: str) -> Channel:
132 | """Creates a new channel"""
133 | # Details
134 | print("Creating new channel..")
135 | channel = Channel()
136 | channel.path = Path(path)
137 | channel.version = ARCHIVE_COMPAT
138 | channel.url = url
139 | channel.videos = []
140 | channel.livestreams = []
141 | channel.shorts = []
142 | channel.reporter = Reporter(channel)
143 |
144 | # Commit and return
145 | channel.commit()
146 | return channel
147 |
148 | @staticmethod
149 | def _new_empty() -> Channel:
150 | return Channel.new(
151 | Path("pretend"), "https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA"
152 | )
153 |
154 | @staticmethod
155 | def load(path: Path) -> Channel:
156 | """Loads existing channel from path"""
157 | # Check existence
158 | path = Path(path)
159 | channel_name = path.name
160 | print(f"Loading {channel_name} channel..")
161 | if not path.exists():
162 | raise ArchiveNotFoundException("Archive doesn't exist")
163 |
164 | # Load config
165 | encoded = json.load(open(path / "yark.json", "r"))
166 |
167 | # Check version before fully decoding and exit if wrong
168 | archive_version = encoded["version"]
169 | if archive_version != ARCHIVE_COMPAT:
170 | encoded = _migrate_archive(
171 | archive_version, ARCHIVE_COMPAT, encoded, channel_name
172 | )
173 |
174 | # Decode and return
175 | return Channel._from_dict(encoded, path)
176 |
177 | def metadata(self):
178 | """Queries YouTube for all channel metadata to refresh known videos"""
179 | # Print loading progress at the start without a loading indicator so there's always a print
180 | msg = "Downloading metadata.."
181 | print(msg, end="\r")
182 |
183 | # Download metadata and give the user a spinner bar
184 | with ThreadPoolExecutor() as ex:
185 | # Make future for downloading metadata
186 | future = ex.submit(self._download_metadata)
187 |
188 | # Start spinning
189 | with PieSpinner(f"{msg} ") as bar:
190 | # Don't show bar for 2 seconds but check if future is done
191 | no_bar_time = time.time() + 2
192 | while time.time() < no_bar_time:
193 | if future.done():
194 | break
195 | time.sleep(0.25)
196 |
197 | # Show loading spinner
198 | while not future.done():
199 | bar.next()
200 | time.sleep(0.075)
201 |
202 | # Get result from thread now that it's finished
203 | res = future.result()
204 |
205 | # Uncomment for saving big dumps for testing
206 | # with open(self.path / "dump.json", "w+") as file:
207 | # json.dump(res, file)
208 |
209 | # Uncomment for loading big dumps for testing
210 | # res = json.load(open(self.path / "dump.json", "r"))
211 |
212 | # Parse downloaded metadata
213 | self._parse_metadata(res)
214 |
215 | def _download_metadata(self) -> dict[str, Any]:
216 | """Downloads metadata dict and returns for further parsing"""
217 | # Construct downloader
218 | settings = {
219 | # Centralized logging system; makes output fully quiet
220 | "logger": VideoLogger(),
221 | # Skip downloading pending livestreams (#60)
222 | "ignore_no_formats_error": True,
223 | # Concurrent fragment downloading for increased resilience (#109)
224 | "concurrent_fragment_downloads": 8,
225 | # First download "flat", then extract_info for each video, to support large channels/playlists (#71)
226 | "extract_flat": True,
227 | }
228 |
229 | # Get response and snip it
230 | with YoutubeDL(settings) as ydl:
231 | # First extract the "flat" metadata, which doesn't yet download the full metadata for each video
232 | for i in range(3):
233 | try:
234 | res: dict[str, Any] = ydl.extract_info(self.url, download=False)
235 | break
236 | except Exception as exception:
237 | # Report error
238 | retrying = i != 2
239 | _err_dl("metadata", exception, retrying)
240 |
241 | # Print retrying message
242 | if retrying:
243 | print(
244 | Style.DIM
245 | + " • Retrying metadata download.."
246 | + Style.RESET_ALL
247 | ) # TODO: compat with loading bar
248 |
249 | # Go through the "flat" metadata and download the full metadata for each video
250 | for index in range(len(res["entries"])):
251 | if res["entries"][index]["_type"] == "playlist":
252 | playlist = res["entries"][index]
253 | for list_index in range(len(playlist["entries"])):
254 | url = playlist["entries"][list_index]["url"]
255 | for i in range(3):
256 | try:
257 | entry = ydl.extract_info(url, download=False)
258 | if len(entry["formats"]) == 0:
259 | ydl = YoutubeDL(settings)
260 | entry = ydl.extract_info(url, download=False)
261 |
262 | playlist["entries"][list_index] = entry
263 | break
264 | except Exception as exception:
265 | # Report error
266 | retrying = i != 2
267 | _err_dl("metadata", exception, retrying)
268 |
269 | # Print retrying message
270 | if retrying:
271 | print(
272 | Style.DIM
273 | + " • Retrying metadata download.."
274 | + Style.RESET_ALL
275 | ) # TODO: compat with loading bar
276 |
277 |
278 | elif res["entries"][index]["_type"] == "url":
279 | url = res["entries"][index]["url"]
280 | for i in range(3):
281 | try:
282 | entry = ydl.extract_info(url, download=False)
283 | # if video didn't download formats, open a new downloader and try again
284 | if len(entry["formats"]) == 0:
285 | ydl = YoutubeDL(settings)
286 | entry = ydl.extract_info(url, download=False)
287 |
288 | res["entries"][index] = entry
289 | break
290 | except Exception as exception:
291 | # Report error
292 | retrying = i != 2
293 | _err_dl("metadata", exception, retrying)
294 |
295 | # Print retrying message
296 | if retrying:
297 | print(
298 | Style.DIM
299 | + " • Retrying metadata download.."
300 | + Style.RESET_ALL
301 | ) # TODO: compat with loading bar
302 |
303 | return res
304 |
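The three near-identical retry loops in `_download_metadata` follow one pattern: attempt up to three times, reporting and retrying on failure. A generic sketch of that pattern (the `retry` helper is hypothetical, not part of the source):

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(attempts: int, action: Callable[[], T]) -> T:
    """Re-runs `action` until it succeeds or the attempts run out,
    re-raising the final error; mirrors the three-attempt loops above"""
    for i in range(attempts):
        try:
            return action()
        except Exception:
            # Last attempt failed, so propagate the error
            if i == attempts - 1:
                raise
    raise AssertionError("unreachable")
```

Factoring the loops into something like this would also remove the duplicated retry-message printing.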
305 | def _parse_metadata(self, res: dict[str, Any]):
306 | """Parses entirety of downloaded metadata"""
307 | # Normalize into types of videos
308 | videos = []
309 | livestreams = []
310 | shorts = []
311 | if len(res["entries"]) > 0 and "entries" not in res["entries"][0]:
312 | # Videos only
313 | videos = res["entries"]
314 | else:
315 | # Videos and at least one other (livestream/shorts)
316 | for entry in res["entries"]:
317 | kind = entry["title"].split(" - ")[-1].lower()
318 | if kind == "videos":
319 | videos = entry["entries"]
320 | elif kind == "live":
321 | livestreams = entry["entries"]
322 | elif kind == "shorts":
323 | shorts = entry["entries"]
324 | else:
325 | _err_msg(f"Unknown video kind '{kind}' found", True)
326 |
327 | # Parse metadata
328 | self._parse_metadata_videos("video", videos, self.videos)
329 | self._parse_metadata_videos("livestream", livestreams, self.livestreams)
330 | self._parse_metadata_videos("shorts", shorts, self.shorts)
331 |
332 | # Go through each and report deleted
333 | self._report_deleted(self.videos)
334 | self._report_deleted(self.livestreams)
335 | self._report_deleted(self.shorts)
336 |
337 | def download(self, config: DownloadConfig):
338 | """Downloads all videos which haven't already been downloaded"""
339 | # Clean out old part files
340 | self._clean_parts()
341 |
342 | # Create settings for the downloader
343 | settings = {
344 | # Set the output path
345 | "outtmpl": f"{self.path}/videos/%(id)s.%(ext)s",
346 | # Centralized logger hook for ignoring all stdout
347 | "logger": VideoLogger(),
348 | # Logger hook for download progress
349 | "progress_hooks": [VideoLogger.downloading],
350 | }
351 | if config.format is not None:
352 | settings["format"] = config.format
353 |
354 | # Attach to the downloader
355 | with YoutubeDL(settings) as ydl:
356 | # Retry downloading 5 times in total for all videos
357 | for i in range(5):
358 | # Try to curate a list and download videos on it
359 | try:
360 | # Curate list of non-downloaded videos
361 | not_downloaded = self._curate(config)
362 |
363 | # Stop if there's nothing to download
364 | if len(not_downloaded) == 0:
365 | break
366 |
367 | # Print curated if this is the first time
368 | if i == 0:
369 | fmt_num = (
370 | "a new video"
371 | if len(not_downloaded) == 1
372 | else f"{len(not_downloaded)} new videos"
373 | )
374 | print(f"Downloading {fmt_num}..")
375 |
376 | # Continuously try to download after private/deleted videos are found
377 | # This block gives the downloader all the curated videos and skips/reports deleted videos by filtering their exceptions
378 | while True:
379 | # Download from curated list then exit the optimistic loop
380 | try:
381 | urls = [video.url() for video in not_downloaded]
382 | ydl.download(urls)
383 | break
384 |
385 | # Special handling for private/deleted videos which are archived, if not we raise again
386 | except DownloadError as exception:
387 | # Video is privated or deleted
388 | if (
389 | "Private video" in exception.msg
390 | or "This video has been removed by the uploader"
391 | in exception.msg
392 | ):
393 | # Skip video from curated and get it as a return
394 | not_downloaded, video = _skip_video(
395 | not_downloaded, "deleted"
396 | )
397 |
398 | # If this is a new occurrence then set it & report
399 | # This will only happen if it's deleted after getting metadata, like in a dry run
400 | if video.deleted.current() is False:
401 | self.reporter.deleted.append(video)
402 | video.deleted.update(None, True)
403 |
404 | # User hasn't got ffmpeg installed and youtube hasn't got format 22
405 | # NOTE: see #55 to learn more
406 | # NOTE: sadly yt-dlp doesn't let us access yt_dlp.utils.ContentTooShortError so we check msg
407 | elif " bytes, expected " in exception.msg:
408 | # Skip video from curated
409 | not_downloaded, _ = _skip_video(
410 | not_downloaded,
411 | "no format found; please download ffmpeg!",
412 | True,
413 | )
414 |
415 | # Nevermind, normal exception
416 | else:
417 | raise exception
418 |
419 | # Stop if we've got them all
420 | break
421 |
422 | # Report error and retry/stop
423 | except Exception as exception:
424 | # Get around carriage return
425 | if i == 0:
426 | print()
427 |
428 | # Report error
429 | _err_dl("videos", exception, i != 4)
430 |
431 | def search(self, id: str):
432 | """Searches channel for a video with the corresponding `id` and returns"""
433 | # Search
434 | for video in self.videos:
435 | if video.id == id:
436 | return video
437 |
438 | # Raise exception if it's not found
439 | raise VideoNotFoundException(f"Couldn't find {id} inside archive")
440 |
441 | def _curate(self, config: DownloadConfig) -> list[Video]:
442 | """Curate videos which aren't downloaded and return their urls"""
443 |
444 | def curate_list(videos: list[Video], maximum: Optional[int]) -> list[Video]:
445 | """Curates the videos inside the provided `videos` list to its local maximum"""
446 | # Cut available videos to maximum if present for deterministic getting
447 | if maximum is not None:
448 | # Cap the maximum at the list length so we don't try to get more than there is
449 | fixed_maximum = min(len(videos), maximum)
450 |
451 | # Set the available videos to this fixed maximum
452 | new_videos = []
453 | for ind in range(fixed_maximum):
454 | new_videos.append(videos[ind])
455 | videos = new_videos
456 |
457 | # Find undownloaded videos in available list
458 | not_downloaded = []
459 | for video in videos:
460 | if not video.downloaded():
461 | not_downloaded.append(video)
462 |
463 | # Return
464 | return not_downloaded
465 |
466 | # Curate
467 | not_downloaded = []
468 | not_downloaded.extend(curate_list(self.videos, config.max_videos))
469 | not_downloaded.extend(curate_list(self.livestreams, config.max_livestreams))
470 | not_downloaded.extend(curate_list(self.shorts, config.max_shorts))
471 |
472 | # Return
473 | return not_downloaded
474 |
475 | def commit(self):
476 | """Commits (saves) archive to path; do this once you've finished all of your transactions"""
477 | # Save backup
478 | self._backup()
479 |
480 | # Directories
481 | print(f"Committing {self} to file..")
482 | paths = [self.path, self.path / "thumbnails", self.path / "videos"]
483 | for path in paths:
484 | if not path.exists():
485 | path.mkdir()
486 |
487 | # Config
488 | with open(self.path / "yark.json", "w+") as file:
489 | json.dump(self._to_dict(), file)
490 |
491 | def _parse_metadata_videos(self, kind: str, i: list, bucket: list):
492 | """Parses metadata for a category of video into its bucket and tells the user what's happening"""
493 |
494 | # Print at the start without a loading indicator so there's always a print
495 | msg = f"Parsing {kind} metadata.."
496 | print(msg, end="\r")
497 |
498 | # Start computing and show loading spinner
499 | with ThreadPoolExecutor() as ex:
500 | # Make future for computation of the video list
501 | future = ex.submit(self._parse_metadata_videos_comp, i, bucket)
502 |
503 | # Start spinning
504 | with PieSpinner(f"{msg} ") as bar:
505 | # Don't show bar for 2 seconds but check if future is done
506 | no_bar_time = time.time() + 2
507 | while time.time() < no_bar_time:
508 | if future.done():
509 | return
510 | time.sleep(0.25)
511 |
512 | # Spin until future is done
513 | while not future.done():
514 | time.sleep(0.075)
515 | bar.next()
516 |
517 | def _parse_metadata_videos_comp(self, i: list, bucket: list):
518 | """Computes the actual parsing for `_parse_metadata_videos` without outputting what's happening"""
519 | for entry in i:
520 | # Skip video if there's no formats available; happens with upcoming videos/livestreams
521 | if "formats" not in entry or len(entry["formats"]) == 0:
522 | continue
523 |
524 | # Updated intra-loop marker
525 | updated = False
526 |
527 | # Update video if it exists
528 | for video in bucket:
529 | if video.id == entry["id"]:
530 | video.update(entry)
531 | updated = True
532 | break
533 |
534 | # Add new video if not
535 | if not updated:
536 | video = Video.new(entry, self)
537 | bucket.append(video)
538 | self.reporter.added.append(video)
539 |
540 | # Sort videos by newest
541 | bucket.sort(reverse=True)
542 |
543 | def _report_deleted(self, videos: list):
544 | """Goes through a video category and reports & marks as deleted any videos which weren't found in the metadata, unless they're already known to be deleted"""
545 | for video in videos:
546 | if video.deleted.current() is False and not video.known_not_deleted:
547 | self.reporter.deleted.append(video)
548 | video.deleted.update(None, True)
549 |
550 | def _clean_parts(self):
551 | """Cleans out old temporary `.part` files which were stopped during download, if present"""
552 | # Make a bucket for found files
553 | deletion_bucket: list[Path] = []
554 |
555 | # Scan through and find part files
556 | videos = self.path / "videos"
557 | for file in videos.iterdir():
558 | if file.suffix == ".part" or file.suffix == ".ytdl":
559 | deletion_bucket.append(file)
560 |
561 | # Print and delete if there are part files present
562 | if len(deletion_bucket) != 0:
563 | print("Cleaning out previous temporary files..")
564 | for file in deletion_bucket:
565 | file.unlink()
566 |
567 | def _backup(self):
568 | """Creates a backup of the existing `yark.json` file in path as `yark.bak` with added comments"""
569 | # Get current archive path
570 | archive_path = self.path / "yark.json"
571 |
572 | # Skip backing up if the archive doesn't exist
573 | if not archive_path.exists():
574 | return
575 |
576 | # Open original archive to copy
577 | with open(archive_path, "r") as file_archive:
578 | # Add comment information to backup file
579 | save = f"// Backup of a Yark archive, dated {datetime.utcnow().isoformat()}\n// Remove these comments and rename to 'yark.json' to restore\n{file_archive.read()}"
580 |
581 | # Save new information into a new backup
582 | with open(self.path / "yark.bak", "w+") as file_backup:
583 | file_backup.write(save)
584 |
585 | @staticmethod
586 | def _from_dict(encoded: dict, path: Path) -> Channel:
587 | """Decodes archive which is being loaded back up"""
588 | channel = Channel()
589 | channel.path = path
590 | channel.version = encoded["version"]
591 | channel.url = encoded["url"]
592 | channel.reporter = Reporter(channel)
593 | channel.videos = [
594 | Video._from_dict(video, channel) for video in encoded["videos"]
595 | ]
596 | channel.livestreams = [
597 | Video._from_dict(video, channel) for video in encoded["livestreams"]
598 | ]
599 | channel.shorts = [
600 | Video._from_dict(video, channel) for video in encoded["shorts"]
601 | ]
602 | return channel
603 |
604 | def _to_dict(self) -> dict:
605 | """Converts channel data to a dictionary to commit"""
606 | return {
607 | "version": self.version,
608 | "url": self.url,
609 | "videos": [video._to_dict() for video in self.videos],
610 | "livestreams": [video._to_dict() for video in self.livestreams],
611 | "shorts": [video._to_dict() for video in self.shorts],
612 | }
613 |
614 | def __repr__(self) -> str:
615 | return self.path.name
616 |
617 |
618 | def _skip_video(
619 | videos: list[Video],
620 | reason: str,
621 | warning: bool = False,
622 | ) -> tuple[list[Video], Video]:
623 | """Skips the first undownloaded video in `videos`; make sure there's at least one to skip, otherwise an exception will be thrown"""
624 | # Find first undownloaded video
625 | for ind, video in enumerate(videos):
626 | if not video.downloaded():
627 | # Tell the user we're skipping over it
628 | if warning:
629 | print(
630 | Fore.YELLOW + f" • Skipping {video.id} ({reason})" + Fore.RESET,
631 | file=sys.stderr,
632 | )
633 | else:
634 | print(
635 | Style.DIM + f" • Skipping {video.id} ({reason})" + Style.NORMAL,
636 | )
637 |
638 | # Set videos to skip over this one
639 | videos = videos[ind + 1 :]
640 |
641 | # Return the corrected list and the video found
642 | return videos, video
643 |
644 | # Shouldn't happen, see docs
645 | raise Exception(
646 | "We expected to skip a video and return it but nothing to skip was found"
647 | )
648 |
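`_skip_video` returns the remainder of the list after the first undownloaded video together with that video; a minimal sketch of the same behaviour with a hypothetical stand-in class:

```python
from dataclasses import dataclass


@dataclass
class FakeVideo:
    """Hypothetical stand-in for yark's Video, just enough for this demo"""
    id: str
    is_downloaded: bool

    def downloaded(self) -> bool:
        return self.is_downloaded


def skip_first_undownloaded(
    videos: list[FakeVideo],
) -> tuple[list[FakeVideo], FakeVideo]:
    """Mirrors _skip_video: find the first undownloaded video and return the
    tail of the list after it along with the video itself"""
    for ind, video in enumerate(videos):
        if not video.downloaded():
            return videos[ind + 1 :], video
    raise Exception("Nothing to skip was found")
```

Because only the tail after the skipped video is returned, the caller's next download attempt never revisits earlier entries.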
649 |
650 | def _migrate_archive(
651 | current_version: int, expected_version: int, encoded: dict, channel_name: str
652 | ) -> dict:
653 | """Automatically migrates an archive from one version to another by bootstrapping"""
654 |
655 | def migrate_step(cur: int, encoded: dict) -> dict:
656 | """Step in recursion to migrate from one to another, contains migration logic"""
657 | # Stop because we've reached the desired version
658 | if cur == expected_version:
659 | return encoded
660 |
661 | # From version 1 to version 2
662 | elif cur == 1:
663 | # Channel id to url
664 | encoded["url"] = "https://www.youtube.com/channel/" + encoded["id"]
665 | del encoded["id"]
666 | print(
667 | Fore.YELLOW
668 | + "Please make sure "
669 | + encoded["url"]
670 | + " is the correct url"
671 | + Fore.RESET
672 | )
673 |
674 | # Empty livestreams/shorts lists
675 | encoded["livestreams"] = []
676 | encoded["shorts"] = []
677 |
678 | # From version 2 to version 3
679 | elif cur == 2:
680 | # Add deleted status to every video/livestream/short
681 | # NOTE: none is fine for new elements, just a slight bodge
682 | for video in encoded["videos"]:
683 | video["deleted"] = Element.new(Video._new_empty(), False)._to_dict()
684 | for video in encoded["livestreams"]:
685 | video["deleted"] = Element.new(Video._new_empty(), False)._to_dict()
686 | for video in encoded["shorts"]:
687 | video["deleted"] = Element.new(Video._new_empty(), False)._to_dict()
688 |
689 | # Unknown version
690 | else:
691 | _err_msg(f"Unknown archive version v{cur} found during migration", True)
692 | sys.exit(1)
693 |
694 | # Increment version and run again until version has been reached
695 | cur += 1
696 | encoded["version"] = cur
697 | return migrate_step(cur, encoded)
698 |
699 | # Inform user of the backup process
700 | print(
701 | Fore.YELLOW
702 | + f"Automatically migrating archive from v{current_version} to v{expected_version}, a backup has been made at {channel_name}/yark.bak"
703 | + Fore.RESET
704 | )
705 |
706 | # Start recursion step
707 | return migrate_step(current_version, encoded)
708 |
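`_migrate_archive` bootstraps one migration step per version and recurses until the target version is reached; a toy sketch of the same pattern on a simplified dict (the key changes here are illustrative, modelled loosely on the v1→v2 and v2→v3 steps):

```python
def migrate(cur: int, target: int, data: dict) -> dict:
    """Toy version of the migrator's bootstrapping pattern: apply one
    version's changes, bump the version, then recurse until the target"""
    if cur == target:
        return data
    if cur == 1:
        # v1 -> v2: rename a key (stands in for the id -> url change)
        data["url"] = data.pop("id")
    elif cur == 2:
        # v2 -> v3: add fields introduced by the newer format
        data.setdefault("livestreams", [])
        data.setdefault("shorts", [])
    else:
        raise ValueError(f"Unknown version v{cur}")
    cur += 1
    data["version"] = cur
    return migrate(cur, target, data)
```

Each breaking change only ever adds one small branch here, which is the trade-off the module docstring describes.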
709 |
710 | def _err_dl(name: str, exception: DownloadError, retrying: bool):
711 | """Prints errors to stdout depending on what kind of download error occurred"""
712 | # Default message
713 | msg = f"Unknown error whilst downloading {name}, details below:\n{exception}"
714 |
715 | # Types of errors
716 | ERRORS = [
717 | "<urlopen error [Errno 8] nodename nor servname provided, or not known>",
718 | "500",
719 | "Got error: The read operation timed out",
720 | "No such file or directory",
721 | "HTTP Error 404: Not Found",
722 | "<urlopen error timed out>",
723 | ]
724 |
725 | # Download errors
726 | if isinstance(exception, DownloadError):
727 | # Server connection
728 | if ERRORS[0] in exception.msg:
729 | msg = "Issue connecting with YouTube's servers"
730 |
731 | # Server fault
732 | elif ERRORS[1] in exception.msg:
733 | msg = "Fault with YouTube's servers"
734 |
735 | # Timeout
736 | elif ERRORS[2] in exception.msg:
737 | msg = "Timed out trying to download video"
738 |
739 | # Video deleted whilst downloading
740 | elif ERRORS[3] in exception.msg:
741 | msg = "Video deleted whilst downloading"
742 |
743 | # Channel not found, might need to retry with alternative route
744 | elif ERRORS[4] in exception.msg:
745 | msg = "Couldn't find channel by its id"
746 |
747 | # Random timeout; not sure if it's user-end or youtube-end
748 | elif ERRORS[5] in exception.msg:
749 | msg = "Timed out trying to reach YouTube"
750 |
751 | # Print error
752 | suffix = ", retrying in a few seconds.." if retrying else ""
753 | print(
754 | Fore.YELLOW + " • " + msg + suffix.ljust(40) + Fore.RESET,
755 | file=sys.stderr,
756 | )
757 |
758 | # Wait if retrying, exit if failed
759 | if retrying:
760 | time.sleep(5)
761 | else:
762 | _err_msg(f" • Sorry, failed to download {name}", True)
763 | sys.exit(1)
764 |
--------------------------------------------------------------------------------
/yark/cli.py:
--------------------------------------------------------------------------------
1 | """Homegrown cli for managing archives"""
2 |
3 | from pathlib import Path
4 | from colorama import Style, Fore
5 | import sys
6 | import threading
7 | import webbrowser
8 | from .errors import _err_msg, ArchiveNotFoundException
9 | from .channel import Channel, DownloadConfig
10 | from .viewer import viewer
11 |
12 | HELP = f"yark [options]\n\n YouTube archiving made simple.\n\nOptions:\n new [name] [url] Creates new archive with name and channel url\n refresh [name] [args?] Refreshes/downloads archive with optional config\n view [name?] Launches offline archive viewer website\n report [name] Provides a report on the most interesting changes\n\nExample:\n $ yark new owez https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA\n $ yark refresh owez\n $ yark view owez"
13 | """User-facing help message provided from the cli"""
14 |
15 |
16 | def _cli():
17 | """Command-line-interface launcher"""
18 |
19 | # Get arguments
20 | args = sys.argv[1:]
21 |
22 | # No arguments
23 | if len(args) == 0:
24 | print(HELP, file=sys.stderr)
25 | _err_msg(f"\nError: No arguments provided")
26 | sys.exit(1)
27 |
28 | # Help
29 | if args[0] in ["help", "--help", "-h"]:
30 | print(HELP)
31 |
32 | # Version
33 | # TODO: automatically track this
34 | elif args[0] in ["-v", "-ver", "--version", "--v"]:
35 | print("1.2.9")
36 |
37 | # Create new
38 | elif args[0] == "new":
39 | # More help
40 | if len(args) == 2 and args[1] == "--help":
41 | _err_no_help()
42 |
43 | # Bad arguments
44 | if len(args) < 3:
45 | _err_msg("Please provide an archive name and the channel url")
46 | sys.exit(1)
47 |
48 | # Create channel
49 | Channel.new(Path(args[1]), args[2])
50 |
51 | # Refresh
52 | elif args[0] == "refresh":
53 | # More help
54 | if len(args) == 2 and args[1] == "--help":
55 | # NOTE: if these get more complex, separate into something like "basic config" and "advanced config"
56 | print(
57 | f"yark refresh [name] [args?]\n\n Refreshes/downloads archive with optional configuration.\n If a maximum is set, unset categories won't be downloaded\n\nArguments:\n --videos=[max] Maximum recent videos to download\n --shorts=[max] Maximum recent shorts to download\n --livestreams=[max] Maximum recent livestreams to download\n --skip-metadata Skips downloading metadata\n --skip-download Skips downloading content\n --format=[str] Downloads using custom yt-dlp format for advanced users\n\n Example:\n $ yark refresh demo\n $ yark refresh demo --videos=5\n $ yark refresh demo --shorts=2 --livestreams=25\n $ yark refresh demo --skip-download"
58 | )
59 | sys.exit(0)
60 |
61 | # Bad arguments
62 | if len(args) < 2:
63 | _err_msg("Please provide the archive name")
64 | sys.exit(1)
65 |
66 | # Figure out configuration
67 | config = DownloadConfig()
68 | if len(args) > 2:
69 |
70 | def parse_value(config_arg: str) -> str:
71 | return config_arg.split("=")[1]
72 |
73 | def parse_maximum_int(config_arg: str) -> int:
74 | """Tries to parse a maximum integer input"""
75 | maximum = parse_value(config_arg)
76 | try:
77 | return int(maximum)
78 | except ValueError:
79 | print(HELP, file=sys.stderr)
80 | _err_msg(
81 | f"\nError: The value '{maximum}' isn't a valid maximum number"
82 | )
83 | sys.exit(1)
84 |
85 | # Go through each configuration argument
86 | for config_arg in args[2:]:
87 | # Video maximum
88 | if config_arg.startswith("--videos="):
89 | config.max_videos = parse_maximum_int(config_arg)
90 |
91 | # Livestream maximum
92 | elif config_arg.startswith("--livestreams="):
93 | config.max_livestreams = parse_maximum_int(config_arg)
94 |
95 | # Shorts maximum
96 | elif config_arg.startswith("--shorts="):
97 | config.max_shorts = parse_maximum_int(config_arg)
98 |
99 | # No metadata
100 | elif config_arg == "--skip-metadata":
101 | config.skip_metadata = True
102 |
103 | # No downloading; functionally equivalent to all maximums being 0 but it skips entirely
104 | elif config_arg == "--skip-download":
105 | config.skip_download = True
106 |
107 | # Custom yt-dlp format
108 | elif config_arg.startswith("--format="):
109 | config.format = parse_value(config_arg)
110 |
111 | # Unknown argument
112 | else:
113 | print(HELP, file=sys.stderr)
114 | _err_msg(
115 | f"\nError: Unknown configuration '{config_arg}' provided for archive refresh"
116 | )
117 | sys.exit(1)
118 |
119 | # Submit config settings
120 | config.submit()
121 |
122 | # Refresh channel using config context
123 | try:
124 | channel = Channel.load(args[1])
125 | if config.skip_metadata:
126 | print("Skipping metadata download..")
127 | else:
128 | channel.metadata()
129 | channel.commit() # NOTE: Do it here no matter, because it's metadata. Downloads do not modify the archive
130 | if config.skip_download:
131 | print("Skipping videos/livestreams/shorts download..")
132 | else:
133 | channel.download(config)
134 | channel.reporter.print()
135 | except ArchiveNotFoundException:
136 | _err_archive_not_found()
137 |
138 | # View
139 | elif args[0] == "view":
140 | # More help
141 | if len(args) == 2 and args[1] == "--help":
142 | print(
143 | f"yark view [name] [args?]\n\n  Launches offline archive viewer website.\n\nArguments:\n  --host=[str]  Custom host address to serve from\n  --port=[int]  Custom port number instead of 7667\n\n Example:\n  $ yark view foobar\n  $ yark view foobar --port=80\n  $ yark view foobar --port=1234 --host=0.0.0.0"
144 | )
145 | sys.exit(0)
146 |
147 | # Basis for custom host/port configs
148 | host = None
149 | port = 7667
150 |
151 | # Go through each configuration argument
152 | for config_arg in args[2:]:
153 | # Host configuration
154 | if config_arg.startswith("--host="):
155 | host = config_arg[7:]
156 |
157 | # Port configuration
158 | elif config_arg.startswith("--port="):
159 | if config_arg[7:].strip() == "":
160 | print(
161 | f"No port number provided for port argument",
162 | file=sys.stderr,
163 | )
164 | sys.exit(1)
165 | try:
166 | port = int(config_arg[7:])
167 | except ValueError:
168 | print(
169 | f"Invalid port number '{config_arg[7:]}' provided",
170 | file=sys.stderr,
171 | )
172 | sys.exit(1)
173 |
174 | def launch():
175 | """Launches viewer; Flask's run() blocks until interrupted"""
176 | app = viewer()
177 | app.run(host=host, port=port)
178 |
179 | # Start on channel name
180 | if len(args) > 1:
181 | # Get name
182 | channel = args[1]
183 |
184 | # Jank archive check
185 | if not Path(channel).exists():
186 | _err_archive_not_found()
187 |
188 | # Launch and start browser
189 | print(f"Starting viewer for {channel}..")
190 | webbrowser.open(f"http://127.0.0.1:{port}/channel/{channel}/videos")
191 | launch()
192 |
193 | # Start on channel finder
194 | else:
195 | print("Starting viewer..")
196 | webbrowser.open(f"http://127.0.0.1:{port}/")
197 | launch()
198 |
199 | # Report
200 | elif args[0] == "report":
201 | # Bad arguments
202 | if len(args) < 2:
203 | _err_msg("Please provide the archive name")
204 | sys.exit(1)
205 |
206 | channel = Channel.load(Path(args[1]))
207 | channel.reporter.interesting_changes()
208 |
209 | # Unknown
210 | else:
211 | print(HELP, file=sys.stderr)
212 | _err_msg(f"\nError: Unknown command '{args[0]}' provided!", True)
213 | sys.exit(1)
214 |
215 |
216 | def _err_archive_not_found():
217 | """Errors out the user if the archive doesn't exist"""
218 | _err_msg("Archive doesn't exist, please make sure you typed its name correctly!")
219 | sys.exit(1)
220 |
221 |
222 | def _err_no_help():
223 | """Prints out help message and exits, displaying a 'no additional help' message"""
224 | print(HELP)
225 | print("\nThere's no additional help for this command")
226 | sys.exit(0)
227 |
228 |
229 | # NOTE: not used, not sure why this is included. might be useful for the future
230 | # def _upgrade_messaging() -> None:
231 | # """
232 | # Give users some info on the new Yark 1.3 version because PyPI releases aren't supported
233 |
234 | # This wouldn't happen normally but users might be confused seeing as we're switching distribution methods.
235 | # """
236 | # # Major update message for 1.3
237 | # print(
238 | # Style.BRIGHT
239 | # + "Yark 1.3 is out now! Go to https://github.com/Owez/yark to download"
240 | # + Style.DIM
241 | # + " (pip is no longer supported)"
242 | # + Style.NORMAL
243 | # )
244 |
245 | # # Give a warning if it's been over a year since release
246 | # if datetime.datetime.utcnow().year >= 2024:
247 | # print(
248 | # Fore.YELLOW
249 | # + "You're currently on an outdated version of Yark"
250 | # + Fore.RESET,
251 | # file=sys.stderr,
252 | # )
253 |
--------------------------------------------------------------------------------
/yark/errors.py:
--------------------------------------------------------------------------------
1 | """Exceptions and error functions"""
2 |
3 | from colorama import Style, Fore
4 | import sys
5 |
6 |
7 | class ArchiveNotFoundException(Exception):
8 | """Archive couldn't be found, the name was probably incorrect"""
9 |
10 | def __init__(self, *args: object) -> None:
11 | super().__init__(*args)
12 |
13 |
14 | class VideoNotFoundException(Exception):
15 | """Video couldn't be found, the id was probably incorrect"""
16 |
17 | def __init__(self, *args: object) -> None:
18 | super().__init__(*args)
19 |
20 |
21 | class NoteNotFoundException(Exception):
22 | """Note couldn't be found, the id was probably incorrect"""
23 |
24 | def __init__(self, *args: object) -> None:
25 | super().__init__(*args)
26 |
27 |
28 | class TimestampException(Exception):
29 | """Invalid timestamp inputted for note"""
30 |
31 | def __init__(self, *args: object) -> None:
32 | super().__init__(*args)
33 |
34 |
35 | def _err_msg(msg: str, report_msg: bool = False):
36 | """Provides a red-coloured error message to the user in the STDERR pipe"""
37 | msg = (
38 | msg
39 | if not report_msg
40 | else f"{msg}\nPlease file a bug report if you think this is a problem with Yark!"
41 | )
42 | print(Fore.RED + Style.BRIGHT + msg + Style.NORMAL + Fore.RESET, file=sys.stderr)
43 |
--------------------------------------------------------------------------------
/yark/reporter.py:
--------------------------------------------------------------------------------
1 | """Channel reporting system allowing detailed logging of useful information"""
2 |
3 | from colorama import Fore, Style
4 | import datetime
5 | from .video import Video, Element
6 | from .utils import _truncate_text
7 | from typing import TYPE_CHECKING, Optional
8 |
9 | if TYPE_CHECKING:
10 | from .channel import Channel
11 |
12 |
13 | class Reporter:
14 | channel: "Channel"
15 | added: list[Video]
16 | deleted: list[Video]
17 | updated: list[tuple[str, Element]]
18 |
19 |     def __init__(self, channel: "Channel") -> None:
20 | self.channel = channel
21 | self.added = []
22 | self.deleted = []
23 | self.updated = []
24 |
25 | def print(self):
26 | """Prints coloured report to STDOUT"""
27 | # Initial message
28 | print(f"Report for {self.channel}:")
29 |
30 | # Updated
31 | for kind, element in self.updated:
32 | colour = (
33 | Fore.CYAN
34 | if kind in ["title", "description", "undeleted"]
35 | else Fore.BLUE
36 | )
37 | video = f" • {element.video}".ljust(82)
38 | kind = f" │ 🔥{kind.capitalize()}"
39 |
40 | print(colour + video + kind)
41 |
42 | # Added
43 | for video in self.added:
44 | print(Fore.GREEN + f" • {video}")
45 |
46 | # Deleted
47 | for video in self.deleted:
48 | print(Fore.RED + f" • {video}")
49 |
50 | # Nothing
51 | if not self.added and not self.deleted and not self.updated:
52 |             print(Style.DIM + " • Nothing was added or deleted")
53 |
54 | # Watermark
55 | print(_watermark())
56 |
57 | def add_updated(self, kind: str, element: Element):
58 | """Tells reporter that an element has been updated"""
59 | self.updated.append((kind, element))
60 |
61 | def reset(self):
62 | """Resets reporting values for new run"""
63 | self.added = []
64 | self.deleted = []
65 | self.updated = []
66 |
67 | def interesting_changes(self):
68 | """Reports on the most interesting changes for the channel linked to this reporter"""
69 |
70 | def fmt_video(kind: str, video: Video) -> str:
71 | """Formats a video if it's interesting, otherwise returns an empty string"""
72 | # Skip formatting because it's got nothing of note
73 | if (
74 | not video.title.changed()
75 | and not video.description.changed()
76 | and not video.deleted.changed()
77 | ):
78 | return ""
79 |
80 |             # Helper appending a formatted change count, capitalizing the first entry
81 |             buf: list[str] = []
82 |
83 |             def add_buf(name: str, change: int, colour: str) -> None:
84 |                 word = name.capitalize() if not buf else name
85 |                 buf.append(colour + word + f" x{change}" + Fore.RESET)
86 |
87 | # Figure out how many changes have happened in each category and format them together
88 | change_deleted = sum(
89 |                 1 for value in video.deleted.inner.values() if value is True
90 | )
91 | if change_deleted != 0:
92 | add_buf("deleted", change_deleted, Fore.RED)
93 | change_description = len(video.description.inner) - 1
94 | if change_description != 0:
95 | add_buf("description", change_description, Fore.CYAN)
96 | change_title = len(video.title.inner) - 1
97 | if change_title != 0:
98 | add_buf("title", change_title, Fore.CYAN)
99 |
100 | # Combine the detected changes together and capitalize
101 | changes = ", ".join(buf) + Fore.RESET
102 |
103 | # Truncate title, get viewer link, and format all together with viewer link
104 | title = _truncate_text(video.title.current(), 51).strip()
105 | url = f"http://127.0.0.1:7667/channel/{video.channel}/{kind}/{video.id}"
106 | return (
107 | f" • {title}\n {changes}\n "
108 | + Style.DIM
109 | + url
110 | + Style.RESET_ALL
111 | + "\n"
112 | )
113 |
114 |         def fmt_category(kind: str, videos: list[Video]) -> Optional[str]:
115 | """Returns formatted string for an entire category of `videos` inputted or returns nothing"""
116 | # Add interesting videos to buffer
117 | HEADING = f"Interesting {kind}:\n"
118 | buf = HEADING
119 | for video in videos:
120 | buf += fmt_video(kind, video)
121 |
122 | # Return depending on if the buf is just the heading
123 | return None if buf == HEADING else buf[:-1]
124 |
125 |         # Tell users what's happening
126 | print(f"Finding interesting changes in {self.channel}..")
127 |
128 | # Get reports on the three categories
129 | categories = [
130 | ("videos", fmt_category("videos", self.channel.videos)),
131 | ("livestreams", fmt_category("livestreams", self.channel.livestreams)),
132 | ("shorts", fmt_category("shorts", self.channel.shorts)),
133 | ]
134 |
135 | # Combine those with nothing of note and print out interesting
136 | not_of_note = []
137 | for name, buf in categories:
138 | if buf is None:
139 | not_of_note.append(name)
140 | else:
141 | print(buf)
142 |
143 | # Print out those with nothing of note at the end
144 |         if len(not_of_note) != 0:
145 |             joined = "/".join(not_of_note)
146 |             print(f"No interesting {joined} found")
147 |
148 | # Watermark
149 | print(_watermark())
150 |
151 |
152 | def _watermark() -> str:
153 | """Returns a new watermark with a Yark timestamp"""
154 | date = datetime.datetime.utcnow().isoformat()
155 | return Style.RESET_ALL + f"Yark – {date}"
156 |
--------------------------------------------------------------------------------
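The change counting in `fmt_video` above can be reduced to two rules, assuming (as the code implies) that an `Element`'s `inner` is a mapping from timestamps to recorded values. A worked sketch with hypothetical history data:

```python
# Hypothetical Element histories: timestamp -> recorded value
title_history = {
    "2022-01-01": "My first video",
    "2022-06-01": "My first video (REMASTERED)",
}
deleted_history = {"2022-01-01": False, "2022-08-01": True}

# Rule 1: a history always contains the initial value, so the number of
# *changes* is one less than the number of recorded values
change_title = len(title_history) - 1

# Rule 2: deletions count only the entries where the flag was recorded as True
change_deleted = sum(1 for value in deleted_history.values() if value is True)

print(change_title, change_deleted)  # 1 1
```

A video with no recorded changes yields zero in every category, which is why `fmt_video` can skip it entirely and return an empty string.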
/yark/templates/base.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 | Yark{% if title %} · {{ title }}{% endif %}
15 |
16 |
17 |
18 |
19 |
117 | {% block styling %}{% endblock %}
118 |
119 |
126 |
127 | {% block content %}{% endblock %}
128 |
129 | {% if error %}
130 |