├── .gitignore
├── LICENSE
├── README.md
├── examples
│   ├── images
│   │   ├── cli_dark.png
│   │   ├── rewrite.png
│   │   ├── transition.png
│   │   ├── viewer_light.png
│   │   └── viewer_stats_light.png
│   └── madness.py
├── poetry.lock
├── pyproject.toml
└── yark
    ├── __init__.py
    ├── __main__.py
    ├── archiver
    │   └── archive.py
    ├── channel.py
    ├── cli.py
    ├── errors.py
    ├── reporter.py
    ├── templates
    │   ├── base.html
    │   ├── channel.html
    │   ├── index.html
    │   └── video.html
    ├── utils.py
    ├── video.py
    └── viewer.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .vscode
2 | demo/
3 | **/.DS*
4 | dist/
5 | __pycache__/
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2022 Owen Griffiths
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Yark
2 |
3 | YouTube archiving made simple.
4 |
5 |
6 |
7 |
8 |
9 |
11 |
12 | ## Installation
13 |
14 | To install Yark, simply download [Python 3.9+](https://www.python.org/downloads/) and [FFmpeg](https://ffmpeg.org/) (optional), then run the following:
15 |
16 | ```shell
17 | $ pip3 install yark
18 | ```
19 |
20 | ## Managing your Archive
21 |
22 | Once you've installed Yark, think of a name for your archive (e.g., "foobar") and copy the target channel's URL:
23 |
24 | ```shell
25 | $ yark new foobar https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA
26 | ```
27 |
28 | Now that you've created the archive, you can tell Yark to download all videos and metadata using the refresh command:
29 |
30 | ```shell
31 | $ yark refresh foobar
32 | ```
33 |
34 | Once everything has been downloaded, Yark will automatically give you a status report of what's changed since the last refresh:
35 |
36 |
37 |
38 | ## Viewing your Archive
39 |
40 | Viewing your archive is easy; just type `view` with your archive's name:
41 |
42 | ```shell
43 | $ yark view foobar
44 | ```
45 |
46 | This will pop up an offline website in your browser letting you watch all videos 🚀
47 |
48 |
49 |
50 | Under each video is a rich history report filled with timelines and graphs, as well as a noting feature which lets you add timestamped and permalinked comments 👐
51 |
52 |
53 |
54 | Light and dark modes are both available and automatically apply based on the system's theme.
55 |
56 | ## Details
57 |
58 | Here are some things to keep in mind when using Yark; the good and the bad:
59 |
60 | - Don't create a new archive again if you just want to update it; Yark accumulates all new metadata for you via timestamps
61 | - Feel free to suggest new features via the issues tab on this repository
62 | - Scheduling isn't a feature just yet, please use [`cron`](https://en.wikipedia.org/wiki/Cron) or something similar!
63 |
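Until scheduling lands, a periodic refresh is easy to script yourself. Below is a minimal sketch which shells out to the `yark` CLI on a fixed interval; the `runner` parameter is only a hypothetical injection point so the loop can be exercised without actually invoking the CLI:

```python
import subprocess
import time


def refresh_on_interval(
    archive: str,
    interval_seconds: float,
    iterations: int,
    runner=subprocess.run,
) -> None:
    """Runs `yark refresh <archive>` once per interval, `iterations` times"""
    for i in range(iterations):
        runner(["yark", "refresh", archive], check=True)
        # Sleep between refreshes, but not after the final one
        if i + 1 < iterations:
            time.sleep(interval_seconds)


# Example: refresh the "foobar" archive three times, an hour apart
# refresh_on_interval("foobar", 60 * 60, 3)
```

For unattended use, a `cron` entry (or a systemd timer) calling `yark refresh foobar` directly remains the simplest option.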
64 | ## Archive Format
65 |
66 | The archive format itself is simple: a directory-based structure with a core metadata file, plus all thumbnail/video data stored as ordinary files in their own directories:
67 |
68 | - `[name]/` – Your self-contained archive
69 | - `yark.json` – Archive file with all metadata
70 | - `yark.bak` – Backup archive file to protect against data damage
71 | - `videos/` – Directory containing all known videos
72 | - `[id].*` – Files containing video data for YouTube videos
73 | - `thumbnails/` – Directory containing all known thumbnails
74 | - `[hash].png` – Thumbnail files, each named by its hash
75 |
76 | It's best to take a few minutes to familiarize yourself with your archive by looking through any files that interest you; everything is quite readable.
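As a sketch of how approachable the format is, the snippet below writes a toy `yark.json` (a heavily simplified stand-in; real archives carry much more metadata per video) and reads it straight back with nothing but the standard library:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Build a toy archive directory with a minimal metadata file;
# this mirrors only the top-level shape, not the full schema
with TemporaryDirectory() as tmp:
    archive = Path(tmp) / "foobar"
    archive.mkdir()
    metadata = {
        "version": 3,
        "url": "https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA",
        "videos": [],
        "livestreams": [],
        "shorts": [],
    }
    (archive / "yark.json").write_text(json.dumps(metadata))

    # Reading it back is plain JSON, no special tooling needed
    loaded = json.loads((archive / "yark.json").read_text())
    print(loaded["version"], loaded["url"])
```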
77 |
--------------------------------------------------------------------------------
/examples/images/cli_dark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/cli_dark.png
--------------------------------------------------------------------------------
/examples/images/rewrite.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/rewrite.png
--------------------------------------------------------------------------------
/examples/images/transition.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/transition.png
--------------------------------------------------------------------------------
/examples/images/viewer_light.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/viewer_light.png
--------------------------------------------------------------------------------
/examples/images/viewer_stats_light.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Owez/yark/e9a6164245274b6cfa60a501d33e3c77069a4a8e/examples/images/viewer_stats_light.png
--------------------------------------------------------------------------------
/examples/madness.py:
--------------------------------------------------------------------------------
1 | from yark import Channel, DownloadConfig
2 | from pathlib import Path
3 |
4 | # Create a new channel
5 | channel = Channel.new(
6 | Path("demo"), "https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA"
7 | )
8 |
9 | # Refresh only metadata and commit to file
10 | channel.metadata()
11 | channel.commit()
12 |
13 | # Load the channel back up from file for the fun of it
14 | channel = Channel.load(Path("demo"))
15 |
16 | # Print all the video ids of the channel
17 | print(", ".join([video.id for video in channel.videos]))
18 |
19 | # Get a cool video I made and print its description
20 | video = channel.search("annp92OPZgQ")
21 | print(video.description.current())
22 |
23 | # Download the 5 most recent videos and 10 most recent shorts
24 | config = DownloadConfig()
25 | config.max_videos = 5
26 | config.max_shorts = 10
27 | config.submit()
28 | channel.download(config)
29 |
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [tool.poetry]
2 | name = "yark"
3 | version = "1.2.12"
4 | description = "YouTube archiving made simple."
5 | authors = ["Owen Griffiths "]
6 | license = "MIT"
7 | readme = "README.md"
8 | repository = "https://github.com/owez/yark"
9 | classifiers = [
10 | "Topic :: System :: Archiving",
11 | "Topic :: System :: Archiving :: Backup",
12 | "Topic :: Multimedia :: Video",
13 | ]
14 | include = [{ path = "templates/*" }]
15 |
16 | [tool.poetry.dependencies]
17 | python = "^3.9"
18 | Flask = "^2.3.1"
19 | requests = "^2.28.2"
20 | colorama = "^0.4.6"
21 | yt-dlp = "2024.10.07"
22 | progress = "^1.6"
23 |
24 | [tool.poetry.scripts]
25 | yark = "yark.cli:_cli"
26 |
27 | [tool.poetry.group.dev.dependencies]
28 | mypy = "^0.991"
29 | poethepoet = "^0.18.1"
30 | types-colorama = "^0.4.15.11"
31 | types-requests = "^2.28.11.17"
32 | black = "^22.12.0"
33 |
34 | [build-system]
35 | requires = ["poetry-core>=1.0.0"]
36 | build-backend = "poetry.core.masonry.api"
37 |
--------------------------------------------------------------------------------
/yark/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 | Yark
3 | ====
4 |
5 | YouTube archiving made simple.
6 |
7 | Commonly-used
8 | -------------
9 |
10 | - `Channel`
11 | - `DownloadConfig`
12 | - `Video`
13 | - `Element`
14 | - `Note`
15 | - `Thumbnail`
16 | - `viewer()`
17 | - `ArchiveNotFoundException`
18 | - `VideoNotFoundException`
19 | - `NoteNotFoundException`
20 | - `TimestampException`
21 |
22 | Beware that using Yark as a library is currently experimental and breaking changes here are not tracked!
23 | """
24 |
25 | from .channel import Channel, DownloadConfig
26 | from .video import Video, Element, Note, Thumbnail
27 | from .viewer import viewer
28 | from .errors import (
29 | ArchiveNotFoundException,
30 | VideoNotFoundException,
31 | NoteNotFoundException,
32 | TimestampException,
33 | )
34 |
--------------------------------------------------------------------------------
/yark/__main__.py:
--------------------------------------------------------------------------------
1 | """Main runner for those using `python3 -m yark` instead of the proper `yark` script poetry provides"""
2 |
3 | from .cli import _cli
4 |
5 | _cli()
6 |
--------------------------------------------------------------------------------
/yark/archiver/archive.py:
--------------------------------------------------------------------------------
1 | """Archive management with metadata/video downloading core"""
2 |
3 | from __future__ import annotations
4 | from datetime import datetime
5 | import json
6 | from pathlib import Path
7 | import time
8 | from yt_dlp import YoutubeDL, DownloadError # type: ignore
9 | import sys
10 | from .reporter import Reporter
11 | from ..errors import ArchiveNotFoundException, MetadataFailException
12 | from .video.video import Video, Videos
13 | from .comment_author import CommentAuthor
14 | from typing import Optional, Any
15 | from .config import Config, YtDlpSettings
16 | from .converter import Converter
17 | from .migrator import _migrate
18 | from ..utils import ARCHIVE_COMPAT, _log_err
19 | from dataclasses import dataclass
20 | import logging
21 |
22 | RawMetadata = dict[str, Any]
23 | """Raw metadata downloaded from yt-dlp to be parsed"""
24 |
25 |
26 | @dataclass(init=False)
27 | class Archive:
28 | path: Path
29 | url: str
30 | version: int
31 | videos: Videos
32 | livestreams: Videos
33 | shorts: Videos
34 | reporter: Reporter
35 | comment_authors: dict[str, CommentAuthor]
36 |
37 | def __init__(
38 | self,
39 | path: Path,
40 | url: str,
41 | version: int = ARCHIVE_COMPAT,
42 | videos: Videos | None = None,
43 | livestreams: Videos | None = None,
44 | shorts: Videos | None = None,
45 | comment_authors: dict[str, CommentAuthor] = {},
46 | ) -> None:
47 | self.path = path
48 | self.url = url
49 | self.version = version
50 | self.videos = Videos(self) if videos is None else videos
51 | self.livestreams = Videos(self) if livestreams is None else livestreams
52 | self.shorts = Videos(self) if shorts is None else shorts
53 | self.reporter = Reporter(self)
54 | self.comment_authors = comment_authors
55 |
56 | @staticmethod
57 | def load(path: Path) -> Archive:
58 | """Loads existing archive from path"""
59 | # Check existence
60 | path = Path(path)
61 | archive_name = path.name
62 | logging.info(f"Loading {archive_name} archive")
63 | if not path.exists():
64 | raise ArchiveNotFoundException(path)
65 |
66 | # Load config
67 | encoded = json.load(open(path / "yark.json", "r"))
68 |
69 | # Check version before fully decoding and exit if wrong
70 | archive_version = encoded["version"]
71 | if archive_version != ARCHIVE_COMPAT:
72 | encoded = _migrate(
73 | archive_version, ARCHIVE_COMPAT, encoded, path, archive_name
74 | )
75 |
76 | # Decode and return
77 | return Archive._from_archive_o(encoded, path)
78 |
79 | def metadata_download(self, config: Config) -> RawMetadata:
80 | """Downloads raw metadata for further parsing"""
81 | logging.info(f"Downloading raw metadata for {self}")
82 |
83 | # Get settings
84 | settings = config.settings_md()
85 |
86 | # Pull metadata from youtube
87 | with YoutubeDL(settings) as ydl:
88 | for i in range(3):
89 | try:
90 | res: RawMetadata = ydl.extract_info(self.url, download=False)
91 | return res
92 | except Exception as exception:
93 | # Report error
94 | retrying = i != 2
95 | _download_error("metadata", exception, retrying)
96 |
97 | # Log retrying message
98 | if retrying:
99 | logging.warning(f"Retrying metadata download ({i+1}/3)")
100 |
101 | # Couldn't download after all retries
102 | raise MetadataFailException()
103 |
104 | def metadata_parse(self, config: Config, metadata: RawMetadata) -> None:
105 | """Updates current archive by parsing the raw downloaded metadata"""
106 | logging.info(f"Parsing downloaded metadata for {self}")
107 |
108 | # Make buckets to normalize different types of videos
109 | videos = []
110 | livestreams = []
111 | shorts = []
112 |
113 | # Videos only (basic channel or playlist)
114 | if "entries" not in metadata["entries"][0]:
115 | videos = metadata["entries"]
116 |
117 | # Videos and at least one other (livestream/shorts)
118 | else:
119 | for entry in metadata["entries"]:
120 | # Find the kind of category this is; youtube formats these as 3 playlists
121 | kind = entry["title"].split(" - ")[-1].lower()
122 |
123 | # Plain videos
124 | if kind == "videos":
125 | videos = entry["entries"]
126 |
127 | # Livestreams
128 | elif kind == "live":
129 | livestreams = entry["entries"]
130 |
131 | # Shorts
132 | elif kind == "shorts":
133 | shorts = entry["entries"]
134 |
135 | # Unknown 4th kind; youtube might've updated
136 | else:
137 | _log_err(f"Unknown video kind '{kind}' found", True)
138 |
139 | # Parse metadata
140 | self._metadata_parse_videos("video", config, videos, self.videos)
141 | self._metadata_parse_videos("livestream", config, livestreams, self.livestreams)
142 | self._metadata_parse_videos("shorts", config, shorts, self.shorts)
143 |
144 | # Go through each and report deleted
145 | self._report_deleted(self.videos)
146 | self._report_deleted(self.livestreams)
147 | self._report_deleted(self.shorts)
148 |
149 | def _metadata_parse_videos(
150 | self,
151 | kind: str,
152 | config: Config,
153 | entries: list[dict[str, Any]],
154 | videos: Videos,
155 | ) -> None:
156 | """Parses metadata for a category of video into its `videos` bucket"""
157 | logging.debug(f"Parsing through {kind} for {self}")
158 |
159 | # Parse each video
160 | for entry in entries:
161 | self._metadata_parse_video(config, entry, videos)
162 |
163 | # Sort videos by newest
164 | videos.sort()
165 |
166 | def _metadata_parse_video(
167 | self, config: Config, entry: dict[str, Any], videos: Videos
168 | ) -> None:
169 | """Parses metadata for one video, creating it or updating it depending on the `videos` already in the bucket"""
170 | id = entry["id"]
171 | logging.debug(f"Parsing video {id} metadata for {self}")
172 |
173 | # Skip video if there's no formats available; happens with upcoming videos/livestreams
174 | if "formats" not in entry or len(entry["formats"]) == 0:
175 | return
176 |
177 | # Update the video if it already exists in the bucket
178 | found_video = videos.inner.get(id)
179 | if found_video is not None:
180 | found_video.update(config, entry)
181 | return
182 |
183 | # Otherwise add it as a new video
184 | video = Video.new(config, self, entry)
185 | videos.inner[video.id] = video
186 | self.reporter.added.append(video)
192 |
193 | def download(self, config: Config) -> bool:
194 | """Downloads all videos which haven't already been downloaded, returning if anything was downloaded"""
195 | logging.debug(f"Downloading curated videos for {self}")
196 |
197 | # Prepare; clean out old part files and get settings
198 | self._clean_parts()
199 | settings = config.settings_dl(self.path)
200 |
201 | # Retry downloading 5 times in total for all videos
202 | anything_downloaded = True
203 | for i in range(5):
204 | # Try to curate a list and download videos on it
205 | try:
206 | # Curate list of non-downloaded videos
207 | not_downloaded = self._curate(config)
208 |
209 | # Return if there's nothing to download
210 | if len(not_downloaded) == 0:
211 | anything_downloaded = False
212 | return False
213 |
214 | # Launch core to download all curated videos
215 | self._download_launch(settings, not_downloaded)
216 |
217 | # Stop if we've got them all
218 | break
219 |
220 | # Report error and retry/stop
221 | except Exception as exception:
222 | _download_error("videos", exception, i != 4)
223 |
224 | # End by converting any downloaded but unsupported video file formats
225 | if anything_downloaded:
226 | converter = Converter(self.path / "videos")
227 | converter.run()
228 |
229 | # Say that something was downloaded
230 | return True
231 |
232 | def _download_launch(
233 | self, settings: YtDlpSettings, not_downloaded: list[Video]
234 | ) -> None:
235 | """Downloads all `not_downloaded` videos passed into it whilst automatically handling private videos; this is the core of the downloader"""
236 | # Continuously try to download after private/deleted videos are found
237 | # This block gives the downloader all the curated videos and skips/reports deleted videos by filtering their exceptions
238 | while True:
239 | # Download from curated list then exit the optimistic loop
240 | try:
241 | urls = [video.url() for video in not_downloaded]
242 | with YoutubeDL(settings) as ydl:
243 | ydl.download(urls)
244 | break
245 |
246 | # Special handling for private/deleted videos which are archived, if not we raise again
247 | except DownloadError as exception:
248 | new_not_downloaded = self._download_exception_handle(
249 | not_downloaded, exception
250 | )
251 | if new_not_downloaded is not None:
252 | not_downloaded = new_not_downloaded
253 |
254 | def _download_exception_handle(
255 | self, not_downloaded: list[Video], exception: DownloadError
256 | ) -> Optional[list[Video]]:
257 | """Handle for failed downloads if there's a special private/deleted video"""
258 | # Set new list for not downloaded to return later
259 | new_not_downloaded = None
260 |
261 | # Video is privated or deleted
262 | if (
263 | "Private video" in exception.msg
264 | or "This video has been removed by the uploader" in exception.msg
265 | ):
266 | # Skip video from curated and get it as a return
267 | new_not_downloaded, video = _skip_video(not_downloaded, "deleted")
268 |
269 | # If this is a new occurrence then set it & report
270 | # This will only happen if it's deleted after getting metadata, like in a dry run
271 | if video.deleted.current() == False:
272 | self.reporter.deleted.append(video)
273 | video.deleted.update(None, True)
274 |
275 | # User hasn't got ffmpeg installed and youtube hasn't got format 22
276 | # NOTE: see #55 to learn more
277 | # NOTE: sadly yt-dlp doesn't let us access yt_dlp.utils.ContentTooShortError so we check msg
278 | elif " bytes, expected " in exception.msg:
279 | # Skip video from curated
280 | new_not_downloaded, _ = _skip_video(
281 | not_downloaded,
282 | "no format found; please download ffmpeg!",
283 | True,
284 | )
285 |
286 | # Nevermind, normal exception
287 | else:
288 | raise exception
289 |
290 | # Return
291 | return new_not_downloaded
292 |
293 | def _curate(self, config: Config) -> list[Video]:
294 | """Curate videos which aren't downloaded and return their urls"""
295 |
296 | def curate_list(videos: Videos, maximum: Optional[int]) -> list[Video]:
297 | """Curates the videos inside the provided `videos` list to its local maximum"""
298 | # Make a list for the videos
299 | found_videos = []
300 |
301 | # Add all undownloaded videos because there's no maximum
302 | if maximum is None:
303 | found_videos = list(
304 | [video for video in videos.inner.values() if not video.downloaded()]
305 | )
306 |
307 | # Cut available videos to maximum if present for deterministic getting
308 | else:
309 | # Fix the maximum to the length so we don't try to get more than there is
310 | fixed_maximum = min(max(len(videos.inner) - 1, 0), maximum)
311 |
312 | # Set the available videos to this fixed maximum
313 | values = list(videos.inner.values())
314 | for ind in range(fixed_maximum):
315 | # Get video
316 | video = values[ind]
317 |
318 | # Save video if it's not been downloaded yet
319 | if not video.downloaded():
320 | found_videos.append(video)
321 |
322 | # Return
323 | return found_videos
324 |
325 | # Curate
326 | not_downloaded = []
327 | not_downloaded.extend(curate_list(self.videos, config.max_videos))
328 | not_downloaded.extend(curate_list(self.livestreams, config.max_livestreams))
329 | not_downloaded.extend(curate_list(self.shorts, config.max_shorts))
330 |
331 | # Return
332 | return not_downloaded
333 |
334 | def commit(self, backup: bool = False) -> None:
335 | """Commits (saves) archive to path; do this once you've finished all of your transactions"""
336 | # Save backup if explicitly wanted
337 | if backup:
338 | self._backup()
339 |
340 | # Directories
341 | logging.info(f"Committing {self} to file")
342 | paths = [self.path, self.path / "images", self.path / "videos"]
343 | for path in paths:
344 | if not path.exists():
345 | path.mkdir()
346 |
347 | # Config
348 | with open(self.path / "yark.json", "w+") as file:
349 | json.dump(self._to_archive_o(), file)
350 |
351 | def _report_deleted(self, videos: Videos) -> None:
352 | """Goes through a video category and reports & marks as deleted any videos which weren't found in the fetched metadata and aren't already known to be deleted"""
353 | for video in videos.inner.values():
354 | if video.deleted.current() == False and not video.known_not_deleted:
355 | self.reporter.deleted.append(video)
356 | video.deleted.update(None, True)
357 |
358 | def _clean_parts(self) -> None:
359 | """Cleans out old temporary `.part` files left behind by downloads which were stopped midway, if present"""
360 | # Make a bucket for found files
361 | deletion_bucket: list[Path] = []
362 |
363 | # Scan through and find part files
364 | videos = self.path / "videos"
365 | deletion_bucket.extend([file for file in videos.glob("*.part")])
366 | deletion_bucket.extend([file for file in videos.glob("*.ytdl")])
367 |
368 | # Log and delete if there are part files present
369 | if len(deletion_bucket) != 0:
370 | logging.info("Cleaning out previous temporary files..")
371 | for file in deletion_bucket:
372 | file.unlink()
373 |
374 | def _backup(self) -> None:
375 | """Creates a backup of the existing `yark.json` file in path as `yark.bak` with added comments"""
376 | logging.info(f"Creating a backup for {self} as yark.bak")
377 |
378 | # Get current archive path
379 | ARCHIVE_PATH = self.path / "yark.json"
380 |
381 | # Skip backing up if the archive doesn't exist
382 | if not ARCHIVE_PATH.exists():
383 | return
384 |
385 | # Open original archive to copy
386 | with open(self.path / "yark.json", "r") as file_archive:
387 | # Add comment information to backup file
388 | save = f"// Backup of a Yark archive, dated {datetime.utcnow().isoformat()}\n// Remove these comments and rename to 'yark.json' to restore\n{file_archive.read()}"
389 |
390 | # Save new information into a new backup
391 | with open(self.path / "yark.bak", "w+") as file_backup:
392 | file_backup.write(save)
393 |
394 | @staticmethod
395 | def _from_archive_o(encoded: dict[str, Any], path: Path) -> Archive:
396 | """Decodes object dict from archive which is being loaded back up"""
397 |
398 | # Initiate archive
399 | archive = Archive(path, encoded["url"], encoded["version"])
400 |
401 | # Decode id & body style comment authors
402 | # NOTE: needed above video decoding for comments
403 | for id in encoded["comment_authors"].keys():
404 | archive.comment_authors[id] = CommentAuthor._from_archive_ib(
405 | archive, id, encoded["comment_authors"][id]
406 | )
407 |
408 | # Load up videos/livestreams/shorts
409 | archive.videos = Videos._from_archive_o(archive, encoded["videos"])
410 | archive.livestreams = Videos._from_archive_o(archive, encoded["livestreams"])
411 | archive.shorts = Videos._from_archive_o(archive, encoded["shorts"])
412 |
413 | # Return
414 | return archive
415 |
416 | def _to_archive_o(self) -> dict[str, Any]:
417 | """Converts all archive data to a object dict to commit"""
418 | # Encode comment authors
419 | comment_authors = {}
420 | for id in self.comment_authors.keys():
421 | comment_authors[id] = self.comment_authors[id]._to_archive_b()
422 |
423 | # Basics
424 | payload = {
425 | "version": self.version,
426 | "url": self.url,
427 | "videos": self.videos._to_archive_o(),
428 | "livestreams": self.livestreams._to_archive_o(),
429 | "shorts": self.shorts._to_archive_o(),
430 | "comment_authors": comment_authors,
431 | }
432 |
433 | # Return
434 | return payload
435 |
436 | def __repr__(self) -> str:
437 | return self.path.name
438 |
439 |
440 | def _skip_video(
441 | videos: list[Video],
442 | reason: str,
443 | warning: bool = False,
444 | ) -> tuple[list[Video], Video]:
445 | """Skips the first undownloaded video in `videos`; make sure there's at least one to skip, otherwise an exception will be thrown"""
446 | # Find first undownloaded video
447 | for ind, video in enumerate(videos):
448 | if not video.downloaded():
449 | # Tell the user we're skipping over it
450 | if warning:
451 | logging.warning(
452 | f"Skipping video {video.id} download for {video.archive} ({reason})"
453 | )
454 | else:
455 | logging.info(
456 | f"Skipping video {video.id} download for {video.archive} ({reason})"
457 | )
458 |
459 | # Set videos to skip over this one
460 | videos = videos[ind + 1 :]
461 |
462 | # Return the corrected list and the video found
463 | return videos, video
464 |
465 | # Shouldn't happen, see docs
466 | raise Exception(
467 | "We expected to skip a video and return it but nothing to skip was found"
468 | )
469 |
470 |
471 | def _download_error(
472 | archive_name: str, exception: DownloadError, retrying: bool
473 | ) -> None:
474 | """Logs errors depending on what kind of download error occurred"""
475 | # Default message
476 | msg = (
477 | f"Unknown error whilst downloading {archive_name}, details below:\n{exception}"
478 | )
479 |
480 | # Types of errors
481 | ERRORS = [
482 | "<urlopen error [Errno 8] nodename nor servname provided, or not known>",
483 | "500",
484 | "Got error: The read operation timed out",
485 | "No such file or directory",
486 | "HTTP Error 404: Not Found",
487 | "<urlopen error timed out>",
488 | "Did not get any data blocks",
489 | ]
490 |
491 | # Download errors
492 | if type(exception) == DownloadError:
493 | # Server connection
494 | if ERRORS[0] in exception.msg or ERRORS[5] in exception.msg:
495 | msg = "Issue connecting with YouTube's servers"
496 |
497 | # Server fault
498 | elif ERRORS[1] in exception.msg:
499 | msg = "Fault with YouTube's servers"
500 |
501 | # Timeout
502 | elif ERRORS[2] in exception.msg:
503 | msg = "Timed out trying to download video"
504 |
505 | # Video deleted whilst downloading
506 | elif ERRORS[3] in exception.msg:
507 | msg = "Video deleted whilst downloading"
508 |
509 | # Target not found, might need to retry with alternative route
510 | elif ERRORS[4] in exception.msg:
511 | msg = "Couldn't find target by its id"
512 |
513 | # Random timeout; not sure if it's user-end or youtube-end
514 | elif ERRORS[5] in exception.msg:
515 | msg = "Timed out trying to reach YouTube"
516 |
517 | # Log error
518 | suffix = ", retrying in a few seconds.." if retrying else ""
519 | logging.warning(msg + suffix)
520 |
521 | # Wait if retrying, exit if failed
522 | if retrying:
523 | time.sleep(5)
524 | else:
525 | _log_err(f"Sorry, failed to download {archive_name}", True)
526 | sys.exit(1)
527 |
--------------------------------------------------------------------------------
/yark/channel.py:
--------------------------------------------------------------------------------
1 | """Channel and overall archive management with downloader"""
2 |
3 | from __future__ import annotations
4 | from datetime import datetime
5 | import json
6 | from pathlib import Path
7 | import time
8 | from yt_dlp import YoutubeDL, DownloadError # type: ignore
9 | from colorama import Style, Fore
10 | import sys
11 | from .reporter import Reporter
12 | from .errors import ArchiveNotFoundException, _err_msg, VideoNotFoundException
13 | from .video import Video, Element
14 | from typing import Any
15 | import time
16 | from progress.spinner import PieSpinner
17 | from concurrent.futures import ThreadPoolExecutor
18 | import time
19 |
20 | ARCHIVE_COMPAT = 3
21 | """
22 | Version of Yark archives which this script is capable of properly parsing
23 |
24 | - Version 1 was the initial format and had all the basic information you can see in the viewer now
25 | - Version 2 introduced livestreams and shorts into the mix, as well as making the channel id into a simple url
26 | - Version 3 was a minor change to introduce a deleted tag so we have full reporting capability
27 |
28 | Some of these breaking versions are large changes and some are relatively small.
29 | We don't check if a value exists or not in the archive format out of precedent
30 | and we don't have optionally-present values, meaning that any new tags are a
31 | breaking change to the format. The only downside to this is that the migrator
32 | gets a line or two of extra code every breaking change. This is much better than
33 | having way more complexity in the archiver decoding system itself.
34 | """
35 |
36 | from typing import Optional
37 |
38 |
39 | class DownloadConfig:
40 | max_videos: Optional[int]
41 | max_livestreams: Optional[int]
42 | max_shorts: Optional[int]
43 | skip_download: bool
44 | skip_metadata: bool
45 | format: Optional[str]
46 |
47 | def __init__(self) -> None:
48 | self.max_videos = None
49 | self.max_livestreams = None
50 | self.max_shorts = None
51 | self.skip_download = False
52 | self.skip_metadata = False
53 | self.format = None
54 |
55 | def submit(self) -> None:
56 | """Submits configuration; this normalises unset maximums to 0 whenever any maximum is given"""
57 | # Adjust remaining maximums if one is given
58 | no_maximums = (
59 | self.max_videos is None
60 | and self.max_livestreams is None
61 | and self.max_shorts is None
62 | )
63 | if not no_maximums:
64 | if self.max_videos is None:
65 | self.max_videos = 0
66 | if self.max_livestreams is None:
67 | self.max_livestreams = 0
68 | if self.max_shorts is None:
69 | self.max_shorts = 0
70 |
71 | # If all are 0, it's equivalent to skipping downloads
72 | if self.max_videos == 0 and self.max_livestreams == 0 and self.max_shorts == 0:
73 | print(
74 | Fore.YELLOW
75 | + "Using the skip downloads option is recommended over setting maximums to 0"
76 | + Fore.RESET
77 | )
78 | self.skip_download = True
79 |
80 |
81 | class VideoLogger:
82 | @staticmethod
83 | def downloading(d):
84 | """Progress hook for video downloading"""
85 | # Get video's id
86 | id = d["info_dict"]["id"]
87 |
88 | # Downloading percent
89 | if d["status"] == "downloading":
90 | percent = d["_percent_str"].strip()
91 | print(
92 | Style.DIM
93 | + f" • Downloading {id}, at {percent}.. "
94 | + Style.NORMAL,
95 | end="\r",
96 | )
99 |
100 | # Finished a video's download
101 | elif d["status"] == "finished":
102 | print(Style.DIM + f" • Downloaded {id} " + Style.NORMAL)
103 |
104 | def debug(self, msg):
105 | """Debug log messages, ignored"""
106 | pass
107 |
108 | def info(self, msg):
109 | """Info log messages, ignored"""
110 | pass
111 |
112 | def warning(self, msg):
113 | """Warning log messages, ignored"""
114 | pass
115 |
116 | def error(self, msg):
117 | """Error log messages, ignored"""
118 | pass
119 |
120 |
121 | class Channel:
122 | path: Path
123 | version: int
124 | url: str
125 | videos: list[Video]
126 | livestreams: list[Video]
127 | shorts: list[Video]
128 | reporter: Reporter
129 |
130 | @staticmethod
131 | def new(path: Path, url: str) -> Channel:
132 | """Creates a new channel"""
133 | # Details
134 | print("Creating new channel..")
135 | channel = Channel()
136 | channel.path = Path(path)
137 | channel.version = ARCHIVE_COMPAT
138 | channel.url = url
139 | channel.videos = []
140 | channel.livestreams = []
141 | channel.shorts = []
142 | channel.reporter = Reporter(channel)
143 |
144 | # Commit and return
145 | channel.commit()
146 | return channel
147 |
148 | @staticmethod
149 | def _new_empty() -> Channel:
150 | return Channel.new(
151 | Path("pretend"), "https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA"
152 | )
153 |
154 | @staticmethod
155 | def load(path: Path) -> Channel:
156 | """Loads existing channel from path"""
157 | # Check existence
158 | path = Path(path)
159 | channel_name = path.name
160 | print(f"Loading {channel_name} channel..")
161 | if not path.exists():
162 | raise ArchiveNotFoundException("Archive doesn't exist")
163 |
164 | # Load config
165 | encoded = json.load(open(path / "yark.json", "r"))
166 |
167 | # Check version before fully decoding and exit if wrong
168 | archive_version = encoded["version"]
169 | if archive_version != ARCHIVE_COMPAT:
170 | encoded = _migrate_archive(
171 | archive_version, ARCHIVE_COMPAT, encoded, channel_name
172 | )
173 |
174 | # Decode and return
175 | return Channel._from_dict(encoded, path)
176 |
177 | def metadata(self):
178 | """Queries YouTube for all channel metadata to refresh known videos"""
179 | # Print loading progress at the start without a loading indicator so there's always a print
180 | msg = "Downloading metadata.."
181 | print(msg, end="\r")
182 |
183 | # Download metadata and give the user a spinner bar
184 | with ThreadPoolExecutor() as ex:
185 | # Make future for downloading metadata
186 | future = ex.submit(self._download_metadata)
187 |
188 | # Start spinning
189 | with PieSpinner(f"{msg} ") as bar:
190 | # Don't show bar for 2 seconds but check if future is done
191 | no_bar_time = time.time() + 2
192 | while time.time() < no_bar_time:
193 | if future.done():
194 | break
195 | time.sleep(0.25)
196 |
197 | # Show loading spinner
198 | while not future.done():
199 | bar.next()
200 | time.sleep(0.075)
201 |
202 | # Get result from thread now that it's finished
203 | res = future.result()
204 |
205 | # Uncomment for saving big dumps for testing
206 | # with open(self.path / "dump.json", "w+") as file:
207 | # json.dump(res, file)
208 |
209 | # Uncomment for loading big dumps for testing
210 | # res = json.load(open(self.path / "dump.json", "r"))
211 |
212 | # Parse downloaded metadata
213 | self._parse_metadata(res)
214 |
215 | def _download_metadata(self) -> dict[str, Any]:
216 | """Downloads metadata dict and returns for further parsing"""
217 | # Construct downloader
218 | settings = {
219 | # Centralized logging system; makes output fully quiet
220 | "logger": VideoLogger(),
221 | # Skip downloading pending livestreams (#60)
222 | "ignore_no_formats_error": True,
223 | # Concurrent fragment downloading for increased resilience (#109)
224 | "concurrent_fragment_downloads": 8,
225 | # First download "flat", then extract_info for each video, to support large channels/playlists (#71)
226 | "extract_flat": True,
227 | }
228 |
229 | # Get response and snip it
230 | with YoutubeDL(settings) as ydl:
231 | # First extract the "flat" metadata, which doesn't yet download the full metadata for each video
232 | for i in range(3):
233 | try:
234 | res: dict[str, Any] = ydl.extract_info(self.url, download=False)
235 | break
236 | except Exception as exception:
237 | # Report error
238 | retrying = i != 2
239 | _err_dl("metadata", exception, retrying)
240 |
241 | # Print retrying message
242 | if retrying:
243 | print(
244 | Style.DIM
245 | + " • Retrying metadata download.."
246 | + Style.RESET_ALL
247 | ) # TODO: compat with loading bar
248 |
249 | # Go through the "flat" metadata and download the full metadata for each video
250 | for index in range(len(res["entries"])):
251 | if res["entries"][index]["_type"] == "playlist":
252 | playlist = res["entries"][index]
253 | for list_index in range(len(playlist["entries"])):
254 | url = playlist["entries"][list_index]["url"]
255 | for i in range(3):
256 | try:
257 | entry = ydl.extract_info(url, download=False)
258 | if len(entry["formats"]) == 0:
259 | ydl = YoutubeDL(settings)
260 | entry = ydl.extract_info(url, download=False)
261 |
262 | playlist["entries"][list_index] = entry
263 | break
264 | except Exception as exception:
265 | # Report error
266 | retrying = i != 2
267 | _err_dl("metadata", exception, retrying)
268 |
269 | # Print retrying message
270 | if retrying:
271 | print(
272 | Style.DIM
273 | + " • Retrying metadata download.."
274 | + Style.RESET_ALL
275 | ) # TODO: compat with loading bar
276 |
277 |
278 | elif res["entries"][index]["_type"] == "url":
279 | url = res["entries"][index]["url"]
280 | for i in range(3):
281 | try:
282 | entry = ydl.extract_info(url, download=False)
283 | # if video didn't download formats, open a new downloader and try again
284 | if len(entry["formats"]) == 0:
285 | ydl = YoutubeDL(settings)
286 | entry = ydl.extract_info(url, download=False)
287 |
288 | res["entries"][index] = entry
289 | break
290 | except Exception as exception:
291 | # Report error
292 | retrying = i != 2
293 | _err_dl("metadata", exception, retrying)
294 |
295 | # Print retrying message
296 | if retrying:
297 | print(
298 | Style.DIM
299 | + " • Retrying metadata download.."
300 | + Style.RESET_ALL
301 | ) # TODO: compat with loading bar
302 |
303 | return res
304 |
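The three near-identical retry loops in `_download_metadata` follow one pattern: attempt up to three times, reporting and retrying on failure. A generic sketch of that pattern (the `retry` helper is hypothetical, not part of the source):

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(attempts: int, action: Callable[[], T]) -> T:
    """Re-runs `action` until it succeeds or the attempts run out,
    re-raising the final error; mirrors the three-attempt loops above"""
    for i in range(attempts):
        try:
            return action()
        except Exception:
            # Last attempt failed, so propagate the error
            if i == attempts - 1:
                raise
    raise AssertionError("unreachable")
```

Factoring the loops into something like this would also remove the duplicated retry-message printing.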
305 | def _parse_metadata(self, res: dict[str, Any]):
306 | """Parses entirety of downloaded metadata"""
307 | # Normalize into types of videos
308 | videos = []
309 | livestreams = []
310 | shorts = []
311 | if len(res["entries"]) > 0 and "entries" not in res["entries"][0]:
312 | # Videos only
313 | videos = res["entries"]
314 | else:
315 | # Videos and at least one other (livestream/shorts)
316 | for entry in res["entries"]:
317 | kind = entry["title"].split(" - ")[-1].lower()
318 | if kind == "videos":
319 | videos = entry["entries"]
320 | elif kind == "live":
321 | livestreams = entry["entries"]
322 | elif kind == "shorts":
323 | shorts = entry["entries"]
324 | else:
325 | _err_msg(f"Unknown video kind '{kind}' found", True)
326 |
327 | # Parse metadata
328 | self._parse_metadata_videos("video", videos, self.videos)
329 | self._parse_metadata_videos("livestream", livestreams, self.livestreams)
330 | self._parse_metadata_videos("shorts", shorts, self.shorts)
331 |
332 | # Go through each and report deleted
333 | self._report_deleted(self.videos)
334 | self._report_deleted(self.livestreams)
335 | self._report_deleted(self.shorts)
336 |
337 | def download(self, config: DownloadConfig):
338 | """Downloads all videos which haven't already been downloaded"""
339 | # Clean out old part files
340 | self._clean_parts()
341 |
342 | # Create settings for the downloader
343 | settings = {
344 | # Set the output path
345 | "outtmpl": f"{self.path}/videos/%(id)s.%(ext)s",
346 | # Centralized logger hook for ignoring all stdout
347 | "logger": VideoLogger(),
348 | # Logger hook for download progress
349 | "progress_hooks": [VideoLogger.downloading],
350 | }
351 | if config.format is not None:
352 | settings["format"] = config.format
353 |
354 | # Attach to the downloader
355 | with YoutubeDL(settings) as ydl:
356 | # Retry downloading 5 times in total for all videos
357 | for i in range(5):
358 | # Try to curate a list and download videos on it
359 | try:
360 | # Curate list of non-downloaded videos
361 | not_downloaded = self._curate(config)
362 |
363 | # Stop if there's nothing to download
364 | if len(not_downloaded) == 0:
365 | break
366 |
367 | # Print curated if this is the first time
368 | if i == 0:
369 | fmt_num = (
370 | "a new video"
371 | if len(not_downloaded) == 1
372 | else f"{len(not_downloaded)} new videos"
373 | )
374 | print(f"Downloading {fmt_num}..")
375 |
376 | # Continuously try to download after private/deleted videos are found
377 | # This block gives the downloader all the curated videos and skips/reports deleted videos by filtering their exceptions
378 | while True:
379 | # Download from curated list then exit the optimistic loop
380 | try:
381 | urls = [video.url() for video in not_downloaded]
382 | ydl.download(urls)
383 | break
384 |
385 | # Special handling for private/deleted videos which are archived, if not we raise again
386 | except DownloadError as exception:
387 | # Video is privated or deleted
388 | if (
389 | "Private video" in exception.msg
390 | or "This video has been removed by the uploader"
391 | in exception.msg
392 | ):
393 | # Skip video from curated and get it as a return
394 | not_downloaded, video = _skip_video(
395 | not_downloaded, "deleted"
396 | )
397 |
398 | # If this is a new occurrence then set it & report
399 | # This will only happen if it's deleted after getting metadata, like in a dry run
400 | if video.deleted.current() is False:
401 | self.reporter.deleted.append(video)
402 | video.deleted.update(None, True)
403 |
404 | # User hasn't got ffmpeg installed and youtube hasn't got format 22
405 | # NOTE: see #55 to learn more
406 | # NOTE: sadly yt-dlp doesn't let us access yt_dlp.utils.ContentTooShortError so we check msg
407 | elif " bytes, expected " in exception.msg:
408 | # Skip video from curated
409 | not_downloaded, _ = _skip_video(
410 | not_downloaded,
411 | "no format found; please download ffmpeg!",
412 | True,
413 | )
414 |
415 | # Nevermind, normal exception
416 | else:
417 | raise exception
418 |
419 | # Stop if we've got them all
420 | break
421 |
422 | # Report error and retry/stop
423 | except Exception as exception:
424 | # Get around carriage return
425 | if i == 0:
426 | print()
427 |
428 | # Report error
429 | _err_dl("videos", exception, i != 4)
430 |
431 | def search(self, id: str):
432 | """Searches channel for a video with the corresponding `id` and returns"""
433 | # Search
434 | for video in self.videos:
435 | if video.id == id:
436 | return video
437 |
438 | # Raise exception if it's not found
439 | raise VideoNotFoundException(f"Couldn't find {id} inside archive")
440 |
441 | def _curate(self, config: DownloadConfig) -> list[Video]:
442 | """Curate videos which aren't downloaded and return their urls"""
443 |
444 | def curate_list(videos: list[Video], maximum: Optional[int]) -> list[Video]:
445 | """Curates the videos inside the provided `videos` list to its local maximum"""
446 | # Cut available videos to maximum if present for deterministic getting
447 | if maximum is not None:
448 | # Cap the maximum at the list length so we don't try to get more than there is
449 | fixed_maximum = min(len(videos), maximum)
450 |
451 | # Set the available videos to this fixed maximum
452 | new_videos = []
453 | for ind in range(fixed_maximum):
454 | new_videos.append(videos[ind])
455 | videos = new_videos
456 |
457 | # Find undownloaded videos in available list
458 | not_downloaded = []
459 | for video in videos:
460 | if not video.downloaded():
461 | not_downloaded.append(video)
462 |
463 | # Return
464 | return not_downloaded
465 |
466 | # Curate
467 | not_downloaded = []
468 | not_downloaded.extend(curate_list(self.videos, config.max_videos))
469 | not_downloaded.extend(curate_list(self.livestreams, config.max_livestreams))
470 | not_downloaded.extend(curate_list(self.shorts, config.max_shorts))
471 |
472 | # Return
473 | return not_downloaded
474 |
475 | def commit(self):
476 | """Commits (saves) archive to path; do this once you've finished all of your transactions"""
477 | # Save backup
478 | self._backup()
479 |
480 | # Directories
481 | print(f"Committing {self} to file..")
482 | paths = [self.path, self.path / "thumbnails", self.path / "videos"]
483 | for path in paths:
484 | if not path.exists():
485 | path.mkdir()
486 |
487 | # Config
488 | with open(self.path / "yark.json", "w+") as file:
489 | json.dump(self._to_dict(), file)
490 |
491 | def _parse_metadata_videos(self, kind: str, i: list, bucket: list):
492 | """Parses metadata for a category of video into its bucket and tells the user what's happening"""
493 |
494 | # Print at the start without a loading indicator so there's always a print
495 | msg = f"Parsing {kind} metadata.."
496 | print(msg, end="\r")
497 |
498 | # Start computing and show loading spinner
499 | with ThreadPoolExecutor() as ex:
500 | # Make future for computation of the video list
501 | future = ex.submit(self._parse_metadata_videos_comp, i, bucket)
502 |
503 | # Start spinning
504 | with PieSpinner(f"{msg} ") as bar:
505 | # Don't show bar for 2 seconds but check if future is done
506 | no_bar_time = time.time() + 2
507 | while time.time() < no_bar_time:
508 | if future.done():
509 | return
510 | time.sleep(0.25)
511 |
512 | # Spin until future is done
513 | while not future.done():
514 | time.sleep(0.075)
515 | bar.next()
516 |
517 | def _parse_metadata_videos_comp(self, i: list, bucket: list):
518 | """Computes the actual parsing for `_parse_metadata_videos` without outputting what's happening"""
519 | for entry in i:
520 | # Skip video if there's no formats available; happens with upcoming videos/livestreams
521 | if "formats" not in entry or len(entry["formats"]) == 0:
522 | continue
523 |
524 | # Updated intra-loop marker
525 | updated = False
526 |
527 | # Update video if it exists
528 | for video in bucket:
529 | if video.id == entry["id"]:
530 | video.update(entry)
531 | updated = True
532 | break
533 |
534 | # Add new video if not
535 | if not updated:
536 | video = Video.new(entry, self)
537 | bucket.append(video)
538 | self.reporter.added.append(video)
539 |
540 | # Sort videos by newest
541 | bucket.sort(reverse=True)
542 |
543 | def _report_deleted(self, videos: list):
544 | """Goes through a video category and reports & marks as deleted any videos which weren't found in the metadata, unless they're already known to be deleted"""
545 | for video in videos:
546 | if video.deleted.current() is False and not video.known_not_deleted:
547 | self.reporter.deleted.append(video)
548 | video.deleted.update(None, True)
549 |
550 | def _clean_parts(self):
551 | """Cleans out old temporary `.part` files which were stopped during download, if present"""
552 | # Make a bucket for found files
553 | deletion_bucket: list[Path] = []
554 |
555 | # Scan through and find part files
556 | videos = self.path / "videos"
557 | for file in videos.iterdir():
558 | if file.suffix == ".part" or file.suffix == ".ytdl":
559 | deletion_bucket.append(file)
560 |
561 | # Print and delete if there are part files present
562 | if len(deletion_bucket) != 0:
563 | print("Cleaning out previous temporary files..")
564 | for file in deletion_bucket:
565 | file.unlink()
566 |
567 | def _backup(self):
568 | """Creates a backup of the existing `yark.json` file in path as `yark.bak` with added comments"""
569 | # Get current archive path
570 | archive_path = self.path / "yark.json"
571 |
572 | # Skip backing up if the archive doesn't exist
573 | if not archive_path.exists():
574 | return
575 |
576 | # Open original archive to copy
577 | with open(archive_path, "r") as file_archive:
578 | # Add comment information to backup file
579 | save = f"// Backup of a Yark archive, dated {datetime.utcnow().isoformat()}\n// Remove these comments and rename to 'yark.json' to restore\n{file_archive.read()}"
580 |
581 | # Save new information into a new backup
582 | with open(self.path / "yark.bak", "w+") as file_backup:
583 | file_backup.write(save)
584 |
585 | @staticmethod
586 | def _from_dict(encoded: dict, path: Path) -> Channel:
587 | """Decodes archive which is being loaded back up"""
588 | channel = Channel()
589 | channel.path = path
590 | channel.version = encoded["version"]
591 | channel.url = encoded["url"]
592 | channel.reporter = Reporter(channel)
593 | channel.videos = [
594 | Video._from_dict(video, channel) for video in encoded["videos"]
595 | ]
596 | channel.livestreams = [
597 | Video._from_dict(video, channel) for video in encoded["livestreams"]
598 | ]
599 | channel.shorts = [
600 | Video._from_dict(video, channel) for video in encoded["shorts"]
601 | ]
602 | return channel
603 |
604 | def _to_dict(self) -> dict:
605 | """Converts channel data to a dictionary to commit"""
606 | return {
607 | "version": self.version,
608 | "url": self.url,
609 | "videos": [video._to_dict() for video in self.videos],
610 | "livestreams": [video._to_dict() for video in self.livestreams],
611 | "shorts": [video._to_dict() for video in self.shorts],
612 | }
613 |
614 | def __repr__(self) -> str:
615 | return self.path.name
616 |
617 |
618 | def _skip_video(
619 | videos: list[Video],
620 | reason: str,
621 | warning: bool = False,
622 | ) -> tuple[list[Video], Video]:
623 | """Skips the first undownloaded video in `videos`; make sure there's at least one to skip, otherwise an exception will be thrown"""
624 | # Find first undownloaded video
625 | for ind, video in enumerate(videos):
626 | if not video.downloaded():
627 | # Tell the user we're skipping over it
628 | if warning:
629 | print(
630 | Fore.YELLOW + f" • Skipping {video.id} ({reason})" + Fore.RESET,
631 | file=sys.stderr,
632 | )
633 | else:
634 | print(
635 | Style.DIM + f" • Skipping {video.id} ({reason})" + Style.NORMAL,
636 | )
637 |
638 | # Set videos to skip over this one
639 | videos = videos[ind + 1 :]
640 |
641 | # Return the corrected list and the video found
642 | return videos, video
643 |
644 | # Shouldn't happen, see docs
645 | raise Exception(
646 | "We expected to skip a video and return it but nothing to skip was found"
647 | )
648 |
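`_skip_video` returns the remainder of the list after the first undownloaded video together with that video; a minimal sketch of the same behaviour with a hypothetical stand-in class:

```python
from dataclasses import dataclass


@dataclass
class FakeVideo:
    """Hypothetical stand-in for yark's Video, just enough for this demo"""
    id: str
    is_downloaded: bool

    def downloaded(self) -> bool:
        return self.is_downloaded


def skip_first_undownloaded(
    videos: list[FakeVideo],
) -> tuple[list[FakeVideo], FakeVideo]:
    """Mirrors _skip_video: find the first undownloaded video and return the
    tail of the list after it along with the video itself"""
    for ind, video in enumerate(videos):
        if not video.downloaded():
            return videos[ind + 1 :], video
    raise Exception("Nothing to skip was found")
```

Because only the tail after the skipped video is returned, the caller's next download attempt never revisits earlier entries.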
649 |
650 | def _migrate_archive(
651 | current_version: int, expected_version: int, encoded: dict, channel_name: str
652 | ) -> dict:
653 | """Automatically migrates an archive from one version to another by bootstrapping"""
654 |
655 | def migrate_step(cur: int, encoded: dict) -> dict:
656 | """Step in recursion to migrate from one to another, contains migration logic"""
657 | # Stop because we've reached the desired version
658 | if cur == expected_version:
659 | return encoded
660 |
661 | # From version 1 to version 2
662 | elif cur == 1:
663 | # Channel id to url
664 | encoded["url"] = "https://www.youtube.com/channel/" + encoded["id"]
665 | del encoded["id"]
666 | print(
667 | Fore.YELLOW
668 | + "Please make sure "
669 | + encoded["url"]
670 | + " is the correct url"
671 | + Fore.RESET
672 | )
673 |
674 | # Empty livestreams/shorts lists
675 | encoded["livestreams"] = []
676 | encoded["shorts"] = []
677 |
678 | # From version 2 to version 3
679 | elif cur == 2:
680 | # Add deleted status to every video/livestream/short
681 | # NOTE: none is fine for new elements, just a slight bodge
682 | for video in encoded["videos"]:
683 | video["deleted"] = Element.new(Video._new_empty(), False)._to_dict()
684 | for video in encoded["livestreams"]:
685 | video["deleted"] = Element.new(Video._new_empty(), False)._to_dict()
686 | for video in encoded["shorts"]:
687 | video["deleted"] = Element.new(Video._new_empty(), False)._to_dict()
688 |
689 | # Unknown version
690 | else:
691 | _err_msg(f"Unknown archive version v{cur} found during migration", True)
692 | sys.exit(1)
693 |
694 | # Increment version and run again until version has been reached
695 | cur += 1
696 | encoded["version"] = cur
697 | return migrate_step(cur, encoded)
698 |
699 | # Inform user of the backup process
700 | print(
701 | Fore.YELLOW
702 | + f"Automatically migrating archive from v{current_version} to v{expected_version}, a backup has been made at {channel_name}/yark.bak"
703 | + Fore.RESET
704 | )
705 |
706 | # Start recursion step
707 | return migrate_step(current_version, encoded)
708 |
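`_migrate_archive` bootstraps one migration step per version and recurses until the target version is reached; a toy sketch of the same pattern on a simplified dict (the key changes here are illustrative, modelled loosely on the v1→v2 and v2→v3 steps):

```python
def migrate(cur: int, target: int, data: dict) -> dict:
    """Toy version of the migrator's bootstrapping pattern: apply one
    version's changes, bump the version, then recurse until the target"""
    if cur == target:
        return data
    if cur == 1:
        # v1 -> v2: rename a key (stands in for the id -> url change)
        data["url"] = data.pop("id")
    elif cur == 2:
        # v2 -> v3: add fields introduced by the newer format
        data.setdefault("livestreams", [])
        data.setdefault("shorts", [])
    else:
        raise ValueError(f"Unknown version v{cur}")
    cur += 1
    data["version"] = cur
    return migrate(cur, target, data)
```

Each breaking change only ever adds one small branch here, which is the trade-off the module docstring describes.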
709 |
710 | def _err_dl(name: str, exception: DownloadError, retrying: bool):
711 | """Prints errors to stdout depending on what kind of download error occurred"""
712 | # Default message
713 | msg = f"Unknown error whilst downloading {name}, details below:\n{exception}"
714 |
715 | # Types of errors
716 | ERRORS = [
717 | "<urlopen error [Errno 8] nodename nor servname provided, or not known>",
718 | "500",
719 | "Got error: The read operation timed out",
720 | "No such file or directory",
721 | "HTTP Error 404: Not Found",
722 | "<urlopen error timed out>",
723 | ]
724 |
725 | # Download errors
726 | if isinstance(exception, DownloadError):
727 | # Server connection
728 | if ERRORS[0] in exception.msg:
729 | msg = "Issue connecting with YouTube's servers"
730 |
731 | # Server fault
732 | elif ERRORS[1] in exception.msg:
733 | msg = "Fault with YouTube's servers"
734 |
735 | # Timeout
736 | elif ERRORS[2] in exception.msg:
737 | msg = "Timed out trying to download video"
738 |
739 | # Video deleted whilst downloading
740 | elif ERRORS[3] in exception.msg:
741 | msg = "Video deleted whilst downloading"
742 |
743 | # Channel not found, might need to retry with alternative route
744 | elif ERRORS[4] in exception.msg:
745 | msg = "Couldn't find channel by its id"
746 |
747 | # Random timeout; not sure if it's user-end or youtube-end
748 | elif ERRORS[5] in exception.msg:
749 | msg = "Timed out trying to reach YouTube"
750 |
751 | # Print error
752 | suffix = ", retrying in a few seconds.." if retrying else ""
753 | print(
754 | Fore.YELLOW + " • " + msg + suffix.ljust(40) + Fore.RESET,
755 | file=sys.stderr,
756 | )
757 |
758 | # Wait if retrying, exit if failed
759 | if retrying:
760 | time.sleep(5)
761 | else:
762 | _err_msg(f" • Sorry, failed to download {name}", True)
763 | sys.exit(1)
764 |
--------------------------------------------------------------------------------
/yark/cli.py:
--------------------------------------------------------------------------------
1 | """Homegrown cli for managing archives"""
2 |
3 | from pathlib import Path
4 | from colorama import Style, Fore
5 | import sys
6 | import threading
7 | import webbrowser
8 | from .errors import _err_msg, ArchiveNotFoundException
9 | from .channel import Channel, DownloadConfig
10 | from .viewer import viewer
11 |
12 | HELP = f"yark [options]\n\n YouTube archiving made simple.\n\nOptions:\n new [name] [url] Creates new archive with name and channel url\n refresh [name] [args?] Refreshes/downloads archive with optional config\n view [name?] Launches offline archive viewer website\n report [name] Provides a report on the most interesting changes\n\nExample:\n $ yark new owez https://www.youtube.com/channel/UCSMdm6bUYIBN0KfS2CVuEPA\n $ yark refresh owez\n $ yark view owez"
13 | """User-facing help message provided from the cli"""
14 |
15 |
16 | def _cli():
17 | """Command-line-interface launcher"""
18 |
19 | # Get arguments
20 | args = sys.argv[1:]
21 |
22 | # No arguments
23 | if len(args) == 0:
24 | print(HELP, file=sys.stderr)
25 | _err_msg(f"\nError: No arguments provided")
26 | sys.exit(1)
27 |
28 | # Help
29 | if args[0] in ["help", "--help", "-h"]:
30 | print(HELP)
31 |
32 | # Version
33 | # TODO: automatically track this
34 | elif args[0] in ["-v", "-ver", "--version", "--v"]:
35 | print("1.2.9")
36 |
37 | # Create new
38 | elif args[0] == "new":
39 | # More help
40 | if len(args) == 2 and args[1] == "--help":
41 | _err_no_help()
42 |
43 | # Bad arguments
44 | if len(args) < 3:
45 | _err_msg("Please provide an archive name and the channel url")
46 | sys.exit(1)
47 |
48 | # Create channel
49 | Channel.new(Path(args[1]), args[2])
50 |
51 | # Refresh
52 | elif args[0] == "refresh":
53 | # More help
54 | if len(args) == 2 and args[1] == "--help":
55 | # NOTE: if these get more complex, separate into something like "basic config" and "advanced config"
56 | print(
57 | f"yark refresh [name] [args?]\n\n Refreshes/downloads archive with optional configuration.\n If a maximum is set, unset categories won't be downloaded\n\nArguments:\n --videos=[max] Maximum recent videos to download\n --shorts=[max] Maximum recent shorts to download\n --livestreams=[max] Maximum recent livestreams to download\n --skip-metadata Skips downloading metadata\n --skip-download Skips downloading content\n --format=[str] Downloads using custom yt-dlp format for advanced users\n\n Example:\n $ yark refresh demo\n $ yark refresh demo --videos=5\n $ yark refresh demo --shorts=2 --livestreams=25\n $ yark refresh demo --skip-download"
58 | )
59 | sys.exit(0)
60 |
61 | # Bad arguments
62 | if len(args) < 2:
63 | _err_msg("Please provide the archive name")
64 | sys.exit(1)
65 |
66 | # Figure out configuration
67 | config = DownloadConfig()
68 | if len(args) > 2:
69 |
70 | def parse_value(config_arg: str) -> str:
71 | return config_arg.split("=")[1]
72 |
73 | def parse_maximum_int(config_arg: str) -> int:
74 | """Tries to parse a maximum integer input"""
75 | maximum = parse_value(config_arg)
76 | try:
77 | return int(maximum)
78 | except ValueError:
79 | print(HELP, file=sys.stderr)
80 | _err_msg(
81 | f"\nError: The value '{maximum}' isn't a valid maximum number"
82 | )
83 | sys.exit(1)
84 |
85 | # Go through each configuration argument
86 | for config_arg in args[2:]:
87 | # Video maximum
88 | if config_arg.startswith("--videos="):
89 | config.max_videos = parse_maximum_int(config_arg)
90 |
91 | # Livestream maximum
92 | elif config_arg.startswith("--livestreams="):
93 | config.max_livestreams = parse_maximum_int(config_arg)
94 |
95 | # Shorts maximum
96 | elif config_arg.startswith("--shorts="):
97 | config.max_shorts = parse_maximum_int(config_arg)
98 |
99 | # No metadata
100 | elif config_arg == "--skip-metadata":
101 | config.skip_metadata = True
102 |
103 | # No downloading; functionally equivalent to all maximums being 0 but it skips entirely
104 | elif config_arg == "--skip-download":
105 | config.skip_download = True
106 |
107 | # Custom yt-dlp format
108 | elif config_arg.startswith("--format="):
109 | config.format = parse_value(config_arg)
110 |
111 | # Unknown argument
112 | else:
113 | print(HELP, file=sys.stderr)
114 | _err_msg(
115 | f"\nError: Unknown configuration '{config_arg}' provided for archive refresh"
116 | )
117 | sys.exit(1)
118 |
119 | # Submit config settings
120 | config.submit()
121 |
122 | # Refresh channel using config context
123 | try:
124 | channel = Channel.load(args[1])
125 | if config.skip_metadata:
126 | print("Skipping metadata download..")
127 | else:
128 | channel.metadata()
129 | channel.commit() # NOTE: Do it here no matter, because it's metadata. Downloads do not modify the archive
130 | if config.skip_download:
131 | print("Skipping videos/livestreams/shorts download..")
132 | else:
133 | channel.download(config)
134 | channel.reporter.print()
135 | except ArchiveNotFoundException:
136 | _err_archive_not_found()
137 |
138 | # View
139 | elif args[0] == "view":
140 | # More help
141 | if len(args) == 2 and args[1] == "--help":
142 | print(
143 | f"yark view [name] [args?]\n\n  Launches offline archive viewer website.\n\nArguments:\n  --host=[str]  Custom host address to serve from\n  --port=[int]  Custom port number instead of 7667\n\n Example:\n  $ yark view foobar\n  $ yark view foobar --port=80\n  $ yark view foobar --port=1234 --host=0.0.0.0"
144 | )
145 | sys.exit(0)
146 |
147 | # Basis for custom host/port configs
148 | host = None
149 | port = 7667
150 |
151 | # Go through each configuration argument
152 | for config_arg in args[2:]:
153 | # Host configuration
154 | if config_arg.startswith("--host="):
155 | host = config_arg[7:]
156 |
157 | # Port configuration
158 | elif config_arg.startswith("--port="):
159 | if config_arg[7:].strip() == "":
160 | print(
161 | f"No port number provided for port argument",
162 | file=sys.stderr,
163 | )
164 | sys.exit(1)
165 | try:
166 | port = int(config_arg[7:])
167 | except ValueError:
168 | print(
169 | f"Invalid port number '{config_arg[7:]}' provided",
170 | file=sys.stderr,
171 | )
172 | sys.exit(1)
173 |
174 | def launch():
175 | """Launches viewer; Flask's run() blocks until interrupted"""
176 | app = viewer()
177 | app.run(host=host, port=port)
178 |
179 | # Start on channel name
180 | if len(args) > 1:
181 | # Get name
182 | channel = args[1]
183 |
184 | # Jank archive check
185 | if not Path(channel).exists():
186 | _err_archive_not_found()
187 |
188 | # Launch and start browser
189 | print(f"Starting viewer for {channel}..")
190 | webbrowser.open(f"http://127.0.0.1:{port}/channel/{channel}/videos")
191 | launch()
192 |
193 | # Start on channel finder
194 | else:
195 | print("Starting viewer..")
196 | webbrowser.open(f"http://127.0.0.1:{port}/")
197 | launch()
198 |
199 | # Report
200 | elif args[0] == "report":
201 | # Bad arguments
202 | if len(args) < 2:
203 | _err_msg("Please provide the archive name")
204 | sys.exit(1)
205 |
206 | channel = Channel.load(Path(args[1]))
207 | channel.reporter.interesting_changes()
208 |
209 | # Unknown
210 | else:
211 | print(HELP, file=sys.stderr)
212 | _err_msg(f"\nError: Unknown command '{args[0]}' provided!", True)
213 | sys.exit(1)
214 |
215 |
216 | def _err_archive_not_found():
217 | """Errors out the user if the archive doesn't exist"""
218 | _err_msg("Archive doesn't exist, please make sure you typed its name correctly!")
219 | sys.exit(1)
220 |
221 |
222 | def _err_no_help():
223 | """Prints out help message and exits, displaying a 'no additional help' message"""
224 | print(HELP)
225 | print("\nThere's no additional help for this command")
226 | sys.exit(0)
227 |
228 |
229 | # NOTE: not used, not sure why this is included. might be useful for the future
230 | # def _upgrade_messaging() -> None:
231 | # """
232 | # Give users some info on the new Yark 1.3 version because PyPI releases aren't supported
233 |
234 | # This wouldn't happen normally but users might be confused seeing as we're switching distribution methods.
235 | # """
236 | # # Major update message for 1.3
237 | # print(
238 | # Style.BRIGHT
239 | # + "Yark 1.3 is out now! Go to https://github.com/Owez/yark to download"
240 | # + Style.DIM
241 | # + " (pip is no longer supported)"
242 | # + Style.NORMAL
243 | # )
244 |
245 | # # Give a warning if it's been over a year since release
246 | # if datetime.datetime.utcnow().year >= 2024:
247 | # print(
248 | # Fore.YELLOW
249 | # + "You're currently on an outdated version of Yark"
250 | # + Fore.RESET,
251 | # file=sys.stderr,
252 | # )
253 |
--------------------------------------------------------------------------------
/yark/errors.py:
--------------------------------------------------------------------------------
1 | """Exceptions and error functions"""
2 |
3 | from colorama import Style, Fore
4 | import sys
5 |
6 |
7 | class ArchiveNotFoundException(Exception):
8 | """Archive couldn't be found, the name was probably incorrect"""
9 |
10 | def __init__(self, *args: object) -> None:
11 | super().__init__(*args)
12 |
13 |
14 | class VideoNotFoundException(Exception):
15 | """Video couldn't be found, the id was probably incorrect"""
16 |
17 | def __init__(self, *args: object) -> None:
18 | super().__init__(*args)
19 |
20 |
21 | class NoteNotFoundException(Exception):
22 | """Note couldn't be found, the id was probably incorrect"""
23 |
24 | def __init__(self, *args: object) -> None:
25 | super().__init__(*args)
26 |
27 |
28 | class TimestampException(Exception):
29 | """Invalid timestamp inputted for note"""
30 |
31 | def __init__(self, *args: object) -> None:
32 | super().__init__(*args)
33 |
34 |
35 | def _err_msg(msg: str, report_msg: bool = False):
36 | """Provides a red-coloured error message to the user in the STDERR pipe"""
37 | msg = (
38 | msg
39 | if not report_msg
40 | else f"{msg}\nPlease file a bug report if you think this is a problem with Yark!"
41 | )
42 | print(Fore.RED + Style.BRIGHT + msg + Style.NORMAL + Fore.RESET, file=sys.stderr)
43 |
--------------------------------------------------------------------------------
/yark/reporter.py:
--------------------------------------------------------------------------------
1 | """Channel reporting system allowing detailed logging of useful information"""
2 |
3 | from colorama import Fore, Style
4 | import datetime
5 | from .video import Video, Element
6 | from .utils import _truncate_text
7 | from typing import TYPE_CHECKING, Optional
8 |
9 | if TYPE_CHECKING:
10 | from .channel import Channel
11 |
12 |
13 | class Reporter:
14 | channel: "Channel"
15 | added: list[Video]
16 | deleted: list[Video]
17 | updated: list[tuple[str, Element]]
18 |
19 |     def __init__(self, channel: "Channel") -> None:
20 | self.channel = channel
21 | self.added = []
22 | self.deleted = []
23 | self.updated = []
24 |
25 | def print(self):
26 | """Prints coloured report to STDOUT"""
27 | # Initial message
28 | print(f"Report for {self.channel}:")
29 |
30 | # Updated
31 | for kind, element in self.updated:
32 | colour = (
33 | Fore.CYAN
34 | if kind in ["title", "description", "undeleted"]
35 | else Fore.BLUE
36 | )
37 | video = f" • {element.video}".ljust(82)
38 | kind = f" │ 🔥{kind.capitalize()}"
39 |
40 | print(colour + video + kind)
41 |
42 | # Added
43 | for video in self.added:
44 | print(Fore.GREEN + f" • {video}")
45 |
46 | # Deleted
47 | for video in self.deleted:
48 | print(Fore.RED + f" • {video}")
49 |
50 | # Nothing
51 | if not self.added and not self.deleted and not self.updated:
52 |             print(Style.DIM + " • Nothing was added or deleted")
53 |
54 | # Watermark
55 | print(_watermark())
56 |
57 | def add_updated(self, kind: str, element: Element):
58 | """Tells reporter that an element has been updated"""
59 | self.updated.append((kind, element))
60 |
61 | def reset(self):
62 | """Resets reporting values for new run"""
63 | self.added = []
64 | self.deleted = []
65 | self.updated = []
66 |
67 | def interesting_changes(self):
68 | """Reports on the most interesting changes for the channel linked to this reporter"""
69 |
70 | def fmt_video(kind: str, video: Video) -> str:
71 | """Formats a video if it's interesting, otherwise returns an empty string"""
72 | # Skip formatting because it's got nothing of note
73 | if (
74 | not video.title.changed()
75 | and not video.description.changed()
76 | and not video.deleted.changed()
77 | ):
78 | return ""
79 |
80 |             # Helper appending a formatted change count, capitalizing the first entry
81 |             buf: list[str] = []
82 |
83 |             def add_buf(name: str, change: int, colour: str) -> None:
84 |                 word = name.capitalize() if not buf else name
85 |                 buf.append(colour + word + f" x{change}" + Fore.RESET)
86 |
87 | # Figure out how many changes have happened in each category and format them together
88 | change_deleted = sum(
89 |                 1 for value in video.deleted.inner.values() if value is True
90 | )
91 | if change_deleted != 0:
92 | add_buf("deleted", change_deleted, Fore.RED)
93 | change_description = len(video.description.inner) - 1
94 | if change_description != 0:
95 | add_buf("description", change_description, Fore.CYAN)
96 | change_title = len(video.title.inner) - 1
97 | if change_title != 0:
98 | add_buf("title", change_title, Fore.CYAN)
99 |
100 | # Combine the detected changes together and capitalize
101 | changes = ", ".join(buf) + Fore.RESET
102 |
103 | # Truncate title, get viewer link, and format all together with viewer link
104 | title = _truncate_text(video.title.current(), 51).strip()
105 | url = f"http://127.0.0.1:7667/channel/{video.channel}/{kind}/{video.id}"
106 | return (
107 | f" • {title}\n {changes}\n "
108 | + Style.DIM
109 | + url
110 | + Style.RESET_ALL
111 | + "\n"
112 | )
113 |
114 |         def fmt_category(kind: str, videos: list[Video]) -> Optional[str]:
115 | """Returns formatted string for an entire category of `videos` inputted or returns nothing"""
116 | # Add interesting videos to buffer
117 | HEADING = f"Interesting {kind}:\n"
118 | buf = HEADING
119 | for video in videos:
120 | buf += fmt_video(kind, video)
121 |
122 | # Return depending on if the buf is just the heading
123 | return None if buf == HEADING else buf[:-1]
124 |
125 |         # Tell users what's happening
126 | print(f"Finding interesting changes in {self.channel}..")
127 |
128 | # Get reports on the three categories
129 | categories = [
130 | ("videos", fmt_category("videos", self.channel.videos)),
131 | ("livestreams", fmt_category("livestreams", self.channel.livestreams)),
132 | ("shorts", fmt_category("shorts", self.channel.shorts)),
133 | ]
134 |
135 | # Combine those with nothing of note and print out interesting
136 | not_of_note = []
137 | for name, buf in categories:
138 | if buf is None:
139 | not_of_note.append(name)
140 | else:
141 | print(buf)
142 |
143 | # Print out those with nothing of note at the end
144 |         if len(not_of_note) != 0:
145 |             joined = "/".join(not_of_note)
146 |             print(f"No interesting {joined} found")
147 |
148 | # Watermark
149 | print(_watermark())
150 |
151 |
152 | def _watermark() -> str:
153 | """Returns a new watermark with a Yark timestamp"""
154 | date = datetime.datetime.utcnow().isoformat()
155 | return Style.RESET_ALL + f"Yark – {date}"
156 |
--------------------------------------------------------------------------------
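The change counting in `fmt_video` above can be reduced to two rules, assuming (as the code implies) that an `Element`'s `inner` is a mapping from timestamps to recorded values. A worked sketch with hypothetical history data:

```python
# Hypothetical Element histories: timestamp -> recorded value
title_history = {
    "2022-01-01": "My first video",
    "2022-06-01": "My first video (REMASTERED)",
}
deleted_history = {"2022-01-01": False, "2022-08-01": True}

# Rule 1: a history always contains the initial value, so the number of
# *changes* is one less than the number of recorded values
change_title = len(title_history) - 1

# Rule 2: deletions count only the entries where the flag was recorded as True
change_deleted = sum(1 for value in deleted_history.values() if value is True)

print(change_title, change_deleted)  # 1 1
```

A video with no recorded changes yields zero in every category, which is why `fmt_video` can skip it entirely and return an empty string.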
/yark/templates/base.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 | Yark{% if title %} · {{ title }}{% endif %}
15 |
16 |
17 |
18 |
19 |
117 | {% block styling %}{% endblock %}
118 |
119 |
126 |
127 | {% block content %}{% endblock %}
128 |
129 | {% if error %}
130 |