├── .gitignore ├── .tool-versions ├── LICENSE.txt ├── README.md ├── run.rb └── calculations.rb /.gitignore: -------------------------------------------------------------------------------- 1 | out 2 | -------------------------------------------------------------------------------- /.tool-versions: -------------------------------------------------------------------------------- 1 | ruby 3.2.3 2 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Max Schnur, Wistia, Inc. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | Attribution: 16 | This technology was developed by Max Schnur at Wistia, Inc. Wistia 17 | Max Schnur on GitHub: https://github.com/MaxPower15 18 | Wistia Website: https://www.wistia.com 19 | 20 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 21 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 22 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 23 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 24 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 25 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 26 | SOFTWARE. 
27 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Seamless AAC Split and Stitch Demo 2 | 3 | This repo demonstrates calculations and ffmpeg commands to encode portions of an audio file with the AAC codec and to recombine them without transcoding and without any skips or glitches. 4 | 5 | https://github.com/wistia/seamless-aac-split-and-stitch-demo/assets/493992/3a88dfda-f345-4518-9e86-696c14ae4a2b 6 | 7 | The general rule is that, when choosing your audio segment sizes, they _need_ to be aligned with AAC frame boundaries. With aligned frame boundaries, we can use the concat demuxer to cut out the silence ffmpeg adds, as well as some extra padding we add to account for AAC's dependency on previous frames. 8 | 9 | This tech is important because it allows faster and more efficient cloud rendering. It may also be used, for example, to render and mux individual HLS segments (TS files) independently of the full file. 10 | 11 | I've added more comments and explanations in the code itself. 12 | 13 | ## Requirements 14 | 15 | This demo assumes ffmpeg is installed and compiled with support for the libfdk_aac codec. It also assumes you have a modern version of ruby installed. 16 | 17 | Some versions of ffmpeg (around 5) may not work properly as there was a temporary regression with aac concatenation. ffmpeg 6 seems to work well. 
The author's build config looks like this: 18 | 19 | ``` 20 | ffmpeg version 6.0 Copyright (c) 2000-2023 the FFmpeg developers 21 | built with Apple clang version 15.0.0 (clang-1500.0.40.1) 22 | configuration: --prefix=/Users/maxschnur/.asdf/installs/ffmpeg/6.0 --enable-gpl --enable-libass --enable-libfdk-aac --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libzimg --enable-nonfree --enable-openssl --enable-shared 23 | ``` 24 | 25 | ## Usage 26 | 27 | Split and stitch with the defaults: 28 | 29 | ruby run.rb 30 | 31 | Test with a different target segment duration, in seconds (run.rb also accepts a local path or an http(s) URL to your own input file as a second argument): 32 | 33 | ruby run.rb 0.5 34 | 35 | You can find all artifacts in the "out" directory: 36 | 37 | ls out 38 | 39 | If you know you're operating on an input file that's already AAC at 44.1 kHz, you can also split up the file without transcoding. Try a command like this to test it out: 40 | 41 | NO_TRANSCODE=1 ruby run.rb 1.0 42 | 43 | On a Mac, you may want to examine out/stitched.mp4 with a visualization program. The author uses [Audacity](https://www.audacityteam.org/) for that. 44 | 45 | open -a Audacity out/stitched.mp4 46 | 47 | NOTE: This repo hardcodes the sample rate as 44.1 kHz to simplify the demo code, but the method should work with any sample rate. 48 | -------------------------------------------------------------------------------- /run.rb: -------------------------------------------------------------------------------- 1 | require "fileutils" 2 | require "uri" 3 | require "net/http" 4 | require_relative "./calculations" 5 | 6 | # Feel free to change these constants for your own testing. 
7 | SINE_WAVE_DURATION = 10.to_f 8 | SINE_FREQUENCY = 10.to_f 9 | DEFAULT_SEGMENT_DURATION = 1.0.to_f 10 | SINE_WAVE_FILE_NAME = "sine-wave-#{SINE_WAVE_DURATION.to_i}-seconds.wav" 11 | 12 | FileUtils.rm_rf "out" 13 | FileUtils.mkdir_p "out" 14 | 15 | input_file = nil 16 | 17 | if ARGV.count == 1 && ARGV[0] !~ /\A-?\d+(\.\d+)?\z/ 18 | input_file = ARGV[0] 19 | target_segment_duration = DEFAULT_SEGMENT_DURATION 20 | else 21 | target_segment_duration = (ARGV[0] || DEFAULT_SEGMENT_DURATION).to_f 22 | end 23 | 24 | if target_segment_duration <= 0 25 | raise "Segment duration must be greater than 0" 26 | end 27 | 28 | if ARGV.count == 2 29 | input_file = ARGV[1] 30 | end 31 | 32 | if input_file&.start_with?(/^https?:\/\//) 33 | uri = URI.parse(input_file) 34 | puts "Downloading file from #{uri}..." 35 | resp = Net::HTTP.get_response(uri) 36 | if resp.code.to_i != 200 37 | raise "Failed to download file: #{resp.code.inspect}" 38 | end 39 | File.write("out/downloaded-file", resp.body) 40 | input_file = "out/downloaded-file" 41 | elsif input_file 42 | puts "Using local file #{input_file}" 43 | else 44 | # generate the sine wave we'll use as input 45 | system("ffmpeg -hide_banner -loglevel error -nostats -y -f lavfi -i \"sine=frequency=#{SINE_FREQUENCY}:duration=#{SINE_WAVE_DURATION}\" out/#{SINE_WAVE_FILE_NAME}") 46 | input_file = "out/#{SINE_WAVE_FILE_NAME}" 47 | end 48 | 49 | f = File.open(input_file) 50 | first_three_bytes_in_hex = f.read(3).unpack1("H*") 51 | first_two_bytes_in_hex = first_three_bytes_in_hex[0..3] 52 | f.close 53 | 54 | # Byte signatures taken from https://en.wikipedia.org/wiki/List_of_file_signatures 55 | looks_like_mp3 = first_three_bytes_in_hex == "494433" || 56 | first_two_bytes_in_hex == "fffb" || 57 | first_two_bytes_in_hex == "fff3" || 58 | first_two_bytes_in_hex == "fff2" 59 | 60 | if looks_like_mp3 61 | # Something about mp3s makes it so they have extra padding between them when 62 | # split. Remuxing to mkv fixes it. 
63 | # 64 | # NOTE: There may be other formats that benefit from remuxing to MKV too. 65 | puts "Detected mp3 input file. Remuxing to mkv..." 66 | remux_cmd = "ffmpeg -hide_banner -loglevel error -nostats -y -i #{input_file} -c copy out/remuxed.mkv" 67 | puts remux_cmd 68 | system(remux_cmd) 69 | input_file = "out/remuxed.mkv" 70 | end 71 | 72 | duration_cmd = "ffprobe -hide_banner -loglevel error -select_streams a:0 -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 #{input_file}" 73 | puts duration_cmd 74 | duration = `#{duration_cmd}`.to_f 75 | puts "Input file duration: #{duration}" 76 | 77 | # Generate the commands we'll use to slice the sine wave into segments and the 78 | # directives we'll use to recombine them later. 79 | commands_and_directives = (duration / target_segment_duration).ceil.to_i.times.map do |i| 80 | start_time = (i * target_segment_duration * 1000000).round.to_i 81 | end_time = [((i + 1) * target_segment_duration * 1000000).round, duration * 1000000].min.to_i 82 | is_last = i == (duration / target_segment_duration).ceil.to_i - 1 83 | generate_command_and_directives_for_segment(input_file, i, start_time, end_time, is_last) 84 | end 85 | 86 | all_directives = commands_and_directives.map { |cmd, directives| directives }.join("\n") 87 | File.write("out/audio-concat.txt", all_directives) 88 | 89 | puts "---" 90 | 91 | # Run the commands. 92 | commands_and_directives.each do |cmd, _| 93 | puts cmd 94 | system(cmd) 95 | end 96 | 97 | puts "---" 98 | 99 | # Stitch the segments back together. 
100 | concat_cmd = "ffmpeg -hide_banner -loglevel error -nostats -y -f concat -i out/audio-concat.txt -c copy out/stitched.mp4" 101 | puts concat_cmd 102 | puts all_directives 103 | system(concat_cmd) 104 | -------------------------------------------------------------------------------- /calculations.rb: -------------------------------------------------------------------------------- 1 | def frame_duration 2 | 1024.0 / 44100.0 * 1000000.0 3 | end 4 | 5 | def get_closest_aligned_time(target_time) 6 | decimal_frames_to_target_time = target_time.to_f / frame_duration 7 | nearest_frame_index_for_target_time = decimal_frames_to_target_time.round 8 | puts "target_time: #{target_time}, decimal_frames_to_target_time: #{decimal_frames_to_target_time}, nearest_frame_index_for_target_time: #{nearest_frame_index_for_target_time}" 9 | nearest_frame_index_for_target_time * frame_duration 10 | end 11 | 12 | def generate_command_and_directives_for_segment(input_file, index, target_start, target_end, is_last) 13 | puts "--- segment #{index + 1} ---" 14 | 15 | start_time = get_closest_aligned_time(target_start) 16 | end_time = get_closest_aligned_time(target_end) 17 | puts "start_time: #{start_time}, end_time: #{end_time}" 18 | 19 | real_duration = end_time - start_time 20 | puts "real_duration: #{real_duration}" 21 | 22 | # We're subtracting two frames from the start time because ffmpeg always internally 23 | # adds 2 frames of priming to the start of the stream. 24 | start_time_with_padding = [start_time - frame_duration * 2, 0].max 25 | 26 | # We add extra padding at the end, too, because ffmpeg tapers the last few frames 27 | # to avoid a pop when audio stops. We don't want tapering--we just want the signal. 28 | # So by shifting the end, we shift the taper past the content we care about. We'll 29 | # chop off this tapered part using outpoint later. 
30 | end_time_with_padding = end_time + frame_duration * 2 31 | puts "start_time_with_padding: #{start_time_with_padding}, end_time_with_padding: #{end_time_with_padding}" 32 | 33 | inpoint = 0 34 | 35 | if index > 0 36 | # We ask to also encode two frames before the start of our segment because 37 | # the AAC format is interframe. That is, the encoding of each frame depends 38 | # on the previous frame. This is also why AAC pads the start with silence. 39 | # By adding some extra padding ourselves, we ensure that the "real" data we 40 | # want will have been encoded as if the correct data preceded it. (Because 41 | # it did!) 42 | # 43 | # Note that, although we always set the extra time at the beginning to 2 44 | # frames here, it can actually be any value that's 2 frames or more. For 45 | # example, if you were encoding with echo, you might want to pad to account 46 | # for the full damping time of an echo. 47 | extra_time_at_beginning = frame_duration * 2 48 | start_time_with_padding = [start_time_with_padding - extra_time_at_beginning, 0].max 49 | 50 | # Although we only asked for two frames of padding, ffmpeg will add an 51 | # additional 2 frames of silence at the start of the segment. When we slice out 52 | # our real data with inpoint and outpoint, we'll want to remove both the silence 53 | # and the extra frames we asked for. 54 | inpoint = frame_duration * 2 + extra_time_at_beginning 55 | end 56 | 57 | padded_duration = end_time_with_padding - start_time_with_padding 58 | puts "padded_duration: #{padded_duration}" 59 | 60 | # inpoint is inclusive and outpoint is exclusive. To avoid overlap, we subtract 61 | # the duration of one frame from the outpoint. 62 | # we don't have to subtract a frame if this is the last segment. 
63 | subtract = frame_duration 64 | if is_last 65 | subtract = 0 66 | end 67 | outpoint = inpoint + real_duration - subtract 68 | 69 | # Things usually appear to work fine without the duration directive, but by 70 | # adding it, we make it so ffmpeg doesn't need to "guess" how long each 71 | # segment should be based on its sample count. Since we can do the math for 72 | # this at higher fidelity than ffmpeg, for very long outputs, it may help 73 | # avoid de-sync and make seeking more predictably exact. 74 | duration_directive = outpoint - inpoint + frame_duration 75 | 76 | puts "inpoint: #{inpoint}, outpoint: #{outpoint}" 77 | 78 | command = 79 | if ENV["NO_TRANSCODE"] 80 | # If we know the input file is AAC and we're not changing the sample rate, 81 | # we can create the segments without transcoding too. This works because, 82 | # if we cut at exactly the AAC frame boundaries, then we can just slice 83 | # out portions of the stream. Note, however, that -ss and -t flags are moved after 84 | # the input file so they're applied after the input file is read. Without that, 85 | # you'll get some funky output. 86 | "ffmpeg -hide_banner -loglevel error -nostats -y -i #{input_file} -c:a copy -ss #{start_time_with_padding}us -t #{padded_duration}us -f adts out/seg#{index + 1}.aac" 87 | else 88 | "ffmpeg -hide_banner -loglevel error -nostats -y -ss #{start_time_with_padding}us -t #{padded_duration}us -i #{input_file} -c:a libfdk_aac -ar 44100 -f adts out/seg#{index + 1}.aac" 89 | end 90 | 91 | directives = [ 92 | "file 'seg#{index + 1}.aac'", 93 | "inpoint #{inpoint}us", 94 | "outpoint #{outpoint}us", 95 | "duration #{duration_directive}us" 96 | ] 97 | 98 | [command, directives.join("\n")] 99 | end 100 | --------------------------------------------------------------------------------
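The frame-alignment arithmetic in calculations.rb can be tried on its own, without ffmpeg. The sketch below is an illustration rather than part of the repo: it assumes the same hardcoded 44.1 kHz sample rate and 1024-sample AAC frames as the demo, and the names `frame_duration_us`, `aligned_time_us`, and `aligned_boundaries_us` are hypothetical helpers introduced here. It shows how target segment boundaries snap to the nearest AAC frame boundary.

```ruby
# Sketch only: mirrors the rounding done by get_closest_aligned_time,
# assuming 44.1 kHz audio and 1024 samples per AAC frame (as in the demo).
SAMPLE_RATE = 44_100.0
SAMPLES_PER_FRAME = 1024.0

# Duration of one AAC frame in microseconds.
def frame_duration_us
  SAMPLES_PER_FRAME / SAMPLE_RATE * 1_000_000.0
end

# Snap a target time (in microseconds) to the nearest frame boundary.
def aligned_time_us(target_us)
  (target_us.to_f / frame_duration_us).round * frame_duration_us
end

# Hypothetical helper: the aligned boundaries (in microseconds) for a file of
# duration_s seconds split into segments of roughly segment_s seconds each.
def aligned_boundaries_us(duration_s, segment_s)
  count = (duration_s / segment_s).ceil
  (0..count).map do |i|
    target_s = [i * segment_s, duration_s].min
    aligned_time_us(target_s * 1_000_000)
  end
end

boundaries = aligned_boundaries_us(10.0, 1.0)

# Frame index of each boundary; consecutive segments share a boundary, so
# their lengths are whole numbers of frames.
frame_indices = boundaries.map { |t| (t / frame_duration_us).round }
puts frame_indices.inspect
```

With a 1-second target, segments come out 43 or 44 frames long rather than exactly one second, which is the price of keeping every cut on a frame boundary.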