├── .gitignore
├── .tool-versions
├── LICENSE.txt
├── README.md
├── run.rb
└── calculations.rb


/.gitignore:
--------------------------------------------------------------------------------
1 | out
2 | 


--------------------------------------------------------------------------------
/.tool-versions:
--------------------------------------------------------------------------------
1 | ruby 3.2.3
2 | 


--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Max Schnur, Wistia, Inc.
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | Attribution:
 16 | This technology was developed by Max Schnur at Wistia, Inc.
17 | Max Schnur on GitHub: https://github.com/MaxPower15
18 | Wistia Website: https://www.wistia.com
19 | 
20 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26 | SOFTWARE.
27 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Seamless AAC Split and Stitch Demo
 2 | 
 3 | This repo demonstrates calculations and ffmpeg commands to encode portions of an audio file with the AAC codec and to recombine them without transcoding and without any skips or glitches.
 4 | 
 5 | https://github.com/wistia/seamless-aac-split-and-stitch-demo/assets/493992/3a88dfda-f345-4518-9e86-696c14ae4a2b
 6 | 
 7 | The general rule is that, when choosing your audio segment sizes, they _need_ to be aligned with AAC frame boundaries. With aligned frame boundaries, we can use the concat demuxer to cut out the silence ffmpeg adds, as well as some extra padding we add to account for AAC's dependency on previous frames.
 8 | 
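As a rough illustration of the alignment rule above, here is a minimal sketch that snaps target cut points to the nearest AAC frame boundary. It assumes 1024 samples per frame at 44.1 kHz (the same values the demo hardcodes); the helper names `nearest_frame_index` and `aligned_boundaries` are made up for this example and are not part of the repo.

```ruby
# Assumed constants: 1024-sample AAC frames at 44.1 kHz, as in this demo.
SAMPLES_PER_FRAME = 1024
SAMPLE_RATE = 44_100
FRAME_US = SAMPLES_PER_FRAME * 1_000_000.0 / SAMPLE_RATE # one frame, in microseconds

# Index of the frame boundary nearest to a target time in microseconds.
def nearest_frame_index(target_us)
  (target_us / FRAME_US).round
end

# Frame-aligned cut points (in microseconds) for equally sized target segments.
# Returns count + 1 boundaries: the start of each segment plus the final end.
def aligned_boundaries(total_us, target_segment_us)
  count = (total_us.to_f / target_segment_us).ceil
  (0..count).map do |i|
    target = [i * target_segment_us, total_us].min
    nearest_frame_index(target) * FRAME_US
  end
end

boundaries = aligned_boundaries(3_000_000, 1_000_000)
boundaries.each_cons(2).with_index(1) do |(from, to), n|
  frames = ((to - from) / FRAME_US).round
  puts format("segment %d: %.2fus to %.2fus (%d frames)", n, from, to, frames)
end
```

Because every boundary is a whole multiple of the frame duration, each segment contains a whole number of frames, which is what lets the concat demuxer trim the padding cleanly.
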
 9 | This tech is important because it allows faster and more efficient cloud rendering. It may also be used, for example, to render and mux individual HLS segments (TS files) independently of the full file.
10 | 
11 | I've added more comments and explanations in the code itself.
12 | 
13 | ## Requirements
14 | 
 15 | This demo assumes ffmpeg is installed and compiled with support for the libfdk_aac encoder. It also assumes you have a modern version of Ruby installed (the repo pins 3.2.3 in `.tool-versions`).
16 | 
 17 | Some versions of ffmpeg (around 5) may not work properly, as there was a temporary regression with AAC concatenation. ffmpeg 6 seems to work well. The author's build config looks like this:
18 | 
19 | ```
20 | ffmpeg version 6.0 Copyright (c) 2000-2023 the FFmpeg developers
21 |   built with Apple clang version 15.0.0 (clang-1500.0.40.1)
22 |   configuration: --prefix=/Users/maxschnur/.asdf/installs/ffmpeg/6.0 --enable-gpl --enable-libass --enable-libfdk-aac --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libzimg --enable-nonfree --enable-openssl --enable-shared
23 | ```
24 | 
25 | ## Usage
26 | 
27 | Split and stitch with the defaults:
28 | 
29 |     ruby run.rb
30 | 
31 | Test with your own input files and different target segment durations:
32 | 
33 |     ruby run.rb 0.5 <path-or-url>
34 | 
35 | You can find all artifacts in the "out" directory:
36 | 
37 |     ls out
38 | 
 39 | If you know you're operating on an input file whose audio stream is already AAC at 44.1 kHz, you can also split up the file without transcoding. Try a command like this to test it out:
40 | 
41 |     NO_TRANSCODE=1 ruby run.rb 1.0 <your-file-with-an-aac-input-stream>
42 | 
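If you're not sure whether an input qualifies, one way to check is to ask ffprobe for the codec and sample rate of the first audio stream. This is only a sketch: the helper names are hypothetical and not part of the demo, and the check itself is an assumption about what "safe to copy" means here (AAC at 44.1 kHz).

```ruby
require "open3"

# Parse "key=value" lines, as printed by ffprobe's default writer when
# noprint_wrappers=1 is set.
def parse_ffprobe_entries(output)
  output.each_line
        .map(&:strip)
        .select { |line| line.include?("=") }
        .to_h { |line| line.split("=", 2) }
end

# Assumed rule for this demo: stream copy only makes sense for AAC at 44.1 kHz.
def copy_safe?(entries)
  entries["codec_name"] == "aac" && entries["sample_rate"] == "44100"
end

# Runs ffprobe on the first audio stream of `path` (requires ffprobe on PATH).
def probe_first_audio_stream(path)
  out, status = Open3.capture2(
    "ffprobe", "-v", "error", "-select_streams", "a:0",
    "-show_entries", "stream=codec_name,sample_rate",
    "-of", "default=noprint_wrappers=1", path
  )
  raise "ffprobe failed for #{path}" unless status.success?
  parse_ffprobe_entries(out)
end

# Example with canned ffprobe output, so it runs without any media file.
sample = "codec_name=aac\nsample_rate=44100\n"
puts copy_safe?(parse_ffprobe_entries(sample))
```

With a real file you would call `copy_safe?(probe_first_audio_stream("your-file.m4a"))` and only set `NO_TRANSCODE=1` when it returns true.
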
43 | On a Mac, you may want to examine out/stitched.mp4 with a visualization program. The author uses [Audacity](https://www.audacityteam.org/) for that.
44 | 
45 |     open -a Audacity out/stitched.mp4
46 | 
 47 | NOTE: This repo hardcodes the sample rate as 44.1 kHz to simplify the demo code, but this method should work with any sample rate.
48 | 
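To adapt the math to another sample rate, the frame duration is the main thing that changes. The sketch below parameterizes it, assuming 1024 samples per AAC frame (as the demo does); the function names are illustrative and not taken from the repo. The encode command's `-ar 44100` would presumably need to change to match as well.

```ruby
# Assumption: 1024 samples per AAC frame, the same value the demo hardcodes.
AAC_SAMPLES_PER_FRAME = 1024

# Duration of one AAC frame in microseconds at the given sample rate.
def frame_duration_us(sample_rate)
  AAC_SAMPLES_PER_FRAME * 1_000_000.0 / sample_rate
end

# Nearest frame-aligned time (in microseconds) to a target time.
def closest_aligned_time_us(target_us, sample_rate)
  frame = frame_duration_us(sample_rate)
  (target_us / frame).round * frame
end

[44_100, 48_000].each do |rate|
  puts "#{rate} Hz: frame = #{frame_duration_us(rate).round(3)} us, " \
       "1 s aligns to #{closest_aligned_time_us(1_000_000, rate).round(3)} us"
end
```
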


--------------------------------------------------------------------------------
/run.rb:
--------------------------------------------------------------------------------
  1 | require "fileutils"
  2 | require "uri"
  3 | require "net/http"
  4 | require_relative "./calculations"
  5 | 
  6 | # Feel free to change these constants for your own testing.
  7 | SINE_WAVE_DURATION = 10.0
  8 | SINE_FREQUENCY = 10.0
  9 | DEFAULT_SEGMENT_DURATION = 1.0
 10 | SINE_WAVE_FILE_NAME = "sine-wave-#{SINE_WAVE_DURATION.to_i}-seconds.wav"
 11 | 
 12 | FileUtils.rm_rf "out"
 13 | FileUtils.mkdir_p "out"
 14 | 
 15 | input_file = nil
 16 | 
 17 | if ARGV.count == 1 && ARGV[0] !~ /\A-?\d+(\.\d+)?\z/
 18 |   input_file = ARGV[0]
 19 |   target_segment_duration = DEFAULT_SEGMENT_DURATION
 20 | else
 21 |   target_segment_duration = (ARGV[0] || DEFAULT_SEGMENT_DURATION).to_f
 22 | end
 23 | 
 24 | if target_segment_duration <= 0
 25 |   raise "Segment duration must be greater than 0"
 26 | end
 27 | 
 28 | if ARGV.count == 2
 29 |   input_file = ARGV[1]
 30 | end
 31 | 
 32 | if input_file&.start_with?(/^https?:\/\//)
 33 |   uri = URI.parse(input_file)
 34 |   puts "Downloading file from #{uri}..."
 35 |   resp = Net::HTTP.get_response(uri)
 36 |   if resp.code.to_i != 200
 37 |     raise "Failed to download file: #{resp.code.inspect}"
 38 |   end
 39 |   File.binwrite("out/downloaded-file", resp.body)
 40 |   input_file = "out/downloaded-file"
 41 | elsif input_file
 42 |   puts "Using local file #{input_file}"
 43 | else
 44 |   # generate the sine wave we'll use as input
 45 |   system("ffmpeg -hide_banner -loglevel error -nostats -y -f lavfi -i \"sine=frequency=#{SINE_FREQUENCY}:duration=#{SINE_WAVE_DURATION}\" out/#{SINE_WAVE_FILE_NAME}")
 46 |   input_file = "out/#{SINE_WAVE_FILE_NAME}"
 47 | end
 48 | 
 49 | first_three_bytes_in_hex = File.open(input_file, "rb") do |f|
 50 |   f.read(3).to_s.unpack1("H*")
 51 | end
 52 | first_two_bytes_in_hex = first_three_bytes_in_hex[0..3]
 53 | 
 54 | # Byte signatures taken from https://en.wikipedia.org/wiki/List_of_file_signatures
 55 | looks_like_mp3 = first_three_bytes_in_hex == "494433" ||
 56 |   first_two_bytes_in_hex == "fffb" ||
 57 |   first_two_bytes_in_hex == "fff3" ||
 58 |   first_two_bytes_in_hex == "fff2"
 59 | 
 60 | if looks_like_mp3
 61 |   # Something about mp3s makes it so they have extra padding between them when
 62 |   # split. Remuxing to mkv fixes it.
 63 |   #
 64 |   # NOTE: There may be other formats that benefit from remuxing to MKV too.
 65 |   puts "Detected mp3 input file. Remuxing to mkv..."
 66 |   remux_cmd = "ffmpeg -hide_banner -loglevel error -nostats -y -i #{input_file} -c copy out/remuxed.mkv"
 67 |   puts remux_cmd
 68 |   system(remux_cmd)
 69 |   input_file = "out/remuxed.mkv"
 70 | end
 71 | 
 72 | duration_cmd = "ffprobe -hide_banner -loglevel error -select_streams a:0 -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 #{input_file}"
 73 | puts duration_cmd
 74 | duration = `#{duration_cmd}`.to_f
 75 | puts "Input file duration: #{duration}"
 76 | 
 77 | # Generate the commands we'll use to slice the input into segments and the
 78 | # directives we'll use to recombine them later.
 79 | commands_and_directives = (duration / target_segment_duration).ceil.to_i.times.map do |i|
 80 |   start_time = (i * target_segment_duration * 1000000).round.to_i
 81 |   end_time = [((i + 1) * target_segment_duration * 1000000).round, duration * 1000000].min.to_i
 82 |   is_last = i == (duration / target_segment_duration).ceil.to_i - 1
 83 |   generate_command_and_directives_for_segment(input_file, i, start_time, end_time, is_last)
 84 | end
 85 | 
 86 | all_directives = commands_and_directives.map { |_, directives| directives }.join("\n")
 87 | File.write("out/audio-concat.txt", all_directives)
 88 | 
 89 | puts "---"
 90 | 
 91 | # Run the commands.
 92 | commands_and_directives.each do |cmd, _|
 93 |   puts cmd
 94 |   system(cmd)
 95 | end
 96 | 
 97 | puts "---"
 98 | 
 99 | # Stitch the segments back together.
100 | concat_cmd = "ffmpeg -hide_banner -loglevel error -nostats -y -f concat -i out/audio-concat.txt -c copy out/stitched.mp4"
101 | puts concat_cmd
102 | puts all_directives
103 | system(concat_cmd)
104 | 


--------------------------------------------------------------------------------
/calculations.rb:
--------------------------------------------------------------------------------
  1 | def frame_duration
  2 |   1024.0 / 44100.0 * 1000000.0
  3 | end
  4 | 
  5 | def get_closest_aligned_time(target_time)
  6 |   decimal_frames_to_target_time = target_time.to_f / frame_duration
  7 |   nearest_frame_index_for_target_time = decimal_frames_to_target_time.round
  8 |   puts "target_time: #{target_time}, decimal_frames_to_target_time: #{decimal_frames_to_target_time}, nearest_frame_index_for_target_time: #{nearest_frame_index_for_target_time}"
  9 |   nearest_frame_index_for_target_time * frame_duration
 10 | end
 11 | 
 12 | def generate_command_and_directives_for_segment(input_file, index, target_start, target_end, is_last)
 13 |   puts "--- segment #{index + 1} ---"
 14 | 
 15 |   start_time = get_closest_aligned_time(target_start)
 16 |   end_time = get_closest_aligned_time(target_end)
 17 |   puts "start_time: #{start_time}, end_time: #{end_time}"
 18 | 
 19 |   real_duration = end_time - start_time
 20 |   puts "real_duration: #{real_duration}"
 21 | 
 22 |   # We're subtracting two frames from the start time because ffmpeg always internally
 23 |   # adds 2 frames of priming to the start of the stream.
 24 |   start_time_with_padding = [start_time - frame_duration * 2, 0].max
 25 | 
 26 |   # We add extra padding at the end, too, because ffmpeg tapers the last few frames
 27 |   # to avoid a pop when audio stops. We don't want tapering--we just want the signal.
 28 |   # So by shifting the end, we shift the taper past the content we care about. We'll
 29 |   # chop off this tapered part using outpoint later.
 30 |   end_time_with_padding = end_time + frame_duration * 2
 31 |   puts "start_time_with_padding: #{start_time_with_padding}, end_time_with_padding: #{end_time_with_padding}"
 32 | 
 33 |   inpoint = 0
 34 | 
 35 |   if index > 0
 36 |     # We ask to also encode two frames before the start of our segment because
 37 |     # the AAC format is interframe. That is, the encoding of each frame depends
 38 |     # on the previous frame. This is also why AAC pads the start with silence.
 39 |     # By adding some extra padding ourselves, we ensure that the "real" data we
 40 |     # want will have been encoded as if the correct data preceded it. (Because
 41 |     # it did!)
 42 |     #
 43 |     # Note that, although we always set the extra time at the beginning to 2
 44 |     # frames here, it can actually be any value that's 2 frames or more. For
 45 |     # example, if you were encoding with echo, you might want to pad to account
 46 |     # for the full damping time of an echo.
 47 |     extra_time_at_beginning = frame_duration * 2
 48 |     start_time_with_padding = [start_time_with_padding - extra_time_at_beginning, 0].max
 49 | 
 50 |     # Although we only asked for two frames of padding, ffmpeg will add an
 51 |     # additional 2 frames of silence at the start of the segment. When we slice out
 52 |     # our real data with inpoint and outpoint, we'll want to remove both the silence
 53 |     # and the extra frames we asked for.
 54 |     inpoint = frame_duration * 2 + extra_time_at_beginning
 55 |   end
 56 | 
 57 |   padded_duration = end_time_with_padding - start_time_with_padding
 58 |   puts "padded_duration: #{padded_duration}"
 59 | 
 60 |   # inpoint is inclusive and outpoint is exclusive. To avoid overlap, we subtract
 61 |   # the duration of one frame from the outpoint.
 62 |   # We don't have to subtract a frame if this is the last segment.
 63 |   subtract = frame_duration
 64 |   if is_last
 65 |     subtract = 0
 66 |   end
 67 |   outpoint = inpoint + real_duration - subtract
 68 | 
 69 |   # Things usually appear to work fine without the duration directive, but by
 70 |   # adding it, we make it so ffmpeg doesn't need to "guess" how long each
 71 |   # segment should be based on its sample count. Since we can do the math for
 72 |   # this at higher fidelity than ffmpeg, for very long outputs, it may help
 73 |   # avoid de-sync and make seeking more predictably exact.
 74 |   duration_directive = outpoint - inpoint + frame_duration
 75 | 
 76 |   puts "inpoint: #{inpoint}, outpoint: #{outpoint}"
 77 | 
 78 |   command =
 79 |     if ENV["NO_TRANSCODE"]
 80 |       # If we know the input file is AAC and we're not changing the sample rate,
 81 |       # we can create the segments without transcoding too. This works because,
 82 |       # if we cut at exactly the AAC frame boundaries, then we can just slice
 83 |       # out portions of the stream. Note, however, that -ss and -t flags are moved after
 84 |       # the input file so they're applied after the input file is read. Without that,
 85 |       # you'll get some funky output.
 86 |       "ffmpeg -hide_banner -loglevel error -nostats -y -i #{input_file} -c:a copy -ss #{start_time_with_padding}us -t #{padded_duration}us -f adts out/seg#{index + 1}.aac"
 87 |     else
 88 |       "ffmpeg -hide_banner -loglevel error -nostats -y -ss #{start_time_with_padding}us -t #{padded_duration}us -i #{input_file} -c:a libfdk_aac -ar 44100 -f adts out/seg#{index + 1}.aac"
 89 |     end
 90 | 
 91 |   directives = [
 92 |     "file 'seg#{index + 1}.aac'",
 93 |     "inpoint #{inpoint}us",
 94 |     "outpoint #{outpoint}us",
 95 |     "duration #{duration_directive}us"
 96 |   ]
 97 | 
 98 |   [command, directives.join("\n")]
 99 | end
100 | 


--------------------------------------------------------------------------------