osyrs: A Binary Patcher

In a past life I spent some time working as a Build Engineer for a co-development studio, so I was fortunate enough to get a good look at a bunch of different games and a bunch of unique (and common) challenges.

There was one challenge that came up consistently: game builds can be HUGE, and you need to move them between build machines, to cloud storage, onto servers, down to developer machines, onto players' machines, etc., and do it all performantly and securely.

TIP: skip to the In Action section below if you need some convincing to keep reading!



This common problem of moving large files around is nothing new to the games industry. Long gone are the days when players would download Doom via the FTP server at the University of Wisconsin-Madison. Today most studios have a solution for performant build distribution, whether it's public-facing like Blizzard's Battle.net launcher, the Epic Games Launcher, or the incredibly popular Steam and itch.io platforms, or internal-only tooling used just by the development team. That said, I've also seen and heard of plenty of teams that just pass around .zip archives via Google Drive. I think teams can do better than that, and deserve better too.

Why am I even thinking about this?

Building your own launcher/patcher probably doesn't make sense for most teams; if you can ship on an existing platform like Steam, then you probably should. But when it comes to distributing internal-only binaries used by the development team, or copying large files (or large quantities of files) between build machines in the cloud, you'll need something more custom. Game patchers are most synonymous with distributing builds to players, but they can serve these other purposes as well.

Note that in the case of Unreal Engine you can utilize something off the shelf like UGS (Unreal Game Sync) provided by Epic, but that doesn't cover distribution of binaries between build machines and servers, so you'll find teams that build something custom or build on top of something like rsync.

These are the use cases that have me thinking about game patchers, how they work, and whether I could iterate on any of the approaches here. Additionally, I wanted a reason to learn Rust, and this seemed like a good one. Also... I'm hard stuck Iron in League of Legends, and this feels like a more constructive use of time than dying over and over to Teemo top lane cheese.

Unreal Game Sync & Editor Binaries

Let's talk briefly about UGS.

Unreal Game Sync is a utility application distributed by Epic as source code alongside the Unreal Engine source. When a game team pulls down the source for Unreal Engine and uploads it to their VCS (Version Control System), one of the first tasks the build guy might tackle is a build job that compiles Unreal Game Sync and saves the artifact into Perforce (or whatever VCS they are using). The team can then pull down the UGS installer, install UGS on their machines, and each morning use it to "sync editor binaries" built from the nightly build, including the engineering team's changes from the previous day. Sounds great, right? It's definitely a good place to start, but Perforce as a mechanism for distributing these binaries sometimes falls short, and the Perforce server is likely already busy syncing workspaces on all those build machines you have.

sync editor binaries

Running "sync editor binaries" when using UGS's default mechanism (which pulls from Perforce), can easily take minutes (5, 15, 30, etc.) depending on how large a game's editor binary artifact is, and what other tooling they ship with it. Doing this only once each morning might not be that big of a deal, but if a engine team is constantly shipping new editor changes, then this time adds up and it can be a real distraction and source of friction in the development process, can easily turn into a DDOS on your perforce server, and can be a signifcant source of stress and hair loss for your build engineer.

During my time as a Build Engineer, optimizing this process was a common struggle, especially for teams that wanted to pull binaries more frequently.

What is osyrs?

At this point you are probably thinking: ok, cool story bro -- thanks for the history lesson and the UGS/UE refresher. Now what is osyrs? Or maybe you've left and aren't even reading anymore... but if you are, thanks for sticking it out this far.

osyrs is my "binary patcher" pet project: a tool that aims to support "patching" binary files, meaning it enables migrating a file from its current version to a later version (well, it will someday).

Functionality

Currently osyrs has this REVOLUTIONARY (/s) functionality:

  • Patching: Create a binary patch file representing the difference between two versions of a file
  • Patching: Apply a binary patch to upgrade one version of a file to the next version of that file
  • Chunking: "Chunk" a file into pieces defined by its contents
  • Chunking: Reconstruct a file fully from its previously computed chunks

Patching (bsdiff)

Patching is the first approach I took, and it is fairly simple and straightforward. Patching supports creating a "patch" file that represents the binary differences between two versions of a file. With a patch file, you can upgrade one version of a file to the next by "applying" the patch to the original. osyrs leverages the bsdiff algorithm through a Rust crate to accomplish the diffing and patching.

While straightforward (apply this one file to upgrade file X from version 1.0 to 2.0), it has its downsides and limitations, such as when you have version 1.0 of a file but need to go to version 3.0. How do you "skip" versions like that? You have to compute a patch that represents the difference between 1.0 and 3.0. That works, but over time, if you have many versions of your application/game, these patches add up: you need a patch from every potential starting version to the latest version, recomputed each time you create a new latest version. That's a lot of work. Additionally, a seemingly small change to a source file can sometimes produce a rather large patch file, so the related IO and patch application aren't incredibly performant either. Riot Games has written some great articles on these topics, which I've linked below in the references section. I highly recommend you give Riot's tech blog a read (finish this article first though, of course!).
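For illustration, here's roughly what that diff/patch round trip looks like using the bsdiff crate on crates.io (a minimal sketch, not osyrs's actual code; the file names are made up):

// Minimal bsdiff round trip using the `bsdiff` crate (illustrative only;
// file names are hypothetical and this is not osyrs's actual code).
use std::fs;

fn main() -> std::io::Result<()> {
    let v1 = fs::read("game-1.0.bin")?;
    let v2 = fs::read("game-2.0.bin")?;

    // Compute a patch representing the binary difference between v1 and v2.
    let mut patch = Vec::new();
    bsdiff::diff(&v1, &v2, &mut patch)?;
    fs::write("1.0-to-2.0.patch", &patch)?;

    // Later, on a machine that only has v1: apply the patch to produce v2.
    let mut upgraded = Vec::new();
    bsdiff::patch(&v1, &mut patch.as_slice(), &mut upgraded)?;
    assert_eq!(upgraded, v2);
    Ok(())
}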

Chunking (FastCDC)

Chunking is the second approach I've tried, and it is much more involved than the simpler patching approach. Chunking aims to re-use portions of the original file (which is already present on disk) and only apply changes to the sections that differ in the next version of the file (or a version several releases later). This might sound very similar to the patching approach above, but there are some very important differences, chiefly support for "jumping" versions of a file: you can update a file from version 1.0 -> 3.0 without first updating from 1.0 -> 2.0 and then 2.0 -> 3.0. Additionally, chunking can help optimize the network IO involved in fetching the differences that need to be applied during the upgrade, since the aim is to only fetch the chunks which are actually missing from your current file, nothing more. Again, Riot does a great job of covering these concepts in more depth on their technology blog.

To accomplish this approach I am leveraging the FastCDC (fast content-defined chunking) algorithm (also through a Rust crate) to compute the "chunks" that make up a file. Using FastCDC I can compute the list of chunks in a file, logically group them for theoretically performant retrieval from disk (and later the network) and access in memory, and then re-assemble the file from those chunks.
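Here's a minimal sketch of that chunking pass using the fastcdc crate (the size parameters are illustrative, and this isn't osyrs's actual code):

// Walk a file's content-defined chunks with the `fastcdc` crate.
// The min/avg/max chunk sizes below are illustrative, not osyrs's settings.
use fastcdc::v2020::FastCDC;

fn main() -> std::io::Result<()> {
    let contents = std::fs::read("2d-rpg.zip")?;

    // Minimum, average, and maximum chunk sizes in bytes.
    let chunker = FastCDC::new(&contents, 4096, 16384, 65535);

    for chunk in chunker {
        // Each chunk is an (offset, length) slice of the input with a
        // content-derived hash that identifies it across file versions.
        println!("hash={} offset={} length={}", chunk.hash, chunk.offset, chunk.length);
    }
    Ok(())
}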

Practically speaking, what this looks like is grouping chunks into .occz (osyrs chunk container zip) files, and then referencing all of the .occz files composing a file in a .occmz (osyrs chunk container manifest zip) file. .occz and .occmz files are stored as binary files using the MessagePack interchange format and then compressed with Brotli. I'm using MessagePack because I knew I would want to store these files in a binary format, and I knew I didn't want to spend time writing my own serialization and deserialization code. I may do a small bit of custom framing in the future to support different types of serialization for the data payload portion of these files, but I'll explore that later if I get there.

Rust Structs

use std::collections::HashMap;

use serde::{Deserialize, Serialize};

/// The payload of a .occz file: a group of chunks packed together.
#[derive(Debug, PartialEq, Deserialize, Serialize)]
pub struct Container {
    // Total size in bytes of the chunk data held in this container.
    pub total_size: usize,
    // Per-chunk lookup into `data`, keyed by chunk id.
    pub map: HashMap<String, Vec<u32>>,
    // The raw chunk bytes; serde_bytes keeps this compact in MessagePack.
    #[serde(with = "serde_bytes")]
    pub data: Vec<u8>,
}

/// The payload of a .occmz file: everything needed to reassemble a file.
#[derive(Debug, PartialEq, Deserialize, Serialize)]
pub struct ContainerManifest {
    // Ids of the chunks composing the file, in order.
    pub chunk_ids: Vec<String>,
    // Which container (.occz) holds each chunk.
    pub chunk_hash_to_container_hash: HashMap<String, String>,
}
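To make the MessagePack-then-Brotli layering concrete, here's what writing a Container out as a .occz could look like, assuming the rmp-serde and brotli crates (a sketch only; the compression parameters are illustrative and this isn't osyrs's exact code):

// Sketch: serialize a Container to MessagePack, then Brotli-compress it
// on the way to disk. Assumes the rmp-serde and brotli crates; the
// buffer size, quality, and window parameters are illustrative.
use std::io::Write;

fn write_occz(container: &Container, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // 1. Serialize the struct to the MessagePack binary format.
    let packed = rmp_serde::to_vec(container)?;

    // 2. Compress the MessagePack bytes with Brotli while writing the file.
    let file = std::fs::File::create(path)?;
    let mut writer = brotli::CompressorWriter::new(file, 4096, 11, 22);
    writer.write_all(&packed)?;
    Ok(())
}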

An interesting note is that the aggregate size on disk of the various .occz and .occmz files is roughly the same as the source file. For example, using a 28 MB zip as the source file, the resulting directory on Windows is 28.4 MB, containing 253 .occz files and a single .occmz file. I would love to get this below the size of the actual source file (maybe through better compression). Serializing as MessagePack does add some size to each file, though, and I add new data into each file to describe its contents, so there is naturally more data than just the source contents.

That said, what I am optimizing for is the reconstruction and patching of a file either from scratch or from one version to another, not necessarily the storage size for the remote patching system, so I believe this is a reasonable tradeoff at present.

In Action

In the examples below I first chunk a file, and then reconstruct it using the processed chunks. The input file (2d-rpg.zip) is a zip archive of a small 2D Unity game compiled and built for Windows. This archive is only about 28 MB compressed, which isn't anything impressive. I'll need to run these tests against much larger files (such as Unreal editor binaries) in the future.

Chunk the file:

RUST_LOG=info ./osyrs.exe chunk "./2d-rpg.zip" -o "./out/chunks"

Reconstruct the file:

RUST_LOG=info ./osyrs.exe reconstruct \
-m "./out/chunks/manifest.occmz" \
-c "./out/chunks" \
-o "./out/reconstructed/2d-rpg.zip"

Demonstrate that the file was reconstructed correctly (shoutout to Windows for taking forever to decompress the file):
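If you'd rather not wait on Windows, a quick hash comparison proves the same thing (a throwaway sketch using the sha2 crate; this isn't part of osyrs):

// Throwaway check: hash the source and reconstructed files and compare.
use sha2::{Digest, Sha256};

fn main() -> std::io::Result<()> {
    let original = Sha256::digest(std::fs::read("./2d-rpg.zip")?);
    let rebuilt = Sha256::digest(std::fs::read("./out/reconstructed/2d-rpg.zip")?);
    assert_eq!(original, rebuilt, "reconstructed file does not match the source");
    println!("Hashes match!");
    Ok(())
}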

Notice that in this example the chunking and reconstruction operations each take less than 1 second to complete. Seems pretty good to me, even if the test file is fairly small (28 MB).

What's Next

Now that I have what I consider the basic functionality (recreating a file from nothing), the next step is to be able to upgrade files rather than just reconstruct them from scratch. This will involve inspecting an existing file, detecting which chunks it already has, determining which chunks are missing, and then applying those chunks in the correct way so that the file is patched properly.
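For a rough idea of the detection step, here's a sketch of what I have in mind (nothing here is implemented yet; the function name, chunking parameters, and chunk id format are all made up for illustration):

// Hypothetical shape of the "which chunks am I missing?" step. Not
// implemented in osyrs yet; names and the id format are made up.
use std::collections::HashSet;
use fastcdc::v2020::FastCDC;

fn missing_chunks(current: &[u8], target_chunk_ids: &[String]) -> Vec<String> {
    // Chunk the file we already have on disk with the same FastCDC
    // parameters used when the target manifest was produced.
    let have: HashSet<String> = FastCDC::new(current, 4096, 16384, 65535)
        .map(|chunk| format!("{:016x}", chunk.hash))
        .collect();

    // Anything in the target manifest that isn't present locally must be
    // fetched (from disk today, from a CDN or bucket later).
    target_chunk_ids
        .iter()
        .filter(|id| !have.contains(*id))
        .cloned()
        .collect()
}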

Once I have the ability to patch/upgrade a file from one version to another, I'll either tackle networking and fetching the .occz and .occmz files from a CDN or cloud bucket (S3, etc.), or explore applying these concepts to a directory of files rather than a single file, or both. Directory recreation/upgrading will be necessary for upgrading an application/game on disk from one version to another without first needing a compressed archive or single-file version of the application/game available.

I'm excited to keep exploring this project, and to someday escape Iron in League of Legends. Keep an eye on this blog and my other socials for updates if you want to follow along.

Cheers!

References