Interrogating Sora
Although I’m a bit late to the party on this one, I thought today would be a good day to write about Sora. This will be a longer post, so try not to nod off.
For those who live under a rock, Sora is OpenAI’s latest model. If you’re to believe OpenAI’s press releases, it’s capable of generating high-fidelity and coherent video from just text or single-image prompts.
As someone who works on generative AI (image generation specifically), I find the inner workings of Sora of particular interest. OpenAI is painfully (and deliberately) vague on this point, with even the technical report describing only the most basic outline of how it works.
The rest, it seems, is left to the imagination of the reader. We can only speculate and make educated guesses as to what’s going on behind the scenes. So that’s what I’ll do in this post.
I’m going to make a series of precise observations and careful deductions, interspersed with speculative leaps of faith, and finally come up with my best guess at what on Earth Sora is really doing behind all the fancy marketing material.
Patches, Unification and Large Language Models
So the first concept that features extensively in Sora’s technical report is space-time patches. The general idea, they say (citing earlier works), is that the entire video is broken up into smaller, tessellating pieces which they call patches. A key detail of the technical report (which could easily be overlooked) is:
The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits.
I’d like to draw particular attention to the use of the words tokens and unify. To be clear, tokens are not, in general, patches. Patches are a specific way of grouping pixels when working with vision transformers. A token refers to any kind of integer-coded information (generally occurring in a sequence) drawn from a fixed, finite “vocabulary”.
What’s interesting in the Sora technical report, then, is the fact that tokens and patches are mentioned in the same paragraph. What’s more, they’re mentioned in the context of both unification and large language models (which almost always use tokens). The only way to truly unify video with conventional language models would be to also tokenize the video.
Could it be that Sora’s “patches” are in fact tokens, then? But how does one tokenize a video?
Vector Quantization
An idea that’s been around for a while now in generative AI / deep learning is that of vector quantization (or VQ). The basic idea is to turn a continuous, high-dimensional natural data type (such as images or audio) into a much more compact, discrete (integer-coded) representation. The resulting “tokenized” representation can then be thought of as a sort of “visual sentence”, which has its own sort of emergent visual language.
This framing is especially useful because one can train a transformer to sample coherent sequences in that visual language, which can then be decoded (reconstructed) back into a plausible image.
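To make this concrete, here’s a minimal NumPy sketch of the quantization step: a random codebook stands in for a learned one, and some made-up patch embeddings are each mapped to the index of their nearest codebook vector. Every size here (the 512-entry codebook, the 32-dimensional codes, the 16x16 patch grid) is invented purely for illustration and has nothing to do with Sora’s actual configuration.

```python
import numpy as np

# Toy vector quantization: map continuous patch embeddings to the index of
# their nearest codebook vector, then "decode" by looking those vectors back
# up. The codebook is random here; in a real VQ model it (and the encoder
# producing the patch embeddings) would be learned.
rng = np.random.default_rng(0)

codebook = rng.normal(size=(512, 32))              # 512-entry "vocabulary" of 32-dim codes
patch_embeddings = rng.normal(size=(16 * 16, 32))  # e.g. a 16x16 grid of patch features

# Quantize: nearest codebook entry per patch (squared Euclidean distance).
dists = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)                      # shape (256,), integers in [0, 512)

# "Decode": swap each token for its codebook vector; a learned decoder would
# then turn this grid of vectors back into pixels.
reconstruction_input = codebook[tokens]

print(tokens[:10])                 # the image is now a short "visual sentence"
print(reconstruction_input.shape)  # (256, 32)
```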
The reason I mention this in the context of Sora is that there’s no reason vector quantization can’t be applied to video. In fact, an earlier work by Google AI shows this approach to be both effective and scalable for generating video from text when combined with a masked encoder (which can be thought of as a kind of discrete diffusion).
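To give a flavour of what “masked encoder as a kind of discrete diffusion” means in practice, here’s a rough sketch of MaskGIT-style iterative decoding over a token sequence: start fully masked, predict every position in parallel, keep the most confident predictions, and repeat. The model_logits function is a random stand-in for a trained masked transformer, and the unmasking schedule is just one plausible choice, not anything confirmed about Sora or the Google work.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, MASK = 1024, 1024        # toy token vocabulary plus a special [MASK] id
SEQ_LEN, STEPS = 256, 8         # e.g. a 16x16 token grid, 8 refinement steps

def model_logits(tokens):
    """Stand-in for a trained masked transformer: per-position logits over the
    vocabulary. Random here so the sketch runs end to end."""
    return rng.normal(size=(tokens.shape[0], VOCAB))

tokens = np.full(SEQ_LEN, MASK)  # start from an all-masked canvas

for step in range(STEPS):
    logits = model_logits(tokens)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    preds = probs.argmax(-1)     # most likely token per position
    conf = probs.max(-1)         # and how confident the model is about it

    # Unmask a growing fraction of the most confident still-masked positions,
    # leaving the rest masked for later steps -- the "discrete diffusion"-like
    # refinement loop.
    still_masked = tokens == MASK
    n_keep = int(np.ceil(still_masked.sum() * (step + 1) / STEPS))
    order = np.argsort(-np.where(still_masked, conf, -np.inf))
    tokens[order[:n_keep]] = preds[order[:n_keep]]

print((tokens == MASK).sum())    # 0: every position has been filled in
```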
So, it’s certainly plausible that Sora could be built on some kind of vector quantization. A point against this idea might be the fact that vector quantization isn’t mentioned at all in the technical report; however, there’s also a reasonable chance it’s a trade secret, exactly the kind of detail a company focused on maintaining a competitive advantage would keep under wraps.
My Best Guess
There are a few more bits from the technical report worth mentioning that could give some clues as to how Sora works:
- Continuous looped video generation: A notable capability of Sora is the ability to generate a seamless infinite loop of video. This is something that’s possible with discrete (VQ-based) models by essentially “tiling” the sampler: applying it many times at different offsets and averaging over the outputs (see the sketch after this list).
- Video in-filling and backwards completion: These kinds of features work out of the box with VQ-type models, since it’s possible to just “mask out” the earlier part of the video.
- Blocky temporal rendering: I could be just imagining this one, but some movements in the videos in the technical report look a bit “blocky” in the time dimension. Though this could also just be an artifact of using patches.
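Here’s a toy sketch of what I mean by “tiling” the sampler for seamless loops: run the (stand-in) model on circularly shifted copies of the token sequence, shift the resulting logits back, and average before picking tokens. The offsets, the one-token-per-frame simplification and the random model are all invented for illustration; this is just one way such a trick could plausibly be implemented, not a claim about what Sora actually does.

```python
import numpy as np

rng = np.random.default_rng(0)

T, VOCAB = 16, 1024             # 16 frames, one token per frame for brevity
OFFSETS = (0, 4, 8, 12)         # circular offsets to "tile" the sampler over

def model_logits(frame_tokens):
    """Stand-in for the trained model: per-frame logits over the vocabulary.
    Random here so the sketch runs; a real model would attend to the tokens."""
    return rng.normal(size=(frame_tokens.shape[0], VOCAB))

frame_tokens = rng.integers(0, VOCAB, size=T)  # some current draft of the loop

# Every offset sees the "seam" somewhere in the middle of its window, so the
# averaged prediction has no preferred start frame -- one way for a discrete
# model to produce a loop with no visible join.
avg_logits = np.zeros((T, VOCAB))
for off in OFFSETS:
    shifted = np.roll(frame_tokens, -off)                      # rotate the loop
    avg_logits += np.roll(model_logits(shifted), off, axis=0)  # un-rotate logits
avg_logits /= len(OFFSETS)

looped_tokens = avg_logits.argmax(-1)
print(looped_tokens.shape)       # (16,): one refined token per frame
```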
All things considered, I think there’s a reasonable chance Sora works roughly as follows (a toy end-to-end sketch follows the list):
- Individual frames are encoded as grids of discrete latent codes (integer-valued) using vector quantization.
- A masked encoder-only transformer (possibly with some specialized optimizations) is trained to generate representations of full-length videos.
- The representations are decoded back into high-fidelity videos.
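Putting those three steps together, here’s a compressed end-to-end sketch of the guessed pipeline. Every function body is a random placeholder, and the names (vq_encode, masked_transformer_generate, vq_decode) are mine, not anything from the technical report; only the shapes and the overall flow are meant to be informative.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, W, VOCAB = 16, 32, 32, 1024   # frames, token-grid height/width, vocabulary

def vq_encode(frames):
    """Step 1: frames -> grid of discrete latent codes (placeholder)."""
    return rng.integers(0, VOCAB, size=(T, H, W))

def masked_transformer_generate(text_prompt, token_grid):
    """Step 2: fill in masked tokens conditioned on the prompt (placeholder)."""
    return rng.integers(0, VOCAB, size=token_grid.shape)

def vq_decode(token_grid):
    """Step 3: discrete codes -> RGB frames via a learned decoder (placeholder)."""
    return rng.random(size=(T, H * 8, W * 8, 3))

frames = rng.random(size=(T, H * 8, W * 8, 3))   # dummy input video (as in training)
tokens = vq_encode(frames)                       # step 1: tokenize
generated = masked_transformer_generate("a corgi surfing a wave", tokens)  # step 2
video = vq_decode(generated)                     # step 3: back to pixels
print(video.shape)                               # (16, 256, 256, 3)
```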
Why I could be wrong
The technical report also specifically mentions the use of Gaussian noise, which is typically not something that’s used in VQ-based diffusion. This could undermine everything I’ve said about VQ, so don’t take anything here too seriously. Or perhaps Sora uses some combination of continuous and discrete latent variables (though that seems unlikely).
So hopefully by now you’re familiar with my humble opinion on something I know very little about. That’s all for today. Stay tuned for more technical-sounding rambles.
Take care!
Jamie