How a 4 Person Team Built Sora

John Mullan, Duncan Crawbuck, Chaitu Aluru, Aakash Sastry

Hotshot

Prompt: close up eyes, a beautiful woman with braids subtly peeks above a calm water surface during golden hour, with only her eyes and forehead visible. water droplets on her skin, detailed skin, the water reflects her serene expression and the warm ambient light, creating a tranquil and intimate scene., close up of her eyes.

We're excited to share an early preview of Hotshot, a large-scale diffusion transformer model that serves as the foundation for our upcoming consumer product. Hotshot excels in prompt alignment, consistency, and motion, while being highly extensible to longer durations, higher resolutions, and additional modalities.

In our evaluations across a variety of prompts, users preferred Hotshot's results to other publicly available text-to-video models 70% of the time. You can see a subset of the results we tried here.

Hotshot is available to try today in beta at https://hotshot.co. We encourage you to give it a spin and share your feedback with us.

Introduction

We've been captivated by diffusion models; they feel like a new type of camera for our imagination. For the first time in over a decade, it feels possible to build new types of video applications that consumers will love.

As we tried building on top of open source and commercial video models, we quickly realized that 1) the base models would have to get much better and 2) we would need control over the underlying model in order to build any compelling user experiences.

So, we decided to build our own video models.

Background

In the last 13 months, we've trained 3 different video models.

Our first video model, Hotshot-XL, generates 1-second, 8 fps videos. Our train was complete 3 months after we wrote our first lines of code. We felt the model was more of a tech demo than a viable base to build our own products on, and we wanted to contribute our learnings back to the community. So we open sourced Hotshot-XL. Today, ~20K new developers and artists use Hotshot-XL every month, and we've seen some amazing things made with it.

Our next video model, Hotshot Act-One, generates 3-second, 8 fps videos and was trained in 5 months. In building this model, we significantly scaled up our video dataset to 200MM densely captioned, publicly available videos and got our first real taste of compute at scale, distributed training, and high-resolution diffusion models.

In the last 4 months, we've trained Hotshot, a text-to-video model that generates up to 10 seconds of footage at 720p.

In the next 12 months, entire YouTube videos will be AI-generated by creators. Text-to-video is our foundation for much more. Control over every aspect of generations, longer durations, higher resolutions, real-time interactivity, and more modalities (like audio!) are just around the corner.

We're writing this blog post to share some of the learnings we gained along the way. We hope to inspire others who are interested in training models like these and pull the veil back a tiny bit on the creation of these magical minds.

Data Engineering

First, since we were going to be training a larger model, we knew we would need to scale up our Data Engineering. We set a goal of scaling our video corpus to 600 million clips. This was going to be a challenge, but we knew the primary burden would be the massive operations overhead more than anything else. Second, we knew we wanted to train the model on images and videos jointly in order to take advantage of how much more abundant publicly accessible image data is than video data. We hadn't trained a foundation model on images from the ground up before, so we needed to build our image corpus from scratch. We set a goal of scaling our image corpus to 1 billion images.

There are many publicly available VLMs to use for captioning (LLaVA, CogVLM, etc.), but because they were trained for image understanding rather than video, they excel at spatial understanding (colors, objects, people, etc.) but struggle heavily with temporal understanding (actions, how things change over time). To combat this, we created a dataset of 300K video samples, hand-captioned with dense temporal captions in the style we wanted, and fine-tuned a publicly available VLM for video understanding. In a couple of weeks, we had a video captioner we were quite happy to use to annotate our hundreds of millions of video samples.
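
To make the data side of that fine-tune concrete, here's a rough sketch of how clips can be turned into temporally ordered frame sequences paired with dense captions. The manifest format, file names, and frame count are illustrative assumptions, not our actual pipeline.

```python
# Illustrative sketch: uniformly sample N frames per clip so an image-first VLM
# can be fine-tuned on ordered frames plus dense temporal captions.
import json
import torch
from torch.utils.data import Dataset
from torchvision.io import read_video


def sample_frames(path: str, num_frames: int = 8) -> torch.Tensor:
    """Decode a clip and return `num_frames` uniformly spaced frames, shape (T, H, W, C)."""
    video, _, _ = read_video(path, pts_unit="sec", output_format="THWC")
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    return video[idx]


class VideoCaptionDataset(Dataset):
    """Pairs sampled frames with hand-written dense temporal captions (hypothetical JSONL manifest)."""

    def __init__(self, manifest_path: str, num_frames: int = 8):
        with open(manifest_path) as f:
            # each line: {"video": "clip_0001.mp4", "caption": "..."}
            self.items = [json.loads(line) for line in f]
        self.num_frames = num_frames

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, i: int):
        item = self.items[i]
        return sample_frames(item["video"], self.num_frames), item["caption"]
```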

Deploying this captioner at scale for billions of images and video clips turned out to be quite the challenge. We needed thousands of GPUs for this, which came with its own orchestration challenges. Managing thousands of GPUs in the cloud is a lot like adding hundreds or thousands of junior teammates to your team overnight: there's a steep cost responsibility, they need constant babysitting, and it sometimes even feels like they have a mind of their own. Managing this pipeline was a 24/7 job for one of our team members for an entire month.
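
For a sense of the shape of that orchestration, here's a toy sketch of a shard-claiming worker loop: each worker owns a slice of shards, writes a "done" marker so restarted workers skip finished work, and retries transient failures. `caption_shard`, the marker scheme, and the retry numbers are all hypothetical stand-ins.

```python
# Toy sketch of fanning caption work out across many GPU workers.
import os
import time


def caption_shard(shard_path: str) -> None:
    """Stand-in for the real VLM inference over one shard of clips."""
    ...


def run_worker(shards: list[str], rank: int, world_size: int, done_dir: str):
    os.makedirs(done_dir, exist_ok=True)
    for shard in shards[rank::world_size]:          # static round-robin assignment
        marker = os.path.join(done_dir, os.path.basename(shard) + ".done")
        if os.path.exists(marker):                  # already captioned before a restart
            continue
        for attempt in range(3):                    # retry transient GPU / IO failures
            try:
                caption_shard(shard)
                open(marker, "w").close()
                break
            except Exception as err:
                print(f"[rank {rank}] shard {shard} failed ({err}), retry {attempt + 1}")
                time.sleep(30)
```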

Research

As with many engineering projects, the first few days are always the easiest. To train a basic model, you only need to clone one of the many great open source repos and you'll be off to a great start. In fact, there's an open source implementation of Diffusion Transformers from when the authors were at Meta.

We got toy examples of ImageNet training with DiT running within a couple of days. In parallel, we knew we had to train a new autoencoder to compress videos both spatially and temporally so that we could efficiently train on long sequence lengths. We hadn't trained an autoencoder from scratch before and were caught off guard by how unstable the training can be. Halfway through training, it looked like there was little to no progression - the discriminator had kicked in, but we couldn't see how much it was actually benefiting our training run. Both generator and discriminator losses weren't moving; we nearly ended up restarting the run with different hyperparameters. Instead, we let the train run - after another day of training, the generator and discriminator losses both started to gradually go down together. After a couple of weeks, we had a newly trained autoencoder that we were happy to use for the input to our network.
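
That kind of stall is common when an adversarial term joins the reconstruction objective partway through training. Below is a sketch of the usual recipe (as in VQGAN-style autoencoders) where the GAN loss is gated on a `disc_start` step; the weights and schedule are illustrative, not our exact configuration.

```python
# Reconstruction + adversarial loss with a delayed discriminator, VQGAN-style.
import torch
import torch.nn.functional as F


def generator_loss(recon, target, disc_logits_fake, step, disc_start=50_000, gan_weight=0.5):
    rec = F.l1_loss(recon, target)          # reconstruction term
    if step < disc_start:
        return rec                          # pure reconstruction warm-up
    gan = -disc_logits_fake.mean()          # non-saturating GAN term: make fakes look real
    return rec + gan_weight * gan


def discriminator_loss(disc_logits_real, disc_logits_fake):
    # hinge loss, a common choice for image/video discriminators
    real = torch.relu(1.0 - disc_logits_real).mean()
    fake = torch.relu(1.0 + disc_logits_fake).mean()
    return 0.5 * (real + fake)
```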

Training

Next, we wanted to make sure we had the perfect architecture for training. What type of diffusion formulation would we go with? How deep and how wide would we make the network? As a product requirement, we wanted Hotshot to generate videos at any resolution and any duration up to 10s. We worked our way through these problems and the code implementations over the next month.
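
One reason a transformer backbone extends naturally to arbitrary resolutions and durations is that a latent video is simply patchified into a token sequence, and only the sequence length changes. Here's a minimal sketch of spatiotemporal patchification; the patch sizes, channel counts, and dimensions are illustrative, not Hotshot's actual values.

```python
# Minimal DiT-style spatiotemporal patchify: latent video -> token sequence.
import torch
import torch.nn as nn


class SpatioTemporalPatchify(nn.Module):
    def __init__(self, latent_channels=16, patch_t=1, patch_hw=2, dim=1024):
        super().__init__()
        self.proj = nn.Conv3d(
            latent_channels, dim,
            kernel_size=(patch_t, patch_hw, patch_hw),
            stride=(patch_t, patch_hw, patch_hw),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (B, C, T, H, W) from the spatiotemporal autoencoder
        x = self.proj(latents)                  # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)     # (B, T'*H'*W', dim) token sequence


tokens = SpatioTemporalPatchify()(torch.randn(1, 16, 8, 44, 80))
print(tokens.shape)  # longer clips or larger frames simply yield more tokens
```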

We spent quite a bit of time evaluating a few different novel architectures that might give us a ~20% speed improvement at training and inference. It turns out that most of the architecture experiments we tried didn't pan out the way we hoped once we weighed expected quality against time/cost savings. We learned quite a bit about evaluating risk/reward trade-offs for different experiments, and we're excited to take those lessons into our next trains.
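
When weighing a promised speed-up against its quality risk, it helps to measure raw step throughput directly rather than trust back-of-the-envelope estimates. A rough timing harness sketch, assuming a model whose forward pass returns a tensor:

```python
# Rough throughput comparison between two architecture variants.
import time
import torch


def steps_per_second(model, batch, n_steps=50, warmup=10):
    device = next(model.parameters()).device
    for _ in range(warmup):                      # let caches and autotuning settle
        model(batch).sum().backward()
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_steps):
        model(batch).sum().backward()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return n_steps / (time.perf_counter() - start)
```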

After running some ablation studies of different architectures, we were ready to start scaling up our trains on more compute.

Scaling

We learned the pains of 2 new areas of training that we hadn't seen previously: infrastructure and optimization. As compute scales, it becomes more difficult to manage - IO gets extremely bottlenecked, logging becomes chaotic, and H100s fail regularly, particularly when you are pushing the hardware to the max training a video model. Additionally, as compute scales, training runs become much, much more expensive, which makes optimizing your code to run as fast as possible all the more important. Anecdotally, we found that the more we optimized our code and pushed the power wattage on the GPUs, the higher our risk of GPU failure - a data center provider even told us we almost set a rack on fire and instructed us to optimize our code less.

For the next 3 months, 99% of our time was spent on infrastructure and optimization: continuously optimizing our trains with different types of data/model parallelism at scale, writing custom kernels, minimizing our GPU overhead, caching our data so it was retrievable as quickly and efficiently as possible, babysitting trains and reviving them when they went down, and more.
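
As one concrete flavor of that parallelism work, here's a minimal sketch of sharding model state across GPUs with PyTorch FSDP and bf16 mixed precision. It assumes a torchrun-style launch; our real mix of data/model parallelism and custom kernels was considerably more involved.

```python
# Minimal FSDP + bf16 sharding sketch (assumes launch via torchrun).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision


def shard_model(model: torch.nn.Module) -> FSDP:
    dist.init_process_group("nccl")                        # rank/world size from the launcher
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(model.cuda(), mixed_precision=bf16, use_orig_params=True)
```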

It is obvious in hindsight: the challenges that come with scale make it a totally different type of adventure. Internally, we like to say that training these models is the software version of rocket launches. Training at scale with thousands of GPUs was like going straight from playing with our NERF rockets to sending up a SpaceX Falcon.

One quick example of a challenge with scale: as you scale your number of GPUs, train start times rise because thousands of processes are reading from an NFS drive trying to load the weights, so you have to start using a distributed file system (which isn't cheap) or beam the weights over the network from a single process (which is also quite slow). If slow weight loading were a one-time thing, it wouldn't be too bad; however, as mentioned above, these GPUs fail a lot, so train restarts are a constant. When a node fails, you need to track it down, evict it from your pool, and grab a hot spare node from a pre-prepared reserve that has your latest model checkpoint cached locally and ready to be loaded as fast as possible. This, of course, is another pipeline you'll have to write and manage from scratch alongside your train.
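
The "beam the weights from a single process" option looks roughly like the sketch below: only rank 0 reads the checkpoint off disk, then broadcasts each tensor over NCCL so thousands of ranks don't hammer the same NFS mount. It assumes an already-initialized process group, a model already placed on GPU on every rank, and a plain state-dict checkpoint.

```python
# Rank-0 load + broadcast so only one process touches the filesystem.
import torch
import torch.distributed as dist


def load_and_broadcast(model: torch.nn.Module, ckpt_path: str):
    if dist.get_rank() == 0:
        state = torch.load(ckpt_path, map_location="cpu")   # only rank 0 reads from disk
        model.load_state_dict(state)
    # every rank participates; rank 0's weights overwrite everyone else's
    for tensor in list(model.parameters()) + list(model.buffers()):
        dist.broadcast(tensor.data, src=0)
```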

In addition, once we had all of the above working, we still encountered GPU process hangs. Just one process hanging among your thousands of synchronizing GPUs will take down an entire train. These hangs can be very hard to track down - the only early indication that a hang has occurred is a sudden drop in GPU power. We wrote our own Watchdog to detect hangs and identify their cause. Each rank would run a Watchdog on a separate thread that tracked the last line of code the main thread had reached. If the Watchdog didn't hear from the main thread within a certain threshold, this was a candidate for a hang and we'd alert an engineer of a potential train failure. The Watchdog would then consolidate the last known places the main thread hit across every rank and identify the culprit ranks. This process enabled us to react much faster, pinpoint issues, get a fix in or replace a node, and resume trains quickly.
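
A stripped-down, single-rank version of that idea looks like this: the main loop stamps a named heartbeat after each stage, and a side thread raises the alarm if no stamp arrives within a threshold. The thresholds and the alerting are placeholders; the real version also gathered the last-known location from every rank to find the culprit.

```python
# Single-rank heartbeat watchdog sketch.
import threading
import time

_last = {"where": "startup", "when": time.monotonic()}


def heartbeat(where: str):
    """Called by the main thread after each stage of the training step."""
    _last["where"], _last["when"] = where, time.monotonic()


def watchdog(threshold_s: float = 600.0, poll_s: float = 30.0):
    while True:
        time.sleep(poll_s)
        stale = time.monotonic() - _last["when"]
        if stale > threshold_s:
            # in production this would page an engineer and collect every rank's state
            print(f"possible hang: stuck after '{_last['where']}' for {stale:.0f}s")


threading.Thread(target=watchdog, daemon=True).start()
# training loop: heartbeat("data_loaded"); heartbeat("forward"); heartbeat("optimizer_step"); ...
```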

Further, the training data for Hotshot was so large that it needed to be streamed in, and we were maxing out the cluster's bandwidth just trying to get our data into the train. As video resolutions got larger and durations got longer, decoding high-res video while training became untenable. Since our model uses multiple frozen networks as inputs (a text encoder and a spatiotemporal latent encoder), we decided to precompute these embeddings for our dataset. This allowed us to skip the heavy steps of decoding video and crunching video latents and text embeddings during training, and instead parallelize this computation for upcoming datasets on cheaper GPUs while the main train was running on our largest cluster.
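
The precompute pass is conceptually simple: run the frozen encoders offline and save their outputs so the main train only reads embeddings. A sketch, where the encoder interfaces and tensor shapes are stand-ins rather than our actual models:

```python
# Offline embedding precompute with frozen encoders (encoder APIs are stand-ins).
import torch


@torch.no_grad()
def precompute_shard(clips, captions, video_encoder, text_encoder, out_path: str):
    video_encoder.eval()
    text_encoder.eval()
    records = []
    for clip, caption in zip(clips, captions):
        latents = video_encoder(clip.unsqueeze(0).cuda())   # e.g. (1, C, T', H', W')
        text_emb = text_encoder(caption)                    # e.g. (1, L, D)
        records.append({
            "latents": latents.squeeze(0).to(torch.bfloat16).cpu(),
            "text_emb": text_emb.squeeze(0).to(torch.bfloat16).cpu(),
            "caption": caption,
        })
    torch.save(records, out_path)
```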

However, the outputs of these frozen embedding networks are larger than the original data. Our dataset index ballooned so large in memory that it would crash trains when loaded by the worker processes. To solve this, we optimized and compressed our index by a factor of 7 to keep the train from going out of memory. To compress the data, we stored all the embeddings as bfloat16 and even quantized some tensor data. We then used Zstandard compression on the dataset shards and uploaded them to S3, choosing a bucket region as physically close to the cluster as we could. We kept the number of samples per shard balanced so shard sizes stayed consistent, which allowed data to stream steadily into the cluster under balanced loads.
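
Packaging a shard then boils down to: serialize the bfloat16 records, Zstandard-compress the bytes, and upload to S3. A sketch; the bucket/key names and compression level are illustrative.

```python
# Serialize, zstd-compress, and upload one dataset shard.
import io
import boto3
import torch
import zstandard as zstd


def upload_shard(records: list[dict], bucket: str, key: str, level: int = 10):
    buf = io.BytesIO()
    torch.save(records, buf)                                       # bfloat16 tensors from the precompute pass
    compressed = zstd.ZstdCompressor(level=level).compress(buf.getvalue())
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=compressed)


# e.g. upload_shard(records, "my-train-data", "shards/shard-000123.pt.zst")
```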

Your train will only go as fast as your slowest node - so heavy monitoring, logging, and health checks are required to keep faulty nodes from completely crippling your run.
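
One lightweight check in that spirit: gather per-rank step times and flag stragglers, since the whole train moves at the speed of the slowest rank. A minimal sketch with an illustrative threshold:

```python
# Gather per-rank step times and flag stragglers.
import time
import torch.distributed as dist


def check_stragglers(step_start: float, slack: float = 1.25):
    my_time = time.perf_counter() - step_start
    times = [None] * dist.get_world_size()
    dist.all_gather_object(times, my_time)
    if dist.get_rank() == 0:
        median = sorted(times)[len(times) // 2]
        slow = [(r, t) for r, t in enumerate(times) if t > slack * median]
        if slow:
            print(f"straggler ranks (>{slack}x median step time): {slow}")
```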

More so in ML than anything else we've done before, optimization is money (and sanity).

End to end, data engineering and training of this latest version of Hotshot took 4 months and many millions of H100 hours.

As a 4 person team, managing these challenges has been an exhilarating experience. The only thing we've experienced that compares is the challenge of serving millions of DAU at scale. We're very keen to find out what both feel like at once.

Try it today

It's been awesome to see how these models learn over the course of training, and we've had a ton of fun playing with the early Hotshot base model so far.

Please try things for yourself here – we can't wait to see what you'll imagine with it.

Join Us

We're product builders. We're creating tools to bring human imagination to life, instantly. If this resonates with you, we'd love to meet you. Apply here 💫