Hotshot - ACT 1

Direct Text-to-Video Synthesis with Enhanced Motion Dynamics and Large-Scale Text-Video Pair Training

Hotshot Research

Text-to-Video Examples

pixar style, a raccoon eating dinner at a restaurant

big bird looks confused while buying milk at the grocery store

jay z rapping at a childrens birthday party

danny devito dressed as a 80s mexican wrestler, vintage, vhs footage.

brad pitt is wearing a purple tank top and holding a cupcake with a candle on it in his hand and blowing the candle.

batman has a red and gold iron man version of the bat suit, waving his hand, standing in the middle of gotham city on a rainy day, glowing.

a teddy bear is swimming

a man is sitting in a chair with his eyes closed, screaming at the top of his lungs and holding his face in his hands. The man is wearing a striped dress shirt and a tie, appearing to be in a professional setting.

a woman wearing a sun dress taking a selfie

Chewbacca is playing the guitar

kim kardashian holding up a sign that says "I LOVE YOU", standing in a field full of flowers

a man sitting on his couch and fist pumping and cheering while he watches the warriors game

jerome powell holding two glasses of whiskey and posing

steve jobs presenting apple car on stage

woman watching fireworks from inside her window

man looking shocked and his head explodes, his hands are on his head, massive explosion

a man drinking a pint at a pub and drunkenly pointing to his friends

woman with the word "POTATO" tattooed on her chest, smiling and happy and pumping her fists. The woman is celebrating and pumping her fists as confetti falls from the sky.

Teddy bears holding hands, walking down rainy 5th ave

will smith eating spaghetti.

Abstract

ACT 1 (Advanced Cinematic Transformer) is a state-of-the-art direct text-to-video synthesis system developed by Hotshot to empower the world to share their imagination through video.

ACT 1 produces high-definition videos at a variety of aspect ratios and without watermarks, creating an engaging user experience. Recently, latent diffusion models have enabled high-quality image synthesis, propelled in part by the abundance of public text-image pair data. Unfortunately, accessible video datasets of comparable fidelity and scale remain few and far between, and video generation has not seen the same advances. Furthermore, publicly available multimodal datasets heavily feature "conceptual captions," so models trained on them are unaware of many of the People, Places, Characters, and Things that the average interested user most wants to generate. We solve this problem by training a cascaded video captioner tailor-made to annotate videos, taking special care to note actions, interesting common-knowledge elements, and the everyday language one would use to describe each video.
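
The captioner itself is not detailed beyond the description above, but the cascaded pattern it implies, per-frame captions folded into a single video-level caption by a language model, can be sketched. In the sketch below, the BLIP checkpoint and the summarize helper are illustrative assumptions, not Hotshot's actual components.

```python
# A minimal sketch of a two-stage ("cascaded") video captioner, assuming an
# off-the-shelf image captioner (BLIP here) for the first stage and a generic
# LLM call for the second. These choices are illustrative, not Hotshot's stack.
import cv2
from PIL import Image
from transformers import pipeline

frame_captioner = pipeline("image-to-text",
                           model="Salesforce/blip-image-captioning-base")

def sample_frames(video_path: str, num_frames: int = 8) -> list:
    """Uniformly sample RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; convert for the PIL/transformers stage.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def summarize(prompt: str) -> str:
    # Hypothetical stand-in for the second stage: any instruction-tuned LLM
    # that can fold the frame captions into one video-level description.
    raise NotImplementedError("plug in an LLM of your choice here")

def caption_video(video_path: str) -> str:
    # Stage 1: caption individual frames with the image captioner.
    frames = sample_frames(video_path)
    frame_captions = [frame_captioner(Image.fromarray(f))[0]["generated_text"]
                      for f in frames]
    # Stage 2: prompt the summarizer to surface actions, recognizable people,
    # places, and characters, and everyday phrasing -- the properties the
    # abstract says the captioner takes special care to note.
    prompt = ("Describe this video in one sentence, naming any recognizable "
              "people, places, or characters and the action taking place:\n"
              + "\n".join(frame_captions))
    return summarize(prompt)
```

Captioning a handful of uniformly sampled frames and then summarizing them is what lets the second stage describe the action across time and name entities that any single frame caption would miss.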

We conduct a variety of architecture and dataset-size experiments and find that a large-scale, high-resolution text-video corpus is crucial for high-fidelity spatial alignment, temporal alignment, and aesthetic quality.

Comparisons with Other Methods

Hotshot - ACT 1 vs. Pika 1.0 vs. Runway ML

"donald trump taking a selfie in court"

"iron man punching through his bedroom door"

"Shrek leaning in for a kiss at starbucks"

"xi jinping reading a book about winnie the pooh"

"A fat rabbit wearing a purple robe walking through a fantasy landscape."

"A panda standing on a surfboard, in the ocean in sunset, 4k, high resolution"

"putin smiling and making the shape of a heart with his hands"

"In the swamp, a crocodile stealthily surfaces, revealing only its eyes and the tip of its nose as it moves forward"

"joe biden holding a sign that says "bye" and waving"