Text and Image to Video Generation using Diffusion Models in Hugging Face

By usmanmalik57, 8 months ago

The AI wave has introduced a myriad of exciting applications. While text generation and natural language processing are leading the AI revolution, image and vision-based technologies are quickly catching up. Applications at the intersection of text and vision have surged recently.

In this article, you'll learn how to generate videos using text and image inputs. We'll leverage open-source models from Hugging Face to bring these applications to life. So, without further ado, let's dive in!

Installing and Importing Required Libraries

We will use the Hugging Face diffusion models to generate videos from text and images. The following script installs the libraries you will need to import these models from Hugging Face.

!pip install --upgrade transformers accelerate diffusers imageio-ffmpeg

For text-to-video generation, we will use the CogVideoX-2b diffusion model. For image-to-video generation, we will use Stability AI's img2vid model.

The following script imports the Hugging Face pipelines for the two models. We also import utility functions to save videos and load images.


import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image

Text to Video Generation with Hugging Face Diffusers

The first step is to create a Hugging Face pipeline that can access the CogVideoX-2b model. You can also use the larger CogVideoX-5b model, but it requires more disk space and GPU memory.
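As a rough back-of-the-envelope check on that difference, the memory needed for model weights scales with the parameter count times the bytes per parameter. The sketch below assumes the "2b" and "5b" suffixes mean roughly 2 and 5 billion parameters and that the weights are stored in 16-bit precision (2 bytes each). It ignores the text encoder, the VAE, and activation memory, so real usage will be higher. The helper name weight_memory_gb is made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed parameter counts, read off the model names (not measured values).
for name, params in [("CogVideoX-2b", 2e9), ("CogVideoX-5b", 5e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of 16-bit weights")
```

Under these assumptions the larger model needs about 2.5 times as much memory for its weights alone.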

The following script creates the pipeline for the CogVideoX-2b model. We also call enable_model_cpu_offload() and enable_sequential_cpu_offload() on the pipeline, and enable_slicing() and enable_tiling() on its VAE. These calls reduce GPU memory usage, usually at the cost of slower generation.


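# Load the CogVideoX-2b weights in half precision
text_video_pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

# Offload model components to the CPU when idle to reduce GPU memory use
text_video_pipe.enable_model_cpu_offload()
text_video_pipe.enable_sequential_cpu_offload()
# Decode the video in slices and tiles to reduce VAE memory use
text_video_pipe.vae.enable_slicing()
text_video_pipe.vae.enable_tiling()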
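The memory-saving calls above can be wrapped in a small helper that chooses one offload mode explicitly. This is only a sketch, not part of diffusers: apply_memory_savers and its sequential flag are hypothetical names, and the helper assumes nothing beyond the four methods used in the script above. Treating the two offload modes as alternatives is an assumption here, so check the diffusers memory optimization documentation for your installed version.

```python
def apply_memory_savers(pipe, sequential: bool = True) -> None:
    """Enable one CPU-offload mode plus VAE slicing and tiling on a pipeline.

    Hypothetical helper: `pipe` is any object exposing the methods called
    in the script above, such as the CogVideoXPipeline created there.
    """
    if sequential:
        # Assumed trade-off: lowest GPU memory use, but slower generation.
        pipe.enable_sequential_cpu_offload()
    else:
        # Assumed trade-off: faster, but keeps more on the GPU at once.
        pipe.enable_model_cpu_offload()
    # Decode the video in slices and tiles to limit VAE memory use.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
```

You would call apply_memory_savers(text_video_pipe) on a GPU with little memory, and pass sequential=False if you have memory to spare and want faster generation.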

Next, we define our text prompt and pass the prompt and other video configurations to the text_video_pipe pipeline that we created in the previous script. You can play around with the configuration settings to see how they affect the output.

The pipeline returns video frames you can export to video using the export_to_video() utility, as the following script shows.


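prompt = "A white dog running on a Caribbean beach."

video = text_video_pipe(
    prompt=prompt,
    num_videos_per_prompt=1,   # generate a single video
    num_inference_steps=50,    # denoising steps; more steps take longer
    num_frames=49,             # frames in the generated clip
    guidance_scale=6,          # how strongly the output follows the prompt
    generator=torch.Generator(device="cuda").manual_seed(42),  # fixed seed for reproducibility
).frames[0]

# Write the frames to an MP4 file at 8 frames per second
export_to_video(video, "text_to_video.mp4", fps=8)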

Output: [generated video, text_to_video.mp4]

Note: I intentionally reduced the output video dimensions.

The above output shows the video generated based on our input prompt. Amazing, isn't it?
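The length of the exported clip follows directly from the settings in the script above: the number of frames divided by the playback frame rate. Here is a minimal sketch of that arithmetic (the helper name clip_duration_seconds is made up for illustration):

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip with `num_frames` frames shown at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Values from the text-to-video script above: 49 frames exported at 8 fps.
print(clip_duration_seconds(49, 8))  # 6.125
```

So the clip runs for about six seconds. Raising fps in export_to_video() shortens playback without generating any more frames.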

But the magic doesn't end here. You can also pass an image as input to a diffusion model and get an animated video in response. This is what you will see in the next section.

Image to Video Generation with Hugging Face Diffusers

We will use Stability AI's img2vid model for image-to-video generation.

The script below creates the corresponding pipeline with the diffusers library and enables CPU offloading to reduce GPU memory usage.


image_video_pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16"
)
image_video_pipe.enable_model_cpu_offload()

We will generate a video using the following image as input. You can use any other image if you want.


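# Image source: https://www.brookhavenanimalhosp.com/site/blog/2022/11/15/15-cute-dog-breeds-that-stay-small-forever
image = load_image("/content/image-73073-800.jpg")  # local path; a URL also works
image = image.resize((1024, 576))  # matches the pipeline's default width and height
image  # display the image in a notebook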

Output: [the resized input image]
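One caveat if you use your own image: resize((1024, 576)) stretches any image that is not already 16:9. A possible workaround, sketched below on the assumption that a center crop suits your picture, is to compute a 16:9 crop box first and pass it to the PIL image's crop() method before resizing. center_crop_box is a hypothetical helper, not part of diffusers or Pillow.

```python
def center_crop_box(width: int, height: int,
                    target_w: int = 1024, target_h: int = 576):
    """Return (left, top, right, bottom) of the largest centered box
    with the same aspect ratio as target_w x target_h."""
    if width * target_h > height * target_w:
        # Image is too wide: keep the full height and trim the sides.
        new_w = round(height * target_w / target_h)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already matches): keep the full width.
    new_h = round(width * target_h / target_w)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


# Example: an 800x800 square image keeps its full width and loses
# 175 rows from the top and 175 rows from the bottom.
print(center_crop_box(800, 800))  # (0, 175, 800, 625)
```

With a PIL image, this could be used as image = image.crop(center_crop_box(*image.size)).resize((1024, 576)).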

To generate a video from the image, pass the image object and the total number of frames to generate to the image_video_pipe pipeline you created in the previous script.

frames = image_video_pipe(image, num_frames=28).frames[0]
export_to_video(frames, "image_to_video.mp4", fps=7)

Output: [generated video, image_to_video.mp4]

Conclusion

Video generation from text and image inputs is a fascinating application. In this article, you saw how to generate videos from both text prompts and images using open-source diffusion models from Hugging Face. I encourage you to play around with these models to generate your own stunning videos.
