Building a Video Generator from Scratch: A Comprehensive Guide

Welcome to the exciting world of video generation! In this in-depth blog post, we’ll embark on a journey to build our own video generator from the ground up. We’ll be using powerful AI tools: Stable Diffusion for image creation and Google’s Gemini vision model for object detection and description, all tied together with the Flask web framework.

This guide is perfect for beginners, and we’ll break down every step, explaining the code, the technologies involved, and the magic behind making it all work.

What You’ll Learn

  • Setting Up a Flask Project: We’ll create a basic Flask application to serve our video generator.
  • Integrating Google’s Gemini Model: Using the Gemini Pro Vision model, via LangChain, to detect objects in images and describe them.
  • Stable Diffusion for Image Generation: Using Stable Diffusion to create stunning images from textual prompts.
  • Text-to-Speech with gTTS: Converting text into audio using the gTTS library.
  • Video Editing with MoviePy: Combining images and audio into a cohesive video.
  • Designing User Interfaces with HTML: Creating user-friendly forms for text and image input.
  • Deployment: Making your video generator accessible online.

Let’s Get Started!

1. Project Setup: Your First Steps

1.1 Create a Virtual Environment

A virtual environment helps us manage project dependencies separately, ensuring that our video generator project doesn’t conflict with other Python projects.

  1. Open your terminal and create a new virtual environment:

    python3 -m venv my_video_env
    
    • python3 -m venv: This command uses the venv module to create a virtual environment.
    • my_video_env: This is the name of our virtual environment. You can choose a different name.
  2. Activate the virtual environment:

    source my_video_env/bin/activate
    
    • source: This command runs a script from the specified path.
    • my_video_env/bin/activate: This is the path to the activation script for our virtual environment.
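    • Windows note: on Windows the activation script lives in a Scripts folder instead, so run my_video_env\Scripts\activate from the command prompt.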

1.2 Install Required Libraries

Now, let’s install the necessary libraries for our project. We’ll use pip, the package installer for Python.

pip install flask "moviepy<2.0" gTTS langchain langchain_google_genai pillow requests python-dotenv
  • Flask: Web framework for building our application.
  • MoviePy: For video editing and combining images and audio.
  • gTTS: Text-to-speech conversion.
  • LangChain: A framework for composing applications around language models.
  • LangChain Google GenAI: The LangChain integration for Google’s Gemini models.
  • Pillow: Image processing and manipulation.
  • Requests: Making HTTP requests to the Stable Diffusion API.
  • python-dotenv: Loading API keys from a .env file (see the notes below).
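
Two notes on this command: the moviepy.editor module we import later was removed in MoviePy 2.0, which is why the install pins moviepy below 2.0, and python-dotenv is what loads the API keys from a .env file in the next section. If you prefer tracking dependencies in a requirements.txt, a minimal sketch looks like this:

flask
moviepy<2.0
gTTS
langchain
langchain_google_genai
pillow
requests
python-dotenv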

2. Securely Handling API Keys

2.1 Obtaining API Keys

We’ll need API keys from two services: Google AI (for the Gemini model) and Hugging Face (for Stable Diffusion). You can create a Gemini API key in Google AI Studio (https://aistudio.google.com/) and a Hugging Face access token under Settings → Access Tokens on https://huggingface.co/.

2.2 Setting Up Environment Variables

To store our API keys securely, we’ll use environment variables.

  1. Create a .env file in your project directory.

  2. Add the following lines to your .env file:

    GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
    HUG_API_KEY=YOUR_HUGGING_FACE_API_KEY
    
    • Replace YOUR_GOOGLE_API_KEY and YOUR_HUGGING_FACE_API_KEY with the actual API keys you obtained.
    • Add .env to your .gitignore so the keys never end up in version control.
  3. Make sure your environment variables are loaded using the python-dotenv package:

    from dotenv import load_dotenv
    load_dotenv() 
    

3. HTML Templates: Designing User Interfaces

Now, we’ll create the HTML templates that will form the user interface of our video generator.
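
Flask looks for templates in a templates/ folder and serves assets from static/ by default, so the rest of this guide assumes a project layout along these lines:

video_generator/
├── main.py
├── .env
├── templates/
│   ├── index.html
│   ├── text-to-video.html
│   ├── image-to-video.html
│   └── player.html
└── static/
    ├── css/
    │   └── style.css
    ├── uploads/          (uploaded source images)
    └── output/           (generated images, audio, and video)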

3.1 index.html (Main Landing Page)

<!DOCTYPE html>
<html>
<head>
  <title>Video Generator</title>
  <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
  <link rel="stylesheet" href="/static/css/style.css">
</head>
<body>
  <header class="container-fluid bg-white py-3">
    <div class="container d-flex justify-content-between align-items-center">
      <h1>Video Generator</h1>
      <nav class="d-flex">
        <a href="/" class="mr-3 text-dark">Get Started</a>
        <a href="#" class="text-dark">Help</a>
      </nav>
    </div>
  </header>
  <main class="container py-5">
    <h3 class="text-center mb-5">Which content would you like to repurpose into videos?</h3>
    <div class="row options">
      <div class="col-md-6 mb-4">
        <div class="option bg-white rounded shadow p-4">
          <div class="icon text-center text-primary mb-3">
            <i class="fas fa-file-alt fa-3x"></i> 
          </div>
          <h4 class="text-center">Text to Video</h4>
          <p class="text-muted text-center">Enter the content and let AI create engaging videos.</p>
          <a href="text-to-video"><button class="btn btn-primary btn-block" >Proceed</button></a>
          <ul class="recommendations list-unstyled mt-3 text-muted">
            <li><i class="fas fa-check-circle mr-2"></i> Educational videos</li>
            <li><i class="fas fa-check-circle mr-2"></i> Coaching videos</li>
            <li><i class="fas fa-check-circle mr-2"></i> Step-by-step guides</li>
          </ul>
        </div>
      </div>
      <div class="col-md-6 mb-4">
        <div class="option bg-white rounded shadow p-4">
          <div class="icon text-center text-primary mb-3">
            <i class="far fa-image fa-3x"></i>
          </div>
          <h4 class="text-center">Image to Video</h4>
          <p class="text-muted text-center">Turn your static images into dynamic video experiences.</p>
          <a href="image-to-video"><button class="btn btn-primary btn-block" >Proceed</button></a>
          <ul class="recommendations list-unstyled mt-3 text-muted">
            <li><i class="fas fa-check-circle mr-2"></i> Marketing and Promotion</li>
            <li><i class="fas fa-check-circle mr-2"></i> Social Proof</li>
            <li><i class="fas fa-check-circle mr-2"></i> Community Engagement</li>
          </ul>
        </div>
      </div>
    </div>
  </main>
  <footer class="container-fluid bg-white py-3 mt-5">
    <div class="container text-center">
      <p class="mb-0">© 2024 Video Generator. All rights reserved.</p>
      <nav class="nav justify-content-center">
        <a href="#" class="nav-link text-muted">Terms of Service</a>
        <a href="#" class="nav-link text-muted">Privacy Policy</a>
        <a href="#" class="nav-link text-muted">Contact Us</a>
      </nav>
    </div>
  </footer>
  <script src="https://kit.fontawesome.com/your-fontawesome-kit-id.js"></script> </body>
</html>
  • Bootstrap and Font Awesome: This template uses Bootstrap (https://getbootstrap.com/) for a clean, responsive layout; its CSS is pulled in from the CDN link in the <head>. The icons come from Font Awesome, loaded by the kit script at the bottom of the page (replace your-fontawesome-kit-id with your actual kit ID).
  • Navigation Links: The header includes links for “Get Started” and “Help.”
  • Content Options: This template presents two options for video generation: “Text to Video” and “Image to Video.”
  • Buttons: Buttons are used to navigate to the respective pages (text-to-video and image-to-video).

3.2 text-to-video.html (Text Input Form)

<!DOCTYPE html>
<html>
<head>
  <title>Video Generator</title>
  <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
  <link rel="stylesheet" href="/static/css/style.css">
</head>
<body>
  <header class="container-fluid bg-white py-3">
    <div class="container d-flex justify-content-between align-items-center">
      <h1>Video Generator</h1>
      <nav class="d-flex">
        <a href="/" class="mr-3 text-dark">Get Started</a>
        <a href="#" class="text-dark">Help</a>
      </nav>
    </div>
  </header>
  <main class="container py-5">
    <div class="row options">
      <div class="col-md-6 mx-auto mb-4">
        <div class="option bg-white rounded shadow p-4">
          <h3 class="text-center mb-3">Enter your text:</h3>
          <form action="/text-to-video" method="post"> 
            <div class="form-group">
              <textarea class="form-control" name="textInput" rows="5" placeholder="Describe the scene you want to create..." required></textarea>
            </div>
            <button type="submit" class="btn btn-primary btn-block">Generate Video</button>
          </form>
        </div>
      </div> 
    </div>
  </main>
  <footer class="container-fluid bg-white py-3 mt-5">
    <div class="container text-center">
      <p class="mb-0">© 2024 Video Generator. All rights reserved.</p>
      <nav class="nav justify-content-center"> 
        <a href="#" class="nav-link text-muted">Terms of Service</a>
        <a href="#" class="nav-link text-muted">Privacy Policy</a>
        <a href="#" class="nav-link text-muted">Contact Us</a>
      </nav>   
    </div>
  </footer>
</body>
</html>
  • Form Action: The form submits data to the /text-to-video route using the POST method.
  • Text Input: The user enters their text description into the textarea field (named textInput).

3.3 image-to-video.html (Image Upload Form)

<!DOCTYPE html>
<html>
<head>
  <title>Video Generator</title>
  <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
  <link rel="stylesheet" href="/static/css/style.css">
</head>
<body>
  <header class="container-fluid bg-white py-3">
    <div class="container d-flex justify-content-between align-items-center">
      <h1>Video Generator</h1>
      <nav class="d-flex">
        <a href="/" class="mr-3 text-dark">Get Started</a>
        <a href="#" class="text-dark">Help</a>
      </nav>
    </div>
  </header>
  <main class="container py-5">
    <div class="row options">
      <div class="col-md-6 mx-auto mb-4"> 
        <div class="option bg-white rounded shadow p-4">
          <div class="icon text-center text-primary mb-3">
            <i class="fas fa-file-alt fa-3x"></i> 
            <h3 class="text-center mb-5">Upload your image</h3>
            <form action="/image-to-video" method="post" enctype="multipart/form-data">
              <label for="file-input" class="btn btn-primary btn-block">Choose File</label>
              <input type="file" name="file" id="file-input">
              <input type="submit" class="btn btn-primary btn-block" value="Generate Video">
              </form>
            <ul class="recommendations list-unstyled mt-3 text-muted">
            </ul>
          </div>
        </div>
      </div>
    </div>
  </main>
  <footer class="container-fluid bg-white py-3 mt-5">
    <div class="container text-center">
      <p class="mb-0">© 2024 Video Generator. All rights reserved.</p>
      <nav class="nav justify-content-center">
        <a href="#" class="nav-link text-muted">Terms of Service</a>
        <a href="#" class="nav-link text-muted">Privacy Policy</a>
        <a href="#" class="nav-link text-muted">Contact Us</a>
      </nav>
    </div>
  </footer>
  <script src="https://kit.fontawesome.com/your-fontawesome-kit-id.js"></script>
</body>
</html>
  • Form Action: The form submits data to the /image-to-video route using the POST method. The enctype="multipart/form-data" is essential for handling file uploads.
  • File Input: The user selects an image file using the input type="file".

3.4 player.html (Video Player)

<!DOCTYPE html>
<html>
<head>
  <title>Video Generator - Video Player</title>
  <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
  <link rel="stylesheet" href="/static/css/style.css">
</head>
<body>
  <header class="container-fluid bg-white py-3">
    <div class="container d-flex justify-content-between align-items-center">
      <h1>Video Generator</h1>
      <nav class="d-flex">
        <a href="/" class="mr-3 text-dark">Get Started</a>
        <a href="#" class="text-dark">Help</a>
      </nav>
    </div>
  </header>
  <main class="container py-5">
    <div class="row">
      <div class="col-md-8 mx-auto">
        <h2>Generated Video</h2>
        <video width="100%" controls>
          <source src="/static/output/gen_video.mp4" type="video/mp4">
          Your browser does not support the video tag.
        </video>
        <!-- Download Button -->
        <a href="/static/output/gen_video.mp4" class="btn btn-primary mt-3" download>
          Download Video
        </a>
        <div>
          <h3>Detected Objects:</h3>
          <p>{{ object_detected }}</p>  
        </div>
        <div>
          <h3>Object Description:</h3>
          <p>{{ object_description }}</p>
        </div>
      </div>
    </div>
  </main>
  <footer class="container-fluid bg-white py-3 mt-5">
    <div class="container text-center">
      <p class="mb-0">© 2024 Video Generator. All rights reserved.</p>
      <nav class="nav justify-content-center">
        <a href="#" class="nav-link text-muted">Terms of Service</a>
        <a href="#" class="nav-link text-muted">Privacy Policy</a>
        <a href="#" class="nav-link text-muted">Contact Us</a>
      </nav>
    </div>
  </footer>
</body>
</html>
  • Video Player: The <video> tag displays the generated video.
  • Download Button: A button allows users to download the video.
  • Object Information: The detected objects and their descriptions are displayed here, using Jinja templating ({{ object_detected }} and {{ object_description }}).

4. Python Code: The Heart of Your Video Generator

Now, let’s delve into the core of our application, the main.py file.

from flask import Flask, render_template, request, redirect, url_for
import os
import requests
import io
from PIL import Image, UnidentifiedImageError
from gtts import gTTS
from moviepy.editor import ImageSequenceClip, AudioFileClip, concatenate_videoclips
from dotenv import load_dotenv
from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI

app = Flask(__name__)

# Load API keys
load_dotenv()

# Set the API keys
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')
hug_api_key = os.environ.get("HUG_API_KEY")

API_URL = "https://api-inference.huggingface.co/models/stabilityai/stable-diffusion-2-1"
headers = {"Authorization": f"Bearer {hug_api_key}"}

# Initialize model to get inference from the image
llm = ChatGoogleGenerativeAI(model="gemini-pro-vision", google_api_key=GOOGLE_API_KEY, max_output_tokens=100)

# Configure allowed upload extensions
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg'}

# Define paths and make sure the output directory exists
output_path = os.path.join('static', 'output')
os.makedirs(output_path, exist_ok=True)
image_save_path = os.path.join(output_path, 'gen_image1.jpg')
image1_save_path = os.path.join(output_path, 'gen_image2.jpg')
image2_save_path = os.path.join(output_path, 'gen_image3.jpg')
audio_save_path = os.path.join(output_path, 'gen_audio.mp3')
video_save_path = os.path.join(output_path, 'gen_video.mp4')


# Function to check allowed file extensions
def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


# Function to detect object from the image
def detect_object(img_path):
    message = HumanMessage(content=[
        {
            "type": "text",
            "text": "detect the object in the image, just return the detected objects name, for example if it's a laptop return laptop, don't make things up"
        },
        {
            "type": "image_url",
            "image_url": img_path
        },
    ])
    response = llm.invoke([message])
    return response.content


# Function to get inference from the image
def get_inference(img_path):
    message = HumanMessage(content=[
        {
            "type": "text",
            "text": "what's in this image, give a short detail on the image"
        },
        {
            "type": "image_url",
            "image_url": img_path
        },
    ])
    response = llm.invoke([message])
    return response.content


# Function to request a generated image from the Stable Diffusion API
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    if response.status_code != 200:
        print(f"Error: API request failed with status code {response.status_code}")
        print(f"Response: {response.text}")
    return response.content


def generate_image(prompt):  # parameter renamed from detect_object so it doesn't shadow the function above
    try:
        print(f"Generating images for: {prompt}")

        # First image
        image_bytes = query({"inputs": prompt})
        print(f"First image bytes length: {len(image_bytes)}")
        if len(image_bytes) < 100:
            print(f"Error: Received invalid image data for the first image.")
            return

        try:
            image = Image.open(io.BytesIO(image_bytes))
            image.save(image_save_path)
        except UnidentifiedImageError as e:
            print(f"Error: Cannot identify the first image file. Details: {e}")
            return

        # Second image
        image_bytes = query({"inputs": f'dramatic,{detect_object}'})
        print(f"Second image bytes length: {len(image_bytes)}")
        if len(image_bytes) < 100:
            print(f"Error: Received invalid image data for the second image.")
            return

        try:
            image1 = Image.open(io.BytesIO(image_bytes))
            image1.save(image1_save_path)
        except UnidentifiedImageError as e:
            print(f"Error: Cannot identify the second image file. Details: {e}")
            return

        # Third image
        image_bytes = query({"inputs": f'A photorealistic,{detect_object}'})
        print(f"Third image bytes length: {len(image_bytes)}")
        if len(image_bytes) < 100:
            print(f"Error: Received invalid image data for the third image.")
            return

        try:
            image2 = Image.open(io.BytesIO(image_bytes))
            image2.save(image2_save_path)
        except UnidentifiedImageError as e:
            print(f"Error: Cannot identify the third image file. Details: {e}")
            return

    except Exception as e:
        print(f"Unexpected error occurred: {e}")


# Function to convert text to speech
def generate_audio(object_description):
    tts = gTTS(object_description, lang='en')
    tts.save(audio_save_path)


# Function for generating video
def generate_video():
    # Check if images exist
    if not (os.path.exists(image_save_path) and os.path.exists(image1_save_path) and os.path.exists(image2_save_path)):
        print("Error: One or more image files are missing.")
        return None

    # Load audio
    audio = AudioFileClip(audio_save_path)
    # Split audio
    avg_duration = audio.duration / 3
    audio1 = audio.subclip(0, avg_duration)
    audio2 = audio.subclip(avg_duration, 2 * avg_duration)
    audio3 = audio.subclip(2 * avg_duration, audio.duration)
    # Load images
    image1 = ImageSequenceClip([image_save_path], durations=[audio1.duration])  # Set duration for each image
    image2 = ImageSequenceClip([image1_save_path], durations=[audio2.duration])
    image3 = ImageSequenceClip([image2_save_path], durations=[audio3.duration])
    # Combine images
    final_clip = concatenate_videoclips([image1, image2, image3], method="compose")
    # Set audio to video clips
    final_clip = final_clip.set_audio(audio)
    # Write the final video
    final_clip.write_videofile(video_save_path, fps=1)
    return video_save_path


@app.route('/')
def index():
    return render_template('index.html')


@app.route('/text-to-video', methods=['GET', 'POST'])
def text_to_video():
    if request.method == 'POST':
        # Get the text data from the request
        text_data = request.form['textInput']  # The textarea in text-to-video.html has name="textInput"

        # Process the text data
        object_description = text_data  # Use the text as the description
        generate_image(object_description)
        generate_audio(object_description)
        video_save_path = generate_video()
        return redirect(url_for('videoplayer', source='text'))

    return render_template('text-to-video.html')


@app.route('/image-to-video', methods=['GET', 'POST'])
def image_to_video():
    if request.method == 'POST':
        # Check if the post request has the file part
        if 'file' not in request.files:
            return "No file part"
        file = request.files['file']
        # If user does not select file, browser also
        # submit an empty part without filename
        if file.filename == '':
            return "No selected file"
        if file and allowed_file(file.filename):
            upload_dir = os.path.join(app.static_folder, 'uploads')
            os.makedirs(upload_dir, exist_ok=True)  # Create the uploads folder if it doesn't exist yet
            file_path = os.path.join(upload_dir, 'temp_image.jpg')
            file.save(file_path)
            # Process image with AI functions
            object_detected = detect_object(file_path)
            object_description = get_inference(file_path)

            # Save the detected objects to gen_keyword.txt
            keywords_file_path = os.path.join(app.static_folder, 'output', 'gen_keyword.txt')
            with open(keywords_file_path, 'w') as f:
                f.write(object_detected)

            # Save the description to gen_desc.txt
            description_file_path = os.path.join(app.static_folder, 'output', 'gen_desc.txt')
            with open(description_file_path, 'w') as f:
                f.write(object_description)

            generate_image(object_description)
            generate_audio(object_description)
            video_save_path = generate_video()
            return redirect(url_for('videoplayer', source='image'))

    return render_template('image-to-video.html')


@app.route('/videoplayer')
def videoplayer():
    # Read the detected keywords from gen_keyword.txt
    keywords_file_path = os.path.join(app.static_folder, 'output', 'gen_keyword.txt')
    try:
        with open(keywords_file_path, 'r') as f:
            object_detected = f.read()
    except FileNotFoundError:
        object_detected = "No objects detected"

    description_file_path = os.path.join(app.static_folder, 'output', 'gen_desc.txt')
    try:
        with open(description_file_path, 'r') as f:
            object_description = f.read()
    except FileNotFoundError:
        object_description = "No description available"

    return render_template('player.html', object_detected=object_detected, object_description=object_description)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)  # the development server's conventional port; port 80 would require elevated privileges

4.1 Initializing the Flask Application

from flask import Flask, render_template, request, redirect, url_for
import os

app = Flask(__name__)
  • Flask(__name__): Creates a Flask application instance. The __name__ variable is used to specify the current module, which Flask needs to determine the application’s root directory.

4.2 Loading API Keys

# Load API keys
load_dotenv()

# Set the API keys
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')
hug_api_key = os.environ.get("HUG_API_KEY")
  • load_dotenv(): Loads environment variables from the .env file we created earlier, making the API keys accessible.
  • os.environ.get('GOOGLE_API_KEY') and os.environ.get("HUG_API_KEY"): Retrieves the values of the environment variables GOOGLE_API_KEY and HUG_API_KEY, respectively.
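
Since every later step silently depends on these keys, it is worth failing fast when they are missing. A small optional check (not in the original code) that you could add right after these lookups:

if not GOOGLE_API_KEY or not hug_api_key:
    raise RuntimeError("Missing GOOGLE_API_KEY or HUG_API_KEY: check your .env file.")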

4.3 Setting Up Stable Diffusion

API_URL = "https://api-inference.huggingface.co/models/stabilityai/stable-diffusion-2-1"
headers = {"Authorization": f"Bearer {hug_api_key}"}
  • API_URL: The URL for the Hugging Face Stable Diffusion API.
  • headers: A dictionary containing the authorization header with the Hugging Face API token.

4.4 Initializing the Gemini Vision Model

# Initialize model to get inference from the image
llm = ChatGoogleGenerativeAI(model="gemini-pro-vision", google_api_key=GOOGLE_API_KEY, max_output_tokens=100)
  • ChatGoogleGenerativeAI: This class, from the langchain_google_genai package, gives us access to Google’s Gemini models through LangChain’s chat interface.
  • model="gemini-pro-vision": Specifies the Gemini Pro Vision model, which we use for object detection and image description.
  • google_api_key=GOOGLE_API_KEY: Provides the Google Cloud AI API key.
  • max_output_tokens=100: Sets the maximum number of tokens for the model’s response.

4.5 Configuring Allowed File Extensions

# Configure allowed upload extensions
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg'}
  • ALLOWED_EXTENSIONS: A set containing the allowed file extensions for image uploads.

4.6 Defining File Paths

# Define paths and make sure the output directory exists
output_path = os.path.join('static', 'output')
os.makedirs(output_path, exist_ok=True)
image_save_path = os.path.join(output_path, 'gen_image1.jpg')
image1_save_path = os.path.join(output_path, 'gen_image2.jpg')
image2_save_path = os.path.join(output_path, 'gen_image3.jpg')
audio_save_path = os.path.join(output_path, 'gen_audio.mp3')
video_save_path = os.path.join(output_path, 'gen_video.mp4')
  • output_path: The path to the directory where generated files (images, audio, and video) will be saved; os.makedirs(output_path, exist_ok=True) creates it at startup if it doesn’t already exist.
  • Other paths: Paths for individual files within the output directory.

4.7 allowed_file Function

# Function to check allowed file extensions
def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
  • allowed_file: This function checks if a filename has a valid extension (i.e., png, jpg, or jpeg).
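
For example, the lowercase comparison makes the check case-insensitive:

allowed_file('photo.PNG')       # True
allowed_file('notes.txt')       # False
allowed_file('archive.tar.gz')  # False, 'gz' is not an allowed extension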

4.8 detect_object Function

# Function to detect object from the image
def detect_object(img_path):
    message = HumanMessage(content=[
        {
            "type": "text",
            "text": "detect the object in the image, just return the detected objects name, for example if it's a laptop return laptop, don't make things up"
        },
        {
            "type": "image_url",
            "image_url": img_path
        },
    ])
    response = llm.invoke([message])
    return response.content
  • detect_object: This function uses the Google Gemini Pro vision model to detect objects in an image.
    • HumanMessage: Creates a LangChain message object with the prompt “detect the object in the image…” and the path to the image (the integration accepts a local file path or a URL for image_url).
    • llm.invoke: Invokes the Google Gemini Pro vision model with the message, sending the image and the prompt.
    • response.content: Returns the model’s response containing the detected objects.

4.9 get_inference Function

# Function to get inference from the image
def get_inference(img_path):
    message = HumanMessage(content=[
        {
            "type": "text",
            "text": "what's in this image, give a short detail on the image"
        },
        {
            "type": "image_url",
            "image_url": img_path
        },
    ])
    response = llm.invoke([message])
    return response.content
  • get_inference: This function obtains a textual description of an image using the Google Gemini Pro vision model.
    • HumanMessage: Creates a LangChain message object with the prompt “what’s in this image…” and the path to the image.
    • llm.invoke: Invokes the Google Gemini Pro vision model with the message, sending the image and the prompt.
    • response.content: Returns the model’s response containing a description of the image.

4.10 generate_image Function

def generate_image(prompt):  # parameter renamed from detect_object so it doesn't shadow the function above
    try:
        print(f"Generating images for: {prompt}")

        # First image
        image_bytes = query({"inputs": prompt})
        print(f"First image bytes length: {len(image_bytes)}")
        if len(image_bytes) < 100:
            print(f"Error: Received invalid image data for the first image.")
            return

        try:
            image = Image.open(io.BytesIO(image_bytes))
            image.save(image_save_path)
        except UnidentifiedImageError as e:
            print(f"Error: Cannot identify the first image file. Details: {e}")
            return

        # Second image
        image_bytes = query({"inputs": f'dramatic,{detect_object}'})
        print(f"Second image bytes length: {len(image_bytes)}")
        if len(image_bytes) < 100:
            print(f"Error: Received invalid image data for the second image.")
            return

        try:
            image1 = Image.open(io.BytesIO(image_bytes))
            image1.save(image1_save_path)
        except UnidentifiedImageError as e:
            print(f"Error: Cannot identify the second image file. Details: {e}")
            return

        # Third image
        image_bytes = query({"inputs": f'A photorealistic,{detect_object}'})
        print(f"Third image bytes length: {len(image_bytes)}")
        if len(image_bytes) < 100:
            print(f"Error: Received invalid image data for the third image.")
            return

        try:
            image2 = Image.open(io.BytesIO(image_bytes))
            image2.save(image2_save_path)
        except UnidentifiedImageError as e:
            print(f"Error: Cannot identify the third image file. Details: {e}")
            return

    except Exception as e:
        print(f"Unexpected error occurred: {e}")
  • generate_image: This function is responsible for generating three images using Stable Diffusion.
    • query: This function (which we’ll look at shortly) sends a request to the Stable Diffusion API with a prompt and receives image data.
    • Image.open(io.BytesIO(image_bytes)): Opens the image data received from the API.
    • image.save(image_save_path): Saves the generated image to a file.
    • try-except blocks: Error handling to catch potential issues with image loading.
    • Style Variations: This function generates three images with different styles: the first image is generated directly from the prompt, the second image adds a “dramatic” style, and the third image uses a “photorealistic” style.

4.11 query Function (Stable Diffusion API Call)

# Function to request a generated image from the Stable Diffusion API
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    if response.status_code != 200:
        print(f"Error: API request failed with status code {response.status_code}")
        print(f"Response: {response.text}")
    return response.content
  • query: This function makes the actual request to the Hugging Face Stable Diffusion API.
    • requests.post(API_URL, headers=headers, json=payload): Sends a POST request to the Stable Diffusion API.
      • API_URL: The API endpoint.
      • headers: The authorization header (containing the Hugging Face API token).
      • json=payload: The prompt for image generation is sent as JSON data.
    • response.status_code != 200: Checks if the API request was successful. If not, an error message is printed.
    • response.content: Returns the image data from the API response.
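
If image generation misbehaves, you can exercise this function outside Flask. A quick smoke test from a Python shell in the project directory (the prompt here is just illustrative):

from main import query  # assumes the application file is named main.py

image_bytes = query({"inputs": "a red bicycle leaning against a brick wall"})
with open("test_image.jpg", "wb") as f:
    f.write(image_bytes)
print(f"Received {len(image_bytes)} bytes")  # a valid image is typically tens of kilobytes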

4.12 generate_audio Function

# Function to convert text to speech
def generate_audio(object_description):
    tts = gTTS(object_description, lang='en')
    tts.save(audio_save_path)
  • generate_audio: This function converts text into speech using the gTTS library.
    • gTTS(object_description, lang='en'): Creates a gTTS object with the text to convert (object_description) and the language (en for English).
    • tts.save(audio_save_path): Saves the generated speech as an MP3 file.

4.13 generate_video Function

# Function for generating video
def generate_video():
    # Check if images exist
    if not (os.path.exists(image_save_path) and os.path.exists(image1_save_path) and os.path.exists(image2_save_path)):
        print("Error: One or more image files are missing.")
        return None

    # Load audio
    audio = AudioFileClip(audio_save_path)
    # Split audio
    avg_duration = audio.duration / 3
    audio1 = audio.subclip(0, avg_duration)
    audio2 = audio.subclip(avg_duration, 2 * avg_duration)
    audio3 = audio.subclip(2 * avg_duration, audio.duration)
    # Load images
    image1 = ImageSequenceClip([image_save_path], durations=[audio1.duration])  # Set duration for each image
    image2 = ImageSequenceClip([image1_save_path], durations=[audio2.duration])
    image3 = ImageSequenceClip([image2_save_path], durations=[audio3.duration])
    # Combine images
    final_clip = concatenate_videoclips([image1, image2, image3], method="compose")
    # Set audio to video clips
    final_clip = final_clip.set_audio(audio)
    # Write the final video
    final_clip.write_videofile(video_save_path, fps=1)
    return video_save_path
  • generate_video: This function combines the generated images and audio into a video using MoviePy.
    • File Check: Ensures that all three generated images exist before proceeding.
    • Audio Loading: Loads the generated audio file.
    • Audio Splitting: Divides the audio into three equal segments to match the three images.
    • Image Loading: Creates ImageSequenceClip objects for each image, setting the duration of each image to match the corresponding audio segment.
    • Video Concatenation: Combines the three image clips using concatenate_videoclips.
    • Setting Audio: Attaches the audio to the video clips.
    • Video Writing: Writes the final video to a file.

4.14 Flask Routes: Handling User Requests

4.14.1 index Route

@app.route('/')
def index():
    return render_template('index.html')
  • @app.route('/'): This decorator registers the index function as the handler for the root route (/).
  • return render_template('index.html'): Renders the index.html template, which is our main landing page.

4.14.2 text-to-video Route

@app.route('/text-to-video', methods=['GET', 'POST'])
def text_to_video():
    if request.method == 'POST':
        # Get the text data from the request
        text_data = request.form['textInput']  # The textarea in text-to-video.html has name="textInput"

        # Process the text data
        object_description = text_data  # Use the text as the description
        generate_image(object_description)
        generate_audio(object_description)
        video_save_path = generate_video()
        return redirect(url_for('videoplayer', source='text'))

    return render_template('text-to-video.html')
  • @app.route('/text-to-video', methods=['GET', 'POST']): Registers the text_to_video function as the handler for the /text-to-video route. It handles both GET and POST requests.
  • if request.method == 'POST': If the request is a POST request (i.e., the form has been submitted), process the data.
    • request.form['textInput']: Retrieves the text description submitted by the user from the form.
    • generate_image: Generates images based on the text description.
    • generate_audio: Creates audio from the text description.
    • generate_video: Combines the images and audio into a video.
    • redirect(url_for('videoplayer', source='text')): Redirects the user to the /videoplayer route, passing information about the source of the video (“text”).
  • return render_template('text-to-video.html'): Renders the text-to-video.html template if the request is a GET request.

4.14.3 image-to-video Route

@app.route('/image-to-video', methods=['GET', 'POST'])
def image_to_video():
    if request.method == 'POST':
        # Check if the post request has the file part
        if 'file' not in request.files:
            return "No file part"
        file = request.files['file']
        # If user does not select file, browser also
        # submit an empty part without filename
        if file.filename == '':
            return "No selected file"
        if file and allowed_file(file.filename):
            upload_dir = os.path.join(app.static_folder, 'uploads')
            os.makedirs(upload_dir, exist_ok=True)  # Create the uploads folder if it doesn't exist yet
            file_path = os.path.join(upload_dir, 'temp_image.jpg')
            file.save(file_path)
            # Process image with AI functions
            object_detected = detect_object(file_path)
            object_description = get_inference(file_path)

            # Save the detected objects to gen_keyword.txt
            keywords_file_path = os.path.join(app.static_folder, 'output', 'gen_keyword.txt')
            with open(keywords_file_path, 'w') as f:
                f.write(object_detected)

            # Save the description to gen_desc.txt
            description_file_path = os.path.join(app.static_folder, 'output', 'gen_desc.txt')
            with open(description_file_path, 'w') as f:
                f.write(object_description)

            generate_image(object_description)
            generate_audio(object_description)
            video_save_path = generate_video()
            return redirect(url_for('videoplayer', source='image'))

    return render_template('image-to-video.html')
  • @app.route('/image-to-video', methods=['GET', 'POST']): Registers the image_to_video function as the handler for the /image-to-video route.
  • if request.method == 'POST': If the request is a POST request (i.e., the form has been submitted with an image file), process the data.
    • if 'file' not in request.files:: Checks if the form includes an uploaded file.
    • file = request.files['file']: Retrieves the uploaded file from the form.
    • if file and allowed_file(file.filename): Checks if a file was uploaded and if it has a valid extension.
    • file.save(file_path): Saves the uploaded image to a temporary file.
    • object_detected = detect_object(file_path): Detects objects in the uploaded image using the detect_object function.
    • object_description = get_inference(file_path): Gets a description of the uploaded image using the get_inference function.
    • Saving Keywords and Description: Saves the detected objects and description to files (gen_keyword.txt and gen_desc.txt) to be used later.
    • generate_image: Generates images based on the text description.
    • generate_audio: Creates audio from the text description.
    • generate_video: Combines the images and audio into a video.
    • redirect(url_for('videoplayer', source='image')): Redirects the user to the /videoplayer route, passing information about the source of the video (“image”).
  • return render_template('image-to-video.html'): Renders the image-to-video.html template if the request is a GET request.

4.14.4 videoplayer Route

@app.route('/videoplayer')
def videoplayer():
    # Read the detected keywords from gen_keyword.txt
    keywords_file_path = os.path.join(app.static_folder, 'output', 'gen_keyword.txt')
    try:
        with open(keywords_file_path, 'r') as f:
            object_detected = f.read()
    except FileNotFoundError:
        object_detected = "No objects detected"

    description_file_path = os.path.join(app.static_folder, 'output', 'gen_desc.txt')
    try:
        with open(description_file_path, 'r') as f:
            object_description = f.read()
    except FileNotFoundError:
        object_description = "No description available"

    return render_template('player.html', object_detected=object_detected, object_description=object_description)
  • @app.route('/videoplayer'): Registers the videoplayer function as the handler for the /videoplayer route.
  • Reading Keywords and Description: Reads the detected objects and description from the saved files.
  • return render_template('player.html', object_detected=object_detected, object_description=object_description): Renders the player.html template, passing the detected objects and description to the template.

4.15 Running the Flask Application

  1. Open your terminal and navigate to the directory where your main.py file is located.
  2. Run the Flask app using either of these commands:

    flask --app main run

    or

    python main.py

    • Either command starts the Flask development server, which will be available at http://127.0.0.1:5000/.

5. Using Your Video Generator

  1. Open your web browser and visit http://127.0.0.1:5000/.
  2. You’ll see the main landing page (index.html) with the options for “Text to Video” and “Image to Video.”
  3. Text to Video:
    • Click “Proceed” under “Text to Video.”
    • Enter a text description in the form on the text-to-video.html page.
    • Click “Generate Video.”
  4. Image to Video:
    • Click “Proceed” under “Image to Video.”
    • Upload an image using the form on the image-to-video.html page.
    • Click “Generate Video.”
  5. The application will generate a video based on your input and redirect you to the /videoplayer route to view the generated video.

Conclusion

You’ve successfully built a video generator using Flask, Google’s Gemini vision model, Stable Diffusion, gTTS, and MoviePy. This application turns your words and images into dynamic, engaging videos.

What’s Next?

This is just the beginning of your video generation journey. Consider these enhancements:

  • Customization: Experiment with different Stable Diffusion prompts to explore diverse styles and aesthetics for your videos.
  • Advanced Features: Implement features like background music, transitions between scenes, text overlays, and more.
  • Deployment: Deploy your application to a web server (such as Heroku or AWS) so you can share your video generator with others; a minimal starting point follows below.
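
For that last point, a common first step is to swap the Flask development server for a production WSGI server such as Gunicorn. A minimal sketch, assuming your application file is named main.py and the Flask instance is called app, as in this guide:

pip install gunicorn
gunicorn --bind 0.0.0.0:8000 main:app

Keep in mind that the routes do heavy synchronous work (image generation, text-to-speech, video encoding), so for real traffic you would want to move that work into a background job queue rather than handling it inside the request.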

I hope this in-depth guide has equipped you with the knowledge and tools to start creating incredible videos. Happy generating!