Unlocking the Essence of Your Videos: Building an AI-Powered Video-to-Keynotes Generator
In today’s information-packed world, video content is everywhere. Whether it’s a webinar, a lecture, a podcast, or even a casual conversation, extracting the key points from these videos can be a daunting task. This is where AI comes to the rescue. In this comprehensive blog post, we’ll build an intelligent application that can automatically summarize your videos and generate concise keynotes. Get ready to learn how to harness the power of AI to make your video content more digestible and actionable.
Why Create a Video-to-Keynotes Generator?
- Save Time and Effort: Manually summarizing long videos can be time-consuming and tedious. Our AI-powered tool will do the heavy lifting for you.
- Improved Comprehension: Keynotes provide a clear and concise overview of the video’s main points, making it easier for viewers to grasp the essential information.
- Enhanced Video Engagement: By extracting keynotes, you can create summaries, highlights, or even transcripts to improve audience engagement.
- Unlocking Value from Archived Content: Turn your existing video archives into valuable resources by summarizing them for easy access and reference.
Project Setup: Setting the Stage
1. Create a Virtual Environment
First, let’s create a virtual environment to keep our project’s dependencies separate from other Python projects. Open your terminal and follow these steps:
python3 -m venv my_video_env
source my_video_env/bin/activate
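If you're on Windows, the activation command differs slightly; in Command Prompt it is:
my_video_env\Scripts\activate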
2. Install Required Libraries
Now, install the necessary Python libraries using pip:
pip install streamlit transformers moviepy gTTS langchain langchain_google_genai pillow requests toml
- Streamlit: A Python library for building interactive web applications. It makes creating data-driven web apps super easy, even for beginners.
- Transformers: A library for working with state-of-the-art deep learning models, including those used for text summarization and speech recognition.
- MoviePy: For processing video files and extracting audio.
- LangChain: A framework for building applications powered by large language models.
- LangChain Google GenAI: The LangChain integration package that lets you call Google's Gemini models.
- Toml: For handling configuration files in TOML format.
3. Obtain API Keys
We’ll need an API key for Google’s Gemini models:
- Sign up for Google Cloud Platform (https://cloud.google.com/) or use Google AI Studio.
- Enable the Generative Language (Gemini) API.
- Create an API key and keep it handy for the next step.
4. Store Your API Key Securely
To keep your API key out of the source code, we’ll store it in a small configuration file.
- Create a secrets.toml file: In your project directory, create a file named secrets.toml. This file will store your API key in a structured way.
- Add your API key: Put the key under a [google] section in secrets.toml:
[google]
api_key = "YOUR_GOOGLE_API_KEY"
- Replace YOUR_GOOGLE_API_KEY with the actual key you obtained.
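Before wiring up the rest of the app, it can help to confirm the key loads correctly. This short sanity-check snippet mirrors how the main script reads the file with toml.load; the masking in the print statement is just for illustration:
import toml

# Load secrets.toml from the project directory (the same call the app uses later)
config = toml.load("secrets.toml")
api_key = config["google"]["api_key"]

# Print only the first few characters so the full key never appears on screen
print("Loaded key starting with:", api_key[:6] + "...")
If your project lives in a Git repository, consider adding secrets.toml to .gitignore so the key is never committed.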
5. Streamlit UI: Your Application’s Front End
We’ll use Streamlit to create an intuitive user interface for our video-to-keynotes generator. Here’s the complete application code; the sections that follow walk through each part:
import streamlit as st
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers import pipeline as transformers_pipeline
from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from io import BytesIO
from moviepy.editor import VideoFileClip
import os
import toml
import tempfile

# --- Streamlit UI ---
st.title("Video to Keynotes Generator")

# --- Load Configuration ---
config = toml.load("secrets.toml")  # Load the TOML file

# --- File Upload ---
uploaded_file = st.file_uploader("Upload a video file", type=["mp4", "avi", "mov"])

if uploaded_file is not None:
    # --- Video to Audio ---
    video_bytes = uploaded_file.read()

    # Create a temporary file for the video
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        temp_file.write(video_bytes)
        temp_file_path = temp_file.name

    video_clip = VideoFileClip(temp_file_path)
    audio_stream = video_clip.audio
    audio_path = "temp_audio.wav"  # Temporary audio file
    audio_stream.write_audiofile(audio_path)

    # --- Audio Transcription ---
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    model_id = "openai/whisper-large-v3"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    )
    model.to(device)
    processor = AutoProcessor.from_pretrained(model_id)
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=438,
        chunk_length_s=30,
        batch_size=16,
        return_timestamps=True,
        torch_dtype=torch_dtype,
        device=device,
    )
    result = pipe(audio_path)
    transcript = result["text"]

    st.subheader("Transcript")
    st.markdown(f"<pre>{transcript}</pre>", unsafe_allow_html=True)  # Use markdown for better formatting

    # --- Summarization ---
    summarizer = transformers_pipeline("summarization", model="facebook/bart-large-cnn")
    chunk_size = 1000
    max_length = 100  # You can adjust max_length as needed
    min_length = 30

    def summarize_chunks(text):
        """Summarizes text by splitting it into chunks and summarizing each chunk."""
        chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
        summaries = []
        for chunk in chunks:
            summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)[0]['summary_text']
            summaries.append(summary)
        return summaries

    summaries = summarize_chunks(transcript)
    combined_summary = "\n\n".join(summaries)

    st.subheader("Summary")
    st.markdown(f"<pre>{combined_summary}</pre>", unsafe_allow_html=True)  # Use markdown for better formatting

    # --- Keynotes Generation (using Gemini) ---
    GOOGLE_API_KEY = config["google"]["api_key"]  # Access the API key from secrets.toml

    def generate_keynotes_from_chunk(transcript_chunk):
        """Generates keynotes from a single chunk of transcript text."""
        try:
            llm = ChatGoogleGenerativeAI(
                model="gemini-pro",
                google_api_key=GOOGLE_API_KEY,
                max_output_tokens=100)
        except ValueError as e:
            st.error(f"Error initializing Gemini model: {e}")
            return None
        message = HumanMessage(
            content=f"Generate keynotes from the following transcript chunk. Don't use any formatting like bold, italics, or underscores; keep it plain text. Only use the family-friendly parts of the transcript and skip anything inappropriate:\n\n{transcript_chunk}"
        )
        try:
            response = llm.invoke([message])
            keynotes = response.content
            return keynotes
        except Exception as e:
            st.error(f"Error generating keynotes: {e}")
            return None

    chunk_size = 250
    transcript_chunks = [transcript[i:i+chunk_size] for i in range(0, len(transcript), chunk_size)]
    all_keynotes = []
    for chunk in transcript_chunks:
        keynotes = generate_keynotes_from_chunk(chunk)
        if keynotes:
            all_keynotes.append(keynotes)

    # --- Display Keynotes in a Formatted Way ---
    st.subheader("Keynotes")
    for i, keynotes in enumerate(all_keynotes):
        st.write(keynotes)
        st.write("---")  # Add a separator between keynotes

    # --- Cleanup ---
    os.remove(audio_path)
    os.remove(temp_file_path)  # Delete the temporary file
- st.title("Video to Keynotes Generator"): Sets the title of the Streamlit app.
- st.file_uploader("Upload a video file", type=["mp4", "avi", "mov"]): Creates a file uploader component that accepts video files with the specified extensions.
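Because Streamlit reruns the whole script on every interaction, long-running steps like transcription benefit from a visible progress indicator, and very large uploads are worth rejecting up front. A minimal sketch of both ideas (the 500 MB cap is an arbitrary example, not a Streamlit default):
import streamlit as st

uploaded_file = st.file_uploader("Upload a video file", type=["mp4", "avi", "mov"])

if uploaded_file is not None:
    # uploaded_file.size is the file size in bytes; 500 MB here is an arbitrary cap
    if uploaded_file.size > 500 * 1024 * 1024:
        st.error("File is too large; please upload a smaller video.")
        st.stop()
    with st.spinner("Processing video..."):
        st.write("Transcription and summarization would run here.")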
6. Video Processing: From Video to Audio (using MoviePy)
The first step is to extract the audio from the uploaded video using MoviePy.
# --- Video to Audio ---
video_bytes = uploaded_file.read()

# Create a temporary file for the video
with tempfile.NamedTemporaryFile(delete=False) as temp_file:
    temp_file.write(video_bytes)
    temp_file_path = temp_file.name

video_clip = VideoFileClip(temp_file_path)
audio_stream = video_clip.audio
audio_path = "temp_audio.wav"  # Temporary audio file
audio_stream.write_audiofile(audio_path)
- video_bytes = uploaded_file.read(): Reads the video file contents into memory.
- tempfile.NamedTemporaryFile(delete=False): Creates a temporary file for storing the video data. The delete=False option prevents the file from being deleted automatically.
- video_clip = VideoFileClip(temp_file_path): Loads the video using MoviePy.
- audio_stream = video_clip.audio: Extracts the audio stream from the video clip.
- audio_stream.write_audiofile(audio_path): Saves the audio stream to a temporary WAV file (temp_audio.wav).
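One detail worth noting: MoviePy keeps an ffmpeg reader and file handle open on the clip, so it's good practice to close the clip once the audio has been written. Here's a small sketch of the same extraction step with explicit cleanup (the extract_audio helper is illustrative, not part of the original script):
from moviepy.editor import VideoFileClip

def extract_audio(video_path, audio_path="temp_audio.wav"):
    """Extract the audio track from a video file and release the clip afterwards."""
    clip = VideoFileClip(video_path)
    try:
        clip.audio.write_audiofile(audio_path)
    finally:
        clip.close()  # releases the underlying ffmpeg reader and file handle
    return audio_path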
7. Speech Recognition: Turning Audio into Text
Now, we’ll use the Whisper speech recognition model from Hugging Face to convert the audio into text.
# --- Audio Transcription ---
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=438,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)
result = pipe(audio_path)
transcript = result["text"]
st.subheader("Transcript")
st.markdown(f"<pre>{transcript}</pre>", unsafe_allow_html=True) # Use markdown for better formatting
- Device Selection: Determines whether the model should run on the GPU (if available) or the CPU.
- AutoModelForSpeechSeq2Seq.from_pretrained: Loads the Whisper model from Hugging Face.
- AutoProcessor.from_pretrained: Loads the processor associated with the Whisper model.
- pipeline("automatic-speech-recognition", ...): Creates a pipeline for automatic speech recognition.
- result = pipe(audio_path): Runs the speech recognition pipeline on the audio file.
- st.subheader("Transcript"): Displays a subheader for the transcript in the Streamlit UI.
- st.markdown(f"<pre>{transcript}</pre>", unsafe_allow_html=True): Displays the transcript in a formatted way using Markdown.
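Because Streamlit reruns the script from top to bottom on every interaction, the Whisper model would otherwise be reloaded for each upload. Wrapping the model setup in a cached function avoids that. This is an optional sketch using Streamlit's st.cache_resource decorator, not part of the original code:
import torch
import streamlit as st
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

@st.cache_resource  # load the model once and reuse it across reruns
def load_asr_pipeline(model_id="openai/whisper-large-v3"):
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id)
    return pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,
        torch_dtype=torch_dtype,
        device=device,
    )

pipe = load_asr_pipeline()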
8. Summarization: Extracting the Key Points
We’ll use the BART (Bidirectional and Auto-Regressive Transformers) model to create summaries of the transcript.
# --- Summarization ---
summarizer = transformers_pipeline("summarization", model="facebook/bart-large-cnn")
chunk_size = 1000
max_length = 100  # You can adjust max_length as needed
min_length = 30

def summarize_chunks(text):
    """Summarizes text by splitting it into chunks and summarizing each chunk."""
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    summaries = []
    for chunk in chunks:
        summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)[0]['summary_text']
        summaries.append(summary)
    return summaries
summaries = summarize_chunks(transcript)
combined_summary = "\n\n".join(summaries)
st.subheader("Summary")
st.markdown(f"<pre>{combined_summary}</pre>", unsafe_allow_html=True) # Use markdown for better formatting
- transformers_pipeline("summarization", model="facebook/bart-large-cnn"): Creates a pipeline for text summarization using the BART model.
- summarize_chunks(text): A function that splits the transcript into chunks and summarizes each chunk individually.
- summaries = summarize_chunks(transcript): Generates summaries for each chunk.
- combined_summary = "\n\n".join(summaries): Combines the individual summaries into a single, comprehensive summary.
- st.subheader("Summary") and st.markdown(...): Display the summary in the Streamlit UI using Markdown (see the chunking sketch below for one possible refinement).
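One caveat: slicing the transcript every 1000 characters can cut sentences in half, and bart-large-cnn can only attend to roughly 1024 tokens at a time. A slightly gentler approach, sketched below, groups whole sentences into chunks of about the same size (the split_into_chunks helper is an illustrative alternative, not part of the original code):
import re

def split_into_chunks(text, max_chars=1000):
    """Group whole sentences into chunks of roughly max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks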
9. Keynotes Generation: Using Gemini for Insightful Summaries
To generate keynotes from the transcript, we’ll use the Gemini Pro model from Google Cloud AI. This model is excellent for understanding and extracting key points from text.
# --- Keynotes Generation (using Gemini) ---
GOOGLE_API_KEY = config["google"]["api_key"]  # Access the API key from secrets.toml

def generate_keynotes_from_chunk(transcript_chunk):
    """Generates keynotes from a single chunk of transcript text."""
    try:
        llm = ChatGoogleGenerativeAI(
            model="gemini-pro",
            google_api_key=GOOGLE_API_KEY,
            max_output_tokens=100)
    except ValueError as e:
        st.error(f"Error initializing Gemini model: {e}")
        return None
    message = HumanMessage(
        content=f"Generate keynotes from the following transcript chunk. Don't use any formatting like bold, italics, or underscores; keep it plain text. Only use the family-friendly parts of the transcript and skip anything inappropriate:\n\n{transcript_chunk}"
    )
    try:
        response = llm.invoke([message])
        keynotes = response.content
        return keynotes
    except Exception as e:
        st.error(f"Error generating keynotes: {e}")
        return None

chunk_size = 250
transcript_chunks = [transcript[i:i+chunk_size] for i in range(0, len(transcript), chunk_size)]
all_keynotes = []
for chunk in transcript_chunks:
    keynotes = generate_keynotes_from_chunk(chunk)
    if keynotes:
        all_keynotes.append(keynotes)

# --- Display Keynotes in a Formatted Way ---
st.subheader("Keynotes")
for i, keynotes in enumerate(all_keynotes):
    st.write(keynotes)
    st.write("---")  # Add a separator between keynotes

# --- Cleanup ---
os.remove(audio_path)
os.remove(temp_file_path)  # Delete the temporary file
- GOOGLE_API_KEY = config["google"]["api_key"]: Retrieves the Google API key from the secrets.toml file.
- generate_keynotes_from_chunk(transcript_chunk): A function that generates keynotes from a single chunk of transcript text.
- ChatGoogleGenerativeAI(model="gemini-pro", ...): Initializes the Gemini Pro model.
- HumanMessage: Creates a LangChain message object with a prompt that asks Gemini Pro to generate keynotes from the transcript chunk, ensuring plain-text output and filtering out inappropriate content.
- response = llm.invoke([message]): Invokes the Gemini Pro model with the prompt.
- Chunk Processing: The transcript is divided into chunks, and keynotes are generated for each chunk.
- Displaying Keynotes: The generated keynotes are displayed in a formatted way in the Streamlit UI (a small merging sketch follows below).
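Because each 250-character chunk produces its own keynotes, consecutive chunks can repeat similar points. If you want a single consolidated list, one simple post-processing option (a sketch, not part of the original code) is to merge the per-chunk output and drop exact duplicate lines:
def merge_keynotes(keynote_blocks):
    """Flatten per-chunk keynotes into one list, skipping exact duplicate lines."""
    seen, merged = set(), []
    for block in keynote_blocks:
        for line in block.splitlines():
            line = line.strip()
            if line and line.lower() not in seen:
                seen.add(line.lower())
                merged.append(line)
    return "\n".join(merged)

# Example usage with the list built in the main script:
# st.subheader("Consolidated Keynotes")
# st.write(merge_keynotes(all_keynotes))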
10. Cleanup
# --- Cleanup ---
os.remove(audio_path)
os.remove(temp_file_path) # Delete the temporary file
- os.remove(audio_path) and os.remove(temp_file_path): Delete the temporary audio file and the temporary video file.
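If transcription or summarization raises an exception partway through, the script never reaches these os.remove calls and the temporary files are left behind. One defensive option, sketched below, is a small cleanup helper called from a try/finally block (the helper and the commented wiring are illustrative, not part of the original code):
import os

def cleanup(paths):
    """Remove temporary files, ignoring any that were never created."""
    for path in paths:
        if os.path.exists(path):
            os.remove(path)

# In the main script, the processing steps could be wrapped like this
# (audio_path and temp_file_path are the temp files created earlier):
# try:
#     ...transcription, summarization, and keynotes generation...
# finally:
#     cleanup([audio_path, temp_file_path])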
11. Running the Streamlit Application
- Save your code in a Python file (e.g., main.py).
- Open your terminal and navigate to the directory where your main.py file is located.
- Run the Streamlit app:
streamlit run main.py
This will open your Streamlit app in a new browser window. Now you can upload a video, and the app will generate a transcript, summary, and keynotes.
Conclusion
You’ve successfully created an AI-powered video-to-keynotes generator using Python and Streamlit. This application can help you quickly and effectively extract the most important information from your video content.
Next Steps:
- Customize the App: You can customize the UI, add more features, or integrate the keynotes generation into other workflows.
- Explore Different Models: Experiment with different language models for summarization and keynotes generation to see which ones work best for your needs.
- Deployment: Deploy your Streamlit app to a web server (like Heroku or Streamlit Cloud) to make it accessible to others.