├── .gitignore
├── Makefile
├── README.md
├── __init__.py
├── app.py
├── assets
│   ├── flow-chart.png
│   ├── video_100_comments.png
│   └── video_500_comments.png
├── comments.py
├── gpt3
│   ├── app_gpt3.py
│   └── utils_gpt3.py
├── requirements.txt
├── sample_urls.txt
├── secrets.toml.example
└── utils.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
venv/

*.pyc
__pycache__/

instance/

.pytest_cache/
.coverage
htmlcov/

dist/
build/
*.egg-info/

.DS_Store

.env

.streamlit

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
requirements:
	pip install --upgrade pip &&\
	pip install -r requirements.txt

run:
	streamlit run app.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# youtube-comments-summary-project

Please find my Streamlit app hosted [here](https://iprinka-comments-summary-ml-project-app-xobtj4.streamlit.app/)
<br/>
Please note that if you face an error, the OpenAI API credits on my account might have expired :D

## Project Summary

The goal of this project is to provide a summary of the comment threads under any given YouTube video.

## Flow Diagram

![Flow Diagram](https://github.com/Priyanka-Gangadhar-Palshetkar/comments-summary-ml-project/blob/main/assets/flow-chart.png?raw=true)

## Approach

* In this project, I use the [YouTube API v3](https://developers.google.com/youtube/v3) to fetch the comment threads for any given YouTube video ID.
* I used Streamlit to build this application. It takes a YouTube video URL as input. Using the YouTube API, I fetch the comment threads under that video and pass the text to a tokenizer that breaks it into chunks.
* Each chunk is then fed to a summarization model. The main app (`app.py`, `utils.py`) uses Google Gemini through LangChain's map-reduce summarization chain; the original proof of concept (`gpt3/`) sends each chunk to OpenAI's Completions API with the prompt "Provide a summary of the comments below." Either way, the model returns a brief summary of how people are reacting to that video.
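At a glance, the pipeline is a chunk-and-summarize loop. The sketch below is illustrative only — `fetch_comments`, `text_to_chunks` and `summarize` stand in for the helpers defined in this repo (`summarize` is the Gemini chain in `utils.py` and the OpenAI call in `gpt3/utils_gpt3.py`):

```python
# Minimal sketch of the pipeline: fetch -> chunk -> summarize -> summarize the summaries.
comments = fetch_comments(video_url)          # one long string of comment text
chunks = text_to_chunks(comments, tokenizer)  # token-bounded chunks of that string
partials = [summarize(chunk) for chunk in chunks]
final_summary = summarize(" ".join(partials))
```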
## How to use

1. Create a virtual environment (`python3 -m venv venv && source venv/bin/activate`).

2. Install the dependencies by running `make requirements` in your terminal.

3. Create a project on the Google Cloud Console and enable YouTube API v3 for your project. Follow the instructions [here.](https://console.developers.google.com/apis/api/youtube.googleapis.com/overview)

4. Create a `.streamlit` folder at the root of the repository and add a `secrets.toml` file to it (see the sample snippet after this list).

5. Ensure your YouTube API settings are stored as `API_SERVICE_NAME`, `API_VERSION` and `YOUTUBE_API_KEY` in the `secrets.toml` file (check the `secrets.toml.example` file for reference). Please find the details on generating the API key [here.](https://developers.google.com/youtube/registering_an_application)

6. Ensure your OpenAI and Gemini API keys are stored as `OPENAI_API_KEY` and `GEMINI_API_KEY` in the `secrets.toml` file (see best practices around API key safety [here](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety)).

7. Run the [Streamlit](https://streamlit.io/) app with `make run`.

8. Open the app in your browser at `http://localhost:8501`.
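A filled-in `secrets.toml` mirrors `secrets.toml.example` (the values below are placeholders):

```toml
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"
YOUTUBE_API_KEY = "xxx"
OPENAI_API_KEY = "xxx"
GEMINI_API_KEY = "xxx"
```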
## Example

![App_Demo](https://github.com/Priyanka-Gangadhar-Palshetkar/comments-summary-ml-project/blob/main/assets/video_100_comments.png?raw=true)

--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/__init__.py

--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
import streamlit as st

from comments import fetch_comments
from utils import get_summary


st.title("Youtube Comments Summarizer")

st.write(
    "Use this tool to generate summaries from comments under any Youtube video. "
    "The tool uses Google Gemini paired with LangChain to generate the summaries."
)

left, right = st.columns(2)
form = left.form("template_form")

url_input = form.text_input(
    "Enter Youtube video url",
    placeholder="",
    value="",
)

submit = form.form_submit_button("Get Summary")

# Read the key up front so a missing secrets entry fails fast.
gemini_api_key = st.secrets["GEMINI_API_KEY"]

with st.container():
    if submit and url_input:
        with st.spinner("Fetching Summary..."):
            # INPUT: fetch the comment text via the YouTube Data API.
            text = fetch_comments(url_input)

            # MAIN: chunk the text and summarize it with Gemini.
            final_summary = get_summary(text)

            # OUTPUT: display the summary in the right-hand column.
            with right:
                right.write(final_summary)

--------------------------------------------------------------------------------
/assets/flow-chart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/assets/flow-chart.png

--------------------------------------------------------------------------------
/assets/video_100_comments.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/assets/video_100_comments.png

--------------------------------------------------------------------------------
/assets/video_500_comments.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/assets/video_500_comments.png

--------------------------------------------------------------------------------
/comments.py:
--------------------------------------------------------------------------------
import streamlit as st
from googleapiclient.discovery import build
from pytube import extract
from dotenv import load_dotenv

load_dotenv()

api_service_name = st.secrets["API_SERVICE_NAME"]
api_version = st.secrets["API_VERSION"]
youtube_api_key = st.secrets["YOUTUBE_API_KEY"]


def start_youtube_service():
    return build(api_service_name, api_version, developerKey=youtube_api_key)


def extract_video_id_from_link(url):
    return extract.video_id(url)


def get_comments_thread(youtube, video_id, next_page_token):
    # Fetch one page (up to 100 threads) of top-level comments with their replies.
    params = dict(
        part="snippet,replies",
        videoId=video_id,
        textFormat="plainText",
        maxResults=100,
    )
    if next_page_token:
        params["pageToken"] = next_page_token
    return youtube.commentThreads().list(**params).execute()


def load_comments_in_format(comments):
    # Flatten each thread (top-level comment plus replies) into a newline-separated string.
    all_comments_string = ""
    for thread in comments["items"]:
        content = thread["snippet"]["topLevelComment"]["snippet"]["textOriginal"]
        all_comments_string += content + "\n"
        if "replies" in thread:
            for reply in thread["replies"]["comments"]:
                all_comments_string += reply["snippet"]["textOriginal"] + "\n"
    return all_comments_string


def fetch_comments(url):
    youtube = start_youtube_service()
    video_id = extract_video_id_from_link(url)

    # Follow nextPageToken until every page of comment threads has been fetched.
    all_comments = ""
    next_page_token = ""
    while True:
        data = get_comments_thread(youtube, video_id, next_page_token)
        all_comments += load_comments_in_format(data)
        next_page_token = data.get("nextPageToken", "")
        if not next_page_token:
            break
    return all_comments
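For a quick sanity check of the URL parsing, pytube's extractor can be exercised on one of the sample URLs (an illustrative snippet, not a file in the repo):

```python
# pytube's extract.video_id pulls the video ID out of a standard watch URL.
from pytube import extract

print(extract.video_id("https://www.youtube.com/watch?v=_55G24aghPY"))  # _55G24aghPY
```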
--------------------------------------------------------------------------------
/gpt3/app_gpt3.py:
--------------------------------------------------------------------------------
import transformers
import streamlit as st

from comments import fetch_comments
from gpt3.utils_gpt3 import text_to_chunks, summarize_chunk


st.title("Youtube Comments Summarizer")

st.write(
    "Use this tool to generate summaries from comments under any Youtube video. "
    "The tool uses OpenAI's APIs to generate the summaries."
)
st.write(
    "The app is currently a POC. It extracts comments from the selected Youtube video, "
    "chunks the text, summarizes each chunk and then returns a summary of the summaries."
)

left, right = st.columns(2)
form = left.form("template_form")

url_input = form.text_input(
    "Enter Youtube video url",
    placeholder="",
    value="",
)

submit = form.form_submit_button("Get Summary")

with st.container():
    if submit and url_input:
        with st.spinner("Fetching Summary..."):
            text = fetch_comments(url_input)

            # Chunk the comment text so each piece fits in the model's context window.
            tokenizer = transformers.GPT2TokenizerFast.from_pretrained("gpt2")
            chunks = text_to_chunks(text, tokenizer)
            print("Chunks list size:", len(chunks))

            # Summarize each chunk, then summarize the concatenated chunk summaries.
            summaries = ""
            for chunk in chunks:
                summaries += summarize_chunk(chunk) + "\n"

            final_summary = summarize_chunk(summaries)

            with right:
                right.write(final_summary)

--------------------------------------------------------------------------------
/gpt3/utils_gpt3.py:
--------------------------------------------------------------------------------
import nltk
import openai
import streamlit as st
import transformers
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)
from dotenv import load_dotenv
from typing import List


load_dotenv()
nltk.download('punkt')

openai.api_key = st.secrets['OPENAI_API_KEY']


def text_to_chunks(input_text: str,
                   tokenizer: transformers.PreTrainedTokenizer,
                   max_token_sz: int = 1024,
                   overlapping_sentences: int = 10) -> List[str]:
    # Split the text into chunks of at most max_token_sz tokens, repeating a few
    # trailing sentences at the start of the next chunk to preserve context.
    sentences = nltk.sent_tokenize(input_text)
    chunks = []

    first_sentence = 0
    last_sentence = 0
    while last_sentence <= len(sentences) - 1:
        last_sentence = first_sentence
        chunk_parts = []
        chunk_size = 0
        for sentence in sentences[first_sentence:]:
            sentence_sz = len(tokenizer.encode(sentence))
            if chunk_size + sentence_sz > max_token_sz:
                break

            chunk_parts.append(sentence)
            chunk_size += sentence_sz
            last_sentence += 1

        chunks.append(" ".join(chunk_parts))
        # Step back for the overlap, but always advance by at least one sentence
        # so the loop cannot stall or walk backwards on oversized sentences.
        first_sentence = max(last_sentence - overlapping_sentences, first_sentence + 1)
    return chunks


@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(5))
def completion_with_backoff(**kwargs):
    # Retry transient API failures with exponential backoff, up to five attempts.
    return openai.Completion.create(**kwargs)


def summarize_chunk(chunk: str, max_tokens: int = 512, temperature: float = 0) -> str:
    response = completion_with_backoff(
        model="text-davinci-002",
        prompt=f"Provide a concise summary of the comments."
               f"\n###\nComments:{chunk}\n###\n-",
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    return response['choices'][0]['text'].strip()
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
nltk
streamlit
tenacity
openai
google-api-python-client
pytube
transformers
python-dotenv
google-auth-httplib2
google-auth-oauthlib
langchain
langchain-google-genai
langchain-text-splitters

--------------------------------------------------------------------------------
/sample_urls.txt:
--------------------------------------------------------------------------------
## Youtube video urls for testing:

1. 100+ comments
https://www.youtube.com/watch?v=_55G24aghPY

2. 500+ comments
https://www.youtube.com/watch?v=7sB052Pz0sQ

3. 1000+ comments
https://www.youtube.com/watch?v=M988_fsOSWo

--------------------------------------------------------------------------------
/secrets.toml.example:
--------------------------------------------------------------------------------
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"
YOUTUBE_API_KEY = "xxx"
OPENAI_API_KEY = "xxx"
GEMINI_API_KEY = "xxx"

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import streamlit as st
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_text_splitters import TokenTextSplitter
from langchain.chains.summarize import load_summarize_chain
from dotenv import load_dotenv

load_dotenv()

gemini_api_key = st.secrets["GEMINI_API_KEY"]


def get_summary(text):
    # Tokenization: split the comment text into ~1000-token chunks with a small overlap.
    text_splitter = TokenTextSplitter(
        chunk_size=1000,
        chunk_overlap=10
    )
    chunks = text_splitter.create_documents([text])

    # Summarization: a map-reduce chain summarizes each chunk, then combines the results.
    llm = ChatGoogleGenerativeAI(
        model="gemini-pro",
        google_api_key=gemini_api_key
    )

    chain = load_summarize_chain(
        llm,
        chain_type="map_reduce"
    )

    # Invoke the chain on the chunked documents.
    response = chain.run(chunks)

    return response
--------------------------------------------------------------------------------
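For completeness, the Gemini path can also be driven outside Streamlit. This is a hypothetical snippet, not a file in the repo; it assumes both the YouTube and Gemini secrets from the setup steps are configured:

```python
# Hypothetical end-to-end run: fetch comments, then summarize them with Gemini.
from comments import fetch_comments
from utils import get_summary

text = fetch_comments("https://www.youtube.com/watch?v=7sB052Pz0sQ")  # 500+ comments, from sample_urls.txt
print(get_summary(text))
```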