├── .gitignore
├── Makefile
├── README.md
├── __init__.py
├── app.py
├── assets
│   ├── flow-chart.png
│   ├── video_100_comments.png
│   └── video_500_comments.png
├── comments.py
├── gpt3
│   ├── app_gpt3.py
│   └── utils_gpt3.py
├── requirements.txt
├── sample_urls.txt
├── secrets.toml.example
└── utils.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
venv/

*.pyc
__pycache__/

instance/

.pytest_cache/
.coverage
htmlcov/

dist/
build/
*.egg-info/

.DS_Store

.env

.streamlit

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
requirements:
	pip install --upgrade pip &&\
	pip install -r requirements.txt

run:
	streamlit run app.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# youtube-comments-summary-project

Please find my Streamlit app hosted [here](https://iprinka-comments-summary-ml-project-app-xobtj4.streamlit.app/)
<br/>
Please note that if you face an error, the OpenAI API credits on my account might have expired :D

## Project Summary

The goal of this project is to provide a summary of the comment threads under any given YouTube video.

## Flow Diagram

![Flow Diagram](https://github.com/Priyanka-Gangadhar-Palshetkar/comments-summary-ml-project/blob/main/assets/flow-chart.png?raw=true)

## Approach

* In this project, I use the [YouTube API v3](https://developers.google.com/youtube/v3) to fetch the comment threads for any given YouTube video ID.
* I used Streamlit to build this application. It takes a YouTube video URL as input. Using the YouTube API, I fetch the comment threads under that video and pass the text to a tokenizer that breaks it into chunks.
* Each chunk is then fed to a summarization model. The main app (`app.py`, `utils.py`) uses Google Gemini through LangChain's map-reduce summarization chain; the original proof of concept (`gpt3/`) sends each chunk to OpenAI's Completions API with the prompt "Provide a summary of the comments below." Either way, the model returns a brief summary of how people are reacting to that video.
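At a glance, the pipeline is a chunk-and-summarize loop. The sketch below is illustrative only — `fetch_comments`, `text_to_chunks` and `summarize` stand in for the helpers defined in this repo (`summarize` is the Gemini chain in `utils.py` and the OpenAI call in `gpt3/utils_gpt3.py`):

```python
# Minimal sketch of the pipeline: fetch -> chunk -> summarize -> summarize the summaries.
comments = fetch_comments(video_url)          # one long string of comment text
chunks = text_to_chunks(comments, tokenizer)  # token-bounded chunks of that string
partials = [summarize(chunk) for chunk in chunks]
final_summary = summarize(" ".join(partials))
```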
## How to use

1. Create a virtual environment (`python3 -m venv venv && source venv/bin/activate`).

2. Install the dependencies by running `make requirements` in your terminal.

3. Create a project on the Google Cloud Console and enable YouTube API v3 for your project. Follow the instructions [here.](https://console.developers.google.com/apis/api/youtube.googleapis.com/overview)

4. Create a `.streamlit` folder at the root of the repository and add a `secrets.toml` file to it (see the sample snippet after this list).

5. Ensure your YouTube API settings are stored as `API_SERVICE_NAME`, `API_VERSION` and `YOUTUBE_API_KEY` in the `secrets.toml` file (check the `secrets.toml.example` file for reference). Please find the details on generating the API key [here.](https://developers.google.com/youtube/registering_an_application)

6. Ensure your OpenAI and Gemini API keys are stored as `OPENAI_API_KEY` and `GEMINI_API_KEY` in the `secrets.toml` file (see best practices around API key safety [here](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety)).

7. Run the [Streamlit](https://streamlit.io/) app with `make run`.

8. Open the app in your browser at `http://localhost:8501`.
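A filled-in `secrets.toml` mirrors `secrets.toml.example` (the values below are placeholders):

```toml
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"
YOUTUBE_API_KEY = "xxx"
OPENAI_API_KEY = "xxx"
GEMINI_API_KEY = "xxx"
```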
## Example

![App_Demo](https://github.com/Priyanka-Gangadhar-Palshetkar/comments-summary-ml-project/blob/main/assets/video_100_comments.png?raw=true)

--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/__init__.py

--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
import streamlit as st

from comments import fetch_comments
from utils import get_summary


st.title("Youtube Comments Summarizer")

st.write(
    "Use this tool to generate summaries from comments under any Youtube video. "
    "The tool uses Google Gemini paired with LangChain to generate the summaries."
)

left, right = st.columns(2)
form = left.form("template_form")

url_input = form.text_input(
    "Enter Youtube video url",
    placeholder="",
    value="",
)

submit = form.form_submit_button("Get Summary")

# Read the key up front so a missing secrets entry fails fast.
gemini_api_key = st.secrets["GEMINI_API_KEY"]

with st.container():
    if submit and url_input:
        with st.spinner("Fetching Summary..."):
            # INPUT: fetch the comment text via the YouTube Data API.
            text = fetch_comments(url_input)

            # MAIN: chunk the text and summarize it with Gemini.
            final_summary = get_summary(text)

            # OUTPUT: display the summary in the right-hand column.
            with right:
                right.write(final_summary)

--------------------------------------------------------------------------------
/assets/flow-chart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/assets/flow-chart.png

--------------------------------------------------------------------------------
/assets/video_100_comments.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/assets/video_100_comments.png

--------------------------------------------------------------------------------
/assets/video_500_comments.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iPrinka/comments-summary-ml-project/d5f93f486178ace407065bd32d2fded215ea1391/assets/video_500_comments.png

--------------------------------------------------------------------------------
/comments.py:
--------------------------------------------------------------------------------
import streamlit as st
from googleapiclient.discovery import build
from pytube import extract
from dotenv import load_dotenv

load_dotenv()

api_service_name = st.secrets["API_SERVICE_NAME"]
api_version = st.secrets["API_VERSION"]
youtube_api_key = st.secrets["YOUTUBE_API_KEY"]


def start_youtube_service():
    return build(api_service_name, api_version, developerKey=youtube_api_key)


def extract_video_id_from_link(url):
    return extract.video_id(url)


def get_comments_thread(youtube, video_id, next_page_token):
    # Fetch one page (up to 100 threads) of top-level comments with their replies.
    params = dict(
        part="snippet,replies",
        videoId=video_id,
        textFormat="plainText",
        maxResults=100,
    )
    if next_page_token:
        params["pageToken"] = next_page_token
    return youtube.commentThreads().list(**params).execute()


def load_comments_in_format(comments):
    # Flatten each thread (top-level comment plus replies) into a newline-separated string.
    all_comments_string = ""
    for thread in comments["items"]:
        content = thread["snippet"]["topLevelComment"]["snippet"]["textOriginal"]
        all_comments_string += content + "\n"
        if "replies" in thread:
            for reply in thread["replies"]["comments"]:
                all_comments_string += reply["snippet"]["textOriginal"] + "\n"
    return all_comments_string


def fetch_comments(url):
    youtube = start_youtube_service()
    video_id = extract_video_id_from_link(url)

    # Follow nextPageToken until every page of comment threads has been fetched.
    all_comments = ""
    next_page_token = ""
    while True:
        data = get_comments_thread(youtube, video_id, next_page_token)
        all_comments += load_comments_in_format(data)
        next_page_token = data.get("nextPageToken", "")
        if not next_page_token:
            break
    return all_comments
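For a quick sanity check of the URL parsing, pytube's extractor can be exercised on one of the sample URLs (an illustrative snippet, not a file in the repo):

```python
# pytube's extract.video_id pulls the video ID out of a standard watch URL.
from pytube import extract

print(extract.video_id("https://www.youtube.com/watch?v=_55G24aghPY"))  # _55G24aghPY
```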
--------------------------------------------------------------------------------
/gpt3/app_gpt3.py:
--------------------------------------------------------------------------------
import transformers
import streamlit as st

from comments import fetch_comments
from gpt3.utils_gpt3 import text_to_chunks, summarize_chunk


st.title("Youtube Comments Summarizer")

st.write(
    "Use this tool to generate summaries from comments under any Youtube video. "
    "The tool uses OpenAI's APIs to generate the summaries."
)
st.write(
    "The app is currently a POC. It extracts comments from the selected Youtube video, "
    "chunks the text, summarizes each chunk and then returns a summary of the summaries."
)

left, right = st.columns(2)
form = left.form("template_form")

url_input = form.text_input(
    "Enter Youtube video url",
    placeholder="",
    value="",
)

submit = form.form_submit_button("Get Summary")

with st.container():
    if submit and url_input:
        with st.spinner("Fetching Summary..."):
            text = fetch_comments(url_input)

            # Chunk the comment text so each piece fits in the model's context window.
            tokenizer = transformers.GPT2TokenizerFast.from_pretrained("gpt2")
            chunks = text_to_chunks(text, tokenizer)
            print("Chunks list size:", len(chunks))

            # Summarize each chunk, then summarize the concatenated chunk summaries.
            summaries = ""
            for chunk in chunks:
                summaries += summarize_chunk(chunk) + "\n"

            final_summary = summarize_chunk(summaries)

            with right:
                right.write(final_summary)

--------------------------------------------------------------------------------
/gpt3/utils_gpt3.py:
--------------------------------------------------------------------------------
import nltk
import openai
import streamlit as st
import transformers
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)
from dotenv import load_dotenv
from typing import List


load_dotenv()
nltk.download('punkt')

openai.api_key = st.secrets['OPENAI_API_KEY']


def text_to_chunks(input_text: str,
                   tokenizer: transformers.PreTrainedTokenizer,
                   max_token_sz: int = 1024,
                   overlapping_sentences: int = 10) -> List[str]:
    # Split the text into chunks of at most max_token_sz tokens, repeating a few
    # trailing sentences at the start of the next chunk to preserve context.
    sentences = nltk.sent_tokenize(input_text)
    chunks = []

    first_sentence = 0
    last_sentence = 0
    while last_sentence <= len(sentences) - 1:
        last_sentence = first_sentence
        chunk_parts = []
        chunk_size = 0
        for sentence in sentences[first_sentence:]:
            sentence_sz = len(tokenizer.encode(sentence))
            if chunk_size + sentence_sz > max_token_sz:
                break

            chunk_parts.append(sentence)
            chunk_size += sentence_sz
            last_sentence += 1

        chunks.append(" ".join(chunk_parts))
        # Step back for the overlap, but always advance by at least one sentence
        # so the loop cannot stall or walk backwards on oversized sentences.
        first_sentence = max(last_sentence - overlapping_sentences, first_sentence + 1)
    return chunks


@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(5))
def completion_with_backoff(**kwargs):
    # Retry transient API failures with exponential backoff, up to five attempts.
    return openai.Completion.create(**kwargs)


def summarize_chunk(chunk: str, max_tokens: int = 512, temperature: float = 0) -> str:
    response = completion_with_backoff(
        model="text-davinci-002",
        prompt=f"Provide a concise summary of the comments."
               f"\n###\nComments:{chunk}\n###\n-",
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    return response['choices'][0]['text'].strip()
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
nltk
streamlit
tenacity
openai
google-api-python-client
pytube
transformers
python-dotenv
google-auth-httplib2
google-auth-oauthlib
langchain
langchain-google-genai
langchain-text-splitters

--------------------------------------------------------------------------------
/sample_urls.txt:
--------------------------------------------------------------------------------
## Youtube video urls for testing:

1. 100+ comments
https://www.youtube.com/watch?v=_55G24aghPY

2. 500+ comments
https://www.youtube.com/watch?v=7sB052Pz0sQ

3. 1000+ comments
https://www.youtube.com/watch?v=M988_fsOSWo

--------------------------------------------------------------------------------
/secrets.toml.example:
--------------------------------------------------------------------------------
API_SERVICE_NAME = "youtube"
API_VERSION = "v3"
YOUTUBE_API_KEY = "xxx"
OPENAI_API_KEY = "xxx"
GEMINI_API_KEY = "xxx"

--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
import streamlit as st
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_text_splitters import TokenTextSplitter
from langchain.chains.summarize import load_summarize_chain
from dotenv import load_dotenv

load_dotenv()

gemini_api_key = st.secrets["GEMINI_API_KEY"]


def get_summary(text):
    # Tokenization: split the comment text into ~1000-token chunks with a small overlap.
    text_splitter = TokenTextSplitter(
        chunk_size=1000,
        chunk_overlap=10
    )
    chunks = text_splitter.create_documents([text])

    # Summarization: a map-reduce chain summarizes each chunk, then combines the results.
    llm = ChatGoogleGenerativeAI(
        model="gemini-pro",
        google_api_key=gemini_api_key
    )

    chain = load_summarize_chain(
        llm,
        chain_type="map_reduce"
    )

    # Invoke the chain on the chunked documents.
    response = chain.run(chunks)

    return response
--------------------------------------------------------------------------------
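For completeness, the Gemini path can also be driven outside Streamlit. This is a hypothetical snippet, not a file in the repo; it assumes both the YouTube and Gemini secrets from the setup steps are configured:

```python
# Hypothetical end-to-end run: fetch comments, then summarize them with Gemini.
from comments import fetch_comments
from utils import get_summary

text = fetch_comments("https://www.youtube.com/watch?v=7sB052Pz0sQ")  # 500+ comments, from sample_urls.txt
print(get_summary(text))
```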