├── README.md
└── summarise_web_articles.py


/README.md:
--------------------------------------------------------------------------------
  1 | # web_scrapping_openai
  2 | 
  3 | ## Overview
  4 | 
  5 | The Smart Article Summarizer is a Python application that leverages the OpenAI API and web scraping techniques to automatically generate concise summaries of web articles. It provides a convenient way to extract key insights and main points from lengthy articles, saving time and effort for users who need to quickly grasp the essence of online content.
  6 | 
  7 | ## Features
  8 | 
  9 | ### OpenAI Integration:
 10 | Uses the OpenAI chat completions API (the script defaults to `gpt-3.5-turbo`) to summarize article text.
 11 | ### Web Scraping: 
 12 | Uses the `newspaper` library to download and parse the content of web articles from their URLs.
 13 | ### Customizable Summarization Logic: 
 14 | Steers the summary through a `summarisation_logic` dictionary whose `reasoning` and `output_format` entries are inserted into the prompt, for example to target a particular audience or to request a table.
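
The summarization behaviour is driven by the text placed in the prompt. The following is a minimal sketch, assuming a dictionary with `reasoning` and `output_format` keys (the shape used in `summarise_web_articles.py`). The helper name `build_messages` is hypothetical and the prompt wording is simplified, so it is not the script's exact prompt.

```python
def build_messages(chunk, logic):
    """Build chat messages for one article chunk (hypothetical helper)."""
    system_prompt = (
        "You are an assistant that writes concise, informative summaries "
        "of web content."
    )
    # The reasoning and output format are interpolated into the user prompt.
    user_prompt = (
        f"Summarize the following article text:\n{chunk}\n\n"
        f"{logic['reasoning']} "
        f"Give the result in the form of a {logic['output_format']}."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


logic = {
    "reasoning": "Extract key points relevant to climate change policy makers.",
    "output_format": "table",
}
messages = build_messages("Example article text.", logic)
print(messages[1]["content"])
```

Changing only the dictionary values is enough to request a different audience or output shape without touching the rest of the pipeline.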
 15 | ### Error Handling: 
 16 | Applies exponential backoff to OpenAI rate limit errors and reports download or parsing failures, such as invalid or forbidden URLs, instead of crashing.
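
Rate limits are usually handled by retrying with increasing delays. The sketch below shows the idea using only the standard library. `retry_with_backoff` and `FakeRateLimitError` are hypothetical names for illustration; the project itself relies on the `backoff` package rather than this helper.

```python
import random
import time


def retry_with_backoff(func, retry_on, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on `retry_on` exceptions with exponential delays."""
    for attempt in range(max_tries):
        try:
            return func()
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay doubles each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


class FakeRateLimitError(Exception):
    """Stand-in for an API rate limit error (hypothetical, for the demo)."""


calls = {"count": 0}


def flaky_call():
    # Fails twice, then succeeds, to simulate a temporary rate limit.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError("rate limited")
    return "ok"


result = retry_with_backoff(flaky_call, FakeRateLimitError, base_delay=0.01)
print(result, "after", calls["count"], "calls")
```

Only the listed exception type is retried, so unrelated errors still fail fast.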
 17 | 
 18 | ## Prerequisites
 19 | 
 20 | - Python 3.x installed on your system.
 21 | - An OpenAI API key obtained from the OpenAI website.
 22 | - Basic familiarity with Python programming and web scraping concepts.
 23 | 
 24 | ## Setup Instructions
 25 | 
 26 | 1. Clone the repository to your local machine.
 27 | 2. Install the Python dependencies imported by the script, for example with `pip install openai backoff newspaper3k tabulate`.
 28 | 3. Set your OpenAI API key in the `OPENAI_API_KEY` environment variable; the script reads it from there.
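
A small check at startup can make a missing key obvious before any API call is attempted. This is a sketch only: `load_api_key` is a hypothetical helper, while `OPENAI_API_KEY` is the variable name the script reads.

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key


# Demo with a stand-in mapping so the example does not depend on your shell.
print(load_api_key({"OPENAI_API_KEY": "example-key"}))
```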
 29 | 
 30 | ## Usage
 31 | 
 32 | 1. Set `url` in `main()` in `summarise_web_articles.py` to the article you want to summarize.
 33 | 2. Run the script: `python summarise_web_articles.py`.
 34 | 3. The Smart Article Summarizer extracts the article content, summarizes it according to the `summarisation_logic` settings, and prints the summary.
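
Before summarizing, the script splits the article text into fixed-size slices and summarizes each slice separately. The following is a minimal sketch of that chunking step. `chunk_text` is a hypothetical helper name, and the size is measured in characters rather than model tokens.

```python
def chunk_text(text, chunk_size):
    """Split text into consecutive slices of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = chunk_text("abcdefghij", 4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

Slicing by character count can cut a sentence in half at a chunk boundary, which is one reason the per-chunk summaries may begin or end abruptly.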
 35 | ## Contributing
 36 | 
 37 | Contributions to the Smart Article Summarizer project are welcome! If you encounter any bugs, have feature requests, or want to contribute improvements, please feel free to submit pull requests or open issues on GitHub.
 38 | 
 41 | ## Output Results 
 42 | 
 43 | ### Example 1 
 44 | 
 45 | Output Generated:
 46 | Title of the Article: OpenAI Help Center
 47 | 
 48 | Generating Summary...
 49 | 
 50 | Summary of the Web Article:
 51 | 
 52 |  Here is a table summarizing the key points relevant to climate change policy makers from the article:
 53 | 
 54 | | Key Points                                                       |
 55 | |------------------------------------------------------------------|
 56 | | Optimize prompts by making them shorter and removing extra words |
 57 | | Remove extra examples                                            |
 58 | | Test revised prompt to ensure effectiveness                      |
 59 | | Shorter prompts result in reduced cost                           |
 60 | | Seek assistance if needed                                        |
 61 | 
 62 | These points highlight the importance of optimizing prompts for climate change policy makers by making them concise, removing unnecessary elements, and testing their effectiveness. Shorter prompts can also lead to cost savings. Additionally, policy makers are encouraged to seek help if needed
 63 | 
 64 | Summary Generated.
 65 | 
 66 | ### Example 2 
 67 | 
 68 | Output Generated:
 69 | 
 70 | Title of the Article: Global warming and climate change effects: information and facts
 71 | 
 72 | Generating Summary...
 73 | 
 74 | 
 75 | Summary of the Web Article:
 76 | 
 77 |  6 degrees Fahrenheit (0.9 degrees Celsius) since 1906.                |
 78 | | Impacts of global warming are evident now         | The effects of global warming, such as melting glaciers and sea ice, shifting precipitation patterns, and changes in animal migration, are observable in the present.                                 |
 79 | | Climate change encompasses various shifts         | Climate change refers to a range of complex shifts in weather and climate systems, including extreme weather events, changing wildlife populations and habitats, rising sea levels, and other impacts. |
 80 | | Human activities contribute to climate change     | The addition of heat-trapping greenhouse gases to the atmosphere by humans has led to climate change.                                  |
 81 | | Accelerated ice loss in Antarctica                | Climate change has caused a faster rate of ice loss across the continent of Antarctica 
 82 | 
 83 | Here is the expert summary of the article along with key points relevant to climate change policy makers:
 84 |   
 85 | | Key Points                                        | Summary                                                         |
 86 | |---------------------------------------------------|-----------------------------------------------------------------|
 87 | | Ice loss in Antarctica is accelerating.           | The article highlights that ice loss across Antarctica is       |
 88 | |                                                   | increasing. This finding is significant for climate change      |
 89 | |                                                   | policy makers as it indicates the urgency to address global     |
 90 | |                                                   | warming and its impacts on polar ice sheets.                    |
 91 | | Sea levels are rising due to ice loss.            | The article mentions that the ice loss in Antarctica is         |
 92 | |                                                   | contributing to rising sea levels. This information is crucial  |
 93 | |                                                   | for policy makers to understand the potential consequences of   |
 94 | |                                                   | climate change and implement mitigation strategies accordingly. |
 95 | | Future effects of warming are anticipated.        | The article discusses the potential future effects of warming   |
 96 | |                                                   | in Antarctica, including ecosystem disruptions and increased    |
 97 | |                                                   | meltwater flow. Policy makers need to be aware of these         |
 98 | |                                                   | projections to make informed decisions regarding climate policy.|
 99 | | The authenticity of global warming is questioned. | The article states that one of the key debate around global     |
100 | |                                                   | warming is its authenticity, suggesting that climate change     |
101 | |                                                   | policy makers should consider scientific consensus while        |
102 | |                                                   | formulating policies.                                           |
103 | 
104 | These key points are relevant to climate change policy makers as they provide insights into the accelerating ice loss in Antarctica, its contribution to rising sea levels, anticipated future effects of warming, and the ongoing debate surrounding the authenticity of global warming. Understanding these aspects is crucial for policymakers to develop effective strategies and policies to address climate change and its impacts
105 | 
106 | 


--------------------------------------------------------------------------------
/summarise_web_articles.py:
--------------------------------------------------------------------------------
  1 | # Import required packages
  2 | import backoff # for exponential backoff
  3 | import openai
  4 | import os
  5 | import re
  6 | from newspaper import Article
  7 | from openai import OpenAI
  8 | from tabulate import tabulate
  9 | 
 10 | 
 11 | # Wrapper around the OpenAI chat API call and article text extraction.
 12 | class LLMWrapper:
 13 |     def __init__(self, api_key, config):
 14 |         # Store the API key and configuration.
 15 |         self.api_key = api_key
 16 |         self.model_name = config['model_name']
 17 |         self.max_tokens = config['max_tokens']
 18 |       
 19 |     # Function to handle OpenAI API calls with rate limit backoff
 20 |     @backoff.on_exception(backoff.expo, openai.RateLimitError)
 21 |     def prompt_completion(self, messages):
 22 |         """
 23 |         Wraps the OpenAI chat completions call for a list of messages. Rate limit
 24 |         errors are retried with exponential backoff; other errors return None.
 25 |         Args:
 26 |         messages: The list of chat messages to send.
 27 |         Returns: The cleaned response text, or None on failure.
 28 |         """
 29 | 
 30 |         try:
 31 |             # Send the messages to the OpenAI chat completions API.
 32 |             client = OpenAI(api_key=self.api_key)
 33 |             response = client.chat.completions.create(
 34 |                 model=self.model_name,
 35 |                 messages=messages,
 36 |                 max_tokens=self.max_tokens
 37 |             )
 38 | 
 39 |             # Extract the generated text from the response.
 40 |             full_text = response.choices[0].message.content
 41 | 
 42 |             # Split the generated text into sentences
 43 |             sentences = full_text.split(".")
 44 | 
 45 |             # Remove any incomplete sentences at the beginning and end of the generated text
 46 |             if len(sentences) > 1:
 47 |                 first_sentence = sentences[0].strip()
 48 |                 if first_sentence and (not first_sentence[0].isupper() or not first_sentence[-1].isalnum()):
 49 |                     full_text = full_text[len(first_sentence):].strip()
 50 | 
 51 |                 last_sentence = sentences[-1].strip()
 52 |                 if not last_sentence or not last_sentence[-1].isalnum():  # Add check for empty string
 53 |                     full_text = full_text[:len(full_text) - len(last_sentence)].strip()
 54 | 
 55 |             full_text = full_text.strip(".”")  # Additional cleanup
 56 |             return full_text
 57 |         except openai.RateLimitError:
 58 |             raise  # Re-raise so the backoff decorator can retry the call
 59 |         except Exception as e:
 60 |             print(f"Exception in OpenAI completion task: {str(e)}")
 61 |             return None
 62 | 
 63 |     def extract_article_text(self, url):
 64 |       """
 65 |       Extracts the text from an article.
 66 |       Args:
 67 |       url: The URL of the article.
 68 |       Returns: A tuple of (chunks, title), or (None, None) if extraction fails.
 69 |       """
 70 |       
 71 |       # Logic to extract the text from an article.
 72 |       try: 
 73 |           article = Article(url)
 74 |           article.download()
 75 |           article.parse()
 76 |           article_text = article.text
 77 |           
 78 |           # Split the text into chunks of max_tokens characters (a character count, not a token count)
 79 |           chunks = [article_text[i:i+self.max_tokens] for i in range(0, len(article_text), self.max_tokens)]
 80 |           print(f"Number of chunks: {len(chunks)}")
 81 |           return chunks, article.title 
 82 |         
 83 |       except Exception as e:
 84 |           print(f"Error downloading or parsing article from {url}: {str(e)}")
 85 |           return None, None
 86 |         
 87 |       
 88 | 
 89 | class SmartArticleSummarizer:
 90 |     def __init__(self, api_key, config):
 91 |         self.llm_wrapper = LLMWrapper(api_key, config)
 92 | 
 93 |     def load_url_content(self, url):
 94 |         """
 95 |         Loads the content of a URL.
 96 |         Args:
 97 |         url: The URL to load.
 98 |         Returns: The article title and the list of article chunks.
 99 |         """
100 |         # Method to fetch and process content from the given URL.
101 |         article_chunks, article_title = self.llm_wrapper.extract_article_text(url)
102 |         print("\nTitle of the Article:", article_title)
103 | 
104 |         return article_title, article_chunks
105 | 
106 |     def generate_bullet_points(self, article_chunks):
107 |         """
108 |         Helper function if required in future: 
109 |         Generates bullet points from the chunks of the article.
110 |         Args:
111 |         article_chunks: A list of chunks of the article.          
112 |         Returns: A string with each bullet point on its own line.
113 |         """
114 |         try:
115 |             # Logic to generate bullet points from the article chunks.
116 |             prompt = f"""Please provide 3-5 bullet points summarizing the main points and 
117 |                     key takeaways of the following article, ensuring they are concise and 
118 |                     informative:\n\n{article_chunks}"""
119 |             
120 |             messages = [
121 |               {"role": "system", "content": prompt}
122 |             ]
123 | 
124 |             # Call the LLM wrapper to generate a response to the prompt.
125 |             raw_text = self.llm_wrapper.prompt_completion(messages)
126 |             
127 |             # Put each bullet point on its own line and return
128 |             return raw_text.replace("•", "\n•").strip()
129 |           
130 |         except Exception as e:
131 |             print(f"Exception in bullet point generation task: {str(e)}")
132 |             return None
133 | 
134 |     def generate_summary(self, article_chunks, summerization_logic):
135 |         """
136 |         Generate a summary from the article chunks
137 |         Args:
138 |         article_chunks: A list of chunks of the article.
139 |         summerization_logic: A dict with 'reasoning' and 'output_format' keys used in the prompts.
140 |         Returns: A string containing the summary of the article.
141 |         """
142 |         try:
143 |             # Logic to generate a summary from the article chunks.
144 |             summaries = []
145 |             for chunk in article_chunks:
146 |               # Generate system and user prompts
147 |               # print(f"\nSystem: Summarize the following chunk: {chunk}")
148 |               system_prompt = f"""
149 |                               You are an advanced AI language model designed to assist 
150 |                               users in extracting expert summarization from web content.
151 |                               Your goal is to distill complex information, identify key
152 |                               insights according to the {summerization_logic}, and 
153 |                               generate concise and informative summary of the content.
154 |                               """
155 |               
156 |               user_prompt = f"""
157 |                             Write expert summarization of the following article:
158 |                             {chunk}. Consider relevant {summerization_logic['reasoning']} 
159 |                             and nuances in the content. Give summarization results in the 
160 |                             form of a {summerization_logic['output_format']}.
161 |                             """
162 |                 
163 |               messages = [
164 |                         {"role": "system", "content": system_prompt},
165 |                         {"role": "user", "content": user_prompt},
166 |               ]
167 |               
168 |               # Call the LLM wrapper to generate a response to the prompt.
169 |               summary = self.llm_wrapper.prompt_completion(messages)
170 |               summaries.append(summary)
171 | 
172 |             # Combine the summaries of all chunks, skipping any that failed (None)
173 |             combined_summary = " ".join(summary for summary in summaries if summary)
174 |        
175 |             return combined_summary
176 |               
177 |         # Handle exceptions if the LLM wrapper encounters an error.
178 |         except Exception as e:
179 |               print(f"Generating summary error occurred. {e}")
180 |               return None
181 | 
182 |     def format_output(self, summary, output_format):
183 |         """
184 |         Helper function if required in future: 
185 |         Format the output based on the output format.
186 |         Args:
187 |         summary: A string containing the summary of the article.
188 |         output_format: A string indicating the output format.
189 |         Returns: A string containing the formatted output.
190 |         """
191 | 
192 |         try:
193 |             # Method is limited to format output only in table format based on the current output format.
194 |             if output_format == "table":
195 |                 # Placeholder logic for converting key points to a table format
196 |                 # This might involve structuring the data in a tabular way
197 |                 # For simplicity, let's consider a simple table format
198 |                 
199 |                 # Split the summary paragraph into sentences using a simple regex
200 |                 sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', summary)
201 |                 # Remove bullets from the sentences
202 |                 sentences = [re.sub(r'[^a-zA-Z0-9 \n.]', "", sentence) for sentence in sentences]
203 |                 
204 |                 
205 |                 # Format the table rows
206 |                 table_data = [(index + 1, sentence) for index, sentence in enumerate(sentences)]
207 |     
208 |                 # Format the table header
209 |                 table_header = ['Index', 'Summarized Key Points from Web Article']
210 |     
211 |                 formatted_output = tabulate(table_data, headers=table_header)
212 |             else:
213 |                 formatted_output = summary
214 |               
215 |             print(formatted_output)
216 |             return formatted_output
217 |           
218 |         except Exception as e:
219 |             print(f"Formatting output error occurred. {e}")
220 |             return None
221 |       
222 |     def summarise(self, url, summarisation_logic):
223 |         """
224 |         Summarise the given URL using the given summarisation logic. It uses the 
225 |         load_url_content and LLMWrapper's methods to produce a summary.
226 |         Args:        
227 |         url (str): URL of the article to be summarised.
228 |         summarisation_logic (dict): Summarisation logic with 'reasoning' and 'output_format' keys.
229 |         Returns: summary (str): Combined summary of the article, or None on failure. 
230 |         """
231 |         try:
232 |             # Fetch and process the content from the URL.
233 |             article_title, article_chunks = self.load_url_content(url)
234 |           
235 |             # Make separate API calls for each part of the article
236 |             print("\nGenerating Summary...\n")
237 |             summary = self.generate_summary(article_chunks, summarisation_logic)
238 |           
239 |   
240 |             return summary
241 |           
242 |         except Exception as e:
243 |             print(f"Exception in summarization process: {str(e)}")
244 |             return None
245 | 
246 |   
247 | 
248 | # Main function to demonstrate usage.
249 | # Instantiate the SmartArticleSummarizer class.
250 | def main():
251 |   
252 |     # Retrieving the OpenAI API key from environment variables.
253 |     openai_api_key = os.environ.get("OPENAI_API_KEY")
254 |     
255 |     # Configuration for the LLMWrapper, such as the model name and token limits.
256 |     openai_config = {"model_name": "gpt-3.5-turbo", "max_tokens": 2000}
257 | 
258 |     # Instantiate the summarizer with the API key and configuration.
259 |     summarizer = SmartArticleSummarizer(openai_api_key, openai_config)
260 | 
261 |     # The URL of the article to summarise.
262 |     # url = "https://help.openai.com/en/articles/6891753-rate-limit-advice"
263 |     # url = "https://www.projectpro.io/article/python-libraries-for-web-scraping/625"
264 |     url = "https://www.nationalgeographic.com/environment/article/global-warming-effects"
265 | 
266 |     # The logic for summarisation, 
267 |     # May include parameters like reasoning or output format
268 |     # Or it could be blank.
269 |     summarisation_logic = {
270 |         "reasoning": "Extract key points relevant to climate change policy makers.",
271 |         "output_format": "table"
272 |     }
273 |   
274 |     # Generate the summary.
275 |     summary = summarizer.summarise(url, summarisation_logic)
276 | 
277 |     # Print the summary.
278 |     print("\nSummary of the Web Article:\n\n", summary)
279 |     print("\nSummary Generated.\n")
280 | 
281 | if __name__ == "__main__":
282 |     # Call the main function.
283 |     main()
284 | 
285 | # End of file.
286 | 


--------------------------------------------------------------------------------