├── README.md
└── summarise_web_articles.py


/README.md:
--------------------------------------------------------------------------------
# web_scrapping_openai

## Overview

The Smart Article Summarizer is a Python application that leverages the OpenAI API and web scraping techniques to automatically generate concise summaries of web articles. It provides a convenient way to extract key insights and main points from lengthy articles, saving time and effort for users who need to quickly grasp the essence of online content.

## Features

### OpenAI Integration:
Utilizes the OpenAI API, specifically the GPT (Generative Pre-trained Transformer) model, to perform advanced natural language processing tasks such as text summarization.
### Web Scraping:
Implements web scraping techniques, facilitated by libraries like newspaper or Beautiful Soup, to extract the content of web articles from their URLs.
### Customizable Summarization Logic:
Offers flexibility in summarization strategies, supporting various approaches such as extractive summarization, abstractive summarization, keyphrase extraction, and query-based summarization.
### Error Handling:
Includes robust error handling mechanisms to address issues such as invalid URLs, rate limits, and forbidden access during web scraping.

## Prerequisites

Python 3.x installed on your system.
OpenAI API key obtained from the OpenAI website.
Basic familiarity with Python programming and web scraping concepts.

## Setup Instructions

1. Clone the repository to your local machine.
2. Install the required Python dependencies using `pip install -r requirements.txt`.
3. Set up your OpenAI API key as an environment variable or provide it directly in the code.

## Usage

Specify the URL of the article you want to summarize.
Run the Python script, providing the URL as input.
The Smart Article Summarizer will extract the content from the article, process it using the chosen summarization logic, and generate a summary.

## Contributing

Contributions to the Smart Article Summarizer project are welcome! If you encounter any bugs, have feature requests, or want to contribute improvements, please feel free to submit pull requests or open issues on GitHub.

Feel free to customize this template according to your project's specific requirements and features.

## Output Results

### Example 1

Output Generated:
Title of the Article: OpenAI Help Center

Generating Summary...

Summary of the Web Article:

Here is a table summarizing the key points relevant to climate change policy makers from the article:

| Key Points |
|------------------------------------------------------------------|
| Optimize prompts by making them shorter and removing extra words |
| Remove extra examples |
| Test revised prompt to ensure effectiveness |
| Shorter prompts result in reduced cost |
| Seek assistance if needed |

These points highlight the importance of optimizing prompts for climate change policy makers by making them concise, removing unnecessary elements, and testing their effectiveness. Shorter prompts can also lead to cost savings. Additionally, policy makers are encouraged to seek help if needed

Summary Generated.

### Example 2

Output Generated:

Title of the Article: Global warming and climate change effects: information and facts

Generating Summary...


Summary of the Web Article:

6 degrees Fahrenheit (0.9 degrees Celsius) since 1906. |
| Impacts of global warming are evident now | The effects of global warming, such as melting glaciers and sea ice, shifting precipitation patterns, and changes in animal migration, are observable in the present. |
| Climate change encompasses various shifts | Climate change refers to a range of complex shifts in weather and climate systems, including extreme weather events, changing wildlife populations and habitats, rising sea levels, and other impacts. |
| Human activities contribute to climate change | The addition of heat-trapping greenhouse gases to the atmosphere by humans has led to climate change. |
| Accelerated ice loss in Antarctica | Climate change has caused a faster rate of ice loss across the continent of Antarctica

Here is the expert summary of the article along with key points relevant to climate change policy makers:

| Key Points | Summary |
|---------------------------------------------------|-----------------------------------------------------------------|
| Ice loss in Antarctica is accelerating. | The article highlights that ice loss across Antarctica is |
| | increasing. This finding is significant for climate change |
| | policy makers as it indicates the urgency to address global |
| | warming and its impacts on polar ice sheets. |
| Sea levels are rising due to ice loss. | The article mentions that the ice loss in Antarctica is |
| | contributing to rising sea levels. This information is crucial |
| | for policy makers to understand the potential consequences of |
| | climate change and implement mitigation strategies accordingly. |
| Future effects of warming are anticipated. | The article discusses the potential future effects of warming |
| | in Antarctica, including ecosystem disruptions and increased |
| | meltwater flow. Policy makers need to be aware of these |
| | projections to make informed decisions regarding climate policy.|
| The authenticity of global warming is questioned. | The article states that one of the key debate around global |
| | warming is its authenticity, suggesting that climate change |
| | policy makers should consider scientific consensus while |
| | formulating policies. |

These key points are relevant to climate change policy makers as they provide insights into the accelerating ice loss in Antarctica, its contribution to rising sea levels, anticipated future effects of warming, and the ongoing debate surrounding the authenticity of global warming. Understanding these aspects is crucial for policymakers to develop effective strategies and policies to address climate change and its impacts


--------------------------------------------------------------------------------
/summarise_web_articles.py:
--------------------------------------------------------------------------------
# Import required packages
import backoff  # for exponential backoff
import openai
import os
import re
from newspaper import Article
from openai import OpenAI
from tabulate import tabulate


# Define your LLMWrapper class within this file, with the necessary methods.
class LLMWrapper:
    def __init__(self, api_key, config):
        # Store the API key and configuration.
        self.api_key = api_key
        self.model_name = config['model_name']
        self.max_tokens = config['max_tokens']

    # Function to handle OpenAI API calls with rate limit backoff
    @backoff.on_exception(backoff.expo, openai.RateLimitError)
    def prompt_completion(self, messages):
        """
        Prompt completion is a helper function that wraps the OpenAI API call to
        complete a prompt. It handles rate limiting and retries on errors.
        Args:
            messages: The chat messages (a list of role/content dicts) to send.
        Returns: The text of the model's reply after cleanup, or None if the request fails.
        """

        try:
            # Logic to generate a response to a prompt by interacting with the OpenAI API.
            client = OpenAI(api_key=self.api_key)
            response = client.chat.completions.create(
                model=self.model_name,
                messages=messages,
                max_tokens=self.max_tokens
            )

            # Extract the generated text from the first choice.
            full_text = response.choices[0].message.content

            # Split the generated text into sentences
            sentences = full_text.split(".")

            # Remove any incomplete sentences at the beginning and end of the generated text
            if len(sentences) > 1:
                first_sentence = sentences[0].strip()
                # Guard against an empty first fragment (text that starts with ".")
                if first_sentence and (not first_sentence[0].isupper() or not first_sentence[-1].isalnum()):
                    full_text = full_text[len(first_sentence):].strip()

                last_sentence = sentences[-1].strip()
                if not last_sentence or not last_sentence[-1].isalnum():  # Add check for empty string
                    full_text = full_text[:len(full_text) - len(last_sentence)].strip()

            # Additional cleanup
            full_text = full_text.strip(".”")
            return full_text

        except openai.RateLimitError:
            # Re-raise so the backoff decorator can retry; the generic handler
            # below would otherwise swallow the error and no retry would happen.
            raise

        except Exception as e:
            print(f"Exception in OpenAI completion task: {str(e)}")
            return None

    def extract_article_text(self, url):
        """
        Extracts the text from an article.
        Args:
            url: The URL of the article.
        Returns: A tuple of (list of text chunks, article title), or (None, None)
            if the article cannot be downloaded or parsed.
        """

        # Logic to extract the text from an article.
        try:
            article = Article(url)
            article.download()
            article.parse()
            article_text = article.text

            # Split the text into chunks. Note that max_tokens is used here as a
            # character count per chunk, not a token count.
            chunks = [article_text[i:i+self.max_tokens] for i in range(0, len(article_text), self.max_tokens)]
            print(f"Number of chunks: {len(chunks)}")
            return chunks, article.title

        except Exception as e:
            print(f"Forbidden URL Exception, Error downloading and parsing article from {url}: {str(e)}")
            return None, None


class SmartArticleSummarizer:
    def __init__(self, api_key, config):
        self.llm_wrapper = LLMWrapper(api_key, config)

    def load_url_content(self, url):
        """
        Loads the content of a URL.
        Args:
            url: The URL to load.
        Returns: A tuple of (article title, list of article text chunks).
        """
        # Method to fetch and process content from the given URL.
        article_chunks, article_title = self.llm_wrapper.extract_article_text(url)
        print("\nTitle of the Article:", article_title)

        return article_title, article_chunks

    def generate_bullet_points(self, article_chunks):
        """
        Helper function if required in future:
        Generates bullet points from the chunks of the article.
        Args:
            article_chunks: A list of chunks of the article.
        Returns: A string with one bullet point per line, or None on failure.
        """
        try:
            # Logic to generate bullet points from the article chunks.
            prompt = f"""Please provide 3-5 bullet points summarizing the main points and
            key takeaways of the following article, ensuring they are concise and
            informative:\n\n{article_chunks}"""

            messages = [
                {"role": "system", "content": prompt}
            ]

            # Call the LLM wrapper to generate a response to the prompt.
            raw_text = self.llm_wrapper.prompt_completion(messages)

            # prompt_completion returns None when the API call fails
            if raw_text is None:
                return None

            # Put each bullet point on its own line and return
            return raw_text.replace("•", "\n•").strip()

        except Exception as e:
            print(f"Exception in bullet point generation task: {str(e)}")
            return None

    def generate_summary(self, article_chunks, summerization_logic):
        """
        Generate a summary from the article chunks
        Args:
            article_chunks: A list of chunks of the article.
            summerization_logic: A dict with 'reasoning' and 'output_format' keys
                that are inserted into the prompt.
        Returns: A string containing the summary of the article.
        """
        try:
            # Logic to generate a summary from the article chunks.
            summaries = []
            for chunk in article_chunks:
                # Generate system and user prompts
                # print(f"\nSystem: Summarize the following chunk: {chunk}")
                system_prompt = f"""
                You are an advanced AI language model designed to assist
                users in extracting expert summarization from web content.
                Your goal is to distill complex information, identify key
                insights according to the {summerization_logic}, and
                generate a concise and informative summary of the content.
                """

                user_prompt = f"""
                Write an expert summary of the following article:
                {chunk}. Consider relevant {summerization_logic['reasoning']}
                and nuances in the content. Give the summarization results in the
                form of a {summerization_logic['output_format']}.
                """

                messages = [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ]

                # Call the LLM wrapper to generate a response to the prompt.
                summary = self.llm_wrapper.prompt_completion(messages)

                # Skip chunks whose completion failed; prompt_completion returns
                # None on error, and joining None below would raise a TypeError.
                if summary is not None:
                    summaries.append(summary)

            # Combine the summaries of all chunks
            combined_summary = " ".join(summaries)

            return combined_summary

        # Handle exceptions if the LLM wrapper encounters an error.
        except Exception as e:
            print(f"Error while generating summary: {e}")
            return None

    def format_output(self, summary, output_format):
        """
        Helper function if required in future:
        Format the output based on the output format.
        Args:
            summary: A string containing the summary of the article.
            output_format: A string indicating the output format.
        Returns: A string containing the formatted output.
        """

        try:
            # Only the table output format is currently supported.
            if output_format == "table":
                # Placeholder logic for converting key points to a table format
                # This might involve structuring the data in a tabular way
                # For simplicity, let's consider a simple table format

                # Split the summary paragraph into sentences using a simple regex
                sentences = re.split(r'(?