├── requirements.txt ├── sentiment_analyzer.py ├── LICENSE ├── .gitignore ├── config.py ├── news_fetcher.py ├── topic_modeler.py ├── main.py └── README.md /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DanielPaulDsouza/ibkr-news-analyzer/HEAD/requirements.txt -------------------------------------------------------------------------------- /sentiment_analyzer.py: -------------------------------------------------------------------------------- 1 | from textblob import TextBlob 2 | 3 | def analyze_sentiment(text: str) -> tuple[str, float]: 4 | """ 5 | Analyzes the sentiment of a given text. 6 | 7 | Args: 8 | text: The text (headline or article) to analyze. 9 | 10 | Returns: 11 | A tuple containing the sentiment label ('Positive', 'Negative', 'Neutral') 12 | and the polarity score (from -1.0 to 1.0). 13 | """ 14 | if not text: 15 | return 'Neutral', 0.0 16 | 17 | # Create a TextBlob object 18 | analysis = TextBlob(text) 19 | 20 | # Get the polarity score 21 | polarity = analysis.sentiment.polarity 22 | 23 | # Classify the sentiment based on the polarity score 24 | if polarity > 0.1: 25 | sentiment = 'Positive' 26 | elif polarity < -0.1: 27 | sentiment = 'Negative' 28 | else: 29 | sentiment = 'Neutral' 30 | 31 | return sentiment, polarity -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2025 Daniel Paul Dsouza 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .nox/ 42 | .coverage 43 | .coverage.* 44 | .cache 45 | nosetests.xml 46 | coverage.xml 47 | *.cover 48 | .hypothesis/ 49 | .pytest_cache/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | local_settings.py 58 | db.sqlite3 59 | 60 | # Flask stuff: 61 | instance/ 62 | .webassets-cache 63 | 64 | # Scrapy stuff: 65 | .scrapy 66 | 67 | # Sphinx documentation 68 | docs/_build/ 69 | 70 | # PyBuilder 71 | target/ 72 | 73 | # Jupyter Notebook 74 | .ipynb_checkpoints 75 | 76 | # IPython 77 | profile_default/ 78 | ipython_config.py 79 | 80 | # pyenv 81 | .python-version 82 | 83 | # celery 84 | celerybeat-schedule 85 | celerybeat.pid 86 | 87 | # SageMath parsed files 88 | *.sage.py 89 | 90 | # Environments 91 | .env 92 | .venv 93 | env/ 94 | venv/ 95 | ENV/ 96 | env.bak/ 97 | venv.bak/ 98 | 99 | # Spyder project settings 100 | .spyderproject 101 | .spyderworkspace 102 | 103 | # VSCode settings 104 | .vscode/ 105 | 106 | # Our project specific ignores 107 | reports/ 108 | *.csv -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | # ============================================================================= 4 | # USER CONFIGURATION 5 | # ============================================================================= 6 | 7 | # --- Connection Settings --- 8 | # Make sure TWS or IB Gateway is running and API connections are enabled. 9 | IB_HOST = '127.0.0.1' 10 | IB_PORT = 7497 # 7497 for TWS Paper, 7496 for TWS Live, 4002 for IB Gateway Paper, 4001 for IB Gateway Live 11 | CLIENT_ID = 123 # Use a unique client ID for each running script 12 | 13 | # --- Contract To Analyze --- 14 | # The stock/ETF symbol you want to fetch news for. 15 | # In the CONTRACT_SYMBOLS you can add ONLY ONE or MULTIPLE contracts. 16 | CONTRACT_SYMBOLS = ['SPY', 'QQQ', 'DIA'] # Add 1 or many symbols, you want. 17 | CONTRACT_TYPE = 'STK' # 'STK' for stock/ETF, 'FUT' for future, etc. 18 | EXCHANGE = 'SMART' 19 | CURRENCY = 'USD' 20 | 21 | # --- News Search Parameters --- 22 | # If you leave this list empty, then the Matches_Keywords column in the CSV 23 | # will return false. 24 | KEYWORDS_TO_SEARCH = [ 25 | 'earnings', 'fed', 'inflation', 'rate cut', 'geopolitical', 'supply chain', 26 | 'buyback', 'guidance', 'downgrade', 'upgrade' 27 | ] 28 | 29 | # --- Topic Modeling Settings --- 30 | # This defines how many distinct topics the LDA model will try to discover 31 | # in the collection of news articles. There is no single "correct" number. 32 | # - A small number (e.g., 5) will result in very broad, high-level topics. 33 | # - A large number (e.g., 20-30) will result in more specific, granular topics. 34 | NUM_TOPICS = 20 35 | 36 | # --- Time Frame for News Search --- 37 | # This script uses naive local datetimes 38 | 39 | # --- CHOOSE YOUR END DATE --- 40 | # This will get the date and time right now in your local timezone. 41 | # We subtract timedelta(days=1) to look at news ending yesterday. 42 | # You can change the number of days to go back further or keep it 43 | # at 0 to have the end date at current time. 44 | END_DATE = datetime.datetime.now() - datetime.timedelta(days=0) 45 | 46 | # Set the START_DATE relative to your chosen END_DATE. 
47 | # Change this number to adjust to the period you want. 48 | START_DATE = END_DATE - datetime.timedelta(days=100) 49 | 50 | # --- Output File --- 51 | # The script will create a filename based on the contract, start date and end date, 52 | # and start time and end time. 53 | OUTPUT_DIRECTORY = 'reports' # A subfolder to keep reports organized 54 | 55 | 56 | #NOTES 57 | 58 | """ 59 | You can search between specific hours and specific minutes too and aren't limited 60 | to searching between days. See example below for the code modification - 61 | 62 | END_DATE = datetime.datetime(2025, 7, 5, 16, 0, 0) # July 5, 4:00 PM 63 | # Search for the previous 2 hours 64 | START_DATE = END_DATE - datetime.timedelta(hours=2) 65 | 66 | """ -------------------------------------------------------------------------------- /news_fetcher.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | from ib_insync import IB, Contract 3 | 4 | 5 | def fetch_historical_news(ib: IB, contract_details: dict, start_date: datetime, end_date: datetime) -> list: 6 | """ 7 | Makes a single request to fetch the most recent batch of historical news 8 | within a given date range. 9 | 10 | NOTE: The IBKR API limits this request to approximately 300 of the most 11 | recent articles within the specified timeframe. But I have kept a limit of 12 | 100000 13 | 14 | Args: 15 | ib: An active and connected ib_insync IB instance. 16 | contract_details: A dictionary with contract info (symbol, type, etc.). 17 | start_date: The start date for the news search window. 18 | end_date: The end date for the news search window. 19 | 20 | Returns: 21 | A list of news headline objects found. 22 | """ 23 | # 1. Get available news providers for the account 24 | print("Fetching available news providers...") 25 | providers = ib.reqNewsProviders() 26 | if not providers: 27 | print("Error: No news providers found for this account.") 28 | return [] 29 | provider_codes = '+'.join([p.code for p in providers]) 30 | print(f"Found providers: {[p.code for p in providers]}") 31 | 32 | # 2. Qualify the contract to get its conId 33 | contract = Contract( 34 | symbol=contract_details['symbol'], 35 | secType=contract_details['secType'], 36 | exchange=contract_details['exchange'], 37 | currency=contract_details['currency'] 38 | ) 39 | ib.qualifyContracts(contract) 40 | if not contract.conId: 41 | print(f"Error: Could not resolve contract for {contract_details['symbol']}.") 42 | return [] 43 | print(f"Successfully qualified contract for {contract_details['symbol']} (conId: {contract.conId})") 44 | 45 | # 3. Make a single, direct request for historical news 46 | start_str = start_date.strftime('%Y%m%d %H:%M:%S') 47 | end_str = end_date.strftime('%Y%m%d %H:%M:%S') 48 | print(f"\nRequesting news from {start_str} to {end_str}...") 49 | print("(Note: API is limited to the ~300 most recent articles in this range)") 50 | 51 | try: 52 | news_headlines = ib.reqHistoricalNews( 53 | conId=contract.conId, 54 | providerCodes=provider_codes, 55 | startDateTime=start_str, 56 | endDateTime=end_str, 57 | totalResults=100000 58 | ) 59 | except Exception as e: 60 | print(f" -> API Error fetching headlines: {e}") 61 | return [] 62 | 63 | if not news_headlines: 64 | news_headlines = [] 65 | 66 | print(f"\nTotal headlines received from API: {len(news_headlines)}") 67 | return news_headlines 68 | 69 | 70 | def get_full_article(ib: IB, headline) -> str: 71 | """ 72 | Fetches the full text of a single news article, with error handling. 
73 | 74 | Args: 75 | ib: An active and connected ib_insync IB instance. 76 | headline: The news headline object. 77 | 78 | Returns: 79 | The full text of the article as a string. 80 | """ 81 | try: 82 | news_article_body = ib.reqNewsArticle( 83 | providerCode=headline.providerCode, 84 | articleId=headline.articleId 85 | ) 86 | return news_article_body.articleText if news_article_body else "" 87 | except Exception as e: 88 | # This will catch the "Not allowed" error and handle it gracefully 89 | # print(f"\nWarning: Could not fetch article {headline.articleId}. Reason: {e}") 90 | return "Full article text not available (subscription may be required)." -------------------------------------------------------------------------------- /topic_modeler.py: -------------------------------------------------------------------------------- 1 | # topic_modeler.py 2 | 3 | import re 4 | import nltk 5 | from collections import Counter 6 | from sklearn.feature_extraction.text import TfidfVectorizer 7 | from sklearn.decomposition import LatentDirichletAllocation 8 | 9 | def download_nltk_packages(): 10 | """Checks if necessary NLTK packages are downloaded and gets them if not.""" 11 | required_packages = ['stopwords', 'punkt', 'wordnet', 'omw-1.4'] 12 | for package in required_packages: 13 | try: 14 | if package in ['punkt', 'wordnet', 'omw-1.4']: 15 | nltk.data.find(f'tokenizers/{package}' if package == 'punkt' else f'corpora/{package}') 16 | else: 17 | nltk.data.find(f'corpora/{package}') 18 | except LookupError: 19 | print(f"Downloading NLTK package: {package}...") 20 | nltk.download(package) 21 | 22 | def lemmatize_and_tokenize(text: str, lemmatizer): 23 | """Tokenizes and lemmatizes a string of text, returning a list of tokens.""" 24 | tokens = nltk.word_tokenize(text.lower()) 25 | return [lemmatizer.lemmatize(token) for token in tokens] 26 | 27 | def find_common_phrases(texts: list, n: int = 5, top_k: int = 10) -> list: 28 | """Finds the most common n-grams (phrases) to identify boilerplate.""" 29 | print(f"Identifying top {top_k} common {n}-word phrases to treat as stopwords...") 30 | all_ngrams = [] 31 | for text in texts: 32 | # Simple tokenization for n-gram finding 33 | tokens = re.findall(r'\b[a-zA-Z]{3,}\b', text.lower()) 34 | if len(tokens) >= n: 35 | for i in range(len(tokens) - n + 1): 36 | all_ngrams.append(tuple(tokens[i:i+n])) 37 | 38 | most_common = [ngram for ngram, count in Counter(all_ngrams).most_common(top_k)] 39 | 40 | # Add individual words from the common phrases to the stopword list 41 | boilerplate_stopwords = set() 42 | for phrase in most_common: 43 | print(f" -> Identified common phrase: {' '.join(phrase)}") 44 | for word in phrase: 45 | boilerplate_stopwords.add(word) 46 | 47 | return list(boilerplate_stopwords) 48 | 49 | def perform_topic_modeling(texts: list, num_topics: int) -> tuple[list, list]: 50 | """ 51 | Performs LDA Topic Modeling using advanced pre-processing, including 52 | lemmatization, bigrams, and automated boilerplate removal. 53 | 54 | Args: 55 | texts: A list of strings, where each string is an article's content. 56 | num_topics: The number of topics to discover. 57 | 58 | Returns: 59 | A tuple containing: 60 | - A list of topic IDs for each text. 61 | - A list of the top words for each discovered topic. 62 | """ 63 | if not texts: 64 | return [], [] 65 | 66 | download_nltk_packages() 67 | 68 | # --- 1. 
Advanced Text Cleaning --- 69 | print("\nCleaning and pre-processing article text...") 70 | 71 | # Automatically find and add common boilerplate phrases to the stopword list 72 | boilerplate_words = find_common_phrases(texts) 73 | 74 | lemmatizer = nltk.stem.WordNetLemmatizer() 75 | nltk_stop_words = list(nltk.corpus.stopwords.words('english')) 76 | 77 | # --- Create a custom list of words to ignore --- 78 | # Add any other meaningless words you find to this list. 79 | # This list WILL always be never-ending .... :p 80 | CUSTOM_STOP_WORDS = [ 81 | 'com', 'story', 'news', 'fly', 'edt', 'theflyonthewall', '00', 'yet', 82 | 'copyright', 'free', '30', 'br', 'apos', 'www', 'wsj', 'writes', 'take', 83 | 'likely', 'wants', 'et', 'according', 'would', 'basis', 'due', 'god', 84 | 'bless', 'https', 'states', 'starting', 'sent', 'instead', 'see', 'thefly', 85 | 'go', 'rest', 'permalinks', 'entry', 'php', 'another', 'event', 'events', 86 | 'like', 'well', 'may', 'us', 'final', 'noted', 'read', 'minutes', 'finished', 87 | 'last', '000', 'years', 'year', 'plans', 'set', 'weeks', 'reference', 'href', 88 | 'called', 'effects', 'near', 'says', 'say', 'make', '000', 'said', 'remains', 'also', 89 | 'seen', 'get', 'time', 'generally', 'looking', 'nice', 'post', 'yesterday', 90 | 'working', 'worked', 'works', 'made', 'great', 'gov', 'briefing', 'authors', 91 | 'div', 'following', 'told', 'made', 'tell', 'comments', 'good', 'speaking', 92 | 'http', 'able', 'place', 'many', 'slipped', 'shed', 'rose', 'higher', 'lower', 93 | 'gains', 'falls', 'rising', 'falling', 'snapped', 'climbs', 'declines', 'closes', 94 | 'fox', 'reuters', 'tells', 'interview', 'bring', 'reporter', 'work', 'long', 95 | 'effect', 'previously', 'move', 'going', 'mod', 'link', 'avoiding', 'new', 96 | 'old', 'done', 'want', 'along', 'accept', 'could', 'stance', 'announces', 97 | 'meanwhile', 'marginally', 'fresh', 'buzz', 'dow', 'jones', 'trading', 'share', 98 | 'pre', 'believed', 'method', 'expected', 'several', 'suggested', 'observed', 99 | 'saying', 'give', 'really', 'earlier', 'think', 'live', 'know', 'held', 100 | 'familiar', 'include', 'citing', 'keep', 'know', 'opted', 'among', 'known', 101 | 'slightly', 'stated', 'shame', 'amp' 102 | ] 103 | 104 | stop_words = nltk_stop_words + CUSTOM_STOP_WORDS + boilerplate_words 105 | 106 | # --- 2. Lemmatization --- 107 | # Lemmatize AFTER finding boilerplate to catch the original phrases 108 | lemmatized_texts = [' '.join(lemmatize_and_tokenize(text, lemmatizer)) for text in texts] 109 | 110 | # --- 3. Vectorization using TF-IDF and Bigrams --- 111 | 112 | print(f"Performing Topic Modeling to discover {num_topics} topics...") 113 | 114 | # TF-IDF is more advanced than a simple count. It weighs words based on 115 | # how important they are to a specific document, not just how frequent they are. 116 | # Also, we ignore words that appear in less than 2 documents or more than 85% of documents. 117 | # The token_pattern considers words of 3+ letters. 118 | vectorizer = TfidfVectorizer( 119 | max_df=0.85, min_df=2, stop_words=stop_words, 120 | token_pattern=r'\b[a-zA-Z]{3,}\b' 121 | ) 122 | 123 | dtm = vectorizer.fit_transform(lemmatized_texts) 124 | 125 | # 4. Build and fit the LDA model 126 | lda = LatentDirichletAllocation(n_components=num_topics, random_state=42) 127 | lda.fit(dtm) 128 | 129 | # 5. Get the dominant topic for each document 130 | topic_results = lda.transform(dtm) 131 | dominant_topic_per_document = topic_results.argmax(axis=1) 132 | 133 | # 6. 
Get the top words/phrases for each topic for display 134 | feature_names = vectorizer.get_feature_names_out() 135 | top_words_per_topic = [] 136 | for topic_idx, topic in enumerate(lda.components_): 137 | # Get the top 10 words/phrases for this topic 138 | top_words = [feature_names[i].replace(' ', '_') for i in topic.argsort()[:-10 - 1:-1]] 139 | top_words_per_topic.append(top_words) 140 | print(f" -> Discovered Topic #{topic_idx}: {', '.join(top_words)}") 141 | 142 | return dominant_topic_per_document.tolist(), top_words_per_topic -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import os 2 | import datetime 3 | import pandas as pd 4 | from ib_insync import IB, util 5 | 6 | # Import our custom modules and configuration 7 | import config 8 | from news_fetcher import fetch_historical_news, get_full_article 9 | from sentiment_analyzer import analyze_sentiment 10 | from topic_modeler import perform_topic_modeling 11 | 12 | def main(): 13 | """ 14 | Main function to orchestrate the news fetching and analysis process. 15 | """ 16 | ib = IB() 17 | all_symbols_results = [] # Master list to hold results from all symbols 18 | 19 | try: 20 | # --- Connect to IBKR --- 21 | print(f"Connecting to IBKR on {config.IB_HOST}:{config.IB_PORT}...") 22 | ib.connect(config.IB_HOST, config.IB_PORT, clientId=config.CLIENT_ID) 23 | print("Connection successful.") 24 | 25 | # Loop through each symbol specified in the config file 26 | for symbol in config.CONTRACT_SYMBOLS: 27 | print(f"\n{'='*20} Processing: {symbol} {'='*20}") 28 | 29 | # --- Fetch News Headlines --- 30 | contract_details = { 31 | 'symbol': symbol, # Use the symbol from the loop 32 | 'secType': config.CONTRACT_TYPE, 33 | 'exchange': config.EXCHANGE, 34 | 'currency': config.CURRENCY 35 | } 36 | all_headlines = fetch_historical_news(ib, contract_details, config.START_DATE, config.END_DATE) 37 | 38 | if not all_headlines: 39 | print(f"No headlines found for {symbol}. 
Skipping.") 40 | continue 41 | 42 | # --- Analyze ALL articles and flag keyword matches --- 43 | print(f"\nAnalyzing all {len(all_headlines)} articles for {symbol}...") 44 | 45 | # Sort headlines by time, newest first 46 | sorted_headlines = sorted(all_headlines, key=lambda h: h.time, reverse=True) 47 | 48 | # --- Batch processing logic to avoid rate limiting --- 49 | BATCH_SIZE = 200 # Process 200 articles at a time 50 | BATCH_PAUSE = 2 # Pause for 2 seconds between batches 51 | 52 | for i in range(0, len(sorted_headlines), BATCH_SIZE): 53 | batch = sorted_headlines[i:i + BATCH_SIZE] 54 | print(f"\n--- Processing Batch {i//BATCH_SIZE + 1}/{len(sorted_headlines)//BATCH_SIZE + 1} ---") 55 | 56 | for headline in batch: 57 | if not (config.START_DATE <= headline.time <= config.END_DATE): 58 | continue 59 | 60 | # Get full article text 61 | article_text = get_full_article(ib, headline) 62 | content_to_search = (headline.headline + ' ' + article_text).lower() 63 | 64 | # Determine if the article matches the keywords 65 | matches_keywords = any(keyword.lower() in content_to_search for keyword in config.KEYWORDS_TO_SEARCH) 66 | 67 | # Perform sentiment analysis on every article 68 | sentiment, polarity = analyze_sentiment(content_to_search) 69 | 70 | # Add to results 71 | ts = headline.time 72 | all_symbols_results.append({ 73 | 'Symbol': symbol, 74 | 'Date': ts.strftime('%Y-%m-%d'), 75 | 'Time': ts.strftime('%H:%M:%S'), 76 | 'Provider': headline.providerCode, 77 | 'Matches_Keywords': matches_keywords, 78 | 'Sentiment': sentiment, 79 | 'Polarity': round(polarity, 4), 80 | 'Headline': headline.headline, 81 | 'Article': article_text.replace('\n', ' ').strip() 82 | }) 83 | 84 | # progress indicator for batches 85 | print(f" -> Processed: {headline.headline[:60]}...", end='\r') 86 | ib.sleep(0.1) # Small pause between each article 87 | 88 | # After a batch is done, check if it's not the very last one 89 | if i + BATCH_SIZE < len(sorted_headlines): 90 | print(f"\n--- Batch complete. Pausing for {BATCH_PAUSE} seconds to respect API limits... 
---") 91 | ib.sleep(BATCH_PAUSE) 92 | 93 | print(f"\n\n{'='*20} Finished processing: {symbol} {'='*20}") 94 | 95 | # Don't pause after the last symbol 96 | if symbol != config.CONTRACT_SYMBOLS[-1]: 97 | print("Pausing before next symbol to respect API rate limits...") 98 | ib.sleep(5) #Longer pause between symbols 99 | 100 | # --- Perform Topic Modeling on ALL collected articles --- 101 | if all_symbols_results: 102 | # Create a list of just the article texts to feed into the model 103 | all_article_texts = [result['Article'] for result in all_symbols_results] 104 | 105 | # --- Define date strings once for use in all filenames --- 106 | start_str = config.START_DATE.strftime('%Y%m%d-%H%M%S') 107 | end_str = config.END_DATE.strftime('%Y%m%d-%H%M%S') 108 | 109 | # Perform the topic modeling 110 | topic_ids, topics = perform_topic_modeling(all_article_texts, num_topics=config.NUM_TOPICS) 111 | 112 | # Add the discovered topic ID to each result 113 | for i, result in enumerate(all_symbols_results): 114 | result['Topic_ID'] = topic_ids[i] 115 | 116 | # --- Save the topic summary to a text file --- 117 | # Create a formatted string with the topic details 118 | topic_summary_lines = ["--- Discovered Topic Summary ---"] 119 | for topic_id, top_words in enumerate(topics): 120 | topic_summary_lines.append(f"Topic #{topic_id}: {', '.join(top_words)}") 121 | 122 | # Define the path for the summary file 123 | summary_filename = f"topic_summary_from_{start_str}_to_{end_str}.txt" 124 | summary_filepath = os.path.join(config.OUTPUT_DIRECTORY, summary_filename) 125 | 126 | # Write the summary to the file 127 | with open(summary_filepath, 'w') as f: 128 | f.write('\n'.join(topic_summary_lines)) 129 | print(f"\nSaved topic summary to '{summary_filepath}'") 130 | 131 | # --- Save Combined Report to CSV --- 132 | if all_symbols_results: 133 | print("\nSaving combined report to CSV...") 134 | if not os.path.exists(config.OUTPUT_DIRECTORY): 135 | os.makedirs(config.OUTPUT_DIRECTORY) 136 | 137 | filename = f"news_report_combined_from_{start_str}_to_{end_str}.csv" 138 | filepath = os.path.join(config.OUTPUT_DIRECTORY, filename) 139 | 140 | # Create and save DataFrame 141 | df = pd.DataFrame(all_symbols_results) 142 | # Reorder columns to put Symbol first 143 | df = df[['Symbol', 'Date', 'Time', 'Provider', 'Matches_Keywords', 'Topic_ID', 'Sentiment', 'Polarity', 'Headline', 'Article']] 144 | df.to_csv(filepath, index=False, encoding='utf-8') 145 | print(f"\nSuccessfully saved the report to '{os.path.abspath(filepath)}'") 146 | else: 147 | print("\nNo articles found across all symbols. No CSV file was generated.") 148 | 149 | except ConnectionRefusedError: 150 | print(f"\nError: Connection refused. Is TWS or IB Gateway running on {config.IB_HOST}:{config.IB_PORT}?") 151 | except Exception as e: 152 | print(f"\nAn unexpected error occurred: {e}") 153 | finally: 154 | if ib.isConnected(): 155 | print("Disconnecting from IBKR.") 156 | ib.disconnect() 157 | 158 | if __name__ == "__main__": 159 | # ib_insync requires an asyncio event loop to run. 160 | # util.startLoop() is a helper for running it in scripts. 161 | util.startLoop() 162 | main() -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # IBKR Historical News Analyzer 2 | 3 | A powerful and robust Python tool to fetch, analyze, and perform advanced topic modeling & sentiment analysis on historical news data from Interactive Brokers. 
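Everything about a run, from the symbols and date range to the number of LDA topics, is controlled from `config.py`. As a quick orientation, a typical setup looks something like the sketch below (the values mirror the defaults shipped in this repository and are purely illustrative; the keyword list is shortened):

```python
# config.py (excerpt): illustrative values only, see the full file for all options.
import datetime

IB_HOST = '127.0.0.1'
IB_PORT = 7497                               # 7497 = TWS paper trading
CLIENT_ID = 123                              # any unique client ID

CONTRACT_SYMBOLS = ['SPY', 'QQQ', 'DIA']     # one or many symbols
KEYWORDS_TO_SEARCH = ['earnings', 'fed', 'inflation', 'rate cut']   # shortened here
NUM_TOPICS = 20                              # how many LDA topics to discover

END_DATE = datetime.datetime.now()
START_DATE = END_DATE - datetime.timedelta(days=100)
OUTPUT_DIRECTORY = 'reports'
```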
4 | 5 | This repository contains the full development history of the project. The latest stable version is **v1.1**. 6 | 7 | --- 8 | 9 | ### Official Releases 10 | 11 | You can browse the code, documentation, and download the source for each official version by clicking the links below. 12 | 13 | | Version | Key Feature | Browse Files & README | View Release Notes & Downloads | 14 | | :------ | :-------------------------- | :------------------------------------------------------------------------- | :------------------------------------------------------------------------- | 15 | | **V1.1** | **Advanced Topic Modeling** | [**Browse V1.1 Files**](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/tree/v1.1) | [**V1.1 Release Notes**](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/releases/tag/v1.1) | 16 | | **V1.0** | **Stable Harvester** | [Browse V1.0 Files](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/tree/v1.0) | [V1.0 Release Notes](https://github.com/DanielPaulDsouza/ibkr-news-analyzer/releases/tag/v1.0) | 17 | 18 | --- 19 | 20 | ## About the Latest Version (V1.1) 21 | 22 | This project has evolved from a simple data harvester into a sophisticated analysis engine. It connects to the IBKR API, downloads news for multiple symbols over a specified date range, and then applies a professional-grade Natural Language Processing (NLP) pipeline to each article. It also analyzes every article for sentiment and flags articles that match your keywords. The final output is a single, rich CSV file containing sentiment scores, keyword flags, and a discovered "Topic ID" for each article, enabling deep thematic analysis, further analysis or visualization. 23 | 24 | ## Features 25 | 26 | - **Multi-Contract Support:** Fetch news for multiple symbols (e.g., 'SPY', 'QQQ', 'AAPL') in a single, automated run. 27 | - **Robust API Rate-Limit Handling:** Politely handles API limits by processing articles in configurable batches with pauses, ensuring reliable data collection without being blocked. 28 | - **✨ New in V1.1: Advanced NLP Pre-processing:** Utilizes a professional pipeline including boilerplate removal, lemmatization (reducing words to their root form), and bigram detection to produce cleaner data for analysis. 29 | - **✨ New in V1.1: Advanced Topic Modeling:** Implements Latent Dirichlet Allocation (LDA) with TF-IDF vectorization to automatically discover and categorize the underlying themes in the news articles. 30 | - **Sentiment Scoring:** Uses `TextBlob` to perform sentiment analysis on every article, providing `Sentiment` (Positive, Negative, Neutral) and `Polarity` score columns. 31 | - **Keyword Flagging:** Includes a `Matches_Keywords` (True/False) column. This allows you to either analyze all news for a symbol or easily filter for articles relevant to your specific interests in a downstream tool like Pandas or Excel. 32 | - **Fully Configurable:** Easily change all parameters (dates, keywords, contract symbols, batch sizes, number of topics for LDA, etc.) in a simple `config.py` file. 33 | - **Combined Outputs:** Saves all results from all symbols into a single, detailed CSV file and creates a separate `topic_summary.txt` file that describes the discovered topics for each run. 34 | 35 | ## What's New in v1.1: Advanced Topic Modeling 36 | 37 | This version introduces a powerful topic modeling feature using Latent Dirichlet Allocation (LDA) with an advanced pre-processing pipeline. 
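Condensed to its essentials, that pipeline looks roughly like the sketch below. This is a simplified, illustrative version of what `topic_modeler.py` does: the automated boilerplate detection and the large custom stopword list are left out, and `quick_topics` is just a placeholder name. The points that follow describe each stage in more detail.

```python
# Simplified, illustrative sketch of the topic modeling pipeline.
# Assumes the NLTK 'punkt' and 'wordnet' data are already downloaded
# (the real pipeline in topic_modeler.py fetches them automatically).
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def quick_topics(texts, num_topics=20):
    # Lemmatize so that e.g. "rates" and "rate" collapse to the same token
    lemmatizer = nltk.stem.WordNetLemmatizer()
    cleaned = [
        ' '.join(lemmatizer.lemmatize(tok) for tok in nltk.word_tokenize(doc.lower()))
        for doc in texts
    ]

    # TF-IDF weighting; drop terms that are too common or too rare
    vectorizer = TfidfVectorizer(
        max_df=0.85, min_df=2,
        stop_words='english',                 # the real pipeline adds many custom stopwords
        token_pattern=r'\b[a-zA-Z]{3,}\b',    # keep words of 3+ letters only
    )
    dtm = vectorizer.fit_transform(cleaned)

    # LDA discovers num_topics clusters of co-occurring terms
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    doc_topics = lda.fit_transform(dtm)       # one topic distribution per article
    return doc_topics.argmax(axis=1)          # dominant Topic_ID for each article
```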
38 |
39 | - **Automated Boilerplate Detection:** Automatically finds and removes common, repetitive phrases (e.g., news provider disclaimers) to reduce noise.
40 |
41 | - **Advanced Text Cleaning:** Uses Lemmatization and Bigram detection to treat words like "rates" and "rate" as the same concept, and to understand that "rate_hike" is a single, important idea.
42 |
43 | - **TF-IDF Vectorization:** Employs TF-IDF to intelligently weigh words, giving more importance to terms that are significant to a specific document rather than just frequent overall.
44 |
45 | - **Tunable Model:** Allows the user to easily configure the number of topics to discover, enabling both high-level and granular thematic analysis.
46 |
47 | ## How v1.1 Works
48 |
49 | The script connects to a running instance of Interactive Brokers' Trader Workstation (TWS) or IB Gateway. For each symbol in the configuration, it makes a request for the ~300 most recent historical news headlines within the specified date range. It then processes these headlines in batches, requesting the full article text for each one while pausing to respect API rate limits. Each article is then cleaned using advanced NLP techniques, and these cleaned articles are analyzed with LDA and TF-IDF to discover and categorize the underlying themes.
50 |
51 | The articles are further analyzed for sentiment and checked against the keyword list before being added to a master results list, which is then saved to a single CSV file.
52 |
53 | ## How Topic Modeling Works & Tuning Guide
54 |
55 | The script first collects all news articles. Then, it cleans the text by removing boilerplate, common "stopwords," and noise like numbers. It then uses TF-IDF to represent the importance of words in each document. Finally, the LDA algorithm analyzes these representations to discover a set number of underlying "topics" (i.e., clusters of words that frequently appear together). Each article in the final CSV is assigned a `Topic_ID` corresponding to its dominant theme.
56 |
57 | ### Tuning Your Topic Model
58 |
59 | The quality of these topics is highly dependent on a few key settings which you can tune in your project:
60 |
61 | * **`NUM_TOPICS`** (in `config.py`): This is the most important setting. It defines how many distinct themes the model should look for. A smaller number (~10-15) will produce broad themes. A larger number (25+) will produce more specific, granular themes. It is recommended to start with a smaller number and increase it as you refine your data cleaning.
62 | * **`CUSTOM_STOP_WORDS`** (in `topic_modeler.py`): This is a powerful list where you can add domain-specific words you want the model to ignore. This is the best place to add generic market commentary verbs (e.g., `rose`, `fell`, `climbed`) or other noise you discover.
63 | * **`max_df` / `min_df`** (in `topic_modeler.py`): These parameters in the `TfidfVectorizer` are powerful filters.
64 |     * `max_df=0.85` tells the model to ignore words that appear in more than 85% of all articles. Lowering this value is an effective way to remove overly common words and force the model to find more nuanced themes.
65 |     * `min_df=2` tells the model to ignore words that appear in fewer than 2 documents, which helps remove rare words and potential typos.
66 |
67 | ## Architectural Choice: `ib_insync` vs. Native `ibapi`
68 |
69 | This project is built using the `ib_insync` library rather than the native `ibapi` for a specific architectural reason. The short sketch below shows the request/response style this enables; the two points that follow it spell out the difference.
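The snippet is only an illustration of that style, condensed from what `news_fetcher.py` actually does; the port, client ID, contract, and look-back window are example values.

```python
# Illustrative sketch of the request/response flow used throughout this project
# (condensed from news_fetcher.py). Port, client ID and contract are example values.
import datetime
from ib_insync import IB, Contract

ib = IB()
ib.connect('127.0.0.1', 7497, clientId=42)         # blocks until the connection is up

contract = Contract(symbol='SPY', secType='STK', exchange='SMART', currency='USD')
ib.qualifyContracts(contract)                       # resolves the contract's conId in place

providers = ib.reqNewsProviders()                   # a plain list, no callbacks to wire up
end = datetime.datetime.now()
start = end - datetime.timedelta(days=5)

headlines = ib.reqHistoricalNews(
    conId=contract.conId,
    providerCodes='+'.join(p.code for p in providers),
    startDateTime=start.strftime('%Y%m%d %H:%M:%S'),
    endDateTime=end.strftime('%Y%m%d %H:%M:%S'),
    totalResults=10,
)
for headline in headlines or []:                    # the request may return nothing
    print(headline.time, headline.headline)

ib.disconnect()
```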
70 | 71 | * The **native `ibapi`** is a low-level, purely **event-driven** system. It is powerful for real-time applications that need to react instantly to live data streams pushed from the server. 72 | * **`ib_insync`** is a higher-level library that provides a clean, **synchronous-style (request/response)** interface. It handles the complex asynchronous background work, making it the ideal tool for tasks that follow a linear logic, such as "request a list of data, then process each item." 73 | 74 | For this project, which focuses on harvesting and analyzing historical data, the request/response model of `ib_insync` is the superior choice. It allows the code to be simpler, more readable, and more focused on the core tasks of data processing and analysis, rather than on managing complex event loops and callbacks. 75 | 76 | ## Prerequisites 77 | 78 | - Python 3.8+ 79 | - An Interactive Brokers account (live or paper) 80 | - Trader Workstation (TWS) or IB Gateway installed and running. 81 | 82 | ## Setup & Installation 83 | 84 | 1. **Clone the repository (or download the ZIP):** 85 | ```bash 86 | git clone https://github.com/DanielPaulDsouza/ibkr-news-analyzer.git 87 | cd ibkr-news-analyzer 88 | ``` 89 | 90 | 2. **Create a virtual environment (recommended):** 91 | ```bash 92 | python -m venv venv 93 | source venv/bin/activate # On Windows use `venv\Scripts\activate` 94 | ``` 95 | 96 | 3. **Install the required libraries:** 97 | ```bash 98 | pip install -r requirements.txt 99 | ``` 100 | 101 | ## How to Use 102 | 103 | 1. **Log in to TWS or IB Gateway.** Make sure the API connection is enabled. 104 | - In TWS: Go to `File -> Global Configuration -> API -> Settings` and check "Enable ActiveX and Socket Clients". 105 | 106 | 2. **Edit the configuration file.** Open `config.py` and adjust the settings to your needs. You can change the date range, the keywords, and the list of stock symbols. 107 | 108 | 3. **Run the application.** From your terminal, simply run: 109 | ```bash 110 | python main.py 111 | ``` 112 | 113 | 4. **Check the output.** The script will print its progress in the terminal. Once finished, you will find a CSV file & a .txt file in the `reports` folder with the combined, analyzed news data. 114 | 115 | ## Output CSV Columns 116 | 117 | | Column | Description | 118 | | ---------------- | ------------------------------------------------------------------------ | 119 | | `Symbol` | The contract symbol (e.g., 'SPY') the news is for. | 120 | | `Date` | The publication date of the article. | 121 | | `Time` | The publication time of the article. | 122 | | `Provider` | The news provider code (e.g., 'FLY', 'BRFG'). | 123 | | `Matches_Keywords` | `True` if the article contains any of your keywords, otherwise `False`. | 124 | | **`Topic_ID`** | **(New in V1.1)** The ID of the topic cluster the article belongs to. | 125 | | `Sentiment` | The sentiment classification: 'Positive', 'Negative', or 'Neutral'. | 126 | | `Polarity` | The sentiment polarity score (from -1.0 for negative to 1.0 for positive). | 127 | | `Headline` | The headline of the news article. | 128 | | `Article` | The full text of the news article (or an error message if unavailable). | 129 | 130 | ## Project Roadmap (Future Development) 131 | 132 | This project is under active development. 
The following major features are planned for future versions: 133 | 134 | ### V2.0: The Advanced Analyzer 135 | 136 | - **Feature:** Upgrade the sentiment analysis engine from `TextBlob` to a finance-specific transformer model like `FinBERT`. Merge the Topic Modeling and FinBERT features into a single, powerful pipeline. 137 | - **Goal:** To create a comprehensive analysis tool that can not only determine the sentiment of news with high accuracy but also identify the specific economic or financial themes driving that sentiment. 138 | 139 | ## Disclaimer 140 | 141 | This tool is for educational and informational purposes only. Financial markets are complex and risky. Past performance is not indicative of future results. The author is not responsible for any financial losses incurred as a result of using this software. Always do your own research. --------------------------------------------------------------------------------