├── .gitignore
├── IR-Project-Spring1401.pdf
├── README.md
└── main.py

/.gitignore:
--------------------------------------------------------------------------------
/test.py
/.idea
/data/

--------------------------------------------------------------------------------
/IR-Project-Spring1401.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aliasad059/Search-Engine/HEAD/IR-Project-Spring1401.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Simple Search Engine with Information Retrieval Techniques

This Python project implements a basic search engine built on classic information retrieval techniques. It provides functionality for searching a dataset and retrieving the documents most relevant to a query. Here's a brief overview of the project components:

## Features
- **Data Loading and Preprocessing:** The code loads a dataset from a JSON file and preprocesses it: normalization, tokenization, removal of URLs, punctuation, and stopwords, and stemming.

- **Indexing:** The project builds an inverted index of the dataset. For each word, the index records its total frequency, the documents it occurs in, its frequency and positions within each document, and a champion list used to speed up ranked retrieval.

- **Searching:** The search engine supports several query types: single-word queries, phrasal queries (phrases enclosed in double quotes), and negation queries (terms wrapped in `!`, e.g. `!term!`, to exclude documents containing them). Results are ranked by relevance to the query.

- **Ranking and Retrieval:** The code provides ranked retrieval that returns the top-k documents most relevant to a query using TF-IDF scores. It can optionally use the champion lists to speed up ranked retrieval.

- **Zipf's Law Visualization:** The project includes a function that draws a Zipf's law plot showing the distribution of word frequencies in the dataset.

- **Index Saving and Loading:** The created index can be saved to and loaded from a file, which saves preprocessing time when working with large datasets.

## Implementation Details
This project implements a search engine from scratch with the following features:
- Data preprocessing, including tokenization, normalization, and stemming.
- Creation of an inverted index for efficient word-to-document mapping.
- Support for several query types, including phrasal and negation queries.
- Ranked retrieval using TF-IDF scores.
- Faster query answering through champion lists.

## Getting Started
To use this search engine, you need:
- Python 3.6 or higher
- The third-party libraries `matplotlib`, `hazm`, `numpy`, and `pandas` (the remaining imports — `json`, `itertools`, `collections`, `re`, and `string` — are part of the standard library)

## Usage
- Load and preprocess your dataset.
- Create an inverted index (champion lists are built alongside it).
- Use the search engine to perform the different types of searches, including ranked retrieval.
- Display the search results to the user.

## Example Usage
Here's an example of how to use this search engine with the functions defined in `main.py`:

```python
from main import load_df, preprocess_df, create_index_dict, query, ranked_retrieval_query

# Load and preprocess your dataset
df = load_df('data/raw_data.json')
df = preprocess_df(df, column_name='content', verbose=True)

# Create an inverted index (champion lists included)
word_index = create_index_dict(df, column_name='content')

# Perform a Boolean/positional search
print(query('your user query', word_index))

# Perform ranked retrieval (top 10 documents by TF-IDF score)
print(ranked_retrieval_query('your user query', word_index, k=10, N=len(df)))
```
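
The query operators can also be combined in a single query string. The sketch below mirrors the sample searches at the bottom of `main.py` and assumes `word_index` and `df` have already been built as in the previous example:

```python
# Phrase search: terms in double quotes must appear consecutively
print(query('"کنگره ضدتروریست"', word_index))

# Negation: terms wrapped in '!' exclude documents containing them
print(query('تحریم‌های آمریکا !ایران!', word_index))

# Combined: a phrase, a plain AND term, and an excluded term
print(query('"تحریم هسته‌ای" آمریکا !ایران!', word_index))

# Ranked retrieval accelerated with champion lists
print(ranked_retrieval_query('جدول رده‌بندی لیگ', word_index, k=10, N=len(df), use_champion_list=True))
```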

This project provides a compact, from-scratch implementation of the ideas above. For more advanced features, accuracy, and scalability, you can also explore a companion repository that uses the Elasticsearch framework for similarity modeling, spell correction, and clustering.

## Additional Features in the Advanced Repository
- **Elasticsearch Integration:** The advanced repository integrates Elasticsearch, a powerful search and analytics engine, to enhance search capabilities.
- **Similarity Modeling:** Elasticsearch allows for more advanced similarity modeling, enabling better relevance ranking of search results.
- **Spell Correction:** The repository incorporates spell correction to improve search accuracy, so users get relevant results even with typos or misspelled queries.
- **Clustering:** It includes clustering algorithms to group and retrieve similar documents, providing a more structured and organized search experience.

You can find the advanced repository at the following link:
[Search Engine with Elasticsearch](https://github.com/aliasad059/Elastic-Search)

Please refer to the advanced repository for more sophisticated search capabilities and accuracy enhancements.
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import json
from itertools import cycle
import collections
from matplotlib import pyplot as plt
from hazm import Normalizer, word_tokenize, stopwords_list, Stemmer
import numpy as np
import pandas as pd
import re
from string import punctuation


def load_df(file_name):
    """
    Loads a dataframe from a json file (documents are stored as columns, hence the transpose).
    """
    return pd.read_json(file_name).transpose()


def preprocess_query(query):
    """
    Preprocesses a query: tokenizes it, removes stopwords, and stems the remaining tokens.
    """
    query = word_tokenize(query)
    query = [w for w in query if w not in stopwords_list()]
    query = [Stemmer().stem(w) for w in query]
    return ' '.join(query)


def preprocess_df(df, column_name, verbose=False):
    """
    Preprocesses a dataframe column: removes URLs and punctuation, normalizes,
    tokenizes, removes stopwords, and stems the tokens.
    """
    normalizer = Normalizer()
    stemmer = Stemmer()
    stopwords = set(stopwords_list())
    if verbose:
        print('Removing URLs...')
    df[column_name] = df[column_name].apply(lambda x: re.sub(r'http\S+', '', x))
    if verbose:
        print('Removing punctuations...')
    df[column_name] = df[column_name].apply(lambda x: re.sub(f'[{punctuation}؟،٪×÷»«]+', '', x))
    if verbose:
        print('Normalizing...')
    df[column_name] = df[column_name].apply(lambda x: normalizer.normalize(x))
    if verbose:
        print('Tokenizing...')
    df[column_name] = df[column_name].apply(lambda x: word_tokenize(x))
    if verbose:
        print('Removing stopwords...')
    df[column_name] = df[column_name].apply(lambda x: [w for w in x if w not in stopwords])

    # # total number of tokens in all documents
    # total_tokens = sum([len(x) for x in df[column_name]])
    # print(f'Total number of tokens before stemming: {total_tokens}')

    if verbose:
        print('Stemming...')
    df[column_name] = df[column_name].apply(lambda x: [stemmer.stem(w) for w in x])

    # # total number of tokens in all documents
    # total_tokens = sum([len(x) for x in df[column_name]])
    # print(f'Total number of tokens after stemming: {total_tokens}')

    if verbose:
        print('Joining...')
    df[column_name] = df[column_name].apply(lambda x: ' '.join(x))
    if verbose:
        print('Done.')
    return df


def create_index_dict(df, column_name):
    """
    Creates the inverted index: for each word, its total count, the documents it
    appears in (with per-document count and positions), and a champion list.
    """
    print('Creating index...')
    word_index = {}
    for i, row in df.iterrows():
        for p, word in enumerate(row[column_name].split()):
            if word not in word_index:
                word_index[word] = {}
                word_index[word]['count'] = 1  # count of the word over all documents
                word_index[word]['docs'] = {}  # documents that contain the word
                word_index[word]['docs'][i] = {}
                word_index[word]['docs'][i]['count'] = 1  # count of the word in document i
                word_index[word]['docs'][i]['positions'] = [p]  # positions of the word in document i
            else:
                word_index[word]['count'] += 1
                if i not in word_index[word]['docs']:
                    word_index[word]['docs'][i] = {}
                    word_index[word]['docs'][i]['count'] = 1
                    word_index[word]['docs'][i]['positions'] = [p]
                else:
                    word_index[word]['docs'][i]['count'] += 1
                    word_index[word]['docs'][i]['positions'].append(p)
    print('Done.')

    print('Creating champion index...')
    for word in word_index:  # champions list: the top half of a word's documents by in-document count
        champ_list = sorted(word_index[word]['docs'], key=lambda x: word_index[word]['docs'][x]['count'], reverse=True)
        word_index[word]['champions'] = champ_list[:len(champ_list) // 2]
    print('Done.')

    return word_index
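
# Shape of the index built by create_index_dict (illustrative example for a
# hypothetical term 'w' that occurs twice in document 0 at positions 3 and 17,
# and once in document 4 at position 9):
#
#   word_index['w'] = {
#       'count': 3,                                   # occurrences over the whole corpus
#       'docs': {
#           0: {'count': 2, 'positions': [3, 17]},
#           4: {'count': 1, 'positions': [9]},
#       },
#       'champions': [0],                             # top half of its documents by in-document count
#   }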

def exclude_indexes(indexes, excluded_indexes):
    """
    Removes the sorted excluded_indexes from the sorted list indexes.
    """
    result = []
    i = 0
    j = 0
    while i < len(indexes) and j < len(excluded_indexes):
        if indexes[i] < excluded_indexes[j]:
            result.append(indexes[i])
            i += 1
        elif indexes[i] > excluded_indexes[j]:
            j += 1
        else:
            i += 1
            j += 1
    # keep whatever is left of indexes once the exclusion list is exhausted
    result.extend(indexes[i:])
    return result


def multi_intersect_indexes(lists):
    """
    Intersects multiple sorted index lists.
    """
    if not lists:
        return []
    if len(lists) == 1:
        return lists[0]

    result = []
    maxval = float("-inf")
    consecutive = 0
    try:
        for sublist in cycle(iter(sublist) for sublist in lists):

            value = next(sublist)
            while value < maxval:
                value = next(sublist)

            if value > maxval:
                maxval = value
                consecutive = 0
                continue

            consecutive += 1
            if consecutive >= len(lists) - 1:
                result.append(maxval)
                consecutive = 0

    except StopIteration:
        return result


def intersect_two_indexes(indexes1, indexes2):
    """
    Intersects two sorted index lists.
    """
    result = []
    i = 0
    j = 0
    while i < len(indexes1) and j < len(indexes2):
        if indexes1[i] == indexes2[j]:
            result.append(indexes1[i])
            i += 1
            j += 1
        elif indexes1[i] < indexes2[j]:
            i += 1
        else:
            j += 1
    return result
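
# Quick sanity sketch of the three list helpers above (illustrative values, not
# used anywhere in the pipeline):
#
#   intersect_two_indexes([1, 3, 5, 8], [3, 4, 8])              -> [3, 8]
#   multi_intersect_indexes([[1, 3, 5], [3, 5, 9], [0, 3, 5]])  -> [3, 5]
#   exclude_indexes([1, 3, 5, 8], [3, 4])                       -> [1, 5, 8]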

def multiple_word_query(words, word_index):
    """
    Answers a multiple-word (AND) query.
    Returns a dict mapping each matching document to the summed counts of the query words in it.
    """
    words = [Stemmer().stem(w) for w in words]
    words = [w for w in words if w not in stopwords_list()]
    try:
        posting_lists = [word_index[word] for word in words]
    except KeyError:
        return {}

    lists = [list(p['docs'].keys()) for p in posting_lists]
    result = multi_intersect_indexes(lists)
    ranked_result = np.zeros(len(result))
    for p in posting_lists:
        ranked_result += [p['docs'][i]['count'] for i in result]
    # return [x for x, y in sorted(zip(ranked_result, result), reverse=True)]
    return dict(zip(result, ranked_result))


def phrasal_query(phrasal_word, word_index):
    """
    Answers a phrasal query.
    Returns a dict mapping each matching document to the number of occurrences of the phrase in it.
    """
    words = phrasal_word.split()
    words = [Stemmer().stem(word) for word in words]
    try:
        posting_lists = [word_index[word] for word in words]
    except KeyError:
        return {}
    lists = [list(p['docs'].keys()) for p in posting_lists]
    intersect_of_words_in_phrase = multi_intersect_indexes(lists)

    result = {}
    for d in intersect_of_words_in_phrase:
        positions = [word_index[w]['docs'][d]['positions'] for w in words]
        for p in positions[0]:
            # the phrase starts at position p if the i-th word of the phrase appears at position p + i
            if all(p + i in positions[i] for i in range(1, len(positions))):
                if d in result:
                    result[d] += 1
                else:
                    result[d] = 1
    return result


def query(query, word_index):
    """
    Answers a query and sorts the results.
    Supported operators: 1. double quotes ("...") for phrasal queries.
                         2. !term! for negation (excluded terms).
                         3. plain words are intersected (AND).
    """
    if len(query.split()) == 1:
        word = query.split()[0]
        if word in stopwords_list():
            return []
        word = Stemmer().stem(word)
        if word not in word_index:
            return []
        posting_list = word_index[word]['docs']
        return sorted(posting_list, key=lambda x: posting_list[x]['count'], reverse=True)
    else:
        phrasal_words = re.findall(r'"(.*?)"', query)
        excluded_words = re.findall(r'!(.*?)!', query)
        other_words = re.sub(r'"(.*?)"|!(.*?)!', '', query).split()

        scores = {}
        result = []
        if phrasal_words:
            phrasal_words_result = [phrasal_query(phrasal_word, word_index) for phrasal_word in phrasal_words]
            result = multi_intersect_indexes([list(p.keys()) for p in phrasal_words_result])
            scores = {i: sum(p[i] for p in phrasal_words_result) for i in result}
        if other_words:
            multiple_word_query_result = multiple_word_query(other_words, word_index)
            if phrasal_words:
                result = intersect_two_indexes(result, list(multiple_word_query_result.keys()))
            else:
                result = list(multiple_word_query_result.keys())
            scores = {i: multiple_word_query_result[i] for i in result}
        if excluded_words:
            for word in excluded_words:
                word = Stemmer().stem(word)
                if word in word_index:
                    if not result:
                        return []
                    result = exclude_indexes(result, list(word_index[word]['docs'].keys()))
        return sorted(result, key=lambda d: scores[d], reverse=True)
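
# How a combined query string is decomposed by the regular expressions above
# (this example query is one of the sample searches in the __main__ block):
#
#   '"تحریم هسته‌ای" آمریکا !ایران!'
#       phrasal_words  -> ['تحریم هسته‌ای']   (must occur as a consecutive phrase)
#       excluded_words -> ['ایران']           (documents containing it are dropped)
#       other_words    -> ['آمریکا']          (plain AND term)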

def draw_zipf_law(word_index):
    """
    Draws a Zipf's law plot (log-log) of token frequency against frequency rank.
    """
    tokens = list(word_index.keys())
    counts = [word_index[w]['count'] for w in tokens]
    ranks = np.arange(1, len(counts) + 1)
    indices = list(reversed(np.argsort(counts)))
    frequencies = [counts[i] for i in indices]
    plt.figure(figsize=(8, 6))
    plt.loglog(ranks, frequencies, marker=".")
    plt.plot([1, frequencies[0]], [frequencies[0], 1], color='r')
    plt.title("Zipf plot for news tokens")
    plt.xlabel("Frequency rank of token")
    plt.ylabel("Absolute frequency of token")
    plt.grid(True)
    plt.show()


def get_tf_idf(tf, idf):
    """
    Returns the tf-idf weight of a term, given its raw term frequency and N / df.
    """
    return (1 + (np.log10(tf))) * np.log10(idf)


def ranked_retrieval_query(query, word_index, k, N, use_champion_list=False):
    """
    Returns the top k documents that are most similar to the query (TF-IDF scoring).
    """
    words = query.split()
    words = [Stemmer().stem(w) for w in words]
    words = [w for w in words if w not in stopwords_list()]
    query_index = dict(collections.Counter(words))

    scores = np.zeros(N)
    for word in query_index:
        if word in word_index:
            docs = word_index[word]['docs']
            if use_champion_list:
                champions_list = word_index[word]['champions']
                docs = {k: v for k, v in docs.items() if k in champions_list}
            if not docs:  # e.g. an empty champion list for a very rare word
                continue
            idf = N / len(docs)
            wtq = get_tf_idf(query_index[word], idf)
            for d in docs:
                wtd = get_tf_idf(docs[d]['count'], idf)
                scores[d] += wtq * wtd
    indices = np.argsort(scores)[::-1]
    return indices[:k]
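
# Worked example of the weighting used in ranked_retrieval_query (illustrative
# numbers): for a corpus of N = 1000 documents and a term that occurs in 10 of
# them, idf is passed as N / len(docs) = 100, so
#
#   a document containing the term 4 times gets
#       wtd = (1 + log10(4)) * log10(100) ≈ 1.602 * 2 ≈ 3.204
#   a query mentioning the term once gets
#       wtq = (1 + log10(1)) * log10(100) = 1 * 2 = 2
#
# and the term contributes wtq * wtd ≈ 6.41 to that document's score.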

def save_index(word_index, file_name):
    """
    Saves the index to a file.
    """
    with open(file_name, 'w') as f:
        json.dump(word_index, f)


def load_index(file_name):
    """
    Loads the index from a file.
    """
    with open(file_name, 'r') as f:
        return json.load(f)


if __name__ == '__main__':
    # UNCOMMENT to do the preprocessing and save the result to a file
    # # Load dataframe from the raw json data file
    # df = load_df('data/raw_data.json')
    # # Preprocess dataframe
    # df = preprocess_df(df, column_name='content', verbose=True)
    # # Save dataframe
    # df.to_csv('data/preprocessed_data.csv')

    # Load preprocessed dataframe
    df = pd.read_csv('data/preprocessed_data.csv')

    # Create index dictionary
    word_index = create_index_dict(df, column_name='content')
    # save_index(word_index, './data/word_index.json')
    # word_index = load_index('./data/word_index.json')

    # draw_zipf_law(word_index)

    # simple local search
    print(query('تحریم‌های آمریکا علیه ایران', word_index))
    print(query('تحریم‌های آمریکا !ایران!', word_index))
    print(query('"کنگره ضدتروریست"', word_index))
    print(query('"تحریم هسته‌ای" آمریکا !ایران!', word_index))
    print(query('اورشلیم !صهیونیست!', word_index))

    # ranked retrieval search
    print(ranked_retrieval_query('لیگ', word_index, k=10, N=len(df)))
    print(ranked_retrieval_query('لیگ', word_index, k=10, N=len(df), use_champion_list=True))
    print(ranked_retrieval_query('جدول رده‌بندی لیگ', word_index, k=10, N=len(df)))
    print(ranked_retrieval_query('جدول رده‌بندی لیگ', word_index, k=10, N=len(df), use_champion_list=True))
    print(ranked_retrieval_query('سایپا', word_index, k=10, N=len(df)))
    print(ranked_retrieval_query('سایپا', word_index, k=10, N=len(df), use_champion_list=True))
    print(ranked_retrieval_query('بودجه سالیانه شهرداری', word_index, k=10, N=len(df)))
    print(ranked_retrieval_query('بودجه سالیانه شهرداری', word_index, k=10, N=len(df), use_champion_list=True))

    # compare ranked retrieval search and simple local search
    print(query('جدول رده‌بندی لیگ', word_index))
    print(ranked_retrieval_query('جدول رده‌بندی لیگ', word_index, k=10, N=len(df)))
    print(query('بودجه سالیانه شهرداری', word_index))
    print(ranked_retrieval_query('بودجه سالیانه شهرداری', word_index, k=10, N=len(df)))
--------------------------------------------------------------------------------