├── README.md
├── bing_search
│   ├── bing_search_name.py
│   ├── combine_db_tables.py
│   └── process_search_result.py
├── clean_name
│   ├── compustat_process_name.py
│   ├── dict_char_replace.json
│   ├── patentsview_process_name.py
│   └── sdc_process_name.py
├── match
│   ├── link_pv2compustat.py
│   └── link_pv2sdc.py
└── my_own_handy_functions.py

/README.md:
--------------------------------------------------------------------------------
1 | # Matching USPTO Patent Assignees to Compustat Public Firms and SDC Private Firms
2 | 
3 | This data project is a systematic effort to match assignee names on USPTO patent records, which are sometimes abbreviated or misspelled, to the universe of public firms in Compustat and to all private firms that have been involved in alliances and M&A in SDC Platinum. The current coverage runs from 1976 to 2017 for Compustat firms and from 1985 to 2017 for SDC firms.
4 | 
5 | The algorithm leverages the Bing web search engine and significantly improves upon fuzzy name matching, a common practice in the literature. This document presents a step-by-step guide to our searching and matching algorithm.
6 | 
7 | You may add your email to our [list](https://forms.gle/1ABK6WZk36n9Ye9p9) so that we can inform you once we make the data publicly available. Please contact Danqing Mei (dqmei@ckgsb.edu.cn) or Miao Liu (miao.liu@bc.edu) if you have any other questions.
8 | 
9 | ## Matching Patent Assignees to Compustat Public Firms:
10 | Our procedure for matching patent assignees to Compustat firms is similar to that of Autor et al. (forthcoming). The initial project started before they published their paper and procedure, so there are a few differences between ours and theirs, as detailed below. We also extend their sample period, which ends in 2013, to 2017.
11 | 
12 | 
13 | ## Matching Patent Assignees to SDC Private Firms:
14 | This project is the first attempt in the literature to match patent assignees to private firms using the web search engine approach (Mei, 2020). To illustrate the importance of private firms in the M&A market, the number of public deals (i.e., public acquirer – public target) is 6,175 between 1976 and 2017. Among these deals, 1,665 targets have patent records. The number of private deals (i.e., public acquirer – private target) is 42,206 between 1985 and 2017. Among these deals, 4,019 targets have patent records.
15 | 
16 | ## Matching Procedure
17 | Our matching procedure has four steps.
18 | 
19 | ### Step 1: Download the source data.
20 | - Patent data from the U.S. Patent and Inventor Database (the March 2019 version from PatentsView is used here).
21 | - Linked patent-Compustat data from the NBER Patent Data Project, which covers only patents granted by 2006.
22 | - Linked patent-CRSP data from Kogan et al. (2017), which covers patents granted by 2010.
23 | - Compustat North America. The relevant variables include the Compustat firm ID (gvkey), firm name, and firm website.
24 | - SDC Platinum. The relevant variables include the SDC deal number, the firm names of the target and acquirer, and the firm ID (cusip).
25 | 
26 | ### Step 2: Clean assignee names and Compustat/SDC firm names.
27 | - In this step, we remove punctuation and accent marks. The file “dict_char_replace.json” provides the complete list of punctuation marks and accented characters that we remove or replace. These choices appear to produce the best Bing search results based on our manual checks. The folder “clean_name” contains the code that removes punctuation and accent marks from the assignee and Compustat/SDC firm names. The punctuation-free names are used as input into the Bing search in the next step. A minimal sketch of this character-level cleaning is shown below.
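The sketch below is illustrative and simplified from the scripts in “clean_name”; the full scripts additionally handle patterns such as “x.x.x” abbreviations, trailing dots, “u s a”, and HTML escapes, and they require every character to appear in the replacement table. The input name here is a made-up example.

```python
import json
import re

# character replacement table shipped in this repository
with open('clean_name/dict_char_replace.json', 'r', encoding='utf-8') as f:
    dict_replace = json.load(f)

def clean_firm_name(raw_name):
    # lower-case, then map every non-space character through the table;
    # the actual scripts use dict_replace[char] directly (an unmapped character
    # raises a KeyError), .get() is used here only to keep the sketch short
    name = ''.join(dict_replace.get(ch, ch) if ch != ' ' else ' '
                   for ch in raw_name.lower())
    # collapse repeated whitespace and strip leading/trailing spaces
    return re.sub(r' +', ' ', name).strip()

print(clean_firm_name('Société Générale (USA), Inc.'))  # -> 'societe generale usa inc.'
```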
28 | 
29 | ### Step 3: Use the Bing Web Search API to collect search results in the form of URLs.
30 | - In this step, we first create the CSV input file that contains the punctuation-free firm names.
31 | - Run the Python program “bing_search_name.py” after adding the API key. A paid subscription to the Bing Web Search API is required to perform more than 3,000 searches per month.
32 | - The Python program generates an output table in a SQLite database that contains the links, titles, and descriptions of the top 50 search results returned for each punctuation-free firm name searched on Bing.
33 | - The folder “match” contains the code that cleans the collected URLs (e.g., stripping the protocol and trailing slash) and performs the matching in the next step.
34 | 
35 | ### Step 4: Match assignees to Compustat public firms and SDC private firms using names and URLs.
36 | - In the final step, we consider a patent assignee and a Compustat/SDC firm to be a match if the top five search results for the assignee and for the Compustat/SDC firm share at least two identical URLs.
37 | 
38 | ## References:
39 | Autor, David, David Dorn, Gordon H. Hanson, Gary Pisano, and Pian Shu, 2019, “Foreign Competition and Domestic Innovation: Evidence from U.S. Patents,” American Economic Review: Insights, forthcoming.
40 | 
41 | Kogan, Leonid, Dimitris Papanikolaou, Amit Seru, and Noah Stoffman, 2017, “Technological Innovation, Resource Allocation, and Growth,” The Quarterly Journal of Economics 132, 665–712.
42 | 
43 | Mei, Danqing, 2020, “Technology Development and Corporate Mergers,” Working Paper.
44 | 
--------------------------------------------------------------------------------
/bing_search/bing_search_name.py:
--------------------------------------------------------------------------------
1 | #!/NOBACKUP/scratch/dmei19/anaconda3/bin/python
2 | # this program runs Bing searches on the Columbia Research Grid
3 | # it runs embarrassingly parallel jobs with multiple tasks
4 | 
5 | import pickle
6 | import pandas as pd
7 | import sqlite3
8 | import time
9 | import sys
10 | from azure.cognitiveservices.search.websearch import WebSearchAPI
11 | from msrest.authentication import CognitiveServicesCredentials
12 | 
13 | subscription_key = ""  # insert your Bing key here
14 | assert subscription_key
15 | client = WebSearchAPI(CognitiveServicesCredentials(subscription_key))
16 | 
17 | client_id = ""
18 | Empty_Search_Word_Err = "Empty Search Word"
19 | 
20 | #############################################################################
21 | # for more details on the Bing Web Search SDK, please refer to
22 | # https://docs.microsoft.com/en-us/azure/cognitive-services/bing-web-search/web-sdk-python-quickstart
23 | 
24 | def bing_web_search_sdk_list(search_word_list, c = 50):
25 |     list_name_url = []
26 |     list_raw = []
27 |     list_urls = []
28 |     for search_word in search_word_list:
29 |         # sanity check to avoid sending an empty query
30 |         if len(search_word) == 0:
31 |             raise Exception(Empty_Search_Word_Err)
32 | 
33 |         web_data_raw = client.web.search(query=search_word, raw = True, count = c)
34 |         raw = web_data_raw.response.text
35 |         list_raw.append(raw)
36 | 
37 |         name_url = []
38 |         urls = []
39 |         if hasattr(web_data_raw.output.web_pages, 'value'):
40 |             for i in web_data_raw.output.web_pages.value:
41 |                 name_url.append((i.name, i.url))
42 |                 urls.append(i.url)
43 | 
44 |         list_name_url.append(str(name_url))
45 |         list_urls.append(str(urls))
46 | 
47 |     return list_name_url, list_raw, list_urls
48 | 
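# ---------------------------------------------------------------------------
# Usage sketch (illustrative only, not executed by the batch job below):
# assuming a valid subscription_key above, a single call on a short list of
# cleaned names returns three parallel lists -- stringified (title, url)
# pairs, the raw JSON responses, and stringified url lists:
#   list_name_url, list_raw, list_urls = bing_web_search_sdk_list(['general electric co'], c = 50)
# The raw responses are later parsed with json.loads in
# process_search_result.py, which keeps the top five urls per name for the
# matching step.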
############################################################################## 49 | 50 | #beg########################################################## 51 | def log_time_used(t1, task, log, mode = 'a'): 52 | t2 = time.time() 53 | t = round(t2-t1,2) 54 | message = f"{task} takes {t}s." 55 | if log == '': 56 | print(message, file = sys.stdout) 57 | else: 58 | with open(log, mode) as f: 59 | print(message, file = f) 60 | return time.time() 61 | #end############################################################ 62 | 63 | t_start = time.time() 64 | t1 = time.time() 65 | 66 | # insert the processed names from SDC/Compustat/PatentsView 67 | with open('sdc/compustat/patentsview_name.pickle', 'rb') as handle: 68 | list_name = pickle.load(handle) 69 | 70 | # take task no, 1-7 for Compustat, 1-10 for SDC, 1-100 for PatentsView 71 | task_num = int(sys.argv[1]) 72 | 73 | logfile = 'search_task' + str(task_num) + '.log' 74 | task_size = 5000 # each task searches 5000 names 75 | task_start = (task_num - 1) * task_size 76 | task_end = min(task_num * task_size, len(list_name)) 77 | list_task = list_name[task_start:task_end] 78 | 79 | df = pd.DataFrame() 80 | df['newname'] = list_task 81 | 82 | # for each task, create a database to store ALL the search results 83 | db = 'search_task' + str(task_num) + '.db' 84 | con = sqlite3.connect(db) 85 | cur = con.cursor() 86 | table = 'newname_task' + str(task_num) 87 | df.to_sql(table, con, if_exists = "replace") 88 | 89 | ########################################### 90 | def show_tables(): 91 | cur.execute("SELECT name FROM sqlite_master WHERE type='table';") 92 | return cur.fetchall() 93 | 94 | def drop_tables(table_name): 95 | query = str( f"DROP TABLE {table_name}" ) 96 | cur.execute(query) 97 | cur.execute("SELECT name FROM sqlite_master WHERE type='table';") 98 | return cur.fetchall() 99 | ################################################# 100 | 101 | result_table = 'search_result_task' + str(task_num) 102 | try: 103 | drop_tables(result_table) 104 | except: 105 | print('no result table in db yet') 106 | 107 | t1 = log_time_used(t1, 'getting ready', log = logfile, mode = 'w+') 108 | 109 | ############################################################################## 110 | ### for each task of 5000 searches, divide it into 5 batches 111 | # after each batch, store them into the corresponding database 112 | # this is to avoid any break-down of the tasks so that the cost of Bing search is wasted 113 | def batch_search_new(n, s, c = 50): 114 | 115 | ############################################# 116 | t1 = time.time() 117 | 118 | begin = (n-1)*s 119 | end = min(n*s, len(list_task)) 120 | name_list = list_task[begin:end] 121 | list_name_url, list_raw, list_urls = bing_web_search_sdk_list(name_list, c) 122 | 123 | df_result = pd.DataFrame() 124 | df_result['newname'] = name_list 125 | df_result['name_url'] = list_name_url 126 | df_result['raw'] = list_raw 127 | df_result['urls'] = list_urls 128 | sql_name = 'sdc_search_result_task' + str(task_num) 129 | df_result.to_sql(sql_name, con, if_exists = 'append') 130 | 131 | t1 = log_time_used(t1, f"query + save {s} searches to sql", log = logfile) 132 | 133 | ##################################################################################### 134 | 135 | t_start = time.time() 136 | batch_size = 1000 # 5 batches within one task 137 | if len(list_task) % batch_size == 0: 138 | batch_round = int(len(list_task) / batch_size) 139 | else: 140 | batch_round = int(len(list_task) / batch_size) + 1 141 | 142 | for batch_num in range(1, 
batch_round+1): 143 | batch_search_new(batch_num, batch_size) 144 | if logfile == '': 145 | # after each batch, print 146 | print(f"processed batch No. {batch_num}") 147 | else: 148 | with open(logfile, 'a') as f: 149 | print(f"processed batch No. {batch_num}", file = f) 150 | 151 | t1 = log_time_used(t_start, f"{batch_round} rounds done", log = logfile) 152 | 153 | #####validation code 154 | #sql = "select * from sdc_search_result_task1 limit 100;" 155 | #df_temp = pd.read_sql(sql, con) 156 | #print(df_temp) 157 | -------------------------------------------------------------------------------- /bing_search/combine_db_tables.py: -------------------------------------------------------------------------------- 1 | #!/NOBACKUP/scratch/dmei19/anaconda3/bin/python 2 | 3 | import pandas as pd 4 | import sqlite3 5 | 6 | 7 | attach_sql = ";" 8 | attach_name = "" 9 | 10 | num_task = 10 # 7 for Compustat / 10 for SDC / 100 for PatentsView 11 | df_list = [None] * num_task 12 | 13 | #prefix = 'compustat' 14 | #prefix = 'pv' 15 | prefix = 'sdc' 16 | db_folder = prefix + '_search_result_db/' 17 | 18 | for i in range(1, num_task+1): 19 | db = db_folder + prefix + '_search_task' + str(i) + '.db' 20 | con = sqlite3.connect(db) 21 | cur = con.cursor 22 | sql = f"select * from {prefix}_search_result_task" + str(i) + attach_sql 23 | df = pd.read_sql(sql, con) 24 | df = df.drop('index', 1) 25 | df_list[i-1] = df 26 | con.close() 27 | 28 | print('save list done') 29 | 30 | db = db_folder + prefix + '_search_all' + attach_name + '.db' 31 | con = sqlite3.connect(db) 32 | for i in range(0, len(df_list)): 33 | dataframe = df_list[i] 34 | table_name = prefix + '_search_result_all' + attach_name 35 | if i == 0: 36 | dataframe.to_sql(table_name, con, if_exists='replace', index = False) 37 | else: 38 | dataframe.to_sql(table_name, con, if_exists='append', index = False) 39 | print(str(i+1) + ' tables processed') 40 | 41 | query = f"select * from {prefix}_search_result_all limit 100;" 42 | df_sample = pd.read_sql(query, con) 43 | print(df_sample) 44 | 45 | con.close() 46 | -------------------------------------------------------------------------------- /bing_search/process_search_result.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed Aug 7 16:25:50 2019 4 | 5 | @author: danie 6 | """ 7 | 8 | import pandas as pd 9 | import sqlite3 10 | import json 11 | import sys 12 | 13 | def print_log(message, logfile = '', mode = 'a'): 14 | if logfile != '': 15 | with open(logfile, mode) as f: 16 | print(message, file = f) 17 | else: 18 | print(message, file = sys.stdout) 19 | return None 20 | 21 | #prefix = 'compustat_' 22 | #prefix = 'pv_' 23 | prefix = 'sdc_' 24 | logfile = '' 25 | 26 | attach_sql = ";" 27 | attach_name = "" 28 | 29 | suffix = 'all' 30 | 31 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 32 | db = db_folder + f"{prefix}search_{suffix}.db" 33 | con = sqlite3.connect(db) 34 | cur = con.cursor 35 | sql = f"select * from {prefix}search_result_" + suffix + attach_sql 36 | df = pd.read_sql(sql, con) 37 | try: 38 | df = df.drop('index', 1) 39 | except: 40 | pass 41 | 42 | con.close() 43 | 44 | print_log('data load done', logfile, 'w+') 45 | 46 | list_urls = [] 47 | list_names = [] 48 | list_newname = list(df['newname']) 49 | list_raw = list(df['raw']) 50 | 51 | for obs in range(0, len(list_newname)): 52 | urls = [] 53 | names = [] 54 | item = list_raw[obs] 55 | newname = list_newname[obs] 56 | dict_item = 
json.loads(item) 57 | if 'webPages' not in dict_item: 58 | print_log(f"{obs} does not have webpage result", logfile) 59 | print_log(newname, logfile) 60 | else: 61 | if 'value' in dict_item['webPages']: 62 | for webpage in dict_item['webPages']['value']: 63 | urls.append(webpage['url']) 64 | names.append(webpage['name']) 65 | list_urls.append(urls) 66 | list_names.append(names) 67 | if (obs+1) % 50000 == 0: 68 | print_log(f"{obs+1} observations processed", logfile) 69 | 70 | print_log("start to pick top 5", logfile) 71 | 72 | list_urls5 = [] 73 | list_names5 = [] 74 | for obs in range(0, len(list_newname)): 75 | urls = list_urls[obs] 76 | names = list_names[obs] 77 | top5 = min(5, len(urls), len(names)) 78 | urls5 = urls[0:top5] 79 | names5 = names[0:top5] 80 | list_urls5.append(str(urls5)) 81 | list_names5.append(str(names5)) 82 | 83 | print_log("start saving to sql", logfile) 84 | 85 | df_result = pd.DataFrame() 86 | df_result['newname'] = list_newname 87 | df_result['urls5'] = list_urls5 88 | df_result['names5'] = list_names5 89 | 90 | db = db_folder + f"{prefix}search_{suffix}_top5.db" 91 | con = sqlite3.connect(db) 92 | cur = con.cursor 93 | 94 | table = f"{prefix}search_{suffix}_top5" 95 | df_result.to_sql(table, con, if_exists = "replace") 96 | con.close() 97 | 98 | print_log("all done", logfile) 99 | 100 | ## validation code 101 | con = sqlite3.connect(db) 102 | sql = 'select * from ' + table + ' limit 10;' 103 | df_test = pd.read_sql(sql, con) 104 | try: 105 | df_test = df_test.drop('index', 1) 106 | except: 107 | pass 108 | con.close() 109 | with open(f"{prefix}test.out", 'w+', encoding='utf-8') as f: 110 | print(df_test, file=f) 111 | 112 | -------------------------------------------------------------------------------- /clean_name/compustat_process_name.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Aug 13 17:34:04 2019 4 | 5 | @author: danie 6 | """ 7 | 8 | import my_own_handy_functions as mf 9 | import re 10 | import pandas as pd 11 | import time 12 | import json 13 | 14 | logfile = '' 15 | #logfile = sys.stdout 16 | 17 | t1 = time.time() 18 | 19 | df = pd.read_csv('compustat.csv') # all firm-year observations from compustat 20 | 21 | df_indl = df.loc[df['indfmt']=='INDL'] 22 | df_fs = df.loc[df['indfmt']=='FS'] 23 | df_indl = df_indl.dropna(how='all', subset = ['at', 'sale', 'xrd']) 24 | df_fs = df_fs.dropna(how='all', subset = ['at', 'sale', 'xrd']) 25 | # drop if missing(total assets) & missing(sales) & missing(r&d expenses) 26 | 27 | df_indl_unique = df_indl[['gvkey', 'conml', 'weburl', 'conm']].drop_duplicates(['gvkey', 'conml', 'weburl']) 28 | 29 | list_gvkey = list(df_indl_unique['gvkey']) 30 | 31 | # if use conm, often get null result from bing searc 32 | # note, have to use legal name hereh 33 | list_old_conm = list(df_indl_unique['conml']) 34 | 35 | list_conm = [] 36 | for i in range(0, len(list_old_conm)): 37 | name = list_old_conm[i].lower() 38 | list_conm.append(name) 39 | 40 | #### get all characters in the company names, to check abnormal cases######### 41 | dict_clean_char = {} 42 | for i in range(0,len(list_gvkey)): 43 | name = list_conm[i] 44 | for char in name: 45 | if char != " ": 46 | gvkey = list_gvkey[i] 47 | if char not in dict_clean_char: 48 | dict_clean_char[char] = [(gvkey, name)] 49 | else: 50 | dict_clean_char[char].append((gvkey, name)) 51 | if i % 5000 == 0: 52 | print(i) 53 | 54 | list_char = list(dict_clean_char.keys()) 55 | list_char.sort() 56 | 
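# Illustrative check, not part of the original pipeline: every character
# collected in list_char above should have an entry in dict_char_replace.json,
# otherwise the per-character replacement further below raises a KeyError.
# For example:
#   unmapped = [c for c in list_char if c not in dict_replace]
#   print(unmapped)  # expected to be empty once the table covers every character
# (dict_replace is loaded from the json file a few lines below.)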
################################################################################ 57 | 58 | # dict_replace gives the correct char to replace the old one 59 | with open('dict_char_replace.json', 'r') as f: 60 | dict_replace = json.load(f) 61 | 62 | ########################################################## 63 | 64 | for i in range(0, len(list_conm)): 65 | name = list_conm[i] 66 | if '.,' in name: 67 | newname = name.replace('.,', ' ') 68 | list_conm[i] = newname 69 | 70 | ############################################################# 71 | # below to find x.x.x.x.x.x.x from 10 x(s) to 3 x(s) 72 | def find_pattern(name): 73 | for i in range(10,1,-1): 74 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') 75 | m = re.search(temp_re, name) 76 | if m: 77 | print(name) 78 | print(m.group(0)) 79 | return m.group(0) 80 | 81 | def fix_pattern(name, i): # i from 10 to 1 82 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') # means x.x.x... (from 11x to 2x) 83 | m = re.search(temp_re, name) 84 | if m: 85 | new_re = ''.join(ele for ele in ['\\' + str(j) for j in range(1, i+1+1)]) 86 | # for example, when i = 5, new_re = r"\1\2\3\4\5\6" 87 | newname = temp_re.sub(new_re, name) 88 | return newname 89 | else: 90 | return name 91 | n = 0 92 | for i in range(0, len(list_conm)): 93 | name = list_conm[i] 94 | newname = list_conm[i] 95 | for n_x in range(10, 0, -1): 96 | newname = fix_pattern(newname, n_x) 97 | if newname != name: 98 | n+=1 99 | list_conm[i] = newname 100 | 101 | ################################################################ 102 | # use dict_replace to clean every char 103 | list_conm_afcharc = [] 104 | for i in range(0, len(list_conm)): 105 | name = list_conm[i] 106 | newchar_list = [] 107 | for char in name: 108 | if char != ' ': 109 | newchar_list.append(dict_replace[char]) 110 | else: 111 | newchar_list.append(' ') 112 | newname = ''.join(newchar for newchar in newchar_list) 113 | list_conm_afcharc.append(newname) 114 | 115 | # dont replace . as space, keep dot, because for .com or .net keeping them has better results for search 116 | dot2replace_re = re.compile(r"(\. 
)|\.$|^\.") # dot space or dot at the end of the string or dot at beg 117 | for i in range(0, len(list_conm_afcharc)): 118 | name = list_conm_afcharc[i] 119 | newname = dot2replace_re.sub(' ', name) 120 | list_conm_afcharc[i] = newname 121 | 122 | white0 = r" +" # >=1 whitespace 123 | white0_re = re.compile(white0) 124 | for i in range(0, len(list_conm_afcharc)): 125 | name = list_conm_afcharc[i] 126 | newname = white0_re.sub(' ', name) 127 | list_conm_afcharc[i] = newname 128 | 129 | white1 = r"^ | $" # begin or end with whitespace 130 | white1_re = re.compile(white1) 131 | for i in range(0, len(list_conm_afcharc)): 132 | name = list_conm_afcharc[i] 133 | newname = white1_re.sub('',name) 134 | list_conm_afcharc[i] = newname 135 | 136 | # take care of u s, u s a 137 | usa_re = re.compile(r"\b(u) \b(s) \b(a)\b") 138 | us_re = re.compile(r"\b(u) \b(s)\b") 139 | for i in range(0, len(list_conm_afcharc)): 140 | name = list_conm_afcharc[i] 141 | newname = usa_re.sub('usa', name) 142 | newname = us_re.sub('us', newname) 143 | list_conm_afcharc[i] = newname 144 | 145 | mf.pickle_dump(list_gvkey, 'list_compustat_gvkey') 146 | mf.pickle_dump(list_old_conm, 'list_compustat_conml') 147 | mf.pickle_dump(list_conm_afcharc, 'list_compustat_newname') 148 | 149 | -------------------------------------------------------------------------------- /clean_name/dict_char_replace.json: -------------------------------------------------------------------------------- 1 | {"!": " ", "#": "#", "$": "s", "%": "%", "&": " & ", "'": "'", "(": " ", ")": " ", "*": "*", "+": "+", ",": " ", "-": " ", ".": ".", "/": "/", "0": "0", "1": "1", "2": "2", "3": "3", "4": "4", "5": "5", "6": "6", "7": "7", "8": "8", "9": "9", ":": ":", ";": " ", "<": " ", "=": "=", ">": " ", "?": " ", "@": " ", "[": " ", "\\": " ", "]": " ", "_": " ", "`": " ", "a": "a", "b": "b", "c": "c", "d": "d", "e": "e", "f": "f", "g": "g", "h": "h", "i": "i", "j": "j", "k": "k", "l": "l", "m": "m", "n": "n", "o": "o", "p": "p", "q": "q", "r": "r", "s": "s", "t": "t", "u": "u", "v": "v", "w": "w", "x": "x", "y": "y", "z": "z", "{": " ", "|": " ", "}": " ", "\u00a3": " ", "\u00a7": "e", "\u00a9": " ", "\u00ae": " ", "\u00b0": " ", "\u00b1": " ", "\u00b2": "2", "\u00b3": "3", "\u00b4": "\u00b4", "\u00b7": " ", "\u00bd": "", "\u00bf": " ", "\u00d7": "\u00d7", "\u00e0": "a", "\u00e1": "a", "\u00e2": "a", "\u00e3": "a", "\u00e4": "ae", "\u00e5": "a", "\u00e6": "ae", "\u00e7": "c", "\u00e8": "e", "\u00e9": "e", "\u00ea": "e", "\u00eb": "e", "\u00ec": "i", "\u00ed": "i", "\u00ee": "i", "\u00ef": "i", "\u00f0": "d", "\u00f1": "n", "\u00f2": "o", "\u00f3": "o", "\u00f4": "o", "\u00f5": "o", "\u00f6": "o", "\u00f7": " ", "\u00f8": "\u00f8", "\u00f9": "u", "\u00fa": "u", "\u00fb": "u", "\u00fc": "u", "\u00fd": "y", "\u00ff": "y", "\u0101": "a", "\u0105": "a", "\u0107": "c", "\u010b": "c", "\u0113": "e", "\u0117": "e", "\u012b": "i", "\u0142": "l", "\u0144": "n", "\u014d": "o", "\u0151": "o", "\u0155": "r", "\u015b": "s", "\u015f": "s", "\u0169": "u", "\u016b": "u", "\u0171": "u", "\u017c": "z", "\u01f5": "g", "\u02dc": " ", "\u0304": "", "\u0305": "", "\u03b2": "\u03b2", "\u03b5": "s", "\u03bb": "a", "\u03bf": "o", "\u04ab": "c", "\u2013": " ", "\u2014": " ", "\u2018": " ", "\u2019": " ", "\u201c": " ", "\u201d": " ", "\u2022": " ", "\u2032": " ", "\u2033": " ", "\u203b": " ", "\u2082": "2", "\u208b": " ", "\u2122": " tm ", "\u2205": " ", "\u2212": " ", "\u2605": "*", "\u266f": "#", "\u2713": "v"} 
-------------------------------------------------------------------------------- /clean_name/patentsview_process_name.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Sat Jul 6 22:04:33 2019 4 | 5 | @author: Danqing Mei 6 | """ 7 | 8 | import pandas as pd 9 | import re 10 | import html 11 | import json 12 | import my_own_handy_functions as mf 13 | 14 | rawassignee = pd.read_stata('rawassignee_noquote.dta') # patent number-assignee name file from patentsview 15 | rawassignee = rawassignee.loc[rawassignee['dummy_raw_org']==1] 16 | rawassignee_nodup = rawassignee.drop_duplicates(['raw_organization'], keep='first', inplace=False) 17 | list_raworg_nodup = list(rawassignee_nodup['raw_organization']) 18 | list_patentid_nodup = list(rawassignee_nodup['patent_id']) 19 | 20 | list_cleanorg = [] 21 | for i in range(0, len(list_raworg_nodup)): 22 | raw_name = list_raworg_nodup[i] 23 | # this unescape takes care most of the "&" 24 | clean_name = html.unescape(raw_name) 25 | # some exceptions below 26 | if '&Circlesolid;' not in clean_name and '&thgr;' not in clean_name and '&dgr;' not in clean_name: 27 | list_cleanorg.append(clean_name.lower()) 28 | else: 29 | clean_name = clean_name.replace('&Circlesolid;', ' ') 30 | clean_name = clean_name.replace('&thgr;', 'o') #'\u03B8', actually should be 'o' 31 | clean_name = clean_name.replace('&dgr;', '-') # '\u03B4', actually should be '-' 32 | list_cleanorg.append(clean_name.lower()) 33 | 34 | # take care of char ";" 35 | checkfmt = re.compile(r'\d+;') # at least one digit followed by a ";" 36 | for i in range(0, len(list_cleanorg)): 37 | name = list_cleanorg[i] 38 | match = re.search(checkfmt, name, flags=0) 39 | if match: 40 | name = name.replace(match.group(0), '') 41 | list_cleanorg[i] = name 42 | 43 | ''' 44 | # check special format about ; 45 | # indeed first check all names containing ; 46 | # then use the following regex 47 | 48 | checkfmt = re.compile(r'(^|\S+);\S+') 49 | # begin of the string or any non-white space (one or more) + ; + any non-white space (one or more) 50 | for i in range(0,len(list_cleanorg)): 51 | name = list_cleanorg[i] 52 | match = re.search(checkfmt, name) 53 | if match: 54 | print( name + ' ' + list_patentid_nodup[i]) 55 | ''' 56 | 57 | # These are stuff need to take care 58 | err_fmt = ['f;vis', ';3m', ';bri', 'hô ;', 'sil;verbrook', 'el;ectronics', 'people;s', 's;p.a.', 'co;,'] 59 | crr_fmt = ['vis' , '3m' , 'bri' , 'hô' , 'silverbrook' , 'electronics' , 'people\'s','s.p.a.', 'co,' ] 60 | 61 | for i in range(0, len(list_cleanorg)): 62 | name = list_cleanorg[i] 63 | for j in range(0, len(err_fmt)): 64 | err = err_fmt[j] 65 | crr = crr_fmt[j] 66 | if err in name: 67 | newname = name.replace(err, crr) 68 | list_cleanorg[i] = newname 69 | 70 | 71 | post = r"( |\()a corp.*of.*$" # take care of "a corp... of..." 
72 | post_re = re.compile(post) 73 | for i in range(0, len(list_cleanorg)): 74 | name = list_cleanorg[i] 75 | newname = post_re.sub('',name) 76 | list_cleanorg[i] = newname 77 | 78 | ''' 79 | # get a dictionary of all char in the assignee names to check later 80 | 81 | dict_clean_char = {} 82 | for i in range(0,len(list_patentid_nodup)): 83 | name = list_cleanorg[i] 84 | for char in name: 85 | if char != " ": 86 | patent_id = list_patentid_nodup[i] 87 | if char not in dict_clean_char: 88 | dict_clean_char[char] = {patent_id:name} 89 | else: 90 | dict_clean_char[char].update({patent_id:name}) 91 | if i % 10000 == 0: 92 | print(i) 93 | 94 | with open('dict_clean_char.pickle', 'wb') as handle: 95 | pickle.dump(dict_clean_char, handle, protocol = pickle.HIGHEST_PROTOCOL) 96 | 97 | with open('dict_clean_char.pickle', 'rb') as handle: 98 | dict_clean_char = pickle.load(handle) 99 | 100 | list_char = list(dict_clean_char.keys()) 101 | list_char.sort() 102 | ''' 103 | 104 | # dict_replace gives the correct char to replace the old one 105 | with open('dict_char_replace.json', 'r') as f: 106 | dict_replace = json.load(f) 107 | 108 | # change ., to space 109 | for i in range(0, len(list_cleanorg)): 110 | name = list_cleanorg[i] 111 | if '.,' in name: 112 | newname = name.replace('.,', ' ') 113 | list_cleanorg[i] = newname 114 | 115 | ##### below to find x.x.x.x.x.x.x from 10 x(s) to 3 x(s) ##################### 116 | def find_pattern(name): 117 | for i in range(10,1,-1): 118 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') 119 | m = re.search(temp_re, name) 120 | if m: 121 | print(name) 122 | print(m.group(0)) 123 | return m.group(0) 124 | 125 | def fix_pattern(name, i): # i from 10 to 1 126 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') # means x.x.x... 
(from 11x to 2x) 127 | m = re.search(temp_re, name) 128 | if m: 129 | new_re = ''.join(ele for ele in ['\\' + str(j) for j in range(1, i+1+1)]) 130 | # for example, when i = 5, new_re = r"\1\2\3\4\5\6" 131 | newname = temp_re.sub(new_re, name) 132 | return newname 133 | else: 134 | return name 135 | 136 | n = 0 137 | for i in range(0, len(list_cleanorg)): 138 | name = list_cleanorg[i] 139 | newname = list_cleanorg[i] 140 | for n_x in range(10, 0, -1): 141 | newname = fix_pattern(newname, n_x) 142 | if newname != name: 143 | n+=1 144 | list_cleanorg[i] = newname 145 | ############################################################################ 146 | ################ begin to take care of {} ################################# 147 | match_re = re.compile(r"{.*over.*\((.)\)}") 148 | 149 | ''' 150 | check all these strange { over ()} cases 151 | for patentid, name in dict_clean_char[list_char[62]].items(): 152 | m = re.search(match_re, name) 153 | if m: 154 | if m.group(1) == ' ': 155 | print(patentid) 156 | print(name) 157 | print(m.group(1)) 158 | ''' 159 | 160 | n=0 161 | for i in range(0, len(list_cleanorg)): 162 | name = list_cleanorg[i] 163 | m = re.search(match_re, name) 164 | if m: 165 | if m.group(1) == ' ': 166 | replace_char = '' 167 | else: 168 | replace_char = m.group(1) 169 | newname = re.sub(match_re, replace_char, name) 170 | list_cleanorg[i] = newname 171 | n+=1 172 | ########################################################################## 173 | 174 | ##### clean every char to correct ones ############################## 175 | list_cleanorg_afcharc = [] 176 | for i in range(0, len(list_cleanorg)): 177 | name = list_cleanorg[i] 178 | newchar_list = [] 179 | for char in name: 180 | if char != ' ': 181 | newchar_list.append(dict_replace[char]) 182 | else: 183 | newchar_list.append(' ') 184 | newname = ''.join(newchar for newchar in newchar_list) 185 | list_cleanorg_afcharc.append(newname) 186 | ###################################################### 187 | 188 | # process dot a bit more carefully because .com or .net cannot replace dot as space, dont have meaningful search results 189 | dot2replace_re = re.compile(r"(\. 
)|\.$|^\.") # dot space or dot at the end of the string or dot at beg 190 | for i in range(0, len(list_cleanorg_afcharc)): 191 | name = list_cleanorg_afcharc[i] 192 | newname = dot2replace_re.sub(' ', name) 193 | list_cleanorg_afcharc[i] = newname 194 | 195 | white0 = r" +" # >=1 whitespace 196 | white0_re = re.compile(white0) 197 | for i in range(0, len(list_cleanorg_afcharc)): 198 | name = list_cleanorg_afcharc[i] 199 | newname = white0_re.sub(' ', name) 200 | list_cleanorg_afcharc[i] = newname 201 | 202 | white1 = r"^ | $" # begin or end with whitespace 203 | white1_re = re.compile(white1) 204 | for i in range(0, len(list_cleanorg_afcharc)): 205 | name = list_cleanorg_afcharc[i] 206 | newname = white1_re.sub('',name) 207 | list_cleanorg_afcharc[i] = newname 208 | 209 | # take care of u s, u s a 210 | usa_re = re.compile(r"\b(u) \b(s) \b(a)\b") 211 | us_re = re.compile(r"\b(u) \b(s)\b") 212 | for i in range(0, len(list_cleanorg_afcharc)): 213 | name = list_cleanorg_afcharc[i] 214 | newname = usa_re.sub('usa', name) 215 | newname = us_re.sub('us', newname) 216 | list_cleanorg_afcharc[i] = newname 217 | 218 | # take care of "a l'energie" 219 | temp_re = re.compile(r"\ba *l'* *energie") 220 | for i in range(0, len(list_cleanorg_afcharc)): 221 | name = list_cleanorg_afcharc[i] 222 | newname = temp_re.sub("a l'energie", name) 223 | list_cleanorg_afcharc[i] = newname 224 | 225 | ############################################################### 226 | dict_raw2new = {} 227 | for i in range(0, len(list_raworg_nodup)): 228 | rawname = list_raworg_nodup[i] 229 | newname = list_cleanorg_afcharc[i] 230 | dict_raw2new.update({rawname: newname}) 231 | mf.pickle_dump(dict_raw2new, 'dict_pv_raw2new') 232 | 233 | dict_new2raw = {} 234 | for i in range(0, len(list_raworg_nodup)): 235 | rawname = list_raworg_nodup[i] 236 | newname = list_cleanorg_afcharc[i] 237 | if newname not in dict_new2raw: 238 | dict_new2raw[newname] = {rawname} 239 | else: 240 | dict_new2raw[newname].update({rawname}) 241 | mf.pickle_dump(dict_new2raw, 'dict_pv_new2raw') 242 | 243 | -------------------------------------------------------------------------------- /clean_name/sdc_process_name.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Oct 29 16:33:37 2019 4 | 5 | @author: danie 6 | """ 7 | 8 | import my_own_handy_functions as mf 9 | import re 10 | import json 11 | import pandas as pd 12 | import time 13 | 14 | logfile = '' 15 | 16 | t1 = time.time() 17 | 18 | df = pd.read_stata('../../../MA_Feb27/step3_new_ma_all_acq_public.dta') # M&A file 19 | df = df.loc[df['divestiture']=="N"] 20 | df = df.loc[~(df['targetname'].str.contains('Undisclosed', regex=False))] # delete undisclosed target 21 | 22 | list_dealnumber = list(df['dealnumber']) 23 | list_targetcusip = list(df['targetcusip']) 24 | list_targetname = list(df['targetname']) 25 | 26 | ######## delete all things in (), in SDC, the content within () is usually the immediate parent. 
27 | ######## including it doesn't seem to yield good search results 28 | bracket_re = re.compile(r"\(.*\)") 29 | list_targetname_new = [] 30 | for i in range(0,len(list_targetname)): 31 | name = list_targetname[i] 32 | if re.search(bracket_re, name): 33 | newname = bracket_re.sub("",name) 34 | print(name) 35 | print(newname) 36 | list_targetname_new.append(newname) 37 | else: 38 | list_targetname_new.append(name) 39 | ################################################# 40 | 41 | # dict_replace gives the correct char to replace the old one 42 | with open('dict_char_replace.json', 'r') as f: 43 | dict_replace = json.load(f) 44 | 45 | 46 | for i in range(0, len(list_targetname_new)): 47 | newname = list_targetname_new[i].lower() 48 | list_targetname_new[i] = newname 49 | 50 | ####### clean every char to correct one ############## 51 | list_targetname_afcharc = [] 52 | for i in range(0, len(list_targetname_new)): 53 | name = list_targetname_new[i] 54 | newchar_list = [] 55 | for char in name: 56 | if char != ' ': 57 | newchar_list.append(dict_replace[char]) 58 | else: 59 | newchar_list.append(' ') 60 | newname = ''.join(newchar for newchar in newchar_list) 61 | list_targetname_afcharc.append(newname) 62 | ########################################################### 63 | 64 | ### process dot carefully, because of .com and .net 65 | dot2replace_re = re.compile(r"(\. )|\.$|^\.") # dot space or dot at the end of the string or dot at beg 66 | for i in range(0, len(list_targetname_afcharc)): 67 | name = list_targetname_afcharc[i] 68 | newname = dot2replace_re.sub(' ', name) 69 | list_targetname_afcharc[i] = newname 70 | 71 | white0 = r" +" # >=1 whitespace 72 | white0_re = re.compile(white0) 73 | for i in range(0, len(list_targetname_afcharc)): 74 | name = list_targetname_afcharc[i] 75 | newname = white0_re.sub(' ', name) 76 | list_targetname_afcharc[i] = newname 77 | 78 | white1 = r"^ | $" # begin or end with whitespace 79 | white1_re = re.compile(white1) 80 | for i in range(0, len(list_targetname_afcharc)): 81 | name = list_targetname_afcharc[i] 82 | newname = white1_re.sub('',name) 83 | list_targetname_afcharc[i] = newname 84 | 85 | # take care of u s, u s a 86 | usa_re = re.compile(r"\b(u) \b(s) \b(a)\b") 87 | us_re = re.compile(r"\b(u) \b(s)\b") 88 | for i in range(0, len(list_targetname_afcharc)): 89 | name = list_targetname_afcharc[i] 90 | newname = usa_re.sub('usa', name) 91 | newname = us_re.sub('us', newname) 92 | list_targetname_afcharc[i] = newname 93 | 94 | mf.pickle_dump(list_dealnumber, 'ma_acq_public_dealnumber') 95 | mf.pickle_dump(list_targetname_afcharc, 'ma_acq_public_targetname_processed') 96 | -------------------------------------------------------------------------------- /match/link_pv2compustat.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import sqlite3 3 | import my_own_handy_functions as mf 4 | from itertools import combinations 5 | import re 6 | 7 | prefix = "pv_" 8 | logfile = '' 9 | 10 | attach_sql = ";" 11 | attach_name = "" 12 | 13 | i = 2 14 | task_i_suffix = f'task{i}' 15 | suffix = "all_top5" 16 | 17 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 18 | db = db_folder + f"{prefix}search_{suffix}.db" 19 | con = sqlite3.connect(db) 20 | print(mf.show_tables(con)) 21 | 22 | sql = f"select * from {prefix}search_" + suffix + attach_sql 23 | df_pv = pd.read_sql(sql, con) 24 | try: 25 | df_pv = df_pv.drop('index', 1) 26 | except: 27 | pass 28 | 29 | prefix = "compustat_" 30 | 
db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 31 | db = db_folder + f"{prefix}search_{suffix}.db" 32 | con = sqlite3.connect(db) 33 | print(mf.show_tables(con)) 34 | sql = f"select * from {prefix}search_" + suffix + attach_sql 35 | df_compustat = pd.read_sql(sql, con) 36 | try: 37 | df_compustat = df_compustat.drop('index', 1) 38 | except: 39 | pass 40 | df_compustat = df_compustat.drop(['gvkey', 'compustat_conml'], 1) 41 | df_compustat = df_compustat.drop_duplicates(['compustat_newname']) 42 | 43 | list_newname_compustat = list(df_compustat['compustat_newname']) 44 | list_urls5_compustat = list(df_compustat['urls5']) 45 | list_newname_pv = list(df_pv['pv_newname']) 46 | list_urls5_pv = list(df_pv['urls5']) 47 | 48 | ## clean the url a bit 49 | http_re = re.compile(r"^http\:\/\/|^https\:\/\/|\/$") 50 | 51 | ## index all the unique urls 52 | set_url_compustat = set(http_re.sub('',url) for urls5 in list_urls5_compustat for url in eval(urls5)) 53 | set_url = set(http_re.sub('',url) for urls5 in list_urls5_pv for url in eval(urls5)) 54 | set_url.update(set_url_compustat) 55 | 56 | list_url = list(set_url) 57 | dict_all_url_index = {} 58 | for i in range(0, len(list_url)): 59 | dict_all_url_index.update({i: list_url[i], list_url[i]: i}) 60 | 61 | ### change 5 urls to their corresponding indexes 62 | def newname_url_index(list_urls5, list_newname): 63 | dict_newname_url_index = {} 64 | for i in range(0, len(list_newname)): 65 | newname = list_newname[i] 66 | urls5 = eval(list_urls5[i]) 67 | urls5_index = frozenset([dict_all_url_index[http_re.sub('',url)] for url in urls5]) 68 | dict_newname_url_index[newname] = urls5_index 69 | return dict_newname_url_index 70 | 71 | # dictionary of newname to a set of 5 urls 72 | dict_newname_url_index_pv = newname_url_index(list_urls5_pv, list_newname_pv) 73 | dict_newname_url_index_compustat = newname_url_index(list_urls5_compustat, list_newname_compustat) 74 | 75 | # construct a dictionary of n-urls to newname 76 | ##### n means that how many matches of urls are required here 77 | #### suf means suffix 78 | def urls_index_dict(n, # number of urls to be hashed 79 | suf, # '_pv' or '_compustat' 80 | ): 81 | 82 | dict_urls = {} 83 | list_newname = globals()[f"list_newname{suf}"] 84 | for i in range(0, len(list_newname)): 85 | newname = list_newname[i] 86 | urls5_index = globals()[f"dict_newname_url_index{suf}"][newname] 87 | if len(urls5_index) >= n: 88 | # hash all the 5Cn combinations of urls 89 | urls5_index_c5n = combinations(urls5_index, n) 90 | for urls_item in urls5_index_c5n: 91 | urls_key = frozenset(urls_item) 92 | if urls_key not in dict_urls: 93 | dict_urls[urls_key] = {newname} 94 | else: 95 | dict_urls[urls_key].update({newname}) 96 | return dict_urls 97 | 98 | dict_pv2compustat = [{}, {}, {}, {}] 99 | #### matching based on 5/4/3/2 urls 100 | for n in range(5,1,-1): 101 | # hash all 5Cn combinations of urls 102 | dict_urls_pv = urls_index_dict(n, '_pv') 103 | dict_urls_compustat = urls_index_dict(n, '_compustat') 104 | print(f"{n} loading done") 105 | 106 | # for each hashed n-url in compustat 107 | for key, value in dict_urls_compustat.items(): 108 | # if this hashed n-url can also be found in PV, give a match 109 | if key in dict_urls_pv: 110 | newname_compustat = list(value) 111 | newname_pv_list = list(dict_urls_pv[key]) 112 | for newname_pv in newname_pv_list: 113 | if newname_pv not in dict_pv2compustat[5-n]: 114 | dict_pv2compustat[5-n][newname_pv] = set(newname_compustat) 115 | else: 116 | 
dict_pv2compustat[5-n][newname_pv].update(set(newname_compustat)) 117 | print(f"{n} dict done") 118 | dict_urls_pv = {} 119 | dict_urls_compustat = {} 120 | 121 | mf.pickle_dump(dict_pv2compustat, 'dict_pv2compustat') 122 | mf.pickle_dump(dict_all_url_index, 'dict_all_url_index_pv2compustat') 123 | -------------------------------------------------------------------------------- /match/link_pv2sdc.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import sqlite3 3 | import my_own_handy_functions as mf 4 | from itertools import combinations 5 | import re 6 | 7 | prefix = "pv_" 8 | logfile = '' 9 | 10 | attach_sql = ";" 11 | attach_name = "" 12 | suffix = "all_top5" 13 | 14 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 15 | db = db_folder + f"{prefix}search_{suffix}.db" 16 | con = sqlite3.connect(db) 17 | print(mf.show_tables(con)) 18 | 19 | sql = f"select * from {prefix}search_" + suffix + attach_sql 20 | df_pv = pd.read_sql(sql, con) 21 | try: 22 | df_pv = df_pv.drop('index', 1) 23 | except: 24 | pass 25 | ######################################### 26 | 27 | prefix = "sdc_" 28 | 29 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 30 | db = db_folder + f"{prefix}search_{suffix}.db" 31 | con = sqlite3.connect(db) 32 | print(mf.show_tables(con)) 33 | sql = f"select * from {prefix}search_" + suffix + attach_sql 34 | df_sdc = pd.read_sql(sql, con) 35 | try: 36 | df_sdc = df_sdc.drop('index', 1) 37 | except: 38 | pass 39 | 40 | df_sdc = df_sdc.drop_duplicates(['newname']) 41 | 42 | list_newname_sdc = list(df_sdc['newname']) 43 | list_urls5_sdc = list(df_sdc['urls5']) 44 | list_newname_pv = list(df_pv['pv_newname']) 45 | list_urls5_pv = list(df_pv['urls5']) 46 | 47 | # clean url a bit 48 | http_re = re.compile(r"^http\:\/\/|^https\:\/\/|\/$") 49 | 50 | ## index all the unique urls 51 | set_url_sdc = set(http_re.sub('',url) for urls5 in list_urls5_sdc for url in eval(urls5)) 52 | set_url = set(http_re.sub('',url) for urls5 in list_urls5_pv for url in eval(urls5)) 53 | set_url.update(set_url_sdc) 54 | 55 | list_url = list(set_url) 56 | dict_all_url_index = {} 57 | for i in range(0, len(list_url)): 58 | dict_all_url_index.update({i: list_url[i], list_url[i]: i}) 59 | 60 | ### change 5 urls to their corresponding indexes 61 | def newname_url_index(list_urls5, list_newname): 62 | dict_newname_url_index = {} 63 | for i in range(0, len(list_newname)): 64 | newname = list_newname[i] 65 | urls5 = eval(list_urls5[i]) 66 | urls5_index = frozenset([dict_all_url_index[http_re.sub('',url)] for url in urls5]) 67 | dict_newname_url_index[newname] = urls5_index 68 | return dict_newname_url_index 69 | 70 | # dictionary of newname to a set of 5 urls 71 | dict_newname_url_index_pv = newname_url_index(list_urls5_pv, list_newname_pv) 72 | dict_newname_url_index_sdc = newname_url_index(list_urls5_sdc, list_newname_sdc) 73 | 74 | # construct a dictionary of n-urls to newname 75 | ##### n means that how many matches of urls are required here 76 | #### suf means suffix 77 | def urls_index_dict(n, # number of urls to be hashed 78 | suf, # '_pv' or '_sdc' 79 | ): 80 | 81 | dict_urls = {} 82 | list_newname = globals()[f"list_newname{suf}"] 83 | for i in range(0, len(list_newname)): 84 | newname = list_newname[i] 85 | urls5_index = globals()[f"dict_newname_url_index{suf}"][newname] 86 | if len(urls5_index) >= n: 87 | # hash all the 5Cn combinations of urls 88 | urls5_index_c5n = combinations(urls5_index, n) 89 | 
for urls_item in urls5_index_c5n: 90 | urls_key = frozenset(urls_item) 91 | if urls_key not in dict_urls: 92 | dict_urls[urls_key] = {newname} 93 | else: 94 | dict_urls[urls_key].update({newname}) 95 | return dict_urls 96 | 97 | dict_pv2sdc = [{}, {}, {}, {}] 98 | #### matching based on 5/4/3/2 urls 99 | for n in range(5,1,-1): 100 | # hash all 5Cn combinations of urls 101 | dict_urls_pv = urls_index_dict(n, '_pv') 102 | dict_urls_sdc = urls_index_dict(n, '_sdc') 103 | print(f"{n} loading done") 104 | # for each hashed n-url in sdc 105 | for key, value in dict_urls_sdc.items(): 106 | if key in dict_urls_pv: 107 | # if this hashed n-url can also be found in PV, give a match 108 | newname_sdc = list(value) 109 | newname_pv_list = list(dict_urls_pv[key]) 110 | for newname_pv in newname_pv_list: 111 | if newname_pv not in dict_pv2sdc[5-n]: 112 | dict_pv2sdc[5-n][newname_pv] = set(newname_sdc) 113 | else: 114 | dict_pv2sdc[5-n][newname_pv].update(set(newname_sdc)) 115 | print(f"{n} dict done") 116 | dict_urls_pv = {} 117 | dict_urls_sdc = {} 118 | 119 | mf.pickle_dump(dict_pv2sdc, 'dict_pv2sdc') 120 | mf.pickle_dump(dict_all_url_index, 'dict_all_url_index_pv2sdc') 121 | -------------------------------------------------------------------------------- /my_own_handy_functions.py: -------------------------------------------------------------------------------- 1 | # some handy functions 2 | 3 | import pickle 4 | import time 5 | import sys 6 | 7 | def pickle_dump(dict2dump, dict_name): 8 | pickle_name = dict_name + '.pickle' 9 | with open(pickle_name, 'wb') as handle: 10 | pickle.dump(dict2dump, handle, protocol = pickle.HIGHEST_PROTOCOL) 11 | return pickle_name 12 | 13 | def pickle_load(dict_name): 14 | pickle_name = dict_name + '.pickle' 15 | with open(pickle_name, 'rb') as handle: 16 | return pickle.load(handle) 17 | 18 | def show_tables(con): 19 | cur = con.cursor() 20 | cur.execute("SELECT name FROM sqlite_master WHERE type='table';") 21 | return cur.fetchall() 22 | 23 | def log_time_used(t1, task, log, mode = 'a'): 24 | t2 = time.time() 25 | t = round(t2-t1,2) 26 | message = f"{task} takes {t}s." 27 | if log == '': 28 | print(message, file = sys.stdout) 29 | else: 30 | with open(log, mode) as f: 31 | print(message, file = f) 32 | return time.time() 33 | 34 | from azure.cognitiveservices.search.websearch import WebSearchAPI 35 | from msrest.authentication import CognitiveServicesCredentials 36 | import json 37 | 38 | def simple_bing_search(search_word): 39 | subscription_key = "" # input key from Bing search 40 | assert subscription_key 41 | client = WebSearchAPI(CognitiveServicesCredentials(subscription_key)) 42 | web_data_raw = client.web.search(query=search_word, raw = True, count = 50) 43 | raw_text = web_data_raw.response.text 44 | raw_dict = json.loads(raw_text) 45 | return raw_dict 46 | 47 | def print_log(message, logfile = '', mode = 'a'): 48 | if logfile != '': 49 | with open(logfile, mode) as f: 50 | print(message, file = f) 51 | else: 52 | print(message, file = sys.stdout) 53 | return None 54 | --------------------------------------------------------------------------------
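For illustration, the following self-contained sketch shows the URL-overlap matching rule implemented in “match/link_pv2compustat.py” and “match/link_pv2sdc.py”, applied to hypothetical names and URLs. The real scripts index frozensets of n URLs, for n from 5 down to 2, over the full PatentsView and Compustat/SDC samples.

```python
from itertools import combinations

# hypothetical cleaned names and top-5 urls (protocol and trailing '/' already stripped)
pv_top5 = {
    'acme robotics inc': frozenset({
        'acmerobotics.com', 'linkedin.com/company/acme-robotics',
        'crunchbase.com/organization/acme-robotics',
        'bloomberg.com/profile/company/acme', 'acmerobotics.com/about'}),
}
sdc_top5 = {
    'acme robotics': frozenset({
        'acmerobotics.com', 'linkedin.com/company/acme-robotics',
        'wikipedia.org/wiki/acme_robotics',
        'crunchbase.com/organization/acme-robotics', 'example.com/acme'}),
}

def match_names(dict_a, dict_b, n=2):
    """Return (name_a, name_b) pairs whose top-5 url sets share at least n urls,
    found by indexing every n-url combination, as in the link_* scripts."""
    index = {}
    for name, urls in dict_a.items():
        for combo in combinations(sorted(urls), n):
            index.setdefault(frozenset(combo), set()).add(name)
    pairs = set()
    for name, urls in dict_b.items():
        for combo in combinations(sorted(urls), n):
            for other in index.get(frozenset(combo), ()):
                pairs.add((other, name))
    return pairs

print(match_names(pv_top5, sdc_top5))  # {('acme robotics inc', 'acme robotics')}
```

With at least two shared URLs required (n = 2), the two hypothetical names above are matched because their top-5 lists share three URLs.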