├── README.md
├── bing_search
│   ├── bing_search_name.py
│   ├── combine_db_tables.py
│   └── process_search_result.py
├── clean_name
│   ├── compustat_process_name.py
│   ├── dict_char_replace.json
│   ├── patentsview_process_name.py
│   └── sdc_process_name.py
├── match
│   ├── link_pv2compustat.py
│   └── link_pv2sdc.py
└── my_own_handy_functions.py

/README.md:
--------------------------------------------------------------------------------
1 | # Matching USPTO Patent Assignees to Compustat Public Firms and SDC Private Firms
2 | 
3 | This data project is a systematic effort to match assignee names on USPTO patent records, which are sometimes abbreviated or misspelled, to the universe of public firms in Compustat and to all private firms that have been involved in alliances and M&A in SDC Platinum. The current coverage runs from 1976 to 2017 for Compustat firms and from 1985 to 2017 for SDC firms.
4 | 
5 | The algorithm leverages the Bing web search engine and significantly improves upon fuzzy name matching, a common practice in the literature. This document presents a step-by-step guide to our searching and matching algorithm.
6 | 
7 | You may add your email to our [list](https://forms.gle/1ABK6WZk36n9Ye9p9) so that we can inform you once we make the data publicly available. Please contact Danqing Mei (dqmei@ckgsb.edu.cn) or Miao Liu (miao.liu@bc.edu) if you have any other questions.
8 | 
9 | ## Matching Patent Assignees to Compustat Public Firms:
10 | Our procedure for matching patent assignees to Compustat firms is similar to that of Autor et al. (forthcoming). The initial project started before they published their paper and procedure, so there are a few differences between ours and theirs, as detailed below. We also extend their sample period, which ends in 2013, to 2017.
11 | 
12 | 
13 | ## Matching Patent Assignees to SDC Private Firms:
14 | This project is the first attempt in the literature to match patent assignees to private firms using the web search engine approach (Mei, 2020). To illustrate the importance of private firms in the M&A market, the number of public deals (i.e., public acquirer – public target) is 6,175 between 1976 and 2017. Among these deals, 1,665 targets have patent records. The number of private deals (i.e., public acquirer – private target) is 42,206 between 1985 and 2017. Among these deals, 4,019 targets have patent records.
15 | 
16 | ## Matching Procedure
17 | Our matching procedure has four steps.
18 | 
19 | ### Step 1: Download the source data.
20 | - Patent data from the U.S. Patent and Inventor Database (the March 2019 version from PatentsView is used here).
21 | - Linked patent-Compustat data from the NBER Patent Data Project, which covers only patents granted by 2006.
22 | - Linked patent-CRSP data from Kogan et al. (2017), which covers patents granted by 2010.
23 | - Compustat North America. The relevant variables include the Compustat firm ID (gvkey), firm name, and firm website.
24 | - SDC Platinum. The relevant variables include the SDC deal number, the firm names of the target and acquirer, and the firm ID (cusip).
25 | 
26 | ### Step 2: Clean assignee names and Compustat/SDC firm names.
27 | - In this step, we remove punctuation and accent marks. The file “dict_char_replace.json” provides the complete list of punctuation marks and accented characters that we remove or replace. These choices appear to produce the best Bing search results based on our manual checks. The folder “clean_name” contains the code that removes punctuation and accent marks from the assignee and Compustat/SDC firm names. The punctuation-free names are used as input into the Bing search in the next step. A minimal sketch of this character-level cleaning is shown below.
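The sketch below is illustrative and simplified from the scripts in “clean_name”; the full scripts additionally handle patterns such as “x.x.x” abbreviations, trailing dots, “u s a”, and HTML escapes, and they require every character to appear in the replacement table. The input name here is a made-up example.

```python
import json
import re

# character replacement table shipped in this repository
with open('clean_name/dict_char_replace.json', 'r', encoding='utf-8') as f:
    dict_replace = json.load(f)

def clean_firm_name(raw_name):
    # lower-case, then map every non-space character through the table;
    # the actual scripts use dict_replace[char] directly (an unmapped character
    # raises a KeyError), .get() is used here only to keep the sketch short
    name = ''.join(dict_replace.get(ch, ch) if ch != ' ' else ' '
                   for ch in raw_name.lower())
    # collapse repeated whitespace and strip leading/trailing spaces
    return re.sub(r' +', ' ', name).strip()

print(clean_firm_name('Société Générale (USA), Inc.'))  # -> 'societe generale usa inc.'
```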
28 | 
29 | ### Step 3: Use the Bing Web Search API to collect search results in the form of URLs.
30 | - In this step, we first create the CSV input file that contains the punctuation-free firm names.
31 | - Run the Python program “bing_search_name.py” after adding the API key. A paid subscription to the Bing Web Search API is required to perform more than 3,000 searches per month.
32 | - The Python program generates an output table in a SQLite database that contains the links, titles, and descriptions of the top 50 search results returned for each punctuation-free firm name searched on Bing.
33 | - The folder “match” contains the code that cleans the collected URLs (e.g., stripping the protocol and trailing slash) and performs the matching in the next step.
34 | 
35 | ### Step 4: Match assignees to Compustat public firms and SDC private firms using names and URLs.
36 | - In the final step, we consider a patent assignee and a Compustat/SDC firm to be a match if the top five search results for the assignee and for the Compustat/SDC firm share at least two identical URLs.
37 | 
38 | ## References:
39 | Autor, David, David Dorn, Gordon H. Hanson, Gary Pisano, and Pian Shu, 2019, “Foreign Competition and Domestic Innovation: Evidence from U.S. Patents,” American Economic Review: Insights, forthcoming.
40 | 
41 | Kogan, Leonid, Dimitris Papanikolaou, Amit Seru, and Noah Stoffman, 2017, “Technological Innovation, Resource Allocation, and Growth,” The Quarterly Journal of Economics 132, 665–712.
42 | 
43 | Mei, Danqing, 2020, “Technology Development and Corporate Mergers,” Working Paper.
44 | 
--------------------------------------------------------------------------------
/bing_search/bing_search_name.py:
--------------------------------------------------------------------------------
1 | #!/NOBACKUP/scratch/dmei19/anaconda3/bin/python
2 | # this program runs Bing searches on the Columbia Research Grid
3 | # it runs embarrassingly parallel jobs with multiple tasks
4 | 
5 | import pickle
6 | import pandas as pd
7 | import sqlite3
8 | import time
9 | import sys
10 | from azure.cognitiveservices.search.websearch import WebSearchAPI
11 | from msrest.authentication import CognitiveServicesCredentials
12 | 
13 | subscription_key = ""  # insert your Bing key here
14 | assert subscription_key
15 | client = WebSearchAPI(CognitiveServicesCredentials(subscription_key))
16 | 
17 | client_id = ""
18 | Empty_Search_Word_Err = "Empty Search Word"
19 | 
20 | #############################################################################
21 | # for more details on the Bing Web Search SDK, please refer to
22 | # https://docs.microsoft.com/en-us/azure/cognitive-services/bing-web-search/web-sdk-python-quickstart
23 | 
24 | def bing_web_search_sdk_list(search_word_list, c = 50):
25 |     list_name_url = []
26 |     list_raw = []
27 |     list_urls = []
28 |     for search_word in search_word_list:
29 |         # sanity check to avoid sending an empty query
30 |         if len(search_word) == 0:
31 |             raise Exception(Empty_Search_Word_Err)
32 | 
33 |         web_data_raw = client.web.search(query=search_word, raw = True, count = c)
34 |         raw = web_data_raw.response.text
35 |         list_raw.append(raw)
36 | 
37 |         name_url = []
38 |         urls = []
39 |         if hasattr(web_data_raw.output.web_pages, 'value'):
40 |             for i in web_data_raw.output.web_pages.value:
41 |                 name_url.append((i.name, i.url))
42 |                 urls.append(i.url)
43 | 
44 |         list_name_url.append(str(name_url))
45 |         list_urls.append(str(urls))
46 | 
47 |     return list_name_url, list_raw, list_urls
48 | 
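# ---------------------------------------------------------------------------
# Usage sketch (illustrative only, not executed by the batch job below):
# assuming a valid subscription_key above, a single call on a short list of
# cleaned names returns three parallel lists -- stringified (title, url)
# pairs, the raw JSON responses, and stringified url lists:
#   list_name_url, list_raw, list_urls = bing_web_search_sdk_list(['general electric co'], c = 50)
# The raw responses are later parsed with json.loads in
# process_search_result.py, which keeps the top five urls per name for the
# matching step.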
############################################################################## 49 | 50 | #beg########################################################## 51 | def log_time_used(t1, task, log, mode = 'a'): 52 | t2 = time.time() 53 | t = round(t2-t1,2) 54 | message = f"{task} takes {t}s." 55 | if log == '': 56 | print(message, file = sys.stdout) 57 | else: 58 | with open(log, mode) as f: 59 | print(message, file = f) 60 | return time.time() 61 | #end############################################################ 62 | 63 | t_start = time.time() 64 | t1 = time.time() 65 | 66 | # insert the processed names from SDC/Compustat/PatentsView 67 | with open('sdc/compustat/patentsview_name.pickle', 'rb') as handle: 68 | list_name = pickle.load(handle) 69 | 70 | # take task no, 1-7 for Compustat, 1-10 for SDC, 1-100 for PatentsView 71 | task_num = int(sys.argv[1]) 72 | 73 | logfile = 'search_task' + str(task_num) + '.log' 74 | task_size = 5000 # each task searches 5000 names 75 | task_start = (task_num - 1) * task_size 76 | task_end = min(task_num * task_size, len(list_name)) 77 | list_task = list_name[task_start:task_end] 78 | 79 | df = pd.DataFrame() 80 | df['newname'] = list_task 81 | 82 | # for each task, create a database to store ALL the search results 83 | db = 'search_task' + str(task_num) + '.db' 84 | con = sqlite3.connect(db) 85 | cur = con.cursor() 86 | table = 'newname_task' + str(task_num) 87 | df.to_sql(table, con, if_exists = "replace") 88 | 89 | ########################################### 90 | def show_tables(): 91 | cur.execute("SELECT name FROM sqlite_master WHERE type='table';") 92 | return cur.fetchall() 93 | 94 | def drop_tables(table_name): 95 | query = str( f"DROP TABLE {table_name}" ) 96 | cur.execute(query) 97 | cur.execute("SELECT name FROM sqlite_master WHERE type='table';") 98 | return cur.fetchall() 99 | ################################################# 100 | 101 | result_table = 'search_result_task' + str(task_num) 102 | try: 103 | drop_tables(result_table) 104 | except: 105 | print('no result table in db yet') 106 | 107 | t1 = log_time_used(t1, 'getting ready', log = logfile, mode = 'w+') 108 | 109 | ############################################################################## 110 | ### for each task of 5000 searches, divide it into 5 batches 111 | # after each batch, store them into the corresponding database 112 | # this is to avoid any break-down of the tasks so that the cost of Bing search is wasted 113 | def batch_search_new(n, s, c = 50): 114 | 115 | ############################################# 116 | t1 = time.time() 117 | 118 | begin = (n-1)*s 119 | end = min(n*s, len(list_task)) 120 | name_list = list_task[begin:end] 121 | list_name_url, list_raw, list_urls = bing_web_search_sdk_list(name_list, c) 122 | 123 | df_result = pd.DataFrame() 124 | df_result['newname'] = name_list 125 | df_result['name_url'] = list_name_url 126 | df_result['raw'] = list_raw 127 | df_result['urls'] = list_urls 128 | sql_name = 'sdc_search_result_task' + str(task_num) 129 | df_result.to_sql(sql_name, con, if_exists = 'append') 130 | 131 | t1 = log_time_used(t1, f"query + save {s} searches to sql", log = logfile) 132 | 133 | ##################################################################################### 134 | 135 | t_start = time.time() 136 | batch_size = 1000 # 5 batches within one task 137 | if len(list_task) % batch_size == 0: 138 | batch_round = int(len(list_task) / batch_size) 139 | else: 140 | batch_round = int(len(list_task) / batch_size) + 1 141 | 142 | for batch_num in range(1, 
batch_round+1): 143 | batch_search_new(batch_num, batch_size) 144 | if logfile == '': 145 | # after each batch, print 146 | print(f"processed batch No. {batch_num}") 147 | else: 148 | with open(logfile, 'a') as f: 149 | print(f"processed batch No. {batch_num}", file = f) 150 | 151 | t1 = log_time_used(t_start, f"{batch_round} rounds done", log = logfile) 152 | 153 | #####validation code 154 | #sql = "select * from sdc_search_result_task1 limit 100;" 155 | #df_temp = pd.read_sql(sql, con) 156 | #print(df_temp) 157 | -------------------------------------------------------------------------------- /bing_search/combine_db_tables.py: -------------------------------------------------------------------------------- 1 | #!/NOBACKUP/scratch/dmei19/anaconda3/bin/python 2 | 3 | import pandas as pd 4 | import sqlite3 5 | 6 | 7 | attach_sql = ";" 8 | attach_name = "" 9 | 10 | num_task = 10 # 7 for Compustat / 10 for SDC / 100 for PatentsView 11 | df_list = [None] * num_task 12 | 13 | #prefix = 'compustat' 14 | #prefix = 'pv' 15 | prefix = 'sdc' 16 | db_folder = prefix + '_search_result_db/' 17 | 18 | for i in range(1, num_task+1): 19 | db = db_folder + prefix + '_search_task' + str(i) + '.db' 20 | con = sqlite3.connect(db) 21 | cur = con.cursor 22 | sql = f"select * from {prefix}_search_result_task" + str(i) + attach_sql 23 | df = pd.read_sql(sql, con) 24 | df = df.drop('index', 1) 25 | df_list[i-1] = df 26 | con.close() 27 | 28 | print('save list done') 29 | 30 | db = db_folder + prefix + '_search_all' + attach_name + '.db' 31 | con = sqlite3.connect(db) 32 | for i in range(0, len(df_list)): 33 | dataframe = df_list[i] 34 | table_name = prefix + '_search_result_all' + attach_name 35 | if i == 0: 36 | dataframe.to_sql(table_name, con, if_exists='replace', index = False) 37 | else: 38 | dataframe.to_sql(table_name, con, if_exists='append', index = False) 39 | print(str(i+1) + ' tables processed') 40 | 41 | query = f"select * from {prefix}_search_result_all limit 100;" 42 | df_sample = pd.read_sql(query, con) 43 | print(df_sample) 44 | 45 | con.close() 46 | -------------------------------------------------------------------------------- /bing_search/process_search_result.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Wed Aug 7 16:25:50 2019 4 | 5 | @author: danie 6 | """ 7 | 8 | import pandas as pd 9 | import sqlite3 10 | import json 11 | import sys 12 | 13 | def print_log(message, logfile = '', mode = 'a'): 14 | if logfile != '': 15 | with open(logfile, mode) as f: 16 | print(message, file = f) 17 | else: 18 | print(message, file = sys.stdout) 19 | return None 20 | 21 | #prefix = 'compustat_' 22 | #prefix = 'pv_' 23 | prefix = 'sdc_' 24 | logfile = '' 25 | 26 | attach_sql = ";" 27 | attach_name = "" 28 | 29 | suffix = 'all' 30 | 31 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 32 | db = db_folder + f"{prefix}search_{suffix}.db" 33 | con = sqlite3.connect(db) 34 | cur = con.cursor 35 | sql = f"select * from {prefix}search_result_" + suffix + attach_sql 36 | df = pd.read_sql(sql, con) 37 | try: 38 | df = df.drop('index', 1) 39 | except: 40 | pass 41 | 42 | con.close() 43 | 44 | print_log('data load done', logfile, 'w+') 45 | 46 | list_urls = [] 47 | list_names = [] 48 | list_newname = list(df['newname']) 49 | list_raw = list(df['raw']) 50 | 51 | for obs in range(0, len(list_newname)): 52 | urls = [] 53 | names = [] 54 | item = list_raw[obs] 55 | newname = list_newname[obs] 56 | dict_item = 
json.loads(item) 57 | if 'webPages' not in dict_item: 58 | print_log(f"{obs} does not have webpage result", logfile) 59 | print_log(newname, logfile) 60 | else: 61 | if 'value' in dict_item['webPages']: 62 | for webpage in dict_item['webPages']['value']: 63 | urls.append(webpage['url']) 64 | names.append(webpage['name']) 65 | list_urls.append(urls) 66 | list_names.append(names) 67 | if (obs+1) % 50000 == 0: 68 | print_log(f"{obs+1} observations processed", logfile) 69 | 70 | print_log("start to pick top 5", logfile) 71 | 72 | list_urls5 = [] 73 | list_names5 = [] 74 | for obs in range(0, len(list_newname)): 75 | urls = list_urls[obs] 76 | names = list_names[obs] 77 | top5 = min(5, len(urls), len(names)) 78 | urls5 = urls[0:top5] 79 | names5 = names[0:top5] 80 | list_urls5.append(str(urls5)) 81 | list_names5.append(str(names5)) 82 | 83 | print_log("start saving to sql", logfile) 84 | 85 | df_result = pd.DataFrame() 86 | df_result['newname'] = list_newname 87 | df_result['urls5'] = list_urls5 88 | df_result['names5'] = list_names5 89 | 90 | db = db_folder + f"{prefix}search_{suffix}_top5.db" 91 | con = sqlite3.connect(db) 92 | cur = con.cursor 93 | 94 | table = f"{prefix}search_{suffix}_top5" 95 | df_result.to_sql(table, con, if_exists = "replace") 96 | con.close() 97 | 98 | print_log("all done", logfile) 99 | 100 | ## validation code 101 | con = sqlite3.connect(db) 102 | sql = 'select * from ' + table + ' limit 10;' 103 | df_test = pd.read_sql(sql, con) 104 | try: 105 | df_test = df_test.drop('index', 1) 106 | except: 107 | pass 108 | con.close() 109 | with open(f"{prefix}test.out", 'w+', encoding='utf-8') as f: 110 | print(df_test, file=f) 111 | 112 | -------------------------------------------------------------------------------- /clean_name/compustat_process_name.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Aug 13 17:34:04 2019 4 | 5 | @author: danie 6 | """ 7 | 8 | import my_own_handy_functions as mf 9 | import re 10 | import pandas as pd 11 | import time 12 | import json 13 | 14 | logfile = '' 15 | #logfile = sys.stdout 16 | 17 | t1 = time.time() 18 | 19 | df = pd.read_csv('compustat.csv') # all firm-year observations from compustat 20 | 21 | df_indl = df.loc[df['indfmt']=='INDL'] 22 | df_fs = df.loc[df['indfmt']=='FS'] 23 | df_indl = df_indl.dropna(how='all', subset = ['at', 'sale', 'xrd']) 24 | df_fs = df_fs.dropna(how='all', subset = ['at', 'sale', 'xrd']) 25 | # drop if missing(total assets) & missing(sales) & missing(r&d expenses) 26 | 27 | df_indl_unique = df_indl[['gvkey', 'conml', 'weburl', 'conm']].drop_duplicates(['gvkey', 'conml', 'weburl']) 28 | 29 | list_gvkey = list(df_indl_unique['gvkey']) 30 | 31 | # if use conm, often get null result from bing searc 32 | # note, have to use legal name hereh 33 | list_old_conm = list(df_indl_unique['conml']) 34 | 35 | list_conm = [] 36 | for i in range(0, len(list_old_conm)): 37 | name = list_old_conm[i].lower() 38 | list_conm.append(name) 39 | 40 | #### get all characters in the company names, to check abnormal cases######### 41 | dict_clean_char = {} 42 | for i in range(0,len(list_gvkey)): 43 | name = list_conm[i] 44 | for char in name: 45 | if char != " ": 46 | gvkey = list_gvkey[i] 47 | if char not in dict_clean_char: 48 | dict_clean_char[char] = [(gvkey, name)] 49 | else: 50 | dict_clean_char[char].append((gvkey, name)) 51 | if i % 5000 == 0: 52 | print(i) 53 | 54 | list_char = list(dict_clean_char.keys()) 55 | list_char.sort() 56 | 
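# Illustrative check, not part of the original pipeline: every character
# collected in list_char above should have an entry in dict_char_replace.json,
# otherwise the per-character replacement further below raises a KeyError.
# For example:
#   unmapped = [c for c in list_char if c not in dict_replace]
#   print(unmapped)  # expected to be empty once the table covers every character
# (dict_replace is loaded from the json file a few lines below.)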
################################################################################ 57 | 58 | # dict_replace gives the correct char to replace the old one 59 | with open('dict_char_replace.json', 'r') as f: 60 | dict_replace = json.load(f) 61 | 62 | ########################################################## 63 | 64 | for i in range(0, len(list_conm)): 65 | name = list_conm[i] 66 | if '.,' in name: 67 | newname = name.replace('.,', ' ') 68 | list_conm[i] = newname 69 | 70 | ############################################################# 71 | # below to find x.x.x.x.x.x.x from 10 x(s) to 3 x(s) 72 | def find_pattern(name): 73 | for i in range(10,1,-1): 74 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') 75 | m = re.search(temp_re, name) 76 | if m: 77 | print(name) 78 | print(m.group(0)) 79 | return m.group(0) 80 | 81 | def fix_pattern(name, i): # i from 10 to 1 82 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') # means x.x.x... (from 11x to 2x) 83 | m = re.search(temp_re, name) 84 | if m: 85 | new_re = ''.join(ele for ele in ['\\' + str(j) for j in range(1, i+1+1)]) 86 | # for example, when i = 5, new_re = r"\1\2\3\4\5\6" 87 | newname = temp_re.sub(new_re, name) 88 | return newname 89 | else: 90 | return name 91 | n = 0 92 | for i in range(0, len(list_conm)): 93 | name = list_conm[i] 94 | newname = list_conm[i] 95 | for n_x in range(10, 0, -1): 96 | newname = fix_pattern(newname, n_x) 97 | if newname != name: 98 | n+=1 99 | list_conm[i] = newname 100 | 101 | ################################################################ 102 | # use dict_replace to clean every char 103 | list_conm_afcharc = [] 104 | for i in range(0, len(list_conm)): 105 | name = list_conm[i] 106 | newchar_list = [] 107 | for char in name: 108 | if char != ' ': 109 | newchar_list.append(dict_replace[char]) 110 | else: 111 | newchar_list.append(' ') 112 | newname = ''.join(newchar for newchar in newchar_list) 113 | list_conm_afcharc.append(newname) 114 | 115 | # dont replace . as space, keep dot, because for .com or .net keeping them has better results for search 116 | dot2replace_re = re.compile(r"(\. 
)|\.$|^\.") # dot space or dot at the end of the string or dot at beg 117 | for i in range(0, len(list_conm_afcharc)): 118 | name = list_conm_afcharc[i] 119 | newname = dot2replace_re.sub(' ', name) 120 | list_conm_afcharc[i] = newname 121 | 122 | white0 = r" +" # >=1 whitespace 123 | white0_re = re.compile(white0) 124 | for i in range(0, len(list_conm_afcharc)): 125 | name = list_conm_afcharc[i] 126 | newname = white0_re.sub(' ', name) 127 | list_conm_afcharc[i] = newname 128 | 129 | white1 = r"^ | $" # begin or end with whitespace 130 | white1_re = re.compile(white1) 131 | for i in range(0, len(list_conm_afcharc)): 132 | name = list_conm_afcharc[i] 133 | newname = white1_re.sub('',name) 134 | list_conm_afcharc[i] = newname 135 | 136 | # take care of u s, u s a 137 | usa_re = re.compile(r"\b(u) \b(s) \b(a)\b") 138 | us_re = re.compile(r"\b(u) \b(s)\b") 139 | for i in range(0, len(list_conm_afcharc)): 140 | name = list_conm_afcharc[i] 141 | newname = usa_re.sub('usa', name) 142 | newname = us_re.sub('us', newname) 143 | list_conm_afcharc[i] = newname 144 | 145 | mf.pickle_dump(list_gvkey, 'list_compustat_gvkey') 146 | mf.pickle_dump(list_old_conm, 'list_compustat_conml') 147 | mf.pickle_dump(list_conm_afcharc, 'list_compustat_newname') 148 | 149 | -------------------------------------------------------------------------------- /clean_name/dict_char_replace.json: -------------------------------------------------------------------------------- 1 | {"!": " ", "#": "#", "$": "s", "%": "%", "&": " & ", "'": "'", "(": " ", ")": " ", "*": "*", "+": "+", ",": " ", "-": " ", ".": ".", "/": "/", "0": "0", "1": "1", "2": "2", "3": "3", "4": "4", "5": "5", "6": "6", "7": "7", "8": "8", "9": "9", ":": ":", ";": " ", "<": " ", "=": "=", ">": " ", "?": " ", "@": " ", "[": " ", "\\": " ", "]": " ", "_": " ", "`": " ", "a": "a", "b": "b", "c": "c", "d": "d", "e": "e", "f": "f", "g": "g", "h": "h", "i": "i", "j": "j", "k": "k", "l": "l", "m": "m", "n": "n", "o": "o", "p": "p", "q": "q", "r": "r", "s": "s", "t": "t", "u": "u", "v": "v", "w": "w", "x": "x", "y": "y", "z": "z", "{": " ", "|": " ", "}": " ", "\u00a3": " ", "\u00a7": "e", "\u00a9": " ", "\u00ae": " ", "\u00b0": " ", "\u00b1": " ", "\u00b2": "2", "\u00b3": "3", "\u00b4": "\u00b4", "\u00b7": " ", "\u00bd": "", "\u00bf": " ", "\u00d7": "\u00d7", "\u00e0": "a", "\u00e1": "a", "\u00e2": "a", "\u00e3": "a", "\u00e4": "ae", "\u00e5": "a", "\u00e6": "ae", "\u00e7": "c", "\u00e8": "e", "\u00e9": "e", "\u00ea": "e", "\u00eb": "e", "\u00ec": "i", "\u00ed": "i", "\u00ee": "i", "\u00ef": "i", "\u00f0": "d", "\u00f1": "n", "\u00f2": "o", "\u00f3": "o", "\u00f4": "o", "\u00f5": "o", "\u00f6": "o", "\u00f7": " ", "\u00f8": "\u00f8", "\u00f9": "u", "\u00fa": "u", "\u00fb": "u", "\u00fc": "u", "\u00fd": "y", "\u00ff": "y", "\u0101": "a", "\u0105": "a", "\u0107": "c", "\u010b": "c", "\u0113": "e", "\u0117": "e", "\u012b": "i", "\u0142": "l", "\u0144": "n", "\u014d": "o", "\u0151": "o", "\u0155": "r", "\u015b": "s", "\u015f": "s", "\u0169": "u", "\u016b": "u", "\u0171": "u", "\u017c": "z", "\u01f5": "g", "\u02dc": " ", "\u0304": "", "\u0305": "", "\u03b2": "\u03b2", "\u03b5": "s", "\u03bb": "a", "\u03bf": "o", "\u04ab": "c", "\u2013": " ", "\u2014": " ", "\u2018": " ", "\u2019": " ", "\u201c": " ", "\u201d": " ", "\u2022": " ", "\u2032": " ", "\u2033": " ", "\u203b": " ", "\u2082": "2", "\u208b": " ", "\u2122": " tm ", "\u2205": " ", "\u2212": " ", "\u2605": "*", "\u266f": "#", "\u2713": "v"} 
-------------------------------------------------------------------------------- /clean_name/patentsview_process_name.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Sat Jul 6 22:04:33 2019 4 | 5 | @author: Danqing Mei 6 | """ 7 | 8 | import pandas as pd 9 | import re 10 | import html 11 | import json 12 | import my_own_handy_functions as mf 13 | 14 | rawassignee = pd.read_stata('rawassignee_noquote.dta') # patent number-assignee name file from patentsview 15 | rawassignee = rawassignee.loc[rawassignee['dummy_raw_org']==1] 16 | rawassignee_nodup = rawassignee.drop_duplicates(['raw_organization'], keep='first', inplace=False) 17 | list_raworg_nodup = list(rawassignee_nodup['raw_organization']) 18 | list_patentid_nodup = list(rawassignee_nodup['patent_id']) 19 | 20 | list_cleanorg = [] 21 | for i in range(0, len(list_raworg_nodup)): 22 | raw_name = list_raworg_nodup[i] 23 | # this unescape takes care most of the "&" 24 | clean_name = html.unescape(raw_name) 25 | # some exceptions below 26 | if '&Circlesolid;' not in clean_name and '&thgr;' not in clean_name and '&dgr;' not in clean_name: 27 | list_cleanorg.append(clean_name.lower()) 28 | else: 29 | clean_name = clean_name.replace('&Circlesolid;', ' ') 30 | clean_name = clean_name.replace('&thgr;', 'o') #'\u03B8', actually should be 'o' 31 | clean_name = clean_name.replace('&dgr;', '-') # '\u03B4', actually should be '-' 32 | list_cleanorg.append(clean_name.lower()) 33 | 34 | # take care of char ";" 35 | checkfmt = re.compile(r'\d+;') # at least one digit followed by a ";" 36 | for i in range(0, len(list_cleanorg)): 37 | name = list_cleanorg[i] 38 | match = re.search(checkfmt, name, flags=0) 39 | if match: 40 | name = name.replace(match.group(0), '') 41 | list_cleanorg[i] = name 42 | 43 | ''' 44 | # check special format about ; 45 | # indeed first check all names containing ; 46 | # then use the following regex 47 | 48 | checkfmt = re.compile(r'(^|\S+);\S+') 49 | # begin of the string or any non-white space (one or more) + ; + any non-white space (one or more) 50 | for i in range(0,len(list_cleanorg)): 51 | name = list_cleanorg[i] 52 | match = re.search(checkfmt, name) 53 | if match: 54 | print( name + ' ' + list_patentid_nodup[i]) 55 | ''' 56 | 57 | # These are stuff need to take care 58 | err_fmt = ['f;vis', ';3m', ';bri', 'hô ;', 'sil;verbrook', 'el;ectronics', 'people;s', 's;p.a.', 'co;,'] 59 | crr_fmt = ['vis' , '3m' , 'bri' , 'hô' , 'silverbrook' , 'electronics' , 'people\'s','s.p.a.', 'co,' ] 60 | 61 | for i in range(0, len(list_cleanorg)): 62 | name = list_cleanorg[i] 63 | for j in range(0, len(err_fmt)): 64 | err = err_fmt[j] 65 | crr = crr_fmt[j] 66 | if err in name: 67 | newname = name.replace(err, crr) 68 | list_cleanorg[i] = newname 69 | 70 | 71 | post = r"( |\()a corp.*of.*$" # take care of "a corp... of..." 
72 | post_re = re.compile(post) 73 | for i in range(0, len(list_cleanorg)): 74 | name = list_cleanorg[i] 75 | newname = post_re.sub('',name) 76 | list_cleanorg[i] = newname 77 | 78 | ''' 79 | # get a dictionary of all char in the assignee names to check later 80 | 81 | dict_clean_char = {} 82 | for i in range(0,len(list_patentid_nodup)): 83 | name = list_cleanorg[i] 84 | for char in name: 85 | if char != " ": 86 | patent_id = list_patentid_nodup[i] 87 | if char not in dict_clean_char: 88 | dict_clean_char[char] = {patent_id:name} 89 | else: 90 | dict_clean_char[char].update({patent_id:name}) 91 | if i % 10000 == 0: 92 | print(i) 93 | 94 | with open('dict_clean_char.pickle', 'wb') as handle: 95 | pickle.dump(dict_clean_char, handle, protocol = pickle.HIGHEST_PROTOCOL) 96 | 97 | with open('dict_clean_char.pickle', 'rb') as handle: 98 | dict_clean_char = pickle.load(handle) 99 | 100 | list_char = list(dict_clean_char.keys()) 101 | list_char.sort() 102 | ''' 103 | 104 | # dict_replace gives the correct char to replace the old one 105 | with open('dict_char_replace.json', 'r') as f: 106 | dict_replace = json.load(f) 107 | 108 | # change ., to space 109 | for i in range(0, len(list_cleanorg)): 110 | name = list_cleanorg[i] 111 | if '.,' in name: 112 | newname = name.replace('.,', ' ') 113 | list_cleanorg[i] = newname 114 | 115 | ##### below to find x.x.x.x.x.x.x from 10 x(s) to 3 x(s) ##################### 116 | def find_pattern(name): 117 | for i in range(10,1,-1): 118 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') 119 | m = re.search(temp_re, name) 120 | if m: 121 | print(name) 122 | print(m.group(0)) 123 | return m.group(0) 124 | 125 | def fix_pattern(name, i): # i from 10 to 1 126 | temp_re = re.compile('\\b(\\w)' + i*'\\.(\\w)\\b') # means x.x.x... 
(from 11x to 2x) 127 | m = re.search(temp_re, name) 128 | if m: 129 | new_re = ''.join(ele for ele in ['\\' + str(j) for j in range(1, i+1+1)]) 130 | # for example, when i = 5, new_re = r"\1\2\3\4\5\6" 131 | newname = temp_re.sub(new_re, name) 132 | return newname 133 | else: 134 | return name 135 | 136 | n = 0 137 | for i in range(0, len(list_cleanorg)): 138 | name = list_cleanorg[i] 139 | newname = list_cleanorg[i] 140 | for n_x in range(10, 0, -1): 141 | newname = fix_pattern(newname, n_x) 142 | if newname != name: 143 | n+=1 144 | list_cleanorg[i] = newname 145 | ############################################################################ 146 | ################ begin to take care of {} ################################# 147 | match_re = re.compile(r"{.*over.*\((.)\)}") 148 | 149 | ''' 150 | check all these strange { over ()} cases 151 | for patentid, name in dict_clean_char[list_char[62]].items(): 152 | m = re.search(match_re, name) 153 | if m: 154 | if m.group(1) == ' ': 155 | print(patentid) 156 | print(name) 157 | print(m.group(1)) 158 | ''' 159 | 160 | n=0 161 | for i in range(0, len(list_cleanorg)): 162 | name = list_cleanorg[i] 163 | m = re.search(match_re, name) 164 | if m: 165 | if m.group(1) == ' ': 166 | replace_char = '' 167 | else: 168 | replace_char = m.group(1) 169 | newname = re.sub(match_re, replace_char, name) 170 | list_cleanorg[i] = newname 171 | n+=1 172 | ########################################################################## 173 | 174 | ##### clean every char to correct ones ############################## 175 | list_cleanorg_afcharc = [] 176 | for i in range(0, len(list_cleanorg)): 177 | name = list_cleanorg[i] 178 | newchar_list = [] 179 | for char in name: 180 | if char != ' ': 181 | newchar_list.append(dict_replace[char]) 182 | else: 183 | newchar_list.append(' ') 184 | newname = ''.join(newchar for newchar in newchar_list) 185 | list_cleanorg_afcharc.append(newname) 186 | ###################################################### 187 | 188 | # process dot a bit more carefully because .com or .net cannot replace dot as space, dont have meaningful search results 189 | dot2replace_re = re.compile(r"(\. 
)|\.$|^\.") # dot space or dot at the end of the string or dot at beg 190 | for i in range(0, len(list_cleanorg_afcharc)): 191 | name = list_cleanorg_afcharc[i] 192 | newname = dot2replace_re.sub(' ', name) 193 | list_cleanorg_afcharc[i] = newname 194 | 195 | white0 = r" +" # >=1 whitespace 196 | white0_re = re.compile(white0) 197 | for i in range(0, len(list_cleanorg_afcharc)): 198 | name = list_cleanorg_afcharc[i] 199 | newname = white0_re.sub(' ', name) 200 | list_cleanorg_afcharc[i] = newname 201 | 202 | white1 = r"^ | $" # begin or end with whitespace 203 | white1_re = re.compile(white1) 204 | for i in range(0, len(list_cleanorg_afcharc)): 205 | name = list_cleanorg_afcharc[i] 206 | newname = white1_re.sub('',name) 207 | list_cleanorg_afcharc[i] = newname 208 | 209 | # take care of u s, u s a 210 | usa_re = re.compile(r"\b(u) \b(s) \b(a)\b") 211 | us_re = re.compile(r"\b(u) \b(s)\b") 212 | for i in range(0, len(list_cleanorg_afcharc)): 213 | name = list_cleanorg_afcharc[i] 214 | newname = usa_re.sub('usa', name) 215 | newname = us_re.sub('us', newname) 216 | list_cleanorg_afcharc[i] = newname 217 | 218 | # take care of "a l'energie" 219 | temp_re = re.compile(r"\ba *l'* *energie") 220 | for i in range(0, len(list_cleanorg_afcharc)): 221 | name = list_cleanorg_afcharc[i] 222 | newname = temp_re.sub("a l'energie", name) 223 | list_cleanorg_afcharc[i] = newname 224 | 225 | ############################################################### 226 | dict_raw2new = {} 227 | for i in range(0, len(list_raworg_nodup)): 228 | rawname = list_raworg_nodup[i] 229 | newname = list_cleanorg_afcharc[i] 230 | dict_raw2new.update({rawname: newname}) 231 | mf.pickle_dump(dict_raw2new, 'dict_pv_raw2new') 232 | 233 | dict_new2raw = {} 234 | for i in range(0, len(list_raworg_nodup)): 235 | rawname = list_raworg_nodup[i] 236 | newname = list_cleanorg_afcharc[i] 237 | if newname not in dict_new2raw: 238 | dict_new2raw[newname] = {rawname} 239 | else: 240 | dict_new2raw[newname].update({rawname}) 241 | mf.pickle_dump(dict_new2raw, 'dict_pv_new2raw') 242 | 243 | -------------------------------------------------------------------------------- /clean_name/sdc_process_name.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Oct 29 16:33:37 2019 4 | 5 | @author: danie 6 | """ 7 | 8 | import my_own_handy_functions as mf 9 | import re 10 | import json 11 | import pandas as pd 12 | import time 13 | 14 | logfile = '' 15 | 16 | t1 = time.time() 17 | 18 | df = pd.read_stata('../../../MA_Feb27/step3_new_ma_all_acq_public.dta') # M&A file 19 | df = df.loc[df['divestiture']=="N"] 20 | df = df.loc[~(df['targetname'].str.contains('Undisclosed', regex=False))] # delete undisclosed target 21 | 22 | list_dealnumber = list(df['dealnumber']) 23 | list_targetcusip = list(df['targetcusip']) 24 | list_targetname = list(df['targetname']) 25 | 26 | ######## delete all things in (), in SDC, the content within () is usually the immediate parent. 
27 | ######## including it doesn't seem to yield good search results 28 | bracket_re = re.compile(r"\(.*\)") 29 | list_targetname_new = [] 30 | for i in range(0,len(list_targetname)): 31 | name = list_targetname[i] 32 | if re.search(bracket_re, name): 33 | newname = bracket_re.sub("",name) 34 | print(name) 35 | print(newname) 36 | list_targetname_new.append(newname) 37 | else: 38 | list_targetname_new.append(name) 39 | ################################################# 40 | 41 | # dict_replace gives the correct char to replace the old one 42 | with open('dict_char_replace.json', 'r') as f: 43 | dict_replace = json.load(f) 44 | 45 | 46 | for i in range(0, len(list_targetname_new)): 47 | newname = list_targetname_new[i].lower() 48 | list_targetname_new[i] = newname 49 | 50 | ####### clean every char to correct one ############## 51 | list_targetname_afcharc = [] 52 | for i in range(0, len(list_targetname_new)): 53 | name = list_targetname_new[i] 54 | newchar_list = [] 55 | for char in name: 56 | if char != ' ': 57 | newchar_list.append(dict_replace[char]) 58 | else: 59 | newchar_list.append(' ') 60 | newname = ''.join(newchar for newchar in newchar_list) 61 | list_targetname_afcharc.append(newname) 62 | ########################################################### 63 | 64 | ### process dot carefully, because of .com and .net 65 | dot2replace_re = re.compile(r"(\. )|\.$|^\.") # dot space or dot at the end of the string or dot at beg 66 | for i in range(0, len(list_targetname_afcharc)): 67 | name = list_targetname_afcharc[i] 68 | newname = dot2replace_re.sub(' ', name) 69 | list_targetname_afcharc[i] = newname 70 | 71 | white0 = r" +" # >=1 whitespace 72 | white0_re = re.compile(white0) 73 | for i in range(0, len(list_targetname_afcharc)): 74 | name = list_targetname_afcharc[i] 75 | newname = white0_re.sub(' ', name) 76 | list_targetname_afcharc[i] = newname 77 | 78 | white1 = r"^ | $" # begin or end with whitespace 79 | white1_re = re.compile(white1) 80 | for i in range(0, len(list_targetname_afcharc)): 81 | name = list_targetname_afcharc[i] 82 | newname = white1_re.sub('',name) 83 | list_targetname_afcharc[i] = newname 84 | 85 | # take care of u s, u s a 86 | usa_re = re.compile(r"\b(u) \b(s) \b(a)\b") 87 | us_re = re.compile(r"\b(u) \b(s)\b") 88 | for i in range(0, len(list_targetname_afcharc)): 89 | name = list_targetname_afcharc[i] 90 | newname = usa_re.sub('usa', name) 91 | newname = us_re.sub('us', newname) 92 | list_targetname_afcharc[i] = newname 93 | 94 | mf.pickle_dump(list_dealnumber, 'ma_acq_public_dealnumber') 95 | mf.pickle_dump(list_targetname_afcharc, 'ma_acq_public_targetname_processed') 96 | -------------------------------------------------------------------------------- /match/link_pv2compustat.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import sqlite3 3 | import my_own_handy_functions as mf 4 | from itertools import combinations 5 | import re 6 | 7 | prefix = "pv_" 8 | logfile = '' 9 | 10 | attach_sql = ";" 11 | attach_name = "" 12 | 13 | i = 2 14 | task_i_suffix = f'task{i}' 15 | suffix = "all_top5" 16 | 17 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 18 | db = db_folder + f"{prefix}search_{suffix}.db" 19 | con = sqlite3.connect(db) 20 | print(mf.show_tables(con)) 21 | 22 | sql = f"select * from {prefix}search_" + suffix + attach_sql 23 | df_pv = pd.read_sql(sql, con) 24 | try: 25 | df_pv = df_pv.drop('index', 1) 26 | except: 27 | pass 28 | 29 | prefix = "compustat_" 30 | 
db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 31 | db = db_folder + f"{prefix}search_{suffix}.db" 32 | con = sqlite3.connect(db) 33 | print(mf.show_tables(con)) 34 | sql = f"select * from {prefix}search_" + suffix + attach_sql 35 | df_compustat = pd.read_sql(sql, con) 36 | try: 37 | df_compustat = df_compustat.drop('index', 1) 38 | except: 39 | pass 40 | df_compustat = df_compustat.drop(['gvkey', 'compustat_conml'], 1) 41 | df_compustat = df_compustat.drop_duplicates(['compustat_newname']) 42 | 43 | list_newname_compustat = list(df_compustat['compustat_newname']) 44 | list_urls5_compustat = list(df_compustat['urls5']) 45 | list_newname_pv = list(df_pv['pv_newname']) 46 | list_urls5_pv = list(df_pv['urls5']) 47 | 48 | ## clean the url a bit 49 | http_re = re.compile(r"^http\:\/\/|^https\:\/\/|\/$") 50 | 51 | ## index all the unique urls 52 | set_url_compustat = set(http_re.sub('',url) for urls5 in list_urls5_compustat for url in eval(urls5)) 53 | set_url = set(http_re.sub('',url) for urls5 in list_urls5_pv for url in eval(urls5)) 54 | set_url.update(set_url_compustat) 55 | 56 | list_url = list(set_url) 57 | dict_all_url_index = {} 58 | for i in range(0, len(list_url)): 59 | dict_all_url_index.update({i: list_url[i], list_url[i]: i}) 60 | 61 | ### change 5 urls to their corresponding indexes 62 | def newname_url_index(list_urls5, list_newname): 63 | dict_newname_url_index = {} 64 | for i in range(0, len(list_newname)): 65 | newname = list_newname[i] 66 | urls5 = eval(list_urls5[i]) 67 | urls5_index = frozenset([dict_all_url_index[http_re.sub('',url)] for url in urls5]) 68 | dict_newname_url_index[newname] = urls5_index 69 | return dict_newname_url_index 70 | 71 | # dictionary of newname to a set of 5 urls 72 | dict_newname_url_index_pv = newname_url_index(list_urls5_pv, list_newname_pv) 73 | dict_newname_url_index_compustat = newname_url_index(list_urls5_compustat, list_newname_compustat) 74 | 75 | # construct a dictionary of n-urls to newname 76 | ##### n means that how many matches of urls are required here 77 | #### suf means suffix 78 | def urls_index_dict(n, # number of urls to be hashed 79 | suf, # '_pv' or '_compustat' 80 | ): 81 | 82 | dict_urls = {} 83 | list_newname = globals()[f"list_newname{suf}"] 84 | for i in range(0, len(list_newname)): 85 | newname = list_newname[i] 86 | urls5_index = globals()[f"dict_newname_url_index{suf}"][newname] 87 | if len(urls5_index) >= n: 88 | # hash all the 5Cn combinations of urls 89 | urls5_index_c5n = combinations(urls5_index, n) 90 | for urls_item in urls5_index_c5n: 91 | urls_key = frozenset(urls_item) 92 | if urls_key not in dict_urls: 93 | dict_urls[urls_key] = {newname} 94 | else: 95 | dict_urls[urls_key].update({newname}) 96 | return dict_urls 97 | 98 | dict_pv2compustat = [{}, {}, {}, {}] 99 | #### matching based on 5/4/3/2 urls 100 | for n in range(5,1,-1): 101 | # hash all 5Cn combinations of urls 102 | dict_urls_pv = urls_index_dict(n, '_pv') 103 | dict_urls_compustat = urls_index_dict(n, '_compustat') 104 | print(f"{n} loading done") 105 | 106 | # for each hashed n-url in compustat 107 | for key, value in dict_urls_compustat.items(): 108 | # if this hashed n-url can also be found in PV, give a match 109 | if key in dict_urls_pv: 110 | newname_compustat = list(value) 111 | newname_pv_list = list(dict_urls_pv[key]) 112 | for newname_pv in newname_pv_list: 113 | if newname_pv not in dict_pv2compustat[5-n]: 114 | dict_pv2compustat[5-n][newname_pv] = set(newname_compustat) 115 | else: 116 | 
dict_pv2compustat[5-n][newname_pv].update(set(newname_compustat)) 117 | print(f"{n} dict done") 118 | dict_urls_pv = {} 119 | dict_urls_compustat = {} 120 | 121 | mf.pickle_dump(dict_pv2compustat, 'dict_pv2compustat') 122 | mf.pickle_dump(dict_all_url_index, 'dict_all_url_index_pv2compustat') 123 | -------------------------------------------------------------------------------- /match/link_pv2sdc.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import sqlite3 3 | import my_own_handy_functions as mf 4 | from itertools import combinations 5 | import re 6 | 7 | prefix = "pv_" 8 | logfile = '' 9 | 10 | attach_sql = ";" 11 | attach_name = "" 12 | suffix = "all_top5" 13 | 14 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 15 | db = db_folder + f"{prefix}search_{suffix}.db" 16 | con = sqlite3.connect(db) 17 | print(mf.show_tables(con)) 18 | 19 | sql = f"select * from {prefix}search_" + suffix + attach_sql 20 | df_pv = pd.read_sql(sql, con) 21 | try: 22 | df_pv = df_pv.drop('index', 1) 23 | except: 24 | pass 25 | ######################################### 26 | 27 | prefix = "sdc_" 28 | 29 | db_folder = f"../../../Apps/CBS_ResearchGrid/{prefix}search_result_db/" 30 | db = db_folder + f"{prefix}search_{suffix}.db" 31 | con = sqlite3.connect(db) 32 | print(mf.show_tables(con)) 33 | sql = f"select * from {prefix}search_" + suffix + attach_sql 34 | df_sdc = pd.read_sql(sql, con) 35 | try: 36 | df_sdc = df_sdc.drop('index', 1) 37 | except: 38 | pass 39 | 40 | df_sdc = df_sdc.drop_duplicates(['newname']) 41 | 42 | list_newname_sdc = list(df_sdc['newname']) 43 | list_urls5_sdc = list(df_sdc['urls5']) 44 | list_newname_pv = list(df_pv['pv_newname']) 45 | list_urls5_pv = list(df_pv['urls5']) 46 | 47 | # clean url a bit 48 | http_re = re.compile(r"^http\:\/\/|^https\:\/\/|\/$") 49 | 50 | ## index all the unique urls 51 | set_url_sdc = set(http_re.sub('',url) for urls5 in list_urls5_sdc for url in eval(urls5)) 52 | set_url = set(http_re.sub('',url) for urls5 in list_urls5_pv for url in eval(urls5)) 53 | set_url.update(set_url_sdc) 54 | 55 | list_url = list(set_url) 56 | dict_all_url_index = {} 57 | for i in range(0, len(list_url)): 58 | dict_all_url_index.update({i: list_url[i], list_url[i]: i}) 59 | 60 | ### change 5 urls to their corresponding indexes 61 | def newname_url_index(list_urls5, list_newname): 62 | dict_newname_url_index = {} 63 | for i in range(0, len(list_newname)): 64 | newname = list_newname[i] 65 | urls5 = eval(list_urls5[i]) 66 | urls5_index = frozenset([dict_all_url_index[http_re.sub('',url)] for url in urls5]) 67 | dict_newname_url_index[newname] = urls5_index 68 | return dict_newname_url_index 69 | 70 | # dictionary of newname to a set of 5 urls 71 | dict_newname_url_index_pv = newname_url_index(list_urls5_pv, list_newname_pv) 72 | dict_newname_url_index_sdc = newname_url_index(list_urls5_sdc, list_newname_sdc) 73 | 74 | # construct a dictionary of n-urls to newname 75 | ##### n means that how many matches of urls are required here 76 | #### suf means suffix 77 | def urls_index_dict(n, # number of urls to be hashed 78 | suf, # '_pv' or '_sdc' 79 | ): 80 | 81 | dict_urls = {} 82 | list_newname = globals()[f"list_newname{suf}"] 83 | for i in range(0, len(list_newname)): 84 | newname = list_newname[i] 85 | urls5_index = globals()[f"dict_newname_url_index{suf}"][newname] 86 | if len(urls5_index) >= n: 87 | # hash all the 5Cn combinations of urls 88 | urls5_index_c5n = combinations(urls5_index, n) 89 | 
for urls_item in urls5_index_c5n: 90 | urls_key = frozenset(urls_item) 91 | if urls_key not in dict_urls: 92 | dict_urls[urls_key] = {newname} 93 | else: 94 | dict_urls[urls_key].update({newname}) 95 | return dict_urls 96 | 97 | dict_pv2sdc = [{}, {}, {}, {}] 98 | #### matching based on 5/4/3/2 urls 99 | for n in range(5,1,-1): 100 | # hash all 5Cn combinations of urls 101 | dict_urls_pv = urls_index_dict(n, '_pv') 102 | dict_urls_sdc = urls_index_dict(n, '_sdc') 103 | print(f"{n} loading done") 104 | # for each hashed n-url in sdc 105 | for key, value in dict_urls_sdc.items(): 106 | if key in dict_urls_pv: 107 | # if this hashed n-url can also be found in PV, give a match 108 | newname_sdc = list(value) 109 | newname_pv_list = list(dict_urls_pv[key]) 110 | for newname_pv in newname_pv_list: 111 | if newname_pv not in dict_pv2sdc[5-n]: 112 | dict_pv2sdc[5-n][newname_pv] = set(newname_sdc) 113 | else: 114 | dict_pv2sdc[5-n][newname_pv].update(set(newname_sdc)) 115 | print(f"{n} dict done") 116 | dict_urls_pv = {} 117 | dict_urls_sdc = {} 118 | 119 | mf.pickle_dump(dict_pv2sdc, 'dict_pv2sdc') 120 | mf.pickle_dump(dict_all_url_index, 'dict_all_url_index_pv2sdc') 121 | -------------------------------------------------------------------------------- /my_own_handy_functions.py: -------------------------------------------------------------------------------- 1 | # some handy functions 2 | 3 | import pickle 4 | import time 5 | import sys 6 | 7 | def pickle_dump(dict2dump, dict_name): 8 | pickle_name = dict_name + '.pickle' 9 | with open(pickle_name, 'wb') as handle: 10 | pickle.dump(dict2dump, handle, protocol = pickle.HIGHEST_PROTOCOL) 11 | return pickle_name 12 | 13 | def pickle_load(dict_name): 14 | pickle_name = dict_name + '.pickle' 15 | with open(pickle_name, 'rb') as handle: 16 | return pickle.load(handle) 17 | 18 | def show_tables(con): 19 | cur = con.cursor() 20 | cur.execute("SELECT name FROM sqlite_master WHERE type='table';") 21 | return cur.fetchall() 22 | 23 | def log_time_used(t1, task, log, mode = 'a'): 24 | t2 = time.time() 25 | t = round(t2-t1,2) 26 | message = f"{task} takes {t}s." 27 | if log == '': 28 | print(message, file = sys.stdout) 29 | else: 30 | with open(log, mode) as f: 31 | print(message, file = f) 32 | return time.time() 33 | 34 | from azure.cognitiveservices.search.websearch import WebSearchAPI 35 | from msrest.authentication import CognitiveServicesCredentials 36 | import json 37 | 38 | def simple_bing_search(search_word): 39 | subscription_key = "" # input key from Bing search 40 | assert subscription_key 41 | client = WebSearchAPI(CognitiveServicesCredentials(subscription_key)) 42 | web_data_raw = client.web.search(query=search_word, raw = True, count = 50) 43 | raw_text = web_data_raw.response.text 44 | raw_dict = json.loads(raw_text) 45 | return raw_dict 46 | 47 | def print_log(message, logfile = '', mode = 'a'): 48 | if logfile != '': 49 | with open(logfile, mode) as f: 50 | print(message, file = f) 51 | else: 52 | print(message, file = sys.stdout) 53 | return None 54 | --------------------------------------------------------------------------------
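For illustration, the following self-contained sketch shows the URL-overlap matching rule implemented in “match/link_pv2compustat.py” and “match/link_pv2sdc.py”, applied to hypothetical names and URLs. The real scripts index frozensets of n URLs, for n from 5 down to 2, over the full PatentsView and Compustat/SDC samples.

```python
from itertools import combinations

# hypothetical cleaned names and top-5 urls (protocol and trailing '/' already stripped)
pv_top5 = {
    'acme robotics inc': frozenset({
        'acmerobotics.com', 'linkedin.com/company/acme-robotics',
        'crunchbase.com/organization/acme-robotics',
        'bloomberg.com/profile/company/acme', 'acmerobotics.com/about'}),
}
sdc_top5 = {
    'acme robotics': frozenset({
        'acmerobotics.com', 'linkedin.com/company/acme-robotics',
        'wikipedia.org/wiki/acme_robotics',
        'crunchbase.com/organization/acme-robotics', 'example.com/acme'}),
}

def match_names(dict_a, dict_b, n=2):
    """Return (name_a, name_b) pairs whose top-5 url sets share at least n urls,
    found by indexing every n-url combination, as in the link_* scripts."""
    index = {}
    for name, urls in dict_a.items():
        for combo in combinations(sorted(urls), n):
            index.setdefault(frozenset(combo), set()).add(name)
    pairs = set()
    for name, urls in dict_b.items():
        for combo in combinations(sorted(urls), n):
            for other in index.get(frozenset(combo), ()):
                pairs.add((other, name))
    return pairs

print(match_names(pv_top5, sdc_top5))  # {('acme robotics inc', 'acme robotics')}
```

With at least two shared URLs required (n = 2), the two hypothetical names above are matched because their top-5 lists share three URLs.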