├── README.md
├── LICENSE
└── scraper

/README.md:
--------------------------------------------------------------------------------
# keywordhunter

A Python script that loops through the URLs in a CSV and looks for specific keywords on each scraped homepage.

This example opens a familyoffice.csv stored in the same folder, scrapes each homepage, cleans it up using Beautiful Soup (i.e. removes the HTML), and then searches for the keywords "VC" and "venture capital".

It was originally set up to save a new CSV once the loop completed, but functionality was later added to save the CSV during the loop (in case the loop errors out, etc.). Note: this incremental save has been unreliable, and keywords were not always being saved.

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Yohei Nakajima

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/scraper:
--------------------------------------------------------------------------------
import csv
import re

import requests
from bs4 import BeautifulSoup

def check_for_keywords(text):
    # Check the longer phrase first so it wins over the bare acronym.
    if re.search(r'\bventure capital\b', text, re.IGNORECASE):
        return 'venture capital'
    if re.search(r'\bVC\b', text, re.IGNORECASE):
        return 'VC'
    return None

def savecsv(rows, fieldnames):
    # Overwrite the source CSV with progress so far, so an interrupted
    # run can resume where it left off.
    with open('familyoffice.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

with open('familyoffice.csv', newline='') as f:
    reader = csv.DictReader(f)
    rows = list(reader)

# Include 'keyword' in the header up front so every save has the column.
fieldnames = list(rows[0].keys())
if 'keyword' not in fieldnames:
    fieldnames.append('keyword')

for row in rows:
    # DictReader returns '' for empty cells, so test the value, not the
    # key's presence ('keyword' is always a key once the column exists).
    if not row.get('keyword'):
        url = row['url']
        print(url)
        try:
            req = requests.get(url, timeout=10)
            soup = BeautifulSoup(req.text, 'html.parser')
            text = soup.get_text()
            # Store 'none' rather than None so a resumed run does not
            # re-scrape pages that matched no keyword.
            row['keyword'] = check_for_keywords(text) or 'none'
        except requests.RequestException:
            row['keyword'] = 'error'
        print(row['keyword'])
        # Save as we go, in case the loop is interrupted.
        savecsv(rows, fieldnames)

# Save the full results to a separate file when complete.
with open('familyoffice_keyword_added.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
--------------------------------------------------------------------------------
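The README notes that saving the CSV during the loop has been unreliable. One pattern that sidesteps rewriting the whole file on every iteration is to append each finished row to a progress CSV as soon as it is processed. A minimal sketch (the file name `progress.csv` and the sample rows here are hypothetical, standing in for the real familyoffice data):

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped familyoffice.csv results.
rows = [
    {'url': 'https://example.com/a', 'keyword': 'VC'},
    {'url': 'https://example.com/b', 'keyword': ''},
]

progress_path = os.path.join(tempfile.mkdtemp(), 'progress.csv')
fieldnames = ['url', 'keyword']

# Write the header once before the loop starts.
with open(progress_path, 'w', newline='') as f:
    csv.DictWriter(f, fieldnames=fieldnames).writeheader()

# Append one row per iteration; a crash mid-loop loses at most the
# row currently being processed, and earlier rows are already on disk.
for row in rows:
    with open(progress_path, 'a', newline='') as f:
        csv.DictWriter(f, fieldnames=fieldnames).writerow(row)

# Read the file back to confirm every processed row survived.
with open(progress_path, newline='') as f:
    saved = list(csv.DictReader(f))
print(saved == rows)
```

Opening the file in `'a'` mode per iteration also flushes each line to disk when the `with` block closes, so progress is durable even if the process is killed between rows.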