├── README.md
└── citation_analysis
└── find_hrefs_loop.py
/README.md:
--------------------------------------------------------------------------------
## Far Right Analysis

**Slack:** [#far-right](https://datafordemocracy.slack.com/messages/far-right)

**Project Leads:** @nick, @sjackson, @jonathon

**Maintainers (people with write/commit access):**
* GitHub: @jonathon, @nick, @sjacks26
* data.world: @jonathon

**Data:** Head over to [data.world](https://data.world/data4democracy/far-right) to check out our most up-to-date data! That's a private dataset, so ask [@jonathon](https://datafordemocracy.slack.com/messages/@jonathon/) or [@sharon](https://datafordemocracy.slack.com/messages/) if you need access. It contains datasets specific to militia groups operating on Facebook, and white supremacists on Twitter, which might help you get started. Also, the [#assemble](https://datafordemocracy.slack.com/messages/assemble) channel in Slack (and the related [assemble GitHub repo](https://github.com/Data4Democracy/assemble)) is dedicated to collecting data for analysis, and may be able to help you build a dataset you need.

**Project Description:** This repo is for collecting analyses related to the behavior of extreme far right online communities. It's likely that this will be temporary until we move to a platform that's designed more specifically for collaborative research, but it'll do in the meantime. Research/analysis ideas are collected [as issues](https://github.com/Data4Democracy/far-right-analysis/issues) - please don't hesitate to add your ideas, or take ownership of an analysis project that's interesting to you. Definitely let someone know (in the issue or on Slack) if you start on an idea, though, so we don't accidentally duplicate work.

There are several related veins of work going on in the group at the current time:
* **Data Collection, Cleaning, and Joining**
* **Data Visualization**
* **Modelling**
* **Analysis**

## Getting Started

### Want to Contribute?
* **"First-timers" are welcome!** Whether you're trying to learn data science, hone your coding skills, or get started collaborating over the web, we're happy to help. If you have any questions, feel free to pose them on our [slack channel](https://datafordemocracy.slack.com/messages/far-right), or reach out to one of the team leads. That channel is a good place to go with preliminary ideas, before you're ready to add them as issues here. If you have questions about Git and GitHub specifically, our [github-playground](https://github.com/Data4Democracy/github-playground) repo and the [#github-help](https://datafordemocracy.slack.com/messages/github-help) Slack channel are good places to start.
* **Feeling Comfortable with GitHub, and Ready to Dig In?** Check out our GitHub issues. This is our official listing of the work that we are planning to get done. As we add more issues, the maintainers will make sure to specifically tag those issues that are good for beginners with: `beginner-friendly`
* **This README is a Living Document:** If you see something you think should be changed, feel free to edit and submit a Pull Request. Not only will this be a huge help to the group, it is also a great first PR!
* **Got an Idea for Something We Should be Working On?** You can submit an issue on our GitHub page, mention your idea on the [slack channel](https://datafordemocracy.slack.com/messages/far-right), or reach out to one of the project leads.

## Latest Project News

Check out this [Google Doc](https://docs.google.com/document/d/16C6tEZJ6i96PbWVUplRJW9IJeFW5TEsR0TsfzNOFYj4/edit?usp=sharing) for the latest news on what we've done and what we're working on now.

--------------------------------------------------------------------------------
/citation_analysis/find_hrefs_loop.py:
--------------------------------------------------------------------------------
'''
Right now, this builds a dictionary with domain names and the number of times each domain appears. I need to revise this
so that it only looks for hrefs in the main text of the page. Right now, it's picking up share buttons and comments and
all sorts of other stuff.
The problem I was having was searching for "a" tags within the results of an earlier find_all.
~~ I think I solved this problem. First, I do a find_all for div class entry-content. Then, I convert the results of
that to a string. Then, I Soup that string, which lets me do another find_all for a.
'''
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import os

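# A minimal sketch of the "entry-content" approach described in the module docstring:
# restrict the search to the post-body div before collecting 'a' tags, so share buttons,
# comments, and sidebar links are skipped. Assumptions: the sites use a WordPress-style
# 'entry-content' class for the article body, and hrefs of interest start with 'http'.
# This helper isn't wired into the loop below yet.
def main_text_hrefs(html):
    page_soup = BeautifulSoup(html, 'lxml')
    body_divs = page_soup.find_all('div', {'class': 'entry-content'})
    # Re-Soup the stringified divs so the second find_all only sees the main text.
    body_soup = BeautifulSoup(str(body_divs), 'lxml')
    links = []
    for a_tag in body_soup.find_all('a'):
        href = str(a_tag.get('href'))
        if href.startswith('http'):
            links.append(href)
    return links
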
start_path = input('What is the start path?')

full_domains = {}

# Accumulators. page_domains maps each page's file path to {outbound domain: count};
# the site-level dictionaries aggregate domains across every page processed so far.
site_domains_w_page_details = {}
site_domains_unique = {}
site_domains_total = {}
page_domains = {}
all_site_domains_count = {}
all_site_domains_unique = []
all_site_domains = []
# os.walk already recurses into every subdirectory, so each .html file under start_path
# is visited exactly once.
for path, subdirs, files in os.walk(start_path):
    for filename in files:
        if filename.endswith('.html'):
            file_path = os.path.join(path, filename)

            with open(file_path, 'r') as html_file:
                soup = BeautifulSoup(html_file, 'lxml')

            content = soup.find_all('p')
            content = str(content)

            # The following couple of blocks (through the line that defines 'this_domain_name')
            # identify the domain of the page being scraped, using its og:url meta tag.
            this_domain_candidates = soup.find_all('meta', {'property': 'og:url'})
            t_d_candidates = []
            for candidate in this_domain_candidates:
                a_domain = candidate.get('content')
                t_d_candidates.append(a_domain)

            this_domain = []
            for candidate in t_d_candidates:
                candidate_string = str(candidate)
                if candidate_string.startswith('http'):
                    this_domain.append(candidate)

            if len(this_domain) == 1:
                this_domain = this_domain[0]
            elif len(this_domain) > 1:
                print('More than one domain option \n')
                for option in this_domain:
                    print(option)
                this_domain = this_domain[0]
            else:
                # No og:url meta tag with an http(s) URL was found.
                this_domain = ''

            domain_parse = urlparse(this_domain)
            this_domain_name = domain_parse.netloc

            # This section searches for URLs in the main text of the page.
            soup = BeautifulSoup(content, 'lxml')
            main_text = soup.find_all('a')

            hrefs = []
            domains = []
            for link in main_text:
                url = link.get('href')
                url = str(url)
                if url.startswith('http'):
                    hrefs.append(url)

            # This builds a list of unique domains linked from the page, excluding the
            # page's own domain.
            all_domains = []
            for link in hrefs:
                url_parts = urlparse(link)
                domain = url_parts.netloc
                all_domains.append(domain)
                all_site_domains.append(domain)
                if domain not in domains and domain not in this_domain_name:
                    domains.append(domain)
                if domain not in all_site_domains_unique and domain not in this_domain_name:
                    all_site_domains_unique.append(domain)

            # This builds a dictionary with domain names associated with domain counts for
            # this page. When working across multiple pages, another dictionary (page_domains)
            # is wrapped around this one, keyed by the page's file path.
            domain_counts = {}
            for d in domains:
                domain_count = 0
                for y in all_domains:
                    if d == y:
                        domain_count += 1
                domain_counts[d] = domain_count

            page_domains[file_path] = domain_counts
            site_domains_w_page_details[this_domain_name] = page_domains

            # Site-level counts: how many times each unique outbound domain appears across
            # all pages processed so far. These accumulators are shared across sites, so the
            # per-site separation only holds when start_path contains a single site.
            for d in all_site_domains_unique:
                site_domain_count = 0
                for y in all_site_domains:
                    if d == y:
                        site_domain_count += 1
                all_site_domains_count[d] = site_domain_count

            site_domains_unique[this_domain_name] = all_site_domains_count
            site_domains_total[this_domain_name] = all_site_domains

            full_domains[this_domain_name] = site_domains_unique

'''
test = page_domains['http://www.totalsurvivalist.com/2016/11/realities-of-defensive-conflicts.html']

for domain in test:
    print(domain)
    print(type(test[domain]))


# I need to develop a count of unique domains found in a site, taking into account the total number of pages in the
# site.
## Instead of doing this, add an additional dictionary above that compiles domains and domain counts (both unique and
# total counts) for the site, rather than for each page.
site_domains = {}
full_domain_list = []
overview_domain_list = []
for site in site_domains:
    #print(site)
    for page in site_domains[site]:
        #print(page)
        for domain in page_domains[page]:
            print(domain)
            if not domain in full_domain_list:
                full_domain_list.append(domain)
'''
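
# A rough sketch of the page-normalized summary mentioned in the notes above: divide each
# outbound domain's total count by the number of pages processed, so heavily linked domains
# can be compared across crawls of different sizes. This assumes a single site was crawled
# (the site-level accumulators above are shared across sites) and is illustrative only.
total_pages = len(page_domains)
if total_pages:
    for d, count in sorted(all_site_domains_count.items(), key=lambda kv: kv[1], reverse=True):
        print(d, count, round(count / total_pages, 2))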

end_on_comment = True
--------------------------------------------------------------------------------