├── README.rdoc
├── build_hashes.py
├── probabilities.dict.json
├── scrape_comments.py
├── spamcheck.py
└── testit.py

--------------------------------------------------------------------------------
/README.rdoc:
--------------------------------------------------------------------------------
== youtube-troll-classifier

This is a Bayesian classifier of online comments. It was trained with YouTube comments, but it could be used on any online commenting system. It is not affiliated in any way with YouTube.com or Google.

It is an implementation of Paul Graham's algorithms and techniques for classifying spam: http://www.paulgraham.com/spam.html

== The training data

Using the YouTube API I trained the classifier with about 30,000 YouTube comments. The API tells you whether a given comment has been marked as spam by other YouTube users.

== Usage

spamcheck.py is the main tool here. Give it a comment via stdin and it prints the probability that the comment is spam, between 0 and 1.0.

  $ echo "Go to my youtube page and check out my videos" | ./spamcheck.py
  0.999964931472
  $ echo "Great video! Loved it" | ./spamcheck.py
  0.222768242843
  $ echo "Please go to my webpage and buy my software" | ./spamcheck.py
  0.997404774789
  $ echo "worst video ever" | ./spamcheck.py
  0.338513134698

== Performance

I ran the classifier on 3 popular videos outside of the ones I used for training. Here were the results:

  total comments: 2936
  Spams: 434
  Not spams: 2502
  Marked as spam: 417
  Marked as notspam: 2519
  False positives: 153
  False negatives: 170

So it caught about 61% of spams (264 of 434) with a 6% false positive rate (153 of 2502). Not very good. Not very good at all.

You can dial these values around: by modifying the thresholds I was able to catch up to 75% of spams, but at the cost of a 15% false positive rate.

== Possible improvements

The data is really fuzzy: the 'spam' comments are the ones YouTube users marked as spam. Sometimes they miss a comment that is spam, sometimes they mark as spam a comment that is really just a troll, and sometimes the trolls mark a legitimate comment as spam. If I went through all of the comments and manually labeled spam vs. not spam, I would have better training data as well as better data with which to score the classifier.

Use capitalization data: right now I convert everything to lowercase, but anecdotally spams seem more likely to be in all caps.

Use punctuation: the classifier doesn't really use punctuation. This is most likely a mistake, because spams seem to have a lot of weird punctuation and ASCII art.

Search for keywords: just tokenizing the comment isn't the best, because a lot of spam comments look like "pleasecheckoutmyfacebookpageatwwwfacebookcom/blah".
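For illustration, here is a minimal sketch of that last idea, which is not part of the classifier (the keyword list is invented for the example): strip the spaces out of both the comment and a list of known spammy phrases, so run-together spam still matches.

  KEYWORDS = ["check out my", "facebook", "my page", "subscribe to"]

  def keyword_hits(comment):
      text = comment.lower().replace(" ", "")
      # match phrases with their spaces removed too, so
      # "pleasecheckoutmyfacebookpage" still hits "check out my"
      return [k for k in KEYWORDS if k.replace(" ", "") in text]

  print keyword_hits("pleasecheckoutmyfacebookpageatwwwfacebookcom/blah")
  # ['check out my', 'facebook']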
--------------------------------------------------------------------------------
/build_hashes.py:
--------------------------------------------------------------------------------
import json

spamtokens = {}
notspamtokens = {}
probabilities = {}

spam_count = 0      # number of spam comments seen
notspam_count = 0   # number of non-spam comments seen

print 'building dictionaries...'

# count token occurrences in the spam corpus; comments are separated
# by the bbSEP marker written by scrape_comments.py
f = open('spam', 'r')
for line in f:
    for token in line.split():
        if token == "bbSEP":
            spam_count = spam_count + 1
            continue
        token = token.lower()
        spamtokens[token] = spamtokens.get(token, 0) + 1
f.close()

f = open('notspam', 'r')
for line in f:
    for token in line.split():
        if token == "bbSEP":
            notspam_count = notspam_count + 1
            continue  # don't count the separator itself as a token
        token = token.lower()
        notspamtokens[token] = notspamtokens.get(token, 0) + 1
f.close()

# per-token spam probability, following Paul Graham's formula
#   p = (s/S) / (s/S + 2g/G)
# where s and g are the token's counts in the spam and non-spam corpora,
# S and G are the corpus sizes, and the non-spam count is doubled to
# bias the classifier away from false positives
for k in spamtokens:
    num_in_spam = spamtokens[k]
    num_in_notspam = notspamtokens.get(k, 0) * 2
    p = (num_in_spam / float(spam_count)) / ((num_in_spam / float(spam_count)) + (num_in_notspam / float(notspam_count)))
    probabilities[k] = p

# tokens that never appear in spam get probability 0
for k in notspamtokens:
    if k not in spamtokens:
        probabilities[k] = 0

print "saving dictionaries as json..."
# save the hashes as json
f = open('spam.dict.json', 'w')
f.write(json.dumps(spamtokens))
f.close()

f = open('notspam.dict.json', 'w')
f.write(json.dumps(notspamtokens))
f.close()

f = open('probabilities.dict.json', 'w')
f.write(json.dumps(probabilities))
f.close()
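# Illustrative extra, not part of the original script: a worked example of
# the formula above. A token seen 5 times across 1000 spams and once across
# 2000 non-spams scores (5/1000.0) / (5/1000.0 + 2*1/2000.0) = 0.005/0.006,
# or about 0.83. As a quick sanity check on the saved model, print the ten
# tokens with the highest spam probability:
for t in sorted(probabilities, key=probabilities.get, reverse=True)[:10]:
    print t, probabilities[t]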
--------------------------------------------------------------------------------
/scrape_comments.py:
--------------------------------------------------------------------------------
import urllib2
from xml.dom import minidom

data = None
headers = {
    'GData-Version': 2
}

spamfile = open('spam', 'w')
notspamfile = open('notspam', 'w')

# get the list of most popular videos
url = 'http://gdata.youtube.com/feeds/api/videos?max-results=50&orderby=viewCount'
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
xmldata = response.read()
videos = minidom.parseString(xmldata)
video_entries = videos.getElementsByTagName("gd:feedLink")
for entry in video_entries:
    link = entry.attributes["href"].value
    print "processing: " + link
    # walk each video's comment feed, 50 comments at a time
    start = 1
    while start <= 1000:
        url = link + '?max-results=50&start-index=' + str(start)

        req = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(req)
        xmldata = response.read()
        comments = minidom.parseString(xmldata)

        entries = comments.getElementsByTagName("entry")

        for entry in entries:
            content = entry.getElementsByTagName("content")
            spam = entry.getElementsByTagName("yt:spam")
            # comments flagged by users carry a yt:spam element
            if len(spam) > 0:
                spamfile.write(content[0].firstChild.nodeValue.encode('utf8'))
                spamfile.write('\nbbSEP\n')
            else:
                notspamfile.write(content[0].firstChild.nodeValue.encode('utf8'))
                notspamfile.write('\nbbSEP\n')
        start += 50

spamfile.close()
notspamfile.close()
--------------------------------------------------------------------------------
/spamcheck.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

import sys
import json

def sort_func(x):
    return abs(token_probs[x] - 0.5)

f = open('probabilities.dict.json', 'r')
json_probs = f.read()
probabilities = json.loads(json_probs)
f.close()

comment = sys.stdin.read()

# look up each token's spam probability; tokens the model has never
# seen get a slightly-innocent default of 0.4
token_probs = {}
for token in comment.split():
    token = token.lower()
    token_probs[token] = probabilities.get(token, 0.4)

# sort token probabilities by distance from 0.5 and keep the most
# interesting ones to calculate the combined probability
max_tokens = 10
ranked = sorted(token_probs, key=sort_func, reverse=True)
interesting_tokens = [token_probs[w] for w in ranked[:max_tokens]]

# combine the token probabilities, Paul Graham style:
#   p = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
a = 1
b = 1
for p in interesting_tokens:
    a = a * p
    b = b * (1 - p)
if a + b == 0:
    print "0"
    sys.exit()
spam_probability = a / (a + b)
print spam_probability
--------------------------------------------------------------------------------
/testit.py:
--------------------------------------------------------------------------------
import subprocess
import urllib2
from xml.dom import minidom

data = None
headers = {
    'GData-Version': 2
}
num_spam = 0
num_notspam = 0
num_marked_as_spam = 0
num_marked_as_notspam = 0
false_positives = 0
false_negatives = 0

# get a list of popular videos, skipping past the ones used for training
url = 'http://gdata.youtube.com/feeds/api/videos?max-results=2&start-index=50&orderby=viewCount'
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
xmldata = response.read()
videos = minidom.parseString(xmldata)
video_entries = videos.getElementsByTagName("gd:feedLink")
for entry in video_entries:
    link = entry.attributes["href"].value
    print "processing: " + link
    # get comments
    start = 1
    while start <= 1000:
        url = link + '?max-results=50&start-index=' + str(start)

        req = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(req)
        xmldata = response.read()
        comments = minidom.parseString(xmldata)

        entries = comments.getElementsByTagName("entry")

        for entry in entries:
            content = entry.getElementsByTagName("content")

            comment_string = content[0].firstChild.nodeValue.encode('utf8')
            print "checking: " + comment_string
            # score the comment by piping it through spamcheck.py
            p = subprocess.Popen("./spamcheck.py", stdin=subprocess.PIPE, stdout=subprocess.PIPE)
            p.stdin.write(comment_string)
            p.stdin.close()
            score = float(p.stdout.readline())
            p.kill()
            print "score: " + str(score)
            if score >= 0.9:
                num_marked_as_spam = num_marked_as_spam + 1
            else:
                num_marked_as_notspam = num_marked_as_notspam + 1

            # compare the score against the user-reported yt:spam flag
            spam = entry.getElementsByTagName("yt:spam")
            if len(spam) > 0:
                if score < 0.9:
                    false_negatives = false_negatives + 1
                num_spam = num_spam + 1
            else:
                if score >= 0.9:
                    false_positives = false_positives + 1
                num_notspam = num_notspam + 1

        start += 50

print "total comments: " + str(num_spam + num_notspam)
print "Spams: " + str(num_spam)
print "Not spams: " + str(num_notspam)
print "Marked as spam: " + str(num_marked_as_spam)
print "Marked as notspam: " + str(num_marked_as_notspam)
print "False positives: " + str(false_positives)
print "False negatives: " + str(false_negatives)
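# Illustrative extra, not part of the original script: derive the summary
# rates quoted in the README from the counters above.
if num_spam > 0 and num_notspam > 0:
    true_positives = num_marked_as_spam - false_positives
    print "spams caught: " + str(true_positives / float(num_spam))
    print "false positive rate: " + str(false_positives / float(num_notspam))
--------------------------------------------------------------------------------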