├── README.rdoc
├── build_hashes.py
├── probabilities.dict.json
├── scrape_comments.py
├── spamcheck.py
└── testit.py

--------------------------------------------------------------------------------
/README.rdoc:
--------------------------------------------------------------------------------
== youtube-troll-classifier

This is a Bayesian classifier of online comments. It was trained with YouTube comments, but it could be used on any online commenting system. It is not affiliated in any way with YouTube.com or Google.

It is an implementation of Paul Graham's algorithms and techniques for classifying spam: http://www.paulgraham.com/spam.html

== The training data

Using the YouTube API I trained the classifier with about 30,000 YouTube comments. The API tells you whether a given comment has been marked as spam by other YouTube users.

== Usage

spamcheck.py is the main tool here. Give it a comment via stdin and it prints the probability that the comment is spam, between 0 and 1.0.

  $ echo "Go to my youtube page and check out my videos" | ./spamcheck.py
  0.999964931472
  $ echo "Great video! Loved it" | ./spamcheck.py
  0.222768242843
  $ echo "Please go to my webpage and buy my software" | ./spamcheck.py
  0.997404774789
  $ echo "worst video ever" | ./spamcheck.py
  0.338513134698

== Performance

I ran the classifier on 3 popular videos outside of the ones I used for training. Here were the results:

  total comments: 2936
  Spams: 434
  Not spams: 2502
  Marked as spam: 417
  Marked as notspam: 2519
  False positives: 153
  False negatives: 170

So it caught about 61% of spams (264 of 434) with a 6% false positive rate (153 of 2502). Not very good. Not very good at all.

You can dial these values around: by modifying the thresholds I was able to catch up to 75% of spams, but at the cost of a 15% false positive rate.

== Possible improvements

The data is really fuzzy: the 'spam' comments are the ones YouTube users marked as spam. Sometimes they miss a comment that is spam, sometimes they mark as spam a comment that is really just a troll, and sometimes the trolls mark a legitimate comment as spam. If I went through all of the comments and manually labeled spam vs. not spam, I would have better training data as well as better data with which to score the classifier.

Use capitalization data: right now I convert everything to lowercase, but anecdotally spams seem more likely to be in all caps.

Use punctuation: the classifier doesn't really use punctuation. This is most likely a mistake, because spams seem to have a lot of weird punctuation and ASCII art.

Search for keywords: just tokenizing the comment isn't the best, because a lot of spam comments look like "pleasecheckoutmyfacebookpageatwwwfacebookcom/blah".
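For illustration, here is a minimal sketch of that last idea, which is not part of the classifier (the keyword list is invented for the example): strip the spaces out of both the comment and a list of known spammy phrases, so run-together spam still matches.

  KEYWORDS = ["check out my", "facebook", "my page", "subscribe to"]

  def keyword_hits(comment):
      text = comment.lower().replace(" ", "")
      # match phrases with their spaces removed too, so
      # "pleasecheckoutmyfacebookpage" still hits "check out my"
      return [k for k in KEYWORDS if k.replace(" ", "") in text]

  print keyword_hits("pleasecheckoutmyfacebookpageatwwwfacebookcom/blah")
  # ['check out my', 'facebook']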
--------------------------------------------------------------------------------
/build_hashes.py:
--------------------------------------------------------------------------------
import json

spamtokens = {}
notspamtokens = {}
probabilities = {}

spam_count = 0      # number of spam comments seen
notspam_count = 0   # number of non-spam comments seen

print 'building dictionaries...'

# count token occurrences in the spam corpus; comments are separated
# by the bbSEP marker written by scrape_comments.py
f = open('spam', 'r')
for line in f:
    for token in line.split():
        if token == "bbSEP":
            spam_count = spam_count + 1
            continue
        token = token.lower()
        spamtokens[token] = spamtokens.get(token, 0) + 1
f.close()

f = open('notspam', 'r')
for line in f:
    for token in line.split():
        if token == "bbSEP":
            notspam_count = notspam_count + 1
            continue  # don't count the separator itself as a token
        token = token.lower()
        notspamtokens[token] = notspamtokens.get(token, 0) + 1
f.close()

# per-token spam probability, following Paul Graham's formula
#   p = (s/S) / (s/S + 2g/G)
# where s and g are the token's counts in the spam and non-spam corpora,
# S and G are the corpus sizes, and the non-spam count is doubled to
# bias the classifier away from false positives
for k in spamtokens:
    num_in_spam = spamtokens[k]
    num_in_notspam = notspamtokens.get(k, 0) * 2
    p = (num_in_spam / float(spam_count)) / ((num_in_spam / float(spam_count)) + (num_in_notspam / float(notspam_count)))
    probabilities[k] = p

# tokens that never appear in spam get probability 0
for k in notspamtokens:
    if k not in spamtokens:
        probabilities[k] = 0

print "saving dictionaries as json..."
# save the hashes as json
f = open('spam.dict.json', 'w')
f.write(json.dumps(spamtokens))
f.close()

f = open('notspam.dict.json', 'w')
f.write(json.dumps(notspamtokens))
f.close()

f = open('probabilities.dict.json', 'w')
f.write(json.dumps(probabilities))
f.close()
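# Illustrative extra, not part of the original script: a worked example of
# the formula above. A token seen 5 times across 1000 spams and once across
# 2000 non-spams scores (5/1000.0) / (5/1000.0 + 2*1/2000.0) = 0.005/0.006,
# or about 0.83. As a quick sanity check on the saved model, print the ten
# tokens with the highest spam probability:
for t in sorted(probabilities, key=probabilities.get, reverse=True)[:10]:
    print t, probabilities[t]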
--------------------------------------------------------------------------------
/scrape_comments.py:
--------------------------------------------------------------------------------
import urllib2
from xml.dom import minidom

data = None
headers = {
    'GData-Version': 2
}

spamfile = open('spam', 'w')
notspamfile = open('notspam', 'w')

# get the list of most popular videos
url = 'http://gdata.youtube.com/feeds/api/videos?max-results=50&orderby=viewCount'
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
xmldata = response.read()
videos = minidom.parseString(xmldata)
video_entries = videos.getElementsByTagName("gd:feedLink")
for entry in video_entries:
    link = entry.attributes["href"].value
    print "processing: " + link
    # walk each video's comment feed, 50 comments at a time
    start = 1
    while start <= 1000:
        url = link + '?max-results=50&start-index=' + str(start)

        req = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(req)
        xmldata = response.read()
        comments = minidom.parseString(xmldata)

        entries = comments.getElementsByTagName("entry")

        for entry in entries:
            content = entry.getElementsByTagName("content")
            spam = entry.getElementsByTagName("yt:spam")
            # comments flagged by users carry a yt:spam element
            if len(spam) > 0:
                spamfile.write(content[0].firstChild.nodeValue.encode('utf8'))
                spamfile.write('\nbbSEP\n')
            else:
                notspamfile.write(content[0].firstChild.nodeValue.encode('utf8'))
                notspamfile.write('\nbbSEP\n')
        start += 50

spamfile.close()
notspamfile.close()
--------------------------------------------------------------------------------
/spamcheck.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

import sys
import json

def sort_func(x):
    return abs(token_probs[x] - 0.5)

f = open('probabilities.dict.json', 'r')
json_probs = f.read()
probabilities = json.loads(json_probs)
f.close()

comment = sys.stdin.read()

# look up each token's spam probability; tokens the model has never
# seen get a slightly-innocent default of 0.4
token_probs = {}
for token in comment.split():
    token = token.lower()
    token_probs[token] = probabilities.get(token, 0.4)

# sort token probabilities by distance from 0.5 and keep the most
# interesting ones to calculate the combined probability
max_tokens = 10
ranked = sorted(token_probs, key=sort_func, reverse=True)
interesting_tokens = [token_probs[w] for w in ranked[:max_tokens]]

# combine the token probabilities, Paul Graham style:
#   p = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
a = 1
b = 1
for p in interesting_tokens:
    a = a * p
    b = b * (1 - p)
if a + b == 0:
    print "0"
    sys.exit()
spam_probability = a / (a + b)
print spam_probability
--------------------------------------------------------------------------------
/testit.py:
--------------------------------------------------------------------------------
import subprocess
import urllib2
from xml.dom import minidom

data = None
headers = {
    'GData-Version': 2
}
num_spam = 0
num_notspam = 0
num_marked_as_spam = 0
num_marked_as_notspam = 0
false_positives = 0
false_negatives = 0

# get a list of popular videos, skipping past the ones used for training
url = 'http://gdata.youtube.com/feeds/api/videos?max-results=2&start-index=50&orderby=viewCount'
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
xmldata = response.read()
videos = minidom.parseString(xmldata)
video_entries = videos.getElementsByTagName("gd:feedLink")
for entry in video_entries:
    link = entry.attributes["href"].value
    print "processing: " + link
    # get comments
    start = 1
    while start <= 1000:
        url = link + '?max-results=50&start-index=' + str(start)

        req = urllib2.Request(url, data, headers)
        response = urllib2.urlopen(req)
        xmldata = response.read()
        comments = minidom.parseString(xmldata)

        entries = comments.getElementsByTagName("entry")

        for entry in entries:
            content = entry.getElementsByTagName("content")

            comment_string = content[0].firstChild.nodeValue.encode('utf8')
            print "checking: " + comment_string
            # score the comment by piping it through spamcheck.py
            p = subprocess.Popen("./spamcheck.py", stdin=subprocess.PIPE, stdout=subprocess.PIPE)
            p.stdin.write(comment_string)
            p.stdin.close()
            score = float(p.stdout.readline())
            p.kill()
            print "score: " + str(score)
            if score >= 0.9:
                num_marked_as_spam = num_marked_as_spam + 1
            else:
                num_marked_as_notspam = num_marked_as_notspam + 1

            # compare the score against the user-reported yt:spam flag
            spam = entry.getElementsByTagName("yt:spam")
            if len(spam) > 0:
                if score < 0.9:
                    false_negatives = false_negatives + 1
                num_spam = num_spam + 1
            else:
                if score >= 0.9:
                    false_positives = false_positives + 1
                num_notspam = num_notspam + 1

        start += 50

print "total comments: " + str(num_spam + num_notspam)
print "Spams: " + str(num_spam)
print "Not spams: " + str(num_notspam)
print "Marked as spam: " + str(num_marked_as_spam)
print "Marked as notspam: " + str(num_marked_as_notspam)
print "False positives: " + str(false_positives)
print "False negatives: " + str(false_negatives)
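# Illustrative extra, not part of the original script: derive the summary
# rates quoted in the README from the counters above.
if num_spam > 0 and num_notspam > 0:
    true_positives = num_marked_as_spam - false_positives
    print "spams caught: " + str(true_positives / float(num_spam))
    print "false positive rate: " + str(false_positives / float(num_notspam))
--------------------------------------------------------------------------------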