├── intro_web_data
│   ├── arts
│   ├── classify.py
│   ├── delicious_import.py
│   ├── distance_demo.py
│   ├── links.csv
│   ├── nytimes_pull.py
│   ├── rec.py
│   ├── sports
│   ├── stopwords.txt
│   └── tag_clustering.py
└── solving_problems
    ├── access_log.txt
    ├── bloom_filter.py
    ├── decision_tree_regression.py
    ├── descriptions.csv
    ├── flat.txt
    ├── kmeans_descriptions.py
    ├── liked_decision_tree.py
    ├── map_reduce.py
    ├── path_distribution.txt
    ├── pca.py
    ├── scripts
    │   ├── bar_chart.py
    │   ├── histogram.py
    │   └── ninety_five_percent.py
    ├── simhashes.py
    ├── thingiverse_all_names.csv
    ├── thingiverse_liked_objects.csv
    ├── thingiverse_liked_objects_1k.csv
    ├── thingiverse_tree.dot
    └── thingiverse_tree.png
/intro_web_data/arts:
--------------------------------------------------------------------------------
1 | LONDON -- An opera about Anna Nicole Smith --the American sex symbol, Playboy Playmate, hapless model, laughable actress and fortune-hunting wife of a billionaire 62 years her senior? Commissioned by, no less, the Royal Opera at Covent Garden? When the plans were announced it sounded like a dubious idea, a tawdry way for a major opera house to look
2 | THIS is the last weekend to see the New York production of “The Merchant of Venice” starring an Academy Award winner in one of Shakespeare ’s greatest roles, Shylock the moneylender. That is, until next weekend. Al Pacino (who won the best-actor Oscar for “Scent of a Woman”) wraps up his four-month run as Shylock on
3 | Theater Approximate running times are in parentheses. Theaters are in Manhattan unless otherwise noted. Full reviews of current shows, additional listings, showtimes and tickets: nytimes.com/theater . Previews and Openings ‘The Book of Mormon’ Previews start on Thursday. Opens on March 24. The “South Park” creators Matt
4 | It’s comedy night at the asylum, folks. And have we got some high-voltage vaudeville for you, the kind that curls your hair and turns your knees to rubber. So here he is, all the way from St. Petersburg, Russia, the man who put the madcap in madness. Put your hands together for the stand-up stylings of Aksentii Poprishchin. No such words of
5 | Only a performer of monumental presence can withstand the theatrical typhoon that is Mandy Patinkin . So hats off to the frail-looking, child-size marionette who walks away with “Compulsion,” the straight-line bio-drama by Rinne Groff, starring Mr. Patinkin at gale force. Designed by Matt Acheson, this charismatic puppet — with
6 | Strip away the uninspired mythology, and “I Am Number Four” is just your average high school movie with below-average drama. Fielding familiar classroom stereotypes — the bully, the science geek, the strutting alien female in the skintight cat suit — this turgid schedule filler is only marginally more fun than a week’s
7 | “Putty Hill,” Matt Porterfield ’s moody, elliptical fusion of fiction and documentary, slips back and forth between the forms with a stealth that dissolves one into the other. The mostly nonprofessional actors in the film, set in a working-class neighborhood on the outskirts of Baltimore, play versions of themselves in a fictional
8 | Icíar Bollaín’s bluntly political film “Even the Rain” makes pertinent, if heavy-handed, comparisons between European imperialism five centuries ago and modern globalization. In particular it portrays high-end filming on location in poor countries as an offshoot of colonial exploitation. The movie is set in and
9 | “We Are What We Are,” Jorge Michel Grau’s macabre fable of urban survival, follows the disintegration of a pod of people eaters when its diseased patriarch expires in a shopping mall. Swiftly removed by a wordless cleanup crew, the man’s remains are found to contain a single undigested finger. “It’s shocking how
10 | The programmers for Film Comment Selects possess the refined tastes of practiced cine-mixologists, along with a yen for the outré. For the 11th edition of this two-week annual festival, which starts on Friday at the Walter Reade Theater, they have exhumed the old and rounded up the new, unearthing treasures and curiosities to put next to
11 | Liam Neeson ’s latter-day renaissance as an unlikely action star should give hope to performers and viewers of a certain age (i.e., over 40) everywhere. While that irrepressible exhibitionist Helen Mirren , born in 1945, continues to inspire legions of AARP members, one discarded garment at a time, Mr. Neeson, a comparative pup born in 1952,
12 | Movies Ratings and running times are in parentheses; foreign films have English subtitles. Full reviews of all current releases, movie trailers, showtimes and tickets: nytimes.com/movies . ★ ‘Another Year’ (PG-13, 2:09) An autumnal gem from Mike Leigh , by turns sweet and abrasive, gentle and sad, about the unequal distribution of
13 | It’s fine to employ a plot device that’s been used repeatedly. But you run into trouble when you use a familiar plot and do only the familiar with it. “Immigration Tango,” a pale romantic comedy, has this problem. An immigrant couple (he’s from Colombia, she’s from Russia) in Florida strike a deal with their best
14 | One of the most urgent and certainly among the most beautifully shot documentaries to hit the big screen in recent memory, “The Last Lions” isn’t just another cute and fuzzy encounter session with a different species. It’s a pulse-quickening, tear-duct milking and outrageously dramatized story about the threats —
15 | Essentially a two-person play liberally sprinkled with gleaming, groovy, graphic sex, “Now & Later” exudes an amiably accessible vibe that softens the edges of its freewheeling explicitness. Lest we enjoy all this flesh without an ennobling context, the movie kicks off with Wilhelm Reich’s assertion of the link between sexual
16 | As a group they give a new and truer meaning to the phrase “independent film.” In a country where all movies must obtain official approval to be exhibited commercially, the five Chinese directors whose work will be featured beginning on Friday in the Museum of Modern Art’s Documentary Fortnight are forced to operate in a peculiar
17 | The something wicked that comes creeping like night in “Vanishing on 7th Street,” turning down the sun and seemingly sucking people right out of their homes, offices, cars and clothes, arrives without warning. One minute moviegoers are yukking it up at a multiplex in this generally nifty little horror flick, and the next minute
18 | “Loveless” is an aimless film about an aimless fellow, but it’s not without its charms. It may be without a point, but hey, you can’t have everything in a no-budget film like this. Andrew (Andrew Von Urtz) is a nearing-middle-age New Yorker who’s still behaving like a self- and sex-centered 25-year-old, trolling for
19 | It’s too bad that Paul Levesque is so, well, large. Mr. Levesque — a professional wrestler whose ring name is Triple H — is a perfectly tolerable actor, as he shows in “The Chaperone,” a lightweight comedy aimed, presumably, at tweeners and fans of World Wrestling Entertainment, whose film division generated this
20 | Around Town Museums and Sites American Museum of Natural History (Saturday and Wednesday) “Saluting Our Jazz Elders,” an afternoon of music and discussions on Saturday in celebration of Black History Month, will include performances by the percussionist Sekou Alaje (12:30 p.m.); the New Amsterdam Music Association (1:15 p.m.); the
21 | ‘CIRCUS INCOGNITUS’ Jamie Adkins, an old-style vaudeville performer, won’t mind if you throw things at him during his show. He wants you to throw. He invites you to throw. He’ll even provide the things. That Mr. Adkins behaves this way in front of children at the New Victory Theater attests to his bravery. The elementary
22 | Poised and whispery, Vanessa Paradis played her New York City debut as a headliner at Town Hall on Wednesday night. In France, Ms. Paradis has been a pop star since she was 14, when her single “Joe le Taxi” was a No. 1 hit in 1987, and where she won the Victoires de la Musique award, the equivalent of a Grammy , for album of the year
23 | LONDON — The English National Opera introduced the German director Nikolaus Lehnhoff’s production of Wagner’s “Parsifal” in 1999, and since then this influential modern staging, which presents the Knights of the Grail as a spiritually decaying brotherhood in a bleakly gray, postapocalyptic and timeless setting, has
24 | Jazz Full reviews of recent jazz concerts: nytimes.com/music . Uri Caine, Theo Bleckmann, Todd Sickafoose, Jenny Scheinman (Friday) This latest show in the weekly Spontaneous Constructions series, which aims to foster new collaborations, features Mr. Caine, a keyboardist of spectacularly diverse tastes; Mr. Bleckmann, a vocalist of ethereal
25 | Pop Prices may not include ticketing service charges. Full reviews of recent concerts: nytimes.com/music . Trey Anastasio (Tuesday) Although Phish , the jam band that made him a star to the noodle-dancing set, has since regrouped, Mr. Anastasio has yet to abandon his solo career. Last summer he released “Time Becomes Elastic” (Rubber
26 | The Tune-In festival at the Park Avenue Armory promises to explore musical connections between past and present, as well as a few philosophical notions, like whether music has the power to express anything. (Stravinsky said that it does not; others have disagreed.) Most of the series, which runs through Sunday, was assembled by the enterprising
27 | Classical Full reviews of recent music performances: nytimes.com/music . Opera ★ ‘Armida’ (Friday and Wednesday) This infrequently heard 1817 Rossini opera finally made it to the Met last spring as a vehicle for Renée Fleming in a handsome and fanciful, if rather safe, production by Mary Zimmerman . “Armida” is
28 | Len Lesser, a character actor for more than half a century whose hawklike profile and Noo Yawk accent finally gained him popular recognition when he played Jerry Seinfeld ’s annoying Uncle Leo on “Seinfeld,” died on Wednesday in Burbank Calif. He was 88. The cause was pneumonia, said his son, David, adding that his father had been
29 | There’s a holdup in the Bronx, Brooklyn’s broken out in fights. There’s a traffic jam in Harlem That’s backed up to Jackson Heights. There’s a scout troop short a child, Khrushchev’s due at Idlewild. Car 54, where are you? Ask almost anyone over 50, and the song pours buoyantly forth, evoking one of
30 | Although a two-hour “Hollywood week” episode of Fox’s “American Idol” topped the ratings on Wednesday, CBS’s new series “Criminal Minds: Suspect Behavior,” with Janeane Garofalo and Forest Whitaker , delivered a strong debut at 10 p.m. According to Nielsen’s estimates 12.9 million viewers tuned
31 | Kate Werble Gallery 83 Vandam Street SoHo Through March 12 Flokati rugs, those fluffy white coverings traditionally handmade in the Pindus Mountains in Europe and prized by contemporary designers, become wild-and-woolly wall reliefs in Anna Betbeze’s first New York solo. Ms. Betbeze dyes, scorches, shreds, shaves and otherwise attacks these
32 | Winkleman Gallery 621 West 27th Street Chelsea Through March 12 The three short, related videos that make up Janet Biggs’s debut show at Winkleman were filmed on glacial islands between the top of Norway and the North Pole. Playing on separate screens and in overlapping sequence, the pieces can be viewed in any order, though a gallery news
33 | Meredith Ward Fine Art 44 East 74th Street Manhattan Through March 12 Working in oil on small pieces of canvas board near the waters and harbors of Manhattan, John Marin (1870-1953) was possibly the first American artist to make abstract paintings. There are other candidates — among them Marsden Hartley and Georgia O’Keeffe — but
34 | Rembrandt ’s jowly, battered face glows like a night light in the great late self-portrait from 1658 at the Frick Collection . And it glows more brightly than ever now that layers of old varnish have been cleaned away. Colors — the gold of the artist’s shirt, the wine-red of his Middle Easternish sash, the pink of the chafe mark
35 | BRIDGEPORT, Conn. — “This was the most contaminated room,” Kathleen Maher said, pointing to powdery debris, paint flakes and glass shards on shelves and carpeting in a dimly lighted ground-floor gallery at her work space. She is curator and executive director at the Barnum Museum here, where last June a tornado struck its 1890s
36 | Art Museums and galleries are in Manhattan unless otherwise noted. Full reviews of recent art shows: nytimes.com/art . Museums American Folk Art Museum : ‘Eugene Von Bruenchenhein: Freelance Artist — Poet and Sculptor — Inovator — Arrow maker and Plant man — Bone artifacts constructor — Photographer and Architect
37 | Perhaps because her work so frequently appears in exhibitions, art fairs and auctions, it seems as though Cindy Sherman’s photographs are often with us. Think of images of her as a clown, a Renaissance Madonna, a sex kitten or even a half-pig, half-human creature. But in the United States it has been nearly 14 years since the public has had a
38 | Steven Harvey Fine Art Projects 24 East 73rd Street Manhattan Through Feb. 28 The companionship and inspiration that artists gain from other artists and their work is pinpointed in this sweet and unusual show. Its main focus is the friendship between Gandy Brodie (1925-1975) and Bob Thompson (1937-1966), who met in Provincetown, Mass., in the
39 | Many human beings evidently share with the magpie a gene causing an irrational attraction to bright and shiny objects. If you suffer from this disorder, you will love “Cloisonné: Chinese Enamels From the Yuan, Ming and Qing Dynasties,” a ravishing exhibition at the Bard Graduate Center. Displaying more than 160 items ranging from
40 | Leo Koenig Inc. Projekte 541 West 23rd Street Chelsea Through March 19 Vincent Szarek’s sleek, four-piece exhibition offers a poetic meditation on modern decadence. The first item, on the floor, is a long, narrow, geometric solid painted in glossy, metal-flake gold. It looks like a parody of Minimalist sculpture, but it is also readily
41 | The New Museum has become a busy place this year, and it is not yet even March. In January it opened a popular tribute to the market-hardy paintings of George Condo. Now it is offering a startlingly excellent resurrection of the prescient Post-Minimalist renegade Lynda Benglis and her gaudy, multidexterous and often gender-bending segues among
42 | Dance Full reviews of recent performances: nytimes.com/dance . Aspen Santa Fe Ballet (Tuesday through Thursday) This handsome company returns with a trio of contemporary works: Jorma Elo’s “Red Sweet,” Jiri Kylian’s “Stamping Ground” and the East Coast premiere of “Uneven,” by the Spanish
43 | Walter Dundervill is just fine with letting his imagination run wild, and that’s evident in “Aesthetic Destiny 1: Candy Mountain,” performed at Dance Theater Workshop on Wednesday night, where the terrain of the stage is decorated with colorful polygons of varying sizes. Some are propped up like jagged mountains with strangely
44 | New York’s flamencophiles are deprived of their annual Flamenco Festival this year, thanks to financial cutbacks in Spain. (The hope is to continue it biennially instead.) In its place, however, is a weeklong season of the filmmaker Carlos Saura ’s “Flamenco Hoy” (“Flamenco Today”) at City Center. Tuesday’s
45 | The British travel writer and novelist Bruce Chatwin (1940-89) had blond hair, flinty blue eyes and delicately firm features — he looked like a bookish member of the Police, Sting’s band, circa 1983 — and the kind of narcissism that can be a byproduct of talent mixed with charisma. Both men and women were drawn to him, and he to
46 | JACKSON, Miss. THERE is “The Help,” and then there is the help. And she is not happy. Ablene Cooper, a 60-year-old woman who has long worked as a maid here, has filed a lawsuit against Kathryn Stockett, the author of the best-selling novel “The Help,” about black maids working for white families in Jackson in the 1960s. In
47 | Geoffrey Rush stars in the production at the Brooklyn Academy of Music.
48 | Geoffrey Rush stars in the production at the Brooklyn Academy of Music.
49 | Geoffrey Rush stars in the production at the Brooklyn Academy of Music.
50 | Geoffrey Rush stars in the production at the Brooklyn Academy of Music.
51 |
--------------------------------------------------------------------------------
/intro_web_data/classify.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # encoding: utf-8
3 | """
4 | classify.py
5 |
6 | Created by Hilary Mason on 2011-02-17.
7 | Copyright (c) 2011 Hilary Mason. All rights reserved.
8 | """
9 |
10 | import re, string
11 |
12 | from nltk import FreqDist
13 | from nltk.tokenize import word_tokenize
14 | from nltk.stem.porter import PorterStemmer
15 |
16 | class NaiveBayesClassifier(object):
17 |
18 |     def __init__(self):
19 |         self.feature_count = {}
20 |         self.category_count = {}
21 |
22 |     def probability(self, item, category):
23 |         """
24 |         probability: unnormalized prob that an item is in a category, Pr(item|category) * Pr(category)
25 |         """
26 |         category_prob = self.get_category_count(category) / sum(self.category_count.values())
27 |         return self.document_probability(item, category) * category_prob
28 |
29 |     def document_probability(self, item, category):
30 |         features = self.get_features(item)
31 |
32 |         p = 1
33 |         for feature in features:
34 |             print "%s - %s - %s" % (feature, category, self.weighted_prob(feature, category))
35 |             p *= self.weighted_prob(feature, category)
36 |
37 |         return p
38 |
39 |     def train_from_data(self, data):
40 |         for category, documents in data.items():
41 |             for doc in documents:
42 |                 self.train(doc, category)
43 |
44 |         # print self.feature_count
45 |
46 |
47 |     # def get_features(self, document):
48 |     #     all_words = word_tokenize(document)
49 |     #     all_words_freq = FreqDist(all_words)
50 |     #
51 |     #     # print sorted(all_words_freq.items(), key=lambda(w,c):(-c, w))
52 |     #     return all_words_freq
53 |
54 |     def get_features(self, document):
55 |         document = re.sub('[%s]' % re.escape(string.punctuation), '', document) # removes punctuation
56 |         document = document.lower() # make everything lowercase
57 |         all_words = [w for w in word_tokenize(document) if 3 < len(w) < 16] # keep words of 4 to 15 characters
58 |         p = PorterStemmer()
59 |         all_words = [p.stem(w) for w in all_words]
60 |         all_words_freq = FreqDist(all_words)
61 |
62 |         # print sorted(all_words_freq.items(), key=lambda(w,c):(-c, w))
63 |         return all_words_freq
64 |
65 |     def increment_feature(self, feature, category):
66 |         self.feature_count.setdefault(feature,{})
67 |         self.feature_count[feature].setdefault(category, 0)
68 |         self.feature_count[feature][category] += 1
69 |
70 |     def increment_cat(self, category):
71 |         self.category_count.setdefault(category, 0)
72 |         self.category_count[category] += 1
73 |
74 |     def get_feature_count(self, feature, category):
75 |         if feature in self.feature_count and category in self.feature_count[feature]:
76 |             return float(self.feature_count[feature][category])
77 |         else:
78 |             return 0.0
79 |
80 |     def get_category_count(self, category):
81 |         if category in self.category_count:
82 |             return float(self.category_count[category])
83 |         else:
84 |             return 0.0
85 |
86 |     def feature_prob(self, f, category): # Pr(A|B)
87 |         if self.get_category_count(category) == 0:
88 |             return 0.0
89 |
90 |         return (self.get_feature_count(f, category) / self.get_category_count(category))
91 |
92 |     def weighted_prob(self, f, category, weight=1.0, ap=0.5):
93 |         basic_prob = self.feature_prob(f, category)
94 |
95 |         totals = sum([self.get_feature_count(f, c) for c in self.category_count.keys()]) # count of f across all categories; c avoids shadowing the category argument
96 |
97 |         w_prob = ((weight*ap) + (totals * basic_prob)) / (weight + totals) # pulled toward the assumed probability ap when f is rare
98 |         return w_prob
99 |
100 |     def train(self, item, category):
101 |         features = self.get_features(item)
102 |
103 |         for f in features:
104 |             self.increment_feature(f, category)
105 |
106 |         self.increment_cat(category)
107 |
108 | if __name__ == '__main__':
109 |     labels = ['arts', 'sports'] # these are the categories we want
110 |     data = {}
111 |     for label in labels:
112 |         f = open(label, 'r')
113 |         data[label] = f.readlines()
114 |         # print len(data[label])
115 |         f.close()
116 |
117 |     nb = NaiveBayesClassifier()
118 |     nb.train_from_data(data)
119 |     print nb.probability("Early Friday afternoon, the lead negotiators for the N.B.A. and the players union will hold a bargaining session in Beverly Hills — the latest attempt to break a 12-month stalemate on a new labor deal.", 'arts')
120 |     print nb.probability("Early Friday afternoon, the lead negotiators for the N.B.A. and the players union will hold a bargaining session in Beverly Hills — the latest attempt to break a 12-month stalemate on a new labor deal.", 'sports')
121 |
122 |
123 |
--------------------------------------------------------------------------------
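Worked example for the smoothing in classify.py's weighted_prob. This is a minimal Python 3 sketch, not part of the repository: the standalone function, its parameter names, and the counts below are made up for illustration. It shows how the estimate is pulled toward the assumed probability (0.5) when a feature has little or no evidence.

```python
def weighted_prob(feature_counts, category_counts, feature, category,
                  weight=1.0, assumed_prob=0.5):
    """Blend the observed Pr(feature | category) with an assumed prior."""
    cat_total = category_counts.get(category, 0)
    in_category = feature_counts.get(feature, {}).get(category, 0)
    # Observed fraction of the category's documents that contain the feature.
    basic_prob = in_category / cat_total if cat_total else 0.0
    # How often the feature was seen across all categories.
    totals = sum(feature_counts.get(feature, {}).values())
    return (weight * assumed_prob + totals * basic_prob) / (weight + totals)


# Hypothetical training counts: 4 documents per category.
category_counts = {"sports": 4, "arts": 4}
feature_counts = {"game": {"sports": 3, "arts": 1}}

p_sports = weighted_prob(feature_counts, category_counts, "game", "sports")
p_arts = weighted_prob(feature_counts, category_counts, "game", "arts")
p_unseen = weighted_prob(feature_counts, category_counts, "opera", "arts")
print(p_sports, p_arts, p_unseen)
```

With these counts, "game" scores (0.5 + 4 * 0.75) / 5 = 0.7 for sports and (0.5 + 4 * 0.25) / 5 = 0.3 for arts, while the unseen word "opera" falls back to the assumed 0.5 instead of zeroing out the whole product.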
/intro_web_data/delicious_import.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # encoding: utf-8
3 | """
4 | delicious_import.py
5 |
6 | Created by Hilary Mason on 2010-11-28.
7 | Copyright (c) 2010 Hilary Mason. All rights reserved.
8 | """
9 |
10 | import sys
11 | import urllib
12 | import csv
13 | from xml.dom import minidom
14 |
15 |
16 | class delicious_import(object):
17 |     def __init__(self, username, password=''):
18 |         # API URL: https://user:passwd@api.del.icio.us/v1/posts/all
19 |         url = "https://%s:%s@api.del.icio.us/v1/posts/all" % (username, password)
20 |         h = urllib.urlopen(url)
21 |         content = h.read()
22 |         h.close()
23 |
24 |         x = minidom.parseString(content)
25 |
26 |         data = []
27 |
28 |         # each <post> element carries href, description, tag and time attributes
29 |         post_list = x.getElementsByTagName('post')
30 |         for post_index, post in enumerate(post_list):
31 |             url = post.getAttribute('href')
32 |             desc = post.getAttribute('description')
33 |             tags = ",".join(post.getAttribute('tag').split()) # space-separated tags become comma-separated
34 |             timestamp = post.getAttribute('time')
35 |
36 |             data.append([url.encode("utf-8"), tags.encode("utf-8")])
37 |
38 |         with open("links.csv", 'wb') as f: # closes the file once every row is written
39 |             writer = csv.writer(f)
40 |             for entry in data:
41 |                 writer.writerow(entry)
42 |
43 | if __name__ == '__main__':
44 |     try:
45 |         (username, password) = sys.argv[1:]
46 |     except ValueError:
47 |         sys.exit("Usage: python delicious_import.py username password") # exit instead of hitting a NameError below
48 |
49 |     d = delicious_import(username, password)
50 |
51 |
--------------------------------------------------------------------------------
/intro_web_data/distance_demo.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # encoding: utf-8
3 | """
4 | distance_demo.py
5 |
6 | Created by Hilary Mason on 2011-02-18.
7 | Copyright (c) 2011 Hilary Mason. All rights reserved.
8 | """
9 |
10 |
11 | from hcluster import euclidean, cityblock, jaccard
12 |
13 | class TagClustering(object):
14 |
15 |     def __init__(self):
16 |         v1 = [0,0,0,1]
17 |         v2 = [0,1,1,1]
18 |
19 |         print euclidean(v1, v2)
20 |         print cityblock(v1, v2)
21 |         print jaccard(v1, v2)
22 |
23 |
24 | if __name__ == '__main__':
25 |     t = TagClustering()
26 |
27 |
--------------------------------------------------------------------------------
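The three metrics printed by distance_demo.py can be checked by hand. The following is a minimal Python 3 sketch, not part of the repository, that uses the textbook definitions rather than hcluster itself; it assumes hcluster's jaccard follows the usual definition for binary vectors (share of disagreeing positions among those where at least one vector is non-zero).

```python
import math


def euclidean(u, v):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cityblock(u, v):
    # Manhattan distance: summed absolute differences.
    return sum(abs(a - b) for a, b in zip(u, v))


def jaccard(u, v):
    # Only positions where at least one vector is non-zero are considered.
    active = [(a, b) for a, b in zip(u, v) if a != 0 or b != 0]
    if not active:
        return 0.0
    return sum(1 for a, b in active if a != b) / len(active)


v1 = [0, 0, 0, 1]
v2 = [0, 1, 1, 1]

print(euclidean(v1, v2))
print(cityblock(v1, v2))
print(jaccard(v1, v2))
```

The vectors differ in two of four positions, so the Euclidean distance is the square root of 2 and the city-block distance is 2. Three positions are non-zero in at least one vector and two of them disagree, giving a Jaccard distance of 2/3.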
/intro_web_data/nytimes_pull.py:
--------------------------------------------------------------------------------
1 | import urllib
2 | import json
3 |
4 | def main(api_key, category, label):
5 |
6 |     content = []
7 |     for i in range(0,5):
8 |         # print "http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=news_desk:('%s')&api-key=%s&page=%s" % (category, api_key, i)
9 |         h = urllib.urlopen("http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=news_desk:(\"%s\")&api-key=%s&page=%s" % (category, api_key, i))
10 |         print h
11 |         raw = h.read() # keep the raw body so it can be reported if parsing fails
12 |         try:
13 |             result = json.loads(raw)
14 |             content.append(result)
15 |         except ValueError:
16 |             print "Malformed JSON: " + raw
17 |             continue #In the rare cases that JSON refuses to parse
18 |
19 |     f = open(label, 'w')
20 |     for line in content:
21 |         try:
22 |             f.write('%s\n' % line)
23 |         except UnicodeEncodeError:
24 |             pass
25 |
26 |     f.close()
27 |
28 | if __name__ == '__main__':
29 |     main("f7b4a1749764aec0364b215c354e3a0f:18:25759498", "Arts","arts")
30 |     main("f7b4a1749764aec0364b215c354e3a0f:18:25759498", "Sports","sports")
31 |
--------------------------------------------------------------------------------
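nytimes_pull.py builds its request URL by string interpolation, which leaves the quotes and parentheses in the news_desk filter unescaped. A minimal Python 3 sketch of the same URL built with urlencode follows; it is not part of the repository, the helper name build_search_url is hypothetical, and "YOUR_API_KEY" is a placeholder rather than a real key. The endpoint and parameter names are the ones used in the script above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(api_key, category, page):
    """Return the article-search URL for one page of one news desk."""
    params = {
        "fq": 'news_desk:("%s")' % category,  # same filter as nytimes_pull.py
        "api-key": api_key,
        "page": page,
    }
    # urlencode percent-escapes the colon, quotes and parentheses.
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("YOUR_API_KEY", "Arts", 0)
print(url)

# Round-trip check: decoding the query string recovers the original values.
query = parse_qs(urlsplit(url).query)
print(query["fq"][0])
```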
/intro_web_data/rec.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # encoding: utf-8
3 | """
4 | rec.py
5 |
6 | Created by Hilary Mason on 2011-02-18.
7 | Copyright (c) 2011 Hilary Mason. All rights reserved.
8 | """
9 |
10 | import csv
11 |
12 | # from Pycluster import *
13 | from hcluster import euclidean
14 |
15 | class TagClustering(object):
16 |
17 |     def __init__(self):
18 |         tag_data = self.load_link_data()
19 |         # print tag_data
20 |         all_tags = []
21 |         all_urls = []
22 |         for url,tags in tag_data.items():
23 |             all_urls.append(url)
24 |             all_tags.extend(tags)
25 |
26 |         all_tags = list(set(all_tags)) # list of all tags in the space
27 |
28 |         numerical_data = {} # create vectors for each item
29 |         for url,tags in tag_data.items():
30 |             v = []
31 |             for t in all_tags:
32 |                 if t in tags:
33 |                     v.append(1)
34 |                 else:
35 |                     v.append(0)
36 |             numerical_data[url] = v
37 |
38 |         recommend_url = 'http://www.qwantz.com/index.php' # must be one of the URLs in links.csv
39 |         results = {}
40 |         for url,vector in numerical_data.items():
41 |             d = euclidean(numerical_data[recommend_url], vector)
42 |             results[url] = d
43 |
44 |         print sorted(results.items(), key=lambda item: (item[1], item[0])) # nearest first, ties broken by URL
45 |
46 |
47 |     def load_link_data(self,filename="links.csv"):
48 |         data = {}
49 |
50 |         with open(filename, 'r') as f: # closes the file after reading
51 |             for row in csv.reader(f):
52 |                 data[row[0]] = row[1].split(',')
53 |
54 |         return data
55 |
56 |
57 | if __name__ == '__main__':
58 |     t = TagClustering()
59 |
60 |
--------------------------------------------------------------------------------
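The recommendation step in rec.py turns each URL's tags into a 0/1 vector over the full tag vocabulary and ranks all URLs by Euclidean distance to a target URL. A minimal Python 3 sketch of that idea follows; it is not part of the repository, it uses plain math instead of hcluster, and the function name and the example.com links are made up for illustration.

```python
import math


def rank_by_tag_distance(tag_data, target_url):
    """Return (url, distance) pairs sorted from nearest to farthest."""
    # Fixed tag order so every vector uses the same dimensions.
    all_tags = sorted({t for tags in tag_data.values() for t in tags})
    vectors = {
        url: [1 if t in tags else 0 for t in all_tags]
        for url, tags in tag_data.items()
    }
    target = vectors[target_url]

    def distance(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(target, v)))

    pairs = [(url, distance(v)) for url, v in vectors.items()]
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


# Hypothetical bookmark data in the shape load_link_data returns.
links = {
    "http://example.com/comics": ["comics", "humor"],
    "http://example.com/webcomic": ["comics", "humor", "daily"],
    "http://example.com/stats": ["statistics", "python"],
}

ranked = rank_by_tag_distance(links, "http://example.com/comics")
for url, dist in ranked:
    print(url, dist)
```

The target itself always comes first at distance 0. The bookmark that shares both tags and adds one more is at distance 1, and the bookmark with no tags in common is at distance 2.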
/intro_web_data/sports:
--------------------------------------------------------------------------------
1 | Harvey Almorn Updyke Jr., 62, of Dadeville, Ala., was arrested on a charge of criminal mischief in connection with the poisoning of the Toomer’s Corner oak trees at Auburn. On Jan. 27, a man saying he was “Al from Dadeville” phoned a radio show, claiming he poured herbicide around the 130-year-old oaks that are a scene of
2 | Mike Repole has had a pretty good winter. St. John’s, his alma mater, is inching closer to a berth in the N.C.A.A. men’s basketball tournament for the first time since 2002. His racehorse Uncle Mo is the early favorite to win the Kentucky Derby . His beloved Mets have started spring training. Repole, after all, is the latest in what has
3 | Derrick Rose scored a career-high 42 points and the host Chicago Bulls headed into the All-Star break with a 109-99 victory over the San Antonio Spurs . The Spurs have the N.B.A. ’s best record, 46-10; the Bulls, who won 41 games last season, are 38-16. Sports Briefing | Basketball
4 | DAYTONA BEACH, Fla. — The 10th anniversary of the fatal crash of Dale Earnhardt at Daytona International Speedway will pass Friday without an official tribute as a lower-level Camping World Series truck race is held at the track. But Earnhardt, a seven-time Nascar Cup champion, will be honored Sunday during the season-opening Daytona 500. On
5 | Joel Northrup refused to compete against a girl at the Iowa state tournament in Des Moines, relinquishing a chance to become a champion because he said wrestling a girl would conflict with his religious beliefs. Northrup, a home-schooled sophomore who was 35-4 wrestling for Linn-Mar High School, defaulted his first-round match in the 112-pound
6 | Tina Maze became the first Slovenian to win an Alpine skiing world championship, riding her advantage from the first run to the gold medal in the women’s giant slalom in Garmisch-Partenkirchen, Germany. She used a controlled second run to finish in 2 minutes 20.54 seconds and defeat Federica Brignone of Italy by 0.09. Sports Briefing | Skiing
7 | DAYTONA BEACH, Fla. — The Danica Patrick Nascar experiment enters its second season on Saturday at Daytona International Speedway, with expectations tempered but hopes raised. Patrick, 28, the only woman to win a race in the IndyCar Series, will compete in Nascar’s Drive4copd 300, a lower-level Nationwide Series race that runs the day
8 | DAYTONA BEACH, Fla. — Kurt Busch declared himself the favorite for Sunday’s Daytona 500, and it was hard to argue after he won the first of two 150-mile qualifying races Thursday at Daytona International Speedway. Busch is 2 for 2 at Daytona this month, having captured the exhibition Budweiser Shootout last Saturday. So far, no one has
9 | In an attempt to jump-start negotiations that stalled a week ago, representatives for N.F.L. owners and the league’s players planned to engage in seven consecutive days of talks with a federal mediator beginning Friday. The collective bargaining agreement expires in less than two weeks, and the decision to attempt to intensify negotiations
10 | Dustin Johnson wound up with another bizarre penalty Thursday when his caddie thought his tee time was 40 minutes later than it was, and he raced to the first tee at the Northern Trust Open in Los Angeles to avoid disqualification. Johnson was given a two-shot penalty for not being on the tee box at his starting time. Players then have five minutes
11 | The remaining games are starting to dwindle, and teams on the playoff bubble, like the Rangers and the Los Angeles Kings , are playing increasingly desperate hockey. That is what happened at Madison Square Garden on Thursday night, in a breathless cliffhanger not decided until Erik Christensen and Mats Zuccarello scored in the shootout and Henrik
12 | For some American hockey players at the highest level, memories of childhood are filled with idyllic days on frozen ponds and outdoor rinks. But for a growing number, childhood memories are framed by palm trees, warm weather and rooting for N.H.L. teams that many Northerners disdain as a failed Sun Belt experiment. Those memories reflect the
13 | SOUTH BEND, Ind. — Every Sunday after church, on the Hansbroughs’ backyard basketball court in Poplar Bluff, Mo., three brothers would play the age-old game of 21. The scene conjures up both Rockwell and Darwin — little Ben Hansbrough, the youngest, learned to survive despite a weekly diet of blocked shots and sharp elbows. Many
14 | Lynetta Kizer scored 17 points, and No. 16 Maryland beat No. 7 Duke , 69-47, Thursday night to drop the visiting Blue Devils into a three-way tie for first place in the Atlantic Coast Conference. The Terrapins (21-5, 7-4) let a 12-point lead dwindle to 39-38 before pulling away to a victory that enabled them to avoid their first three-game losing
15 | Peter Roby grew up playing ball with Tom Thibodeau in New Britain, Conn., and later coached with him at Harvard . His friend’s success at basketball’s highest level is no surprise — Thibodeau, Roby recalled, was always passionate about learning the game and intrigued by the challenge of teaching it. The “defensive
16 | When the Naismith Memorial Basketball Hall of Fame announces its finalists for the class of 2011 on Friday, one name will be conspicuous in its absence: that of Reggie Miller , the former Indiana Pacers sharpshooter, who is in his first year of eligibility. Miller, 45, who retired in 2005 and will be in Los Angeles this weekend as an analyst for
17 | Before they can celebrate Derrick Rose’s ascendance, Kevin Durant’s dominance or Blake Griffin’s hang time , the N.B.A. ’s brightest minds and brightest players will gather in a hotel conference room and beg one another not to ruin it all. For the next three days in downtown Los Angeles, the N.B.A. will do what it does best
18 | PORT ST. LUCIE, Fla. — While their players stretched and exercised in advance of their first official workout of the 2011 season, the owners of the Mets stood nearby on an artificial turf field and discussed the troubling issues that could jeopardize their ownership of the team. Fred Wilpon , the principal owner of the team, said Thursday
19 | PORT ST. LUCIE, Fla. — Johan Santana has been throwing a baseball for almost two weeks, but the next time he throws a pitch in a major league game could be more than four months from now, perhaps sometime close to the All-Star break. That has been the time frame the Mets expected all along for Santana, a two-time Cy Young Award winner. But in
20 | PORT ST. LUCIE, Fla. Half a dozen times during a spirited news conference that lasted about 20 minutes, Fred Wilpon took to the offense with his new favorite V word, now that his role as a victim in Bernard L. Madoff case is under grave legal challenge. In what will most likely be a costly struggle for the survival of his ownership of the Mets , he
21 | TAMPA, Fla. — Freddy Garcia dipped into his memory bank the other morning, his mind drifting to a Seattle special of an afternoon, cool and overcast, during the 2001 playoffs, when he was the 25-year-old ace of the Mariners . Garcia recalled how Bartolo Colon pumped fastball after fastball past his Seattle teammates to secure the division
22 | The Detroit Tigers ’ Miguel Cabrera was arrested on charges of drunken driving and resisting an officer in Fort Pierce, Fla. He has had drinking-related problems, including a 2009 incident in which he fought with his wife after drinking heavily the night before his team lost the A.L. Central title to Minnesota. ¶Catcher Yorvit Torrealba
23 | Joe Frazier, the manager of the Mets in the turbulent period between the tenures of Yogi Berra and Joe Torre , died Tuesday in Broken Arrow, Okla. He was 88 and a longtime Broken Arrow resident. His death was confirmed by the Christian-Gavlik Funeral Home in Broken Arrow. Frazier, who spent almost a half-century in organized baseball, primarily as
24 | PHOENIX — It was Don Mattingly ’s opening news conference at his first spring training as the Dodgers ’ manager, and a seat at the head of a picnic table was reserved for him. Mattingly demurred and folded his 6-foot frame into another chair on an outdoor patio after casually brushing off a dried bird dropping stuck to it. If only
25 | Auburn said that someone poisoned oak trees at Toomer’s Corner, where fans celebrate big wins, and that the trees, which are estimated to be more than 130 years old, could not be saved. Auburn said a herbicide commonly used to kill trees was applied “in lethal amounts” to the soil. A caller to a radio show claimed he had applied
26 | It’s not every day that Max Klimavicius, the owner of Sardi’s, personally cuts a customer’s filet mignon into bite-size pieces. Then again, not every customer has just won Best in Show at the Westminster Kennel Club Dog Show . The chef had cooked the steak until it was medium rare, lightly seasoned it with salt and pepper, then
27 | Camille Richardson has heard all the arguments, read all the comments, and sees the logic. But as a freshman midfielder for the Columbia women’s lacrosse team who is fully aware of the dangers of head trauma, Richardson makes one thing clear: She has no interest in wearing a helmet, as the men must. “Wearing a helmet,” Richardson
28 | The Boston Athletic Association announced new registration procedures for the Boston Marathon in response to the growing demand that will leave some of the fastest runners on the sideline. The field of nearly 27,000 for this year filled up in eight hours. Organizers said the top qualifiers would be allowed to enter first under a two-week, online,
29 | The DVD was sitting in Michael Waltrip ’s house for nine and a half years while the accident churned inside him. His big sister Connie had recorded every race of his. As soon as the Daytona 500 went off the air that fatal day , she decorated the case with stars and happy faces to commemorate his first Daytona — his first Nascar Cup
30 | DAYTONA BEACH, Fla. — Dale Earnhardt Jr. crashed during practice at Daytona International Speedway on Wednesday, costing him the pole position for the Daytona 500 on Sunday. Earnhardt captured the pole last Sunday, but he will now have to switch to a backup car. Under Nascar rules, that means he will have to start from the back of the field.
31 | DAYTONA BEACH, Fla. — A year after potholes led to embarrassing delays in the Daytona 500, a $20 million repaving job at Daytona International Speedway is helping to create another set of concerns for Nascar ’s season-opening showcase event. Nascar officials mandated a series of adjustments to the racecars this week, the latest coming
32 | Lance Armstrong , the seven-time Tour de France winner who is the target of a federal investigation into doping in cycling, announced Wednesday that he had retired from his sport — this time for good. Armstrong, who is 39 and a cancer survivor, said he was leaving to spend more time with his family — he has five children — and to
33 | Arsenal stunned Barcelona with a second-half comeback in the European Champions League at Emirate Stadium on Wednesday in London, with the substitute Andrey Arshavin scoring the winner in the 83rd minute in a 2-1 victory. The Gunners were outplayed for most of the first leg of the Round of 16 meeting. The second leg is scheduled to be played on
34 | The Turkish Basketball Federation lifted the provisional suspension of Diana Taurasi on Wednesday after the lab that conducted Taurasi’s positive test retracted its report. The lab issued the change after it evaluated Taurasi’s statements in her defense. The federation did not say whether the lab made a mistake. Taurasi, 28, who had her
35 | Anna Chakvetadze collapsed on the court as she was serving for the second set against top-seeded Caroline Wozniacki at the Dubai Championships in the United Arab Emirates and had to withdraw. After losing the first set, 6-1, Chakvetadze was ahead, 5-3, when she lost a long rally to Wozniacki, wobbled and fainted. After treatment, Chakvetadze
36 | LOS ANGELES — So what if the weather on the horizon is as forbidding as the maître d’ at Koi glaring at the common people? The specter of three days of rain in Southern California has not deterred the field at Riviera Country Club for the Northern Trust Open. At least, not yet. With 11 of the top 20 players in the World Golf
37 | The Devils rookie Nick Palmieri gave the puck to Ilya Kovalchuk along the Carolina goal line early in the second period of a scoreless game, then got a chance to watch Kovalchuk, a $100 million Russian superstar, put on a show. Kovalchuk skated to the point, reversed direction, went back down to the left circle, hit the brakes to lose defenseman
38 | Kemba Walker had the ball at about the free-throw line and he was being covered by a player 9 inches taller than him. He had quite a list of possible plays in front of him. The one he chose is not on the list of options for almost every other player. Walker faked to his left, then threw the ball hard off the backboard and — since he was the
39 | Looking nothing like the two-time defending N.B.A. champions, the Los Angeles Lakers dropped their third straight game, a 104-99 loss Wednesday night on the road to the Cleveland Cavaliers — the league’s worst team, which avenged a 55-point embarrassment against Los Angeles last month. Ramon Sessions came off the bench and scored a
40 | The Knicks settled into the All-Star break on Wednesday at nearly the same point where they started the season. Only now they have two more wins than losses. For the Knicks, any progress is significant toward ending a six-year playoff absence. After a 102-90 victory over the Atlanta Hawks at Madison Square Garden, the Knicks (28-26) squeezed into
41 | TAMPA, Fla. — Joba Chamberlain arrived at the most important spring training of his young career listed at 230 pounds , just as he was all last season. This would not be a problem except that Chamberlain weighs more than 230 pounds, and the Yankees are hardly pleased that he does. Asked Wednesday morning for his impression of Chamberlain,
42 | In late 1999, three friends created a Web site to solicit fans to acquire the Jets . The quixotic effort at one point claimed commitments worth $20 million from 11,000 people. But the N.F.L. rejected the plan, and Woody Johnson paid $635 million for the team in early 2000. On Wednesday, three friends started an Internet bid to acquire the Mets and
43 | PORT ST. LUCIE, Fla. — Jeff Wilpon, the Mets ’ chief operating officer, spoke to reporters Wednesday for the first time since a lawsuit seeking as much as $1 billion from the team’s owners was unsealed on Feb. 4. Wilpon, the son of the longtime principal owner of the Mets, Fred Wilpon , said his family had received many offers to
44 | PORT ST. LUCIE, Fla. — A contrite Francisco Rodriguez arrived in Mets camp Wednesday and promised that he had learned from his mistakes and had become a better person. At the same time, Rodriguez vowed that in some respects he would not change. “On the mound, it’s going to be the same,” he said. “It’s going to be
45 | CLEARWATER, Fla. He stood beneath a palm tree on a clear Florida morning, commanding attention as he always has across a lifetime in baseball. Yet the routines of spring training were gone for Dallas Green. There is nothing routine about coping with horror. Christina-Taylor Green was the youngest victim of the shooting in Tucson last month that
46 | JUPITER, Fla. — In explaining how the St. Louis Cardinals have reached the end of negotiations to extend the contract of Albert Pujols, the team’s owner, Bill DeWitt Jr., really did not have to utter much more than one short declarative sentence. “We’re not the Yankees ,” he said after Pujols’s self-imposed noon
47 | A day before a scheduled arbitration hearing, the Brewers and second baseman Rickie Weeks agreed to a four-year, $38.5 million deal. A 2015 option could increase the total value to $50 million. Weeks, 28, hit .269 with 29 homers, 83 runs batted in and 112 runs last year. Sports Briefing | Baseball
48 | The day in sports, including cricket, skiing, and pitchers and catchers.
49 | The day in sports, including cricket, skiing, and pitchers and catchers.
50 | The day in sports, including cricket, skiing, and pitchers and catchers.
51 |
--------------------------------------------------------------------------------
/intro_web_data/stopwords.txt:
--------------------------------------------------------------------------------
1 | i
2 | me
3 | my
4 | myself
5 | we
6 | our
7 | ours
8 | ourselves
9 | you
10 | your
11 | yours
12 | yourself
13 | yourselves
14 | he
15 | him
16 | his
17 | himself
18 | she
19 | her
20 | hers
21 | herself
22 | it
23 | its
24 | itself
25 | they
26 | them
27 | their
28 | theirs
29 | themselves
30 | what
31 | which
32 | who
33 | whom
34 | this
35 | that
36 | these
37 | those
38 | am
39 | is
40 | are
41 | was
42 | were
43 | be
44 | been
45 | being
46 | have
47 | has
48 | had
49 | having
50 | do
51 | does
52 | did
53 | doing
54 | a
55 | an
56 | the
57 | and
58 | but
59 | if
60 | or
61 | because
62 | as
63 | until
64 | while
65 | of
66 | at
67 | by
68 | for
69 | with
70 | about
71 | against
72 | between
73 | into
74 | through
75 | during
76 | before
77 | after
78 | above
79 | below
80 | to
81 | from
82 | up
83 | down
84 | in
85 | out
86 | on
87 | off
88 | over
89 | under
90 | again
91 | further
92 | then
93 | once
94 | here
95 | there
96 | when
97 | where
98 | why
99 | how
100 | all
101 | any
102 | both
103 | each
104 | few
105 | more
106 | most
107 | other
108 | some
109 | such
110 | no
111 | nor
112 | not
113 | only
114 | own
115 | same
116 | so
117 | than
118 | too
119 | very
120 | s
121 | t
122 | can
123 | will
124 | just
125 | don
126 | should
127 | now
128 | the
129 | and
130 | this
131 | with
132 | for
133 | not
134 | but
135 | with
136 | how
--------------------------------------------------------------------------------
/intro_web_data/tag_clustering.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# encoding: utf-8
"""
tag_clustering.py

Created by Hilary Mason on 2011-02-18.
Copyright (c) 2011 Hilary Mason. All rights reserved.
"""

import csv

import numpy
from Pycluster import kcluster


class TagClustering(object):

    def __init__(self):
        tag_data = self.load_link_data()
        # print tag_data
        all_tags = []
        all_urls = []
        for url, tags in tag_data.items():
            all_urls.append(url)
            all_tags.extend(tags)

        all_tags = list(set(all_tags))  # list of all tags in the space

        numerical_data = []  # one binary vector per URL: 1 if the URL carries the tag, else 0
        for url, tags in tag_data.items():
            v = []
            for t in all_tags:
                if t in tags:
                    v.append(1)
                else:
                    v.append(0)
            numerical_data.append(tuple(v))
        data = numpy.array(numerical_data)

        # cluster the items
        # labels, error, nfound = kcluster(data, nclusters=20, dist='e')  # 20 clusters, euclidean distance
        # labels, error, nfound = kcluster(data, nclusters=20, dist='b', npass=10)  # 20 clusters, city-block distance, 10 passes (random restarts)
        labels, error, nfound = kcluster(data, nclusters=30, dist='a', npass=10)  # 30 clusters, abs val of the correlation distance, 10 passes (random restarts)

        # print out the clusters
        clustered_urls = {}
        clustered_tags = {}
        for url, label in zip(all_urls, labels):
            clustered_urls.setdefault(label, []).append(url)
            clustered_tags.setdefault(label, []).extend(tag_data[url])

        for cluster_id, urls in clustered_urls.items():
            print cluster_id
            print urls

        # for cluster_id,tags in clustered_tags.items():
        #     print cluster_id
        #     print list(set(tags))

    def load_link_data(self, filename="links.csv"):
        """Read rows of (url, comma-separated tags) into a dict of url -> tag list."""
        data = {}

        with open(filename, 'r') as f:
            for row in csv.reader(f):
                data[row[0]] = row[1].split(',')

        return data


if __name__ == '__main__':
    t = TagClustering()
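The vector-building step in tag_clustering.py can be tried in isolation. The following is a minimal sketch, not the script itself: it assumes a small hypothetical `tag_data` dictionary in place of links.csv, uses plain lists rather than NumPy, and computes a city-block distance by hand instead of calling Pycluster. The helper names `build_tag_vectors` and `city_block` are illustrative only.

```python
def build_tag_vectors(tag_data):
    """Return (urls, all_tags, vectors) with one 0/1 vector per URL."""
    urls = sorted(tag_data)
    # Every distinct tag becomes one dimension of the vector space.
    all_tags = sorted({t for tags in tag_data.values() for t in tags})
    vectors = [[1 if t in tag_data[u] else 0 for t in all_tags] for u in urls]
    return urls, all_tags, vectors


def city_block(u, v):
    """City-block (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical sample data standing in for links.csv.
tag_data = {
    "http://example.com/a": ["python", "data"],
    "http://example.com/b": ["python", "web"],
    "http://example.com/c": ["cooking"],
}

urls, all_tags, vectors = build_tag_vectors(tag_data)
for url, vec in zip(urls, vectors):
    print(url, vec)
print(city_block(vectors[0], vectors[1]))
```

URLs that share tags end up closer together, which is what the clustering step relies on.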
--------------------------------------------------------------------------------
/solving_problems/bloom_filter.py:
--------------------------------------------------------------------------------
from hashes.bloom import bloomfilter

hash1 = bloomfilter('imastring')
print hash1.hashbits, hash1.num_hashes  # default values

hash1.add('imastring string')

# print 'test string' in hash1
for word in 'bloom filters are the best'.split():
    hash1.add(word)

# 'machine' was never added, so this prints only on a false positive
if 'machine' in hash1:
    print "machine!"
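For readers who want to see what a Bloom filter does internally, here is a minimal sketch. It is not the implementation in the `hashes` package used by bloom_filter.py: the class name `SimpleBloomFilter`, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class SimpleBloomFilter(object):
    """Toy Bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            data = ("%d:%s" % (i, item)).encode("utf-8")
            yield int(hashlib.sha256(data).hexdigest(), 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True only if every position is set; false positives are possible,
        # false negatives are not.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = SimpleBloomFilter()
words = "bloom filters are the best".split()
for word in words:
    bf.add(word)

print(all(w in bf for w in words))
```

Every added word is always reported as present. A word that was never added is usually reported as absent, but may occasionally collide with bits that are already set.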
--------------------------------------------------------------------------------
/solving_problems/decision_tree_regression.py:
--------------------------------------------------------------------------------
import numpy as np

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Fit regression model
from sklearn.tree import DecisionTreeRegressor

clf_1 = DecisionTreeRegressor(max_depth=2)
clf_2 = DecisionTreeRegressor(max_depth=5)
clf_1.fit(X, y)
clf_2.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = clf_1.predict(X_test)
y_2 = clf_2.predict(X_test)

# Plot the results
import pylab as pl

pl.figure()
pl.scatter(X, y, c="k", label="data")
pl.plot(X_test, y_1, c="g", label="max_depth=2", linewidth=2)
pl.plot(X_test, y_2, c="r", label="max_depth=5", linewidth=2)
pl.xlabel("data")
pl.ylabel("target")
pl.title("Decision Tree Regression")
pl.legend()
#pl.show()
pl.savefig('decision_tree_regression.png', format='png')
--------------------------------------------------------------------------------
/solving_problems/flat.txt:
--------------------------------------------------------------------------------
1 | Flat Glasses
2 | Flatpack Bunny
3 | Flatpack Monkey
4 | Flat Pack Fastenerless (FPF) Game Table With Reversible Top
5 | Axim X51v flatpack cradle
6 | Flatfile Nameplate
7 | Print Flat - Roll Into 3D, Heptagonal Column
8 | FlatRoll Airfoil
9 | FlatRoll with Adjustable Thickness, Pentagonal Column
10 | Lock-Tab FlatRoll, Hexagonal Column
11 | Flatpack Sphere
12 | Interlocking Puzzle Piece Flat
13 | Calibration -flat- square
14 | Flat decorative Christmas tree
15 | Flat Bottom Shotglass
16 | 3D from any 2D (or From Flat to Cat)
17 | Urinal with flat bottom
18 | Flat drivenut for 12x6x2 (mm) trapezium thread
19 | Iris Box V2 Flat Base
20 | Stepper motor gear for Reprap Mendel with flat on motor shaft
21 | aMESS RAMPS Flattened Enclosure v.0.1.2 - Arduino Modular Enclosure System Stack
22 | Supa-Flat X-Carriage
23 | Flat Yodsta/Gangda
24 | Parametric inflater nozzle
25 | Flatfooted Soldier Boy
26 | Flat Teardrop
27 |
--------------------------------------------------------------------------------
/solving_problems/kmeans_descriptions.py:
--------------------------------------------------------------------------------
"""
Cluster the text descriptions in a CSV file with k-means.
"""
import csv
import logging
from optparse import OptionParser
import sys
from time import time

import numpy as np

from sklearn.feature_extraction.text import Vectorizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# parse commandline arguments
op = OptionParser()
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm.")

print __doc__
op.print_help()

(opts, args) = op.parse_args()
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)


# each row: label in column 0, description text in column 1
input_data = csv.reader(open('descriptions_100.csv', 'rb'))
dataset_data = []
dataset_target = []
for row in input_data:
    dataset_data.append(row[1])
    dataset_target.append(row[0])

labels = dataset_target
true_k = np.unique(labels).shape[0]  # one cluster per distinct label

print "Extracting features from the training dataset using a sparse vectorizer"
t0 = time()
vectorizer = Vectorizer(max_df=0.95, max_features=10000)
X = vectorizer.fit_transform(dataset_data)
print X

print "done in %fs" % (time() - t0)
print "n_samples: %d, n_features: %d" % X.shape


###############################################################################
# Do the actual clustering

if opts.minibatch:
    km = MiniBatchKMeans(k=true_k, init='k-means++', n_init=1, init_size=1000, batch_size=1000, verbose=1)
else:
    # --no-minibatch: ordinary k-means; these parameters are an assumption
    # based on the scikit-learn API of the same era
    km = KMeans(k=true_k, init='random', max_iter=100, n_init=1, verbose=1)

print "Clustering with %s" % km
t0 = time()
km.fit(X)
print "done in %0.3fs\n" % (time() - t0)
print km.labels_

# print "Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_)
# print "Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_)
# print "V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_)
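The feature-extraction step in kmeans_descriptions.py turns each description into a weighted term vector and drops terms that appear in more than `max_df` of the documents. The sketch below illustrates that idea in plain Python. It is a simplification under stated assumptions: the function name `tfidf` is hypothetical, tokenization is a bare whitespace split, and the weight `count * log(N / df)` is a textbook form rather than scikit-learn's exact formula. The sample documents are made up.

```python
import math
from collections import Counter


def tfidf(docs, max_df=0.95):
    """Return (vocab, rows): a sorted vocabulary and one weight row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Drop terms that occur in more than max_df of all documents.
    vocab = sorted(t for t, c in df.items() if float(c) / n_docs <= max_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] * math.log(float(n_docs) / df[t]) for t in vocab])
    return vocab, rows


# Hypothetical sample descriptions.
docs = ["flat pack bunny", "flat pack monkey", "flat glasses"]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print(row)
```

The term "flat" occurs in every document, so it carries no information for telling them apart and is removed by the `max_df` cutoff.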
--------------------------------------------------------------------------------
/solving_problems/liked_decision_tree.py:
--------------------------------------------------------------------------------
import csv

from sklearn import tree

if __name__ == '__main__':
    input_file = "thingiverse_liked_objects_1k.csv"
    input_data = csv.reader(open(input_file, 'rb'))

    data_features = []
    data_labels = []

    # each row: two feature columns followed by the label
    for row in input_data:
        data_features.append([row[0], row[1]])
        data_labels.append(row[2])

    # sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, min_split=1,
    #     min_density=0.10000000000000001, max_features=None, compute_importances=False, random_state=None)

    dt = tree.DecisionTreeClassifier(min_split=10)
    dt = dt.fit(data_features, data_labels)

    print dt.predict([[50, 500]])  # one sample with two features

    tree.export_graphviz(dt, out_file='thingiverse_tree.dot', feature_names=['user_id', 'num_likes'])
--------------------------------------------------------------------------------
/solving_problems/map_reduce.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
#
# Licensed to Cloudera, Inc. under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  Cloudera, Inc. licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# Template for python Hadoop streaming.  Fill in the map() and reduce()
# functions, which should call emit(), as appropriate.
#
# Test your script with
#  cat input | python map_reduce.py map | sort | python map_reduce.py reduce

import sys
import re
try:
    import simplejson as json
except ImportError:
    import json

import __builtin__

def map(line):
    # word count: emit (word, "1") for every word on the line
    words = line.split()
    for word in words:
        emit(word, str(1))

def reduce(key, values):
    # map() above shadows the builtin, so call the builtin explicitly
    emit(key, str(sum(__builtin__.map(int, values))))

# Common library code follows:

def emit(key, value):
    """
    Emits a key->value pair.  Key and value should be strings.
    """
    try:
        print "\t".join( (key, value) )
    except:
        pass

def run_map():
    """Calls map() for each input value."""
    for line in sys.stdin:
        line = line.rstrip()
        map(line)

def run_reduce():
    """Gathers reduce() data in memory, and calls reduce()."""
    prev_key = None
    values = []
    for line in sys.stdin:
        line = line.rstrip()
        key, value = re.split("\t", line, 1)
        if prev_key == key:
            values.append(value)
        else:
            if prev_key is not None:
                reduce(prev_key, values)
            prev_key = key
            values = [ value ]

    if prev_key is not None:
        reduce(prev_key, values)

def main():
    """Runs map or reduce code, per arguments."""
    if len(sys.argv) != 2 or sys.argv[1] not in ("map", "reduce"):
        print "Usage: %s