67 | How can you learn about the underlying structure of documents in a way that is informative and intuitive? This basic
68 | motivating question led me on a journey to visualize and cluster documents in a two-dimensional space. What you see
69 | above is the output of an analytical pipeline that began by gathering synopses of the top 100 films of all time and ended by
70 | analyzing the latent topics within each document. In between, I preprocessed these synopses (tokenization, stemming),
71 | transformed them into a vector space model (tf-idf), and clustered them into groups (k-means). You can learn all about how
72 | I did this with my detailed guide to Document Clustering with Python. But first, what did I learn?
73 |
74 |
75 |
76 | A bit of background
77 |
78 |
79 |
80 | I obtained a list of the top 100 films of all time from an IMDB user list called
81 |
82 | Top 100 Greatest Movies of All Time (The Ultimate List)
83 | by ChrisWalczyk55.
84 | ChrisWalczyk55 claims that "My lists are not based on my own personal favorites;
85 | they are based on the true greatness and/or sucess of the person, place, or thing
86 | being ranked." Ok, sure, whatever. Using this list and its ordinal rankings,
87 | combined with synopses gathered from IMDB and Wikipedia, I was able to separate the films into 5 clusters.
88 | Why 5? Clustering is more art than science. If I had selected 20 clusters, they would have been too narrow to allow
89 | me to draw any generalizations. If I had picked 2 or 3 clusters, they would have been too broad. Anywhere from 5 to 8 gave a good fit,
90 | but I chose 5 clusters since this gave the most intuitive groupings.
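The post does not say how "fit" was judged. One common numeric aid is to compare k-means inertia (the within-cluster sum of squared distances) across several values of k and look for where the improvement levels off. The sketch below is a plain-Python illustration of that idea under stated assumptions: the 2-d points are toy data rather than tf-idf vectors, and `sq_dist` and `kmeans_inertia` are hypothetical helper names, not functions from the original pipeline.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_inertia(points, k, n_iter=100, seed=0):
    """Run Lloyd's k-means and return the final within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    # Inertia: each point's squared distance to its nearest final centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Toy data for illustration only: three loose groups of 2-d points.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]

for k in range(1, 6):
    print(k, round(kmeans_inertia(points, k), 2))
```

Inertia can only tell you that more clusters fit more tightly; whether the resulting groups are interpretable is still a judgment call, which is the point made above.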
91 |
92 |
93 |
94 |
95 | Understanding the visualization
96 |
97 |
98 |
99 | The visualization at the top of the page is a 2-dimensional scatterplot of the movies (colored by cluster), laid out
100 | according to the cosine distances between them. The dimensions (X and Y) do not actually have labels. The way to interpret the scatterplot is
101 | by examining the location of one film, relative to others, in this 2-d space. Proximity in this space equates to similarity as
102 | determined by a multi-dimensional scaling of the cosine distance (1 minus cosine similarity) between the synopses' rows
103 | in the term frequency-inverse document frequency (tf-idf) matrix. That was probably confusing and I plan to explain it in a
104 | more detailed write-up of my methodology, but the basic intuition is that, based on the collected synopses, each film is plotted in relation
105 | to its similarity to all other films contained in the plot. You might find some weird relationships in this plot: keep in mind
106 | that similarity was measured based on the words found in the film synopses. If the film synopses were poorly written or very short,
107 | the results were most certainly affected. Garbage in, garbage out. Mostly I was interested in exploring the methodology.
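To make the distance measure concrete, here is a minimal plain-Python sketch of tf-idf weighting followed by cosine distance (1 minus cosine similarity). Everything in it is an assumption for illustration: the three toy "synopses", the whitespace tokenizer, and the helper names `tfidf_vectors` and `cosine_distance` are not taken from the actual pipeline, and the idf formula is one simple variant (libraries typically apply smoothing).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Plain tf * log(N / df); a term found in every document gets weight 0.
        vectors.append({t: n * math.log(n_docs / df[t]) for t, n in tf.items()})
    return vectors


def cosine_distance(a, b):
    """1 minus the cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for synopses (made up for this example).
docs = [
    "soldiers fight the war and the captain is killed",
    "the captain leads soldiers into war",
    "a family sings and dances at home",
]
vecs = tfidf_vectors(docs)
# Pairwise distance matrix: dist[i][j] is the distance between docs i and j.
dist = [[cosine_distance(a, b) for b in vecs] for a in vecs]
```

The first two toy synopses share several terms, so they end up closer to each other than either is to the third. A pairwise matrix like `dist` is the input that multi-dimensional scaling turns into 2-d coordinates.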
108 |
109 |
110 |
111 |
112 |
113 | Scoring the clusters
114 |
115 |
116 |
117 | Based on the outcome of the clustering, I used the average rank from the IMDB list to score the clusters (lower is better).
118 |
119 |
120 |
121 |
122 |
Rank  Cluster                     Score  Count
1     Killed, soldiers, captain   43.7   26
2     Family, home, war           47.2   25
3     Father, New York, brothers  49.4   21
4     Dance, singing, love        54.5   12
5     Police, killed, murders     58.8   16
158 |
159 |
160 |
161 | You can see that the war movies scored the best. The basic war epic cluster was at the top, followed
162 | closely by family/home with some war mixed in. Family and "New York" (or perhaps just cities) follow the war groupings.
163 | The dancing, singing, love cluster beats out the crime-ish flicks which, in the scheme of the top 100 movies, tend towards the bottom.
164 | This is despite the dominance of the Godfather films.
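The scoring described above is just a grouped mean of list ranks. A minimal sketch, assuming the cluster assignments are available as (rank, cluster label) pairs; the pairs below are invented for illustration and are not the real cluster memberships, and `score_clusters` is a hypothetical helper name.

```python
from collections import defaultdict


def score_clusters(assignments):
    """Average list rank per cluster, best (lowest) first.

    `assignments` is an iterable of (rank, cluster_label) pairs.
    Returns a list of (cluster_label, mean_rank, count) tuples.
    """
    ranks_by_cluster = defaultdict(list)
    for rank, label in assignments:
        ranks_by_cluster[label].append(rank)
    scores = [
        (label, sum(ranks) / len(ranks), len(ranks))
        for label, ranks in ranks_by_cluster.items()
    ]
    # A lower average rank means the cluster's films sit higher on the list.
    return sorted(scores, key=lambda row: row[1])


# Toy assignments for illustration only (not the real cluster memberships).
toy = [(1, "A"), (2, "B"), (3, "A"), (10, "C"), (20, "B"), (30, "C")]
for label, mean_rank, count in score_clusters(toy):
    print(f"{label}: score={mean_rank:.1f}, count={count}")
```

Note that a mean like this is sensitive to cluster size: a small cluster's score can swing a lot on one or two films.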
165 |
166 |
167 |
168 |
169 |
170 |
Killed, soldiers, captain

Rank  Title
2     The Shawshank Redemption
11    Lawrence of Arabia
18    The Sound of Music
20    Star Wars
22    2001: A Space Odyssey
25    The Bridge on the River Kwai
30    Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
654 |
655 |
656 |
744 |
745 |
746 |
747 |
--------------------------------------------------------------------------------
/backup/header_short.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/header_short.jpg
--------------------------------------------------------------------------------
/backup/link_list.txt:
--------------------------------------------------------------------------------
1 | /title/tt0068646/
2 | /title/tt0111161/
3 | /title/tt0108052/
4 | /title/tt0081398/
5 | /title/tt0034583/
6 | /title/tt0073486/
7 | /title/tt0031381/
8 | /title/tt0033467/
9 | /title/tt0032138/
10 | /title/tt0120338/
11 | /title/tt0056172/
12 | /title/tt0071562/
13 | /title/tt0054215/
14 | /title/tt0043014/
15 | /title/tt0052357/
16 | /title/tt0047296/
17 | /title/tt0109830/
18 | /title/tt0059742/
19 | /title/tt0055614/
20 | /title/tt0076759/
21 | /title/tt0083866/
22 | /title/tt0062622/
23 | /title/tt0102926/
24 | /title/tt0071315/
25 | /title/tt0050212/
26 | /title/tt0045152/
27 | /title/tt0038650/
28 | /title/tt0053291/
29 | /title/tt0050083/
30 | /title/tt0057012/
31 | /title/tt0086879/
32 | /title/tt0078788/
33 | /title/tt0083987/
34 | /title/tt0167260/
35 | /title/tt0172495/
36 | /title/tt0045793/
37 | /title/tt0120815/
38 | /title/tt0105695/
39 | /title/tt0082971/
40 | /title/tt0075148/
41 | /title/tt0044081/
42 | /title/tt0032904/
43 | /title/tt0056592/
44 | /title/tt0043278/
45 | /title/tt0036868/
46 | /title/tt0058385/
47 | /title/tt0052618/
48 | /title/tt0059113/
49 | /title/tt0066206/
50 | /title/tt0073195/
51 | /title/tt0112573/
52 | /title/tt0060196/
53 | /title/tt0064115/
54 | /title/tt0040897/
55 | /title/tt0053604/
56 | /title/tt0091763/
57 | /title/tt0044706/
58 | /title/tt0099348/
59 | /title/tt0253474/
60 | /title/tt0099685/
61 | /title/tt0070047/
62 | /title/tt0077416/
63 | /title/tt0020629/
64 | /title/tt0067116/
65 | /title/tt0021749/
66 | /title/tt1504320/
67 | /title/tt0025316/
68 | /title/tt0043924/
69 | /title/tt0064665/
70 | /title/tt0031679/
71 | /title/tt0095953/
72 | /title/tt0075686/
73 | /title/tt0089755/
74 | /title/tt0119217/
75 | /title/tt0086425/
76 | /title/tt0084805/
77 | /title/tt0116282/
78 | /title/tt0049261/
79 | /title/tt0032551/
80 | /title/tt0046303/
81 | /title/tt0120689/
82 | /title/tt0075860/
83 | /title/tt0074958/
84 | /title/tt0073440/
85 | /title/tt0061722/
86 | /title/tt0069704/
87 | /title/tt0110912/
88 | /title/tt0043265/
89 | /title/tt0031971/
90 | /title/tt0026752/
91 | /title/tt0033870/
92 | /title/tt0066921/
93 | /title/tt0075314/
94 | /title/tt0032145/
95 | /title/tt0036775/
96 | /title/tt0048545/
97 | /title/tt0047396/
98 | /title/tt0041959/
99 | /title/tt0053125/
100 | /title/tt0035575/
101 |
--------------------------------------------------------------------------------
/backup/link_list_imdb.txt:
--------------------------------------------------------------------------------
1 | http://www.imdb.com/title/tt0068646/
2 | http://www.imdb.com/title/tt0111161/
3 | http://www.imdb.com/title/tt0108052/
4 | http://www.imdb.com/title/tt0081398/
5 | http://www.imdb.com/title/tt0034583/
6 | http://www.imdb.com/title/tt0073486/
7 | http://www.imdb.com/title/tt0031381/
8 | http://www.imdb.com/title/tt0033467/
9 | http://www.imdb.com/title/tt0032138/
10 | http://www.imdb.com/title/tt0120338/
11 | http://www.imdb.com/title/tt0056172/
12 | http://www.imdb.com/title/tt0071562/
13 | http://www.imdb.com/title/tt0054215/
14 | http://www.imdb.com/title/tt0043014/
15 | http://www.imdb.com/title/tt0052357/
16 | http://www.imdb.com/title/tt0047296/
17 | http://www.imdb.com/title/tt0109830/
18 | http://www.imdb.com/title/tt0059742/
19 | http://www.imdb.com/title/tt0055614/
20 | http://www.imdb.com/title/tt0076759/
21 | http://www.imdb.com/title/tt0083866/
22 | http://www.imdb.com/title/tt0062622/
23 | http://www.imdb.com/title/tt0102926/
24 | http://www.imdb.com/title/tt0071315/
25 | http://www.imdb.com/title/tt0050212/
26 | http://www.imdb.com/title/tt0045152/
27 | http://www.imdb.com/title/tt0038650/
28 | http://www.imdb.com/title/tt0053291/
29 | http://www.imdb.com/title/tt0050083/
30 | http://www.imdb.com/title/tt0057012/
31 | http://www.imdb.com/title/tt0086879/
32 | http://www.imdb.com/title/tt0078788/
33 | http://www.imdb.com/title/tt0083987/
34 | http://www.imdb.com/title/tt0167260/
35 | http://www.imdb.com/title/tt0172495/
36 | http://www.imdb.com/title/tt0045793/
37 | http://www.imdb.com/title/tt0120815/
38 | http://www.imdb.com/title/tt0105695/
39 | http://www.imdb.com/title/tt0082971/
40 | http://www.imdb.com/title/tt0075148/
41 | http://www.imdb.com/title/tt0044081/
42 | http://www.imdb.com/title/tt0032904/
43 | http://www.imdb.com/title/tt0056592/
44 | http://www.imdb.com/title/tt0043278/
45 | http://www.imdb.com/title/tt0036868/
46 | http://www.imdb.com/title/tt0058385/
47 | http://www.imdb.com/title/tt0052618/
48 | http://www.imdb.com/title/tt0059113/
49 | http://www.imdb.com/title/tt0066206/
50 | http://www.imdb.com/title/tt0073195/
51 | http://www.imdb.com/title/tt0112573/
52 | http://www.imdb.com/title/tt0060196/
53 | http://www.imdb.com/title/tt0064115/
54 | http://www.imdb.com/title/tt0040897/
55 | http://www.imdb.com/title/tt0053604/
56 | http://www.imdb.com/title/tt0091763/
57 | http://www.imdb.com/title/tt0044706/
58 | http://www.imdb.com/title/tt0099348/
59 | http://www.imdb.com/title/tt0253474/
60 | http://www.imdb.com/title/tt0099685/
61 | http://www.imdb.com/title/tt0070047/
62 | http://www.imdb.com/title/tt0077416/
63 | http://www.imdb.com/title/tt0020629/
64 | http://www.imdb.com/title/tt0067116/
65 | http://www.imdb.com/title/tt0021749/
66 | http://www.imdb.com/title/tt1504320/
67 | http://www.imdb.com/title/tt0025316/
68 | http://www.imdb.com/title/tt0043924/
69 | http://www.imdb.com/title/tt0064665/
70 | http://www.imdb.com/title/tt0031679/
71 | http://www.imdb.com/title/tt0095953/
72 | http://www.imdb.com/title/tt0075686/
73 | http://www.imdb.com/title/tt0089755/
74 | http://www.imdb.com/title/tt0119217/
75 | http://www.imdb.com/title/tt0086425/
76 | http://www.imdb.com/title/tt0084805/
77 | http://www.imdb.com/title/tt0116282/
78 | http://www.imdb.com/title/tt0049261/
79 | http://www.imdb.com/title/tt0032551/
80 | http://www.imdb.com/title/tt0046303/
81 | http://www.imdb.com/title/tt0120689/
82 | http://www.imdb.com/title/tt0075860/
83 | http://www.imdb.com/title/tt0074958/
84 | http://www.imdb.com/title/tt0073440/
85 | http://www.imdb.com/title/tt0061722/
86 | http://www.imdb.com/title/tt0069704/
87 | http://www.imdb.com/title/tt0110912/
88 | http://www.imdb.com/title/tt0043265/
89 | http://www.imdb.com/title/tt0031971/
90 | http://www.imdb.com/title/tt0026752/
91 | http://www.imdb.com/title/tt0033870/
92 | http://www.imdb.com/title/tt0066921/
93 | http://www.imdb.com/title/tt0075314/
94 | http://www.imdb.com/title/tt0032145/
95 | http://www.imdb.com/title/tt0036775/
96 | http://www.imdb.com/title/tt0048545/
97 | http://www.imdb.com/title/tt0047396/
98 | http://www.imdb.com/title/tt0041959/
99 | http://www.imdb.com/title/tt0053125/
100 | http://www.imdb.com/title/tt0035575/
101 |
--------------------------------------------------------------------------------
/backup/link_list_wiki.txt:
--------------------------------------------------------------------------------
1 | http://en.wikipedia.org/wiki/The_Godfather
2 | http://en.wikipedia.org/wiki/The_Shawshank_Redemption
3 | http://en.wikipedia.org/wiki/Schindler%27s_List
4 | http://en.wikipedia.org/wiki/Raging_Bull
5 | http://en.wikipedia.org/wiki/Casablanca_(film)
6 | http://en.wikipedia.org/wiki/One_Flew_Over_the_Cuckoo%27s_Nest_(film)
7 | http://en.wikipedia.org/wiki/Gone_with_the_Wind_(film)
8 | http://en.wikipedia.org/wiki/Citizen_Kane
9 | http://en.wikipedia.org/wiki/The_Wizard_of_Oz_(1939_film)
10 | http://en.wikipedia.org/wiki/Titanic_(1997_film)
11 | http://en.wikipedia.org/wiki/Lawrence_of_Arabia_(film)
12 | http://en.wikipedia.org/wiki/The_Godfather_Part_II
13 | http://en.wikipedia.org/wiki/American_Psycho_(film)
14 | http://en.wikipedia.org/wiki/Sunset_Boulevard_(film)
15 | http://en.wikipedia.org/wiki/Vertigo_(film)
16 | http://en.wikipedia.org/wiki/On_the_Waterfront
17 | http://en.wikipedia.org/wiki/Forrest_Gump
18 | http://en.wikipedia.org/wiki/The_Sound_of_Music_(film)
19 | http://en.wikipedia.org/wiki/West_Side_Story_(film)
20 | http://en.wikipedia.org/wiki/Star_Wars_(film)
21 | http://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial
22 | http://en.wikipedia.org/wiki/2001:_A_Space_Odyssey_(film)
23 | http://en.wikipedia.org/wiki/The_Silence_of_the_Lambs_(film)
24 | http://en.wikipedia.org/wiki/Chinatown_(1974_film)
25 | http://en.wikipedia.org/wiki/The_Bridge_on_the_River_Kwai
26 | http://en.wikipedia.org/wiki/Singin%27_in_the_Rain
27 | http://en.wikipedia.org/wiki/It%27s_a_Wonderful_Life
28 | http://en.wikipedia.org/wiki/Some_Like_It_Hot
29 | http://en.wikipedia.org/wiki/12_Angry_Men_(1957_film)
30 | http://en.wikipedia.org/wiki/Dr._Strangelove
31 | http://en.wikipedia.org/wiki/Amadeus_(film)
32 | http://en.wikipedia.org/wiki/Apocalypse_Now
33 | http://en.wikipedia.org/wiki/Gandhi_(film)
34 | http://en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Return_of_the_King
35 | http://en.wikipedia.org/wiki/Gladiator_(2000_film)
36 | http://en.wikipedia.org/wiki/From_Here_to_Eternity
37 | http://en.wikipedia.org/wiki/Saving_Private_Ryan
38 | http://en.wikipedia.org/wiki/Unforgiven
39 | http://en.wikipedia.org/wiki/Raiders_of_the_Lost_Ark
40 | http://en.wikipedia.org/wiki/Rocky
41 | http://en.wikipedia.org/wiki/A_Streetcar_Named_Desire_(1951_film)
42 | http://en.wikipedia.org/wiki/The_Philadelphia_Story_(film)
43 | http://en.wikipedia.org/wiki/To_Kill_a_Mockingbird_(film)
44 | http://en.wikipedia.org/wiki/An_American_in_Paris_(film)
45 | http://en.wikipedia.org/wiki/The_Best_Years_of_Our_Lives
46 | http://en.wikipedia.org/wiki/My_Fair_Lady_(film)
47 | http://en.wikipedia.org/wiki/Ben-Hur_(1959_film)
48 | http://en.wikipedia.org/wiki/Doctor_Zhivago_(film)
49 | http://en.wikipedia.org/wiki/Patton_(film)
50 | http://en.wikipedia.org/wiki/Jaws_(film)
51 | http://en.wikipedia.org/wiki/Braveheart
52 | http://en.wikipedia.org/wiki/Coyote_Ugly_(film)
53 | http://en.wikipedia.org/wiki/Butch_Cassidy_and_the_Sundance_Kid
54 | http://en.wikipedia.org/wiki/The_Treasure_of_the_Sierra_Madre_(film)
55 | http://en.wikipedia.org/wiki/The_Apartment
56 | http://en.wikipedia.org/wiki/Platoon_(film)
57 | http://en.wikipedia.org/wiki/High_Noon
58 | http://en.wikipedia.org/wiki/Dances_with_Wolves
59 | http://en.wikipedia.org/wiki/The_Pianist_(2002_film)
60 | http://en.wikipedia.org/wiki/Goodfellas
61 | http://en.wikipedia.org/wiki/The_Exorcist_(film)
62 | http://en.wikipedia.org/wiki/The_Deer_Hunter
63 | http://en.wikipedia.org/wiki/All_Quiet_on_the_Western_Front_(1930_film)
64 | http://en.wikipedia.org/wiki/The_French_Connection_(film)
65 | http://en.wikipedia.org/wiki/Friday_Night_Lights_(film)
66 | http://en.wikipedia.org/wiki/The_King%27s_Speech
67 | http://en.wikipedia.org/wiki/It_Happened_One_Night
68 | http://en.wikipedia.org/wiki/A_Place_in_the_Sun_(film)
69 | http://en.wikipedia.org/wiki/Midnight_Cowboy
70 | http://en.wikipedia.org/wiki/Mr._Smith_Goes_to_Washington
71 | http://en.wikipedia.org/wiki/Rain_Man
72 | http://en.wikipedia.org/wiki/Annie_Hall
73 | http://en.wikipedia.org/wiki/Out_of_Africa_(film)
74 | http://en.wikipedia.org/wiki/Good_Will_Hunting
75 | http://en.wikipedia.org/wiki/Terms_of_Endearment
76 | http://en.wikipedia.org/wiki/Tootsie
77 | http://en.wikipedia.org/wiki/Fargo_(film)
78 | http://en.wikipedia.org/wiki/Giant_(1956_film)
79 | http://en.wikipedia.org/wiki/The_Grapes_of_Wrath_(film)
80 | http://en.wikipedia.org/wiki/Shane_(film)
81 | http://en.wikipedia.org/wiki/The_Green_Mile_(film)
82 | http://en.wikipedia.org/wiki/Close_Encounters_of_the_Third_Kind
83 | http://en.wikipedia.org/wiki/Network_(film)
84 | http://en.wikipedia.org/wiki/Nashville_(film)
85 | http://en.wikipedia.org/wiki/The_Graduate
86 | http://en.wikipedia.org/wiki/American_Graffiti
87 | http://en.wikipedia.org/wiki/Pulp_Fiction
88 | http://en.wikipedia.org/wiki/The_African_Queen_(film)
89 | http://en.wikipedia.org/wiki/Stagecoach_(1939_film)
90 | http://en.wikipedia.org/wiki/Mutiny_on_the_Bounty_(1962_film)
91 | http://en.wikipedia.org/wiki/The_Maltese_Falcon_(1941_film)
92 | http://en.wikipedia.org/wiki/A_Clockwork_Orange_(film)
93 | http://en.wikipedia.org/wiki/Taxi_Driver
94 | http://en.wikipedia.org/wiki/Wuthering_Heights_%281939_film%29
95 | http://en.wikipedia.org/wiki/Double_Indemnity_(film)
96 | http://en.wikipedia.org/wiki/Rebel_Without_a_Cause
97 | http://en.wikipedia.org/wiki/Rear_Window
98 | http://en.wikipedia.org/wiki/The_Third_Man
99 | http://en.wikipedia.org/wiki/North_by_Northwest
100 | http://en.wikipedia.org/wiki/Yankee_Doodle_Dandy
101 |
--------------------------------------------------------------------------------
/backup/synopses_list.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/synopses_list.txt
--------------------------------------------------------------------------------
/backup/synopses_list.txt.txt:
--------------------------------------------------------------------------------
1 |
2 |
3 | In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone's daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as "Godfather." He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, "no Sicilian can refuse a request on his daughter's wedding day." One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men because she refused their advances; the men received minimal punishment. The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to Corleone's nefarious business dealings. The Don's wife is godmother to Bonasera's shamed daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees to have his men punish the young men responsible.Meanwhile, the Don's youngest son Michael (Al Pacino), a decorated Marine hero returning from World War II service, arrives at the wedding and tells his girlfriend Kay Adams (Diane Keaton) anecdotes about his family, informing her about his father's criminal life; he reassures her that he is different from his family and doesn't plan to join them in their criminal dealings. The wedding scene serves as critical exposition for the remainder of the film, as Michael introduces the main characters to Kay. Fredo (John Cazale), Michael's next older brother, is a bit dim-witted and quite drunk by the time he finds Michael at the party. Sonny (James Caan), the Don's eldest child and next in line to become Don upon his father's retirement, is married but he is a hot-tempered philanderer who sneaks into a bedroom to have sex with one of Connie's bridesmaids, Lucy Mancini (Jeannie Linero). 
Tom Hagen is not related to the family by blood but is considered one of the Don's sons because he was homeless when he befriended Sonny in the Little Italy neighborhood of Manhattan and the Don took him in. Now a talented attorney, Tom is being groomed for the important position of consigliere (counselor) to the Don, despite his non-Sicilian heritage.Also among the guests at the celebration is the famous singer Johnny Fontane (Al Martino), Corleone's godson, who has come from Hollywood to petition Vito's help in landing a movie role that will revitalize his flagging career. Jack Woltz (John Marley), the head of the studio, denies Fontane the part (a character much like Johnny himself), which will make him an even bigger star, but Don Corleone explains to Johnny: "I'm gonna make him an offer he can't refuse." The Don also receives congratulatory salutations from Luca Brasi, a terrifying enforcer in the criminal underworld, and fills a request from the baker who made Connie's wedding cake who wishes for his nephew Enzo to become an American citizen.After the wedding, Hagen is dispatched to Los Angeles to meet with Woltz, but Woltz angrily tells him that he will never cast Fontane in the role. Woltz holds a grudge because Fontane seduced and "ruined" a starlet who Woltz had been grooming for stardom and with whom he had a sexual relationship. Woltz is persuaded to give Johnny the role, however, when he wakes up early the next morning and feels something wet in his bed. He pulls back the sheets and finds himself in a pool of blood; he screams in horror when he discovers the severed head of his prized $600,000 stud horse, Khartoum, in the bed with him. (A deleted scene from the film implies that Luca Brasi (Lenny Montana), Vito's top "button man" or hitman, is responsible.)Upon Hagen's return, the family meets with Virgil "The Turk" Sollozzo (Al Lettieri), who is being backed by the rival Tattaglia family. 
He asks Don Corleone for financing as well as political and legal protection for importing and distributing heroin. Despite the huge profit to be made, Vito Corleone refuses, explaining that his political influence would be jeopardized by a move into the narcotics trade. The Don's eldest son, Sonny, who had earlier urged the family to enter the narcotics trade, breaks ranks during the meeting and questions Sollozzo's assurances as to the Corleone Family's investment being guaranteed by the Tattaglia Family. His father, angry at Sonny's dissension in a non-family member's presence, privately rebukes him later. Don Corleone then dispatches Luca Brasi to infiltrate Sollozzo's organization and report back with information. During the meeting, while Brasi is bent over to allow Bruno Tattaglia to light his cigarette, he is stabbed in the hand by Sollozzo, and is subsequently garroted by an assassin.Soon after his meeting with Sollozzo, Don Corleone is gunned down in an assassination attempt just outside his office, and it is not immediately known whether he has survived. Fredo Corleone had been assigned driving and protection duty for his father when Paulie Gatto, the Don's usual bodyguard, had called in sick. Fredo proves to be ineffectual, fumbling with his gun and unable to shoot back. When Sonny hears about the Don being shot and Paulie's absence, he orders Clemenza (Richard S. Castellano) to find Paulie and bring him to the Don's house.Sollozzo abducts Tom Hagen and persuades him to offer Sonny the deal previously offered to his father. Enraged, Sonny refuses to consider it and issues an ultimatum to the Tattaglias: turn over Sollozzo or face a lengthy, bloody and costly (for both sides) gang war. 
They refuse, and instead send Sonny "a Sicilian message," in the form of two fresh fish wrapped in Luca Brasi's bullet-proof vest, to tell the Corleones that Luca Brasi "sleeps with the fishes."Clemenza later takes Paulie and one of the family's hitmen, Rocco Lampone, for a drive into Manhattan. Sonny wants to "go to the mattresses" -- set up beds in apartments for Corleone button men to operate out of in the event that the crime war breaks out. On their way back from Manhattan, Clemenza has Paulie stop the car in a remote area so he can urinate. Rocco shoots Paulie dead; he and Clemenza leave Paulie and the car behind.Michael, whom the other Mafia families consider a "civilian" uninvolved in mob business, visits his father at a small private hospital. He is shocked to find that no one is guarding him. Realizing that his father is again being set up to be killed, he calls Sonny for help, moves his father to another room, and goes outside to watch the entrance. Michael enlists help from Enzo the baker (Gabriele Torrei), who has come to the hospital to pay his respects. Together, they bluff away Sollozzo's men as they drive by. Police cars soon appear bringing the corrupt Captain McCluskey (Sterling Hayden), who viciously punches Michael in the cheek and breaks his jaw when Michael insinuates that Sollozzo paid McCluskey to set up his father. Just then, Hagen arrives with "private detectives" licensed to carry guns to protect Don Corleone, and he takes the injured Michael home. Sonny responds by having Bruno Tattaglia (Tony Giorgio), the eldest son and underboss of Don Phillip Tattaglia (Victor Rendina), killed (off-camera).Following the attempt on the Don's life at the hospital, Sollozzo requests a meeting with the Corleones, which Captain McCluskey will attend as Sollozzo's bodyguard. 
When Michael volunteers to kill both men during the meeting, Sonny and the other senior Family members are amused; however, Michael convinces them that he is serious and that killing Sollozzo and McCluskey is in the family's interest: "It's not personal. It's strictly business." Because Michael is considered a civilian, he won't be regarded as a suspicious ambassador for the Corleones. Although police officers are usually off limits for hits, Michael argues that since McCluskey is corrupt and has illegal dealings with Sollozzo, he is fair game. Michael also implies that newspaper reporters that the Corleones have on their payroll would delight in publishing stories about a corrupt police captain.Michael meets with Clemenza, one of his father's caporegimes (captains), who prepares a small pistol for him, covering the trigger and grip with tape to prevent any fingerprint evidence. He instructs Michael about the proper way to perform the assassination and tells him to leave the gun behind. He also tells Michael that the family were all very proud of Michael for becoming a war hero during his service in the Marines. Clemenza shows great confidence that Michael can perform the job and tells him it will all go smoothly. The plan is to have the Corleone's informers find out the location of the meeting and plant the revolver before Michael, Sollozzo and McCluskey arrive.Before the meeting in a small Italian restaurant, McCluskey frisks Michael for weapons and finds him clean. Michael excuses himself to go to the bathroom, where he retrieves the planted revolver. Returning to the table, he fatally shoots Sollozzo, then McCluskey. Michael is sent to hide in Sicily while the Corleone family prepares for all-out warfare with the Five Families (who are united against the Corleones) as well as a general clampdown on the mob by the police and government authorities. 
When the don returns home from the hospital, he is distraught to learn that it was Michael who killed Sollozzo and McCluskey.Meanwhile, Connie and Carlo's marriage is disintegrating. They argue publicly over Carlo's suspected infidelity and his possessive behavior toward Connie. By Italian tradition, nobody, not even a high-ranking Mafia don, can intervene in a married couple's personal disputes, even if they involve infidelity, money, or domestic abuse. One day, Sonny sees a bruise on Connie's face and she tells him that Carlo hit her after she asked him if he was having an affair. Sonny tracks down and severely beats up Carlo Rizzi in the middle of a crowded street for brutalizing the pregnant Connie, and threatens to kill Carlo if he ever abuses Connie again. An angry Carlo responds by plotting with Tattaglia and Don Emilio Barzini (Richard Conte), the Corleones' chief rivals, to have Sonny killed.Later, Carlo has one of his mistresses phone his house, knowing that Connie will answer. The woman asks Connie to tell Carlo not to meet her tonight. The very pregnant and distraught Connie assaults Carlo; he takes advantage of the altercation to beat Connie in order to lure Sonny out in the open and away from the Corleone compound. When Connie phones the compound to tell Sonny that Carlo has beaten her again, the furious Sonny drives off (alone and unescorted) to fulfill his threat against Carlo. On the way to Connie and Carlo's house, Sonny is ambushed at a toll booth on the Long Island Causeway and violently shot to death by several carloads of hitmen wielding Thompson sub-machine guns.Tom Hagen relays the news of Sonny's massacre to the Don, who calls in the favor from Bonasera to personally handle the embalming of Sonny's body. Rather than seek revenge for Sonny's killing, Don Corleone meets with the heads of the Five Families to negotiate a cease-fire. 
Not only is the conflict draining all their assets and threatening their survival, but ending it is the only way that Michael can return home safely. Reversing his previous decision, Vito agrees that the Corleone family will provide political protection for Tattaglia's traffic in heroin, as long as it is controlled and not sold to children. At the meeting, Don Corleone deduces that Don Barzini, not Tattaglia, was ultimately behind the start of the mob war and Sonny's death.In Sicily, Michael patiently waits out his exile, protected by Don Tommasino (Corrado Gaipa), an old family friend. Michael aimlessly wanders the countryside, accompanied by his ever-present bodyguards, Calo (Franco Citti) and Fabrizio (Angelo Infanti). In a small village, Michael meets and falls in love with Apollonia Vitelli (Simonetta Stefanelli), the beautiful young daughter of a bar owner. They court and marry in the traditional Sicilian fashion, but soon Michael's presence becomes known to Corleone enemies. As the couple is about to be moved to a safer location, Apollonia is killed as a result of a rigged car (originally intended for Michael), exploding on ignition; Michael, who watched the car blow up, spots Fabrizio hurriedly leaving the grounds seconds before the explosion, implicating him in the assassination plot. (In a deleted scene, Fabrizio is found years later and killed.)With his safety guaranteed, Michael returns home. More than a year later, in 1950, he reunites with his former girlfriend Kay after a total of four years of separation -- three in Italy and one in America. He tells her he wants them to be married. Although Kay is hurt that he waited so long to contact her, she accepts his proposal. 
With Don Vito semi-retired, Sonny dead, and middle brother Fredo considered incapable of running the family business, Michael is now in charge; he promises Kay he will make the family business completely legitimate within five years.Two years later, Clemenza and Salvatore Tessio (Abe Vigoda), complain that they are being pushed around by the Barzini Family and ask permission to strike back, but Michael denies the request. He plans to move the family operations to Nevada and after that, Clemenza and Tessio may break away to form their own families. Michael further promises Connie's husband, Carlo, that he will be his right hand man in Nevada (Carlo had grown up there), unaware of his part in Sonny's assassination. Tom Hagen has been removed as consigliere and is now merely the family's lawyer, with Vito serving as consigliere. Privately, Hagen inquires about his change in status, and also questions Michael about a new regime of "soldiers" secretly being built under Rocco Lampone (Tom Rosqui). Don Vito explains to Hagen that Michael is acting on his advice.Another year or so later, Michael travels to Las Vegas and meets with Moe Greene (Alex Rocco), a rich and shrewd casino boss looking to expand his business dealings. After the Don's attempted assassination, Fredo had been sent to Las Vegas to learn about the casino business from Greene. Michael arrogantly offers to buy out Greene but is rudely rebuffed. Greene believes the Corleones are weak and that he can secure a better deal from Barzini. As Moe and Michael heatedly negotiate, Fredo sides with Moe. Afterward, Michael warns Fredo to never again "take sides with anyone against the family."Michael returns home. In a private moment, Vito explains his expectation that the Family's enemies will attempt to murder Michael by using a trusted associate to arrange a meeting as a pretext for assassination. 
Vito also reveals that he had never really intended a life of crime for Michael, hoping that his youngest son would hold legitimate power as a senator or governor. Some months later, Vito collapses and dies while playing with his young grandson Anthony (Anthony Gounaris) in his tomato garden. At the burial, Tessio conveys a proposal for a meeting with Barzini, which identifies Tessio as the traitor that Vito was expecting.Michael arranges for a series of murders to occur simultaneously while he is standing godfather to Connie's and Carlo's newborn son at the church:Don Stracci (Don Costello) is gunned down along with his bodyguard in a hotel elevator by a shotgun-wielding Clemenza.Moe Greene is killed while having a massage, shot through the eye by an unidentified assassin.Don Cuneo (Rudy Bond) is trapped in a revolving door at the St. Regis Hotel and shot dead by soldier Willi Cicci (Joe Spinell).Don Tattaglia is assassinated in his bed, along with a prostitute, by Rocco Lampone and an unknown associate.Don Barzini is killed on the steps of his office building along with his bodyguard and driver, shot by Al Neri (Richard Bright), disguised in his old police uniform.After the baptism, Tessio believes he and Hagen are on their way to the meeting between Michael and Barzini that he has arranged. Instead, he is surrounded by Willi Cicci and other button men as Hagen steps away. Realizing that Michael has uncovered his betrayal, Tessio tells Hagen that he always respected Michael, and that his disloyalty "was only business." He asks if Tom can get him off for "old times' sake," but Tom says he cannot. Tessio is driven away and never seen again (it is implied that Cicci shoots and kills Tessio with his own gun after he disarms him prior to entering the car).Meanwhile, Michael confronts Carlo about Sonny's murder and forces him to admit his role in setting up the ambush, having been approached by Barzini himself. 
(The hitmen who killed Sonny were the core members of Barzini's personal bodyguard.) Michael assures Carlo he will not be killed, but his punishment is exclusion from all family business. He hands Carlo a plane ticket to exile in Las Vegas. However, when Carlo gets into a car headed for the airport, he is immediately garroted to death by Clemenza, on Michael's orders.Later, a hysterical Connie confronts Michael at the Corleone compound as movers carry away the furniture in preparation for the family move to Nevada. She accuses him of murdering Carlo in retribution for Carlo's brutal treatment of her and for Carlo's suspected involvement in Sonny's murder. After Connie is removed from the house, Kay questions Michael about Connie's accusation, but he refuses to answer, reminding her to never ask him about his business or what he does for a living. She insists, and Michael outright lies, reassuring his wife that he played no role in Carlo's death. Kay believes him and is relieved. The film ends with Clemenza and new caporegimes Rocco Lampone and Al Neri arriving and paying their respects to Michael. Clemenza kisses Michael's hand and greets him as "Don Corleone." As Kay watches, the office door is closed.
4 |
5 |
6 |
7 |
8 | In 1947, Andy Dufresne (Tim Robbins), a banker in Maine, is convicted of murdering his wife and her lover, a golf pro. He is given two life sentences and sent to the notoriously harsh Shawshank Prison. Andy always claims his innocence, but his cold and measured demeanor led many to doubt his word.During the first night, the chief guard, Byron Hadley (Clancy Brown), savagely beats a newly arrived inmate because of his crying and hysterics. The inmate later dies in the infirmary because the prison doctor had left for the night. Meanwhile Andy remained steadfast and composed. Ellis Boyd Redding (Morgan Freeman), also known as Red, bet against others that Andy would be the one to break down first and loses a considerable amount of cash.About a month later, Andy approaches Red, who runs contraband inside the walls of Shawshank. He asks if Red could find him a rock hammer, an instrument he claims is necessary for his hobby of rock collecting and sculpting. Though other prisoners consider Andy "a really cold fish", Red sees something in Andy, and likes him from the start. Red believes Andy intends to use the hammer to engineer his escape in the future but when the tool arrived and he saw how small it was, Red put aside the thought that Andy could ever use it to dig his way out of prison.Over the first two years of his incarceration, Andy works in the prison laundry. He attracts attention from "the Sisters", a group of prisoners who sexually assault other prisoners. Though he persistently resists and fights them, Andy is beaten and raped on a regular basis.Red pulls some strings, and gets Andy and a few of their mutual friends a break by getting them all on a work detail tarring the roof of one of the prison's buildings. During the job Andy overhears Hadley complaining about having to pay taxes for an upcoming inheritance. Using his expertise as a banker, Andy lets Hadley know how he could shelter his money from the IRS, turning it into a one-time gift for his wife. 
He said he'd assist in exchange for some cold beers for his fellow inmates while on the tarring job. Though he at first threatens to throw Andy off the roof, Hadley, the most brutal guard in the prison, agrees, providing the men with cold beer before the job is finished. Red remarks that Andy may have engineered the privilege to build favor with the prison guards as much as with his fellow inmates, but Red also thinks Andy did it simply to "feel free."While watching a movie, Andy demands Red "Rita Hayworth". Soon, after asking Red for "Rita Hayworth", Andy once more encountered the Sisters and is brutally beaten, putting him in the infirmary for a month. Boggs (Mark Rolston), the leader of "The Sisters", spends a week in solitary. When he comes out, he finds Hadley and his men waiting in his cell. They beat him so badly he's left paralyzed, transferred to a prison hospital upstate, and the Sisters never bothered Andy again. When Andy got out of the infirmary, he finds a bunch of rocks and a poster of Rita Hayworth in his cell: presents from Red and his buddies.Warden Samuel Norton (Bob Gunton) hears about Andy helped Hadley and uses a surprise cell inspection to size Andy up. The warden meets with Andy and sends him to work with aging inmate Brooks Hatlen (James Whitmore) in the prison library, where he sets up a make-shift desk to provide services to other guards (and the warden himself) with income tax returns and other financial advice. There Andy sees an opportunity to expand the prison library, starting with asking the Maine state senate for funds. He starts writing letters and sending them every week. His financial support practice became so appreciated that even guards from other prisons, when they came for inter-prison baseball matches, sought Andy's financial advice. Andy even ends up doing Norton's taxes the next season.Not long afterward, Brooks, the old librarian, threatens to kill another prisoner, Heywood, in order to avoid being paroled. 
Andy is able to talk him down and Brooks is paroled. He goes to a halfway house but finds it impossible to adjust to life outside the prison. He eventually commits suicide. When his friends suggest that he was crazy for doing so, Red tells them that Brooks had obviously become "institutionalized", essentially conditioned to be a prisoner for the rest of his life and unable to adapt to the outside world. Red remarks: "These walls are funny. First you hate 'em, then you get used to 'em. Enough time passes, you get so you depend on them."After six years of writing letters, Andy receives $200 from the state for the library, along with a collection of old books and records. Though the state Senate thinks this will be enough to get Andy to halt his letter-writing campaign, he is undaunted and doubles his efforts.When the donations of old books and records arrive at the warden's office, Andy finds a copy of Mozart's "The Marriage of Figaro" among the records. He locks the guard assigned to the warden's office in the bathroom and plays the record on a phonograph over the prison's PA system. The entire prison seems captivated by the music - Red remarks that the voices of the women in the intro made everyone feel free, if only for a brief time. Outside the office, Norton appears, furious at the act of defiance and orders Andy to turn off the record player. Andy reaches for the needle arm at first, then turns the volume on the phonograph up. The warden orders Hadley to break into the office and Andy is sent immediately to solitary confinement for two weeks. When he gets out, he tells his friends that it was the "easiest time" stretch ever did in the hole because he thought of Mozart's Figaro. When the other prisoners tell him how unlikely that could be, he tells them that hope can sustain them. Red is not convinced and leaves, bitter at the thought.With the enlarged library and more materials, Andy begins to teach those inmates who want to receive their high school diplomas. 
After Andy is able to secure a steady stream of funding from various sources, the library is further renovated and named for Brooks.Warden Norton profits on Andy's knowledge of bookkeeping and devises a scheme whereby he put prison inmates to work in public projects which he won by outbidding other contractors (cheap labor from the prisoners). Occasionally, he let others get some contracts if they bribe him. Andy launders money for the warden by setting up many accounts in different banks, along with several investments, using a fake identity: "Randall Stephens". He shared the details only with his friend, Red, noting that he had to "go to prison to learn how to be a criminal."In 1965, a young prisoner named Tommy (Gil Bellows) comes to Shawshank. Andy suggests that Tommy take up another line of work besides theft. The suggestion really gets to Tommy and he works on achieving his high school equivalency diploma. Though Tommy is a good student, he is still frustrated when he takes the exam itself, crumpling it up and tossing it in the trash. Andy retrieves it and sends it in.One day Red tells Tommy about Andy's case. Tommy is visibly upset at hearing Andy's story and tells Andy and Red that he had a cellmate in another prison who boasted about killing a man who was a pro golfer at the country club he worked at, along with his lover. The woman's husband, a banker, had gone to prison for those murders. With this new information, Andy, full of hope, meets with the warden's, expecting he could help him get another trial with Tommy as a witness. The reaction from Norton is completely contrary to what Andy hoped for. Andy says emphatically that he would never reveal the money laundering schemes he had set up for Norton over the years - the warden becomes furious and orders him to solitary for a month. The warden later meets with Tommy alone and asks him if he'll testify on Andy's behalf. 
Tommy enthusiastically agrees and the warden has him shot dead by Hadley.When the warden visits Andy in solitary, he tells him that Tommy was killed while attempting escape. Andy tells Norton that the financial schemes will stop. The warden counters, saying the library will be destroyed and all it's materials burned. Andy will also lose his private cell and be sent to the block with the most hardened criminals. The warden gives Andy another month in solitary.Afterwards, Andy returns to the usual daily life at Shawshank, a seemingly broken man. One day he talks to Red, about how although he didn't kill his wife, his personality drove her away, which led to her infidelity and death. He says if he's ever freed or escapes, he'd like to go to Zihuatanejo, a beach town on the Pacific coast of Mexico. He also tells Red how he got engaged. He and his future wife went up to a farm in Buxton, Maine, to a large oak tree at the end of a stone wall. The two made love under the tree, after which he proposed to her. He tells Red that, if he should ever be paroled, he should look for that field, and that oak tree. There, under a large black volcanic rock that would look out of place, Andy has buried a box that he wants Red to have. Andy refuses to reveal what might be in that box.Later, Andy asks for a length of rope, leading Red and his buddies to suspect he will commit suicide. At the end of the day, Norton asks Andy to shine his shoes for him and put his suit in for dry-cleaning before retiring for the night.The following morning, Andy is not accounted for as usual from his cell. At the same time, Norton becomes alarmed when he finds Andy's shoes in his shoebox instead of his own. He rushes to Andy's cell and demands an explanation. Hadley brings in Red, but Red insists he knows nothing of Andy's plans. Becoming increasing hostile and paranoid, Norton starts throwing Andy's sculpted rocks around the cell. 
When he throws one at Andy's poster of Raquel Welch (where it used to hold Marilyn Monroe and Rita Hayworth before), the rock punches through and into the wall. Norton tears the poster away from the wall and finds a tunnel just wide enough for a man to crawl through.During the previous nights thunderstorm, Andy wore Norton's shoes to his cell, catching a lucky break when no one notices. He packs some papers and Norton's clothes into a plastic bag, tied it to himself with the rope he'd asked for, and escapes through his hole. The tunnel he'd excavated led him to a space between two walls of the prison where he found a sewer main line. Using a rock, he hits it in time with the lightning strikes and eventually burst it. Crawling through 500 yards in the pipe and through the raw sewage contained in it, Andy emerged in a brook outside the walls. A search team later found his uniform and his rock hammer which had been worn nearly to nothing.That morning, Andy walks into the Maine National Bank in Portland, where he had put Warden Norton's money. Using his assumed identity as Randall Stephens, and with all the necessary documentation, he walked out with a cashier's check. Before he leaves, he asks them to drop a package in the mail. He continues his visitations to nearly a dozen other local banks, ending up with some $370,000. The package contained Warden Norton's account books which were delivered straight to the Portland Daily Bugle newspaper.Not long after, the police storms Shawshank Prison. Hadley is arrested for murder; Red said he was taken away "crying like a little girl". Warden Norton finally opens the safe, which he hadn't touched since Andy escaped, and instead of his books, he finds the Bible he had given Andy. Norton opens it to the book of Exodus and finds that the pages have been cut out in the shape of Andy's rock hammer. Norton walks back to his desk as the police pound on his door, takes out a small revolver and shoots himself under the chin. 
Red remarks that he wondered if the warden thought, right before pulling the trigger, how "Andy could ever have gotten the best of him."Shortly after, Red receives a postcard from Fort Hancock, Texas, with nothing written on it. Red takes it as a sign that Andy made it into Mexico to freedom. Red and his buddies would spend their time talking about Andy's exploits (with a lot of embellishments), but Red just missed his friend.At Red's next parole hearing in 1967, he talked to the parole board about how "rehabilitated" was a made-up word, and how he regretted his actions of the past. His parole is granted this time. He goes to work at a grocery store, and stays at the same halfway house room Brooks had stayed in. He frequently walks by a pawn shop, which had several guns and compasses in the window. At times he would contemplate trying to get back into prison, but he remembered the promise he had made to Andy.One day, with a compass he bought from the pawn shop, he followed Andy's instructions, hitchhiking to Buxton and arriving at the stone wall Andy described. Just like Andy said, there was a large black stone. Under it was a small box containing a large sum of cash and instructions to find him. He said he needed somebody "who could get things" for a "project" of his.Red violates parole and leaves the halfway house, unconcerned since no one would really do an extensive manhunt for "an old crook like [him]." He takes a bus to Fort Hancock, where he crosses into Mexico. The two friends are finally reunited on the beach of Zihuatanejo on the Pacific coast.
9 |
10 |
--------------------------------------------------------------------------------
/backup/ward_clusters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/ward_clusters.png
--------------------------------------------------------------------------------
/d3/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2013, Michael Bostock
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without
5 | modification, are permitted provided that the following conditions are met:
6 |
7 | * Redistributions of source code must retain the above copyright notice, this
8 | list of conditions and the following disclaimer.
9 |
10 | * Redistributions in binary form must reproduce the above copyright notice,
11 | this list of conditions and the following disclaimer in the documentation
12 | and/or other materials provided with the distribution.
13 |
14 | * The name Michael Bostock may not be used to endorse or promote products
15 | derived from this software without specific prior written permission.
16 |
17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
18 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
19 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
20 | DISCLAIMED. IN NO EVENT SHALL MICHAEL BOSTOCK BE LIABLE FOR ANY DIRECT,
21 | INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
22 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
23 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
24 | OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
25 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
26 | EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
27 |
--------------------------------------------------------------------------------
/data/genres_list.txt:
--------------------------------------------------------------------------------
1 | [u' Crime', u' Drama']
2 | [u' Crime', u' Drama']
3 | [u' Biography', u' Drama', u' History']
4 | [u' Biography', u' Drama', u' Sport']
5 | [u' Drama', u' Romance', u' War']
6 | [u' Drama']
7 | [u' Drama', u' Romance', u' War']
8 | [u' Drama', u' Mystery']
9 | [u' Adventure', u' Family', u' Fantasy', u' Musical']
10 | [u' Drama', u' Romance']
11 | [u' Adventure', u' Biography', u' Drama', u' History', u' War']
12 | [u' Crime', u' Drama']
13 | [u' Horror', u' Mystery', u' Thriller']
14 | [u' Drama', u' Film-Noir']
15 | [u' Mystery', u' Romance', u' Thriller']
16 | [u' Crime', u' Drama']
17 | [u' Drama', u' Romance']
18 | [u' Biography', u' Drama', u' Family', u' Musical', u' Romance']
19 | [u' Crime', u' Drama', u' Musical', u' Romance', u' Thriller']
20 | [u' Action', u' Adventure', u' Fantasy', u' Sci-Fi']
21 | [u' Adventure', u' Family', u' Sci-Fi']
22 | [u' Mystery', u' Sci-Fi']
23 | [u' Crime', u' Drama', u' Thriller']
24 | [u' Drama', u' Mystery', u' Thriller']
25 | [u' Adventure', u' Drama', u' War']
26 | [u' Comedy', u' Musical', u' Romance']
27 | [u' Drama', u' Family', u' Fantasy']
28 | [u' Comedy']
29 | [u' Drama']
30 | [u' Comedy', u' War']
31 | [u' Biography', u' Drama', u' Music']
32 | [u' Drama', u' War']
33 | [u' Biography', u' Drama', u' History']
34 | [u' Adventure', u' Fantasy']
35 | [u' Action', u' Drama']
36 | [u' Drama', u' Romance', u' War']
37 | [u' Action', u' Drama', u' War']
38 | [u' Western']
39 | [u' Action', u' Adventure']
40 | [u' Drama', u' Sport']
41 | [u' Drama']
42 | [u' Comedy', u' Romance']
43 | [u' Drama']
44 | [u' Musical', u' Romance']
45 | [u' Drama', u' Romance', u' War']
46 | [u' Drama', u' Family', u' Musical', u' Romance']
47 | [u' Adventure', u' Drama']
48 | [u' Drama', u' Romance', u' War']
49 | [u' Biography', u' Drama', u' War']
50 | [u' Drama', u' Thriller']
51 | [u' Action', u' Biography', u' Drama', u' History', u' War']
52 | [u' Western']
53 | [u' Biography', u' Crime', u' Western']
54 | [u' Action', u' Adventure', u' Drama', u' Western']
55 | [u' Comedy', u' Drama', u' Romance']
56 | [u' Drama', u' War']
57 | [u' Western']
58 | [u' Adventure', u' Drama', u' Western']
59 | [u' Biography', u' Drama', u' War']
60 | [u' Biography', u' Crime', u' Drama']
61 | [u' Horror']
62 | [u' Drama', u' War']
63 | [u' Drama', u' War']
64 | [u' Action', u' Crime', u' Thriller']
65 | [u' Comedy', u' Drama', u' Romance']
66 | [u' Biography', u' Drama', u' History']
67 | [u' Comedy', u' Romance']
68 | [u' Drama', u' Romance']
69 | [u' Drama']
70 | [u' Drama']
71 | [u' Drama']
72 | [u' Comedy', u' Drama', u' Romance']
73 | [u' Biography', u' Drama', u' Romance']
74 | [u' Drama']
75 | [u' Comedy', u' Drama']
76 | [u' Comedy', u' Drama', u' Romance']
77 | [u' Crime', u' Drama', u' Thriller']
78 | [u' Drama', u' Romance']
79 | [u' Drama']
80 | [u' Drama', u' Romance', u' Western']
81 | [u' Crime', u' Drama', u' Fantasy', u' Mystery']
82 | [u' Drama', u' Sci-Fi']
83 | [u' Drama']
84 | [u' Drama', u' Music']
85 | [u' Comedy', u' Drama', u' Romance']
86 | [u' Comedy', u' Drama']
87 | [u' Crime', u' Drama', u' Thriller']
88 | [u' Adventure', u' Romance', u' War']
89 | [u' Adventure', u' Western']
90 | [u' Adventure', u' Drama', u' History']
91 | [u' Drama', u' Film-Noir', u' Mystery']
92 | [u' Crime', u' Drama', u' Sci-Fi']
93 | [u' Crime', u' Drama']
94 | [u' Drama', u' Romance']
95 | [u' Crime', u' Drama', u' Film-Noir', u' Thriller']
96 | [u' Drama']
97 | [u' Mystery', u' Thriller']
98 | [u' Film-Noir', u' Mystery', u' Thriller']
99 | [u' Mystery', u' Thriller']
100 | [u' Biography', u' Drama', u' Musical']
101 |
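Each line of this file is a Python 2 list literal, and every genre name carries a leading space. A minimal sketch of turning one line back into a clean list, using only the standard library (the helper name `parse_genre_line` is hypothetical and not part of the repository):

```python
import ast

def parse_genre_line(line):
    """Parse one line such as "[u' Crime', u' Drama']" into ['Crime', 'Drama']."""
    # ast.literal_eval safely evaluates the list literal; the u'' prefix is
    # accepted on Python 3.3+ as well as on Python 2.
    raw = ast.literal_eval(line.strip())
    # Each genre name carries a leading space in the file, so strip it.
    return [genre.strip() for genre in raw]

print(parse_genre_line("[u' Crime', u' Drama']"))
```

Parsing the lines this way would make per-genre comparisons possible, whereas `doc_clustering.py` keeps each line as a raw string.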
--------------------------------------------------------------------------------
/data/title_list.txt:
--------------------------------------------------------------------------------
1 | The Godfather
2 | The Shawshank Redemption
3 | Schindler's List
4 | Raging Bull
5 | Casablanca
6 | One Flew Over the Cuckoo's Nest
7 | Gone with the Wind
8 | Citizen Kane
9 | The Wizard of Oz
10 | Titanic
11 | Lawrence of Arabia
12 | The Godfather: Part II
13 | Psycho
14 | Sunset Blvd.
15 | Vertigo
16 | On the Waterfront
17 | Forrest Gump
18 | The Sound of Music
19 | West Side Story
20 | Star Wars
21 | E.T. the Extra-Terrestrial
22 | 2001: A Space Odyssey
23 | The Silence of the Lambs
24 | Chinatown
25 | The Bridge on the River Kwai
26 | Singin' in the Rain
27 | It's a Wonderful Life
28 | Some Like It Hot
29 | 12 Angry Men
30 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
31 | Amadeus
32 | Apocalypse Now
33 | Gandhi
34 | The Lord of the Rings: The Return of the King
35 | Gladiator
36 | From Here to Eternity
37 | Saving Private Ryan
38 | Unforgiven
39 | Raiders of the Lost Ark
40 | Rocky
41 | A Streetcar Named Desire
42 | The Philadelphia Story
43 | To Kill a Mockingbird
44 | An American in Paris
45 | The Best Years of Our Lives
46 | My Fair Lady
47 | Ben-Hur
48 | Doctor Zhivago
49 | Patton
50 | Jaws
51 | Braveheart
52 | The Good, the Bad and the Ugly
53 | Butch Cassidy and the Sundance Kid
54 | The Treasure of the Sierra Madre
55 | The Apartment
56 | Platoon
57 | High Noon
58 | Dances with Wolves
59 | The Pianist
60 | Goodfellas
61 | The Exorcist
62 | The Deer Hunter
63 | All Quiet on the Western Front
64 | The French Connection
65 | City Lights
66 | The King's Speech
67 | It Happened One Night
68 | A Place in the Sun
69 | Midnight Cowboy
70 | Mr. Smith Goes to Washington
71 | Rain Man
72 | Annie Hall
73 | Out of Africa
74 | Good Will Hunting
75 | Terms of Endearment
76 | Tootsie
77 | Fargo
78 | Giant
79 | The Grapes of Wrath
80 | Shane
81 | The Green Mile
82 | Close Encounters of the Third Kind
83 | Network
84 | Nashville
85 | The Graduate
86 | American Graffiti
87 | Pulp Fiction
88 | The African Queen
89 | Stagecoach
90 | Mutiny on the Bounty
91 | The Maltese Falcon
92 | A Clockwork Orange
93 | Taxi Driver
94 | Wuthering Heights
95 | Double Indemnity
96 | Rebel Without a Cause
97 | Rear Window
98 | The Third Man
99 | North by Northwest
100 | Yankee Doodle Dandy
101 |
--------------------------------------------------------------------------------
/doc_clustering.py:
--------------------------------------------------------------------------------
1 |
2 | # coding: utf-8
3 |
4 | # TL;DR
5 | #
6 | # Data: Top 100 movies (http://www.imdb.com/list/ls055592025/) with title, genre, and synopsis (IMDB and Wiki)
7 | # Goal: Put 100 movies into 5 clusters based on text mining their synopses
8 |
9 | # In[1]:
10 |
11 | import numpy as np
12 | import pandas as pd
13 | import nltk
14 | from nltk.stem.snowball import SnowballStemmer
15 | from bs4 import BeautifulSoup
16 | import re
17 | import os
18 | import codecs
19 | from sklearn import feature_extraction
20 | import mpld3
21 |
22 |
23 | # Read movie titles: the file lists 100 movies and ends with a newline, so split('\n') leaves an empty string as the last item
24 |
25 | # In[2]:
26 |
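27 | # print() is already a function on Python 3, so no __future__ import is needed here
28 | # (a `from __future__` import must be the first statement of a file and raises a SyntaxError this far down)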
29 | titles = open('data/title_list.txt').read().split('\n')
30 |
31 |
32 | # In[3]:
33 |
34 | len(titles)
35 |
36 |
37 | # In[4]:
38 |
39 | titles[:10]
40 |
41 |
42 | # In[5]:
43 |
44 | titles = titles[:100]
45 |
46 |
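The trailing empty item comes from the final newline in the file. A small sketch with a made-up two-title string (not the real file) shows the difference between `split('\n')` and `splitlines()`:

```python
text = "The Godfather\nThe Shawshank Redemption\n"  # content ending in a newline

# split('\n') keeps the empty string that follows the final newline
with_split = text.split('\n')
print(with_split)

# splitlines() drops it, so no extra slicing is needed
with_splitlines = text.splitlines()
print(with_splitlines)
```

Slicing with `[:100]`, as the script does, gives the same result here; `splitlines()` is simply an alternative that does not depend on knowing the count in advance.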
47 | # Read Genres information
48 |
49 | # In[6]:
50 |
51 | genres = open('data/genres_list.txt').read().split('\n')
52 | genres = genres[:100]
53 |
54 |
55 | # In[7]:
56 |
57 | genres[0]
58 |
59 |
60 | # Read in the synopses from wiki
61 |
62 | # In[8]:
63 |
64 | synopses_wiki = open('data/synopses_list_wiki.txt').read().split('\n BREAKS HERE')
65 |
66 |
67 | # In[9]:
68 |
69 | len(synopses_wiki)
70 |
71 |
72 | # In[10]:
73 |
74 | synopses_wiki = synopses_wiki[:100]
75 |
76 |
77 | # In[11]:
78 |
79 | synopses_wiki[0]
80 |
81 |
82 | # strips html formatting and converts to unicode
83 |
84 | # In[12]:
85 |
86 | synopses_clean_wiki = []
87 | for text in synopses_wiki:
88 |     text = BeautifulSoup(text, 'html.parser').getText()
89 |     synopses_clean_wiki.append(text)
90 | synopses_wiki = synopses_clean_wiki
91 |
92 |
93 | # In[13]:
94 |
95 | synopses_wiki[0]
96 |
97 |
98 | # Read synopses information from imdb, which might be different from wiki. Also cleaned as above.
99 |
100 | # In[14]:
101 |
102 | synopses_imdb = open('data/synopses_list_imdb.txt').read().split('\n BREAKS HERE')
103 | synopses_imdb = synopses_imdb[:100]
104 |
105 | synopses_clean_imdb = []
106 |
107 | for text in synopses_imdb:
108 |     text = BeautifulSoup(text, 'html.parser').getText()
109 |     synopses_clean_imdb.append(text)
110 | synopses_imdb = synopses_clean_imdb
111 |
112 |
113 | # In[15]:
114 |
115 | synopses_imdb[0]
116 |
117 |
118 | # Combine synopses from wiki and imdb
119 |
120 | # In[16]:
121 |
122 | synopses = []
123 | for i in range(len(synopses_wiki)):
124 |     item = synopses_wiki[i] + synopses_imdb[i]
125 |     synopses.append(item)
126 |
127 |
128 | # In[17]:
129 |
130 | synopses[0]
131 |
132 |
133 | # In[18]:
134 |
135 | print(str(len(titles)) + ' titles')
136 | print(str(len(genres)) + ' genres')
137 | print(str(len(synopses)) + ' synopses')
138 |
139 |
140 | # In[19]:
141 |
142 | # generates index for each item in the corpora (in this case it's just rank) and I'll use this for scoring later
143 | # the movies in the list are already ranked from 1 to 100
144 | ranks = []
145 | for i in range(1, len(titles)+1):
146 |     ranks.append(i)
147 |
148 |
149 | # In[20]:
150 |
151 | # load nltk's English stopwords as variable called 'stopwords'
152 | # run nltk.download('stopwords') once to install the corpus first
153 | # stop words are very common words (e.g. "the", "is", "at") that carry little meaning on their own and are usually filtered out before analysis
154 | stopwords = nltk.corpus.stopwords.words('english')
155 |
156 | # load nltk's SnowballStemmer as a variable called 'stemmer'
157 | stemmer = SnowballStemmer("english")
158 |
159 |
160 | # In[21]:
161 |
162 | len(stopwords)
163 |
164 |
165 | # In[22]:
166 |
167 | stopwords
168 |
169 |
170 | # In[23]:
171 |
172 | sents = [sent for sent in nltk.sent_tokenize("Today (May 19, 2016) is his only daughter's wedding. Vito Corleone is the Godfather. Vito's youngest son, Michael, in a Marine Corps uniform, introduces his girlfriend, Kay Adams, to his family at the sprawling reception.")]
173 |
174 |
175 | # In[24]:
176 |
177 | sents
178 |
179 |
180 | # In[25]:
181 |
182 | words = [word for word in nltk.word_tokenize(sents[0])]
183 | words
184 |
185 |
186 | # In[26]:
187 |
188 | # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
189 | filtered_words = []
190 | for word in words:
191 |     if re.search('[a-zA-Z]', word):
192 |         filtered_words.append(word)
193 | filtered_words
194 |
195 |
196 | # In[27]:
197 |
198 | # see how "only" is stemmed to "onli" and "wedding" is stemmed to "wed"
199 | stems = [stemmer.stem(t) for t in filtered_words]
200 | stems
201 |
202 |
203 | # In[28]:
204 |
205 | # here I define a tokenizer and stemmer which returns the list of stems in the text that it is passed
206 | # nltk.sent_tokenize uses the Punkt sentence tokenizer ("sent" means sentence)
207 | def tokenize_and_stem(text):
208 |     # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
209 |     tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
210 |     filtered_tokens = []
211 |     # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
212 |     for token in tokens:
213 |         if re.search('[a-zA-Z]', token):
214 |             filtered_tokens.append(token)
215 |     stems = [stemmer.stem(t) for t in filtered_tokens]
216 |     return stems
217 |
218 |
219 | # In[29]:
220 |
221 |
222 | def tokenize_only(text):
223 |     # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
224 |     tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
225 |     filtered_tokens = []
226 |     # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
227 |     for token in tokens:
228 |         if re.search('[a-zA-Z]', token):
229 |             filtered_tokens.append(token)
230 |     return filtered_tokens
231 |
232 |
233 | # In[30]:
234 |
235 | words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
236 | print(words_stemmed)
237 |
238 |
239 | # In[31]:
240 |
241 | words_only = tokenize_only("Today (May 19, 2016) is his only daughter's wedding.")
242 | words_only
243 |
244 |
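The letter filter shared by both functions can be seen in isolation with a minimal sketch. It assumes a simple regular-expression tokenizer in place of NLTK's `word_tokenize` (so it runs without downloading NLTK data), and the name `simple_tokens` is hypothetical. The stand-in splits "daughter's" differently from NLTK, so the token lists will not match NLTK's output exactly:

```python
import re

def simple_tokens(text):
    """Split text into word-like runs and single punctuation marks
    (a rough stand-in for nltk.word_tokenize, not an equivalent)."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Today (May 19, 2016) is his only daughter's wedding."
tokens = simple_tokens(sentence)

# Same test as in tokenize_and_stem / tokenize_only:
# keep only tokens that contain at least one letter
kept = [t.lower() for t in tokens if re.search('[a-zA-Z]', t)]
dropped = [t for t in tokens if not re.search('[a-zA-Z]', t)]

print(kept)     # word tokens, lowercased
print(dropped)  # numbers and punctuation
```

Numeric tokens such as "19" and "2016" and bare punctuation end up in `dropped`, which is the effect the filter has in the real functions.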
245 | # Below I use my stemming/tokenizing and tokenizing functions to iterate over the list of synopses to create two vocabularies: one stemmed and one only tokenized
246 |
247 | # In[32]:
248 |
249 | # extend vs. append
250 | a = [1, 2]
251 | b = [3, 4]
252 | c = [5, 6]
253 | b.append(a)
254 | c.extend(a)
255 | print(b)
256 | print(c)
257 |
258 |
259 | # In[33]:
260 |
261 | totalvocab_stemmed = []
262 | totalvocab_tokenized = []
263 | for i in synopses:
264 |     allwords_stemmed = tokenize_and_stem(i) # for each item in 'synopses', tokenize/stem
265 |     totalvocab_stemmed.extend(allwords_stemmed) # extend the 'totalvocab_stemmed' list
266 |
267 |     allwords_tokenized = tokenize_only(i)
268 |     totalvocab_tokenized.extend(allwords_tokenized)
269 |
270 |
271 | # In[34]:
272 |
273 | print(len(totalvocab_stemmed))
274 | print(len(totalvocab_tokenized))
275 |
276 |
277 | # In[35]:
278 |
279 | vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
280 | print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
281 | print(vocab_frame.head())
282 |
283 |
284 | # In[36]:
285 |
286 | words_frame = pd.DataFrame({'WORD': words_only}, index = words_stemmed)
287 | print('there are ' + str(words_frame.shape[0]) + ' items in words_frame')
288 | print(words_frame)
289 |
290 |
291 | # Generate TF-IDF matrix (see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
292 | #
293 | # max_df: this is the maximum document frequency a given feature can have to be used in the tf-idf matrix. If the term appears in more than 80% of the documents it probably carries little meaning (in the context of film synopses).
294 | #
295 | # min_df: this could be an integer (e.g. 5), in which case the term would have to be in at least 5 of the documents to be considered. Here I pass 0.2; the term must be in at least 20% of the documents. I found that if I allowed a lower min_df I ended up basing clustering on names; for example, "Michael" or "Tom" are names found in several of the movies and the synopses use these names frequently, but the names carry no real meaning.
296 | #
297 | # ngram_range: this just means I'll look at unigrams, bigrams and trigrams
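#
# To make the max_df and min_df cut-offs concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is only an illustration and not how TfidfVectorizer is implemented: the toy corpus and the helper name filter_by_document_frequency are hypothetical, and both thresholds are assumed to be proportions of documents.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=0.2, max_df=0.8):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = float(len(tokenized_docs))
    # count each term once per document, no matter how often it repeats there
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, count in doc_freq.items()
                  if min_df <= count / n_docs <= max_df)


# hypothetical toy corpus: five already-tokenized, already-stemmed "synopses"
toy_docs = [
    ['famili', 'war', 'michael'],
    ['famili', 'polic', 'kill'],
    ['famili', 'war', 'soldier'],
    ['famili', 'danc', 'love'],
    ['famili', 'kill', 'love'],
]

kept_terms = filter_by_document_frequency(toy_docs, min_df=0.4, max_df=0.8)
print(kept_terms)
```

# With these thresholds 'famili' is dropped because it appears in every document, and terms that appear in a single document (such as the name 'michael') are dropped as too rare.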
298 |
299 | # In[37]:
300 |
301 | # Note that the result of this block takes a while to show
302 | from sklearn.feature_extraction.text import TfidfVectorizer
303 |
304 | #define vectorizer parameters
305 | tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
306 | min_df=0.2, stop_words='english',
307 | use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
308 |
309 | get_ipython().magic(u'time tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses')
310 |
311 | # (100, 563) means the matrix has 100 rows and 563 columns
312 | print(tfidf_matrix.shape)
313 | terms = tfidf_vectorizer.get_feature_names()
314 | len(terms)
315 |
316 |
317 | # In[40]:
318 |
319 | from sklearn.metrics.pairwise import cosine_similarity
320 | # A short example using the sentences above
321 | words_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
322 | min_df=0.2, stop_words='english',
323 | use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
324 |
325 | get_ipython().magic(u'time words_matrix = words_vectorizer.fit_transform(sents) #fit the vectorizer to the example sentences')
326 |
327 | # (2, 18) means the matrix has 2 rows (two sentences) and 18 columns (18 terms)
328 | print(words_matrix.shape)
329 | print(words_matrix)
330 |
331 | # this is how we get the 18 terms
332 | analyze = words_vectorizer.build_analyzer()
333 | print(analyze("Today (May 19, 2016) is his only daughter's wedding."))
334 | print(analyze("Vito Corleone is the Godfather."))
335 | print(analyze("Vito's youngest son, Michael, in a Marine Corps uniform, introduces his girlfriend, Kay Adams, to his family at the sprawling reception."))
336 | all_terms = words_vectorizer.get_feature_names()
337 | print(all_terms)
338 | print(len(all_terms))
339 |
340 | # sentences 1 and 2 have similarity 0; sentences 1 and 3 share "his"; sentences 2 and 3 share "Vito". Try changing "Vito's" in sentence 3 to "His" and see how the similarity matrix changes.
341 | example_similarity = cosine_similarity(words_matrix)
342 | example_similarity
343 |
344 |
345 | # Now onto the fun part. Using the tf-idf matrix, you can run a slew of clustering algorithms to better understand the hidden structure within the synopses. I first chose k-means. K-means initializes with a pre-determined number of clusters (I chose 5). Each observation is assigned to a cluster (cluster assignment) so as to minimize the within cluster sum of squares. Next, the mean of the clustered observations is calculated and used as the new cluster centroid. Then, observations are reassigned to clusters and centroids recalculated in an iterative process until the algorithm reaches convergence.
346 | #
347 | # I found it took several runs for the algorithm to converge to a global optimum, as k-means is susceptible to reaching local optima. An open question: how do you decide that the algorithm has converged?
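#
# One possible answer to the convergence question, sketched in plain Python. This is not scikit-learn's KMeans: the function names, the toy points and the stopping rule (stop when no point changes cluster) are assumptions made for illustration. Runs started from different seeds are compared by their within-cluster sum of squares.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Plain Lloyd iteration; stops when no point changes cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = None
    for n_iter in range(1, max_iter + 1):
        # assignment step: each point goes to its nearest centroid
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged: assignments are stable
        labels = new_labels
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / float(len(members))
                                     for dim in zip(*members))
    # within-cluster sum of squares: lower is better when comparing runs
    inertia = sum(squared_distance(p, centroids[lab])
                  for p, lab in zip(points, labels))
    return labels, inertia, n_iter


# hypothetical toy data: two well-separated groups of three points
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
runs = [kmeans_sketch(toy_points, k=2, seed=s) for s in range(5)]
best_labels, best_inertia, best_n_iter = min(runs, key=lambda run: run[1])
print(best_labels, best_inertia, best_n_iter)
```

# On this toy data the runs should all recover the same two groups; on real tf-idf data the inertia values typically differ between runs, and keeping the lowest one is a common way to choose. scikit-learn's KMeans reports the same quantity as inertia_ and repeats the initialization n_init times.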
348 |
349 | # In[41]:
350 |
351 | from sklearn.cluster import KMeans
352 |
353 | num_clusters = 5
354 |
355 | km = KMeans(n_clusters=num_clusters)
356 |
357 | get_ipython().magic(u'time km.fit(tfidf_matrix)')
358 |
359 | clusters = km.labels_.tolist()
360 |
361 |
362 | # I use joblib.dump to pickle the model once it has converged, and joblib.load to reload the model and reassign the labels as the clusters.
363 |
364 | # In[42]:
365 |
366 | from sklearn.externals import joblib
367 |
368 | #save the fitted model to disk, then reload it from the pickle
369 | #(once the model has been saved, you can comment out the dump and only load)
370 |
371 | joblib.dump(km, 'doc_cluster.pkl')
372 |
373 | km = joblib.load('doc_cluster.pkl')
374 | clusters = km.labels_.tolist()
375 | # clusters shows which cluster (0-4) each of the 100 synopses belongs to
376 | print(len(clusters))
377 | print(clusters)
378 |
379 |
380 | # Here, I create a dictionary of titles, ranks, the synopsis, the cluster assignment, and the genre [rank and genre were scraped from IMDB].
381 | # I convert this dictionary to a Pandas DataFrame for easy access. I'm a huge fan of Pandas and recommend taking a look at some of its awesome functionality which I'll use below, but not describe in a ton of detail.
382 |
383 | # In[43]:
384 |
385 | films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
386 |
387 | frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre'])
388 |
389 | print(frame) # here the ranking is still 0 to 99
390 |
391 | frame['cluster'].value_counts() #number of films per cluster (clusters from 0 to 4)
392 |
393 |
394 | # In[44]:
395 |
396 | grouped = frame['rank'].groupby(frame['cluster']) # groupby cluster for aggregation purposes
397 |
398 | grouped.mean() # average rank (1 to 100) per cluster
399 |
400 |
401 | # Note that clusters 4 and 0 have the lowest rank, which indicates that they, on average, contain films that were ranked as "better" on the top 100 list.
402 | # Here is some fancy indexing and sorting on each cluster to identify which are the top n (I chose n=6) words that are nearest to the cluster centroid. This gives a good sense of the main topic of the cluster.
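#
# The indexing trick can be easier to follow on a single centroid. Below is a small pure-Python sketch of the same idea: rank the term indices by their weight in the centroid, largest first, and look the top n up in the vocabulary. The helper name top_terms, the vocabulary and the weights are hypothetical and are not taken from the fitted model.

```python
def top_terms(centroid_weights, vocabulary, n=3):
    """Return the n terms with the largest weights in one centroid."""
    # indices sorted by descending weight, similar to argsort()[::-1] on one row
    order = sorted(range(len(centroid_weights)),
                   key=lambda j: centroid_weights[j], reverse=True)
    return [vocabulary[j] for j in order[:n]]


# hypothetical vocabulary and centroid weights
toy_terms = ['famili', 'war', 'polic', 'kill', 'danc']
toy_centroid = [0.10, 0.45, 0.05, 0.30, 0.00]
toy_top = top_terms(toy_centroid, toy_terms, n=3)
print(toy_top)
```

# The terms with the largest centroid weights are the ones that best characterize the cluster, which is why they work as a rough topic label.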
403 |
404 | # In[45]:
405 |
406 | from __future__ import print_function  # note: in a plain script (outside the notebook) this import must be moved to the top of the file
407 |
408 | print("Top terms per cluster:")
409 |
410 | #for each cluster, sort the term indices by descending weight in the centroid
411 | order_centroids = km.cluster_centers_.argsort()[:, ::-1]
412 |
413 | for i in range(num_clusters):
414 |     print("Cluster %d words:" % i, end='')
415 |
416 |     for ind in order_centroids[i, :6]: #replace 6 with n words per cluster
417 |         print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
418 |     print() #add whitespace
419 |     print() #add whitespace
420 |
421 |     print("Cluster %d titles:" % i, end='')
422 |     for title in frame.ix[i]['title'].values.tolist():
423 |         print(' %s,' % title, end='')
424 |     print() #add whitespace
425 |     print() #add whitespace
426 |
427 |
428 | # Cosine similarity is measured against the tf-idf matrix and can be used to generate a measure of similarity between each document and the other documents in the corpus (each synopsis among the synopses). A cosine similarity of 1 means two documents have the same term weights (up to scaling); 0 means they share no terms. similarity_distance is defined as 1 minus the cosine similarity of each pair of documents. Subtracting from 1 provides the cosine distance, which I will use for plotting on a euclidean (2-dimensional) plane.
429 | # Note that with similarity_distance it is possible to evaluate the similarity of any two or more synopses.
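#
# As a small worked example of the distance used here, the sketch below computes 1 minus the cosine similarity for a few hand-made vectors in plain Python. The function name cosine_similarity_sketch and the vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention assumed for this sketch.

```python
import math


def cosine_similarity_sketch(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


# hypothetical tf-idf-like rows over a four-term vocabulary
doc_a = [0.5, 0.5, 0.0, 0.0]
doc_b = [1.0, 1.0, 0.0, 0.0]  # same direction as doc_a, different length
doc_c = [0.0, 0.0, 0.7, 0.3]  # shares no terms with doc_a

distance_ab = 1 - cosine_similarity_sketch(doc_a, doc_b)
distance_ac = 1 - cosine_similarity_sketch(doc_a, doc_c)
print(distance_ab, distance_ac)
```

# doc_a and doc_b point in the same direction, so their distance is (numerically) zero even though their lengths differ; doc_a and doc_c share no terms, so their distance is 1.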
430 |
431 | # In[46]:
432 |
433 |
434 | similarity_distance = 1 - cosine_similarity(tfidf_matrix)
435 | print(type(similarity_distance))
436 | print(similarity_distance.shape)
437 |
438 |
439 | # Multidimensional scaling
440 | # Here is some code to convert the similarity_distance matrix into a 2-dimensional array using multidimensional scaling. I won't pretend I know a ton about MDS, but it was useful for this purpose. Another option would be to use principal component analysis.
441 |
442 | # In[47]:
443 |
444 | import os # for os.path.basename
445 |
446 | import matplotlib.pyplot as plt
447 | import matplotlib as mpl
448 |
449 | from sklearn.manifold import MDS
450 |
451 | # use two components, as we're plotting points in a two-dimensional plane
452 | # "precomputed" because we provide a distance matrix
453 | # we will also specify `random_state` so the plot is reproducible.
454 | mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
455 |
456 | get_ipython().magic(u'time pos = mds.fit_transform(similarity_distance) # shape (n_samples, n_components)')
457 |
458 | print(pos.shape)
459 | print(pos)
460 |
461 | xs, ys = pos[:, 0], pos[:, 1]
462 | print(type(xs))
463 | xs
464 |
465 |
466 | # Visualizing document clusters
467 | # In this section, I demonstrate how you can visualize the document clustering output using matplotlib and mpld3 (a matplotlib wrapper for D3.js).
468 | # First I define some dictionaries for going from cluster number to color and to cluster name. I based the cluster names off the words that were closest to each cluster centroid.
469 |
470 | # In[48]:
471 |
472 | #set up colors per clusters using a dict
473 | cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e'}
474 |
475 | #set up cluster names using a dict
476 | cluster_names = {0: 'Family, home, war',
477 | 1: 'Police, killed, murders',
478 | 2: 'Father, New York, brothers',
479 | 3: 'Dance, singing, love',
480 | 4: 'Killed, soldiers, captain'}
481 |
482 |
483 | # Next, I plot the labeled observations (films, film titles) colored by cluster using matplotlib. I won't get into too much detail about the matplotlib plot, but I tried to provide some helpful commenting.
484 |
485 | # In[49]:
486 |
487 | #some ipython magic to show the matplotlib plots inline
488 | get_ipython().magic(u'matplotlib inline')
489 |
490 | #create data frame that has the result of the MDS plus the cluster numbers and titles
491 | df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles))
492 |
493 | print(df[1:10])
494 | # group by cluster
495 | # this generates a mapping {name: group}, where each group is a DataFrame
496 | groups = df.groupby('label')
497 | print(groups.groups)
498 |
499 |
500 | # set up plot
501 | fig, ax = plt.subplots(figsize=(17, 9)) # set size
502 | # ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
503 |
504 | #iterate through groups to layer the plot
505 | #note that I use the cluster_name and cluster_color dicts with the 'name' lookup to return the appropriate color/label
506 | # ms: marker size
507 | for name, group in groups:
508 |     print("*******")
509 |     print("group name " + str(name))
510 |     print(group)
511 |     ax.plot(group.x, group.y, marker='o', linestyle='', ms=20,
512 |             label=cluster_names[name], color=cluster_colors[name],
513 |             mec='none')
514 |     ax.set_aspect('auto')
515 |     ax.tick_params(axis='x',          # changes apply to the x-axis
516 |                    which='both',      # both major and minor ticks are affected
517 |                    bottom='off',      # ticks along the bottom edge are off
518 |                    top='off',         # ticks along the top edge are off
519 |                    labelbottom='off')
520 |     ax.tick_params(axis='y',          # changes apply to the y-axis
521 |                    which='both',      # both major and minor ticks are affected
522 |                    left='off',        # ticks along the left edge are off
523 |                    top='off',         # ticks along the top edge are off
524 |                    labelleft='off')
525 |
526 | ax.legend(numpoints=1) #show legend with only 1 point
527 |
528 | #add label in x,y position with the label as the film title
529 | for i in range(len(df)):
530 |     ax.text(df.ix[i]['x'], df.ix[i]['y'], df.ix[i]['title'], size=10)
531 |
532 |
533 |
534 | plt.show() #show the plot
535 |
536 | #uncomment the below to save the plot if need be
537 | #plt.savefig('clusters_small_noaxes.png', dpi=200)
538 |
539 |
540 | # Use plotly to generate an interactive chart. I had to downgrade matplotlib to 1.3.1 for this chart to work with plotly. See https://github.com/harrywang/plotly/blob/master/README.md for how to set up plotly. After running the following, a browser will open to show the plotly chart.
541 |
542 | # In[50]:
543 |
544 | # import plotly.plotly as py
545 | # plot_url = py.plot_mpl(fig)
546 |
547 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | alabaster==0.7.6
2 | Babel==1.3
3 | backports.ssl-match-hostname==3.4.0.2
4 | beautifulsoup4==4.3.2
5 | boto==2.38.0
6 | bz2file==0.98
7 | certifi==2015.4.28
8 | cufflinks==0.7.2
9 | docutils==0.12
10 | functools32==3.2.3.post1
11 | gensim==0.11.1.post1
12 | gnureadline==6.3.3
13 | ipython==3.2.0
14 | Jinja2==2.7.3
15 | jsonschema==2.5.1
16 | MarkupSafe==0.23
17 | matplotlib==1.3.1
18 | mistune==0.6
19 | mock==1.0.1
20 | mpld3==0.2
21 | nltk==3.0.3
22 | nose==1.3.7
23 | numpy==1.9.2
24 | numpydoc==0.5
25 | pandas==0.16.2
26 | plotly==1.10.0
27 | ptyprocess==0.5
28 | Pygments==2.0.2
29 | pyparsing==2.0.3
30 | python-dateutil==2.4.2
31 | pytz==2015.4
32 | pyzmq==14.7.0
33 | requests==2.7.0
34 | scikit-learn==0.16.1
35 | scipy==0.15.1
36 | six==1.9.0
37 | smart-open==1.2.1
38 | snowballstemmer==1.2.0
39 | Sphinx==1.3.1
40 | sphinx-rtd-theme==0.1.8
41 | terminado==0.5
42 | tornado==4.2
43 |
--------------------------------------------------------------------------------