├── .gitignore ├── README.md ├── backup ├── Film Scrape.ipynb ├── brandonrose_doc.css ├── cluster_analysis.ipynb ├── cluster_analysis_web.html ├── cluster_analysis_web.ipynb ├── cluster_script.js ├── clusters_small.png ├── clusters_small_noaxes.png ├── film_cluster.html ├── header_short.jpg ├── link_list.txt ├── link_list_imdb.txt ├── link_list_wiki.txt ├── synopses_list.txt ├── synopses_list.txt.txt └── ward_clusters.png ├── d3 ├── LICENSE ├── d3.v3.js └── d3.v3.min.js ├── data ├── genres_list.txt ├── synopses_list_imdb.txt ├── synopses_list_wiki.txt └── title_list.txt ├── doc_clustering.ipynb ├── doc_clustering.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | 3 | # C extensions 4 | *.so 5 | 6 | # Packages 7 | *.egg 8 | *.egg-info 9 | dist 10 | build 11 | eggs 12 | parts 13 | bin 14 | var 15 | sdist 16 | develop-eggs 17 | .installed.cfg 18 | lib 19 | lib64 20 | __pycache__ 21 | 22 | # Installer logs 23 | pip-log.txt 24 | 25 | # Unit test / coverage reports 26 | .coverage 27 | .tox 28 | nosetests.xml 29 | 30 | # Translations 31 | *.mo 32 | 33 | # Mr Developer 34 | .mr.developer.cfg 35 | .project 36 | .pydevproject 37 | 38 | # PyCharm 39 | .idea 40 | 41 | # SQLite databases 42 | *.sqlite 43 | 44 | # Virtual environment 45 | venv/ 46 | 47 | # Node and Bower 48 | node_modules 49 | app/static/bower_components 50 | 51 | # OS generated files # 52 | .DS_Store 53 | .DS_Store? 
54 | 55 | # heroku config 56 | .env 57 | 58 | # redis 59 | redis-stable 60 | .rdb 61 | 62 | #uploaded files 63 | app/static/upload/temp 64 | 65 | # tmp dictionary 66 | tmp/ 67 | 68 | #iPython notebook 69 | .npy 70 | .pkl 71 | .ipynb_checkpoints 72 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Document Clustering with Python 2 | 3 | This is my revision of the great tutorial at http://brandonrose.org/clustering - many thanks to the author. 4 | 5 | ## TL;DR 6 | **Data**: Top 100 movies (http://www.imdb.com/list/ls055592025/) with title, genre, and synopsis (IMDB and Wiki) 7 | 8 | **Goal**: Put 100 movies into 5 clusters by text-mining their synopses and plot the result as follows 9 | 10 | screenshot 2016-05-23 20 50 20 11 | 12 | ## Setup 13 | 14 | First, clone the repo, go to the repo folder, set up the virtual environment, and install the required packages: 15 | 16 | ``` 17 | git clone https://github.com/harrywang/document_clustering.git 18 | cd document_clustering 19 | virtualenv venv 20 | source venv/bin/activate 21 | pip install -r requirements.txt 22 | ``` 23 | Second, use nltk.download() to download all NLTK packages, which are saved to /Users/harrywang/nltk_data 24 | 25 | ``` 26 | ipython2 27 | import nltk 28 | nltk.download() 29 | ``` 30 | 31 | Lastly, view doc_clustering.ipynb directly on GitHub at https://github.com/harrywang/document_cluster/blob/master/doc_clustering.ipynb or locally by running `ipython2 notebook` to follow the tutorial step by step. 32 | 33 | ## Key Steps 34 | 1. **Read data**: read titles, genres, synopses, rankings into four arrays 35 | 2. **Tokenize and stem**: break paragraphs into sentences, then into words, and stem the words (without removing stopwords) - each synopsis essentially becomes a bag of stemmed words. 36 | 3. 
**Generate tf-idf matrix**: each row is a synopsis and each column is a term (unigram, bigram, trigram, ... generated from the bag of words in step 2). 37 | 4. **Generate clusters**: based on the tf-idf matrix, 5 (or any number) clusters are generated using k-means. The top key terms are selected for each cluster. 38 | 5. **Calculate similarity**: generate the 100x100 cosine similarity matrix from the tf-idf matrix, then generate the distance matrix (1 - similarity matrix), so each pair of synopses has a distance between 0 and 1. 39 | 6. **Plot clusters**: use multidimensional scaling (MDS) to convert the distance matrix to a 2-dimensional array, so each synopsis gets an (x, y) coordinate that represents its relative location based on the distance matrix. Plot the 100 points at their (x, y) coordinates using matplotlib (I added an example using plotly.js). 40 | -------------------------------------------------------------------------------- /backup/brandonrose_doc.css: -------------------------------------------------------------------------------- 1 | html { 2 | min-width: 1040px; 3 | 4 | } 5 | 6 | body { 7 | font-family: "Helvetica Neue", Helvetica, sans-serif; 8 | margin: 1em auto 4em auto; 9 | 10 | tab-size: 2; 11 | width: 960px; 12 | } 13 | 14 | .main { 15 | width: 750px; 16 | 17 | 18 | } 19 | 20 | .thumb_holder { 21 | 22 | } 23 | 24 | /******************************************************************* 25 | TYPOGRAPHY 26 | *******************************************************************/ 27 | 28 | h1 { 29 | font-size: 64px; 30 | font-weight: 300; 31 | letter-spacing: -2px; 32 | margin: .3em 0 .1em 0; 33 | } 34 | 35 | h2 { 36 | margin-top: 2em; 37 | font-weight: 300; 38 | 39 | 40 | } 41 | 42 | h3 { 43 | margin: .5em 0 .1em 0; 44 | 45 | } 46 | 47 | 48 | p { 49 | line-height: 1.5em; 50 | width: 750px; 51 | } 52 | 53 | a { 54 | color: steelblue; 55 | } 56 | 57 | a:not(:hover) { 58 | text-decoration: none; 59 | } 60 | 61 | 62 | 63 | 
/******************************************************************* 64 | TABLES 65 | *******************************************************************/ 66 | .twocols { 67 | width: 300px; 68 | display: inline-block; 69 | } 70 | 71 | table { 72 | width: 300px; 73 | border-collapse:collapse; 74 | display: inline-block; 75 | margin-right: 25px; 76 | } 77 | 78 | th { 79 | padding: 5px; 80 | font-weight: bold; 81 | font-size: 14px; 82 | background-color: #f8f8f8; 83 | } 84 | 85 | th,td { 86 | border: 1px solid #ddd; 87 | 88 | } 89 | 90 | td { 91 | padding: 5px; 92 | font-size: 14px; 93 | letter-spacing: -1px; 94 | 95 | } 96 | 97 | /******************************************************************* 98 | IMAGES 99 | *******************************************************************/ 100 | 101 | .thumbnail { 102 | height: 264px; 103 | width: 360px; 104 | display: inline-block; 105 | } 106 | 107 | .index_thumb { 108 | height: 173px; 109 | width: 333px; 110 | display: inline-block; 111 | border: 2px solid #ddd; 112 | margin: .3em 0 .5em .5em; 113 | 114 | } 115 | 116 | 117 | img.index_thumb:hover { 118 | border: 2px solid steelblue; 119 | 120 | } -------------------------------------------------------------------------------- /backup/cluster_script.js: -------------------------------------------------------------------------------- 1 | function mpld3_load_lib(url, callback){ 2 | var s = document.createElement('script'); 3 | s.src = url; 4 | s.async = true; 5 | s.onreadystatechange = s.onload = callback; 6 | s.onerror = function(){console.warn("failed to load library " + url);}; 7 | document.getElementsByTagName("head")[0].appendChild(s); 8 | } 9 | 10 | if(typeof(mpld3) !== "undefined" && mpld3._mpld3IsLoaded){ 11 | // already loaded: just create the figure 12 | !function(mpld3){ 13 | 14 | mpld3.register_plugin("htmltooltip", HtmlTooltipPlugin); 15 | HtmlTooltipPlugin.prototype = Object.create(mpld3.Plugin.prototype); 16 | HtmlTooltipPlugin.prototype.constructor = 
HtmlTooltipPlugin; 17 | HtmlTooltipPlugin.prototype.requiredProps = ["id"]; 18 | HtmlTooltipPlugin.prototype.defaultProps = {labels:null, hoffset:0, voffset:10}; 19 | function HtmlTooltipPlugin(fig, props){ 20 | mpld3.Plugin.call(this, fig, props); 21 | }; 22 | 23 | HtmlTooltipPlugin.prototype.draw = function(){ 24 | var obj = mpld3.get_element(this.props.id); 25 | var labels = this.props.labels; 26 | var tooltip = d3.select("body").append("div") 27 | .attr("class", "mpld3-tooltip") 28 | .style("position", "absolute") 29 | .style("z-index", "10") 30 | .style("visibility", "hidden"); 31 | 32 | obj.elements() 33 | .on("mouseover", function(d, i){ 34 | tooltip.html(labels[i]) 35 | .style("visibility", "visible");}) 36 | .on("mousemove", function(d, i){ 37 | tooltip 38 | .style("top", d3.event.pageY + this.props.voffset + "px") 39 | .style("left",d3.event.pageX + this.props.hoffset + "px"); 40 | }.bind(this)) 41 | .on("mouseout", function(d, i){ 42 | tooltip.style("visibility", "hidden");}); 43 | }; 44 | 45 | mpld3.register_plugin("toptoolbar", TopToolbar); 46 | TopToolbar.prototype = Object.create(mpld3.Plugin.prototype); 47 | TopToolbar.prototype.constructor = TopToolbar; 48 | function TopToolbar(fig, props){ 49 | mpld3.Plugin.call(this, fig, props); 50 | }; 51 | 52 | TopToolbar.prototype.draw = function(){ 53 | // the toolbar svg doesn't exist 54 | // yet, so first draw it 55 | this.fig.toolbar.draw(); 56 | 57 | // then change the y position to be 58 | // at the top of the figure 59 | this.fig.toolbar.toolbar.attr("x", 150); 60 | this.fig.toolbar.toolbar.attr("y", 400); 61 | 62 | // then remove the draw function, 63 | // so that it is not called again 64 | this.fig.toolbar.draw = function() {} 65 | } 66 | 67 | mpld3.register_plugin("htmltooltip", HtmlTooltipPlugin); 68 | HtmlTooltipPlugin.prototype = Object.create(mpld3.Plugin.prototype); 69 | HtmlTooltipPlugin.prototype.constructor = HtmlTooltipPlugin; 70 | HtmlTooltipPlugin.prototype.requiredProps = ["id"]; 71 | 
HtmlTooltipPlugin.prototype.defaultProps = {labels:null, hoffset:0, voffset:10}; 72 | function HtmlTooltipPlugin(fig, props){ 73 | mpld3.Plugin.call(this, fig, props); 74 | }; 75 | 76 | HtmlTooltipPlugin.prototype.draw = function(){ 77 | var obj = mpld3.get_element(this.props.id); 78 | var labels = this.props.labels; 79 | var tooltip = d3.select("body").append("div") 80 | .attr("class", "mpld3-tooltip") 81 | .style("position", "absolute") 82 | .style("z-index", "10") 83 | .style("visibility", "hidden"); 84 | 85 | obj.elements() 86 | .on("mouseover", function(d, i){ 87 | tooltip.html(labels[i]) 88 | .style("visibility", "visible");}) 89 | .on("mousemove", function(d, i){ 90 | tooltip 91 | .style("top", d3.event.pageY + this.props.voffset + "px") 92 | .style("left",d3.event.pageX + this.props.hoffset + "px"); 93 | }.bind(this)) 94 | .on("mouseout", function(d, i){ 95 | tooltip.style("visibility", "hidden");}); 96 | }; 97 | 98 | mpld3.register_plugin("toptoolbar", TopToolbar); 99 | TopToolbar.prototype = Object.create(mpld3.Plugin.prototype); 100 | TopToolbar.prototype.constructor = TopToolbar; 101 | function TopToolbar(fig, props){ 102 | mpld3.Plugin.call(this, fig, props); 103 | }; 104 | 105 | TopToolbar.prototype.draw = function(){ 106 | // the toolbar svg doesn't exist 107 | // yet, so first draw it 108 | this.fig.toolbar.draw(); 109 | 110 | // then change the y position to be 111 | // at the top of the figure 112 | this.fig.toolbar.toolbar.attr("x", 150); 113 | this.fig.toolbar.toolbar.attr("y", 400); 114 | 115 | // then remove the draw function, 116 | // so that it is not called again 117 | this.fig.toolbar.draw = function() {} 118 | } 119 | 120 | mpld3.register_plugin("htmltooltip", HtmlTooltipPlugin); 121 | HtmlTooltipPlugin.prototype = Object.create(mpld3.Plugin.prototype); 122 | HtmlTooltipPlugin.prototype.constructor = HtmlTooltipPlugin; 123 | HtmlTooltipPlugin.prototype.requiredProps = ["id"]; 124 | HtmlTooltipPlugin.prototype.defaultProps = {labels:null, 
hoffset:0, voffset:10}; 125 | function HtmlTooltipPlugin(fig, props){ 126 | mpld3.Plugin.call(this, fig, props); 127 | }; 128 | 129 | HtmlTooltipPlugin.prototype.draw = function(){ 130 | var obj = mpld3.get_element(this.props.id); 131 | var labels = this.props.labels; 132 | var tooltip = d3.select("body").append("div") 133 | .attr("class", "mpld3-tooltip") 134 | .style("position", "absolute") 135 | .style("z-index", "10") 136 | .style("visibility", "hidden"); 137 | 138 | obj.elements() 139 | .on("mouseover", function(d, i){ 140 | tooltip.html(labels[i]) 141 | .style("visibility", "visible");}) 142 | .on("mousemove", function(d, i){ 143 | tooltip 144 | .style("top", d3.event.pageY + this.props.voffset + "px") 145 | .style("left",d3.event.pageX + this.props.hoffset + "px"); 146 | }.bind(this)) 147 | .on("mouseout", function(d, i){ 148 | tooltip.style("visibility", "hidden");}); 149 | }; 150 | 151 | mpld3.register_plugin("toptoolbar", TopToolbar); 152 | TopToolbar.prototype = Object.create(mpld3.Plugin.prototype); 153 | TopToolbar.prototype.constructor = TopToolbar; 154 | function TopToolbar(fig, props){ 155 | mpld3.Plugin.call(this, fig, props); 156 | }; 157 | 158 | TopToolbar.prototype.draw = function(){ 159 | // the toolbar svg doesn't exist 160 | // yet, so first draw it 161 | this.fig.toolbar.draw(); 162 | 163 | // then change the y position to be 164 | // at the top of the figure 165 | this.fig.toolbar.toolbar.attr("x", 150); 166 | this.fig.toolbar.toolbar.attr("y", 400); 167 | 168 | // then remove the draw function, 169 | // so that it is not called again 170 | this.fig.toolbar.draw = function() {} 171 | } 172 | 173 | mpld3.register_plugin("htmltooltip", HtmlTooltipPlugin); 174 | HtmlTooltipPlugin.prototype = Object.create(mpld3.Plugin.prototype); 175 | HtmlTooltipPlugin.prototype.constructor = HtmlTooltipPlugin; 176 | HtmlTooltipPlugin.prototype.requiredProps = ["id"]; 177 | HtmlTooltipPlugin.prototype.defaultProps = {labels:null, hoffset:0, voffset:10}; 178 | 
function HtmlTooltipPlugin(fig, props){ 179 | mpld3.Plugin.call(this, fig, props); 180 | }; 181 | 182 | HtmlTooltipPlugin.prototype.draw = function(){ 183 | var obj = mpld3.get_element(this.props.id); 184 | var labels = this.props.labels; 185 | var tooltip = d3.select("body").append("div") 186 | .attr("class", "mpld3-tooltip") 187 | .style("position", "absolute") 188 | .style("z-index", "10") 189 | .style("visibility", "hidden"); 190 | 191 | obj.elements() 192 | .on("mouseover", function(d, i){ 193 | tooltip.html(labels[i]) 194 | .style("visibility", "visible");}) 195 | .on("mousemove", function(d, i){ 196 | tooltip 197 | .style("top", d3.event.pageY + this.props.voffset + "px") 198 | .style("left",d3.event.pageX + this.props.hoffset + "px"); 199 | }.bind(this)) 200 | .on("mouseout", function(d, i){ 201 | tooltip.style("visibility", "hidden");}); 202 | }; 203 | 204 | mpld3.register_plugin("toptoolbar", TopToolbar); 205 | TopToolbar.prototype = Object.create(mpld3.Plugin.prototype); 206 | TopToolbar.prototype.constructor = TopToolbar; 207 | function TopToolbar(fig, props){ 208 | mpld3.Plugin.call(this, fig, props); 209 | }; 210 | 211 | TopToolbar.prototype.draw = function(){ 212 | // the toolbar svg doesn't exist 213 | // yet, so first draw it 214 | this.fig.toolbar.draw(); 215 | 216 | // then change the y position to be 217 | // at the top of the figure 218 | this.fig.toolbar.toolbar.attr("x", 150); 219 | this.fig.toolbar.toolbar.attr("y", 400); 220 | 221 | // then remove the draw function, 222 | // so that it is not called again 223 | this.fig.toolbar.draw = function() {} 224 | } 225 | 226 | mpld3.register_plugin("htmltooltip", HtmlTooltipPlugin); 227 | HtmlTooltipPlugin.prototype = Object.create(mpld3.Plugin.prototype); 228 | HtmlTooltipPlugin.prototype.constructor = HtmlTooltipPlugin; 229 | HtmlTooltipPlugin.prototype.requiredProps = ["id"]; 230 | HtmlTooltipPlugin.prototype.defaultProps = {labels:null, hoffset:0, voffset:10}; 231 | function 
HtmlTooltipPlugin(fig, props){ 232 | mpld3.Plugin.call(this, fig, props); 233 | }; 234 | 235 | HtmlTooltipPlugin.prototype.draw = function(){ 236 | var obj = mpld3.get_element(this.props.id); 237 | var labels = this.props.labels; 238 | var tooltip = d3.select("body").append("div") 239 | .attr("class", "mpld3-tooltip") 240 | .style("position", "absolute") 241 | .style("z-index", "10") 242 | .style("visibility", "hidden"); 243 | 244 | obj.elements() 245 | .on("mouseover", function(d, i){ 246 | tooltip.html(labels[i]) 247 | .style("visibility", "visible");}) 248 | .on("mousemove", function(d, i){ 249 | tooltip 250 | .style("top", d3.event.pageY + this.props.voffset + "px") 251 | .style("left",d3.event.pageX + this.props.hoffset + "px"); 252 | }.bind(this)) 253 | .on("mouseout", function(d, i){ 254 | tooltip.style("visibility", "hidden");}); 255 | }; 256 | 257 | mpld3.register_plugin("toptoolbar", TopToolbar); 258 | TopToolbar.prototype = Object.create(mpld3.Plugin.prototype); 259 | TopToolbar.prototype.constructor = TopToolbar; 260 | function TopToolbar(fig, props){ 261 | mpld3.Plugin.call(this, fig, props); 262 | }; 263 | 264 | TopToolbar.prototype.draw = function(){ 265 | // the toolbar svg doesn't exist 266 | // yet, so first draw it 267 | this.fig.toolbar.draw(); 268 | 269 | // then change the y position to be 270 | // at the top of the figure 271 | this.fig.toolbar.toolbar.attr("x", 150); 272 | this.fig.toolbar.toolbar.attr("y", 400); 273 | 274 | // then remove the draw function, 275 | // so that it is not called again 276 | this.fig.toolbar.draw = function() {} 277 | } 278 | 279 | mpld3.draw_figure("fig_el290745801570086793125459", {"axes": [{"xlim": [-0.70189582032870046, 0.71051473536226761], "yscale": "linear", "axesbg": "#FFFFFF", "texts": [], "zoomable": true, "images": [], "xdomain": [-0.70189582032870046, 0.71051473536226761], "ylim": [-0.70830339802964049, 0.69370701811001245], "paths": [], "sharey": [], "sharex": [], "axesbgalpha": null, "axes": 
[{"scale": "linear", "tickformat": "", "grid": {"gridOn": false}, "fontsize": null, "position": "bottom", "nticks": 0, "tickvalues": []}, {"scale": "linear", "tickformat": "", "grid": {"gridOn": false}, "fontsize": null, "position": "left", "nticks": 0, "tickvalues": []}], "lines": [], "markers": [{"edgecolor": "none", "facecolor": "#1B9E77", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data01", "id": "el29074665479312pts"}, {"edgecolor": "none", "facecolor": "#D95F02", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], 
[-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data02", "id": "el29074709227536pts"}, {"edgecolor": "none", "facecolor": "#7570B3", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data03", "id": "el29074665483024pts"}, {"edgecolor": "none", "facecolor": "#E7298A", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], 
[-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data04", "id": "el29074540723600pts"}, {"edgecolor": "none", "facecolor": "#66A61E", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data05", "id": "el29074709858768pts"}], "id": "el29074527433296", "ydomain": [-0.70830339802964049, 0.69370701811001245], "collections": [], "xscale": "linear", "bbox": [0.125, 0.125, 0.77500000000000002, 0.77500000000000002]}], "height": 480.0, "width": 1120.0, "plugins": [{"type": "reset"}, {"enabled": false, "button": true, "type": "zoom"}, {"enabled": false, "button": true, "type": "boxzoom"}, {"voffset": 10, "labels": ["Schindler's List", "One Flew Over the Cuckoo's Nest", "Gone with the Wind", "The Wizard of Oz", "Titanic", "Forrest Gump", "E.T. 
the Extra-Terrestrial", "The Silence of the Lambs", "Gandhi", "A Streetcar Named Desire", "The Best Years of Our Lives", "My Fair Lady", "Ben-Hur", "Doctor Zhivago", "The Pianist", "The Exorcist", "Out of Africa", "Good Will Hunting", "Terms of Endearment", "Giant", "The Grapes of Wrath", "Close Encounters of the Third Kind", "The Graduate", "Stagecoach", "Wuthering Heights"], "type": "htmltooltip", "id": "el29074665479312pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["Casablanca", "Psycho", "Sunset Blvd.", "Vertigo", "Chinatown", "Amadeus", "High Noon", "The French Connection", "Fargo", "Pulp Fiction", "The Maltese Falcon", "A Clockwork Orange", "Double Indemnity", "Rebel Without a Cause", "The Third Man", "North by Northwest"], "type": "htmltooltip", "id": "el29074709227536pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["The Godfather", "Raging Bull", "Citizen Kane", "The Godfather: Part II", "On the Waterfront", "12 Angry Men", "Rocky", "To Kill a Mockingbird", "Braveheart", "The Good, the Bad and the Ugly", "The Apartment", "Goodfellas", "City Lights", "It Happened One Night", "Midnight Cowboy", "Mr. Smith Goes to Washington", "Rain Man", "Annie Hall", "Network", "Taxi Driver", "Rear Window"], "type": "htmltooltip", "id": "el29074665483024pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["West Side Story", "Singin' in the Rain", "It's a Wonderful Life", "Some Like It Hot", "The Philadelphia Story", "An American in Paris", "The King's Speech", "A Place in the Sun", "Tootsie", "Nashville", "American Graffiti", "Yankee Doodle Dandy"], "type": "htmltooltip", "id": "el29074540723600pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["The Shawshank Redemption", "Lawrence of Arabia", "The Sound of Music", "Star Wars", "2001: A Space Odyssey", "The Bridge on the River Kwai", "Dr. 
Strangelove or: How I Learned to Stop Worrying and Love the Bomb", "Apocalypse Now", "The Lord of the Rings: The Return of the King", "Gladiator", "From Here to Eternity", "Saving Private Ryan", "Unforgiven", "Raiders of the Lost Ark", "Patton", "Jaws", "Butch Cassidy and the Sundance Kid", "The Treasure of the Sierra Madre", "Platoon", "Dances with Wolves", "The Deer Hunter", "All Quiet on the Western Front", "Shane", "The Green Mile", "The African Queen", "Mutiny on the Bounty"], "type": "htmltooltip", "id": "el29074709858768pts", "hoffset": 10}, {"type": "toptoolbar"}], "data": {"data04": [[-0.48383166475045314, 0.17406382605759155], [-0.3431621006015567, 0.6339455084054434], [0.42023926967786635, 0.585073332726518], [-0.17325065040967416, 0.23163649164649666], [0.6705408517106365, 0.1152867311811776], [-0.20507682543002934, 0.5602431727781544], [0.4029759387982718, 0.4993894350885348], [0.6671527866782192, 0.2655632808140779], [-0.02368751477042728, 0.34058596130883056], [0.06057638900463016, 0.6101307798961454], [-0.6012117652228453, -0.04229065076764765], [0.6299986440743635, 0.35408489380322294]], "data05": [[0.4425217895905002, -0.2796139796878817], [-0.24776866601229716, -0.5782685424165561], [0.5555461511292237, 0.486954891259419], [0.19708753275222002, -0.38902709232046384], [0.3497144798998901, -0.338553421833173], [-0.07419701411754268, -0.5782948227044011], [-0.014592308929045296, -0.6368511048162793], [0.025373394060233537, -0.634289200597481], [0.24399849236205579, -0.5034936076710894], [0.07277111225479607, -0.47961483445651304], [-0.1519065165847722, -0.338080020776755], [-0.015830214134386178, -0.36229007847066996], [-0.13402784922090072, -0.3012324095400005], [-0.23474733283083896, -0.4502062748678493], [0.12924047876683878, -0.6686238579502164], [0.08194147679809109, -0.3929722898416771], [-0.34747538257781435, -0.3233751877512795], [-0.3631413985443942, -0.4813272206475685], [-0.09731550319802962, -0.505864130911545], [0.18513473762402247, 
-0.5781596431297301], [-0.09597244991104013, -0.19942409220233107], [-0.18228725458498232, -0.5566467223632475], [-0.551994202072232, 0.33486446230226674], [-0.5679692454857176, -0.35356444529282777], [-0.45369369757595834, -0.44434234569143366], [0.3366116741827183, -0.6447240310583208]], "data02": [[-0.2476475600239232, -0.26290199748846194], [-0.22919155130164443, 0.06968880208772948], [-0.42967373628736466, 0.28652830354480197], [-0.3531795894914802, 0.12742474874876483], [-0.5517590615479852, -0.22488031423191712], [0.4134372284409409, 0.40907893375027393], [-0.6619219366770693, -0.20578641688667848], [-0.4603909162269872, 0.008430725029157468], [-0.3422448822106277, 0.03699561151307461], [-0.3438274431555653, -0.06956150821349773], [-0.3384187737692901, -0.1515124959398215], [-0.18283874549876644, 0.14829844310283077], [-0.46586683087784553, -0.09919136771199337], [-0.577910557660688, 0.28870430670538155], [-0.20956788370183355, -0.0574148454899521], [-0.20029308653202116, 0.00718931223386455]], "data03": [[0.18558490034956887, -0.01077771968604301], [-0.13262115988334475, 0.5751789037009231], [0.016222043434641144, 0.4652342083561871], [0.07240937947290513, 0.13247203862675336], [-0.05239281560115413, 0.12267112673062845], [0.5241688637949986, 0.14425768177773893], [-0.2825557188697044, 0.2745647039405087], [0.21864153351095966, 0.22101476971173736], [0.49293979852148206, -0.1406180288639719], [-0.07221741889036004, 0.5688366094319275], [-0.5967680343768968, 0.1894475004960752], [0.2030557040289471, 0.06941302351286169], [0.5649139525233913, 0.34616110908884656], [-0.42188287709302075, 0.4618936364485825], [-0.30512268820937116, 0.4768633838569787], [0.6449291941438089, -0.15516297397920956], [-0.18081125741175175, 0.4765240199441122], [-0.027643220902818993, 0.6540274780305884], [0.6687778032882542, -0.08795572251657516], [-0.5534118533588518, 0.02875489034480997], [-0.5807159767006421, 0.10579656075156939]], "data01": [[-0.0777242299467317, 
-0.4515361500051302], [0.05143904563855998, 0.25947569793979247], [0.05926003443907459, -0.04759931281928853], [-0.4336747338877505, -0.2971244895104658], [0.6035469203843735, -0.3456193989894745], [0.35577780891092403, 0.007409352966807421], [-0.019806795422961994, -0.044100107912120406], [0.5183164260703591, -0.25353575550390045], [0.43257291891111244, 0.1381593571372851], [0.3437182373924481, 0.24527960034963084], [0.31733764740406106, 0.08349042218139716], [-0.19661734109288, 0.39627304179640405], [0.3443594003836902, -0.13069722017172503], [0.1620196922509593, -0.11876438020876365], [0.1563956093287554, -0.21269443337356803], [0.17499103541231237, 0.2985373547088241], [0.12782499859472804, 0.4815648223887244], [0.12523150357922094, 0.41721548160929905], [0.2219315942794893, 0.47877846167983906], [0.48413541386850095, 0.008757361949769692], [0.4755390685857624, -0.4356393509319867], [0.27427215751846695, -0.2940897904508699], [-0.32213342140019896, 0.33175887910243107], [0.3241754832842173, -0.45547190176887226], [0.18462105786497715, 0.5797922578764358]]}, "id": "el29074580157008"}); 280 | }(mpld3); 281 | }else if(typeof define === "function" && define.amd){ 282 | // require.js is available: use it to load d3/mpld3 283 | require.config({paths: {d3: "https://mpld3.github.io/js/d3.v3.min"}}); 284 | require(["d3"], function(d3){ 285 | window.d3 = d3; 286 | mpld3_load_lib("https://mpld3.github.io/js/mpld3.v0.2.js", function(){ 287 | 288 | mpld3.register_plugin("htmltooltip", HtmlTooltipPlugin); 289 | HtmlTooltipPlugin.prototype = Object.create(mpld3.Plugin.prototype); 290 | HtmlTooltipPlugin.prototype.constructor = HtmlTooltipPlugin; 291 | HtmlTooltipPlugin.prototype.requiredProps = ["id"]; 292 | HtmlTooltipPlugin.prototype.defaultProps = {labels:null, hoffset:0, voffset:10}; 293 | function HtmlTooltipPlugin(fig, props){ 294 | mpld3.Plugin.call(this, fig, props); 295 | }; 296 | 297 | HtmlTooltipPlugin.prototype.draw = function(){ 298 | var obj = 
mpld3.get_element(this.props.id); 511 | var labels = this.props.labels; 512 | var tooltip = d3.select("body").append("div") 513 | .attr("class", "mpld3-tooltip") 514 | .style("position", "absolute") 515 | .style("z-index", "10") 516 | .style("visibility", "hidden"); 517 | 518 | obj.elements() 519 | .on("mouseover", function(d, i){ 520 | tooltip.html(labels[i]) 521 | .style("visibility", "visible");}) 522 | .on("mousemove", function(d, i){ 523 | tooltip 524 | .style("top", d3.event.pageY + this.props.voffset + "px") 525 | .style("left",d3.event.pageX + this.props.hoffset + "px"); 526 | }.bind(this)) 527 | .on("mouseout", function(d, i){ 528 | tooltip.style("visibility", "hidden");}); 529 | }; 530 | 531 | mpld3.register_plugin("toptoolbar", TopToolbar); 532 | TopToolbar.prototype = Object.create(mpld3.Plugin.prototype); 533 | TopToolbar.prototype.constructor = TopToolbar; 534 | function TopToolbar(fig, props){ 535 | mpld3.Plugin.call(this, fig, props); 536 | }; 537 | 538 | TopToolbar.prototype.draw = function(){ 539 | // the toolbar svg doesn't exist 540 | // yet, so first draw it 541 | this.fig.toolbar.draw(); 542 | 543 | // then change the y position to be 544 | // at the top of the figure 545 | this.fig.toolbar.toolbar.attr("x", 150); 546 | this.fig.toolbar.toolbar.attr("y", 400); 547 | 548 | // then remove the draw function, 549 | // so that it is not called again 550 | this.fig.toolbar.draw = function() {} 551 | } 552 | 553 | mpld3.draw_figure("fig_el290745801570086793125459", {"axes": [{"xlim": [-0.70189582032870046, 0.71051473536226761], "yscale": "linear", "axesbg": "#FFFFFF", "texts": [], "zoomable": true, "images": [], "xdomain": [-0.70189582032870046, 0.71051473536226761], "ylim": [-0.70830339802964049, 0.69370701811001245], "paths": [], "sharey": [], "sharex": [], "axesbgalpha": null, "axes": [{"scale": "linear", "tickformat": "", "grid": {"gridOn": false}, "fontsize": null, "position": "bottom", "nticks": 0, "tickvalues": []}, {"scale": "linear", 
"tickformat": "", "grid": {"gridOn": false}, "fontsize": null, "position": "left", "nticks": 0, "tickvalues": []}], "lines": [], "markers": [{"edgecolor": "none", "facecolor": "#1B9E77", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data01", "id": "el29074665479312pts"}, {"edgecolor": "none", "facecolor": "#D95F02", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", 
"C", "Z"]], "alpha": 1, "xindex": 0, "data": "data02", "id": "el29074709227536pts"}, {"edgecolor": "none", "facecolor": "#7570B3", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data03", "id": "el29074665483024pts"}, {"edgecolor": "none", "facecolor": "#E7298A", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data04", 
"id": "el29074540723600pts"}, {"edgecolor": "none", "facecolor": "#66A61E", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data05", "id": "el29074709858768pts"}], "id": "el29074527433296", "ydomain": [-0.70830339802964049, 0.69370701811001245], "collections": [], "xscale": "linear", "bbox": [0.125, 0.125, 0.77500000000000002, 0.77500000000000002]}], "height": 480.0, "width": 1120.0, "plugins": [{"type": "reset"}, {"enabled": false, "button": true, "type": "zoom"}, {"enabled": false, "button": true, "type": "boxzoom"}, {"voffset": 10, "labels": ["Schindler's List", "One Flew Over the Cuckoo's Nest", "Gone with the Wind", "The Wizard of Oz", "Titanic", "Forrest Gump", "E.T. 
the Extra-Terrestrial", "The Silence of the Lambs", "Gandhi", "A Streetcar Named Desire", "The Best Years of Our Lives", "My Fair Lady", "Ben-Hur", "Doctor Zhivago", "The Pianist", "The Exorcist", "Out of Africa", "Good Will Hunting", "Terms of Endearment", "Giant", "The Grapes of Wrath", "Close Encounters of the Third Kind", "The Graduate", "Stagecoach", "Wuthering Heights"], "type": "htmltooltip", "id": "el29074665479312pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["Casablanca", "Psycho", "Sunset Blvd.", "Vertigo", "Chinatown", "Amadeus", "High Noon", "The French Connection", "Fargo", "Pulp Fiction", "The Maltese Falcon", "A Clockwork Orange", "Double Indemnity", "Rebel Without a Cause", "The Third Man", "North by Northwest"], "type": "htmltooltip", "id": "el29074709227536pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["The Godfather", "Raging Bull", "Citizen Kane", "The Godfather: Part II", "On the Waterfront", "12 Angry Men", "Rocky", "To Kill a Mockingbird", "Braveheart", "The Good, the Bad and the Ugly", "The Apartment", "Goodfellas", "City Lights", "It Happened One Night", "Midnight Cowboy", "Mr. Smith Goes to Washington", "Rain Man", "Annie Hall", "Network", "Taxi Driver", "Rear Window"], "type": "htmltooltip", "id": "el29074665483024pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["West Side Story", "Singin' in the Rain", "It's a Wonderful Life", "Some Like It Hot", "The Philadelphia Story", "An American in Paris", "The King's Speech", "A Place in the Sun", "Tootsie", "Nashville", "American Graffiti", "Yankee Doodle Dandy"], "type": "htmltooltip", "id": "el29074540723600pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["The Shawshank Redemption", "Lawrence of Arabia", "The Sound of Music", "Star Wars", "2001: A Space Odyssey", "The Bridge on the River Kwai", "Dr. 
Strangelove or: How I Learned to Stop Worrying and Love the Bomb", "Apocalypse Now", "The Lord of the Rings: The Return of the King", "Gladiator", "From Here to Eternity", "Saving Private Ryan", "Unforgiven", "Raiders of the Lost Ark", "Patton", "Jaws", "Butch Cassidy and the Sundance Kid", "The Treasure of the Sierra Madre", "Platoon", "Dances with Wolves", "The Deer Hunter", "All Quiet on the Western Front", "Shane", "The Green Mile", "The African Queen", "Mutiny on the Bounty"], "type": "htmltooltip", "id": "el29074709858768pts", "hoffset": 10}, {"type": "toptoolbar"}], "data": {"data04": [[-0.48383166475045314, 0.17406382605759155], [-0.3431621006015567, 0.6339455084054434], [0.42023926967786635, 0.585073332726518], [-0.17325065040967416, 0.23163649164649666], [0.6705408517106365, 0.1152867311811776], [-0.20507682543002934, 0.5602431727781544], [0.4029759387982718, 0.4993894350885348], [0.6671527866782192, 0.2655632808140779], [-0.02368751477042728, 0.34058596130883056], [0.06057638900463016, 0.6101307798961454], [-0.6012117652228453, -0.04229065076764765], [0.6299986440743635, 0.35408489380322294]], "data05": [[0.4425217895905002, -0.2796139796878817], [-0.24776866601229716, -0.5782685424165561], [0.5555461511292237, 0.486954891259419], [0.19708753275222002, -0.38902709232046384], [0.3497144798998901, -0.338553421833173], [-0.07419701411754268, -0.5782948227044011], [-0.014592308929045296, -0.6368511048162793], [0.025373394060233537, -0.634289200597481], [0.24399849236205579, -0.5034936076710894], [0.07277111225479607, -0.47961483445651304], [-0.1519065165847722, -0.338080020776755], [-0.015830214134386178, -0.36229007847066996], [-0.13402784922090072, -0.3012324095400005], [-0.23474733283083896, -0.4502062748678493], [0.12924047876683878, -0.6686238579502164], [0.08194147679809109, -0.3929722898416771], [-0.34747538257781435, -0.3233751877512795], [-0.3631413985443942, -0.4813272206475685], [-0.09731550319802962, -0.505864130911545], [0.18513473762402247, 
-0.5781596431297301], [-0.09597244991104013, -0.19942409220233107], [-0.18228725458498232, -0.5566467223632475], [-0.551994202072232, 0.33486446230226674], [-0.5679692454857176, -0.35356444529282777], [-0.45369369757595834, -0.44434234569143366], [0.3366116741827183, -0.6447240310583208]], "data02": [[-0.2476475600239232, -0.26290199748846194], [-0.22919155130164443, 0.06968880208772948], [-0.42967373628736466, 0.28652830354480197], [-0.3531795894914802, 0.12742474874876483], [-0.5517590615479852, -0.22488031423191712], [0.4134372284409409, 0.40907893375027393], [-0.6619219366770693, -0.20578641688667848], [-0.4603909162269872, 0.008430725029157468], [-0.3422448822106277, 0.03699561151307461], [-0.3438274431555653, -0.06956150821349773], [-0.3384187737692901, -0.1515124959398215], [-0.18283874549876644, 0.14829844310283077], [-0.46586683087784553, -0.09919136771199337], [-0.577910557660688, 0.28870430670538155], [-0.20956788370183355, -0.0574148454899521], [-0.20029308653202116, 0.00718931223386455]], "data03": [[0.18558490034956887, -0.01077771968604301], [-0.13262115988334475, 0.5751789037009231], [0.016222043434641144, 0.4652342083561871], [0.07240937947290513, 0.13247203862675336], [-0.05239281560115413, 0.12267112673062845], [0.5241688637949986, 0.14425768177773893], [-0.2825557188697044, 0.2745647039405087], [0.21864153351095966, 0.22101476971173736], [0.49293979852148206, -0.1406180288639719], [-0.07221741889036004, 0.5688366094319275], [-0.5967680343768968, 0.1894475004960752], [0.2030557040289471, 0.06941302351286169], [0.5649139525233913, 0.34616110908884656], [-0.42188287709302075, 0.4618936364485825], [-0.30512268820937116, 0.4768633838569787], [0.6449291941438089, -0.15516297397920956], [-0.18081125741175175, 0.4765240199441122], [-0.027643220902818993, 0.6540274780305884], [0.6687778032882542, -0.08795572251657516], [-0.5534118533588518, 0.02875489034480997], [-0.5807159767006421, 0.10579656075156939]], "data01": [[-0.0777242299467317, 
-0.4515361500051302], [0.05143904563855998, 0.25947569793979247], [0.05926003443907459, -0.04759931281928853], [-0.4336747338877505, -0.2971244895104658], [0.6035469203843735, -0.3456193989894745], [0.35577780891092403, 0.007409352966807421], [-0.019806795422961994, -0.044100107912120406], [0.5183164260703591, -0.25353575550390045], [0.43257291891111244, 0.1381593571372851], [0.3437182373924481, 0.24527960034963084], [0.31733764740406106, 0.08349042218139716], [-0.19661734109288, 0.39627304179640405], [0.3443594003836902, -0.13069722017172503], [0.1620196922509593, -0.11876438020876365], [0.1563956093287554, -0.21269443337356803], [0.17499103541231237, 0.2985373547088241], [0.12782499859472804, 0.4815648223887244], [0.12523150357922094, 0.41721548160929905], [0.2219315942794893, 0.47877846167983906], [0.48413541386850095, 0.008757361949769692], [0.4755390685857624, -0.4356393509319867], [0.27427215751846695, -0.2940897904508699], [-0.32213342140019896, 0.33175887910243107], [0.3241754832842173, -0.45547190176887226], [0.18462105786497715, 0.5797922578764358]]}, "id": "el29074580157008"}); 554 | }); 555 | }); 556 | }else{ 557 | // require.js not available: dynamically load d3 & mpld3 558 | mpld3_load_lib("https://mpld3.github.io/js/d3.v3.min.js", function(){ 559 | mpld3_load_lib("https://mpld3.github.io/js/mpld3.v0.2.js", function(){ 560 | 561 | mpld3.register_plugin("htmltooltip", HtmlTooltipPlugin); 562 | HtmlTooltipPlugin.prototype = Object.create(mpld3.Plugin.prototype); 563 | HtmlTooltipPlugin.prototype.constructor = HtmlTooltipPlugin; 564 | HtmlTooltipPlugin.prototype.requiredProps = ["id"]; 565 | HtmlTooltipPlugin.prototype.defaultProps = {labels:null, hoffset:0, voffset:10}; 566 | function HtmlTooltipPlugin(fig, props){ 567 | mpld3.Plugin.call(this, fig, props); 568 | }; 569 | 570 | HtmlTooltipPlugin.prototype.draw = function(){ 571 | var obj = mpld3.get_element(this.props.id); 572 | var labels = this.props.labels; 573 | var tooltip = 
d3.select("body").append("div") 786 | .attr("class", "mpld3-tooltip") 787 | .style("position", "absolute") 788 | .style("z-index", "10") 789 | .style("visibility", "hidden"); 790 | 791 | obj.elements() 792 | .on("mouseover", function(d, i){ 793 | tooltip.html(labels[i]) 794 | .style("visibility", "visible");}) 795 | .on("mousemove", function(d, i){ 796 | tooltip 797 | .style("top", d3.event.pageY + this.props.voffset + "px") 798 | .style("left",d3.event.pageX + this.props.hoffset + "px"); 799 | }.bind(this)) 800 | .on("mouseout", function(d, i){ 801 | tooltip.style("visibility", "hidden");}); 802 | }; 803 | 804 | mpld3.register_plugin("toptoolbar", TopToolbar); 805 | TopToolbar.prototype = Object.create(mpld3.Plugin.prototype); 806 | TopToolbar.prototype.constructor = TopToolbar; 807 | function TopToolbar(fig, props){ 808 | mpld3.Plugin.call(this, fig, props); 809 | }; 810 | 811 | TopToolbar.prototype.draw = function(){ 812 | // the toolbar svg doesn't exist 813 | // yet, so first draw it 814 | this.fig.toolbar.draw(); 815 | 816 | // then change the y position to be 817 | // at the top of the figure 818 | this.fig.toolbar.toolbar.attr("x", 150); 819 | this.fig.toolbar.toolbar.attr("y", 400); 820 | 821 | // then remove the draw function, 822 | // so that it is not called again 823 | this.fig.toolbar.draw = function() {} 824 | } 825 | 826 | mpld3.draw_figure("fig_el290745801570086793125459", {"axes": [{"xlim": [-0.70189582032870046, 0.71051473536226761], "yscale": "linear", "axesbg": "#FFFFFF", "texts": [], "zoomable": true, "images": [], "xdomain": [-0.70189582032870046, 0.71051473536226761], "ylim": [-0.70830339802964049, 0.69370701811001245], "paths": [], "sharey": [], "sharex": [], "axesbgalpha": null, "axes": [{"scale": "linear", "tickformat": "", "grid": {"gridOn": false}, "fontsize": null, "position": "bottom", "nticks": 0, "tickvalues": []}, {"scale": "linear", "tickformat": "", "grid": {"gridOn": false}, "fontsize": null, "position": "left", "nticks": 0, 
"tickvalues": []}], "lines": [], "markers": [{"edgecolor": "none", "facecolor": "#1B9E77", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data01", "id": "el29074665479312pts"}, {"edgecolor": "none", "facecolor": "#D95F02", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data02", "id": "el29074709227536pts"}, {"edgecolor": 
"none", "facecolor": "#7570B3", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data03", "id": "el29074665483024pts"}, {"edgecolor": "none", "facecolor": "#E7298A", "edgewidth": 0.5, "yindex": 1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data04", "id": "el29074540723600pts"}, {"edgecolor": "none", "facecolor": "#66A61E", "edgewidth": 0.5, "yindex": 
1, "coordinates": "data", "zorder": 2, "markerpath": [[[0.0, 9.0], [2.3868279, 9.0], [4.676218837063681, 8.051703224294176], [6.3639610306789285, 6.3639610306789285], [8.051703224294176, 4.676218837063681], [9.0, 2.3868279], [9.0, 0.0], [9.0, -2.3868279], [8.051703224294176, -4.676218837063681], [6.3639610306789285, -6.3639610306789285], [4.676218837063681, -8.051703224294176], [2.3868279, -9.0], [0.0, -9.0], [-2.3868279, -9.0], [-4.676218837063681, -8.051703224294176], [-6.3639610306789285, -6.3639610306789285], [-8.051703224294176, -4.676218837063681], [-9.0, -2.3868279], [-9.0, 0.0], [-9.0, 2.3868279], [-8.051703224294176, 4.676218837063681], [-6.3639610306789285, 6.3639610306789285], [-4.676218837063681, 8.051703224294176], [-2.3868279, 9.0], [0.0, 9.0]], ["M", "C", "C", "C", "C", "C", "C", "C", "C", "Z"]], "alpha": 1, "xindex": 0, "data": "data05", "id": "el29074709858768pts"}], "id": "el29074527433296", "ydomain": [-0.70830339802964049, 0.69370701811001245], "collections": [], "xscale": "linear", "bbox": [0.125, 0.125, 0.77500000000000002, 0.77500000000000002]}], "height": 480.0, "width": 1120.0, "plugins": [{"type": "reset"}, {"enabled": false, "button": true, "type": "zoom"}, {"enabled": false, "button": true, "type": "boxzoom"}, {"voffset": 10, "labels": ["Schindler's List", "One Flew Over the Cuckoo's Nest", "Gone with the Wind", "The Wizard of Oz", "Titanic", "Forrest Gump", "E.T. 
the Extra-Terrestrial", "The Silence of the Lambs", "Gandhi", "A Streetcar Named Desire", "The Best Years of Our Lives", "My Fair Lady", "Ben-Hur", "Doctor Zhivago", "The Pianist", "The Exorcist", "Out of Africa", "Good Will Hunting", "Terms of Endearment", "Giant", "The Grapes of Wrath", "Close Encounters of the Third Kind", "The Graduate", "Stagecoach", "Wuthering Heights"], "type": "htmltooltip", "id": "el29074665479312pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["Casablanca", "Psycho", "Sunset Blvd.", "Vertigo", "Chinatown", "Amadeus", "High Noon", "The French Connection", "Fargo", "Pulp Fiction", "The Maltese Falcon", "A Clockwork Orange", "Double Indemnity", "Rebel Without a Cause", "The Third Man", "North by Northwest"], "type": "htmltooltip", "id": "el29074709227536pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["The Godfather", "Raging Bull", "Citizen Kane", "The Godfather: Part II", "On the Waterfront", "12 Angry Men", "Rocky", "To Kill a Mockingbird", "Braveheart", "The Good, the Bad and the Ugly", "The Apartment", "Goodfellas", "City Lights", "It Happened One Night", "Midnight Cowboy", "Mr. Smith Goes to Washington", "Rain Man", "Annie Hall", "Network", "Taxi Driver", "Rear Window"], "type": "htmltooltip", "id": "el29074665483024pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["West Side Story", "Singin' in the Rain", "It's a Wonderful Life", "Some Like It Hot", "The Philadelphia Story", "An American in Paris", "The King's Speech", "A Place in the Sun", "Tootsie", "Nashville", "American Graffiti", "Yankee Doodle Dandy"], "type": "htmltooltip", "id": "el29074540723600pts", "hoffset": 10}, {"type": "toptoolbar"}, {"voffset": 10, "labels": ["The Shawshank Redemption", "Lawrence of Arabia", "The Sound of Music", "Star Wars", "2001: A Space Odyssey", "The Bridge on the River Kwai", "Dr. 
Strangelove or: How I Learned to Stop Worrying and Love the Bomb", "Apocalypse Now", "The Lord of the Rings: The Return of the King", "Gladiator", "From Here to Eternity", "Saving Private Ryan", "Unforgiven", "Raiders of the Lost Ark", "Patton", "Jaws", "Butch Cassidy and the Sundance Kid", "The Treasure of the Sierra Madre", "Platoon", "Dances with Wolves", "The Deer Hunter", "All Quiet on the Western Front", "Shane", "The Green Mile", "The African Queen", "Mutiny on the Bounty"], "type": "htmltooltip", "id": "el29074709858768pts", "hoffset": 10}, {"type": "toptoolbar"}], "data": {"data04": [[-0.48383166475045314, 0.17406382605759155], [-0.3431621006015567, 0.6339455084054434], [0.42023926967786635, 0.585073332726518], [-0.17325065040967416, 0.23163649164649666], [0.6705408517106365, 0.1152867311811776], [-0.20507682543002934, 0.5602431727781544], [0.4029759387982718, 0.4993894350885348], [0.6671527866782192, 0.2655632808140779], [-0.02368751477042728, 0.34058596130883056], [0.06057638900463016, 0.6101307798961454], [-0.6012117652228453, -0.04229065076764765], [0.6299986440743635, 0.35408489380322294]], "data05": [[0.4425217895905002, -0.2796139796878817], [-0.24776866601229716, -0.5782685424165561], [0.5555461511292237, 0.486954891259419], [0.19708753275222002, -0.38902709232046384], [0.3497144798998901, -0.338553421833173], [-0.07419701411754268, -0.5782948227044011], [-0.014592308929045296, -0.6368511048162793], [0.025373394060233537, -0.634289200597481], [0.24399849236205579, -0.5034936076710894], [0.07277111225479607, -0.47961483445651304], [-0.1519065165847722, -0.338080020776755], [-0.015830214134386178, -0.36229007847066996], [-0.13402784922090072, -0.3012324095400005], [-0.23474733283083896, -0.4502062748678493], [0.12924047876683878, -0.6686238579502164], [0.08194147679809109, -0.3929722898416771], [-0.34747538257781435, -0.3233751877512795], [-0.3631413985443942, -0.4813272206475685], [-0.09731550319802962, -0.505864130911545], [0.18513473762402247, 
-0.5781596431297301], [-0.09597244991104013, -0.19942409220233107], [-0.18228725458498232, -0.5566467223632475], [-0.551994202072232, 0.33486446230226674], [-0.5679692454857176, -0.35356444529282777], [-0.45369369757595834, -0.44434234569143366], [0.3366116741827183, -0.6447240310583208]], "data02": [[-0.2476475600239232, -0.26290199748846194], [-0.22919155130164443, 0.06968880208772948], [-0.42967373628736466, 0.28652830354480197], [-0.3531795894914802, 0.12742474874876483], [-0.5517590615479852, -0.22488031423191712], [0.4134372284409409, 0.40907893375027393], [-0.6619219366770693, -0.20578641688667848], [-0.4603909162269872, 0.008430725029157468], [-0.3422448822106277, 0.03699561151307461], [-0.3438274431555653, -0.06956150821349773], [-0.3384187737692901, -0.1515124959398215], [-0.18283874549876644, 0.14829844310283077], [-0.46586683087784553, -0.09919136771199337], [-0.577910557660688, 0.28870430670538155], [-0.20956788370183355, -0.0574148454899521], [-0.20029308653202116, 0.00718931223386455]], "data03": [[0.18558490034956887, -0.01077771968604301], [-0.13262115988334475, 0.5751789037009231], [0.016222043434641144, 0.4652342083561871], [0.07240937947290513, 0.13247203862675336], [-0.05239281560115413, 0.12267112673062845], [0.5241688637949986, 0.14425768177773893], [-0.2825557188697044, 0.2745647039405087], [0.21864153351095966, 0.22101476971173736], [0.49293979852148206, -0.1406180288639719], [-0.07221741889036004, 0.5688366094319275], [-0.5967680343768968, 0.1894475004960752], [0.2030557040289471, 0.06941302351286169], [0.5649139525233913, 0.34616110908884656], [-0.42188287709302075, 0.4618936364485825], [-0.30512268820937116, 0.4768633838569787], [0.6449291941438089, -0.15516297397920956], [-0.18081125741175175, 0.4765240199441122], [-0.027643220902818993, 0.6540274780305884], [0.6687778032882542, -0.08795572251657516], [-0.5534118533588518, 0.02875489034480997], [-0.5807159767006421, 0.10579656075156939]], "data01": [[-0.0777242299467317, 
-0.4515361500051302], [0.05143904563855998, 0.25947569793979247], [0.05926003443907459, -0.04759931281928853], [-0.4336747338877505, -0.2971244895104658], [0.6035469203843735, -0.3456193989894745], [0.35577780891092403, 0.007409352966807421], [-0.019806795422961994, -0.044100107912120406], [0.5183164260703591, -0.25353575550390045], [0.43257291891111244, 0.1381593571372851], [0.3437182373924481, 0.24527960034963084], [0.31733764740406106, 0.08349042218139716], [-0.19661734109288, 0.39627304179640405], [0.3443594003836902, -0.13069722017172503], [0.1620196922509593, -0.11876438020876365], [0.1563956093287554, -0.21269443337356803], [0.17499103541231237, 0.2985373547088241], [0.12782499859472804, 0.4815648223887244], [0.12523150357922094, 0.41721548160929905], [0.2219315942794893, 0.47877846167983906], [0.48413541386850095, 0.008757361949769692], [0.4755390685857624, -0.4356393509319867], [0.27427215751846695, -0.2940897904508699], [-0.32213342140019896, 0.33175887910243107], [0.3241754832842173, -0.45547190176887226], [0.18462105786497715, 0.5797922578764358]]}, "id": "el29074580157008"}); 827 | }) 828 | }); 829 | } -------------------------------------------------------------------------------- /backup/clusters_small.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/clusters_small.png -------------------------------------------------------------------------------- /backup/clusters_small_noaxes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/clusters_small_noaxes.png -------------------------------------------------------------------------------- /backup/film_cluster.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 
Top 100 Films 6 | 7 | 8 | 9 | 10 | 11 | 21 | 22 | 23 | 24 | 25 | 50 | 51 | 52 |
53 | Home 54 |
55 | 56 |
57 | 58 |

Top 100 Films of all Time

59 | 60 |
61 | 62 | 63 |
64 | 65 |
66 |

67 | How can you learn about the underlying structure of documents in a way that is informative and intuitive? This basic 68 | motivating question led me on a journey to visualize and cluster documents in a two-dimensional space. What you see 69 | above is an output of an analytical pipeline that began by gathering synopses on the top 100 films of all time and ended by 70 | analyzing the latent topics within each document. In between I ran significant manipulations on these synopses (tokenization, stemming), 71 | transformed them into a vector space model (tf-idf), and clustered them into groups (k-means). You can learn all about how 72 | I did this with my detailed guide to Document Clustering with Python. But first, what did I learn? 73 |

74 |
75 |

76 | A bit of background 77 |

78 | 79 |

80 | I obtained a list of the top 100 films of all time from an IMDB user list called 81 | 82 | Top 100 Greatest Movies of All Time (The Ultimate List) 83 | by ChrisWalczyk55. 84 | ChrisWalczyk55 claims that "My lists are not based on my own personal favorites; 85 | they are based on the true greatness and/or sucess of the person, place, or thing 86 | being ranked." Ok, sure, whatever. Using this list and its ordinal rankings, 87 | combined with synopses gathered from IMDB and Wikipedia, I was able to separate the films into 5 clusters. 88 | Why 5? Clustering is more art than science: if I selected 20 clusters they would be too narrow to allow 89 | me to draw any generalizations. If I picked 2 or 3 clusters they would be too broad. Anywhere from 5 to 8 clusters generated a good fit, 90 | but I chose 5 clusters since this led to the best intuition. 91 |

92 |
93 | 94 |

95 | Understanding the visualization 96 |

97 | 98 |

99 | The visualization at the top of the page is a 2-dimensional scatterplot of the cosine distance of each of the movies 100 | (colored by cluster). The dimensions (X and Y) do not actually have labels. The way to interpret the scatterplot is 101 | by examining the location of one film, relative to others, in this 2-d space. Proximity in this space equates to similarity as 102 | determined by a multi-dimensional scaling of the cosine distance (1 minus cosine similarity) between synopses contained 103 | within the term frequency-inverse document frequency (tf-idf) matrix. That was probably confusing and I plan to explain it in a 104 | more detailed write up of my methodology, but the basic intuition is that, based on the collected synopses, each film is plotted in relation 105 | to its similarity to all other films contained in the plot. You might find some weird relationships in this plot: keep in mind 106 | that similarity was measured based on the words found in the film synopses. If the film synopses were poorly written or very short, 107 | the results were most certainly impacted. Garbage in, garbage out. Mostly I was interested in exploring the methodology. 108 |
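The distance measure described above can be illustrated with a minimal sketch. This is not the notebook's code: the three toy synopses, the helper names `tfidf_vectors` and `cosine_distance`, and the smoothed idf formula are all assumptions chosen for illustration, and the multi-dimensional scaling step that produces the 2-d coordinates is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw count * smoothed idf).

    Hypothetical helper: tokenization is a plain whitespace split, with no
    stemming or stopword handling.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed smoothing: idf = ln((1 + N) / (1 + df)) + 1, so no weight is zero.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Made-up synopses, only to show the shape of the computation.
toy_synopses = [
    "soldiers fight the war",
    "soldiers survive the war",
    "lovers dance and sing",
]
vecs = tfidf_vectors(toy_synopses)
for i in range(len(vecs)):
    print([round(cosine_distance(vecs[i], v), 3) for v in vecs])
```

Synopses that share vocabulary end up with a smaller distance than synopses with no words in common, which is the property the scatterplot relies on.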

109 |
110 | 111 | 112 |

113 | Scoring the clusters 114 |

115 | 116 |

117 | Based on the outcome of the clustering, I used the average rank from the IMDB list to score the clusters (lower is better). 118 |

119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 |
Rank  Cluster                     Score  Count
1     Killed, soldiers, captain   43.7   26
2     Family, home, war           47.2   25
3     Father, New York, brothers  49.4   21
4     Dance, singing, love        54.5   12
5     Police, killed, murders     58.8   16
159 | 160 |

161 | You can see that the war movies scored the best. The basic war epic cluster was at the top, followed 162 | closely by family/home with some war mixed in. Family and "New York" or perhaps just cities follows the war grouping. 163 | Dancing, singing, love beats out the crime-ish flicks which, in the scheme of the top 100 movies, tend towards the bottom. 164 | This despite the dominance of the Godfather films. 165 |
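The scoring idea (average list rank per cluster, lower is better) can be sketched as follows. The function name `score_clusters` is hypothetical, and the sample uses only the first three films of two clusters from the tables on this page, so its averages will not match the published scores.

```python
from statistics import mean


def score_clusters(ranks_by_cluster):
    """Return (cluster label, mean rank) pairs sorted best (lowest mean) first."""
    scores = {label: mean(ranks) for label, ranks in ranks_by_cluster.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Illustrative subset only: three ranks per cluster, not the full membership.
sample = {
    "Police, killed, murders": [5, 13, 14],      # Casablanca, Psycho, Sunset Blvd.
    "Killed, soldiers, captain": [2, 11, 18],    # Shawshank, Lawrence of Arabia, The Sound of Music
}

for position, (label, score) in enumerate(score_clusters(sample), start=1):
    print(position, label, round(score, 1))
```

Sorting by the mean puts the cluster whose films sit nearest the top of the list first, matching the "lower is better" reading of the table.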

166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 |
Killed, soldiers, captain
Rank Title
2 The Shawshank Redemption
11 Lawrence of Arabia
18 The Sound of Music
20 Star Wars
22 2001: A Space Odyssey
25 The Bridge on the River Kwai
30 Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
32 Apocalypse Now
34 The Lord of the Rings: The Return of the King
35 Gladiator
36 From Here to Eternity
37 Saving Private Ryan
38 Unforgiven
39 Raiders of the Lost Ark
49 Patton
50 Jaws
53 Butch Cassidy and the Sundance Kid
54 The Treasure of the Sierra Madre
56 Platoon
58 Dances with Wolves
62 The Deer Hunter
63 All Quiet on the Western Front
80 Shane
81 The Green Mile
88 The African Queen
90 Mutiny on the Bounty
284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 |
Family, home, war
Rank Title
3 Schindler's List
6 One Flew Over the Cuckoo's Nest
7 Gone with the Wind
9 The Wizard of Oz
10 Titanic
17 Forrest Gump
21 E.T. the Extra-Terrestrial
23 The Silence of the Lambs
33 Gandhi
41 A Streetcar Named Desire
45 The Best Years of Our Lives
46 My Fair Lady
47 Ben-Hur
48 Doctor Zhivago
59 The Pianist
61 The Exorcist
73 Out of Africa
74 Good Will Hunting
75 Terms of Endearment
78 Giant
79 The Grapes of Wrath
82 Close Encounters of the Third Kind
85 The Graduate
89 Stagecoach
94 Wuthering Heights
398 | 399 |
400 |
401 |
402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | 485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 |
Father, New York, brothers
Rank Title
1 The Godfather
4 Raging Bull
8 Citizen Kane
12 The Godfather: Part II
16 On the Waterfront
29 12 Angry Men
40 Rocky
43 To Kill a Mockingbird
51 Braveheart
52 The Good, the Bad and the Ugly
55 The Apartment
60 Goodfellas
65 City Lights
67 It Happened One Night
69 Midnight Cowboy
70 Mr. Smith Goes to Washington
71 Rain Man
72 Annie Hall
83 Network
93 Taxi Driver
97 Rear Window
500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | 512 | 513 | 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 558 | 559 | 560 | 561 | 562 |
Dance, singing, love
Rank Title
19 West Side Story
26 Singin' in the Rain
27 It's a Wonderful Life
28 Some Like It Hot
42 The Philadelphia Story
44 An American in Paris
66 The King's Speech
68 A Place in the Sun
76 Tootsie
84 Nashville
86 American Graffiti
100 Yankee Doodle Dandy
563 | 564 |
565 |
566 |
567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | 611 | 612 | 613 | 614 | 615 | 616 | 617 | 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | 626 | 627 | 628 | 629 | 630 | 631 | 632 | 633 | 634 | 635 | 636 | 637 | 638 | 639 | 640 | 641 | 642 | 643 | 644 |
Police, killed, murders
Rank Title
5 Casablanca
13 Psycho
14 Sunset Blvd.
15 Vertigo
24 Chinatown
31 Amadeus
57 High Noon
64 The French Connection
77 Fargo
87 Pulp Fiction
91 The Maltese Falcon
92 A Clockwork Orange
95 Double Indemnity
96 Rebel Without a Cause
98 The Third Man
99 North by Northwest
645 | 646 | 647 |
648 |
649 |
650 | 651 | 652 | 653 |
654 | 655 | 656 | 744 | 745 | 746 | 747 | -------------------------------------------------------------------------------- /backup/header_short.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/header_short.jpg -------------------------------------------------------------------------------- /backup/link_list.txt: -------------------------------------------------------------------------------- 1 | /title/tt0068646/ 2 | /title/tt0111161/ 3 | /title/tt0108052/ 4 | /title/tt0081398/ 5 | /title/tt0034583/ 6 | /title/tt0073486/ 7 | /title/tt0031381/ 8 | /title/tt0033467/ 9 | /title/tt0032138/ 10 | /title/tt0120338/ 11 | /title/tt0056172/ 12 | /title/tt0071562/ 13 | /title/tt0054215/ 14 | /title/tt0043014/ 15 | /title/tt0052357/ 16 | /title/tt0047296/ 17 | /title/tt0109830/ 18 | /title/tt0059742/ 19 | /title/tt0055614/ 20 | /title/tt0076759/ 21 | /title/tt0083866/ 22 | /title/tt0062622/ 23 | /title/tt0102926/ 24 | /title/tt0071315/ 25 | /title/tt0050212/ 26 | /title/tt0045152/ 27 | /title/tt0038650/ 28 | /title/tt0053291/ 29 | /title/tt0050083/ 30 | /title/tt0057012/ 31 | /title/tt0086879/ 32 | /title/tt0078788/ 33 | /title/tt0083987/ 34 | /title/tt0167260/ 35 | /title/tt0172495/ 36 | /title/tt0045793/ 37 | /title/tt0120815/ 38 | /title/tt0105695/ 39 | /title/tt0082971/ 40 | /title/tt0075148/ 41 | /title/tt0044081/ 42 | /title/tt0032904/ 43 | /title/tt0056592/ 44 | /title/tt0043278/ 45 | /title/tt0036868/ 46 | /title/tt0058385/ 47 | /title/tt0052618/ 48 | /title/tt0059113/ 49 | /title/tt0066206/ 50 | /title/tt0073195/ 51 | /title/tt0112573/ 52 | /title/tt0060196/ 53 | /title/tt0064115/ 54 | /title/tt0040897/ 55 | /title/tt0053604/ 56 | /title/tt0091763/ 57 | /title/tt0044706/ 58 | /title/tt0099348/ 59 | /title/tt0253474/ 60 | /title/tt0099685/ 61 | /title/tt0070047/ 62 | /title/tt0077416/ 63 | /title/tt0020629/ 64 | 
/title/tt0067116/ 65 | /title/tt0021749/ 66 | /title/tt1504320/ 67 | /title/tt0025316/ 68 | /title/tt0043924/ 69 | /title/tt0064665/ 70 | /title/tt0031679/ 71 | /title/tt0095953/ 72 | /title/tt0075686/ 73 | /title/tt0089755/ 74 | /title/tt0119217/ 75 | /title/tt0086425/ 76 | /title/tt0084805/ 77 | /title/tt0116282/ 78 | /title/tt0049261/ 79 | /title/tt0032551/ 80 | /title/tt0046303/ 81 | /title/tt0120689/ 82 | /title/tt0075860/ 83 | /title/tt0074958/ 84 | /title/tt0073440/ 85 | /title/tt0061722/ 86 | /title/tt0069704/ 87 | /title/tt0110912/ 88 | /title/tt0043265/ 89 | /title/tt0031971/ 90 | /title/tt0026752/ 91 | /title/tt0033870/ 92 | /title/tt0066921/ 93 | /title/tt0075314/ 94 | /title/tt0032145/ 95 | /title/tt0036775/ 96 | /title/tt0048545/ 97 | /title/tt0047396/ 98 | /title/tt0041959/ 99 | /title/tt0053125/ 100 | /title/tt0035575/ 101 | -------------------------------------------------------------------------------- /backup/link_list_imdb.txt: -------------------------------------------------------------------------------- 1 | http://www.imdb.com/title/tt0068646/ 2 | http://www.imdb.com/title/tt0111161/ 3 | http://www.imdb.com/title/tt0108052/ 4 | http://www.imdb.com/title/tt0081398/ 5 | http://www.imdb.com/title/tt0034583/ 6 | http://www.imdb.com/title/tt0073486/ 7 | http://www.imdb.com/title/tt0031381/ 8 | http://www.imdb.com/title/tt0033467/ 9 | http://www.imdb.com/title/tt0032138/ 10 | http://www.imdb.com/title/tt0120338/ 11 | http://www.imdb.com/title/tt0056172/ 12 | http://www.imdb.com/title/tt0071562/ 13 | http://www.imdb.com/title/tt0054215/ 14 | http://www.imdb.com/title/tt0043014/ 15 | http://www.imdb.com/title/tt0052357/ 16 | http://www.imdb.com/title/tt0047296/ 17 | http://www.imdb.com/title/tt0109830/ 18 | http://www.imdb.com/title/tt0059742/ 19 | http://www.imdb.com/title/tt0055614/ 20 | http://www.imdb.com/title/tt0076759/ 21 | http://www.imdb.com/title/tt0083866/ 22 | http://www.imdb.com/title/tt0062622/ 23 | http://www.imdb.com/title/tt0102926/ 
24 | http://www.imdb.com/title/tt0071315/ 25 | http://www.imdb.com/title/tt0050212/ 26 | http://www.imdb.com/title/tt0045152/ 27 | http://www.imdb.com/title/tt0038650/ 28 | http://www.imdb.com/title/tt0053291/ 29 | http://www.imdb.com/title/tt0050083/ 30 | http://www.imdb.com/title/tt0057012/ 31 | http://www.imdb.com/title/tt0086879/ 32 | http://www.imdb.com/title/tt0078788/ 33 | http://www.imdb.com/title/tt0083987/ 34 | http://www.imdb.com/title/tt0167260/ 35 | http://www.imdb.com/title/tt0172495/ 36 | http://www.imdb.com/title/tt0045793/ 37 | http://www.imdb.com/title/tt0120815/ 38 | http://www.imdb.com/title/tt0105695/ 39 | http://www.imdb.com/title/tt0082971/ 40 | http://www.imdb.com/title/tt0075148/ 41 | http://www.imdb.com/title/tt0044081/ 42 | http://www.imdb.com/title/tt0032904/ 43 | http://www.imdb.com/title/tt0056592/ 44 | http://www.imdb.com/title/tt0043278/ 45 | http://www.imdb.com/title/tt0036868/ 46 | http://www.imdb.com/title/tt0058385/ 47 | http://www.imdb.com/title/tt0052618/ 48 | http://www.imdb.com/title/tt0059113/ 49 | http://www.imdb.com/title/tt0066206/ 50 | http://www.imdb.com/title/tt0073195/ 51 | http://www.imdb.com/title/tt0112573/ 52 | http://www.imdb.com/title/tt0060196/ 53 | http://www.imdb.com/title/tt0064115/ 54 | http://www.imdb.com/title/tt0040897/ 55 | http://www.imdb.com/title/tt0053604/ 56 | http://www.imdb.com/title/tt0091763/ 57 | http://www.imdb.com/title/tt0044706/ 58 | http://www.imdb.com/title/tt0099348/ 59 | http://www.imdb.com/title/tt0253474/ 60 | http://www.imdb.com/title/tt0099685/ 61 | http://www.imdb.com/title/tt0070047/ 62 | http://www.imdb.com/title/tt0077416/ 63 | http://www.imdb.com/title/tt0020629/ 64 | http://www.imdb.com/title/tt0067116/ 65 | http://www.imdb.com/title/tt0021749/ 66 | http://www.imdb.com/title/tt1504320/ 67 | http://www.imdb.com/title/tt0025316/ 68 | http://www.imdb.com/title/tt0043924/ 69 | http://www.imdb.com/title/tt0064665/ 70 | http://www.imdb.com/title/tt0031679/ 71 | 
http://www.imdb.com/title/tt0095953/ 72 | http://www.imdb.com/title/tt0075686/ 73 | http://www.imdb.com/title/tt0089755/ 74 | http://www.imdb.com/title/tt0119217/ 75 | http://www.imdb.com/title/tt0086425/ 76 | http://www.imdb.com/title/tt0084805/ 77 | http://www.imdb.com/title/tt0116282/ 78 | http://www.imdb.com/title/tt0049261/ 79 | http://www.imdb.com/title/tt0032551/ 80 | http://www.imdb.com/title/tt0046303/ 81 | http://www.imdb.com/title/tt0120689/ 82 | http://www.imdb.com/title/tt0075860/ 83 | http://www.imdb.com/title/tt0074958/ 84 | http://www.imdb.com/title/tt0073440/ 85 | http://www.imdb.com/title/tt0061722/ 86 | http://www.imdb.com/title/tt0069704/ 87 | http://www.imdb.com/title/tt0110912/ 88 | http://www.imdb.com/title/tt0043265/ 89 | http://www.imdb.com/title/tt0031971/ 90 | http://www.imdb.com/title/tt0026752/ 91 | http://www.imdb.com/title/tt0033870/ 92 | http://www.imdb.com/title/tt0066921/ 93 | http://www.imdb.com/title/tt0075314/ 94 | http://www.imdb.com/title/tt0032145/ 95 | http://www.imdb.com/title/tt0036775/ 96 | http://www.imdb.com/title/tt0048545/ 97 | http://www.imdb.com/title/tt0047396/ 98 | http://www.imdb.com/title/tt0041959/ 99 | http://www.imdb.com/title/tt0053125/ 100 | http://www.imdb.com/title/tt0035575/ 101 | -------------------------------------------------------------------------------- /backup/link_list_wiki.txt: -------------------------------------------------------------------------------- 1 | http://en.wikipedia.org/wiki/The_Godfather 2 | http://en.wikipedia.org/wiki/The_Shawshank_Redemption 3 | http://en.wikipedia.org/wiki/Schindler%27s_List 4 | http://en.wikipedia.org/wiki/Raging_Bull 5 | http://en.wikipedia.org/wiki/Casablanca_(film) 6 | http://en.wikipedia.org/wiki/One_Flew_Over_the_Cuckoo%27s_Nest_(film) 7 | http://en.wikipedia.org/wiki/Gone_with_the_Wind_(film) 8 | http://en.wikipedia.org/wiki/Citizen_Kane 9 | http://en.wikipedia.org/wiki/The_Wizard_of_Oz_(1939_film) 10 | http://en.wikipedia.org/wiki/Titanic_(1997_film) 
11 | http://en.wikipedia.org/wiki/Lawrence_of_Arabia_(film) 12 | http://en.wikipedia.org/wiki/The_Godfather_Part_II 13 | http://en.wikipedia.org/wiki/American_Psycho_(film) 14 | http://en.wikipedia.org/wiki/Sunset_Boulevard_(film) 15 | http://en.wikipedia.org/wiki/Vertigo_(film) 16 | http://en.wikipedia.org/wiki/On_the_Waterfront 17 | http://en.wikipedia.org/wiki/Forrest_Gump 18 | http://en.wikipedia.org/wiki/The_Sound_of_Music_(film) 19 | http://en.wikipedia.org/wiki/West_Side_Story_(film) 20 | http://en.wikipedia.org/wiki/Star_Wars_(film) 21 | http://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial 22 | http://en.wikipedia.org/wiki/2001:_A_Space_Odyssey_(film) 23 | http://en.wikipedia.org/wiki/The_Silence_of_the_Lambs_(film) 24 | http://en.wikipedia.org/wiki/Chinatown_(1974_film) 25 | http://en.wikipedia.org/wiki/The_Bridge_on_the_River_Kwai 26 | http://en.wikipedia.org/wiki/Singin%27_in_the_Rain 27 | http://en.wikipedia.org/wiki/It%27s_a_Wonderful_Life 28 | http://en.wikipedia.org/wiki/Some_Like_It_Hot 29 | http://en.wikipedia.org/wiki/12_Angry_Men_(1957_film) 30 | http://en.wikipedia.org/wiki/Dr._Strangelove 31 | http://en.wikipedia.org/wiki/Amadeus_(film) 32 | http://en.wikipedia.org/wiki/Apocalypse_Now 33 | http://en.wikipedia.org/wiki/Gandhi_(film) 34 | http://en.wikipedia.org/wiki/The_Lord_of_the_Rings:_The_Return_of_the_King 35 | http://en.wikipedia.org/wiki/Gladiator_(2000_film) 36 | http://en.wikipedia.org/wiki/From_Here_to_Eternity 37 | http://en.wikipedia.org/wiki/Saving_Private_Ryan 38 | http://en.wikipedia.org/wiki/Unforgiven 39 | http://en.wikipedia.org/wiki/Raiders_of_the_Lost_Ark 40 | http://en.wikipedia.org/wiki/Rocky 41 | http://en.wikipedia.org/wiki/A_Streetcar_Named_Desire_(1951_film) 42 | http://en.wikipedia.org/wiki/The_Philadelphia_Story_(film) 43 | http://en.wikipedia.org/wiki/To_Kill_a_Mockingbird_(film) 44 | http://en.wikipedia.org/wiki/An_American_in_Paris_(film) 45 | http://en.wikipedia.org/wiki/The_Best_Years_of_Our_Lives 46 | 
http://en.wikipedia.org/wiki/My_Fair_Lady_(film) 47 | http://en.wikipedia.org/wiki/Ben-Hur_(1959_film) 48 | http://en.wikipedia.org/wiki/Doctor_Zhivago_(film) 49 | http://en.wikipedia.org/wiki/Patton_(film) 50 | http://en.wikipedia.org/wiki/Jaws_(film) 51 | http://en.wikipedia.org/wiki/Braveheart 52 | http://en.wikipedia.org/wiki/Coyote_Ugly_(film) 53 | http://en.wikipedia.org/wiki/Butch_Cassidy_and_the_Sundance_Kid 54 | http://en.wikipedia.org/wiki/The_Treasure_of_the_Sierra_Madre_(film) 55 | http://en.wikipedia.org/wiki/The_Apartment 56 | http://en.wikipedia.org/wiki/Platoon_(film) 57 | http://en.wikipedia.org/wiki/High_Noon 58 | http://en.wikipedia.org/wiki/Dances_with_Wolves 59 | http://en.wikipedia.org/wiki/The_Pianist_(2002_film) 60 | http://en.wikipedia.org/wiki/Goodfellas 61 | http://en.wikipedia.org/wiki/The_Exorcist_(film) 62 | http://en.wikipedia.org/wiki/The_Deer_Hunter 63 | http://en.wikipedia.org/wiki/All_Quiet_on_the_Western_Front_(1930_film) 64 | http://en.wikipedia.org/wiki/The_French_Connection_(film) 65 | http://en.wikipedia.org/wiki/Friday_Night_Lights_(film) 66 | http://en.wikipedia.org/wiki/The_King%27s_Speech 67 | http://en.wikipedia.org/wiki/It_Happened_One_Night 68 | http://en.wikipedia.org/wiki/A_Place_in_the_Sun_(film) 69 | http://en.wikipedia.org/wiki/Midnight_Cowboy 70 | http://en.wikipedia.org/wiki/Mr._Smith_Goes_to_Washington 71 | http://en.wikipedia.org/wiki/Rain_Man 72 | http://en.wikipedia.org/wiki/Annie_Hall 73 | http://en.wikipedia.org/wiki/Out_of_Africa_(film) 74 | http://en.wikipedia.org/wiki/Good_Will_Hunting 75 | http://en.wikipedia.org/wiki/Terms_of_Endearment 76 | http://en.wikipedia.org/wiki/Tootsie 77 | http://en.wikipedia.org/wiki/Fargo_(film) 78 | http://en.wikipedia.org/wiki/Giant_(1956_film) 79 | http://en.wikipedia.org/wiki/The_Grapes_of_Wrath_(film) 80 | http://en.wikipedia.org/wiki/Shane_(film) 81 | http://en.wikipedia.org/wiki/The_Green_Mile_(film) 82 | 
http://en.wikipedia.org/wiki/Close_Encounters_of_the_Third_Kind 83 | http://en.wikipedia.org/wiki/Network_(film) 84 | http://en.wikipedia.org/wiki/Nashville_(film) 85 | http://en.wikipedia.org/wiki/The_Graduate 86 | http://en.wikipedia.org/wiki/American_Graffiti 87 | http://en.wikipedia.org/wiki/Pulp_Fiction 88 | http://en.wikipedia.org/wiki/The_African_Queen_(film) 89 | http://en.wikipedia.org/wiki/Stagecoach_(1939_film) 90 | http://en.wikipedia.org/wiki/Mutiny_on_the_Bounty_(1962_film) 91 | http://en.wikipedia.org/wiki/The_Maltese_Falcon_(1941_film) 92 | http://en.wikipedia.org/wiki/A_Clockwork_Orange_(film) 93 | http://en.wikipedia.org/wiki/Taxi_Driver 94 | http://en.wikipedia.org/wiki/Wuthering_Heights_%281939_film%29 95 | http://en.wikipedia.org/wiki/Double_Indemnity_(film) 96 | http://en.wikipedia.org/wiki/Rebel_Without_a_Cause 97 | http://en.wikipedia.org/wiki/Rear_Window 98 | http://en.wikipedia.org/wiki/The_Third_Man 99 | http://en.wikipedia.org/wiki/North_by_Northwest 100 | http://en.wikipedia.org/wiki/Yankee_Doodle_Dandy 101 | -------------------------------------------------------------------------------- /backup/synopses_list.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/synopses_list.txt -------------------------------------------------------------------------------- /backup/synopses_list.txt.txt: -------------------------------------------------------------------------------- 1 | 2 | 3 | In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone's daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as "Godfather." 
He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, "no Sicilian can refuse a request on his daughter's wedding day." One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men because she refused their advances; the men received minimal punishment. The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to Corleone's nefarious business dealings. The Don's wife is godmother to Bonasera's shamed daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees to have his men punish the young men responsible.Meanwhile, the Don's youngest son Michael (Al Pacino), a decorated Marine hero returning from World War II service, arrives at the wedding and tells his girlfriend Kay Adams (Diane Keaton) anecdotes about his family, informing her about his father's criminal life; he reassures her that he is different from his family and doesn't plan to join them in their criminal dealings. The wedding scene serves as critical exposition for the remainder of the film, as Michael introduces the main characters to Kay. Fredo (John Cazale), Michael's next older brother, is a bit dim-witted and quite drunk by the time he finds Michael at the party. Sonny (James Caan), the Don's eldest child and next in line to become Don upon his father's retirement, is married but he is a hot-tempered philanderer who sneaks into a bedroom to have sex with one of Connie's bridesmaids, Lucy Mancini (Jeannie Linero). Tom Hagen is not related to the family by blood but is considered one of the Don's sons because he was homeless when he befriended Sonny in the Little Italy neighborhood of Manhattan and the Don took him in. 
Now a talented attorney, Tom is being groomed for the important position of consigliere (counselor) to the Don, despite his non-Sicilian heritage.Also among the guests at the celebration is the famous singer Johnny Fontane (Al Martino), Corleone's godson, who has come from Hollywood to petition Vito's help in landing a movie role that will revitalize his flagging career. Jack Woltz (John Marley), the head of the studio, denies Fontane the part (a character much like Johnny himself), which will make him an even bigger star, but Don Corleone explains to Johnny: "I'm gonna make him an offer he can't refuse." The Don also receives congratulatory salutations from Luca Brasi, a terrifying enforcer in the criminal underworld, and fills a request from the baker who made Connie's wedding cake who wishes for his nephew Enzo to become an American citizen.After the wedding, Hagen is dispatched to Los Angeles to meet with Woltz, but Woltz angrily tells him that he will never cast Fontane in the role. Woltz holds a grudge because Fontane seduced and "ruined" a starlet who Woltz had been grooming for stardom and with whom he had a sexual relationship. Woltz is persuaded to give Johnny the role, however, when he wakes up early the next morning and feels something wet in his bed. He pulls back the sheets and finds himself in a pool of blood; he screams in horror when he discovers the severed head of his prized $600,000 stud horse, Khartoum, in the bed with him. (A deleted scene from the film implies that Luca Brasi (Lenny Montana), Vito's top "button man" or hitman, is responsible.)Upon Hagen's return, the family meets with Virgil "The Turk" Sollozzo (Al Lettieri), who is being backed by the rival Tattaglia family. He asks Don Corleone for financing as well as political and legal protection for importing and distributing heroin. Despite the huge profit to be made, Vito Corleone refuses, explaining that his political influence would be jeopardized by a move into the narcotics trade. 
The Don's eldest son, Sonny, who had earlier urged the family to enter the narcotics trade, breaks ranks during the meeting and questions Sollozzo's assurances as to the Corleone Family's investment being guaranteed by the Tattaglia Family. His father, angry at Sonny's dissension in a non-family member's presence, privately rebukes him later. Don Corleone then dispatches Luca Brasi to infiltrate Sollozzo's organization and report back with information. During the meeting, while Brasi is bent over to allow Bruno Tattaglia to light his cigarette, he is stabbed in the hand by Sollozzo, and is subsequently garroted by an assassin.Soon after his meeting with Sollozzo, Don Corleone is gunned down in an assassination attempt just outside his office, and it is not immediately known whether he has survived. Fredo Corleone had been assigned driving and protection duty for his father when Paulie Gatto, the Don's usual bodyguard, had called in sick. Fredo proves to be ineffectual, fumbling with his gun and unable to shoot back. When Sonny hears about the Don being shot and Paulie's absence, he orders Clemenza (Richard S. Castellano) to find Paulie and bring him to the Don's house.Sollozzo abducts Tom Hagen and persuades him to offer Sonny the deal previously offered to his father. Enraged, Sonny refuses to consider it and issues an ultimatum to the Tattaglias: turn over Sollozzo or face a lengthy, bloody and costly (for both sides) gang war. They refuse, and instead send Sonny "a Sicilian message," in the form of two fresh fish wrapped in Luca Brasi's bullet-proof vest, to tell the Corleones that Luca Brasi "sleeps with the fishes."Clemenza later takes Paulie and one of the family's hitmen, Rocco Lampone, for a drive into Manhattan. Sonny wants to "go to the mattresses" -- set up beds in apartments for Corleone button men to operate out of in the event that the crime war breaks out. 
On their way back from Manhattan, Clemenza has Paulie stop the car in a remote area so he can urinate. Rocco shoots Paulie dead; he and Clemenza leave Paulie and the car behind.Michael, whom the other Mafia families consider a "civilian" uninvolved in mob business, visits his father at a small private hospital. He is shocked to find that no one is guarding him. Realizing that his father is again being set up to be killed, he calls Sonny for help, moves his father to another room, and goes outside to watch the entrance. Michael enlists help from Enzo the baker (Gabriele Torrei), who has come to the hospital to pay his respects. Together, they bluff away Sollozzo's men as they drive by. Police cars soon appear bringing the corrupt Captain McCluskey (Sterling Hayden), who viciously punches Michael in the cheek and breaks his jaw when Michael insinuates that Sollozzo paid McCluskey to set up his father. Just then, Hagen arrives with "private detectives" licensed to carry guns to protect Don Corleone, and he takes the injured Michael home. Sonny responds by having Bruno Tattaglia (Tony Giorgio), the eldest son and underboss of Don Phillip Tattaglia (Victor Rendina), killed (off-camera).Following the attempt on the Don's life at the hospital, Sollozzo requests a meeting with the Corleones, which Captain McCluskey will attend as Sollozzo's bodyguard. When Michael volunteers to kill both men during the meeting, Sonny and the other senior Family members are amused; however, Michael convinces them that he is serious and that killing Sollozzo and McCluskey is in the family's interest: "It's not personal. It's strictly business." Because Michael is considered a civilian, he won't be regarded as a suspicious ambassador for the Corleones. Although police officers are usually off limits for hits, Michael argues that since McCluskey is corrupt and has illegal dealings with Sollozzo, he is fair game. 
Michael also implies that newspaper reporters that the Corleones have on their payroll would delight in publishing stories about a corrupt police captain.Michael meets with Clemenza, one of his father's caporegimes (captains), who prepares a small pistol for him, covering the trigger and grip with tape to prevent any fingerprint evidence. He instructs Michael about the proper way to perform the assassination and tells him to leave the gun behind. He also tells Michael that the family were all very proud of Michael for becoming a war hero during his service in the Marines. Clemenza shows great confidence that Michael can perform the job and tells him it will all go smoothly. The plan is to have the Corleone's informers find out the location of the meeting and plant the revolver before Michael, Sollozzo and McCluskey arrive.Before the meeting in a small Italian restaurant, McCluskey frisks Michael for weapons and finds him clean. Michael excuses himself to go to the bathroom, where he retrieves the planted revolver. Returning to the table, he fatally shoots Sollozzo, then McCluskey. Michael is sent to hide in Sicily while the Corleone family prepares for all-out warfare with the Five Families (who are united against the Corleones) as well as a general clampdown on the mob by the police and government authorities. When the don returns home from the hospital, he is distraught to learn that it was Michael who killed Sollozzo and McCluskey.Meanwhile, Connie and Carlo's marriage is disintegrating. They argue publicly over Carlo's suspected infidelity and his possessive behavior toward Connie. By Italian tradition, nobody, not even a high-ranking Mafia don, can intervene in a married couple's personal disputes, even if they involve infidelity, money, or domestic abuse. One day, Sonny sees a bruise on Connie's face and she tells him that Carlo hit her after she asked him if he was having an affair. 
Sonny tracks down and severely beats up Carlo Rizzi in the middle of a crowded street for brutalizing the pregnant Connie, and threatens to kill Carlo if he ever abuses Connie again. An angry Carlo responds by plotting with Tattaglia and Don Emilio Barzini (Richard Conte), the Corleones' chief rivals, to have Sonny killed.Later, Carlo has one of his mistresses phone his house, knowing that Connie will answer. The woman asks Connie to tell Carlo not to meet her tonight. The very pregnant and distraught Connie assaults Carlo; he takes advantage of the altercation to beat Connie in order to lure Sonny out in the open and away from the Corleone compound. When Connie phones the compound to tell Sonny that Carlo has beaten her again, the furious Sonny drives off (alone and unescorted) to fulfill his threat against Carlo. On the way to Connie and Carlo's house, Sonny is ambushed at a toll booth on the Long Island Causeway and violently shot to death by several carloads of hitmen wielding Thompson sub-machine guns.Tom Hagen relays the news of Sonny's massacre to the Don, who calls in the favor from Bonasera to personally handle the embalming of Sonny's body. Rather than seek revenge for Sonny's killing, Don Corleone meets with the heads of the Five Families to negotiate a cease-fire. Not only is the conflict draining all their assets and threatening their survival, but ending it is the only way that Michael can return home safely. Reversing his previous decision, Vito agrees that the Corleone family will provide political protection for Tattaglia's traffic in heroin, as long as it is controlled and not sold to children. At the meeting, Don Corleone deduces that Don Barzini, not Tattaglia, was ultimately behind the start of the mob war and Sonny's death.In Sicily, Michael patiently waits out his exile, protected by Don Tommasino (Corrado Gaipa), an old family friend. 
Michael aimlessly wanders the countryside, accompanied by his ever-present bodyguards, Calo (Franco Citti) and Fabrizio (Angelo Infanti). In a small village, Michael meets and falls in love with Apollonia Vitelli (Simonetta Stefanelli), the beautiful young daughter of a bar owner. They court and marry in the traditional Sicilian fashion, but soon Michael's presence becomes known to Corleone enemies. As the couple is about to be moved to a safer location, Apollonia is killed as a result of a rigged car (originally intended for Michael), exploding on ignition; Michael, who watched the car blow up, spots Fabrizio hurriedly leaving the grounds seconds before the explosion, implicating him in the assassination plot. (In a deleted scene, Fabrizio is found years later and killed.)With his safety guaranteed, Michael returns home. More than a year later, in 1950, he reunites with his former girlfriend Kay after a total of four years of separation -- three in Italy and one in America. He tells her he wants them to be married. Although Kay is hurt that he waited so long to contact her, she accepts his proposal. With Don Vito semi-retired, Sonny dead, and middle brother Fredo considered incapable of running the family business, Michael is now in charge; he promises Kay he will make the family business completely legitimate within five years.Two years later, Clemenza and Salvatore Tessio (Abe Vigoda), complain that they are being pushed around by the Barzini Family and ask permission to strike back, but Michael denies the request. He plans to move the family operations to Nevada and after that, Clemenza and Tessio may break away to form their own families. Michael further promises Connie's husband, Carlo, that he will be his right hand man in Nevada (Carlo had grown up there), unaware of his part in Sonny's assassination. Tom Hagen has been removed as consigliere and is now merely the family's lawyer, with Vito serving as consigliere. 
Privately, Hagen inquires about his change in status, and also questions Michael about a new regime of "soldiers" secretly being built under Rocco Lampone (Tom Rosqui). Don Vito explains to Hagen that Michael is acting on his advice.Another year or so later, Michael travels to Las Vegas and meets with Moe Greene (Alex Rocco), a rich and shrewd casino boss looking to expand his business dealings. After the Don's attempted assassination, Fredo had been sent to Las Vegas to learn about the casino business from Greene. Michael arrogantly offers to buy out Greene but is rudely rebuffed. Greene believes the Corleones are weak and that he can secure a better deal from Barzini. As Moe and Michael heatedly negotiate, Fredo sides with Moe. Afterward, Michael warns Fredo to never again "take sides with anyone against the family."Michael returns home. In a private moment, Vito explains his expectation that the Family's enemies will attempt to murder Michael by using a trusted associate to arrange a meeting as a pretext for assassination. Vito also reveals that he had never really intended a life of crime for Michael, hoping that his youngest son would hold legitimate power as a senator or governor. Some months later, Vito collapses and dies while playing with his young grandson Anthony (Anthony Gounaris) in his tomato garden. At the burial, Tessio conveys a proposal for a meeting with Barzini, which identifies Tessio as the traitor that Vito was expecting.Michael arranges for a series of murders to occur simultaneously while he is standing godfather to Connie's and Carlo's newborn son at the church:Don Stracci (Don Costello) is gunned down along with his bodyguard in a hotel elevator by a shotgun-wielding Clemenza.Moe Greene is killed while having a massage, shot through the eye by an unidentified assassin.Don Cuneo (Rudy Bond) is trapped in a revolving door at the St. 
Regis Hotel and shot dead by soldier Willi Cicci (Joe Spinell).Don Tattaglia is assassinated in his bed, along with a prostitute, by Rocco Lampone and an unknown associate.Don Barzini is killed on the steps of his office building along with his bodyguard and driver, shot by Al Neri (Richard Bright), disguised in his old police uniform.After the baptism, Tessio believes he and Hagen are on their way to the meeting between Michael and Barzini that he has arranged. Instead, he is surrounded by Willi Cicci and other button men as Hagen steps away. Realizing that Michael has uncovered his betrayal, Tessio tells Hagen that he always respected Michael, and that his disloyalty "was only business." He asks if Tom can get him off for "old times' sake," but Tom says he cannot. Tessio is driven away and never seen again (it is implied that Cicci shoots and kills Tessio with his own gun after he disarms him prior to entering the car).Meanwhile, Michael confronts Carlo about Sonny's murder and forces him to admit his role in setting up the ambush, having been approached by Barzini himself. (The hitmen who killed Sonny were the core members of Barzini's personal bodyguard.) Michael assures Carlo he will not be killed, but his punishment is exclusion from all family business. He hands Carlo a plane ticket to exile in Las Vegas. However, when Carlo gets into a car headed for the airport, he is immediately garroted to death by Clemenza, on Michael's orders.Later, a hysterical Connie confronts Michael at the Corleone compound as movers carry away the furniture in preparation for the family move to Nevada. She accuses him of murdering Carlo in retribution for Carlo's brutal treatment of her and for Carlo's suspected involvement in Sonny's murder. After Connie is removed from the house, Kay questions Michael about Connie's accusation, but he refuses to answer, reminding her to never ask him about his business or what he does for a living. 
She insists, and Michael outright lies, reassuring his wife that he played no role in Carlo's death. Kay believes him and is relieved. The film ends with Clemenza and new caporegimes Rocco Lampone and Al Neri arriving and paying their respects to Michael. Clemenza kisses Michael's hand and greets him as "Don Corleone." As Kay watches, the office door is closed. 4 | 5 | 6 | 7 | 8 | In 1947, Andy Dufresne (Tim Robbins), a banker in Maine, is convicted of murdering his wife and her lover, a golf pro. He is given two life sentences and sent to the notoriously harsh Shawshank Prison. Andy always claims his innocence, but his cold and measured demeanor led many to doubt his word.During the first night, the chief guard, Byron Hadley (Clancy Brown), savagely beats a newly arrived inmate because of his crying and hysterics. The inmate later dies in the infirmary because the prison doctor had left for the night. Meanwhile Andy remained steadfast and composed. Ellis Boyd Redding (Morgan Freeman), also known as Red, bet against others that Andy would be the one to break down first and loses a considerable amount of cash.About a month later, Andy approaches Red, who runs contraband inside the walls of Shawshank. He asks if Red could find him a rock hammer, an instrument he claims is necessary for his hobby of rock collecting and sculpting. Though other prisoners consider Andy "a really cold fish", Red sees something in Andy, and likes him from the start. Red believes Andy intends to use the hammer to engineer his escape in the future but when the tool arrived and he saw how small it was, Red put aside the thought that Andy could ever use it to dig his way out of prison.Over the first two years of his incarceration, Andy works in the prison laundry. He attracts attention from "the Sisters", a group of prisoners who sexually assault other prisoners. 
Though he persistently resists and fights them, Andy is beaten and raped on a regular basis. Red pulls some strings, and gets Andy and a few of their mutual friends a break by getting them all on a work detail tarring the roof of one of the prison's buildings. During the job Andy overhears Hadley complaining about having to pay taxes for an upcoming inheritance. Using his expertise as a banker, Andy lets Hadley know how he could shelter his money from the IRS, turning it into a one-time gift for his wife. He says he'd assist in exchange for some cold beers for his fellow inmates while on the tarring job. Though he at first threatens to throw Andy off the roof, Hadley, the most brutal guard in the prison, agrees, providing the men with cold beer before the job is finished. Red remarks that Andy may have engineered the privilege to build favor with the prison guards as much as with his fellow inmates, but Red also thinks Andy did it simply to "feel free." While watching a movie, Andy asks Red for "Rita Hayworth". Soon after asking Red for "Rita Hayworth", Andy once more encounters the Sisters and is brutally beaten, putting him in the infirmary for a month. Boggs (Mark Rolston), the leader of "The Sisters", spends a week in solitary. When he comes out, he finds Hadley and his men waiting in his cell. They beat him so badly he's left paralyzed, transferred to a prison hospital upstate, and the Sisters never bother Andy again. When Andy gets out of the infirmary, he finds a bunch of rocks and a poster of Rita Hayworth in his cell: presents from Red and his buddies. Warden Samuel Norton (Bob Gunton) hears about how Andy helped Hadley and uses a surprise cell inspection to size Andy up. The warden meets with Andy and sends him to work with aging inmate Brooks Hatlen (James Whitmore) in the prison library, where he sets up a make-shift desk to help other guards (and the warden himself) with income tax returns and other financial advice.
There Andy sees an opportunity to expand the prison library, starting with asking the Maine state senate for funds. He starts writing letters and sending them every week. His financial support practice became so appreciated that even guards from other prisons, when they came for inter-prison baseball matches, sought Andy's financial advice. Andy even ends up doing Norton's taxes the next season.Not long afterward, Brooks, the old librarian, threatens to kill another prisoner, Heywood, in order to avoid being paroled. Andy is able to talk him down and Brooks is paroled. He goes to a halfway house but finds it impossible to adjust to life outside the prison. He eventually commits suicide. When his friends suggest that he was crazy for doing so, Red tells them that Brooks had obviously become "institutionalized", essentially conditioned to be a prisoner for the rest of his life and unable to adapt to the outside world. Red remarks: "These walls are funny. First you hate 'em, then you get used to 'em. Enough time passes, you get so you depend on them."After six years of writing letters, Andy receives $200 from the state for the library, along with a collection of old books and records. Though the state Senate thinks this will be enough to get Andy to halt his letter-writing campaign, he is undaunted and doubles his efforts.When the donations of old books and records arrive at the warden's office, Andy finds a copy of Mozart's "The Marriage of Figaro" among the records. He locks the guard assigned to the warden's office in the bathroom and plays the record on a phonograph over the prison's PA system. The entire prison seems captivated by the music - Red remarks that the voices of the women in the intro made everyone feel free, if only for a brief time. Outside the office, Norton appears, furious at the act of defiance and orders Andy to turn off the record player. Andy reaches for the needle arm at first, then turns the volume on the phonograph up. 
The warden orders Hadley to break into the office and Andy is sent immediately to solitary confinement for two weeks. When he gets out, he tells his friends that it was the "easiest time" he ever did in the hole because he thought of Mozart's Figaro. When the other prisoners tell him how unlikely that could be, he tells them that hope can sustain them. Red is not convinced and leaves, bitter at the thought. With the enlarged library and more materials, Andy begins to teach those inmates who want to receive their high school diplomas. After Andy is able to secure a steady stream of funding from various sources, the library is further renovated and named for Brooks. Warden Norton profits from Andy's knowledge of bookkeeping and devises a scheme whereby he puts prison inmates to work in public projects which he wins by outbidding other contractors (cheap labor from the prisoners). Occasionally, he lets others get some contracts if they bribe him. Andy launders money for the warden by setting up many accounts in different banks, along with several investments, using a fake identity: "Randall Stephens". He shares the details only with his friend, Red, noting that he had to "go to prison to learn how to be a criminal." In 1965, a young prisoner named Tommy (Gil Bellows) comes to Shawshank. Andy suggests that Tommy take up another line of work besides theft. The suggestion really gets to Tommy and he works on achieving his high school equivalency diploma. Though Tommy is a good student, he is still frustrated when he takes the exam itself, crumpling it up and tossing it in the trash. Andy retrieves it and sends it in. One day Red tells Tommy about Andy's case. Tommy is visibly upset at hearing Andy's story and tells Andy and Red that he had a cellmate in another prison who boasted about killing a man who was a pro golfer at the country club he worked at, along with his lover. The woman's husband, a banker, had gone to prison for those murders.
With this new information, Andy, full of hope, meets with the warden, expecting he could help him get another trial with Tommy as a witness. The reaction from Norton is completely contrary to what Andy hoped for. Andy says emphatically that he would never reveal the money laundering schemes he had set up for Norton over the years - the warden becomes furious and orders him to solitary for a month. The warden later meets with Tommy alone and asks him if he'll testify on Andy's behalf. Tommy enthusiastically agrees and the warden has him shot dead by Hadley. When the warden visits Andy in solitary, he tells him that Tommy was killed while attempting escape. Andy tells Norton that the financial schemes will stop. The warden counters, saying the library will be destroyed and all its materials burned. Andy will also lose his private cell and be sent to the block with the most hardened criminals. The warden gives Andy another month in solitary. Afterwards, Andy returns to the usual daily life at Shawshank, a seemingly broken man. One day he talks to Red about how, although he didn't kill his wife, his personality drove her away, which led to her infidelity and death. He says if he's ever freed or escapes, he'd like to go to Zihuatanejo, a beach town on the Pacific coast of Mexico. He also tells Red how he got engaged. He and his future wife went up to a farm in Buxton, Maine, to a large oak tree at the end of a stone wall. The two made love under the tree, after which he proposed to her. He tells Red that, if he should ever be paroled, he should look for that field, and that oak tree. There, under a large black volcanic rock that would look out of place, Andy has buried a box that he wants Red to have. Andy refuses to reveal what might be in that box. Later, Andy asks for a length of rope, leading Red and his buddies to suspect he will commit suicide.
At the end of the day, Norton asks Andy to shine his shoes for him and put his suit in for dry-cleaning before retiring for the night. The following morning, Andy is not accounted for as usual from his cell. At the same time, Norton becomes alarmed when he finds Andy's shoes in his shoebox instead of his own. He rushes to Andy's cell and demands an explanation. Hadley brings in Red, but Red insists he knows nothing of Andy's plans. Becoming increasingly hostile and paranoid, Norton starts throwing Andy's sculpted rocks around the cell. When he throws one at Andy's poster of Raquel Welch (which replaced earlier posters of Marilyn Monroe and Rita Hayworth), the rock punches through and into the wall. Norton tears the poster away from the wall and finds a tunnel just wide enough for a man to crawl through. During the previous night's thunderstorm, Andy wore Norton's shoes to his cell, catching a lucky break when no one noticed. He packed some papers and Norton's clothes into a plastic bag, tied it to himself with the rope he'd asked for, and escaped through his hole. The tunnel he'd excavated led him to a space between two walls of the prison where he found a sewer main line. Using a rock, he hit it in time with the lightning strikes and eventually burst it. Crawling through 500 yards in the pipe and through the raw sewage contained in it, Andy emerged in a brook outside the walls. A search team later found his uniform and his rock hammer, which had been worn nearly to nothing. That morning, Andy walks into the Maine National Bank in Portland, where he had put Warden Norton's money. Using his assumed identity as Randall Stephens, and with all the necessary documentation, he walks out with a cashier's check. Before he leaves, he asks them to drop a package in the mail. He continues his visits to nearly a dozen other local banks, ending up with some $370,000.
The package contained Warden Norton's account books, which were delivered straight to the Portland Daily Bugle newspaper. Not long after, the police storm Shawshank Prison. Hadley is arrested for murder; Red says he was taken away "crying like a little girl". Warden Norton finally opens the safe, which he hadn't touched since Andy escaped, and instead of his books, he finds the Bible he had given Andy. Norton opens it to the book of Exodus and finds that the pages have been cut out in the shape of Andy's rock hammer. Norton walks back to his desk as the police pound on his door, takes out a small revolver and shoots himself under the chin. Red remarks that he wondered if the warden thought, right before pulling the trigger, how "Andy could ever have gotten the best of him." Shortly after, Red receives a postcard from Fort Hancock, Texas, with nothing written on it. Red takes it as a sign that Andy made it into Mexico to freedom. Red and his buddies would spend their time talking about Andy's exploits (with a lot of embellishments), but Red just missed his friend. At Red's next parole hearing in 1967, he talked to the parole board about how "rehabilitated" was a made-up word, and how he regretted his actions of the past. His parole is granted this time. He goes to work at a grocery store, and stays at the same halfway house room Brooks had stayed in. He frequently walks by a pawn shop, which had several guns and compasses in the window. At times he would contemplate trying to get back into prison, but he remembered the promise he had made to Andy. One day, with a compass he bought from the pawn shop, he followed Andy's instructions, hitchhiking to Buxton and arriving at the stone wall Andy described. Just like Andy said, there was a large black stone. Under it was a small box containing a large sum of cash and instructions to find him.
He said he needed somebody "who could get things" for a "project" of his.Red violates parole and leaves the halfway house, unconcerned since no one would really do an extensive manhunt for "an old crook like [him]." He takes a bus to Fort Hancock, where he crosses into Mexico. The two friends are finally reunited on the beach of Zihuatanejo on the Pacific coast. 9 | 10 | -------------------------------------------------------------------------------- /backup/ward_clusters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/harrywang/document_clustering/e540feba339fe12e09ad8cd0eb07f0aaddcfd1fc/backup/ward_clusters.png -------------------------------------------------------------------------------- /d3/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013, Michael Bostock 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | * The name Michael Bostock may not be used to endorse or promote products 15 | derived from this software without specific prior written permission. 16 | 17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 18 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 19 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 20 | DISCLAIMED. 
IN NO EVENT SHALL MICHAEL BOSTOCK BE LIABLE FOR ANY DIRECT, 21 | INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 22 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 23 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY 24 | OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 25 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, 26 | EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 27 | -------------------------------------------------------------------------------- /data/genres_list.txt: -------------------------------------------------------------------------------- 1 | [u' Crime', u' Drama'] 2 | [u' Crime', u' Drama'] 3 | [u' Biography', u' Drama', u' History'] 4 | [u' Biography', u' Drama', u' Sport'] 5 | [u' Drama', u' Romance', u' War'] 6 | [u' Drama'] 7 | [u' Drama', u' Romance', u' War'] 8 | [u' Drama', u' Mystery'] 9 | [u' Adventure', u' Family', u' Fantasy', u' Musical'] 10 | [u' Drama', u' Romance'] 11 | [u' Adventure', u' Biography', u' Drama', u' History', u' War'] 12 | [u' Crime', u' Drama'] 13 | [u' Horror', u' Mystery', u' Thriller'] 14 | [u' Drama', u' Film-Noir'] 15 | [u' Mystery', u' Romance', u' Thriller'] 16 | [u' Crime', u' Drama'] 17 | [u' Drama', u' Romance'] 18 | [u' Biography', u' Drama', u' Family', u' Musical', u' Romance'] 19 | [u' Crime', u' Drama', u' Musical', u' Romance', u' Thriller'] 20 | [u' Action', u' Adventure', u' Fantasy', u' Sci-Fi'] 21 | [u' Adventure', u' Family', u' Sci-Fi'] 22 | [u' Mystery', u' Sci-Fi'] 23 | [u' Crime', u' Drama', u' Thriller'] 24 | [u' Drama', u' Mystery', u' Thriller'] 25 | [u' Adventure', u' Drama', u' War'] 26 | [u' Comedy', u' Musical', u' Romance'] 27 | [u' Drama', u' Family', u' Fantasy'] 28 | [u' Comedy'] 29 | [u' Drama'] 30 | [u' Comedy', u' War'] 31 | [u' Biography', u' Drama', u' Music'] 32 | [u' Drama', u' War'] 33 | [u' Biography', u' Drama', u' 
History'] 34 | [u' Adventure', u' Fantasy'] 35 | [u' Action', u' Drama'] 36 | [u' Drama', u' Romance', u' War'] 37 | [u' Action', u' Drama', u' War'] 38 | [u' Western'] 39 | [u' Action', u' Adventure'] 40 | [u' Drama', u' Sport'] 41 | [u' Drama'] 42 | [u' Comedy', u' Romance'] 43 | [u' Drama'] 44 | [u' Musical', u' Romance'] 45 | [u' Drama', u' Romance', u' War'] 46 | [u' Drama', u' Family', u' Musical', u' Romance'] 47 | [u' Adventure', u' Drama'] 48 | [u' Drama', u' Romance', u' War'] 49 | [u' Biography', u' Drama', u' War'] 50 | [u' Drama', u' Thriller'] 51 | [u' Action', u' Biography', u' Drama', u' History', u' War'] 52 | [u' Western'] 53 | [u' Biography', u' Crime', u' Western'] 54 | [u' Action', u' Adventure', u' Drama', u' Western'] 55 | [u' Comedy', u' Drama', u' Romance'] 56 | [u' Drama', u' War'] 57 | [u' Western'] 58 | [u' Adventure', u' Drama', u' Western'] 59 | [u' Biography', u' Drama', u' War'] 60 | [u' Biography', u' Crime', u' Drama'] 61 | [u' Horror'] 62 | [u' Drama', u' War'] 63 | [u' Drama', u' War'] 64 | [u' Action', u' Crime', u' Thriller'] 65 | [u' Comedy', u' Drama', u' Romance'] 66 | [u' Biography', u' Drama', u' History'] 67 | [u' Comedy', u' Romance'] 68 | [u' Drama', u' Romance'] 69 | [u' Drama'] 70 | [u' Drama'] 71 | [u' Drama'] 72 | [u' Comedy', u' Drama', u' Romance'] 73 | [u' Biography', u' Drama', u' Romance'] 74 | [u' Drama'] 75 | [u' Comedy', u' Drama'] 76 | [u' Comedy', u' Drama', u' Romance'] 77 | [u' Crime', u' Drama', u' Thriller'] 78 | [u' Drama', u' Romance'] 79 | [u' Drama'] 80 | [u' Drama', u' Romance', u' Western'] 81 | [u' Crime', u' Drama', u' Fantasy', u' Mystery'] 82 | [u' Drama', u' Sci-Fi'] 83 | [u' Drama'] 84 | [u' Drama', u' Music'] 85 | [u' Comedy', u' Drama', u' Romance'] 86 | [u' Comedy', u' Drama'] 87 | [u' Crime', u' Drama', u' Thriller'] 88 | [u' Adventure', u' Romance', u' War'] 89 | [u' Adventure', u' Western'] 90 | [u' Adventure', u' Drama', u' History'] 91 | [u' Drama', u' Film-Noir', u' Mystery'] 92 | 
[u' Crime', u' Drama', u' Sci-Fi'] 93 | [u' Crime', u' Drama'] 94 | [u' Drama', u' Romance'] 95 | [u' Crime', u' Drama', u' Film-Noir', u' Thriller'] 96 | [u' Drama'] 97 | [u' Mystery', u' Thriller'] 98 | [u' Film-Noir', u' Mystery', u' Thriller'] 99 | [u' Mystery', u' Thriller'] 100 | [u' Biography', u' Drama', u' Musical'] 101 | -------------------------------------------------------------------------------- /data/title_list.txt: -------------------------------------------------------------------------------- 1 | The Godfather 2 | The Shawshank Redemption 3 | Schindler's List 4 | Raging Bull 5 | Casablanca 6 | One Flew Over the Cuckoo's Nest 7 | Gone with the Wind 8 | Citizen Kane 9 | The Wizard of Oz 10 | Titanic 11 | Lawrence of Arabia 12 | The Godfather: Part II 13 | Psycho 14 | Sunset Blvd. 15 | Vertigo 16 | On the Waterfront 17 | Forrest Gump 18 | The Sound of Music 19 | West Side Story 20 | Star Wars 21 | E.T. the Extra-Terrestrial 22 | 2001: A Space Odyssey 23 | The Silence of the Lambs 24 | Chinatown 25 | The Bridge on the River Kwai 26 | Singin' in the Rain 27 | It's a Wonderful Life 28 | Some Like It Hot 29 | 12 Angry Men 30 | Dr. 
Strangelove or: How I Learned to Stop Worrying and Love the Bomb 31 | Amadeus 32 | Apocalypse Now 33 | Gandhi 34 | The Lord of the Rings: The Return of the King 35 | Gladiator 36 | From Here to Eternity 37 | Saving Private Ryan 38 | Unforgiven 39 | Raiders of the Lost Ark 40 | Rocky 41 | A Streetcar Named Desire 42 | The Philadelphia Story 43 | To Kill a Mockingbird 44 | An American in Paris 45 | The Best Years of Our Lives 46 | My Fair Lady 47 | Ben-Hur 48 | Doctor Zhivago 49 | Patton 50 | Jaws 51 | Braveheart 52 | The Good, the Bad and the Ugly 53 | Butch Cassidy and the Sundance Kid 54 | The Treasure of the Sierra Madre 55 | The Apartment 56 | Platoon 57 | High Noon 58 | Dances with Wolves 59 | The Pianist 60 | Goodfellas 61 | The Exorcist 62 | The Deer Hunter 63 | All Quiet on the Western Front 64 | The French Connection 65 | City Lights 66 | The King's Speech 67 | It Happened One Night 68 | A Place in the Sun 69 | Midnight Cowboy 70 | Mr. Smith Goes to Washington 71 | Rain Man 72 | Annie Hall 73 | Out of Africa 74 | Good Will Hunting 75 | Terms of Endearment 76 | Tootsie 77 | Fargo 78 | Giant 79 | The Grapes of Wrath 80 | Shane 81 | The Green Mile 82 | Close Encounters of the Third Kind 83 | Network 84 | Nashville 85 | The Graduate 86 | American Graffiti 87 | Pulp Fiction 88 | The African Queen 89 | Stagecoach 90 | Mutiny on the Bounty 91 | The Maltese Falcon 92 | A Clockwork Orange 93 | Taxi Driver 94 | Wuthering Heights 95 | Double Indemnity 96 | Rebel Without a Cause 97 | Rear Window 98 | The Third Man 99 | North by Northwest 100 | Yankee Doodle Dandy 101 | -------------------------------------------------------------------------------- /doc_clustering.py: -------------------------------------------------------------------------------- 1 | 2 | # coding: utf-8 3 | 4 | # TL;DR 5 | # 6 | # Data: Top 100 movies (http://www.imdb.com/list/ls055592025/) with title, genre, and synopsis (IMDB and Wiki) 7 | # Goal: Put 100 movies into 5 clusters based on text mining 
their synopses 8 | 9 | # In[1]: 10 | 11 | import numpy as np 12 | import pandas as pd 13 | import nltk 14 | from nltk.stem.snowball import SnowballStemmer 15 | from bs4 import BeautifulSoup 16 | import re 17 | import os 18 | import codecs 19 | from sklearn import feature_extraction 20 | import mpld3 21 | 22 | 23 | # Read movie titles, 100 movies but somehow the last one is empty string 24 | 25 | # In[2]: 26 | 27 | # so that you need to use print() 28 | from __future__ import print_function 29 | titles = open('data/title_list.txt').read().split('\n') 30 | 31 | 32 | # In[3]: 33 | 34 | len(titles) 35 | 36 | 37 | # In[4]: 38 | 39 | titles[:10] 40 | 41 | 42 | # In[5]: 43 | 44 | titles = titles[:100] 45 | 46 | 47 | # Read Genres information 48 | 49 | # In[6]: 50 | 51 | genres = open('data/genres_list.txt').read().split('\n') 52 | genres = genres[:100] 53 | 54 | 55 | # In[7]: 56 | 57 | genres[0] 58 | 59 | 60 | # Read in the synopses from wiki 61 | 62 | # In[8]: 63 | 64 | synopses_wiki = open('data/synopses_list_wiki.txt').read().split('\n BREAKS HERE') 65 | 66 | 67 | # In[9]: 68 | 69 | len(synopses_wiki) 70 | 71 | 72 | # In[10]: 73 | 74 | synopses_wiki = synopses_wiki[:100] 75 | 76 | 77 | # In[11]: 78 | 79 | synopses_wiki[0] 80 | 81 | 82 | # strips html formatting and converts to unicode 83 | 84 | # In[12]: 85 | 86 | synopses_clean_wiki = [] 87 | for text in synopses_wiki: 88 | text = BeautifulSoup(text, 'html.parser').getText() 89 | synopses_clean_wiki.append(text) 90 | synopses_wiki = synopses_clean_wiki 91 | 92 | 93 | # In[13]: 94 | 95 | synopses_wiki[0] 96 | 97 | 98 | # Read synopses information from imdb, which might be different from wiki. Also cleaned as above. 
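The read-split-clean steps above can be tried without the data files. The following is a minimal sketch written for Python 3, assuming a made-up two-entry string in place of the real synopsis files. It swaps in the standard library's `html.parser` for BeautifulSoup so it has no dependencies, and the names `strip_html`, `raw`, `parts` and `cleaned` are hypothetical. It also shows why the script slices the lists to 100 items: a separator at the very end of the file leaves an empty last element after `split`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Made-up stand-in for the contents of a synopsis file (assumption, not real data).
raw = "<p>First <b>film</b></p>\n BREAKS HERE<p>Second film</p>\n BREAKS HERE"

# Splitting on the separator leaves an empty string after the final separator.
parts = raw.split("\n BREAKS HERE")

# Keep only the real entries (the script keeps the first 100) and strip the markup.
cleaned = [strip_html(p) for p in parts[:2]]

print(len(parts))
print(cleaned)
```

BeautifulSoup is more forgiving of malformed markup, so this stand-in is only meant to show the shape of the step, not to replace it.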
99 | 100 | # In[14]: 101 | 102 | synopses_imdb = open('data/synopses_list_imdb.txt').read().split('\n BREAKS HERE') 103 | synopses_imdb = synopses_imdb[:100] 104 | 105 | synopses_clean_imdb = [] 106 | 107 | for text in synopses_imdb: 108 | text = BeautifulSoup(text, 'html.parser').getText() 109 | synopses_clean_imdb.append(text) 110 | synopses_imdb = synopses_clean_imdb 111 | 112 | 113 | # In[15]: 114 | 115 | synopses_imdb[0] 116 | 117 | 118 | # Combine synopses from wiki and imdb 119 | 120 | # In[16]: 121 | 122 | synopses = [] 123 | for i in range(len(synopses_wiki)): 124 | item = synopses_wiki[i] + synopses_imdb[i] 125 | synopses.append(item) 126 | 127 | 128 | # In[17]: 129 | 130 | synopses[0] 131 | 132 | 133 | # In[18]: 134 | 135 | print(str(len(titles)) + ' titles') 136 | print(str(len(genres)) + ' genres') 137 | print(str(len(synopses)) + ' synopses') 138 | 139 | 140 | # In[19]: 141 | 142 | # generates an index for each item in the corpus (in this case it's just the rank); I'll use this for scoring later 143 | # the movies in the list are already ranked from 1 to 100 144 | ranks = [] 145 | for i in range(1, len(titles)+1): 146 | ranks.append(i) 147 | 148 | 149 | # In[20]: 150 | 151 | # load nltk's English stopwords as a variable called 'stopwords' 152 | # use nltk.download() to install the corpus first 153 | # stop words are very common words (e.g. "the", "is") that carry little meaning for search or text mining 154 | stopwords = nltk.corpus.stopwords.words('english') 155 | 156 | # load nltk's SnowballStemmer as a variable called 'stemmer' 157 | stemmer = SnowballStemmer("english") 158 | 159 | 160 | # In[21]: 161 | 162 | len(stopwords) 163 | 164 | 165 | # In[22]: 166 | 167 | stopwords 168 | 169 | 170 | # In[23]: 171 | 172 | sents = [sent for sent in nltk.sent_tokenize("Today (May 19, 2016) is his only daughter's wedding. Vito Corleone is the Godfather. 
Vito's youngest son, Michael, in a Marine Corps uniform, introduces his girlfriend, Kay Adams, to his family at the sprawling reception.")] 173 | 174 | 175 | # In[24]: 176 | 177 | sents 178 | 179 | 180 | # In[25]: 181 | 182 | words = [word for word in nltk.word_tokenize(sents[0])] 183 | words 184 | 185 | 186 | # In[26]: 187 | 188 | # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation) 189 | filtered_words = [] 190 | for word in words: 191 | if re.search('[a-zA-Z]', word): 192 | filtered_words.append(word) 193 | filtered_words 194 | 195 | 196 | # In[27]: 197 | 198 | # see how "only" is stemmed to "onli" and "wedding" is stemmed to "wed" 199 | stems = [stemmer.stem(t) for t in filtered_words] 200 | stems 201 | 202 | 203 | # In[28]: 204 | 205 | # here I define a tokenizer and stemmer which returns the list of stems in the text that it is passed 206 | # Punkt Sentence Tokenizer, sent means sentence 207 | def tokenize_and_stem(text): 208 | # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token 209 | tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)] 210 | filtered_tokens = [] 211 | # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation) 212 | for token in tokens: 213 | if re.search('[a-zA-Z]', token): 214 | filtered_tokens.append(token) 215 | stems = [stemmer.stem(t) for t in filtered_tokens] 216 | return stems 217 | 218 | 219 | # In[29]: 220 | 221 | 222 | def tokenize_only(text): 223 | # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token 224 | tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)] 225 | filtered_tokens = [] 226 | # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation) 227 | for token in tokens: 228 | if re.search('[a-zA-Z]', token): 229 | filtered_tokens.append(token) 230 | return 
filtered_tokens 231 | 232 | 233 | # In[30]: 234 | 235 | words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.") 236 | print(words_stemmed) 237 | 238 | 239 | # In[31]: 240 | 241 | words_only = tokenize_only("Today (May 19, 2016) is his only daughter's wedding.") 242 | words_only 243 | 244 | 245 | # Below I use my tokenize-and-stem and tokenize-only functions to iterate over the list of synopses and create two vocabularies: one stemmed and one only tokenized 246 | 247 | # In[32]: 248 | 249 | # extend vs. append 250 | a = [1, 2] 251 | b = [3, 4] 252 | c = [5, 6] 253 | b.append(a) # append adds the list itself as a single element 254 | c.extend(a) # extend adds each element of the list 255 | print(b) # [3, 4, [1, 2]] 256 | print(c) # [5, 6, 1, 2] 257 | 258 | 259 | # In[33]: 260 | 261 | totalvocab_stemmed = [] 262 | totalvocab_tokenized = [] 263 | for i in synopses: 264 | allwords_stemmed = tokenize_and_stem(i) # for each item in 'synopses', tokenize/stem 265 | totalvocab_stemmed.extend(allwords_stemmed) # extend the 'totalvocab_stemmed' list 266 | 267 | allwords_tokenized = tokenize_only(i) 268 | totalvocab_tokenized.extend(allwords_tokenized) 269 | 270 | 271 | # In[34]: 272 | 273 | print(len(totalvocab_stemmed)) 274 | print(len(totalvocab_tokenized)) 275 | 276 | 277 | # In[35]: 278 | 279 | vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed) 280 | print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame') 281 | print(vocab_frame.head()) 282 | 283 | 284 | # In[36]: 285 | 286 | words_frame = pd.DataFrame({'WORD': words_only}, index = words_stemmed) 287 | print('there are ' + str(words_frame.shape[0]) + ' items in words_frame') 288 | print(words_frame) 289 | 290 | 291 | # Generate TF-IDF matrix (see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) 292 | # 293 | # max_df: this is the maximum document frequency a given feature can have to be used in the tf-idf matrix. 
If the term is in more than 80% of the documents it probably carries little meaning (in the context of film synopses) 294 | # 295 | # min_df: this could be an integer (e.g. 5) and the term would have to be in at least 5 of the documents to be considered. Here I pass 0.2; the term must be in at least 20% of the documents. I found that if I allowed a lower min_df I ended up basing clustering on names--for example "Michael" or "Tom" are names found in several of the movies and the synopses use these names frequently, but the names carry no real meaning. 296 | # 297 | # ngram_range: this just means I'll look at unigrams, bigrams and trigrams 298 | 299 | # In[37]: 300 | 301 | # Note that the result of this block takes a while to show 302 | from sklearn.feature_extraction.text import TfidfVectorizer 303 | 304 | #define vectorizer parameters 305 | tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, 306 | min_df=0.2, stop_words='english', 307 | use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)) 308 | 309 | get_ipython().magic(u'time tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses') 310 | 311 | # (100, 563) means the matrix has 100 rows (synopses) and 563 columns (terms) 312 | print(tfidf_matrix.shape) 313 | terms = tfidf_vectorizer.get_feature_names() 314 | len(terms) 315 | 316 | 317 | # In[40]: 318 | 319 | from sklearn.metrics.pairwise import cosine_similarity 320 | # A short example using the sentences above 321 | words_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, 322 | min_df=0.2, stop_words='english', 323 | use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)) 324 | 325 | get_ipython().magic(u'time words_matrix = words_vectorizer.fit_transform(sents) #fit the vectorizer to the example sentences') 326 | 327 | # (2, 18) means the matrix has 2 rows (two sentences) and 18 columns (18 terms) 328 | print(words_matrix.shape) 329 | print(words_matrix) 330 | 331 | # this is how we get the 18 terms 332 | analyze = 
words_vectorizer.build_analyzer() 333 | print(analyze("Today (May 19, 2016) is his only daughter's wedding.")) 334 | print(analyze("Vito Corleone is the Godfather.")) 335 | print(analyze("Vito's youngest son, Michael, in a Marine Corps uniform, introduces his girlfriend, Kay Adams, to his family at the sprawling reception.")) 336 | all_terms = words_vectorizer.get_feature_names() 337 | print(all_terms) 338 | print(len(all_terms)) 339 | 340 | # sentences 1 and 2 have similarity 0, sentences 1 and 3 share "his", sentences 2 and 3 share "Vito" - try changing "Vito's" in sentence 3 to "His" and see how the similarity matrix changes 341 | example_similarity = cosine_similarity(words_matrix) 342 | example_similarity 343 | 344 | 345 | # Now onto the fun part. Using the tf-idf matrix, you can run a slew of clustering algorithms to better understand the hidden structure within the synopses. I first chose k-means. K-means initializes with a pre-determined number of clusters (I chose 5). Each observation is assigned to a cluster (cluster assignment) so as to minimize the within-cluster sum of squares. Next, the mean of the clustered observations is calculated and used as the new cluster centroid. Then, observations are reassigned to clusters and centroids recalculated in an iterative process until the algorithm reaches convergence. 346 | # 347 | # I found it took several runs for the algorithm to converge to a global optimum, as k-means is susceptible to reaching local optima - open question: how do we decide that the algorithm has converged? 348 | 349 | # In[41]: 350 | 351 | from sklearn.cluster import KMeans 352 | 353 | num_clusters = 5 354 | 355 | km = KMeans(n_clusters=num_clusters) 356 | 357 | get_ipython().magic(u'time km.fit(tfidf_matrix)') 358 | 359 | clusters = km.labels_.tolist() 360 | 361 | 362 | # I use joblib.dump to pickle the model once it has converged, and joblib.load to reload the model and reassign the labels as the clusters.
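The k-means loop described above, and the question of when it has converged, can be made concrete with a small sketch. This is a pure-Python illustration written for Python 3, not scikit-learn's implementation. The function `kmeans`, the four toy `points`, and the two starting centroid sets are all assumptions chosen for the example. It treats "no point changed cluster" as convergence and uses the within-cluster sum of squares (what scikit-learn exposes as `inertia_`) to compare runs.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Plain Lloyd iteration; returns (labels, centroids, inertia)."""
    centroids = [list(c) for c in init_centroids]  # copy so the input is not mutated
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # update step: each centroid becomes the mean of its members
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    # within-cluster sum of squares for the final assignment
    inertia = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return labels, centroids, inertia


# Four corners of a wide rectangle: the natural split is left pair vs right pair.
points = [[0, 0], [0, 1], [4, 0], [4, 1]]

# A poor start converges to a local optimum (bottom pair vs top pair).
bad_labels, _, bad_inertia = kmeans(points, [[2, 0], [2, 1]])

# A better start finds the left/right split with much lower inertia.
good_labels, _, good_inertia = kmeans(points, [[0, 0], [4, 1]])

# Keeping the run with the lowest inertia is the usual way to pick among restarts.
best = min([(bad_inertia, bad_labels), (good_inertia, good_labels)])
print(bad_labels, bad_inertia)
print(good_labels, good_inertia)
print(best)
```

Both runs "converge" in the sense that assignments stop changing, which is why convergence alone does not guarantee a good clustering. scikit-learn's `KMeans` handles this with its `n_init` parameter, which repeats the fit from several starting points and keeps the result with the lowest inertia.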
363 | 364 | # In[42]: 365 | 366 | from sklearn.externals import joblib 367 | 368 | # save the fitted model to disk so it can be reloaded without refitting 369 | # (comment out the dump below to load a previously saved model instead) 370 | 371 | joblib.dump(km, 'doc_cluster.pkl') 372 | 373 | km = joblib.load('doc_cluster.pkl') 374 | clusters = km.labels_.tolist() 375 | # clusters shows which cluster (0-4) each of the 100 synopses belongs to 376 | print(len(clusters)) 377 | print(clusters) 378 | 379 | 380 | # Here, I create a dictionary of titles, ranks, the synopsis, the cluster assignment, and the genre [rank and genre were scraped from IMDB]. 381 | # I convert this dictionary to a Pandas DataFrame for easy access. I'm a huge fan of Pandas and recommend taking a look at some of its awesome functionality which I'll use below, but not describe in a ton of detail. 382 | 383 | # In[43]: 384 | 385 | films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres } 386 | 387 | frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre']) 388 | 389 | print(frame) # the index is the cluster label; the 'rank' column runs from 1 to 100 390 | 391 | frame['cluster'].value_counts() #number of films per cluster (clusters from 0 to 4) 392 | 393 | 394 | # In[44]: 395 | 396 | grouped = frame['rank'].groupby(frame['cluster']) # groupby cluster for aggregation purposes 397 | 398 | grouped.mean() # average rank (1 to 100) per cluster 399 | 400 | 401 | # Note that clusters 4 and 0 have the lowest average rank, which indicates that they, on average, contain films that were ranked as "better" on the top 100 list. 402 | # Here is some fancy indexing and sorting on each cluster to identify the top n (I chose n=6) words that are nearest to the cluster centroid. This gives a good sense of the main topic of the cluster.
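The "fancy indexing" mentioned above comes down to sorting the term indices of each centroid by descending weight and keeping the first n. Here is a minimal pure-Python sketch of that idea, written for Python 3. The helper `top_terms`, the five stemmed `terms`, and the centroid weights are all made up for illustration and are not output from the real model.

```python
def top_terms(centroid, terms, n=6):
    """Return the n terms with the largest weights in one cluster centroid."""
    # indices of the centroid's components, largest weight first
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:n]]


# Hypothetical vocabulary and centroid weights (illustrative values only).
terms = ["famili", "war", "polic", "love", "kill"]
centroids = [
    [0.40, 0.30, 0.05, 0.10, 0.02],
    [0.01, 0.20, 0.50, 0.03, 0.45],
]

for i, centroid in enumerate(centroids):
    print("Cluster %d words: %s" % (i, ", ".join(top_terms(centroid, terms, n=3))))
```

In the script the same ordering is produced for all clusters at once with `km.cluster_centers_.argsort()[:, ::-1]`, and `vocab_frame` is then used to map each stem back to a readable word.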
403 | 404 | # In[45]: 405 | 406 | from __future__ import print_function 407 | 408 | print("Top terms per cluster:") 409 | 410 | # for each cluster center, sort the term indices by descending weight 411 | order_centroids = km.cluster_centers_.argsort()[:, ::-1] 412 | 413 | for i in range(num_clusters): 414 | print("Cluster %d words:" % i, end='') 415 | 416 | for ind in order_centroids[i, :6]: #replace 6 with n words per cluster 417 | print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',') 418 | print() #add whitespace 419 | print() #add whitespace 420 | 421 | print("Cluster %d titles:" % i, end='') 422 | for title in frame.ix[i]['title'].values.tolist(): 423 | print(' %s,' % title, end='') 424 | print() #add whitespace 425 | print() #add whitespace 426 | 427 | 428 | # Cosine similarity is measured against the tf-idf matrix and can be used to generate a measure of similarity between each document and the other documents in the corpus (each synopsis among the synopses). A cosine similarity of 1 means the two documents have the same term profile; 0 means they share no terms. dist (named similarity_distance in the code below) is defined as 1 - the cosine similarity of each pair of documents. Subtracting the similarity from 1 gives the cosine distance, which I will use for plotting on a Euclidean (2-dimensional) plane. 429 | # Note that with dist it is possible to evaluate the similarity of any two or more synopses. 430 | 431 | # In[46]: 432 | 433 | 434 | similarity_distance = 1 - cosine_similarity(tfidf_matrix) 435 | print(type(similarity_distance)) 436 | print(similarity_distance.shape) 437 | 438 | 439 | # Multidimensional scaling 440 | # Here is some code to convert the dist matrix into a 2-dimensional array using multidimensional scaling. I won't pretend I know a ton about MDS, but it was useful for this purpose. Another option would be to use principal component analysis.
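The distance matrix that feeds MDS is just 1 minus the cosine similarity of every pair of rows. As a minimal sketch of that calculation, written for Python 3 and using plain lists instead of the sparse tf-idf matrix, the helpers `cosine_sim` and `distance_matrix` and the three toy vectors below are assumptions for illustration only.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Pairwise cosine distance: 1 - cosine similarity for every pair of rows."""
    return [[1.0 - cosine_sim(u, v) for v in vectors] for u in vectors]


# Toy term-weight vectors: the first two point the same way, the third shares no terms.
docs = [
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
]

dist = distance_matrix(docs)
for row in dist:
    print([round(d, 3) for d in row])
```

Because cosine similarity ignores vector length, the first two rows are at distance (approximately) zero even though one is twice the other, while rows with no terms in common are at distance 1. The resulting square, symmetric matrix is the kind of input `MDS(dissimilarity="precomputed")` expects.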
441 | 442 | # In[47]: 443 | 444 | import os # for os.path.basename 445 | 446 | import matplotlib.pyplot as plt 447 | import matplotlib as mpl 448 | 449 | from sklearn.manifold import MDS 450 | 451 | # reduce to two components as we're plotting points in a two-dimensional plane 452 | # "precomputed" because we provide a distance matrix 453 | # we will also specify `random_state` so the plot is reproducible. 454 | mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1) 455 | 456 | get_ipython().magic(u'time pos = mds.fit_transform(similarity_distance) # shape (n_samples, n_components)') 457 | 458 | print(pos.shape) 459 | print(pos) 460 | 461 | xs, ys = pos[:, 0], pos[:, 1] 462 | print(type(xs)) 463 | xs 464 | 465 | 466 | # Visualizing document clusters 467 | # In this section, I demonstrate how you can visualize the document clustering output using matplotlib and mpld3 (a matplotlib wrapper for D3.js). 468 | # First I define some dictionaries for going from cluster number to color and to cluster name. I based the cluster names off the words that were closest to each cluster centroid. 469 | 470 | # In[48]: 471 | 472 | #set up colors per cluster using a dict 473 | cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e'} 474 | 475 | #set up cluster names using a dict 476 | cluster_names = {0: 'Family, home, war', 477 | 1: 'Police, killed, murders', 478 | 2: 'Father, New York, brothers', 479 | 3: 'Dance, singing, love', 480 | 4: 'Killed, soldiers, captain'} 481 | 482 | 483 | # Next, I plot the labeled observations (films, film titles) colored by cluster using matplotlib. I won't get into too much detail about the matplotlib plot, but I tried to provide some helpful commenting.
484 | 485 | # In[49]: 486 | 487 | #some ipython magic to show the matplotlib plots inline 488 | get_ipython().magic(u'matplotlib inline') 489 | 490 | #create data frame that has the result of the MDS plus the cluster numbers and titles 491 | df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles)) 492 | 493 | print(df[1:10]) 494 | # group by cluster 495 | # this generates {name: group}, where each group is a dataframe 496 | groups = df.groupby('label') 497 | print(groups.groups) 498 | 499 | 500 | # set up plot 501 | fig, ax = plt.subplots(figsize=(17, 9)) # set size 502 | # ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling 503 | 504 | #iterate through groups to layer the plot 505 | #note that I use the cluster_names and cluster_colors dicts with the 'name' lookup to return the appropriate color/label 506 | # ms: marker size 507 | for name, group in groups: 508 | print("*******") 509 | print("group name " + str(name)) 510 | print(group) 511 | ax.plot(group.x, group.y, marker='o', linestyle='', ms=20, 512 | label=cluster_names[name], color=cluster_colors[name], 513 | mec='none') 514 | ax.set_aspect('auto') 515 | ax.tick_params( axis= 'x', # changes apply to the x-axis 516 | which='both', # both major and minor ticks are affected 517 | bottom='off', # ticks along the bottom edge are off 518 | top='off', # ticks along the top edge are off 519 | labelbottom='off') 520 | ax.tick_params( axis= 'y', # changes apply to the y-axis 521 | which='both', # both major and minor ticks are affected 522 | left='off', # ticks along the left edge are off 523 | right='off', # ticks along the right edge are off 524 | labelleft='off') 525 | 526 | ax.legend(numpoints=1) #show legend with only 1 point 527 | 528 | #add label in x,y position with the label as the film title 529 | for i in range(len(df)): 530 | ax.text(df.ix[i]['x'], df.ix[i]['y'], df.ix[i]['title'], size=10) 531 | 532 | 533 | 534 | plt.show() #show the plot 535 | 536 | #uncomment the below to save the plot if need 
be 537 | #plt.savefig('clusters_small_noaxes.png', dpi=200) 538 | 539 | 540 | # Use plotly to generate interactive chart. I have to downgrade matplotlib to 1.3.1 for this chart to work with plotly. see https://github.com/harrywang/plotly/blob/master/README.md for how to setup plotly. After running the following, a browser will open to show the plotly chart. 541 | 542 | # In[50]: 543 | 544 | # import plotly.plotly as py 545 | # plot_url = py.plot_mpl(fig) 546 | 547 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | alabaster==0.7.6 2 | Babel==1.3 3 | backports.ssl-match-hostname==3.4.0.2 4 | beautifulsoup4==4.3.2 5 | boto==2.38.0 6 | bz2file==0.98 7 | certifi==2015.4.28 8 | cufflinks==0.7.2 9 | docutils==0.12 10 | functools32==3.2.3.post1 11 | gensim==0.11.1.post1 12 | gnureadline==6.3.3 13 | ipython==3.2.0 14 | Jinja2==2.7.3 15 | jsonschema==2.5.1 16 | MarkupSafe==0.23 17 | matplotlib==1.3.1 18 | mistune==0.6 19 | mock==1.0.1 20 | mpld3==0.2 21 | nltk==3.0.3 22 | nose==1.3.7 23 | numpy==1.9.2 24 | numpydoc==0.5 25 | pandas==0.16.2 26 | plotly==1.10.0 27 | ptyprocess==0.5 28 | Pygments==2.0.2 29 | pyparsing==2.0.3 30 | python-dateutil==2.4.2 31 | pytz==2015.4 32 | pyzmq==14.7.0 33 | requests==2.7.0 34 | scikit-learn==0.16.1 35 | scipy==0.15.1 36 | six==1.9.0 37 | smart-open==1.2.1 38 | snowballstemmer==1.2.0 39 | Sphinx==1.3.1 40 | sphinx-rtd-theme==0.1.8 41 | terminado==0.5 42 | tornado==4.2 43 | --------------------------------------------------------------------------------