├── .gitignore
├── .gitmodules
├── LICENSE
├── Makefile
├── README.rst
├── index_template.html
├── out
│   └── ep16
│       ├── europython_logo.png
│       └── gael_simple.png
└── topics_extraction.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 |
27 | # PyInstaller
28 | # Usually these files are written by a python script from a template
29 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 |
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 |
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 |
48 | # Translations
49 | *.mo
50 | *.pot
51 |
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 |
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 |
60 | # Scrapy stuff:
61 | .scrapy
62 |
63 | # Sphinx documentation
64 | docs/_build/
65 |
66 | # PyBuilder
67 | target/
68 |
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 |
72 | # pyenv
73 | .python-version
74 |
75 | # celery beat schedule file
76 | celerybeat-schedule
77 |
78 | # dotenv
79 | .env
80 |
81 | # virtualenv
82 | venv/
83 | ENV/
84 |
85 | # Spyder project settings
86 | .spyderproject
87 |
88 | # Rope project settings
89 | .ropeproject
90 |
--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "github-pages-publish"]
2 | path = github-pages-publish
3 | url = git@github.com:rafaelmartins/github-pages-publish.git
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2016, Gael Varoquaux
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without
5 | modification, are permitted provided that the following conditions are met:
6 |
7 | * Redistributions of source code must retain the above copyright notice, this
8 | list of conditions and the following disclaimer.
9 |
10 | * Redistributions in binary form must reproduce the above copyright notice,
11 | this list of conditions and the following disclaimer in the documentation
12 | and/or other materials provided with the distribution.
13 |
14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
15 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
16 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
17 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
18 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
19 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
20 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
21 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
22 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
23 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
24 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 |
2 | all: html
3 |
4 | html: topics_extraction.py
5 | 	python topics_extraction.py
6 |
7 | install: html
8 | 	python github-pages-publish/github-pages-publish . out/
9 | 	git push origin gh-pages
10 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 |
2 | ====================================================
3 | Topic modelling from EuroPython's list of abstracts
4 | ====================================================
5 |
6 | This is the code to produce a list of topics from abstracts downloaded
7 | from the conference website.
8 |
9 | The different steps and corresponding modules are:
10 |
11 | * **Web scraping** to retrieve the abstracts, based on `beautifulsoup4`
12 | and `urllib2`.
13 |
14 | `joblib` is also useful for caching, to avoid multiple crawls of the
15 | websites and downloads.
16 |
17 | I could have asked the organizers for access to a dump of the database,
18 | but it was more fun to crawl.
19 |
20 | * **Stemming**: trying to convert plural words to singular, using `NLTK`.
21 |
22 | Note that general-purpose stemmers are more aggressive and reduce words
23 | to their roots, such as 'organization' -> 'organ'. To keep the word
24 | clouds understandable, we want to preserve more differentiation. Hence
25 | we add a custom layer to reduce the power of the stemmer.
26 |
27 | * **Topic modelling** with `scikit-learn`.
28 |
29 | It's a two-step process: first we convert the text data to a numerical
30 | representation ("vectorizing"); second we use a Non-negative Matrix
31 | Factorization to extract "topics" from it.
32 |
33 | * **Word-cloud figures** with the `wordcloud` module.
34 |
35 | * **Creating a webpage** with `tempita`.
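
The stemming step above can be illustrated with a small sketch. This is not the project's code and does not use `NLTK`: it is a dependency-free stand-in that only maps plurals to singulars, which is the behaviour the README says the custom layer aims for. The function name `light_stem`, the suffix rules, and the exception list `KEEP_AS_IS` are all assumptions made for illustration.

```python
# Illustrative stand-in for a "reduced-power" stemmer (assumption: not the
# project's implementation, and NLTK is deliberately not used here).

# Hypothetical list of words that end in "s" but are not plurals.
KEEP_AS_IS = {"analysis", "pandas", "series", "class", "process", "status"}


def light_stem(word):
    """Map a plural English word to its singular; leave other words intact."""
    lower = word.lower()
    if lower in KEEP_AS_IS or len(lower) <= 3:
        return lower
    if lower.endswith("ies"):
        return lower[:-3] + "y"      # "libraries" -> "library"
    if lower.endswith(("sses", "xes")):
        return lower[:-2]            # "classes" -> "class", "boxes" -> "box"
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return lower[:-1]            # "topics" -> "topic"
    return lower


# Unlike a full stemmer, this keeps "organization" whole.
print([light_stem(w) for w in ["talks", "libraries", "organization"]])
# -> ['talk', 'library', 'organization']
```

A handful of suffix rules like these will mishandle irregular plurals; the point is only to show how a lighter rule set keeps words readable for word clouds.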
36 |
37 | ___
38 |
39 |
40 | This application beautifully combines multiple facets of the Python
41 | ecosystem, from web tools to PyData.
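
As a rough illustration of the two-step topic-modelling process described in the list above, here is a minimal pure-Python sketch. It is not the project's code: the README says the real pipeline uses `scikit-learn`, whereas this sketch assumes plain word counts for the vectorizing step and the classic multiplicative-update rule for the Non-negative Matrix Factorization. The helper names and the four toy documents are invented for the example.

```python
import random
from collections import Counter


def vectorize(documents):
    """Step 1: build a term-count matrix (rows: documents, columns: words)."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[float(c[word]) for word in vocabulary] for c in counts]
    return matrix, vocabulary


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, i.e. how well the factors reconstruct V."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Step 2: factorize V ~ W H with non-negative W and H."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_words)]
             for i in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_words(h, vocabulary, n=3):
    """Return the n highest-weighted words of each topic (each row of H)."""
    return [[vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:n]]
            for row in h]


# Toy "abstracts", invented for this example.
docs = [
    "python web framework django web",
    "django web server python",
    "numpy array data science",
    "data science pandas numpy array",
]
matrix, vocabulary = vectorize(docs)
w, h = nmf(matrix, n_topics=2)
print(top_words(h, vocabulary))
```

Each row of `h` weights the vocabulary for one topic, and each row of `w` says how strongly a document expresses each topic; the top-weighted words of a row of `h` are what a word cloud for that topic would emphasize.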
42 |
43 |
--------------------------------------------------------------------------------
/index_template.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |