├── README.md ├── goa_symposium.pdf ├── gsoc ├── 2016 │ └── gensim-2016.md └── 2017 │ ├── CLiPS-2017.md │ ├── papers.md │ └── pymc3-2017.md └── notebooks ├── README.md ├── clustering_classing.ipynb ├── computational_social_science ├── README.md ├── REQUIREMENTS.txt ├── bend_it_like_beckham.txt ├── computational_social_science_tutorial.ipynb └── computational_social_science_tutorial_unrun.ipynb ├── gensim ├── Dynamic Topic Model.png ├── Monkey Brains New.png ├── Monkey Brains.png ├── distance_metrics.ipynb ├── dtm_example.ipynb ├── ldaseqmodel.ipynb └── topic_methods.ipynb ├── manifolds.ipynb ├── metric-learn └── metric_plotting.ipynb ├── metrics.ipynb ├── pycobra ├── regression.ipynb └── visualise.ipynb ├── text_analysis ├── clustering_classing.ipynb └── word2vec.ipynb └── text_analysis_tutorial ├── README.md ├── REQUIREMENTS.txt ├── REQUIREMENTS_1.txt ├── Weights ├── weights-improvement-01-4.3050.hdf5 ├── weights-improvement-01-4.3902.hdf5 ├── weights-improvement-02-4.0229.hdf5 └── weights-improvement-02-4.3093.hdf5 ├── text_analysis_tutorial.ipynb ├── text_analysis_tutorial_unrun.ipynb ├── topic_modelling.ipynb └── topic_modelling_unrun.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Personal Material 2 | 3 | Contains personal material which is to share - Google Summer of Code proposals, Jupyter notebooks I have helped contribute to in the past, and new notebooks/code snippets I have been working on. -------------------------------------------------------------------------------- /goa_symposium.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/goa_symposium.pdf -------------------------------------------------------------------------------- /gsoc/2016/gensim-2016.md: -------------------------------------------------------------------------------- 1 | # Title 2 | 3 | Dynamic Topic Model for Gensim. 4 | ## Abstract 5 | 6 | Dynamic Topic Models [\[1\]](http://www.stat.uchicago.edu/~lafferty/pdf/dtm.pdf) [\[2\]](https://en.wikipedia.org/wiki/Dynamic_topic_model) are used to model the evolution of topics in a corpus, over time. The Dynamic Topic Model is part of a class of probabilistic topic models, and unlike the previous models developed [\[3\]](https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf), takes into account time-series data. The idea behind this project proposal is to implement Dynamic Topic Models while taking care of Gensim's philosophy of being memory-independent and robust. 7 | 8 | There is already an academic implementation available in C++ [\[4\]](https://code.google.com/archive/p/princeton-statistical-learning/downloads), but it is not well-supported and the python wrapper already implemented [\[5\]](https://radimrehurek.com/gensim/models/wrappers/dtmmodel.html) uses this rather flimsy academic implementation. Completing this project would be of great use in the industry, and in academia [\[6\]](http://nlp.stanford.edu/pubs/hall-emnlp08.pdf) , [\[7\]](http://jmlr.org/proceedings/papers/v28/shalit13.pd) where having a quick implementation of Dynamic Topic Models would do wonders, especially in the Humanities [\[8\]](http://sappingattention.blogspot.in/2013/01/keeping-words-in-topic-models.html) where evolution of topics in literature and sociology is heavily studied. 
9 | 10 | This work can be extended to include Document Influence Models [\[9\]](https://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf), where the influence of certain documents affects topic evolution. My personal interest in this project is motivated by my extensive use of Dynamic Topic Models in my research work on tracking the evolution of Computer Science research topics on academic graph data sets, and by my experience working with various topic models [\[10\]](https://github.com/santonus/bigscholarlydata). 11 | 12 | ## Technical Details 13 | 14 | Dynamic Topic Models were first described academically here [\[1\]](http://www.stat.uchicago.edu/~lafferty/pdf/dtm.pdf). One of the main tenets of the Dynamic Topic Model is that the topics at time `t` have evolved from the topics at time `t-1`. To make this happen, there is a relationship between the hyperparameters associated with each time slice. All of this is a stark departure from a typical LDA implementation, and the challenges/differences in DTM start making themselves clear now. 15 | 16 | For starters, we can't sample documents and the topics they've evolved from the way we normally do. In typical language modeling applications, Dirichlet distributions are used to model uncertainty about the distributions over words. However, the Dirichlet is not amenable to sequential modeling. Blei's paper describes the models for drawing the parameters (`α`,`β`) and the topics. As described in the paper - `"Our approach is thus to model sequences of compositional random variables by chaining Gaussian distributions in a dynamic model and mapping the emitted values to the simplex." ` 17 | How this works into our code will be a topic of discussion. 18 | 19 | We are soon faced with another problem - `"Working with time series over the natural parameters enables the use of Gaussian models for the time dynamics; however, due to the nonconjugacy of the Gaussian and multinomial models, posterior inference is intractable."` The authors go on to describe a variational method for approximate posterior inference, suggesting Kalman Filtering [\[11\]](https://en.wikipedia.org/wiki/Kalman_filter) and Wavelet Regression as ways to go about this. 20 | 21 | Other things to keep in mind while going about this project are Gensim's philosophy of not loading the entire corpus into memory, and addressing the current lack of timed-stream functionality. 22 | 23 | ## Schedule of Deliverables 24 | 25 | ### Community Bonding Period 26 | 27 | - A huge part of being part of the open-source community and Google Summer of Code is sharing your ideas and work. I've started a blog [here](https://topicmodel2016.wordpress.com/) to give weekly reports of the summer work I will be doing, and to try and break down the world of probabilistic topic models for everyone. 28 | 29 | - Before the official time period begins I intend to fix some of the outstanding issues [\[#113\]](https://github.com/piskvorky/gensim/issues/113) and add some basic functionality such as [\[#64\]](https://github.com/piskvorky/gensim/issues/64), among others. I have begun work on #64, and after finishing Hellinger distance I will add more similarity metrics to `matutils.py`, such as the Jaccard Coefficient and other metrics described in [this](http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf) paper. 
[This paper](http://bib.dbvis.de/uploadedFiles/155.pdf) is also a good resource for high-dimensional distance metrics, and introduces a fractional distance norm which can be considered (a small NumPy sketch of these metrics is included after the schedule below). 30 | 31 | ### May 25th - June 7th 32 | 33 | - Work on sampling the documents, words, and topics according to the Gaussian state space model described above. 34 | 35 | 36 | ### June 8th - June 21st 37 | 38 | - Start with tests to make sure the sampling methods work correctly. 39 | - Figure out how to work around the posterior inference approximation, and add Kalman Filtering or Wavelet Regression functionality if needed. 40 | - By this point the basic building blocks of DTM should be coming together. 41 | 42 | ### June 22nd - July 5th 43 | 44 | - Engineer time slices into our existing system and test the above results over a range of slices. 45 | - Since time slices have not been introduced in Gensim before, we might need to spend some time working on 46 | how to build our API for this. 47 | - Keep working on the blog as well! 48 | 49 | ### July 6th - July 19th 50 | 51 | - By now we should have a crude representation of the Dynamic Topic Model set up. 52 | - Put our building blocks together and make sure the results are at least on par with the current gensim wrapper. We will use [\[12\]](http://arxiv.org/pdf/1510.03797.pdf) as the corpus to benchmark the two implementations. There is some pre-processing code for the same dataset on [this](https://github.com/ashishbaghudana/dtm) repo. 53 | 54 | 55 | ### July 20th - August 2nd 56 | 57 | - Now that we have our Dynamic Topic Model, it's time to test, fine-tune, and give insights. 58 | - The papers [\[13\]](http://ceur-ws.org/Vol-974/lakdatachallenge2013_01.pdf) and [\[14\]](http://jmlr.org/proceedings/papers/v28/shalit13.pdf) also use Dynamic Topic Modelling - an effort can be made to procure these corpora as well, to have multiple benchmarks of the performance of our code. 59 | - After tuning on different scenarios and the corpora the above references use, we can see how this implementation holds up against standard previous implementations such as [\[4\]](https://code.google.com/archive/p/princeton-statistical-learning/downloads). 60 | 61 | ### August 3rd - August 16th 62 | 63 | - At this point documentation will start. One of the ideas is to start making an IPython notebook tutorial as well. 64 | - Work on possible bugs and problems our implementation will have at this point. Make sure all edge and corner cases are handled. 65 | - As mentioned in the project deliverables, time, memory usage and accuracy are what we will primarily concern ourselves with at this point. 66 | 67 | ### August 17th - August 21st 19:00 UTC 68 | 69 | - By now there will be a good amount of insight into how our model works, along with knowledge of how to select parameters and the number of topics for different scenarios. 70 | - Documentation of the same, cleaning, and seeing how we can set up our code to run on multiple cores (maybe on Spark or TensorFlow) should be discussed, if not completed. 71 | - The last few days will be spent scrubbing and finishing documentation, and wrapping up the IPython tutorial as well. 72 | 73 | ## Future works 74 | 75 | The project timeline described above does not include an implementation across multiple cores, but an extension to include this can be attempted. Ideas used in the Dynamic Topic Model are also used in the Document Influence Model, and this could be a direction to go in after the Dynamic Topic Model is implemented. 
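As a concrete reference for the similarity-metric work planned during the Community Bonding Period, here is a minimal NumPy sketch of the Hellinger distance and Jaccard coefficient on dense topic-word distributions. It only illustrates the formulas - the function names and toy inputs are my own, and the eventual Gensim versions would also need to handle sparse bag-of-words vectors.

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def jaccard(a, b):
    # Jaccard coefficient between two sets of tokens (e.g. top words of two topics)
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# toy example: two topic-word distributions over a four-word vocabulary
print(hellinger([0.6, 0.2, 0.1, 0.1], [0.3, 0.3, 0.2, 0.2]))
print(jaccard(["match", "goal", "team"], ["goal", "team", "league"]))
```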
76 | 77 | ## Open Source Development Experience 78 | 79 | My first tryst with Open Source Development was when I was learning to program through [Code Combat](https://codecombat.com/), a popular game used to learn Python and JavaScript. While playing, I stumbled across an error [\[#25\]](https://github.com/differentmatt/filbert/issues/25) in the sorting function. This had me stuck, so I decided to dive in and fix the bug [\[#42\]](https://github.com/differentmatt/filbert/pull/42) - it was quite nice to see my contribution in a game I enjoyed playing. Since then, I have largely been working on data science research and have stepped away from open source contribution, but some of this academic work has been made public [\[15\]](https://github.com/bhargavvader/CASApythonPort), and I hope it has helped those porting MATLAB to Python for work in Blind Source Separation. Since becoming interested in contributing to Gensim, I have been in regular touch with my mentors and have started working on a few issues to get more familiar with the code base [\[#64\]](https://github.com/piskvorky/gensim/issues/64). 80 | I look forward to getting back to regularly contributing to the Open Source and Machine Learning community. 81 | 82 | ## Academic Experience 83 | 84 | I am a 3rd year Undergraduate Computer Science student enrolled at BITS Pilani University, India. I have undertaken a variety of Data Science projects and coursework, the most recent being analyzing the growth of topics in Computer Science research areas using the [Microsoft Academic Graph](http://research.microsoft.com/en-us/projects/mag/). I have worked in a Machine Learning [start-up](https://zero.ai) and undertaken research projects on Mining High-Dimensional Datasets with pioneering researchers such as [Prof. Ashwin Srinivasan](http://www.bits-pilani.ac.in/goa/ashwin/profile). I am attaching my [CV](https://drive.google.com/a/goa.bits-pilani.ac.in/file/d/0By80y9AXd1WsOVdBOGdWRzNxaWM/view) as well, for a more detailed account of my previous experiences. 85 | 86 | ## Why this project? 87 | 88 | The idea of implementing Dynamic Topic Models came from a rather selfish point of view - I wanted a python implementation of it to use in my undergraduate research project! I have a fair amount of experience in the mathematical theory and implementation of Topic Models because of my use of them in my work with Professors at my University, and in pet projects. I am glad to have a formal platform to implement Dynamic Topic Models and look forward to contributing to Gensim towards this end. It would also be a great joy to be a part of a project which will most definitely be used by a large number of people in industry and academia. 89 | 90 | ## Appendix 91 | 92 | Find below a list of all the web articles and research papers described above. 
93 | 94 | 1 - http://www.stat.uchicago.edu/~lafferty/pdf/dtm.pdf 95 | 2 - https://en.wikipedia.org/wiki/Dynamic_topic_model 96 | 3 - https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf 97 | 4 - https://code.google.com/archive/p/princeton-statistical-learning/downloads 98 | 5 - https://radimrehurek.com/gensim/models/wrappers/dtmmodel.html 99 | 6 - http://nlp.stanford.edu/pubs/hall-emnlp08.pdf 100 | 7 - http://jmlr.org/proceedings/papers/v28/shalit13.pdf 101 | 8 - http://sappingattention.blogspot.in/2013/01/keeping-words-in-topic-models.html 102 | 9 - https://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf 103 | 10 - https://github.com/santonus/bigscholarlydata 104 | 11 - https://en.wikipedia.org/wiki/Kalman_filter 105 | 12 - http://arxiv.org/pdf/1510.03797.pdf 106 | 13 - http://ceur-ws.org/Vol-974/lakdatachallenge2013_01.pdf 107 | 14 - http://jmlr.org/proceedings/papers/v28/shalit13.pdf 108 | 15 - http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf 109 | 16 - http://bib.dbvis.de/uploadedFiles/155.pdf -------------------------------------------------------------------------------- /gsoc/2017/CLiPS-2017.md: -------------------------------------------------------------------------------- 1 | # Title 2 | 3 | Solving Fake News - From Data Collection to Classification 4 | 5 | ## Abstract 6 | 7 | In 2007, the Journal of Mass Media Ethics had an article with the title: [The Role of Journalist and the Performance of Journalism: Ethical Lessons From “Fake” News (Seriously)](http://www.tandfonline.com/doi/abs/10.1080/08900520701583586). It suggests that Jon Stewart of The Daily Show with Jon Stewart and Stephen Colbert of The Colbert Report (TCR) are a new kind of journalist - but that they lack the journalist's moral commitment, and are hence not bound to report news morally. Now, while the actual content of the paper may not be relevant to the problem we intend to explore, it is certainly curious to note the word "Seriously" in the title - and one can only wonder what the authors would say about how serious the Fake News problem has become, 10 years after the article was written! 8 | 9 | In the age of post-truth politics, where debate appeals more to emotion than to policy, Fake News is more relevant (and dangerous) than ever before. The 2016 American Presidential election is perhaps the most pertinent example, where Fake News on social media was a subject of huge debate. 10 | 11 | I propose an attempt to better understand this problem - by crawling for data, organising data, reading relevant scientific and journalistic literature on Fake News, and attempting to solve this problem using a predictive model. As for which model we decide to use, a Neural Network seems like a popular choice, but this can be discussed further once our data-set is finalised. No machine learning model is worth anything without data - after all, [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)! 12 | 13 | A final product which we would like to see is a thoroughly documented GitHub repository which features all the steps of our project. Regular blog posts/infographics about the findings and illustrative Jupyter Notebooks would help in making the public aware of the project. 
14 | 15 | ## Technical Details 16 | 17 | With the spike in the number of Fake News articles lately, websites such as [Media Bias Fact Check](https://mediabiasfactcheck.com) and [Fake News Checker](https://www.fakenewschecker.com) are useful in identifying which websites publish, share, or spread fake news. Crawling these websites to create a database of articles and headlines which are biased or have unverified sources would be very useful. 18 | 19 | The [Fake News Challenge](http://www.fakenewschallenge.org) also serves as a very useful resource - coordinating with the community of researchers already trying to tackle the problem and using the already annotated dataset will further help. [Snopes](http://www.snopes.com), [Politifact](http://www.politifact.com), and [FactCheck](http://www.factcheck.org) also serve as websites to identify trustworthy sources. 20 | 21 | Another interesting source of data is social media itself - Twitter is often used, but it also usually just links to spurious sources, which we can use the above links to verify. [Reddit](https://www.reddit.com) is another very popular online forum, and has a huge community discussing [Fake News](https://www.reddit.com/search?q=fake+news). Crawling Reddit would also give us some interesting results. Kaggle is another very important resource - the competition [Getting Real about Fake News](https://www.kaggle.com/mrisdal/fake-news) is another forum to not only discuss strategies with other data scientists but also to compare our model's performance - something which will be important in the later stages of our project when we are fine-tuning our model. 22 | 23 | [Google Scholar](https://scholar.google.fr/scholar?q=fake+news&btnG=&hl=en&as_sdt=0%2C5) offers us access to the existing scientific literature on Fake News, helping us with our research project. [Hoaxy](http://hoaxy.iuni.iu.edu) is a research project visualising networks of fake news. 24 | 25 | As for training our model and choosing linguistic features, some of the existing literature such as [Social Media and Fake News in the 2016 Election](https://web.stanford.edu/~gentzkow/research/fakenews.pdf) and [Classifying Fake News](http://www.conniefan.com/wp-content/uploads/2017/03/classifying-fake-news.pdf) helps us - ideas such as using sentiment (higher negative or positive sentiment - likely to be fake), writing quality (better writing quality - likely to be real), and punctuation (again, similar to writing quality) are some of the many features we can consider. The [project details](http://www.clips.uantwerpen.be/projects/gsoc-2017) on the CLiPS page also ask important questions, some of which have been answered in the papers above. 26 | 27 | As for the tools we can use, the work will largely be done in Python - Python boasts a really powerful variety of machine learning and deep learning tools. [Scikit-learn](http://scikit-learn.org/stable/), [Keras](https://keras.io), [Theano](http://deeplearning.net/software/theano/), and [Tensorflow](https://www.tensorflow.org) are all libraries which can be used to create models for our problem, and I am comfortable with all of them. In particular, I believe the python library [spaCy](https://spacy.io) will be very useful - it has the fastest pre-processing toolkit available, and its deep learning integration is seamless. [Gensim](https://github.com/RaRe-Technologies/gensim), [NLTK](http://www.nltk.org) and [Pattern](http://www.clips.ua.ac.be/pages/pattern) will also serve as useful libraries during our project. 
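To make the modelling discussion above a little more concrete, here is a minimal scikit-learn sketch of the kind of baseline we could start from - TF-IDF bag-of-words features feeding a Naive Bayes classifier. The `texts` and `labels` below are hypothetical stand-ins for whatever labelled dataset we end up assembling, and the linguistic features discussed above (sentiment, writing quality, punctuation) would be added on top of this baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical labelled data: 1 = fake, 0 = real (stand-ins for the crawled dataset)
texts = [
    "SHOCKING!!! Celebrity reveals miracle cure doctors hate",
    "You won't BELIEVE what this politician secretly did",
    "City council approves new budget for public transport",
    "Study finds moderate exercise linked to better sleep",
]
labels = [1, 1, 0, 0]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # word and bigram bag-of-words features
    ("clf", MultinomialNB()),                        # simple, fast baseline classifier
])
baseline.fit(texts, labels)

print(baseline.predict(["BREAKING: secret cure revealed, doctors furious"]))
```

Swapping the Naive Bayes step for a neural model in Keras, or adding spaCy-derived features, would then mostly be a matter of extending this pipeline - which is part of why starting from a simple, well-understood baseline makes sense.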
28 | 29 | ## Open Source Development Experience 30 | 31 | I enjoy and regularly contribute to open source scientific computing libraries. I was previously selected to participate in Google Summer of Code 2016 with Gensim under the NumFOCUS umbrella, where I implemented Dynamic Topic Models. My [blog](https://summerofcode2017.wordpress.com/) details my experiences during the summer of 2016. 32 | 33 | I'm still a regular contributor to [Gensim](https://github.com/RaRe-Technologies/gensim/pulls/bhargavvader) - I have 29 pull requests merged, and actively help on the mailing list and issues. I have given talks at PyCon France 2016 and PyCon Slovakia 2017 about my experience with Gensim and GSoC 2016. I also contribute to [metric-learn](https://github.com/all-umass/metric-learn/pulls?q=is%3Apr+author%3Abhargavvader+is%3Aclosed), [Edward](https://github.com/blei-lab/edward/pulls/bhargavvader), [spaCy](https://github.com/explosion/spacy-notebooks/pulls?q=is%3Apr+author%3Abhargavvader+is%3Aclosed), and [pymc3](https://github.com/pymc-devs/pymc3/issues?q=is%3Aopen+mentions%3Abhargavvader). The links direct to my PRs/issues for the respective repos. My [GitHub profile](https://github.com/bhargavvader) has a more detailed summary of all my contributions. 34 | 35 | 36 | ## Academic Experience 37 | 38 | I am a student researcher at INRIA, France. I work with the MODAL (Models Of Data Analysis and Learning) team, where I work on Predictor Aggregation and Metric Learning problems. I'm finishing up my undergraduate education in Computer Science Engineering at BITS Pilani University, India. My [resume](https://drive.google.com/file/d/0By80y9AXd1WsRUJWTlFfeldITGc/view?usp=sharing) details my previous internships and research experiences, which are all in the fields of Machine Learning, Data Science and Software Engineering - and will help in completing this project. 39 | 40 | ## Why this project? 41 | 42 | I am interested in this project for multiple reasons. While I majored in Computer Science at University, my minor was in Philosophy, Economics, Politics [PEP]. My coursework in the Social Sciences is what sparked my interest in Computational Social Science - solving or understanding social problems with the use of computational tools. Computational Linguistics and Natural Language Processing offer a very powerful way to approach social problems, mainly because of the abundance of textual data and ways to mine it. The Fake News problem in particular is interesting because of how often we feel its effects, notice it, or talk about it - with the ubiquity of social media, Fake News is everywhere! It directly affects our governments and in some cases their policies - being an informed voter is no longer easy. By making any progress on this problem - from gathering and labelling data and documentation to training a model - we are addressing an important research and social problem. 43 | 44 | A personal reason is my interest in pursuing a PhD in Computational Social Science - working with researchers from CLiPS, a top facility for Computational Linguistics, would be a great research experience! 45 | 46 | ## What skills do I have suited for this project? 47 | 48 | I have 3 years of experience in Python, and have spent a considerable amount of time contributing to open source, particularly in Python and machine learning. As I mentioned in the Open Source Development Experience section, I'm very comfortable with open source Python NLP/CL tools. 
My background in the social sciences will also help in better analysing the context of the problem. 49 | 50 | My research experience in Machine Learning means I am familiar with the pipeline of solving such problems - finding and cleaning data, writing clean, reproducible code, and keeping the process open source are as important as solving the problem itself. 51 | 52 | I have worked with visualisation tools such as D3.js before, and am very comfortable with matplotlib. This [link](https://github.com/bhargavvader/personal/tree/master/notebooks) contains all the Jupyter Notebooks I have made and contributed to. 53 | I also enjoy writing (I have previously linked to my GSoC 2016 blog), and using Jupyter notebooks to explain concepts and ideas. 54 | 55 | If working in a team, I am comfortable working either on data collection, curation and cleaning, or on creating the classifier/model, and have experience in doing both. Having completed GSoC before, I understand the importance of clear communication and how to work and behave in an open source community. 56 | 57 | ## Schedule of Deliverables 58 | 59 | ### May 1st - May 28th, **Community Bonding Period** 60 | 61 | - Meet with project mentors, decide which sources to use and which tools to use - at least for the beginning! 62 | - Literature survey so everyone is comfortable with what has already been done, and discuss what hasn't! 63 | 64 | 65 | ### May 29th - June 9th 66 | 67 | - Our first objective is to create a comprehensive dataset. We have discussed possible sources in detail in the Technical Details section. 68 | - The first two weeks could be spent on this task. 69 | 70 | ### June 12th - June 16th 71 | 72 | - Cleaning and pre-processing our dataset is as important as collecting it! 73 | - This week could be spent pre-processing and arranging our data, making it ready for training. 74 | 75 | ### June 19th - June 23rd, **End of Phase 1** 76 | 77 | - As phase 1 ends we now have our data ready. Documentation is key - making sure we let people know the sources of our data, and creating a Jupyter notebook to explain our collection and cleaning would make for an interesting blog post! 78 | - We can start setting up our model. 79 | 80 | 81 | ### June 26th - July 7th, **Beginning of Phase 2** 82 | 83 | - We now begin exploring possible models to train on our data. 84 | - A more complex model is not necessarily a better one - a simple Naive Bayes can give us excellent results if we choose the right features! 85 | 86 | ### July 10th - July 14th 87 | 88 | - We are now midway through the project and should be seeing some results! 89 | - In most machine learning problems, fine-tuning and playing with parameters is half the trouble - we should be at this stage now. 90 | 91 | ### July 17th - July 21st, **End of Phase 2** 92 | 93 | - A blog post detailing our progress would be appropriate now. 94 | - Along with progress, it is important to clearly document the difficulties and problems faced - we are contributing to the existing scientific literature by doing this! 95 | 96 | ### July 24th - August 18th, **Beginning of Phase 3** 97 | 98 | - The last month of the project should be when we are coding less and less - visualisations, write-ups, infographics and blog posts should be the focus! 99 | - Much like a trained sentiment model, a trained Fake News detector would be a very useful open source addition. 100 | - Even an imperfect model would be useful to the community, and we can always improve the state-of-the-art. 
101 | - Cleaning up what's left of the project, and making sure that it can be accessed again and worked on is a high priority. 102 | 103 | 104 | ### August 21st - August 29th, **Final Week** 105 | 106 | - At this point we should have explored the Fake News project in great detail, and from many angles and perspectives. 107 | - A research publication would be an ideal result - with regular blog posts and documentation, we should have enough material ready! 108 | 109 | ## Future Works 110 | 111 | I would like to continue contributing to the research pursued by CLiPS - either by contributing to Pattern, or even considering a career as a researcher or student at CLiPS! 112 | 113 | ## Appendix 114 | 115 | [Fakes News Challenge](http://www.fakenewschallenge.org) 116 | 117 | [Media Bias Fact Check](https://mediabiasfactcheck.com) 118 | 119 | [Fake News Checker](https://www.fakenewschecker.com) 120 | 121 | [Social Media and Fake News in the 2016 Election](https://web.stanford.edu/~gentzkow/research/fakenews.pdf) 122 | 123 | [Classifying Fake News](http://www.conniefan.com/wp-content/uploads/2017/03/classifying-fake-news.pdf) 124 | 125 | -------------------------------------------------------------------------------- /gsoc/2017/papers.md: -------------------------------------------------------------------------------- 1 | ## Links to papers relevant to GSoC 2017 with pymc3 2 | ### Introduction to MCMC 3 | [The Convergence of Markov chain Monte Carlo Methods: From the Metropolis method to Hamiltonian Monte Carlo](https://arxiv.org/abs/1706.01520) 4 | 5 | ### Reading Material for Riemannian Manifold Hamiltonian Monte Carlo 6 | 7 | [A General Metric for Riemannian Manifold Hamiltonian Monte Carlo](https://arxiv.org/abs/1212.4693) 8 | 9 | [Generalizing the No-U-Turn Sampler to Riemannian Manifolds](https://arxiv.org/abs/1304.1920) 10 | 11 | [Riemann Manifold Langevin and Hamiltonian Monte Carlo](https://pdfs.semanticscholar.org/16c5/06c5bb253f7528ddcc80c72673fabf584f32.pdf) 12 | 13 | [Identifying the Optimal Integration Time in Hamiltonian Monte Carlo](https://arxiv.org/abs/1601.00225) 14 | 15 | ### Reading material for Manifolds and Differential Geometry 16 | 17 | [INTRODUCTION TO SMOOTH MANIFOLDS](http://webmath2.unito.it/paginepersonali/sergio.console/lee.pdf) 18 | -------------------------------------------------------------------------------- /gsoc/2017/pymc3-2017.md: -------------------------------------------------------------------------------- 1 | # Title 2 | 3 | Implementing Riemannian Manifold HMC 4 | 5 | ## Abstract 6 | 7 | pymc3 has a powerful range of sampling methods, with the Hamiltonian Monte Carlo (and NUTS) being the crown jewel. 8 | And with the [NUTS](https://arxiv.org/abs/1111.4246) algorithm, HMC can be optimized adaptively - also, empirically, NUTS performs at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. There are some minor drawbacks to this - HMC doesn't always work so well on extremely complex models, and sometimes NUTS can fail by terminating prematurely before reaching the optimal integration time. 9 | 10 | We can build on the HMC method by introducing Riemannian Manifold HMC to pymc3, which improves the auto-tuning behavior of the NUTS algorithm and provides independent samples on extremely complex models. 
In the [paper](https://arxiv.org/pdf/1304.1920.pdf) generalizing NUTS to Riemannian manifolds, the motivation is explained as follows - ``` The No U-Turn Sampler identifies these turning points for Euclidean manifolds, but the criterion begins to fail when applied to more complex distributions and Riemannian manifolds. Appealing to the geometry of HMC, however, admits a straightforward generalization of the No U-Turn Sampler that is not only amenable to Riemannian manifolds but also isolates the turning points in more complicated, non-convex target distributions.``` 11 | 12 | By implementing RMHMC as an alternate sampling method for pymc3, we can tackle more problems and provide a powerful alternative to the existing sampling methods. 13 | 14 | 15 | ## Technical Details 16 | 17 | Michael Betancourt's implementation [here](https://github.com/betanalpha/jamon) would serve as a very useful blueprint while going about RM-HMC. It is also coded in Stan, but not exposed there (due to the potential fragility of higher-order autodiff). This [issue](https://github.com/stan-dev/stan/issues/304) discusses this in more detail, linking in particular to the files relevant to the RMHMC implementation. We would require higher-order autodiff for probability distributions, which would be a mini-task in itself, but with Theano this should be manageable. 18 | 19 | The current implementations use the SoftAbs metric, an approach explained in [this paper](https://arxiv.org/pdf/1212.4693.pdf). 20 | 21 | Other useful links include a [MATLAB implementation of RMHMC](https://github.com/a-kramer/ode_rmhmc), and the [paper](http://www.dcs.gla.ac.uk/publications/PAPERS/9149/RMHMC_MG_BC_SC_07_09.pdf) by Girolami & Calderhead. 22 | 23 | The API would require discussion - but it could be structured similarly to the [HMC directory](https://github.com/pymc-devs/pymc3/tree/master/pymc3/step_methods/hmc), i.e. create an RMHMC directory with its own NUTS algorithm. 24 | 25 | ## Schedule of Deliverables 26 | 27 | ### May 1st - May 28th, **Community Bonding Period** 28 | 29 | - A huge part of being part of the open-source community and Google Summer of Code is sharing your ideas and work. I've started a blog [here](https://summerofcode2017.wordpress.com/) to give regular reports of the summer work I will be doing, and to try and break down the world of Probabilistic Programming for everyone. 30 | 31 | - Wrap up existing PRs and continue to help around with issues and bugs. I would also start work on setting up RM-HMC and NUTS. It would be useful to have the basic API and structure ready so that we can focus on the heavy lifting for the rest of the period. 32 | 33 | - Communication with mentors to clarify doubts, and re-read the papers and Stan code to make sure everything is in place. 34 | 35 | ### May 29th - June 3rd 36 | 37 | - Since the current implementation is based on the SoftAbs metric, it would make sense to first see if we can replicate [this](https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/hamiltonians/softabs_metric.hpp), and whether we would need to at all. 38 | 39 | - If we haven't already, decide the API which we will start working with. 40 | 41 | ### June 5th - June 9th 42 | 43 | - The current road-block in the Stan implementation is due to the instability of higher-order autodiff. Exploring this problem, and how a pymc3 implementation of the same will handle it, is the next step. 44 | 45 | 46 | ### June 12th - June 16th 47 | 48 | - By now we will have our basic road-map in place, and can start implementing the algorithm. 
Before Phase 2 begins, a PR should be opened which has the skeleton code ready! 50 | 51 | - It would also be important to decide how much we would like to be influenced by the existing implementations (linked to above). 52 | 53 | ### June 19th - June 23rd, **End of Phase 1** 54 | 55 | - As phase 1 ends, the design decisions should end and RM-HMC should start taking shape. 56 | 57 | - We will follow the philosophy of test-driven development, and regularly document the work as it progresses. 58 | 59 | ### June 26th - July 7th, **Beginning of Phase 2** 60 | 61 | - Phase 2 would mark the start of the bulk of the coding work. 62 | 63 | - A blog post detailing how MCMC, HMC and RM-HMC work would be a nice way to talk more about pymc3 and GSoC 2017. 64 | 65 | ### July 10th - July 14th 66 | 67 | - By now, RM-HMC would have taken considerable shape - our focus would now be on how to extend the NUTS algorithm appropriately to it. 68 | 69 | ### July 17th - July 21st, **End of Phase 2** 70 | 71 | - This would be enough time to be done with RM-HMC and the NUTS algorithm, coinciding with the end of Phase 2. 72 | 73 | - The rest of the time would be devoted to testing, benchmarks, and further documentation (with tutorials). 74 | 75 | ### July 24th - August 4th, **Beginning of Phase 3** 76 | 77 | - The performance of RMHMC can be assessed by performing posterior inference on logistic regression models, log-Gaussian Cox point processes, and stochastic volatility models. 78 | 79 | - The bulk of this period will involve finishing testing, and handling bugs and edge cases. 80 | 81 | ### August 7th - August 11th 82 | 83 | - Another blog post detailing the work done so far, and how NUTS works. 84 | 85 | - Wrap up tests and documentation. Perform memory/speed benchmarks. 86 | 87 | ### August 14th - August 18th 88 | 89 | - Jupyter Notebook to accompany the rest of the code. 90 | 91 | ### August 21st - August 25th, **Final Week** 92 | 93 | - The last week will involve any remaining code cleaning, and finishing the tutorial. 94 | 95 | ### August 28th - August 29th, **Submit final work** 96 | 97 | - Have the PR merged, celebrate! 98 | 99 | ## Future works 100 | 101 | With regard to sampling methods and MCMC/HMC in particular, Betancourt's work is particularly interesting - [this](https://arxiv.org/pdf/1601.00225.pdf) paper talks about XMC, which claims to be twice as fast as the regular NUTS algorithm. I intend to keep contributing to pymc3 with issues and PRs, and to be an active part of the community. 102 | 103 | ## Open Source Development Experience 104 | 105 | I enjoy and regularly contribute to open source scientific computing libraries. I was previously selected to participate in Google Summer of Code 2016 with Gensim under the NumFOCUS umbrella, where I implemented Dynamic Topic Models. My [blog](https://summerofcode2017.wordpress.com/) details my experiences during the summer of 2016. 106 | 107 | I have been using pymc3 for my research work at INRIA, and to learn probabilistic programming myself. I've recently started contributing to pymc3 - these are my [contributions](https://github.com/pymc-devs/pymc3/issues?q=is%3Aopen+mentions%3Abhargavvader). 108 | 109 | I'm still a regular contributor to [Gensim](https://github.com/RaRe-Technologies/gensim/pulls/bhargavvader) - I have 29 pull requests merged, and actively help on the mailing list and issues. I have given talks at PyCon France 2016 and PyCon Slovakia 2017 about my experience with Gensim and GSoC 2016. 
I also contribute to [metric-learn](https://github.com/all-umass/metric-learn/pulls?q=is%3Apr+author%3Abhargavvader+is%3Aclosed), [Edward](https://github.com/blei-lab/edward/pulls/bhargavvader), and [spaCy](https://github.com/explosion/spacy-notebooks/pulls?q=is%3Apr+author%3Abhargavvader+is%3Aclosed). The links direct to my PRs/issues for the respective repos. My [GitHub profile](https://github.com/bhargavvader) has a more detailed summary of all my contributions. 109 | 110 | I intend to continue my open source contributions, and to keep contributing to pymc3 as well! 111 | 112 | ## Academic Experience 113 | 114 | I am a student researcher at INRIA, France. I work with the MODAL (Models Of Data Analysis and Learning) team, where I work on Predictor Aggregation and Metric Learning problems. I'm finishing up my undergraduate education in Computer Science Engineering from BITS Pilani University, India. My [resume](https://drive.google.com/file/d/0By80y9AXd1WsRUJWTlFfeldITGc/view?usp=sharing) details my previous internships and research experiences, which are all in the field of Machine Learning, Data Science and Software Engineering. 115 | 116 | ## Why this project? 117 | 118 | One of pymc3's main "selling points" are it's powerful suite of sampling methods. While NUTS is arguably the most powerful tool in the box, it can be bettered, and RMHMC is a way to do this. Working on this project would mean that I am contributing to one of the most important parts of pymc3, while providing an addition which isn't there in any other Probabilistic Programming languages/packages yet. 119 | On a personal note, the project serves as a way to better my Physics (and Math) while contributing, subjects I have always loved but never spent the amount of time I would like on. 120 | 121 | ## Appendix 122 | 123 | The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo - Matthew D. Hoffman, Andrew Gelman 124 | 125 | Riemann Manifold Langevin and Hamiltonian Monte Carlo - Mark Girolami, Ben Calderhead, Siu A. Chin 126 | 127 | A General Metric for Riemannian Manifold Hamiltonian Monte Carlo - Michael Betancourt 128 | 129 | Generalizing the No-U-Turn Sampler to Riemannian Manifolds - Michael Betancourt 130 | 131 | Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers - Ziyu Wang, Shakir Mohamed, Nando de Freitas 132 | 133 | 134 | 135 | -------------------------------------------------------------------------------- /notebooks/README.md: -------------------------------------------------------------------------------- 1 | ## Notebooks 2 | 3 | This directory contains the Jupyter notebooks I have made or helped contribute to. 4 | 5 | ## Links 6 | 7 | Following are the links to the notebooks in the original repositories. 
8 | 9 | ### Gensim 10 | 11 | [Distance Metrics](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/distance_metrics.ipynb) 12 | 13 | [DTM wrapper example](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/dtm_example.ipynb) 14 | 15 | [Dynamic Topic Models in python](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb) 16 | 17 | [New LDA topic methods](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb) 18 | 19 | ### metric-learn 20 | 21 | [Metric Plotting](https://github.com/all-umass/metric-learn/blob/master/examples/metric_plotting.ipynb) 22 | 23 | ### pycobra 24 | 25 | [Regression](https://github.com/bhargavvader/pycobra/blob/master/notebooks/regression.ipynb) 26 | 27 | [Visualisation](https://github.com/bhargavvader/pycobra/blob/master/notebooks/visualise.ipynb) 28 | 29 | ### spacy-notebooks 30 | 31 | I also love using spaCy - while I haven't gotten around to contributing to spaCy itself, I help maintain the [notebooks](https://github.com/explosion/spacy-notebooks) directory, where we document Jupyter Notebooks which use spaCy. -------------------------------------------------------------------------------- /notebooks/computational_social_science/README.md: -------------------------------------------------------------------------------- 1 | ## Workshop 2 | 3 | This directory contains the Jupyter Notebooks which will be followed during the workshop/tutorial. 4 | 5 | The computational social science tutorial will walk us through networks and text analysis - this is just the tip of the iceberg, and there is a lot more which would constitute computational social science, but we have to start somewhere. Hopefully this should give you an idea of the kinds of methods used. 6 | 7 | ### Requirements for Tutorial 8 | 9 | ``` 10 | - Jupyter 11 | - Gensim 12 | - matplotlib 13 | - spaCy 14 | - pandas 15 | - numpy 16 | - networkx 17 | - seaborn 18 | ``` 19 | 20 | 21 | ### Setup 22 | 23 | The setup instructions below assume you are using virtualenv for your environment: you can use any environment to do this, or even your local machine (though that isn't recommended). Basically, you want to be able to run the Jupyter notebook in the directory, install the packages, and start running the cells. The instructions below will help you through one way of making that happen. 24 | 25 | - Start by cloning the repo using 26 | 27 | `git clone https://github.com/bhargavvader/personal` 28 | 29 | - Go into the `notebooks/computational_social_science` directory 30 | 31 | - Install `virtualenv` using 32 | 33 | `pip install virtualenv` 34 | 35 | - Start the environment with 36 | 37 | ``` 38 | virtualenv venv 39 | source venv/bin/activate 40 | ``` 41 | 42 | - Download the requirements with 43 | 44 | `pip install -r REQUIREMENTS.txt` 45 | 46 | And you should be good to go! 47 | 48 | If you are using anaconda as your virtual environment, running `conda install gensim` and `conda install spacy` should also do the trick. 49 | 50 | Alternatively, you can look up which of the libraries you still need to download and go ahead and just download those. 51 | 52 | ### Downloading the spaCy language model 53 | 54 | The tutorial will be using the spaCy English language model, so we will need to download it first. 55 | This [link](https://spacy.io/usage/models) contains instructions for downloading this model. 56 | All we really have to do is run `python -m spacy download en` after we finish all our library installations. 
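As a quick sanity check that the language model is installed and loadable, you can run something like the snippet below. This is only a suggestion, and it assumes the `en` shortcut created by the download command above; on newer spaCy versions you may need the full model name `en_core_web_sm` instead.

```
import spacy

nlp = spacy.load('en')   # or spacy.load('en_core_web_sm') on newer versions
doc = nlp("She bends it like Beckham.")
print([(token.text, token.pos_) for token in doc])
```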
57 | -------------------------------------------------------------------------------- /notebooks/computational_social_science/REQUIREMENTS.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | scipy 3 | gensim 4 | spacy 5 | matplotlib 6 | jupyter 7 | pandas 8 | seaborn 9 | networkx -------------------------------------------------------------------------------- /notebooks/computational_social_science/computational_social_science_tutorial_unrun.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## An (attempt at an) Introduction to Computational Social Science with Python\n", 8 | "\n", 9 | "- by Bhargav Srinivasa Desikan\n", 10 | "\n", 11 | "This tutorial was first conducted at PyCon India 2020 ([CFP](https://in.pycon.org/cfp/2020/proposals/python-for-computational-social-science-with-text-networks-and-clever-data-games~dyPPE/)).\n", 12 | "\n", 13 | "Computational Social Science is a broad, diverse field, and no tutorial, let alone Jupyter notebook, can claim to give a true introduction to the field: what we will attempt to do in this notebook is understand the _kinds_ of approaches one might be able to take when dealing with a variety of computational social science questions.\n", 14 | "\n", 15 | "What is Computational Social Science? It includes the academic sub-disciplines concerned with computational approaches (such as data scraping, data cleaning, machine learning, natural language processing and others) to various social science disciplines. Social science disciplines include Economics, Sociology, Psychology, Political Science, Anthropology, Cultural Studies (and sometimes History & Science-Technology-Society Studies). Examples of computational approaches to social science questions could be characterising tie formation in friendship groups, analysing tweets to identify sentiment towards political groups, modelling economies and their reactions to changes in policy, or identifying how words in a language change meaning over time. \n", 16 | "\n", 17 | "While the idea of computational social science is based in academia and research, we see usages of these ideas by big tech companies all the time: for example, when we see a friend or page suggestion on a social media website, or a movie recommendation on Netflix. In research, academic and experimental rigor is valued more than a business oriented approach, but the core ideas often remain the same.\n", 18 | "\n", 19 | "The computational paradigm has taken off due to a congruence of hardware (GPUs, TPUs), software (larger communities of open source software, and increasingly sophisticated software and environments), and large amounts of available social data. Social data can range from curated datasets we find on Kaggle or the UCI repository, to data we mine from Reddit or Twitter. Important aspects of Computational Social Science include choosing the right research question, studying the theoretical underpinnings associated with the question, and setting up an experiment or mining data which might be able to appropriately address the question. Given that this is a tough exercise, and requires extensive academic coursework, I will be skipping the academic rigor, to instead explore the kinds of approaches and methods one might encounter in Computational Social Science research.\n", 20 | "\n", 21 | "How will this tutorial be useful to you? 
It will include basic examples of code with small datasets to illustrate a variety of methods often used in CSS research. This includes (and is not limited to) text analysis via Natural Language Processing and Computational Linguistics, network and graph oriented methods, and building classifiers and regression models trained on a variety of social data. \n", 22 | "\n", 23 | "Since it is difficult to represent all of these methods, I will only focus on a few, particularly NLP and CL, as these are the methods I have the most experience in. We will then move on to network based methods, and finally to a combination of both text and networks. The tutorial will assume basic python knowledge but not much else: do not worry if some of the methods or ideas don't make sense yet. A lot of the words and ideas I have used in this tutorial have ample resources available if you wish to understand them, and you are also welcome to e-mail / tweet at me any questions you have and I will do my best to answer them. If you find yourself reproducing parts of this tutorial for your own work or tutorials, please do cite / shout-out / reference this original tutorial and page, thank you!" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### Where do you find data? What is the correct dataset?\n", 31 | "\n", 32 | "This is always a challenging question. Really, it depends on the question you want to answer: there are a lot of considerations one must weigh when deciding if a dataset is the right one to answer your social science question. Since this tutorial would like to introduce a wide variety of methods, the dataset we choose has to be one which is amenable to a variety of approaches: to that end, I have decided to use the script for the Indo-British film, _Bend it like Beckham_. Why this film? It's simply the last movie I watched, I enjoyed watching it as a kid back in the early 2000s, and the movie also kicks _ass_. For those who may not have seen it, it is the coming-of-age story of an Indian-British Punjabi girl, Jesminder (Jess), and her journey to becoming a professional footballer. I have to apologize - SPOILERS! During the various methods we will be applying throughout the notebook, you might learn things about the movie you might have wanted to keep a surprise if you haven't seen it. \n", 33 | "\n", 34 | "I found the script of the movie through a simple google search, at https://www.swcs.com.au/bend.htm. Like you, I have no clue what to expect from the results of the analysis so far - we are going to try and use the data in its available form to create objects of analysis for us. Let us start by setting up some basic imports and reading the data, and thinking of useful representations.\n", 35 | "\n", 36 | "### Imports and Data\n", 37 | "\n", 38 | "Arguably the most important part of Computational Social Science is cleaning and organising the data. The following cells will have you follow me as I work to set up the best representations to analyse the dataset." 
39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "import numpy as np\n", 48 | "import spacy\n", 49 | "import gensim\n", 50 | "import pandas as pd" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "txt = []" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "txtfile = open('bend_it_like_beckham.txt')" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "for line in txtfile.readlines():\n", 78 | " txt.append(line)" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [ 87 | "txt" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "There: we now have each line from the movie (as well as it has been transcribed, at least: we do not know anything about the quality of the document) saved in a list." 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "len(txt)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "txt[0]" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "txt[23]" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "txt[25]" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "Ok, so looking at these output examples starts to give us some hints. The first word of each text is usually the character's name, followed by a colon, and then the text. This gives us some ideas on how to parse this dataset. Let us probe what else is there in this dataset. Sentences without a colon might be associated with a song and commentary, so for now we will add it in our extras list." 
138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "extras, texts = [], []" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "character_counts = {}" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": null, 161 | "metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "for line in txt:\n", 165 | " try:\n", 166 | " character, text = line.split(\":\")\n", 167 | " except ValueError:\n", 168 | " if line is not \"\\n\":\n", 169 | " extras.append(line)\n", 170 | " continue\n", 171 | " \n", 172 | " if character in character_counts:\n", 173 | " character_counts[character] += 1\n", 174 | " \n", 175 | " if character not in character_counts:\n", 176 | " character_counts[character] = 1\n", 177 | " \n", 178 | " if text is not \"\\n\":\n", 179 | " texts.append(text)" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "extras" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "A quick look at the list confirms our hypothesis. Lets now see what characters we find:" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "character_counts" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "character_counts['Song Playing']" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "This is an important list: it tells us how many lines each of the chatacters had, at least the way it was transcribed. Can we somehow use this information to find key characters, and to also find transitions in scenes?\n", 221 | "\n", 222 | "A little bonus: we that 15 songs have been played in the movie!" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [ 231 | "locations_transitions = []\n", 232 | "key_characters = []" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "for character in character_counts:\n", 242 | " if character_counts[character] > 25:\n", 243 | " key_characters.append(character)\n", 244 | " location_transition_keywords = ['at', 'then', 'later', 'in', 'outside', 'back', 'inside', 'on', 'after', 'walking']\n", 245 | " for word in location_transition_keywords:\n", 246 | " if word in character:\n", 247 | " locations_transitions.append(character)\n", 248 | " continue\n", 249 | " " 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "locations_transitions" 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "metadata": {}, 264 | "source": [ 265 | "These aren't perfect, and we might have missed some of them, but they will be very useful in creating co-occurence counts: i.e, two characters share screen time. 
" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "key_characters" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "That makes sense, having seen the movie. Now let us make some datasets for us which will be useful for analysis!" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": null, 287 | "metadata": {}, 288 | "outputs": [], 289 | "source": [ 290 | "key_character_texts = {}" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "for line in txt:\n", 300 | " try:\n", 301 | " character, text = line.split(\":\")\n", 302 | " except ValueError:\n", 303 | " continue\n", 304 | " if character in key_characters:\n", 305 | " if character not in key_character_texts:\n", 306 | " key_character_texts[character] = []\n", 307 | " if character in key_character_texts:\n", 308 | " key_character_texts[character].append(text.replace(\"\\n\", \"\"))\n" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | "for character in key_character_texts:\n", 318 | " print(character, len(key_character_texts[character]))" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "key_character_texts['Jess']" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "len(texts)" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "Neat: we now have each character and their associated texts, and the individual texts, so we can create our first network based object, based on interactions. \n", 344 | "\n", 345 | "## Graphs and Networks\n", 346 | "\n", 347 | "Graphs and networks can be a very useful way to represent data, and can provide us all kinds of insights. \n", 348 | "\n", 349 | "I would highly, highly recommend checking out this amazing set of Jupyter Notebooks and resources by Mridul Seth and Eric Ma: http://ericmjl.github.io/Network-Analysis-Made-Simple/\n", 350 | "\n", 351 | "I will not be going into a real introduction to the theories and ideas behind graphs and networks, but I will explain how they are useful for us in social scientific analysis. A graph can be thought of as a representation of data in a way you can measure connections or relations, and the kind of relations. \n", 352 | "\n", 353 | "Examples (taken from Network Analysis Made Simple) of graphs are:\n", 354 | "\n", 355 | "1) protein-protein interaction network. Here, the graph can be defined in the following way:\n", 356 | "\n", 357 | " nodes/entities are the proteins,\n", 358 | " edges/relationships are defined as \"one protein is known to bind with another\".\n", 359 | "\n", 360 | "2) air transportation network. 
Here, the graph can be defined in the following way:\n", 361 | "\n", 362 | " nodes/entities are airports\n", 363 | " edges/relationships are defined as \"at least one flight carrier flies between the airports\".\n", 364 | "\n", 365 | "3) social networks: With Twitter, the graph can be defined in the following way:\n", 366 | "\n", 367 | " nodes/entities are individual users\n", 368 | " edges/relationships are defined as \"one user has decided to follow another\".\n", 369 | "\n", 370 | "How can we extend these examples to our data set?\n", 371 | "It would be useful to know which characters are in the same scences, so we can measure which characters are most central, and so on. " 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": null, 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "import networkx as nx" 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "character_graph = nx.Graph()" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "for character in key_characters:\n", 399 | " character_graph.add_node(character)" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": null, 405 | "metadata": {}, 406 | "outputs": [], 407 | "source": [ 408 | "scenes = {}" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "i = 0" 418 | ] 419 | }, 420 | { 421 | "cell_type": "code", 422 | "execution_count": null, 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [ 426 | "scenes[i] = []" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "for line in txt:\n", 436 | " try:\n", 437 | " character, text = line.split(\":\")\n", 438 | " except ValueError:\n", 439 | " continue\n", 440 | " if character in key_characters and character not in scenes[i]:\n", 441 | " scenes[i].append(character)\n", 442 | " if character in locations_transitions:\n", 443 | " i += 1\n", 444 | " scenes[i] = []" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": null, 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [ 453 | "scenes" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "We now have information about which characters share a scence, and can build our graph." 
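The next code cell builds the weighted co-occurrence graph with a nested loop over ordered pairs of characters, which visits every pair twice (once in each order). An equivalent sketch using `itertools.combinations` visits each unordered pair once, so its edge weights come out at half the values of the cell below. It assumes the `scenes` and `key_characters` objects built above; `co_graph` is just an illustrative name:

```python
from itertools import combinations

import networkx as nx

# Count shared scenes per unordered pair of characters, assuming `scenes` and
# `key_characters` from the cells above. Each pair is visited once per scene,
# so weights here are half of those produced by the nested loop in the next cell.
co_graph = nx.Graph()
co_graph.add_nodes_from(key_characters)

for scene, present in scenes.items():
    for char_a, char_b in combinations(present, 2):
        if co_graph.has_edge(char_a, char_b):
            co_graph[char_a][char_b]['weight'] += 1
        else:
            co_graph.add_edge(char_a, char_b, weight=1)
```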
461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "for scene in scenes:\n", 470 | " if len(scenes[scene]) > 1:\n", 471 | " for character_0 in scenes[scene]:\n", 472 | " for character_1 in scenes[scene]:\n", 473 | " if character_0 != character_1:\n", 474 | " if (character_0, character_1) not in character_graph.edges():\n", 475 | " character_graph.add_edge(character_0, character_1, weight=0)\n", 476 | " if (character_0, character_1) in character_graph.edges():\n", 477 | " character_graph.edges[(character_0, character_1)]['weight'] += 1" 478 | ] 479 | }, 480 | { 481 | "cell_type": "code", 482 | "execution_count": null, 483 | "metadata": {}, 484 | "outputs": [], 485 | "source": [ 486 | "nx.draw(character_graph, with_labels=True, font_weight='bold')\n" 487 | ] 488 | }, 489 | { 490 | "cell_type": "code", 491 | "execution_count": null, 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "character_graph.edges()[('Jess', 'Pinky')]" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "We can see how this graph gives us a lot of information: we see which characters appear in the same scence together, and can get an idea of centrality, as well as the numnber of times two characters share a scence." 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": null, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "import seaborn as sns\n", 512 | "import matplotlib.pyplot as plt" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": {}, 519 | "outputs": [], 520 | "source": [ 521 | "L = []\n", 522 | "for node in character_graph.nodes():\n", 523 | " l = []\n", 524 | " for node_ in character_graph.nodes():\n", 525 | " if node == node_:\n", 526 | " l.append(0)\n", 527 | " else:\n", 528 | " if (node, node_) in character_graph.edges():\n", 529 | " l.append(character_graph.edges[(node, node_)]['weight'])\n", 530 | " else:\n", 531 | " l.append(0)\n", 532 | " L.append(l)\n", 533 | "M_ = np.array(L)\n", 534 | "fig = plt.figure()\n", 535 | "div = pd.DataFrame(M_, columns = list(character_graph.nodes()), index = list(character_graph.nodes()))\n", 536 | "ax = sns.heatmap(div)\n", 537 | "plt.show()" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "We can use our co-occurence graph to plot which characters share the screen with each other. Let's also use this graph to find the most central characters." 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "nx.algorithms.centrality.degree_centrality(character_graph)" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "metadata": {}, 559 | "source": [ 560 | "Let's do some catch up: we have created a graph which maps how often characters share the screen together. This allows us to visualise how often characters appear with each other, and calculate centralities, where we see that Jess, Jules and Jess's Dad share the screen with every other character. This makes sense because they are three central characters.\n", 561 | "\n", 562 | "But with every character more or less linked to every other one, it's tough to see any cliques or groups. What if we make the threshold higher?" 
563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": {}, 569 | "outputs": [], 570 | "source": [ 571 | "stricter_graph = nx.Graph()" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": {}, 578 | "outputs": [], 579 | "source": [ 580 | "for character_0 in character_graph.nodes():\n", 581 | "    stricter_graph.add_node(character_0)\n", 582 | "    for character_1 in character_graph.nodes():\n", 583 | "        if character_0 != character_1 and (character_0, character_1) in character_graph.edges():\n", 584 | "            if character_graph.edges()[(character_0, character_1)]['weight'] > 50:\n", 585 | "                stricter_graph.add_edge(character_0, character_1)" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": null, 591 | "metadata": {}, 592 | "outputs": [], 593 | "source": [ 594 | "nx.draw(stricter_graph, with_labels=True, font_weight='bold')\n" 595 | ] 596 | }, 597 | { 598 | "cell_type": "markdown", 599 | "metadata": {}, 600 | "source": [ 601 | "This graph gives us an idea of the cliques: Jess is clearly the main character, and Jules's Mum speaks to everyone else the least. So far the networks have been useful in identifying the key players and relationships! We can also see Jess's Mum and Dad forming a clique with Jess.\n", 602 | "\n", 603 | "Let's take a break from networks and move on to analysing some of the actual text spoken in the movie.\n", 604 | "\n", 605 | "## Text Analysis\n", 606 | "\n", 607 | "Text is used as information in all kinds of situations, and is used heavily in social scientific analysis. There is a very wide variety of text-related methods one can use for different purposes. Text analysis can be useful for classifying documents and lines, for identifying intent, semantic content, parts of speech, named entities, and many more purposes. It can even be used to generate text, such as in chatbots. For a tutorial focusing only on text analysis, please see: https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/text_analysis_tutorial.ipynb\n", 608 | "\n", 609 | "We will be using one of the methods used in that notebook: Topic Modelling. Topic modelling is an unsupervised statistical learning algorithm which can be used to identify topics in large bodies of text. For example, if we were running topic models on newspapers, we'd likely see topics associated with the weather, politics, and sports. The algorithm assumes that topics are made up of words, and that documents are made up of words. Similarly, documents can also be thought of as made up of topics. 
All of this will be more clear once we see the actual topics generated.\n", 610 | "\n", 611 | "For a tutorial specifically for topic modelling, please see: \n", 612 | "https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": null, 618 | "metadata": {}, 619 | "outputs": [], 620 | "source": [ 621 | "import spacy\n", 622 | "import gensim\n", 623 | "from gensim.corpora import Dictionary\n", 624 | "from gensim.models import LdaModel" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": null, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [ 633 | "nlp = spacy.load('en')" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": null, 639 | "metadata": {}, 640 | "outputs": [], 641 | "source": [ 642 | "my_stop_words = [u'yeah', u'like', u'look']\n", 643 | "for stopword in my_stop_words:\n", 644 | " lexeme = nlp.vocab[stopword]\n", 645 | " lexeme.is_stop = True" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": {}, 652 | "outputs": [], 653 | "source": [ 654 | "# we add some words to the stop word list\n", 655 | "cleaned_texts, article = [], []\n", 656 | "for line in texts:\n", 657 | " doc = nlp(line.lower())\n", 658 | " for w in doc:\n", 659 | " # if it's not a stop word or punctuation mark, add it to our article!\n", 660 | " if w.text != '\\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I' and w.text.strip() is not '':\n", 661 | " # we add the lematized version of the word\n", 662 | " article.append(w.lemma_)\n", 663 | " # if it's a new line, it means we're onto our next document\n", 664 | " cleaned_texts.append(article)\n", 665 | " article = []" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": null, 671 | "metadata": {}, 672 | "outputs": [], 673 | "source": [ 674 | "cleaned_texts[9]" 675 | ] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "execution_count": null, 680 | "metadata": {}, 681 | "outputs": [], 682 | "source": [ 683 | "dictionary = Dictionary(cleaned_texts)\n", 684 | "corpus = [dictionary.doc2bow(text) for text in cleaned_texts]" 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": null, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "corpus[9]" 694 | ] 695 | }, 696 | { 697 | "cell_type": "code", 698 | "execution_count": null, 699 | "metadata": {}, 700 | "outputs": [], 701 | "source": [ 702 | "ldamodel = LdaModel(corpus=corpus, num_topics=5, id2word=dictionary)" 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": null, 708 | "metadata": {}, 709 | "outputs": [], 710 | "source": [ 711 | "ldamodel.print_topics()" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": {}, 717 | "source": [ 718 | "We can see here 5 topics: one is to do with talking to Mums (there is a lot of that going on in this movie), abuot playing, one about Jess, and one to do with the coach. We can try and guess which kinds of scenes are dominated by which topics (or write some code to identify it, too!).\n", 719 | "\n", 720 | "So we now see the power of topic models in getting a birds eye view of a large body of text: it can be even more powerful on larger texts with more well defined topics. I highly encourage you to check out both the tutorials I linked to earlier, they cover the same methods but in more detail. 
This is also just the beginning of the power of such methods: gensim has a variety of tutorials in their documentation on different ways you can use such models." 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "## Networks + Topics\n", 728 | "\n", 729 | "We're now going to combine both of the ideas we explored! We can use these topic models to get an idea of the kinds of things each of the key characters talked about. One good example is the paper - Individuals, institutions, and innovation in the debates of the French Revolution (https://www.pnas.org/content/115/18/4607), where they use topic models to find similarities and differences between the topics of different individuals in the French revolution.\n", 730 | "\n", 731 | "Let us use some of these ideas to create a graph which also contains information of the topic distributions of each of the key characters." 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": null, 737 | "metadata": {}, 738 | "outputs": [], 739 | "source": [ 740 | "key_character_texts_cleaned = {}\n", 741 | "key_character_all_words = {}\n", 742 | "key_character_doc2bow = {}" 743 | ] 744 | }, 745 | { 746 | "cell_type": "code", 747 | "execution_count": null, 748 | "metadata": {}, 749 | "outputs": [], 750 | "source": [ 751 | "for character in key_character_texts:\n", 752 | " key_character_texts_cleaned[character] = []\n", 753 | " for line in key_character_texts[character]:\n", 754 | " doc = nlp(line.lower())\n", 755 | " for w in doc:\n", 756 | " # if it's not a stop word or punctuation mark, add it to our article!\n", 757 | " if w.text != '\\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I' and w.text.strip() is not '':\n", 758 | " # we add the lematized version of the word\n", 759 | " article.append(w.lemma_)\n", 760 | " # if it's a new line, it means we're onto our next document\n", 761 | " key_character_texts_cleaned[character].append(article)\n", 762 | " article = []\n", 763 | " # now that we have all the cleaned texts, we can find all the words used\n", 764 | " key_character_all_words[character] = []\n", 765 | " for line in key_character_texts_cleaned[character]:\n", 766 | " for word in line:\n", 767 | " key_character_all_words[character].append(word)\n", 768 | " # we convert these words to doc2bow\n", 769 | " key_character_doc2bow[character] = dictionary.doc2bow(key_character_all_words[character])\n", 770 | " # we convert the doc2bow to topic proportions, and assign to the graph\n", 771 | " character_graph.nodes()[character]['topic_proportions'] = ldamodel[key_character_doc2bow[character]]" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": null, 777 | "metadata": {}, 778 | "outputs": [], 779 | "source": [ 780 | "# to make all the lists equal size\n", 781 | "character_graph.nodes['Jules\\'s Mum']['topic_proportions'].append((4,0.04))" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "for actor in character_graph.nodes():\n", 791 | " print(actor, character_graph.nodes[actor]['topic_proportions'])" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "We now have our graph with the topic proportions as well! Let us now see how similar or different they are to each other. 
We can measure topic similarity or difference using information-theoretic measures that compare probability distributions. " 799 | ] 800 | }, 801 | { 802 | "cell_type": "code", 803 | "execution_count": null, 804 | "metadata": {}, 805 | "outputs": [], 806 | "source": [ 807 | "from gensim.matutils import kullback_leibler\n" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "metadata": {}, 814 | "outputs": [], 815 | "source": [ 816 | "def convert_to_prob(bow):\n", 817 | "    ps = []\n", 818 | "    for topic_no, topic_prob in bow:\n", 819 | "        ps.append(topic_prob)\n", 820 | "    return ps" 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": null, 826 | "metadata": {}, 827 | "outputs": [], 828 | "source": [ 829 | "L = []\n", 830 | "for actor_1 in character_graph.nodes():\n", 831 | "    p = character_graph.nodes[actor_1]['topic_proportions'] \n", 832 | "    p = convert_to_prob(p)\n", 833 | "    l = []\n", 834 | "    for actor_2 in character_graph.nodes():\n", 835 | "        q = character_graph.nodes[actor_2]['topic_proportions'] \n", 836 | "        q = convert_to_prob(q)\n", 837 | "        l.append(kullback_leibler(p, q))\n", 838 | "    L.append(l)\n", 839 | "M = np.array(L)" 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": null, 845 | "metadata": {}, 846 | "outputs": [], 847 | "source": [ 848 | "fig = plt.figure()\n", 849 | "div = pd.DataFrame(M, columns = list(character_graph.nodes()), index = list(character_graph.nodes()))\n", 850 | "ax = sns.heatmap(div)\n", 851 | "plt.show()" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "Let us try and understand this plot: we see two clusters, one formed by Jules, Pinky, and Jess, and the rest separated. This makes sense, as they all use similar language in the movie. The Coach, Joe, and Jess's Mum also speak quite similarly, which doesn't make complete sense, but they do talk about Jess a lot, so maybe that's the reason?" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "## Conclusions and Ways Forward\n", 866 | "\n", 867 | "Phew - there was a lot going on there! We explored a variety of concepts, from networks, to topic models, to measuring similarities between topic models. These methods serve as the base for a variety of more complex analyses, and hopefully they have served to illustrate the kinds of tools Computational Social Scientists use throughout their analysis. A huge part of the notebook was also cleaning and organising data into different structures so that we can analyse them: this is a very important part of Computational Social Science, and indeed of any data analysis exercise.\n", 868 | "\n", 869 | "While the movie we chose to analyse may not have thrown us any very surprising results, it did provide some useful summarising ideas about the movie, which we arrived at only by looking at the text of the script. While Social Scientists may not sit around analysing movie scripts, if the process is operationalised over a large number of movies, some of the metrics we learned in the tutorial today may prove useful. Different social scientists use different parts of the tools we explored today to carry out their analysis. \n", 870 | "\n", 871 | "If you wish to carry out your own academic research and analysis, focus on your research question and what kind of data might be helpful in answering the question. 
Then, explore the exisiting literature on the question, and think of a computational technique might help in add to the literature in a useful way. Outside of academia, these methods can still be used for a variety of business and personal needs.\n", 872 | "\n", 873 | "Happy researching and problem solving - and feel free to reach out to me for any clarifications, advice, or suggestions/errata!" 874 | ] 875 | } 876 | ], 877 | "metadata": { 878 | "kernelspec": { 879 | "display_name": "Python 3", 880 | "language": "python", 881 | "name": "python3" 882 | }, 883 | "language_info": { 884 | "codemirror_mode": { 885 | "name": "ipython", 886 | "version": 3 887 | }, 888 | "file_extension": ".py", 889 | "mimetype": "text/x-python", 890 | "name": "python", 891 | "nbconvert_exporter": "python", 892 | "pygments_lexer": "ipython3", 893 | "version": "3.7.6" 894 | } 895 | }, 896 | "nbformat": 4, 897 | "nbformat_minor": 4 898 | } 899 | -------------------------------------------------------------------------------- /notebooks/gensim/Dynamic Topic Model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/notebooks/gensim/Dynamic Topic Model.png -------------------------------------------------------------------------------- /notebooks/gensim/Monkey Brains New.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/notebooks/gensim/Monkey Brains New.png -------------------------------------------------------------------------------- /notebooks/gensim/Monkey Brains.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/notebooks/gensim/Monkey Brains.png -------------------------------------------------------------------------------- /notebooks/gensim/distance_metrics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## New Distance Metrics for Probability Distribution and Bag of Words " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "A small tutorial to illustrate the new distance functions.\n", 15 | "\n", 16 | "We would need this mostly when comparing how similar two probability distributions are, and in the case of gensim, usually for LSI or LDA topic distributions after we have a LDA model.\n", 17 | "\n", 18 | "Gensim already has functionalities for this, in the sense of getting most similar documents - [this](http://radimrehurek.com/topic_modeling_tutorial/3%20-%20Indexing%20and%20Retrieval.html), [this](https://radimrehurek.com/gensim/tut3.html) and [this](https://radimrehurek.com/gensim/similarities/docsim.html) are such examples of documentation and tutorials.\n", 19 | "\n", 20 | "What this tutorial shows is a building block of these larger methods, which are a small suite of distance metrics.\n", 21 | "We'll start by setting up a small corpus and showing off the methods." 
22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": false 29 | }, 30 | "outputs": [], 31 | "source": [ 32 | "from gensim.corpora import Dictionary\n", 33 | "from gensim.models import ldamodel\n", 34 | "from gensim.matutils import kullback_leibler, jaccard, hellinger, sparse2full\n", 35 | "import numpy" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "metadata": { 42 | "collapsed": true 43 | }, 44 | "outputs": [], 45 | "source": [ 46 | "# you can use any corpus, this is just illustratory\n", 47 | "\n", 48 | "texts = [['bank','river','shore','water'],\n", 49 | " ['river','water','flow','fast','tree'],\n", 50 | " ['bank','water','fall','flow'],\n", 51 | " ['bank','bank','water','rain','river'],\n", 52 | " ['river','water','mud','tree'],\n", 53 | " ['money','transaction','bank','finance'],\n", 54 | " ['bank','borrow','money'], \n", 55 | " ['bank','finance'],\n", 56 | " ['finance','money','sell','bank'],\n", 57 | " ['borrow','sell'],\n", 58 | " ['bank','loan','sell']]\n", 59 | "\n", 60 | "dictionary = Dictionary(texts)\n", 61 | "corpus = [dictionary.doc2bow(text) for text in texts]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 3, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "[(0,\n", 75 | " u'0.164*bank + 0.142*water + 0.108*river + 0.076*flow + 0.067*borrow + 0.063*sell + 0.060*tree + 0.048*money + 0.046*fast + 0.044*rain'),\n", 76 | " (1,\n", 77 | " u'0.196*bank + 0.120*finance + 0.100*money + 0.082*sell + 0.067*river + 0.065*water + 0.056*transaction + 0.049*loan + 0.046*tree + 0.040*mud')]" 78 | ] 79 | }, 80 | "execution_count": 3, 81 | "metadata": {}, 82 | "output_type": "execute_result" 83 | } 84 | ], 85 | "source": [ 86 | "numpy.random.seed(1) # setting random seed to get the same results each time.\n", 87 | "model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)\n", 88 | "\n", 89 | "model.show_topics()" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": {}, 95 | "source": [ 96 | "Let's take a few sample documents and get them ready to test Similarity. Let's call the 1st topic the water topic and the second topic the finance topic.\n", 97 | "\n", 98 | "Note: these are all distance metrics. This means that a value between 0 and 1 is returned, where values closer to 0 indicate a smaller 'distance' and therefore a larger similarity." 
99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 4, 104 | "metadata": { 105 | "collapsed": false 106 | }, 107 | "outputs": [], 108 | "source": [ 109 | "doc_water = ['river', 'water', 'shore']\n", 110 | "doc_finance = ['finance', 'money', 'sell']\n", 111 | "doc_bank = ['finance', 'bank', 'tree', 'water']\n", 112 | "\n", 113 | "# now let's make these into a bag of words format\n", 114 | "\n", 115 | "bow_water = model.id2word.doc2bow(doc_water) \n", 116 | "bow_finance = model.id2word.doc2bow(doc_finance) \n", 117 | "bow_bank = model.id2word.doc2bow(doc_bank) \n", 118 | "\n", 119 | "# we can now get the LDA topic distributions for these\n", 120 | "lda_bow_water = model[bow_water]\n", 121 | "lda_bow_finance = model[bow_finance]\n", 122 | "lda_bow_bank = model[bow_bank]" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "## Hellinger and Kullback–Leibler" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "We're now ready to apply our distance metrics.\n", 137 | "\n", 138 | "Let's start with the popular Hellinger distance. \n", 139 | "The Hellinger distance metric gives an output in the range [0,1] for two probability distributions, with values closer to 0 meaning they are more similar." 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 5, 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "data": { 151 | "text/plain": [ 152 | "0.51251199778753576" 153 | ] 154 | }, 155 | "execution_count": 5, 156 | "metadata": {}, 157 | "output_type": "execute_result" 158 | } 159 | ], 160 | "source": [ 161 | "hellinger(lda_bow_water, lda_bow_finance)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 6, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "0.23407305272210427" 175 | ] 176 | }, 177 | "execution_count": 6, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "hellinger(lda_bow_finance, lda_bow_bank)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "Makes sense, right? In the first example, Document 1 and Document 2 are hardly similar, so we get a value of roughly 0.5. \n", 191 | "\n", 192 | "In the second case, the documents are a lot more similar, semantically. Trained with the model, they give a much less distance value." 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "Let's run similar examples down with Kullback Leibler." 
200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 7, 205 | "metadata": { 206 | "collapsed": false 207 | }, 208 | "outputs": [ 209 | { 210 | "data": { 211 | "text/plain": [ 212 | "0.30823547" 213 | ] 214 | }, 215 | "execution_count": 7, 216 | "metadata": {}, 217 | "output_type": "execute_result" 218 | } 219 | ], 220 | "source": [ 221 | "kullback_leibler(lda_bow_water, lda_bow_bank)" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 8, 227 | "metadata": { 228 | "collapsed": false 229 | }, 230 | "outputs": [ 231 | { 232 | "data": { 233 | "text/plain": [ 234 | "0.19881117" 235 | ] 236 | }, 237 | "execution_count": 8, 238 | "metadata": {}, 239 | "output_type": "execute_result" 240 | } 241 | ], 242 | "source": [ 243 | "kullback_leibler(lda_bow_finance, lda_bow_bank)" 244 | ] 245 | }, 246 | { 247 | "cell_type": "markdown", 248 | "metadata": {}, 249 | "source": [ 250 | "*NOTE!*\n", 251 | "\n", 252 | "KL is not a Distance Metric in the mathematical sense, and hence is not symmetrical. \n", 253 | "This means that `kullback_leibler(lda_bow_finance, lda_bow_bank)` is not equal to `kullback_leibler(lda_bow_bank, lda_bow_finance)`. " 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 9, 259 | "metadata": { 260 | "collapsed": false 261 | }, 262 | "outputs": [ 263 | { 264 | "data": { 265 | "text/plain": [ 266 | "0.24780412" 267 | ] 268 | }, 269 | "execution_count": 9, 270 | "metadata": {}, 271 | "output_type": "execute_result" 272 | } 273 | ], 274 | "source": [ 275 | "# As you can see, the values are not equal. We'll get more into the details of this later on in the notebook.\n", 276 | "kullback_leibler(lda_bow_bank, lda_bow_finance)" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "In our previous examples we saw that there were lower distance values between bank and finance than for bank and water, even if it wasn't by a huge margin. What does this mean?\n", 284 | "\n", 285 | "The `bank` document is a combination of both water and finance related terms - but as bank in this context is likely to belong to the finance topic, the distance values are less between the finance and bank bows." 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 10, 291 | "metadata": { 292 | "collapsed": false 293 | }, 294 | "outputs": [ 295 | { 296 | "data": { 297 | "text/plain": [ 298 | "[(0, 0.44146764073708339), (1, 0.55853235926291656)]" 299 | ] 300 | }, 301 | "execution_count": 10, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "# just to confirm our suspicion that the bank bow is more to do with finance:\n", 308 | "\n", 309 | "model.get_document_topics(bow_bank)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "It's evident that while it isn't too skewed, it it more towards the finance topic." 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "Distance metrics (also referred to as similarity metrics), as suggested in the examples above, are mainly for probability distributions, but the methods can accept a bunch of formats for input. You can do some further reading on [Kullback Leibler](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) and [Hellinger](https://en.wikipedia.org/wiki/Hellinger_distance) to figure out what suits your needs." 
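Before moving on, it can help to see what the two helpers compute. Here is a minimal numpy sketch of the Hellinger and Kullback–Leibler formulas for dense, equal-length probability vectors — an illustration only, not gensim's implementation (which also accepts sparse bag-of-words input):

```python
import numpy as np

def hellinger_dense(p, q):
    # (1 / sqrt(2)) * Euclidean norm of (sqrt(p) - sqrt(q)); bounded in [0, 1]
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def kl_dense(p, q):
    # sum of p * log(p / q); unbounded, and blows up when q has zeros where p does not --
    # the same zero problem discussed later in this notebook
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

print(hellinger_dense([0.5, 0.5], [0.5, 0.5]))  # 0.0 -- identical distributions
print(hellinger_dense([1.0, 0.0], [0.0, 1.0]))  # 1.0 -- completely disjoint distributions
print(kl_dense([0.7, 0.3], [0.5, 0.5]), kl_dense([0.5, 0.5], [0.7, 0.3]))  # two different values: KL is asymmetric
```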
324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "## Jaccard " 331 | ] 332 | }, 333 | { 334 | "cell_type": "markdown", 335 | "metadata": {}, 336 | "source": [ 337 | "Let us now look at the [Jaccard Distance](https://en.wikipedia.org/wiki/Jaccard_index) metric for similarity between bags of words (i.e, documents)" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 11, 343 | "metadata": { 344 | "collapsed": false 345 | }, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/plain": [ 350 | "0.8571428571428572" 351 | ] 352 | }, 353 | "execution_count": 11, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "jaccard(bow_water, bow_bank)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 12, 365 | "metadata": { 366 | "collapsed": false 367 | }, 368 | "outputs": [ 369 | { 370 | "data": { 371 | "text/plain": [ 372 | "0.8333333333333334" 373 | ] 374 | }, 375 | "execution_count": 12, 376 | "metadata": {}, 377 | "output_type": "execute_result" 378 | } 379 | ], 380 | "source": [ 381 | "jaccard(doc_water, doc_bank)" 382 | ] 383 | }, 384 | { 385 | "cell_type": "code", 386 | "execution_count": 13, 387 | "metadata": { 388 | "collapsed": false 389 | }, 390 | "outputs": [ 391 | { 392 | "data": { 393 | "text/plain": [ 394 | "0.0" 395 | ] 396 | }, 397 | "execution_count": 13, 398 | "metadata": {}, 399 | "output_type": "execute_result" 400 | } 401 | ], 402 | "source": [ 403 | "jaccard(['word'], ['word'])" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "The three examples above feature 2 different input methods. \n", 411 | "\n", 412 | "In the first case, we present to jaccard document vectors already in bag of words format. The distance can be defined as 1 minus the size of the intersection upon the size of the union of the vectors. \n", 413 | "\n", 414 | "We can see (on manual inspection as well), that the distance is likely to be high - and it is. \n", 415 | "\n", 416 | "The last two examples illustrate the ability for jaccard to accept even lists (i.e, documents) as inputs.\n", 417 | "In the last case, because they are the same vectors, the value returned is 0 - this means the distance is 0 and they are very similar. " 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "## Distance Metrics for Topic Distributions" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "While there are already standard methods to identify similarity of documents, our distance metrics has one more interesting use-case: topic distributions. \n", 432 | "\n", 433 | "Let's say we want to find out how similar our two topics are, water and finance." 
434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 14, 439 | "metadata": { 440 | "collapsed": false 441 | }, 442 | "outputs": [ 443 | { 444 | "data": { 445 | "text/plain": [ 446 | "[(3, 0.196),\n", 447 | " (12, 0.12),\n", 448 | " (10, 0.1),\n", 449 | " (14, 0.082),\n", 450 | " (2, 0.067),\n", 451 | " (0, 0.065),\n", 452 | " (11, 0.056),\n", 453 | " (15, 0.049),\n", 454 | " (5, 0.046),\n", 455 | " (9, 0.04)]" 456 | ] 457 | }, 458 | "execution_count": 14, 459 | "metadata": {}, 460 | "output_type": "execute_result" 461 | } 462 | ], 463 | "source": [ 464 | "topic_water, topic_finance = model.show_topics()\n", 465 | "\n", 466 | "# some pre processing to get the topics in a format acceptable to our distance metrics\n", 467 | "\n", 468 | "def make_topics_bow(topic):\n", 469 | " # takes the string returned by model.show_topics()\n", 470 | " # split on strings to get topics and the probabilities\n", 471 | " topic = topic.split('+')\n", 472 | " # list to store topic bows\n", 473 | " topic_bow = []\n", 474 | " for word in topic:\n", 475 | " # split probability and word\n", 476 | " prob, word = word.split('*')\n", 477 | " # get rid of spaces\n", 478 | " word = word.replace(\" \",\"\")\n", 479 | " # convert to word_type\n", 480 | " word = model.id2word.doc2bow([word])[0][0]\n", 481 | " topic_bow.append((word, float(prob)))\n", 482 | " return topic_bow\n", 483 | "\n", 484 | "finance_distribution = make_topics_bow(topic_finance[1])\n", 485 | "water_distribution = make_topics_bow(topic_water[1])\n", 486 | "\n", 487 | "# the finance topic in bag of words format looks like this:\n", 488 | "finance_distribution" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "Now that we've got our topics in a format more acceptable by our functions, let's use a Distance metric to see how similar the word distributions in the topics are." 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 15, 501 | "metadata": { 502 | "collapsed": false 503 | }, 504 | "outputs": [ 505 | { 506 | "data": { 507 | "text/plain": [ 508 | "0.36453028040240248" 509 | ] 510 | }, 511 | "execution_count": 15, 512 | "metadata": {}, 513 | "output_type": "execute_result" 514 | } 515 | ], 516 | "source": [ 517 | "hellinger(water_distribution, finance_distribution)" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "Our value of roughly 0.36 means that the topics are not TOO distant with respect to their word distributions.\n", 525 | "This makes sense again, because of overlapping words like `bank` and a small size dictionary." 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "## Some things to take care of " 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "In our previous example we didn't use Kullback Leibler to test for similarity for a reason - KL is not a Distance 'Metric' in the technical sense (you can see what a metric is [here](https://en.wikipedia.org/wiki/Metric_(mathematics)). The nature of it, mathematically also means we must be a little careful before using it, because since it involves the log function, a zero can mess things up. 
For example:" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": 16, 545 | "metadata": { 546 | "collapsed": false 547 | }, 548 | "outputs": [ 549 | { 550 | "data": { 551 | "text/plain": [ 552 | "inf" 553 | ] 554 | }, 555 | "execution_count": 16, 556 | "metadata": {}, 557 | "output_type": "execute_result" 558 | } 559 | ], 560 | "source": [ 561 | "# 16 here is the number of features the probability distribution draws from\n", 562 | "kullback_leibler(water_distribution, finance_distribution, 16) " 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "That wasn't very helpful, right? This just means that we have to be a bit careful about our inputs. Our old example didn't work out because they were some missing values for some words (because `show_topics()` only returned the top 10 topics). \n", 570 | "\n", 571 | "This can be remedied, though." 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 17, 577 | "metadata": { 578 | "collapsed": false 579 | }, 580 | "outputs": [ 581 | { 582 | "data": { 583 | "text/plain": [ 584 | "0.19781515" 585 | ] 586 | }, 587 | "execution_count": 17, 588 | "metadata": {}, 589 | "output_type": "execute_result" 590 | } 591 | ], 592 | "source": [ 593 | "# return ALL the words in the dictionary for the topic-word distribution.\n", 594 | "topic_water, topic_finance = model.show_topics(num_words=len(model.id2word))\n", 595 | "\n", 596 | "# do our bag of words transformation again\n", 597 | "finance_distribution = make_topics_bow(topic_finance[1])\n", 598 | "water_distribution = make_topics_bow(topic_water[1])\n", 599 | "\n", 600 | "# and voila!\n", 601 | "kullback_leibler(water_distribution, finance_distribution)" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "You may notice that the distance for this is quite less, indicating a high similarity. This may be a bit off because of the small size of the corpus, where all topics are likely to contain a decent overlap of word probabilities. You will likely get a better value for a bigger corpus.\n", 609 | "\n", 610 | "So, just remember, if you intend to use KL as a metric to measure similarity or distance between two distributions, avoid zeros by returning the ENTIRE distribution. Since it's unlikely any probability distribution will ever have absolute zeros for any feature/word, returning all the values like we did will make you good to go." 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "## So - what exactly are Distance Metrics? " 618 | ] 619 | }, 620 | { 621 | "cell_type": "markdown", 622 | "metadata": {}, 623 | "source": [ 624 | "Having seen the practical usages of these measures (i.e, to find similarity), let's learn a little about what exactly Distance Measures and Metrics are. \n", 625 | "\n", 626 | "I mentioned in the previous section that KL was not a distance metric. There are 4 conditons for for a distance measure to be a matric:\n", 627 | "\n", 628 | "1.\td(x,y) >= 0\n", 629 | "2. d(x,y) = 0 <=> x = y\n", 630 | "3. d(x,y) = d(y,x)\n", 631 | "4. d(x,z) <= d(x,y) + d(y,z)\n", 632 | "\n", 633 | "That is: it must be non-negative; if x and y are the same, distance must be zero; it must be symmetric; and it must obey the triangle inequality law. \n", 634 | "\n", 635 | "Simple enough, right? \n", 636 | "Let's test these out for our measures." 
637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 18, 642 | "metadata": { 643 | "collapsed": false 644 | }, 645 | "outputs": [ 646 | { 647 | "data": { 648 | "text/plain": [ 649 | "0.22491784692602151" 650 | ] 651 | }, 652 | "execution_count": 18, 653 | "metadata": {}, 654 | "output_type": "execute_result" 655 | } 656 | ], 657 | "source": [ 658 | "# normal Hellinger\n", 659 | "hellinger(water_distribution, finance_distribution)" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 19, 665 | "metadata": { 666 | "collapsed": false 667 | }, 668 | "outputs": [ 669 | { 670 | "data": { 671 | "text/plain": [ 672 | "0.22491784692602151" 673 | ] 674 | }, 675 | "execution_count": 19, 676 | "metadata": {}, 677 | "output_type": "execute_result" 678 | } 679 | ], 680 | "source": [ 681 | "# we swap finance and water distributions and get the same value. It is indeed symmetric!\n", 682 | "hellinger(finance_distribution, water_distribution)" 683 | ] 684 | }, 685 | { 686 | "cell_type": "code", 687 | "execution_count": 20, 688 | "metadata": { 689 | "collapsed": false 690 | }, 691 | "outputs": [ 692 | { 693 | "data": { 694 | "text/plain": [ 695 | "0.0" 696 | ] 697 | }, 698 | "execution_count": 20, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "# if we pass the same values, it is zero.\n", 705 | "hellinger(water_distribution, water_distribution)" 706 | ] 707 | }, 708 | { 709 | "cell_type": "code", 710 | "execution_count": 21, 711 | "metadata": { 712 | "collapsed": false 713 | }, 714 | "outputs": [ 715 | { 716 | "data": { 717 | "text/plain": [ 718 | "0.23407305272210427" 719 | ] 720 | }, 721 | "execution_count": 21, 722 | "metadata": {}, 723 | "output_type": "execute_result" 724 | } 725 | ], 726 | "source": [ 727 | "# for triangle inequality let's use LDA document distributions\n", 728 | "hellinger(lda_bow_finance, lda_bow_bank)" 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": 22, 734 | "metadata": { 735 | "collapsed": false 736 | }, 737 | "outputs": [ 738 | { 739 | "data": { 740 | "text/plain": [ 741 | "0.79979376323008911" 742 | ] 743 | }, 744 | "execution_count": 22, 745 | "metadata": {}, 746 | "output_type": "execute_result" 747 | } 748 | ], 749 | "source": [ 750 | "# Triangle inequality works too!\n", 751 | "hellinger(lda_bow_finance, lda_bow_water) + hellinger(lda_bow_water, lda_bow_bank)" 752 | ] 753 | }, 754 | { 755 | "cell_type": "markdown", 756 | "metadata": {}, 757 | "source": [ 758 | "So Hellinger is indeed a metric. Let's check out KL. 
" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 23, 764 | "metadata": { 765 | "collapsed": false 766 | }, 767 | "outputs": [ 768 | { 769 | "data": { 770 | "text/plain": [ 771 | "0.2149342" 772 | ] 773 | }, 774 | "execution_count": 23, 775 | "metadata": {}, 776 | "output_type": "execute_result" 777 | } 778 | ], 779 | "source": [ 780 | "kullback_leibler(finance_distribution, water_distribution)" 781 | ] 782 | }, 783 | { 784 | "cell_type": "code", 785 | "execution_count": 24, 786 | "metadata": { 787 | "collapsed": false 788 | }, 789 | "outputs": [ 790 | { 791 | "data": { 792 | "text/plain": [ 793 | "0.19781515" 794 | ] 795 | }, 796 | "execution_count": 24, 797 | "metadata": {}, 798 | "output_type": "execute_result" 799 | } 800 | ], 801 | "source": [ 802 | "kullback_leibler(water_distribution, finance_distribution)" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "We immediately notice that when we swap the values they aren't equal! One of the four conditions not fitting is enough for it to not be a metric. \n", 810 | "\n", 811 | "However, just because it is not a metric, (strictly in the mathematical sense) does not mean that it is not useful to figure out the distance between two probability distributions. KL Divergence is widely used for this purpose, and is probably the most 'famous' distance measure in fields like Information Theory.\n", 812 | "\n", 813 | "For a nice review of the mathematical differences between Hellinger and KL, [this](http://stats.stackexchange.com/questions/130432/differences-between-bhattacharyya-distance-and-kl-divergence) link does a very good job. " 814 | ] 815 | }, 816 | { 817 | "cell_type": "markdown", 818 | "metadata": {}, 819 | "source": [ 820 | "## Conclusion" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [ 827 | "That brings us to the end of this small tutorial.\n", 828 | "The scope for adding new similarity metrics is large, as there exist an even larger suite of metrics and methods to add to the matutils.py file. ([This](http://nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf) is one paper which talks about some of them)\n", 829 | "\n", 830 | "Looking forward to more PRs towards this functionality in Gensim! :)" 831 | ] 832 | } 833 | ], 834 | "metadata": { 835 | "kernelspec": { 836 | "display_name": "Python 2", 837 | "language": "python", 838 | "name": "python2" 839 | }, 840 | "language_info": { 841 | "codemirror_mode": { 842 | "name": "ipython", 843 | "version": 2 844 | }, 845 | "file_extension": ".py", 846 | "mimetype": "text/x-python", 847 | "name": "python", 848 | "nbconvert_exporter": "python", 849 | "pygments_lexer": "ipython2", 850 | "version": "2.7.11" 851 | } 852 | }, 853 | "nbformat": 4, 854 | "nbformat_minor": 0 855 | } 856 | -------------------------------------------------------------------------------- /notebooks/gensim/dtm_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# DTM Example\n", 8 | "\n", 9 | "In this example we will present a sample usage of the DTM wrapper. Prior to using this you need to compile the [DTM code](https://github.com/magsilva/dtm) yourself or use one of the [binaries](https://github.com/magsilva/dtm/tree/master/bin).\n", 10 | "\n", 11 | "This tutorial is on Windows. 
Running it on Linux and OSX is the same.\n", 12 | "\n", 13 | "In this example we will use a small already processed corpus. To see how to get a dataset to this stage please take a look at [Gensim Tutorials](https://radimrehurek.com/gensim/tutorial.html)" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 1, 19 | "metadata": { 20 | "collapsed": true 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "import logging\n", 25 | "import os\n", 26 | "from gensim import corpora, utils\n", 27 | "from gensim.models.wrappers.dtmmodel import DtmModel\n", 28 | "import numpy as np\n", 29 | "\n", 30 | "if not os.environ.get('DTM_PATH', None):\n", 31 | " raise ValueError(\"SKIP: You need to set the DTM path\")" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "First we wil setup logging" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": { 45 | "collapsed": false 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "logger = logging.getLogger()\n", 50 | "logger.setLevel(logging.DEBUG)\n", 51 | "logging.debug(\"test\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "Now lets load a set of documents" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 3, 64 | "metadata": { 65 | "collapsed": true 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "documents = [[u'senior', u'studios', u'studios', u'studios', u'creators', u'award', u'mobile', u'currently', u'challenges', u'senior', u'summary', u'senior', u'motivated', u'creative', u'senior', u'performs', u'engineering', u'tasks', u'infrastructure', u'focusing', u'primarily', u'programming', u'interaction', u'designers', u'engineers', u'leadership', u'teams', u'teams', u'crews', u'responsibilities', u'engineering', u'quality', u'functional', u'functional', u'teams', u'organizing', u'prioritizing', u'technical', u'decisions', u'engineering', u'participates', u'participates', u'reviews', u'participates', u'hiring', u'conducting', u'interviews', u'feedback', u'departments', u'define', u'focusing', u'engineering', u'teams', u'crews', u'facilitate', u'engineering', u'departments', u'deadlines', u'milestones', u'typically', u'spends', u'designing', u'developing', u'updating', u'bugs', u'mentoring', u'engineers', u'define', u'schedules', u'milestones', u'participating', u'reviews', u'interviews', u'sized', u'teams', u'interacts', u'disciplines', u'knowledge', u'skills', u'knowledge', u'knowledge', u'xcode', u'scripting', u'debugging', u'skills', u'skills', u'knowledge', u'disciplines', u'animation', u'networking', u'expertise', u'competencies', u'oral', u'skills', u'management', u'skills', u'proven', u'effectively', u'teams', u'deadline', u'environment', u'bachelor', u'minimum', u'shipped', u'leadership', u'teams', u'location', u'resumes', u'jobs', u'candidates', u'openings', u'jobs'], [u'maryland', u'client', u'producers', u'electricity', u'operates', u'storage', u'utility', u'retail', u'customers', u'engineering', u'consultant', u'maryland', u'summary', u'technical', u'technology', u'departments', u'expertise', u'maximizing', u'output', u'reduces', u'operating', u'participates', u'areas', u'engineering', u'conducts', u'testing', u'solve', u'supports', u'environmental', u'understands', u'objectives', u'operates', u'responsibilities', u'handles', u'complex', u'engineering', u'aspects', u'monitors', u'quality', u'proficiency', u'optimization', u'recommendations', u'supports', u'personnel', 
u'troubleshooting', u'commissioning', u'startup', u'shutdown', u'supports', u'procedure', u'operating', u'units', u'develops', u'simulations', u'troubleshooting', u'tests', u'enhancing', u'solving', u'develops', u'estimates', u'schedules', u'scopes', u'understands', u'technical', u'management', u'utilize', u'routine', u'conducts', u'hazards', u'utilizing', u'hazard', u'operability', u'methodologies', u'participates', u'startup', u'reviews', u'pssr', u'participate', u'teams', u'participate', u'regulatory', u'audits', u'define', u'scopes', u'budgets', u'schedules', u'technical', u'management', u'environmental', u'awareness', u'interfacing', u'personnel', u'interacts', u'regulatory', u'departments', u'input', u'objectives', u'identifying', u'introducing', u'concepts', u'solutions', u'peers', u'customers', u'coworkers', u'knowledge', u'skills', u'engineering', u'quality', u'engineering', u'commissioning', u'startup', u'knowledge', u'simulators', u'technologies', u'knowledge', u'engineering', u'techniques', u'disciplines', u'leadership', u'skills', u'proven', u'engineers', u'oral', u'skills', u'technical', u'skills', u'analytically', u'solve', u'complex', u'interpret', u'proficiency', u'simulation', u'knowledge', u'applications', u'manipulate', u'applications', u'engineering', u'calculations', u'programs', u'matlab', u'excel', u'independently', u'environment', u'proven', u'skills', u'effectively', u'multiple', u'tasks', u'planning', u'organizational', u'management', u'skills', u'rigzone', u'jobs', u'developer', u'exceptional', u'strategies', u'junction', u'exceptional', u'strategies', u'solutions', u'solutions', u'biggest', u'insurers', u'operates', u'investment'], [u'vegas', u'tasks', u'electrical', u'contracting', u'expertise', u'virtually', u'electrical', u'developments', u'institutional', u'utilities', u'technical', u'experts', u'relationships', u'credibility', u'contractors', u'utility', u'customers', u'customer', u'relationships', u'consistently', u'innovations', u'profile', u'construct', u'envision', u'dynamic', u'complex', u'electrical', u'management', u'grad', u'internship', u'electrical', u'engineering', u'infrastructures', u'engineers', u'documented', u'management', u'engineering', u'quality', u'engineering', u'electrical', u'engineers', u'complex', u'distribution', u'grounding', u'estimation', u'testing', u'procedures', u'voltage', u'engineering', u'troubleshooting', u'installation', u'documentation', u'bsee', u'certification', u'electrical', u'voltage', u'cabling', u'electrical', u'engineering', u'candidates', u'electrical', u'internships', u'oral', u'skills', u'organizational', u'prioritization', u'skills', u'skills', u'excel', u'cadd', u'calculation', u'autocad', u'mathcad', u'skills', u'skills', u'customer', u'relationships', u'solving', u'ethic', u'motivation', u'tasks', u'budget', u'affirmative', u'diversity', u'workforce', u'gender', u'orientation', u'disability', u'disabled', u'veteran', u'vietnam', u'veteran', u'qualifying', u'veteran', u'diverse', u'candidates', u'respond', u'developing', u'workplace', u'reflects', u'diversity', u'communities', u'reviews', u'electrical', u'contracting', u'southwest', u'electrical', u'contractors'], [u'intern', u'electrical', u'engineering', u'idexx', u'laboratories', u'validating', u'idexx', u'integrated', u'hardware', u'entails', u'planning', u'debug', u'validation', u'engineers', u'validation', u'methodologies', u'healthcare', u'platforms', u'brightest', u'solve', u'challenges', u'innovation', u'technology', u'idexx', u'intern', 
u'idexx', u'interns', u'supplement', u'interns', u'teams', u'roles', u'competitive', u'interns', u'idexx', u'interns', u'participate', u'internships', u'mentors', u'seminars', u'topics', u'leadership', u'workshops', u'relevant', u'planning', u'topics', u'intern', u'presentations', u'mixers', u'applicants', u'ineligible', u'laboratory', u'compliant', u'idexx', u'laboratories', u'healthcare', u'innovation', u'practicing', u'veterinarians', u'diagnostic', u'technology', u'idexx', u'enhance', u'veterinarians', u'efficiency', u'economically', u'idexx', u'worldwide', u'diagnostic', u'tests', u'tests', u'quality', u'headquartered', u'idexx', u'laboratories', u'employs', u'customers', u'qualifications', u'applicants', u'idexx', u'interns', u'potential', u'demonstrated', u'portfolio', u'recommendation', u'resumes', u'marketing', u'location', u'americas', u'verification', u'validation', u'schedule', u'overtime', u'idexx', u'laboratories', u'reviews', u'idexx', u'laboratories', u'nasdaq', u'healthcare', u'innovation', u'practicing', u'veterinarians'], [u'location', u'duration', u'temp', u'verification', u'validation', u'tester', u'verification', u'validation', u'middleware', u'specifically', u'testing', u'applications', u'clinical', u'laboratory', u'regulated', u'environment', u'responsibilities', u'complex', u'hardware', u'testing', u'clinical', u'analyzers', u'laboratory', u'graphical', u'interfaces', u'complex', u'sample', u'sequencing', u'protocols', u'developers', u'correction', u'tracking', u'tool', u'timely', u'troubleshoot', u'testing', u'functional', u'manual', u'automated', u'participate', u'ongoing', u'testing', u'coverage', u'planning', u'documentation', u'testing', u'validation', u'corrections', u'monitor', u'implementation', u'recurrence', u'operating', u'statistical', u'quality', u'testing', u'global', u'multi', u'teams', u'travel', u'skills', u'concepts', u'waterfall', u'agile', u'methodologies', u'debugging', u'skills', u'complex', u'automated', u'instrumentation', u'environment', u'hardware', u'mechanical', u'components', u'tracking', u'lifecycle', u'management', u'quality', u'organize', u'define', u'priorities', u'organize', u'supervision', u'aggressive', u'deadlines', u'ambiguity', u'analyze', u'complex', u'situations', u'concepts', u'technologies', u'verbal', u'skills', u'effectively', u'technical', u'clinical', u'diverse', u'strategy', u'clinical', u'chemistry', u'analyzer', u'laboratory', u'middleware', u'basic', u'automated', u'testing', u'biomedical', u'engineering', u'technologists', u'laboratory', u'technology', u'availability', u'click', u'attach'], [u'scientist', u'linux', u'asrc', u'scientist', u'linux', u'asrc', u'technology', u'solutions', u'subsidiary', u'asrc', u'engineering', u'technology', u'contracts', u'multiple', u'agencies', u'scientists', u'engineers', u'management', u'personnel', u'allows', u'solutions', u'complex', u'aeronautics', u'aviation', u'management', u'aviation', u'engineering', u'hughes', u'technical', u'technical', u'aviation', u'evaluation', u'engineering', u'management', u'technical', u'terminal', u'surveillance', u'programs', u'currently', u'scientist', u'travel', u'responsibilities', u'develops', u'technology', u'modifies', u'technical', u'complex', u'reviews', u'draft', u'conformity', u'completeness', u'testing', u'interface', u'hardware', u'regression', u'impact', u'reliability', u'maintainability', u'factors', u'standardization', u'skills', u'travel', u'programming', u'linux', u'environment', u'cisco', u'knowledge', u'terminal', 
u'environment', u'clearance', u'clearance', u'input', u'output', u'digital', u'automatic', u'terminal', u'management', u'controller', u'termination', u'testing', u'evaluating', u'policies', u'procedure', u'interface', u'installation', u'verification', u'certification', u'core', u'avionic', u'programs', u'knowledge', u'procedural', u'testing', u'interfacing', u'hardware', u'regression', u'impact', u'reliability', u'maintainability', u'factors', u'standardization', u'missions', u'asrc', u'subsidiaries', u'affirmative', u'employers', u'applicants', u'disability', u'veteran', u'technology', u'location', u'airport', u'bachelor', u'schedule', u'travel', u'contributor', u'management', u'asrc', u'reviews'], [u'technical', u'solarcity', u'niche', u'vegas', u'overview', u'resolving', u'customer', u'clients', u'expanding', u'engineers', u'developers', u'responsibilities', u'knowledge', u'planning', u'adapt', u'dynamic', u'environment', u'inventive', u'creative', u'solarcity', u'lifecycle', u'responsibilities', u'technical', u'analyzing', u'diagnosing', u'troubleshooting', u'customers', u'ticketing', u'console', u'escalate', u'knowledge', u'engineering', u'timely', u'basic', u'phone', u'functionality', u'customer', u'tracking', u'knowledgebase', u'rotation', u'configure', u'deployment', u'sccm', u'technical', u'deployment', u'deploy', u'hardware', u'solarcity', u'bachelor', u'knowledge', u'dell', u'laptops', u'analytical', u'troubleshooting', u'solving', u'skills', u'knowledge', u'databases', u'preferably', u'server', u'preferably', u'monitoring', u'suites', u'documentation', u'procedures', u'knowledge', u'entries', u'verbal', u'skills', u'customer', u'skills', u'competitive', u'solar', u'package', u'insurance', u'vacation', u'savings', u'referral', u'eligibility', u'equity', u'performers', u'solarcity', u'affirmative', u'diversity', u'workplace', u'applicants', u'orientation', u'disability', u'veteran', u'careerrookie'], [u'embedded', u'exelis', u'junction', u'exelis', u'embedded', u'acquisition', u'networking', u'capabilities', u'classified', u'customer', u'motivated', u'develops', u'tests', u'innovative', u'solutions', u'minimal', u'supervision', u'paced', u'environment', u'enjoys', u'assignments', u'interact', u'multi', u'disciplined', u'challenging', u'focused', u'embedded', u'developments', u'spanning', u'engineering', u'lifecycle', u'specification', u'enhancement', u'applications', u'embedded', u'freescale', u'applications', u'android', u'platforms', u'interface', u'customers', u'developers', u'refine', u'specifications', u'architectures', u'java', u'programming', u'scripts', u'python', u'debug', u'debugging', u'emulators', u'regression', u'revisions', u'specialized', u'setups', u'capabilities', u'subversion', u'technical', u'documentation', u'multiple', u'engineering', u'techexpousa', u'reviews'], [u'modeler', u'semantic', u'modeling', u'models', u'skills', u'ontology', u'resource', u'framework', u'schema', u'technologies', u'hadoop', u'warehouse', u'oracle', u'relational', u'artifacts', u'models', u'dictionaries', u'models', u'interface', u'specifications', u'documentation', u'harmonization', u'mappings', u'aligned', u'coordinate', u'technical', u'peer', u'reviews', u'stakeholder', u'communities', u'impact', u'domains', u'relationships', u'interdependencies', u'models', u'define', u'analyze', u'legacy', u'models', u'corporate', u'databases', u'architectural', u'alignment', u'customer', u'expertise', u'harmonization', u'modeling', u'modeling', u'consulting', u'stakeholders', u'quality', 
u'models', u'storage', u'agile', u'specifically', u'focus', u'modeling', u'qualifications', u'bachelors', u'accredited', u'modeler', u'encompass', u'evaluation', u'skills', u'knowledge', u'modeling', u'techniques', u'resource', u'framework', u'schema', u'technologies', u'unified', u'modeling', u'technologies', u'schemas', u'ontologies', u'sybase', u'knowledge', u'skills', u'interpersonal', u'skills', u'customers', u'clearance', u'applicants', u'eligibility', u'classified', u'clearance', u'polygraph', u'techexpousa', u'solutions', u'partnership', u'solutions', u'integration'], [u'technologies', u'junction', u'develops', u'maintains', u'enhances', u'complex', u'diverse', u'intensive', u'analytics', u'algorithm', u'manipulation', u'management', u'documented', u'individually', u'reviews', u'tests', u'components', u'adherence', u'resolves', u'utilizes', u'methodologies', u'environment', u'input', u'components', u'hardware', u'offs', u'reuse', u'cots', u'gots', u'synthesis', u'components', u'tasks', u'individually', u'analyzes', u'modifies', u'debugs', u'corrects', u'integrates', u'operating', u'environments', u'develops', u'queries', u'databases', u'repositories', u'recommendations', u'improving', u'documentation', u'develops', u'implements', u'algorithms', u'functional', u'assists', u'developing', u'executing', u'procedures', u'components', u'reviews', u'documentation', u'solutions', u'analyzing', u'conferring', u'users', u'engineers', u'analyzing', u'investigating', u'areas', u'adapt', u'hardware', u'mathematical', u'models', u'predict', u'outcome', u'implement', u'complex', u'database', u'repository', u'interfaces', u'queries', u'bachelors', u'accredited', u'substituted', u'bachelors', u'firewalls', u'ipsec', u'vpns', u'technology', u'administering', u'servers', u'apache', u'jboss', u'tomcat', u'developing', u'interfaces', u'firefox', u'internet', u'explorer', u'operating', u'mainframe', u'linux', u'solaris', u'virtual', u'scripting', u'programming', u'oriented', u'programming', u'ajax', u'script', u'procedures', u'cobol', u'cognos', u'fusion', u'focus', u'html', u'java', u'java', u'script', u'jquery', u'perl', u'visual', u'basic', u'powershell', u'cots', u'cots', u'oracle', u'apex', u'integration', u'competitive', u'package', u'bonus', u'corporate', u'equity', u'tuition', u'reimbursement', u'referral', u'bonus', u'holidays', u'insurance', u'flexible', u'disability', u'insurance', u'technologies', u'disability', u'accommodation', u'recruiter', u'techexpousa']]" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "This corpus contains 10 documents. Now lets say we would like to model this with DTM.\n", 77 | "To do this we have to define the time steps\n", 78 | "each document belongs to. In this case the first 3 documents were collected at the same time, while the last 7 were collected \n", 79 | "a month later, and we wish to see how the topics change from month to month.\n", 80 | "For this we will define the `time_seq`, which contains the time slice definition." 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 4, 86 | "metadata": { 87 | "collapsed": true 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "time_seq = [3, 7] # first 3 documents are from time slice one \n", 92 | "# and the other 7 are from the second time slice." 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "A simple corpus wrapper to load a premade corpus. You can use this with your own data." 
100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 5, 105 | "metadata": { 106 | "collapsed": false 107 | }, 108 | "outputs": [], 109 | "source": [ 110 | "class DTMcorpus(corpora.textcorpus.TextCorpus):\n", 111 | "\n", 112 | " def get_texts(self):\n", 113 | " return self.input\n", 114 | "\n", 115 | " def __len__(self):\n", 116 | " return len(self.input)\n", 117 | "\n", 118 | "corpus = DTMcorpus(documents)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "So now we have to generate the path to DTM executable, here I have already set an ENV variable for the DTM_HOME" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 6, 131 | "metadata": { 132 | "collapsed": true 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "# path to dtm home folder\n", 137 | "dtm_home = os.environ.get('DTM_HOME', \"dtm-master\")\n", 138 | "# path to the binary. on my PC the executable file is dtm-master/bin/dtm\n", 139 | "dtm_path = os.path.join(dtm_home, 'bin', 'dtm') if dtm_home else None\n", 140 | "# you can also copy the path down directly. Change this variable to your DTM executable before running.\n", 141 | "dtm_path = \"/home/bhargav/dtm/main\"" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "That is basically all we need to be able to invoke the Training. \n", 149 | "\n", 150 | "If ```initialize_lda=True``` then DTM will create a LDA model first and store it in initial-lda-ss.dat.\n", 151 | "If you already have itial-lda-ss.dat in the DTM folder then you can save time and re-use it with ```initialize_lda=False```. If the file is missing then DTM wil exit with an error." 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 7, 157 | "metadata": { 158 | "collapsed": false 159 | }, 160 | "outputs": [], 161 | "source": [ 162 | "model = DtmModel(dtm_path, corpus, time_seq, num_topics=2,\n", 163 | " id2word=corpus.dictionary, initialize_lda=True)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "If everything worked we should be able to print out the topics" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 8, 176 | "metadata": { 177 | "collapsed": true 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "topics = model.show_topic(topicid=1, time=1, num_words=10)" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 9, 187 | "metadata": { 188 | "collapsed": false 189 | }, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "[(0.023565028919164586, 'skills'),\n", 195 | " (0.02308969736545094, 'engineering'),\n", 196 | " (0.019616329462533579, 'idexx'),\n", 197 | " (0.0194313503731963, 'testing'),\n", 198 | " (0.01858957362093603, 'technical'),\n", 199 | " (0.017685337300946517, 'electrical'),\n", 200 | " (0.017483543705882995, 'management'),\n", 201 | " (0.015310984365058886, 'complex'),\n", 202 | " (0.014032951915032212, 'knowledge'),\n", 203 | " (0.012958700085355939, 'technology')]" 204 | ] 205 | }, 206 | "execution_count": 9, 207 | "metadata": {}, 208 | "output_type": "execute_result" 209 | } 210 | ], 211 | "source": [ 212 | "topics" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "## Document-Topic proportions " 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "Next, we'll attempt to find the 
Document-Topic proportions. We will use the gamma class variable of the model to do the same. Gamma is a matrix such that gamma[5,10] is the proportion of the 10th topic in document 5.\n", 227 | "\n", 228 | "To find, say, the topic proportions in Document 1, we do the following:" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 10, 234 | "metadata": { 235 | "collapsed": false 236 | }, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "Distribution of Topic 0 0.562498\n", 243 | "Distribution of Topic 1 0.437502\n" 244 | ] 245 | } 246 | ], 247 | "source": [ 248 | "doc_number = 1\n", 249 | "num_topics = 2\n", 250 | "\n", 251 | "for i in range(0, num_topics):\n", 252 | " print (\"Distribution of Topic %d %f\" % (i, model.gamma_[doc_number, i]))" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## DIM Example" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": {}, 265 | "source": [ 266 | "The DTM wrapper in Gensim also has the capacity to run in Document Influence Model mode. The Model is described in [this](http://www.umiacs.umd.edu/~jbg/nips_tm_workshop/30.pdf) paper. What it allows you to do is find the 'influence' of a certain document on a particular topic. It is primarily used in identifying the scientific impact of research papers through the capability of that document's keywords influencing a topic. \n", 267 | "\n", 268 | "'Influence' can be naively thought of like this - if more of a particular document's words appear in subsequent evolution of a topic, that document is understood to have influenced that topic more.\n", 269 | "\n", 270 | "To run it in this mode, we now call `DtmModel` again, but with the `model` parameter set as `fixed`. \n", 271 | "\n", 272 | "Note that running it in this mode will also generate the DTM topics similar to running plain DTM, but with added information on document influence." 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": 11, 278 | "metadata": { 279 | "collapsed": false 280 | }, 281 | "outputs": [], 282 | "source": [ 283 | "model = DtmModel(dtm_path, corpus, time_seq, num_topics=2,\n", 284 | " id2word=corpus.dictionary, initialize_lda=True, model='fixed')" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "The main difference between the DTM and DIM models are the addition of Influence files for each time-slice, which is interpreted with the `influences_time` variable. \n", 292 | "\n", 293 | "To find, say, the influence of Document 2 on Topic 2 in Time-Slice 1, we do the following:" 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 12, 299 | "metadata": { 300 | "collapsed": false 301 | }, 302 | "outputs": [ 303 | { 304 | "data": { 305 | "text/plain": [ 306 | "0.0061833357763878861" 307 | ] 308 | }, 309 | "execution_count": 12, 310 | "metadata": {}, 311 | "output_type": "execute_result" 312 | } 313 | ], 314 | "source": [ 315 | "document_no = 1 #document 2\n", 316 | "topic_no = 1 #topic number 2\n", 317 | "time_slice = 0 #time slice 1\n", 318 | "\n", 319 | "model.influences_time[time_slice][document_no][topic_no]" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "## Differences between DTM and DIM mode." 
327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "There are not too many differences between DTM and DIM apart from the Document Influence information that is generated by running in DIM mode. The topics generated by the two models are also more or less similar.\n", 334 | "\n", 335 | "As for running times, with smaller corpora of less than 2000 documents the time taken by the two models is roughly the same, but for larger corpora DIM mode takes significantly more time - usually 1.5 to 2 times as long as DTM takes.\n", 336 | "\n", 337 | "For examples of use-cases of both, the following resources might be helpful:\n", 338 | "\n", 339 | "[Modeling Musical Influence with Topic Models](http://jmlr.org/proceedings/papers/v28/shalit13.pdf)\n", 340 | "\n", 341 | "[A Language-based Approach to Measuring Scholarly Impact](https://www.cs.princeton.edu/~blei/papers/GerrishBlei2010.pdf)\n", 342 | "\n", 343 | "[Studying the history of ideas using topic models](http://web.stanford.edu/~jurafsky/hallemnlp08.pdf)\n" 344 | ] 345 | } 346 | ], 347 | "metadata": { 348 | "kernelspec": { 349 | "display_name": "Python [py35]", 350 | "language": "python", 351 | "name": "Python [py35]" 352 | }, 353 | "language_info": { 354 | "codemirror_mode": { 355 | "name": "ipython", 356 | "version": 3 357 | }, 358 | "file_extension": ".py", 359 | "mimetype": "text/x-python", 360 | "name": "python", 361 | "nbconvert_exporter": "python", 362 | "pygments_lexer": "ipython3", 363 | "version": "3.5.2" 364 | } 365 | }, 366 | "nbformat": 4, 367 | "nbformat_minor": 0 368 | } 369 | -------------------------------------------------------------------------------- /notebooks/manifolds.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Creating Manifolds\n", 8 | "\n", 9 | "Let's start with the torus." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": true 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "%matplotlib inline\n", 21 | "import numpy as np\n", 22 | "import theano.tensor as tt\n", 23 | "import pymc3 as pm\n", 24 | "\n", 25 | "import matplotlib.pyplot as plt" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "def torus_param(u, v):\n", 37 | "    x = (1 + np.cos(v)) * np.cos(u)\n", 38 | "    y = (1 + np.cos(v)) * np.sin(u)\n", 39 | "    z = np.sin(v)\n", 40 | "    return np.array([x, y, z])" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "How do we make a path from $x_1$ to $x_2$? We can of course just use the straight-line Euclidean distance.\n", 48 | "\n", 49 | "One approach is to take a path from $torus\_param^{-1}(x_1)$ to $torus\_param^{-1}(x_2)$, lift it onto the torus, and measure all the segments."
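The `riemannian_metric` stub further down in this notebook is left empty, so here is a minimal sketch of the lift-and-measure idea described above. It is only an assumption about the intended approach, not the notebook's final implementation: it relies on `torus_param` as defined above (ring and tube radius both equal to 1), and the helper names `torus_param_inverse` and `lifted_path_length` are made up for this illustration.

```
import numpy as np

def torus_param_inverse(x):
    # Recover (u, v) for a point produced by torus_param above:
    # sqrt(x^2 + y^2) = 1 + cos(v) and z = sin(v).
    u = np.arctan2(x[1], x[0])
    v = np.arctan2(x[2], np.sqrt(x[0]**2 + x[1]**2) - 1)
    return u, v

def lifted_path_length(x_1, x_2, n_segments=100):
    # Straight path in (u, v) parameter space, lifted back onto the torus
    # via torus_param and measured segment by segment. It ignores wrap-around
    # across +/- pi, so it is only a rough stand-in for a true geodesic distance.
    u_1, v_1 = torus_param_inverse(x_1)
    u_2, v_2 = torus_param_inverse(x_2)
    us = np.linspace(u_1, u_2, n_segments + 1)
    vs = np.linspace(v_1, v_2, n_segments + 1)
    points = np.array([torus_param(u, v) for u, v in zip(us, vs)])
    return np.linalg.norm(np.diff(points, axis=0), axis=1).sum()
```

As a sanity check, `lifted_path_length(torus_param(0, 2), torus_param(1, 2))` should always come out at least as large as the straight chord between the two points.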
50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 7, 55 | "metadata": { 56 | "collapsed": true 57 | }, 58 | "outputs": [], 59 | "source": [ 60 | "def euclidean_distance(x_1, x_2):\n", 61 | " return np.square(x_1 - x_2).sum()" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 8, 67 | "metadata": { 68 | "collapsed": false 69 | }, 70 | "outputs": [ 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "0.31340765237662049" 75 | ] 76 | }, 77 | "execution_count": 8, 78 | "metadata": {}, 79 | "output_type": "execute_result" 80 | } 81 | ], 82 | "source": [ 83 | "euclidean_distance(torus_param(0, 2), torus_param(1, 2))" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": { 90 | "collapsed": true 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "def riemannian_metric(x_1, x_2):\n", 95 | " " 96 | ] 97 | } 98 | ], 99 | "metadata": { 100 | "kernelspec": { 101 | "display_name": "Python 2", 102 | "language": "python", 103 | "name": "python2" 104 | }, 105 | "language_info": { 106 | "codemirror_mode": { 107 | "name": "ipython", 108 | "version": 2 109 | }, 110 | "file_extension": ".py", 111 | "mimetype": "text/x-python", 112 | "name": "python", 113 | "nbconvert_exporter": "python", 114 | "pygments_lexer": "ipython2", 115 | "version": "2.7.12" 116 | } 117 | }, 118 | "nbformat": 4, 119 | "nbformat_minor": 0 120 | } 121 | -------------------------------------------------------------------------------- /notebooks/metrics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Distance Metrics for Probability Distributions\n", 8 | "\n", 9 | "We'll be looking at 3 different distance metrics, and see how different probability distributions look with them.\n", 10 | "\n", 11 | "### Creating probability distributions" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "import pymc3 as pm\n", 23 | "import numpy as np" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 2, 29 | "metadata": { 30 | "collapsed": false 31 | }, 32 | "outputs": [], 33 | "source": [ 34 | "import matplotlib.pyplot as plt\n", 35 | "\n", 36 | "# Initialize random number generator\n", 37 | "np.random.seed(123)\n", 38 | "\n", 39 | "# True parameter values\n", 40 | "alpha, sigma = 1, 1\n", 41 | "beta = [1, 2.5]\n", 42 | "\n", 43 | "# Size of dataset\n", 44 | "size = 100\n", 45 | "\n", 46 | "# Predictor variable\n", 47 | "X1 = np.random.randn(size)\n", 48 | "X2 = np.random.randn(size) * 0.2\n", 49 | "\n", 50 | "# Simulate outcome variable\n", 51 | "Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "### Create Models\n", 59 | "\n", 60 | "Let's create traces based on different sampling methods." 
61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": { 67 | "collapsed": true 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "basic_model = pm.Model()\n", 72 | "\n", 73 | "with basic_model:\n", 74 | " \n", 75 | " # Priors for unknown model parameters\n", 76 | " alpha = pm.Normal('alpha', mu=0, sd=10)\n", 77 | " beta = pm.Normal('beta', mu=0, sd=10, shape=2)\n", 78 | " sigma = pm.HalfNormal('sigma', sd=1)\n", 79 | " \n", 80 | " # Expected value of outcome\n", 81 | " mu = alpha + beta[0]*X1 + beta[1]*X2\n", 82 | " \n", 83 | " # Likelihood (sampling distribution) of observations\n", 84 | " Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": { 91 | "collapsed": false 92 | }, 93 | "outputs": [ 94 | { 95 | "name": "stderr", 96 | "output_type": "stream", 97 | "text": [ 98 | "Average Loss = 156.62: 5%|▌ | 10526/200000 [00:01<00:21, 8667.63it/s]\n", 99 | "100%|██████████| 1000/1000 [00:01<00:00, 730.98it/s]\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "from scipy import optimize\n", 105 | "\n", 106 | "with basic_model:\n", 107 | " # draw 500 posterior samples\n", 108 | " trace_default = pm.sample()" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 5, 114 | "metadata": { 115 | "collapsed": false 116 | }, 117 | "outputs": [ 118 | { 119 | "name": "stderr", 120 | "output_type": "stream", 121 | "text": [ 122 | " 1%| | 67/5500 [00:00<00:50, 107.78it/s]" 123 | ] 124 | }, 125 | { 126 | "name": "stdout", 127 | "output_type": "stream", 128 | "text": [ 129 | "Optimization terminated successfully.\n", 130 | " Current function value: 149.019762\n", 131 | " Iterations: 4\n", 132 | " Function evaluations: 176\n" 133 | ] 134 | }, 135 | { 136 | "name": "stderr", 137 | "output_type": "stream", 138 | "text": [ 139 | "100%|██████████| 5500/5500 [00:08<00:00, 622.00it/s]\n" 140 | ] 141 | } 142 | ], 143 | "source": [ 144 | "with basic_model:\n", 145 | " # obtain starting values via MAP\n", 146 | " start = pm.find_MAP(fmin=optimize.fmin_powell)\n", 147 | " # instantiate sampler\n", 148 | " step = pm.Slice() \n", 149 | " # draw 5000 posterior samples\n", 150 | " trace_slice = pm.sample(5000, step=step, start=start)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 6, 156 | "metadata": { 157 | "collapsed": false 158 | }, 159 | "outputs": [ 160 | { 161 | "name": "stderr", 162 | "output_type": "stream", 163 | "text": [ 164 | "100%|██████████| 5500/5500 [00:04<00:00, 1312.59it/s]\n" 165 | ] 166 | } 167 | ], 168 | "source": [ 169 | "with basic_model:\n", 170 | " # instantiate sampler\n", 171 | " step = pm.HamiltonianMC()\n", 172 | " # draw 5000 posterior samples\n", 173 | " trace_HMC = pm.sample(5000, step=step)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 7, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [ 183 | { 184 | "name": "stderr", 185 | "output_type": "stream", 186 | "text": [ 187 | "100%|██████████| 5500/5500 [00:07<00:00, 704.28it/s]\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "with basic_model:\n", 193 | " step = pm.NUTS()\n", 194 | " # draw 5000 posterior samples\n", 195 | " trace_NUTS = pm.sample(5000, step=step)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 8, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [ 205 | { 206 | "name": "stderr", 207 | "output_type": "stream", 208 | "text": [ 209 | 
"100%|██████████| 5500/5500 [00:02<00:00, 2649.34it/s]\n" 210 | ] 211 | } 212 | ], 213 | "source": [ 214 | "with basic_model:\n", 215 | " step = pm.Metropolis()\n", 216 | " # draw 5000 posterior samples\n", 217 | " trace_metropolis = pm.sample(5000, step=step)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 9, 223 | "metadata": { 224 | "collapsed": false 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "# SMC is still an experimental method.\n", 229 | "# with basic_model:\n", 230 | "# step = pm.SMC()\n", 231 | "# # draw 5000 posterior samples\n", 232 | "# trace_SMC = pm.sample(5000, step=step)" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "### Creating Manifolds\n", 240 | "\n", 241 | "Torus. Sphere?" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": { 248 | "collapsed": true 249 | }, 250 | "outputs": [], 251 | "source": [] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "### Kullback–Leibler divergence" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": { 264 | "collapsed": true 265 | }, 266 | "outputs": [], 267 | "source": [ 268 | "def KLdivergence(dist_1, dist_2):\n", 269 | " distance = np.sum(dist_1 * np.log(dist_1 / dist_2))\n", 270 | " return distance" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "### Hellinger Distance" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": null, 283 | "metadata": { 284 | "collapsed": true 285 | }, 286 | "outputs": [], 287 | "source": [ 288 | "def hellinger(dist_1, dist_2):\n", 289 | " distance = np.sqrt(0.5 * ((np.sqrt(dist_1) - np.sqrt(dist_2))**2).sum())\n", 290 | " return distance" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "### Fischer-Rao Metric\n", 298 | "\n", 299 | "The Fischer-Rao metric is a particular Riemannian metric. We normally have a statistical manifold with coordinates at each point; in this small snippet we will make do with pseudo code." 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": { 306 | "collapsed": true 307 | }, 308 | "outputs": [], 309 | "source": [ 310 | "def fischer_rao(distribution, coordinate_1, coordinate_2):\n", 311 | " distance = np.sum(np.log(distribution(coordinate_1)) * np.log(distribution(coordinate_2))*distribution)\n", 312 | " return distance" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "### SoftAbs Metric\n", 320 | "\n", 321 | "The SoftAbs metric is based on an exponential map.\n", 322 | "We need to compute the gradient of the quadratic form, and the log determinant. \n", 323 | "Here p is the momenta and pi(q) is the N-dimensional Target density.\n", 324 | "\n", 325 | "H = Q . $lambda$ . $Q^T$\n", 326 | "\n", 327 | "$lambda$ = Diag($lambda_{i}$)\n", 328 | "\n", 329 | "Lambda is the diagonal matrix of eigenvalues and Q is the corresponding matrix of eigenvectors. " 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": null, 335 | "metadata": { 336 | "collapsed": false 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "def grad_quad(H_ij, p):\n", 341 | " Q, lambda_i = decompose(H_ij)\n", 342 | " D = diag(Q_t . p / (lambda_i . coth(alpha . lambda_i))\n", 343 | " J = d(lambda_i . coth(alpha . 
lambda_i))\n", 344 | " grad = - Trace(Q . D . J . D . Q_t . d(H))\n", 345 | " return grad" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": { 352 | "collapsed": true 353 | }, 354 | "outputs": [], 355 | "source": [ 356 | "def grad_log(H_ij):\n", 357 | " Q, lambda_i = decompose(H_ij)\n", 358 | " J = d(lambda_i . coth(alpha . lambda_i))\n", 359 | " R = diag(1 / lambda_i . coth(alpha . lambda_i)\n", 360 | " grad = Trace(Q . (R ◦ J). Q_t . dH)\n", 361 | " return grad" 362 | ] 363 | } 364 | ], 365 | "metadata": { 366 | "kernelspec": { 367 | "display_name": "Python 2", 368 | "language": "python", 369 | "name": "python2" 370 | }, 371 | "language_info": { 372 | "codemirror_mode": { 373 | "name": "ipython", 374 | "version": 2 375 | }, 376 | "file_extension": ".py", 377 | "mimetype": "text/x-python", 378 | "name": "python", 379 | "nbconvert_exporter": "python", 380 | "pygments_lexer": "ipython2", 381 | "version": "2.7.12" 382 | } 383 | }, 384 | "nbformat": 4, 385 | "nbformat_minor": 0 386 | } 387 | -------------------------------------------------------------------------------- /notebooks/pycobra/regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Playing with Regression\n", 8 | "\n", 9 | "This notebook will help us with testing different regression techniques, and eventually, test COBRA. \n", 10 | "\n", 11 | "So for now we will generate a random data-set and try some of the popular regression techniques on it, after it has been loaded to COBRA.\n", 12 | "\n", 13 | "#### Imports" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 1, 19 | "metadata": { 20 | "collapsed": false 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "from pycobra.cobra import cobra\n", 25 | "from pycobra.diagnostics import diagnostics\n", 26 | "import numpy as np\n", 27 | "%matplotlib inline" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "#### Setting up data set" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 2, 40 | "metadata": { 41 | "collapsed": false 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "# setting up our random data-set\n", 46 | "rng = np.random.RandomState(1)\n", 47 | "\n", 48 | "# D1 = train machines; D2 = create COBRA; D3 = calibrate epsilon, alpha; D4 = testing\n", 49 | "n_features = 20\n", 50 | "D1, D2, D3, D4 = 200, 200, 200, 200\n", 51 | "D = D1 + D2 + D3 + D4\n", 52 | "X = rng.uniform(-1, 1, D * n_features).reshape(D, n_features)\n", 53 | "Y = np.power(X[:,1], 2) + np.power(X[:,3], 3) + np.exp(X[:,10]) \n", 54 | "# Y = np.power(X[:,0], 2) + np.power(X[:,1], 3)\n", 55 | "\n", 56 | "# training data-set\n", 57 | "X_train = X[:D1 + D2]\n", 58 | "X_test = X[D1 + D2 + D3:D1 + D2 + D3 + D4]\n", 59 | "X_eps = X[D1 + D2:D1 + D2 + D3]\n", 60 | "# for testing\n", 61 | "Y_train = Y[:D1 + D2]\n", 62 | "Y_test = Y[D1 + D2 + D3:D1 + D2 + D3 + D4]\n", 63 | "Y_eps = Y[D1 + D2:D1 + D2 + D3]" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### Setting up COBRA\n", 71 | "\n", 72 | "Let's up our COBRA machine with the data." 
73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 3, 78 | "metadata": { 79 | "collapsed": false 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "COBRA = cobra(X_train, Y_train, epsilon=0.5, default=False, random_state=0)" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "In the above line, we initialise COBRA with an epsilon value of $0.5$ - since we know the distribution and the data varies from $-1$ to $1$, $0.5$ is a fair guess at a \"good\" epsilon value. \n", 91 | "\n", 92 | "If we do not pass the $\epsilon$ parameter, it is automatically set to $\frac{\epsilon_{max} - \epsilon_{min}}{2}$, or, if `test_data` is passed, to an epsilon value optimised for that test data.\n", 93 | "\n", 94 | "Note that the `default` parameter is set to false: this is so we can walk you through what happens when COBRA is set up, instead of using the default settings." 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "We're now going to split our dataset into two parts, and shuffle data points." 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "COBRA.split_data(D1, D1 + D2, shuffle_data=True)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "Let's load the default machines to COBRA." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 5, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [], 129 | "source": [ 130 | "COBRA.load_default()" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "We note here that further machines can be loaded using either the `loadMachine()` or the `loadSKMachine()` method. The only prerequisite is that the machine has a valid `predict()` method." 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "## Using COBRA's machines\n", 145 | "\n", 146 | "We've created our random dataset and now we're going to use the default scikit-learn machines to see what the results look like."
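Following up on the note above about `loadMachine()` and `loadSKMachine()`: the snippet below is a hedged sketch of adding one extra scikit-learn machine, not pycobra's documented API. The exact method name and signature may differ between versions, so check the package documentation; the only firm requirement, as stated above, is that the object has a valid `predict()` method.

```
from sklearn.svm import SVR

# Any estimator with a predict() method will do; here we fit it on the D_1 part
# of the training data, which is the portion used to train the default machines.
svm_machine = SVR().fit(X_train[:D1], Y_train[:D1])

# Assumed call, based on the note above -- verify the exact signature in pycobra.
COBRA.loadSKMachine('svm', svm_machine)

# Reload the stored predictions so the new machine takes part in the aggregation.
COBRA.load_machine_predictions()
```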
147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 6, 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "query = X_test[9].reshape(1, -1)" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 7, 163 | "metadata": { 164 | "collapsed": false 165 | }, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "{'lasso': LassoLars(alpha=1.0, copy_X=True, eps=2.2204460492503131e-16,\n", 171 | " fit_intercept=True, fit_path=True, max_iter=500, normalize=True,\n", 172 | " positive=False, precompute='auto', verbose=False),\n", 173 | " 'random_forest': RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n", 174 | " max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,\n", 175 | " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", 176 | " n_estimators=10, n_jobs=1, oob_score=False, random_state=0,\n", 177 | " verbose=0, warm_start=False),\n", 178 | " 'ridge': Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,\n", 179 | " normalize=False, random_state=0, solver='auto', tol=0.001),\n", 180 | " 'tree': DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,\n", 181 | " max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,\n", 182 | " min_weight_fraction_leaf=0.0, presort=False, random_state=0,\n", 183 | " splitter='best')}" 184 | ] 185 | }, 186 | "execution_count": 7, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "COBRA.machines" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 8, 198 | "metadata": { 199 | "collapsed": false 200 | }, 201 | "outputs": [ 202 | { 203 | "data": { 204 | "text/plain": [ 205 | "array([ 1.55459791])" 206 | ] 207 | }, 208 | "execution_count": 8, 209 | "metadata": {}, 210 | "output_type": "execute_result" 211 | } 212 | ], 213 | "source": [ 214 | "COBRA.machines['lasso'].predict(query)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 9, 220 | "metadata": { 221 | "collapsed": false 222 | }, 223 | "outputs": [ 224 | { 225 | "data": { 226 | "text/plain": [ 227 | "array([ 0.22769628])" 228 | ] 229 | }, 230 | "execution_count": 9, 231 | "metadata": {}, 232 | "output_type": "execute_result" 233 | } 234 | ], 235 | "source": [ 236 | "COBRA.machines['tree'].predict(query)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 10, 242 | "metadata": { 243 | "collapsed": false 244 | }, 245 | "outputs": [ 246 | { 247 | "data": { 248 | "text/plain": [ 249 | "array([ 0.06747291])" 250 | ] 251 | }, 252 | "execution_count": 10, 253 | "metadata": {}, 254 | "output_type": "execute_result" 255 | } 256 | ], 257 | "source": [ 258 | "COBRA.machines['ridge'].predict(query)" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 11, 264 | "metadata": { 265 | "collapsed": false 266 | }, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/plain": [ 271 | "array([ 0.3382969])" 272 | ] 273 | }, 274 | "execution_count": 11, 275 | "metadata": {}, 276 | "output_type": "execute_result" 277 | } 278 | ], 279 | "source": [ 280 | "COBRA.machines['random_forest'].predict(query)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "## Aggregate!\n", 288 | "\n", 289 | "By using the aggregate function we can combine our predictors.\n", 290 | "You can read about the aggregation procedure either in the original COBRA paper or look around 
in the source code for the algorithm.\n", 291 | "\n", 292 | "We start by loading each machine's predictions now." 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 12, 298 | "metadata": { 299 | "collapsed": false 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "COBRA.load_machine_predictions()" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 13, 309 | "metadata": { 310 | "collapsed": false 311 | }, 312 | "outputs": [ 313 | { 314 | "data": { 315 | "text/plain": [ 316 | "0.20355644905114159" 317 | ] 318 | }, 319 | "execution_count": 13, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "COBRA.predict(query)" 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 14, 331 | "metadata": { 332 | "collapsed": false 333 | }, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "0.0095390633892067367" 339 | ] 340 | }, 341 | "execution_count": 14, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "Y_test[9]" 348 | ] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "### Optimizing COBRA\n", 355 | "\n", 356 | "To squeeze the best out of COBRA we make use of the COBRA diagnostics class. With a grid-based approach to optimizing hyperparameters, we can find the best epsilon value, number of machines (alpha value), and combination of machines." 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": {}, 362 | "source": [ 363 | "Let's check the MSE for each of COBRA's machines:" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 15, 369 | "metadata": { 370 | "collapsed": false 371 | }, 372 | "outputs": [], 373 | "source": [ 374 | "cobra_diagnostics = diagnostics(COBRA, X_test, Y_test, load_MSE=True)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": 16, 380 | "metadata": { 381 | "collapsed": false 382 | }, 383 | "outputs": [ 384 | { 385 | "data": { 386 | "text/plain": [ 387 | "{'COBRA': 0.1186812827420051,\n", 388 | " 'lasso': 0.69378035831915863,\n", 389 | " 'random_forest': 0.1087898042447215,\n", 390 | " 'ridge': 0.17014683933748218,\n", 391 | " 'tree': 0.16080252584960136}" 392 | ] 393 | }, 394 | "execution_count": 16, 395 | "metadata": {}, 396 | "output_type": "execute_result" 397 | } 398 | ], 399 | "source": [ 400 | "cobra_diagnostics.machine_MSE" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": {}, 406 | "source": [ 407 | "This error is bounded by the value $C\mathscr{l}^{\frac{-2}{M + 2}}$ up to a constant $C$, which is problem dependent. For more details, we refer the user to the original [paper](http://www.sciencedirect.com/science/article/pii/S0047259X15000950)." 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 17, 413 | "metadata": { 414 | "collapsed": false 415 | }, 416 | "outputs": [ 417 | { 418 | "data": { 419 | "text/plain": [ 420 | "0.005" 421 | ] 422 | }, 423 | "execution_count": 17, 424 | "metadata": {}, 425 | "output_type": "execute_result" 426 | } 427 | ], 428 | "source": [ 429 | "cobra_diagnostics.error_bound" 430 | ] 431 | }, 432 | { 433 | "cell_type": "markdown", 434 | "metadata": {}, 435 | "source": [ 436 | "### Playing with Data-Splitting\n", 437 | "\n", 438 | "When we initially started to set up COBRA, we split our training data into two further parts - $D_k$ and $D_l$. 
\n", 439 | "This split was done 50-50, but it is upto us how we wish to do this. \n", 440 | "The following section will compare 20-80, 60-40, 50-50, 40-60, 80-20 and check for which case we get the best MSE values, for a fixed Epsilon (or use a grid)." 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 18, 446 | "metadata": { 447 | "collapsed": false 448 | }, 449 | "outputs": [ 450 | { 451 | "data": { 452 | "text/plain": [ 453 | "((0.6, 0.4), 0.16127856452749259)" 454 | ] 455 | }, 456 | "execution_count": 18, 457 | "metadata": {}, 458 | "output_type": "execute_result" 459 | } 460 | ], 461 | "source": [ 462 | "cobra_diagnostics.optimal_split(X_eps, Y_eps)" 463 | ] 464 | }, 465 | { 466 | "cell_type": "markdown", 467 | "metadata": {}, 468 | "source": [ 469 | "What we saw was the default result, with the optimal split ratio and the corresponding MSE. We can do a further analysis here by enabling the info and graph options, and using more values to split on." 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 19, 475 | "metadata": { 476 | "collapsed": true 477 | }, 478 | "outputs": [], 479 | "source": [ 480 | "split = [(0.05, 0.95), (0.10, 0.90), (0.20, 0.80), (0.40, 0.60), (0.50, 0.50), (0.60, 0.40), (0.80, 0.20), (0.90, 0.10), (0.95, 0.05)]" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": 20, 486 | "metadata": { 487 | "collapsed": false 488 | }, 489 | "outputs": [ 490 | { 491 | "data": { 492 | "text/plain": [ 493 | "{(0.05, 0.95): 0.39262084148151344,\n", 494 | " (0.1, 0.9): 0.22715491182618358,\n", 495 | " (0.2, 0.8): 0.24066262191934482,\n", 496 | " (0.4, 0.6): 0.25176243012900951,\n", 497 | " (0.5, 0.5): 0.32890802098373911,\n", 498 | " (0.6, 0.4): 0.16127856452749259,\n", 499 | " (0.8, 0.2): 0.85840425483893001,\n", 500 | " (0.9, 0.1): 0.87760930708223073,\n", 501 | " (0.95, 0.05): 1.2414145213568437}" 502 | ] 503 | }, 504 | "execution_count": 20, 505 | "metadata": {}, 506 | "output_type": "execute_result" 507 | }, 508 | { 509 | "data": { 510 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8VPW9//HXJ8kkIWQDEoiQQNiXgrhEZKnWhSraFmvV\nVtS6obRWW1utXntvb29/7W3vrdvV3lqpu7a1Yr22pRWlqFCsJEoQUVkCSVgS1jCBQBKyf39/JEBA\nIAPMzJmZvJ+PBw9mOZnz9pi8Ofme5WvOOUREJLbEeR1ARESCT+UuIhKDVO4iIjFI5S4iEoNU7iIi\nMUjlLiISg1TuIiIxSOUuIhKDVO4iIjEowasVZ2Vlufz8fK9WLyISlZYtW7bTOZfd1XKelXt+fj7F\nxcVerV5EJCqZ2cZAltOwjIhIDFK5i4jEIJW7iEgMUrmLiMQglbuISAxSuYuIxCCVu4hIDFK5i4iE\n0SNvrmVJ6c6Qr0flLiISJrvrm3j0rXUUb9wV8nWp3EVEwqSovBrnYNLQPiFfl8pdRCRMisr99PDF\nMz43M+TrUrmLiIRJYZmfgvxeJCaEvnq7XIOZPWNmO8zsk6O8f62ZfWRmH5vZEjMbH/yYIiLRzV/b\nSMn2vUwcEvohGQhsz/05YNox3l8PfM45Nw74KfBEEHKJiMSUovJqIDzj7RDALX+dc4vNLP8Y7y/p\n9LQIyD35WCIisaWwfCc9E+MZNyAjLOsL9sDPTOD1IH+miEjUKyzzc9bg3vjiw3OoM2hrMbPzaS/3\nfznGMrPMrNjMiquqqoK1ahGRiLZjTwNlVXVMCtN4OwSp3M3sVOAp4DLnnP9oyznnnnDOFTjnCrKz\nu5wlSkQkJhSWt9diuMbbIQjlbmYDgVeBrzvn1p58JBGR2FJU7ictOYHP9A/PeDsEcEDVzP4AnAdk\nmVkl8B+AD8A5Nxv4EdAH+LWZAbQ45wpCFVhEJNoUlvk5e3Bv4uMsbOsM5GyZGV28fwtwS9ASiYjE\nkK01+9jgr+e6iYPCul5doSoiEkKFZeEfbweVu4hISBWW+clM8TE6Jz2s61W5i4iEUGF5+3h7XBjH\n20HlLiISMhXV9VTu2hfW89v3U7mLiITIwfPbs8K+bpW7iEiIFJX56dMzkRH9UsO+bpW7iEgIOOco\nLPczcUgfOq4BCiuVu4hICGz017O1poGJYT4Fcj+Vu4hICBwYb/fgYCqo3EVEQqKwzE92WhJDs3t6\nsn6Vu4hIkO0fb5/k0Xg7qNxFRIKurKqOqr2NYb/lQGcqdxGRIPN6vB1U7iIiQVdU5ueUjGQG9Unx\nLIPKXUQkiJxzFHk83g4qdxGRoFq7vRZ/XZNn57fvp3IXEQmiwrKdgLfj7aByFxEJqsJyP7m9epDX\n27vxdlC5i4gETVubo6i82vO9dlC5i4gEzaqte6jZ1+zp+e37qdxFRIKkqNyb+VKPROUuIhIkhWV+\n8vukcEpGD6+jqNxFRIKhpbWN99dXR8ReO6jcRUSCYuWWPextbGFiBBxMBZW7iEhQRML9ZDpTuYuI\nBEFhmZ+h2T3pm57sdRRA5S4ictKaW9tYuiFyxttB5S4ictI+qqyhvqmVSUOyvI5yQJflbmbPmNkO\nM/vkKO+bmf3SzErN7CMzOyP4MUVEItf+89snDuntcZKDAtlzfw6Ydoz3LwGGd/yZBTx+8rFERKJH\nYZmfkf3S6JOa5HWUA7osd+fcYqD6GItcBrzg2hUBmWZ2SrACiohEssaWVoo3RtZ4OwRnzH0AUNHp\neWXHa59iZrPMrNjMiquqqoKwahERb62oqKGhuS1izm/fL6wHVJ1zTzjnCpxzBdnZ2eFctYhISBSW\n+TGLrPF2CE65bwbyOj3P7XhNRCTmFZbvZHROOpkpiV5HOUQwyn0ucH3HWTMTgRrn3NYgfK6ISERr\naG7lg027I268HSChqwXM7A/AeUCWmVUC/wH4AJxzs4F5wKVAKVAP3BSqsCIikeSDTbtoammLmFsO\ndNZluTvnZnTxvgNuD1oiEZEoUVTmJ85gQoSNt4OuUBUROWGF5X7GDsggPdnndZRPUbmLiJyAfU2t\nfFixOyKHZEDlLiJyQoo3VtPc6pgYgQdTQeUuInJCCsv8xMcZZ+VH3ng7qNxFRE5IYbmfU3MzSE3q\n8rwUT6jcRUSOU21jCx9V1kTseDuo3EVEjtvSDdW0trmIvHhpP5W7iMhxKirz44s3CgZF5ng7qNxF\nRI5bYbmf0/Iy6ZEY73WUo1K5i4gchz0NzXyyObLH20HlLiJyXN4vr6bNEbHnt++nchcROQ6F5X4S\nE+I4Y2Avr6Mck8pdROQ4FJb5OWNgJsm+yB1vB5W7iEjAdtc3sXrbHiYNyfI6SpdU7iIiASoqr8Y5\nIvr89v1U7iIiASoq95Psi2N8XobXUboUmTdFEBEJM+cce/a1sG1PA9v3NLBtTwM7Ov7eVtPIjr0N\nrN2+l4JBvUlKiOzxdlC5i0g30NDcyo49jWzf28C2mvbybi/wxgOPt+9poKG57VNfm9HDR056Mv0y\nkvnSqf2ZcfZAD/4Ljp/KXUSiVlubw1/X1F7UNQ1s39vA9pqGjr3vg8W9q775U1+bmBBHTnoyOenJ\njBuQwedH96NfR4nnpCfTLz2JfunJEX9WzNGo3EUk4m301/Hax1vZsaeRbTUHh0x27G2kpc0dsqwZ\nZKUmkZOeTG6vHpw5qBf9Okq8X0Z7aeekJ5PRw4eZefRfFHoqdxGJaE0tbVz39HtUVO8jLSnhQEFP\nHNrnYGl37GnnZCSTlZqEL17niqjcRSSizVm6iYrqfTx1fQFTx/TzOk7U0D9vIhKx6ptaePStUibk\n9+bC0X29jhNVVO4iErGefXcDO2sbuXfayJgeHw8FlbuIRKTd9U3M/kcZU0f3pSBCJ6GOZCp3EYlI\nj/+jjNrGFr5/8Uivo0QllbuIRJxtNQ089+4GvnzaAEblpHsdJyoFVO5mNs3MSsys1MzuO8L7A81s\noZktN7OPzOzS4EcVke7i0bfW0eYc35s6wusoUavLcjezeOAx4BJgDDDDzMYcttgPgZedc6cDVwO/\nDnZQEeke1u+s4+XiCmZMGMjAPilex4lagey5TwBKnXPlzrkm4CXgssOWccD+350ygC3Biygi3clD\nfy8hMT6OOy4Y5nWUqBZIuQ8AKjo9r+x4rbMfA9eZWSUwD/h2UNKJSLfyyeYa/vbRVmZ+djB905K9\njhPVgnVAdQbwnHMuF7gU+K2ZfeqzzWyWmRWbWXFVVVWQVi0iseL++SVkpviY9bkhXkeJeoGU+2Yg\nr9Pz3I7XOpsJvAzgnCsEkoFPzUPlnHvCOVfgnCvIzs4+scQiEpMKy/wsXlvFbZ8bSnqyz+s4US+Q\ncl8KDDezwWaWSPsB07mHLbMJuBDAzEbTXu7aNReRgDjnuH/+GvqlJ3HD5Hyv48SELsvdOdcC3AHM\nB1bTflbMSjP7iZlN71jsbuBWM1sB/AG40TnnjvyJIiKH
WrBqO8s37ea7U0dE7f3TI01Ad4V0zs2j\n/UBp59d+1OnxKmBKcKOJSHfQ2uZ48O8lDM7qyVVn5nodJ2boClUR8dSfl29m7fZa7r5oBAm6D3vQ\naEuKiGcaW1p5eMFaxg5I59Kxp3gdJ6ao3EXEMy++t4nNu/dx78WjiIvTLX2DSeUuIp6obWzhV2+X\nMnFIb84Z/qkzp+UkaZo9EfHEM/9cj7+uiSenjdJEHCGgPXcRCbvquiaeWFzORWP6ccbAXl7HiUkq\ndxEJu8cXlVLXpIk4QknlLiJhtWX3Pp4v3MhXTs9lRL80r+PELJW7iITVo2+uAwffnTrc6ygxTeUu\nImFTuqOWPy6r4NqJA8nrrYk4QknlLiJh8/CCEpJ98dx+vibiCDWVu4iExYqK3cz7eBu3nDOErNQk\nr+PEPJW7iITFA/NL6JXi49ZzBnsdpVtQuYtIyL1bupN/lu7k9vOHkaaJOMJC5S4iIeWc4/431tA/\nI5nrJg7yOk63oXIXkZCav3IbKyprNBFHmKncRSRkWlrbeGB+CUOze/KVMwZ4HadbUbmLSMi8unwz\nZVV1fP+ikZqII8y0tUUkJBqaW3lkwVpOzc1g2tgcr+N0Oyp3EQmJ3xVtZEtNA/+iW/p6QuUuIkG3\nt6GZXy8q47PDspgyTBNxeEHlLiJB99Q766mua+Ie3dLXMyp3EQmqnbWNPPVOOZeMzWF8XqbXcbot\nlbuIBNVjC0vZ19zK3Rdpr91LKncRCZrKXfX8vmgTV56Zy7C+qV7H6dZU7iISNI+8uQ4M7pw6wuso\n3Z7KXUSCYt32vbz6QSXXTxzEgMweXsfp9lTuIhIUD/69hJTEBL6liTgiQkDlbmbTzKzEzErN7L6j\nLPNVM1tlZivN7MXgxhSRSLZ80y7mr9zOrecMoXfPRK/jCJDQ1QJmFg88BnweqASWmtlc59yqTssM\nB34ATHHO7TKzvqEKLCKRxTnHL95YQ5+eiczURBwRI5A99wlAqXOu3DnXBLwEXHbYMrcCjznndgE4\n53YEN6aIRKp31u2kqLyaOy4YRmpSl/uLEiaBlPsAoKLT88qO1zobAYwws3fNrMjMpgUroIhErrY2\nxwPzSxiQ2YNrzh7odRzpJFj/zCYAw4HzgFxgsZmNc87t7ryQmc0CZgEMHKhvBJFo9/on2/h4cw0P\nXjWepARNxBFJAtlz3wzkdXqe2/FaZ5XAXOdcs3NuPbCW9rI/hHPuCedcgXOuIDs7+0Qzi0gEaGlt\n46G/lzC8byqXn66JOCJNIOW+FBhuZoPNLBG4Gph72DJ/pn2vHTPLon2YpjyIOUUkwryyrJLynXXc\nc/FI4uN0S99I02W5O+dagDuA+cBq4GXn3Eoz+4mZTe9YbD7gN7NVwELgHuecP1ShRcRbDc2tPPLm\nOk4fmMnnx/TzOo4cQUBj7s65ecC8w177UafHDrir44+IxLgXCjewbU8D//O10zQRR4TSFaoiclz2\ndEzEce6IbCYN7eN1HDkKlbuIHJcnF5ezu76ZezURR0RTuYtIwKr2NvLUO+v5wqmnMHZAhtdx5BhU\n7iISsF+9vY6m1jbu/rxu6RvpVO4iEpCK6npefH8TXy3IY0i2JuKIdCp3EQnI/yxYS5wZd174qesT\nJQKp3EWkS2u27eFPH27mxsn55GQkex1HAqByF5EuPTi/hNSkBG47b6jXUSRAKncROaZlG6t5c/UO\nvnHuEDJTNBFHtFC5i8hROef4xeslZKUmcdMUTcQRTVTuInJUi9ZW8f6Gar5z4TB6aiKOqKJyF5Ej\namtz3P9GCXm9e3D1WZp/Idqo3EXkiP728VZWb93DXZ8fQWKCqiLa6P+YiHxKc8dEHKNy0pg+XhNx\nRCOVu4h8ypylFWz012sijiimcheRQ+xrauWXb62jYFAvLhjV1+s4coJU7iJyiOeWbGDH3kbunTZK\nE3FEMZW7iBxQU9/M44tKOX9kNhMG9/Y6jpwElbuIHDB7cRl7Glq45+JRXkeRk6RyFxEAduxp4Nl3\n1zN9fH/G9E/3Oo6cJJW7iADwy7fX0dLquEsTccQElbuIsNFfx0vvV3D1hDzys3p6HUeCQOUuIjy8\nYC0J8cZ3LtBEHLFC5S7Sza3cUsNfPtzCTVMG0zddE3HECpW7SDf34PwS0pMT+Oa5mogjlqjcRbqx\n99dXs7CkitvOG0ZGis/rOBJEKneRbso5x/1vrKFvWhI3Ts73Oo4EmcpdpJt6e80Oijfu4jsXDqdH\nYrzXcSTIAip3M5tmZiVmVmpm9x1juSvMzJlZQfAiikiwtbU5HphfwqA+KXztrDyv40gIdFnuZhYP\nPAZcAowBZpjZmCMslwbcCbwX7JAiElxzV2xhzba93H3RSHzx+gU+FgXyf3UCUOqcK3fONQEvAZcd\nYbmfAr8AGoKYT0SCrKmljYcWlDDmlHS+OO4Ur+NIiARS7gOAik7PKzteO8DMzgDynHOvHeuDzGyW\nmRWbWXFVVdVxh4X2g0DFG6pP6GtFBF5auomK6n3cM20kcZqII2ad9O9jZhYHPAzc3dWyzrknnHMF\nzrmC7OzsE1rfnKUVXDm7kPfXq+BFjld9Uwu/fKuUCYN7c96IE/sZlOgQSLlvBjofccnteG2/NGAs\nsMjMNgATgbmhOqh62WkDyElP5mevraKtzYViFSIx69l3N7CztpF/mTZSE3HEuEDKfSkw3MwGm1ki\ncDUwd/+bzrka51yWcy7fOZcPFAHTnXPFoQjcIzGe7188khWVNfz1oy2hWIVITNpV18TsRWVMHd2X\nMwdpIo5Y12W5O+dagDuA+cBq4GXn3Eoz+4mZTQ91wCP5yukD+Ez/dO5/o4SG5lYvIohEndn/KKO2\nqYXvXzzS6ygSBgGNuTvn5jnnRjjnhjrnftbx2o+cc3OPsOx5odpr3y8uzvi3L4xm8+59PPvuhlCu\nSiQmbKtp4LklG7j8tAGMytFEHN1B1J7gOnloFlNH9+XXC0vx1zZ6HUckoj361jranON7moij24ja\ncge475LR1De38sib67yOIhKxyqtqebm4gmsmDCSvd4rXcSRMorrch/VN5dqzB/Li+5so3bHX6zgi\nEemhBWtJSojjDk3E0a1EdbkD3HnhcFJ88fzXvDVeRxGJOJ9sruG1j7Zy85TBZKcleR1Hwijqy71P\nahK3XzCMt9bsYEnpTq/jiESU++eXkJniY9bnhngdRcIs6ssd4MbJ+QzI7MF/vraaVl3YJAJAYZmf\nxWur+NZ5Q0lP1kQc3U1MlHuyL557p41k1dY9vPpBpddxJIpVVNfz1Dvl7Ngb3fe/c85x//w15KQn\nc/2kfK/jiAdiotwBpo/vz/i8TB78ewn1TS1ex5Eo0tbmWFiyg5nPLeXcBxbyn6+t5orHl7BhZ53X\n0U7YglXbWb5pN3dOHU6yTxNxdEcxU+5mxr9/YTTb9zTy5OL1XseRKLC7voknF5dz/kOLuOnZpayo\nrOHb5w/jmRs
LqGts5YrHl/BxZY3XMY9ba8dEHEOyenLVmblexxGPJHgdIJgK8ntzydgcfrO4jBkT\n8uibnux1JIlAn2yu4YXCDfzlwy00trRxVn4v7r5oJNM+k0NiQvv+zh+/OYnrn36fq58o5DdfL+Cz\nw7O8DX0c/rR8M+t21PLYNWeQoIk4ui1zzpsDkAUFBa64OPh3Kdjor2Pqw//gK6fn8osrTw3650t0\namxpZd7HW3mhcCPLN+2mhy+eL58+gK9PHMSY/ke+HH/7ngZueOZ9yqpqefirp/Gl8f3DnPr4Nba0\ncsGD/6B3z0T+cvsU3a89BpnZMudcl3fdjak9d4BBfXpy/aR8nnl3PTdOyWf0KbqPRndWuaue37+3\niTlLK6iua2JIVk9+9MUxXHFmLhk9jn0GSb/0ZOZ8YxK3Pl/Md15ajr+2kRunDA5T8hPz4nub2Lx7\nH//1lXEq9m4u5sod4NsXDOOVZZX8fN5qXrh5gu5b3c20tTn+WbqTFwo38vaa7QBMHd2P6yflM3lo\nn+MqvYwePl6YOYFv/2E5P/7rKnbWNnH3RSMi8nuqtrGFX71dyqQhfTgnioaRJDRistwzUxL5zoXD\n+enfVrFobRXnj+zrdSQJg5p9zbyyrJLfFW1k/c46+vRM5LbzhnLN2YMYkNnjhD832RfP49eewb//\n5RN+tbCUqr2N/OzysRE3nv30O+vx1zVxrybiEGK03AG+PnEQvy3cwM9fW805w7Ii7gdRgmfllhp+\nV7SRPy/fwr7mVs4c1Is7LxzOJeNySEoIzmmACfFx/PzycWSnJvHLt0uprm/if2ecHjGnGVbXNfHk\nO+VcNKYfpw/s5XUciQAxW+6JCXHcd8kovvm7D5hTXMG1Zw/yOpIEUVNLG69/0n6AdNnGXST74vjy\naQO4buIgxg7ICMk6zYy7LhpJn9QkfvzXlXz96fd46vqzyEjx/urPXy8spV4TcUgnMVvuABd/JocJ\n+b35nwVrmT6+P2m6BDvqbdm9jxff28RLSzexs7aJ/D4p/PALo7nqzLywlewNk/Ppk5rI9+Z8yFd/\nU8jzN08gJ8O702637N7HC0Ub+coZuYzol+ZZDoksMV3uZu0zNl322LvM/kcZ91w8yutIcgKccywp\n8/NC4QYWrNqOAy4c1ZevT8rnnGFZnpwV8sVT+9MrJZFZLxRzxeNLeP7mCQzrmxr2HACPvrkOHHx3\nqm7pKwfFdLkDjM/L5LLT+vPUO+u59uxB9D+JA2sSXnsamvm/ZZX8tmgj5VV19ErxMevcoVx7dmRM\nOjFlWBZzvjGJG599n6tmL+HZmyZwWl5mWDOU7qjlj8squGFyPrm9vN8mEjm6xVHGey4eiQMemF/i\ndRQJwJpte/jXP33MxJ+/xf/76yrSk308dNV4Cn9wIfddMioiin2/sQMyeOWbk0lL9jHjiSIWlewI\n6/of+nsJPXzx3H7+sLCuVyJfzO+5A+T2SmHmZwfz+KIybpqSz6m54d276o6cczS3OhpbWmlsaaOx\npY2mlrb2581tNLW20djc/ryp4/29Dc38dcVW3t9QTVJCHNPH9+f6SfmMyw3NAdJgyc/qySu3TeLG\nZ5Zyy/PFPHDVqVx+eujv6bKiYjevf7KNOy8cTlaqJuKQQ3WLcge47byhzFlawc9eW81LsybG9HnA\nbW3uYHm2tnaUaKdyPfD40HJtbG7tVLr7C/gI5Xzga4/wNZ3ePxEDe6fwr5eO4qoz8+jVMzHIWyZ0\n+qYlM+cbE5n1wjK+N2cF/tombjkntBNkPDC/hF4pPm45J7KvmhVvdJtyT0/28b2pw/n3v6xkwart\nXPSZnLBnaGxppaa+mV31zeyqb2J3fRN7G1oOK8fOBdq5eDvK9hh7vvu/vqn1xIq1szhrv3gnMSGO\npIQ4khI6P25/ntnDR1JaEkm+eBLj40jytb+X2PF+0mHLH/h63+Gfd/BxTnpy1F42n5bs49mbzuKu\nlz/kP19bTdXeRu67ZFRIdiT+uW4n/yzdyQ+/MFpngckRdZtyB5gxYSDPLdnAf7++hvNH9cV3ghc2\ntbY5avY1s7u+iV317X/vPlDYzezed/D1XXXN1Oxrf6++qTWgz/fF24FyPGK5+uJIS05oX+YIhfrp\nx3GfKuCkQwr40M9OjI/TRV8nKNkXz//OOIM+PVfym8Xl7Kxt4r+vGHfC32tH4pzjgflr6J+RzHUT\ndf2GHFm3KveE+Dj+9dLRzHy+mN8XbeSGyfnUNra0F/L+ct53sJR31TcdKOZd9c3UdPy9p6GZo91M\nM87ab3+QmeKjV0oip2QkM/qUdHql+MhM8ZGZkkivlER6pfjISPGRnuw7WLQdxRqte67SLj7O+Mll\nnyE7LYmHF6yluq6Rx649g5TE4Py4zV+5jRWVNdx/5akRc4WsRJ6Yu+VvV5xzXPvUe7y3vpo4g+bW\no//3pyUlkNnTR2aPg2XdXsrtf/dKSSSj0+uZKYmkJSWonOWAF9/bxA///DHj8zJ55oazTvo4Qktr\nGxc/shiA+d89V79hdUPd9pa/XTEzfnb5OJ7553pSkxMOlHJmDx+9enaUd0eZB/NXaemerjl7IL17\n+vjOSx9y5ewlvDDz7JO6idmrH2ymrKqO2ddpIg45toD23M1sGvAoEA885Zz778Pevwu4BWgBqoCb\nnXMbj/WZXu25i3ihqNzPrc8Xk5qcwPM3Tzih2wQ0NLdy/oOL6JuWxJ9vnxLTZ3zJ0QW6597lP/1m\nFg88BlwCjAFmmNmYwxZbDhQ4504FXgHuP/7IIrFr4pA+zPnGJFraHFfNLmTZxurj/ozfFW1ka00D\n904LzRk4ElsC+b1uAlDqnCt3zjUBLwGXdV7AObfQOVff8bQI0Ky8IocZ0z+dV2+bTO+eiVz71Hu8\nuWp7wF+7t6GZxxaW8tlhWUwZpok4pGuBlPsAoKLT88qO145mJvD6yYQSiVV5vVN45ZuTGNEvjW/8\nbhkvF1d0/UXAk++sZ1d9M/folr4SoKAekTGz64AC4IGjvD/LzIrNrLiqqiqYqxaJGn1Sk3jx1olM\nHtqHe1/5iF8vKuVYx7521jby9DvlXDouh/FhvjGZRK9Ayn0zkNfpeW7Ha4cws6nAvwHTnXONR/og\n59wTzrkC51xBdnb2ieQViQmpSQk8fcNZTB/fn/vfKOGnf1tNW9uRC/6xhaXsa27lrs9rr10CF8ip\nkEuB4WY2mPZSvxq4pvMCZnY68BtgmnMuvLfFE4lSiQlxPPK10+jdM5Fn3l2Pv66RB64cT2LCwX2u\niup6fl+0iavOzPPsfvESnbosd+dci5ndAcyn/VTIZ5xzK83sJ0Cxc24u7cMwqcAfO47ib3LOTQ9h\nbpGYEBdn/MeXxtA3PYn73yihuq6Jx687k9Sk9h/NR95cBwZ3aiIOOU4BXcTknJsHzDvstR91ejw1\nyLlEug0z41vnDSMrNYkfvPox1zxZxLM3noW/rok/La/k5imDNcmMHLdu
d4WqSKT6akEevVMSuf3F\nD7hydiE56cmkJCbwLU3EISdA1y+LRJCpY/rx+1vOxl/bSGG5n1nnDqF3FN3XXiKH9txFIkxBfm9e\nuW0y/7eskpmf1UQccmJU7iIRaES/NH5w6WivY0gU07CMiEgMUrmLiMQglbuISAxSuYuIxCCVu4hI\nDFK5i4jEIJW7iEgMUrmLiMSggCbIDsmKzaqAY06i3U1kATu9DhFBtD0O0rY4lLZHu0HOuS4nxPCs\n3KWdmRUHMpN5d6HtcZC2xaG0PY6PhmVERGKQyl1EJAap3L33hNcBIoy2x0HaFofS9jgOGnMXEYlB\n2nMXEYlBKvcwMbNpZlZiZqVmdt8R3r/LzFaZ2Udm9paZDfIiZ7h0tT06LXeFmTkzi9mzJALZFmb2\n1Y7vj5Vm9mK4M4ZTAD8rA81soZkt7/h5udSLnBHPOac/If4DxANlwBAgEVgBjDlsmfOBlI7HtwFz\nvM7t5fboWC4NWAwUAQVe5/bwe2M4sBzo1fG8r9e5Pd4eTwC3dTweA2zwOnck/tGee3hMAEqdc+XO\nuSbgJeCyzgs45xY65+o7nhYBuWHOGE5dbo8OPwV+ATSEM1yYBbItbgUec87tAnDO7QhzxnAKZHs4\nIL3jcQaaugSVAAABwklEQVSwJYz5oobKPTwGABWdnld2vHY0M4HXQ5rIW11uDzM7A8hzzr0WzmAe\nCOR7YwQwwszeNbMiM5sWtnThF8j2+DFwnZlVAvOAb4cnWnTRHKoRxsyuAwqAz3mdxStmFgc8DNzo\ncZRIkUD70Mx5tP9Gt9jMxjnndnuayjszgOeccw+Z2STgt2Y21jnX5nWwSKI99/DYDOR1ep7b8doh\nzGwq8G/AdOdcY5iyeaGr7ZEGjAUWmdkGYCIwN0YPqgbyvVEJzHXONTvn1gNraS/7WBTI9pgJvAzg\nnCsEkmm/74x0onIPj6XAcDMbbGaJwNXA3M4LmNnpwG9oL/ZYHlOFLraHc67GOZflnMt3zuXTfgxi\nunOu2Ju4IdXl9wbwZ9r32jGzLNqHacrDGTKMAtkem4ALAcxsNO3lXhXWlFFA5R4GzrkW4A5gPrAa\neNk5t9LMfmJm0zsWewBIBf5oZh+a2eHf0DEjwO3RLQS4LeYDfjNbBSwE7nHO+b1JHFoBbo+7gVvN\nbAXwB+BG13HqjBykK1RFRGKQ9txFRGKQyl1EJAap3EVEYpDKXUQkBqncRURikMpdRCQGqdxFRGKQ\nyl1EJAb9f3Y2RkJum7VFAAAAAElFTkSuQmCC\n", 511 | "text/plain": [ 512 | "" 513 | ] 514 | }, 515 | "metadata": {}, 516 | "output_type": "display_data" 517 | } 518 | ], 519 | "source": [ 520 | "cobra_diagnostics.optimal_split(X_eps, Y_eps, split=split, info=True, graph=True)" 521 | ] 522 | }, 523 | { 524 | "cell_type": "markdown", 525 | "metadata": {}, 526 | "source": [ 527 | "### Alpha, Epsilon and Machines\n", 528 | "\n", 529 | "The following are methods to idetify the optimal epislon values, alpha values, and combination of machines. \n", 530 | "The grid methods allow for us to predict for a single point the optimal alpha/machines and epsilon combination." 
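Before tuning these, it helps to recall roughly what epsilon and alpha mean in COBRA's aggregation rule (see the paper linked earlier): a training point from $D_l$ contributes to the final average only if at least alpha of the machines predict it within epsilon of what they predict for the query point. The toy function below is only meant to illustrate that rule - it is not pycobra's internal implementation, and the names are made up for this sketch.

```
import numpy as np

def toy_cobra_predict(preds_train, preds_query, Y_train_l, epsilon, alpha):
    # preds_train: (n_machines, n_train) array of each machine's predictions on the D_l points
    # preds_query: (n_machines,) array of each machine's prediction for one query point
    agree = np.abs(preds_train - preds_query[:, None]) <= epsilon
    votes = agree.sum(axis=0)          # how many machines "keep" each training point
    selected = votes >= alpha          # consensus set for this query
    if not selected.any():
        return np.nan                  # no consensus at this epsilon/alpha
    return Y_train_l[selected].mean()  # average the responses of the retained points
```

Seen this way, `optimal_epsilon` and `optimal_alpha` below are simply searching for the consensus threshold and the number of agreeing machines that minimise the error on the calibration data.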
531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": 21, 536 | "metadata": { 537 | "collapsed": false, 538 | "scrolled": true 539 | }, 540 | "outputs": [ 541 | { 542 | "data": { 543 | "text/plain": [ 544 | "(0.70018379287582377, 0.14676391327075944)" 545 | ] 546 | }, 547 | "execution_count": 21, 548 | "metadata": {}, 549 | "output_type": "execute_result" 550 | } 551 | ], 552 | "source": [ 553 | "cobra_diagnostics.optimal_epsilon(X_eps, Y_eps, line_points=100)" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 22, 559 | "metadata": { 560 | "collapsed": false 561 | }, 562 | "outputs": [ 563 | { 564 | "data": { 565 | "text/plain": [ 566 | "{1: 1.3770681356163277,\n", 567 | " 2: 0.36330962313779791,\n", 568 | " 3: 0.22184955956802241,\n", 569 | " 4: 0.32890802098373911}" 570 | ] 571 | }, 572 | "execution_count": 22, 573 | "metadata": {}, 574 | "output_type": "execute_result" 575 | } 576 | ], 577 | "source": [ 578 | "cobra_diagnostics.optimal_alpha(X_eps, Y_eps, info=True)" 579 | ] 580 | }, 581 | { 582 | "cell_type": "code", 583 | "execution_count": 23, 584 | "metadata": { 585 | "collapsed": false 586 | }, 587 | "outputs": [ 588 | { 589 | "data": { 590 | "text/plain": [ 591 | "{('lasso',): 0.67305998753716922,\n", 592 | " ('random_forest',): 0.21212098528993584,\n", 593 | " ('random_forest', 'lasso'): 0.21212098528993584,\n", 594 | " ('ridge',): 0.15102495085167283,\n", 595 | " ('ridge', 'lasso'): 0.15102495085167283,\n", 596 | " ('ridge', 'random_forest'): 0.12856244062945194,\n", 597 | " ('ridge', 'random_forest', 'lasso'): 0.12856244062945194,\n", 598 | " ('ridge', 'tree'): 0.29304405824493457,\n", 599 | " ('ridge', 'tree', 'lasso'): 0.29304405824493457,\n", 600 | " ('ridge', 'tree', 'random_forest'): 0.32890802098373911,\n", 601 | " ('ridge', 'tree', 'random_forest', 'lasso'): 0.32890802098373911,\n", 602 | " ('tree',): 0.30387747420693395,\n", 603 | " ('tree', 'lasso'): 0.30387747420693395,\n", 604 | " ('tree', 'random_forest'): 0.256264208732493,\n", 605 | " ('tree', 'random_forest', 'lasso'): 0.256264208732493}" 606 | ] 607 | }, 608 | "execution_count": 23, 609 | "metadata": {}, 610 | "output_type": "execute_result" 611 | } 612 | ], 613 | "source": [ 614 | "cobra_diagnostics.optimal_machines(X_eps, Y_eps, info=True)" 615 | ] 616 | }, 617 | { 618 | "cell_type": "code", 619 | "execution_count": 24, 620 | "metadata": { 621 | "collapsed": false 622 | }, 623 | "outputs": [ 624 | { 625 | "data": { 626 | "text/plain": [ 627 | "((4, 0.28007351715032952), 0.0016800595281319843)" 628 | ] 629 | }, 630 | "execution_count": 24, 631 | "metadata": {}, 632 | "output_type": "execute_result" 633 | } 634 | ], 635 | "source": [ 636 | "cobra_diagnostics.optimal_alpha_grid(X_eps[0], Y_eps[0], line_points=100)" 637 | ] 638 | }, 639 | { 640 | "cell_type": "code", 641 | "execution_count": 25, 642 | "metadata": { 643 | "collapsed": false 644 | }, 645 | "outputs": [ 646 | { 647 | "data": { 648 | "text/plain": [ 649 | "((('ridge', 'tree', 'lasso'), 0.3150827067941207), 0.0002692832977726628)" 650 | ] 651 | }, 652 | "execution_count": 25, 653 | "metadata": {}, 654 | "output_type": "execute_result" 655 | } 656 | ], 657 | "source": [ 658 | "cobra_diagnostics.optimal_machines_grid(X_eps[0], Y_eps[0], line_points=100)" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "Increasing the number of line points helps in finding a better optimal value. These are the results for the same point. 
The MSEs are to the second value of the tuple.\n", 666 | "\n", 667 | "With 10:\n", 668 | "((('ridge', 'random_forest', 'lasso'), 1.1063905961135443), 0.96254542159345469)\n", 669 | " \n", 670 | "With 20: \n", 671 | "((('tree', 'random_forest'), 0.87346626008964035), 0.53850941611803993)\n", 672 | "\n", 673 | "With 50:\n", 674 | "((('ridge', 'tree'), 0.94833479666875231), 0.48256303899450931)\n", 675 | "\n", 676 | "With 100:\n", 677 | "((('ridge', 'tree', 'random_forest'), 0.10058096328304948), 0.30285776885759158)\n", 678 | "\n", 679 | "With 200: \n", 680 | "((('ridge', 'tree', 'lasso'), 0.10007553130675276), 0.30285776885759158)" 681 | ] 682 | } 683 | ], 684 | "metadata": { 685 | "kernelspec": { 686 | "display_name": "Python 2", 687 | "language": "python", 688 | "name": "python2" 689 | }, 690 | "language_info": { 691 | "codemirror_mode": { 692 | "name": "ipython", 693 | "version": 2 694 | }, 695 | "file_extension": ".py", 696 | "mimetype": "text/x-python", 697 | "name": "python", 698 | "nbconvert_exporter": "python", 699 | "pygments_lexer": "ipython2", 700 | "version": "2.7.12" 701 | } 702 | }, 703 | "nbformat": 4, 704 | "nbformat_minor": 1 705 | } 706 | -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/README.md: -------------------------------------------------------------------------------- 1 | ## Workshop 2 | 3 | This directory contains the Jupyter Notebooks which will be followed during the workshop/tutorial. 4 | 5 | It largely follows the structure of the [News Classification](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/gensim_news_classification.ipynb) notebook, with the major difference being in the pre-processing done by spaCy instead of NLTK. 6 | edit: I've also added text generation with Keras now, and will soon be adding a brief bit on word embeddings. 7 | 8 | The tutorial on Topic Modelling follows us through different topic models and how to visualise them and evaluate them, while the text analysis tutorial introduces users to a variety of text analysis approaches. 9 | 10 | ### Requirements for Topic Modelling 11 | 12 | ``` 13 | - Jupyter 14 | - Gensim Version (>=0.13.1 would be preferred since we will be using topic coherence briefly) 15 | - matplotlib 16 | - spaCy 17 | - pyLDAVis 18 | ``` 19 | 20 | In case the user finds it difficult to download any of the above, there will be a Jupyter Notebook with all the cells already run, so you can just follow the same. 21 | 22 | ### Requirements for Text Analysis 23 | 24 | ``` 25 | - Jupyter 26 | - Gensim Version 27 | - matplotlib 28 | - spaCy 29 | - scikit-learn 30 | - keras 31 | ``` 32 | 33 | ### Setup 34 | 35 | - Start by cloning the repo using 36 | 37 | `git clone https://github.com/bhargavvader/personal` 38 | 39 | - Go into the `notebooks/text_analysis_tutorial` directory 40 | 41 | - Install `virtualenv` using 42 | 43 | `pip install virtualenv` 44 | 45 | - Start the environment with 46 | 47 | ``` 48 | virtualenv venv 49 | source venv/bin/activate 50 | ``` 51 | 52 | - Download requirements with - 53 | 54 | `pip install -r REQUIREMENTS.txt` 55 | 56 | And you should be good to go! 57 | 58 | Alternatively, if you are using anaconda as your virtual environment, running `conda install gensim` and `conda install spacy` should also do the trick. 
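If you prefer conda, a full setup might look something like this (a rough sketch - the environment name `text_tutorial` is arbitrary, and pyLDAvis is easiest to install with pip):

```
conda create -n text_tutorial python
source activate text_tutorial
conda install gensim spacy matplotlib jupyter
pip install pyLDAvis
```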
59 | 60 | ### Text Analysis Tutorial 61 | 62 | For the text analysis tutorial, you will be following the same instructions as above, but will need to run 63 | 64 | `pip install -r REQUIREMENTS_1.txt` 65 | 66 | Alternatively, you can look up which of the libraries you would still need to download and go ahead and just download those. 67 | 68 | ### Downloading spaCy language model 69 | 70 | Both of the tutorials will be using the spaCy English language model, so we will be needing to download it first. 71 | This [link](https://spacy.io/usage/models) contains instructions to download this model. 72 | All we really have to do is run `python -m spacy download en` after we finish all our libary installations. 73 | -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/REQUIREMENTS.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | scipy 3 | gensim 4 | spacy 5 | matplotlib 6 | jupyter 7 | pyLDAvis -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/REQUIREMENTS_1.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | scipy 3 | gensim 4 | spacy 5 | matplotlib 6 | jupyter 7 | scikit-learn 8 | tensorflow 9 | theano 10 | keras -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/Weights/weights-improvement-01-4.3050.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/notebooks/text_analysis_tutorial/Weights/weights-improvement-01-4.3050.hdf5 -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/Weights/weights-improvement-01-4.3902.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/notebooks/text_analysis_tutorial/Weights/weights-improvement-01-4.3902.hdf5 -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/Weights/weights-improvement-02-4.0229.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/notebooks/text_analysis_tutorial/Weights/weights-improvement-02-4.0229.hdf5 -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/Weights/weights-improvement-02-4.3093.hdf5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bhargavvader/personal/fe52b001ca4a21912b5d1755a396cce1a1546574/notebooks/text_analysis_tutorial/Weights/weights-improvement-02-4.3093.hdf5 -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/text_analysis_tutorial_unrun.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Text Analysis Tutorial\n", 8 | "\n", 9 | "Hello there - we'll be following this Jupyter Notebook for the tutorial. 
\n", 10 | "The purpose of this tutorial is to walk you through different parts of the text analysis pipeline, from getting a hold of our data, cleaning and annotating it all the way to swapping verbs in sentences and evaluating topic models.\n", 11 | "\n", 12 | "We will not be looking to explore our textual data in depth, but rather in breadth; give a taste of the different kinds of analysis we can do." 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "Our step, naturally, is setting up our imports. We will be using spaCy for data pre-processing and computational linguistics, gensim for topic modelling, scikit-learn for classification, and Keras for text generation.\n", 20 | "We will also use numpy and matplotlib for other parts of the tutorial.\n", 21 | "\n", 22 | "### Imports" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "import gensim\n", 32 | "import numpy as np\n", 33 | "import spacy\n", 34 | "from spacy import displacy\n", 35 | "from gensim.corpora import Dictionary\n", 36 | "from gensim.models import LdaModel\n", 37 | "import matplotlib.pyplot as plt\n", 38 | "import sklearn\n", 39 | "import keras" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "import warnings\n", 49 | "import os\n", 50 | "warnings.filterwarnings('ignore') # Let's not pay heed to them right now\n", 51 | "%matplotlib inline" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "## Gathering Data\n", 59 | "\n", 60 | "A huge part of text analysis is your data collection - one of the initial goals of the tutorial was to walk the user through the process of cleaning messy twitter data, or scraping data off the internet. But while this does remain an integral part of text analysis, a one and half hour tutorial cannot do justice to both the process of data collection and data analysis - so we will use two more popular, already available data-sets for the purpose of the tutorial.\n", 61 | "\n", 62 | "Keep in mind the only main difference between using a standardised data-set and scraping your own data off the internet is that internet data is largely unstructured; this means we will be spending a lot of time in organising our data into a form that is easy to pre-processes. The datasets we will be working with will be the Lee corpus which is a shortened version of the [Lee Background Corpus](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF), and the [20NG dataset](http://qwone.com/~jason/20Newsgroups/). We will be performing different tasks with these two datasets, and will talk a little bit more about the datasets when we come across them.\n", 63 | "\n", 64 | "Let us now get started with loading our first data-set, the Lee corpus, which we load using Gensim." 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n", 74 | "lee_train_file = test_data_dir + os.sep + 'lee_background.cor'\n", 75 | "text = open(lee_train_file).read()" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Cleaning Data\n", 83 | "\n", 84 | "It's been often said in Machine Learning and NLP algorithms - garbage in, garbage out. 
We can't have state-of-the-art results without data that is just as good. Let's spend this section working on cleaning and understanding our data set.\n", 85 | "NLTK is usually a popular choice for pre-processing - but it is rather [outdated](https://explosion.ai/blog/dead-code-should-be-buried), so we will be checking out spaCy, an industrial-strength text-processing package. \n", 86 | "\n", 87 | "spaCy uses language models similar to the one we just downloaded before starting this tutorial." 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "nlp = spacy.load(\"en\")" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "For good measure, let's add some stopwords. It's a newspaper corpus, so it is likely we will be coming across variations of 'said' and 'Mister' which will not really add any value to the topic models.\n" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "my_stop_words = [u'say', u'\\'s', u'mr', u'be', u'said', u'says', u'saying', 'today']\n", 113 | "for stopword in my_stop_words:\n", 114 | " lexeme = nlp.vocab[stopword]\n", 115 | " lexeme.is_stop = True" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": null, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "doc = nlp(text.lower())" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "Voila! With the `English` pipeline, all the heavy lifting has been done. Let's see what went on under the hood." 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "doc" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "## Computational Linguistics\n", 148 | "\n", 149 | "Okay - now that we have our doc object, what exactly can we do with it?\n", 150 | "We can see that the doc object now contains the entire corpus. This is important because we will be using this doc object to create our corpus for the machine learning algorithms. When creating a corpus for gensim/scikit-learn, we sometimes forget the incredible power which spaCy packs into its pipeline, so we will briefly demonstrate it in this section with a smaller example sentence. Keep in mind that whatever we can do with a sentence, we can also just as well do with the entire corpus." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "sent = nlp(u\"Tom went to IKEA to get some of those delicious Swedish meatballs.\")" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "Simple enough sentence, right? When we pass any kind of text through the spaCy pipeline, it becomes annotated. 
We will quickly have a look at the 3 most important of capabilities which spaCy provides - POS-tagging, NER-tagging, and dependency parsing.\n", 167 | "\n", 168 | "#### POS-Tagging" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "metadata": {}, 175 | "outputs": [], 176 | "source": [ 177 | "for token in sent:\n", 178 | " print(token.text, token.pos_, token.tag_)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "#### NER-Tagging" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": null, 191 | "metadata": {}, 192 | "outputs": [], 193 | "source": [ 194 | "for token in sent:\n", 195 | " print(token.text, token.ent_type_)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "for ent in sent.ents:\n", 205 | " print(ent.text, ent.label_)" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "displacy.render(sent, style='ent', jupyter=True)" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "#### Dependency Parsing" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "for chunk in sent.noun_chunks:\n", 231 | " print(chunk.text, chunk.root.text, chunk.root.dep_,\n", 232 | " chunk.root.head.text)\n" 233 | ] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "execution_count": null, 238 | "metadata": {}, 239 | "outputs": [], 240 | "source": [ 241 | "for token in sent:\n", 242 | " print(token.text, token.dep_, token.head.text, token.head.pos_,\n", 243 | " [child for child in token.children])\n" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "displacy.render(sent, style='dep', jupyter=True)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "This is just an example of the kind of annotations spaCy adds when it runs any text through its pipeline. We will see in the very next section that spaCy has a bunch of other information as well, such as whether a token is a number or not, stop-word or not, and other information which comes in very handy when pre-processing text. \n", 260 | "\n", 261 | "## Continuing Cleaning\n", 262 | "\n", 263 | "Have a quick look at the output of the doc object. It seems like nothing, right? But spaCy's internal data structure has done all the work for us. Let's see how we can create our corpus. You can check out what a gensim corpus looks like [here](https://radimrehurek.com/gensim/tut1.html)." 
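, "\n", "To make the link above concrete (an illustrative sketch that is not part of the original notebook - the `toy_*` names are made up), a gensim bag-of-words corpus is simply a list of documents, each a list of `(token_id, count)` pairs:\n", "\n", "```python\n", "from gensim.corpora import Dictionary\n", "\n", "toy_texts = [['new', 'york', 'new', 'year'], ['topic', 'model']]\n", "toy_dictionary = Dictionary(toy_texts)\n", "toy_corpus = [toy_dictionary.doc2bow(t) for t in toy_texts]\n", "print(toy_corpus)  # something like [[(0, 2), (1, 1), (2, 1)], [(3, 1), (4, 1)]]\n", "```"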
264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "# build our corpus: one list of tokens per document\n", 273 | "texts, article = [], []\n", 274 | "for w in doc:\n", 275 | " # if it's not a stop word or punctuation mark, add it to our article!\n", 276 | " if w.text != '\\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I':\n", 277 | " # we add the lemmatized version of the word\n", 278 | " article.append(w.lemma_)\n", 279 | " # if it's a new line, it means we're onto our next document\n", 280 | " if w.text == '\\n':\n", 281 | " texts.append(article)\n", 282 | " article = []" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "texts" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "And this is the magic of spaCy - just like that, we've managed to get rid of stopwords and punctuation markers, and added the lemmatized form of each word. \n", 299 | "\n", 300 | "Sometimes topic models make more sense when 'New' and 'York' are treated as 'New_York' - we can do this by creating a bigram model and modifying our corpus accordingly." 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "bigram = gensim.models.Phrases(texts)" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "texts = [bigram[line] for line in texts]" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": null, 324 | "metadata": {}, 325 | "outputs": [], 326 | "source": [ 327 | "texts[0]" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": {}, 334 | "outputs": [], 335 | "source": [ 336 | "dictionary = Dictionary(texts)\n", 337 | "corpus = [dictionary.doc2bow(text) for text in texts]" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "## Topic Modelling\n", 345 | "\n", 346 | "Topic Modelling refers to the probabilistic modelling of text documents as mixtures of topics. Gensim remains the most popular library to perform such modelling, and we will be using it to perform our topic modelling. \n", 347 | "\n", 348 | "LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modelling algorithm out there. Here we create a simple topic model with 10 topics. This is where the corpus we created earlier will come in handy." 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": null, 354 | "metadata": {}, 355 | "outputs": [], 356 | "source": [ 357 | "ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)" 358 | ] 359 | }, 360 | { 361 | "cell_type": "markdown", 362 | "metadata": {}, 363 | "source": [ 364 | "This is a great way to get a view of what words end up appearing in our documents, and what kind of document topics might be present. For more details, such as the other topic models which Gensim provides, as well as ways to measure topic coherence (performance), and visualisation, the topic modelling notebook in the same directory will serve as a good resource."
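, "\n", "As a quick sanity check (an illustrative sketch, not part of the original notebook), the learned topics can be printed with `show_topics`, which the topic modelling notebook also uses:\n", "\n", "```python\n", "for topic_id, topic_words in ldamodel.show_topics(num_topics=5, formatted=False):\n", "    print(topic_id, [word for word, prob in topic_words])\n", "```"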
365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "## Text Classification\n", 372 | "\n", 373 | "In the previous example, we worked with unlabelled, unstructured data. Classification is a machine learning task which is quite different from the previous examples because we are dealing with labelled data, and we know what classes we want to put our documents into - we are not discovering topics or classes.\n", 374 | "\n", 375 | "For such an example, we would need to use a labelled data-set, and in our case we will be using the previously mentioned 20NG dataset." 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [ 384 | "from sklearn.datasets import fetch_20newsgroups\n", 385 | "from sklearn.decomposition import TruncatedSVD\n", 386 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 387 | "from sklearn.feature_extraction.text import HashingVectorizer\n", 388 | "from sklearn.feature_extraction.text import TfidfTransformer\n", 389 | "from sklearn.feature_extraction.text import CountVectorizer\n", 390 | "\n", 391 | "categories = [\n", 392 | " 'alt.atheism',\n", 393 | " 'talk.religion.misc',\n", 394 | " 'comp.graphics',\n", 395 | " 'sci.space',\n", 396 | "]\n", 397 | "\n", 398 | "from sklearn.pipeline import Pipeline\n", 399 | "from sklearn.pipeline import make_pipeline\n", 400 | "from sklearn.preprocessing import Normalizer" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": {}, 407 | "outputs": [], 408 | "source": [ 409 | "data_train = fetch_20newsgroups(subset='train', categories=categories,\n", 410 | " shuffle=True, random_state=42)\n", 411 | "n_components = 5\n", 412 | "labels = data_train.target\n", 413 | "true_k = np.unique(labels).shape[0]\n", 414 | "\n", 415 | "# convert to TF-IDF format\n", 416 | "vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english', use_idf=True)\n", 417 | "X_train = vectorizer.fit_transform(data_train.data)\n", 418 | "\n", 419 | "# Reduce dimensions\n", 420 | "svd = TruncatedSVD(n_components)\n", 421 | "normalizer = Normalizer(copy=False)\n", 422 | "lsa = make_pipeline(svd, normalizer)\n", 423 | "\n", 424 | "X_train = lsa.fit_transform(X_train)" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# order of labels in `target_names` can be different from `categories`\n", 434 | "data_test = fetch_20newsgroups(subset='test', categories=categories,\n", 435 | " shuffle=True, random_state=42)\n", 436 | "\n", 437 | "target_names = data_train.target_names\n", 438 | "# split a training set and a test set\n", 439 | "y_train, y_test = data_train.target, data_test.target\n", 440 | "\n", 441 | "print(\"Extracting features from the test data using the same vectorizer\")\n", 442 | "X_test = vectorizer.transform(data_test.data)\n", 443 | "X_test = lsa.fit_transform(X_test)" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "Take a minute to note the pre-processing steps we used above - it is less transparent than our method with spaCy, but it is still important to know and to be able to use the scikit-learn modules for the same. 
" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": null, 456 | "metadata": {}, 457 | "outputs": [], 458 | "source": [ 459 | "from sklearn.naive_bayes import GaussianNB" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": null, 465 | "metadata": {}, 466 | "outputs": [], 467 | "source": [ 468 | "gnb = GaussianNB()\n", 469 | "y_pred_NB = gnb.fit(X_train, y_train).predict(X_test)" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [ 478 | "y_pred_NB" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "from sklearn.svm import SVC" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": {}, 494 | "outputs": [], 495 | "source": [ 496 | "svm = SVC()\n", 497 | "y_pred_SVM = svm.fit(X_train, y_train).predict(X_test) " 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "## Deep Learning\n", 505 | "\n", 506 | "The final bit of our tutorial will explore the ideas of neural networks, and using RNNs to generate text.\n", 507 | "\n", 508 | "A Recurrent Neural Network does one step better than other neural networks because of its ability to remember context, as each layer in the network is built with information from the previous layer - this additional context allows it to perform better. We will be using a particular variant of an RNN called LSTM, or Long Short Term Memory - as the name suggests, it has the ability to have short-term memory which can last for a long period of time. Whenever there is a significant time-lag between inputs, LSTMs tend to perform well - considering the nature of language, where a word which appears later on in a sentence is influenced by the context of the sentence, this property starts becoming more important. For a more detailed explanation on the mathematics or intuition behind an LSTM and RNN, the following blog posts can be very useful:\n", 509 | "\n", 510 | "[Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)\n", 511 | "[Unreasonable Effectiveness of Reccurent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)\n", 512 | "\n", 513 | "For this part of the tutorial, we will be using the code written by my good friend Kirit Thadaka - you can find the original code over on this [GitHub repository](https://github.com/kirit93/Personal/blob/master/text_generation_keras/text_generation.ipynb)." 
514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": {}, 520 | "outputs": [], 521 | "source": [ 522 | "from keras.models import Sequential, Model\n", 523 | "from keras.layers import LSTM, Dense, Dropout, Input\n", 524 | "from keras.callbacks import ModelCheckpoint\n", 525 | "from keras.utils import np_utils\n", 526 | "SEQ_LENGTH = 100" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": {}, 533 | "outputs": [], 534 | "source": [ 535 | "test_x = np.array([1, 2, 0, 4, 3, 7, 10])\n", 536 | "# one hot encoding\n", 537 | "test_y = np_utils.to_categorical(test_x)\n", 538 | "print(test_x)\n", 539 | "print(test_y)" 540 | ] 541 | }, 542 | { 543 | "cell_type": "code", 544 | "execution_count": null, 545 | "metadata": {}, 546 | "outputs": [], 547 | "source": [ 548 | "# Using the keras functional API (a single-layer counterpart of create_model below; not used later in the notebook)\n", 549 | "def create_functional_model(n_layers, input_shape, hidden_dim, n_out, **kwargs):\n", 550 | " drop = kwargs.get('drop_rate', 0.2)\n", 551 | " activ = kwargs.get('activation', 'softmax')\n", 552 | " mode = kwargs.get('mode', 'train')\n", 553 | " hidden_dim = int(hidden_dim)\n", 554 | "\n", 555 | " inputs = Input(shape = (input_shape[1], input_shape[2]))\n", 556 | " model = LSTM(hidden_dim)(inputs)\n", 557 | " model = Dropout(drop)(model)\n", 558 | " model = Dense(n_out, activation = activ)(model)\n", " return Model(inputs = inputs, outputs = model)" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": null, 564 | "metadata": {}, 565 | "outputs": [], 566 | "source": [ 567 | "# Using keras sequential model\n", 568 | "def create_model(n_layers, input_shape, hidden_dim, n_out, **kwargs):\n", 569 | " drop = kwargs.get('drop_rate', 0.2)\n", 570 | " activ = kwargs.get('activation', 'softmax')\n", 571 | " mode = kwargs.get('mode', 'train')\n", 572 | " hidden_dim = int(hidden_dim)\n", 573 | " model = Sequential()\n", 574 | " flag = True \n", 575 | "\n", 576 | " if n_layers == 1: \n", 577 | " model.add( LSTM(hidden_dim, input_shape = (input_shape[1], input_shape[2])) )\n", 578 | " if mode == 'train':\n", 579 | " model.add( Dropout(drop) )\n", 580 | "\n", 581 | " else:\n", 582 | " model.add( LSTM(hidden_dim, input_shape = (input_shape[1], input_shape[2]), return_sequences = True) )\n", 583 | " if mode == 'train':\n", 584 | " model.add( Dropout(drop) )\n", 585 | " for i in range(n_layers - 2):\n", 586 | " model.add( LSTM(hidden_dim, return_sequences = True) )\n", 587 | " if mode == 'train':\n", 588 | " model.add( Dropout(drop) )\n", 589 | " model.add( LSTM(hidden_dim) )\n", 590 | "\n", 591 | " model.add( Dense(n_out, activation = activ) )\n", 592 | "\n", 593 | " return model" 594 | ] 595 | }, 596 | { 597 | "cell_type": "code", 598 | "execution_count": null, 599 | "metadata": {}, 600 | "outputs": [], 601 | "source": [ 602 | "def train(model, X, Y, n_epochs, b_size, vocab_size, **kwargs): \n", 603 | " loss = kwargs.get('loss', 'categorical_crossentropy')\n", 604 | " opt = kwargs.get('optimizer', 'adam')\n", 605 | " \n", 606 | " model.compile(loss = loss, optimizer = opt)\n", 607 | "\n", 608 | " filepath = \"Weights/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5\"\n", 609 | " checkpoint = ModelCheckpoint(filepath, monitor = 'loss', verbose = 1, save_best_only = True, mode = 'min')\n", 610 | " callbacks_list = [checkpoint]\n", 611 | " X = X / float(vocab_size)\n", 612 | " model.fit(X, Y, epochs = n_epochs, batch_size = b_size, callbacks = callbacks_list)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 
| "The fit function will run the input batchwase n_epochs number of times and it will save the weights to a file whenever there is an improvement. This is taken care of through the callback. \n", 620 | "\n", 621 | "After the training is done or once you find a loss that you are happy with, you can test how well the model generates text." 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": null, 627 | "metadata": {}, 628 | "outputs": [], 629 | "source": [ 630 | "def generate_text(model, X, filename, ix_to_char, vocab_size):\n", 631 | " \n", 632 | " # Load the weights from the epoch with the least loss\n", 633 | " model.load_weights(filename)\n", 634 | " model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')\n", 635 | "\n", 636 | " start = np.random.randint(0, len(X) - 1)\n", 637 | " pattern = np.ravel(X[start]).tolist()\n", 638 | "\n", 639 | " # We seed the model with a random sequence of 100 so it can start predicting\n", 640 | " print (\"Seed:\")\n", 641 | " print (\"\\\"\", ''.join([ix_to_char[value] for value in pattern]), \"\\\"\")\n", 642 | " output = []\n", 643 | " for i in range(250):\n", 644 | " x = np.reshape(pattern, (1, len(pattern), 1))\n", 645 | " x = x / float(vocab_size)\n", 646 | " prediction = model.predict(x, verbose = 0)\n", 647 | " index = np.argmax(prediction)\n", 648 | " result = index\n", 649 | " output.append(result)\n", 650 | " pattern.append(index)\n", 651 | " pattern = pattern[1 : len(pattern)]\n", 652 | "\n", 653 | " print(\"Predictions\")\n", 654 | " print (\"\\\"\", ''.join([ix_to_char[value] for value in output]), \"\\\"\")" 655 | ] 656 | }, 657 | { 658 | "cell_type": "code", 659 | "execution_count": null, 660 | "metadata": {}, 661 | "outputs": [], 662 | "source": [ 663 | "# filename = 'data/game_of_thrones.txt'\n", 664 | "# data = open(filename).read()\n", 665 | "# data = data.lower()\n", 666 | "test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n", 667 | "lee_train_file = test_data_dir + os.sep + 'lee_background.cor'\n", 668 | "data = open(lee_train_file).read()\n", 669 | "# Find all the unique characters\n", 670 | "chars = sorted(list(set(data)))\n", 671 | "char_to_int = dict((c, i) for i, c in enumerate(chars))\n", 672 | "ix_to_char = dict((i, c) for i, c in enumerate(chars))\n", 673 | "vocab_size = len(chars)\n", 674 | "\n", 675 | "print(\"List of unique characters : \\n\", chars)\n", 676 | "\n", 677 | "print(\"Number of unique characters : \\n\", vocab_size)\n", 678 | "\n", 679 | "print(\"Character to integer mapping : \\n\", char_to_int)" 680 | ] 681 | }, 682 | { 683 | "cell_type": "code", 684 | "execution_count": null, 685 | "metadata": {}, 686 | "outputs": [], 687 | "source": [ 688 | "list_X = []\n", 689 | "list_Y = []\n", 690 | "\n", 691 | "# Python append is faster than numpy append. 
Try it!\n", 692 | "for i in range(0, len(data) - SEQ_LENGTH, 1):\n", 693 | " seq_in = data[i : i + SEQ_LENGTH]\n", 694 | " seq_out = data[i + SEQ_LENGTH]\n", 695 | " list_X.append([char_to_int[char] for char in seq_in])\n", 696 | " list_Y.append(char_to_int[seq_out])\n", 697 | "\n", 698 | "n_patterns = len(list_X)" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": {}, 705 | "outputs": [], 706 | "source": [ 707 | "X = np.reshape(list_X, (n_patterns, SEQ_LENGTH, 1)) # (n, 100, 1)\n", 708 | "# Encode output as one-hot vector\n", 709 | "Y = np_utils.to_categorical(list_Y)" 710 | ] 711 | }, 712 | { 713 | "cell_type": "code", 714 | "execution_count": null, 715 | "metadata": {}, 716 | "outputs": [], 717 | "source": [ 718 | "model = create_model(1, X.shape, 256, Y.shape[1], mode = 'train')" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": null, 724 | "metadata": {}, 725 | "outputs": [], 726 | "source": [ 727 | "train(model, X[:1024], Y[:1024], 2, 512, vocab_size)\n" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": null, 733 | "metadata": {}, 734 | "outputs": [], 735 | "source": [ 736 | "generate_text(model, X, \"Weights/weights-improvement-01-4.3050.hdf5\", ix_to_char, vocab_size)" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": null, 742 | "metadata": {}, 743 | "outputs": [], 744 | "source": [ 745 | "generate_text(model, X, \"Weights/weights-improvement-02-4.0229.hdf5\", ix_to_char, vocab_size)\n" 746 | ] 747 | }, 748 | { 749 | "cell_type": "code", 750 | "execution_count": null, 751 | "metadata": {}, 752 | "outputs": [], 753 | "source": [] 754 | } 755 | ], 756 | "metadata": { 757 | "kernelspec": { 758 | "display_name": "Python 3", 759 | "language": "python", 760 | "name": "python3" 761 | }, 762 | "language_info": { 763 | "codemirror_mode": { 764 | "name": "ipython", 765 | "version": 3 766 | }, 767 | "file_extension": ".py", 768 | "mimetype": "text/x-python", 769 | "name": "python", 770 | "nbconvert_exporter": "python", 771 | "pygments_lexer": "ipython3", 772 | "version": "3.5.2" 773 | } 774 | }, 775 | "nbformat": 4, 776 | "nbformat_minor": 2 777 | } 778 | -------------------------------------------------------------------------------- /notebooks/text_analysis_tutorial/topic_modelling_unrun.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": true 7 | }, 8 | "source": [ 9 | "### Topic Modelling - and more - with Gensim!\n", 10 | "\n", 11 | "This tutorial will attempt to walk you through the entire process of analysing your text - from pre-processing to creating your topic models and visualising them. 
\n", 12 | "\n", 13 | "python offers a very rich suite of NLP and CL tools, and we will illustrate these to the best of our capabilities.\n", 14 | "Let's start by setting up our imports.\n", 15 | "\n", 16 | "We will be needing: \n", 17 | "```\n", 18 | "- Gensim\n", 19 | "- matplotlib\n", 20 | "- spaCy\n", 21 | "- pyLDAVis\n", 22 | "```\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "import matplotlib.pyplot as plt\n", 32 | "import gensim\n", 33 | "import numpy as np\n", 34 | "import spacy\n", 35 | "\n", 36 | "from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel\n", 37 | "from gensim.models.wrappers import LdaMallet\n", 38 | "from gensim.corpora import Dictionary\n", 39 | "import pyLDAvis.gensim\n", 40 | "\n", 41 | "import os, re, operator, warnings\n", 42 | "warnings.filterwarnings('ignore') # Let's not pay heed to them right now\n", 43 | "%matplotlib inline" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "For this tutorial, we will be using the Lee corpus which is a shortened version of the [Lee Background Corpus](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF). The shortened version consists of 300 documents selected from the Australian Broadcasting Corporation's news mail service. It consists of texts of headline stories from around the year 2000-2001. \n", 51 | "\n", 52 | "We should keep in mind we can use pretty much any textual data-set and go ahead with what we will be doing." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "# since we're working in python 2.7 in this tutorial, we need to make sure to clean our data to make it unicode consistent\n", 62 | "def clean(text):\n", 63 | " return unicode(''.join([i if ord(i) < 128 else ' ' for i in text]))\n", 64 | "\n", 65 | "test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n", 66 | "lee_train_file = test_data_dir + os.sep + 'lee_background.cor'\n", 67 | "text = open(lee_train_file).read()" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### Pre-processing data!\n", 75 | "\n", 76 | "It's been often said in Machine Learning and NLP algorithms - garbage in, garbage out. We can't have state-of-the-art results without data which is as good. Let's spend this section working on cleaning and understanding our data set.\n", 77 | "NTLK is usually a popular choice for pre-processing - but is a rather [outdated](https://explosion.ai/blog/dead-code-should-be-buried) and we will be checking out spaCy, an industry grade text-processing package. " 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "nlp = spacy.load(\"en\")" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "For safe measure, let's add some stopwords. It's a newspaper corpus, so it is likely we will be coming across variations of 'said' and 'Mister' which will not really add any value to the topic models." 
94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "my_stop_words = [u'say', u'\\'s', u'Mr', u'be', u'said', u'says', u'saying']\n", 103 | "for stopword in my_stop_words:\n", 104 | " lexeme = nlp.vocab[stopword]\n", 105 | " lexeme.is_stop = True" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "doc = nlp(clean(text))" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "Voila! With the `English` pipeline, all the heavy lifting has been done. Let's see what went on under the hood." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "doc" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "It seems like nothing, right? But spaCy's internal data structure has done all the work for us. Let's see how we can create our corpus. You can check out what a gensim corpus looks like [here](https://radimrehurek.com/gensim/tut1.html)." 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# build our corpus: one list of tokens per document\n", 147 | "texts, article = [], []\n", 148 | "for w in doc:\n", 149 | " # if it's not a stop word or punctuation mark, add it to our article!\n", 150 | " if w.text != '\\n' and not w.is_stop and not w.is_punct and not w.like_num:\n", 151 | " # we add the lemmatized version of the word\n", 152 | " article.append(w.lemma_)\n", 153 | " # if it's a new line, it means we're onto our next document\n", 154 | " if w.text == '\\n':\n", 155 | " texts.append(article)\n", 156 | " article = []" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": {}, 163 | "outputs": [], 164 | "source": [ 165 | "texts" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "And this is the magic of spaCy - just like that, we've managed to get rid of stopwords and punctuation markers, and added the lemmatized form of each word. There's a lot more we can do with spaCy, which I would really recommend checking out.\n", 173 | "\n", 174 | "Sometimes topic models make more sense when 'New' and 'York' are treated as 'New_York' - we can do this by creating a bigram model and modifying our corpus accordingly." 
175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": null, 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "bigram = gensim.models.Phrases(texts)" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [ 192 | "texts = [bigram[line] for line in texts]" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "texts[10]" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "dictionary = Dictionary(texts)\n", 211 | "corpus = [dictionary.doc2bow(text) for text in texts]" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "corpus[100]" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "We're now done with a very important part of any text analysis - the data cleaning and the setting up of the corpus. It must be kept in mind that we created the corpus the way we did because that's how gensim requires it - most algorithms still require one to clean the data set the way we did, by removing stop words and numbers, adding the lemmatized form of the word, and using bigrams. " 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### LSI\n", 235 | "\n", 236 | "LSI stands for Latent Semantic Indexing - it is a popular information retrieval method which works by decomposing the term-document matrix and keeping only the strongest components, which act as topics. Gensim's implementation uses a truncated SVD." 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "lsimodel.show_topics(num_topics=5) # Showing only the top 5 topics" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "### HDP\n", 262 | "\n", 263 | "HDP, the Hierarchical Dirichlet Process, is a non-parametric, unsupervised topic model which figures out the number of topics on its own." 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": {}, 270 | "outputs": [], 271 | "source": [ 272 | "hdpmodel = HdpModel(corpus=corpus, id2word=dictionary)" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "hdpmodel.show_topics()" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "### LDA\n", 289 | "\n", 290 | "LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modelling algorithm out there. Here we create a simple topic model with 10 topics." 
291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": null, 296 | "metadata": {}, 297 | "outputs": [], 298 | "source": [ 299 | "ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "ldamodel.show_topics()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": {}, 314 | "source": [ 315 | "### pyLDAvis \n", 316 | "\n", 317 | "Thanks to pyLDAvis, we can visualise our topic models in a really handy way. All we need to do is enable our notebook and prepare the object." 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "pyLDAvis.enable_notebook()\n", 327 | "pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "### Round-up\n", 335 | "\n", 336 | "Okay - so what have we learned so far? \n", 337 | "By using spaCy, we cleaned up our data super fast. It's worth noting that by running our doc through the pipeline we also know about every single word's POS-tag and NER-tag. This is useful information and we can do some funky things with it! I would highly recommend going through [this](https://github.com/explosion/spacy-notebooks) repository to see examples of hands-on spaCy usage.\n", 338 | "\n", 339 | "As for gensim and topic modelling, it's pretty easy to see how quickly we could create our topic models. Now the obvious next question is - how do we use these topic models? The [news classification notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/gensim_news_classification.ipynb) in the Gensim [notebooks](https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks) directory is a good example of how we can use topic models in a practical scenario.\n", 340 | "\n", 341 | "We will continue this tutorial by demonstrating one of the newer topic modelling features of gensim - in particular, Topic Coherence. \n", 342 | "\n", 343 | "### Topic Coherence\n", 344 | "\n", 345 | "Topic Coherence is a newer gensim feature which lets us identify which topic model is 'better'. \n", 346 | "By returning a score, we can compare different topic models of the same corpus. We use the same example from the news classification notebook to plot a graph comparing the topic models we have created." 
347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": null, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "lsitopics = [[word for word, prob in topic] for topicid, topic in lsimodel.show_topics(formatted=False)]\n", 356 | "\n", 357 | "hdptopics = [[word for word, prob in topic] for topicid, topic in hdpmodel.show_topics(formatted=False)]\n", 358 | "\n", 359 | "ldatopics = [[word for word, prob in topic] for topicid, topic in ldamodel.show_topics(formatted=False)]" 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": null, 365 | "metadata": {}, 366 | "outputs": [], 367 | "source": [ 368 | "lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=texts, dictionary=dictionary, window_size=10).get_coherence()\n", 369 | "\n", 370 | "hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=texts, dictionary=dictionary, window_size=10).get_coherence()\n", 371 | "\n", 372 | "lda_coherence = CoherenceModel(topics=ldatopics, texts=texts, dictionary=dictionary, window_size=10).get_coherence()" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "def evaluate_bar_graph(coherences, indices):\n", 382 | " \"\"\"\n", 383 | " Function to plot bar graph.\n", 384 | " \n", 385 | " coherences: list of coherence values\n", 386 | " indices: Indices to be used to mark bars. Length of this and coherences should be equal.\n", 387 | " \"\"\"\n", 388 | " assert len(coherences) == len(indices)\n", 389 | " n = len(coherences)\n", 390 | " x = np.arange(n)\n", 391 | " plt.bar(x, coherences, width=0.2, tick_label=indices, align='center')\n", 392 | " plt.xlabel('Models')\n", 393 | " plt.ylabel('Coherence Value')" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": {}, 400 | "outputs": [], 401 | "source": [ 402 | "evaluate_bar_graph([lsi_coherence, hdp_coherence, lda_coherence],\n", 403 | " ['LSI', 'HDP', 'LDA'])" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "We can see that topic coherence helped us get past manually inspecting our topic models - we can now keep fine tuning our models and compare between them to see which has the best performance. \n", 411 | "\n", 412 | "This also brings us to the end of the runnable part of this tutorial - we will continue however by briefly going over two more Jupyter notebooks I have previously worked on - mainly, [Dynamic Topic Modelling](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/ldaseqmodel.ipynb) and [Document Word Coloring](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb)." 413 | ] 414 | } 415 | ], 416 | "metadata": { 417 | "kernelspec": { 418 | "display_name": "Python 3", 419 | "language": "python", 420 | "name": "python3" 421 | }, 422 | "language_info": { 423 | "codemirror_mode": { 424 | "name": "ipython", 425 | "version": 3 426 | }, 427 | "file_extension": ".py", 428 | "mimetype": "text/x-python", 429 | "name": "python", 430 | "nbconvert_exporter": "python", 431 | "pygments_lexer": "ipython3", 432 | "version": "3.6.4" 433 | } 434 | }, 435 | "nbformat": 4, 436 | "nbformat_minor": 2 437 | } 438 | --------------------------------------------------------------------------------