├── .gitignore
├── README.md
├── data
│   ├── bird_small.mat
│   ├── bird_small.png
│   ├── emailSample1.txt
│   ├── emailSample2.txt
│   ├── ex1data1.txt
│   ├── ex1data2.txt
│   ├── ex2data1.txt
│   ├── ex2data2.txt
│   ├── ex3data1.mat
│   ├── ex3weights.mat
│   ├── ex4data1.mat
│   ├── ex4weights.mat
│   ├── ex5data1.mat
│   ├── ex6data1.mat
│   ├── ex6data2.mat
│   ├── ex6data3.mat
│   ├── ex7data1.mat
│   ├── ex7data2.mat
│   ├── ex7faces.mat
│   ├── ex8_movieParams.mat
│   ├── ex8_movies.mat
│   ├── ex8data1.mat
│   ├── ex8data2.mat
│   ├── manning.csv
│   ├── movie_ids.txt
│   ├── spamSample1.txt
│   ├── spamSample2.txt
│   ├── spamTest.mat
│   ├── spamTrain.mat
│   ├── stock_data.csv
│   └── vocab.txt
├── exercises
│   └── ML
│       ├── ex1.pdf
│       ├── ex2.pdf
│       ├── ex3.pdf
│       ├── ex4.pdf
│       ├── ex5.pdf
│       ├── ex6.pdf
│       ├── ex7.pdf
│       └── ex8.pdf
├── notebooks
│   ├── fastai
│   │   ├── Fastai-Lesson1.ipynb
│   │   ├── Fastai-Lesson2.ipynb
│   │   ├── Fastai-Lesson3.ipynb
│   │   ├── Fastai-Lesson4.ipynb
│   │   ├── Fastai-Lesson5.ipynb
│   │   ├── Fastai-Lesson6.ipynb
│   │   └── Fastai-Lesson7.ipynb
│   ├── keras
│   │   ├── AnomalyDetection.ipynb
│   │   ├── ConvolutionalNetworks.ipynb
│   │   ├── GenerativeAdversarialNetworks.ipynb
│   │   ├── RecommenderSystems.ipynb
│   │   ├── RecurrentNetworks.ipynb
│   │   └── StructuredTimeSeries.ipynb
│   ├── language
│   │   ├── IPythonMagic.ipynb
│   │   └── Intro.ipynb
│   ├── libraries
│   │   ├── DEAP.ipynb
│   │   ├── Gensim.ipynb
│   │   ├── Matplotlib.ipynb
│   │   ├── NLTK.ipynb
│   │   ├── NetworkX.ipynb
│   │   ├── NumPy.ipynb
│   │   ├── Pandas.ipynb
│   │   ├── PyMC.ipynb
│   │   ├── SciPy.ipynb
│   │   ├── Scikit-learn.ipynb
│   │   ├── Seaborn.ipynb
│   │   └── Statsmodels.ipynb
│   ├── misc
│   │   ├── CodeOptimization.ipynb
│   │   ├── DynamicProgramming.ipynb
│   │   ├── LanguageVectors.ipynb
│   │   ├── MarkovChains.ipynb
│   │   ├── MonteCarlo.ipynb
│   │   ├── ProbablisticProgramming.ipynb
│   │   ├── ProphetForecasting.ipynb
│   │   └── TimeSeriesStockAnalysis.ipynb
│   ├── ml
│   │   ├── ML-Exercise1.ipynb
│   │   ├── ML-Exercise2.ipynb
│   │   ├── ML-Exercise3.ipynb
│   │   ├── ML-Exercise4.ipynb
│   │   ├── ML-Exercise6.ipynb
│   │   ├── ML-Exercise7.ipynb
│   │   └── ML-Exercise8.ipynb
│   ├── spark
│   │   ├── Spark-Lab0-Tutorial.ipynb
│   │   ├── Spark-Lab1-WordCount.ipynb
│   │   ├── Spark-Lab2-ApacheLog.ipynb
│   │   ├── Spark-Lab3-EntityResolution.ipynb
│   │   ├── Spark-Lab4-MachineLearning.ipynb
│   │   ├── Spark-ML-Lab3-LinearRegression.ipynb
│   │   ├── Spark-ML-Lab4-CriteoPrediction.ipynb
│   │   └── Spark-ML-Lab5-NeuroPCA.ipynb
│   └── tensorflow
│       ├── Tensorflow-1-NotMNIST.ipynb
│       ├── Tensorflow-2-FullyConnected.ipynb
│       ├── Tensorflow-3-Regularization.ipynb
│       ├── Tensorflow-4-Convolutions.ipynb
│       ├── Tensorflow-5-Word2Vec.ipynb
│       └── Tensorflow-6-LSTM.ipynb
└── scripts
    ├── __init__.py
    └── hello.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 |
5 | # C extensions
6 | *.so
7 |
8 | # Distribution / packaging
9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | eggs/
15 | lib/
16 | lib64/
17 | parts/
18 | sdist/
19 | var/
20 | *.egg-info/
21 | .installed.cfg
22 | *.egg
23 |
24 | # Installer logs
25 | pip-log.txt
26 | pip-delete-this-directory.txt
27 |
28 | # Unit test / coverage reports
29 | htmlcov/
30 | .tox/
31 | .coverage
32 | .cache
33 | nosetests.xml
34 | coverage.xml
35 |
36 | # Translations
37 | *.mo
38 | *.pot
39 |
40 | # Django stuff:
41 | *.log
42 |
43 | # Sphinx documentation
44 | docs/_build/
45 |
46 | # IPython notebook checkpoints
47 | .ipynb_checkpoints/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ipython-notebooks
2 | ========================
3 |
4 |
5 |
6 | This repo contains various IPython notebooks I've created to experiment with libraries, work through exercises, and explore subjects that I find interesting. Notebook viewer links are included below; click a link to see a live rendering of the notebook.
7 |
8 | #### Language
9 |
10 | These notebooks contain introductory content such as an overview of the language and a review of IPython's functionality.
11 |
12 | Introduction To Python
13 | IPython Magic Commands
14 |
15 | #### Libraries
16 |
17 | Examples using a variety of popular "data science" Python libraries.
18 |
19 | NumPy
20 | SciPy
21 | Matplotlib
22 | Pandas
23 | Statsmodels
24 | Scikit-learn
25 | Seaborn
26 | NetworkX
27 | PyMC
28 | NLTK
29 | DEAP
30 | Gensim
31 |
32 | #### Machine Learning Exercises
33 |
34 | Implementations of the exercises presented in Andrew Ng's "Machine Learning" class on Coursera.
35 |
36 | Exercise 1 - Linear Regression
37 | Exercise 2 - Logistic Regression
38 | Exercise 3 - Multi-Class Classification
39 | Exercise 4 - Neural Networks
40 | Exercise 6 - Support Vector Machines
41 | Exercise 7 - K-Means Clustering & PCA
42 | Exercise 8 - Anomaly Detection & Recommendation Systems
43 |
44 | #### Tensorflow Deep Learning Exercises
45 |
46 | Implementations of the assignments from Google's Udacity course on deep learning.
47 |
48 | Assignment 1 - Intro & Data Prep
49 | Assignment 2 - Regression & Neural Nets
50 | Assignment 3 - Regularization
51 | Assignment 4 - Convolutions
52 | Assignment 5 - Word Embeddings
53 | Assignment 6 - Recurrent Nets
54 |
55 | #### Spark Big Data Labs
56 |
57 | Lab exercises for the original Spark classes on edX.
58 |
59 | Lab 0 - Learning Apache Spark
60 | Lab 1 - Building A Word Count Application
61 | Lab 2 - Web Server Log Analysis
62 | Lab 3 - Text Analysis & Entity Resolution
63 | Lab 4 - Introduction To Machine Learning
64 | ML Lab 3 - Linear Regression
65 | ML Lab 4 - Click-Through Rate Prediction
66 | ML Lab 5 - Principal Component Analysis
67 |
68 | #### Fast.ai Lessons
69 |
70 | Notebooks from Jeremy Howard's fast.ai class.
71 |
72 | Lesson 1 - Image Classification
73 | Lesson 2 - Multi-label Classification
74 | Lesson 3 - Structured And Time Series Data
75 | Lesson 4 - Sentiment Classification
76 | Lesson 5 - Recommendation Using Deep Learning
77 | Lesson 6 - Language Modeling With RNNs
78 | Lesson 7 - Convolutional Networks In Detail
79 |
80 | #### Deep Learning With Keras
81 |
82 | Notebooks using Keras to implement deep learning models.
83 |
84 | Part 1 - Structured And Time Series Data
85 | Part 2 - Convolutional Networks
86 | Part 3 - Recommender Systems
87 | Part 4 - Recurrent Networks
88 | Part 5 - Anomaly Detection
89 | Part 6 - Generative Adversarial Networks
90 |
91 | #### Misc
92 |
93 | Notebooks covering various interesting topics!
94 |
95 | Comparison Of Various Code Optimization Methods
96 | A Simple Time Series Analysis of the S&P 500 Index
97 | An Intro To Probabilistic Programming
98 | Language Exploration Using Vector Space Models
99 | Solving Problems With Dynamic Programming
100 | Time Series Forecasting With Prophet
101 | Markov Chains From Scratch
102 | A Sampling Of Monte Carlo Methods
103 |
--------------------------------------------------------------------------------
/data/bird_small.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/bird_small.mat
--------------------------------------------------------------------------------
/data/bird_small.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/bird_small.png
--------------------------------------------------------------------------------
/data/emailSample1.txt:
--------------------------------------------------------------------------------
1 | > Anyone knows how much it costs to host a web portal ?
2 | >
3 | Well, it depends on how many visitors you're expecting.
4 | This can be anywhere from less than 10 bucks a month to a couple of $100.
5 | You should checkout http://www.rackspace.com/ or perhaps Amazon EC2
6 | if youre running something big..
7 |
8 | To unsubscribe yourself from this mailing list, send an email to:
9 | groupname-unsubscribe@egroups.com
10 |
11 |
--------------------------------------------------------------------------------
/data/emailSample2.txt:
--------------------------------------------------------------------------------
1 | Folks,
2 |
3 | my first time posting - have a bit of Unix experience, but am new to Linux.
4 |
5 |
6 | Just got a new PC at home - Dell box with Windows XP. Added a second hard disk
7 | for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went
8 | fine except it didn't pick up my monitor.
9 |
10 | I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4
11 | Ti4200 video card, both of which are probably too new to feature in Suse's default
12 | set. I downloaded a driver from the nVidia website and installed it using RPM.
13 | Then I ran Sax2 (as was recommended in some postings I found on the net), but
14 | it still doesn't feature my video card in the available list. What next?
15 |
16 | Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice,
17 | the whole machine crashes (in Linux, not Windows) - even the on/off switch is
18 | inactive, leaving me to reach for the power cable instead.
19 |
20 | If anyone can help me in any way with these probs., I'd be really grateful -
21 | I've searched the 'net but have run out of ideas.
22 |
23 | Or should I be going for a different version of Linux such as RedHat? Opinions
24 | welcome.
25 |
26 | Thanks a lot,
27 | Peter
28 |
29 | --
30 | Irish Linux Users' Group: ilug@linux.ie
31 | http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
32 | List maintainer: listmaster@linux.ie
33 |
34 |
35 |
--------------------------------------------------------------------------------
/data/ex1data1.txt:
--------------------------------------------------------------------------------
1 | 6.1101,17.592
2 | 5.5277,9.1302
3 | 8.5186,13.662
4 | 7.0032,11.854
5 | 5.8598,6.8233
6 | 8.3829,11.886
7 | 7.4764,4.3483
8 | 8.5781,12
9 | 6.4862,6.5987
10 | 5.0546,3.8166
11 | 5.7107,3.2522
12 | 14.164,15.505
13 | 5.734,3.1551
14 | 8.4084,7.2258
15 | 5.6407,0.71618
16 | 5.3794,3.5129
17 | 6.3654,5.3048
18 | 5.1301,0.56077
19 | 6.4296,3.6518
20 | 7.0708,5.3893
21 | 6.1891,3.1386
22 | 20.27,21.767
23 | 5.4901,4.263
24 | 6.3261,5.1875
25 | 5.5649,3.0825
26 | 18.945,22.638
27 | 12.828,13.501
28 | 10.957,7.0467
29 | 13.176,14.692
30 | 22.203,24.147
31 | 5.2524,-1.22
32 | 6.5894,5.9966
33 | 9.2482,12.134
34 | 5.8918,1.8495
35 | 8.2111,6.5426
36 | 7.9334,4.5623
37 | 8.0959,4.1164
38 | 5.6063,3.3928
39 | 12.836,10.117
40 | 6.3534,5.4974
41 | 5.4069,0.55657
42 | 6.8825,3.9115
43 | 11.708,5.3854
44 | 5.7737,2.4406
45 | 7.8247,6.7318
46 | 7.0931,1.0463
47 | 5.0702,5.1337
48 | 5.8014,1.844
49 | 11.7,8.0043
50 | 5.5416,1.0179
51 | 7.5402,6.7504
52 | 5.3077,1.8396
53 | 7.4239,4.2885
54 | 7.6031,4.9981
55 | 6.3328,1.4233
56 | 6.3589,-1.4211
57 | 6.2742,2.4756
58 | 5.6397,4.6042
59 | 9.3102,3.9624
60 | 9.4536,5.4141
61 | 8.8254,5.1694
62 | 5.1793,-0.74279
63 | 21.279,17.929
64 | 14.908,12.054
65 | 18.959,17.054
66 | 7.2182,4.8852
67 | 8.2951,5.7442
68 | 10.236,7.7754
69 | 5.4994,1.0173
70 | 20.341,20.992
71 | 10.136,6.6799
72 | 7.3345,4.0259
73 | 6.0062,1.2784
74 | 7.2259,3.3411
75 | 5.0269,-2.6807
76 | 6.5479,0.29678
77 | 7.5386,3.8845
78 | 5.0365,5.7014
79 | 10.274,6.7526
80 | 5.1077,2.0576
81 | 5.7292,0.47953
82 | 5.1884,0.20421
83 | 6.3557,0.67861
84 | 9.7687,7.5435
85 | 6.5159,5.3436
86 | 8.5172,4.2415
87 | 9.1802,6.7981
88 | 6.002,0.92695
89 | 5.5204,0.152
90 | 5.0594,2.8214
91 | 5.7077,1.8451
92 | 7.6366,4.2959
93 | 5.8707,7.2029
94 | 5.3054,1.9869
95 | 8.2934,0.14454
96 | 13.394,9.0551
97 | 5.4369,0.61705
98 |
--------------------------------------------------------------------------------
/data/ex1data2.txt:
--------------------------------------------------------------------------------
1 | 2104,3,399900
2 | 1600,3,329900
3 | 2400,3,369000
4 | 1416,2,232000
5 | 3000,4,539900
6 | 1985,4,299900
7 | 1534,3,314900
8 | 1427,3,198999
9 | 1380,3,212000
10 | 1494,3,242500
11 | 1940,4,239999
12 | 2000,3,347000
13 | 1890,3,329999
14 | 4478,5,699900
15 | 1268,3,259900
16 | 2300,4,449900
17 | 1320,2,299900
18 | 1236,3,199900
19 | 2609,4,499998
20 | 3031,4,599000
21 | 1767,3,252900
22 | 1888,2,255000
23 | 1604,3,242900
24 | 1962,4,259900
25 | 3890,3,573900
26 | 1100,3,249900
27 | 1458,3,464500
28 | 2526,3,469000
29 | 2200,3,475000
30 | 2637,3,299900
31 | 1839,2,349900
32 | 1000,1,169900
33 | 2040,4,314900
34 | 3137,3,579900
35 | 1811,4,285900
36 | 1437,3,249900
37 | 1239,3,229900
38 | 2132,4,345000
39 | 4215,4,549000
40 | 2162,4,287000
41 | 1664,2,368500
42 | 2238,3,329900
43 | 2567,4,314000
44 | 1200,3,299000
45 | 852,2,179900
46 | 1852,4,299900
47 | 1203,3,239500
48 |
--------------------------------------------------------------------------------
/data/ex2data1.txt:
--------------------------------------------------------------------------------
1 | 34.62365962451697,78.0246928153624,0
2 | 30.28671076822607,43.89499752400101,0
3 | 35.84740876993872,72.90219802708364,0
4 | 60.18259938620976,86.30855209546826,1
5 | 79.0327360507101,75.3443764369103,1
6 | 45.08327747668339,56.3163717815305,0
7 | 61.10666453684766,96.51142588489624,1
8 | 75.02474556738889,46.55401354116538,1
9 | 76.09878670226257,87.42056971926803,1
10 | 84.43281996120035,43.53339331072109,1
11 | 95.86155507093572,38.22527805795094,0
12 | 75.01365838958247,30.60326323428011,0
13 | 82.30705337399482,76.48196330235604,1
14 | 69.36458875970939,97.71869196188608,1
15 | 39.53833914367223,76.03681085115882,0
16 | 53.9710521485623,89.20735013750205,1
17 | 69.07014406283025,52.74046973016765,1
18 | 67.94685547711617,46.67857410673128,0
19 | 70.66150955499435,92.92713789364831,1
20 | 76.97878372747498,47.57596364975532,1
21 | 67.37202754570876,42.83843832029179,0
22 | 89.67677575072079,65.79936592745237,1
23 | 50.534788289883,48.85581152764205,0
24 | 34.21206097786789,44.20952859866288,0
25 | 77.9240914545704,68.9723599933059,1
26 | 62.27101367004632,69.95445795447587,1
27 | 80.1901807509566,44.82162893218353,1
28 | 93.114388797442,38.80067033713209,0
29 | 61.83020602312595,50.25610789244621,0
30 | 38.78580379679423,64.99568095539578,0
31 | 61.379289447425,72.80788731317097,1
32 | 85.40451939411645,57.05198397627122,1
33 | 52.10797973193984,63.12762376881715,0
34 | 52.04540476831827,69.43286012045222,1
35 | 40.23689373545111,71.16774802184875,0
36 | 54.63510555424817,52.21388588061123,0
37 | 33.91550010906887,98.86943574220611,0
38 | 64.17698887494485,80.90806058670817,1
39 | 74.78925295941542,41.57341522824434,0
40 | 34.1836400264419,75.2377203360134,0
41 | 83.90239366249155,56.30804621605327,1
42 | 51.54772026906181,46.85629026349976,0
43 | 94.44336776917852,65.56892160559052,1
44 | 82.36875375713919,40.61825515970618,0
45 | 51.04775177128865,45.82270145776001,0
46 | 62.22267576120188,52.06099194836679,0
47 | 77.19303492601364,70.45820000180959,1
48 | 97.77159928000232,86.7278223300282,1
49 | 62.07306379667647,96.76882412413983,1
50 | 91.56497449807442,88.69629254546599,1
51 | 79.94481794066932,74.16311935043758,1
52 | 99.2725269292572,60.99903099844988,1
53 | 90.54671411399852,43.39060180650027,1
54 | 34.52451385320009,60.39634245837173,0
55 | 50.2864961189907,49.80453881323059,0
56 | 49.58667721632031,59.80895099453265,0
57 | 97.64563396007767,68.86157272420604,1
58 | 32.57720016809309,95.59854761387875,0
59 | 74.24869136721598,69.82457122657193,1
60 | 71.79646205863379,78.45356224515052,1
61 | 75.3956114656803,85.75993667331619,1
62 | 35.28611281526193,47.02051394723416,0
63 | 56.25381749711624,39.26147251058019,0
64 | 30.05882244669796,49.59297386723685,0
65 | 44.66826172480893,66.45008614558913,0
66 | 66.56089447242954,41.09209807936973,0
67 | 40.45755098375164,97.53518548909936,1
68 | 49.07256321908844,51.88321182073966,0
69 | 80.27957401466998,92.11606081344084,1
70 | 66.74671856944039,60.99139402740988,1
71 | 32.72283304060323,43.30717306430063,0
72 | 64.0393204150601,78.03168802018232,1
73 | 72.34649422579923,96.22759296761404,1
74 | 60.45788573918959,73.09499809758037,1
75 | 58.84095621726802,75.85844831279042,1
76 | 99.82785779692128,72.36925193383885,1
77 | 47.26426910848174,88.47586499559782,1
78 | 50.45815980285988,75.80985952982456,1
79 | 60.45555629271532,42.50840943572217,0
80 | 82.22666157785568,42.71987853716458,0
81 | 88.9138964166533,69.80378889835472,1
82 | 94.83450672430196,45.69430680250754,1
83 | 67.31925746917527,66.58935317747915,1
84 | 57.23870631569862,59.51428198012956,1
85 | 80.36675600171273,90.96014789746954,1
86 | 68.46852178591112,85.59430710452014,1
87 | 42.0754545384731,78.84478600148043,0
88 | 75.47770200533905,90.42453899753964,1
89 | 78.63542434898018,96.64742716885644,1
90 | 52.34800398794107,60.76950525602592,0
91 | 94.09433112516793,77.15910509073893,1
92 | 90.44855097096364,87.50879176484702,1
93 | 55.48216114069585,35.57070347228866,0
94 | 74.49269241843041,84.84513684930135,1
95 | 89.84580670720979,45.35828361091658,1
96 | 83.48916274498238,48.38028579728175,1
97 | 42.2617008099817,87.10385094025457,1
98 | 99.31500880510394,68.77540947206617,1
99 | 55.34001756003703,64.9319380069486,1
100 | 74.77589300092767,89.52981289513276,1
101 |
--------------------------------------------------------------------------------
/data/ex2data2.txt:
--------------------------------------------------------------------------------
1 | 0.051267,0.69956,1
2 | -0.092742,0.68494,1
3 | -0.21371,0.69225,1
4 | -0.375,0.50219,1
5 | -0.51325,0.46564,1
6 | -0.52477,0.2098,1
7 | -0.39804,0.034357,1
8 | -0.30588,-0.19225,1
9 | 0.016705,-0.40424,1
10 | 0.13191,-0.51389,1
11 | 0.38537,-0.56506,1
12 | 0.52938,-0.5212,1
13 | 0.63882,-0.24342,1
14 | 0.73675,-0.18494,1
15 | 0.54666,0.48757,1
16 | 0.322,0.5826,1
17 | 0.16647,0.53874,1
18 | -0.046659,0.81652,1
19 | -0.17339,0.69956,1
20 | -0.47869,0.63377,1
21 | -0.60541,0.59722,1
22 | -0.62846,0.33406,1
23 | -0.59389,0.005117,1
24 | -0.42108,-0.27266,1
25 | -0.11578,-0.39693,1
26 | 0.20104,-0.60161,1
27 | 0.46601,-0.53582,1
28 | 0.67339,-0.53582,1
29 | -0.13882,0.54605,1
30 | -0.29435,0.77997,1
31 | -0.26555,0.96272,1
32 | -0.16187,0.8019,1
33 | -0.17339,0.64839,1
34 | -0.28283,0.47295,1
35 | -0.36348,0.31213,1
36 | -0.30012,0.027047,1
37 | -0.23675,-0.21418,1
38 | -0.06394,-0.18494,1
39 | 0.062788,-0.16301,1
40 | 0.22984,-0.41155,1
41 | 0.2932,-0.2288,1
42 | 0.48329,-0.18494,1
43 | 0.64459,-0.14108,1
44 | 0.46025,0.012427,1
45 | 0.6273,0.15863,1
46 | 0.57546,0.26827,1
47 | 0.72523,0.44371,1
48 | 0.22408,0.52412,1
49 | 0.44297,0.67032,1
50 | 0.322,0.69225,1
51 | 0.13767,0.57529,1
52 | -0.0063364,0.39985,1
53 | -0.092742,0.55336,1
54 | -0.20795,0.35599,1
55 | -0.20795,0.17325,1
56 | -0.43836,0.21711,1
57 | -0.21947,-0.016813,1
58 | -0.13882,-0.27266,1
59 | 0.18376,0.93348,0
60 | 0.22408,0.77997,0
61 | 0.29896,0.61915,0
62 | 0.50634,0.75804,0
63 | 0.61578,0.7288,0
64 | 0.60426,0.59722,0
65 | 0.76555,0.50219,0
66 | 0.92684,0.3633,0
67 | 0.82316,0.27558,0
68 | 0.96141,0.085526,0
69 | 0.93836,0.012427,0
70 | 0.86348,-0.082602,0
71 | 0.89804,-0.20687,0
72 | 0.85196,-0.36769,0
73 | 0.82892,-0.5212,0
74 | 0.79435,-0.55775,0
75 | 0.59274,-0.7405,0
76 | 0.51786,-0.5943,0
77 | 0.46601,-0.41886,0
78 | 0.35081,-0.57968,0
79 | 0.28744,-0.76974,0
80 | 0.085829,-0.75512,0
81 | 0.14919,-0.57968,0
82 | -0.13306,-0.4481,0
83 | -0.40956,-0.41155,0
84 | -0.39228,-0.25804,0
85 | -0.74366,-0.25804,0
86 | -0.69758,0.041667,0
87 | -0.75518,0.2902,0
88 | -0.69758,0.68494,0
89 | -0.4038,0.70687,0
90 | -0.38076,0.91886,0
91 | -0.50749,0.90424,0
92 | -0.54781,0.70687,0
93 | 0.10311,0.77997,0
94 | 0.057028,0.91886,0
95 | -0.10426,0.99196,0
96 | -0.081221,1.1089,0
97 | 0.28744,1.087,0
98 | 0.39689,0.82383,0
99 | 0.63882,0.88962,0
100 | 0.82316,0.66301,0
101 | 0.67339,0.64108,0
102 | 1.0709,0.10015,0
103 | -0.046659,-0.57968,0
104 | -0.23675,-0.63816,0
105 | -0.15035,-0.36769,0
106 | -0.49021,-0.3019,0
107 | -0.46717,-0.13377,0
108 | -0.28859,-0.060673,0
109 | -0.61118,-0.067982,0
110 | -0.66302,-0.21418,0
111 | -0.59965,-0.41886,0
112 | -0.72638,-0.082602,0
113 | -0.83007,0.31213,0
114 | -0.72062,0.53874,0
115 | -0.59389,0.49488,0
116 | -0.48445,0.99927,0
117 | -0.0063364,0.99927,0
118 | 0.63265,-0.030612,0
119 |
--------------------------------------------------------------------------------
/data/ex3data1.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex3data1.mat
--------------------------------------------------------------------------------
/data/ex3weights.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex3weights.mat
--------------------------------------------------------------------------------
/data/ex4data1.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex4data1.mat
--------------------------------------------------------------------------------
/data/ex4weights.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex4weights.mat
--------------------------------------------------------------------------------
/data/ex5data1.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex5data1.mat
--------------------------------------------------------------------------------
/data/ex6data1.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex6data1.mat
--------------------------------------------------------------------------------
/data/ex6data2.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex6data2.mat
--------------------------------------------------------------------------------
/data/ex6data3.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex6data3.mat
--------------------------------------------------------------------------------
/data/ex7data1.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex7data1.mat
--------------------------------------------------------------------------------
/data/ex7data2.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex7data2.mat
--------------------------------------------------------------------------------
/data/ex7faces.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex7faces.mat
--------------------------------------------------------------------------------
/data/ex8_movieParams.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex8_movieParams.mat
--------------------------------------------------------------------------------
/data/ex8_movies.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex8_movies.mat
--------------------------------------------------------------------------------
/data/ex8data1.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex8data1.mat
--------------------------------------------------------------------------------
/data/ex8data2.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/ex8data2.mat
--------------------------------------------------------------------------------
/data/movie_ids.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/movie_ids.txt
--------------------------------------------------------------------------------
/data/spamSample1.txt:
--------------------------------------------------------------------------------
1 | Do You Want To Make $1000 Or More Per Week?
2 |
3 |
4 |
5 | If you are a motivated and qualified individual - I
6 | will personally demonstrate to you a system that will
7 | make you $1,000 per week or more! This is NOT mlm.
8 |
9 |
10 |
11 | Call our 24 hour pre-recorded number to get the
12 | details.
13 |
14 |
15 |
16 | 000-456-789
17 |
18 |
19 |
20 | I need people who want to make serious money. Make
21 | the call and get the facts.
22 |
23 | Invest 2 minutes in yourself now!
24 |
25 |
26 |
27 | 000-456-789
28 |
29 |
30 |
31 | Looking forward to your call and I will introduce you
32 | to people like yourself who
33 | are currently making $10,000 plus per week!
34 |
35 |
36 |
37 | 000-456-789
38 |
39 |
40 |
41 | 3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72
42 |
43 |
--------------------------------------------------------------------------------
/data/spamSample2.txt:
--------------------------------------------------------------------------------
1 | Best Buy Viagra Generic Online
2 |
3 | Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed!
4 |
5 | We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers!
6 | http://medphysitcstech.ru
7 |
8 |
9 |
--------------------------------------------------------------------------------
/data/spamTest.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/spamTest.mat
--------------------------------------------------------------------------------
/data/spamTrain.mat:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/data/spamTrain.mat
--------------------------------------------------------------------------------
/exercises/ML/ex1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex1.pdf
--------------------------------------------------------------------------------
/exercises/ML/ex2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex2.pdf
--------------------------------------------------------------------------------
/exercises/ML/ex3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex3.pdf
--------------------------------------------------------------------------------
/exercises/ML/ex4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex4.pdf
--------------------------------------------------------------------------------
/exercises/ML/ex5.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex5.pdf
--------------------------------------------------------------------------------
/exercises/ML/ex6.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex6.pdf
--------------------------------------------------------------------------------
/exercises/ML/ex7.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex7.pdf
--------------------------------------------------------------------------------
/exercises/ML/ex8.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/exercises/ML/ex8.pdf
--------------------------------------------------------------------------------
/notebooks/libraries/DEAP.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# DEAP"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent. It works in perfect harmony with parallelisation mechanism such as multiprocessing and SCOOP. The following documentation presents the key concepts and many features to build your own evolutions.\n",
15 | "\n",
16 | "Library documentation: http://deap.readthedocs.org/en/master/"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "## One Max Problem (GA)"
24 | ]
25 | },
26 | {
27 | "cell_type": "markdown",
28 | "metadata": {},
29 | "source": [
30 | "This problem is very simple, we search for a 1 filled list individual. This problem is widely used in the evolutionary computation community since it is very simple and it illustrates well the potential of evolutionary algorithms."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 1,
36 | "metadata": {
37 | "collapsed": true
38 | },
39 | "outputs": [],
40 | "source": [
41 | "import random\n",
42 | "\n",
43 | "from deap import base\n",
44 | "from deap import creator\n",
45 | "from deap import tools"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 2,
51 | "metadata": {
52 | "collapsed": true
53 | },
54 | "outputs": [],
55 | "source": [
56 | "# creator is a class factory that can build new classes at run-time\n",
57 | "creator.create(\"FitnessMax\", base.Fitness, weights=(1.0,))\n",
58 | "creator.create(\"Individual\", list, fitness=creator.FitnessMax)"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 3,
64 | "metadata": {
65 | "collapsed": true
66 | },
67 | "outputs": [],
68 | "source": [
69 | "# a toolbox stores functions and their arguments\n",
70 | "toolbox = base.Toolbox()\n",
71 | "\n",
72 | "# attribute generator\n",
73 | "toolbox.register(\"attr_bool\", random.randint, 0, 1)\n",
74 | "\n",
75 | "# structure initializers\n",
76 | "toolbox.register(\"individual\", tools.initRepeat, creator.Individual, toolbox.attr_bool, 100)\n",
77 | "toolbox.register(\"population\", tools.initRepeat, list, toolbox.individual)"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 4,
83 | "metadata": {
84 | "collapsed": true
85 | },
86 | "outputs": [],
87 | "source": [
88 | "# evaluation function\n",
89 | "def evalOneMax(individual):\n",
90 | " return sum(individual),"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 5,
96 | "metadata": {
97 | "collapsed": true
98 | },
99 | "outputs": [],
100 | "source": [
101 | "# register the required genetic operators\n",
102 | "toolbox.register(\"evaluate\", evalOneMax)\n",
103 | "toolbox.register(\"mate\", tools.cxTwoPoint)\n",
104 | "toolbox.register(\"mutate\", tools.mutFlipBit, indpb=0.05)\n",
105 | "toolbox.register(\"select\", tools.selTournament, tournsize=3)"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 6,
111 | "metadata": {
112 | "collapsed": false
113 | },
114 | "outputs": [
115 | {
116 | "name": "stdout",
117 | "output_type": "stream",
118 | "text": [
119 | " Evaluated 300 individuals\n"
120 | ]
121 | }
122 | ],
123 | "source": [
124 | "random.seed(64)\n",
125 | "\n",
126 | "# instantiate a population\n",
127 | "pop = toolbox.population(n=300)\n",
128 | "CXPB, MUTPB, NGEN = 0.5, 0.2, 40\n",
129 | "\n",
130 | "# evaluate the entire population\n",
131 | "fitnesses = list(map(toolbox.evaluate, pop))\n",
132 | "for ind, fit in zip(pop, fitnesses):\n",
133 | " ind.fitness.values = fit\n",
134 | "\n",
135 | "print(\" Evaluated %i individuals\" % len(pop))"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 7,
141 | "metadata": {
142 | "collapsed": false
143 | },
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "-- Generation 0 --\n",
150 | " Evaluated 189 individuals\n",
151 | " Min 40.0\n",
152 | " Max 65.0\n",
153 | " Avg 54.7433333333\n",
154 | " Std 4.46289766358\n",
155 | "-- Generation 1 --\n",
156 | " Evaluated 171 individuals\n",
157 | " Min 44.0\n",
158 | " Max 70.0\n",
159 | " Avg 58.48\n",
160 | " Std 3.98533980149\n",
161 | "-- Generation 2 --\n",
162 | " Evaluated 169 individuals\n",
163 | " Min 54.0\n",
164 | " Max 68.0\n",
165 | " Avg 61.6066666667\n",
166 | " Std 2.92779021714\n",
167 | "-- Generation 3 --\n",
168 | " Evaluated 185 individuals\n",
169 | " Min 57.0\n",
170 | " Max 73.0\n",
171 | " Avg 63.82\n",
172 | " Std 2.74364720764\n",
173 | "-- Generation 4 --\n",
174 | " Evaluated 175 individuals\n",
175 | " Min 54.0\n",
176 | " Max 73.0\n",
177 | " Avg 65.67\n",
178 | " Std 2.57961883489\n",
179 | "-- Generation 5 --\n",
180 | " Evaluated 164 individuals\n",
181 | " Min 60.0\n",
182 | " Max 76.0\n",
183 | " Avg 67.5466666667\n",
184 | " Std 2.57833710407\n",
185 | "-- Generation 6 --\n",
186 | " Evaluated 185 individuals\n",
187 | " Min 63.0\n",
188 | " Max 77.0\n",
189 | " Avg 69.0666666667\n",
190 | " Std 2.50510589707\n",
191 | "-- Generation 7 --\n",
192 | " Evaluated 194 individuals\n",
193 | " Min 62.0\n",
194 | " Max 78.0\n",
195 | " Avg 70.78\n",
196 | " Std 2.39963886172\n",
197 | "-- Generation 8 --\n",
198 | " Evaluated 199 individuals\n",
199 | " Min 63.0\n",
200 | " Max 79.0\n",
201 | " Avg 72.3133333333\n",
202 | " Std 2.57717330077\n",
203 | "-- Generation 9 --\n",
204 | " Evaluated 169 individuals\n",
205 | " Min 67.0\n",
206 | " Max 81.0\n",
207 | " Avg 74.0\n",
208 | " Std 2.62551582234\n",
209 | "-- Generation 10 --\n",
210 | " Evaluated 180 individuals\n",
211 | " Min 67.0\n",
212 | " Max 83.0\n",
213 | " Avg 75.9166666667\n",
214 | " Std 2.52910831893\n",
215 | "-- Generation 11 --\n",
216 | " Evaluated 193 individuals\n",
217 | " Min 67.0\n",
218 | " Max 84.0\n",
219 | " Avg 77.5966666667\n",
220 | " Std 2.40291258453\n",
221 | "-- Generation 12 --\n",
222 | " Evaluated 177 individuals\n",
223 | " Min 72.0\n",
224 | " Max 85.0\n",
225 | " Avg 78.97\n",
226 | " Std 2.29690371297\n",
227 | "-- Generation 13 --\n",
228 | " Evaluated 195 individuals\n",
229 | " Min 70.0\n",
230 | " Max 86.0\n",
231 | " Avg 80.13\n",
232 | " Std 2.35650164439\n",
233 | "-- Generation 14 --\n",
234 | " Evaluated 175 individuals\n",
235 | " Min 74.0\n",
236 | " Max 86.0\n",
237 | " Avg 81.3966666667\n",
238 | " Std 2.03780655499\n",
239 | "-- Generation 15 --\n",
240 | " Evaluated 181 individuals\n",
241 | " Min 74.0\n",
242 | " Max 87.0\n",
243 | " Avg 82.33\n",
244 | " Std 2.18504767301\n",
245 | "-- Generation 16 --\n",
246 | " Evaluated 198 individuals\n",
247 | " Min 74.0\n",
248 | " Max 88.0\n",
249 | " Avg 83.4033333333\n",
250 | " Std 2.22575580172\n",
251 | "-- Generation 17 --\n",
252 | " Evaluated 190 individuals\n",
253 | " Min 72.0\n",
254 | " Max 88.0\n",
255 | " Avg 84.14\n",
256 | " Std 2.34955314901\n",
257 | "-- Generation 18 --\n",
258 | " Evaluated 170 individuals\n",
259 | " Min 76.0\n",
260 | " Max 89.0\n",
261 | " Avg 85.1\n",
262 | " Std 2.20529665427\n",
263 | "-- Generation 19 --\n",
264 | " Evaluated 189 individuals\n",
265 | " Min 75.0\n",
266 | " Max 90.0\n",
267 | " Avg 85.77\n",
268 | " Std 2.1564863397\n",
269 | "-- Generation 20 --\n",
270 | " Evaluated 188 individuals\n",
271 | " Min 77.0\n",
272 | " Max 91.0\n",
273 | " Avg 86.4833333333\n",
274 | " Std 2.2589943682\n",
275 | "-- Generation 21 --\n",
276 | " Evaluated 180 individuals\n",
277 | " Min 80.0\n",
278 | " Max 91.0\n",
279 | " Avg 87.24\n",
280 | " Std 2.0613264338\n",
281 | "-- Generation 22 --\n",
282 | " Evaluated 179 individuals\n",
283 | " Min 80.0\n",
284 | " Max 92.0\n",
285 | " Avg 87.95\n",
286 | " Std 1.95298916194\n",
287 | "-- Generation 23 --\n",
288 | " Evaluated 196 individuals\n",
289 | " Min 79.0\n",
290 | " Max 93.0\n",
291 | " Avg 88.42\n",
292 | " Std 2.2249194742\n",
293 | "-- Generation 24 --\n",
294 | " Evaluated 168 individuals\n",
295 | " Min 82.0\n",
296 | " Max 93.0\n",
297 | " Avg 89.2833333333\n",
298 | " Std 1.89289607627\n",
299 | "-- Generation 25 --\n",
300 | " Evaluated 186 individuals\n",
301 | " Min 78.0\n",
302 | " Max 94.0\n",
303 | " Avg 89.7666666667\n",
304 | " Std 2.26102238428\n",
305 | "-- Generation 26 --\n",
306 | " Evaluated 182 individuals\n",
307 | " Min 82.0\n",
308 | " Max 94.0\n",
309 | " Avg 90.4633333333\n",
310 | " Std 2.21404356075\n",
311 | "-- Generation 27 --\n",
312 | " Evaluated 179 individuals\n",
313 | " Min 81.0\n",
314 | " Max 95.0\n",
315 | " Avg 90.8733333333\n",
316 | " Std 2.41328729238\n",
317 | "-- Generation 28 --\n",
318 | " Evaluated 183 individuals\n",
319 | " Min 83.0\n",
320 | " Max 95.0\n",
321 | " Avg 91.7166666667\n",
322 | " Std 2.18701978856\n",
323 | "-- Generation 29 --\n",
324 | " Evaluated 167 individuals\n",
325 | " Min 83.0\n",
326 | " Max 98.0\n",
327 | " Avg 92.3466666667\n",
328 | " Std 2.21656390739\n",
329 | "-- Generation 30 --\n",
330 | " Evaluated 170 individuals\n",
331 | " Min 84.0\n",
332 | " Max 98.0\n",
333 | " Avg 92.9533333333\n",
334 | " Std 2.09868742048\n",
335 | "-- Generation 31 --\n",
336 | " Evaluated 172 individuals\n",
337 | " Min 83.0\n",
338 | " Max 97.0\n",
339 | " Avg 93.5266666667\n",
340 | " Std 2.28238666507\n",
341 | "-- Generation 32 --\n",
342 | " Evaluated 196 individuals\n",
343 | " Min 86.0\n",
344 | " Max 98.0\n",
345 | " Avg 94.28\n",
346 | " Std 2.16985406575\n",
347 | "-- Generation 33 --\n",
348 | " Evaluated 176 individuals\n",
349 | " Min 85.0\n",
350 | " Max 98.0\n",
351 | " Avg 94.9133333333\n",
352 | " Std 2.22392046221\n",
353 | "-- Generation 34 --\n",
354 | " Evaluated 176 individuals\n",
355 | " Min 86.0\n",
356 | " Max 99.0\n",
357 | " Avg 95.6333333333\n",
358 | " Std 2.13359373411\n",
359 | "-- Generation 35 --\n",
360 | " Evaluated 174 individuals\n",
361 | " Min 86.0\n",
362 | " Max 99.0\n",
363 | " Avg 96.2966666667\n",
364 | " Std 2.23651266236\n",
365 | "-- Generation 36 --\n",
366 | " Evaluated 174 individuals\n",
367 | " Min 87.0\n",
368 | " Max 100.0\n",
369 | " Avg 96.5866666667\n",
370 | " Std 2.41436442062\n",
371 | "-- Generation 37 --\n",
372 | " Evaluated 195 individuals\n",
373 | " Min 84.0\n",
374 | " Max 100.0\n",
375 | " Avg 97.3666666667\n",
376 | " Std 2.16153237825\n",
377 | "-- Generation 38 --\n",
378 | " Evaluated 180 individuals\n",
379 | " Min 89.0\n",
380 | " Max 100.0\n",
381 | " Avg 97.7466666667\n",
382 | " Std 2.32719191779\n",
383 | "-- Generation 39 --\n",
384 | " Evaluated 196 individuals\n",
385 | " Min 88.0\n",
386 | " Max 100.0\n",
387 | " Avg 98.1833333333\n",
388 | " Std 2.33589145486\n"
389 | ]
390 | }
391 | ],
392 | "source": [
393 | "# begin the evolution\n",
394 | "for g in range(NGEN):\n",
395 | " print(\"-- Generation %i --\" % g)\n",
396 | "\n",
397 | " # select the next generation individuals\n",
398 | " offspring = toolbox.select(pop, len(pop))\n",
399 | "\n",
400 | " # clone the selected individuals\n",
401 | " offspring = list(map(toolbox.clone, offspring))\n",
402 | "\n",
403 | " # apply crossover and mutation on the offspring\n",
404 | " for child1, child2 in zip(offspring[::2], offspring[1::2]):\n",
405 | " if random.random() < CXPB:\n",
406 | " toolbox.mate(child1, child2)\n",
407 | " del child1.fitness.values\n",
408 | " del child2.fitness.values\n",
409 | "\n",
410 | " for mutant in offspring:\n",
411 | " if random.random() < MUTPB:\n",
412 | " toolbox.mutate(mutant)\n",
413 | " del mutant.fitness.values\n",
414 | "\n",
415 | " # evaluate the individuals with an invalid fitness\n",
416 | " invalid_ind = [ind for ind in offspring if not ind.fitness.valid]\n",
417 | " fitnesses = map(toolbox.evaluate, invalid_ind)\n",
418 | " for ind, fit in zip(invalid_ind, fitnesses):\n",
419 | " ind.fitness.values = fit\n",
420 | "\n",
421 | " print(\" Evaluated %i individuals\" % len(invalid_ind))\n",
422 | "\n",
423 | " # the population is entirely replaced by the offspring\n",
424 | " pop[:] = offspring\n",
425 | "\n",
426 | " # gather all the fitnesses in one list and print the stats\n",
427 | " fits = [ind.fitness.values[0] for ind in pop]\n",
428 | "\n",
429 | " length = len(pop)\n",
430 | " mean = sum(fits) / length\n",
431 | " sum2 = sum(x*x for x in fits)\n",
432 | " std = abs(sum2 / length - mean**2)**0.5\n",
433 | "\n",
434 | " print(\" Min %s\" % min(fits))\n",
435 | " print(\" Max %s\" % max(fits))\n",
436 | " print(\" Avg %s\" % mean)\n",
437 | " print(\" Std %s\" % std)"
438 | ]
439 | },
440 | {
441 | "cell_type": "code",
442 | "execution_count": 8,
443 | "metadata": {
444 | "collapsed": false
445 | },
446 | "outputs": [
447 | {
448 | "name": "stdout",
449 | "output_type": "stream",
450 | "text": [
451 | "Best individual is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], (100.0,)\n"
452 | ]
453 | }
454 | ],
455 | "source": [
456 | "best_ind = tools.selBest(pop, 1)[0]\n",
457 | "print(\"Best individual is %s, %s\" % (best_ind, best_ind.fitness.values))"
458 | ]
459 | },
460 | {
461 | "cell_type": "markdown",
462 | "metadata": {},
463 | "source": [
464 | "## Symbolic Regression (GP)"
465 | ]
466 | },
467 | {
468 | "cell_type": "markdown",
469 | "metadata": {},
470 | "source": [
471 | "Symbolic regression is one of the best known problems in GP. It is commonly used as a tuning problem for new algorithms, but is also widely used with real-life distributions, where other regression methods may not work.\n",
472 | "\n",
473 | "All symbolic regression problems use an arbitrary data distribution, and try to fit the most accurately the data with a symbolic formula. Usually, a measure like the RMSE (Root Mean Square Error) is used to measure an individual’s fitness.\n",
474 | "\n",
475 | "In this example, we use a classical distribution, the quartic polynomial (x^4 + x^3 + x^2 + x), a one-dimension distribution. 20 equidistant points are generated in the range [-1, 1], and are used to evaluate the fitness."
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": 9,
481 | "metadata": {
482 | "collapsed": true
483 | },
484 | "outputs": [],
485 | "source": [
486 | "import operator\n",
487 | "import math\n",
488 | "import random\n",
489 | "\n",
490 | "import numpy\n",
491 | "\n",
492 | "from deap import algorithms\n",
493 | "from deap import base\n",
494 | "from deap import creator\n",
495 | "from deap import tools\n",
496 | "from deap import gp\n",
497 | "\n",
498 | "# define a new function for divison that guards against divide by 0\n",
499 | "def protectedDiv(left, right):\n",
500 | " try:\n",
501 | " return left / right\n",
502 | " except ZeroDivisionError:\n",
503 | " return 1"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": 10,
509 | "metadata": {
510 | "collapsed": true
511 | },
512 | "outputs": [],
513 | "source": [
514 | "# add aritmetic primitives\n",
515 | "pset = gp.PrimitiveSet(\"MAIN\", 1)\n",
516 | "pset.addPrimitive(operator.add, 2)\n",
517 | "pset.addPrimitive(operator.sub, 2)\n",
518 | "pset.addPrimitive(operator.mul, 2)\n",
519 | "pset.addPrimitive(protectedDiv, 2)\n",
520 | "pset.addPrimitive(operator.neg, 1)\n",
521 | "pset.addPrimitive(math.cos, 1)\n",
522 | "pset.addPrimitive(math.sin, 1)\n",
523 | "\n",
524 | "# constant terminal\n",
525 | "pset.addEphemeralConstant(\"rand101\", lambda: random.randint(-1,1))\n",
526 | "\n",
527 | "# define number of inputs\n",
528 | "pset.renameArguments(ARG0='x')"
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": 11,
534 | "metadata": {
535 | "collapsed": true
536 | },
537 | "outputs": [],
538 | "source": [
539 | "# create fitness and individual objects\n",
540 | "creator.create(\"FitnessMin\", base.Fitness, weights=(-1.0,))\n",
541 | "creator.create(\"Individual\", gp.PrimitiveTree, fitness=creator.FitnessMin)"
542 | ]
543 | },
544 | {
545 | "cell_type": "code",
546 | "execution_count": 12,
547 | "metadata": {
548 | "collapsed": true
549 | },
550 | "outputs": [],
551 | "source": [
552 | "# register evolution process parameters through the toolbox\n",
553 | "toolbox = base.Toolbox()\n",
554 | "toolbox.register(\"expr\", gp.genHalfAndHalf, pset=pset, min_=1, max_=2)\n",
555 | "toolbox.register(\"individual\", tools.initIterate, creator.Individual, toolbox.expr)\n",
556 | "toolbox.register(\"population\", tools.initRepeat, list, toolbox.individual)\n",
557 | "toolbox.register(\"compile\", gp.compile, pset=pset)\n",
558 | "\n",
559 | "# evaluation function\n",
560 | "def evalSymbReg(individual, points):\n",
561 | " # transform the tree expression in a callable function\n",
562 | " func = toolbox.compile(expr=individual)\n",
563 | " # evaluate the mean squared error between the expression\n",
564 | " # and the real function : x**4 + x**3 + x**2 + x\n",
565 | " sqerrors = ((func(x) - x**4 - x**3 - x**2 - x)**2 for x in points)\n",
566 | " return math.fsum(sqerrors) / len(points),\n",
567 | "\n",
568 | "toolbox.register(\"evaluate\", evalSymbReg, points=[x/10. for x in range(-10,10)])\n",
569 | "toolbox.register(\"select\", tools.selTournament, tournsize=3)\n",
570 | "toolbox.register(\"mate\", gp.cxOnePoint)\n",
571 | "toolbox.register(\"expr_mut\", gp.genFull, min_=0, max_=2)\n",
572 | "toolbox.register(\"mutate\", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)\n",
573 | "\n",
574 | "# prevent functions from getting too deep/complex\n",
575 | "toolbox.decorate(\"mate\", gp.staticLimit(key=operator.attrgetter(\"height\"), max_value=17))\n",
576 | "toolbox.decorate(\"mutate\", gp.staticLimit(key=operator.attrgetter(\"height\"), max_value=17))"
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": 13,
582 | "metadata": {
583 | "collapsed": true
584 | },
585 | "outputs": [],
586 | "source": [
587 | "# compute some statistics about the population\n",
588 | "stats_fit = tools.Statistics(lambda ind: ind.fitness.values)\n",
589 | "stats_size = tools.Statistics(len)\n",
590 | "mstats = tools.MultiStatistics(fitness=stats_fit, size=stats_size)\n",
591 | "mstats.register(\"avg\", numpy.mean)\n",
592 | "mstats.register(\"std\", numpy.std)\n",
593 | "mstats.register(\"min\", numpy.min)\n",
594 | "mstats.register(\"max\", numpy.max)"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 14,
600 | "metadata": {
601 | "collapsed": false
602 | },
603 | "outputs": [
604 | {
605 | "name": "stdout",
606 | "output_type": "stream",
607 | "text": [
608 | " \t \t fitness \t size \n",
609 | " \t \t---------------------------------------\t-------------------------------\n",
610 | "gen\tnevals\tavg \tmax \tmin \tstd \tavg \tmax\tmin\tstd \n",
611 | "0 \t300 \t2.39949\t59.2593\t0.165572\t4.64122\t3.69667\t7 \t2 \t1.61389\n",
612 | "1 \t146 \t1.0971 \t10.1 \t0.165572\t0.845978\t3.80667\t13 \t1 \t1.78586\n",
613 | "2 \t169 \t0.902365\t6.5179 \t0.165572\t0.72362 \t4.16 \t13 \t1 \t2.0366 \n",
614 | "3 \t167 \t0.852725\t9.6327 \t0.165572\t0.869381\t4.63667\t13 \t1 \t2.20408\n",
615 | "4 \t158 \t0.74829 \t14.1573\t0.165572\t1.01281 \t4.88333\t13 \t1 \t2.14392\n",
616 | "5 \t160 \t0.630299\t7.90605\t0.165572\t0.904373\t5.52333\t14 \t1 \t2.09351\n",
617 | "6 \t181 \t0.495118\t4.09456\t0.165572\t0.524658\t6.08333\t13 \t1 \t1.99409\n",
618 | "7 \t170 \t0.403873\t2.6434 \t0.165572\t0.440596\t6.34667\t14 \t1 \t1.84386\n",
619 | "8 \t173 \t0.393405\t2.9829 \t0.165572\t0.425415\t6.37 \t12 \t1 \t1.78132\n",
620 | "9 \t168 \t0.414299\t13.5996\t0.165572\t0.841226\t6.25333\t11 \t2 \t1.76328\n",
621 | "10 \t142 \t0.384179\t4.07808\t0.165572\t0.477269\t6.25667\t13 \t1 \t1.78067\n",
622 | "11 \t156 \t0.459639\t19.8316\t0.165572\t1.47254 \t6.35333\t15 \t1 \t2.04983\n",
623 | "12 \t167 \t0.384348\t6.79674\t0.165572\t0.495807\t6.25 \t13 \t1 \t1.92029\n",
624 | "13 \t157 \t0.42446 \t11.0636\t0.165572\t0.818953\t6.43667\t15 \t1 \t2.11959\n",
625 | "14 \t175 \t0.342257\t2.552 \t0.165572\t0.325872\t6.23333\t15 \t1 \t2.14295\n",
626 | "15 \t154 \t0.442374\t13.8349\t0.165572\t0.950612\t6.05667\t14 \t1 \t1.90266\n",
627 | "16 \t181 \t0.455697\t19.7228\t0.101561\t1.39528 \t6.08667\t13 \t1 \t1.84006\n",
628 | "17 \t178 \t0.36256 \t2.54124\t0.101561\t0.340555\t6.24 \t15 \t1 \t2.20055\n",
629 | "18 \t171 \t0.411532\t14.2339\t0.101561\t0.897785\t6.44 \t15 \t1 \t2.2715 \n",
630 | "19 \t156 \t0.43193 \t15.5923\t0.101561\t0.9949 \t6.66667\t15 \t1 \t2.40185\n",
631 | "20 \t169 \t0.398163\t4.09456\t0.0976781\t0.450231\t6.96667\t15 \t1 \t2.62022\n",
632 | "21 \t162 \t0.385774\t4.09456\t0.0976781\t0.421867\t7.13 \t14 \t1 \t2.65577\n",
633 | "22 \t162 \t0.35318 \t2.55465\t0.0253803\t0.389453\t7.66667\t19 \t2 \t3.04995\n",
634 | "23 \t164 \t0.3471 \t3.66792\t0.0253803\t0.482334\t8.24 \t21 \t1 \t3.48364\n",
635 | "24 \t159 \t1.46248 \t331.247\t0.0253803\t19.0841 \t9.42667\t19 \t3 \t3.238 \n",
636 | "25 \t164 \t0.382697\t6.6452 \t0.0173316\t0.652247\t10.1867\t25 \t1 \t3.46292\n",
637 | "26 \t139 \t0.367651\t11.9045\t0.0173316\t0.855067\t10.67 \t19 \t3 \t3.32582\n",
638 | "27 \t167 \t0.345866\t6.6452 \t0.0173316\t0.586155\t11.4 \t27 \t3 \t3.44384\n",
639 | "28 \t183 \t0.388404\t4.53076\t0.0173316\t0.58986 \t11.5767\t24 \t3 \t3.4483 \n",
640 | "29 \t163 \t0.356009\t6.33264\t0.0173316\t0.563266\t12.2433\t29 \t2 \t4.23211\n",
641 | "30 \t174 \t0.31506 \t2.54124\t0.0173316\t0.412507\t12.92 \t27 \t3 \t4.5041 \n",
642 | "31 \t206 \t0.361197\t2.9829 \t0.0173316\t0.486155\t13.9333\t33 \t1 \t5.6747 \n",
643 | "32 \t168 \t0.302704\t4.01244\t0.0173316\t0.502277\t15.04 \t31 \t3 \t5.40849\n",
644 | "33 \t160 \t0.246509\t3.30873\t0.012947 \t0.433212\t16.3967\t34 \t2 \t5.66092\n",
645 | "34 \t158 \t0.344791\t26.1966\t0.012947 \t1.57277 \t17.39 \t43 \t1 \t6.13008\n",
646 | "35 \t162 \t0.212572\t2.85856\t0.0148373\t0.363023\t17.64 \t37 \t2 \t6.04349\n",
647 | "36 \t183 \t0.240268\t5.06093\t0.0112887\t0.482794\t17.4333\t41 \t3 \t6.33184\n",
648 | "37 \t185 \t0.514635\t65.543 \t0.0103125\t3.7864 \t16.6167\t41 \t1 \t6.58456\n",
649 | "38 \t134 \t0.340433\t11.2506\t0.0103125\t0.827213\t16.2733\t34 \t1 \t6.08484\n",
650 | "39 \t158 \t0.329797\t15.8145\t4.50668e-33\t1.05693 \t16.4133\t34 \t1 \t6.09993\n",
651 | "40 \t164 \t0.306543\t14.3573\t4.50668e-33\t0.947046\t17.9033\t53 \t2 \t8.23695\n"
652 | ]
653 | }
654 | ],
655 | "source": [
656 | "random.seed(318)\n",
657 | "\n",
658 | "pop = toolbox.population(n=300)\n",
659 | "hof = tools.HallOfFame(1)\n",
660 | "\n",
661 | "# run the algorithm\n",
662 | "pop, log = algorithms.eaSimple(pop, toolbox, 0.5, 0.1, 40, stats=mstats,\n",
663 | " halloffame=hof, verbose=True)"
664 | ]
665 | }
666 | ],
667 | "metadata": {
668 | "kernelspec": {
669 | "display_name": "Python 2",
670 | "language": "python",
671 | "name": "python2"
672 | },
673 | "language_info": {
674 | "codemirror_mode": {
675 | "name": "ipython",
676 | "version": 2
677 | },
678 | "file_extension": ".py",
679 | "mimetype": "text/x-python",
680 | "name": "python",
681 | "nbconvert_exporter": "python",
682 | "pygments_lexer": "ipython2",
683 | "version": "2.7.9"
684 | }
685 | },
686 | "nbformat": 4,
687 | "nbformat_minor": 0
688 | }
689 |
--------------------------------------------------------------------------------
/notebooks/libraries/Gensim.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Gensim"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Gensim is a free Python library designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.\n",
15 | "\n",
16 | "Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover semantic structure of documents, by examining word statistical co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary – you only need a corpus of plain text documents.\n",
17 | "\n",
18 | "Once these statistical patterns are found, any plain text documents can be succinctly expressed in the new, semantic representation, and queried for topical similarity against other documents.\n",
19 | "\n",
20 | "Library documentation: https://radimrehurek.com/gensim/index.html"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 1,
26 | "metadata": {
27 | "collapsed": false
28 | },
29 | "outputs": [],
30 | "source": [
31 | "from gensim import corpora, models, similarities\n",
32 | "\n",
33 | "documents = [\"Human machine interface for lab abc computer applications\",\n",
34 | " \"A survey of user opinion of computer system response time\",\n",
35 | " \"The EPS user interface management system\",\n",
36 | " \"System and human system engineering testing of EPS\",\n",
37 | " \"Relation of user perceived response time to error measurement\",\n",
38 | " \"The generation of random binary unordered trees\",\n",
39 | " \"The intersection graph of paths in trees\",\n",
40 | " \"Graph minors IV Widths of trees and well quasi ordering\",\n",
41 | " \"Graph minors A survey\"]"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 2,
47 | "metadata": {
48 | "collapsed": true
49 | },
50 | "outputs": [],
51 | "source": [
52 | "# remove common words and tokenize\n",
53 | "stoplist = set('for a of the and to in'.split())\n",
54 | "texts = [[word for word in document.lower().split() if word not in stoplist]\n",
55 | " for document in documents]"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 3,
61 | "metadata": {
62 | "collapsed": true
63 | },
64 | "outputs": [],
65 | "source": [
66 | "# remove words that appear only once\n",
67 | "from collections import defaultdict\n",
68 | "frequency = defaultdict(int)\n",
69 | "for text in texts:\n",
70 | " for token in text:\n",
71 | " frequency[token] += 1\n",
72 | "\n",
73 | "texts = [[token for token in text if frequency[token] > 1]\n",
74 | " for text in texts]"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 4,
80 | "metadata": {
81 | "collapsed": false
82 | },
83 | "outputs": [
84 | {
85 | "name": "stdout",
86 | "output_type": "stream",
87 | "text": [
88 | "[['human', 'interface', 'computer'],\n",
89 | " ['survey', 'user', 'computer', 'system', 'response', 'time'],\n",
90 | " ['eps', 'user', 'interface', 'system'],\n",
91 | " ['system', 'human', 'system', 'eps'],\n",
92 | " ['user', 'response', 'time'],\n",
93 | " ['trees'],\n",
94 | " ['graph', 'trees'],\n",
95 | " ['graph', 'minors', 'trees'],\n",
96 | " ['graph', 'minors', 'survey']]\n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "from pprint import pprint\n",
102 | "pprint(texts)"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "metadata": {
109 | "collapsed": false
110 | },
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | "Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...)\n"
117 | ]
118 | }
119 | ],
120 | "source": [
121 | "# create a dictionary mapping between ids and unique words\n",
122 | "dictionary = corpora.Dictionary(texts)\n",
123 | "print(dictionary)"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 6,
129 | "metadata": {
130 | "collapsed": false
131 | },
132 | "outputs": [
133 | {
134 | "name": "stdout",
135 | "output_type": "stream",
136 | "text": [
137 | "{u'minors': 11, u'graph': 10, u'system': 5, u'trees': 9, u'eps': 8, u'computer': 0, u'survey': 4, u'user': 7, u'human': 1, u'time': 6, u'interface': 2, u'response': 3}\n"
138 | ]
139 | }
140 | ],
141 | "source": [
142 | "# mapping between ids and words\n",
143 | "print(dictionary.token2id)"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 7,
149 | "metadata": {
150 | "collapsed": false
151 | },
152 | "outputs": [
153 | {
154 | "name": "stdout",
155 | "output_type": "stream",
156 | "text": [
157 | "[[(0, 1), (1, 1), (2, 1)],\n",
158 | " [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],\n",
159 | " [(2, 1), (5, 1), (7, 1), (8, 1)],\n",
160 | " [(1, 1), (5, 2), (8, 1)],\n",
161 | " [(3, 1), (6, 1), (7, 1)],\n",
162 | " [(9, 1)],\n",
163 | " [(9, 1), (10, 1)],\n",
164 | " [(9, 1), (10, 1), (11, 1)],\n",
165 | " [(4, 1), (10, 1), (11, 1)]]\n"
166 | ]
167 | }
168 | ],
169 | "source": [
170 | "# convert the text to a bag-of-words corpus\n",
171 | "corpus = [dictionary.doc2bow(text) for text in texts]\n",
172 | "pprint(corpus)"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": 8,
178 | "metadata": {
179 | "collapsed": false
180 | },
181 | "outputs": [
182 | {
183 | "name": "stdout",
184 | "output_type": "stream",
185 | "text": [
186 | "[[ 1. 1. 0. 0. 0. 0. 0. 0. 0.]\n",
187 | " [ 1. 0. 0. 1. 0. 0. 0. 0. 0.]\n",
188 | " [ 1. 0. 1. 0. 0. 0. 0. 0. 0.]\n",
189 | " [ 0. 1. 0. 0. 1. 0. 0. 0. 0.]\n",
190 | " [ 0. 1. 0. 0. 0. 0. 0. 0. 1.]\n",
191 | " [ 0. 1. 1. 2. 0. 0. 0. 0. 0.]\n",
192 | " [ 0. 1. 0. 0. 1. 0. 0. 0. 0.]\n",
193 | " [ 0. 1. 1. 0. 1. 0. 0. 0. 0.]\n",
194 | " [ 0. 0. 1. 1. 0. 0. 0. 0. 0.]\n",
195 | " [ 0. 0. 0. 0. 0. 1. 1. 1. 0.]\n",
196 | " [ 0. 0. 0. 0. 0. 0. 1. 1. 1.]\n",
197 | " [ 0. 0. 0. 0. 0. 0. 0. 1. 1.]]\n"
198 | ]
199 | }
200 | ],
201 | "source": [
202 | "# can convert to numpy/scipy matrices and back\n",
203 | "from gensim import matutils\n",
204 | "numpy_matrix = matutils.corpus2dense(corpus, num_terms=12)\n",
205 | "print(numpy_matrix)"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 9,
211 | "metadata": {
212 | "collapsed": false
213 | },
214 | "outputs": [],
215 | "source": [
216 | "scipy_csc_matrix = matutils.corpus2csc(corpus)\n",
217 | "numpy_corpus = matutils.Dense2Corpus(numpy_matrix)\n",
218 | "scipy_corpus = matutils.Sparse2Corpus(scipy_csc_matrix)"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": 10,
224 | "metadata": {
225 | "collapsed": true
226 | },
227 | "outputs": [],
228 | "source": [
229 | "# initialize a TF-IDF transformation\n",
230 | "tfidf = models.TfidfModel(corpus)"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": 11,
236 | "metadata": {
237 | "collapsed": false
238 | },
239 | "outputs": [
240 | {
241 | "name": "stdout",
242 | "output_type": "stream",
243 | "text": [
244 | "[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]\n",
245 | "[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]\n",
246 | "[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]\n",
247 | "[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]\n",
248 | "[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]\n",
249 | "[(9, 1.0)]\n",
250 | "[(9, 0.7071067811865475), (10, 0.7071067811865475)]\n",
251 | "[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]\n",
252 | "[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]\n"
253 | ]
254 | }
255 | ],
256 | "source": [
257 | "# apply it to the whole corpus\n",
258 | "corpus_tfidf = tfidf[corpus]\n",
259 | "for doc in corpus_tfidf:\n",
260 | " print(doc)"
261 | ]
262 | },
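A hedged aside on where these weights come from, assuming gensim's documented defaults: each entry is the raw term frequency times idf = log2(N / df), and the resulting vector is L2-normalized. In the first document each of the three tokens occurs in exactly 2 of the 9 documents, so all three weights come out as 1/sqrt(3):

```python
import math

n_docs, doc_freq = 9, 2                        # corpus size; each token of doc 0 appears in 2 docs
idf = math.log(float(n_docs) / doc_freq, 2)    # log2(9/2) ~= 2.17, identical for all three tokens
weights = [1 * idf] * 3                        # tf is 1 for each token
norm = math.sqrt(sum(w * w for w in weights))
print([w / norm for w in weights])             # -> [0.5773..., 0.5773..., 0.5773...]
```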
263 | {
264 | "cell_type": "code",
265 | "execution_count": 12,
266 | "metadata": {
267 | "collapsed": false
268 | },
269 | "outputs": [
270 | {
271 | "data": {
272 | "text/plain": [
273 | "[u'0.703*\"trees\" + 0.538*\"graph\" + 0.402*\"minors\" + 0.187*\"survey\" + 0.061*\"system\" + 0.060*\"response\" + 0.060*\"time\" + 0.058*\"user\" + 0.049*\"computer\" + 0.035*\"interface\"',\n",
274 | " u'-0.460*\"system\" + -0.373*\"user\" + -0.332*\"eps\" + -0.328*\"interface\" + -0.320*\"time\" + -0.320*\"response\" + -0.293*\"computer\" + -0.280*\"human\" + -0.171*\"survey\" + 0.161*\"trees\"']"
275 | ]
276 | },
277 | "execution_count": 12,
278 | "metadata": {},
279 | "output_type": "execute_result"
280 | }
281 | ],
282 | "source": [
283 | "# initialize an LSI transformation\n",
284 | "lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)\n",
285 | "lsi.print_topics(2)"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 13,
291 | "metadata": {
292 | "collapsed": false
293 | },
294 | "outputs": [
295 | {
296 | "name": "stdout",
297 | "output_type": "stream",
298 | "text": [
299 | "[(0, 0.066007833960902734), (1, -0.52007033063618424)]\n",
300 | "[(0, 0.19667592859142366), (1, -0.76095631677000475)]\n",
301 | "[(0, 0.089926399724463812), (1, -0.72418606267525032)]\n",
302 | "[(0, 0.075858476521781015), (1, -0.63205515860034267)]\n",
303 | "[(0, 0.10150299184980033), (1, -0.57373084830029586)]\n",
304 | "[(0, 0.70321089393783154), (1, 0.16115180214025748)]\n",
305 | "[(0, 0.87747876731198349), (1, 0.16758906864659354)]\n",
306 | "[(0, 0.90986246868185783), (1, 0.14086553628718948)]\n",
307 | "[(0, 0.61658253505692784), (1, -0.053929075663894252)]\n"
308 | ]
309 | }
310 | ],
311 | "source": [
312 | "# create a double wrapper over the original corpus: bow->tfidf->lsi\n",
313 | "corpus_lsi = lsi[corpus_tfidf]\n",
314 | "for doc in corpus_lsi:\n",
315 | " print(doc)"
316 | ]
317 | },
318 | {
319 | "cell_type": "code",
320 | "execution_count": 14,
321 | "metadata": {
322 | "collapsed": false
323 | },
324 | "outputs": [],
325 | "source": [
326 | "# random projection model\n",
327 | "rp = models.RpModel(corpus_tfidf, num_topics=500)"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": 15,
333 | "metadata": {
334 | "collapsed": false
335 | },
336 | "outputs": [
337 | {
338 | "name": "stderr",
339 | "output_type": "stream",
340 | "text": [
341 | "WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy\n"
342 | ]
343 | }
344 | ],
345 | "source": [
346 | "# latent dirichlet allocation model\n",
347 | "lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": 16,
353 | "metadata": {
354 | "collapsed": false
355 | },
356 | "outputs": [
357 | {
358 | "name": "stdout",
359 | "output_type": "stream",
360 | "text": [
361 | "[(0, 0.079104751174447263), (1, -0.5732835243079395)]\n"
362 | ]
363 | }
364 | ],
365 | "source": [
366 | "# convert a phrase into the LSI model space\n",
367 | "doc = \"Human computer interaction\"\n",
368 | "vec_bow = dictionary.doc2bow(doc.lower().split())\n",
369 | "vec_lsi = lsi[vec_bow] # convert the query to LSI space\n",
370 | "print(vec_lsi)"
371 | ]
372 | },
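Note that "interaction" never occurs in the training documents, so it has no id in the dictionary and `doc2bow` simply drops it; the query vector is built from "human" and "computer" alone. A quick check, reusing the `dictionary` built above (the exact ids are the ones it assigned):

```python
query_tokens = "Human computer interaction".lower().split()
# tokens unknown to the dictionary are silently ignored by doc2bow
print([t for t in query_tokens if t not in dictionary.token2id])   # -> ['interaction']
print(dictionary.doc2bow(query_tokens))                            # -> [(0, 1), (1, 1)]
```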
373 | {
374 | "cell_type": "code",
375 | "execution_count": 17,
376 | "metadata": {
377 | "collapsed": false
378 | },
379 | "outputs": [
380 | {
381 | "name": "stderr",
382 | "output_type": "stream",
383 | "text": [
384 | "WARNING:gensim.similarities.docsim:scanning corpus to determine the number of features (consider setting `num_features` explicitly)\n"
385 | ]
386 | }
387 | ],
388 | "source": [
389 | "# index the transformed corpus from earlier\n",
390 | "index = similarities.MatrixSimilarity(corpus_lsi)"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": 18,
396 | "metadata": {
397 | "collapsed": false
398 | },
399 | "outputs": [
400 | {
401 | "name": "stdout",
402 | "output_type": "stream",
403 | "text": [
404 | "[(0, 0.99994081), (1, 0.99330217), (2, 0.99990785), (3, 0.99984384), (4, 0.9992786), (5, -0.08804217), (6, -0.0515742), (7, -0.016480923), (8, 0.22248439)]\n"
405 | ]
406 | }
407 | ],
408 | "source": [
409 | "# perform a similarity query against the corpus using cosine similarity\n",
410 | "sims = index[vec_lsi]\n",
411 | "print(list(enumerate(sims)))"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": 19,
417 | "metadata": {
418 | "collapsed": false
419 | },
420 | "outputs": [
421 | {
422 | "name": "stdout",
423 | "output_type": "stream",
424 | "text": [
425 | "[(0, 0.99994081),\n",
426 | " (2, 0.99990785),\n",
427 | " (3, 0.99984384),\n",
428 | " (4, 0.9992786),\n",
429 | " (1, 0.99330217),\n",
430 | " (8, 0.22248439),\n",
431 | " (7, -0.016480923),\n",
432 | " (6, -0.0515742),\n",
433 | " (5, -0.08804217)]\n"
434 | ]
435 | }
436 | ],
437 | "source": [
438 | "# display in sorted order\n",
439 | "sims = sorted(enumerate(sims), key=lambda item: -item[1])\n",
440 | "pprint(sims)"
441 | ]
442 | }
443 | ],
444 | "metadata": {
445 | "kernelspec": {
446 | "display_name": "Python 2",
447 | "language": "python",
448 | "name": "python2"
449 | },
450 | "language_info": {
451 | "codemirror_mode": {
452 | "name": "ipython",
453 | "version": 2
454 | },
455 | "file_extension": ".py",
456 | "mimetype": "text/x-python",
457 | "name": "python",
458 | "nbconvert_exporter": "python",
459 | "pygments_lexer": "ipython2",
460 | "version": "2.7.9"
461 | }
462 | },
463 | "nbformat": 4,
464 | "nbformat_minor": 0
465 | }
466 |
--------------------------------------------------------------------------------
/notebooks/libraries/NetworkX.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# NetworkX"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks.\n",
15 | "\n",
16 | "With NetworkX you can load and store networks in standard and nonstandard data formats, generate many types of random and classic networks, analyze network structure, build network models, design new network algorithms, draw networks, and much more.\n",
17 | "\n",
18 | "Library documentation: https://networkx.github.io/"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {
25 | "collapsed": false
26 | },
27 | "outputs": [],
28 | "source": [
29 | "import networkx as nx\n",
30 | "G = nx.Graph()"
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 2,
36 | "metadata": {
37 | "collapsed": false
38 | },
39 | "outputs": [],
40 | "source": [
41 | "# basic add nodes\n",
42 | "G.add_node(1)\n",
43 | "G.add_nodes_from([2, 3])"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 3,
49 | "metadata": {
50 | "collapsed": false
51 | },
52 | "outputs": [],
53 | "source": [
54 | "# add a group of nodes at once\n",
55 | "H = nx.path_graph(10)\n",
56 | "G.add_nodes_from(H)"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 4,
62 | "metadata": {
63 | "collapsed": false
64 | },
65 | "outputs": [],
66 | "source": [
67 | "# add another graph itself as a node\n",
68 | "G.add_node(H)"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 5,
74 | "metadata": {
75 | "collapsed": false
76 | },
77 | "outputs": [],
78 | "source": [
79 | "# add edges using similar methods\n",
80 | "G.add_edge(1, 2)\n",
81 | "e = (2, 3)\n",
82 | "G.add_edge(*e)\n",
83 | "G.add_edges_from([(1, 2), (1, 3)])\n",
84 | "G.add_edges_from(H.edges())"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 6,
90 | "metadata": {
91 | "collapsed": false
92 | },
93 | "outputs": [],
94 | "source": [
95 | "# can also remove or clear\n",
96 | "G.remove_node(H)\n",
97 | "G.clear()"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 7,
103 | "metadata": {
104 | "collapsed": false
105 | },
106 | "outputs": [],
107 | "source": [
108 | "# repeats are ignored\n",
109 | "G.add_edges_from([(1,2),(1,3)])\n",
110 | "G.add_node(1)\n",
111 | "G.add_edge(1,2)\n",
112 | "G.add_node('spam') # adds node \"spam\"\n",
113 | "G.add_nodes_from('spam') # adds 4 nodes: 's', 'p', 'a', 'm'"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": 8,
119 | "metadata": {
120 | "collapsed": false
121 | },
122 | "outputs": [
123 | {
124 | "data": {
125 | "text/plain": [
126 | "(8, 2)"
127 | ]
128 | },
129 | "execution_count": 8,
130 | "metadata": {},
131 | "output_type": "execute_result"
132 | }
133 | ],
134 | "source": [
135 | "# get the number of nodes and edges\n",
136 | "G.number_of_nodes(), G.number_of_edges()"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 9,
142 | "metadata": {
143 | "collapsed": false
144 | },
145 | "outputs": [
146 | {
147 | "data": {
148 | "text/plain": [
149 | "{2: {}, 3: {}}"
150 | ]
151 | },
152 | "execution_count": 9,
153 | "metadata": {},
154 | "output_type": "execute_result"
155 | }
156 | ],
157 | "source": [
158 | "# access graph edges\n",
159 | "G[1]"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 10,
165 | "metadata": {
166 | "collapsed": false
167 | },
168 | "outputs": [
169 | {
170 | "data": {
171 | "text/plain": [
172 | "{}"
173 | ]
174 | },
175 | "execution_count": 10,
176 | "metadata": {},
177 | "output_type": "execute_result"
178 | }
179 | ],
180 | "source": [
181 | "G[1][2]"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 11,
187 | "metadata": {
188 | "collapsed": false
189 | },
190 | "outputs": [],
191 | "source": [
192 | "# set an attribute of an edge\n",
193 | "G.add_edge(1,3)\n",
194 | "G[1][3]['color'] = 'blue'"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 12,
200 | "metadata": {
201 | "collapsed": false
202 | },
203 | "outputs": [
204 | {
205 | "name": "stdout",
206 | "output_type": "stream",
207 | "text": [
208 | "(1, 2, 0.125)\n",
209 | "(2, 1, 0.125)\n",
210 | "(3, 4, 0.375)\n",
211 | "(4, 3, 0.375)\n"
212 | ]
213 | }
214 | ],
215 | "source": [
216 | "FG = nx.Graph()\n",
217 | "FG.add_weighted_edges_from([(1, 2, 0.125), (1, 3, 0.75), (2, 4, 1.2), (3, 4, 0.375)])\n",
218 | "for n, nbrs in FG.adjacency_iter():\n",
219 | " for nbr, eattr in nbrs.items():\n",
220 | " data = eattr['weight']\n",
221 | " if data < 0.5: print('(%d, %d, %.3f)' % (n, nbr, data))"
222 | ]
223 | },
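The cells above focus on constructing graphs; as a brief sketch of the analysis side mentioned in the introduction, here is how shortest paths and degree centrality might be computed on the weighted graph `FG` just built (standard NetworkX calls, though results and APIs can vary slightly by version):

```python
# shortest path by hop count vs. by edge weight on FG
print(nx.shortest_path(FG, source=1, target=4))                    # two hops either way
print(nx.shortest_path(FG, source=1, target=4, weight='weight'))   # should prefer 1-3-4 (0.75 + 0.375)
print(nx.dijkstra_path_length(FG, 1, 4))                           # total weight of that path, 1.125

# degree centrality: fraction of the other nodes each node is connected to
print(nx.degree_centrality(FG))
```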
224 | {
225 | "cell_type": "code",
226 | "execution_count": 13,
227 | "metadata": {
228 | "collapsed": false
229 | },
230 | "outputs": [
231 | {
232 | "data": {
233 | "text/plain": [
234 | "{'day': 'Friday'}"
235 | ]
236 | },
237 | "execution_count": 13,
238 | "metadata": {},
239 | "output_type": "execute_result"
240 | }
241 | ],
242 | "source": [
243 | "# graph attribte\n",
244 | "G = nx.Graph(day='Friday')\n",
245 | "G.graph"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": 14,
251 | "metadata": {
252 | "collapsed": false
253 | },
254 | "outputs": [
255 | {
256 | "data": {
257 | "text/plain": [
258 | "{'day': 'Monday'}"
259 | ]
260 | },
261 | "execution_count": 14,
262 | "metadata": {},
263 | "output_type": "execute_result"
264 | }
265 | ],
266 | "source": [
267 | "# modifying an attribute\n",
268 | "G.graph['day'] = 'Monday'\n",
269 | "G.graph"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": 15,
275 | "metadata": {
276 | "collapsed": false
277 | },
278 | "outputs": [
279 | {
280 | "data": {
281 | "text/plain": [
282 | "[(1, {'room': 714, 'time': '5pm'}), (3, {'time': '2pm'})]"
283 | ]
284 | },
285 | "execution_count": 15,
286 | "metadata": {},
287 | "output_type": "execute_result"
288 | }
289 | ],
290 | "source": [
291 | "# node attributes\n",
292 | "G.add_node(1, time='5pm')\n",
293 | "G.add_nodes_from([3], time='2pm')\n",
294 | "G.node[1]['room'] = 714\n",
295 | "G.nodes(data=True)"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 16,
301 | "metadata": {
302 | "collapsed": false
303 | },
304 | "outputs": [],
305 | "source": [
306 | "# edge attributes (weight is a special numeric attribute)\n",
307 | "G.add_edge(1, 2, weight=4.7)\n",
308 | "G.add_edges_from([(3, 4), (4, 5)], color='red')\n",
309 | "G.add_edges_from([(1, 2 ,{'color': 'blue'}), (2, 3, {'weight' :8})])\n",
310 | "G[1][2]['weight'] = 4.7\n",
311 | "G.edge[1][2]['weight'] = 4"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 17,
317 | "metadata": {
318 | "collapsed": false
319 | },
320 | "outputs": [
321 | {
322 | "data": {
323 | "text/plain": [
324 | "0.5"
325 | ]
326 | },
327 | "execution_count": 17,
328 | "metadata": {},
329 | "output_type": "execute_result"
330 | }
331 | ],
332 | "source": [
333 | "# directed graph\n",
334 | "DG = nx.DiGraph()\n",
335 | "DG.add_weighted_edges_from([(1, 2 ,0.5), (3, 1, 0.75)])\n",
336 | "DG.out_degree(1, weight='weight')"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": 18,
342 | "metadata": {
343 | "collapsed": false
344 | },
345 | "outputs": [
346 | {
347 | "data": {
348 | "text/plain": [
349 | "1.25"
350 | ]
351 | },
352 | "execution_count": 18,
353 | "metadata": {},
354 | "output_type": "execute_result"
355 | }
356 | ],
357 | "source": [
358 | "DG.degree(1, weight='weight')"
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 19,
364 | "metadata": {
365 | "collapsed": false
366 | },
367 | "outputs": [
368 | {
369 | "data": {
370 | "text/plain": [
371 | "[2]"
372 | ]
373 | },
374 | "execution_count": 19,
375 | "metadata": {},
376 | "output_type": "execute_result"
377 | }
378 | ],
379 | "source": [
380 | "DG.successors(1)"
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": 20,
386 | "metadata": {
387 | "collapsed": false
388 | },
389 | "outputs": [
390 | {
391 | "data": {
392 | "text/plain": [
393 | "[3]"
394 | ]
395 | },
396 | "execution_count": 20,
397 | "metadata": {},
398 | "output_type": "execute_result"
399 | }
400 | ],
401 | "source": [
402 | "DG.predecessors(1)"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": 21,
408 | "metadata": {
409 | "collapsed": false
410 | },
411 | "outputs": [],
412 | "source": [
413 | "# convert to undirected graph\n",
414 | "H = nx.Graph(G)"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 22,
420 | "metadata": {
421 | "collapsed": false
422 | },
423 | "outputs": [
424 | {
425 | "data": {
426 | "image/png": [
427 | "iVBORw0KGgoAAAANSUhEUgAAAd8AAAFBCAYAAAA2bKVrAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\n",
428 | "AAALEgAACxIB0t1+/AAAHeFJREFUeJzt3X9QVXX+x/EXEj/lQoCISWFN1pTYSKKom5QWOJY/Et1V\n",
429 | "+mljrhq7W+pi1LbNjlk7WW2b1trSN8tyC8S6gCaaoIuiGSo6Wjqtle3aZibieoFNL1c43z8strKU\n",
430 | "H/eec388HzPOyMw9t5f/9OLzPp/zOUGGYRgCAACm6WZ1AAAAAg3lCwCAyShfAABMRvkCAGAyyhcA\n",
431 | "AJNRvgAAmIzyBQDAZJQvAAAmo3wBADAZ5QsAgMkoXwAATEb5AgBgMsoXAACTUb4AAJiM8gUAwGSU\n",
432 | "LwAAJqN8AQAwGeULAIDJKF8AAExG+QIAYDLKFwAAk1G+AACYjPIFAMBklC8AACajfAEAMBnlCwCA\n",
433 | "yShfAABMRvkCAGAyyhcAAJNRvgAAmIzyBQDAZJQvAAAmo3wBADAZ5QsAgMkoXwAATEb5AgBgMsoX\n",
434 | "AACTUb4AAJiM8gUAwGSULwAAJqN8AQAwGeULAIDJLrA6AADAOg6HQ/X19ZKk+Ph4xcTEWJwoMLDy\n",
435 | "BYAA43Q6VVhYqIzUVCUlJOimAQN004ABSkpIUEZqqgoLC9Xc3Gx1TL8WZBiGYXUIAIA5VhQV6YGZ\n",
436 | "M3WNYSi3sVHj9L8RqEvSaklLoqL0YbduWlRQoCk5OdaF9WOULwAEiMXPPqtnfv97lZw8qbTzfLZW\n",
437 | "UnZkpPIWLND9c+eaES+gUL4AEABWFBVp3rRp2nLypJLbec0hScMjI/X00qWsgN2M8gUAP+d0OtWn\n",
438 | "Z0+VNzRoYAevrZU0Jjpah+rqFBoa6ol4AYkNVwDg5+x2u/q3tna4eCUpTVJKa6vsdru7YwU0Vr4A\n",
439 | "4OcyUlM1Z88eTezk9W9LWpSaqs27d7szVkCjfAHAjzkcDiUlJOiEy9Xpgx1ckmJDQvRFXR3PAbsJ\n",
440 | "Y2cA8GP19fVKCAvr0olKIZJ6hIbq+PHj7ooV8DjhCoAlOFnJvQzD0PHjx3X48OG2P1988YU++ugj\n",
441 | "nTp50up4+AHKF4BpnE6n7Ha7lixcqN379yshLEySVOd06tp+/ZSbn69Jkyaxq/YHmpqavleoP/b3\n",
442 | "w4cPKyIiQr179277k5SUpAEDBqikqEgunVnBdoZL0rHmZsXFxbnxXxXYuOcLwBScrHQ2p9OpL7/8\n",
443 | "8qzV6g9/drlcSkpKaivU75brt3+/6KKL1L179x/977DhyvtQvgA8LtBOVmppadHRo0fPWaiHDx+W\n",
444 | "w+FQr169zlqt/vDnmJgYBQUFdTpPYWGhls6Yocqmpk5df5PNpl++9JJyAuAXIrNQvgA8yp9OVjIM\n",
445 | "Q//5z39+sky//XP06FHFxsb+aKF+9+8JCQnq1s3z+145ZMP7UL4APMaX/qf/3fuq51qthoeH/+jq\n",
446 | "9Ls/9+rVy+uKyp9+CfIHbLgC4DHuOlmpK+PO5ubmtvuq59qs1NzcfNYK9eKLL1Z6enrbz+e6r+rt\n",
447 | "puTk6KvDhzW8E+N/itf9WPkC8BhPbvT54X3Vn1qtOhwOJSYmnnP827t3b1144YVduq/qK77d+Na/\n",
448 | "tVW5TU0ar+9vfFslaYnNpn1BQQGz8c0KlC8Aj3DXyUoxwcH67cMPtz3D+m3B/vC+6k9tVurRo4eC\n",
449 | "g4Pd+U/zec3NzW2PfO3at089vhmRH2tu1sCUFOXm52vixIleNzr3J5QvAI84ePCgbhowQJ91coft\n",
450 | "ty664AJNmD5d/fr1+17BeuN9VV/kcDjaTq6Ki4vjsBOTcM8XgFcLCwvTgw8+qMsuu8zqKH4pJiaG\n",
451 | "wrUAZzsD8Ij4+HjVOZ1ydeE7OFkJ/oqVLwC3am1t1c6dO1VSUqKIbt20Wur0hqtVkgampLAyg99h\n",
452 | "5Qugy5qbm1VRUaFf/epXSk5O1tSpUxUUFKT7H3lES6KiOv29S2w25ebnuzEp4B3YcAWgU5qamrRu\n",
453 | "3TqVlpaqvLxcV155pbKzs3XrrbfqqquukuRbh2wAZmLsDKDdjh49qtWrV6u0tFSbNm3SsGHDlJ2d\n",
454 | "raeeekq9e/c+6/NhYWFaVFCgCZ04WSk7MlKLCgooXvglyhfAOR08eFClpaUqLS3V3r17NWrUKN1+\n",
455 | "++1avny5LrzwwvNez8lKwNkYOwP4HsMwtGfPHpWUlKi0tFRHjhzR+PHjlZ2drRtvvFHh4eGd+l5O\n",
456 | "VgL+h/IFoNOnT2vr1q1thRscHKzs7GxNmDBBw4YNc9sJUZysBJxB+QIB6uTJk6qoqFBJSYneeecd\n",
457 | "JScna8KECZowYYL69+/v8XOOOVkJgYzyBQLI8ePHtWbNGpWUlGjDhg0aOHBg2w7lPn36WB0PCBiU\n",
458 | "L+DnPv/8c5WVlamkpEQ7duzQjTfeqOzsbI0ZM0Y9evSwOh4QkChfwM8YhqH9+/ertLRUJSUl+uyz\n",
459 | "zzR27FhlZ2crKyvLZ99HC/gTyhfwA62trXr//ffbHgk6depU2/3bjIwMhYSEWB0RwHdQvoCbOBwO\n",
460 | "1dfXSzrzUgFPbyByOp3auHGjSktLVVZWpoSEhLbCHThwYEC8GB7wVRyyAXSB0+lse3Rm9/79SggL\n",
461 | "kyTVOZ26tl8/5ebna9KkSW57dKahoUHl5eUqLS3VunXrlJKSouzsbFVXV+uKK65wy38DgOex8gU6\n",
462 | "6dtDI64xDOU2Nmqcvn9oxGpJS6Ki9GG3bl06NOLIkSNatWqVSkpKtHXrVg0fPlzZ2dkaN26cevXq\n",
463 | "5aZ/DQAzUb5AJyx+9lk904njEu+fO7dd3//xxx+3bZjav3+/br75ZmVnZ2v06NGKjo7ucn4A1qJ8\n",
464 | "gQ5aUVSkeZ14UcDwyEg9vXTpj66ADcNQbW1tW+EeP35ct956qyZMmKCRI0cq7JtxNgD/QPkCHeDO\n",
465 | "V+S5XC5t3ry5bYdyRERE25GOQ4YMUbduvG4b8FdsuAI6wG63q39ra4eLV5LSJPVrbdVDDz2kY8eO\n",
466 | "ac2aNbr88ss1YcIEvfvuu7r66qvZoQwECFa+QAdkpKZqzp49mtjJ69+WdH9UlH735JMaP368Lrnk\n",
467 | "EnfGA+AjKF+gnRwOh5ISEnTC5er0yMglKTYkRF/U1fEiASCAcVMJaKf6+nolhIV16V5NiKQeoaFt\n",
468 | "b/MBEJgoXwAATEb5Au0UHx+vOqdTri58h0tnXhwfFxfnrlgAfBDlC7RTTEyMru3XT6u78B2rJA1M\n",
469 | "SeF+LxDgKF+gA3Lz87UkKqrT1y+x2ZSbn+/GRAB8EbudgQ5wOp1KTkjQ2sbGLh+yASBwsfIFOuCD\n",
470 | "Dz6QER6uMRdcoEMduO6QzpzvvKiggOIFQPkC7WEYhl588UXdcsstevHFF/XwwoUaHhGh2nZcW6sz\n",
471 | "5zrnLVjQ6TcbAfAvHC8JnEdTU5NmzJih/fv367333lPfvn0lSYm9e2vMzJnq39qq3KYmjdf3Xym4\n",
472 | "Smfu8e4LCurSKwUB+B9WvsA57Nu3T4MHD1ZkZKS2bdvWVrySNCUnR4fq6jT9//5Pz6Wm6sKQEF3a\n",
473 | "vbsu7d5dsSEhWpSaql++9JIO1dVRvAC+hw1XwE9Yvny55s6dq2eeeUZTp0497+cdDkfbyVVxcXE8\n",
474 | "TgTgJ1G+wA+cOnVK999/vzZt2qS33npL11xzjdWRAPgZxs7Ad3z66acaNmyYGhoatHPnTooXgEdQ\n",
475 | "vsA3SkpKNGzYME2fPl2FhYWy2WxWRwLgp9jtjIDncrn00EMP6e2339Y777yj9PR0qyMB8HOULwLa\n",
476 | "v//9b02ZMkWxsbHatWsXLzwAYArGzghY69ev1+DBgzVu3DitWrWK4gVgGla+CDgtLS167LHH9PLL\n",
477 | "L6uoqEg33HCD1ZEABBjKFwHl6NGjuuOOO3T69GnV1taqV69eVkcCEIAYOyNgbNmyRWlpaUpPT1dF\n",
478 | "RQXFC8AyrHzh9wzD0J/+9Cc9/fTTevXVV3XLLbdYHQlAgKN84ddOnDihe+65R19++aW2b9+uPn36\n",
479 | "WB0JABg7w3/V1tYqLS1NycnJqq6upngBeA3KF37HMAwVFBRo9OjRevLJJ7V48WJeYA/AqzB2hl9p\n",
480 | "amrSrFmztHfvXm3dulVXXnml1ZEA4CysfOE39u/fr/T0dIWGhur999+neAF4LcoXfuGNN97QDTfc\n",
481 | "oLy8PL3yyiuKjIy0OhIA/CTGzvBpp06d0uzZs7Vx40ZVVlZqwIABVkcCgPNi5QufdfDgQV133XWq\n",
482 | "r6/Xzp07KV4APoPyhU8qKyvT0KFDNXXqVBUXFys6OtrqSADQboyd4VNcLpd+97vfqbi4WKtWrdLQ\n",
483 | "oUOtjgQAHUb5wmd88cUXysnJkc1m065duxQfH291JADoFMbO8AmVlZUaNGiQRo8erXfeeYfiBeDT\n",
484 | "WPnCq7W0tOjxxx9XQUGB3nzzTY0cOdLqSADQZZQvvFZdXZ3uvPNOnTp1SrW1tbrooousjgQAbsHY\n",
485 | "GV7pvffeU1pamq699lpt2LCB4gXgV1j5wqsYhqE///nPWrhwoZYuXaqxY8daHQkA3I7yhdc4ceKE\n",
486 | "pk2bps8//1w1NTW69NJLrY4EAB7B2BleYffu3Ro0aJB69+6tLVu2ULwA/BrlC0sZhqGXXnpJo0aN\n",
487 | "0hNPPKEXXnhBYWFhVscCAI9i7AzL/Pe//9V9992nXbt2qbq6WldddZXVkQDAFKx8YYmPPvpIQ4YM\n",
488 | "UVBQkGpqaiheAAGF8oXpCgsLlZGRodmzZ2vZsmXq3r271ZEAwFSMnWEap9OpOXPmqKKiQhUVFUpN\n",
489 | "TbU6EgBYgpUvTPHZZ59p+PDh+uqrr7Rz506KF0BAo3zhcatXr9bQoUN1++2366233lJMTIzVkQDA\n",
490 | "Uoyd4TGnT5/WI488osLCQpWUlOhnP/uZ1ZEAwCtQvvCIw4cPKycnR5GRkdq1a5d69OhhdSQA8BqM\n",
491 | "neF2Gzdu1KBBg5SVlaXy8nKKFwB+gJUv3Ka1tVV//OMf9Ze//EXLly9XZmam1ZEAwCtRvnCLY8eO\n",
492 | "6a677lJTU5N27typpKQkqyMBgNdi7Iwu27Ztm9LS0nTNNddo48aNFC8AnAcrX3SaYRhavHixnnji\n",
493 | "Cb388ssaP3681ZEAwCdQvugUh8Ohe++9V//85z9VU1Ojyy67zOpIAOAzGDujw/bs2aNBgwapZ8+e\n",
494 | "2rJlC8ULAB1E+aLdDMPQ0qVLlZmZqfnz52vJkiUKDw+3OhYA+BzGzmiXr7/+Wrm5udqxY4c2b96s\n",
495 | "q6++2upIAOCzWPnivP7xj39oyJAhamlp0fbt2yleAOgiyhfnVFxcrOHDh+s3v/mNXn/9dd69CwBu\n",
496 | "wNgZP8rpdCovL0/l5eV69913NXDgQKsjAYDfoHxxln/961/6xS9+oaSkJNXW1urCCy+0OhIA+BXG\n",
497 | "zvieNWvWKD09XVOmTJHdbqd4AcADWPlC0pl37z766KP629/+Jrvdruuuu87qSADgtyhf6Msvv9Rt\n",
498 | "t92m0NBQ7dq1SwkJCVZHAgC/xtg5wFVVVWnQoEEaMWKE1q5dS/ECgAlY+Qao1tZWPfnkk3r++ef1\n",
499 | "2muvadSoUVZHAoCAQfkGoPr6et199906ceKEduzYoYsvvtjqSAAQUBg7B5iamhqlpaXp6quvVlVV\n",
500 | "FcULABZg5RsgDMPQCy+8oAULFqigoEDZ2dlWRwKAgEX5BoCGhgZNnz5dn3zyibZt26bLL7/c6kgA\n",
501 | "ENAYO/u5vXv3avDgwYqNjdV7771H8QKAF6B8/dirr76qm266SY8++qgKCgp49y4AeAnGzn7o66+/\n",
502 | "1q9//Wtt27ZNVVVVSklJsToSAOA7WPn6mQMHDmjo0KE6deqUduzYQfECgBeifP3IypUrdd111+m+\n",
503 | "++7TG2+8oaioKKsjAQB+BGNnP9Dc3Kx58+Zp9erVWrt2rQYNGmR1JADAOVC+Pu7QoUOaPHmyEhMT\n",
504 | "VVtbq9jYWKsjAQDOg7GzD1u7dq0GDx6sSZMmqbS0lOIFAB/BytcHnT59Wn/4wx/02muv6a233lJG\n",
505 | "RobVkQAAHUD5+pgjR47otttuU7du3VRbW6vExESrIwEAOoixsw/ZtGmT0tLSlJGRofXr11O8AOCj\n",
506 | "WPlawOFwqL6+XpIUHx+vmJiYc36+tbVVTz31lJ577jktW7ZMo0ePNiMmAMBDKF+TOJ1O2e12LVm4\n",
507 | "ULv371dCWJgkqc7p1LX9+ik3P1+TJk1SaGjo9647fvy4pk6dqvr6eu3YsUOXXHKJFfEBAG7E2NkE\n",
508 | "K4qK1KdnT70yc6bm7tmjEy6XPmtq0mdNTfqPy6U5e/Zo6YwZSk5I0IqiorbrduzYobS0NPXt21dV\n",
509 | "VVUULwD4iSDDMAyrQ/izxc8+q2d+/3uVnDyptPN8tlZSdmSkfvvYYwoOC9P8+fP117/+VZMmTTIj\n",
510 | "KgDAJJSvB60oKtK8adO05eRJJbfzmkOS0oODFZaUpA0bNqhv376ejAgAsADl6yFOp1N9evZUeUOD\n",
511 | "Bnbw2lpJY6Kjdaiu7qx7wAAA38c9Xw+x2+3q39ra4eKVpDRJKa2tstvt7o4FAPACrHw9JCM1VXP2\n",
512 | "7NHETl7/tqRFqanavHu3O2MBALwA5esBDodDSQkJOuFydfpZLpek2JAQfVFXd97ngAEAvoWxswfU\n",
513 | "19crISysSw9Rh0jqERqq48ePuysWAMBLUL4AAJiM8vWA+Ph41TmdcnXhO1ySjjU3Ky4uzl2xAABe\n",
514 | "gvL1gJiYGF3br59Wd+E7VkkamJLC/V4A8EOUr4fk5udrSVRUp69fYrMpNz/fjYkAAN6C3c4ewiEb\n",
515 | "AICfwsrXQ8LCwrSooEATIiJ0qAPXHZJ06zfXUrwA4J8oXw+akpOjvMcf1/CICNW24/O1kn4WFqam\n",
516 | "Cy5Qn0sv9XA6AIBVGDubYEVRkR6YOVP9W1uV29Sk8frfi5RdOrO5aonNpn1BQVpUUKDomBjdc889\n",
517 | "Wr9+vQYMGGBdcACAR1C+JmlubpbdbteShQu1a98+9fhmpHysuVkDU1KUm5+viRMnto2aV65cqQce\n",
518 | "eEBVVVW68sorrYwOAHAzytcCDoej7eSquLi4n3yc6JVXXtH8+fO1efNm9enTx8yIAAAP6soJiOik\n",
519 | "mJiYdj2/O23aNDU2NiozM1PV1dXq1auXCekAAJ5G+Xq5Bx54QA0NDcrKytKmTZs48QoA/ABjZx9g\n",
520 | "GIYefPBBbd68WZWVlbLZbFZHAgB0AeXrIwzD0KxZs3TgwAGVl5crIiLC6kgAgE6ifH1IS0uL7r77\n",
521 | "bjkcDtntdg7hAAAfRfn6GJfLpZ///OcKDw/Xm2++qeDgYKsjAQA6iBOufExISIhWrFihY8eOaebM\n",
522 | "meJ3JwDwPZSvDwoPD1dZWZn27dunuXPnUsAA4GMoXx8VFRWl8vJy/f3vf9f8+fOtjgMA6ACe8/Vh\n",
523 | "sbGxWr9+vTIyMhQdHa25c+daHQkA0A6Ur4/r2bOnKisrlZGRIZvNpl/+8pdWRwIAnAfl6wcuueQS\n",
524 | "VVZW6oYbbpDNZlNOTo7VkQAA50D5+om+fftq3bp1yszMVPfu3TVu3DirIwEAfgLP+fqZ7du3a8yY\n",
525 | "MVqxYoVuvPFGq+MAAH4Eu539THp6ulauXKmcnBy9//77VscBAPwIytcPjRgxQsuWLdOtt96qvXv3\n",
526 | "Wh0HAPADlK+fuuWWW/T8889r9OjROnDggNVxAADfwYYrPzZ58mQ1NTUpKytL1dXVSk5OtjoSAECU\n",
527 | "r9+bNm2aGhoalJmZqc2bN6tXr15WRwKAgEf5BoDZs2ersbFRo0aNUlVVleLi4qyOBAABjUeNAoRh\n",
528 | "GJo3b56qq6tVWVkpm81mdSQACFiUbwAxDEMzZ87UJ598ojVr1igiIsLqSAAQkCjfANPS0qK77rpL\n",
529 | "DQ0NKikpUUhIiNWRACDgUL4ByOVyadKkSYqMjNQbb7yh4OBgqyMBQEDhOd8AFBISouLiYtXV1WnW\n",
530 | "rFni9y8AMBflG6DCw8NVVlamDz/8UL/97W8pYAAwEeUbwKKiolReXq4NGzboscceszoOAAQMnvMN\n",
531 | "cLGxsVq/fr2uv/562Ww2zZ071+pIAOD3KF8oMTFRFRUVuv766xUdHa3p06dbHQkA/BrlC0lScnKy\n",
532 | "KioqNGLECNlsNk2ZMsXqSADgtyhftLniiiu0bt06ZWVlqXv37ho7dqzVkQDAL/GcL86yfft2jRkz\n",
533 | "RsXFxRo5cqTVcQDA77DbGWdJT0/XypUrNXnyZNXU1FgdBwD8DuWLHzVixAgtW7ZM48eP1969e62O\n",
534 | "AwB+hfLFTxozZoyef/553XzzzTpw4IDVcQDAb7DhCuc0efJkNTY2KisrS9XV1UpOTrY6EgD4PMoX\n",
535 | "53XvvfeqsbFRmZmZqq6uVmJiotWRAMCnUb5ol9mzZ6uhoUFZWVmqqqpSXFyc1ZEAwGfxqBHazTAM\n",
536 | "5eXlaevWraqoqJDNZrM6EgD4JMoXHWIYhmbMmKFPP/1U5eXlCg8PtzoSAPgcyhcd1tLSojvvvFNN\n",
537 | "TU2y2+0KCQmxOhIA+BQeNUKHBQcH6/XXX1dQUJDuvvtutbS0WB0JAHwK5YtOCQkJUXFxsb766ivN\n",
538 | "mjVLDFAAoP0oX3RaeHi4ysrK9MEHHygvL48CBoB2onzRJTabTeXl5aqsrNSCBQusjgMAPoHnfNFl\n",
539 | "cXFxWr9+vTIyMmSz2TRnzhyrIwGAV6N84RaJiYmqrKzU9ddfr+joaN17771WRwIAr0X5wm2Sk5NV\n",
540 | "UVGhESNGKCoqSlOmTLE6EgB4JcoXbnXFFVdo7dq1ysrKUlRUlMaMGWN1JADwOhyyAY+oqanR2LFj\n",
541 | "VVxcrJEjR1odBwC8Crud4RFDhgxRcXGxJk+erJqaGqvjAIBXoXzhMSNHjtSrr76q8ePH64MPPrA6\n",
542 | "DgB4DcoXHjV27FgtXrxYo0eP1scff2x1HADwCmy4gsdNmTJFjY2NysrK0ubNm5WcnGx1JACwFOUL\n",
543 | "U0yfPl2NjY3KzMxUdXW1EhMTrY4EAJahfGGaOXPmqKGhQaNGjVJVVZViY2OtjgQAluBRI5jKMAzl\n",
544 | "5eVp69atqqiokM1mszoSAJiO8oXpDMPQjBkzdPDgQa1Zs0bh4eFWRwIAU1G+sERLS4vuuOMOff31\n",
545 | "13r77bcVEhJidSQAMA2PGsESwcHBWr58uQzD0NSpU9XS0mJ1JAAwDeULy4SEhKi4uFhHjhzRfffd\n",
546 | "J4YwAAIF5QtLRUREqKysTHv27NG8efMoYAABgUeNYDmbzaa1a9dqxIgRiomJ0aOPPnrWZxwOh+rr\n",
547 | "6yVJ8fHxiomJMTsmALgNK194hbi4OK1fv17Lly/Xc889J0lyOp0qLCxURmqqkhISdNOAAbppwAAl\n",
548 | "JSQoIzVVhYWFam5utjg5AHQcu53hVQ4dOqSMjAyNHj1aZUVFusYwlNvYqHH635jGJWm1pCVRUfqw\n",
549 | "WzctKijQlJwc60IDQAdRvvA6jz78sP765JNaJyntPJ+tlZQdGam8BQt0/9y5JqQDgK6jfOFVVhQV\n",
550 | "ad60adpy8qTa+/qFQ5KGR0bq6aVLWQED8AmUL7yG0+lUn549Vd7QoIEdvLZW0pjoaB2qq1NoaKgn\n",
551 | "4gGA27DhCl7Dbrerf2trh4tXOjOeTmltld1ud3csAHA7Vr7wGhmpqZqzZ48mdvL6tyUtSk3V5t27\n",
552 | "3RkLANyO8oVXcDgcSkpI0AmXq9MPn7skxYaE6Iu6Op4DBuDVGDvDK9TX1yshLKxLp76ESOoRGqrj\n",
553 | "x4+7KxYAeATlCwCAyShfeIX4+HjVOZ1ydeE7XJKONTcrLi7OXbEAwCMoX3iFmJgYXduvn1Z34TtW\n",
554 | "SRqYksL9XgBej/KF18jNz9eSqKhOX7/EZlNufr4bEwGAZ7DbGV6DQzYABApWvvAaYWFhWlRQoAkR\n",
555 | "ETrUgesO6cz5zosKCiheAD6B8oVXmZKTo7zHH9fwiAjVtuPztTpzrnPeggWc6wzAZzB2hldaUVSk\n",
556 | "B2bOVP/WVuU2NWm8vv9KwVU6c493X1AQrxQE4HMoX3it5uZm2e12LVm4ULv27VOPb0bKx5qbNTAl\n",
557 | "Rbn5+Zo4cSKjZgA+h/KFT3A4HG0nV8XFxfE4EQCfRvkCAGAyNlwBAGAyyhcAAJNRvgAAmIzyBQDA\n",
558 | "ZJQvAAAmo3wBADAZ5QsAgMkoXwAATEb5AgBgMsoXAACTUb4AAJiM8gUAwGSULwAAJqN8AQAwGeUL\n",
559 | "AIDJKF8AAExG+QIAYDLKFwAAk1G+AACYjPIFAMBklC8AACajfAEAMBnlCwCAyShfAABMRvkCAGAy\n",
560 | "yhcAAJNRvgAAmIzyBQDAZJQvAAAmo3wBADAZ5QsAgMkoXwAATEb5AgBgMsoXAACTUb4AAJiM8gUA\n",
561 | "wGSULwAAJqN8AQAwGeULAIDJKF8AAExG+QIAYDLKFwAAk1G+AACYjPIFAMBklC8AACajfAEAMBnl\n",
562 | "CwCAyShfAABMRvkCAGAyyhcAAJNRvgAAmOz/ATDalB3w9E/NAAAAAElFTkSuQmCC\n"
563 | ],
564 | "text/plain": [
565 | ""
566 | ]
567 | },
568 | "metadata": {},
569 | "output_type": "display_data"
570 | }
571 | ],
572 | "source": [
573 | "# basic graph drawing capability\n",
574 | "%matplotlib inline\n",
575 | "import matplotlib.pyplot as plt\n",
576 | "nx.draw(G)"
577 | ]
578 | }
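The default call above picks a layout automatically and omits labels; a slightly more readable variant (a sketch using standard NetworkX/matplotlib options) computes the layout explicitly and labels the nodes:

```python
pos = nx.spring_layout(G)                                   # force-directed node positions
nx.draw(G, pos, with_labels=True, node_color='lightblue')   # same graph, labelled nodes
plt.show()
```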
579 | ],
580 | "metadata": {
581 | "kernelspec": {
582 | "display_name": "Python 2",
583 | "language": "python",
584 | "name": "python2"
585 | },
586 | "language_info": {
587 | "codemirror_mode": {
588 | "name": "ipython",
589 | "version": 2
590 | },
591 | "file_extension": ".py",
592 | "mimetype": "text/x-python",
593 | "name": "python",
594 | "nbconvert_exporter": "python",
595 | "pygments_lexer": "ipython2",
596 | "version": "2.7.9"
597 | }
598 | },
599 | "nbformat": 4,
600 | "nbformat_minor": 0
601 | }
602 |
--------------------------------------------------------------------------------
/notebooks/misc/CodeOptimization.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Comparison Of Various Code Optimization Methods"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "There are a number of ways to optimize the performance of Python code. Below is a short summary adapted from http://people.duke.edu/~ccc14/sta-663/MakingCodeFast.html regarding various optimization strategies.\n",
15 | "\n",
16 | "There is a traditional sequence for writing code, and it goes like this:\n",
17 | "\n",
18 | "1) Make it run
\n",
19 | "2) Make it right (testing)
\n",
20 | "3) Make it fast (optimization)
\n",
21 | "\n",
22 | "Making it fast is the last step, and you should only optimize when it is necessary. Also, it is good to know when a program is “fast enough” for your needs. Optimization has a price:\n",
23 | "\n",
24 | "1) Cost in programmer time
\n",
25 | "2) Optimized code is often more complex
\n",
26 | "3) Optimized code is often less generic
\n",
27 | "\n",
28 | "However, having fast code is often necessary for statistical computing, so we will spend some time learning how to make code run faster. To do so, we need to understand why our code is slow. Code can be slow because of differnet resource limitations:\n",
29 | "\n",
30 | "CPU-bound - CPU is working flat out
\n",
31 | "Memory-bound - Out of RAM - swapping to hard disk
\n",
32 | "IO-bound - Lots of data transfer to and from hard disk
\n",
33 | "Network-bound - CPU is waiting for data to come over network or from memory (“starvation”)
\n",
34 | "\n",
35 | "Different bottlenekcs may require different approaches. However, there is a natural order to making code fast:\n",
36 | "\n",
37 | "1) Cheat\n",
38 | "* Use a better machine (e.g. if RAM is limiting, buy more RAM)\n",
39 | "* Solve a simpler problem (e.g. will a subsample of the data suffice?)\n",
40 | "* Solve a different problem\n",
41 | "\n",
42 | "2) Find out what is slowing down the code (profiling)\n",
43 | "* Using timeit\n",
44 | "* Using cProfile\n",
45 | "* Using memory_profiler\n",
46 | "\n",
47 | "3) Use better algorithms and data structures\n",
48 | "\n",
49 | "4) Off-load heavy computations to numpy/scipy\n",
50 | "\n",
51 | "5) Use compiled code written in another language\n",
52 | "* Calling code written in C (ctypes, cython)\n",
53 | "* Calling code written in Fotran (f2py)\n",
54 | "* Calling code written in Julia (pyjulia)\n",
55 | "\n",
56 | "6) Convert Python code to compiled code\n",
57 | "* Using numexpr\n",
58 | "* Using numba\n",
59 | "* Using cython\n",
60 | "\n",
61 | "7) Write parallel programs\n",
62 | "* Ahmdahl and Gustafsson’s laws\n",
63 | "* Embarassinlgy parallel problems\n",
64 | "* Problems requiring communication and syncrhonization\n",
65 | "\n",
66 | "8) Execute in parallel\n",
67 | "* On multi-core machines\n",
68 | "* On multiple machines\n",
69 | "* On GPUs\n",
70 | "\n",
71 | "This notebook will focus on 4 and 6. We will use the example of calculating the pairwsise Euclidean distance between all points. Examples are adapted from http://people.duke.edu/~ccc14/sta-663/Optimization_Bakeoff.html."
72 | ]
73 | },
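Expanding briefly on item 2 (profiling) before moving on to the implementations: a minimal sketch of `timeit` and `cProfile` usage. It assumes the `pdist_python` function and `xs` array defined in the cells below have already been run.

```python
import cProfile
import timeit

# wall-clock timing of a callable (roughly what the %timeit magic below reports)
print(timeit.timeit(lambda: pdist_python(xs), number=3))

# function-level profile: which calls the time is actually spent in
cProfile.run('pdist_python(xs)', sort='cumulative')
```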
74 | {
75 | "cell_type": "code",
76 | "execution_count": 1,
77 | "metadata": {
78 | "collapsed": true
79 | },
80 | "outputs": [],
81 | "source": [
82 | "%matplotlib inline\n",
83 | "%precision 2\n",
84 | "import numpy as np\n",
85 | "import matplotlib.pyplot as plt\n",
86 | "import numexpr as ne\n",
87 | "from numba import jit"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 2,
93 | "metadata": {
94 | "collapsed": false
95 | },
96 | "outputs": [
97 | {
98 | "data": {
99 | "text/plain": [
100 | "(1000, 3)"
101 | ]
102 | },
103 | "execution_count": 2,
104 | "metadata": {},
105 | "output_type": "execute_result"
106 | }
107 | ],
108 | "source": [
109 | "xs = np.random.random((1000, 3))\n",
110 | "xs.shape"
111 | ]
112 | },
113 | {
114 | "cell_type": "markdown",
115 | "metadata": {},
116 | "source": [
117 | "## Python"
118 | ]
119 | },
120 | {
121 | "cell_type": "markdown",
122 | "metadata": {},
123 | "source": [
124 | "This is the pure python version of the algorithm. Used as a baseline for comparison."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 3,
130 | "metadata": {
131 | "collapsed": true
132 | },
133 | "outputs": [],
134 | "source": [
135 | "def pdist_python(xs):\n",
136 | " n, p = xs.shape\n",
137 | " D = np.empty((n, n), np.float)\n",
138 | " for i in range(n):\n",
139 | " for j in range(n):\n",
140 | " s = 0.0\n",
141 | " for k in range(p):\n",
142 | " tmp = xs[i,k] - xs[j,k]\n",
143 | " s += tmp * tmp\n",
144 | " D[i, j] = s**0.5\n",
145 | " return D"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": 4,
151 | "metadata": {
152 | "collapsed": false
153 | },
154 | "outputs": [
155 | {
156 | "name": "stdout",
157 | "output_type": "stream",
158 | "text": [
159 | "10 loops, best of 3: 3.17 s per loop\n"
160 | ]
161 | }
162 | ],
163 | "source": [
164 | "%timeit -n 10 pdist_python(xs)"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "## Numpy"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "NumPy is the fundamental package for scientific computing with Python. It contains among other things:\n",
179 | "\n",
180 | "* a powerful N-dimensional array object\n",
181 | "* sophisticated (broadcasting) functions\n",
182 | "* tools for integrating C/C++ and Fortran code\n",
183 | "* useful linear algebra, Fourier transform, and random number capabilities\n",
184 | "\n",
185 | "Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.\n",
186 | "\n",
187 | "Library documentation: http://www.numpy.org/"
188 | ]
189 | },
190 | {
191 | "cell_type": "code",
192 | "execution_count": 5,
193 | "metadata": {
194 | "collapsed": true
195 | },
196 | "outputs": [],
197 | "source": [
198 | "def pdist_numpy(xs):\n",
199 | " return np.sqrt(((xs[:,None,:] - xs)**2).sum(-1))"
200 | ]
201 | },
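The one-liner above leans entirely on broadcasting; a short sketch of the intermediate shapes (using the same `xs`) may make it easier to follow:

```python
a = xs[:, None, :]                        # shape (1000, 1, 3)
diff = a - xs                             # broadcasts against (1000, 3) -> (1000, 1000, 3)
D = np.sqrt((diff ** 2).sum(axis=-1))     # sum over the coordinate axis, then square root
print(diff.shape, np.allclose(D, pdist_numpy(xs)))
```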
202 | {
203 | "cell_type": "code",
204 | "execution_count": 6,
205 | "metadata": {
206 | "collapsed": false
207 | },
208 | "outputs": [
209 | {
210 | "name": "stdout",
211 | "output_type": "stream",
212 | "text": [
213 | "100 loops, best of 3: 50.3 ms per loop\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "%timeit -n 100 pdist_numpy(xs)"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "metadata": {},
224 | "source": [
225 | "## Numexpr"
226 | ]
227 | },
228 | {
229 | "cell_type": "markdown",
230 | "metadata": {},
231 | "source": [
232 | "Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like \"3\\*a+4\\*b\") are accelerated and use less memory than doing the same calculation in Python.\n",
233 | "\n",
234 | "In addition, its multi-threaded capabilities can make use of all your cores, which may accelerate computations, most specially if they are not memory-bounded (e.g. those using transcendental functions).\n",
235 | "\n",
236 | "Library documentation: https://github.com/pydata/numexpr"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 7,
242 | "metadata": {
243 | "collapsed": true
244 | },
245 | "outputs": [],
246 | "source": [
247 | "def pdist_numexpr(xs):\n",
248 | " a = xs[:, np.newaxis, :]\n",
249 | " return np.sqrt(ne.evaluate('sum((a-xs)**2, axis=2)'))"
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": 8,
255 | "metadata": {
256 | "collapsed": false
257 | },
258 | "outputs": [
259 | {
260 | "name": "stdout",
261 | "output_type": "stream",
262 | "text": [
263 | "100 loops, best of 3: 19.4 ms per loop\n"
264 | ]
265 | }
266 | ],
267 | "source": [
268 | "%timeit -n 100 pdist_numexpr(xs)"
269 | ]
270 | },
271 | {
272 | "cell_type": "markdown",
273 | "metadata": {},
274 | "source": [
275 | "## Numba"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.\n",
283 | "\n",
284 | "Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.\n",
285 | "\n",
286 | "Library documentation: http://numba.pydata.org/"
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": 9,
292 | "metadata": {
293 | "collapsed": true
294 | },
295 | "outputs": [],
296 | "source": [
297 | "pdist_numba = jit(pdist_python)"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 10,
303 | "metadata": {
304 | "collapsed": false
305 | },
306 | "outputs": [
307 | {
308 | "name": "stdout",
309 | "output_type": "stream",
310 | "text": [
311 | "100 loops, best of 3: 11.1 ms per loop\n"
312 | ]
313 | }
314 | ],
315 | "source": [
316 | "%timeit -n 100 pdist_numba(xs)"
317 | ]
318 | },
319 | {
320 | "cell_type": "markdown",
321 | "metadata": {},
322 | "source": [
323 | "## Cython"
324 | ]
325 | },
326 | {
327 | "cell_type": "markdown",
328 | "metadata": {},
329 | "source": [
330 | "Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language. It makes writing C extensions for Python as easy as Python itself.\n",
331 | "Cython gives you the combined power of Python and C to let you:\n",
332 | "\n",
333 | "* write Python code that calls back and forth from and to C or C++ code natively at any point\n",
334 | "* easily tune readable Python code into plain C performance by adding static type declarations\n",
335 | "* use combined source code level debugging to find bugs in your Python, Cython and C code\n",
336 | "* interact efficiently with large data sets, e.g. using multi-dimensional NumPy arrays\n",
337 | "* quickly build your applications within the large, mature and widely used CPython ecosystem\n",
338 | "* integrate natively with existing code and data from legacy, low-level or high-performance libraries and applications\n",
339 | "\n",
340 | "Library details here: http://cython.org/"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 11,
346 | "metadata": {
347 | "collapsed": true
348 | },
349 | "outputs": [],
350 | "source": [
351 | "%load_ext Cython"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": 12,
357 | "metadata": {
358 | "collapsed": false
359 | },
360 | "outputs": [],
361 | "source": [
362 | "%%cython\n",
363 | "\n",
364 | "import numpy as np\n",
365 | "cimport cython\n",
366 | "from libc.math cimport sqrt\n",
367 | "\n",
368 | "@cython.boundscheck(False)\n",
369 | "@cython.wraparound(False)\n",
370 | "def pdist_cython(double[:, ::1] xs):\n",
371 | " cdef int n = xs.shape[0]\n",
372 | " cdef int p = xs.shape[1]\n",
373 | " cdef double tmp, d\n",
374 | " cdef double[:, ::1] D = np.empty((n, n), dtype=np.float)\n",
375 | " for i in range(n):\n",
376 | " for j in range(n):\n",
377 | " d = 0.0\n",
378 | " for k in range(p):\n",
379 | " tmp = xs[i, k] - xs[j, k]\n",
380 | " d += tmp * tmp\n",
381 | " D[i, j] = sqrt(d)\n",
382 | " return np.asarray(D)"
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": 13,
388 | "metadata": {
389 | "collapsed": false
390 | },
391 | "outputs": [
392 | {
393 | "name": "stdout",
394 | "output_type": "stream",
395 | "text": [
396 | "100 loops, best of 3: 11 ms per loop\n"
397 | ]
398 | }
399 | ],
400 | "source": [
401 | "%timeit -n 100 pdist_cython(xs)"
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "## Scipy"
409 | ]
410 | },
411 | {
412 | "cell_type": "markdown",
413 | "metadata": {},
414 | "source": [
415 | "Scipy has an optimized version of this particular function already built in. It exploits symmetry in the problem that we're not taking advantage of it in the \"naive\" implementations above.\n",
416 | "\n",
417 | "Library documentation: http://www.scipy.org/"
418 | ]
419 | },
420 | {
421 | "cell_type": "code",
422 | "execution_count": 14,
423 | "metadata": {
424 | "collapsed": true
425 | },
426 | "outputs": [],
427 | "source": [
428 | "from scipy.spatial.distance import pdist as pdist_scipy"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": 15,
434 | "metadata": {
435 | "collapsed": false
436 | },
437 | "outputs": [
438 | {
439 | "name": "stdout",
440 | "output_type": "stream",
441 | "text": [
442 | "100 loops, best of 3: 5.79 ms per loop\n"
443 | ]
444 | }
445 | ],
446 | "source": [
447 | "%timeit -n 100 pdist_scipy(xs)"
448 | ]
449 | },
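A quick note on what "exploits symmetry" means in practice: `pdist` returns only the n*(n-1)/2 unique distances as a condensed 1-D array, and `scipy.spatial.distance.squareform` expands that back into the full symmetric matrix produced by the other implementations. A small sketch to verify they agree:

```python
from scipy.spatial.distance import squareform

condensed = pdist_scipy(xs)
print(condensed.shape)                    # (499500,) == 1000 * 999 / 2 unique pairs
D = squareform(condensed)                 # full symmetric 1000 x 1000 distance matrix
print(np.allclose(D, pdist_numpy(xs)))    # should agree with the numpy version
```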
450 | {
451 | "cell_type": "markdown",
452 | "metadata": {},
453 | "source": [
454 | "## Summary"
455 | ]
456 | },
457 | {
458 | "cell_type": "markdown",
459 | "metadata": {},
460 | "source": [
461 | "Here's all of them together."
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 16,
467 | "metadata": {
468 | "collapsed": false
469 | },
470 | "outputs": [
471 | {
472 | "name": "stdout",
473 | "output_type": "stream",
474 | "text": [
475 | "Python\n",
476 | "10 loops, best of 3: 3.15 s per loop\n",
477 | "Numpy\n",
478 | "100 loops, best of 3: 49.5 ms per loop\n",
479 | "Numexpr\n",
480 | "100 loops, best of 3: 19.3 ms per loop\n",
481 | "Numba\n",
482 | "100 loops, best of 3: 11 ms per loop\n",
483 | "Cython\n",
484 | "100 loops, best of 3: 11 ms per loop\n",
485 | "Scipy\n",
486 | "100 loops, best of 3: 5.68 ms per loop\n"
487 | ]
488 | }
489 | ],
490 | "source": [
491 | "print('Python')\n",
492 | "%timeit -n 10 pdist_python(xs)\n",
493 | "print('Numpy')\n",
494 | "%timeit -n 100 pdist_numpy(xs)\n",
495 | "print('Numexpr')\n",
496 | "%timeit -n 100 pdist_numexpr(xs)\n",
497 | "print('Numba')\n",
498 | "%timeit -n 100 pdist_numba(xs)\n",
499 | "print('Cython')\n",
500 | "%timeit -n 100 pdist_cython(xs)\n",
501 | "print('Scipy')\n",
502 | "%timeit -n 100 pdist_scipy(xs)"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "Some observations:\n",
510 | "\n",
511 | "* Pure python is much, much slower than all of the other methods (close to 1000x difference!)\n",
512 | "* Simply using Numpy where possible results in a huge speed-up\n",
513 | "* Numba is surprisingly effective given how easy it is to utilize, on par with compiled C code using Cython\n",
514 | "* Algorithm optimizations (such as those employed in the Scipy implementation) can easily trump other methods"
515 | ]
516 | }
517 | ],
518 | "metadata": {
519 | "kernelspec": {
520 | "display_name": "Python 2",
521 | "language": "python",
522 | "name": "python2"
523 | },
524 | "language_info": {
525 | "codemirror_mode": {
526 | "name": "ipython",
527 | "version": 2
528 | },
529 | "file_extension": ".py",
530 | "mimetype": "text/x-python",
531 | "name": "python",
532 | "nbconvert_exporter": "python",
533 | "pygments_lexer": "ipython2",
534 | "version": "2.7.9"
535 | }
536 | },
537 | "nbformat": 4,
538 | "nbformat_minor": 0
539 | }
540 |
--------------------------------------------------------------------------------
/notebooks/misc/DynamicProgramming.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Solving Problems With Dynamic Programming"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Dynamic programming is a really useful general technique for solving problems that involves breaking down problems into smaller overlapping sub-problems, storing the results computed from the sub-problems and reusing those results on larger chunks of the problem. Dynamic programming solutions are pretty much always more efficent than naive brute-force solutions. Dynamic programming techniques are particularly effective on problems that contain [optimal substructure](https://en.wikipedia.org/wiki/Optimal_substructure).\n",
15 | "\n",
16 | "Dynamic programming is related to a number of other fundamental concepts in computer science in interesting ways. Recursion, for example, is similar to (but not identical to) dynamic programming. The key difference is that in a naive recursive solution, answers to sub-problems may be computed many times. A recursive solution that caches answers to sub-problems which were already computed is called [memoization](https://en.wikipedia.org/wiki/Memoization), which is basically the inverse of dynamic programming. Another variation is when the sub-problems don't actually overlap at all, in which case the technique is known as [divide and conquer](https://en.wikipedia.org/wiki/Divide_and_conquer_algorithms). Finally, dynamic programming is tied to the concept of [mathematical induction](https://en.wikipedia.org/wiki/Mathematical_induction) and can be thought of as a specific application of inductive reasoning in practice.\n",
17 | "\n",
18 | "While the core ideas behind dynamic programming are actually pretty simple, it turns out that it's fairly challenging to use on non-trivial problems because it's often not obvious how to frame a difficult problem in terms of overlapping sub-problems. This is where experience and practice come in handy, which is the idea for this notebook. We'll build both naive and \"intelligent\" solutions to several well-known problems and see how the problems are decomposed to use dynamic programming solutions."
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Fibonacci Numbers"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "First we'll look at the problem of computing numbers in the [Fibonacci sequence](https://en.wikipedia.org/wiki/Fibonacci_number). The problem definition is very simple - each number in the sequence is the sum of the two previous numbers in the sequence. Or, more formally:\n",
33 | "\n",
34 | "$F_n = F_{n-1} + F_{n-2}$, with $F_0 = 0$ and $F_1 = 1$ as the seed values.\n",
35 | "\n",
36 | "Our solution will be responsible for calculating each of Fibonacci numbers up to some defined limit. We'll first implement a naive solution that re-calculates each number in the sequence from scratch."
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 1,
42 | "metadata": {
43 | "collapsed": false
44 | },
45 | "outputs": [],
46 | "source": [
47 | "def fib(n):\n",
48 | " if n == 0:\n",
49 | " return 0\n",
50 | " if n == 1:\n",
51 | " return 1\n",
52 | " \n",
53 | " return fib(n - 1) + fib(n - 2)"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 2,
59 | "metadata": {
60 | "collapsed": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "def all_fib(n):\n",
65 | " fibs = []\n",
66 | " for i in range(n):\n",
67 | " fibs.append(fib(i))\n",
68 | " \n",
69 | " return fibs"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "Let's try it out on a pretty small number first."
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 3,
82 | "metadata": {
83 | "collapsed": false
84 | },
85 | "outputs": [
86 | {
87 | "name": "stdout",
88 | "output_type": "stream",
89 | "text": [
90 | "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]\n",
91 | "Wall time: 0 ns\n"
92 | ]
93 | }
94 | ],
95 | "source": [
96 | "%time print(all_fib(10))"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "Okay, probably too trivial. Let's try a bit bigger..."
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 4,
109 | "metadata": {
110 | "collapsed": false
111 | },
112 | "outputs": [
113 | {
114 | "name": "stdout",
115 | "output_type": "stream",
116 | "text": [
117 | "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181]\n",
118 | "Wall time: 5 ms\n"
119 | ]
120 | }
121 | ],
122 | "source": [
123 | "%time print(all_fib(20))"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "The runtime was at least measurable now, but still pretty quick. Let's try one more time..."
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 5,
136 | "metadata": {
137 | "collapsed": false
138 | },
139 | "outputs": [
140 | {
141 | "name": "stdout",
142 | "output_type": "stream",
143 | "text": [
144 | "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986]\n",
145 | "Wall time: 1min 9s\n"
146 | ]
147 | }
148 | ],
149 | "source": [
150 | "%time print(all_fib(40))"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "That escalated quickly! Clearly this is a pretty bad solution. Let's see what it looks like when applying dynamic programming."
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 6,
163 | "metadata": {
164 | "collapsed": true
165 | },
166 | "outputs": [],
167 | "source": [
168 | "def all_fib_dp(n):\n",
169 | " fibs = []\n",
170 | " for i in range(n):\n",
171 | " if i < 2:\n",
172 | " fibs.append(i)\n",
173 | " else:\n",
174 | " fibs.append(fibs[i - 2] + fibs[i - 1])\n",
175 | " \n",
176 | " return fibs"
177 | ]
178 | },
179 | {
180 | "cell_type": "markdown",
181 | "metadata": {},
182 | "source": [
183 | "This time we're saving the result at each iteration and computing new numbers as a sum of the previously saved results. Let's see what this does to the performance of the function."
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 7,
189 | "metadata": {
190 | "collapsed": false
191 | },
192 | "outputs": [
193 | {
194 | "name": "stdout",
195 | "output_type": "stream",
196 | "text": [
197 | "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986]\n",
198 | "Wall time: 0 ns\n"
199 | ]
200 | }
201 | ],
202 | "source": [
203 | "%time print(all_fib_dp(40))"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "By not computing the full recusrive tree on each iteration, we've essentially reduced the running time for the first 40 numbers from ~75 seconds to virtually instant. This also happens to be a good example of the danger of naive recursive functions. Our new Fibonaci number function can compute additional values in linear time vs. exponential time for the first version."
211 | ]
212 | },
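For completeness, the recursive version can also be rescued with the memoization idea mentioned in the introduction: keep the top-down recursion but cache each answer the first time it is computed. A minimal sketch:

```python
def fib_memo(n, cache=None):
    # same recursion as fib() above, but each sub-problem is solved only once
    if cache is None:
        cache = {}
    if n < 2:
        return n
    if n not in cache:
        cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
    return cache[n]

print([fib_memo(i) for i in range(40)])   # each call is linear, so this finishes instantly
```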
213 | {
214 | "cell_type": "code",
215 | "execution_count": 8,
216 | "metadata": {
217 | "collapsed": false
218 | },
219 | "outputs": [
220 | {
221 | "name": "stdout",
222 | "output_type": "stream",
223 | "text": [
224 | "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073L, 4807526976L, 7778742049L, 12586269025L, 20365011074L, 32951280099L, 53316291173L, 86267571272L, 139583862445L, 225851433717L, 365435296162L, 591286729879L, 956722026041L, 1548008755920L, 2504730781961L, 4052739537881L, 6557470319842L, 10610209857723L, 17167680177565L, 27777890035288L, 44945570212853L, 72723460248141L, 117669030460994L, 190392490709135L, 308061521170129L, 498454011879264L, 806515533049393L, 1304969544928657L, 2111485077978050L, 3416454622906707L, 5527939700884757L, 8944394323791464L, 14472334024676221L, 23416728348467685L, 37889062373143906L, 61305790721611591L, 99194853094755497L, 160500643816367088L, 259695496911122585L, 420196140727489673L, 679891637638612258L, 1100087778366101931L, 1779979416004714189L, 2880067194370816120L, 4660046610375530309L, 7540113804746346429L, 12200160415121876738L, 19740274219868223167L, 31940434634990099905L, 51680708854858323072L, 83621143489848422977L, 135301852344706746049L, 218922995834555169026L]\n",
225 | "Wall time: 0 ns\n"
226 | ]
227 | }
228 | ],
229 | "source": [
230 | "%time print(all_fib_dp(100))"
231 | ]
232 | },
233 | {
234 | "cell_type": "markdown",
235 | "metadata": {},
236 | "source": [
237 | "## Longest Increasing Subsequence"
238 | ]
239 | },
240 | {
241 | "cell_type": "markdown",
242 | "metadata": {},
243 | "source": [
244 | "The Fibonacci problem is a good starter example but doesn't really capture the challenge of representing problems in terms of optimal sub-problems because for Fibonacci numbers the answer is pretty obvious. Let's move up one step in difficulty to a problem known as the [longest increasing subsequence](https://en.wikipedia.org/wiki/Longest_increasing_subsequence) problem. The objective is to find the longest subsequence of a given sequence such that all elements in the subsequence are sorted in increasing order. Note that the elements do not need to be contiguous; that is, they are not required to appear next to each other. For example, in the sequence [ 10, 22, 9, 33, 21, 50, 41, 60, 80 ] the longest increasing subsequence (LIS) is [10, 22, 33, 50, 60, 80].\n",
245 | "\n",
246 | "It turns out that it's fairly difficult to do a \"brute-force\" solution to this problem. The dynamic programming solution is much more concise and a natural fit for the problem definition, so we'll skip creating an unnecessarily complicated naive solution and jump straight to the DP solution."
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 9,
252 | "metadata": {
253 | "collapsed": true
254 | },
255 | "outputs": [],
256 | "source": [
257 | "def find_lis(seq):\n",
258 | " n = len(seq)\n",
259 | " max_length = 1\n",
260 | " best_seq_end = -1\n",
261 | " \n",
262 | " # keep a chain of the values of the lis\n",
263 | " prev = [0 for i in range(n)]\n",
264 | " prev[0] = -1\n",
265 | " \n",
266 | " # the length of the lis at each position\n",
267 | " length = [0 for i in range(n)]\n",
268 | " length[0] = 1\n",
269 | " \n",
270 | " for i in range(1, n):\n",
271 | " length[i] = 0\n",
272 | " prev[i] = -1\n",
273 | " \n",
274 | " # start from index i-1 and work back to 0\n",
275 | " for j in range(i - 1, -1, -1):\n",
276 | " if (length[j] + 1) > length[i] and seq[j] < seq[i]:\n",
277 | " # there's a number before position i that increases the lis at i\n",
278 | " length[i] = length[j] + 1\n",
279 | " prev[i] = j\n",
280 | " \n",
281 | " if length[i] > max_length:\n",
282 | " max_length = length[i]\n",
283 | " best_seq_end = i\n",
284 | " \n",
285 | " # recover the subsequence\n",
286 | " lis = []\n",
287 | " element = best_seq_end\n",
288 | " while element != -1:\n",
289 | " lis.append(seq[element])\n",
290 | " element = prev[element]\n",
291 | " \n",
292 | " return lis[::-1]"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "The intuition here is that for a given index $i$, we can compute the length of the longest increasing subsequence $length(i)$ by looking at all indices $j < i$ and if $length(j) + 1 > i$ and $seq[j] < seq[i]$ (meaning there's a number at position $j$ that increases the longest subsequence at that index such that it is now longer than the longest recorded subsequence at $i$) then we increase $length(i)$ by 1. It's a bit confusing at first glance but step through it carefully and convince yourself that this solution finds the optimal subsequence. The \"prev\" list holds the indices of the elements that form the actual values in the subsequence.\n",
300 | "\n",
301 | "Let's generate some test data and try it out."
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": 10,
307 | "metadata": {
308 | "collapsed": false
309 | },
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "[16, 10, 17, 18, 9, 0, 2, 19, 4, 3, 1, 14, 12, 6, 2, 4, 11, 5, 19, 4]"
315 | ]
316 | },
317 | "execution_count": 10,
318 | "metadata": {},
319 | "output_type": "execute_result"
320 | }
321 | ],
322 | "source": [
323 | "import numpy as np\n",
324 | "seq_small = list(np.random.randint(0, 20, 20))\n",
325 | "seq_small"
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": 11,
331 | "metadata": {
332 | "collapsed": false
333 | },
334 | "outputs": [
335 | {
336 | "name": "stdout",
337 | "output_type": "stream",
338 | "text": [
339 | "[0, 1, 2, 4, 5, 19]\n",
340 | "Wall time: 0 ns\n"
341 | ]
342 | }
343 | ],
344 | "source": [
345 | "%time print(find_lis(seq_small))"
346 | ]
347 | },
348 | {
349 | "cell_type": "markdown",
350 | "metadata": {},
351 | "source": [
352 | "Just based on the eye test the output looks correct. Let's see how well it performs on much larger sequences."
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": 12,
358 | "metadata": {
359 | "collapsed": false
360 | },
361 | "outputs": [
362 | {
363 | "name": "stdout",
364 | "output_type": "stream",
365 | "text": [
366 | "[29, 94, 125, 159, 262, 271, 274, 345, 375, 421, 524, 536, 668, 689, 694, 755, 763, 774, 788, 854, 916, 1018, 1022, 1098, 1136, 1154, 1172, 1237, 1325, 1361, 1400, 1401, 1406, 1450, 1498, 1633, 1693, 1745, 1765, 1793, 1835, 1949, 1997, 2069, 2072, 2096, 2157, 2336, 2345, 2468, 2519, 2529, 2624, 2630, 2924, 3103, 3291, 3321, 3380, 3546, 3635, 3657, 3668, 3703, 3775, 3836, 3850, 3961, 4002, 4004, 4039, 4060, 4128, 4361, 4377, 4424, 4432, 4460, 4465, 4493, 4540, 4595, 4693, 4732, 4735, 4766, 4831, 4850, 4873, 4908, 4940, 4969, 5013, 5073, 5087, 5139, 5144, 5271, 5280, 5299, 5300, 5355, 5393, 5430, 5536, 5538, 5559, 5565, 5822, 5891, 5895, 5906, 6157, 6199, 6286, 6369, 6431, 6450, 6510, 6533, 6577, 6585, 6683, 6701, 6740, 6745, 6829, 6853, 6863, 6872, 6884, 6923, 6925, 7009, 7019, 7028, 7040, 7170, 7235, 7304, 7356, 7377, 7416, 7490, 7495, 7662, 7676, 7703, 7808, 7925, 7971, 8036, 8073, 8282, 8295, 8332, 8342, 8360, 8429, 8454, 8499, 8557, 8585, 8639, 8649, 8725, 8759, 8831, 8860, 8899, 8969, 9046, 9146, 9161, 9245, 9270, 9374, 9451, 9465, 9515, 9522, 9525, 9527, 9664, 9770, 9781, 9787, 9914, 9993]\n",
367 | "Wall time: 4.94 s\n"
368 | ]
369 | }
370 | ],
371 | "source": [
372 | "seq = list(np.random.randint(0, 10000, 10000))\n",
373 | "%time print(find_lis(seq))"
374 | ]
375 | },
376 | {
377 | "cell_type": "markdown",
378 | "metadata": {},
379 | "source": [
380 | "So it's still pretty fast, but the difference is definitely noticable. At 10,000 integers in the sequence our algorithm already takes several seconds to complete. In fact, even though this solution uses dynamic programming its runtime is still $O(n^2)$. The lesson here is that dynamic programming doesn't always result in lightning-fast solutions. There are also different ways to apply DP to the same problem. In fact there's a solution to this problem that uses binary search trees and runs in $O(nlogn)$ time, significantly better than the solution we just came up with."
381 | ]
382 | },
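For reference, here is a hedged sketch (mine, not from the notebook) of that binary-search idea using Python's `bisect` module; the helper name `lis_length` is my own, and it returns only the length of the LIS, not the subsequence itself:

```python
import bisect

def lis_length(seq):
    # tails[k] = smallest tail value of any increasing subsequence of length k+1
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[i] = x      # x becomes a smaller tail for subsequences of length i+1
    return len(tails)

print(lis_length([10, 22, 9, 33, 21, 50, 41, 60, 80]))  # 6
```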
383 | {
384 | "cell_type": "markdown",
385 | "metadata": {},
386 | "source": [
387 | "## Knapsack Problem"
388 | ]
389 | },
390 | {
391 | "cell_type": "markdown",
392 | "metadata": {},
393 | "source": [
394 | "The [knapsack problem](https://en.wikipedia.org/wiki/Knapsack_problem) is another classic dynamic programming exercise. The generalization of this problem is very old and comes in many variations, and there are actually multiple ways to tackle this problem aside from dynamic programming. Still, it's a common example for DP exercises.\n",
395 | "\n",
396 | "The problem at its core is one of combinatorial optimization. Given a set of items, each with a mass and a value, determine the collection of items that results in the highest possible value while not exceeding some limit on the total weight. The variation we'll look at is commonly referred to as the 0-1 knapsack problem, which restricts the number of copies of each kind of item to 0 or 1. More formally, given a set of $n$ items each with weight $w_i$ and value $v_i$ along with a maximum total weight $W$, our objective is:\n",
397 | "\n",
398 | "$\\Large max \\Sigma v_i x_i$, where $\\Large \\Sigma w_i x_i \\leq W$\n",
399 | "\n",
400 | "Let's see what the implementation looks like then discuss why it works."
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": 13,
406 | "metadata": {
407 | "collapsed": true
408 | },
409 | "outputs": [],
410 | "source": [
411 | "def knapsack(W, w, v):\n",
412 | " # create a W x n solution matrix to store the sub-problem results\n",
413 | " n = len(v)\n",
414 | " S = [[0 for x in range(W)] for k in range(n)]\n",
415 | " \n",
416 | " for x in range(1, W):\n",
417 | " for k in range(1, n):\n",
418 | " # using this notation k is the number of items in the solution and x is the max weight of the solution,\n",
419 | " # so the initial assumption is that the optimal solution with k items at weight x is at least as good\n",
420 | " # as the optimal solution with k-1 items for the same max weight\n",
421 | " S[k][x] = S[k-1][x]\n",
422 | " \n",
423 | " # if the current item weighs less than the max weight and the optimal solution including this item is \n",
424 | " # better than the current optimum, the new optimum is the one resulting from including the current item\n",
425 | " if w[k] < x and S[k-1][x-w[k]] + v[k] > S[k][x]:\n",
426 | " S[k][x] = S[k-1][x-w[k]] + v[k]\n",
427 | " \n",
428 | " return S"
429 | ]
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {},
434 | "source": [
435 | "The intuition behind this algorithm is that once you've solved for the optimal combination of items at some weight $x < W$ and with some number of items $k < n$, then it's easy to solve the problem with one more item or one higher max weight because you can just check to see if the solution obtained by incorporating that item is better than the best solution you've already found. So how do you get the initial solution? Keep going down the rabbit hole until to reach 0 (in which case the answer is 0). At first glance it's very hard to grasp, but that's part of the magic of dynamic programming. Let's run an example to see what it looks like."
436 | ]
437 | },
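Written as a recurrence (my restatement of the logic in the `knapsack` function above, not a formula taken from the exercise text), each table entry is:

$$
S[k][x] \;=\; \begin{cases} \max\!\big(S[k-1][x],\; S[k-1][x - w_k] + v_k\big) & \text{if } w_k < x \\ S[k-1][x] & \text{otherwise,} \end{cases}
$$

with $S[0][x] = S[k][0] = 0$ as the base cases (every entry starts at zero and the loops begin at index 1).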
438 | {
439 | "cell_type": "code",
440 | "execution_count": 14,
441 | "metadata": {
442 | "collapsed": false
443 | },
444 | "outputs": [
445 | {
446 | "data": {
447 | "text/plain": [
448 | "([3, 9, 3, 6, 5], [40, 45, 72, 77, 16])"
449 | ]
450 | },
451 | "execution_count": 14,
452 | "metadata": {},
453 | "output_type": "execute_result"
454 | }
455 | ],
456 | "source": [
457 | "w = list(np.random.randint(0, 10, 5))\n",
458 | "v = list(np.random.randint(0, 100, 5))\n",
459 | "w, v"
460 | ]
461 | },
462 | {
463 | "cell_type": "code",
464 | "execution_count": 15,
465 | "metadata": {
466 | "collapsed": false
467 | },
468 | "outputs": [
469 | {
470 | "data": {
471 | "text/plain": [
472 | "[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
473 | " [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 45, 45, 45, 45, 45],\n",
474 | " [0, 0, 0, 0, 72, 72, 72, 72, 72, 72, 72, 72, 72, 117, 117],\n",
475 | " [0, 0, 0, 0, 72, 72, 72, 77, 77, 77, 149, 149, 149, 149, 149],\n",
476 | " [0, 0, 0, 0, 72, 72, 72, 77, 77, 88, 149, 149, 149, 149, 149]]"
477 | ]
478 | },
479 | "execution_count": 15,
480 | "metadata": {},
481 | "output_type": "execute_result"
482 | }
483 | ],
484 | "source": [
485 | "knapsack(15, w, v)"
486 | ]
487 | },
488 | {
489 | "cell_type": "markdown",
490 | "metadata": {},
491 | "source": [
492 | "The output here is the array of optimal values for a given max weight (think of it as the column index) and max number of items (the row index). Notice how the output follows what looks sort of like a wavefront pattern. This seems to be a recurring phenomenon with dynamic programming solutions. The value in the lower right corner is the max value that we were looking for under the given constraints and is the answer to the problem."
493 | ]
494 | }
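One natural follow-up (a sketch of my own, not part of the notebook) is to recover which items produce that optimal value by walking the table backwards from the lower-right corner; the helper name `recover_items` is mine, and it assumes `S`, `w`, and `v` come from the `knapsack` example above:

```python
# Sketch: if S[k][x] differs from S[k-1][x], item k must have been included,
# so record it and move the remaining weight budget down by w[k].
def recover_items(S, W, w, v):
    items = []
    k, x = len(v) - 1, W - 1          # start at the lower-right corner of S
    while k > 0:
        if S[k][x] != S[k - 1][x]:    # the value changed, so item k was included
            items.append(k)
            x -= w[k]
        k -= 1
    return items[::-1]

S = knapsack(15, w, v)
print(recover_items(S, 15, w, v))     # e.g. [2, 3] for the weights/values shown above
```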
495 | ],
496 | "metadata": {
497 | "kernelspec": {
498 | "display_name": "Python 2",
499 | "language": "python",
500 | "name": "python2"
501 | },
502 | "language_info": {
503 | "codemirror_mode": {
504 | "name": "ipython",
505 | "version": 2
506 | },
507 | "file_extension": ".py",
508 | "mimetype": "text/x-python",
509 | "name": "python",
510 | "nbconvert_exporter": "python",
511 | "pygments_lexer": "ipython2",
512 | "version": "2.7.11"
513 | }
514 | },
515 | "nbformat": 4,
516 | "nbformat_minor": 0
517 | }
518 |
--------------------------------------------------------------------------------
/notebooks/misc/MarkovChains.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Markov Chains From Scratch"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "Sometimes it pays to go back to basics. Data science is a massive, complicated field with seemingly endless topics to learn about. But in our rush to learn about the latest deep learning trends, it's easy to forget that there are simple yet powerful techniques right under our noses. In this notebook we'll explore one such technique called a Markov chain. By building one from scratch using nothing but standard Python libraries, we'll see how simplistic they can be while also yielding some cool results.\n",
15 | "\n",
16 | "Markov chains are essentially a way to capture the probability of state transitions in a system. A process can be considered a Markov process if one can make predictions about the future state of the process based soley on its present state (or several of the most recent states for a higher-order Markov process). In other words, the history doesn't matter beyond a certain point. There are lots of great explainers out there so I'll leave that for the reader to explore independently ([this one](http://setosa.io/ev/markov-chains/) is my favorite). It will become clearer as we step through the code, so let's dive in."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "For this example we're going to build a language-based Markov chain. More specifically, we'll read in a corpus of text and identify pairs of words that appear together. The pairings are sequential such that when a word w1 is followed by a word w2, then we say that the system has a probablistic state transition from w1 to w2. An example will help. Consider the phrase \"the brown fox jumped over the lazy dog\". If we break this down by word pairings, our state transitions would look like this:\n",
24 | "\n",
25 | "the: [brown, lazy]
\n",
26 | "brown: [fox]
\n",
27 | "fox: [jumped]
\n",
28 | "over: [the]
\n",
29 | "lazy: [dog]
\n",
30 | "\n",
31 | "This set of state transitions is called a Markov chain. With this in hand we can now choose a starting point (i.e. a word in the corpus) and \"walk the chain\" to create a new phrase. Markov chains built in this manner over large amounts of text can produce surprisingly realistic-sounding phrases."
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "In order to get started we need a corpus of text. Anything sufficiently large will do, but to really have some fun (and at the risk of bringing politics into the mix) we're going to make Markov chains great again by using [this collection of text from Donald Trump's campain speeches](https://github.com/ryanmcdermott/trump-speeches). Our first step is to import the text file and parse it into words."
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 1,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "name": "stdout",
48 | "output_type": "stream",
49 | "text": [
50 | "Corpus size: 166259 words.\n"
51 | ]
52 | }
53 | ],
54 | "source": [
55 | "import urllib2\n",
56 | "text = urllib2.urlopen('https://raw.githubusercontent.com/ryanmcdermott/trump-speeches/master/speeches.txt')\n",
57 | "words = []\n",
58 | "for line in text:\n",
59 | " line = line.decode('utf-8-sig', errors='ignore')\n",
60 | " line = line.encode('ascii', errors='ignore')\n",
61 | " line = line.replace('\\r', ' ').replace('\\n', ' ')\n",
62 | " new_words = line.split(' ')\n",
63 | " new_words = [word for word in new_words if word not in ['', ' ']]\n",
64 | " words = words + new_words\n",
65 | "\n",
66 | "print('Corpus size: {0} words.'.format(len(words)))"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "I did some clean-up by converting it to ASCII and removing line breaks but that's about it, the rest of the text is just left as it appears in the source file. Our next step is to build the transition probabilities. We'll represent our transitions as a dictionary where the keys are the distinct words in the corpus and the value for a given key is a list of words that appear after that key. To build the chain we just need to iterate through the list of words, add it to the dictionary if it's not already there, and add the word proceeding it to the list of transition words."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 2,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "Chain size: 13292 distinct words.\n"
86 | ]
87 | }
88 | ],
89 | "source": [
90 | "chain = {}\n",
91 | "n_words = len(words)\n",
92 | "for i, key in enumerate(words):\n",
93 | " if n_words > (i + 1):\n",
94 | " word = words[i + 1]\n",
95 | " if key not in chain:\n",
96 | " chain[key] = [word]\n",
97 | " else:\n",
98 | " chain[key].append(word)\n",
99 | "\n",
100 | "print('Chain size: {0} distinct words.'.format(len(chain)))"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "It may come as a surprise that we're just naively inserting words into the transition list without caring if that word had appeared already or not. Won't we get duplicates, and isn't that a problem? Yes we will, and no it's not. Think of this as a simplistic way of representing the transition probability. If a word appears multiple times in the list, and we sample from the list randomly during a transition, there's a higher likelihood that we pick that word proportional to the number of times it appeared after the key relative to all the other words in the corpus that appeared after that key."
108 | ]
109 | },
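If you'd rather see those weights explicitly, a small aside (my own illustration, not in the original code; the helper name `transition_probs` is mine) is to count the duplicates in a key's list:

```python
# Turn a key's raw transition list into an explicit probability distribution
# by counting how often each follower appears.
from collections import Counter

def transition_probs(chain, key):
    counts = Counter(chain[key])
    total = float(sum(counts.values()))
    return {word: count / total for word, count in counts.items()}

# e.g. transition_probs(chain, 'the') -- the exact numbers depend on the corpus
```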
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "Now that we've built our Markov chain, we can get to the fun part - using it to generate phrases! To do this we only need two pieces of information - a starting word, and a phrase length. We're going to randomly select a starting word from the corpus and make our phrases tweet-length by sampling until our phrase hits 140 characters (assume we're part of the #never280 crowd). Let's give it a try."
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 3,
120 | "metadata": {},
121 | "outputs": [
122 | {
123 | "name": "stdout",
124 | "output_type": "stream",
125 | "text": [
126 | "Were not going to run by the 93 million people are, where were starting. New Hampshire.\" I PROMISE. I do so incredible, and be insulted, Chuck.\n"
127 | ]
128 | }
129 | ],
130 | "source": [
131 | "import random\n",
132 | "w1 = random.choice(words)\n",
133 | "tweet = w1\n",
134 | "\n",
135 | "while len(tweet) < 140:\n",
136 | " w2 = random.choice(chain[w1])\n",
137 | " tweet += ' ' + w2\n",
138 | " w1 = w2\n",
139 | "\n",
140 | "print(tweet)"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "Not bad! The limitations of using only one word for context are readily apparent though. We can improve it by using a 2nd-order Markov chain instead. This time, instead of using simple word pairings, our \"keys\" will be the set of distinct tuples of words that appear in the text. Borrowing from the example phrase earlier, a 2nd-order Markov chain for \"the brown fox jumped over the lazy dog\" would look like:\n",
148 | "\n",
149 | "(the, brown): [fox]
\n",
150 | "(brown, fox): [jumped]
\n",
151 | "(fox, jumped): [over]
\n",
152 | "(jumped, over): [the]
\n",
153 | "(over, the): [lazy]
\n",
154 | "(the, lazy): [dog]
\n",
155 | "\n",
156 | "In order to build a 2nd-order chain, we have to make a few modifications to the code."
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 4,
162 | "metadata": {},
163 | "outputs": [
164 | {
165 | "name": "stdout",
166 | "output_type": "stream",
167 | "text": [
168 | "Chain size: 72373 distinct word pairs.\n"
169 | ]
170 | }
171 | ],
172 | "source": [
173 | "chain = {}\n",
174 | "n_words = len(words)\n",
175 | "for i, key1 in enumerate(words):\n",
176 | " if n_words > i + 2:\n",
177 | " key2 = words[i + 1]\n",
178 | " word = words[i + 2]\n",
179 | " if (key1, key2) not in chain:\n",
180 | " chain[(key1, key2)] = [word]\n",
181 | " else:\n",
182 | " chain[(key1, key2)].append(word)\n",
183 | "\n",
184 | "print('Chain size: {0} distinct word pairs.'.format(len(chain)))"
185 | ]
186 | },
187 | {
188 | "cell_type": "markdown",
189 | "metadata": {},
190 | "source": [
191 | "We can do a sanity check to make sure it's doing what we expect by choosing a word pair that appears somewhere in the text and then examining the transitions in the chain for that pair of words."
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 5,
197 | "metadata": {},
198 | "outputs": [
199 | {
200 | "data": {
201 | "text/plain": [
202 | "['great',\n",
203 | " 'great',\n",
204 | " 'easy.',\n",
205 | " 'preposterous.',\n",
206 | " 'important...',\n",
207 | " 'simple.',\n",
208 | " 'simple.',\n",
209 | " 'horrible.',\n",
210 | " 'out',\n",
211 | " 'terrible.',\n",
212 | " 'sad.',\n",
213 | " 'much',\n",
214 | " 'can',\n",
215 | " 'easy.',\n",
216 | " 'embarrassing',\n",
217 | " 'astronomical']"
218 | ]
219 | },
220 | "execution_count": 5,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 | "chain[(\"Its\", \"so\")]"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "Looks about like what I'd expect. Next we need to modify the \"tweet\" code to handle the new design."
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": 6,
239 | "metadata": {},
240 | "outputs": [
241 | {
242 | "name": "stdout",
243 | "output_type": "stream",
244 | "text": [
245 | "there. They saw it. He talks about medical cards. He talks about fixing the VA health care. They want to talk to me from Georgia? \"Dear So and\n"
246 | ]
247 | }
248 | ],
249 | "source": [
250 | "r = random.randint(0, len(words) - 1)\n",
251 | "key = (words[r], words[r + 1])\n",
252 | "tweet = key[0] + ' ' + key[1]\n",
253 | "\n",
254 | "while len(tweet) < 140:\n",
255 | " w = random.choice(chain[key])\n",
256 | " tweet += ' ' + w\n",
257 | " key = (key[1], w)\n",
258 | "\n",
259 | "print(tweet)"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {},
265 | "source": [
266 | "Better! Let's turn this into a function that we can call repeatedly to see a few more examples."
267 | ]
268 | },
269 | {
270 | "cell_type": "code",
271 | "execution_count": 7,
272 | "metadata": {
273 | "collapsed": true
274 | },
275 | "outputs": [],
276 | "source": [
277 | "def markov_tweet(chain, words):\n",
278 | " r = random.randint(0, len(words) - 1)\n",
279 | " key = (words[r], words[r + 1])\n",
280 | " tweet = key[0] + ' ' + key[1]\n",
281 | "\n",
282 | " while len(tweet) < 140:\n",
283 | " w = random.choice(chain[key])\n",
284 | " tweet += ' ' + w\n",
285 | " key = (key[1], w)\n",
286 | "\n",
287 | " print(tweet + '\\n')"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 8,
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "name": "stdout",
297 | "output_type": "stream",
298 | "text": [
299 | "East. But we have a huge subject. Ive been with the Romney campaign. Guys made tens of thousands of people didnt care about the vets in one hour.\n",
300 | "\n",
301 | "somebody is going to put American-produced steel back into the sky. It will be the candidate. But I think 11 is a huge problem. And Im on the\n",
302 | "\n",
303 | "THAT WE CAN ONLY DREAM ABOUT. THEY HAVE A VERY BIG BEAUTIFUL GATE IN THAT WALL, BIG AND BEAUTIFUL, RIGHT. NO. NO, I DON'T KNOW WHERE THEY HAVE\n",
304 | "\n",
305 | "We need to get so sick of me. I didnt want the world my tenant. They buy condos for tens of millions of dollars overseas. And too many executive\n",
306 | "\n",
307 | "Wont be as good as you know, started going around and were going to win. Were going to happen. Thank you. SPEECH 8 This is serious rifle. This\n",
308 | "\n"
309 | ]
310 | }
311 | ],
312 | "source": [
313 | "markov_tweet(chain, words)\n",
314 | "markov_tweet(chain, words)\n",
315 | "markov_tweet(chain, words)\n",
316 | "markov_tweet(chain, words)\n",
317 | "markov_tweet(chain, words)"
318 | ]
319 | },
320 | {
321 | "cell_type": "markdown",
322 | "metadata": {},
323 | "source": [
324 | "That's all there is to it! Incredibly simple yet surprisingly effective. It's obviously not perfect but it's not complete gibberish either. If you run it enough times you'll find some combinations that actually sound pretty plausible. These results could probably be improved significantly with a much more powerful technique like a recurrent neural net, but relative to the effort involved it's hard to beat Markov chains."
325 | ]
326 | }
327 | ],
328 | "metadata": {
329 | "kernelspec": {
330 | "display_name": "Python 3",
331 | "language": "python",
332 | "name": "python3"
333 | },
334 | "language_info": {
335 | "codemirror_mode": {
336 | "name": "ipython",
337 | "version": 3
338 | },
339 | "file_extension": ".py",
340 | "mimetype": "text/x-python",
341 | "name": "python",
342 | "nbconvert_exporter": "python",
343 | "pygments_lexer": "ipython3",
344 | "version": "3.6.4"
345 | }
346 | },
347 | "nbformat": 4,
348 | "nbformat_minor": 2
349 | }
350 |
--------------------------------------------------------------------------------
/notebooks/ml/ML-Exercise3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Machine Learning Exercise 3 - Multi-Class Classification"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "This notebook covers a Python-based solution for the third programming exercise of the machine learning class on Coursera. Please refer to the [exercise text](https://github.com/jdwittenauer/ipython-notebooks/blob/master/exercises/ML/ex3.pdf) for detailed descriptions and equations."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {
20 | "collapsed": true
21 | },
22 | "source": [
23 | "For this exercise we'll use logistic regression to recognize hand-written digits (0 to 9). We'll be extending the implementation of logistic regression we wrote in exercise 2 and apply it to one-vs-all classification. Let's get started by loading the data set. It's in MATLAB's native format, so to load it in Python we need to use a SciPy utility."
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {
30 | "collapsed": false
31 | },
32 | "outputs": [
33 | {
34 | "data": {
35 | "text/plain": [
36 | "{'X': array([[ 0., 0., 0., ..., 0., 0., 0.],\n",
37 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
38 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
39 | " ..., \n",
40 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
41 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
42 | " [ 0., 0., 0., ..., 0., 0., 0.]]),\n",
43 | " '__globals__': [],\n",
44 | " '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',\n",
45 | " '__version__': '1.0',\n",
46 | " 'y': array([[10],\n",
47 | " [10],\n",
48 | " [10],\n",
49 | " ..., \n",
50 | " [ 9],\n",
51 | " [ 9],\n",
52 | " [ 9]], dtype=uint8)}"
53 | ]
54 | },
55 | "execution_count": 1,
56 | "metadata": {},
57 | "output_type": "execute_result"
58 | }
59 | ],
60 | "source": [
61 | "import numpy as np\n",
62 | "import pandas as pd\n",
63 | "import matplotlib.pyplot as plt\n",
64 | "from scipy.io import loadmat\n",
65 | "%matplotlib inline\n",
66 | "\n",
67 | "data = loadmat('data/ex3data1.mat')\n",
68 | "data"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 2,
74 | "metadata": {
75 | "collapsed": false
76 | },
77 | "outputs": [
78 | {
79 | "data": {
80 | "text/plain": [
81 | "((5000L, 400L), (5000L, 1L))"
82 | ]
83 | },
84 | "execution_count": 2,
85 | "metadata": {},
86 | "output_type": "execute_result"
87 | }
88 | ],
89 | "source": [
90 | "data['X'].shape, data['y'].shape"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Great, we've got our data loaded. The images are represented in martix X as a 400-dimensional vector (of which there are 5,000 of them). The 400 \"features\" are grayscale intensities of each pixel in the original 20 x 20 image. The class labels are in the vector y as a numeric class representing the digit that's in the image.\n",
98 | "\n",
99 | "The exercise code in MATLAB has a function provided to visualize the hand-written digits. I'm not going to reproduce that in Python, but there's an illustration in the exercise PDF if one is interested in seeing what the images look like. We're going to move on to our logistic regression implementation.\n",
100 | "\n",
101 | "The first task is to modify our logistic regression implementation to be completely vectorized (i.e. no \"for\" loops). This is because vectorized code, in addition to being short and concise, is able to take advantage of linear algebra optimizations and is typically much faster than iterative code. However if you look at our cost function implementation from exercise 2, it's already vectorized! So we can re-use the same implementation here. Note we're skipping straight to the final, regularized version."
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 3,
107 | "metadata": {
108 | "collapsed": true
109 | },
110 | "outputs": [],
111 | "source": [
112 | "def sigmoid(z):\n",
113 | " return 1 / (1 + np.exp(-z))"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": 4,
119 | "metadata": {
120 | "collapsed": true
121 | },
122 | "outputs": [],
123 | "source": [
124 | "def cost(theta, X, y, learningRate):\n",
125 | " theta = np.matrix(theta)\n",
126 | " X = np.matrix(X)\n",
127 | " y = np.matrix(y)\n",
128 | " first = np.multiply(-y, np.log(sigmoid(X * theta.T)))\n",
129 | " second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))\n",
130 | " reg = (learningRate / 2 * len(X)) * np.sum(np.power(theta[:,1:theta.shape[1]], 2))\n",
131 | " return np.sum(first - second) / (len(X)) + reg"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "Next we need the function that computes the gradient. Again, we already defined this in the previous exercise, only in this case we do have a \"for\" loop in the update step that we need to get rid of. Here's the original code for reference:"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 5,
144 | "metadata": {
145 | "collapsed": true
146 | },
147 | "outputs": [],
148 | "source": [
149 | "def gradient_with_loop(theta, X, y, learningRate):\n",
150 | " theta = np.matrix(theta)\n",
151 | " X = np.matrix(X)\n",
152 | " y = np.matrix(y)\n",
153 | " \n",
154 | " parameters = int(theta.ravel().shape[1])\n",
155 | " grad = np.zeros(parameters)\n",
156 | " \n",
157 | " error = sigmoid(X * theta.T) - y\n",
158 | " \n",
159 | " for i in range(parameters):\n",
160 | " term = np.multiply(error, X[:,i])\n",
161 | " \n",
162 | " if (i == 0):\n",
163 | " grad[i] = np.sum(term) / len(X)\n",
164 | " else:\n",
165 | " grad[i] = (np.sum(term) / len(X)) + ((learningRate / len(X)) * theta[:,i])\n",
166 | " \n",
167 | " return grad"
168 | ]
169 | },
170 | {
171 | "cell_type": "markdown",
172 | "metadata": {},
173 | "source": [
174 | "In our new version we're going to pull out the \"for\" loop and compute the gradient for each parameter at once using linear algebra (except for the intercept parameter, which is not regularized so it's computed separately). To follow the math behind the transformation, refer to the exercise 3 text.\n",
175 | "\n",
176 | "Also note that we're converting the data structures to NumPy matrices (which I've used for the most part throughout these exercises). This is done in an attempt to make the code look more similar to Octave than it would using arrays because matrices automatically follow matrix operation rules vs. element-wise operations, which is the default for arrays. There is some debate in the community over wether or not the matrix class should be used at all, but it's there so we're using it in these examples."
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 6,
182 | "metadata": {
183 | "collapsed": true
184 | },
185 | "outputs": [],
186 | "source": [
187 | "def gradient(theta, X, y, learningRate):\n",
188 | " theta = np.matrix(theta)\n",
189 | " X = np.matrix(X)\n",
190 | " y = np.matrix(y)\n",
191 | " \n",
192 | " parameters = int(theta.ravel().shape[1])\n",
193 | " error = sigmoid(X * theta.T) - y\n",
194 | " \n",
195 | " grad = ((X.T * error) / len(X)).T + ((learningRate / len(X)) * theta)\n",
196 | " \n",
197 | " # intercept gradient is not regularized\n",
198 | " grad[0, 0] = np.sum(np.multiply(error, X[:,0])) / len(X)\n",
199 | " \n",
200 | " return np.array(grad).ravel()"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {},
206 | "source": [
207 | "Now that we've defined our cost and gradient functions, it's time to build a classifier. For this task we've got 10 possible classes, and since logistic regression is only able to distiguish between 2 classes at a time, we need a strategy to deal with the multi-class scenario. In this exercise we're tasked with implementing a one-vs-all classification approach, where a label with k different classes results in k classifiers, each one deciding between \"class i\" and \"not class i\" (i.e. any class other than i). We're going to wrap the classifier training up in one function that computes the final weights for each of the 10 classifiers and returns the weights as a k X (n + 1) array, where n is the number of parameters."
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 7,
213 | "metadata": {
214 | "collapsed": true
215 | },
216 | "outputs": [],
217 | "source": [
218 | "from scipy.optimize import minimize\n",
219 | "\n",
220 | "def one_vs_all(X, y, num_labels, learning_rate):\n",
221 | " rows = X.shape[0]\n",
222 | " params = X.shape[1]\n",
223 | " \n",
224 | " # k X (n + 1) array for the parameters of each of the k classifiers\n",
225 | " all_theta = np.zeros((num_labels, params + 1))\n",
226 | " \n",
227 | " # insert a column of ones at the beginning for the intercept term\n",
228 | " X = np.insert(X, 0, values=np.ones(rows), axis=1)\n",
229 | " \n",
230 | " # labels are 1-indexed instead of 0-indexed\n",
231 | " for i in range(1, num_labels + 1):\n",
232 | " theta = np.zeros(params + 1)\n",
233 | " y_i = np.array([1 if label == i else 0 for label in y])\n",
234 | " y_i = np.reshape(y_i, (rows, 1))\n",
235 | " \n",
236 | " # minimize the objective function\n",
237 | " fmin = minimize(fun=cost, x0=theta, args=(X, y_i, learning_rate), method='TNC', jac=gradient)\n",
238 | " all_theta[i-1,:] = fmin.x\n",
239 | " \n",
240 | " return all_theta"
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "A few things to note here...first, we're adding an extra parameter to theta (along with a column of ones to the training data) to account for the intercept term. Second, we're transforming y from a class label to a binary value for each classifier (either is class i or is not class i). Finally, we're using SciPy's newer optimization API to minimize the cost function for each classifier. The API takes an objective function, an initial set of parameters, an optimization method, and a jacobian (gradient) function if specified. The parameters found by the optimization routine are then assigned to the parameter array.\n",
248 | "\n",
249 | "One of the more challenging parts of implementing vectorized code is getting all of the matrix interactions written correctly, so I find it useful to do some sanity checks by looking at the shapes of the arrays/matrices I'm working with and convincing myself that they're sensible. Let's look at some of the data structures used in the above function."
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": 8,
255 | "metadata": {
256 | "collapsed": false
257 | },
258 | "outputs": [
259 | {
260 | "data": {
261 | "text/plain": [
262 | "((5000L, 401L), (5000L, 1L), (401L,), (10L, 401L))"
263 | ]
264 | },
265 | "execution_count": 8,
266 | "metadata": {},
267 | "output_type": "execute_result"
268 | }
269 | ],
270 | "source": [
271 | "rows = data['X'].shape[0]\n",
272 | "params = data['X'].shape[1]\n",
273 | "\n",
274 | "all_theta = np.zeros((10, params + 1))\n",
275 | "\n",
276 | "X = np.insert(data['X'], 0, values=np.ones(rows), axis=1)\n",
277 | "\n",
278 | "theta = np.zeros(params + 1)\n",
279 | "\n",
280 | "y_0 = np.array([1 if label == 0 else 0 for label in data['y']])\n",
281 | "y_0 = np.reshape(y_0, (rows, 1))\n",
282 | "\n",
283 | "X.shape, y_0.shape, theta.shape, all_theta.shape"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "metadata": {},
289 | "source": [
290 | "These all appear to make sense. Note that theta is a one-dimensional array, so when it gets converted to a matrix in the code that computes the gradient, it turns into a (1 X 401) matrix. Let's also check the class labels in y to make sure they look like what we're expecting."
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 9,
296 | "metadata": {
297 | "collapsed": false
298 | },
299 | "outputs": [
300 | {
301 | "data": {
302 | "text/plain": [
303 | "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=uint8)"
304 | ]
305 | },
306 | "execution_count": 9,
307 | "metadata": {},
308 | "output_type": "execute_result"
309 | }
310 | ],
311 | "source": [
312 | "np.unique(data['y'])"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "Let's make sure that our training function actually runs, and we get some sensible outputs, before going any further."
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": 10,
325 | "metadata": {
326 | "collapsed": false
327 | },
328 | "outputs": [
329 | {
330 | "data": {
331 | "text/plain": [
332 | "array([[ -5.79312170e+00, 0.00000000e+00, 0.00000000e+00, ...,\n",
333 | " 1.22140973e-02, 2.88611969e-07, 0.00000000e+00],\n",
334 | " [ -4.91685285e+00, 0.00000000e+00, 0.00000000e+00, ...,\n",
335 | " 2.40449128e-01, -1.08488270e-02, 0.00000000e+00],\n",
336 | " [ -8.56840371e+00, 0.00000000e+00, 0.00000000e+00, ...,\n",
337 | " -2.59241796e-04, -1.12756844e-06, 0.00000000e+00],\n",
338 | " ..., \n",
339 | " [ -1.32641613e+01, 0.00000000e+00, 0.00000000e+00, ...,\n",
340 | " -5.63659404e+00, 6.50939114e-01, 0.00000000e+00],\n",
341 | " [ -8.55392716e+00, 0.00000000e+00, 0.00000000e+00, ...,\n",
342 | " -2.01206880e-01, 9.61930149e-03, 0.00000000e+00],\n",
343 | " [ -1.29807876e+01, 0.00000000e+00, 0.00000000e+00, ...,\n",
344 | " 2.60651472e-04, 4.22693052e-05, 0.00000000e+00]])"
345 | ]
346 | },
347 | "execution_count": 10,
348 | "metadata": {},
349 | "output_type": "execute_result"
350 | }
351 | ],
352 | "source": [
353 | "all_theta = one_vs_all(data['X'], data['y'], 10, 1)\n",
354 | "all_theta"
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "metadata": {},
360 | "source": [
361 | "We're now ready for the final step - using the trained classifiers to predict a label for each image. For this step we're going to compute the class probability for each class, for each training instance (using vectorized code of course!) and assign the output class label as the class with the highest probability."
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": 11,
367 | "metadata": {
368 | "collapsed": true
369 | },
370 | "outputs": [],
371 | "source": [
372 | "def predict_all(X, all_theta):\n",
373 | " rows = X.shape[0]\n",
374 | " params = X.shape[1]\n",
375 | " num_labels = all_theta.shape[0]\n",
376 | " \n",
377 | " # same as before, insert ones to match the shape\n",
378 | " X = np.insert(X, 0, values=np.ones(rows), axis=1)\n",
379 | " \n",
380 | " # convert to matrices\n",
381 | " X = np.matrix(X)\n",
382 | " all_theta = np.matrix(all_theta)\n",
383 | " \n",
384 | " # compute the class probability for each class on each training instance\n",
385 | " h = sigmoid(X * all_theta.T)\n",
386 | " \n",
387 | " # create array of the index with the maximum probability\n",
388 | " h_argmax = np.argmax(h, axis=1)\n",
389 | " \n",
390 | " # because our array was zero-indexed we need to add one for the true label prediction\n",
391 | " h_argmax = h_argmax + 1\n",
392 | " \n",
393 | " return h_argmax"
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "Now we can use the predict_all function to generate class predictions for each instance and see how well our classifier works."
401 | ]
402 | },
403 | {
404 | "cell_type": "code",
405 | "execution_count": 12,
406 | "metadata": {
407 | "collapsed": false
408 | },
409 | "outputs": [
410 | {
411 | "name": "stdout",
412 | "output_type": "stream",
413 | "text": [
414 | "accuracy = 97.58%\n"
415 | ]
416 | }
417 | ],
418 | "source": [
419 | "y_pred = predict_all(data['X'], all_theta)\n",
420 | "correct = [1 if a == b else 0 for (a, b) in zip(y_pred, data['y'])]\n",
421 | "accuracy = (sum(map(int, correct)) / float(len(correct)))\n",
422 | "print 'accuracy = {0}%'.format(accuracy * 100)"
423 | ]
424 | },
425 | {
426 | "cell_type": "markdown",
427 | "metadata": {},
428 | "source": [
429 | "Almost 98% isn't too bad! That's all for exercise 3. In the next exercise, we'll look at how to implement a feed-forward neural network from scratch."
430 | ]
431 | }
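As an optional sanity check (my addition, not part of the exercise), scikit-learn's built-in logistic regression with a one-vs-rest strategy can be trained on the same data for comparison; the exact accuracy you get may differ from the from-scratch implementation:

```python
# Optional cross-check against scikit-learn (not part of the original exercise).
from scipy.io import loadmat
from sklearn.linear_model import LogisticRegression

data = loadmat('data/ex3data1.mat')
X, y = data['X'], data['y'].ravel()

# The liblinear solver trains one binary classifier per class (one-vs-rest);
# C is the inverse of the regularization strength.
clf = LogisticRegression(C=1.0, solver='liblinear')
clf.fit(X, y)
print('sklearn training accuracy = {0:.2f}%'.format(clf.score(X, y) * 100))
```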
432 | ],
433 | "metadata": {
434 | "kernelspec": {
435 | "display_name": "Python 2",
436 | "language": "python",
437 | "name": "python2"
438 | },
439 | "language_info": {
440 | "codemirror_mode": {
441 | "name": "ipython",
442 | "version": 2
443 | },
444 | "file_extension": ".py",
445 | "mimetype": "text/x-python",
446 | "name": "python",
447 | "nbconvert_exporter": "python",
448 | "pygments_lexer": "ipython2",
449 | "version": "2.7.9"
450 | }
451 | },
452 | "nbformat": 4,
453 | "nbformat_minor": 0
454 | }
455 |
--------------------------------------------------------------------------------
/notebooks/ml/ML-Exercise4.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Machine Learning Exercise 4 - Neural Networks"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "This notebook covers a Python-based solution for the fourth programming exercise of the machine learning class on Coursera. Please refer to the [exercise text](https://github.com/jdwittenauer/ipython-notebooks/blob/master/exercises/ML/ex4.pdf) for detailed descriptions and equations."
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "For this exercise we'll again tackle the hand-written digits data set, this time using a feed-forward neural network with backpropagation. We'll implement un-regularized and regularized versions of the neural network cost function and gradient computation via the backpropagation algorithm. We'll also implement random weight initialization and a method to use the network to make predictions.\n",
22 | "\n",
23 | "Since the data set is the same one we used in exercise 3, we'll re-use the code to load the data."
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {
30 | "collapsed": false
31 | },
32 | "outputs": [
33 | {
34 | "data": {
35 | "text/plain": [
36 | "{'X': array([[ 0., 0., 0., ..., 0., 0., 0.],\n",
37 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
38 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
39 | " ..., \n",
40 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
41 | " [ 0., 0., 0., ..., 0., 0., 0.],\n",
42 | " [ 0., 0., 0., ..., 0., 0., 0.]]),\n",
43 | " '__globals__': [],\n",
44 | " '__header__': 'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sun Oct 16 13:09:09 2011',\n",
45 | " '__version__': '1.0',\n",
46 | " 'y': array([[10],\n",
47 | " [10],\n",
48 | " [10],\n",
49 | " ..., \n",
50 | " [ 9],\n",
51 | " [ 9],\n",
52 | " [ 9]], dtype=uint8)}"
53 | ]
54 | },
55 | "execution_count": 1,
56 | "metadata": {},
57 | "output_type": "execute_result"
58 | }
59 | ],
60 | "source": [
61 | "import numpy as np\n",
62 | "import pandas as pd\n",
63 | "import matplotlib.pyplot as plt\n",
64 | "from scipy.io import loadmat\n",
65 | "%matplotlib inline\n",
66 | "\n",
67 | "data = loadmat('data/ex3data1.mat')\n",
68 | "data"
69 | ]
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "Since we're going to need these later (and will use them often), let's create some useful variables up-front."
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 2,
81 | "metadata": {
82 | "collapsed": false
83 | },
84 | "outputs": [
85 | {
86 | "data": {
87 | "text/plain": [
88 | "((5000L, 400L), (5000L, 1L))"
89 | ]
90 | },
91 | "execution_count": 2,
92 | "metadata": {},
93 | "output_type": "execute_result"
94 | }
95 | ],
96 | "source": [
97 | "X = data['X']\n",
98 | "y = data['y']\n",
99 | "\n",
100 | "X.shape, y.shape"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "We're also going to need to one-hot encode our y labels. One-hot encoding turns a class label n (out of k classes) into a vector of length k where index n is \"hot\" (1) while the rest are zero. Scikit-learn has a built in utility we can use for this."
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 3,
113 | "metadata": {
114 | "collapsed": false
115 | },
116 | "outputs": [
117 | {
118 | "data": {
119 | "text/plain": [
120 | "(5000L, 10L)"
121 | ]
122 | },
123 | "execution_count": 3,
124 | "metadata": {},
125 | "output_type": "execute_result"
126 | }
127 | ],
128 | "source": [
129 | "from sklearn.preprocessing import OneHotEncoder\n",
130 | "encoder = OneHotEncoder(sparse=False)\n",
131 | "y_onehot = encoder.fit_transform(y)\n",
132 | "y_onehot.shape"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 4,
138 | "metadata": {
139 | "collapsed": false
140 | },
141 | "outputs": [
142 | {
143 | "data": {
144 | "text/plain": [
145 | "(array([10], dtype=uint8),\n",
146 | " array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]))"
147 | ]
148 | },
149 | "execution_count": 4,
150 | "metadata": {},
151 | "output_type": "execute_result"
152 | }
153 | ],
154 | "source": [
155 | "y[0], y_onehot[0,:]"
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "The neural network we're going to build for this exercise has an input layer matching the size of our instance data (400 + the bias unit), a hidden layer with 25 units (26 with the bias unit), and an output layer with 10 units corresponding to our one-hot encoding for the class labels. For additional details and an image of the network architecture, please refer to the PDF in the \"exercises\" folder.\n",
163 | "\n",
164 | "The first piece we need to implement is a cost function to evaluate the loss for a given set of network parameters. The source mathematical function is in the exercise text (and looks pretty intimidating). Here are the functions required to compute the cost."
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 5,
170 | "metadata": {
171 | "collapsed": true
172 | },
173 | "outputs": [],
174 | "source": [
175 | "def sigmoid(z):\n",
176 | " return 1 / (1 + np.exp(-z))"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 6,
182 | "metadata": {
183 | "collapsed": true
184 | },
185 | "outputs": [],
186 | "source": [
187 | "def forward_propagate(X, theta1, theta2):\n",
188 | " m = X.shape[0]\n",
189 | " \n",
190 | " a1 = np.insert(X, 0, values=np.ones(m), axis=1)\n",
191 | " z2 = a1 * theta1.T\n",
192 | " a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)\n",
193 | " z3 = a2 * theta2.T\n",
194 | " h = sigmoid(z3)\n",
195 | " \n",
196 | " return a1, z2, a2, z3, h"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 7,
202 | "metadata": {
203 | "collapsed": true
204 | },
205 | "outputs": [],
206 | "source": [
207 | "def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n",
208 | " m = X.shape[0]\n",
209 | " X = np.matrix(X)\n",
210 | " y = np.matrix(y)\n",
211 | " \n",
212 | " # reshape the parameter array into parameter matrices for each layer\n",
213 | " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n",
214 | " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n",
215 | " \n",
216 | " # run the feed-forward pass\n",
217 | " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n",
218 | " \n",
219 | " # compute the cost\n",
220 | " J = 0\n",
221 | " for i in range(m):\n",
222 | " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n",
223 | " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n",
224 | " J += np.sum(first_term - second_term)\n",
225 | " \n",
226 | " J = J / m\n",
227 | " \n",
228 | " return J"
229 | ]
230 | },
231 | {
232 | "cell_type": "markdown",
233 | "metadata": {},
234 | "source": [
235 | "We've used the sigmoid function before so that's not new. The forward-propagate function computes the hypothesis for each training instance given the current parameters. It's output shape should match the same of our one-hot encoding for y. We can test this real quick to convince ourselves that it's working as expected (the intermediate steps are also returned as these will be useful later)."
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 8,
241 | "metadata": {
242 | "collapsed": false
243 | },
244 | "outputs": [
245 | {
246 | "data": {
247 | "text/plain": [
248 | "((25L, 401L), (10L, 26L))"
249 | ]
250 | },
251 | "execution_count": 8,
252 | "metadata": {},
253 | "output_type": "execute_result"
254 | }
255 | ],
256 | "source": [
257 | "# initial setup\n",
258 | "input_size = 400\n",
259 | "hidden_size = 25\n",
260 | "num_labels = 10\n",
261 | "learning_rate = 1\n",
262 | "\n",
263 | "# randomly initialize a parameter array of the size of the full network's parameters\n",
264 | "params = (np.random.random(size=hidden_size * (input_size + 1) + num_labels * (hidden_size + 1)) - 0.5) * 0.25\n",
265 | "\n",
266 | "m = X.shape[0]\n",
267 | "X = np.matrix(X)\n",
268 | "y = np.matrix(y)\n",
269 | "\n",
270 | "# unravel the parameter array into parameter matrices for each layer\n",
271 | "theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n",
272 | "theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n",
273 | "\n",
274 | "theta1.shape, theta2.shape"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": 9,
280 | "metadata": {
281 | "collapsed": false
282 | },
283 | "outputs": [
284 | {
285 | "data": {
286 | "text/plain": [
287 | "((5000L, 401L), (5000L, 25L), (5000L, 26L), (5000L, 10L), (5000L, 10L))"
288 | ]
289 | },
290 | "execution_count": 9,
291 | "metadata": {},
292 | "output_type": "execute_result"
293 | }
294 | ],
295 | "source": [
296 | "a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n",
297 | "a1.shape, z2.shape, a2.shape, z3.shape, h.shape"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "metadata": {},
303 | "source": [
304 | "The cost function, after computing the hypothesis matrix h, applies the cost equation to compute the total error between y and h."
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 10,
310 | "metadata": {
311 | "collapsed": false
312 | },
313 | "outputs": [
314 | {
315 | "data": {
316 | "text/plain": [
317 | "6.8228086634127862"
318 | ]
319 | },
320 | "execution_count": 10,
321 | "metadata": {},
322 | "output_type": "execute_result"
323 | }
324 | ],
325 | "source": [
326 | "cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)"
327 | ]
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "Our next step is to add regularization to the cost function. If you're following along in the exercise text and thought the last equation looked ugly, this one looks REALLY ugly. It's actually not as complicated as it looks though - in fact, the regularization term is simply an addition to the cost we already computed. Here's the revised cost function."
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 11,
339 | "metadata": {
340 | "collapsed": true
341 | },
342 | "outputs": [],
343 | "source": [
344 | "def cost(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n",
345 | " m = X.shape[0]\n",
346 | " X = np.matrix(X)\n",
347 | " y = np.matrix(y)\n",
348 | " \n",
349 | " # reshape the parameter array into parameter matrices for each layer\n",
350 | " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n",
351 | " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n",
352 | " \n",
353 | " # run the feed-forward pass\n",
354 | " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n",
355 | " \n",
356 | " # compute the cost\n",
357 | " J = 0\n",
358 | " for i in range(m):\n",
359 | " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n",
360 | " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n",
361 | " J += np.sum(first_term - second_term)\n",
362 | " \n",
363 | " J = J / m\n",
364 | " \n",
365 | " # add the cost regularization term\n",
366 | " J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))\n",
367 | " \n",
368 | " return J"
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "execution_count": 12,
374 | "metadata": {
375 | "collapsed": false
376 | },
377 | "outputs": [
378 | {
379 | "data": {
380 | "text/plain": [
381 | "6.8281541822949299"
382 | ]
383 | },
384 | "execution_count": 12,
385 | "metadata": {},
386 | "output_type": "execute_result"
387 | }
388 | ],
389 | "source": [
390 | "cost(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)"
391 | ]
392 | },
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "Next up is the backpropagation algorithm. Backpropagation computes the parameter updates that will reduce the error of the network on the training data. The first thing we need is a function that computes the gradient of the sigmoid function we created earlier."
398 | ]
399 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 13,
403 | "metadata": {
404 | "collapsed": true
405 | },
406 | "outputs": [],
407 | "source": [
408 | "def sigmoid_gradient(z):\n",
409 | " return np.multiply(sigmoid(z), (1 - sigmoid(z)))"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "metadata": {},
415 | "source": [
416 | "Now we're ready to implement backpropagation to compute the gradients. Since the computations required for backpropagation are a superset of those required in the cost function, we're actually going to extend the cost function to also perform backpropagation and return both the cost and the gradients."
417 | ]
418 | },
419 | {
420 | "cell_type": "code",
421 | "execution_count": 14,
422 | "metadata": {
423 | "collapsed": true
424 | },
425 | "outputs": [],
426 | "source": [
427 | "def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n",
428 | " m = X.shape[0]\n",
429 | " X = np.matrix(X)\n",
430 | " y = np.matrix(y)\n",
431 | " \n",
432 | " # reshape the parameter array into parameter matrices for each layer\n",
433 | " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n",
434 | " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n",
435 | " \n",
436 | " # run the feed-forward pass\n",
437 | " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n",
438 | " \n",
439 | " # initializations\n",
440 | " J = 0\n",
441 | " delta1 = np.zeros(theta1.shape) # (25, 401)\n",
442 | " delta2 = np.zeros(theta2.shape) # (10, 26)\n",
443 | " \n",
444 | " # compute the cost\n",
445 | " for i in range(m):\n",
446 | " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n",
447 | " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n",
448 | " J += np.sum(first_term - second_term)\n",
449 | " \n",
450 | " J = J / m\n",
451 | " \n",
452 | " # add the cost regularization term\n",
453 | " J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))\n",
454 | " \n",
455 | " # perform backpropagation\n",
456 | " for t in range(m):\n",
457 | " a1t = a1[t,:] # (1, 401)\n",
458 | " z2t = z2[t,:] # (1, 25)\n",
459 | " a2t = a2[t,:] # (1, 26)\n",
460 | " ht = h[t,:] # (1, 10)\n",
461 | " yt = y[t,:] # (1, 10)\n",
462 | " \n",
463 | " d3t = ht - yt # (1, 10)\n",
464 | " \n",
465 | " z2t = np.insert(z2t, 0, values=np.ones(1)) # (1, 26)\n",
466 | " d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t)) # (1, 26)\n",
467 | " \n",
468 | " delta1 = delta1 + (d2t[:,1:]).T * a1t\n",
469 | " delta2 = delta2 + d3t.T * a2t\n",
470 | " \n",
471 | " delta1 = delta1 / m\n",
472 | " delta2 = delta2 / m\n",
473 | " \n",
474 | " # unravel the gradient matrices into a single array\n",
475 | " grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))\n",
476 | " \n",
477 | " return J, grad"
478 | ]
479 | },
480 | {
481 | "cell_type": "markdown",
482 | "metadata": {},
483 | "source": [
484 |     "The hardest part of the backprop computation (other than understanding WHY we're doing all these calculations) is getting the matrix dimensions right. By the way, if you find it confusing when to use A * B vs. np.multiply(A, B), you're not alone. Basically the former is a matrix multiplication and the latter is an element-wise multiplication (unless A or B is a scalar value, in which case it doesn't matter). I wish there were a more concise syntax for this (maybe there is and I'm just not aware of it).\n",
485 | "\n",
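486 |     "As a quick illustration (a sketch with made-up values):\n",
487 |     "\n",
488 |     "    A = np.matrix([[1, 2], [3, 4]])\n",
489 |     "    B = np.matrix([[5, 6], [7, 8]])\n",
490 |     "    A * B               # matrix product       -> [[19, 22], [43, 50]]\n",
491 |     "    np.multiply(A, B)   # element-wise product -> [[ 5, 12], [21, 32]]\n",
492 |     "\n",
493 |     "With plain ndarrays (rather than np.matrix), * is element-wise and np.dot gives the matrix product.\n",
494 |     "\n",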
486 | "Anyway, let's test it out to make sure the function returns what we're expecting it to."
487 | ]
488 | },
489 | {
490 | "cell_type": "code",
491 | "execution_count": 15,
492 | "metadata": {
493 | "collapsed": false
494 | },
495 | "outputs": [
496 | {
497 | "data": {
498 | "text/plain": [
499 | "(6.8281541822949299, (10285L,))"
500 | ]
501 | },
502 | "execution_count": 15,
503 | "metadata": {},
504 | "output_type": "execute_result"
505 | }
506 | ],
507 | "source": [
508 | "J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)\n",
509 | "J, grad.shape"
510 | ]
511 | },
512 | {
513 | "cell_type": "markdown",
514 | "metadata": {},
515 | "source": [
516 | "We still have one more modification to make to the backprop function - adding regularization to the gradient calculations. The final regularized version is below."
517 | ]
518 | },
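519 |   {
520 |    "cell_type": "markdown",
521 |    "metadata": {},
522 |    "source": [
523 |     "In notation (a sketch), the only change is to the accumulated gradients: $D^{(l)}_{ij} = \\frac{1}{m} \\Delta^{(l)}_{ij} + \\frac{\\lambda}{m} \\Theta^{(l)}_{ij}$ for $j \\geq 1$, while the bias column $j = 0$ stays unregularized: $D^{(l)}_{i0} = \\frac{1}{m} \\Delta^{(l)}_{i0}$."
524 |    ]
525 |   },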
519 | {
520 | "cell_type": "code",
521 | "execution_count": 16,
522 | "metadata": {
523 | "collapsed": true
524 | },
525 | "outputs": [],
526 | "source": [
527 | "def backprop(params, input_size, hidden_size, num_labels, X, y, learning_rate):\n",
528 | " m = X.shape[0]\n",
529 | " X = np.matrix(X)\n",
530 | " y = np.matrix(y)\n",
531 | " \n",
532 | " # reshape the parameter array into parameter matrices for each layer\n",
533 | " theta1 = np.matrix(np.reshape(params[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n",
534 | " theta2 = np.matrix(np.reshape(params[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n",
535 | " \n",
536 | " # run the feed-forward pass\n",
537 | " a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n",
538 | " \n",
539 | " # initializations\n",
540 | " J = 0\n",
541 | " delta1 = np.zeros(theta1.shape) # (25, 401)\n",
542 | " delta2 = np.zeros(theta2.shape) # (10, 26)\n",
543 | " \n",
544 | " # compute the cost\n",
545 | " for i in range(m):\n",
546 | " first_term = np.multiply(-y[i,:], np.log(h[i,:]))\n",
547 | " second_term = np.multiply((1 - y[i,:]), np.log(1 - h[i,:]))\n",
548 | " J += np.sum(first_term - second_term)\n",
549 | " \n",
550 | " J = J / m\n",
551 | " \n",
552 | " # add the cost regularization term\n",
553 | " J += (float(learning_rate) / (2 * m)) * (np.sum(np.power(theta1[:,1:], 2)) + np.sum(np.power(theta2[:,1:], 2)))\n",
554 | " \n",
555 | " # perform backpropagation\n",
556 | " for t in range(m):\n",
557 | " a1t = a1[t,:] # (1, 401)\n",
558 | " z2t = z2[t,:] # (1, 25)\n",
559 | " a2t = a2[t,:] # (1, 26)\n",
560 | " ht = h[t,:] # (1, 10)\n",
561 | " yt = y[t,:] # (1, 10)\n",
562 | " \n",
563 | " d3t = ht - yt # (1, 10)\n",
564 | " \n",
565 | " z2t = np.insert(z2t, 0, values=np.ones(1)) # (1, 26)\n",
566 | " d2t = np.multiply((theta2.T * d3t.T).T, sigmoid_gradient(z2t)) # (1, 26)\n",
567 | " \n",
568 | " delta1 = delta1 + (d2t[:,1:]).T * a1t\n",
569 | " delta2 = delta2 + d3t.T * a2t\n",
570 | " \n",
571 | " delta1 = delta1 / m\n",
572 | " delta2 = delta2 / m\n",
573 | " \n",
574 | " # add the gradient regularization term\n",
575 | " delta1[:,1:] = delta1[:,1:] + (theta1[:,1:] * learning_rate) / m\n",
576 | " delta2[:,1:] = delta2[:,1:] + (theta2[:,1:] * learning_rate) / m\n",
577 | " \n",
578 | " # unravel the gradient matrices into a single array\n",
579 | " grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))\n",
580 | " \n",
581 | " return J, grad"
582 | ]
583 | },
584 | {
585 | "cell_type": "code",
586 | "execution_count": 17,
587 | "metadata": {
588 | "collapsed": false
589 | },
590 | "outputs": [
591 | {
592 | "data": {
593 | "text/plain": [
594 | "(6.8281541822949299, (10285L,))"
595 | ]
596 | },
597 | "execution_count": 17,
598 | "metadata": {},
599 | "output_type": "execute_result"
600 | }
601 | ],
602 | "source": [
603 | "J, grad = backprop(params, input_size, hidden_size, num_labels, X, y_onehot, learning_rate)\n",
604 | "J, grad.shape"
605 | ]
606 | },
607 | {
608 | "cell_type": "markdown",
609 | "metadata": {},
610 | "source": [
611 |     "We're finally ready to train our network and use it to make predictions. This is roughly similar to the previous exercise with multi-class logistic regression. Since backprop returns both the cost and the gradient, we can hand it to scipy's minimize with jac=True so the optimizer uses our analytic gradients instead of estimating them numerically."
612 | ]
613 | },
614 | {
615 | "cell_type": "code",
616 | "execution_count": 18,
617 | "metadata": {
618 | "collapsed": false
619 | },
620 | "outputs": [
621 | {
622 | "data": {
623 | "text/plain": [
624 | " status: 3\n",
625 | " success: False\n",
626 | " nfev: 250\n",
627 | " fun: 0.33900736818312283\n",
628 | " x: array([ -8.85740564e-01, 2.57420350e-04, -4.09396202e-04, ...,\n",
629 | " 1.44634791e+00, 1.68974302e+00, 7.10121593e-01])\n",
630 | " message: 'Max. number of function evaluations reach'\n",
631 | " jac: array([ -5.11463703e-04, 5.14840700e-08, -8.18792403e-08, ...,\n",
632 | " -2.48297749e-04, -3.17870911e-04, -3.31404592e-04])\n",
633 | " nit: 21"
634 | ]
635 | },
636 | "execution_count": 18,
637 | "metadata": {},
638 | "output_type": "execute_result"
639 | }
640 | ],
641 | "source": [
642 | "from scipy.optimize import minimize\n",
643 | "\n",
644 | "# minimize the objective function\n",
645 | "fmin = minimize(fun=backprop, x0=params, args=(input_size, hidden_size, num_labels, X, y_onehot, learning_rate), \n",
646 | " method='TNC', jac=True, options={'maxiter': 250})\n",
647 | "fmin"
648 | ]
649 | },
650 | {
651 | "cell_type": "markdown",
652 | "metadata": {},
653 | "source": [
654 |     "We put a bound on the number of function evaluations since the objective function is not likely to completely converge (that's also why the result reports success: False - it simply hit the cap). Our total cost has dropped below 0.5 though, so that's a good indicator that the algorithm is working. Let's use the parameters it found and forward-propagate them through the network to get some predictions."
655 | ]
656 | },
657 | {
658 | "cell_type": "code",
659 | "execution_count": 19,
660 | "metadata": {
661 | "collapsed": false
662 | },
663 | "outputs": [
664 | {
665 | "data": {
666 | "text/plain": [
667 | "array([[10],\n",
668 | " [10],\n",
669 | " [10],\n",
670 | " ..., \n",
671 | " [ 9],\n",
672 | " [ 9],\n",
673 | " [ 9]], dtype=int64)"
674 | ]
675 | },
676 | "execution_count": 19,
677 | "metadata": {},
678 | "output_type": "execute_result"
679 | }
680 | ],
681 | "source": [
682 | "X = np.matrix(X)\n",
683 | "theta1 = np.matrix(np.reshape(fmin.x[:hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))\n",
684 | "theta2 = np.matrix(np.reshape(fmin.x[hidden_size * (input_size + 1):], (num_labels, (hidden_size + 1))))\n",
685 | "\n",
686 | "a1, z2, a2, z3, h = forward_propagate(X, theta1, theta2)\n",
687 |     "y_pred = np.array(np.argmax(h, axis=1) + 1)  # shift from 0-based argmax back to the 1-based class labels used in y\n",
688 | "y_pred"
689 | ]
690 | },
691 | {
692 | "cell_type": "markdown",
693 | "metadata": {},
694 | "source": [
695 |     "Finally we can compute the accuracy on the training set to see how well our trained network is doing."
696 | ]
697 | },
698 | {
699 | "cell_type": "code",
700 | "execution_count": 20,
701 | "metadata": {
702 | "collapsed": false
703 | },
704 | "outputs": [
705 | {
706 | "name": "stdout",
707 | "output_type": "stream",
708 | "text": [
709 | "accuracy = 99.22%\n"
710 | ]
711 | }
712 | ],
713 | "source": [
714 | "correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y)]\n",
715 | "accuracy = (sum(map(int, correct)) / float(len(correct)))\n",
716 | "print 'accuracy = {0}%'.format(accuracy * 100)"
717 | ]
718 | },
719 | {
720 | "cell_type": "markdown",
721 | "metadata": {},
722 | "source": [
723 |     "And we're done! We've successfully implemented a rudimentary feed-forward neural network with backpropagation and used it to classify images of handwritten digits. In the next exercise we'll look at another powerful supervised learning algorithm, support vector machines."
724 | ]
725 | }
726 | ],
727 | "metadata": {
728 | "kernelspec": {
729 | "display_name": "Python 2",
730 | "language": "python",
731 | "name": "python2"
732 | },
733 | "language_info": {
734 | "codemirror_mode": {
735 | "name": "ipython",
736 | "version": 2
737 | },
738 | "file_extension": ".py",
739 | "mimetype": "text/x-python",
740 | "name": "python",
741 | "nbconvert_exporter": "python",
742 | "pygments_lexer": "ipython2",
743 | "version": "2.7.9"
744 | }
745 | },
746 | "nbformat": 4,
747 | "nbformat_minor": 0
748 | }
749 |
--------------------------------------------------------------------------------
/notebooks/tensorflow/Tensorflow-2-FullyConnected.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "kR-4eNdK6lYS"
8 | },
9 | "source": [
10 | "Deep Learning\n",
11 | "=============\n",
12 | "\n",
13 | "Assignment 2\n",
14 | "------------\n",
15 | "\n",
16 | "Previously in `1_notmnist.ipynb`, we created a pickle with formatted datasets for training, development and testing on the [notMNIST dataset](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html).\n",
17 | "\n",
18 | "The goal of this assignment is to progressively train deeper and more accurate models using TensorFlow."
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {
25 | "cellView": "both",
26 | "colab": {
27 | "autoexec": {
28 | "startup": false,
29 | "wait_interval": 0
30 | }
31 | },
32 | "colab_type": "code",
33 | "collapsed": true,
34 | "id": "JLpLa8Jt7Vu4"
35 | },
36 | "outputs": [],
37 | "source": [
38 | "# These are all the modules we'll be using later. Make sure you can import them\n",
39 | "# before proceeding further.\n",
40 | "from __future__ import print_function\n",
41 | "import numpy as np\n",
42 | "import tensorflow as tf\n",
43 | "from six.moves import cPickle as pickle\n",
44 | "from six.moves import range"
45 | ]
46 | },
47 | {
48 | "cell_type": "markdown",
49 | "metadata": {
50 | "colab_type": "text",
51 | "id": "1HrCK6e17WzV"
52 | },
53 | "source": [
54 | "First reload the data we generated in `1_notmnist.ipynb`."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 2,
60 | "metadata": {
61 | "cellView": "both",
62 | "colab": {
63 | "autoexec": {
64 | "startup": false,
65 | "wait_interval": 0
66 | },
67 | "output_extras": [
68 | {
69 | "item_id": 1
70 | }
71 | ]
72 | },
73 | "colab_type": "code",
74 | "collapsed": false,
75 | "executionInfo": {
76 | "elapsed": 19456,
77 | "status": "ok",
78 | "timestamp": 1449847956073,
79 | "user": {
80 | "color": "",
81 | "displayName": "",
82 | "isAnonymous": false,
83 | "isMe": true,
84 | "permissionId": "",
85 | "photoUrl": "",
86 | "sessionId": "0",
87 | "userId": ""
88 | },
89 | "user_tz": 480
90 | },
91 | "id": "y3-cj1bpmuxc",
92 | "outputId": "0ddb1607-1fc4-4ddb-de28-6c7ab7fb0c33"
93 | },
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "Training set (200000, 28, 28) (200000,)\n",
100 | "Validation set (10000, 28, 28) (10000,)\n",
101 | "Test set (10000, 28, 28) (10000,)\n"
102 | ]
103 | }
104 | ],
105 | "source": [
106 | "pickle_file = 'notMNIST.pickle'\n",
107 | "\n",
108 | "with open(pickle_file, 'rb') as f:\n",
109 | " save = pickle.load(f)\n",
110 | " train_dataset = save['train_dataset']\n",
111 | " train_labels = save['train_labels']\n",
112 | " valid_dataset = save['valid_dataset']\n",
113 | " valid_labels = save['valid_labels']\n",
114 | " test_dataset = save['test_dataset']\n",
115 | " test_labels = save['test_labels']\n",
116 | " del save # hint to help gc free up memory\n",
117 | " print('Training set', train_dataset.shape, train_labels.shape)\n",
118 | " print('Validation set', valid_dataset.shape, valid_labels.shape)\n",
119 | " print('Test set', test_dataset.shape, test_labels.shape)"
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {
125 | "colab_type": "text",
126 | "id": "L7aHrm6nGDMB"
127 | },
128 | "source": [
129 | "Reformat into a shape that's more adapted to the models we're going to train:\n",
130 | "- data as a flat matrix,\n",
131 |     "- labels as float 1-hot encodings (see the sketch in the next cell)."
132 | ]
133 | },
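134 |   {
135 |    "cell_type": "markdown",
136 |    "metadata": {},
137 |    "source": [
138 |     "The 1-hot encoding below relies on a NumPy broadcasting trick; as a sketch with made-up labels and only 3 classes:\n",
139 |     "\n",
140 |     "    labels = np.array([0, 2])\n",
141 |     "    (np.arange(3) == labels[:, None]).astype(np.float32)\n",
142 |     "    # -> [[ 1.,  0.,  0.],\n",
143 |     "    #     [ 0.,  0.,  1.]]"
144 |    ]
145 |   },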
134 | {
135 | "cell_type": "code",
136 | "execution_count": 3,
137 | "metadata": {
138 | "cellView": "both",
139 | "colab": {
140 | "autoexec": {
141 | "startup": false,
142 | "wait_interval": 0
143 | },
144 | "output_extras": [
145 | {
146 | "item_id": 1
147 | }
148 | ]
149 | },
150 | "colab_type": "code",
151 | "collapsed": false,
152 | "executionInfo": {
153 | "elapsed": 19723,
154 | "status": "ok",
155 | "timestamp": 1449847956364,
156 | "user": {
157 | "color": "",
158 | "displayName": "",
159 | "isAnonymous": false,
160 | "isMe": true,
161 | "permissionId": "",
162 | "photoUrl": "",
163 | "sessionId": "0",
164 | "userId": ""
165 | },
166 | "user_tz": 480
167 | },
168 | "id": "IRSyYiIIGIzS",
169 | "outputId": "2ba0fc75-1487-4ace-a562-cf81cae82793"
170 | },
171 | "outputs": [
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "Training set (200000, 784) (200000, 10)\n",
177 | "Validation set (10000, 784) (10000, 10)\n",
178 | "Test set (10000, 784) (10000, 10)\n"
179 | ]
180 | }
181 | ],
182 | "source": [
183 | "image_size = 28\n",
184 | "num_labels = 10\n",
185 | "\n",
186 | "def reformat(dataset, labels):\n",
187 | " dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)\n",
188 | " # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]\n",
189 | " labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)\n",
190 | " return dataset, labels\n",
191 | "train_dataset, train_labels = reformat(train_dataset, train_labels)\n",
192 | "valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)\n",
193 | "test_dataset, test_labels = reformat(test_dataset, test_labels)\n",
194 | "print('Training set', train_dataset.shape, train_labels.shape)\n",
195 | "print('Validation set', valid_dataset.shape, valid_labels.shape)\n",
196 | "print('Test set', test_dataset.shape, test_labels.shape)"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "metadata": {
202 | "colab_type": "text",
203 | "id": "nCLVqyQ5vPPH"
204 | },
205 | "source": [
206 | "We're first going to train a multinomial logistic regression using simple gradient descent.\n",
207 | "\n",
208 | "TensorFlow works like this:\n",
209 |     "* First you describe the computation that you want to see performed: what the inputs, the variables, and the operations look like. These get created as nodes in a computation graph. This description is all contained within the block below:\n",
210 | "\n",
211 | " with graph.as_default():\n",
212 | " ...\n",
213 | "\n",
214 |     "* Then you can run the operations on this graph as many times as you want by calling `session.run()`, providing it the outputs you want to fetch from the graph; those values get returned. This runtime operation is all contained in the block below:\n",
215 | "\n",
216 | " with tf.Session(graph=graph) as session:\n",
217 | " ...\n",
218 | "\n",
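219 |     "For instance, a minimal end-to-end sketch (hypothetical values, using the same pre-1.0 TensorFlow API as this notebook) looks like:\n",
220 |     "\n",
221 |     "    graph = tf.Graph()\n",
222 |     "    with graph.as_default():\n",
223 |     "        c = tf.constant([1.0, 2.0])\n",
224 |     "        d = c * 2.0\n",
225 |     "    with tf.Session(graph=graph) as session:\n",
226 |     "        print(session.run(d))  # fetches d from the graph -> [ 2.  4.]\n",
227 |     "\n",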
219 | "Let's load all the data into TensorFlow and build the computation graph corresponding to our training:"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 4,
225 | "metadata": {
226 | "cellView": "both",
227 | "colab": {
228 | "autoexec": {
229 | "startup": false,
230 | "wait_interval": 0
231 | }
232 | },
233 | "colab_type": "code",
234 | "collapsed": true,
235 | "id": "Nfv39qvtvOl_"
236 | },
237 | "outputs": [],
238 | "source": [
239 | "# With gradient descent training, even this much data is prohibitive.\n",
240 | "# Subset the training data for faster turnaround.\n",
241 | "train_subset = 10000\n",
242 | "\n",
243 | "graph = tf.Graph()\n",
244 | "with graph.as_default():\n",
245 | "\n",
246 | " # Input data.\n",
247 | " # Load the training, validation and test data into constants that are\n",
248 | " # attached to the graph.\n",
249 | " tf_train_dataset = tf.constant(train_dataset[:train_subset, :])\n",
250 | " tf_train_labels = tf.constant(train_labels[:train_subset])\n",
251 | " tf_valid_dataset = tf.constant(valid_dataset)\n",
252 | " tf_test_dataset = tf.constant(test_dataset)\n",
253 | " \n",
254 | " # Variables.\n",
255 | " # These are the parameters that we are going to be training. The weight\n",
256 |     "  # matrix will be initialized using random values following a (truncated)\n",
257 | " # normal distribution. The biases get initialized to zero.\n",
258 | " weights = tf.Variable(\n",
259 | " tf.truncated_normal([image_size * image_size, num_labels]))\n",
260 | " biases = tf.Variable(tf.zeros([num_labels]))\n",
261 | " \n",
262 | " # Training computation.\n",
263 | " # We multiply the inputs with the weight matrix, and add biases. We compute\n",
264 | " # the softmax and cross-entropy (it's one operation in TensorFlow, because\n",
265 | " # it's very common, and it can be optimized). We take the average of this\n",
266 | " # cross-entropy across all training examples: that's our loss.\n",
267 | " logits = tf.matmul(tf_train_dataset, weights) + biases\n",
268 | " loss = tf.reduce_mean(\n",
269 | " tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))\n",
270 | " \n",
271 | " # Optimizer.\n",
272 | " # We are going to find the minimum of this loss using gradient descent.\n",
273 | " optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)\n",
274 | " \n",
275 | " # Predictions for the training, validation, and test data.\n",
276 | " # These are not part of training, but merely here so that we can report\n",
277 | " # accuracy figures as we train.\n",
278 | " train_prediction = tf.nn.softmax(logits)\n",
279 | " valid_prediction = tf.nn.softmax(\n",
280 | " tf.matmul(tf_valid_dataset, weights) + biases)\n",
281 | " test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)"
282 | ]
283 | },
284 | {
285 | "cell_type": "markdown",
286 | "metadata": {
287 | "colab_type": "text",
288 | "id": "KQcL4uqISHjP"
289 | },
290 | "source": [
291 | "Let's run this computation and iterate:"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": 5,
297 | "metadata": {
298 | "cellView": "both",
299 | "colab": {
300 | "autoexec": {
301 | "startup": false,
302 | "wait_interval": 0
303 | },
304 | "output_extras": [
305 | {
306 | "item_id": 9
307 | }
308 | ]
309 | },
310 | "colab_type": "code",
311 | "collapsed": false,
312 | "executionInfo": {
313 | "elapsed": 57454,
314 | "status": "ok",
315 | "timestamp": 1449847994134,
316 | "user": {
317 | "color": "",
318 | "displayName": "",
319 | "isAnonymous": false,
320 | "isMe": true,
321 | "permissionId": "",
322 | "photoUrl": "",
323 | "sessionId": "0",
324 | "userId": ""
325 | },
326 | "user_tz": 480
327 | },
328 | "id": "z2cjdenH869W",
329 | "outputId": "4c037ba1-b526-4d8e-e632-91e2a0333267"
330 | },
331 | "outputs": [
332 | {
333 | "name": "stdout",
334 | "output_type": "stream",
335 | "text": [
336 | "Initialized\n",
337 | "Loss at step 0: 17.643072\n",
338 | "Training accuracy: 8.1%\n",
339 | "Validation accuracy: 9.6%\n",
340 | "Loss at step 100: 2.280050\n",
341 | "Training accuracy: 71.9%\n",
342 | "Validation accuracy: 71.1%\n",
343 | "Loss at step 200: 1.824829\n",
344 | "Training accuracy: 74.9%\n",
345 | "Validation accuracy: 73.6%\n",
346 | "Loss at step 300: 1.588680\n",
347 | "Training accuracy: 76.2%\n",
348 | "Validation accuracy: 74.5%\n",
349 | "Loss at step 400: 1.431254\n",
350 | "Training accuracy: 77.1%\n",
351 | "Validation accuracy: 74.9%\n",
352 | "Loss at step 500: 1.314126\n",
353 | "Training accuracy: 77.9%\n",
354 | "Validation accuracy: 75.1%\n",
355 | "Loss at step 600: 1.221493\n",
356 | "Training accuracy: 78.3%\n",
357 | "Validation accuracy: 75.3%\n",
358 | "Loss at step 700: 1.145665\n",
359 | "Training accuracy: 78.9%\n",
360 | "Validation accuracy: 75.6%\n",
361 | "Loss at step 800: 1.082164\n",
362 | "Training accuracy: 79.3%\n",
363 | "Validation accuracy: 75.7%\n",
364 | "Test accuracy: 82.1%\n"
365 | ]
366 | }
367 | ],
368 | "source": [
369 | "num_steps = 801\n",
370 | "\n",
371 | "def accuracy(predictions, labels):\n",
372 | " return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))\n",
373 | " / predictions.shape[0])\n",
374 | "\n",
375 | "with tf.Session(graph=graph) as session:\n",
376 | " # This is a one-time operation which ensures the parameters get initialized as\n",
377 | " # we described in the graph: random weights for the matrix, zeros for the\n",
378 | " # biases. \n",
379 | " tf.initialize_all_variables().run()\n",
380 | " print('Initialized')\n",
381 | " for step in range(num_steps):\n",
382 | " # Run the computations. We tell .run() that we want to run the optimizer,\n",
383 | " # and get the loss value and the training predictions returned as numpy\n",
384 | " # arrays.\n",
385 | " _, l, predictions = session.run([optimizer, loss, train_prediction])\n",
386 | " if (step % 100 == 0):\n",
387 | " print('Loss at step %d: %f' % (step, l))\n",
388 | " print('Training accuracy: %.1f%%' % accuracy(\n",
389 | " predictions, train_labels[:train_subset, :]))\n",
390 | " # Calling .eval() on valid_prediction is basically like calling run(), but\n",
391 | " # just to get that one numpy array. Note that it recomputes all its graph\n",
392 | " # dependencies.\n",
393 | " print('Validation accuracy: %.1f%%' % accuracy(\n",
394 | " valid_prediction.eval(), valid_labels))\n",
395 | " print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))"
396 | ]
397 | },
398 | {
399 | "cell_type": "markdown",
400 | "metadata": {
401 | "colab_type": "text",
402 | "id": "x68f-hxRGm3H"
403 | },
404 | "source": [
405 | "Let's now switch to stochastic gradient descent training instead, which is much faster.\n",
406 | "\n",
407 |     "The graph will be similar, except that instead of holding all the training data in a constant node, we create a `Placeholder` node which will be fed actual data at every call of `session.run()`.\n",
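408 |     "\n",
409 |     "At run time the placeholders get bound through a `feed_dict`; as a sketch (mirroring the training loop below):\n",
410 |     "\n",
411 |     "    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}\n",
412 |     "    _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)"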
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": 6,
413 | "metadata": {
414 | "cellView": "both",
415 | "colab": {
416 | "autoexec": {
417 | "startup": false,
418 | "wait_interval": 0
419 | }
420 | },
421 | "colab_type": "code",
422 | "collapsed": true,
423 | "id": "qhPMzWYRGrzM"
424 | },
425 | "outputs": [],
426 | "source": [
427 | "batch_size = 128\n",
428 | "\n",
429 | "graph = tf.Graph()\n",
430 | "with graph.as_default():\n",
431 | "\n",
432 | " # Input data. For the training data, we use a placeholder that will be fed\n",
433 | " # at run time with a training minibatch.\n",
434 | " tf_train_dataset = tf.placeholder(tf.float32,\n",
435 | " shape=(batch_size, image_size * image_size))\n",
436 | " tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))\n",
437 | " tf_valid_dataset = tf.constant(valid_dataset)\n",
438 | " tf_test_dataset = tf.constant(test_dataset)\n",
439 | " \n",
440 | " # Variables.\n",
441 | " weights = tf.Variable(\n",
442 | " tf.truncated_normal([image_size * image_size, num_labels]))\n",
443 | " biases = tf.Variable(tf.zeros([num_labels]))\n",
444 | " \n",
445 | " # Training computation.\n",
446 | " logits = tf.matmul(tf_train_dataset, weights) + biases\n",
447 | " loss = tf.reduce_mean(\n",
448 | " tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))\n",
449 | " \n",
450 | " # Optimizer.\n",
451 | " optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)\n",
452 | " \n",
453 | " # Predictions for the training, validation, and test data.\n",
454 | " train_prediction = tf.nn.softmax(logits)\n",
455 | " valid_prediction = tf.nn.softmax(\n",
456 | " tf.matmul(tf_valid_dataset, weights) + biases)\n",
457 | " test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)"
458 | ]
459 | },
460 | {
461 | "cell_type": "markdown",
462 | "metadata": {
463 | "colab_type": "text",
464 | "id": "XmVZESmtG4JH"
465 | },
466 | "source": [
467 | "Let's run it:"
468 | ]
469 | },
470 | {
471 | "cell_type": "code",
472 | "execution_count": 7,
473 | "metadata": {
474 | "cellView": "both",
475 | "colab": {
476 | "autoexec": {
477 | "startup": false,
478 | "wait_interval": 0
479 | },
480 | "output_extras": [
481 | {
482 | "item_id": 6
483 | }
484 | ]
485 | },
486 | "colab_type": "code",
487 | "collapsed": false,
488 | "executionInfo": {
489 | "elapsed": 66292,
490 | "status": "ok",
491 | "timestamp": 1449848003013,
492 | "user": {
493 | "color": "",
494 | "displayName": "",
495 | "isAnonymous": false,
496 | "isMe": true,
497 | "permissionId": "",
498 | "photoUrl": "",
499 | "sessionId": "0",
500 | "userId": ""
501 | },
502 | "user_tz": 480
503 | },
504 | "id": "FoF91pknG_YW",
505 | "outputId": "d255c80e-954d-4183-ca1c-c7333ce91d0a"
506 | },
507 | "outputs": [
508 | {
509 | "name": "stdout",
510 | "output_type": "stream",
511 | "text": [
512 | "Initialized\n",
513 | "Minibatch loss at step 0: 19.032427\n",
514 | "Minibatch accuracy: 8.6%\n",
515 | "Validation accuracy: 9.5%\n",
516 | "Minibatch loss at step 500: 1.287354\n",
517 | "Minibatch accuracy: 77.3%\n",
518 | "Validation accuracy: 75.8%\n",
519 | "Minibatch loss at step 1000: 1.040482\n",
520 | "Minibatch accuracy: 83.6%\n",
521 | "Validation accuracy: 76.4%\n",
522 | "Minibatch loss at step 1500: 1.008734\n",
523 | "Minibatch accuracy: 82.8%\n",
524 | "Validation accuracy: 78.2%\n",
525 | "Minibatch loss at step 2000: 0.933759\n",
526 | "Minibatch accuracy: 79.7%\n",
527 | "Validation accuracy: 78.5%\n",
528 | "Minibatch loss at step 2500: 0.854228\n",
529 | "Minibatch accuracy: 78.9%\n",
530 | "Validation accuracy: 78.6%\n",
531 | "Minibatch loss at step 3000: 1.088064\n",
532 | "Minibatch accuracy: 71.1%\n",
533 | "Validation accuracy: 79.0%\n",
534 | "Test accuracy: 85.7%\n"
535 | ]
536 | }
537 | ],
538 | "source": [
539 | "num_steps = 3001\n",
540 | "\n",
541 | "with tf.Session(graph=graph) as session:\n",
542 | " tf.initialize_all_variables().run()\n",
543 | " print(\"Initialized\")\n",
544 | " for step in range(num_steps):\n",
545 | " # Pick an offset within the training data, which has been randomized.\n",
546 | " # Note: we could use better randomization across epochs.\n",
547 | " offset = (step * batch_size) % (train_labels.shape[0] - batch_size)\n",
548 | " # Generate a minibatch.\n",
549 | " batch_data = train_dataset[offset:(offset + batch_size), :]\n",
550 | " batch_labels = train_labels[offset:(offset + batch_size), :]\n",
551 | " # Prepare a dictionary telling the session where to feed the minibatch.\n",
552 | " # The key of the dictionary is the placeholder node of the graph to be fed,\n",
553 | " # and the value is the numpy array to feed to it.\n",
554 | " feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}\n",
555 | " _, l, predictions = session.run(\n",
556 | " [optimizer, loss, train_prediction], feed_dict=feed_dict)\n",
557 | " if (step % 500 == 0):\n",
558 | " print(\"Minibatch loss at step %d: %f\" % (step, l))\n",
559 | " print(\"Minibatch accuracy: %.1f%%\" % accuracy(predictions, batch_labels))\n",
560 | " print(\"Validation accuracy: %.1f%%\" % accuracy(\n",
561 | " valid_prediction.eval(), valid_labels))\n",
562 | " print(\"Test accuracy: %.1f%%\" % accuracy(test_prediction.eval(), test_labels))"
563 | ]
564 | },
565 | {
566 | "cell_type": "markdown",
567 | "metadata": {
568 | "colab_type": "text",
569 | "id": "7omWxtvLLxik"
570 | },
571 | "source": [
572 | "---\n",
573 | "Problem\n",
574 | "-------\n",
575 | "\n",
576 |     "Turn the logistic regression example with SGD into a 1-hidden layer neural network with rectified linear units (tf.nn.relu()) and 1024 hidden nodes. This model should improve your validation / test accuracy."
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": 8,
582 | "metadata": {
583 | "collapsed": false
584 | },
585 | "outputs": [],
586 | "source": [
587 | "batch_size = 128\n",
588 | "hidden_nodes = 1024\n",
589 | "\n",
590 | "graph = tf.Graph()\n",
591 | "with graph.as_default():\n",
592 | "\n",
593 | " # Input data. For the training data, we use a placeholder that will be fed\n",
594 | " # at run time with a training minibatch.\n",
595 | " tf_train_dataset = tf.placeholder(tf.float32,\n",
596 | " shape=(batch_size, image_size * image_size))\n",
597 | " tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))\n",
598 | " tf_valid_dataset = tf.constant(valid_dataset)\n",
599 | " tf_test_dataset = tf.constant(test_dataset)\n",
600 | " \n",
601 | " # Variables.\n",
602 | " weights_1 = tf.Variable(\n",
603 | " tf.truncated_normal([image_size * image_size, hidden_nodes]))\n",
604 | " biases_1 = tf.Variable(tf.zeros([hidden_nodes]))\n",
605 | " weights_2 = tf.Variable(\n",
606 | " tf.truncated_normal([hidden_nodes, num_labels]))\n",
607 | " biases_2 = tf.Variable(tf.zeros([num_labels]))\n",
608 | " \n",
609 | " # Training computation.\n",
610 | " def forward_prop(input):\n",
611 | " h1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)\n",
612 | " return tf.matmul(h1, weights_2) + biases_2\n",
613 | " \n",
614 | " logits = forward_prop(tf_train_dataset)\n",
615 | " loss = tf.reduce_mean(\n",
616 | " tf.nn.softmax_cross_entropy_with_logits(logits, tf_train_labels))\n",
617 | " \n",
618 | " # Optimizer.\n",
619 | " optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)\n",
620 | " \n",
621 | " # Predictions for the training, validation, and test data.\n",
622 | " train_prediction = tf.nn.softmax(logits)\n",
623 | " valid_prediction = tf.nn.softmax(forward_prop(tf_valid_dataset))\n",
624 | " test_prediction = tf.nn.softmax(forward_prop(tf_test_dataset))"
625 | ]
626 | },
627 | {
628 | "cell_type": "code",
629 | "execution_count": 9,
630 | "metadata": {
631 | "collapsed": false
632 | },
633 | "outputs": [
634 | {
635 | "name": "stdout",
636 | "output_type": "stream",
637 | "text": [
638 | "Initialized\n",
639 | "Minibatch loss at step 0: 364.405762\n",
640 | "Minibatch accuracy: 13.3%\n",
641 | "Validation accuracy: 32.3%\n",
642 | "Minibatch loss at step 500: 12.502020\n",
643 | "Minibatch accuracy: 78.1%\n",
644 | "Validation accuracy: 80.9%\n",
645 | "Minibatch loss at step 1000: 3.743875\n",
646 | "Minibatch accuracy: 83.6%\n",
647 | "Validation accuracy: 81.2%\n",
648 | "Minibatch loss at step 1500: 5.572070\n",
649 | "Minibatch accuracy: 79.7%\n",
650 | "Validation accuracy: 81.5%\n",
651 | "Minibatch loss at step 2000: 3.639989\n",
652 | "Minibatch accuracy: 82.8%\n",
653 | "Validation accuracy: 82.7%\n",
654 | "Minibatch loss at step 2500: 6.019464\n",
655 | "Minibatch accuracy: 82.8%\n",
656 | "Validation accuracy: 82.8%\n",
657 | "Minibatch loss at step 3000: 2.160873\n",
658 | "Minibatch accuracy: 81.2%\n",
659 | "Validation accuracy: 82.9%\n",
660 | "Test accuracy: 89.0%\n"
661 | ]
662 | }
663 | ],
664 | "source": [
665 | "num_steps = 3001\n",
666 | "\n",
667 | "with tf.Session(graph=graph) as session:\n",
668 | " tf.initialize_all_variables().run()\n",
669 | " print(\"Initialized\")\n",
670 | " for step in range(num_steps):\n",
671 | " # Pick an offset within the training data, which has been randomized.\n",
672 | " # Note: we could use better randomization across epochs.\n",
673 | " offset = (step * batch_size) % (train_labels.shape[0] - batch_size)\n",
674 | " # Generate a minibatch.\n",
675 | " batch_data = train_dataset[offset:(offset + batch_size), :]\n",
676 | " batch_labels = train_labels[offset:(offset + batch_size), :]\n",
677 | " # Prepare a dictionary telling the session where to feed the minibatch.\n",
678 | " # The key of the dictionary is the placeholder node of the graph to be fed,\n",
679 | " # and the value is the numpy array to feed to it.\n",
680 | " feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}\n",
681 | " _, l, predictions = session.run(\n",
682 | " [optimizer, loss, train_prediction], feed_dict=feed_dict)\n",
683 | " if (step % 500 == 0):\n",
684 | " print(\"Minibatch loss at step %d: %f\" % (step, l))\n",
685 | " print(\"Minibatch accuracy: %.1f%%\" % accuracy(predictions, batch_labels))\n",
686 | " print(\"Validation accuracy: %.1f%%\" % accuracy(\n",
687 | " valid_prediction.eval(), valid_labels))\n",
688 | " print(\"Test accuracy: %.1f%%\" % accuracy(test_prediction.eval(), test_labels))"
689 | ]
690 | }
691 | ],
692 | "metadata": {
693 | "colab": {
694 | "default_view": {},
695 | "name": "2_fullyconnected.ipynb",
696 | "provenance": [],
697 | "version": "0.3.2",
698 | "views": {}
699 | },
700 | "kernelspec": {
701 | "display_name": "Python 2",
702 | "language": "python",
703 | "name": "python2"
704 | },
705 | "language_info": {
706 | "codemirror_mode": {
707 | "name": "ipython",
708 | "version": 2
709 | },
710 | "file_extension": ".py",
711 | "mimetype": "text/x-python",
712 | "name": "python",
713 | "nbconvert_exporter": "python",
714 | "pygments_lexer": "ipython2",
715 | "version": "2.7.11"
716 | }
717 | },
718 | "nbformat": 4,
719 | "nbformat_minor": 0
720 | }
721 |
--------------------------------------------------------------------------------
/scripts/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jdwittenauer/ipython-notebooks/44641e20d6c30c073cde849626a51248632d1381/scripts/__init__.py
--------------------------------------------------------------------------------
/scripts/hello.py:
--------------------------------------------------------------------------------
1 | print('Hello world!')
--------------------------------------------------------------------------------