├── 2012 └── 12 │ ├── 4 │ └── hello-internet.rst │ └── 5 │ └── python-spell-checker.rst ├── 2013 ├── 1 │ └── 5 │ │ └── decorates-and-annotations.rst └── 2 │ ├── 4 │ └── joel-test-for-data-teams.rst │ └── 24 │ └── timing-python-code.rst ├── .DS_Store ├── config.yml ├── Makefile ├── README.md ├── _build ├── 2012 │ ├── 12 │ │ ├── 4 │ │ │ └── hello-internet │ │ │ │ └── index.html │ │ ├── 5 │ │ │ └── python-spell-checker │ │ │ │ └── index.html │ │ └── index.html │ └── index.html ├── 2013 │ ├── 1 │ │ └── 5 │ │ │ └── decorates-and-annotations │ │ │ └── index.html │ ├── 2 │ │ ├── 4 │ │ │ └── joel-test-for-data-teams │ │ │ │ └── index.html │ │ └── 24 │ │ │ └── timing-python-code │ │ │ └── index.html │ ├── index.html │ ├── 02 │ │ └── index.html │ └── 01 │ │ └── index.html ├── README.md ├── upload.sh ├── tags │ ├── Thoughts │ │ ├── index.html │ │ └── feed.atom │ ├── performance │ │ ├── index.html │ │ └── feed.atom │ ├── bayes │ │ ├── index.html │ │ └── feed.atom │ ├── introduction │ │ ├── index.html │ │ └── feed.atom │ ├── statistics │ │ ├── index.html │ │ └── feed.atom │ ├── probability │ │ ├── index.html │ │ └── feed.atom │ ├── python │ │ └── index.html │ └── index.html ├── archive │ └── index.html ├── static │ ├── _pygments.css │ └── style.css ├── about │ └── index.html └── index.html ├── _templates ├── tagcloud.html ├── tag.html ├── _pagination.html ├── blog │ ├── year_archive.html │ ├── month_archive.html │ ├── archive.html │ └── index.html ├── rst_display.html └── layout.html ├── upload.sh ├── .gitignore ├── about.rst └── static └── style.css /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mattalcock/blog/HEAD/.DS_Store -------------------------------------------------------------------------------- /config.yml: -------------------------------------------------------------------------------- 1 | active_modules: [pygments, tags, blog, latex] 2 | author: Matt Alcock 3 | canonical_url: 
http://blog.mattalcock.com/ 4 | modules: 5 | pygments: 6 | style: tango -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | all: build upload 2 | 3 | clean: 4 | rm -rf _build 5 | 6 | build: 7 | run-rstblog build 8 | 9 | serve: 10 | run-rstblog serve 11 | 12 | upload: 13 | ./upload.sh 14 | @echo "Done..." -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | blog 2 | ==== 3 | 4 | My personal blog 5 | 6 | 7 | Building blog command.... 8 | 9 | >run-rstblog build 10 | 11 | Running blog command.... 12 | 13 | >run-rstblog serve 14 | 15 | Copy blog command.... 16 | 17 | >upload.sh (work in progress) -------------------------------------------------------------------------------- /_build/README.md: -------------------------------------------------------------------------------- 1 | blog 2 | ==== 3 | 4 | My personal blog 5 | 6 | 7 | Building blog command.... 8 | 9 | >run-rstblog build 10 | 11 | Running blog command.... 12 | 13 | >run-rstblog serve 14 | 15 | Copy blog command.... 16 | 17 | >upload.sh (work in progress) -------------------------------------------------------------------------------- /_templates/tagcloud.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% block title %}Tags{% endblock %} 3 | {% block body %} 4 |

Tags

5 | 11 | {% endblock %} 12 | -------------------------------------------------------------------------------- /upload.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | HOST='ftp.mattalcock.com' 3 | USER='mattalc1' 4 | TARGETFOLDER='/public_html/blog' 5 | SOURCEFOLDER='/Users/mattalcock/Dev/blog/_build' 6 | 7 | #lftp was installed using 'brew install lftp' 8 | 9 | lftp -f " 10 | open $HOST 11 | user $USER $BLOGPASS 12 | lcd $SOURCEFOLDER 13 | mirror --reverse --delete --verbose $SOURCEFOLDER $TARGETFOLDER 14 | bye 15 | " -------------------------------------------------------------------------------- /_build/upload.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | HOST='ftp.mattalcock.com' 3 | USER='mattalc1' 4 | TARGETFOLDER='/public_html/blog' 5 | SOURCEFOLDER='/Users/mattalcock/Dev/blog/_build' 6 | 7 | #lftp was installed using 'brew install lftp' 8 | 9 | lftp -f " 10 | open $HOST 11 | user $USER $BLOGPASS 12 | lcd $SOURCEFOLDER 13 | mirror --reverse --delete --verbose $SOURCEFOLDER $TARGETFOLDER 14 | bye 15 | " -------------------------------------------------------------------------------- /_templates/tag.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% block title %}Entries tagged “{{ tag.name }}”{% endblock %} 3 | {% block body %} 4 |

Entries tagged “{{ tag.name }}”

5 | 10 | {% endblock %} 11 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | 3 | # C extensions 4 | *.so 5 | 6 | # Packages 7 | *.egg 8 | *.egg-info 9 | dist 10 | build 11 | eggs 12 | parts 13 | bin 14 | var 15 | sdist 16 | develop-eggs 17 | .installed.cfg 18 | lib 19 | lib64 20 | 21 | # Installer logs 22 | pip-log.txt 23 | 24 | # Unit test / coverage reports 25 | .coverage 26 | .tox 27 | nosetests.xml 28 | 29 | # Translations 30 | *.mo 31 | 32 | # Mr Developer 33 | .mr.developer.cfg 34 | .project 35 | .pydevproject 36 | -------------------------------------------------------------------------------- /_templates/_pagination.html: -------------------------------------------------------------------------------- 1 | 14 | -------------------------------------------------------------------------------- /_templates/blog/year_archive.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% block title %}Blog Archive for {{ entry.year }}{% endblock %} 3 | {% block body %} 4 | 5 |

Blog Archive for {{ entry.year }}

6 | 7 | 14 | {% endblock %} 15 | -------------------------------------------------------------------------------- /_templates/blog/month_archive.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% block title %}Blog Archive for {{ entry.month_name}}, {{ entry.year }}{% endblock %} 3 | {% block body %} 4 | 5 |

Blog Archive for {{ entry.month_name }}, 6 | {{ entry.year }}

7 | 8 | 14 | {% endblock %} 15 | -------------------------------------------------------------------------------- /_templates/blog/archive.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% block title %}Blog Archive{% endblock %} 3 | {% block body %} 4 | 5 |

Blog Archive

6 | 7 | 19 | {% endblock %} 20 | -------------------------------------------------------------------------------- /_templates/rst_display.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% block title %}{{ rst.title }}{% endblock %} 3 | {% block body %} 4 | {%- if not config.hide_title %} 5 | {{ rst.html_title }} 6 | {%- endif %} 7 | {% if ctx.pub_date %} 8 |

written on {{ format_date(ctx.pub_date, format='full') }} 9 | {% endif %} 10 | 11 | {{ rst.fragment }} 12 | 13 | {% if ctx.tags %} 14 |

This entry was tagged 15 | {% for tag in ctx.tags|sort(case_sensitive=true) %} 16 | {%- if not loop.first and not loop.last %}, {% endif -%} 17 | {%- if loop.last and not loop.first %} and {% endif %} 18 | {{ tag }} 19 | {%- endfor %} 20 | {% endif %} 21 | 22 | {% if 'disqus' in config.active_modules %} 23 | {{ get_disqus() }} 24 | {% endif %} 25 | {% endblock %} 26 | -------------------------------------------------------------------------------- /_templates/blog/index.html: -------------------------------------------------------------------------------- 1 | {% extends "layout.html" %} 2 | {% block title %}Blog{% endblock %} 3 | {% block body %} 4 | 5 | {%- for entry in pagination.get_slice() %} 6 |

7 |
{{ format_date(entry.pub_date, format='medium') }}
8 |
9 |

{{ entry.title }}

10 | {% if entry.summary %} 11 |
{{ entry.render_summary() }}
12 | {% endif %} 13 | {% if entry.tags %} 14 |
15 | {% for tag in entry.tags|sort(case_sensitive=true) %} 16 | {%- if not loop.first and not loop.last %}, {% endif -%} 17 | {%- if loop.last and not loop.first %} and {% endif %} 18 | #{{ tag }} 19 | {%- endfor %} 20 |
21 | {% endif %} 22 |
23 |
24 | {%- endfor %} 25 | 26 | {% if show_pagination and pagination.pages > 1 %} 27 | {{ pagination }} 28 | {% endif %} 29 | {% endblock %} 30 | -------------------------------------------------------------------------------- /about.rst: -------------------------------------------------------------------------------- 1 | public: yes 2 | 3 | About Me 4 | ======== 5 | 6 | My name is Matt Alcock and I'm a Data Scientist, Analytics Lead and Python fan. I'm currently the Lead Analyst at NaturalMotion Games where I manage a small team of data analysts. I currently work and live in Oxford. I've worked in small startups and large multinational companies covering a variety of industries including games, finance and fashion. I've been working with data for 10+ years and although my jobs have been varied they've all centered around drawing insight from large data sets. 7 | 8 | I split the majority of my time between Oxford and London. If you'd like to meet for a coffee or discuss anything please drop me a message through one of the following channels: 9 | 10 | - drop me an `email `_ 11 | - send a tweet to `@mattalcock `_ 12 | - send me a direct message on `LinkedIn `_ 13 | 14 | 15 | About this Website 16 | ------------------ 17 | 18 | The website is a collection of observations, thoughts, notes and side projects. A lot of the supporting code for the blog posts can be found in a public repo under my `github account `_. 19 | 20 | The website itself is written in reStructuredText and built with a small 21 | script written by the very talented Armin Ronacher. The source code can be `found on github 22 | `_. -------------------------------------------------------------------------------- /2013/2/4/joel-test-for-data-teams.rst: -------------------------------------------------------------------------------- 1 | public: no 2 | tags: [python, introduction] 3 | pub_date: 2013-02-04 4 | summary: | 5 | The 'Data Team Test': a Joel Test for data and analytics teams. 
6 | 7 | The 'Data Team Test' 8 | ==================== 9 | 10 | The original 'Joel Test'. 11 | 12 | Do you use source control? 13 | Can you make a build in one step? 14 | Do you make daily builds? 15 | Do you have a bug database? 16 | Do you fix bugs before writing new code? 17 | Do you have an up-to-date schedule? 18 | Do you have a spec? 19 | Do programmers have quiet working conditions? 20 | Do you use the best tools money can buy? 21 | Do you have testers? 22 | Do new candidates write code during their interview? 23 | Do you do hallway usability testing? 24 | 25 | 26 | The 'Data Team Test'. 27 | 28 | Do you use source control? 29 | Do you have separate development and production environments? 30 | Do you have access to raw data in a warehouse or other offline store that doesn't impact the live system? 31 | Do you have a way of recording and marking dirty data? 32 | Do you record bugs and fix them before writing new code? 33 | Do you have a process to schedule and prioritise analysis? 34 | Do you spend time with product owners and domain experts? 35 | Do programmers/analysts have quiet working conditions? 36 | Do you use the best tools money can buy? 37 | Do you have testers? 38 | Do you have a mechanism to communicate findings and make recommendations? 39 | Do you and other teams employ data-driven development techniques? -------------------------------------------------------------------------------- /2012/12/4/hello-internet.rst: -------------------------------------------------------------------------------- 1 | public: yes 2 | tags: [thoughts] 3 | pub_date: 2012-12-04 4 | summary: | 5 | The obligatory first post. 6 | 7 | Hello Internet 8 | ============== 9 | 10 | The obligatory first post. I'll be honest, I've been meaning to start and commit to a blog for some time. After some false starts covering my broad spectrum of interests I've decided to focus on writing about my thoughts as a Data Scientist. I'm hoping people will find this informative and insightful. 
If there are any thoughts, feedback or collaborations that come from this then the blog will have been a success. So please let me know what you think. 11 | 12 | I've been working with data for 10 years and my job title has jumped around from Developer, Data Analyst, Quantitative Analyst, Team Leader, Product Manager, Warehouse Manager, Head Of Analytics, Lead Analyst and Data Scientist. So what am I? I'm not sure everyone fits into role buckets but one thing I am convinced of is that everyone's interests and expertise are different. I enjoy managing small technical teams, I love working with large amounts of data and I have the expertise to apply statistical and scientific methods to my work. 13 | 14 | Below are some biases and opinions I should mention before I start. I'll explain these in more detail over the coming posts but they're four personal and somewhat subjective opinions I wanted to share from the outset. 15 | 16 | - I love the power of modern NoSQL data stores but still feel SQL is an amazing tool for analysis that's hard to beat. 17 | - I think data can unlock questions and give insights into almost every area of business but I also understand it's not a silver bullet. It should be used with creativity, lateral thinking and domain expertise, not instead of them. 18 | - I love the concise power of statistics but realise they're frequently misleading and often poorly presented and explained. 19 | - I'm programming-language agnostic but love to use Python 20 | 21 | The blog will contain thoughts, tools, projects and some book reviews. If there is anything you'd like me to talk about or review please drop me an email. 22 | 23 | I hope you like what's to come... 
24 | -------------------------------------------------------------------------------- /_build/tags/Thoughts/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Entries tagged “thoughts” | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Entries tagged “thoughts”

27 | 30 | 31 |
32 | 44 |
45 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /_build/tags/performance/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Entries tagged “performance” | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Entries tagged “performance”

27 | 30 | 31 |
32 | 44 |
45 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /_build/2012/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blog Archive for 2012 | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 | 27 |

Blog Archive for 2012

28 | 29 | 33 | 34 |
35 | 47 |
48 | 61 | 62 | 63 | -------------------------------------------------------------------------------- /_build/tags/bayes/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Entries tagged “bayes” | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Entries tagged “bayes”

27 | 30 | 31 |
32 | 44 |
45 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /_build/tags/introduction/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Entries tagged “introduction” | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Entries tagged “introduction”

27 | 30 | 31 |
32 | 44 |
45 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /_build/tags/statistics/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Entries tagged “statistics” | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Entries tagged “statistics”

27 | 30 | 31 |
32 | 44 |
45 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /_build/tags/probability/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Entries tagged “probability” | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Entries tagged “probability”

27 | 30 | 31 |
32 | 44 |
45 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /_build/2013/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blog Archive for 2013 | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 | 27 |

Blog Archive for 2013

28 | 29 | 35 | 36 |
37 | 49 |
50 | 63 | 64 | 65 | -------------------------------------------------------------------------------- /_build/2013/02/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blog Archive for February, 2013 | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 | 27 |

Blog Archive for February, 28 | 2013

29 | 30 | 33 | 34 |
35 | 47 |
48 | 61 | 62 | 63 | -------------------------------------------------------------------------------- /_build/2013/01/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blog Archive for January, 2013 | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 | 27 |

Blog Archive for January, 28 | 2013

29 | 30 | 33 | 34 |
35 | 47 |
48 | 61 | 62 | 63 | -------------------------------------------------------------------------------- /_templates/layout.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | {% block htmlhead %} 6 | {% block title %}Home{% endblock %} | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | {%- for link in links %} 10 | 12 | {%- endfor %} 13 | {% endblock %} 14 | 15 | 16 |
17 |
18 | Matt Alcock - A Data Scientist with a passion for Python 19 |
20 | 28 |
29 | {% block body %}{% endblock %} 30 |
31 | 43 |
44 | 57 | 58 | -------------------------------------------------------------------------------- /_build/tags/python/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Entries tagged “python” | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Entries tagged “python”

27 | 32 | 33 |
34 | 46 |
47 | 60 | 61 | 62 | -------------------------------------------------------------------------------- /_build/2012/12/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blog Archive for December, 2012 | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 | 27 |

Blog Archive for December, 28 | 2012

29 | 30 | 34 | 35 |
36 | 48 |
49 | 62 | 63 | 64 | -------------------------------------------------------------------------------- /_build/archive/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blog Archive | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 | 27 |

Blog Archive

28 | 29 | 43 | 44 |
45 | 57 |
58 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /_build/tags/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Tags | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |

Tags

27 | 36 | 37 |
38 | 50 |
51 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /_build/tags/Thoughts/feed.atom: -------------------------------------------------------------------------------- 1 | 2 | 3 | Recent Blog Posts 4 | http://blog.mattalcock.com/feed.atom 5 | 2012-12-04T00:00:00Z 6 | 7 | 8 | Recent blog posts 9 | Werkzeug 10 | 11 | Hello Internet 12 | http://blog.mattalcock.com/2012/12/4/hello-internet 13 | 2012-12-04T00:00:00Z 14 | 15 | 16 | Matt Alcock 17 | 18 | <p>The obligatory first post. I'll be honest, I've been meaning to start and commit to a blog for some time. After some false starts covering my broad spectrum of interests I've decided to focus on writing about my thoughts as a Data Scientist. I'm hoping people will find this informative and insightful. If there are any thoughts, feedback or collaborations that come from this then the blog will have been a success. So please let me know what you think.</p> 19 | <p>I've been working with data for 10 years and my job title has jumped around from Developer, Data Analyst, Quantitative Analyst, Team Leader, Product Manager, Warehouse Manager, Head Of Analytics, Lead Analyst and Data Scientist. So what am I? I'm not sure everyone fits into role buckets but one thing I am convinced of is that everyone's interests and expertise are different. I enjoy managing small technical teams, I love working with large amounts of data and I have the expertise to apply statistical and scientific methods to my work.</p> 20 | <p>Below are some biases and opinions I should mention before I start. I'll explain these in more detail over the coming posts but they're four personal and somewhat subjective opinions I wanted to share from the outset. 21 | 22 | - I love the power of modern NoSQL data stores but still feel SQL is an amazing tool for analysis that's hard to beat. 
23 | - I think data can unlock questions and give insights into almost every area of business but I also understand it's not a silver bullet. It should be used with creativity, lateral thinking and domain expertise, not instead of them. 24 | - I love the concise power of statistics but realise they're frequently misleading and often poorly presented and explained. 25 | - I'm programming-language agnostic but love to use Python</p> 26 | <p>The blog will contain thoughts, tools, projects and some book reviews. If there is anything you'd like me to talk about or review please drop me an email.</p> 27 | <p>I hope you like what's to come...</p> 28 | 29 | 30 | 31 | 32 | -------------------------------------------------------------------------------- /_build/2013/2/4/joel-test-for-data-teams/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | The 'Data Team Test' | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 |
15 | Matt Alcock - A Data Scientist with a passion for Python 16 |
17 | 25 |
26 | 27 |

The 'Data Team Test'

28 | 29 | 30 |

written on Monday, February 4, 2013 31 | 32 | 33 |

The original 'Joel Test'.

34 |

Do you use source control? 35 | Can you make a build in one step? 36 | Do you make daily builds? 37 | Do you have a bug database? 38 | Do you fix bugs before writing new code? 39 | Do you have an up-to-date schedule? 40 | Do you have a spec? 41 | Do programmers have quiet working conditions? 42 | Do you use the best tools money can buy? 43 | Do you have testers? 44 | Do new candidates write code during their interview? 45 | Do you do hallway usability testing?

46 |

The 'Data Team Test'.

47 |

Do you use source control? 48 | Do you have separate development and production environments? 49 | Do you have access to raw data in a warehouse or other offline store that doesn't impact the live system? 50 | Do you have a way of recording and marking dirty data? 51 | Do you record bugs and fix them before writing new code? 52 | Do you have a process to schedule and prioritise analysis? 53 | Do you spend time with product owners and domain experts? 54 | Do programmers/analysts have quiet working conditions? 55 | Do you use the best tools money can buy? 56 | Do you have testers? 57 | Do you have a mechanism to communicate findings and make recommendations? 58 | Do you and other teams employ data-driven development techniques?

59 | 60 | 61 | 62 | 63 | 64 | 65 |
66 | 78 |
79 | 92 | 93 | 94 | -------------------------------------------------------------------------------- /_build/static/_pygments.css: -------------------------------------------------------------------------------- 1 | .hll { background-color: #ffffcc } 2 | .c { color: #8f5902; font-style: italic } /* Comment */ 3 | .err { color: #a40000; border: 1px solid #ef2929 } /* Error */ 4 | .g { color: #000000 } /* Generic */ 5 | .k { color: #204a87; font-weight: bold } /* Keyword */ 6 | .l { color: #000000 } /* Literal */ 7 | .n { color: #000000 } /* Name */ 8 | .o { color: #ce5c00; font-weight: bold } /* Operator */ 9 | .x { color: #000000 } /* Other */ 10 | .p { color: #000000; font-weight: bold } /* Punctuation */ 11 | .cm { color: #8f5902; font-style: italic } /* Comment.Multiline */ 12 | .cp { color: #8f5902; font-style: italic } /* Comment.Preproc */ 13 | .c1 { color: #8f5902; font-style: italic } /* Comment.Single */ 14 | .cs { color: #8f5902; font-style: italic } /* Comment.Special */ 15 | .gd { color: #a40000 } /* Generic.Deleted */ 16 | .ge { color: #000000; font-style: italic } /* Generic.Emph */ 17 | .gr { color: #ef2929 } /* Generic.Error */ 18 | .gh { color: #000080; font-weight: bold } /* Generic.Heading */ 19 | .gi { color: #00A000 } /* Generic.Inserted */ 20 | .go { color: #000000; font-style: italic } /* Generic.Output */ 21 | .gp { color: #8f5902 } /* Generic.Prompt */ 22 | .gs { color: #000000; font-weight: bold } /* Generic.Strong */ 23 | .gu { color: #800080; font-weight: bold } /* Generic.Subheading */ 24 | .gt { color: #a40000; font-weight: bold } /* Generic.Traceback */ 25 | .kc { color: #204a87; font-weight: bold } /* Keyword.Constant */ 26 | .kd { color: #204a87; font-weight: bold } /* Keyword.Declaration */ 27 | .kn { color: #204a87; font-weight: bold } /* Keyword.Namespace */ 28 | .kp { color: #204a87; font-weight: bold } /* Keyword.Pseudo */ 29 | .kr { color: #204a87; font-weight: bold } /* Keyword.Reserved */ 30 | .kt { color: #204a87; 
font-weight: bold } /* Keyword.Type */ 31 | .ld { color: #000000 } /* Literal.Date */ 32 | .m { color: #0000cf; font-weight: bold } /* Literal.Number */ 33 | .s { color: #4e9a06 } /* Literal.String */ 34 | .na { color: #c4a000 } /* Name.Attribute */ 35 | .nb { color: #204a87 } /* Name.Builtin */ 36 | .nc { color: #000000 } /* Name.Class */ 37 | .no { color: #000000 } /* Name.Constant */ 38 | .nd { color: #5c35cc; font-weight: bold } /* Name.Decorator */ 39 | .ni { color: #ce5c00 } /* Name.Entity */ 40 | .ne { color: #cc0000; font-weight: bold } /* Name.Exception */ 41 | .nf { color: #000000 } /* Name.Function */ 42 | .nl { color: #f57900 } /* Name.Label */ 43 | .nn { color: #000000 } /* Name.Namespace */ 44 | .nx { color: #000000 } /* Name.Other */ 45 | .py { color: #000000 } /* Name.Property */ 46 | .nt { color: #204a87; font-weight: bold } /* Name.Tag */ 47 | .nv { color: #000000 } /* Name.Variable */ 48 | .ow { color: #204a87; font-weight: bold } /* Operator.Word */ 49 | .w { color: #f8f8f8; text-decoration: underline } /* Text.Whitespace */ 50 | .mf { color: #0000cf; font-weight: bold } /* Literal.Number.Float */ 51 | .mh { color: #0000cf; font-weight: bold } /* Literal.Number.Hex */ 52 | .mi { color: #0000cf; font-weight: bold } /* Literal.Number.Integer */ 53 | .mo { color: #0000cf; font-weight: bold } /* Literal.Number.Oct */ 54 | .sb { color: #4e9a06 } /* Literal.String.Backtick */ 55 | .sc { color: #4e9a06 } /* Literal.String.Char */ 56 | .sd { color: #8f5902; font-style: italic } /* Literal.String.Doc */ 57 | .s2 { color: #4e9a06 } /* Literal.String.Double */ 58 | .se { color: #4e9a06 } /* Literal.String.Escape */ 59 | .sh { color: #4e9a06 } /* Literal.String.Heredoc */ 60 | .si { color: #4e9a06 } /* Literal.String.Interpol */ 61 | .sx { color: #4e9a06 } /* Literal.String.Other */ 62 | .sr { color: #4e9a06 } /* Literal.String.Regex */ 63 | .s1 { color: #4e9a06 } /* Literal.String.Single */ 64 | .ss { color: #4e9a06 } /* Literal.String.Symbol */ 65 | .bp { 
color: #3465a4 } /* Name.Builtin.Pseudo */ 66 | .vc { color: #000000 } /* Name.Variable.Class */ 67 | .vg { color: #000000 } /* Name.Variable.Global */ 68 | .vi { color: #000000 } /* Name.Variable.Instance */ 69 | .il { color: #0000cf; font-weight: bold } /* Literal.Number.Integer.Long */ -------------------------------------------------------------------------------- /_build/about/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | About Me | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 |
15 | Matt Alcock - A Data Scientist with a passion for Python 16 |
17 | 25 |
26 | 27 |

About Me

28 | 29 | 30 | 31 |

My name is Matt Alcock and I'm a Data Scientist, Analytics Lead and Python fan. I'm currently the Lead Analyst at NaturalMotion Games where I manage a small team of data analysts. I currently work and live in Oxford. I've worked in small startups and large multinational companies covering a variety of industries including games, finance and fashion. I've been working with data for 10+ years and although my jobs have been varied they've all centered around drawing insight from large data sets.

32 |

I split the majority of my time between Oxford and London. If you'd like to meet for a coffee or discuss anything please drop me a message through one of the following channels:

33 | 38 |
39 |

About this Website

40 |

The website is a collection of observations, thoughts, notes and side projects. A lot of the supporting code for the blog posts can be found in a public repo under my github account.

41 |

The website itself is written in reStructuredText and built with a small 42 | script written by the very talented Armin Ronacher. Source code can be found on GitHub.

43 |
44 | 45 | 46 | 47 | 48 | 49 | 50 |
51 | 63 |
64 | 77 | 78 | 79 | -------------------------------------------------------------------------------- /2012/12/5/python-spell-checker.rst: -------------------------------------------------------------------------------- 1 | public: yes 2 | tags: [python, probability, statistics, bayes] 3 | pub_date: 2012-12-05 4 | summary: | 5 | How to use Python and some powerful statistics to create a very lightweight but effective Google style spell checker. 6 | 7 | Did you mean 'python spell checker'? 8 | ==================================== 9 | 10 | Have you ever been really impressed with Google's 'Did you mean...' spell checker? 11 | Have you ever just typed something into Google to help you with your spelling? 12 | 13 | My answer to both questions would be yes, all the time! 14 | 15 | In a fantastic post I read some years ago Peter Norvig outlined how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: 'spelling'. Below is a toy spelling corrector in Python that achieves 80 to 90% accuracy and is very fast. It's written in a fantastically impressive 21 lines of code. It uses list comprehensions, and some of my favorite data structures (sets and default dictionaries). 16 | 17 | The code and supporting data files can be found in my hacks public repo under the spellcheck folder. 18 | 19 | The data seed comes from a big.txt file that consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. Obviously Google has a lot more data to seed its spelling checker with but I was surprised at how effective this relatively small seed was.
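The counting step is small enough to try on its own. The sketch below reuses the same `words` and `train` logic on a made-up sentence; the `sample` string is an assumption standing in for big.txt, and the code is written for Python 3.

```python
import collections
import re

def words(text):
    # Lowercase the text and keep only runs of the letters a-z.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Every word starts at 1, so a word that was never seen
    # still gets a small non-zero count.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical sample text standing in for big.txt.
sample = "The cat sat on the mat. The end."
counts = train(words(sample))

print(counts['the'])  # 3 occurrences plus the starting value of 1
print(counts['cat'])  # 1 occurrence plus the starting value of 1
```

Starting every count at 1 is a simple form of smoothing: a correctly spelt word that happens to be missing from the training text is treated as rare rather than impossible.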
20 | 21 | .. sourcecode:: python 22 | 23 | import re, collections 24 | 25 | def words(text): 26 | return re.findall('[a-z]+', text.lower()) 27 | 28 | def train(features): 29 | model = collections.defaultdict(lambda: 1) 30 | for f in features: 31 | model[f] += 1 32 | return model 33 | 34 | NWORDS = train(words(open('big.txt').read())) 35 | alphabet = 'abcdefghijklmnopqrstuvwxyz' 36 | 37 | def edits1(word): 38 | s = [(word[:i], word[i:]) for i in range(len(word) + 1)] 39 | deletes = [a + b[1:] for a, b in s if b] 40 | transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1] 41 | replaces = [a + c + b[1:] for a, b in s for c in alphabet if b] 42 | inserts = [a + c + b for a, b in s for c in alphabet] 43 | return set(deletes + transposes + replaces + inserts) 44 | 45 | def known_edits2(word): 46 | return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS) 47 | 48 | def known(words): 49 | return set(w for w in words if w in NWORDS) 50 | 51 | def correct(word): 52 | candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] 53 | return max(candidates, key=NWORDS.get) 54 | 55 | 56 | If you're new to Python some of the above code may look complicated and hard to follow. Although it's dense, I love Peter's use of list comprehensions and generators. The use of nested function composition is also very efficient and I've noticed a massive speed up in using such approaches when ingesting or processing large data files. 57 | 58 | An example of nested function composition is: 59 | 60 | .. sourcecode:: python 61 | 62 | NWORDS = train(words(open('big.txt').read())) 63 | 64 | An example of a complex list comprehension is: 65 | 66 | .. sourcecode:: python 67 | 68 | [a + c + b[1:] for a, b in s for c in alphabet if b] 69 | 70 | The final thing I really like in this code snippet is the overriding of the key function when max is called in the 'correct' function.
This is a great way to find the word with the highest value in a dictionary of word->count mappings. 71 | 72 | .. sourcecode:: python 73 | 74 | return max(candidates, key=NWORDS.get) 75 | 76 | The code is simple and elegant and basically generates a set of candidate words based on the partial or badly spelt word (aka the original word). The most frequently used word from the candidates is chosen. Peter explains how Bayes' Theorem is used to select the best correction given the original word. 77 | 78 | See more details, test results and further work at Peter Norvig’s `site `_ . 79 | 80 | 81 | -------------------------------------------------------------------------------- /_build/2012/12/4/hello-internet/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Hello Internet | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 |
15 | Matt Alcock - A Data Scientist with a passion for Python 16 |
17 | 25 |
26 | 27 |

Hello Internet

28 | 29 | 30 |

written on Tuesday, December 4, 2012 31 | 32 | 33 |

The obligatory first post. I'll be honest, I've been meaning to start and commit to a blog for some time. After some false starts covering my broad spectrum of interests I've decided to focus on writing about my thoughts as a Data Scientist. I'm hoping people will find this informative and insightful. If there are any thoughts, feedback or collaborations that come from this then the blog will have been a success. So please let me know what you think.

34 |

I've been working with data for 10 years and my job title has jumped around from Developer, Data Analyst, Quantitative Analyst, Team Leader, Product Manager, Warehouse Manager, Head Of Analytics, Lead Analyst and Data Scientist. So what am I? I'm not sure everyone fits into role buckets but one thing I am convinced of is that everyone's interests and expertise are different. I enjoy managing small technical teams, I love working with large amounts of data and I have the expertise to apply statistical and scientific methods to my work.

35 |

Below are some biases and opinions I should mention before I start. I'll explain these in more detail over the coming posts but they're four personal and somewhat subjective opinions I wanted to share from the outset. 36 | 37 | - I love the power of modern NoSQL data stores but still feel SQL is an amazing tool for analysis that's hard to beat. 38 | - I think data can answer questions and give insights into almost every area of business but I also understand it's not a silver bullet. It should be used with creativity, lateral thinking and domain expertise, not instead of them. 39 | - I love the concise power of statistics but realise they're frequently misleading and often poorly presented and explained. 40 | - I'm programming-language agnostic but love to use Python

41 |

The blog will contain thoughts, tools, projects and some book reviews. If there is anything you'd like me to talk about or review please drop me an email.

42 |

I hope you like what's to come...

43 | 44 | 45 | 46 |

This entry was tagged 47 | 48 | thoughts 49 | 50 | 51 | 52 | 53 |

54 | 66 |
67 | 80 | 81 | 82 | -------------------------------------------------------------------------------- /static/style.css: -------------------------------------------------------------------------------- 1 | /* fonts */ 2 | @import url(http://fonts.googleapis.com/css?family=Merriweather:400,300); 3 | @import url(http://fonts.googleapis.com/css?family=Ubuntu+Mono:400,400italic,700,700italic); 4 | 5 | /* general style */ 6 | body { font: 17px/25px 'Merriweather', serif; 7 | margin: 0; padding: 0; font-weight: 300; } 8 | a { color: black; font-weight: 400; } 9 | a:hover { color: #CC0033; } 10 | 11 | /* headlines */ 12 | h1, h2, h3, h4, h5, h6 { font-family: 'Merriweather', serif; 13 | font-weight: 300; color: #222; } 14 | h1 a, h2 a, h3 a, h4 a, 15 | h5 a, h6 a { text-decoration: none; } 16 | h1 a:hover, h2 a:hover, 17 | h3 a:hover, h4 a:hover { text-decoration: underline; } 18 | h1.title { width: 560px; } 19 | h1, h2 { margin: 10px 0 25px 0; } 20 | h1 { font-size: 42px; line-height: 52px; } 21 | h2 { font-size: 32px; line-height: 40px; } 22 | 23 | /* layout elements */ 24 | div.container { width: 740px; margin: 48px auto; padding: 0; } 25 | div.header { float: left; } 26 | div.navigation { float: right; } 27 | div.header, div.navigation { height: 25px; margin-bottom: 42px; } 28 | div.navigation ul { margin: 0; padding: 0; list-style: none; } 29 | div.navigation ul li { display: inline; margin: 0 2px; padding: 0; } 30 | div.body { clear: both; } 31 | div.footer { border-top: 1px solid #555; padding-top: 9px; 32 | margin-top: 42px; font-size: 16px; 33 | text-align: center; color: #555; } 34 | div.footer p { margin: 0; } 35 | 36 | /* margins and stuff */ 37 | p, div.line-block, ul, ol, pre, 38 | table { margin: 25px 0 25px 0; } 39 | dt { margin: 25px 0 16px 0; padding: 0; } 40 | dd { margin: 16px 0 25px 40px; padding: 0; } 41 | ul ol, ol ul, ul ul, ol ol { margin: 10px 0; padding: 0 0 0 40px; } 42 | li { padding: 0; } 43 | h1 + p.date { margin-top: -25px; } 44 | 45 | /* 
code formatting. no monospace because of webkit (bug?) */ 46 | pre, code, tt { font-family: 'Ubuntu Mono', 'Consolas', 'Deja Vu Sans Mono', 47 | 'Bitstream Vera Sans Mono', 'Monaco', 'Courier New'; 48 | font-size: 0.9em; } 49 | pre { line-height: 1.3; background: none; padding: 0; } 50 | code, tt { background: #eee; } 51 | 52 | /* tables */ 53 | table { border: 1px solid #ddd; border-collapse: collapse; 54 | background: #fafafa; } 55 | td, th { padding: 2px 12px; border: 1px solid #ddd; } 56 | 57 | /* footnotes */ 58 | table.footnote { margin: 15px 0; background: transparent; border: none; } 59 | table.footnote td { border: none; padding: 9px 0 0 0; font-size: 12px; } 60 | table.footnote td.label { padding-right: 10px; } 61 | table.footnote td p { margin: 0; } 62 | table.footnote td p + p { margin-top: 15px; } 63 | 64 | /* blog overview */ 65 | div.entry-overview { margin: 25px 122px 25px 102px; } 66 | div.entry-overview h1, 67 | div.entry-overview div.summary, 68 | div.entry-overview div.summary p { line-height: 25px; } 69 | div.entry-overview h1 { margin: 0; font-size: 20px; } 70 | div.entry-overview div.summary, 71 | div.entry-overview div.date, 72 | div.entry-overview div.summary p { margin: 0; padding: 0; } 73 | div.entry-overview div.detail { margin-left: 140px; } 74 | div.entry-overview div.date { float: left; width: 120px; color: #CC0033; 75 | text-align: right; font-size: 14px; } 76 | div.entry-overview div.summary-tags { font-size: 12px; } 77 | div.entry-overview div.summary-tags a { text-decoration: none; font-weight: 300;} 78 | div.entry-overview h1 { font-weight: normal; } 79 | div.entry-overview h1:after { content: ""; } 80 | 81 | /* other alignment things */ 82 | img.align-center { margin: 15px auto; display: block; } 83 | 84 | /* pagination */ 85 | div.pagination { margin: 36px 0 0 0; text-align: center; } 86 | div.pagination strong { font-weight: normal; font-style: italic; } 87 | 88 | /* tags */ 89 | p.tags { text-align: right; } 90 | ul.tagcloud 
{ font-size: 16px; margin: 36px 0; padding: 0; 91 | list-style: none; } 92 | ul.tagcloud li { margin: 0; padding: 0 10px; display: inline; } 93 | 94 | /* latex math */ 95 | span.math img { margin-bottom: -7px; } -------------------------------------------------------------------------------- /_build/static/style.css: -------------------------------------------------------------------------------- 1 | /* fonts */ 2 | @import url(http://fonts.googleapis.com/css?family=Merriweather:400,300); 3 | @import url(http://fonts.googleapis.com/css?family=Ubuntu+Mono:400,400italic,700,700italic); 4 | 5 | /* general style */ 6 | body { font: 17px/25px 'Merriweather', serif; 7 | margin: 0; padding: 0; font-weight: 300; } 8 | a { color: black; font-weight: 400; } 9 | a:hover { color: #CC0033; } 10 | 11 | /* headlines */ 12 | h1, h2, h3, h4, h5, h6 { font-family: 'Merriweather', serif; 13 | font-weight: 300; color: #222; } 14 | h1 a, h2 a, h3 a, h4 a, 15 | h5 a, h6 a { text-decoration: none; } 16 | h1 a:hover, h2 a:hover, 17 | h3 a:hover, h4 a:hover { text-decoration: underline; } 18 | h1.title { width: 560px; } 19 | h1, h2 { margin: 10px 0 25px 0; } 20 | h1 { font-size: 42px; line-height: 52px; } 21 | h2 { font-size: 32px; line-height: 40px; } 22 | 23 | /* layout elements */ 24 | div.container { width: 740px; margin: 48px auto; padding: 0; } 25 | div.header { float: left; } 26 | div.navigation { float: right; } 27 | div.header, div.navigation { height: 25px; margin-bottom: 42px; } 28 | div.navigation ul { margin: 0; padding: 0; list-style: none; } 29 | div.navigation ul li { display: inline; margin: 0 2px; padding: 0; } 30 | div.body { clear: both; } 31 | div.footer { border-top: 1px solid #555; padding-top: 9px; 32 | margin-top: 42px; font-size: 16px; 33 | text-align: center; color: #555; } 34 | div.footer p { margin: 0; } 35 | 36 | /* margins and stuff */ 37 | p, div.line-block, ul, ol, pre, 38 | table { margin: 25px 0 25px 0; } 39 | dt { margin: 25px 0 16px 0; padding: 0; } 40 
| dd { margin: 16px 0 25px 40px; padding: 0; } 41 | ul ol, ol ul, ul ul, ol ol { margin: 10px 0; padding: 0 0 0 40px; } 42 | li { padding: 0; } 43 | h1 + p.date { margin-top: -25px; } 44 | 45 | /* code formatting. no monospace because of webkit (bug?) */ 46 | pre, code, tt { font-family: 'Ubuntu Mono', 'Consolas', 'Deja Vu Sans Mono', 47 | 'Bitstream Vera Sans Mono', 'Monaco', 'Courier New'; 48 | font-size: 0.9em; } 49 | pre { line-height: 1.3; background: none; padding: 0; } 50 | code, tt { background: #eee; } 51 | 52 | /* tables */ 53 | table { border: 1px solid #ddd; border-collapse: collapse; 54 | background: #fafafa; } 55 | td, th { padding: 2px 12px; border: 1px solid #ddd; } 56 | 57 | /* footnotes */ 58 | table.footnote { margin: 15px 0; background: transparent; border: none; } 59 | table.footnote td { border: none; padding: 9px 0 0 0; font-size: 12px; } 60 | table.footnote td.label { padding-right: 10px; } 61 | table.footnote td p { margin: 0; } 62 | table.footnote td p + p { margin-top: 15px; } 63 | 64 | /* blog overview */ 65 | div.entry-overview { margin: 25px 122px 25px 102px; } 66 | div.entry-overview h1, 67 | div.entry-overview div.summary, 68 | div.entry-overview div.summary p { line-height: 25px; } 69 | div.entry-overview h1 { margin: 0; font-size: 20px; } 70 | div.entry-overview div.summary, 71 | div.entry-overview div.date, 72 | div.entry-overview div.summary p { margin: 0; padding: 0; } 73 | div.entry-overview div.detail { margin-left: 140px; } 74 | div.entry-overview div.date { float: left; width: 120px; color: #CC0033; 75 | text-align: right; font-size: 14px; } 76 | div.entry-overview div.summary-tags { font-size: 12px; } 77 | div.entry-overview div.summary-tags a { text-decoration: none; font-weight: 300;} 78 | div.entry-overview h1 { font-weight: normal; } 79 | div.entry-overview h1:after { content: ""; } 80 | 81 | /* other alignment things */ 82 | img.align-center { margin: 15px auto; display: block; } 83 | 84 | /* pagination */ 85 | 
div.pagination { margin: 36px 0 0 0; text-align: center; } 86 | div.pagination strong { font-weight: normal; font-style: italic; } 87 | 88 | /* tags */ 89 | p.tags { text-align: right; } 90 | ul.tagcloud { font-size: 16px; margin: 36px 0; padding: 0; 91 | list-style: none; } 92 | ul.tagcloud li { margin: 0; padding: 0 10px; display: inline; } 93 | 94 | /* latex math */ 95 | span.math img { margin-bottom: -7px; } -------------------------------------------------------------------------------- /_build/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Blog | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 |
13 |
14 | Matt Alcock - A Data Scientist with a passion for Python 15 |
16 | 24 |
25 | 26 |
27 |
Feb 24, 2013
28 |
29 |

Timing Python Code

30 | 31 |

Using decorators to time and optimise the performance of python code.

32 |
33 | 34 | 35 |
36 | 37 | #performance and 38 | #python 39 |
40 | 41 |
42 |
43 |
44 |
Jan 5, 2013
45 |
46 |

Decorators & Annotations

47 | 48 |

An introduction into decorators and annotations in python and their simple power.

49 |
50 | 51 | 52 |
53 | 54 | #introduction and 55 | #python 56 |
57 | 58 |
59 |
60 |
61 |
Dec 5, 2012
62 |
63 |

Did you mean 'python spell checker'?

64 | 65 |

How to use Python and some powerful statistics to create a very lightweight but effective Google style spell checker.

66 |
67 | 68 | 69 |
70 | 71 | #bayes, 72 | #probability, 73 | #python and 74 | #statistics 75 |
76 | 77 |
78 |
79 |
80 |
Dec 4, 2012
81 |
82 |

Hello Internet

83 | 84 |

The obligatory first post.

85 |
86 | 87 | 88 |
89 | 90 | #thoughts 91 |
92 | 93 |
94 |
95 | 96 | 97 | 98 |
99 | 111 |
112 | 125 | 126 | 127 | -------------------------------------------------------------------------------- /2013/2/24/timing-python-code.rst: -------------------------------------------------------------------------------- 1 | public: yes 2 | tags: [python, performance] 3 | pub_date: 2013-02-24 4 | summary: | 5 | Using decorators to time and optimise the performance of python code. 6 | 7 | Timing Python Code 8 | ================== 9 | 10 | This post outlines why timing code in Python is important and provides some simple decorators that can help you time your code without the concerns and worries of peppering your lovely clean code with temporary timing and print statements. 11 | 12 | Scroll down if you're just after the decorator code to time functions... 13 | 14 | Python vs Speed 15 | --------------- 16 | 17 | One of the things Python was originally criticised for was speed. Like lots of dynamic languages there is an overhead in keeping track of types, and because code is interpreted at runtime instead of being compiled to native code ahead of time, dynamic languages like Python will usually be a little slower. 18 | 19 | Where Python shines is in its power and ability to allow programmers to optimise and focus on the algorithm. Focusing on the complexity of the problem and the algorithm's order of magnitude rather than the low-level details of memory management, pointers etc. can often have massive benefits. Ask any computer science student and they can list off numerous teachings that show algorithm and data structure design will beat brute-force computation power. 20 | 21 | If you're looking to build something where microseconds count then I'd turn to C or Java. `PyPy `_ and other sophisticated JIT (Just in Time) compilers can help and they seem to be the future for Python solutions in this space. Another alternative is to find the slow code and either optimise that function or write a C extension for Python for your very specific task.
This last approach seems very popular in the finance industry where milliseconds mean dollars but they still need the flexibility and speed of development benefits that come with a dynamic language. 22 | 23 | More often than not slow code just needs some refactoring work, a new supporting data structure or a change in the complexity of processing. So the challenge is really not how can I speed up my code but which code needs my attention. 24 | 25 | Finding Slow Code 26 | ----------------- 27 | 28 | In order to find slow Python code we're going to have to time stuff. We don't want to cover our lovely clean code with temporary timing code and print statements, so how can we: 29 | 30 | - Time code without altering the code of a function 31 | - Get detailed timing information if the function is run with different arguments 32 | - Switch off the timing at deploy time to remove the monitoring overhead 33 | 34 | The timing decorators below can help with all of these. If you're new to decorators and annotations see my previous blog `post on the subject `_ 35 | 36 | 37 | The Timeit Decorator 38 | -------------------- 39 | 40 | .. sourcecode:: python 41 | 42 | import time 43 | 44 | def timeit(f): 45 | 46 | def timed(*args, **kw): 47 | ts = time.time() 48 | result = f(*args, **kw) 49 | te = time.time() 50 | 51 | print('func:%r args:[%r, %r] took: %2.4f sec' % 52 | (f.__name__, args, kw, te-ts)) 53 | return result 54 | 55 | return timed 56 | 57 | 58 | Using the decorator is easy. Either use annotations. 59 | 60 | .. sourcecode:: python 61 | 62 | @timeit 63 | def compute_magic(n): 64 | #function definition 65 | .... 66 | 67 | 68 | Or re-alias the function you want to time. 69 | 70 | .. sourcecode:: python 71 | 72 | compute_magic = timeit(compute_magic) 73 | 74 | 75 | Sometimes you'll want to remove the code timing.
You can either do this by removing the timeit annotations before deployment or you can use a configuration switch that controls whether the decorator wraps the function in timing code. 76 | 77 | .. sourcecode:: python 78 | 79 | import time 80 | 81 | #from config import TIME_FUNCTIONS 82 | TIME_FUNCTIONS = False 83 | 84 | def timeit(f): 85 | if not TIME_FUNCTIONS: 86 | return f 87 | else: 88 | def timed(*args, **kw): 89 | ts = time.time() 90 | result = f(*args, **kw) 91 | te = time.time() 92 | 93 | print('func:%r args:[%r, %r] took: %2.4f sec' % 94 | (f.__name__, args, kw, te-ts)) 95 | return result 96 | 97 | return timed 98 | 99 | By simply setting the TIME_FUNCTIONS configuration switch to False the functions will not be decorated. I find having these switches in a common config file/folder often helps. 100 | 101 | All this code and the majority of code from my posts can be found in the hacks repo of my GitHub account. Please take a look `here `_ . I hope it's helped. If there are any questions about the above or you'd like to understand more about timing code in Python, drop me a mail. 102 | 103 | Matt 104 | 105 | 106 | -------------------------------------------------------------------------------- /2013/1/5/decorates-and-annotations.rst: -------------------------------------------------------------------------------- 1 | public: yes 2 | tags: [python, introduction] 3 | pub_date: 2013-01-05 4 | summary: | 5 | An introduction into decorators and annotations in python and their simple power. 6 | 7 | Decorators & Annotations 8 | ======================== 9 | 10 | I wanted to highlight the power of decorators and annotations in Python and give the novice Python programmer some insight into how they can be used. If you've only been using Python for a short while then both of these will probably be new. 11 | 12 | Decorators are a way of implementing the famous computer science decorator pattern.
This pattern, put in simple terms, is a mechanism that allows you to inject or modify code in a function. In Python you can have two different styles of decorator: the function defined style or the class defined style. I prefer the function style but I'll show you using a class structure as well. 13 | 14 | The best way to explain their use is through a well known example. The below code shows how to functionally compute the Fibonacci numbers. 15 | 16 | The Fibonacci sequence is: [0,1,1,2,3,5,8,13.....] where the nth number equals the sum of the (n-1)th and (n-2)th numbers. 17 | 18 | An elegant way of computing this is using the below code: 19 | 20 | .. sourcecode:: python 21 | 22 | def fib(n): 23 | if n<=0: 24 | return 0 25 | elif n==1: 26 | return 1 27 | else: 28 | return fib(n-2) + fib(n-1) 29 | 30 | So fib(7) would return 13. As you can see from the code this uses recursion. The challenge with this approach for calculating the fib sequence is that the calls for low values of n are repeated many times. This is not something 'tail recursion elimination' (TRE) would fix: fib is not tail recursive, and Python doesn't support TRE and `probably won't `_ anyway. Below shows how running the fib function for just a small n can result in a massive number of repeated calls for the low values. 31 | 32 | .. sourcecode:: python 33 | 34 | fib(7) = fib(6) + fib(5) 35 | fib(7) = fib(5) + fib(4) + fib(4) + fib(3) 36 | fib(7) = fib(4) + fib(3) + fib(3) + fib(2) + fib(3) + fib(2) + fib(2) + fib(1) 37 | ..... 38 | fib(7) = fib(1) + fib(0) + fib(1) + fib(0) + .......... [All fib zeros and fib ones] 39 | 40 | A way to make this faster is to use a technique called memoization. This remembers the result of a function for a given argument, stores it and if called again uses the stored version rather than recalculating. For larger n this can speed up the above by many orders of magnitude.
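To make the repeated work concrete, here is a minimal sketch (Python 3) that counts how often the function body runs with and without a stored result. The `calls` counter, the `cache` dictionary and the `fib_memo` name are illustrative assumptions, not code from the hacks repo.

```python
calls = {'naive': 0, 'memo': 0}

def fib(n):
    # Plain recursive version: every call does the work again.
    calls['naive'] += 1
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    return fib(n - 2) + fib(n - 1)

cache = {}

def fib_memo(n):
    # Return the stored answer if this n has been computed before.
    if n in cache:
        return cache[n]
    calls['memo'] += 1  # count only the calls that do real work
    if n <= 0:
        result = 0
    elif n == 1:
        result = 1
    else:
        result = fib_memo(n - 2) + fib_memo(n - 1)
    cache[n] = result
    return result

naive_result = fib(7)
memo_result = fib_memo(7)
print(naive_result, calls['naive'])  # 13 41
print(memo_result, calls['memo'])    # 13 8
```

The plain version needs 41 calls to produce fib(7), while the version with a cache only computes each of the eight values fib(0) to fib(7) once. The gap widens very quickly as n grows.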
41 | 42 | The best way to implement this function calling memory is by decorating the function with some code that can modify the execution path to check a pre-saved store first. Below is the memoize decorator as a function. 43 | 44 | .. sourcecode:: python 45 | 46 | def memoize(f): 47 | cache = {} 48 | def memf(*args, **kw): 49 | key = (args, tuple(sorted(kw.items()))) 50 | if key not in cache: 51 | cache[key] = f(*args, **kw) 52 | return cache[key] 53 | return memf 54 | 55 | The memoize decorator above takes a function as an argument. It then creates a new function that stores the results of the function in a cache. The decorator then returns the new function that contains the original function call. 56 | We can then use some clever dynamic language tricks to re-alias the fib function to the decorated version. 57 | 58 | .. sourcecode:: python 59 | 60 | fib = memoize(fib) 61 | 62 | By calling fib after this aliased decoration we can ensure that the decorated function will run instead of the basic fib function. 63 | 64 | I hope that explains how decorators work in Python and gives you an example of use. So what are annotations? 65 | 66 | Annotations allow us to use decorators all over our code and are actually syntactic sugar for (the same thing as) the above aliased line. Rather than re-aliasing fib to the decorated fib we can use annotations at the point of writing the fib function definition. 67 | 68 | An annotated fib function would look like this. Note the simple use of @ and the decorator name above the definition. 69 | 70 | .. sourcecode:: python 71 | 72 | @memoize 73 | def fib(n): 74 | if n<=0: 75 | return 0 76 | elif n==1: 77 | return 1 78 | else: 79 | return fib(n-2) + fib(n-1) 80 | 81 | Simple hey! So annotations are just stylish and helpful ways to decorate functions at the place of definition.
This really helps when you're sharing code and working as a small team because you don't have to look all over the code to see if the function has been re-aliased and decorated; it's right above the definition. 82 | 83 | One of the best uses of this type of decoration using annotations is to log the performance of a function or to perform some detailed profiling. You only need to write a single decorator to modify and wrap any function and then you just sprinkle the decorator around your code as annotations depending on what functions you want to time/profile or investigate in detail. 84 | 85 | As I mentioned before there is also a class style to writing decorators, so let's use our memoize decorator as an example. 86 | 87 | Written as a class the decorator is: 88 | 89 | .. sourcecode:: python 90 | 91 | class Memoize: 92 | 93 | def __init__(self, f): 94 | self.f = f 95 | self.cache = {} 96 | 97 | def __call__(self, *args, **kw): 98 | key = (args, tuple(sorted(kw.items()))) 99 | if key not in self.cache: 100 | self.cache[key] = self.f(*args, **kw) 101 | return self.cache[key] 102 | 103 | The class has to have two methods to operate as a decorator: __init__ and __call__. Some people find this easier to read and construct, others prefer the function style. I think it really depends on how advanced the decorator is going to be. 104 | 105 | The class style can then be applied in the exact same way as the above function style decorator. 106 | 107 | .. sourcecode:: python 108 | 109 | fib = Memoize(fib) 110 | 111 | @Memoize 112 | def fib(n): 113 | if n<=0: 114 | return 0 115 | ... 116 | 117 | I hope this has helped you understand the basics of decorators and annotations.
All of the decorator code listed above can be found in the hacks repo on my github account `here `_ -------------------------------------------------------------------------------- /_build/2013/2/24/timing-python-code/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Timing Python Code | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 |
15 | Matt Alcock - A Data Scientist with a passion for Python 16 |
17 | 25 |
26 | 27 |

Timing Python Code

28 | 29 | 30 |

written on Sunday, February 24, 2013 31 | 32 | 33 |

This post outlines why timing code in Python is important and provides some simple decorators that can help you time your code without the concerns and worries of peppering your lovely clean code with temporary timing and print statements.

34 |

Scroll down if you're just after the decorator code to time functions...

35 |
36 |

Python vs Speed

37 |

One of the things Python was originally criticised for was speed. Like lots of dynamic languages there is an overhead in keeping track of types, and because code is interpreted at runtime instead of being compiled to native code ahead of time, dynamic languages like Python will usually be a little slower.

38 |

Where Python shines is in its power and ability to allow programmers to optimise and focus on the algorithm. Focusing on the complexity of the problem and the algorithm's order of magnitude rather than the low-level details of memory management, pointers etc. can often have massive benefits. Ask any computer science student and they can list off numerous teachings that show algorithm and data structure design will beat brute-force computation power.

39 |

If you're looking to build something where microseconds count then I'd turn to C or Java. PyPy and other sophisticated JIT (Just in Time) compilers can help and they seem to be the future for Python solutions in this space. Another alternative is to find the slow code and either optimise that function or write a C extension for Python for your very specific task. This last approach seems very popular in the finance industry where milliseconds mean dollars but they still need the flexibility and speed of development benefits that come with a dynamic language.

40 |

More often than not slow code just needs some refactoring work, a new supporting data structure or a change in the complexity of processing. So the challenge is really not how can I speed up my code but which code needs my attention.
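As a small illustration of how a supporting data structure can matter more than raw speed, the sketch below (Python 3) times the same membership test against a list and against a set using the standard `timeit` module. The sizes and the names `haystack_list`, `haystack_set`, `needles` and `count_hits` are arbitrary assumptions chosen for the example.

```python
import timeit

# The same 20,000 integers stored in two different containers.
haystack_list = list(range(20000))
haystack_set = set(haystack_list)

# 100 values to look up, spread evenly across the data.
needles = range(0, 20000, 200)

def count_hits(container):
    # "n in container" scans a list item by item,
    # but uses a hash lookup for a set.
    return sum(1 for n in needles if n in container)

list_time = timeit.timeit(lambda: count_hits(haystack_list), number=3)
set_time = timeit.timeit(lambda: count_hits(haystack_set), number=3)

print('list: %.4f sec, set: %.4f sec' % (list_time, set_time))
```

Both calls return the same answer; only the cost of each lookup differs. The exact timings depend on the machine, so no figures are quoted here.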

41 |
42 |
43 |

Finding Slow Code

44 |

In order to find slow Python code we're going to have to time stuff. We don't want to cover our lovely clean code with temporary timing code and print statements, so how can we:

45 |
46 |
    47 |
  • Time code without altering the code of a function
  • 48 |
  • Get detailed timing information if the function is run with different arguments
  • 49 |
  • Switch off the timing at deploy time to remove the monitoring overhead
  • 50 |
51 |
52 |

The timing decorators below can help with all of these. If you're new to decorators and annotations see my previous blog post on the subject.

53 |
54 |
55 |

The Timeit Decorator

56 |
import time
 57 | 
 58 | def timeit(f):
 59 | 
 60 |     def timed(*args, **kw):
 61 |         ts = time.time()
 62 |         result = f(*args, **kw)
 63 |         te = time.time()
 64 | 
 65 |         print('func:%r args:[%r, %r] took: %2.4f sec' %
 66 |               (f.__name__, args, kw, te-ts))
 67 |         return result
 68 | 
 69 |     return timed
 70 | 
71 |

Using the decorator is easy. Either use annotations.

72 |
@timeit
 73 | def compute_magic(n):
 74 |     #function definition
 75 |     ....
 76 | 
77 |

Or re-alias the function you want to time.

78 |
compute_magic = timeit(compute_magic)
 79 | 
80 |

Sometimes you'll want to remove the code timing. You can either do this by removing the timeit annotations before deployment, or you can use a configuration switch to control whether the decorator wraps the function in timing code.

81 |
import time

# from config import TIME_FUNCTIONS
TIME_FUNCTIONS = False

def timeit(f):
    # With the switch off, hand back the original function untouched.
    if not TIME_FUNCTIONS:
        return f
    else:
        def timed(*args, **kw):
            ts = time.time()
            result = f(*args, **kw)
            te = time.time()

            print('func:%r args:[%r, %r] took: %2.4f sec' %
                  (f.__name__, args, kw, te - ts))
            return result

        return timed

101 |

By simply changing the TIME_FUNCTIONS configuration switch, the functions will not be decorated. I find having these switches in a common config file/folder often helps.
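The switch does not have to live in a Python module. As one possible variation (an assumption for illustration, not something taken from the hacks repo), it could be read from an environment variable when the decorator is built. The environment variable name `TIME_FUNCTIONS` and the helper `make_timeit` are hypothetical.

```python
import os
import time


def make_timeit(enabled):
    """Build a timeit decorator that is a no-op when enabled is False."""
    def timeit(f):
        if not enabled:
            return f  # undecorated: no extra work per call
        def timed(*args, **kw):
            ts = time.time()
            result = f(*args, **kw)
            te = time.time()
            print('func:%r args:[%r, %r] took: %2.4f sec' %
                  (f.__name__, args, kw, te - ts))
            return result
        return timed
    return timeit


# Hypothetical convention: TIME_FUNCTIONS=1 in the environment turns timing on.
timeit = make_timeit(os.environ.get('TIME_FUNCTIONS', '0') == '1')


@timeit
def compute_magic(n):
    return n * 2
```

Note that the decision is made once, when each function is decorated, so flipping the switch later has no effect on functions that are already defined.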

102 |

All this code and the majority of code from my posts can be found in the hacks repo of my GitHub account. Please take a look here. I hope it's helped. If there are any questions about the above, or you'd like to understand more about timing code in Python, drop me a mail.

103 |

Matt

104 |
105 | 106 | 107 | 108 |

This entry was tagged performance and python

117 | 129 |
130 | 143 | 144 | 145 | -------------------------------------------------------------------------------- /_build/tags/performance/feed.atom: -------------------------------------------------------------------------------- 1 | 2 | 3 | Recent Blog Posts 4 | http://blog.mattalcock.com/feed.atom 5 | 2013-02-24T00:00:00Z 6 | 7 | 8 | Recent blog posts 9 | Werkzeug 10 | 11 | Timing Python Code 12 | http://blog.mattalcock.com/2013/2/24/timing-python-code 13 | 2013-02-24T00:00:00Z 14 | 15 | 16 | Matt Alcock 17 | 18 | <p>This post outlines why timing code in Python is important and provides some simple decorators that can help you time your code without the concerns and worries of peppering your lovely clean code with temporary timing and print statements.</p> 19 | <p>Scroll down if you're just after the decorator code to time functions....</p> 20 | <div class="section" id="python-vs-speed"> 21 | <h2>Python vs Speed</h2> 22 | <p>One of the things Python was originally criticised for was speed. Like lots of dynamic languages, there is an overhead in keeping track of types, and because code is interpreted at runtime instead of being compiled to native code at compile time, dynamic languages like Python will always be a little slower.</p> 23 | <p>Where Python shines is in its power and ability to allow programmers to optimise and focus on the algorithm. Focusing on the complexity of the problem and the algorithm's order of magnitude rather than the low-level details of memory management, pointers etc. can often have massive benefits. Ask any computer science student and they can list off numerous teachings that show algorithm and data structure design will beat brute-force computation power.</p> 24 | <p>If you're looking to build something where microseconds count then I'd turn to C or Java. <a class="reference external" href="http://pypy.org/">PyPy</a> and other sophisticated JIT (Just in Time) compilers can help, and they seem to be the future for Python solutions in this space. 
Another alternative is to find the slow code and either optimise that function or write a C extension for Python for your very specific task. This last approach seems very popular in the finance industry, where milliseconds mean dollars but teams still need the flexibility and speed of development that come with a dynamic language.</p> 25 | <p>More often than not, slow code just needs some refactoring work, a new supporting data structure or a change in the complexity of processing. So the challenge is really not "how can I speed up my code?" but "which code needs my attention?"</p> 26 | </div> 27 | <div class="section" id="finding-slow-code"> 28 | <h2>Finding Slow Code</h2> 29 | <p>In order to find slow Python code we're going to have to time stuff. We don't want to cover our lovely clean code with temporary timing code and print statements, so how can we:</p> 30 | <blockquote> 31 | <ul class="simple"> 32 | <li>Time code without altering the code of a function</li> 33 | <li>Get detailed timing information if the function is run with different arguments</li> 34 | <li>Switch off the timing at deploy time to remove the monitoring overhead</li> 35 | </ul> 36 | </blockquote> 37 | <p>The timing decorators below can help with all of these. 
If your new to decorators and annotations see my previous blog <a class="reference external" href="http://blog.mattalcock.com/2013/1/5/decorates-and-annotations/">post on the subject</a></p> 38 | </div> 39 | <div class="section" id="the-timeit-decorator"> 40 | <h2>The Timeit Decorator</h2> 41 | <div class="highlight"><pre><span class="kn">import</span> <span class="nn">time</span> 42 | 43 | <span class="k">def</span> <span class="nf">timeit</span><span class="p">(</span><span class="n">f</span><span class="p">):</span> 44 | 45 | <span class="k">def</span> <span class="nf">timed</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">):</span> 46 | <span class="n">ts</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> 47 | <span class="n">result</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">)</span> 48 | <span class="n">te</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> 49 | 50 | <span class="k">print</span> <span class="s">&#39;func:</span><span class="si">%r</span><span class="s"> args:[</span><span class="si">%r</span><span class="s">, </span><span class="si">%r</span><span class="s">] took: </span><span class="si">%2.4f</span><span class="s"> sec&#39;</span> <span class="o">%</span> \ 51 | <span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">__name__</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kw</span><span class="p">,</span> <span class="n">te</span><span class="o">-</span><span class="n">ts</span><span 
class="p">)</span> 52 | <span class="k">return</span> <span class="n">result</span> 53 | 54 | <span class="k">return</span> <span class="n">timed</span> 55 | </pre></div> 56 | <p>Using the decorator is easy either use annotations.</p> 57 | <div class="highlight"><pre><span class="nd">@timeit</span> 58 | <span class="k">def</span> <span class="nf">compute_magic</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> 59 | <span class="c">#function definition</span> 60 | <span class="o">....</span> 61 | </pre></div> 62 | <p>Or realias the function you want to time.</p> 63 | <div class="highlight"><pre><span class="n">compute_magic</span> <span class="o">=</span> <span class="n">timeit</span><span class="p">(</span><span class="n">compute_magic</span><span class="p">)</span> 64 | </pre></div> 65 | <p>Sometimes you'll want to remove the code timing. You can either do this by remvoing the timeit annotations before deployment or you can you a configuration switch to enable the decorator to wrap the function in timing code.</p> 66 | <div class="highlight"><pre><span class="kn">import</span> <span class="nn">time</span> 67 | 68 | <span class="c">#from config import TIME_FUCNTIONS</span> 69 | <span class="n">TIME_FUCNTIONS</span> <span class="o">=</span> <span class="bp">False</span> 70 | 71 | <span class="k">def</span> <span class="nf">timeit</span><span class="p">(</span><span class="n">f</span><span class="p">):</span> 72 | <span class="k">if</span> <span class="ow">not</span> <span class="n">TIME_FUCNTIONS</span><span class="p">:</span> 73 | <span class="k">return</span> <span class="n">f</span> 74 | <span class="k">else</span><span class="p">:</span> 75 | <span class="k">def</span> <span class="nf">timed</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">):</span> 76 | <span class="n">ts</span> <span class="o">=</span> <span 
class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> 77 | <span class="n">result</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">)</span> 78 | <span class="n">te</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> 79 | 80 | <span class="k">print</span> <span class="s">&#39;func:</span><span class="si">%r</span><span class="s"> args:[</span><span class="si">%r</span><span class="s">, </span><span class="si">%r</span><span class="s">] took: </span><span class="si">%2.4f</span><span class="s"> sec&#39;</span> <span class="o">%</span> \ 81 | <span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">__name__</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">kw</span><span class="p">,</span> <span class="n">te</span><span class="o">-</span><span class="n">ts</span><span class="p">)</span> 82 | <span class="k">return</span> <span class="n">result</span> 83 | 84 | <span class="k">return</span> <span class="n">timed</span> 85 | </pre></div> 86 | <p>By simpley changing the TIME_FUNCTIONS configuration swtich the functions will not decorated. I find having these swtiches in a common config file/folder often helps.</p> 87 | <p>All this code and the majorty of code from my posts can be found in the hack repo of my github account. Please take a look <a class="reference external" href="https://github.com/mattalcock/hacks">here</a> . 
I hope it's helped. If there are any questions about the above, or you'd like to understand more about timing code in Python, drop me a mail.</p> 88 | <p>Matt</p> 89 | </div> 90 | 91 | 92 | 93 | 94 | -------------------------------------------------------------------------------- /_build/2012/12/5/python-spell-checker/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Did you mean 'python spell checker'? | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 |
15 | Matt Alcock - A Data Scientist with a passion for Python 16 |
17 | 25 |
26 | 27 |

Did you mean 'python spell checker'?

28 | 29 | 30 |

written on Wednesday, December 5, 2012

Have you ever been really impressed with Google's 'Did you mean....' spell checker? Have you ever just typed something into Google to help you with your spelling?

35 |

My answer to the above questions would be: yes, all the time!

36 |

In a fantastic post I read some years ago, Peter Norvig outlined how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: 'spelling'. Below is a toy spelling corrector in Python that achieves 80 to 90% accuracy and is very fast. It's written in a fantastically impressive 21 lines of code. It uses list comprehensions and some of my favorite data structures (sets and default dictionaries).

37 |

The code and supporting data files can be found in my hacks public repo under the spellcheck folder.

38 |

The data seed comes from a big.txt file that consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. Obviously Google has a lot more data to seed its spelling checker with, but I was surprised at how effective this relatively small seed was.
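The counting step can be tried on a tiny piece of text before pointing it at big.txt. This sketch swaps in `collections.Counter` for the `defaultdict` used in the post's code, so unseen words count as 0 rather than 1; the sample sentence and the name `count_words` are made up for illustration.

```python
import collections
import re


def words(text):
    # Lowercase the text and pull out runs of letters.
    return re.findall('[a-z]+', text.lower())


def count_words(text):
    # Counter is a dict subclass that tallies how often each word occurs.
    return collections.Counter(words(text))


sample = "The cat sat on the mat. The mat was flat."
counts = count_words(sample)
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```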

39 |
import re, collections

def words(text):
    # Lowercase the text and split it into runs of letters.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Count occurrences; novel words get a default count of 1 (smoothing).
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit away from word.
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b     for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Known words two edits away from word.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer the word itself, then distance 1, then distance 2.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

71 |
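To get a feel for how many candidates edits1 produces, the four edit types can be counted separately. Norvig's essay gives the totals for a word of length n as n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, i.e. 54n+25 strings before duplicates are removed. The helper `edit_counts` below is a hypothetical name used only for this check.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'


def edit_counts(word):
    # Same splits and four edit types as edits1, but returning sizes.
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    total = len(deletes) + len(transposes) + len(replaces) + len(inserts)
    unique = len(set(deletes + transposes + replaces + inserts))
    return len(deletes), len(transposes), len(replaces), len(inserts), total, unique


print(edit_counts('cat'))
```

For 'cat' the total is 54 * 3 + 25 = 187 strings, and the set is smaller because some edits coincide (for instance, replacing a letter with itself gives back the original word).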

If you're new to Python, some of the above code may look complicated and hard to follow. Although dense, I love Peter's use of list comprehensions and generators. The use of nested function composition is also very efficient, and I've noticed a massive speed-up from using such approaches when ingesting or processing large data files.

72 |

An example of nested function composition is:

73 |
NWORDS = train(words(open('big.txt').read()))
 74 | 
75 |

An example of a complex list comprehension is:

76 |
[a + c + b[1:] for a, b in s for c in alphabet if b]
 77 | 
78 |
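Read left to right, the comprehension is just two nested for loops with a filter. The sketch below spells that out for a short word and checks that both forms give the same list; `replaces_comprehension` and `replaces_loop` are hypothetical names for this illustration.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'


def replaces_comprehension(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return [a + c + b[1:] for a, b in s for c in alphabet if b]


def replaces_loop(word):
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    result = []
    for a, b in s:              # outer loop: every split point
        for c in alphabet:      # inner loop: every replacement letter
            if b:               # skip the split with nothing left to replace
                result.append(a + c + b[1:])
    return result


print(replaces_loop('hi')[:3])  # ['ai', 'bi', 'ci']
```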

The final thing I really like in this code snippet is the overriding of the key function when max is called in the 'correct' function. This is a great way to find the word with the highest value in a dictionary of word->count mappings.

79 |
return max(candidates, key=NWORDS.get)
 80 | 
81 |
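A small standalone illustration of the same idiom, using made-up counts rather than the real NWORDS model:

```python
counts = {'spelling': 12, 'spewing': 3, 'spieling': 1}  # made-up counts
candidates = {'spelling', 'spewing', 'spieling'}

# key=counts.get makes max compare the candidates by their count,
# not alphabetically.
best = max(candidates, key=counts.get)
print(best)  # spelling
```

One detail worth knowing: `get` returns None for a missing key, even on a defaultdict. That is harmless in the fallback case where the only candidate is the unknown word itself, because max never has to compare a single item with anything.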

The code is simple and elegant and basically generates a set of candidate words based on the partial or badly spelt word (aka the original word). The most often used word from the candidates is chosen. Peter explains how Bayes' Theorem is used to select the best correction given the original word.
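In the notation of Norvig's essay, with w the original word and c a candidate correction, the choice being made can be written as:

```latex
\hat{c} = \operatorname*{argmax}_{c} P(c \mid w)
        = \operatorname*{argmax}_{c} \frac{P(w \mid c)\, P(c)}{P(w)}
        = \operatorname*{argmax}_{c} P(w \mid c)\, P(c)
```

P(w) is the same for every candidate, so it drops out. In the code, P(c) is estimated from the word counts in NWORDS, while P(w | c) is only approximated: known words at edit distance 0 are preferred over those at distance 1, which in turn are preferred over those at distance 2.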

82 |

See more details, test results and further work at Peter Norvig’s site.

83 | 84 | 85 | 86 |

This entry was tagged bayes, probability, python and statistics

97 | 109 |
110 | 123 | 124 | 125 | -------------------------------------------------------------------------------- /_build/tags/bayes/feed.atom: -------------------------------------------------------------------------------- 1 | 2 | 3 | Recent Blog Posts 4 | http://blog.mattalcock.com/feed.atom 5 | 2012-12-05T00:00:00Z 6 | 7 | 8 | Recent blog posts 9 | Werkzeug 10 | 11 | Did you mean 'python spell checker'? 12 | http://blog.mattalcock.com/2012/12/5/python-spell-checker 13 | 2012-12-05T00:00:00Z 14 | 15 | 16 | Matt Alcock 17 | 18 | <p>Have you ever been really impressed with Google's 'Did you mean....' spell checker? 19 | Have you ever just typed something into Google to help you with your spelling?</p> 20 | <p>My answer to the above questions would be: yes, all the time!</p> 21 | <p>In a fantastic post I read some years ago, Peter Norvig outlined how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: 'spelling'. Below is a toy spelling corrector in Python that achieves 80 to 90% accuracy and is very fast. It's written in a fantastically impressive 21 lines of code. It uses list comprehensions and some of my favorite data structures (sets and default dictionaries).</p> 22 | <p>The code and supporting data files can be found in my hacks public repo under the spellcheck folder.</p> 23 | <p>The data seed comes from a big.txt file that consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. 
Obviously Google has a lot more data to seed this spelling checker with but I was suprised at how effective this relatively small seed was.</p> 24 | <div class="highlight"><pre><span class="kn">import</span> <span class="nn">re</span><span class="o">,</span> <span class="nn">collections</span> 25 | 26 | <span class="k">def</span> <span class="nf">words</span><span class="p">(</span><span class="n">text</span><span class="p">):</span> 27 | <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">&#39;[a-z]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span> 28 | 29 | <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">features</span><span class="p">):</span> 30 | <span class="n">model</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">defaultdict</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="mi">1</span><span class="p">)</span> 31 | <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span> 32 | <span class="n">model</span><span class="p">[</span><span class="n">f</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> 33 | <span class="k">return</span> <span class="n">model</span> 34 | 35 | <span class="n">NWORDS</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">words</span><span class="p">(</span><span class="nb">file</span><span class="p">(</span><span class="s">&#39;big.txt&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()))</span> 36 | <span class="n">alphabet</span> <span class="o">=</span> <span 
class="s">&#39;abcdefghijklmnopqrstuvwxyz&#39;</span> 37 | 38 | <span class="k">def</span> <span class="nf">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 39 | <span class="n">s</span> <span class="o">=</span> <span class="p">[(</span><span class="n">word</span><span class="p">[:</span><span class="n">i</span><span class="p">],</span> <span class="n">word</span><span class="p">[</span><span class="n">i</span><span class="p">:])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span> 40 | <span class="n">deletes</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 41 | <span class="n">transposes</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">2</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span 
class="p">)</span><span class="o">&gt;</span><span class="mi">1</span><span class="p">]</span> 42 | <span class="n">replaces</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">alphabet</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 43 | <span class="n">inserts</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">alphabet</span><span class="p">]</span> 44 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">deletes</span> <span class="o">+</span> <span class="n">transposes</span> <span class="o">+</span> <span class="n">replaces</span> <span class="o">+</span> <span class="n">inserts</span><span class="p">)</span> 45 | 46 | <span class="k">def</span> <span class="nf">known_edits2</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 47 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">e2</span> <span class="k">for</span> <span class="n">e1</span> <span class="ow">in</span> <span class="n">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span 
class="k">for</span> <span class="n">e2</span> <span class="ow">in</span> <span class="n">edits1</span><span class="p">(</span><span class="n">e1</span><span class="p">)</span> <span class="k">if</span> <span class="n">e2</span> <span class="ow">in</span> <span class="n">NWORDS</span><span class="p">)</span> 48 | 49 | <span class="k">def</span> <span class="nf">known</span><span class="p">(</span><span class="n">words</span><span class="p">):</span> 50 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">NWORDS</span><span class="p">)</span> 51 | 52 | <span class="k">def</span> <span class="nf">correct</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 53 | <span class="n">candidates</span> <span class="o">=</span> <span class="n">known</span><span class="p">([</span><span class="n">word</span><span class="p">])</span> <span class="ow">or</span> <span class="n">known</span><span class="p">(</span><span class="n">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">))</span> <span class="ow">or</span> <span class="n">known_edits2</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="ow">or</span> <span class="p">[</span><span class="n">word</span><span class="p">]</span> 54 | <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">candidates</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">NWORDS</span><span class="o">.</span><span class="n">get</span><span class="p">)</span> 55 | </pre></div> 56 | <p>If your new to python some of the above code my look complicated and hard to follow. 
Although dense I love Peter's use of list comprehensions and generators. The use of nested function composits is also very efficient and I've noticed a massive speed up in using such approaches when injesting or processing large data files.</p> 57 | <p>An exmaple of nested function composition is:</p> 58 | <div class="highlight"><pre><span class="n">NWORDS</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">words</span><span class="p">(</span><span class="nb">file</span><span class="p">(</span><span class="s">&#39;big.txt&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()))</span> 59 | </pre></div> 60 | <p>An example of complex list comprehension is:</p> 61 | <div class="highlight"><pre><span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">alphabet</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 62 | </pre></div> 63 | <p>The final thing I really like in this code snippet is the overriding of the key function when max is called in the 'correct' function. 
This is a great way to find the word with the highest value in a dictionaty of word-&gt;count mappings.</p> 64 | <div class="highlight"><pre><span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">candidates</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">NWORDS</span><span class="o">.</span><span class="n">get</span><span class="p">)</span> 65 | </pre></div> 66 | <p>The code is simple and elegant and basically generates a set of candidates words based on the partial or badly spelt word (aka the original word). The most often used word from the candiates is chosen. Peter expalins how Bayes Theorem is used to select the best correction given the original word.</p> 67 | <p>See more details, test results and further work at Peter Novig’s <a class="reference external" href="http://norvig.com/spell-correct.html">site</a> .</p> 68 | 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /_build/tags/probability/feed.atom: -------------------------------------------------------------------------------- 1 | 2 | 3 | Recent Blog Posts 4 | http://blog.mattalcock.com/feed.atom 5 | 2012-12-05T00:00:00Z 6 | 7 | 8 | Recent blog posts 9 | Werkzeug 10 | 11 | Did you mean 'python spell checker'? 12 | http://blog.mattalcock.com/2012/12/5/python-spell-checker 13 | 2012-12-05T00:00:00Z 14 | 15 | 16 | Matt Alcock 17 | 18 | <p>Have you ever been really impressed with Googles 'Did you mean....' spell checker? 19 | Have you ever just typed something into google to help you with your spelling?</p> 20 | <p>My answer to the above questions above would be Yes, all the time!</p> 21 | <p>In a fantastic post I read some years ago Peter Norvig outlined how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. 
Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: 'spelling'. Below is a toy spelling corrector in python that achieves 80 to 90% accuracy and is very fast. It's written in a fanstically impressive 21 lines of code. It uses list comprehensions, and some of my favorite data structures (sets and default dictionaries).</p> 22 | <p>The code and supporting data files can be found in my hacks public repo under the spellcheck folder.</p> 23 | <p>The data seed comes from a big.txt file that consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. Obviously Google has a lot more data to seed this spelling checker with but I was suprised at how effective this relatively small seed was.</p> 24 | <div class="highlight"><pre><span class="kn">import</span> <span class="nn">re</span><span class="o">,</span> <span class="nn">collections</span> 25 | 26 | <span class="k">def</span> <span class="nf">words</span><span class="p">(</span><span class="n">text</span><span class="p">):</span> 27 | <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">&#39;[a-z]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span> 28 | 29 | <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">features</span><span class="p">):</span> 30 | <span class="n">model</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">defaultdict</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="mi">1</span><span 
class="p">)</span> 31 | <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span> 32 | <span class="n">model</span><span class="p">[</span><span class="n">f</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> 33 | <span class="k">return</span> <span class="n">model</span> 34 | 35 | <span class="n">NWORDS</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">words</span><span class="p">(</span><span class="nb">file</span><span class="p">(</span><span class="s">&#39;big.txt&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()))</span> 36 | <span class="n">alphabet</span> <span class="o">=</span> <span class="s">&#39;abcdefghijklmnopqrstuvwxyz&#39;</span> 37 | 38 | <span class="k">def</span> <span class="nf">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 39 | <span class="n">s</span> <span class="o">=</span> <span class="p">[(</span><span class="n">word</span><span class="p">[:</span><span class="n">i</span><span class="p">],</span> <span class="n">word</span><span class="p">[</span><span class="n">i</span><span class="p">:])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span> 40 | <span class="n">deletes</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span 
class="k">if</span> <span class="n">b</span><span class="p">]</span> 41 | <span class="n">transposes</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">2</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="o">&gt;</span><span class="mi">1</span><span class="p">]</span> 42 | <span class="n">replaces</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">alphabet</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 43 | <span class="n">inserts</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span 
class="n">alphabet</span><span class="p">]</span> 44 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">deletes</span> <span class="o">+</span> <span class="n">transposes</span> <span class="o">+</span> <span class="n">replaces</span> <span class="o">+</span> <span class="n">inserts</span><span class="p">)</span> 45 | 46 | <span class="k">def</span> <span class="nf">known_edits2</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 47 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">e2</span> <span class="k">for</span> <span class="n">e1</span> <span class="ow">in</span> <span class="n">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="k">for</span> <span class="n">e2</span> <span class="ow">in</span> <span class="n">edits1</span><span class="p">(</span><span class="n">e1</span><span class="p">)</span> <span class="k">if</span> <span class="n">e2</span> <span class="ow">in</span> <span class="n">NWORDS</span><span class="p">)</span> 48 | 49 | <span class="k">def</span> <span class="nf">known</span><span class="p">(</span><span class="n">words</span><span class="p">):</span> 50 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">NWORDS</span><span class="p">)</span> 51 | 52 | <span class="k">def</span> <span class="nf">correct</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 53 | <span class="n">candidates</span> <span class="o">=</span> <span class="n">known</span><span class="p">([</span><span class="n">word</span><span class="p">])</span> <span class="ow">or</span> <span class="n">known</span><span 
class="p">(</span><span class="n">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">))</span> <span class="ow">or</span> <span class="n">known_edits2</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="ow">or</span> <span class="p">[</span><span class="n">word</span><span class="p">]</span> 54 | <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">candidates</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">NWORDS</span><span class="o">.</span><span class="n">get</span><span class="p">)</span> 55 | </pre></div> 56 | <p>If you're new to Python some of the above code may look complicated and hard to follow. Although dense, I love Peter's use of list comprehensions and generators. The use of nested function composition is also very efficient and I've noticed a massive speed-up from using such approaches when ingesting or processing large data files.</p> 57 | <p>An example of nested function composition is:</p> 58 | <div class="highlight"><pre><span class="n">NWORDS</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">words</span><span class="p">(</span><span class="nb">file</span><span class="p">(</span><span class="s">&#39;big.txt&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()))</span> 59 | </pre></div> 60 | <p>An example of complex list comprehension is:</p> 61 | <div class="highlight"><pre><span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> 
<span class="ow">in</span> <span class="n">alphabet</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 62 | </pre></div> 63 | <p>The final thing I really like in this code snippet is the overriding of the key function when max is called in the 'correct' function. This is a great way to find the word with the highest value in a dictionary of word-&gt;count mappings.</p> 64 | <div class="highlight"><pre><span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">candidates</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">NWORDS</span><span class="o">.</span><span class="n">get</span><span class="p">)</span> 65 | </pre></div> 66 | <p>The code is simple and elegant and basically generates a set of candidate words based on the partial or badly spelt word (aka the original word). The most often used word from the candidates is chosen. Peter explains how Bayes' Theorem is used to select the best correction given the original word.</p> 67 | <p>See more details, test results and further work at Peter Norvig’s <a class="reference external" href="http://norvig.com/spell-correct.html">site</a>.</p> 68 | 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /_build/tags/statistics/feed.atom: -------------------------------------------------------------------------------- 1 | 2 | 3 | Recent Blog Posts 4 | http://blog.mattalcock.com/feed.atom 5 | 2012-12-05T00:00:00Z 6 | 7 | 8 | Recent blog posts 9 | Werkzeug 10 | 11 | Did you mean 'python spell checker'? 12 | http://blog.mattalcock.com/2012/12/5/python-spell-checker 13 | 2012-12-05T00:00:00Z 14 | 15 | 16 | Matt Alcock 17 | 18 | <p>Have you ever been really impressed with Google's 'Did you mean....' spell checker? 
Have you ever just typed something into Google to help you with your spelling?</p> 20 | <p>My answer to the questions above would be yes, all the time!</p> 21 | <p>In a fantastic post I read some years ago Peter Norvig outlined how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: 'spelling'. Below is a toy spelling corrector in Python that achieves 80 to 90% accuracy and is very fast. It's written in a fantastically impressive 21 lines of code. It uses list comprehensions, and some of my favorite data structures (sets and default dictionaries).</p> 22 | <p>The code and supporting data files can be found in my hacks public repo under the spellcheck folder.</p> 23 | <p>The data seed comes from a big.txt file that consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. 
Obviously Google has a lot more data to seed this spelling checker with but I was surprised at how effective this relatively small seed was.</p> 24 | <div class="highlight"><pre><span class="kn">import</span> <span class="nn">re</span><span class="o">,</span> <span class="nn">collections</span> 25 | 26 | <span class="k">def</span> <span class="nf">words</span><span class="p">(</span><span class="n">text</span><span class="p">):</span> 27 | <span class="k">return</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">&#39;[a-z]+&#39;</span><span class="p">,</span> <span class="n">text</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span> 28 | 29 | <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">features</span><span class="p">):</span> 30 | <span class="n">model</span> <span class="o">=</span> <span class="n">collections</span><span class="o">.</span><span class="n">defaultdict</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="mi">1</span><span class="p">)</span> 31 | <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">features</span><span class="p">:</span> 32 | <span class="n">model</span><span class="p">[</span><span class="n">f</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span> 33 | <span class="k">return</span> <span class="n">model</span> 34 | 35 | <span class="n">NWORDS</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">words</span><span class="p">(</span><span class="nb">file</span><span class="p">(</span><span class="s">&#39;big.txt&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()))</span> 36 | <span class="n">alphabet</span> <span class="o">=</span> <span 
class="s">&#39;abcdefghijklmnopqrstuvwxyz&#39;</span> 37 | 38 | <span class="k">def</span> <span class="nf">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 39 | <span class="n">s</span> <span class="o">=</span> <span class="p">[(</span><span class="n">word</span><span class="p">[:</span><span class="n">i</span><span class="p">],</span> <span class="n">word</span><span class="p">[</span><span class="n">i</span><span class="p">:])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)]</span> 40 | <span class="n">deletes</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 41 | <span class="n">transposes</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">2</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span 
class="p">)</span><span class="o">&gt;</span><span class="mi">1</span><span class="p">]</span> 42 | <span class="n">replaces</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">alphabet</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 43 | <span class="n">inserts</span> <span class="o">=</span> <span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">alphabet</span><span class="p">]</span> 44 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">deletes</span> <span class="o">+</span> <span class="n">transposes</span> <span class="o">+</span> <span class="n">replaces</span> <span class="o">+</span> <span class="n">inserts</span><span class="p">)</span> 45 | 46 | <span class="k">def</span> <span class="nf">known_edits2</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 47 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">e2</span> <span class="k">for</span> <span class="n">e1</span> <span class="ow">in</span> <span class="n">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span 
class="k">for</span> <span class="n">e2</span> <span class="ow">in</span> <span class="n">edits1</span><span class="p">(</span><span class="n">e1</span><span class="p">)</span> <span class="k">if</span> <span class="n">e2</span> <span class="ow">in</span> <span class="n">NWORDS</span><span class="p">)</span> 48 | 49 | <span class="k">def</span> <span class="nf">known</span><span class="p">(</span><span class="n">words</span><span class="p">):</span> 50 | <span class="k">return</span> <span class="nb">set</span><span class="p">(</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">words</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">NWORDS</span><span class="p">)</span> 51 | 52 | <span class="k">def</span> <span class="nf">correct</span><span class="p">(</span><span class="n">word</span><span class="p">):</span> 53 | <span class="n">candidates</span> <span class="o">=</span> <span class="n">known</span><span class="p">([</span><span class="n">word</span><span class="p">])</span> <span class="ow">or</span> <span class="n">known</span><span class="p">(</span><span class="n">edits1</span><span class="p">(</span><span class="n">word</span><span class="p">))</span> <span class="ow">or</span> <span class="n">known_edits2</span><span class="p">(</span><span class="n">word</span><span class="p">)</span> <span class="ow">or</span> <span class="p">[</span><span class="n">word</span><span class="p">]</span> 54 | <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">candidates</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">NWORDS</span><span class="o">.</span><span class="n">get</span><span class="p">)</span> 55 | </pre></div> 56 | <p>If you're new to Python some of the above code may look complicated and hard to follow. 
Although dense, I love Peter's use of list comprehensions and generators. The use of nested function composition is also very efficient and I've noticed a massive speed-up from using such approaches when ingesting or processing large data files.</p> 57 | <p>An example of nested function composition is:</p> 58 | <div class="highlight"><pre><span class="n">NWORDS</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">words</span><span class="p">(</span><span class="nb">file</span><span class="p">(</span><span class="s">&#39;big.txt&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()))</span> 59 | </pre></div> 60 | <p>An example of complex list comprehension is:</p> 61 | <div class="highlight"><pre><span class="p">[</span><span class="n">a</span> <span class="o">+</span> <span class="n">c</span> <span class="o">+</span> <span class="n">b</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="n">s</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">alphabet</span> <span class="k">if</span> <span class="n">b</span><span class="p">]</span> 62 | </pre></div> 63 | <p>The final thing I really like in this code snippet is the overriding of the key function when max is called in the 'correct' function. 
This is a great way to find the word with the highest value in a dictionary of word-&gt;count mappings.</p> 64 | <div class="highlight"><pre><span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="n">candidates</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">NWORDS</span><span class="o">.</span><span class="n">get</span><span class="p">)</span> 65 | </pre></div> 66 | <p>The code is simple and elegant and basically generates a set of candidate words based on the partial or badly spelt word (aka the original word). The most often used word from the candidates is chosen. Peter explains how Bayes' Theorem is used to select the best correction given the original word.</p> 67 | <p>See more details, test results and further work at Peter Norvig’s <a class="reference external" href="http://norvig.com/spell-correct.html">site</a>.</p> 68 | 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /_build/2013/1/5/decorates-and-annotations/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Decorators & Annotations | Matt Alcock - A Data Scientist with a passion for Python 7 | 8 | 9 | 10 | 11 | 12 | 13 |
14 |
15 | Matt Alcock - A Data Scientist with a passion for Python 16 |
17 | 25 |
26 | 27 |

Decorators & Annotations

28 | 29 | 30 |

written on Saturday, January 5, 2013 31 | 32 | 33 |

I wanted to highlight the power of decorators and annotations in Python and give the novice Python programmer some insight into how they can be used. If you've only been using Python for a short while then both of these will probably be new.

34 |

Decorators are a way of implementing the famous computer science decorator pattern. This pattern, put in simple terms, is a mechanism that allows you to inject or modify code in a function. In Python you can have two different styles of decorator: the function-defined style or the class-defined style. I prefer the function style but I'll show you using a class structure as well.

35 |

The best way to explain their use is through a well known example. The below code shows how to functionally compute the Fibonacci numbers.

36 |

The Fibonacci sequence is: [0,1,1,2,3,5,8,13.....] where the nth number equals the sum of the (n-1)th and (n-2)th numbers.

37 |

An elegant way of computing this is using the below code:

38 |
def fib(n):
    if n<=0:
        return 0
    elif n==1:
        return 1
    else:
        return fib(n-2) + fib(n-1)
 45 | 
46 |

So fib(7) would return 13. As you can see from the code this uses recursion. The challenge with this approach for calculating the fib sequence is that the low 'tail' function calls get made many times over. Note that this is a different problem from the one solved by 'tail recursion elimination' or TRE, which Python doesn't support and probably won't; the recursive calls in fib are not tail calls, so TRE would not remove the repeated work. Below shows how running the fib function for just a small n can result in a massive number of calls of the tail values.

47 |
fib(7) = fib(6) + fib(5)
fib(7) = fib(5) + fib(4) + fib(4) + fib(3)
fib(7) = fib(4) + fib(3) + fib(3) + fib(2) + fib(3) + fib(2) + fib(2) + fib(1)
.....
fib(7) = fib(1) + fib(0) + fib(1) + fib(0) + .......... [All fib zeros and fib ones]
 52 | 
53 |
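To put a number on that blow-up, here is a minimal sketch that counts how many times the naive function gets called. The count_fib_calls helper and its calls counter are names made up for this illustration; they are not part of the code above.

```python
def count_fib_calls(n):
    """Return (fib(n), number of calls made) for the naive recursive version."""
    calls = [0]  # a one-item list so the nested function can update the count

    def fib(k):
        calls[0] += 1
        if k <= 0:
            return 0
        elif k == 1:
            return 1
        else:
            return fib(k-2) + fib(k-1)

    return fib(n), calls[0]

value, calls = count_fib_calls(7)
print("fib(7) = %d in %d calls" % (value, calls))     # fib(7) = 13 in 41 calls
value, calls = count_fib_calls(20)
print("fib(20) = %d in %d calls" % (value, calls))    # fib(20) = 6765 in 21891 calls
```

The call count grows roughly as fast as the Fibonacci numbers themselves, which is why even modest values of n get slow.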

A way to make this faster is to use a technique called memoization. This remembers the result of a function for a given argument, stores it and, if the function is called again with the same argument, uses the stored version rather than recalculating. This can speed up the above by many orders of magnitude.

54 |

The best way to implement this function calling memory is by decorating the function with some code that can modify the execution path to check a pre saved store first. Below is the memoize decorator as a function.

55 |
def memoize(f):
    cache = {}
    def memf(*args, **kw):
        key = (args, tuple(sorted(kw.items())))
        if key not in cache:
            cache[key] = f(*args, **kw)
        return cache[key]
    return memf
 63 | 
64 |

The memoize decorator above takes a function as an argument. It then creates a new function that stores the results of the original function in a cache. The decorator then returns the new function, which wraps the original function call. We can then use some clever dynamic language tricks to re-alias the fib function to the decorated version.

66 |
fib = memoize(fib)
 67 | 
68 |
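As a rough check that the re-aliasing really removes the repeated work, the sketch below counts how often the body of fib actually runs once it has been wrapped. The evaluations counter is a hypothetical addition for this demonstration; memoize is the same function-style decorator shown above.

```python
def memoize(f):
    cache = {}
    def memf(*args, **kw):
        key = (args, tuple(sorted(kw.items())))
        if key not in cache:
            cache[key] = f(*args, **kw)
        return cache[key]
    return memf

evaluations = [0]  # hypothetical counter: how often the real fib body runs

def fib(n):
    evaluations[0] += 1
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-2) + fib(n-1)

fib = memoize(fib)   # the recursive calls inside fib now go through the cache too

result = fib(30)
print("fib(30) = %d after %d evaluations" % (result, evaluations[0]))
```

Because the recursive calls look up the name fib at call time, they hit the memoized version, so each value from fib(0) to fib(30) is computed only once.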

Calling fib after this aliased decoration, we can be sure that the decorated function will run instead of the basic fib function. This includes the recursive calls inside fib itself, because they look up the name fib at call time. I hope that explains how decorators work in Python and gives you an example of use. So what are annotations?

71 |

Annotations (the @ syntax, which the Python documentation itself simply calls decorator syntax) allow us to use decorators all over our code and are actually syntactic sugar for (the same thing as) the above aliased line. Rather than re-aliasing fib to the decorated fib we can use annotations at the point of writing the fib function definition.

72 |

An annotated fib function would look like this. Note the simple use of @ and the decorator name above the definition.

73 |
@memoize
def fib(n):
    if n<=0:
        return 0
    elif n==1:
        return 1
    else:
        return fib(n-2) + fib(n-1)
 81 | 
82 |

Simple, hey! So annotations are just a stylish and helpful way to decorate functions at the place of definition. This really helps when you're sharing code and working as a small team because you don't have to look all over the code to see if the function has been re-aliased and decorated; it's right above the definition.
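If you're on Python 3.2 or later, the standard library already ships a memoizing decorator, functools.lru_cache, so the annotated version can be written without a hand-rolled memoize. This is an alternative to the decorator above rather than the same code, and it is shown here only as a sketch.

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # maxsize=None means the cache never evicts entries
def fib(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-2) + fib(n-1)

print(fib(50))             # 12586269025
print(fib.cache_info())    # hit and miss counts kept by lru_cache
```

One design difference worth knowing: lru_cache requires the arguments to be hashable, which is the same restriction the dictionary-based memoize above has.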

83 |

One of the best uses of this type of decoration using annotations is to log the performance of a function or to perform some detailed profiling. You need only write a single decorator to modify and wrap any function, and then you just sprinkle the decorator around your code as annotations depending on which functions you want to time, profile or investigate in detail.
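A timing decorator along those lines could be sketched as follows. The names timed, timings and slow_sum are made up for this example, and time.perf_counter assumes Python 3.3 or later.

```python
import functools
import time

def timed(f):
    """Hypothetical decorator: record how long each call to f takes."""
    @functools.wraps(f)            # keep f's name and docstring on the wrapper
    def wrapper(*args, **kw):
        start = time.perf_counter()
        try:
            return f(*args, **kw)
        finally:
            # runs even if f raises, so failed calls are timed as well
            wrapper.timings.append(time.perf_counter() - start)
    wrapper.timings = []           # one entry, in seconds, per call
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(1000000)
print("%s took %.6f seconds" % (slow_sum.__name__, slow_sum.timings[-1]))
```

Using functools.wraps means the decorated function still reports its own name, which matters when the timings end up in a log.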

84 |

As I mentioned before there is also a class style for writing decorators; let's use our memoize decorator as an example.

85 |

Written as a class the decorator is:

86 |
class Memoize:

    def __init__(self, f):
        self.f = f
        self.cache = {}

    def __call__(self, *args, **kw):
        key = (args, tuple(sorted(kw.items())))
        if key not in self.cache:
            self.cache[key] = self.f(*args, **kw)
        return self.cache[key]
 97 | 
98 |

The class has to have two methods to operate as a decorator: __init__ and __call__. Some people find this easier to read and construct; others prefer the function style. I think it really depends on how advanced the decorator is going to be.

99 |

The class style can then be applied in the exact same way as the above function style decorator.

100 |
fib = Memoize(fib)

# or, at the point of definition:
@Memoize
def fib(n):
    if n<=0:
        return 0
    ...
107 | 
108 |
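One case where the class style earns its keep is when the decorator needs to carry extra state. The sketch below is a hypothetical variant, CountingMemoize, that tracks cache hits and misses as attributes; it is an illustration of the idea rather than code from the hacks repo.

```python
import functools

class CountingMemoize:
    """Hypothetical variant of Memoize that also tracks cache hits and misses."""

    def __init__(self, f):
        self.f = f
        self.cache = {}
        self.hits = 0
        self.misses = 0
        functools.update_wrapper(self, f)   # copy f's name and docstring

    def __call__(self, *args, **kw):
        key = (args, tuple(sorted(kw.items())))
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.f(*args, **kw)
        return self.cache[key]

@CountingMemoize
def fib(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-2) + fib(n-1)

print(fib(10), fib.hits, fib.misses)
```

With a function-style decorator the same counters would have to live in the closure or be attached to the wrapper function, which is workable but arguably less tidy.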

I hope this has helped you understand the basics of decorators and annotations. All of the decorator code listed above can be found in the hacks repo on my github account here

109 | 110 | 111 | 112 |

This entry was tagged 113 | 114 | introduction and 115 | python 116 | 117 | 118 | 119 | 120 |

121 | 133 |
134 | 147 | 148 | 149 | -------------------------------------------------------------------------------- /_build/tags/introduction/feed.atom: -------------------------------------------------------------------------------- 1 | 2 | 3 | Recent Blog Posts 4 | http://blog.mattalcock.com/feed.atom 5 | 2013-01-05T00:00:00Z 6 | 7 | 8 | Recent blog posts 9 | Werkzeug 10 | 11 | Decorators & Annotations 12 | http://blog.mattalcock.com/2013/1/5/decorates-and-annotations 13 | 2013-01-05T00:00:00Z 14 | 15 | 16 | Matt Alcock 17 | 18 | <p>I wanted to highlight the power of decorators and annotations in Python and give the novice Python programmer some insight into how they can be used. If you've only been using Python for a short while then both of these will probably be new.</p> 19 | <p>Decorators are a way of implementing the famous computer science decorator pattern. This pattern, put in simple terms, is a mechanism that allows you to inject or modify code in a function. In Python you can have two different styles of decorator: the function-defined style or the class-defined style. I prefer the function style but I'll show you using a class structure as well.</p> 20 | <p>The best way to explain their use is through a well known example. The below code shows how to functionally compute the Fibonacci numbers.</p> 21 | <p>The Fibonacci sequence is: [0,1,1,2,3,5,8,13.....] 
where the nth number equals the sum of the (n-1)th and (n-2)th numbers.</p> 22 | <p>An elegant way of computing this is using the below code:</p> 23 | <div class="highlight"><pre><span class="k">def</span> <span class="nf">fib</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> 24 | <span class="k">if</span> <span class="n">n</span><span class="o">&lt;=</span><span class="mi">0</span><span class="p">:</span> 25 | <span class="k">return</span> <span class="mi">0</span> 26 | <span class="k">elif</span> <span class="n">n</span><span class="o">==</span><span class="mi">1</span><span class="p">:</span> 27 | <span class="k">return</span> <span class="mi">1</span> 28 | <span class="k">else</span><span class="p">:</span> 29 | <span class="k">return</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> 30 | </pre></div> 31 | <p>So fib(7) would return 13. As you can see from the code this uses recursion. The challenge with this approach for calculating the fib sequence is that the low 'tail' function calls get made many times over. Note that this is a different problem from the one solved by 'tail recursion elimination' or TRE, which Python doesn't support and <a class="reference external" href="http://neopythonic.blogspot.co.uk/2009/04/tail-recursion-elimination.html">probably won't</a>; the recursive calls in fib are not tail calls, so TRE would not remove the repeated work. 
Below shows how running the fib function for just a small n can result in a massive number of calls of the tail values.</p> 32 | <div class="highlight"><pre><span class="n">fib</span><span class="p">(</span><span class="mi">7</span><span class="p">)</span> <span class="o">=</span> <span class="n">fib</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> 33 | <span class="n">fib</span><span class="p">(</span><span class="mi">7</span><span class="p">)</span> <span class="o">=</span> <span class="n">fib</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> 34 | <span class="n">fib</span><span class="p">(</span><span class="mi">7</span><span class="p">)</span> <span class="o">=</span> <span class="n">fib</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span 
class="n">fib</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> 35 | <span class="o">.....</span> 36 | <span class="n">fib</span><span class="p">(</span><span class="mi">7</span><span class="p">)</span> <span class="o">=</span> <span class="n">fib</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="o">+</span> <span class="o">..........</span> <span class="p">[</span><span class="n">All</span> <span class="n">fib</span> <span class="n">zeros</span> <span class="ow">and</span> <span class="n">fib</span> <span class="n">ones</span><span class="p">]</span> 37 | </pre></div> 38 | <p>A way to make this faster is to use a technique called memoization. This remembers the result of a function for a given argument, stores it and, if the function is called again with the same argument, uses the stored version rather than recalculating. This can speed up the above by many orders of magnitude.</p> 39 | <p>The best way to implement this function calling memory is by decorating the function with some code that can modify the execution path to check a pre-saved store first. 
Below is the memoize decorator as a function.</p> 40 | <div class="highlight"><pre><span class="k">def</span> <span class="nf">memoize</span><span class="p">(</span><span class="n">f</span><span class="p">):</span> 41 | <span class="n">cache</span> <span class="o">=</span> <span class="p">{}</span> 42 | <span class="k">def</span> <span class="nf">memf</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">):</span> 43 | <span class="n">key</span> <span class="o">=</span> <span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">kw</span><span class="o">.</span><span class="n">items</span><span class="p">())))</span> 44 | <span class="k">if</span> <span class="n">key</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">cache</span><span class="p">:</span> 45 | <span class="n">cache</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">)</span> 46 | <span class="k">return</span> <span class="n">cache</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> 47 | <span class="k">return</span> <span class="n">memf</span> 48 | </pre></div> 49 | <p>The memoize decorator above takes a function as an argument. It then creates a new function that stores the results of the original function in a cache. The decorator then returns the new function, which wraps the original function call. 
50 | We can then use some clever dynamic-language tricks to re-alias the fib function to the decorated version.</p> 51 | <div class="highlight"><pre><span class="n">fib</span> <span class="o">=</span> <span class="n">memoize</span><span class="p">(</span><span class="n">fib</span><span class="p">)</span> 52 | </pre></div> 53 | <p>After this re-aliasing, any call to fib runs the decorated function instead of the basic fib function. 54 | 55 | I hope that explains how decorators work in Python and gives you an example of their use. So what are annotations?</p> 56 | <p>Annotations (the name used here for Python's &#64; decorator syntax) allow us to use decorators all over our code and are simply syntactic sugar for the aliasing line above. Rather than re-aliasing fib to the decorated fib, we can use an annotation at the point of writing the fib function definition.</p> 57 | <p>An annotated fib function would look like this. Note the simple use of &#64; and the decorator name above the definition.</p> 58 | <div class="highlight"><pre><span class="nd">@memoize</span> 59 | <span class="k">def</span> <span class="nf">fib</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> 60 | <span class="k">if</span> <span class="n">n</span><span class="o">&lt;=</span><span class="mi">0</span><span class="p">:</span> 61 | <span class="k">return</span> <span class="mi">0</span> 62 | <span class="k">elif</span> <span class="n">n</span><span class="o">==</span><span class="mi">1</span><span class="p">:</span> 63 | <span class="k">return</span> <span class="mi">1</span> 64 | <span class="k">else</span><span class="p">:</span> 65 | <span class="k">return</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> 66 | 
</pre></div> 67 | <p>Simple, hey! So annotations are just a stylish and helpful way to decorate functions at the place of definition. This really helps when you're sharing code and working as a small team, because you don't have to look all over the code to see if the function has been re-aliased and decorated: the decorator is right above the definition.</p> 68 | <p>One of the best uses of this type of decoration using annotations is to log the performance of a function or to perform some detailed profiling. You only need to write a single decorator to modify and wrap any function, and then you just sprinkle the decorator around your code as annotations, depending on which functions you want to time, profile or investigate in detail.</p> 69 | <p>As I mentioned before, there is also a class style of writing decorators. Let's use our memoize decorator as an example.</p> 70 | <p>Written as a class the decorator is:</p> 71 | <div class="highlight"><pre><span class="k">class</span> <span class="nc">Memoize</span><span class="p">:</span> 72 | 73 | <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">f</span><span class="p">):</span> 74 | <span class="bp">self</span><span class="o">.</span><span class="n">f</span> <span class="o">=</span> <span class="n">f</span> 75 | <span class="bp">self</span><span class="o">.</span><span class="n">cache</span> <span class="o">=</span> <span class="p">{}</span> 76 | 77 | <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">):</span> 78 | <span class="n">key</span> <span class="o">=</span> <span class="p">(</span><span class="n">args</span><span class="p">,</span> <span 
class="nb">tuple</span><span class="p">(</span><span class="nb">sorted</span><span class="p">(</span><span class="n">kw</span><span class="o">.</span><span class="n">items</span><span class="p">())))</span> 79 | <span class="k">if</span> <span class="n">key</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">cache</span><span class="p">:</span> 80 | <span class="bp">self</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">f</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kw</span><span class="p">)</span> 81 | <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> 82 | </pre></div> 83 | <p>The class must have two methods to operate as a decorator: __init__ and __call__. Some people find this easier to read and construct; others prefer the function style. 
I think it really depends on how advanced the decorator is going to be.</p> 84 | <p>The class style can then be applied in exactly the same way as the function-style decorator above.</p> 85 | <div class="highlight"><pre><span class="n">fib</span> <span class="o">=</span> <span class="n">Memoize</span><span class="p">(</span><span class="n">fib</span><span class="p">)</span> 86 | 87 | <span class="nd">@Memoize</span> 88 | <span class="k">def</span> <span class="nf">fib</span><span class="p">(</span><span class="n">n</span><span class="p">):</span> 89 | <span class="k">if</span> <span class="n">n</span><span class="o">&lt;=</span><span class="mi">0</span><span class="p">:</span> 90 | <span class="k">return</span> <span class="mi">0</span> 91 | <span class="o">...</span> 92 | </pre></div> 93 | <p>I hope this has helped you understand the basics of decorators and annotations. All of the decorator code listed above can be found in the hacks repo on my GitHub account <a class="reference external" href="https://github.com/mattalcock/hacks/tree/master/decorators">here</a>.</p> 94 | 95 | 96 | 97 | 98 | --------------------------------------------------------------------------------