├── 2012
│   └── 12
│       ├── 4
│       │   └── hello-internet.rst
│       └── 5
│           └── python-spell-checker.rst
├── 2013
│   ├── 1
│   │   └── 5
│   │       └── decorates-and-annotations.rst
│   └── 2
│       ├── 4
│       │   └── joel-test-for-data-teams.rst
│       └── 24
│           └── timing-python-code.rst
├── .DS_Store
├── config.yml
├── Makefile
├── README.md
├── _build
│   ├── 2012
│   │   ├── 12
│   │   │   ├── 4
│   │   │   │   └── hello-internet
│   │   │   │       └── index.html
│   │   │   ├── 5
│   │   │   │   └── python-spell-checker
│   │   │   │       └── index.html
│   │   │   └── index.html
│   │   └── index.html
│   ├── 2013
│   │   ├── 1
│   │   │   └── 5
│   │   │       └── decorates-and-annotations
│   │   │           └── index.html
│   │   ├── 2
│   │   │   ├── 4
│   │   │   │   └── joel-test-for-data-teams
│   │   │   │       └── index.html
│   │   │   └── 24
│   │   │       └── timing-python-code
│   │   │           └── index.html
│   │   ├── index.html
│   │   ├── 02
│   │   │   └── index.html
│   │   └── 01
│   │       └── index.html
│   ├── README.md
│   ├── upload.sh
│   ├── tags
│   │   ├── Thoughts
│   │   │   ├── index.html
│   │   │   └── feed.atom
│   │   ├── performance
│   │   │   ├── index.html
│   │   │   └── feed.atom
│   │   ├── bayes
│   │   │   ├── index.html
│   │   │   └── feed.atom
│   │   ├── introduction
│   │   │   ├── index.html
│   │   │   └── feed.atom
│   │   ├── statistics
│   │   │   ├── index.html
│   │   │   └── feed.atom
│   │   ├── probability
│   │   │   ├── index.html
│   │   │   └── feed.atom
│   │   ├── python
│   │   │   └── index.html
│   │   └── index.html
│   ├── archive
│   │   └── index.html
│   ├── static
│   │   ├── _pygments.css
│   │   └── style.css
│   ├── about
│   │   └── index.html
│   └── index.html
├── _templates
│   ├── tagcloud.html
│   ├── tag.html
│   ├── _pagination.html
│   ├── blog
│   │   ├── year_archive.html
│   │   ├── month_archive.html
│   │   ├── archive.html
│   │   └── index.html
│   ├── rst_display.html
│   └── layout.html
├── upload.sh
├── .gitignore
├── about.rst
└── static
    └── style.css

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mattalcock/blog/HEAD/.DS_Store

--------------------------------------------------------------------------------
/config.yml:
--------------------------------------------------------------------------------
1 | active_modules: [pygments, tags, blog, latex]
2 | author: Matt Alcock
3 | canonical_url: 
http://blog.mattalcock.com/
4 | modules:
5 |   pygments:
6 |     style: tango

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | all: build upload
2 | 
3 | clean:
4 | 	rm -rf _build
5 | 
6 | build:
7 | 	run-rstblog build
8 | 
9 | serve:
10 | 	run-rstblog serve
11 | 
12 | upload:
13 | 	./upload.sh
14 | 	@echo "Done..."

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | blog
2 | ====
3 | 
4 | My personal blog
5 | 
6 | 
7 | Building blog command....
8 | 
9 | >run-rstblog build
10 | 
11 | Running blog command....
12 | 
13 | >run-rstblog serve
14 | 
15 | Copy blog command....
16 | 
17 | >upload.sh (work in progress)

--------------------------------------------------------------------------------
/_build/README.md:
--------------------------------------------------------------------------------
1 | blog
2 | ====
3 | 
4 | My personal blog
5 | 
6 | 
7 | Building blog command....
8 | 
9 | >run-rstblog build
10 | 
11 | Running blog command....
12 | 
13 | >run-rstblog serve
14 | 
15 | Copy blog command....
16 | 
17 | >upload.sh (work in progress)

--------------------------------------------------------------------------------
/_templates/tagcloud.html:
--------------------------------------------------------------------------------
1 | {% extends "layout.html" %}
2 | {% block title %}Tags{% endblock %}
3 | {% block body %}
4 | 
written on {{ format_date(ctx.pub_date, format='full') }}
9 | {% endif %}
10 | 
11 | {{ rst.fragment }}
12 | 
13 | {% if ctx.tags %}
14 | 
written on Monday, February 4, 2013 31 | 32 | 33 |
The original 'Joel Test'.
34 |Do you use source control? 35 | Can you make a build in one step? 36 | Do you make daily builds? 37 | Do you have a bug database? 38 | Do you fix bugs before writing new code? 39 | Do you have an up-to-date schedule? 40 | Do you have a spec? 41 | Do programmers have quiet working conditions? 42 | Do you use the best tools money can buy? 43 | Do you have testers? 44 | Do new candidates write code during their interview? 45 | Do you do hallway usability testing?
46 |The 'Data Team Test'.
47 |Do you use source control? 48 | Do you have separate development and production environments? 49 | Do you have access to raw data in a warehouse or other offline store that doesn't impact the live system? 50 | Do you have a way of recording and marking dirty data? 51 | Do you record bugs and fix them before writing new code? 52 | Do you have a process to schedule and prioritise analysis? 53 | Do you spend time with product owners and domain experts? 54 | Do programmers/analysts have quiet working conditions? 55 | Do you use the best tools money can buy? 56 | Do you have testers? 57 | Do you have a mechanism to communicate findings and make recommendations? 58 | Do you and other teams employ data-driven development techniques?
59 | 60 | 61 | 62 | 63 | 64 | 65 |My name is Matt Alcock and I'm a Data Scientist, Analytics Lead and Python fan. I'm currently the Lead Analyst at NaturalMotion Games, where I manage a small team of data analysts. I currently work and live in Oxford. I've worked in small startups and large multinational companies covering a variety of industries including games, finance and fashion. I've been working with data for 10+ years and although my jobs have been varied they've all centered around drawing insight from large data sets.
32 |I split the majority of my time between Oxford and London. If you'd like to meet for a coffee or discuss anything please drop me a message through one of the following channels:
33 |The website is a collection of observations, thoughts, notes and side projects. A lot of the supporting code for the blog posts can be found in a public repo under my GitHub account.
41 |The website itself is written in reStructuredText and built with a small 42 | script written by the very talented Armin Ronacher. Source code can be found on GitHub.
43 |written on Tuesday, December 4, 2012 31 | 32 | 33 |
The obligatory first post. I'll be honest: I've been meaning to start and commit to a blog for some time. After some false starts covering my broad spectrum of interests, I've decided to focus on writing about my thoughts as a Data Scientist. I'm hoping people will find this informative and insightful. If any thoughts, feedback or collaborations come from this then the blog will have been a success. So please let me know what you think.
34 |I've been working with data for 10 years and my job title has jumped around from Developer, Data Analyst, Quantitative Analyst, Team Leader, Product Manager, Warehouse Manager, Head of Analytics, Lead Analyst and Data Scientist. So what am I? I'm not sure everyone fits into role buckets, but one thing I am convinced of is that everyone's interests and expertise are different. I enjoy managing small technical teams, I love working with large amounts of data and I have the expertise to apply statistical and scientific methods to my work.
35 |Below are some biases and opinions I should mention before I start. I'll explain these in more detail over the coming posts but they're four personal and somewhat subjective opinions I wanted to share from the outset. 36 | 37 | - I love the power of modern NoSQL data stores but still feel SQL is an amazing tool for analysis that's hard to beat. 38 | - I think data can unlock questions and give insights into almost every area of business, but I also understand it's not a silver bullet. It should be used with creativity, lateral thinking and domain expertise, not instead of them. 39 | - I love the concise power of statistics but realise they're frequently misleading and often poorly presented and explained. 40 | - I'm programming-language agnostic but love to use Python
41 |The blog will contain thoughts, tools, projects and some book reviews. If there is anything you'd like me to talk about or review, please drop me an email.
42 |I hope you like what's to come...
43 | 44 | 45 | 46 |Using decorators to time and optimise the performance of python code.
32 |An introduction into decorators and annotations in python and their simple power.
49 |How to use Python and some powerful statistics to create a very lightweight but effective Google style spell checker.
66 |The obligatory first post.
85 |written on Sunday, February 24, 2013 31 | 32 | 33 |
This post outlines why timing code in Python is important and provides some simple decorators that can help you time your code without the concerns and worries of peppering your lovely clean code with temporary timing and print statements.
34 |Scroll down if your just after the decortor code to time functions....
35 |One of things Python was orignally critisied for was speed. Like lots of Dynamic Lanaguages there is an overhead in keeping tracking of types and because code is interpreted at runtime instead of being compiled to native code at compile time dynmaic langauges like Python will always be a little slower.
38 |Where Python shines is in it's power and ability to allow progrmaers to opmtimise and focus on the algorthim. Focusing on the complextity of the problem and the algorithms order of magnitidue rather than the low level detials of memory management, pointers etc can often have massive benefits. Ask any computer science student and they can list of nermerous teachings that show alogrithm and data strucute design will beat brute force compuatation power.
39 |If your looking to build something where microsecounds count then I'd turn to C or Java. PyPy and other sophiticated JIT (Just in Time) compliers can help and they seam to be the future for Pytohn solutions in this space. Another aterntative is to find the slow code and either optimise that function or write a C plugin for Python for your very specific task. This last approach seams very popular in the finaince industry where milliseconds mean dollars but they still need the felxiablity and speed of devlelopment benefits that come with a dynamic lanaguage.
40 |More often than not slow code just needs some refactoring work, a new support data strucutre or a change in the complexity of processing. So the challenge is really not how can I speed up my code but what code needs my attention.
41 |In order to find slow Python code we're going to have to time stuff. We don't want to cover our lovely clean code with temporary timing code and print statements, so how can we:
45 |46 |52 |47 |
51 |- Time code without alteringing the code of a function
48 |- Get detailed timing information if the function is run with different arguments
49 |- Switch off the timing at deploy time to reduce the overhead and improve the performance of monitoring
50 |
The timing decorators below can help with all of these. If you're new to decorators and annotations, see my previous blog post on the subject. 53 |
53 |import time
57 |
58 | def timeit(f):
59 |
60 | def timed(*args, **kw):
61 | ts = time.time()
62 | result = f(*args, **kw)
63 | te = time.time()
64 |
65 | print 'func:%r args:[%r, %r] took: %2.4f sec' % \
66 | (f.__name__, args, kw, te-ts)
67 | return result
68 |
69 | return timed
70 | Using the decorator is easy: either use annotations.
72 |@timeit
73 | def compute_magic(n):
74 | #function definition
75 | ....
76 | Or re-alias the function you want to time.
78 |compute_magic = timeit(compute_magic)
79 | Sometimes you'll want to remove the timing code. You can either do this by removing the timeit annotations before deployment, or you can use a configuration switch to control whether the decorator wraps the function in timing code.
81 |import time
82 |
83 | #from config import TIME_FUNCTIONS
84 | TIME_FUNCTIONS = False
85 |
86 | def timeit(f):
87 | if not TIME_FUNCTIONS:
88 | return f
89 | else:
90 | def timed(*args, **kw):
91 | ts = time.time()
92 | result = f(*args, **kw)
93 | te = time.time()
94 |
95 | print 'func:%r args:[%r, %r] took: %2.4f sec' % \
96 | (f.__name__, args, kw, te-ts)
97 | return result
98 |
99 | return timed
100 | By simply changing the TIME_FUNCTIONS configuration switch, functions will not be decorated. I find having these switches in a common config file/folder often helps.
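A refinement worth mentioning alongside the switch, though it isn't in the original snippet: wrapping with functools.wraps preserves the decorated function's name and docstring, which matters because the timing line prints f.__name__. A minimal sketch (compute_magic is just an illustrative stand-in):

```python
import time
from functools import wraps

TIME_FUNCTIONS = True  # in practice this would come from your config module


def timeit(f):
    if not TIME_FUNCTIONS:
        return f  # switched off: return the function untouched

    @wraps(f)  # keep f.__name__ and f.__doc__ on the wrapper
    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print('func:%r args:[%r, %r] took: %2.4f sec' % (f.__name__, args, kw, te - ts))
        return result
    return timed


@timeit
def compute_magic(n):
    return sum(range(n))
```

With wraps in place, compute_magic.__name__ still reports 'compute_magic' rather than 'timed', which keeps logs and tracebacks readable.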
102 |All this code and the majority of code from my posts can be found in the hack repo of my GitHub account. Please take a look here. I hope it's helped; if there are any questions about the above, or you'd like to understand more about timing code in Python, drop me a mail.
103 |Matt
104 |written on Wednesday, December 5, 2012 31 | 32 | 33 |
Have you ever been really impressed with Google's 'Did you mean....' spell checker? 34 | Have you ever just typed something into Google to help you with your spelling?
35 |My answer to the questions above would be: Yes, all the time!
36 |In a fantastic post I read some years ago Peter Norvig outlined how Google’s ‘did you mean’ spelling corrector uses probability theory, large training sets and some elegant statistical language processing to be so effective. Type in a search like 'speling' and Google comes back in 0.1 seconds or so with Did you mean: 'spelling'. Below is a toy spelling corrector in Python that achieves 80 to 90% accuracy and is very fast. It's written in a fantastically impressive 21 lines of code. It uses list comprehensions, and some of my favorite data structures (sets and default dictionaries).
37 |The code and supporting data files can be found in my hacks public repo under the spellcheck folder.
38 |The data seed comes from a big.txt file that consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of the most frequent words from Wiktionary and the British National Corpus. It uses a simple training method of just counting the occurrences of each word in the big text file. Obviously Google has a lot more data to seed its spelling checker with, but I was surprised at how effective this relatively small seed was.
39 |import re, collections
40 |
41 | def words(text):
42 | return re.findall('[a-z]+', text.lower())
43 |
44 | def train(features):
45 | model = collections.defaultdict(lambda: 1)
46 | for f in features:
47 | model[f] += 1
48 | return model
49 |
50 | NWORDS = train(words(file('big.txt').read()))
51 | alphabet = 'abcdefghijklmnopqrstuvwxyz'
52 |
53 | def edits1(word):
54 | s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
55 | deletes = [a + b[1:] for a, b in s if b]
56 | transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b)>1]
57 | replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
58 | inserts = [a + c + b for a, b in s for c in alphabet]
59 | return set(deletes + transposes + replaces + inserts)
60 |
61 | def known_edits2(word):
62 | return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
63 |
64 | def known(words):
65 | return set(w for w in words if w in NWORDS)
66 |
67 | def correct(word):
68 | candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
69 | return max(candidates, key=NWORDS.get)
70 | If you're new to Python some of the above code may look complicated and hard to follow. Although dense, I love Peter's use of list comprehensions and generators. The use of nested function composition is also very efficient and I've noticed a massive speed-up in using such approaches when ingesting or processing large data files.
72 |An example of nested function composition is:
73 |NWORDS = train(words(file('big.txt').read()))
74 | An example of complex list comprehension is:
76 |[a + c + b[1:] for a, b in s for c in alphabet if b]
77 | The final thing I really like in this code snippet is the overriding of the key function when max is called in the 'correct' function. This is a great way to find the word with the highest value in a dictionary of word->count mappings.
79 |return max(candidates, key=NWORDS.get)
80 | The code is simple and elegant and basically generates a set of candidate words based on the partial or badly spelt word (aka the original word). The most often used word from the candidates is chosen. Peter explains how Bayes' Theorem is used to select the best correction given the original word.
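To see the pipeline work end to end without downloading big.txt, the same functions can be run against a tiny inline corpus. This is a sketch: the corpus below is purely illustrative and stands in for the real training data, so don't expect the 80 to 90% accuracy from it.

```python
import re
import collections


def words(text):
    return re.findall('[a-z]+', text.lower())


def train(features):
    # Count word occurrences; unseen words default to 1 (a crude smoothing)
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model


# Tiny inline corpus standing in for big.txt (illustrative only)
NWORDS = train(words('spelling is hard but spelling practice helps spelling'))
alphabet = 'abcdefghijklmnopqrstuvwxyz'


def edits1(word):
    # All words one edit away: deletes, transposes, replaces, inserts
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)
```

Here correct('speling') finds 'spelling' because the insert edit generates it and it is the most frequent known candidate.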
82 |See more details, test results and further work at Peter Norvig’s site.
83 | 84 | 85 | 86 |written on Saturday, January 5, 2013 31 | 32 | 33 |
I wanted to highlight the power of decorators and annotations in Python and give the novice Python programmer some insight into how they can be used. If you've only been using Python for a short while then both of these will probably be new.
34 |Decorators are a way of implementing the famous computer science decorator pattern. This pattern, put in simple terms, is a mechanism that allows you to inject or modify code in a function. In Python you can have two different styles of decorator: the function defined style or the class defined style. I prefer the function style but I'll show you the class structure as well.
35 |The best way to explain their use is through a well known example. The below code shows how to functionally compute the Fibonacci numbers.
36 |The Fibonacci sequence is: [0,1,1,2,3,5,8,13.....] where the nth number equals the sum of the (n-1)th and (n-2)th.
37 |An elegant way of computing this is using the below code:
38 |def fib(n):
39 | if n<=0:
40 | return 0
41 | elif n==1:
42 | return 1
43 | else:
44 | return fib(n-2) + fib(n-1)
45 | So fib(7) would return 13. As you can see from the code this uses recursion. The challenge with this approach for calculating the fib sequence is that the low 'tail' function calls will get called multiple times. Removing this overhead is called 'tail recursion elimination' or TRE. Python doesn't support this and probably won't. Below shows how running the fib function for just a small n can result in a massive number of calls of the tail values.
47 |fib(7) = fib(6) + fib(5)
48 | fib(7) = fib(5) + fib(4) + fib(4) + fib(3)
49 | fib(7) = fib(4) + fib(3) + fib(3) + fib(2) + fib(3) + fib(2) + fib(2) + fib(1)
50 | .....
51 | fib(7) = fib(1) + fib(0) + fib(1) + fib(0) + .......... [All fib zeros and fib ones]
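The blow-up sketched above is easy to measure by adding a call counter to the plain fib (a quick sketch, not from the original post):

```python
calls = 0


def fib(n):
    global calls
    calls += 1  # count every invocation, including all the repeated tail calls
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 2) + fib(n - 1)


result = fib(7)
print(result, calls)  # fib(7) == 13 already takes 41 calls
```

The call count grows roughly as fast as the Fibonacci numbers themselves: fib(30) this way takes about 2.7 million calls, which is exactly the overhead memoization removes.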
52 | A way to make this faster is to use a technique called memoization. This remembers the result of a function for a given argument, stores it, and if called again uses the stored version rather than recalculating. This can speed up the above by many orders of magnitude.
54 |The best way to implement this function calling memory is by decorating the function with some code that can modify the execution path to check a pre saved store first. Below is the memoize decorator as a function.
55 |def memoize(f):
56 | cache= {}
57 | def memf(*args, **kw):
58 | key = (args, tuple(sorted(kw.items())))
59 | if key not in cache:
60 | cache[key] = f(*args, **kw)
61 | return cache[key]
62 | return memf
63 | The memoize decorator above takes a function as an argument. It then creates a new function that stores the results of the function into a cache. The decorator then returns the new function that contains the original function call. 65 | We can then use some clever dynamic language tricks to re-alias the fib function to the decorated version.
66 |fib = memoize(fib)
67 | Calling fib after this aliased decoration ensures that the decorated function will run instead of the basic fib function. 69 | 70 | I hope that explains how decorators work in Python and gives you an example of use. So what are annotations?
71 |Annotations allow us to use decorators all over our code and are actually syntactic sugar for (the same thing as) the above aliasing line. Rather than re-aliasing fib to the decorated fib, we can use annotations at the point of writing the fib function definition.
72 |An annotated fib function would look like this. Note the simple use of @ and the decorator name above the definition.
73 |@memoize
74 | def fib(n):
75 | if n<=0:
76 | return 0
77 | elif n==1:
78 | return 1
79 | else:
80 | return fib(n-2) + fib(n-1)
81 | Simple hey! So annotations are just stylish and helpful ways to decorate functions at the place of definition. This really helps when you're sharing code and working as a small team, because you don't have to look all over the code to see if the function has been re-aliased and decorated; it's right above the definition.
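A quick way to convince yourself the decoration works is to count how many times the undecorated body actually runs (a sketch reusing the memoize function from above; the counter is mine, not part of the original post):

```python
def memoize(f):
    cache = {}

    def memf(*args, **kw):
        key = (args, tuple(sorted(kw.items())))
        if key not in cache:
            cache[key] = f(*args, **kw)
        return cache[key]
    return memf


calls = 0


@memoize
def fib(n):
    global calls
    calls += 1  # counts executions of the body, not cache hits
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 2) + fib(n - 1)


result = fib(30)
print(result, calls)  # the body runs once per distinct n: 31 calls instead of millions
```

Because the recursive calls go through the decorated name fib, every sub-result is cached too; each fib(k) for k in 0..30 is computed exactly once.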
83 |One of the best uses of this type of decoration using annotations is to log the performance of a function or to perform some detailed profiling. You only need write a single decorator to modify and wrap any function, and then you just sprinkle the decorator around your code as annotations depending on which functions you want to time, profile or investigate in detail.
84 |As I mentioned before, there is also a class style of writing decorators; let's use our memoize decorator as an example.
85 |Written as a class the decorator is:
86 |class Memoize:
87 |
88 | def __init__(self, f):
89 | self.f = f
90 | self.cache = {}
91 |
92 | def __call__(self, *args, **kw):
93 | key = (args, tuple(sorted(kw.items())))
94 | if not key in self.cache:
95 | self.cache[key] = self.f(*args, **kw)
96 | return self.cache[key]
97 | The class has to have two methods to operate as a decorator: __init__ and __call__. Some people find this easier to read and construct; others prefer the function style. I think it really depends on how advanced the decorator is going to be.
99 |The class style can then be applied in the exact same way as the above function style decorator.
100 |fib = Memoize(fib)
101 |
102 | @Memoize
103 | def fib(n):
104 | if n<=0:
105 | return 0
106 | ...
107 | I hope this has helped you understand the basics of decorators and annotations. All of the decorator code listed above can be found in the hacks repo on my github account here
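As a final check, the class-style Memoize above can be exercised end to end (a self-contained sketch of the class from this post):

```python
class Memoize:
    def __init__(self, f):
        self.f = f
        self.cache = {}

    def __call__(self, *args, **kw):
        key = (args, tuple(sorted(kw.items())))
        if key not in self.cache:
            self.cache[key] = self.f(*args, **kw)
        return self.cache[key]


@Memoize
def fib(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n - 2) + fib(n - 1)


print(fib(50))  # 12586269025, returned near-instantly thanks to the cache
```

Note that after decoration fib is an instance of Memoize, and calling it invokes __call__; the recursion inside the original function body hits the cache just as in the function-style version.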
109 | 110 | 111 | 112 |