Python generators and being lazy
This is going to be a rather long post (call it a tutorial if you wish), but in case you are a beginner I hope it will help you understand generators in Python and lazy evaluation and your time will be well spent. I usually take notes while learning any new stuff and now I am trying to experiment converting the notes into blog post/tutorials as I feel it will be a good way for me to revisit and revise the concepts while being helpful to others at the same time.
And no, please don't grab a cup of coffee for this one ;-) Instead fire up a Python shell and have your favourite editor ready because we will be trying out stuff.
A simple example
The good news is that, to work with Python generators it doesn't require us to learn much additional syntax. Here is a simple generator.
def gen(): for i in range(1, 6): yield i
>>> g = gen() >>> type(g) <type 'generator'>
g is a generator here. What's happening is that the function
when invoked returns a generator object which is assigned to
you think I am crazy to say it returns a generator object, I don't
blame you because it's not immediately clear. After all there is no
return keyword used. Instead, we see a new keyword
function with yield statement will magically return a generator
The call to the function will not execute any code inside it yet. For
that we need to call the generator object's
>>> g.next() 1 >>> print 'Hello' Hello >>> g.next() 2 >>> g.next() 3
At the time of the first call to the
next, the yield statement will
be executed once and a value will be returned. At the same time, the
control will also be returned back to the calling code. On the next
call to the
next method, the control goes back to the function and
it can resume the execution from where it left with full access to the
local variables again.
Iterator protocol and Generator expressions
Generators support the
i.e. they implement the
__iter__ methods and raise
StopIteration exception when no more values can be yielded. Hence
we can use a for loop to generate values from a generator instead of
calling the next method manually.
for will implicitly handle
StopIteration and when that happens, will end the loop.
for i in g: print i
In fact there also exist list comprehensions equivalent for generators
called generator expressions. The syntax again is ridiculously
similar, the only change being, round brackets
() instead of
. The difference is that it will give us an iterator (a
generator object) instead of an iterable (a list in memory).
>>> squares = [i*i for i in range(1, 11)] # list >>> type(squares) <type 'list'> >>> gen_squares = (i*i for i in range(1, 11)) # generator object >>> type(gen_squares) <type 'generator'> >>> iter(gen_squares) is gen_squares True
Now you may ask how does this differ from an ordinary list and what is
the use of all this anyway? The key difference is that the generator
gives out new values on the fly and doesn't keep the elements in
memory. Turns out, our earlier example was not quite apt for
understanding the concept as we used
range to build a list in memory
upfront. As a practical example, let's define a function to give us
incremental values infinitely.
def infinitely_incr(start=0): n = start while True: n += 1 yield n
>>> iinf = infinitely_incr() >>> iinf.next() 1 >>> iinf.next() 2 >>> iinf.next() 3
We can call
iinf.next() as many times as we want to get an
incremented number each time without having a list in memory. This is
Let's consider another example. What if we have huge data in some file and need to process each of it's lines by calling one or many functions on them,
def gen1(): with open('hugedata.txt') as f: for line in f: yield line g = gen1() g2 = (process(x) for x in g) for x in g2: print x
In python, a file object can be iterated over to obtain one line at a
time. In the above example, since the
process function is called
inside a generator expression, it will not be executed until the for
loop starts consuming the generator. That is when the
function will execute for each value. Don't worry if all this sounds
confusing at the moment since the next example will clarify
things. But if you think about it, the cost of loading all data from
the huge file into memory is avoided. On the other hand, it also means
that the file cannot be closed until all the lines are processed.
Also, not keeping the elements in memory implies that a generator object can be looped through or consumed only once. So it is obviously not a good choice if the sequence of items need to be reused in which case a normal list would be suitable.
>>> g = gen() >>> squares = (i*i for i in g) >>> list(squares) [1, 4, 9, 16, 25] >>> cubes = (i*i*i for i in g) >>> list(cubes) 
But if you have a series of functions, that need to be executed one after another on each line of a file, then the laziness of generator expressions can be tremendously useful.
Understanding the 'lazy' using a
concrete contrived example
So, what does being lazy mean after all? Imagine our hugedata.txt contains some 100000 lines with 1 random number on each line and we want to find out the digit sum of the square of each number and print out the results in the shell. Here is an example that uses list comprehensions and hence will build and keep lists in memory.
For the sake of an example and to make sense out of the results, let's assume that our hugedata.txt is actually a tiny file of just 5 lines containing the first 5 positive integers :-)
def square(x): print 'Square of %d ->' % x, return x*x def digit_sum(x): print 'Digit Sum of %d ->' % x, return sum(map(int, str(x))) numbers = gen() squares = [square(n) for n in numbers] dsums = [digit_sum(n) for n in squares] for n in dsums: print n
Running the above snippet of code will produce an output as follows,
Square of 1 -> Square of 2 -> Square of 3 -> Square of 4 -> Square of 5 -> Digit Sum of 1 -> Digit Sum of 4 -> Digit Sum of 9 -> Digit Sum of 16 -> Digit Sum of 25 -> 1 4 9 7 7
First all squares will be calculated, then their digit sums and then the results will be printed one by one.
Now with generator expressions just see what we get,
numbers = gen1() squares = (square(n) for n in numbers) dsums = (digit_sum(n) for n in squares) for n in dsums: print n
Square of 1 -> Digit Sum of 1 -> 1 Square of 2 -> Digit Sum of 4 -> 4 Square of 3 -> Digit Sum of 9 -> 9 Square of 4 -> Digit Sum of 16 -> 7 Square of 5 -> Digit Sum of 25 -> 7
Every item is processed by each function sequencially similar to how
it would have been if there was just one
for loop and all
functions were called progressively on the derived values of the item
in each iteration. This is quite awesome if you can imagine numbers
flowing through functions similar to signals flowing through various
stages of a signal processor.
It's called lazy because the numbers are getting consumed late, at the
time of iteration. The implicit call to
next by the
dsums which asks for the
squares which asks for
numbers can yield a value. Nothing is evaluated
unless it is asked for.
Common traps and things to watch out for
Just like in case of many other cool language features, there are a few gotchas and things that we need to watch out for when using generators as it's very easy to screw things up.
Rule #0 is - Use generators wisely. Don't use a generator expression only because the syntax is slightly different from list comprehensions.
Also, as we saw earlier, if the sequence needs to be reused then simply use a list. Keeping stuff in memory is not bad after all (we do that all the time while caching values, don't we?)
Another important thing to watch out for is the scope of the variables that are going to be used by functions when they execute in a lazy manner. This needs a bit more explanation so here is an example.
Suppose we have a generator that yeilds alphabets and we need to add
two suffixes to each alphabet for eg. we have alphabet
it's suffixed with
x which makes it
ax and then with
axy. We need to do this with multiple alphabets and we
choose to use a generator object to yield each alphabet.
def add_suffix(s, suffix): return '%s%s' % (s, suffix) def gen(): for i in ['a', 'b', 'c', 'd']: yield i ns = gen() suffixes = ['x', 'y'] for s in suffixes: ns = (add_suffix(i, s) for i in ns) print list(ns)
What do you think will be output of the above program? If your mind
['axy', 'bxy', 'cxy', 'dxy'] then it's wrong. Just run it
and see it for yourself that the output we get is
['ayy', 'byy', 'cyy', 'dyy']. What's happening here?
A generator can remember the local variables when it gets back the
control on the call of
next method. The local scope here is actually that of
for loop. By the time the generator is consumed upon call to
list(ns), the value of
s in local scope is
y. The value
the previous iteration of suffixes is simply lost.
To fix this, we just define another function wrapping over the call to
add_suffix function that will return a generator object
def gen1(s, sfx): for x in s: yield add_suffix(x, sfx) for s in suffixes: ns = gen1(ns, s)
>>> list(ns) ['axy', 'bxy', 'cxy', 'dxy']
This is by no means all about generators
I know there is lot more to generators than what this post covers. You should only consider this as a starting point for digging deeper into them. It would also be worth mentioning about the use of generator as co-routines where it can accept values from the calling code besides yeilding to it. Co-routines are pretty advanced and mind bending to understand and I am still trying to explore this topic. I got interested in it after attending a talk on 'Data processing pipelines' by Ami Tavory at SciPy India 2012 where he also showed Dagpype - A framework written by him for data processing and preparation.
- Generator Tricks for system programmers by David Beazley (particularly a must read)
- Iterators and simple generators by David Mertz
- Core Python Programming Book by Wesley Chun
- I would also like to recommend this recently published article by Ian Ward.
If you are curious about co-routines, also see,
- A Curious Course on Coroutines and Concurrency (again by David Beazley)
- Part II of the above article by Ian Ward
People who helped improve the post by pointing out errors and bugs. Thanks!
- Jimit Modi
- Sanjay Bhangar