6.6.4 Processing files
We’re deferring some of the work of decoding our files to functions that return generators
over data. The readlines() function takes the connection, key, and a block-iterating
callback. It’ll iterate over blocks of data yielded by the block-iterating callback,
discover line breaks, and yield lines. When provided with blocks as in listing 6.32, it finds
the last line ending in the block, and then splits the lines up to that last line ending, yielding
the lines one by one. When it’s done, it keeps any partial lines to prepend onto the
next block. If there’s no more data, it yields the last line by itself. There are other ways
of finding line breaks and extracting lines in Python, but the rfind()/split() combination
is faster than the alternatives.
For our higher-level line-generating function, we’re iterating over blocks produced by
one of two readers, which allows us to focus on finding line breaks.
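
The line-splitting steps described above can be sketched as follows. This is a minimal illustration, not a reproduction of listing 6.32: it works on bytes, and fake_blocks() is a hypothetical stand-in for a block-iterating callback so that the example runs without Redis.

```python
def readlines(conn, key, rblocks):
    """Yield lines from the blocks produced by the rblocks callback."""
    out = b''
    for block in rblocks(conn, key):
        out += block
        # Find the last line ending in the data buffered so far.
        posn = out.rfind(b'\n')
        if posn >= 0:
            # Split everything before that line ending into whole lines.
            for line in out[:posn].split(b'\n'):
                yield line + b'\n'
            # Keep the trailing partial line to prepend onto the next block.
            out = out[posn + 1:]
    # No more data: yield whatever partial line is left, if any.
    if out:
        yield out


def fake_blocks(conn, key):
    # Hypothetical callback: ignores conn and key and yields fixed blocks,
    # with line breaks deliberately falling in the middle of blocks.
    for block in (b'one\ntw', b'o\nthree\nfo', b'ur', b''):
        yield block


lines = list(readlines(None, 'logfile', fake_blocks))
```

Because readlines() only ever sees blocks, the same function works unchanged with any callback that accepts a connection and a key and yields byte strings.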
GENERATORS WITH YIELD Listing 6.32 offers our first real use of Python generators
with the yield statement. Generally, this allows Python to suspend and
resume execution of code primarily to allow for easy iteration over sequences
or pseudo-sequences of data. For more details on how generators work, you can
visit the Python language tutorial with this short URL: http://mng.bz/Z2b1.
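
As a small illustration of suspending and resuming execution, consider the following sketch. The countdown() function is a hypothetical example and isn't part of our file-processing code.

```python
def countdown(n):
    # Execution pauses at each yield and resumes from the same point
    # when the caller asks for the next value.
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
first = next(gen)   # runs the body up to the first yield
rest = list(gen)    # resumes the body and drains the remaining values
```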
Each of the two block-yielding callbacks, readblocks() and readblocks_gz(), will
read blocks of data from Redis. The first yields the blocks directly, whereas the second
automatically decompresses gzip files. We’ll use this layer separation to offer the most
useful and reusable data-reading method possible. The following
listing shows the readblocks() generator.
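
A block reader of this kind might look like the following sketch. Several details here are assumptions rather than the book's listing: FakeConn is a hypothetical in-memory stand-in whose getrange() mimics the inclusive offsets of the Redis GETRANGE command, the default block size is arbitrary, and ending the stream with an empty block is a convention chosen for this example.

```python
class FakeConn:
    """Hypothetical in-memory stand-in for a Redis connection."""

    def __init__(self, data):
        self.data = data  # dict mapping key -> bytes

    def getrange(self, key, start, end):
        # Like Redis GETRANGE, both offsets are inclusive.
        return self.data.get(key, b'')[start:end + 1]


def readblocks(conn, key, blocksize=2**17):
    """Yield the value stored at key in blocks of at most blocksize bytes."""
    pos = 0
    while True:
        block = conn.getrange(key, pos, pos + blocksize - 1)
        # The final block yielded is always empty, which tells the
        # consumer that there is no more data.
        yield block
        if not block:
            break
        pos += len(block)


conn = FakeConn({'log': b'abcdefghij'})
blocks = list(readblocks(conn, 'log', blocksize=4))
```

With a real connection object that provides a compatible getrange() method, the generator body would stay the same; only the object passed in as conn would change.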
The readblocks() generator is primarily meant to offer an abstraction over our block
reading, which allows us to replace it later with other types of readers, such as a filesystem
reader, a memcached reader, a ZSET reader, or in our case, a block reader that
handles gzip files in Redis. The next listing shows the readblocks_gz() generator.
Much of the body of readblocks_gz() is gzip header parsing code, which is unfortunately
necessary. For log files like the ones we’re parsing, gzip can reduce storage space
requirements by a factor of 2–5 while offering fairly high-speed decompression.
Though more modern compression methods are able to compress better (bzip2,
lzma/xz, and many others) or faster (lz4, lzop, snappy, QuickLZ, and many others),
no other method is as widely available (bzip2 comes close) or offers such a useful range
of trade-offs between compression ratio and CPU utilization.
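
Block-by-block decompression can be sketched as shown next. Note that this is a deliberately different technique from the manual header parsing described above: passing 16 + zlib.MAX_WBITS to zlib.decompressobj() asks zlib to handle the gzip header and trailer itself. The readblocks() used here is a hypothetical stand-in that reads from a plain dict rather than Redis, and the sketch assumes a single-member gzip file.

```python
import gzip
import zlib


def readblocks(conn, key, blocksize=2**17):
    # Hypothetical stand-in: conn is a plain dict of key -> bytes here,
    # not a Redis connection.
    data = conn[key]
    for pos in range(0, len(data), blocksize):
        yield data[pos:pos + blocksize]
    yield b''


def readblocks_gz(conn, key, blocksize=2**17):
    """Yield decompressed blocks from a gzip file stored at key."""
    # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for block in readblocks(conn, key, blocksize):
        if not block:
            break
        data = decomp.decompress(block)
        # A compressed block may not produce any output yet.
        if data:
            yield data
    tail = decomp.flush()
    if tail:
        yield tail
    # Finish with an empty block to signal the end of the data.
    yield b''


original = b'GET /index.html 200\n' * 1000
conn = {'log.gz': gzip.compress(original)}
restored = b''.join(readblocks_gz(conn, 'log.gz', blocksize=64))
```

Because readblocks_gz() takes the same leading arguments as readblocks() and also yields byte strings, a line reader that consumes blocks doesn't need to know whether the underlying data was compressed.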