EBOOK – REDIS IN ACTION

This book covers the use of Redis, an in-memory database/data structure server.


5.2.1 Storing counters in Redis

As we monitor our application, being able to gather information over time becomes
ever more important. Code changes (which can affect how quickly our site responds,
and consequently how many pages we serve), new advertising campaigns, or new users
to our system can all radically change the number of pages that are loaded on a site,
and with them any number of other performance metrics. But if we aren't
recording any metrics, then it's impossible to know how they're changing, or whether
we're doing better or worse.

In an attempt to start gathering metrics to watch and analyze, we’ll build a tool to
keep named counters over time (counters with names like site hits, sales, or database queries
can be crucial). Each of these counters will store the most recent 120 samples at a
variety of time precisions (like 1 second, 5 seconds, 1 minute, and so on). Both the number
of samples and the selection of precisions to record can be customized as necessary.
The first step for keeping counters is actually storing the counters themselves.

UPDATING A COUNTER

In order to update counters, we’ll need to store the actual counter information. For each counter and precision, like site hits and 5 seconds, we’ll keep a HASH that stores information about the number of site hits that have occurred in each 5-second time slice. The keys in the hash will be the start of the time slice, and the value will be the number of hits. Figure 5.1 shows a selection of data from a hit counter with 5-second time slices.
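Computing which time slice a hit falls into is a one-liner: round the timestamp down to a multiple of the precision. A minimal sketch (the helper name and the example timestamp are ours, not from the book's code):

```python
def slice_start(now, precision):
    # Round a UNIX timestamp down to the start of its time slice.
    return int(now // precision) * precision

# 7:40:03 a.m. UTC on May 7, 2012 falls in the 5-second slice
# that starts at 7:40:00 (as in figure 5.1).
print(slice_start(1336376403, 5))   # 1336376400
```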

As we start to use counters, we need to record what counters have been written to so
that we can clear out old data. For this, we need an ordered sequence that lets us iterate
one by one over its entries, and that also doesn’t allow duplicates. We could use a LIST
combined with a SET, but that would take extra code and round trips to Redis. Instead,
we’ll use a ZSET, where the members are the combinations of precisions and names that
have been written to, and the scores are all 0. By setting all scores to 0 in a ZSET, Redis
will try to sort by score, and finding them all equal, will then sort by member name. This gives us a fixed order for a given set of members, which will make it easy to sequentially
scan them. An example ZSET of known counters can be seen in figure 5.2.
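Redis breaks score ties by comparing the members as byte strings, so a ZSET whose scores are all 0 behaves like a sorted list of member names. A plain-Python emulation of that tie-break (the counter members here are made up):

```python
# Hypothetical members of the known: ZSET, all inserted with score 0.
known = {'1:hits': 0, '5:hits': 0, '60:hits': 0, '1:user:232': 0}

# With every score equal, ZRANGE known: 0 -1 returns the members in
# byte-string order, which plain sorting reproduces.
members = sorted(known)
print(members)   # ['1:hits', '1:user:232', '5:hits', '60:hits']
```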

Figure 5.1 A HASH that shows the number of web page hits over 5-second time slices around 7:40 a.m. on May 7, 2012
Figure 5.2 A ZSET that shows some known counters

Now that we know what our structures for counters look like, what goes on to make that
happen? For each time slice precision, we’ll add a reference to the precision and the
name of the counter to the known ZSET, and we’ll increment the appropriate time window
by the count in the proper HASH. Our code for updating a counter looks like this.

Listing 5.3 The update_counter() function

# The precision of the counters in seconds: 1 second, 5 seconds, 1 minute,
# 5 minutes, 1 hour, 5 hours, 1 day—adjust as necessary.
PRECISION = [1, 5, 60, 300, 3600, 18000, 86400]

def update_counter(conn, name, count=1, now=None):
   # Get the current time to know which time slice to increment.
   now = now or time.time()
   # Create a transactional pipeline so that later cleanup can work correctly.
   pipe = conn.pipeline()
   # Add entries for all precisions that we record.
   for prec in PRECISION:
      # Get the start of the current time slice.
      pnow = int(now / prec) * prec
      # Create the named hash where this data will be stored.
      hash = '%s:%s'%(prec, name)
      # Record a reference to the counters into a ZSET with the score 0
      # so we can clean up after ourselves.
      pipe.zadd('known:', hash, 0)
      # Update the counter for the given name and time precision.
      pipe.hincrby('count:' + hash, pnow, count)
   pipe.execute()
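To trace which keys and fields a single call touches without a live server, here is a sketch against a tiny in-memory stand-in for the pipeline (the stub class, its storage dicts, and the timestamp are all hypothetical; it mimics only the three commands the listing uses, not a real Redis client):

```python
import time

class StubPipeline:
    """Barebones stand-in for a redis-py pipeline, just enough to trace
    the ZADD and HINCRBY calls that update_counter issues."""
    def __init__(self):
        self.zsets = {}   # key -> {member: score}
        self.hashes = {}  # key -> {field: count}
    def zadd(self, key, member, score):
        self.zsets.setdefault(key, {})[member] = score
    def hincrby(self, key, field, amount):
        h = self.hashes.setdefault(key, {})
        h[field] = h.get(field, 0) + amount
    def execute(self):
        pass

PRECISION = [1, 5, 60, 300, 3600, 18000, 86400]

def update_counter_demo(pipe, name, count=1, now=None):
    now = now or time.time()
    for prec in PRECISION:
        pnow = int(now // prec) * prec
        hash = '%s:%s' % (prec, name)
        pipe.zadd('known:', hash, 0)
        pipe.hincrby('count:' + hash, pnow, count)
    pipe.execute()

pipe = StubPipeline()
update_counter_demo(pipe, 'hits', count=3, now=1336425040)
# The 5-second counter lives in the hash count:5:hits, keyed by slice start.
print(pipe.hashes['count:5:hits'])   # {1336425040: 3}
```

The stub confirms the layout described above: one count:&lt;precision&gt;:&lt;name&gt; HASH per precision, plus one known: ZSET entry per precision/name pair.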

Updating the counter information isn’t so bad; just a ZADD and HINCRBY for each time
slice precision. And fetching the information for a named counter and a specific precision
is also easy. We fetch the whole HASH with HGETALL, convert our time slices and
counters back into numbers (they’re all returned as strings), sort them by time, and
finally return the values. The next listing shows our code for fetching counter data.

Listing 5.4 The get_counter() function

def get_counter(conn, name, precision):
   # Get the name of the key where we'll be storing counter data.
   hash = '%s:%s'%(precision, name)
   # Fetch the counter data from Redis.
   data = conn.hgetall('count:' + hash)
   # Convert the counter data into something more expected.
   to_return = []
   for key, value in data.iteritems():
      to_return.append((int(key), int(value)))
   # Sort our data so that older samples are first.
   to_return.sort()
   return to_return

We did exactly what we said we were going to do. We fetched the data, ordered it
sequentially based on time, and converted it back into integers. Let’s look at how we
prevent these counters from keeping too much data.
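The conversion-and-sort step can be checked by hand against a made-up HGETALL reply (a real redis-py reply would hold strings or bytes; the field values here are invented):

```python
# What HGETALL count:5:hits might return: string fields and values,
# in no particular order (hash fields are unordered).
data = {'1336376410': '2', '1336376400': '17', '1336376405': '3'}

# Convert to integer pairs and sort so that older samples come first.
to_return = sorted((int(k), int(v)) for k, v in data.items())
print(to_return)   # [(1336376400, 17), (1336376405, 3), (1336376410, 2)]
```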

CLEANING OUT OLD COUNTERS

Now we have all of our counters written to Redis and we can fetch our counters with
ease. But as we update our counters, at some point we’re going to run out of memory
if we don’t perform any cleanup. Because we were thinking ahead, we wrote to our
known ZSET the listing of known counters. To clean out the counters, we need to iterate
over that listing and clean up old entries.

WHY NOT USE EXPIRE? One limitation of the EXPIRE command is that it only
applies to whole keys; we can’t expire parts of keys. And because we chose to
structure our data so that counter X of precision Y is in a single key for all
time, we have to clean out the counters periodically. If you feel ambitious, you
may want to try restructuring the counters to change the data layout to use
standard Redis expiration instead.

As we process and clean up old counters, a few things are important to pay attention
to. The following list shows a few of the more important items that we need to be
aware of as we clean out old counters:

  • New counters can be added at any time.
  • Multiple cleanups may be occurring at the same time.
  • Trying to clean up daily counters every minute is a waste of effort.
  • If a counter has no more data, we shouldn’t try to clean it up any more.

With all of those things in mind, we'll build a daemon function similar in operation to
the daemon functions that we wrote back in chapter 2. As before, we'll repeatedly
loop until the system is told to quit. To help minimize load during cleanup, we'll
attempt a cleanup pass roughly once per minute, and within each pass we'll clean
each counter at roughly the rate that it receives new entries. If a counter has a time
slice of 5 minutes, we'll try to clean out old entries from that counter every 5 minutes;
counters that get new entries more often than once per minute (the 1-second and
5-second counters in our example) will be cleaned out every minute.
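That schedule boils down to a single modulus check, which the upcoming listing performs with a bprec variable: the precision expressed in whole minutes, rounded up to at least one, dividing the pass counter. Pulled out on its own (the helper name is ours):

```python
def should_clean(passes, prec):
    # bprec is the precision in whole minutes; sub-minute precisions
    # become 1, so those counters are cleaned on every pass.
    bprec = int(prec // 60) or 1
    return passes % bprec == 0

print(should_clean(3, 5))     # True: 5-second counters are cleaned every pass
print(should_clean(3, 300))   # False: 5-minute counters wait for a 5th pass
print(should_clean(5, 300))   # True
```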

To iterate over the counters, we’ll fetch known counters one by one with ZRANGE.
To clean a counter, we’ll fetch all of the start times for a given counter, calculate which
items are before a calculated cutoff (120 samples ago), and remove them. If there’s no
more data for a given counter, we’ll remove the counter reference from the known
ZSET. Explaining what goes on is simple, but the details of the code show some corner
cases. Check out this listing to see the cleanup code in full detail.

Listing 5.5 The clean_counters() function

def clean_counters(conn):
   pipe = conn.pipeline(True)
   # Keep a record of the number of passes so that we can balance
   # cleaning out per-second vs. per-day counters.
   passes = 0
   # Keep cleaning out counters until we're told to stop.
   while not QUIT:
      # Get the start time of the pass to calculate the total duration.
      start = time.time()
      # Incrementally iterate over all known counters.
      index = 0
      while index < conn.zcard('known:'):
         # Get the next counter to check.
         hash = conn.zrange('known:', index, index)
         index += 1
         if not hash:
            break
         hash = hash[0]
         # Get the precision of the counter.
         prec = int(hash.partition(':')[0])
         # We'll take a pass every 60 seconds or so, so we'll try to clean out
         # counters at roughly the rate that they're written to.
         bprec = int(prec // 60) or 1
         # Try the next counter if we aren't supposed to check this one on this
         # pass (for example, we've taken three passes, but the counter has a
         # precision of 5 minutes).
         if passes % bprec:
            continue

         hkey = 'count:' + hash
         # Find the cutoff time for the earliest sample that we should keep,
         # given the precision and number of samples that we want to keep.
         cutoff = time.time() - SAMPLE_COUNT * prec
         # Fetch the times of the samples, and convert the strings to integers.
         samples = map(int, conn.hkeys(hkey))
         samples.sort()
         # Determine the number of samples that should be deleted.
         remove = bisect.bisect_right(samples, cutoff)

         # Remove the samples as necessary.
         if remove:
            conn.hdel(hkey, *samples[:remove])
            # The data HASH may be empty.
            if remove == len(samples):
               try:
                  # Watch the counter hash for changes.
                  pipe.watch(hkey)
                  # Verify that the counter hash is empty, and if so,
                  # remove it from the known counters.
                  if not pipe.hlen(hkey):
                     pipe.multi()
                     pipe.zrem('known:', hash)
                     pipe.execute()
                     # If we deleted a counter, then we can use
                     # the same index next pass.
                     index -= 1
                  else:
                     # The hash isn't empty; keep it in the
                     # list of known counters.
                     pipe.unwatch()
               except redis.exceptions.WatchError:
                  # Someone else changed the counter hash by adding counters,
                  # which means that it has data, so we'll leave the counter
                  # in the list of known counters.
                  pass

      passes += 1
      duration = min(int(time.time() - start) + 1, 60)
      # Sleep the remainder of the 60 seconds, or at least 1 second,
      # just to offer a bit of a rest.
      time.sleep(max(60 - duration, 1))

As described earlier, we iterate one by one over the ZSET of counters, looking for items
to clean out. We only clean out counters that should be cleaned in this pass, so we perform
that check early. We then fetch the counter data and determine what (if anything)
should be cleaned up. After cleaning out old data as necessary, if we don’t
believe that there should be any remaining data, we verify that there’s no more data
for the counter and remove it from our ZSET of counters. Finally, after passing over all
of the counters, we calculate how long it took to perform a pass, and sleep roughly the
remainder of the minute we left for ourselves to perform the full cleanup, until the
next pass.
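The cutoff arithmetic at the heart of the cleanup can be exercised in isolation: everything at or before now - SAMPLE_COUNT * prec falls to the left of bisect_right's split point. A sketch with invented numbers, using the 120-sample limit from the text:

```python
import bisect

SAMPLE_COUNT = 120            # keep the most recent 120 samples, per the text
prec = 5                      # a 5-second counter
now = 1336376400              # hypothetical current time

cutoff = now - SAMPLE_COUNT * prec                   # oldest slice start we keep
samples = sorted(now - 5 * i for i in range(150))    # 150 slice starts, ascending

remove = bisect.bisect_right(samples, cutoff)        # how many to delete
old, kept = samples[:remove], samples[remove:]
print(len(old), len(kept))   # 30 120
```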

Now that we have counter data, are cleaning it up, and can fetch it, it’s just a matter of building an interface for consuming the data. Sadly, that part is out of the scope of this book, but there are a few usable JavaScript plotting libraries that can help you out on the web side of things (I’ve had good experiences with jqplot [http://www.jqplot.com/], Highcharts [http://www.highcharts.com/], dygraphs [http://dygraphs.com/], and D3 [http://d3js.org/] for personal and professional uses).

When dealing with the depth of complexity in a real website, knowing that a page
gets hit thousands of times a day can help us to decide that the page should be
cached. But if that page takes 2 milliseconds to render, whereas another page gets one
tenth the traffic but takes 2 seconds to render, we can instead direct our attention to
optimizing the slower page. In the next section, we change our approach from keeping
precise counters that give us data over time, to keeping aggregate statistics to help
us make more nuanced decisions about what to optimize.