Epistemic Status: 🌲 Evergreen


Almost a decade ago, a friend of mine sent me a link to a video explaining how Statsd works. It introduced me to the whole concept of abstract algebra, and I ran out and bought books about it (like I do). I had actually forgotten a lot of the details, so recently I went back to find the video again and thought I’d write up an introduction to the topic¹.

For those who don’t know, Statsd or things like it are what power metrics observability services like Datadog and SignalFx. These services provide other features outside of metrics, but I’m only talking about metrics here.

Have you ever wondered how these services can efficiently collect hundreds of thousands of metrics from hundreds of thousands of sources, and slice and dice that data in graphs, without falling over from the load or requiring ridiculous amounts of storage? The solution leverages abstract algebra and group theory: things like commutative monoids and abelian groups.

Let’s back up a bit

Anytime someone starts throwing around terms like abelian group, people’s eyes glaze over and they tend to check out, so let’s start with some examples.

Counting things

Let’s say you wanted to keep track of how much money your fancy new web service is generating. One way to do this would be to track it in a metric. Using Datadog’s libraries as an example, we might do something like this:

require 'datadog/statsd'

statsd = Datadog::Statsd.new(...)

def checkout_page(...)
	# somewhere inside your view handler code,
	# assuming that `order.total` is the amount
	# of money on the order
	statsd.count('money_money_money', order.total)
end

The idea here is that anytime you call statsd.count with a label and a value, some global count is incremented by that value. Then we can go to a dashboard and watch our graphs go up and to the right (that’s the dream, right?).

A simple way to implement this would be to just have the Statsd library call an API endpoint hosted by Datadog to submit the values as they arrive. But if your application is making lots and lots of transactions, this would quickly overwhelm Datadog, not to mention saturate your outbound traffic.

But we can exploit the fact that counting is built on addition. It doesn’t matter if we get $250 and then $100, or $100 and then $250; in the end we’ll have $350. And no matter how many values we have, or which parts we add together first, we’ll get the same result.

So, on our web server, we can run a little agent that receives each of these events, and adds them to an internal counter, and then every once in a while, we can send that aggregate count to Datadog in a single, small message.
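Here’s a minimal sketch of that idea in Ruby. This is a toy of my own for illustration, not how the actual Statsd daemon or Datadog agent is implemented: it accumulates counts in memory and flushes one small rolled-up message on an interval.

require 'json'

# A toy aggregating agent: receive many count events, ship one message.
class CountingAgent
  def initialize(flush_interval: 10)
    @totals = Hash.new(0) # metric name => running total
    @mutex = Mutex.new
    @flush_interval = flush_interval
  end

  # Called for every event, e.g. count('money_money_money', 250).
  # Addition is associative and commutative, so order doesn't matter.
  def count(name, value)
    @mutex.synchronize { @totals[name] += value }
  end

  # Every flush_interval seconds, send the aggregates in one message.
  def run
    loop do
      sleep @flush_interval
      flush
    end
  end

  def flush
    snapshot = @mutex.synchronize do
      current = @totals.dup
      @totals.clear
      current
    end
    # In reality this would be a request to the metrics backend.
    puts JSON.generate(snapshot) unless snapshot.empty?
  end
end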

I'm glossing over details

How it actually works is a little more involved, and Datadog usually combines multiple metrics (mean, median, percentiles, etc.) into a single message, but the basic premise holds.

Let’s look at a few more examples, even though Datadog doesn’t directly provide some of these.

Tracking response times

Let’s say we want to keep track of the slowest response times in our views. One way we could do that is by sending timing values in milliseconds. Since we’re only interested in the slowest ones, we could use the max function to determine that.

$$\max(a,b) = \begin{cases} a & \text{if } a \geq b \\ b & \text{if } a < b \end{cases}$$

The max function has the same properties as addition, in that the order of the arguments doesn’t matter, and the order in which you process the individual values doesn’t matter. So again, we can aggregate in the agent running on our web servers, and send smaller rolled-up values to Datadog.

What if we wanted to track the fastest requests? We could try to use the min function, but we might discover a problem.

When we’re keeping track of totals with addition, or slowest requests with max, we don’t care if we get zeros reported, but if we apply min to a value and zero, we’ll get zero back.

Properties of operations

What is it about + and max that makes them different from min? If you remember from algebra, there are certain properties that various operations have.

An operation can be associative, which is to say if I take a series of numbers a, b, and c and combine them using the operation (using + to represent it here), the following equation holds true for all values:

$$a + (b + c) = (a + b) + c$$

An operation can be commutative, which is to say if I have a pair of numbers a and b and combine them using the operation, the following equation holds true for all values:

$$a + b = b + a$$

So far, we can say that addition, max, and min are all associative and commutative. An example of an operation that is not commutative is subtraction.

$$a - b \neq b - a$$

In fact, subtraction is anticommutative, which means that when you swap the arguments, you get the negation of the original result: b - a = -(a - b). You can easily prove this by distributing the -1:

$$\begin{align*} -(a - b) &= (-1)(a - b)\\ &= (-1)(a) + (-1)(-b)\\ &= -a + b\\ &= b - a \end{align*}$$

What happens when we apply addition to 0? Well, we get back the number we started with, that is to say:

$$a + 0 = a$$

The value 0 in this case is known as the identity with respect to addition. It’s also the identity with respect to the max function when working with non-negative integers.

$$\max(a, 0) = a$$

However, the value 0 is not the identity with respect to the min function. We need to use a different value for the identity, which in this case would be infinity (or the maximum value an integer can hold in your programming language).
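To make the identity concrete, here’s a small Ruby sketch (my own illustration, not part of any Statsd library). An aggregator built on reduce needs a starting value, and that starting value has to be the operation’s identity, or quiet intervals will corrupt the result:

samples = [120, 45, 310] # response times in milliseconds

# 0 is the identity for addition and (over non-negative timings) for max
total   = samples.reduce(0) { |acc, x| acc + x }      # => 475
slowest = samples.reduce(0) { |acc, x| [acc, x].max } # => 310

# min needs infinity as its starting value...
fastest = samples.reduce(Float::INFINITY) { |acc, x| [acc, x].min } # => 45

# ...because starting from 0 would always report 0:
broken = samples.reduce(0) { |acc, x| [acc, x].min } # => 0, wrong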

These are the basics of group theory. You have a set of things, e.g. positive integers, and you have an operation you can apply to elements of that set to produce other elements of the set. Group theory defines various structures on top of sets, each one adding a restriction to the previous one. Starting from the loosest structure, we have:

  • A magma is a set with a single binary operation that must be closed, that is, any two elements of the set combined with the operation result in another element of the set; no combination produces an element outside the set. The operation has no other required properties: no associativity, no commutativity, etc.
  • A semigroup adds associativity to the operation.
    • Adding commutativity to the operation turns a semigroup into a commutative semigroup.
  • A monoid takes the semigroup and adds an identity element, e.g. the value 0 for addition, or 1 for multiplication.
    • Adding commutativity to the operation turns a monoid into a commutative monoid.
  • A group takes the monoid and requires that every element of the set has an inverse: combining an element with its inverse under the operation results in the identity element.
    • Adding commutativity to the operation turns a group into an abelian group.
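To tie the definitions back to code, here’s one way to phrase a commutative monoid in Ruby. Again, this is a sketch of my own, nothing Datadog-specific; it’s just an identity element plus a closed, associative, commutative operation:

# A commutative monoid as an identity element plus a binary operation.
Monoid = Struct.new(:identity, :op) do
  # Fold a collection down to a single value, starting from the identity.
  def combine(values)
    values.reduce(identity) { |acc, v| op.call(acc, v) }
  end
end

SUM = Monoid.new(0, ->(a, b) { a + b })
MAX = Monoid.new(-Float::INFINITY, ->(a, b) { [a, b].max })
MIN = Monoid.new(Float::INFINITY,  ->(a, b) { [a, b].min })

# Associativity and commutativity mean partial aggregates from
# different servers can be combined in any order and any grouping:
SUM.combine([SUM.combine([1, 2]), SUM.combine([3, 4])]) # => 10, same as SUM.combine([1, 2, 3, 4])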

Enough with the maths! Back to some metrics we care about

Suppose we wanted to calculate average response times. It seems like this is something we can’t do without keeping all the intermediate values. After all, if we have several collected values like this:

$$[a, b, c, d]$$

Then the average of those will be:

$$\frac{a + b + c + d}{4}$$

If we just keep track of those averages, we have a problem. Say we have this series of values from three different sources:

$$\begin{align*} \mathit{source_a} &\rightarrow [1, 4, 3]\\ \mathit{source_b} &\rightarrow [2, 7, 1, 2, 8]\\ \mathit{source_c} &\rightarrow [8, 7, 6] \end{align*}$$

If each of those sources are averaged, we end up with these values:

$$\begin{align*} \mathrm{avg}(\mathit{source_a}) &= 2.66\\ \mathrm{avg}(\mathit{source_b}) &= 4\\ \mathrm{avg}(\mathit{source_c}) &= 7 \end{align*}$$

If we naïvely average those (summing them and dividing by the number of averages) we’ll end up with:

$$\frac{2.66 + 4 + 7}{3} \approx 4.55$$

but if we do the same on the original values, we get:

$$\frac{1+4+3+2+7+1+2+8+8+7+6}{11} \approx 4.45$$

This happens because we’ve lost some information in the aggregated values, and the averages by themselves do not form a commutative monoid.

But, we can fix this by changing the data we’re storing. What if instead of storing the average, we store the average and the count of the elements that created it? All of our original values would just have a count of 1, but the aggregated values would become:

$$\begin{align*} \mathrm{avg}(\mathit{source_a}) &= \{2.66, 3\}\\ \mathrm{avg}(\mathit{source_b}) &= \{4, 5\}\\ \mathrm{avg}(\mathit{source_c}) &= \{7, 3\} \end{align*}$$

Then we could combine these using a weighted average:

$$\frac{(2.66 \times 3) + (4 \times 5) + (7 \times 3)}{3 + 5 + 3} \approx 4.45$$

So, the combination of an average and a count does form a commutative monoid under the operation of weighted average.
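Here’s that fix sketched in Ruby (the Avg name and shape are my own, not from any library): store (mean, count) pairs, and combine pairs with a weighted average, using {0, 0} as the identity.

# An average alone loses information; an (average, count) pair does not.
Avg = Struct.new(:mean, :count) do
  def self.of(values)
    new(values.sum.to_f / values.size, values.size)
  end

  # Weighted average of two aggregates; associative and commutative.
  def combine(other)
    return other if count.zero?
    return self  if other.count.zero?
    total = count + other.count
    Avg.new((mean * count + other.mean * other.count) / total, total)
  end
end

IDENTITY = Avg.new(0.0, 0) # the identity element for this monoid

a = Avg.of([1, 4, 3])
b = Avg.of([2, 7, 1, 2, 8])
c = Avg.of([8, 7, 6])

[a, b, c].reduce(IDENTITY, :combine).mean # => 4.4545..., matching the raw average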

Numerical stability

This formula isn’t numerically stable when using floating point numbers in a computer, which is to say, inaccuracies can creep in. A detailed analysis of this problem can be found in this paper.
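I won’t reproduce the analysis here, but one standard trick (my own sketch, not necessarily the paper’s presentation) is to merge means with a correction term instead of multiplying the means back out into large sums:

# Merge two (mean, count) aggregates by nudging mean_a toward mean_b,
# weighted by the share of samples mean_b represents. Intermediate
# values stay near the means themselves rather than growing into
# large sums, which limits the accumulated floating point error.
def merge_means(mean_a, count_a, mean_b, count_b)
  total = count_a + count_b
  mean_a + (mean_b - mean_a) * (count_b.to_f / total)
end

merge_means(2.66, 3, 4.0, 5) # => 3.4975, same as (2.66 * 3 + 4.0 * 5) / 8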

It’s easy to get fixated on numbers as the elements of sets, but the beauty of abstract algebra and group theory is that it applies to any set of “objects” that follow the rules of the structure we’re interested in.

An example of this from A Book of Abstract Algebra: imagine you’re at a family reunion, and through some method we can determine, for any two people, which person at the reunion is their closest common ancestor. The people at the reunion would be the “set of objects” and “closest common ancestor” would be the operation².

What’s the takeaway here?

There are many scenarios that can be looked at through this lens of commutative monoids and abelian groups. The nice thing about these structures is that the intermediate values take a roughly fixed amount of space. We don’t have to store all the intermediate averages if we don’t want to. How many snapshots you take over time only limits the resolution at which you can see individual results; it doesn’t affect the final calculation.

There are also quite a few scenarios that don’t fit this model.

Unique visitors are an example of something that would be nice to track this way but is impossible to do under this model. This is known as the count-distinct problem, and solving it exactly has space requirements proportional to the cardinality of the set of possible values. The best we can do is use something like HyperLogLog to form a fairly accurate estimate.
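A tiny Ruby illustration of why (again, my own example): per-source unique counts can’t simply be added, and the exact fix, keeping the full sets around, costs space proportional to the number of distinct values seen.

require 'set'

visitors_a = Set['u1', 'u2', 'u3'] # seen by server A
visitors_b = Set['u2', 'u3', 'u4'] # seen by server B

visitors_a.size + visitors_b.size # => 6, overcounts the overlap
(visitors_a | visitors_b).size    # => 4, correct, but we kept every id

# Sets under union do form a commutative monoid, but unlike a count
# or a (mean, count) pair, the aggregate grows with the data.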

But that’s a story for another time.

Book Recommendations

The only book I still own on the topic is A Book of Abstract Algebra by Charles Pinter. It’s a typical math book, so being comfortable with algebra and set theory would be helpful (there’s a review of the latter in one of the appendices).

It’s been a while since I read it, but it does a good job of going from the basics all the way through Galois Theory³.

Footnotes

  1. This is sort of me doing a variant of the Feynman technique.

  2. They leave out an explanation of what the “identity” element would be, so this is a commutative semigroup: the operation is associative and commutative, but the set lacks an identity element. Even if you count yourself as your own ancestor, there is no single person who, combined with any other attendee, would give back that attendee.

  3. Named after Évariste Galois, who came up with the idea in the early 19th century when he was 18. Who knows what he could have accomplished if he hadn’t died at 20 in a duel.