A dynamic programming algorithm for the Gini coefficient

There’s a very simple $O(n)$ DP algorithm for calculating the Gini coefficient of a sorted list. The code for it honestly looks almost simpler than the naive formula. Code is at the bottom.

If $x$ is a sorted list of values of length $n$, then the Gini coefficient $G$ is given by:

$^$ G = \frac{\sum_{i=0}^n \sum_{j=0}^n \text{abs}(x_i - x_j) }{2 \cdot n^2 \cdot \text{mean}(x)} $^$

Where does this come from?

Here’s one way of looking at it. The mean absolute difference of a distribution is the expected value of drawing two items from the distribution and returning the absolute value of their difference. Here’s a formula for the mean absolute difference $MD$ of the sorted list $x$ from before:

$^$MD = \frac{\sum_{i=0}^n \sum_{j=0}^n \text{abs}(x_i - x_j)}{n^2}$^$

Using this, we can write the Gini coefficient as

$^$G = \frac{MD}{2 \cdot \text{mean}(x)}$^$

This makes more sense. We divide by the mean because we want the Gini coefficient of a distribution to be unaffected by multiplying by a constant. ($\frac{MD}{\text{mean}(x)}$ is known as relative mean absolute distance.) And then we divide by 2 so that our Gini coefficient varies between 0 meaning perfect equality and 1 meaning perfect inequality.

Calculating it

We can calculate this directly. But this takes $O(n^2)$, which is super slow.

Alternatively, we can use a dynamic programming approach to cut the runtime down to $O(n)$ plus the cost of getting the list in sorted order (which is $O(n \log(n))$ if we start out with an unsorted list of incomes, but we plausibly get our data in some other form than that).

Let’s look at the sum on the numerator of the Mean Absolute Difference formula:

$^$ \sum_{i=0}^n \sum_{j=0}^n \text{abs}(x_i - x_j) $^$

Here’s a table of absolute differences for items in the list $[1, 3, 4, 5]$.

	1	3	4	5
1	0	2	3	4
3	2	0	1	2
4	3	1	0	1
5	4	2	1	0

First thing to notice here is that the table is symmetrical around its diagonal. If we can calculate the sum of one of those sides, we’re done. Let’s choose the lower triangle.

	1	3	4	5
1	0
3	2	0
4	3	1	0
5	4	2	1	0

Let’s see how to compute the sum of that triangle efficiently.

First, let’s compute a list $y$ of prefix sums of $x$–that is, $y_i$ is the sum of the first $i$ elements in $x$. So $y_0 = 0$, and $y_i = y_{i-1} + x_{i-1}$.

Let’s see what this list looks like. Let’s also look at the sum of the elements in the $i$th row of the table.

$$i$$	0	1	2	3
$$x_i$$	1	3	4	5
$$y_i$$	0	1	4	8
$$\sum_{j=0}^i x_i - x_j$$	0	2	4	7

Can we rewrite this

\[\begin{align} &\sum_{j=0}^{i} \left| x_i - x_j \right|\\ = &\sum_{j=0}^{i} \left(x_i - x_j \right) \\ = &i\cdot x_i - \left( \sum_{j=0}^i x_j \right) \end{align}\]

Or, taking advantage of $y$:

$^$i\cdot x_i - \left( \sum_{j=0}^i x_j \right) =i \cdot x_i - y_i$^$

Feel free to verify that this is correct:

$$i$$	0	1	2	3
$$x_i$$	1	3	4	5
$$y_i$$	0	1	4	8
$$\sum_{j=0}^i x_i - x_j$$	0	2	4	7
$$i \cdot x_i - y_i$$	0	2	4	7

Now we can write a single for loop to calculate our sum of absolute differences. And so we can write a very fast function for the Gini coefficient of a list of incomes:

def gini(incomes):
    # skip this if it's unneeded
    incomes.sort()

    sum_of_absolute_differences = 0
    subsum = 0

    for i, x in enumerate(incomes):
        sum_of_absolute_differences += i * x - subsum
        subsum += x

    return sum_of_absolute_differences / subsum / len(incomes)

>>> gini([1, 1, 1])
0.0
>>> gini([0, 0, 0, 0, 10])
0.8
>>> gini([1, 2, 3, 4, 5])
0.26666666666666666

All of these match my what Gini coefficient visualizer says.