Writing high(ish) performance code with Kingdon and Numba

In this document we demonstrate how easy it is to compose the fundamental (binary and unary) operators that kingdon supplies out of the box, while maintaining good performance.

The example is inspired by that of `clifford <https://clifford.readthedocs.io/en/latest/tutorials/PerformanceCliffordTutorial.html>`__, to allow for a comparison of both syntax and performance. To keep the performance comparison fair, we run both versions within this document, so that they are timed on the same machine.

First we will run kingdon with numba enabled while leveraging the sparseness in the example, to get the best speeds possible. Then we will run the clifford equivalent. Lastly, we will run the kingdon example again, but with a full multivector, to show how the two libraries compare in this case.

Kingdon unchained

First, we run kingdon on the example, allowing it to use all of its tricks.

Let’s initialize a 3DCGA algebra and create some simple lines.

[1]:
from kingdon import Algebra, MultiVector
import numpy as np
from numba import njit

alg = Algebra(4, 1, wrapper=njit)
locals().update(alg.blades)

The lines l1 and l2 are single blades, so kingdon can exploit the sparseness of the input to gain additional performance.

[2]:
l1 = e145
l2 = e245
R = 1 + l2 * l1
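
These blades are in fact the same (normalized) lines that the clifford version below builds with up(). As a quick sketch, assuming the usual conformal conventions for Algebra(4, 1), namely \(n_o = (e_5 - e_4)/2\) and \(n_\infty = e_4 + e_5\) (the up helper below is our own illustration, not a kingdon builtin), we can check this:

no = 0.5 * (e5 - e4)   # conformal origin (assumed convention)
ninf = e4 + e5         # point at infinity (assumed convention)

def up(x):
    # Embed a Euclidean point into CGA; a sketch, not a kingdon builtin.
    return no + x + 0.5 * (x | x) * ninf

print(up(0 * e1) ^ up(e1) ^ ninf)  # equals e145
print(up(0 * e1) ^ up(e2) ^ ninf)  # equals e245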

The function we will optimize for this example is the conjugation (or sandwich product) \(X \mapsto R X \widetilde{R}\), used to transform a multivector \(X\) by a rotor \(R\):

[3]:
def apply_rotor(R, X):
    return R * X * ~R

The first execution will be the most expensive, because it triggers the code generation for the geometric product and reversion operators. We therefore time it separately:

[4]:
%timeit -n 1 -r 1 apply_rotor(R, l1)
153 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

We can now time the actual execution of the apply_rotor function.

[5]:
%timeit -n 100_000 -r 10 apply_rotor(R, l1)
13.4 µs ± 5.47 µs per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

We now have a benchmark: performing the function apply_rotor takes about 13.4 µs ± 5.47 µs per loop. To do better is easy: we simply apply the Algebra.register decorator to apply_rotor:

@alg.register
def apply_rotor(R, X):
    return R * X * ~R

This decorator allows kingdon to work its magic on the decorated function. While the decorator syntax is pleasant to look at, it overwrites the original function, so in this document we are better off using:
[6]:
apply_rotor_compiled = alg.register(apply_rotor)
[7]:
%timeit -n 1 -r 1 apply_rotor_compiled(R, l1)
104 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[8]:
%timeit -n 100_000 -r 10 apply_rotor_compiled(R, l1)
8.89 µs ± 975 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

After decoration the code runs in about 8.89 µs ± 975 ns, roughly a 1.5-fold improvement in performance. Not bad, for so little effort. The reason for the speed-up is that ordinarily kingdon has to traverse from the Python level down to the actual numerical implementation and back up for every operator. In this example that happens three times: twice for the products *, and once for the reversion ~.

By contrast, the Algebra.register decorator only traverses down to the numerical level once, does all the computations on that level, and then passes the results back up the chain.
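
Schematically (a sketch with hypothetical names, not kingdon's actual internals), the difference in call patterns looks like this:

# Undecorated: three separate round trips to the numerical level:
#
#   tmp = gp(R, X)      # 1st trip: R * X
#   rev = reverse(R)    # 2nd trip: ~R
#   out = gp(tmp, rev)  # 3rd trip: (R * X) * ~R
#
# With @alg.register: one fused numerical function is generated for the
# whole expression, so the boundary is crossed only once:
#
#   out_values = fused_apply_rotor(R.values(), X.values())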

Having had our first taste of better performance, we of course want to know if we can do even better. In this example we expect a better result if we first perform some symbolic simplification, since both \(R\) and \(\widetilde{R}\) appear in the expression. We can ask the decorator to do symbolic simplification by instead using

@alg.register(symbolic=True)
def apply_rotor(R, X):
    return R * X * ~R

(Note: for more complicated expressions this might be a bad idea, since symbolic simplification can get very expensive.)
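
To get a feel for what the symbolic engine works with, here is a small illustration in a 2D algebra (assuming kingdon's symbolic constructor alg.multivector(name=...); the names below are our own):

# Symbolic multivectors: the coefficients become sympy symbols, and a
# product like R * X * ~R becomes a multivector of sympy expressions
# that symbolic=True can simplify before code generation.
toy = Algebra(2)
R_sym = toy.multivector(name='R', grades=(0, 2))  # symbolic rotor
X_sym = toy.multivector(name='X', grades=(1,))    # symbolic vector
print(R_sym * X_sym * ~R_sym)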

[9]:
apply_rotor_simplified = alg.register(apply_rotor, symbolic=True)
[10]:
%timeit -n 1 -r 1 apply_rotor_simplified(R, l1)
710 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[11]:
%timeit -n 100_000 -r 10 apply_rotor_simplified(R, l1)
4.17 µs ± 330 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

The symbolic optimization makes the code generation more expensive, even in this sparse example. However, we do obtain a new speed record: 4.17 µs ± 330 ns per loop, roughly 3 times faster than the original.

Of course conjugation is a common operation, so kingdon ships with a precompiled version out of the box. The operator for conjugation is >>:

[12]:
%timeit -n 1 -r 1 R >> l1
95.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[13]:
%timeit -n 100_000 -r 10 R >> l1
8.69 µs ± 812 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

This has comparable numerical performance and also a shorter generation time, which is important for the user experience.
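
As a quick sanity check (purely illustrative), both spellings compute the same thing:

print(R >> l1)             # kingdon's built-in conjugation operator
print(apply_rotor(R, l1))  # our hand-written sandwich product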

We have now reached the limit of what can be achieved with the @alg.register decorator: 4.17 µs ± 330 ns, roughly a three-fold speed-up compared to the undecorated version. Beware, however, that until we can find a faster symbolic engine than sympy, symbolic simplification is usually expensive, so it is recommended to use the symbolic keyword only selectively.

Clifford

Now let us repeat these exercises, but using `clifford <https://clifford.readthedocs.io/en/latest/tutorials/PerformanceCliffordTutorial.html>`__.

[14]:
import clifford as cf
from clifford import g3c
import numba

# Get the layout in our local namespace etc etc
layout = g3c.layout
locals().update(g3c.blades)

ep, en, up, down, homo, E0, ninf, no = (g3c.stuff["ep"], g3c.stuff["en"],
                                        g3c.stuff["up"], g3c.stuff["down"], g3c.stuff["homo"],
                                        g3c.stuff["E0"], g3c.stuff["einf"], -g3c.stuff["eo"])
# Define a few useful terms
E = ninf^(no)
I5 = e12345
I3 = e123
[15]:
def cf_apply_rotor(R,mv):
    return R*mv*~R
[16]:
line_one = (up(0)^up(e1)^ninf).normal()
line_two = (up(0)^up(e2)^ninf).normal()
R = 1 + line_two*line_one
[17]:
%timeit -n 1 -r 1 cf_apply_rotor(R, line_one)
%timeit -n 100_000 -r 10 cf_apply_rotor(R, line_one)
113 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
23.3 µs ± 5.54 µs per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

The default clifford version of the conjugation formula takes 23.3 µs ± 5.54 µs per loop. In order to improve performance, we need to reach into the internals of clifford and explicitly call the relevant functions. This means that to speed up the code, one has to be an advanced user.

[18]:
def cf_apply_rotor_faster(R,mv):
    return layout.MultiVector(layout.gmt_func(R.value,layout.gmt_func(mv.value,layout.adjoint_func(R.value))) )
[19]:
%timeit -n 1 -r 1 cf_apply_rotor_faster(R, line_one)
%timeit -n 100_000 -r 10 cf_apply_rotor_faster(R, line_one)
43.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
6.25 µs ± 823 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

The result is very good: almost a four-fold speed-up, to 6.25 µs ± 823 ns per loop.

We can improve this further by adding the numba.njit decorator to our function:

[20]:
gmt_func = layout.gmt_func
adjoint_func = layout.adjoint_func

@numba.njit
def cf_apply_rotor_val_numba(R_val,mv_val):
    return gmt_func(R_val, gmt_func(mv_val, adjoint_func(R_val)))

def cf_apply_rotor_wrapped_numba(R,mv):
    return cf.MultiVector(layout, cf_apply_rotor_val_numba(R.value, mv.value))
[21]:
%timeit -n 1 -r 1 cf_apply_rotor_wrapped_numba(R, line_one)
%timeit -n 100_000 -r 10 cf_apply_rotor_wrapped_numba(R, line_one)
90.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4.12 µs ± 326 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

We are now down to 4.12 µs ± 326 ns per loop, more than a five-fold speed-up over the default clifford version. These times are comparable to what we got from the @alg.register decorator (8.89 µs ± 975 ns, or 4.17 µs ± 330 ns with symbolic=True), which looked like this:

@alg.register
def apply_rotor(R, X):
    return R * X * ~R

So with a much cleaner API, we can achieve similar results in kingdon.

King on a Leash

In order to have a fair comparison, we will end with a computation in kingdon on a full multivector. This means kingdon will not be able to cheat by doing almost no computations in the first place, and will really have to multiply full multivectors, just like clifford.

[22]:
from kingdon import Algebra, MultiVector
import numpy as np
from numba import njit

alg = Algebra(4, 1, wrapper=njit)
locals().update(alg.blades)
[23]:
l1vals = np.zeros(len(alg))
l2vals = np.zeros(len(alg))
l1 = alg.multivector(l1vals) + e145
l2 = alg.multivector(l2vals) + e245
R = 1 + l2 * l1
[24]:
def apply_rotor(R, X):
    return R * X * ~R
[25]:
%timeit -n 1 -r 1 apply_rotor(R, l1)
%timeit -n 100_000 -r 10 apply_rotor(R, l1)
254 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
20.3 µs ± 507 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

On a full multivector, kingdon takes 20.3 µs ± 507 ns per loop. This is significantly more than the 13.4 µs ± 5.47 µs for the sparse scenario, and comparable to the 23.3 µs ± 5.54 µs of clifford's default implementation. However, we can fix this in one simple move:

[26]:
apply_rotor_compiled = alg.register(apply_rotor)
[27]:
%timeit -n 1 -r 1 apply_rotor_compiled(R, l1)
%timeit -n 100_000 -r 10 apply_rotor_compiled(R, l1)
181 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.28 µs ± 100 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

With one decorator, we are down to 5.28 µs ± 100 ns per loop, which is close to the 4.12 µs ± 326 ns that clifford achieves with much more manual labour on the part of the programmer. (We will not include the symbolic=True version here, because for a full multivector it would be far too expensive.)

Most of the cost in both clifford and kingdon is in the glue around the computation. Removing this glue is responsible for part of the speed-up in the optimized clifford code, and we haven't tapped into it yet in the kingdon example. So if you are willing to compromise on readability, kingdon can still do a little bit better by removing the glue:

[28]:
def apply_rotor_numba(R, X):
    keys_out, func = apply_rotor_compiled[R.keys(), X.keys()]
    numba_func = alg.numspace[func.__name__]  # the numba-compiled version lives in alg.numspace, the namespace of numerical functions
    return MultiVector.fromkeysvalues(
        alg,
        keys=keys_out,
        values=numba_func(R.values(), X.values()),
    )

The decorated function apply_rotor_compiled can be indexed like a dictionary to retrieve the numerical function and the corresponding output keys keys_out for the input keys of R and X. Moreover, the alternative constructor MultiVector.fromkeysvalues bypasses the sanity checking of the default constructor in order to reach optimal performance. We then simply have to wrap the resulting values back into a MultiVector ourselves.
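
As a quick check (illustrative; both paths share the same generated function, so the component order matches), the low-level version agrees with the decorated one:

assert np.allclose(
    apply_rotor_numba(R, l1).values(),
    apply_rotor_compiled(R, l1).values(),
)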

[29]:
%timeit -n 1 -r 1 apply_rotor_numba(R, l1)
27.5 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[30]:
%timeit -n 100_000 -r 10 apply_rotor_numba(R, l1)
3.05 µs ± 61.2 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)

We now see that the kingdon function is also faster than the clifford version in the full multivector case: 3.05 µs ± 61.2 ns for kingdon vs. 4.12 µs ± 326 ns for clifford. Putting the kingdon decorator on a diet to reach this limit is an ongoing effort.

Conclusion

For full multivectors, clifford is still a bit faster than kingdon’s Algebra.register decorator, but at the cost of readability. And as we have seen, kingdon can match this performance (and lack of readability) if needed. However, in reality many multivectors are not full, but contain only a few grades, or even just a couple of coefficients. In these scenarios kingdon really comes into its own, and the very readable Algebra.register decorator offers a way to achieve high performance for very little effort on the user’s part.
