Writing high(ish) performance code with Kingdon and Numba¶
In this document we will demonstrate how easy it is to compose the fundamental (binary and unary) operators that ``kingdon`` supplies out of the box, while maintaining good performance.
The example used is inspired by that of `clifford
<https://clifford.readthedocs.io/en/latest/tutorials/PerformanceCliffordTutorial.html>`__ to allow for a comparison of both syntax and performance. To make the comparison fair, we run both versions within this document, so that both examples are timed on the same machine.
First we will run ``kingdon`` with ``numba`` enabled while leveraging the sparseness in the example, to get the best speeds possible. Then we will run the ``clifford`` equivalent. Lastly, we will run the ``kingdon`` example again but with a full multivector, to show how the two libraries compare in that case.
Kingdon unchained¶
First, we run ``kingdon`` on the example, allowing it to use all of its tricks. Let's initialize a 3DCGA algebra and create some simple lines.
[1]:
from kingdon import Algebra, MultiVector
import numpy as np
from numba import njit
alg = Algebra(4, 1, wrapper=njit)
locals().update(alg.blades)
We create two sparse lines, which allows ``kingdon`` to exploit the sparseness of the input to gain additional performance benefits.
[2]:
l1 = e145
l2 = e245
R = 1 + l2 * l1
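To see why sparseness matters, here is a hypothetical mini-implementation (not ``kingdon``'s actual code) of a sparse geometric product, using 2D GA for brevity: multivectors are stored as ``{basis_blade: coefficient}`` dicts, so the work scales with the number of nonzero terms rather than with the full dimension of the algebra.

```python
import itertools

# Cayley table for 2D GA with e1*e1 = e2*e2 = +1; each entry maps a pair of
# basis blades to (sign, resulting blade).
TABLE = {
    ("1", "1"): (1, "1"),     ("1", "e1"): (1, "e1"),     ("1", "e2"): (1, "e2"),    ("1", "e12"): (1, "e12"),
    ("e1", "1"): (1, "e1"),   ("e1", "e1"): (1, "1"),     ("e1", "e2"): (1, "e12"),  ("e1", "e12"): (1, "e2"),
    ("e2", "1"): (1, "e2"),   ("e2", "e1"): (-1, "e12"),  ("e2", "e2"): (1, "1"),    ("e2", "e12"): (-1, "e1"),
    ("e12", "1"): (1, "e12"), ("e12", "e1"): (-1, "e2"),  ("e12", "e2"): (1, "e1"),  ("e12", "e12"): (-1, "1"),
}

def sparse_gp(a, b):
    """Geometric product of two sparse multivectors: only present keys are visited."""
    out = {}
    for (ka, va), (kb, vb) in itertools.product(a.items(), b.items()):
        sign, key = TABLE[ka, kb]
        out[key] = out.get(key, 0) + sign * va * vb
    return {k: v for k, v in out.items() if v != 0}

# (1 + e1 e2) times e1 gives e1 - e2: only 2 x 1 = 2 table lookups,
# instead of 4 x 4 = 16 for dense multivectors.
R = {"1": 1, "e12": 1}
print(sparse_gp(R, {"e1": 1}))  # {'e1': 1, 'e2': -1}
```

In a 3DCGA algebra the full basis has 32 blades, so a dense product costs 32 × 32 term combinations; a single line or rotor touches only a handful of them, which is exactly the advantage this section exploits.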
The function we will optimize for this example is the conjugation (or sandwich product) used to transform a multivector \(X\) by a rotor \(R\):
[3]:
def apply_rotor(R, X):
return R * X * ~R
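To build some intuition for what this conjugation does, here is a self-contained quaternion analogue (the quaternions being isomorphic to the even subalgebra of 3D GA): sandwiching a vector between a unit quaternion and its conjugate rotates it. All names below are illustrative and not part of ``kingdon``.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z] arrays."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def qconj(q):
    """Quaternion conjugate, the analogue of reversion ~R."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def apply_rotor_quat(R, v):
    """Sandwich product R X R~ on a pure-vector quaternion X."""
    X = np.array([0.0, *v])
    return qmul(qmul(R, X), qconj(R))[1:]

# Rotor for a 90-degree rotation about the z-axis: cos(t/2) + sin(t/2) e_z
theta = np.pi / 2
R = np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
print(apply_rotor_quat(R, [1.0, 0.0, 0.0]))  # ≈ [0, 1, 0]
```

The same R X R~ pattern is what ``apply_rotor`` computes, except that in CGA the rotor can encode not just rotations but any conformal transformation.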
The first execution will be the most expensive, because it triggers the code generation for the geometric product and reversion operators. We therefore time it separately:
[4]:
%timeit -n 1 -r 1 apply_rotor(R, l1)
153 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
We can now time the actual execution time of the ``apply_rotor`` function.
[5]:
%timeit -n 100_000 -r 10 apply_rotor(R, l1)
13.4 µs ± 5.47 µs per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
We now have a benchmark: the function ``apply_rotor`` takes about 7.69 µs ± 59.2 ns on the author's laptop. To do better is easy: we simply apply the ``Algebra.register`` decorator to ``apply_rotor``:
@alg.register
def apply_rotor(R, X):
return R * X * ~R
This decorator allows ``kingdon`` to work its magic on the decorated function. While the decorator syntax is pleasant to look at, it overwrites the original function, so in this document we are better off using
apply_rotor_compiled = alg.register(apply_rotor)
[6]:
apply_rotor_compiled = alg.register(apply_rotor)
[7]:
%timeit -n 1 -r 1 apply_rotor_compiled(R, l1)
104 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[8]:
%timeit -n 100_000 -r 10 apply_rotor_compiled(R, l1)
8.89 µs ± 975 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
After decoration the code runs in about 3.87 µs ± 529 ns on the author's laptop, roughly a two-fold improvement in performance. Not bad, for so little effort. The reason for the speed-up is that ordinarily ``kingdon`` has to traverse down to the actual numerical implementation from the Python level and back up for every operator. In this example that happens three times: twice for the products ``*``, and once for the reversion ``~``.
By contrast, the ``Algebra.register`` decorator traverses down to the numerical level only once, does all the computations on that level, and then passes the result back up the chain.
Having had our first taste of better performance, we of course want to know if we can do even better. In this example we expect a better result if we first do some symbolic simplification, since both \(R\) and \(\widetilde{R}\) appear in the expression. We can ask the decorator to perform symbolic simplification by instead using
@alg.register(symbolic=True)
def apply_rotor(R, X):
return R * X * ~R
(Note: for more complicated expressions this might be a bad idea, since symbolic simplification can get very expensive.)
[9]:
apply_rotor_simplified = alg.register(apply_rotor, symbolic=True)
[10]:
%timeit -n 1 -r 1 apply_rotor_simplified(R, l1)
710 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[11]:
%timeit -n 100_000 -r 10 apply_rotor_simplified(R, l1)
4.17 µs ± 330 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
The symbolic optimization is more expensive to perform, even in this sparse example. However, we do obtain a new speed record: 1.91 µs ± 376 ns, which is ~4 times faster than the original.
Of course conjugation is a common operation, so ``kingdon`` ships with a precompiled version. The operator for conjugation is ``>>``:
[12]:
%timeit -n 1 -r 1 R >> l1
95.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[13]:
%timeit -n 100_000 -r 10 R >> l1
8.69 µs ± 812 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
This has comparable numerical performance and also has a shorter generation time, which is important for the user experience.
We have now reached the limit of what can be achieved with the ``@alg.register`` decorator: 1.91 µs ± 376 ns, a four-fold speed-up compared to the undecorated version. Beware, however: until we can find a faster symbolic engine than ``sympy``, it is usually a bad idea to use the ``symbolic`` keyword, so it is recommended to use it only selectively.
Clifford¶
Now let us repeat these exercises, but using `clifford
<https://clifford.readthedocs.io/en/latest/tutorials/PerformanceCliffordTutorial.html>`__.
[14]:
import clifford as cf
from clifford import g3c
import numba
# Get the layout in our local namespace etc etc
layout = g3c.layout
locals().update(g3c.blades)
ep, en, up, down, homo, E0, ninf, no = (g3c.stuff["ep"], g3c.stuff["en"],
g3c.stuff["up"], g3c.stuff["down"], g3c.stuff["homo"],
g3c.stuff["E0"], g3c.stuff["einf"], -g3c.stuff["eo"])
# Define a few useful terms
E = ninf^(no)
I5 = e12345
I3 = e123
[15]:
def cf_apply_rotor(R,mv):
return R*mv*~R
[16]:
line_one = (up(0)^up(e1)^ninf).normal()
line_two = (up(0)^up(e2)^ninf).normal()
R = 1 + line_two*line_one
[17]:
%timeit -n 1 -r 1 cf_apply_rotor(R, line_one)
%timeit -n 100_000 -r 10 cf_apply_rotor(R, line_one)
113 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
23.3 µs ± 5.54 µs per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
The default ``clifford`` version of the conjugation formula takes 9.14 µs ± 55.6 ns on the author's laptop. In order to improve performance, we need to reach into the internals of ``clifford`` and explicitly call the relevant functions. This means that to speed up the code, one has to be an advanced user.
[18]:
def cf_apply_rotor_faster(R,mv):
return layout.MultiVector(layout.gmt_func(R.value,layout.gmt_func(mv.value,layout.adjoint_func(R.value))) )
[19]:
%timeit -n 1 -r 1 cf_apply_rotor_faster(R, line_one)
%timeit -n 100_000 -r 10 cf_apply_rotor_faster(R, line_one)
43.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
6.25 µs ± 823 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
The result is very good: about a two-fold speed-up, to 4.95 µs ± 698 ns on the author's laptop.
We can improve this further by adding the ``numba.njit`` decorator to our function:
[20]:
gmt_func = layout.gmt_func
adjoint_func = layout.adjoint_func
@numba.njit
def cf_apply_rotor_val_numba(R_val,mv_val):
return gmt_func(R_val, gmt_func(mv_val, adjoint_func(R_val)))
def cf_apply_rotor_wrapped_numba(R,mv):
return cf.MultiVector(layout, cf_apply_rotor_val_numba(R.value, mv.value))
[21]:
%timeit -n 1 -r 1 cf_apply_rotor_wrapped_numba(R, line_one)
%timeit -n 100_000 -r 10 cf_apply_rotor_wrapped_numba(R, line_one)
90.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
4.12 µs ± 326 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
We are now down to 3.52 µs ± 403 ns per loop on the author's laptop, a ~2.6-fold speed-up. These times are very comparable to the 3.87 µs ± 529 ns we got from the ``@alg.register`` decorator, which looked like this:
@alg.register
def apply_rotor(R, X):
return R * X * ~R
So with a much cleaner API, we can achieve similar results in ``kingdon``.
King on a Leash¶
In order to have a fair comparison, we will end with a computation in ``kingdon`` on a full multivector. This means ``kingdon`` will not be able to cheat by doing almost no computations in the first place, and will really have to multiply full multivectors, just like ``clifford``.
[22]:
from kingdon import Algebra, MultiVector
import numpy as np
from numba import njit
alg = Algebra(4, 1, wrapper=njit)
locals().update(alg.blades)
[23]:
l1vals = np.zeros(len(alg))
l2vals = np.zeros(len(alg))
l1 = alg.multivector(l1vals) + e145
l2 = alg.multivector(l2vals) + e245
R = 1 + l2 * l1
[24]:
def apply_rotor(R, X):
return R * X * ~R
[25]:
%timeit -n 1 -r 1 apply_rotor(R, l1)
%timeit -n 100_000 -r 10 apply_rotor(R, l1)
254 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
20.3 µs ± 507 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
On a full multivector, ``kingdon`` takes 18 µs ± 696 ns per loop on the author's laptop. This is significantly more than the 7.69 µs ± 59.2 ns of the sparse scenario, and also more than the 9.14 µs ± 55.6 ns delivered by ``clifford``. However, we can fix this in one simple move:
[26]:
apply_rotor_compiled = alg.register(apply_rotor)
[27]:
%timeit -n 1 -r 1 apply_rotor_compiled(R, l1)
%timeit -n 100_000 -r 10 apply_rotor_compiled(R, l1)
181 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
5.28 µs ± 100 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
With one decorator, we are down to 5.59 µs ± 724 ns per loop, which is close to the 3.52 µs ± 403 ns that ``clifford`` achieves with much more manual labour on the part of the programmer. (We will not include the ``symbolic=True`` version here, because for a full multivector it would be way too expensive.)
Much of the cost in both ``clifford`` and ``kingdon`` lies in the glue around the computation. Removing this glue is responsible for part of the speed-up in the ``clifford`` code, and we haven't tapped into it yet in the ``kingdon`` example. So if you are willing to compromise readability, ``kingdon`` can still do a little bit better by removing the glue:
[28]:
def apply_rotor_numba(R, X):
keys_out, func = apply_rotor_compiled[R.keys(), X.keys()]
numba_func = alg.numspace[func.__name__] # The numba version is available from the namespace of numerical functions; numspace.
return MultiVector.fromkeysvalues(
alg,
keys=keys_out,
values=numba_func(R.values(), X.values()),
)
The decorated function ``apply_rotor_compiled`` can be accessed like a dictionary to retrieve the numerical function and the resulting ``keys_out`` for the input keys of ``R`` and ``l1``. Moreover, the alternative constructor ``MultiVector.fromkeysvalues`` bypasses the sanity checking of the default constructor so as to reach optimal performance. We then simply wrap the resulting values back into a multivector.
[29]:
%timeit -n 1 -r 1 apply_rotor_numba(R, l1)
27.5 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
[30]:
%timeit -n 100_000 -r 10 apply_rotor_numba(R, l1)
3.05 µs ± 61.2 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)
We now see that the ``kingdon`` function is also faster than the ``clifford`` version in the full multivector case, with 2.91 µs ± 16.3 ns for ``kingdon`` vs. 3.52 µs ± 403 ns for ``clifford``. Putting the ``kingdon`` decorator on a diet to reach this limit is an ongoing effort.
Conclusion¶
For full multivectors, ``clifford`` is still a bit faster than ``kingdon``'s ``Algebra.register`` decorator, but at the cost of readability. And as we have seen, ``kingdon`` can match this performance (and lack of readability) if needed. However, in reality many multivectors are not full, but contain only a select number of grades, or even just a couple of coefficients. In these scenarios ``kingdon`` really comes into its own, and the very readable ``Algebra.register`` decorator offers a way to achieve high performance for very little effort on the user's part.