{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3ad91839",
   "metadata": {},
   "source": [
    "# Writing high(ish) performance code with Kingdon and Numba\n",
    "\n",
    "In this document we will demonstrate how easy it is to compose together the fundamental (binary and unary) operators that kingdon supplies out of the box, while maintaining good performance.\n",
    "\n",
    "The example used is inspired by that of [`clifford`](https://clifford.readthedocs.io/en/latest/tutorials/PerformanceCliffordTutorial.html) to allow for comparison of both syntax and performance. In order to make the performance comparison fair, we will run both versions within this document, such that both examples are run on the same machine. \n",
    "\n",
    "First we will run `kingdon` with `numba` enabled and while leveraging the sparseness in the example, to get the best speeds possible. Then we will run the `clifford` equivalent. Lastly, we will run the `kingdon` example again but with a full multivector, to show how the two libraries compare in this case."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa6be68a",
   "metadata": {},
   "source": [
    "### Kingdon unchained\n",
    "\n",
    "First, we run `kingdon` on the example, allowing it to use all of its tricks.\n",
    "\n",
    "Let's initiate a 3DCGA algebra and create some simple lines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bb371f89",
   "metadata": {},
   "outputs": [],
   "source": [
    "from kingdon import Algebra, MultiVector\n",
    "import numpy as np\n",
    "from numba import njit\n",
    "\n",
    "alg = Algebra(4, 1, wrapper=njit)\n",
    "locals().update(alg.blades)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1099bfb8",
   "metadata": {},
   "source": [
    "In order to have a fair comparison, we make a full multivector, which prevents `kingdon` from using the sparseness of the input to gain additional performance benefits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2d7ba575",
   "metadata": {},
   "outputs": [],
   "source": [
    "l1 = e145\n",
    "l2 = e245\n",
    "R = 1 + l2 * l1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "005a28b9",
   "metadata": {},
   "source": [
    "The function we will optimize for this example is the conjugation (or sandwich product) used to transform a multivector $X$ by a rotor $R$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "94ce9dfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "def apply_rotor(R, X):\n",
    "    return R * X * ~R"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61930d91",
   "metadata": {},
   "source": [
    "The first execution will be the most expensive, because it triggers the code generation for the geometric product and reversion operators to be performed. We therefore time it seperatelly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bd5b8144",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "153 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 apply_rotor(R, l1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6eb96fb0",
   "metadata": {},
   "source": [
    "We can now time the actual excecution time of the `apply_rotor` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "d1b60b9a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "13.4 µs ± 5.47 µs per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 100_000 -r 10 apply_rotor(R, l1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdc22c38",
   "metadata": {},
   "source": [
    "We now have a benchmark: to perform the function `apply_rotor` takes about 7.69 µs ± 59.2 ns on the authors laptop. To do better is easy: we simply apply the `Algebra.register` decorator to `apply_rotor`:\n",
    "```python\n",
    "@alg.register\n",
    "def apply_rotor(R, X):\n",
    "    return R * X * ~R\n",
    "```\n",
    "This decorator allows `kingdon` to work its magic on the decorated function.\n",
    "While the decorator syntax is pleasant to look it, it overwrites the original function, so in this document we are better off using\n",
    "```python\n",
    "apply_rotor_compiled = alg.register(apply_rotor)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8ac69632",
   "metadata": {},
   "outputs": [],
   "source": [
    "apply_rotor_compiled = alg.register(apply_rotor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "50b66c3b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "104 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 apply_rotor_compiled(R, l1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "1d9138d2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8.89 µs ± 975 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 100_000 -r 10 apply_rotor_compiled(R, l1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99679152",
   "metadata": {},
   "source": [
    "After decoration the code runs in about 3.87 µs ± 529 ns on the authors laptop, which is roughly a two-fold improvement in performance. Not bad, for so little effort. The reason for the speed-up is that ordinarilly kingdon has to traverse down to the actual numerical implementation from the python level and then back up for every operator. So in this example this is done three times: two times for the products `*`, and once for the reversion `~`. \n",
    "\n",
    "By contrast, the `Algebra.register` decorator only traverses down to the numerical level once, does all the computations on that level, and then passes the results back up the chain.\n",
    "\n",
    "Having had our first taste of better performance, we of course want to know if we can do even better. In this example we would expect that we can get a better result if we first do some symbolical symplification, since both $R$ and $\\widetilde{R}$ appear in the expression. We can ask the decorator to do symbolic symplification by instead using\n",
    "\n",
    "```python\n",
    "@alg.register(symbolic=True)\n",
    "def apply_rotor(R, X):\n",
    "    return R * X * ~R\n",
    "```\n",
    "\n",
    "(Note: for more complicated expressions this might be a bad idea, since symbolic symplification can get very expensive.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "51543cb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "apply_rotor_simplified = alg.register(apply_rotor, symbolic=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d5b19ce9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "710 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 apply_rotor_simplified(R, l1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c17d3cdc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.17 µs ± 330 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 100_000 -r 10 apply_rotor_simplified(R, l1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f059763",
   "metadata": {},
   "source": [
    "The symbolical optimization is more expensive to perform, even in this sparse example. However, we do obtain a new speed record: 1.91 µs ± 376 ns, which is ~4 times faster than the original.\n",
    "\n",
    "Of course conjugation is a common operation, so `kingdon` does ship with a precompiled version. The operator for conjugation is given by `>>`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "60db1538",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "95.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 R >> l1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "af32af82",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8.69 µs ± 812 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 100_000 -r 10 R >> l1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "87f67373",
   "metadata": {},
   "source": [
    "This has comparable numerical performance and also has a shorter generation time, which is important for the user experience.\n",
    "\n",
    "We have now achieved the limit of what can be achieved with the `@alg.register` decorator: 1.91 µs ± 376 ns, a four-fold speed-up compared to the undecorated version. Beware however, that until we can find a faster symbolic engine than `sympy`, it is usually a bad idea to use the `symbolic` keyword. It is therefore recommended to only use the `symbolic` keyword selectly."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "987b8564",
   "metadata": {},
   "source": [
    "\n",
    "### Clifford \n",
    "\n",
    "Now let us repeat these exercises, but using [`clifford`](https://clifford.readthedocs.io/en/latest/tutorials/PerformanceCliffordTutorial.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "0d6280fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import clifford as cf\n",
    "from clifford import g3c\n",
    "import numba\n",
    "\n",
    "# Get the layout in our local namespace etc etc\n",
    "layout = g3c.layout\n",
    "locals().update(g3c.blades)\n",
    "\n",
    "ep, en, up, down, homo, E0, ninf, no = (g3c.stuff[\"ep\"], g3c.stuff[\"en\"],\n",
    "                                        g3c.stuff[\"up\"], g3c.stuff[\"down\"], g3c.stuff[\"homo\"],\n",
    "                                        g3c.stuff[\"E0\"], g3c.stuff[\"einf\"], -g3c.stuff[\"eo\"])\n",
    "# Define a few useful terms\n",
    "E = ninf^(no)\n",
    "I5 = e12345\n",
    "I3 = e123"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "7ef28eb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def cf_apply_rotor(R,mv):\n",
    "    return R*mv*~R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "a7e3016f",
   "metadata": {},
   "outputs": [],
   "source": [
    "line_one = (up(0)^up(e1)^ninf).normal()\n",
    "line_two = (up(0)^up(e2)^ninf).normal()\n",
    "R = 1 + line_two*line_one"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "f7cf17b7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "113 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
      "23.3 µs ± 5.54 µs per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 cf_apply_rotor(R, line_one)\n",
    "%timeit -n 100_000 -r 10 cf_apply_rotor(R, line_one)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e12d2953",
   "metadata": {},
   "source": [
    "The default `clifford` version of the conjugation formula takes 9.14 µs ± 55.6 ns on the authors laptop. In order to improve performance, we need to reach into the internals of `clifford`, and explicitelly call the relevant functions. This means that to speed-up the code, one has to be an advanced user."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "b12d4afc",
   "metadata": {},
   "outputs": [],
   "source": [
    "def cf_apply_rotor_faster(R,mv):\n",
    "    return layout.MultiVector(layout.gmt_func(R.value,layout.gmt_func(mv.value,layout.adjoint_func(R.value))) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "00e4372f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "43.6 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
      "6.25 µs ± 823 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 cf_apply_rotor_faster(R, line_one)\n",
    "%timeit -n 100_000 -r 10 cf_apply_rotor_faster(R, line_one)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ce05c39",
   "metadata": {},
   "source": [
    "The result is very good: about a two-fold speed-up to 4.95 µs ± 698 ns on the authors laptop.\n",
    "\n",
    "We can improve this further by adding the `numba.njit` decorator to our function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "97706e35",
   "metadata": {},
   "outputs": [],
   "source": [
    "gmt_func = layout.gmt_func\n",
    "adjoint_func = layout.adjoint_func\n",
    "\n",
    "@numba.njit\n",
    "def cf_apply_rotor_val_numba(R_val,mv_val):\n",
    "    return gmt_func(R_val, gmt_func(mv_val, adjoint_func(R_val)))\n",
    "\n",
    "def cf_apply_rotor_wrapped_numba(R,mv):\n",
    "    return cf.MultiVector(layout, cf_apply_rotor_val_numba(R.value, mv.value))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "b88d63cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "90.2 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
      "4.12 µs ± 326 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 cf_apply_rotor_wrapped_numba(R, line_one)\n",
    "%timeit -n 100_000 -r 10 cf_apply_rotor_wrapped_numba(R, line_one)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c52cb69c",
   "metadata": {},
   "source": [
    "We are now down to 3.52 µs ± 403 ns per loop on the authors laptop, a ~2.6 times increase. These times however are very comparible to the 3.87 µs ± 529 ns we got from the `@alg.register` decorator, which looked like this:\n",
    "```python\n",
    "@alg.register\n",
    "def apply_rotor(R, X):\n",
    "    return R * X * ~R\n",
    "```\n",
    "So with a much cleaner API, we can achieve similar results in `kingdon`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99807452",
   "metadata": {},
   "source": [
    "### King on a Leash\n",
    "\n",
    "In order to have a fair comparison, we will end with a computation in kingdon on a full multivector. This means `kingdon` will not be able to cheat by doing almost no computations in the first place, and will really have to multiply full multivectors, just like `clifford`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "52de901a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from kingdon import Algebra, MultiVector\n",
    "import numpy as np\n",
    "\n",
    "alg = Algebra(4, 1, wrapper=njit)\n",
    "locals().update(alg.blades)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "28f7e40f",
   "metadata": {},
   "outputs": [],
   "source": [
    "l1vals = np.zeros(len(alg))\n",
    "l2vals = np.zeros(len(alg))\n",
    "l1 = alg.multivector(l1vals) + e145\n",
    "l2 = alg.multivector(l2vals) + e245\n",
    "R = 1 + l2 * l1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "1944bd46",
   "metadata": {},
   "outputs": [],
   "source": [
    "def apply_rotor(R, X):\n",
    "    return R * X * ~R"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "0c7daddf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "254 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
      "20.3 µs ± 507 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 apply_rotor(R, l1)\n",
    "%timeit -n 100_000 -r 10 apply_rotor(R, l1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c619e256",
   "metadata": {},
   "source": [
    "On a full multivector, `kingdon` takes 18 µs ± 696 ns per loop on the authors laptop. This is significantly more than the 7.69 µs ± 59.2 ns for the sparse scenario, and also more than the 9.14 µs ± 55.6 ns delivered by `clifford`. However, we can fix this in one simple move:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "a204a397",
   "metadata": {},
   "outputs": [],
   "source": [
    "apply_rotor_compiled = alg.register(apply_rotor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "995c36ad",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "181 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n",
      "5.28 µs ± 100 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 apply_rotor_compiled(R, l1)\n",
    "%timeit -n 100_000 -r 10 apply_rotor_compiled(R, l1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d381fc0",
   "metadata": {},
   "source": [
    "With one decorator, we are down to 5.59 µs ± 724 ns per loop, which is close to the 3.52 µs ± 403 ns that `clifford` achieves with much more manual labour on the part of the programmer. (We will not include the `symbolic=True` version here, because for a full multivector this will be way to expensive.)\n",
    "\n",
    "Most of the costs that both `clifford` and `kingdon` make is in the glue around the computation. \n",
    "This is responsable for part of the speed-up in the `clifford` code, and we haven't tapped into this yet in the `kingdon` example so far.\n",
    "So if you are willing to compromise readability, `kingdon` can still do a little bit better by removing the glue:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "9566a7ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "def apply_rotor_numba(R, X):\n",
    "    keys_out, func = apply_rotor_compiled[R.keys(), X.keys()]\n",
    "    numba_func = alg.numspace[func.__name__]  # The numba version is available from the namespace of numerical functions; numspace.\n",
    "    return MultiVector.fromkeysvalues(\n",
    "        alg, \n",
    "        keys=keys_out, \n",
    "        values=numba_func(R.values(), X.values()),\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fde3b74f",
   "metadata": {},
   "source": [
    "The decorated function `apply_rotor_compiled` can be accessed like a dictionary to retrieve the numerical function and the resulting `keys_out` for the input `keys` of `R` and `l1`. Moreover, the alternative constructor `MultiVector.fromkeysvalues` bypasses the sanity checking of the default constructor so as to reach optimal performance. We then simply have to make a function to return the result back into"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "b65c7888",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "27.5 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 1 -r 1 apply_rotor_numba(R, l1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "0dd191be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.05 µs ± 61.2 ns per loop (mean ± std. dev. of 10 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit -n 100_000 -r 10 apply_rotor_numba(R, l1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b8e6546",
   "metadata": {},
   "source": [
    "We now see that the `kingdon` function is also faster than the `clifford` version in the full multivector version, with 2.91 µs ± 16.3 ns for `kingdon` vs. 3.52 µs ± 403 ns for `clifford`. Putting the `kingdon` decorator on a diet to reach this limit is an ongoing effort."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "faca82c7",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "For full multivectors, clifford is still a bit faster than `kingdon`'s `Algebra.register` decorator, but at the cost of readability. And as we have seen, `kingdon` can match this performance and lack of readability if needed. However, in reality many multivectors are not full, but rather of only a select number of grades, or even just a couple of coefficients. In these scenarios, `kingdon` really comes into its own, and the very readable `Algebra.register` decorator offers a way to achieve high performance for very little effort on the users part."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b2b28a0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}