<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
 <channel xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:link href="http://blog.pkh.me/rss.xml" rel="self" type="application/rss+xml" />
  <title>A small freedom area RSS</title>
  <description>Default feed for blog.pkh.me</description>
  <link>http://blog.pkh.me/</link>
<item>
 <guid>http://blog.pkh.me/p/48-a-series-of-tricks-and-techniques-i-learned-doing-tiny-glsl-demos.html</guid>
 <link>http://blog.pkh.me/p/48-a-series-of-tricks-and-techniques-i-learned-doing-tiny-glsl-demos.html</link>
 <title>A series of tricks and techniques I learned doing tiny GLSL demos</title>
 <pubDate>Sun, 07 Dec 2025 17:48:26 -0000</pubDate>
 <description>&lt;p&gt;In the past two months or so, I spent some time making tiny GLSL demos. I wrote
an article about the first one, &lt;a href=&quot;http://blog.pkh.me/p/45-code-golfing-a-tiny-demo-using-maths-and-a-pinch-of-insanity.html&quot;&gt;Red Alp&lt;/a&gt;. There, I went into detail about the
whole process, so I recommend checking it out first if you&#x27;re not familiar with
the field.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://blog.pkh.me/img/demo-tricks/thumb.jpg&quot; alt=&quot;preview of the 4 demos&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We will look at 4 demos: &lt;a href=&quot;#Moonlight&quot;&gt;Moonlight&lt;/a&gt;, &lt;a href=&quot;#Entrance3&quot;&gt;Entrance 3&lt;/a&gt;,
&lt;a href=&quot;#Archipelago&quot;&gt;Archipelago&lt;/a&gt;, and &lt;a href=&quot;#Cutie&quot;&gt;Cutie&lt;/a&gt;. But this time, for each
demo, we&#x27;re going to cover one or two things I learned from it. It won&#x27;t be a
deep dive into every aspect because it would be extremely redundant. Instead,
I&#x27;ll take you along a journey of learning experiences.&lt;/p&gt;
&lt;p&gt;&lt;a id=&quot;Moonlight&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Moonlight&lt;/h2&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;340&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demo-tricks/moonlight.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Moonlight demo in 460 characters&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Moonlight [460] by bµg
// License: CC BY-NC-SA 4.0
void main(){vec3 o,p,u=vec3((P+P-R)/R.y,1),Q;Q++;for(float d,a,m,i,t;i++&amp;lt;1e2;p=t&amp;lt;7.2?Q:vec3(2,1,0),d=abs(d)*.15+.1,o+=p/m+(t&amp;gt;9.?d=9.,Q:p/d),t+=min(m,d))for(p=normalize(u)*t,p.z-=5e1,m=max(length(p)-1e1,.01),p.z+=T,d=5.-length(p.xy*=mat2(cos(t*.2+vec4(0,33,11,0)))),a=.01;a&amp;lt;1.;a+=a)p.xz*=mat2(8,6,-6,8)*.1,d-=abs(dot(sin(p/a*.6-T*.3),p-p+a)),m+=abs(dot(sin(p/a/5.),p-p+a/5.));o/=4e2;O=vec4(tanh(mix(vec3(-35,-15,8),vec3(118,95,60),o-o*length(u.xy*.5))*.01),1);}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;See it on &lt;a href=&quot;https://b.pkh.me/2025-11-09-moonlight.htm&quot;&gt;its official page&lt;/a&gt;, or play with the code on &lt;a href=&quot;https://www.shadertoy.com/view/wX2Bzy&quot;&gt;its
Shadertoy port&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In Red Alp, I used volumetric raymarching to go through the clouds and fog, and
it took quite a significant part of the code to make the absorption and emission
convincing. But there is an alternative technique that is surprisingly simpler.&lt;/p&gt;
&lt;p&gt;In the raymarching loop, the color contribution at each iteration becomes &lt;span class=&quot;math inline&quot;&gt;1/d&lt;/span&gt;
or &lt;span class=&quot;math inline&quot;&gt;c/d&lt;/span&gt;, where &lt;span class=&quot;math inline&quot;&gt;d&lt;/span&gt; is the distance to the closest surface at the current ray position,
and &lt;span class=&quot;math inline&quot;&gt;c&lt;/span&gt; an optional color tint if you don&#x27;t want to work in grayscale.
Some variants exist, for example &lt;span class=&quot;math inline&quot;&gt;1/d^2&lt;/span&gt;, but we&#x27;ll focus on &lt;span class=&quot;math inline&quot;&gt;1/d&lt;/span&gt;.&lt;/p&gt;
&lt;h3&gt;1/d explanation&lt;/h3&gt;
&lt;p&gt;Let&#x27;s see how it looks in practice with a simple cube raymarch where we use this
peculiar contribution:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;340&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demo-tricks/onecube.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;One glowing and rotating cube&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;void main() {
    float d, t;
    vec3 o, p,
         u = normalize(vec3(P+P-R,R.y)); // screen to world coordinate

    for (int i = 0; i &amp;lt; 30; i++) {
        p = u * t; // ray position

        p.z -= 3.; // take a step back

        // Rodrigues rotation with a fixed angle of π/2
        // and a time-varying axis
        vec3 a = normalize(cos(T+vec3(0,2,4)));
        p = a*dot(a,p)-cross(a,p);

        // Signed distance function of a cube of size 1
        p = abs(p)-1.;
        d = length(max(p,0.)) + min(max(p.x,max(p.y,p.z)),0.);

        // Maxed out to not enter the solid
        d = max(d,.001);

        t += d; // stepping forward by that distance

        // Our mysterious contribution to the output
        o += 1./d;
    }

    // Arbitrary scale within visible range
    O = vec4(o/200., 1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;The signed distance function of the cube is from the &lt;a href=&quot;https://iquilezles.org/articles/distfunctions/&quot;&gt;classic Inigo Quilez
page&lt;/a&gt;. For the rotation you can refer to the articles by &lt;a href=&quot;https://mini.gmshaders.com/p/3d-rotation&quot;&gt;Xor&lt;/a&gt; or
&lt;a href=&quot;https://suricrasia.online/blog/shader-functions/&quot;&gt;Blackle&lt;/a&gt;. For a general understanding of
the code, see my previous article on &lt;a href=&quot;http://blog.pkh.me/p/45-code-golfing-a-tiny-demo-using-maths-and-a-pinch-of-insanity.html&quot;&gt;Red Alp&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The first time I saw it, I wondered whether it was a creative take, or if it was
backed by physical properties.&lt;/p&gt;
&lt;p&gt;Let&#x27;s simplify the problem with the following figure:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/demo-tricks/ray.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;A ray passing by a radiating object&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The glowing object sends photons that spread all around it. The further we go
from the object, the more spread out these photons are: their density basically follows the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Inverse-square_law&quot;&gt;inverse square law&lt;/a&gt; &lt;span class=&quot;math inline&quot;&gt;1/r^2&lt;/span&gt;,
where &lt;span class=&quot;math inline&quot;&gt;r&lt;/span&gt; is the distance to the object.&lt;/p&gt;
&lt;p&gt;Let&#x27;s say we send a ray and want to know how many photons are present along the
whole path. We have to &amp;quot;sum&amp;quot;, or rather integrate, all these photon density
measurements along the ray. Since we are doing a discrete sampling (the dots on the
figure), we need to interpolate the photon density &lt;em&gt;between&lt;/em&gt; each sampling
point as well.&lt;/p&gt;
&lt;p&gt;Given two arbitrary sampling points and their corresponding distances &lt;span class=&quot;math inline&quot;&gt;d_n&lt;/span&gt;
and &lt;span class=&quot;math inline&quot;&gt;d_{n+1}&lt;/span&gt;, any intermediate distance can be linearly interpolated with
&lt;span class=&quot;math inline&quot;&gt;r=\mathrm{mix}(d_n,d_{n+1},t)&lt;/span&gt; where &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; is within &lt;span class=&quot;math inline&quot;&gt;[0,1]&lt;/span&gt;. Applying the
inverse square law from before (&lt;span class=&quot;math inline&quot;&gt;1/r^2&lt;/span&gt;), the integrated photon density between
these 2 points can be expressed with this formula (in &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; range):&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
v = \Delta t \int \frac{1}{\mathrm{mix}(d_n,d_{n+1},t)^2} dt
&lt;/div&gt;
&lt;p&gt;&lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; being normalized, the &lt;span class=&quot;math inline&quot;&gt;\Delta t&lt;/span&gt; is here to cover the actual segment
length. With the help of Sympy we can do the integration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;&amp;gt;&amp;gt;&amp;gt; from sympy import *
&amp;gt;&amp;gt;&amp;gt; a, b, D, t = symbols(&#x27;a b D t&#x27;, real=True)
&amp;gt;&amp;gt;&amp;gt; mix = a*(1-t) + b*t
&amp;gt;&amp;gt;&amp;gt; D * integrate(1/mix**2, (t,0,1)).simplify()
 D
───
a⋅b
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So the result of this integration is:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
v = \frac{\Delta t}{d_{n}d_{n+1}}.
&lt;/div&gt;
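&lt;p&gt;As a sanity check, the closed form can also be verified numerically. A quick plain-Python sketch (hypothetical helper names, arbitrary positive distances):&lt;/p&gt;

```python
# Numerically verify that Δt · ∫₀¹ dt / mix(dn, dn1, t)² equals Δt / (dn·dn1).
def mix(a, b, t):
    return a * (1 - t) + b * t

def integrated_density(dn, dn1, delta_t, steps=100_000):
    # Midpoint-rule integration of the inverse-square density
    h = 1.0 / steps
    total = sum(1.0 / mix(dn, dn1, (i + 0.5) * h) ** 2 for i in range(steps))
    return delta_t * total * h

numeric = integrated_density(2.0, 5.0, 3.0)
closed_form = 3.0 / (2.0 * 5.0)
print(abs(numeric - closed_form))  # essentially zero: the two agree
```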
&lt;p&gt;Now the key is that in the loop, the stepping distance &lt;span class=&quot;math inline&quot;&gt;\Delta t&lt;/span&gt; is actually &lt;span class=&quot;math inline&quot;&gt;d_{n+1}&lt;/span&gt;, so
we end up with:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
v = \frac{\Delta t}{d_{n}\Delta t} = \frac{1}{d_n}
&lt;/div&gt;
&lt;p&gt;And we find back our mysterious &lt;span class=&quot;math inline&quot;&gt;1/d&lt;/span&gt;. It&#x27;s &amp;quot;physically correct&amp;quot;, assuming
vacuum space. Of course, reality is more complex, and we don&#x27;t even need to
stick to that formula, but it was nice figuring out that this simple fraction is
a fairly good model of reality.&lt;/p&gt;
&lt;h3&gt;Going through the object&lt;/h3&gt;
&lt;p&gt;In the cube example we didn&#x27;t go through the object, using &lt;code&gt;max(d, .001)&lt;/code&gt;. But
if we were to add some transparency, we could have used &lt;code&gt;d = A*abs(d)+B&lt;/code&gt;
instead, where &lt;code&gt;A&lt;/code&gt; could be interpreted as absorption and &lt;code&gt;B&lt;/code&gt; the pass-through,
or transparency.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;340&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demo-tricks/onecube-alpha.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;One glowing, transparent, and rotating cube; A=0.4, B=0.1&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;I first saw this formula mentioned in &lt;a href=&quot;https://mini.gmshaders.com/p/volumetric&quot;&gt;Xor&#x27;s article on volumetrics&lt;/a&gt;.
To understand it a bit better, here is my intuitive take: the &lt;code&gt;+B&lt;/code&gt; causes a
potential penetration into the solid at the next iteration, which wouldn&#x27;t
happen otherwise (or only very marginally). When inside the solid, the &lt;code&gt;abs(d)&lt;/code&gt;
causes the ray to continue further (by the amount of the distance to the closest
edge). Then the multiplication by &lt;code&gt;A&lt;/code&gt; makes sure we don&#x27;t penetrate too fast
into it; it&#x27;s the absorption, or &amp;quot;damping&amp;quot;.&lt;/p&gt;
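&lt;p&gt;To make the behavior concrete, here is a minimal single-ray sketch in Python: a 1-D stand-in for the GLSL loop, with an assumed glowing sphere of radius 1 sitting 3 units ahead of the camera:&lt;/p&gt;

```python
# March a single 1-D ray through a transparent glowing sphere, using
# d = A*abs(d) + B instead of the opaque max(d, .001).
A, B = 0.4, 0.1  # absorption and pass-through, as in the figure

def march(steps=60):
    t, brightness = 0.0, 0.0
    for _ in range(steps):
        d = abs(t - 3.0) - 1.0   # signed distance to the sphere along the ray
        d = A * abs(d) + B       # abs: keep moving inside; A damps, B penetrates
        brightness += 1.0 / d    # the 1/d glow contribution
        t += d                   # step forward
    return t, brightness

t, brightness = march()
print(t)  # well past the sphere's far side at t = 4: the ray went through
```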
&lt;p&gt;This is basically the technique I used in Moonlight to avoid the complex
absorption/emission code.&lt;/p&gt;
&lt;p&gt;&lt;a id=&quot;Entrance3&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Entrance 3&lt;/h2&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;340&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demo-tricks/entrance3.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Entrance 3 demo in 465 characters&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Entrance 3 [465] by bµg
// License: CC BY-NC-SA 4.0
#define V for(s++;d&amp;lt;l&amp;amp;&amp;amp;s&amp;gt;.001;q=abs(p+=v*s)-45.,b=abs(p+vec3(mod(T*5.,80.)-7.,45.+sin(T*10.)*.2,12))-vec3(1,7,1),d+=s=min(max(p.y,-min(max(abs(p.y+28.)-17.,abs(p.z+12.)-4.),max(q.x,max(q.y,q.z)))),max(b.x,max(b.y,b.z))))
void main(){float d,s,r=1.7,l=2e2;vec3 b,v=b-.58,q,p=mat3(r,0,-r,-1,2,-1,b+1.4)*vec3((P+P-R)/R.y*20.4,30);V;r=exp(-d*d/1e4)*.2;l=length(v=-vec3(90,30,10)-p);v/=l;d=1.;V;r+=50.*d/l/l;O=vec4(pow(mix(vec3(0,4,9),vec3(80,7,2),r*r)*.01,p-p+.45),1);}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;See it on &lt;a href=&quot;https://b.pkh.me/2025-11-18-entrance-3.htm&quot;&gt;its official page&lt;/a&gt;, or play with the code on &lt;a href=&quot;https://www.shadertoy.com/view/3ctcDn&quot;&gt;its
Shadertoy port&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This demo was probably one of the most challenging, but I&#x27;m pretty happy with its
atmospheric vibe. It&#x27;s kind of different from the usual demos at this size.&lt;/p&gt;
&lt;p&gt;I initially tried with some voxels, but I couldn&#x27;t make it work with the light
under 512 characters (the initialization code was too large, not the branchless
&lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_differential_analyzer_(graphics_algorithm)&quot;&gt;DDA&lt;/a&gt; stepping). It also had annoying limitations (typically, the animation was
unit-bound), so I fell back to classic raymarching.&lt;/p&gt;
&lt;p&gt;The first thing I did differently was to use an &lt;a href=&quot;https://iquilezles.org/articles/distfunctions2dlinf/&quot;&gt;L-∞ norm&lt;/a&gt; instead of a
Euclidean norm for the distance function: every solid is a cube, so it&#x27;s
appropriate to use simpler formulas.&lt;/p&gt;
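&lt;p&gt;For reference, under the L-∞ norm the distance to an axis-aligned box needs no square root at all. A Python sketch mirroring the &lt;code&gt;max(q.x,max(q.y,q.z))&lt;/code&gt; pattern from the demo code (hypothetical helper name):&lt;/p&gt;

```python
# L-infinity (Chebyshev) distance to an axis-aligned box of half-size h:
# componentwise |p| - h, then take the maximum — no dot product, no sqrt.
def box_dist_linf(p, h=1.0):
    q = [abs(c) - h for c in p]
    return max(q)

print(box_dist_linf((2.0, 0.0, 0.0)))  # 1.0: one unit away from the face
print(box_dist_linf((0.0, 0.0, 0.0)))  # -1.0: inside the box
```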
&lt;p&gt;For the light, it&#x27;s not an illusion, it&#x27;s an actual light: after the first
raymarch to a solid, the ray direction is reoriented toward the light and the
march runs again (it&#x27;s the &lt;code&gt;V&lt;/code&gt; macro). Whether the ray hits a solid or not
determines whether the fragment is lit up.&lt;/p&gt;
&lt;h3&gt;Mobile bugs&lt;/h3&gt;
&lt;p&gt;A bad surprise of this demo was uncovering two driver bugs on mobile:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One with tricky &lt;a href=&quot;https://crbug.com/462233638&quot;&gt;for-loop compounds on Snapdragon/Adreno&lt;/a&gt;, triggered because I was trying
hard to avoid macros and functions.&lt;/li&gt;
&lt;li&gt;One with &lt;a href=&quot;https://crbug.com/462288594&quot;&gt;chained assignments on Imagination/PowerVR&lt;/a&gt; (typically affects
the Google Pixel 10 Pro).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first was worked around with the &lt;code&gt;V&lt;/code&gt; macro (which actually saved 3 characters in
the process), but the second one had to be unpacked and made me lose 2 characters.&lt;/p&gt;
&lt;h3&gt;Isometry&lt;/h3&gt;
&lt;p&gt;Another thing I studied was how to set up the camera in a non-perspective
&lt;a href=&quot;https://en.wikipedia.org/wiki/Isometric_projection&quot;&gt;isometric or dimetric view&lt;/a&gt;. I couldn&#x27;t make sense of the maths from
the Wikipedia page (it just didn&#x27;t work), but Sympy rescued me again:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from sympy import *

# Counter-clockwise rotation
a, ax0, ax1, ax2 = symbols(&#x27;a ax0:3&#x27;)
c, s = cos(a), sin(a)
k = 1-c
rot = Matrix(3,3, [
    # col 1             col 2              col 3
    ax0*ax0*k + c,     ax0*ax1*k + ax2*s, ax0*ax2*k - ax1*s, # row 1
    ax1*ax0*k - ax2*s, ax1*ax1*k + c,     ax1*ax2*k + ax0*s, # row 2
    ax2*ax0*k + ax1*s, ax2*ax1*k - ax0*s, ax2*ax2*k + c      # row 3
])

# Rotation by 45° on the y-axis
m45 = rot.subs({a:rad(-45), ax0:0, ax1:1, ax2:0})

# Apply the 2nd rotation on the x-axis to get the transform matrices for two
# classic projections
# Note: asin(tan(rad(30))) is the same as atan(sin(rad(45)))
isometric = m45 * rot.subs({a:asin(tan(rad(30))), ax0:1, ax1:0, ax2:0})
dimetric  = m45 * rot.subs({a:         rad(30),   ax0:1, ax1:0, ax2:0})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inspecting the matrices and factoring out the common terms, we obtain the
following transform matrices:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
M_{iso} = \frac{1}{\sqrt{2}\sqrt{3}}\begin{bmatrix}
   \sqrt{3} &amp;amp; -1 &amp;amp; \sqrt{2} \\
          0 &amp;amp;  2 &amp;amp; \sqrt{2} \\
  -\sqrt{3} &amp;amp; -1 &amp;amp; \sqrt{2}
\end{bmatrix} \text{ and } M_{dim} = \frac{1}{2\sqrt{2}}\begin{bmatrix}
     2 &amp;amp;       -1 &amp;amp; \sqrt{3} \\
     0 &amp;amp; \sqrt{6} &amp;amp; \sqrt{2} \\
    -2 &amp;amp;       -1 &amp;amp; \sqrt{3}
\end{bmatrix}
&lt;/div&gt;
&lt;p&gt;The ray direction is common to all fragments, so we use the central UV
coordinate (0,0) as reference point. We push it forward for convenience: (0,0,1),
and transform it with our matrix. This gives the central screen coordinate in
world space. Since the obtained point coordinate is relative to the world
origin, to go from that point to the origin, we just have to flip its sign. The
ray direction formula is then:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
d_{iso} = -M_{iso} \begin{bmatrix}0 \\ 0 \\ 1\end{bmatrix} = -\frac{\sqrt{3}}{3}\begin{bmatrix}1 \\ 1 \\ 1\end{bmatrix}
\text{ and } d_{dim} = -M_{dim} \begin{bmatrix}0 \\ 0 \\ 1\end{bmatrix} = -\frac{1}{4} \begin{bmatrix}\sqrt{6} \\ 2 \\ \sqrt{6}\end{bmatrix}
&lt;/div&gt;
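&lt;p&gt;These values are easy to double-check numerically. Here is a plain-Python sanity check (hypothetical helper names) that the isometric matrix, normalized by 1/(√2·√3), is a pure rotation and maps (0,0,1) to (1,1,1)/√3:&lt;/p&gt;

```python
import math

# The isometric transform matrix, rows scaled by 1/(√2·√3)
s2, s3 = math.sqrt(2), math.sqrt(3)
M = [[ s3, -1.0, s2],
     [0.0,  2.0, s2],
     [-s3, -1.0, s2]]
M = [[v / (s2 * s3) for v in row] for row in M]

# Third column = image of (0, 0, 1): expect (1, 1, 1)/√3
col_z = [row[2] for row in M]
print(all(abs(v - 1 / s3) < 1e-12 for v in col_z))  # True

# Every row has unit length: M is a pure rotation, no scaling
norms = [math.fsum(v * v for v in row) for row in M]
print(all(abs(n - 1.0) < 1e-12 for n in norms))  # True
```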
&lt;p&gt;To get the ray origin of every other pixel, the remaining question is: what is
the smallest distance we need to step the screen coordinates back such that,
after applying the transformation, the view doesn&#x27;t clip into the ground at
&lt;span class=&quot;math inline&quot;&gt;y=0&lt;/span&gt;?&lt;/p&gt;
&lt;p&gt;This requirement can be modeled with the following expression:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\left( M \begin{bmatrix}x \\ -1 \\ z\end{bmatrix} \right)_y &amp;gt; 0
&lt;/div&gt;
&lt;p&gt;The -1 is the lowest y screen coordinate (which we don&#x27;t want to end up in the
ground). The lazy bum in me just asks Sympy to solve it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;x, z = symbols(&amp;quot;x z&amp;quot;, real=True)
m = isometric  # or dimetric
u = m * Matrix([x, -1, z])
uz = solve(u[1] &amp;gt; 0, z)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We get &lt;span class=&quot;math inline&quot;&gt;z&amp;gt;\sqrt{2}&lt;/span&gt; for isometric, and &lt;span class=&quot;math inline&quot;&gt;z&amp;gt;\sqrt{3}&lt;/span&gt; for dimetric.&lt;/p&gt;
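&lt;p&gt;The isometric bound is also quick to sanity-check by hand: the y row of the matrix is (0, 2, √2)/√6, so the y-component of the transformed point is (√2·z - 2)/√6, positive exactly when z is greater than √2. A tiny Python check (hypothetical helper name):&lt;/p&gt;

```python
import math

# y-component of M_iso · (x, -1, z); the x entry of the y row is 0,
# so x does not matter for the ground-clipping condition.
def y_world(z):
    s2, s6 = math.sqrt(2), math.sqrt(6)
    return (s2 * z - 2.0) / s6

print(y_world(math.sqrt(2) + 0.01) > 0.0)  # True: stays above the ground
print(y_world(math.sqrt(2) - 0.01) > 0.0)  # False: clips below y = 0
```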
&lt;p&gt;With an arbitrary scale &lt;code&gt;S&lt;/code&gt; of the coordinate we end up with the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float S = 50.;
vec2 u = (P+P-R)/R.y * S; // scaled screen coordinates

float A=sqrt(2.), B=sqrt(3.);

// Isometric
vec3 rd = -vec3(1)/B,
     ro = mat3(B,0,-B,-1,2,-1,A,A,A)/A/B * vec3(u, A*S + eps);

// Dimetric
vec3 rd = -vec3(B,A,B)/A/2.,
     ro = mat3(2,0,-2,-1,A*B,-1,B,A,B)/A/2. * vec3(u, B*S + eps);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;eps&lt;/code&gt; is an arbitrary small value to make sure the y-coordinate ends up
above 0.&lt;/p&gt;
&lt;p&gt;In Entrance 3, I used a rough approximation of the isometric setup.&lt;/p&gt;
&lt;p&gt;&lt;a id=&quot;Archipelago&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Archipelago&lt;/h2&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;340&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demo-tricks/archipelago.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Archipelago demo in 472 characters&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Archipelago [472] by bµg
// License: CC BY-NC-SA 4.0
#define r(a)*=mat2(cos(a+vec4(0,11,33,0))),
void main(){vec3 p,q,k;for(float w,x,a,b,i,t,h,e=.1,d=e,z=.001;i++&amp;lt;50.&amp;amp;&amp;amp;d&amp;gt;z;h+=k.y,w=h-d,t+=d=min(d,h)*.8,O=vec4((w&amp;gt;z?k.zxx*e:k.zyz/20.)+i/1e2+max(1.-abs(w/e),z),1))for(p=normalize(vec3(P+P-R,R.y))*t,p.zy r(1.)p.z+=T+T,p.x+=sin(w=T*.4)*2.,p.xy r(cos(w)*e)d=p.y+=4.,h=d-2.3+abs(p.x*.2),q=p,k-=k,a=e,b=.8;a&amp;gt;z;a*=.8,b*=.5)q.xz r(.6)p.xz r(.6)k.y+=abs(dot(sin(q.xz*.4/b),R-R+b)),k.x+=w=a*exp(sin(x=p.x/a*e+T+T)),p.x-=w*cos(x),d-=w;}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;See it on &lt;a href=&quot;https://b.pkh.me/2025-12-02-archipelago.htm&quot;&gt;its official page&lt;/a&gt;, or play with the code on &lt;a href=&quot;https://www.shadertoy.com/view/wfKcDR&quot;&gt;its
Shadertoy port&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;For this infinite procedurally generated Japan, I wanted to break away from
my red/orange obsession. Technically speaking, it&#x27;s actually fairly basic if
you&#x27;re familiar with Red Alp. I used the same noise for the mountains/islands,
but the water uses a different noise.&lt;/p&gt;
&lt;p&gt;The per octave noise curve is &lt;code&gt;w=exp(sin(x))&lt;/code&gt;, with the particularity of
shifting the &lt;code&gt;x&lt;/code&gt; coordinate with its derivative: &lt;code&gt;x-=w*cos(x)&lt;/code&gt;. This is some
form of &lt;a href=&quot;https://iquilezles.org/articles/warp/&quot;&gt;domain warping&lt;/a&gt; that gives the nice effect here. When I say &lt;code&gt;x&lt;/code&gt;, I&#x27;m
really referring to the x-axis position. There is no need to do the same on the
z-component (xz forms the flat plane) because each octave of the fbm has a
rotation that &amp;quot;mixes&amp;quot; both axes, so &lt;code&gt;z&lt;/code&gt; is actually baked into &lt;code&gt;x&lt;/code&gt;.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/demo-tricks/waves.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;w=exp(sin(x))&lt;/figcaption&gt;
&lt;/figure&gt;
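&lt;p&gt;A minimal Python sketch of that per-octave curve, to make the warp explicit: the shift really is the curve&#x27;s own derivative, since d/dx exp(sin(x)) = cos(x)·exp(sin(x)) = w·cos(x):&lt;/p&gt;

```python
import math

# One octave of the wave noise: height w = exp(sin(x)), then the sample
# position is shifted by the derivative w*cos(x) (the domain warp).
def octave(x):
    w = math.exp(math.sin(x))
    x_warped = x - w * math.cos(x)
    return w, x_warped

w, xw = octave(0.0)
print(w, xw)  # 1.0 -1.0: exp(sin 0) = 1, shifted back by cos(0)·1
```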
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;I didn&#x27;t come up with the formula, but first found it in &lt;a href=&quot;https://youtu.be/PH9q0HNBjT4&amp;amp;t=1025s&quot;&gt;this video by
Acerola&lt;/a&gt;. I don&#x27;t know if he&#x27;s the original author, but I&#x27;ve
seen the formula replicated in various places.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a id=&quot;Cutie&quot;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Cutie&lt;/h2&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;340&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demo-tricks/cutie.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Cutie demo in 602 characters&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Cutie [602] by bµg
// License: CC BY-NC-SA 4.0
#define V vec3
#define L length(p
#define C(A,B,X,Y)d=min(d,-.2*log2(exp2(X-L-A)/.2)+exp2(Y-L-B)/.2)))
#define H(Z)S,k=fract(T*1.5+s),a=V(1.3,.2,Z),b=V(1,.3*max(1.-abs(3.*k-1.),z),Z*.75+3.*max(-k*S,k-1.)),q=b*S,q+=a+sqrt(1.-dot(q,q))*normalize(V(-b.y,b.x,0)),C(a,q,3.5,2.5),C(q,a-b,2.5,2.)
void main(){float i,t,k,z,s,S=.5,d=S;for(V p,q,a,b;i++&amp;lt;5e1&amp;amp;&amp;amp;d&amp;gt;.001;t+=d=min(d,s=L+V(S-2.*p.x,-1,S))-S))p=normalize(V(P+P-R,R.y))*t,p.z-=5.,p.zy*=mat2(cos(vec4(1,12,34,1))),p.xz*=mat2(cos(sin(T)+vec4(0,11,33,0))),d=1.+p.y,C(z,V(z,z,1.2),7.5,6.),s=p.x&amp;lt;z?p.x=-p.x,z:H(z),s+=H(1.);O=vec4(V(exp(-i/(s&amp;gt;d?1e2:9.))),1);}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;See it on &lt;a href=&quot;https://b.pkh.me/2025-12-05-cutie.htm&quot;&gt;its official page&lt;/a&gt;, or play with the code on &lt;a href=&quot;https://www.shadertoy.com/view/tfVcRV&quot;&gt;its
Shadertoy port&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Here I got cocky and thought I could manage to fit it in 512 chars. I failed,
by 90 characters. I did use the &lt;a href=&quot;https://iquilezles.org/articles/smin/&quot;&gt;smoothmin&lt;/a&gt; operator for the first time: every
limb of Cutie&#x27;s body is composed of two spheres creating a rounded cone
(two spheres of different sizes smoothly merged like metaballs).&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;340&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demo-tricks/metaballs.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;2 spheres merging using the smin operator&lt;/figcaption&gt;
&lt;/figure&gt;
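&lt;p&gt;The merge in the figure uses a smooth minimum. One common polynomial variant from the linked page, transliterated to Python (the demo&#x27;s golfed form differs):&lt;/p&gt;

```python
# Quadratic polynomial smooth-minimum: behaves like min(a, b) when the two
# distances differ by more than k, and blends them smoothly otherwise.
def smin(a, b, k):
    h = max(k - abs(a - b), 0.0) / k
    return min(a, b) - h * h * k * 0.25

print(smin(1.0, 5.0, 0.5))  # 1.0: too far apart to blend, plain min
print(smin(1.0, 1.0, 1.0))  # 0.75: equal distances blend below the plain min
```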
&lt;p&gt;Then I used &lt;a href=&quot;https://iquilezles.org/articles/simpleik/&quot;&gt;simple inverse kinematics&lt;/a&gt; for the animation. Using leg parts
with a size of 1 helped simplify the formula and make it shorter.&lt;/p&gt;
&lt;p&gt;You may be wondering about the smooth visuals themselves: I didn&#x27;t use the depth
map but simply the number of iterations. Due to the nature of the raymarching
algorithm, when a ray passes close to a shape, it slows down significantly,
increasing the number of iterations. This is super useful because it exaggerates
the contours of the shapes naturally. It&#x27;s wrapped into an exponential, but &lt;code&gt;i&lt;/code&gt;
defines the output color directly.&lt;/p&gt;
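&lt;p&gt;As a sketch of that mapping (the demo uses two falloff constants, 1e2 and 9; here is one case in Python, with an assumed constant name):&lt;/p&gt;

```python
import math

# Iteration-count shading: few iterations (ray flies past everything) stay
# bright, many iterations (ray grazes or hits a shape) go dark.
def shade(i, k=100.0):
    return math.exp(-i / k)

print(shade(5))   # far from everything: close to white
print(shade(50))  # grazing a silhouette: visibly darker
```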
&lt;h2&gt;What&#x27;s next&lt;/h2&gt;
&lt;p&gt;I will continue making more of those, keeping my artistic ambition low because
of the 512-character constraint I&#x27;m imposing on myself.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://blog.pkh.me/img/demo-tricks/512.jpg&quot; alt=&quot;meme about the 512 chars limit&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You may be wondering why I keep this obsession with 512 characters; many
people have called me out on it. There are actually many arguments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A tiny demo has to focus on one or two very scoped aspects of computer
graphics, which makes it perfect as a &lt;strong&gt;learning support&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It&#x27;s part of the &lt;strong&gt;artistic performance&lt;/strong&gt;: it&#x27;s not just techniques and
visuals, the wizardry of the code is part of why it&#x27;s so impressive. We&#x27;re in
an era of visuals, people have been fed the craziest VFX ever. But have
they seen them done with a few hundred bytes of code?&lt;/li&gt;
&lt;li&gt;The constraint helps me &lt;strong&gt;finish the work&lt;/strong&gt;: when making art, there is always
this question of when to stop. Here there is a hard limit where I just
cannot do more and I have to move on.&lt;/li&gt;
&lt;li&gt;Similarly, it &lt;strong&gt;prevents my ambition&lt;/strong&gt; from tricking me into some colossal
project I will never finish or even start. That format has a ton of
limitations, and that&#x27;s its strength.&lt;/li&gt;
&lt;li&gt;Working on such a tiny piece of code for days/weeks just &lt;strong&gt;brings me joy&lt;/strong&gt;. I
do feel like a craftsperson, spending an unreasonable amount of time
perfecting it, for the beauty of it.&lt;/li&gt;
&lt;li&gt;I&#x27;m trying to build a portfolio, and it&#x27;s important for me to keep it
&lt;strong&gt;consistent&lt;/strong&gt;. If the size limit were different, I would have done things
differently, so I can&#x27;t change it now. If I had hundreds more characters,
Red Alp might have had birds, the sky opening to cast a beam of light onto the
mountains, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Why 512 in particular? It happens to be the size of a toot on &lt;a href=&quot;https://fosstodon.org/@bug&quot;&gt;my Mastodon
instance&lt;/a&gt; so I can fit the code there, and I found it to be a good
balance.&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/47-text-rendering-and-effects-using-gpu-computed-distances.html</guid>
 <link>http://blog.pkh.me/p/47-text-rendering-and-effects-using-gpu-computed-distances.html</link>
 <title>Text rendering and effects using GPU-computed distances</title>
 <pubDate>Sat, 01 Nov 2025 17:20:06 -0000</pubDate>
 <description>&lt;p&gt;Text rendering is &lt;em&gt;cursed&lt;/em&gt;. Anyone who has worked on text will tell you the
same; whether it&#x27;s about layout, bidirectional text, shaping, Unicode, or the
rendering itself, it&#x27;s never a completely solved problem. In my personal case,
I&#x27;ve been working on trying to render text in the context of a compositing
engine for creative content. I needed crazy text effects, and I needed them to
be reasonably fast, which implied working with the GPU as much as possible. A
distance field was an obvious requirement because it unlocks anti-aliasing and
the ability to get many great effects basically for free.&lt;/p&gt;
&lt;p&gt;In this article, we will see how to compute a signed distance field on the GPU,
because it&#x27;s much faster than doing it on the CPU, especially when targeting
mobile devices. We will make the algorithm decently fast, then after lamenting
about the limitations, we will see what kind of effects this opens up.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/intro.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Progressive build of the &#x27;あ&#x27; glyph from the Mochiy Pop One font&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Extraction of glyph outlines&lt;/h2&gt;
&lt;p&gt;Non-bitmap fonts contain glyphs defined by outlines made of closed sequences
of lines and (quadratic or cubic) Bézier curves. Extracting them isn&#x27;t exactly
complicated: &lt;a href=&quot;https://freetype.org&quot;&gt;FreeType&lt;/a&gt; or &lt;a href=&quot;https://github.com/harfbuzz/ttf-parser&quot;&gt;ttf-parser&lt;/a&gt; typically expose a way to do that.&lt;/p&gt;
&lt;p&gt;For the purpose of this article, we&#x27;re going to hard code the list of the Bézier
curves inside the shader, but of course in a more serious setup those would be
uploaded through storage buffers or similar.&lt;/p&gt;
&lt;p&gt;Using &lt;a href=&quot;http://blog.pkh.me/misc/dump-outline.rs&quot;&gt;this tiny program&lt;/a&gt;, a glyph can be dumped as a series of
outlines into a fixed size array:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;struct Bezier {
    vec2 p0; // start point
    vec2 p1; // control point 1
    vec2 p2; // control point 2
    vec2 p3; // end point
};

#define N 42
#define NC 2
const int glyph_A_count[2] = int[](33, 9);
const Bezier glyph_A[42] = Bezier[](
    Bezier(vec2(  0.365370,  -0.570817), vec2(  0.374708,  -0.631518), vec2(  0.332685,  -0.687549), vec2(  0.339689,  -0.748249)),
    Bezier(vec2(  0.339689,  -0.748249), vec2(  0.339689,  -0.748249), vec2(  0.344358,  -0.764591), vec2(  0.351362,  -0.771595)),
    Bezier(vec2(  0.351362,  -0.771595), vec2(  0.384047,  -0.822957), vec2(  0.402724,  -0.855642), vec2(  0.442412,  -0.885992)),
    // ...
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;glyph_A_count&lt;/code&gt; contains how many Bézier curves there are for each sub-shape
composing the glyph, and &lt;code&gt;glyph_A&lt;/code&gt; contains that list of cubic Bézier curves.&lt;/p&gt;
&lt;p&gt;Even though glyphs are also composed of lines and quadratic curves, we expand
them all into cubics: &amp;quot;who can do more can do less&amp;quot;.&lt;/p&gt;
&lt;p&gt;We use these formulas to respectively expand lines and quadratics into cubics:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
B_1 &amp;amp;= \begin{bmatrix}
P_0 \\
\mathrm{mix}(P_0, P_1, 1/3) \\
\mathrm{mix}(P_0, P_1, 2/3) \\
P_1
\end{bmatrix} \\
B_2 &amp;amp;= \begin{bmatrix}
P_0 \\
\mathrm{mix}(P_0, P_1, 2/3) \\
\mathrm{mix}(P_1, P_2, 1/3) \\
P_2
\end{bmatrix}
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Where &lt;span class=&quot;math inline&quot;&gt;P_n&lt;/span&gt; are the Bézier control points.&lt;/p&gt;
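&lt;p&gt;These elevations preserve the curve exactly. A quick Python check (hypothetical helper names) that a degree-elevated quadratic reproduces the original at every parameter:&lt;/p&gt;

```python
def mix(p, q, t):
    return tuple(a + (b - a) * t for a, b in zip(p, q))

def quad_to_cubic(p0, p1, p2):
    # B2 from above: degree elevation of a quadratic Bézier to a cubic
    return (p0, mix(p0, p1, 2/3), mix(p1, p2, 1/3), p2)

def eval_quad(p0, p1, p2, t):
    u = 1 - t
    return tuple(u*u*a + 2*u*t*b + t*t*c for a, b, c in zip(p0, p1, p2))

def eval_cubic(ctrl, t):
    u = 1 - t
    return tuple(u**3*a + 3*u*u*t*b + 3*u*t*t*c + t**3*d
                 for a, b, c, d in zip(*ctrl))

p0, p1, p2 = (0.0, 0.0), (0.5, 1.0), (1.0, 0.0)
elevated = quad_to_cubic(p0, p1, p2)
same = all(abs(x - y) < 1e-12
           for t in (0.0, 0.25, 0.5, 0.75, 1.0)
           for x, y in zip(eval_cubic(elevated, t), eval_quad(p0, p1, p2, t)))
print(same)  # True
```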
&lt;p&gt;For simplicity, and because we want to make sure the most complex case is well
tested, we will stick to this approach in this article. But it also means there
is a lot of room for further optimization. Since solving lines and quadratics
is much simpler, this is left as an exercise for the reader.&lt;/p&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;You may be tempted to upload the polynomial form of these curves directly to
save some computation in the shader. Don&#x27;t. You will lose the exact
stitching property because one evaluated polynomial end &lt;span class=&quot;math inline&quot;&gt;B_n(1)&lt;/span&gt; will
not necessarily match the next polynomial start &lt;span class=&quot;math inline&quot;&gt;B_{n+1}(0)&lt;/span&gt;. This creates
artificial &amp;quot;precision holes&amp;quot; that will break rendering in obscure ways.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Signed distance to the shape&lt;/h2&gt;
&lt;p&gt;In the &lt;a href=&quot;http://blog.pkh.me/p/46-fast-calculation-of-the-distance-to-cubic-bezier-curves-on-the-gpu.html&quot;&gt;previous article&lt;/a&gt;, we saw how to get the distance to
a cubic Bézier curve. Since each glyph is composed of multiple outlines, we can
simply iterate over all of them and pick the shortest distance.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float get_distance(vec2 p, Bezier buf[N], int counts[NC]) {
    int base = 0;
    float dist = 1e38;

    for (int j = 0; j &amp;lt; NC; j++) {
        int count = counts[j];

        for (int i = 0; i &amp;lt; count; i++) {
            Bezier b = buf[base + i];
            float d = bezier_sq(p, b.p0, b.p1, b.p2, b.p3);
            dist = min(dist, d);
        }

        base += count;
    }

    return sqrt(dist);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;code&gt;bezier_sq()&lt;/code&gt; is the distance to the Bézier curve, squared, as defined in
the &lt;a href=&quot;http://blog.pkh.me/p/46-fast-calculation-of-the-distance-to-cubic-bezier-curves-on-the-gpu.html&quot;&gt;previous article&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/glyph_unsigned.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Distance to the &#x27;A&#x27; glyph from the Virgil font&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This works just fine, but as you can imagine, it&#x27;s not cheap to solve that many
distances per pixel. A first straightforward optimization is to ignore any
curve whose bounding box is further away than our current best distance, because
none of its points can give a shorter one:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/box-optim.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;Box distance optimization&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Where each box encloses a Bézier curve like this:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/box.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;Most naive/conservative bounding box&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We could use a tighter bound, but it would require more computation, so this
felt like a good trade-off.&lt;/p&gt;
&lt;p&gt;Implementing this in the inner loop is pretty simple:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;for (int i = 0; i &amp;lt; count; i++) {
    Bezier b = buf[base + i];
    vec2 p0=b.p0, p1=b.p1, p2=b.p2, p3=b.p3;

    // Distance to box (0 if inside), squared
    vec2 q0 = min(p0, min(p1, min(p2, p3)));
    vec2 q1 = max(p0, max(p1, max(p2, p3)));
    vec2 v = max(abs(q0+q1-p-p)-q1+q0, 0.)*.5;
    float h = dot(v,v);

    // We can&#x27;t get a shorter distance than h if we were to compute the
    // distance to that curve
    if (h &amp;gt; dist)
        continue;

    float d = bezier_sq(p, p0, p1, p2, p3);
    dist = min(dist, d);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The distance-to-bounding-box formula comes from this &lt;a href=&quot;https://www.youtube.com/watch?v=62-pRVZuS5c&quot;&gt;explanatory video by
Inigo Quilez&lt;/a&gt; (the basic version, without the inside distance), adapted to
the Bézier control point coordinates.&lt;/p&gt;
&lt;p&gt;This saves a lot of computation in certain cases, but the worst case is still
pretty terrible, as shown by the heat map of this &#x27;C&#x27; glyph:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/glyph_heatmap_nobest.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Heat map of how many distances are evaluated&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Indeed, it sometimes takes a long time before reaching a Bézier curve close
enough to disregard most of the others. We can observe this effect the further
we move away from the beginning of the shape.&lt;/p&gt;
&lt;p&gt;So the next step is to find a good initial candidate. One cheap way to do that
is to first compute the distance to the center of each curve&#x27;s bounding box, and
pick the smallest:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Find a good initial guess
int best = 0;
float boxd = 1e38;
for (int i = 0; i &amp;lt; count; i++) {
    Bezier b = buf[base + i];
    vec2 p0=b.p0, p1=b.p1, p2=b.p2, p3=b.p3;
    vec2 q0 = min(p0, min(p1, min(p2, p3)));
    vec2 q1 = max(p0, max(p1, max(p2, p3)));
    vec2 v = (q0+q1)*.5 - p;
    float h = dot(v,v);
    if (h &amp;lt; boxd)
        best=i, boxd=h;
}

// Initial guess
Bezier bb = buf[base + best];
dist = min(dist, bezier_sq(p, bb.p0, bb.p1, bb.p2, bb.p3));

for (int i = 0; i &amp;lt; count; i++) {
    if (i == best) // We already computed this one
        continue;

    Bezier b = buf[base + i];
    vec2 p0=b.p0, p1=b.p1, p2=b.p2, p3=b.p3;
    // ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This optimization is immediately reflected in the heat map, where only the
central point remains a hot spot (this glyph is a pathological case, as it
roughly forms a circle):&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/glyph_heatmap.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Heat map with a rough initial guess&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Winding number&lt;/h2&gt;
&lt;p&gt;The last step is to figure out whether we are inside or outside the shape. There
are two schools here, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Even%E2%80%93odd_rule&quot;&gt;even-odd&lt;/a&gt; and the &lt;a href=&quot;https://en.wikipedia.org/wiki/Nonzero-rule&quot;&gt;non-zero&lt;/a&gt; rules. We&#x27;ll pick the
latter because that&#x27;s the expectation in the case of font rendering.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;http://blog.pkh.me/p/33-deconstructing-be%CC%81zier-curves.html&quot;&gt;deconstructing Bézier curves&lt;/a&gt;, we explained the theory behind this
specific algorithm, so we&#x27;re not going to dive into the details again. The basic
idea is to cast a ray in one direction from our current position and count
how many times it crosses a given curve. Here we will arbitrarily choose the
horizontal ray line &lt;span class=&quot;math inline&quot;&gt;y = P_y&lt;/span&gt;, where P is our current coordinate.&lt;/p&gt;
&lt;p&gt;The topology of each curve can hint at whether it&#x27;s worth considering or not.
For example, if every control point is above our current position, or every one
below, the curve can be ignored. We can store all the signs in a mask and bail
out as soon as the bounding box of the curve is entirely below or entirely above
the ray:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;int signs = int(p0.y &amp;lt; p.y)
          | int(p1.y &amp;lt; p.y) &amp;lt;&amp;lt; 1
          | int(p2.y &amp;lt; p.y) &amp;lt;&amp;lt; 2
          | int(p3.y &amp;lt; p.y) &amp;lt;&amp;lt; 3;
if (signs == 0 || signs == 15) // all signs are identical
    return 0;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each sign indicates the position of a control point with respect to the ray.
We can use the relative position of the starting point as a reference for the
overall orientation (if there is a crossing, we know whether it will come from
below or from above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;int inc = (signs &amp;amp; 1) == 0 ? 1 : -1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also need to convert the Bézier curves to the usual polynomial
&lt;span class=&quot;math inline&quot;&gt;at^3+bt^2+ct+d&lt;/span&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec2 a = -p0 + 3.*(p1 - p2) + p3,
     b = 3. * (p0 - 2.*p1 + p2),
     c = 3. * (p1 - p0),
     d = p0 - p;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we can find the roots on the y-axis and evaluate the curve on the x-axis
at each of them. Every crossing point (there are at most 3) alternates the
winding direction:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;    float r[5];
    int count = root_find3(r, a.y, b.y, c.y, d.y);
    vec3 t = vec3(r[0], r[1], r[2]);
    vec3 v = ((a.x*t + b.x)*t + c.x)*t + d.x;
    if (count &amp;gt; 0 &amp;amp;&amp;amp; v.x &amp;gt;= 0.) w += inc;
    if (count &amp;gt; 1 &amp;amp;&amp;amp; v.y &amp;gt;= 0.) w -= inc;
    if (count &amp;gt; 2 &amp;amp;&amp;amp; v.z &amp;gt;= 0.) w += inc;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since we already have a 5th degree root finder from the previous article, we
just have to build a tiny version for the 3rd degree:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;int root_find3(out float r[5], float a, float b, float c, float d) {
    float r2[5];
    int n = root_find2(r2, 3.*a, b+b, c);
    return cy_find5(r, r2, n, 0., 0., a, b, c, d);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;Our root finder doesn&#x27;t return roots outside &lt;span class=&quot;math inline&quot;&gt;[0,1]&lt;/span&gt; so no filtering is
required.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;To summarize:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;int bezier_winding(vec2 p, vec2 p0, vec2 p1, vec2 p2, vec2 p3) {
    int w = 0;
    int signs = int(p0.y &amp;lt; p.y)
              | int(p1.y &amp;lt; p.y) &amp;lt;&amp;lt; 1
              | int(p2.y &amp;lt; p.y) &amp;lt;&amp;lt; 2
              | int(p3.y &amp;lt; p.y) &amp;lt;&amp;lt; 3;
    if (signs == 0 || signs == 15)
        return 0;
    int inc = (signs &amp;amp; 1) == 0 ? 1 : -1;
    vec2 a = -p0 + 3.*(p1 - p2) + p3,
         b = 3. * (p0 - 2.*p1 + p2),
         c = 3. * (p1 - p0),
         d = p0 - p;
    float r[5];
    int count = root_find3(r, a.y, b.y, c.y, d.y);
    vec3 t = vec3(r[0], r[1], r[2]);
    vec3 v = ((a.x*t + b.x)*t + c.x)*t + d.x;
    if (count &amp;gt; 0 &amp;amp;&amp;amp; v.x &amp;gt;= 0.) w += inc;
    if (count &amp;gt; 1 &amp;amp;&amp;amp; v.y &amp;gt;= 0.) w -= inc;
    if (count &amp;gt; 2 &amp;amp;&amp;amp; v.z &amp;gt;= 0.) w += inc;
    return w;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For every sub-shape, we can accumulate the winding number, and use it at the end
to decide whether we&#x27;re inside or outside:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float get_distance(vec2 p, Bezier buf[N], int counts[NC]) {
    int w = 0;
    int base = 0;
    float dist = 1e38;

    for (int j = 0; j &amp;lt; NC; j++) {
        int count = counts[j];

        // Get the sign of the distance
        for (int i = 0; i &amp;lt; count; i++) {
            Bezier b = buf[base + i];
            w += bezier_winding(p, b.p0, b.p1, b.p2, b.p3);
        }

        // ...
    }

    // Positive outside, negative inside
    return (w != 0 ? -1. : 1.) * sqrt(dist);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And voilà:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/glyph_signed.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Signed distance to the &#x27;A&#x27; glyph from the Virgil font&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;This winding number logic might be too fragile: it doesn&#x27;t cover potential
degenerate cases such as horizontal tangents or duplicated roots. But for
some reason, while I fought these issues for years, none of the weird corner
cases seemed to glitch in my extensive tests, probably because this root
finder is more resilient than what I was using before.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;h3&gt;Wicked curves&lt;/h3&gt;
&lt;p&gt;This may look satisfying, but it&#x27;s only the beginning of the problems. For
example, variable fonts typically follow chaotic patterns:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/quicksand-e.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;The glyph &#x27;e&#x27; in the Quicksand font&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In addition to the self-overlapping part, notice the triangle folding back on
itself on the right.&lt;/p&gt;
&lt;p&gt;This completely wrecks the distance field:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/glyph_glitch.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Glyph with a broken SDF due to overlaps&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Even with a simple character display (meaning something that doesn&#x27;t exploit the
wide range of effects available with an SDF), it starts to glitch:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/glyph_glitch_nodebug.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Glitching glyph due to broken SDF&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Little &amp;quot;cracks&amp;quot; should appear around the overlaps. This can be mitigated by
lowering the distance by a tiny constant to avoid the zero-crossing, but it
impacts the overall glyph (it gets bolder).&lt;/p&gt;
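&lt;p&gt;As a minimal sketch of that mitigation (the offset is an arbitrary value of
mine that would need tuning per font and rendering size):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float d = get_distance(p, glyph_A, glyph_A_count);
d -= 0.002; // shift the zero-crossing slightly outward to hide the cracks
&lt;/code&gt;&lt;/pre&gt;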
&lt;p&gt;And it&#x27;s not just a variable font problem; sometimes designers rely on
overlaps for simplicity:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/quicksand-t.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;The glyph &#x27;t&#x27; in the Quicksand font&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;And sometimes... well let&#x27;s say they have a legitimate reason to do it:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/bengali.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;A Bengali glyph&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This is not something that can be addressed easily.&lt;/p&gt;
&lt;p&gt;For example, take these two overlapping shapes:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/overlap.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;Distance inside two overlapping shapes&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We see that the actual distance (white circle) is not the smallest distance
to either shape, and it&#x27;s not even the smallest distance to any edge: it is at
an intersection point between two curves, which we do not have. Here we&#x27;re
dealing with line segments, but with cubic curves, the problem explodes in
complexity.&lt;/p&gt;
&lt;p&gt;At this point, we need another strategy, like feeding the GPU renderer with
preprocessed outline-only curves. Many people rely on curve flattening to
address this issue. This is unfortunately yet another field of research that
we&#x27;re not going to explore this time.&lt;/p&gt;
&lt;p&gt;Inigo talked about &lt;a href=&quot;https://iquilezles.org/articles/interiordistance/&quot;&gt;the combination of signed distances&lt;/a&gt; if you want
some ideas, but aside from the first one (giving up), none seems particularly
applicable here.&lt;/p&gt;
&lt;h3&gt;Atlas and overlapping distances&lt;/h3&gt;
&lt;p&gt;Some effects such as blur or glow expand beyond the boundaries of the
characters, so the distance field needs to be larger than the glyph itself. This
means that when an effect spreads too far, there will be an overlap. If we&#x27;re
applying an effect to a word, the distance field must be the union of all the
word&#x27;s glyphs (or sometimes even the whole sentence). The classic approach of an
atlas of glyph distances will not work reliably.&lt;/p&gt;
&lt;p&gt;In the following illustration, a geometry per glyph is used, each geometry is
enlarged to account for the larger distance field, and we end up with potential
overlaps when applying effects.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/atlas.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;Overlapping character geometries due to larger distance&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h3&gt;Rounded corners&lt;/h3&gt;
&lt;p&gt;Like all distance maps, ours suffers from the usual limitations. The most
common one is the rounded corners problem. This is typically addressed using a
&lt;a href=&quot;https://github.com/Chlumsky/msdfgen&quot;&gt;multi-channel signed distance field generator&lt;/a&gt;, but it&#x27;s hard for me
to tell how amenable it is to a GPU port.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/msdfgen-A.png&quot; alt=&quot;&quot;&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/text-rendering/msdfgen-A-ok.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;msdfgen demonstration of corners improvement&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;This problem only appears with intermediate textures. When computing exact
distances directly in the shader, as we do here, this is not an issue.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Effects&lt;/h2&gt;
&lt;p&gt;Despite all these limitations, we can already do so much, so let&#x27;s close this
article on a positive note. We don&#x27;t do it here, but all of these effects become
free as soon as the distance field is stored in an intermediate texture.&lt;/p&gt;
&lt;p&gt;First, we have anti-aliasing / blur:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/blur.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;AA / blur effect&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;I wrote &lt;a href=&quot;http://blog.pkh.me/p/44-perfecting-anti-aliasing-on-signed-distance-functions.html&quot;&gt;a dedicated article&lt;/a&gt; on the subject of AA (and blur) on SDF if you
want more information on how to achieve that.&lt;/p&gt;
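&lt;p&gt;As a minimal sketch (the dedicated article covers the subtleties), the
anti-aliased fill boils down to mapping the distance through a pixel-wide
&lt;code&gt;smoothstep&lt;/code&gt;; &lt;code&gt;fill_color&lt;/code&gt; is an assumed uniform of mine:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float d = get_distance(p, glyph_A, glyph_A_count);
float w = fwidth(d) * .5;           // half a pixel worth of distance
float mask = smoothstep(-w, w, -d); // 1 inside the glyph, 0 outside
vec3 o = fill_color * mask;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Widening &lt;code&gt;w&lt;/code&gt; beyond a pixel turns the same code into a cheap blur.&lt;/p&gt;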
&lt;p&gt;The shape can also be drastically altered with a simple operator such as
&amp;quot;rounding&amp;quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;d -= rounding;
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/rounding.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Rounding effect&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This is the same technique we suggested to cover up the overlap glitch
earlier, just rebranded as an effect.&lt;/p&gt;
&lt;p&gt;In the same spirit we can also create an outline stroke (on the outer edge to
preserve the original glyph design):&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/outline.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Outline effect&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This is &lt;em&gt;sooo&lt;/em&gt; useful because it makes it possible for our text to be visible no
matter what the background is. So many editors don&#x27;t have this feature because
it&#x27;s hard and expensive to do correctly. Given a distance field though, all we
have to do is this (which also includes anti-aliasing on every border):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float aa = fwidth(d); // pixel width estimates
float w = aa * .5; // half diffuse width
vec2 b = vec2(0,1)*outline - d; // inner and outer boundaries; vec2(-1,0) for inner, vec2(-.5,.5) for centered
float inner_mask = smoothstep(-w, w, b.x); // cut-off between the outline and the outside (whole shape w/ outline)
float outer_mask = smoothstep(-w, w, b.y); // cut-off between the fill color and the outline (whole shape w/o outline)
float outline_mask = outer_mask - inner_mask;
vec3 o = (inner_color*inner_mask + outline_color*outline_mask) * outer_mask;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also dig into our character with &lt;code&gt;d = abs(d)-ring&lt;/code&gt;:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/ring.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Ring effect&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;And maybe apply some glow to create a neon effect:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/text-rendering/glow.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Ring combined with a neon/glow effect&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float glow_power = glow * exp(-max(d, 0.) * 10.);
o += glow_color * glow_power;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We could also do drop shadows, all sorts of distortions, or so many other
creative ways of exploiting this distance. You get the idea: it is fundamental
as soon as you want fast visual effects.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This article is the last of the series on 2D rendering for me. I&#x27;ve wanted
to share this experience and knowledge after many years of struggling (mostly
alone) on these issues. I wish I could have succeeded in providing a good free
and open-source text effects rendering engine to compete with the industry
standards. (Un)fortunately for me, the adventure stops here, but I hope this
will benefit creators and future tinkerers interested in the subject.&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/46-fast-calculation-of-the-distance-to-cubic-bezier-curves-on-the-gpu.html</guid>
 <link>http://blog.pkh.me/p/46-fast-calculation-of-the-distance-to-cubic-bezier-curves-on-the-gpu.html</link>
 <title>Fast calculation of the distance to cubic Bezier curves on the GPU</title>
 <pubDate>Sat, 18 Oct 2025 09:21:56 -0000</pubDate>
 <description>&lt;p&gt;Bézier curves are a core building block of text and 2D shapes rendering.
There are several approaches to rendering them, but one especially challenging
problem, both mathematically and technically, is computing the distance to
a Bézier curve. For quadratic curves (one control point), this is fairly
accessible, but for cubics (two control points), we&#x27;re going to see why it is
so hard.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/bezier-distance/glyph.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;A glyph from the Virgil font, composed of multiple Bézier curves&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Having this distance field opens up many rendering possibilities. It&#x27;s hard, but
it&#x27;s possible; here is a live proof:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/bezier-distance/bezier-dist.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Distance to a cubic Bézier curve&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In this visualization, I&#x27;m borrowing your device resources to compute the
distance to the curve for every single pixel. The yellow points are the control
points of the curve (in white) and the blue zone is a representation of the
distance field.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;All the demos and code in this article are self-contained GLSL fragment
shaders. Most of the code can be found in the article, but feel free to
inspect the source code of any of these WebGL demos for the complete code.
They can be run verbatim using &lt;a href=&quot;https://github.com/ubitux/ShaderWorkshop&quot;&gt;ShaderWorkshop&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;The basic maths&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://blog.pkh.me/p/33-deconstructing-be%CC%81zier-curves.html&quot;&gt;In a previous article&lt;/a&gt;, we explained that a Bézier curve
can be expressed as a polynomial. In our case, a cubic polynomial:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
B_3(t) = \textbf{a}t^3 + \textbf{b}t^2 + \textbf{c}t + \textbf{d}
&lt;/div&gt;
&lt;p&gt;Where &lt;strong&gt;a&lt;/strong&gt;, &lt;strong&gt;b&lt;/strong&gt;, &lt;strong&gt;c&lt;/strong&gt; and &lt;strong&gt;d&lt;/strong&gt; are the vector coefficients derived from
the start (&lt;span class=&quot;math inline&quot;&gt;P_0&lt;/span&gt;), end (&lt;span class=&quot;math inline&quot;&gt;P_3&lt;/span&gt;), and control points (&lt;span class=&quot;math inline&quot;&gt;P_1&lt;/span&gt;, &lt;span class=&quot;math inline&quot;&gt;P_2&lt;/span&gt;) using the
following formulas (you can refer to the previous article for details):&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
\textbf{a} &amp;amp;= -P_0 + 3(P_1-P_2) + P_3 \\
\textbf{b} &amp;amp;= 3P_0 - 6P_1 + 3P_2 \\
\textbf{c} &amp;amp;= -3P_0 + 3P_1 \\
\textbf{d} &amp;amp;= P_0
\end{aligned}
&lt;/div&gt;
&lt;p&gt;For a given point &lt;span class=&quot;math inline&quot;&gt;p&lt;/span&gt; in 2D space, the distance to that Bézier curve can be
expressed as a length between our curve and &lt;span class=&quot;math inline&quot;&gt;p&lt;/span&gt;:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
d(t) &amp;amp;= ||B_3(t) - \textbf{p}|| \\
     &amp;amp;= ||\textbf{a}t^3 + \textbf{b}t^2 + \textbf{c}t + \textbf{d} - \textbf{p}||
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Our goal is to find the &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; value where &lt;span class=&quot;math inline&quot;&gt;d(t)&lt;/span&gt; is the smallest.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://registry.khronos.org/OpenGL-Refpages/gl4/html/length.xhtml&quot;&gt;length&lt;/a&gt; formula has an annoying square root, so we start with the distance
squared for simplicity, which we are going to unroll:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
D(t) &amp;amp;= d(t)^2 \\
     &amp;amp;= ||\textbf{a}t^3 + \textbf{b}t^2 + \textbf{c}t + \textbf{d} - \textbf{p}||^2 \\
     &amp;amp;= (a_xt^3 + b_xt^2 + c_xt + d_x - p_x)^2 + (a_yt^3 + b_yt^2 + c_yt + d_y - p_y)^2
\end{aligned}
&lt;/div&gt;
&lt;p&gt;The derivative of that function will allow us to identify critical points:
that is, points where the distance starts growing or reducing. Said differently,
solving &lt;span class=&quot;math inline&quot;&gt;D&#x27;(t)=0&lt;/span&gt; will identify all the maximums and minimums (we&#x27;re interested
in the latter) of &lt;span class=&quot;math inline&quot;&gt;D(t)&lt;/span&gt; (and thus &lt;span class=&quot;math inline&quot;&gt;d(t)&lt;/span&gt; as well).&lt;/p&gt;
&lt;p&gt;It is a bit convoluted in our case but straightforward to compute:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
D&#x27;(t) &amp;amp;= 2(3a_xt^2 + 2b_xt + c_x)(a_xt^3 + b_xt^2 + c_xt + d_x - p_x) \\
      &amp;amp;+ 2(3a_yt^2 + 2b_yt + c_y)(a_yt^3 + b_yt^2 + c_yt + d_y - p_y) \\
      &amp;amp;= 6a_x^2t^5 + 10a_xb_xt^4 + (8a_xc_x + 4b_x^2)t^3 + 6(a_xd_x + b_xc_x - a_xp_x)t^2 + (4b_x(d_x-p_x) + 2c_x^2)t + 2c_x(d_x-p_x) \\
      &amp;amp;+ 6a_y^2t^5 + 10a_yb_yt^4 + (8a_yc_y + 4b_y^2)t^3 + 6(a_yd_y + b_yc_y - a_yp_y)t^2 + (4b_y(d_y-p_y) + 2c_y^2)t + 2c_y(d_y-p_y) \\
      &amp;amp;= t^5  6(a_x^2+a_y^2) \\
      &amp;amp;+ t^4  10(a_xb_x+a_yb_y) \\
      &amp;amp;+ t^3  (8(a_xc_x+a_yc_y)+4(b_x^2+b_y^2)) \\
      &amp;amp;+ t^2  6(a_x(d_x-p_x)+a_y(d_y-p_y) + b_xc_x+b_yc_y) \\
      &amp;amp;+ t    (4(b_x(d_x-p_x)+b_y(d_y-p_y)) + 2(c_x^2+c_y^2)) \\
      &amp;amp;+      2(c_x(d_x-p_x)+c_y(d_y-p_y)) \\
\end{aligned}
&lt;/div&gt;
&lt;p&gt;A polynomial, this time of degree 5, emerges here. For conciseness, we can
express the polynomial coefficients of &lt;span class=&quot;math inline&quot;&gt;D&#x27;(t)&lt;/span&gt; as a bunch of dot products:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
D&#x27;(t) &amp;amp;= t^5 6(\textbf{a}\cdot\textbf{a}) \\
      &amp;amp;+ t^4 10(\textbf{a}\cdot\textbf{b}) \\
      &amp;amp;+ t^3 (8(\textbf{a}\cdot\textbf{c})+4(\textbf{b}\cdot\textbf{b})) \\
      &amp;amp;+ t^2 6(\textbf{a}\cdot(\textbf{d}-\textbf{p}) + \textbf{b}\cdot\textbf{c}) \\
      &amp;amp;+ t   (4(\textbf{b}\cdot(\textbf{d}-\textbf{p})) + 2(\textbf{c}\cdot\textbf{c})) \\
      &amp;amp;+     2(\textbf{c}\cdot(\textbf{d}-\textbf{p})) \\
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Finally, we notice that solving &lt;span class=&quot;math inline&quot;&gt;D&#x27;(t)=0&lt;/span&gt; is equivalent to solving &lt;span class=&quot;math inline&quot;&gt;D&#x27;(t)/2 =
0&lt;/span&gt;, so we simplify the expression:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
D&#x27;(t)/2 &amp;amp;= t^5 3(\textbf{a}\cdot\textbf{a}) \\
        &amp;amp;+ t^4 5(\textbf{a}\cdot\textbf{b}) \\
        &amp;amp;+ t^3 (4(\textbf{a}\cdot\textbf{c})+2(\textbf{b}\cdot\textbf{b})) \\
        &amp;amp;+ t^2 3(\textbf{a}\cdot(\textbf{d}-\textbf{p}) + \textbf{b}\cdot\textbf{c}) \\
        &amp;amp;+ t   (2(\textbf{b}\cdot(\textbf{d}-\textbf{p})) + \textbf{c}\cdot\textbf{c}) \\
        &amp;amp;+     \textbf{c}\cdot(\textbf{d}-\textbf{p}) \\
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Assuming we are able to solve this equation, we will get at most 5 values of
&lt;em&gt;t&lt;/em&gt;, among which we should find the shortest distance from &lt;em&gt;p&lt;/em&gt; to the curve.
Since &lt;em&gt;t&lt;/em&gt; is bound within 0 and 1 (start and end of the curve), we will also
have to test the distance at these locations.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;We could also compute the 2nd derivative in order to differentiate minimums
from maximums, but simply evaluating the 5(+2) potential &lt;em&gt;t&lt;/em&gt; values and
keeping the smallest works just fine.&lt;/p&gt;
&lt;/div&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/bezier-distance/critical-points.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Distance from a random point to critical testing points of the curve&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The red dot in the blue field is a random point in space. The red lines show
which distances are evaluated (at most 5+2) to find the smallest one.&lt;/p&gt;
&lt;h3&gt;Translated to GLSL code&lt;/h3&gt;
&lt;p&gt;Transposing these formulas into code gives us this base template code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float bezier_distance(vec2 p, vec2 p0, vec2 p1, vec2 p2, vec2 p3) {
    // Start by testing the distance to the boundary points at t=0 (p0) and t=1 (p3)
    vec2 dp0 = p0 - p,
         dp3 = p3 - p;
    float dist = min(dot(dp0, dp0), dot(dp3, dp3));

    // Bezier cubic points to polynomial coefficients
    vec2 a = -p0 + 3.0*(p1 - p2) + p3,
         b = 3.0 * (p0 - 2.0*p1 + p2),
         c = 3.0 * (p1 - p0),
         d = p0;

    // Solve D&#x27;(t)=0 where D(t) is the distance squared
    vec2 dmp = d - p;
    float da = 3.0 * dot(a, a),
          db = 5.0 * dot(a, b),
          dc = 4.0 * dot(a, c) + 2.0 * dot(b, b),
          dd = 3.0 * (dot(a, dmp) + dot(b, c)),
          de = 2.0 * dot(b, dmp) + dot(c, c),
          df = dot(c, dmp);

    float roots[5];
    int count = root_find5(roots, da, db, dc, dd, de, df);
    for (int i = 0; i &amp;lt; count; i++) {
        float t = roots[i];
        // Evaluate the distance to our point p and keep the smallest
        vec2 dp = ((a * t + b) * t + c) * t + dmp;
        dist = min(dist, dot(dp, dp));
    }

    // We&#x27;ve been working with the squared distance so far, it&#x27;s time to get its
    // square root
    return sqrt(dist);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;code&gt;dot(dp,dp)&lt;/code&gt; is a shorthand for the squared length, which is of course
cheaper than computing &lt;code&gt;length()&lt;/code&gt;, which involves a square root.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;We assume here the root finder only returns the roots that are within &lt;span class=&quot;math inline&quot;&gt;[0,1]&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
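&lt;p&gt;Before hunting for roots, we can sanity-check these quintic coefficients on the
CPU: evaluated as a polynomial, they must match a finite difference of the squared
distance. Here is a quick Python sketch (the control points and query point are
arbitrary values picked for the test):&lt;/p&gt;

```python
# 2D points represented as Python complex numbers; dot(u, v) = Re(u * conj(v))
def dot(u, v):
    return (u * v.conjugate()).real

# Hypothetical control points p0..p3 and query point p, picked for the test
p0, p1, p2, p3, p = 0+0j, 1+2j, 3+2j, 4+0j, 2+3j

# Bezier cubic points to polynomial coefficients (same mapping as the GLSL)
a = -p0 + 3*(p1 - p2) + p3
b = 3*(p0 - 2*p1 + p2)
c = 3*(p1 - p0)
d = p0
dmp = d - p

# Quintic coefficients of D'(t)/2, where D(t) is the squared distance
da = 3*dot(a, a)
db = 5*dot(a, b)
dc = 4*dot(a, c) + 2*dot(b, b)
dd = 3*(dot(a, dmp) + dot(b, c))
de = 2*dot(b, dmp) + dot(c, c)
df = dot(c, dmp)

def D(t):  # squared distance from p to the curve point B(t)
    bt = ((a*t + b)*t + c)*t + d
    return abs(bt - p)**2

def quintic(t):  # analytic D'(t)/2
    return ((((da*t + db)*t + dc)*t + dd)*t + de)*t + df

# A centered finite difference of D(t), halved, must match the quintic
h = 1e-6
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    fd = (D(t + h) - D(t - h)) / (4*h)  # numerical D'(t)/2
    assert abs(fd - quintic(t)) < 1e-4 * (1 + abs(quintic(t)))
print("quintic matches the derivative of the squared distance")
```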
&lt;p&gt;&lt;code&gt;root_find5()&lt;/code&gt; is our 5th degree root finder, that is, the function that gives
us all the &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; values (at most 5) which satisfy:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
at^5+bt^4+ct^3+dt^2+et+f = 0
&lt;/div&gt;
&lt;p&gt;But before we can solve that, we need to study how to solve the simpler 2nd
degree polynomial:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
at^2+bt+c = 0
&lt;/div&gt;
&lt;h2&gt;Solving quadratic polynomial equations&lt;/h2&gt;
&lt;p&gt;Diving into the rabbit hole of solving polynomials numerically will lead you
to insanity. But we still have to scratch the surface because higher degree
solvers usually rely on the lower degree ones.&lt;/p&gt;
&lt;p&gt;My favorite quadratic root finding formula is the super simple one introduced
by &lt;a href=&quot;https://www.youtube.com/watch?v=MHXO86wKeDY&quot;&gt;3Blue1Brown&lt;/a&gt;, which involves locating a midpoint &lt;span class=&quot;math inline&quot;&gt;m&lt;/span&gt; from
which you get the 2 surrounding roots &lt;span class=&quot;math inline&quot;&gt;r&lt;/span&gt;:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
m &amp;amp;= -\frac{b}{2a} \\
r &amp;amp;= m \pm \sqrt{m^2-\frac{c}{a}}
\end{aligned}
&lt;/div&gt;
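&lt;p&gt;As a quick numeric check of the midpoint formulation, here is a Python sketch
on a factored quadratic with known roots:&lt;/p&gt;

```python
import math

# x^2 - 3x + 2 = 0 has roots 1 and 2
a, b, c = 1.0, -3.0, 2.0

m = -b / (2*a)             # midpoint between the two roots
z = math.sqrt(m*m - c/a)   # half-distance between them
r0, r1 = m - z, m + z
print(r0, r1)  # → 1.0 2.0
```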
&lt;p&gt;In GLSL, code covering the most common corner cases would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Return true if x is not a NaN nor an infinite
// highp is probably mandatory to force IEEE 754 compliance
bool isfinite(highp float x) { return (floatBitsToUint(x) &amp;amp; 0x7f800000u) != 0x7f800000u; }

// Quadratic: solve ax²+bx+c=0
int root_find2(out float r[5], float a, float b, float c) {
    int count = 0;
    float m = -b / (2.*a);
    float d = m*m - c/a;
    if (!isfinite(m) || !isfinite(d)) { // a is (probably) too small
        // Linear: solve bx+c=0
        float s = -c / b;
        if (isfinite(s))
            r[count++] = s;
        return count;
    }
    if (d &amp;lt; 0.) // no root
        return count;
    if (d == 0.) {
        r[count++] = m; // single root
        return count;
    }
    float z = sqrt(d);
    r[count++] = m - z;
    r[count++] = m + z;
    return count;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Not quite as straightforward as the math formula, is it?&lt;/p&gt;
&lt;p&gt;We cannot know in advance whether a division is going to succeed, so we run
the divisions and only then check whether they failed (and assume a reason for
the failure). This is much more reliable than an arbitrary epsilon value. We
also try to avoid duplicated roots.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;The roots are automatically sorted because &lt;em&gt;z&lt;/em&gt; is always positive.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;&lt;code&gt;isfinite()&lt;/code&gt; may not be as reliable because in GLSL &amp;quot;NaNs are not required
to be generated&amp;quot;, meaning some edge case may not be supported depending on
the hardware, drivers, and the current weather in Yokohama.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As much as I like it, this implementation might not be the most
stable numerically (even though I don&#x27;t have strong data to back
this claim). Instead, we may prefer the formula from &lt;a href=&quot;https://numerical.recipes/&quot;&gt;Numerical
Recipes&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
\delta &amp;amp;= b^2-4ac \\
q &amp;amp;= -\frac{1}{2} (b + \mathrm{sign}(b)\sqrt{\delta}) \\
r_0 &amp;amp;= \frac{q}{a} \\
r_1 &amp;amp;= \frac{c}{q}
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Leading to the following alternative implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;int root_find2(out float r[5], float a, float b, float c) {
    int count = 0;
    float d = b*b - 4.*a*c;
    if (d &amp;lt; 0.)
        return count;
    if (d == 0.) {
        float s = -.5 * b / a;
        if (isfinite(s))
            r[count++] = s;
        return count;
    }
    float h = sqrt(d);
    float q = -.5 * (b + (b &amp;gt; 0. ? h : -h));
    float r0 = q/a, r1 = c/q;
    if (isfinite(r0)) r[count++] = r0;
    if (isfinite(r1)) r[count++] = r1;
    return count;
}
&lt;/code&gt;&lt;/pre&gt;
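&lt;p&gt;The point of the &lt;em&gt;q&lt;/em&gt; formulation is to avoid the catastrophic cancellation
of the textbook formula when &lt;em&gt;b&lt;/em&gt; dwarfs the other coefficients. A Python
sketch contrasting the two on an ill-conditioned quadratic:&lt;/p&gt;

```python
import math

# Ill-conditioned quadratic: x^2 - 1e8*x + 1 = 0, with a tiny root near 1e-8.
# The textbook formula computes it as -b - sqrt(d), which cancels badly.
a, b, c = 1.0, -1e8, 1.0
d = b*b - 4*a*c
h = math.sqrt(d)

naive_small = (-b - h) / (2*a)         # textbook quadratic formula
q = -0.5 * (b + math.copysign(h, b))   # Numerical Recipes variant
nr_small = c / q

# Since root_big * root_small = c/a = 1, the small root is 1e-8 to ~1e-16 accuracy
print(naive_small, nr_small)
assert abs(naive_small / 1e-8 - 1) > 0.1   # several digits lost
assert abs(nr_small / 1e-8 - 1) < 1e-9     # accurate
```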
&lt;p&gt;This is not perfect at all (especially the &lt;span class=&quot;math inline&quot;&gt;b²-4ac&lt;/span&gt; part). There
are actually many other possible implementations, and &lt;a href=&quot;https://cnrs.hal.science/hal-04116310v1/&quot;&gt;this HAL CNRS
paper&lt;/a&gt; shows how nearly impossible it is to make a correct one. It is
an interesting but &lt;a href=&quot;https://fosstodon.org/@bug/115351364099509082&quot;&gt;depressing&lt;/a&gt; read, especially since it &amp;quot;only&amp;quot; covers IEEE
754 floats, and we have no such guarantee on GPUs. We also don&#x27;t have &lt;code&gt;fma()&lt;/code&gt; in
WebGL, which greatly limits the possible improvements. For now, it will have to do.&lt;/p&gt;
&lt;h2&gt;Solving quintic polynomial equations: attempt 1&lt;/h2&gt;
&lt;p&gt;Polynomials of degree 5 cannot be solved analytically like quadratics.
And even if they could, we probably wouldn&#x27;t do it because of numerical
instability. Typically, in my experience, analytical 3rd degree polynomial
solvers do not provide reliable results.&lt;/p&gt;
&lt;p&gt;The first iterative algorithm I picked was the &lt;a href=&quot;https://en.wikipedia.org/wiki/Aberth_method&quot;&gt;Aberth–Ehrlich method&lt;/a&gt;.
Nowadays, more appropriate algorithms exist, but at the time I started messing
around with these problems (several years ago), it was a fairly good contender.
&lt;a href=&quot;https://www.youtube.com/watch?v=XIzCzfMDSzk&quot;&gt;This video&lt;/a&gt; explores how it works.&lt;/p&gt;
&lt;p&gt;The convergence to the roots is quick, and it&#x27;s overall simple to implement. But
it&#x27;s not without flaws. The main problem is that it works in complex space. We
can&#x27;t ignore the complex roots because they all &amp;quot;respond&amp;quot; to each other. And
filtering these roots out at the end implies some unreliable arbitrary threshold
mechanism (we keep a root only when its imaginary part is close to 0).&lt;/p&gt;
&lt;p&gt;The initialization process also annoyingly requires you to come up with a guess
at what the roots are, and the method doesn&#x27;t provide anything relevant to start
from. Aberth-Ehrlich works by refining these initial roots, similar to a more
elaborate Newton iteration. Choosing better initial estimates leads to faster
convergence (meaning fewer iterations).&lt;/p&gt;
&lt;p&gt;The Cauchy bound defines the radius of a disk (complex numbers live in 2D
space) within which all the roots of a polynomial lie. We are going to use it
for the initial guess, and more specifically its &amp;quot;tight&amp;quot;
version (which unfortunately relies on &lt;code&gt;pow()&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Since Aberth-Ehrlich is a refinement and not just a shrinking process, we define
and use an inner disk that has half the area of the Cauchy bound disk. That way,
we&#x27;re more likely to start with initial guesses spread in the &amp;quot;middle&amp;quot; of the
roots; this is where the √2 comes from in the formula below.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://blog.pkh.me/img/bezier-distance/cauchy.png&quot; alt=&quot;Tight cauchy bound&quot; /&gt;&lt;/p&gt;
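&lt;p&gt;As a sanity check of the general idea, here is the classic (non-tight) Cauchy
bound, 1 + max|a_i/a_n|, verified with a small Python sketch on a quintic with
known roots:&lt;/p&gt;

```python
# Classic Cauchy bound: all the roots of the monic polynomial
# x^n + a_{n-1}x^{n-1} + ... + a_0 lie in the disk |z| <= 1 + max|a_i|.
# Sample quintic with known roots -2, -1, 1, 2 and 3:
#   (x+2)(x+1)(x-1)(x-2)(x-3) = x^5 - 3x^4 - 5x^3 + 15x^2 + 4x - 12
coeffs = [-3.0, -5.0, 15.0, 4.0, -12.0]  # a4..a0
r = 1.0 + max(abs(k) for k in coeffs)
print(r)  # → 16.0
assert all(abs(root) < r for root in (-2, -1, 1, 2, 3))
```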
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;#define K5_0 vec2( 0.951056516295154,  0.309016994374947)
#define K5_1 vec2( 0.000000000000000,  1.000000000000000)
#define K5_2 vec2(-0.951056516295154,  0.309016994374948)
#define K5_3 vec2(-0.587785252292473, -0.809016994374947)
#define K5_4 vec2( 0.587785252292473, -0.809016994374948)

int root_find5_aberth(out float roots[5], float a, float b, float c, float d, float e, float f) {
    // Initial candidates set mid-way of the tight Cauchy bound estimate
    float r = (1.0 + max_5(
        pow(abs(b/a), 1.0/5.0),
        pow(abs(c/a), 1.0/4.0),
        pow(abs(d/a), 1.0/3.0),
        pow(abs(e/a), 1.0/2.0),
            abs(f/a))) / sqrt(2.0);

    // Spread in a circle
    vec2 r0 = r * K5_0,
         r1 = r * K5_1,
         r2 = r * K5_2,
         r3 = r * K5_3,
         r4 = r * K5_4;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The circle constants are generated with the following script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import math
import sys

n = int(sys.argv[1])
for k in range(n):
    angle = 2 * math.pi / n
    off = math.pi / (2 * n)
    z = angle * k + off
    c, s = math.cos(z), math.sin(z)
    print(f&amp;quot;#define K{n}_{k} vec2({c:18.15f}, {s:18.15f})&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, it&#x27;s basically a simple iterative process. Unrolling everything for degree
5 looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;#define close_to_zero(x) (abs(x) &amp;lt; eps)

// This also filters out roots out of the [0,1] range
#define ADD_ROOT_IF_REAL(r) if (close_to_zero(r.y) &amp;amp;&amp;amp; r.x &amp;gt;= 0. &amp;amp;&amp;amp; r.x &amp;lt;= 1.) roots[count++] = r.x

#define SMALL_OFF(off) (dot(off, off) &amp;lt;= eps*eps)

/* Complex multiply, divide, inverse */
vec2 c_mul(vec2 a, vec2 b) { return mat2(a, -a.y, a.x) * b; }
vec2 c_div(vec2 a, vec2 b) { return mat2(a, a.y, -a.x) * b / dot(b, b); }
vec2 c_inv(vec2 z)         { return vec2(z.x, -z.y) / dot(z, z); }

// Compute f(x)/f&#x27;(x): complex polynomial evaluation (y) divided by their
// derivatives (q) using Horner&#x27;s method in one pass
vec2 c_poly5d4(float a, float b, float c, float d, float e, float f, vec2 x) {
    vec2 y =       a*x  + vec2(b, 0), q =       a*x  + y;
         y = c_mul(y,x) + vec2(c, 0); q = c_mul(q,x) + y;
         y = c_mul(y,x) + vec2(d, 0); q = c_mul(q,x) + y;
         y = c_mul(y,x) + vec2(e, 0); q = c_mul(q,x) + y;
         y = c_mul(y,x) + vec2(f, 0);
    return c_div(y, q);
}

vec2 sum_of_inv(vec2 z0, vec2 z1, vec2 z2, vec2 z3, vec2 z4) { return c_inv(z0 - z1) + c_inv(z0 - z2) + c_inv(z0 - z3) + c_inv(z0 - z4); }

int root_find5_aberth(out float roots[5], float a, float b, float c, float d, float e, float f) {
    if (close_to_zero(a))
        return root_find4_aberth(roots, b, c, d, e, f);

    // Code snip: see previous snippet
    // float r = ...
    // vec2 r0, r1, r2, ... 

    for (int m = 0; m &amp;lt; 16; m++) {
        vec2 d0 = c_poly5d4(a, b, c, d, e, f, r0),
             d1 = c_poly5d4(a, b, c, d, e, f, r1),
             d2 = c_poly5d4(a, b, c, d, e, f, r2),
             d3 = c_poly5d4(a, b, c, d, e, f, r3),
             d4 = c_poly5d4(a, b, c, d, e, f, r4);

        vec2 off0 = c_div(d0, vec2(1,0) - c_mul(d0, sum_of_inv(r0, r1, r2, r3, r4))),
             off1 = c_div(d1, vec2(1,0) - c_mul(d1, sum_of_inv(r1, r0, r2, r3, r4))),
             off2 = c_div(d2, vec2(1,0) - c_mul(d2, sum_of_inv(r2, r0, r1, r3, r4))),
             off3 = c_div(d3, vec2(1,0) - c_mul(d3, sum_of_inv(r3, r0, r1, r2, r4))),
             off4 = c_div(d4, vec2(1,0) - c_mul(d4, sum_of_inv(r4, r0, r1, r2, r3)));

        r0 -= off0;
        r1 -= off1;
        r2 -= off2;
        r3 -= off3;
        r4 -= off4;

        if (SMALL_OFF(off0) &amp;amp;&amp;amp; SMALL_OFF(off1) &amp;amp;&amp;amp; SMALL_OFF(off2) &amp;amp;&amp;amp; SMALL_OFF(off3) &amp;amp;&amp;amp; SMALL_OFF(off4))
            break;
    }

    int count = 0;
    ADD_ROOT_IF_REAL(r0);
    ADD_ROOT_IF_REAL(r1);
    ADD_ROOT_IF_REAL(r2);
    ADD_ROOT_IF_REAL(r3);
    ADD_ROOT_IF_REAL(r4);
    return count;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the main coefficient is too small, we fall back on the 4th degree solver
(and so on until we reach the analytic quadratic). The 4th and 3rd degree
versions of this function are easy to guess (they&#x27;re pretty much identical,
just dropping one coefficient at each degree).&lt;/p&gt;
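&lt;p&gt;The interleaved Horner evaluation done by &lt;code&gt;c_poly5d4&lt;/code&gt; is easy to check
off-GPU with Python&#x27;s built-in complex numbers; a small sketch on &lt;em&gt;f(x) = x^5 - 1&lt;/em&gt;:&lt;/p&gt;

```python
# One-pass Horner evaluation of f(x)/f'(x), the same interleaving as
# c_poly5d4, using Python's native complex arithmetic
def poly5_over_deriv(a, b, c, d, e, f, x):
    y = a*x + b; q = a*x + y
    y = y*x + c; q = q*x + y
    y = y*x + d; q = q*x + y
    y = y*x + e; q = q*x + y
    y = y*x + f
    return y / q  # f(x) / f'(x)

# Real test point: f(2) = 31, f'(2) = 5*2^4 = 80
assert abs(poly5_over_deriv(1, 0, 0, 0, 0, -1, 2+0j) - 31/80) < 1e-12

# Complex test point: f(1+i) = -5-4i, f'(1+i) = 5(1+i)^4 = -20
assert abs(poly5_over_deriv(1, 0, 0, 0, 0, -1, 1+1j) - (0.25+0.2j)) < 1e-12
print("one-pass Horner f/f' checks out")
```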
&lt;p&gt;We also hardcode a maximum of 16 iterations here because it&#x27;s usually
enough. To get an idea of how many iterations are required in practice,
here is a visualization of the heat map of the iteration count for every
pixel:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/bezier-distance/aberth-heatmap.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Heat map of the iterations of the Aberth-Ehrlich algorithm&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The big picture and the weaknesses of the algorithm should be pretty obvious
by now. Among all the drawbacks of this approach, there are also surprising
pathological cases where the algorithm performs poorly. Fortunately,
there has been some progress on the state of the art in recent years.&lt;/p&gt;
&lt;h2&gt;Solving quintic polynomial equations: the state of the art&lt;/h2&gt;
&lt;p&gt;In 2022, &lt;a href=&quot;https://www.cemyuksel.com/research/polynomials/&quot;&gt;Cem Yuksel published a new algorithm for polynomial root
solving&lt;/a&gt;. Initially I had my reservations because the &lt;a href=&quot;https://github.com/cemyuksel/cyCodeBase/&quot;&gt;official
implementation&lt;/a&gt; had a &lt;a href=&quot;https://github.com/cemyuksel/cyCodeBase/issues/20&quot;&gt;few shortcomings on some edge
cases&lt;/a&gt;, which made me question its reliability. It&#x27;s also
optimized for CPU computation and is, to my very personal taste, overly complex.&lt;/p&gt;
&lt;p&gt;Fortunately, Christoph Peters &lt;a href=&quot;https://momentsingraphics.de/GPUPolynomialRoots.html&quot;&gt;showed that it was possible on the
GPU&lt;/a&gt; by implementing it for very large degrees, and without any
recursion. Inspired by that, I decided to unroll it myself for degree 5.&lt;/p&gt;
&lt;p&gt;One core difference with the Aberth approach is that it is designed for arbitrary
ranges. In our case this is actually convenient because, due to how Bézier
curves are defined, we are only interested in roots between 0 and 1. We will
need to adjust the quadratic function to work in this range, as well as keep
the roots ordered:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;     }
     float h = sqrt(d);
     float q = -.5 * (b + (b &amp;gt; 0. ? h : -h));
-    float r0 = q/a, r1 = c/q;
-    if (isfinite(r0)) r[count++] = r0;
-    if (isfinite(r1)) r[count++] = r1;
+    vec2 v = vec2(q/a, c/q);
+    if (v.x &amp;gt; v.y) v.xy = v.yx; // keep them ordered
+    if (isfinite(v.x) &amp;amp;&amp;amp; v.x &amp;gt;= 0. &amp;amp;&amp;amp; v.x &amp;lt;= 1.) r[count++] = v.x;
+    if (isfinite(v.y) &amp;amp;&amp;amp; v.y &amp;gt;= 0. &amp;amp;&amp;amp; v.y &amp;lt;= 1.) r[count++] = v.y;
     return count;
 }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The core logic of the algorithm relies on a cascade of derivatives, one per
degree. Christoph Peters provides an analytic formula to obtain the derivative
for any degree. This is a huge help when working with an arbitrary
degree, but in our case we can just differentiate manually:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
f_5(x) &amp;amp;= ax^5+bx^4+cx^3+dx^2+ex+f \\
f_4(x) &amp;amp;= 5ax^4+4bx^3+3cx^2+2dx+e \\
f_3(x) &amp;amp;= 20ax^3+12bx^2+6cx+2d \\
f_2(x) &amp;amp;= 60ax^2+24bx+6c
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Since we&#x27;re only interested in the roots, similar to what we did with &lt;span class=&quot;math inline&quot;&gt;D(t)&lt;/span&gt;, we
can scale some of these expressions by a positive constant without changing
their zeroes (here dividing &lt;span class=&quot;math inline&quot;&gt;f_3&lt;/span&gt; by 2 and &lt;span class=&quot;math inline&quot;&gt;f_2&lt;/span&gt; by 6):&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
f_5(x) &amp;amp;= ax^5+bx^4+cx^3+dx^2+ex+f \\
f_4(x) &amp;amp;= 5ax^4+4bx^3+3cx^2+2dx+e \\
f_3(x) &amp;amp;= 10ax^3+6bx^2+3cx+d \\
f_2(x) &amp;amp;= 10ax^2+4bx+c
\end{aligned}
&lt;/div&gt;
&lt;p&gt;The purpose of that cascade of derivatives is to cut the curve into monotonic
segments. In practice, the core function looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;int root_find5_cy(out float r[5], float a, float b, float c, float d, float e, float f) {
    float r2[5], r3[5], r4[5];
    int n = root_find2(r2,          10.*a, 4.*b,    c);            // degree 2
    n = cy_find5(r3, r2, n, 0., 0., 10.*a, 6.*b, 3.*c,   d);       // degree 3
    n = cy_find5(r4, r3, n,     0.,  5.*a, 4.*b, 3.*c, d+d, e);    // degree 4
    n = cy_find5(r,  r4, n,             a,    b,    c,   d, e, f); // degree 5
    return n;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We could unroll &lt;code&gt;cy_find3&lt;/code&gt;, &lt;code&gt;cy_find4&lt;/code&gt;, and &lt;code&gt;cy_find5&lt;/code&gt;, but to keep the
code simple, degrees 3 to 5 share the same function, with the leading
coefficients set to 0 (hopefully the compiler does its job properly).&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cy_find5&lt;/code&gt; relies on the roots found at the previous stage (at most 4) to
define the search intervals:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://blog.pkh.me/img/bezier-distance/root_find5.png&quot; alt=&quot;Finding roots at degree 5&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Such an approach has the nice side effect of keeping the roots ordered.&lt;/p&gt;
&lt;p&gt;The solver itself is not that complex either:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float poly5(float a, float b, float c, float d, float e, float f, float t) {
     return ((((a * t + b) * t + c) * t + d) * t + e) * t + f;
}

// Quintic: solve ax⁵+bx⁴+cx³+dx²+ex+f=0
int cy_find5(out float r[5], float r4[5], int n, float a, float b, float c, float d, float e, float f) {
    int count = 0;
    vec2 p = vec2(0, poly5(a,b,c,d,e,f, 0.));
    for (int i = 0; i &amp;lt;= n; i++) {
        float x = i == n ? 1. : r4[i],
              y = poly5(a,b,c,d,e,f, x);
        if (p.y * y &amp;gt; 0.)
            continue;
        float v = bisect5(a,b,c,d,e,f, vec2(p.x,x), vec2(p.y,y));
        r[count++] = v;
        p = vec2(x, y);
    }
    return count;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last brick of the algorithm is the Newton bisection, the slowest part:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Newton bisection
//
// a,b,c,d,e,f: 5th degree polynomial parameters
// t: x-axis boundaries
// v: respectively f(t.x) and f(t.y)
float bisect5(float a, float b, float c, float d, float e, float f, vec2 t, vec2 v) {
    float x = (t.x+t.y) * .5; // mid point
    float s = v.x &amp;lt; v.y ? 1. : -1.; // sign flip
    for (int i = 0; i &amp;lt; 32; i++) {
        // Evaluate polynomial (y) and its derivative (q) using Horner&#x27;s method in one pass
        float y = a*x + b, q = a*x + y;
              y = y*x + c; q = q*x + y;
              y = y*x + d; q = q*x + y;
              y = y*x + e; q = q*x + y;
              y = y*x + f;

        t = s*y &amp;lt; 0. ? vec2(x, t.y) : vec2(t.x, x);
        float next = x - y/q; // Newton iteration
        next = next &amp;gt;= t.x &amp;amp;&amp;amp; next &amp;lt;= t.y ? next : (t.x+t.y) * .5;
        if (abs(next - x) &amp;lt; eps)
            return next;
        x = next;
    }
    return x;
}
&lt;/code&gt;&lt;/pre&gt;
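&lt;p&gt;To convince ourselves the unrolled logic is correct, here is a direct Python
port of &lt;code&gt;bisect5&lt;/code&gt; (with &lt;code&gt;eps&lt;/code&gt;, which the snippet leaves undefined,
assumed to be 1e-7), hunting the single root of &lt;em&gt;x^5 = 0.5&lt;/em&gt; on [0,1]:&lt;/p&gt;

```python
# CPU port of bisect5, assuming eps = 1e-7 (not specified in the shader snippet)
def bisect5(a, b, c, d, e, f, t0, t1, v0, v1, eps=1e-7):
    x = (t0 + t1) * 0.5  # midpoint
    s = 1.0 if v0 < v1 else -1.0  # sign flip
    for _ in range(32):
        # Polynomial (y) and its derivative (q) with Horner's method in one pass
        y = a*x + b; q = a*x + y
        y = y*x + c; q = q*x + y
        y = y*x + d; q = q*x + y
        y = y*x + e; q = q*x + y
        y = y*x + f
        if s*y < 0.0:
            t0 = x
        else:
            t1 = x
        nxt = x - y/q  # Newton iteration
        if not (t0 <= nxt <= t1):
            nxt = (t0 + t1) * 0.5  # fall back to plain bisection
        if abs(nxt - x) < eps:
            return nxt
        x = nxt
    return x

# x^5 - 0.5 is increasing on [0,1], with a single root at 0.5**0.2
root = bisect5(1, 0, 0, 0, 0, -0.5, 0.0, 1.0, -0.5, 0.5)
print(root)
assert abs(root - 0.5**0.2) < 1e-6
```

Near the root the Newton steps converge quadratically, which is why the iteration count stays low despite the 32-iteration safety cap.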
&lt;p&gt;And that&#x27;s pretty much it. Its heat map has a completely
different look than Aberth&#x27;s:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/bezier-distance/cy-bisect-heatmap.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Heat map of the iterations of Cem Yuksel algorithm&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The number of iterations might be larger, but the solver is much faster (I
observed a factor of 3 on my machine), the code is shorter, and it is actually
more reliable.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;The scale used to represent the heat map is &lt;em&gt;not&lt;/em&gt; the same as the one used
for Aberth, but it is the same as the one used for the method presented in the
next section.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Exploring ITP convergence&lt;/h2&gt;
&lt;p&gt;The bisection being the hot loop, it is interesting to ponder how to make
it faster. A while back, &lt;a href=&quot;https://levien.com/&quot;&gt;Raph Levien&lt;/a&gt; hypothesized about how the &lt;a href=&quot;https://en.wikipedia.org/wiki/ITP_method&quot;&gt;ITP
method&lt;/a&gt; could perform here. Out of curiosity, I gave it a chance. The method
is designed to work like a bisection, claiming to be at least as performant in
the worst case.&lt;/p&gt;
&lt;p&gt;There isn&#x27;t a lot of code, and the paper provides pseudo-code. But
implementing it was actually challenging in many ways.&lt;/p&gt;
&lt;p&gt;First of all, the authors didn&#x27;t seem to find it relevant to mention that it only
works if &lt;span class=&quot;math inline&quot;&gt;f(a)&amp;lt;0&amp;lt;f(b)&lt;/span&gt;. If &lt;span class=&quot;math inline&quot;&gt;f(a)&amp;gt;0&amp;gt;f(b)&lt;/span&gt;, you&#x27;re pretty much on your own. It
only requires 2 lines of adjustment, but figuring out this shortcoming of the
algorithm was particularly unexpected.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://blog.pkh.me/img/bezier-distance/itp-ok-fail.png&quot; alt=&quot;ITP method failing case&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Another bothersome aspect concerns the parameters: &lt;span class=&quot;math inline&quot;&gt;K_1&lt;/span&gt;, &lt;span class=&quot;math inline&quot;&gt;K_2&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;n_0&lt;/span&gt;. The
paper proposes these:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
K_1 &amp;amp;= 0.1 \\
K_2 &amp;amp;= 0.98(1+\frac{1+\sqrt{5}}{2})\approx 2.56567 \\
n_0 &amp;amp;= ?
\end{aligned}
&lt;/div&gt;
&lt;p&gt;I played with them for a while and couldn&#x27;t find any set that made
a real difference, so I ended up with the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For performance reasons, reducing &lt;span class=&quot;math inline&quot;&gt;K_2&lt;/span&gt; to a value of 2 saves a call to
&lt;code&gt;pow()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For &lt;span class=&quot;math inline&quot;&gt;K_1&lt;/span&gt;, &lt;a href=&quot;https://cran.r-project.org/web/packages/itp/refman/itp.html&quot;&gt;CRAN&lt;/a&gt; seems to suggest &lt;span class=&quot;math inline&quot;&gt;\frac{0.2}{b-a}&lt;/span&gt;, so I went along with it.&lt;/li&gt;
&lt;li&gt;And for &lt;span class=&quot;math inline&quot;&gt;n_0&lt;/span&gt;, 1 or 2 seems to be the usual choice.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the end, the function looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// ITP algorithm (2020) by Oliveira &amp;amp; Takahashi
// &amp;quot;An Enhancement of the Bisection Method Average Performance Preserving Minmax Optimality&amp;quot;
//
// a,b,c,d,e,f: 5th degree polynomial parameters
// t: x-axis boundaries (a and b in the paper)
// v: respectively f(a) and f(b) in the paper (evaluation of the function with t.x and t.y)
float itp5(float a, float b, float c, float d, float e, float f, vec2 t, vec2 v) {
    float diff = t.y-t.x;

    // K1 and n0 suggested by CRAN
    float K1 = .2 / diff;
    int n0 = 1;

    // The paper has the assumption that f(a)&amp;lt;0&amp;lt;f(b) but we want to
    // support f(a)&amp;gt;0&amp;gt;f(b) too, so we keep a sign flip
    float s = v.x &amp;lt; v.y ? 1. : -1.;

    // Using log(ab)=log(a)+log(b): log2(x/(2ε)) &amp;lt;=&amp;gt; log2(x/ε)-1
    int nh = int(ceil(log2(diff/eps)-1.)); // n_{1/2} (half point)
    int n_max = nh + n0;

    // ε 2^(n_max-k) = ε 2^n_max 2^-k = ε 2^n_max ½^k
    // ½^k is done iteratively in the loop, simplifying the arithmetic
    float q = eps * float(1&amp;lt;&amp;lt;n_max);

    while (diff &amp;gt; eps+eps) {
        // Interpolation
        float xf = (v.y*t.x - v.x*t.y) / (v.y-v.x); // Regula-Falsi

        // Truncation
        float xh = (t.x+t.y) * .5; // x half point
        float sigma = sign(xh - xf);
        float delta = K1*diff*diff; // save a pow() by forcing K2=2
        float xt = delta &amp;lt;= abs(xh - xf) ? xf + sigma*delta : xh; // xt: truncation of xf

        // Projection
        float r = q - diff*.5;
        float x = abs(xt-xh) &amp;lt;= r ? xt : xh-sigma*r;

        // Updating
        float y = poly5(a,b,c,d,e,f, x);
        float side = s*y;
        if      (side &amp;gt; 0.) t.y=x, v.y=y;
        else if (side &amp;lt; 0.) t.x=x, v.x=y;
        else                return x;

        diff = t.y-t.x;
        q *= .5;
    }
    return (t.x+t.y) * .5;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function can be used as a drop-in replacement for &lt;code&gt;bisect5&lt;/code&gt;.&lt;/p&gt;
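&lt;p&gt;Like before, a line-for-line Python port (with the undefined &lt;code&gt;eps&lt;/code&gt;
assumed to be 1e-7) can be tested against the known root of &lt;em&gt;x^5 = 0.5&lt;/em&gt; on [0,1]:&lt;/p&gt;

```python
import math

def poly5(a, b, c, d, e, f, t):
    return ((((a*t + b)*t + c)*t + d)*t + e)*t + f

# CPU port of itp5, assuming eps = 1e-7 (not specified in the shader snippet)
def itp5(a, b, c, d, e, f, tx, ty, vx, vy, eps=1e-7):
    diff = ty - tx
    K1 = 0.2 / diff  # K1 and n0 suggested by CRAN
    n0 = 1
    s = 1.0 if vx < vy else -1.0  # support f(a) > 0 > f(b) too
    nh = math.ceil(math.log2(diff / eps) - 1.0)
    q = eps * 2.0**(nh + n0)
    while diff > eps + eps:
        xf = (vy*tx - vx*ty) / (vy - vx)  # interpolation (Regula-Falsi)
        xh = (tx + ty) * 0.5
        sigma = (xh > xf) - (xh < xf)  # sign(xh - xf) as -1/0/+1
        delta = K1 * diff * diff  # K2 forced to 2 to save a pow()
        xt = xf + sigma*delta if delta <= abs(xh - xf) else xh  # truncation
        r = q - diff*0.5
        x = xt if abs(xt - xh) <= r else xh - sigma*r  # projection
        y = poly5(a, b, c, d, e, f, x)
        side = s*y
        if side > 0.0:
            ty, vy = x, y
        elif side < 0.0:
            tx, vx = x, y
        else:
            return x
        diff = ty - tx
        q *= 0.5
    return (tx + ty) * 0.5

# x^5 - 0.5 on [0,1], root at 0.5**0.2
root = itp5(1, 0, 0, 0, 0, -0.5, 0.0, 1.0, -0.5, 0.5)
print(root)
assert abs(root - 0.5**0.2) < 1e-6
```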
&lt;p&gt;I had a lot of expectations about it, but in the end it requires more iterations
than the bisection we implemented. The paper claims to perform at least as well
as a bisection, but our &lt;code&gt;bisect5&lt;/code&gt; is driven by the derivatives, so it converges
much faster. Here is the heat map with &lt;code&gt;itp5&lt;/code&gt; instead of &lt;code&gt;bisect5&lt;/code&gt;:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/bezier-distance/cy-itp-heatmap.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Heat map of the iterations of Cem Yuksel algorithm with ITP method&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The naive unrolled version of Cem Yuksel&#x27;s algorithm is, so far, definitely the
best choice for our problem. I still have concerns about how to implement a good
quadratic formula, and I have my reservations about various edge cases. There
is also still room for improvement in the cubic solver (degree 3) because
it&#x27;s still a special case where analytical formulas exist, but in general this
implementation is satisfying.&lt;/p&gt;
&lt;p&gt;The next step is to work with chains of Bézier curves to make up complex shapes
(such as font glyphs). It will lead us to build a &lt;em&gt;signed&lt;/em&gt; distance field. This
is not trivial &lt;em&gt;at all&lt;/em&gt; and mandates one or several dedicated articles. We will
hopefully study these subjects in the not-so-distant future.&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/45-code-golfing-a-tiny-demo-using-maths-and-a-pinch-of-insanity.html</guid>
 <link>http://blog.pkh.me/p/45-code-golfing-a-tiny-demo-using-maths-and-a-pinch-of-insanity.html</link>
 <title>Code golfing a tiny demo using maths and a pinch of insanity</title>
 <pubDate>Mon, 29 Sep 2025 13:30:50 -0000</pubDate>
 <description>&lt;p&gt;A few weeks ago, I made a tiny demo that fits into 448 characters:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/red-alp.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Red Alp GLSL demo in 448 characters&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;void main(){vec3 c,p,K=vec3(3,1,0);for(float z,i,a,g=1.,t,h,d,w,k=.15;i++&amp;lt;1e2;d=max(max(d-3.
,-d),a=z)*k,w=g-g/exp(h&amp;gt;.001?a++,d/.4:h*3e2),g-=a*=w,c+=a*d*4.5+(d&amp;gt;z?z:h/2e2)*K,a=min(p.y+2.
,1.),c.r+=w*a*a*.1,t+=min(h*.2,k/=.985))for(p=normalize(vec3(P+P-R,R.y))*t,p.xz*=mat2(cos(
sin(T*.2)+K.zyxz*11.)),p.z+=T*.3,d=p.y,h=d+.5,a=.01;a&amp;lt;1.;a+=a)p.xz*=mat2(8,6,-6,8)*.1,d+=abs
(dot(sin((p/a+T)*.3),p-p+a)),h+=abs(dot(sin(p.xz*.6/a),P-P+a));O=vec4(tanh(c),1);}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;The demo was 464 characters at first, but thanks to the
community it got reduced further, and the article was updated accordingly.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;There is no texture, no mesh, no 3D helper: it&#x27;s simply a procedural
mathematical formula evaluated at each pixel to assign it a color. Code
golfing is about making the code as short as possible, and is thus part of the
artistic performance.&lt;/p&gt;
&lt;p&gt;To put things into perspective, the 853x480 JPEG thumbnail of this article is
167x larger than this code.&lt;/p&gt;
&lt;p&gt;You can watch a larger version on &lt;a href=&quot;https://b.pkh.me/2025-09-08-red-alp.htm&quot;&gt;its main dedicated page&lt;/a&gt;, or a
port on &lt;a href=&quot;https://www.shadertoy.com/view/WflfR8&quot;&gt;Shadertoy&lt;/a&gt; (484 chars). If your device is not powerful
enough (I&#x27;m sorry for the lag on this page) or doesn&#x27;t support WebGL2, a short
preview video can be seen on &lt;a href=&quot;https://fosstodon.org/@bug/115168470956294772&quot;&gt;Mastodon&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&#x27;m guessing the wizardry of the code has confused many people, so we&#x27;re going
to dive through the making-of together. Overall, this demo is a particularly
dense and entangled compilation of different techniques, where each aspect could
mandate a dedicated article. For that reason, some parts will prefer to link to
external resources when the literature is already verbose on the subject.&lt;/p&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;Some demos in this article will start &amp;quot;decaying&amp;quot; over time due to floating
point variables getting too large. Reloading the page should fix that.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;The base template&lt;/h2&gt;
&lt;p&gt;The code is written in GLSL and is executed for each pixel (technically each
fragment) on a simple quad geometry (to be accurate it&#x27;s even &lt;a href=&quot;https://wallisc.github.io/rendering/2021/04/18/Fullscreen-Pass.html&quot;&gt;a single big
triangle&lt;/a&gt;). There is no geometry aside from that, it&#x27;s basically just
a fragment shader.&lt;/p&gt;
&lt;p&gt;The fragment receives 3 different inputs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the canvas resolution &lt;code&gt;vec2 R&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the time &lt;code&gt;float T&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the pixel position &lt;code&gt;vec2 P&lt;/code&gt; (basically &lt;code&gt;gl_FragCoord.xy&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And it has to output a sRGB color in &lt;code&gt;out vec4 O&lt;/code&gt;. The code has to be written in
a &lt;code&gt;void main()&lt;/code&gt; function, and that&#x27;s pretty much all we need to start.&lt;/p&gt;
&lt;p&gt;If you&#x27;re curious about the glue to setup WebGL2, just look at the source code
on &lt;a href=&quot;https://b.pkh.me/2025-09-08-red-alp.htm&quot;&gt;the dedicated page&lt;/a&gt;. There is no external dependency and the canvas
setup code is pretty simple.&lt;/p&gt;
&lt;h2&gt;Development setup&lt;/h2&gt;
&lt;p&gt;For development, people usually use Shadertoy directly. I prefer to use
my own local live coding environment: &lt;a href=&quot;https://github.com/ubitux/ShaderWorkshop&quot;&gt;ShaderWorkshop&lt;/a&gt;. It can be run
without setting up anything, just &lt;code&gt;uv run --with shader-workshop sw-server&lt;/code&gt;
(assuming the &lt;code&gt;uv&lt;/code&gt; Python package manager is installed on the machine).
Aside from the comfort of being able to use your favorite code editor, it
makes instancing live controls for uniforms very easy, making it smooth to
interact with any value and get immediate feedback.&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&quot;http://blog.pkh.me/img/demomaking/shader-workshop.png&quot; alt=&quot;&quot; /&gt;
    &lt;figcaption&gt;Red Alp demo with user controls as seen from ShaderWorkshop&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Noise&lt;/h2&gt;
&lt;p&gt;One of the core primitives we need is a noise function: it is required for the
mountains, the fog, and the clouds.&lt;/p&gt;
&lt;p&gt;In a recent article, I &lt;a href=&quot;http://blog.pkh.me/p/42-sharing-everything-i-could-understand-about-gradient-noise.html&quot;&gt;talked about gradient noise&lt;/a&gt;. We could
technically use that, but it has a lot of drawbacks. First of all, it&#x27;s
super expensive: I know because I made a demo using it the other day, and it was
awfully slow. Once per pixel would be fine, but in our case it will have to be
evaluated a hundred times per pixel, so we need something faster.&lt;/p&gt;
&lt;p&gt;Secondly, we&#x27;re trying to make it as short as possible, and the 2D gradient
noise, even minified, is already twice as big as the size of the full demo. We
will also need a 3D noise for the clouds and fog, which is even larger and more
expensive. And that&#x27;s not even accounting for the fbm signal combination code.&lt;/p&gt;
&lt;p&gt;Inigo Quilez, in his famous &lt;a href=&quot;https://www.shadertoy.com/view/4ttSWf&quot;&gt;Rainforest&lt;/a&gt;, used value noise. It is faster, but
it still won&#x27;t do for us, for the same reasons, just somewhat mitigated. And since
we&#x27;re professionals, we&#x27;re not going to cheat by sampling a noise texture.&lt;/p&gt;
&lt;p&gt;Fortunately, while reverse engineering some Shadertoy demos, in particular the
ones from &lt;a href=&quot;https://www.shadertoy.com/user/diatribes&quot;&gt;diatribes&lt;/a&gt;, I came across some code that made use of this incredible
technique of accumulating sine waves.&lt;/p&gt;
&lt;h3&gt;Combining sin waves&lt;/h3&gt;
&lt;p&gt;Let&#x27;s say we want to combine two sine waves in order to get a height map as a
3rd dimension. There are multiple ways of achieving that. For example, we can
multiply them:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
z = \sin x \times \sin y
&lt;/div&gt;
&lt;p&gt;&lt;canvas width=&quot;360&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/sinxsin.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
&lt;p&gt;But we could also add them together:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
z = \sin x + \sin y
&lt;/div&gt;
&lt;p&gt;&lt;canvas width=&quot;360&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/sinpsin.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
&lt;p&gt;The surprising takeaway here is that... it&#x27;s pretty much equivalent. It doesn&#x27;t give
the same result for sure, but visually it could be considered the same, just
with a slightly different frequency and amplitude, and rotated by 45° around the z-axis.&lt;/p&gt;
&lt;p&gt;Similarly, you may think using cosines instead of sines would make a
difference, but no: however they are combined, they always give the same base
pattern we just saw.&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
z = \sin x + \cos y
&lt;/div&gt;
&lt;p&gt;&lt;canvas width=&quot;360&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/sinpcos.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
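&lt;p&gt;If you want to convince yourself numerically, the sum-to-product identity
&lt;span class=&quot;math inline&quot;&gt;\sin x + \sin y = 2 \sin\frac{x+y}{2} \cos\frac{x-y}{2}&lt;/span&gt; shows the sum form is just a
product of sinusoids along axes rotated by 45°. Here is a small Python check
(Python rather than GLSL so it can be run anywhere):&lt;/p&gt;

```python
import math
import random

# Check that sin(x) + sin(y) equals a product of sinusoids evaluated in a
# 45-degree rotated, rescaled space: 2*sin((x+y)/2)*cos((x-y)/2)
random.seed(0)
for _ in range(1000):
    x = random.uniform(-10.0, 10.0)
    y = random.uniform(-10.0, 10.0)
    s = math.sin(x) + math.sin(y)
    p = 2.0 * math.sin((x + y) / 2.0) * math.cos((x - y) / 2.0)
    assert 1e-9 > abs(s - p)  # identical up to floating point noise
```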
&lt;p&gt;So let&#x27;s pick one, let&#x27;s say &lt;span class=&quot;math inline&quot;&gt;z=\sin x + \sin y&lt;/span&gt;. But this time, we&#x27;re going to
take the absolute value to transform the up and down pattern into bumps:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
z = |\sin x + \sin y|
&lt;/div&gt;
&lt;p&gt;&lt;canvas width=&quot;360&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/abssinpsin.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
&lt;p&gt;These bumps are the perfect base for clouds, but not so much for spiky mountains
going through aggressive erosion. But with the help of this weird little trick,
we can just flip the shape upside down to get sharp edges:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
z = -|\sin x + \sin y|
&lt;/div&gt;
&lt;p&gt;&lt;canvas width=&quot;360&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/mabssinpsin.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
&lt;p&gt;We now have the basis for both our clouds and mountains, but it&#x27;s not yet
convincing. The next step is to use the fbm loop as if we were dealing with
Gaussian or value noise: we accumulate several frequencies of our signal
together:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
z = S \sum_{i=0}^{N-1} F(\begin{bmatrix}x \\ y\end{bmatrix} \cdot l^{i}) g^{i}
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;S&lt;/span&gt; is the sign (-1 for spiky, 1 for blobby)&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;i&lt;/span&gt; is the octave identifier going from 0 to &lt;span class=&quot;math inline&quot;&gt;N-1&lt;/span&gt; (included).&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;F(x,y)&lt;/span&gt; is usually the noise signal function, in our case it&#x27;s the sinusoid
combination function, we choose &lt;span class=&quot;math inline&quot;&gt;|\sin x + \sin y|&lt;/span&gt; here.&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;l&lt;/span&gt; is the lacunarity factor, that is how the frequency changes at each
octave; this is usually a multiply by 2 or a close value.&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;g&lt;/span&gt; is the gain, that is by how much the amplitude changes at each octave; this is
usually a multiply by 0.5 or a close value.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;canvas width=&quot;360&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/fbmnorot.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
&lt;p&gt;Without surprise this is still very periodic, but we can see a glimpse of chaos
emerging. The final touch does all the magic: all we have to do now is simply
rotate each layer by like, 30° or something (I&#x27;ll pick 0.5 radians here, or
about 29°):&lt;/p&gt;
&lt;p&gt;&lt;canvas width=&quot;360&quot; height=&quot;360&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/fbmrot.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
&lt;p&gt;The symmetry around the origin is still noticeable, but the illusion will work
as we will move away from it. It&#x27;s also possible to add some phase or offsetting
(arbitrary addition within the &lt;code&gt;sin&lt;/code&gt; or between each layer).&lt;/p&gt;
&lt;p&gt;I implemented this in a &lt;a href=&quot;https://www.desmos.com/3d/odvwh2ttdb&quot;&gt;Desmos 3D scene&lt;/a&gt; with all the
parameters if one wants to play with it. The formula there has a few more
controls, for example the vertical location, an optional transition offset in
addition to the rotation, and controls for the base frequency and amplitude.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;http://blog.pkh.me/img/demomaking/desmos.png&quot; alt=&quot;Screenshot of fake noise in Desmos 3D&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If this mathematical gibberish is above your head, GLSL code for the 2D noise
could look like this, with a lacunarity of 2, a gain of 0.5, and 5 octaves:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float noise(vec2 p) {
    float v = 0.0;
    float amplitude = 1.0;
    for (int i = 0; i &amp;lt; 5; i++) {
        p = rotate(0.5) * p; // rotate our space (more on this in the next section)
        v += abs(sin(p.x) + sin(p.y)) * amplitude; // accumulate noise
        p *= 2.0; // double the frequency at each octave
        amplitude *= 0.5; // half the amplitude at each octave
    }
    return v;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One cool trick here: &lt;code&gt;abs(sin(p.x)+sin(p.y))&lt;/code&gt; could also be written
&lt;code&gt;abs(dot(sin(p),vec2(1)))&lt;/code&gt;. This is interesting because now we can operate on
the two components of &lt;code&gt;p&lt;/code&gt;, easing the possibility to modify them at once (for
example doing &lt;code&gt;p*A+B&lt;/code&gt;). The &lt;code&gt;dot&lt;/code&gt; trick doesn&#x27;t work with &lt;code&gt;sin(p.x)*sin(p.y)&lt;/code&gt;,
but fortunately, as we saw before, multiply and addition are similar and could
be swapped in various situations.&lt;/p&gt;
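&lt;p&gt;To double-check that the &lt;code&gt;dot&lt;/code&gt; rewrite is strictly equivalent, here is a quick
Python sketch where the helper functions are stand-ins for the GLSL built-ins:&lt;/p&gt;

```python
import math

def sin2(p):
    # component-wise sine, like GLSL sin() applied to a vec2
    return [math.sin(c) for c in p]

def dot2(a, b):
    # like GLSL dot() on two vec2
    return a[0] * b[0] + a[1] * b[1]

p = [1.7, -0.4]
original = abs(math.sin(p[0]) + math.sin(p[1]))  # abs(sin(p.x)+sin(p.y))
vectorized = abs(dot2(sin2(p), [1.0, 1.0]))      # abs(dot(sin(p),vec2(1)))
assert 1e-12 > abs(original - vectorized)
```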
&lt;h2&gt;Rotations&lt;/h2&gt;
&lt;p&gt;We needed some rotations for the noise, and they will be required again soon, so
we need to take a closer look at them. Let&#x27;s start with the formula most people
are familiar with:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
M =
\begin{bmatrix}
    \cos \theta &amp;amp; -\sin \theta \\
    \sin \theta &amp;amp; \cos \theta
\end{bmatrix}
&lt;/div&gt;
&lt;p&gt;A matrix can be seen as a function, so mathematically writing &lt;span class=&quot;math inline&quot;&gt;p&#x27;=M \cdot p&lt;/span&gt;
would be equivalent to the code &lt;code&gt;p=rotate(angle)*p&lt;/code&gt; with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Matrix for a counter-clockwise rotation
mat2 rotate(float a) {
    return mat2(
        cos(a), sin(a), // column 1
       -sin(a), cos(a)  // column 2
    );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Doing &lt;span class=&quot;math inline&quot;&gt;p&#x27;=M \cdot p&lt;/span&gt; is rotating the &lt;em&gt;space&lt;/em&gt; &lt;span class=&quot;math inline&quot;&gt;p&lt;/span&gt; lies in, which means it gives
the illusion the &lt;em&gt;object&lt;/em&gt; is rotating &lt;strong&gt;clockwise&lt;/strong&gt;. Though, in the expression
&lt;code&gt;p=rotate(angle)*p&lt;/code&gt;, I can&#x27;t help but be bothered by the redundancy of &lt;code&gt;p&lt;/code&gt;,
so I would prefer to write &lt;code&gt;p*=rotate(angle)&lt;/code&gt; instead. Since matrix multiplication
is not commutative, this will instead do a &lt;strong&gt;counter-clockwise&lt;/strong&gt; rotation of the
object. The inlined rotation ends up being:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;p *= mat2(cos(a),sin(a),-sin(a),cos(a)); // counter-clockwise rotation of object at point p
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;To make the rotation clockwise, we can of course use &lt;code&gt;-a&lt;/code&gt;, or we can
transpose the matrix: &lt;code&gt;mat2(cos(a),-sin(a),sin(a),cos(a))&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This is problematic though: we need to repeat the angle 4 times, which can be
particularly troublesome if we want to create a macro and/or don&#x27;t want an
intermediate variable for the angle. But I got you covered: trigonometry has a
shitton of identities, and we can express every &lt;code&gt;sin&lt;/code&gt; according to a &lt;code&gt;cos&lt;/code&gt; (and
the other way around).&lt;/p&gt;
&lt;p&gt;For example, here is another formulation of the same expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;p *= mat2(cos(a + vec4(0,3,1,0)*PI/2.0));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the angle appears only once, in a vectorized cosine call.&lt;/p&gt;
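&lt;p&gt;We can verify in a few lines of Python that the offset vector &lt;code&gt;vec4(0,3,1,0)&lt;/code&gt;
scaled by &lt;span class=&quot;math inline&quot;&gt;\pi/2&lt;/span&gt; reproduces the four matrix entries exactly:&lt;/p&gt;

```python
import math

a = 0.7                   # arbitrary test angle
half_pi = math.acos(0.0)  # pi/2, the same trick as in the shader

entries = [math.cos(a + k * half_pi) for k in (0, 3, 1, 0)]
# column-major mat2: cos(a), sin(a), -sin(a), cos(a)
expected = [math.cos(a), math.sin(a), -math.sin(a), math.cos(a)]
for got, want in zip(entries, expected):
    assert 1e-12 > abs(got - want)
```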
&lt;p&gt;GLSL has &lt;code&gt;degrees()&lt;/code&gt; and &lt;code&gt;radians()&lt;/code&gt; functions, but it doesn&#x27;t expose anything
for the &lt;span class=&quot;math inline&quot;&gt;\pi&lt;/span&gt; nor &lt;span class=&quot;math inline&quot;&gt;\tau&lt;/span&gt; constants. And of course, it doesn&#x27;t have &lt;code&gt;sinpi&lt;/code&gt; and
&lt;code&gt;cospi&lt;/code&gt; implementations either. So it&#x27;s obvious they want us to use &lt;span class=&quot;math inline&quot;&gt;\arccos(-1)&lt;/span&gt;
for &lt;span class=&quot;math inline&quot;&gt;\pi&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\arccos(0)&lt;/span&gt; for &lt;span class=&quot;math inline&quot;&gt;\pi/2&lt;/span&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;p *= mat2(cos(a + vec4(0,3,1,0)*acos(0.)));
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;To specify &lt;code&gt;a&lt;/code&gt; as a normalized value, we can use
&lt;code&gt;mat2(cos((a*4.+vec4(0,3,1,0))*acos(0.)))&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;On his Unofficial Shadertoy blog, Fabrice Neyret goes further and provides us
with &lt;a href=&quot;https://shadertoyunofficial.wordpress.com/#vector-maths&quot;&gt;a very cute approximation&lt;/a&gt;, which is the one we will use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;p *= mat2(cos(a + vec4(0,11,33,0)));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I &lt;a href=&quot;https://github.com/ubitux/research/blob/main/misc/rotation-approx.py&quot;&gt;checked for the best numbers in 2 digits&lt;/a&gt;, and I can confirm
they are indeed the ones providing the best accuracy.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/rotations-precision.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Comparison of the 2 rotation matrices&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;On this last figure, the slight red/green on the outline of the circle
represents the loss of precision.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;With 3 digits, &lt;code&gt;344&lt;/code&gt; and &lt;code&gt;699&lt;/code&gt; can respectively be used instead of &lt;code&gt;11&lt;/code&gt;
and &lt;code&gt;33&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
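&lt;p&gt;As a sanity check, here is a small Python sketch measuring the worst-case error of
these constants: 11 stands in for &lt;span class=&quot;math inline&quot;&gt;3\pi/2 + 2\pi \approx 10.9956&lt;/span&gt; and 33 for
&lt;span class=&quot;math inline&quot;&gt;\pi/2 + 10\pi \approx 32.9867&lt;/span&gt;:&lt;/p&gt;

```python
import math

def max_err(s, c):
    # worst deviation of cos(a+s) from sin(a) and of cos(a+c) from -sin(a)
    worst = 0.0
    for i in range(1000):
        a = i * 0.01
        worst = max(worst,
                    abs(math.cos(a + s) - math.sin(a)),
                    abs(math.cos(a + c) + math.sin(a)))
    return worst

assert 0.02 > max_err(11.0, 33.0)                   # about 0.013 at worst
assert max_err(11.0, 33.0) > max_err(344.0, 699.0)  # the 3-digit pair is tighter
```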
&lt;p&gt;This is good when we want a dynamic rotation angle (we will need that for the
camera panning typically), but sometimes we just need a hardcoded value: for
example in the &lt;code&gt;rotate(0.5)&lt;/code&gt; of our combined noise function.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;mat2(cos(.5+vec4(0,11,33,0)))&lt;/code&gt; is fine but we can do better. Through Inigo&#x27;s
demos I found the following: &lt;code&gt;mat2(.8,.6,-.6,.8)&lt;/code&gt;. It makes a rotation angle of
about 37° (around 0.64 radians) in a very tiny form. Since 0.5 was pretty much
arbitrary, we can just use this matrix as well. And we can make it even smaller
(thank you &lt;a href=&quot;https://www.shadertoy.com/user/jolle&quot;&gt;jolle&lt;/a&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;p *= mat2(8,6,-6,8)*.1; // rotate p counter-clockwise by about 37° without any trigo
&lt;/code&gt;&lt;/pre&gt;
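&lt;p&gt;What makes this work is the 3-4-5 Pythagorean triple: the columns (0.8, 0.6) and
(-0.6, 0.8) are already unit length, so no trigonometry is needed at all. A quick
Python check:&lt;/p&gt;

```python
import math

c, s = 0.8, 0.6  # from the 3-4-5 triangle: (4/5, 3/5) lies on the unit circle
assert 1e-12 > abs(c * c + s * s - 1.0)   # unit columns: a pure rotation

angle = math.atan2(s, c)
assert 37.0 > math.degrees(angle) > 36.8  # about 36.87 degrees (0.6435 rad)
```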
&lt;p&gt;One last rotation tip from Fabrice&#x27;s bag of tricks: rotating in 3D around an
axis can be done with the help of GLSL swizzling:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;p.xz *= rotate(0.5); // 3D rotation around y-axis (the absent component)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will use this too.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;code&gt;p.zy *= rotate(.5)&lt;/code&gt; is the same as &lt;code&gt;p.yz *= rotate(-.5)&lt;/code&gt;, if we need to save
one character and can&#x27;t transpose the matrix.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Camera (and axis) setup&lt;/h2&gt;
&lt;p&gt;One last essential before going creative is the camera setup.&lt;/p&gt;
&lt;p&gt;We start with the 2D &lt;code&gt;P&lt;/code&gt; pixel coordinates, which we are going to make resolution
independent by transforming them into a traditional mathematical coordinate
system:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// 1:1 ratio with [-1,1] along the shortest axis (horizontal or vertical)
vec2 u = (2.0*P - R) / min(R.x, R.y);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;http://blog.pkh.me/img/demomaking/coords-system.png&quot; alt=&quot;coordinate system&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Since we know our demo will be rendered in landscape mode, dividing by &lt;code&gt;R.y&lt;/code&gt;
is enough. We can also save one character using &lt;code&gt;P+P&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// 1:1 ratio with [-1,1] along the vertical axis
vec2 u = (P+P - R) / R.y;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To enter 3D space, we append a third component, giving us either a right- or a
left-handed Y-up coordinate system. This choice is not completely random.&lt;/p&gt;
&lt;p&gt;Indeed, it&#x27;s easier/shorter to add a 3rd dimension at the end compared
to interleaving a middle component. Compare the length of &lt;code&gt;vec3(P, z)&lt;/code&gt; to
&lt;code&gt;vec3(P.x, z, P.y)&lt;/code&gt; (Z-up convention). In the former case, picking just a plane
remains short and easy thanks to swizzling: &lt;code&gt;p.xz&lt;/code&gt; instead of &lt;code&gt;p.xy&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;To work in 3D, we need an origin point (&lt;code&gt;ro&lt;/code&gt; for ray origin) and a looking
direction (&lt;code&gt;rd&lt;/code&gt; for ray direction). &lt;code&gt;ro&lt;/code&gt; is picked arbitrarily for the eye
position, while &lt;code&gt;rd&lt;/code&gt; is usually calculated thanks to a &lt;code&gt;lookAt&lt;/code&gt; helper:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Right-hand with Y-up (like Godot)
mat3 lookAt(vec3 origin /* where we are */, vec3 target /* where we look */) {
    vec3 w = normalize(target - origin);
    vec3 u = normalize(cross(w, vec3(0,1,0)));
    vec3 v = normalize(cross(u, w)); // Note: normalize() can be ditched here
    return mat3(u, v, w);
}
&lt;/code&gt;&lt;/pre&gt;
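&lt;p&gt;If you want to poke at this outside a shader, here is a direct Python port of the
same &lt;code&gt;lookAt&lt;/code&gt; (the list-based helpers are stand-ins for the GLSL built-ins),
checking that the three columns form an orthonormal basis:&lt;/p&gt;

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def look_at(origin, target):
    # right-handed, Y-up, same construction as the GLSL version
    w = normalize([t - o for t, o in zip(target, origin)])
    u = normalize(cross(w, [0.0, 1.0, 0.0]))
    v = cross(u, w)  # already unit length since u and w are orthonormal
    return u, v, w   # the three columns of the mat3

u, v, w = look_at([1.0, 2.0, -3.0], [0.0, 0.5, 4.0])
for axis in (u, v, w):
    assert 1e-9 > abs(dot(axis, axis) - 1.0)  # unit vectors
assert 1e-9 > abs(dot(u, v))                  # mutually orthogonal
assert 1e-9 > abs(dot(u, w))
assert 1e-9 > abs(dot(v, w))
```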
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/coord-system.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Right-hand Y-up 3D coordinates system&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Which is then used like that, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec2 u = (P+P - R) / R.y;
vec3 target = /* ... */;
vec3 ro = /* ... */;
vec3 rd = normalize(lookAt(ro, target) * vec3(u, 1));
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;I made a &lt;a href=&quot;https://www.shadertoy.com/view/wcfBRS&quot;&gt;Shadertoy demo&lt;/a&gt; to experiment with different
3D coordinate spaces if you are interested in digging this further.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;All of this is perfectly fine because it is flexible, but it&#x27;s also way too much
unnecessary code for our needs, so we need to shrink it.&lt;/p&gt;
&lt;p&gt;One approach is to pick a simple origin and straight target point so that the
matrix is as simple as possible. And then later on apply some transformations on
the point. If we give &lt;code&gt;ro=vec3(0)&lt;/code&gt; and &lt;code&gt;target=vec3(0,0,1)&lt;/code&gt;, we end up with an
identity matrix, so we can ditch everything and just write:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 rd = normalize(vec3((P+P - R) / R.y, 1));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This can be shortened further: since the vector is normalized anyway, we can scale
it at will, for example by a factor &lt;code&gt;R.y&lt;/code&gt;, saving us precious characters:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 rd = normalize(vec3(P+P - R, R.y));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And just like that, we are located at the origin &lt;code&gt;vec3(0)&lt;/code&gt;, looking toward Z+,
ready to render our scene.&lt;/p&gt;
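&lt;p&gt;A quick Python sketch confirms the rescaling is harmless; the &lt;code&gt;P&lt;/code&gt; and &lt;code&gt;R&lt;/code&gt;
values are arbitrary, picked just for the comparison:&lt;/p&gt;

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

# hypothetical pixel P on a hypothetical canvas R
P = (300.0, 200.0)
R = (480.0, 340.0)

long_form = normalize([(2.0 * P[0] - R[0]) / R[1],
                       (2.0 * P[1] - R[1]) / R[1],
                       1.0])
short_form = normalize([P[0] + P[0] - R[0],  # same vector...
                        P[1] + P[1] - R[1],
                        R[1]])               # ...scaled by R.y before normalize

for a, b in zip(long_form, short_form):
    assert 1e-12 > abs(a - b)  # same direction, so same normalized vector
```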
&lt;h2&gt;Mountain height map&lt;/h2&gt;
&lt;p&gt;It&#x27;s finally time to build our scene. We&#x27;re going to start with our &lt;code&gt;noise&lt;/code&gt;
function previously defined, but we&#x27;re going to tweak it in various ways to craft a
mountain height map function.&lt;/p&gt;
&lt;p&gt;Here is our first draft:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float mountain_y = -0.5; // mountain y-axis position
const float mountain_f = 0.6; // mountain base frequency

float mountain_height_map(vec2 p) {
    float h = mountain_y;
    for (float a = 1.0; a &amp;gt; 0.01; a /= 2.0) {
        p *= rotate(0.5);
        h += abs(dot(sin(p*mountain_f / a), vec2(1))) * a; // dot(sin(v),1) -&amp;gt; sin(v.x)+sin(v.y)
    }
    return -h; // minus for the spiky version of the noise
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&#x27;re exploiting one important correlation in the noise function: at every
octave, the amplitude is halved while the frequency is doubled. So instead
of having 2 running variables, we just have an amplitude &lt;code&gt;a&lt;/code&gt; getting halved
every octave, and we &lt;em&gt;divide&lt;/em&gt; our position &lt;code&gt;p&lt;/code&gt; by &lt;code&gt;a&lt;/code&gt; (which is the same as
multiplying by a frequency that doubles at every octave).&lt;/p&gt;
&lt;p&gt;I actually like this way of writing the loop because we can stop it
when the amplitude becomes meaningless (&lt;code&gt;a&amp;gt;0.01&lt;/code&gt; acts as a precision stopper).
Unfortunately, &lt;code&gt;a/=2.&lt;/code&gt; is one character too long for the iteration expression, so
we&#x27;re going to double instead, using &lt;code&gt;a+=a&lt;/code&gt;, and write the loop the other way
around: &lt;code&gt;for (float a=.01; a&amp;lt;1.; a+=a)&lt;/code&gt;. It&#x27;s not exactly equivalent, but it&#x27;s good
enough (and we can still tweak the values if necessary).&lt;/p&gt;
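&lt;p&gt;A tiny Python sketch makes the difference concrete: both loops run for 7 octaves,
but the amplitude sequences are not quite the same:&lt;/p&gt;

```python
# original halving loop, stopping when the amplitude becomes meaningless
down = []
a = 1.0
while a > 0.01:
    down.append(a)
    a /= 2.0

# golfed doubling loop from the article, written the other way around
up = []
a = 0.01
while 1.0 > a:
    up.append(a)
    a += a

assert len(down) == len(up) == 7   # same octave count...
assert down[0] == 1.0              # ...but amplitudes go from 1 down to ~0.016
assert 1e-12 > abs(up[-1] - 0.64)  # versus from 0.01 up to 0.64
```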
&lt;p&gt;We&#x27;re going to inline the constants and rotate, and use one more cool trick:
&lt;code&gt;vec2(1)&lt;/code&gt; can be shortened: we just need another &lt;code&gt;vec2&lt;/code&gt;. Luckily we have &lt;code&gt;p&lt;/code&gt;,
so we can simply replace it with &lt;code&gt;p/p&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Finally, we can get rid of the braces of the &lt;code&gt;for&lt;/code&gt; loop by using the &lt;code&gt;,&lt;/code&gt; in its
local scope:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float mountain_height_map(vec2 p) {
    float h = -.5;
    for (float a=.01; a&amp;lt;1.; a+=a)
        p *= mat2(8,6,-6,8)*.1,
        h += abs(dot(sin(p*.6/a), p/p))*a;
    return -h;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;p/p&lt;/code&gt; works fine as long as no component of &lt;code&gt;p&lt;/code&gt; is zero. In this particular case, we can
instead use &lt;code&gt;vec2(0)&lt;/code&gt; (obtained with &lt;code&gt;p-p&lt;/code&gt;) and then include the &lt;code&gt;a&lt;/code&gt; amplitude
multiplier within the expression: &lt;code&gt;abs(dot(sin(p*.6/a), p-p+a))&lt;/code&gt;. (&lt;code&gt;p-p+a&lt;/code&gt; is
the same as &lt;code&gt;vec2(a)&lt;/code&gt; when &lt;code&gt;p&lt;/code&gt; is a &lt;code&gt;vec2&lt;/code&gt;.) We end up with the following safer
version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float mountain_height_map(vec2 p) {
    float h = -.5;
    for (float a=.01; a&amp;lt;1.; a+=a)
        p *= mat2(8,6,-6,8)*.1,
        h += abs(dot(sin(p*.6/a), p-p+a));
    return -h;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/hmap.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Mountain height map in 2D (rescaled for display)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;To render this in 3D, we are going to do some ray-marching.&lt;/p&gt;
&lt;h2&gt;Solid ray-marching&lt;/h2&gt;
&lt;p&gt;The main technique used in most Shadertoy demos is ray-marching. I will assume
familiarity with the technique, but if that&#x27;s not the case, &lt;a href=&quot;https://www.youtube.com/watch?v=khblXafu7iA&quot;&gt;An introduction
to Raymarching (YouTube)&lt;/a&gt; by kishimisu and &lt;a href=&quot;https://blog.maximeheckel.com/posts/painting-with-math-a-gentle-study-of-raymarching/&quot;&gt;Painting with Math:
A Gentle Study of Raymarching&lt;/a&gt; by Maxime Heckel were good
resources for me.&lt;/p&gt;
&lt;p&gt;In short: we start from a position in space called the ray origin &lt;code&gt;ro&lt;/code&gt; and
march from it along a ray direction &lt;code&gt;rd&lt;/code&gt;. At every iteration we check the distance
to the closest solid in our scene, and step forward by that distance, hoping to
converge closer and closer to the object boundary.&lt;/p&gt;
&lt;p&gt;We end up with this main loop template:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float t = 0.0;

vec3 ro = vec3(0); // ray origin
vec3 rd = normalize(vec3(P+P - R, R.y)); // ray direction

// 100 iterations should be enough to hit something if there is any
for (int i = 0; i &amp;lt; 100; i++) {
    vec3 p = ro + rd*t; // t amount in rd direction from ro origin
    float h = distance_to_solid(p); // 3D distance function
    if (h &amp;lt; 0.001) { // we converged close enough to a solid
        // Here we assign a color according to where p is
        // [...]
        break;
    }
    t += h; // there is no solid closer than h so we step by that much
}
&lt;/code&gt;&lt;/pre&gt;
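&lt;p&gt;To make the stepping concrete, here is the same loop transcribed to Python against
a sphere, which is a true distance field (the scene values are arbitrary):&lt;/p&gt;

```python
import math

def sphere_distance(p, center, radius):
    # signed distance to a sphere: the canonical 3D distance field
    return math.dist(p, center) - radius

ro = (0.0, 0.0, 0.0)        # ray origin
rd = (0.0, 0.0, 1.0)        # ray direction (unit length)
center, radius = (0.0, 0.0, 5.0), 1.0

t = 0.0
hit = False
for _ in range(100):
    p = tuple(o + d * t for o, d in zip(ro, rd))
    h = sphere_distance(p, center, radius)
    if 0.001 > h:           # converged close enough to the surface
        hit = True
        break
    t += h                  # no solid is closer than h, so the step is safe

assert hit
assert 0.01 > abs(t - 4.0)  # the surface is 4 units away from the origin
```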
&lt;p&gt;This works fine for solids expressed with &lt;a href=&quot;https://iquilezles.org/articles/distfunctions/&quot;&gt;3D distance fields&lt;/a&gt;, that is
functions that for a given point give the distance to the object. We will use
it for our mountain, with one subtlety: the noise height map of the mountain
is not exactly a distance (it is only the distance to what&#x27;s below our current
point &lt;code&gt;p&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float distance_to_solid(vec3 p) { // positive outside, negative inside
    return p.y - mountain_height_map(p.xz);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because of this, we can&#x27;t step by the distance directly, or we&#x27;re likely to go
through mountains during the stepping (&lt;code&gt;t += h&lt;/code&gt;). A common workaround here is to
step a certain percentage of that distance to play it safe.&lt;/p&gt;
&lt;p&gt;Technically we should &lt;a href=&quot;https://www.peterstefek.me/ray-marching-heightfields.html&quot;&gt;figure out the theoretical proper shrink
factor&lt;/a&gt;, but we&#x27;re going to take a shortcut today and just pick one
arbitrarily. Using trial and error, I ended up with 20% of the distance.&lt;/p&gt;
&lt;p&gt;After a few simplifications, we end up with the following (complete) code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float mountain_height_map(vec2 p) {
    float h = .5;
    for (float a=.01; a&amp;lt;1.; a+=a)
        p *= mat2(8,6,-6,8)*.1,
        h += abs(dot(sin(p*.6/a), p-p+a));
    return -h;
}

float distance_to_solid(vec3 p) {
    return p.y - mountain_height_map(p.xz);
}

void main() {
    vec3 rd = normalize(vec3(P+P - R, R.y));

    float t = 0.0, color = 0.0;
    for (int i = 0; i &amp;lt; 100; i++) {
        vec3 p = rd*t;

        p.z += T*.2; // move forward

        float h = distance_to_solid(p);
        if (h &amp;lt; 0.001) {
            color = exp(-t*t*.01); // depth map like &amp;quot;coloring&amp;quot;
            break;
        }
        t += h * 0.2;
    }

    O = vec4(vec3(pow(color, 3.0/2.2)), 1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/hmap3d.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Basic ray-marching of the mountain height map&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We start at &lt;code&gt;ro=vec3(0)&lt;/code&gt; so I dropped the variable entirely.&lt;/p&gt;
&lt;p&gt;You may be curious about the power at the end; this is just a combination
of luminance perception with gamma 2.2 (sRGB) transfer function. It only
works well for grayscale; for more information, see &lt;a href=&quot;http://blog.pkh.me/p/43-the-current-technology-is-not-ready-for-proper-blending.html&quot;&gt;my previous article on
blending&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Clouds and fog&lt;/h2&gt;
&lt;p&gt;Compared to the mountain, the clouds and fog will need a 3 dimensional noise.
Well, we don&#x27;t need to be very original here; we simply extend the 2D noise
to 3D:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float noise3(vec3 p) {
    float v;
    for (float a=.01; a&amp;lt;1.; a+=a)
        p.xz *= mat2(8,6,-6,8)*.1,
        v += abs(dot(sin(p*.3/a + T*.3), p-p+a));
    return v;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The base frequency is lowered to &lt;code&gt;0.3&lt;/code&gt; to make it smoother, and &lt;code&gt;p&lt;/code&gt; goes
from 2 to 3 dimensions. Notice how the rotation is only done around the y-axis (the
one pointing up): don&#x27;t worry, it&#x27;s good enough for our purpose.&lt;/p&gt;
&lt;p&gt;We also add a phase (meaning we are offsetting the sinusoid) of &lt;code&gt;T*0.3&lt;/code&gt; (&lt;code&gt;T&lt;/code&gt; is
the time in seconds, slowed down by the multiply) to slowly morph it over time.
The base frequency and time scale being identical is a happy &amp;quot;coincidence&amp;quot; to be
factored out later (I actually forgot about it until &lt;a href=&quot;https://www.shadertoy.com/user/jolle&quot;&gt;jolle&lt;/a&gt; reminded me of it).&lt;/p&gt;
&lt;p&gt;You also most definitely noticed &lt;code&gt;v&lt;/code&gt; isn&#x27;t explicitly initialized: while this
only holds in WebGL, the spec &lt;a href=&quot;https://registry.khronos.org/webgl/specs/latest/1.0/#6.39&quot;&gt;guarantees zero initialization&lt;/a&gt;, so we&#x27;re saving a
few characters here.&lt;/p&gt;
&lt;h2&gt;Volumetric ray-marching&lt;/h2&gt;
&lt;p&gt;For volumetric materials (clouds and fog), the loop is a bit different: instead
of calculating the distance to the solid for our current point &lt;code&gt;p&lt;/code&gt;, we
compute the density of our target &amp;quot;object&amp;quot;. Funny enough, it can be thought of
as a 3D SDF with the sign flipped: positive inside (because the density
increases as we go deeper) and negative outside (there is no density, we&#x27;re not
in it).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float clouds_y = 3.0; // vertical position

float clouds_density(vec3 p) {
    float n = noise3(p);     // random value associated with a 3D position in space
    float h = -clouds_y + n; // similar to mountain_height_map() but 3d and blobby
    float d = p.y - h;       // similar to distance_to_solid()
    d = -d;                  // flip sign: distance to density
    // We are only interested in the density within the material,
    // the density will be considered 0 when outside of it.
    return max(d, 0.0);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For simplicity, we&#x27;re going to rewrite the function like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float clouds_y = 3.0;

float clouds_density(vec3 p) {
    float n = noise3(p);
    float d = -p.y - clouds_y + n;
    return max(d, 0.0);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compared to the solid ray-marching loop, the volumetric one doesn&#x27;t bail out
when it reaches the target. Instead, it slowly steps into it, damping the light
as the density increases:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float absorption = 0.15;
const float radiance   = 1.0;

void main() {
    float step_len = 0.15;
    float t;

    vec3 rd = normalize(vec3(P+P-R,R.y));
    vec3 color;

    float transmittance = 1.0; // remaining visibility
    for (int i = 0; i &amp;lt; 100; i++) {
        vec3 p = rd*t;

        // Move camera forward
        p.z += T * 1.5;

        // How many particles of the material we can find at that position
        // If negative, we&#x27;re not in the element yet, otherwise it&#x27;s the density
        // (getting higher as we go deeper into it typically).
        float d = clouds_density(p);

        // Integrate the density discretely: we assume the segment we&#x27;re
        // walking has a constant density along its length
        d *= step_len;

        // The fraction of light that survives through this segment (Beer-Lambert law)
        // The denser, the closer to 0 this gets
        float attenuation = exp(-d*absorption);

        float emission = d*radiance; // how much light is emitted along the segment (glow)
        float alpha = 1.0 - attenuation; // fraction of light removed for that given density segment

        float weight = alpha * transmittance;

        // Accumulate color emission
        color += weight * emission;

        transmittance -= weight; // could also be written transmittance *= attenuation

        // Advance by the volumetric step
        t += step_len;

        // Larger volumetric steps as we go far
        step_len *= 1.015;
    }

    O = vec4(pow(color, vec3(3.0/2.2)), 1);
}
&lt;/code&gt;&lt;/pre&gt;
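One detail worth double-checking is the comment on the transmittance update: subtracting the weight really is the same as multiplying by the attenuation. A quick Python check of that identity (not shader code, just the math, with arbitrary density samples):

```python
import math

def updated_transmittance(transmittance, attenuation):
    # the loop's form: transmittance -= weight
    alpha = 1.0 - attenuation
    weight = alpha * transmittance
    return transmittance - weight

def updated_transmittance_alt(transmittance, attenuation):
    # the equivalent form from the comment: transmittance *= attenuation
    return transmittance * attenuation
```

Both forms compute the fraction of light still reaching the camera after the segment, which is why front-to-back compositing can be written either way.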
&lt;p&gt;The core idea is that the volumetric material emits some radiance but also
absorbs the atmospheric light. The deeper we get, the smaller the transmittance
gets, until it converges to 0 and stops all light. All the thresholds you see were
chosen by tweaking them through trial and error, not by any particular logic. They
are also highly dependent on the total number of iterations.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;Steps get larger and larger as the distance increases; this is because we
don&#x27;t need as much precision per &amp;quot;slice&amp;quot;, but we still want to reach a long
distance.&lt;/p&gt;
&lt;/div&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/clouds.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Basic volumetric ray-marching of the clouds density map&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We want to be positioned below the clouds, so we&#x27;re going to need a simple sign
flip in the function.&lt;/p&gt;
&lt;p&gt;The fog will take its place at the bottom, except upside down (the
sharpness will give a mountain-hugging feeling) and at a different position.
&lt;code&gt;clouds_density()&lt;/code&gt; becomes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float clouds_y = 3.0;
const float fog_y    = 0.0;

float clouds_fog_density(vec3 p) {
    float n = noise3(p);

    float clouds_d = p.y - clouds_y + n;
    float fog_d    = p.y - fog_y    + n;

    // Pick the element with the highest density (they don&#x27;t overlap anyway)
    float d = max(clouds_d, -fog_d);

    return max(d, 0.0);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/cloudsfog.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Both clouds and fog with volumetric ray-marching&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;For more resources on volumetric rendering, here are the ones I studied
the most:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://wallisc.github.io/rendering/2020/05/02/Volumetric-Rendering-Part-1.html&quot;&gt;Volumetric Rendering in 2 parts&lt;/a&gt;, by Chris&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.maximeheckel.com/posts/real-time-cloudscapes-with-volumetric-raymarching/&quot;&gt;Real-time dreamy Cloudscapes with Volumetric Raymarching&lt;/a&gt;, by
Maxime Heckel again&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://mini.gmshaders.com/p/volumetric&quot;&gt;Volumetric Raymarching&lt;/a&gt;, by Xor&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Combining ray-marching&lt;/h2&gt;
&lt;p&gt;Having a single ray-marching loop combining the two methods (solid and
volumetric) can be challenging. In theory, we should stop marching when we
hit a solid, bail out of the loop, and do some fancy normal and lighting
calculations. We can&#x27;t afford any of that, so we&#x27;re going to start doing art
from now on.&lt;/p&gt;
&lt;p&gt;We start from the volumetric ray-marching loop, and add the distance to the
mountain:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;for (int i = 0; i &amp;lt; 100; i++) {
    vec3 p = rd*t;

    // ...

    float d = clouds_fog_density(p);
    float h = distance_to_solid(p);

    // ...
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If &lt;code&gt;h&lt;/code&gt; gets small enough, we can assume we hit a solid:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;bool solid = h &amp;lt; 0.001;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In volumetric, the attenuation is calculated with the Beer-Lambert law. For solid,
we&#x27;re simply going to make it fairly high:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    float attenuation = exp(-d*absorption);
+    float attenuation = solid ? 0.95 : exp(-d*absorption);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This has the effect of making the mountain behave like a very dense gas.&lt;/p&gt;
&lt;p&gt;We&#x27;re also going to disable the light emission from the solid (it will be
handled differently down the line):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    float emission = d*radiance;
+    float emission = solid ? 0.0 : d*radiance;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The transmittance is not changed when we hit a solid, as we just want
to accumulate light onto it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    transmittance -= weight;
+    if (!solid) transmittance -= weight;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we have to combine the volumetric stepping (&lt;code&gt;t += step_len&lt;/code&gt;) with the
solid stepping (&lt;code&gt;t += h*0.2&lt;/code&gt;) by choosing the safest step length, that is the
minimum:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    t += step_len;
+    t += min(h*0.2, step_len);
&lt;/code&gt;&lt;/pre&gt;
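To convince ourselves the min() is safe, here is a tiny 1D Python model (illustrative only: a made-up solid placed at t=5 stands in for the mountain). The damped solid step h*0.2 takes over near the surface, so the ray creeps up to the solid without ever stepping through it:

```python
def march_1d(solid_t=5.0, iterations=200):
    # hypothetical 1D scene: the solid sits at distance solid_t along the ray
    t = 0.0
    step_len = 0.15
    for _ in range(iterations):
        h = solid_t - t                  # distance_to_solid() in 1D
        t = t + min(h * 0.2, step_len)   # safest of the two step lengths
        step_len = step_len * 1.015      # larger volumetric steps as we go far
    return t
```

Far from the solid, the volumetric step drives the march; near it, the damped solid step shrinks geometrically, so t converges to the surface from below.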
&lt;p&gt;We end up with the following:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/comb0.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Combination of volumetric and solid ray-marching&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We can make out the mountain from the negative space and the discreet presence of
the fog, but it&#x27;s definitely way too dark. So the first thing we&#x27;re going to do is boost
the radiance, as well as the absorption for contrast:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-const float absorption = 0.15;
-const float radiance   = 1.0;
+const float absorption = 2.5;
+const float radiance   = 4.5;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will make the light actually overshoot, so we also have to replace
the current gamma 2.2 correction with a &lt;a href=&quot;https://mini.gmshaders.com/p/func-tanh&quot;&gt;cheap and simple tone mapping
hack&lt;/a&gt;: &lt;code&gt;tanh()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    O = vec4(pow(color, vec3(3.0/2.2)), 1);
+    O = vec4(tanh(color), 1);
&lt;/code&gt;&lt;/pre&gt;
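tanh makes a good cheap tone mapper because it is almost the identity near 0 (dark values are left untouched) while smoothly saturating toward 1 for overshooting radiance, instead of clipping. A Python sanity check of those properties:

```python
import math

# near zero, tanh(x) is approximately x (the curve starts with slope 1)
dark = math.tanh(0.1)

# large overshoots get squashed asymptotically toward 1.0 instead of clipping
bright = math.tanh(10.0)

# the mapping is monotonic, so the relative ordering of intensities is preserved
ordered = math.tanh(2.0) - math.tanh(1.0)
```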
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/comb1.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Tonemapping the scene&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The clouds and fog are much better but the mountain is still trying to act cool.
So we&#x27;re going to tweak it in the loop:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;emission += 0.1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This boosts the overall emission.&lt;/p&gt;
&lt;p&gt;While we&#x27;re at it, since the horizon is also sadly dark, we want to blast some light
into it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;color += d == 0.0 ? 0.005*h : 0.0;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;mkbosmans&lt;/code&gt; from HackerNews noticed that the opposite of &lt;code&gt;d==0.0&lt;/code&gt; is actually
&lt;code&gt;d&amp;gt;0.0&lt;/code&gt; due to the &lt;code&gt;max(...,0)&lt;/code&gt;. So we could write it more simply:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;color += d &amp;gt; 0.0 ? 0.0 : 0.005*h;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the density is null (meaning we&#x27;re outside clouds and fog), additional
light is added, proportional to how far we are from any solid (basically, the sky
gets the biggest boost).&lt;/p&gt;
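A quick standalone check of that remark: since clouds_fog_density() ends with max(d, 0.0), the returned density is never negative, so for this d the test "d strictly positive" is exactly "d not equal to zero", and its opposite is d == 0.0. In Python form (with arbitrary sample values):

```python
def boost_original(d, h):
    # original form: d == 0.0 ? 0.005*h : 0.0
    return 0.005 * h if d == 0.0 else 0.0

def boost_flipped(d, h):
    # mkbosmans' form uses "d strictly positive", which for a clamped,
    # non-negative d is the same test as "d not equal to zero"
    return 0.0 if d != 0.0 else 0.005 * h

# densities as they come out of the max(d, 0.0) clamp
densities = [max(raw, 0.0) for raw in (-2.0, -0.1, 0.0, 0.3, 5.0)]
```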
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/comb2.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;More atmospheric light&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The mountain looks fine but I wanted a more eerie atmosphere, so I changed the
attenuation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    float attenuation = solid ? 0.95 : exp(-d*absorption);
+    float attenuation = exp(solid ? -h*300.0 : -d*absorption);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, instead of being a hard-coded value, the attenuation is correlated with the
proximity to the solid (when getting close to it). This has nothing to do with any
physics formula; it&#x27;s more of an implementation trick which relies
on the ray-marching algorithm. The effect it creates is those crack-like polygon
edges on the mountain.&lt;/p&gt;
&lt;p&gt;To add more to the effect, the emission boost is tweaked into:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    emission += 0.1;
+    float e = min(p.y - mountain_y + 1.5, 1.0);
+    emission += e*e * 0.1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This makes the bottom of the mountain quadratically darker: only the tip of the
mountain will have the glowing cracks.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/comb3.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Making mountains eerie&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Color&lt;/h2&gt;
&lt;p&gt;We&#x27;ve been working in grayscale so far, which is usually a sound approach to
visual art in general. But we can afford a few more characters to turn the scene
into a decent piece of art from the 21st century.&lt;/p&gt;
&lt;p&gt;Adding the color only requires tiny changes. First, the emission boost is going
to target only the red component of the color:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    emission += e*e * 0.1;
     float alpha = 1. - attenuation;
     float weight = alpha * transmittance;
     color += weight * emission;
+    color.r += weight * e*e * 0.1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And similarly, the overall addition of light into the horizon/atmosphere is
going to get a reddish/orange tint:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt;-    color += d &amp;gt; 0.0 ? 0.0 : 0.005*h;
+    color += (d &amp;gt; 0.0 ? 0.0 : 0.005*h) * vec3(3,1,0);
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/demomaking/comb4.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Add a red/orange tint&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Last tweaks&lt;/h2&gt;
&lt;p&gt;We&#x27;re almost done. For the last tweak, we&#x27;re going to add a cyclic panning
rotation of the camera, and adjust the moving speed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;p.xz *= mat2(cos(sin(T*.2)+vec4(0,11,33,0)));
p.z += T*.3;
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;I&#x27;m currently satisfied with the &amp;quot;seed&amp;quot; of the scene, but otherwise it would
have been possible to nudge the noise in different ways. For example,
remember the &lt;code&gt;sin&lt;/code&gt; can be replaced with &lt;code&gt;cos&lt;/code&gt; in either or both volumetric
and mountain related noises. Similarly, the offsetting &lt;code&gt;+T&lt;/code&gt; could be changed
into &lt;code&gt;-T&lt;/code&gt; for a different morphing effect. And of course the rotations can
be swapped (either by changing &lt;code&gt;.xz&lt;/code&gt; into &lt;code&gt;.zx&lt;/code&gt; or transposing the values).&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Code golfing&lt;/h2&gt;
&lt;p&gt;At this point, our code has gone through the early stages of code golfing, but it
still needs some work to reach perfection. Stripped of its comments, it looks
like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Reference code: 1278 chars (unnecessary spaces and line breaks are not counted)
const float fog_y      = 0.0;
const float clouds_y   = 3.0;
const float mountain_y = -0.5;
const float absorption = 2.5;
const float radiance   = 4.5;

float noise3(vec3 p) {
    float v;
    for(float a=.01; a&amp;lt;1.; a+=a)
        p.xz *= mat2(8,6,-6,8)*.1,
        v += abs(dot(sin(p*.3/a + T*.3), vec3(1)))*a;
    return v;
}

float clouds_fog_density(vec3 p) {
    float n = noise3(p);
    float clouds_d = p.y-clouds_y+n;
    float fog_d    = p.y-fog_y+n;
    float d = max(clouds_d, -fog_d);
    return max(d, 0.0);
}

float mountain_height_map(vec2 p) {
    float h = -mountain_y;
    for (float a=.01; a&amp;lt;1.; a+=a)
        p *= mat2(8,6,-6,8)*.1,
        h += abs(dot(sin(p*.6/a), vec2(1)))*a;
    return -h;
}

float distance_to_solid(vec3 p) {
    return p.y - mountain_height_map(p.xz);
}

void main() {
    float step_len = 0.15;
    float t;

    vec3 color;

    float transmittance = 1.0;
    vec3 rd = normalize(vec3(P+P-R,R.y));
    for (int i = 0; i &amp;lt; 100; i++) {
        vec3 p = rd*t;

        p.xz *= mat2(cos(sin(T*.2)+vec4(0,11,33,0)));
        p.z += T*.3;

        float d = clouds_fog_density(p);
        float h = distance_to_solid(p);

        bool solid = h &amp;lt; 0.001;
        d *= step_len;
        float attenuation = exp(solid ? -h*300.0 : -d*absorption);
        float emission = solid ? 0.0 : d*radiance;
        float e = min(p.y - mountain_y + 1.5, 1.0);
        float alpha = 1. - attenuation;
        float weight = alpha * transmittance;
        color   += weight * emission;
        color.r += weight * e*e * 0.1;
        color += (d &amp;gt; 0.0 ? 0.0 : 0.005*h) * vec3(3,1,0);
        if (!solid) transmittance -= weight;
        t += min(h*0.2, step_len);
        step_len *= 1.015;
    }

    O = vec4(tanh(color), 1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first thing we notice is that the mountain, clouds, and fog all use the
exact same loop. Factoring them out and inlining the whole thing into the main
function is the obvious move:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// 922 chars
const float fog_y      = 0.0;
const float clouds_y   = 3.0;
const float mountain_y = -0.5;
const float absorption = 2.5;
const float radiance   = 4.5;

void main() {
    float step_len = 0.15;
    float t;

    vec3 color;

    float transmittance = 1.0;
    vec3 rd = normalize(vec3(P+P-R,R.y));
    for (int i = 0; i &amp;lt; 100; i++) {
        vec3 p = rd*t;

        p.xz *= mat2(cos(sin(T*.2)+vec4(0,11,33,0)));
        p.z += T*.3;

        float d = p.y;
        float h = p.y-mountain_y;
        for (float a=.01; a&amp;lt;1.; a+=a)
            p.xz *= mat2(8,6,-6,8)*.1,
            d += abs(dot(sin(p*.3/a + T*.3), vec3(1)))*a,
            h += abs(dot(sin(p.xz*.6/a), vec2(1)))*a;
        d = max(max(d-clouds_y, -(d-fog_y)), 0.0);

        bool solid = h &amp;lt; 0.001;
        d *= step_len;
        float attenuation = exp(solid ? -h*300.0 : -d*absorption);
        float emission = solid ? 0.0 : d*radiance;
        float e = min(p.y - mountain_y + 1.5, 1.0);
        float alpha = 1. - attenuation;
        float weight = alpha * transmittance;
        color   += weight * emission;
        color.r += weight * e*e * 0.1;
        color += (d &amp;gt; 0.0 ? 0.0 : 0.005*h) * vec3(3,1,0);
        if (!solid) transmittance -= weight;
        t += min(h*0.2, step_len);
        step_len *= 1.015;
    }

    O = vec4(tanh(color), 1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we are going to make the following changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rename every variable to single letter or inline them whenever possible&lt;/li&gt;
&lt;li&gt;Inline all constants&lt;/li&gt;
&lt;li&gt;Remove &lt;a href=&quot;https://registry.khronos.org/webgl/specs/latest/1.0/#6.39&quot;&gt;any explicit zero initialization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;float&lt;/code&gt; instead of &lt;code&gt;int&lt;/code&gt; for the iterator and instead of &lt;code&gt;bool&lt;/code&gt; for the solid flag&lt;/li&gt;
&lt;li&gt;Pack all &lt;code&gt;float&lt;/code&gt; and &lt;code&gt;vec3&lt;/code&gt; declarations together&lt;/li&gt;
&lt;li&gt;Simplify numbers: &lt;code&gt;1e2&lt;/code&gt; instead of &lt;code&gt;100.0&lt;/code&gt;, &lt;code&gt;3.&lt;/code&gt; instead of &lt;code&gt;3.0&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;vec*()&lt;/code&gt; constructors act like casts, so you can pass down integers&lt;/li&gt;
&lt;li&gt;Instead of &lt;code&gt;*x&lt;/code&gt;, &lt;code&gt;/(1/x)&lt;/code&gt; is sometimes shorter (for example &lt;code&gt;/.4&lt;/code&gt; instead
of &lt;code&gt;*2.5&lt;/code&gt;) (thanks &lt;a href=&quot;https://www.shadertoy.com/user/coyote&quot;&gt;coyote&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// 491 chars
void main() {
    vec3 c, p;
    for (float i, a, g=1., t, h, d, w, k=.15, x, e; i &amp;lt; 1e2; i++) {
        p = normalize(vec3(P+P-R,R.y))*t;
        p.xz *= mat2(cos(sin(T*.2)+vec4(0,11,33,0)));
        p.z += T*.3;
        d = p.y;
        h = p.y+.5;
        for (a=.01; a&amp;lt;1.; a+=a)
            p.xz *= mat2(8,6,-6,8)*.1,
            d += abs(dot(sin(p*.3/a + T*.3), vec3(1)))*a,
            h += abs(dot(sin(p.xz*.6/a), vec2(1)))*a;
        d = max(max(d-3., -d), 0.);
        x = h &amp;lt; .001 ? 0. : 1.;
        d *= k;
        e = min(p.y+2., 1.);
        w = g * (1. - exp(x==0. ? -h*3e2 : -d/.4));
        c += w * x*d*4.5;
        c.r += w * e*e * .1;
        c += (d &amp;gt; 0. ? .0 : h/2e2) * vec3(3,1,0);
        g -= w * x;
        t += min(h*.2, k);
        k /= .985;
    }
    O = vec4(tanh(c), 1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Last pass of tricks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Merge and unroll more expressions together&lt;/li&gt;
&lt;li&gt;Use alternative forms for &lt;code&gt;vec*(1)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Rely on mathematical equivalences such as &lt;span class=&quot;math inline&quot;&gt;e^{-x}=1/e^x&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Some symbol names can be reused (see &lt;code&gt;a&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Notice how the rotation matrix coefficients (&lt;code&gt;0,11,33,0&lt;/code&gt;) are close to the
red factors (&lt;code&gt;3,1,0&lt;/code&gt;)? That&#x27;s right, we can factor that out into a shared
constant &lt;code&gt;K&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Iterate &lt;code&gt;i&lt;/code&gt; within the condition&lt;/li&gt;
&lt;li&gt;We&#x27;re going to inline &lt;code&gt;k*=1.015&lt;/code&gt; inside the &lt;code&gt;min()&lt;/code&gt;: this is &lt;em&gt;not&lt;/em&gt; equivalent,
but in practice it makes no difference&lt;/li&gt;
&lt;li&gt;The first 5 instructions of the main loop go into the initialization
placeholder of the inner &lt;code&gt;for&lt;/code&gt;, and all the others go into the iteration
placeholder of the outer &lt;code&gt;for&lt;/code&gt;, so that we can remove all &lt;code&gt;{}&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Declare a &lt;code&gt;z&lt;/code&gt; to be used instead of &lt;code&gt;0.&lt;/code&gt; since we have a bunch of them (thanks
&lt;a href=&quot;https://www.shadertoy.com/user/coyote&quot;&gt;coyote&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;x = h &amp;lt; .001 ? 0. : 1.&lt;/code&gt; can also be obtained progressively through some
increment trick (thanks &lt;a href=&quot;https://www.shadertoy.com/user/coyote&quot;&gt;coyote&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
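Two of these equivalences are easy to double-check outside the shader: the identity exp(-x) = 1/exp(x) is what lets g*(1-exp(-x)) become g - g/exp(x), and dividing by .4 is the same as multiplying by 2.5. In Python (just the math, with arbitrary test values):

```python
import math

def weight_long(g, x):
    # readable form of the attenuation weight
    return g * (1.0 - math.exp(-x))

def weight_golfed(g, x):
    # golfed form, relying on exp(-x) = 1/exp(x)
    return g - g / math.exp(x)

cases = [(1.0, 0.3), (0.8, 2.0), (0.25, 0.01)]
```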
&lt;p&gt;I&#x27;m also reordering some instructions a bit for clarity 🙃&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// 448 chars
void main() {
    vec3 c,p,K=vec3(3,1,0);
    for(float z,i,a,g=1.,t,h,d,w,k=.15; i++&amp;lt;1e2;
        d = max(max(d-3.,-d),a=z)*k,
        w = g-g/exp(h&amp;gt;.001?a++,d/.4:h*3e2),
        g -= a*=w,
        c += a*d*4.5+(d&amp;gt;z?z:h/2e2)*K,
        a = min(p.y+2.,1.),
        c.r += w*a*a*.1,
        t += min(h*.2,k/=.985))
        for(p=normalize(vec3(P+P-R,R.y))*t,
            p.xz*=mat2(cos(sin(T*.2)+K.zyxz*11.)),
            p.z+=T*.3,
            d=p.y,h=d+.5,a=.01;a&amp;lt;1.;a+=a)
            p.xz *= mat2(8,6,-6,8)*.1,
            d += abs(dot(sin((p/a+T)*.3),p-p+a)),
            h += abs(dot(sin(p.xz*.6/a),P-P+a));
    O = vec4(tanh(c),1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here we are. All we have to do now is remove all unnecessary spaces and
line breaks to obtain the final version. I&#x27;ll leave you here with this readable
version.&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&quot;http://blog.pkh.me/img/demomaking/courtney-cook-FALrwN_MpeE-unsplash.jpg&quot; alt=&quot;&quot;&gt;
    &lt;figcaption&gt;Golfer by Courtney Cook (Unsplash)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Afterword&lt;/h2&gt;
&lt;p&gt;I&#x27;m definitely breaking the magic of that artwork by explaining everything
in detail here. But hopefully it is replaced with an appreciation for how many
concepts, and how much math and art, can be packed into so little space. Maybe
this is possible because they fundamentally overlap?&lt;/p&gt;
&lt;p&gt;Nevertheless, writing such a piece was extremely refreshing and liberating. As
developers, we&#x27;re so used to navigating through mountains of abstractions, dealing
with interoperability issues, and pissing glue code like robots. Here, even
though GLSL is a very crude language, I can&#x27;t help but be in awe of how much
beauty we can produce with a standalone shader. It&#x27;s just... pure code and math,
and I just love it.&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/44-perfecting-anti-aliasing-on-signed-distance-functions.html</guid>
 <link>http://blog.pkh.me/p/44-perfecting-anti-aliasing-on-signed-distance-functions.html</link>
 <title>Perfecting anti-aliasing on signed distance functions</title>
 <pubDate>Sat, 26 Jul 2025 14:29:32 -0000</pubDate>
 <description>&lt;p&gt;Doing anti-aliasing on &lt;a href=&quot;https://en.wikipedia.org/wiki/Signed_distance_function&quot;&gt;SDFs&lt;/a&gt; is not as straightforward as it seems. Most of the
time, we see people use a &lt;code&gt;smoothstep&lt;/code&gt; with hardcoded constants, sometimes with
screen-space information, sometimes with cryptic or convoluted formulas. Even though SDFs
have the perfect mathematical properties needed for clean anti-aliasing, the
whole issue has a larger scope than it appears at first glance. And even when
trivial solutions exist, it&#x27;s not always clear why they are a good fit. Let&#x27;s
study that together.&lt;/p&gt;
&lt;h2&gt;SDF&lt;/h2&gt;
&lt;p&gt;The article assumes that you are at least a bit familiar with what an SDF is,
but if I had to provide a quick and informal definition, I would say something
like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;It&#x27;s a function (or lookup-table of said function, usually stored in a
texture) which returns the signed distance from the specified coordinates to
a given shape, where the sign indicates whether you&#x27;re inside or outside the
shape.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A common visualization of it looks like this:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;320&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-debug.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;SDF of a moving pie/pacman, using Inigo Quilez formula and colorscheme for visualization&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The distance is fancily colored here for illustrative purposes, and the shape is
animated to show how it affects the field.&lt;/p&gt;
&lt;p&gt;Another way of seeing it is to switch to a 3D view:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;480&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-3d.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;SDF of a moving pie/pacman, as seen in 3D&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;For the sign interpretation, here we&#x27;re using the convention &lt;strong&gt;positive
inside and negative outside&lt;/strong&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Signed_distance_function#/media/File:Signed_distance1.png&quot;&gt;as seen for example on the Wikipedia
illustration&lt;/a&gt;. But this is not always the case, for example, Inigo
prefers the opposite: &lt;strong&gt;negative inside and positive outside&lt;/strong&gt;. I personally
find the Wikipedia convention more intuitive and easier to work with, but
that&#x27;s a matter of preference, so we&#x27;ll figure out the formulas for both models.
Switching from one to the other is just a sign swap, but it&#x27;s important to know
which one we are working with.&lt;/p&gt;
&lt;h2&gt;Linear ramp&lt;/h2&gt;
&lt;p&gt;A properly crafted SDF has a gradient of length 1, meaning the slope is either
going up or down, but always at the same constant rate of 1:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/anti-aliasing/sdf-grad1.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;1D side cut of an SDF depicting the gradient/slope&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;This is an important property since anti-aliasing is all about transitioning
smoothly toward (or away from) the shape. For our first attempt at anti-aliasing
we will simply follow that ramp and make a straight transition.&lt;/p&gt;
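The unit-gradient property is easy to verify numerically. A small Python sketch using a circle SDF (positive inside, as in this article) and central finite differences, purely to illustrate that the slope has length 1 everywhere away from the center:

```python
import math

def circle_sdf(x, y, radius=1.0):
    # signed distance to a circle: positive inside, negative outside
    return radius - math.hypot(x, y)

def gradient_length(x, y, eps=1e-5):
    # central finite differences approximate the gradient
    gx = (circle_sdf(x + eps, y) - circle_sdf(x - eps, y)) / (2.0 * eps)
    gy = (circle_sdf(x, y + eps) - circle_sdf(x, y - eps)) / (2.0 * eps)
    return math.hypot(gx, gy)
```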
&lt;p&gt;Once again, we are going to rely on &lt;code&gt;linear&lt;/code&gt;, one of &lt;a href=&quot;http://blog.pkh.me/p/29-the-most-useful-math-formulas.html&quot;&gt;the most useful math
formulas&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float linear(float a, float b, float x) { return (x-a)/(b-a); }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And more specifically we will need its saturated version &lt;code&gt;linearstep&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float linearstep(float a, float b, float x) { return clamp(linear(a,b,x), 0.0, 1.0); }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the same as the well-known &lt;code&gt;smoothstep&lt;/code&gt;, except it&#x27;s a straight line
when transitioning from &lt;code&gt;a&lt;/code&gt; to &lt;code&gt;b&lt;/code&gt;.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/anti-aliasing/linearstep.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;linearstep function&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The length of our ramp (the transition zone between &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;) is going to be
arbitrary at first; we will call it &lt;code&gt;w&lt;/code&gt; (for &amp;quot;width&amp;quot;). It&#x27;s our diffuse, or blur,
parameter if you prefer. The height &lt;code&gt;h&lt;/code&gt; we are looking for corresponds to the
opacity of our shape.&lt;/p&gt;
&lt;p&gt;Given a positive inside and negative outside SDF, we will start with the
transition centered around the boundary between the shape and its outside.&lt;/p&gt;
&lt;p&gt;You might be confused about the relationship between the distance and the
transition zone (diffuse width &lt;code&gt;w&lt;/code&gt;). The following diagram may help clarify
why:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/anti-aliasing/w-vs-d.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;The relationship between the diffuse width (w) and the signed distance (d)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Remember, the gradient of an SDF is supposed to have a length of 1. This means
there is a direct match between the height of the signed distance (y-axis on the
figure) and the spatial distance traveled (x-axis on the figure).&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;This is why Inigo and other folks spend a lot of energy looking for
the perfect formula for &lt;a href=&quot;https://iquilezles.org/articles/ellipsedist/&quot;&gt;the distance to an ellipse&lt;/a&gt;. We cannot
just stretch a circle, as it would distort the SDF and thus break this
important property. A broken AA would be one of the consequences.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The previous figure shows that for a centered transition, when the distance
&lt;code&gt;d&lt;/code&gt; is within &lt;code&gt;[-w/2,w/2]&lt;/code&gt;, it represents a transition width of size &lt;code&gt;w&lt;/code&gt; around
the edge, so we want it to be mapped to an opacity within &lt;code&gt;[0,1]&lt;/code&gt;. This can be
expressed with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float h = linearstep(-w/2.0, w/2.0, d);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which can be unrolled and simplified into the following tiny form:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float h = clamp(0.5 + d/w, 0.0, 1.0);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can also decide to make the transition on the outside or the inside boundary
of the shape:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float h_in  = linearstep(0.0, w, d);  // or simply clamp(    d/w, 0.0, 1.0);
float h_out = linearstep(-w, 0.0, d); // or simply clamp(1.0+d/w, 0.0, 1.0);
&lt;/code&gt;&lt;/pre&gt;
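These unrolled forms can be checked against the full linearstep in a few lines of Python (a direct port of the GLSL above, not new shader code):

```python
def clamp01(x):
    return min(max(x, 0.0), 1.0)

def linearstep(a, b, x):
    return clamp01((x - a) / (b - a))

def h_centered(d, w):
    # linearstep(-w/2, w/2, d) unrolled
    return clamp01(0.5 + d / w)

def h_inside(d, w):
    # linearstep(0, w, d) unrolled
    return clamp01(d / w)

def h_outside(d, w):
    # linearstep(-w, 0, d) unrolled
    return clamp01(1.0 + d / w)
```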
&lt;p&gt;And we can get creative and have a cursor indicating where we are on the border.
If we set &lt;code&gt;k=0&lt;/code&gt; for inside, &lt;code&gt;k=0.5&lt;/code&gt; for centered, and &lt;code&gt;k=1&lt;/code&gt; for outside:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float h = clamp(k + d/w, 0.0, 1.0);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the negative inside and positive outside SDF, we simply swap the sign:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float h = clamp(k - d/w, 0.0, 1.0);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any of these one-liners is all we need to have AA for our shape, but the
question of what value to use for the ramp width &lt;code&gt;w&lt;/code&gt; arises.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;320&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-blurry.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;&quot;anti-aliasing&quot; with a width w oscillating within [0.1,0.3]&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Pixel size&lt;/h2&gt;
&lt;p&gt;The difference between a blur and anti-aliasing is simply the width value. With
AA, it&#x27;s the size of a &amp;quot;pixel&amp;quot;, and with a blur it&#x27;s typically a user input or
an arbitrarily large value.&lt;/p&gt;
&lt;p&gt;If we are in 2D and have access to the pixel resolution, we can use it to get
the pixel size. Note that this is closely tied to the coordinate space we use to
calculate the SDF.&lt;/p&gt;
&lt;p&gt;For example, let&#x27;s say we have a canvas whose aspect ratio we don&#x27;t know;
we can calculate the screen coordinates like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec2 p = (2.0*gl_FragCoord.xy - resolution) / min(resolution.x, resolution.y);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will give us a &lt;code&gt;p&lt;/code&gt; value within &lt;code&gt;[-1,1]&lt;/code&gt; on the shortest axis (the y-axis
in landscape mode) while preserving the aspect ratio (units stay square). That
means the shortest axis has an amplitude of &lt;code&gt;2&lt;/code&gt;, so the number of pixels on that
axis corresponds to 2 units. As a result, the unit width used for the
signed distance can be obtained with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float w = 2.0 / min(resolution.x, resolution.y);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remember that this is true &lt;em&gt;only&lt;/em&gt; if the position &lt;code&gt;p&lt;/code&gt; we use for the SDF is
in that range. Basically, we have to adjust this formula to the coordinate space
we are using.&lt;/p&gt;
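&lt;p&gt;For instance, if we were to scale the coordinates by some zoom factor (here a
hypothetical &lt;code&gt;u_zoom&lt;/code&gt; uniform, purely for illustration), the pixel width would
have to be scaled by the same amount:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;uniform float u_zoom; // hypothetical zoom factor

vec2 p = u_zoom * (2.0*gl_FragCoord.xy - resolution) / min(resolution.x, resolution.y);
float w = u_zoom * 2.0 / min(resolution.x, resolution.y); // 1 pixel in the scaled space
&lt;/code&gt;&lt;/pre&gt;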
&lt;figure&gt;
  &lt;canvas width=&quot;480&quot; height=&quot;320&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-exact.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;anti-aliasing with a width w of 1 pixel&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The same with a x10 resolution to better see the AA:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;48&quot; height=&quot;32&quot; style=&quot;width:480px; height:320px; image-rendering:pixelated&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-exact.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;anti-aliasing with a width w of 1 pixel (resolution x10)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;3D and numerical derivatives&lt;/h2&gt;
&lt;p&gt;But sometimes we might not have access to the resolution, or we may want to map
that 2D SDF onto a plane in 3D or through some other transformation: for example, a
decal or some text on a wall in a video game. In that latter case, if we were to
use the screen resolution, it would lead to inconsistent anti-aliasing:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;120&quot; height=&quot;80&quot; style=&quot;width:480px; height:320px; image-rendering:pixelated&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-decal.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;An SDF shape viewed from above and in perspective, resolution x4&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We can see that, when put in perspective, the edge in the back gets way too
sharp while the edge in the front becomes a bit too blurry. Fortunately, there
is a magic trick we can use, the numerical derivatives:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float w = fwidth(d);
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;120&quot; height=&quot;80&quot; style=&quot;width:480px; height:320px; image-rendering:pixelated&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-decal-fwidth.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;An SDF shape viewed from above and in perspective, using w=fwidth(d), resolution x4&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Now we magically have a smooth pixel-wise anti-aliasing, no matter the
perspective. What is this sorcery? 🧙&lt;/p&gt;
&lt;p&gt;&lt;code&gt;fwidth&lt;/code&gt; calculates the rate of change of a given variable using fragment-based
numerical derivatives. Mathematically it is an L1-norm (also known as the Taxicab
or Manhattan norm), defined as &lt;code&gt;abs(dFdx(x)) + abs(dFdy(x))&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;But how the hell is this providing a good pixel width estimate?&lt;/p&gt;
&lt;p&gt;Let&#x27;s look at the simple case where we want to observe the rate of change of one
variable across one axis. For example, &lt;code&gt;dFdx(px)&lt;/code&gt; where &lt;code&gt;px&lt;/code&gt; is the pixel
coordinate: &lt;code&gt;float px = gl_FragCoord.x&lt;/code&gt;. We will have &lt;code&gt;dFdx(px)=1&lt;/code&gt;. Why? Because
&lt;code&gt;px&lt;/code&gt; changes at a constant rate of 1 (exactly like our SDF) from one pixel to
the next. If we remap &lt;code&gt;px&lt;/code&gt; to a value within &lt;code&gt;[-1,1]&lt;/code&gt; using &lt;code&gt;p=px/W*2.0-1.0&lt;/code&gt;
(where &lt;code&gt;W&lt;/code&gt; is the number of pixels on the x-axis), we can follow the derivation
rules and end up with &lt;code&gt;dFdx(p)=2.0/W&lt;/code&gt;. This matches the pixel size
&lt;code&gt;w&lt;/code&gt; we computed in the previous section.&lt;/p&gt;
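&lt;p&gt;Spelled out as code, the chain rule step looks like this (with &lt;code&gt;W&lt;/code&gt; the number
of pixels on the x-axis):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float px = gl_FragCoord.x;  // dFdx(px) = 1
float p  = px/W*2.0 - 1.0;  // dFdx(p)  = dFdx(px)*2.0/W = 2.0/W
&lt;/code&gt;&lt;/pre&gt;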
&lt;p&gt;Now when 3D and perspective distortions are involved, this still holds and you
may be wondering why. The intuitive answer is that &lt;code&gt;fwidth(d)&lt;/code&gt; is the rate of
change of the signed distance &lt;strong&gt;as seen from the flat pixel screen
perspective&lt;/strong&gt;. In 3D view, in the back of the shape, the distance &lt;code&gt;d&lt;/code&gt; changes
sharply from one pixel to another (meaning &lt;code&gt;fwidth(d)&lt;/code&gt; will be high), while in
the front it&#x27;s way smoother (meaning &lt;code&gt;fwidth(d)&lt;/code&gt; will be low). So this numerical
derivative is used to scale the distance back to a transition that works
smoothly from the 2D pixel point of view.&lt;/p&gt;
&lt;h3&gt;Numerical derivatives refinement&lt;/h3&gt;
&lt;p&gt;Instead of &lt;code&gt;fwidth&lt;/code&gt;, we could also use the L2-norm (also known as euclidean
distance):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float w = length(vec2(dFdx(d), dFdy(d)));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is more expensive than &lt;code&gt;fwidth&lt;/code&gt;, but it can be considered as an
alternative. The AA will be slightly different, but it&#x27;s hard to say that one
really is better than the other:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;80&quot; height=&quot;40&quot; style=&quot;width:640px; height:320px; image-rendering:pixelated&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/norm1-vs-norm2.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;L1-norm (left) vs L2-norm (right) for pixel estimate, resolution x8&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Straight vs smooth(er) ramp&lt;/h2&gt;
&lt;p&gt;Instead of a &lt;code&gt;linearstep()&lt;/code&gt;, some people like to use &lt;code&gt;smoothstep()&lt;/code&gt;. The main
reason is probably that &lt;code&gt;smoothstep()&lt;/code&gt; is a builtin while
&lt;code&gt;linearstep()&lt;/code&gt; isn&#x27;t. But is it a better choice?&lt;/p&gt;
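&lt;p&gt;For reference, since it&#x27;s not a builtin, &lt;code&gt;linearstep()&lt;/code&gt; is commonly defined
with the same signature as &lt;code&gt;smoothstep()&lt;/code&gt;, minus the Hermite curve:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float linearstep(float a, float b, float x) {
    return clamp((x - a) / (b - a), 0.0, 1.0);
}
&lt;/code&gt;&lt;/pre&gt;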
&lt;p&gt;Intuitively, to me at least, it makes perfect sense for the alpha value to
follow a linear ramp. A few weeks ago I would have adamantly argued that it&#x27;s
actually a faster and more logical choice than its curved version &lt;code&gt;smoothstep&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Well... I did some tests. With a large diffuse, here is what it looks like with
a linear ramp:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;320&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-linear.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;A blurry shape using a linearstep transition with w=0.3&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;It looks like there is a brighter highlight around the border (before the
fall-off), doesn&#x27;t it? Well, it&#x27;s an illusion, it&#x27;s just our brain noticing the
discontinuity and telling us about it.&lt;/p&gt;
&lt;p&gt;With a &lt;code&gt;smoothstep&lt;/code&gt; things get better:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;320&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-smooth.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;A blurry shape using a smoothstep transition with w=0.3&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;I wasn&#x27;t expecting that, so I stand corrected: &lt;code&gt;smoothstep&lt;/code&gt; is actually a better
choice. It&#x27;s also a builtin, so we don&#x27;t need to define our own function.&lt;/p&gt;
&lt;p&gt;Of course, one may prefer an even smoother curve, for example &lt;code&gt;smootherstep()&lt;/code&gt;,
which uses a quintic curve instead of the Hermite one:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float smootherstep(float a, float b, float x) {
    float t = linearstep(a, b, x);
    return ((6.0*t-15.0)*t+10.0)*t*t*t; // quintic
}
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;320&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-smoother.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;A blurry shape using a smootherstep transition&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;For pixel-wise anti-aliasing, the discontinuity won&#x27;t be noticed, so using a
linear interpolation is still a perfectly valid choice.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Color space&lt;/h2&gt;
&lt;p&gt;Once we have our anti-aliasing value, we&#x27;re pretty much done. But we still have
the question of how to use it. My previous examples were in black and white, but
in many cases we need blending between colors. The question of how to blend is
probably the trickiest of all, and &lt;a href=&quot;http://blog.pkh.me/p/43-the-current-technology-is-not-ready-for-proper-blending.html&quot;&gt;my previous article was entirely dedicated
to this particular issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In all the examples on this page, I&#x27;ve been using OkLab blending because
&amp;quot;it&#x27;s perfect&amp;quot;. But reality is likely to force you to use a simple linear
blending. For anti-aliasing, that&#x27;s honestly just fine, the illusion still works
out; but if you&#x27;re doing a blur, I would advise against it: switch to a
better colorspace like OkLab whenever possible.&lt;/p&gt;
&lt;p&gt;See here how, because of the way human perception works, the linear blending
feels &amp;quot;bobby&amp;quot; and too large compared to OkLab:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;640&quot; height=&quot;320&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/anti-aliasing/sdf-linear-vs-oklab.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;A blurry shape blend using linear (left) or OkLab (right) blending&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Given all these tools we can combine them according to our needs and
preferences. As closing words, let me propose a few reference examples:&lt;/p&gt;
&lt;h3&gt;A &amp;quot;good enough&amp;quot; centered linear ramp working in 2D or 3D&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec2 p = (2.0*gl_FragCoord.xy - resolution) / min(resolution.x, resolution.y);
float d = sdWP(p, ...); // signed distance, positive inside, negative outside SDF (Wikipedia style)
float h = clamp(0.5 + d/fwidth(d), 0.0, 1.0);
vec3 c = mix(c0, c1, h); // this assumes c0 and c1 colors are in linear space
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;A smooth user blur working in 2D only&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float r = min(resolution.x, resolution.y);
vec2 p = (2.0*gl_FragCoord.xy - resolution) / r;
float w = max(u_blur, 2.0/r); // blur should not be smaller than unit size
float d = sdIQ(p, ...); // signed distance, negative inside, positive outside SDF (iQuilez style)
float h = smoothstep(-w/2.0, w/2.0, -d); // smooth centered blur
vec3 c = mix(c0, c1, h); // this assumes c0 and c1 colors are in linear space
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;An outer anti-aliasing with a more refined unit width estimation&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec2 p = (2.0*gl_FragCoord.xy - resolution) / min(resolution.x, resolution.y);
float d = sdWP(p, ...); // signed distance, positive inside, negative outside SDF (Wikipedia style)
float w = length(vec2(dFdx(d),dFdy(d))); // L2-norm width estimation
float h = smoothstep(-w, 0.0, d); // smooth outer AA
vec3 c = mix(c0, c1, h); // this assumes c0 and c1 colors are in linear space
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Anti-aliasing SDFs can be beautifully simple, and now we also have a better
understanding of the magic behind it. ✨&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/43-the-current-technology-is-not-ready-for-proper-blending.html</guid>
 <link>http://blog.pkh.me/p/43-the-current-technology-is-not-ready-for-proper-blending.html</link>
 <title>The current technology is not ready for proper blending</title>
 <pubDate>Fri, 18 Jul 2025 20:10:43 -0000</pubDate>
 <description>&lt;p&gt;The idea that we must always linearize sRGB gradients or work in a perceptually
uniform colorspace is starting to be accepted universally. But is it that
simple?&lt;/p&gt;
&lt;p&gt;When I learned about the subject, it felt like being handed a hammer and using
it everywhere. The reality is a bit more nuanced. In this article we will see
when to use which, how to use them, and we will then see why the situation is
more dire than it looks.&lt;/p&gt;
&lt;p&gt;&lt;canvas width=&quot;800&quot; height=&quot;200&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/intro.frag&quot;&gt;&lt;/canvas&gt;&lt;/p&gt;
&lt;h2&gt;Code snippets&lt;/h2&gt;
&lt;p&gt;Before we start, since we are going to use GLSL as our language, the following
are the reference functions we will rely on for the rest of the article.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 s2l(vec3 c) { // sRGB to linear
    return mix(c/12.92, pow((max(c,0.0)+0.055)/1.055,vec3(2.4)), step(vec3(0.04045),c));
}

vec3 l2s(vec3 c) { // linear to sRGB
    return mix(c*12.92, 1.055*pow(max(c,0.0),vec3(1./2.4))-0.055, step(vec3(0.0031308),c));
}

vec3 l2oklab(vec3 rgb) { // linear to OkLab
    const mat3 rgb2lms = mat3(
        +0.4122214708, +0.2119034982, +0.0883024619,
        +0.5363325363, +0.6806995451, +0.2817188376,
        +0.0514459929, +0.1073969566, +0.6299787005);
    const mat3 lms2lab = mat3(
        +0.2104542553, +1.9779984951, +0.0259040371,
        +0.7936177850, -2.4285922050, +0.7827717662,
        -0.0040720468, +0.4505937099, -0.8086757660);
    vec3 lms = rgb2lms * rgb;
    return lms2lab * pow(lms, vec3(1.0/3.0));
}

vec3 oklab2l(vec3 lab) { // OkLab to linear
    const mat3 lab2lms = mat3(
        +1.0000000000, +1.0000000000, +1.0000000000,
        +0.3963377774, -0.1055613458, -0.0894841775,
        +0.2158037573, -0.0638541728, -1.2914855480);
    const mat3 lms2rgb = mat3(
        +4.0767416621, -1.2684380046, -0.0041960863,
        -3.3077115913, +2.6097574011, -0.7034186147,
        +0.2309699292, -0.3413193965, +1.7076147010);
    vec3 lms = lab2lms * lab;
    return lms2rgb * (lms*lms*lms);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also, the output of the pipeline will be expected to be sRGB all the time.&lt;/p&gt;
&lt;h2&gt;Color gradients&lt;/h2&gt;
&lt;p&gt;To illustrate what sRGB, linear RGB, and OkLab respectively look like, let&#x27;s
interpolate between two colors in each of them:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/gradient-color.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Color gradients from top to bottom: sRGB, linear, OkLab&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The 3 stripes were generated like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 o_srgb   = mix(c0, c1, v);
vec3 o_linear = l2s(mix(s2l(c0), s2l(c1), v));
vec3 o_oklab  = l2s(oklab2l(mix(l2oklab(s2l(c0)), l2oklab(s2l(c1)), v)));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;code&gt;v&lt;/code&gt; is simply the x coordinate between 0 and 1, &lt;code&gt;c0&lt;/code&gt; the left color, and
&lt;code&gt;c1&lt;/code&gt; the right one.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;The input colors are assumed to be sRGB. Similarly, we always
make sure to output sRGB at the end (with &lt;code&gt;l2s()&lt;/code&gt;) because that&#x27;s what the
pipeline expects.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Key takeaways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sRGB is not acceptable because of this grayish/brownish zone, which is
also perceived darker. In various situations this creates undesirable muddy
midtones. In general, it&#x27;s &lt;a href=&quot;https://www.youtube.com/watch?v=LKnqECcg6Gw&quot;&gt;wrong and broken&lt;/a&gt; to do that.&lt;/li&gt;
&lt;li&gt;Linear is better from a purely physical point of view as it models the
mixing of light energy properly. But from a color perception point of view
it&#x27;s not ideal, for example here it has this transition into pinkish which
might not be desirable.&lt;/li&gt;
&lt;li&gt;The last one is using &lt;a href=&quot;https://bottosson.github.io/posts/oklab/&quot;&gt;OkLab&lt;/a&gt; for a perceptually uniform gradient; it is
the one providing the best result for our human perception, at a certain
performance cost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The general consensus is as follows: if you need a color transition within a
shape or texture, or some sort of color map, OkLab is the best tool, while
linear is cheap, physically correct, and usually visually acceptable.&lt;/p&gt;
&lt;h2&gt;But what about monochrome gradients?&lt;/h2&gt;
&lt;p&gt;Things are not as obvious as they seem when we work in monochrome. If
instead of red and blue we pick black and white, this is what happens:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/gradient-gray.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Grayscale gradients from top to bottom: sRGB, linear, OkLab&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Suddenly this tells a whole different story. sRGB becomes perfectly acceptable,
linear skews way too much toward lightness, and OkLab remains the best. The
linear gradient felt acceptable before, but now it is highly questionable.&lt;/p&gt;
&lt;p&gt;Just to be clear, the linear strip &lt;em&gt;is&lt;/em&gt; linear: you can see it as linear energy,
or casually speaking &amp;quot;wattage&amp;quot;, to which our perception responds non-linearly.&lt;/p&gt;
&lt;p&gt;At this point one may even argue that sRGB looks best.&lt;/p&gt;
&lt;img src=&quot;http://blog.pkh.me/img/gradient-blending/srgb-linear-iq-meme.jpg&quot; alt=&quot;sRGB vs linear IQ meme&quot;&gt;
&lt;p&gt;So what can we do about this?&lt;/p&gt;
&lt;p&gt;First of all, we always need to question what we are trying to achieve, and
fortunately sometimes we can take a few shortcuts. For example, let&#x27;s say we
want to depict a heat map in black and white. In &lt;a href=&quot;http://blog.pkh.me/p/42-sharing-everything-i-could-understand-about-gradient-noise.html&quot;&gt;my previous article&lt;/a&gt;
I had to display 2D noise, so I wanted the observer to experience a linear
perception of the &amp;quot;height&amp;quot; of the noise. In this case, working in sRGB (that is,
doing zero effort with regards to perception) is actually a better call than
mixing between black and white in linear space:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/noise2-linear-srgb.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Noise 2D with height as sRGB (left) or linear (right)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Here we are comparing these two:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 o_srgb   = vec3(v);      // equivalent to mix(black, white, v)
vec3 o_linear = l2s(vec3(v)); // equivalent to l2s(mix(black, white, v))
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;We removed the &lt;code&gt;mix&lt;/code&gt; from the formulas because &lt;code&gt;black=vec3(0)&lt;/code&gt; and
&lt;code&gt;white=vec3(1)&lt;/code&gt; keep the same values when uncompressed to linear
space.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;To do things right we may want to use OkLab, but that feels overkill since this
is just a straightforward monochromatic signal. Fortunately, the perceptual
lightness is fairly simple to model: with monochromatic input, OkLab uses
&lt;code&gt;L=x³&lt;/code&gt;, which is basically equivalent to a gamma correction with &lt;code&gt;γ=3&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This means that we can simplify the OkLab interpolation we used before to the
very simple:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 o_oklab = l2s(vec3(v*v*v));  // equivalent to l2s(oklab2l(vec3(v,0,0)))
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/noise2-oklab.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Noise 2D with height remapped to human lightness perception&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Doing this simple operation is exactly equivalent to interpolating between black
and white in OkLab space, except it&#x27;s just 2 extra multiplications.&lt;/p&gt;
&lt;p&gt;We still need to be extra careful if we want to swap the black and white. &lt;code&gt;v&lt;/code&gt;
needs to be swapped &lt;em&gt;before&lt;/em&gt; the gamma encoding, and that means before the sRGB
gamma encoding as well:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/gradient-w2b.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Top to bottom: srgb(1-v³) (incorrect), 1-srgb(v³) (incorrect), srgb((1-v)³)&lt;/figcaption&gt;
&lt;/figure&gt;
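&lt;p&gt;The three variants from the figure can be written as:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 o_bad1 = l2s(vec3(1.0 - v*v*v));        // srgb(1-v³): incorrect
vec3 o_bad2 = 1.0 - l2s(vec3(v*v*v));        // 1-srgb(v³): incorrect
vec3 o_good = l2s(vec3(pow(1.0 - v, 3.0)));  // srgb((1-v)³): correct
&lt;/code&gt;&lt;/pre&gt;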
&lt;h3&gt;One extra trick: combining gammas&lt;/h3&gt;
&lt;p&gt;sRGB has a curve that closely approximates a gamma correction &lt;code&gt;γ=2.2&lt;/code&gt;. So
sometimes, instead of using &lt;code&gt;l2s(rgb)&lt;/code&gt;, we may prefer to use the simpler
&lt;code&gt;pow(rgb,vec3(1.0/2.2))&lt;/code&gt;. It means we could replace &lt;code&gt;l2s(vec3(v*v*v))&lt;/code&gt; with the
following to merge the two operations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 o_oklab = vec3(pow(v, 3.0/2.2)); // combination of v³ and gamma 2.2 (sRGB-like) encoding
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the white-to-black version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 o_oklab = vec3(pow(1.0-v, 3.0/2.2));
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;Whenever you use &lt;code&gt;pow&lt;/code&gt;, make sure your input is positive. Adding a
&lt;code&gt;max(v,0.0)&lt;/code&gt; for safety might be reasonable in certain cases.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The difference between a proper sRGB conversion and the combined gamma is pretty
small:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;160&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/gradient-srgb-vs-pow.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Top: srgb(v³), bottom: v^(3/2.2)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Alpha blending and pre-multiplication&lt;/h2&gt;
&lt;p&gt;Sometimes, instead of fading colors into each other, we need to compose shapes,
textures, masks, etc. This need for compositing, or blending, arises when the
pipelines are separated, meaning we are not doing everything in the same
fragment shader. For example, we could have a shape generated in one fragment
shader, which we need to overlay onto a surface. That shape might have some
non-binary transparency, whether for anti-aliasing purposes, a blur, or similar.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;380&quot; height=&quot;240&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/shape.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;An example of a partially transparent shape&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If that shape were to be blended onto another colored surface, we would like to
get the same effect as the gradients earlier. For &lt;a href=&quot;https://www.realtimerendering.com/blog/gpus-prefer-premultiplication/&quot;&gt;well-known reasons&lt;/a&gt;,
it is likely that this shape would end up as a pre-multiplied color, which
would be blended onto one or more layers. If what I just said is confusing, I
recommend checking out this &lt;a href=&quot;https://ciechanow.ski/alpha-compositing/&quot;&gt;good article on alpha compositing from Bartosz
Ciechanowski&lt;/a&gt;. The literature on the subject is quite extensive, so I will
assume familiarity with it.&lt;/p&gt;
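&lt;p&gt;As a quick refresher, the standard &amp;quot;over&amp;quot; operator on pre-multiplied colors
is a one-liner:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// src and dst carry pre-multiplied colors: (rgb*a, a)
vec4 over(vec4 src, vec4 dst) {
    return src + (1.0 - src.a) * dst;
}
&lt;/code&gt;&lt;/pre&gt;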
&lt;p&gt;Of course, if we are to do things right, the blending has to happen in
linear space. Do not consider sRGB for alpha blending: it&#x27;s an even
more terrible idea than before because of the bilinear filtering, transforms, or
mipmapping that can happen between the pre-multiplication and the blending itself.&lt;/p&gt;
&lt;p&gt;But that means we would end up with the linear gradient shortcomings from
earlier, wouldn&#x27;t we? And this is where things get ugly.&lt;/p&gt;
&lt;p&gt;Look at the difference between a linear and an OkLab blending, in black and
white:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;500&quot; height=&quot;250&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/blend-lin-vs-ok-wob.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Blending of a blurry white circle onto black, left is linear, right is OkLab&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If we invert the colors:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;500&quot; height=&quot;250&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/gradient-blending/blend-lin-vs-ok-bow.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Blending of a blurry black circle onto white, left is linear, right is OkLab&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We have the exact same problem as earlier, but seeing it with an actual
blending of shapes makes the problem particularly striking. The white and black
OkLab circles look the same size (because they are), and they don&#x27;t have the
unfortunate &amp;quot;bobbing&amp;quot; effect of the linear version (on the white onto black).&lt;/p&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;The OkLab blending is done with pre-multiplied Lab colors. It is important
not to pre-multiply linear values which are then converted to OkLab, this
will give very unexpected results.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The problem is, it is very unlikely that your whole graphics pipeline would
switch to OkLab for every texture and buffer. And since most of the time
pipelines are built for more than just black and white, the cube hack suggested
earlier has a very limited scope. In the case of shape blending, it is almost
certain that the whole pipeline would not be contained in a single shader where
you can just mix in OkLab. You&#x27;re probably thinking of using sRGB, but in a
blending pipeline that really is a terrible idea.&lt;/p&gt;
&lt;h2&gt;Final words&lt;/h2&gt;
&lt;p&gt;In practice, neither sRGB nor pure linear blending give good results, and using
OkLab is not always an option. And unfortunately, I don&#x27;t have a good answer to
this whole situation. My next article is about anti-aliasing where this problem
also exists, and I must admit this whole ordeal puts me in quite some distress;
I had to talk about this issue first.&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/42-sharing-everything-i-could-understand-about-gradient-noise.html</guid>
 <link>http://blog.pkh.me/p/42-sharing-everything-i-could-understand-about-gradient-noise.html</link>
 <title>Sharing everything I could understand about gradient noise</title>
 <pubDate>Fri, 06 Jun 2025 14:45:38 -0000</pubDate>
 <description>&lt;p&gt;You&#x27;ve most likely heard about &lt;strong&gt;gradient noise&lt;/strong&gt; through the name &lt;em&gt;Perlin
noise&lt;/em&gt;, which refers to one particular implementation with various CPU
optimizations. Because it&#x27;s an incredible tool for creative work, it&#x27;s used
virtually everywhere: visual effects, video games, procedural mathematical art,
etc. While getting it right can sometimes be subtle, a &amp;quot;broken&amp;quot; implementation
can still look good or interesting. After all, &amp;quot;it looks fine, and I&#x27;m an
artist&amp;quot;.&lt;/p&gt;
&lt;p&gt;In order to gain a deeper and more meaningful understanding we will start
studying the 1D version (a case often omitted in the literature), then slowly
climb our way up in dimensions and complexity. We&#x27;ll also work from a GPU
perspective rather than a CPU-based one, hence all code snippets and visuals
here are implemented in WebGL2/GLSL (hopefully without being too heavy on
performance). They should run on most modern devices; let me know if you run
into issues.&lt;/p&gt;
&lt;p&gt;Before we begin, credit where it&#x27;s due: most of the material here is nothing
new. This article is the result of weeks of studying and experimenting with
the maths from &lt;a href=&quot;https://iquilezles.org/articles/&quot;&gt;Inigo Quilez&#x27;s incredible pages&lt;/a&gt; and other resources scattered
over the Internet. But as rich and valuable as these resources are, they
sometimes move quickly over the details, assuming they&#x27;re obvious. This post is
an attempt to fill those gaps.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;140&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/intro.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;A welcoming wavy 1D gradient noise signal&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Hashing function and pseudo-random values&lt;/h2&gt;
&lt;p&gt;At the most elementary level, we need a deterministic coordinate based
pseudo-random system. More specifically, for any given integer coordinate we
need a random value, and as uniformly distributed as possible. Something like:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
h(-3) &amp;amp;= -0.006124 \\
h(-2) &amp;amp;= -0.996686 \\
h(-1) &amp;amp;= 0.200864 \\
h(0) &amp;amp;= -1.000000 \\
h(1) &amp;amp;= 0.053313 \\
h(2) &amp;amp;= -0.893312 \\
h(3) &amp;amp;= 0.854923 \\
\text{...}
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Perlin&#x27;s implementation relies on a permutation table, which is convenient when
working on the CPU, but more awkward for a shader. On the GPU, most people rely
on various floating point hacks or sub-optimal bit tricks, which often fall
short when exploring the full 32-bit range of inputs.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;We can not use a PRNG because it relies on a state, whereas we need determinism
(for one coordinate, we always want the same random value). LCGs (Linear
Congruential Generators) are a sub-class of PRNGs where the state is
actually the returned value. But even then, we can not re-use the previously
returned value since we need seeking, that is, the ability to get the random
value associated with our integer coordinate. Feeding a PRNG/LCG with
our coordinate instead of the expected state would be equivalent to changing
the seed at every call, and this may create an important bias in the random
distribution. This is why we need integer hashing instead.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So we need a hashing function, and we will have to limit ourselves to
32-bit because we&#x27;re on the GPU. Fortunately, we won&#x27;t have to look for
very long because in 2018 &lt;a href=&quot;https://nullprogram.com/blog/2018/07/31/&quot;&gt;Chris Wellons found a pretty good one he named
lowbias32&lt;/a&gt;, which was later &lt;a href=&quot;https://github.com/skeeto/hash-prospector/issues/19#issuecomment-1120105785&quot;&gt;refined by TheIronBorn&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This is the first building block we will be using, the hashing function:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;uint hash(uint x) {
    x = (x ^ (x &amp;gt;&amp;gt; 16)) * 0x21f0aaadU;
    x = (x ^ (x &amp;gt;&amp;gt; 15)) * 0x735a2d97U;
    return x ^ (x &amp;gt;&amp;gt; 15);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is sweet and all, but a 32-bit unsigned integer is not directly useful by
itself; we need a normalized float. We could naively divide by &lt;code&gt;0xffffffff&lt;/code&gt;,
but we&#x27;d end up with a nonuniform distribution (not as important as it may
sound, to be honest). Instead, we will adapt the &lt;a href=&quot;https://prng.di.unimi.it/&quot;&gt;technique presented by
Sebastiano Vigna for doubles&lt;/a&gt; to floats, which in GLSL reads:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float u2f(uint x) { return float(x &amp;gt;&amp;gt; 8U) * uintBitsToFloat(0x33800000U); }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Combining these two functions into &lt;code&gt;u2f(hash(x))&lt;/code&gt; maps any 32-bit
coordinate &lt;span class=&quot;math inline&quot;&gt;x&lt;/span&gt; to a random float in &lt;span class=&quot;math inline&quot;&gt;[0,1)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;To match our &lt;span class=&quot;math inline&quot;&gt;h(x)&lt;/span&gt; function from earlier and make it more &amp;quot;signal-like&amp;quot;, we can
optionally center it around &lt;span class=&quot;math inline&quot;&gt;0&lt;/span&gt; by remapping it from &lt;span class=&quot;math inline&quot;&gt;[0,1)&lt;/span&gt; to &lt;span class=&quot;math inline&quot;&gt;[-1,1)&lt;/span&gt;:
&lt;code&gt;h(x)=u2f(hash(x))*2.0-1.0&lt;/code&gt;.&lt;/p&gt;
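&lt;p&gt;As a sanity check, here is a hypothetical CPU-side C port of these two
building blocks (not shader code; it assumes IEEE-754 floats, with the hex
literal &lt;code&gt;0x1p-24f&lt;/code&gt; standing in for &lt;code&gt;uintBitsToFloat(0x33800000U)&lt;/code&gt;, both
being &lt;span class=&quot;math inline&quot;&gt;2^{-24}&lt;/span&gt;), verifying the claimed output ranges:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

static uint32_t hash(uint32_t x) {
    x = (x ^ (x &amp;gt;&amp;gt; 16)) * 0x21f0aaadu;
    x = (x ^ (x &amp;gt;&amp;gt; 15)) * 0x735a2d97u;
    return x ^ (x &amp;gt;&amp;gt; 15);
}

static float u2f(uint32_t x) { return (float)(x &amp;gt;&amp;gt; 8) * 0x1p-24f; }

static float h(int32_t x) { return u2f(hash((uint32_t)x)) * 2.0f - 1.0f; }

int main(void) {
    for (int32_t x = -1000; x &amp;lt; 1000; x++) {
        float v = u2f(hash((uint32_t)x));
        assert(v &amp;gt;= 0.0f &amp;amp;&amp;amp; v &amp;lt; 1.0f);        /* u2f(hash(x)) lands in [0,1) */
        assert(h(x) &amp;gt;= -1.0f &amp;amp;&amp;amp; h(x) &amp;lt; 1.0f); /* centered version in [-1,1) */
    }
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;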
&lt;h2&gt;Expanding the hash function to more dimensions&lt;/h2&gt;
&lt;p&gt;When we work in 2 or more dimensions, our input grid coordinates will be
more than one value, but we need a way to feed them to our single-parameter hash
function. One trick is to use a nested xor hash:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;uint hash(uvec2 x) { return hash(x.x ^ hash(x.y)); }  // for 2D input
uint hash(uvec3 x) { return hash(x.x ^ hash(x.yz)); } // for 3D input
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the pixel coordinates as input, we can test the hashing function in 2D:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float stepnoise2(vec2 p) {
    ivec2 i = ivec2(floor(p));    // integer coordinate, or lattice
    return u2f(hash(uvec2(i)));   // non-centered h(x)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/stepnoise2.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Display h(x,y) by stepping the contiguous coordinates&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition warning&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Warning&lt;/p&gt;
&lt;p&gt;We&#x27;re going through an intermediate signed integer conversion before going
unsigned to avoid issues with negative coordinates. Quoting the GLSL 4.60
specification: &amp;quot;It is undefined to convert a negative floating-point value
to an uint&amp;quot;. With this conversion, we will only have a problem when the
coordinates go outside the signed 32-bit range.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Basic signal and white noise&lt;/h2&gt;
&lt;p&gt;Aside from stepping, we can also assign values to the integer coordinates only
and interpolate linearly between them. Let&#x27;s try this in 1D:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float value(int x) { return u2f(hash(uint(x)))*2.0 - 1.0; } // h(x)

float vnoise1_linear(float p) {
    int i = int(floor(p));                // integer coordinate, or lattice
    float f = fract(p);                   // x-position between the 2 surrounding values
    return mix(value(i), value(i+1), f);  // linear interpolation between the two
}
&lt;/code&gt;&lt;/pre&gt;
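&lt;p&gt;One reassuring property we can check on the CPU: at integer positions the
interpolation collapses to the lattice value itself, and halfway between two
lattice points we get their average. A hypothetical C port (with &lt;code&gt;mixf&lt;/code&gt;
standing in for GLSL &lt;code&gt;mix&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

static uint32_t hash(uint32_t x) {
    x = (x ^ (x &amp;gt;&amp;gt; 16)) * 0x21f0aaadu;
    x = (x ^ (x &amp;gt;&amp;gt; 15)) * 0x735a2d97u;
    return x ^ (x &amp;gt;&amp;gt; 15);
}
static float u2f(uint32_t x) { return (float)(x &amp;gt;&amp;gt; 8) * 0x1p-24f; }
static float value(int32_t x) { return u2f(hash((uint32_t)x)) * 2.0f - 1.0f; }
static float mixf(float a, float b, float t) { return a + (b - a) * t; }

static float vnoise1_linear(float p) {
    int32_t i = (int32_t)floorf(p);
    float f = p - floorf(p);
    return mixf(value(i), value(i + 1), f);
}

int main(void) {
    for (int32_t i = -100; i &amp;lt; 100; i++)
        assert(vnoise1_linear((float)i) == value(i)); /* lattice points hit their value exactly */
    /* halfway between two lattice points we get the average of the two values */
    assert(fabsf(vnoise1_linear(2.5f) - 0.5f*(value(2) + value(3))) &amp;lt; 1e-6f);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;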
&lt;p&gt;Feeding this function with the x-axis coordinate, we get the amplitude (height)
of the signal:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;150&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/vnoise1_linear.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Basic 1D value noise with linear interpolation&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;This might not be obvious yet, but the beauty of this is that this
&amp;quot;&lt;strong&gt;infinite&lt;/strong&gt;&amp;quot; signal is &lt;strong&gt;deterministic&lt;/strong&gt; and thus &lt;strong&gt;seekable&lt;/strong&gt;. Indeed,
we can move forward and backward to any real position &lt;span class=&quot;math inline&quot;&gt;p&lt;/span&gt; and instantly know
what the signal looks like there. This is essential for &lt;strong&gt;procedural&lt;/strong&gt; programming.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;But linear interpolation is usually not that great for a signal, so instead
of a straight line we will use a smooth fade. The two commonly used fading
functions are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;cubic Hermite curve&lt;/strong&gt;, &lt;span class=&quot;math inline&quot;&gt;f(t)=3t^2-2t^3&lt;/span&gt; (also used in GLSL
&lt;code&gt;smoothstep()&lt;/code&gt;) initially used by Ken Perlin in his first Perlin Noise
implementation.&lt;/li&gt;
&lt;li&gt;The more modern (and more complex) &lt;strong&gt;quintic curve&lt;/strong&gt; &lt;span class=&quot;math inline&quot;&gt;f(t)=6t^5-15t^4+10t^3&lt;/span&gt;
introduced in 2002 by Ken Perlin in his proposed improved version of
Perlin Noise, in order to address discontinuities in the 2nd order derivative
&lt;span class=&quot;math inline&quot;&gt;f&#x27;&#x27;(t)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For the record, in GLSL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float fade_quintic(float t) { return ((6.0*t-15.0)*t+10.0)*t*t*t; }
float fade_hermite(float t) { return (3.0-2.0*t)*t*t; }
&lt;/code&gt;&lt;/pre&gt;
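&lt;p&gt;Both fades map &lt;span class=&quot;math inline&quot;&gt;[0,1]&lt;/span&gt; to &lt;span class=&quot;math inline&quot;&gt;[0,1]&lt;/span&gt; while flattening the endpoints, which is what removes the
kinks at the lattice points. A quick hypothetical C check of these properties,
estimating the endpoint slopes with finite differences:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;math.h&amp;gt;

static float fade_quintic(float t) { return ((6.0f*t - 15.0f)*t + 10.0f)*t*t*t; }
static float fade_hermite(float t) { return (3.0f - 2.0f*t)*t*t; }

int main(void) {
    /* both curves pin the endpoints and the midpoint */
    assert(fade_quintic(0.0f) == 0.0f &amp;amp;&amp;amp; fade_quintic(1.0f) == 1.0f);
    assert(fade_hermite(0.0f) == 0.0f &amp;amp;&amp;amp; fade_hermite(1.0f) == 1.0f);
    assert(fabsf(fade_quintic(0.5f) - 0.5f) &amp;lt; 1e-6f);
    assert(fabsf(fade_hermite(0.5f) - 0.5f) &amp;lt; 1e-6f);
    /* slopes at the endpoints are ~0: the fades ease in and out */
    float e = 1e-3f;
    assert(fabsf(fade_quintic(e)) / e &amp;lt; 1e-2f);
    assert(fabsf(1.0f - fade_hermite(1.0f - e)) / e &amp;lt; 1e-2f);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;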
&lt;p&gt;We will stick with the quintic for the rest of the article:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;#define fade fade_quintic

float vnoise1(float p) {
    int i = int(floor(p));
    float f = fract(p);
    float a = fade(f);
    return mix(value(i), value(i+1), a);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;150&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/vnoise1.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;1D value noise with quintic interpolation as fading function&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;1D gradient noise&lt;/h2&gt;
&lt;p&gt;Still, a signal generated in such a way may not be desirable due to the abrupt
changes in slope/frequency; it is too &amp;quot;unstable&amp;quot;. So instead of using the
random values directly as noise, we &lt;strong&gt;interpret them as gradients&lt;/strong&gt;: this is
&lt;em&gt;gradient noise&lt;/em&gt;.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise1dancing.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;How the gradient values affect the signal shape&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;If we&#x27;re pedantic, in 1D they can&#x27;t exactly be called gradients; we should
use the term &lt;em&gt;slopes&lt;/em&gt;, or &lt;em&gt;angles&lt;/em&gt;. But we will keep the word for consistency
with 2 and more dimensions.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The curve passes through &lt;span class=&quot;math inline&quot;&gt;y=0&lt;/span&gt; at every lattice point, and the
random gradient assigned to each point shapes the surrounding curve. This may
sound complicated to implement, but in practice it&#x27;s only 2 multiplications and
1 subtraction more than the value noise (for 1D at least):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float grad(int x) { // int lattice to random [-1,1)
    return u2f(hash(uint(x))) * 2.0 - 1.0;
}

float noise1(float p) {
    int i = int(floor(p));
    float g0 = grad(i);
    float g1 = grad(i + 1);

    float f = fract(p);
    float v0 = g0 * f;
    float v1 = g1 * (f - 1.0);

    float a = fade(f);
    return mix(v0, v1, a);
}
&lt;/code&gt;&lt;/pre&gt;
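&lt;p&gt;We can verify on the CPU that this construction behaves as described: the
signal is exactly zero at every lattice point, and right next to a lattice
point its slope matches the stored gradient. A hypothetical C port:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

static uint32_t hash(uint32_t x) {
    x = (x ^ (x &amp;gt;&amp;gt; 16)) * 0x21f0aaadu;
    x = (x ^ (x &amp;gt;&amp;gt; 15)) * 0x735a2d97u;
    return x ^ (x &amp;gt;&amp;gt; 15);
}
static float u2f(uint32_t x) { return (float)(x &amp;gt;&amp;gt; 8) * 0x1p-24f; }
static float grad(int32_t x) { return u2f(hash((uint32_t)x)) * 2.0f - 1.0f; }
static float mixf(float a, float b, float t) { return a + (b - a) * t; }
static float fade(float t) { return ((6.0f*t - 15.0f)*t + 10.0f)*t*t*t; }

static float noise1(float p) {
    int32_t i = (int32_t)floorf(p);
    float g0 = grad(i), g1 = grad(i + 1);
    float f = p - floorf(p);
    float v0 = g0 * f, v1 = g1 * (f - 1.0f);
    return mixf(v0, v1, fade(f));
}

int main(void) {
    for (int32_t i = -50; i &amp;lt; 50; i++) {
        assert(noise1((float)i) == 0.0f); /* zero crossing at every lattice point */
        /* just after a lattice point, the slope matches the stored gradient */
        float e = 1e-3f;
        assert(fabsf(noise1((float)i + e) / e - grad(i)) &amp;lt; 1e-2f);
    }
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;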
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise1dbg.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;1D gradient noise with slope indicators on the lattice coordinates&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Geometrically speaking, &lt;span class=&quot;math inline&quot;&gt;v_0&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;v_1&lt;/span&gt; are y-coordinates obtained by extending
the 2 slopes (the little sticks at each lattice) around our current target
point &lt;span class=&quot;math inline&quot;&gt;p&lt;/span&gt; and finding where the vertical line &lt;span class=&quot;math inline&quot;&gt;x=p&lt;/span&gt; intersects them.
Then we smoothly interpolate between them using our fade function.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;One important property of the gradient noise is that it passes through &lt;span class=&quot;math inline&quot;&gt;0&lt;/span&gt;
at every lattice point, that is at regular intervals. In other words,
&lt;span class=&quot;math inline&quot;&gt;\textbf{noise}_1(p)=0&lt;/span&gt; whenever &lt;span class=&quot;math inline&quot;&gt;p&lt;/span&gt; is an integer.&lt;/p&gt;
&lt;/div&gt;
&lt;h2&gt;Expanding to 2 dimensions&lt;/h2&gt;
&lt;p&gt;In 2D, each lattice point stores a 2-component gradient vector. To evaluate
noise at a point &lt;span class=&quot;math inline&quot;&gt;p=(x,y)&lt;/span&gt;, instead of a simple multiplication we compute the
dot product of each gradient vector with the vector from the lattice corner
to &lt;span class=&quot;math inline&quot;&gt;(x,y)&lt;/span&gt;, then &lt;strong&gt;bilinearly interpolate&lt;/strong&gt; those four dot products using a 2D
fade function.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;#define bmix(a,b,c,d,x,y) mix(mix(a,b,x),mix(c,d,x),y) // bilinear interpolation

float noise2(vec2 p) {
    ivec2 i = ivec2(floor(p));
    vec2 g0 = grad(i);
    vec2 g1 = grad(i + ivec2(1, 0));
    vec2 g2 = grad(i + ivec2(0, 1));
    vec2 g3 = grad(i + ivec2(1, 1));

    vec2 f = fract(p);
    float v0 = dot(g0, f);
    float v1 = dot(g1, f - vec2(1.0, 0.0));
    float v2 = dot(g2, f - vec2(0.0, 1.0));
    float v3 = dot(g3, f - vec2(1.0, 1.0));

    vec2 a = fade(f);
    return bmix(v0, v1, v2, v3, a.x, a.y);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since gradients are now in 2D, the random gradient function needs to be
extended: we need &lt;strong&gt;a normalized vector generator&lt;/strong&gt;, that is something that
gives us a unit vector pointing in any direction.&lt;/p&gt;
&lt;p&gt;A first solution would be to call the hash function twice (typically on itself)
to obtain random &lt;span class=&quot;math inline&quot;&gt;(x,y)&lt;/span&gt; coordinates in a square, which we then normalize to get
them on a circle:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec2 grad(ivec2 x) { // ivec2 lattice to random 2D unit vector (normalized square point)
    uint h1 = hash(uvec2(x));
    uint h2 = hash(h1);
    return normalize(vec2(u2f(h1), u2f(h2)) * 2.0 - 1.0);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But normalizing a point from a square biases the directions toward its
diagonals; we can do better with only one hash and some trigonometry:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float TAU = 6.283185307179586;

vec2 grad(ivec2 x) { // ivec2 lattice to random 2D unit vector (circle point)
    float angle = u2f(hash(uvec2(x))) * TAU;
    return vec2(cos(angle), sin(angle));
}
&lt;/code&gt;&lt;/pre&gt;
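&lt;p&gt;A quick hypothetical C check that this generator behaves as expected: every
gradient has unit length, and the directions average out to roughly zero over
many lattice points (a crude isotropy test):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

static uint32_t hash(uint32_t x) {
    x = (x ^ (x &amp;gt;&amp;gt; 16)) * 0x21f0aaadu;
    x = (x ^ (x &amp;gt;&amp;gt; 15)) * 0x735a2d97u;
    return x ^ (x &amp;gt;&amp;gt; 15);
}
static uint32_t hash2(uint32_t x, uint32_t y) { return hash(x ^ hash(y)); }
static float u2f(uint32_t x) { return (float)(x &amp;gt;&amp;gt; 8) * 0x1p-24f; }

static const float TAU = 6.283185307179586f;

static void grad_circle(int32_t x, int32_t y, float g[2]) {
    float angle = u2f(hash2((uint32_t)x, (uint32_t)y)) * TAU;
    g[0] = cosf(angle);
    g[1] = sinf(angle);
}

int main(void) {
    double mx = 0.0, my = 0.0;
    int n = 0;
    for (int32_t x = 0; x &amp;lt; 100; x++)
        for (int32_t y = 0; y &amp;lt; 100; y++) {
            float g[2];
            grad_circle(x, y, g);
            assert(fabsf(g[0]*g[0] + g[1]*g[1] - 1.0f) &amp;lt; 1e-5f); /* always unit length */
            mx += g[0]; my += g[1]; n++;
        }
    /* crude isotropy check: no strongly preferred direction */
    assert(fabs(mx / n) &amp;lt; 0.1 &amp;amp;&amp;amp; fabs(my / n) &amp;lt; 0.1);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;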
&lt;p&gt;And this is enough to get our 2D gradient noise:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise2.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;2D gradient noise (with y-axis coordinates that cover -10 to 10)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Expanding to 3 dimensions&lt;/h2&gt;
&lt;p&gt;Similarly, for 3D gradient noise we will need these 2 changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The interpolation happens between the 8 points of a cube: it&#x27;s a &lt;strong&gt;trilinear
interpolation&lt;/strong&gt;, a combination of 2 bilinear interpolations (itself being a
combination of 3 linear interpolations)&lt;/li&gt;
&lt;li&gt;The random unit vectors used as gradients need to be distributed evenly on a
&lt;strong&gt;sphere&lt;/strong&gt; instead of a circle since we are working in 3D&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 grad(ivec3 x) { // ivec3 lattice to random 3D unit vector (sphere point)
    uint h0 = hash(uvec3(x));
    uint h1 = hash(h0);
    // use the first random for the polar angle (latitude)
    float c = 2.0*u2f(h0) - 1.0, // c = cos(theta) = cos(acos(2x-1)) = 2x-1
          s = sqrt(1.0 - c*c);   // s = sin(theta) = sin(acos(c)) = sqrt(1-c*c)
    float phi = TAU * u2f(h1);   // use the 2nd random for the azimuth (longitude)
    return vec3(cos(phi) * s, sin(phi) * s, c);
}

#define tmix(a,b,c,d,e,f,g,h,x,y,z) mix(bmix(a,b,c,d,x,y),bmix(e,f,g,h,x,y),z) // trilinear interpolation

float noise3(vec3 p) {
    ivec3 i = ivec3(floor(p));
    vec3 g0 = grad(i);
    vec3 g1 = grad(i + ivec3(1, 0, 0));
    vec3 g2 = grad(i + ivec3(0, 1, 0));
    vec3 g3 = grad(i + ivec3(1, 1, 0));
    vec3 g4 = grad(i + ivec3(0, 0, 1));
    vec3 g5 = grad(i + ivec3(1, 0, 1));
    vec3 g6 = grad(i + ivec3(0, 1, 1));
    vec3 g7 = grad(i + ivec3(1, 1, 1));

    vec3 f = fract(p);
    float v0 = dot(g0, f);
    float v1 = dot(g1, f - vec3(1, 0, 0));
    float v2 = dot(g2, f - vec3(0, 1, 0));
    float v3 = dot(g3, f - vec3(1, 1, 0));
    float v4 = dot(g4, f - vec3(0, 0, 1));
    float v5 = dot(g5, f - vec3(1, 0, 1));
    float v6 = dot(g6, f - vec3(0, 1, 1));
    float v7 = dot(g7, f - vec3(1, 1, 1));

    vec3 a = fade(f);
    return tmix(v0, v1, v2, v3, v4, v5, v6, v7, a.x, a.y, a.z);
}
&lt;/code&gt;&lt;/pre&gt;
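&lt;p&gt;The &lt;code&gt;grad&lt;/code&gt; function above uses the classic trick of drawing
&lt;span class=&quot;math inline&quot;&gt;\cos\theta&lt;/span&gt; uniformly to get a uniform point on the sphere. A hypothetical
C port checking that it indeed returns unit vectors with a near-zero mean
&lt;span class=&quot;math inline&quot;&gt;z&lt;/span&gt; component:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;math.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

static uint32_t hash(uint32_t x) {
    x = (x ^ (x &amp;gt;&amp;gt; 16)) * 0x21f0aaadu;
    x = (x ^ (x &amp;gt;&amp;gt; 15)) * 0x735a2d97u;
    return x ^ (x &amp;gt;&amp;gt; 15);
}
static float u2f(uint32_t x) { return (float)(x &amp;gt;&amp;gt; 8) * 0x1p-24f; }
static uint32_t hash3(uint32_t x, uint32_t y, uint32_t z) {
    return hash(x ^ hash(y ^ hash(z))); /* same nesting as the GLSL uvec3 overload */
}

static void grad3(int32_t x, int32_t y, int32_t z, float g[3]) {
    uint32_t h0 = hash3((uint32_t)x, (uint32_t)y, (uint32_t)z);
    uint32_t h1 = hash(h0);
    float c = 2.0f*u2f(h0) - 1.0f; /* cos(theta), uniform in [-1,1) */
    float s = sqrtf(1.0f - c*c);   /* sin(theta) */
    float phi = 6.283185307179586f * u2f(h1);
    g[0] = cosf(phi)*s; g[1] = sinf(phi)*s; g[2] = c;
}

int main(void) {
    double mean_z = 0.0;
    int n = 0;
    for (int32_t x = -10; x &amp;lt; 10; x++)
    for (int32_t y = -10; y &amp;lt; 10; y++)
    for (int32_t z = -10; z &amp;lt; 10; z++) {
        float g[3];
        grad3(x, y, z, g);
        assert(fabsf(g[0]*g[0] + g[1]*g[1] + g[2]*g[2] - 1.0f) &amp;lt; 1e-5f);
        mean_z += g[2]; n++;
    }
    assert(fabs(mean_z / n) &amp;lt; 0.05); /* latitudes are balanced, no polar clustering */
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;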
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise3.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;3D gradient noise on a sphere (with 10x unzoom)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;The spherical representation has nothing to do with the selection of a
random 3D unit vector on a sphere. The noise in this implementation applies
to any 3D geometry. The choice of a sphere for the display was just a simple
3D shape to lay it on. If we were in 2D, we could also feed the 3D noise
function with &lt;span class=&quot;math inline&quot;&gt;p=(x,y,t)&lt;/span&gt; where &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; is the current time, giving us a
&amp;quot;cloudy&amp;quot; atmospheric rendering.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The random gradient in 3D might be a bit expensive, so we may consider simpler
approaches, like normalizing a random position in a cube, similar to what we&#x27;ve
initially suggested for 2D noise. From my observation, it doesn&#x27;t seem to have
any noticeable impact visually, but maybe it would have under certain
circumstances.&lt;/p&gt;
&lt;h2&gt;Fractal Brownian Motion (fBm)&lt;/h2&gt;
&lt;p&gt;The idea behind fBm is to sum multiple &amp;quot;octaves&amp;quot; of noise to construct
a more refined pattern. In the most common case, at each octave we raise the
frequency by doubling it (the &amp;quot;lacunarity&amp;quot; factor) and halve the amplitude
(the &amp;quot;gain&amp;quot; or &amp;quot;persistence&amp;quot; factor):&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;600&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise1multi.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Multiple signals&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The algorithm is pretty much the same in all dimensions: it&#x27;s simply a sum of
signals. In 2D for example we can write:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;const float LACUNARITY = 1.98;
const float GAIN = 0.51;

float fbm(vec2 p, int octaves) {
    float sum = 0.0;
    float amp = 1.0, freq = 1.0;
    for (int i = 0; i &amp;lt; octaves; i++) {
        sum += amp * noise2(p * freq);
        freq *= LACUNARITY;
        amp *= GAIN;
    }
    return sum;
}
&lt;/code&gt;&lt;/pre&gt;
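&lt;p&gt;The octave weights alone tell us the range of the sum: with a gain close to
&lt;span class=&quot;math inline&quot;&gt;1/2&lt;/span&gt;, the total amplitude converges to about &lt;span class=&quot;math inline&quot;&gt;1/(1-\mathrm{GAIN})&lt;/span&gt;. A hypothetical
C sketch where the noise is stubbed out with a constant 1 to isolate the
weights (the real noise has a smaller amplitude):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;math.h&amp;gt;

#define LACUNARITY 1.98f
#define GAIN 0.51f

/* stub: replace the real noise2() with a constant to isolate the octave weights */
static float noise2_stub(float px, float py) { (void)px; (void)py; return 1.0f; }

static float fbm_stub(float px, float py, int octaves) {
    float sum = 0.0f, amp = 1.0f, freq = 1.0f;
    for (int i = 0; i &amp;lt; octaves; i++) {
        sum += amp * noise2_stub(px * freq, py * freq);
        freq *= LACUNARITY;
        amp *= GAIN;
    }
    return sum;
}

int main(void) {
    /* with a constant noise of 1, fbm reduces to a geometric series of GAIN^i */
    float expected = (1.0f - powf(GAIN, 5.0f)) / (1.0f - GAIN);
    assert(fabsf(fbm_stub(0.0f, 0.0f, 5) - expected) &amp;lt; 1e-5f);
    /* the infinite sum is bounded by 1/(1-GAIN), about 2.04 */
    assert(fbm_stub(0.0f, 0.0f, 64) &amp;lt; 1.0f / (1.0f - GAIN) + 1e-4f);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;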
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise2fbm.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;2D gradient noise with 5 octaves (with y-axis coordinates that cover -2.5 to 2.5)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;We&#x27;re not using exactly 2.0 and 0.5 for lacunarity and gain to break
correlations.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;And in 3D for completeness (aside from the raytracer to get a sphere, the noise
code is the same, it just calls &lt;code&gt;noise3(p*freq)&lt;/code&gt; and uses a &lt;code&gt;vec3 p&lt;/code&gt; input):&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise3fbm.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;3D gradient noise with 5 octaves&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Derivatives&lt;/h2&gt;
&lt;p&gt;Derivatives (that is the rate of change of the signal) are useful in many
situations. For example, you may have noticed that the 1D signals in this
article tend to have a lighter color when the slope is steep compared to when
it&#x27;s flatter: the derivative is used to interpolate between the 2 colors.&lt;/p&gt;
&lt;p&gt;Still in the 1D case, to display the curve with a correct thickness, I&#x27;m also
using the derivatives for &lt;a href=&quot;https://iquilezles.org/articles/distance/&quot;&gt;Inigo&#x27;s distance to curve trick&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float dist = abs(v - p.y) / sqrt(1.0 + d*d); // v: curve value, p: position, d: derivative of curve
&lt;/code&gt;&lt;/pre&gt;
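&lt;p&gt;To build intuition for this formula: for a straight curve of slope &lt;span class=&quot;math inline&quot;&gt;d&lt;/span&gt; it is
the exact perpendicular point-to-line distance, which a brute-force search
confirms. A hypothetical C sketch with arbitrary values:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;math.h&amp;gt;

int main(void) {
    /* straight curve y = d*x, evaluated against an arbitrary point p */
    float d = 1.7f;               /* slope of the curve */
    float px = 0.4f, py = 2.0f;   /* the point we measure from */
    float v = d * px;             /* curve value at p.x */
    float dist = fabsf(v - py) / sqrtf(1.0f + d*d); /* the trick */

    /* brute force: minimum distance to finely sampled points on the line */
    float best = 1e9f;
    for (int i = -100000; i &amp;lt;= 100000; i++) {
        float x = (float)i * 1e-4f;
        float dx = x - px, dy = d*x - py;
        float dd = sqrtf(dx*dx + dy*dy);
        if (dd &amp;lt; best) best = dd;
    }
    assert(fabsf(dist - best) &amp;lt; 1e-3f); /* both agree */
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;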
&lt;p&gt;Using this signed distance like any other, we can display the curve smoothly:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;150&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise1.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;1D gradient noise&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;In higher dimensions, we may want to use them for lighting: indeed,
with the derivatives we can compute the normal, which is then used for
reflections:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;// Lambertian lighting hack
vec3 normal = normalize(vec3(-d, 1.0)); // normal from the 2D partial derivatives d
float a = 5.0*TAU/8.0; // light from south-west
vec3 light_direction = normalize(vec3(cos(a), sin(a), 2.0));
float lighting = max(dot(normal, light_direction), 0.0);
col *= lighting;
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise2light.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;2D gradient noise with (right) and without (left) lighting&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If you&#x27;re into terrain generation, another use case is to &lt;a href=&quot;https://iquilezles.org/articles/morenoise/&quot;&gt;fake
erosion&lt;/a&gt; by scaling the value of each noise layer of the fBm by
&lt;span class=&quot;math inline&quot;&gt;\frac{1}{1+\|\sum_{i=0}^{octaves}d_i\|^2}&lt;/span&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float fbm2e(vec2 p, int octaves) {
    float sum = 0.0;
    float amp = 1.0, freq = 1.0;
    vec2 d = vec2(0.0);
    for (int i = 0; i &amp;lt; octaves; i++) {
        vec3 n = noise2d(p * freq);    // adjusted noise2() returning the partial derivatives in .xy
        d += n.xy;                     // cumulated derivatives without frequency scaling
        float w = 1.0/(1.0+dot(d,d));  // gradient/slope based weight
        sum += amp * n.z * w;          // value damped down by the weight
        freq *= LACUNARITY;
        amp *= GAIN;
    }
    return sum;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise2erosion.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;2D gradient noise with (right) and without (left) erosion&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;The intuitive idea behind the formula is that each new layer gets &amp;quot;muted&amp;quot;
where the accumulated gradients indicate a steep mountain, preventing steep
areas from being &amp;quot;rugged&amp;quot; further by higher frequency noises, and thus giving
a sharper feel.&lt;/p&gt;
&lt;p&gt;Anyway, these are random things we can do with the derivatives, but there are
likely others I&#x27;m forgetting. The point is, they are particularly useful so
we&#x27;re going to study them.&lt;/p&gt;
&lt;h3&gt;Numerical vs analytical derivatives&lt;/h3&gt;
&lt;p&gt;We have two methods to get the derivatives:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;numerical&lt;/strong&gt; method, where we compute the rate of change between 2 close
points. It &amp;quot;works&amp;quot; with all curves (given enough precision to work with), but it
has an accuracy (and sometimes speed) issue.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;analytical&lt;/strong&gt; method, where we derive the exact mathematical formula.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The numerical method can be implemented on the GPU thanks to the &lt;code&gt;dFdx&lt;/code&gt; and
&lt;code&gt;dFdy&lt;/code&gt; functions. These functions use the local and neighboring fragments&#x27; data
to calculate a derivative of the specified value. They are among the rare
functions that communicate information across fragments, and as you can guess
this requires synchronization and thus has performance implications.&lt;/p&gt;
&lt;p&gt;For example, let&#x27;s take our last 2D noise scene and compare the
numerical vs the analytical derivatives:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise2fbmderiv.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;2D gradient noise with its partial derivatives (both analytical and numerical)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;On the left we have our standard 2D gradient fBm noise, and on the right the
length of the derivatives of the noise: first the analytical version, then the
numerical one. The latter should appear pixelized.&lt;/p&gt;
&lt;h3&gt;Numerical derivatives&lt;/h3&gt;
&lt;p&gt;The numerical derivatives were obtained with the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float w = 2.0/resolution.y;           // size of a pixel
float v = noise2(p);                  // noise value at position p
vec2 d = vec2(dFdx(v), dFdy(v)) / w;  // numerical partial derivatives
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;It is also possible to compute the derivatives numerically ourselves with
finite differences, sampling the noise function at the neighboring positions
within the same fragment, but this is going to be extremely expensive.&lt;/p&gt;
&lt;/div&gt;
&lt;h3&gt;Analytical derivatives&lt;/h3&gt;
&lt;p&gt;To get the analytical derivatives we need to combine multiple derivatives found
in the noise functions, starting from the fading function derivative. This one
is easy:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
\mathrm{fade}(t)  &amp;amp;= 6t^5-15t^4+10t^3 \\
\mathrm{fade}&#x27;(t) &amp;amp;= 30t^2(t^2-2t+1)
\end{aligned}
&lt;/div&gt;
&lt;p&gt;But we will also need the derivatives of the interpolation functions for each
dimension:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;lerp: &lt;span class=&quot;math inline&quot;&gt;\mathrm{mix}(a,b,x) = (1-x)a + bx&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;bilerp: &lt;span class=&quot;math inline&quot;&gt;\mathrm{bmix}(a,b,c,d,x,y) = \mathrm{mix}(\mathrm{mix}(a,b,x),\mathrm{mix}(c,d,x),y)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;trilerp: &lt;span class=&quot;math inline&quot;&gt;\mathrm{tmix}(a,b,c,d,e,f,g,h,x,y,z) = \mathrm{mix}(\mathrm{bmix}(a,b,c,d,x,y),\mathrm{bmix}(e,f,g,h,x,y),z)&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So here they are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\partial_x\mathrm{mix}=b-a&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\nabla\mathrm{bmix}=\mathrm{mix}(\begin{bmatrix}b\\c\end{bmatrix}-a,d-\begin{bmatrix}c\\b\end{bmatrix},\begin{bmatrix}y\\x\end{bmatrix})&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\nabla\mathrm{tmix}=\mathrm{bmix}(\begin{bmatrix}b\\c\\e\end{bmatrix}-a,\begin{bmatrix}d-c\\d-b\\f-b\end{bmatrix},\begin{bmatrix}f-e\\g-e\\g-c\end{bmatrix},h-\begin{bmatrix}g\\f\\d\end{bmatrix},\begin{bmatrix}y\\x\\x\end{bmatrix},\begin{bmatrix}z\\z\\y\end{bmatrix})&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Following is a proof for the bilinear and trilinear interpolation partial
derivatives; you can skip it if you&#x27;re not into a nasty orgy of math symbols.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;I&#x27;m introducing a new notation for the partial derivatives with values
because the usual ones are awful to make and read. An example is better than a
long explanation, so instead of something like
&lt;span class=&quot;math inline&quot;&gt;\frac{\partial_f(x,y,z)}{\partial_x}(17,u+v,v^2)&lt;/span&gt; or its horrendous
vertical bar version, I will instead use: &lt;span class=&quot;math inline&quot;&gt;\partial_xf(x=17,y=u+v,z=v^2)&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
\partial_x\mathrm{bmix} &amp;amp;= \partial_a\mathrm{mix}\cdot\partial_x\mathrm{mix} + \partial_b\mathrm{mix}\cdot\partial_x\mathrm{mix} \\
                        &amp;amp;= \partial_a\mathrm{mix}(a=\mathrm{mix}(a,b,x),b=\mathrm{mix}(c,d,x),x=y) \cdot \partial_x\mathrm{mix}(a=a,b=b,x=x) \\
                        &amp;amp;+ \partial_b\mathrm{mix}(a=\mathrm{mix}(a,b,x),b=\mathrm{mix}(c,d,x),x=y) \cdot \partial_x\mathrm{mix}(a=c,b=d,x=x) \\
                        &amp;amp;= (1-y)(b-a) + y(d-c) \\
                        &amp;amp;= \boxed{\mathrm{mix}(b-a,d-c,y)} \\
\\
\partial_y\mathrm{bmix} &amp;amp;= \partial_x\mathrm{mix}(a=\mathrm{mix}(a,b,x), b=\mathrm{mix}(c,d,x), x=y) \\
                        &amp;amp;= \mathrm{mix}(c,d,x) - \mathrm{mix}(a,b,x) \\
                        &amp;amp;= \boxed{\mathrm{mix}(c-a,d-b,x)} \\
\\
\partial_x \mathrm{tmix} &amp;amp;= \partial_a\mathrm{mix}\cdot\partial_x\mathrm{bmix} + \partial_b\mathrm{mix}\cdot\partial_x\mathrm{bmix} \\
                         &amp;amp;= \partial_a\mathrm{mix}(a=\mathrm{bmix}(a,b,c,d,x,y),b=\mathrm{bmix}(e,f,g,h,x,y),z) \\
                         &amp;amp;\cdot \partial_x\mathrm{bmix}(a=a,b=b,c=c,d=d,x=x,y=y) \\
                         &amp;amp;+ \partial_b\mathrm{mix}(a=\mathrm{bmix}(a,b,c,d,x,y),b=\mathrm{bmix}(e,f,g,h,x,y),z) \\
                         &amp;amp;\cdot \partial_x\mathrm{bmix}(a=e,b=f,c=g,d=h,x=x,y=y) \\
                         &amp;amp;= (1-z)\mathrm{mix}(b-a,d-c,y) + z\mathrm{mix}(f-e,h-g,y) \\
                         &amp;amp;= \mathrm{mix}(\mathrm{mix}(b-a,d-c,y), \mathrm{mix}(f-e,h-g,y), z) \\
                         &amp;amp;= \boxed{\mathrm{bmix}(b-a,d-c,f-e,h-g,y,z)} \\
\\
\partial_y \mathrm{tmix} &amp;amp;= \partial_a\mathrm{mix}\cdot\partial_y\mathrm{bmix} + \partial_b\mathrm{mix}\cdot\partial_y\mathrm{bmix} \\
                         &amp;amp;= \partial_a\mathrm{mix}(a=\mathrm{bmix}(a,b,c,d,x,y),b=\mathrm{bmix}(e,f,g,h,x,y),z) \\
                         &amp;amp;\cdot \partial_y\mathrm{bmix}(a=a,b=b,c=c,d=d,x=x,y=y) \\
                         &amp;amp;+ \partial_b\mathrm{mix}(a=\mathrm{bmix}(a,b,c,d,x,y),b=\mathrm{bmix}(e,f,g,h,x,y),z) \\
                         &amp;amp;\cdot \partial_y\mathrm{bmix}(a=e,b=f,c=g,d=h,x=x,y=y) \\
                         &amp;amp;= (1-z)\mathrm{mix}(c-a,d-b,x) + z\mathrm{mix}(g-e,h-f,x) \\
                         &amp;amp;= \mathrm{mix}(\mathrm{mix}(c-a,d-b,x), \mathrm{mix}(g-e,h-f,x), z) \\
                         &amp;amp;= \boxed{\mathrm{bmix}(c-a,d-b,g-e,h-f,x,z)} \\
\\
\partial_z\mathrm{tmix} &amp;amp;= \partial_x\mathrm{mix} \\
                        &amp;amp;= \partial_x\mathrm{mix}(a=\mathrm{bmix}(a,b,c,d,x,y),b=\mathrm{bmix}(e,f,g,h,x,y),x=z) \\
                        &amp;amp;= \mathrm{bmix}(e,f,g,h,x,y)-\mathrm{bmix}(a,b,c,d,x,y) \\
                        &amp;amp;= \boxed{\mathrm{bmix}(e-a,f-b,g-c,h-d,x,y)}
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Finally, given these partial derivatives, we can figure out how to compute the
derivatives of the noise itself. It helps to know that the derivative of
&lt;code&gt;fract(p)&lt;/code&gt; is 1 (mostly) and the derivative of &lt;code&gt;floor(p)&lt;/code&gt; is 0 (mostly), so
things simplify fairly cleanly. I&#x27;ll spare you the details this time:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{aligned}
\text{1D: } &amp;amp; \boxed{\textbf{mix}(g_0,g_1,\textbf{fade}(f))+\partial_x\textbf{mix}(v_0,v_1)\cdot\textbf{fade}&#x27;(f)} \\
\text{2D: } &amp;amp; \boxed{\textbf{bmix}(g_0,g_1,g_2,g_3,\textbf{fade}(f))+\nabla\mathrm{bmix}(v_0,v_1,v_2,v_3,\textbf{fade}(f))\cdot\textbf{fade}&#x27;(f)} \\
\text{3D: } &amp;amp; \boxed{\textbf{tmix}(g_0,g_1,g_2,g_3,g_4,g_5,g_6,g_7,\textbf{fade}(f))+\nabla\mathrm{tmix}(v_0,v_1,v_2,v_3,v_4,v_5,v_6,v_7,\textbf{fade}(f))\cdot\textbf{fade}&#x27;(f)}
\end{aligned}
&lt;/div&gt;
&lt;p&gt;Adjusting our gradient noise functions to return the derivatives along with the
noise value itself:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec2 noise1d(float p) {
    int i = int(floor(p));
    float g0 = grad(i);
    float g1 = grad(i + 1);

    float f = fract(p);
    float v0 = g0 * f;
    float v1 = g1 * (f - 1.0);

    float a = fade(f);
    float v = mix(v0, v1, a);

    float g = mix(g0, g1, a);
    float d = v1 - v0;                    // derivative of mix with respect to the interpolant
    float da = ((f-2.0)*f+1.0)*30.0*f*f;  // fade&#x27;(t): derivative of quintic interpolant

    return vec2(g + d*da, v);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec3 noise2d(vec2 p) {
    // [...]
    float v = bmix(v0, v1, v2, v3, a.x, a.y);

    vec2 g = bmix(g0, g1, g2, g3, a.x, a.y);
    vec2 d = mix(vec2(v1,v2)-v0, v3-vec2(v2,v1), a.yx); // derivatives of bmix with respect to the interpolant
    vec2 da = ((f-2.0)*f+1.0)*30.0*f*f;

    return vec3(g + d*da, v);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec4 noise3d(vec3 p) {
    // [...]
    float v = tmix(v0, v1, v2, v3, v4, v5, v6, v7, a.x, a.y, a.z);

    vec3 g = tmix(g0, g1, g2, g3, g4, g5, g6, g7, a.x, a.y, a.z);
    vec3 d = bmix(vec3(v1,v2,v4)-v0, vec3(v3-v2,v3-v1,v5-v1),  // derivatives of tmix with
                  vec3(v5-v4,v6-v4,v6-v2), v7-vec3(v6,v5,v3),  // respect to the interpolant
                  a.yxx, a.zzy);
    vec3 da = ((f-2.0)*f+1.0)*30.0*f*f;

    return vec4(g + d*da, v);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;The derivatives are in the first components so that given &lt;code&gt;n=noise3d(p)&lt;/code&gt; we
can write &lt;code&gt;vec3 d=n.xyz&lt;/code&gt; for the x/y/z partial derivatives instead of the more
awkward &lt;code&gt;vec3 d=n.yzw&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;To get the equivalent derivatives for value noise, we just set &lt;span class=&quot;math inline&quot;&gt;g=0&lt;/span&gt; in the
final expression of the derivatives.&lt;/p&gt;
&lt;p&gt;It is possible to unroll the nested &lt;code&gt;mix&lt;/code&gt; expression to (maybe) make it faster,
but I find the mix expressions simple, elegant, and likely more numerically
stable.&lt;/p&gt;
&lt;h3&gt;Using the derivatives&lt;/h3&gt;
&lt;p&gt;There are important things to take into consideration when working with the
derivatives. Most notably, it&#x27;s important to follow the &lt;a href=&quot;https://en.wikipedia.org/wiki/Chain_rule&quot;&gt;chain rule&lt;/a&gt;
and the &lt;a href=&quot;https://en.wikipedia.org/wiki/Product_rule&quot;&gt;product rule&lt;/a&gt;. For example, let&#x27;s say we&#x27;re working with
a noise function that returns the derivatives:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;float freq = 0.1;
vec4 n = noise3d(x*freq);
float v = n.a;            // value
vec3 d = n.xyz * freq;    // partial derivatives
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we multiply the input of our function by the frequency, then we need to
multiply the derivatives by the same frequency factor.&lt;/p&gt;
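&lt;p&gt;As a sanity check, here is a minimal Python port of the 1D gradient noise (the
&lt;code&gt;grad&lt;/code&gt; hash below is a made-up placeholder, not the one from the demos), which
can be compared against finite differences to confirm the chain rule scaling by
&lt;code&gt;freq&lt;/code&gt;:&lt;/p&gt;

```python
import math

def grad(i):
    # placeholder integer hash giving a pseudo-random gradient roughly in [-1, 1]
    h = (i * 374761393 + 668265263) % 2**32
    h = ((h ^ (h >> 13)) * 1274126177) % 2**32
    return h % 65536 / 32767.5 - 1.0

def fade(t):
    return ((6.0*t - 15.0)*t + 10.0)*t*t*t

def dfade(t):  # fade'(t)
    return ((t - 2.0)*t + 1.0)*30.0*t*t

def noise1d(p):
    # returns (derivative, value), like the GLSL version
    i = math.floor(p)
    f = p - i
    g0, g1 = grad(i), grad(i + 1)
    v0, v1 = g0 * f, g1 * (f - 1.0)
    a = fade(f)
    v = v0 + (v1 - v0) * a
    d = g0 + (g1 - g0) * a + (v1 - v0) * dfade(f)
    return d, v

# chain rule: the derivative of noise1d(x*freq) w.r.t. x is freq times
# the derivative returned by noise1d at x*freq
freq, x = 0.1, 3.7
d, v = noise1d(x * freq)
d = d * freq
```

&lt;p&gt;The same finite-difference check generalizes component-wise to 2D and 3D.&lt;/p&gt;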
&lt;p&gt;Similarly, let&#x27;s say we want to move &lt;span class=&quot;math inline&quot;&gt;v&lt;/span&gt; from &lt;span class=&quot;math inline&quot;&gt;[-1,1]&lt;/span&gt; to &lt;span class=&quot;math inline&quot;&gt;[0,1]&lt;/span&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;v = (v + 1.0) / 2.0; // [-1,1] -&amp;gt; [0,1]
d = d / 2.0;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The intuitive explanation is that if we squeeze the value into half its
original range, then the derivatives (slopes) get flatter as well.&lt;/p&gt;
&lt;p&gt;This may not sound very important, but failing to propagate these scale factors
correctly often leads to subtle bugs. For example, in the fBm, if we want it to
return the correct derivatives, we need to write something like this (notice
the &lt;code&gt;freq&lt;/code&gt; factor):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-glsl&quot;&gt;vec4 fbm3d(vec3 p) {
    vec4 sum = vec4(0.0);
    float amp = 1.0, freq = 1.0;
    for (int i = 0; i &amp;lt; octaves; i++) {
        vec4 n = noise3d(p * freq);
        sum.xyz += amp * n.xyz * freq; // derivatives
        sum.a += amp * n.a; // value
        freq *= LACUNARITY;
        amp *= GAIN;
    }
    return sum;
}
&lt;/code&gt;&lt;/pre&gt;
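&lt;p&gt;To see why the extra &lt;code&gt;freq&lt;/code&gt; factor matters, here is a 1D Python sketch using
&lt;code&gt;sin&lt;/code&gt; as a stand-in for the noise (its derivative is &lt;code&gt;cos&lt;/code&gt;); the accumulated
derivative only matches a numerical derivative because each octave is scaled by
its frequency:&lt;/p&gt;

```python
import math

OCTAVES, LACUNARITY, GAIN = 4, 2.0, 0.5

def noise(p):
    # analytic stand-in for a 1D noise: returns (derivative, value)
    return math.cos(p), math.sin(p)

def fbm(p):
    dsum, vsum = 0.0, 0.0
    amp, freq = 1.0, 1.0
    for _ in range(OCTAVES):
        d, v = noise(p * freq)
        dsum += amp * d * freq  # chain rule: scale the derivative by freq
        vsum += amp * v
        freq *= LACUNARITY
        amp *= GAIN
    return dsum, vsum
```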
&lt;p&gt;Things get tricky when the noise is adjusted with the derivatives themselves,
as we saw before with the erosion, especially since this implies second-order
derivatives.&lt;/p&gt;
&lt;p&gt;Similarly, there is a technique involving the rotation of &lt;span class=&quot;math inline&quot;&gt;p&lt;/span&gt; at every iteration
of the fBm to reduce the correlations between noise layers. This rotation is a
linear transformation, but it requires careful adjustments to the derivatives as
well. Implementing these transformations correctly is left as an exercise;
careful bookkeeping of derivatives is essential.&lt;/p&gt;
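&lt;p&gt;For the curious, here is what that adjustment can look like in a 2D Python
sketch (again with an analytic stand-in for the noise): since the rotation is a
linear map, the chain rule amounts to multiplying the returned gradient by the
transpose of the rotation matrix.&lt;/p&gt;

```python
import math

def noise2(x, y):
    # analytic stand-in: value sin(x)*cos(y), and its gradient
    return (math.cos(x)*math.cos(y), -math.sin(x)*math.sin(y)), math.sin(x)*math.cos(y)

c, s = math.cos(0.6), math.sin(0.6)  # an arbitrary rotation angle

def rotated_noise(x, y):
    (gx, gy), v = noise2(c*x - s*y, s*x + c*y)
    # chain rule for the linear map: gradient times the transposed rotation
    return (c*gx + s*gy, -s*gx + c*gy), v
```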
&lt;h2&gt;Going further&lt;/h2&gt;
&lt;p&gt;With this introduction we only explored the tip of the iceberg. For example,
there are alternative noises such as &lt;a href=&quot;https://en.wikipedia.org/wiki/OpenSimplex_noise&quot;&gt;OpenSimplex&lt;/a&gt;, where the lattice
points lie on a simplex grid (triangles in 2D, tetrahedra in 3D) instead of a
rectangular grid. It has useful properties: in an fBm, for example, it yields
fewer directional artifacts, eliminating the need for ad-hoc rotations.&lt;/p&gt;
&lt;p&gt;Speaking of fBm, you may want to check out &lt;a href=&quot;https://iquilezles.org/articles/warp/&quot;&gt;domain warping&lt;/a&gt; where
nested fBm together make fancy effects:&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;600&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/noise/gnoise2warp.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Domain warping with fbm2(p+fbm2(p+fbm2(p+t)))&lt;/figcaption&gt;
&lt;/figure&gt;
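&lt;p&gt;A minimal Python sketch of the nesting pattern from the caption, with a crude
&lt;code&gt;sin&lt;/code&gt;/&lt;code&gt;cos&lt;/code&gt; based stand-in for a real 2D fBm (constants are arbitrary):&lt;/p&gt;

```python
import math

def fbm2(x, y):
    # stand-in for a 2D fBm, just to illustrate the nesting
    v, amp, freq = 0.0, 0.5, 1.0
    for _ in range(4):
        v += amp * math.sin(freq*x + 1.7) * math.cos(freq*y - 0.3)
        amp *= 0.5
        freq *= 2.0
    return v

def warp(x, y, t):
    # fbm2(p + fbm2(p + fbm2(p + t)))
    a = fbm2(x + t, y + t)
    b = fbm2(x + a, y + a)
    return fbm2(x + b, y + b)
```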
&lt;p&gt;Another thing you&#x27;ll be tempted to research is how to consistently scale the
noise so that it fits into a controlled range of values. Spoiler alert: it&#x27;s a
&lt;em&gt;particularly&lt;/em&gt; complex subject.&lt;/p&gt;
&lt;p&gt;Similarly, we stopped ourselves at 3D, but 4D is also useful: for example, we
may want to morph the 3D noise over time with &lt;span class=&quot;math inline&quot;&gt;p=(x,y,z,t)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;I hope this article was able to give a good overview of the concepts, and I&#x27;ll
see you next time for new adventures.&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/41-fixing-the-iterative-damping-interpolation-in-video-games.html</guid>
 <link>http://blog.pkh.me/p/41-fixing-the-iterative-damping-interpolation-in-video-games.html</link>
 <title>Fixing the iterative damping interpolation in video games</title>
 <pubDate>Sat, 18 May 2024 12:22:15 -0000</pubDate>
 <description>&lt;p&gt;As I&#x27;m exploring the fantastic world of indie game development lately, I end up
watching a large number of video tutorials on the subject. Even though the
quality of the content is pretty variable, I&#x27;m very grateful to the creators
for it. That being said, I couldn&#x27;t help noticing this particular bit time and
time again:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;a = lerp(a, B, delta * RATE)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Behind this apparently banal call hides a terrible curse, forever perpetuated by
innocent souls on the Internet.&lt;/p&gt;
&lt;p&gt;In this article we will study what it&#x27;s trying to achieve, how it works, why
it&#x27;s wrong, and then we&#x27;ll come up with a good solution to the initial problem.&lt;/p&gt;
&lt;p&gt;The usual warning: I don&#x27;t have a mathematics or academic background, so the
article is addressed at other neanderthals like myself, who managed to
understand that pressing keys on a keyboard make pixels turn on and off.&lt;/p&gt;
&lt;h2&gt;What is it?&lt;/h2&gt;
&lt;p&gt;Let&#x27;s start from the beginning. We&#x27;re in a game engine main loop callback
called at a regular interval (roughly), passing down the time difference from
the last call.&lt;/p&gt;
&lt;p&gt;In Godot engine, it looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;func _physics_process(delta: float):
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the game is configured to refresh at 60 FPS, we can expect this function to
be called around 60 times per second with &lt;code&gt;delta = 1/60 = 0.01666...&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As a game developer, we want some smooth animations for all kind of
transformations. For example, we may want the speed of the player to go down to
zero as they release the moving key. We could do that linearly, but to make the
stop less brutal and robotic we want to slow down the speed progressively.&lt;/p&gt;
&lt;figure&gt;
  &lt;canvas width=&quot;300&quot; height=&quot;300&quot; class=&quot;shader-canvas&quot; data-fragment=&quot;http://blog.pkh.me/frag/lin-vs-exp.frag&quot;&gt;&lt;/canvas&gt;
  &lt;figcaption&gt;Linear (top) versus smooth/exponential (bottom) animation&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Virtually every tutorial will suggest updating some variable with
something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;velocity = lerp(velocity, 0, delta * RATE)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At 60 FPS, with a decay &lt;code&gt;RATE&lt;/code&gt; set to &lt;code&gt;3.5&lt;/code&gt; and an initial &lt;code&gt;velocity&lt;/code&gt; of &lt;code&gt;100&lt;/code&gt;,
the &lt;code&gt;velocity&lt;/code&gt; will go down to &lt;code&gt;0&lt;/code&gt; following this curve:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/fixing-damp/velocity-curve.png&quot; alt=&quot;velocity curve&quot;&gt;
  &lt;figcaption&gt;Example curve of a decaying variable&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;code&gt;velocity&lt;/code&gt; is just an example of a variable name; the same pattern can be
found in many other contexts.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;If you&#x27;re familiar with &lt;code&gt;lerp()&lt;/code&gt; (&amp;quot;linear interpolation&amp;quot;) you may be wondering
why this is making a curve. Indeed, this &lt;code&gt;lerp()&lt;/code&gt; function, also known as
&lt;code&gt;mix()&lt;/code&gt;, is a simple linear function defined as &lt;code&gt;lerp(a,b,x) = x*(b-a) + a&lt;/code&gt; or
its alternative stable form &lt;code&gt;lerp(a,b,x) = a*(1-x) + b*x&lt;/code&gt;. For more information,
see a &lt;a href=&quot;http://blog.pkh.me/p/29-the-most-useful-math-formulas.html&quot;&gt;previous article&lt;/a&gt; about this particular function. But here
we are re-using the previous value, so this essentially means nesting &lt;code&gt;lerp()&lt;/code&gt;
function calls, which expands into a power formula, forming a curve composed of
a chain of small straight segments.&lt;/p&gt;
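&lt;p&gt;We can convince ourselves of this with a tiny Python experiment: iterating
&lt;code&gt;lerp&lt;/code&gt; toward &lt;code&gt;0&lt;/code&gt; with a constant factor &lt;code&gt;k&lt;/code&gt; is exactly the power formula
&lt;code&gt;a*(1-k)**n&lt;/code&gt;:&lt;/p&gt;

```python
def lerp(a, b, x):
    return a * (1.0 - x) + b * x

a, k = 100.0, 0.05
v = a
for _ in range(10):
    v = lerp(v, 0.0, k)
# v is now a * (1 - k)**10, about 59.87
```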
&lt;h2&gt;Why is it wrong?&lt;/h2&gt;
&lt;p&gt;The main issue is that the formula depends heavily on the refresh rate. If the
game is supposed to work at 30, 60, or 144 FPS, then the physics
engine is going to behave differently at each rate.&lt;/p&gt;
&lt;p&gt;Here is an illustration of the kind of instability we can expect:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/fixing-damp/problematic-formula.png&quot; alt=&quot;problematic formula&quot;&gt;
  &lt;figcaption&gt;Comparison of the curves at different frame rates with the problematic formula&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Note that the inaccuracy when compared to an ideal curve is not the issue here.
The problem is that the game mechanics are different depending on the hardware,
the system, and the wind direction observed in a small island of Japan. Imagine
being able to jump further if we replace our 60Hz monitor with a 144Hz one,
that would be some nasty pay to win incentive.&lt;/p&gt;
&lt;p&gt;We may be able to get away with this by forcing a constant refresh rate for the
game and considering this a non-issue (I&#x27;m not convinced this is achievable on all
engines and platforms), but then we meet another problem: the device may not be
able to hold this requirement at all times because of potential lag (for
reasons that may be outside our control). So far we assumed
&lt;code&gt;delta=1/FPS&lt;/code&gt;, but that&#x27;s merely a target: it can fluctuate, causing mild to
dramatic situations gameplay-wise.&lt;/p&gt;
&lt;p&gt;One last issue with that formula is the situation of a huge delay spike
causing an overshoot of the target. For example, if we have &lt;code&gt;RATE=3&lt;/code&gt; and we
end up with a frame that takes 500ms for whatever random reason, we&#x27;re going to
interpolate with a value of 1.5, which is way above 1. This is easily fixed by
capping the 3rd argument of &lt;code&gt;lerp&lt;/code&gt; at 1, but we have to keep that issue in
mind.&lt;/p&gt;
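&lt;p&gt;These issues are easy to reproduce with a small Python simulation of the naive
update running for one second at different frame rates (the helper name is mine):&lt;/p&gt;

```python
def damp_naive(fps, seconds, rate=3.5, start=100.0):
    v, dt = start, 1.0 / fps
    for _ in range(round(seconds * fps)):
        v = v + (0.0 - v) * (dt * rate)  # lerp(v, 0, dt*rate)
    return v

# after one simulated second, the result depends on the frame rate:
# damp_naive(30, 1.0) is about 2.42 while damp_naive(144, 1.0) is about 2.89
```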
&lt;p&gt;To summarize, the formula is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;not frame rate agnostic ❌&lt;/li&gt;
&lt;li&gt;non deterministic ❌&lt;/li&gt;
&lt;li&gt;vulnerable to overshooting ❌&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you&#x27;re not interested in the gory details of the &lt;em&gt;how&lt;/em&gt;, you can now jump
straight to the conclusion for a better alternative.&lt;/p&gt;
&lt;h2&gt;Study&lt;/h2&gt;
&lt;p&gt;We&#x27;re going to switch to a more mathematical notation from now on. It&#x27;s only
going to be linear algebra, nothing particularly fancy, but we&#x27;re going to make
a mess of one-letter symbols, so bear with me.&lt;/p&gt;
&lt;p&gt;Let&#x27;s name the exhaustive list of inputs of our problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Initial value: &lt;span class=&quot;math inline&quot;&gt;a_0=\Alpha&lt;/span&gt; (from where we start, only used once)&lt;/li&gt;
&lt;li&gt;Target value: &lt;span class=&quot;math inline&quot;&gt;\Beta&lt;/span&gt; (where we are going, constant value)&lt;/li&gt;
&lt;li&gt;Time delta: &lt;span class=&quot;math inline&quot;&gt;\Delta_n&lt;/span&gt; (time difference from last call)&lt;/li&gt;
&lt;li&gt;The rate of change: &lt;span class=&quot;math inline&quot;&gt;R&lt;/span&gt; (arbitrary scaling user constant)&lt;/li&gt;
&lt;li&gt;Original sequence: &lt;span class=&quot;math inline&quot;&gt;a_{n+1} = \mathrm{lerp}(a_n, \Beta, R\Delta_n)&lt;/span&gt; (the code
in the main loop callback)&lt;/li&gt;
&lt;li&gt;Frame rate: &lt;span class=&quot;math inline&quot;&gt;F&lt;/span&gt; (the target frame rate, for example &lt;span class=&quot;math inline&quot;&gt;60&lt;/span&gt; FPS)&lt;/li&gt;
&lt;li&gt;Time: &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; (animation time elapsed)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What we are looking for is a new sequence formula &lt;span class=&quot;math inline&quot;&gt;u_n&lt;/span&gt; (&lt;span class=&quot;math inline&quot;&gt;u&lt;/span&gt; standing for
&lt;em&gt;purfect&lt;/em&gt;) that doesn&#x27;t have the 3 previously mentioned pitfalls.&lt;/p&gt;
&lt;p&gt;The first thing we can do is to transform this recursive sequence into the
expected ideal continuous time-based function. The original sequence was
designed for a given rate &lt;span class=&quot;math inline&quot;&gt;R&lt;/span&gt; and FPS &lt;span class=&quot;math inline&quot;&gt;F&lt;/span&gt;: this means that while &lt;span class=&quot;math inline&quot;&gt;\Delta_n&lt;/span&gt;
changes in practice every frame, the ideal function we are looking for is
constant: &lt;span class=&quot;math inline&quot;&gt;\Delta=1/F&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;So instead of starting from &lt;span class=&quot;math inline&quot;&gt;a_{n+1} = \mathrm{lerp}(a_n, \Beta, R\Delta_n)&lt;/span&gt;,
we will look for &lt;span class=&quot;math inline&quot;&gt;u_n&lt;/span&gt; starting from &lt;span class=&quot;math inline&quot;&gt;u_{n+1} = \mathrm{lerp}(u_n, \Beta,
R\Delta)&lt;/span&gt; with &lt;span class=&quot;math inline&quot;&gt;u_0=a_0=\Alpha&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Since I&#x27;m lazy and incompetent, we are just going to ask WolframAlpha for help
finding the solution to the recursive sequence. But to feed its input we need
to simplify the terms a bit:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{split}
u_{n+1} &amp;amp;= \mathrm{lerp}(u_n, \Beta, R\Delta) \\
      &amp;amp;= u_n(1-R\Delta) + \Beta R\Delta \\
      &amp;amp;= u_nP + Q
\end{split}
&lt;/div&gt;
&lt;p&gt;...with &lt;span class=&quot;math inline&quot;&gt;P=(1-R\Delta)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;Q=\Beta R\Delta&lt;/span&gt;. We do that so we have a familiar &lt;span class=&quot;math inline&quot;&gt;ax+b&lt;/span&gt; linear form.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.wolframalpha.com/input?i=u%280%29%3DA+and+u%28n%2B1%29%3Du%28n%29*P%2BQ&quot;&gt;According to WolframAlpha&lt;/a&gt; this is equivalent to:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
u_n = \Alpha P^n + \frac{Q(P^n-1)}{P-1}
&lt;/div&gt;
&lt;p&gt;This is great because we now have the formula according to &lt;span class=&quot;math inline&quot;&gt;n&lt;/span&gt;, our frame
number. We can also express that discrete sequence as a continuous function
according to the time &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt;:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
f(t) = \Alpha P^{tF} + \frac{Q(P^{tF}-1)}{P-1}
&lt;/div&gt;
&lt;p&gt;Expanding our temporary &lt;span class=&quot;math inline&quot;&gt;P&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;Q&lt;/span&gt; placeholders with their values and
unrolling, we get:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{split}
f(t) &amp;amp;= \Alpha P^{tF} + \frac{Q(P^{tF}-1)}{P-1} \\
     &amp;amp;= \Alpha(1-R\Delta)^{tF} + \frac{\Beta R\Delta((1-R\Delta)^{tF}-1)}{(1-R\Delta)-1} \\
     &amp;amp;= \Alpha(1-R\Delta)^{tF} - \Beta((1-R\Delta)^{tF}-1) \\
     &amp;amp;= \Alpha(1-R\Delta)^{tF} + \Beta(1-(1-R\Delta)^{tF}) \\
     &amp;amp;= \mathrm{lerp}(\Beta, \Alpha, (1-R\Delta)^{tF}) \\
     &amp;amp;= \mathrm{lerp}(\Beta, \Alpha, (1-R/F)^{tF}) \\
f(t) &amp;amp;= \boxed{\mathrm{lerp}(\Alpha, \Beta, 1-(1-R/F)^{tF})}
\end{split}
&lt;/div&gt;
&lt;p&gt;This function perfectly matches the initial &lt;span class=&quot;math inline&quot;&gt;\mathrm{lerp}()&lt;/span&gt; sequence in the
hypothetical situation where the frame rate is honored. Basically, it&#x27;s &lt;strong&gt;what
the sequence &lt;span class=&quot;math inline&quot;&gt;a_{n+1}&lt;/span&gt; was meant to emulate at a given frame rate &lt;span class=&quot;math inline&quot;&gt;F&lt;/span&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;We swapped the first 2 terms of &lt;span class=&quot;math inline&quot;&gt;\mathrm{lerp}()&lt;/span&gt; at the last step because
it makes more sense semantically to go from &lt;span class=&quot;math inline&quot;&gt;\Alpha&lt;/span&gt; to &lt;span class=&quot;math inline&quot;&gt;\Beta&lt;/span&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Let&#x27;s again summarize what we have and what we want: we&#x27;re in the game main
loop and we want our running value to stick to that &lt;span class=&quot;math inline&quot;&gt;f(t)&lt;/span&gt; function. We
have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;v=f(t)&lt;/span&gt;: the value previously computed (&lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; is the running duration so far,
but we don&#x27;t have it); in the original sequence this is known as &lt;span class=&quot;math inline&quot;&gt;a_n&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\Delta_n&lt;/span&gt;: the delta time for the current frame&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are looking for a function &lt;span class=&quot;math inline&quot;&gt;\Eta(v,\Delta_n)&lt;/span&gt; which defines the position of
a new point on the curve, only knowing &lt;span class=&quot;math inline&quot;&gt;v&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\Delta_n&lt;/span&gt;. It&#x27;s a &amp;quot;time
agnostic&amp;quot; version of &lt;span class=&quot;math inline&quot;&gt;f(t)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Basically, it is defined as &lt;span class=&quot;math inline&quot;&gt;\Eta(v,\Delta_n)=f(t+\Delta_n)&lt;/span&gt;, but since we don&#x27;t have
&lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; it&#x27;s not very helpful. That being said, while we don&#x27;t have &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt;, we do have
&lt;span class=&quot;math inline&quot;&gt;f(t)&lt;/span&gt; (the previous value &lt;span class=&quot;math inline&quot;&gt;v&lt;/span&gt;).&lt;/p&gt;
&lt;p&gt;Looking at the curve, we know the y-value of the previous point, and we know
the difference between the new point and the previous point on the x-axis:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/fixing-damp/prev-to-current-curve.png&quot; alt=&quot;previous to current point&quot;&gt;
  &lt;figcaption&gt;Previous and current point in time&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;If we want &lt;span class=&quot;math inline&quot;&gt;t&lt;/span&gt; (the total time elapsed at the previous point), we need the
inverse function &lt;span class=&quot;math inline&quot;&gt;f^{-1}&lt;/span&gt;. Indeed, &lt;span class=&quot;math inline&quot;&gt;t = f^{-1}(f(t))&lt;/span&gt;: taking the inverse of a
function gives back the input. We know &lt;span class=&quot;math inline&quot;&gt;f&lt;/span&gt; so we can invert it, relying on
WolframAlpha again (what a blessing this website is):&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
f^{-1}(x) = \frac{\ln{\frac{\Beta-x}{\Beta-\Alpha}}}{F \ln(1-R/F)}
&lt;/div&gt;
&lt;div class=&quot;admonition note&quot;&gt;
&lt;p class=&quot;admonition-title&quot;&gt;Note&lt;/p&gt;
&lt;p&gt;&lt;span class=&quot;math inline&quot;&gt;\ln&lt;/span&gt; stands for natural logarithm, sometimes also called &lt;span class=&quot;math inline&quot;&gt;\log&lt;/span&gt;. Careful
though, on Desmos for example &lt;span class=&quot;math inline&quot;&gt;\log&lt;/span&gt; is in base 10, not base &lt;span class=&quot;math inline&quot;&gt;e&lt;/span&gt; (while its
&lt;span class=&quot;math inline&quot;&gt;\exp&lt;/span&gt; is in base &lt;span class=&quot;math inline&quot;&gt;e&lt;/span&gt; for some reason).&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This complex formula may feel a bit intimidating but we can now find &lt;span class=&quot;math inline&quot;&gt;\Eta&lt;/span&gt;
only using its two parameters:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{split}
\Eta(v,\Delta_n) &amp;amp;= f(t + \Delta_n) \\
                &amp;amp;= f(f^{-1}(f(t)) + \Delta_n) \\
                &amp;amp;= f(f^{-1}(v) + \Delta_n) \\
                &amp;amp;= f(\frac{\ln{\frac{\Beta-v}{\Beta-\Alpha}}}{F \ln(1-R/F)} + \Delta_n) \\
                &amp;amp;= \mathrm{lerp}(\Alpha, \Beta, 1-(1-R/F)^{(\frac{\ln{\frac{\Beta-v}{\Beta-\Alpha}}}{F \ln(1-R/F)} + \Delta_n) \times F}) \\
                &amp;amp;= \mathrm{lerp}(\Alpha, \Beta, 1-(1-R/F)^{\frac{\ln{\frac{\Beta-v}{\Beta-\Alpha}}}{\ln(1-R/F)}} (1-R/F)^{F\Delta_n}) \\
                &amp;amp;= \mathrm{lerp}(\Alpha, \Beta, 1-\frac{\Beta-v}{\Beta-\Alpha} (1-R/F)^{F\Delta_n}) \\
                &amp;amp;= (1-\frac{\Beta-v}{\Beta-\Alpha} (1-R/F)^{F\Delta_n})(\Beta-\Alpha) + \Alpha \\
                &amp;amp;= (\Beta-\Alpha) - (\Beta-v) (1-R/F)^{F\Delta_n} + \Alpha \\
                &amp;amp;= (v-\Beta)(1-R/F)^{F\Delta_n} + \Beta \\
                &amp;amp;= \mathrm{lerp}(\Beta, v, (1-R/F)^{F\Delta_n}) \\
\Eta(v,\Delta_n) &amp;amp;= \mathrm{lerp}(v, \Beta, 1-(1-R/F)^{F\Delta_n})
\end{split}
&lt;/div&gt;
&lt;p&gt;Again we swapped the first 2 arguments of &lt;code&gt;lerp&lt;/code&gt; at the last step at the cost
of an additional subtraction: this is more readable because &lt;span class=&quot;math inline&quot;&gt;\Beta&lt;/span&gt; is our
destination point.&lt;/p&gt;
&lt;p&gt;An interesting property that is going to be helpful here is &lt;span class=&quot;math inline&quot;&gt;m^n = e^{n
\ln{m}}&lt;/span&gt;. For my fellow programmers getting tense here: &lt;code&gt;pow(m, n) == exp(n * log(m))&lt;/code&gt;. Replacing the power with the exponential may not seem like an
improvement at first, but it allows packing all the constant terms together:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{split}
\Eta(v,\Delta_n) &amp;amp;= \mathrm{lerp}(v, \Beta, 1-(1-R/F)^{F\Delta_n}) \\
                &amp;amp;= \mathrm{lerp}(v, \Beta, 1-e^{F\ln(1-R/F)\Delta_n})
\end{split}
&lt;/div&gt;
&lt;p&gt;&lt;span class=&quot;math inline&quot;&gt;F\ln(1-R/F)&lt;/span&gt; can be pre-computed because it is constant: it&#x27;s our &lt;strong&gt;rate
conversion formula&lt;/strong&gt;, which we can extract:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{split}
             R&#x27; &amp;amp;= F\ln(1-R/F) \\
\Eta(v,\Delta_n) &amp;amp;= \mathrm{lerp}(v, \Beta, 1-e^{R&#x27;\Delta_n})
\end{split}
&lt;/div&gt;
&lt;p&gt;Rewriting this in a sequence notation, we get:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\begin{split}
  R&#x27; &amp;amp;= F\ln(1-R/F) \\
u_{n+1} &amp;amp;= \mathrm{lerp}(u_n, \Beta, 1-e^{R&#x27;\Delta_n})
\end{split}
&lt;/div&gt;
&lt;p&gt;We&#x27;re going to make one last adjustment: &lt;span class=&quot;math inline&quot;&gt;R&#x27;&lt;/span&gt; is negative, which is not
exactly intuitive to work with as a user (in case it is defined arbitrarily and
not through the conversion formula), so we make a sign swap for convenience:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\boxed{\begin{split}
  R&#x27; &amp;amp;= -F\ln(1-R/F) \\
u_{n+1} &amp;amp;= \mathrm{lerp}(u_n, \Beta, 1-e^{-R&#x27;\Delta_n})
\end{split}}
&lt;/div&gt;
&lt;p&gt;The conversion formula is optional; it&#x27;s only needed to port previously
broken code to the new formula. One interesting thing here is that &lt;span class=&quot;math inline&quot;&gt;R&#x27;&lt;/span&gt; is
fairly close to &lt;span class=&quot;math inline&quot;&gt;R&lt;/span&gt; when &lt;span class=&quot;math inline&quot;&gt;R&lt;/span&gt; is small.&lt;/p&gt;
&lt;p&gt;For example, a rate factor &lt;span class=&quot;math inline&quot;&gt;R=5&lt;/span&gt; at 60 FPS gives us &lt;span class=&quot;math inline&quot;&gt;R&#x27; \approx 5.22&lt;/span&gt;. This
means that if the rate factors weren&#x27;t closely tuned, it is probably acceptable
to go with &lt;span class=&quot;math inline&quot;&gt;R&#x27;=R&lt;/span&gt; and not bother with any conversion. Still, having that
formula can be useful to update all the decay constants and check that
everything still works as expected.&lt;/p&gt;
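&lt;p&gt;In code, the conversion is a one-liner; as a Python sketch (the helper name is
mine):&lt;/p&gt;

```python
import math

def convert_rate(rate, fps):
    # rate conversion formula: R' = -F * ln(1 - R/F)
    return -fps * math.log(1.0 - rate / fps)

# convert_rate(5.0, 60.0) gives about 5.22, close to the original 5
```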
&lt;p&gt;Also, notice how if the delta gets very large, &lt;span class=&quot;math inline&quot;&gt;-R&#x27;\Delta_n&lt;/span&gt; is going toward
&lt;span class=&quot;math inline&quot;&gt;-\infty&lt;/span&gt;, &lt;span class=&quot;math inline&quot;&gt;e^{-R&#x27;\Delta_n}&lt;/span&gt; toward &lt;span class=&quot;math inline&quot;&gt;0&lt;/span&gt;, &lt;span class=&quot;math inline&quot;&gt;1-e^{-R&#x27;\Delta_n}&lt;/span&gt; toward &lt;span class=&quot;math inline&quot;&gt;1&lt;/span&gt;, and so
the interpolation is going to reach our final target &lt;span class=&quot;math inline&quot;&gt;\Beta&lt;/span&gt; without
overshooting. This means the formula doesn&#x27;t need any extra care with regard to
the 3rd issue we pointed out earlier.&lt;/p&gt;
&lt;p&gt;Looking at the previous curves but now with the new formula and an adjusted
rate:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/fixing-damp/new-formula.png&quot; alt=&quot;new formula&quot;&gt;
  &lt;figcaption&gt;Comparison of the curves at different frame rates with the new formula&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;So there we have it, the perfect formula, frame rate agnostic ✅,
deterministic ✅ and resilient to overshooting ✅. If you&#x27;ve quickly skimmed
through the maths, here is what you need to know:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;a = lerp(a, B, delta * RATE)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Should be changed to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;a = lerp(a, B, 1.0 - exp(-delta * RATE2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the precomputed &lt;code&gt;RATE2 = -FPS * log(1 - RATE/FPS)&lt;/code&gt; (where &lt;code&gt;log&lt;/code&gt; is the
natural logarithm), or simply using &lt;code&gt;RATE2 = RATE&lt;/code&gt; as a rough equivalent.&lt;/p&gt;
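&lt;p&gt;A quick Python simulation confirms the fix: with the corrected update, one
simulated second lands on the same value regardless of the frame rate, matching
the closed-form result (the helper name is mine):&lt;/p&gt;

```python
import math

def damp_fixed(fps, seconds, rate=3.5, target=0.0, start=100.0):
    v, dt = start, 1.0 / fps
    for _ in range(round(seconds * fps)):
        v = v + (target - v) * (1.0 - math.exp(-dt * rate))
    return v

# damp_fixed(30, 1.0) and damp_fixed(144, 1.0) both give 100*exp(-3.5), about 3.02
```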
&lt;p&gt;Also, any existing overshooting clamping can safely be dropped.&lt;/p&gt;
&lt;p&gt;Now please adjust your game to make the world a better and safer place for
everyone ♥&lt;/p&gt;
&lt;h2&gt;Going further&lt;/h2&gt;
&lt;p&gt;As &lt;a href=&quot;https://news.ycombinator.com/item?id=40401152&quot;&gt;suggested on HN&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For numerical stability it makes sense to &lt;a href=&quot;https://www.johndcook.com/blog/cpp_expm1/&quot;&gt;use &lt;code&gt;-expm1(x)&lt;/code&gt;&lt;/a&gt; instead of
&lt;code&gt;1-exp(x)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;API wise, proposing a &lt;a href=&quot;https://en.wikipedia.org/wiki/Exponential_smoothing#Time_constant&quot;&gt;time constant&lt;/a&gt; &lt;code&gt;T&lt;/code&gt; instead of the rate (where
&lt;code&gt;T=1/rate&lt;/code&gt;) might be more intuitive&lt;/li&gt;
&lt;li&gt;For performance reasons, the exponential could be expanded manually: for small
values of &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;1-exp(-x)&lt;/code&gt; is approximately &lt;code&gt;x-x²/2&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
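&lt;p&gt;The first point is easy to check in Python: for a tiny &lt;code&gt;delta&lt;/code&gt;, &lt;code&gt;-expm1(-x)&lt;/code&gt;
keeps its precision while &lt;code&gt;1-exp(-x)&lt;/code&gt; suffers from cancellation:&lt;/p&gt;

```python
import math

x = 3.5e-10  # delta * RATE2 for a very small delta
naive = 1.0 - math.exp(-x)
stable = -math.expm1(-x)
# stable is accurate to full precision; naive can lose a large
# fraction of its significant digits to the subtraction from 1.0
```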

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/40-hacking-window-titles-to-help-obs.html</guid>
 <link>http://blog.pkh.me/p/40-hacking-window-titles-to-help-obs.html</link>
 <title>Hacking window titles to help OBS</title>
 <pubDate>Tue, 06 Jun 2023 09:27:10 -0000</pubDate>
 <description>&lt;p&gt;This write-up presents the rationale and technical details behind a
tiny project I wrote the other day, &lt;a href=&quot;https://github.com/ubitux/WindowTitleHack&quot;&gt;WTH, or WindowTitleHack&lt;/a&gt;, which
forces a constant window name for apps that keep changing it (I&#x27;m looking
specifically at Firefox and Krita, but there are probably many others).&lt;/p&gt;
&lt;h2&gt;Why tho?&lt;/h2&gt;
&lt;p&gt;I&#x27;ve been streaming on Twitch from Linux (X11) with a barebone &lt;a href=&quot;https://obsproject.com/&quot;&gt;OBS
Studio&lt;/a&gt; setup for a while now, and while most of the experience has been
relatively smooth, one particularly striking frustration has been dealing with
window detection.&lt;/p&gt;
&lt;p&gt;If we don&#x27;t want to capture the whole desktop for privacy reasons or simply to
have control over the scene layout depending on the currently focused app, we
need to rely on the &lt;code&gt;Window Capture (XComposite)&lt;/code&gt; source. This works mostly
fine, and it is actually able to track windows even when their title bar is
renamed. But obviously, upon restart it can&#x27;t find them again because both the
window titles and the window IDs changed, meaning we have to redo our setup by
reselecting the windows again.&lt;/p&gt;
&lt;p&gt;It would have been acceptable if that was the only issue I had, but one of the
more advanced features I&#x27;m using extensively is the &lt;code&gt;Advanced Scene Switcher&lt;/code&gt;
(the builtin one, available through the &lt;code&gt;Tools&lt;/code&gt; menu). This tool is a basic
window title pattern matching system that allows automatic scene switches
depending on the current window. Note that it does seem to support regex, which
could help with the problem, but there is no guarantee that the app would leave
a recognizable matchable pattern in its title. Also, if we want multiple
Firefox windows but only match one in particular, the regex wouldn&#x27;t help.&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/windowtitlehack/obs-automatic-scene-switcher.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;OBS native automatic scene switcher&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;h2&gt;Hacking Windows&lt;/h2&gt;
&lt;p&gt;One unreliable hack would be to spam &lt;code&gt;xdotool&lt;/code&gt; commands to correct the window
title. This could be a resource hog, and it would create quite a few races. One
slight improvement over this would be to use &lt;code&gt;xprop -spy&lt;/code&gt;, but that
wouldn&#x27;t address the race conditions (since we would adjust the title &lt;em&gt;after&lt;/em&gt;
it has already been changed).&lt;/p&gt;
&lt;p&gt;So how do we deal with that properly? Well, on X11 with the reference library
(&lt;code&gt;Xlib&lt;/code&gt;) there are a lot of ways of changing the title bar. It took me a while
to identify which call(s) to target, but I ended up with the following call
graph, where each function is exposed publicly:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/windowtitlehack/x11-xchangeproperty.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;X11 XChangeProperty call tree&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;From this we can easily see that we only need to hook the deepest function
&lt;code&gt;XChangeProperty&lt;/code&gt;, and check if the property is &lt;code&gt;XA_WM_NAME&lt;/code&gt; (or its &amp;quot;modern&amp;quot;
sibling, &lt;code&gt;_NET_WM_NAME&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;How do we do that? With the help of the &lt;code&gt;LD_PRELOAD&lt;/code&gt; environment variable and a
dynamic library that implements a custom &lt;code&gt;XChangeProperty&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;First, we grab the original function:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;dlfcn.h&amp;gt;

/* A type matching the prototype of the target function */
typedef int (*XChangeProperty_func_type)(
    Display *display,
    Window w,
    Atom property,
    Atom type,
    int format,
    int mode,
    const unsigned char *data,
    int nelements
);

/* [...] */

XChangeProperty_func_type XChangeProperty_orig = dlsym(RTLD_NEXT, &amp;quot;XChangeProperty&amp;quot;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also need to craft a custom &lt;code&gt;_NET_WM_NAME&lt;/code&gt; atom:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;_NET_WM_NAME = XInternAtom(display, &amp;quot;_NET_WM_NAME&amp;quot;, 0);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this we are now able to intercept all the &lt;code&gt;WM_NAME&lt;/code&gt; property updates and
override them with our own title:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;if (property == XA_WM_NAME || property == _NET_WM_NAME) {
    data = (const unsigned char *)new_title;
    nelements = (int)strlen(new_title);
}
return XChangeProperty_orig(display, w, property, type, format, mode, data, nelements);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We wrap all of this into our own redefinition of &lt;code&gt;XChangeProperty&lt;/code&gt; and… that&#x27;s
pretty much it.&lt;/p&gt;
&lt;p&gt;Now, due to a long history of development, &lt;code&gt;Xlib&lt;/code&gt; has been &amp;quot;deprecated&amp;quot; and
superseded by &lt;code&gt;libxcb&lt;/code&gt;. Both are widely used, but fortunately the APIs are more
or less similar. The function to hook is &lt;code&gt;xcb_change_property&lt;/code&gt;, and defining
&lt;code&gt;_NET_WM_NAME&lt;/code&gt; is slightly more cumbersome but not exactly challenging:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;const xcb_intern_atom_cookie_t cookie = xcb_intern_atom(conn, 0, strlen(&amp;quot;_NET_WM_NAME&amp;quot;), &amp;quot;_NET_WM_NAME&amp;quot;);
xcb_intern_atom_reply_t *reply = xcb_intern_atom_reply(conn, cookie, NULL);
if (reply)
    _NET_WM_NAME = reply-&amp;gt;atom;
free(reply);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Aside from that, the code is pretty much the same.&lt;/p&gt;
&lt;h2&gt;Configuration&lt;/h2&gt;
&lt;p&gt;To pass down the custom title to override, I&#x27;ve been relying on an environment
variable &lt;code&gt;WTH_TITLE&lt;/code&gt;. From a user point of view, it looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;LD_PRELOAD=&amp;quot;builddir/libwth.so&amp;quot; WTH_TITLE=&amp;quot;Krita4ever&amp;quot; krita
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We could probably improve the usability by creating a wrapping tool (so that we
could have something such as &lt;code&gt;./wth --title=Krita4ever krita&lt;/code&gt;). Unfortunately I
wasn&#x27;t yet able to make a self-referencing executable accepted by &lt;code&gt;LD_PRELOAD&lt;/code&gt;,
so for now, manually setting the &lt;code&gt;LD_PRELOAD&lt;/code&gt; and &lt;code&gt;WTH_TITLE&lt;/code&gt; environment
variables will do just fine.&lt;/p&gt;
&lt;h2&gt;Thread safety&lt;/h2&gt;
&lt;p&gt;To avoid a bunch of redundant function round-trips we need to globally cache a
few things: the new title (to avoid fetching it in the environment all the
time), the original functions (to save the &lt;code&gt;dlsym&lt;/code&gt; call), and &lt;code&gt;_NET_WM_NAME&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Those are loaded lazily at the first function call, but we have no guarantee
with regard to concurrent calls to the hooked function, so we must create our
own lock. I initially thought about using &lt;code&gt;pthread_once&lt;/code&gt;, but unfortunately its
initialization callback mechanism doesn&#x27;t allow a custom argument. Again,
this is merely a slight annoyance since we can implement our own in a few lines
of code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;/* The &amp;quot;once&amp;quot; API is similar to pthread_once but allows a custom function argument */
struct wth_once {
    pthread_mutex_t lock;
    int initialized;
};

#define WTH_ONCE_INITIALIZER {.lock=PTHREAD_MUTEX_INITIALIZER}

typedef void (*init_func_type)(void *user_arg);

void wth_init_once(struct wth_once *once, init_func_type init_func, void *user_arg)
{
    pthread_mutex_lock(&amp;amp;once-&amp;gt;lock);
    if (!once-&amp;gt;initialized) {
        init_func(user_arg);
        once-&amp;gt;initialized = 1;
    }
    pthread_mutex_unlock(&amp;amp;once-&amp;gt;lock);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which we use like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;static struct wth_once once = WTH_ONCE_INITIALIZER;

static void init_once(void *user_arg)
{
    Display *display = user_arg;
    /* [...] */
}

/* [...] */

wth_init_once(&amp;amp;once, init_once, display);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The End?&lt;/h2&gt;
&lt;p&gt;I&#x27;ve been putting off this project for weeks because it felt complex at
first glance, but it actually took me only a few hours, probably about the same
amount of time it took me to write this article. While the project is
admittedly really small, it still feels like a nice accomplishment. I hope it&#x27;s
useful to other people.&lt;/p&gt;
&lt;p&gt;Now, Wayland support is probably the most obvious improvement the project
could receive, but I don&#x27;t have such a setup locally to test with yet, so this
is postponed for an undetermined amount of time.&lt;/p&gt;
&lt;p&gt;The code is released with a permissive license (MIT); if you want to contribute
you can open a pull request but getting in touch with me first is appreciated
to avoid unnecessary and overlapping efforts.&lt;/p&gt;

 </description>
</item>
<item>
 <guid>http://blog.pkh.me/p/39-improving-color-quantization-heuristics.html</guid>
 <link>http://blog.pkh.me/p/39-improving-color-quantization-heuristics.html</link>
 <title>Improving color quantization heuristics</title>
 <pubDate>Sat, 31 Dec 2022 12:00:43 -0000</pubDate>
 <description>&lt;p&gt;In 2015, I wrote an article about &lt;a href=&quot;http://blog.pkh.me/p/21-high-quality-gif-with-ffmpeg.html&quot;&gt;how the palette color quantization was
improved in FFmpeg&lt;/a&gt; in order to make nice animated GIF files. For some
reason, to this day this is one of my most popular articles.&lt;/p&gt;
&lt;p&gt;As time passed, my experience with colors grew and I ended up being quite
ashamed and frustrated with the state of these filters. A lot of the code was
naive (when not terribly wrong), despite the apparent good results.&lt;/p&gt;
&lt;p&gt;One of the major changes I wanted to make was to evaluate the color distances
in a perceptually uniform color space, instead of using a naive Euclidean
distance between RGB triplets.&lt;/p&gt;
&lt;p&gt;As usual it felt like a week-end long project; after all, all I had to do was
change the distance function to work in a different space, right? Well, if
you&#x27;re following my blog you might have noticed I had numerous adventures
that stacked up on each other:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I had to work out the &lt;a href=&quot;http://blog.pkh.me/p/38-porting-oklab-colorspace-to-integer-arithmetic.html&quot;&gt;colorspace with integer arithmetic&lt;/a&gt; first&lt;/li&gt;
&lt;li&gt;...which forced me to look into &lt;a href=&quot;http://blog.pkh.me/p/36-figuring-out-round%2C-floor-and-ceil-with-integer-division.html&quot;&gt;integer division&lt;/a&gt; more deeply&lt;/li&gt;
&lt;li&gt;...which confronted me with all sorts of &lt;a href=&quot;http://blog.pkh.me/p/37-gcc-undefined-behaviors-are-getting-wild.html&quot;&gt;undefined behaviours&lt;/a&gt; in the
process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And when I finally reached the point where I could make the switch to
&lt;a href=&quot;https://bottosson.github.io/posts/oklab/&quot;&gt;OkLab&lt;/a&gt; (the perceptual colorspace), a few experiments showed that the
flavor of the core algorithm I was using might contain some fundamental flaws,
or at least was not implementing optimal heuristics. So here we go again:
quickly enough, I found myself starting a new research study in the pursuit of
understanding how to put pixels on the screen. This write-up is the story of
yet another self-inflicted struggle.&lt;/p&gt;
&lt;h2&gt;Palette quantization&lt;/h2&gt;
&lt;p&gt;But what is &lt;em&gt;palette quantization&lt;/em&gt;? It essentially refers to the process of
reducing the number of available colors of an image down to a smaller subset.
In sRGB, an image can have up to 16.7 million colors. In practice though it&#x27;s
generally much less, to the surprise of no one. Still, it&#x27;s not rare to have a
few hundred thousand different colors in a single picture. Our goal is to
reduce that to something like 256 colors that represent them best, and use
these colors to create a new picture.&lt;/p&gt;
&lt;p&gt;Why, you may ask? There are multiple reasons; here are some:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improve compression (this is a lossy operation of course, and using
dithering on top might actually defeat the original purpose)&lt;/li&gt;
&lt;li&gt;Some codecs might not support anything other than limited palettes (GIF and
subtitle codecs are examples)&lt;/li&gt;
&lt;li&gt;Various artistic purposes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Following is an example of a picture quantized at different levels:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Original (26125 colors)&lt;/th&gt;
&lt;th&gt;Quantized to 8bpp (256 colors)&lt;/th&gt;
&lt;th&gt;Quantized to 2bpp (4 colors)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src=&quot;http://blog.pkh.me/img/color-quant/cat-orig.png&quot; alt=&quot;Cat (original)&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;http://blog.pkh.me/img/color-quant/cat-256.png&quot; alt=&quot;Cat (8bpp)&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;http://blog.pkh.me/img/color-quant/cat-4.png&quot; alt=&quot;Cat (2bpp)&quot; /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This color quantization process can be roughly summarized as a 4-step
process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Sample the input image: we build a histogram of all the colors in the
picture (basically a simple statistical analysis)&lt;/li&gt;
&lt;li&gt;Design a colormap: we build the palette through various means using the
histograms&lt;/li&gt;
&lt;li&gt;Create a pixel mapping which associates a color (one that can be found in
the input image) with another (one that can be found in the newly created
palette)&lt;/li&gt;
&lt;li&gt;Image quantizing: we use the color mapping to build our new image. This step
may also involve some &lt;a href=&quot;https://en.wikipedia.org/wiki/Dither&quot;&gt;dithering&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The study here will focus on step 2 (which itself relies on step 1).&lt;/p&gt;
&lt;h2&gt;Colormap design algorithms&lt;/h2&gt;
&lt;p&gt;A palette is simply a set of colors. It can be represented in various ways, for
example here in 2D and 3D:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/color-quant/pal-2d-3d.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;A 256 color palette represented in 2D and 3D&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;To generate such a palette, all sorts of algorithms exist. They are usually
classified into 2 large categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dividing/splitting algorithms (such as Median-Cut and its various flavors)&lt;/li&gt;
&lt;li&gt;Clustering algorithms (such as K-means, maximin distance, (E)LBG or pairwise
clustering)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The former are faster but non-optimal while the latter are slower but better.
The problem is &lt;a href=&quot;https://en.wikipedia.org/wiki/NP-completeness&quot;&gt;NP-complete&lt;/a&gt;, meaning it&#x27;s possible to find the
optimal solution but it can be extremely costly. On the other hand, it&#x27;s
possible to find &amp;quot;local optimums&amp;quot; at minimal cost.&lt;/p&gt;
&lt;p&gt;Since I&#x27;m working within FFmpeg, speed has always been a priority. This is
what motivated me to initially implement Median-Cut rather than a more
expensive algorithm.&lt;/p&gt;
&lt;p&gt;The rough picture of the algorithm is relatively easy to grasp. Assuming we
want a palette of &lt;code&gt;K&lt;/code&gt; colors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A set &lt;code&gt;S&lt;/code&gt; of all the colors in the input picture is constructed, along with
a respective set &lt;code&gt;W&lt;/code&gt; of the weight of each color (how much they appear)&lt;/li&gt;
&lt;li&gt;Since the colors are expressed as RGB triplets, they can be encapsulated
in one big cuboid, or box&lt;/li&gt;
&lt;li&gt;The box is cut in two along one of the axes (R, G or B) at the median
(hence the name of the algorithm)&lt;/li&gt;
&lt;li&gt;If we don&#x27;t have a total of &lt;code&gt;K&lt;/code&gt; boxes yet, pick one of them and go back to
the previous step&lt;/li&gt;
&lt;li&gt;All the colors in each of the &lt;code&gt;K&lt;/code&gt; boxes are then averaged to form the color
palette entries&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here is how the process looks visually:&lt;/p&gt;
&lt;p&gt;&lt;video src=&quot;http://blog.pkh.me/misc/mediancut-parrot-16.mp4&quot; controls=&quot;controls&quot; width=&quot;800&quot;&gt;Median-Cut algorithm targeting 16 boxes&lt;/video&gt;&lt;/p&gt;
&lt;p&gt;You may have spotted in this video that the colors are not expressed in RGB but
in Lab: this is because instead of representing the colors in a traditional RGB
colorspace, we are instead using the OkLab colorspace which has the property of
being perceptually uniform. It doesn&#x27;t really change the Median Cut algorithm,
but it definitely has an impact on the resulting palette.&lt;/p&gt;
&lt;p&gt;One striking limitation of this algorithm is that we are working exclusively
with cuboids: the cuts are limited to an axis, we are not cutting along an
arbitrary plane or a more complex shape. Think of it like working with voxels
instead of more free-form geometries. The main benefit is that the algorithm is
pretty simple to implement.&lt;/p&gt;
&lt;p&gt;Now, the description provided earlier conveniently avoided describing two
important aspects of steps 3 and 4:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;How do we choose the next box to split?&lt;/li&gt;
&lt;li&gt;How do we choose along which axis of the box we make the cut?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I pondered that for quite a long time.&lt;/p&gt;
&lt;h2&gt;An overview of the possible heuristics&lt;/h2&gt;
&lt;p&gt;In bulk, some of the heuristics I started thinking about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Should we take the box that has the longest axis across all boxes?&lt;/li&gt;
&lt;li&gt;Should we take the box that has the largest volume?&lt;/li&gt;
&lt;li&gt;Should we take the box that has the biggest &lt;a href=&quot;https://en.wikipedia.org/wiki/Mean_squared_error&quot;&gt;Mean Squared Error&lt;/a&gt; when
compared to its average color?&lt;/li&gt;
&lt;li&gt;Should we take the box that has the &lt;em&gt;axis&lt;/em&gt; with the biggest MSE?&lt;/li&gt;
&lt;li&gt;Assuming we choose to go with the MSE, should it be normalized across all
boxes?&lt;/li&gt;
&lt;li&gt;Should we even account for the weight of each color or consider them equal?&lt;/li&gt;
&lt;li&gt;What about the axis? Is it better to pick the longest one or the one with
the highest MSE?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I tried to formalize these questions mathematically to the best of my limited
abilities. So let&#x27;s start by saying that all the colors &lt;code&gt;c&lt;/code&gt; of a given box are
stored in a &lt;code&gt;N×M&lt;/code&gt; 2D-array following the matrix notation:&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;L₁&lt;/td&gt;&lt;td&gt;L₂&lt;/td&gt;&lt;td&gt;L₃&lt;/td&gt;&lt;td&gt;…&lt;/td&gt;&lt;td&gt;Lₘ&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;a₁&lt;/td&gt;&lt;td&gt;a₂&lt;/td&gt;&lt;td&gt;a₃&lt;/td&gt;&lt;td&gt;…&lt;/td&gt;&lt;td&gt;aₘ&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;b₁&lt;/td&gt;&lt;td&gt;b₂&lt;/td&gt;&lt;td&gt;b₃&lt;/td&gt;&lt;td&gt;…&lt;/td&gt;&lt;td&gt;bₘ&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;N&lt;/code&gt; is the number of components (3 in our case, whether it&#x27;s RGB or Lab), and
&lt;code&gt;M&lt;/code&gt; the number of colors in that box. You can visualize this as a list of
vectors as well, where &lt;code&gt;c_{i,j}&lt;/code&gt; is the color at row &lt;code&gt;i&lt;/code&gt; and column &lt;code&gt;j&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;With that in mind we can sketch the following diagram representing the tree of
heuristic possibilities to implement:&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/color-quant/diagram-heuristics.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;Tree of potential heuristics for the Median-Cut algorithm&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;Mathematicians are going to kill me for doodling random notes all over this
perfectly understandable gibberish of symbols, but I believe it&#x27;s required for
the human beings reading this article.&lt;/p&gt;
&lt;p&gt;In summary, we end up with a total of 24 combinations to try out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;2 axis selection heuristics:
&lt;ul&gt;
&lt;li&gt;Cut the axis with the maximum error squared&lt;/li&gt;
&lt;li&gt;Cut the axis with the maximum length&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;3 operators:
&lt;ul&gt;
&lt;li&gt;Maximum measurement out of all the channels&lt;/li&gt;
&lt;li&gt;Product of the measurements of all the channels&lt;/li&gt;
&lt;li&gt;Sum of the measurements of all the channels&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;4 measurements:
&lt;ul&gt;
&lt;li&gt;Error squared, honoring weights&lt;/li&gt;
&lt;li&gt;Error squared, not honoring weights&lt;/li&gt;
&lt;li&gt;Error squared, honoring weights, normalized&lt;/li&gt;
&lt;li&gt;Length of the axis&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If we start to think intuitively about which ones are likely going to perform
the best, we quickly realize that we haven&#x27;t actually formalized what we are
trying to achieve. Such a rookie mistake. Clarifying this will help us get a
better feeling for the likely outcome.&lt;/p&gt;
&lt;p&gt;I chose to target an output that minimizes the MSE against the reference image,
in a perceptual way. Said differently, trying to make the perceptual distance
between an input and output color pixel as minimal as possible. This is an
arbitrary and debatable target, but it&#x27;s relatively simple and objective to
evaluate if we have faith in the selected perceptual model. Another appropriate
metric could have been to find the ideal palette through another algorithm and
compare against that instead. Unfortunately, doing that would have required
trusting that other algorithm and its implementation, and having enough
computing power.&lt;/p&gt;
&lt;p&gt;So to summarize, we want to minimize the MSE between the input and output,
evaluated in the OkLab color space. This can be expressed with the following
formula:&lt;/p&gt;
&lt;div class=&quot;math block&quot;&gt;
\min_{P : |P| = K} \displaystyle\sum_{C \in P} \sum_{c \in C} w_c||c-\mu_C||^2
&lt;/div&gt;
&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;P&lt;/span&gt; is a &lt;a href=&quot;https://en.m.wikipedia.org/wiki/Partition_of_a_set&quot;&gt;partition&lt;/a&gt;
(which we constrain to a box in our implementation)&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;C&lt;/span&gt; the set of colors in the partition &lt;span class=&quot;math inline&quot;&gt;P&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;w&lt;/span&gt; the weight of a color&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;c&lt;/span&gt; a single color&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\mu&lt;/span&gt; the average color of the set &lt;span class=&quot;math inline&quot;&gt;C&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Special thanks to &lt;code&gt;criver&lt;/code&gt; for helping me a ton on the math area, this last
formula is from them.&lt;/p&gt;
&lt;p&gt;Looking at the formula, we can see how similar it is to certain branches of the
heuristics tree, so we can start getting an intuition about the result of the
experiment.&lt;/p&gt;
&lt;h2&gt;Experiment language&lt;/h2&gt;
&lt;p&gt;A short digression from the main topic (feel free to skip to the next section):
working in C within FFmpeg quickly became more of a hurdle than anything. Aside
from the lack of flexibility, the implicit casts deceitfully destroying
precision, and the undefined behaviours, all kinds of C quirks got in the way
several times, which made me question my sanity. This one typically severely
messed me up while trying to average the colors:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

int main (void)
{
    const int32_t x = -30;
    const uint32_t y = 10;

    const uint32_t a = 30;
    const int32_t b = -10;

    printf(&amp;quot;%d×%u=%d\n&amp;quot;, x, y, x * y);
    printf(&amp;quot;%u×%d=%d\n&amp;quot;, a, b, a * b);
    printf(&amp;quot;%d/%u=%d\n&amp;quot;, x, y, x / y);
    printf(&amp;quot;%u/%d=%d\n&amp;quot;, a, b, a / b);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;% cc -Wall -Wextra -fsanitize=undefined test.c -o test &amp;amp;&amp;amp; ./test
-30×10=-300
30×-10=-300
-30/10=429496726
30/-10=0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Anyway, I know this is obvious, but if you aren&#x27;t already doing so, I suggest
you build your experiments in another language, Python or whatever, and rewrite
them in C later once you&#x27;ve figured out your expected output.&lt;/p&gt;
&lt;p&gt;Re-implementing what I needed in Python didn&#x27;t take me long. It was, and still
is, obviously much slower at runtime, but that&#x27;s fine. There is a lot of room
for speed improvement, typically by relying on &lt;code&gt;numpy&lt;/code&gt; (which I didn&#x27;t bother
with).&lt;/p&gt;
&lt;h2&gt;Experiment results&lt;/h2&gt;
&lt;p&gt;I created a &lt;a href=&quot;https://github.com/ubitux/research/&quot;&gt;research repository&lt;/a&gt; for the occasion. The code to
reproduce and the results can be found in the &lt;a href=&quot;https://github.com/ubitux/research/tree/main/color-quantization&quot;&gt;color quantization
README&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In short, based on the results, we can conclude that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Overall, the box that has the axis with the largest non-normalized weighted
sum of squared error is the best candidate in the box selection algorithm&lt;/li&gt;
&lt;li&gt;Overall, cutting the axis with the largest weighted sum of squared error is
the best axis cut selection algorithm&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To my surprise, normalizing the weights per box is not a good idea. I initially
observed that by trial and error, which was actually one of the main motivators
for this research. I initially thought normalizing each box was necessary in
order to compare them against each other (such that they are compared on a
common ground). My loose explanation of the phenomenon was that not normalizing
causes a bias towards boxes with many colors, but that&#x27;s actually exactly what
we want. I believe it can also be explained by our evaluation function: we want
to minimize the error across the whole set of colors, so small partitions (in
color counts) must not be made stronger. At least not in the context of the
target we chose.&lt;/p&gt;
&lt;p&gt;It&#x27;s also interesting to see how the &lt;code&gt;max()&lt;/code&gt; seems to perform better than the
&lt;code&gt;sum()&lt;/code&gt; of the variance of each component most of the time. Admittedly, my
sample set of pictures is not that big, which may mean that more experiments
are required to confirm that tendency.&lt;/p&gt;
&lt;p&gt;In retrospect, this might have been quickly predictable to someone with a
mathematical background. But since I don&#x27;t have that, nor do I trust my
abstract thinking much, I&#x27;m kind of forced to try things out often. This is
likely one of the many instances where I spent way too much energy on something
obvious from the beginning, but I have the hope it will actually provide some
useful information for other lost souls out there.&lt;/p&gt;
&lt;h2&gt;Known limitations&lt;/h2&gt;
&lt;p&gt;There are two main limitations I want to discuss before closing this article.
The first one is related to minimizing the MSE even more.&lt;/p&gt;
&lt;h3&gt;K-means refinement&lt;/h3&gt;
&lt;p&gt;We know the Median-Cut actually provides a rough estimate of the optimal
palette. One thing we could do is use it as a first step before refinement, for
example by running a few K-means iterations as post-processing (how much
refinement/iterations could be a user control). The general idea of K-means is
to progressively move each color individually to a more appropriate box, that
is, a box for which the color distance to the average color of that box is
smaller. I started implementing that in a very naive way, so it&#x27;s extremely
slow, but that&#x27;s something to investigate further because it definitely
improves the results.&lt;/p&gt;
&lt;p&gt;Most of the academic literature seems to suggest the use of K-means
clustering, but all of the approaches require some startup step. Some come up
with various heuristics, some use PCA, but I&#x27;ve yet to see one that relies on
Median-Cut as a first pass; maybe that&#x27;s not such a good idea, but who knows.&lt;/p&gt;
&lt;h3&gt;Bias toward perceived lightness&lt;/h3&gt;
&lt;p&gt;Another, more annoying problem for which I have no solution is that human
perception is much more sensitive to lightness changes than to hue. If you
look at the first demo with the parrot, you may have noticed the boxes are
kind of thin. This is because the &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; components (respectively how
green/red and blue/yellow the color is) have a much smaller amplitude compared
to the &lt;code&gt;L&lt;/code&gt; (perceived lightness).&lt;/p&gt;
&lt;figure&gt;
  &lt;img src=&quot;http://blog.pkh.me/img/color-quant/oklab-axis-scaled.png&quot; alt=&quot;&quot;&gt;
  &lt;figcaption&gt;Side by side comparison of the spread of colors between a stretched and normalized view&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;You may rightfully question whether this is a problem or not. In practice, this
means that when &lt;code&gt;K&lt;/code&gt; is low (let&#x27;s say smaller than 8 or even 16), cuts along &lt;code&gt;L&lt;/code&gt;
will almost always be preferred, causing the picture to be heavily desaturated.
This is because it tries to preserve the most significant attribute in human
perception: the lightness.&lt;/p&gt;
&lt;p&gt;That particular picture is actually a pathological study case:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;4 colors&lt;/th&gt;
&lt;th&gt;8 colors&lt;/th&gt;
&lt;th&gt;12 colors&lt;/th&gt;
&lt;th&gt;16 colors&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;img src=&quot;http://blog.pkh.me/img/color-quant/woman-4.png&quot; alt=&quot;Portrait K=4&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;http://blog.pkh.me/img/color-quant/woman-8.png&quot; alt=&quot;Portrait K=8&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;http://blog.pkh.me/img/color-quant/woman-12.png&quot; alt=&quot;Portrait K=12&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;http://blog.pkh.me/img/color-quant/woman-16.png&quot; alt=&quot;Portrait K=16&quot; /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can see the hue timidly appearing around &lt;code&gt;K=16&lt;/code&gt; (specifically, it starts
being more strongly noticeable from the cut at &lt;code&gt;K=13&lt;/code&gt;).&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;For now, I&#x27;m mostly done with this &amp;quot;week-end long project&amp;quot; into which I
actually poured 2 or 3 months of lifetime. The FFmpeg patchset will likely be
upstreamed soon, so everyone should hopefully be able to benefit from it in the
next release. It will also come with &lt;a href=&quot;https://fosstodon.org/@bug/109602427382086789&quot;&gt;additional dithering
methods&lt;/a&gt;, whose implementation was actually a relaxing
distraction from all this hardship. There are still many ways of improving this
work, but it&#x27;s the end of the line for me, so I&#x27;ll trust the Internet with it.&lt;/p&gt;

 </description>
</item>
 </channel>
</rss>