Story of debugging a crash in MPD

Getting close to 3 years without posting. I haven't been peacefully enjoying a retired commoner life during all this time. If anything, I'm angrier and overdoing things more than ever. At the same time, the queue of blog posts to write and publish keeps growing as well, so it's time to give it a stab.

Today, I'm going to talk about a random bug that occurred while playing with my web audio stream. Long story short, I have a local MPD pushing to a remote Icecast using the "shout" audio output. I'm usually relying on that setup to listen to my music when I'm not at home. But this time, MPD kept crashing.

I naively asked on IRC if such an obvious bug was already known, and the answer was "no crash bugs are currently known. report them with full backtraces".

Spoiler alert: the bug is not in MPD.

Starting point

# coredumpctl info
           PID: 415 (mpd)
           UID: 45 (mpd)
           GID: 45 (mpd)
        Signal: 11 (SEGV)
     Timestamp: Sun 2020-08-02 12:24:58 CEST (4h 16min ago)
  Command Line: /usr/bin/mpd --no-daemon
    Executable: /usr/bin/mpd
 Control Group: /user.slice/user-45.slice/user@45.service/mpd.service
          Unit: user@45.service
     User Unit: mpd.service
         Slice: user-45.slice
     Owner UID: 45 (mpd)
       Boot ID: 6f57bce0857b45509356e5e01298eea0
    Machine ID: 799f672765424647aa2b2bbaea25145c
      Hostname: kaeru
       Storage: /var/lib/systemd/coredump/core.mpd.45.6f57bce0857b45509356e5e01298eea0.415.1596363898000000000000.lz4
       Message: Process 415 (mpd) of user 45 dumped core.

                Stack trace of thread 462:
                #0  0x0000ffff9269f444 memcpy (libc.so.6 + 0x81444)
                #1  0x0000ffff939c47c4 n/a (libshout.so.3 + 0xd7c4)
                #2  0x0000aaaaea7b6778 n/a (mpd + 0xd4778)
                #3  0x0000aaaaea7b6de0 n/a (mpd + 0xd4de0)
                #4  0x0000aaaaea7a0a64 n/a (mpd + 0xbea64)
                #5  0x0000aaaaea7a1d80 n/a (mpd + 0xbfd80)
                #6  0x0000aaaaea7a2ab4 n/a (mpd + 0xc0ab4)
                #7  0x0000aaaaea763270 n/a (mpd + 0x81270)
                #8  0x0000ffff939ee4b8 start_thread (libpthread.so.0 + 0x84b8)
                #9  0x0000ffff926ee95c thread_start (libc.so.6 + 0xd095c)

       [<snip thread-lock backtraces>]

For the context, this is a low cost ARM (aarch64) board running Arch Linux ARM. I'm basically able to reproduce the issue by randomly toggling on and off both the shout output and the music playback.

This is the kind of situation where debugging is painful for several reasons:

Getting the symbols on the board is going to be annoying because I need to cross compile (Arch Linux and Arch Linux ARM don't provide symbol packages).
Setting up a toolchain and cross compiling existing packages is a nightmare.
Setting up a similar setup locally with MPD and Icecast is going to be a annoying and I'm not sure I'll be able to reproduce the issue as it's going to be a pretty different setup; I need more information before doing that.

Identifying the crash source

The method I used to identify the code responsible for the crash just required to install GDB on the board, run coredumpctl gdb, and x/10x 0x0000aaaaea7b6778 (this is frame #2, the last frame in MPD before entering libshout.so). The reason I'm looking here is because MPD is the top brick and it might be misusing its libraries.

After grabbing the /usr/bin/mpd binary locally from the board, and loading it in Ghidra, I byte-search the 10 bytes dump by the x/10x command, which brings me here (on the blr instruction):

Ghidra disassembly view

Alright well, easy enough then, it's the part of the code in MPD that calls shout_send(). We're fortunate, there is only one call in MPD, in src/output/plugins/ShoutOutputPlugin.cxx:

static void
EncoderToShout(shout_t *shout_conn, Encoder &encoder,
	       unsigned char *buffer, size_t buffer_size)
{
	while (true) {
		size_t nbytes = encoder.Read(buffer, buffer_size);
		if (nbytes == 0)
			return;

		int err = shout_send(shout_conn, buffer, nbytes);
		HandleShoutError(shout_conn, err);
	}
}

I want to observe the value of the 3 parameters shout_conn, buffer and nbytes by looking at the first 3 registers at frame #2:

(gdb) bt
#0  0x0000ffff9269f444 in memcpy () at /usr/lib/libc.so.6
#1  0x0000ffff939c47c4 in  () at /usr/lib/libshout.so.3
#2  0x0000aaaaea7b6778 in  ()
#3  0x0000aaaaea7b6de0 in  ()
#4  0x0000aaaaea7a0a64 in  ()
#5  0x0000aaaaea7a1d80 in  ()
#6  0x0000aaaaea7a2ab4 in  ()
#7  0x0000aaaaea763270 in  ()
#8  0x0000ffff939ee4b8 in start_thread () at /usr/lib/libpthread.so.0
#9  0x0000ffff926ee95c in thread_start () at /usr/lib/libc.so.6
(gdb) frame 2
#2  0x0000aaaaea7b6778 in ?? ()
(gdb) i r
x0             0xaaaa8e85e4c2      187649512301762
x1             0xaaaaf6e9d46c      187651263681644
x2             0x3a                58
[...]

Unfortunately, these 3 parameters do not correspond to the registers at frame #2, they correspond to the values at frame #0 (memcpy arguments, potentially modified within the function). Some registers seem to change between frames, but not those, which means that GDB is pretty much useless here. According to the doc, this is what info registers does:

Print the names and values of all registers except floating-point and vector registers (in the selected stack frame).

Source: Debugging with GDB: Registers

But of course it can't because these registers weren't saved anywhere: they are call clobbered. Functions can modify them and do not need to restore them, so they don't end up saved on the stack, and as a result GDB can't reconstruct them. It would have been much better if GDB was printing <unknown> or tagging with <unreliable> such registers when we are on a frame ≠ #0.

Anyway, let's see what we get in libshout.so (down one level, at frame #1) instead. Using the same method used with MPD, we easily identify this code in icecast-libshout (src/format_ogg.c):

static int send_ogg(shout_t *self, const unsigned char *data, size_t len)
{
    ogg_data_t  *ogg_data = (ogg_data_t*)self->format_data;
    ogg_codec_t *codec;
    char        *buffer;
    ogg_page     page;

    buffer = ogg_sync_buffer(&ogg_data->oy, len);
    memcpy(buffer, data, len);
    ogg_sync_wrote(&ogg_data->oy, len);

We know the crash happens inside that memcpy call. First thing we observe here is that ogg_sync_buffer() is unchecked and can actually return NULL in several scenarios, this one for a start:

char *ogg_sync_buffer(ogg_sync_state *oy, long size){
  if(ogg_sync_check(oy)) return NULL;
  // ...

The SIGSEGV does indeed happens on a write instruction (stp) meaning that we have an invalid destination buffer. Well then, we already have a patch to try out:

diff --git a/src/format_ogg.c b/src/format_ogg.c
index 9d852c4..b694911 100644
--- a/src/format_ogg.c
+++ b/src/format_ogg.c
@@ -92,6 +92,8 @@ static int send_ogg(shout_t *self, const unsigned char *data, size_t len)
     ogg_page     page;

     buffer = ogg_sync_buffer(&ogg_data->oy, len);
+    if (!buffer)
+        return self->error = SHOUTERR_INSANE;
     memcpy(buffer, data, len);
     ogg_sync_wrote(&ogg_data->oy, len);

But in our case, the pointer looks far from being NULL if we look at the registers, so it's likely another scenario. The rest of the function looks rather correct, so we'll keep that patch in standby while we're getting deeper.

Down the rabbit hole

Down into libogg this time. Just in case ogg_sync_buffer() is returning an invalid buffer, we can have a look at what's going on there.

First, here is a small aggregation of a few useful definitions to understand the context:

typedef struct {
  unsigned char *data;
  int storage;
  int fill;
  int returned;

  int unsynced;
  int headerbytes;
  int bodybytes;
} ogg_sync_state;

/* ... */

typedef struct {
    ogg_sync_state  oy;
    ogg_codec_t    *codecs;
    char            bos;
} ogg_data_t;

We have a ring buffer with data the base pointer, storage the total number allocated, fill how much is filled, and returned corresponding to what has been consumed.

Only scenario where I see the ogg_sync_buffer() fail would be if the values are inconsistent, typically with returned being larger than fill.

After staring at the code for a while, I couldn't notice anything special. Remember, I don't have much tools to actually debug or make any modification yet, so building up a mental model without any data is getting trickier as I'm getting deeper.

Still, one thing is suspicious: the growing buffer size (newsize) is held in a long but saved in an int (storage):

char *ogg_sync_buffer(ogg_sync_state *oy, long size){
  if(ogg_sync_check(oy)) return NULL;

  /* first, clear out any space that has been previously returned */
  if(oy->returned){
    oy->fill-=oy->returned;
    if(oy->fill>0)
      memmove(oy->data,oy->data+oy->returned,oy->fill);
    oy->returned=0;
  }

  if(size>oy->storage-oy->fill){
    /* We need to extend the internal buffer */
    long newsize=size+oy->fill+4096; /* an extra page to be nice */
    void *ret;

    if(oy->data)
      ret=_ogg_realloc(oy->data,newsize);
    else
      ret=_ogg_malloc(newsize);
    if(!ret){
      ogg_sync_clear(oy);
      return NULL;
    }
    oy->data=ret;
    oy->storage=newsize;
  }

  /* expose a segment at least as large as requested at the fill mark */
  return((char *)oy->data+oy->fill);
}

I have doubts this is the issue I'm getting, but it's still something we have to address. We'll make another patch for that.

At this point, I've been playing with a lot of assumptions, but one thing I'm reassured of is that it doesn't look like an architecture specific issue (with multimedia stuff, you really never know), so I'm more confident that I'll be able to reproduce it locally. Since I can't go much further without actual debugging, I'm adjusting the strategy.

Switching to easy mode

I now have some more time to construct a similar but simplified setup locally.

MPD setup

We first need a minimal MPD with a shout output, let's go with the following ~/mpd.conf:

log_file	"/home/ux/mpd.log"
db_file		"/home/ux/mpd.db"
music_directory	"/home/ux/x"

audio_output {
	type		"shout"
	name		"My Shout Stream"
	host		"localhost"
	port		"8000"
	mount		"/mpd.ogg"
	password	"hackme"
	bitrate		"128"
	format		"44100:16:1"
}

We add a random music file in ~/x and can start it in standalone with mpd --no-daemon --stderr mpd.conf.

Icecast setup

I don't need to make a custom build or actually debug Icecast, so we can install it on the system and be done with it. But of course the package is broken out of the box, so we need to find a workaround for a stupid permission issue. I don't do that often but I have to commend the quick reaction and fix here. Quickly enough, systemctl start icecast is working again.

Crashing it

Since using ncmpcpp or another UI for controlling MPD is burdensome for automation, we need to identify a minimal number of steps to reproduce the issue. So here it is, with mpc:

mpc add /
mpc play
mpc toggleoutput 1
mpc toggleoutput 1
mpc play

We can save that in a shell script and happily trigger the crash as much as we want.

Tooling

Last step: valgrind, the joker tool for almost every possible memory issue. I wanted to do something smart with the vgdb attach system, but a bug in the packaging prevented me from reading the manual so I had to let it go. Maybe one day someone will explain to me why everything I try is always broken one way or another.

A first run with helgrind (thread sanitizer) revealed a race in MPD, but apparently only in a debugging scenario according to someone on IRC, so I ignored that one...

With the default Valgrind mode though, we quickly realize that the invalid write to the buffer is a use after free issue.

End of the road

The rest of the story is pretty uninteresting (valgrind did all the job really) so I'll skip down to the answer: libshout has a shout_t context which the user configures. Then you can call shout_open() and shout_close() on it. Problem is that doing that several times on the same context causes a re-use of previous data (allocated by the open) because it isn't cleaned up properly (in the close). This happens typically when you turn on and off the audio stream (the mpc toggleoutput commands do that, or in my case in the view 7 of ncmpcpp). Pretty boring and uninteresting bug overall, but the path was interesting given that my main goal was initially to see how far I could go without actually being on a workstation.

The patch itself is simple enough, and it actually baffles me that no one addressed it before that. Looking at the libshout tracker, I found the bug was actually reported a year ago.

In the end, I made the 3 patches and they were upstreamed:

ogg: framing: check for overflow on growing buffer (684c7377)
libshout: Fix: cleanup format residual after close (6598fd21)
libshout: Fix: check ogg_sync_buffer() pointer before writing to it (9d3bed1a)

2nd commit in that list being the important fix.

Everything is broken

Frankly, I don't understand how software can be so broken, even at such low levels like this, where things don't even change much lately. This blog post is about 3 bugs including a critical one, but along the way so much shit happened such as the 2 packaging specific issues (Icecast, Valgrind). But GDB is also unreliable, and I'm not even talking about the i915 GPU hangs I had every 15 minutes or so (for totally unrelated reasons). And after all of this is fixed and I think I can listen to my stream, I realize there is now a bug in the ogg demuxer of my player, causing a huge CPU and memory consumption after a few minutes of listening to the stream. I haven't made an exhaustive list, but during my 2 days of investigation, I got hit by more bugs than my hands can count.

The amount of effort I have to make just to be able to listen to my music is not humanly acceptable, and it's no surprise so many people eat proprietary garbage software on a daily basis. And yes I know, it's my fault for using a bleeding edge distribution, and C code is full of memory issues, Rust is the best, blablabla. I still believe we could do much better in the current state. Testing software and being held to higher standards don't need to be reserved to modern languages and stable distributions. The core issue is how we break this mentality of "bugs don't matter because we can just fix them quickly".

To be honest, I've lost all hopes for things to ever get better in any way in this world, but I still won't surrender to making things even worse. See you soon for another blog post.

For updates and more frequent content you can follow me on Mastodon. Feel also free to subscribe to the RSS in order to be notified of new write-ups. It is also usually possible to reach me through other means (check the footer below). Finally, discussions on some of the articles can sometimes be found on HackerNews, Lobste.rs and Reddit.