Getting close to 3 years without posting. I haven't been peacefully enjoying a retired commoner life during all this time. If anything, I'm angrier and overdoing things more than ever. At the same time, the queue of blog posts to write and publish keeps growing as well, so it's time to give it a stab.
Today, I'm going to talk about a random bug that occurred while playing with my web audio stream. Long story short, I have a local MPD pushing to a remote Icecast using the "shout" audio output. I'm usually relying on that setup to listen to my music when I'm not at home. But this time, MPD kept crashing.
I naively asked on IRC if such an obvious bug was already known, and the answer was "no crash bugs are currently known. report them with full backtraces".
Spoiler alert: the bug is not in MPD.
Starting point
# coredumpctl info
PID: 415 (mpd)
UID: 45 (mpd)
GID: 45 (mpd)
Signal: 11 (SEGV)
Timestamp: Sun 2020-08-02 12:24:58 CEST (4h 16min ago)
Command Line: /usr/bin/mpd --no-daemon
Executable: /usr/bin/mpd
Control Group: /user.slice/user-45.slice/user@45.service/mpd.service
Unit: user@45.service
User Unit: mpd.service
Slice: user-45.slice
Owner UID: 45 (mpd)
Boot ID: 6f57bce0857b45509356e5e01298eea0
Machine ID: 799f672765424647aa2b2bbaea25145c
Hostname: kaeru
Storage: /var/lib/systemd/coredump/core.mpd.45.6f57bce0857b45509356e5e01298eea0.415.1596363898000000000000.lz4
Message: Process 415 (mpd) of user 45 dumped core.
Stack trace of thread 462:
#0 0x0000ffff9269f444 memcpy (libc.so.6 + 0x81444)
#1 0x0000ffff939c47c4 n/a (libshout.so.3 + 0xd7c4)
#2 0x0000aaaaea7b6778 n/a (mpd + 0xd4778)
#3 0x0000aaaaea7b6de0 n/a (mpd + 0xd4de0)
#4 0x0000aaaaea7a0a64 n/a (mpd + 0xbea64)
#5 0x0000aaaaea7a1d80 n/a (mpd + 0xbfd80)
#6 0x0000aaaaea7a2ab4 n/a (mpd + 0xc0ab4)
#7 0x0000aaaaea763270 n/a (mpd + 0x81270)
#8 0x0000ffff939ee4b8 start_thread (libpthread.so.0 + 0x84b8)
#9 0x0000ffff926ee95c thread_start (libc.so.6 + 0xd095c)
[<snip thread-lock backtraces>]
For the context, this is a low cost ARM (aarch64) board running Arch Linux ARM. I'm basically able to reproduce the issue by randomly toggling on and off both the shout output and the music playback.
This is the kind of situation where debugging is painful for several reasons:
- Getting the symbols on the board is going to be annoying because I need to cross compile (Arch Linux and Arch Linux ARM don't provide symbol packages).
- Setting up a toolchain and cross compiling existing packages is a nightmare.
- Setting up a similar setup locally with MPD and Icecast is going to be a annoying and I'm not sure I'll be able to reproduce the issue as it's going to be a pretty different setup; I need more information before doing that.
Identifying the crash source
The method I used to identify the code responsible for the crash just required
to install GDB on the board, run coredumpctl gdb
, and x/10x 0x0000aaaaea7b6778
(this is frame #2
, the last frame in MPD before entering
libshout.so
). The reason I'm looking here is because MPD is the top brick and
it might be misusing its libraries.
After grabbing the /usr/bin/mpd
binary locally from the board, and loading it
in Ghidra, I byte-search the 10 bytes dump by the x/10x
command,
which brings me here (on the blr
instruction):
Alright well, easy enough then, it's the part of the code in MPD that calls
shout_send()
. We're fortunate, there is only one call in MPD, in
src/output/plugins/ShoutOutputPlugin.cxx
:
static void
EncoderToShout(shout_t *shout_conn, Encoder &encoder,
unsigned char *buffer, size_t buffer_size)
{
while (true) {
size_t nbytes = encoder.Read(buffer, buffer_size);
if (nbytes == 0)
return;
int err = shout_send(shout_conn, buffer, nbytes);
HandleShoutError(shout_conn, err);
}
}
I want to observe the value of the 3 parameters shout_conn
, buffer
and
nbytes
by looking at the first 3 registers at frame #2
:
(gdb) bt
#0 0x0000ffff9269f444 in memcpy () at /usr/lib/libc.so.6
#1 0x0000ffff939c47c4 in () at /usr/lib/libshout.so.3
#2 0x0000aaaaea7b6778 in ()
#3 0x0000aaaaea7b6de0 in ()
#4 0x0000aaaaea7a0a64 in ()
#5 0x0000aaaaea7a1d80 in ()
#6 0x0000aaaaea7a2ab4 in ()
#7 0x0000aaaaea763270 in ()
#8 0x0000ffff939ee4b8 in start_thread () at /usr/lib/libpthread.so.0
#9 0x0000ffff926ee95c in thread_start () at /usr/lib/libc.so.6
(gdb) frame 2
#2 0x0000aaaaea7b6778 in ?? ()
(gdb) i r
x0 0xaaaa8e85e4c2 187649512301762
x1 0xaaaaf6e9d46c 187651263681644
x2 0x3a 58
[...]
Unfortunately, these 3 parameters do not correspond to the registers at frame
#2
, they correspond to the values at frame #0
(memcpy
arguments,
potentially modified within the function). Some registers seem to change
between frames, but not those, which means that GDB is pretty much useless
here. According to the doc, this is what info registers
does:
Print the names and values of all registers except floating-point and vector registers (in the selected stack frame).
Source: Debugging with GDB: Registers
But of course it can't because these registers weren't saved anywhere: they are
call clobbered. Functions can modify them and do not need to restore them, so
they don't end up saved on the stack, and as a result GDB can't reconstruct
them. It would have been much better if GDB was printing <unknown>
or tagging
with <unreliable>
such registers when we are on a frame ≠ #0
.
Anyway, let's see what we get in libshout.so
(down one level, at frame #1
)
instead. Using the same method used with MPD, we easily identify this code in
icecast-libshout
(src/format_ogg.c
):
static int send_ogg(shout_t *self, const unsigned char *data, size_t len)
{
ogg_data_t *ogg_data = (ogg_data_t*)self->format_data;
ogg_codec_t *codec;
char *buffer;
ogg_page page;
buffer = ogg_sync_buffer(&ogg_data->oy, len);
memcpy(buffer, data, len);
ogg_sync_wrote(&ogg_data->oy, len);
We know the crash happens inside that memcpy
call. First thing we observe
here is that ogg_sync_buffer()
is unchecked and can actually return NULL
in
several scenarios, this one for a start:
char *ogg_sync_buffer(ogg_sync_state *oy, long size){
if(ogg_sync_check(oy)) return NULL;
// ...
The SIGSEGV
does indeed happens on a write instruction (stp
) meaning that
we have an invalid destination buffer
. Well then, we already have a patch to
try out:
diff --git a/src/format_ogg.c b/src/format_ogg.c
index 9d852c4..b694911 100644
--- a/src/format_ogg.c
+++ b/src/format_ogg.c
@@ -92,6 +92,8 @@ static int send_ogg(shout_t *self, const unsigned char *data, size_t len)
ogg_page page;
buffer = ogg_sync_buffer(&ogg_data->oy, len);
+ if (!buffer)
+ return self->error = SHOUTERR_INSANE;
memcpy(buffer, data, len);
ogg_sync_wrote(&ogg_data->oy, len);
But in our case, the pointer looks far from being NULL
if we look at the
registers, so it's likely another scenario. The rest of the function looks
rather correct, so we'll keep that patch in standby while we're getting deeper.
Down the rabbit hole
Down into libogg
this time. Just in case ogg_sync_buffer()
is returning an
invalid buffer, we can have a look at what's going on there.
First, here is a small aggregation of a few useful definitions to understand the context:
typedef struct {
unsigned char *data;
int storage;
int fill;
int returned;
int unsynced;
int headerbytes;
int bodybytes;
} ogg_sync_state;
/* ... */
typedef struct {
ogg_sync_state oy;
ogg_codec_t *codecs;
char bos;
} ogg_data_t;
We have a ring buffer with data
the base pointer, storage
the total number
allocated, fill
how much is filled, and returned
corresponding to what has
been consumed.
Only scenario where I see the ogg_sync_buffer()
fail would be if the values
are inconsistent, typically with returned
being larger than fill
.
After staring at the code for a while, I couldn't notice anything special. Remember, I don't have much tools to actually debug or make any modification yet, so building up a mental model without any data is getting trickier as I'm getting deeper.
Still, one thing is suspicious: the growing buffer size (newsize
) is held
in a long
but saved in an int
(storage
):
char *ogg_sync_buffer(ogg_sync_state *oy, long size){
if(ogg_sync_check(oy)) return NULL;
/* first, clear out any space that has been previously returned */
if(oy->returned){
oy->fill-=oy->returned;
if(oy->fill>0)
memmove(oy->data,oy->data+oy->returned,oy->fill);
oy->returned=0;
}
if(size>oy->storage-oy->fill){
/* We need to extend the internal buffer */
long newsize=size+oy->fill+4096; /* an extra page to be nice */
void *ret;
if(oy->data)
ret=_ogg_realloc(oy->data,newsize);
else
ret=_ogg_malloc(newsize);
if(!ret){
ogg_sync_clear(oy);
return NULL;
}
oy->data=ret;
oy->storage=newsize;
}
/* expose a segment at least as large as requested at the fill mark */
return((char *)oy->data+oy->fill);
}
I have doubts this is the issue I'm getting, but it's still something we have to address. We'll make another patch for that.
At this point, I've been playing with a lot of assumptions, but one thing I'm reassured of is that it doesn't look like an architecture specific issue (with multimedia stuff, you really never know), so I'm more confident that I'll be able to reproduce it locally. Since I can't go much further without actual debugging, I'm adjusting the strategy.
Switching to easy mode
I now have some more time to construct a similar but simplified setup locally.
MPD setup
We first need a minimal MPD with a shout output, let's go with the following
~/mpd.conf
:
log_file "/home/ux/mpd.log"
db_file "/home/ux/mpd.db"
music_directory "/home/ux/x"
audio_output {
type "shout"
name "My Shout Stream"
host "localhost"
port "8000"
mount "/mpd.ogg"
password "hackme"
bitrate "128"
format "44100:16:1"
}
We add a random music file in ~/x
and can start it in standalone with mpd --no-daemon --stderr mpd.conf
.
Icecast setup
I don't need to make a custom build or actually debug Icecast, so we can
install it on the system and be done with it. But of course the package is
broken out of the box, so we need to find a workaround for a stupid permission
issue. I don't do that often but I have
to commend the quick reaction and fix here. Quickly enough, systemctl start icecast
is working again.
Crashing it
Since using ncmpcpp
or another UI for controlling MPD
is burdensome for
automation, we need to identify a minimal number of steps to reproduce the
issue. So here it is, with mpc
:
mpc add /
mpc play
mpc toggleoutput 1
mpc toggleoutput 1
mpc play
We can save that in a shell script and happily trigger the crash as much as we want.
Tooling
Last step: valgrind
, the joker tool for almost every possible memory issue. I
wanted to do something smart with the vgdb
attach system, but a bug in the
packaging prevented me from reading the
manual so I had to let it go. Maybe one day someone will explain to me why
everything I try is always broken one way or another.
A first run with helgrind
(thread sanitizer) revealed a race in MPD
, but
apparently only in a debugging scenario according to someone on IRC, so I
ignored that one...
With the default Valgrind mode though, we quickly realize that the invalid write to the buffer is a use after free issue.
End of the road
The rest of the story is pretty uninteresting (valgrind
did all the job
really) so I'll skip down to the answer: libshout
has a shout_t
context
which the user configures. Then you can call shout_open()
and shout_close()
on it. Problem is that doing that several times on the same context causes a
re-use of previous data (allocated by the open) because it isn't cleaned up
properly (in the close). This happens typically when you turn on and off the
audio stream (the mpc toggleoutput
commands do that, or in my case in the
view 7
of ncmpcpp
). Pretty boring and uninteresting bug overall, but the
path was interesting given that my main goal was initially to see how far I
could go without actually being on a workstation.
The patch itself is simple enough, and it actually baffles me that no one
addressed it before that. Looking at the libshout
tracker, I found the bug
was actually reported a year ago.
In the end, I made the 3 patches and they were upstreamed:
ogg
: framing: check for overflow on growing buffer (684c7377)libshout
: Fix: cleanup format residual after close (6598fd21)libshout
: Fix: check ogg_sync_buffer() pointer before writing to it (9d3bed1a)
2nd commit in that list being the important fix.
Everything is broken
Frankly, I don't understand how software can be so broken, even at such low
levels like this, where things don't even change much lately. This blog post is
about 3 bugs including a critical one, but along the way so much shit happened
such as the 2 packaging specific issues (Icecast, Valgrind). But GDB is also
unreliable, and I'm not even talking about the i915
GPU hangs I had every 15
minutes or so (for totally unrelated reasons). And after all of this is fixed
and I think I can listen to my stream, I realize there is now a bug in the
ogg
demuxer of my player, causing a
huge CPU and memory consumption after a few minutes of listening to the stream.
I haven't made an exhaustive list, but during my 2 days of investigation, I got
hit by more bugs than my hands can count.
The amount of effort I have to make just to be able to listen to my music is not humanly acceptable, and it's no surprise so many people eat proprietary garbage software on a daily basis. And yes I know, it's my fault for using a bleeding edge distribution, and C code is full of memory issues, Rust is the best, blablabla. I still believe we could do much better in the current state. Testing software and being held to higher standards don't need to be reserved to modern languages and stable distributions. The core issue is how we break this mentality of "bugs don't matter because we can just fix them quickly".
To be honest, I've lost all hopes for things to ever get better in any way in this world, but I still won't surrender to making things even worse. See you soon for another blog post.