A small freedom area.

LAS Lossy Audio Spotter

Tue 09 Aug 2011

music, audio analysis, prog, ffmpeg, fft, av

FLAC is supposed to be a lossless codec. Actually it is, but the source might not be. Stupid people, or worse, evil ones, might pretend to release a lossless copy of a song while they've only transcoded some lossy MP3 to FLAC files.

This post explains the various steps I followed to write LAS, a tool able to give a guess about the original quality of the source. It is not 100% efficient, but it works for most of the cases. There are certainly better algorithms out there, this is just my simple attempt to solve the issue.

Observation

Let's take a random lossless FLAC sample, and transcode it to a MP3 in CBR:

% ffmpeg -i lossless.flac -ab 192k lossy.mp3

So what could we do to detect the fraud without the original copy? Well, we can do a simple observation in Audacity; these are the respective spectrum histograms (AnalyzePlot Spectrum):

Here is the original:

centerimg

And the MP3:

centerimg

It is obvious with those histograms that higher frequencies are never reached, so this is where we will try to operate.

The same scheme is observable with CBR 320k (highest available bit rate), and high/low VBR encodes; the no magnitude's land is just found at different frequencies.

We will start by reproducing those histograms and then analyze them.

How to proceed

This section will present the content of the main struct and will resume the execution flow.

First, we will use FFmpeg to decode the audio (because it's great, and it also implements FFT stuff). We will keep in the structure a pointer to the codec (will be needed at least to decode the audio packets, and access the number of channels):

AVCodecContext *codec;

The decoded data frame will be stored in a buffer:

uint8_t *audio_buf;

The samples (amplitudes) will be extracted from audio_buf in an appropriate buffer of 16-signed-bits precision. Note that we can not use audio_buf directly because samples must have a specific size before applying the Fast Fourier Transform, and audio_buf length is undefined. So the samples will be stored in:

int16_t *samples;

...and its size in bytes stored in:

int samples_bsize;

The amount of samples filled (in bytes) after decoding each packet is stored in:

int filled;

It will be used to know when to send the samples to the process function (flush).

We will also need some pointers for the FFT. Since we will apply a Real Discrete Fourier Transform, we need the appropriate context (still using the FFmpeg API):

RDFTContext *rdft;

With the following buffer as input (samples will be stored in that buffer before being transformed):

FFTSample *rdft_data;

This FFT will transform FFT_WINSIZE samples to FFT_WINSIZE/2 + 1 complex bins. FFT_WINSIZE is the size of each window; it's the N you know when doing FFT. And we will need to keep a counter of those processed windows, so we can average them later:

int win_count;

The magnitude will be tracked in that last buffer:

float fft[FFT_WINSIZE/2 + 1];

Given all those parameters, and window size of 512B, we will have:

#define FFT_NBITS   9
#define FFT_WINSIZE (1<<FFT_NBITS)

struct las {
    AVCodecContext *codec;
    uint8_t *audio_buf;
    int16_t *samples;
    int samples_bsize;
    int filled;
    RDFTContext *rdft;
    FFTSample *rdft_data;
    int win_count;
    float fft[FFT_WINSIZE/2 + 1];
};

Decoding the audio

Before decoding the audio file, let's initialize FFmpeg, load the file, find the first audio stream, open the codec, and check for the number of audio channels (we will only support up to 2 channels):

/* FFmpeg init */
av_register_all();
AVFormatContext *fmt_ctx = NULL;
if (avformat_open_input(&fmt_ctx, av[1], NULL, NULL) ||
    av_find_stream_info(fmt_ctx) < 0)
    errx(1, "unable to load %s", av[1]);
int stream_id = -1;
for (unsigned i = 0; i < fmt_ctx->nb_streams; i++) {
    if (fmt_ctx->streams[i]->codec->codec_type == AVMEDIA_TYPE_AUDIO) {
        stream_id = i;
        break;
    }
}
if (stream_id == -1)
    errx(1, "no audio stream found");
ctx.codec = fmt_ctx->streams[stream_id]->codec;
AVCodec *adec = avcodec_find_decoder(ctx.codec->codec_id);
if (!adec)
    errx(1, "unsupported codec");
avcodec_open(ctx.codec, adec);
if (ctx.codec->channels != 1 && ctx.codec->channels != 2)
    errx(1, "unsupported number of channels (%d)", ctx.codec->channels);

Now, let's see our own stuff:

/* LAS init */
precalc_hann();
ctx.samples_bsize = FFT_WINSIZE * ctx.codec->channels * sizeof(*ctx.samples);
ctx.samples       = av_malloc(ctx.samples_bsize);
ctx.rdft          = av_rdft_init(FFT_NBITS, DFT_R2C);
ctx.rdft_data     = av_malloc(FFT_WINSIZE * sizeof(*ctx.rdft_data));
ctx.audio_buf     = av_malloc(AVCODEC_MAX_AUDIO_FRAME_SIZE);

A few notes are needed here. precalc_hann() is just a small function used to initialize the Hann function (which is needed to get smooth FFT results) LUT:

static float hann[FFT_WINSIZE];

static void precalc_hann(void)
{
    for (int i = 0; i < FFT_WINSIZE; i++)
        hann[i] = .5f * (1 - cos(2*M_PI*i / (FFT_WINSIZE-1)));
}

The samples size (in bytes) is pretty straightforward; we need FFT_WINSIZE samples, times the number of channels (which will be downmix to mono channel later) times the size of one sample.

The FFmpeg RDFT is only able to perform FFT with buffer of size being a power of two, and this is expressed with a number of bits FFT_NBITS (1<<2=4, 1<<3=8, 1<<4=16, etc.). DFT_R2C means Discrete Fourier Transform from Real numbers to Complex numbers. It will transform a window of FFT_WINSIZE samples: rdft_data.

Last one is the audio_buf mentionned above already, which needs a size of AVCODEC_MAX_AUDIO_FRAME_SIZE. Note that we avoid stack allocation and prefer av_malloc() because of some data alignment requirements (FFmpeg API) and performance reasons.

We can now read the frames and process the audio packets:

AVPacket pkt;
while (av_read_frame(fmt_ctx, &pkt) >= 0)
    if (pkt.stream_index == stream_id && process_audio_pkt(&ctx, &pkt) < 0)
        warnx("error while processing audio packet");

Since the buffer might have some information left, we pad the buffer with zeros (we need FFT_WINSIZE samples, and the buffer have garbage/old data at the end) and flush it:

if (ctx.filled) { // flush buffer
    memset((uint8_t*)ctx.samples + ctx.filled, 0, ctx.samples_bsize - ctx.filled);
    process_samples(&ctx);
}

Here is the process_audio_pkt() verbatim:

static int process_audio_pkt(struct las *c, AVPacket *pkt)
{
    while (pkt->size > 0) {
        uint8_t *data = c->audio_buf;
        int data_size = AVCODEC_MAX_AUDIO_FRAME_SIZE;
        int len       = avcodec_decode_audio3(c->codec, (int16_t*)data, &data_size, pkt);
        if (len < 0) {
            pkt->size = 0;
            return -1;
        }
        pkt->data += len;
        pkt->size -= len;
        while (data_size > 0) {
            int needed = c->samples_bsize - c->filled; // in bytes
            int ncpy   = data_size >= needed ? needed : data_size;
            memcpy((uint8_t*)c->samples + c->filled, data, ncpy);
            c->filled += ncpy;
            if (c->filled != c->samples_bsize)
                break;
            process_samples(c);
            data      += ncpy;
            data_size -= ncpy;
        }
    }
    return 0;
}

It will basically decode the packet, fill our samples buffer, and call process_samples() when a buffer is ready.

Processing the audio data

Alright, now we are in process_samples() and have FFT_WINSIZE samples available for processing.

static void process_samples(struct las *c)
{
    c->win_count++;
    c->filled = 0;

So first, we need to downmix (to mono-channel), resample (from signed 16-bits integer to [0;1] floats), and apply the Hann window on the samples:

    // Resample (int16_t to float), downmix if necessary and apply hann window
    int16_t *s16 = c->samples;
    switch (c->codec->channels) {
    case 2:
        for (int i = 0; i < FFT_WINSIZE; i++)
            c->rdft_data[i] = (s16[i*2] + s16[i*2+1]) / (2*32768.f) * hann[i];
        break;
    case 1:
        for (int i = 0; i < FFT_WINSIZE; i++)
            c->rdft_data[i] = s16[i] / 32768.f * hann[i];
        break;
    }

And then apply FFT, compute the magnitude (not normalized, and without the square root) for each sample:

    // FFT
#define FFT_ASSIGN_VALUES(i, re, im) c->fft[i] += re*re + im*im
    float *bin = c->rdft_data;
    av_rdft_calc(c->rdft, bin);
    for (int i = 1; i < FFT_WINSIZE/2; i++)
        FFT_ASSIGN_VALUES(i, bin[i*2], bin[i*2+1]);
    FFT_ASSIGN_VALUES(0,             bin[0], 0);
    FFT_ASSIGN_VALUES(FFT_WINSIZE/2, bin[1], 0);
}

av_rdft_calc() processes FFT_WINSIZE samples and gives FFT_WINSIZE/2 + 1 bins. How is that possible in the input buffer since we have two bins for one sample?

Logically we would need a buffer of (FFT_WINSIZE/2+1)*2 = FFT_WINSIZE+2 bytes, and not FFT_WINSIZE. But actually, the first and last bins are real numbers, which means their imaginary parts are always zero: so the buffer is composed of couples of real and imaginary floats except for the two first ones which are the real parts of the first and last bins. The last two lines of code are for these extractions.

Display the histogram

It's time to check out the histogram. fft field now contains the sum of all the magnitudes for each frequency range. We assume the input is encoded at 44100Hz, so FFT_WINSIZE/2+1 entries in fft will correspond to the magnitudes from 0 to 44100/2=22050Hz.

In order to get a graph similar to what we got in the first place with audacity, we first have to average the values (divide each entry by the windows count), and scale it with FFT_WINSIZE. And then, we need to convert the magnitude in dB. Since there was no square root used on the magnitude, we won't use 20 * log10(x) but 10 * log10(x).

Here is the code used to print the magnitude of each frequencies to be piped to Gnuplot:

printf("set terminal png font tiny size 300,200\n"
       "set style data boxes\n"
       "set style fill solid\n"
       "set xrange [0:%f]\n"
       "set yrange [-120:0]\n"
       "set xlabel 'Frequency (kHz)'\n"
       "set ylabel 'Log magnitude (dB)'\n"
       "plot '-' using ($0/%d.*%f):1 notitle\n",
       44.1f/2.f, FFT_WINSIZE, 44.1f);
for (i = 0; i < sizeof(ctx.fft)/sizeof(*ctx.fft); i++) {
    float x = ctx.fft[i] / ctx.win_count / FFT_WINSIZE;
    ctx.fft[i] = 10 * log10(x ? x : -1e20);
    printf("%f\n", ctx.fft[i]);
}

Exploit the results

And now the fun. How can we figure out the higher frequency from the graph? Well, let's first do a pretty simple heuristic:

/* Find higher frequency used */
unsigned last = sizeof(ctx.fft)/sizeof(*ctx.fft) - 1;
for (i = last - 1; i > 0; i--)
    if (fabsf(ctx.fft[i] - ctx.fft[last])  > 1.5f ||
        fabsf(ctx.fft[i] - ctx.fft[i + 1]) > 3.f)
        break;

We take the last entry and check from the end (higher frequency) to the beginning (lower frequency) trying to detect a fluctuation. The error ranges of 1.5f and 3.f are chosen by trial and error, and seemed pretty efficient here with my samples.

I also added a second check: the huge fall lookup. The previous check does not always work out, and we need to look for something else. This second heuristic seems to cover most of the exceptions:

    /* Try to detect huge falls too */
#define THRESHOLD_FALL 5
    for (unsigned n = i - THRESHOLD_FALL; n; n--) {
        if (ctx.fft[n] - ctx.fft[n + THRESHOLD_FALL] > 25.f) {
            i = n;
            break;
        }
    }

After computing the higher frequency from i, we can check if it's "high enough" to be considered lossless:

/* Report lossy/lossless estimation in a Gnuplot comment */
int hfreq = i * 44100 / FFT_WINSIZE;
printf("# higher freq=%dHz → %s\n", hfreq, hfreq < 21000 ? "lossy" : "lossless");

Again, the 21000Hz value was obtained by trial and error.

Testing

The complete source is available on LAS webpage and includes a analyze.sh script which takes files and directories as argument, scans all the FLAC in it, and generates a all-in-one report.html page to summarize all the results:

centerimg

I tested the code with a few FLAC from different musical genres re-encoded with various high and low VBR and CBR MP3 profiles, and re-transcoded to FLAC. Here is the small script I used to generate those samples (can be pretty long to execute):

#!/bin/sh

for f in *.flac; do
    name=${f%.flac}
    for aq in 0 1 2 6 8;      do ffmpeg -i $f -aq ${aq}  ${name}-aq${aq}.mp3; done
    for ab in 64 128 192 320; do ffmpeg -i $f -ab ${ab}k ${name}-${ab}k.mp3;  done
done
for mp3 in *.mp3; do
    ffmpeg -i $mp3 ${mp3%.mp3}.flac;
done
rm -f *.mp3

Have no fear of perfection - you'll never reach it. S. Dali

The project is not mature enough, we might be able to do something simpler (I guess we can get rid of the dB conversion completely for instance), and more efficient (another algorithm, better chosen coefficient, etc.), but right now it seems to work pretty fine in all my tests.

Also, I did not test at all with other codecs like Vorbis, so there is certainly some path to improvement here.

I don't know if there are already stuff out there related to this research. If you have something on that matter, feel free to send it. Also, I'm all open for samples, better values, algorithms, various comments, etc.

index