Finding specific multimedia samples

When working on multimedia or image processing projects such as FFmpeg, we tend to accumulate a lot of samples over time. Today I have around 60G of samples, in addition to the FFmpeg test suite (1.1G). These samples are a mess because, just like music, there is no good way to classify them. So it's basically a melting pot of files sorted into arbitrary, overlapping categories.

Past a certain size, it's hard to find specific samples according to arbitrary criteria. You may want to find media files with streams of certain codecs, of certain durations, with special time bases, or really any kind of property or combination of properties.

Metadata

The first step is obviously to identify the properties of each file. For that, we will rely on ffprobe. Here is what its JSON output (-of json -show_streams -show_format) typically provides for a given file:

{
    "streams": [
        {
            "index": 0,
            "codec_name": "012v",
            "codec_long_name": "Uncompressed 4:2:2 10-bit",
            "codec_type": "video",
            "codec_time_base": "1/10",
            "codec_tag_string": "012v",
            "codec_tag": "0x76323130",
            "width": 316,
            "height": 240,
            "coded_width": 316,
            "coded_height": 240,
            "closed_captions": 0,
            "has_b_frames": 0,
            "pix_fmt": "yuv422p16le",
            "level": -99,
            "refs": 1,
            "r_frame_rate": "10/1",
            "avg_frame_rate": "10/1",
            "time_base": "1/10",
            "start_pts": 0,
            "start_time": "0.000000",
            "duration_ts": 1,
            "duration": "0.100000",
            "bits_per_raw_sample": "10",
            "nb_frames": "1",
            "disposition": {
                "default": 0,
                "dub": 0,
                "original": 0,
                "comment": 0,
                "lyrics": 0,
                "karaoke": 0,
                "forced": 0,
                "hearing_impaired": 0,
                "visual_impaired": 0,
                "clean_effects": 0,
                "attached_pic": 0,
                "timed_thumbnails": 0
            }
        }
    ],
    "format": {
        "filename": "/home/ux/fate-samples/012v/sample.avi",
        "nb_streams": 1,
        "nb_programs": 0,
        "format_name": "avi",
        "format_long_name": "AVI (Audio Video Interleaved)",
        "start_time": "0.000000",
        "duration": "0.100000",
        "size": "211756",
        "bit_rate": "16940480",
        "probe_score": 100
    }
}

Building a database

One important property of this metadata is that it is free-form. Typically, the number of streams varies from file to file, and arbitrary keys can pop up. This makes it a poor fit for the fixed schema of a SQL (Structured) database.

So for now, we're just going to aggregate all this information into a single JSON array. I present to you gen-samples-db.py the magnificent:

#!/usr/bin/env python3

import os
import os.path as op
import sys
import json
import subprocess
from multiprocessing.dummy import Pool


def _get_files(root):
    # os.walk() already descends into subdirectories, so no manual recursion
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield op.join(dirpath, filename)


def _probe_file(filepath):
    try:
        raw_data = subprocess.check_output(['ffprobe', '-v', '0', '-of', 'json',
                                            '-show_streams', '-show_format', filepath])
    # ffprobe is missing, or it could not probe the file (non-zero exit)
    except Exception:
        print(f'[✗] {filepath}')
        return None
    else:
        print(f'[✓] {filepath}')
        return json.loads(raw_data)


def _main(output, dirs):
    files = sorted(f for d in dirs for f in _get_files(d))
    print(f'processing {len(files)} files...')

    db = [result for result in Pool().imap(_probe_file, files) if result]

    print(f'writing {output} database')
    with open(output, 'w') as f:
        json.dump(db, f, indent=4)


if __name__ == '__main__':
    output = sys.argv[1]
    dirs = sys.argv[2:]
    _main(output, dirs)

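It uses a pool of threads, each of which runs ffprobe. Threads are enough here since the heavy lifting happens in the ffprobe child processes. Files that ffprobe cannot probe (checksums, text files) are marked with ✗ and skipped. The results are then aggregated and stored in the specified database. Building the db.json database looks like this: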

% ./gen-samples-db.py db.json ~/fate-samples/ ~/samples
processing 9063 files...
[✗] /home/ux/fate-samples/4xm/md5sum
[✓] /home/ux/fate-samples/012v/sample.avi
[✓] /home/ux/fate-samples/4xm/dracula.4xm
[✓] /home/ux/fate-samples/4xm/version2.4xm
[✓] /home/ux/fate-samples/4xm/TimeGatep01s01n01a02_2.4xm
[✓] /home/ux/fate-samples/8bps/full9iron-partial.mov
[✗] /home/ux/fate-samples/8bps/md5sum
[✓] /home/ux/fate-samples/4xm/version1.4xm
[✗] /home/ux/fate-samples/HEADER.txt
[✓] /home/ux/fate-samples/KMVC/LOGO1.AVI
[✓] /home/ux/fate-samples/CSCD/sample_video.avi
[✓] /home/ux/fate-samples/CCITT_fax/G4.TIF
...
[✓] /home/ux/samples/wiko-tests/VID_20130923_120657.3gp
[✓] /home/ux/samples/wiko-tests/VID_20130923_120152.3gp
[✓] /home/ux/samples/wiko-tests/VID_20130905_115141.3gp
writing db.json database

Querying the database

At this point, it's already usable. We can just open db.json with any text editor and search the buffer, or we can use jq to make "queries". Admittedly clumsy, it typically looks like this:

% cat db.json|jq '.[] | select(.streams[].codec_name == "dvb_subtitle") | .format.filename'
"/home/ux/fate-samples/wtv/law-and-order-partial.wtv"
"/home/ux/samples/BBC1HD_v101.ts"
"/home/ux/samples/SubTitleHD.ts"
"/home/ux/samples/dvbsub/dvbsubtest.ts"
"/home/ux/samples/dvbsub/fr-tv-dvd-sub-and-teletext.ts"
"/home/ux/samples/dvbsub/fr-tv-dvd-sub-and-teletext.ts"
"/home/ux/samples/dvbsub/tf1-000t.ts"
"/home/ux/samples/pps-sps-libav-merge/mpegts_with_dvbsubs.ts"
"/home/ux/samples/pps-sps-libav-merge/mpegts_with_dvbsubs.ts"
"/home/ux/samples/ticket4274-sample.ts"

This query gives me all the media files with a DVB subtitle stream (I actually needed that for some recent work). A filename listed several times just means the file contains several DVB subtitle streams.
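
For those more at ease with Python than jq, the same lookup can be sketched as follows. The helper name files_with_codec and the inline sample entries are made up for illustration; real data would come from loading db.json. Collecting the filenames into a set also gets rid of the repeated entries.

```python
def files_with_codec(db, codec_name):
    """Return sorted, de-duplicated filenames with at least one matching stream."""
    return sorted({
        entry['format']['filename']
        for entry in db
        if any(s.get('codec_name') == codec_name
               for s in entry.get('streams', []))
    })


# Made-up entries in the shape ffprobe produces; real data would come from
# json.load() on db.json.
db = [
    {'format': {'filename': 'a.ts'},
     'streams': [{'codec_name': 'dvb_subtitle'}, {'codec_name': 'dvb_subtitle'}]},
    {'format': {'filename': 'b.avi'},
     'streams': [{'codec_name': 'mpeg4'}]},
]
print(files_with_codec(db, 'dvb_subtitle'))
```

The .get() calls are there because stream keys are free-form: a stream without a codec_name is simply treated as a non-match.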

More examples

In bulk, here are a few more examples:

Identify media files with a negative start_time (yup, they exist, I have 5 of them here) and print both filename and start_time for these matches:

.[]
| .format
| select(.start_time != null and (.start_time | tonumber) < 0)
| {filename, start_time}

How many files have a SubRip subtitle stream:

[.[] | select(.streams[] | .codec_name == "subrip") | .format.filename]
| unique
| length

Do I have media files with multiple video streams?

.[] | {
    f: .format.filename,
    n: ([.streams[] | select(.codec_type == "video")] | length)
} | select(.n > 1)

Find portrait videos:

.[] | {
    f: .format.filename,
    s: [
        .streams[]
        | select(.width > 0 and .height > 0 and .width < .height)
        | {width, height, ratio:(.width/.height)}
    ]
} | select(.s | length > 0)

We can probably do smarter requests but that language is pretty new to me.
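
Since the keys are free-form, it helps to know which ones actually exist before writing a query. Here is a minimal sketch that counts how many streams carry each key; the function name stream_key_counts and the tiny inline database are assumptions for illustration, standing in for the content of db.json.

```python
from collections import Counter


def stream_key_counts(db):
    """Count in how many streams each key appears, across the whole database."""
    counts = Counter()
    for entry in db:
        for stream in entry.get('streams', []):
            counts.update(stream.keys())
    return counts


# Tiny made-up database standing in for the content of db.json
db = [
    {'streams': [{'codec_name': 'h264', 'width': 1920, 'height': 1080},
                 {'codec_name': 'aac', 'sample_rate': '48000'}]},
    {'streams': [{'codec_name': 'subrip'}]},
]

# Most common keys first
for key, n in stream_key_counts(db).most_common():
    print(f'{n:6d} {key}')
```

Keys near the top of the list are safe to rely on in a query, while rare ones need a null check like the one used for start_time.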

A better interface

Obviously at this point, you're wondering how to make a better interface. First off, a fuzzy search with something like fzf would be nice for simple requests: the DVB subtitle example is a good one, I don't want to type more than "codec type dvb" as a query. We can probably make something not so complex with the shell, but I leave that as an exercise for the reader.
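
As a starting point, here is a rough sketch of one possible approach, in Python rather than shell: flatten each database entry into a single line that a fuzzy finder can search. The summarize function and the line layout (filename, a tab, then a few words per stream) are assumptions of mine, not an established format.

```python
import json
import sys


def summarize(entry):
    """Return 'filename<TAB>searchable words' for one ffprobe result."""
    fmt = entry.get('format', {})
    words = ['format', fmt.get('format_name', '?')]
    for stream in entry.get('streams', []):
        words += ['codec', 'type',
                  stream.get('codec_type', '?'),
                  stream.get('codec_name', '?')]
    return fmt.get('filename', '?') + '\t' + ' '.join(words)


def main(db_path):
    # Print one searchable line per file in the database
    with open(db_path) as f:
        db = json.load(f)
    for entry in db:
        print(summarize(entry))


# Only runs when called with exactly one argument: the path to the database
if __name__ == '__main__' and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved under a hypothetical name such as samples-index.py, its output can be piped into fzf (./samples-index.py db.json | fzf), where typing "codec type dvb" should narrow the list down to the files with a DVB subtitle stream.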

Also worth considering: a real document database like CouchDB may open the way to various improvements.

And then there is all the web shit universe which I'm sure provides all the crazy tools to make sexy web interfaces with fuzzy finding, but I'm not into masochism so I leave that for those into that kind of stuff.
