When working on multimedia or image processing projects such as FFmpeg, we tend to accumulate a lot of samples over time. Today I have around 60G of samples, in addition to the FFmpeg test suite (1.1G). These samples are basically a mess, because just like music, there is no good way to classify them. So it's basically a melting pot of files sorted into arbitrary, overlapping categories.
Past a certain size, it becomes hard to find specific samples matching arbitrary criteria. You may want to find media files with streams of certain codecs, of certain durations, with special time bases, or really any property or combination of properties.
Metadata
The first step is obviously to identify the properties of each file. For that, we will rely on ffprobe. Here is what its JSON output typically provides for a given file:
{
    "streams": [
        {
            "index": 0,
            "codec_name": "012v",
            "codec_long_name": "Uncompressed 4:2:2 10-bit",
            "codec_type": "video",
            "codec_time_base": "1/10",
            "codec_tag_string": "012v",
            "codec_tag": "0x76323130",
            "width": 316,
            "height": 240,
            "coded_width": 316,
            "coded_height": 240,
            "closed_captions": 0,
            "has_b_frames": 0,
            "pix_fmt": "yuv422p16le",
            "level": -99,
            "refs": 1,
            "r_frame_rate": "10/1",
            "avg_frame_rate": "10/1",
            "time_base": "1/10",
            "start_pts": 0,
            "start_time": "0.000000",
            "duration_ts": 1,
            "duration": "0.100000",
            "bits_per_raw_sample": "10",
            "nb_frames": "1",
            "disposition": {
                "default": 0,
                "dub": 0,
                "original": 0,
                "comment": 0,
                "lyrics": 0,
                "karaoke": 0,
                "forced": 0,
                "hearing_impaired": 0,
                "visual_impaired": 0,
                "clean_effects": 0,
                "attached_pic": 0,
                "timed_thumbnails": 0
            }
        }
    ],
    "format": {
        "filename": "/home/ux/fate-samples/012v/sample.avi",
        "nb_streams": 1,
        "nb_programs": 0,
        "format_name": "avi",
        "format_long_name": "AVI (Audio Video Interleaved)",
        "start_time": "0.000000",
        "duration": "0.100000",
        "size": "211756",
        "bit_rate": "16940480",
        "probe_score": 100
    }
}
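Each probe result is just a nested JSON object, so once parsed, the interesting fields are a lookup away. A minimal sketch (the probe dict below is a trimmed-down, made-up version of the output above):

```python
import json

# Trimmed-down ffprobe output, in the same shape as shown above
probe = json.loads('''{
    "streams": [{"index": 0, "codec_name": "012v", "codec_type": "video",
                 "width": 316, "height": 240}],
    "format": {"filename": "sample.avi", "duration": "0.100000"}
}''')

# One (codec_type, codec_name) pair per stream
codecs = [(s["codec_type"], s["codec_name"]) for s in probe["streams"]]
print(codecs)  # [('video', '012v')]

# Note that numeric fields such as duration are strings in the ffprobe
# JSON output; convert them before comparing
duration = float(probe["format"]["duration"])
print(duration)  # 0.1
```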
Building a database
One important property of this metadata is that it is free-form. Typically, the number of streams is variable, and random keys can pop up. This makes it a poor fit for a SQL (Structured) database.
So for now, we're just going to aggregate all this information into a single JSON array. I present you gen-samples-db.py the magnificent:
#!/usr/bin/env python3

import os
import os.path as op
import sys
import json
import subprocess
from multiprocessing.dummy import Pool


def _get_files(root):
    # os.walk() already recurses into sub-directories
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            yield op.join(dirpath, filename)


def _probe_file(filepath):
    try:
        raw_data = subprocess.check_output(['ffprobe', '-v', '0', '-of', 'json',
                                            '-show_streams', '-show_format',
                                            filepath])
    except Exception:
        print(f'[✗] {filepath}')
        return None
    else:
        print(f'[✓] {filepath}')
        return json.loads(raw_data)


def _main(output, dirs):
    files = sorted(f for d in dirs for f in _get_files(d))
    print(f'processing {len(files)} files...')
    db = [result for result in Pool().imap(_probe_file, files) if result]
    print(f'writing {output} database')
    with open(output, 'w') as f:
        f.write(json.dumps(db, indent=4))


if __name__ == '__main__':
    output = sys.argv[1]
    dirs = sys.argv[2:]
    _main(output, dirs)
It uses a pool of threads, each of them executing ffprobe. The whole thing is then aggregated and stored in the specified database. Building the db.json database looks like this:
% ./gen-samples-db.py db.json ~/fate-samples/ ~/samples
processing 9063 files...
[✗] /home/ux/fate-samples/4xm/md5sum
[✓] /home/ux/fate-samples/012v/sample.avi
[✓] /home/ux/fate-samples/4xm/dracula.4xm
[✓] /home/ux/fate-samples/4xm/version2.4xm
[✓] /home/ux/fate-samples/4xm/TimeGatep01s01n01a02_2.4xm
[✓] /home/ux/fate-samples/8bps/full9iron-partial.mov
[✗] /home/ux/fate-samples/8bps/md5sum
[✓] /home/ux/fate-samples/4xm/version1.4xm
[✗] /home/ux/fate-samples/HEADER.txt
[✓] /home/ux/fate-samples/KMVC/LOGO1.AVI
[✓] /home/ux/fate-samples/CSCD/sample_video.avi
[✓] /home/ux/fate-samples/CCITT_fax/G4.TIF
...
[✓] /home/ux/samples/wiko-tests/VID_20130923_120657.3gp
[✓] /home/ux/samples/wiko-tests/VID_20130923_120152.3gp
[✓] /home/ux/samples/wiko-tests/VID_20130905_115141.3gp
writing db.json database
Querying the database
At this point, it's already usable. We can just open db.json in any text editor and search through the buffer, or we can use jq to make "queries". Admittedly clumsy, it typically looks like this:
% cat db.json | jq '.[] | select(.streams[].codec_name == "dvb_subtitle") | .format.filename'
"/home/ux/fate-samples/wtv/law-and-order-partial.wtv"
"/home/ux/samples/BBC1HD_v101.ts"
"/home/ux/samples/SubTitleHD.ts"
"/home/ux/samples/dvbsub/dvbsubtest.ts"
"/home/ux/samples/dvbsub/fr-tv-dvd-sub-and-teletext.ts"
"/home/ux/samples/dvbsub/fr-tv-dvd-sub-and-teletext.ts"
"/home/ux/samples/dvbsub/tf1-000t.ts"
"/home/ux/samples/pps-sps-libav-merge/mpegts_with_dvbsubs.ts"
"/home/ux/samples/pps-sps-libav-merge/mpegts_with_dvbsubs.ts"
"/home/ux/samples/ticket4274-sample.ts"
This query gives me all the media files with a DVB subtitle stream (I actually needed that in some recent work). The duplicated filenames just mean there are multiple DVB subtitle streams in the same file.
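If jq is not at hand, the same kind of query is easy to express in Python against the db.json array. A rough equivalent of the filter above (the miniature database here is made up for illustration):

```python
import json

# Hypothetical miniature database, in the same shape as db.json
db = json.loads('''[
    {"streams": [{"codec_name": "h264"}, {"codec_name": "dvb_subtitle"}],
     "format": {"filename": "a.ts"}},
    {"streams": [{"codec_name": "aac"}],
     "format": {"filename": "b.mp4"}}
]''')

# Like the jq filter, this yields the filename once per matching stream
matches = [entry["format"]["filename"]
           for entry in db
           for stream in entry["streams"]
           if stream.get("codec_name") == "dvb_subtitle"]
print(matches)  # ['a.ts']
```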
More examples
In bulk, here are a few more examples:
Identify media files with a negative start_time (yup, that exists, I have 5 of them here) and print both filename and start_time for the matches:
.[]
| .format
| select(.start_time != null and (.start_time | tonumber) < 0)
| {filename, start_time}
How many files have a SubRip subtitle stream:
[.[] | select(.streams[] | .codec_name == "subrip") | .format.filename]
| unique
| length
Do I have media files with multiple video streams?
.[] | {
f: .format.filename,
n: ([.streams[] | select(.codec_type == "video")] | length)
} | select(.n > 1)
Find portrait videos:
.[] | {
f: .format.filename,
s: [
.streams[]
| select(.width > 0 and .height > 0 and .width < .height)
| {width, height, ratio:(.width/.height)}
]
} | select(.s | length > 0)
We can probably write smarter queries, but this language is pretty new to me.
A better interface
Obviously at this point, you're wondering how to make a better interface. First off, a fuzzy search with something like fzf would be nice for simple requests: the DVB subtitle example is a good one, I don't want to type more than "codec type dvb" as a query. We can probably build something not too complex with the shell, but I leave that as an exercise for the reader.
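To give an idea of what such an interface would search over, here is a crude, non-fuzzy sketch in Python: it builds a one-line summary per file (filename plus stream types and codecs) and does plain substring matching on it. Real fuzzy matching is what fzf is for, and the tiny database below is made up for illustration:

```python
def summarize(entry):
    """One searchable line per file: filename plus stream types/codecs."""
    streams = " ".join(f'{s.get("codec_type", "?")}:{s.get("codec_name", "?")}'
                       for s in entry["streams"])
    return f'{entry["format"]["filename"]} {streams}'

def search(db, query):
    """Keep entries whose summary contains every word of the query."""
    words = query.lower().split()
    return [e["format"]["filename"] for e in db
            if all(w in summarize(e).lower() for w in words)]

# Tiny made-up database, in the same shape as db.json
db = [{"streams": [{"codec_type": "subtitle", "codec_name": "dvb_subtitle"}],
       "format": {"filename": "tv.ts"}},
      {"streams": [{"codec_type": "video", "codec_name": "h264"}],
       "format": {"filename": "clip.mp4"}}]

print(search(db, "subtitle dvb"))  # ['tv.ts']
```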
Also to be considered, a real database like CouchDB may open ways for various improvements.
And then there is the whole web universe, which I'm sure provides all the crazy tools to make sexy web interfaces with fuzzy finding, but I'm not into masochism, so I leave that to those who are into that kind of stuff.