A small freedom area.

The music classifying nightmare

Sat 08 Sep 2012

music, thoughts, av

While I accumulated music files over the years, the main issue I came across wasn't how to avoid getting caught or where to find the data, but how to classify and organize the whole tree of music files. There is actually no such thing as a perfect naming/tagging convention, and this blog post has only one goal: to share with you my nightmare of trying to find one anyway.

If you have ever considered organizing your music and don't know where to start, I hope this will discourage you, so you won't waste the amount of time I did over the last few years.

However, if you are trying to create a nice universal software solution for this problem, you might be interested in the different issues I faced (but well, you will fail anyway).

Redefining the reasons, and a bit of (my) history

The following list (chronologically ordered) gives the reasons that made me change the way I sort my music collection:

  1. "I need to quickly find the music I want to listen to"
  2. "I want to be able to really identify a song, out of its place"
  3. "I want all the music of the universe and be a better provider than commercial ones"
  4. "I don't care anymore"

The first point led me to put all the MP3s in one single directory, with a simple pattern: <artist> - <title>.mp3. That worked, for a while. Then I started to have a lot of files, and thus started creating directories for artists, albums, and eventually a bazaar directory (the name was actually more crude). At that time, I wasn't aware of good systems making use of the metadata.

The second point was all about extracting one song out of the tree while still keeping a full reference (for instance in order to share the file with friends, or copy it to an MP3 player). So I started renaming files like <tracknum> - <title>.mp3 into something like <artist> - <album> - <tracknum> - <title>.mp3. The files were in <artist>/<album>. And everything was fine, until I hit several kinds of music such as rap, soundtracks, or even classical music.
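The renaming described above can be sketched in a few lines of Python. This is only a minimal sketch, assuming the files already sit in <artist>/<album>/ and are named <tracknum> - <title>.mp3; the function name `self_describing_name` is hypothetical.

```python
from pathlib import PurePosixPath


def self_describing_name(path: str) -> str:
    """Turn '<artist>/<album>/<tracknum> - <title>.mp3' into
    '<artist> - <album> - <tracknum> - <title>.mp3'.

    Assumes the last two directories are the artist and the album.
    """
    p = PurePosixPath(path)
    artist, album = p.parts[-3], p.parts[-2]
    return f"{artist} - {album} - {p.name}"


print(self_describing_name("Artist/Album/01 - Title.mp3"))
```

The file keeps its full reference once copied out of the tree, at the price of repeating the artist and album in every name.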

This is where the third point came into play. The madness began when I started thinking about how I could organize my music so it could store and provide to everyone any music ever made. The first step was to get the most complete collection possible, so I took every orphan MP3 I had, one by one, and grabbed the discography of the artist. The second step was to fix all the tags and have a homogeneous naming convention. So I started playing with EasyTAG a lot. This was a nightmare.


And then I reached the terminal phase of this madness, where I just stopped caring, for various reasons. Here is a short list:

I strongly advise you to pick a few reasons too if you don't want to become insane.

So what are the issues?

Lossless/lossy

As I mentioned earlier, MP3 (or any lossy codec) might not be the perfect solution: this is not what you would get if you had the original CD. Having lossless data instead of lossy data avoids the need to re-download the same music several times in order to get a higher quality (or any data closer to the original release). Also, there is no point in having a perfect organization system if it's just to sort junk files.

Still, some music is only distributed as MP3 and never on CD; this happens often with independent artists selling or sharing their music online, so you will likely end up with different formats in your collection.

Formats/tags

An MP3 is just a file of stacked MPEG audio frames, generally with ID3v2 tags at the start of the file, or ID3v1 tags at the end, sometimes both, and sometimes well... nothing. This tag system has several limitations. The main one is not having a method to handle multiple artists, which is a real problem. You have to select a separator and stick with it (Comma? Semicolon? Slash? Something else?).
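To see which of those situations a given file is in, here is a minimal sketch that only looks at the two well-known markers: an ID3v2 tag starts the file with the bytes "ID3", and an ID3v1 tag is a fixed 128-byte block at the very end starting with "TAG". The function name `detect_id3` is made up for this example, and it does not parse the tags themselves.

```python
import io


def detect_id3(stream) -> dict:
    """Report which ID3 tag versions a seekable binary stream carries."""
    stream.seek(0)
    has_v2 = stream.read(3) == b"ID3"  # ID3v2 header marker
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    has_v1 = False
    if size >= 128:
        stream.seek(size - 128)  # ID3v1 occupies the last 128 bytes
        has_v1 = stream.read(3) == b"TAG"
    return {"id3v1": has_v1, "id3v2": has_v2}


# Fake in-memory "file" carrying both markers, for demonstration only.
fake = io.BytesIO(b"ID3" + b"\x00" * 200 + b"TAG" + b"\x00" * 125)
print(detect_id3(fake))
```

With a real file you would pass `open("song.mp3", "rb")` instead of the in-memory stream.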

Also, different audio formats mean different tag systems, so you can't tag a FLAC file the same way you would tag an MP3. If you want to correlate the metadata of different tracks, you need some kind of common mapping, and heuristics (to split the artists from ID3, for example).
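A rough sketch of such a common mapping, assuming the tags have already been read into a plain dict by whatever library you use: the frame IDs are the standard ID3v2.4 ones (ID3v2.3 stores the year in TYER rather than TDRC), the right-hand names are the usual Vorbis comment fields used by FLAC and Ogg, and the separator list and the `normalize_id3` helper are assumptions made for this example.

```python
import re

# ID3v2.4 frame IDs mapped to the Vorbis comment field names used by FLAC/Ogg.
ID3_TO_VORBIS = {
    "TPE1": "ARTIST",
    "TALB": "ALBUM",
    "TIT2": "TITLE",
    "TRCK": "TRACKNUMBER",
    "TDRC": "DATE",
    "TCON": "GENRE",
}

# Assumed separators; a real collection will need its own list.
ARTIST_SEPARATORS = re.compile(r"\s*[;/,]\s*")


def normalize_id3(frames: dict) -> dict:
    """Convert a {frame_id: text} dict into {FIELD: [values]}."""
    common = {}
    for frame_id, text in frames.items():
        field = ID3_TO_VORBIS.get(frame_id)
        if field is None:
            continue  # frame not covered by this small mapping
        if field == "ARTIST":
            common[field] = [a for a in ARTIST_SEPARATORS.split(text) if a]
        else:
            common[field] = [text]
    return common


print(normalize_id3({"TPE1": "A; B / C", "TIT2": "Some title"}))
```

The heuristic is as fragile as you would expect: an artist such as "AC/DC" gets split in two, which is exactly the kind of problem that makes the separator question painful.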

The multi-language problem

Fortunately, not all music comes from the USA. But this obviously means some localisation issues. Generally you have an international title, so you can somehow manage to keep only ASCII data, but this is not always the case, especially with marginal tracks, so you need to keep the original titles. But how do we know if the albums of a given artist will always be internationalized (you could grab one or two internationally licensed and renamed albums and later find an old local mixtape)? The best solution here is to keep both the original name and the international one (if available) for consistency; you can't have two different names for the same artist because of this, or else you might lose some references.

Look at Miyuki Nakajima for instance. How would you name the artist (in directory and tags)? Miyuki Nakajima (中島 みゆき)? More issues come out now, only because of some language issues:

This list is only a subset of the problems you will have with Japanese material. What about Russian or traditional Chinese albums?


Extra info in the titles

Sometimes you get various extra information about the album or the track, such as: Cover, Bonus CD, Feat./Ft./Featuring, Remix/rmx, Single, Edit, Radio edit, Instrumental, Karaoke, Skit, ... And thus:

If you just keep the original naming the artist gave, and want that information to be exploitable by software, you will need a lot of complex heuristics to match all the different conventions. Extracting this information can lead to various issues, especially when splitting the artist names.
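To give an idea of what such heuristics look like, here is a small and deliberately incomplete sketch that peels trailing parenthesized annotations off a track title. The patterns and the `parse_title` function are hypothetical; they only cover a few of the conventions listed above.

```python
import re

# Hypothetical, deliberately incomplete patterns for trailing "(...)" notes.
FEAT = re.compile(r"^(?:feat\.?|ft\.?|featuring)\s+(.+)$", re.IGNORECASE)
FLAGS = re.compile(
    r"\b(remix|rmx|radio edit|edit|instrumental|karaoke|live)\b", re.IGNORECASE
)
TRAILING_PARENS = re.compile(r"\s*\(([^()]*)\)\s*$")


def parse_title(raw: str) -> dict:
    """Split a raw title into the bare title, a featuring credit and flags."""
    info = {"title": raw.strip(), "featuring": None, "flags": []}
    while True:
        match = TRAILING_PARENS.search(info["title"])
        if not match:
            break
        note = match.group(1).strip()
        feat = FEAT.match(note)
        if feat:
            info["featuring"] = feat.group(1)
        elif FLAGS.search(note):
            info["flags"].append(note)
        else:
            break  # unknown parenthesis: assume it belongs to the title
        info["title"] = info["title"][: match.start()]
    return info


print(parse_title("track foo (feat. X) (Y Remix)"))
print(parse_title("track foo (foobar mix)"))
```

The second call leaves "(foobar mix)" inside the title because "mix" is not in the pattern list, which shows how quickly the list of special cases grows.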

Album types

Album annotations sometimes refer to an album type or category: Albums, Compilations, Singles, ED, Soundtracks, Anthologies, Lives, Remixes, Bootlegs, Mixtapes, ...

Where is that supposed to be stored in the tags? For the file system, you could go with a pattern like <artist>/<album-style>/<year> - <album>/<tracknum>. <title>.
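As an illustration of that file system pattern, here is a minimal sketch that builds the path from its components. The helper names, the zero-padded track number and the file extension are assumptions; the pattern in the text does not specify them.

```python
from pathlib import PurePosixPath


def sanitize(component: str) -> str:
    """Replace path separators so a name cannot create extra directories."""
    return component.replace("/", "_").strip()


def album_track_path(artist, album_type, year, album, tracknum, title, ext="flac"):
    """Build '<artist>/<album-style>/<year> - <album>/<tracknum>. <title>.<ext>'."""
    return PurePosixPath(
        sanitize(artist),
        sanitize(album_type),
        f"{year} - {sanitize(album)}",
        f"{tracknum:02d}. {sanitize(title)}.{ext}",
    )


print(album_track_path("AC/DC", "Albums", 1980, "Back in Black", 1, "Hells Bells"))
```

Note that the sanitizing step already makes the directory name differ from the artist tag, so the file system and the tags start to diverge from the very first slash.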

This can sometimes get mixed up with the track-specific information (see previous section), like a live track from a soundtrack being released as a single.

Multiple artists

How are we supposed to deal with multiple artists? What is supposed to be the separator? You can use the tag format's built-in support if available (Ogg tags support this, for example), or you need to define a separator as mentioned earlier.

If the album is made by various artists, you have different patterns; here are some random ones I came across:

Compilations/Rap/crazy mashups:

1. A, B, C, D
2. B, D, X, Y, Z
3. A, X, Y
...

Here you generally set the artist to Various artists, or you could also have a real artist list in the tags. But what about the artist directory? Various artists can be a solution here; that's generally what I do.
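The directory decision can be sketched as a simple rule: if every track has the same set of artists, use it, otherwise fall back to Various artists. This is only one possible rule, and the function name `directory_artist` is made up for the example.

```python
def directory_artist(track_artists):
    """Pick the artist directory for an album.

    track_artists holds one list of artist names per track.
    """
    distinct = {tuple(sorted(artists)) for artists in track_artists}
    if len(distinct) == 1:
        return ", ".join(distinct.pop())
    return "Various artists"


compilation = [["A", "B", "C", "D"], ["B", "D", "X", "Y", "Z"], ["A", "X", "Y"]]
print(directory_artist(compilation))
print(directory_artist([["Main"], ["Main"]]))
```

The real artist list stays in the tags; only the directory name is collapsed.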

Soundtrack common pattern:

1.  X
2.  Main
3.  Main
4.  Main
...
10. Main
11. X

Soundtracks are really horrible. I personally have a dedicated directory, because I'm actually using the file system (I'm much more confident with it than with any tag system). You generally have a main artist and one or two special tracks from a different artist. The common way to handle that situation is to use the main artist's name to classify the whole album, sometimes even for the tracks where they aren't the author. I don't want to lose that "grain" of information, so I keep the real reference names in the tags, and find the whole soundtrack using the file system (my Soundtracks/ directory) instead of relying on the main author's name. There is a section dedicated to soundtracks later in this post, with more issues about them.

Rap case 2:

1. Main (feat. X)
2. Main
3. Main (feat. Y)
4. Main (feat. Z and X)
...

Here the featured artists are "secondary" most of the time: basically, they don't get the same amount of time as the main artist. The featuring might even be just a sample in the background. So it might not be wise to split the artist list like in the first case (or they will be at the same "level", which is wrong). But sometimes they share the time roughly 50/50, or worse, the featured artist might actually monopolize the whole track. You need to know the artists and songs pretty well to be able to sort this out.

Singles common pattern:

1. Main - track foo
2. Main - track foo (Remixed by X)
3. Main - track foo (Y Remix)
4. Main - track foo (foobar mix)
5. Main - track foo (radio edit)

This runs into various issues I talked about previously (naming convention and authorship). You will also note there is sometimes a distinction between the performer and the composer (hi classical music lovers!), and you might want to keep this information in some cases (even if it is likely to be ignored by most players).

Artist aliases

This is yet another thing I had a really hard time dealing with. Let's take Aphex Twin for instance, which is one of the most insane cases:

Aphex Twin has also recorded music under the aliases AFX, Blue Calx, Bradley Strider, Caustic Window, DJ Smojphace, GAK, Martin Tressider, Polygon Window, Power-Pill, Prichard D. Jams, Q-Chastic, Tahnaiya Russell, The Dice Man, Soit-P.P., and speculatively The Tuss.

-- Wikipedia

Oh, and his real name is Richard David James. What are you supposed to use for the directory and file names? His name? The most common nickname? Both? One file system solution is to have symbolic links (do you link Richard David James to Aphex Twin, or vice versa?). For tags, if you don't want to lose information, this is another story…
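The symbolic link idea can be sketched with the standard library. The choice of "Aphex Twin" as the real directory is arbitrary here (the question of which name should be the canonical one stays open), and `link_aliases` is a hypothetical helper; the example works in a temporary directory so it touches nothing else.

```python
import os
import tempfile


def link_aliases(root, canonical, aliases):
    """Create '<root>/<alias>' symlinks pointing at '<root>/<canonical>'."""
    os.makedirs(os.path.join(root, canonical), exist_ok=True)
    for alias in aliases:
        link = os.path.join(root, alias)
        if not os.path.lexists(link):
            # Relative target, so the tree survives being moved as a whole.
            os.symlink(canonical, link, target_is_directory=True)


root = tempfile.mkdtemp()
link_aliases(root, "Aphex Twin", ["AFX", "Polygon Window", "Richard David James"])
print(sorted(os.listdir(root)))
```

This only solves browsing by name; the tags inside the files still have to pick one spelling.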

Other examples: SNoW, M/Matthieu Chedid

Soundtracks

I will start this section by selecting one of the worst cases I have seen in soundtracks: Compilation album by Anna Tsuchiya Inspi' Nana (Black Stones) Olivia Inspi' Reira (Trapnest)

The real artists are Anna Tsuchiya (see issues with romanization I mentioned above by the way) and Olivia Lufkin. Nana is the anime's name, and Black Stones and Trapnest are the bands in the anime. So far, it just looks like a lot of information, but it's more like multiple different names. Olivia is for example spelled in a few different ways in the albums:

The second alias is pretty interesting, because it is common for Japanese (nick)names to use capitalization, while a lot of "tag-oriented music viewers" apply some kind of generic formatting that changes this name into "Olivia" (like what you get with the .title() method in Python).
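The effect is easy to reproduce. The `gentle_title` helper below is a hypothetical workaround, assuming that a name written entirely in capitals is a deliberate spelling:

```python
def gentle_title(name: str) -> str:
    """Title-case a name unless it is written entirely in capitals."""
    return name if name.isupper() else name.title()


print("OLIVIA".title())               # Olivia (the stylized spelling is lost)
print(gentle_title("OLIVIA"))         # OLIVIA
print(gentle_title("anna tsuchiya"))  # Anna Tsuchiya
```

It still fails on mixed-case names: "SNoW" comes out as "Snow", so even this small rule needs exceptions.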

What if you now also want to store some other songs from Olivia: where and how should the files be located and tagged so we can easily find everything she did (and also get the related music if she worked with different artists in the same scope)?

Also, what is the genre of the soundtracks? Soundtrack? Anime Soundtrack? Anime? Original Soundtrack? OST? Rock? I'll come back to the genre issue later.


What should we do with the "soundtrack" mention in the title by the way? Various solutions with their own problems for this:

You will also note various separators other than the dash '-' in the track titles, like tildes '~' or special dots (common Japanese "markups"), which are actually not really part of the title. Or cases where the title contains the artist's nickname, but you have to store the real artist name (in the Artist tag for instance), a different name for the lyrics, and also a different name for the arranger(s), all of which, by the way, are subject to the same artist-splitting issue.

Assuming you managed to tag everything in a somewhat consistent way, how about the file system? Keep in mind you might want to group all the albums under an arbitrary pack name such as "Nana" (like you would put all the "Lord of the Rings" soundtracks under a directory of that name), which might not even appear in the tags. You will certainly come up with an arbitrary path such as: Soundtracks/Animes/Nana/<album name>/<tracknum>. <title>

This is a somewhat intuitive way of storing most of the "important" information when looking for a song, but it certainly contradicts the layout used for the rest of the music.

Sometimes, you hit yet another problem, like with the Arrietty soundtrack, where you have a US (international) version, a French version (with additional tracks, somehow flagged as "premium"), and maybe a Japanese one I'm not aware of.

Genre

You may have noticed that I used "Soundtracks" as the root directory in the previous example, which sounds somewhat like a musical genre. If you want to do that for all the artists, you just can't, for the simple reason that genres are almost song-specific, pretty subjective, generally mixed, and often never defined by the author. What you can do, on the other hand, is to have genre tags (like arbitrary annotations), if your system allows it. But this doesn't solve the file system issue.

Dates

It is sometimes important to keep track of the date of an album, for instance to follow the chronological evolution of the artist. But if the artist released multiple albums in the same year, you will need to stick to a convention such as:

But sometimes, the "same" album has been released at different times...

Multiple editions

A lot of albums are released in different versions; the album Brothers in Arms by Dire Straits is one example.

One of the issues is that you can't easily keep up with the following file system format: <artist>/<year> - <album>/

And you will have a lot of tag collisions. You need to find a way of differentiating these albums and their tracks.
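The collision itself is easy to demonstrate. In the sketch below, each release is a plain dict and the edition labels are made up for illustration; `find_path_collisions` is a hypothetical helper that only detects the clash and does not decide how to resolve it.

```python
from collections import defaultdict


def find_path_collisions(releases):
    """Group releases by their '<artist>/<year> - <album>' directory.

    Returns only the directories claimed by more than one release.
    """
    by_path = defaultdict(list)
    for rel in releases:
        path = f"{rel['artist']}/{rel['year']} - {rel['album']}"
        by_path[path].append(rel["edition"])
    return {path: eds for path, eds in by_path.items() if len(eds) > 1}


releases = [
    {"artist": "Artist", "year": 1985, "album": "Album", "edition": "edition A"},
    {"artist": "Artist", "year": 1985, "album": "Album", "edition": "edition B"},
    {"artist": "Artist", "year": 1991, "album": "Other", "edition": "edition A"},
]
print(find_path_collisions(releases))
```

Any fix has to put the edition somewhere, either in the directory name or in a tag, and most tag formats have no agreed field for it.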

Random issues

Here are some remaining issues I didn't have the courage to elaborate on:


Potential solutions

Filesystem

Some potential solutions exist. The first one is to have a virtual file system (with FUSE for instance), dedicated to music. For instance, the path genre/soundtracks/animes/Nana would point to the same data as: artist/lufkin/aliases/OLIVIA/album/Nana
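This is not a FUSE implementation, only a sketch of the lookup such a virtual file system would need: the path is read as a chain of facet/value pairs and resolved against an index. The index layout, the facet names and the `resolve` function are all assumptions made for this example.

```python
def resolve(path, tracks):
    """Return the files matching every facet/value pair in the path."""
    parts = [p for p in path.split("/") if p]
    if len(parts) % 2:
        raise ValueError("expected facet/value pairs")
    wanted = list(zip(parts[0::2], parts[1::2]))
    return [
        t["file"]
        for t in tracks
        if all(value in t["facets"].get(facet, ()) for facet, value in wanted)
    ]


# Hypothetical index: one entry per file, each facet holding a set of values.
index = [
    {
        "file": "a.flac",
        "facets": {
            "genre": {"soundtracks"},
            "animes": {"Nana"},
            "artist": {"lufkin"},
            "aliases": {"OLIVIA"},
            "album": {"Nana"},
        },
    },
    {"file": "b.flac", "facets": {"genre": {"rock"}, "artist": {"someone"}}},
]

print(resolve("genre/soundtracks/animes/Nana", index))
print(resolve("artist/lufkin/aliases/OLIVIA/album/Nana", index))
```

Both paths lead to the same file, but someone still has to fill the index, which brings back every tagging problem described earlier.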

This solution may need a lot of thinking, and will likely hit the same issues as the tag system, and certainly a lot of others. If you are working on something similar to this, I'm interested.

Musical content retrieval

This is actually a more promising solution in my opinion. The goal here is to remove the textual content issue. Anita Lillie wrote a thesis on this: MusicBox. I encourage you to read it if you are interested in the topic.

I actually worked on this for two years as an experiment at my school with a few fellows: we basically re-implemented what Anita proposed, put the analysis in place ourselves (instead of using an existing engine like she did), and implemented a way of communicating with various players instead of taking on the burden of proposing yet another player. There is nothing really releasable, so here is some feedback on that experience if you want to do something like this:


Since this is going slightly off topic, I won't go into detail here; I'm just mentioning the idea.

A fine hack

Despite all my attempts to get a well classified music collection, it is essentially like everyone else's: a giant mess. But the tags are somehow good enough to give interesting results with the suggestion API of Last.FM (and actually they need to be as messy as everyone else's so the matching algorithm can work), so I basically start with a song I like, and run DynaMPD in order to get some related content from my collection.

And about the mess in my files well… I just don't give a shit anymore. I'm still able to find what I'm looking for in a relatively short amount of time, and trying to remove the small overhead isn't worth wasting my life on it.

I have resigned myself: music is a form of art, and thus it is not, and must never be, limited to a binary mind.

