PodRat

I wrote a quick initial implementation of some podcast software. I named it PodRat, short for “podcast pack rat.” This first implementation is just for myself, but I may eventually release it for others who collect podcasts and care about things like filenames and tags.

The software currently has three commands:

  • The rss2yaml command parses an RSS file (or other type of feed file) and creates a YAML file that contains normalized metadata, including filenames and tags.
  • The process command reads a YAML file and processes source files by setting tags and renaming them to target filenames.
  • The sync command reads a YAML file and updates tags of target files when they do not match.

I implemented the initial version in Python in order to get it done quickly. I would like to write it in Haskell, but it was faster to use Python because of the libraries available.

  • I used feedparser to parse feeds. It supports many feed formats, which can be quite tricky to parse. The Haskell feed package looks promising!
  • I used Mutagen for reading and writing tags. I have very specific tag requirements, and I have not had success with Haskell libraries.
  • I used PyYAML for rendering and parsing YAML files. The order of properties in objects of a YAML file should be fixed (specified by the program). This is trivial to do with PyYAML, but I do not know of any Haskell libraries that can do this.
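As a rough illustration of the fixed property order mentioned above, here is a minimal sketch using PyYAML's sort_keys option (available since PyYAML 5.1). The episode record and its keys are hypothetical and are not PodRat's actual schema.

```python
import yaml  # PyYAML, a third-party package

# Hypothetical episode record; the keys are illustrative only.
# Python dicts keep insertion order, so this is the order the program chose.
episode = {
    "filename": "SAP-042-EpisodeTitle.mp3",
    "title": "42: Episode Title",
    "artist": "Host Name",
    "comment": "Guest Name",
}

# sort_keys=False tells PyYAML to emit keys in insertion order
# instead of sorting them alphabetically (the default).
text = yaml.safe_dump(episode, sort_keys=False)
print(text)
```

Loading the text back with yaml.safe_load returns an equal dict, so the ordering only affects how the file reads when edited by hand.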

The rss2yaml command is used to create a new YAML file, which can then be edited by hand. The goal is to do as much as possible automatically, in order to minimize manual editing work. The CLI has enough functionality to handle the easy cases, but the vast majority of feeds require custom code. I usually avoid object-oriented code, but the rss2yaml implementation is object-oriented because it is the easiest way in Python to provide default behavior for many operations and allow implementations to change behavior with fine granularity.
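The pattern described above can be sketched roughly as follows: a base class supplies default behavior for each operation, and a per-podcast subclass overrides only the operations that differ. All class names, method names, and dictionary keys here are hypothetical, chosen for illustration; this is not PodRat's actual code.

```python
import re
from typing import Optional


class FeedHandler:
    """Default behavior (hypothetical stand-in for a per-feed base class)."""

    def episode_number(self, entry: dict) -> Optional[int]:
        # Default: use the "itunes_episode" value (an assumed key) if present.
        value = entry.get("itunes_episode")
        return int(value) if value is not None else None

    def episode_title(self, entry: dict) -> str:
        # Default: the feed title with surrounding whitespace removed.
        return entry["title"].strip()

    def title(self, entry: dict) -> str:
        # Normalized title tag: "<number>: <title>" when a number is known.
        number = self.episode_number(entry)
        name = self.episode_title(entry)
        return name if number is None else f"{number}: {name}"


class HashTitleHandler(FeedHandler):
    """A feed that writes titles like "#42 | Title" and has no episode field."""

    _pattern = re.compile(r"^#(\d+)\s*\|\s*(.+)$")

    def episode_number(self, entry: dict) -> Optional[int]:
        match = self._pattern.match(entry["title"])
        # Fall back to the default when the title does not fit the pattern.
        return int(match.group(1)) if match else super().episode_number(entry)

    def episode_title(self, entry: dict) -> str:
        match = self._pattern.match(entry["title"])
        return match.group(2).strip() if match else super().episode_title(entry)


print(FeedHandler().title({"title": " Intro ", "itunes_episode": "7"}))
print(HashTitleHandler().title({"title": "#42 | Title"}))
```

The subclass only touches two small methods, while title() is inherited unchanged; that is the fine-grained overriding that makes the object-oriented approach convenient here.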

Here are some examples from the podcasts that I have processed so far:

  • Episode numbers are a common way to identify episodes of a podcast. One can usually get the episode number from an itunes:episode element, but not all podcasts include that. PodRat makes it easy to override the default behavior and parse the episode number from the title or URL, for example.
  • I processed one podcast feed that formatted episode numbers as real numbers (1.00)! My guess is that it was rendered by JavaScript, a language that is popular in spite of its lack of a separate integer type for ordinary numbers. PodRat makes it easy to handle such issues.
  • Many podcasts are surprisingly disorganized! Episode numbers are sometimes skipped, repeated, etc. Some podcasts release episodes that are not deemed “full” episodes and give them some special designation, such as “bonus” or a point release (such as “episode 42.5”). PodRat generates “discriminator” strings that ensure unique filenames with the correct sort order. These strings are alphabetic by default, but implementations for specific podcasts can use custom behavior.
  • For podcasts that are hopelessly disorganized, strict index numbers and/or timestamps can be used instead of episode numbers.
  • I format the title tag with the episode number at the beginning, followed by a colon and space, followed by the episode title. This is the convention used by most podcasts, but some podcasts do not include the episode number, some podcasts use a different format (such as #42 | Title), and some podcasts write out Episode before the episode number. PodRat makes it easy to normalize such titles.
  • I use the artist tag to specify the podcast hosts and the comment tag to specify the guests. When podcasts consistently specify guests in the title, PodRat can parse the title so that the title tag does not include the names of guests and the comment tag does.
  • For podcasts that do not specify guests in the title, such information can be scraped from the website, using requests and beautifulsoup, for example.
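Two of the cases above, episode numbers rendered as real numbers and alphabetic discriminators for repeated numbers, can be sketched as follows. The function names are hypothetical, and the discriminator rule shown (no suffix for the first use of a number, then a, b, c, ... in release order) is an assumption for illustration rather than PodRat's actual algorithm.

```python
from collections import Counter
from decimal import Decimal
from typing import List
import string


def parse_episode_number(raw: str) -> int:
    """Parse "42", "1.00", or "42.5" into the whole episode number."""
    # Decimal avoids float rounding; int() truncates, so "42.5" becomes 42.
    return int(Decimal(raw.strip()))


def label_episodes(raw_numbers: List[str], width: int = 3) -> List[str]:
    """Build zero-padded labels, adding a suffix when a number repeats.

    Assumes the input is in release order. This sketch supports at most
    26 repeats of one number; more would raise an IndexError.
    """
    seen: Counter = Counter()
    labels = []
    for raw in raw_numbers:
        number = parse_episode_number(raw)
        repeats = seen[number]
        seen[number] += 1
        # The first use of a number gets no suffix; later uses get a, b, ...
        suffix = "" if repeats == 0 else string.ascii_lowercase[repeats - 1]
        labels.append(f"{number:0{width}d}{suffix}")
    return labels


labels = label_episodes(["41", "42.00", "42.5", "43"])
print(labels)
```

Because the labels are zero-padded and the suffix follows the digits, sorting them as plain strings keeps the episodes in release order, with the point release placed directly after episode 42.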

What is the deal with filenames?

  • For people who manage podcasts using the command line, the filename is used to identify a file. For example, using a UUID as the filename is completely unhelpful.
  • Filenames determine the order in which files are played on many MP3 players. For example, using a UUID or just the title as the filename results in an essentially random order. An episode number (and discriminator), timestamp, or index number needs to be used to get the order correct, and such numbers must be zero-padded.
  • Users often need to be able to select specific files on MP3 players with tiny screens, so the filename prefix should not be too long. For example, a podcast with files named like SomeAwesomePodcast-042-EpisodeTitle.mp3 may result in a menu showing only lines of SomeAwes and make it difficult for the user to select a file. Files named like SAP-042-EpisodeTitle.mp3 show the important information in the menu.
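The filename points above can be illustrated with a small sketch that builds names in the SAP-042-EpisodeTitle.mp3 style. The make_filename function and its parameters are hypothetical, and the rule for turning a title into a CamelCase slug is an assumption.

```python
import re


def make_filename(prefix: str, number: int, title: str,
                  discriminator: str = "", width: int = 3,
                  ext: str = "mp3") -> str:
    """Build a name like SAP-042-EpisodeTitle.mp3 (illustrative only)."""
    # Keep runs of ASCII letters and digits and join them in CamelCase.
    words = re.findall(r"[A-Za-z0-9]+", title)
    slug = "".join(word[:1].upper() + word[1:] for word in words)
    # Zero-pad the number so that string order matches numeric order.
    return f"{prefix}-{number:0{width}d}{discriminator}-{slug}.{ext}"


padded = [make_filename("SAP", n, "Episode Title") for n in (9, 10, 100)]
unpadded = [f"SAP-{n}-EpisodeTitle.mp3" for n in (9, 10, 100)]

print(sorted(padded) == padded)      # padded names keep numeric order
print(sorted(unpadded) == unpadded)  # unpadded names do not
```

Without padding, a player that sorts by filename puts episode 10 and episode 100 before episode 9, which is why the numbers must be zero-padded to a fixed width.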

I have used PodRat to process many podcasts, but I still have many left to process. It is a work in progress, and I will likely improve the design as I get more experience using it. The hacky nature of coding special behavior to handle the formats and quirks of each podcast makes me think that Python might be an appropriate choice for such software.

Author

Travis Cardwell
