I gave myself a programming challenge.
I like to listen to podcasts, and one of my favorites is the Planet Money podcast. I have a Linux box I use as a server that runs a script called podget. It runs every night and checks for the latest episodes. When I finish listening to the podcasts on my player, I copy over the MP3s that have downloaded since the last time. I have been avoiding the Planet Money podcast because, frankly, it is a pain. The files download with a weird name, so I have a script I run to rename them properly. Now I have files named with a datestamp and the number of the episode.
I can’t tell what the subject is from the file name, so I thought I could write a Python script to pull the ID3 data from the files. I tried a few different utilities and none of them worked. It finally dawned on me that they did not work because Planet Money did not bother to populate the tags. I then looked for a list of the episodes, and the only usable one was the RSS feed. If you view it in the browser and do a save page, you get an XHTML page. Looking at the file, I realized I could key on the # sign next to the episode number. I could have loaded the document into a parser, but I did not want to add additional dependencies since the target for the script is an old server. Here is what the pertinent data looks like.
<span xml:base="https://www.npr.org/rss/podcast.php?id=510289">#830: XXX-XX-XXXX</span>
I decided I should be able to take the file, loop through the lines, and split each line at the # sign. It seemed to work. I then parsed out the episode number. Following the colon is the name of the episode. (In this example, the title XXX-XX-XXXX represents a Social Security number.) So all I had to do was take the text between the colon and the less-than sign. I then populated a dictionary using the episode number as the key.
Next I moved on to the actual file names. At first I piped the output of an ls *.mp3 command to a file and looped through that file. While doing that, I figured I did not want to pipe the filenames manually every time, so I switched to listing the directory and looping through the filenames. Again, I parsed out the episode number, opting not to include the version number, since some of the files include a v1, v2, etc. after the episode number. I split each filename on ‘pmpod’; the second item contains the episode number, and I loop character by character until I find a ‘v’ or ‘.’. I then use a subprocess to call the shell command ‘mp3info’ to update the MP3 file with the ID3 v1 title tag.
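To make those steps concrete, here is a minimal sketch of how they could look in Python. It is not my exact script: the function names are made up for illustration, and the sample filename in the comments is only a guess at the renamed format (datestamp, then ‘pmpod’, then the episode number). The mp3info call relies on its -t option, which sets the ID3 v1 title.

```python
import subprocess


def parse_titles(lines):
    """Build {episode_number: title} from lines that look like '...>#830: Title<...'."""
    titles = {}
    for line in lines:
        parts = line.split("#", 1)            # key on the first # sign
        if len(parts) < 2:
            continue                          # no # on this line
        number, colon, rest = parts[1].partition(":")
        number = number.strip()
        if not colon or not number.isdigit():
            continue                          # a # that is not an episode marker
        # The title runs from the colon up to the next less-than sign.
        titles[number] = rest.split("<", 1)[0].strip()
    return titles


def episode_number(filename):
    """Return the digits after 'pmpod', stopping at a 'v' or '.' (assumed name format)."""
    parts = filename.split("pmpod")
    if len(parts) < 2:
        return None
    digits = ""
    for ch in parts[1]:
        if ch in ("v", "."):
            break
        digits += ch
    return digits or None


def set_title(path, title):
    """Write the ID3 v1 title by shelling out to mp3info (must be installed)."""
    subprocess.run(["mp3info", "-t", title, path], check=True)
```

With the sample line from the feed, parse_titles gives a dictionary with the key "830", and episode_number on a hypothetical name like 20180316_pmpod830v2.mp3 gives "830" as well, so the two can be matched up before calling set_title. Keep in mind that an ID3 v1 title only holds 30 characters, so longer episode names will not fit in full.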
Here is a before and after of what VLC displays as the title.
I then write out the filename to a timestamped file, so I have a list of titles and can decide which ones I would like to listen to.
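Writing that list could be sketched roughly like this. The file name pattern, the tab-separated layout, and the function names are assumptions for the example, not what my script necessarily uses.

```python
import os
from datetime import datetime


def list_filename(now):
    """Name the list after the time it was written (hypothetical naming pattern)."""
    return now.strftime("titles_%Y%m%d_%H%M%S.txt")


def write_title_list(entries, directory=".", now=None):
    """Write one 'filename<TAB>title' line per entry and return the path written."""
    now = now or datetime.now()
    path = os.path.join(directory, list_filename(now))
    with open(path, "w", encoding="utf-8") as fh:
        for filename, title in entries:
            fh.write("{}\t{}\n".format(filename, title))
    return path
```

Because the timestamp is part of the name, each nightly run leaves its own list behind instead of overwriting the previous one.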
I then thought that I did not want to download the XHTML manually, so I did a wget of the feed, which returns the actual RSS feed. The RSS feed and the XHTML are different, but since they are both XML my code worked fine with either. I can now add a wget to my rename shell script and have the RSS feed waiting for my new Python script.
Of course when you are dealing with a feed that is out of your control, you can only get your process good enough. I am at the mercy of what Planet Money does with their feed. I know at some point I will have to modify the code, but for now at least I have titles.