CBS News Radio

From Archiveteam

Most of CBS Radio's online-distributed audio from 2008 to the present can be archived for roughly 2 TB of storage and 3 million URLs.

Sources

Starting in 2008, CBS Radio distributed its newsfeed audio (cbsnewsfeed.com / legacy.cbsradionewsfeed.com) from an S3 bucket (cbsrnaudio). This includes hourly news broadcasts, updates, regular shows, special broadcasts, and the soundbites and audio clips used as part of larger packages. The bucket is no longer open, but its CloudFront URL (audio.cbsradionewsfeed.com) resolves all valid paths to it.

  • I have the list of the entire contents of the "2015" prefix in that bucket: 200,405 objects, 83GB.
  • All of this audio was pushed out via a redirector called "eyecast". Eyecast IDs are incrementing integers and cover most of the URLs to capture (e.g. 199,475 of the 200,377 regularly named files in the 2015 prefix, or 99.5%). Scraping eyecast yields 3,040,502 URLs. (Around 1,800 are actually invalid, e.g. with /home/cbsnewsr/www/audio in the path, and needed to be fixed manually.)
  • Prior to the internal standardization of prefixes and filenames in the S3 bucket, everything was dumped in the root as YYMMDDHHmmSS/P00?.mp3 files. This is another 1,288 URLs.
  • Based on the eyecast scrape, there are about 21 days when eyecast may have been down or dropping data. Hourlies, updates, and licensed hourlies have regular filenames, so we can generate every filename that should exist and try to fetch those as well, to make sure nothing is missed. This is 403,200 URLs.
  • Finally, I pulled all the snapshots for all the cbsradionewsfeed.com and cbsrnaudio.s3.amazonaws.com URLs from the Wayback Machine, and compiled a list of all the already-captured URLs. This is 6,941 URLs.

The de-duplicated list of presumed-good URLs (up to March 20 or so) has 3,044,050 entries. The remaining list of speculative URLs has 3,323. Eyecast IDs should continue to be scraped, and speculative URLs generated, through the May 22, 2026 shutdown.
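
The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function name is hypothetical, and it assumes the eyecast, legacy-root, and Wayback lists are the "presumed good" sources while the generated hourly/update filenames are the speculative ones.

```python
def split_url_lists(confirmed_sources, speculative):
    """Return (good, remaining_speculative), de-duplicated, first-seen order kept.

    confirmed_sources: iterable of URL iterables (assumed: eyecast, legacy, Wayback)
    speculative: iterable of generated URLs that may or may not exist
    """
    # dict.fromkeys drops duplicates while preserving insertion order.
    good = list(dict.fromkeys(
        url.strip()
        for source in confirmed_sources
        for url in source
        if url.strip()
    ))
    seen = set(good)
    # Keep only speculative URLs that no confirmed source already covers.
    remaining = [
        url
        for url in dict.fromkeys(u.strip() for u in speculative if u.strip())
        if url not in seen
    ]
    return good, remaining
```

Each input can be any iterable of strings, such as an open text file with one URL per line.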

A test run of 50,505 downloads showed that 325 (0.6%) reported 403 Forbidden.

URLs

File patterns

  • Hourlies (top of the hour): YYYY/MM/DD/HH/Hourly-HH.mp3
  • Updates (bottom of the hour): YYYY/MM/DD/HH/Update-HH.mp3
  • Licensed hourlies (top of the hour): YYYY/MM/DD/HH/Licensedhnc-HH.mp3
  • Named shows and features (usually): TITLE_STATION_EXPORT.mp3 or DDTITLE_STATION_EXPORT.mp3
  • Sound bites and clips: NXXX_STATION_COUNTER.mp3

HH is 00 to 23. STATION is probably the specific station ID that produced the report, usually a four-digit number, sometimes alphanumeric. EXPORT is an incrementing, sequential, seven-digit integer, recycled at some point, indicating some global asset ID from their back-end system (ENPS or other), with around a thousand files generated a day. NXXX increments from N001 starting at midnight Eastern. COUNTER appears to be a back-end server's time-based counter in tenths of a second.

The licensed hourly newscasts are usually just the first half of the hourlies, for syndicated content, so local stations can put their own content after the commercial break. These began appearing in the eyecast scrape December 8, 2014.
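
As a rough sketch of how the speculative list could be generated from the regular patterns above: the function name is hypothetical, the https scheme on the CloudFront host is an assumption, and so is using December 8, 2014 (the first appearance in the eyecast scrape) as the cutoff for licensed hourlies.

```python
from datetime import date, timedelta

# Assumption: the CloudFront host is reachable over https.
BASE = "https://audio.cbsradionewsfeed.com/"
# Assumption: licensed hourlies are only worth trying from this date on.
LICENSED_START = date(2014, 12, 8)


def expected_urls(start, end, base=BASE):
    """Yield the regularly named URLs for every hour from start to end inclusive."""
    day = start
    while day <= end:
        for hour in range(24):  # HH runs from 00 to 23
            hh = f"{hour:02d}"
            prefix = f"{base}{day:%Y/%m/%d}/{hh}/"
            yield f"{prefix}Hourly-{hh}.mp3"
            yield f"{prefix}Update-{hh}.mp3"
            if day >= LICENSED_START:
                yield f"{prefix}Licensedhnc-{hh}.mp3"
        day += timedelta(days=1)
```

For example, `list(expected_urls(date(2015, 1, 1), date(2015, 1, 1)))` produces 72 URLs, three per hour for one day.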

There are files that don't match these patterns. Examples from the 2015 capture:

  • 2015/01/11/21/60 MINUTES_ 1_11a.mp3
  • 2015/01/22/14/The Observation Deck.mp3
  • 2015/08/21/11/TRY AGAIN.mp3
  • 2015/10/22/16/Green Air 2.wav
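
To find such outliers in a bucket listing, a classifier along these lines may help. It is only a sketch: the regular expressions are guesses at the patterns described above (for instance, the length of COUNTER is not known, and the "?" in P00? is treated as any single character), and the labels and function name are hypothetical.

```python
import re

# Legacy root-level layout: YYMMDDHHmmSS/P00?.mp3 (matched against the full key).
LEGACY = re.compile(r"^\d{12}/P00.\.mp3$")

# Patterns matched against the filename only, in order; first match wins.
NAME_PATTERNS = [
    ("hourly", re.compile(r"^Hourly-\d{2}\.mp3$")),
    ("update", re.compile(r"^Update-\d{2}\.mp3$")),
    ("licensed", re.compile(r"^Licensedhnc-\d{2}\.mp3$")),
    # NXXX_STATION_COUNTER.mp3; checked before shows because the shapes overlap.
    ("clip", re.compile(r"^N\d{3}_[A-Za-z0-9]+_\d+\.mp3$")),
    # TITLE_STATION_EXPORT.mp3, with EXPORT assumed to be exactly seven digits.
    ("show", re.compile(r"^.+_[A-Za-z0-9]+_\d{7}\.mp3$")),
]


def classify(key):
    """Return a label for an object key, or "other" if no pattern matches."""
    if LEGACY.match(key):
        return "legacy"
    name = key.rsplit("/", 1)[-1]
    for label, pattern in NAME_PATTERNS:
        if pattern.match(name):
            return label
    return "other"
```

All four examples listed above come back as "other", so counting the labels over a listing gives a quick estimate of how many files fall outside the regular naming scheme.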