CBS News Radio
| CBS News Radio | |
US news radio | |
| URL | https://www.cbsnewsfeed.com |
| Status | Closing |
| Archiving status | Not saved yet |
| Archiving type | Unknown |
| IRC channel | #archiveteam-bs (on hackint) |
Most of CBS Radio's online-distributed audio from 2008 to the present can be archived by fetching ~3M URLs, totaling ~2TB.
Sources
Starting in 2008, CBS Radio distributed its newsfeed audio (cbsnewsfeed.com / legacy.cbsradionewsfeed.com) from an S3 bucket (cbsrnaudio), including hourly news broadcasts, updates, regular shows, special broadcasts, and soundbites and audio clips used as part of larger packages. The bucket is no longer open, but the CloudFront URL in front of it (audio.cbsradionewsfeed.com) still resolves all valid paths to it.
- I have the list of the entire contents of the "2015" prefix in that bucket: 200,405 objects, 83GB.
- All of this audio was pushed out via a redirector called "eyecast". Eyecast IDs are incrementing integers and cover most of the URLs to capture (e.g. 199,475 out of 200,377 regularly named files, or 99.5%, of the 2015 prefix). Scraping this list is 3,040,502 URLs. (Around 1,800 are actually invalid, e.g. with `/home/cbsnewsr/www/audio` in the path, and needed to be manually fixed.)
- Prior to the internal standardization of prefixes and filenames in the S3 bucket, everything was dumped in the root as `YYMMDDHHmmSS/P00?.mp3` files. This is another 1,288 URLs.
- Based on the eyecast scrape, there's ~21 days where eyecast may have been down or dropping data. Hourlies, updates, and licensed hourlies have regular filenames, and we can generate a list of all the filenames that should exist and try to fetch those as well, to make sure nothing is missed. This is 403,200 URLs.
- Finally, I pulled all the snapshots for all the `cbsradionewsfeed.com` and `cbsrnaudio.s3.amazonaws.com` URLs from Wayback Machine, and compiled a list of all the already-captured URLs. This is 6,941 URLs.
The de-duped list of presumed good URLs (up to March 20 or so) is 3,044,050. The remaining list of speculative URLs is 3,323. Eyecast IDs should be scraped and speculative URLs should be generated through to the May 22, 2026 shutdown.
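The merge described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used to build the published lists: it assumes the "presumed good" list is the de-duplicated union of the scraped and captured lists, and that the speculative list is whatever remains of the generated filenames after removing anything already known. The function names `dedupe` and `split_speculative` and the toy URLs are hypothetical.

```python
def dedupe(urls):
    """Return URLs in first-seen order, dropping blanks and duplicates."""
    seen = set()
    out = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            out.append(url)
    return out


def split_speculative(known_lists, generated):
    """Merge the known lists, then keep only generated URLs not already known.

    Assumption: "speculative" means generated-but-not-yet-seen; the text
    does not spell out the exact rule.
    """
    known = dedupe(url for lst in known_lists for url in lst)
    known_set = set(known)
    speculative = [u for u in dedupe(generated) if u not in known_set]
    return known, speculative


# Toy data standing in for the eyecast, root-dump, and Wayback lists.
eyecast = ["http://example.invalid/a.mp3", "http://example.invalid/b.mp3"]
wayback = ["http://example.invalid/b.mp3", "http://example.invalid/c.mp3"]
generated = ["http://example.invalid/c.mp3", "http://example.invalid/d.mp3"]

good, speculative = split_speculative([eyecast, wayback], generated)
```

Keeping first-seen order means the merged list stays stable between runs, which makes it easier to diff against an earlier version.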
A test run of 50,505 downloads showed that 325 (0.6%) reported 403 Forbidden.
URLs
- https://s3.amazonaws.com/vitorio/ready_for_wget.txt contains the full de-duped list of presumed good URLs.
- https://s3.amazonaws.com/vitorio/predictable_urls.txt contains the generated list of speculative URLs.
- https://s3.amazonaws.com/vitorio/cbs_archive_updater.py is a tested Python 3 script (requires `requests`) that will scrape eyecast and generate URLs for dates beyond those in these lists. `--date 2026-03-21 --id 3732737` will scrape and generate URLs from March 21 through to "tomorrow" and output to a file named for the results, e.g. `cbs_update_20260321_to_20260408_Eyecast_3732737_to_3739122.txt`.
File patterns
- Hourlies (top of the hour): `YYYY/MM/DD/HH/Hourly-HH.mp3`
- Updates (bottom of the hour): `YYYY/MM/DD/HH/Update-HH.mp3`
- Licensed hourlies (top of the hour): `YYYY/MM/DD/HH/Licensedhnc-HH.mp3`
- Named shows and features (usually): `TITLE_STATION_EXPORT.mp3` or `DDTITLE_STATION_EXPORT.mp3`
- Sound bites and clips: `NXXX_STATION_COUNTER.mp3`
HH is 00 to 23. STATION is probably the specific station ID that produced the report, usually a four-digit number, sometimes alphanumeric. EXPORT is an incrementing, sequential, seven-digit integer, recycled at some point, indicating some global asset ID from their back-end system (ENPS or other), with around a thousand files generated a day. NXXX increments from N001 starting at midnight Eastern. COUNTER appears to be a back-end server's time-based counter in tenths of a second.
The licensed hourly newscasts are usually just the first half of the hourlies, for syndicated content, so local stations can put their own content after the commercial break. These began appearing in the eyecast scrape December 8, 2014.
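Because the hourly, update, and licensed hourly names are fully determined by date and hour, the expected URLs for a date range can be generated. The sketch below is one way to do that and is not the linked updater script. It assumes the `https` scheme, that the CloudFront host serves objects under the same path as the bucket key, and that licensed hourlies are only worth generating from December 8, 2014 onward; the function name `predictable_urls` is hypothetical.

```python
from datetime import date, timedelta

# Host is from the text; the https scheme is an assumption.
BASE = "https://audio.cbsradionewsfeed.com/"
# First date licensed hourlies appear in the eyecast scrape (assumed cutoff).
LICENSED_START = date(2014, 12, 8)


def predictable_urls(start, end):
    """Yield expected Hourly/Update/Licensedhnc URLs for each day in [start, end]."""
    day = start
    while day <= end:
        kinds = ["Hourly", "Update"]
        if day >= LICENSED_START:
            kinds.append("Licensedhnc")
        for hour in range(24):  # HH runs from 00 to 23
            prefix = f"{day:%Y/%m/%d}/{hour:02d}"
            for kind in kinds:
                yield f"{BASE}{prefix}/{kind}-{hour:02d}.mp3"
        day += timedelta(days=1)


# One day after the cutoff yields 24 hours x 3 kinds of file.
one_day = list(predictable_urls(date(2015, 1, 1), date(2015, 1, 1)))
```

A generator keeps memory use flat over multi-year ranges; the output can be written straight to a file for a downloader to consume.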
There are files that don't match these patterns. Examples from the 2015 capture:
- `2015/01/11/21/60 MINUTES_ 1_11a.mp3`
- `2015/01/22/14/The Observation Deck.mp3`
- `2015/08/21/11/TRY AGAIN.mp3`
- `2015/10/22/16/Green Air 2.wav`
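To find outliers like these in a bucket listing, keys can be sorted into the patterns above with a few regular expressions. This is a rough sketch under several assumptions not confirmed by the text: that NXXX is always N plus three digits, that STATION is purely alphanumeric, that EXPORT is exactly seven digits, and that COUNTER is all digits. The function name `classify` and the two sample keys marked as made up are hypothetical.

```python
import re

# Date-and-hour files: the HH in the path must match the HH in the filename.
HOUR_FILE = re.compile(
    r"^\d{4}/\d{2}/\d{2}/(\d{2})/(Hourly|Update|Licensedhnc)-(\d{2})\.mp3$"
)
# NXXX_STATION_COUNTER.mp3 (assumes three digits after N, numeric counter).
SOUNDBITE = re.compile(r"^N\d{3}_[A-Za-z0-9]+_\d+\.mp3$")
# TITLE_STATION_EXPORT.mp3 or DDTITLE_STATION_EXPORT.mp3 (seven-digit export ID).
NAMED = re.compile(r"^.+_[A-Za-z0-9]+_\d{7}\.mp3$")


def classify(key):
    """Return 'hourly', 'update', 'licensedhnc', 'soundbite', 'named', or 'other'."""
    m = HOUR_FILE.match(key)
    if m and m.group(1) == m.group(3):
        return m.group(2).lower()
    name = key.rsplit("/", 1)[-1]
    # Check sound bites first: a seven-digit counter would also fit NAMED.
    if SOUNDBITE.match(name):
        return "soundbite"
    if NAMED.match(name):
        return "named"
    return "other"


samples = [
    "2015/01/11/21/Hourly-21.mp3",
    "2015/03/04/10/N012_1234_567890.mp3",              # made-up sound bite key
    "2015/03/04/10/WEEKENDROUNDUP_1234_0123456.mp3",   # made-up named show key
    "2015/01/22/14/The Observation Deck.mp3",
]
labels = [classify(k) for k in samples]
```

Anything labeled `other` is a candidate for manual review rather than something to discard, since the unusual names are still real audio.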