From Archiveteam


ROBOTS.TXT is a stupid, silly idea in the modern era. Archive Team entirely ignores it and, with precisely one exception, everyone else should too.

If you do not know what ROBOTS.TXT is and you run a site... excellent. If you do know what it is and you have one, delete it. Regardless, Archive Team will ignore it and we'll delete your complaints, just like you should be deleting ROBOTS.TXT.

For the unfamiliar, ROBOTS.TXT is a machine-readable text file that sits on a webserver and gives instructions as to which items, directories or sections of a website should not be "crawled", that is, viewed by search engines, downloaded by programs, or otherwise accessed by automatic means. The reason is rarely stated in the file itself, and in fact people implement ROBOTS.TXT for all sorts of reasons: convincing themselves that they don't want "outdated" information in caches, preventing undue taxing of resources, or avoiding any unpleasant situation where information they deleted because it was embarrassing or unfavorable still shows up elsewhere.
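To illustrate how crawlers are expected to read the file, here is a minimal sketch using Python's standard urllib.robotparser module, which implements the protocol. The rules, bot name and URLs are hypothetical, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical ROBOTS.TXT asking all crawlers to skip /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved crawler consults the parser before fetching each URL
print(parser.can_fetch("ExampleBot", "https://example.com/private/diary.html"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/public/page.html"))    # True
```

Note that nothing enforces this: the file is purely advisory, and a crawler that never calls `can_fetch` sees every URL anyway.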

The purpose and meaning behind the creation of the ROBOTS.TXT file date back to the early 1990s, when the then-new World Wide Web was quickly defining itself as the killer application that would change forever how users would interact with the growing internet. Where previous information networks utilizing internet connections, such as GOPHER and WAIS, were text-based and relatively low-bandwidth, the combination of text, graphics and even sounds on webpages meant that resources were stretched to the limit. It was possible, no joke, to crash a machine with a Netscape/Mozilla web browser, as it opened multiple connections to web servers and downloaded all their items; the optimizations and advantages that 20 years of innovation have brought to web serving were simply not there. As crawlers/spiders/search engines came into use, the potential to overwhelm a site was great. Thus, Martijn Koster is credited with creating the Robots Exclusion Protocol, also known simply as the ROBOTS.TXT file.

It was an understandable stop-gap fix for a temporary problem, a problem that has long, long since been solved. While the onslaught of some social media hoo-hah will demolish some servers in the modern era, normal single or multi-thread use of a site will not cause trouble, unless there's a server misconfiguration or you're browsing a science project.

What this situation does, in fact, is cause many more problems than it solves: with ROBOTS.TXT in place, a catastrophic failure on a website becomes total destruction, because no mirror or cache exists to recover from. Redesigns, poor choices in URL transition, and all other sorts of management work can lead to the loss of historically important and relevant data. Left unchecked and left alone, the ROBOTS.TXT file ensures no mirroring of, or reference to, items that may have use and meaning beyond the website's own context.

Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically. A dumb spider or crawler will hit every URL linked, and if a site allows users to activate a link that causes resource hogging or otherwise deletes/adds data, then a ROBOTS.TXT exclusion makes perfect sense while you fix your broken and idiotic configuration.
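In that one case, the exclusion should be narrow, covering only the dangerous endpoints rather than the whole site. The paths below are hypothetical examples of such state-changing URLs; the content itself stays crawlable:

```
User-agent: *
Disallow: /admin/delete
Disallow: /cart/add
Disallow: /report/regenerate
```

Even this is a stop-gap: the real fix is to stop letting a plain GET link modify data on the server.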

Again, Archive Team interprets ROBOTS.TXT as damage and temporary madness, and works around it. Everyone should. If you don't want people to have your data, don't put it online.

Don't commit suicide. Don't use ROBOTS.TXT.

Archiveteam welcomes debate, dissent, rage and misery around the saving of online history. Please join the conversation and our various projects to rescue sites from destruction.