Dev/Infrastructure
Latest revision as of 18:13, 11 November 2024
The Archive Team infrastructure is a distributed web processing system used for distributed preservation of service attacks.
Component Overview
Figure | Description |
---|---|
1 | Website in Danger |
2 | Warrior |
3 | Tracker |
4 | Upload Target |
5 | Internet Archive |
Website in Danger
The website in danger is typically one that exhibits some combination of the following:
- acquihire
- mass layoffs
- neglect, decay, poor health, or owners missing in action
- political and legal issues
- a robots.txt exclusion file that forbids crawling by the Wayback Machine (whether intentionally or unintentionally)
- cultural significance
Warrior
The Warrior is client code run by volunteers that grabs/scrapes the content of the website in danger.
Websites often implement throttling systems to protect themselves against problems such as spam or excessive server load. These systems typically ban clients by IP address, so many Warriors, running on many IP addresses, are needed.
Content is usually grabbed and saved in WARC files.
Tracker
The Tracker is server code run by "core" Archive Team volunteers. The Tracker tells each Warrior what to download and provides a leaderboard.
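The Tracker's two roles, handing out work and keeping a leaderboard, can be illustrated with a minimal in-memory sketch. This is not the actual Tracker code: the `Tracker` class, its method names, the item strings, and the choice to rank Warriors by bytes archived are all assumptions made for illustration.

```python
from collections import Counter, deque


class Tracker:
    """Toy in-memory tracker (hypothetical): assigns work and keeps a leaderboard."""

    def __init__(self, items):
        self.todo = deque(items)      # items nobody has claimed yet
        self.claimed = {}             # item -> name of the Warrior working on it
        self.bytes_done = Counter()   # Warrior name -> total bytes archived

    def request_item(self, warrior):
        """Assign the next unclaimed item to a Warrior, or None if none are left."""
        if not self.todo:
            return None
        item = self.todo.popleft()
        self.claimed[item] = warrior
        return item

    def report_done(self, warrior, item, size):
        """Record a finished item; reject reports for items the Warrior does not hold."""
        if self.claimed.get(item) != warrior:
            raise ValueError(f"{item!r} is not claimed by {warrior!r}")
        del self.claimed[item]
        self.bytes_done[warrior] += size

    def leaderboard(self):
        """Return (warrior, bytes) pairs, largest contributor first."""
        return self.bytes_done.most_common()
```

A Warrior in this sketch would loop over `request_item`, download the assigned item, and call `report_done` with the size of what it saved. Tracking claims separately from the queue is one simple way to notice work that was handed out but never finished.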
Upload Targets
Targets (sometimes called "staging servers") are typically servers that run rsync and are operated by "core" volunteers. Warriors upload WARC files to these hosts. The hosts queue the WARC files and package them into large WARC files (Megawarcs). The Megawarcs are then uploaded to the Internet Archive under the archiveteam collection.
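The packaging step can be sketched as follows. This is an illustration of the general idea only, not the tooling the targets actually run: it assumes the uploaded files are gzip-compressed, relies on the fact that concatenated gzip members still form a valid gzip stream, and uses hypothetical function names and an ad hoc index layout.

```python
import gzip


def pack_megawarc(warcs):
    """Concatenate gzip-compressed WARC files and record where each one landed.

    `warcs` maps a file name to its already-gzipped bytes. Returns the combined
    bytes plus a list of (name, offset, length) index entries.
    """
    combined = bytearray()
    index = []
    for name, data in sorted(warcs.items()):
        index.append((name, len(combined), len(data)))
        combined += data
    return bytes(combined), index


def extract(combined, entry):
    """Recover one original file from the combined blob using its index entry."""
    _name, offset, length = entry
    return combined[offset:offset + length]


def read_all(combined):
    """Decompress the whole blob; works because gzip members can be concatenated."""
    return gzip.decompress(combined)
```

Keeping an index of offsets means a single small upload can still be pulled back out of the large file without decompressing everything before it.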
Internet Archive
The Internet Archive is a digital library and archive. It differs from other hosting services because it is not a distribution platform. If there is a legal issue, items are "darked" (made inaccessible to the general public) instead of deleted.
An item is ingested by the Wayback Machine if it
- has .warc, .warc.gz, or .warc.zst files,
- has a "web" media type,
- and is under the Archive Team collection.
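The three conditions above amount to a simple predicate. The sketch below expresses them over a plain dictionary; the keys `files`, `mediatype`, and `collections` are assumptions chosen for illustration rather than a documented Internet Archive schema, and the real ingestion logic may check more than this.

```python
WARC_SUFFIXES = (".warc", ".warc.gz", ".warc.zst")


def is_wayback_candidate(item):
    """Check the three listed conditions against a dict describing an item.

    Assumed (hypothetical) layout: `files` is a list of file names, `mediatype`
    is a string, and `collections` lists every collection the item is under.
    """
    has_warc = any(name.endswith(WARC_SUFFIXES) for name in item.get("files", []))
    is_web = item.get("mediatype") == "web"
    in_collection = "archiveteam" in item.get("collections", [])
    return has_warc and is_web and in_collection
```

All three conditions must hold at once: an item with WARC files but a different media type, or one outside the archiveteam collection, would not qualify under this reading.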