Difference between revisions of "Frequently Asked Questions"

From Archiveteam
(+ navbox)
(Add an entry pointing people to the WBM for general access)
(48 intermediate revisions by 13 users not shown)
'''How can I help?'''


See [[Who We Are]], [[Deathwatch]], and [[:Category:Projects_status]]. These pages describe our projects and the things you can do to help.
 
'''Is the Archive Team affiliated with the [[Internet Archive]] (archive.org)?'''
 
No. A few members are affiliated, but the majority of Archive Team members are volunteers who help when they are not busy with work or school.
 
'''Why is ArchiveTeam crawling my site / disrespecting robots.txt?'''
 
A detailed manifesto is located at [[Robots.txt]]. Please read it first and contact us through [[IRC]] (described in an answer below) before taking drastic action. [[Posterous/Story|We cooperate!]]
 
If you notice the crawler's user-agent is "ArchiveBot", please see [[ArchiveBot]].
 
'''How should I go about backing things up?'''
 
What would you like to back up? If you want to mirror or back up a website, the de facto tool is [[Wget]] (but there are many more; see [[Software]]!). WARC files are highly recommended as they can be ingested by the Wayback Machine.
 
If you want to back up your personal files, {{wikipedia|List of backup software}} is an extensive list of backup software. See [[Backup Tips]] as well!
 
'''Where do all the saved files go?'''
 
Files are ultimately uploaded to Internet Archive on the {{IA item|archiveteam}} collection. Archive Team relies on Internet Archive for storing the files.
 
<span id="faq_data_access">'''How do I access the stuff you archived?'''</span>
 
Usually, the content we archived is available in the [[Wayback Machine]], and this is generally the recommended way of accessing it. However, in some cases, this will not work as you might expect. If the obvious ''plug a URL into the WBM'' doesn't work, check whether the wiki page for the specific project has more information.
 
'''What are these WARC files in the Internet Archive? How do I extract files from a WARC file?'''
 
[http://fileformats.archiveteam.org/wiki/WARC WARC files] are the de facto medium for digital preservation of the web. These WARC files are ingested by the Wayback Machine. WARC files are not simple zip files; they're designed to record metadata (such as HTTP headers and capture dates) along with the content.
 
There is a growing number of tools that can manipulate WARC files in [[The WARC Ecosystem]].
 
= halp pls halp =
 
'''I uploaded a WARC file but why doesn't it show up in Wayback Machine?'''
 
To ensure content integrity, items with WARC files must have the mediatype set to "web" and be uploaded by a whitelisted Internet Archive account for them to be ingested by the Wayback Machine.
 
'''I think there is a web site that's going to shut down / sun set / end its incredible journey. Can you save it?'''
 
Yes, do tell us as soon as possible! Don't just Tweet about it; do something. '''''Get things done.''''' Talk to us on [[IRC]] to let us know. There is also a [https://www.reddit.com/r/shutdown /r/shutdown] subreddit.
 
For small websites, see [[ArchiveBot]]. For large websites, the [[Warrior]] may need to be deployed. Please take a look at [[Dev]] for learning how the Warrior works.
 
'''I saved/archived some stuff from a website / ftp server. Do you want it? Where should I upload it?'''
 
Great, yes, and https://archive.org! Tag your items with the subject keyword "archiveteam" and let us know so we can move them under the ArchiveTeam collection.
 
P.S. Creating an account on Internet Archive is free and should be the first thing that comes to mind for archive files. File hosting sites like Putfile, Megaupload, etc. are ''not'' suitable for hosting archives!
 
'''I lost my stuff from [[Geocities]]/[[Tabblo]]/[[Posterous]]/some web host! Where can I get it back?'''
 
Try searching the wiki for a page about the specific website to find out more about what happened. Typically, there are several ways of recovering files from the Internet Archive:
 
* A specially crafted username lookup page created by Archive Team
** Allows you to search by your username and will present the relevant materials. Only a small set of projects have this feature.
* The Internet Archive's [https://archive.org/web/ Wayback Machine]
** This method is the easiest for most users but some web pages take months to show up in the Wayback Machine.
* Individual [[The WARC Ecosystem|WARC Files]] uploaded to the Internet Archive
** This method is the most accurate but requires power-user skills for working with WARC files. Also, WARC files produced by the Internet Archive are not publicly available (but the ones by Archive Team are always available).
 
For details, see [[Restoring]].
 
'''I need help running the Warrior or scripts. I think it's broken.'''
 
See the FAQ in the [[Warrior]] wiki page.
 
= We Are Not The Internet Archive =
 
'''How do I upload something to the Internet Archive (archive.org)?'''
 
You can either use the HTML5/Flash uploader or use an [[Internet_Archive#Uploading_to_archive.org|alternative method]].
 
'''Can someone remove or fix something on the Internet Archive?'''
 
Possibly. Keep in mind that the majority of Archive Team members are volunteers who are not affiliated with the Internet Archive, so requests should go to Internet Archive staff instead. If exceptional circumstances arise, please see the question about contacting Archive Team below.
 
'''How do I get the original file on the Wayback Machine?'''
 
Add <code>id_</code> after the date in the URL. For example: <code>web.archive.org/web/20090119040418'''id_'''/<nowiki>http://www.archiveteam.org/index.php?title=Main_Page</nowiki></code>
 
'''How do I search the contents of the Wayback Machine?'''
 
Unfortunately, you can't. However, the Internet Archive provides API access (designed for programmers and power users) to the Wayback Machine and to the CDX database.
 
'''Why does the Wayback Machine follow robots.txt in a way that I don't like?'''
 
Because it makes the lawyers go away. We're not the Internet Archive. Don't ask us, ask them.
 
'''Can I upload $COPYRIGHTED_THING to the Internet Archive?'''
 
Although the Internet Archive ''prefers'' freely-redistributable content, they also accept still-in-copyright things. If there's a valid complaint/DMCA takedown request, they'll simply make the item private, but they will '''not''' delete the data. Having said that, the Internet Archive is not [[The Pirate Bay]], so please don't treat it as such.
 
'''Can I save big files with the Save Now feature?'''
 
No, files larger than 200 MB will not be saved correctly.
 
= How redundant is Archiveteam? =


'''Is there a backup of the data on the archiveteam.org website? If so where can I download it?'''
Two sets of backups of this wiki are available. There are backups done by the hosting provider (several, going back days and weeks as well as hours), which use the storage capability of the shared hosting to keep them automatically (no tape or disk backups being done as most people would think of them). There are similarly copies of the database kept going back months.


Additionally, an XML dump of the MediaWiki database (which can be imported into any MediaWiki installation) is accessible at [https://www.archiveteam.org/dumps https://www.archiveteam.org/dumps]. New backups are currently pushed out once a week (and the frequency will be increased if changes on the site require it). Our entire images directory is available at [https://www.archiveteam.org/images https://www.archiveteam.org/images].
 
Dumps of the ArchiveTeam Wiki are generated with [[WikiTeam]] tools and uploaded to the [[Internet Archive]] quite regularly. You can find them [https://archive.org/details/wiki-archiveteamorg here] and older ones [https://archive.org/details/wiki-archiveteam.org here].


'''Is there a mirror of the archiveteam.org website?'''
There are no mirrors we know of, although we encourage our more paranoid or protective readers to maintain one based on the above dumps.


There is a backup from August 03, 2011 available. The main things that are not included are: Site history, Edit & source of the pages, Special pages and other minor links. (See "Not Crawled.txt") [http://www.archive.org/details/ArchiveTeamsiteRip Click here to download.]
= Who are y'all? =
 
'''Does Archive Team have any social media accounts?'''
 
Follow us on Twitter ([https://twitter.com/archiveteam @archiveteam] and [https://twitter.com/at_warrior @at_warrior]), join [https://www.reddit.com/r/archiveteam /r/archiveteam] on Reddit, and [https://www.facebook.com/ArchiveTeam like us on Facebook]. These accounts are run by selected volunteers and may not be monitored for questions.
 
(There is a [https://groups.google.com/forum/?fromgroups=#!forum/archive-team Google ArchiveTeam group] but it is not used.)
 
'''Who's the administrator?'''
 
For a list of administrators, see [[Tracker#People]] which has a table at the bottom of the page.
 
'''I went through the wiki and I still have a question! How do I contact the Archive Team?'''
 
Join us on [[IRC|IRC!]] For general inquiries, visit [ircs://irc.hackint.org:6697/archiveteam #archiveteam] on hackint. Email can be sent to [mailto:archiveteam@archiveteam.org archiveteam@archiveteam.org].
 
This wiki is ''not'' monitored for questions.
 
If a FAQ should appear here, please add it.
 
= Notes =
<references/>




{{Navigation pager
| previous = Recommended Reading
}}
{{Navigation box}}

Revision as of 03:13, 25 October 2021

How can I help?

See Who We Are, Deathwatch, and Category:Projects_status. These pages describe our projects and the things you can do to help.

Is the Archive Team affiliated with the Internet Archive (archive.org)?

No. A few members are affiliated, but the majority of Archive Team members are volunteers who help when they are not busy with work or school.

Why is ArchiveTeam crawling my site / disrespecting robots.txt?

A detailed manifesto is located at Robots.txt. Please read it first and contact us through IRC (described in an answer below) before taking drastic action. We cooperate!

If you notice the crawler's user-agent is "ArchiveBot", please see ArchiveBot.

How should I go about backing things up?

What would you like to back up? If you want to mirror or back up a website, the de facto tool is Wget (but there are many more; see Software!). WARC files are highly recommended as they can be ingested by the Wayback Machine.

If you want to back up your personal files, "List of backup software" at Wikipedia is an extensive list of backup software. See Backup Tips as well!
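
As a rough illustration of the Wget-plus-WARC advice above, here is a minimal sketch that only assembles a command line; it does not run anything. The function name, the example URL, the WARC name, and the one-second wait are all assumptions chosen for illustration, and the options used are standard GNU Wget flags.

```python
import shlex


def build_wget_warc_command(url, warc_name, wait_seconds=1):
    """Return an argv list for a polite Wget mirror that also writes a WARC."""
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS, and scripts
        f"--wait={wait_seconds}",    # pause between requests (assumed value)
        f"--warc-file={warc_name}",  # Wget writes warc_name.warc.gz by default
        url,
    ]


# Hypothetical site and output name, for illustration only.
command = build_wget_warc_command("https://example.com/", "example.com-2021")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Wget is installed. See the Wget page for the options Archive Team actually recommends.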

Where do all the saved files go?

Files are ultimately uploaded to Internet Archive on the archiveteam collection. Archive Team relies on Internet Archive for storing the files.

How do I access the stuff you archived?

Usually, the content we archived is available in the Wayback Machine, and this is generally the recommended way of accessing it. However, in some cases, this will not work as you might expect. If the obvious plug a URL into the WBM doesn't work, check whether the wiki page for the specific project has more information.

What are these WARC files in the Internet Archive? How do I extract files from a WARC file?

WARC files are the de facto medium for digital preservation of the web. These WARC files are ingested by the Wayback Machine. WARC files are not simple zip files; they're designed to record metadata (such as HTTP headers and capture dates) along with the content.

There is a growing number of tools that can manipulate WARC files in The WARC Ecosystem.
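
To make the "not simple zip files" point concrete, here is a small sketch that parses one uncompressed WARC record by hand. It is an illustration under simplifying assumptions, not a replacement for the tools in The WARC Ecosystem: the sample record is made up, it omits mandatory fields such as WARC-Record-ID, and real files are usually gzip-compressed record by record.

```python
def parse_warc_record(raw):
    """Split one uncompressed WARC record into (version, headers, payload)."""
    # The WARC header block ends at the first blank line.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the exact size of the payload that follows.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# A made-up, simplified record: the payload is a captured HTTP response.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
sample_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2021-10-25T03:13:00Z\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)

version, headers, body = parse_warc_record(sample_record)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
```

Each record carries its own metadata (type, target URI, capture date) next to the raw HTTP exchange, which is what lets the Wayback Machine replay a capture.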

halp pls halp

I uploaded a WARC file but why doesn't it show up in Wayback Machine?

To ensure content integrity, items with WARC files must have the mediatype set to "web" and be uploaded by a whitelisted Internet Archive account for them to be ingested by the Wayback Machine.

I think there is a web site that's going to shut down / sun set / end its incredible journey. Can you save it?

Yes, do tell us as soon as possible! Don't just Tweet about it; do something. Get things done. Talk to us on IRC to let us know. There is also a /r/shutdown subreddit.

For small websites, see ArchiveBot. For large websites, the Warrior may need to be deployed. Please take a look at Dev for learning how the Warrior works.

I saved/archived some stuff from a website / ftp server. Do you want it? Where should I upload it?

Great, yes, and https://archive.org! Tag your items with the subject keyword "archiveteam" and let us know so we can move them under the ArchiveTeam collection.

P.S. Creating an account on Internet Archive is free and should be the first thing that comes to mind for archive files. File hosting sites like Putfile, Megaupload, etc. are not suitable for hosting archives!

I lost my stuff from Geocities/Tabblo/Posterous/some web host! Where can I get it back?

Try searching the wiki for a page about the specific website to find out more about what happened. Typically, there are several ways of recovering files from the Internet Archive:

  • A specially crafted username lookup page created by Archive Team
    • Allows you to search by your username and will present the relevant materials. Only a small set of projects have this feature.
  • The Internet Archive's Wayback Machine
    • This method is the easiest for most users but some web pages take months to show up in the Wayback Machine.
  • Individual WARC Files uploaded to the Internet Archive
    • This method is the most accurate but requires power-user skills for working with WARC files. Also, WARC files produced by the Internet Archive are not publicly available (but the ones by Archive Team are always available).

For details, see Restoring.

I need help running the Warrior or scripts. I think it's broken.

See the FAQ in the Warrior wiki page.

We Are Not The Internet Archive

How do I upload something to the Internet Archive (archive.org)?

You can either use the HTML5/Flash uploader or use an alternative method.

Can someone remove or fix something on the Internet Archive?

Possibly. Keep in mind that the majority of Archive Team members are volunteers who are not affiliated with the Internet Archive, so requests should go to Internet Archive staff instead. If exceptional circumstances arise, please see the question about contacting Archive Team below.

How do I get the original file on the Wayback Machine?

Add id_ after the date in the URL. For example: web.archive.org/web/20090119040418id_/http://www.archiveteam.org/index.php?title=Main_Page
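
The rule above is simple enough to script. The sketch below shows two hypothetical helpers (the function names and the `https://` scheme are assumptions, not part of any official API): one builds an `id_` URL from a timestamp and an original URL, the other inserts `id_` into an existing Wayback Machine URL.

```python
import re


def wayback_original_url(timestamp, url):
    """Build a Wayback Machine URL that returns the originally captured file."""
    return f"https://web.archive.org/web/{timestamp}id_/{url}"


def add_id_flag(wayback_url):
    """Insert id_ after the timestamp of an existing Wayback Machine URL."""
    # Only the first /web/<digits>/ segment is the capture timestamp.
    return re.sub(r"/web/(\d+)/", r"/web/\g<1>id_/", wayback_url, count=1)


original = wayback_original_url(
    "20090119040418",
    "http://www.archiveteam.org/index.php?title=Main_Page",
)
print(original)
```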

How do I search the contents of the Wayback Machine?

Unfortunately, you can't. However, the Internet Archive provides API access (designed for programmers and power users) to the Wayback Machine and to the CDX database.
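
For readers who want to try the CDX route, here is a minimal sketch of building a query against the Wayback Machine CDX server. The helper names and default values are assumptions for illustration; the endpoint and the `url`, `output`, `limit`, and `matchType` parameters are taken from the Internet Archive's CDX server documentation as best understood here, so check that documentation before relying on them. Note that this looks up captures by URL; it is not a full-text search.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit=10, match_type="exact"):
    """Return a CDX query URL listing captures of `url` as JSON."""
    params = {
        "url": url,
        "output": "json",         # JSON rows instead of space-separated text
        "limit": limit,           # cap the number of rows returned
        "matchType": match_type,  # "exact", "prefix", "host", or "domain"
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def fetch_captures(url, limit=10):
    """Fetch capture rows; in JSON output the first row holds the field names."""
    with urlopen(build_cdx_query(url, limit=limit)) as response:
        return json.load(response)


query = build_cdx_query("archiveteam.org", limit=5, match_type="prefix")
print(query)
```

`fetch_captures` is defined but not called here, so the example makes no network request unless you invoke it yourself.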

Why does the Wayback Machine follow robots.txt in a way that I don't like?

Because it makes the lawyers go away. We're not the Internet Archive. Don't ask us, ask them.

Can I upload $COPYRIGHTED_THING to the Internet Archive?

Although the Internet Archive prefers freely-redistributable content, they also accept still-in-copyright things. If there's a valid complaint/DMCA takedown request, they'll simply make the item private, but they will not delete the data. Having said that, the Internet Archive is not The Pirate Bay, so please don't treat it as such.

Can I save big files with the Save Now feature?

No, files larger than 200 MB will not be saved correctly.

How redundant is Archiveteam?

Is there a backup of the data on the archiveteam.org website? If so where can I download it?

Two sets of backups of this wiki are available. There are backups done by the hosting provider (several, going back days and weeks as well as hours), which use the storage capability of the shared hosting to keep them automatically (no tape or disk backups being done as most people would think of them). There are similarly copies of the database kept going back months.

Additionally, an XML dump of the MediaWiki database (which can be imported into any MediaWiki installation) is accessible at https://www.archiveteam.org/dumps. New backups are currently pushed out once a week (and the frequency will be increased if changes on the site require it). Our entire images directory is available at https://www.archiveteam.org/images.

Dumps of the ArchiveTeam Wiki are generated with WikiTeam tools and uploaded to the Internet Archive quite regularly. You can find them here and older ones here.

Is there a mirror of the archiveteam.org website?

There are no mirrors we know of, although we encourage our more paranoid or protective readers to maintain one based on the above dumps.

Who are y'all?

Does Archive Team have any social media accounts?

Follow us on Twitter (@archiveteam and @at_warrior), join /r/archiveteam on Reddit, and like us on Facebook. These accounts are run by selected volunteers and may not be monitored for questions.

(There is a Google ArchiveTeam group but it is not used.)

Who's the administrator?

For a list of administrators, see Tracker#People which has a table at the bottom of the page.

I went through the wiki and I still have a question! How do I contact the Archive Team?

Join us on IRC! For general inquiries, visit #archiveteam on hackint. Email can be sent to archiveteam@archiveteam.org.

This wiki is not monitored for questions.

If a FAQ should appear here, please add it.

Notes