URLs

URL: https://url.spec.whatwg.org/
Status: Special case
Archiving status: In progress...
Archiving type: DPoS
Project source: urls-grab, urls-sources
Project tracker: urls
IRC channel: #// (on hackint)
Project lead: arkiver
Data: archiveteam_urls

The URLs project is a continuous, generic, best-effort project to archive random URLs from a variety of sources, including external links discovered in other projects (such as Reddit and Telegram), news sites and feeds of interest crawled regularly from urls-sources, and lists queued manually in the IRC channel.

Important note: If you run this project, you'll likely see your IP get banned from Facebook, Instagram, YouTube, etc., and using those sites may become difficult (e.g. constant captchas, forced login). Also, if you run at significant speed, you'll likely see abuse notices, IP blacklists, and so on.

How to help if you have lists of URLs

For other ArchiveTeam projects that can use this kind of help, see Projects requiring URL lists.

This project runs on lists of URLs. If you have a source of URLs, please:

  1. If the list exceeds a few megabytes, compress it, preferably using zstd -10 (see the example after this list).
  2. Give the file a descriptive name and upload it to https://transfer.archivete.am/.
  3. Share the resulting URL in the project IRC channel.
    • If you wish your list to remain private, please get in touch with a channel op (e.g. arkiver or JustAnotherArchivist). Items generated from your list will still be processed publicly, but they will be mixed in with all other items and channel logs will not associate them with you.
    • Please briefly describe the content of your list and why it should be archived! A sentence or two is fine.
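
A minimal sketch of the three steps above, assuming a hypothetical list file named my-url-list.txt and assuming that https://transfer.archivete.am/ accepts transfer.sh-style uploads via HTTP PUT:

  # Step 1: compress the list (only needed if it exceeds a few megabytes)
  zstd -10 my-url-list.txt    # writes my-url-list.txt.zst, keeping the original
  # Step 2: upload the compressed file under a descriptive name
  curl --upload-file my-url-list.txt.zst https://transfer.archivete.am/my-url-list.txt.zst
  # Step 3: the upload prints a download URL; share that URL in the project IRC channel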

Caveats

  • Lists containing large numbers of URLs on one host are not appropriate here. This project runs at very high speed and can easily DDoS a server. If you would like a crawl of a single website, please request it in #archivebot instead. (A quick way to check your list is sketched after this list.)
  • Lists containing extremely important or endangered URLs are not appropriate here either. This project is best-effort only and does not track whether archival succeeded. If you would like to monitor the status of a list as it is archived, please request it in #archivebot instead.
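
Before submitting, it may help to check how concentrated your list is on single hosts. A quick sketch, assuming one http(s) URL per line in a hypothetical file my-url-list.txt:

  # Count URLs per host, most common first; a list dominated by one host
  # belongs in #archivebot instead
  awk -F/ '{print $3}' my-url-list.txt | sort | uniq -c | sort -rn | head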