Distributed recursive crawls

This is a project to recursively crawl large websites whose structure cannot easily be split into work items the way we usually do on DPoS projects. It is somewhat comparable to ArchiveBot in that crawls are started manually for specific sites of interest.
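
The crawl loop itself is conceptually simple. Below is a minimal sketch of what a single worker's recursive crawl might look like, written in Python with the requests and BeautifulSoup libraries. The start URL, delay and page limit are placeholders, and this is not the project's actual tooling, only an illustration of a breadth-first, rate-limited crawl confined to one host.

  import time
  from collections import deque
  from urllib.parse import urljoin, urlparse

  import requests
  from bs4 import BeautifulSoup

  def crawl(start_url, delay=1.0, max_pages=1000):
      """Breadth-first crawl of a single site, with a polite delay between requests."""
      allowed_host = urlparse(start_url).netloc
      queue = deque([start_url])
      seen = {start_url}
      fetched = 0

      while queue and fetched < max_pages:
          url = queue.popleft()
          try:
              resp = requests.get(url, timeout=30)
          except requests.RequestException:
              continue  # a real crawler would retry or log failed URLs
          fetched += 1

          # Only parse HTML responses for further links
          if "text/html" in resp.headers.get("Content-Type", ""):
              soup = BeautifulSoup(resp.text, "html.parser")
              for tag in soup.find_all("a", href=True):
                  link = urljoin(url, tag["href"])
                  # Stay on the same host and avoid revisiting URLs
                  if urlparse(link).netloc == allowed_host and link not in seen:
                      seen.add(link)
                      queue.append(link)

          time.sleep(delay)  # simple rate limiting; distributed workers spread this load

  if __name__ == "__main__":
      crawl("https://example.com/")  # placeholder start URL

In a distributed setup, the queue and the set of seen URLs would live in some shared tracker so that several such workers can divide one crawl between them; how this project coordinates its workers is not described on this page.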

Useful due to

  • rate limiting on the target site
  • the site running on slow hardware
  • geo-restrictions on access
  • the site being very large
  • the site blocking ArchiveBot
  • the site still being updated but abandoned
  • features missing in ArchiveBot
    • automatic open directory scanning

Candidates

Some sites that are candidates for this project: