Distributed recursive crawls
Distributed recursive crawls | |
Status | Special case |
Archiving status | On hiatus |
Archiving type | DPoS |
Project source | grab-grab |
Project tracker | grab |
IRC channel | #Y (on hackint) |
Data | archiveteam_grab |
This project recursively crawls large websites that lack a clear structure and therefore cannot easily be split into work items the way DPoS projects usually are. It is somewhat comparable to ArchiveBot (AB) in that crawls are started manually for specific sites of interest.
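As a rough illustration of the idea, the sketch below treats every discovered URL as its own work item and feeds newly found same-host links back into a queue. It is not the project's actual code: `crawl`, `fetch_links`, `FAKE_SITE` and the `example.org` URLs are hypothetical names chosen for the example, and the in-memory site stands in for real HTTP fetching so the block runs on its own.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def crawl(seed, fetch_links, max_items=1000):
    """Breadth-first crawl: each discovered same-host URL becomes a new work item.

    fetch_links is a hypothetical callable returning the hrefs found on a page.
    """
    host = urlsplit(seed).netloc
    queue = deque([seed])   # work items waiting to be processed
    seen = {seed}           # every URL ever queued, to avoid duplicates
    done = []               # URLs processed, in order
    while queue and len(done) < max_items:
        url = queue.popleft()
        done.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop the fragment.
            link = urljoin(url, href).split("#", 1)[0]
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return done


# Hypothetical in-memory site used instead of real network requests.
FAKE_SITE = {
    "https://example.org/": ["/a", "/b", "https://other.example/x"],
    "https://example.org/a": ["/b", "/c#top"],
    "https://example.org/b": ["/"],
    "https://example.org/c": [],
}


def fake_fetch_links(url):
    return FAKE_SITE.get(url, [])


if __name__ == "__main__":
    for page in crawl("https://example.org/", fake_fetch_links):
        print(page)
```

In a distributed setting the queue and the `seen` set would presumably live on a shared tracker rather than in one process, but that part is an assumption and is not shown here.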
Candidates
Some sites that are candidates for this project:
- MoinMoin instances - rate limited and slow, and AB invents non-sequential diff URLs
- https://brianwilson.websitetoolbox.com/ - embedded in brianwilson.com - owner died - rate limited to ~40-50 req/s - done in AB, but the site was undetectably disabled for some of that time, and there have been new posts since then
- https://forum.hardware.fr/ - more than 100 million messages - related to a site that is potentially in danger
- https://www.flamesofwar.com/ https://events.battlefront.co.nz/ https://forcesv4.flamesofwar.com/ https://forces.team-yankee.com/ https://forces.flamesofwar.com/ https://battlerankings.com/ https://www.battlefrontgroup.com/ https://www.team-yankee.com/ https://battlefront-community.com/ - fails in AB - shutting down
- https://www.blogalia.com/ - abandoned blogging platform that is still in use, so it needs a continuous project
- encode.su - large forum that needs a moderately large delay
- http://bmwfans.info/ - notoriously unstable and slow but contains very valuable information
- https://clay.earth/ - acquired - huge and rate limited
- https://lists.gnu.org/ - very large and needs a continuous project, maybe just as part of Mailing Lists
- https://www.premio.nl/
- https://wiseupaction.info/ - rate limiting
- https://gobiernu.cw/ - rate limiting
- https://ellieirons.com/ - slow
- https://akb.au.int/ - rate limiting
- https://issues.mediagoblin.org/ - rate limiting - almost finished in AB
- https://ukraineforum.de/ - needs to run very slowly to avoid refused connections
- https://jimfitzpatrick.com/ - gets 409s if run too fast (which is still way slower than the default speed)
- https://www.mordorintelligence.com/ - 429s even with incredibly long delays
- https://www.whosampled.com/
- https://blog.csdn.net/ - too large for AB
- https://transfer.archivete.am/VUfaK/sina.com.cn-subdomains.txt - too large/slow for AB
- https://forum.ixbt.com/ - too large for AB
- https://www.buholegal.com/ and https://buholegal.com/ - too slow for AB
- https://revistas-colaboracion.juridicas.unam.mx/ - too slow for AB
- https://www.energia.ru/ - not accessible outside Russia
- https://www.culture.ru/ - huge and rate limited
- https://www.abgeordnetenwatch.de/ - rate limited
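
Many of the candidates above are rate limited or need long delays. One possible way to handle that is an adaptive per-site delay that grows when the server pushes back and shrinks again when requests succeed. The sketch below is only an illustration under that assumption: `next_delay`, its parameters and its default values are hypothetical and are not taken from the project's code.

```python
def next_delay(current, status, base=1.0, factor=2.0, ceiling=300.0,
               throttle_statuses=frozenset({429, 503})):
    """Return the delay in seconds to wait before the next request to a site.

    status is the HTTP status code of the last response, or None if the
    connection was refused or timed out. All defaults are illustrative.
    """
    if status is None or status in throttle_statuses:
        # Back off: multiply the delay, but never exceed the ceiling.
        return min(max(current, base) * factor, ceiling)
    # Success: relax the delay gradually, but never go below the base.
    return max(base, current / factor)


if __name__ == "__main__":
    delay = 1.0
    for status in (200, 429, 429, None, 200, 200):
        delay = next_delay(delay, status)
        print(status, delay)
```

A site that signals throttling with a different status code, such as the 409s noted above, could be handled by passing a different `throttle_statuses` set.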