Starting a new project is a giant leap into getting things done.
Take a good look at how the website is structured:
- Is everything hosted under one domain name?
- Is there a throttling system?
- How can I discover usernames or page IDs?
- Is there an API?
- Is there a sitemap.xml?
- Can I guess URLs by incrementing a value?
- Does disabling cookies or using specific cookies affect anything?
- Does the website break if you make special requests?
- Can you Google site:example.com for some URLs?
- Is it a video? Try get-flash-videos
- Check to see if there's a noscript or mobile version.
- Use a web inspector to observe its behavior and simulate POST requests made by the scripts.
Websites sometimes do not host static media such as images and stylesheets under their primary domain name. Be sure to take those into consideration.
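The sitemap and hosting questions above can be answered partly by script. The following is a minimal sketch, assuming the site publishes a standard sitemap.xml using the sitemaps.org namespace; the helper names `sitemap_urls` and `hosts_in` and the sample document are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlsplit

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def hosts_in(urls):
    """Count how many URLs fall under each hostname."""
    return Counter(urlsplit(url).hostname for url in urls)


# Hypothetical sample; in practice this text would come from the site's sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/user/alice</loc></url>
  <url><loc>http://example.com/user/bob</loc></url>
  <url><loc>http://static.example.net/img/logo.png</loc></url>
</urlset>"""

if __name__ == "__main__":
    for host, count in hosts_in(sitemap_urls(SAMPLE)).items():
        print(host, count)
```

A hostname that differs from the primary domain, like the static host in the sample, is a hint that media lives elsewhere and needs to be included in the grab.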
IP Address Bans & Throttling
Find out if there is IP address banning. Use a sacrificial IP address if you need to.
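When probing for throttling, it helps to recognize the responses that commonly signal it. This is a rough heuristic sketch, not a description of any particular website: it assumes the site signals rate limiting with HTTP 429, or with 503 plus a Retry-After header. Many sites instead ban silently or serve an error page with status 200, which this will not catch. The function names are hypothetical.

```python
def retry_after_seconds(headers):
    """Return the Retry-After delay in seconds, or None if absent or not an integer.

    Retry-After may also be an HTTP date; that form is not handled here.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("retry-after", "").strip()
    return int(value) if value.isdigit() else None


def looks_throttled(status, headers):
    """Heuristic: does this response suggest rate limiting?"""
    if status == 429:  # Too Many Requests
        return True
    lowered = {key.lower() for key in headers}
    # Assumption: a 503 carrying Retry-After is treated as a throttling signal.
    return status == 503 and "retry-after" in lowered


if __name__ == "__main__":
    print(looks_throttled(503, {"Retry-After": "120"}))
    print(retry_after_seconds({"Retry-After": "120"}))
```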
Once you determine the website structure, you need to determine how to split the work up efficiently into units identified by an item name. An item name is a short string describing the work unit, for example, a username.
Because the Tracker uses Redis as its database, memory usage is a concern. The maximum number of items supported ranges from 5,000,000 to 10,000,000 depending on the item name length.
- If a user site is USERNAME.example.com, a good candidate is USERNAME.
- Be careful of large subdomain sites.
- If the content is by some numerical ID, consider whether ranges of IDs are appropriate.
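For numerical IDs, grouping a range of IDs into one item keeps the item count down. The sketch below shows one possible naming scheme; the `range:START-END` format and both function names are assumptions made for illustration, not an established Tracker convention.

```python
def range_items(first_id, last_id, size, prefix="range"):
    """Yield item names such as 'range:0-999' covering first_id..last_id inclusive."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(first_id, last_id + 1, size):
        # The final item may be shorter than `size`.
        end = min(start + size - 1, last_id)
        yield f"{prefix}:{start}-{end}"


def ids_in_item(item_name):
    """Expand an item name back into the range of IDs it covers."""
    _, span = item_name.split(":", 1)
    start, end = span.split("-")
    return range(int(start), int(end) + 1)


if __name__ == "__main__":
    for name in range_items(0, 2499, 1000):
        print(name, len(ids_in_item(name)))
```

With 100 IDs per item, a site with 100,000,000 IDs needs 1,000,000 items, which stays below the 5,000,000 lower bound mentioned above. Short item names also help, since the limit depends on item name length.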
Call for Action
- ProTip™: Get things done.
Ensure there is documentation on this wiki about the project.
- an overview of the website
- the shutdown notice
- "how to help" instructions
- a (future) link to the archives
Writing Grab Scripts
If you do not have permission to create an Archive Team repository, please ask on IRC.
For detailed information about what goes inside grab scripts, take a look at writing Seesaw scripts.
If you do not have permission to access the Tracker, please see Tracker#People.
Archive Team uses per-project IRC channels to reduce noise in the main channel. Each project channel also serves as a technical support channel.
IRC channel names must be humorous.
- If an employee of the website in danger appears on the channel, please do cooperate.
Successful projects are a result of successful management. See Project Management for details.
Many Twitter followers? Got connections? Become a loudmouth!
Otherwise, take initiative yourself and encourage other team members to take initiative.