Dev/New Project

From Archiveteam
Revision as of 12:22, 5 December 2013 by Chfoo (talk | contribs) (fill in some website structure notes)
This page is a work in progress.

Starting a new project is a giant leap into getting things done.

Website Structure

Take a good look at how the website is structured:

  • Is everything hosted under one domain name?
  • Is there a throttling system?
  • How can I discover usernames?
  • Is there an API?
  • Is there a sitemap.xml?
  • Can I guess URLs by incrementing a value?
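
Two of the questions above (the sitemap and the incrementing value) can be sketched in a few lines of Python. This is only an illustration: the function names, the example.com URLs, and the `{id}` placeholder are assumptions and are not part of any Archive Team tool.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def urls_by_increment(template, start, stop):
    """Yield candidate URLs for the IDs start..stop-1.

    `template` is a hypothetical URL pattern containing an {id} placeholder.
    """
    for value in range(start, stop):
        yield template.format(id=value)
```

For example, `list(urls_by_increment("http://example.com/post/{id}", 1, 4))` produces the URLs for posts 1, 2, and 3. Whether the guessed URLs exist still has to be checked against the real site.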

JavaScript

JavaScript is a pain.

  • Check to see if there's a noscript or mobile version.
  • Use a web inspector to observe its behavior and simulate POST requests made by the scripts.
  • Scrape URLs from JavaScript templates with regular expressions.
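
As a rough sketch of the regular-expression approach, the following Python pulls quoted URLs out of a script or template. The pattern and the function name are assumptions made for illustration; real sites usually need a pattern tuned to their own templates, and this one misses URLs that are built by string concatenation or written with escaped slashes.

```python
import re

# Matches a quoted string that is either an absolute http(s) URL
# or a path starting with "/". Deliberately simple, not exhaustive.
URL_PATTERN = re.compile(r"""["'](https?://[^"'\s]+|/[^"'\s]+)["']""")


def scrape_urls(js_source):
    """Return the unique URLs found in js_source, in order of first appearance."""
    matches = (m.group(1) for m in URL_PATTERN.finditer(js_source))
    return list(dict.fromkeys(matches))
```

Paths that start with "/" are returned as found, so they still need to be joined to the site's base URL before fetching.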

Static Assets

Websites sometimes do not host static media, such as images and stylesheets, under their primary domain name. Be sure to take those other domains into consideration.
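
One way to spot such domains is to count the hostnames of the asset URLs found on a sample page. The sketch below assumes the asset URLs have already been extracted; the function name and the example hostnames are hypothetical.

```python
from collections import Counter
from urllib.parse import urljoin, urlsplit


def hosts_used(page_url, asset_urls):
    """Count how many asset URLs resolve to each hostname.

    Relative and protocol-relative URLs are resolved against page_url first.
    """
    counts = Counter()
    for asset in asset_urls:
        host = urlsplit(urljoin(page_url, asset)).hostname
        if host:
            counts[host] += 1
    return counts
```

Any hostname in the result other than the page's own is a candidate to include in the grab.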

IP Address Bans & Throttling

Find out whether the site bans or throttles IP addresses that make too many requests. Test from a sacrificial IP address if you need to.
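
When probing, it helps to label responses consistently. The following is a minimal sketch under the assumption that the site signals rate limiting with standard HTTP status codes; many sites instead return a normal 200 page with a CAPTCHA or an error message, which this does not detect. The function name and the labels are made up for illustration.

```python
def classify_response(status_code, headers):
    """Label a response as 'throttled', 'possibly banned', or 'ok'.

    `headers` is a mapping of response header names to values.
    """
    names = {name.lower() for name in headers}
    if status_code == 429:
        # 429 Too Many Requests is the standard rate-limit status.
        return "throttled"
    if status_code == 503 and "retry-after" in names:
        # A 503 that asks the client to retry later often means overload or throttling.
        return "throttled"
    if status_code == 403:
        # A 403 on a page that used to load may indicate a ban; confirm by hand.
        return "possibly banned"
    return "ok"
```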

Items

Once you determine the website structure, you need to determine how to split the jobs up efficiently by item name.

Because the Tracker uses Redis as its database, the maximum number of items supported ranges from 5,000,000 to 10,000,000.

  • If a user site is USERNAME.example.com, a good candidate is USERNAME.
    • Be careful of large subdomain sites.
  • If the content is addressed by some ID, consider whether ranges of IDs are appropriate.
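
As a sketch of the ID-range idea, the Python below turns a span of numeric IDs into item names and works out how large each range must be to stay under an item limit. The "START-END" naming scheme and the function names are assumptions, not a Tracker requirement.

```python
def range_items(first_id, last_id, ids_per_item):
    """Yield item names such as '1-1000' covering first_id..last_id inclusive."""
    for start in range(first_id, last_id + 1, ids_per_item):
        end = min(start + ids_per_item - 1, last_id)
        yield "{}-{}".format(start, end)


def smallest_range_size(total_ids, max_items):
    """Return the smallest IDs-per-item that keeps the item count <= max_items."""
    return -(-total_ids // max_items)  # ceiling division
```

For instance, a site with 1,000,000,000 IDs and a ceiling of 5,000,000 items (the lower bound given above) needs at least `smallest_range_size(1000000000, 5000000)`, that is 200, IDs per item.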

Writing Grab Scripts

Take a look at writing Seesaw scripts.

Call for Action

Wiki Page

Repo & Source Code

IRC Channel

Developer Documentation