Dev/New Project

Revision as of 02:50, 5 December 2017

Starting a new project is a giant leap into getting things done.

Website Structure

Take a good look at how the website is structured:

  • Is everything hosted under one domain name?
  • Is there a throttling system?
  • How can I discover usernames or page IDs?
  • Is there an API?
  • Is there a sitemap.xml?
  • Can I guess URLs by incrementing a value?
  • Does disabling cookies or using specific cookies affect anything?
  • Does the website break if you make special requests?
  • Can you Google site:example.com for some URLs?
    • Hint: site:example.com inurl:show_thread
  • Is it a video? Try get-flash-videos
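
Two of the questions above (the sitemap and the incrementing value) can be explored with a few lines of Python. The following is only a rough sketch: the function names, the `{id}` placeholder, and the example URLs are assumptions made for illustration, not part of any Archive Team tooling.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for sitemap.xml files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> URL found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def candidate_urls(template, start, stop):
    """Yield URLs made by substituting consecutive integer IDs into template.

    The template uses a hypothetical "{id}" placeholder, for example
    "http://example.com/show_thread?id={id}".
    """
    for page_id in range(start, stop):
        yield template.format(id=page_id)
```

For instance, `list(candidate_urls("http://example.com/show_thread?id={id}", 1, 4))` gives the three URLs for IDs 1, 2 and 3, which can then be requested by hand to see whether guessing works.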

JavaScript

JavaScript is a pain.

  • Check to see if there's a noscript or mobile version.
  • Use a web inspector to observe its behavior and simulate POST requests made by the scripts.
  • Scrape URLs from JavaScript templates with regular expressions.
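
As a rough illustration of the regular-expression approach, the sketch below pulls quoted URL-like string literals out of a script and resolves them against the page URL. The pattern and the function name are assumptions chosen for this example; real sites often need a pattern tuned to their templates, and escaped forms such as `http:\/\/` are not handled here.

```python
import re
from urllib.parse import urljoin

# Quoted string literals that look like absolute URLs, protocol-relative
# URLs, or root-relative paths. This is a deliberately loose heuristic.
URL_LITERAL = re.compile(r"""["']((?:https?:)?//[^"'\s]+|/[^"'\s/][^"'\s]*)["']""")


def urls_from_script(script_text, base_url):
    """Return URLs found in script_text, in order of appearance, without repeats."""
    found = []
    for match in URL_LITERAL.finditer(script_text):
        url = urljoin(base_url, match.group(1))
        if url not in found:
            found.append(url)
    return found
```

Expect false positives (any quoted string starting with a slash matches), so the results are best treated as candidates to check rather than a final URL list.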

Static Assets

Websites sometimes do not host static media, such as images and stylesheets, under their primary domain name. Be sure to take those into consideration.
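
One way to spot such hosts is to list the hostnames referenced by a saved page. The sketch below uses only the Python standard library; the class name, the function name, and the choice of tags (`img`, `script`, `link`) are assumptions for illustration and will miss assets referenced from CSS or loaded by scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AssetHostCollector(HTMLParser):
    """Collect the hostnames referenced by img, script and link tags."""

    # Tag name -> attribute that carries the asset URL.
    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                host = urlsplit(urljoin(self.base_url, value)).hostname
                if host:
                    self.hosts.add(host)


def external_asset_hosts(html_text, base_url):
    """Return the sorted asset hostnames that differ from the page's own host."""
    collector = AssetHostCollector(base_url)
    collector.feed(html_text)
    return sorted(collector.hosts - {urlsplit(base_url).hostname})
```

Any hostname this returns is a candidate to include in the grab alongside the primary domain.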

IP Address Bans & Throttling

Find out if there is IP address banning. Use a sacrificial IP address if you need to.

Items

Once you have determined the website structure, decide how to split the work into units efficiently, each identified by an item name. An item name is a short string describing the work unit, for example, a username.

Because the Tracker uses Redis as its database, memory usage is a concern. The maximum number of items supported ranges from 5,000,000 to 10,000,000 depending on the item name length.

  • If a user site is USERNAME.example.com, a good candidate is USERNAME.
    • Be careful of large subdomain sites.
  • If the content is by some numerical ID, consider whether ranges of IDs are appropriate.
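
To make the ID-range idea concrete, here is a minimal sketch that groups numerical IDs into range items. The "START-END" item name format and both function names are assumptions for this example, not a format required by the Tracker.

```python
def range_items(first_id, last_id, chunk_size):
    """Yield item names such as "1001-2000" covering first_id..last_id inclusive."""
    for start in range(first_id, last_id + 1, chunk_size):
        end = min(start + chunk_size - 1, last_id)
        yield f"{start}-{end}"


def parse_range_item(item_name):
    """Turn an item name such as "1001-2000" back into the IDs it covers."""
    start, end = item_name.split("-")
    return range(int(start), int(end) + 1)
```

With this scheme, 50,000,000 IDs grouped 100 per item become 500,000 items, comfortably below the item limits mentioned above, whereas one item per ID would exceed them.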

Call for Action

  • ProTip™: Get things done.

Wiki Page

Ensure there is documentation on this wiki about the project.

Include:

  • an overview of the website
  • the shutdown notice
  • "how to help" instructions
  • a (future) link to the archives

Writing Grab Scripts

If you do not have permission to create a repository for Archive Team, please ask on IRC.

For detailed information about what goes inside grab scripts, take a look at writing Seesaw scripts.

Tracker Access

If you do not have permission to access the Tracker, please see Tracker#People.

IRC Channel

Archive Team uses per-project IRC channels to reduce noise in the main channel. Each one also serves as a technical support channel.

IRC channel names must be humorous.

  • If an employee of the website in danger appears on the channel, please do cooperate.

Project Management

Successful projects are a result of successful management. See Project Management for details.

Getting Attention

Many Twitter followers? Got connections? Become a loudmouth!

Otherwise, take initiative yourself and encourage other team members to take initiative.
