Difference between revisions of "Dev/New Project"

(fill in some website structure notes)
(more writing)

Revision as of 14:28, 5 December 2013

Starting a new project is a giant leap into getting things done.

Website Structure

Take a good look at how the website is structured:

  • Is everything hosted under one domain name?
  • Is there a throttling system?
  • How can I discover usernames or page IDs?
  • Is there an API?
  • Is there a sitemap.xml?
  • Can I guess URLs by incrementing a value?
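If the site does publish a sitemap.xml, its URLs can be listed with a few lines of Python. This is only a sketch: it assumes the file follows the standard sitemaps.org schema, and the sample document and the function name sitemap_urls are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org schema (assumed; check the real file).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sample; a real sitemap would be downloaded from the site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/user/alice</loc></url>
  <url><loc>http://www.example.com/user/bob</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE):
        print(url)
```

A sitemap index uses the same loc element, so the function returns the child sitemap URLs in that case and can be run again on each of them.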

JavaScript

JavaScript is a pain.

  • Check to see if there's a noscript or mobile version.
  • Use a web inspector to observe its behavior and simulate POST requests made by the scripts.
  • Scrape URLs from JavaScript templates with regular expressions.
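As a rough illustration of the regular-expression approach, the sketch below pulls quoted absolute URLs and root-relative paths out of script source. The pattern is a heuristic, not a JavaScript parser, and the sample script and the name urls_in_script are hypothetical.

```python
import re

# Matches a quoted string that starts with http(s):// or with a slash.
# Heuristic only: it will miss URLs built from several pieces and may
# pick up strings that are not URLs at all.
URL_RE = re.compile(r"""["'](https?://[^"'\s]+|/[^"'\s]*)["']""")


def urls_in_script(source):
    """Return the unique URL-like string literals found in script source."""
    return sorted(set(URL_RE.findall(source)))


# Hypothetical script fragment for demonstration.
SAMPLE_JS = '''
var avatar = "http://static.example.com/avatars/" + id + ".png";
$.post('/api/comments', {page: 2});
var name = "alice";
'''

if __name__ == "__main__":
    for url in urls_in_script(SAMPLE_JS):
        print(url)
```

URL prefixes found this way, such as the avatar path in the sample, still need the variable part filled in from usernames or IDs discovered elsewhere.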

Static Assets

Websites sometimes do not host static media, such as images and stylesheets, under their primary domain name. Be sure to take those other hosts into consideration.
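One way to spot such hosts is to collect the asset URLs from a few pages and list the hostnames that differ from the primary one. A minimal sketch, assuming the URLs have already been extracted; the function name and the sample hosts are hypothetical.

```python
from urllib.parse import urlsplit


def external_asset_hosts(primary_host, asset_urls):
    """Return the sorted hostnames in asset_urls that differ from primary_host."""
    hosts = set()
    for url in asset_urls:
        # Relative URLs have no hostname and therefore stay on the primary host.
        host = urlsplit(url).hostname
        if host and host != primary_host:
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    found = external_asset_hosts(
        "www.example.com",
        [
            "http://www.example.com/a.png",
            "http://static.example.com/b.png",
            "//cdn.example.net/style.css",
            "/img/logo.png",
        ],
    )
    print(found)
```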

IP Address Bans & Throttling

Find out if there is IP address banning. Use a sacrificial IP address if you need to.

Items

Once you determine the website structure, you need to determine how to split the work efficiently into units, each identified by an item name. An item name is a short string describing the work unit, for example, a username.

Because the Tracker uses Redis as its database, memory usage is a concern. The maximum number of items supported ranges from 5,000,000 to 10,000,000 depending on the item name length.

  • If a user site is USERNAME.example.com, a good candidate is USERNAME.
    • Be careful of large subdomain sites.
  • If the content is addressed by a numerical ID, consider whether ranges of IDs are appropriate.
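To see how ID ranges keep the item count within the limit mentioned above, the following sketch generates range-based item names and counts them. The "START-END" name format and both function names are assumptions made for this example, not a Tracker requirement.

```python
def item_count(first_id, last_id, ids_per_item):
    """Number of items needed to cover first_id..last_id inclusive."""
    total_ids = last_id - first_id + 1
    return -(-total_ids // ids_per_item)  # ceiling division


def range_items(first_id, last_id, ids_per_item):
    """Yield item names such as '0-999' covering first_id..last_id inclusive."""
    if ids_per_item < 1:
        raise ValueError("ids_per_item must be at least 1")
    for start in range(first_id, last_id + 1, ids_per_item):
        end = min(start + ids_per_item - 1, last_id)
        yield f"{start}-{end}"


if __name__ == "__main__":
    print(list(range_items(0, 2499, 1000)))
    # 50,000,000 IDs as single-ID items would exceed the limit;
    # grouping 100 IDs per item brings the count down.
    print(item_count(1, 50_000_000, 100))
```

Larger ranges mean fewer items in the Tracker, but each item takes longer to complete and more work is lost if one fails, so the range size is a trade-off.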

Call for Action

Wiki Page

Ensure there is documentation on this wiki about the project.

Include:

  • an overview of the website
  • the shutdown notice
  • "how to help" instructions
  • a (future) link to the archives

Writing Grab Scripts

If you do not have permission to create a repository for Archive Team, please ask on IRC.

For detailed information about what goes inside grab scripts, take a look at writing Seesaw scripts.

IRC Channel

Archive Team uses per-project IRC channels to reduce noise in the main channel. They also serve as technical support channels.

IRC channel names must be humorous.

  • If an employee of the website in danger appears on the channel, please do cooperate.

Project Management

Successful projects are a result of successful management. See Project Management for details.

Getting Attention

Many Twitter followers? Got connections? Become a loudmouth!

Otherwise, encourage other team members to take initiative.

