Difference between revisions of "Dev/New Project"
Revision as of 12:22, 5 December 2013
Starting a new project is a giant leap into getting things done.
Website Structure
Take a good look at how the website is structured:
- Is everything hosted under one domain name?
- Is there a throttling system?
- How can I discover usernames?
- Is there an API?
- Is there a sitemap.xml?
- Can I guess URLs by incrementing a value?
JavaScript
JavaScript is a pain.
- Check to see if there's a noscript or mobile version.
- Use a web inspector to observe its behavior and simulate POST requests made by the scripts.
- Scrape URLs from JavaScript templates with regular expressions.
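As a rough illustration of the regular-expression approach, the sketch below pulls quoted absolute URLs and root-relative paths out of a script. The pattern, the function name, and the sample script are assumptions made for this example; real sites will usually need a pattern tuned to their own templates.

```python
import re

# Matches a quoted string that is either an absolute http(s) URL or a
# root-relative path. Deliberately loose; expect some false positives.
URL_PATTERN = re.compile(r"""["'](https?://[^"'\s]+|/[^"'\s]*)["']""")


def scrape_urls(script_text):
    """Return the unique URL-like strings in script_text, in order of appearance."""
    found = []
    for match in URL_PATTERN.finditer(script_text):
        url = match.group(1)
        if url not in found:
            found.append(url)
    return found


# Hypothetical script content for demonstration only.
SAMPLE_SCRIPT = '''
var avatar = "http://static.example.com/img/avatar.png";
$.post('/ajax/load_comments', {page: 2});
var tpl = "/user/{{name}}/photos";
'''

if __name__ == "__main__":
    for url in scrape_urls(SAMPLE_SCRIPT):
        print(url)
```

Template placeholders such as `{{name}}` come through unexpanded, so the grab script still has to substitute real values before requesting those URLs.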
Static Assets
Websites sometimes do not host static media, such as images and stylesheets, under their primary domain name. Be sure to take those into consideration.
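One way to spot such hosts is to collect the URLs referenced by a few sample pages and list every hostname that is not the primary domain or one of its subdomains. This is only a sketch; the function name and the sample URLs are invented for illustration.

```python
from urllib.parse import urlsplit


def external_hosts(urls, primary_domain):
    """Return the sorted hostnames in urls that fall outside primary_domain."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # None for relative URLs
        if not host:
            continue
        if host == primary_domain or host.endswith("." + primary_domain):
            continue
        hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    # Hypothetical URLs as they might appear in a page's HTML.
    sample = [
        "http://example.com/index.html",
        "http://www.example.com/style.css",
        "http://img.example-cdn.net/photos/1.jpg",
        "/relative/path.png",
    ]
    print(external_hosts(sample, "example.com"))
```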
IP Address Bans & Throttling
Find out if there is IP address banning. Use a sacrificial IP address if you need to.
Items
Once you determine the website structure, you need to determine how to split jobs efficiently by item name.
Because the Tracker uses Redis as its database, the maximum number of items supported ranges from 5,000,000 to 10,000,000.
- If a user site is USERNAME.example.com, a good candidate is USERNAME.
  - Be careful of large subdomain sites.
- If the content is addressed by some ID, consider whether ranges of IDs are appropriate.
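To make the ID-range idea concrete, the sketch below works out how many IDs each item must cover to stay under a given item limit, then generates the item names. The `START-END` naming scheme and both function names are assumptions chosen for this example, not a format the Tracker requires.

```python
def min_ids_per_item(total_ids, max_items):
    """Smallest range size that keeps the item count at or below max_items."""
    return -(-total_ids // max_items)  # ceiling division


def id_range_items(first_id, last_id, ids_per_item):
    """Return item names such as '1-100' covering first_id..last_id inclusive."""
    items = []
    for start in range(first_id, last_id + 1, ids_per_item):
        end = min(start + ids_per_item - 1, last_id)
        items.append(f"{start}-{end}")
    return items


if __name__ == "__main__":
    # A hypothetical site with one billion IDs and a 5,000,000 item budget.
    print(min_ids_per_item(1_000_000_000, 5_000_000))
    print(id_range_items(1, 250, 100))
```

Larger ranges mean fewer items for the Tracker to hold but longer-running jobs, so the range size is a trade-off rather than a fixed rule.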
Writing Grab Scripts
Take a look at writing Seesaw scripts.