HTTrack options
Good options for mirroring a large-ish site with httrack (requires 2GB of RAM). Works well on my Dell 2850 with 4GB of RAM:
httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 'http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos'
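- ignores robots.txt (-s0)
- allows for a queue of up to 500 million unfetched URLs (-#L500000000)
- sends a custom User-Agent string (-F)
- pretty fast: up to 80 simultaneous connections (--sockets=80), at most 50 new connections per second (--connection-per-second=50)
- rewrites links so the mirror works offline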
NOTE: remove the "-n" if you only want to mirror the site in question. Leave it in to also grab the non-HTML files (images and so on) that pages pull in from neighbouring sites, so the pages still render completely if the internet goes away.
NOTE: httrack seems to be limited to 2GB of RAM. It is written in C, not Java, so my guess is that this is the address-space limit of a 32-bit process. Not sure if a 64-bit build will allow for a larger crawl queue.
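
The one-liner is hard to read and edit, so here is a sketch of the same command as a bash array with one flag per line. The flags, User-Agent string and URL are copied from the command above. The variable names (ua, url, args) and the RUN switch are made up for this example, and the comments reflect my reading of the httrack help text, so check them against `httrack --help` before relying on them. By default the script only prints the command; set RUN=1 to actually start the mirror.

```shell
#!/usr/bin/env bash
# Sketch only: same options as the one-liner, split up so they can be
# commented and toggled. ua, url, args and RUN are hypothetical names.

ua='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5'
url='http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos'

args=(
  --connection-per-second=50   # at most 50 new connections per second
  --sockets=80                 # up to 80 simultaneous connections
  --keep-alive                 # reuse connections where the server allows it
  --display                    # progress/log output options,
  --verbose                    # kept exactly as in the
  --advanced-progressinfo      # original command
  --disable-security-limits    # lift httrack's built-in bandwidth/connection caps
  -n                           # also fetch non-HTML files "near" a page; drop to stay on one site
  -i                           # continue an interrupted mirror using the cache
  -s0                          # never follow robots.txt
  -m                           # kept as in the original command
  -F "$ua"                     # custom User-Agent
  -#L500000000                 # maximum number of links (500 million)
  "$url"
)

if [ "${RUN:-0}" = 1 ]; then
  httrack "${args[@]}"
else
  # Dry run: print the command with shell quoting so it can be pasted back.
  printf '%q ' httrack "${args[@]}"
  printf '\n'
fi
```

To mirror only the site in question, delete the -n line and leave the rest of the array alone.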