Good options to use for [[httrack]] to mirror a large-ish site.


== Quick copy and paste ==
 
<pre>httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -A100000000 -#L500000000 'YOURURL'</pre>


* ignores robots.txt
* allows for a queue of 500M unfetched URLs
* uses a custom user agent
* is pretty fast (uses several connections at once)
* will re-write links so they work offline


NOTE: remove the <code>-n</code> if you only want to mirror the site in question. Leave it in to grab everything off neighbouring sites, so pages still render completely if the rest of the internet goes away.

== A rundown of the previous options ==
 
* <code>--connection-per-second=50</code>: Allows HTTrack to open up to 50 new connections per second.
* <code>--sockets=80</code>: Opens up to 80 simultaneous connections (sockets). If this gives you errors, lower it to 48.
* <code>--disable-security-limits</code>, <code>-A100000000</code>: By default, HTTrack tries to play nicely with web servers and avoid overloading them by limiting the download speed to roughly 25 KB/s. On text-based sites this is normally fine, but it becomes a hassle when the site is image-heavy. The first option disables the forced limit; the second raises the bandwidth cap (<code>-A</code> is in bytes per second) to a very large value.
* <code>-s0</code>: Tells HTTrack to disobey [[robots.txt]].
* <code>-F</code>: Sets the user agent.
* <code>-#L500000000</code>: Raises the maximum number of links HTTrack will fetch to 500 million. Raise it further if needed.
* <code>-n</code>: Gets all nearby files (all files shown on a page, even when hosted elsewhere), rather than only those on the domain name, which is HTTrack's default behavior. A gentler variant of the full command is sketched below.
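
If the target is small, or you would rather stay within HTTrack's built-in politeness limits, a gentler variant of the same command is possible. This is a sketch only: the lower connection and socket counts are arbitrary choices, not from the original command. It drops <code>--disable-security-limits</code> and <code>-A</code> so the default rate limit stays in place, and drops <code>-n</code> so only the site in question is mirrored:

<pre>httrack --connection-per-second=4 --sockets=8 --keep-alive --display --verbose --advanced-progressinfo -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 'YOURURL'</pre>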
 
== Other options ==
 
* A full rundown of all possible options (including those on site structure) can be found at https://www.httrack.com/html/fcguide.html (two of the more useful ones are sketched below).
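
Two options from that guide are worth calling out, even though they are not used in the command above (so treat this as a sketch to check against the guide): <code>-O</code> sets the directory the mirror is written to, and trailing <code>+</code>/<code>-</code> patterns are scan rules that limit which URLs get fetched. With a placeholder URL and path:

<pre>httrack 'http://example.com/' -O '/data/mirrors/example.com' '+*.example.com/*' --verbose</pre>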
 


NOTE: httrack seems to be limited to 2GB of RAM. Not sure if a 64-bit build of it will allow for a larger crawl queue.
