Software
Revision as of 20:51, 15 November 2011 by Nemo bis (→General Tools: Wget with WARC output)
General Tools
- GNU Wget
- Backing up a WordPress site: "wget --no-parent --no-clobber --html-extension --recursive --convert-links --page-requisites --user=<username> --password=<password> <path>"
- Wget with WARC output
- cURL
- HTTrack - HTTrack options
- Heritrix -- what archive.org uses
- Pavuk -- a bit flaky, but very flexible
- Warrick (http://warrick.cs.odu.edu/warrick.html) -- reconstructs lost websites from web archives and search engine caches
- Beautiful Soup - Python library for web scraping
- Scrapy - Fast Python framework for web crawling and scraping
- Splinter - Web app acceptance testing library for Python -- could be used along with a scraping lib to extract data from hard-to-reach places
- WiLiSe WikiLink Search - Python script that finds links to specific pages of a site by searching a MediaWiki wiki that has api.php accessible or the LinkSearch extension enabled (the project is still very immature, and at the moment the code is only available in this SVN repository).
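The Wget backup command in the list above can also be assembled from a script, which helps when the same options are reused across many sites. The following is only a rough sketch, not a tested backup tool: the function name `build_wget_command` and its parameters are made up for illustration, and the example URL and credentials are placeholders. It assumes a Wget build that supports `--warc-file` (mainline Wget 1.14 or later) when WARC output is requested, and it assumes that Wget ignores `--no-clobber` when writing a WARC, so that flag is left out in that case.

```python
import shlex


def build_wget_command(url, username=None, password=None, warc_name=None):
    """Return a wget argument list for a recursive site backup.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    argv = [
        "wget",
        "--no-parent",        # never ascend above the starting directory
        "--recursive",        # follow links within the site
        "--html-extension",   # append .html to HTML pages that lack it
        "--convert-links",    # rewrite links for offline browsing
        "--page-requisites",  # also fetch images, CSS and scripts
    ]
    if warc_name is None:
        # Plain mirror: keep existing local files, as in the command above.
        argv.append("--no-clobber")
    else:
        # Assumption: --no-clobber is dropped because wget is expected to
        # disable it when WARC output is enabled.
        argv.append("--warc-file=" + warc_name)
    if username is not None:
        argv.append("--user=" + username)
    if password is not None:
        argv.append("--password=" + password)
    argv.append(url)
    return argv


if __name__ == "__main__":
    # Placeholder URL and credentials; print the command instead of running it.
    cmd = build_wget_command("http://example.com/blog/", "alice", "secret")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to start the download. Building a list rather than a single string avoids shell-quoting problems with passwords or URLs that contain special characters.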
Hosted tools
Pinboard is a convenient social bookmarking service that will archive copies of all your bookmarks for online viewing. The catch is that it costs $9.25 just to join, plus $25/year for the archival feature, and you can download archives of only your 25 most recent bookmarks in a particular category. This may pose problems if you ever need to get your data out in a hurry.