Difference between revisions of "Splinder"
Revision as of 00:14, 9 December 2011
Splinder | |
URL | http://www.splinder.com/ |
Status | Closing |
Archiving status | In progress... |
Archiving type | Unknown |
IRC channel | #archiveteam-bs (on hackint) |
Splinder.com has been the main blog hosting company in Italy for a while (see Wikipedia:it:Splinder). It was founded in 2001 and hosts about half a million blogs and over 55 million pages. Since 8 November 2011, a warning on the home page has stated that no new PRO accounts have been created since 1 June. The company has confirmed that the website will close on the 24th.[1]
Update: the company issued an official statement saying that the closure will happen on January 31, 2012.[2] According to our tracker, we have downloaded or assigned all users.
Upload status
For the time being, please ignore any errors caused by special characters in usernames (| ^ etc.); we'll get those profiles later.
Phase 1
Downloader | Count | Status |
---|---|---|
closure | 254869 | |
kenneth | 206696 | |
ndurner | 177665 | |
Nemo | 111340 | Uploaded with errors, some incomplete |
donbex | 71562 | |
dnova | 68620 | Uploaded; still downloading more |
underscor | 58774 | |
Wyatt | 54525 | |
crawl336 | 45785 | |
Angra | 35752 | |
cameron_d | 26357 | |
db48x | 23120 | Uploaded, three profiles not uploaded |
yipdw | 18789 | Most uploaded, re-doing some larger blogs with errors |
crawl338 | 17783 | |
crawl337 | 16784 | |
crawl334 | 15897 | |
Coderjoe | 13749 | |
bsmith093 | 13194 | |
DoubleJ | 10301 | Uploaded from all machines w/ no errors |
crawl339 | 9026 | |
anonymous | 8653 | |
kennethreitz | 8287 | |
alard | 7299 | Uploaded, one error |
dashcloud | 6803 | |
crawl333 | 6292 | |
spirit | 6282 | |
crawl335 | 6106 | |
Paradoks | 5890 | Uploaded, but still downloading it:scatto, which includes files.splinder.com (Among others). |
koon | 5029 | |
chronomex | 4913 | Partially Uploaded, moved house and has yet to get computers running |
VMB | 4620 | |
shoop | 4461 | |
marceloantonio1 | 2927 | Uploaded |
undercave | 2508 | |
DFJustin | 2456 | |
proub | 1178 | |
Hydriz | 842 | Uploaded |
canUbeatclosure | 669 | |
tef | 440 | |
arima | 347 | |
NotGLaDOS | 259 | |
sarpedon | 105 | |
pberry | 89 | |
Wyattq | 84 | |
soultcer | 74 | |
Konklone | 56 | |
PepsiMax | 12 | |
mareloantonio1 | 10 | Uploaded |
hrbrmstr | 9 | |
sente | 7 | |
rebiolca | 6 | |
2 | 5 | |
Wyatt-B | 3 | |
Wyatt-A | 2 | |
asdf | 2 |
How to help archiving
There is a distributed download script that gets usernames from a tracker and downloads the data.
Make sure you are on Linux and that you have curl, git, and a recent version of Bash. Your system must also be able to compile wget.
- Get the code:
git clone https://github.com/ArchiveTeam/splinder-grab
- Get and compile the latest version of wget-warc:
./get-wget-warc.sh
- Think of a nickname for yourself (preferably use your IRC name).
- Run the download script:
  - To run a single downloader, run
  ./dld-client.sh "<YOURNICK>"
  - To run multiple downloaders (and thus use your bandwidth more efficiently), do either:
    - simply run as many copies of dld-client.sh as you like, or
    - run
    ./dld-streamer.sh <YOURNICK> <N>
    where <N> is the number of concurrent downloads you want.
- To stop the script gracefully, run
touch STOP
in the script's working directory. It will finish the current task and stop.
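The STOP-file convention described above can be illustrated with a minimal sketch. This is not the code of dld-client.sh: run_worker and do_one_task are hypothetical names, and the task body is only a placeholder for downloading one user. It shows how a loop can check for the file between tasks so that the task in progress is never interrupted.

```shell
# Sketch of a worker loop that stops gracefully when a STOP file appears.
# run_worker and do_one_task are hypothetical names, not part of splinder-grab.

# Placeholder for one unit of work (in the real script: download one user).
do_one_task() {
  echo "task $1 done"
}

# Run up to max_tasks tasks, checking work_dir for STOP before each one.
run_worker() {
  local work_dir=$1 max_tasks=$2 i=1
  while [ "$i" -le "$max_tasks" ]; do
    if [ -e "$work_dir/STOP" ]; then
      echo "STOP found, exiting after $((i - 1)) tasks"
      return 0
    fi
    do_one_task "$i"
    i=$((i + 1))
  done
  echo "all $max_tasks tasks finished"
}
```

Because the check happens only between tasks, a worker that is in the middle of a large blog may keep running for a long time after the file is created.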
Notes
- Compiling wget-warc will require dev packages for the various libraries that it needs. Most questions have been about gnutls; install the gnutls-devel or gnutls-dev package with your favorite package manager.
- Downloading one user's data can take between 10 seconds and several days.
- The data for one user is equally varied, from a few kB to several GB.
- The downloaded data will be saved in the ./data/ subdirectory.
- Download speeds from splinder.com are not that high (servers may be particularly overloaded during European day because of additional traffic of people exporting their blogs). You can run multiple clients to speed things up.
Errors
- There are some problems with subdomains containing dashes[3]: if they fail on your machine (reported: wget compiled with +nls), for now stop and restart the script, someone else will do those users (although they seem to fail in part anyway).
- Some such users: macrisa, -Maryanne-, it:SalixArdens, it:MCris, it:7lilla, it:thepinkpenguin, it:bimbambolina, it:lazzaretta, it:Hedwige, it:N4m3L3Ss, it:Barbabietole_Azzurre, it:celebrolesa2212, it:buongiono.mattina, it:DarkExtra, it:-slash-, it:marlene1, it:Ohina, us:XyKy, us:Naluf, it:elisablu, it:*JuLs*, it:RikuSan, it:Nasutina
- There are also some problems with upload-finished.sh because of some inconsistencies in escaping special characters, e.g. [4]; remember not to delete those directories without fixing/uploading them.
- The script looks for errors in English, so it's better if you set wget-warc to use English. Otherwise, errors like these won't be detected and the script will mark users that failed as done. Please run fix-dld.sh to fix those users, after changing
if grep -q "ERROR 50"
to match your localised output.
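The English-only error check can be sketched as below. log_has_server_error is a hypothetical helper, not a function from the scripts; it only wraps the grep pattern quoted above. The comment about LC_ALL=C is an assumption: it presumes that wget-warc honours the usual locale environment variables the way stock GNU wget does.

```shell
# Returns success (exit status 0) if the given wget log contains an English
# "ERROR 50x" line, using the same pattern that the scripts grep for.
# log_has_server_error is a hypothetical name used for illustration.
log_has_server_error() {
  grep -q "ERROR 50" "$1"
}

# Assumption: running the client with LC_ALL=C makes wget-warc print
# untranslated (English) messages, so the pattern above matches, e.g.:
#   LC_ALL=C ./dld-client.sh "<YOURNICK>"
```

With a localised wget, the same check would silently return "no error", which is why failed users can end up marked as done.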
splinder_noconn.html errors
Please check your wget logs for the presence of a file named splinder_noconn.html. This is a transient maintenance page that has appeared in some downloads, but cannot be detected as an error by wget, because the page isn't returned with a status code indicating "an error occurred".
Some examples:
- https://gist.github.com/a15c7707ee666502a825
- https://gist.github.com/0427b4ed12ae48f2fb5f
- http://p.defau.lt/?sJOFev7prpKYpC_CYRnqrg
These accounts may have to be re-fetched.
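One possible way to find affected downloads is to search the logs for the page name. The sketch below assumes GNU grep and assumes that the logs are files ending in .log somewhere under the data directory; find_noconn_logs is a hypothetical helper, not part of the scripts.

```shell
# Print the path of every *.log file under a directory that mentions
# splinder_noconn.html. The *.log naming and the ./data default are
# assumptions; find_noconn_logs is a hypothetical name.
find_noconn_logs() {
  local dir=${1:-./data}
  grep -rl --include='*.log' 'splinder_noconn\.html' "$dir"
}
```

Each path printed points at a download that may have captured the maintenance page instead of real content and is therefore a candidate for re-fetching.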
Uploading your data
- To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the ./upload-finished.sh script to upload your data. For example, run this in your script directory:
./upload-finished.sh batcave.textfiles.com::YOURNICK/splinder/
- The script will upload only completed users. To check how much space the incomplete users are taking, without killing your disk, you can use
ionice -c 3 find -name .incomplete -printf "%h\0" | ionice -c 3 du -mcs --files0-from=-
in your splinder-grab directory.
Status
There is a real-time dashboard where you can check the progress.
External links
Site structure
The users are identified by their usernames. Fortunately, the site provides a list of all users. Usernames are not case-sensitive, but there is a case preference.
Example URLs
User profile: http://www.splinder.com/profile/<<username>>
Example profile:
  http://www.splinder.com/profile/difficilifoglie
View count on profile page:
  http://www.splinder.com/ajax.php?type=counter&op=profile&profile=Romanticdreamer
Example of friends list paging: (160 per page, starting at 0)
  http://www.splinder.com/profile/difficilifoglie/friends
  http://www.splinder.com/profile/difficilifoglie/friends/160
Inverse friends (probably also paged):
  http://www.splinder.com/profile/difficilifoglie/friendof
Link to blog: (note: not always the same as the username)
  http://difficilifoglie.splinder.com/
  http://learnonline.splinder.com/
Photo:
  http://www.splinder.com/profile/difficilifoglie/photo
  http://www.splinder.com/mediablog/wondermum/media/24544805
Video:
  http://www.splinder.com/profile/wondermum/video
  http://www.splinder.com/mediablog/wondermum/media/25737390
Audio: Not a separate user feed, but only accessible via mediablog
  http://www.splinder.com/mediablog/learnonline/media/25727030
Mediablog: combination of the audio + video + photo lists
  http://www.splinder.com/mediablog/learnonline
  (16 per page, starting at 0)
  http://www.splinder.com/mediablog/learnonline/16
Mediablog has PowerPoint, Word files:
  http://www.splinder.com/mediablog/learnonline/media/25641346
  http://www.splinder.com/mediablog/learnonline/media/25546305
  http://www.splinder.com/mediablog/learnonline/media/21901634
  http://www.splinder.com/mediablog/learnonline/media/24875290
User avatar: grab url from profile page
Photo file: grab url from photo page and remove _medium to get original picture
  http://files.splinder.com/d5e492233631af39212268593afca02d_square.jpg
  http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg
  http://files.splinder.com/d5e492233631af39212268593afca02d.jpg
  older photos do not have this structure, different ids for each size:
  http://www.splinder.com/mediablog/babboramo/media/17359043
  http://files.splinder.com/13b615ccbd75354ee4e0d973da66c2b2.jpeg
  http://files.splinder.com/770d7b9ecac27083d9204af327ebe743.jpeg
PowerPoint, Word files: grab url from media page
  http://files.splinder.com/46dbf3d5a0b12e490f81ddb8444b4fad.ppt
  http://files.splinder.com/ab3ce16c850ac530351d9df0937152c7.pdf
Video items: grab url from media page
  http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_square.jpg
  http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_thumbnail.jpg
  http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv
  note: square, thumbnail, small is not always available, check flashvars for vidpath, imgpath
  http://www.splinder.com/mediablog/babboramo/media/13131052
  http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv
  http://files.splinder.com/f56060b7fef139f03b72e06ca9fcba55.jpeg
Audio items: grab url from media page, flashvars
  sometimes there is a _thumbnail, remove that to get a better quality
  http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3
  http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3
Comments on blog posts:
  http://www.splinder.com/myblog/comment/list/25742358
  on some, but not on all blogs, those comments are also included in the blog page
  http://dal15al25.splinder.com/post/25740180
  http://soluzioni.splinder.com/post/2802227/blog-pager-su-piu-righe
  http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/
  http://civati.splinder.com/post/25742977
  pagination: see media comments
Comments on media items:
  http://www.splinder.com/media/comment/list/21254470
  http://www.splinder.com/media/comment/list/21254470?from=50
  (50 per page, starting at 0)
  number of comments is on the media page
  http://www.splinder.com/mediablog/danspo/media/21254470
Blog urls: the blogs have content from their own subdomain, but also from
  files.splinder.com
  www.splinder.com/misc/ (topbar css, gif)
  www.splinder.com/includes/ (js)
  www.splinder.com/modules/service_links/ (images)
  syndication.splinder.com
links to www.splinder.com that should NOT be followed:
  /myblog/ /users/ /media/ /node/ /profile/ /mediablog/ /community/ /user/ /night/ /home/ /mysearch/ /online/ /trackback/
wget-warc --mirror --page-requisites --span-hosts --domains=learnonline.splinder.com,files.splinder.com,www.splinder.com,syndication.splinder.com --exclude-directories="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe" -nv -o wget.log "http://learnonline.splinder.com/"
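Two of the URL rules listed above (removing a size suffix to reach the original file, and offset-based paging) can be sketched as small Bash helpers. The function names are hypothetical and are not taken from the download scripts. The item totals are assumed to be supplied by the caller, for example the comment count read from the media page. The suffix rule is only meant for newer photos and for audio items, since older items use a different id for each size.

```shell
# Strip a _square, _medium or _thumbnail suffix from a files.splinder.com
# URL. Hypothetical helper; intended for newer photos and audio items only.
original_file_url() {
  printf '%s\n' "$1" | sed -E 's/_(square|medium|thumbnail)(\.[A-Za-z0-9]+)$/\2/'
}

# Print friends-list page URLs: 160 per page, offsets starting at 0,
# with the first page carrying no offset. total is supplied by the caller.
friends_page_urls() {
  local user=$1 total=$2 offset=0
  local base="http://www.splinder.com/profile/$user/friends"
  while [ "$offset" -lt "$total" ]; do
    if [ "$offset" -eq 0 ]; then
      echo "$base"
    else
      echo "$base/$offset"
    fi
    offset=$((offset + 160))
  done
}

# Print media comment page URLs: 50 per page, offset passed as ?from=N.
media_comment_page_urls() {
  local id=$1 total=$2 offset=0
  local base="http://www.splinder.com/media/comment/list/$id"
  while [ "$offset" -lt "$total" ]; do
    if [ "$offset" -eq 0 ]; then
      echo "$base"
    else
      echo "$base?from=$offset"
    fi
    offset=$((offset + 50))
  done
}
```

For example, friends_page_urls difficilifoglie 200 would print the two friends URLs shown in the list above. Both paging helpers print nothing when the total is 0.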