Difference between revisions of "Splinder"
Revision as of 08:50, 15 November 2011
Splinder | |
URL | http://www.splinder.com/ |
Status | Closing |
Archiving status | In progress... |
Archiving type | Unknown |
IRC channel | #archiveteam-bs (on hackint) |
Splinder.com has been the main blog hosting company in Italy for a while (see Wikipedia:it:Splinder). It was founded in 2001 and hosts about half a million blogs and over 55 million pages. Since 8 November 2011, a warning on the home page has said that no new PRO accounts have been created since 1 June. The company has confirmed that the website will close on the 24th.[1]
How to help archiving
There is a distributed download script that gets usernames from a tracker and downloads the data.
Make sure you are on Linux and that you have curl, git, and a recent version of Bash. Your system must also be able to compile wget.
- Get the code:
git clone https://github.com/ArchiveTeam/splinder-grab
- Get and compile the latest version of wget-warc:
./get-wget-warc.sh
- Think of a nickname for yourself (preferably use your IRC name).
- Run the download script with
./dld-client.sh "<YOURNICK>"
- To stop the script gracefully, run
touch STOP
in the script's working directory. It will finish the current task and stop.
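The graceful stop described above can be pictured as a loop that looks for the STOP file only between tasks. The following is a minimal sketch of that idea, not the actual code of dld-client.sh: the function names run_until_stopped and process_task are hypothetical, and the tasks are passed as arguments here rather than fetched from the tracker.

```sh
#!/bin/sh
# Hypothetical sketch of a "finish the current task, then stop" loop.
# None of these names come from splinder-grab; they are for illustration only.

# Placeholder for the real work (downloading one user's data).
process_task() {
    echo "processed $1"
}

# Process each argument as one task, checking for ./STOP between tasks.
run_until_stopped() {
    for task in "$@"; do
        # The marker file is checked before a task starts, never during one,
        # so a task that is already running always completes.
        if [ -f STOP ]; then
            echo "STOP found, stopping before $task"
            return 0
        fi
        process_task "$task"
    done
}
```

Because the check happens only between tasks, creating the STOP file never interrupts a download that is already in progress.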
Notes
- Compiling wget-warc will require dev packages for the various libraries that it needs. Most questions have been about gnutls; install the gnutls-devel or gnutls-dev package with your favorite package manager.
- Downloading one user's data can take between 10 seconds and a few hours.
- The data for one user is equally varied, from a few kB to several GB.
- The downloaded data will be saved in the ./data/ subdirectory.
- Download speeds from splinder.com are not that high. You can run multiple clients to speed things up.
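Running several clients at once could be sketched as below. This is an illustration under assumptions: the helper name run_clients is hypothetical, and it assumes that several copies of dld-client.sh can safely share one working directory, which the notes above do not confirm.

```sh
#!/bin/sh
# Hypothetical helper: start COUNT copies of the download client in the
# background and wait for all of them to finish.
# Usage: run_clients COUNT NICK [CLIENT_COMMAND]
run_clients() {
    rc_count=$1
    rc_nick=$2
    rc_client=${3:-./dld-client.sh}   # default to the script from the steps above
    rc_i=1
    while [ "$rc_i" -le "$rc_count" ]; do
        "$rc_client" "$rc_nick" &     # each client runs as a background job
        rc_i=$((rc_i + 1))
    done
    wait                              # block until every client has exited
}
```

For example, `run_clients 3 "<YOURNICK>"` would start three clients. If the clients do share a working directory, a single `touch STOP` there would presumably stop all of them.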
Status
There is a real-time dashboard where you can check the progress.
External links
Site structure
The users are identified by their usernames. Fortunately, the site provides a list of all users. Usernames are not case-sensitive, but there is a case preference.
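Since usernames are case-insensitive but have a preferred spelling, a username list may need de-duplicating without losing that spelling. A minimal sketch, assuming one username per line on standard input and treating the first spelling seen as the preferred one (the function name dedupe_usernames is hypothetical):

```sh
#!/bin/sh
# Hypothetical helper: drop usernames that differ only in letter case.
# The first spelling encountered is kept as-is; input order is preserved.
dedupe_usernames() {
    # seen[] is keyed on the lowercased name, so "Foo" and "foo" collide,
    # while the line that gets printed keeps its original case.
    awk '!seen[tolower($0)]++'
}
```

Usage: `dedupe_usernames < usernames.txt`, where usernames.txt is a placeholder file name.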
Example URLs
User profile: http://www.splinder.com/profile/<<username>>
Example profile:
  http://www.splinder.com/profile/difficilifoglie
View count on profile page:
  http://www.splinder.com/ajax.php?type=counter&op=profile&profile=Romanticdreamer
Example of friends list paging (160 per page, starting at 0):
  http://www.splinder.com/profile/difficilifoglie/friends
  http://www.splinder.com/profile/difficilifoglie/friends/160
Inverse friends (probably also paged):
  http://www.splinder.com/profile/difficilifoglie/friendof
Link to blog (note: not always the same as the username):
  http://difficilifoglie.splinder.com/
  http://learnonline.splinder.com/
Photo:
  http://www.splinder.com/profile/difficilifoglie/photo
  http://www.splinder.com/mediablog/wondermum/media/24544805
Video:
  http://www.splinder.com/profile/wondermum/video
  http://www.splinder.com/mediablog/wondermum/media/25737390
Audio (not a separate user feed, only accessible via the mediablog):
  http://www.splinder.com/mediablog/learnonline/media/25727030
Mediablog (combination of the audio + video + photo lists; 16 per page, starting at 0):
  http://www.splinder.com/mediablog/learnonline
  http://www.splinder.com/mediablog/learnonline/16
Mediablog has PowerPoint, Word files:
  http://www.splinder.com/mediablog/learnonline/media/25641346
  http://www.splinder.com/mediablog/learnonline/media/25546305
  http://www.splinder.com/mediablog/learnonline/media/21901634
  http://www.splinder.com/mediablog/learnonline/media/24875290
User avatar: grab URL from profile page
Photo file: grab URL from photo page and remove _medium to get the original picture
  http://files.splinder.com/d5e492233631af39212268593afca02d_square.jpg
  http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg
  http://files.splinder.com/d5e492233631af39212268593afca02d.jpg
  Older photos do not have this structure; each size has a different id:
  http://www.splinder.com/mediablog/babboramo/media/17359043
  http://files.splinder.com/13b615ccbd75354ee4e0d973da66c2b2.jpeg
  http://files.splinder.com/770d7b9ecac27083d9204af327ebe743.jpeg
PowerPoint, Word files: grab URL from media page
  http://files.splinder.com/46dbf3d5a0b12e490f81ddb8444b4fad.ppt
  http://files.splinder.com/ab3ce16c850ac530351d9df0937152c7.pdf
Video items: grab URL from media page
  http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_square.jpg
  http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_thumbnail.jpg
  http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv
  Note: square, thumbnail, small are not always available; check flashvars for vidpath, imgpath
  http://www.splinder.com/mediablog/babboramo/media/13131052
  http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv
  http://files.splinder.com/f56060b7fef139f03b72e06ca9fcba55.jpeg
Audio items: grab URL from media page, flashvars; sometimes there is a _thumbnail, remove that to get better quality
  http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3
  http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3
Comments on blog posts:
  http://www.splinder.com/myblog/comment/list/25742358
  On some, but not all, blogs those comments are also included in the blog page:
  http://dal15al25.splinder.com/post/25740180
  http://soluzioni.splinder.com/post/2802227/blog-pager-su-piu-righe
  http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/
  http://civati.splinder.com/post/25742977
  Pagination: see media comments
Comments on media items (50 per page, starting at 0):
  http://www.splinder.com/media/comment/list/21254470
  http://www.splinder.com/media/comment/list/21254470?from=50
  The number of comments is on the media page:
  http://www.splinder.com/mediablog/danspo/media/21254470
Blog URLs: the blogs have content from their own subdomain, but also from
  files.splinder.com
  www.splinder.com/misc/ (topbar css, gif)
  www.splinder.com/includes/ (js)
  www.splinder.com/modules/service_links/ (images)
  syndication.splinder.com
Links to www.splinder.com that should NOT be followed:
  /myblog/ /users/ /media/ /node/ /profile/ /mediablog/ /community/ /user/ /night/ /home/ /mysearch/ /online/ /trackback/
wget-warc --mirror --page-requisites --span-hosts --domains=learnonline.splinder.com,files.splinder.com,www.splinder.com,syndication.splinder.com --exclude-directories="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe" -nv -o wget.log "http://learnonline.splinder.com/"
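Two of the URL rules in the examples above lend themselves to small helpers: paging through media comments (50 per page, selected with ?from=) and turning a _medium photo URL or a _thumbnail audio URL into the full-quality file URL. The sketch below is illustrative only. The function names comment_pages and original_file_url are hypothetical, the comment count is assumed to have been read from the media page already, and the suffix rule does not apply to older photos, whose sizes use unrelated ids.

```sh
#!/bin/sh
# Hypothetical helpers derived from the example URLs; not part of splinder-grab.

# Print every comment-list URL for one media item.
# Usage: comment_pages MEDIA_ID TOTAL_COMMENTS
comment_pages() {
    cp_id=$1
    cp_total=$2
    cp_offset=0
    while [ "$cp_offset" -lt "$cp_total" ]; do
        if [ "$cp_offset" -eq 0 ]; then
            # The first page carries no from= parameter.
            echo "http://www.splinder.com/media/comment/list/$cp_id"
        else
            echo "http://www.splinder.com/media/comment/list/$cp_id?from=$cp_offset"
        fi
        cp_offset=$((cp_offset + 50))   # 50 comments per page
    done
}

# Strip _medium (photos) or _thumbnail (mp3 audio) from a files.splinder.com URL.
# URLs without these suffixes are printed unchanged.
original_file_url() {
    printf '%s\n' "$1" | sed -e 's/_medium\(\.[^./]*\)$/\1/' \
                             -e 's/_thumbnail\(\.mp3\)$/\1/'
}
```

For example, `comment_pages 21254470 120` lists three pages (offsets 0, 50 and 100). Video thumbnails are deliberately left alone, because the examples do not show that removing their suffix yields a valid file.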