Revision as of 09:10, 28 May 2011
Friendster
URL | http://www.friendster.com/ |
Status | Closing |
Archiving status | In progress... |
Archiving type | Unknown |
IRC channel | #archiveteam-bs (on hackint) |
Friendster is an early social networking site, estimated to have over 115 million registered users. Founded in 2002, Friendster allowed the posting of blogs, photos, shoutouts/comments, and "widgets" of varying quality (not dissimilar to Facebook applications). It is considered one of the earlier social media networks (although it has numerous predecessors dating back for years) and distinguished itself by allowing such "rich media" additions to a user's account. After an initially high ranking and rating in the charts, Friendster's slow decline in hotness ensured an ever-growing chance of deletion, and on April 25th, 2011, Friendster announced that most of the user-generated content on the site would be removed on May 31st, 2011. Terabytes of user-generated content are in danger of being wiped out, and Archive Team has made it a priority to grab as much of Friendster as possible. A unix-based script (called BFF, or Best Friends Forever) has been created, and Archive Team is asking for anyone with unix and 100 GB of disk space to get involved in the project.
Jonathan Abrams, the original co-founder of Friendster, has washed his hands of the whole situation, and is mostly frustrated with Friendster's past. [1]
Because Friendster is based on numeric IDs (as opposed to usernames), it is possible to assign "chunks" to Archive Team volunteers. Please read up on the tools below, and if you have an interest in helping, join us at #foreveralone on EFnet and help us save Friendster.
Tools
friendster-scrape-profile
Script to download a Friendster profile: download it, or clone the git repository.
You need a Friendster account to use this script. (Note: if you are creating an account, Mailinator email addresses are blocked.) Add your login details to a file username.txt and a file password.txt, and save those in the directory of the download script.
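For example, the two credential files can be created like this (the address and password below are placeholders, not real credentials):

```shell
# Create the credential files next to the script.
# The values below are placeholders -- use your own Friendster login.
printf '%s' 'you@example.com' > username.txt
printf '%s' 'yourpassword'    > password.txt
```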
Run with a numeric profile id of a Friendster user: ./friendster-scrape-profile PROFILE_ID
Currently downloads:
- the main profile page (profiles.friendster.com/$PROFILE_ID)
- the user's profile image from that page
- the list of public albums (www.friendster.com/viewalbums.php?uid=$PROFILE_ID)
- each of the album pages (www.friendster.com/viewphotos.php?a=$id&uid=$PROFILE_ID)
- the original photos from each album
- the list of friends (www.friendster.com/fans.php?uid=$PROFILE_ID)
- the shoutoutstream (www.friendster.com/shoutoutstream.php?uid=$PROFILE_ID) and the associated comments
- the Friendster blog, if any
It does not download any of the widgets.
Downloading one profile takes between 6 and 10 seconds and generates 200-400 kB of data (for normal profiles).
Automating the process
(This is all unix-only; it won't work in Windows.)
1. Create a Friendster account
2. Download the script; name it 'bff.sh'.
3. In the directory where you put bff.sh, create a username.txt file containing your Friendster e-mail address.
4. In the directory where you put bff.sh, create a password.txt file containing your Friendster password.
5. Choose your profile range from the signup sheet below.
6. Edit the signup table to mark the range you'll do.
7. On the command line, type (with your range replacing the '#'s.):
$ for i in {#..#}; do bash bff.sh $i; done
or even better
$ ./bff-thread.sh # #
which will allow you to stop at any time by touching the STOP file.
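The STOP-file convention works roughly like this (a simplified sketch, with a stand-in function instead of the real bff.sh call):

```shell
# Simplified sketch of the STOP-file loop used by bff-thread.sh.
# process_id is a stand-in for "bash bff.sh $i", not the real script.
process_id() { echo "would run: bash bff.sh $1"; }

start=100; end=105
for i in $(seq "$start" "$end"); do
  # Touching a file named STOP halts the loop cleanly between profiles.
  if [ -e STOP ]; then
    echo "STOP file found, halting"
    break
  fi
  process_id "$i"
done
```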
Advanced: multiple instances
Requirements
Now you might notice it's relatively slow; my average is 115 profiles per hour. The bottleneck is mainly network requests, so running multiple instances can increase your download speed nearly linearly. BUT we're not sure whether it's safe to use the same cookies.txt file for all the instances (which they will do by default). Luckily you can easily avoid this with an extra optional parameter of bff.sh: just add the name of the cookie file you want it to create and use right after the profile ID, for instance "bff.sh 4012089 cookie3.txt". Use a different cookie file for each instance.
Manually
The full, modified command would then be (replacing the #'s with your range or the cookie number, where applicable):
$ for i in {#..#}; do bash bff.sh $i cookie#.txt; done
chunky.sh
The latest and most sophisticated way to automate this is to run chunky.sh. It breaks the range up into chunks of a thousand profiles, and runs as many of these chunks concurrently as you request. This means that if some chunks contain smaller profiles and therefore finish more quickly, you don't end up with fewer concurrent downloads than you wanted.
$ ./chunky.sh <start> <end> <threads>
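The chunking arithmetic can be sketched like this (a standalone illustration of the range splitting only; the real chunky.sh also runs the chunks concurrently):

```shell
# Split an id range into chunks of 1000, as chunky.sh does internally.
# Illustration only; example range values are made up.
start=5000; end=8499
s=$start
while [ "$s" -le "$end" ]; do
  e=$(( s + 999 ))
  if [ "$e" -gt "$end" ]; then e=$end; fi
  echo "chunk: $s-$e"
  s=$(( e + 1 ))
done
```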
snook.sh
The original automated solution was snook.sh. This script takes the start and end of a range and a number of download threads to run, and launches that many instances of bff.sh at once. It automatically logs the output to individual log files and creates separate cookies files for them. This script was originally written by underscore; you may have seen his pastebin link on the IRC channel. I've fixed several bugs, including one very serious one. If you used the version from pastebin, you'll need to start over, because it downloaded the wrong profiles (keep what you downloaded; it'll merely overlap with someone else's range). If you need to stop the downloads cleanly, simply $ touch STOP.
invoker.pl and summary.pl
Another option is this perl script, which does a similar job. It's not thoroughly tested yet, but it's pretty simple. It takes the starting ID, the number of IDs per process, and the number of processes, then creates a shell script which launches them. It has the bonus of being stoppable via $ touch STOP, and it logs every finished ID from every instance to one file for monitoring. summary.pl gives a quick summary of that file to monitor the processes' progress. (And with touch STOP and the summary file, that means easy management over SSH! Woo!)
Troubleshooting
If you get an error like bff.sh: line 26: $'\r': command not found, you will need to convert the script to use UNIX-style line endings:
$ dos2unix bff.sh
or if you somehow find yourself without the dos2unix command, do this:
$ sed "s/\r//" bff.sh > bff-fixed.sh
$ mv bff-fixed.sh bff.sh
Site Organization
Content on Friendster seems to be primarily organized by the id number of the users, which were sequentially assigned starting at 1. This will make it fairly easy for wget to scrape the site and for us to break it up into convenient work units. The main components we need to scrape are the profile pages, photo albums and blogs, but there may be others. More research is needed.
Profiles
Urls of the form 'http://profiles.friendster.com/<userid>'. Many pictures on these pages are hosted on urls that look like 'http://photos-p.friendster.com/photos/<lk>/<ji>/nnnnnijkl/<imageid>.jpg', but these folders aren't browsable directly. Profiles will not be easy to scrape with wget.
Photo Albums
A user's photo albums are at urls that look like 'http://www.friendster.com/viewalbums.php?uid=<userid>' with individual albums at 'http://www.friendster.com/viewphotos.php?a=<album id>&uid=<userid>'. It appears that the individual photo pages use javascript to load the images, so they will be very hard to scrape.
On the individual album pages, the photo thumbnails are stored under similar paths as the main images. i.e. if the album thumb is at http://photos-p.friendster.com/photos/<lk>/<ji>/nnnnnijkl/<imageid>m.jpg, just drop the final 'm' to get the main photo (or replace it with a 't' to get an even tinier version).
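Since only the trailing letter differs, the main and tiny photo URLs can be derived from a thumbnail URL with plain shell string manipulation (the path below is a fabricated example matching the observed pattern):

```shell
# Derive the full-size and tiny photo URLs from an album thumbnail URL.
# The path below is made up to match the observed URL pattern.
thumb='http://photos-p.friendster.com/photos/21/43/123454321/1_98765m.jpg'
main="${thumb%m.jpg}.jpg"   # drop the trailing 'm'
tiny="${thumb%m.jpg}t.jpg"  # or swap it for 't'
echo "$main"
echo "$tiny"
```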
Blogs
Blogs are hosted by a WordPress install, typically at (somename).blog.friendster.com for the actual blog pages, with images hosted on (somename).blogs.friendster.com, where the name is the same and picked by the user.
Groups
Friendster groups (only visible when logged in) have a profile picture, a list of members, photos, discussions (a forum) and announcements.
Range Signup Sheet
We're going to break up the user ids into ranges and let individuals claim a range to download. Use this table to mark your territory:
Start | End | Status | Size (Uncompressed) | Claimant |
---|---|---|---|---|
1 | 999 | Uploaded | 55MB | closure |
1,000 | 1,999 | Uploaded | 283MB | alard |
2,000 | 2,999 | Uploaded | 473MB | DoubleJ |
3,000 | 3,999 | Downloaded | 234MB | Teaspoon |
4,000 | 4,999 | Uploaded | 183MB | Paradoks |
5,000 | 5,999 | Uploaded | 202MB | robbiet48/Robbie Trencheny (Amsterdam) |
6,000 | 9,999 | Uploaded | 1.1GB | Sketchcow/Jason Scott |
10,000 | 29,999 | Uploaded | 5.1GB | Sketchcow/Jason Scott |
30,000 | 31,999 | Uploaded | 485MB | Sketchcow/Jason Scott |
32,000 | 32,999 | Uploaded | 201MB | Paradoks |
33,000 | 33,999 | Uploaded | 241MB | closure |
34,000 | 100,000 | Uploaded | unknown (20+ GB?) | closure |
100,000 | 101,000 | Downloaded | 205.6 MB | xlene |
101,001 | 102,000 | Uploaded | 232MB | robbiet48/Robbie Trencheny (Florida) |
102,001 | 103,000 | Uploaded | 241MB | robbiet48/Robbie Trencheny (Amsterdam) |
103,001 | 104,000 | Uploaded | | yipdw |
104,001 | 105,000 | Downloaded | 252MB | Coderjoe |
105,001 | 114,999 | Uploaded | 2.1GB | Paradoks |
115,000 | 116,999 | Uploaded | | yipdw |
117,000 | 119,999 | Downloaded | 815MB | Coderjoe |
120,000 | 130,000 | Uploaded | 2.3GB | robbiet48/Robbie Trencheny (Florida) |
130,000 | 140,000 | Claimed | | robbiet48/Robbie Trencheny (Florida) |
140,001 | 160,000 | Uploaded | | yipdw |
160,001 | 180,000 | Downloaded | 2.4GB | jch |
180,001 | 200,000 | Uploaded | | yipdw |
200,001 | 220,000 | Downloaded | 8.4GB | Coderjoe |
220,001 | 230,000 | Claimed | | xlene |
230,001 | 240,000 | Uploaded | 4.4GB | alard |
240,001 | 250,000 | Downloaded | | Teaspoon |
250,001 | 260,000 | Claimed | | robbiet48/Robbie Trencheny (Newark) |
260,001 | 270,000 | Uploaded | 4.0GB | robbiet48/Robbie Trencheny (Fremont 1) |
270,001 | 280,000 | Uploaded | 3.2GB | robbiet48/Robbie Trencheny (Fremont 2) |
280,001 | 290,000 | Uploaded | 3.8GB | DoubleJ |
290,001 | 300,000 | Uploaded | 3.9GB | dnova |
310,001 | 320,000 | Downloaded | 5.1GB | Coderjoe |
320,001 | 330,000 | Claimed | | robbiet48/Robbie Trencheny (Oakland) |
330,000 | 340,000 | Uploaded | | closure |
340,000 | 400,000 | Uploaded | 25GB | Sketchcow/Jason Scott |
400,001 | 500,000 | Uploaded | 40 GB | DoubleJ |
500,000 | 600,000 | Downloaded | 37 GB | closure (penguin) |
600,001 | 700,000 | Claimed | | no2pencil |
700,001 | 800,000 | Uploaded | 36GB | proub/Paul Roub |
800,001 | 900,000 | Uploaded | 39GB | proub/Paul Roub |
900,001 | 1,000,000 | Downloaded / v12 | 36GB | Soult |
1,000,001 | 1,100,000 | Claimed | | Avram |
1,100,001 | 1,200,000 | Uploaded | 33GB | Paradoks |
1,200,001 | 1,300,000 | Uploaded | 36 GB | db48x |
1,300,000 | 1,400,000 | Downloaded | 36 GB | closure (penguin) |
1,400,001 | 1,500,000 | Uploaded | | alard |
1,500,001 | 1,600,000 | Claimed | 94.3% done | ksh/omglolbah |
1,600,001 | 1,700,000 | Claimed | 86.9% done | ksh/omglolbah |
1,700,001 | 1,800,000 | Claimed | 83.7% done | ksh/omglolbah |
1,800,001 | 1,900,000 | Claimed | 79.3% done | ksh/omglolbah |
1,900,001 | 2,000,000 | Claimed | 53.7% done | ksh/omglolbah |
2,000,001 | 2,100,000 | Claimed | 67.4% done | ksh/omglolbah |
2,100,001 | 2,200,000 | Downloaded | 65 GB | Teaspoon |
2,200,001 | 2,300,000 | Uploaded | 50GB compressed | Darkstar |
2,300,001 | 2,400,000 | Uploaded | 70GB compressed | Darkstar |
2,400,001 | 2,500,000 | Downloaded | | underscor (snookie) |
2,500,001 | 2,600,000 | Claimed | | Bardicer |
2,600,001 | 2,700,000 | Claimed | | Robbie Trencheny (Amsterdam) |
2,700,001 | 2,800,000 | Claimed | | Robbie Trencheny (Fremont 2) |
2,800,001 | 2,900,000 | Downloaded | 139GB | Coderjoe (system1) |
2,900,001 | 3,000,000 | Downloaded | 154GB | Coderjoe (system2) |
3,000,001 | 3,100,000 | Claimed | 78GB | Qwerty0 |
3,100,001 | 3,600,000 | Claimed | | Jason Scott/Sketchcow |
3,600,001 | 3,700,000 | Done/Uploading | | DoubleJ |
3,700,001 | 3,800,000 | Uploaded | | yipdw |
3,800,001 | 3,900,000 | Uploaded | | oli |
3,900,001 | 4,000,000 | Claimed | | Jason Scott/Sketchcow |
4,000,001 | 4,100,000 | Claimed | | primus102 |
4,100,001 | 4,200,000 | Claimed | | Zebranky |
4,200,001 | 4,300,000 | Claimed | | Zebranky |
4,300,001 | 4,399,999 | Uploaded | 255GB (196GB compressed) | db48x |
4,400,000 | 4,599,999 | Claimed | | Jade Falcon |
4,600,000 | 4,799,999 | Claimed | | Soult |
4,800,000 | 4,809,999 | Uploaded | | alard |
4,810,000 | 4,899,999 | Uploaded | | oli |
4,900,000 | 4,999,999 | Uploaded | 216GB (160GB compressed) | db48x |
5,000,000 | 5,099,999 | Claimed | | jch |
5,100,000 | 5,199,999 | Claimed, 10% Complete | | hydruh |
5,200,000 | 5,299,999 | Uploaded | | chris_k |
5,300,000 | 5,349,000 | Downloaded | ~177GB uncompressed | ersi |
5,349,001 | 5,359,000 | Downloaded | 13GB uncompressed | Underscor 03:25, 22 May 2011 (UTC) |
5,359,001 | 5,360,000 | Downloaded | | Underscor 03:25, 22 May 2011 (UTC) |
5,360,001 | 5,370,000 | Downloaded | 11GB uncompressed | Underscor 03:25, 22 May 2011 (UTC) |
5,370,001 | 5,470,000 | Downloaded | | Underscor 03:25, 22 May 2011 (UTC) |
5,470,001 | 5,570,000 | Downloaded | | Underscor 03:25, 22 May 2011 (UTC) |
5,570,001 | 5,670,000 | Downloaded | | Underscor 03:25, 22 May 2011 (UTC) |
5,670,001 | 6,349,999 | Downloading | | jeremydouglass |
6,350,000 | 6,449,999 | Downloaded | ~212GB uncompressed | Paradoks |
6,450,000 | 6,550,000 | Uploaded | | yipdw |
6,550,001 | 6,700,000 | Claimed | | oli |
6,700,000 | 6,800,000 | Claimed | | closure (penguin) |
6,800,001 | 6,900,000 | Uploaded | | alard |
6,900,001 | 7,000,000 | Uploaded | | oli |
7,000,001 | 7,100,000 | Compressing | | seanp2k (likwid/@ip2k on twitter) |
7,100,001 | 7,150,000 | Claimed | | oli |
7,150,001 | 7,250,001 | Downloaded | | dashcloud |
7,250,002 | 7,299,999 | Downloaded | | db48x |
7,300,000 | 7,399,999 | Done/Uploading | | DoubleJ |
7,400,000 | 7,499,999 | Claimed (70% downloaded) | | dsquared |
7,500,000 | 7,599,999 | Uploaded | | oli |
7,600,000 | 7,699,999 | Uploaded | | oli |
7,700,000 | 7,799,999 | Claimed | | seanp2k (likwid/@ip2k on twitter) |
7,800,000 | 7,899,999 | Claimed | | seanp2k (likwid/@ip2k on twitter) |
7,900,000 | 7,999,999 | Claimed | | seanp2k (likwid/@ip2k on twitter) |
8,000,000 | 8,099,999 | Claimed | | primus102 |
8,100,000 | 8,199,999 | Uploading | | alard |
8,200,000 | 8,299,999 | Downloading (30%) | 58GB uncompressed | jeremydouglass |
8,300,000 | 8,399,999 | Downloaded | 192GB uncompressed | Beardicus |
8,400,000 | 8,449,999 | Compressed | 100GB uncompressed | Shadyman (Yes, 50k IDs) |
8,450,000 | 8,599,999 | Downloading (70%) | | aristotle |
8,600,000 | 8,699,999 | Uploaded | 131GB uncompressed | chris_k |
8,700,000 | 8,715,999 | Downloading | | vertevero |
8,716,000 | 8,999,999 | Downloading | | aggroskater |
9,000,000 | 9,899,999 | Pool - unclaimed | | |
9,900,000 | 9,999,999 | Downloading (80%) | | aristotle |
10,000,000 | 10,050,000 | Uploaded | | yipdw (50k intentional) |
10,050,001 | 10,100,000 | Uploaded | 96GB | dinomite |
10,100,001 | 10,199,999 | Uploaded | | chris_k |
10,200,000 | 10,300,000 | Claimed | | Coderjoe (yes, 100k1) |
10,300,001 | 10,399,999 | Downloaded | | dashcloud |
10,400,001 | 10,499,999 | Claimed | | Lambda_Driver |
10,500,000 | 10,599,999 | Uploaded | 199GB (155GB compressed) | dinomite |
10,600,000 | 10,699,999 | Downloaded | | dinomite (Titus uploading) |
10,700,000 | 10,799,999 | Claimed | | DoubleJ |
10,800,000 | 10,849,999 | Claimed | | Shadyman |
10,850,000 | 10,899,999 | Downloading | | Underscor 18:56, 24 May 2011 (UTC) |
10,900,000 | 10,999,999 | Downloaded | | chris_k |
11,000,000 | 11,049,999 | Claimed | | alard |
11,050,000 | 11,999,999 | Downloading | Unknown | Cameron_D |
12,000,000 | 12,099,999 | Downloading (25%) | | chris_k |
12,100,000 | 19,999,999 | Pool - unclaimed | | |
20,000,000 | 20,099,999 | Uploaded | | dinomite |
20,100,000 | 20,199,999 | Claimed | | Beardicus |
20,200,000 | 29,999,999 | Pool - unclaimed | | |
30,000,000 | 30,099,999 | Claimed | | dsquared |
30,100,000 | 39,999,999 | Pool - unclaimed | | |
40,000,000 | 40,099,999 | Claimed | incomplete | db48x |
40,100,000 | 40,199,999 | Downloaded | | dashcloud |
40,200,000 | 49,999,999 | Pool - unclaimed | | |
50,000,000 | 50,099,999 | Downloaded | 135GB uncompressed | jeremydouglass |
50,100,000 | 59,999,999 | Pool - unclaimed | | |
60,000,000 | 60,099,999 | Uploaded | 131GB, 91GB .gz | chris_k |
60,100,000 | 69,999,999 | Pool - unclaimed | | |
70,000,000 | 70,099,999 | Downloaded | 97GB uncompressed | jeremydouglass |
70,100,000 | 79,999,999 | Pool - unclaimed | | |
80,000,000 | 80,099,999 | Uploaded | 94GB uncompressed | chris_k |
80,100,000 | 89,999,999 | Pool - unclaimed | | |
90,000,000 | 90,099,999 | Uploaded | 90GB, 51GB .gz | chris_k |
90,100,000 | 94,999,999 | Pool - unclaimed | | |
95,000,000 | 95,099,999 | Downloaded | 76GB | zebedee |
95,100,000 | 100,000,000 | Pool - unclaimed | | |
100,000,000 | 100,099,999 | Uploaded | | dinomite |
100,100,000 | 109,999,999 | Downloading | | |
110,000,000 | 110,099,999 | Claimed, 90% done | 72GB est. | Paradoks |
110,100,000 | 124,099,999 | Pool - unclaimed | | |
124,100,000 | 124,138,261 | Downloaded | 9GB uncompressed | jeremydouglass |
Special Collection | Status | Size (Uncompressed) | Claimant |
---|---|---|---|
320 fan profiles (from search) | Downloading | | alard |
We recommend claiming 100k at a time, because that keeps things neat and tidy, both in this table and on your computer. However, it seems that the number of photographs per profile increased quite a bit during the early years, so the later profiles are much larger than the older ones. Feel free to claim a smaller block if it'll help. 100GB should hold about 50,000 ids and only take a couple of days to download.
Proposal: sampling
It is growing increasingly likely that we won't get it all by the 31st. Given that, perhaps we should be sampling new ranges from across the total index, in order to capture a better picture of what Friendster was like across its history.
Here are eleven proposed ranges to start, ranked by priority:
Start | End | Priority |
---|---|---|
20,000,000 | 20,099,999 | 5 |
30,000,000 | 30,099,999 | 9 |
40,000,000 | 40,099,999 | 3 |
50,000,000 | 50,099,999 | 6 |
60,000,000 | 60,099,999 | 10 |
70,000,000 | 70,099,999 | 2 |
80,000,000 | 80,099,999 | 7 |
90,000,000 | 90,099,999 | 11 |
100,000,000 | 100,099,999 | 4 |
110,000,000 | 110,099,999 | 8 |
124,100,000 | 124,138,261 | 1 |
...and here they are sorted by priority. If you want to do one of these ranges, you would still add an entry for it in the main table above.
Start | End | Priority | Status |
---|---|---|---|
124,100,000 | 124,138,261 | 1 | claimed |
70,000,000 | 70,099,999 | 2 | claimed |
40,000,000 | 40,099,999 | 3 | claimed |
100,000,000 | 100,099,999 | 4 | claimed |
20,000,000 | 20,099,999 | 5 | claimed |
50,000,000 | 50,099,999 | 6 | claimed |
80,000,000 | 80,099,999 | 7 | claimed |
110,000,000 | 110,099,999 | 8 | claimed |
30,000,000 | 30,099,999 | 9 | claimed |
60,000,000 | 60,099,999 | 10 | claimed |
90,000,000 | 90,099,999 | 11 | claimed |
Proposal: download some groups
It might be interesting to download at least part of the Friendster groups. The bigger groups often have forums, photos and announcements. This table lists the number of groups with 100 members or more. If you are interested, claim a category.
The lists of group ids can be found in this special Github repository. To download a group, you'll need the bgf.sh script, which can be found in the same git repository as bff.sh (you may already have it!).
CatID | Category | ID list file | Groups | Status | Claimed by |
---|---|---|---|---|---|
11 | Activities | (still downloading ids) | 183 | | Unclaimed |
12 | Automotive | ids-100plus-cat-12.txt | 983 | Downloading | Cameron_D |
13 | Business | ids-100plus-cat-13.txt | 289 | | Unclaimed |
14 | Career & Jobs | ids-100plus-cat-14.txt | 457 | | Unclaimed |
15 | Cities & Neighborhoods | ids-100plus-cat-15.txt | 586 | | Unclaimed |
16 | Companies | ids-100plus-cat-16.txt | 627 | | Unclaimed |
17 | Computers & Internet | ids-100plus-cat-17.txt | 661 | | Unclaimed |
18 | Countries & Regional | ids-100plus-cat-18.txt | 645 | | Unclaimed |
19 | Cultures & Community | ids-100plus-cat-19.txt | 1425 | | Unclaimed |
20 | Entertainment | (still downloading ids) | 2718 | | Unclaimed |
21 | Family & Home | ids-100plus-cat-21.txt | 554 | | Unclaimed |
22 | Fan Clubs | (still downloading ids) | 268 | | Unclaimed |
23 | Fashion & Beauty | ids-100plus-cat-23.txt | 1725 | | Unclaimed |
24 | Film & Television | ids-100plus-cat-24.txt | 954 | | Unclaimed |
25 | Food, Drink, & Wine | ids-100plus-cat-25.txt | 634 | | Unclaimed |
26 | Games | (still downloading ids) | 1583 | | Unclaimed |
27 | Gay, Lesbian & Bi | ids-100plus-cat-27.txt | 850 | | Unclaimed |
28 | Government & Politics | ids-100plus-cat-28.txt | 279 | | Unclaimed |
29 | Health & Fitness | ids-100plus-cat-29.txt | 281 | | Unclaimed |
30 | Hobbies & Crafts | ids-100plus-cat-30.txt | 753 | | Unclaimed |
31 | Literature & Arts | ids-100plus-cat-31.txt | 422 | | Unclaimed |
32 | Money & Investing | ids-100plus-cat-32.txt | 102 | | Unclaimed |
33 | Movies | ids-100plus-cat-33.txt | 1265 | | Unclaimed |
34 | Music | (still downloading ids) | 437 | | Unclaimed |
35 | Nightlife & Clubs | ids-100plus-cat-35.txt | 856 | | Unclaimed |
36 | Non-Profit & Philanthropic | ids-100plus-cat-36.txt | 126 | | Paradoks |
37 | People | (still downloading ids) | 136 | | Unclaimed |
38 | Pets & Animals | ids-100plus-cat-38.txt | 449 | | Unclaimed |
39 | Professional Organizations | ids-100plus-cat-39.txt | 1200 | | Unclaimed |
40 | Recreation & Sports | ids-100plus-cat-40.txt | 1130 | | Unclaimed |
41 | Religion & Beliefs | ids-100plus-cat-41.txt | 1281 | | Unclaimed |
42 | Romance & Relationships | ids-100plus-cat-42.txt | 1020 | | Unclaimed |
43 | Schools & Alumni | (still downloading ids) | 95 | | Unclaimed |
44 | Science & History | ids-100plus-cat-44.txt | 181 | Downloaded, 341MB | Paradoks |
45 | Sorority/Fraternities | ids-100plus-cat-45.txt | 491 | | Unclaimed |
46 | Television | ids-100plus-cat-46.txt | 908 | | Unclaimed |
47 | Travel | ids-100plus-cat-47.txt | 241 | | Unclaimed |
48 | Other | (still downloading ids) | 493 | | Unclaimed |
49 | Events | ids-100plus-cat-49.txt | 330 | | Unclaimed |
Known issues
Affected | Issue | Resolution |
---|---|---|
User IDs < 340000 | Suspect blog content | Jason will run a blog-check at the end |
Profiles retrieved with bff.sh < v8 | Missing blog content | Redownload affected profiles or wait for blog-check |
Profiles retrieved with bff.sh < v9 | Missing images | Redownload affected profiles |
Profiles with more than one shoutout page, retrieved with bff.sh < v12 | Only first page of shoutoutstream | Redownload profiles that have a file shoutout_2.html |
Running on Mac OS X
Summary: To run on Mac OS X 10.5 (Leopard), 10.6 (Snow Leopard), or 10.6 Server, you need to install wget, bash 4.0+, and a more recent expr.
Description: In MacPorts, this can be done by installing the packages wget, bash, and coreutils (for gexpr), then changing the top line in all .sh files from the ArchiveTeam-friendster-scrape git package to #!/opt/local/bin/bash, and replacing all instances of 'expr' with 'gexpr'. Then run chunky.sh as normal on a range and declare victory.
Problem Details: All this is done to work around these three problems:
- bff.sh requires wget
- ...which is not installed by default
- bff.sh requires a more recent version of expr
- chunky.sh requires a more recent version of bash (4.0+)
- ...to support the shell builtin "declare -A" (associative arrays)
Solution Details:
- Install MacPorts (requires Developer Tools). You may also use Homebrew or Fink.
- Install new versions of wget, expr, and bash. In MacPorts:
- sudo port install wget
- sudo port install coreutils
- sudo port install bash
- In MacPorts, the new bash is installed at
- /opt/local/bin/bash
- ...so change the first line of each .sh file from:
- #!/bin/bash
- ...to:
- #!/opt/local/bin/bash
- In MacPorts, the new expr is called "gexpr", so search and replace every expr-->gexpr
- (lines 167 and 231 in bff.sh, in the current scripts)
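Both edits can be scripted with sed. This sketch patches a fabricated two-line stand-in (demo.sh) rather than the real bff.sh, and the blanket expr-to-gexpr substitution assumes 'expr' does not occur inside other words in your copy of the scripts:

```shell
# Sketch: apply the two MacPorts edits with sed.
# demo.sh is a made-up stand-in for bff.sh; we write a patched copy
# rather than editing in place, to stay portable across BSD and GNU sed.
printf '#!/bin/bash\nn=$(expr 1 + 2)\n' > demo.sh
sed -e '1s|^#!/bin/bash|#!/opt/local/bin/bash|' \
    -e 's/expr/gexpr/g' demo.sh > demo-mac.sh
head -n 1 demo-mac.sh
```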
More:
- Homebrew users can use commands like:
- brew install bash
- ...or Xcode users can build bash 4+ from source.
More on the missing image problem
We've just discovered that the versions of bff.sh we've been using don't grab the right things on some systems. Specifically, we know that older versions of grep (e.g. 2.5.4) don't match some urls as intended. To test whether your files have been downloading correctly, run ./bff.sh 115288. If you end up with one .jpg instead of 8 (here is what you should end up with), you need to upgrade your version of bff.sh before continuing. The current version solves the issue. We're figuring out what to do about the already-downloaded stuff.
More on the shoutout page problem
There was an error in the section downloading the shoutoutstream pages (bff.sh versions < 12). For profiles with more than one shoutoutstream page, the first page was downloaded several times: shoutout_1.html, shoutout_2.html etc. all contained the first page of messages. This problem was fixed in version 12 of the script.
This only affects profiles with more than one shoutout page. This is a small percentage of the profiles (7 out of the 50,000 profiles in my collection). They can be found by looking for shoutout_2.html. Remove the profiles that have this file and run the script again.
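Assuming a directory-per-profile layout for your downloaded data (adjust the path to your own tree), the affected profiles can be listed with find:

```shell
# List profile directories that contain a shoutout_2.html.
# Assumes one directory per profile id; adjust the search root as needed.
find . -name 'shoutout_2.html' -exec dirname {} \; | sort -u
```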
Blogs with bad links
Some blogs have bad links that expand into an infinite tree. The latest version of bff.sh ameliorates this problem by limiting recursion depth to 20, but in some cases that can still be too much.
These profile IDs are known to have blogs that cause problems:
ID | Example of offending URL |
---|---|
319533 | exquisitelle.blog.friendster.com/category/uncategorized/<object width=/<object width=/... |
488742 | oxidation-hani.blog.friendster.com/category/uncategorized/page/category/uncategorized/page/... |
2969345 | mercurian.blog.friendster.com/category/life-or-something-like-it/ <a href=/... |
3007822 | iz-freedom.blog.friendster.com/category/uncategorized/;/;/... |
3035078 | khelay-angela.blog.friendster.com/category/music/<div style=/<div style=/<div style=/... |
3764079 | luzie53.blog.friendster.com/&tbnh=106&tbnw=142&hl=tl&start=4&prev=/images%3Fq%3Dsad%2Bangel%2Bpictures%26svnum%3D10%26hl%3Dtl%26lr%3D%26sa%3DN/... |
3774275 | msiabeckham.blog.friendster.com/category/uncategorized/\/\/\/\/\/\/\/\/\/page/2/\/page/2... |
3789069 | chrisna.blog.friendster.com/&tbnh=110&tbnw=80&hl=en&start=20&prev=/images?q=friendship&hl=en&lr=&sa=G/images?q=friendship&hl=en&lr=&sa=G/... |
6501473 | irisruby.blog.friendster.com/category/music/<a href=/<a href=/<a href=/<a href=/<a href=/<a href=/<a href=/<object classid=/... |
6803753 | pauloge10.blog.friendster.com/category/uncategorized/object_width=%2F%3Ca+href%3D%2F%3Cobject+width%3D%2F%3Ca+href%3D%2F%3Cobject+width%3D%2F%3Ca+href%3D%2F%3Ca+href%3D%2F%3Ca+href%3D%2F%3Ca+href%3D%2F%3Ca+href%3D%2F%3Ca+href%3D%2F/page/2/ |
7124822 | julius-kurimaw.blog.friendster.com/&tbnh=110&tbnw=111&hl=en&start=25&prev=/images?q=+x-japan+&start=20&hl=en&lr=&sa=N/&tbnh=97&tbnw=97&hl=en&start=6&prev=/images?q=first+love+lyrics+by+utada+hikaru&hl=en&lr=&sa=N... |
If you come across one of these, please add to this list.
Blogs with corrupt images
Some blogs have links to images that just never finish downloading. wget downloads to 99%, then hangs until the server closes the connection.
ID | Example of offending URL |
---|---|
1421002 | http://diverjun23.blog.friendster.com/files/sany0901.jpg |
2848374 | http://aspen.blogs.friendster.com/photos/uncategorized/test.gif |
7934375 | http://photos-p.friendster.com/photos/76/94/7934967/1_256336337.jpg |
If you come across another corrupt image, please add it to the list.