MyVIP/Bash script
'''The following notes, preceding the actual script, have been copied from the [[myVIP]] page, and most of them should be considered deprecated, along with the script afterwards; these are kept for historical purposes only.'''
<hr>
Some notes about the site, the script and the archiving process:
* The site utilizes a lot of JavaScript but can still be saved perfectly.
* '''Scrapers will need to register an account, and the cookies file (exported from the browser) must be fed to the script. Note: visiting myvip.com in a new browser session instantly invalidates the old cookie!'''
* '''The total size of the content is probably around 2–3 terabytes.''' The first (oldest) 100,000 profiles show that 47% of the possible profiles exist, with an average profile size of ~790 kilobytes; the largest ones are a few tens (less than 100) of megabytes (WARC-compressed). Note that this is a rough estimate from a small sample. (That would put the profiles at ~2 TB, not counting the profile picture thumbnails and the clubs, but those are probably not too significant in size.)
* The bash script is amateur work and might be bash-specific at some points (i.e. not usable with other shells).
* However, the script has been tested and should be reliable and do its job.
* '''The script currently supports only profiles.''' Club pages should also be saved later; having this algorithm makes it simpler to write the one needed for that.
* The script accepts user IDs, which run sequentially from 1 up to about 4,600,000 (?), but not all profiles exist.
* The script currently saves each user's content into a separate WARC file (this should be changed, as one such WARC file might be too small, resulting in lots of little files).
* The script saves the following for a user: profile page, list of clubs the user is a member of (if more than one page), acquaintances (or "friends") list, photo albums, photos, and comments on photos. That is everything that should be saved, where present.
* The script supports creating a "directory" of users: it extracts some identifying information and stores it in a one-line CSV file per user. (This should be adjusted just like the WARC; later the files can be concatenated to form a database.)
* The script also creates lists of the profile picture and club avatar thumbnails that are used in lists on the site. They could be saved for every user, but that would mean a profile picture being requested once for every acquaintance the user has. So creating a list of them and then downloading each of those tiny pictures only once is the feasible solution.
* The script currently has (almost) separate discovery and grab phases. This means that '''some (many) pages are requested twice: once during discovery and once while WARCing'''. This could probably be optimized.
* '''A user's acquaintances list is a problematic point.''' When the list is first visited (by clicking "Ismerősök"), an alphanumeric pager ID is generated. Requests for the other pages of the list need this pager ID. However, a new request for the initial page invalidates the earlier pager ID! Also, the pager ID expires in 20 minutes – that is, all pages of a user's acquaintances list must be saved within 20 minutes. (This is why the script currently does this strictly in a separate phase at the end, and why the initial page is grabbed separately, to find out the current pager ID.)
* '''The site should be saved in Hungarian.''' There is an English language option, but how it works hasn't been tested. (Is it automatically set to English when visited from outside Hungary? Does the site remember the setting? Is the setting sent in a cookie or in the URL? etc.)
* The script uses wget for discovery. It's much faster, but it's not immune to DNS resolution errors (it doesn't retry), which is why there is a separate bash function for fetching with wget.
* The script uses wpull for grabbing (WARCing), because it's much more intelligent than wget. (The wget-lua version could probably also be used, though that needs some coding.)
* The script often checks whether we are still logged in. If not, the item – depending on which phase we are in – pauses (sleep) or fails.
* In its current state, the bash script doesn't support running multiple instances of itself. (However, there is probably no server-side obstacle to doing so with a proper script. '''There is a little glitch''' with colliding pagers that results in some 302 redirections, but this doesn't seem to affect either user experience or archival – just let it redirect. Anyway, '''a concurrency of 1 is recommended, but not absolutely necessary.''')
* A list of '''static files''' (that need to be downloaded only once) is [[myVIP/Static files|here]].
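The 2–3 TB figure above can be reproduced from the sample numbers quoted in the notes (about 4,600,000 possible IDs, 47% existing, ~790 KB average per profile). A back-of-the-envelope sketch in integer shell arithmetic – the figures are the estimates above, not measurements:

```shell
# Rough size estimate for the profile grab, from the sampled figures in the notes.
total_ids=4600000                      # highest known user ID (approximate)
existing=$(( total_ids * 47 / 100 ))   # ~47% of profiles exist
avg_kb=790                             # average WARC-compressed profile size, KB
total_kb=$(( existing * avg_kb ))
# express terabytes with two decimals using integer arithmetic only
tb_hundredths=$(( total_kb * 100 / 1024 / 1024 / 1024 ))
echo "~${existing} profiles, ~$(( tb_hundredths / 100 )).$(( tb_hundredths % 100 )) TB"
```

This lands at roughly 1.6 TB for the profiles alone, consistent with the "~2 TB" estimate once thumbnails and club pages are added on top.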
For more info, see [[myVIP/Bash script|the code]]. '''Further questions should be addressed to [[user:bzc6p]]''', either on this page's talk page, or on his talk page.
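The acquaintances-list constraint described in the notes is what forces the script's two-phase design: the page URLs are discovered early with a pager ID that is stale by grab time, so the script requests the list once more just before grabbing, learns the fresh pager ID, and rewrites the stored URLs – after which every page must be fetched within the 20-minute window. A minimal self-contained sketch of that substitution step (the file contents and the new pager ID here are made up for illustration):

```shell
# Sketch of the pager-ID swap: acq_list was written during discovery and
# carries a pager ID that has since been invalidated by the fresh request.
printf 'http://myvip.com/browse.php?pager=abc123&p=0\nhttp://myvip.com/browse.php?pager=abc123&p=1\n' > acq_list
OLDPAGERID=`head -1 acq_list | grep -o "pager=[0-9a-z]*" | cut -d"=" -f 2`
NEWPAGERID="4f3a9c"   # in the real script, parsed from wpull's log after one fresh request
sed -i -e "s/$OLDPAGERID/$NEWPAGERID/g" acq_list
# acq_list now points at the live pager; all its pages must be grabbed within ~20 minutes
```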
<hr>
MyVIP archiving bash script, written by [[user:bzc6p]]. Needs to be rewritten to conform to the ArchiveTeam framework and standards.
<pre>
#!/bin/bash
# Discovers and downloads user content belonging to a given user ID.
# Accepts one or two parameters: a single id or two ids; in the latter case it does the range.
# Creates a WARC file with the profile content and a csv file with one line containing some identifying information about the user.
# Avatar pictures' links are collected for future downloading.
abort_wpull () # if wpull is redirected, content is wrong and therefore we shouldn't go on
{
echo "> Wpull grabbed wrong pages last time, you probably have lost authentication or something other weird happened. Check the logs before going on."
echo "> Aborted."
rm temp1 temp2 temp3 temp4 temp41 temp42 temp5 acq_list list db myvip_script_lock 2>/dev/null
mkdir ERROR 2>/dev/null
mv $2.warc.gz ERROR
mv $2.csv ERROR
mv log.txt ERROR/log_$1.txt
rm avatars/*av_$1
unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGE_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS
unset MYVIP_A MYVIP_B WPULL_OPTS WGET_OPTS
}
fetch () # In case wget has a DNS error (doesn't retry) or we've lost authentication.
{
while [ true ]
do
wget $WGET_OPTS -O $2 $1 || { echo "> Probably an error in the connection. Sleeping 1 minute..."; sleep 60; continue; }
if [ `grep "<span class=\"btn-text\">Bejelentkezés</span>" $2 | wc -l | cut -d" " -f 1` -gt 0 ]; then
echo "> You have lost your authentication! Log in and export your cookies file again!"
echo "> Sleeping 1 minute..."
sleep 60
echo "> Retrying..."
continue
fi
break
done
}
echo "*** myVIP user backup script ***"
[[ $1 =~ `echo "^[0-9]+$"` ]] || { echo "> First parameter wrong!"; echo "> Aborted."; exit 1; }
[[ -z $2 ]] || [[ $2 =~ `echo "^[0-9]+$"` ]] || { echo "> Second parameter wrong!"; echo "> Aborted."; exit 1; }
[[ -z $2 ]] || [[ $1-$2 -le 0 ]] || { echo "> Parameters wrong!"; echo "> Aborted."; exit 1; }
echo "> Looking for wpull..."
wpull --version > wpull_ver 2>/dev/null || { echo "> You don't have wpull installed! wpull is necessary for the script to run!"; echo "> Aborted."; rm wpull_ver; exit 1; }
[ `cat wpull_ver | cut -d"." -f 1` -lt 1 ] && { echo "> Your wpull version is too old (`cat wpull_ver`). The script needs at least wpull version 1.2 to run."; echo "> Aborted."; rm wpull_ver; exit 1; }
[ `cat wpull_ver | cut -d"." -f 1` -eq 1 -a `cat wpull_ver | cut -d"." -f 2` -lt 2 ] && { echo "> Your wpull version is too old (`cat wpull_ver`). The script needs at least wpull version 1.2 to run."; echo "> Aborted."; rm wpull_ver; exit 1; }
rm wpull_ver
echo "> Checking authentication..."
if [ `wget --load-cookies cookies.txt -q -O - http://myvip.com/profile.php | grep "Adatlap" | wc -l | cut -d" " -f 1` -lt 1 ]; then
echo "> Authentication failed. Check your cookies file or your internet connection."; echo "> Aborted."; exit 1
fi
cat myvip_script_lock >/dev/null 2>/dev/null && { echo "> Another myVIP backup script seems to be running! Multiple instances of the script MUST NOT be run at the same time!"; echo "> It is possible though that the last run was interrupted. If you are sure no other myVIP backup script is running, issue 'rm myvip_script_lock' and retry."; echo "> Aborted."; exit 1; }
touch myvip_script_lock
mkdir avatars warcs logs index 2>/dev/null
MYVIP_A=$1
if [[ -z $2 ]]; then
MYVIP_B=$1
echo "> Backing up myVIP user profile $MYVIP_A"
else
MYVIP_B=$2
echo "> Backing up myVIP user profiles ${MYVIP_A}–${MYVIP_B}"
fi
WPULL_OPTS="--exclude-domains static.myvip.com,avatar.myvip.com --reject-regex infobar_frame|banner_bottombanner_frame -a log.txt --retry-connrefused --retry-dns-error --tries inf --waitretry 10 --timeout 30 --no-robots --progress none --load-cookies cookies.txt -p -H -Dmyvip.com --no-warc-keep-log --delete-after --database db --warc-append" # options for wpull
WGET_OPTS="-q -a log.txt --retry-connrefused -e robots=off --tries 0 --waitretry 10 --timeout 30 --load-cookies cookies.txt"
for (( n = $MYVIP_A; n <= $MYVIP_B; n++ ))
do
WARC_NAME=myvip_com_user_$n
rm list acq_list 2>/dev/null
echo "-------------------------------------------------------------------------------"
unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGE_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS
echo "> Fetching user page $n..."
fetch `echo "http://myvip.com/profile.php?uid=$n"` "temp1" # initial grab of user page
if [ `grep "Törölt, vagy nem létező felhasználó!" temp1 | wc -l | cut -d" " -f 1` -ne 0 ]; then # if profile doesn't exist
echo "> User profile doesn't exist, saving empty page..."
echo ";;;;;http://myvip.com/profile.php?uid=$n" > $WARC_NAME.csv
wpull $WPULL_OPTS --warc-file $WARC_NAME "http://myvip.com/profile.php?uid=$n" # actual content grab
if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
abort_wpull $n $WARC_NAME
exit 1
fi
echo "> Empty profile page $n archived."
else # if user page exists
echo "http://myvip.com/profile.php?uid=$n" >> list # it will be grabbed
# In the following lines, we parse the profile page for some identifying information. Everyone's will be put in an index so that if someone looks for a profile, they can easily find it. Multiple fields are necessary because several people may have the same name, and not everyone fills in all the fields. The index can be hidden or truncated later; the script should build it anyway.
# We'll use semicolon as the field separator, so we replace any semicolons with commas.
MYVIP_NAME=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>név:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>név:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/;/,/g"`
MYVIP_NICKNAME=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>becenév:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>becenév:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/;/,/g"`
MYVIP_BIRTHDATE=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>születési idő:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>születési idő:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | cut -d" " -f1-3`
MYVIP_PERM_ADDRESS=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>lakhely:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>lakhely:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/&gt;/>/g" | cut -d">" -f 3 | cut -d" " -f 2- | sed "s/;/,/g"`
MYVIP_TEMP_ADDRESS=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>tartózkodási hely:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>tartózkodási hely:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/&gt;/>/g" | cut -d">" -f 3 | cut -d" " -f 2- | sed "s/;/,/g"`
MYVIP_URL="http://myvip.com/profile.php?uid=$n"
echo "$MYVIP_NAME;$MYVIP_NICKNAME;$MYVIP_BIRTHDATE;$MYVIP_PERM_ADDRESS;$MYVIP_TEMP_ADDRESS;$MYVIP_URL" | sed "s/&quot,/\"/g" | sed "s/&amp,/\&/g" | sed "s/&lt,/</g" | sed "s/&gt,/>/g" > $WARC_NAME.csv # decoding special characters (whose trailing semicolons became commas above); they go to a semicolon-separated file
echo "> Profile for user '$MYVIP_NAME' indexed."
grep "loaded-image-userprofile_avatar" temp1 | grep -o "http[0-9a-zA-Z/\.?:_]*" | uniq | sed "s/\\\//g" >> list # avatar pic
if [ `grep -o "onclick='profile_gotopage(\"\",[0-9],[0-9]*); return false' class='rangepager-jump rangepager-jump-last'>" temp1 | wc -l | cut -d" " -f 1` -gt 0 ]; then
NUMCLUBS=`grep -o "onclick='profile_gotopage(\"\",[0-9],[0-9]*); return false' class='rangepager-jump rangepager-jump-last'>" temp1 | cut -d"," -f 2` # counting clublist pages
if [[ ! $NUMCLUBS = "" ]]; then
echo "> Parsing for club avatars..."
for (( i = 0; i <= $NUMCLUBS; i++))
do
echo -n $(($NUMCLUBS-$i))... # print progress
echo "http://myvip.com/profile.php?act=getclubs&page=$i&uid=$n" >> list # adding them to the list
fetch `echo "http://myvip.com/profile.php?act=getclubs&page=$i&uid=$n"` "temp2" # fetching to discover club avatars
grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp2 | cut -d'"' -f 2 >> avatars/clubav_$n
done
echo
else
grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/clubav_$n
fi
else
grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/clubav_$n
fi
grep -o "images.php?uid=[0-9]\+&imageid=[0-9]\+#imageview_container" temp1 | cut -d "'" -f 2 | sed "s/images\.php/http:\/\/myvip\.com\/images\.php/g" >> list # links to pictures on the profile page
if [ `grep "dousercontacts" temp1 | wc -l | cut -d" " -f 1` -eq 0 ]; then # does the user have acquaintances?
echo "> User has no acquaintances."
MAXPAGE=-1
else
echo "> Discovering acquaintances..."
fetch `echo "http://myvip.com/search.php?act=dousercontacts&uid=$n"` "temp1" # grabbing the acq. list to discover the number of acq. pages
if [ `grep "rangepager-jump rangepager-jump-last rangepager-jump-disabled" temp1 | wc -l | cut -d" " -f 1` -eq 0 ]; then # does the acq. list have more than one page?
MAXPAGE=`grep "rangepager-jump rangepager-jump-last" temp1 | uniq | rev | cut -d"&" -f 1 | rev | cut -d"=" -f 2 | cut -d"'" -f 1` # number of acq. pages
PAGER_PREFIX=`grep "rangepager-jump rangepager-jump-last" temp1 | uniq | rev | cut -d"'" -f 4 | rev | cut -d "&" -f 1-2` # url prefix for acq. pages, including a unique pager id
for (( i = 0; i <= $MAXPAGE; i++ ))
do
echo -n $(($MAXPAGE-$i))... # print progress
echo "http://myvip.com/$PAGER_PREFIX&p=$i" >> acq_list # urls of acquaintances pages. WE'LL MODIFY AND GRAB LATER!
fetch `echo "http://myvip.com/$PAGER_PREFIX&p=$i"` "temp2" # discovering profile avatars
grep -o "img src=\"http://avatar\.myvip\.com/avatars/users[^\"]*\"" temp2 | cut -d'"' -f 2 >> avatars/profav_$n
done
echo
else
grep -o "img src=\"http://avatar\.myvip\.com/avatars/users[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/profav_$n
MAXPAGE=0
echo "http://myvip.com/browse.php?act=browse&pager=phant0mpag3r1d3nt1f13r&p=0" >> acq_list
fi
#echo "> Found $(( $MAXPAGE + 1 )) pages of acquaintances." # We've already printed progress, deprecated
fi
echo "> Discovering images..."
echo "http://myvip.com/images.php?uid=$n" >> list
fetch `echo "http://myvip.com/images.php?uid=$n"` "temp1" # fetching the images page for discovery
if [ `grep "A felhasználónak nincs nyilvános albuma!" temp1 | wc -l | cut -d" " -f 1` -ne 0 ]; then # does the user have images?
echo "> User has no public images."
else
grep "images.php?albumid" temp1 | cut -d'"' -f 2 | cut -d"/" -f 2 | uniq > temp2 # collecting direct album links' postfixes
# cut -d"=" -f 2 temp2 | cut -d"&" -f 1 > albumids_$n # collecting albumids (probably not necessary)
echo "> User has `wc -l temp2 | cut -d" " -f 1` public albums."
sed "s/images\.php/http:\/\/myvip\.com\/images\.php/g" temp2 >> list # add myvip.com prefix
grep -o "/images.php?uid=[0-9]\+&albumid=[0-9]\+&imageid=[0-9]\+&getcontent=album&isajax=1" temp1 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" >> list # collecting browser thumbnail album links
grep -o "'/images.php?uid=[0-9]\+&albumid=[0-9]\+'" temp1 | cut -d "'" -f 2 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" | uniq >> list # another kind of direct links to albums
grep -o "/images.php?uid=[0-9]\+&albumid=[0-9]\+&getcontent=album&isajax=1" temp1 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" | uniq > temp3 # collecting browser album links
cat temp3 >> list # we'll grab them too
NUMALBUMS=`wc -l temp3 | cut -d" " -f 1`
for (( h = 1; h <= $NUMALBUMS; h++))
do
echo "> Discovering content of album $h/$NUMALBUMS..."
fetch `head -$h temp3 | tail -1` "temp4" # fetch the albums' embedded pages
grep -o "/images.php?uid=[0-9]\+&imageid=[0-9]\+&getcontent=img&isajax=1" temp4 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" > temp5 # collect image page postfixes & add myvip.com prefix
echo "> User has `wc -l temp5 | cut -d" " -f 1` images in this album."
cat temp5 >> list # add them to the list
grep -o "<div class=\"thumbnail-commentcnt\">[^<]*</div>" temp4 | cut -d">" -f 2 | cut -d"<" -f 1 > temp41 # list of comment counts
grep -o "/images.php?uid=[0-9]\+&imageid=[0-9]\+&getcontent=img&isajax=1" temp4 > temp42 # list of image pages, in the same order
ALBUMID=`head -$h temp3 | tail -1 | grep -o "albumid=[0-9]\+" | cut -d"=" -f 2`
NUMIMAGES=`wc -l temp41 | cut -d" " -f 1`
for (( i = 1; i <= $NUMIMAGES; i++))
do
if [ `head -$i temp41 | tail -1` -gt 20 ]; then
echo "http://myvip.com/images.php?imageid=`head -$i temp42 | tail -1 | cut -d'&' -f 2 | cut -d'=' -f 2`&albumid=$ALBUMID&uid=$n&isajax=1&getcontent=comments" >> list # get comments
fi
done
done
fi
echo "> Downloading discovered content..."
wpull $WPULL_OPTS --warc-file $WARC_NAME -i list # actual content grab
if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
abort_wpull $n $WARC_NAME
exit 1
fi
if [ $MAXPAGE -ne -1 ]; then
echo "> Downloading acquaintances pages"
OLDPAGERID=`head -1 acq_list | grep -o "pager=[0-9a-z]*" | cut -d"=" -f 2`
echo "http://myvip.com/search.php?act=dousercontacts&uid=$n" > list # one URL to find out the current pager ID
wpull $WPULL_OPTS --warc-file $WARC_NAME -i list
if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
abort_wpull $n $WARC_NAME
exit 1
fi
if [ `grep "Fetching ‘http://myvip.com/browse.php?pager=[0-9a-z]*&p=0’ encountered an error" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
echo "> A rare problem occurred. The grab of this user profile must be restarted."
rm acq_list temp1 temp2 temp3 temp4 temp41 temp42 temp5 list db $WARC_NAME.warc.gz $WARC_NAME.csv log.txt avatars/profav_$n avatars/clubav_$n 2>/dev/null
((n--))
cat STOP 2>/dev/null && ((n=$MYVIP_B))
continue
fi
NEWPAGERID=`grep "pager" log.txt | tail -1 | cut -d"=" -f 2 | cut -d"&" -f 1`
sed -i -e "s/$OLDPAGERID/$NEWPAGERID/g" acq_list
mv acq_list list
wpull $WPULL_OPTS --warc-file $WARC_NAME -i list # needed so that wpull surely uses the old database
if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
abort_wpull $n $WARC_NAME
exit 1
fi
fi
echo "> myVIP profile of user '$MYVIP_NAME' (id $n) has been successfully archived!"
fi
rm temp1 temp2 temp3 temp4 temp41 temp42 temp5 list db 2>/dev/null
mv $WARC_NAME.warc.gz warcs
mv $WARC_NAME.csv index
mv log.txt logs/log_$n.txt
cat STOP 2>/dev/null && ((n=$MYVIP_B)) # if the STOP file is present, we stop the loop
done
unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGE_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS
unset MYVIP_A MYVIP_B WPULL_OPTS WGET_OPTS
rm myvip_script_lock
exit 0
</pre>
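The wpull version gate near the top of the script spreads the major/minor comparison over two separate `[ ... ]` commands. The same logic can be expressed as one reusable function; a sketch only, with a made-up function name, not part of the original script:

```shell
# version_ok VERSION -- succeed if VERSION (MAJOR.MINOR[.PATCH]) is >= 1.2,
# the minimum wpull version the script requires.
version_ok () {
    local major minor
    major=${1%%.*}          # text before the first dot
    minor=${1#*.}           # text after the first dot...
    minor=${minor%%.*}      # ...trimmed of any patch component
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 2 ]; }
}

version_ok "1.2" && echo "1.2 is new enough"
version_ok "0.9" || echo "0.9 is too old"
```

Folding the check into one function also makes it easy to raise the minimum version in a single place instead of editing two test commands.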
Revision as of 16:18, 16 January 2017
The following notes, preceding the actual script, have been copied from the myVIP page, and most of it should be considered deprecated, along with the script afterwards; these are kept for historical purposes only.
Some notes about the site, the script and the archiving process:
- The site utilizes a lot of javascript but still can be saved perfectly.
- Scraping persons will need to register an account, and the cookies file (exported from the browser) must be fed with the script. Note: visiting myvip.com in a new browser session instantly invalidates the old cookie!
- The total size of content is probably around 2–3 terabytes. The first (oldest) 100,000 profiles show that 47% of the possible profiles exist, and those have an average profile size of ~790 kilobytes, with the largest ones being a few tens (less then 100) megabytes (WARC compressed). Note that this is a rough estimate with a small sample. (That would mean that the profiles would be like ~2 TB, not counting the profile pic thumbnails and the clubs, but those are probably not too significant in size.)
- The bash script is an amateur work, might be bash-specific at some points (i.e. not usable with other shells).
- However, the script has been tested, should be reliable and do its job.
- The script currently supports only profiles. Club pages should also be saved later, having this algorithm makes it simpler to write the one needed for that.
- The script accepts userids, that go from 1 up to like 4,600,000 (?) sequentially, but not all profiles exist.
- The script currently saves each user's stuff into separate WARC files (should be changed, as one such WARC file might be too little, resulting in lots of little files).
- The script saves the followings of a user: profile page, list of clubs the user is a member of (if more than one page), acquaintances (or "friends") list, photo albums, photos, comments on photos. That's all that should be saved, if any.
- The script supports creating a "directory" of users: it extracts some identifying information and stores in a one-line CSV file per user. (Should be adjusted just like the WARC; later they can be concatenated to form a database.)
- The script also creates lists of profile picture and club avatar thumbnails that are used in lists on the site. They could be saved for every user, but that would mean that a profile picture would be requested as many times as many acquaintances the user has. So, creating a list of them and then downloading all those tiny pics only once is the feasible solution.
- The script currently has a(n almost) separate discovery and grab phase. This means that some (many) pages are requested twice: while discovery and while WARCing. This could be probably optimized.
- A user's acquaintances list is a problematic point. When first visiting the list (clicking "Ismerősök"), an alphanumeric pager ID is generated. The request for the other pages of the list needs this pager ID. However, a new request for the initial pager ID invalidates the earlier one! Also, the pager ID expires in 20 minutes – that is, all pages of a user's acq. list must be saved in 20 minutes. (This is why the script currently does it strictly in one separate phase in the end, and that the initial page is grabbed separately, to find out the current pager ID.)
- The site should be saved in Hungarian. There is an English language option, but how it works hasn't been tested out. (Is it automatically set to English when visited outside of Hungary? Does the site remember the setting? Is the setting sent in a cookie or in the URL? etc.)
- The script uses wget for discovery. It's much faster, but it's not immune to DNS resolution errors (doesn't retry), that's why a separate bash function for fetching with wget.
- The script uses wpull for grabbing (WARCing), beacuse it's much more intelligent than wget. (The wget-lua version could probably also be used, though, but that needs some coding.)
- The script often checks whether we still are logged in. If not, then the item – depending on which phase we are in – pauses (sleep) or fails.
- The bash script doesn't support running multiple instances of it, in its current state. (However, there is probably no obstacle server-side in the way of doing so with a proper script. There is a little glitch with colliding pagers, that results in some 302 redirections, but this doesn't seem to change user experience nor archival, just let it redirect. – Anyway, concurrency of 1 is recommended, but not absolutely necessary.)
- A list of static files (that need to be downloaded only once) is here.
For more info, see the code. Further questions should be addressed to user:bzc6p, either on this page's talk page, or on his talk page.
MyVIP archiving bash script, written by user:bzc6p. Needs to be rewritten to conform ArchiveTeam framework and standards.
#!/bin/bash # Discovers and downloads user content belonging to given user ID # Accepts one or two paramters: a single id or two ids, in the latter case does the range. # Creates a WARC file with the profile content and a csv file with one line containing some identifying information about the user. # Avatar pictures' links are collected for future downloading. abort_wpull () # if wpull is redirected, content is wrong and therefore we shouldn't go on { echo "> Wpull grabbed wrong pages last time, you probably have lost authentication or something other weird happened. Check the logs before going on." echo "> Aborted." rm temp1 temp2 temp3 temp4 temp41 temp42 temp5 acq_list list db myvip_script_lock 2>/dev/null mkdir ERROR 2>/dev/null mv $2.warc.gz ERROR mv $2.csv ERROR mv log.txt ERROR/log_$1.txt rm avatars/*av_$1 unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGE_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS unset MYVIP_A MYVIP_B WPULL_OPTS WGET_OPTS } fetch () # In case wget has a DNS error (doesn't retry) or we've lost authentication. { while [ true ] do wget $WGET_OPTS -O $2 $1 || { echo "> Probably an error in the connection. Sleeping 1 minute..."; sleep 60; continue; } if [ `grep "<span class=\"btn-text\">Bejelentkezés</span>" $2 | wc -l | cut -d" " -f 1` -gt 0 ]; then echo "> You have lost your authentication! Log in and export your cookies file again!" echo "> Sleeping 1 minute..." sleep 60 echo "> Retrying..." continue fi break done } echo "*** myVIP user backup script ***" [[ $1 =~ `echo "^[0-9]+$"` ]] || { echo "> First parameter wrong!"; echo "> Aborted."; exit 1; } [[ -z $2 ]] || [[ $2 =~ `echo "^[0-9]+$"` ]] || { echo "> Second parameter wrong!"; echo "> Aborted."; exit 1; } [[ -z $2 ]] || [[ $1-$2 -le 0 ]] || { echo "> Parameters wrong!"; echo "> Aborted."; exit 1; } echo "> Looking for wpull..." 
wpull --version > wpull_ver 2>/dev/null || { echo "> You don't have wpull installed! wpull is necessary for the script to run!"; echo "> Aborted"; rm wpull_ver; exit 1; } [ `cat wpull_ver | cut -d"." -f 1` -lt 1 ] && { echo "> Your wpull version is too old (`cat wpull_ver`). The script needs at least wpull version 1.2 to run."; echo "> Aborted."; rm wpull_ver; exit 1; } [ `cat wpull_ver | cut -d"." -f 1` -eq 1 -a `cat wpull_ver | cut -d"." -f 2` -lt 2 ] && { echo "> Your wpull version is too old (`cat wpull_ver`). The script needs at least wpull version 1.2 to run."; echo "> Aborted."; rm wpull_ver; exit 1; } rm wpull_ver echo "> Checking authentication..." if [ `wget --load-cookies cookies.txt -q -O - http://myvip.com/profile.php | grep "Adatlap" | wc -l | cut -d" " -f 1` -lt 1 ]; then echo "> Authentication failed. Check your cookies file or your internet connection."; echo "> Aborted."; exit 1 fi cat myvip_script_lock >/dev/null 2>/dev/null && { echo "> Another myVIP backup script seems to be running! Multiple instances of the script MUST NOT be run at the same time!"; echo "> It is possible though that the last run interrupted. 
If you are sure no other myVIP backup script is running, issue 'rm myvip_script_lock' and retry."; echo "> Aborted."; exit 1; } touch myvip_script_lock mkdir avatars warcs logs index 2>/dev/null MYVIP_A=$1 if [[ -z $2 ]]; then MYVIP_B=$1 echo "> Backing up myVIP user profile $MYVIP_A" else MYVIP_B=$2 echo "> Backing up myVIP user profiles ${MYVIP_A}–${MYVIP_B}" fi WPULL_OPTS="--exclude-domains static.myvip.com,avatar.myvip.com --reject-regex infobar_frame|banner_bottombanner_frame -a log.txt --retry-connrefused --retry-dns-error --tries inf --waitretry 10 --timeout 30 --no-robots --progress none --load-cookies cookies.txt -p -H -Dmyvip.com --no-warc-keep-log --delete-after --database db --warc-append" # options for wpull WGET_OPTS="-q -a log.txt --retry-connrefused -e robots=off --tries 0 --waitretry 10 --timeout 30 --load-cookies cookies.txt" for (( n = $MYVIP_A; n <= $MYVIP_B; n++ )) do WARC_NAME=myvip_com_user_$n rm list acq_list 2>/dev/null echo "-------------------------------------------------------------------------------" unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGE_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS echo "> Fetching user page $n..." fetch `echo "http://myvip.com/profile.php?uid=$n"` "temp1" # initial grab of user page if [ `grep "Törölt, vagy nem létező felhasználó!" temp1 | wc -l | cut -d" " -f 1` -ne 0 ]; then # if profile doesn't exist echo "> User profile doesn't exist, saving empty page..." echo ";;;;;http://myvip.com/profile.php?uid=$n" > $WARC_NAME.csv wpull $WPULL_OPTS --warc-file $WARC_NAME "http://myvip.com/profile.php?uid=$n" # actual content grab if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then abort_wpull $n $WARC_NAME exit 1 fi echo "> Empty profile page $n archived." 
	else # if user page exists
		echo "http://myvip.com/profile.php?uid=$n" >> list # it will be grabbed
		# In the following lines, we parse the profile page for some identification information. Everyone's data is put in an index so that whoever looks for a profile can easily find it. Multiple fields are necessary because several people may have the same name, and not everyone fills in all the fields. The index can be hidden or truncated later; the script should build it anyway.
		# We'll use semicolon as the field separator, so we replace any semicolons in the data with commas.
		MYVIP_NAME=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>név:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>név:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/;/,/g"`
		MYVIP_NICKNAME=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>becenév:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>becenév:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/;/,/g"`
		MYVIP_BIRTHDATE=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>születési idő:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>születési idő:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | cut -d" " -f1-3`
		MYVIP_PERM_ADDRESS=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>lakhely:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>lakhely:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/&gt;/>/g" | cut -d">" -f 3 | cut -d" " -f 2- | sed "s/;/,/g"`
		MYVIP_TEMP_ADDRESS=`grep -o "<span style='width:[0-9]*px;' class='pairs-key'>tartózkodási hely:</span><span style='margin-left:[0-9]*px;' class='pairs-value'>[^<]*</span>" temp1 | sed "s/<span style='width:[0-9]*px;' class='pairs-key'>tartózkodási hely:<\/span><span style='margin-left:[0-9]*px;' class='pairs-value'>//g" | sed "s/<\/span>//g" | sed "s/&gt;/>/g" | cut -d">" -f 3 | cut -d" " -f 2- | sed "s/;/,/g"`
		MYVIP_URL="http://myvip.com/profile.php?uid=$n"
		echo "$MYVIP_NAME;$MYVIP_NICKNAME;$MYVIP_BIRTHDATE;$MYVIP_PERM_ADDRESS;$MYVIP_TEMP_ADDRESS;$MYVIP_URL" | sed "s/&quot,/\"/g" | sed "s/&amp,/\&/g" | sed "s/&lt,/</g" | sed "s/&gt,/>/g" > $WARC_NAME.csv # decoding special characters (their entities' semicolons became commas above); the fields go to a semicolon-separated file
		echo "> Profile for user '$MYVIP_NAME' indexed."
		grep "loaded-image-userprofile_avatar" temp1 | grep -o "http[0-9a-zA-Z/\.?:_]*" | uniq | sed "s/\\\//g" >> list # avatar pic
		if [ `grep -o "onclick='profile_gotopage(\"\",[0-9],[0-9]*); return false' class='rangepager-jump rangepager-jump-last'>" temp1 | wc -l | cut -d" " -f 1` -gt 0 ]; then
			NUMCLUBS=`grep -o "onclick='profile_gotopage(\"\",[0-9],[0-9]*); return false' class='rangepager-jump rangepager-jump-last'>" temp1 | cut -d"," -f 2` # counting clublist pages
			if [[ ! $NUMCLUBS = "" ]]; then
				echo "> Parsing for club avatars..."
				for (( i = 0; i <= $NUMCLUBS; i++))
				do
					echo -n $(($NUMCLUBS-$i))... # print progress
					echo "http://myvip.com/profile.php?act=getclubs&page=$i&uid=$n" >> list # adding them to the list
					fetch `echo "http://myvip.com/profile.php?act=getclubs&page=$i&uid=$n"` "temp2" # fetching to discover club avatars
					grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp2 | cut -d'"' -f 2 >> avatars/clubav_$n
				done
				echo
			else
				grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/clubav_$n
			fi
		else
			grep -o "img src=\"http://avatar\.myvip\.com/avatars/clubs[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/clubav_$n
		fi
		grep -o "images.php?uid=[0-9]\+&imageid=[0-9]\+#imageview_container" temp1 | cut -d "'" -f 2 | sed "s/images\.php/http:\/\/myvip\.com\/images\.php/g" >> list # links to pictures on the profile page
		if [ `grep "dousercontacts" temp1 | wc -l | cut -d" " -f 1` -eq 0 ]; then # does the user have acquaintances?
			echo "> User has no acquaintances."
			MAXPAGE=-1
		else
			echo "> Discovering acquaintances..."
			fetch `echo "http://myvip.com/search.php?act=dousercontacts&uid=$n"` "temp1" # grabbing the acquaintance list to discover the number of its pages
			if [ `grep "rangepager-jump rangepager-jump-last rangepager-jump-disabled" temp1 | wc -l | cut -d" " -f 1` -eq 0 ]; then # does the acquaintance list have more than one page?
				MAXPAGE=`grep "rangepager-jump rangepager-jump-last" temp1 | uniq | rev | cut -d"&" -f 1 | rev | cut -d"=" -f 2 | cut -d"'" -f 1` # number of acquaintance pages
				PAGER_PREFIX=`grep "rangepager-jump rangepager-jump-last" temp1 | uniq | rev | cut -d"'" -f 4 | rev | cut -d "&" -f 1-2` # URL prefix for acquaintance pages, including a unique pager id
				for (( i = 0; i <= $MAXPAGE; i++ ))
				do
					echo -n $(($MAXPAGE-$i))... # print progress
					echo "http://myvip.com/$PAGER_PREFIX&p=$i" >> acq_list # URLs for acquaintance pages. WE'LL MODIFY AND GRAB THEM LATER!
					fetch `echo "http://myvip.com/$PAGER_PREFIX&p=$i"` "temp2" # discovering profile avatars
					grep -o "img src=\"http://avatar\.myvip\.com/avatars/users[^\"]*\"" temp2 | cut -d'"' -f 2 >> avatars/profav_$n
				done
				echo
			else
				grep -o "img src=\"http://avatar\.myvip\.com/avatars/users[^\"]*\"" temp1 | cut -d'"' -f 2 >> avatars/profav_$n
				MAXPAGE=0
				echo "http://myvip.com/browse.php?act=browse&pager=phant0mpag3r1d3nt1f13r&p=0" >> acq_list # placeholder pager id, replaced before grabbing
			fi
			#echo "> Found $(( $MAXPAGE + 1 )) pages of acquaintances." # We've already printed progress, deprecated
		fi
		echo "> Discovering images..."
		echo "http://myvip.com/images.php?uid=$n" >> list
		fetch `echo "http://myvip.com/images.php?uid=$n"` "temp1" # fetching the images page for discovery
		if [ `grep "A felhasználónak nincs nyilvános albuma!" temp1 | wc -l | cut -d" " -f 1` -ne 0 ]; then # does the user have images?
			echo "> User has no public images."
		else
			grep "images.php?albumid" temp1 | cut -d'"' -f 2 | cut -d"/" -f 2 | uniq > temp2 # collecting direct album links' postfixes
			# cut -d"=" -f 2 temp2 | cut -d"&" -f 1 > albumids_$n # collecting albumids (probably not necessary)
			echo "> User has `wc -l temp2 | cut -d" " -f 1` public albums."
			sed "s/images\.php/http:\/\/myvip\.com\/images\.php/g" temp2 >> list # add the myvip.com prefix
			grep -o "/images.php?uid=[0-9]\+&albumid=[0-9]\+&imageid=[0-9]\+&getcontent=album&isajax=1" temp1 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" >> list # collecting browser thumbnail album links
			grep -o "'/images.php?uid=[0-9]\+&albumid=[0-9]\+'" temp1 | cut -d "'" -f 2 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" | uniq >> list # get the other kind of direct links to albums
			grep -o "/images.php?uid=[0-9]\+&albumid=[0-9]\+&getcontent=album&isajax=1" temp1 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" | uniq > temp3 # collecting browser album links
			cat temp3 >> list # we'll grab them too
			NUMALBUMS=`wc -l temp3 | cut -d" " -f 1`
			for (( h = 1; h <= $NUMALBUMS; h++))
			do
				echo "> Discovering content of album $h/$NUMALBUMS..."
				fetch `head -$h temp3 | tail -1` "temp4" # fetch the album's embedded page
				grep -o "/images.php?uid=[0-9]\+&imageid=[0-9]\+&getcontent=img&isajax=1" temp4 | sed "s/\/images\.php/http:\/\/myvip\.com\/images\.php/g" > temp5 # collect image page postfixes & add the myvip.com prefix
				echo "> User has `wc -l temp5 | cut -d" " -f 1` images in this album."
				cat temp5 >> list # add them to the list
				grep -o "<div class=\"thumbnail-commentcnt\">[^<]*</div>" temp4 | cut -d">" -f 2 | cut -d"<" -f 1 > temp41 # list of comment counts
				grep -o "/images.php?uid=[0-9]\+&imageid=[0-9]\+&getcontent=img&isajax=1" temp4 > temp42 # list of image pages, in the same order
				ALBUMID=`head -$h temp3 | tail -1 | grep -o "albumid=[0-9]\+" | cut -d"=" -f 2`
				NUMIMAGES=`wc -l temp41 | cut -d" " -f 1`
				for (( i = 1; i <= $NUMIMAGES; i++))
				do
					if [ `head -$i temp41 | tail -1` -gt 20 ]; then
						echo "http://myvip.com/images.php?imageid=`head -$i temp42 | tail -1 | cut -d'&' -f 2 | cut -d'=' -f 2`&albumid=$ALBUMID&uid=$n&isajax=1&getcontent=comments" >> list # get the comments
					fi
				done
			done
		fi
		echo "> Downloading discovered content..."
		wpull $WPULL_OPTS --warc-file $WARC_NAME -i list # actual content grab
		if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
			abort_wpull $n $WARC_NAME
			exit 1
		fi
		if [ $MAXPAGE -ne -1 ]; then
			echo "> Downloading acquaintances pages"
			OLDPAGERID=`head -1 acq_list | grep -o "pager=[0-9a-z]*" | cut -d"=" -f 2`
			echo "http://myvip.com/search.php?act=dousercontacts&uid=$n" > list # one URL to find out the current pager ID
			wpull $WPULL_OPTS --warc-file $WARC_NAME -i list
			if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
				abort_wpull $n $WARC_NAME
				exit 1
			fi
			if [ `grep "Fetching ‘http://myvip.com/browse.php?pager=[0-9a-z]*&p=0’ encountered an error" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
				echo "> A rare problem occurred. Grab of this user profile must be restarted."
				rm acq_list temp1 temp2 temp3 temp4 temp41 temp42 temp5 list db $WARC_NAME.warc.gz $WARC_NAME.csv log.txt avatars/profav_$n avatars/clubav_$n 2>/dev/null
				((n--)) # redo this user id
				cat STOP 2>/dev/null && ((n=$MYVIP_B))
				continue
			fi
			NEWPAGERID=`grep "pager" log.txt | tail -1 | cut -d"=" -f 2 | cut -d"&" -f 1`
			sed -i -e "s/$OLDPAGERID/$NEWPAGERID/g" acq_list # swap in the pager ID of the current session
			mv acq_list list
			wpull $WPULL_OPTS --warc-file $WARC_NAME -i list # needed so that wpull surely uses the old database
			if [ `grep "index\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 -o `grep "homeent\.php" log.txt | wc -l | cut -d" " -f 1` -ne 0 ]; then
				abort_wpull $n $WARC_NAME
				exit 1
			fi
		fi
		echo "> myVIP profile of user '$MYVIP_NAME' (id $n) has been successfully archived!"
	fi
	rm temp1 temp2 temp3 temp4 temp41 temp42 temp5 list db 2>/dev/null
	mv $WARC_NAME.warc.gz warcs
	mv $WARC_NAME.csv index
	mv log.txt logs/log_$n.txt
	cat STOP 2>/dev/null && ((n=$MYVIP_B)) # if a STOP file is present, we stop the loop
done
unset MYVIP_NAME MYVIP_NICKNAME MYVIP_BIRTHDATE MYVIP_PERM_ADDRESS MYVIP_TEMP_ADDRESS MYVIP_URL MAXPAGE PAGER_PREFIX NUMALBUMS ALBUMID NUMIMAGES MYVIP_TEMP NEWPAGERID OLDPAGERID NUMCLUBS
unset MYVIP_A MYVIP_B WPULL_OPTS WGET_OPTS
rm myvip_script_lock
exit 0
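The wpull version gate at the top of the script simply splits the output of `wpull --version` on dots and compares the major and minor numbers. A minimal standalone sketch of that comparison (the `version_ok` helper and the sample version strings are illustrative, not part of the original script):

```shell
# Sketch of the script's "at least wpull 1.2" check, isolated into a function.
version_ok() {
	major=`echo "$1" | cut -d"." -f 1`
	minor=`echo "$1" | cut -d"." -f 2`
	[ "$major" -lt 1 ] && return 1                   # 0.x is too old
	[ "$major" -eq 1 -a "$minor" -lt 2 ] && return 1 # 1.0 and 1.1 are too old
	return 0                                         # 1.2 and newer pass
}

version_ok "1.2.3" && echo "1.2.3: ok"
version_ok "1.1" || echo "1.1: too old"
```

Note that `cut -d"." -f 2` returns the whole string when there is no dot, so a bare major version like "2" still compares correctly.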