Difference between revisions of "User talk:Archive Maniac"
I hope I answered your questions, and sorry for missing your earlier messages. --- [[User:Chfoo|Chfoo]] 16:18, 12 April 2014 (EDT)
== Some friendly words ==
The text in small print is obsolete; see the update below it.
| <small>I don't like starting private conversations except about technical things. However, I've seen your strange activities on the ArchiveTeam IRC channels recently, and I can't help saying some words. | |||
First, I won't ever be sarcastic or cynical with you. Some AT members may have been, but that's understandable: we have different amounts of patience. They have also made assumptions about your age, although we have no information about that.
| Seeing your reactions and activities, I think I can understand your behaviour. I used to make similar actions and reactions myself, so we have common traits in some way, if you don't mind me saying this. | |||
I was lucky to be present when the initial affair happened. I read through the lines several times, and the only thing I could conclude is that you accidentally wrote those lines in that window (they were totally out of context, and you said so yourself), yet you were immediately banned. I don't remember you asking too much, as you state on your user page. It is possible that I missed something, and that SketchCow and the others had their reasons to treat you as "persona non grata", but I don't see them.
| Either way, you shouldn't feel offended. If you had logged in with another nickname, no one would have ever remembered your earlier activity. Even if you had logged in as... you know how, even then, I'm sure, no one would have said a word against or about you, provided you acted normally. | |||
Even now, you could see that people tried to be friendly towards you. However, what you feel is that they have some kind of hatred against you, and that you must take revenge. No. It's not true. People don't hate you, even now, and you don't need to take revenge. I hate to say this, but if you go on acting like this, ''then'' they may actually become fed up. But it's not too late to turn back from this crazy path.
What you have been doing is called "demonstrating" on Wikipedia. It is sad to see someone who is otherwise a valuable member do that. You seem to be a valuable member, doing useful things for/with/like ArchiveTeam. Please be collaborative and not disruptive. You don't have to say much, or even do much. I myself don't say or do too much (although more and more, the longer I've been an AT member). I'm sure all (or at least 98% of) ArchiveTeam members count on your work and will welcome you if you don't act in a kind of crazy way, if you don't mind me saying this.
And one thing about SketchCow: he is neither a de jure nor a de facto leader of AT. He writes about himself: ''"While I am a (generally) beloved figure who is appreciated for his public speaking skills and snappy dressing, Archive Team has collectively disagreed with me and some projects have been approached completely different ways than I would have approached them."'' What's more, you don't have to talk to him. I myself haven't talked to him yet either; I just listen to him and agree or disagree with him in my own mind.
| You write on your user page that there are friendly people here. Definitely, more than you think. As I see, almost every one of them. There are ones who don't seem to be so good mannered – but where aren't people like them? They are good too, just not that patient or have their own problems or such. (SketchCow has really unique manners, some adorable, some maybe not, but the same could be said about any one of us.) | |||
I want to assure you that you can ask me if you have questions or want to discuss something; I won't try to get rid of you, and I'll try not to hurt you with my words. And I want to encourage you to take part in ArchiveTeam's nice work. I've been in the group for only some months so far, but every day I know more and more about archiving, the web and programming – and archiving is kind of fun, isn't it? Be sure your work is appreciated by everyone; just avoid demonstrating like today. Apart from today's demonstrative activities, your work (making website crawls, informing AT about closures, running warriors) is appreciated. I think everyone is ready to forget everything immediately if you return to that kind of work with a calm tone. The one on your user page is a good starting point.
| I know what it is like to be touchy. I am (or used to be) touchy myself. People forget and forgive, and we outgrow our traits like that. So cheer up and ArchiveTeam awaits you in its journey and mission! | |||
| Yours truly, [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]), 17 October 2014, 14:55 (UTC)</small> | |||
I studied your "history", the events preceding your ban. So basically the only problem was that you talked a lot off-topic on ArchiveTeam channels and asked many not-so-important questions.
About the first thing. It's no problem that you are chatty. You may have thought that these IRC channels are also meant for the kind of talk you initiated. You weren't far wrong about that, just a bit. You just need to accept that these channels are not completely as you imagined. The problem is not with you, nor with the channel, but with the two together. You can't do much about it, but don't be angry with the channel members. They aren't angry with you either; they just find what you were doing inappropriate.
| About the second thing. For some of your questions, the previous paragraph applies. For the others: some answers you may find out yourself, some of them you don't necessarily need to know. No problem with curiosity, but members may find too many questions exhausting. I hope you understand this. (I say this while I myself tend to ask too many questions sometimes, to make sure, but I am also patient answering questions. Not all of us must be like me regarding this thing, it's understandable that some people don't like tons of questions.) | |||
| And about both of the two things in general: too much text in IRC channels and logs makes the essence get lost. At least I think this. This is another thing why we should talk only about archiving-related stuff on AT IRC channels. | |||
Still, I uphold much of what I wrote earlier. You shouldn't be cross with AT members, and especially shouldn't swear at them. If you consider what I wrote in the preceding paragraphs, you will be welcome on IRC even after these things. (Or, if you want to make sure, you can choose another nickname. That doesn't matter too much, I think.) Don't let revenge lead your actions. That's disruptive and counterproductive. None of us can do quality work if we don't listen to each other, study, and sometimes ask. We know more and more every day, and after a point we answer more than we ask. But only if we are collaborative. That's the way it goes.
| I'm ready to answer your questions if I can, I think I won't run out of patience too early. (No problem if someone does, but then that person shouldn't be bothered too much.) You can use my talk page if you have questions you think I can answer. | |||
I'm glad to see you didn't give up archiving, even if you communicated this on IRC in a quite provocative way. I want to repeat that you can't possibly do quality work if you ignore other, more experienced members. Don't be hurt if they say your product is not okay. What can one do with incompatible, corrupted or incomplete files? You should accept the advice. All of us do so. If anything, archiving is something you can't do with completely closed eyes and ears.
| And please don't curse SketchCow or anyone else...  We must conform to others' manners when we talk to them. They also do so when they talk to us. This is the way it goes, again. I'm sure you know how it feels to be hurt. Why would you hurt others then? | |||
I myself feel that I must be careful when talking to some people, especially if they are much older than me or have strange manners. Others do the same when talking to us (e.g. trying not to hurt, being patient, etc.). And, about mistakes, we all forget and forgive – and learn.
| I know the things I just wrote may be seen as spam, or at least needless and offtopic and too personal for this wiki. However, I just wanted to tell you that your archiving efforts are appreciated, and with some experience you may soon become a valued member of ArchiveTeam, doing lot of good stuff. You only need to be patient yourself, listen to others, read instructions and IRC, try things you are unsure of, and if important or you can't find out, ask. More or less this is what I've been doing, and I haven't had quarrels with others in AT so far, but I'm already on the level of being able to answer some questions and do good work (I think so). | |||
| I think I can tell you on behalf of ArchiveTeam that if you consider what I've written above, you'll be fine and your work will be welcome. | |||
| I hope we can count on you in the future. That's why I wrote this 10kb-ish post. (Sorry everyone for writing so much, this is one of my weaknesses.) | |||
| Yours truly, [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]), 18 October 2014, 20:23 (UTC) | |||
| :You are welcome. However, I think it would be too early and strange if I entered the channel that "Hey guys, Dec-31-99 is sorry and wants you to forgive him"... It will resolve itself, if you wait a couple of days. Then, if you want to tell them something important (in short, to make sure), they won't kick you out, I'm sure – provided you follow the guidelines others and I told you. | |||
:I'm sure that I'm not the only one who "understood your situation". Rather, I may be the only time-millionaire who can type 10 kB to "explain ArchiveTeam".
:Well, the message "if you know any other Hungarian sites..." is addressed to Hungarian people in the first place, as they can find sunsetting sites more easily, you can guess why... but of course no one is excluded. I myself regularly check Google with keywords like "web site closes" (in Hungarian). (In fact, this is how I found Panoramio and alerted ArchiveTeam!) As for GPortál, it's a very big WYSIWYG website hosting service with other services as well. I don't expect it to close without any notification, and if it is ever going to shut down, that will be a big thing and will make noise.
:For the specific website you mentioned: if you want to archive that site (I don't have the time now, as I'm busy with Demotiváló, and you could learn from grabbing this donkeykong site), you can do two things. One is to pass it to ArchiveBot. I haven't used it, so you need to check how it works. (My projects so far needed special care; I think ArchiveBot couldn't have done them by itself. But if it's a simple website without too much awful JavaScript, hidden comments, etc., it may be able to handle it.) The other is to grab the website yourself. For that I recommend [http://github.com/chfoo/wpull wpull], a wget-like program designed with creating WARC files in mind. I didn't check the website too deeply, but as far as I can see, its components reside under "donkeykong.gportal.hu" and "gportal.hu/portal/donkeykong". The wpull command I would try first:
| :<code>wpull --accept-regex "donkeykong.gportal.hu|gportal.hu/portal/donkeykong" -o log.txt --no-warc-keep-log --recursive --level inf -p -H -Dgportal.hu --tries inf --no-robots --retry-connrefused --retry-dns-error --delete-after --warc-cdx --database DATABASEFILENAME --warc-file WARCFILENAME</code> | |||
:where you choose DATABASEFILENAME and WARCFILENAME as you wish. The database file lets you continue the download; the only problem is that wpull then ignores the already existing WARC file (and overwrites it). When I archive a larger site, I prepare for this: I give the WARC file name the postfix _01 first, and if wpull gets stopped for some reason, I change the postfix to _02 and so on, leaving the other options intact. Having several files is not too elegant, but they may later be merged with some megawarc tool. But if you have a good internet connection (the problem here is that wpull sometimes pretends there is no connection when there is one, which may be a bug) and the site is not that big, it may come down in one run – in that case you can omit the database file and the postfixes. ''The latter is the desirable way.''
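:The numbered-postfix routine described above could be automated with a small shell helper. This is only a sketch: the function name <code>next_warc_name</code> and the file names in the usage note are made up for illustration, and it assumes that wpull stores its output as <code>WARCFILENAME.warc.gz</code>, which is worth checking on your version.

```shell
# Print the first unused numbered WARC base name (base_01, base_02, ...).
# Assumption: wpull writes its archive as <name>.warc.gz.
# The body runs in a subshell, so its variables do not leak out.
next_warc_name() (
    base="$1"
    n=1
    while :; do
        candidate=$(printf '%s_%02d' "$base" "$n")
        if [ ! -e "${candidate}.warc.gz" ]; then
            printf '%s\n' "$candidate"
            exit 0
        fi
        n=$((n + 1))
    done
)

# Usage (hypothetical names): keep every option of the command above and the
# same database file, and change only the last option to
#     --warc-file "$(next_warc_name donkeykong)"
```

:Each restart then picks a fresh WARC name by itself, so an earlier part cannot be overwritten by accident.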
| :Wpull documentation, including a manpage-style option overview: http://wpull.readthedocs.org | |||
| :See [[The WARC Ecosystem]] for warc-tools. | |||
| :If you want to test your WARC, try [https://github.com/alard/warc-proxy warc-proxy]. Even ArchiveTeam uses that sometimes. I've read somewhere that one of your (?) WARCs couldn't be injected into Wayback Machine for some reason. Well, if warc-proxy can read your WARC, that doesn't necessarily imply that Wayback also will, but we can hope. | |||
| :These are all Linux tools. I don't know any tools for Windows. Software like HTTrack may be good in mirroring, but they don't speak WARC, and WARC is essential for Wayback Machine. | |||
| :[[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 19 October 2014, 22:25 (UTC+2) | |||
| ::wpull [http://wpull.readthedocs.org/en/master/changelog.html#id3 has just dropped] Python2 support. | |||
::You can run Python programs on Windows if you have Python and the other dependencies installed, can't you? (I haven't tried.)
| ::[[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 20 October 2014, 17:52 (UTC+2) | |||
:::A possible and handy solution is to create a virtual machine with a minimalist Linux installation (e.g. Debian ''testing''; when installing, choose ''Expert install'' and don't go further than installing the base (or core) system if you don't want a GUI). I do the same myself, as Debian stable (which I use) seems to be too old for wpull. I don't remember any errors when installing wpull on Debian testing.
| :::On the other hand, I could install the ArchiveTeam scripts easily on Debian stable and had problems on Debian testing, so I run the scripts on the real stable system, and also a virtual machine with testing to run wpull. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 21 October 2014, 07:31 (UTC+2) | |||
== Re: Any Help on Chat? ==
I don't think I have much of a way with words. I rarely speak on AT IRC channels, and I have never spoken on #archiveteam-bs or #archivebot.
| Regarding #archiveteam-bs, the best way to find out the appropriate behaviour is to read through some of the chatlogs. On http://badcheese.com/~steve/atlogs/ you can read the logs of some channels (including #archiveteam, #archiveteam-bs, but unfortunately not #archivebot) for the last 10 days directly, but by changing the parameter in the URL, you can even go back several months. | |||
| Regarding #archivebot, I've never been there and have no chatlogs, so I can rely only on what is written on the wiki: ''"Channel for controlling ArchiveBot. Discussions about ArchiveBot development also take place here."''; and yipdw wrote on #archiveteam-bs on 2014-09-09: ''"in that channel the expectation is that you're there to issue commands, check up on a job, or talk about something to work on; talking about how archivebot works is fine but there's a point where it just gets annoying to deal with"''. There is a [[ArchiveBot|wiki page]] with basic information about ArchiveBot. | |||
| Regarding my IRC presence, I'm usually logged in to channels of featured – and currently active – projects, mainly to follow the news. Right now I'm available on #quitpic. Sometimes I also log in to #archiveteam, but usually for a short time, when announcing something important and waiting for the reactions. I said I rarely speak on IRC: I only answer questions not answered in some minutes, or announce important news or problems yet not noticed by the people "in charge". | |||
| My IRC username is the same as here: ''bzc6p''. However, I'm much of the time away from keyboard, but I usually check the log when coming back, and reply to private messages if any. | |||
| [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 22 October 2014, 12:59 (UTC+2) | |||
== Invitation for private chat ==
| Let's talk in private my friend. Please come to #pmchannel on EFnet. (There you can recommend a better "place" if you have any.) I'll be by the computer or check often from today to Sunday from ~7:00 until 22:00 UTC. I count on your attendance. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 23 October 2014, 10:44 (UTC+2) | |||
:Damn timezones. Thank you for being there – I missed your arrival and departure by just some tens of minutes... Well, the weekend may be better for us in terms of free time and sleeping patterns, but I won't wait until then. Next time I'll get up during the night (that's your afternoon and evening), and we can talk. We may give each other our email addresses (I don't want to disclose mine publicly) to overcome this timezone issue. I want to end this private communication on this wiki.
| :See you there tomorrow, and sorry for this situation.  | |||
| : [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 23 October 2014 21:03 PDT / 24 October 2014 06:03 CEST | |||
== ArchiveBot Requests ==
To tell the truth, I don't really care about ArchiveBot, at least for now, for three reasons. One, I don't consider most websites simple enough for a wpull run to get everything without human intelligence. Two, I don't want to use others' bandwidth while I can manage with mine. Three, I can learn a lot about archiving websites if I do it myself.
| So I think I can't take ArchiveBot requests. (I don't even know its commands, etc.) Moreover, I have no more right in ArchiveBot or ArchiveTeam channels than you. And, if you have only some sites you want to be grabbed by archivebot, people in #archiveteam usually initiate the task of archiving a page if you ask them. | |||
If you've been banned in such a way that you can't enter those channels at all (even with another nickname), that's another case; then tell me and I'll pass on your request.
Sorry if I sounded rough or something; don't take it personally. I've been busy these days and I'm quite tired at the moment.
| Regards, [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 15:03, 19 November 2014 (EST) | |||
:1. My favourite web archiving tool is [http://github.com/chfoo/wpull wpull]. I had a problem with wget parsing certain HTML files. And a great thing about wpull (which wget lacks, as far as I know) is that it can store its database in a separate file. So, when continuing a mirroring run, it doesn't need the files it downloaded earlier (they don't even need to be stored, see --delete-after), but uses the database file instead. (You can even manipulate it with e.g. [http://sqlitestudio.pl/ sqlitestudio], for example to prevent failing URLs from being retried forever, to add new URLs, etc. – however, normal usage may not require this, and it may be inappropriate.)
| :I don't know about any other sophisticated WARC supporting mirroring tool. | |||
:2. Yes, of course, I do that myself too. As far as I know, ArchiveBot runs wpull, and I believe that an ArchiveBot command just initiates a recursive download of the site with page requisites. I don't know ArchiveBot at all, but surely there is no (easy) way to apply human intelligence the way I do in my mirrors, in several steps, taking scripts and other things into consideration.
:So I think there's nothing ArchiveBot could do that you couldn't. (Except maybe uploading directly to the ArchiveTeam collection, but that's not so important.) However, AB may have more space, better performance and a stable internet connection. But the latter can also be worked around: if you need to stop and continue the grab, you can do that with the help of the database file, but you must give a different WARC file name (you may append a postfix), and finally you can concatenate the files using [https://github.com/alard/megawarc megawarc].
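:For joining the numbered parts, a low-tech alternative to megawarc is plain concatenation. This is not megawarc's own method, only a sketch that rests on two assumptions: the parts are gzip-compressed WARC files, and several gzip files written one after another still form a valid gzip stream. The function name and the file names are hypothetical.

```shell
# Join WARC parts (e.g. site_01.warc.gz, site_02.warc.gz) into one file.
# Arguments: <output file> <part 1> <part 2> ...
# Assumption: all parts are gzip-compressed; concatenated gzip members
# decompress as one continuous stream.
join_warc_parts() (
    out="$1"
    shift
    cat "$@" > "$out" || exit 1
    # Check that the result is still a readable gzip stream.
    gzip -t "$out"
)

# Usage (hypothetical file names; the output must not be one of the inputs):
# join_warc_parts site_full.warc.gz site_01.warc.gz site_02.warc.gz
```

:Keep the original parts until the merged file has been tested, for example with warc-proxy.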
| :3. Possibly because Python is – I guess – much more portable and platform independent, and the source code doesn't need to be compiled every single time (it's an interpreted language). | |||
| :4. I don't, but I bet you'll find information about it with a Google search. | |||
:No, I'm not annoyed at all. However, it may happen that I can't/don't answer very soon. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 14:43, 21 November 2014 (EST)
| ::Well, as I remember, on a Debian Jessie (currently testing branch) I could install wpull smoothly. I think the <code>python3</code> and possibly the <code>python3-pip</code> packages are necessary to issue <code>pip3 install wpull</code>, and that pulls the dependencies automatically. | |||
::On the older Debian (Wheezy, the current stable) I couldn't install it, because Wheezy seems to insist on Python2 as the default. I run a virtual machine with Jessie (without a GUI) to run wpull. You said you had similar problems on Windows. Well, I haven't used Windows much in a while, and its new versions not at all, so I think I can't help with that. It should work on Debian Jessie. Or, if you prefer Ubuntu or something else, a recent version that prefers (or at least supports well) Python3 should suffice too. (I believe there must be a Windows workaround too, documented somewhere on the internet, for Python3 in general.)
| ::[[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 15:04, 22 November 2014 (EST) | |||
== Re: Wikiadownloader.py problem ==
| [[user:bzc6p|bzc6p]] here, let me answer your question until chfoo gives a better one (if necessary). | |||
| At the beginning of the wikiadownloader.py you can read the following:  | |||
|  # using a list of wikia subdomains, it downloads all dumps available in Special:Statistics pages | |||
|  # you can use the list available at the "listofwikis" directory, the file is called wikia.com and it contains +200k wikis | |||
| So, <code>wikia.com</code> ''is'' actually a file, so the script isn't wrong, at least at this point. However, I couldn't find the file where it is said to be. But indeed, there are files in that directory (in fact, in its subdirectories) that have lists of wikis. After studying the code, I think you need to download a list, rename it to <code>wikia.com</code> and start the script (the listfile must be in the same directory as the script). See also the instructions in the script file. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 13:04, 27 November 2014 (EST) | |||
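The steps above could look roughly like this in a shell. It is only a sketch: the helper name is made up, the paths in the usage note are placeholders for whichever list you pick from the "listofwikis" subdirectories, and which Python version the script needs is not checked here.

```shell
# Copy a chosen wiki list next to the script under the name it expects.
# Arguments: <list file> <directory containing wikiadownloader.py>
prepare_wikia_list() (
    list="$1"
    script_dir="$2"
    [ -f "$list" ] || { echo "no such list: $list" >&2; exit 1; }
    cp "$list" "$script_dir/wikia.com"
)

# Usage (placeholder paths):
# prepare_wikia_list path/to/chosen-list path/to/script-dir
# cd path/to/script-dir && python wikiadownloader.py
```

Copying instead of renaming keeps the original list untouched, so another list can be tried later.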
== Re: Blank CD Question ==
| I don't know what the estimated shelf life of optical discs is, I've never studied this question. Instead, I check their health every year. For archiving purposes I mostly use DVD-RWs, as some of the archives change, and they are reusable for any purpose. They are not older than 5 years, so that's probably not a reference. | |||
| I have a CD-R like 12 years old and it's fine. But a CD-ROM with about the same age is showing the common signs of failure. (It's actually a Windows XP disc.) I don't have many old discs, some CD-Rs and DVD-Rs I have are like 6 years old and fine. | |||
So I'm not a good person to ask, as I don't have an old collection of optical discs. I believe that the technology indeed has a shelf life, and that the actual lifetime of a given disc depends on the manufacturer and the usage, but must be around some typical number. To make sure, check your discs regularly. I've found a great tool: [http://dvdisaster.net dvdisaster]. With it you can not only check your discs but also generate error correction code that can save your disc when the first sectors go wrong. An interesting fact: its documentation says that discs tend to go wrong from the outside to the inside, so the outer parts (that's the ''end'' of the data) are lost first. My Windows XP disc indeed behaves like that.
| For your second question: Unfortunately I'm not an expert in such questions, I possibly don't know more about it than you. I use DVD-RWs for backing up my original CDs and my work on the computer, but that's not more than some tens of gigabytes. If I wanted to store terabytes (I hope one day I can, I have some great plans), I think I would buy an external hard disk drive, and buy some second-hand hard disks. And store them cold, i.e. not connected to a computer. | |||
| [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 16:46, 29 November 2014 (EST) | |||
== Re: Blogter.hu's Unexpected Downfall ==
| You don't need to worry about GPortal. Let me explain. | |||
| 1. Blogter.hu showed '''clear signs''' of decay. I myself used the site too. In 2010, it started to respond slowly. At that time, archiving would have been possible. Maybe in 2011 too, or – although very slowly – even later. If I had been concerned with web archiving like now, I definitely would have saved it or sounded ArchiveTeam's alarms if necessary. (The same applies to extra.hu, freeblog.hu etc.) | |||
2. I think the financial status of the company behind a website also helps in deciding whether problems are looming. (Of course, this is not the only possible reason for closing a website, but why would you close down a profitable thing?) Balance sheets of companies are publicly available in Hungary. I realised this just now. According to them, Blogter Ltd. took out a huge loan in 2007, I don't know why. (It was a loan of about $400,000, although their share capital was just $12,000.) By 2011 it became clear that they couldn't (or didn't want to) repay the loan, and their capital even went negative. Remember: at that time the site could still have been saved. (Note: I'm not an economist, I don't understand these sheets ''too much''.)
Another example is [[freeblog.hu]]. I checked its numbers too. They got into a situation similar to that of Blogter Ltd. The balance sheet of 2011 showed $–14,000 capital, while the share capital was just $2,000. So it was in huge trouble. The data became available in May 2012. The site was still up and fine in August 2013, but there were some signs of decay. So one could have archived it, had one paid attention.
3. Gportal is owned by Origo, which is owned by Magyar Telekom (one of the biggest companies in Hungary and the market leader in telecommunication), which is owned by Deutsche Telekom, a multinational company. Although there are some negative numbers in the subsidiary's balance sheet, the company is big enough that I'm not that afraid. And, what is more important, I'm absolutely sure it wouldn't go down without notice or at least obvious signs. Remember, Blogter and Freeblog also showed signs. Just no one acted. So their downfall wasn't ''unexpected''.
4. Gportal is the 156th most popular website in Hungary, according to Alexa. That's quite good, I think.
| -------------- | |||
The biggest site closures in Hungary in recent years were those of extra.hu, iWiW, Freeblog and Blogter. Extra.hu announced its closure about two months in advance, and iWiW also months in advance. Freeblog showed signs of technical problems months before, and of financial problems 1.5 years before. I have already written about Blogter. If I had been an ArchiveTeam member back then, we would at least have tried to save all of them.
Every day I know a lot more about what to keep an eye on: not only technical status, support activity and number of visitors, but also financial status. Once I find a site endangered, I'll initiate its archival. (The first ones may be those old free hosting services that started about a decade ago but are now quite abandoned with the rise of social networks.) But at the moment I'm not afraid, and archiving a living website would be not only a waste of time and resources, but also partly in vain, as new content is still being added. Especially if the Wayback Machine is making snapshots.
GPortal is one of the least likely to disappear anyway.
| -------------- | |||
| Before you misunderstand my tone, no problem that you're concerned with archiving, what's more, it's great. But the site you mentioned doesn't seem to be in danger, and I keep an eye on the others as well, considering several factors. (If it was now that Blogter was still available, I'd try to save it, along with mommo.hu, which was an early social networking site also run by Blogter Ltd. – and, what a shame, it was fully functional even this year and I didn't save it.) | |||
| I joined ArchiveTeam in spring this year, and it's since summer that I'm ready to take action myself if necessary. Too bad I came just too late for Blogter and Mommo. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 15:21, 8 December 2014 (EST) | |||
== Re: What I'm Currently Doing ==
As far as I can see, you upload rarities (and get WM snapshots of them), which is good, as these rare but possibly valuable things are in danger of being lost and forgotten. This work, however, takes time and manual effort, so it's even more appreciated.
| What I'm doing regarding archiving is quite a "bulk" work instead. After discovering the site structure, it goes almost automatically to download and then upload stuff. I mean image hosting services, which I'm "specialized" in. Although they are alive now, I '''don't trust any of them'''. And I find them important, for several reasons. | |||
I've also been planning to upload some small but good software that I know and that is in danger of being forgotten. And also some original authors on YouTube – however, I must look into what the Internet Archive is already archiving with its YouTube crawls.
Regarding your earlier "mad rage", there is nothing I need to forgive you for, as you never hurt ''me'' – your rage wasn't directed at me. Thus you don't need to feel embarrassed around me at all, and I also kind of understand it, as we discussed it privately.
| Have a good time archiving stuff. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 09:45, 6 January 2015 (EST) | |||
| :You asked Chfoo [http://archiveteam.org/index.php?title=User_talk:Chfoo#Wikiadownloader.py_problem here] (last section), and I answered [http://archiveteam.org/index.php?title=User_talk:Archive_Maniac#Re:_Wikiadownloader.py_problem here]. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 14:36, 6 January 2015 (EST) | |||
== Re: View Archive.org Directories as Text Only ==
I don't know what exactly you mean, and even if I did, I think I would be unable to give an answer. If you meant listing the files in an item, <code><nowiki>http://archive.org/download/<IDENTIFIER></nowiki></code> is the way, e.g. https://archive.org/download/demotivalo_net_2014_october. This is the page you are presented with when you click "HTTP" under "Download item" on the left (on the old version of the site) of the details page of an item.
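As a small illustration of the URL pattern above, the listing address can be built and fetched from a shell. The function name is made up, and the use of <code>curl</code> is an assumption about what you have installed.

```shell
# Build the file-listing URL for an Internet Archive item identifier.
ia_listing_url() {
    printf 'https://archive.org/download/%s\n' "$1"
}

# Usage: fetch the listing page of the example item (needs curl and network access).
# curl -L "$(ia_listing_url demotivalo_net_2014_october)"
```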
| The type of link you gave as an example is new to me, and I don't know (and couldn't find) other such "hidden" pages of the Wayback Machine. | |||
| If it was on IRC that you were taught the way, then try searching for that with Google on badcheese.com. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 07:28, 23 January 2015 (EST) | |||
| :I guess it's not what they taught you, but [https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md this] tool (which I found mentioned [http://www.willglynn.com/2014/01/26/exporting-from-the-wayback-machine/ here]), used with proper parameters, may give the result you need. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 07:46, 24 January 2015 (EST) | |||
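:As an illustration, the CDX server linked above is also reachable on the public Wayback Machine at <code>web.archive.org/cdx/search/cdx</code>. The sketch below builds a query that lists the captures under a URL prefix as plain text. The function name is made up, and the parameter choice (<code>matchType</code>, <code>fl</code>, <code>collapse</code>) is just one possible combination from that README, not necessarily the one you were taught.

```shell
# Build a Wayback CDX query that lists captured URLs under a prefix.
#   matchType=prefix       : everything below the given URL
#   fl=timestamp,original  : print only the capture time and the original URL
#   collapse=urlkey        : one line per distinct URL
# Note: the argument is inserted as-is, without URL-encoding.
cdx_query_url() {
    printf 'http://web.archive.org/cdx/search/cdx?url=%s&matchType=prefix&fl=timestamp,original&collapse=urlkey\n' "$1"
}

# Usage (needs curl and network access):
# curl "$(cdx_query_url donkeykong.gportal.hu/)"
```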
::Unfortunately, as you also know, IA respects robots.txt, even retroactively (official statement [https://archive.org/about/exclude.php here] and a discussion [https://archive.org/post/406632/why-does-the-wayback-machine-pay-attention-to-robotstxt here]). As you yourself discovered, IA hides content (I guess they don't delete it) behind a 403 page – and I think those guys have the technical skills not to leave any gap through which someone could still access it. (I mean, there are lots of websites that, for example, don't have www links to some documents, but if you know the address of the containing directory, you get the directory listing, and you've got everything. Well, I guess the IA staff are not that careless.)
| ::So, what to say? This is the situation. IA doesn't seem to be willing to change it, not even to drop the retroactivity. (They may have a reason for that, though.) I can only quote your own words: "I suggest you save/archive your favorite old web pages on this machine before they get the "robots" move." | |||
| ::Or, one more thing you could do: try searching in other, smaller internet archiving sites' databases – who knows, some of them may have the portion of the internet you're looking for.  | |||
| ::And one off-topic thing. SketchCow [http://badcheese.com/~steve/atlogs/?chan=archiveteam-bs&day=2014-12-05 advised] against "engaging" people in the forums. That is – if I understand well with my English – one should avoid getting into an argument with them, especially when talking about ArchiveTeam. I'm telling you this because [https://archive.org/post/1020010/why-does-the-wayback-machine-pay-attention-to-robotstxt you were on the edge of it]; take this as my preemptive good advice. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 17:08, 24 January 2015 (EST) | |||
| == Re: FTP Sites == | |||
| I am aware of the FTP saving project. However, it doesn't seem well organized to me – I don't know who works on it or what they download, although there is a [https://github.com/zmap/zmap tool] with which ALL the IPv4 FTP sites can be discovered unbelievably quickly, and then a distributed project could be created to systematically mirror each and every one of them. I'm not in a position to organize this, and until someone does, I don't feel like joining. | |||
| At the same time, I've got a lot of work with Hungarian image hosting services; it could saturate my bandwidth forever, so I don't have the capacity to do anything else unless it is very urgent. (It is beyond dispute that image hosts must be saved – they hold content and generate links; the content should not be lost, nor should the links become broken. And the reliability of these services is poor on average.) | |||
| In response to your other discovery, I myself discovered that too, and although it seems to be a handy way of archiving, I'm not sure if it is the desirable way – at first glance, it looks okay, though. However, creating and uploading WARCs has the advantage that they can easily be restored if the site goes down and the domain expires and someone decides "let's put it back!". Or if someone wants to mirror specific sites' archives, then it's just one click – but exporting from the Wayback Machine is very difficult. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 15:27, 1 February 2015 (EST) | |||
| :What I meant last time is that downloading sites with wget or wpull (a wget-like tool written in Python) using the options that create a WARC file (wpull was designed to create WARCs from the start; wget has supported it only fairly recently) and then uploading these WARCs has the benefit that one can download the whole archive of the site in one file. And, yes, as WARCs can be "injected" into the Wayback Machine, the site looks almost the same there as if you had saved it with the web.archive.org/web/save (or whatever) method. But, again, one can simply download the whole archive only if it is uploaded as an archive (preferably WARC). | |||
| :As far as I know, wpull is behind ArchiveBot. The Warrior still uses the modified wget, but one day it may be replaced with wpull too. | |||
| :Regarding your FTP site saving efforts, do what you wish – you don't have to narrow your efforts on Hungarian sites, and although it looks kind that you do this "for my sake", I don't feel like I deserve that much... recognition or something like that. Hungarian FTP sites are not more important than any others in the world, maybe rather less important. | |||
| :Just save what you think important, urgent or what you feel like. I do the same. [[User:bzc6p|bzc6p]] ([[User_talk:bzc6p|talk]]) 17:29, 5 February 2015 (EST) | |||
| == Re: Sup Archive Maniac, from the Bibliotheca Anonoma == | |||
| You really might want to consider joining our team, we need more help from people just like you, and just as you say, you've really been helping us a lot. | |||
| It's easy to get in contact with us. Our control room is our IRC channel at irc.rizon.net #bibanon. This is where everything happens. Please join in and idle, it's a lively community of archivists just like you, sharing info and tips and tricks with each other. | |||
| [[User:Antonizoon|Antonizoon]] 13:33, 21 October 2015 (EDT) | |||
Latest revision as of 17:33, 21 October 2015
Hi Archive Maniac, if you're having trouble, it's best to chat on IRC on the #archiveteam channel on EFnet where more people can help. I don't know how to upload wikis so you will need to join the #wikiteam channel for help. Please be patient and leave your chat client connected to give someone time to answer. Thanks. Chfoo 01:37, 17 February 2014 (EST)
Hi, sorry for not responding to your earlier messages. I don't check the wiki for messages that often because Archive Team does all its discussion on IRC. There's no forums unfortunately. If you have trouble with IRC, you can email me and I can get back to you sooner.
The best way to store your backups is to keep copies on multiple hard drives. Like VHS tapes and audio cassettes, CDs and DVDs wear out after a while. It's called disc rot. Although hard drives don't last long either, they hold much more data and are cheaper in the long run.
People who run the Warrior scripts manually usually have experience and money to spend on cloud computing for virtual hosts so they can run dozens of the scripts at once. This is why the people at the top of the Warrior leaderboards have gigabytes and gigabytes downloaded.
Archive Team already has a way for people to submit websites to be archived. It's called ArchiveBot and anyone can use it. All Archive Team files are placed into the archiveteam collection. Adding files to this collection is restricted, since files in it show up in the Wayback Machine.
Regarding uploading things to Internet Archive, uploading archives with good conventions is excellent and I wish more people would take initiative and be proactive.
However when uploading websites, you need to upload WARC files instead of a 7z file of the website. With wget, you'll need to use the --warc-file option. For example, --warc-file example will produce a WARC file called example.warc.gz. You want to use WARC files so The Wayback Machine can load them and show the archives properly.
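As a rough illustration of the advice above, here is a small Python sketch that assembles such a wget command line. The helper names `wget_warc_command` and `warc_output_name` and the URL `http://example.com/` are made up for the example; the extra `--mirror` and `--page-requisites` options are a common choice for site grabs, not something prescribed above.

```python
import shlex


def wget_warc_command(url, warc_name):
    """Build a wget command line that mirrors `url` and also writes a WARC.

    Hypothetical helper: only --warc-file comes from the advice above,
    the other options are an assumption about a typical site grab.
    """
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS and scripts
        "--warc-file=" + warc_name,  # wget adds the .warc.gz ending itself
        url,
    ]


def warc_output_name(warc_name):
    """Name of the file wget writes with its default WARC compression."""
    return warc_name + ".warc.gz"


command = wget_warc_command("http://example.com/", "example")
print(shlex.join(command))          # paste the printed line into a shell
print(warc_output_name("example"))
```

The resulting `.warc.gz` file, not a 7z of the downloaded pages, is what should be uploaded.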
I hope I answered your questions and sorry for missing your earlier messages. --- Chfoo 16:18, 12 April 2014 (EDT)
Some friendly words
The text with small letters is obsolete, see update under that.
I don't like starting private conversations except about technical things. However, I've seen your strange activities on the ArchiveTeam IRC channels recently, and I can't help saying some words.
First, I won't ever be sarcastic or cynical with you. Some AT members may have been, but it's understandable. We have different amounts of patience. They have made assumptions about your age as well; however, we have no information about that.
Seeing your reactions and activities, I think I can understand your behaviour. I used to make similar actions and reactions myself, so we have common traits in some way, if you don't mind me saying this.
I was lucky to be present when the initial affair happened. I read through the lines several times, but the only thing I could conclude is that you accidentally wrote those lines to that window (they were totally out of context and you said this yourself too), but you were immediately banned. I don't remember you asking as much as you state on your user page. It is possible I missed something, and SketchCow and the others had their reasons to qualify you as "persona non grata", but I don't see them.
Either way, you shouldn't feel offended. If you had logged in with another nickname, no one would have ever remembered your earlier activity. Even if you had logged in as... you know how, even then, I'm sure, no one would have said a word against or about you, provided you acted normally.
Even now, you could see people tried to be friendly towards you. However, what you feel is that they have some kind of hatred against you, and you must take revenge. No. It's not true. People don't hate you, even now, and you don't need to take revenge. I hate to say this, but if you go on acting like this, they may actually become fed up. But it's not too late to turn back from this crazy path.
What you have been doing is called "demonstrating" on Wikipedia. It is sad to see an otherwise valuable member do that. You seem to be a valuable member, doing useful things for/with/like ArchiveTeam. Please be collaborative and not disruptive. You don't have to say much, or even do much. I myself don't say or do too much (though more and more, the longer I've been an AT member). I'm sure all (or at least 98% of) ArchiveTeam members count on your work and will welcome you if you don't act in a kind of crazy way, if you don't mind me saying this.
And one thing about SketchCow: he is neither a de jure nor a de facto leader of AT. He writes about himself: "While I am a (generally) beloved figure who is appreciated for his public speaking skills and snappy dressing, Archive Team has collectively disagreed with me and some projects have been approached completely different ways than I would have approached them." What's more, you don't have to talk to him. I myself haven't talked to him yet either; I just listen to him and agree or disagree with him privately.
You write on your user page that there are friendly people here. Definitely, more than you think. As I see, almost every one of them. There are ones who don't seem to be so good mannered – but where aren't people like them? They are good too, just not that patient or have their own problems or such. (SketchCow has really unique manners, some adorable, some maybe not, but the same could be said about any one of us.)
I want to assure you that you can ask me if you have questions or want to discuss something; I won't try to get rid of you, and I'll try not to hurt you with my words. And I want to encourage you to take part in ArchiveTeam's nice work. I've been in the group only for some months so far, but every day I know more and more about archiving, web, programming – and archiving is kind of fun, isn't it? Be sure your work is appreciated by everyone, just avoid demonstrating like today. Apart from today's demonstrative activities, your work (making website crawls, informing AT about closures, running warriors) is appreciated. I think everyone is ready to forget everything about you immediately, if you return to that kind of work, with a calm tone. The one on your user page is a good starting point.
I know what it is like to be touchy. I am (or used to be) touchy myself. People forget and forgive, and we outgrow our traits like that. So cheer up and ArchiveTeam awaits you in its journey and mission!
Yours truly, bzc6p (talk), 17 October 2014, 14:55 (UTC)
I studied your "history", the events preceding your ban. So basically the only problem was that you talked much offtopic on ArchiveTeam channels and asked many, not-that-much important questions.
About the first thing. No problem that you are chatty. You could think that these IRC channels are also meant for talk you initiated. You didn't mistake too much about it, just a bit. You just need to accept that these channels are not completely like you imagined. It's not a problem with you, nor with the channel. But the two together. You can't do much about it, but don't be angry with channel members. Nor are they angry with you, they just find what you were doing inappropriate.
About the second thing. For some of your questions, the previous paragraph applies. For the others: some answers you may find out yourself, some of them you don't necessarily need to know. No problem with curiosity, but members may find too many questions exhausting. I hope you understand this. (I say this while I myself tend to ask too many questions sometimes, to make sure, but I am also patient answering questions. Not all of us must be like me regarding this thing, it's understandable that some people don't like tons of questions.)
And about both of the two things in general: too much text in IRC channels and logs makes the essence get lost. At least I think this. This is another thing why we should talk only about archiving-related stuff on AT IRC channels.
I still uphold much of what I wrote earlier. You shouldn't be at odds with AT members, and especially shouldn't swear at them. If you consider what I wrote in the preceding paragraphs, you will be welcome on IRC even after these things. (Or, if you want to make sure, you can choose another nickname. That doesn't matter too much, I think.) Don't let revenge lead your actions. That's disruptive and counterproductive. None of us can do quality work if we don't listen to each other, study, sometimes ask. We know more and more every day, and after a point we answer more than we ask. But only if we are collaborative. That's the way it goes.
I'm ready to answer your questions if I can, I think I won't run out of patience too early. (No problem if someone does, but then that person shouldn't be bothered too much.) You can use my talk page if you have questions you think I can answer.
I'm glad to see you didn't give up archiving, even if you communicated this on IRC in quite a provocative way. I want to repeat that you can't possibly do quality work if you ignore other, more experienced members. Don't feel hurt if they say your product is not okay. What can be done with incompatible, corrupted or incomplete files? You should accept advice. All of us do so. If anything, archiving is something you can't do with your eyes and ears completely closed.
And please don't curse SketchCow or anyone else... We must conform to others' manners when we talk to them. They also do so when they talk to us. This is the way it goes, again. I'm sure you know how it feels to be hurt. Why would you hurt others then?
I myself feel that I must be careful when talking to some people, especially if they are much older than me or have strange manners. Others do the same when talking to us (e.g. trying not to hurt, being patient etc.). And, about mistakes, we all forget and forgive – and learn.
I know the things I just wrote may be seen as spam, or at least needless and offtopic and too personal for this wiki. However, I just wanted to tell you that your archiving efforts are appreciated, and with some experience you may soon become a valued member of ArchiveTeam, doing a lot of good stuff. You only need to be patient yourself, listen to others, read instructions and IRC, try things you are unsure of, and, if something is important or you can't find it out, ask. More or less this is what I've been doing, and I haven't had quarrels with others in AT so far, but I'm already on the level of being able to answer some questions and do good work (I think so).
I think I can tell you on behalf of ArchiveTeam that if you consider what I've written above, you'll be fine and your work will be welcome.
I hope we can count on you in the future. That's why I wrote this 10kb-ish post. (Sorry everyone for writing so much, this is one of my weaknesses.)
Yours truly, bzc6p (talk), 18 October 2014, 20:23 (UTC)
- You are welcome. However, I think it would be too early and strange if I entered the channel that "Hey guys, Dec-31-99 is sorry and wants you to forgive him"... It will resolve itself, if you wait a couple of days. Then, if you want to tell them something important (in short, to make sure), they won't kick you out, I'm sure – provided you follow the guidelines others and I told you.
- I'm sure I'm not the only one who "understood your situation". Rather, I may be the only time-millionaire who can type 10 kB to "explain ArchiveTeam".
- Well, the message "if you know any other Hungarian sites..." is addressed to Hungarian people in the first place, as they can find sunsetting sites more easily, you can guess why... but of course no one is excluded. I myself regularly check Google with keywords like "web site closes" (in Hungarian). (In fact, this is how I found Panoramio and alerted ArchiveTeam!) As for GPortál, it's a very big WYSIWYG website hosting service and has other services as well; I don't expect it to close without any notification, and if it is ever going to shut down, that will be a big thing and will make noise.
- For the specific website you mentioned: if you want to archive that site (I don't have the time now, I'm busy with Demotiváló right now – and you could learn by grabbing this donkeykong), you can do two things. One is to pass it to ArchiveBot. I haven't used it, so you need to check out how it works. (My projects so far needed special care; I think ArchiveBot couldn't have done them by itself. But if it's a simple website without too much awful JavaScript, hidden comments etc., it may be able to handle it.) The other is to grab the website yourself. For that I recommend wpull, which is a wget-like program designed with creating WARC files in mind. I didn't check the website too deeply, but as far as I can see, website components reside under "donkeykong.gportal.hu" and "gportal.hu/portal/donkeykong". The wpull command I would try first:
- wpull --accept-regex "donkeykong.gportal.hu|gportal.hu/portal/donkeykong" -o log.txt --no-warc-keep-log --recursive --level inf -p -H -Dgportal.hu --tries inf --no-robots --retry-connrefused --retry-dns-error --delete-after --warc-cdx --database DATABASEFILENAME --warc-file WARCFILENAME
- where you choose DATABASEFILENAME and WARCFILENAME as you wish. The database file lets you continue the download; the only problem is that wpull then ignores the already existing WARC file (and overwrites it). If I archive a larger site, I prepare for this: I give the WARC file name the _01 postfix first, and if wpull gets stopped for some reason, I change the postfix to _02 etc., leaving the other options intact. Having several files is not too elegant, but later they may be merged with some megawarc tool. But if you have a good internet connection (the problem here is that for some reason wpull sometimes pretends there is no connection when there is; it may be a bug) and the site is not that big, it may come down in one run – in that case you can omit the database file and the postfixes. The latter is the preferable way.
- Wpull documentation, including a manpage-style option overview: http://wpull.readthedocs.org
- See The WARC Ecosystem for warc-tools.
- If you want to test your WARC, try warc-proxy. Even ArchiveTeam uses that sometimes. I've read somewhere that one of your (?) WARCs couldn't be injected into Wayback Machine for some reason. Well, if warc-proxy can read your WARC, that doesn't necessarily imply that Wayback also will, but we can hope.
- These are all Linux tools. I don't know any tools for Windows. Software like HTTrack may be good in mirroring, but they don't speak WARC, and WARC is essential for Wayback Machine.
- wpull has just dropped Python2 support.
- You can run Python programs on Windows if you have Python and the other dependencies installed, can't you? (I haven't tried.)
- bzc6p (talk) 20 October 2014, 17:52 (UTC+2)
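The _01, _02 naming described above for WARC files can be automated. The following is only a sketch of that idea in Python: `next_warc_name` is a made-up helper, and it assumes the WARC parts sit in one directory and end in `.warc` or `.warc.gz`. It does not run wpull; it only picks the value to pass as WARCFILENAME for the next run.

```python
import re
import tempfile
from pathlib import Path


def next_warc_name(directory, base):
    """Return base_NN, where NN is one above the highest part already present.

    Hypothetical helper: looks for files named base_NN.warc or base_NN.warc.gz.
    """
    pattern = re.compile(re.escape(base) + r"_(\d+)\.warc(\.gz)?$")
    used = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            used.append(int(match.group(1)))
    return "{}_{:02d}".format(base, max(used, default=0) + 1)


# Demonstration in a throwaway directory instead of a real grab.
with tempfile.TemporaryDirectory() as workdir:
    first = next_warc_name(workdir, "donkeykong")    # nothing downloaded yet
    (Path(workdir) / (first + ".warc.gz")).touch()   # pretend wpull was interrupted
    second = next_warc_name(workdir, "donkeykong")   # name for the resumed run
    print(first, second)
```

The database file name stays the same between runs; only the WARC name changes.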
 
- A possible and handy solution is to create a virtual machine with a minimalist Linux installation (e.g. Debian testing; when installing, choose Expert install and don't go further than installing the base (or core) system if you don't want a GUI). I do the same myself, as Debian stable (which I use) seems to be too old for wpull. I don't remember any errors when installing wpull on Debian testing.
- On the other hand, I could install the ArchiveTeam scripts easily on Debian stable and had problems on Debian testing, so I run the scripts on the real stable system, and also a virtual machine with testing to run wpull. bzc6p (talk) 21 October 2014, 07:31 (UTC+2)
 
 
Re: Any Help on Chat?
I don't think I have that much of a way with words. I rarely speak on AT IRC channels, and have never done so on #archiveteam-bs or #archivebot.
Regarding #archiveteam-bs, the best way to find out the appropriate behaviour is to read through some of the chatlogs. On http://badcheese.com/~steve/atlogs/ you can read the logs of some channels (including #archiveteam, #archiveteam-bs, but unfortunately not #archivebot) for the last 10 days directly, but by changing the parameter in the URL, you can even go back several months.
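To make "changing the parameter in the URL" concrete, here is a small Python sketch that builds a log link for a given channel and day. It assumes the log viewer takes `chan` and `day` query parameters with the day written as YYYY-MM-DD, which is how links to it look; `log_url` is a made-up helper name.

```python
from datetime import date
from urllib.parse import urlencode

LOG_BASE = "http://badcheese.com/~steve/atlogs/"


def log_url(channel, day):
    """Build a link to one day's log of a channel (channel name without '#').

    Assumption: the viewer accepts ?chan=<channel>&day=<YYYY-MM-DD>.
    """
    query = urlencode({"chan": channel, "day": day.isoformat()})
    return LOG_BASE + "?" + query


example_link = log_url("archiveteam-bs", date(2014, 9, 9))
print(example_link)
```

Whether a given day is still available depends on how far back the logs are kept.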
Regarding #archivebot, I've never been there and have no chatlogs, so I can rely only on what is written on the wiki: "Channel for controlling ArchiveBot. Discussions about ArchiveBot development also take place here."; and yipdw wrote on #archiveteam-bs on 2014-09-09: "in that channel the expectation is that you're there to issue commands, check up on a job, or talk about something to work on; talking about how archivebot works is fine but there's a point where it just gets annoying to deal with". There is a wiki page with basic information about ArchiveBot.
Regarding my IRC presence, I'm usually logged in to channels of featured – and currently active – projects, mainly to follow the news. Right now I'm available on #quitpic. Sometimes I also log in to #archiveteam, but usually for a short time, when announcing something important and waiting for the reactions. I said I rarely speak on IRC: I only answer questions not answered in some minutes, or announce important news or problems yet not noticed by the people "in charge".
My IRC username is the same as here: bzc6p. However, I'm much of the time away from keyboard, but I usually check the log when coming back, and reply to private messages if any.
bzc6p (talk) 22 October 2014, 12:59 (UTC+2)
Invitation for private chat
Let's talk in private my friend. Please come to #pmchannel on EFnet. (There you can recommend a better "place" if you have any.) I'll be by the computer or check often from today to Sunday from ~7:00 until 22:00 UTC. I count on your attendance. bzc6p (talk) 23 October 2014, 10:44 (UTC+2)
- Damn timezones. Thank you for being there – I missed your arrival and leaving just by some tens of minutes... Well, the weekend may be better for us in terms of free time and sleeping patterns, but I don't wait until that. Next time I'll get up during the night (that's your afternoon and evening), and we can talk. We may give each other our email addresses (I don't want to disclose it publicly) to overcome this timezone issue. I want to end this private communication on this wiki.
- See you there tomorrow, and sorry for this situation.
ArchiveBot Requests
To tell the truth, I don't really care about ArchiveBot, at least for now, for three reasons. One, I don't consider most websites simple enough that a wpull run can get everything without human intelligence. Two, I don't want to use others' bandwidth while I manage with mine. Three, I can learn a lot about archiving websites if I do that myself.
So I think I can't take ArchiveBot requests. (I don't even know its commands, etc.) Moreover, I have no more rights in ArchiveBot or ArchiveTeam channels than you. And, if you have only a few sites you want to be grabbed by ArchiveBot, people in #archiveteam usually start the archiving job if you ask them.
If you've been banned in such a way that you can't enter those channels at all (even with another nickname), that's another case; then tell me and I'll pass on your request.
Sorry if I sounded rough or something, don't take it personally. I've been busy these days and I'm quite tired at the moment.
Regards, bzc6p (talk) 15:03, 19 November 2014 (EST)
- 1. My favourite web archiving tool is wpull. I had a problem with wget parsing certain HTML files. And a great thing about wpull (which wget lacks, as far as I know) is that it can store its database in a separate file. So, when continuing a mirror, it doesn't need the files it downloaded earlier (they don't even need to be stored, see --delete-after), but uses the database file instead. (You can even manipulate it with e.g. sqlitestudio, for example to prevent failing URLs from being retried forever, to add new URLs, etc. – however, normal usage may not require this, and it may be inappropriate.)
- I don't know of any other sophisticated mirroring tool that supports WARC.
- 2. Yes, of course, I do that myself too. As far as I know, ArchiveBot runs wpull, and I believe that an ArchiveBot command just initiates a recursive download of the site with page requisites – I don't really know it, but surely there is no (easy) way to apply human intelligence the way I do in my mirrors, in several steps, taking scripts and other things into consideration.
- So I think there's nothing you couldn't do and ArchiveBot could. (Except maybe uploading directly to the ArchiveTeam collection, but that's not so important.) However, AB may have more space, better performance and a stable internet connection. But the latter can also be worked around: if you need to stop and continue the grab, you can do that with the help of the database file, but you must give a different WARC file name (you may append a postfix), and finally you can concatenate them using megawarc.
- 3. Possibly because Python is – I guess – much more portable and platform independent, and the source code doesn't need to be compiled every single time (it's an interpreted language).
- 4. I don't, but I bet you'll find information about it with a Google search.
- No, I'm not annoyed at all. However, it may happen that I can't/don't answer very soon. bzc6p (talk) 14:43, 21 November 2014 (EST)
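About joining the WARC parts of a stopped-and-continued grab: I don't show a megawarc invocation here, but the idea can be sketched in plain Python. The sketch assumes the parts are ordinary gzip-compressed `.warc.gz` files; gzip files stay readable when simply appended to each other. `concatenate_warcs` is a made-up helper, and the two small files below are stand-ins, not real WARCs.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concatenate_warcs(parts, output):
    """Append each gzip-compressed part to `output`, in the given order."""
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    # Stand-ins for real WARC parts: any gzip data behaves the same way here.
    samples = [("site_01.warc.gz", b"first part\n"),
               ("site_02.warc.gz", b"second part\n")]
    for name, payload in samples:
        with gzip.open(work / name, "wb") as handle:
            handle.write(payload)

    merged = work / "site.warc.gz"
    concatenate_warcs([work / name for name, _ in samples], merged)

    # Reading the merged file back yields the parts one after another.
    with gzip.open(merged, "rb") as handle:
        combined = handle.read()
    print(combined)
```

Keep the original parts until the merged file has been checked, for example with warc-proxy.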
- Well, as I remember, on a Debian Jessie (currently testing branch) I could install wpull smoothly. I think the python3 and possibly the python3-pip packages are necessary to issue pip3 install wpull, and that pulls the dependencies automatically.
 
- On older Debian (Wheezy, it's the stable) I couldn't install it, because Wheezy seems to insist on Python2 as default. I run a virtual machine with Jessie (without GUI) to run wpull. You said you had similar problems on Windows. Well, I haven't used much Windows in a while and not at all its new versions, so I think I can't help with that. It should work on Debian Jessie. Or, if you prefer Ubuntu or something else, if it's a recent version and prefers (or at least supports well) Python3, that should suffice too. (There must be a Windows workaround too, I believe, documented somewhere on the internet, in general about Python3.)
 
Re: Wikiadownloader.py problem
bzc6p here, let me answer your question until chfoo gives a better one (if necessary).
At the beginning of the wikiadownloader.py you can read the following:
# using a list of wikia subdomains, it downloads all dumps available in Special:Statistics pages
# you can use the list available at the "listofwikis" directory, the file is called wikia.com and it contains +200k wikis
So, wikia.com is actually a file, so the script isn't wrong, at least at this point. However, I couldn't find the file where it is said to be. But indeed, there are files in that directory (in fact, in its subdirectories) that have lists of wikis. After studying the code, I think you need to download a list, rename it to wikia.com and start the script (the listfile must be in the same directory as the script). See also the instructions in the script file. bzc6p (talk) 13:04, 27 November 2014 (EST)
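The renaming step can be sketched in Python as follows. This only illustrates my reading of the script's comments: `install_wiki_list` is a made-up helper, the file name `downloaded-list.txt`, the directory name and the two list entries are placeholders, and the real list format should be checked against the instructions in the script file.

```python
import shutil
import tempfile
from pathlib import Path


def install_wiki_list(list_file, script_dir):
    """Copy a downloaded list next to the script under the name wikia.com."""
    target = Path(script_dir) / "wikia.com"
    shutil.copyfile(list_file, target)
    return target


# Demonstration with throwaway files instead of a real checkout.
with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    downloaded = work / "downloaded-list.txt"              # hypothetical name
    downloaded.write_text("examplewiki1\nexamplewiki2\n")  # placeholder entries
    script_dir = work / "scripts"                          # where wikiadownloader.py would be
    script_dir.mkdir()

    target = install_wiki_list(downloaded, script_dir)
    installed_lines = target.read_text().splitlines()

print(len(installed_lines), "entries copied to", target.name)
```

After that, the script is started from the directory that contains both the script and wikia.com.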
Re: Blank CD Question
I don't know what the estimated shelf life of optical discs is, I've never studied this question. Instead, I check their health every year. For archiving purposes I mostly use DVD-RWs, as some of the archives change, and they are reusable for any purpose. They are not older than 5 years, so that's probably not a reference.
I have a CD-R like 12 years old and it's fine. But a CD-ROM with about the same age is showing the common signs of failure. (It's actually a Windows XP disc.) I don't have many old discs, some CD-Rs and DVD-Rs I have are like 6 years old and fine.
So I'm not a good person to ask, as I don't have an old collection of optical discs. I believe that the technology indeed has its shelf life, and the actual lifetime of a certain disc depends on the manufacturer and the usage, but must be around some number. To make sure, check them regularly. I've found a great tool: dvdisaster. You can not only check your discs with it, but generate error correction code, that can save your disc when the first sectors go wrong. Some interesting fact: Its documentation says that the discs tend to go wrong from outside to inside, so the outer parts (that's the end of the data) are lost first. My Windows XP disc indeed behaves like that.
For your second question: Unfortunately I'm not an expert in such questions, I possibly don't know more about it than you. I use DVD-RWs for backing up my original CDs and my work on the computer, but that's not more than some tens of gigabytes. If I wanted to store terabytes (I hope one day I can, I have some great plans), I think I would buy an external hard disk drive, and buy some second-hand hard disks. And store them cold, i.e. not connected to a computer.
bzc6p (talk) 16:46, 29 November 2014 (EST)
Re: Blogter.hu's Unexpected Downfall
You don't need to worry about GPortal. Let me explain.
1. Blogter.hu showed clear signs of decay. I myself used the site too. In 2010, it started to respond slowly. At that time, archiving would have been possible. Maybe in 2011 too, or – although very slowly – even later. If I had been concerned with web archiving like now, I definitely would have saved it or sounded ArchiveTeam's alarms if necessary. (The same applies to extra.hu, freeblog.hu etc.)
2. I think the financial status of a company behind a website also helps deciding if there are problems looming or not. (Of course, this is not an exclusive reason for closing a website, but why would you close down a profitable thing?) Balance sheets of companies are publicly available in Hungary. I realised it just now. According to that, Blogter Ltd. took a huge loan in 2007, I don't know why. (It was like $400,000 loan, although their share capital was just $12,000.) By 2011 it became clear that they couldn't (or didn't want to) repay the loan, even their capital went to negative. You remember: then the site still could have been saved. (Note: I'm not an economist, I don't understand these sheets too much.)
Another example is freeblog.hu. I checked its numbers too. They got into a situation similar to Blogter Ltd.'s. The balance sheet of 2011 showed $–14,000 capital, while share capital was just $2,000. So it was in huge trouble. The data became available in May 2012. The site was still up and fine in August 2013, but there were some signs of decay. So one could have archived it if one had paid attention.
3. Gportal is owned by Origo, which is owned by Magyar Telekom (one of the biggest companies in Hungary, the market leader in telecommunication), which is owned by Deutsche Telekom, a multinational company. Although there are some negative numbers in the subsidiary's balance sheet, the company is big enough that I'm not that afraid. And, which is more important, I'm absolutely sure it wouldn't go down without notice or at least obvious signs. Remember, Blogter and Freeblog also had signs. Just no one acted. So, their downfall wasn't unexpected.
4. Gportal is 156th most popular website in Hungary, according to Alexa. That's quite good I think.
The biggest site closures in Hungary in recent years were those of extra.hu, iWiW, Freeblog and Blogter. Extra.hu announced its closure about two months in advance, and iWiW also months in advance. Freeblog showed signs of technical problems months before, and of financial problems 1.5 years before. I already wrote about Blogter. If I had been an ArchiveTeam member back then, I would have tried to save all of them.
Every day I learn more about what to keep an eye on: not only technical status, activity of support and number of visitors, but also financial status. Once I find a site endangered, I'll initiate its archival. (The first ones may be those old free hosting services that started about a decade ago but are now quite abandoned because of the rise of social networks.) But at the moment I'm not worried, and archiving a living website would be not only a waste of time and resources, but also partly in vain, as new content is still being added. Especially if the Wayback Machine is making snapshots.
GPortal is one of the sites least likely to disappear anyway.
Before you misunderstand my tone: it's no problem that you're concerned with archiving; what's more, it's great. But the site you mentioned doesn't seem to be in danger, and I keep an eye on the others as well, considering several factors. (If Blogter were still available now, I'd try to save it, along with mommo.hu, which was an early social networking site also run by Blogter Ltd. – and, what a shame, it was fully functional even this year and I didn't save it.)
I joined ArchiveTeam in spring this year, and since summer I've been ready to take action myself if necessary. Too bad I came just too late for Blogter and Mommo. bzc6p (talk) 15:21, 8 December 2014 (EST)
Re: What I'm Currently Doing
If I see it correctly, you upload (and get WM snapshots of) things that are rarities, which is good, as these rare but possibly valuable things are in danger of being lost and forgotten. This work, however, takes time and manual effort, so it's even more appreciated.
What I'm doing regarding archiving is rather "bulk" work instead. Once I have discovered the site structure, downloading and then uploading the content goes almost automatically. I mean image hosting services, which I'm "specialized" in. Although they are alive now, I don't trust any of them. And I find them important, for several reasons.
I've also been planning to upload some small but good software that I know of but that is in danger of being forgotten. And also the work of some original authors on YouTube – however, I must look into what Internet Archive is already archiving with its YouTube crawls.
Regarding your earlier "mad rage", there is nothing I need to forgive you for, as you never hurt me – your rage wasn't directed towards me. Thus you don't need to feel embarrassed around me at all, and I also kind of understand it, as we discussed it privately.
Have a good time archiving stuff. bzc6p (talk) 09:45, 6 January 2015 (EST)
Re: View Archive.org Directories as Text Only
I don't know what exactly you mean, and even if I did, I think I would be unable to give an answer. If you meant listing the files in an item, http://archive.org/download/<IDENTIFIER> is the way, e.g. https://archive.org/download/demotivalo_net_2014_october. This is the page you are presented with when you click "HTTP" under "Download item" on the left of an item's details page (on the old version of the site).
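A minimal sketch of the URL pattern described above, in Python. The function name `item_download_url` is made up for illustration; only the `archive.org/download/<IDENTIFIER>` pattern and the example identifier come from the message itself.

```python
from urllib.parse import quote


def item_download_url(identifier: str) -> str:
    """Build the file-listing URL for an archive.org item identifier.

    Hypothetical helper: it only fills the identifier into the
    archive.org/download/<IDENTIFIER> pattern; it does not contact the site.
    """
    # Percent-encode anything unusual so the identifier stays a single path segment.
    return "https://archive.org/download/" + quote(identifier, safe="")


print(item_download_url("demotivalo_net_2014_october"))
```

Opening the resulting address in a browser should show the same file listing as the "HTTP" link on the item's details page.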
The type of link you gave as an example is new to me, and I don't know (and couldn't find) other such "hidden" pages of the Wayback Machine.
If it was on IRC that you were taught the way, then try searching for that with Google on badcheese.com. bzc6p (talk) 07:28, 23 January 2015 (EST)
- I guess it's not what they taught you, but this tool (which I found mentioned here), used with proper parameters, may give the result you need. bzc6p (talk) 07:46, 24 January 2015 (EST)
- Unfortunately, as you also know, IA respects robots.txt, and even does so retroactively (official statement here and a discussion here). As you yourself discovered, IA does hide (I guess they don't delete) content behind a 403 page – and I think those guys have the technical skills not to leave any gap through which someone could still access it. (I mean, there are lots of websites that, for example, don't have any links on the web to some documents, but if you know the address of the containing directory, you get the directory listing, and you've got everything. Well, I guess the IA staff are not that dumb.)
 
- So, what to say? This is the situation. IA doesn't seem to be willing to change it, not even to drop the retroactivity. (They may have a reason for that, though.) I can only quote your own words: "I suggest you save/archive your favorite old web pages on this machine before they get the "robots" move."
 
- Or, one more thing you could do: try searching in other, smaller internet archiving sites' databases – who knows, some of them may have the portion of the internet you're looking for.
 
- And something off-topic. SketchCow advised against "engaging" people in the forums. That is – if I understand it correctly with my English – one should avoid getting into an argument with them, especially when talking about ArchiveTeam. I'm telling you this because you were on the edge of it; take this as my preemptive good advice. bzc6p (talk) 17:08, 24 January 2015 (EST)
 
Re: FTP Sites
I am aware of the FTP saving project. However, it doesn't seem to be well organized to me – I don't know who works on it or what they download, although there is a tool with which ALL the IPv4 FTP sites can easily be discovered unbelievably quickly, and then a distributed project could be created to systematically mirror each and every one of them. I'm not in a position to organize this, and until it is organized, I don't feel like joining.
At the same time, I've got a lot of work with Hungarian image hosting services; it could saturate my bandwidth forever, so I don't even have the capacity to do anything else, unless it is very urgent. (It is beyond dispute that image hosting services must be saved – they hold content and generate links, and neither should the content be lost nor should the links become broken. And the reliability of these services is poor on average.)
In response to your other discovery, I myself discovered that too, and although it seems to be a handy way of archiving, I'm not sure if it is the desirable way – at first glance, it looks okay, though. However, creating and uploading WARCs has the advantage that they can easily be restored if the site goes down and the domain expires and someone decides "let's put it back!". Or if someone wants to mirror specific sites' archives, then it's just one click – but exporting from the Wayback Machine is very difficult. bzc6p (talk) 15:27, 1 February 2015 (EST)
- Last time I meant that downloading sites with wget or wpull (which is in fact a wget-compatible downloader written in Python), using the options responsible for creating a WARC file, and then uploading these WARCs has the benefit that one can download the whole archive of the site in one file. (wpull was designed from the start to be able to create WARCs; wget has supported it only since not too long ago.) And, yes, as WARCs can be "injected" into the Wayback Machine, the site has almost the same appearance there as if you had saved it with the web.archive.org/web/save (or whatever) method – but, again, one can simply download the whole archive only if it is uploaded as an archive (preferably WARC).
- As far as I know, wpull is behind ArchiveBot. The Warrior still uses the modified wget, but one day it may be replaced with wpull too.
- Regarding your FTP site saving efforts, do what you wish – you don't have to narrow your efforts to Hungarian sites, and although it is kind of you to do this "for my sake", I don't feel like I deserve that much... recognition or something like that. Hungarian FTP sites are not more important than any others in the world, maybe rather less important.
- Just save what you think important, urgent or what you feel like. I do the same. bzc6p (talk) 17:29, 5 February 2015 (EST)
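To illustrate the WARC-producing download mentioned above, here is a rough Python sketch that only assembles a wget command line without running it. The helper name `wget_warc_command`, the example URL and the WARC base name are made up for illustration; `--mirror`, `--page-requisites` and `--warc-file` are wget options (the last one needs a wget build with WARC support), and which other options a real crawl needs depends on the site.

```python
import shlex
from typing import List


def wget_warc_command(url: str, warc_name: str) -> List[str]:
    """Assemble (but do not run) a wget command that also writes a WARC file.

    Hypothetical helper for illustration only.
    """
    return [
        "wget",
        "--mirror",                   # recursive download with timestamping
        "--page-requisites",          # also fetch images, CSS, etc. needed by pages
        f"--warc-file={warc_name}",   # base name of the WARC file wget writes
        url,
    ]


cmd = wget_warc_command("http://example.com/", "example_com")
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting WARC file is what would then be uploaded, so that the whole archive of the site can later be downloaded as one file.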
Re: Sup Archive Maniac, from the Bibliotheca Anonoma
You really might want to consider joining our team. We need more help from people just like you, and just as you say, you've really been helping us a lot.
It's easy to get in contact with us. Our control room is our IRC channel at irc.rizon.net #bibanon. This is where everything happens. Please join in and idle; it's a lively community of archivists just like you, sharing info, tips and tricks with each other.
Antonizoon 13:33, 21 October 2015 (EDT)