PDF 2016
PDF 2016 | |
Status | Online! |
Archiving status | Unknown |
Archiving type | Unknown |
Project source | citeseerxpdf-grab |
Project tracker | citeseerxpdf |
IRC channel | #archiveteam-bs (on hackint) (formerly #pdflush (on EFnet)) |
PDF (Portable Document Format) is a file format used to present documents in a manner independent of application software, hardware, and operating systems.[1]
PDF 2016 is the codename of an ArchiveTeam project to archive a large collection of PDF files from around the Web.
Story
In March 2016, user davidar informed ArchiveTeam on IRC that he had obtained a list of hundreds of millions of links to PDF files from around the Web.[2][3][4] ArchiveTeam decided to set up a Warrior project to download these files.
Most of the files are open-access scientific documents. In addition to being uploaded to the Internet Archive, they will also be hosted and indexed by CiteSeerX (http://citeseerx.ist.psu.edu/).
Archiving
ArchiveTeam is first saving about 1 terabyte of files. The Internet Archive will then decide whether it can store everything that is downloadable, which is expected to amount to tens or hundreds of terabytes. (The test run also helps to produce a good estimate of the total size.)
We have 250-300M URLs. About 10% of them are direct links to PDFs; the rest are mostly links to HTML landing pages. 10-15% have a citation_pdf_url meta tag giving a direct link to a PDF, and another 10-15% link to the PDF in an unstructured way.
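As a rough illustration of the landing-page case, the sketch below shows one way a citation_pdf_url meta tag could be read from a page using only the Python standard library. It is not the code used by citeseerxpdf-grab; the names CitationPdfParser and find_citation_pdf_url are hypothetical, and it assumes the tag has the form `<meta name="citation_pdf_url" content="...">`.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CitationPdfParser(HTMLParser):
    """Remembers the content of the first <meta name="citation_pdf_url"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.pdf_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first match is kept.
        if tag != "meta" or self.pdf_url is not None:
            return
        attr_map = dict(attrs)  # attribute names arrive lowercased
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content")
        if name == "citation_pdf_url" and content:
            self.pdf_url = content.strip()


def find_citation_pdf_url(html: str, page_url: str) -> Optional[str]:
    """Return the PDF link advertised by a landing page, or None.

    page_url is used to resolve relative links (hypothetical helper).
    """
    parser = CitationPdfParser()
    parser.feed(html)
    if parser.pdf_url is None:
        return None
    return urljoin(page_url, parser.pdf_url)
```

Pages that link their PDF in an unstructured way would not be covered by this approach and would need separate handling.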
How can I help?
Running a Warrior
Start up a Warrior and select PDF 2016 there. (If you don't really care what you are archiving, select ArchiveTeam's Choice instead, as ArchiveTeam may at times prioritize another project.)
Running the script manually
If you use Linux and are somewhat familiar with it, you can try running the script directly.
The instructions can be found at github.com/ArchiveTeam/citeseerxpdf-grab.
Some additional information |
---|
Don't forget to replace YOURNICKHERE with your nickname.
The number after
If you want to stop the script, please do it gracefully if possible. To do so, create an empty file named STOP in the folder of the script (terminal command:
If you see "Project code is out of date", kill the script, go to its folder ( |
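The graceful stop described above comes down to creating an empty file named STOP next to the script. A minimal sketch in Python follows; the function name request_graceful_stop and the script_dir argument are hypothetical, and the folder you pass should be wherever you checked out the script.

```python
from pathlib import Path


def request_graceful_stop(script_dir: str) -> Path:
    """Create a file named STOP in script_dir and return its path.

    A newly created file is empty; an existing STOP file is left untouched.
    """
    stop_file = Path(script_dir) / "STOP"
    stop_file.touch(exist_ok=True)
    return stop_file
```

For example, `request_graceful_stop(".")` creates the file in the current working directory.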
Donating to the Internet Archive
Content downloaded by ArchiveTeam will be uploaded to the Internet Archive, where it will be stored and remain available – hopefully – forever. However, storing it costs thousands of dollars in the long run. So, if you can afford it, please consider donating to the Internet Archive, so that this piece of history can be kept for us all. http://archive.org/donate
Do you like our cause?
If you want to help with other projects, learn more about ArchiveTeam, or even help with development in general, go to the Main Page of this wiki; from there you can reach a lot of information. The Team consists of volunteers working on the projects in their free time, so helping hands (and resources) are always welcome.
References
- [1] https://en.wikipedia.org/wiki/Portable_Document_Format
- [2] http://archive.fart.website/bin/irclogger_log/archiveteam?date=2016-03-03,Thu&sel=67#l63
- [3] http://archive.fart.website/bin/irclogger_log/archiveteam?date=2016-03-09,Wed&sel=134#l130
- [4] http://archive.fart.website/bin/irclogger_log/archiveteam?date=2016-03-13,Sun&sel=133#l129