News+C

News+C is a project brought to life by user:bzc6p, and is concerned with archiving news websites.

NewsGrabber vs. News+C

Wait, we already have NewsGrabber! How is this one different?

  • News+C focuses on websites that have user comments, especially those that use third-party comment plugins (Facebook, Disqus etc.)
  • As third-party comment plugins are usually full of JavaScript and difficult to archive in an automated way, News+C is more of a manual project.
  • While NewsGrabber archives all articles of thousands of websites, News+C focuses only on certain (popular) websites and archives only those, but more thoroughly.
  • While anyone can join the NewsGrabber project, as it only requires starting a script, a News+C project needs more knowledge, time and attention.

News+C is in no way a replacement for or a competitor of NewsGrabber. It is a small-scale project with a different approach that, in fact, focuses on the comments rather than on the news and, to some extent, prefers quality over quantity.

Tools

Archiving websites with a ton of JavaScript is always a pain in the ass for an archivist, and third-party comment plugins do use a lot of JavaScript. The problem is that you need to list and save all the URLs these scripts request during user actions, but only browsers are able to interpret these scripts correctly.

So there are two approaches:

  • If you are a JavaScript mage, you find a way to automate all those Ajax and whatever requests, so that you are able to fetch comments with wget or wpull.
  • If you are not that expert/intelligent/patient/whatever and don't want to deal with all that, there is a slower but simpler, more universal and cosier approach: automating a web browser, using computer vision.

Solving the script puzzle

If you know how to efficiently archive Facebook or Disqus comment threads with a script, do not hesitate to share. The founder of this project, however, doesn't, so he is developing the other method.

Using computer vision

Web browsers interpret JavaScript well, and there are tools that archive websites as you browse ([1], [2] etc.), so you can save a website pretty much perfectly if you yourself browse it in a web browser with such an archiving tool running. Alternatively, if you don't trust or otherwise can't use such a tool, you can export the list of URLs with some browser plugin and then save those with wget or wpull. (We are talking about WARC archives, of course; Ctrl+S-ing the website is not the optimal way for us.)
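
If you go the wget route, a minimal sketch could look like this (urls.txt, the WARC name and the exact choice of options are assumptions; wget's --input-file and --warc-file options do the actual work):

 import subprocess
 # Fetch every URL exported from the browser into a single WARC (comments.warc.gz).
 # urls.txt and the WARC name are assumptions; adjust them as you like.
 subprocess.call([
     "wget",
     "--input-file=urls.txt",   # the URL list exported with a browser plugin
     "--warc-file=comments",    # write comments.warc.gz
     "--page-requisites",       # also grab images, CSS and scripts the pages reference
     "--delete-after",          # keep only the WARC, not a plain mirror on disk
 ])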

But the question is, as always: how do you automate this process?

This is where computer vision comes into the picture. You can – surprisingly easily –

  • simulate keypresses
  • simulate mouse movement and clicks
  • find the location of an excerpt image on the screen

with a little programming.
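
To give a flavour, a tiny pyAutoGUI snippet covering all three could look like this (the coordinates and the template image name are placeholders):

 import pyautogui
 pyautogui.press("f5")      # simulate a keypress (here: reload the page)
 pyautogui.click(640, 400)  # simulate a mouse click at given screen coordinates
 # Find an excerpt image (e.g. a cropped "Read comments" button) on the screen
 # and click its centre; this assumes the button is currently visible, and the
 # confidence= parameter needs OpenCV installed.
 box = pyautogui.locateOnScreen("read_comments.png", confidence=0.9)
 pyautogui.click(pyautogui.center(box))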

This – according to user:bzc6p's knowledge – needs a graphical interface and can't be put in the background, but at least you can save a few hundred/thousand articles overnight, while you sleep.

Different scripts are necessary for different websites, but the approach is the same, and the scripts are also similar. The modifiable Python 2 script user:bzc6p uses has been named by its creator the Archiving SharpShooter (ASS).

Archiving SharpShooter

The particular code may be published later (or, if you are interested, you can ask user:bzc6p), but the project is still quite beta, so only the algorithm is explained here.

  • Input is a list of news URLs.
  • Key Python 2 libraries used are pyAutoGUI and openCV. The former is our hands, the latter our eyes.
  • pyAutoGUI calls such as pyautogui.press() and pyautogui.click() type, scroll and click. cv2.matchTemplate() finds the location of the "Read comments", "More comments" etc. buttons or links, and we click them.
  • matchTemplate needs a template to search for (we cut these out from screenshots) and an up-to-date screenshot (we invoke scrot from Python and load that image). With matchTemplate we can also check whether the page has loaded or whether we have reached the bottom of the page.
  • The threshold for matchTemplate must be carefully chosen for each template, so that it neither misses a real occurrence nor finds a false positive. (A sketch of such a helper follows this list.)
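
A minimal sketch of such a helper, under the assumptions above (the /tmp path, file names and the 0.8 default threshold are only examples; the real script tunes the threshold per template):

 import os
 import subprocess
 import cv2
 def find_on_screen(template_path, threshold=0.8):
     """Return the centre (x, y) of the best match on screen, or None below the threshold."""
     shot = "/tmp/screen.png"
     if os.path.exists(shot):
         os.remove(shot)
     subprocess.call(["scrot", shot])          # grab the current screen with scrot
     screen = cv2.imread(shot)
     template = cv2.imread(template_path)
     result = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
     _, max_val, _, max_loc = cv2.minMaxLoc(result)
     if max_val < threshold:
         return None                           # template not (reliably) on the screen
     h, w = template.shape[:2]
     return (max_loc[0] + w // 2, max_loc[1] + h // 2)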

What the program basically does (sketched in code after the list):

  1. types URL in the address bar
  2. waits till page is loaded
  3. scrolls till it finds "Read comments" or equivalent sign
  4. clicks on that
  5. waits for comments to be loaded
  6. scrolls till "More comments" or equivalent is reached
  7. waits for more comments to be loaded
  8. repeats this until bottom of page is reached (no more comments)
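
Put together, a sketch of that loop for a single article might look like the following. The template file names, sleep times and confidence value are assumptions, not the actual ASS code:

 import time
 import pyautogui
 def locate(template):
     # Centre of the template on screen, or None if it isn't there (confidence=
     # needs OpenCV; the real script uses the scrot + matchTemplate helper instead).
     try:
         return pyautogui.locateCenterOnScreen(template, confidence=0.9)
     except pyautogui.ImageNotFoundException:
         return None
 def archive_article(url):
     pyautogui.hotkey("ctrl", "l")                   # 1. focus the address bar
     pyautogui.typewrite(url + "\n", interval=0.02)  #    and type the URL
     time.sleep(10)                                  # 2. wait for the page to load
     pos = locate("read_comments.png")               # 3. scroll until "Read comments" appears
     while pos is None:
         pyautogui.scroll(-600)
         time.sleep(1)
         pos = locate("read_comments.png")
     pyautogui.click(pos)                            # 4. click it
     time.sleep(5)                                   # 5. wait for the comments to load
     while locate("bottom_of_page.png") is None:     # 8. repeat until the bottom is reached
         more = locate("more_comments.png")          # 6. look for "More comments"
         if more is not None:
             pyautogui.click(more)
             time.sleep(5)                           # 7. wait for more comments to load
         else:
             pyautogui.scroll(-600)
 for line in open("urls.txt"):                       # input: a list of news URLs
     archive_article(line.strip())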

During this, warcprox runs in the background, and every request is immediately saved to a WARC file. (Warcprox provides a proxy, which is set in the browser.)
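
For completeness, a hedged sketch of the warcprox side (the port, output directory and flags are assumptions based on warcprox's usual defaults; check warcprox --help for your version, and note that HTTPS capture also requires trusting warcprox's CA certificate in the browser):

 import subprocess
 # Start warcprox before the browsing session; point the browser's HTTP/HTTPS
 # proxy at localhost:8000 so that every request ends up in a WARC under ./warcs/.
 proxy = subprocess.Popen(["warcprox", "-p", "8000", "-d", "warcs"])
 # ... run the Archiving SharpShooter session here ...
 # proxy.terminate() once the session is finished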

Disclaimer

The Archiving SharpShooter, or anything built on the same concept, may be slow, but it does the job, and we don't have anything better until someone comes up with one. Also, ASS is universal in the sense that, for each website, you only need a few templates (excerpt images), set (and test) the thresholds and set the command order, and you're all set, without having to carefully reverse-engineer tons of JavaScript code. This may also help with archiving e.g. Facebook threads or other stuff besides news.

Websites being archived

For an easier overview, let this page have subpages for countries/languages.