NewsGrabber

From Archiveteam
Revision as of 13:16, 17 January 2017

NewsGrabber - Archiving all the world's news!

URL: http://newsgrabber.harrycross.me:29000
Status: Online!
Archiving status: In progress...
Archiving type: Unknown
Project source: NewsGrabber
Project tracker: [1]
IRC channel: #newsgrabber (on hackint)

NewsGrabber is a project to save as many news articles as possible from as many websites as possible.


NewsGrabber

NewsGrabber was written to solve the problem of missing news articles. It maintains a database of URLs to be checked for articles, and anyone can add new websites to the database. Multiple seed URLs can be added for each website entry, all of which are crawled periodically to look for new article URLs. youtube-dl can be used to download article URLs, making it possible to preserve news in video form just as well as news in text form.
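The shape of a website entry described above can be sketched roughly as follows. All class and field names here are illustrative guesses for explanation, not NewsGrabber's actual schema; see the project source for the real definitions.

```python
import re
from dataclasses import dataclass


# Hypothetical sketch of a website entry; names are illustrative,
# not NewsGrabber's actual schema.
@dataclass
class Website:
    name: str
    seed_urls: list             # pages crawled periodically for new article links
    refresh_seconds: int = 300  # how often the seed URLs are re-crawled
    article_regex: str = r""    # pattern that article URLs must match
    video_regex: str = ""       # URLs matching this are handed to youtube-dl


def discover_articles(html: str, site: Website) -> set:
    """Extract candidate article URLs from a seed page's HTML."""
    links = set(re.findall(r'href="([^"]+)"', html))
    return {u for u in links if re.match(site.article_regex, u)}
```

For example, a site configured with `article_regex=r"https://example\.com/news/"` would keep only links under that path when a seed page is crawled.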

How can I help?

There are two ways you can help:

  • There are countless news websites in the world. Make our list more complete! Add news sites to our database; see the next section or here for how.
  • Donate server power for the grabbing process. Join the #newsgrabber IRC channel on hackint and find user HCross.

Options

NewsGrabber handles several options for discovering and processing URLs. More details about using these options to add new websites can be found in the README of the project [2].

Refreshtime

The refreshtime is the interval NewsGrabber waits between crawls of a website's seed URLs. It can be set as low as 5 seconds.
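The refreshtime check amounts to comparing the time since the last seed crawl against the configured interval. A minimal sketch, with illustrative names (not NewsGrabber's internals):

```python
import time


# Minimal sketch of a refreshtime check, assuming each site records
# the timestamp of its last seed crawl. Names are illustrative.
def due_for_crawl(last_crawl: float, refresh_seconds: int, now: float = None) -> bool:
    """Return True once at least `refresh_seconds` have passed since the last crawl."""
    if now is None:
        now = time.time()
    return now - last_crawl >= refresh_seconds
```

With a refreshtime of 5 seconds, a site last crawled at t=0 becomes due again at t=5.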

SeedURLs

Seed URLs are used by NewsGrabber to discover new articles. Often not all news articles are linked from a website's front page; they may be spread over several sections or RSS feeds. A list of these URLs can be given to NewsGrabber so that articles from all of them are found, not only those on the front page.
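Discovery across several seed URLs is just the union of the article links found on each. A sketch under simplifying assumptions (fetching is stubbed out, and a crude regex stands in for real HTML/RSS parsing):

```python
import re


# Sketch: union article links found across several seed pages
# (front page, section pages, RSS feeds). Fetching is stubbed out;
# already-fetched page bodies are passed in.
def discover_from_seeds(pages: dict, article_regex: str) -> set:
    """pages maps seed URL -> fetched body (HTML or RSS)."""
    found = set()
    pattern = re.compile(article_regex)
    for body in pages.values():
        # href="..." covers HTML links; <link>...</link> covers RSS items
        for groups in re.findall(r'href="([^"]+)"|<link>([^<]+)</link>', body):
            for candidate in groups:
                if candidate and pattern.match(candidate):
                    found.add(candidate)
    return found
```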

Videos

NewsGrabber supports videos. URLs that match the regex given for video URLs are downloaded with youtube-dl.
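Handing a matching URL to youtube-dl might look like the sketch below. The video regex and the exact youtube-dl flags NewsGrabber uses are assumptions here; the function only builds the command rather than executing it:

```python
import re

# Illustrative pattern; a real entry would use the site's own videoURLs regex.
VIDEO_REGEX = re.compile(r"https?://example\.com/video/")


def video_command(url: str):
    """Return a youtube-dl invocation for a video URL, or None if the URL
    does not match the video regex. The command is built, not executed."""
    if not VIDEO_REGEX.match(url):
        return None
    return ["youtube-dl", "--no-playlist", "-o", "%(id)s.%(ext)s", url]
```

The returned list could be passed to `subprocess.run` by the grab worker.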

Livepages

During big events, a live page is often created with the latest news on the event. NewsGrabber grabs URLs matching the regex given for live URLs repeatedly, crawling every new bit of information that appears on them.
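Repeatedly grabbing the same live page only pays off if each changed version is kept. One simple way to track that, sketched here with hypothetical names (hashing full snapshots to skip identical grabs):

```python
import hashlib


# Sketch: re-grab a live page and keep only snapshots whose content
# changed, so every update is preserved without storing duplicates.
class LivePageTracker:
    def __init__(self):
        self.seen_hashes = set()
        self.snapshots = []

    def record(self, body: str) -> bool:
        """Store body if it differs from every earlier grab; return True if new."""
        digest = hashlib.sha256(body.encode()).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        self.snapshots.append(body)
        return True
```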

Immediate grab

When emergencies happen, news sites try to cover as much as they can about what is happening. False information is often published on the websites and later removed. To catch articles that are later removed, NewsGrabber can be set to grab newly found articles immediately instead of adding them to the list, which is grabbed once every hour.
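The choice between immediate and hourly grabbing comes down to routing each newly discovered URL to one of two queues. A minimal sketch, assuming a per-site "immediate" flag (names are illustrative):

```python
# Sketch: route newly found article URLs either to an immediate-grab
# queue or to the hourly batch, per the site's "immediate" flag.
def route_url(url: str, immediate: bool, now_queue: list, hourly_queue: list):
    """Immediate sites are grabbed right away, to catch articles that may
    be edited or deleted; everything else waits for the hourly batch."""
    (now_queue if immediate else hourly_queue).append(url)
```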

Lists to include