Difference between revisions of "NewsGrabber"

Revision as of 16:13, 27 December 2015

NewsGrabber - Archiving all the world's news!
URL	http://newsgrabber.harrycross.me:29000
Status	Online!
Archiving status	In progress...
Archiving type	Unknown
Project source	NewsGrabber
Project tracker	[1]
IRC channel	#newsgrabber (on hackint)

NewsGrabber is a project to save as many news articles as possible from as many websites as possible.

Why

A lot of news articles are saved in the Wayback Machine by the Internet Archive. They are mostly saved through Top News in Focused Crawls [2], through the crawls of GDELT URLs [3], and through Wide Crawls.

We think these crawls do not cover the world's news articles very completely.

Crawls of websites from Top News in Focused Crawls follow a website from the seed URL up to 5 layers deep. This is often done once a day. Few websites are covered, and the websites that are covered are mostly English.
GDELT does a very good job of covering news from around the world, but sometimes misses the more local websites and many non-English websites.
Wide Crawls are focused on the whole World Wide Web, not on news articles.

NewsGrabber

NewsGrabber is written to solve the problem of missing news articles. It allows anyone to add new websites to its database to be checked for articles. Several seed URLs can be added to a website entry; these are crawled periodically for new URLs. Any new URLs found are downloaded either with or without youtube-dl, making it possible to preserve news in video form just as well as news in text form.

Options

NewsGrabber handles several options for discovering and processing URLs. More details about using these options to add new websites can be found in the README of the project [4].

Refreshtime

The refreshtime is the time NewsGrabber waits before crawling the seed URLs of a website again. The refreshtime can be as low as 5 seconds.
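
The waiting behaviour described above can be sketched roughly as follows. This is only an illustration: the names `SeedSchedule`, `due`, and `mark_crawled` are hypothetical and not taken from the NewsGrabber source, and treating 5 seconds as a hard lower bound is an assumption based on the sentence above.

```python
import time
from typing import Optional

MIN_REFRESH_SECONDS = 5  # assumed lower bound, from "as low as 5 seconds"


class SeedSchedule:
    """Hypothetical helper: tracks when one website's seed URLs are due."""

    def __init__(self, refresh_seconds: float) -> None:
        if refresh_seconds < MIN_REFRESH_SECONDS:
            raise ValueError("refreshtime must be at least 5 seconds")
        self.refresh_seconds = refresh_seconds
        self.last_crawl: Optional[float] = None  # None: never crawled yet

    def due(self, now: Optional[float] = None) -> bool:
        """Return True if the seed URLs should be crawled at time `now`."""
        now = time.monotonic() if now is None else now
        if self.last_crawl is None:
            return True
        return now - self.last_crawl >= self.refresh_seconds

    def mark_crawled(self, now: Optional[float] = None) -> None:
        """Record that the seed URLs were just crawled."""
        self.last_crawl = time.monotonic() if now is None else now
```

A crawler loop would call `due()` for each website entry and crawl its seed URLs only when it returns `True`.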

SeedURLs

SeedURLs are used by NewsGrabber to discover new articles. Often not all news articles are displayed on the front page of a website. They may be spread over several sections or RSS feeds. A list of these URLs can be given to NewsGrabber so that all articles from these URLs are found, and not only those on the front page.
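
As an illustration of discovery from several seed pages, here is a minimal sketch using only the Python standard library. It assumes the seed pages are HTML that has already been fetched; `LinkCollector`, `discover_new_urls`, and the `news.example` URLs are hypothetical and do not reflect how NewsGrabber itself is implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute link targets from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the seed URL.
                self.links.add(urljoin(self.base_url, value))


def discover_new_urls(pages, seen):
    """Return URLs linked from any seed page that are not yet in `seen`.

    `pages` maps each seed URL to its HTML text; `seen` is updated in place.
    """
    found = set()
    for seed_url, html in pages.items():
        collector = LinkCollector(seed_url)
        collector.feed(html)
        found |= collector.links
    new = found - seen
    seen |= new
    return sorted(new)
```

Because every seed page feeds the same `seen` set, an article linked from both the front page and a section page is reported only once.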

Videos

NewsGrabber supports videos. URLs containing videos are downloaded with youtube-dl if they match the regex given for videoURLs.
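
The routing decision can be sketched as below. The function name `choose_downloader`, the return labels, and the example pattern are assumptions made for illustration; the sketch only decides which path a URL would take and does not invoke youtube-dl.

```python
import re


def choose_downloader(url, video_patterns):
    """Return "youtube-dl" if `url` matches any video regex, else "plain"."""
    if any(re.search(pattern, url) for pattern in video_patterns):
        return "youtube-dl"
    return "plain"


# Hypothetical pattern for a site that keeps its videos under /video/.
video_patterns = [r"^https?://news\.example/video/"]
print(choose_downloader("http://news.example/video/123", video_patterns))
print(choose_downloader("http://news.example/politics/456", video_patterns))
```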

Livepages

During big events a live page is often created with the latest news on the event. NewsGrabber grabs URLs matching the regex given for liveURLs over and over, so that every new bit of information on them is crawled.
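
One way to picture the difference between live pages and ordinary articles is the sketch below: ordinary URLs are grabbed once, while URLs matching a live regex are selected again on every pass. `urls_to_grab` and the example pattern are hypothetical and not taken from the NewsGrabber source.

```python
import re


def urls_to_grab(found_urls, seen, live_patterns):
    """Select URLs to grab in this pass.

    Ordinary URLs are returned only the first time they are found;
    URLs matching a live regex are returned on every pass.
    """
    to_grab = []
    for url in found_urls:
        is_live = any(re.search(pattern, url) for pattern in live_patterns)
        if is_live or url not in seen:
            to_grab.append(url)
        seen.add(url)
    return to_grab
```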

Immediate grab

When emergencies happen, news sites try to cover as much as they can about what is happening. Often false information is published on the websites and later removed. To catch articles that are later removed, it is possible to make NewsGrabber grab newly found articles immediately instead of adding them to the list that is grabbed once every hour.
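
The choice between an immediate grab and the hourly list can be sketched as follows. `GrabQueue`, its methods, and the `grab` callback are hypothetical names used for illustration; the hourly timing is left to whatever scheduler calls `flush`.

```python
class GrabQueue:
    """Hypothetical queue: grab a URL at once or hold it for the next batch."""

    def __init__(self, grab):
        self.grab = grab      # callable that downloads a list of URLs
        self.pending = []     # URLs waiting for the next batch grab

    def add(self, url, immediate=False):
        if immediate:
            self.grab([url])  # do not wait for the batch
        else:
            self.pending.append(url)

    def flush(self):
        """Grab everything pending; meant to be called, e.g., once an hour."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.grab(batch)
```

With `immediate=True` an article is downloaded as soon as it is discovered, so it is captured even if the site removes it before the next batch runs.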

@@ Line 8: / Line 8: @@
 | irc = newsgrabber
 }}
+NewsGrabber is a project to save as many news articles as possible from as many websites as possible.
 == Why ==
-The Internet Archive had a project that was aiming to cover news websites from all around the world, but it didn't cover as many as we liked.
+A lot of news articles are saved in the Wayback Machine by the Internet Archive. They are mostly saved through Top News in Focused Crawls [https://archive.org/details/top_news], through the crawls of GDELT URLs [https://archive.org/details/NO404-GDELT], and through Wide Crawls.
+We think these crawls do not cover the world's news articles very completely.
+* Crawls of websites from Top News in Focused Crawls follow a website from the seed URL up to 5 layers deep. This is often done once a day. Few websites are covered, and the websites that are covered are mostly English.
+* GDELT does a very good job of covering news from around the world, but sometimes misses the more local websites and many non-English websites.
+* Wide Crawls are focused on the whole World Wide Web, not on news articles.
+== NewsGrabber ==
+NewsGrabber is written to solve the problem of missing news articles. It allows anyone to add new websites to its database to be checked for articles. Several seed URLs can be added to a website entry; these are crawled periodically for new URLs. Any new URLs found are downloaded either with or without youtube-dl, making it possible to preserve news in video form just as well as news in text form.
+== Options ==
+NewsGrabber handles several options for discovering and processing URLs. More details about using these options to add new websites can be found in the README of the project [https://github.com/ArchiveTeam/NewsGrabber/blob/master/README.md].
+=== Refreshtime ===
+The refreshtime is the time NewsGrabber waits before crawling the seed URLs of the website. The refreshtime can be as low as 5 seconds.
+=== SeedURLs ===
+SeedURLs are used by NewsGrabber to discover new articles. Often not all news articles are displayed on the front page of a website. They may be spread over several sections or RSS feeds. A list of these URLs can be given to NewsGrabber so that all articles from these URLs are found, and not only those on the front page.
+=== Videos ===
+NewsGrabber supports videos. URLs containing videos are downloaded with youtube-dl, if they match the regex given for videoURLs.
+=== Livepages ===
+In case of big events a live page is often created with the latest news on this event. NewsGrabber will grab URLs matching the regex given for liveURLs over and over, and will crawl every new bit of information on them.
+=== Immediate grab ===
+When emergencies happen, news sites try to cover as much as they can about what is happening. Often false information is published on the websites and later removed. To catch articles that are later removed, it is possible to make NewsGrabber grab newly found articles immediately instead of adding them to the list that is grabbed once every hour.
