ArchiveBot/Ignore

From Archiveteam

ArchiveBot already provides a number of built-in ignoresets. There are, however, some commonly used patterns for the ignore command which aren’t part of an ignoreset (yet).

Regex

As mentioned in the ArchiveBot documentation, the patterns have to be expressed in the Python Regular Expression format (HOWTO).

Applying a pattern looks like this, for example:

 !ig 4inxzjho43kee3ufmxdiizg9g ^https?://deprospekte\.com/.*/gtm\.js$
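Before applying a pattern to a job, it can help to try it locally. The following is a minimal sketch using Python's `re` module; it assumes the pattern is applied to the full URL with search semantics, and the test URLs are made up for illustration.

```python
import re

# Pattern copied from the !ig example above.
PATTERN = re.compile(r"^https?://deprospekte\.com/.*/gtm\.js$")


def is_ignored(url: str) -> bool:
    """Return True if the URL matches the ignore pattern.

    Assumption: the pattern is searched against the complete URL.
    """
    return PATTERN.search(url) is not None


# Hypothetical URLs, for illustration only.
print(is_ignored("https://deprospekte.com/angebote/seite/gtm.js"))  # True
print(is_ignored("https://deprospekte.com/angebote/seite/"))        # False
print(is_ignored("https://deprospekte.com/gtm.js"))                 # False
```

The last case shows that the `.*/` part requires at least one directory level, so a gtm.js file directly under the site root is not matched by this pattern.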

eggdrop

The Eggdrop bot[1] in the #archivebot channel (on hackint) can create an ignore pattern for you that covers an entire domain for the protocols HTTP, HTTPS, FTP and FTPS, with or without a username or port specified. Creating an ignore for archive.org.ua, for instance, looks like this:

 <username> !igd https://archive.org.ua/
 <eggdrop> username: ^(http|ftp)s?://([^/]*[@.])?archive\.org\.ua\.?(:\d+)?/

You can then copy that pattern and paste it into your command for adding the ignore to a job.
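If the bot is not available, a pattern of the same shape can be put together by hand. The sketch below is not the bot's actual implementation; `domain_ignore` is a hypothetical helper that simply reproduces the structure of the output shown above.

```python
import re


def domain_ignore(domain: str) -> str:
    """Build a domain-wide ignore in the same shape as the bot's output.

    Hypothetical helper, not the Eggdrop bot's own code.
    """
    return (
        r"^(http|ftp)s?://"   # http, https, ftp or ftps
        r"([^/]*[@.])?"       # optional "user@" or subdomain prefix
        + re.escape(domain)   # the domain itself, dots escaped
        + r"\.?(:\d+)?/"      # optional trailing dot and port
    )


pattern = domain_ignore("archive.org.ua")
print(pattern)  # ^(http|ftp)s?://([^/]*[@.])?archive\.org\.ua\.?(:\d+)?/

# Hypothetical URLs, for illustration only.
print(bool(re.search(pattern, "ftp://user@archive.org.ua:21/pub/")))  # True
print(bool(re.search(pattern, "https://notarchive.org.ua/")))         # False
```

The `([^/]*[@.])?` group is what lets the pattern cover subdomains and usernames while still rejecting unrelated hosts that merely end in the same string.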

The Eggdrop bot can also generate URLs to the current set of ignores and ignoresets for a job:

 <username> !ignores 9rk2clrrtny63oyav6urv0vk
 <eggdrop> username: http://archivebot.com/ignores/9rk2clrrtny63oyav6urv0vk?compact=true

Commonly needed ignore patterns

TODO

  • Calendars
  • Woocommerce

Drupal

Drupal websites use relative paths in a <script> tag for their JavaScript files, and the web servers are often misconfigured so that the resulting broken URLs do not return 404s. This leads to large, quickly growing URL loops for most of the affected URLs. It can happen for off-site links too, which is why an ignoreset wouldn’t work here.

A typical Drupal-loop will look like this:

 https://www.spdfraktion.de/themen/sites/all/libraries/colorbox/sites/all/libraries/jquery.bxslider/misc/collapse.js

To figure out the correct ignore pattern, look at the source code for one of these pages. You want to find the Drupal.settings script block, which contains the basePath. Then, check whether there are URLs starting with profiles in that same script block, i.e. whether "profiles\/ appears.
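The manual check described above can be roughly automated. This is a sketch under assumptions: `drupal_settings_hints` is a hypothetical helper, it expects the page source as a string, and for simplicity it searches the whole page for `"profiles\/` rather than only the Drupal.settings script block.

```python
import json
import re


def drupal_settings_hints(html: str):
    """Return (base_path, uses_profiles) guessed from a page's source.

    Hypothetical helper; simplification: the whole page is searched,
    not just the Drupal.settings script block.
    """
    # Capture the JSON string literal that follows "basePath":
    m = re.search(r'"basePath"\s*:\s*("(?:[^"\\]|\\.)*")', html)
    # json.loads turns the escaped form "\/" back into "/".
    base_path = json.loads(m.group(1)) if m else None
    uses_profiles = r'"profiles\/' in html
    return base_path, uses_profiles
```

A result of `("/", False)` would point to the first of the two ignores below, while `True` in the second position would point to the variant that includes profiles.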

There are two possible ignores, and it depends on the site which one you need to use:

 .../(?!sites/).*/(sites|modules|misc)/.*\.(js|css)$
 .../(?!(profiles|sites)/).*/(sites|modules|misc|profiles)/.*\.(js|css)$

The ... part should be the base of the Drupal site, i.e. the domain plus any basePath. If "profiles\/ is present in the script block, use the ignore that includes profiles; if not, the other one.

Example with "basePath":"\/" and no profiles:

 !ig jobid ^https?://www\.spdfraktion\.de/(?!sites/).*/(sites|modules|misc)/.*\.(js|css)$

Example with "basePath":"\/" and profiles:

 !ig jobid ^https?://www\.splcenter\.org/(?!(profiles|sites)/).*/(sites|modules|misc|profiles)/.*\.(js|css)$

Example with a non-root basePath and no profiles:

 !ig jobid ^https?://humber\.ca/student-life/(?!sites/).*/(sites|modules|misc)/.*\.(js|css)$
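The three examples above follow one recipe, which can be sketched as a small function. `drupal_ignore` is a hypothetical helper, and the test URLs other than the spdfraktion.de loop URL are made up. Note that `re.escape` may also escape characters such as `-`, which gives an equivalent pattern that looks slightly different from the hand-written ones.

```python
import re


def drupal_ignore(host: str, base_path: str = "/", profiles: bool = False) -> str:
    """Build one of the two Drupal ignores for a given site base.

    Hypothetical helper. base_path must start and end with "/".
    """
    base = r"^https?://" + re.escape(host) + re.escape(base_path)
    if profiles:
        return base + r"(?!(profiles|sites)/).*/(sites|modules|misc|profiles)/.*\.(js|css)$"
    return base + r"(?!sites/).*/(sites|modules|misc)/.*\.(js|css)$"


pattern = drupal_ignore("www.spdfraktion.de")

# The loop URL from above is matched ...
loop = ("https://www.spdfraktion.de/themen/sites/all/libraries/colorbox/"
        "sites/all/libraries/jquery.bxslider/misc/collapse.js")
print(bool(re.search(pattern, loop)))  # True

# ... while a script under the real /sites/ path (hypothetical URL) is not.
real = "https://www.spdfraktion.de/sites/all/libraries/colorbox/colorbox.js"
print(bool(re.search(pattern, real)))  # False
```

The negative lookahead right after the base is what protects the legitimate script paths: anything that starts with sites/ (or profiles/) directly under the base is never ignored.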

An easy way to apply those ignores is to use the right-click context menu on the dashboard to get an ignore for the affected domain and path, copy one of the two patterns above, and paste it at the end to complete the ignore.

Occasionally that won’t be enough: some sites will already have a lot of URLs ending with .png in the queue, URLs that end with a directory like …/misc/, or URLs with an additional ?parameter=… after the JS/CSS filename. In those cases it’s best to manually add adaptations of the above patterns matching those URLs.
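What such adaptations might look like is sketched below. Both adapted patterns are assumptions for illustration and not established ignores, and the test URLs are made up. They should be checked against the actual queue of the job before use.

```python
import re

# Shared prefix of the spdfraktion.de example from above.
BASE = r"^https?://www\.spdfraktion\.de/(?!sites/).*/(sites|modules|misc)/"

ORIGINAL = re.compile(BASE + r".*\.(js|css)$")

# Hypothetical adaptation 1: also cover .png and an optional query string.
WITH_PNG_AND_QUERY = re.compile(BASE + r".*\.(js|css|png)(\?.*)?$")

# Hypothetical adaptation 2: URLs that end with the directory itself.
DIRECTORY = re.compile(BASE + r"$")

# Made-up URLs, for illustration only.
with_query = "https://www.spdfraktion.de/themen/sites/all/misc/collapse.js?v=1"
directory = "https://www.spdfraktion.de/themen/sites/all/libraries/colorbox/misc/"

print(bool(ORIGINAL.search(with_query)))            # False
print(bool(WITH_PNG_AND_QUERY.search(with_query)))  # True
print(bool(DIRECTORY.search(directory)))            # True
```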

Note that there is a fork of Drupal called Backdrop CMS. It needs different ignores; see below.

Common Offenders

URLs that are commonly seen as off-site links running into Drupal loops:

Backdrop CMS

Backdrop CMS is a fork of Drupal and has the same class of problem. It is significantly rarer and can be identified via a <meta> tag or the window.Backdrop object instead of Drupal.settings. An ignore might look like this, though it isn't yet clear whether this is correct in all cases:

 ^https?://islandpress\.org/(?!core/|modules/|themes/|files/).*/(core|modules|themes|files)/.*\.(js|css)$

https://mappingmilitants.org/ uses Backdrop CMS; note that https://mappingmilitants.org/profiles is a valid URL (/profiles/ is not a script path on this site).

Facebook

Facebook/Instagram/etc will commonly get stuck in loops with URLs like this:

 https://www.facebook.com/ingo.bodtke/posts/pfbid06HRh8gNKe1usc32uyXXpPy3j6dj8EMMKCCBUHiGrVooMNGf747WBYkJGwnd9wtWze33xnzUcCyiQ7rYgTr2Til/js/d99xfjrvg2gco08s.pkg,js/1s3xnfn0kpi8gock.pkg,js/29hbm3a6lfdwg8kg.pkg,js/cw6k1bjdh1s88sko.pkg,js/6e7o7mrmay88c8ww.pkg,js/effz650iopkck0s4.pkg,js/31g819r5dxk404s4.pkg,js/2yvog9db9ym8soso.pkg,js/3x3uuoh7yrc448g4.pkg,js/4fwkuxugoe80wsgc.pkg,js/ct2xs7vmceg4wok4.pkg,js/5uqm9c708og8ccok.pkg,js/6rrjcec8x74sok80.pkg,js/4lpqcfjirrwg0csg.pkg,js/zrruzjb8668wgo44.pkg,js/6zdjq4yqc7sws8go.pkg,js/ajvj8auatuoko8cc.pkg,js/b7a7y0ant8g0gokc.pkg,js/ayrtrli4elck00sg.pkg.__composite__.js

If that happens, use this ignore pattern:

 !ig jobid ^https?://www\.(instagram|facebook)\.com/.*\.pkg[,.](js|css)($|/)
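To see why this pattern catches the loop, it can be tried against a shortened form of the URL above. This is a sketch assuming search semantics; the loop URL keeps only two of the original .pkg segments.

```python
import re

PATTERN = re.compile(
    r"^https?://www\.(instagram|facebook)\.com/.*\.pkg[,.](js|css)($|/)"
)

# Shortened form of the looping URL shown above (two .pkg segments kept).
loop_url = (
    "https://www.facebook.com/ingo.bodtke/posts/"
    "pfbid06HRh8gNKe1usc32uyXXpPy3j6dj8EMMKCCBUHiGrVooMNGf747WBYkJGwnd9wtWze33xnzUcCyiQ7rYgTr2Til"
    "/js/d99xfjrvg2gco08s.pkg,js/ayrtrli4elck00sg.pkg.__composite__.js"
)

# The post itself, without the bogus suffix.
post_url = (
    "https://www.facebook.com/ingo.bodtke/posts/"
    "pfbid06HRh8gNKe1usc32uyXXpPy3j6dj8EMMKCCBUHiGrVooMNGf747WBYkJGwnd9wtWze33xnzUcCyiQ7rYgTr2Til"
)

print(bool(PATTERN.search(loop_url)))  # True
print(bool(PATTERN.search(post_url)))  # False
```

The match happens at `.pkg,js/`: the `[,.]` class accepts the comma that joins the package names, and `($|/)` requires that `js` or `css` is followed by a slash or the end of the URL.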

Wix

On websites built using the Wix website builder, ArchiveBot will try to access a bunch of nonsense URLs and will probably end up stuck in a URL loop towards the end of the job.

Use this ignore pattern to avoid it:

 !ignore jobid ^https?://{primary_netloc}/((.*/)?productPage_USD_productPage_USD|(.*/)?h_\d+/(.*/)?h_\d+(/|$)|.*/.*\.(jpg|jpeg|svg|png|json|txt|xml|text|gif|pdf|mp4)$|.*\.(css|js|json)$|.*/wix-thunderbolt/)
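The `{primary_netloc}` placeholder stands for the host of the job's starting URL. To test the pattern locally, the placeholder has to be filled in first. The sketch below does this with a plain string replacement; how ArchiveBot itself performs the substitution is an assumption here, and `expand` as well as the example.com URLs are made up.

```python
import re

# Pattern copied from the !ignore example above.
WIX_TEMPLATE = (
    r"^https?://{primary_netloc}/("
    r"(.*/)?productPage_USD_productPage_USD"
    r"|(.*/)?h_\d+/(.*/)?h_\d+(/|$)"
    r"|.*/.*\.(jpg|jpeg|svg|png|json|txt|xml|text|gif|pdf|mp4)$"
    r"|.*\.(css|js|json)$"
    r"|.*/wix-thunderbolt/)"
)


def expand(template: str, netloc: str) -> str:
    """Fill in the placeholder with an escaped host (hypothetical helper)."""
    # str.replace avoids str.format, which would trip over regex braces.
    return template.replace("{primary_netloc}", re.escape(netloc))


pattern = re.compile(expand(WIX_TEMPLATE, "www.example.com"))

# Hypothetical URLs, for illustration only.
print(bool(pattern.search("https://www.example.com/shop/h_120/h_120/")))  # True
print(bool(pattern.search("https://www.example.com/about/style.css")))    # True
print(bool(pattern.search("https://www.example.com/about")))              # False
```

Note that the pattern ignores every .css, .js and .json URL on the primary host, so it should only be used on sites that are actually built with Wix.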

Pinterest

AB finds a bunch of bogus JS URLs for Pinterest (including off-site). Ignore them using this:

 !ignore jobid ^https?://(www|[a-z]{2})\.pinterest\.com/.*\.js$

Wordpress

Some WordPress plugins generate junk URLs. Ignore them using this:

 !ignore jobid ^https?://{primary_netloc}/.*/(udata\.vst|current\.cmp|current\.src|current_add\.ep|gtm\.js)/?$

DokuWiki

DokuWiki sites need some ignores to avoid problematic URLs.

Non-sequential diffs

AB may sometimes create URLs to diffs between non-sequential page versions on MoinMoin and potentially other sites. There is a draft (buggy) set of ignores for non-sequential integers.