.hu domains seed
.hu domains | |
Status | Special case |
Archiving status | In progress... |
Archiving type | other |
IRC channel | #archiveteam-bs (on hackint) |
Project lead | bzc6p |
Data | In the Wayback Machine |
Hundreds of .hu domains are registered and expire every single day. Many of them are just parked and never actually used, for example because they are common words or catchy phrases. Because nothing links to them, even Internet Archive crawls are unable to catch them, leaving us with no publicly available proof that such domains were ever registered.
Justification
The Council of Internet Service Providers (ISZT), which manages the .hu domain registry, maintains up-to-date lists of recently registered and recently expired domains. The Internet Archive crawler occasionally checks these lists and could therefore reach the listed sites via the links on them, but it does not visit often enough for all domains to be archived, because new domain names stay on the list for only 2 weeks.
user:bzc6p has been saving these lists since early 2021. Even though many of these domains have been deleted since then, the database built from this data (including known registration and expiration times of all domains) suggests that tens of thousands of domains registered throughout this period are still active in the registry (with or without actual websites behind the domains).
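To illustrate how a database of registration and expiration times can tell which domains are still registered, here is a minimal sketch using an in-memory SQLite database. The table name `registrations`, its columns, the helper `active_domains` and the sample domain names are all hypothetical placeholders; the project's real schema is not described here and may look quite different.

```python
import sqlite3

# Hypothetical schema: one row per registration period of a domain.
# `expired` stays NULL until the domain shows up on the expiry list.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registrations (domain TEXT, registered TEXT, expired TEXT)"
)
conn.executemany(
    "INSERT INTO registrations VALUES (?, ?, ?)",
    [
        # Placeholder names, for illustration only.
        ("alma.hu", "2021-03-01", None),            # never expired
        ("korte.hu", "2021-05-10", "2022-05-10"),   # expired, not re-registered
        ("szilva.hu", "2021-06-01", "2022-06-01"),  # expired ...
        ("szilva.hu", "2023-01-15", None),          # ... then re-registered
    ],
)


def active_domains(conn):
    """Return (domain, last_registration) for domains with an open registration."""
    rows = conn.execute(
        """
        SELECT domain, MAX(registered)
        FROM registrations
        WHERE expired IS NULL
        GROUP BY domain
        ORDER BY domain
        """
    )
    return rows.fetchall()


print(active_domains(conn))
# [('alma.hu', '2021-03-01'), ('szilva.hu', '2023-01-15')]
```

Keeping the last registration date next to each domain is convenient because it is the date a Wayback Machine snapshot has to be compared against later.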
Purpose
The purpose of this project is to seed all available domains into the Wayback Machine, by triggering a Save Page Now action (programmatically, of course...), so that the Wayback Machine knows that the given domain exists, and it can then create subsequent captures on its own.
The project started on 2024-12-17.
Methodology
- A database has been created of all domain names, with their registration and (actual) expiration dates since early 2021.
- Queries have been made to find out which registered domains have not expired yet (or have been re-registered after expiring) – in other words, which are active in the registry.
- A script goes through every domain on the list, first querying the Wayback Machine to check whether a snapshot exists and whether it is recent enough (i.e. created after the last registration date of the domain). The following API was tried for this purpose first:
  https://archive.org/wayback/available?url=example.com
  However, this one doesn't seem to work correctly, or somehow filters out snapshots that should be considered valid for this purpose. So, instead, the JSON fetched in the background by the Wayback Machine web interface is used to find out the date of the last snapshot.
- If needed, a Save Page Now action is triggered, saving error pages as well (e.g. 404), with an appropriately composed POST request to the https://web.archive.org/save endpoint.
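The last two steps could look roughly like the sketch below. It is not the project's actual script: the function names are made up for illustration, the last snapshot is assumed to arrive as a 14-digit Wayback timestamp (`YYYYMMDDhhmmss`), and the form field `capture_all` is an assumption based on the Save Page Now 2 API description of how to request that error pages be saved too. The request is only built, never sent, and any authentication the endpoint may require is left out.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import Request

SAVE_ENDPOINT = "https://web.archive.org/save"


def parse_wayback_timestamp(ts):
    """Convert a 14-digit Wayback timestamp (YYYYMMDDhhmmss) to a UTC datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def needs_capture(last_snapshot_ts, last_registration):
    """True if no snapshot exists, or the newest one is not newer than the
    domain's last registration date."""
    if last_snapshot_ts is None:
        return True
    return parse_wayback_timestamp(last_snapshot_ts) <= last_registration


def build_save_request(domain):
    """Build (but do not send) a POST request for Save Page Now."""
    form = {
        "url": f"http://{domain}/",
        "capture_all": "1",  # assumed field name: also save error pages (e.g. 404)
    }
    return Request(SAVE_ENDPOINT, data=urlencode(form).encode("ascii"), method="POST")


# Example: a domain last registered on 2023-01-15 with an older snapshot.
registered = datetime(2023, 1, 15, tzinfo=timezone.utc)
if needs_capture("20220101000000", registered):
    request = build_save_request("example.hu")
    print(request.get_method(), request.full_url, request.data)
```

Separating the decision (`needs_capture`) from the request construction makes it possible to check the logic without touching the network; actually sending the request would be a single `urllib.request.urlopen(request)` call, ideally rate-limited.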
Progress and statistics
The following table will be updated as results come in.
Year | Known active domains | of which: snapshot already existed | of which: snapshotting failed (DNS error) | of which: first snapshot created |
---|---|---|---|---|
2021 | 50,277[1] | TBA | TBA | TBA |
2022 | TBA | TBA | TBA | TBA |
2023 | TBA | TBA | TBA | TBA |
2024 | TBA | TBA | TBA | TBA |
Total | 50,277 | TBA | TBA | TBA |
- [1] As of 2024-12-14