Description
@hook54321a brought up on IRC that a pipeline retrieved http://www/ successfully (Wayback Machine). The reason for this is that there's a `search ovh.net` line in the pipeline's `resolv.conf`, meaning that `www` resolves to `www.ovh.net`. It appears that this is not a new issue; here's an example snapshot of Online.net's website from two years ago, captured through the same URL by an ArchiveBot pipeline.
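To illustrate why a bare `www` ends up at `www.ovh.net`, here is a rough sketch of how a stub resolver builds its list of names to try from the search list. The function name `candidate_names` is made up for this example, and the ordering follows the behaviour described in resolv.conf(5) with the default `ndots` of 1; actual resolvers may differ in details.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the names a stub resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain.rstrip('.')}" for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + expanded
    # Too few dots (e.g. a single label): the search list is tried first.
    return expanded + [name]


print(candidate_names("www", ["ovh.net"]))
# ['www.ovh.net', 'www']
```

With `search ovh.net` configured, the first candidate for `www` is `www.ovh.net`, which resolves, so the request for http://www/ succeeds.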
We are not alone with this problem: there are various other snapshots showing all kinds of responses (e.g. one, two), and IA even captured itself under that URL at least once back in 2005.
Still, we need to find a way to prevent this from happening to avoid pollution in the archives. @ivan suggested testing whether `www` resolves in the preflight check, but this might not always be reliable because `www` might not resolve on the search domain while other (sub)domains work fine. Another option might be explicitly testing whether `/etc/resolv.conf` contains a `search` line, though that would obviously not work on all operating systems. Yet another option would be implementing a custom DNS resolving stack which completely ignores the DNS configuration in `resolv.conf` (i.e. communicates with specific DNS servers directly), but that's probably not a good idea.
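A minimal sketch of what the first two options could look like combined in a preflight check. This is not existing ArchiveBot code; the names `search_domains`, `single_label_resolves`, and `preflight_ok` are hypothetical, and the check assumes a Unix-like system with a readable `/etc/resolv.conf`.

```python
import socket


def search_domains(resolv_conf_text):
    """Return the domains from the last `search`/`domain` line, if any."""
    domains = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # blank line or comment
        fields = stripped.split()
        if fields[0] in ("search", "domain"):
            # If the keyword appears more than once, the last instance wins.
            domains = fields[1:]
    return domains


def single_label_resolves(name="www"):
    """Return True if a bare single-label name resolves on this machine."""
    try:
        socket.getaddrinfo(name, 80)
    except socket.gaierror:
        return False
    return True


def preflight_ok(path="/etc/resolv.conf"):
    """Pass only if no search list is configured and `www` does not resolve."""
    try:
        with open(path) as f:
            domains = search_domains(f.read())
    except OSError:
        domains = []  # no readable resolv.conf, e.g. on a non-Unix system
    return not domains and not single_label_resolves()
```

Checking both conditions covers the case mentioned above where `www` itself does not resolve on the search domain but other names would. It still does not help on systems that configure the search list somewhere other than `resolv.conf`.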
In any case, the current pipelines also need to be fixed of course. Ping to the current pipeline operators: @Asparagirl, @chronomex, @falconkirtaran, @HarryC145, @MattIggo. My pipelines are not affected.