
Pollution in archives due to search domains in resolv.conf #318

Open

Description


@hook54321a brought up on IRC that a pipeline retrieved http://www/ successfully (Wayback Machine). The reason is that the pipeline's resolv.conf contains a "search ovh.net" line, so the bare hostname www resolves to www.ovh.net. This does not appear to be a new issue: there is a snapshot of Online.net's website from two years ago, captured through the same URL by an ArchiveBot pipeline.
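To make the mechanism concrete, here is a minimal sketch of how a stub resolver might expand a name using the search list. It is a simplification modelled on the behaviour described in resolv.conf(5), not the resolver's actual code; the function name `candidate_names` and its parameters are made up for illustration.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a resolver might try, in order.

    Simplified model: names ending in a dot are absolute and never get
    the search list applied; names with fewer than `ndots` dots try the
    search domains first and the bare name last.
    """
    if name.endswith("."):
        return [name]  # absolute name: search list is not applied

    as_is = name + "."
    with_search = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]

    if name.count(".") >= ndots:
        # Enough dots: try the name as given before the search list.
        return [as_is] + with_search
    # Too few dots (e.g. "www"): search domains are tried first.
    return with_search + [as_is]


if __name__ == "__main__":
    # With "search ovh.net", the bare host "www" is first tried as www.ovh.net.
    print(candidate_names("www", ["ovh.net"]))
```

Under this model, a URL like http://www/ only fails to resolve if none of the search-domain expansions exist, which is why the outcome depends on the hosting provider's resolv.conf.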

We are not alone in this problem: there are various other snapshots of that URL showing all kinds of responses, and IA even captured itself under that URL at least once back in 2005.

Still, we need to find a way to prevent this from happening to avoid pollution in the archives. @ivan suggested testing whether www resolves in the preflight check, but this might not always be reliable because www might not resolve on the search domain while other (sub)domains work fine. Another option might be explicitly testing whether /etc/resolv.conf contains a search line, though that would obviously not work on all operating systems. Yet another option would be implementing a custom DNS resolution stack that completely ignores the DNS configuration in resolv.conf (i.e. communicates with specific DNS servers directly), but that's probably not a good idea.
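As a rough illustration of the second option, the following sketch scans resolv.conf text for search or domain directives. It is not ArchiveBot code: `find_search_domains` and `preflight_ok` are hypothetical names, the parsing is deliberately simplified, and it assumes a Unix-like system where the resolver reads /etc/resolv.conf.

```python
def find_search_domains(resolv_conf_text):
    """Return the search domains configured in resolv.conf-style text.

    Simplified parsing: anything after '#' or ';' is treated as a comment,
    and the last 'search' or 'domain' directive wins.
    """
    domains = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, then split the directive into whitespace-separated fields.
        line = line.split("#", 1)[0].split(";", 1)[0]
        fields = line.split()
        if fields and fields[0] in ("search", "domain"):
            domains = fields[1:]
    return domains


def preflight_ok(path="/etc/resolv.conf"):
    """Hypothetical preflight check: pass only if no search domains are set."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return not find_search_domains(handle.read())
    except OSError:
        # No readable resolv.conf: this check cannot say anything useful.
        return True


if __name__ == "__main__":
    sample = "nameserver 213.186.33.99\nsearch ovh.net\n"
    print(find_search_domains(sample))
```

A check like this is only a partial safeguard. According to resolv.conf(5), when no search list is configured the resolver may derive a default one from the local hostname, so an empty result here does not by itself guarantee that bare hostnames will fail to resolve.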

In any case, the current pipelines also need to be fixed. Ping to the current pipeline operators: @Asparagirl, @chronomex, @falconkirtaran, @HarryC145, @MattIggo. My pipelines are not affected.
