Second insight of the day

It’s time for my favorite thing. Let’s have a Python packaging rant.

Most people know that the Python world’s approach to packaging and distribution is best summarized, and probably directly inspired, by xkcd.

A long time ago, a decision was made, or accepted through mutual apathy, about how installation tools like pip and the late, unlamented easy_install would find packages:

Predict the file name (possibly more than one).
Load an HTML page.
Check every <a> on that page.
Use the one that points at the most preferred file name variation.

The only things on the page that are at all interesting in this model are the //a/@href values. Everything else is noise.
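For the curious, the lookup described above can be sketched in a few lines using nothing but the standard library's html.parser. To be clear, this is my own illustration and not pip's actual code: the names AnchorCollector and pick_link, and the sample page, are made up for the example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands us lowercased tag names and (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def pick_link(page, preferred_names):
    """Return the href matching the most preferred file name, or None."""
    collector = AnchorCollector()
    collector.feed(page)
    for wanted in preferred_names:  # ordered from most to least preferred
        for href in collector.hrefs:
            # Drop any URL fragment (e.g. "#sha256=...") before comparing.
            filename = href.split("#", 1)[0].rsplit("/", 1)[-1]
            if filename == wanted:
                return href
    return None


# Hypothetical index page: no doctype, no <html>, just links.
PAGE = """
<a href="files/demo-1.0.tar.gz#sha256=abc">demo-1.0.tar.gz</a><br>
<a href="files/demo-1.0-py3-none-any.whl">demo-1.0-py3-none-any.whl</a>
"""

print(pick_link(PAGE, ["demo-1.0-py3-none-any.whl", "demo-1.0.tar.gz"]))
```

Note that nothing in this sketch cares what surrounds the <a> elements, which is rather the point.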

Enter pip 22.0, released ~2 days ago. It now uses a different parser for HTML pages than previous versions. Instead of the certainly entirely overpowered html5lib, whatever that is, it is now using something else, apparently called html.parser.

Remember, the approach to looking up download links for package files consists of looking at <a> elements. Finding them in HTML, however close to, or far from, the spec it may be, is not a big issue. (Anyone who allows user-generated content on a download page is beyond help anyway.)

Why, then, the switch to a parser that, in order to find <a> elements, requires that the page have a correct <!DOCTYPE html> declaration, i.e. the claim that it adheres to the HTML5 spec?
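To show that the two concerns are separable, here is a small sketch, again with html.parser, that records whether a page declares <!DOCTYPE html> and collects the links either way. I am not claiming this is how pip implements its check; DoctypeSniffer and inspect are names invented for this example.

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Note whether an HTML5 doctype was seen, and collect <a> hrefs regardless."""

    def __init__(self):
        super().__init__()
        self.has_html5_doctype = False
        self.hrefs = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "DOCTYPE html".
        if decl.strip().lower() == "doctype html":
            self.has_html5_doctype = True

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def inspect(page):
    """Return (has_html5_doctype, list_of_hrefs) for the given page text."""
    sniffer = DoctypeSniffer()
    sniffer.feed(page)
    return sniffer.has_html5_doctype, sniffer.hrefs


with_doctype = "<!DOCTYPE html><html><body><a href='x.whl'>x</a></body></html>"
without_doctype = "<a href='x.whl'>x</a>"

print(inspect(with_doctype))
print(inspect(without_doctype))
```

In this sketch the parser finds the link in both pages; the doctype only changes a flag. Whether to reject a page over that flag is a policy choice layered on top, not something the parsing itself forces.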

I will not go into the HTML5 “spec” any further here.

The result is pip 22.0.2, an emergency release so urgent that it could not even be tagged in the repo before it was put on PyPI (and the tag has still not been created as of press time). In typical fashion, whenever it encounters a page that does not have the canonical <!DOCTYPE>, it prints a warning blaming the user for their audacity in installing packages from sources that do not wrap their unstructured lists of <a> elements in proper, (un)well-specified HTML5.

Hey, PyPA, you want a proper fix for that? Publish a file. One single file. Its name will end in “.xsd”, and it will describe how lists of package links are supposed to look.

Ceterum censeo: This bug would have been avoided by not suggesting to package publishers that you will accept any line noise as long as there are <a> elements in it, then suddenly deciding that the stuff surrounding the <a> elements really matters to you.

tl;dr: Shame on you. This took at least six minutes to write, go enjoy it.