Second insight of the day
It’s time for my favorite thing. Let’s have a Python packaging rant.
Most people know that the Python world’s approach to packaging and distribution is best summarized, and probably directly inspired, by xkcd.
A long time ago, a decision was made, or accepted by mutual apathy,
on how installation tools like pip and the late, unlamented
easy_install would find packages:
- Predict the file name (possibly more than one).
- Load an HTML page.
- Check every <a> on that page.
- Use the one that points at the most preferred file name variation.
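The last step of that list can be sketched in a few lines of Python. This is only an illustration of the idea, not pip's code: the function name pick_link, its parameters, and the assumption that the predicted file names arrive as a list ordered from most to least preferred are all made up for this example.

```python
from posixpath import basename
from urllib.parse import urlsplit


def pick_link(hrefs, preferred_names):
    """Return the href whose file name ranks best in preferred_names.

    Hypothetical helper: preferred_names holds the predicted file names,
    ordered from most to least preferred. Returns None if nothing matches.
    """
    rank = {name: position for position, name in enumerate(preferred_names)}
    best = None  # (file name, href) of the best match seen so far
    for href in hrefs:
        # Only the last path segment matters; query and fragment are ignored.
        name = basename(urlsplit(href).path)
        if name in rank and (best is None or rank[name] < rank[best[0]]):
            best = (name, href)
    return best[1] if best else None
```

Given a source tarball link and a wheel link for the same release, and a preference list that puts the wheel first, this returns the wheel link regardless of where it appears on the page.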
The only things in the page that are at all interesting in this model
are the //a/@href values. Everything else is noise.
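To make the "everything else is noise" point concrete, here is a minimal sketch that pulls the href of every <a> out of a page and ignores the rest. It uses the standard library's html.parser module; the class name HrefCollector and the function collect_hrefs are invented for this example and are not how any installer actually does it.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag, nothing else."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def collect_hrefs(page):
    """Return all <a href> values found in page, in document order."""
    collector = HrefCollector()
    collector.feed(page)
    collector.close()
    return collector.hrefs
```

Note that the sketch never looks at anything outside the <a> start tags, so missing <html>, <body>, or closing tags make no difference to the result.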
Enter pip 22.0, released ~2 days ago. It now uses a different parser
for HTML pages than previous versions. Instead of the certainly
entirely overpowered html5lib, whatever that is, it is now using
something else, apparently called html.parser.
Remember, the approach to looking up download links for package files
consists of looking at <a> elements. Finding them in HTML, however
close to, or far from, the spec it may be, is not a big issue. (Anyone
who allows user-generated content on a download page is beyond help
anyway.)
Why, then, the switch to a parser that, in order to find <a>
elements, requires that the page have a correct <!DOCTYPE html>
declaration, i.e. the claim that it adheres to the HTML5 spec?
The HTML5 “spec” shall not be gone into here any further.
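For illustration, a check of this kind can be written with html.parser by overriding handle_decl, which receives the contents of a <!...> declaration. The sketch below is an assumption about what such a check might look like, not pip's actual rule: the names DoctypeSniffer and has_html5_doctype are hypothetical, and it accepts the declaration anywhere in the page rather than only at the top.

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Record every <!...> declaration the parser reports."""

    def __init__(self):
        super().__init__()
        self.declarations = []

    def handle_decl(self, decl):
        # For "<!DOCTYPE html>" the parser passes "DOCTYPE html".
        self.declarations.append(decl)


def has_html5_doctype(page):
    """Return True if page contains a <!DOCTYPE html> declaration.

    Simplification: case and whitespace are normalized, and the position
    of the declaration within the page is not checked.
    """
    sniffer = DoctypeSniffer()
    sniffer.feed(page)
    sniffer.close()
    return any(
        " ".join(decl.split()).lower() == "doctype html"
        for decl in sniffer.declarations
    )
```

A bare list of links such as "<a href='demo-1.0.tar.gz'>demo</a>" fails this check even though every link in it can be found without trouble.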
The result is pip 22.0.2, an emergency release so urgent that it could
not even be tagged in the repo before it was put on PyPI (and the tag
has still not been created as of press time). In the typical manner,
whenever it encounters a page that does not have the canonical
<!DOCTYPE>, it will print a warning blaming the user for their
audacity in installing packages from sources that do not wrap their
unstructured lists of <a> elements in proper, (un)well-specified
HTML5.
Hey, PyPA, you want a proper fix for that? Publish a file. One single
file. Its name will end in “.xsd”, and it will describe how lists of
package links are supposed to look.
Ceterum censeo: This bug would have been avoided by not suggesting to
package publishers that you will accept any line noise as long as
there are <a> elements in it, then suddenly deciding that the stuff
surrounding the <a> elements really matters to you.
tl;dr: Shame on you. This took at least six minutes to write, go enjoy it.