It is incorrect to "normalize" // in HTTP URL paths (runxiyu.org)

by pabs3 67 comments 70 points

[−] Bender 27d ago
NGinx, Kube-NGINX, Apache, Traefik all default to normalizing request paths per reference of RFC 3986 [1]. This behavior can be disabled when requests are proxied to resources on the back-end that require double-slashes. I only reference the RFC to describe what they are talking about, not why they default to merging. They all agreed on a decision as one was not made for them.
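
For readers unfamiliar with the behavior being discussed, here is a rough sketch of what "merging slashes" amounts to. The function name `merge_slashes` is borrowed from the nginx directive mentioned later in the thread, but the code is only an illustration and is not how any of these servers implement it.

```python
import re


def merge_slashes(path: str) -> str:
    """Collapse every run of two or more slashes into a single slash."""
    return re.sub(r"/{2,}", "/", path)


# After merging, two distinct request paths become indistinguishable.
print(merge_slashes("/foo//bar"))  # /foo/bar
print(merge_slashes("/foo/bar"))   # /foo/bar
```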

To generalize by saying "incorrect" is itself incorrect. The correct answer is that it depends on the requirements of the given implementation. Making such generalizations will just lead to endless arguing. If there is still any debate, then a group must vote to deprecate the existing RFC and replace it with a new RFC that requires that merging slashes MUST be either always enabled or always disabled, using verbiage per RFC 2119 [2] and optionally RFC 6919 [3]. Even then, one may violate an RFC if there is a need to do so and everyone has verified, documented and signed off that doing so has not introduced any security or other risks in the given implementation, and that any such risk identified later will be remediated or mitigated in a timely manner.

[Edit] For clarification, the reason I am linking to RFC 3986 is that it only defines path characteristics and does not explicitly say what to do or not to do. Arguments will persist until a new RFC is created; blog and Stack Overflow posts will not settle them. Even then people may violate the RFC if they feel it is safe to do so. I do not know how to reword this to make it less confusing.

[1] - https://datatracker.ietf.org/doc/html/rfc3986

[2] - https://datatracker.ietf.org/doc/html/rfc2119

[3] - https://datatracker.ietf.org/doc/html/rfc6919

[−] cxr 27d ago
Both you and the original author cite the same RFC to support your arguments. Passages from RFC 3986 comprise the bulk of the original post.

The difference between the support for your argument and theirs is that they call out the specific sections in the RFC that they claim are relevant to the issue at hand, while your comment only broadly references the RFC by name. In any case, even if they, too, merely gestured to its existence, claiming that it supports their position, then appearing here with a bare claim that RFC 3986 supports the opposing side without further elaboration is not exactly a strong candidate for a path to a fruitful resolution.

[−] mjmas 27d ago
Agreed. Reading through the RFC it certainly appears to support the blog article.

And looking around I found this SO answer noting nothing in the RFC:

https://stackoverflow.com/a/24661288

[−] Bender 27d ago
> In any case, even if they, too, merely gestured to its existence

That is entirely my point. If the author wants to disable merge slashes then they need to replace the RFC I linked to with one that explicitly says what to do or not do using strong verbiage that is explicit as I explained. Blog articles and Stack Overflow threads will not set a standard.

If people interpret the RFC differently than I in that they feel it is explicit vs vague then please contact all of the web daemon maintainers to have them correct their default behavior. Just know ahead of time that two of them are quite challenging to have these discussions with.

[−] cxr 26d ago

> That is entirely my point. If the author wants to disable merge slashes then they need to replace the RFC I linked to with one that explicitly says what to do or not do using strong verbiage that is explicit as I explained.

That doesn't seem to be the case. You said, "NGinx, Kube-NGINX, Apache, Traefik all default to normalizing request paths per reference of RFC 3986". That's a strong claim, not an appeal to ambiguity.

> Blog articles[…] will not set a standard.

Blog posts absolutely have the power to influence future developments. That's historically how it has worked. "RFC" stands for "Request For Comments".

[−] Bender 26d ago
> to influence future developments

This development work is already completed. New web daemons would likely just follow the precedent that has been set by the popular daemons so as not to cause confusion, unexpected behavior and even more arguments.

If a notable sized group of developers would like to contact all the web daemon maintainers I can list all their contact information. In my experience these developers and F5 are not very open to making sweeping changes but there is mostly no harm in trying. The representative should be someone thick skinned.

[−] cxr 26d ago

> This development work is already completed.

You're prevaricating. Earlier:

> they need to replace the RFC I linked to with one that explicitly says what to do or not do

Do they need to work on getting the RFC to be more explicit about the correct behavior or not?

[−] Bender 26d ago
It really isn't up to me. If enough developers find this to be an important issue then the first step would be to replace the RFC with a new one and then work with the existing web daemon developers to change their defaults. There should also be an effort to communicate these changes to all the internet companies world wide long in advance as this will be a breaking change for many people. Perhaps I am just jaded but I think this will break a lot of stuff and cause a lot of really bad maintenance windows and other fallout. Who among the developers is willing to take the lead on this? You appear to be very articulate and astute. Are you taking the lead?
[−] cxr 26d ago

> It really isn't up to me. If enough developers find this to be an important issue then the first step would be to[…]

That's not what I asked. In one breath, you've said they need to take up that effort. In the next breath, you've said that it's a done deal. I'm asking: which is it?

Making up your mind (instead of perpetually moving the goalposts) is up to you.

[−] Timwi 25d ago
I believe you misunderstood them. The way I interpreted it is: 1) the development is already done, so the developers have broad consensus on how it should work; 2) the only thing that can break that consensus is a new RFC that tells them in no uncertain terms to do it differently.
[−] bigbadfeline 26d ago
If cxr takes the lead, I'll be happy to help too. I don't have much time but I can provide some support in this monumental struggle.
[−] embedding-shape 27d ago

> Making such generalizations will just lead to endless arguing

But 80% of all programming blog posts on the internet rely on being able to make sweeping generalizations across the ecosystem! Without this, we basically have nothing left to argue about.

Caring about tradeoffs, contexts, nuance and not just cargoculting our way into a distributed architecture for an app with 10 users just sounds so 90s and early 00s. We're now in the future and we're all outputting the same ̶t̶o̶k̶e̶n̶s̶ code, so obviously whatever is the solution in my case surely must be the solution in your case too.

[−] echoangle 27d ago

> Wait, are there any implementations that wrongly collapse double-slashes?

> nginx with merge_slashes

How can it be wrong if it is server-side? If the server wants to treat those paths equally, it can if it wants to.

It would only be wrong if a client does it and requests a different URL than the user entered, right?

[−] MattJ100 27d ago
URL parsing/normalisation/escaping/unescaping is a minefield. There are many edge cases where every implementation does things differently. This is a perfect example.

It gets worse if you are mapping URLs to a filesystem (e.g. for serving files). Even though they look similar, URL paths have different capabilities and rules than filesystems, and different filesystems also vary. This is also an example of that (I don't think most filesystems support empty directory names).

[−] PunchyHamster 27d ago
We cut those and few others coz historically there were exploits relying on it

Nothing on web is "correct", deal with it

[−] bryden_cruz 27d ago
This exact ambiguity causes massive headaches when putting Nginx in front of a Spring Boot backend. Nginx defaults to merge_slashes on, so it silently 'fixes' the path. But Spring Security's strict firewall explicitly rejects URLs with // as a potential directory traversal vector and throws an error. It forces you to explicitly decide which layer in your infrastructure owns path normalization, because if Nginx passes it raw, the Java backend completely panics.
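
The ownership question described above can be sketched with a toy model. This is not Nginx or Spring Security code; `proxy_forward`, `strict_backend`, and `handle` are hypothetical names, and the backend's rejection rule is an assumption standing in for a strict firewall that refuses `//`.

```python
import re


def proxy_forward(path: str, merge: bool) -> str:
    """Toy reverse proxy: optionally collapse repeated slashes before forwarding."""
    return re.sub(r"/{2,}", "/", path) if merge else path


def strict_backend(path: str) -> str:
    """Toy backend that refuses any path containing an empty segment."""
    if "//" in path:
        raise ValueError(f"rejected: {path!r}")
    return f"200 OK for {path}"


def handle(path: str, merge: bool) -> str:
    """Run one request through the proxy and then the backend."""
    try:
        return strict_backend(proxy_forward(path, merge))
    except ValueError as exc:
        return f"400 {exc}"


print(handle("/api//users", merge=True))   # 200 OK for /api/users
print(handle("/api//users", merge=False))  # 400 rejected: '/api//users'
```

The same client request succeeds or fails depending only on which layer is allowed to rewrite the path, which is why the decision has to be made explicitly.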
[−] dale_glass 27d ago
But maybe you should anyway.

Because maybe you use S3, which treats foo/bar.txt and foo//bar.txt as entirely separate things. Because to S3, directories don't exist and those are literally the exact names of the keys under which data is stored.

So you have script A concatenate "foo" + "/bar" and script B concatenate "foo/" + "/bar", and suddenly you have a weird problem.

I can't imagine a real use case where you'd think this is desirable.
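
The concatenation problem described above can be sketched with a plain dictionary standing in for a flat key-value object store. No real S3 API is used here; `naive_key` and `join_key` are hypothetical helpers.

```python
def naive_key(prefix: str, name: str) -> str:
    """Plain string concatenation, as in the two scripts described above."""
    return prefix + name


def join_key(prefix: str, name: str) -> str:
    """Join with exactly one slash at the seam, leaving the rest untouched."""
    return prefix.rstrip("/") + "/" + name.lstrip("/")


bucket = {}  # stand-in for an object store: a flat mapping from key to data
bucket[naive_key("foo", "/bar.txt")] = b"written by script A"
bucket[naive_key("foo/", "/bar.txt")] = b"written by script B"

# Two separate objects now exist where one was intended.
print(sorted(bucket))  # ['foo//bar.txt', 'foo/bar.txt']

# Joining at the seam gives both scripts the same key.
print(join_key("foo", "/bar.txt") == join_key("foo/", "/bar.txt"))  # True
```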

[−] leni536 27d ago
I don't think it's incorrect for distinct paths to point to the same resource.

Of course you shouldn't assume that in a client. If you are implementing against an API don't deviate regarding // and trailing / from the API documentation.

[−] mjs01 27d ago
// is useful if the server needs to serve both static files in the filesystem, and embedded files like a webpage. // can be used for embedded files' URL because they will never conflict with filesystem paths.
[−] sfeng 27d ago
What I’ve learned in doing this type of normalization is that whatever the specification says, you will always find some website that uses some insane URL tweak to decide what content it should show.
[−] WesolyKubeczek 27d ago
It is probably “incorrect”, but given the established actual usage over the decades, it’s most likely what you need to do nevertheless.

Not doing it is like punishing people for not using Oxford commas, or entering an hour long debate each time someone writes “would of” instead of “would have”. It grinds my gears too, but I have different hills to die on.

[−] nottorp 27d ago
There are still email forms that refuse pluses in email addresses too...
[−] domenicd 27d ago
As some others have indirectly pointed out, this article conflates two things:

- URL parsing/normalization; and

- Mapping URLs to resources (e.g. file paths or database entries) to be served from the server, and whether you ever map two distinct URLs to the same resource (either via redirects or just serving the same content).

The former has a good spec these days: https://url.spec.whatwg.org/ tells you precisely how to turn a string (e.g., sent over the network via HTTP requests) into a normalized data structure [1] of (scheme, username, password, host, port, path, query, fragment). The article is correct insofar as the spec's path (which is a list of strings, for HTTP URLs) can contain empty string segments.
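
As a small illustration of empty path segments surviving a parse, here is a sketch using Python's standard `urllib.parse`. Note the assumption: `urllib.parse` is not an implementation of the WHATWG URL Standard, so this only shows the general idea of a path as a list of segments, and `path_segments` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def path_segments(url: str) -> list[str]:
    """Split a URL's path into segments, keeping empty ones."""
    path = urlsplit(url).path
    # The path starts with "/", so drop the empty string before it.
    return path.split("/")[1:]


print(path_segments("https://example.com/foo//bar"))  # ['foo', '', 'bar']
print(path_segments("https://example.com/foo/bar"))   # ['foo', 'bar']
```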

But the latter is much more wild-west, and I don't know of any attempt being made to standardize it. There are tons of possible choices you can make here:

- Should https://example.com/foo//bar serve the same resource as https://example.com/foo/bar? (What the article focuses on.)

- https://example.com/foo/ vs. https://example.com/foo

- https://example.com/foo/ vs. https://example.com/FOO

- https://example.com/foo vs. https://example.com/fo%6f% vs. https://example.com/fo%6F%

- https://example.com/foo%2Fbar vs. https://example.com/foo/bar

- https://example.com/foo/ vs. https://example.com/foo.html
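
To make the percent-encoding choices in this list concrete, here is a sketch of how the order of decoding and splitting changes which paths a server treats as equal. The helper names and example paths are illustrative assumptions, not part of any specification.

```python
from urllib.parse import unquote


def decode_then_split(raw_path: str) -> list[str]:
    """Percent-decode the whole path first, then split into segments."""
    return unquote(raw_path).split("/")[1:]


def split_then_decode(raw_path: str) -> list[str]:
    """Split on literal slashes first, then percent-decode each segment."""
    return [unquote(segment) for segment in raw_path.split("/")[1:]]


# Decoding first erases the difference between "%2F" and "/".
print(decode_then_split("/foo%2Fbar"))  # ['foo', 'bar']
print(decode_then_split("/foo/bar"))    # ['foo', 'bar']

# Splitting first keeps "%2F" inside a single segment.
print(split_then_decode("/foo%2Fbar"))  # ['foo/bar']

# Either order treats an escaped "o" as the same as a literal one.
print(split_then_decode("/fo%6f") == split_then_decode("/fo%6F") == ["foo"])  # True
```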

Note that some things are normalized during parsing, e.g. /foo\bar -> /foo/bar, and /foo/baz/../bar -> /foo/bar. But for paths, very few.

Relatedly:

- For hosts, many more things are normalized during parsing. (This makes some sense, for security reasons.)

- For query, very little is normalized during parsing. But unlike for pathname, there is a standardized format and parser, application/x-www-form-urlencoded [2], that can be used to go further and canonicalize from the raw query string into a list of (name, value) string pairs.
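
As a rough illustration of canonicalizing a raw query string into (name, value) pairs, here is a sketch using Python's `urllib.parse.parse_qsl`. This is an assumption-laden stand-in: Python's parser handles the same general format but may differ from the URL Standard's parser in edge cases, and `query_pairs` is a hypothetical wrapper.

```python
from urllib.parse import parse_qsl


def query_pairs(raw_query: str) -> list[tuple[str, str]]:
    """Parse a raw query string into an ordered list of (name, value) pairs."""
    return parse_qsl(raw_query, keep_blank_values=True)


# Order and duplicates are preserved; "+" and percent-escapes are decoded.
print(query_pairs("a=1&b=hello+world&a=%2F"))
# [('a', '1'), ('b', 'hello world'), ('a', '/')]
```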

Some discussions on the topic of path normalization, especially in terms of mapping the filesystem, in the URL Standard repo:

- https://github.com/whatwg/url/issues/552

- https://github.com/whatwg/url/issues/606

- https://github.com/whatwg/url/issues/565

- https://github.com/whatwg/url/issues/729

-----

[1]: https://url.spec.whatwg.org/#url-representation

[2]: https://url.spec.whatwg.org/#application/x-www-form-urlencod...

[−] janmarsal 27d ago
i'm gonna do it anyway
[−] leni536 27d ago
Wait until you try http:/example.com and http://////example.com in your browser.
[−] renewiltord 27d ago
I’m going to keep doing it.
[−] joeframbach 26d ago
The "why would you want to do that?" section ought to be the very first paragraph. I spent half the article thinking, who the hell is collapsing http:// into http:/, until I had to deduce from context what this article even was about. The article starts in medias res.