Properly escaping characters (especially spaces) in an HTML anchor name
Problem
A well-loved HTML feature is the ability to link to a specific part of a page. Specifically, the “fragment portion” of a URL can identify an HTML element in the page, and after loading the page, the browser will automatically scroll to that element. For example,
No known cat breeds are considered arachnids, as I <a href="#cat-arachnids">explain later</a>.
...
...
<h2 id="cat-arachnids">Cat Arachnids</h2>
There is currently no overlap between the felines and the arachnids. Thankfully, our best and brightest scientists are working hard to rectify this.
...
...
For simple cases, everything is good. But what if you want to include special characters in the anchor name?
Things get a little interesting because the anchor name must be
written both in the id attribute and in a URL. Most HTML
attributes are allowed to be arbitrary strings, but unfortunately
id attributes are an exception:
“When specified on HTML elements, the id attribute value must be unique amongst all the IDs in the element’s tree and must contain at least one character. The value must not contain any ASCII whitespace.”
(from the HTML5 spec).
Furthermore, the spec also
clarifies that the id attribute should not be
URL-encoded (the URL is decoded before trying to find a matching ID), so
it is not correct to simply escape the spaces using %20. In
fact, using only valid HTML, it is impossible to write an anchor
target using the id attribute which will match a URL
fragment identifier with a space in it.
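To make that lookup rule concrete, here is a minimal Python sketch of the matching step as described above: the fragment is percent-decoded once and then compared with the id value exactly as written in the HTML. The function name fragment_matches_id is hypothetical, and the sketch models only the rule as stated here, not a browser's full algorithm.

```python
from urllib.parse import unquote

def fragment_matches_id(fragment: str, id_value: str) -> bool:
    """Model of the rule described above: decode the URL fragment once,
    then compare it with the id attribute exactly as written."""
    return unquote(fragment) == id_value

# A plain id matches its plain fragment.
print(fragment_matches_id("cat-arachnids", "cat-arachnids"))        # True
# The id is never decoded, so "#cat%20arachnids" does not find id="cat%20arachnids"...
print(fragment_matches_id("cat%20arachnids", "cat%20arachnids"))    # False
# ...and only the double-encoded fragment does.
print(fragment_matches_id("cat%2520arachnids", "cat%20arachnids"))  # True
```

The only id that "#cat%20arachnids" could match under this rule is one containing a literal space, which is exactly what the spec forbids.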
Solution
However, you may be aware that there’s another way to create a named
anchor in HTML: using an <a> tag with the
name attribute. As of HTML5, this is deprecated,
but still valid. While name attributes may
contain any character (including whitespace), we actually don’t
want to include literal spaces; the spec says that when an
<a> tag with a name attribute is used as a fragment target,
the name is expected to be URL-encoded, unlike the id
attribute. So we should do something like
<a name='cat%20arachnids'></a>
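If you generate your pages from a script, the encoded name can be produced rather than typed by hand. The following is a small sketch using Python's urllib.parse.quote; the helper name anchor_name is made up for illustration, and passing safe="" is my choice so that "/" is encoded too.

```python
from urllib.parse import quote

def anchor_name(label: str) -> str:
    """Percent-encode a human-readable label for use in a name attribute.
    safe="" also encodes "/", which quote() leaves alone by default."""
    return quote(label, safe="")

name = anchor_name("cat arachnids")
print(f"<a name='{name}'></a>")                 # <a name='cat%20arachnids'></a>
print(f'<a href="#{name}">explain later</a>')   # <a href="#cat%20arachnids">explain later</a>
```

The same encoded string is used both in the name attribute and after the "#" in the link.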
Non-ASCII characters
There doesn’t seem to be anything in the spec prohibiting the
id attribute from storing non-ASCII characters. The only
prohibited characters are “ASCII whitespace”.
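As a rough illustration, the two value constraints from the quoted spec text can be checked in a few lines of Python. The helper is_valid_id is hypothetical, and it deliberately ignores the uniqueness requirement, which needs the whole document.

```python
# The five code points the HTML spec calls "ASCII whitespace":
# tab, line feed, form feed, carriage return, and space.
ASCII_WHITESPACE = "\t\n\f\r "

def is_valid_id(value: str) -> bool:
    """Check only that the value is non-empty and has no ASCII whitespace.
    Uniqueness within the document is not checked here."""
    return len(value) > 0 and not any(ch in ASCII_WHITESPACE for ch in value)

print(is_valid_id("cat-arachnids"))           # True
print(is_valid_id("chat-araign\u00e9es"))     # True: non-ASCII is allowed
print(is_valid_id("cat arachnids"))           # False: contains a space
print(is_valid_id("cat\u00a0arachnids"))      # True: U+00A0 is not ASCII whitespace
```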
What do browsers do?
Both Chrome and Firefox break the spec to make spaces work better, by
treating %20 specially in id attributes.
According to the spec, %20 in the id attribute
should match only %2520 in the URL. Chrome and Firefox both
do work if you specify that URL, but they also let
%20 in the ID match %20 in the URL. This is
only for %20, though: If you put any other URL-encoded
character into the id attribute, it will be treated
according to the spec (i.e., if your attribute is
my%2Fanchor, it will only match the fragment URL
my%252Fanchor, and not
my%2Fanchor).
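One way to picture this behaviour is the toy model below. It is only my guess at a rule that reproduces the observations above, not how Chrome or Firefox actually implement it: the function names are invented, and the assumption is that a "%20" in the id may additionally be read as a space.

```python
from urllib.parse import unquote

def spec_match(fragment: str, id_value: str) -> bool:
    # Rule as described in this post: decode the fragment, compare to the raw id.
    return unquote(fragment) == id_value

def browser_like_match(fragment: str, id_value: str) -> bool:
    # Assumption: model the leniency by also reading "%20" in the id as a space.
    # No other escape sequence in the id gets this treatment.
    lenient_id = id_value.replace("%20", " ")
    return spec_match(fragment, id_value) or unquote(fragment) == lenient_id

print(browser_like_match("cat%2520arachnids", "cat%20arachnids"))  # True (per spec)
print(browser_like_match("cat%20arachnids", "cat%20arachnids"))    # True (the extra leniency)
print(browser_like_match("my%252Fanchor", "my%2Fanchor"))          # True (per spec)
print(browser_like_match("my%2Fanchor", "my%2Fanchor"))            # False (no leniency for %2F)
```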
TL;DR
- Do you need ASCII whitespace in your anchor names?
  - Yes: Use <a name="my%20url-encoded%20name">...</a>
  - No: Use <span id="my-non-url-encoded-name">...</span>
Both of these options are valid HTML5, though the former is technically deprecated.