I Want XMP But I Know Why I Can't Have It


When writing text that will be displayed to users as HTML I write HTML. This seems like it would be the normal way to do it, though it’s unusual these days. Either you draft in a fancy content editor or you write Markdown. HTML authoring is a bit retrogrouch but it suits me. Except when I need to write about HTML itself.

In my previous post I was talking about how I write recipes in HTML:

<li>2 eggs (or 2T flax and 5T water)
<li>2/3 C oil
<li>1C greek yoghurt
<li>1/4 cup milk, more if needed

Except in the blog post itself the angle brackets needed to be escaped, so on my screen it looked like:

<pre>
&lt;li&gt;2 eggs (or 2T flax and 5T water)
&lt;li&gt;2/3 C oil
&lt;li&gt;1C greek yoghurt
&lt;li&gt;1/4 cup milk, more if needed
</pre>

This is pretty painful. What if I could just write HTML, but mark it off as an example of how to write HTML so the browser would ignore it? This was exactly the problem Berners-Lee had when initially documenting HTML, and if you look at early pages you’ll see he introduced an <xmp> tag:

The title of a document is given between title tags:
<XMP><TITLE> … </TITLE></XMP>

This is great! Can I just write:

<xmp>
<li>2 eggs (or 2T flax and 5T water)
<li>2/3 C oil
<li>1C greek yoghurt
<li>1/4 cup milk, more if needed
</xmp>

Let’s go look up the documentation first:

Oh dear. Deprecated since at least 1995. Now, it does still work when I try it in Chrome and Firefox (would they ever break the original HTML pages?) but I’m not going to use it here in case it doesn’t work in RSS readers. But why would they take away this great feature?

When I look at it today, the obvious answer is security. Someone is going to generate a page with:

print("You wrote <xmp>" +
      userString +
      "</xmp>")

And someone will give the string "</xmp><script>...". That’s an XSS vulnerability.
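
As a minimal sketch of that failure mode (the function names `render_unsafe` and `render_escaped` and the `payload` string are made up for illustration, not anything from a real site), here is the naive version next to one that escapes the input and uses <pre> instead:

```python
from html import escape

def render_unsafe(user_string: str) -> str:
    # Hypothetical page generator: user input goes inside <xmp> unmodified.
    return "You wrote <xmp>" + user_string + "</xmp>"

def render_escaped(user_string: str) -> str:
    # Hypothetical safer variant: escape the input and wrap it in <pre>,
    # where the browser decodes entities back into visible characters.
    return "You wrote <pre>" + escape(user_string) + "</pre>"

payload = "</xmp><script>alert(1)</script>"

# The payload closes the <xmp> early, so the <script> tag is live markup.
print(render_unsafe(payload))

# Here the same payload is just text: no literal <script> tag remains.
print(render_escaped(payload))
```

Escaping inside the <xmp> itself would not help, since no markup (including entities) is recognized there, which is part of why it “may be mistreated as a security feature”.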

But when this was deprecated (1995 or before) XSS wasn’t possible: JS only came out in December 1995. XSS didn’t become something people were really thinking about until more like 2000. [1]

Instead, it was an SGML issue:

An earlier HTML specification included an XMP element whose syntax is not expressible in SGML. Inside the XMP , no markup was recognized except the </XMP> end tag. While implementations are encouraged to support this idiom, its use is obsolete.

While the original HTML spec was implementable and relatively simple, a lot of technical people didn’t like that it was written somewhat informally. Berners-Lee designed it to be easy to write, even when that made it inelegant for computers to process. The W3C tried to get HTML to be a kind of SGML, and later to replace it with XML, but none of this worked: there was no gain to authors in switching their sites to some new HTML-like language. Eventually the browser vendors got together to make HTML5, which didn’t worry about ‘reforming’ HTML and just resolved disagreements between implementations and specified it as is.

At which point, why not restore <xmp>? This was proposed for HTML5, and the editor (Ian Hickson) summarized the discussion as:

Pros: Experienced authors who are writing specs, HTML tutorials, programming language blogs, or other pages containing snippets of code that can be expected to contain < and & characters get to save the time of escaping their <s and &s.
Cons: Complicates the language, introduces yet another polyglot difference, may be mistreated as a security feature, a pain to use if you have to later add markup inside the block (e.g. to highlight a section), doesn’t support characters outside the character encoding of the page (as it can’t get entities).

Before deciding to reject the proposal:

I’m going to say no on this, mostly driven by the simplicity argument. It’s a tough call, though. There’s some good arguments on both sides.

The spec says it is “entirely obsolete, and must not be used by authors”, but, being pragmatic, it does require browsers to maintain their current behavior.

If this came up now I expect the main objection would be security and not simplicity, but regardless I don’t expect this to be reopened.

On the other hand, what I write here already runs through a script before anyone sees it. Let’s make it convert <xmp> to <pre> while escaping the content:

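import re
from html import escape

def replace_xmp(s):
  # Swap each <xmp>...</xmp> for a <pre> with its contents HTML-escaped.
  return re.sub(
    r'<xmp>(.*?)</xmp>',
    lambda m: f"<pre>{escape(m.group(1))}</pre>",
    s,
    flags=re.DOTALL,  # let . match newlines so multi-line blocks work
  )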

Note that I couldn’t do this with etree because that would require the HTML inside the <xmp> to be well formed. Since modifying the parser would be a pain, it’s a preprocessing step before I parse the HTML.
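
To illustrate why the order matters, here is a rough sketch that assumes the standard library’s `xml.etree.ElementTree` stands in for whatever etree the real pipeline uses. The `raw` input and the `parsed_raw` flag are invented for the example:

```python
import re
import xml.etree.ElementTree as ET
from html import escape

def replace_xmp(s):
    # Same regex-based preprocessing: <xmp> becomes <pre>, contents escaped.
    return re.sub(
        r"<xmp>(.*?)</xmp>",
        lambda m: f"<pre>{escape(m.group(1))}</pre>",
        s,
        flags=re.DOTALL,
    )

# Hypothetical input: the <li> tags inside <xmp> are never closed.
raw = "<div><xmp><li>2 eggs\n<li>2/3 C oil</xmp></div>"

# Parsing first fails, because the parser sees mismatched tags.
try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

# Preprocessing first works: the <li> tags are escaped text by parse time.
root = ET.fromstring(replace_xmp(raw))
print(parsed_raw)
print(root.find("pre").text)
```

After the conversion the parser only ever sees a <pre> containing text, so the unclosed tags in the example never reach it as markup.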

And now here’s that code block again, this time internally implemented with <xmp> and converted to <pre> before it leaves my server:

<li>2 eggs (or 2T flax and 5T water)
<li>2/3 C oil
<li>1C greek yoghurt
<li>1/4 cup milk, more if needed

Now as long as I’m not talking about <xmp> I can avoid a lot of angle brackets in my example code!

[1] If you read through the 2000 CERT Advisories you’ll see “CA-2000-02: Malicious HTML Tags Embedded in Client Web Requests” from February 2000 which is the first public description of the attack I can find.

