I Want XMP But I Know Why I Can’t Have It

When writing text that will be displayed to users as HTML I write
HTML. This seems like it would be the normal way to do it, though
it’s unusual these days. Either you draft in a fancy content editor
or you write Markdown. HTML authoring is a bit retrogrouch but it
suits me. Except when I need to write about HTML itself.
In my previous post I was talking
about how I write recipes in HTML:
<li>2 eggs (or 2T flax and 5T water)
<li>2/3 C oil
<li>1C greek yoghurt
<li>1/4 cup milk, more if needed
Except in the blog post itself the angle brackets needed to be
escaped, so on my screen it looked like:
<pre>
&lt;li>2 eggs (or 2T flax and 5T water)
&lt;li>2/3 C oil
&lt;li>1C greek yoghurt
&lt;li>1/4 cup milk, more if needed
</pre>
This is pretty painful. What if I could just write HTML, but mark it
off as an example of how to write HTML so the browser would ignore it?
This was exactly the problem Berners-Lee
had when initially documenting HTML, and if you look at early
pages you’ll see he introduced an <xmp> tag:
The title of a document is given between title tags:
<XMP><TITLE> … </TITLE></XMP>
This is great! Can I just write:
<xmp>
<li>2 eggs (or 2T flax and 5T water)
<li>2/3 C oil
<li>1C greek yoghurt
<li>1/4 cup milk, more if needed
</xmp>
Let’s go look up the documentation first:
Oh dear. Deprecated since at least 1995.
Now, it does still work when I try it in Chrome and Firefox (would
they ever break the original HTML pages?) but I’m not going to use it
here in case it doesn’t work in RSS readers. But why would they take
away this great feature?
When I look at it today, the obvious answer is security. Someone is
going to generate a page with:
print("You wrote <xmp>" +
      userString +
      "</xmp>")
And someone will give the string "</xmp><script>...".
That’s an XSS vulnerability.
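To make the failure concrete, here is a minimal sketch in Python. The function names `render_unsafe` and `render_escaped` and the `attack` string are hypothetical, made up for illustration; the point is only that `<xmp>` cannot contain its own end tag, while escaped text can contain anything.

```python
from html import escape

def render_unsafe(user_string: str) -> str:
    # Mirrors the pattern above: relies on <xmp> to neutralize the input.
    return "You wrote <xmp>" + user_string + "</xmp>"

def render_escaped(user_string: str) -> str:
    # Escaping turns "<" into "&lt;", so the input can never close the block.
    return "You wrote <pre>" + escape(user_string) + "</pre>"

# Hypothetical attacker input: close the <xmp> early, then add a script.
attack = "</xmp><script>alert(1)</script>"

print(render_unsafe(attack))   # the <script> tag survives as live markup
print(render_escaped(attack))  # the same input is rendered as inert text
```

This is why treating `<xmp>` as a security feature does not work: the one sequence it still recognizes is exactly the one an attacker would send.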
But when this was deprecated (1995 or before) XSS wasn’t possible: JS
only came out in December 1995. XSS didn’t become something people
were really thinking about until more like 2000. [1]
Instead, it was an SGML issue:
An earlier HTML specification included an XMP element whose syntax is
not expressible in SGML. Inside the XMP, no markup was recognized
except the </XMP> end tag. While implementations are encouraged to
support this idiom, its use is obsolete.
While the original HTML spec was implementable and relatively simple,
a lot of technical people didn’t like that it was written somewhat
informally. Berners-Lee designed it to be easy to write, even when
that made it inelegant for computers to process. The W3C tried to get
HTML to be a kind of SGML, and later to replace it with XML, but none
of this worked: there was no gain to authors in switching their
sites to some new HTML-like language. Eventually the browser
vendors got together to make HTML5, which didn’t
worry about ‘reforming’ HTML and just resolved disagreements between
implementations and specified it as is.
At which point, why not restore <xmp>? This was proposed for HTML5,
and the editor (Ian Hickson) summarized the discussion as:
Pros: Experienced authors who are writing specs, HTML tutorials,
programming language blogs, or other pages containing snippets of code
that can be expected to contain < and & characters get to save the
time of escaping their <s and &s.
Cons: Complicates the language, introduces yet another polyglot
difference, may be mistreated as a security feature, a pain to use if
you have to later add markup inside the block (e.g. to highlight a
section), doesn’t support characters outside the character encoding of
the page (as it can’t get entities).
Before deciding to reject the proposal:
I’m going to say no on this, mostly driven by the simplicity
argument. It’s a tough call, though. There’s some good arguments on
both sides.
The spec says it is “entirely obsolete, and must not be used by
authors”, but being pragmatic they did require browsers maintain their
current behavior.
If this came up now I expect the main objection would be security and
not simplicity, but regardless I don’t expect this to be reopened.
On the other hand, what I write here already runs through a script
before anyone sees it. Let’s make it convert <xmp> to
<pre> while escaping the content:
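import re
from html import escape

def replace_xmp(s):
    # Rewrite each <xmp>...</xmp> block as <pre>...</pre> with its
    # contents HTML-escaped. The non-greedy match stops at the first
    # </xmp>, and DOTALL lets "." match across newlines.
    return re.sub(
        r'<xmp>(.*?)</xmp>',
        lambda m: f"<pre>{escape(m.group(1))}</pre>",
        s,
        flags=re.DOTALL
    )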
Note that I couldn’t do this with etree because that would require the
HTML inside the <xmp> to be well formed. Since modifying the parser
would be a pain, it’s a preprocessing step before I parse the HTML.
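Here is a small sketch of why the ordering matters. It assumes the standard library’s `xml.etree.ElementTree` as a stand-in for whatever etree the real pipeline uses, and the `post` string and `parses_as_xml` helper are made up for illustration. The unclosed `<li>` tags inside `<xmp>` make a strict parser give up, while the preprocessed version parses.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

def replace_xmp(s):
    # Same preprocessing step as above: <xmp> becomes an escaped <pre>.
    return re.sub(
        r'<xmp>(.*?)</xmp>',
        lambda m: f"<pre>{escape(m.group(1))}</pre>",
        s,
        flags=re.DOTALL,
    )

def parses_as_xml(fragment):
    # Hypothetical helper: wrap the fragment in one root element and
    # report whether a strict XML parser accepts it.
    try:
        ET.fromstring(f"<div>{fragment}</div>")
        return True
    except ET.ParseError:
        return False

# Made-up post body with unclosed <li> tags inside the example block.
post = "<p>Recipe:</p><xmp><li>2 eggs\n<li>2/3 C oil</xmp>"
converted = replace_xmp(post)

print(parses_as_xml(post))       # False: </xmp> arrives while <li> is open
print(parses_as_xml(converted))  # True: the example is now plain text
print(converted)
```

Note that `html.escape` also escapes `>` and quotes, which is more than strictly needed inside `<pre>` but harmless.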
And now here’s that code block again, this time internally implemented
with <xmp> and converted to
<pre> before it leaves my server:
<li>2 eggs (or 2T flax and 5T water)
<li>2/3 C oil
<li>1C greek yoghurt
<li>1/4 cup milk, more if needed
Now as long as I’m not talking about <xmp> I can
avoid escaping a lot of angle brackets in my example code!
[1] If you read through the 2000 CERT
Advisories you’ll see “CA-2000-02: Malicious HTML Tags Embedded in
Client Web Requests” from February 2000 which is the first public
description of the attack I can find.