How not to break a site – sirre.al

Summary

Main Summary

The article examines the unintuitive parsing rules of HTML <script> tags, which can cause unexpected failures in web pages, especially when embedding JSON data. Although common practice is to escape </script> as <\/script> inside JSON strings (for example, with json_encode()), the analysis shows that this is often not enough. The underlying problem is a parsing state called the "script data double escaped state", in which the browser can get "stuck" and consume all subsequent content, including legitimate HTML, as part of the script. This situation, which can lead to a blank page in production, stems from tag-name matching rules and from the way browsers handle certain patterns such as <!--. The recommended solution, standardized by HTML and available in tools such as PHP and WordPress, is to replace the < character with \x3C or \u003C in JSON strings, which makes the JSON output truly safe inside <script> tags. These complex rules are rooted in the historical evolution of the web and the need for backward compatibility.

Key Points

  • Special parsing rules for <script>: <script> tags have unique parsing behavior that sets them apart from the rest of HTML. The browser accepts practically any content until it finds the closing tag </script>. This allows languages such as JavaScript to be embedded directly, but it also means that a </script> string inside the script itself can close the element prematurely and break the document structure.
  • Basic JSON escaping is not enough: Although json_encode() escapes </script> as <\/script>, this is not always sufficient. The article shows that even with this escaping, a JSON string containing <!-- followed by <script> (in the example, "openComment": "<!--") can trigger a complex parsing state, the "script data double escaped state", so that the parser never leaves the script element and consumes all the HTML that follows.
  • The "script data double escaped state": This is the critical state in which the HTML parser gets stuck. Escaping the slash of </script> with a backslash (\) inside the JSON string does not help, because the real closing tag after the JSON no longer ends the element. <!-- moves the parser into the script data escaped state and a following <script moves it into the double escaped state, where </script only returns the parser to the escaped state instead of closing the element. The HTML that follows is then interpreted as part of the script content.
  • Recommended escaping for JSON: The safest way to embed JSON in <script> tags is to always replace the < character with its escaped equivalent \x3C or \u003C inside JSON strings. This practice is recommended by the HTML standard and is available in PHP through json_encode($data, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES) and in WordPress through wp_json_encode with the same flags, which ensures that no potential tag opener is misinterpreted.

Analysis and Implications

A deep understanding of these rules of

Content

<script> tags follow unintuitive parsing rules that can break a webpage in surprising ways. Fortunately, it’s relatively straightforward to escape JSON for script tags.

Just do this

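  • Replace < with \x3C or \u003C in JSON strings. (\u003C is the form that is still valid JSON; \x3C is only understood when the result is read as JavaScript.)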
  • In PHP, use json_encode($data, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES) for safe JSON in <script> tags.
  • In WordPress, use wp_json_encode with the same flags.
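Outside PHP, the first bullet can be sketched in a few lines. This is a Python illustration rather than the post's own code, and the helper name json_for_script is hypothetical. It assumes the result is printed inside a <script> element and relies on the fact that "<" can only occur inside string literals in JSON output.

```python
import json


def json_for_script(data):
    """Serialize data as JSON with every "<" written as \\u003C.

    Hypothetical helper: in JSON output "<" can only occur inside string
    literals, so a plain text replacement keeps the document valid JSON.
    """
    return json.dumps(data).replace("<", "\\u003C")


payload = {"openComment": "<!--", "closeScript": "</script>"}
encoded = json_for_script(payload)
print(encoded)
# The escaped text still decodes to the original values.
assert json.loads(encoded) == payload
```

Python's json.dumps leaves "/" unescaped, so the sketch roughly corresponds to the "<" half of JSON_HEX_TAG combined with JSON_UNESCAPED_SLASHES.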

You don’t have to take my word for it. The HTML standard recommends this type of escaping:

The easiest and safest … is to always escape an ASCII case-insensitive match for “<!--” as “\x3C!--”, “<script” as “\x3Cscript”, and “</script” as “\x3C/script”…

This post will dive deep into the exotic script tag parsing rules in order to understand how they work and why this is the appropriate way to escape JSON.

What’s so gnarly about a script tag?

Script tags are used to embed other languages in HTML. The most common example is JavaScript:

This is great, JavaScript can be embedded directly. Imagine if script tags required HTML escaping:

In fact, script tags can contain any language (not necessarily JavaScript) or even arbitrary data. In order to support this behavior, script tags have special parsing rules. For the most part, the browser accepts whatever is inside the script tag until it finds the script close tag </script>.[1]

So, what happens when we embed this perfectly valid JavaScript that contains a script close tag?

Oops! We can see that </script> was part of a JavaScript string, but the browser is just parsing the HTML. This script element closes prematurely, resulting in the following tree:

├─SCRIPT
│ └─#text console.log('
└─#text ')

Ok, let’s use json_encode() and we should be all set:

Now we’ve got this HTML:

</script> has become <\/script>. The JavaScript string value is preserved and the script element does not close prematurely. Perfect, right?
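PHP's json_encode() escapes "/" as "\/" by default, which is what turns </script> into <\/script> here. Python's json.dumps does not do this, so the sketch below adds the replacement by hand, purely to show that the escaped form decodes back to the same string. The helper name is hypothetical.

```python
import json


def encode_with_escaped_slashes(value):
    # Mimics the default "/" -> "\/" escaping described above.
    # "/" never appears in JSON syntax outside string literals.
    return json.dumps(value).replace("/", "\\/")


encoded = encode_with_escaped_slashes("</script>")
print(encoded)  # "<\/script>"
assert json.loads(encoded) == "</script>"
```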

Not so fast, things are about to get messy

Let’s expand with a more complex example. Here’s some data used by an imaginary HTML library. We’ll escape the JSON again with json_encode[2]:

Our HTML page includes the following, with a safely escaped script close tag:

Lovely. We’re good at this. Let’s just ship that 🚀


🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥


Great. Production is now a blank page and we need to write a post-mortem. What happened? The HTML looks just fine. Let’s inspect the document tree:

└─SCRIPT
  └─#text {␊
              "closeComment": "-->",␊
              "closeScript": "<\/script>",␊
              "openComment": "<!--",␊
              "openScript": "<script>"␊
          }</script>␊
          <h1>Success! 🎉</h1>

The script tag did not close as expected at </script>. The script close tag and all of the subsequent HTML are part of the script tag contents.

Wait, what???

We’ve just discovered some of those unintuitive parsing rules. In short, the HTML parser entered the script data double escaped state and got stuck. Yes, this does break real pages.

If you’re not steeped in HTML arcana, fear not, this handy chart should clarify things 🙃

This is a real and mostly accurate diagram of how script tag tokenization works. I’ve taken some liberties with things like end-of-file tokens and null bytes that aren’t relevant to the discussion.

You may be wondering, like I did, why HTML would work like this. Well, the web wasn’t always the mature platform we know and love today:

When JavaScript was first introduced, many browsers did not support it. So they would render the content of the script tag – the JavaScript code itself. The normal way to get around that was to put the script into a comment — things like

This kind of practice was commonplace on the web. As the web evolved, browsers continued to support the behavior so they wouldn’t break existing pages. Then, HTML5 came along and standardized the behavior so folks knew what to expect, even if it’s surprising. We can see other remnants of this practice in the HTML scripting specification:

for related historical reasons, the string “<!--” in classic scripts is actually treated as a line comment start, just like “//”.

Back to our script data double escaped state. We can simplify the diagram above to collapse some states and focus on the interesting transitions:

This diagram labels some transitions <script and </script. That is accurate, but the tag name only matches when the name script is followed by a character that terminates a tag name: Space, Tab, “/”, “>”, or a newline (\n, \f, \r). For example, <script-o-rama and </scripty do not trigger a transition.
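The matching rule can be written down as a small predicate. The following is a Python sketch with hypothetical names, covering only the terminator characters listed above, not the browser's actual implementation.

```python
TAG_NAME_TERMINATORS = "\t\n\f\r />"


def starts_script_tag(text, pos=0):
    """Return True if text[pos:] begins with "<script" or "</script"
    followed by a character that ends a tag name."""
    rest = text[pos:]
    for opener in ("</script", "<script"):
        candidate = rest[: len(opener) + 1]
        # Require the opener plus exactly one following character.
        if len(candidate) == len(opener) + 1 and candidate[:-1].lower() == opener:
            return candidate[-1] in TAG_NAME_TERMINATORS
    return False


print(starts_script_tag("<script>"))        # True
print(starts_script_tag("<script-o-rama"))  # False
print(starts_script_tag("</scripty"))       # False
```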

To understand the problem with our example above, locate the three transitions for </script:

  • script data → close
  • script data escaped → close
  • ‼️ script data double escaped → script data escaped ‼️

</script> does not close a script element from the script data double escaped state.
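To make these transitions concrete, here is a deliberately simplified Python model of the three states. It is a sketch under assumptions, not a full HTML tokenizer: the intermediate dash and less-than states are collapsed, end-of-file and null handling are ignored, and all names are hypothetical.

```python
DATA = "script data"
ESCAPED = "script data escaped"
DOUBLE_ESCAPED = "script data double escaped"
TERMINATORS = "\t\n\f\r />"


def _tag_at(text, i, opener):
    """True if text[i:] starts with opener (case-insensitive) plus a terminator."""
    end = i + len(opener)
    return end < len(text) and text[i:end].lower() == opener and text[end] in TERMINATORS


def find_script_close(text):
    """Return the index where the script element closes, or None if it never does.

    text is everything that follows the opening <script> tag.
    """
    state, i = DATA, 0
    while i < len(text):
        if state == DATA and text.startswith("<!--", i):
            # Step over "<!" only, so the dashes can still form "-->".
            state, i = ESCAPED, i + 2
        elif state != DATA and text.startswith("-->", i):
            state, i = DATA, i + 3
        elif _tag_at(text, i, "</script"):
            if state == DOUBLE_ESCAPED:
                # Does not close the element, only drops back one level.
                state, i = ESCAPED, i + len("</script")
            else:
                return i
        elif state == ESCAPED and _tag_at(text, i, "<script"):
            state, i = DOUBLE_ESCAPED, i + len("<script")
        else:
            i += 1
    return None


tail = "</script>\n<h1>Success!</h1>"
broken = (
    '{"closeComment": "-->", "closeScript": "<\\/script>", '
    '"openComment": "<!--", "openScript": "<script>"}'
)
fixed = broken.replace("<", "\\u003C")

print(find_script_close(broken + tail))               # None: the element never closes
print(find_script_close(fixed + tail) == len(fixed))  # True: closes right after the JSON
```

In the broken input, "<!--" enters the escaped state, "<script>" enters the double escaped state, and the real </script> only steps back to the escaped state, so the heading is swallowed.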

I encourage you to pause for a moment and play with this example to get a feel for how the script tag escaped states work.

Avoid the double escaped state

The complexity of script tag parsing and escaping comes from the escaped states. Avoid the script data double escaped state and script tags become simple. Everything until the tag closer </script> is inside the script element.

How can we avoid the double escaped state? Script tag parsing always starts in the script data state and there’s a pattern in its transitions:

  • </script: script data → close
  • <!--: script data → script data escaped

Both require “<” as their first character![3] Everything will be handled predictably if < never appears inside of the script tag. Remember what the HTML standard on scripting said? It recommends escaping < in specific places:

[Escape] “<!--” as “\x3C!--”, “<script” as “\x3Cscript”, and “</script” as “\x3C/script” [in literals.]

PHP has the JSON_HEX_TAG flag that will escape all < as \u003C and all > as \u003E. This will escape much more than is strictly necessary, but it’s sufficient and is provided by the language. Perfect!
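For comparison, the narrower escaping quoted from the standard touches only the three risky sequences. The sketch below is a Python illustration with hypothetical names. It assumes its input is already-serialized JSON text, and it writes \u003C instead of the standard's \x3C so the result stays valid JSON.

```python
import json
import re

# "<" only when it begins "<!--", "<script" or "</script" (case-insensitive).
RISKY_OPENERS = re.compile(r"<(?=!--|/?script)", re.IGNORECASE)


def escape_risky_openers(json_text):
    """Rewrite only the "<" that begins one of the three risky sequences."""
    return RISKY_OPENERS.sub(lambda match: "\\u003C", json_text)


raw = json.dumps({"math": "a < b", "openComment": "<!--", "close": "</SCRIPT>"})
escaped = escape_risky_openers(raw)
print(escaped)
# Decoding gives back the same data; "a < b" was left untouched.
assert json.loads(escaped) == json.loads(raw)
```

Escaping every "<", as JSON_HEX_TAG does, is simpler to reason about. The targeted version mainly shows how small the set of dangerous sequences is.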

How to escape JSON in PHP

For JSON that will be printed in a script tag, use the following flags: JSON_HEX_TAG | JSON_UNESCAPED_SLASHES.

If everything is UTF-8 (both the data and the charset of the page) you can add these flags for cleaner and shorter JSON:

JSON_UNESCAPED_LINE_TERMINATORS is a fun one. Before ES2019, JavaScript string literals could not contain two characters, U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR), unescaped, although JSON strings allow them. Some valid JSON was therefore invalid JavaScript. Since the proposal to make JavaScript a superset of JSON landed in ES2019, that’s no longer the case and those characters no longer require escaping. Phew! Browser support today is very good.
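The same trade-off exists in other JSON encoders. As a loose Python analogy (ensure_ascii is an option of Python's json module, not a PHP flag), the two forms of U+2028 look like this:

```python
import json

text = "line\u2028separator"

escaped = json.dumps(text)                         # non-ASCII written as \uXXXX
unescaped = json.dumps(text, ensure_ascii=False)   # U+2028 left as a raw character

assert "\\u2028" in escaped
assert "\u2028" in unescaped
# Both forms are valid JSON and decode to the same string.
assert json.loads(escaped) == json.loads(unescaped) == text
```

Only the raw form depends on ES2019 behavior when the JSON is read as JavaScript source. The escaped form is accepted by older engines as well.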

JSON escaping in action

Here’s the problematic example again, now with the recommended flags:

Let’s see the printed HTML and its resulting tree:

├─SCRIPT
│ └─#text {␊
│             "closeComment": "--\u003E",␊
│             "closeScript": "\u003C/script\u003E",␊
│             "openComment": "\u003C!--",␊
│             "openScript": "\u003Cscript\u003E"␊
│         }
├─#text ␊ 
└─H1
  └─#text Success! 🎉

“Success! 🎉” is displayed and the tree structure is exactly what we expected.

What about JavaScript?

The problems with JSON seem to be solved. But what about JavaScript source text? Or what if we decide to embed XML, Python, or Haskell in a script tag? All of those are permitted but bring different challenges.

Given what we learned here, see if you can find a general solution for escaping JavaScript safely. Remember that the script data double escaped state is dangerous and should be avoided. We also can’t allow the script tag to close prematurely with </script>. The path from our entry state to double-escaped looks like this:

  • Script data state: “<!--” transition to
  • Script data escaped state: “<script>” transition to
  • Script data double escaped state: ‼️
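Without giving away a solution to the exercise, a first step is simply to locate the sequences involved in that path. The following Python sketch (names hypothetical) reports where they occur in arbitrary script text. How to rewrite each occurrence depends on the embedded language, so the sketch only detects and does not escape.

```python
import re

# "<!--", or "<script" / "</script" followed by a tag-name terminator.
RISKY = re.compile(r"<!--|</?script(?=[\t\n\f\r />])", re.IGNORECASE)


def find_risky_sequences(source):
    """Return (offset, matched text) for each sequence that can change
    the script tokenizer state or close the element."""
    return [(m.start(), m.group(0)) for m in RISKY.finditer(source)]


print(find_risky_sequences("if (a<b) x = '</script>';"))
print(find_risky_sequences("<!-- <Script> -->"))
```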

The diagrams in this post were generated with Mermaid and Graphviz. Their source is available in this gist. Thanks to Dennis Snell for an improved version of the reduced state graph.

  1. It’s easiest to talk about </script> as the script close tag. Technically, it’s not strictly </script>, but a sequence of characters that looks like a script tag closer. For example, </SCRIPT/> closes a script element but </script-no> does not.
  2. Several examples include JSON_PRETTY_PRINT in the output for legibility. This flag is omitted from the example code.
  3. Script data transitions to script data less-than sign state when the < character is encountered. That is the only transition from the script data state.