Sentry Answers>HTML>

Regex to match open HTML tags except for self-contained tags

Regex to match open HTML tags except for self-contained tags

Matthew C.

The Problem

You want a regular expression (regex) to match opening HTML tags such as <div>, <form id="myForm">, and <h1>. The regex should not match self-contained (self-closing) tags such as <img />, <br />, and <input />.

The Solution

Self-closing tags do not exist in HTML. HTML elements that can’t have any child nodes are void elements. These elements don’t have a closing tag. Self-closing tags, which contain a trailing slash character (”/”) before the closing angle bracket, are required for XML, XHTML, and SVG void elements. Some code formatters add a trailing slash to the start tag of an HTML void element to make them XHTML compatible and to improve readability. Self-closing tags can be used when writing HTML code since the trailing slash character is ignored by HTML parsers. These days HTML is used far more than XHTML: it’s the most used markup language for websites.

Various regexes can be used to match open HTML tags and not self-contained tags. For example:

Click to Copy
<([a-z]+)(?![^>]*\/>)[^>]*>

This regex does the following:

  • <: Match the opening angle bracket of an HTML tag.
  • ([a-z]+): Match one or more lowercase alphabetical characters.
  • (?![^>]*\/>): Negative lookahead that prevents matching closing tags. If there are zero or more characters other than ”>” followed by a ”/>” then the regex won’t match.
  • [^>]*>: The regex will match if the string ends in zero or more characters other than ”>” followed by a ”>” character.

Using a regex to find HTML tags is not ideal as it can lead to incorrect matches. For example, if you use the above regex for the following HTML string:

Click to Copy
<script> const myString = "<script></script>"; </script> <div class="container"> <!-- <img src="cat.jpg" alt="big cat" > --> </div>

The regex will match the <script> and <div> HTML opening tags. However, it will also match two opening tags that are not actual DOM tags: the <script> tag string in the myString variable and the <img> tag in the HTML comment.

A better approach is to use an HTML parser library such as Cheerio.

  • Syntax.fmListen to the Syntax Podcast
  • ResourcesWhat is Distributed Tracing
  • Syntax.fm logo
    Listen to the Syntax Podcast

    Tasty treats for web developers brought to you by Sentry. Get tips and tricks from Wes Bos and Scott Tolinski.

    SEE EPISODES

Considered “not bad” by 4 million developers and more than 100,000 organizations worldwide, Sentry provides code-level observability to many of the world’s best-known companies like Disney, Peloton, Cloudflare, Eventbrite, Slack, Supercell, and Rockstar Games. Each month we process billions of exceptions from the most popular products on the internet.

© 2024 • Sentry is a registered Trademark of Functional Software, Inc.