Character Information Tool

Analyze any character to see its Unicode code point, UTF-8/UTF-16 encodings, HTML entities, CSS and JavaScript escapes, and more.

Understanding Unicode and Character Encodings

What Is Unicode and Why Does It Exist?

Unicode is a universal character encoding standard that assigns a unique number, called a code point, to every character used in human writing systems across the world. Before Unicode, dozens of competing encoding standards existed, each supporting only a limited set of languages and character sets. ASCII, one of the earliest and most widely used standards, could only represent 128 characters, covering basic English letters, digits, and punctuation. Extended ASCII variants added another 128 characters for specific languages, but these were incompatible with each other, leading to widespread problems when exchanging text across different systems, countries, and platforms. A document written in one encoding would display as garbled nonsense on a system using a different encoding, a problem known as mojibake.
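Mojibake is easy to reproduce. The following minimal sketch encodes a word as UTF-8 and then deliberately decodes the bytes with the wrong encoding (Latin-1 is chosen here only as an illustrative example of a legacy encoding):

```python
# Encode text as UTF-8, then (wrongly) decode the same bytes as Latin-1.
original = "caf\u00e9"                   # "café"
utf8_bytes = original.encode("utf-8")    # é becomes the two bytes C3 A9
garbled = utf8_bytes.decode("latin-1")   # each byte is read as its own character
print(garbled)                           # cafÃ©

# In this particular case the damage is reversible, because Latin-1 maps
# every byte value to exactly one character.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)              # True
```

The two-character sequence "Ã©" in place of "é" is a typical sign that UTF-8 bytes were interpreted as a single-byte encoding.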

Unicode was created to solve this fragmentation by providing a single, comprehensive standard that could represent every character in every writing system. As of Unicode 15.0, the standard defined over 149,000 characters covering 161 modern and historic scripts, as well as symbols, emoji, and control characters, and later versions have added more. Each character is assigned a unique code point written as U+ followed by four to six hexadecimal digits, in the range U+0000 to U+10FFFF. For example, the Latin letter "A" is U+0041, the Greek letter pi is U+03C0, the snowman symbol is U+2603, and the grinning face emoji is U+1F600. The Unicode standard is maintained by the Unicode Consortium and is updated regularly to include new scripts, characters, and emoji.
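The mapping between characters and code points can be sketched in a few lines of Python. The helper names `format_code_point` and `parse_code_point` are hypothetical, chosen for this illustration; only the built-ins `ord`, `chr`, and `int` are doing the real work:

```python
def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # At least four hex digits; longer code points simply use more.
    return f"U+{ord(ch):04X}"


def parse_code_point(text: str) -> str:
    """Turn 'U+0041', '0x41', or bare hex such as '0041' into a character."""
    cleaned = text.strip().upper()
    for prefix in ("U+", "0X"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    return chr(int(cleaned, 16))


print(format_code_point("A"))           # U+0041
print(format_code_point("\u03c0"))      # U+03C0 (Greek small letter pi)
print(format_code_point("\u2603"))      # U+2603 (snowman)
print(parse_code_point("U+0041"))       # A
```

This sketch does no validation beyond what `chr` and `int` enforce, so malformed input raises `ValueError`.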

UTF-8: The Dominant Encoding on the Web

While Unicode defines the abstract mapping from characters to code points, an encoding scheme is needed to represent those code points as actual bytes in computer memory and files. UTF-8 (Unicode Transformation Format, 8-bit) is by far the most widely used encoding on the internet and in modern software; web surveys put its share at roughly 98% of all web pages. UTF-8 is a variable-length encoding that uses one to four bytes per code point. ASCII characters (U+0000 to U+007F) are encoded as a single byte with the same value, so any valid ASCII file is also valid UTF-8. Code points from U+0080 to U+07FF take two bytes, U+0800 to U+FFFF take three, and U+10000 to U+10FFFF take four. This variable-length design makes UTF-8 very space-efficient for text that is primarily in Latin-based scripts while still supporting every Unicode character.
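To make the variable-length layout concrete, here is a hand-written encoder sketch. The function name `utf8_encode` is hypothetical, and in real code Python's built-in `str.encode("utf-8")` should be used instead; the sketch only illustrates how the bits of a code point are spread across one to four bytes:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative, not optimized)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# A, e with acute accent, euro sign, grinning face emoji
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    # The hand-written result should agree with the built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", len(encoded), encoded.hex(" ").upper())
```

The four sample code points produce 1, 2, 3, and 4 bytes respectively, one from each of the length classes.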

UTF-16, another common encoding, uses two or four bytes per code point. It is the in-memory string representation used by JavaScript, Java, and the Windows API. Code points in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) are encoded as a single 16-bit code unit, while code points outside this range (such as many emoji) are encoded using a pair of 16-bit values called a surrogate pair, taken from the reserved range U+D800 to U+DFFF. This is why, in JavaScript, some emoji and rare characters have a string length of 2 even though they represent a single visual character: the length property counts 16-bit code units, not code points. Understanding these encoding details is essential for developers working with international text, file processing, and data exchange between systems.
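The surrogate-pair arithmetic can be sketched as follows. The helper names `surrogate_pair` and `utf16_code_units` are hypothetical; the second one relies on Python's built-in `utf-16-be` codec to cross-check the manual calculation:

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text, using the built-in codec."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


emoji = "\U0001F600"                        # grinning face, U+1F600
print([f"{u:04X}" for u in utf16_code_units(emoji)])   # ['D83D', 'DE00']
print(len(utf16_code_units(emoji)))         # 2, matching JavaScript's .length
print(len(emoji))                           # 1, Python counts code points
```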

HTML Entities and Character References

HTML entities are a mechanism for representing characters in HTML documents without typing them directly. There are two forms: named entities and numeric references. Named entities use a human-readable label, such as &amp; for the ampersand (&), &lt; for the less-than sign (<), and &copy; for the copyright symbol. Numeric character references use the Unicode code point in either decimal (&#169;) or hexadecimal (&#xA9;) form. Escaping is necessary for characters that have special meaning in HTML syntax: the less-than sign, greater-than sign, and ampersand in text content, and additionally quotation marks inside quoted attribute values. Inserting untrusted text without escaping these characters can break the HTML structure or create security vulnerabilities such as cross-site scripting (XSS) attacks.
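A short sketch of both directions, using Python's standard `html` module. The helper `numeric_references` is a hypothetical name introduced for this example:

```python
import html


def numeric_references(ch: str) -> tuple:
    """Return the (decimal, hexadecimal) numeric references for a character."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


# Escape the characters that are special in HTML before inserting text.
unsafe = '<a href="x">Tom & Jerry</a>'
print(html.escape(unsafe))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;

# Named, decimal, and hexadecimal references all decode to the same character.
print(html.unescape("&copy; &#169; &#xA9;"))

print(numeric_references("\u00a9"))   # ('&#169;', '&#xA9;')
```

Note that `html.escape` also escapes quotation marks by default, which is what makes its output safe inside quoted attribute values.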

CSS and JavaScript Character Escapes

Both CSS and JavaScript provide their own character escape mechanisms for including Unicode characters in source code. In CSS, a character can be escaped with a backslash followed by one to six hexadecimal digits of its code point, such as \00A9 for the copyright symbol. If the escape has fewer than six digits and the next character is a hex digit or whitespace, a single space after the escape terminates it. This is particularly useful in CSS content properties, selectors, and identifiers where special characters need to be included. JavaScript uses the \uXXXX syntax, with exactly four hex digits, for code points in the Basic Multilingual Plane. Code points outside it can be written as a surrogate pair of two \uXXXX escapes or, since ES2015, in the \u{XXXXX} form, which accepts one to six hex digits and works for any code point. These escape sequences allow developers to include any Unicode character in their source code regardless of the file encoding, which is valuable for writing portable, internationalized applications.
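As a rough illustration, the escapes described above can be generated from a code point like this. The function names `css_escape` and `js_escape` are hypothetical, and the sketch deliberately ignores the CSS trailing-space rule, so callers would need to add a space when the next character is a hex digit:

```python
def css_escape(ch: str) -> str:
    """Backslash plus the hex code point, padded to at least four digits."""
    return f"\\{ord(ch):04X}"


def js_escape(ch: str) -> str:
    """\\uXXXX for BMP code points, \\u{...} for anything above U+FFFF."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return f"\\u{cp:04X}"
    return f"\\u{{{cp:X}}}"


print(css_escape("\u00a9"))        # \00A9
print(js_escape("\u00a9"))         # \u00A9
print(js_escape("\U0001F600"))     # \u{1F600}
```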

Unicode Categories and Blocks

Every Unicode character belongs to a general category that describes its type and function. The major categories include Letter (L), Mark (M), Number (N), Punctuation (P), Symbol (S), Separator (Z), and Other (C). Each major category has subcategories: for example, Lu for uppercase letters, Ll for lowercase letters, Nd for decimal digits, and Sc for currency symbols. These categories are used by programming languages for character classification functions, regular expressions, and text processing algorithms.
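In Python, for instance, the general category is exposed through the standard `unicodedata` module. The sample characters below are chosen only for illustration:

```python
import unicodedata

samples = {
    "A": "uppercase letter",
    "a": "lowercase letter",
    "5": "decimal digit",
    "$": "currency symbol",
    " ": "space separator",
    "\u0301": "combining acute accent",
}

for ch, description in samples.items():
    # category() returns the two-letter code, e.g. 'Lu' or 'Nd'.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), description)

# The first letter of the code is the major category.
print(unicodedata.category("A")[0])   # L
```

The values reflect the version of the Unicode Character Database bundled with the Python interpreter, so very recently added characters may be reported as unassigned (Cn) on older installations.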

Characters are also organized into blocks, which are contiguous ranges of code points allocated to specific scripts or purposes. For example, Basic Latin (U+0000 to U+007F) contains the ASCII character set, Greek and Coptic (U+0370 to U+03FF) contains Greek letters, and CJK Unified Ideographs (U+4E00 to U+9FFF) contains the most common Han ideographs used in Chinese, Japanese, and Korean. A block is not the same thing as a script: many scripts are spread across several blocks (Latin letters, for instance, continue well beyond Basic Latin), so a single range check rarely covers a whole writing system. With that caveat, understanding Unicode blocks helps developers work with specific writing systems and validate character ranges for input filtering and text processing applications.
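A block lookup is a simple range search. Python's standard library does not expose block names, so the sketch below uses a small hand-written table containing only the three blocks mentioned above; the `BLOCKS` table and the `block_of` function are hypothetical, and a complete table would come from the Blocks.txt file of the Unicode Character Database:

```python
from typing import Optional

# (first code point, last code point, block name); deliberately incomplete.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for ch, or None if it is not in the table."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(block_of("A"))          # Basic Latin
print(block_of("\u03c0"))     # Greek and Coptic
print(block_of("\u4e2d"))     # CJK Unified Ideographs
print(block_of("\u00e9"))     # None (Latin-1 Supplement is not in this table)
```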

Practical Applications of Character Analysis

Character analysis tools like this one serve many practical purposes in web development, software engineering, and content creation. Developers use character information to debug encoding issues, such as when invisible characters or zero-width spaces cause unexpected behavior in applications. Web developers need HTML entity references when writing content that includes special characters, mathematical notation, or multilingual text. Database administrators use character encoding knowledge to ensure proper storage and retrieval of international text data. Security professionals analyze characters to detect homograph attacks, where visually similar characters from different scripts are used to create deceptive URLs. Content creators and designers rely on Unicode information to find and use special characters, symbols, and typographic elements. Whether you are troubleshooting a mysterious rendering bug, crafting internationalized content, or simply exploring the vast landscape of Unicode characters, this tool provides all the technical details you need at a glance.
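Two of these debugging tasks can be sketched with the standard `unicodedata` module. Both function names are hypothetical, and the script check is only a rough heuristic based on the first word of each character's name, since Python's standard library does not expose the Unicode Script property:

```python
import unicodedata


def find_invisible(text: str) -> list:
    """List (index, code point, name) for format characters (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def script_hints(text: str) -> set:
    """Rough heuristic: collect the first word of each letter's name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


# A zero-width space hidden in the middle of a word.
print(find_invisible("pass\u200bword"))
# [(4, 'U+200B', 'ZERO WIDTH SPACE')]

# A Cyrillic "a" (U+0430) posing as a Latin "a".
print(sorted(script_hints("p\u0430ypal")))
# ['CYRILLIC', 'LATIN']
```

A result containing more than one script name is only a prompt for closer inspection; legitimate text can mix scripts, and production-grade homograph detection relies on the confusables data published by the Unicode Consortium.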