Unicode Checker breaks any string into individual characters and shows each one's Unicode codepoint (U+xxxx), official name, general category, script, UTF-8 byte sequence, and HTML entity. It highlights invisible characters (zero-width space, soft hyphen, directional marks), homoglyph lookalikes, and unexpected scripts — essential for security auditing and debugging encoding issues.
The Unicode Checker analyses text character by character, showing the Unicode code point (U+XXXX), UTF-8 byte sequence, general category (letter, digit, punctuation, symbol, etc.), script (Latin, Cyrillic, Arabic, etc.), and HTML entity for each character. It also flags invisible and potentially dangerous characters such as zero-width joiners, bidirectional control characters (including the right-to-left override), and the BOM (byte order mark, U+FEFF).
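The per-character breakdown described above can be approximated with Python's standard library. This is a minimal sketch, not the checker's actual implementation: the function name describe_text and the row layout are hypothetical, the HTML entity is always given in numeric form, and the script property is omitted because the standard unicodedata module does not expose it.

```python
import unicodedata


def describe_text(text):
    """Return one dict per character with its basic Unicode properties.

    Hypothetical helper for illustration; the script property is left out
    because the standard library does not provide it.
    """
    rows = []
    for ch in text:
        # surrogatepass lets a lone surrogate be encoded instead of raising.
        encoded = ch.encode("utf-8", errors="surrogatepass")
        rows.append({
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",
            # Some code points (e.g. control characters) have no name.
            "name": unicodedata.name(ch, "<unnamed>"),
            # Two-letter general category, e.g. "Lu", "Nd", "Cf".
            "category": unicodedata.category(ch),
            "utf8": " ".join(f"{byte:02X}" for byte in encoded),
            # Numeric character reference; named entities are not looked up.
            "html": f"&#x{ord(ch):X};",
        })
    return rows


if __name__ == "__main__":
    for row in describe_text("é\u200b"):
        print(row)
```

Each row carries the same fields the tool lists, so a zero-width space shows up as an explicit U+200B entry with category Cf even though it renders as nothing.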
Why check for invisible Unicode characters?
Invisible Unicode characters — zero-width spaces (U+200B), zero-width non-joiners (U+200C), soft hyphens (U+00AD), right-to-left overrides (U+202E) — are often copy-pasted from untrusted sources and can cause subtle bugs: password mismatches, URL spoofing, broken string comparisons, and hidden Trojan-source attacks in code. The checker highlights all suspicious code points in red.
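To illustrate how such characters can be detected, the sketch below flags every code point in the general category Cf (format), which covers all four examples above as well as the BOM. This is an assumed heuristic, not the checker's actual rule set, and the function name find_invisible is hypothetical. Cf also contains characters with legitimate uses, such as the zero-width joiner inside emoji sequences, so a hit is a prompt to look closer rather than proof of an attack.

```python
import unicodedata


def find_invisible(text):
    """Return (index, code point, name) for each format character in text.

    Hypothetical helper: treats general category "Cf" as suspicious, which is
    a heuristic and may flag legitimate characters.
    """
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append(
                (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            )
    return hits


if __name__ == "__main__":
    pasted = "pass\u200bword"
    # The two strings look identical on screen but do not compare equal.
    print(pasted == "password")
    print(find_invisible(pasted))
```

The comparison in the usage example shows the kind of password mismatch mentioned above: the pasted string contains one extra code point that no font displays.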
UTF-8 byte sequences
UTF-8 is a variable-length encoding: ASCII characters (U+0000–U+007F) use one byte, characters up to U+07FF use two bytes, characters up to U+FFFF use three bytes, and characters from U+10000 to U+10FFFF (most emoji, rare CJK extensions) use four bytes. Knowing the byte length is important when allocating storage (e.g. MySQL requires the utf8mb4 charset to store four-byte characters, because its older utf8 charset holds at most three bytes per character) and when calculating byte offsets in binary protocols.
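The byte-length ranges above translate directly into a small lookup. The following is a sketch with hypothetical function names (utf8_length, utf8_byte_count); it assumes valid code points in the range U+0000 to U+10FFFF and can be cross-checked against Python's own encoder.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for a single code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point!r}")
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # supplementary planes, e.g. most emoji


def utf8_byte_count(text):
    """Return the total UTF-8 size of a string, in bytes."""
    return sum(utf8_length(ord(ch)) for ch in text)


if __name__ == "__main__":
    sample = "A\u00e9\u20ac\U0001F600"  # A, é, €, 😀
    for ch in sample:
        print(f"U+{ord(ch):04X}", utf8_length(ord(ch)))
    # Cross-check the range table against the built-in encoder.
    print(utf8_byte_count(sample) == len(sample.encode("utf-8")))
```

A column sized by character count can therefore need up to four times as many bytes, which is why byte length rather than character length matters for storage limits and protocol offsets.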