UTF-8 (Unicode Transformation Format, 8-bit) is a character encoding standard that can represent every character in Unicode, which covers the writing systems of virtually all modern languages, plus symbols, emoji, and special characters.
UTF-8 is the common encoding that lets computers store and display text correctly, whether it's English, Chinese, Arabic, emoji 😀, or mathematical symbols ∑.
Why UTF-8 is Important
- 98% of websites use UTF-8 (as of 2025)
- Can encode all 1,112,064 valid Unicode code points (about 1.1 million)
- Backward compatible with ASCII
- The default or required encoding for HTML5, JSON, and XML
UTF-8 vs ASCII vs UTF-16
- ASCII: Only 128 characters (English letters, numbers, basic symbols)
- UTF-8: About 1.1 million code points, variable length (1-4 bytes per character)
- UTF-16: Covers the same 1.1 million code points, but uses 2 or 4 bytes per character (less efficient for English text)
Example:
- Letter "A" in ASCII: 1 byte
- Letter "A" in UTF-8: 1 byte (same as ASCII!)
- Chinese character "中" in UTF-8: 3 bytes
- Emoji "😀" in UTF-8: 4 bytes
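The byte counts above can be checked directly. The following is a minimal Python sketch; the helper name `encoded_sizes` and the extra sample "é" are illustrative choices, not part of any standard API.

```python
# Compare how many bytes each sample character needs in UTF-8 and UTF-16.
samples = ["A", "é", "中", "😀"]

def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    utf8_len = len(ch.encode("utf-8"))
    # "utf-16-le" avoids the 2-byte BOM that the plain "utf-16" codec prepends.
    utf16_len = len(ch.encode("utf-16-le"))
    return utf8_len, utf16_len

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {ch}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Note that "中" is actually smaller in UTF-16 (2 bytes) than in UTF-8 (3 bytes), so which encoding is more compact depends on the text.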
How UTF-8 Works
UTF-8 is "variable-length": each character takes 1 to 4 bytes, depending on its code point:
- 1 byte: ASCII (English letters A-Z, digits, basic punctuation)
- 2 bytes: Latin letters with accents (é, ñ), Greek, Cyrillic, Hebrew, Arabic
- 3 bytes: Most Chinese, Japanese, and Korean characters, most symbols
- 4 bytes: Most emoji, rare CJK characters, rare/ancient scripts
This makes UTF-8 efficient: English text uses the same space as ASCII!
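To make the byte ranges concrete, here is a hand-written encoder for a single code point, checked against Python's built-in codec. It is a sketch for illustration only (the function name `utf8_encode_code_point` is made up here); real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point (an int) into UTF-8 bytes by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in "Aé中😀":
    manual = utf8_encode_code_point(ord(ch))
    assert manual == ch.encode("utf-8")  # agrees with the built-in codec
    print(f"U+{ord(ch):04X} ->", " ".join(f"{b:02x}" for b in manual))
```

The leading bits of the first byte tell a decoder how many bytes follow, and every continuation byte starts with the bits 10, which is why a 1-byte character looks exactly like ASCII.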
Common UTF-8 Problems and Solutions
Problem: Seeing gibberish like "Ã©" instead of "é"
Cause: File saved as UTF-8 but opened as Latin-1/Windows-1252
Solution: Re-open the file and select UTF-8 encoding
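The garbling is easy to reproduce: the two UTF-8 bytes of "é" are each read as a separate Windows-1252 character. A small Python sketch, using "café" as an arbitrary sample word:

```python
original = "café"
utf8_bytes = original.encode("utf-8")     # the é becomes two bytes: C3 A9
garbled = utf8_bytes.decode("cp1252")     # wrong decoder: each byte becomes its own character

# Undo the damage by reversing the wrong step, then decoding correctly.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)   # the é shows up as two characters
print(repaired)  # back to the original word
```

This round-trip repair is a best-effort trick: it assumes the wrong decoder really was Windows-1252, and it can fail when the text contains bytes that Windows-1252 leaves undefined. Re-reading the original file with the right encoding is the more reliable fix.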
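Problem: Question marks or boxes instead of text
Cause: Boxes usually mean the font has no glyph for the character; question marks usually mean the character was lost when the text was converted to an encoding that cannot represent it
Solution: Install fonts that support the character set, and keep the text in UTF-8 at every step (file, database, connection, output)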
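Besides missing fonts, a lossy conversion is a common source of question marks. The sketch below shows two ways characters get replaced in Python; the sample strings are arbitrary.

```python
text = "naïve 中文"

# Encoding to a charset that lacks these characters: each one becomes "?".
lossy = text.encode("ascii", errors="replace")

# Decoding bytes that are not valid UTF-8: the bad byte becomes U+FFFD,
# the Unicode replacement character (often drawn as a question mark in a diamond).
bad_bytes = b"caf\xe9"   # "café" saved as Latin-1, which is not valid UTF-8
decoded = bad_bytes.decode("utf-8", errors="replace")

print(lossy)
print(decoded)
```

Once a character has been replaced this way, the original is gone from that copy of the data, so no font will bring it back.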
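Problem: Database showing wrong characters
Cause: Database, table, or connection not set to UTF-8
Solution: In MySQL, set the database and tables to the utf8mb4 character set with the utf8mb4_unicode_ci collation (MySQL's older "utf8" stores at most 3 bytes per character, so it cannot hold emoji)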
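The reason for utf8mb4 rather than MySQL's older 3-byte "utf8" (utf8mb3) is that some characters need 4 bytes in UTF-8. As a rough check, assuming the text is already a Python string, a hypothetical helper like `needs_utf8mb4` can flag such input:

```python
def needs_utf8mb4(text):
    """True if any character takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("hello 中"))   # 3-byte characters still fit in the older charset
print(needs_utf8mb4("hello 😀"))   # emoji need 4 bytes
```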
How to Set UTF-8 in Different Contexts
HTML
<meta charset="UTF-8">
CSS
@charset "UTF-8";
PHP
header('Content-Type: text/html; charset=utf-8');
MySQL
SET NAMES utf8mb4;
Python
# -*- coding: utf-8 -*-
# (Needed only in Python 2; Python 3 reads source files as UTF-8 by default.)
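The coding comment only affects how Python reads the source file itself. For reading and writing data files, the encoding is usually passed explicitly, because the default used by `open()` can depend on the platform and locale. A minimal sketch, using a temporary file and an arbitrary sample string:

```python
import os
import tempfile

text = "English, 中文, 😀, ∑"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Always name the encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)
print(os.path.getsize(path), "bytes on disk for", len(text), "characters")
```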
UTF-8 vs UTF-8 BOM
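The BOM (Byte Order Mark) is an optional 3-byte marker (EF BB BF) at the start of a UTF-8 file:
- UTF-8 without BOM: Standard, recommended for web
- UTF-8 with BOM: Written by some Windows tools (including older versions of Notepad), can cause issues
Problem with BOM: The BOM bytes count as output before the opening <?php tag, which can break PHP files and prevent HTTP headers from being sent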
Always save files as "UTF-8 without BOM" for web development. Most modern editors support this.
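If an editor or tool has already added a BOM, it can be detected and removed. The following is a small Python sketch; `strip_utf8_bom` is an illustrative helper name, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + "<?php echo 'hi';".encode("utf-8")
clean = strip_utf8_bom(with_bom)

# The "utf-8-sig" codec skips a leading BOM while decoding.
decoded = with_bom.decode("utf-8-sig")

print(with_bom[:3].hex())  # the three BOM bytes
print(clean)
print(decoded)
```

Decoding the same bytes with plain "utf-8" keeps the BOM as an invisible U+FEFF character at the start of the string, which is how it ends up in output unnoticed.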