What Does UTF-8 Encoding Mean?

Updated December 12, 2025 | 5 min read

UTF-8 (Unicode Transformation Format, 8-bit) is a character encoding standard that lets computers represent every character defined in Unicode. That covers the writing systems of virtually every language in use, plus symbols, emoji, and special characters.

In Simple Terms:
UTF-8 is the universal language that lets computers display text correctly, whether it's English, Chinese, Arabic, emoji such as 😀, or mathematical symbols such as ∑.

Why UTF-8 is Important

  • Roughly 98% of websites use UTF-8 (as of 2025)
  • Can encode all 1,112,064 valid Unicode code points (about 1.1 million)
  • Backward compatible with ASCII: any valid ASCII file is also valid UTF-8
  • The default or required encoding for HTML5, JSON, and XML

UTF-8 vs ASCII vs UTF-16

  • ASCII: Only 128 characters (English letters, numbers, basic symbols)
  • UTF-8: 1.1 million characters, variable length (1-4 bytes)
  • UTF-16: Also supports 1.1 million, but uses 2 or 4 bytes per character (less efficient for English text)

Example:

  • Letter "A" in ASCII: 1 byte
  • Letter "A" in UTF-8: 1 byte (same as ASCII!)
  • Chinese character "中" in UTF-8: 3 bytes
  • Emoji "😀" in UTF-8: 4 bytes
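
The byte counts above can be checked directly. This is a minimal Python sketch using only the built-in str.encode; the sample list and the helper name encoded_sizes are illustrative choices, and the "utf-16-le" variant is used so that no byte order mark is added to the UTF-16 count.

```python
# Count how many bytes each sample character takes in UTF-8 and UTF-16.
samples = ["A", "é", "中", "😀"]


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" fixes the byte order, so no BOM is included in the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {ch!r}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Note that the comparison cuts both ways: "A" is half the size in UTF-8, while "中" takes 3 bytes in UTF-8 but only 2 in UTF-16.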

How UTF-8 Works

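UTF-8 is "variable-length": each character takes 1 to 4 bytes, depending on its Unicode code point:

  • 1 byte (U+0000 to U+007F): English letters (A-Z, a-z), numbers, basic punctuation
  • 2 bytes (U+0080 to U+07FF): Latin with accents (é, ñ), Greek, Cyrillic, Hebrew, Arabic
  • 3 bytes (U+0800 to U+FFFF): Most Chinese, Japanese, and Korean characters, most symbols
  • 4 bytes (U+10000 to U+10FFFF): Most emoji, rare/ancient scripts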

This makes UTF-8 efficient: English text uses the same space as ASCII!
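
To make the variable-length scheme concrete, here is a hand-written encoder, offered as a teaching sketch rather than something to use in production (Python's built-in str.encode already does this). The function name utf8_encode is hypothetical. The bit patterns in the comments show how the code point's bits are spread across a lead byte and one or more continuation bytes that start with 10.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")

    if code_point < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Cross-check the sketch against Python's own encoder.
for ch in "Aé中😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_encode(ord(ch)).hex(" "))
```

Because every 1-byte sequence starts with a 0 bit, the first branch produces exactly the ASCII byte, which is where the ASCII compatibility comes from.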

Common UTF-8 Problems and Solutions

Problem: Seeing gibberish like "Ã©" instead of "é"

Cause: File saved as UTF-8 but opened as Latin-1/Windows-1252

Solution: Re-open the file and select UTF-8 as the encoding
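
The mix-up is easy to reproduce, which also shows why it can often be undone. The sketch below is a small Python illustration with a made-up sample string. It decodes UTF-8 bytes as Windows-1252 (Python codec name "cp1252") to create the gibberish, then reverses the mistake.

```python
original = "café"

# Simulate the bug: the text is stored as UTF-8 bytes...
utf8_bytes = original.encode("utf-8")

# ...but read back as if it were Windows-1252.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # the single "é" now appears as two unrelated characters

# Undo the mistake: recover the raw bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

This round trip only works while the garbled text is still intact. If it has been saved again with characters dropped or replaced, the original bytes may be gone, so the safer fix is to read the file as UTF-8 in the first place.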

Problem: Question marks or boxes instead of text

Cause: Missing font or unsupported characters

Solution: Install fonts that support the character set

Problem: Database showing wrong characters

Cause: Database not set to UTF-8

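Solution: In MySQL, set the database and table character set to utf8mb4 with the collation utf8mb4_unicode_ci. MySQL's older utf8 character set stores at most 3 bytes per character, so it cannot hold emoji.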

How to Set UTF-8 in Different Contexts

HTML

<meta charset="UTF-8">

CSS

@charset "UTF-8";

PHP

header('Content-Type: text/html; charset=utf-8');

MySQL

SET NAMES utf8mb4;

Python

# -*- coding: utf-8 -*-
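
Python 3 already assumes UTF-8 for source files, so the declaration above mainly matters for Python 2 code. In practice, encoding bugs more often come from file I/O, because in most current Python versions open() falls back to a platform-dependent default when no encoding is given. This minimal sketch passes the encoding explicitly; the file name sample.txt and the sample text are placeholders.

```python
import os
import tempfile

text = "naïve 中文 😀"

# Work in a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

    # Always state the encoding instead of relying on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    size_on_disk = os.path.getsize(path)

print(round_trip == text)
print(size_on_disk, "bytes on disk for", len(text), "characters")
```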

UTF-8 vs UTF-8 BOM

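BOM (Byte Order Mark) is an optional 3-byte marker (EF BB BF) at the start of UTF-8 files:

  • UTF-8 without BOM: Standard, recommended for web
  • UTF-8 with BOM: Written by some Windows tools, including older versions of Notepad; can cause issues

Problem with BOM: It can break PHP files. The BOM bytes are sent as output before any code runs, so later header() calls fail with a "headers already sent" warning.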

Best Practice:
Always save files as "UTF-8 without BOM" for web development. Most modern editors support this.
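
If a file might already carry a BOM, it can be detected and removed programmatically. The following Python sketch uses the standard codecs module; the helper name strip_utf8_bom and the sample PHP snippet are hypothetical, chosen only for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample bytes as an editor that writes a BOM would save them.
with_bom = codecs.BOM_UTF8 + "<?php echo 'hi';".encode("utf-8")
without_bom = strip_utf8_bom(with_bom)
print(with_bom[:3].hex(" "), "->", without_bom[:5])

# When decoding to text, the "utf-8-sig" codec drops a leading BOM if there is one.
decoded = with_bom.decode("utf-8-sig")
print(decoded)
```

Decoding the same bytes with plain "utf-8" keeps the BOM as an invisible U+FEFF character at the start of the string, which is how it ends up in front of output unnoticed.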
