What is File Encoding?

Understanding how computers store text and data in files

Simple Answer
File encoding is the system computers use to translate human-readable text (like "Hello") into numbers that can be stored in a file. Different encoding schemes (UTF-8, ASCII, etc.) use different number-to-character mappings. Using the wrong encoding makes text appear as gibberish (�������).

The Basic Problem: Computers Only Understand Numbers

Computers can only store and process numbers (specifically, binary data: 1s and 0s). But humans work with letters, symbols, and characters. File encoding is the bridge between these two worlds - it's a standardized system that says "the number 65 represents the letter 'A', the number 66 represents 'B'," and so on.

When you save a text file containing "Hello World", your computer doesn't actually save those letters. Instead, it:

  1. Looks up each character in an encoding table
  2. Converts each character to its corresponding number
  3. Saves those numbers to the file

When you open the file, the process reverses: the computer reads the numbers and converts them back to characters using the same encoding table.

A Simple Example

Text you type: Hello

ASCII encoding converts to:
H = 72
e = 101
l = 108
l = 108
o = 111

File contains (in binary):
01001000 01100101 01101100 01101100 01101111

When opened: Computer reads numbers, looks up characters → displays "Hello"
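The round trip above can be verified directly in Python, using the built-in `ord()`, `format()`, and `bytes` types:

```python
# Verify the ASCII example: each character maps to a number,
# and the file stores those numbers as bytes.
text = "Hello"

codes = [ord(c) for c in text]           # character -> number
print(codes)                             # [72, 101, 108, 108, 111]

bits = " ".join(format(b, "08b") for b in text.encode("ascii"))
print(bits)  # 01001000 01100101 01101100 01101100 01101111

decoded = bytes(codes).decode("ascii")   # numbers -> characters
print(decoded)                           # Hello
```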

Why Encoding Matters

If the program opening the file uses a different encoding table than the one used to save it, the numbers get translated to the wrong characters. This is why you sometimes see garbled text like "Ã©" instead of "é" or "ã“ã‚“ã«ã¡ã¯" instead of "こんにちは".

Common Problem:
You open a file and see odd symbols (�, Ã©, ’) instead of normal text. This happens when the file was saved with one encoding (e.g., UTF-8) but opened with another (e.g., Windows-1252). The numbers in the file are correct; they're just being interpreted with the wrong translation table.
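The misreading can be reproduced in a few lines of Python: encode text as UTF-8, then decode the same bytes with the Windows-1252 table:

```python
# Reproduce mojibake: bytes written as UTF-8, read back as Windows-1252.
original = "é"                        # U+00E9
raw = original.encode("utf-8")        # b'\xc3\xa9' -- two bytes
garbled = raw.decode("windows-1252")  # each byte becomes its own character
print(garbled)                        # Ã©

# Decoding with the correct table recovers the text.
print(raw.decode("utf-8"))            # é
```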

Common Encoding Types

ASCII: Original standard (1960s). Supports 128 characters: English letters, digits, and basic symbols. Common use: old systems and simple text.

UTF-8: Modern universal standard. Supports all world languages and emoji (Unicode defines over 1.1 million possible code points). Common use: the web and modern apps; the recommended default.

UTF-16: Wide-character encoding. Covers the same characters as UTF-8 but uses 2 or 4 bytes per character. Common use: Windows internals, Java, .NET.

Windows-1252: Old Windows default. Covers Western European languages. Common use: legacy Windows files.

ISO-8859-1: Latin-1 standard. Covers Western European languages. Common use: legacy email and HTTP headers.

Shift-JIS: Japanese encoding. Covers Japanese characters. Common use: Japanese systems and files.
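A quick Python comparison shows how the byte cost of the same text differs across encodings (`utf-16-le` is used so the count excludes the BOM):

```python
# Compare how many bytes each encoding needs for the same text.
samples = {"ASCII text": "Hello", "Accented": "café", "Japanese": "こんにちは"}

for label, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # -le: no BOM, for a fair count
    print(f"{label}: UTF-8={utf8} bytes, UTF-16={utf16} bytes")
```

For plain English, UTF-8 is half the size of UTF-16; for Japanese the balance flips, which is why neither is "smaller" in general.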

UTF-8: The Universal Solution

UTF-8 (Unicode Transformation Format, 8-bit) is now the dominant encoding standard for several reasons: it is backward-compatible with ASCII (plain English text is byte-for-byte identical in both), it can represent every Unicode character, and it stays compact for the Latin-script text that dominates code and markup.

Best Practice:
Always use UTF-8 encoding for new files. It's the modern standard and prevents encoding problems across different systems, languages, and platforms.

Character Encoding vs File Encoding

These terms are often used interchangeably, but there's a subtle difference: a character encoding is the mapping scheme itself (UTF-8, ASCII), while a file's encoding is which of those schemes was used when that particular file was saved.

When someone asks "What's the file encoding?", they're asking "Which character encoding system was used to convert the text to numbers in this file?"

Text Files vs Binary Files

Encoding primarily matters for text files, meaning files that store human-readable text: plain text (.txt), source code, CSV, JSON, XML, HTML, and similar formats.

Binary files (images, videos, executables, PDFs) have their own internal structure that doesn't rely on character encoding. They store raw data in format-specific ways.

Technical Note:
Even binary files sometimes contain embedded text (metadata, tags) that uses character encoding, but the file as a whole isn't a "text file" in the encoding sense.

Real-World Encoding Problems

Problem 1: Mojibake (Character Corruption)

Symptoms: Text displays as "Ã©" instead of "é", "’" instead of "'", or as random symbols
Cause: File saved in UTF-8, opened as Windows-1252 (or vice versa)
Solution: Re-open the file and specify the correct encoding (UTF-8 in most cases)

Problem 2: Question Marks or Boxes

Symptoms: Characters display as � or □
Cause: The encoding being used doesn't support those characters (e.g., trying to display Chinese in ASCII)
Solution: Convert the file to UTF-8, which supports all characters
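Python makes this failure mode visible: a strict ASCII encode raises an error, while `errors="replace"` substitutes question marks:

```python
# What happens when an encoding can't represent a character.
text = "Hello 世界"

try:
    text.encode("ascii")              # strict mode raises
except UnicodeEncodeError as e:
    print("ASCII cannot encode:", e.reason)

# errors="replace" substitutes '?' for unencodable characters
print(text.encode("ascii", errors="replace"))  # b'Hello ??'

# UTF-8 handles everything
print(text.encode("utf-8"))
```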

Problem 3: BOM Confusion

Symptoms: UTF-8 files with BOM (Byte Order Mark) cause issues in Linux/web development
Cause: Windows adds an invisible BOM (EF BB BF) to UTF-8 files; some systems can't handle it
Solution: Save as "UTF-8 without BOM"
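The BOM behavior can be seen in Python, where the `utf-8-sig` codec strips a leading BOM while plain `utf-8` keeps it as an invisible character:

```python
# A UTF-8 BOM is the three bytes EF BB BF at the start of the file.
data = b"\xef\xbb\xbfHello"

print(repr(data.decode("utf-8")))      # '\ufeffHello' -- BOM survives as a character
print(repr(data.decode("utf-8-sig")))  # 'Hello'       -- BOM stripped

# When reading files, encoding='utf-8-sig' handles both cases:
# it strips a BOM if present and behaves like plain UTF-8 otherwise.
```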

How to Check File Encoding

Windows - Notepad

  1. Open file in Notepad
  2. Click File → Save As
  3. Look at the "Encoding" dropdown at the bottom
  4. The current encoding is shown (don't save unless you want to change it)

Visual Studio Code

The bottom-right corner of VS Code shows the current file encoding (e.g., "UTF-8"). Click it to change encoding.

Command Line (Linux/Mac)

file -i filename.txt    # on macOS, the BSD file command uses a capital -I
# Output: text/plain; charset=utf-8
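For a rough programmatic check, one approach is to try candidate encodings in order and keep the first that decodes without error. This is only a heuristic sketch; real detectors such as file or the third-party chardet library use statistical analysis:

```python
# Heuristic encoding guess: try candidates in order, keep the first
# that decodes cleanly. Order matters -- windows-1252 accepts almost
# any byte sequence, so try the stricter UTF-8 first.
def guess_encoding(raw: bytes, candidates=("utf-8", "windows-1252")) -> str:
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

print(guess_encoding("héllo".encode("utf-8")))         # utf-8
print(guess_encoding("héllo".encode("windows-1252")))  # windows-1252
```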

When Encoding Doesn't Matter

File encoding is irrelevant for binary files: images, videos, executables, compressed archives, PDFs, and other formats that define their own byte layout.

Converting Between Encodings

If you need to convert a file from one encoding to another (e.g., Windows-1252 to UTF-8):

Method 1: Text Editor

  1. Open file in a capable editor (VS Code, Notepad++, Sublime)
  2. Tell it the current encoding to display correctly
  3. Save As and choose the target encoding (UTF-8)

Method 2: Command Line

iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt

Data Loss Warning:
Converting from a rich encoding (UTF-8) to a limited one (ASCII) will lose characters that don't exist in the target. "Hello 世界" → "Hello ??" because ASCII can't represent Chinese characters.
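The same conversion can be done in Python by decoding with the old table and re-encoding with the new one (the file names here are placeholders for the sketch):

```python
# Convert a file from Windows-1252 to UTF-8, mirroring the iconv command.
import pathlib
import tempfile

src = pathlib.Path(tempfile.mkdtemp()) / "input.txt"
dst = src.with_name("output.txt")
src.write_bytes("café".encode("windows-1252"))  # simulate a legacy file

text = src.read_text(encoding="windows-1252")   # decode with the old table
dst.write_text(text, encoding="utf-8")          # re-encode as UTF-8

print(dst.read_bytes())  # b'caf\xc3\xa9' -- é now stored as two UTF-8 bytes
```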

Programming and Encoding

When writing code that reads/writes files, you must specify encoding:

Python:
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

JavaScript (Node.js):
const fs = require('fs');
fs.readFileSync('file.txt', 'utf8')

Java:
Files.readString(Path.of("file.txt"), StandardCharsets.UTF_8)

Omitting encoding makes the program use a platform-default encoding, which causes bugs when code runs on different systems (Windows vs Linux vs Mac).

Frequently Asked Questions

What is the default file encoding?

It depends on your system. Windows historically used Windows-1252. Modern Linux and Mac use UTF-8. Web browsers assume UTF-8. Because defaults vary, you should always explicitly specify UTF-8 to avoid problems.

Can I change a file's encoding without changing its content?

You must re-encode (convert) the file. You can't just "change the label" because the actual numbers stored in the file are different for different encodings. Opening in the correct encoding first, then saving in the new encoding, preserves the visible text.
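A short Python session shows why relabeling is impossible: the same character is stored as different bytes under different encodings, so the bytes themselves must change:

```python
# The same text produces different bytes under different encodings,
# which is why you can't just relabel a file -- you must re-encode it.
text = "é"
print(text.encode("utf-8"))         # b'\xc3\xa9'  (two bytes)
print(text.encode("windows-1252"))  # b'\xe9'      (one byte)

# Correct conversion: decode with the old table, encode with the new one.
old_bytes = b"\xe9"                                   # saved as Windows-1252
new_bytes = old_bytes.decode("windows-1252").encode("utf-8")
print(new_bytes)                    # b'\xc3\xa9'
```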

Why do some files say "ANSI" encoding?

"ANSI" is a vague term that usually means Windows-1252 on Western systems. It's not a specific encoding. Modern tools should use the actual encoding name (Windows-1252) instead of the ambiguous "ANSI" label.

Does encoding affect file size?

Yes. UTF-8 uses 1 byte for English/ASCII characters but 2-4 bytes for other languages. A file with Chinese text will be larger in UTF-8 than in GB2312 (Chinese-specific encoding). However, UTF-8's universal support outweighs the small size difference.
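The size difference is easy to measure in Python (the gb2312 codec ships with the standard library):

```python
# Size comparison: UTF-8 vs a language-specific encoding for Chinese text.
text = "你好世界"  # "Hello world" in Chinese

print(len(text.encode("utf-8")))    # 12 -- 3 bytes per character
print(len(text.encode("gb2312")))   # 8  -- 2 bytes per character

# For plain ASCII text, UTF-8 costs nothing extra:
print(len("Hello".encode("utf-8")))  # 5
```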

What's the difference between UTF-8 and Unicode?

Unicode is the character set - the master list of every character and its assigned number (e.g., "A = U+0041"). UTF-8 is one encoding method for storing those Unicode numbers in files. UTF-16 and UTF-32 are other encoding methods for the same Unicode character set.
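The distinction is visible in Python: a character's Unicode code point is fixed, while its on-disk size depends on which UTF encoding you choose:

```python
# One Unicode code point, three ways to store it on disk.
ch = "€"                  # U+20AC EURO SIGN
print(hex(ord(ch)))       # 0x20ac -- the code point, encoding-independent

print(ch.encode("utf-8"))            # b'\xe2\x82\xac'
print(len(ch.encode("utf-8")))       # 3 bytes
print(len(ch.encode("utf-16-le")))   # 2 bytes
print(len(ch.encode("utf-32-le")))   # 4 bytes
```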