What is File Encoding?

Understanding how computers store text and data in files

Simple Answer
File encoding is the system computers use to translate human-readable text (like "Hello") into numbers that can be stored in a file. Different encoding schemes (UTF-8, ASCII, etc.) use different number-to-character mappings. Using the wrong encoding makes text appear as gibberish (�������).

The Basic Problem: Computers Only Understand Numbers

Computers can only store and process numbers (specifically, binary data: 1s and 0s). But humans work with letters, symbols, and characters. File encoding is the bridge between these two worlds - it's a standardized system that says "the number 65 represents the letter 'A', the number 66 represents 'B'," and so on.

When you save a text file containing "Hello World", your computer doesn't actually save those letters. Instead, it:

  1. Looks up each character in an encoding table
  2. Converts each character to its corresponding number
  3. Saves those numbers to the file

When you open the file, the process reverses: the computer reads the numbers and converts them back to characters using the same encoding table.

A Simple Example

Text you type: Hello

ASCII encoding converts to:
H = 72
e = 101
l = 108
l = 108
o = 111

File contains (in binary):
01001000 01100101 01101100 01101100 01101111

When opened: Computer reads numbers, looks up characters → displays "Hello"
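The round trip above can be verified directly in Python, using the built-in `ord()`, `format()`, and `bytes` types:

```python
# Verify the ASCII example: each character maps to a number,
# and the file stores those numbers as bytes.
text = "Hello"

codes = [ord(c) for c in text]           # character -> number
print(codes)                             # [72, 101, 108, 108, 111]

bits = " ".join(format(b, "08b") for b in text.encode("ascii"))
print(bits)  # 01001000 01100101 01101100 01101100 01101111

decoded = bytes(codes).decode("ascii")   # numbers -> characters
print(decoded)                           # Hello
```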

Why Encoding Matters

If the program opening the file uses a different encoding table than the one used to save it, the numbers get translated to the wrong characters. This is why you sometimes see garbled text like "Ã©" instead of "é" or "ã“ã‚“ã«ã¡ã¯" instead of "こんにちは".

Common Problem:
You open a file and see odd symbols (�, Ã©, ’) instead of normal text. This happens when the file was saved with one encoding (e.g., UTF-8) but opened with another (e.g., Windows-1252). The numbers in the file are correct; they're just being interpreted with the wrong translation table.
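The misreading can be reproduced in a few lines of Python: encode text as UTF-8, then decode the same bytes with the Windows-1252 table:

```python
# Reproduce mojibake: bytes written as UTF-8, read back as Windows-1252.
original = "é"                        # U+00E9
raw = original.encode("utf-8")        # b'\xc3\xa9' -- two bytes
garbled = raw.decode("windows-1252")  # each byte becomes its own character
print(garbled)                        # Ã©

# Decoding with the correct table recovers the text.
print(raw.decode("utf-8"))            # é
```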

Common Encoding Types

ASCII: Original standard (1960s). Supports 128 characters: English letters, digits, and basic symbols. Common use: old systems and simple text.

UTF-8: Modern universal standard. Supports all world languages and emoji (Unicode defines over 1.1 million possible code points). Common use: the web and modern apps; the recommended default.

UTF-16: Wide-character encoding. Covers the same characters as UTF-8 but uses 2 or 4 bytes per character. Common use: Windows internals, Java, .NET.

Windows-1252: Old Windows default. Covers Western European languages. Common use: legacy Windows files.

ISO-8859-1: Latin-1 standard. Covers Western European languages. Common use: legacy email and HTTP headers.

Shift-JIS: Japanese encoding. Covers Japanese characters. Common use: Japanese systems and files.
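A quick Python comparison shows how the byte cost of the same text differs across encodings (`utf-16-le` is used so the count excludes the BOM):

```python
# Compare how many bytes each encoding needs for the same text.
samples = {"ASCII text": "Hello", "Accented": "café", "Japanese": "こんにちは"}

for label, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # -le: no BOM, for a fair count
    print(f"{label}: UTF-8={utf8} bytes, UTF-16={utf16} bytes")
```

For plain English, UTF-8 is half the size of UTF-16; for Japanese the balance flips, which is why neither is "smaller" in general.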

UTF-8: The Universal Solution

UTF-8 (Unicode Transformation Format, 8-bit) is now the dominant encoding standard for several reasons: it is backward-compatible with ASCII (plain English text is byte-for-byte identical in both), it can represent every Unicode character, and it stays compact for the Latin-script text that dominates code and markup.

Best Practice:
Always use UTF-8 encoding for new files. It's the modern standard and prevents encoding problems across different systems, languages, and platforms.

Character Encoding vs File Encoding

These terms are often used interchangeably, but there's a subtle difference: a character encoding is the mapping scheme itself (UTF-8, ASCII), while a file's encoding is which of those schemes was used when that particular file was saved.

When someone asks "What's the file encoding?", they're asking "Which character encoding system was used to convert the text to numbers in this file?"

Text Files vs Binary Files

Encoding primarily matters for text files, meaning files that store human-readable text: plain text (.txt), source code, CSV, JSON, XML, HTML, and similar formats.

Binary files (images, videos, executables, PDFs) have their own internal structure that doesn't rely on character encoding. They store raw data in format-specific ways.

Technical Note:
Even binary files sometimes contain embedded text (metadata, tags) that uses character encoding, but the file as a whole isn't a "text file" in the encoding sense.

Real-World Encoding Problems

Problem 1: Mojibake (Character Corruption)

Symptoms: Text displays as "Ã©" instead of "é", "’" instead of "'", or as random symbols
Cause: File saved in UTF-8, opened as Windows-1252 (or vice versa)
Solution: Re-open the file and specify the correct encoding (UTF-8 in most cases)

Problem 2: Question Marks or Boxes

Symptoms: Characters display as � or □
Cause: The encoding being used doesn't support those characters (e.g., trying to display Chinese in ASCII)
Solution: Convert the file to UTF-8, which supports all characters
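Python makes this failure mode visible: a strict ASCII encode raises an error, while `errors="replace"` substitutes question marks:

```python
# What happens when an encoding can't represent a character.
text = "Hello 世界"

try:
    text.encode("ascii")              # strict mode raises
except UnicodeEncodeError as e:
    print("ASCII cannot encode:", e.reason)

# errors="replace" substitutes '?' for unencodable characters
print(text.encode("ascii", errors="replace"))  # b'Hello ??'

# UTF-8 handles everything
print(text.encode("utf-8"))
```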

Problem 3: BOM Confusion

Symptoms: UTF-8 files with BOM (Byte Order Mark) cause issues in Linux/web development
Cause: Windows adds an invisible BOM (EF BB BF) to UTF-8 files; some systems can't handle it
Solution: Save as "UTF-8 without BOM"
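The BOM behavior can be seen in Python, where the `utf-8-sig` codec strips a leading BOM while plain `utf-8` keeps it as an invisible character:

```python
# A UTF-8 BOM is the three bytes EF BB BF at the start of the file.
data = b"\xef\xbb\xbfHello"

print(repr(data.decode("utf-8")))      # '\ufeffHello' -- BOM survives as a character
print(repr(data.decode("utf-8-sig")))  # 'Hello'       -- BOM stripped

# When reading files, encoding='utf-8-sig' handles both cases:
# it strips a BOM if present and behaves like plain UTF-8 otherwise.
```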

How to Check File Encoding

Windows - Notepad

  1. Open file in Notepad
  2. Click File → Save As
  3. Look at the "Encoding" dropdown at the bottom
  4. The current encoding is shown (don't save unless you want to change it)

Visual Studio Code

The bottom-right corner of VS Code shows the current file encoding (e.g., "UTF-8"). Click it to change encoding.

Command Line (Linux/Mac)

file -i filename.txt    # on macOS, the BSD file command uses a capital -I
# Output: text/plain; charset=utf-8
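For a rough programmatic check, one approach is to try candidate encodings in order and keep the first that decodes without error. This is only a heuristic sketch; real detectors such as file or the third-party chardet library use statistical analysis:

```python
# Heuristic encoding guess: try candidates in order, keep the first
# that decodes cleanly. Order matters -- windows-1252 accepts almost
# any byte sequence, so try the stricter UTF-8 first.
def guess_encoding(raw: bytes, candidates=("utf-8", "windows-1252")) -> str:
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

print(guess_encoding("héllo".encode("utf-8")))         # utf-8
print(guess_encoding("héllo".encode("windows-1252")))  # windows-1252
```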

When Encoding Doesn't Matter

File encoding is irrelevant for binary files: images, videos, executables, compressed archives, PDFs, and other formats that define their own byte layout.

Converting Between Encodings

If you need to convert a file from one encoding to another (e.g., Windows-1252 to UTF-8):

Method 1: Text Editor

  1. Open file in a capable editor (VS Code, Notepad++, Sublime)
  2. Tell it the current encoding to display correctly
  3. Save As and choose the target encoding (UTF-8)

Method 2: Command Line

iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt

Data Loss Warning:
Converting from a rich encoding (UTF-8) to a limited one (ASCII) will lose characters that don't exist in the target. "Hello 世界" → "Hello ??" because ASCII can't represent Chinese characters.
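The same conversion can be done in Python by decoding with the old table and re-encoding with the new one (the file names here are placeholders for the sketch):

```python
# Convert a file from Windows-1252 to UTF-8, mirroring the iconv command.
import pathlib
import tempfile

src = pathlib.Path(tempfile.mkdtemp()) / "input.txt"
dst = src.with_name("output.txt")
src.write_bytes("café".encode("windows-1252"))  # simulate a legacy file

text = src.read_text(encoding="windows-1252")   # decode with the old table
dst.write_text(text, encoding="utf-8")          # re-encode as UTF-8

print(dst.read_bytes())  # b'caf\xc3\xa9' -- é now stored as two UTF-8 bytes
```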

Programming and Encoding

When writing code that reads/writes files, you must specify encoding:

Python:
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

JavaScript (Node.js):
const fs = require('fs');
fs.readFileSync('file.txt', 'utf8')

Java:
Files.readString(Path.of("file.txt"), StandardCharsets.UTF_8)

Omitting encoding makes the program use a platform-default encoding, which causes bugs when code runs on different systems (Windows vs Linux vs Mac).

Frequently Asked Questions

What is the default file encoding?

It depends on your system. Windows historically used Windows-1252. Modern Linux and Mac use UTF-8. Web browsers assume UTF-8. Because defaults vary, you should always explicitly specify UTF-8 to avoid problems.

Can I change a file's encoding without changing its content?

You must re-encode (convert) the file. You can't just "change the label" because the actual numbers stored in the file are different for different encodings. Opening in the correct encoding first, then saving in the new encoding, preserves the visible text.
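A short Python session shows why relabeling is impossible: the same character is stored as different bytes under different encodings, so the bytes themselves must change:

```python
# The same text produces different bytes under different encodings,
# which is why you can't just relabel a file -- you must re-encode it.
text = "é"
print(text.encode("utf-8"))         # b'\xc3\xa9'  (two bytes)
print(text.encode("windows-1252"))  # b'\xe9'      (one byte)

# Correct conversion: decode with the old table, encode with the new one.
old_bytes = b"\xe9"                                   # saved as Windows-1252
new_bytes = old_bytes.decode("windows-1252").encode("utf-8")
print(new_bytes)                    # b'\xc3\xa9'
```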

Why do some files say "ANSI" encoding?

"ANSI" is a vague term that usually means Windows-1252 on Western systems. It's not a specific encoding. Modern tools should use the actual encoding name (Windows-1252) instead of the ambiguous "ANSI" label.

Does encoding affect file size?

Yes. UTF-8 uses 1 byte for English/ASCII characters but 2-4 bytes for other languages. A file with Chinese text will be larger in UTF-8 than in GB2312 (Chinese-specific encoding). However, UTF-8's universal support outweighs the small size difference.
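The size difference is easy to measure in Python (the gb2312 codec ships with the standard library):

```python
# Size comparison: UTF-8 vs a language-specific encoding for Chinese text.
text = "你好世界"  # "Hello world" in Chinese

print(len(text.encode("utf-8")))    # 12 -- 3 bytes per character
print(len(text.encode("gb2312")))   # 8  -- 2 bytes per character

# For plain ASCII text, UTF-8 costs nothing extra:
print(len("Hello".encode("utf-8")))  # 5
```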

What's the difference between UTF-8 and Unicode?

Unicode is the character set - the master list of every character and its assigned number (e.g., "A = U+0041"). UTF-8 is one encoding method for storing those Unicode numbers in files. UTF-16 and UTF-32 are other encoding methods for the same Unicode character set.
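The distinction is visible in Python: a character's Unicode code point is fixed, while its on-disk size depends on which UTF encoding you choose:

```python
# One Unicode code point, three ways to store it on disk.
ch = "€"                  # U+20AC EURO SIGN
print(hex(ord(ch)))       # 0x20ac -- the code point, encoding-independent

print(ch.encode("utf-8"))            # b'\xe2\x82\xac'
print(len(ch.encode("utf-8")))       # 3 bytes
print(len(ch.encode("utf-16-le")))   # 2 bytes
print(len(ch.encode("utf-32-le")))   # 4 bytes
```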