File encoding is the system computers use to translate human-readable text (like "Hello") into numbers that can be stored in a file. Different encoding schemes (UTF-8, ASCII, etc.) use different number-to-character mappings. Using the wrong encoding makes text appear as gibberish (�������).
The Basic Problem: Computers Only Understand Numbers
Computers can only store and process numbers (specifically, binary data: 1s and 0s). But humans work with letters, symbols, and characters. File encoding is the bridge between these two worlds - it's a standardized system that says "the number 65 represents the letter 'A', the number 66 represents 'B'," and so on.
When you save a text file containing "Hello World", your computer doesn't actually save those letters. Instead, it:
- Looks up each character in an encoding table
- Converts each character to its corresponding number
- Saves those numbers to the file
When you open the file, the process reverses: the computer reads the numbers and converts them back to characters using the same encoding table.
A Simple Example
ASCII encoding converts the text "Hello" to:
H = 72
e = 101
l = 108
l = 108
o = 111
File contains (in binary):
01001000 01100101 01101100 01101100 01101111
When opened: Computer reads numbers, looks up characters → displays "Hello"
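The lookup above can be reproduced in a few lines of Python. This is a minimal sketch using only the built-in `str.encode` and `bytes.decode` methods; the variable names are illustrative.

```python
# Encode "Hello" with ASCII and inspect the numbers that would be stored.
text = "Hello"
data = text.encode("ascii")            # bytes object: the numbers written to disk

numbers = list(data)                   # one integer per character
binary = " ".join(f"{b:08b}" for b in data)

print(numbers)   # [72, 101, 108, 108, 111]
print(binary)    # 01001000 01100101 01101100 01101100 01101111

# Decoding with the same table recovers the original text.
restored = data.decode("ascii")
print(restored)  # Hello
```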
Why Encoding Matters
If the program opening the file uses a different encoding table than the one used to save it, the numbers get translated to the wrong characters. This is why you sometimes see garbled text like "é" instead of "é" or "ãã"ã«ã¡ã¯" instead of "こんにちは".
You open a file and see weird symbols (�, é, ’) instead of normal text. This happens when the file was saved with one encoding (e.g., UTF-8) but opened with another (e.g., Windows-1252). The numbers are correct, but they're being interpreted using the wrong translation table.
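The mismatch is easy to reproduce. The sketch below assumes Python's codec name `cp1252` for Windows-1252 and uses escape sequences so the exact characters are unambiguous.

```python
# Save text as UTF-8, then read the same bytes back as Windows-1252.
original = "\u00e9"                      # é
stored = original.encode("utf-8")        # two bytes: 0xC3 0xA9
misread = stored.decode("cp1252")        # each byte becomes its own character

print(stored)        # b'\xc3\xa9'
print(len(misread))  # 2  (the characters U+00C3 and U+00A9)

# The same thing happens to a curly apostrophe (U+2019): three bytes, three wrong characters.
apostrophe_misread = "\u2019".encode("utf-8").decode("cp1252")
print(len(apostrophe_misread))  # 3
```

The bytes never change; only the table used to interpret them does.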
Common Encoding Types
| Encoding | Description | Character Support | Common Use |
|---|---|---|---|
| ASCII | Original standard (1960s) | 128 characters: English letters, numbers, basic symbols | Old systems, simple text |
| UTF-8 | Modern universal standard | All Unicode characters: every major writing system, emoji (1.1M+ possible code points) | Web, modern apps, recommended default |
| UTF-16 | Wide character encoding | Same characters as UTF-8, but uses 2 or 4 bytes per character | Windows internal, Java, .NET |
| Windows-1252 | Windows default (old) | Western European languages | Legacy Windows files |
| ISO-8859-1 | Latin-1 standard | Western European languages | Email, HTTP headers (legacy) |
| Shift-JIS | Japanese encoding | Japanese characters | Japanese systems and files |
UTF-8: The Universal Solution
UTF-8 (Unicode Transformation Format - 8-bit) is now the dominant encoding standard for several reasons:
- Universal: Supports every written language, including Chinese, Arabic, Russian, emoji, mathematical symbols
- Backward compatible: The first 128 characters are identical to ASCII, so any valid ASCII file is also a valid UTF-8 file
- Efficient: English text uses 1 byte per character; other languages use 2-4 bytes only where needed
- Web standard: 98% of websites use UTF-8
- No ambiguity: Files explicitly marked as UTF-8 are interpreted correctly everywhere
Always use UTF-8 encoding for new files. It's the modern standard and prevents encoding problems across different systems, languages, and platforms.
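The variable-width and ASCII-compatibility properties can be checked directly. This is a small sketch; the sample characters are arbitrary choices for illustration.

```python
# One sample character for each UTF-8 width (1 to 4 bytes).
samples = ["A", "\u00e9", "\u4e16", "\U0001F600"]   # A, é, 世, 😀
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# Pure ASCII text produces identical bytes under both encodings.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(same_bytes)  # True
```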
Character Encoding vs File Encoding
These terms are often used interchangeably, but there's a subtle difference:
- Character encoding: The mapping system itself (e.g., "UTF-8 says é = 195 169")
- File encoding: Which character encoding was used to save a particular file
When someone asks "What's the file encoding?", they're asking "Which character encoding system was used to convert the text to numbers in this file?"
Text Files vs Binary Files
Encoding primarily matters for text files - files meant to store human-readable text:
- .txt files
- .csv files
- .html, .css, .js files
- .xml, .json files
- Source code (.py, .java, .cpp, etc.)
Binary files (images, videos, executables, PDFs) have their own internal structure that doesn't rely on character encoding. They store raw data in format-specific ways.
Even binary files sometimes contain embedded text (metadata, tags) that uses character encoding, but the file as a whole isn't a "text file" in the encoding sense.
Real-World Encoding Problems
Problem 1: Mojibake (Character Corruption)
Symptoms: Text displays as "é" instead of "é", "’" instead of "’", or random symbols
Cause: File saved in UTF-8, opened as Windows-1252 (or vice versa)
Solution: Re-open the file and specify the correct encoding (UTF-8 in most cases)
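If the garbled text has already been saved or copied somewhere, it can sometimes be repaired by reversing the mistake. The helper below is a hypothetical sketch (`repair_mojibake` is not a library function) and assumes the damage was exactly one UTF-8 to Windows-1252 misread; other combinations will raise an error or produce wrong output.

```python
def repair_mojibake(garbled: str) -> str:
    """Undo a single UTF-8 -> Windows-1252 misread by reversing both steps."""
    # Turn the wrong characters back into the original bytes,
    # then decode those bytes with the encoding they were written in.
    return garbled.encode("cp1252").decode("utf-8")


garbled_word = "caf\u00c3\u00a9"          # "café" after being misread
fixed_word = repair_mojibake(garbled_word)
print(fixed_word == "caf\u00e9")          # True
```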
Problem 2: Question Marks or Boxes
Symptoms: Characters display as � or □
Cause: The encoding (or font) being used can't represent those characters (e.g., trying to display Chinese in ASCII), so a placeholder is shown instead
Solution: Convert to UTF-8 which supports all characters
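The placeholder behavior can be seen by decoding Chinese text with ASCII. This sketch uses Python's `errors="replace"` option, which substitutes U+FFFD (�) for every byte the decoder cannot map.

```python
# Two Chinese characters stored as UTF-8 take six bytes, none of them valid ASCII.
utf8_bytes = "\u4e16\u754c".encode("utf-8")            # 世界

shown = utf8_bytes.decode("ascii", errors="replace")   # what an ASCII-only reader sees
fixed = utf8_bytes.decode("utf-8")                     # what a UTF-8 reader sees

print(len(utf8_bytes))          # 6
print(shown == "\ufffd" * 6)    # True: six replacement characters
print(fixed == "\u4e16\u754c")  # True
```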
Problem 3: BOM Confusion
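Symptoms: UTF-8 files with a BOM (Byte Order Mark) cause issues in Linux and web development, such as stray characters at the start of the output or scripts that fail to run
Cause: Some Windows tools add an invisible BOM (bytes EF BB BF) to the start of UTF-8 files; some systems and parsers can't handle it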
Solution: Save as "UTF-8 without BOM"
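When re-saving is not an option, the BOM can be handled in code. The sketch below relies on Python's standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; `strip_bom` is a hypothetical helper name.

```python
import codecs

# Simulate a file that was saved as "UTF-8 with BOM".
with_bom = codecs.BOM_UTF8 + "name,age\n".encode("utf-8")

plain = with_bom.decode("utf-8")       # BOM survives as an invisible U+FEFF
clean = with_bom.decode("utf-8-sig")   # BOM is dropped if present


def strip_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(plain.startswith("\ufeff"))  # True
print(clean)                       # name,age
```

Reading with `utf-8-sig` is safe for files both with and without a BOM.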
How to Check File Encoding
Windows - Notepad
- Open file in Notepad
- Click File → Save As
- Look at the "Encoding" dropdown at the bottom
- The current encoding is shown (don't save unless you want to change it)
Visual Studio Code
The bottom-right corner of VS Code shows the current file encoding (e.g., "UTF-8"). Click it to change encoding.
Command Line (Linux/Mac)
file -bi filename.txt    # Linux; on macOS use: file -bI filename.txt
# Output: text/plain; charset=utf-8
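Where no such tool is available, a rough check can be sketched in Python. `guess_encoding` is a hypothetical helper, not a library function, and it is deliberately simple: it looks for a BOM, then tests whether the bytes are valid UTF-8. It ignores UTF-32 and cannot tell legacy encodings such as Windows-1252 and ISO-8859-1 apart.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-effort encoding name for raw file bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")               # strict: raises on any invalid byte
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown (not valid UTF-8)"


print(guess_encoding("caf\u00e9".encode("utf-8")))    # utf-8
print(guess_encoding("caf\u00e9".encode("cp1252")))   # unknown (not valid UTF-8)
```

For a real file, pass in the raw bytes, for example the result of reading the file in binary mode (`"rb"`).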
When Encoding Doesn't Matter
File encoding is irrelevant for:
- Image files: JPG, PNG, GIF - they have their own internal format
- Video files: MP4, AVI, MKV - binary formats with embedded text metadata
- Audio files: MP3, WAV, FLAC - audio data is not text
- Executable files: .exe, .dll, .so - binary machine code
- Document files: .docx, .pdf, .xlsx - these have internal encoding separate from file encoding
Converting Between Encodings
If you need to convert a file from one encoding to another (e.g., Windows-1252 to UTF-8):
Method 1: Text Editor
- Open file in a capable editor (VS Code, Notepad++, Sublime)
- Tell it the current encoding to display correctly
- Save As and choose the target encoding (UTF-8)
Method 2: Command Line
iconv -f WINDOWS-1252 -t UTF-8 input.txt > output.txt
Converting from a rich encoding (UTF-8) to a limited one (ASCII) will lose characters that don't exist in the target. "Hello 世界" → "Hello ??" because ASCII can't represent Chinese characters.
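The same conversion can be scripted. This is a minimal sketch using `pathlib`; `convert_file` and its parameter names are hypothetical. It reads the whole file into memory, so it suits small to medium files.

```python
from pathlib import Path


def convert_file(src: str, dst: str, from_enc: str, to_enc: str = "utf-8") -> None:
    """Decode src using from_enc, then write the same text to dst in to_enc."""
    text = Path(src).read_text(encoding=from_enc)
    # Raises UnicodeEncodeError if to_enc cannot represent some character,
    # which is safer than silently losing data.
    Path(dst).write_text(text, encoding=to_enc)


# Lossy conversion, as in the warning above: unsupported characters become "?".
lossy = "Hello \u4e16\u754c".encode("ascii", errors="replace")
print(lossy)  # b'Hello ??'
```

A call such as `convert_file("input.txt", "output.txt", "cp1252")` would re-encode a Windows-1252 file as UTF-8; the file names here are placeholders.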
Programming and Encoding
When writing code that reads or writes text files, always specify the encoding explicitly:
Python:
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
JavaScript (Node.js):
const fs = require('fs');
const content = fs.readFileSync('file.txt', 'utf8');
Java:
Files.readString(Path.of("file.txt"), StandardCharsets.UTF_8)
Omitting the encoding often makes the program fall back to a platform-default encoding (Python's open() does this, for example), which causes bugs when the same code runs on different systems (Windows vs Linux vs Mac).
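A write-then-read round trip shows the explicit form in both directions. This is a sketch; `save_text` and `load_text` are hypothetical helper names, and `locale.getpreferredencoding` is used only to display the default that would otherwise apply on the current machine.

```python
import locale

# The platform default differs between systems, so no fixed output is shown.
print(locale.getpreferredencoding(False))


def save_text(path: str, text: str) -> None:
    """Write text to path, always as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path: str) -> str:
    """Read text from path, always as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Because both helpers name the encoding, a file written on one system reads back identically on another.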
Frequently Asked Questions
What is the default file encoding?
It depends on your system. Windows historically used a regional legacy code page (Windows-1252 on Western systems). Modern Linux and Mac use UTF-8. The modern web expects UTF-8. Because defaults vary, you should always explicitly specify UTF-8 to avoid problems.
Can I change a file's encoding without changing its content?
You must re-encode (convert) the file. You can't just "change the label" because the actual numbers stored in the file are different for different encodings. Opening in the correct encoding first, then saving in the new encoding, preserves the visible text.
Why do some files say "ANSI" encoding?
"ANSI" is a vague term that usually means Windows-1252 on Western systems. It's not a specific encoding. Modern tools should use the actual encoding name (Windows-1252) instead of the ambiguous "ANSI" label.
Does encoding affect file size?
Yes. UTF-8 uses 1 byte for English/ASCII characters but 2-4 bytes for other languages. A file with Chinese text will be larger in UTF-8 (typically 3 bytes per character) than in GB2312 (a Chinese-specific encoding that uses 2 bytes per character). However, UTF-8's universal support usually outweighs the size difference.
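The size difference can be measured by encoding the same text several ways. This sketch uses an arbitrary four-character Chinese sample and Python's built-in `gb2312` and `utf-16-le` codecs.

```python
chinese = "\u4f60\u597d\u4e16\u754c"   # 你好世界 (4 characters)
english = "Hello"                      # 5 characters

chinese_sizes = {enc: len(chinese.encode(enc)) for enc in ("utf-8", "gb2312", "utf-16-le")}
english_sizes = {enc: len(english.encode(enc)) for enc in ("utf-8", "utf-16-le")}

print(chinese_sizes)  # {'utf-8': 12, 'gb2312': 8, 'utf-16-le': 8}
print(english_sizes)  # {'utf-8': 5, 'utf-16-le': 10}
```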
What's the difference between UTF-8 and Unicode?
Unicode is the character set - the master list of every character and its assigned number (e.g., "A = U+0041"). UTF-8 is one encoding method for storing those Unicode numbers in files. UTF-16 and UTF-32 are other encoding methods for the same Unicode character set.
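The distinction shows up clearly in code: a character has one Unicode number but different byte sequences under each encoding. In this sketch `describe` is a hypothetical helper, and the big-endian codec variants are chosen so that no byte order mark is added.

```python
def describe(ch: str) -> dict:
    """Return the Unicode code point and the encoded bytes (as hex) for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }


print(describe("A"))
# {'code_point': 'U+0041', 'utf-8': '41', 'utf-16': '0041', 'utf-32': '00000041'}
print(describe("\u00e9"))   # é
# {'code_point': 'U+00E9', 'utf-8': 'c3a9', 'utf-16': '00e9', 'utf-32': '000000e9'}
```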