Understanding Character Encoding
Character encoding defines how text characters are stored as bytes in computers. Reading or writing text with the wrong encoding corrupts it, displaying characters as question marks or garbled symbols.
Common Encodings
UTF-8 (Recommended)
UTF-8 is the modern standard and should be your default choice for all new projects.
- Variable width: 1-4 bytes per character
- Coverage: Supports all Unicode characters (over 140,000 assigned characters)
- Backward compatible: ASCII characters (0-127) use the same single-byte values
- Web standard: Required for HTML5, recommended by W3C
- Efficiency: Compact for English text, larger for Asian languages
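The variable-width property is easy to see in Python, where encoding a string returns its raw bytes:

```python
# UTF-8 uses 1-4 bytes depending on the character
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A uses 1 byte, é uses 2, € uses 3, and the emoji uses 4
```

ASCII characters stay at one byte, which is why UTF-8 is so compact for English text.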
ASCII
The original 7-bit encoding, supports only basic English characters.
- Size: 1 byte (7 bits used) per character
- Coverage: 128 characters (0-127)
- Includes: English letters, numbers, basic punctuation
- Use when: Simple English text, no special characters needed
- Limitations: No accented characters, emoji, or non-English alphabets
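A quick sketch of that limitation: Python's strict ASCII codec raises as soon as a character falls outside the 0-127 range.

```python
# Plain English text encodes fine
print("hello".encode("ascii"))

# Anything outside 0-127 raises UnicodeEncodeError
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err)
```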
Latin-1 (ISO-8859-1)
8-bit encoding for Western European languages.
- Size: 1 byte per character
- Coverage: 256 characters (0-255)
- Includes: ASCII + Western European accented characters (é, ñ, ö, etc.)
- Use when: Legacy systems, Western European text only
- Limitations: No emoji, Eastern European, Asian, or other scripts
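Latin-1's one-byte-per-character model can be demonstrated the same way; note that even the euro sign € is missing from ISO-8859-1 (it was added later in ISO-8859-15 and Windows-1252).

```python
# Every Latin-1 character maps to exactly one byte (0-255)
print("café".encode("latin-1"))  # é becomes the single byte 0xE9

# Characters outside the 256 slots fail
try:
    "€".encode("latin-1")
except UnicodeEncodeError as err:
    print("cannot encode:", err)
```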
UTF-16
A variable-width encoding built on 16-bit code units, primarily used internally by Windows and Java.
- Size: 2-4 bytes per character
- Coverage: All Unicode characters
- Use when: Windows API, Java internals, .NET strings
- Drawback: Larger file sizes for English text, byte order issues
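Both drawbacks show up directly in Python: UTF-16 more than doubles the size of ASCII-only text (including a 2-byte byte order mark, or BOM), and the explicit `-le`/`-be` variants exist precisely because of byte order issues.

```python
s = "hello"
print(len(s.encode("utf-8")))   # 5 bytes
print(len(s.encode("utf-16")))  # 12 bytes: 2-byte BOM + 2 bytes per character

# Explicit byte orders skip the BOM but must be agreed on out of band
print(s.encode("utf-16-le"))
print(s.encode("utf-16-be"))
```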
Common Encoding Problems
Mojibake (Character Corruption)
Occurs when text is decoded with the wrong encoding. Examples:
- "café" becomes "cafÃ©" (UTF-8 read as Latin-1)
- "naïve" becomes "naÃ¯ve" (UTF-8 read as Windows-1252)
- Emoji appear as "�" (replacement character)
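Mojibake can be reproduced (and, while the original bytes survive, reversed) in a few lines:

```python
text = "café"

# Encode as UTF-8 but decode as Latin-1: the two bytes of é
# are misread as two separate Latin-1 characters
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reversing the mistake recovers the original text
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)    # café
```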
Database Storage Issues
Database column encoding must match application encoding:
- MySQL: Use the utf8mb4 charset (not utf8, which is limited to three bytes per character and cannot store emoji)
- PostgreSQL: Use UTF8 encoding
- SQL Server: Use NVARCHAR for Unicode data
Best Practices
Modern Applications
- Always use UTF-8 for new projects
- Set UTF-8 in HTML: <meta charset="UTF-8">
- Configure web servers to send UTF-8 headers
- Use UTF-8 in database connections and storage
- Save source code files as UTF-8
Working with Files
- Specify encoding explicitly when opening files, e.g. open(file, encoding='utf-8') in Python
- Detect encoding before processing unknown files
- Convert legacy files to UTF-8 before processing
- Include BOM (Byte Order Mark) only if required by tools
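The first three points above can be combined into a small helper. This is a minimal sketch, and `read_text` is a hypothetical name: it tries strict UTF-8 first, then falls back to Latin-1, which accepts any byte sequence (real detection tools inspect the bytes more carefully).

```python
from pathlib import Path

def read_text(path):
    """Read a file, trying UTF-8 first and falling back to Latin-1."""
    data = Path(path).read_bytes()
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 never raises: every byte maps to some character
        return data.decode("latin-1")
```

The fallback guarantees a result but can silently produce wrong characters for non-Latin-1 files, so prefer a proper detection step for unknown data.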
APIs and Data Exchange
- Always specify encoding in Content-Type headers
- Document encoding requirements in API specs
- Validate incoming data is valid UTF-8
- Reject or convert data with encoding errors
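Validation at the API boundary can be as simple as a strict decode. A sketch, with `validate_utf8` as a hypothetical helper name:

```python
def validate_utf8(raw: bytes) -> str:
    """Decode strictly and reject payloads that are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Surface a clear error instead of storing corrupt text
        raise ValueError(f"payload is not valid UTF-8: {exc}") from exc
```

Rejecting early keeps invalid byte sequences out of databases and downstream systems, where they are much harder to track down.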
Migration from Legacy Encodings
Step 1: Identify Current Encoding
Use encoding detection tools (such as the chardet library or the Unix file command) on sample files to determine the source encoding.
Step 2: Convert Files
Use tools like iconv to convert files from source encoding to UTF-8.
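The same conversion can be scripted in Python, roughly equivalent to `iconv -f LATIN1 -t UTF-8 src > dst`. A sketch assuming the source encoding is already known from Step 1:

```python
def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode a text file from source_encoding to UTF-8."""
    # The default strict error handling makes the conversion fail
    # loudly on bad bytes instead of silently corrupting text
    with open(src, encoding=source_encoding) as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.write(fin.read())
```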
Step 3: Update Code
Change all encoding declarations in code, databases, and configuration files.
Step 4: Test Thoroughly
Test with special characters, emoji, and text in multiple languages.
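A minimal round-trip smoke test along these lines can catch gross misconfiguration early (sample strings chosen here for illustration):

```python
samples = ["café", "naïve", "日本語", "😀", "Здравствуйте"]
for s in samples:
    # Encode and decode through UTF-8 and confirm nothing changed
    round_tripped = s.encode("utf-8").decode("utf-8")
    assert round_tripped == s, f"round-trip failed for {s!r}"
print("all samples survived the round trip")
```

In a real migration, run the equivalent check through the full pipeline (application, database, and API), not just in-process.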