Understanding Character Encoding
Character encoding defines how text characters are stored as bytes in computers. Reading or writing text with the wrong encoding corrupts it, displaying characters as question marks or garbled symbols.
Common Encodings
UTF-8 (Recommended)
UTF-8 is the modern standard and should be your default choice for all new projects.
- Variable width: 1-4 bytes per character
- Coverage: Supports all Unicode characters (over 140,000 assigned characters)
- Backward compatible: ASCII characters (0-127) use the same single-byte values
- Web standard: Required for HTML5, recommended by W3C
- Efficiency: Compact for English text, larger for Asian languages
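The variable-width property is easy to see in Python, where encoding a string returns its raw bytes:

```python
# UTF-8 uses 1-4 bytes depending on the character
for ch in ["A", "é", "€", "😀"]:
    print(ch, len(ch.encode("utf-8")))
# A uses 1 byte, é uses 2, € uses 3, and the emoji uses 4
```

ASCII characters stay at one byte, which is why UTF-8 is so compact for English text.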
ASCII
The original 7-bit encoding, supports only basic English characters.
- Size: 1 byte (7 bits used) per character
- Coverage: 128 characters (0-127)
- Includes: English letters, numbers, basic punctuation
- Use when: Simple English text, no special characters needed
- Limitations: No accented characters, emoji, or non-English alphabets
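A quick sketch of that limitation: Python's strict ASCII codec raises as soon as a character falls outside the 0-127 range.

```python
# Plain English text encodes fine
print("hello".encode("ascii"))

# Anything outside 0-127 raises UnicodeEncodeError
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err)
```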
Latin-1 (ISO-8859-1)
8-bit encoding for Western European languages.
- Size: 1 byte per character
- Coverage: 256 characters (0-255)
- Includes: ASCII + Western European accented characters (é, ñ, ö, etc.)
- Use when: Legacy systems, Western European text only
- Limitations: No emoji, Eastern European, Asian, or other scripts
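Latin-1's one-byte-per-character model can be demonstrated the same way; note that even the euro sign € is missing from ISO-8859-1 (it was added later in ISO-8859-15 and Windows-1252).

```python
# Every Latin-1 character maps to exactly one byte (0-255)
print("café".encode("latin-1"))  # é becomes the single byte 0xE9

# Characters outside the 256 slots fail
try:
    "€".encode("latin-1")
except UnicodeEncodeError as err:
    print("cannot encode:", err)
```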
UTF-16
A variable-width encoding built on 16-bit code units, primarily used internally by Windows and Java.
- Size: 2-4 bytes per character
- Coverage: All Unicode characters
- Use when: Windows API, Java internals, .NET strings
- Drawback: Larger file sizes for English text, byte order issues
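Both drawbacks show up directly in Python: UTF-16 more than doubles the size of ASCII-only text (including a 2-byte byte order mark, or BOM), and the explicit `-le`/`-be` variants exist precisely because of byte order issues.

```python
s = "hello"
print(len(s.encode("utf-8")))   # 5 bytes
print(len(s.encode("utf-16")))  # 12 bytes: 2-byte BOM + 2 bytes per character

# Explicit byte orders skip the BOM but must be agreed on out of band
print(s.encode("utf-16-le"))
print(s.encode("utf-16-be"))
```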
Common Encoding Problems
Mojibake (Character Corruption)
Occurs when text is decoded with the wrong encoding. Examples:
- "café" becomes "cafÃ©" (UTF-8 read as Latin-1)
- "naïve" becomes "naÃ¯ve" (UTF-8 read as Windows-1252)
- Emoji appear as "�" (replacement character)
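Mojibake can be reproduced (and, while the original bytes survive, reversed) in a few lines:

```python
text = "café"

# Encode as UTF-8 but decode as Latin-1: the two bytes of é
# are misread as two separate Latin-1 characters
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reversing the mistake recovers the original text
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)    # café
```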
Database Storage Issues
Database column encoding must match application encoding:
- MySQL: Use the utf8mb4 charset (not utf8, which is limited to three bytes per character and cannot store emoji)
- PostgreSQL: Use UTF8 encoding
- SQL Server: Use NVARCHAR for Unicode data
Best Practices
Modern Applications
- Always use UTF-8 for new projects
- Set UTF-8 in HTML: <meta charset="UTF-8">
- Configure web servers to send UTF-8 headers
- Use UTF-8 in database connections and storage
- Save source code files as UTF-8
Working with Files
- Specify encoding explicitly when opening files, e.g. open(file, encoding='utf-8') in Python
- Detect encoding before processing unknown files
- Convert legacy files to UTF-8 before processing
- Include BOM (Byte Order Mark) only if required by tools
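The first three points above can be combined into a small helper. This is a minimal sketch, and `read_text` is a hypothetical name: it tries strict UTF-8 first, then falls back to Latin-1, which accepts any byte sequence (real detection tools inspect the bytes more carefully).

```python
from pathlib import Path

def read_text(path):
    """Read a file, trying UTF-8 first and falling back to Latin-1."""
    data = Path(path).read_bytes()
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 never raises: every byte maps to some character
        return data.decode("latin-1")
```

The fallback guarantees a result but can silently produce wrong characters for non-Latin-1 files, so prefer a proper detection step for unknown data.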
APIs and Data Exchange
- Always specify encoding in Content-Type headers
- Document encoding requirements in API specs
- Validate incoming data is valid UTF-8
- Reject or convert data with encoding errors
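Validation at the API boundary can be as simple as a strict decode. A sketch, with `validate_utf8` as a hypothetical helper name:

```python
def validate_utf8(raw: bytes) -> str:
    """Decode strictly and reject payloads that are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Surface a clear error instead of storing corrupt text
        raise ValueError(f"payload is not valid UTF-8: {exc}") from exc
```

Rejecting early keeps invalid byte sequences out of databases and downstream systems, where they are much harder to track down.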
Migration from Legacy Encodings
Step 1: Identify Current Encoding
Use encoding detection tools (such as the chardet library or the Unix file command) on sample files to determine the source encoding.
Step 2: Convert Files
Use tools like iconv to convert files from source encoding to UTF-8.
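The same conversion can be scripted in Python, roughly equivalent to `iconv -f LATIN1 -t UTF-8 src > dst`. A sketch assuming the source encoding is already known from Step 1:

```python
def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode a text file from source_encoding to UTF-8."""
    # The default strict error handling makes the conversion fail
    # loudly on bad bytes instead of silently corrupting text
    with open(src, encoding=source_encoding) as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.write(fin.read())
```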
Step 3: Update Code
Change all encoding declarations in code, databases, and configuration files.
Step 4: Test Thoroughly
Test with special characters, emoji, and text in multiple languages.
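A minimal round-trip smoke test along these lines can catch gross misconfiguration early (sample strings chosen here for illustration):

```python
samples = ["café", "naïve", "日本語", "😀", "Здравствуйте"]
for s in samples:
    # Encode and decode through UTF-8 and confirm nothing changed
    round_tripped = s.encode("utf-8").decode("utf-8")
    assert round_tripped == s, f"round-trip failed for {s!r}"
print("all samples survived the round trip")
```

In a real migration, run the equivalent check through the full pipeline (application, database, and API), not just in-process.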