What is String Encoding?
A computer only understands numbers. Encoding is a "dictionary" that says: "Character 'A' corresponds to the number 65, character 'B' — 66", etc.
🟢 Junior Level
Encoding is a set of rules that defines how characters (letters, digits, symbols) are converted to bytes for storage and transmission, and back.
Simple analogy: Encoding is like a translation language. You tell the translator: “Translate this phrase to German” (encode), and then: “Translate back to Russian” (decode). If both translators use the same dictionary — everything will be understood correctly.
Example:
import java.nio.charset.StandardCharsets;

String s = "Привет";
// Encode: String → byte[]
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
// Decode: byte[] → String
String restored = new String(bytes, StandardCharsets.UTF_8);
Most important encodings:

| Encoding | For what | Size of "Привет" |
|---|---|---|
| UTF-8 | Internet, files, databases | 12 bytes |
| UTF-16 | Internal format of Java strings (Java 8 and earlier; non-Latin strings in Java 9+) | 12 bytes |
| Windows-1251 | Old Windows systems | 6 bytes |
Main rule: ALWAYS specify the encoding explicitly! Never rely on the default encoding — before Java 18 (JEP 400 made UTF-8 the default) it differs between Windows and Linux.
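A mismatch between the encode side and the decode side is exactly what produces "garbled text". A minimal sketch (the charset names are standard JDK identifiers):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Привет";

        // Sender encodes in UTF-8 (12 bytes: 2 bytes per Cyrillic character)
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Receiver wrongly assumes Windows-1251 → each UTF-8 byte is decoded
        // as a separate (wrong) character
        String garbled = new String(utf8Bytes, Charset.forName("windows-1251"));
        System.out.println(garbled.equals(original)); // false — mojibake

        // Decoding with the matching charset restores the string exactly
        String restored = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(restored.equals(original)); // true
    }
}
```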
🟡 Middle Level
How Java stores strings internally
Before Java 9: Strings were stored as char[] in UTF-16 encoding (2 bytes per character always).
Java 9+ (Compact Strings): Java automatically chooses optimal format:
- Only Latin characters (U+0000–U+00FF) → Latin-1 (1 byte/char)
- Non-Latin characters present → UTF-16 (2 bytes/char)
- The decision is stored in the `coder` field (0 = Latin-1, 1 = UTF-16)
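The `coder` field itself is private, but the JDK's decision rule can be mirrored externally: a string is Latin-1-compatible exactly when every char fits in one byte. A sketch — `fitsLatin1` is a hypothetical helper for illustration, not a JDK API:

```java
public class CompactStringsCheck {
    // Mirrors the Java 9+ compaction rule: Latin-1 iff all chars are ≤ U+00FF.
    // Hypothetical helper — the real check lives inside java.lang.String.
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("Hello"));  // true  → stored as 1 byte/char
        System.out.println(fitsLatin1("Привет")); // false → stored as UTF-16, 2 bytes/char
    }
}
```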
Conversion during I/O
When a string leaves the JVM (file write, network send, DB query), it must be converted to bytes with the specified encoding:
// ✅ GOOD — explicit encoding
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
String str = new String(bytes, StandardCharsets.UTF_8);
// ❌ BAD — default encoding depends on OS!
byte[] bytes = str.getBytes(); // Windows-1251 on Windows, UTF-8 on Linux
String str = new String(bytes); // May produce "garbled text"
Table of typical mistakes:

| Mistake | Consequences | Solution |
|---|---|---|
| `new String(bytes)` without encoding | "Garbled text" on OS change | Always `new String(bytes, StandardCharsets.UTF_8)` |
| `getBytes()` without encoding | Cannot decode on another system | Always `getBytes(StandardCharsets.UTF_8)` |
| Confusing `str.length()` and `bytes.length` | Wrong data-processing logic | `"Привет".length()` = 6, `"Привет".getBytes(UTF_8).length` = 12 |
Comparison of popular encodings
| Encoding | Bytes/char | Support | Compatibility |
|---|---|---|---|
| UTF-8 | 1–4 (variable) | All Unicode | ASCII-compatible |
| UTF-16 LE/BE | 2–4 | All Unicode | Requires BOM or agreement |
| Latin-1 | 1 | Western European | Doesn’t support Cyrillic |
| Windows-1251 | 1 | Cyrillic | Windows only |
| ASCII | 1 | 128 characters | Doesn’t support Cyrillic |
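The byte counts in the tables above are easy to verify directly. One caveat worth knowing: `StandardCharsets.UTF_16` prepends a 2-byte BOM when encoding, so use `UTF_16BE`/`UTF_16LE` for a "pure" comparison:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String s = "Привет"; // 6 characters

        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 12 — 2 bytes per Cyrillic char
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 12 — fixed 2 bytes for BMP chars
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 14 — UTF_16 prepends a 2-byte BOM!
        System.out.println(s.getBytes(Charset.forName("windows-1251")).length); // 6 — 1 byte per char
    }
}
```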
When NOT to convert to String:

- Binary data (images, protobuf, archives) — work with `byte[]` directly
- Ultra-low-latency systems — conversion adds roughly 50–200 ns of overhead
- Streaming data of unknown encoding — use `BOMInputStream` or charset auto-detection
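The first bullet is easy to demonstrate: round-tripping arbitrary binary data through String is lossy, because byte sequences that are invalid in the charset are silently replaced. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryRoundTrip {
    public static void main(String[] args) {
        // Arbitrary binary payload (e.g. the start of a PNG header)
        byte[] binary = {(byte) 0x89, 'P', 'N', 'G', (byte) 0xFF};

        // WRONG: treating binary as text — 0x89 and 0xFF are invalid UTF-8,
        // so they are replaced with U+FFFD and the original bytes are gone
        String asText = new String(binary, StandardCharsets.UTF_8);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(binary, back)); // false — data corrupted
    }
}
```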
🔴 Senior Level
Internal Implementation — Compact Strings and Encoding
public final class String {
@Stable
private final byte[] value; // byte[] instead of char[] (Java 9+)
private final byte coder; // LATIN1=0 or UTF16=1
}
When getBytes(Charset) is called:

1. Check the current String's `coder`
2. Choose the encoding algorithm in `StringCoding.encode(value, coder, charset)`
3. For UTF-8, convert character by character via `CharsetEncoder`:
   - Latin-1 → UTF-8: 1 byte → 1 byte (ASCII) or 1 byte → 2 bytes (extended Latin-1)
   - UTF-16 → UTF-8: 2 bytes → 1–3 bytes (depending on the code point)
On new String(byte[], Charset):

- `StringCoding.decode(bytes, charset)` → `byte[] value` + `coder`
- If the decoder detects invalid bytes, they are replaced with `\uFFFD` (the replacement character)
Edge Cases
1. BOM (Byte Order Mark):
byte[] withBom = {(byte)0xEF, (byte)0xBB, (byte)0xBF, (byte)'H', (byte)'i'};
String s = new String(withBom, StandardCharsets.UTF_8);
// s.charAt(0) = '\uFEFF' — invisible character!
// s.startsWith("Hi") → false!
Solution: BOMInputStream (Apache Commons IO) or manual check of first 3 bytes.
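Without Commons IO, the manual check amounts to a few lines — a sketch (`stripUtf8Bom` is a hypothetical helper name):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomStrip {
    // Drops a leading UTF-8 BOM (EF BB BF) if present. Hypothetical helper.
    static byte[] stripUtf8Bom(byte[] bytes) {
        if (bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF) {
            return Arrays.copyOfRange(bytes, 3, bytes.length);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        String s = new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8);
        System.out.println(s.startsWith("Hi")); // true — BOM removed before decoding
    }
}
```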
2. Malformed input — invalid UTF-8 sequences:
byte[] bad = {(byte) 0xFF, (byte) 0xFE};
String s = new String(bad, StandardCharsets.UTF_8);
// Result: "\uFFFD\uFFFD" — replacement characters
Behavior depends on CodingErrorAction (REPLACE by default, can be changed to REPORT or IGNORE).
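The REPORT behaviour has to go through a `CharsetDecoder` explicitly — the `new String(...)` convenience path always uses REPLACE. A sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, (byte) 0xFE}; // invalid UTF-8

        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bad));
            System.out.println("decoded OK");
        } catch (CharacterCodingException e) {
            // MalformedInputException: invalid bytes rejected instead of silent '\uFFFD'
            System.out.println("rejected: " + e);
        }
    }
}
```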
3. Charset encoding roundtrip — data loss:
String s = "Привет";
byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
// "??????" — Cyrillic not supported in ASCII, replaced with '?'
// Reverse conversion is impossible — data is lost!
4. Security — encoding bypass:
// SQL injection through multi-byte encoding bypass
String input = "%C0%27 OR 1=1 --";
// Some old systems decode %C0%27 as a single quote
// Always validate input AFTER decoding, not before!
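The "validate AFTER decoding" rule can be shown with `java.net.URLDecoder`: the quote hidden inside `%C0%27` only becomes visible once the percent-escapes are decoded (the malformed byte 0xC0 turns into U+FFFD, and 0x27 into a real single quote). A sketch:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class ValidateAfterDecode {
    public static void main(String[] args) {
        String raw = "%C0%27 OR 1=1 --";

        // Validating BEFORE decoding: no quote visible → a naive filter passes it
        System.out.println(raw.contains("'")); // false

        // After percent-decoding, 0x27 becomes a real single quote —
        // the injection payload appears, so run validation here
        String decoded = URLDecoder.decode(raw, StandardCharsets.UTF_8);
        System.out.println(decoded.contains("'")); // true
    }
}
```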
Performance
| Operation | UTF-8 | UTF-16 | Latin-1 |
|---|---|---|---|
| Encode “Hello” | ~5ns | ~3ns | ~2ns |
| Encode “Привет” | ~15ns | ~5ns | N/A |
| Decode 12 bytes | ~10ns | ~5ns | ~3ns |
| Encode 10KB text | ~500ns | ~150ns | ~80ns |
UTF-8 encodes Cyrillic at 2 bytes/char with multi-byte branching logic, while UTF-16 is a direct 1:1 copy for BMP characters — hence the faster UTF-16 numbers above.
Memory (Java 9+):
- Latin-1 string: 24 bytes (object) + 16 + N (byte[]) → ~40+N bytes
- UTF-16 string: 24 bytes (object) + 16 + 2N (byte[]) → ~40+2N bytes
- Savings for ASCII/Latin-1: ~50% memory
Thread Safety

The classes StandardCharsets, Charset, CharsetEncoder, CharsetDecoder:

- `StandardCharsets.UTF_8` — thread-safe (immutable singleton)
- `CharsetEncoder` / `CharsetDecoder` — NOT thread-safe! A single instance cannot be used from multiple threads simultaneously
- Solution: create a new encoder/decoder per thread, or use `ThreadLocal`
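The ThreadLocal approach can be sketched like this:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class PerThreadEncoder {
    // One encoder per thread: CharsetEncoder keeps internal state and is NOT thread-safe
    private static final ThreadLocal<CharsetEncoder> UTF8_ENCODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newEncoder());

    static ByteBuffer encode(String s) throws CharacterCodingException {
        CharsetEncoder encoder = UTF8_ENCODER.get();
        encoder.reset(); // clear any state left by a previous call on this thread
        return encoder.encode(CharBuffer.wrap(s));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encode("Привет").remaining()); // 12
    }
}
```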
Production War Story
Scenario: Microservice on Linux (Spring Boot) reads data from legacy Windows system.
Windows system sent data in Windows-1251. Linux service read bytes as UTF-8 (default Linux encoding) → “garbled text” in logs. Clients complained about incorrect responses.
Diagnosis: Charset.defaultCharset() on Linux = UTF-8, on Windows = Windows-1251 (or Cp1252).
Fix:
// Explicit encoding at protocol level
String text = new String(bytes, Charset.forName("Windows-1251"));
Long-term solution: agree on UTF-8 at API contract level between all services.
Scenario 2: HTTP API response with Cyrillic — without a Content-Type: application/json; charset=UTF-8 header, the browser interpreted the response as ISO-8859-1 and the Cyrillic turned into "garbled text".
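Until the contract is fixed, the boundary adapter for the first scenario amounts to a decode-then-re-encode step — a sketch of transcoding legacy Windows-1251 bytes to UTF-8 at the service edge (`toUtf8` is a hypothetical helper name):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LegacyBridge {
    private static final Charset WINDOWS_1251 = Charset.forName("windows-1251");

    // Decode with the legacy system's charset, re-encode in the UTF-8 agreed for the API
    static byte[] toUtf8(byte[] legacyBytes) {
        return new String(legacyBytes, WINDOWS_1251).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] legacy = "Привет".getBytes(WINDOWS_1251); // 6 bytes, 1 byte/char
        byte[] utf8 = toUtf8(legacy);                    // 12 bytes, 2 bytes/char
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // Привет
    }
}
```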
Monitoring
# Check default encoding (file.encoding is a system property, not a JVM flag)
java -XshowSettings:properties -version 2>&1 | grep file.encoding
# file.encoding = UTF-8
# Check available encodings
jrunscript -e "print(java.nio.charset.Charset.availableCharsets().keySet())"
# JFR — can track I/O operations with encodings
java -XX:StartFlightRecording=filename=recording.jfr ...
# GC logs — indirectly: smaller string size → smaller heap
java -Xlog:gc*:file=gc.log ...
// Runtime encoding check
System.out.println(Charset.defaultCharset());
// JOL — actual String size after conversion
String latin = "Hello";
String cyrillic = "Привет";
System.out.println(GraphLayout.parseInstance(latin).toFootprint());    // String header + 5-byte Latin-1 array
System.out.println(GraphLayout.parseInstance(cyrillic).toFootprint()); // String header + 12-byte UTF-16 array
Best Practices for Highload

- Always specify the `Charset` — use `StandardCharsets.UTF_8` (a constant, no allocations)
- For JSON/XML/HTTP: UTF-8 — the de facto standard
- For binary data: work with `byte[]` / `ByteBuffer` directly, don't convert to String
- `CharsetEncoder` / `CharsetDecoder` — don't share between threads; create one per thread or use `ThreadLocal`
- For ultra-low latency: avoid String for I/O — use `ByteBuf` (Netty) and zero-copy approaches
- BOM handling: `BOMInputStream` or a manual check of the first bytes before decoding
- Security: validate input AFTER decoding; use constant-time comparison for secrets
🎯 Interview Cheat Sheet
Must know:

- Encoding — rules for converting characters to bytes and back
- UTF-8 — the de facto standard for the internet, JSON, HTTP (ASCII-compatible, 1–4 bytes/char)
- Java 9+: Compact Strings — Latin-1 (1 byte/char) or UTF-16 (2 bytes/char), chosen automatically
- `getBytes()` without an encoding depends on the OS — different results on Windows and Linux
- `CharsetEncoder` / `CharsetDecoder` — NOT thread-safe, cannot be shared between threads
- BOM (Byte Order Mark) — the invisible character `\uFEFF` at the start of a UTF-8 file
Frequent follow-up questions:

- Why can't you use `getBytes()` without an encoding? — The default encoding depends on the OS: on Windows it may be Cp1251, on Linux — UTF-8.
- How to handle BOM? — `BOMInputStream` (Apache Commons IO) or a manual check of the first 3 bytes (EF BB BF).
- What happens when decoding invalid UTF-8 bytes? — They are replaced with `\uFFFD` (the replacement character); configurable via `CodingErrorAction`.
- How does Java 9+ store strings? — `byte[]` + `coder`: Latin-1 for U+0000–U+00FF, UTF-16 otherwise.
Red flags (DON'T say):

- ❌ "The default encoding is always UTF-8" — it depends on the OS and locale
- ❌ "`CharsetEncoder` is thread-safe" — it is NOT; use `ThreadLocal` or create one per call
- ❌ "You can convert binary data to String" — data loss; work with `byte[]` directly
- ❌ "BOM is only a UTF-16 problem" — UTF-8 files can have a BOM too
Related topics:
- [[18. How to Properly Convert String to byte[] and Back]]
- [[19. What are Compact Strings in Java 9+]]