What is String Encoding?

🟢 Junior Level

Encoding is a set of rules that defines how characters (letters, digits, symbols) are converted to bytes for storage and transmission, and back.

A computer only understands numbers. An encoding is a “dictionary” that says: “Latin ‘A’ corresponds to number 65, Latin ‘B’ to 66, Cyrillic ‘А’ to 1040”, and so on.

Simple analogy: an encoding is like a translator's dictionary. You ask one translator to “translate this phrase to German” (encode) and another to “translate it back to Russian” (decode). If both translators use the same dictionary, the phrase survives the round trip.

Example:

String s = "Привет";

// Encode: String → byte[]
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

// Decode: byte[] → String
String restored = new String(bytes, StandardCharsets.UTF_8);

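Main rule: ALWAYS specify the encoding explicitly! Never rely on the default encoding. Before Java 18 it depends on the OS and locale (typically a legacy code page on Windows and UTF-8 on Linux). Since Java 18 (JEP 400) the default is UTF-8, but it can still be switched back with -Dfile.encoding=COMPAT.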
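
To make the “same dictionary” rule concrete, the sketch below encodes a string as UTF-8 and then decodes the same bytes twice: once with UTF-8 and once, deliberately wrongly, with ISO-8859-1. The class and method names are illustrative only, and the Cyrillic text is written with \u escapes so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Encodes with one charset and decodes the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Privet" in Cyrillic

        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(same.equals(original));  // true: same charset on both sides
        System.out.println(mixed.equals(original)); // false: UTF-8 bytes read as Latin-1
        System.out.println(mixed.length());         // 12: each of the 12 bytes became one char
    }
}
```

The mismatched decode does not throw; it silently produces different text, which is why this class of bug usually shows up late, in logs or in the UI.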

🟡 Middle Level

How Java stores strings internally

Before Java 9: Strings were stored as char[] in UTF-16 encoding (2 bytes per char; characters outside the BMP take two chars, i.e. 4 bytes).

Java 9+ (Compact Strings): Java automatically chooses optimal format:

All characters within U+0000–U+00FF → Latin-1 (1 byte/char)
At least one character above U+00FF → UTF-16 for the whole string (2 bytes/char)
Decision stored in the coder field (0 = Latin-1, 1 = UTF-16)
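
The coder field is private, so application code cannot observe the choice directly. The sketch below only mirrors the rule stated above with a hypothetical helper, fitsLatin1, and also shows that length() counts UTF-16 chars rather than user-visible characters.

```java
public class CompactStringRule {

    // Hypothetical helper: true if every char is in U+0000..U+00FF,
    // i.e. the string is eligible for the 1-byte-per-char Latin-1 layout.
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        String ascii = "Hello";
        String accented = "caf\u00E9";                             // U+00E9, still Latin-1
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Privet" in Cyrillic
        String emoji = "\uD83D\uDE00";                             // U+1F600 as a surrogate pair

        System.out.println(fitsLatin1(ascii));    // true
        System.out.println(fitsLatin1(accented)); // true
        System.out.println(fitsLatin1(cyrillic)); // false

        // One code point outside the BMP occupies two chars (4 bytes in UTF-16).
        System.out.println(emoji.length());                          // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    }
}
```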

Conversion during I/O

When a string leaves the JVM (file write, network send, DB query), it must be converted to bytes with the specified encoding:

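String text = "Привет";

// ✅ GOOD: explicit encoding
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);

// ❌ BAD: before Java 18 the default encoding depends on OS and locale!
byte[] defaultBytes = text.getBytes();          // e.g. windows-1251 on Russian-locale Windows, UTF-8 on most Linux systems
String fromDefault = new String(defaultBytes);  // May produce "garbled text" if the bytes came from another system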
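
The same rule applies to streams and files. As a minimal sketch (the class name and the temporary file are illustrative), the example below writes and reads a file through OutputStreamWriter and InputStreamReader with the charset passed explicitly, so the result does not depend on the JVM default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetIo {

    // Writes text as UTF-8 regardless of the platform default.
    static void write(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Reads the first line, decoding the bytes as UTF-8.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("encoding-demo", ".txt");
        try {
            String text = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Privet" in Cyrillic
            write(tmp, text);
            System.out.println(Files.size(tmp));                 // 12 bytes on disk for 6 chars
            System.out.println(readFirstLine(tmp).equals(text)); // true
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```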

Table of typical mistakes

Mistake	Consequences	Solution
`new String(bytes)` without encoding	“Garbled text” on OS change	Always `new String(bytes, StandardCharsets.UTF_8)`
`getBytes()` without encoding	Cannot decode on another system	Always `getBytes(StandardCharsets.UTF_8)`
Confusing `str.length()` and `bytes.length`	Wrong data processing logic	`"Привет".length()` = 6, `"Привет".getBytes(UTF_8).length` = 12

Comparison of popular encodings

Encoding	Bytes/char	Support	Compatibility
UTF-8	1–4 (variable)	All Unicode	ASCII-compatible
UTF-16 LE/BE	2 or 4	All Unicode	Requires BOM or agreement on byte order
Latin-1	1	Western European	Doesn’t support Cyrillic
Windows-1251	1	Cyrillic	Legacy; mostly Windows-era systems
ASCII	1	128 characters	Doesn’t support Cyrillic
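
The bytes-per-character column can be checked directly. The sketch below (illustrative class name) encodes the same six-character Cyrillic string with several charsets and prints the byte counts. windows-1251 is not among the charsets guaranteed by StandardCharsets, so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSizes {

    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Privet" in Cyrillic, 6 chars

        System.out.println(encodedLength(text, StandardCharsets.UTF_8));    // 12: 2 bytes per Cyrillic letter
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE)); // 12: 2 bytes per char, no BOM
        System.out.println(encodedLength(text, StandardCharsets.UTF_16));   // 14: 2-byte BOM + 12
        System.out.println(encodedLength(text, StandardCharsets.US_ASCII)); // 6: six '?' bytes, text is lost

        // Availability of windows-1251 depends on the runtime, hence the guard.
        if (Charset.isSupported("windows-1251")) {
            System.out.println(encodedLength(text, Charset.forName("windows-1251"))); // 6
        }
    }
}
```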

When NOT to convert to String

Binary data (images, protobuf, archives) — work with byte[] directly
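Ultra-low-latency systems: every conversion allocates and copies; the overhead is roughly 50–200ns per short string and should be measured on the target workload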
Streaming data of unknown encoding — use BOMInputStream or auto-detect
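
The first point can be demonstrated in a few lines. In the sketch below (illustrative names), the first eight bytes of a PNG file are pushed through a UTF-8 String and come back different, because invalid sequences are replaced during decoding. Base64 is shown as one possible alternative when a textual form is unavoidable; that is a suggestion added here, not something the rest of this note relies on.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryIsNotText {

    // True only if bytes -> String -> bytes returns the original data.
    static boolean survivesUtf8RoundTrip(byte[] data) {
        byte[] back = new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    // Base64 maps arbitrary bytes to ASCII text and back without loss.
    static boolean survivesBase64RoundTrip(byte[] data) {
        String text = Base64.getEncoder().encodeToString(data);
        return Arrays.equals(data, Base64.getDecoder().decode(text));
    }

    public static void main(String[] args) {
        // PNG signature: 0x89 is not valid as the first byte of a UTF-8 sequence.
        byte[] pngSignature = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

        System.out.println(survivesUtf8RoundTrip(pngSignature));   // false: 0x89 became U+FFFD
        System.out.println(survivesBase64RoundTrip(pngSignature)); // true
    }
}
```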

🔴 Senior Level

Internal Implementation — Compact Strings and Encoding

public final class String {
    @Stable
    private final byte[] value;  // byte[] instead of char[] (Java 9+)
    private final byte coder;    // LATIN1=0 or UTF16=1
}

When getBytes(Charset) is called:

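Check the current String’s coder
Choose the encoding path for (value, coder, charset); this logic is internal (the StringCoding class in earlier Java 9+ releases, String itself in newer ones) and may change between JDK versions
For UTF-8 (as for ISO-8859-1 and US-ASCII) the JDK uses a dedicated fast path; other charsets go through a CharsetEncoder
- Latin-1 → UTF-8: 1 byte → 1 byte (ASCII), 1 byte → 2 bytes (U+0080–U+00FF)
- UTF-16 → UTF-8: one 2-byte char → 1–3 bytes (depends on code point); a surrogate pair (4 bytes) → 4 bytes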

On new String(byte[], Charset):

StringCoding.decode(bytes, charset) → byte[] value + coder
If decoder detects invalid bytes → replacement with \uFFFD (replacement character)
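
The byte counts listed above follow from the UTF-8 bit layout. The sketch below is a simplified hand-written encoder for a single code point, compared against the JDK's own output. It is not the JDK's implementation, and it assumes the input is a valid Unicode scalar value (it performs no surrogate or range checks).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {

    // Simplified sketch: assumes cp is a valid Unicode scalar value.
    static byte[] encodeCodePoint(int cp) {
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    // Compares the hand-written result with String.getBytes(UTF_8).
    static boolean matchesJdk(int cp) {
        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(jdk, encodeCodePoint(cp));
    }

    public static void main(String[] args) {
        // 'A', U+00E9, Cyrillic U+041F, euro sign U+20AC, emoji U+1F600
        int[] samples = {'A', 0xE9, 0x41F, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " -> "
                + encodeCodePoint(cp).length + " byte(s), matches JDK: " + matchesJdk(cp));
        }
    }
}
```

For the five samples the lengths are 1, 2, 2, 3, and 4 bytes respectively.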

Edge Cases

1. BOM (Byte Order Mark):

byte[] withBom = {(byte)0xEF, (byte)0xBB, (byte)0xBF, (byte)'H', (byte)'i'};
String s = new String(withBom, StandardCharsets.UTF_8);
// s.charAt(0) = '\uFEFF' — invisible character!
// s.startsWith("Hi") → false!

Solution: BOMInputStream (Apache Commons IO) or manual check of first 3 bytes.
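
A minimal sketch of the manual check, assuming the whole payload is already in memory as a byte[] (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bom {

    // True if the array starts with the UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
            && (b[0] & 0xFF) == 0xEF
            && (b[1] & 0xFF) == 0xBB
            && (b[2] & 0xFF) == 0xBF;
    }

    // Decodes as UTF-8, skipping the BOM when present.
    static String decodeSkippingBom(byte[] b) {
        int offset = hasUtf8Bom(b) ? 3 : 0;
        return new String(b, offset, b.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'H', (byte) 'i'};

        String naive = new String(withBom, StandardCharsets.UTF_8);
        String clean = decodeSkippingBom(withBom);

        System.out.println(naive.startsWith("Hi")); // false: first char is U+FEFF
        System.out.println(clean.equals("Hi"));     // true
    }
}
```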

2. Malformed input — invalid UTF-8 sequences:

byte[] bad = {(byte) 0xFF, (byte) 0xFE};
String s = new String(bad, StandardCharsets.UTF_8);
// Result: "\uFFFD\uFFFD" — replacement characters

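new String(byte[], Charset) always replaces malformed input with \uFFFD, and this cannot be changed. To control the behavior, decode through a CharsetDecoder and set its CodingErrorAction: a decoder obtained from newDecoder() uses REPORT (throws an exception) by default and can be switched to REPLACE or IGNORE.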
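
To fail on malformed input instead of silently receiving \uFFFD, one option is to decode through a CharsetDecoder configured with CodingErrorAction.REPORT. The sketch below uses illustrative names; the actions are set explicitly so the intent is visible in the code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecoding {

    // Throws CharacterCodingException on any invalid UTF-8 sequence.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, (byte) 0xFE};
        try {
            decodeStrict(bad);
            System.out.println("accepted");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e); // this branch runs for the bytes above
        }
    }
}
```

A new decoder is created per call here, which keeps the method safe to call from several threads at the cost of a small allocation.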

3. Charset encoding roundtrip — data loss:

String s = "Привет";
byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
// "??????" — Cyrillic not supported in ASCII, replaced with '?'
// Reverse conversion is impossible — data is lost!
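
One way to catch this loss before it happens is to ask the target charset's encoder whether it can represent the text. A minimal sketch with illustrative names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossCheck {

    // True if every character of the text can be represented in the target charset.
    static boolean canEncode(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    // Encodes only when no character would be replaced with '?'.
    static byte[] encodeOrFail(String text, Charset target) {
        if (!canEncode(text, target)) {
            throw new IllegalArgumentException("Text is not representable in " + target);
        }
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // "Privet" in Cyrillic

        System.out.println(canEncode("Hello", StandardCharsets.US_ASCII));  // true
        System.out.println(canEncode(cyrillic, StandardCharsets.US_ASCII)); // false
        System.out.println(canEncode(cyrillic, StandardCharsets.UTF_8));    // true
    }
}
```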

4. Security — encoding bypass:

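// SQL injection through a multi-byte encoding bypass
String input = "%C0%A7 OR 1=1 --";
// C0 A7 is an overlong (invalid) UTF-8 form of the single quote (0x27).
// Some old, lenient decoders turn it into ' after a filter has already checked the raw input.
// Java's UTF-8 decoder rejects overlong forms and yields \uFFFD instead.
// Always validate input AFTER decoding, not before!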
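
A sketch of the “decode first, then validate” order. The allow-list here (letters, digits, underscore, hyphen) is a hypothetical policy chosen only for illustration, and URLDecoder.decode(String, Charset) requires Java 10 or later. The input uses %C0%A7, the overlong two-byte form of the single quote.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeThenValidate {

    // Hypothetical allow-list: letters, digits, '_' and '-'.
    static boolean isAllowed(String decoded) {
        return decoded.chars().allMatch(c ->
            Character.isLetterOrDigit(c) || c == '_' || c == '-');
    }

    // Step 1: percent-decode as UTF-8. Step 2: validate the decoded text.
    static String decodeAndValidate(String raw) {
        String decoded = URLDecoder.decode(raw, StandardCharsets.UTF_8); // Java 10+
        if (!isAllowed(decoded)) {
            throw new IllegalArgumentException("Rejected input");
        }
        return decoded;
    }

    public static void main(String[] args) {
        System.out.println(decodeAndValidate("user_42")); // user_42

        String decoded = URLDecoder.decode("%C0%A7 OR 1=1 --", StandardCharsets.UTF_8);
        System.out.println(decoded.indexOf('\''));      // -1: no quote is produced
        System.out.println(decoded.indexOf('\uFFFD'));  // 0: invalid bytes became U+FFFD
    }
}
```

Validation of this kind complements parameterized queries; it is not a replacement for them.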

Performance

Operation	UTF-8	UTF-16	Latin-1
Encode “Hello”	~5ns	~3ns	~2ns
Encode “Привет”	~15ns	~5ns	N/A
Decode 12 bytes	~10ns	~5ns	~3ns
Encode 10KB text	~500ns	~150ns	~80ns

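Why UTF-16 is faster for Cyrillic: UTF-8 needs 2 bytes per character plus multi-byte encoding logic, while UTF-16 is close to a direct 1:1 copy for BMP characters. The figures in the table are rough orders of magnitude, not measurements to rely on; benchmark on the target JVM and hardware (for example with JMH) before drawing conclusions.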

Memory (Java 9+):

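Latin-1 string: 24 bytes (String object) + 16 + N (byte[] header + data) → ~40+N bytes
UTF-16 string: 24 bytes (String object) + 16 + 2N (byte[] header + data) → ~40+2N bytes
These figures assume 64-bit HotSpot with compressed oops; each object is additionally padded to a multiple of 8 bytes
Savings for ASCII/Latin-1: approaches 50% as N grows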
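
The arithmetic above can be written down as a small estimator. This is a back-of-the-envelope sketch, not a measurement: it assumes a 24-byte String object, a 16-byte array header, and 8-byte object alignment (typical for 64-bit HotSpot with compressed oops), and the class and method names are illustrative.

```java
public class StringFootprint {

    // Rounds up to the next multiple of 8 (assumed object alignment).
    static long align8(long n) {
        return (n + 7) & ~7L;
    }

    // Estimated bytes for a String of the given length: String object + backing byte[].
    static long estimate(int chars, boolean latin1) {
        long array = align8(16 + (long) chars * (latin1 ? 1 : 2));
        return 24 + array;
    }

    public static void main(String[] args) {
        System.out.println(estimate(5, true));     // 48:   "Hello" stored as Latin-1
        System.out.println(estimate(6, false));    // 56:   six Cyrillic chars stored as UTF-16
        System.out.println(estimate(1000, true));  // 1040
        System.out.println(estimate(1000, false)); // 2040
    }
}
```

For short strings the fixed 40 bytes of headers dominate, so the relative saving is small; it only approaches one half for long strings.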

Thread Safety

Classes StandardCharsets, Charset, CharsetEncoder, CharsetDecoder:

StandardCharsets.UTF_8 — thread-safe (immutable singleton)
CharsetEncoder / CharsetDecoder — NOT thread-safe! Single instance cannot be used from multiple threads simultaneously
Solution: create new encoder/decoder per thread or use ThreadLocal
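
A minimal sketch of the ThreadLocal variant, with illustrative names. Each thread lazily gets its own CharsetEncoder, and the one-shot encode(CharBuffer) method resets the encoder before use, so reusing it within a thread is safe.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class PerThreadEncoder {

    // One encoder per thread; never shared between threads.
    private static final ThreadLocal<CharsetEncoder> UTF8_ENCODER =
        ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder);

    static byte[] encode(String text) throws CharacterCodingException {
        // encode(CharBuffer) performs a complete operation: reset, encode, flush.
        ByteBuffer buf = UTF8_ENCODER.get().encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            try {
                int n = encode("\u041F\u0440\u0438\u0432\u0435\u0442").length; // 12
                System.out.println(Thread.currentThread().getName() + ": " + n + " bytes");
            } catch (CharacterCodingException e) {
                throw new IllegalStateException(e);
            }
        };
        Thread a = new Thread(task, "worker-a");
        Thread b = new Thread(task, "worker-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

With thread pools the encoders live as long as the pooled threads; for occasional conversions, String.getBytes(Charset) is simpler and already thread-safe.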

Production War Story

Scenario: Microservice on Linux (Spring Boot) reads data from legacy Windows system.

Windows system sent data in Windows-1251. Linux service read bytes as UTF-8 (default Linux encoding) → “garbled text” in logs. Clients complained about incorrect responses.

Diagnosis: before Java 18, Charset.defaultCharset() follows the OS locale: UTF-8 on the Linux host, windows-1251 (or windows-1252, depending on locale) on the Windows host.

Fix:

// Explicit encoding at protocol level
String text = new String(bytes, Charset.forName("Windows-1251"));

Long-term solution: agree on UTF-8 at API contract level between all services.
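
Until every producer is migrated, a boundary adapter can transcode legacy payloads to UTF-8 once, at the edge. The sketch below uses illustrative names and assumes the runtime supports windows-1251, which is not one of the charsets guaranteed by StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LegacyTranscoder {

    // Assumption: windows-1251 is available in this runtime.
    private static final Charset LEGACY = Charset.forName("windows-1251");

    // Decodes legacy bytes with their real charset, then re-encodes as UTF-8.
    static byte[] toUtf8(byte[] legacyBytes) {
        String text = new String(legacyBytes, LEGACY);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Privet" in Cyrillic as sent by the legacy system: 1 byte per letter.
        byte[] legacy = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};

        byte[] utf8 = toUtf8(legacy);
        System.out.println(utf8.length); // 12
        System.out.println(new String(utf8, StandardCharsets.UTF_8)
            .equals("\u041F\u0440\u0438\u0432\u0435\u0442")); // true
    }
}
```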

Scenario 2: an HTTP API returned a response with Cyrillic but without Content-Type: application/json; charset=UTF-8. The browser interpreted the response as ISO-8859-1, and the Cyrillic turned into “garbled text”.

Monitoring

# Check default encoding (file.encoding is a system property, not a -XX flag)
java -XshowSettings:properties -version 2>&1 | grep encoding
# e.g. file.encoding = UTF-8

# Check available encodings (jrunscript needs a JavaScript engine; none is bundled since JDK 15)
jrunscript -e "print(java.nio.charset.Charset.availableCharsets().keySet())"

# JFR: records file and socket I/O events (bytes, duration); it does not show which charset was used
java -XX:StartFlightRecording=filename=recording.jfr ...

# GC logs — indirectly: smaller string size → smaller heap
java -Xlog:gc*:file=gc.log ...

// Runtime encoding check
System.out.println(Charset.defaultCharset());

// JOL (org.openjdk.jol.info.GraphLayout): actual String size after conversion
// Totals below assume 64-bit HotSpot with compressed oops
String latin = "Hello";
String cyrillic = "Привет";
System.out.println(GraphLayout.parseInstance(latin).toFootprint());    // total 48 bytes (Latin-1): 24 + 24
System.out.println(GraphLayout.parseInstance(cyrillic).toFootprint()); // total 56 bytes (UTF-16): 24 + 32
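
For a slightly fuller runtime check, a service can log its encoding-related settings once at startup. The sketch below uses an illustrative class name. native.encoding exists only on Java 17+, and sun.jnu.encoding is a JDK-internal property, so either may print as null on some runtimes.

```java
import java.nio.charset.Charset;

public class EncodingReport {

    // Builds a one-line summary suitable for a startup log entry.
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset()
            + ", file.encoding=" + System.getProperty("file.encoding")
            + ", native.encoding=" + System.getProperty("native.encoding")    // Java 17+
            + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"); // JDK-specific
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```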

Best Practices for Highload

Always specify Charset — use StandardCharsets.UTF_8 (constant, no allocations)
For JSON/XML/HTTP: UTF-8 — de facto standard
For binary data: work with byte[]/ByteBuffer directly, don’t convert to String
CharsetEncoder/CharsetDecoder — don’t share between threads, create per-thread or use ThreadLocal
For ultra-low-latency: avoid String for I/O — use ByteBuf (Netty), zero-copy approaches
BOM handling: BOMInputStream or manual check of first bytes before decoding
Security: validate input AFTER decoding, use constant-time comparison for secrets

🎯 Interview Cheat Sheet

Must know:

Encoding — rules for converting characters to bytes and back
UTF-8 — de facto standard for internet, JSON, HTTP (ASCII-compatible, 1-4 bytes/char)
Java 9+: Compact Strings — Latin-1 (1 byte/char) or UTF-16 (2 bytes/char) automatically
getBytes() without encoding uses the default charset: OS- and locale-dependent before Java 18 (different results on Windows and Linux), UTF-8 by default since Java 18
CharsetEncoder/CharsetDecoder — NOT thread-safe, cannot share between threads
BOM (Byte Order Mark) — invisible character \uFEFF at start of UTF-8 file

Frequent follow-up questions:

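Why can’t you use getBytes() without encoding? Before Java 18 the default encoding depends on OS and locale: on Windows it may be Cp1251, on Linux UTF-8. On Java 18+ the default is UTF-8, but it can still be switched back with -Dfile.encoding=COMPAT.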
How to handle BOM? — BOMInputStream (Apache Commons IO) or manual check of first 3 bytes (EF BB BF).
What happens when decoding invalid UTF-8 bytes? new String(bytes, charset) replaces them with \uFFFD (replacement character). To throw or ignore instead, decode with a CharsetDecoder and set its CodingErrorAction.
How does Java 9+ store strings? — byte[] + coder: Latin-1 for U+0000–U+00FF, UTF-16 for others.

Red flags (DON’T say):

❌ “Default encoding is always UTF-8”: true by default only since Java 18; before that it depends on OS and locale
❌ “CharsetEncoder is thread-safe” — NOT thread-safe, need ThreadLocal or per-call creation
❌ “You can convert binary data to String” — data loss, work with byte[] directly
❌ “BOM — only a UTF-16 problem” — UTF-8 can have BOM too

Related topics:

[[18. How to Properly Convert String to byte[] and Back]]
[[19. What are Compact Strings in Java 9+]]