Question 17 · Section 12

What is String Encoding?



🟢 Junior Level

Encoding is a set of rules that defines how characters (letters, digits, symbols) are converted to bytes for storage and transmission, and back.

A computer only understands numbers. Encoding is a “dictionary” that says: “Character ‘A’ corresponds to number 65, character ‘B’ to number 66”, and so on.

Simple analogy: Encoding is like a translation language. You tell the translator: “Translate this phrase to German” (encode), and then: “Translate back to Russian” (decode). If both translators use the same dictionary — everything will be understood correctly.

Example:

String s = "Привет";

// Encode: String → byte[]
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

// Decode: byte[] → String
String restored = new String(bytes, StandardCharsets.UTF_8);

Most important encodings:

| Encoding | For what | Size of “Привет” |
| --- | --- | --- |
| UTF-8 | Internet, files, databases | 12 bytes |
| UTF-16 | Internal format of Java 8 and earlier / for non-Latin in Java 9+ | 12 bytes |
| Windows-1251 | Old Windows systems | 6 bytes |
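
The sizes above can be checked directly. The sketch below is illustrative only: the class and method names are made up for this example, and it assumes the JDK in use ships the windows-1251 charset (standard full JDK builds do). The string is written with Unicode escapes so the source file stays ASCII-only and does not depend on the compiler's source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // The Cyrillic word used throughout this article, as Unicode escapes.
    static final String PRIVET = "\u041F\u0440\u0438\u0432\u0435\u0442";

    /** Number of bytes the text occupies in the given charset. */
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: windows-1251 is available in this JDK build.
        Charset cp1251 = Charset.forName("windows-1251");
        System.out.println("UTF-8:        " + byteLength(PRIVET, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:     " + byteLength(PRIVET, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16 + BOM: " + byteLength(PRIVET, StandardCharsets.UTF_16));
        System.out.println("windows-1251: " + byteLength(PRIVET, cp1251));
    }
}
```

Note that StandardCharsets.UTF_16 prepends a 2-byte byte order mark when encoding, so it yields 14 bytes here; the 12 bytes in the table correspond to UTF-16 without a BOM (UTF_16BE or UTF_16LE).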

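Main rule: ALWAYS specify encoding explicitly! Never rely on the default encoding: before Java 18 it depends on the OS and locale (so it differs between Windows and Linux), and on newer JDKs, where UTF-8 is the default, it can still be overridden with -Dfile.encoding.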


🟡 Middle Level

How Java stores strings internally

Before Java 9: Strings were stored as char[] in UTF-16 encoding (always 2 bytes per char; characters outside the Basic Multilingual Plane occupy two chars).

Java 9+ (Compact Strings): Java automatically chooses optimal format:

  • Only Latin characters (U+0000–U+00FF) → Latin-1 (1 byte/char)
  • Non-Latin characters present → UTF-16 (2 bytes/char)
  • Decision stored in coder field (0 = Latin-1, 1 = UTF-16)
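
The selection rule above can be mirrored in a few lines. This is only a sketch of the rule as stated here: it does not read the JVM's private coder field, the class and method names are hypothetical, and it assumes Compact Strings are enabled (the default; they can be switched off with -XX:-CompactStrings).

```java
final class CompactStringCheck {
    /**
     * Predicts whether a string is eligible for the 1-byte Latin-1 layout:
     * every UTF-16 code unit must lie in the range U+0000..U+00FF.
     */
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("Hello"));     // ASCII only
        System.out.println(fitsLatin1("caf\u00E9")); // U+00E9 is still inside Latin-1
        System.out.println(fitsLatin1("\u041F\u0440\u0438\u0432\u0435\u0442")); // Cyrillic
    }
}
```

A single character above U+00FF is enough to move the whole string to the 2-byte layout.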

Conversion during I/O

When a string leaves the JVM (file write, network send, DB query), it must be converted to bytes with the specified encoding:

// ✅ GOOD: explicit encoding
byte[] utf8Bytes = str.getBytes(StandardCharsets.UTF_8);
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);

// ❌ BAD: default encoding depends on OS, locale, and JDK version!
byte[] platformBytes = str.getBytes();      // before Java 18: e.g. Windows-1251 on Windows, UTF-8 on Linux
String risky = new String(platformBytes);   // may produce "garbled text" when the bytes came from another system
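
The same rule applies to file I/O. The following sketch (hypothetical class and method names, temp file used only for the demonstration) writes text as UTF-8 and reads it back twice: once with the matching charset and once with windows-1251, which is assumed to be available in the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetIo {
    /** Writes text as UTF-8 into a temp file and decodes the file with readCharset. */
    static String writeUtf8ThenRead(String text, Charset readCharset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), readCharset);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";
        String same = writeUtf8ThenRead(original, StandardCharsets.UTF_8);
        String garbled = writeUtf8ThenRead(original, Charset.forName("windows-1251"));
        System.out.println("matching charset restores text: " + original.equals(same));
        System.out.println("wrong charset restores text:    " + original.equals(garbled));
    }
}
```

With the wrong charset no exception is thrown: every byte maps to some character, so the corruption stays silent until someone looks at the text.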

Table of typical mistakes

| Mistake | Consequences | Solution |
| --- | --- | --- |
| new String(bytes) without encoding | “Garbled text” on OS change | Always new String(bytes, StandardCharsets.UTF_8) |
| getBytes() without encoding | Cannot decode on another system | Always getBytes(StandardCharsets.UTF_8) |
| Confusing str.length() and bytes.length | Wrong data processing logic | "Привет".length() = 6, "Привет".getBytes(UTF_8).length = 12 |

Comparison of common encodings:

| Encoding | Bytes/char | Support | Compatibility |
| --- | --- | --- | --- |
| UTF-8 | 1–4 (variable) | All Unicode | ASCII-compatible |
| UTF-16 LE/BE | 2 or 4 | All Unicode | Requires BOM or agreement |
| Latin-1 | 1 | Western European | Doesn’t support Cyrillic |
| Windows-1251 | 1 | Cyrillic | Windows only |
| ASCII | 1 | 128 characters | Doesn’t support Cyrillic |
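
The difference between string length and byte length becomes even more visible with characters outside the Basic Multilingual Plane. A minimal sketch (class and method names are illustrative) that compares the three ways of counting:

```java
import java.nio.charset.StandardCharsets;

final class LengthVsBytes {
    static int chars(String s)      { return s.length(); }                              // UTF-16 code units
    static int codePoints(String s) { return s.codePointCount(0, s.length()); }         // Unicode characters
    static int utf8Bytes(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // six BMP characters
        String emoji = "\uD83D\uDE00";                            // U+1F600 as a surrogate pair
        for (String s : new String[] {cyrillic, emoji}) {
            System.out.println(chars(s) + " chars, " + codePoints(s)
                    + " code points, " + utf8Bytes(s) + " UTF-8 bytes");
        }
    }
}
```

For the Cyrillic word the counts are 6, 6, and 12; for the single emoji they are 2, 1, and 4, so none of the three numbers can stand in for another.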

When NOT to convert to String

  • Binary data (images, protobuf, archives) — work with byte[] directly
  • Ultra-low-latency systems — conversion adds 50–200ns overhead
  • Streaming data of unknown encoding — use BOMInputStream or auto-detect
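
Why binary data should stay in byte[] can be shown with a round trip. The sketch below (hypothetical names) pushes the first bytes of a JPEG header through a UTF-8 String and compares the result with the input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BinaryRoundTrip {
    /** True if the bytes come back unchanged after byte[] -> String -> byte[]. */
    static boolean survivesUtf8RoundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        byte[] jpegMagic = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] plainText = "Hi".getBytes(StandardCharsets.UTF_8);
        System.out.println("JPEG header survives: " + survivesUtf8RoundTrip(jpegMagic));
        System.out.println("ASCII text survives:  " + survivesUtf8RoundTrip(plainText));
    }
}
```

The JPEG bytes are not valid UTF-8, so the decoder substitutes replacement characters and the original values cannot be recovered.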

🔴 Senior Level

Internal Implementation — Compact Strings and Encoding

public final class String {
    @Stable
    private final byte[] value;  // byte[] instead of char[] (Java 9+)
    private final byte coder;    // LATIN1=0 or UTF16=1
}

When getBytes(Charset) is called:

  1. Check current String’s coder
  2. Choose encoding algorithm in StringCoding.encode(value, coder, charset)
  3. For UTF-8: in OpenJDK a dedicated fast path converts the internal bytes directly, without a general-purpose CharsetEncoder
    • Latin-1 → UTF-8: 1 byte → 1 byte (ASCII), 1 byte → 2 bytes (U+0080–U+00FF)
    • UTF-16 → UTF-8: 2 bytes → 1–3 bytes for BMP characters; a surrogate pair (4 bytes) → 4 bytes
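
The byte counts per code point range can be observed from the outside, without touching JDK internals. A small sketch with a hypothetical helper:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Width {
    /** Number of UTF-8 bytes produced for a single Unicode code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x41F, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

The samples cover ASCII (1 byte), extended Latin-1 and Cyrillic (2 bytes), the euro sign from the rest of the BMP (3 bytes), and an emoji stored as a surrogate pair (4 bytes).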

On new String(byte[], Charset):

  1. StringCoding.decode(bytes, charset) → byte[] value + coder
  2. If decoder detects invalid bytes → replacement with \uFFFD (replacement character)

Edge Cases

1. BOM (Byte Order Mark):

byte[] withBom = {(byte)0xEF, (byte)0xBB, (byte)0xBF, (byte)'H', (byte)'i'};
String s = new String(withBom, StandardCharsets.UTF_8);
// s.charAt(0) = '\uFEFF' — invisible character!
// s.startsWith("Hi") → false!

Solution: BOMInputStream (Apache Commons IO) or manual check of first 3 bytes.
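
The manual check can look roughly like this. It is a sketch for UTF-8 input only, with made-up class and method names; BOMInputStream covers more encodings.

```java
import java.nio.charset.StandardCharsets;

final class BomStripper {
    /** True if the array starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF;
    }

    /** Decodes UTF-8 bytes, skipping a leading BOM if one is present. */
    static String decodeUtf8SkippingBom(byte[] bytes) {
        int offset = hasUtf8Bom(bytes) ? 3 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'H', (byte) 'i'};
        String s = decodeUtf8SkippingBom(withBom);
        System.out.println(s.startsWith("Hi"));
    }
}
```

Skipping the bytes before decoding avoids creating a string that starts with the invisible U+FEFF character in the first place.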

2. Malformed input — invalid UTF-8 sequences:

byte[] bad = {(byte) 0xFF, (byte) 0xFE};
String s = new String(bad, StandardCharsets.UTF_8);
// Result: "\uFFFD\uFFFD" — replacement characters

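new String(bytes, charset) always replaces invalid input and cannot be configured. To change the behavior, use a CharsetDecoder and set its CodingErrorAction (REPLACE, REPORT, or IGNORE) via onMalformedInput and onUnmappableCharacter. Note that a freshly created CharsetDecoder defaults to REPORT, not REPLACE.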
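
To reject invalid input instead of silently replacing it, a CharsetDecoder can be configured with CodingErrorAction.REPORT. The helper below is a sketch with a hypothetical name; returning null on failure is a simplification for the example, and real code might prefer to propagate the exception.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    /** Returns the decoded text, or null if the bytes are not valid UTF-8. */
    static String decodeOrNull(byte[] bytes) {
        // A new decoder per call: decoders are stateful and not thread-safe.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, (byte) 0xFE};
        System.out.println("strict:  " + decodeOrNull(bad));
        System.out.println("lenient: " + new String(bad, StandardCharsets.UTF_8).length() + " replacement chars");
    }
}
```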

3. Charset encoding roundtrip — data loss:

String s = "Привет";
byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
// "??????" — Cyrillic not supported in ASCII, replaced with '?'
// Reverse conversion is impossible — data is lost!
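
One way to catch this kind of loss before it happens is to ask the charset's encoder whether it can represent the text. A minimal sketch (class and method names are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LossyCheck {
    /** True if every character of the text can be represented in the charset. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    /** Encodes and decodes with the same charset, as a naive round trip would. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String privet = "\u041F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(canEncode(privet, StandardCharsets.US_ASCII));
        System.out.println(canEncode(privet, StandardCharsets.UTF_8));
        System.out.println(roundTrip(privet, StandardCharsets.US_ASCII));
    }
}
```

When canEncode returns false, the safer options are to pick another charset or to fail loudly rather than write question marks.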

4. Security — encoding bypass:

// SQL injection through multi-byte encoding bypass
String input = "%C0%27 OR 1=1 --";
// Some old systems decode %C0%27 as a single quote
// Always validate input AFTER decoding, not before!
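
The ordering problem can be reproduced with the JDK's own URLDecoder. The sketch below is not a complete defense against injection (parameterized queries remain the primary protection); it only shows that a check on the raw input and a check on the decoded input give different answers. The class and method names are hypothetical, and URLDecoder.decode(String, Charset) requires Java 10 or newer.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class DecodeThenValidate {
    /** Percent-decodes the input as UTF-8 (Java 10+ overload). */
    static String decode(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    /** A deliberately naive validation rule used only for this demonstration. */
    static boolean containsQuote(String s) {
        return s.indexOf('\'') >= 0;
    }

    public static void main(String[] args) {
        String raw = "%C0%27 OR 1=1 --";
        System.out.println("quote in raw input:     " + containsQuote(raw));
        System.out.println("quote in decoded input: " + containsQuote(decode(raw)));
    }
}
```

A filter that runs before decoding sees no quote at all, while the decoded string does contain one, which is why validation belongs after the decoding step.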

Performance

| Operation | UTF-8 | UTF-16 | Latin-1 |
| --- | --- | --- | --- |
| Encode “Hello” | ~5ns | ~3ns | ~2ns |
| Encode “Привет” | ~15ns | ~5ns | N/A |
| Decode 12 bytes | ~10ns | ~5ns | ~3ns |
| Encode 10KB text | ~500ns | ~150ns | ~80ns |

UTF-8 for Cyrillic uses 2 bytes/char and needs multi-byte encoding logic. UTF-16 is a direct 1:1 copy for BMP characters, which is faster.

Memory (Java 9+):

  • Latin-1 string: 24 bytes (object) + 16 + N (byte[]) → ~40+N bytes
  • UTF-16 string: 24 bytes (object) + 16 + 2N (byte[]) → ~40+2N bytes
  • Savings for ASCII/Latin-1: ~50% memory

Thread Safety

Classes StandardCharsets, Charset, CharsetEncoder, CharsetDecoder:

  • StandardCharsets.UTF_8 — thread-safe (immutable singleton)
  • CharsetEncoder / CharsetDecoder: NOT thread-safe! A single instance cannot be used from multiple threads simultaneously
  • Solution: create new encoder/decoder per thread or use ThreadLocal
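
A ThreadLocal-based variant might look like the sketch below. The class name and the choice to wrap the checked exception are assumptions made for the example; for plain UTF-8 conversion String.getBytes(StandardCharsets.UTF_8) is usually simpler, and an explicit encoder mainly pays off when strict error reporting or buffer reuse is needed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Utf8Encoders {
    // One encoder per thread: instances are stateful and must not be shared.
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder);

    static byte[] encode(String text) {
        try {
            // encode(CharBuffer) resets the encoder before running a full encoding operation.
            ByteBuffer buffer = ENCODER.get().encode(CharBuffer.wrap(text));
            byte[] result = new byte[buffer.remaining()];
            buffer.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Text cannot be encoded as UTF-8", e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + ": " + encode("Hi").length + " bytes");
        Thread a = new Thread(task, "worker-a");
        Thread b = new Thread(task, "worker-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```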

Production War Story

Scenario: Microservice on Linux (Spring Boot) reads data from legacy Windows system.

Windows system sent data in Windows-1251. Linux service read bytes as UTF-8 (default Linux encoding) → “garbled text” in logs. Clients complained about incorrect responses.

Diagnosis: Charset.defaultCharset() on Linux = UTF-8, on Windows = Windows-1251 or Cp1252, depending on the system locale.

Fix:

// Explicit encoding at protocol level
String text = new String(bytes, Charset.forName("Windows-1251"));

Long-term solution: agree on UTF-8 at API contract level between all services.
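
Until every service speaks UTF-8, a boundary adapter can transcode the legacy bytes once, at the edge. The sketch below uses hypothetical names and assumes the legacy system really sends windows-1251 and that this charset is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LegacyTranscoder {
    // Assumption: the legacy Windows system always sends windows-1251.
    private static final Charset LEGACY = Charset.forName("windows-1251");

    /** Converts windows-1251 bytes into UTF-8 bytes for downstream services. */
    static byte[] toUtf8(byte[] legacyBytes) {
        String text = new String(legacyBytes, LEGACY);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Cyrillic word from this article encoded as windows-1251 (one byte per letter).
        byte[] legacy = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        byte[] utf8 = toUtf8(legacy);
        System.out.println(legacy.length + " legacy bytes -> " + utf8.length + " UTF-8 bytes");
    }
}
```

Keeping the legacy charset in one place means the rest of the service never has to know that a non-UTF-8 source exists.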

Scenario 2: An HTTP API returned Cyrillic text without the header Content-Type: application/json; charset=UTF-8. The browser interpreted the response as ISO-8859-1, and the Cyrillic turned into “garbled text”.

Monitoring

# Check default encoding (file.encoding is a system property, not a VM flag)
java -XshowSettings:properties -version 2>&1 | grep file.encoding
#     file.encoding = UTF-8

# Check available encodings
jrunscript -e "print(java.nio.charset.Charset.availableCharsets().keySet())"

# JFR — can track I/O operations with encodings
java -XX:StartFlightRecording=filename=recording.jfr ...

# GC logs — indirectly: smaller string size → smaller heap
java -Xlog:gc*:file=gc.log ...

// Runtime encoding check
System.out.println(Charset.defaultCharset());

// JOL — actual String size after conversion
String latin = "Hello";
String cyrillic = "Привет";
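// Sizes below are typical for 64-bit HotSpot with compressed oops and 8-byte alignment
System.out.println(GraphLayout.parseInstance(latin).totalSize());    // ~48 bytes (Latin-1: 24 String + 24 byte[])
System.out.println(GraphLayout.parseInstance(cyrillic).totalSize()); // ~56 bytes (UTF-16: 24 String + 32 byte[])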

Best Practices for Highload

  • Always specify Charset — use StandardCharsets.UTF_8 (constant, no allocations)
  • For JSON/XML/HTTP: UTF-8 — de facto standard
  • For binary data: work with byte[]/ByteBuffer directly, don’t convert to String
  • CharsetEncoder/CharsetDecoder: don’t share between threads, create per-thread or use ThreadLocal
  • For ultra-low-latency: avoid String for I/O — use ByteBuf (Netty), zero-copy approaches
  • BOM handling: BOMInputStream or manual check of first bytes before decoding
  • Security: validate input AFTER decoding, use constant-time comparison for secrets

🎯 Interview Cheat Sheet

Must know:

  • Encoding — rules for converting characters to bytes and back
  • UTF-8 — de facto standard for internet, JSON, HTTP (ASCII-compatible, 1-4 bytes/char)
  • Java 9+: Compact Strings — Latin-1 (1 byte/char) or UTF-16 (2 bytes/char) automatically
  • getBytes() without encoding depends on OS — different results on Windows and Linux
  • CharsetEncoder/CharsetDecoder — NOT thread-safe, cannot share between threads
  • BOM (Byte Order Mark) — invisible character \uFEFF at start of UTF-8 file

Frequent follow-up questions:

  • Why can’t you use getBytes() without encoding? Before Java 18 the default encoding depends on OS and locale: on Windows it may be Cp1251, on Linux UTF-8. It can also be overridden with -Dfile.encoding.
  • How to handle BOM? Use BOMInputStream (Apache Commons IO) or manually check the first 3 bytes (EF BB BF).
  • What happens when decoding invalid UTF-8 bytes? new String(...) replaces them with \uFFFD (replacement character). A CharsetDecoder can be configured via CodingErrorAction to report or ignore them instead.
  • How does Java 9+ store strings? As byte[] + coder: Latin-1 for strings containing only U+0000–U+00FF, UTF-16 for all others.

Red flags (DON’T say):

  • ❌ “Default encoding is always UTF-8”: before Java 18 it depends on OS and locale, and on newer JDKs it can still be changed with -Dfile.encoding
  • ❌ “CharsetEncoder is thread-safe” — NOT thread-safe, need ThreadLocal or per-call creation
  • ❌ “You can convert binary data to String” — data loss, work with byte[] directly
  • ❌ “BOM — only a UTF-16 problem” — UTF-8 can have BOM too

Related topics:

  • [[18. How to Properly Convert String to byte[] and Back]]
  • [[19. What are Compact Strings in Java 9+]]