Question 19 · Section 12

What are Compact Strings in Java 9+?

🟢 Junior Level

Compact Strings is an optimization introduced in Java 9 that lets a string use half the memory for its character data when it contains only Latin-1 characters (English letters, digits, basic punctuation, Western European accented letters).

How it worked before Java 9: Every character was stored in 2 bytes (UTF-16), even for simple letters like “a”, “b”, “c”. The string “Hello” took 10 bytes.

How it works in Java 9+:

  • If the string contains only Latin-1 characters (U+0000–U+00FF) → 1 byte per character
  • If any other character is present (Cyrillic, CJK ideographs, emoji) → 2 bytes per char for the whole string, as before (an emoji is two chars, so 4 bytes)
  • Java picks the format automatically; you don’t need to change anything in your code

Example:

String english = "Hello World";  // 11 bytes of character data (22 in Java 8)
String russian = "Привет Мир";    // 20 bytes (UTF-16, as before)
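
The selection rule above can be checked by hand. The sketch below is an illustration, not JDK code: the class and method names (Latin1Check, fitsLatin1, payloadBytes) are hypothetical, and the result is only an estimate of the character data a JVM with Compact Strings enabled would store.

```java
// Illustration only: hypothetical helper, not part of the JDK.
final class Latin1Check {

    // True when every char lies in U+0000..U+00FF, the range Latin-1 can hold.
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    // Estimated bytes of character data with Compact Strings enabled.
    static int payloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String english = "Hello World";
        // "Privet Mir" in Cyrillic, written with escapes to keep the source ASCII-only
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442 \u041c\u0438\u0440";
        System.out.println(payloadBytes(english)); // 11
        System.out.println(payloadBytes(russian)); // 20
    }
}
```

A single character outside Latin-1, such as the euro sign U+20AC, is enough to make fitsLatin1 return false for the whole string.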

Simple analogy: Imagine packing a suitcase. Before, Java always used the large suitcase (2 bytes per character), even if you were only taking socks. Now Java checks first: if everything fits in the small suitcase (1 byte per character), it takes that one; if even one item does not fit, everything goes into the large one.

You don’t need to change anything. This works completely transparently.


🟡 Middle Level

How it’s implemented internally

In Java 9, the char[] value field was replaced with byte[] value + coder flag:

public final class String {
    @Stable
    private final byte[] value;  // Was char[] before
    private final byte coder;    // 0 = Latin-1, 1 = UTF-16
}
  • coder = 0 (LATIN1): 1 byte per character, range U+0000–U+00FF
  • coder = 1 (UTF16): 2 bytes per character, all Unicode characters

Automatic switching

String latin = "Hello";      // coder = LATIN1, 1 byte/char
String cyrillic = "Привет";  // coder = UTF16, 2 bytes/char

// Concatenation Latin-1 + Cyrillic → UTF-16 (entire string expands)
String mixed = latin + cyrillic; // coder = UTF16 for entire string
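
The cost of that expansion can be made visible with the public charset API. This is a sketch under an assumption: the ISO_8859_1 and UTF_16BE encoders stand in for the two internal layouts (the JDK itself keeps UTF-16 data in the platform's native byte order), and the class name ConcatInflation is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: uses charset encoders as a stand-in for the internal layouts.
final class ConcatInflation {

    // Size of the text when every char takes 1 byte (only valid for Latin-1 content).
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Size of the text when every char takes 2 bytes (UTF_16BE writes no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String latin = "Hello";
        String cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442"; // "Privet" in Cyrillic
        String mixed = latin + cyrillic;

        System.out.println(latin1Bytes(latin));   // 5
        System.out.println(utf16Bytes(cyrillic)); // 12
        System.out.println(utf16Bytes(mixed));    // 22, not 5 + 12 = 17
    }
}
```

The 5 extra bytes are the Latin-1 part of the result being widened to 2 bytes per char.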

Table of typical mistakes

Mistake | Consequences | Solution
Thinking Compact Strings is compression (like gzip) | Expecting savings for any data | It is only a denser storage format, and only for Latin-1 strings
Thinking -XX:+CompactStrings has to be enabled | Confusion with flags | Enabled by default since Java 9; -XX:-CompactStrings disables it
Assuming substring() of a UTF-16 string always stays UTF-16 | Overestimated memory usage, needless manual copies | OpenJDK re-checks the copied characters and stores a Latin-1-only substring with 1 byte per character

Memory comparison

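String | Java 8 (char[]) | Java 9+ (byte[]) | Savings | Notes
"Hello" | 10 bytes | 5 bytes | 50% |
"Hello World" | 22 bytes | 11 bytes | 50% |
"Привет" | 12 bytes | 12 bytes | 0% |
"Hello Привет" | 24 bytes | 24 bytes | 0% |
"" (empty) | ~40 bytes | ~40 bytes | 0% | Whole object: String (~24) + empty array (~16)

Sizes are character data only, except for the empty string, where the whole object is counted.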

When NOT to rely on compact strings

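  • Cyrillic/Chinese/Japanese text: always UTF-16, no savings
  • Mixed text (Latin + Cyrillic): the entire string becomes UTF-16
  • Mostly-ASCII text with a single character outside Latin-1 (for example “€” or a curly quote): the entire string is UTF-16. An ASCII-only substring() of such a string is stored as Latin-1 again, because OpenJDK re-checks the content when it copies the characters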

🔴 Senior Level

Internal Implementation — JEP 254

JEP 254: Compact Strings (Java 9) changed the internal String representation:

// Key methods with coder check (simplified from the OpenJDK source)
public int length() {
    return value.length >> coder; // LATIN1: >> 0 = no shift; UTF16: >> 1 = /2
}

public char charAt(int index) {
    if (isLatin1()) {
        return (char)(value[index] & 0xff);
    }
    return StringUTF16.getChar(value, index);
}

public boolean equals(Object anObject) {
    if (this == anObject) return true;
    if (anObject instanceof String another) {
        if (coder == another.coder) { // First check coder!
            return isLatin1()
                ? StringLatin1.equals(value, another.value)
                : StringUTF16.equals(value, another.value);
        }
    }
    return false; // Different coder → not equal: a UTF16 string always holds at least one non-Latin-1 char
}

Most String methods now check coder and delegate the work to StringLatin1 or StringUTF16. Many of those helpers are HotSpot intrinsics: the JIT compiler replaces them with hand-tuned, often SIMD-vectorized, machine code for comparing and copying byte arrays.
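
The coder-dependent logic in the excerpt above can be exercised with a toy class. This is a minimal sketch, not the JDK implementation: MiniCompactString and its members are hypothetical names, and it stores UTF-16 high byte first for readability, whereas the JDK uses the platform's native byte order.

```java
// Toy model of the byte[] + coder layout; illustration only, not JDK code.
final class MiniCompactString {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;
    private final byte coder;

    MiniCompactString(String s) {
        // Pick the coder once, from the content: Latin-1 only if every char fits in 1 byte.
        boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
        this.coder = latin1 ? LATIN1 : UTF16;
        this.value = new byte[s.length() << coder];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (latin1) {
                value[i] = (byte) c;
            } else {
                value[2 * i] = (byte) (c >> 8); // high byte first (assumption of this sketch)
                value[2 * i + 1] = (byte) c;
            }
        }
    }

    // Same trick as String.length(): shift by 0 for LATIN1, by 1 for UTF16.
    int length() {
        return value.length >> coder;
    }

    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xff);
        }
        return (char) (((value[2 * index] & 0xff) << 8) | (value[2 * index + 1] & 0xff));
    }

    byte coder() { return coder; }

    int payloadBytes() { return value.length; }

    public static void main(String[] args) {
        MiniCompactString latin = new MiniCompactString("Hello");
        MiniCompactString cyrillic = new MiniCompactString("\u041c\u0438\u0440"); // "Mir" in Cyrillic
        System.out.println(latin.payloadBytes() + " bytes, length " + latin.length());
        System.out.println(cyrillic.payloadBytes() + " bytes, length " + cyrillic.length());
    }
}
```

Both strings report their length in chars even though their backing arrays use different widths, which is the job of the shift in length().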

Trade-offs

Pros:

  • Memory savings: 40–50% of string memory in typical Enterprise applications (JSON keys, log levels, HTTP headers are all ASCII)
  • GC pressure: the same number of objects, but smaller backing arrays → less heap occupied → less frequent GC cycles
  • Cache locality: compact data → more fits in L1/L2 CPU cache → faster processing
  • Free optimization — zero code change

Cons:

  • Small overhead for the coder check in most methods (the JIT can fold it away for constant strings; otherwise it is usually a well-predicted branch)
  • Concatenation Latin-1 + UTF-16 → UTF-16 (entire string expands)
  • JDK developers had to rewrite the string intrinsics for two formats
  • Reflection code that read String.value as char[] broke

Edge Cases

1. Coder mismatch on concatenation:

String latin = "Hello";     // Latin-1
String cyrillic = "Мир";    // UTF-16
String result = latin + " " + cyrillic; // Entire string → UTF-16
// Even "Hello " expands to UTF-16 in result
// Loss: 6 bytes → 12 bytes for Latin-1 part

2. Substring of a UTF-16 string is re-compacted:

String mixed = "Hello Мир";    // UTF-16 (contains Cyrillic)
String sub = mixed.substring(0, 5); // "Hello": stored as Latin-1 again
// OpenJDK's substring() copies the chars and tries to compress them to Latin-1
// sub takes 5 bytes of character data, not 10; the price is an O(n) copy and scan

3. Reflection and byte[]:

// In Java 9+ the value field is byte[], not char[]:
// old reflection code that cast it to char[] broke
// The module system (Java 9+) additionally restricts access to internal fields
Field valueField = String.class.getDeclaredField("value");
valueField.setAccessible(true); // Java 16+: needs --add-opens java.base/java.lang=ALL-UNNAMED
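
For experiments, the coder can be read the same way. The sketch below is hedged heavily: it relies on the private field name coder, an OpenJDK implementation detail that may change, and the class name CoderProbe is hypothetical. When access is denied (the default on recent JDKs without --add-opens) or the field does not exist (Java 8), it reports that instead of failing.

```java
import java.lang.reflect.Field;

// Diagnostic sketch only: depends on OpenJDK internals, never use in production code.
final class CoderProbe {

    // Returns "LATIN1", "UTF16", or "inaccessible" when the field cannot be read.
    static String probe(String s) {
        try {
            Field coder = String.class.getDeclaredField("coder");
            coder.setAccessible(true); // throws InaccessibleObjectException if java.lang is not opened
            return coder.getByte(s) == 0 ? "LATIN1" : "UTF16";
        } catch (ReflectiveOperationException | RuntimeException e) {
            return "inaccessible";
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("Hello"));
    }
}
```

Running it with java --add-opens java.base/java.lang=ALL-UNNAMED opens the package so the field can actually be read.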

4. Coder on creation via constructor:

// new String(byte[], Charset) determines coder by content
byte[] ascii = {72, 101, 108, 108, 111}; // "Hello"
String s = new String(ascii, StandardCharsets.UTF_8); // coder = LATIN1

byte[] cyrillicBytes = {(byte)0xD0, (byte)0x9F}; // "П" in UTF-8
String s2 = new String(cyrillicBytes, StandardCharsets.UTF_8); // coder = UTF16

Performance

Operation | Java 8 (char[]) | Java 9+ Compact | Improvement
Character data, “Hello World” | 22 bytes | 11 bytes | -50%
Character data, “Hello World!” | 24 bytes | 12 bytes | -50%
charAt() | ~1ns | ~1ns (intrinsic) | Same
equals() (Latin-1) | ~2ns | ~1.5ns (SIMD on byte[]) | -25%
equals() (UTF-16) | ~2ns | ~2ns | Same
GC throughput | Baseline | +10–15% | Better

// Approximate values (JMH). Actual results depend on CPU and JVM.

Memory (64-bit JVM, CompressedOops):

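  • Latin-1 String: 24 bytes (String object: header + fields) + 16 bytes (array header) + N bytes of data ≈ 40 + N bytes
  • UTF-16 String: 24 bytes + 16 bytes + 2N bytes ≈ 40 + 2N bytes
  • Each array is padded to a multiple of 8 bytes, so real sizes are slightly larger
  • For 1M “Hello” strings: ~5MB less character data (~8MB less Heap once padding is counted)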
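
The arithmetic above can be written as a small estimator. It is a sketch under stated assumptions, not a measurement: a 64-bit HotSpot JVM with CompressedOops, a 24-byte String object, a 16-byte array header, and 8-byte object alignment. The class name StringFootprint is hypothetical; a tool such as JOL gives the real numbers.

```java
// Back-of-the-envelope estimator; the constants are assumptions, not guarantees.
final class StringFootprint {
    static final int STRING_OBJECT = 24; // header + value ref + hash + coder, padded
    static final int ARRAY_HEADER = 16;  // array header including the length field

    // Round up to the next multiple of 8 (assumed object alignment).
    static long align8(long n) {
        return (n + 7) & ~7L;
    }

    // Estimated heap bytes for one String with the given number of chars.
    static long estimate(int chars, boolean latin1) {
        long payload = latin1 ? chars : 2L * chars;
        return STRING_OBJECT + align8(ARRAY_HEADER + payload);
    }

    public static void main(String[] args) {
        System.out.println(estimate(5, true));  // 48
        System.out.println(estimate(5, false)); // 56
        long saved = (estimate(5, false) - estimate(5, true)) * 1_000_000L;
        System.out.println(saved);              // 8000000
    }
}
```

Because of the padding, the per-string difference for "Hello" comes out at 8 bytes rather than the 5 bytes of raw character data.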

Thread Safety

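String remains fully thread-safe. Both value and coder are final, and the byte array is never modified after construction, so concurrent readers cannot observe a race. @Stable is an internal JDK annotation that lets the JIT treat the field's contents as constant once set; it is an optimization hint, not the source of thread safety.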

Production War Story

Scenario: Migration Java 8 → 17 in microservice (4GB Heap, Spring Boot, JSON API).

  • Before: Heap usage 75% (3GB), Full GC every 20 minutes, p99 latency = 15ms
  • After: Heap usage 55% (2.2GB), Full GC every 45 minutes, p99 latency = 10ms
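  • JSON keys ("id", "name", "type", "status") are all Latin-1 → 50% less character data for them
  • No code changes were needed: the upgrade was transparent
  • Caveat: a Java 8 → 17 upgrade also changes the default collector (G1 since Java 9) and many JIT and library internals, so not all of the gain can be attributed to Compact Strings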

Scenario 2: Highload parser (1M lines/sec, logs):

  • Strings: JSON keys — all Latin-1
  • Savings: ~200MB/sec allocations → ~100MB/sec
  • Young GC duration reduced by 30%
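  • Caveat: a line that contains even one non-Latin-1 character is stored entirely as UTF-16, so “mostly ASCII” data can still cost 2 bytes per character. ASCII-only pieces cut out with substring() are stored as Latin-1 again, so no explicit re-creation via constructor is needed.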

Monitoring

# Check Compact Strings is enabled
java -XX:+PrintFlagsFinal -version | grep CompactStrings
# bool CompactStrings = true  {product}

# JOL — actual size
# String s = "Hello";
# GraphLayout.parseInstance(s).toPrintable()
# Java 9+: value = byte[5]  (5 bytes)
# Java 8:  value = char[5]   (10 bytes)

# GC logs — you'll notice reduced heap usage
java -Xlog:gc*:file=gc.log ...
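
The flag can also be read from inside a running application. This sketch assumes a HotSpot-based JVM, where the com.sun.management.HotSpotDiagnosticMXBean platform bean is available; the class name CompactStringsFlag is hypothetical. On a JVM that does not know the option (Java 8, for example) it returns "unknown".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: reads the CompactStrings VM option at runtime on HotSpot-based JVMs.
final class CompactStringsFlag {

    // Returns "true", "false", or "unknown" if the option or the bean is missing.
    static String value() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption("CompactStrings").getValue();
        } catch (RuntimeException e) {
            // IllegalArgumentException for an unknown option, NullPointerException if no bean
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("CompactStrings = " + value());
    }
}
```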

Best Practices for Highload

  • Compact Strings are enabled by default — no action needed
  • For maximum benefit: store data in Latin-1 when possible (ASCII keys, enum values, statuses)
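  • Don’t try to manually “downgrade” the coder: the String class already stores Latin-1 whenever every character fits
  • -XX:-CompactStrings disables the feature; this is rarely useful, except perhaps when almost all strings are UTF-16 and profiling shows that the coder checks matter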
  • Benefit most noticeable in applications with many strings: web, JSON parsing, logging, HTTP headers
  • On migration Java 8 → 9+: check reflection code that worked with char[] value
  • For ultra-low-latency: compact strings improve cache locality → fewer L1/L2 cache misses

🎯 Interview Cheat Sheet

Must know:

  • Compact Strings (JEP 254, Java 9+) — optimization: Latin-1 (1 byte/char) instead of UTF-16 (2 bytes)
  • byte[] value + byte coder instead of char[] value (Java 8 and earlier)
  • coder = 0 → Latin-1 (U+0000–U+00FF), coder = 1 → UTF-16
  • Enabled by default, flag -XX:-CompactStrings disables
  • Concatenation Latin-1 + UTF-16 → result is UTF-16 (entire string expands)
  • Savings: 40-50% of string memory in typical Enterprise applications (JSON keys, HTTP headers are all ASCII)
  • Substring from a UTF-16 string is re-checked on OpenJDK and stored as Latin-1 if every character fits

Frequent follow-up questions:

  • What memory savings do Compact Strings give? 40-50% for Latin-1 strings. As a rough estimate, if about 70% of the strings in a typical web app are Latin-1, the overall Heap shrinks by 20-30%.
  • What happens on Latin-1 + Cyrillic concatenation? The entire string becomes UTF-16. The Latin-1 part expands to 2 bytes/char.
  • Does substring() downgrade the coder from UTF-16 to Latin-1? Yes, on OpenJDK: substring() copies the characters and stores the result as Latin-1 if every character fits. The original string stays UTF-16.
  • Do you need to enable Compact Strings with a flag? No, it is enabled by default since Java 9; -XX:-CompactStrings disables it.

Red flags (DON’T say):

  • ❌ “Compact Strings is compression like gzip”: it is just a denser storage format
  • ❌ “You need to enable -XX:+CompactStrings”: it is already enabled by default
  • ❌ “Compact Strings work for Cyrillic”: Cyrillic = UTF-16, no savings
  • ❌ “substring() of a UTF-16 string is always UTF-16”: OpenJDK stores the result as Latin-1 when every character fits

Related topics:

  • [[17. What is String Encoding]]
  • [[20. How to Find Out How Much Memory a String Occupies]]
  • [[13. What substring() Does and How It Worked Before Java 7]]
  • [[22. What is String Deduplication in G1 GC]]