What are Compact Strings in Java 9+?

🟢 Junior Level

Compact Strings is an optimization introduced in Java 9 that lets a string use roughly half the memory for its character data when it contains only Latin-1 characters (English letters, digits, basic punctuation, Western European accented letters).

How it worked before Java 9: Every character was stored in 2 bytes (UTF-16), even for simple letters like “a”, “b”, “c”. The string “Hello” took 10 bytes.

How it works in Java 9+:

If a string contains only “simple” characters (Latin-1, U+0000–U+00FF) → 1 byte per character
If at least one other character is present (Cyrillic, CJK ideographs, emoji) → 2 bytes per character for the whole string, as before
Java decides which format to use automatically — you don’t need to change anything in your code

Example:

String english = "Hello World";  // 11 chars → 11 bytes (would be 22 in Java 8)
String russian = "Привет Мир";    // 10 chars → 20 bytes (UTF-16, as before)
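
The byte counts for character data can be estimated with a few lines of code. This is a minimal sketch of the sizing rule only: `CompactSizeDemo`, `fitsLatin1` and `payloadBytes` are illustrative names, not JDK APIs, and the estimate ignores object and array headers. The Cyrillic sample is written with Unicode escapes so that the file compiles regardless of source encoding.

```java
final class CompactSizeDemo {
    /** True when every char fits in one byte (U+0000..U+00FF), i.e. the Latin-1 range. */
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Estimated bytes of character data only (assumption: headers and padding are ignored). */
    static int payloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String english = "Hello World";
        // Cyrillic greeting from the example, spelled with Unicode escapes
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442 \u041C\u0438\u0440";
        for (String s : new String[] {english, russian}) {
            System.out.println(s.length() + " chars, ~" + payloadBytes(s) + " bytes of character data");
        }
    }
}
```

The rule is all-or-nothing per string: one character above U+00FF doubles the cost of every character in that string.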

Simple analogy: Imagine a suitcase. Before, Java always packed things in a large suitcase (2 bytes per character), even if you were only taking socks. Now Java checks first: if everything fits in the small suitcase (1 byte per character), it takes the small one; otherwise it takes the large one.

You don’t need to change anything. This works completely transparently.

🟡 Middle Level

How it’s implemented internally

In Java 9, the char[] value field was replaced with byte[] value + coder flag:

public final class String {
    @Stable
    private final byte[] value;  // Was char[] before
    private final byte coder;    // 0 = Latin-1, 1 = UTF-16
}

coder = 0 (LATIN1): 1 byte per character, range U+0000–U+00FF
coder = 1 (UTF16): 2 bytes per character, all Unicode characters

Automatic switching

String latin = "Hello";      // coder = LATIN1, 1 byte/char
String cyrillic = "Привет";  // coder = UTF16, 2 bytes/char

// Concatenation Latin-1 + Cyrillic → UTF-16 (entire string expands)
String mixed = latin + cyrillic; // coder = UTF16 for entire string
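
The cost of mixing can be put into numbers. The sketch below uses the standard `CharsetEncoder.canEncode` check for ISO-8859-1 as a stand-in for the JDK's internal Latin-1 test; `ConcatCostDemo` and `payloadBytes` are hypothetical names, and only character data is counted.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class ConcatCostDemo {
    /** Character-data bytes under the 1-or-2-bytes-per-char rule (headers ignored). */
    static int payloadBytes(String s) {
        // A fresh encoder per call keeps the sketch simple and stateless.
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        return latin1.canEncode(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        String latin = "Hello";
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // the Cyrillic word from the example
        String mixed = latin + cyrillic;

        int separate = payloadBytes(latin) + payloadBytes(cyrillic); // 5 + 12
        int combined = payloadBytes(mixed);                          // 11 chars * 2

        System.out.println("separate: " + separate + " bytes, combined: " + combined + " bytes");
    }
}
```

Kept apart, the two strings need 17 bytes of character data; concatenated, they need 22, because the five Latin-1 characters are widened to two bytes each.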

Table of typical mistakes

Mistake	Consequences	Solution
Thinking compact strings is compression (like gzip)	Expecting savings for any data	It’s just a more efficient storage format, only for Latin-1
Thinking `-XX:+CompactStrings` must be set explicitly	Redundant JVM flags, confusion	Enabled by default since Java 9; `-XX:-CompactStrings` disables it
Assuming an ASCII-only substring of a UTF-16 string stays UTF-16	Redundant manual copies “to save memory”	Every new String is re-checked: `substring()` stores its result as Latin-1 when all characters fit

Memory comparison

String	Java 8 (`char[]`)	Java 9+ (`byte[]`)	Savings
`"Hello"`	10 bytes	5 bytes	50%
`"Hello World"`	22 bytes	11 bytes	50%
`"Привет"`	12 bytes	12 bytes	0%
`"Hello Привет"`	24 bytes	24 bytes	0%
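`""` (empty)	~40 bytes	~40 bytes	0%

Rows with text count character data only. The empty-string row counts the whole object: String object (~24 bytes) + empty array (~16 bytes).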

When NOT to rely on compact strings

Cyrillic/Chinese/Japanese text — always UTF-16, no savings
Mixed text (Latin + Cyrillic) — entire string becomes UTF-16
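Strings with a single non-Latin-1 character (a typographic quote, €, an emoji) — the entire string is stored as UTF-16. An ASCII-only substring taken from such a string is stored as Latin-1 again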
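
When a string that looks “simple” turns out to be UTF-16, it helps to locate the character responsible. A minimal sketch follows; `Latin1Check` and `firstNonLatin1Index` are hypothetical helper names, and the samples are made up for illustration.

```java
final class Latin1Check {
    /** Index of the first char above U+00FF, or -1 if the whole string fits in Latin-1. */
    static int firstNonLatin1Index(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "status=OK",
            "Price: 10 \u20AC",   // euro sign, U+20AC
            "It\u2019s fine"      // typographic apostrophe, U+2019
        };
        for (String s : samples) {
            int i = firstNonLatin1Index(s);
            System.out.println(i < 0
                ? "fits Latin-1 (" + s.length() + " chars)"
                : String.format("needs UTF-16: U+%04X at index %d", (int) s.charAt(i), i));
        }
    }
}
```

Typographic punctuation pasted from word processors is a common reason for otherwise ASCII text to need two bytes per character.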

🔴 Senior Level

Internal Implementation — JEP 254

JEP 254: Compact Strings (Java 9) changed the internal String representation:

// Key methods with coder check (simplified; the real JDK sources differ between releases)
public int length() {
    return value.length >> coder; // LATIN1: >> 0 = no shift; UTF16: >> 1 = /2
}

public char charAt(int index) {
    if (isLatin1()) {
        return (char)(value[index] & 0xff);
    }
    return StringUTF16.getChar(value, index);
}

public boolean equals(Object anObject) {
    if (this == anObject) return true;
    if (anObject instanceof String another) {
        if (coder == another.coder) { // First check coder!
            return isLatin1()
                ? StringLatin1.equals(value, another.value)
                : StringUTF16.equals(value, another.value);
        }
    }
    return false; // Not a String, or different coder → definitely not equal
}

Almost every String method now checks coder and delegates the work to StringLatin1 or StringUTF16. Many of those helpers are HotSpot intrinsics: the JIT compiler replaces them with hand-tuned, often SIMD-based, machine code (for example, for comparing byte arrays).
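
The dispatch pattern can be shown with a toy model. The class below is not the JDK implementation: `MiniString` is a hypothetical name, the byte order chosen for the two-byte form is arbitrary, and bounds checks are omitted. It only illustrates how one `byte[]` plus a `coder` flag can serve both layouts.

```java
final class MiniString {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;
    private final byte coder;

    MiniString(char[] chars) {
        boolean latin1 = true;
        for (char c : chars) {
            if (c > 0xFF) {
                latin1 = false;
                break;
            }
        }
        if (latin1) {
            value = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                value[i] = (byte) chars[i];            // one byte per char
            }
            coder = LATIN1;
        } else {
            value = new byte[chars.length * 2];
            for (int i = 0; i < chars.length; i++) {
                value[2 * i] = (byte) (chars[i] >> 8); // high byte first (a choice made for this sketch)
                value[2 * i + 1] = (byte) chars[i];    // low byte
            }
            coder = UTF16;
        }
    }

    int length() {
        return value.length >> coder; // LATIN1: no shift; UTF16: divide by 2
    }

    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        return (char) (((value[2 * index] & 0xFF) << 8) | (value[2 * index + 1] & 0xFF));
    }

    byte coder() {
        return coder;
    }

    int storedBytes() {
        return value.length;
    }

    public static void main(String[] args) {
        MiniString a = new MiniString("Hello".toCharArray());
        MiniString b = new MiniString("\u041C\u0438\u0440".toCharArray()); // three Cyrillic letters
        System.out.println("coder=" + a.coder() + " length=" + a.length() + " bytes=" + a.storedBytes());
        System.out.println("coder=" + b.coder() + " length=" + b.length() + " bytes=" + b.storedBytes());
    }
}
```

Because the coder is decided once in the constructor and both fields are final, every later method call only needs a single branch on `coder`.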

Trade-offs

Pros:

Memory savings: 40–50% on string data in typical Enterprise applications, where most strings (JSON keys, log levels, HTTP headers) are ASCII
GC pressure: the same number of objects, but smaller arrays → the Heap fills more slowly → less frequent GC pauses
Cache locality: compact data → more fits in L1/L2 CPU cache → faster processing
Free optimization — zero code change

Cons:

Small overhead of the coder check in every method (usually negligible: the branch is well predicted, and the JIT can fold it away for constant strings)
Concatenation Latin-1 + UTF-16 → UTF-16 (entire string expands)
JDK developers had to rewrite all intrinsic optimizations for two formats
Reflection code that read the value field as char[] broke

Edge Cases

1. Coder mismatch on concatenation:

String latin = "Hello";     // Latin-1
String cyrillic = "Мир";    // UTF-16
String result = latin + " " + cyrillic; // Entire string → UTF-16
// Even "Hello " expands to UTF-16 in result
// Loss: 6 bytes → 12 bytes for Latin-1 part

2. Substring is re-compressed when possible:

String mixed = "Hello Мир";    // UTF-16 (contains Cyrillic)
String sub = mixed.substring(0, 5); // "Hello": stored as Latin-1 again
// substring() copies the characters and tries to compress them to Latin-1
// sub takes 5 bytes, not 10; mixed itself stays UTF-16 (strings are immutable)
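
In current OpenJDK sources, a substring of a UTF-16 string is built by copying the range and first trying to compress it to one byte per character. The sketch below models that step loosely on the compress helper in the internal `StringUTF16` class; `tryCompress` and `SubstringCoderSketch` are illustrative names, and the real code works on `byte[]` rather than `char[]`.

```java
final class SubstringCoderSketch {
    /**
     * Tries to narrow chars [off, off + len) to one byte each.
     * Returns null as soon as a char above U+00FF is found (the range must stay UTF-16).
     */
    static byte[] tryCompress(char[] src, int off, int len) {
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            char c = src[off + i];
            if (c > 0xFF) {
                return null;
            }
            out[i] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        String mixed = "Hello \u041C\u0438\u0440"; // "Hello " followed by three Cyrillic letters
        char[] chars = mixed.toCharArray();

        byte[] head = tryCompress(chars, 0, 5);             // the "Hello" part
        byte[] whole = tryCompress(chars, 0, chars.length); // includes Cyrillic

        System.out.println("head compressible:  " + (head != null));
        System.out.println("whole compressible: " + (whole != null));

        // Observable through the public API: equals() is true, and the simplified
        // equals() shown earlier returns false whenever the two coders differ.
        System.out.println(mixed.substring(0, 5).equals("Hello"));
    }
}
```

The last line is a useful sanity check: if an ASCII-only substring kept the UTF-16 coder, a coder-first `equals()` could never report it equal to the literal `"Hello"`.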

3. Reflection and byte[]:

// In Java 9+ the value field is a byte[], not a char[]: old reflection code that cast it to char[] fails
// The module system (Java 9+) additionally restricts access to java.lang internals
Field valueField = String.class.getDeclaredField("value");
// On JDK 16+ this throws InaccessibleObjectException unless the JVM is started with
// --add-opens java.base/java.lang=ALL-UNNAMED (JDK 9–15 only print a warning by default)
valueField.setAccessible(true);
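
For experiments, the coder can be read defensively so that the code degrades gracefully when access is denied. This is a sketch for exploration only, not something to ship: `CoderPeek` and `readCoder` are hypothetical names, and the result depends on how the JVM was started (on JDK 16+ it needs `--add-opens java.base/java.lang=ALL-UNNAMED`).

```java
import java.lang.reflect.Field;

final class CoderPeek {
    /**
     * Returns the value of String's private coder field (0 = LATIN1, 1 = UTF16),
     * or -1 when the field does not exist (Java 8) or access is denied.
     */
    static int readCoder(String s) {
        try {
            Field f = String.class.getDeclaredField("coder");
            f.setAccessible(true); // may throw InaccessibleObjectException (a RuntimeException)
            return f.getByte(s);
        } catch (ReflectiveOperationException | RuntimeException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("Hello      -> " + readCoder("Hello"));
        System.out.println("Cyrillic   -> " + readCoder("\u041C\u0438\u0440"));
        System.out.println("substring  -> " + readCoder("Hello \u041C\u0438\u0440".substring(0, 5)));
    }
}
```

Without the `--add-opens` flag all three lines print -1 on recent JDKs, which is itself a demonstration of the encapsulation described above.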

4. Coder on creation via constructor:

// new String(byte[], Charset) determines coder by content
byte[] ascii = {72, 101, 108, 108, 111}; // "Hello"
String s = new String(ascii, StandardCharsets.UTF_8); // coder = LATIN1

byte[] cyrillicBytes = {(byte)0xD0, (byte)0x9F}; // "П" in UTF-8
String s2 = new String(cyrillicBytes, StandardCharsets.UTF_8); // coder = UTF16
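
The coder follows from the decoded characters, not from the raw bytes. A small sketch (with the hypothetical names `DecodeCoderDemo` and `fitsLatin1`) decodes the same two bytes with two charsets and checks which result fits into Latin-1.

```java
import java.nio.charset.StandardCharsets;

final class DecodeCoderDemo {
    /** True when every char is in U+0000..U+00FF, so the string can be stored as Latin-1. */
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xD0, (byte) 0x9F};

        // Decoded as UTF-8: one character, U+041F (a Cyrillic capital letter)
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        // Decoded as ISO-8859-1: two characters, U+00D0 and U+009F
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length() + " char(s), fits Latin-1: " + fitsLatin1(asUtf8));
        System.out.println(asLatin1.length() + " char(s), fits Latin-1: " + fitsLatin1(asLatin1));
    }
}
```

The same input bytes produce a UTF-16 string in one case and a Latin-1 string in the other, so a wrong charset can silently change both the text and its memory layout.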

Performance

Operation	Java 8 (`char[]`)	Java 9+ Compact	Improvement
Character data, “Hello World”	22 bytes	11 bytes	-50%
Character data, “Hello World!”	24 bytes	12 bytes	-50%
`charAt()`	~1ns	~1ns (intrinsic)	Same
`equals()` (Latin-1)	~2ns	~1.5ns (SIMD on `byte[]`)	-25%
`equals()` (UTF-16)	~2ns	~2ns	Same
GC throughput	Baseline	+10–15%	Better

// Approximate values (JMH). Actual numbers depend on CPU and JVM.

Memory (64-bit JVM, CompressedOops):

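Latin-1 String: 24 bytes (String object) + 16 bytes (byte[] header) + N bytes of data ≈ 40 + N bytes, rounded up to a multiple of 8
UTF-16 String: 24 bytes + 16 bytes + 2N bytes ≈ 40 + 2N bytes, rounded up the same way
For 1M “Hello” strings: ~5MB less character data (~8MB less Heap once each array is padded to 8 bytes)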
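
The arithmetic can be written down as a small estimator. The constants are assumptions for a 64-bit HotSpot JVM with compressed oops and 8-byte object alignment (24-byte String object, 16-byte array header); `StringFootprint` and `estimate` are hypothetical names, and a tool such as JOL should be used for real measurements.

```java
final class StringFootprint {
    static final int STRING_OBJECT = 24; // assumed: header + value reference + hash + coder, padded
    static final int ARRAY_HEADER = 16;  // assumed: object header + array length

    /** Rounds up to the next multiple of 8 (assumed object alignment). */
    static long align8(long n) {
        return (n + 7) & ~7L;
    }

    /** Estimated bytes for one String with its backing array. */
    static long estimate(int chars, int bytesPerChar) {
        return STRING_OBJECT + align8(ARRAY_HEADER + (long) chars * bytesPerChar);
    }

    public static void main(String[] args) {
        long compact = estimate(5, 1); // "Hello" stored as Latin-1
        long wide = estimate(5, 2);    // "Hello" stored as UTF-16 or in a Java 8 char[]
        long count = 1_000_000L;

        System.out.println("Latin-1: " + compact + " bytes, UTF-16: " + wide + " bytes");
        System.out.println("Saved for " + count + " strings: " + (wide - compact) * count + " bytes");
    }
}
```

Because of alignment, savings come in steps of 8 bytes per string: very short strings may save nothing, while long ones approach the full 50% of character data.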

Thread Safety

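String remains fully thread-safe. Both value and coder are final, and the array is never modified after construction, so any thread can read a String without synchronization. The @Stable annotation is only a hint that lets the JIT treat the array contents as constants; it is not what provides thread safety.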

Production War Story

Scenario: Migration Java 8 → 17 in microservice (4GB Heap, Spring Boot, JSON API).

Before: Heap usage 75% (3GB), Full GC every 20 minutes, p99 latency = 15ms
After: Heap usage 55% (2.2GB), Full GC every 45 minutes, p99 latency = 10ms
JSON keys ("id", "name", "type", "status") — all Latin-1 → 50% savings
Without a single line of code — transparent upgrade

Scenario 2: Highload parser (1M lines/sec, logs):

Strings: JSON keys — all Latin-1
Savings: ~200MB/sec allocations → ~100MB/sec
Young GC duration reduced by 30%
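Problem: Lines containing even one non-Latin-1 character (for example, a Cyrillic user name) were stored entirely as UTF-16, so memory consumption for such lines stayed at Java 8 levels. ASCII-only fields extracted from them with substring() were stored as Latin-1 again, so no manual re-creation of strings was needed.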

Monitoring

# Check Compact Strings is enabled
java -XX:+PrintFlagsFinal -version | grep CompactStrings
# bool CompactStrings = true  {product}

# JOL — actual size
# String s = "Hello";
# GraphLayout.parseInstance(s).toPrintable()
# Java 9+: value = byte[5]  (5 bytes)
# Java 8:  value = char[5]   (10 bytes)

# GC logs — you'll notice reduced heap usage
java -Xlog:gc*:file=gc.log ...
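
The same flag can also be read from inside a running application. A minimal sketch, assuming a HotSpot-based JVM that exposes `com.sun.management.HotSpotDiagnosticMXBean`; the class name `CompactStringsFlag` and the method `read` are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class CompactStringsFlag {
    /** Returns "true" or "false" as reported by the JVM, or "unknown" if the option cannot be read. */
    static String read() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unknown";
            }
            return bean.getVMOption("CompactStrings").getValue();
        } catch (RuntimeException e) {
            // For example IllegalArgumentException on Java 8, where the option does not exist
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("CompactStrings = " + read());
    }
}
```

Logging this value once at startup makes it easy to spot an environment where the feature was switched off by a leftover JVM option.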

Best Practices for Highload

Compact Strings are enabled by default — no action needed
For maximum benefit: store data in Latin-1 when possible (ASCII keys, enum values, statuses)
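There is no need to “downgrade” the coder manually: every new String (including substring() results) is stored as Latin-1 automatically when all its characters fit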
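-XX:-CompactStrings disables the feature; consider it only when the application allocates overwhelmingly non-Latin-1 strings or profiling shows a regression caused by Compact Strings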
Benefit most noticeable in applications with many strings: web, JSON parsing, logging, HTTP headers
On migration Java 8 → 9+: check reflection code that worked with char[] value
For ultra-low-latency: compact strings improve cache locality → fewer L1/L2 cache misses

🎯 Interview Cheat Sheet

Must know:

Compact Strings (JEP 254, Java 9+) — optimization: Latin-1 (1 byte/char) instead of UTF-16 (2 bytes)
byte[] value + byte coder instead of char[] value (Java 8 and earlier)
coder = 0 → Latin-1 (U+0000–U+00FF), coder = 1 → UTF-16
Enabled by default, flag -XX:-CompactStrings disables
Concatenation Latin-1 + UTF-16 → result is UTF-16 (entire string expands)
Savings: 40-50% memory for typical Enterprise applications (JSON keys, HTTP headers — all ASCII)
Substring from a UTF-16 string is re-checked: if it contains only Latin-1 characters, it is stored as Latin-1

Frequent follow-up questions:

What memory savings from Compact Strings? — Up to 50% of character data for Latin-1 strings. The overall Heap reduction depends on how much of the Heap is strings; for web apps where most strings are Latin-1, figures of 20-30% are commonly quoted.
What happens on Latin-1 + Cyrillic concatenation? — Entire string becomes UTF-16. Latin-1 part expands to 2 bytes/char.
Does substring() downgrade coder from UTF-16 to Latin-1? — Yes, when possible. The substring is a new String, and the JDK stores it as Latin-1 if all its characters fit; the original string is unchanged.
Do you need to enable Compact Strings with a flag? — No, enabled by default since Java 9. -XX:-CompactStrings — disables.

Red flags (DON’T say):

❌ “Compact Strings — compression like gzip” — it’s just a more efficient storage format
❌ “Need to enable -XX:+CompactStrings” — already enabled by default
❌ “Compact Strings work for Cyrillic” — Cyrillic = UTF-16, no savings
❌ “An ASCII-only substring of a UTF-16 string stays UTF-16” — the new String is compressed to Latin-1; only the existing (immutable) string keeps its coder

Related topics:

[[17. What is String Encoding]]
[[20. How to Find Out How Much Memory a String Occupies]]
[[13. What substring() Does and How It Worked Before Java 7]]
[[22. What is String Deduplication in G1 GC]]