What are Compact Strings in Java 9+?
🟢 Junior Level
Compact Strings is an optimization introduced in Java 9 that allows strings to take half the memory if they contain only simple Latin characters (English letters, digits, basic punctuation).
How it worked before Java 9: Every character was stored in 2 bytes (UTF-16), even for simple letters like “a”, “b”, “c”. The string “Hello” took 10 bytes.
How it works in Java 9+:
- If a string contains only Latin-1 characters (U+0000–U+00FF) → 1 byte per character
- If it contains any other character (Cyrillic, CJK ideographs, emoji) → 2 bytes per UTF-16 code unit for the whole string, as before
- Java decides which format to use automatically — you don’t need to change anything in your code
Example:
String english = "Hello World"; // 11 chars → 11 bytes of character data (22 in Java 8)
String russian = "Привет Мир"; // 10 chars → 20 bytes (UTF-16, as before)
Simple analogy: Imagine a suitcase. Before, Java always packed things in a large suitcase (2 bytes per character), even if you were only taking socks. Now Java checks the contents: if every item fits the small suitcase (1 byte per character), it takes that one; if even one item does not, everything goes into the large one (2 bytes).
You don’t need to change anything. This works completely transparently.
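The selection rule above can be sketched with a small helper that predicts how many bytes of character data a string needs. This is a minimal sketch using only public API; `CompactSizeEstimator`, `fitsLatin1` and `payloadBytes` are hypothetical names, and the result ignores object and array headers.

```java
final class CompactSizeEstimator {
    /** True if every UTF-16 code unit of s fits in Latin-1 (U+0000..U+00FF). */
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Predicted size of the character data only (no object or array headers). */
    static int payloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : s.length() * 2;
    }

    public static void main(String[] args) {
        // Cyrillic text is written with \u escapes so the file compiles under any source encoding.
        String russian = "\u041F\u0440\u0438\u0432\u0435\u0442 \u041C\u0438\u0440"; // "Привет Мир"
        String mixed = "Hello \u041C\u0438\u0440";                                  // "Hello Мир"
        System.out.println(payloadBytes("Hello World")); // 11
        System.out.println(payloadBytes(russian));       // 20
        System.out.println(payloadBytes(mixed));         // 18
    }
}
```

The last call shows that one non-Latin-1 character is enough to double the cost of the whole string, including its ASCII part.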
🟡 Middle Level
How it’s implemented internally
In Java 9, the char[] value field was replaced with byte[] value + coder flag:
public final class String {
@Stable
private final byte[] value; // Was char[] before
private final byte coder; // 0 = Latin-1, 1 = UTF-16
}
- coder = 0 (LATIN1): 1 byte per character, range U+0000–U+00FF
- coder = 1 (UTF16): 2 bytes per character, all Unicode characters
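To make the byte[] + coder layout concrete, here is a toy model of it. It is a sketch, not the real java.lang.String: `CompactText` and its methods are hypothetical names, and it stores UTF-16 as big-endian byte pairs for simplicity, whereas the JDK uses the platform's native byte order.

```java
/** Toy model of the byte[] + coder layout; not the real java.lang.String. */
final class CompactText {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;
    private final byte coder;

    private CompactText(byte[] value, byte coder) {
        this.value = value;
        this.coder = coder;
    }

    /** Picks LATIN1 when every char fits in one byte, otherwise UTF16. */
    static CompactText of(CharSequence text) {
        int n = text.length();
        boolean latin1 = true;
        for (int i = 0; i < n; i++) {
            if (text.charAt(i) > 0xFF) {
                latin1 = false;
                break;
            }
        }
        if (latin1) {
            byte[] bytes = new byte[n];
            for (int i = 0; i < n; i++) {
                bytes[i] = (byte) text.charAt(i);
            }
            return new CompactText(bytes, LATIN1);
        }
        byte[] bytes = new byte[n * 2];
        for (int i = 0; i < n; i++) {
            char c = text.charAt(i);
            bytes[2 * i] = (byte) (c >> 8); // high byte first (big-endian, an assumption of this sketch)
            bytes[2 * i + 1] = (byte) c;    // low byte
        }
        return new CompactText(bytes, UTF16);
    }

    /** LATIN1: shift by 0; UTF16: shift by 1, i.e. divide by two. */
    int length() {
        return value.length >> coder;
    }

    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        return (char) (((value[2 * index] & 0xFF) << 8) | (value[2 * index + 1] & 0xFF));
    }

    byte coder() {
        return coder;
    }

    int storedBytes() {
        return value.length;
    }

    public static void main(String[] args) {
        CompactText latin = CompactText.of("Hello");
        System.out.println(latin.coder() + " " + latin.storedBytes() + " " + latin.length()); // 0 5 5
    }
}
```

The coder doubles as the shift amount in length(), which is why the two values 0 and 1 were chosen for it.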
Automatic switching
String latin = "Hello"; // coder = LATIN1, 1 byte/char
String cyrillic = "Привет"; // coder = UTF16, 2 bytes/char
// Concatenation Latin-1 + Cyrillic → UTF-16 (entire string expands)
String mixed = latin + cyrillic; // coder = UTF16 for entire string
Table of typical mistakes
| Mistake | Consequences | Solution |
|---|---|---|
| Thinking compact strings is compression (like gzip) | Expecting savings for any data | It’s just a more efficient storage format, only for Latin-1 |
| Expecting that -XX:+CompactStrings must be enabled | Confusion with flags | Enabled by default since Java 9. The flag -XX:-CompactStrings disables it |
| Expecting an ASCII substring of a UTF-16 string to stay UTF-16 | Needless defensive copying | In OpenJDK, substring() copies the range into a new array and stores it as Latin-1 whenever all its characters fit |
Memory comparison
| String | Java 8 (char[]) | Java 9+ (byte[]) | Savings | Notes |
|---|---|---|---|---|
| "Hello" | 10 bytes | 5 bytes | 50% | |
| "Hello World" | 22 bytes | 11 bytes | 50% | |
| "Привет" | 12 bytes | 12 bytes | 0% | |
| "Hello Привет" | 24 bytes | 24 bytes | 0% | |
| "" (empty) | ~40 bytes | ~40 bytes | 0% | String object (~24) + empty byte[] (~16) |
When NOT to rely on compact strings
- Cyrillic/Chinese/Japanese text — always UTF-16, no savings
- Mixed text (Latin + Cyrillic) — entire string becomes UTF-16
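- Strings with a single stray non-Latin-1 character (a curly quote, a typographic dash, an emoji): the whole string is UTF-16. The coder is re-evaluated only when a new string is created, so an ASCII-only substring() of such a string is stored as Latin-1 again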
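When a string that looks like plain ASCII turns out to be UTF-16, it can help to locate the character that forced the wider format. A minimal sketch using only public API follows; `NonLatin1Finder` and `firstNonLatin1Index` are hypothetical names.

```java
final class NonLatin1Finder {
    /** Index of the first char above U+00FF, or -1 if the string fits in Latin-1. */
    static int firstNonLatin1Index(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // \u2014 is a typographic dash: visually harmless, but outside Latin-1.
        String line = "Hello \u2014 World";
        System.out.println(firstNonLatin1Index(line));     // 6
        System.out.println(firstNonLatin1Index("status")); // -1
    }
}
```

Text pasted from word processors is a common source of such characters, since typographic quotes and dashes lie outside the U+0000–U+00FF range.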
🔴 Senior Level
Internal Implementation — JEP 254
JEP 254: Compact Strings (Java 9) changed the internal String representation:
// Key methods with coder check
public int length() {
return value.length >> coder; // LATIN1: >> 0 = no shift; UTF16: >> 1 = /2
}
public char charAt(int index) {
if (isLatin1()) {
return (char)(value[index] & 0xff);
}
return StringUTF16.getChar(value, index);
}
public boolean equals(Object anObject) {
if (this == anObject) return true;
if (anObject instanceof String another) {
if (coder == another.coder) { // First check coder!
return isLatin1()
? StringLatin1.equals(value, another.value)
: StringUTF16.equals(value, another.value);
}
}
return false; // Different coder → definitely not equal
}
Every String method now checks coder and delegates to StringLatin1 or StringUTF16. Many of these helpers are HotSpot intrinsics: the JIT compiler replaces them with hand-tuned, often SIMD-based, machine code for comparing and copying byte arrays.
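The early exit on a coder mismatch in equals() is only sound if equal character sequences always end up with the same coder, that is, if a string is never left in UTF-16 when it would fit in Latin-1. The observable consequence can be checked with public API alone; this is a sketch, and `CoderEqualityCheck` is a hypothetical name.

```java
final class CoderEqualityCheck {
    /** An ASCII-only substring of a mixed string must still equal the ASCII literal. */
    static boolean asciiSubstringEqualsLiteral() {
        String mixed = "Hello \u041C\u0438\u0440"; // "Hello Мир", contains Cyrillic
        String sub = mixed.substring(0, 5);        // "Hello"
        return sub.equals("Hello") && sub.hashCode() == "Hello".hashCode();
    }

    public static void main(String[] args) {
        System.out.println(asciiSubstringEqualsLiteral()); // true
    }
}
```

Since the equals() shown above returns false for differing coders, the result true implies that the substring carries the same coder as the literal.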
Trade-offs
Pros:
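- Memory savings: close to 50% of the character data for Latin-1 strings, which dominate typical Enterprise applications (JSON keys, log levels, HTTP headers are all ASCII)
- GC pressure: smaller backing arrays → less heap occupied and a lower allocation rate → less frequent GC pauses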
- Cache locality: compact data → more fits in L1/L2 CPU cache → faster processing
- Free optimization — zero code change
Cons:
- Small overhead on coder check in every method (JIT usually eliminates via constant folding)
- Concatenation Latin-1 + UTF-16 → UTF-16 (entire string expands)
- Had to rewrite all intrinsic optimizations for two formats
- Reflection code working with char[] broke
Edge Cases
1. Coder mismatch on concatenation:
String latin = "Hello"; // Latin-1
String cyrillic = "Мир"; // UTF-16
String result = latin + " " + cyrillic; // Entire string → UTF-16
// Even "Hello " expands to UTF-16 in result
// Loss: 6 bytes → 12 bytes for Latin-1 part
2. Substring re-evaluates the coder:
String mixed = "Hello Мир"; // UTF-16 (contains Cyrillic)
String sub = mixed.substring(0, 5); // "Hello", a new String with its own array
// OpenJDK's substring() tries to compress the copied range to Latin-1,
// so sub is stored in 5 bytes, not 10; only mixed itself stays UTF-16
3. Reflection and byte[]:
// In Java 9+ you can't just get value via reflection
// byte[] value instead of char[] — broke old reflection code
// Module system (Java 9+) additionally restricts access to internal fields
Field valueField = String.class.getDeclaredField("value");
valueField.setAccessible(true); // Requires --add-opens java.base/java.lang
4. Coder on creation via constructor:
// new String(byte[], Charset) determines coder by content
byte[] ascii = {72, 101, 108, 108, 111}; // "Hello"
String s = new String(ascii, StandardCharsets.UTF_8); // coder = LATIN1
byte[] cyrillicBytes = {(byte)0xD0, (byte)0x9F}; // "П" in UTF-8
String s2 = new String(cyrillicBytes, StandardCharsets.UTF_8); // coder = UTF16
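The coder values discussed in these edge cases can be inspected by reading the private field reflectively. The sketch below assumes an OpenJDK-style String with a `coder` field; on Java 16+ it only works when the JVM is started with --add-opens java.base/java.lang=ALL-UNNAMED, and it returns -1 otherwise. `CoderPeek` and `coderOf` are hypothetical names, and this is a diagnostic aid only, not something to ship.

```java
import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;

final class CoderPeek {
    /** Returns 0 (LATIN1), 1 (UTF16), or -1 if the field is missing or not accessible. */
    static int coderOf(String s) {
        try {
            Field coder = String.class.getDeclaredField("coder"); // absent on Java 8
            coder.setAccessible(true); // fails on Java 16+ unless java.base/java.lang is opened
            return coder.getByte(s);
        } catch (ReflectiveOperationException | RuntimeException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = {72, 101, 108, 108, 111}; // "Hello"
        String mixed = "Hello \u041C\u0438\u0440"; // "Hello Мир"
        System.out.println(coderOf(new String(ascii, StandardCharsets.UTF_8)));
        System.out.println(coderOf(mixed));
        System.out.println(coderOf(mixed.substring(0, 5)));
    }
}
```

With the package opened and Compact Strings enabled, the three calls are expected to report Latin-1, UTF-16 and Latin-1 respectively; without it, all three return -1.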
Performance
| Operation | Java 8 (char[]) | Java 9+ Compact | Improvement |
|---|---|---|---|
| Memory “Hello World” (character data) | 22 bytes | 11 bytes | -50% |
| Memory “Hello World!” (character data) | 24 bytes | 12 bytes | -50% |
| charAt() | ~1ns | ~1ns (intrinsic) | Same |
| equals() (Latin-1) | ~2ns | ~1.5ns (SIMD on byte[]) | -25% |
| equals() (UTF-16) | ~2ns | ~2ns | Same |
| GC throughput | Baseline | +10–15% | Better |
Approximate values (JMH); actual numbers depend on the CPU and JVM.
Memory (64-bit JVM, CompressedOops):
- Latin-1 String: 24 bytes (String object) + 16 bytes (byte[] header) + N bytes of data ≈ 40 + N bytes
- UTF-16 String: 24 bytes + 16 bytes + 2N bytes ≈ 40 + 2N bytes
- For 1M “Hello” strings: ~5MB less character data in Heap (about 8MB once each array is padded to an 8-byte boundary)
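The formulas above can be turned into a small calculator. This is a sketch under stated assumptions: a 64-bit HotSpot JVM with CompressedOops, a 24-byte String object, a 16-byte array header and 8-byte object alignment. `StringFootprint` and its members are hypothetical names; a tool such as JOL gives the real layout.

```java
final class StringFootprint {
    static final int STRING_OBJECT_BYTES = 24; // assumed: header + fields, padded
    static final int ARRAY_HEADER_BYTES = 16;  // assumed: header + length field

    /** Rounds n up to the next multiple of 8 (assumed object alignment). */
    static long align8(long n) {
        return (n + 7) & ~7L;
    }

    /** Estimated total bytes for a String of the given length and coder. */
    static long estimate(int chars, boolean latin1) {
        long payload = latin1 ? chars : 2L * chars;
        return STRING_OBJECT_BYTES + align8(ARRAY_HEADER_BYTES + payload);
    }

    public static void main(String[] args) {
        long compact = estimate(5, true);  // "Hello" as Latin-1
        long wide = estimate(5, false);    // "Hello" as UTF-16 (Java 8 layout)
        System.out.println(compact);                      // 48
        System.out.println(wide);                         // 56
        System.out.println((wide - compact) * 1_000_000); // 8000000
    }
}
```

For very short strings the fixed 40 bytes of overhead dominate, so the relative saving per object is well below 50%; it approaches 50% only as N grows.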
Thread Safety
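String remains fully thread-safe. Both value and coder are final fields, and String never modifies or exposes the array after construction, so there are no race conditions when reading from multiple threads. The @Stable annotation on value is a JDK-internal hint that lets the JIT treat the array contents as constant once set; it is an optimization aid, not the source of immutability.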
Production War Story
Scenario: Migration Java 8 → 17 in microservice (4GB Heap, Spring Boot, JSON API).
- Before: Heap usage 75% (3GB), Full GC every 20 minutes, p99 latency = 15ms
- After: Heap usage 55% (2.2GB), Full GC every 45 minutes, p99 latency = 10ms
- JSON keys ("id", "name", "type", "status"): all Latin-1 → 50% savings
- Without a single line of code: transparent upgrade
Scenario 2: Highload parser (1M lines/sec, logs):
- Strings: JSON keys — all Latin-1
- Savings: ~200MB/sec allocations → ~100MB/sec
- Young GC duration reduced by 30%
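- Caveat: a line containing even one non-Latin-1 character is stored entirely as UTF-16, so such lines see no savings. Fields cut out of it with substring() are compressed back to Latin-1 when they contain only Latin-1 characters, so no explicit copying via a constructor is needed.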
Monitoring
# Check Compact Strings is enabled
java -XX:+PrintFlagsFinal -version | grep CompactStrings
# bool CompactStrings = true {product}
# JOL — actual size
# String s = "Hello";
# GraphLayout.parseInstance(s).toPrintable()
# Java 9+: value = byte[5] (5 bytes)
# Java 8: value = char[5] (10 bytes)
# GC logs — you'll notice reduced heap usage
java -Xlog:gc*:file=gc.log ...
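The same flag can also be read from inside a running application. The sketch below assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; `CompactStringsFlag` and `read` are hypothetical names, and the method returns an empty Optional when the flag is unknown (for example on Java 8).

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

final class CompactStringsFlag {
    /** Value of the CompactStrings VM flag, or empty if it cannot be determined. */
    static Optional<Boolean> read() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty();
            }
            String value = bean.getVMOption("CompactStrings").getValue();
            return Optional.of(Boolean.parseBoolean(value));
        } catch (RuntimeException e) {
            // e.g. IllegalArgumentException when the JVM does not know this flag
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(read());
    }
}
```

Logging this once at startup makes it easy to spot an environment where someone passed -XX:-CompactStrings.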
Best Practices for Highload
- Compact Strings are enabled by default — no action needed
- For maximum benefit: store data in Latin-1 when possible (ASCII keys, enum values, statuses)
- Don’t try to manually “downgrade” coder: String already picks the most compact coder whenever a new string is created
- -XX:-CompactStrings to disable (but why?)
- Benefit most noticeable in applications with many strings: web, JSON parsing, logging, HTTP headers
- On migration Java 8 → 9+: check reflection code that worked with char[] value
- For ultra-low-latency: compact strings improve cache locality → fewer L1/L2 cache misses
🎯 Interview Cheat Sheet
Must know:
- Compact Strings (JEP 254, Java 9+) — optimization: Latin-1 (1 byte/char) instead of UTF-16 (2 bytes)
- byte[] value + byte coder instead of char[] value (Java 8 and earlier)
- coder = 0 → Latin-1 (U+0000–U+00FF), coder = 1 → UTF-16
- Enabled by default, flag -XX:-CompactStrings disables
- Concatenation Latin-1 + UTF-16 → result is UTF-16 (entire string expands)
- Savings: 40-50% memory for typical Enterprise applications (JSON keys, HTTP headers: all ASCII)
- Substring of a UTF-16 string is a new String: OpenJDK stores it as Latin-1 if all its characters fit, which is why equals() can compare coders first
Frequent follow-up questions:
- What memory savings from Compact Strings? — 40-50% for Latin-1 strings. In typical web app 70% of strings are Latin-1, overall Heap reduction 20-30%.
- What happens on Latin-1 + Cyrillic concatenation? — Entire string becomes UTF-16. Latin-1 part expands to 2 bytes/char.
- Does substring() downgrade coder from UTF-16 to Latin-1? In OpenJDK, yes for the new string: substring() copies the range and stores it as Latin-1 if every character fits. The original string is immutable and keeps its UTF-16 coder.
- Do you need to enable Compact Strings with a flag? No, enabled by default since Java 9. -XX:-CompactStrings disables.
Red flags (DON’T say):
- ❌ “Compact Strings — compression like gzip” — it’s just a more efficient storage format
- ❌ “Need to enable -XX:+CompactStrings”: already enabled by default
- ❌ “Compact Strings work for Cyrillic”: Cyrillic = UTF-16, no savings
- ❌ “An existing String changes its coder”: strings are immutable; the coder is chosen once, when the String is created
Related topics:
- [[17. What is String Encoding]]
- [[20. How to Find Out How Much Memory a String Occupies]]
- [[13. What substring() Does and How It Worked Before Java 7]]
- [[22. What is String Deduplication in G1 GC]]