What is String Deduplication in G1 GC?

🟢 Junior Level

String Deduplication is a feature of the G1 garbage collector that automatically finds identical strings in memory and merges their internal arrays to save memory.

How it works:

The JVM notices that two different String objects contain the same text
Instead of keeping two identical byte arrays, it makes both strings reference a single array
This happens automatically; you don’t need to change any code
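
The points above imply that deduplication is invisible to application code. A minimal sketch, assuming a hypothetical helper `fresh` that forces a new backing array, shows what stays the same whether or not the flag is on: the two objects remain distinct, and only their hidden internal array may end up shared.

```java
final class DedupTransparencyDemo {
    // Hypothetical helper: builds a String backed by a freshly allocated array,
    // so the result never shares storage with the literal passed in.
    static String fresh(String text) {
        return new String(text.toCharArray());
    }

    public static void main(String[] args) {
        String a = fresh("duplicate");
        String b = fresh("duplicate");
        System.out.println(a.equals(b)); // true: same text
        System.out.println(a == b);      // false: two distinct String objects
        // With -XX:+UseStringDeduplication the JVM may later point both objects
        // at one internal array; neither line above changes its result.
    }
}
```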

How to enable:

java -XX:+UseG1GC -XX:+UseStringDeduplication -jar app.jar

Simple analogy: Imagine you have 100 copies of the same book in a library. Deduplication is when the librarian leaves one book on the shelf and gives all other readers a reference to it. Same text, but space is saved.

🟡 Middle Level

How it works internally

Scanning: During GC (evacuation phase), G1 identifies String objects in the regions being collected
Queue: References to candidate strings are placed in a deduplication queue
Background thread: A separate thread computes a hash of each candidate's byte[] and looks for a match in the deduplication table
Merging: If an identical array is found, the string's value field is redirected to the existing array via an atomic operation
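
The four steps can be sketched as a toy model in plain Java. This is not HotSpot code: `ToyString`, `Key`, and `ToyDedup` are hypothetical names, and a `HashMap` stands in for the native deduplication table (its `hashCode` then `equals` lookup mirrors the hash followed by a byte-by-byte check).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class ToyDedup {
    /** Stand-in for java.lang.String: just a mutable reference to its bytes. */
    static final class ToyString {
        byte[] value;
        ToyString(String text) { value = text.getBytes(StandardCharsets.ISO_8859_1); }
    }

    /** Wraps a byte[] so the map compares arrays by content, not identity. */
    private static final class Key {
        final byte[] bytes;
        Key(byte[] bytes) { this.bytes = bytes; }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes);
        }
    }

    private final Queue<ToyString> queue = new ArrayDeque<>();
    private final Map<Key, byte[]> table = new HashMap<>();

    /** Steps 1 and 2: a candidate found during collection is queued. */
    void enqueue(ToyString s) { queue.add(s); }

    /** Steps 3 and 4: drains the queue; returns how many strings were redirected. */
    int drain() {
        int redirected = 0;
        ToyString s;
        while ((s = queue.poll()) != null) {
            // Registers the array if its content is new, otherwise returns the shared one.
            byte[] shared = table.putIfAbsent(new Key(s.value), s.value);
            if (shared != null && shared != s.value) {
                s.value = shared; // redirect to the existing array
                redirected++;
            }
        }
        return redirected;
    }
}
```

Enqueuing two `ToyString("INFO")` instances and one `ToyString("WARN")` and then calling `drain()` redirects exactly one of them; the real collector additionally uses weak references so that the table does not keep dead arrays alive.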

Difference from String Pool — detailed comparison

Characteristic	String Pool (`intern()`)	String Deduplication
What combines	`String` objects	Internal `byte[]` arrays
When	On `intern()` call (synchronous)	During GC (asynchronous)
Management	Manual (need to call `intern()`)	Automatic (JVM flag)
Works with any GC	Yes	G1 and Shenandoah; also Serial, Parallel, and ZGC since JDK 18
Effect on `==`	Makes `==` true	`==` remains false (objects different)
CPU overhead	On each `intern()`	Background thread, ~2–5% CPU
Table memory	StringTable in native memory; the interned `String` objects themselves stay in Heap	Native memory (~10–50MB)
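
The `==` row of the comparison can be checked directly. A minimal sketch, assuming a hypothetical helper `fresh` that creates non-pooled strings:

```java
final class InternVsDedup {
    // Hypothetical helper: returns a new, non-pooled String with its own array.
    static String fresh(String text) {
        return new String(text.toCharArray());
    }

    public static void main(String[] args) {
        String a = fresh("status=OK");
        String b = fresh("status=OK");
        System.out.println(a == b);                   // false: distinct objects
        System.out.println(a.intern() == b.intern()); // true: one canonical object
        // Deduplication never changes the first result; it only affects
        // which internal array the two objects reference.
    }
}
```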

When to enable

Profiler shows many duplicate strings in Heap
You can’t use intern() (legacy code, complex logic, no code access)
Application runs on G1 GC (default in Java 9+)
Heap > 4GB and strings occupy significant portion
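
A heap profiler is the right tool for the first criterion. As a rough application-level cross-check, a sample of strings (for example, field values from parsed requests) can be scanned for repeated content. The class and method names below are hypothetical, and the ratio counts characters rather than real heap bytes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicateEstimator {
    /** Fraction of characters that belong to repeated copies of a string (0.0 to 1.0). */
    static double duplicateCharRatio(List<String> strings) {
        Map<String, Integer> counts = new HashMap<>();
        long total = 0;
        for (String s : strings) {
            counts.merge(s, 1, Integer::sum);
            total += s.length();
        }
        long redundant = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Every copy beyond the first is redundant.
            redundant += (long) (e.getValue() - 1) * e.getKey().length();
        }
        return total == 0 ? 0.0 : (double) redundant / total;
    }
}
```

For the sample `["INFO", "INFO", "INFO", "WARN"]` the method returns 0.5: two of the three "INFO" copies (8 of 16 characters) are redundant.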

Table of typical mistakes

Mistake	Consequences	Solution
Expecting instant results	“Enabled, but memory didn’t free”	Deduplication happens during GC, not instantly; needs several GC cycles
Enabling without monitoring	Don’t know if it works at all	Check via `-XX:+PrintStringDeduplicationStatistics` (JDK 8), `-Xlog:gc+stringdedup=debug` (JDK 9–16), or `-Xlog:stringdedup*=debug` (JDK 17+)
Expecting `==` to become true	Comparison logic broken	Deduplication doesn’t change object references, only internal arrays
Enabling on ZGC before JDK 18	Deduplication does not happen	ZGC supports String Deduplication only since JDK 18; upgrade, or use G1 or Shenandoah

When NOT to use

Few duplicates: if strings are mostly unique — overhead without benefit
Short-lived strings: die before reaching deduplication queue
ZGC before JDK 18: not supported (ZGC gained String Deduplication in JDK 18; on older JDKs use G1 or Shenandoah)
Ultra-low-latency: 2–5% CPU overhead may be critical

🔴 Senior Level

Internal Implementation — G1 GC Deduplication Pipeline

┌──────────────────────────────────────────────────────────────┐
│  GC (Evacuation Phase)                                        │
│  ├── Identify String objects in collection set                │
│  ├── Age filter: StringDeduplicationAgeThreshold (default 3)   │
│  ├── Enqueue candidates to dedup queue                        │
│  └── Continue evacuation                                      │
├──────────────────────────────────────────────────────────────┤
│  Deduplication Thread (concurrent, low-priority)              │
│  ├── Dequeue String references                                │
│  ├── Compute hash of byte[] value (not String.hashCode)      │
│  ├── Lookup in deduplication table (native memory hashtable)  │
│  ├── If found: byte-by-byte comparison to confirm             │
│  ├── If match: CAS redirect value reference → shared array    │
│  └── If not found: add to table                               │
└──────────────────────────────────────────────────────────────┘

Deduplication Table:

Native hash table (outside Java Heap, in C-heap)
Stores hashes of byte[] arrays + weak references
On hash match — byte-by-byte comparison to confirm (collision protection)
Reference update via CAS (Compare-And-Swap) — thread-safe without locks

Marking Algorithm:

G1 uses concurrent marking
Strings are marked during marking phase
Age threshold (default 3 GC cycles) — only strings that “survived” enough are deduplicated
This filters out short-lived strings that die before queue processing
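
The effect of the age threshold can be illustrated with a small simulation. This is a simplified model rather than G1's actual bookkeeping: `Tracked` and `collect` are hypothetical names, and each call to `collect` stands for one collection that the listed objects survive.

```java
import java.util.ArrayList;
import java.util.List;

final class AgeFilterModel {
    static final int AGE_THRESHOLD = 3; // mirrors the default StringDeduplicationAgeThreshold

    static final class Tracked {
        final String label;
        int age;          // number of simulated collections survived
        boolean enqueued; // a string becomes a candidate only once
        Tracked(String label) { this.label = label; }
    }

    /** Simulates one collection: survivors age by one; those reaching the threshold are enqueued. */
    static List<Tracked> collect(List<Tracked> survivors) {
        List<Tracked> candidates = new ArrayList<>();
        for (Tracked t : survivors) {
            t.age++;
            if (!t.enqueued && t.age >= AGE_THRESHOLD) {
                t.enqueued = true;
                candidates.add(t);
            }
        }
        return candidates;
    }
}
```

A string that dies after one or two collections never appears in any returned list, which is the filtering effect described above.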

Trade-offs

Pros:

Transparency: no code change needed — only JVM flag
Savings: 10–20% Heap for text-heavy applications
Safety: no risk of String Pool corruption (data doesn’t change)
Works with any strings, not just interned ones

Cons:

CPU overhead: hashing + lookup + byte comparison (~2–5% CPU)
Memory: deduplication table (~10–50MB native memory)
Collector support depends on the JDK: G1 (since JDK 8u20) and Shenandoah first; Serial, Parallel, and ZGC only since JDK 18
Delay: deduplication happens after several GC cycles (age threshold)
No cross-coder matches: a Latin-1 array and a UTF-16 array never compare equal (in practice equal strings share a coder, so this rarely costs anything)

Edge Cases (minimum 3)

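1. Strings with different coder are never merged (but equal strings normally share a coder):

String s1 = "Hello"; // Latin-1, byte[5]
String s2 = new String("Hello".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16);
// s2 is decoded from 12 UTF-16 bytes (BOM + 5 chars), but Compact Strings still stores it as Latin-1, byte[5]
// A String gets the UTF-16 coder only if it contains a char outside Latin-1, e.g. "Hello€"
// So a Latin-1 array and a UTF-16 array hold different text and never match; s1 and s2 above can be deduplicated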

2. Very short-lived strings:

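void process() {
    // Literals alone would share one pooled String, so build fresh copies
    String temp = new String("duplicate".toCharArray());  // new array in Eden
    String temp2 = new String("duplicate".toCharArray()); // another new array in Eden
    // Both strings become garbage before surviving 3 collections (the age threshold),
    // so they are never enqueued for deduplication
}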

3. Race condition on redirect:

// Dedup thread performs CAS:
// if (CAS(oldValue, sharedValue)) → success
// If two threads try to redirect simultaneously, only one succeeds
// The other thread sees that value is already redirected and skips
// A thread reading s.value sees either the old or the shared array; both hold identical bytes
// @Stable: JDK-internal annotation telling the JIT that a field changes at most once, from its default value to a non-default one.

4. String Pool vs Deduplication — interaction:

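String s1 = new String("Hello".toCharArray());
String s2 = new String("Hello".toCharArray());
// (new String("Hello") would not do: that constructor reuses the literal's array)
// s1.value and s2.value: different byte[] arrays (both Latin-1, byte[5])
// After deduplication: s1.value and s2.value → SAME byte[]
// But s1 != s2 (String objects are different!)
// Savings: one byte[5] instead of two, about 24 bytes with header and padding on 64-bit HotSpot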
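
The saving from sharing one array is larger than its payload, because every array also carries a header and alignment padding. A back-of-the-envelope sketch, assuming a 64-bit HotSpot JVM with compressed class pointers (16-byte array header) and the default 8-byte object alignment; both constants are assumptions and `ArrayFootprint` is a hypothetical name:

```java
final class ArrayFootprint {
    static final int ARRAY_HEADER_BYTES = 16; // assumption: 64-bit HotSpot, compressed class pointers
    static final int ALIGNMENT = 8;           // assumption: default object alignment

    /** Estimated heap size of a byte[] of the given length, header and padding included. */
    static long byteArrayBytes(int length) {
        long raw = ARRAY_HEADER_BYTES + (long) length;
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT; // round up to the alignment
    }
}
```

Under these assumptions `byteArrayBytes(5)` is 24, so short strings benefit mostly from the saved header, while for long strings the payload dominates.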

5. Very long strings:

String huge = "A".repeat(10_000_000); // 10MB
String huge2 = "A".repeat(10_000_000); // 10MB
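// Byte-by-byte comparison of 10MB is expensive (on the order of milliseconds)
// It runs on the concurrent deduplication thread, so it costs CPU and delays other candidates rather than lengthening the pause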
// In practice: long strings may be identical in log aggregators or data pipelines
// (same JSON payloads), but this is rare for typical web applications.

Performance

Metric	Without dedup	With dedup	Delta
Heap usage	4.0 GB	3.2 GB	-20%
GC pause (avg)	50ms	55ms	+10%
CPU overhead	Baseline	+2–5%	Small
Young GC	10ms	10ms	No change
Mixed GC	50ms	55ms	+5ms
Native memory (table)	0MB	10–50MB	Extra

Memory savings (real scenarios):

JSON API service: 15–25% strings are duplicates (keys, status values)
Log aggregator: 30–40% duplicates (levels, service names, host names)
ETL pipeline: 5–10% duplicates (category names, country codes)

Thread Safety

Deduplication is thread-safe:

value reference update via CAS (atomic operation)
value field is @Stable, but JVM allows redirect within GC
Reading threads always see consistent value (memory barriers during GC)
Deduplication thread — single (single-threaded), no contention between dedup threads
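
The compare-and-swap redirect described above can be modeled with `AtomicReference`. This is an illustration of the CAS idea only; `java.lang.String` has no such field, and `ToyString` and `redirect` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

final class CasRedirectModel {
    /** Stand-in for a String whose value reference can be swapped atomically. */
    static final class ToyString {
        final AtomicReference<byte[]> value;
        ToyString(byte[] bytes) { value = new AtomicReference<>(bytes); }
    }

    /** Redirects s to shared only if s still points at expected; returns true on success. */
    static boolean redirect(ToyString s, byte[] expected, byte[] shared) {
        return s.value.compareAndSet(expected, shared);
    }
}
```

A second `redirect` call with the same stale `expected` array returns false, which models the "other thread skips" case; readers calling `value.get()` see either the old or the shared array, never a partially updated state.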

Production War Story

Scenario 1: JSON API service (G1 GC, 8GB Heap, Spring Boot, 50K RPS):

Without deduplication: Heap usage 6.5GB, Full GC every 30 minutes, p99 latency = 25ms
With deduplication: Heap usage 5.2GB, Full GC every 50 minutes, p99 latency = 20ms
CPU overhead: +3% (acceptable)
Stats: deduplicated 2.3GB of strings, 850K unique byte[] arrays merged
Result: reduced instances from 10 to 8 (saving $15K/month)

Scenario 2: Log aggregator (1M log lines/min, G1 GC, 12GB Heap):

Fields level, service, host — many duplicates ("INFO", "UserService", "host-1")
Without deduplication: 12GB Heap
With deduplication: 9GB Heap
Savings: 3GB → fewer instances in cluster
Problem: CPU overhead grew to 7% due to huge number of strings. Fix: -XX:StringDeduplicationAgeThreshold=5 (increased age threshold, fewer candidates → less CPU).

Scenario 3 (anti-pattern): Team enabled deduplication “just in case” for an app with unique strings (UUIDs, hashes, timestamps). CPU overhead +4%, memory savings 0.5%. Disabled it.

Monitoring

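# Enable deduplication statistics
# JDK 8:
-XX:+PrintStringDeduplicationStatistics
-XX:+PrintGC
# JDK 9–16 (unified logging):
-Xlog:gc+stringdedup=debug
# JDK 17+:
-Xlog:stringdedup*=debug

# The GC log then reports, per deduplication cycle, how many strings were
# inspected and deduplicated and how much memory was saved.
# The exact line format differs between JDK versions.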

# jcmd at runtime (there is no dedicated string deduplication command)
jcmd <pid> VM.flags   # confirm that UseStringDeduplication is enabled
# JDK 17+: turn on deduplication logging without a restart
jcmd <pid> VM.log output="file=dedup.log" what="stringdedup*=debug"

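# JFR (Java Flight Recorder)
java -XX:StartFlightRecording=filename=recording.jfr ...
# The recording captures JVM flag values, which confirms whether UseStringDeduplication was on
# List the event types actually recorded by your JDK with: jfr summary recording.jfr
# Open the recording in JDK Mission Control to inspect heap and GC behavior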

# Configure age threshold
-XX:StringDeduplicationAgeThreshold=3  # Default 3 GC cycles
# Increase to 5–10 if CPU overhead is too high
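
Whether the flags are actually in effect can also be checked from inside the application. A minimal sketch using `HotSpotDiagnosticMXBean` from the `com.sun.management` package (HotSpot-specific, so the code falls back to "unavailable" elsewhere); `DedupFlagCheck` and `flagValue` are hypothetical names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class DedupFlagCheck {
    /** Returns the current value of a HotSpot flag, or "unavailable" if this JVM does not expose it. */
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unavailable";
            }
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            return "unavailable"; // flag (or the MXBean itself) unknown to this JVM
        }
    }

    public static void main(String[] args) {
        System.out.println("UseStringDeduplication = " + flagValue("UseStringDeduplication"));
        System.out.println("StringDeduplicationAgeThreshold = "
                + flagValue("StringDeduplicationAgeThreshold"));
    }
}
```

This only reports configuration; whether deduplication saves memory still has to be read from the GC log.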

Best Practices for Highload

Enable when profiler shows > 10% duplicate strings in Heap
Don’t use as replacement for intern() for highly duplicate long-lived strings (dictionaries, configs) — intern() is more efficient
Monitor CPU overhead — if > 5%, increase -XX:StringDeduplicationAgeThreshold=5
Combine: intern() for dictionary data (enum values, status codes) + deduplication for everything else
For ZGC: String Deduplication is supported since JDK 18; on older JDKs consider G1 or Shenandoah (-XX:+UseShenandoahGC -XX:+UseStringDeduplication)
For max savings: lower -XX:StringDeduplicationAgeThreshold (for example to 1 or 2) so strings become candidates sooner, and measure the extra CPU this costs
Don’t enable for short-lived apps (CLI, batch jobs < 1min) — won’t have time to work
For ultra-low-latency: benchmark with and without deduplication; sometimes 5ms GC pause increase is critical
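
The "combine" practice can look like the sketch below: fields with a small fixed vocabulary are interned when the object is built, while mostly unique fields are left alone and any accidental duplicates among them are left to the collector. `LogRecord` is a hypothetical class used only for illustration.

```java
final class LogRecord {
    final String level;   // small fixed vocabulary: canonicalized eagerly via intern()
    final String message; // mostly unique: left as-is; the GC may deduplicate its array

    LogRecord(String level, String message) {
        this.level = level.intern();
        this.message = message;
    }
}
```

With this split, every record's `level` points at one canonical String object per distinct value, so both the object and its array are shared immediately instead of after several GC cycles.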

🎯 Interview Cheat Sheet

Must know:

String Deduplication — G1 GC feature, automatically merges identical string byte[] arrays
Enabled with flag: -XX:+UseG1GC -XX:+UseStringDeduplication (no code change needed)
Difference from String Pool: deduplication merges byte[], not String objects; == remains false
Works asynchronously during GC, age threshold (default 3 GC cycles) — filters out short-lived strings
CPU overhead: ~2-5%, native memory for dedup table: ~10-50MB
Doesn’t deduplicate strings with different coder (Latin-1 vs UTF-16)

Frequent follow-up questions:

How is deduplication different from intern()? — intern() merges String objects (== becomes true), requires code. Deduplication — only byte[] arrays, == remains false, no code change.
What memory savings? — 10-20% Heap for apps with duplicate strings. JSON API: 15-25%, log aggregator: 30-40%.
Does it work with ZGC? Yes, since JDK 18 (Serial and Parallel gained it in the same release). On older JDKs only G1 and Shenandoah support it.
How to reduce CPU overhead? — Increase -XX:StringDeduplicationAgeThreshold=5 — fewer candidates, less CPU.

Red flags (DON’T say):

❌ “Deduplication makes == true” — String objects remain different, only byte[] is merged
❌ “This replaces intern()” — no, intern() is more efficient for dictionary data
❌ “Works instantly” — happens during GC, needs several cycles
❌ “Works with any GC on any JDK”: before JDK 18 only G1 and Shenandoah supported it

Related topics:

[[1. How String Pool Works]]
[[3. When to Use intern()]]
[[19. What are Compact Strings in Java 9+]]
[[20. How to Find Out How Much Memory a String Occupies]]