Question 22 · Section 12

What is String Deduplication in G1 GC?

🟢 Junior Level

String Deduplication is a feature of the G1 garbage collector that automatically finds identical strings in memory and merges their internal arrays to save memory.

How it works:

  1. JVM notices that two different String objects contain the same text
  2. Instead of storing two identical byte arrays, it makes both strings reference one array
  3. This happens automatically; you don't need to change code

How to enable:

java -XX:+UseG1GC -XX:+UseStringDeduplication -jar app.jar
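A minimal sketch of what the steps above mean at the Java level. The class name `DedupIdentityDemo` and the helper `fresh` are hypothetical. The internal array sharing is not observable from plain Java code, so the example only shows what stays the same with or without the flag: equal text, distinct objects.

```java
public class DedupIdentityDemo {
    // Hypothetical helper: builds a String backed by its own freshly allocated array.
    static String fresh(String text) {
        return new String(text.toCharArray());
    }

    public static void main(String[] args) {
        String a = fresh("order-status:SHIPPED");
        String b = fresh("order-status:SHIPPED");

        System.out.println(a.equals(b)); // true: same text
        System.out.println(a == b);      // false: two String objects, with or without deduplication
    }
}
```

Running it with and without `-XX:+UseStringDeduplication` should print the same two lines; only the heap footprint of long-lived duplicates is expected to differ.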

Simple analogy: Imagine you have 100 copies of the same book in a library. Deduplication is when the librarian leaves one book on the shelf and gives all other readers a reference to it. Same text, but space is saved.

Difference from String Pool:

| | String Pool (intern()) | String Deduplication |
| --- | --- | --- |
| What combines | String objects | Internal byte[] arrays |
| Need code change? | Yes (str.intern()) | No (only JVM flag) |
| When it works | On intern() call | During GC (in background) |
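The String Pool side of this comparison can be observed directly. The sketch below uses a hypothetical class `InternVsDedupDemo`; it shows that `intern()` changes which object a reference points to, which deduplication never does.

```java
public class InternVsDedupDemo {
    // Returns true when both arguments resolve to the same pooled String object.
    static boolean sameObjectAfterIntern(String x, String y) {
        return x.intern() == y.intern();
    }

    public static void main(String[] args) {
        String a = new String("ACTIVE".toCharArray());
        String b = new String("ACTIVE".toCharArray());

        System.out.println(a == b);                      // false: distinct String objects
        System.out.println(sameObjectAfterIntern(a, b)); // true: intern() returns the pooled instance
    }
}
```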


🟡 Middle Level

How it works internally

  1. Scanning: During GC (evacuation phase) G1 marks String objects in collected regions
  2. Queue: References to candidate strings are placed in deduplication queue
  3. Background thread: Separate thread computes hash of byte[] and searches for matches in deduplication table
  4. Merging: If identical array is found, the value field is redirected to the existing array via atomic operation
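The four steps can be modelled in ordinary Java. This is a simplified sketch, not HotSpot's implementation: `FakeString`, `DedupPipelineModel`, `enqueue` and `drain` are hypothetical names, the "GC" is just a method call, and the table lives on the Java heap rather than in native memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class DedupPipelineModel {
    // Stand-in for java.lang.String: only the backing array matters here.
    static final class FakeString {
        byte[] value;
        FakeString(String text) {
            this.value = text.getBytes(StandardCharsets.ISO_8859_1);
        }
    }

    private final Queue<FakeString> queue = new ArrayDeque<>();
    // Hash of array contents -> arrays already seen with that hash.
    private final Map<Integer, List<byte[]>> table = new HashMap<>();

    // Steps 1-2: the collector hands over a candidate.
    void enqueue(FakeString s) {
        queue.add(s);
    }

    // Steps 3-4: hash, look up, confirm byte by byte, redirect. Returns the number of redirects.
    int drain() {
        int merged = 0;
        FakeString s;
        while ((s = queue.poll()) != null) {
            List<byte[]> bucket =
                    table.computeIfAbsent(Arrays.hashCode(s.value), h -> new ArrayList<>());
            byte[] match = null;
            for (byte[] known : bucket) {
                if (Arrays.equals(known, s.value)) { // guards against hash collisions
                    match = known;
                    break;
                }
            }
            if (match == null) {
                bucket.add(s.value);   // first time this content is seen
            } else if (match != s.value) {
                s.value = match;       // redirect to the shared array
                merged++;
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        DedupPipelineModel model = new DedupPipelineModel();
        FakeString a = new FakeString("INFO");
        FakeString b = new FakeString("INFO");
        FakeString c = new FakeString("WARN");
        model.enqueue(a);
        model.enqueue(b);
        model.enqueue(c);

        System.out.println(model.drain());      // 1: only b is redirected
        System.out.println(a.value == b.value); // true: a and b now share one array
    }
}
```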

Difference from String Pool: detailed comparison

| Characteristic | String Pool (intern()) | String Deduplication |
| --- | --- | --- |
| What combines | String objects | Internal byte[] arrays |
| When | On intern() call (synchronous) | During GC (asynchronous) |
| Management | Manual (need to call intern()) | Automatic (JVM flag) |
| Works with any GC | Yes | G1 and Shenandoah; also Serial, Parallel, and ZGC since JDK 18 |
| Effect on == | Makes == true | == remains false (objects different) |
| CPU overhead | On each intern() | Background thread, ~2–5% CPU |
| Table memory | StringTable in Heap (~32 bytes/entry) | Native memory (~10–50MB) |

When to enable

  • Profiler shows many duplicate strings in Heap
  • You can't use intern() (legacy code, complex logic, no code access)
  • Application runs on G1 GC (default in Java 9+)
  • Heap > 4GB and strings occupy significant portion
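Before reaching for a heap profiler, a rough first estimate of the duplicate share can be taken from a sample of strings the application already holds. The sketch below is an assumption-laden stand-in for real heap analysis: `DuplicateRatio` and `duplicateFraction` are hypothetical names, and the sample must be supplied by the caller.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateRatio {
    // Fraction of strings in the sample whose text was already seen earlier in the sample.
    static double duplicateFraction(List<String> sample) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        Set<String> seen = new HashSet<>();
        int duplicates = 0;
        for (String s : sample) {
            if (!seen.add(s)) { // add() returns false when the text is already present
                duplicates++;
            }
        }
        return (double) duplicates / sample.size();
    }

    public static void main(String[] args) {
        List<String> sample = List.of("INFO", "INFO", "WARN", "INFO", "host-1", "host-1");
        System.out.println(duplicateFraction(sample)); // 0.5: three of six entries repeat earlier text
    }
}
```

A high fraction on a representative sample is only a hint; a heap dump remains the reliable way to confirm how much memory the duplicates occupy.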

Table of typical mistakes

| Mistake | Consequences | Solution |
| --- | --- | --- |
| Expecting instant results | "Enabled, but memory didn't free" | Deduplication happens during GC, not instantly; needs several GC cycles |
| Enabling without monitoring | Don't know if it works at all | Check via -XX:+PrintStringDeduplicationStatistics (JDK 8) or unified logging with the stringdedup tag (JDK 9+) |
| Expecting == to become true | Comparison logic broken | Deduplication doesn't change object references, only internal arrays |
| Enabling on ZGC before JDK 18 | Doesn't work | ZGC supports String Deduplication only from JDK 18 |
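A first sanity check against the "enabling without monitoring" mistake is to ask the running JVM which collectors are active and whether the flag was actually applied. The sketch below assumes a HotSpot-based JVM that exposes `com.sun.management.HotSpotDiagnosticMXBean`; `DedupFlagCheck` and `flagValue` are hypothetical names. It confirms configuration only, not that any strings were deduplicated.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class DedupFlagCheck {
    // Returns the current value of a HotSpot flag, or "unknown" if this JVM does not expose it.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Non-HotSpot JVM or a flag name this JVM does not know.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("Collector: " + gc.getName());
        }
        System.out.println("UseStringDeduplication = " + flagValue("UseStringDeduplication"));
    }
}
```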

When NOT to use

  • Few duplicates: if strings are mostly unique, it is overhead without benefit
  • Short-lived strings: die before reaching deduplication queue
  • ZGC before JDK 18: not supported (on those versions, use -XX:+UseStringDeduplication with Shenandoah)
  • Ultra-low-latency: 2–5% CPU overhead may be critical

🔴 Senior Level

Internal Implementation: G1 GC Deduplication Pipeline

┌────────────────────────────────────────────────────────────────────────┐
│  GC (Evacuation Phase)                                                 │
│  ├── Identify String objects in collection set                         │
│  ├── Filter: age >= StringDeduplicationAgeThreshold (default 3)        │
│  ├── Enqueue candidates to dedup queue                                 │
│  └── Continue evacuation                                               │
├────────────────────────────────────────────────────────────────────────┤
│  Deduplication Thread (concurrent, low-priority)                       │
│  ├── Dequeue String references                                         │
│  ├── Compute hash of byte[] value (content hash, not String.hashCode)  │
│  ├── Lookup in deduplication table (native memory hashtable)           │
│  ├── If found: byte-by-byte comparison to confirm                      │
│  ├── If match: CAS redirect value reference → shared array             │
│  └── If not found: add to table                                        │
└────────────────────────────────────────────────────────────────────────┘

Deduplication Table:

  • Native hash table (outside Java Heap, in C-heap)
  • Stores hashes of byte[] arrays + weak references
  • On hash match: byte-by-byte comparison to confirm (collision protection)
  • Reference update via CAS (Compare-And-Swap): thread-safe without locks
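The compare-and-swap redirect described in the last bullet can be illustrated with `AtomicReference`. This is a model under stated assumptions, not the JVM's code: `CasRedirectModel`, `FakeString` and `redirect` are hypothetical, and the real update happens on the `value` field inside the VM.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

public class CasRedirectModel {
    // Model of a String whose backing array can be swapped atomically.
    static final class FakeString {
        final AtomicReference<byte[]> value;
        FakeString(byte[] bytes) {
            this.value = new AtomicReference<>(bytes);
        }
    }

    // Redirects s to the shared array only if s still points at the array that was compared.
    static boolean redirect(FakeString s, byte[] compared, byte[] shared) {
        return s.value.compareAndSet(compared, shared);
    }

    public static void main(String[] args) {
        byte[] shared = "OK".getBytes(StandardCharsets.ISO_8859_1);
        byte[] own = "OK".getBytes(StandardCharsets.ISO_8859_1);
        FakeString s = new FakeString(own);

        System.out.println(redirect(s, own, shared)); // true: first attempt wins
        System.out.println(redirect(s, own, shared)); // false: value no longer points at 'own'
        System.out.println(s.value.get() == shared);  // true: readers see the shared array
    }
}
```

The second call failing without any lock is the point: a late or repeated redirect becomes a no-op instead of corrupting the reference.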

Marking Algorithm:

  • G1 uses concurrent marking
  • Strings are marked during marking phase
  • Age threshold (default 3 GC cycles): only strings that have survived enough collections are deduplicated
  • This filters out short-lived strings that die before queue processing
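The age filter can be pictured with a toy simulation. Everything here is hypothetical (`AgeThresholdModel`, `TrackedString`, `collectOnce`); the model enqueues a string once, at the moment its age reaches the threshold, which is a simplification of the `age >= threshold` check.

```java
import java.util.ArrayList;
import java.util.List;

public class AgeThresholdModel {
    // Mirrors the default of -XX:StringDeduplicationAgeThreshold.
    static final int AGE_THRESHOLD = 3;

    static final class TrackedString {
        final String text;
        int age; // number of collections survived so far
        TrackedString(String text) {
            this.text = text;
        }
    }

    // Simulates one collection: every survivor ages by one,
    // and those that just reached the threshold become dedup candidates.
    static List<TrackedString> collectOnce(List<TrackedString> survivors) {
        List<TrackedString> candidates = new ArrayList<>();
        for (TrackedString s : survivors) {
            s.age++;
            if (s.age == AGE_THRESHOLD) {
                candidates.add(s);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<TrackedString> live = List.of(new TrackedString("INFO"));
        for (int gc = 1; gc <= 3; gc++) {
            // Only the third collection yields a candidate.
            System.out.println("GC " + gc + ": " + collectOnce(live).size() + " candidate(s)");
        }
    }
}
```

A string that becomes unreachable after one or two collections never shows up in `candidates`, which is the filtering effect the bullets describe.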

Trade-offs

Pros:

  • Transparency: no code change needed, only a JVM flag
  • Savings: 10–20% Heap for text-heavy applications
  • Safety: no risk of String Pool corruption (data doesn't change)
  • Works with any strings, not just interned ones

Cons:

  • CPU overhead: hashing + lookup + byte comparison (~2–5% CPU)
  • Memory: deduplication table (~10–50MB native memory)
  • Collector support: G1 and Shenandoah; Serial, Parallel, and ZGC only from JDK 18
  • Delay: deduplication happens after several GC cycles (age threshold)
  • Doesn't deduplicate: strings with different coder (Latin-1 vs UTF-16: arrays of different length)

Edge Cases

1. Doesn't deduplicate strings with different coder:

// import java.nio.charset.StandardCharsets;
String s1 = "Hello"; // Latin-1 coder, byte[5]
String s2 = new String("Hello".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16);
// s2 decodes back to "Hello"; with Compact Strings it is stored as Latin-1 byte[5], same as s1
String s3 = "Привет"; // non-Latin-1 characters → UTF-16 coder, byte[12]
// Arrays are compared byte by byte, so a Latin-1 array never merges with a UTF-16 one

2. Very short-lived strings:

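void process() {
    // String literals are pooled, so build the copies at runtime to get separate arrays
    String temp = new StringBuilder("duplicate").toString();  // new String + byte[] in Eden
    String temp2 = new StringBuilder("duplicate").toString(); // another copy in Eden
    // Both strings die in Young GC: they don't reach the age threshold (3 GCs)
    // Deduplication won't have time to work
}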

3. Race condition on redirect:

// Dedup thread performs CAS:
// if (CAS(oldValue, sharedValue)) → success
// If two threads simultaneously try to redirect, only one succeeds
// Other thread sees value is already redirected, and skips
// Thread reading s.value always sees consistent value (@Stable + CAS)
// @Stable: JVM-internal annotation telling the JIT that the field changes at most once from its default value.

4. String Pool vs Deduplication: interaction

String s1 = new String("Hello".toCharArray());
String s2 = new String("Hello".toCharArray());
// (new String("Hello") would not show this: that constructor reuses the literal's array)
// s1.value and s2.value: different byte[] arrays (both Latin-1, byte[5])
// After deduplication: s1.value and s2.value → SAME byte[]
// But s1 != s2 (String objects are different!)
// Savings: one byte[5] instead of two, typically ~24 bytes including array header and padding

5. Abnormally large strings (very long):

String huge = "A".repeat(10_000_000); // 10MB
String huge2 = "A".repeat(10_000_000); // 10MB
// Byte-by-byte comparison of 10MB is expensive (~10ms)
// Deduplication thread may slow down GC
// In practice: long strings may be identical in log aggregators or data pipelines
// (same JSON payloads), but this is rare for typical web applications.

Performance

| Metric | Without dedup | With dedup | Delta |
| --- | --- | --- | --- |
| Heap usage | 4.0 GB | 3.2 GB | -20% |
| GC pause (avg) | 50ms | 55ms | +10% |
| CPU overhead | Baseline | +2–5% | Small |
| Young GC | 10ms | 10ms | No change |
| Mixed GC | 50ms | 55ms | +5ms |
| Native memory (table) | 0MB | 10–50MB | Extra |
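The Delta column is plain relative change, (after - before) / before. A small helper makes the arithmetic reproducible for your own measurements; `DeltaCheck` and `percentDelta` are hypothetical names, and the inputs below are the figures from the table.

```java
public class DeltaCheck {
    // Relative change from 'before' to 'after', in percent.
    static double percentDelta(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("Heap usage: %.0f%%%n", percentDelta(4.0, 3.2)); // Heap usage: -20%
        System.out.printf("GC pause: %.0f%%%n", percentDelta(50, 55));     // GC pause: 10%
    }
}
```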

Memory savings (real scenarios):

  • JSON API service: 15–25% strings are duplicates (keys, status values)
  • Log aggregator: 30–40% duplicates (levels, service names, host names)
  • ETL pipeline: 5–10% duplicates (category names, country codes)

Thread Safety

Deduplication is thread-safe:

  • value reference update via CAS (atomic operation)
  • value field is @Stable, but JVM allows redirect within GC
  • Reading threads always see consistent value (memory barriers during GC)
  • Deduplication thread: single (single-threaded), no contention between dedup threads

Production War Story

Scenario 1: JSON API service (G1 GC, 8GB Heap, Spring Boot, 50K RPS):

  • Without deduplication: Heap usage 6.5GB, Full GC every 30 minutes, p99 latency = 25ms
  • With deduplication: Heap usage 5.2GB, Full GC every 50 minutes, p99 latency = 20ms
  • CPU overhead: +3% (acceptable)
  • Stats: deduplicated 2.3GB of strings, 850K unique byte[] arrays merged
  • Result: reduced instances from 10 to 8 (saving $15K/month)

Scenario 2: Log aggregator (1M log lines/min, G1 GC, 12GB Heap):

  • Fields level, service, host: many duplicates ("INFO", "UserService", "host-1")
  • Without deduplication: 12GB Heap
  • With deduplication: 9GB Heap
  • Savings: 3GB → fewer instances in cluster
  • Problem: CPU overhead grew to 7% due to huge number of strings. Fix: -XX:StringDeduplicationAgeThreshold=5 (increased age threshold, fewer candidates → less CPU).

Scenario 3 (anti-pattern): Team enabled deduplication "just in case" for an app with unique strings (UUIDs, hashes, timestamps). CPU overhead +4%, memory savings 0.5%. Disabled it.

Monitoring

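# Enable deduplication statistics (JDK 8 flags)
-XX:+PrintStringDeduplicationStatistics
-XX:+PrintGC
# JDK 9+: unified logging replaces the flags above
# (tag set gc+stringdedup in JDK 9-16; JDK 17+ uses the stringdedup tag on its own)
-Xlog:gc+stringdedup=debug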

# Output in GC logs:
# [GC concurrent string deduplication]
# String Deduplication: 1.2GB deduplicated (500K strings)
# [DEDUP: 500K strings, 1.2GB, 2.3ms]

# jcmd: statistics at runtime (verify that your JDK lists this command: jcmd <pid> help)
jcmd <pid> GC.string_deduplication_statistics

# Output:
# String Deduplication Statistics:
#   Executed: 1234 times
#   Deduplicated: 567890 strings (1.2GB)
#   Skipped: 123456 strings (already deduplicated)

# JFR (Java Flight Recorder)
java -XX:StartFlightRecording=filename=recording.jfr ...
# Events: StringDeduplicationStatistics
# In JDK Mission Control: Memory → String Deduplication

# Configure age threshold
-XX:StringDeduplicationAgeThreshold=3  # Default 3 GC cycles
# Increase to 5–10 if CPU overhead is too high

Best Practices for Highload

  • Enable when profiler shows > 10% duplicate strings in Heap
  • Don't use as replacement for intern() for highly duplicate long-lived strings (dictionaries, configs): intern() is more efficient
  • Monitor CPU overhead: if > 5%, raise the age threshold, e.g. -XX:StringDeduplicationAgeThreshold=5
  • Combine: intern() for dictionary data (enum values, status codes) + deduplication for everything else
  • For ZGC: String Deduplication is supported only from JDK 18; on older JDKs consider Shenandoah (-XX:+UseShenandoahGC -XX:+UseStringDeduplication)
  • For max savings: tune -XX:G1HeapRegionSize (smaller regions → more frequent evacuation → more dedup candidates)
  • Don't enable for short-lived apps (CLI, batch jobs < 1min): won't have time to work
  • For ultra-low-latency: benchmark with and without deduplication; sometimes 5ms GC pause increase is critical
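The "combine" advice can be sketched as a parser that interns only the dictionary-like fields and leaves free-form text to the collector. The record layout, the `level|service|message` line format, and the names `LogRecordParser` and `LogRecord` are assumptions made for illustration.

```java
public class LogRecordParser {
    static final class LogRecord {
        final String level;
        final String service;
        final String message;

        LogRecord(String level, String service, String message) {
            this.level = level;
            this.service = service;
            this.message = message;
        }
    }

    // Parses a hypothetical "level|service|message" line.
    static LogRecord parse(String line) {
        String[] parts = line.split("\\|", 3);
        // Small, closed vocabularies: intern() gives one shared String object per value.
        String level = parts[0].intern();
        String service = parts[1].intern();
        // Free-form text: not interned; repeated payloads are left to String Deduplication.
        return new LogRecord(level, service, parts[2]);
    }

    public static void main(String[] args) {
        LogRecord r1 = parse("INFO|UserService|user 1 logged in");
        LogRecord r2 = parse("INFO|UserService|user 2 logged in");

        System.out.println(r1.level == r2.level);     // true: both point at the pooled "INFO"
        System.out.println(r1.service == r2.service); // true: pooled "UserService"
    }
}
```

Interning the message field as well would grow the String Pool with values that rarely repeat, which is the case the deduplication flag is meant to cover instead.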

🎯 Interview Cheat Sheet

Must know:

  • String Deduplication: G1 GC feature, automatically merges identical string byte[] arrays
  • Enabled with flag: -XX:+UseG1GC -XX:+UseStringDeduplication (no code change needed)
  • Difference from String Pool: deduplication merges byte[], not String objects; == remains false
  • Works asynchronously during GC; age threshold (default 3 GC cycles) filters out short-lived strings
  • CPU overhead: ~2-5%, native memory for dedup table: ~10-50MB
  • Doesn't deduplicate strings with different coder (Latin-1 vs UTF-16)

Frequent follow-up questions:

  • How is deduplication different from intern()? intern() merges String objects (== becomes true) and requires code. Deduplication merges only byte[] arrays, == remains false, no code change.
  • What memory savings? 10-20% Heap for apps with duplicate strings. JSON API: 15-25%, log aggregator: 30-40%.
  • Why doesn't it work with ZGC? ZGC gained String Deduplication support only in JDK 18. On older JDKs the alternative is Shenandoah GC.
  • How to reduce CPU overhead? Increase the age threshold, e.g. -XX:StringDeduplicationAgeThreshold=5: fewer candidates, less CPU.

Red flags (DON’T say):

  • ❌ "Deduplication makes == true": String objects remain different, only byte[] is merged
  • ❌ "This replaces intern()": no, intern() is more efficient for dictionary data
  • ❌ "Works instantly": happens during GC, needs several cycles
  • ❌ "Works with any GC": G1 and Shenandoah; Serial, Parallel, and ZGC only from JDK 18

Related topics:

  • [[1. How String Pool Works]]
  • [[3. When to Use intern()]]
  • [[19. What are Compact Strings in Java 9+]]
  • [[20. How to Find Out How Much Memory a String Occupies]]