Question 18 · Section 12

How to Properly Convert String to byte[] and Back

Converting a string to bytes and back is one of the most common operations when working with files, networks, and databases.


🟢 Junior Level


Main rule: ALWAYS explicitly specify the encoding!

import java.nio.charset.StandardCharsets;

String str = "Hello, World!";

// String → byte[]
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// byte[] → String
String restored = new String(bytes, StandardCharsets.UTF_8);

System.out.println(restored); // "Hello, World!"

Never do this:

// ❌ BAD — uses OS encoding (different on Windows and Linux!)
byte[] bytes = str.getBytes();
String restored = new String(bytes);

Why: the default encoding depends on the JVM and operating system. On Windows it might be Windows-1251, on Linux — UTF-8. A string converted on one system turns into “garbled text” on another. (Note: since Java 18, JEP 400 makes UTF-8 the default charset on all platforms, but explicit charsets remain best practice for portability to older runtimes.)
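To make the failure mode concrete, here is a minimal sketch (class name is illustrative) that encodes a Cyrillic string as UTF-8 and then decodes it with the wrong charset, producing the classic “garbled text”:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Привет";
        // Encode correctly: UTF-8 uses 2 bytes per Cyrillic character
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // Decode with the WRONG charset, as if the platform default were Windows-1251
        String garbled = new String(bytes, Charset.forName("windows-1251"));
        System.out.println(garbled); // "РџСЂРёРІРµС‚": each UTF-8 byte became a separate character
    }
}
```

Six characters became twelve, because every byte of each two-byte UTF-8 sequence was decoded as a standalone Windows-1251 character.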

Analogy: String → byte[] is like recording speech on a dictaphone (text → bytes), and byte[] → String is like playing back the recording (bytes → text). If you choose the wrong recording format (encoding), you hear noise instead of words.


🟡 Middle Level

Correct conversion methods

String → byte[]:

// Java 7+ — recommended approach (StandardCharsets constant: no charset-name lookup)
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// If encoding is determined dynamically
byte[] bytes = str.getBytes(Charset.forName("Windows-1251"));

byte[] → String:

// Java 7+ — recommended approach
String str = new String(bytes, StandardCharsets.UTF_8);

// If encoding is dynamic
String str = new String(bytes, Charset.forName("Windows-1251"));

Streaming work with large data:

// Reading file with encoding
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line = reader.readLine();
}

// Writing file with encoding
try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
    writer.write("Hello");
}

Table of typical mistakes

| Mistake | Consequences | Solution |
|---|---|---|
| getBytes() without encoding | “Garbled text” on OS change | Always getBytes(StandardCharsets.UTF_8) |
| new String(bytes) without encoding | Cannot restore the original text | Always new String(bytes, StandardCharsets.UTF_8) |
| Confusing str.length() with bytes.length | Wrong data-processing logic | "Привет".length() = 6, "Привет".getBytes(UTF_8).length = 12 |
| Converting binary data to String | Data loss, corruption | For binary data, use byte[]/ByteBuffer directly |

Encoding comparison for conversion

| Encoding | Bytes/char (Latin) | Bytes/char (Cyrillic) | When to use |
|---|---|---|---|
| UTF-8 | 1 | 2 | Internet, JSON, HTTP — de facto standard |
| UTF-16 | 2 | 2 | Internal Java format, Windows API |
| Latin-1 | 1 | N/A | Western European languages only |
| Windows-1251 | N/A | 1 | Legacy Windows systems |
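The numbers in the table can be verified directly. A small sketch (class name is illustrative; it assumes the windows-1251 charset is available in your JDK, which is true for standard JDK distributions):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String latin = "Hello";   // 5 characters
        String cyr = "Привет";    // 6 characters

        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);      // 5  (1 byte/char)
        System.out.println(latin.getBytes(StandardCharsets.UTF_16).length);     // 12 (2-byte BOM + 2 bytes/char)
        System.out.println(latin.getBytes(StandardCharsets.ISO_8859_1).length); // 5  (1 byte/char)

        System.out.println(cyr.getBytes(StandardCharsets.UTF_8).length);          // 12 (2 bytes/char)
        System.out.println(cyr.getBytes(Charset.forName("windows-1251")).length); // 6  (1 byte/char)
    }
}
```

Note that StandardCharsets.UTF_16 prepends a BOM when encoding; UTF_16BE/UTF_16LE do not.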

When NOT to convert String → byte[]

  • Binary data (images, PDF, protobuf) — work with byte[] directly
  • Ultra-low-latency systems — conversion adds 50–200ns overhead
  • Data of unknown encoding — use auto-detect or BOMInputStream

🔴 Senior Level

Internal Implementation

String.getBytes(Charset):

public byte[] getBytes(Charset charset) {
    if (charset == null) throw new NullPointerException();
    return StringCoding.encode(charset, coder(), value);
}

StringCoding.encode — what happens:

  1. Check coder (Latin-1 or UTF-16)
  2. Choose encoding algorithm:
    • Latin-1 → UTF-8: 1 byte → 1 byte (ASCII), 1 byte → 2 bytes (extended Latin-1, U+0080–U+00FF)
    • UTF-16 → UTF-8: 2 bytes → 1–3 bytes (depends on code point)
  3. For UTF-8: character-by-character conversion via CharsetEncoder
  4. Allocate new byte[] with needed size

new String(byte[], Charset):

public String(byte[] bytes, Charset charset) {
    this(bytes, 0, bytes.length, charset);
}
// → StringCoding.decode → CharsetDecoder → byte[] value + coder

What happens:

  1. CharsetDecoder decodes bytes to characters
  2. coder is determined: if all characters in range U+0000–U+00FF → LATIN1, otherwise → UTF16
  3. Invalid bytes → replacement with \uFFFD (replacement character)
  4. New String is created with byte[] value + coder

Encoding trade-offs

UTF-8:

  • Pros: De facto standard, ASCII-compatible, variable-length (savings for Latin)
  • Cons: Cyrillic = 2 bytes/char, CJK characters = 3 bytes/char, variable length complicates random access

UTF-16:

  • Pros: Fixed size for BMP (2 bytes/char), internal Java format (minimum conversion)
  • Cons: Endianness (LE vs BE), BOM, 2x size for ASCII, not ASCII-compatible

Latin-1 (ISO-8859-1):

  • Pros: 1 byte/char, minimal overhead, fixed size
  • Cons: Only 256 characters, Cyrillic in extended Latin-1, not Unicode

Edge Cases

1. BOM (Byte Order Mark):

// UTF-8 file with BOM: EF BB BF
byte[] bomUtf8 = {(byte)0xEF, (byte)0xBB, (byte)0xBF, 'H', 'i'};
String s = new String(bomUtf8, StandardCharsets.UTF_8);
// s.charAt(0) = '\uFEFF' — invisible BOM character!
// s.startsWith("Hi") → false!

Solution: BOMInputStream (Apache Commons IO) or manual check of first 3 bytes:

if (bytes.length >= 3 && bytes[0] == (byte)0xEF && bytes[1] == (byte)0xBB && bytes[2] == (byte)0xBF) {
    s = new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
}
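The manual check above can be wrapped into a small self-contained helper (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class BomStrip {
    // Decode UTF-8 bytes, skipping a leading BOM (EF BB BF) if present
    static String decodeUtf8SkippingBom(byte[] bytes) {
        if (bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF) {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        System.out.println(decodeUtf8SkippingBom(withBom));                  // "Hi"
        System.out.println(decodeUtf8SkippingBom(withBom).startsWith("Hi")); // true
    }
}
```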

2. Malformed input — invalid UTF-8 bytes:

byte[] malformed = {(byte) 0xFF, (byte) 0xFE, (byte) 0x80};
String s = new String(malformed, StandardCharsets.UTF_8);
// Invalid sequences → '\uFFFD' (replacement character)
// s = "\uFFFD\uFFFD\uFFFD"

Behavior controlled via CodingErrorAction:

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
// Pick ONE policy (each call overwrites the previous one):
decoder.onMalformedInput(CodingErrorAction.REPORT);  // Throws MalformedInputException (default for a raw decoder)
// decoder.onMalformedInput(CodingErrorAction.IGNORE);  // Silently skips invalid bytes
// decoder.onMalformedInput(CodingErrorAction.REPLACE); // Replaces with \uFFFD (what new String(...) does)
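A complete, runnable sketch of strict decoding (class name is illustrative): with REPORT, invalid bytes fail fast instead of silently turning into \uFFFD:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] malformed = {(byte) 0xFF, (byte) 0xFE, (byte) 0x80}; // invalid UTF-8
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(malformed));
            System.out.println("decoded OK");
        } catch (CharacterCodingException e) {
            // REPORT makes invalid input throw instead of producing replacement characters
            System.out.println("rejected: " + e);
        }
    }
}
```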

3. Truncated multi-byte sequences:

// Cyrillic 'А' = D0 90 in UTF-8
byte[] truncated = {(byte) 0xD0}; // Only first byte
String s = new String(truncated, StandardCharsets.UTF_8);
// '\uFFFD' — incomplete sequence

Critical for streaming I/O: if buffer breaks in the middle of a multi-byte sequence, you need to save the “tail” and add it to the next buffer.
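The standard way to handle such a “tail” is CharsetDecoder with endOfInput=false, which leaves the incomplete sequence unconsumed in the input buffer for the next round. A minimal sketch (class name is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class StreamingDecode {
    public static void main(String[] args) {
        // Cyrillic 'А' = D0 90 in UTF-8, split across two chunks
        byte[] chunk1 = {(byte) 0xD0};
        byte[] chunk2 = {(byte) 0x90};

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(16);
        ByteBuffer in = ByteBuffer.allocate(16);

        in.put(chunk1).flip();
        decoder.decode(in, out, false); // endOfInput=false: partial byte stays unconsumed
        in.compact();                   // move the unconsumed "tail" to the front
        in.put(chunk2).flip();
        decoder.decode(in, out, true);  // full sequence now available
        decoder.flush(out);

        out.flip();
        System.out.println(out.toString()); // "А"
    }
}
```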

4. Security — encoding bypass:

// SQL injection through multi-byte encoding bypass
String input = "%C0%27 OR 1=1 --";
// In some old systems %C0%27 decodes as a single quote
// This bypasses filters that check input BEFORE decoding
// Always validate input AFTER decoding!

5. Surrogate pairs and byte length:

String emoji = "\uD83D\uDE00"; // "😀" — surrogate pair
emoji.length();                         // 2 (char)
emoji.getBytes(StandardCharsets.UTF_8).length; // 4 (bytes)
emoji.getBytes(StandardCharsets.UTF_16).length; // 6 (2 BOM + 4 data)
// StandardCharsets.UTF_16 includes BOM. UTF_16BE/UTF_16LE — without BOM, would be 4 bytes.
// Don't use bytes.length to determine character count!

Performance

| Operation | UTF-8 | UTF-16 | Latin-1 |
|---|---|---|---|
| Encode 100 chars (Latin-1) | ~100ns | ~50ns | ~20ns |
| Encode 100 chars (Cyrillic) | ~200ns | ~50ns | N/A |
| Decode 200 bytes (UTF-8) | ~150ns | N/A | N/A |
| Decode 200 bytes (UTF-16) | N/A | ~40ns | N/A |
| Encode 10KB text (mixed) | ~8μs | ~3μs | ~2μs |

Allocations:

  • getBytes(UTF_8): allocates new byte[] of size ~N–3N (depends on content)
  • new String(bytes, UTF_8): allocates new byte[] + String object (~24 bytes overhead)
  • For 1M conversions: ~100MB allocations → Young GC pressure

Thread Safety

  • StandardCharsets.UTF_8 — thread-safe (immutable singleton)
  • Charset.forName(...) — thread-safe (results are cached)
  • CharsetEncoder / CharsetDecoder — NOT thread-safe! A single instance must not be used from multiple threads simultaneously
  • Solution: create a new encoder/decoder per call, or use ThreadLocal<CharsetEncoder>
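A sketch of the ThreadLocal approach (class and method names are illustrative): each thread gets its own encoder, which it can reuse across calls:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncoderPool {
    // CharsetEncoder is stateful and NOT thread-safe; give each thread its own instance
    private static final ThreadLocal<CharsetEncoder> UTF8_ENCODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newEncoder());

    public static byte[] encode(String s) throws CharacterCodingException {
        CharsetEncoder encoder = UTF8_ENCODER.get();
        encoder.reset(); // clear any state left by a previous call
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encode("Привет").length); // 12
    }
}
```

For a one-off conversion, plain str.getBytes(StandardCharsets.UTF_8) is simpler; the ThreadLocal pattern matters only on hot paths where encoder reuse is measurable.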

Production War Story

Scenario 1: HTTP API — reading request body (Spring Boot):

// Spring Boot — automatically uses UTF-8
@PostMapping
public void handle(@RequestBody String body) { ... }

// Raw Servlet — need to specify explicitly
request.setCharacterEncoding("UTF-8");
String body = request.getReader().readLine();

Problem: client sent POST request in Windows-1251, Spring read as UTF-8 → “garbled text”. Fix: header Content-Type: text/plain; charset=Windows-1251 or client migration to UTF-8.

Scenario 2: Kafka messages — serialization/deserialization:

// Producer
byte[] bytes = jsonString.getBytes(StandardCharsets.UTF_8);
producer.send(new ProducerRecord<>(topic, keyBytes, bytes));

// Consumer
String message = new String(record.value(), StandardCharsets.UTF_8);

Problem: one producer used getBytes() without encoding (default on Windows = Cp1251). Consumer on Linux read as UTF-8 → corrupted messages. Fix: unified UTF-8 standard at Kafka contract level.

Scenario 3: Highload log parser (500K lines/sec):

  • Converting each line new String(bytes, UTF_8) → 500K allocations/sec
  • Young GC every 2 seconds, pause 15ms
  • Fix: zero-copy via ByteBuffer + custom parser, without String conversion
  • Result: Young GC every 8 seconds, pause 5ms
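A minimal illustration of the idea (names are illustrative): scanning raw bytes for newlines without decoding. This is safe because in UTF-8 every byte of a multi-byte sequence is >= 0x80, so the '\n' byte (0x0A) never occurs inside one:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ZeroCopyLineCount {
    // Count lines by scanning raw bytes: no per-line String allocation
    static int countLines(ByteBuffer buf) {
        int count = 0;
        while (buf.hasRemaining()) {
            if (buf.get() == (byte) '\n') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "line1\nстрока2\nline3\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLines(ByteBuffer.wrap(data))); // 3
    }
}
```

A real parser would also extract fields this way, decoding to String only the fragments that actually need it.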

Monitoring

# Check default encoding (file.encoding system property)
java -XshowSettings:properties -version 2>&1 | grep file.encoding
# file.encoding = UTF-8

# Available encodings
jrunscript -e "print(java.nio.charset.Charset.availableCharsets().keySet())"

# GC logs — allocations from conversion
java -Xlog:gc*:file=gc.log ...

# JFR — Object Allocation
java -XX:StartFlightRecording=filename=recording.jfr ...
# In JFR: Memory → Object Allocation → filter by java.lang.String

// JOL (Java Object Layout) — String size after conversion
System.out.println(GraphLayout.parseInstance(s).toFootprint());

// Runtime check of the default charset
System.out.println(Charset.defaultCharset()); // Depends on OS (UTF-8 since Java 18)!

// Naive conversion benchmark (for reliable numbers use JMH)
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    byte[] b = str.getBytes(StandardCharsets.UTF_8);
}
long elapsed = System.nanoTime() - start;
System.out.println("1M encode: " + elapsed / 1_000_000 + "ms");

Best Practices for Highload

  • Always specify Charset — use StandardCharsets.UTF_8 (constant, no lookup allocations)
  • For JSON/XML/HTTP: UTF-8 — de facto standard
  • For binary protocols (Kafka, gRPC, TCP): work with byte[]/ByteBuf directly, without String conversion
  • For streaming I/O: InputStreamReader/OutputStreamWriter with explicit Charset — buffering inside
  • BOM handling: BOMInputStream (Apache Commons IO) or manual check of first bytes
  • For ultra-low-latency: avoid String — use ByteBuf (Netty), ByteBuffer, zero-copy approaches
  • Security: validate input AFTER decoding, use constant-time comparison for secrets
  • CharsetEncoder/CharsetDecoder — don’t share between threads; use ThreadLocal or per-call creation
  • For large data: streaming via Reader/Writer, don’t load entire file into byte[]

🎯 Interview Cheat Sheet

Must know:

  • str.getBytes(StandardCharsets.UTF_8) — correct way to convert String → byte[]
  • new String(bytes, StandardCharsets.UTF_8) — correct way to convert byte[] → String
  • getBytes() and new String(bytes) without encoding — depend on OS, cause “garbled text”
  • BOM in UTF-8 (3 bytes EF BB BF) creates invisible character \uFEFF at start of string
  • Invalid UTF-8 bytes are replaced with \uFFFD (replacement character)
  • CharsetEncoder/CharsetDecoder — NOT thread-safe, need ThreadLocal

Frequent follow-up questions:

  • Why is new String(bytes) without encoding bad? — The default encoding depends on the OS. A string encoded on Windows (Cp1251) becomes “garbled text” on Linux (UTF-8).
  • How to handle BOM when reading a UTF-8 file? — BOMInputStream, or check: if the first 3 bytes are EF BB BF, skip them.
  • What happens when converting binary data to String? — Data loss. For binary data, work with byte[]/ByteBuffer directly.
  • How to speed up conversion in highload? — Zero-copy: ByteBuffer, ByteBuf (Netty), streaming via Reader/Writer.

Red flags (DON’T say):

  • ❌ “getBytes() without encoding — is fine” — depends on OS, breaks on migration
  • ❌ “CharsetEncoder can be shared between threads” — NOT thread-safe
  • ❌ “BOM — only a UTF-16 problem” — UTF-8 can have BOM too
  • ❌ “You can use bytes.length to determine character count” — for UTF-8 1 char = 1-4 bytes

Related topics:

  • [[17. What is String Encoding]]
  • [[19. What are Compact Strings in Java 9+]]
  • [[20. How to Find Out How Much Memory a String Occupies]]