Question 18 · Section 12

How to Properly Convert String to byte[] and Back

Converting a string to bytes and back is one of the most common operations when working with files, networks, and databases.


🟢 Junior Level


Main rule: ALWAYS explicitly specify the encoding!

import java.nio.charset.StandardCharsets;

String str = "Hello, World!";

// String → byte[]
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// byte[] → String
String restored = new String(bytes, StandardCharsets.UTF_8);

System.out.println(restored); // "Hello, World!"

Never do this:

// ❌ BAD — uses OS encoding (different on Windows and Linux!)
byte[] bytes = str.getBytes();
String restored = new String(bytes);

Why: the default encoding depends on the JVM and operating system. On Windows it might be Windows-1251, on Linux — UTF-8. A string converted on one system turns into “garbled text” on another. (Note: since Java 18, JEP 400 makes UTF-8 the default charset on all platforms, but explicit charsets remain best practice for portability to older runtimes.)
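To make the failure mode concrete, here is a minimal sketch (class name is illustrative) that encodes a Cyrillic string as UTF-8 and then decodes it with the wrong charset, producing the classic “garbled text”:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Привет";
        // Encode correctly: UTF-8 uses 2 bytes per Cyrillic character
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // Decode with the WRONG charset, as if the platform default were Windows-1251
        String garbled = new String(bytes, Charset.forName("windows-1251"));
        System.out.println(garbled); // "РџСЂРёРІРµС‚": each UTF-8 byte became a separate character
    }
}
```

Six characters became twelve, because every byte of each two-byte UTF-8 sequence was decoded as a standalone Windows-1251 character.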

Analogy: String → byte[] is like recording speech on a dictaphone (text → bytes), and byte[] → String is like playing back the recording (bytes → text). If you choose the wrong recording format (encoding), you hear noise instead of words.


🟡 Middle Level

Correct conversion methods

String → byte[]:

// Java 7+ — recommended approach (StandardCharsets constant: no charset-name lookup)
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// If encoding is determined dynamically
byte[] bytes = str.getBytes(Charset.forName("Windows-1251"));

byte[] → String:

// Java 7+ — recommended approach
String str = new String(bytes, StandardCharsets.UTF_8);

// If encoding is dynamic
String str = new String(bytes, Charset.forName("Windows-1251"));

Streaming work with large data:

// Reading file with encoding
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line = reader.readLine();
}

// Writing file with encoding
try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
    writer.write("Hello");
}

Table of typical mistakes

| Mistake | Consequences | Solution |
|---|---|---|
| getBytes() without encoding | “Garbled text” on OS change | Always getBytes(StandardCharsets.UTF_8) |
| new String(bytes) without encoding | Cannot restore the original text | Always new String(bytes, StandardCharsets.UTF_8) |
| Confusing str.length() with bytes.length | Wrong data-processing logic | "Привет".length() = 6, "Привет".getBytes(UTF_8).length = 12 |
| Converting binary data to String | Data loss, corruption | For binary data, use byte[]/ByteBuffer directly |

Encoding comparison for conversion

| Encoding | Bytes/char (Latin) | Bytes/char (Cyrillic) | When to use |
|---|---|---|---|
| UTF-8 | 1 | 2 | Internet, JSON, HTTP — de facto standard |
| UTF-16 | 2 | 2 | Internal Java format, Windows API |
| Latin-1 | 1 | N/A | Western European languages only |
| Windows-1251 | N/A | 1 | Legacy Windows systems |
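The numbers in the table can be verified directly. A small sketch (class name is illustrative; it assumes the windows-1251 charset is available in your JDK, which is true for standard JDK distributions):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String latin = "Hello";   // 5 characters
        String cyr = "Привет";    // 6 characters

        System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);      // 5  (1 byte/char)
        System.out.println(latin.getBytes(StandardCharsets.UTF_16).length);     // 12 (2-byte BOM + 2 bytes/char)
        System.out.println(latin.getBytes(StandardCharsets.ISO_8859_1).length); // 5  (1 byte/char)

        System.out.println(cyr.getBytes(StandardCharsets.UTF_8).length);          // 12 (2 bytes/char)
        System.out.println(cyr.getBytes(Charset.forName("windows-1251")).length); // 6  (1 byte/char)
    }
}
```

Note that StandardCharsets.UTF_16 prepends a BOM when encoding; UTF_16BE/UTF_16LE do not.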

When NOT to convert String → byte[]

  • Binary data (images, PDF, protobuf) — work with byte[] directly
  • Ultra-low-latency systems — conversion adds 50–200ns overhead
  • Data of unknown encoding — use auto-detect or BOMInputStream

🔴 Senior Level

Internal Implementation

String.getBytes(Charset):

public byte[] getBytes(Charset charset) {
    if (charset == null) throw new NullPointerException();
    return StringCoding.encode(charset, coder(), value);
}

StringCoding.encode — what happens:

  1. Check coder (Latin-1 or UTF-16)
  2. Choose encoding algorithm:
    • Latin-1 → UTF-8: 1 byte → 1 byte (ASCII), 1 byte → 2 bytes (extended Latin-1, U+0080–U+00FF)
    • UTF-16 → UTF-8: 2 bytes → 1–3 bytes (depends on code point)
  3. For UTF-8: character-by-character conversion via CharsetEncoder
  4. Allocate new byte[] with needed size

new String(byte[], Charset):

public String(byte[] bytes, Charset charset) {
    this(bytes, 0, bytes.length, charset);
}
// → StringCoding.decode → CharsetDecoder → byte[] value + coder

What happens:

  1. CharsetDecoder decodes bytes to characters
  2. coder is determined: if all characters in range U+0000–U+00FF → LATIN1, otherwise → UTF16
  3. Invalid bytes → replacement with \uFFFD (replacement character)
  4. New String is created with byte[] value + coder

Encoding trade-offs

UTF-8:

  • Pros: De facto standard, ASCII-compatible, variable-length (savings for Latin)
  • Cons: Cyrillic = 2 bytes/char, CJK characters = 3 bytes/char, variable length complicates random access

UTF-16:

  • Pros: Fixed size for BMP (2 bytes/char), internal Java format (minimum conversion)
  • Cons: Endianness (LE vs BE), BOM, 2x size for ASCII, not ASCII-compatible

Latin-1 (ISO-8859-1):

  • Pros: 1 byte/char, minimal overhead, fixed size
  • Cons: Only 256 characters, Cyrillic in extended Latin-1, not Unicode

Edge Cases

1. BOM (Byte Order Mark):

// UTF-8 file with BOM: EF BB BF
byte[] bomUtf8 = {(byte)0xEF, (byte)0xBB, (byte)0xBF, 'H', 'i'};
String s = new String(bomUtf8, StandardCharsets.UTF_8);
// s.charAt(0) = '\uFEFF' — invisible BOM character!
// s.startsWith("Hi") → false!

Solution: BOMInputStream (Apache Commons IO) or manual check of first 3 bytes:

if (bytes.length >= 3 && bytes[0] == (byte)0xEF && bytes[1] == (byte)0xBB && bytes[2] == (byte)0xBF) {
    s = new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
}
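The manual check above can be wrapped into a small self-contained helper (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class BomStrip {
    // Decode UTF-8 bytes, skipping a leading BOM (EF BB BF) if present
    static String decodeUtf8SkippingBom(byte[] bytes) {
        if (bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF) {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        System.out.println(decodeUtf8SkippingBom(withBom));                  // "Hi"
        System.out.println(decodeUtf8SkippingBom(withBom).startsWith("Hi")); // true
    }
}
```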

2. Malformed input — invalid UTF-8 bytes:

byte[] malformed = {(byte) 0xFF, (byte) 0xFE, (byte) 0x80};
String s = new String(malformed, StandardCharsets.UTF_8);
// Invalid sequences → '\uFFFD' (replacement character)
// s = "\uFFFD\uFFFD\uFFFD"

Behavior controlled via CodingErrorAction:

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
// Pick ONE policy (each call overwrites the previous one):
decoder.onMalformedInput(CodingErrorAction.REPORT);  // Throws MalformedInputException (default for a raw decoder)
// decoder.onMalformedInput(CodingErrorAction.IGNORE);  // Silently skips invalid bytes
// decoder.onMalformedInput(CodingErrorAction.REPLACE); // Replaces with \uFFFD (what new String(...) does)
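A complete, runnable sketch of strict decoding (class name is illustrative): with REPORT, invalid bytes fail fast instead of silently turning into \uFFFD:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] malformed = {(byte) 0xFF, (byte) 0xFE, (byte) 0x80}; // invalid UTF-8
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(malformed));
            System.out.println("decoded OK");
        } catch (CharacterCodingException e) {
            // REPORT makes invalid input throw instead of producing replacement characters
            System.out.println("rejected: " + e);
        }
    }
}
```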

3. Truncated multi-byte sequences:

// Cyrillic 'А' = D0 90 in UTF-8
byte[] truncated = {(byte) 0xD0}; // Only first byte
String s = new String(truncated, StandardCharsets.UTF_8);
// '\uFFFD' — incomplete sequence

Critical for streaming I/O: if buffer breaks in the middle of a multi-byte sequence, you need to save the “tail” and add it to the next buffer.
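The standard way to handle such a “tail” is CharsetDecoder with endOfInput=false, which leaves the incomplete sequence unconsumed in the input buffer for the next round. A minimal sketch (class name is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class StreamingDecode {
    public static void main(String[] args) {
        // Cyrillic 'А' = D0 90 in UTF-8, split across two chunks
        byte[] chunk1 = {(byte) 0xD0};
        byte[] chunk2 = {(byte) 0x90};

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(16);
        ByteBuffer in = ByteBuffer.allocate(16);

        in.put(chunk1).flip();
        decoder.decode(in, out, false); // endOfInput=false: partial byte stays unconsumed
        in.compact();                   // move the unconsumed "tail" to the front
        in.put(chunk2).flip();
        decoder.decode(in, out, true);  // full sequence now available
        decoder.flush(out);

        out.flip();
        System.out.println(out.toString()); // "А"
    }
}
```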

4. Security — encoding bypass:

// SQL injection through multi-byte encoding bypass
String input = "%C0%27 OR 1=1 --";
// In some old systems %C0%27 decodes as a single quote
// This bypasses filters that check input BEFORE decoding
// Always validate input AFTER decoding!

5. Surrogate pairs and byte length:

String emoji = "\uD83D\uDE00"; // "😀" — surrogate pair
emoji.length();                         // 2 (char)
emoji.getBytes(StandardCharsets.UTF_8).length; // 4 (bytes)
emoji.getBytes(StandardCharsets.UTF_16).length; // 6 (2 BOM + 4 data)
// StandardCharsets.UTF_16 includes BOM. UTF_16BE/UTF_16LE — without BOM, would be 4 bytes.
// Don't use bytes.length to determine character count!

Performance

| Operation | UTF-8 | UTF-16 | Latin-1 |
|---|---|---|---|
| Encode 100 chars (Latin-1) | ~100ns | ~50ns | ~20ns |
| Encode 100 chars (Cyrillic) | ~200ns | ~50ns | N/A |
| Decode 200 bytes (UTF-8) | ~150ns | N/A | N/A |
| Decode 200 bytes (UTF-16) | N/A | ~40ns | N/A |
| Encode 10KB text (mixed) | ~8μs | ~3μs | ~2μs |

Allocations:

  • getBytes(UTF_8): allocates new byte[] of size ~N–3N (depends on content)
  • new String(bytes, UTF_8): allocates new byte[] + String object (~24 bytes overhead)
  • For 1M conversions: ~100MB allocations → Young GC pressure

Thread Safety

  • StandardCharsets.UTF_8 — thread-safe (immutable singleton)
  • Charset.forName(...) — thread-safe (results are cached)
  • CharsetEncoder / CharsetDecoder — NOT thread-safe! A single instance must not be used from multiple threads simultaneously
  • Solution: create a new encoder/decoder per call, or use ThreadLocal<CharsetEncoder>
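A sketch of the ThreadLocal approach (class and method names are illustrative): each thread gets its own encoder, which it can reuse across calls:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class EncoderPool {
    // CharsetEncoder is stateful and NOT thread-safe; give each thread its own instance
    private static final ThreadLocal<CharsetEncoder> UTF8_ENCODER =
            ThreadLocal.withInitial(() -> StandardCharsets.UTF_8.newEncoder());

    public static byte[] encode(String s) throws CharacterCodingException {
        CharsetEncoder encoder = UTF8_ENCODER.get();
        encoder.reset(); // clear any state left by a previous call
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encode("Привет").length); // 12
    }
}
```

For a one-off conversion, plain str.getBytes(StandardCharsets.UTF_8) is simpler; the ThreadLocal pattern matters only on hot paths where encoder reuse is measurable.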

Production War Story

Scenario 1: HTTP API — reading request body (Spring Boot):

// Spring Boot — automatically uses UTF-8
@PostMapping
public void handle(@RequestBody String body) { ... }

// Raw Servlet — need to specify explicitly
request.setCharacterEncoding("UTF-8");
String body = request.getReader().readLine();

Problem: client sent POST request in Windows-1251, Spring read as UTF-8 → “garbled text”. Fix: header Content-Type: text/plain; charset=Windows-1251 or client migration to UTF-8.

Scenario 2: Kafka messages — serialization/deserialization:

// Producer
byte[] bytes = jsonString.getBytes(StandardCharsets.UTF_8);
producer.send(new ProducerRecord<>(topic, keyBytes, bytes));

// Consumer
String message = new String(record.value(), StandardCharsets.UTF_8);

Problem: one producer used getBytes() without encoding (default on Windows = Cp1251). Consumer on Linux read as UTF-8 → corrupted messages. Fix: unified UTF-8 standard at Kafka contract level.

Scenario 3: Highload log parser (500K lines/sec):

  • Converting each line new String(bytes, UTF_8) → 500K allocations/sec
  • Young GC every 2 seconds, pause 15ms
  • Fix: zero-copy via ByteBuffer + custom parser, without String conversion
  • Result: Young GC every 8 seconds, pause 5ms
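A minimal illustration of the idea (names are illustrative): scanning raw bytes for newlines without decoding. This is safe because in UTF-8 every byte of a multi-byte sequence is >= 0x80, so the '\n' byte (0x0A) never occurs inside one:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ZeroCopyLineCount {
    // Count lines by scanning raw bytes: no per-line String allocation
    static int countLines(ByteBuffer buf) {
        int count = 0;
        while (buf.hasRemaining()) {
            if (buf.get() == (byte) '\n') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "line1\nстрока2\nline3\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLines(ByteBuffer.wrap(data))); // 3
    }
}
```

A real parser would also extract fields this way, decoding to String only the fragments that actually need it.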

Monitoring

# Check default encoding (file.encoding system property)
java -XshowSettings:properties -version 2>&1 | grep file.encoding
# file.encoding = UTF-8

# Available encodings
jrunscript -e "print(java.nio.charset.Charset.availableCharsets().keySet())"

# GC logs — allocations from conversion
java -Xlog:gc*:file=gc.log ...

# JFR — Object Allocation
java -XX:StartFlightRecording=filename=recording.jfr ...
# In JFR: Memory → Object Allocation → filter by java.lang.String

// JOL (Java Object Layout) — String size after conversion
System.out.println(GraphLayout.parseInstance(s).toFootprint());

// Runtime check of the default charset
System.out.println(Charset.defaultCharset()); // Depends on OS (UTF-8 since Java 18)!

// Naive conversion benchmark (for reliable numbers use JMH)
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    byte[] b = str.getBytes(StandardCharsets.UTF_8);
}
long elapsed = System.nanoTime() - start;
System.out.println("1M encode: " + elapsed / 1_000_000 + "ms");

Best Practices for Highload

  • Always specify Charset — use StandardCharsets.UTF_8 (constant, no lookup allocations)
  • For JSON/XML/HTTP: UTF-8 — de facto standard
  • For binary protocols (Kafka, gRPC, TCP): work with byte[]/ByteBuf directly, without String conversion
  • For streaming I/O: InputStreamReader/OutputStreamWriter with explicit Charset — buffering inside
  • BOM handling: BOMInputStream (Apache Commons IO) or manual check of first bytes
  • For ultra-low-latency: avoid String — use ByteBuf (Netty), ByteBuffer, zero-copy approaches
  • Security: validate input AFTER decoding, use constant-time comparison for secrets
  • CharsetEncoder/CharsetDecoder — don’t share between threads; use ThreadLocal or per-call creation
  • For large data: streaming via Reader/Writer, don’t load entire file into byte[]

🎯 Interview Cheat Sheet

Must know:

  • str.getBytes(StandardCharsets.UTF_8) — correct way to convert String → byte[]
  • new String(bytes, StandardCharsets.UTF_8) — correct way to convert byte[] → String
  • getBytes() and new String(bytes) without encoding — depend on OS, cause “garbled text”
  • BOM in UTF-8 (3 bytes EF BB BF) creates invisible character \uFEFF at start of string
  • Invalid UTF-8 bytes are replaced with \uFFFD (replacement character)
  • CharsetEncoder/CharsetDecoder — NOT thread-safe, need ThreadLocal

Frequent follow-up questions:

  • Why is new String(bytes) without encoding bad? — The default encoding depends on the OS. A string encoded on Windows (Cp1251) becomes “garbled text” on Linux (UTF-8).
  • How to handle BOM when reading a UTF-8 file? — BOMInputStream, or check: if the first 3 bytes are EF BB BF, skip them.
  • What happens when converting binary data to String? — Data loss. For binary data, work with byte[]/ByteBuffer directly.
  • How to speed up conversion in highload? — Zero-copy: ByteBuffer, ByteBuf (Netty), streaming via Reader/Writer.

Red flags (DON’T say):

  • ❌ “getBytes() without encoding — is fine” — depends on OS, breaks on migration
  • ❌ “CharsetEncoder can be shared between threads” — NOT thread-safe
  • ❌ “BOM — only a UTF-16 problem” — UTF-8 can have BOM too
  • ❌ “You can use bytes.length to determine character count” — for UTF-8 1 char = 1-4 bytes

Related topics:

  • [[17. What is String Encoding]]
  • [[19. What are Compact Strings in Java 9+]]
  • [[20. How to Find Out How Much Memory a String Occupies]]