How to Properly Convert String to byte[] and Back
Converting a string to bytes and back is one of the most common operations when working with files, networks, and databases.
🟢 Junior Level
Main rule: ALWAYS explicitly specify the encoding!
```java
import java.nio.charset.StandardCharsets;

String str = "Hello, World!";

// String → byte[]
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// byte[] → String
String restored = new String(bytes, StandardCharsets.UTF_8);
System.out.println(restored); // "Hello, World!"
```
Never do this:
```java
// ❌ BAD — uses the OS default encoding (different on Windows and Linux!)
byte[] bytes = str.getBytes();
String restored = new String(bytes);
```
Why: Default encoding depends on the operating system. On Windows it might be Windows-1251, on Linux — UTF-8. A string converted on one system will turn into “garbled text” on another.
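To make the failure concrete, here is a minimal sketch (the class name is illustrative; `windows-1251` is assumed to be available, as it is in standard JDKs) showing how decoding with the wrong charset mangles text, while matching explicit charsets round-trip losslessly:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Привет";

        // Bytes produced on a machine whose default encoding is Windows-1251...
        byte[] bytes = original.getBytes(Charset.forName("windows-1251"));

        // ...decoded on a machine whose default is UTF-8
        String garbled = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(garbled.equals(original)); // false — mojibake

        // Explicit, matching charsets round-trip losslessly
        String ok = new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(ok.equals(original)); // true
    }
}
```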
Analogy: String → byte[] is like recording speech on a dictaphone (text → bytes); byte[] → String is like playing back the recording (bytes → text). If you choose the wrong recording format (encoding), you'll hear noise instead of words.
🟡 Middle Level
Correct conversion methods
String → byte[]:
```java
// Java 7+ — recommended approach (constant, no lookup allocations)
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

// If the encoding is determined dynamically
byte[] bytes = str.getBytes(Charset.forName("Windows-1251"));
```
byte[] → String:
```java
// Java 7+ — recommended approach
String str = new String(bytes, StandardCharsets.UTF_8);

// If the encoding is dynamic
String str = new String(bytes, Charset.forName("Windows-1251"));
```
Streaming work with large data:
```java
// Reading a file with an explicit encoding
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line = reader.readLine();
}

// Writing a file with an explicit encoding
try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
    writer.write("Hello");
}
```
Table of typical mistakes
| Mistake | Consequences | Solution |
|---|---|---|
| `getBytes()` without encoding | "Garbled text" on OS change | Always `getBytes(StandardCharsets.UTF_8)` |
| `new String(bytes)` without encoding | Cannot restore the original text | Always `new String(bytes, StandardCharsets.UTF_8)` |
| Confusing `str.length()` with `bytes.length` | Wrong data-processing logic | `"Привет".length()` = 6, but `"Привет".getBytes(UTF_8).length` = 12 |
| Converting binary data to String | Data loss, corruption | For binary data, use `byte[]`/`ByteBuffer` directly |
Encoding comparison for conversion
| Encoding | Bytes/char (Latin) | Bytes/char (Cyrillic) | When to use |
|---|---|---|---|
| UTF-8 | 1 | 2 | Internet, JSON, HTTP — de facto standard |
| UTF-16 | 2 | 2 | Internal Java format, Windows API |
| Latin-1 | 1 | N/A | Western European only |
| Windows-1251 | N/A | 1 | Legacy Windows systems |
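The table's byte counts can be verified directly; note that `StandardCharsets.UTF_16` prepends a 2-byte BOM when encoding (the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    public static void main(String[] args) {
        System.out.println("Hello".getBytes(StandardCharsets.UTF_8).length);      // 5  — 1 byte/char
        System.out.println("Привет".getBytes(StandardCharsets.UTF_8).length);     // 12 — 2 bytes/char
        System.out.println("Hello".getBytes(StandardCharsets.UTF_16).length);     // 12 — 2-byte BOM + 2 bytes/char
        System.out.println("Hello".getBytes(StandardCharsets.ISO_8859_1).length); // 5  — Latin-1, 1 byte/char
    }
}
```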
When NOT to convert String → byte[]
- Binary data (images, PDF, protobuf) — work with `byte[]` directly
- Ultra-low-latency systems — conversion adds ~50–200 ns of overhead
- Data of unknown encoding — use auto-detection or `BOMInputStream`
🔴 Senior Level
Internal Implementation
String.getBytes(Charset):
```java
// Simplified JDK source
public byte[] getBytes(Charset charset) {
    if (charset == null) throw new NullPointerException();
    return StringCoding.encode(value, coder, charset);
}
```
StringCoding.encode — what happens:
- Check the `coder` (LATIN1 or UTF16)
- Choose the encoding algorithm:
  - Latin-1 → UTF-8: 1 byte → 1 byte (ASCII), 1 byte → 2 bytes (extended Latin-1, U+0080–U+00FF)
  - UTF-16 → UTF-8: 2 bytes → 1–3 bytes (depends on the code point)
- For charsets without a built-in fast path: character-by-character conversion via `CharsetEncoder`
- Allocate a new `byte[]` of the needed size
new String(byte[], Charset):
```java
public String(byte[] bytes, Charset charset) {
    this(bytes, 0, bytes.length, charset);
}
// → StringCoding.decode → CharsetDecoder → byte[] value + coder
```
- `CharsetDecoder` decodes bytes to characters
- The `coder` is determined: if all characters fit in the range U+0000–U+00FF → LATIN1, otherwise → UTF16
- Invalid bytes are replaced with `\uFFFD` (the replacement character)
- A new String is created with the `byte[] value` and the `coder`
Encoding trade-offs
UTF-8:
- Pros: De facto standard, ASCII-compatible, variable-length (savings for Latin)
- Cons: Cyrillic = 2 bytes/char, hieroglyphs = 3 bytes, variable-length complicates random access
UTF-16:
- Pros: Fixed size for BMP (2 bytes/char), internal Java format (minimum conversion)
- Cons: Endianness (LE vs BE), BOM, 2x size for ASCII, not ASCII-compatible
Latin-1 (ISO-8859-1):
- Pros: 1 byte/char, minimal overhead, fixed size
- Cons: Only 256 characters, Cyrillic in extended Latin-1, not Unicode
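A caveat worth demonstrating: `getBytes(Charset)` does not throw on unmappable characters — it silently substitutes `?`, so encoding Cyrillic as Latin-1 destroys the data (the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Latin1Loss {
    public static void main(String[] args) {
        String s = "Привет";
        // Cyrillic is unmappable in ISO-8859-1 → each char is replaced with '?'
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        String back = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(back); // "??????" — irreversibly lost
    }
}
```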
Edge Cases
1. BOM (Byte Order Mark):
```java
// UTF-8 file with BOM: EF BB BF
byte[] bomUtf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
String s = new String(bomUtf8, StandardCharsets.UTF_8);
// s.charAt(0) == '\uFEFF' — an invisible BOM character!
// s.startsWith("Hi") → false!
```
Solution: BOMInputStream (Apache Commons IO) or manual check of first 3 bytes:
```java
if (bytes.length >= 3 && bytes[0] == (byte) 0xEF && bytes[1] == (byte) 0xBB && bytes[2] == (byte) 0xBF) {
    s = new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
}
```
2. Malformed input — invalid UTF-8 bytes:
```java
byte[] malformed = {(byte) 0xFF, (byte) 0xFE, (byte) 0x80};
String s = new String(malformed, StandardCharsets.UTF_8);
// Invalid sequences → '\uFFFD' (replacement character)
// s = "\uFFFD\uFFFD\uFFFD"
```
Behavior controlled via CodingErrorAction:
```java
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);  // throws MalformedInputException (decoder default)
decoder.onMalformedInput(CodingErrorAction.IGNORE);  // silently skips invalid bytes
decoder.onMalformedInput(CodingErrorAction.REPLACE); // → \uFFFD (what new String(...) uses)
```
3. Truncated multi-byte sequences:
```java
// Cyrillic 'А' = D0 90 in UTF-8
byte[] truncated = {(byte) 0xD0}; // only the first byte
String s = new String(truncated, StandardCharsets.UTF_8);
// '\uFFFD' — incomplete sequence
```
Critical for streaming I/O: if buffer breaks in the middle of a multi-byte sequence, you need to save the “tail” and add it to the next buffer.
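One way to handle this correctly is to keep a single `CharsetDecoder` across reads and pass `endOfInput=false` until the stream ends; the decoder then leaves the incomplete tail in the input buffer, and `compact()` carries it over to the next chunk. A sketch (the class name, buffer sizes, and the split point are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StreamingDecode {
    public static void main(String[] args) {
        byte[] all = "Привет".getBytes(StandardCharsets.UTF_8); // 12 bytes, 2 per char
        byte[] chunk1 = Arrays.copyOfRange(all, 0, 5);          // ends mid-'и'
        byte[] chunk2 = Arrays.copyOfRange(all, 5, 12);

        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer chars = CharBuffer.allocate(16);
        ByteBuffer buf = ByteBuffer.allocate(16);

        buf.put(chunk1).flip();
        decoder.decode(buf, chars, false); // endOfInput=false: the lone lead byte stays in buf
        buf.compact();                     // move the undecoded tail to the front
        buf.put(chunk2).flip();
        decoder.decode(buf, chars, true);  // endOfInput=true: finish decoding
        decoder.flush(chars);

        chars.flip();
        System.out.println(chars.toString()); // "Привет" — no \uFFFD despite the split
    }
}
```

Decoding each chunk independently with `new String(chunk, UTF_8)` would instead turn the dangling lead byte into `\uFFFD`.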
4. Security — encoding bypass:
```java
// SQL injection via a multi-byte encoding bypass
String input = "%C0%27 OR 1=1 --";
// In some old systems %C0%27 decodes to a single quote
// This bypasses filters that check input BEFORE decoding
// Always validate input AFTER decoding!
```
5. Surrogate pairs and byte length:
```java
String emoji = "\uD83D\uDE00"; // "😀" — a surrogate pair
emoji.length();                                  // 2 (chars)
emoji.getBytes(StandardCharsets.UTF_8).length;   // 4 (bytes)
emoji.getBytes(StandardCharsets.UTF_16).length;  // 6 (2-byte BOM + 4 data)
// StandardCharsets.UTF_16 includes a BOM; UTF_16BE/UTF_16LE don't — those give 4 bytes.
// Don't use bytes.length to determine character count!
```
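To count user-visible characters rather than `char` units or bytes, count code points (the class name is illustrative):

```java
public class CodePoints {
    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // "😀" — one code point, two chars
        System.out.println(emoji.length());                          // 2 — UTF-16 code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 — actual characters
    }
}
```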
Performance
| Operation | UTF-8 | UTF-16 | Latin-1 |
|---|---|---|---|
| Encode 100 chars (Latin-1) | ~100ns | ~50ns | ~20ns |
| Encode 100 chars (Cyrillic) | ~200ns | ~50ns | N/A |
| Decode 200 bytes (UTF-8) | ~150ns | N/A | N/A |
| Decode 200 bytes (UTF-16) | N/A | ~40ns | N/A |
| Encode 10KB text (mixed) | ~8μs | ~3μs | ~2μs |
Allocations:
- `getBytes(UTF_8)`: allocates a new `byte[]` of size ~N–3N (depends on content)
- `new String(bytes, UTF_8)`: allocates a new `byte[]` plus the String object (~24 bytes of overhead)
- At 1M conversions: ~100MB of allocations → Young GC pressure
Thread Safety
- `StandardCharsets.UTF_8` — thread-safe (immutable singleton)
- `Charset.forName(...)` — thread-safe (results are cached)
- `CharsetEncoder`/`CharsetDecoder` — NOT thread-safe! A single instance cannot be used from multiple threads simultaneously
- Solution: create a new encoder/decoder per call, or use `ThreadLocal<CharsetEncoder>`
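A per-thread encoder cache might be sketched like this (the class and method names are illustrative):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public final class Utf8Encoders {
    // CharsetEncoder carries mutable state between calls, so give each thread its own
    private static final ThreadLocal<CharsetEncoder> ENCODER =
            ThreadLocal.withInitial(StandardCharsets.UTF_8::newEncoder);

    public static CharsetEncoder get() {
        // reset() clears any state left over from the previous use on this thread
        return ENCODER.get().reset();
    }
}
```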
Production War Stories
Scenario 1: HTTP API — reading request body (Spring Boot):
```java
// Spring Boot — uses UTF-8 automatically
@PostMapping
public void handle(@RequestBody String body) { ... }

// Raw Servlet — must be set explicitly, BEFORE calling getReader()
request.setCharacterEncoding("UTF-8");
String body = request.getReader().readLine();
```
Problem: client sent POST request in Windows-1251, Spring read as UTF-8 → “garbled text”. Fix: header Content-Type: text/plain; charset=Windows-1251 or client migration to UTF-8.
Scenario 2: Kafka messages — serialization/deserialization:
```java
// Producer
byte[] bytes = jsonString.getBytes(StandardCharsets.UTF_8);
producer.send(new ProducerRecord<>(topic, keyBytes, bytes));

// Consumer
String message = new String(record.value(), StandardCharsets.UTF_8);
```
Problem: one producer used getBytes() without encoding (default on Windows = Cp1251). Consumer on Linux read as UTF-8 → corrupted messages. Fix: unified UTF-8 standard at Kafka contract level.
Scenario 3: Highload log parser (500K lines/sec):
- Converting each line with `new String(bytes, UTF_8)` → 500K allocations/sec
- Young GC every 2 seconds, 15ms pauses
- Fix: zero-copy via `ByteBuffer` + a custom parser, without String conversion
- Result: Young GC every 8 seconds, 5ms pauses
Monitoring
```bash
# Check the default encoding (system properties)
java -XshowSettings:properties -version 2>&1 | grep file.encoding

# Available encodings
jrunscript -e "print(java.nio.charset.Charset.availableCharsets().keySet())"

# GC logs — allocations from conversion
java -Xlog:gc*:file=gc.log ...

# JFR — Object Allocation
java -XX:StartFlightRecording=filename=recording.jfr ...
# In JFR: Memory → Object Allocation → filter by java.lang.String
```

```java
// JOL — String footprint after conversion
System.out.println(GraphLayout.parseInstance(s).toFootprint());

// Runtime check
System.out.println(Charset.defaultCharset()); // Depends on OS (UTF-8 by default since JDK 18)!

// Conversion micro-benchmark (rough; use JMH for real measurements)
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    byte[] b = str.getBytes(StandardCharsets.UTF_8);
}
long elapsed = System.nanoTime() - start;
System.out.println("1M encode: " + elapsed / 1_000_000 + "ms");
```
Best Practices for Highload
- Always specify a `Charset` — use `StandardCharsets.UTF_8` (a constant, no lookup allocations)
- For JSON/XML/HTTP: UTF-8 — the de facto standard
- For binary protocols (Kafka, gRPC, TCP): work with `byte[]`/`ByteBuf` directly, without String conversion
- For streaming I/O: `InputStreamReader`/`OutputStreamWriter` with an explicit Charset — buffering is built in
- BOM handling: `BOMInputStream` (Apache Commons IO) or a manual check of the first bytes
- For ultra-low-latency: avoid String — use `ByteBuf` (Netty), `ByteBuffer`, zero-copy approaches
- Security: validate input AFTER decoding; use constant-time comparison for secrets
- `CharsetEncoder`/`CharsetDecoder` — don't share between threads; use `ThreadLocal` or per-call creation
- For large data: stream via `Reader`/`Writer`; don't load the entire file into a `byte[]`
🎯 Interview Cheat Sheet
Must know:
- `str.getBytes(StandardCharsets.UTF_8)` — the correct way to convert String → byte[]
- `new String(bytes, StandardCharsets.UTF_8)` — the correct way to convert byte[] → String
- `getBytes()` and `new String(bytes)` without encoding depend on the OS and cause "garbled text"
- BOM in UTF-8 (3 bytes `EF BB BF`) creates an invisible `\uFEFF` character at the start of the string
- Invalid UTF-8 bytes are replaced with `\uFFFD` (the replacement character)
- `CharsetEncoder`/`CharsetDecoder` — NOT thread-safe; use `ThreadLocal`
Frequent follow-up questions:
- Why is `new String(bytes)` without encoding bad? — The default encoding depends on the OS: a string encoded on Windows (Cp1251) becomes "garbled text" on Linux (UTF-8).
- How to handle BOM when reading a UTF-8 file? — `BOMInputStream`, or check: if the first 3 bytes are `EF BB BF`, skip them.
- What happens when converting binary data to String? — Data loss. For binary data, work with `byte[]`/`ByteBuffer` directly.
- How to speed up conversion in highload? — Zero-copy: `ByteBuffer`, `ByteBuf` (Netty), streaming via `Reader`/`Writer`.
Red flags (DON’T say):
- ❌ "`getBytes()` without encoding is fine" — it depends on the OS and breaks on migration
- ❌ "`CharsetEncoder` can be shared between threads" — it is NOT thread-safe
- ❌ "BOM is only a UTF-16 problem" — UTF-8 files can have a BOM too
- ❌ "You can use `bytes.length` to determine character count" — in UTF-8, 1 char = 1–4 bytes
Related topics:
- [[17. What is String Encoding]]
- [[19. What are Compact Strings in Java 9+]]
- [[20. How to Find Out How Much Memory a String Occupies]]