How the split() Method Works

🟢 Junior Level

The split() method breaks a string into an array of substrings by a given delimiter.

Simple example:

String data = "apple,banana,cherry";
String[] fruits = data.split(",");
// ["apple", "banana", "cherry"]

Important: The delimiter is a regular expression, not just a string! Some characters need escaping:

String ip = "192.168.1.1";
String[] parts = ip.split("\\."); // Dot needs escaping!
// ["192", "168", "1", "1"]

Trailing empty strings are removed by default:

"a,b,,,".split(",");  // ["a", "b"] — empties removed
"a,b,,,".split(",", -1); // ["a", "b", "", "", ""] — all preserved

🟡 Middle Level

Two versions of the method

String[] split(String regex)           // limit = 0
String[] split(String regex, int limit) // full control

limit parameter:

| Limit | Behavior | Example: "a,b,c,,".split(",", limit) |
| --- | --- | --- |
| 0 (default) | Maximum splitting, trailing empty strings removed | ["a", "b", "c"] |
| > 0 | At most limit elements; the unsplit remainder goes into the last element | ["a", "b,c,,"] (limit=2) |
| < 0 | Maximum splitting, trailing empty strings preserved | ["a", "b", "c", "", ""] |
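The three limit cases can be checked side by side with a small sketch. The class and method names (SplitLimitDemo, splitWithLimit) are illustrative and not part of any API; only String.split(String, int) is the real call.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // Splits on a comma with a caller-chosen limit so the cases can be compared.
    static String[] splitWithLimit(String input, int limit) {
        return input.split(",", limit);
    }

    public static void main(String[] args) {
        String input = "a,b,c,,";
        // limit = 0: trailing empty strings are dropped
        System.out.println(Arrays.toString(splitWithLimit(input, 0)));  // [a, b, c]
        // limit = 2: at most two elements, the remainder is left unsplit
        System.out.println(Arrays.toString(splitWithLimit(input, 2)));  // [a, b,c,,]
        // limit = -1: trailing empty strings are kept
        System.out.println(Arrays.toString(splitWithLimit(input, -1))); // [a, b, c, , ]
    }
}
```

Printing the array length alongside the contents makes the difference between limit 0 and limit -1 easier to spot, since empty strings are nearly invisible in Arrays.toString output.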

Fast Path optimization

split() does NOT always use the heavy regex engine. If the delimiter is a single character that is not a regex metacharacter, or a backslash followed by a single non-alphanumeric character (for example "\\."), a direct search is used:

// Fast Path — no regex compilation
"hello world".split(" ");

// Regex engine — Pattern/Matcher compilation
"hello world".split("\\s+");

Metacharacters that break Fast Path when used as an unescaped single-character delimiter: . $ | ( ) [ { ^ ? * + \

The closing ] and } are not part of this check.

Typical mistakes

Mistake: split(".") treats the dot as “any character” in regex. Solution: split("\\.") or split(Pattern.quote(".")).
Mistake: Expecting trailing empty strings to be kept by default. Solution: Use split(",", -1) to preserve them.
Mistake: Calling split() in a loop with a regex delimiter that misses the Fast Path, so the pattern is recompiled on every call. Solution: Compile the Pattern once (for example in a static final field) and reuse it: pattern.split(str).
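The first mistake is easy to reproduce. The sketch below (DotSplitDemo and its method names are illustrative) contrasts the unescaped dot with the two fixes named above.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class DotSplitDemo {
    // Unescaped dot: every character is a delimiter, so only empty strings remain
    // and all of them are removed as trailing empties.
    static String[] naive(String s) {
        return s.split(".");
    }

    // Escaped dot: matches a literal '.'.
    static String[] escaped(String s) {
        return s.split("\\.");
    }

    // Pattern.quote wraps the text so the regex engine treats it literally.
    static String[] quoted(String s) {
        return s.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        String ip = "192.168.1.1";
        System.out.println(naive(ip).length);               // 0
        System.out.println(Arrays.toString(escaped(ip)));   // [192, 168, 1, 1]
        System.out.println(Arrays.toString(quoted(ip)));    // [192, 168, 1, 1]
    }
}
```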

🔴 Senior Level

Internal Implementation

OpenJDK — String.split():

public String[] split(String regex, int limit) {
    char ch = 0;
    // Fast Path applies when the delimiter is either
    //  (1) a single char that is not a regex metacharacter, or
    //  (2) a backslash followed by a char that is not an ASCII digit or letter
    //      (so it cannot be a known shorthand such as \d or \w),
    // and, in both cases, ch is not a surrogate.
    // Each ((ch-'a')|('z'-ch)) < 0 test is true only when ch lies outside that range.
    if (((regex.length() == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE)) {
        // FAST PATH: direct indexOf search, no Pattern/Matcher
        int off = 0;
        int next = indexOf(ch, off);
        // ... manual splitting; this branch returns its own result
    }
    // SLOW PATH: via Pattern
    return Pattern.compile(regex).split(this, limit);
}
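To reason about which delimiters qualify, the condition from the listing can be restated as a standalone predicate. This is a sketch that mirrors only the check shown above; FastPathCheck and isFastPathDelimiter are hypothetical names, and the real condition may differ between JDK versions.

```java
final class FastPathCheck {
    // Same metacharacter set as in the listing above.
    private static final String META = ".$|()[{^?*+\\";

    // Returns true when the delimiter would satisfy the Fast Path condition shown above.
    static boolean isFastPathDelimiter(String regex) {
        char ch;
        if (regex.length() == 1) {
            ch = regex.charAt(0);
            if (META.indexOf(ch) != -1) {
                return false; // unescaped metacharacter
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            ch = regex.charAt(1);
            boolean alphanumeric = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z');
            if (alphanumeric) {
                return false; // could be a shorthand such as \d or \w
            }
        } else {
            return false; // anything longer goes to the regex engine
        }
        // Surrogates are excluded in both cases.
        return ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(isFastPathDelimiter(","));     // true
        System.out.println(isFastPathDelimiter("."));     // false
        System.out.println(isFastPathDelimiter("\\."));   // true
        System.out.println(isFastPathDelimiter("\\s+"));  // false
    }
}
```

A predicate like this is only useful for reasoning or for a lint-style check; it does not change how split() itself behaves.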

Architectural Trade-offs

Fast Path:

Pros: No Pattern/Matcher allocation, ~5-10ns per call
Cons: Works only for simplest delimiters

Regex Engine (Pattern/Matcher):

Pros: Full power of regular expressions
Cons: Regex compilation (~1-5μs), Pattern + Matcher + results allocations

Edge Cases

Empty input:

"".split(",");    // [""] — array with one empty string
",".split(",");   // [] — empty array (limit=0 removes empties)
",".split(",", -1); // ["", ""] — two empty strings

Regex with lookahead/lookbehind:

"a1b2c3".split("(?=\\d)"); // ["a", "1b", "2c", "3"] — split before digit

Trailing empty strings:

"a,,b".split(",");     // ["a", "", "b"]
"a,,b,,,".split(",");  // ["a", "", "b"] — trailing removed
"a,,b,,,".split(",", -1); // ["a", "", "b", "", "", ""]

Performance

| Scenario | Fast Path | Regex Engine | Pre-compiled Pattern |
| --- | --- | --- | --- |
| split(",") 1M times | ~50ms | ~500ms | ~80ms |
| split on an escaped pipe, 1M times | ~60ms | ~500ms | ~80ms |
| split("\\s+") 1M times | N/A | ~800ms | ~120ms |
| Regex compile overhead | 0 | ~1-5μs | 0 (once) |

In the second row the delimiter is "\\|" (a backslash-escaped pipe), which still qualifies for the Fast Path.

Production Experience

Scenario: CSV parsing (10M lines):

// BAD: regex delimiter recompiled on every line
for (String line : lines) {
    String[] fields = line.split("\\s*,\\s*"); // not Fast Path: 10M regex compilations!
}

// GOOD: pre-compiled Pattern (a class-level constant), reused for every line
private static final Pattern FIELD_SEP = Pattern.compile("\\s*,\\s*");
for (String line : lines) {
    String[] fields = FIELD_SEP.split(line);
}

// BETTER when the delimiter is a plain comma: Fast Path (single char, not meta)
for (String line : lines) {
    String[] fields = line.split(","); // Fast Path will trigger, nothing is compiled
}

Scenario 2: Log file parsing with regex delimiter:

line.split("\\s\\|\\s") — not Fast Path, every call compiles regex
Fix: private static final Pattern SEP = Pattern.compile("\\s\\|\\s");
Result: -80% CPU on parsing
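A minimal sketch of that fix, assuming log lines of the form "timestamp | level | message" (the line format and the LogLineParser class name are assumptions made for illustration):

```java
import java.util.regex.Pattern;

final class LogLineParser {
    // Compiled once when the class is loaded, then reused for every line.
    private static final Pattern SEP = Pattern.compile("\\s\\|\\s");

    // Splits one log line into its fields without recompiling the regex.
    static String[] parse(String line) {
        return SEP.split(line);
    }

    public static void main(String[] args) {
        String[] fields = parse("2024-01-01 | INFO | started");
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

Pattern instances are immutable and safe to share between threads, so a single static constant is enough even in a multi-threaded parser.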

Monitoring

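// JMH benchmark (class name and sample input are illustrative)
import java.util.regex.Pattern;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class SplitBenchmark {
    private static final Pattern COMMA = Pattern.compile(",");
    private String input = "apple,banana,cherry"; // non-final to avoid constant folding

    @Benchmark
    public String[] testSplit() {
        return input.split(",");
    }

    @Benchmark
    public String[] testPrecompiled() {
        return COMMA.split(input);
    }
}

// Watch JIT compilation of the hot methods (this reports JIT activity, not regex compilation;
// to find regex compilation, look for Pattern.compile in a CPU profile)
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions ...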

Best Practices for Highload

For single-character delimiters (not meta): split(",") — Fast Path
For regex delimiters: compile the Pattern once (static final field) and reuse it via pattern.split(str)
In hot paths: consider manual implementation via indexOf() — minimum allocations
For CSV/TSV: specialized libraries (OpenCSV, Apache Commons CSV)
For ultra-low-latency: zero-copy parsing via CharSequence wrappers
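The manual indexOf() approach from the list above could look like the following sketch. IndexOfSplitter is a hypothetical helper, not a JDK class, and it deliberately keeps all empty fields, so it behaves like split(",", -1) rather than the default split(",").

```java
import java.util.ArrayList;
import java.util.List;

final class IndexOfSplitter {
    // Splits on a single char using indexOf only; no regex, no Pattern/Matcher.
    // Empty fields (leading, inner, trailing) are all preserved.
    static List<String> split(String s, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int next;
        while ((next = s.indexOf(delimiter, start)) != -1) {
            parts.add(s.substring(start, next));
            start = next + 1; // continue right after the delimiter
        }
        parts.add(s.substring(start)); // the tail after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a,,b,,", ','));  // [a, , b, , ]
        System.out.println(split("", ','));        // []
    }
}
```

Note that the second call prints "[]" because the list holds one empty string, which List.toString renders as nothing between the brackets. Whether this beats the built-in Fast Path depends on the workload, so it is worth measuring before adopting it.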

🎯 Interview Cheat Sheet

Must know:

split(regex) splits string by regular expression, returns array
Fast Path: for single-character non-meta delimiters — direct search without regex engine
limit parameter: 0 — trailing empties removed, < 0 — preserved, > 0 — max elements
Regex metacharacters need escaping: . $ | ( ) [ ] { } ^ ? * + \
Pattern.quote(".") — safe way to escape for split()
Compiling regex on every loop iteration — antipattern, use pre-compiled Pattern

Frequent follow-up questions:

Why doesn’t split(".") work? — Dot in regex = “any character”. Need split("\\.").
What does split(",", -1) do? — Preserves empty strings at end. By default (limit=0) they’re removed.
What is Fast Path in split()? — If delimiter is a single non-meta character, direct search is used without Pattern/Matcher.
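How to optimize split() in a loop? For a regex delimiter, use a pre-compiled Pattern, e.g. private static final Pattern SEP = Pattern.compile("\\s+"). For a single non-meta character such as ",", plain split(",") already takes the Fast Path.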

Red flags (DON’T say):

❌ “split() takes a plain string, not regex” — takes regex, dot will break logic
❌ “split() always preserves empty strings” — by default (limit=0) removes trailing empty
❌ “You can compile regex in a loop without consequences” — 10M compilations = seconds of CPU
❌ “split() — the only way to split a string” — there’s indexOf(), StringTokenizer, specialized parsers

Related topics:

[[16. Difference Between replace() vs replaceAll()]]
[[8. How Java Compiler Optimizes String Concatenation]]
[[7. What Happens When Concatenating Strings with + Operator]]