Question 15 · Section 12

How the split() Method Works

The split() method breaks a string into an array of substrings by a given delimiter.

🟢 Junior Level

Simple example:

String data = "apple,banana,cherry";
String[] fruits = data.split(",");
// ["apple", "banana", "cherry"]

Important: The delimiter is a regular expression, not just a string! Some characters need escaping:

String ip = "192.168.1.1";
String[] parts = ip.split("\\."); // Dot needs escaping!
// ["192", "168", "1", "1"]
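
To make the escaping rule concrete, here is a minimal sketch contrasting an unescaped and an escaped dot. The class and method names (DotSplitDemo, splitUnescaped, splitEscaped) are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

class DotSplitDemo {
    // Unescaped dot: every character matches the delimiter, so only empty
    // strings are produced, and all of them are dropped as trailing empties.
    static String[] splitUnescaped(String s) {
        return s.split(".");
    }

    // Escaped dot: splits on the literal '.' character.
    static String[] splitEscaped(String s) {
        return s.split("\\.");
    }

    public static void main(String[] args) {
        String ip = "192.168.1.1";
        System.out.println(Arrays.toString(splitUnescaped(ip))); // []
        System.out.println(Arrays.toString(splitEscaped(ip)));   // [192, 168, 1, 1]
    }
}
```

The unescaped call does not throw; it silently returns an empty array, which is why this mistake is easy to miss.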

Trailing empty strings are removed by default:

"a,b,,,".split(",");  // ["a", "b"] (trailing empties removed)
"a,b,,,".split(",", -1); // ["a", "b", "", "", ""] (all preserved)

🟡 Middle Level

Two versions of the method

String[] split(String regex)           // limit = 0
String[] split(String regex, int limit) // full control

limit parameter (example: "a,b,c,,".split(",", limit)):

| Limit | Behavior | Result |
| --- | --- | --- |
| 0 (default) | Maximum splitting, trailing empty strings removed | ["a", "b", "c"] |
| > 0 | At most limit elements; the unsplit remainder goes into the last element | ["a", "b,c,,"] (limit=2) |
| < 0 | Maximum splitting, trailing empty strings preserved | ["a", "b", "c", "", ""] |
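
The three limit modes can be checked with a short sketch. LimitDemo and splitCsv are hypothetical names used only here; the limit=1 case is added to show that a positive limit of 1 performs no splitting at all.

```java
import java.util.Arrays;

class LimitDemo {
    // Thin wrapper so each limit value can be tried on the same input.
    static String[] splitCsv(String s, int limit) {
        return s.split(",", limit);
    }

    public static void main(String[] args) {
        String s = "a,b,c,,";
        System.out.println(Arrays.toString(splitCsv(s, 0)));  // [a, b, c]
        System.out.println(Arrays.toString(splitCsv(s, 2)));  // [a, b,c,,]
        System.out.println(splitCsv(s, -1).length);           // 5 (two trailing empties kept)
        System.out.println(Arrays.toString(splitCsv(s, 1)));  // [a,b,c,,]
    }
}
```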

Fast Path optimization

split() does NOT always use the heavy regex engine. If the delimiter is a single character that is not a regex metacharacter (or a backslash-escaped character that is not an ASCII letter or digit, such as "\\."), a direct character search is used:

// Fast Path: no regex compilation
"hello world".split(" ");

// Regex engine: Pattern/Matcher compilation
"hello world".split("\\s+");

Metacharacters that break Fast Path when used unescaped: ., $, |, (, ), [, {, ^, ?, *, +, \

Typical mistakes

  1. Mistake: split("."), where the dot means “any character” in regex. Solution: split("\\.") or split(Pattern.quote(".")).

  2. Mistake: expecting trailing empty strings to be kept by default. Solution: use split(",", -1) to preserve them.

  3. Mistake: calling split() in a loop with the same non-trivial regex, which recompiles it on every call. Solution: compile the Pattern once, e.g. Pattern p = Pattern.compile("\\s+"), then call p.split(str) inside the loop.
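
The three fixes above can be combined in one sketch: quote the delimiter so it is treated as plain text, compile it once, and pass -1 to keep trailing empties. LiteralSplitDemo and literalDelimiter are hypothetical names, and the input lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralSplitDemo {
    // Hypothetical helper: builds a Pattern that matches `delimiter` literally,
    // so characters such as '.' or '|' lose their regex meaning.
    static Pattern literalDelimiter(String delimiter) {
        return Pattern.compile(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        Pattern doublePipe = literalDelimiter("||"); // compiled once, reused below
        String[] lines = {"a||b||c", "x||y||"};
        for (String line : lines) {
            // limit = -1 keeps trailing empty strings
            System.out.println(Arrays.toString(doublePipe.split(line, -1)));
        }
    }
}
```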


🔴 Senior Level

Internal Implementation

OpenJDK, String.split() (abridged):

public String[] split(String regex, int limit) {
    char ch = 0;
    // Fast Path applies when the regex is either
    //  (1) a single char that is not one of the metacharacters ".$|()[{^?*+\\", or
    //  (2) a backslash followed by a char that is not an ASCII digit or letter
    //      (so it cannot be a shorthand class such as \d, \w, \s),
    // and, in both cases, ch is not a surrogate.
    if (((regex.length() == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&  // not 0-9
          ((ch-'a')|('z'-ch)) < 0 &&                      // not a-z
          ((ch-'A')|('Z'-ch)) < 0)) &&                    // not A-Z
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE)) {
        // FAST PATH: direct indexOf search, no Pattern/Matcher
        int off = 0;
        int next = indexOf(ch, off);
        // ... manual splitting without Pattern/Matcher
    }
    // SLOW PATH: compile the regex and delegate to Pattern
    return Pattern.compile(regex).split(this, limit);
}
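
The bit-twiddling condition above is easier to read when rewritten with plain range checks. The sketch below is a hypothetical helper (looksLikeFastPath is not a JDK method) that mirrors the quoted condition only; other JDK versions may apply different or additional checks.

```java
class FastPathCheck {
    // Hypothetical helper mirroring the condition quoted from String.split():
    // returns true when the delimiter would qualify for the Fast Path.
    static boolean looksLikeFastPath(String regex) {
        char ch;
        if (regex.length() == 1) {
            ch = regex.charAt(0);
            // A single unescaped metacharacter needs the regex engine.
            if (".$|()[{^?*+\\".indexOf(ch) != -1) {
                return false;
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            ch = regex.charAt(1);
            // Backslash + ASCII letter or digit may be a shorthand such as \d or \w.
            boolean asciiAlnum = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z');
            if (asciiAlnum) {
                return false;
            }
        } else {
            return false;
        }
        // Surrogate chars are excluded, same as the range check in the original.
        return !Character.isSurrogate(ch);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeFastPath(","));     // true
        System.out.println(looksLikeFastPath("."));     // false
        System.out.println(looksLikeFastPath("\\."));   // true
        System.out.println(looksLikeFastPath("\\s+"));  // false
    }
}
```

Note that "]" and "}" are not in the metacharacter string, so under this condition they qualify for the Fast Path even though they are often listed as characters to escape.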

Architectural Trade-offs

Fast Path:

  • Pros: No Pattern/Matcher allocation, typically tens of nanoseconds per call (indicative; compare ~50ms per 1M calls under Performance)
  • Cons: Works only for the simplest delimiters

Regex Engine (Pattern/Matcher):

  • Pros: Full power of regular expressions
  • Cons: Regex compilation (roughly 1-5 μs per call), plus Pattern, Matcher, and result allocations

Edge Cases

  1. Empty input:
    "".split(",");    // [""] (array with one empty string)
    ",".split(",");   // [] (empty array, limit=0 removes empties)
    ",".split(",", -1); // ["", ""] (two empty strings)

  2. Regex with lookahead/lookbehind:
    "a1b2c3".split("(?=\\d)"); // ["a", "1b", "2c", "3"] (split before each digit)

  3. Trailing empty strings:
    "a,,b".split(",");     // ["a", "", "b"]
    "a,,b,,,".split(",");  // ["a", "", "b"] (trailing removed)
    "a,,b,,,".split(",", -1); // ["a", "", "b", "", "", ""]
    

Performance

| Scenario | Fast Path | Regex Engine | Pre-compiled Pattern |
| --- | --- | --- | --- |
| split(",") 1M times | ~50ms | ~500ms | ~80ms |
| split("\\\|") 1M times (escaped pipe) | ~60ms | ~500ms | ~80ms |
| split("\\s+") 1M times | N/A | ~800ms | ~120ms |
| Regex compile overhead | 0 | ~1-5μs | 0 (once) |

These figures are approximate and depend on the JVM, hardware, and input length.

Production Experience

Scenario: CSV parsing (10M lines):

// BAD: a multi-character regex delimiter is recompiled on every line
for (String line : lines) {
    String[] fields = line.split("\\s*,\\s*"); // 10M regex compilations!
}

// GOOD: pre-compiled Pattern, compiled once and reused
private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");
for (String line : lines) {
    String[] fields = COMMA.split(line);
}

// ALSO GOOD: a plain "," is a single non-meta character, so the Fast Path is used
for (String line : lines) {
    String[] fields = line.split(","); // Fast Path will trigger, no compilation
}

Scenario 2: Log file parsing with regex delimiter:

  • line.split("\\s\\|\\s"): not Fast Path, every call compiles the regex
  • Fix: private static final Pattern SEP = Pattern.compile("\\s\\|\\s");
  • Result: about 80% less CPU time spent on parsing
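
A minimal sketch of the fix in context follows. The SEP pattern is the one from the text; the class name, the parseLine helper, and the sample log lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LogSplitDemo {
    // Compiled once when the class is loaded, then shared by every call.
    private static final Pattern SEP = Pattern.compile("\\s\\|\\s");

    // Hypothetical helper: splits one log line on " | " without recompiling.
    static String[] parseLine(String line) {
        return SEP.split(line);
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> lines = Arrays.asList(
                "2024-01-01 | INFO | service started",
                "2024-01-01 | WARN | disk usage high");
        for (String line : lines) {
            System.out.println(Arrays.toString(parseLine(line)));
        }
    }
}
```

Pattern instances are immutable and safe to share across threads, which is what makes the static final field a reasonable default here.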

Monitoring

// JMH benchmark
@Benchmark
public String[] testSplit() {
    return input.split(",");
}

@Benchmark
public String[] testPrecompiled() {
    return COMMA.split(input);
}

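// Note: -XX:+PrintCompilation logs JIT compilation of methods, not regex compilation.
// To find Pattern.compile hot spots, use a profiler (for example JFR or async-profiler).
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions ...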

Best Practices for Highload

  • For single-character delimiters (not meta): split(","), which takes the Fast Path
  • For regex delimiters: pre-compiled Pattern.compile(regex).split(str)
  • In hot paths: consider a manual implementation via indexOf() for minimum allocations
  • For CSV/TSV: specialized libraries (OpenCSV, Apache Commons CSV)
  • For ultra-low-latency: zero-copy parsing via CharSequence wrappers
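
As an illustration of the manual indexOf() approach mentioned above, here is a minimal sketch. IndexOfSplit and splitOnChar are hypothetical names; the helper keeps every empty field, so it behaves like split(delimiter, -1) rather than the default split().

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplit {
    // Hypothetical helper: splits on a single char using indexOf only.
    // Keeps all empty fields, including trailing ones.
    static List<String> splitOnChar(String s, char delim) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = s.indexOf(delim, start)) != -1) {
            out.add(s.substring(start, idx));
            start = idx + 1; // continue after the delimiter
        }
        out.add(s.substring(start)); // remainder after the last delimiter
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitOnChar("a,,b,,,", ',').size()); // 6
        System.out.println(splitOnChar("a,b,c", ','));          // [a, b, c]
    }
}
```

This version still allocates one String per field. Whether it beats the built-in Fast Path depends on the workload, so it is worth measuring before adopting.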

🎯 Interview Cheat Sheet

Must know:

  • split(regex) splits a string by a regular expression and returns an array
  • Fast Path: for single-character non-meta delimiters, direct search without the regex engine
  • limit parameter: 0 removes trailing empties, < 0 preserves them, > 0 caps the number of elements
  • Regex metacharacters need escaping: . $ | ( ) [ ] { } ^ ? * + \
  • Pattern.quote("."): a safe way to escape a literal delimiter for split()
  • Compiling a regex on every loop iteration is an antipattern; use a pre-compiled Pattern

Frequent follow-up questions:

  • Why doesn’t split(".") work? Dot in regex means “any character”. Use split("\\.").
  • What does split(",", -1) do? It preserves trailing empty strings. By default (limit=0) they’re removed.
  • What is Fast Path in split()? If the delimiter is a single non-meta character, direct search is used without Pattern/Matcher.
  • How to optimize split() in a loop? Use a pre-compiled Pattern: private static final Pattern COMMA = Pattern.compile(",").

Red flags (DON’T say):

  • ❌ “split() takes a plain string, not regex”: it takes a regex, so a dot will break the logic
  • ❌ “split() always preserves empty strings”: by default (limit=0) it removes trailing empty strings
  • ❌ “You can compile regex in a loop without consequences”: 10M compilations = seconds of CPU
  • ❌ “split() is the only way to split a string”: there’s indexOf(), StringTokenizer, specialized parsers
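
The alternatives are not drop-in replacements for each other. As a small sketch (TokenizerDemo and tokenize are hypothetical names), StringTokenizer skips empty tokens between adjacent delimiters, while split() keeps them:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Hypothetical helper: collects all tokens produced by StringTokenizer.
    static List<String> tokenize(String s, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a,,b", ","));                // [a, b]
        System.out.println(Arrays.toString("a,,b".split(",")));   // [a, , b]
    }
}
```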

Related topics:

  • [[16. Difference Between replace() vs replaceAll()]]
  • [[8. How Java Compiler Optimizes String Concatenation]]
  • [[7. What Happens When Concatenating Strings with + Operator]]