Split string with regex and keep delimiters

The other day I needed to split a string on a regex delimiter while also keeping the delimiters. Java’s default String.split does not do that: it throws the delimiters away. Below is code that can be used to achieve this.
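For reference, here is a small sketch of what the standard method returns for the delimiter regex used in this post. The class and method names (SplitDropsDelimiters, plainSplit) are made up for illustration.

```java
import java.util.Arrays;

public class SplitDropsDelimiters {
  // Splits with the standard library; the matched delimiters are not returned.
  public static String[] plainSplit(String str, String regex) {
    return str.split(regex);
  }

  public static void main(String[] args) {
    String regex = "\\s*[+\\-*/]\\s*";
    // Prints [1, 2, 3, 4]: the operators and surrounding spaces are gone.
    System.out.println(Arrays.toString(plainSplit("1 + 2 - 3/4", regex)));
  }
}
```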

Java

import java.util.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class SplitWithDelimiters {
  public static void main(String[] args) {
    new SplitWithDelimiters().run();
  }

  private void run() {
    String regex = "\\s*[+\\-*/]\\s*";

    // Arrays must be compared with Arrays.equals; array.equals() only checks identity.
    // Run with "java -ea" so that the assertions are actually evaluated.
    assert Arrays.equals(new String[] { },
      splitWithDelimiters("", regex));
    assert Arrays.equals(new String[] { "1" },
      splitWithDelimiters("1", regex));
    assert Arrays.equals(new String[] { "1", "+" },
      splitWithDelimiters("1+", regex));
    assert Arrays.equals(new String[] { "-", "1" },
      splitWithDelimiters("-1", regex));
    assert Arrays.equals(new String[] { "- ", "- ", "-", "1" },
      splitWithDelimiters("- - -1", regex));
    assert Arrays.equals(new String[] { "1", " + ", "2" },
      splitWithDelimiters("1 + 2", regex));
    assert Arrays.equals(new String[] { "-", "1", " + ", "2", " - ", "3", "/", "4" },
      splitWithDelimiters("-1 + 2 - 3/4", regex));
    
    System.out.println("Done.");
  }

  private String[] splitWithDelimiters(String str, String regex) {
    List<String> parts = new ArrayList<String>();

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(str);

    int lastEnd = 0;
    while(m.find()) {
      int start = m.start();
      if(lastEnd != start) {
        String nonDelim = str.substring(lastEnd, start);
        parts.add(nonDelim);
      }
      String delim = m.group();
      parts.add(delim);

      int end = m.end();
      lastEnd = end;
    }

    if(lastEnd != str.length()) {
      String nonDelim = str.substring(lastEnd);
      parts.add(nonDelim);
    }

    String[] res = parts.toArray(new String[0]);
    System.out.println("result: " + Arrays.toString(res));

    return res;
  }
}
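One property worth checking is that nothing is lost: since every character ends up either in a delimiter or in a non-delimiter part, concatenating the parts should give back the input. The sketch below repeats the same algorithm in a standalone class so it can run on its own. The names RoundTripCheck and join are hypothetical, and it returns a List instead of an array for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoundTripCheck {
  // Same algorithm as above, returning a List instead of an array.
  public static List<String> splitWithDelimiters(String str, String regex) {
    List<String> parts = new ArrayList<String>();
    Matcher m = Pattern.compile(regex).matcher(str);
    int lastEnd = 0;
    while (m.find()) {
      if (lastEnd != m.start()) {
        parts.add(str.substring(lastEnd, m.start()));
      }
      parts.add(m.group());
      lastEnd = m.end();
    }
    if (lastEnd != str.length()) {
      parts.add(str.substring(lastEnd));
    }
    return parts;
  }

  // Concatenates the parts back into a single string.
  public static String join(List<String> parts) {
    StringBuilder sb = new StringBuilder();
    for (String part : parts) {
      sb.append(part);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String input = "-1 + 2 - 3/4";
    String rebuilt = join(splitWithDelimiters(input, "\\s*[+\\-*/]\\s*"));
    System.out.println(rebuilt.equals(input)); // prints true
  }
}
```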

Clojure

Here’s a test file for the Clojure version:

(deftest split-keep-delim-test
  (is (= []
         (split-keep-delim "" #"\d+")))
  (is (= ["abc"]
         (split-keep-delim "abc" #"\d+")))
  (is (= ["-" "1" " + " "2" " - " "3" "/" "4"]
         (split-keep-delim "-1 + 2 - 3/4" #"\s*[+\-*/]\s*")))
  (is (= ["a" "b" "12" "b" "a"]
         (split-keep-delim "ab12ba" #"[ab]"))))

and the implementation:

(defn split-keep-delim 
  "Splits str with re-delim. Returns a lazy seq of parts, including delimiters.

   > (split-keep-delim \"-1 + 2 - 3/4\" #\"\\s*[+\\-*/]\\s*\")
   [\"-\" \"1\" \" + \" \"2\" \" - \" \"3\" \"/\" \"4\"]
   > (split-keep-delim \"ab12ba\" #\"[ab]\")
   [\"a\" \"b\" \"12\" \"b\" \"a\"]"
  [str re-delim]
  (let [m (.matcher re-delim str)]
    ((fn step [last-end]
       (if (.find m)
         (let [start (.start m)
               end (.end m)
               delim (.group m)
               new-head (if (not= last-end start)
                          [(.substring str last-end start) delim]
                          [delim])]
           (concat new-head (lazy-seq (step end))))
         (if (not= last-end (.length str))
           [(.substring str last-end)]
           []))) 0)))

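This version is lazy, though I did not notice any speedup, as you can see from the timings below. Keep in mind that take is itself lazy and dotimes discards its result, so each iteration here only pays for creating the matcher and finding the first match, whether 1, 3 or 300 elements are requested; wrapping the take in doall would be needed to measure the cost of realizing them. Either way, the timings are rather good for what I needed: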

> (let [s (apply str (take 100 (cycle "-1 + 2 - 3/4"))) pat #"\s*[+\-*/]\s*"] (time (dotimes [_ 1000] (take 1 (split-keep-delim s pat)))))
"Elapsed time: 26.013445 msecs"
nil
> (let [s (apply str (take 100 (cycle "-1 + 2 - 3/4"))) pat #"\s*[+\-*/]\s*"] (time (dotimes [_ 1000] (take 3 (split-keep-delim s pat)))))
"Elapsed time: 28.754948 msecs"
nil
> (let [s (apply str (take 100 (cycle "-1 + 2 - 3/4"))) pat #"\s*[+\-*/]\s*"] (time (dotimes [_ 1000] (take 300 (split-keep-delim s pat)))))
"Elapsed time: 28.388654 msecs"
nil
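
For comparison, the same lazy idea can be sketched in Java with an Iterator that advances the Matcher only when another part is requested. This is an illustration rather than part of the code above: the names LazySplit and splitLazily are made up, and it assumes Java 8 or later (where Iterator.remove has a default implementation).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazySplit {
  // Returns an iterator over parts and delimiters; the matcher advances
  // only when more parts are requested.
  public static Iterator<String> splitLazily(final String str, final Pattern delim) {
    final Matcher m = delim.matcher(str);
    return new Iterator<String>() {
      private final Deque<String> pending = new ArrayDeque<String>();
      private int lastEnd = 0;
      private boolean exhausted = false;

      // Searches for the next match only if nothing is queued yet.
      private void fill() {
        while (pending.isEmpty() && !exhausted) {
          if (m.find()) {
            if (lastEnd != m.start()) {
              pending.add(str.substring(lastEnd, m.start()));
            }
            pending.add(m.group());
            lastEnd = m.end();
          } else {
            exhausted = true;
            if (lastEnd != str.length()) {
              pending.add(str.substring(lastEnd));
            }
          }
        }
      }

      @Override
      public boolean hasNext() {
        fill();
        return !pending.isEmpty();
      }

      @Override
      public String next() {
        fill();
        if (pending.isEmpty()) {
          throw new NoSuchElementException();
        }
        return pending.remove();
      }
    };
  }

  public static void main(String[] args) {
    Iterator<String> it =
        splitLazily("-1 + 2 - 3/4", Pattern.compile("\\s*[+\\-*/]\\s*"));
    // The matcher has only searched as far as the second delimiter here.
    System.out.println(it.next() + "|" + it.next()); // prints -|1
  }
}
```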