The other day I needed to split a string on a regex delimiter while keeping the delimiters in the result. Java’s default String.split does not do that: it throws away the delimiters. Below is code that achieves this.
Java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitWithDelimiters {

    public static void main(String[] args) {
        new SplitWithDelimiters().run();
    }

    // Run with "java -ea SplitWithDelimiters"; assertions are disabled by default.
    private void run() {
        String regex = "\\s*[+\\-*/]\\s*";
        // Arrays.equals compares contents; calling equals() on an array
        // only compares references, so it cannot be used to check the result.
        assert Arrays.equals(new String[] { },
                splitWithDelimiters("", regex));
        assert Arrays.equals(new String[] { "1" },
                splitWithDelimiters("1", regex));
        assert Arrays.equals(new String[] { "1", "+" },
                splitWithDelimiters("1+", regex));
        assert Arrays.equals(new String[] { "-", "1" },
                splitWithDelimiters("-1", regex));
        assert Arrays.equals(new String[] { "- ", "- ", "-", "1" },
                splitWithDelimiters("- - -1", regex));
        assert Arrays.equals(new String[] { "1", " + ", "2" },
                splitWithDelimiters("1 + 2", regex));
        assert Arrays.equals(new String[] { "-", "1", " + ", "2", " - ", "3", "/", "4" },
                splitWithDelimiters("-1 + 2 - 3/4", regex));
        System.out.println("Done.");
    }

    private String[] splitWithDelimiters(String str, String regex) {
        List<String> parts = new ArrayList<String>();
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);

        int lastEnd = 0; // end of the previous delimiter match
        while (m.find()) {
            int start = m.start();
            // Text between the previous delimiter and this one, if any.
            if (lastEnd != start) {
                String nonDelim = str.substring(lastEnd, start);
                parts.add(nonDelim);
            }
            // The delimiter itself is kept as a separate part.
            String delim = m.group();
            parts.add(delim);
            lastEnd = m.end();
        }

        // Trailing text after the last delimiter, if any.
        if (lastEnd != str.length()) {
            String nonDelim = str.substring(lastEnd);
            parts.add(nonDelim);
        }

        String[] res = parts.toArray(new String[0]);
        System.out.println("result: " + Arrays.toString(res));
        return res;
    }
}
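For comparison, here is a minimal sketch of what plain String.split returns for the same regex. The class and method names (PlainSplitDemo, plainSplit) are made up for illustration; only String.split itself is standard API.

```java
import java.util.Arrays;

public class PlainSplitDemo {

    // Hypothetical wrapper, only here to make the behaviour easy to inspect.
    public static String[] plainSplit(String str, String regex) {
        return str.split(regex);
    }

    public static void main(String[] args) {
        String regex = "\\s*[+\\-*/]\\s*";

        // The " + " delimiter is gone from the result.
        System.out.println(Arrays.toString(plainSplit("1 + 2", regex)));
        // prints [1, 2]

        // A delimiter at the very start leaves an empty leading string.
        System.out.println(Arrays.toString(plainSplit("-1 + 2 - 3/4", regex)));
        // prints [, 1, 2, 3, 4]
    }
}
```

Neither the operators nor the whitespace around them can be recovered from this output, which is why the code above walks the Matcher by hand.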
Clojure
Here’s a test file for the Clojure version:
(deftest split-keep-delim-test
  (is (= []
         (split-keep-delim "" #"\d+")))
  (is (= ["abc"]
         (split-keep-delim "abc" #"\d+")))
  (is (= ["-" "1" " + " "2" " - " "3" "/" "4"]
         (split-keep-delim "-1 + 2 - 3/4" #"\s*[+\-*/]\s*")))
  (is (= ["a" "b" "12" "b" "a"]
         (split-keep-delim "ab12ba" #"[ab]"))))
and the implementation:
(defn split-keep-delim
  "Splits str with re-delim. Returns list of parts, including delimiters. Lazy.
  > (split-keep-delim \"-1 + 2 - 3/4\" #\"\\s*[+\\-*/]\\s*\")
  [\"-\" \"1\" \" + \" \"2\" \" - \" \"3\" \"/\" \"4\"]
  > (split-keep-delim \"ab12ba\" #\"[ab]\")
  [\"a\" \"b\" \"12\" \"b\" \"a\"]"
  [str re-delim]
  (let [m (.matcher re-delim str)]
    ((fn step [last-end]
       (if (.find m)
         (let [start (.start m)
               end (.end m)
               delim (.group m)
               new-head (if (not= last-end start)
                          [(.substring str last-end start) delim]
                          [delim])]
           (concat new-head (lazy-seq (step end))))
         (if (not= last-end (.length str))
           [(.substring str last-end)]
           [])))
     0)))
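This version is lazy, though I did not notice any speedup, as you can see from the timings below. Keep in mind that take is itself lazy and nothing inside dotimes realizes its result, so each iteration performs only the first .find; that would explain why the three timings are nearly identical. Timings are rather good for what I needed: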
> (let [s (apply str (take 100 (cycle "-1 + 2 - 3/4")))
        pat #"\s*[+\-*/]\s*"]
    (time (dotimes [_ 1000]
            (take 1 (split-keep-delim s pat)))))
"Elapsed time: 26.013445 msecs"
nil

> (let [s (apply str (take 100 (cycle "-1 + 2 - 3/4")))
        pat #"\s*[+\-*/]\s*"]
    (time (dotimes [_ 1000]
            (take 3 (split-keep-delim s pat)))))
"Elapsed time: 28.754948 msecs"
nil

> (let [s (apply str (take 100 (cycle "-1 + 2 - 3/4")))
        pat #"\s*[+\-*/]\s*"]
    (time (dotimes [_ 1000]
            (take 300 (split-keep-delim s pat)))))
"Elapsed time: 28.388654 msecs"
nil
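The Java version earlier in the post builds the whole array eagerly. As a rough sketch of how the same lazy idea might look in Java, the parts can be exposed through an Iterator that advances the Matcher only when the caller asks for the next part. The class name LazySplit and the method splitKeepDelim are hypothetical, chosen to mirror the Clojure function; this is one possible shape, not a tuned implementation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazySplit {

    // Hypothetical helper: yields non-delimiter parts and delimiters in order,
    // scanning the input only as far as the iterator is consumed.
    public static Iterator<String> splitKeepDelim(final String str, final Pattern delim) {
        final Matcher m = delim.matcher(str);
        return new Iterator<String>() {
            private int lastEnd = 0;  // end of the previous delimiter match
            private String pending;   // delimiter queued behind a non-delimiter part
            private String next;      // part already computed for the next call
            private boolean done;     // true once the matcher is exhausted

            // Computes the next part if it has not been computed yet.
            private void advance() {
                if (next != null || done) {
                    return;
                }
                if (pending != null) {
                    next = pending;
                    pending = null;
                    return;
                }
                if (m.find()) {
                    if (m.start() != lastEnd) {
                        // Text before the delimiter comes first, the delimiter waits.
                        next = str.substring(lastEnd, m.start());
                        pending = m.group();
                    } else {
                        next = m.group();
                    }
                    lastEnd = m.end();
                } else {
                    done = true;
                    if (lastEnd != str.length()) {
                        next = str.substring(lastEnd); // trailing text
                    }
                }
            }

            @Override
            public boolean hasNext() {
                advance();
                return next != null;
            }

            @Override
            public String next() {
                advance();
                if (next == null) {
                    throw new NoSuchElementException();
                }
                String result = next;
                next = null;
                return result;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> it =
                splitKeepDelim("-1 + 2 - 3/4", Pattern.compile("\\s*[+\\-*/]\\s*"));
        // Consume only the first three parts; the matcher stops after " + ".
        for (int i = 0; i < 3 && it.hasNext(); i++) {
            System.out.println("[" + it.next() + "]");
        }
    }
}
```

To time something like this fairly, the benchmark loop has to actually pull elements from the iterator; otherwise only the setup cost is measured.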