How to split a string, but also keep the delimiters?

I have a multiline string which is delimited by a set of different delimiters:

(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)

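I can split this string into its parts using String.split, but it seems that I can't get the actual string that matched the delimiter regex.

In other words, this is what I get: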

  • Text1
  • Text2
  • Text3
  • Text4

This is what I want:

  • Text1
  • DelimiterA
  • Text2
  • DelimiterC
  • Text3
  • DelimiterB
  • Text4

Is there any way, using the JDK, to split the string with a delimiter regex but also keep the delimiters?


Accepted answer

You can use lookahead and lookbehind, which are features of regular expressions.

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

You will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

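The last one is what you want.

((?<=;)|(?=;)) matches the empty string just after a ; or just before a ;, so the split happens on both sides of every delimiter.

EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regular expressions. One thing I do to make a regex more readable is to create a variable whose name represents what the regex does. You can even put placeholders in it (e.g. %1$s) and use Java's String.format to replace the placeholders with the actual string you need; for example: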

static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

public void someMethod() {
    final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
    ...
}
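
One caveat with the template above: the delimiter is pasted into the regex unchanged, so a delimiter such as . or | would be read as a regex metacharacter. A minimal sketch of one way around this, assuming the delimiter is meant as literal text, is to run it through Pattern.quote first. The class name KeepDelimiters and the helper splitKeeping are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class KeepDelimiters {
    // Same template as WITH_DELIMITER: split just after or just before the delimiter.
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Hypothetical helper: treats the delimiter as literal text, not as a regex.
    static String[] splitKeeping(String input, String literalDelimiter) {
        String quoted = Pattern.quote(literalDelimiter);
        return input.split(String.format(WITH_DELIMITER, quoted));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a;b;c;d", ";"))); // [a, ;, b, ;, c, ;, d]
        System.out.println(Arrays.toString(splitKeeping("1.2.3", ".")));   // [1, ., 2, ., 3]
    }
}
```

Because the quoted delimiter has a fixed length, it is still accepted inside the lookbehind.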

Answers:


You want to use lookarounds, and split on zero-width matches. Here are some examples:

public class SplitNDump {
    static void dump(String[] arr) {
        for (String s : arr) {
            System.out.format("[%s]", s);
        }
        System.out.println();
    }
    public static void main(String[] args) {
        dump("1,234,567,890".split(","));
        // "[1][234][567][890]"
        dump("1,234,567,890".split("(?=,)"));   
        // "[1][,234][,567][,890]"
        dump("1,234,567,890".split("(?<=,)"));  
        // "[1,][234,][567,][890]"
        dump("1,234,567,890".split("(?<=,)|(?=,)"));
        // "[1][,][234][,][567][,][890]"

        dump(":a:bb::c:".split("(?=:)|(?<=:)"));
        // "[][:][a][:][bb][:][:][c][:]" (Java 7 and earlier; Java 8+ omits the leading empty string)
        dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
        // "[:][a][:][bb][:][:][c][:]"
        dump(":::a::::b  b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
        // "[:::][a][::::][b  b][::][c][:]"
        dump("a,bb:::c  d..e".split("(?!^)\\b"));
        // "[a][,][bb][:::][c][  ][d][..][e]"

        dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
        // "[Array][Index][Out][Of][Bounds][Exception]"
        dump("1234567890".split("(?<=\\G.{4})"));
        // "[1234][5678][90]"

        // Split at the end of each run of identical characters
        dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
        // "[Booo][yaaaa][h! Yipp][ieeee][!!]"
    }
}

And yes, that is a triply-nested assertion in the last pattern.



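A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along these lines (assuming a comma as the delimiter):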

String marked = fullString.replace(",", "~,~");

You can replace the tilde (~) with any suitable unique delimiter.

Then, if you split on your new delimiter, I believe you will get the desired result.
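
The two steps can be sketched as follows. This is only an illustration: the class name ReplaceThenSplit, the helper splitKeeping, and the choice of ~ as the marker are assumptions, and the approach only works if the marker never occurs in the input.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ReplaceThenSplit {
    // Assumption: this marker never appears in the text being split.
    static final String MARKER = "~";

    // Hypothetical helper: wrap every delimiter in markers, then split on the marker.
    static String[] splitKeeping(String input, String delimiter) {
        String marked = input.replace(delimiter, MARKER + delimiter + MARKER);
        return marked.split(Pattern.quote(MARKER));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a,b,c", ","))); // [a, ,, b, ,, c]
    }
}
```

Note that two adjacent delimiters produce an empty string between them, because the two inserted markers end up side by side.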

import java.util.regex.*;
import java.util.LinkedList;

public class Splitter {
    private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");

    private Pattern pattern;
    private boolean keep_delimiters;

    public Splitter(Pattern pattern, boolean keep_delimiters) {
        this.pattern = pattern;
        this.keep_delimiters = keep_delimiters;
    }
    public Splitter(String pattern, boolean keep_delimiters) {
        this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
    }
    public Splitter(Pattern pattern) { this(pattern, true); }
    public Splitter(String pattern) { this(pattern, true); }
    public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
    public Splitter() { this(DEFAULT_PATTERN); }

    public String[] split(String text) {
        if (text == null) {
            text = "";
        }

        int last_match = 0;
        LinkedList<String> splitted = new LinkedList<String>();

        Matcher m = this.pattern.matcher(text);

        while (m.find()) {

            splitted.add(text.substring(last_match,m.start()));

            if (this.keep_delimiters) {
                splitted.add(m.group());
            }

            last_match = m.end();
        }

        splitted.add(text.substring(last_match));

        return splitted.toArray(new String[splitted.size()]);
    }

    public static void main(String[] argv) {
        if (argv.length != 2) {
            System.err.println("Syntax: java Splitter <pattern> <text>");
            return;
        }

        Pattern pattern = null;
        try {
            pattern = Pattern.compile(argv[0]);
        }
        catch (PatternSyntaxException e) {
            System.err.println(e);
            return;
        }

        Splitter splitter = new Splitter(pattern);

        String text = argv[1];
        int counter = 1;
        for (String part : splitter.split(text)) {
            System.out.printf("Part %d: \"%s\"%n", counter++, part);
        }
    }
}

/*
    Example:
    > java Splitter "\W+" "Hello World!"
    Part 1: "Hello"
    Part 2: " "
    Part 3: "World"
    Part 4: "!"
    Part 5: ""
*/

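I don't really like the other way, where you get an empty element at the front and at the back. A delimiter is usually not at the beginning or the end of the string, so you most often end up wasting two good array slots.

Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453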


Pass the third argument as "true". It will return the delimiters as well.

new StringTokenizer(str, delimiters, true);
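
A short sketch of what that looks like in use. The class name TokenizerDemo and the helper tokens are made up for this example. Keep in mind that StringTokenizer treats its second argument as a set of single delimiter characters, not as a regex, so each delimiter character comes back as its own one-character token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // Hypothetical helper: collects text pieces and delimiter characters in order.
    static List<String> tokens(String text, String delimiterChars) {
        List<String> out = new ArrayList<>();
        // The third argument (returnDelims = true) makes delimiters come back as tokens.
        StringTokenizer st = new StringTokenizer(text, delimiterChars, true);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a;b,c", ";,")); // [a, ;, b, ,, c]
    }
}
```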

I got here late, but returning to the original question, why not just use lookarounds?

Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));

Output:

[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]

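EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's hard to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code: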

{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }

I hope that's easier to read. Thanks for the heads-up, @finnw.


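I know this is a very old question and an answer has already been accepted. But I would still like to submit a very simple answer to the original question. Consider this code: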

String str = "Hello-World:How\nAre You&doing";
String[] inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
   System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}

Output:

a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"

I am just using the word boundary \b to delimit the words, except when it is at the start of the text.


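I had a look at the above answers and honestly I find none of them satisfactory. What you want to do is essentially mimic Perl's split functionality. Why Java doesn't allow this and doesn't have a join() method somewhere is beyond me, but I digress. You don't even need a class for this really. It's just a function. Run this sample program:

Some of the earlier answers have excessive null-checking, a topic I recently wrote a response about here: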

https://stackoverflow.com/users/18393/cletus

Anyway, the code:

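import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Split {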
    public static List<String> split(String s, String pattern) {
        assert s != null;
        assert pattern != null;
        return split(s, Pattern.compile(pattern));
    }

    public static List<String> split(String s, Pattern pattern) {
        assert s != null;
        assert pattern != null;
        Matcher m = pattern.matcher(s);
        List<String> ret = new ArrayList<String>();
        int start = 0;
        while (m.find()) {
            ret.add(s.substring(start, m.start()));
            ret.add(m.group());
            start = m.end();
        }
        ret.add(start >= s.length() ? "" : s.substring(start));
        return ret;
    }

    private static void testSplit(String s, String pattern) {
        System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
        List<String> tokens = split(s, pattern);
        System.out.printf("Found %d matches%n", tokens.size());
        int i = 0;
        for (String token : tokens) {
            System.out.printf("  %d/%d: '%s'%n", ++i, tokens.size(), token);
        }
        System.out.println();
    }

    public static void main(String args[]) {
        testSplit("abcdefghij", "z"); // "abcdefghij"
        testSplit("abcdefghij", "f"); // "abcde", "f", "ghij"
        testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
        testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
        testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
    }
}

I like the idea of StringTokenizer because it is an Enumeration.
But it is also obsolete, and replaced by String.split, which returns a boring String[] (and does not include the delimiters).

So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.

A true regexp means it is not a 'Character sequence' repeated to form the delimiter:
'o' will only match 'o', and split 'ooo' into three delimiters, with two empty strings in between:

[o], '', [o], '', [o]

But the regexp o+ will return the expected result when splitting "aooob"

[], 'a', [ooo], 'b', []

To use this StringTokenizerEx:

final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
    // uses the split String detected and memorized in 'aString'
    final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}

The code of this class is available at DZone Snippets

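As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (into a 'src/test' directory) and run it. Its main() method illustrates the different usages.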


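Note: (late 2009 edit)

The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior of String.split().
Josh Bloch even commented in response to that article:

Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).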

The Google common library Guava also contains a Splitter, which is:

  • simpler to use
  • maintained by Google (and not by you)

So it may be worth checking out. From their initial rough documentation (pdf):

The JDK has this:

String[] pieces = "foo.bar".split("\\.");

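It's fine to use this if you want exactly what it does:

  • regular expression
  • result as an array
  • its way of handling empty pieces

Mini-puzzler: ",a,,b,".split(",") returns...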

(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above

Answer: (e) None of the above.

",a,,b,".split(",")
returns
"", "a", "", "b"

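Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)

In any case, our Splitter is simply more flexible. The default behavior is simplistic: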
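
For reference, one workaround inside the JDK is the two-argument overload String.split(regex, limit): a negative limit keeps trailing empty strings. A minimal sketch follows, with the class name TrailingEmpties and the helper keepTrailing made up for the example.

```java
import java.util.Arrays;

public class TrailingEmpties {
    // Hypothetical helper: a negative limit tells split() not to drop trailing empty strings.
    static String[] keepTrailing(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = ",a,,b,";
        System.out.println(input.split(",").length);           // 4: "", "a", "", "b"
        System.out.println(keepTrailing(input, ",").length);   // 5: "", "a", "", "b", ""
        System.out.println(Arrays.toString(keepTrailing(input, ",")));
    }
}
```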

Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]

If you want extra features, ask for them!

Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]

The order of the config methods doesn't matter: during splitting, trimming happens before checking for empties.


Here is a simple clean implementation which is consistent with Pattern#split and works with variable length patterns, which look behind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.

public static String[] split(CharSequence input, String pattern) {
    return split(input, Pattern.compile(pattern));
}

public static String[] split(CharSequence input, Pattern pattern) {
    Matcher matcher = pattern.matcher(input);
    int start = 0;
    List<String> result = new ArrayList<>();
    while (matcher.find()) {
        result.add(input.subSequence(start, matcher.start()).toString());
        result.add(matcher.group());
        start = matcher.end();
    }
    if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
    return result.toArray(new String[0]);
}

I don't do null checks here; Pattern#split doesn't, so why should I? I don't like the if at the end, but it is required for consistency with Pattern#split. Otherwise I would append unconditionally, resulting in an empty string as the last element of the result if the input string ends with the pattern.

I convert to String[] for consistency with Pattern#split, I use new String[0] rather than new String[result.size()], see here for why.

Here are my tests:

@Test
public void splitsVariableLengthPattern() {
    String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}

@Test
public void splitsEndingWithPattern() {
    String[] result = Split.split("/foo/$bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}

@Test
public void splitsStartingWithPattern() {
    String[] result = Split.split("$foo/bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}

@Test
public void splitsNoMatchesPattern() {
    String[] result = Split.split("/foo/bar", "\\$\\w+");
    Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}