Introduction to the Java.util.regex Object Model - public String[] split(CharSequence input)
This method can be particularly helpful if you need to break up a String into an array of substrings based on some criteria. In concept, it’s similar to the StringTokenizer, but it’s much more powerful and more resource intensive than StringTokenizer because it allows your program to use regular expressions as the splitting criteria.
If the pattern can’t be found in the split candidate, input, then a String array is returned that contains exactly one String: the original input. The one case in which no elements come back is an input made up of nothing but matches, such as a lone comma split on a comma; every substring is then empty, and an empty array is returned.
If the pattern can be found, then a String array is returned. That array contains the substrings of input that lie between occurrences of the pattern. Thus, for the pattern
Pattern p = Pattern.compile(",");
the split method for Hello,Dolly will return a String array consisting of two elements. The first element of the array will contain Hello, and the second will contain Dolly. That String array is obtained as follows:
String tmp[] = p.split("Hello,Dolly");
In this case, the value returned is
//tmp is equal to { "Hello", "Dolly"}
You should be aware of some subtleties when you work with this method. If the candidate String had been Hello,Dolly, (with a trailing comma character after the y in Dolly), then this method would still have returned a two-element String array consisting of Hello and Dolly. The implicit behavior is that trailing empty strings aren’t returned.
If the input String had been Hello,,,Dolly the resulting String array would have had four elements. The return value of the split method, as applied to the pattern, is
// p.split("Hello,,,Dolly") returns {"Hello","","","Dolly"}
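The cases described above can be checked with a short sketch. The class name SplitBasics and the helper splitOnComma are illustrative names that don't appear in the original text; only Pattern.compile and Pattern.split come from java.util.regex.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative sketch; SplitBasics and splitOnComma are hypothetical names.
class SplitBasics {
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits the input around each comma using the one-argument split.
    static String[] splitOnComma(String input) {
        return COMMA.split(input);
    }

    public static void main(String[] args) {
        // No comma in the input: one element, the original input.
        System.out.println(Arrays.toString(splitOnComma("HelloDolly")));
        // One comma: two elements.
        System.out.println(Arrays.toString(splitOnComma("Hello,Dolly")));
        // Trailing comma: the trailing empty string is dropped.
        System.out.println(Arrays.toString(splitOnComma("Hello,Dolly,")));
        // Consecutive commas: the interior empty strings are kept.
        System.out.println(Arrays.toString(splitOnComma("Hello,,,Dolly")));
    }
}
```

Compiling the Pattern once and reusing it, as the static field does here, avoids recompiling the regex on every call.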
Listing 2-4 provides an example in which the split method is used to split a String into an array based on a single space character.
Listing 2-4. Pattern Splitting Example
import java.util.regex.*;
public class PatternSplitExample {
    public static void main(String args[]) {
        splitTest();
    }
    public static void splitTest() {
        // Split on a single space character.
        Pattern p = Pattern.compile(" ");
        String tmp = "this is the String I want to split up";
        String[] tokens = p.split(tmp);
        // Print each token on its own line.
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i]);
        }
    }
}
Of course, this is a misuse of the method: You could have used a StringTokenizer to achieve the same result, and it would have been less resource intensive. In light of what you now know, consider Listing 2-5, which is a slightly modified version of Listing 1-12 from Chapter 1 in that it uses the Pattern.split method. Output 2-1 shows the result of running the program.
Listing 2-5. PatternSplit.java
import java.util.regex.*;
public class PatternSplit {
    public static void main(String args[]) {
        String statement = "I will not compromise. I will not " +
            "cooperate. There will be no concession, no conciliation, no " +
            "finding the middle ground, and no give and take.";
        String tokens[] = null;
        // Each alternative in the pattern acts as a delimiter.
        String splitPattern = "compromise|cooperate|concession|" +
            "conciliation|(finding the middle ground)|(give and take)";
        Pattern p = Pattern.compile(splitPattern);
        tokens = p.split(statement);
        System.out.println("REGEX PATTERN:\n" + splitPattern + "\n");
        System.out.println("STATEMENT:\n" + statement + "\n");
        System.out.println("TOKENS:");
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i]);
        }
    }
}
Output 2-1. Result of Running PatternSplit.java
C:\RegEx\code\chapter1>java PatternSplit
REGEX PATTERN:
compromise|cooperate|concession|conciliation|(finding the middle ground)|(give and take)

STATEMENT:
I will not compromise. I will not cooperate. There will be no concession, no conciliation, no finding the middle ground, and no give and take.

TOKENS:
I will not
. I will not
. There will be no
, no
, no
, and no
.
You’ll notice that Listing 2-5 uses the Pattern.split method, whereas Listing 1-12 uses the new String.split method. In effect, the two are identical because the String.split method simply defers to this method internally.
What you’ve done is really quite amazing and might have been ridiculously convoluted without regular expressions. You’re actually using complicated artificial constructs—namely, English language synonyms—to decompose text. This isn’t your father’s J2SE.
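| NOTE The String.split method does not alter the pattern it is given. It is the String.matches method that behaves as though an invisible ^ were placed before the pattern and a $ after it, because it succeeds only when the entire String matches. |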
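The claim that String.split and Pattern.split produce the same result can be checked directly. The following is a minimal sketch; the class name SplitEquivalence and the helper sameResult are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative sketch; SplitEquivalence and sameResult are hypothetical names.
class SplitEquivalence {
    // True if String.split and Pattern.split agree for this regex and input.
    static boolean sameResult(String regex, String input) {
        String[] viaString = input.split(regex);
        String[] viaPattern = Pattern.compile(regex).split(input);
        return Arrays.equals(viaString, viaPattern);
    }

    public static void main(String[] args) {
        System.out.println(sameResult(",", "Hello,Dolly"));
        System.out.println(sameResult("compromise|cooperate",
            "I will not compromise. I will not cooperate."));
    }
}
```

When the same regex is applied to many inputs, holding on to a compiled Pattern is usually the better choice, because String.split works from the regex String on each call.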
public String[] split(CharSequence input, int limit)
This method works in exactly the same way that Pattern.split(CharSequence input) does, except for one variation. The second parameter, limit, allows you to control how many elements are returned, as shown in the following sections.
Limit == 0
If you specify that the second parameter, limit, should equal 0, then this method behaves exactly like its overloaded counterpart. That is, it splits the input as many times as possible, and trailing empty strings are discarded. Thus, the pattern
Pattern p = Pattern.compile(",");
will return an array consisting of two elements when split against the candidate Hello,Dolly. An example of the usage of the method follows:
String tmp[] = p.split("Hello,Dolly", 0);
Similarly, split will return two elements when applied to the String Hello,Dolly, (note the trailing comma character after the y in Dolly):
String tmp[] = p.split("Hello,Dolly,", 0);
However, you may not always want this behavior. For example, there may be a time when you want to limit the number of elements returned.
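The behavior of a zero limit can be sketched as follows. The class name SplitLimitZero and its two helper methods are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative sketch; SplitLimitZero and its method names are hypothetical.
class SplitLimitZero {
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits with an explicit limit of 0.
    static String[] splitWithZeroLimit(String input) {
        return COMMA.split(input, 0);
    }

    // True if limit 0 gives the same array as the one-argument split.
    static boolean matchesOneArgSplit(String input) {
        return Arrays.equals(COMMA.split(input, 0), COMMA.split(input));
    }

    public static void main(String[] args) {
        // Two elements.
        System.out.println(Arrays.toString(splitWithZeroLimit("Hello,Dolly")));
        // Still two elements: every trailing empty string is discarded.
        System.out.println(Arrays.toString(splitWithZeroLimit("Hello,Dolly,,,")));
        // Three elements: a leading empty string is kept.
        System.out.println(Arrays.toString(splitWithZeroLimit(",Hello,Dolly")));
        System.out.println(matchesOneArgSplit("Hello,,,Dolly,"));
    }
}
```

Only trailing empty strings are dropped; an empty string at the start or in the middle of the result survives.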
Limit > 0
Use a positive limit if you’re interested in only a certain number of tokens. The pattern is applied at most limit - 1 times, so you should use that number + 1 as the limit; the extra element holds whatever is left of the input. To split the String Hello,Dolly,You,Are,My,Favorite on commas when you want only the first two tokens, use this:
String[] tmp = p.split("Hello,Dolly,You,Are,My,Favorite", 3);
The value of the resulting String array is as follows:
// tmp[0] is "Hello"
// tmp[1] is "Dolly"
The interesting behavior here is that a third element is returned. In this case, the third element is the unsplit remainder of the input:
// tmp[2] is "You,Are,My,Favorite"
Using a positive limit can potentially lead to performance enhancements, because the regex engine can stop searching when it meets the specified number of matches.
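One way to apply the "number + 1" advice is to wrap it in a small helper that drops the unsplit remainder. This is only a sketch under stated assumptions: the class name SplitLimitPositive and the helper firstTokens are hypothetical, and the helper assumes a comma delimiter and a token count of at least 1.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative sketch; SplitLimitPositive and firstTokens are hypothetical names.
class SplitLimitPositive {
    private static final Pattern COMMA = Pattern.compile(",");

    // Returns at most n leading tokens by splitting with limit n + 1
    // and discarding the unsplit remainder, if there is one.
    static String[] firstTokens(String input, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String[] parts = COMMA.split(input, n + 1);
        return parts.length > n ? Arrays.copyOf(parts, n) : parts;
    }

    public static void main(String[] args) {
        String input = "Hello,Dolly,You,Are,My,Favorite";
        // With limit 3 the third element is the unsplit remainder.
        System.out.println(Arrays.toString(COMMA.split(input, 3)));
        // The helper keeps only the two tokens that were asked for.
        System.out.println(Arrays.toString(firstTokens(input, 2)));
    }
}
```

If the input holds fewer tokens than requested, the helper simply returns what it found.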
Limit < 0
Using a negative number (any negative number) for the limit tells the regex engine that you want to return as many substrings as possible and that you want trailing empty strings, if any, to be returned. Thus, for the regex pattern
Pattern p = Pattern.compile(",");
and the candidate String Hello,Dolly, the command
String tmp[] = p.split("Hello,Dolly", -1);
results in the following condition:
//tmp is equal to {"Hello","Dolly"};
However, for the String Hello,Dolly, (with a trailing comma following Dolly), the method call
String tmp[] = p.split("Hello,Dolly,", -1);
results in this:
//tmp is equal to {"Hello","Dolly",""};
Notice that the actual value of the negative limit doesn’t matter; thus
p.split("Hello,Dolly", -1);
is exactly equivalent to this:
p.split("Hello,Dolly", -100);
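The difference between a zero limit and a negative limit shows up only when the input ends with one or more delimiters, so that the split produces empty strings at the end. A minimal sketch follows; the class name SplitLimitNegative and the helper splitKeepingTrailing are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Illustrative sketch; SplitLimitNegative and splitKeepingTrailing are hypothetical names.
class SplitLimitNegative {
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits with a negative limit so that trailing empty strings are kept.
    static String[] splitKeepingTrailing(String input) {
        return COMMA.split(input, -1);
    }

    public static void main(String[] args) {
        String input = "Hello,Dolly,";
        // Limit 0: two elements, the trailing empty string is discarded.
        System.out.println(COMMA.split(input, 0).length);
        // Limit -1: three elements, the last one is the empty string.
        System.out.println(splitKeepingTrailing(input).length);
        // Any negative limit gives the same array.
        System.out.println(Arrays.equals(COMMA.split(input, -1), COMMA.split(input, -100)));
    }
}
```

A negative limit is the usual choice when the position of each field matters, for example when a record may end with empty fields.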
This article is excerpted from chapter three of Java Regular Expressions: Taming the Java.util.regex Engine, written by Mehran Habibi (Apress, 2004; ISBN: 1590591070).