Iterating and Incrementing Strings in Ruby - Regular Expressions
You have already seen regular expressions in action. A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The syntax for regular expressions was invented by mathematician Stephen Kleene in the 1950s.
I’ll spend a little time demonstrating some patterns to search for strings. In this little discussion, you’ll learn the fundamentals: how to use basic string patterns, square brackets, alternation, grouping, anchors, shortcuts, repetition operators, and braces. Table 4-1 lists the syntax for regular expressions in Ruby.
We need a little text to munch on. Here are the opening lines of Shakespeare’s 29th sonnet:
opening = "When in disgrace with fortune and men's eyes\nI all alone beweep my outcast state,\n"
Note that this string contains two lines, each ended by the newline character \n.
You can match the first line just by using a word in the pattern:
opening.grep(/men/) # => ["When in disgrace with fortune and men's eyes\n"]
By the way, grep is not a String method; it comes from the Enumerable module, which the String class includes, so it is available for processing strings. grep takes a pattern as an argument, and can also take a block (see http://www.ruby-doc.org/core/classes/Enumerable.html).
When you use a pair of square brackets ([]), you can match any character in the brackets. Let’s try to match the word man or men using []:
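On Ruby 1.9 and later, String no longer mixes in Enumerable, so calling grep directly on a string raises NoMethodError. Here is a minimal sketch, assuming a newer Ruby, that gets the same result by grepping over the lines of the string; it also shows the optional block, which transforms each matching line:

```ruby
opening = "When in disgrace with fortune and men's eyes\nI all alone beweep my outcast state,\n"

# each_line without a block returns an Enumerator, which provides grep
matches = opening.each_line.grep(/men/)

# with a block, grep passes each matching line to the block
# and collects the block's results instead of the lines themselves
lengths = opening.each_line.grep(/men/) { |line| line.chomp.length }

p matches
p lengths
```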
opening.grep(/m[ae]n/) # => ["When in disgrace with fortune and men's eyes\n"]
The same pattern would also match a line with the word man in it.
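To see the character class pick up both spellings, here is a small sketch that greps an array of lines. The first and third lines are made up for illustration and are not part of the sonnet:

```ruby
lines = [
  "A man for all seasons\n",                          # hypothetical sample line
  "When in disgrace with fortune and men's eyes\n",
  "The moon is down\n"                                # hypothetical sample line
]

# [ae] matches exactly one character, either a or e
found = lines.grep(/m[ae]n/)

p found
```

The third line is skipped because moon has neither a nor e between m and n.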
Alternation lets you match alternate forms of a pattern using the pipe character (|):
opening.grep(/men|man/) # => ["When in disgrace with fortune and men's eyes\n"]
Grouping uses parentheses to group a subexpression, like this one that contains an alternation:
opening.grep(/m(e|a)n/) # => ["When in disgrace with fortune and men's eyes\n"]
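Grouping does more than bundle an alternation: the text matched inside the parentheses is remembered. A brief sketch, using String#match rather than grep so that the captured group can be inspected:

```ruby
line = "When in disgrace with fortune and men's eyes"

# match returns a MatchData object (or nil when nothing matches)
m = line.match(/m(e|a)n/)

whole = m[0]   # the entire matched text
vowel = m[1]   # the text matched by the first group

p whole
p vowel
```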
Anchors anchor a pattern to the beginning (^) or end ($) of a line:
opening.grep(/^When in/) # => ["When in disgrace with fortune and men's eyes\n"]
opening.grep(/outcast state,$/) # => ["I all alone beweep my outcast state,\n"]
The ^ means that a match is found when the text When in is at the beginning of a line, and $ will only match outcast state, if it is found at the end of a line.
One way to specify the beginning and ending of strings in a pattern is with shortcuts. Shortcut syntax is brief: a single character preceded by a backslash. For example, the \d shortcut represents a digit; it is the same as using [0-9] but, well, shorter. Similar to ^, the shortcut \A matches the beginning of a string, not a line:
opening.grep(/\AWhen in/) # => ["When in disgrace with fortune and men's eyes\n"]
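Similar to $, the shortcut \z matches the end of a string, not a line. Because opening ends with a newline, chomp removes it first so that \z can match directly after the comma:
opening.chomp.grep(/outcast state,\z/) # => ["I all alone beweep my outcast state,"]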
The shortcut \Z matches the end of a string, or just before a newline character (\n) if that newline is the last character in the string.
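The difference between line anchors and string anchors is easiest to see on the whole two-line string. The following sketch uses =~ (which returns the match offset, or nil) instead of grep, so that the string is not split into lines first:

```ruby
opening = "When in disgrace with fortune and men's eyes\nI all alone beweep my outcast state,\n"

line_start   = opening =~ /^I all/     # ^ also matches right after an embedded newline
string_start = opening =~ /\AI all/    # \A matches only at the very start, so this is nil
line_end     = opening =~ /eyes$/      # $ matches just before the first newline
before_final = opening =~ /state,\Z/   # \Z tolerates the one trailing newline
very_end     = opening =~ /state,\z/   # \z does not, so this is nil

p [line_start, string_start, line_end, before_final, very_end]
```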
Let’s figure out how to match a phone number in the form (555)123-4567. Supposing that the string phone contains a phone number like this, the following pattern will find it:
phone.grep(/(\(\d\d\d\))?\d\d\d-\d\d\d\d/) # => ["(555)123-4567"]
The backslash precedes the inner parentheses (\(...\)) to let the regexp engine know that these are literal characters. Otherwise, the engine will see the parentheses as enclosing a subexpression. The three \ds between the literal parentheses represent three digits. The hyphen (-) is just an unambiguous character, so you can use it in the pattern as is.
The question mark (?) is a repetition operator. It indicates zero or one occurrence of the previous pattern. So the phone number you are looking for can have an area code in parentheses, or not. The area-code pattern is surrounded by an unescaped ( and ), which group it so that the ? operator applies to the entire area code; square brackets would not work here, because they match only a single character. Either form of the phone number, with or without the area code, will work. Here is a way to use ? with just a single character, u:
color.grep(/colou?r/) # => ["I think that colour is just right for your office."]
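A quick way to check optional matching is to try a pattern against both forms of the input. This sketch assumes the area code is wrapped in a parenthesized group so that ? applies to all of it, and uses String#[] with a regexp, which returns the matched text or nil:

```ruby
phone_pattern = /(\(\d\d\d\))?\d\d\d-\d\d\d\d/

with_area    = "(555)123-4567"[phone_pattern]   # area code present
without_area = "123-4567"[phone_pattern]        # area code absent
no_match     = "12-34"[phone_pattern]           # too few digits

# the same idea with a single optional character
american = "color"[/colou?r/]
british  = "colour"[/colou?r/]

p [with_area, without_area, no_match, american, british]
```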
The plus sign (+) operator indicates one or more of the previous pattern, in this case digits:
phone.grep(/(\(\d+\))?\d+-\d+/) # => ["(555)123-4567"]
Braces ({}) let you specify the exact number of digits, such as \d{3} or \d{4}:
phone.grep(/(\(\d{3}\))?\d{3}-\d{4}/) # => ["(555)123-4567"]
It is also possible to indicate an “at least” amount with{m,}, and a minimum/maximum number with{m,n}.
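As a small illustration of the open-ended and ranged forms, the following sketch anchors each pattern with \A and \z so that the whole string must consist of the stated number of digits. The sample strings are made up:

```ruby
at_least_three = /\A\d{3,}\z/    # three or more digits
three_to_five  = /\A\d{3,5}\z/   # between three and five digits

samples = ["12", "123", "12345", "123456"]

long_enough = samples.grep(at_least_three)
in_range    = samples.grep(three_to_five)

p long_enough
p in_range
```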
The String class also has the =~ method and the !~ operator. If =~ finds a match, it returns the offset position where the match starts in the string:
color =~ /colou?r/ # => 13
The !~ operator returns true if it does not match the string, false otherwise:
color !~ /colou?r/ # => false
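When there is no match, =~ returns nil rather than an offset, which makes it convenient in conditions. A short sketch; the value of color is an assumption, inferred from the grep output shown earlier:

```ruby
# assumed contents of color, inferred from the earlier grep result
color = "I think that colour is just right for your office."

found_at  = color =~ /colou?r/   # offset of the first match
not_found = color =~ /purple/    # nil, because nothing matches
negated   = color !~ /purple/    # true, because nothing matches

if color =~ /colou?r/
  puts "matched at offset #{found_at}"
end
```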
Also of interest are the Regexp and MatchData classes. The Regexp class (http://www.ruby-doc.org/core/classes/Regexp.html) lets you create a regular expression object. The MatchData class (http://www.ruby-doc.org/core/classes/MatchData.html) is the type of the special $~ variable, which encapsulates all search results from a pattern match.
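Both classes can be tried out in a few lines. This sketch builds a pattern with Regexp.new, runs it with Regexp#match, and reads a few details from the resulting MatchData object; the sample sentence is made up:

```ruby
pattern = Regexp.new('colou?r')   # equivalent to the literal /colou?r/

md = pattern.match("I think that colour is just right.")
last_match = $~                   # the special variable holds the most recent MatchData

matched = md[0]          # text of the whole match
before  = md.pre_match   # text preceding the match
offset  = md.begin(0)    # offset where the match starts

p [matched, before, offset, last_match[0]]
```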
This discussion has given you a decent foundation in regular expressions (see Table 4-1 for a listing). With these fundamentals, you can define almost any pattern.
Table 4-1. Regular expressions in Ruby
| Pattern | Description |
| /pattern/options | Pattern pattern in slashes, followed by optional options, i.e., one or more of: i for case-insensitive; o for substitute once; x for ignore whitespace, allow comments; m for match multiple lines, newlines as normal characters |
| %r!pattern! | General delimited string for a regular expression, where ! can be an arbitrary character |
| ^ | Matches beginning of line |
| $ | Matches end of line |
| . | Matches any character |
| \1...\9 | Matches nth grouped subexpression |
| \10 | Matches nth grouped subexpression, if already matched; otherwise, refers to octal representation of a character code |
| \n, \r, \t, etc. | Matches character in backslash notation |
| \w | Matches word character, as in [0-9A-Za-z_] |
| \W | Matches nonword character |
| \s | Matches whitespace character, as in [ \t\n\r\f] |
| \S | Matches nonwhitespace character |
| \d | Matches digit, same as[0-9] |
| \D | Matches nondigit |
| \A | Matches beginning of a string |
| \Z | Matches end of a string, or before newline at the end |
| \z | Matches end of a string |
| \b | Matches word boundary outside [], or backspace (0x08) inside [] |
| \B | Matches nonword boundary |
| \G | Matches point where last match finished |
| [..] | Matches any single character in brackets, such as [ch]at |
| [^..] | Matches any single character not in brackets |
| * | Matches zero or more of previous regular expressions |
| *? | Matches zero or more of previous regular expressions (nongreedy) |
| + | Matches one or more of previous regular expressions |
| +? | Matches one or more of previous regular expressions (nongreedy) |
| {m} | Matches exactly m number of previous regular expressions |
| {m,} | Matches at least m number of previous regular expressions |
| {m,n} | Matches at least m but at most n number of previous regular expressions |
| {m,n}? | Matches at least m but at most n number of previous regular expressions (nongreedy) |
| ? | Matches zero or one of previous regular expressions |
| | | Alternation, such as color|colour |
| ( ) | Grouping regular expressions or subexpression, such as col(o|ou)r |
| (?#..) | Comment |
| (?:..) | Grouping without back-references (without remembering matched text) |
| (?=..) | Lookahead: matches a position followed by the pattern, without consuming it |
| (?!..) | Negative lookahead: matches a position not followed by the pattern |
| (?>..) | Matches independent pattern without backtracking |
| (?imx) | Toggles i, m, or x options on |
| (?-imx) | Toggles i, m, or x options off |
| (?imx:..) | Toggles i, m, or x options on within parentheses |
| (?-imx:..) | Toggles i, m, or x options off within parentheses |
| (?ix-ix: ) | Turns on (or off) i and x options within this noncapturing group |
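A few of the table entries that were not demonstrated in the text can be sketched briefly: greedy versus nongreedy repetition, lookahead, and the i and x options. All sample strings here are made up for illustration:

```ruby
markup = "<b>bold</b> and <i>italic</i>"

greedy = markup[/<.+>/]    # + grabs as much as it can, up to the last >
lazy   = markup[/<.+?>/]   # +? stops at the first > it can

prices = "cost: 42 dollars, 7 euros"
euros  = prices[/\d+(?= euros)/]   # digits only when followed by " euros"

ignore_case = "RUBY" =~ /ruby/i    # i makes the match case-insensitive

# x ignores whitespace in the pattern and allows comments
phone_pattern = /
  \(\d{3}\)     # area code in literal parentheses
  \d{3}-\d{4}   # seven-digit number with a hyphen
/x
phone_offset = "(555)123-4567" =~ phone_pattern

p [greedy, lazy, euros, ignore_case, phone_offset]
```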
1.9 and Beyond
In the versions of Ruby that follow, String will likely:
- Add the start_with? and end_with? methods, which will return true if a string starts with or ends with a given prefix or suffix of the string.
- Add a clear method that will turn a string with a length greater than 0 into an empty string.
- Add an ord method that will return a character code.
- Add the partition and rpartition methods to partition a string at a given separator.
- Add a bytes method that will return the bytes of a string, one by one.
- Return a single-character string instead of a character code when a string is indexed with [].
- Consider characters to be more than one byte in length.
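For readers already on Ruby 1.9 or later, the methods in this list can be tried directly. A minimal sketch, assuming such a version; the sample strings are made up:

```ruby
s = "ruby string"

starts = s.start_with?("ruby")
ends   = s.end_with?("ing")

code = "a".ord                          # character code of the first character

head = "key=value".partition("=")       # splits at the first separator
tail = "a.b.c".rpartition(".")          # splits at the last separator

byte_values = "hi".bytes.to_a           # to_a works whether bytes gives an Array or an Enumerator

first_char = "ruby"[0]                  # a one-character string, not a character code

scratch = "temp".dup                    # dup so that a mutable copy is cleared
scratch.clear

p [starts, ends, code, head, tail, byte_values, first_char, scratch]
```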
Review Questions
- How do chop and chomp differ?
- Name two ways to concatenate strings.
- What happens when you reverse a palindrome?
- How do you iterate over a string?
- Name two or more case conversion methods.
- What methods would you use to adjust space in a string?
- Describe alternation in a regular expression pattern.
- What does /\d{3}/ match?
- How do you convert a string to an array?
- What do you think is the easiest way to create a string?
This article is excerpted from chapter four of Learning Ruby, written by Michael Fitzgerald (O'Reilly; ISBN: 0596529864).