Regular Expressions: features, filler, and repetitions
Stephanie Lockwood-Childs
9/24/09

* Common uses for regular expressions
    Match/no-match result
        Syntax validation
    Show matching entries
        Search patterns
    Modify matching entries
        Find and replace
        Strip out unwanted data
        General reformatting

* Demonstration problems
    Find and replace
        Capitalize all instances of "linux"
    Strip out unwanted data
        Equivalent of "basename" utility on list of paths
        Equivalent of "dirname" utility on list of paths
    Search patterns
        Find C functions returning type int
    Syntax validation
        First name, optional middle initial, last name
    General reformatting
        Convert dates from MM/DD/YY to YYYY-MM-DD format

* General approach
    Make up example inputs
    Pick out match features
    Create candidate expression
    Test on examples
    Refine until examples work
    Think of more devious examples
    Fix until devious examples work

- Make up example inputs

- Pick out match features
    Define match boundaries
    Mandatory and optional components
    Alternatives
    Repetition of components

- Create candidate expression
    Regular expression syntax
        Regex != Pathname expansion
        Basic syntax vs extended syntax: choices in back-slashification

        Pattern     Basic requires slash   Meaning
        .           N                      ANY character
        [:class:]   N                      Character inside pre-defined set named "class" **
        [abc]       N                      Character inside given set
        [^abc]      N                      Character NOT inside given set
        (abc)       Y                      Cluster into pattern group
        a|b         Y                      Alternative patterns/groups
        a?          Y                      Repeat count: occurs zero or once
        a*          N                      Repeat count: occurs ANY number of times (including 0)
        a+          Y                      Repeat count: occurs one or more times
        a{2,5}      Y                      Repeat count: occurs some number within range
        ^abc        N                      Anchor: beginning of line
        abc$        N                      Anchor: end of line

        ** there is no character class named "class", merely my placeholder
           for the real classes, of which the most important are:
           alpha, digit, alnum, lower, upper, blank, and space

    Reasons for grouping
        Apply count to whole group              (abc)?
        Use group as alternative                (abc)|(xyz)
        Remember group for replacement phase
    Match rules
        Matching starts from the left
        Greedy vs. non-greedy matching -- greedy by default

* Tips
    Develop complicated patterns in extended syntax, then back-slashify if needed
    Choose carefully between "any number" and "one or more"
        echo "HUUMA" | egrep --colour "(U*)|Z"
        echo "HUUMA" | egrep --colour "(U+)|Z"
    Relevant man pages: "man 7 regex", "man perlre"
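
The demonstration problems and the "any number" vs "one or more" tip can be sketched in extended syntax with sed and grep. This is one possible set of first-candidate expressions, not the solutions from the talk. The function names are made up for illustration, the date conversion assumes every two-digit year belongs to 20YY, and "grep -E" is used as the equivalent of "egrep". Each expression is deliberately simple, so devious inputs (noted in the comments) will still break some of them.

```shell
# Find and replace: capitalize every "linux".
# Devious case not handled: also rewrites "linux" inside longer words.
cap_linux() { sed -E 's/linux/Linux/g'; }

# Strip unwanted data: greedy ".*" eats everything up to the LAST slash,
# leaving what the basename utility would print.
re_basename() { sed -E 's#^.*/##'; }

# Strip unwanted data: remove the last slash and whatever follows it.
# Devious cases not handled: "/foo" becomes empty (dirname prints "/"),
# and a path with no slash is left unchanged (dirname prints ".").
re_dirname() { sed -E 's#/[^/]*$##'; }

# General reformatting: remember three groups, then reorder them.
# Assumption: all years are 20YY.
re_date() { sed -E 's#([0-9]{2})/([0-9]{2})/([0-9]{2})#20\3-\1-\2#g'; }

# Syntax validation: first name, optional middle initial (with or
# without a period), last name. The last name may contain a hyphen.
is_name() {
    printf '%s\n' "$1" |
        grep -Eq '^[[:alpha:]]+( [[:alpha:]]\.?)? [[:alpha:]-]+$'
}

# Search patterns: lines that start with "int", then an identifier,
# then an opening parenthesis.
# Devious case not handled: qualifiers such as "static int foo(".
find_int_funcs() {
    grep -E '^[[:blank:]]*int[[:blank:]]+[[:alpha:]_][[:alnum:]_]*[[:blank:]]*\('
}

echo '/usr/local/bin/gcc' | re_basename     # prints: gcc
echo '/usr/local/bin/gcc' | re_dirname      # prints: /usr/local/bin
echo '09/24/09' | re_date                   # prints: 2009-09-24

# "Any number" includes zero: U* matches the empty string in front of
# the H, so the replacement lands there instead of on the run of U's.
echo "HUUMA" | sed -E 's/(U*)|Z/X/'         # prints: XHUUMA
echo "HUUMA" | sed -E 's/(U+)|Z/X/'         # prints: HXMA
```

The last two commands are the sed counterpart of the egrep pair in the Tips: with "U*" the expression matches at the very start of every line, which is rarely what was intended.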