Sunday, February 7, 2010

Regular Expressions - 2

Alternation:
The difference between a character class and alternation is that alternation can match any of several complete regular expressions, while a character class matches exactly one character. However, alternation cannot be negated the way a character class can.
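A small sketch of the difference, assuming a grep that accepts -E (the equivalent of egrep). The sample words and variable names are made up for illustration:

```shell
# Made-up sample input, one word per line.
words=$(printf '%s\n' 'gray' 'grey' 'griy' 'Bob' 'Robert')

# Character class: exactly one character, 'a' or 'e', between 'gr' and 'y'.
class_matches=$(printf '%s\n' "$words" | grep -E 'gr[ae]y')

# Alternation: each alternative is a complete expression of any length.
alt_matches=$(printf '%s\n' "$words" | grep -E 'Bob|Robert')

# Negated class: any single character other than 'a' or 'e' in that position.
# Alternation has no comparable negated form.
negated_matches=$(printf '%s\n' "$words" | grep -E 'gr[^ae]y')

printf '%s\n' "$class_matches" "$alt_matches" "$negated_matches"
```

Here the class picks out 'gray' and 'grey', the alternation picks out 'Bob' and 'Robert', and the negated class picks out only 'griy'.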

Ignoring differences in capitalization:
We can tell egrep to ignore capitalization differences by using the '-i' option. Remember, though, that this is an option of egrep and not part of the regular expression language itself.
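As a quick illustration of '-i', here is a sketch using made-up input lines and grep -E in place of egrep:

```shell
# Made-up sample lines with different capitalizations of the same name.
lines=$(printf '%s\n' 'Parag wrote this' 'parag wrote that' 'PARAG shouted' 'nobody here')

# Without -i, only the exact capitalization matches.
exact=$(printf '%s\n' "$lines" | grep -E 'Parag')

# With -i, every capitalization of the pattern matches.
any_case=$(printf '%s\n' "$lines" | grep -E -i 'parag')

# -c prints the number of matching lines instead of the lines themselves.
any_case_count=$(printf '%s\n' "$lines" | grep -E -i -c 'parag')

printf '%s\n' "$exact" "$any_case" "$any_case_count"
```

The regular expression is the same plain string in both cases; only the command-line option changes how it is applied.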

Word boundaries:

Metacharacters:
There are different metacharacters within a class and outside of it. Sometimes the meaning of a metacharacter also changes depending on whether it is within a class or outside of it.

. Matches any single character
[...] Represents a character class
[^..] Negated character class
^ Start of a line when outside a class, negation when it is the first character in a class
$ End of a line
\< Start of a word
\> End of a word
| Alternation
() Used to limit the scope of alternation... there are other uses as well
- Used to specify ranges within a character class
? 0 or 1 match (signifies an optional match)
+ Matches 1 or more
* Matches any number including none (0 or more)
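
A few of these metacharacters can be tried out as follows. This is only a sketch with made-up input, and it assumes GNU grep, since \< and \> are extensions that not every grep supports:

```shell
# Made-up sample input.
text=$(printf '%s\n' 'cat' 'concatenate' 'the cat sat' 'color' 'colour' 'colouur')

# ^ and $ anchor the match to the start and end of the line.
whole_line=$(printf '%s\n' "$text" | grep -E '^cat$')

# \< and \> match at the start and end of a word (assumed GNU extension).
whole_word=$(printf '%s\n' "$text" | grep -E '\<cat\>')

# ? makes the preceding 'u' optional (0 or 1).
optional_u=$(printf '%s\n' "$text" | grep -E '^colou?r$')

# + requires at least one 'u' (1 or more).
one_or_more_u=$(printf '%s\n' "$text" | grep -E '^colou+r$')

# * allows any number of 'u', including none (0 or more).
zero_or_more_u=$(printf '%s\n' "$text" | grep -E '^colou*r$')

printf '%s\n' "$whole_line" "$whole_word" "$optional_u" "$one_or_more_u" "$zero_or_more_u"
```

Note that 'concatenate' contains 'cat' but is rejected by both the anchored and the word-boundary patterns.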

Example:

July? (fourth|4(th)?)
This matches a 'J' followed by a 'u' followed by an 'l', which may optionally be followed by a 'y'. After this comes a single space, and then either 'fourth', '4', or '4th'.
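
To check this reading of the expression, it can be run against a handful of made-up date strings (a sketch using grep -E in place of egrep):

```shell
# Made-up sample input: four lines that should match, two that should not.
dates=$(printf '%s\n' 'July fourth' 'Jul 4' 'July 4th' 'Jul fourth' 'July 5th' 'June 4th')

# The pattern is quoted so the shell passes |, (, ) and ? through untouched.
matches=$(printf '%s\n' "$dates" | grep -E 'July? (fourth|4(th)?)')

# Number of matching lines.
match_count=$(printf '%s\n' "$dates" | grep -E -c 'July? (fourth|4(th)?)')

printf '%s\n' "$matches" "$match_count"
```

'July 5th' fails on the day and 'June 4th' fails on the month. Because the expression is not anchored, it would also match inside longer text such as 'July 45'.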


Saturday, February 6, 2010

Regular Expressions - 1

Regular expressions by practice

I want to search for my name in a large text file. This is a case where we can use simple regular expressions.

egrep Parag file.txt

This searches the file called file.txt and prints every line that contains the string 'Parag'. But normally things are not as simple as this. What if I suddenly remember that some instances of Parag may not have the first character in upper case? I want to search for all instances of 'Parag' or 'parag'. So basically I want to search for a word which begins with either 'P' or 'p', followed by 'arag'. How do I specify 'P' or 'p'? It can be done with what is known as a character class. A character class matches any single one of the characters listed inside it.

egrep '[Pp]arag' file.txt

will print all lines containing 'Parag' or 'parag'. The quotes keep the shell from treating the brackets as a filename pattern.
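
The two searches can be compared side by side on a small throwaway file. This is a sketch: the file contents are made up, mktemp is assumed to be available, and grep -E stands in for egrep:

```shell
# Create a temporary sample file (contents are made up for illustration).
sample=$(mktemp)
printf '%s\n' 'Parag was here' 'parag was here too' 'nothing to see' > "$sample"

# Plain string: only the capitalized form matches.
upper_only=$(grep -E 'Parag' "$sample")

# Character class: 'P' or 'p', followed by 'arag'.
# The pattern is quoted so the shell does not expand the brackets itself.
either_case=$(grep -E '[Pp]arag' "$sample")

rm -f "$sample"

printf '%s\n' "$upper_only" "$either_case"
```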