designates my notes. / designates important.
“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” -Jamie Zawinski, 1997
This short, sub-100-page book starts off with a whirlwind history of regular expressions, beginning in 1946 and moving up through the Perl implementation, which sets the stage for Python’s version.
It covers the basics, including the theory behind how patterns are matched, in a language-agnostic format. This is perfect if you have little to no prior knowledge of regex. Several diagrams and tables summarize everything you’ll need to get started.
Chapter two moves on to Python-specific regex: building patterns, match objects, a cursory look at groups and position, the various compilation flags (re.DOTALL, re.ASCII, etc.), and the differences between Python 2 and 3.
Chapter three starts to describe more advanced features such as backreferencing, named groups, overlapping groups, and what the authors call the yes-pattern|no-pattern.
Chapter four is spent detailing the look around functionality: forward and backward, positive and negative. I found this part the most interesting and useful; up until this point the book has been pretty basic.
Lastly, chapter five focuses on performance. It is nice that the authors start out with time-honored advice from Donald Knuth:
“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
The performance optimizations given will most likely be irrelevant to someone using regex in a simple program; if you are mining big data, the optimizations in the book might be too simple for you. Still, it rounds out the presentation.
All in all, I thought it was well worth the read: quick, to the point, with a few simple, if contrived, examples for each topic.
“the look behind mechanism is only able to match fixed-width patterns.”
Element | Description (for regex with default flags)
--------|-------------------------------------------
. | This element matches any character except newline \n
\d | This matches any decimal digit; this is equivalent to the class [0-9]
\D | This matches any non-digit character; this is equivalent to the class [^0-9]
\s | This matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
\S | This matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]
\w | This matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_]
\W | This matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]
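The shorthand classes in the table above can be checked with a small script. This is a minimal sketch assuming Python 3 and the standard `re` module; the sample string is made up for illustration. Note that in Python 3 the `[a-zA-Z0-9_]` equivalence for `\w` holds exactly only with the `re.ASCII` flag, since `str` patterns match Unicode word characters by default.

```python
import re

# Hypothetical sample text, ASCII only.
sample = "Order 66, table_7!\n"

digits = re.findall(r"\d", sample)      # single decimal digits
words = re.findall(r"\w+", sample)      # runs of letters, digits, and underscores
spaces = re.findall(r"\s", sample)      # whitespace, including the newline
non_words = re.findall(r"\W", sample)   # everything that \w does not match

print(digits)
print(words)
print(spaces)
print(non_words)
```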
Symbol | Name | Quantification of previous character
------ |-------------- |--------------------------------------
? | Question mark | Optional (0 or 1 repetitions)
* | Asterisk | Zero or more times
+ | Plus sign | One or more times
{n,m} | Curly braces | Between n and m times
We can also define a certain range of repetitions by providing a minimum and maximum number of repetitions; for example, between three and eight times can be defined with the syntax {3,8}. Either the minimum or the maximum value can be omitted, defaulting to 0 and infinity respectively. To designate a repetition of up to three times, we can use {,3}; to require at least three repetitions, we can use {3,}.
Instead of using {,1}, you can use the question mark. The same applies to {0,} for the asterisk * and {1,} for the plus sign +.
Syntax | Description
-------|----------------------------------------------------
{n} | The previous character is repeated exactly n times.
{n,} | The previous character is repeated at least n times.
{,n} | The previous character is repeated at most n times.
{n,m} | The previous character is repeated between n and m times (both inclusive).
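The four curly-brace forms can be compared side by side. The following is a minimal sketch assuming Python 3.4+ (for `re.fullmatch`); the candidate strings are hypothetical and only vary in length.

```python
import re

# Hypothetical inputs: runs of the letter "a" with lengths 2, 3, 5, and 9.
candidates = ["aa", "aaa", "aaaaa", "aaaaaaaaa"]

# fullmatch requires the whole string to match, which makes the bounds visible.
between_3_and_8 = [s for s in candidates if re.fullmatch(r"a{3,8}", s)]
at_most_3 = [s for s in candidates if re.fullmatch(r"a{,3}", s)]
at_least_3 = [s for s in candidates if re.fullmatch(r"a{3,}", s)]
exactly_3 = [s for s in candidates if re.fullmatch(r"a{3}", s)]

print(between_3_and_8)
print(at_most_3)
print(at_least_3)
print(exactly_3)
```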
Matcher | Description
------- |------------
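^ | Matches at the beginning of the input, and at the beginning of each line when re.MULTILINE is set
$ | Matches at the end of the input (or just before a trailing newline), and at the end of each line when re.MULTILINE is set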
\b | Matches a word boundary
\B | Matches the opposite of \b. Anything that is not a word boundary
\A | Matches the beginning of the input
\Z | Matches the end of the input
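The difference between the line anchors and the input anchors is easiest to see on a multi-line string. This is a minimal sketch assuming Python 3; both input strings are invented for the example.

```python
import re

text = "first line\nsecond line"  # hypothetical two-line input

# Without re.MULTILINE, ^ only matches at the very start of the string.
default_starts = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)
# \A always anchors to the start of the input, whatever the flags.
input_start = re.findall(r"\A\w+", text, re.MULTILINE)
# \b keeps "line" from matching inside longer words such as "lines" or "outline".
whole_words = re.findall(r"\bline\b", "line lines outline line")

print(default_starts)
print(multiline_starts)
print(input_start)
print(whole_words)
```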
>>> pattern = re.compile(r"(\w+) (\w+)")
>>> it = pattern.finditer("Hello world hola mundo")
>>> match = next(it)
>>> match.groups()
('Hello', 'world')
>>> match.span()
(0, 11)
>>> match = next(it)
>>> match.groups()
('hola', 'mundo')
>>> match.span()
(12, 22)
>>> pattern = re.compile(r"\W")
>>> pattern.split("Beautiful is better than ugly", 2)
['Beautiful', 'is', 'better than ugly']
>>> pattern = re.compile(r"(-)")
>>> pattern.split("hello-word")
['hello', '-', 'word']
>>> pattern = re.compile(r"[0-9]+")
>>> pattern.sub("-", "order0 order1 order13")
'order- order- order-'
>>> re.sub('00', '-', 'order00000')
'order--0'
• -1234
• A193, B123, C124
• A1234
• B193, B123, B124
>>> def normalize_orders(matchobj):
...     if matchobj.group(1) == '-': return "A"
...     else: return "B"
...
>>> re.sub('([-|A-Z])', normalize_orders, '-1234 A193 B123')
'A1234 B193 B123'
>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
>>> pattern.search("Hello world").groupdict()
{'first': 'Hello', 'second': 'world'}
>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)?")
>>> match = pattern.search("Hello ")
>>> match.span(1)
(0, 5)
>>> pattern = re.compile(r"(\w+) \1")
>>> match = pattern.search(r"hello hello world")
>>> match.groups()
('hello',)
>>> pattern = re.compile(r"(\d+)-(\w+)")
>>> pattern.sub(r"\2-\1", "1-a\n20-baer\n34-afcr")
'a-1\nbaer-20\nafcr-34'
>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
>>> match = pattern.search("Hello world")
>>> match.group("first")
'Hello'
>>> match.group("second")
'world'
>>> pattern = re.compile(r"(?P<country>\d+)-(?P<id>\w+)")
>>> pattern.sub(r"\g<id>-\g<country>", "1-a\n20-baer\n34-afcr")
'a-1\nbaer-20\nafcr-34'
>>> pattern = re.compile(r"(?P<word>\w+) (?P=word)")
>>> match = pattern.search(r"hello hello world")
>>> match.groups()
('hello',)
Use | Syntax
------------------------------|---------------------
Inside a pattern | (?P=name)
In the repl string of the sub operation | \g<name>
In any of the operations of the MatchObject | match.group('name')
>>> re.search("Españ(a|ol)", "Español")
<_sre.SRE_Match at 0x10e90b828>
>>> re.search("Españ(a|ol)", "Español").groups()
('ol',)
>>> re.search("Españ(?:a|ol)", "Español")
<_sre.SRE_Match at 0x10e912648>
>>> re.search("Españ(?:a|ol)", "Español").groups()
()
Letter | Flag
-------|-----
i | re.IGNORECASE
L | re.LOCALE
m | re.MULTILINE
s | re.DOTALL
u | re.UNICODE
x | re.VERBOSE
>>> re.findall(r"(?u)\w+", ur"ñ")
[u'\xf1']
>>> re.findall(r"\w+", ur"ñ", re.U)
[u'\xf1']
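The same inline-versus-argument choice applies to the other flags in the table. The following is a minimal sketch assuming Python 3 (where inline flags must appear at the start of the pattern); the sample text and the names `inline`, `by_argument`, and `verbose` are made up for illustration.

```python
import re

text = "Hello\nWORLD"  # hypothetical sample with an embedded newline

# The letter form (?i) and the re.IGNORECASE argument are interchangeable.
inline = re.findall(r"(?i)world", text)
by_argument = re.findall(r"world", text, re.IGNORECASE)

# By default . does not match a newline; (?s) / re.DOTALL changes that.
dot_default = re.search(r"Hello.WORLD", text)
dot_all = re.search(r"(?s)Hello.WORLD", text)

# re.VERBOSE ignores pattern whitespace and allows comments inside the pattern.
verbose = re.compile(r"""
    (?P<first>\w+)    # first word
    \s+               # any whitespace, including the newline
    (?P<second>\w+)   # second word
""", re.VERBOSE)
groups = verbose.search(text).groupdict()

print(inline, by_argument)
print(dot_default, dot_all is not None)
print(groups)
```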
Positive look ahead: This mechanism is represented as an expression preceded by a question mark and an equals sign, ?=, inside a parenthesis block. For example, (?=regex) will match if the passed regex does match against the forthcoming input.
Negative look ahead: This mechanism is specified as an expression preceded by a question mark and an exclamation mark, ?!, inside a parenthesis block. For example, (?!regex) will match if the passed regex does not match against the forthcoming input.
Positive look behind: This mechanism is represented as an expression preceded by a question mark, a less-than sign, and an equals sign, ?<=, inside a parenthesis block. For example, (?<=regex) will match if the passed regex does match against the previous input.
Negative look behind: This mechanism is represented as an expression preceded by a question mark, a less-than sign, and an exclamation mark, ?<!, inside a parenthesis block. For example, (?<!regex) will match if the passed regex does not match against the previous input.
>>> pattern = re.compile(r'\w+(?=,)')
>>> pattern.findall("They were three: Felix, Victor, and Carlos.")
['Felix', 'Victor']
>>> pattern = re.compile(r'John(?!\sSmith)')
>>> result = pattern.finditer("I would rather go out with John McLane than with John Smith or John Bon Jovi")
>>> for i in result:
...     print i.start(), i.end()
...
27 31
63 67
One typical example of look ahead and substitutions would be the conversion of a number composed of just numeric characters, such as 1234567890, into a comma separated number, that is, 1,234,567,890.
We can start with an almost naive approach, using the following regular expression:
>>> pattern = re.compile(r'\d{1,3}')
>>> pattern.findall("The number is: 12345567890")
['123', '455', '678', '90']
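We have failed in this attempt: the digits are grouped from the left, whereas thousands separators need groups of three counted from the right.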
Let’s try to find one, two, or three digits that have to be followed by any number of blocks of three digits until we find something that is not a digit.
>>> pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
>>> pattern.sub(r'\g<0>,', "1234567890")
'1,234,567,890'
>>> pattern = re.compile(r'(?<=John\s)McLane')
>>> result = pattern.finditer("I would rather go out with John McLane than with John Smith or John Bon Jovi")
>>> for i in result:
... print i.start(), i.end()
...
32 38
The look behind mechanism is only able to match fixed-width patterns.
If variable-width patterns in look behind are required, the regex module at https://pypi.python.org/pypi/regex can be leveraged instead of the standard Python re module.
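The fixed-width restriction shows up as a compile-time error in the standard `re` module. This is a minimal sketch assuming Python 3; the flag name `variable_width_ok` is invented for the example, and the exact wording of the error message may differ between Python versions.

```python
import re

# A fixed-width look behind ("John" plus exactly one whitespace) compiles fine.
fixed = re.compile(r"(?<=John\s)McLane")

# A variable-width look behind (\s+ can be any length) is rejected by re.
try:
    re.compile(r"(?<=John\s+)McLane")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re refused the pattern:", exc)

print(fixed.search("John McLane").span())
```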
>>> pattern = re.compile(r'\B@[\w_]+')
>>> pattern.findall("Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers")
['@HadoopNews']
>>> pattern = re.compile(r'(?<=\B@)[\w_]+')
>>> pattern.findall("Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers")
['HadoopNews']
Uses look behind to exclude the @.
The negative look behind mechanism is only able to match fixed-width patterns.
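The notes above show positive look behind but no negative one, so here is a minimal sketch of it, assuming Python 3. The sentence is invented for the example; the look behind stays fixed-width (four letters plus one whitespace character).

```python
import re

# Match "Smith" only when it is NOT preceded by "John" and one whitespace.
pattern = re.compile(r"(?<!John\s)Smith")

text = "John Smith and Jane Smith"  # hypothetical input
spans = [m.span() for m in pattern.finditer(text)]

# Only the second "Smith" (the one after "Jane") should be reported.
print(spans)
```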
>>> import cProfile
>>> cProfile.run("alternation('spaniard')")
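The body of `alternation` is not shown in these notes, so the function below is a hypothetical stand-in used only to make the profiling call runnable. This sketch assumes Python 3 and uses `cProfile.Profile().runcall` with `pstats` instead of `cProfile.run`, so that it does not depend on the function living in the `__main__` namespace.

```python
import cProfile
import io
import pstats
import re


def alternation(text):
    # Hypothetical stand-in: the original function body is not given in the notes.
    pattern = re.compile(r"spa(in|niard)")
    return pattern.search(text)


# Profile a single call and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(alternation, "spaniard")

# Collect the report in a string instead of writing straight to stdout.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats()
report = stream.getvalue()

print(report)
```

Sorting by cumulative time puts the functions that dominate the call at the top of the report, which is usually the first thing to look at when deciding whether a pattern is worth optimizing.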