Mastering Python Regular Expressions
Author:
Pub Year:
Source:
Read: 2018-07-12
Last Update: 2018-07-12

Five Sentence Abstract:

Short and sweet as it is, the book still finds space to devote the first chapter to the history of regular expressions, not failing to mention the grep trivia. After the history is history, it moves on to cover the basics, including building patterns and match objects. Next it moves into slightly more advanced territory with groups and backreferencing. The most advanced features covered are matching with look-ahead and look-behind, both positive and negative. It finishes up with some tips on optimization and the advice not to prematurely optimize.

Thoughts:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." -Jamie Zawinski, 1997

This short, sub-100-page book starts off with a whirlwind history of regular expressions, beginning in 1946, moving up through the Perl implementation, and setting the stage for Python's version.

It covers the basics, the theory behind how patterns are matched, in a language-agnostic format. This is perfect if you have little to no foreknowledge of regex. Several diagrams and tables are included that summarize everything you'll need to get started.

Chapter two moves on to Python-specific regex: building patterns, match objects, a cursory look at groups and position, the various compilation flags (re.DOTALL, re.ASCII, etc.), and the differences between Python 2 and 3.
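To give a feel for what a compilation flag changes, here is a minimal example of my own (modern Python 3 syntax, not taken from the book) using re.DOTALL:

```python
import re

text = "first line\nsecond line"

# By default "." matches any character except a newline, so this search fails.
print(re.search(r"first.*second", text))  # None

# With re.DOTALL, "." is allowed to cross the newline.
print(re.search(r"first.*second", text, re.DOTALL).group())  # 'first line\nsecond'
```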

Chapter three describes more advanced features such as backreferencing, named groups, overlapping groups, and what the authors call the yes-pattern|no-pattern.
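The yes-pattern|no-pattern refers to the conditional syntax (?(id)yes|no): if group id matched earlier, the yes-pattern must match, otherwise the no-pattern must. A small sketch of my own (Python 3) matching digits that must be closed with a parenthesis only if one was opened:

```python
import re

# (?(1)\)|$): if group 1 (an opening paren) matched, require ")", else require end of string.
pattern = re.compile(r"(\()?\d+(?(1)\)|$)")

print(pattern.match("(123)").group())  # '(123)'
print(pattern.match("123").group())    # '123'
print(pattern.match("(123"))           # None: opened but never closed
```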

Chapter four is spent detailing the look-around functionality: ahead and behind, positive and negative. I found this part the most interesting and useful; up until this point the book had been pretty basic.

Lastly, chapter five focuses on performance. It is nice that the authors start out with time-honored advice from Donald Knuth:

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

The performance optimizations given will most likely be irrelevant to someone using regex in a simple program, and if you are mining big data they might be too simple for you. Still, they round out the presentation.
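As a taste of the kind of advice given, one classic tip is to pre-compile a pattern that is reused many times. A quick sketch of my own (Python 3); note that re internally caches recently used patterns, so the gap between the two is usually modest:

```python
import re
import timeit

setup = "import re; data = 'order-1234 order-5678 ' * 100"

# Module-level call: the pattern is looked up (and cached) on every call.
t1 = timeit.timeit("re.findall(r'\\d+', data)", setup=setup, number=1000)

# Pre-compiled object: the cache lookup is skipped entirely.
t2 = timeit.timeit("p.findall(data)",
                   setup=setup + "; p = re.compile(r'\\d+')",
                   number=1000)

print(t1, t2)
```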

All in all, I thought it was well worth the read: quick, to the point, with a few simple, if contrived, examples for each topic.

Exceptional Excerpts:

"the look behind mechanism is only able to match fixed-width patterns."

'Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.' -Donald Knuth

Notes:

Table of Contents

01: Introducing Regular Expressions
02: Regular Expressions with Python
03: Grouping
04: Look Around
05: Performance of Regular Expressions

01: Introducing Regular Expressions

page 6:
page 13:
Element | Description (for regex with default flags)
--------|-------------------------------------------
.       | This element matches any character except newline \n
\d      | This matches any decimal digit; this is equivalent to the class [0-9]
\D      | This matches any non-digit character; this is equivalent to the class [^0-9]
\s      | This matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
\S      | This matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]
\w      | This matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_]
\W      | This matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]
page 17:
Symbol | Name          | Quantification of previous character
------ |-------------- |--------------------------------------
?      | Question mark | Optional (0 or 1 repetitions)
*      | Asterisk      | Zero or more times
+      | Plus sign     | One or more times
{n,m}  | Curly braces  | Between n and m times
page 18:
page 19:
Syntax | Description
-------|----------------------------------------------------
{n}    | The previous character is repeated exactly n times.
{n,}   | The previous character is repeated at least n times.
{,n}   | The previous character is repeated at most n times.
{n,m}  | The previous character is repeated between n and m times (both inclusive).
page 21:
Matcher | Description
------- |------------
^       | Matches at the beginning of a line
$       | Matches at the end of a line
\b      | Matches a word boundary
\B      | Matches the opposite of \b. Anything that is not a word boundary
\A      | Matches the beginning of the input
\Z      | Matches the end of the input
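The difference between ^/$ and \A/\Z only shows up with re.MULTILINE; a small demo of my own (Python 3) to make the table concrete:

```python
import re

text = "first\nsecond"

# Without re.MULTILINE, ^ behaves like \A: only the very start of the input.
print(re.findall(r"^\w+", text))                 # ['first']

# With re.MULTILINE, ^ also matches right after every newline...
print(re.findall(r"^\w+", text, re.MULTILINE))   # ['first', 'second']

# ...while \A still matches only the start of the input.
print(re.findall(r"\A\w+", text, re.MULTILINE))  # ['first']
```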

02: Regular Expressions with Python

page 28:

page 34:
>>> pattern = re.compile(r"(\w+) (\w+)")
>>> it = pattern.finditer("Hello world hola mundo")
>>> match = it.next()
>>> match.groups()
('Hello', 'world')
>>> match.span()
(0, 11)
>>> match = it.next()
>>> match.groups()
('hola', 'mundo')
>>> match.span()
(12, 22)
page 36:
>>> pattern = re.compile(r"\W")
>>> pattern.split("Beautiful is better than ugly", 2)
['Beautiful', 'is', 'better than ugly']
>>> pattern = re.compile(r"(-)")
>>> pattern.split("hello-word")
['hello', '-', 'word']
>>> pattern = re.compile(r"[0-9]+")
>>> pattern.sub("-", "order0 order1 order13")
'order- order- order-'
>>> re.sub('00', '-', 'order00000')
'order--0'
page 37:
•    -1234
•    A193, B123, C124
•    A1234
•    B193, B123, B124
>>> def normalize_orders(matchobj):
...     if matchobj.group(1) == '-': return "A"
...     else: return "B"
>>> re.sub('([-|A-Z])', normalize_orders, '-1234 A193 B123')
'A1234 B193 B123'
page 41:
>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
>>> pattern.search("Hello world").groupdict()
{'first': 'Hello', 'second': 'world'}
page 42:
>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)?")
>>> match = pattern.search("Hello ")
>>> match.span(1)
(0, 5)
page 44:

03: Grouping

page 56:
>>> pattern = re.compile(r"(\w+) \1")
>>> match = pattern.search(r"hello hello world")
>>> match.groups()
('hello',)
>>> pattern = re.compile(r"(\d+)-(\w+)")
>>> pattern.sub(r"\2-\1", "1-a\n20-baer\n34-afcr")
'a-1\nbaer-20\nafcr-34'
>>> pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
>>> match = pattern.search("Hello world")
>>> match.group("first")
'Hello'
>>> match.group("second")
'world'
>>> pattern = re.compile(r"(?P<country>\d+)-(?P<id>\w+)")
>>> pattern.sub(r"\g<id>-\g<country>", "1-a\n20-baer\n34-afcr")
'a-1\nbaer-20\nafcr-34'
page 58:
>>> pattern = re.compile(r"(?P<word>\w+) (?P=word)")
>>> match = pattern.search(r"hello hello world")
>>> match.groups()
('hello',)
Use                                         | Syntax
--------------------------------------------|---------------------
Inside a pattern                            | (?P=name)
In the repl string of the sub operation     | \g<name>
In any of the operations of the MatchObject | match.group('name')
page 59:
>>> re.search("Españ(a|ol)", "Español")
<_sre.SRE_Match at 0x10e90b828>
>>> re.search("Españ(a|ol)", "Español").groups()
('ol',)
>>> re.search("Españ(?:a|ol)", "Español")
<_sre.SRE_Match at 0x10e912648>
>>> re.search("Españ(?:a|ol)", "Español").groups()
()
page 60:
Letter | Flag
-------|-----
i      | re.IGNORECASE
L      | re.LOCALE
m      | re.MULTILINE
s      | re.DOTALL
u      | re.UNICODE
x      | re.VERBOSE
>>> re.findall(r"(?u)\w+", ur"ñ")
[u'\xf1']
>>> re.findall(r"\w+", ur"ñ", re.U)
[u'\xf1']

04: Look Around

page 66:
page 67:
>>> pattern = re.compile(r'\w+(?=,)')
>>> pattern.findall("They were three: Felix, Victor, and Carlos.")
['Felix', 'Victor']
>>> pattern = re.compile(r'John(?!\sSmith)')
>>> result = pattern.finditer("I would rather go out with John McLane than with John Smith or John Bon Jovi")
>>> for i in result:
...     print i.start(), i.end()
...
27 31
63 67
page 69:
>>> pattern = re.compile(r'\d{1,3}')
>>> pattern.findall("The number is: 12345567890")
['123', '455', '678', '90']
page 70:

page 71:
>>> pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
>>> pattern.sub(r'\g<0>,', "1234567890")
'1,234,567,890'
>>> pattern = re.compile(r'(?<=John\s)McLane')
>>> result = pattern.finditer("I would rather go out with John McLane than with John Smith or John Bon Jovi")
>>> for i in result:
...     print i.start(), i.end()
...
32 38
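One gotcha the book calls out is that the look-behind mechanism is only able to match fixed-width patterns; a quantifier like + makes the width unknown and the pattern is rejected at compile time. A quick check of my own (Python 3):

```python
import re

# A fixed-width look-behind compiles fine.
re.compile(r"(?<=John\s)McLane")

# A variable-width look-behind ("\w+" has no fixed length) raises re.error.
try:
    re.compile(r"(?<=\w+\s)McLane")
except re.error as exc:
    print(exc)
```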
page 72:

page 73:
>>> pattern = re.compile(r'\B@[\w_]+')
>>> pattern.findall("Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers")
['@HadoopNews']
page 74:
>>> pattern = re.compile(r'(?<=\B@)[\w_]+')
>>> pattern.findall("Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers")
['HadoopNews']

05: Performance of Regular Expressions

page 77:

page 78:
>>> import cProfile
>>> cProfile.run("alternation('spaniard')")
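The excerpt above calls alternation() without showing its body. A runnable stand-in of my own (Python 3; the function body is an assumption shaped like the book's alternation examples, not the authors' actual code):

```python
import cProfile
import re

def alternation(text):
    # Hypothetical stand-in: try two equivalent alternation patterns on the input.
    patterns = [r"spa(in|niard)", r"spain|spaniard"]
    return [re.search(p, text) for p in patterns]

# runctx profiles the call in the current namespace, printing per-call stats.
cProfile.runctx("alternation('spaniard')", globals(), locals())
```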
page 80:
page 81:
page 82: