RegEx engines already define some groups of characters that can make writing RegEx expressions quicker.
^ is used to assert the beginning of a line in multi-line mode, or the beginning of the string in whole-string mode.
$ is used to assert the end of a line in multi-line mode, or the end of the string in whole-string mode.
The behaviour of both depends on the match options.
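To make the two modes concrete, here is a small sketch. It uses Python's re module as a stand-in engine, which is my assumption and not something this article prescribes; other engines expose the same option under different names. The sample text and variable names are made up for illustration.

```python
import re

text = "first line\nsecond line"

# Whole-string mode (Python's default): ^ only matches at the very start.
starts_default = re.findall(r"^\w+", text)

# Multi-line mode: ^ also matches right after every line break.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# The same distinction applies to $ and the end of the string / of each line.
ends_default = re.findall(r"\w+$", text)
ends_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(ends_default)      # ['line']
print(ends_multiline)    # ['line', 'line']
```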
Some combinators can match either “lazy” or “greedy”.
Lazy is when the engine only matches as many characters as required to get to the next step. This should almost always be used.
Greedy matching is when the engine tries to match as many characters as possible. The problem with this is that it can cause a lot of “backtracking”: when the rest of the pattern fails to match, the engine has to go back and give up the characters it already consumed, one at a time, until the whole pattern fits. This can cause big performance issues.
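The difference is easiest to see on the same input. This is a minimal sketch, assuming Python's re module (my choice of engine, not the article's) and a made-up sample string; it only shows what each mode ends up matching, not how fast it gets there.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs everything up to the end, then backs off to the LAST '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the FIRST '>' that lets the pattern finish.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```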
Multiple atoms can be combined together to form more complex patterns.
When two expressions are next to each other, they will be chained together, which means that both will be evaluated in order.
Example:
x\d matches an x and then a digit, like for example x9
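A quick way to check the chaining example, sketched with Python's re module (an assumption on my side; any engine with \d should behave the same for this pattern):

```python
import re

# 'x' and '\d' are chained: an x must be followed directly by one digit.
chain = re.compile(r"x\d")

matches_x9 = chain.fullmatch("x9") is not None
matches_x = chain.fullmatch("x") is not None    # the digit is missing
matches_9x = chain.fullmatch("9x") is not None  # wrong order
found = chain.findall("x1 y2 x3")

print(matches_x9, matches_x, matches_9x)  # True False False
print(found)                              # ['x1', 'x3']
```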
Two expressions separated by a | cause the RegEx engine to first try to match the left side, and only if it fails, it tries the right side instead.
Note that “or” has a long left and right scope, which means that ab|cd will match either ab or cd.
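The long scope of “or” is a common source of surprises, so here is a small sketch of it. It assumes Python's re module, and it borrows a non-capture group (described further down in this article) to show how the scope can be narrowed; the names wide and narrow are only illustrative.

```python
import re

# Long scope: the alternatives are the whole of "ab" and the whole of "cd".
wide = re.compile(r"ab|cd")

# Narrowed with a group: an a, then b or c, then a d.
narrow = re.compile(r"a(?:b|c)d")

wide_results = [wide.fullmatch(s) is not None for s in ("ab", "cd", "abd", "acd")]
narrow_results = [narrow.fullmatch(s) is not None for s in ("ab", "cd", "abd", "acd")]

print(wide_results)    # [True, True, False, False]
print(narrow_results)  # [False, False, True, True]
```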
A ? tries to match the expression to its left, but the overall match won’t fail if that expression doesn’t match.
Note that “or-not” has a short left scope, which means that ab? will always match a, and then try to match b.
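As a rough illustration of the short left scope, the following sketch checks a few inputs against ab? and against a second, made-up pattern (colou?r). Python's re module is assumed here as the engine.

```python
import re

# Only the b is optional; the a is always required.
optional_b = re.compile(r"ab?")
results = {s: optional_b.fullmatch(s) is not None for s in ("a", "ab", "b", "abb")}

# A typical use: accept two spellings of the same word.
colour = re.compile(r"colou?r")
spellings = [colour.fullmatch(s) is not None for s in ("color", "colour", "colouur")]

print(results)    # {'a': True, 'ab': True, 'b': False, 'abb': False}
print(spellings)  # [True, True, False]
```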
An expression followed by either a * for greedy repeat, or a *? for lazy repeat.
The greedy form matches as many times as possible and the lazy form as few times as possible. Both can also match the pattern zero times.
Note that this has a short left scope.
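The sketch below shows both points for the zero-or-more repeat: the short left scope, and the greedy/lazy difference. It assumes Python's re module; the inputs are invented for the example.

```python
import re

# Short left scope: only the b is repeated, not "ab" as a whole.
star = re.compile(r"ab*")
scope = [star.fullmatch(s) is not None for s in ("a", "abbb", "abab")]

# Greedy takes every b it can; lazy takes none, because zero is allowed
# and nothing after it forces the engine to consume more.
greedy = re.match(r"ab*", "abbb").group()
lazy = re.match(r"ab*?", "abbb").group()

print(scope)   # [True, True, False]
print(greedy)  # abbb
print(lazy)    # a
```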
An expression followed by either a + for greedy repeat, or a +? for lazy repeat.
The greedy form matches as many times as possible and the lazy form as few times as possible, but always at least one time.
Note that this has a short left scope.
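For the one-or-more repeat, the only change compared to the zero-or-more repeat is that at least one repetition is required. A minimal sketch, again assuming Python's re module:

```python
import re

plus = re.compile(r"ab+")

# At least one b is required, so a lone "a" no longer matches.
required = [plus.fullmatch(s) is not None for s in ("a", "ab", "abbb")]

# Greedy takes all three b's; lazy stops after the one mandatory b.
greedy = re.match(r"ab+", "abbb").group()
lazy = re.match(r"ab+?", "abbb").group()

print(required)  # [False, True, True]
print(greedy)    # abbb
print(lazy)      # ab
```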
Groups multiple expressions together for scoping.
Example:
(?:abc) will just match abc
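Grouping becomes useful once it is combined with a short-scoped operator such as a repeat. The following sketch compares a grouped and an ungrouped pattern; Python's re module and the two pattern names are assumptions made for the example.

```python
import re

# With the group, the repeat applies to "abc" as a whole.
repeated_group = re.compile(r"(?:abc)+")

# Without it, the repeat only applies to the c.
repeated_char = re.compile(r"abc+")

group_results = [repeated_group.fullmatch(s) is not None for s in ("abcabc", "abccc")]
char_results = [repeated_char.fullmatch(s) is not None for s in ("abcabc", "abccc")]

print(group_results)  # [True, False]
print(char_results)   # [False, True]
```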
Similar to Non-Capture Groups except that they capture the matched text. This allows the matched text of the inner expression to be extracted later.
Capture group IDs are enumerated from left to right, starting with 1.
Example:
(abc)de will match abcde, and store abc in group 1.
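Here is how extracting a group might look in practice. The sketch assumes Python's re module, where group 0 is the whole match and numbered groups start at 1; the second pattern is a made-up example showing that non-capture groups are skipped when the IDs are assigned.

```python
import re

m = re.fullmatch(r"(abc)de", "abcde")
whole = m.group(0)  # the complete match
first = m.group(1)  # the text captured by the first group

# Non-capture groups do not get an ID, so c ends up in group 2.
mixed = re.fullmatch(r"(a)(?:b)(c)", "abc").groups()

print(whole)  # abcde
print(first)  # abc
print(mixed)  # ('a', 'c')
```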
By surrounding multiple characters in square brackets, the engine will match any of them. Special characters or expressions won’t be parsed inside them, which means that this can also be used to escape characters.
For example:
[abc] will match either a, b or c.
and
[ab(?:c)] will match either a, b, (, ?, :, c, or ).
Character groups and escaped characters still work inside character sets.
Character sets can also contain ranges. For example:
[0-9a-z] will match either any digit, or any lowercase letter.
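The character set examples above can be checked with a short script. This is a sketch that assumes Python's re module; the input strings are invented, and the last pattern ([\d.]) is my own addition to show a character group next to a literal dot inside a set.

```python
import re

# Any one of a, b or c.
simple = re.findall(r"[abc]", "cabbage")

# Group syntax is not parsed inside a set: every character is taken literally.
literal = re.findall(r"[ab(?:c)]", "a(b)?:c")

# Ranges: any digit or any lowercase letter.
ranges = re.findall(r"[0-9a-z]", "A1 b2!")

# Character groups still work inside a set, and the dot is a plain dot here.
mixed = re.findall(r"[\d.]", "v1.2")

print(simple)   # ['c', 'a', 'b', 'b', 'a']
print(literal)  # ['a', '(', 'b', ')', '?', ':', 'c']
print(ranges)   # ['1', 'b', '2']
print(mixed)    # ['1', '.', '2']
```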
RegEx is perfect for when you just want to match some patterns, but the syntax can make patterns very hard to read or modify.
In the next article, we will start to dive into implementing RegEx.
Stay tuned!