- 1. Introduction
- 2. Character Matching
- 3. Meta Characters
- 4. Special Quantifiers
- 5. Shorthand Character Set
- 6. Good Examples
- 7. Command Line Usage
- 8. Look Around
- 9. Flags
- 10. Substitutions
- 11. When not to use Regex
- 12. Cheat Sheets
REGEX mean REGular EXpression which in turn is nothing but sequence of characters. These expressions say for example [0–9] means that the expression should contain numbers. Regular expressions are used in many situation in computer programming. Majorly in search, pattern matching, parsing, filtering of results and so on.
In lay man words, its a kind of rule which programmer tells to the computer to understand.
You may see this in website forms, where you are only forced to enter ONLY numbers or ONLY characters or MINIMUM 8 characters and so on, these are controlled by REGEX behind the screen.Imagine you are writing an application and you want to set the rules for when a
user chooses their username. We want to allow the username to contain letters, numbers, underscores and hyphens. We also want to limit the number of characters in username so it does not look ugly. We use the following regular expression to validate a username:
Above regular expression would accept bansalnitish
but not bansalnitish020297
(too long).
A good thing about regex is that it is almost language independent. Implementation might vary but the expression itself is more or less the same. See below each script will read the test.txt
file, search it using our regular expression:^[0-9]+$
, and print all the numbers in file to the console. As of now
consider just assume [0-9]
means any digit from 0-9, ^
means start of string, +
means anything greater than 0 and $
means end of the string.
If we use the following file as test.txt:
1234
abcde
12db2
5362
1
The output would be: ('1234', '5362', '1')
in all the following cases -
const fs = require('fs')
const testFile = fs.readFileSync('test.txt', 'utf8')
const regex = /^([0-9]+)$/gm
let results = testFile.match(regex)
console.log(results)
import re
with open('test.txt', 'r') as f:
test_string = f.read()
regex = re.compile(r'^([0-9]+)$', re.MULTILINE)
result = regex.findall(test_string)
print(result)
<?php
$myfile = fopen("test.txt", "r") or die("Unable to open file.");
$test_str = fread($myfile,filesize("test.txt"));
fclose($myfile);
$re = '/^[0-9]+$/m';
preg_match_all($re, $test_str, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
?>
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
class FileRegexExample {
public static void main(String[] args) {
try {
String content = new String(Files.readAllBytes(Paths.get("test.txt")));
Pattern pattern = Pattern.compile("^[0-9]+$", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(content);
ArrayList<String> matchList = new ArrayList<String>();
while (matcher.find()) {
matchList.add(matcher.group());
}
System.out.println(matchList);
} catch (IOException e) {
e.printStackTrace();
}
}
}
#include <string>
#include <fstream>
#include <iostream>
#include <sstream>
#include <regex>
using namespace std;
int main () {
ifstream t("test.txt");
stringstream buffer;
buffer << t.rdbuf();
string testString = buffer.str();
regex numberLineRegex("(^|\n)([0-9]+)($|\n)");
sregex_iterator it(testString.begin(), testString.end(), numberLineRegex);
sregex_iterator it_end;
while(it != it_end) {
cout << it -> str();
++it;
}
}
Regular expressions (regex or regexp) are extremely useful in extracting information from any text by searching for one or more matches of a specific search pattern (i.e. a specific sequence of ASCII or unicode characters). Fields of application range from validation to parsing/replacing strings, passing through converting data to other formats and web scraping. One of the most interesting features is that once you’ve learned the syntax, you can actually use this tool in (almost) all programming languages (JavaScript, Java, VB, C #, C / C++, Python, Perl, Ruby, Delphi, R, Tcl, an many others) with the slightest distinctions about the support of the most advanced features and syntax versions supported by the engines).
A regex can be used for simple searching and matching of a sequence of characters.
- For example, the regular expression
hat
means: the letterh
, followed by the lettera
, followed by the lettert
.hat
=> She wore a pretty bluehat
. Her friend wore a redhat
. Allhat
s are cool.
In this way, you can find a match for any sequence of characters you want to. The characters will be searched throughtout the input provided by you. Note that regex is case sensitive.
Meta characters are the building blocks of the regular expressions. Meta characters do not stand for themselves but instead are interpreted in some special way. Some meta characters have a special meaning and are written inside square brackets. The meta characters are as follows:
Meta character | Description |
---|---|
. | Period matches any single character except a line break. |
[ ] | Character class. Matches any character contained between the square brackets. |
[^ ] | Negated character class. Matches any character that is not contained between the square brackets |
* | Matches 0 or more repetitions of the preceding symbol. |
+ | Matches 1 or more repetitions of the preceding symbol. |
? | Makes the preceding symbol optional. |
{n,m} | Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol. |
(xyz) | Character group. Matches the characters xyz in that exact order. |
| | Alternation. Matches either the characters before or the characters after the symbol. |
\ | Escapes the next character. This allows you to match reserved characters [ ] ( ) { } . * + ? ^ $ \ | |
^ | Matches the beginning of the input. |
$ | Matches the end of the input. |
Full stop .
is the simplest example of meta character. The meta character .
matches any single character. It will not match return or newline characters.
For example, the regular expression .ar
means: any character, followed by the letter a
, followed by the letter r
.
".ar" => The car parked in the garage.
Character sets are also called character class. Square brackets are used to specify character sets. Use a hyphen inside a character set to specify the
characters' range. The order of the character range inside square brackets doesn't matter. For example, the regular expression [Tt]he
means: an uppercase
T
or lowercase t
, followed by the letter h
, followed by the letter e
.
"[Tt]he" => The car parked in the garage.
A period inside a character set, however, means a literal period. The regular expression ar[.]
means: a lowercase character a
, followed by letter r
,
followed by a period .
character.
"ar[.]" => A garage is a good place to park a car.
In general, the caret symbol represents the start of the string, but when it is typed after the opening square bracket it negates the character set. For
example, the regular expression [^c]ar
means: any character except c
, followed by the character a
, followed by the letter r
.
"[^c]ar" => The car parked in the garage.
Following meta characters +
, *
or ?
are used to specify how many times a subpattern can occur. These meta characters act differently in different
situations.
The symbol *
matches zero or more repetitions of the preceding matcher. The regular expression a*
means: zero or more repetitions of preceding lowercase
character a
. But if it appears after a character set or class then it finds the repetitions of the whole character set. For example, the regular expression
[a-z]*
means: any number of lowercase letters in a row.
"[a-z]*" => The car parked in the garage #21.
The *
symbol can be used with the meta character .
to match any string of characters .*
. The *
symbol can be used with the whitespace character \s
to match a string of whitespace characters. For example, the expression \s*cat\s*
means: zero or more spaces, followed by lowercase character c
,
followed by lowercase character a
, followed by lowercase character t
, followed by zero or more spaces.
"\s*cat\s*" => The fat cat sat on the concatenation.
The symbol +
matches one or more repetitions of the preceding character. For example, the regular expression c.+t
means: lowercase letter c
, followed by at least one character, followed by the lowercase character t
. It needs to be clarified that t
is the last t
in the sentence.
"c.+t" => The fat cat sat on the mat.
In regular expression the meta character ?
makes the preceding character optional. This symbol matches zero or one instance of the preceding character. For example, the regular expression [T]?he
means: Optional the uppercase letter T
, followed by the lowercase character h
, followed by the lowercase
character e
.
"[T]he" => The car is parked in the garage.
"[T]?he" => The car is parked in the garage.
In regular expression braces that are also called quantifiers are used to specify the number of times that a character or a group of characters can be
repeated. For example, the regular expression [0-9]{2,3}
means: Match at least 2 digits but not more than 3 ( characters in the range of 0 to 9).
"[0-9]{2,3}" => The number was 9.9997 but we rounded it off to 10.0.
We can leave out the second number. For example, the regular expression [0-9]{2,}
means: Match 2 or more digits. If we also remove the comma the regular expression [0-9]{3}
means: Match exactly 3 digits.
"[0-9]{2,}" => The number was 9.9997 but we rounded it off to 10.0.
"[0-9]{3}" => The number was 9.9997 but we rounded it off to 10.0.
Character group is a group of sub-patterns that is written inside Parentheses (...)
. As we discussed before that in regular expression if we put a quantifier after a character then it will repeat the preceding character. But if we put quantifier after a character group then it repeats the whole character group. For example, the regular expression (ab)*
matches zero or more repetitions of the character "ab". We can also use the alternation |
meta character inside character group. For example, the regular expression (c|g|p)ar
means: lowercase character c
, g
or p
, followed by character a
, followed by character r
.
"(c|g|p)ar" => The car is parked in the garage.
In regular expression Vertical bar |
is used to define alternation.Alternation is like a condition between multiple expressions. Now, you may be thinking that character set and alternation works the same way. But the big difference between character set and alternation is that character set works on character level but alternation works on expression level. For example, the regular expression (T|t)he|car
means: uppercase character T
or lowercase t
, followed by lowercase character h
, followed by lowercase character e
or lowercase character c
, followed by lowercase character a
, followed by lowercase character r
.
"(T|t)he|car" => The car is parked in the garage.
Backslash \
is used in regular expression to escape the next character. This allows us to specify a symbol as a matching character including reserved
characters { } [ ] / \ + * . $ ^ | ?
. To use a special character as a matching character prepend \
before it.
For example, the regular expression .
is used to match any character except newline. Now to match .
in an input string the regular expression
(f|c|m)at\.?
means: lowercase letter f
, c
or m
, followed by lowercase character a
, followed by lowercase letter t
, followed by optional .
character.
"(f|c|m)at\.?" => The fat cat sat on the mat.
In regular expressions, we use anchors to check if the matching symbol is the starting symbol or ending symbol of the input string. Anchors are of two types:
First type is Caret ^
that check if the matching character is the start character of the input and the second type is Dollar $
that checks if matching character is the last character of the input string.
Caret ^
symbol is used to check if matching character is the first character of the input string. If we apply the following regular expression ^a
(if a is the starting symbol) to input string abc
it matches a
. But if we apply regular expression ^b
on above input string it does not match anything.Because in input string abc
"b" is not the starting symbol. Let's take a look at another regular expression ^(T|t)he
which means: uppercase character T
or lowercase character t
is the start symbol of the input string, followed by lowercase character h
, followed by lowercase character e
.
"(T|t)he" => The car is parked in the garage.
"^(T|t)he" => The car is parked in the garage.
Dollar $
symbol is used to check if matching character is the last character of the input string. For example, regular expression (at\.)$
means: a
lowercase character a
, followed by lowercase character t
, followed by a .
character and the matcher must be end of the string.
"(at\.)" => The fat cat. sat. on the mat.
"(at\.)$" => The fat cat. sat. on the mat.
- Greedy Quantifier: matches as many characters as possible.
a.*a
=> greedy can be dangerous a
t times - Lazy Quantifier : matches as few characters as possible.
a*?
=>a
pa
rt ofa
na
dventurous trip. - Possessive Quantifier : matches as many characters as possible; backtracking can't reduce the number of characters matched. Since it is greedy, it will match all the way to the last digit, leaving nothing else for the
.
to match. Without backtracking, this regex fails to produce a match.
Regular expression provides shorthands for the commonly used character sets, which offer convenient shorthands for commonly used regular expressions. The shorthand character sets are as follows:
Shorthand | Description |
---|---|
. | Any character except new line |
\w | Matches alphanumeric characters: [a-zA-Z0-9_] |
\W | Matches non-alphanumeric characters: [^\w] |
\d | Matches digit: [0-9] |
\D | Matches non-digit: [^\d] |
\s | Matches whitespace character: [\t\n\f\r\p{Z}] |
\S | Matches non-whitespace character: [^\s] |
\b
may be a bit confusing to some of you so I tried to expand this a bit more. \b
is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
A word character is all alphanumeric alphabets
[a-zA-Z0-9]
and underscore_
. Any character not a word character is non word or not a word character.
A few examples will may you much clear about how \b
works.
The regex \bcat\b
would match cat in a black cat, but it wouldn't match it in catatonic, tomcat or certificate. Removing one of the boundaries, \bcat
would match cat in catfish, and cat\b would match cat in tomcat, but not vice-versa. Both, of course, would match cat on its own. The \bcat\b
will not match cat in _cat or in cat25 because there is no boundary between an underscore and a letter, nor between a letter and a digit: these all belong to what regex defines as word characters (as defined above).
"\bm" => moon
oo\b
does not match the oo
in "moon", because oo
is followed by n
which is a word character.
\B
matches all positions where \b doesn't match. Therefore, it matches:
- When neither side is a word character, for instance at any position in the string $=(@-%++) (including the beginning and end of the string)
- When both sides are a word character, for instance between the
H
and thei
inHi!
.
I feel examples explain the content in the best way possible, clearing all your doubts. Here we have two examples they will make you confident that as of now you can write and play with regex.
Another highly useful regex recipe is matching URLs in text.
Here an example URL matching expression from Stack Overflow.
(https?:\/\/)(www\.)?(?<domain>[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})(?<path>\/[-a-zA-Z0-9@:%_\/+.~#?&=]*)?
(https?:\/\/)
- Match http or https followed by \(www\.)?
- Optional "www" prefix(?<domain>[-a-zA-Z0-9@:%._\+~#=]{2,256}
- Match a valid domain name\.[a-z]{2,6})
- Match a domain extension extension (i.e. ".com" or ".org")(?<path>\/[-a-zA-Z0-9@:%_\/+.~#?&=]*)?
- Match URL path (/posts
), query string (?limit=1
), and/or file extension (.html
), all optional.
You'll notice here that some of the capture groups now begin with a ?<name>
identifier. This is the syntax for a named capture group, which makes the data extraction cleaner.
Here's how we could use named capture groups to extract the domain name of each URL in a web page using Python.
import re
import urllib.request
html = str(urllib.request.urlopen("https://moz.com/top500").read())
regex = r"(https?:\/\/)(www\.)?(?P<domain>[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6})(?P<path>\/[-a-zA-Z0-9@:%_\/+.~#?&=]*)?"
matches = re.finditer(regex, html)
for match in matches:
print(match.group('domain'))
In the above code we have three groups, the last two of which have name domain and path respectively.
The script will print out each domain name it finds in the raw web page HTML content.
...
facebook.com
twitter.com
google.com
youtube.com
linkedin.com
wordpress.org
instagram.com
pinterest.com
wikipedia.org
wordpress.com
...
Note that you can also access groups using the group number i.e you can get the same output by just replacing
print(match.group('domain'))
with
print(match.group(2))
Just to make this more clear try to follow the image below, here we have 3 groups and three matchings. When we print the matching of a group two none and a matching would be out.
Regex is :
^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
Observer this and you feel see the first group is to be repeated 3 times as {3} is mentioned and also all three groups should end with a dot. Inside the bracket we have 25
followed by a number between 0
and 5
; or the string 2
and a number between 0
and 4
and any number; or an optional zero or one followed by two numbers, with the second being optional. Making it a number less than 256.
Similarly we have the last group, just the dot is removed. Since dot is meta character so backslash is used to provide its literal meaning.
So the string 252.60.129.255 will be matched
while the string 256.60.129.255 will not be matched
Regular expressions are also supported by many Unix command line utilities! We'll walk through how to use them with grep
to find specific files, and with sed
to replace text file content in-place.
We'll define another basic regular expression, this time to match image files.
^.+\.(?i)(png|jpg|jpeg|gif|webp)$
^
- Start of line..+
- Match any character (letters, digits, symbols), expect for\n
(new line), 1+ times.\.
- Match the '.' character.(?i)
- Signifies that the next sequence is case-insensitive.(png|jpg|jpeg|gif|webp)
- Match common image file extensions$
- End of line
Here's how you could list all of the image files in your Downloads
directory.
ls ~/Downloads | grep -E '^.+\.(?i)(png|jpg|jpeg|gif|webp)$'
ls ~/Downloads
- List the files in your downloads directory|
- Pipe the output to the next commandgrep -E
- Filter the input with regular expression
Another good use of regular expressions in bash commands could be redacting emails within a text file.
This can be done quite using the sed
command, along with a modified version of our email regex from earlier.
sed -E -i 's/^(.*?\s|)[^@]+@[^\s]+/\1\{redacted\}/g' test.txt
sed
- The Unix "stream editor" utility, which allows for powerful text file transformations.-E
- Use extended regex pattern matching-i
- Replace the file stream in-place's/^(.*?\s|)
- Wrap the beginning of the line in a capture group[^@]+@[^\s]+
- Simplified version of our email regex./\1\{redacted\}/g'
- Replace each email address with{redacted}
.test.txt
- Perform the operation on thetest.txt
file.
We can run the above substitution command on a sample test.txt
file.
My email is patrick.triest@gmail.com
Once the command has been run, the email will be redacted from the test.txt
file.
My email is {redacted}
Warning - This command will automatically remove all email addresses from any
test.txt
that you pass it, so be careful where/when you run it, since this operation cannot be reversed. To preview the results within the terminal, instead of replacing the text in-place, simply omit the-i
flag.
Lookbehind and lookahead (also called lookaround) are specific types of
non-capturing groups (Used to match the pattern but not included in matching
list). Lookaheads are used when we have the condition that this pattern is
preceded or followed by another certain pattern. For example, we want to get all
numbers that are preceded by $
character from the following input string
$4.44 and $10.88
. We will use following regular expression (?<=\$)[0-9\.]*
which means: get all the numbers which contain .
character and are preceded
by $
character. Following are the lookarounds that are used in regular
expressions:
Symbol | Description |
---|---|
?= | Positive Lookahead |
?! | Negative Lookahead |
?<= | Positive Lookbehind |
?<! | Negative Lookbehind |
The positive lookahead asserts that the first part of the expression must be
followed by the lookahead expression. The returned match only contains the text
that is matched by the first part of the expression. To define a positive
lookahead, parentheses are used. Within those parentheses, a question mark with
equal sign is used like this: (?=...)
. Lookahead expression is written after
the equal sign inside parentheses. For example, the regular expression
(T|t)he(?=\sfat)
means: optionally match lowercase letter t
or uppercase
letter T
, followed by letter h
, followed by letter e
. In parentheses we
define positive lookahead which tells regular expression engine to match The
or the
which are followed by the word fat
.
"(T|t)he(?=\sfat)" => The fat cat sat on the mat.
Negative lookahead is used when we need to get all matches from input string
that are not followed by a pattern. Negative lookahead is defined same as we define
positive lookahead but the only difference is instead of equal =
character we
use negation !
character i.e. (?!...)
. Let's take a look at the following
regular expression (T|t)he(?!\sfat)
which means: get all The
or the
words
from input string that are not followed by the word fat
precedes by a space
character.
"(T|t)he(?!\sfat)" => The fat cat sat on the mat.
Positive lookbehind is used to get all the matches that are preceded by a
specific pattern. Positive lookbehind is denoted by (?<=...)
. For example, the
regular expression (?<=(T|t)he\s)(fat|mat)
means: get all fat
or mat
words
from input string that are after the word The
or the
.
"(?<=(T|t)he\s)(fat|mat)" => The fat cat sat on the mat.
Negative lookbehind is used to get all the matches that are not preceded by a
specific pattern. Negative lookbehind is denoted by (?<!...)
. For example, the
regular expression (?<!(T|t)he\s)(cat)
means: get all cat
words from input
string that are not after the word The
or the
.
"(?<!(T|t)he\s)(cat)" => The cat sat on cat.
Flags are also called modifiers because they modify the output of a regular expression. A regex usually comes with in this form /abc/, where the search pattern is delimited by two slash characters /. At the end we can specify a flag with these values (we can also combine them with each other):
g
(global) does not return after the first match, restarting the subsequent searches from the end of the previous match.m
(multi line) when enabled^
and$
will match the start and end of a line, instead of the whole string.i
(insensitive) makes the whole expression case-insensitive (for instance/aBc/i
would match AbC).
\t
inserts a tab character.\r
inserts a carriage return character.\n
inserts a newline character.\f
inserts a form-feed character.
Ok, so clearly regex is a powerful, flexible tool. Are there times when you should avoid writing your own regex expressions? Yes!
Parsing structured languages, from English to Java to JSON, can be a real pain using regex expressions.
Writing your own regex expression for this purpose is likely to be an exercise in frustration that will result in eventual (or immediate) disaster when an edge case or minor syntax/grammar error in the data source causes the expression to fail.
Battle-hardened parsers are available for virtually all machine-readable languages, and NLP tools are available for human languages - I strongly recommend that you use one of them instead of attempting to write your own.
It may seem tempting to use regular expressions to filter user input (such as from a web form), to prevent hackers from sending malicious commands (such as SQL injections) to your application.
Using a custom regex expression here is unwise since it is very difficult to cover every potential attack vector or malicious command. For instance, hackers can use alternative character encodings to get around naively programmed input blacklist filters.
This is another instance where I would strongly recommend using the well-tested libraries and/or services, along with the use of whitelists instead of blacklists, in order to protect your application from malicious inputs.
Regex matching speeds can range from not-very-fast to extremely slow, depending on how well the expression is written. This is fine for most use cases, especially if the text being matched is very short (such as an email address form). For high-performance server applications, however, regex can be a performance bottleneck, especially if expression is poorly written or the text being searched is long.
Regex is an incredibly useful tool, but that doesn't mean you should use it everywhere.
If there is an alternative solution to a problem, which is simpler and/or does not require the use of regular expressions, please do not use regex just to feel clever. Regex is great, but it is also one of the least readable programming tools, and one that is very prone to edge cases and bugs.
Overusing regex is a great way to make your co-workers (and anyone else who needs to work with your code) very angry with you.
-
Essential Guide: http://coding.smashingmagazine.com/2009/06/01/essential-guide-to-regular-expressions-tools-tutorials-and-resources/
-
Basic RegEx Tester: http://regexpal.com/ and http://www.myregextester.com/
-
JavaScript RegEx Cheat Sheet: https://www.debuggex.com/cheatsheet/regex/javascript
-
Toolbox: http://www.tripwiremagazine.com/2009/08/massive-regular-expressions-toolbox.html
-
Mozilla Dev article: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
-
General Cheat Sheet: http://www.addedbytes.com/cheat-sheets/download/regular-expressions-cheat-sheet-v2.png
-
GNU Grep tutorial: http://www.ibm.com/developerworks/aix/library/au-regexp/
-
TAO: http://linuxreviews.org/beginner/tao_of_regular_expressions/
-
List of common regex: https://gist.github.com/nerdsrescueme/1237767
- Report Issues
- Feel free to expand this