The Art of Regex: Python’s pattern painting
What are Regular Expressions?
A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. When the pattern is applied to a piece of text, the regex engine returns the parts of the text that match it.
Why use Regular Expressions?
Regex makes life easier when you clean text-based data or automate searches over it.
Consider that you have a log file that contains information like computer details, process ID, timestamp, and other information that you may not be interested in.
If you want to fetch only the process ID, whatever else the log contains, you can write a regex that matches the process ID pattern and extract it.
Similarly, if you want to validate a field entered by the user, such as a username, you can do it with a regex instead of a chain of if-else conditions.
Consider that you want the username to be 6 to 20 characters long and to contain only letters, numbers, and the special characters -_&#$@.
^[a-zA-Z0-9_\-&#$@]{6,20}$
By using this regular expression, we can check whether the user input matches this pattern or not. The query may look confusing, but don’t worry. By the end of this blog, you will be able to decode it.
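As a minimal sketch of that validation, the snippet below wraps the pattern in a small helper. The function name is_valid_username and the sample usernames are made up for illustration, and the character class is written as a-zA-Z0-9 plus the special characters listed above.

```python
import re

# Letters, digits, and the special characters - _ & # $ @, 6 to 20 characters long.
USERNAME_PATTERN = re.compile(r"^[a-zA-Z0-9_\-&#$@]{6,20}$")


def is_valid_username(username):
    """Return True when the username fits the rule above (hypothetical helper)."""
    return USERNAME_PATTERN.search(username) is not None


print(is_valid_username("data_geek#42"))   # True
print(is_valid_username("abc"))            # False, shorter than 6 characters
print(is_valid_username("has space 123"))  # False, spaces are not allowed
```

One pattern replaces separate checks for the length and for each allowed character.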
Pip Install
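The re module used in this post ships with Python's standard library, so it needs no installation. If you also want the third-party regex package, which extends re with extra features, you can install it with pip.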
pip install regex
If you want to test your regular expressions online, you can use an online regex tester.
Basic Regular Expression
search and findall
Both search and findall are functions in the re module.
re.search(<regex pattern>, <string>)
re.findall(<regex pattern>, <string>)
search tells us whether the pattern matches anywhere in the text. If it does, search returns a match object for the first occurrence of the pattern; if it does not, search returns None. Calling span() on the match object gives the start and end index of the matched text.
findall, unlike search, returns every non-overlapping occurrence of the pattern as a list.
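Here is a short sketch of both functions side by side. The log-style text is invented for the example.

```python
import re

# Hypothetical log line, made up for this example.
text = "pid 4521 started, pid 87 stopped"

# search stops at the first occurrence and returns a match object.
first = re.search(r"pid", text)
print(first.group())  # pid
print(first.span())   # (0, 3)

# When nothing matches, search returns None.
missing = re.search(r"error", text)
print(missing)        # None

# findall collects every occurrence in a list.
all_hits = re.findall(r"pid", text)
print(all_hits)       # ['pid', 'pid']
```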
Anchors :
^ and $ are known as anchors in the regex world.
^ - matches at the start of the string
$ - matches at the end of the string
A pattern that begins with ^ matches only if it occurs at the start of the string, and a pattern that ends with $ matches only if it occurs at the end.
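A quick sketch of ^ and $ in action, using a sample sentence chosen for illustration:

```python
import re

text = "regex makes text search easy"

# ^ ties the pattern to the start of the string.
starts_with_regex = re.search(r"^regex", text) is not None
starts_with_text = re.search(r"^text", text) is not None

# $ ties the pattern to the end of the string.
ends_with_easy = re.search(r"easy$", text) is not None
ends_with_search = re.search(r"search$", text) is not None

print(starts_with_regex, starts_with_text)  # True False
print(ends_with_easy, ends_with_search)     # True False
```

The words text and search both appear in the string, but not at the position the anchor demands, so those two searches fail.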
Character classes :
Character classes, or character sets, are enclosed in square brackets, []. They tell the regex engine to match exactly one out of several characters. Just place the characters you want to match inside the square brackets.
As we saw earlier, if the pattern does not match, search returns None.
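The snippet below is a small illustration of character classes, with sample strings invented for the purpose.

```python
import re

# [ae] matches exactly one character: either a or e.
both_spellings = re.findall(r"gr[ae]y", "grey or gray, but not groy")
print(both_spellings)  # ['grey', 'gray']

# A hyphen inside the brackets defines a range of characters.
first_letters = re.findall(r"[a-c]", "abcdef")
print(first_letters)   # ['a', 'b', 'c']

# None of x, y, or z occurs in the text, so search returns None.
no_match = re.search(r"[xyz]", "hello")
print(no_match)        # None
```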
Shorthand character class :
We can use shortcuts for the character class.
. - also known as the wildcard, as it matches any character except a line break
\d - matches a digit
\w - matches a word character (an alphanumeric character or an underscore)
\s - matches a whitespace character such as a space, a tab, or a line break
\b - matches a word boundary, which makes whole-word matching possible
Similarly, \d, \w, and \s have negations: \D, \W, and \S respectively.
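To see the shorthands at work, here is a sketch that runs each one through findall. All sample strings are made up for the example.

```python
import re

# . matches any single character (except a line break).
wildcard = re.findall(r"c.t", "cat cot c-t")
print(wildcard)     # ['cat', 'cot', 'c-t']

# \d matches digits, \D matches everything that is not a digit.
digits = re.findall(r"\d", "Order 66 shipped on 4 May")
non_digits = re.findall(r"\D", "a1b2")
print(digits)       # ['6', '6', '4']
print(non_digits)   # ['a', 'b']

# \w matches letters, digits, and the underscore.
word_chars = re.findall(r"\w", "a_1!")
print(word_chars)   # ['a', '_', '1']

# \s matches spaces, tabs, and line breaks.
spaces = re.findall(r"\s", "a b\tc\nd")
print(spaces)       # [' ', '\t', '\n']

# \b marks a word boundary, so only the standalone word matches.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
print(whole_words)  # ['cat', 'cat']
```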
Meta Characters
[^ ] → Matches any character except the ones inside the brackets
| → Matches either the pattern before the symbol or the pattern after it
? → Makes the preceding symbol optional
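The three meta characters above can be tried out with a sketch like this one. The sample sentences are invented for illustration.

```python
import re

# [^aeiou ] matches any character that is neither a vowel nor a space.
consonants = re.findall(r"[^aeiou ]", "regex is fun")
print(consonants)   # ['r', 'g', 'x', 's', 'f', 'n']

# | matches either the pattern on its left or the one on its right.
pets = re.findall(r"cat|dog", "I have a cat and a dog")
print(pets)         # ['cat', 'dog']

# ? makes the preceding u optional, so both spellings match.
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)    # ['color', 'colour']
```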
Repetition Qualifiers
So far, we have looked at how to match a specific character or any character. Now we're going to see how to match these characters several times.
* → Matches 0 or more repetitions of the preceding symbol
.* → Matches any sequence of characters, including an empty one
Once the engine encounters a quantified token such as .*, it matches as many input characters as possible. This behavior is called greedy matching, because the engine eagerly consumes as many characters as it can while still letting the rest of the pattern match.
+ → Matches 1 or more repetitions of the preceding symbol
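A small sketch of * and + follows, along with greedy matching. The HTML-like string is just a convenient made-up example.

```python
import re

# b* accepts zero or more b characters after the a.
zero_or_more = re.findall(r"ab*", "a ab abbb")
print(zero_or_more)  # ['a', 'ab', 'abbb']

# b+ needs at least one b, so the lone a no longer matches.
one_or_more = re.findall(r"ab+", "a ab abbb")
print(one_or_more)   # ['ab', 'abbb']

# Greedy matching: .* runs on to the last > in the text, not the first one.
greedy = re.search(r"<.*>", "<b>bold</b> text").group()
print(greedy)        # <b>bold</b>
```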
More on Repetition qualifiers
What if you want to use repetition qualifiers for the character class, but you want to restrict the number of characters?
Consider a case where you have to fetch all the strings that have six characters. Using curly braces, {}, you can specify the number of repetitions.
What if you want to fetch strings whose length falls within a range? Give both bounds inside the braces, as in {6,8}.
There is a problem, though: the pattern also matches the first few characters of longer words, so those words get cut in the middle. Using \b, we can restrict the match to whole words.
\b → performs “whole words only” search
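The sketch below shows the cut-off problem and the \b fix on a sample sentence chosen for illustration.

```python
import re

text = "Python regular expressions are powerful"

# Exactly six letters, but longer words get cut in the middle.
six_cut = re.findall(r"[a-zA-Z]{6}", text)
print(six_cut)      # ['Python', 'regula', 'expres', 'powerf']

# \b on both sides keeps only words that are exactly six letters long.
six_whole = re.findall(r"\b[a-zA-Z]{6}\b", text)
print(six_whole)    # ['Python']

# {6,8} accepts six to eight letters; without \b, expressions is still cut.
range_cut = re.findall(r"[a-zA-Z]{6,8}", text)
print(range_cut)    # ['Python', 'regular', 'expressi', 'powerful']

range_whole = re.findall(r"\b[a-zA-Z]{6,8}\b", text)
print(range_whole)  # ['Python', 'regular', 'powerful']
```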
Escaping Characters
Escape characters are used when we need our pattern to match special characters literally.
Let’s see an example. Consider that you want to find whether the domain name contains .com or not.
The pattern .com matches a string even when the string does not contain .com, because the special character . matches any character.
So how do we fix this?
To solve this, we escape the special character with a backslash (\). An escaped special character is treated as a normal character, so \. matches only a literal dot.
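Here is a minimal sketch of the problem and the fix. The strings welcome and example.com are placeholders picked for the example.

```python
import re

# Unescaped, the dot matches any character, so "lcom" inside "welcome" counts.
loose = re.search(r".com", "welcome")
print(loose.group())  # lcom

# Escaped, the dot must be a literal dot, so "welcome" no longer matches.
strict_miss = re.search(r"\.com", "welcome")
print(strict_miss)    # None

strict_hit = re.search(r"\.com", "example.com")
print(strict_hit.group())  # .com
```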
Now we have looked at a bunch of basic regular expressions. Let's see regular expressions in action.
We will look at a problem: checking whether a variable name in Python is valid or not.
A valid name starts with an underscore or a letter, and the rest of it contains only letters, digits, and underscores. Any other variable name is invalid in Python.
So how will you do it?
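One possible answer is sketched below. The helper name is_valid_variable_name is made up for this example, and the pattern follows the simplified rule stated above, so it assumes ASCII letters only and does not reject reserved keywords such as class or for.

```python
import re

# First character: a letter or underscore. Remaining characters: letters, digits, underscores.
VARIABLE_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def is_valid_variable_name(name):
    """Check a name against the simplified rule (hypothetical helper)."""
    return VARIABLE_NAME.search(name) is not None


print(is_valid_variable_name("_count"))      # True
print(is_valid_variable_name("user_name2"))  # True
print(is_valid_variable_name("2fast"))       # False, starts with a digit
print(is_valid_variable_name("my-var"))      # False, hyphen is not allowed
```

The anchors matter here: without ^ and $, a name like 2fast would still match on its fast part.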
You can also try some exercises on your own.
- Match a country name that starts and ends with the letter a. Australia and Argentina are valid, whereas Afghanistan is invalid.
- Match the proper time format. 9:49 AM and 12:21 PM are correct formats, whereas 7:60 AM and twelve o'clock are invalid.
Were you able to crack them? Awesome! Below are some of the advanced concepts in regex.
Capturing Groups
For example, you are given the name of a person in the format last name, first name. If you want to fetch the first name and the last name separately, how can you achieve it?
Capturing groups are the portions of the pattern that are enclosed in parentheses. The syntax is (regex). The text matched by each group can be read from the match object with group(1), group(2), and so on.
Consider that you want to write a program to rearrange the first and last name. You can make use of the regex functions we looked at above.
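A minimal sketch of that rearrangement is shown below. The helper name rearrange_name and the sample name are made up for illustration, and the pattern assumes single-word first and last names separated by a comma and a space.

```python
import re

# Group 1 captures the last name, group 2 captures the first name.
NAME_PATTERN = re.compile(r"^(\w+), (\w+)$")


def rearrange_name(name):
    """Turn 'Last, First' into 'First Last' (hypothetical helper)."""
    match = NAME_PATTERN.search(name)
    if match is None:
        return name  # leave anything in another format untouched
    return f"{match.group(2)} {match.group(1)}"


print(rearrange_name("Lovelace, Ada"))  # Ada Lovelace
print(rearrange_name("Ada"))            # Ada
```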
Splitting and replacing
We can use regex to split a string and also to substitute parts of it.
import re
# Split the string by spaces
text = "This is a sentence with spaces."
split_text = re.split(r"\s+", text)
# Replace all occurrences of "sentence" with "phrase"
replaced_text = re.sub(r"sentence", "phrase", text)
# Print the split and replaced text
print(split_text)
# Output : ['This', 'is', 'a', 'sentence', 'with', 'spaces.']
print(replaced_text)
# Output : This is a phrase with spaces.
Repetition Qualifiers
* : It specifies that the character previous to it can be repeated zero or more times.
+ : It specifies that the character previous to it can be repeated one or more times.
? : It specifies that the character previous to it can be repeated zero or one time.
import re
string = "This is a string."
# "a*" matches the letter "a" zero or more times. Zero repetitions are allowed,
# so search succeeds right at index 0 with an empty match.
pattern = r"a*"
match = re.search(pattern, string)
if match:
    print(repr(match.group()))
# Output : ''
# "a+" matches the letter "a" one or more times, so search skips ahead
# to the first actual "a" in the string.
pattern = r"a+"
match = re.search(pattern, string)
if match:
    print(repr(match.group()))
# Output : 'a'
# "a?" matches the letter "a" zero or one time. Like "a*", it is satisfied
# by an empty match at index 0.
pattern = r"a?"
match = re.search(pattern, string)
if match:
    print(repr(match.group()))
# Output : ''
Regex is like a Swiss Army knife for strings — versatile, compact, and you only need one until you drop it in a loop and suddenly it’s a chainsaw.