Regular Expressions for String Pattern Matching in Python
Regular expressions (regex) are powerful tools used for searching and manipulating strings that match a particular pattern. They allow you to perform tasks such as searching, replacing, splitting, or validating strings based on a given pattern. Python's `re` module provides functions for working with regular expressions.
1. Basic Concepts of Regular Expressions
A regular expression is a sequence of characters that defines a search pattern. It can include:
- Literal characters: Such as letters or numbers.
- Metacharacters: Special characters that control the search pattern (e.g., `.` (any character), `*` (zero or more repetitions)).
- Character classes: Groups of characters that match specific sets (e.g., `[0-9]` for digits).
- Anchors: Characters that specify where a match occurs (e.g., `^` for the start of a string, `$` for the end).
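As a minimal sketch of the four building blocks listed above, the snippet below uses `re.search()` with one small pattern per category. The sample strings and variable names are made up for illustration.

```python
import re

# Literal characters match themselves.
literal = re.search(r"cat", "concatenate")

# The metacharacter "." matches any single character.
meta = re.search(r"c.t", "a cut above")

# The character class [0-9] matches one digit.
digit = re.search(r"[0-9]", "room 7")

# The anchors ^ and $ tie the match to the start and end of the string.
anchored = re.search(r"^abc$", "abc")

print(literal.group(), meta.group(), digit.group(), anchored.group())
# prints: cat cut 7 abc
```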
2. Common Regular Expression Syntax
Here are some commonly used regex patterns:
- `.` (Dot): Matches any single character except a newline.
  - Example: `a.b` matches `aab`, `acb`, etc., but not `ab`.
- `^` (Caret): Anchors the match at the beginning of the string.
  - Example: `^abc` matches strings that start with "abc".
- `$` (Dollar Sign): Anchors the match at the end of the string.
  - Example: `abc$` matches strings that end with "abc".
- `*` (Asterisk): Matches zero or more repetitions of the preceding element.
  - Example: `ab*c` matches `ac`, `abc`, `abbc`, etc.
- `+` (Plus Sign): Matches one or more repetitions of the preceding element.
  - Example: `ab+c` matches `abc`, `abbc`, etc., but not `ac`.
- `?` (Question Mark): Matches zero or one occurrence of the preceding element.
  - Example: `ab?c` matches `abc` or `ac`.
- `{n}`: Matches exactly `n` repetitions of the preceding element.
  - Example: `a{3}` matches `aaa`.
- `{n,m}`: Matches between `n` and `m` repetitions (inclusive) of the preceding element. Write it without a space after the comma.
  - Example: `a{2,4}` matches `aa`, `aaa`, or `aaaa`.
- `[]` (Square Brackets): Matches any one character within the brackets.
  - Example: `[a-z]` matches any lowercase letter.
- `|` (Pipe): Acts as a logical OR between two patterns.
  - Example: `abc|def` matches either `abc` or `def`.
- `\` (Backslash): Escapes a metacharacter so that it is matched literally. It also introduces special sequences such as `\d`.
  - Example: `\.` matches a literal period (`.`), and `\\` matches a literal backslash.
- `()` (Parentheses): Groups expressions together and captures the matched text for backreferencing.
  - Example: `(abc)+` matches `abc`, `abcabc`, etc.
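The examples in this list can be checked mechanically. The sketch below uses `re.fullmatch()`, which succeeds only when the whole string matches the pattern, to confirm each claim. The `cases` table and the `check` helper are hypothetical names introduced for this illustration, and the "should not match" strings beyond those in the list are assumptions chosen to make the contrast visible.

```python
import re

# Each entry: pattern -> (strings that should match, strings that should not).
cases = {
    r"a.b": (["aab", "acb"], ["ab"]),
    r"ab*c": (["ac", "abc", "abbc"], ["adc"]),
    r"ab+c": (["abc", "abbc"], ["ac"]),
    r"ab?c": (["abc", "ac"], ["abbc"]),
    r"a{3}": (["aaa"], ["aa"]),
    r"a{2,4}": (["aa", "aaa", "aaaa"], ["a", "aaaaa"]),
    r"[a-z]": (["q"], ["Q", "5"]),
    r"abc|def": (["abc", "def"], ["abd"]),
    r"\.": (["."], ["a"]),
    r"(abc)+": (["abc", "abcabc"], ["ab"]),
}


def check(pattern, should_match, should_fail):
    """Return True if fullmatch accepts every positive and rejects every negative."""
    accepted = all(re.fullmatch(pattern, s) for s in should_match)
    rejected = not any(re.fullmatch(pattern, s) for s in should_fail)
    return accepted and rejected


results = {pattern: check(pattern, yes, no) for pattern, (yes, no) in cases.items()}
print(all(results.values()))  # prints: True
```

`re.fullmatch()` is used here instead of `re.search()` because `re.search()` would also accept strings that merely contain a match, which would hide the difference between, say, `a{3}` and `a{2,4}`.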
3. Commonly Used Character Classes
- `\d`: Matches any digit; for ASCII text this is equivalent to `[0-9]`.
  - Example: `\d{3}` matches any three consecutive digits (e.g., `123`, `456`).
- `\D`: Matches any non-digit.
  - Example: `\D+` matches a sequence of non-digits.
- `\w`: Matches any word character (letters, digits, and the underscore); for ASCII text this is equivalent to `[a-zA-Z0-9_]`.
  - Example: `\w+` matches a word of one or more characters.
- `\W`: Matches any non-word character, that is, anything `\w` does not match.
  - Example: `\W+` matches non-word characters like spaces, punctuation, etc.
- `\s`: Matches any whitespace character (spaces, tabs, newlines).
  - Example: `\s+` matches one or more whitespace characters.
- `\S`: Matches any non-whitespace character.
  - Example: `\S+` matches a sequence of non-whitespace characters.
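To see these classes side by side, here is a small sketch that runs each one over the same made-up sample string. The last two lines illustrate a detail worth knowing: with `str` patterns in Python 3, `\d` also matches non-ASCII decimal digits unless the `re.ASCII` flag is passed.

```python
import re

text = "Order #42 shipped on 2024-01-15\tto Ana_B."

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of word characters
spaces = re.findall(r"\s", text)      # individual whitespace characters
chunks = re.findall(r"\S+", text)     # whitespace-separated chunks

print(digits)  # ['42', '2024', '01', '15']
print(words)   # ['Order', '42', 'shipped', 'on', '2024', '01', '15', 'to', 'Ana_B']
print(chunks)  # ['Order', '#42', 'shipped', 'on', '2024-01-15', 'to', 'Ana_B.']

# "\u0663" is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
unicode_digit = re.fullmatch(r"\d", "\u0663")
ascii_only = re.fullmatch(r"\d", "\u0663", re.ASCII)
print(unicode_digit is not None, ascii_only is None)  # True True
```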
4. Working with Regular Expressions in Python (`re` module)
Python’s `re` module provides functions for searching and manipulating strings using regular expressions. Here's how you can use some of its key functions.
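The module-level functions take the pattern as their first argument on every call. When the same pattern is used repeatedly, one common alternative is to compile it once with `re.compile()` and call the equivalent methods on the resulting pattern object. This is a minimal sketch; the variable names are illustrative.

```python
import re

# Compile once, then reuse the pattern object's methods.
number = re.compile(r"\d+")

text = "I have 2 apples and 3 bananas."
first = number.search(text)          # same as re.search(r"\d+", text)
everything = number.findall(text)    # same as re.findall(r"\d+", text)
replaced = number.sub("many", text)  # same as re.sub(r"\d+", "many", text)

print(first.group())  # 2
print(everything)     # ['2', '3']
print(replaced)       # I have many apples and many bananas.
```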
5. Basic `re` Functions
1. re.match()
This function checks if the regular expression matches the beginning of the string.
import re
pattern = r"^hello"
text = "hello world"
# Check if pattern matches the start of the string
match = re.match(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match")
- Output: `Match found: hello`
- Note: `match()` only checks the beginning of the string.
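A short sketch of what "only the beginning" means in practice: because `re.match()` always starts at position 0, a leading `^` is redundant, and a pattern that occurs later in the string yields `None`. For comparison, `re.fullmatch()` additionally requires the match to reach the end of the string. The sample strings are made up.

```python
import re

pattern = r"hello"  # no leading ^ needed with re.match()

at_start = re.match(pattern, "hello world")   # matches at position 0
later = re.match(pattern, "well, hello")      # None: "hello" is not at the start

whole = re.fullmatch(pattern, "hello")          # the entire string matches
partial = re.fullmatch(pattern, "hello world")  # None: trailing text remains

print(at_start.group(), later, whole.group(), partial)
# prints: hello None hello None
```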
2. re.search()
This function searches the entire string for the first occurrence of the pattern.
import re
pattern = r"world"
text = "hello world"
# Search for the pattern in the text
search = re.search(pattern, text)
if search:
    print("Search found:", search.group())
else:
    print("No match")
- Output: `Search found: world`
- Note: `search()` returns the first match found anywhere in the string.
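The match object returned by `search()` also records where the match was found. The sketch below, using a made-up string that contains the pattern twice, shows that only the first occurrence is reported and how `start()`, `end()`, and `span()` locate it.

```python
import re

text = "goodbye world, hello world"
found = re.search(r"world", text)

print(found.group())                     # world
print(found.span())                      # (8, 13)
print(text[found.start():found.end()])   # world
```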
3. re.findall()
This function finds all occurrences of the pattern in the string and returns them as a list.
import re
pattern = r"\d+" # Matches one or more digits
text = "I have 2 apples and 3 bananas."
# Find all numbers in the string
numbers = re.findall(pattern, text)
print("Found numbers:", numbers)
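- Output: `Found numbers: ['2', '3']`
- Note: `findall()` returns all non-overlapping matches as a list of strings. If the pattern contains capturing groups, it returns the text of the groups instead of the whole match.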
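The shape of the list returned by `findall()` depends on how many capturing groups the pattern has. This sketch reuses the sample sentence from above with three variants of a pattern; the variable names are illustrative.

```python
import re

text = "I have 2 apples and 3 bananas."

# No groups: each item is the whole match.
whole = re.findall(r"\d+ \w+", text)

# One group: each item is the text of that group only.
one_group = re.findall(r"(\d+) \w+", text)

# Two groups: each item is a tuple with one entry per group.
pairs = re.findall(r"(\d+) (\w+)", text)

print(whole)      # ['2 apples', '3 bananas']
print(one_group)  # ['2', '3']
print(pairs)      # [('2', 'apples'), ('3', 'bananas')]
```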
4. re.finditer()
This function returns an iterator yielding match objects for all occurrences of the pattern.
import re
pattern = r"\d+" # Matches one or more digits
text = "I have 2 apples and 3 bananas."
# Find all numbers and iterate over match objects
matches = re.finditer(pattern, text)
for match in matches:
    print(f"Found number: {match.group()}")
- Output:
  Found number: 2
  Found number: 3
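One reason to prefer `finditer()` over `findall()` is that each match object carries position information. A minimal sketch, collecting each number together with its start index in the same sample sentence:

```python
import re

text = "I have 2 apples and 3 bananas."

# Pair each matched number with the index where it starts.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]

print(positions)  # [('2', 7), ('3', 20)]
```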
5. re.sub()
This function replaces occurrences of the pattern with a specified replacement.
import re
pattern = r"\d+" # Matches one or more digits
text = "I have 2 apples and 3 bananas."
# Replace digits with the word "many"
new_text = re.sub(pattern, "many", text)
print("Replaced text:", new_text)
- Output: `Replaced text: I have many apples and many bananas.`
- Note: `sub()` performs a search-and-replace operation.
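Beyond a fixed replacement string, `re.sub()` also accepts a function that computes the replacement from each match object, and a `count` argument that limits how many replacements are made. `re.subn()` additionally reports how many substitutions occurred. In this sketch, `double` is a hypothetical helper written for the example.

```python
import re

text = "I have 2 apples and 3 bananas."


def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)


doubled = re.sub(r"\d+", double, text)
first_only = re.sub(r"\d+", "many", text, count=1)
new_text, replacements = re.subn(r"\d+", "many", text)

print(doubled)       # I have 4 apples and 6 bananas.
print(first_only)    # I have many apples and 3 bananas.
print(replacements)  # 2
```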
6. re.split()
This function splits the string at occurrences of the pattern.
import re
pattern = r"\s+" # Matches one or more whitespace characters
text = "I have 2 apples and 3 bananas."
# Split the text on runs of whitespace
words = re.split(pattern, text)
print("Split words:", words)
- Output: `Split words: ['I', 'have', '2', 'apples', 'and', '3', 'bananas.']`
- Note: `split()` works like `str.split()`, but it uses a regular expression.
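Using a regular expression as the separator pays off when the delimiters vary. The sketch below splits a made-up string on commas, semicolons, or whitespace, limits the number of splits with `maxsplit`, and shows that wrapping the separator in a capturing group keeps the delimiters in the result.

```python
import re

text = "apples, bananas;cherries  dates"

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", text)

# Stop after the first split.
limited = re.split(r"[,;\s]+", text, maxsplit=1)

# A capturing group in the pattern keeps the separators in the output.
with_separators = re.split(r"([,;])", "a,b;c")

print(parts)            # ['apples', 'bananas', 'cherries', 'dates']
print(limited)          # ['apples', 'bananas;cherries  dates']
print(with_separators)  # ['a', ',', 'b', ';', 'c']
```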
6. Regular Expression Groups and Backreferences
You can use parentheses `()` in regular expressions to create groups, which can capture parts of the match. These groups can be referenced later.
Example: Capturing Groups
import re
pattern = r"(\d+)-(\d+)"
text = "My phone number is 123-4567."
# Use parentheses to capture the area code and phone number
match = re.search(pattern, text)
if match:
    print("Area Code:", match.group(1))
    print("Phone Number:", match.group(2))
- Output:
  Area Code: 123
  Phone Number: 4567
- Note: `group(0)` refers to the entire match, `group(1)` refers to the first captured group, and so on.
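Groups can be referenced in three further ways: by name, from inside the pattern itself, and from a replacement string. The following is a minimal sketch of each; the group names `prefix` and `line` and the sample strings are assumptions chosen for illustration.

```python
import re

text = "My phone number is 123-4567."

# Named groups: (?P<name>...) lets you look up a group by name.
named = re.search(r"(?P<prefix>\d+)-(?P<line>\d+)", text)
print(named.group("prefix"), named.group("line"))  # 123 4567
print(named.groupdict())  # {'prefix': '123', 'line': '4567'}

# Backreference in the pattern: \1 must repeat exactly what group 1 matched.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group())  # is is

# Backreference in the replacement: \2 and \1 insert the captured text.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "123-4567")
print(swapped)  # 4567-123
```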
7. Using Flags in Regular Expressions
Flags can be used to modify the behavior of regular expression functions. Common flags include:
- `re.IGNORECASE` (`re.I`): Makes the pattern case-insensitive.
- `re.MULTILINE` (`re.M`): Changes the behavior of `^` and `$` to match the start and end of each line, not just the string.
- `re.DOTALL` (`re.S`): Allows the dot (`.`) to match newlines as well.
Example: Using Flags
import re
pattern = r"^hello"
text = "Hello, World!"
# Case-insensitive match
match = re.match(pattern, text, re.IGNORECASE)
if match:
    print("Match found:", match.group())
else:
    print("No match")
- **Output**: `Match found: Hello`
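The other two flags described above can be illustrated the same way. This sketch uses a made-up two-line string to show the effect of `re.MULTILINE` on `^`, the effect of `re.DOTALL` on `.`, and how flags are combined with the `|` operator.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ matches only at the start of the string.
default_starts = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(default_starts)  # ['first']
print(line_starts)     # ['first', 'second']

# Without DOTALL, "." does not match the newline between the two lines.
no_dotall = re.search(r"line.second", text)
dotall = re.search(r"line.second", text, re.DOTALL)
print(no_dotall)               # None
print(repr(dotall.group()))    # 'line\nsecond'

# Flags can be combined with the bitwise OR operator.
combined = re.findall(r"^SECOND", text, re.IGNORECASE | re.MULTILINE)
print(combined)  # ['second']
```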
---
### 8. **Summary of Regular Expressions in Python**
- Regular expressions allow for powerful pattern matching and manipulation of strings.
- Python’s `re` module provides functions like `match()`, `search()`, `findall()`, `sub()`, and more.
- Metacharacters and character classes provide flexibility for pattern creation.
- Capturing groups and backreferences make it easy to work with specific parts of a match.
- Flags modify how the regular expression behaves, enabling case-insensitive matching, multiline matching, and more.
With these tools, you can perform complex text manipulation efficiently and flexibly in Python.