Regex
Regular Expression in Python and JavaScript
8.30.22
Algorithms. Validation. Sanitization. Web Scraping. Data wrangling. Search Engines. Natural language processing. Syntax highlighting. A common denominator that threads across all of these concepts is the use of Regular Expressions.
I have been mentoring a friend who just graduated from a bootcamp, and I brushed up on Regular Expressions so we could grind some LeetCode. I thought this could be a cool topic for a blog post.
When I first started learning to code, Regular Expressions really confused me. A lot of code, especially Python, can be very readable to an untrained eye. Even if you don’t know what a specific built-in function does, you can usually figure it out by breaking it down, and context helps a lot. But with Regular Expressions, you come across forward and back slashes, brackets, upper and lower case single letters, carets, dashes, question marks, dollar signs, pipes, etc., all jumbled together. To me, it looked like code that only computers can understand. Like binary code but maybe worse.
The truth is, Regular Expressions can be very readable if you understand how they work. They can look intense because they can be complex, but at the core there are some pretty simple rules.
So, what is it? A Regular Expression is a way to search through a string (or text) or, more specifically, a sequence of characters that specifies a search pattern in text. The concept of Regular Expressions, or just Regex (sometimes regexp), dates back to the 1950s and the common syntax dates back to the 1980s. Even though the concept of regex is programming-language agnostic, programming languages do use Regex a little differently. I will show some regex using JavaScript and Python.
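In JavaScript, the basic syntax for regex is to put a pattern inside of two forward slashes followed by an optional flag (modifier): /pattern/flag. Python has no slash syntax; you pass the pattern as a string to the functions in the re module and set flags with arguments like re.IGNORECASE.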
Some common flags (or modifiers) are:
i • perform case-Insensitive matching
g • perform a Global match
m • perform a Multiline matching
Some more you can use:
s • Single line (aka dotAll): allows the dot . to match newline characters.
u • Unicode: treat a pattern as a sequence of Unicode code points.
d • generate indices for substring matches.
y • perform a sticky search that matches starting at the current position in the target string
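To make a few of these flags concrete, here is a minimal Python sketch. The sample string is my own, and the mapping from the JavaScript letters to Python's re constants is the closest equivalent I know of rather than an exact one-to-one match (Python has no g flag, for example, because functions like re.findall() already return every match).

```python
import re

# Sample text made up for this sketch
text = "Method Man\nmethod actor"

# i: case-insensitive matching (re.IGNORECASE, or re.I for short)
case_insensitive = re.findall(r"method", text, flags=re.IGNORECASE)
print(case_insensitive)  # ['Method', 'method']

# g: no Python flag needed, re.findall() already returns every match
all_capital_m = re.findall(r"M", text)
print(all_capital_m)  # ['M', 'M']

# m: multiline, so ^ and $ also match at the start and end of each line
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(line_starts)  # ['Method', 'method']

# s: dotAll, so . also matches the newline between the two lines
without_dotall = re.search(r"Man.method", text) is not None
with_dotall = re.search(r"Man.method", text, flags=re.DOTALL) is not None
print(without_dotall, with_dotall)  # False True
```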
Ok, now onto the patterns. Square brackets [ ] match any single character from a set. Just put the characters or digits you are looking for inside the brackets: [abc] or [123]. You can also use brackets to find ranges: [a-m] or [0-9]. Add a caret to define what you are NOT looking for: [^abc] or [^0-9]. Parentheses ( ) group part of a pattern so it is matched exactly as written, and you can look for multiple alternatives by separating them with a pipe character |: (x|y). Not too bad so far, right?
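Here is a small Python sketch of brackets, ranges, negation, and the pipe. The sample strings are made up for illustration, and I am using re.findall() simply because it returns every match as a list.

```python
import re

# Sample text made up for this sketch
text = "Method Man 1993"

# [aeo] matches any single character listed in the brackets
vowels = re.findall(r"[aeo]", text)
print(vowels)  # ['e', 'o', 'a']

# [0-9] matches any single character in the range
digits = re.findall(r"[0-9]", text)
print(digits)  # ['1', '9', '9', '3']

# [^a-z0-9] matches anything that is NOT a lowercase letter or a digit
others = re.findall(r"[^a-z0-9]", text)
print(others)  # ['M', ' ', 'M', ' ']

# (x|y) matches either alternative
names = re.findall(r"(Man|Woman)", "Method Man and Wonder Woman")
print(names)  # ['Man', 'Woman']
```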
After flags and patterns, we have metacharacters. These give your search some more power. These are the backslashes that I mentioned earlier.
\w • find a word character
\W • find a non-word character
\d • find a digit
\D • find a non-digit character
\s • find a whitespace character
\S • find a non-whitespace character
\b • find a match at the Beginning \bWORD or end WORD\b of a word
\B • find a match, but not at the beginning/end of a word
\n • find a new line character
\t • find a tab character
\uxxxx • find the unicode character specified by the hexadecimal number xxxx
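A quick Python sketch of some of these metacharacters in action. The sample string (with a tab and a newline in it) is my own, and the patterns are written as raw strings (the r prefix) so Python leaves the backslashes alone.

```python
import re

# Sample text made up for this sketch: contains a space, a tab, and a newline
text = "Method Man\t1993\nWu-Tang"

# \w matches word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)
print(words)  # ['Method', 'Man', '1993', 'Wu', 'Tang']

# \d matches digits
numbers = re.findall(r"\d+", text)
print(numbers)  # ['1993']

# \s matches whitespace, including \t and \n
spaces = re.findall(r"\s", text)
print(spaces)  # [' ', '\t', '\n']

# \W matches anything that is not a word character
non_words = re.findall(r"\W", text)
print(non_words)  # [' ', '\t', '\n', '-']

# \b anchors the match to the beginning of a word
m_words = re.findall(r"\bM\w*", text)
print(m_words)  # ['Method', 'Man']
```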
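Finally, we have quantifiers, which, as the name suggests, define quantities. The list below also includes two anchors (^ and $) and a lookahead, which match positions rather than quantities.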
n+ • Matches one or more occurrences of n
n* • Matches zero or more occurrences of n
n? • Matches zero or one occurrence of n
n$ • Matches n at the end of the string
^n • Matches n at the beginning of the string
(?=n) • Lookahead: matches a position that is followed by n, without including n in the match
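To see the difference between these, here is a minimal Python sketch. The test strings are made up for this example, and the lookahead is written with its parentheses, (?=...), which is the form Python expects.

```python
import re

# Sample text made up for this sketch: zero, one, and two o's between g and l
text = "gl gol gool"

# o+ needs at least one o
one_or_more = re.findall(r"go+l", text)
print(one_or_more)  # ['gol', 'gool']

# o* allows zero or more o's
zero_or_more = re.findall(r"go*l", text)
print(zero_or_more)  # ['gl', 'gol', 'gool']

# o? allows zero or one o
zero_or_one = re.findall(r"go?l", text)
print(zero_or_one)  # ['gl', 'gol']

name = "Method Man"

# ^ anchors to the beginning of the string, $ to the end
starts_with_method = re.search(r"^Method", name) is not None
ends_with_man = re.search(r"Man$", name) is not None
print(starts_with_method, ends_with_man)  # True True

# (?= Man) checks what follows without including it in the match
before_man = re.findall(r"\w+(?= Man)", name)
print(before_man)  # ['Method']
```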
Ok cool, now that those definitions are out of the way, we can see it in action.
In Python, you have access to the built-in re module. After you import it, you have access to several functions that work with regular expressions. One of those functions is re.sub(), which replaces the matches with the text of your choice. For this example and the following JavaScript version, know that the text variable is the string Method Man.
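A minimal sketch of what that can look like is below. Only the text variable and its value come from this post; the search term (Method), the replacement (Iron), and the new_text variable name are placeholders I picked for illustration.

```python
import re

text = "Method Man"

# Hypothetical search term and replacement, chosen only for this sketch.
# The parentheses group the exact characters we are looking for.
new_text = re.sub(r"(Method)", "Iron", text)

# f-string to build the output from both variables
message = f"{text} becomes {new_text}"
print(message)  # Method Man becomes Iron Man
```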
In the above Python example, we are importing the built-in re module and then assigning the result of the re.sub() function to a variable, passing in the search term, what we want to replace it with, and the text to search. For the search, I am simply using parentheses to find exactly what I am looking for. Then we use an f-string to print out a new string using the new variable.
The above is the JavaScript version of the Python example. In JavaScript, you don’t need to import anything, and the built-in string method replace() works much like Python’s re.sub(), except that without the g flag it only replaces the first match. In the JavaScript version, we are again defining a new variable, and inside the replace() method we are using regex to find the string, with the i flag making the search case insensitive. The JavaScript equivalent of f-strings is template literals (backticks), which let us pass our variables into a string.
There is a cool website where you can practice regex or see it in action before you implement a pattern into your code. Below is from regexr.com.
So, in that example we have our regex pattern inside our forward slashes with the g global flag on it, which makes it find every match in the string instead of stopping at the first one. The pattern is searching for all non-word characters, plus underscores, and the matches are highlighted.
In JavaScript, we can use that regex pattern with the replace() method to manipulate the string. After we use the regex pattern, we can replace everything we found with an empty string (nothing) to return just the letters and digits. This is a technique known as sanitizing a string.
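The same idea can be sketched in Python with re.sub(). The pattern [\W_] and the sample string are my assumptions here, since they are one common way to say "non-word characters plus underscores" and may not be the exact pattern used on regexr.

```python
import re

# Sample text made up for this sketch
raw = "Method_Man!! (Wu-Tang) #1993"

# [\W_] matches any non-word character or an underscore;
# replacing each match with "" leaves only letters and digits
clean = re.sub(r"[\W_]", "", raw)
print(clean)  # MethodManWuTang1993
```

No flag is needed on the Python side because re.sub() replaces every match by default.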
These are some pretty simple examples. Again, these regex patterns can become very complex very quickly, but as you can probably tell, they can also be pretty powerful. If this is the first time you are reading about regex, go check out how your programming language of choice handles these patterns. Regex is definitely an essential tool to have in your belt and something that does not have to be intimidating.
Adam