There will be a day when you are working on a real-world data set that is not formatted properly. What do you do in this situation?

a. Panic, because you never had to deal with untidy data during your school studies.

b. Rage quit because you tried to clean the data either manually or using Excel tools like Find and Replace.

c. Use regular expressions to help clean your data.

What Are Regular Expressions?

Regular expressions, also known as Regex, are a way to find patterns in text. Let’s say that you are reading a PDF document about cats and you are trying to find information about Siamese cats. The PDF document does not have a table of contents. Would you just read the entire document until you stumble upon the chapter about Siamese cats? No. You would just use the Ctrl+F command to find the word Siamese. That kind of search looks for one exact piece of text; Regex extends the same idea to patterns that can describe many possible pieces of text.
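As a small illustration of the difference between searching for exact text and searching for a pattern, here is a minimal sketch using Python's built-in re module. The sample sentence and the variable names are made up for this example.

```python
import re

# Hypothetical sample text, not taken from any real document
text = "Chapter 4 covers Siamese cats. Chapter 12 covers Persian cats."

# Exact-text search: the same idea as Ctrl+F
found_siamese = "Siamese" in text
print(found_siamese)  # True

# Pattern search: find every chapter number, whatever its digits are
chapter_numbers = re.findall(r"Chapter (\d+)", text)
print(chapter_numbers)  # ['4', '12']
```

The exact-text search can only answer "is this word here?", while the pattern describes a whole family of strings ("Chapter" followed by some digits).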

The great thing about Regex is that it is supported in almost every programming language, although the exact syntax varies slightly from one to another. Using Regex can be confusing at first, but it is a powerful tool to have in your arsenal.

Basic Regex Patterns

a, X, 9, %: ordinary characters just match themselves exactly.

. ^ $ * + ? { [ ] \ | ( ): special characters that do not match themselves because they have special meanings in a pattern.

\d: decimal digit [0-9]

\t, \n, \r: tab, newline, return

. (a period): matches any single character except newline ‘\n’

\w (lowercase w): matches a “word” character, meaning a letter, digit, or underscore [a-zA-Z0-9_].

\W (uppercase W): matches any non-word character.

^: matches the start of a string, e.g. ^apple.* matches any string that starts with apple, like applesauce

$: matches the end of the string

\: escapes a special character, e.g. \. matches a literal period
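The basic patterns above can be tried out with a short sketch like the following, which uses Python's re module. All of the sample strings and variable names are made up for illustration.

```python
import re

# All sample strings below are made up for illustration.
digits = re.findall(r"\d", "a1b22")          # every decimal digit
word_chars = re.findall(r"\w", "hi_5!")      # letters, digits, underscore
non_word_chars = re.findall(r"\W", "hi_5!")  # everything that is not a word character
any_char = re.findall(r"a.c", "abc a-c ac")  # '.' stands for exactly one character
starts_with = re.search(r"^apple.*", "applesauce")
not_at_start = re.search(r"^apple", "pineapple")
ends_with = re.search(r"cat$", "bobcat")
literal_dots = re.findall(r"\.", "3.14.")    # escaped period matches only '.'

print(digits)               # ['1', '2', '2']
print(word_chars)           # ['h', 'i', '_', '5']
print(non_word_chars)       # ['!']
print(any_char)             # ['abc', 'a-c']
print(starts_with.group())  # applesauce
print(not_at_start)         # None
print(ends_with.group())    # cat
print(literal_dots)         # ['.', '.']
```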

Repetition

You can use quantifiers to specify repetition in the pattern.

  • +: 1 or more occurrences of the pattern
  • *: 0 or more occurrences of the pattern
  • ?: 0 or 1 occurrences of the pattern
  • {number}: an exact number of occurrences of the pattern, e.g. {3} matches exactly 3 times
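To make the quantifiers concrete, here is a minimal sketch in Python. The helper fully_matches is a hypothetical name introduced only for this example; it uses re.fullmatch so that the whole string has to fit the pattern.

```python
import re

def fully_matches(pattern, text):
    """Hypothetical helper: True if the entire string fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# + : one or more 'b' after the 'a'
print(fully_matches(r"ab+", "a"))       # False
print(fully_matches(r"ab+", "abbb"))    # True

# * : zero or more 'b' after the 'a'
print(fully_matches(r"ab*", "a"))       # True
print(fully_matches(r"ab*", "abbb"))    # True

# ? : the 'u' is optional (zero or one)
print(fully_matches(r"colou?r", "color"))    # True
print(fully_matches(r"colou?r", "colouur"))  # False

# {3} : exactly three digits
print(fully_matches(r"\d{3}", "123"))   # True
print(fully_matches(r"\d{3}", "1234"))  # False
```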

Find a Phone Number

Please design regular expression code for a phone number and explain how you came up with this design and verified it works.

Let’s look at the pattern of a phone number such as 312-555-5555.

The pattern of a phone number is:

digit digit digit dash digit digit digit dash digit digit digit digit

To create a regex pattern, we can simply use:

\d+-\d+-\d+

\d+ <- Match a digit at least once, and as many times as possible

- <- Match the dash character

The pattern above will work, but it has some flaws. We know that phone numbers in the United States have ten digits, including the area code. If the text contains numbers that fit this pattern but have more or fewer than ten digits, our regular expression will return them as well. To make our regular expression correct, we have to spell out exactly how many digits appear in each group:

\d\d\d-\d\d\d-\d\d\d\d

Or we can simplify this using quantifiers:

\d{3}-\d{3}-\d{4}
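One way to document how the pattern is put together is Python's re.VERBOSE flag, which allows whitespace and comments inside a pattern. This is only a sketch: the name phone_pattern and the labels in the comments (area code, exchange, line number) are my own additions for illustration.

```python
import re

# Same pattern as \d{3}-\d{3}-\d{4}, written out piece by piece.
phone_pattern = re.compile(r"""
    \d{3}   # area code: exactly three digits
    -       # a literal dash
    \d{3}   # exchange: exactly three digits
    -       # a literal dash
    \d{4}   # line number: exactly four digits
""", re.VERBOSE)

is_valid = phone_pattern.fullmatch("312-555-5555") is not None
is_too_short = phone_pattern.fullmatch("44-55-442") is not None

print(is_valid)      # True
print(is_too_short)  # False
```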

I created a Python script to verify that my code works. You can run it on an .

import re

# Sample text with two valid phone numbers and one malformed number
text = 'purple  monkey dishwasher 312-444-5555 44-55-442 773-888-8888'
matches = re.findall(r'\d{3}-\d{3}-\d{4}', text)
for match in matches:
    print(match)
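Building on the script above, the following sketch runs all three patterns from this section against the same sample string, so you can see the flaw in the first one and confirm that the two stricter versions agree. The variable names are chosen just for this comparison.

```python
import re

text = 'purple  monkey dishwasher 312-444-5555 44-55-442 773-888-8888'

# Loose pattern: any run of digits between the dashes
loose = re.findall(r'\d+-\d+-\d+', text)

# Strict patterns: exactly 3, 3, and 4 digits
spelled_out = re.findall(r'\d\d\d-\d\d\d-\d\d\d\d', text)
strict = re.findall(r'\d{3}-\d{3}-\d{4}', text)

print(loose)                  # ['312-444-5555', '44-55-442', '773-888-8888']
print(strict)                 # ['312-444-5555', '773-888-8888']
print(strict == spelled_out)  # True
```

The loose pattern also returns 44-55-442, which is not a valid phone number, while both strict patterns skip it.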

Conclusion

This is just the tip of the iceberg of what regular expressions can do. Here are some more resources to learn about regular expressions.

Full Stack Developer | Aspiring Data Scientist | Northwestern Coding Bootcamp Student | Udacity Scholar | Foodie
