Simple Regex tutorial

lllllIllIlllI

Regex is one of the more complicated modules that you can use in Python. Once you have learnt it, though, you can use it in many different programming languages, so it's a useful tool for working with strings.

So first to use regex you must import it

import re

This loads the module for us to use.

The re module is designed to make strings easy to search and manipulate, and it is often used to check that input has the correct format.

For example

r = raw_input("Please enter an email address")

But how do you know, without complicated checking, that they have entered something in the right format of something@something.com? To check this by hand we would need to find the index of the '@' symbol, make sure the address ends with the right text (.com), and confirm that it is all in the right order.
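To give a feel for that manual checking, here is a rough sketch of what it could look like without regex. The function name and the exact rules (one '@', only letters, digits and dots, a .com ending) are assumptions made for illustration, not a complete email check.

```python
def looks_like_com_email_manual(address):
    """Hand-rolled check for the something@something.com shape (hypothetical helper)."""
    allowed = set("abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789.")
    # There must be exactly one '@' sign.
    if address.count("@") != 1:
        return False
    at = address.index("@")
    local, domain = address[:at], address[at + 1:]
    # The address must end in '.com'.
    if not domain.endswith(".com"):
        return False
    host = domain[:-len(".com")]
    # Both sides of the '@' need at least one character.
    if not local or not host:
        return False
    # Every remaining character must be a letter, a digit or a dot.
    return all(ch in allowed for ch in local + host)


print(looks_like_com_email_manual("bogusemail123@sillymail.com"))  # True
print(looks_like_com_email_manual("@sillymail.com"))               # False
print(looks_like_com_email_manual("bogusemail123@sillymail"))      # False
```

That is a fair amount of code for a simple format check, which is the motivation for the pattern-based approach.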

But with regex we can work this out in one line... that is, after working out the regex string.

So let's start on the email.
First we have to understand what an email address needs in it:

  1. A beginning (xxxx@mail.com)
  2. The '@' sign
  3. A domain (mail@xxx.com)
  4. A .com ending (we are not going to handle .org or anything else)

So let's start (please see below for an explanation of the symbols)

import re
# Let's make a fake email...
email = 'bogusemail123@sillymail.com'

# To make a re pattern we use a string. The r prefix makes it a raw string,
# and \. matches a literal dot (a bare . would match any character).
pattern = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com$"

# Now we check if it matches
re.findall(pattern, email)
# Yes! It does
# It returns ['bogusemail123@sillymail.com']

# Let's try some other addresses
re.findall(pattern, "@sillymail.com")
# returns []
re.findall(pattern, "bogusemail123@sillymail")
# returns []

So this is a relatively simple example, but you can easily see how it can save you time when checking that a user has entered the correct things, as well as when searching for things in a string.

Now to explain what "^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com$" means (note the backslash before the last dot, which is explained in the final point)

  • ^ --> means that the pattern starts at the start of the string. This means that "Hello bogusmail123@sillymail.com" will not match.
  • [A-Za-z0-9.] --> This is called a character class (a set of allowed characters, written here with ranges). Any single character inside it will match: any letter A-Z or a-z, any digit 0-9, or a dot. This means that you do not get emails with other forms of punctuation in them.
  • + --> This does not mean plus, or anything like that. Rather, it means that whatever came before it needs to be in the string one or more times. In this case the thing before it was our character class, so we need at least one letter/number/dot for the string to match.
  • @ --> To match a character exactly, you just put that character in the pattern at the place it is meant to be.
  • [A-Za-z0-9.]+ --> Just another character class like we had before, with a '+' sign to mean it needs one or more characters from the class.
  • \.com$ --> Then we put in exactly what we want at the end of the email address ('.com') and make sure it is at the end of the string with the dollar symbol. Outside a character class a bare dot matches any character, so it is escaped as \. to match a real dot; written as .com$ the pattern would also accept "name@sillymailxcom".
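The individual symbols can also be tried out one at a time. The snippet below is a small sketch with made-up test strings; it shows the effect of the ^ anchor, the + quantifier, and the difference between a bare dot and an escaped one.

```python
import re

# ^ anchors the match to the start of the string.
unanchored = re.findall(r"[A-Za-z0-9.]+@", "Hello bogusmail123@")
anchored = re.findall(r"^[A-Za-z0-9.]+@", "Hello bogusmail123@")
print(unanchored)  # ['bogusmail123@']
print(anchored)    # []

# + requires one or more characters from the class.
one_or_more = re.findall(r"^[A-Za-z0-9.]+$", "abc.123")
nothing = re.findall(r"^[A-Za-z0-9.]+$", "")
print(one_or_more)  # ['abc.123']
print(nothing)      # []

# A bare dot matches any character; \. matches only a real dot.
any_char = re.findall(r"^.com$", "xcom")
literal_dot = re.findall(r"^\.com$", "xcom")
print(any_char)     # ['xcom']
print(literal_dot)  # []
```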

Then to check that our string matches we use re.findall(pattern, string). That lists all of the substrings that match; in our case it should come back with either a list containing one email address, or an empty list if the input was incorrect.
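Because an empty list counts as false in Python, the result of re.findall can be turned into a simple yes/no answer. A minimal sketch, assuming a helper name (is_com_email) that is made up for this example and a pattern with the last dot escaped:

```python
import re

PATTERN = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com$"


def is_com_email(text):
    """Return True if text has the something@something.com shape (hypothetical helper)."""
    # findall gives [] (falsy) on failure and a one-item list (truthy) on success.
    return bool(re.findall(PATTERN, text))


for candidate in ["bogusemail123@sillymail.com",
                  "@sillymail.com",
                  "bogusemail123@sillymail"]:
    print("%s -> %s" % (candidate, is_com_email(candidate)))
```

A helper like this could then wrap the input prompt, asking again until it returns True.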

This will not match all email addresses; it's just a simple example designed to show people the possibilities of the re module.

If you want to extend yourself in this, try making it accept .org/.net/.com.au etc.
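One possible way to approach that exercise is alternation inside a non-capturing group. This is only a sketch: the list of endings and the function name are assumptions for illustration, and re.escape is used so the dot in "com.au" is treated as a literal dot.

```python
import re

# Hypothetical list of accepted endings; extend it as needed.
ENDINGS = ["com", "org", "net", "com.au"]

# Build "com|org|net|com\.au" and wrap it in a non-capturing group (?:...)
# so that findall still returns the whole address.
suffix = "|".join(re.escape(ending) for ending in ENDINGS)
PATTERN = r"^[A-Za-z0-9.]+@[A-Za-z0-9.]+\.(?:" + suffix + r")$"


def has_known_ending(address):
    """Return True if the address matches one of the accepted endings."""
    return bool(re.findall(PATTERN, address))


print(has_known_ending("someone@example.org"))     # True
print(has_known_ending("someone@example.com.au"))  # True
print(has_known_ending("someone@example.xyz"))     # False
```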

Hope you enjoyed the tutorial and learnt something :)

scru 909 Posting Virtuoso Featured Poster

Great tutorial, I have some comments.

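In Python's re module, you can omit the ^ from your regular expression and use re.match, which always starts matching at the beginning of the string (as long as you don't need newlines to be taken into account). Be aware that re.match does not check that the match reaches the end of the string, so to match the string from start to finish you still need the $, or re.fullmatch in Python 3.4 and later. Without the anchors, the same pattern can be used with re.search to find an address inside a longer string, or with re.match to test the start of the string.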
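A quick way to see how match, search and fullmatch differ is to try them on the same text. This is only a sketch: the sample strings are made up, the dot before com is escaped, and fullmatch needs Python 3.4 or later.

```python
import re

rexp = re.compile(r"[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com")

address = "bogusemail123@sillymail.com"
trailing = address + " and more"
leading = "Write to " + address

# match() only anchors at the start, so trailing text is ignored.
print(bool(rexp.match(trailing)))      # True
# match() fails when the address is not at the very start...
print(bool(rexp.match(leading)))       # False
# ...but search() scans the whole string.
print(bool(rexp.search(leading)))      # True
# fullmatch() (Python 3.4+) requires the entire string to match.
print(bool(rexp.fullmatch(trailing)))  # False
print(bool(rexp.fullmatch(address)))   # True
```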

Also, if you need to use a pattern more than once, you may as well compile the pattern into a regular expression object.

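import re

email = 'bogusemail123@sillymail.com'

# \. matches a literal dot; a bare . would match any character
pattern = r"[A-Za-z0-9.]+@[A-Za-z0-9.]+\.com"
rexp = re.compile(pattern)  # compile once, reuse many times

# match() returns a match object or None, so bool() gives True/False
print(bool(rexp.match(email)))
#True

print(bool(rexp.match("@sillymail.com")))
#False

print(bool(rexp.match("bogusemail123@sillymail")))
#False

lllllIllIlllI 178 Veteran Poster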

Ah yeah, forgot about compiling them. I'm glad you liked it though :)

bunkus 0 Newbie Poster

Here is some more sophisticated code that handles more cases than just *@*.com. It uses no regexp but still does its job quite well when checking for valid email addresses. If you have any improvements, or know of any cases in which this code does not work, feel free to comment. Originally this code was inspired by
http://commandline.org.uk/python/email-syntax-check/

and by
http://code.activestate.com/recipes/65215/

def validateEmail(email):
    """Checks for a syntactically valid email address."""

    # Email address must be at least 6 characters in total.
    if len(email) < 6:
        return False # Address too short.

    # Split up email address into parts.
    try:
        localpart, domainname = email.rsplit('@', 1)
        host, toplevel = domainname.rsplit('.', 1)
    except ValueError:
        return False # Address does not have enough parts.

    # startswith/endswith are safe even when the localpart is empty.
    if localpart.startswith(".") or localpart.endswith("."):
        return False # Dots at the beginning or end of the localpart

    # Check the length of the top-level domain.
    if len(toplevel) < 2 or len(toplevel) > 6:
        return False # Not a domain name.

    # Strip the allowed punctuation, then require letters and digits only.
    for i in '-_.%+':
        localpart = localpart.replace(i, "")
    for i in '-_.':
        host = host.replace(i, "")

    # isalnum() is False for an empty string, so empty parts are rejected too.
    return localpart.isalnum() and host.isalnum()

if __name__ == '__main__':
    # A few sample addresses, valid and invalid, to try the function on.
    for email in ("test.@web.com", "test+john@web.museum", "test+john@web.m",
                  "a@n.dk", "and.bun@webben.de"):
        print("%s is valid: %s" % (email, validateEmail(email)))