The Exchange 2007 Wiki

Text Patterns or Regular Expressions

Regular expressions are used by any transport rule condition that supports "Text Patterns."  They allow powerful text matching, for example Social Security numbers in the format ###-##-####, or phrases that contain certain words.

See this MS TechNet article on this topic for more details.

For a tutorial on regular expressions see http://www.regular-expressions.info/tutorial.html.  Note that Exchange 2007's RegEx support is limited to the following pattern strings (from MS TechNet):

Pattern string Description

\S

The \S pattern string matches any single character that is not a space.

\s

The \s pattern string matches any single white-space character.

\D

The \D pattern string matches any single character that is not a numeric digit.

\d

The \d pattern string matches any single numeric digit.

\w

The \w pattern string matches any single letter (a-z, A-Z), any digit (0-9), or any other Unicode character categorized as a letter or digit.

|

The pipe ( | ) character performs an OR function.

*

The wildcard ( * ) character matches zero or more instances of the previous character. For example, ab*c matches the following strings: ac, abc, abbbbc.

( )

Parentheses act as grouping delimiters. For example, a(bc)* matches the following strings: a, abc, abcbc, abcbcbc, and so on.

\\

Two backslashes indicate that the character that follows the backslashes should be escaped. For example, if you want to match a string that contains \d, you would type \\d.

^

The caret ( ^ ) character indicates that the pattern string that follows the caret must exist at the start of the text string that is being matched. For example, ^fred@contoso matches fred@contoso.com and fred@contoso.co.uk but not alfred@contoso.com.

This character can also be used with the dollar ( $ ) character to specify an exact string to match. For example, ^kim@contoso.com$ matches only kim@contoso.com and does not match anything else, such as kim@contoso.com.au.

$

The dollar ( $ ) character indicates that the preceding pattern string must exist at the end of the text string that is being matched. For example, contoso.com$ matches adam@contoso.com and kim@research.contoso.com, but does not match kim@contoso.com.au.

This character can also be used with the caret ( ^ ) character to specify an exact string to match. For example, ^kim@contoso.com$ matches only kim@contoso.com and does not match anything else, such as chris@sales.contoso.com.
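
The documented examples in the table above can be checked with a short script. This is a minimal sketch in Python, whose re module is not the Exchange 2007 engine; it is used only because the operators shown behave the same way for these cases. The helper name hits is made up for illustration, re.IGNORECASE is an assumption meant to imitate Exchange's case-insensitive matching, and the period is written as \. because an unescaped period in Python matches any character.

```python
import re

def hits(pattern, candidates):
    """Hypothetical helper: return the candidates in which the pattern is found."""
    # re.IGNORECASE approximates Exchange 2007's case-insensitive matching.
    return [c for c in candidates if re.search(pattern, c, re.IGNORECASE)]

# * : zero or more of the previous character (anchored so the whole string must match).
star = hits(r"^ab*c$", ["ac", "abc", "abbbbc", "abd"])

# ( ) : grouping, so * applies to the whole "bc" group.
group = hits(r"^a(bc)*$", ["a", "abc", "abcbc", "ab"])

# ^ : the pattern must sit at the start of the string.
start = hits(r"^fred@contoso",
             ["fred@contoso.com", "fred@contoso.co.uk", "alfred@contoso.com"])

# $ : the pattern must sit at the end of the string.
end = hits(r"contoso\.com$",
           ["adam@contoso.com", "kim@research.contoso.com", "kim@contoso.com.au"])

# ^ and $ together: only the exact string matches.
exact = hits(r"^kim@contoso\.com$",
             ["kim@contoso.com", "kim@contoso.com.au", "chris@sales.contoso.com"])

for name, result in [("star", star), ("group", group), ("start", start),
                     ("end", end), ("exact", exact)]:
    print(name, result)
```

Each list that is printed contains only the strings the table says should match, which makes it easy to spot a pattern that is too loose before it goes into a rule.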

 

To test your expressions see: http://www.javaregex.com/test.html (Note that this tester is case sensitive, whereas Exchange 2007 is not.  It also supports more regular expression syntax than Exchange 2007 does.)
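
A few lines of Python can also stand in for a quick local check. This is only a sketch: Python's re module accepts far more syntax than Exchange 2007 does, so a pattern that works here may still be rejected or behave differently in a transport rule. The function name e2k7_like_search is invented for this example, and re.IGNORECASE is an assumption used to imitate the case-insensitive matching of Exchange 2007.

```python
import re

def e2k7_like_search(pattern, text):
    """Hypothetical helper: case-insensitive search anywhere in the text."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# Case-insensitive, as Exchange 2007 is described to behave.
lower_case_hit = e2k7_like_search(r"income", "Monthly income report")
upper_case_hit = e2k7_like_search(r"income", "MONTHLY INCOME REPORT")

# A case-sensitive tester would treat the upper-case text differently.
case_sensitive_hit = re.search(r"income", "MONTHLY INCOME REPORT") is not None

print(lower_case_hit, upper_case_hit, case_sensitive_hit)
```

The difference between the last two results is the reason a pattern can appear to fail in a case-sensitive tester and still fire in a transport rule.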

Examples:

  1. Social Security numbers
    • \d\d\d-\d\d-\d\d\d\d
    • In the above rule \d matches any single digit, so any number in the ###-##-#### format will match the rule
  2. US Telephone #
    • (\\()*\d\d\d(\\)|\s|.)\d\d\d(-|.)\d\d\d\d
    • This pattern matches three different phone number formats: ### ###-####, ###.###.####, and (###) ###-####
    • Here is a breakdown of this example: (From MS TechNet)
      • (\\()*   This portion makes the opening parenthesis optional. Because the opening parenthesis is also a regular expression grouping delimiter, it must be escaped by using two backslashes \\. The surrounding parentheses group the \\( characters together so that the wildcard character * can act upon the \\( characters to make them optional.
      • \d\d\d   This portion requires that exactly three numeric digits appear next.
      • (\\)|\s|.)   This portion requires that a closing parenthesis, a space, or a period exist after the three-digit number. Each character-matching string is contained in the grouping delimiters and is separated by the pipe character. This means that only one of the specified characters inside the grouping delimiters can exist in this location in the string that is being matched.
        • \\) = closing parenthesis
        • | = Or
        • \s = space
        • | = Or
        • . = period
      • \d\d\d   This portion requires that exactly three numeric digits appear next.
      • (-|.)   This portion requires that either a hyphen or period exists after the three-digit number. Because the hyphen and period exist in the grouping delimiters, only one of the two characters can exist in this location in the string that is being matched.
      • \d\d\d\d   This portion requires that exactly four numeric digits appear next.
  3. Matching attachments whose names end with four or more digits followed by .pdf or .zip
    • (\d\d\d\d.pdf)|(\d\d\d\d.zip)
    • This will search for "####.pdf" or "####.zip", which is a common format used in the recent PDF and ZIP attachment spam.  Most of these have an attachment that has something like "new account-454323262.pdf",  "agreement-983423423.pdf", "investor_report-2342343243.pdf", "market_watch-31242342.pdf", "new account-42354352.zip" etc
    • The above expression will match all of these since each contains at least four digits followed by ".pdf" or ".zip".
    • Be careful with this one since it would also match Report-070708.zip
    • To prevent false matches I would set up this expression to search both the subject, using the "when the Subject field contains text patterns" rule, and the attachment file name, using the "when any attachment file name contains text patterns" rule.  This is because most of the "PDF spam" also uses the file name in the subject.
  4. Matching attachments and subjects that contain certain keywords
    • (quotes*)|(Income*)|(analyst*)|(financial*)
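    • This will search for the words above, even when other text comes before or after them.  Note that * applies only to the preceding character, so quotes* matches "quote", "quotes", "quotess", and so on.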
  5. Matching attachments, or any string, that contains a certain keyword and ends with a known value
    • (urgent info.*pdf$)|(paid-.*pdf$)
    • In English, the first alternative matches any text that contains "urgent info", followed by any additional text (.*), and ends in "pdf" ($).  The second alternative does the same for text that contains "paid-" and ends in "pdf".
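
The five examples above can be tried against sample text before they are put into a rule. The following is a rough sketch in Python, not a reproduction of the Exchange 2007 engine. The translations are assumptions: the Exchange escape \\( is written as \( in Python, and a period that is meant literally is written as \. because Python treats an unescaped period as "any character". The names PATTERNS and rule_matches, and all of the sample strings, are hypothetical.

```python
import re

# Python translations of the Exchange 2007 patterns from the examples above.
# Assumption: "\\(" in Exchange corresponds to "\(" here, and "\." is a literal period.
PATTERNS = {
    "ssn": r"\d\d\d-\d\d-\d\d\d\d",
    "phone": r"(\()*\d\d\d(\)|\s|\.)\d\d\d(-|\.)\d\d\d\d",
    "numbered_attachment": r"(\d\d\d\d\.pdf)|(\d\d\d\d\.zip)",
    "keywords": r"(quotes*)|(Income*)|(analyst*)|(financial*)",
    "keyword_and_ending": r"(urgent info.*pdf$)|(paid-.*pdf$)",
}

def rule_matches(name, text):
    """Hypothetical helper: True if the named pattern is found anywhere in text."""
    # re.IGNORECASE imitates the case-insensitive matching of Exchange 2007.
    return re.search(PATTERNS[name], text, re.IGNORECASE) is not None

samples = [
    ("ssn", "SSN 123-45-6789"),
    ("ssn", "12-345-6789"),
    ("phone", "555 555-1234"),
    ("phone", "555.555.1234"),
    # In Python the separator group consumes exactly one character,
    # so the parenthesized form is tested without a space here.
    ("phone", "(555)555-1234"),
    ("phone", "5555551234"),
    ("numbered_attachment", "new account-454323262.pdf"),
    ("numbered_attachment", "Report-070708.zip"),   # the false positive noted above
    ("numbered_attachment", "agenda.pdf"),
    ("keywords", "quote of the day"),
    ("keywords", "hello world"),
    ("keyword_and_ending", "URGENT INFO inside.pdf"),
    ("keyword_and_ending", "paid-invoice.pdf"),
    ("keyword_and_ending", "urgent info.pdf.exe"),
]

for name, text in samples:
    print(f"{name:20} {text!r:30} {rule_matches(name, text)}")
```

Running a list of known-good and known-bad file names or subjects through a table like this is a cheap way to find false positives, such as Report-070708.zip, before a rule starts acting on real mail.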

For an export of rules see: TP-Rules.xml.  These rules were created by Jason Sherry and mainly focus on blocking spam.

Last Modified 8/17/07 7:11 AM