Contiem Connect

Using Regular Expressions

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with simple wildcard notations such as *.txt in Windows Explorer.

In HelpStudio, you can use regular expressions in the Project Find and Replace dialog and in the Topic Editor Find and Replace dialog.

You can do much more with regular expressions than you can with simple wildcards. For example, you could use the regular expression \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b to search for almost any email address. Because the character classes list only uppercase letters, the pattern assumes a case-insensitive search.
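As a rough illustration outside HelpStudio, the email pattern above can be tried in Python's `re` module. The sample text and the name `EMAIL` are made up for this sketch, and Python's regex flavor may differ in small ways from the one HelpStudio uses.

```python
import re

# Pattern copied from the text. re.IGNORECASE is required because the
# character classes only list upper-case letters.
EMAIL = re.compile(r"\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Hypothetical sample text, used only for this demonstration.
text = "Contact jane.doe@example.com or support@example.org for help."

found = EMAIL.findall(text)
print(found)  # ['jane.doe@example.com', 'support@example.org']
```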

The power of regular expressions comes with a learning curve - regular expressions can be difficult to master and complex to read. But in terms of search (and replace) functionality, they are difficult to beat and can turn simple Find and Replace functionality into a powerful content manipulation tool.

This topic contains just a few simple examples of regular expressions and is intended only as an introduction. More resources on regular expressions, including tutorials, can be found through a simple web search on the term "Regular Expression".

Literal Characters

The simplest regular expressions contain just literal characters, e.g. a. A regular expression containing literal characters will match the first occurrence of the specified character or characters.

There are 11 characters (metacharacters) with special meanings in a regular expression - [ \ ^ $ . | ? * + ( ). If you want to use any of these characters as a literal in a regex, escape them with a backslash. For example, to match the literal string Does 1 + 1 = 2? the correct regex would be Does 1 \+ 1 = 2\?.
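The effect of escaping can be checked with a short Python sketch. It compares the hand-escaped pattern from the text with the unescaped string, and also shows `re.escape`, a Python helper (not a HelpStudio feature) that escapes metacharacters automatically.

```python
import re

literal = "Does 1 + 1 = 2?"

# Hand-escaped pattern from the text: + and ? are treated as literals.
escaped_by_hand = r"Does 1 \+ 1 = 2\?"
hand_match = re.search(escaped_by_hand, literal)

# Unescaped, + and ? act as quantifiers, so the literal string is not found.
unescaped_match = re.search(literal, literal)

# re.escape builds an escaped pattern from arbitrary literal text.
auto_match = re.fullmatch(re.escape(literal), literal)

print(hand_match is not None)   # True
print(unescaped_match is None)  # True
print(auto_match is not None)   # True
```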

Character Classes

A character class is a special construct that matches one out of several specified characters. For example, st[oe]p matches either stop or step. The order of characters in the character class is not important. A character class matches only a single character, so st[oe]p does not match steep or stoop.

You can use a hyphen to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use multiple ranges: [a-zA-Z] matches any single letter. You can also combine ranges and single characters: [a-zA-Z1] matches a letter or the digit 1.

Typing a ^ character at the start of the character class (immediately after the opening square bracket) negates the class: it then matches any single character that is not listed. ch[^a]p would match chip and chop, but not chap.
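The character class examples above can be verified with a small Python sketch. The word lists are sample data chosen for illustration, and `re.fullmatch` is used so that each word must match the pattern in its entirety.

```python
import re

# st[oe]p: exactly one of o or e between "st" and "p".
words = ["stop", "step", "steep", "stoop"]
single = [w for w in words if re.fullmatch(r"st[oe]p", w)]
print(single)  # ['stop', 'step']

# ch[^a]p: any single character except a between "ch" and "p".
negated = [w for w in ["chip", "chop", "chap"] if re.fullmatch(r"ch[^a]p", w)]
print(negated)  # ['chip', 'chop']

# [a-zA-Z1]: a letter or the digit 1, one character at a time.
combined = re.findall(r"[a-zA-Z1]", "a1-B2")
print(combined)  # ['a', '1', 'B']
```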

Shorthand Character Classes

There are a number of pre-defined shorthand character classes available for use in a regexp.

  • \w matches a word character (any alphanumeric character plus the underscore character).
  • \W matches any non-word character.
  • \d matches a single character that is a digit.
  • \D matches a single character that is not a digit.
  • \s matches any whitespace character (including line breaks and tabs).
  • \S matches any non-whitespace character.
  • \b matches a word boundary. er\b matches the er in never but not the er in verb.
  • \B matches a non-word boundary. er\B matches the er in verb but not the er in never.
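The shorthand classes and boundaries listed above behave as follows in Python's `re` module. This is a sketch with invented sample strings; other regex flavors may treat non-ASCII characters differently.

```python
import re

# \w+ collects runs of word characters (letters, digits, underscore).
word_runs = re.findall(r"\w+", "item_42: ready!")
print(word_runs)  # ['item_42', 'ready']

# \d matches one digit at a time.
digits = re.findall(r"\d", "Room 101")
print(digits)  # ['1', '0', '1']

# \s+ matches runs of spaces, tabs and line breaks.
parts = re.split(r"\s+", "one two\tthree\nfour")
print(parts)  # ['one', 'two', 'three', 'four']

# \b requires a word boundary after "er"; \B requires that there is none.
boundary_never = re.search(r"er\b", "never") is not None
boundary_verb = re.search(r"er\b", "verb") is not None
non_boundary_verb = re.search(r"er\B", "verb") is not None
non_boundary_never = re.search(r"er\B", "never") is not None
print(boundary_never, boundary_verb)          # True False
print(non_boundary_verb, non_boundary_never)  # True False
```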

Non-printable Characters

A regexp can contain character sequences to identify non-printable characters.

  • \t matches a tab character.
  • \r matches a carriage return and \n matches a line feed (Windows uses \r\n to terminate lines).

Non-printable characters can be used directly in the regexp, or in character classes.
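A brief Python sketch of both uses, directly in a pattern and inside a character class. The tab-separated sample text is invented for the example.

```python
import re

# Hypothetical tab-separated data with Windows line endings.
text = "name\tvalue\r\nfirst\t1\r\n"

# \r\n used directly: split into lines, then split each line on tabs.
rows = [re.split(r"\t", line) for line in re.split(r"\r\n", text) if line]
print(rows)  # [['name', 'value'], ['first', '1']]

# Inside a character class: collapse any run of tabs, CRs or LFs to one space.
flattened = re.sub(r"[\t\r\n]+", " ", text).strip()
print(flattened)  # name value first 1
```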

Wildcard Match

The dot matches any single character, except line break characters.

st.p matches step, stop, st%p etc.

Use the dot carefully. Often a character class (or negated character class) is more precise.
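To see why a character class is often more precise than the dot, compare the two on a few sample strings (illustrative data only). In Python, the dot skips line breaks by default, which is consistent with the description above.

```python
import re

candidates = ["step", "stop", "st%p", "st p", "st\np"]

# The dot accepts any single character except a line break.
with_dot = [c for c in candidates if re.fullmatch(r"st.p", c)]
print(with_dot)  # ['step', 'stop', 'st%p', 'st p']

# A character class restricts the middle character to lower-case letters.
with_class = [c for c in candidates if re.fullmatch(r"st[a-z]p", c)]
print(with_class)  # ['step', 'stop']
```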

Anchors

Anchors are used to match the start or end of the string.

  • ^ matches the start of the string.
  • $ matches the end of the string.
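The two anchors can be sketched in Python as shown here, using made-up sample strings. One flavor-specific detail to keep in mind: in Python, $ also matches just before a single trailing line break, which may or may not be the case in other tools.

```python
import re

# ^ ties the pattern to the start of the string.
starts_with_note = re.search(r"^Note", "Note: save first") is not None
note_elsewhere = re.search(r"^Note", "See Note 2") is not None
print(starts_with_note, note_elsewhere)  # True False

# $ ties the pattern to the end of the string.
ends_with_done = re.search(r"done$", "all done") is not None
done_elsewhere = re.search(r"done$", "done already") is not None
print(ends_with_done, done_elsewhere)  # True False
```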

Alternation

Alternation is the regular expression "or" functionality. rain|shine will match rain in Do you think it will rain today? Perhaps the sun will shine. If the regexp is applied again, it will match shine.
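The "applied again" behavior described above can be reproduced in Python by searching a second time from the end of the first match. This is only a sketch of the idea; a Find Next button in an editor does the equivalent for you.

```python
import re

text = "Do you think it will rain today? Perhaps the sun will shine."
pattern = re.compile(r"rain|shine")

# First application: the leftmost alternative that occurs in the text wins.
first = pattern.search(text)
print(first.group())  # rain

# Second application: continue searching after the first match.
second = pattern.search(text, first.end())
print(second.group())  # shine
```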

Repetition

The question mark makes the preceding token optional. E.g. stee?p matches step or steep.

The asterisk will match the preceding token zero or more times. The plus matches the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.

Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999.

Repetition operators are "greedy". They will expand the match as far as possible. <.+> will match <STRONG>True</STRONG> in If the value is <STRONG>True</STRONG>.

Place a question mark after the quantifier to make it "lazy". <.+?> will match <STRONG> in the previous example.
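The repetition examples in this section can be checked with the Python sketch below. The test strings other than the STRONG sentence are invented for the demonstration.

```python
import re

# ? makes the second e optional.
optional = [w for w in ["step", "steep", "steeep"] if re.fullmatch(r"stee?p", w)]
print(optional)  # ['step', 'steep']

# * allows zero or more further characters after the leading letter.
tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", "<b> <h1> <1>")
print(tags)  # ['<b>', '<h1>']

# {3} requires exactly three more digits: numbers from 1000 to 9999.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")
print(four_digit)  # ['1000', '9999']

sentence = "If the value is <STRONG>True</STRONG>."

# Greedy: .+ expands as far as possible, up to the last >.
greedy = re.search(r"<.+>", sentence).group()
print(greedy)  # <STRONG>True</STRONG>

# Lazy: .+? stops at the first > it can.
lazy = re.search(r"<.+?>", sentence).group()
print(lazy)  # <STRONG>
```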

Grouping

Place parentheses around multiple tokens to group them together. You can then apply a quantifier to the whole group. E.g. Fruit(fly)? matches Fruit or Fruitfly.

Groups can be used when executing a project-wide find and replace operation by using $n in the 'Replace With' string, where n is a number from 1-9 identifying a group defined in the search string. E.g. a search string of Fruit(fly)? and a replace string of Fire$1 would replace Fruitfly with Firefly and Fruit with Fire.
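The same replacement can be sketched in Python for experimentation. Note the syntax difference: Python's `re.sub` refers to groups as \1 rather than the $1 used in the 'Replace With' string, so this is an analogue of the operation, not the HelpStudio syntax. It also assumes Python 3.5 or later, where a group that did not take part in the match is replaced with an empty string.

```python
import re

text = "A Fruitfly landed on the Fruit."

# \1 inserts whatever group 1 captured: "fly" when present, otherwise nothing.
result = re.sub(r"Fruit(fly)?", r"Fire\1", text)
print(result)  # A Firefly landed on the Fire.
```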

Lookaround

A lookaround is a special kind of group. The group is matched normally, but the text it matches is not included in the result. A lookaround therefore matches a position, similar to anchors.

st(?=e) matches the st in steer but not in stop. This is called a positive lookahead.

st(?!e) matches the st in stop but not the st in steer. This is called a negative lookahead.

To look backwards, use a lookbehind. (?<=t)op matches the op in top but not in pop.

Lookaround expressions are only supported in Tools.
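For experimenting with the lookaround examples above, the following Python sketch shows that the looked-at character is checked but never becomes part of the match. Python's `re` module supports these constructs; support in other tools varies.

```python
import re

# Positive lookahead: "st" only when followed by "e".
ahead_steer = re.search(r"st(?=e)", "steer")
ahead_stop = re.search(r"st(?=e)", "stop")
print(ahead_steer.group(), ahead_stop)  # st None

# Negative lookahead: "st" only when NOT followed by "e".
not_ahead_stop = re.search(r"st(?!e)", "stop")
not_ahead_steer = re.search(r"st(?!e)", "steer")
print(not_ahead_stop.group(), not_ahead_steer)  # st None

# Lookbehind: "op" only when preceded by "t".
behind_top = re.search(r"(?<=t)op", "top")
behind_pop = re.search(r"(?<=t)op", "pop")
print(behind_top.group(), behind_pop)  # op None
```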
