Playing with Regular Expressions, part 2 – Find the first word in a sentence

All the examples below use the same sample text, which you will see in the samples. I used an excellent tool named Expresso to evaluate every expression in this blog entry.

Problem

I want to find the first word of every sentence in a text.

Solution

Use the following pattern to find the first word of each sentence in a plain-text string, i.e. one without any formatting such as HTML.

(?:^|\r\n|\.\s+)(?<myMatchedWord>\w+)

 

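The pattern above will match the first word of every sentence in the sample text below. With Windows line endings (\r\n) those words are Lorem, Ut, Nulla, Vestibulum, Nulla, Praesent and Vestibulum: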

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut eu sem nisl.
Nulla elementum consectetur leo nec consequat. Vestibulum quis libero sit amet arcu euismod bibendum a.

Nulla elementum:    1389-89-1443

Praesent a nibh sed augue mollis vehicula.
Vestibulum nisl elit, eleifend a tristique nec, faucibus a sem.
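
For readers who want to try the pattern outside Expresso, here is a minimal sketch in Python. It is not the setup used in this post: it assumes the sample text uses Windows line endings (\r\n), and it spells the named group (?P<myMatchedWord>...) because Python's re module requires that form instead of (?<myMatchedWord>...). The names SAMPLE, FIRST_WORD and first_words are made up for the illustration.

```python
import re

# Sample text from this post, joined with \r\n (an assumption about the line endings).
SAMPLE = "\r\n".join([
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut eu sem nisl.",
    "Nulla elementum consectetur leo nec consequat. Vestibulum quis libero sit amet arcu euismod bibendum a.",
    "",
    "Nulla elementum:    1389-89-1443",
    "",
    "Praesent a nibh sed augue mollis vehicula.",
    "Vestibulum nisl elit, eleifend a tristique nec, faucibus a sem.",
])

# Same pattern as above, with Python's (?P<name>...) spelling for the named group.
FIRST_WORD = re.compile(r"(?:^|\r\n|\.\s+)(?P<myMatchedWord>\w+)")

# Collect only the named group, not the separator that precedes each word.
first_words = [m.group("myMatchedWord") for m in FIRST_WORD.finditer(SAMPLE)]
print(first_words)
# ['Lorem', 'Ut', 'Nulla', 'Vestibulum', 'Nulla', 'Praesent', 'Vestibulum']
```

If the text used a bare \n instead of \r\n, the word Praesent would be missed, because it follows an empty line rather than a dot and only the \r\n alternative can reach it.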

Explanation

  • The part at the beginning of the expression, (?:^|\r\n|\.\s+), determines whether the word occurs at the beginning of the string, after a line break, or after a previous sentence.
  • The pipe character has the same meaning as the OR operator in code. The expression above contains three alternatives, and one of them has to match in order to successfully find the beginning of a sentence.
  • The opening (?: starts a non-capturing group: regex matches what is inside it but does not store it in the group of matches. The caret that follows, before the first pipe character, states that we are looking at the start of the string.
  • The second alternative, \r\n, matches a carriage return followed by a line feed. We can add more OR cases if we like, e.g. if the text has only carriage returns or only line feeds.
  • The third alternative, \.\s+, finds the first word somewhere inside a paragraph, i.e. after a previous sentence that ends with a dot followed by one or more whitespace characters.
  • The second pair of parentheses matches the first word itself (\w+, one or more word characters) and stores it in a named group called myMatchedWord.
  • To create a named group in .NET regex (the flavor Expresso uses), use the (?<myName>myExpression) construct.
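
The points above about adding more OR cases and about the non-capturing group can be sketched in Python as follows. This is an illustrative variant, not the pattern from the Solution section: the extra \n and \r alternatives, the sample string text and the names PATTERN and words are assumptions made for the example, and the named group uses Python's (?P<name>...) spelling.

```python
import re

# Hypothetical variant: accept \r\n, a lone \n, or a lone \r as the line break.
PATTERN = re.compile(r"(?:^|\r\n|\n|\r|\.\s+)(?P<myMatchedWord>\w+)")

# Made-up text that mixes the different separators.
text = "First line here\nSecond line. Third sentence\rFourth line"

for m in PATTERN.finditer(text):
    # group(0) is the whole match, separator included;
    # the named group holds only the word itself.
    print(repr(m.group(0)), "->", m.group("myMatchedWord"))

words = [m.group("myMatchedWord") for m in PATTERN.finditer(text)]
print(words)           # ['First', 'Second', 'Third', 'Fourth']

# The (?:...) prefix is non-capturing, so the pattern has exactly one group.
print(PATTERN.groups)  # 1
```

Because the prefix is non-capturing, the separator shows up only in the overall match and never as a group of its own, which keeps myMatchedWord the single captured value.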