Playing with Regular Expressions, part 2 – Find the first word in a sentence

In all the examples below I have used the same sample text (you will see it in the samples). I have used an excellent tool named Expresso to evaluate all expressions in this blog entry.

Problem

I want to find the first word in all sentences in a text.

Solution

Use the following pattern to find all the first words in a string without any formatting (HTML).

(?:^|\r\n|\.\s+)(?<myMatchedWord>\w+)

 

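Run against the sample text below, the pattern matches the first word of each sentence (Lorem, Ut, Nulla, Vestibulum, Nulla, Praesent and Vestibulum):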

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut eu sem nisl.
Nulla elementum consectetur leo nec consequat. Vestibulum quis libero sit amet arcu euismod bibendum a.

Nulla elementum:    1389-89-1443

Praesent a nibh sed augue mollis vehicula.
Vestibulum nisl elit, eleifend a tristique nec, faucibus a sem.

Explanation

  • The first part of the expression, (?:^|\r\n|\.\s+), determines whether the word occurs at the beginning of the string, at the start of a new line, or after a previous sentence.
  • The pipe character has the same meaning as the OR operator in code. The group above has three alternatives, and one of them has to match for the beginning of a sentence to be found.
  • The opening (?: makes the group non-capturing: regex matches the pattern but does not store it as a group in the match. The caret that follows it, ^, states that we are looking at the start of the string.
  • The second alternative, \r\n, matches a carriage return followed by a line feed. We can add more OR cases if we like, e.g. for text that has only carriage return or only line feed.
  • The third alternative, \.\s+, finds a first word somewhere inside a paragraph, i.e. after a previous sentence that ends with a dot followed by one or more whitespace characters.
  • The second parenthesis matches the first word itself and stores it in a named group called myMatchedWord.
  • Named groups in regex are written with the (?<myName>myExpression) construct.
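
For readers who want to try the pattern outside Expresso, here is a minimal sketch in Python. It assumes the sample text uses \r\n line endings, and it spells the named group (?P<myMatchedWord>...) because Python's re module uses that syntax instead of (?<myMatchedWord>...). The variable names are only for illustration.

```python
import re

# The sample text from this post, joined with \r\n line endings
# (an assumption; the pattern only looks for \r\n between lines).
SAMPLE_TEXT = "\r\n".join([
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut eu sem nisl.",
    "Nulla elementum consectetur leo nec consequat. Vestibulum quis libero sit amet arcu euismod bibendum a.",
    "",
    "Nulla elementum:    1389-89-1443",
    "",
    "Praesent a nibh sed augue mollis vehicula.",
    "Vestibulum nisl elit, eleifend a tristique nec, faucibus a sem.",
])

# Same pattern as in the post, with Python's (?P<name>...) named-group syntax.
FIRST_WORD = re.compile(r"(?:^|\r\n|\.\s+)(?P<myMatchedWord>\w+)")

# Collect only the named group, not the prefix that introduced the word.
first_words = [m.group("myMatchedWord") for m in FIRST_WORD.finditer(SAMPLE_TEXT)]
print(first_words)
```

Because the prefix group is non-capturing, only the word itself ends up in the named group, even though the dot or line break is part of the overall match.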

Playing with Regular Expressions, part 1 – Find the last word in a sentence

As a developer I quite often run into situations where I need to find an occurrence of a word or phrase in a text, or some kind of number or pattern in a string. Regular expressions make these tasks relatively simple, and you will usually find loads of examples on the internet of how to match your specific pattern. This blog series will cover how to think when working with regular expressions.

In all the examples in this blog series I have used the same sample text (you will see it in the samples). I have used an excellent tool named Expresso to evaluate all expressions in this blog entry. Of course it is possible to tweak the expressions in my examples below so that they search for other patterns as well.

Problem

I want to retrieve the last word in a sentence.

Solution

Use the \b anchor together with the pattern that ends the sentence to tell regex that you only want the word that appears right before the dot character, the carriage return and line feed, or the dollar character that represents the end of the string. The pattern below covers the dot and the line break; $ can be added as a third alternative if the text may end without a dot.

The pattern below will match all the last words in each sentence.

(?<myNamedGroup>\b\w+)(?:\.|\r\n)

 

You should see the following result when executing the expression against the sample text (the matched words are elit, nisl, consequat, a, 1443, vehicula and sem).

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut eu sem nisl.
Nulla elementum consectetur leo nec consequat. Vestibulum quis libero sit amet arcu euismod bibendum a.

Nulla elementum:    1389-89-1443

Praesent a nibh sed augue mollis vehicula.
Vestibulum nisl elit, eleifend a tristique nec, faucibus a sem.

Explanation

  • The first parenthesis matches the last word and stores it in a named group called myNamedGroup.
  • \b\w+ matches any word in the text that has at least one character. The \b anchor matches at either the beginning or the end of a word.
  • (?:\.|\r\n) matches either a dot or the combination of carriage return and line feed. The (?: part makes the group non-capturing, so the dot or line break has to follow the word but is not stored in a group.
  • Taken together, the expression returns every word that is immediately followed by one of the alternatives in the second part.
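
The explanation above can be tried out with a small Python sketch. It assumes \r\n line endings in the sample text and uses Python's (?P<myNamedGroup>...) spelling for the named group. The second pattern, with $ added as a third alternative, is my own illustrative extension and not part of the pattern shown above.

```python
import re

# The sample text from this post, joined with \r\n line endings (an assumption).
SAMPLE_TEXT = "\r\n".join([
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut eu sem nisl.",
    "Nulla elementum consectetur leo nec consequat. Vestibulum quis libero sit amet arcu euismod bibendum a.",
    "",
    "Nulla elementum:    1389-89-1443",
    "",
    "Praesent a nibh sed augue mollis vehicula.",
    "Vestibulum nisl elit, eleifend a tristique nec, faucibus a sem.",
])

# Same pattern as in the post, with Python's (?P<name>...) named-group syntax.
LAST_WORD = re.compile(r"(?P<myNamedGroup>\b\w+)(?:\.|\r\n)")
last_words = [m.group("myNamedGroup") for m in LAST_WORD.finditer(SAMPLE_TEXT)]
print(last_words)

# Hypothetical variant: also accept the end of the string as a terminator.
LAST_WORD_OR_END = re.compile(r"(?P<myNamedGroup>\b\w+)(?:\.|\r\n|$)")
no_final_dot = [m.group("myNamedGroup") for m in LAST_WORD_OR_END.finditer("Hello world")]
print(no_final_dot)
```

Note that \w also matches digits, so the number at the end of the "Nulla elementum:" line counts as a last word.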

Matching passwords – a look into the wonderful world of regular expressions

Ok, so I am a big fan of regular expressions. There lies a great strength in those few characters that you write to validate your input, but it can be a little bit tricky to get the expressions to do exactly what you want.

I got a question from a colleague who wanted to validate passwords with regex, but a task that seemed so trivial was, as always, not that easy. His original regex was ^.*(?=.{8,})(?=.*[0-9]{2,})(?=.*[a-z\d]).*$ but the problem was that it only validated strings that had two numbers in a row, and not two numbers that could appear at different locations in the string.

I began to dive into the problem based on the rules that were stated for the passwords that should be validated:

  • Minimum 8 characters in the password.
  • Minimum 2 numbers somewhere in the password.

Let's analyze the regular expression above:

  • The characters at the beginning, ^.*, say that zero or more characters of any kind may occur between the beginning of the string and the point where the rest of the pattern starts to match. The caret sign, ^, says that the matching should start from the beginning of the text. The dot, ., means any character, alphanumeric or not (by default excluding line breaks). The asterisk, *, means that the preceding element should occur zero or more times.
    • This part of the regular expression was not ok since it had no purpose.
  • The construct (?=.{8,}) says that the string should contain at least 8 characters, which can be alphanumeric or non-alphanumeric.
    • So this construct was almost ok since it did what it was intended to do.
  • The construct (?=.*[0-9]{2,}) states that we are looking for zero or more characters of any kind followed by two or more numbers in a row.
    • This part of the expression did not do what was intended: the {2,} quantifier applies to [0-9] alone, so the two numbers have to be adjacent.
  • The construct (?=.*[a-z\d]) says almost the same as above, but only looks for a single alphabetical character between a-z or any number.
    • This part of the regular expression had no purpose, since any string that contains a number already satisfies it.
  • The last part of the regular expression, .*$, states that we are looking for zero or more characters of any kind that occur before the end of the string to be searched. The dollar sign, $, marks the end of the string.
    • This part of the regular expression was not ok since it had no purpose.

So how do we find the solution to this problem? Well, the construct (?=.{8,}) gives us a good start. The requirement was that the string should contain at least 8 characters, and that at least two of those characters should be numbers. Hmm, two numbers, that would be (?=\d{2,}), since \d is the shorthand for any digit in regex and {2,} says that there should be two or more of them. But this is not enough, since this pattern says that there should be two numbers in a row, right at the current position. Putting .* in front, as in (?=.*\d{2,}), lets the numbers appear anywhere but still requires them to be adjacent. The correct pattern is (?=(.*\d){2,}), which repeats the whole group "any characters followed by a digit" at least twice and gives us the correct result.
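
To make the difference between the two quantifier placements concrete, here is a small Python sketch. The helper has_match and the test strings are made up for illustration; re.match is used so that each lookahead is evaluated from the start of the string.

```python
import re

# {2,} applies to \d only: two or more digits in a row, anywhere in the string.
ADJACENT_DIGITS = re.compile(r"(?=.*\d{2,})")

# {2,} applies to the whole group: "anything, then a digit", at least twice.
SEPARATE_DIGITS = re.compile(r"(?=(.*\d){2,})")


def has_match(pattern, text):
    """Return True if the lookahead succeeds at the start of text."""
    return pattern.match(text) is not None


# Map each test string to (adjacent-digits result, separate-digits result).
results = {
    text: (has_match(ADJACENT_DIGITS, text), has_match(SEPARATE_DIGITS, text))
    for text in ("abcdefg12", "a1bcdefg2", "abcdefgh1")
}
for text, outcome in results.items():
    print(text, outcome)
```

Only the second pattern accepts "a1bcdefg2", where the two digits are separated by other characters; both reject "abcdefgh1", which has a single digit.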

What happens if we combine these two constructs? Well, we get almost what we want. The solution to the problem was a combination of the patterns described above, which yielded the pattern (?=.{8,})(?=(.*\d){2,})

To state that the password should contain at least one non-alphanumeric character we could add the pattern (?=(.*\W){1,}), where \W matches any character that is not a letter, a digit or an underscore. That gives us the final pattern (?=.{8,})(?=(.*\d){2,})(?=(.*\W){1,})
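
As a rough sketch of how the final pattern could be used, here it is wrapped in a Python function. The function name is_valid_password and the sample passwords are hypothetical, and re.match is assumed so that all three lookaheads are checked from the start of the string.

```python
import re

# The final pattern from this post: at least 8 characters, at least two
# digits anywhere, and at least one non-word character anywhere.
PASSWORD_RULES = re.compile(r"(?=.{8,})(?=(.*\d){2,})(?=(.*\W){1,})")


def is_valid_password(candidate: str) -> bool:
    """Hypothetical helper: True if candidate satisfies all three lookaheads."""
    return PASSWORD_RULES.match(candidate) is not None


for candidate in ("pass1word2!", "pass1word2", "password1!", "a1b2!", "pass_1word2"):
    print(candidate, is_valid_password(candidate))
```

The last candidate is rejected because the underscore counts as a word character, so \W does not match it.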

Hope that this gives you some clues on the power and complexity of regular expressions.