Thursday, 22 February 2018

Greedy and Non-Greedy match - Regex to compute first n digits in a string

This post explains what greedy and non-greedy mean in the context of regex.


Input String: AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724A

Required Output: 201708, i.e. the first six consecutive digits in the input string.

I am not going to give the answer straight away; instead, we will go step by step from here.

Keep an online regex tester open so you can try each pattern while going through the steps below.



Step 1: What if we want to match each digit, i.e. every single digit in the given string?

Possible Sol: (\d) or ^.*(\d). Both look reasonable, but they do not behave the same way.

(\d):
  • Produces one full match per digit in the string (i.e. 2, 0, 1, 7, ... 4).
  • Produces one group per match, holding that same digit (i.e. 2, 0, 1, 7, ... 4).
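
If you prefer to check this in code rather than in an online tester, here is a minimal sketch using Python's re module. The post itself does not name a language, so Python and the variable names text and digits are assumptions made only for illustration.

```python
import re

# Input string from the post, written on a single line.
text = "AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724A"

# With one capturing group, findall returns the content of that group
# for every non-overlapping match, scanning left to right.
digits = re.findall(r"(\d)", text)

print(digits[:6])   # ['2', '0', '1', '7', '0', '8']
print(digits[-1])   # 4
print(len(digits))  # one entry per digit in the string
```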

^.*(\d):
  • This also ends on a digit, but it behaves differently:
  • It produces a single full match that starts at the beginning of the string and runs up to the last digit in the string
(i.e. AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724).
  • It produces a single group, holding only that last digit (i.e. 4).
This difference comes from “^.*”. Let’s see how it affects the match.

  • ^ matches the start of the line, i.e. in our case the position just before the first char "A".
  • .* matches any character (except a line break), zero or more times.

With file-name style wildcards, *txt means everything that ends with txt.
Similarly, in ^.*(\d) anything can be present before (\d), but the match has to end with a digit (\d).

So you might expect the match to end at the first digit, i.e. 2 in our case: AV assd 2.

But no, it goes up to the last digit in the string, i.e. 4 (see screenshot below).

This is because .* is greedy: it first consumes as many characters as it can (here, the whole line) and then gives characters back one at a time, only until the rest of the pattern, (\d), can match. The first point where that works is the last digit in the string.

So when you use it in your code, the group will hold the last digit as the result, i.e. 4 in our case.
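
The greedy behaviour described above can be reproduced with a short sketch. Python's re module is an assumption here (the post does not say which language the regex is used from), and the names m, full_match and last_digit are only illustrative.

```python
import re

# Input string from the post, written on a single line.
text = "AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724A"

# Greedy .* grabs the whole line first, then backs off one character
# at a time until (\d) can match. That happens at the last digit.
m = re.match(r"^.*(\d)", text)

full_match = m.group(0)  # everything from the start up to the last digit
last_digit = m.group(1)  # the single captured digit

print(full_match)
print(last_digit)  # 4
```

Note that there is only one match and one group here; the earlier digits are swallowed by .* and never captured.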


Step 2: Similarly, what if we want to match six consecutive digits?

(\d{6}) or ^.*(\d{6}) look like possible solutions.

(\d{6})
  • Produces one full match for each non-overlapping run of six consecutive digits in the string (i.e. 201708, 205960, ... 108524).
  • Produces one group per match, holding those same six digits (i.e. 201708, 205960, ... 108524).
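
To see exactly which six-digit runs are picked up, here is a small sketch. It assumes Python's re module, which the post does not prescribe, and the variable name runs is made up for the example.

```python
import re

# Input string from the post, written on a single line.
text = "AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724A"

# Each match consumes six digits; scanning then resumes right after it,
# so matches never overlap.
runs = re.findall(r"(\d{6})", text)

print(runs)
# ['201708', '205960', '205961', '206200', '206129', '852585', '108524']
```

The trailing 724 of 108524724 is left over because only three digits remain after 108524 has been consumed.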

^.*(\d{6})

Again we can write it as ^.*(\d{6}), i.e. go to the start of the line, consume as much as possible, and then capture six consecutive digits.

It starts from "A" and could stop at the first six consecutive digits, i.e. "201708", but because .* is greedy it does not. It consumes the whole line and backs off only as far as needed, so the group ends up holding the last six consecutive digits, i.e. "524724".

  • As in Step 1, the behaviour differs from (\d{6}) on its own:
  • It produces a single full match that starts at the beginning of the string and runs up to the last six consecutive digits in the string
(i.e. AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724).
  • It produces a single group, holding those last six digits
(i.e. 524724).
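
A quick way to confirm the greedy result is the sketch below. It assumes Python's re module (not something the post specifies), and the names m, full_match and last_six are illustrative only.

```python
import re

# Input string from the post, written on a single line.
text = "AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724A"

# Greedy .* takes the whole line, then gives back characters until
# six digits in a row can follow it. The first place that works,
# coming from the right, is the tail of 108524724.
m = re.match(r"^.*(\d{6})", text)

full_match = m.group(0)
last_six = m.group(1)

print(last_six)  # 524724
```

Notice that 524724 is not one of the matches that (\d{6}) finds on its own, because there the run 108524 is consumed first.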



Step 3: Here comes our requirement, i.e. match only the first six digits.

As I already mentioned, .* is greedy. So how do we stop its greediness and make it lazy?

We saw in the previous steps that our expression does not stop at the first place where it could match; it stretches as far as possible, and the group holds the last candidate.

This is the meaning of greedy in regex.

To make it lazy, or non-greedy, ? comes into play.


Placed right after a quantifier, as in .*?, the ? makes that quantifier lazy: it matches as few characters as possible and grows only until the rest of the pattern can match. The match therefore stops at the first six consecutive digits, even though more are available later in the string.

^.*?(\d{6})
  • Produces a single full match that starts at the beginning of the string and runs up to the first six consecutive digits in the string (i.e. AV assd 201708).
  • Produces a single group, holding those first six digits (i.e. 201708).
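
Since the title talks about the first n digits, the lazy pattern can be wrapped in a small helper. This is only a sketch: Python's re module is an assumption, and first_n_digits is a hypothetical function name that does not come from the post.

```python
import re
from typing import Optional

# Input string from the post, written on a single line.
text = "AV assd 201708 DC18 ROUTE PO205960-205961-206200-206129ASSS 90a852585 108524724A"


def first_n_digits(s: str, n: int) -> Optional[str]:
    """Return the first run of n consecutive digits in s, or None.

    Hypothetical helper for illustration. It builds the lazy pattern
    ^.*?(\\d{n}) from the post for the requested n.
    """
    # Lazy .*? matches as little as possible, so the group lands on
    # the earliest place where n digits in a row are available.
    m = re.match(rf"^.*?(\d{{{n}}})", s)
    return m.group(1) if m else None


print(first_n_digits(text, 6))         # 201708
print(first_n_digits("no digits", 6))  # None
```

Comparing this with the greedy version from Step 2 on the same input shows the whole difference: the greedy pattern gives 524724, the lazy one gives 201708.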

Feel free to mention any doubts in the comment section.
Hope this post helps you.
