Posted by: lrrp | January 18, 2006

Be careful about using Java RegEx’s for simple patterns by Bill Higgins

Be careful about using Java RegEx’s for simple patterns

posted by Bill Higgins
Permalink

I was writing some code this week that essentially implements a REST API for a server-side web app. The URL could either end in a positive integer (e.g. /123) or a semi-random unique identifier (e.g. _someLongString). So the server-side code basically needed to check if the thing was a legit positive integer and if so do something.

Java provides a method int Integer.parseInt(String str) that attemps to turn ‘str’ into a primitive int, so say if you passed in the string “-45” it would become the number negative 45. The problem is that if you pass this method something that’s not an int (say “abc”) it throws an exception. Throwing an exception may make sense in cases where the correct input is always supposed to be a String form of an int, but in my case there was a legitimate case where I’d receive non-ints (the unique identifier case) so throwing and catching an exception seemed like a bad practice (see Bloch’s Effective Java, Item 39: “Use exceptions only for exceptional conditions”) so I was looking for something that could test a String to see if it was an int without throwing an exception.

I thought I found something when I found java.lang.String’s matches method. It takes a regular expression as input and returns a boolean indicating whether or not your pattern matches your String. So to test for a positive int I could just say myStr.matches("[0-9]+") (For non-programming friends and family members, this pattern means “match only strings that contain a digit (“[0-9]”) one or more times (“+”).)

I mentioned this to my manager and he was a bit worried because he was under the impression that regular expressions in Java were relatively processor-intensive, and the code in question was on a frequently-traveled server-side code path, meaning that any extra cycles would add up as more users used the server. He suggested I time the different options available, so I did. It turns out he was right – String.matches was horrendously slow compared to some of the other methods available. Below is a summary of the tests I ran:

Method 1: Custom checker
The custom checker is a private method that takes a String as input, walks the String and returns false if it hits any character that’s not a digit (using Character.isDigit(char testChar)).

Method 2: Try Integer.parseInt, catch Exception
This seems to be the de facto method for testing Strings for ints, but the one I rejected because of the legit case of a non-int as input.

Method 3: Pre-compiled regular expression
Similar to String.matches, but using a regular expression that’s compiled only once in the lifetime of the program, since the pattern (“[0-9]+”) never changes.

Method 4: String.matches(String regex)
Matching on the string “[0-9]+”, meaning that this regex needs to be compiled each time.

Input:
1 million randomly generated integers, converted to strings, with 5% of said int-strings mangled to include non-int characters.

Results:

# Method Processing time (in secs)
1 Custom checker 0.9
2 Number format exception checker 5.5
3 Pre-compiled reg-ex 27.1
4 String.matches 141.7

I was pretty stunned that the most straight-forward method (String.matches) was two orders of magnitude slower than the custom checker, which I ended up going with. Now I know someone out there is reciting “premature optimization is the root of all evils”, so a few disclaimers:

  • This pattern was extremely simple, so it was easy to write a custom checker – if the pattern was more complex, or user-supplied, then it would have made more sense from a maintainability standpoint to use the pre-compiled regex since your custom algorithm for the complex case would start to be as expensive as just using the regex, plus probably containing bugs!
  • This was server-side code on a frequently-executed code path. If this were running in some obscure server-side code or better yet, on an infrequent method in a desktop application, the pre-compiled regex again would have been fine.
  • The simple Character.isDigit checker isn’t as robust as Integer.parseInt (e.g. mine couldn’t grok a minus sign), but again it didn’t need to be because of my application’s requirements. Because of the one-off nature of this test, I put it in a private method.

Ah, coding again.

(http://www-128.ibm.com/developerworks/blogs/dw_blog_comments.jspa?blog=396&entry=104674)


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Categories

%d bloggers like this: