Google Unwittingly Cracks Own Captchas


----------------

streetviewnumbers

Google has developed an algorithm that can decipher house numbers from Street View images with impressive accuracy. The problem is that the algorithm can match or even outperform humans when confronted with Google’s own use of CAPTCHA tests.

Because so much of the data collection for Google Maps is automated, the first drafts of the maps often lack the ability to match a specific number to a specific building, meaning maps can’t always highlight a particular building (or part of a larger building) in a street.

One way round this is to automatically hunt for building numbers that appear in the images captured for the Street View images. That’s not always easy as the numbers on the side of buildings appear in many different typefaces at different sizes and with differing degrees of contrast to the backgrounds.

In a paper published via Cornell University, Google researchers have rethought the way this recognition works. It says traditionally computer systems work on image recognition in three distinct steps: figuring out what part of the image actually houses the characters, breaking it down into manageable sections, then recognizing it piece by piece.

In this case, the researchers instead developed what they call “a deep convolutional neural network that operates directly on the image pixels.” In simpler terms, they tried to develop a system that looks at the entire image and tried to recognize characters in a single process. That’s the technique humans use and has usually been considered as a task computers can’t perform as quickly.

The researchers say they initially tested their algorithm on the Street View House Numbers (SVHN) dataset, a selection of images Google makes publicly available for testing and refining automated recognition tools. The algorithm successfully recognized 96 percent of complete numbers and 97.84 percent of individual digits.

They then turned to the complete Google database, which contains many images considered too difficult to be worth including in SVHN. Here they still managed 90 percent accuracy.

Realizing the technique might work in other areas, the researchers then tried the algorithm on reCAPTCHA, Google’s own system used to distinguish between humans and bots. Not only does reCAPCTAH use letters as well as numbers, but it also distorts the image. The idea is that humans can quickly piece together and reverse-engineer the effects of the distortion, whereas a computer is forced to try out every possible change to get back to the original image and then recognize the characters.

According to the researchers, the algorithm was able to get 99.8% accuracy on the hardest category in the reCAPTCHA test.

Google now concedes that reCAPTCHA as it stands may no longer be sufficient to distinguish between human and machine:

Thanks to this research, we know that relying on distorted text alone isnít enough. However, itís important to note that simply identifying the text in CAPTCHA puzzles correctly doesnít mean that reCAPTCHA itself is broken or ineffective. On the contrary, these findings have helped us build additional safeguards against bad actors in reCAPTCHA.

Understandably Google isn’t detailing these changes, which have involved cooperation between the researchers and the reCAPTCHA department. However, the paper suggests the simplest option is to make the CAPTCHA text longer, something that not only requires more computer resources to automatically crack, but also increases the significance of whatever error rate does exist.

It also suggests a more effective change would be to have the CAPTCHA text sliding around over a static image background, effectively creating a moving target for automated recognition.





4 Responses to Google Unwittingly Cracks Own Captchas

    • Better captchas? Most certainly. More secure? Not sure that assumption can be made. We’ll have to wait for all those hackers sitting in their parents basements to take a shot at it first.

  1. If anything, having the numbers/text sliding around on a static background would make it easier for a suitable algorithm to work on as you’d have gone through the trouble of telling it exactly which parts are signal rather than noise – the bits that change from frame to frame.

  2. The captchas need to go. They are easily breakable by ocr, and if not then you can just hire a bunch of guys from 3rd world countries to retype them all day long for a few bucks. With time they only get harder for me (a human non the less) to decipher :-P. The worst offender so far was the captcha on old Steam forum – that one I needed to reload even 10x to get one that I was able to read :-P.

    All in all the captchas were never a real solution to the problem that is on the net – the spammers. We need some kind of a larger change to the way people think, to make spam stop working at all, then the spammers will go away. As long as ppl fall for spam links then the spammers will be here.