
New machine learning algorithm breaks text CAPTCHAs more easily than ever

Algorithm tested against the text CAPTCHA systems used on 33 popular websites.

Academics from the UK and China have developed a new machine learning algorithm that can break text-based CAPTCHA systems faster, with less effort, and with higher accuracy than all previous methods.

This new algorithm, developed by scientists from Lancaster University (UK), Northwest University (China), and Peking University (China), is based on the concept of a GAN, which stands for “Generative Adversarial Network.”

GANs are a special class of artificial intelligence algorithms that are useful in scenarios where the algorithm doesn’t have access to large quantities of training data.

Classic machine learning algorithms usually require millions of data points before they can perform a task with the desired degree of accuracy.

A GAN algorithm has the advantage that it can work with a much smaller batch of initial data points. This is because a GAN uses a so-called “generative” component to produce lookalike data. These “generated” data points are then fed to a “solver” algorithm that tries to guess the output.

As these two GAN components are pitted against each other, the solver gets better, as if it had been trained with millions of data points.

The UK and Chinese academics applied this same concept to breaking text CAPTCHAs, which, in the vast majority of previous research studies, have only been tested with classic machine learning algorithms trained with large quantities of initial data points.
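The adversarial loop described above can be illustrated with a deliberately tiny example. This is a minimal sketch, not the researchers' model: it works on single numbers rather than CAPTCHA images, the "real" data is an assumed Gaussian stand-in, and all names and hyperparameters are hypothetical. The second component, which the GAN literature usually calls the discriminator, learns to tell real samples from generated ones, while the generator learns to fool it.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: x_fake = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.gauss(4.0, 0.5)  # assumed stand-in for one real sample
        z = rng.gauss(0.0, 1.0)       # noise fed to the generator
        x_fake = a * z + b

        # Discriminator step: minimise -log D(x_real) - log(1 - D(x_fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * ((d_real - 1.0) * x_real + d_fake * x_fake)
        c -= lr * ((d_real - 1.0) + d_fake)

        # Generator step: minimise -log D(x_fake) (non-saturating loss).
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (d_fake - 1.0) * w   # gradient of the loss w.r.t. x_fake
        a -= lr * grad_x * z
        b -= lr * grad_x
    return a, b, w, c


if __name__ == "__main__":
    print(train_toy_gan())
```

Toy GANs like this one tend to oscillate rather than settle, so the printed parameters are only a snapshot of the back-and-forth between the two components.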

Researchers argued that in a real-world scenario, an attacker wouldn’t be able to generate millions of CAPTCHAs on a live website or API without being detected and banned.

That’s why, for their research, they used only 500 text CAPTCHAs from each of the 11 text CAPTCHA services used on 32 of the Top 50 Alexa websites.

“It takes up to 2 hours (less than 30 minutes for most of the scheme) to collect 500 captchas and less than 2 hours to label them by one user,” said researchers. “This means that the effort and cost for launching our attack on a particular captcha scheme is low.”

The training data, listed in the table below, included text CAPTCHAs from sites like Wikipedia, Microsoft, eBay, Baidu, Google, Alipay, JD, Qihoo360, Sina, Weibo, and Sohu.

Once they had collected the samples and trained their GAN solvers by generating up to 200,000 “synthetic” CAPTCHAs, the researchers tested their algorithms against other text CAPTCHA systems used across the Internet, which had previously been tested by other researchers in prior academic work.
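To put the figures quoted so far side by side, here is a small back-of-the-envelope calculation. It uses only the numbers given in the article; the variable names are ours, and the hour totals are upper bounds that assume collection and labelling are done one after the other for every scheme.

```python
SCHEMES = 11                    # text CAPTCHA services sampled
REAL_PER_SCHEME = 500           # real CAPTCHAs collected per service
SYNTHETIC_PER_SOLVER = 200_000  # "up to" figure for generated CAPTCHAs
MAX_COLLECT_HOURS = 2           # "up to 2 hours" to collect 500 samples
MAX_LABEL_HOURS = 2             # "less than 2 hours" to label them

real_total = SCHEMES * REAL_PER_SCHEME
synthetic_per_real = SYNTHETIC_PER_SOLVER // REAL_PER_SCHEME
max_hours_per_scheme = MAX_COLLECT_HOURS + MAX_LABEL_HOURS
max_hours_total = SCHEMES * max_hours_per_scheme

print(f"real CAPTCHAs collected in total: {real_total}")
print(f"synthetic samples per real sample: up to {synthetic_per_real}")
print(f"manual effort: under {max_hours_per_scheme} h per scheme, "
      f"under {max_hours_total} h for all schemes")
```

In other words, each hand-labelled CAPTCHA is backed by up to 400 generated ones, which is where the reduction in manual effort comes from.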



“Table 4 [see below] compares our fine-tuned solver to previous attacks,” researchers said. “In this experiment, our approach outperforms all comparative schemes by delivering a significantly higher success rate.”

Researchers said their method was able to solve text CAPTCHAs with 100 percent accuracy on sites like Megaupload, Blizzard, and Authorize.NET. Their method also achieved better accuracy on every other CAPTCHA system used on the other 30 sites they tested, which included the likes of Amazon, Digg, Slashdot, PayPal, Yahoo, and QQ, to name a few.



Besides improved accuracy, researchers said that the solver component of the GAN algorithm they developed was more efficient and cheaper to run than any other approach.

“It can solve a captcha within 0.05 of a second by using a desktop PC,” researchers said.

This means that attackers won’t need to buy and keep paying for expensive cloud computing servers in order to break text CAPTCHAs in real time on websites.
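The quoted solve time translates into throughput as follows. This is a rough sketch that assumes a single solver working through CAPTCHAs one after another and ignores network latency and any rate limiting by the target site; the function name is ours.

```python
SOLVE_TIME_MS = 50  # 0.05 s per CAPTCHA, the figure quoted by the researchers


def solves_per(seconds, solve_time_ms=SOLVE_TIME_MS):
    """Number of CAPTCHAs one sequential solver could finish in `seconds`."""
    return (seconds * 1000) // solve_time_ms


per_second = solves_per(1)
per_hour = solves_per(3600)
print(per_second, per_hour)  # prints "20 72000"
```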

Once an attacker has trained a text CAPTCHA algorithm, they can run it on a regular PC or web server, and launch coordinated DDoS or spam-posting attacks on websites where that CAPTCHA service is in use.

Because the algorithm is also easy to train, attackers who encounter a never-before-seen text CAPTCHA can train it to deal with that scheme as well.

“This is scary because it means that this first security defence of many websites is no longer reliable,” said Dr. Zheng Wang, Senior Lecturer at Lancaster University’s School of Computing and Communications and co-author of the research.

Zheng and his team recommend that website owners implement alternative bot-detection measures that use multiple layers of security, such as a user’s usage patterns, device location, or biometric data.

Earlier this year, Google launched such a service, version 3 of the reCAPTCHA tool, which Google said relies on machine learning algorithms to discern bots from actual users.
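One way to picture such a layered approach is a score that combines several independent signals instead of relying on a single challenge. The sketch below is purely illustrative: the signal names mirror the examples in the recommendation, but the weights, thresholds, and function names are hypothetical and do not come from the researchers or from any real product.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # Each field is a suspicion level in [0, 1]; how it would be measured
    # is outside the scope of this sketch.
    usage_pattern: float
    device_location: float
    biometric: float


# Hypothetical weights; a real system would have to tune or learn these.
WEIGHTS = {"usage_pattern": 0.5, "device_location": 0.2, "biometric": 0.3}


def bot_score(signals: Signals) -> float:
    """Weighted sum of the individual suspicion levels."""
    return sum(weight * getattr(signals, name) for name, weight in WEIGHTS.items())


def decide(signals: Signals, challenge_at: float = 0.4, block_at: float = 0.7) -> str:
    """Map the combined score to an action using illustrative thresholds."""
    score = bot_score(signals)
    if score >= block_at:
        return "block"
    if score >= challenge_at:
        return "challenge"
    return "allow"


print(decide(Signals(usage_pattern=0.1, device_location=0.0, biometric=0.1)))
print(decide(Signals(usage_pattern=0.9, device_location=0.8, biometric=0.9)))
```

The point of the design is that an attacker who defeats one layer, such as a text CAPTCHA, still has to look plausible on the others.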

More details about the researchers’ work can be found in a research paper entitled “Yet Another Text Captcha Solver: A Generative Adversarial Network Based Approach.”