It’s common practice to use web scraping (copying) to fetch the content of other sites in order to reuse the information. Sometimes this is well intentioned, for example when people curate only the best content from multiple sites. There’s also the case of news aggregators, which combine articles from multiple sources to give their users a single, complete feed.

The Problem

However, there are also people who are not so well intentioned and simply want to duplicate other sites, or parts of them. These people put out clones of a site and try to make a profit off somebody else’s work.

Another example would be those who parse the entire content of a listing site or marketplace and build lists of valuable information that they can sell later. One specific case is pretty common: scraping listing sites to collect sellers’ names, phone numbers, and locations, then selling the lists to spammers.

General Solutions

There are a few solutions that help us protect our content from malicious users. None of them is perfect, and neither is the one I’m going to describe a bit later, but the thinking is that by implementing them we can at least stop a big chunk of the scripters who would otherwise scrape our content. In other words, the general solutions make it harder for bots to access the content.

Using a CAPTCHA

Yes, that’s right: we can use CAPTCHA systems for more than just forms. If we want to protect against automated tools that copy our content extremely fast, we can set up a page containing nothing but a CAPTCHA challenge and show it after a set number of hits.

This is not an uncommon practice at all; it’s used by anybody who handles large amounts of unwanted traffic. Two obvious examples are Facebook and Google, which show a CAPTCHA after a few failed login attempts. This protects their users’ accounts from brute-force attacks (automatically trying password combinations until one fits).

Well, we can use the same principle to handle connections coming from the same IP, for example. Let’s say we have a site with 1000 articles on it and somebody wants to scrape all of them. Say we display 10 articles per page, and to get the full content of an article they need to visit it individually. That adds up to 1100 requests to accomplish the task. Without any protection, that can be done in a few minutes.

That’s 1000 article requests plus 100 requests to go through the listing pages and collect the articles’ URLs.

If we set the CAPTCHA to appear after 5 requests made within the same second, a normal user would probably never trigger it. Even if they open all 10 articles on a page in 10 tabs, without an automated tool it takes them more than a second to open even 5 of them.

In a real-world scenario we’d need to fine-tune this limit based on user activity: if normal users hit it, we increase it; if they never hit it, we can decrease it to catch more bots; and so on.

So now we’re forcing the scraping bot to make at most 5 requests per second, and then solve a CAPTCHA to continue making requests.
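To make the idea concrete, here’s a minimal sketch of what that check could look like as an Express middleware. The in-memory hits map, the 5-per-second threshold and the /captcha route are assumptions for illustration, not part of any particular product; a production setup would track hits in a shared store and remember who already solved the challenge.

const express = require('express');
const app = express();

// Naive in-memory hit counter keyed by IP. In production you'd use a
// shared store (e.g. Redis) and mark an IP as trusted once it solves
// the CAPTCHA.
const hits = new Map();

app.use((req, res, next) => {
    const now = Date.now();
    const entry = hits.get(req.ip) || { count: 0, windowStart: now };

    // Start a fresh one-second window when the old one expires.
    if (now - entry.windowStart > 1000) {
        entry.count = 0;
        entry.windowStart = now;
    }

    entry.count += 1;
    hits.set(req.ip, entry);

    // More than 5 requests within the same second: send the client to a
    // page that contains nothing but the CAPTCHA challenge.
    if (entry.count > 5) {
        return res.redirect('/captcha');
    }

    next();
});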

Cloudflare uses a very similar technique to protect sites against large bursts of traffic, because that kind of volume looks like something a human wouldn’t be able to produce.

This is not a perfect solution, but it’s a good step to scare away some of the people that want to scrape your content.

Throttling Requests

This is a common technique for APIs, but it can be implemented for anything that handles a lot of requests. They usually have a set limit on the number of requests, very similar to the CAPTCHA approach, but instead of displaying a CAPTCHA they throttle the client, which means they will not respond to any more requests until the limit window expires.

So, if the limit is 100 requests per minute and you make 101 requests in a minute, the server will wait until the next minute to serve the 101st request and everything that comes after it.
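A bare-bones sketch of that behaviour, again as a hypothetical Express middleware (the 100-per-minute limit and the in-memory counters map are just for illustration; many real APIs respond with a 429 status instead of holding the request open):

// Track how many requests each IP has made in the current window.
const counters = new Map();
const LIMIT = 100;          // requests allowed per window
const WINDOW = 60 * 1000;   // one minute, in milliseconds

app.use((req, res, next) => {
    const now = Date.now();
    const entry = counters.get(req.ip) || { count: 0, resetAt: now + WINDOW };

    // Start a fresh window once the previous one has expired.
    if (now > entry.resetAt) {
        entry.count = 0;
        entry.resetAt = now + WINDOW;
    }

    entry.count += 1;
    counters.set(req.ip, entry);

    // Over the limit: hold the request until the window resets instead
    // of serving it right away.
    if (entry.count > LIMIT) {
        return setTimeout(next, entry.resetAt - now);
    }

    next();
});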

Obfuscating the Content

Sometimes it’s enough just to obfuscate the content and decode it on the client side using JavaScript. For example, you could output your content Base64-encoded and decode it in the browser after the page has been served. People who scrape the contents of other sites usually visit the site normally in a browser, select some text, and then look for it in the source code. If they can’t find it there, they have to decide whether they want to scrape it as is and decode it, or just move on. This is a common technique among quiz developers, who don’t want their users to be able to find the answer by peeking at the code.
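As a rough sketch (the data-content attribute and the sample string are made up for illustration), the backend could print the text Base64-encoded into an attribute, and a small script could decode it once the page is running:

<!-- Rendered by the backend: the real text only exists Base64-encoded,
     so it never shows up as plain text in the page source. -->
<div id="article" data-content="SGVsbG8sIHRoaXMgaXMgdGhlIGFydGljbGUu"></div>

<script>
    const el = document.getElementById('article');

    // atob() turns the Base64 string back into raw bytes; TextDecoder
    // handles UTF-8 content correctly.
    const bytes = Uint8Array.from(atob(el.dataset.content), c => c.charCodeAt(0));
    el.textContent = new TextDecoder().decode(bytes);
</script>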

Lazy Loading the Content

Apps that rely heavily on JavaScript don’t load the content with the first request. They load it once the JS layer (framework/library) has loaded and is able to fetch and display it, usually by making additional XHR (AJAX) requests, more often than not triggered by user interactions (click, hover, etc.).
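A minimal sketch of that pattern, assuming a made-up /api/article/42 endpoint that returns the article as JSON:

<div id="article"></div>

<script>
    // The initial HTML contains no article text at all; the content is
    // fetched only after the page and its JavaScript have loaded.
    window.addEventListener('DOMContentLoaded', () => {
        fetch('/api/article/42')
            .then(response => response.json())
            .then(article => {
                document.getElementById('article').textContent = article.body;
            });
    });
</script>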

Most of the people who write scrapers for a living don’t know how to handle this. As I mentioned in the previous section, they usually copy content displayed on the page and then look for it in the source. If it’s not there, they need to realize that the content is loaded later through a separate request, know how to use DevTools to inspect those requests, and then figure out how to duplicate the right one. That’s not easy for somebody who isn’t a developer, and they often aren’t.

As you can see, this is already a pretty good way of loading content if you want to protect it against scrapers. However, I’m going to introduce another layer of protection to add on top of it.

A Different Approach

For sites that don’t include the content in the initial response and rely on JavaScript to load it separately, most people will use DevTools to look at the list of requests, identify the ones they’re interested in, and then try to duplicate them.

However, we have the ability to tell when a user has opened DevTools, and we can do something about it. My suggestion relies on detecting that moment, swapping out the content, and stopping the execution of the scripts.

It turns out that this is pretty simple to do actually. Let’s take a look at my approach:

const devtools = require('devtools-detect/index');

if (window.location.hostname === 'www.codepicky.com') {
    setInterval(() => {
        if (devtools.open) {
            document.body.innerHTML = ':)';
            for (;;) find();
        }
    }, 1e3);
}

The first line loads an NPM package made available by Sindre Sorhus, here: https://github.com/sindresorhus/devtools-detect

You can install it via NPM:

npm install --save devtools-detect

To use it, simply load it using require():

const devtools = require('devtools-detect/index');

Or you can simply copy his script from the index.js file found in the repo above.

The first if makes sure we only run the code under our specific domain name, because we wouldn’t want this to run in our local environments.

Next, we’re setting up an interval that runs the contained code every second.

At each iteration we check whether DevTools is open, using the script we’ve just installed. If it is, we wipe the entire markup of the page. This is just a quick way of removing everything; we could just as well clear any timeouts or intervals manually (there’s a sketch of that a bit further down). We can do anything there.

The next line,

for (;;) find();

is just the shortest annoying piece of code I could think of. It calls the browser’s find() function forever, keeping the CPU busy, which translates into the tab freezing entirely.

Running:

for (;;) debugger;

would probably be equally annoying.
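And, as mentioned above, wiping the body is just one possible reaction. Here’s a sketch of an alternative, assuming hypothetical intervalIds and timeoutIds arrays in which your own code keeps track of the timers it registered:

if (devtools.open) {
    // Stop everything our own code scheduled, so nothing keeps
    // fetching or refreshing content in the background.
    intervalIds.forEach(id => clearInterval(id));
    timeoutIds.forEach(id => clearTimeout(id));

    // Then swap the content and stall the tab, same as before.
    document.body.innerHTML = ':)';
    for (;;) debugger;
}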

If you like this approach and want to implement it on your sites, I suggest that you hide the entire code, including the contents of the devtools-detect/index.js file, inside other big chunks of JS, and obfuscate it as well.

Conclusion

As I said earlier, this is not a perfect solution. Advanced devs will be able to find ways around it, but if you need a quick way to scare off most of the people looking to scrape your site, this will help.

If you know of any other ways of protecting content against DevTools usage, I’d love to hear about them.