On AI Crawlers

The OpenAI web crawler has been met with some controversy. The reaction was mostly to try to block it from accessing content, esp. content behind an ineffective paywall. Let's explore whether that is even needed.

Initially, I only wanted to argue the philosophical reason why blocking may be bad for society at large. While writing, I realised that if crawlers played by the rules (of copyright and licensing), the now-blocked information would not be used anyway. Therefore this discussion is very limited.

Still, there are a few interesting facts to explore: does it make sense technically and is it needed legally?

What is this Crawler?

A crawler, in most basic terms, is a piece of software that surfs the web to find content. It takes the content that it finds, puts it into a database, looks for links in that content, puts those into kind of a todo list and then goes to the next link in this todo list. From then on, it’s rinse, repeat.
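
To make that loop concrete, here is a minimal sketch in Python. It is not how any particular crawler is built. The names crawl, LinkExtractor and the fetch parameter are my own, and fetch is passed in as a function so the example runs against a small made-up site instead of the real web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML or None (assumed interface)."""
    database = {}              # url -> content that was found
    todo = deque([start_url])  # the "todo list" of links still to visit
    seen = {start_url}         # avoid queueing the same link twice
    while todo and len(database) < max_pages:
        url = todo.popleft()
        html = fetch(url)
        if html is None:
            continue
        database[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return database


# A made-up three-page site standing in for the web.
site = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a>',
    "https://example.com/b": "no links here",
}
pages = crawl("https://example.com/", site.get)
print(sorted(pages))
```

A real crawler would swap site.get for an HTTP client and add error handling, politeness delays and a robots.txt check, but the rinse-and-repeat core stays the same.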

What happens with the content is up to the operator of the crawler. The most famous of those crawlers is Googlebot, which finds content for Google’s search engine. Bing has one as well. There are a few (thousand) others, some of which have legitimate use cases, while others don’t. Now OpenAI has joined the party with a crawler aiming to enhance its language models with more up-to-date information.

Why OpenAI’s crawler is different

You may ask, what makes OpenAI’s crawler different to Google’s or Bing’s? When you search, Google and Bing ultimately just show you links to the documents containing the information you’re looking for. You still have to read that document and may have to pay for it, if it’s behind a paywall.

OpenAI is different in the sense that it has already read the document for you and will adjust its reply to your question accordingly. No need to go to the document and pay if necessary. Of course that has everybody rattled, especially news outlets and magazines that rely on either subscriptions or ad revenue. This distinction becomes important later, when we discuss licenses.

To further understand the issue, you need to know that those paywalls are often ineffective against crawlers, and that this is by design. The content may only be invisible to you, but is fully readable by crawlers. In some cases it might be as simple as going to the Dev Tools of your browser and toggling visibility for this content. Websites do that because the whole content boosts their visibility in a search engine. The better your content, the higher you rank in Google’s and Bing’s listings.

Since the search listing only shows a preview text, which may be the freely available introduction paragraph and can be controlled to a certain degree by the content provider, it’s a win-win.

The philosophical question

My initial reaction to the blocking was along the lines of “great, now the AIs are only trained on racist and alternate truth content and become a useless cesspool of misinformation”.

First of all, there is no correlation between content being blocked for crawlers and its societal value. Second, a provider like OpenAI already has to guard against misinformation and hate speech. Third, if they deem a content source trustworthy enough to use it for training their AI, then they have an incentive to seek a way to integrate this source in a sustainable way (for the source and for them).

Verdict: from this point of view, blocking the crawler doesn’t really matter. Let’s move on…

Why it’s likely in vain

Let’s look at the technical side of blocking: the most popular way to block crawlers is to use something called robots.txt. This file sits at the root of a website (e.g. https://jjkress.com/robots.txt) and defines what the crawler should explore or not. The legit crawlers will obey this file, but there is no stopping them if they wanted to see everything (Think about it. You’re explicitly pointing to the boxes the crawler should not open. If the crawler isn’t legit, it will crawl through those boxes first 😉 ).
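
Here is a small sketch of what “obeying the file” looks like from the crawler’s side, using the robots.txt parser from Python’s standard library. The rules are made up for illustration. GPTBot is, as far as I know, the user-agent token OpenAI publishes for its crawler; the helper name may_fetch is mine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out GPTBot entirely, keep everyone
# else away from /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """What a well-behaved crawler asks itself before every request."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("GPTBot", "https://example.com/article"))
print(may_fetch("Googlebot", "https://example.com/article"))
print(may_fetch("Googlebot", "https://example.com/private/report"))
```

Note that the check runs inside the crawler. Nothing on the server enforces it, which is exactly why a crawler that isn’t legit can skip the call and fetch whatever it likes.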

Crawlers, especially dubious ones, are often designed to behave as human-like as possible. In fact, crawlers often pause between visiting a second, third, and further links on the same domain, both to look more human and to avoid being blocked.

If you want to block beyond that, you have to resort to pretty drastic measures, like blocking known IPs that crawlers are using. This can be expensive and error-prone. Which brings us to knowing something about the crawlers: even for robots.txt you need to tell friend from foe. The only compromise you have is to allow Google and Bing explicitly and deny all the others. That is probably something most can live with, since new search engines don’t show up very often because of Google’s dominance.
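
As a sketch of that compromise, the rules below allow the two big search crawlers by name and deny everyone else, again checked with Python’s standard robots.txt parser. Googlebot and Bingbot are the tokens those crawlers announce themselves with; SomeNewSearchBot is an invented name standing in for any newcomer.

```python
from urllib.robotparser import RobotFileParser

# Allowlist policy: named friends get everything, the rest get nothing.
ALLOWLIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ALLOWLIST_ROBOTS_TXT.splitlines())

URL = "https://example.com/article"
for agent in ("Googlebot", "Bingbot", "GPTBot", "SomeNewSearchBot"):
    verdict = "allowed" if rules.can_fetch(agent, URL) else "denied"
    print(f"{agent}: {verdict}")
```

The price of this policy is visible in the last entry: a legitimate newcomer is denied along with everybody else until you add it by hand.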

Which brings us to the most drastic measure: blocking all requests that are not authenticated (i.e. from subscribed users). If that was your idea, bad luck. First of all, unless your name is Financial Times or similar, people are unlikely to find your content beyond word-of-mouth and your revenue tanks. Secondly, as mentioned earlier, a crawler can simulate humans, including logging into a subscription. Whether that impacts the content provider’s business model depends on how the crawler uses the information, but regardless, it might be prohibited by the terms of service, which often disallow automated access by their users.

Verdict: robots.txt is probably the simplest way to block legit crawlers. Anything beyond is not practical. I would not call it a day just yet though…

Why it’s probably unnecessary…

To be honest, the headline should have a question mark, because I’m neither sure, nor a lawyer in copyright and licensing. The question I’m pondering: isn’t this all governed by copyright and the license you attach to the content?

In that case the operator of a crawler should adhere to the licensing when transforming the information in a material way that makes accessing the original information obsolete. The problem is the definition of “transforming in a material way”.

A summary or a citation of specific factoids, e.g. “25% of European citizens”, is and should probably be allowed. Otherwise you could throw out a huge portion of Wikipedia for copyright reasons. Wikipedia doesn’t discourage or forbid citing sources behind a paywall. It does, however, encourage contributors to always look for an alternative source that is freely available. So if a contributor to Wikipedia pays for a Financial Times subscription, they can cite away, linking to the original information. In fact, if you search for Wikipedia’s policy regarding paywalled content, you find that this is even a subject of study for scholars and for their own data science team.

The keyword in the last paragraph is citing. We all have seen the famous and meme-worthy “citation needed” mark in Wikipedia. Every mentioned factoid requires a proper citation. However, if all you want is an overview, you will visit Wikipedia and call it a day. If you want to go deeper, you will sift through the sources, especially for the information that is important to you.

The first step for OpenAI to improve its crawler’s image would be to cite crucial primary sources by default. You can trigger that with the right query, but ChatGPT has been shown to hallucinate sources that sound perfectly legit.

The other sub-question is around licensing. How do you create a licensing model for AI crawlers? If they create perfectly Wikipedia-esque answers with proper citations, what is the difference between an AI and a Wikipedia contributor? Hence each AI needs one subscription to each paywalled source, right?

If they create more elaborate answers that make a visit to the sources obsolete, even for deep research, there are two options:

The LLM operator, e.g. OpenAI, pays a license fee to their sources that may or may not be used in the answer. That would require the operator and the source to negotiate how that could work.
The LLM operator finds a technical way to alter answers depending on the sources the end user has access to, in a “bring your own sources” kind of way. This could either be through authenticating with the source (e.g. OAuth) or an add-on subscription through the LLM provider. The answer would be different for a user who brings a Washington Post subscription than for a user who brings an Economist subscription, and different again for a user who brings no subscription at all. The more subscriptions, the better the answer?
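
A rough sketch of the second option, purely to illustrate the idea: before answering, the operator filters the material down to what this particular user is entitled to. Everything here (the Document type, usable_documents, the sample texts) is invented for the example, and it skips the hard parts such as verifying a subscription via OAuth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    source: str      # publisher the text came from
    paywalled: bool  # True if the publisher requires a subscription
    text: str


def usable_documents(documents, subscriptions):
    """Keep free documents plus paywalled ones the user subscribes to."""
    return [
        doc for doc in documents
        if not doc.paywalled or doc.source in subscriptions
    ]


corpus = [
    Document("Wikipedia", False, "free overview"),
    Document("Washington Post", True, "paywalled analysis"),
    Document("The Economist", True, "paywalled briefing"),
]

for subs in (set(), {"Washington Post"}, {"Washington Post", "The Economist"}):
    sources = [doc.source for doc in usable_documents(corpus, subs)]
    print(sorted(subs), "->", sources)
```

Each user ends up with a different pool of sources behind the answer, which is the “more subscriptions, better answer” effect in miniature.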

We will see how lawyers will figure this out over the next few months or years. To make it more tangible, let’s look at some of the Creative Commons licenses and how I would read them when it comes to LLMs (again, I’m not a lawyer):

CC-BY: As long as you cite it as one of your learning sources, the LLM can answer in every shape or form.
CC-BY-ND: The ND stands for “No Derivatives”, so an LLM could summarise and cite “Wikipedia-style”, but rephrasing would be limited.
CC-BY-NC: The NC stands for “Non Commercial”, so a free ChatGPT could use the information for an answer, while a paid-for version would either have to filter it out or ensure a royalty license is in effect.
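
My reading of these three licenses can be written down as a small filter. This is a sketch of my own interpretation, not legal advice and not anything Creative Commons defines: the names Use and permitted are made up, attribution is assumed to happen in every case, and any license string the function does not recognise is rejected to stay on the safe side.

```python
from dataclasses import dataclass

PUBLIC_DOMAIN = {"CC0", "PUBLIC-DOMAIN"}  # markers treated as unrestricted


@dataclass(frozen=True)
class Use:
    commercial: bool  # is the answer served by a paid-for product?
    rephrases: bool   # does it go beyond summarising and citing?


def permitted(license_id, use):
    """True if the intended use fits my reading of the license."""
    name = license_id.strip().upper()
    if name in PUBLIC_DOMAIN:
        return True
    parts = name.split("-")
    terms = set(parts[1:])
    # Only plain CC licenses built from BY, NC and ND are understood here.
    if parts[0] != "CC" or "BY" not in terms or not terms <= {"BY", "NC", "ND"}:
        return False
    if "NC" in terms and use.commercial:
        return False  # paid product would need a separate royalty license
    if "ND" in terms and use.rephrases:
        return False  # summarise and cite only
    return True


free_summary = Use(commercial=False, rephrases=False)
paid_rewrite = Use(commercial=True, rephrases=True)
for lic in ("CC-BY", "CC-BY-ND", "CC-BY-NC", "All rights reserved"):
    print(lic, permitted(lic, free_summary), permitted(lic, paid_rewrite))
```

The interesting rows are the ones where the two columns differ: the same source is fine for a free summary and off limits for a paid rewrite.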

The rest are combinations of the above or public domain, which is OK for LLMs anyway. As you can see, even seemingly uncritical licenses like CC-BY and CC-BY-ND have legal pitfalls for LLMs. Unless a source has an explicit license allowing commercial and derivative use, you should steer clear of it. The explicit license is important. The way I understand copyright law in most jurisdictions, any and every work is implicitly copyrighted, unless the creator chooses to license it differently. Whether or not they seek litigation when you use their work is up to them, and the lack of an explicit copyright notice won’t save you.

As mentioned earlier, beyond the license there is also the question of terms of service.

Machine-centric licensing

Policy makers and Creative Commons may want to start thinking about machine-centric licensing sooner rather than later. Today’s licenses need at least a clarification, if not separate terms altogether.

The tech scene may want to think about an extension or replacement of robots.txt that includes licensing terms for machine-centric usage, e.g. teaching an AI.
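
To show what such an extension might look like, here is a sketch with a directive I made up for this post. AI-License is not part of robots.txt or of any standard I know of, and parse_ai_license is just a toy reader for it.

```python
# Hypothetical robots.txt with an invented "AI-License" directive.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /drafts/
AI-License: CC-BY-NC   # terms for machine-centric usage

User-agent: *
Disallow: /
"""


def parse_ai_license(robots_txt):
    """Map each user agent to the AI-License declared in its group."""
    licenses = {}
    agents = []       # user agents of the group being read
    in_rules = False  # True once the group has left its User-agent lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents = []
                in_rules = False
            agents.append(value)
        else:
            in_rules = True
            if field == "ai-license":
                for agent in agents:
                    licenses[agent] = value
    return licenses


print(parse_ai_license(HYPOTHETICAL_ROBOTS_TXT))
```

Whether the terms live in robots.txt or in a separate file matters less than having one machine-readable place where a crawler can look them up before it reads anything else.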

Theoretically blockchain could be an answer here, but it’s neither practical at the moment, nor bullet-proof against bad actors, so please curb your enthusiasm.

Take away for content providers

If your business model relies on people accessing the content directly, either through ad revenue or subscriptions, blocking crawlers with measures like robots.txt may give you a little bit of relief, but it’s not the be-all and end-all.

You should pay attention to the developments surrounding this topic and maybe choose to join interest groups that deal with the issue. I certainly believe that a collaborative solution might even increase the revenue for content creators. Just don’t get greedy, like the music and film industries.

Take away for LLM operators

If you run your own LLM, you should get familiar with copyright and licensing like yesterday. Understand whether your usage of content constitutes commercial usage or, better, let a lawyer verify it. Form or join interest groups and proactively work on proposals that include the interests of content providers, because it will save you headaches later and might even give you a competitive advantage over less prepared operators.

The last bit is also a danger. If I were in a startup today working on a general LLM, I would not shelve this issue and let OpenAI and others figure it out for me.

Take away for everybody else

It depends on whether you use this type of AI for business or not. If all you do is ask ChatGPT for dad jokes and information you might as well find on Wikipedia or Wolfram Alpha, you can continue without bothering about the ins and outs of it all.

If you rely even somewhat on an AI for your business, things are less laissez-faire. You should be at least vaguely informed about the developments and how they impact your work. You may have to change providers or take other actions. You should also be aware of the content sources your provider uses. The liability question is currently mostly discussed with regard to wrong answers, but what about answers that use illegally obtained information? Can you unwittingly do insider trading with the answer of an AI?

Using an LLM, let alone working on one, is exciting (and funny at times). With all that excitement we should not forget to do our homework with regard to copyright and licensing. Unfortunately, skipping that homework is what always happens when there is a gold rush in the tech scene.

What about the original question of blocking? I would not bother with it. I’d try to understand how such an AI could even add to the bottom line of my business, because they are here to stay.