Mustafa Suleyman's statements about copyright and AI

In a recent interview with NBC News at the Aspen Ideas Festival, Mustafa Suleyman, CEO of Microsoft AI, made an interesting statement about using publicly available data for AI training purposes. This conversation, moderated by CNBC’s Andrew Ross Sorkin, can be watched on YouTube here.

Before I share my thoughts on what he said, here is the full transcript of the segment in which he describes web content as “freeware”. If you click on the video link above for the interview, the segment starts at the 13:40 mark.

Andrew Ross Sorkin:

Was the idea that the AI machines, the training, that [AI’s] is running out of data to consume? And we’ll have a conversation in a moment about synthetic data, which is actually digital data, mostly reproduced off of data that doesn’t exist. But I want to actually ask about the data that does exist. There are a number of authors here at the Aspen Ideas Festival and a number of journalists as well. It appears that a lot of the information that has been trained on over the years has come from the web. Some of it is the open web and some of it is not.

We’ve heard stories about how OpenAI was turning YouTube videos into transcripts and then training on the transcripts. Who is supposed to get value from that IP? To put it in very blunt terms, have the AI companies effectively stolen the world’s IP?

Mustafa Suleyman:

Yeah, I think that’s a very fair argument. With respect to content that is already on the open web, the social contract of that content since the ’90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That’s been the understanding. There’s a separate category where a website, publisher, or news organization has explicitly said, “Do not scrape or use me for any other reason than indexing me so that other people can find that content.” But that’s the gray area, and I think that’s going to work its way through the courts.

Andrew Ross Sorkin:

What does that mean when you say it’s a gray area?

Mustafa Suleyman:

So far, some people have taken that information. I don’t know who has and who hasn’t, but that’s going to get litigated in the U.S.

Andrew Ross Sorkin:

Right. Do you think that the IP laws should be different? As an author, I can go write a book. In the process of writing my book, I could go to the library or buy 40 other books on Amazon. I could read those books, put them in my bibliography, and hopefully produce a book. I would owe the authors of those 40 books nothing more than whatever it cost me to buy those books. Maybe the library bought them once as well. Nobody ever imagined that I was going to be able to produce a million books every 10 seconds. What should the economics of that be?

Mustafa Suleyman:

You know, look, the economics of information are about to radically change because we can reduce the cost of production of knowledge to zero marginal cost. This is just a very difficult thing for people to understand. But in 15 or 20 years’ time, we will be producing new scientific and cultural knowledge at almost zero marginal cost. It will be widely open-source and available to everybody. I think that is going to be a true inflection point in the history of our species because, collectively, as an organism of humans, we are a knowledge and intellectual production engine. We produce knowledge, and our science makes us better. What we really want in the world, in my opinion, are new engines that can turbocharge discovery and invention.

In answering the questions, Suleyman made three key points:

He claimed that content on the open web has traditionally been considered fair use, allowing anyone to copy, recreate, or reproduce it unless explicitly restricted.
He acknowledged a gray area concerning content with explicit restrictions on scraping, which is still under legal scrutiny and likely to be resolved through litigation.
He predicted that AI can reduce the cost of knowledge production to nearly zero, potentially leading to widely accessible, open-source scientific and cultural knowledge.

So, for what it’s worth, here is how I feel about it.

For starters, the idea that content published on the open web is automatically “freeware” is a bold assertion. As someone who creates content myself, ranging from articles to software code, I understand the instinctual desire to share knowledge and information freely. However, this must be balanced with respect for intellectual property rights, which are crucial for incentivizing creativity and ensuring creators can benefit from their work.

When Suleyman suggests that the social contract of the internet since the ’90s has been one of open sharing akin to freeware, he oversimplifies a complex legal landscape. Yes, the internet has fostered a culture of sharing and collaboration, but this doesn’t nullify copyright protections. In reality, the moment someone creates original content, whether it’s a blog post, a photograph, or a video, it is automatically protected under copyright law in most jurisdictions, including the US.

Your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device.
U.S. Copyright Office

The concept of fair use, which allows for certain limited uses of copyrighted material under specific conditions (such as criticism, commentary, news reporting, etc.), is indeed a legal defense that is determined by courts on a case-by-case basis. It’s not a blanket permission to use any content found on the internet for any purpose without consequence.

But the issue extends beyond legality to ethics and respect for content creators. As someone who respects intellectual property and values the hard work that goes into creating original content, I find it disconcerting to hear prominent figures in the tech industry downplay the importance of respecting copyright. It undermines the efforts of creators who rely on their work for income and recognition.

As for robots.txt, which site owners use to specify which bots can and cannot scrape their content: it is not legally binding. It’s a technical measure rather than a legal one. While it helps establish guidelines for web crawlers, it doesn’t replace the need for legal frameworks that protect intellectual property. Just recently we saw Perplexity (which has a serious plagiarism problem) completely ignore robots.txt and outright lie about its User-Agent. Because of this, Amazon is now investigating them.
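
To make the "technical, not legal" point concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The bot name ExampleAIBot and the example.com URL are made up for illustration; they are not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one AI crawler is blocked, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from a string, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/blog/post"
    # A crawler that identifies itself honestly is told to stay out.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))
    # The same crawler claiming to be a browser gets a green light.
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))
```

Notice that the check runs entirely on the crawler's side and depends on the name the crawler chooses to report. Nothing in the file can stop a bot that skips the check or announces a different User-Agent, which is why robots.txt only works as an honor system.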

The argument that training AI models on copyrighted content falls under fair use is contentious and subject to legal interpretation. While OpenAI is running around and making content deals with every big (not small!) publisher on the planet, some, such as The New York Times, have actually stood up for themselves and taken them to court.

The reason? OpenAI thought that all that content they trained their GPT models on was “freeware”. Mind you, the Times lawsuit is specifically about fair use.

And lastly, Suleyman predicts that AI can reduce the cost of knowledge production to nearly zero.

This prediction, while enticing, appears to be more aspirational than grounded in the current realities of AI development and intellectual property law. Firstly, the concept of reducing knowledge production costs to nearly zero overlooks the substantial and ongoing investments required for developing, training, and maintaining advanced AI systems. These include costs for computational resources, data acquisition, and the skilled labor necessary for refining and updating models. Even if the marginal cost of producing additional information decreases, these initial and ongoing expenses are far from negligible.

Suleyman’s vision assumes a level of openness and accessibility that clashes with the current landscape of intellectual property rights. Intellectual property laws are designed to protect the rights of creators and ensure they can monetize their work.

Needless to say, I am having a difficult time imagining a reality where the company behind a cutting-edge, state-of-the-art AI model gives you full access to it for free. It’s all about money, and it always has been. These companies, including Microsoft, which is an OpenAI partner, are spending tens of billions of dollars to train and deploy these AI models, but somehow the cost of knowledge is going to be reduced to zero.

Who knew? Microsoft is actually a charitable non-profit organization.

Posted by api
