How Companies May Be 'Acting Irresponsibly' When Training AI Models
A recent report from the New York Times claims that tech companies building large language models or other foundation models have perhaps been playing a little fast and loose with copyright law and user agreements. For example, the report says that OpenAI used Whisper, its speech-to-text tool, to mine millions of hours of conversations from YouTube videos to train the system that powers ChatGPT. This seems to be in violation of YouTube's user agreement.
There are three important issues to consider here:
First, it looks like companies might be acting irresponsibly in their secrecy about the data they use. They are certainly not asking for permission to use the data that they scrape from the internet, and they seem to have a good idea of how close to the line of “fair use” they are skating and when they are crossing it.
Second, from a regulatory perspective, we need to examine issues of copyright law, plagiarism, and fair use. You can make the argument that this is not a pure copyright issue because AI systems are using data to learn, not to copy and reproduce the data. However, you can also argue that there must be a discussion about fair use guidelines. And given that AI models work better with high-quality documents that are usually proprietary, that discussion must include compensation for the creators of the original content.
And finally, given that human beings learn faster, better, and smarter with less volume, perhaps we should consider alternative routes to building intelligence. The brute force training methods behind LLMs are time-consuming, require a lot of energy, and are super expensive. A more supervised approach that incorporates machine teaching with machine learning might get us to where we want to go without infringing on the rights of others. Of course, that might require a more mindful approach to what we want our machines to learn, much like how we decide what and how to teach our kids.
Kristian Hammond
Bill and Cathy Osborn Professor of Computer Science
Director of the Center for Advancing Safety of Machine Intelligence (CASMI)
Director of the Master of Science in Artificial Intelligence (MSAI) Program