Stealing Books to Teach Reading

What’s the difference between a school district using AI to support the teaching of reading, writing and thinking, and a school district using a stolen car to teach driver’s education? 

Notice how one of these questions elicits a quick, unequivocal answer while the other gets a “wait, what?”

Car theft is obvious, easy to understand, and illegal; even buying a stolen car is against the law. The theft that lies behind AI tools is harder to see, but it’s no less real. Yet we simply ignore it as AI floods into almost every aspect of our information environment, and we rush to barrel it into our schools.

Maybe there are better questions to help us think about what we are doing.

What would you think about English teachers ganging up for a smash-and-grab looting spree at Barnes & Noble? What would parents think? Would they be troubled by the terrible example these teachers are setting for their children, or would they think instead about what their children would gain from all those books? Sure, they don’t want their child to see Ms. Hawkins perp-walked out of the mall in cuffs, but those books are expensive.

New Jersey requires educators to ensure students learn by 2nd grade that “Digital artifacts can be owned by individuals or organizations” and by 5th grade that “Intellectual property rights exist to protect the original works of individuals”. 

By 12th grade, NJ students should know that “there are consequences to utilizing or sharing another’s original works without permission or appropriate credit” (NJ Standards). Yet utilizing another’s work without permission or credit is precisely what AI tools do. By using AI tools powered by LLMs trained on content stolen from authors, artists, composers, illustrators, and writers, or taken without their knowledge, we are ourselves teaching students that there are no consequences for “utilizing … original works without permission or appropriate credit.”

It’s likely that readers who’ve made it to this point are wondering what proof I’m going to offer that AI tools are powered by LLMs trained on stolen content.  Choosing examples from all the evidence available is like choosing just one type of yogurt at Whole Foods.

Behind the slick, clean interface of every AI tool is the mathematical analysis of hundreds of terabytes of language, audio, images, or video, which maps the relationships among the component parts of that content in order to calculate, in response to a prompt, a likely result that appears to make sense. Large language models (LLMs), the engines under the hood of AI tools, are trained on massive datasets of publicly available web-scraped data as well as pirated content from archives and behind paywalls. In short, everything generated by AI is based on something generated by a human.
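To make that concrete, here is a toy sketch of the underlying idea (a deliberately simplified illustration of my own, not how any production LLM actually works): a program that tallies which word follows which in a human-written corpus, then “generates” text purely by recombining those human-written patterns.

```python
# Toy "language model": count which word follows which in human-written
# text, then generate new text by sampling from those counts. Real LLMs
# replace the counts with neural networks trained on vastly more data,
# but every output is still assembled from human-made source material.
import random
from collections import defaultdict

corpus = (
    "the quick brown fox jumps over the lazy dog "
    "the lazy dog sleeps while the quick fox runs"
).split()

# Tally next-word frequencies: follows["the"] -> ["quick", "lazy", ...]
follows = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word].append(next_word)

def generate(start: str, length: int = 8) -> str:
    """Produce text by repeatedly picking a likely next word."""
    words = [start]
    for _ in range(length):
        choices = follows.get(words[-1])
        if not choices:  # dead end: no human text left to draw on
            break
        words.append(random.choice(choices))
    return " ".join(words)

print(generate("the"))  # e.g. "the lazy dog sleeps while the quick fox runs"
```

Nothing in this program’s output exists that wasn’t first written by a person; scaling the technique up changes the sophistication, not that basic dependence.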

How do we know that some of that content was stolen?

You could throw titles or authors’ names into the LibGen database that Meta used to train its AI, or search for writers, directors, and actors to see the movies and TV shows it used. Cambridge University Press complained about Meta’s “piracy to harvest content for its AI development,” and the Society of Authors said that “Meta needs to compensate the rightsholders of all the works it has been exploiting.”

You could read a summary of the personal and class-action lawsuits against AI companies from the Authors Guild, or look at a visualization of them published by Wired. Or, if you’re in the mood for a deep dive, skim through the first couple of pages of the New York Times complaint filed against OpenAI.

There are many questions about the efficacy, pedagogy, and social implications of AI to consider; this moral question deserves to be on that list.
