In brief

  • The ruling compels OpenAI to provide 20 million chat logs after months of disputes over privacy, preservation, and scope.
  • Judge Ona T. Wang ruled that the sample size is “proportional” to what the case needs to prove whether ChatGPT outputs reproduced Times content.
  • The case joins a growing wave of copyright challenges aimed at how AI labs source and use training data.

A federal magistrate judge has ordered OpenAI to turn over roughly 20 million de-identified ChatGPT logs to The New York Times and other plaintiffs, deepening the AI development company’s exposure to an array of copyright and data governance disputes.

Issued on Wednesday in New York, the order denies OpenAI’s bid to block the production of user-chat records and directs the company to hand over the logs under a protective framework.

The outcome could shape how tech firms such as OpenAI, Anthropic, and Perplexity source training data, license content, and build guardrails around what their systems can output.

While the court “recognizes that the privacy considerations of OpenAI’s users are sincere,” such considerations “are only one factor in the proportionality analysis, and cannot predominate where there is clear relevance and minimal burden,” U.S. Magistrate Judge Ona T. Wang wrote.

Decrypt has reached out to both parties for comment.

The order stems from the Times’ ongoing lawsuit, filed in December 2023, which alleges that OpenAI’s models were trained on copyrighted news content without permission.

In January last year, OpenAI challenged the NYT’s claims and filed a countersuit, claiming that the publication was not “telling the full story.”

The court later found that the 20 million chat log samples in question are “proportional to the needs of the case” to assess whether ChatGPT outputs copied the NYT’s material.

Over the past year, the dispute has intensified, with plaintiffs pressing for broad access to output data and OpenAI warning that expansive production of these materials would raise privacy concerns and impose operational burdens.

In June, OpenAI faced another setback when the court ordered the company to keep a wide range of ChatGPT user data for the lawsuit, including chats users may have already deleted.

Months later, in October, the dispute resurfaced, with the court flagging OpenAI’s October 20 filing (ECF 679), which challenged the production of the 20 million log sample, and ordering both sides to submit clarifications on why they disagree.

At the time, the judge pressed the parties to explain how the fight related to earlier concerns over deleted logs and whether OpenAI had backed away from prior agreements on what it would turn over.

Late last month, OpenAI filed a formal objection asking the district judge to overturn the magistrate judge’s discovery order.

The company argued that the ruling was “clearly erroneous” and “disproportionate,” in that it would force the company to disclose millions of private user conversations, according to a court document shared with Decrypt by an OpenAI representative.

The dispute arises as part of a broader offensive against AI labs, with authors, news organizations, music publishers, and code repositories seeking to test how far existing copyright law extends when models ingest and reproduce protected material.

Courts across the U.S. and Europe are now sorting through similar claims.
