Can we detect whether your data has been used in LLM training?

We present a method to test whether a specific text corpus appears in a language model’s training data, using statistical probing and verification techniques.

Overview

Whether one can prove or disprove that private data has been used to train an LLM has long been a matter of debate.
In this demo, we show a mechanism to detect whether a particular corpus has been used in training.
While it is hard (perhaps impossible) to show the source from which the data was scraped, it is not that challenging to show that certain data has been used in our