Data Portraits: Recording Foundation Model Training Data

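Summary

Foundation models are trained on ever larger and more opaque datasets.
These models are now central to building AI systems, yet it can be hard to answer a simple question: did the model already encounter a given example during training?
We therefore propose the widespread adoption of Data Portraits: artifacts that record training data and enable downstream inspection.
We first outline the properties such an artifact should have and discuss how existing solutions can be used to increase transparency.
We then propose and implement a solution based on data sketching, with an emphasis on fast, space-efficient queries.
Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack).
We show that our solution can answer questions about test set leakage and model plagiarism.
Our tool is lightweight and fast, with an overhead of only 3% of the dataset size.
We release a live interface to our tools at https://dataportraits.org/ and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.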

Abstract (original)

Foundation models are trained on increasingly immense and opaque datasets. Even while these models are now key in AI system building, it can be difficult to answer the straightforward question: has the model already encountered a given example during training? We therefore propose a widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space efficient querying. Using our tools, we document a popular language modeling corpus (The Pile) and a recently released code modeling dataset (The Stack). We show that our solution enables answering questions about test set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release a live interface of our tools at https://dataportraits.org/ and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.
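
To make the idea of a sketch-based membership query concrete, here is a minimal illustration in Python. The abstract does not say which sketch the authors use, so this assumes a Bloom filter over character n-grams. The names `BloomFilter`, `build_sketch`, `overlap`, and the `width` and `error_rate` parameters are all hypothetical and are not taken from the authors' tool. It is a sketch under those assumptions, not a reproduction of their implementation or of its reported overhead.

```python
import hashlib
import math
from typing import Iterable, Iterator, List


class BloomFilter:
    """Minimal Bloom filter: a compact set sketch with no false negatives."""

    def __init__(self, capacity: int, error_rate: float = 0.001) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def char_ngrams(text: str, width: int) -> List[str]:
    """All overlapping character n-grams of the given width."""
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def build_sketch(documents: Iterable[str], width: int = 20, error_rate: float = 0.001) -> BloomFilter:
    """Index every n-gram of every document (assumed indexing scheme)."""
    docs = list(documents)
    # Upper bound on distinct n-grams, used only to size the filter.
    total = sum(max(0, len(doc) - width + 1) for doc in docs)
    sketch = BloomFilter(capacity=max(1, total), error_rate=error_rate)
    for doc in docs:
        for gram in char_ngrams(doc, width):
            sketch.add(gram)
    return sketch


def overlap(sketch: BloomFilter, query: str, width: int = 20) -> float:
    """Fraction of the query's n-grams that the sketch reports as seen."""
    grams = char_ngrams(query, width)
    if not grams:
        return 0.0  # Query shorter than one n-gram: nothing to test.
    return sum(gram in sketch for gram in grams) / len(grams)


corpus = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "Foundation models are trained on increasingly immense datasets.",
]
sketch = build_sketch(corpus)
seen_score = overlap(sketch, "quick brown fox jumps over the lazy dog")
unseen_score = overlap(sketch, "an entirely different sentence about weather")
```

Because a Bloom filter never produces false negatives, any query copied verbatim from the indexed text scores 1.0. Unseen text can occasionally register spurious hits, at a rate governed by `error_rate`, which is the trade-off that buys the small memory footprint.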

arXiv information

Authors	Marc Marone, Benjamin Van Durme
Published	2023-12-14 16:55:42+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
