DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

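Summary

Generating tabular data under differential privacy (DP) protection provides theoretical privacy guarantees, but it makes training machine learning models harder, mainly because complex structures must be captured under noisy supervision signals.
Recently, pre-trained large language models (LLMs), even those at the scale of GPT-2, have shown great potential for synthesizing tabular data.
However, their application under DP constraints remains largely unexplored.
This work addresses that gap by applying DP techniques to the generation of synthetic tabular data.
The findings show that LLMs struggle to generate coherent text when fine-tuned with DP, because the privacy budget is allocated inefficiently to non-private elements such as table structure.
To overcome this, the authors propose a two-stage fine-tuning framework for differentially private tabular data generation.
The first stage performs non-private fine-tuning on a pseudo dataset, and the second stage performs DP fine-tuning on the private dataset.
The empirical results show that this approach improves performance across various settings and metrics compared with LLMs fine-tuned directly in DP contexts.
The code and setup are released at https://github.com/tejuafonja/DP-2Stage.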

Summary (original)

Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs), even those at the scale of GPT-2, have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.
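
The two-stage schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a tiny logistic regression stands in for the LLM, stage 2 is assumed to use DP-SGD-style per-example gradient clipping plus Gaussian noise (the abstract only says "DP fine-tuning"), and every function name and hyperparameter below is hypothetical. No privacy accountant is included, so the sketch does not compute an epsilon.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def per_example_grad(w, x, y):
    # Gradient of the logistic loss for a single (x, y) example.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def clip(g, max_norm):
    # Rescale g so that its L2 norm is at most max_norm.
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [v * scale for v in g]


def sgd_epoch(w, data, lr, rng, batch_size=8, clip_norm=None, noise_multiplier=0.0):
    # One pass over the data. With clip_norm=None this is plain SGD;
    # otherwise each example's gradient is clipped and Gaussian noise
    # is added to the summed batch gradient (DP-SGD style, assumed here).
    data = list(data)
    rng.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        total = [0.0] * len(w)
        for x, y in batch:
            g = per_example_grad(w, x, y)
            if clip_norm is not None:
                g = clip(g, clip_norm)
            total = [t + v for t, v in zip(total, g)]
        if clip_norm is not None and noise_multiplier > 0:
            sigma = noise_multiplier * clip_norm
            total = [t + rng.gauss(0.0, sigma) for t in total]
        w = [wi - lr * t / len(batch) for wi, t in zip(w, total)]
    return w


def two_stage_finetune(pseudo, private, rng, lr=0.5, epochs=(5, 5)):
    # Stage 1: non-private training on the pseudo dataset.
    # Stage 2: clipped, noised training on the private dataset.
    w = [0.0] * len(pseudo[0][0])
    for _ in range(epochs[0]):
        w = sgd_epoch(w, pseudo, lr, rng)
    for _ in range(epochs[1]):
        w = sgd_epoch(w, private, lr, rng, clip_norm=1.0, noise_multiplier=1.0)
    return w


def make_toy_rows(n, rng, shift):
    # Hypothetical toy data: a bias feature plus one shifted numeric feature.
    rows = []
    for _ in range(n):
        x = [1.0, rng.uniform(-1.0, 1.0) + shift]
        rows.append((x, 1 if x[1] > shift else 0))
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    pseudo = make_toy_rows(64, rng, shift=0.0)
    private = make_toy_rows(64, rng, shift=0.2)
    print(two_stage_finetune(pseudo, private, rng))
```

The point of the sketch is the ordering: the model first learns the shared structure without any noise, so the noisy second stage only has to adapt it to the private data. For an actual LLM, a DP training library with a privacy accountant would replace the hand-written loop.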

arXiv information

Authors	Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz
Published	2024-12-03 14:10:09+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
