A Data Synthesis Method Using Generative Adversarial Networks with Progressive Replacement

Date

2023

Abstract

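Machine learning is now applied across a wide range of fields, such as customer preference analysis, biometric recognition, and even healthcare, and these applications undoubtedly help people make more accurate decisions. However, as datasets containing personal information grow larger and more detailed, techniques and regulations for data privacy protection have not advanced at a comparable pace. The best-known line of data anonymization research is led by k-anonymity, but k-anonymity is limited to lower-dimensional datasets and is vulnerable to background-knowledge attacks. Differential privacy, an anonymization technique that has drawn attention over the past decade, provides a strong mathematical privacy guarantee that ensures no record in a dataset can be re-identified, yet in practice it struggles to balance privacy protection against data utility.

The Greek philosopher Plutarch used the well-known legend of the Ship of Theseus to ask whether the ship is still the same ship after its planks have been replaced one by one. Taking the English political philosopher Hobbes's conclusion on this question as our inspiration, we propose the Theseus Data Synthesis Approach. By repeatedly replacing part of the data in a dataset until every record has been replaced, the approach produces a synthetic dataset that resembles the original while ensuring that no record in the synthetic dataset comes from the original dataset, thereby avoiding the possibility of re-identification. We also propose models for reasoning about the security and similarity of the Theseus replacement mechanism.

Data synthesis can be used to create missing data or to enlarge a dataset, and in recent years it has also been used in medical data research, where privacy protection is especially needed. Following prior work, we generate synthetic datasets with generative adversarial networks (GANs), and we rely on the inherent randomness of the GAN in place of the noise that related work adds to the generator's loss function, which improves similarity to the original dataset. Finally, we analyze the similarity between the synthetic and original datasets and the utility of the synthetic dataset, and examine how different replacement ratios and related work differ in generation quality. We find that synthetic datasets produced with smaller replacement ratios are more similar to the original dataset, outperform related work, and also yield better predictive quality.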
Machine learning algorithms are used for tasks such as customer preference prediction, facial and voice recognition, and even medical diagnosis, helping people make more accurate decisions. However, as the personal information in electronic data becomes more massive and detailed, progress in data privacy protection has not kept up with rapidly improving data curation techniques. Earlier data anonymization methods such as k-anonymity are vulnerable to background-knowledge attacks and are limited to low-dimensional, centralized input datasets. Differential privacy provides a strong mathematical guarantee for algorithms that analyze datasets and thereby prevents any individual who contributed data from being identified. Despite its rigorous model for ensuring that data providers cannot be identified, differential privacy faces a major obstacle in real-world applications because of the tradeoff between data quality and privacy.

The Greek philosopher Plutarch asked whether the ship of Theseus would remain the same ship if it were entirely replaced, piece by piece. Among the many discussions of this question, we draw inspiration from the conclusion of the English philosopher Thomas Hobbes to build our own mechanism, the Theseus Data Synthesis Approach (TDSA). We generate synthetic data by replacing part of the records at a time until no record from the original dataset remains. This prevents data in the original dataset from being re-identified from the released synthetic dataset. Furthermore, we propose a similarity and security scheme for our replacement mechanism.

Data synthesis can be used to construct missing values in a dataset or for data augmentation. In recent years it has also been applied to relatively sensitive medical datasets. We generate our synthetic data with a GAN framework based on previous research, but we rely on the randomness of the GAN itself rather than adding noise through an additional loss function as other works do, in order to preserve the similarity of the generated synthetic data to the original dataset.

We analyze the quality and utility of the synthetic datasets under different settings of our proposed mechanism and compare them with related work. We conclude that settings with a small replacement proportion yield higher-quality synthetic data.
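
The replacement loop described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation. The names `fit_and_sample` and `theseus_replace` are hypothetical, the generator is a per-column Gaussian stand-in rather than a GAN, and refitting the generator on the current mixed dataset at every round is an assumption, since the abstract does not state what each round trains on. The sketch only shows the idea of swapping out a fixed proportion of records per round until no original record remains.

```python
import random
from statistics import mean, stdev


def fit_and_sample(records, n, rng):
    """Stand-in generator (assumption): sample each column independently
    from a Gaussian fitted to `records`. The thesis uses a GAN here."""
    columns = list(zip(*records))
    params = [(mean(col), stdev(col)) for col in columns]
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]


def theseus_replace(original, ratio, seed=0):
    """Replace a `ratio` share of the records per round until every
    original record has been swapped for a synthetic one."""
    rng = random.Random(seed)
    current = list(original)
    remaining = list(range(len(original)))  # positions still holding original records
    rng.shuffle(remaining)
    batch = max(1, int(len(original) * ratio))
    rounds = 0
    while remaining:
        picked, remaining = remaining[:batch], remaining[batch:]
        # Assumption: the generator is refit on the current mixed dataset each round.
        synthetic = fit_and_sample(current, len(picked), rng)
        for idx, record in zip(picked, synthetic):
            current[idx] = record
        rounds += 1
    return current, rounds


# Toy numeric dataset with two columns (illustrative values only).
data_rng = random.Random(42)
original = [(data_rng.gauss(50, 10), data_rng.gauss(0, 1)) for _ in range(200)]
synthetic, rounds = theseus_replace(original, ratio=0.25)
overlap = set(synthetic) & set(original)
print(rounds, len(synthetic), len(overlap))
```

A smaller `ratio` means more rounds, each fitted to data that is still mostly original, which matches the abstract's intuition that small replacement proportions keep the synthetic data closer to the source. How the thesis measures similarity and security is not reproduced here.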

Keywords

Data Anonymization, Data Synthesis
