Deep Learning-Assisted Distribution-Based Statistical Visualization and Analysis of Ensemble Scientific Data
Date
2025
Abstract
To study complex real-world phenomena through computer simulation, scientists often rely on ensemble datasets generated from multiple simulation runs with varying parameter configurations. This process can produce extreme-scale datasets, making traditional data analysis pipelines impractical because of limited I/O bandwidth and disk capacity. Distribution-based data representations have been proposed as a promising solution. Processing data in situ to generate compact distribution-based representations not only alleviates the constraints of limited I/O bandwidth and disk capacity but also enables uncertainty quantification, which mitigates the risk of misinterpretation. Nevertheless, distribution-based methods inherently sacrifice the spatial information of the data samples within each distribution, potentially reducing precision in the data analysis pipeline. To address this issue, we introduce a deep learning model that reconstructs the data volume from its distribution representation. Instead of using a model that predicts a data block directly from its distribution representation, we propose a deep learning model based on the Gumbel-Sinkhorn Neural Network (GSNN) that learns to map samples drawn from a block's distribution to spatial locations within the block. The model supports high-quality downstream data analysis and visualization, provides point-wise uncertainty quantification, and guarantees that the distribution of each reconstructed data block follows the block's distribution representation.
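The sample-to-location mapping described in the abstract can be illustrated with a minimal NumPy sketch of the Gumbel-Sinkhorn relaxation. This is not the thesis's implementation. The score matrix is random and only stands in for what a trained network might output, the samples are drawn from a normal distribution as a stand-in for a block's stored distribution, and every function name (`gumbel_sinkhorn`, `greedy_permutation`, `place_samples`) is hypothetical. The greedy rounding step is a simplification; an exact assignment solver could be substituted.

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis (dims kept)."""
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))


def gumbel_sinkhorn(scores, tau=1.0, n_iters=50, rng=None):
    """Relax a square score matrix into a soft assignment matrix.

    Rows index samples, columns index positions inside the block.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Perturb the scores with Gumbel noise, then scale by the temperature.
    log_alpha = (scores + rng.gumbel(size=scores.shape)) / tau
    # Sinkhorn iterations in log space: alternate row and column normalization.
    for _ in range(n_iters):
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=1)
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=0)
    return np.exp(log_alpha)


def greedy_permutation(soft):
    """Round a soft assignment to a hard permutation (simplified, not optimal).

    Returns perm such that sample i is placed at position perm[i].
    """
    n = soft.shape[0]
    work = soft.astype(float).copy()
    perm = np.empty(n, dtype=int)
    for _ in range(n):
        # Take the largest remaining entry, then retire its row and column.
        i, j = np.unravel_index(np.argmax(work), work.shape)
        perm[i] = j
        work[i, :] = -np.inf
        work[:, j] = -np.inf
    return perm


def place_samples(samples, perm):
    """Write each sample to its assigned position in a flattened block."""
    block = np.empty_like(samples)
    block[perm] = samples
    return block


rng = np.random.default_rng(0)
n = 8                              # a flattened 2x2x2 block
scores = rng.normal(size=(n, n))   # ASSUMPTION: stand-in for a network's output
samples = rng.normal(size=n)       # ASSUMPTION: stand-in for draws from the block's distribution
soft = gumbel_sinkhorn(scores, tau=1.0, rng=rng)
perm = greedy_permutation(soft)
block = place_samples(samples, perm).reshape(2, 2, 2)
print(block)
```

Because the final step only reorders the drawn samples, the reconstructed block contains exactly those values. This is one way to read the abstract's claim that the reconstructed block's distribution follows its distribution representation.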
Description
Keywords
Deep learning, distribution-based representation, in situ data processing, large ensemble data