Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing

Date

2021

Abstract

Representing large-scale scientific datasets with distributions is an emerging and promising approach. Distribution-based approaches typically transform a scientific dataset into many distributions, each computed from a small number of samples. Most existing parallel algorithms focus on efficiently fitting a single distribution to many input samples, which does not suit large-scale scientific data processing because it cannot use the computing resources efficiently. Histograms and Gaussian mixture models (GMMs) are the most popular distribution representations for scientific datasets. We therefore propose multi-set histogram and GMM modeling algorithms for large-scale scientific data processing. Our algorithms are built on data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases in scientific data processing.
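
To make the multi-set idea concrete, the following is a minimal sketch of how many small histograms could be built in one pass using only map, sort, and reduce-by-key steps. It is an illustration under assumptions, not the algorithm from this work: the function name `multi_set_histograms`, its parameters, the fixed shared bin range, and the choice of these three primitives are all hypothetical, and plain Python stands in for a real data-parallel backend.

```python
from itertools import groupby


def multi_set_histograms(values, set_ids, num_sets, num_bins, lo, hi):
    """Build one histogram per set with map, sort, and reduce-by-key steps.

    Hypothetical helper for illustration only; assumes every set shares
    the same bin range [lo, hi].
    """
    width = (hi - lo) / num_bins

    # Map: turn every sample into a flat key, set_id * num_bins + bin index.
    def to_key(value, set_id):
        bin_idx = int((value - lo) / width)
        # Clamp so that out-of-range values and value == hi stay in a valid bin.
        bin_idx = min(max(bin_idx, 0), num_bins - 1)
        return set_id * num_bins + bin_idx

    keys = list(map(to_key, values, set_ids))

    # Sort: bring equal keys next to each other.
    keys.sort()

    # Reduce-by-key: the run length of each distinct key is its bin count.
    counts = [[0] * num_bins for _ in range(num_sets)]
    for key, run in groupby(keys):
        counts[key // num_bins][key % num_bins] = sum(1 for _ in run)
    return counts


# Two sets of three samples each, two bins over [0, 1].
values = [0.1, 0.4, 0.9, 0.2, 0.6, 0.7]
set_ids = [0, 0, 0, 1, 1, 1]
hists = multi_set_histograms(values, set_ids, num_sets=2, num_bins=2, lo=0.0, hi=1.0)
print(hists)
```

Because every sample is processed independently in the map step and all sets share a single sort and a single reduction, the available parallelism scales with the total number of samples rather than with the size of any one set.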

Keywords

big data processing, scientific data, parallel computing, data-parallel primitives, large-scale data processing, scientific dataset, distribution-based approach, parallel algorithm
