方瓊瑤 (Fang, Chiung-Yao); 吳孟倫 (Wu, Meng-Luen); 俞柏丞 (Yu, Po-Cheng)
2024-12-17; 2024-08-05; 2024
https://etds.lib.ntnu.edu.tw/thesis/detail/0f289209c0a2f624e8c5fd75bc828314/
http://rportal.lib.ntnu.edu.tw/handle/20.500.12235/123724

In recent years, rapid advances in natural language processing and image processing have given rise to numerous applications. With smartphones becoming an everyday photography tool, this study proposes a deep learning-based photography guidance system. Combining natural language processing and image processing techniques, the system offers users emotionally and aesthetically meaningful suggestions while they shoot: it provides guidance through textual critiques and aesthetic scores, helping users improve their photography skills, capture the beauty of a scene, and tell emotional stories.

The photography guidance system consists of two subsystems: an aesthetic scoring subsystem that outputs a score and an aesthetic critique subsystem that outputs text. The scoring subsystem adopts a multi-scale image quality assessment model as the study's objective reference for evaluating images. The critique subsystem uses an Encoder-Decoder text generation model: SwinV2 serves as the Encoder to extract image features, GPT-2 serves as the Decoder to learn textual features, and a cross-attention mechanism inside the decoder fuses the heterogeneous features before a critique is generated. Because cross-attention alone cannot fuse heterogeneous features effectively, this study introduces a Self-Resurrecting Activation Unit (SRAU) to control what is learned from the heterogeneous features. In addition, the Multi-Layer Perceptron (MLP) in the GPT-2 block cannot adequately process complex feature information, so this study replaces it with a Feedforward Network with Gaussian Error Gated Linear Units (FFN_GEGLU) to improve the model's learning.

To address the shortage of training data, this study uses a weakly labeled dataset collected from the web, whose textual comments often contain errors. Two methods raise the dataset's quality: the weakly labeled data are collected, organized, and cleaned, and high-quality data are added to training, with data augmentation used to enlarge the high-quality set. These steps are combined into a single high-quality dataset for training and testing. Results show that the proposed photography guidance system outperforms the baseline model on 33 of 35 evaluation metrics; across the five aesthetic aspects, 94% of the metrics exceed the baseline, confirming its effectiveness.
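As a rough illustration of the fusion step described in the abstract, the following PyTorch sketch shows how decoder hidden states can attend to SwinV2 patch features through cross-attention. The class name, dimensions, and residual/normalization layout are illustrative assumptions, not the thesis's actual implementation, and the SRAU gating is deliberately not sketched because its exact formulation is specific to this work.

```python
# Minimal sketch: image features from the encoder act as keys/values for a
# cross-attention layer inside a GPT-2-style decoder block. Illustrative only.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses image (encoder) features into text (decoder) hidden states."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden, image_feats):
        # text_hidden:  (B, T, d_model) from the decoder's self-attention sublayer
        # image_feats:  (B, P, d_model) patch features from the SwinV2 encoder
        fused, _ = self.attn(query=text_hidden, key=image_feats, value=image_feats)
        return self.norm(text_hidden + fused)  # residual connection, then LayerNorm
```

In a system of this kind, such a layer would typically sit between the self-attention and feed-forward sublayers of each decoder block, so the text stream can absorb visual information at every layer.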
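The FFN_GEGLU replacement for the GPT-2 MLP can likewise be sketched, assuming the standard GEGLU formulation (a GELU-gated branch multiplied elementwise with a linear branch, then projected back to the model width); the layer widths below are assumptions for illustration.

```python
# Hedged sketch of a GEGLU-style feed-forward network used in place of the
# usual two-layer GELU MLP in a transformer block. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNGEGLU(nn.Module):
    def __init__(self, d_model=768, d_ff=3072):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff)   # branch passed through GELU
        self.w_value = nn.Linear(d_model, d_ff)  # linear branch
        self.w_out = nn.Linear(d_ff, d_model)    # projection back to d_model

    def forward(self, x):
        # GEGLU(x) = GELU(x W_gate) * (x W_value), then project back
        return self.w_out(F.gelu(self.w_gate(x)) * self.w_value(x))
```

Compared with the plain MLP, the gated branch lets the network modulate which features pass through the feed-forward path, which is the motivation the abstract cites for the swap.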
Keywords: 拍攝指引系統 (Photography Guidance System); 影像美學描述 (Aesthetic Image Captioning); 自然語言處理 (Natural Language Processing); 異質性特徵融合 (Heterogeneous Features Fusion); 影像美學評估 (Image Aesthetics Assessment); 電腦視覺 (Computer Vision)
Title: 基於深度學習之攝影指引系統──多面相評論和評分 (Deep Learning-Based Photography Guidance System: Multi-Aspect Reviews and Ratings)
Type: 學術論文 (Academic thesis)