Smart Lecture Recording System
Files
Date
2017
Authors
Abstract
In recent years, the development of e-learning (or distance learning) has provided equal opportunities to learners everywhere, from highly developed metropolises to remote, less-developed countries. Lecture recording systems play a vital role in collecting content for e-learning. However, as e-learning flourishes, the shortage of digital content and of professional recording crews is becoming a major problem. This research presents a smart lecture recording system that can automatically record content at the same level of quality as a human team, alleviating the shortage of recording personnel.
The proposed smart lecture recording system consists of three principal components: the virtual cameraman, the virtual director, and virtual-real match moving. The first two, the virtual cameraman and the virtual director, run online, whereas virtual-real match moving is an offline post-production component. The virtual cameraman is further divided into three subsystems: the speaker cameraman, the audience cameraman, and the hall cameraman. All of these subsystems operate automatically, performing functions such as target selection, tracking, and special-event detection. The videos captured by the three subsystems are all forwarded to the virtual director, which selects the most representative shot for recording or live broadcasting. We call this function of the virtual director shot selection. Shot selection performs content analysis on the videos from the virtual cameramen and makes its framing decisions through a machine-learning process based on a counter-propagation neural network. In addition, the virtual director has another key function, visual instruction, through which it imitates the communication between a human director and human cameramen in the real world.
After a live lecture recording is completed, additional content or material is sometimes added to the recorded footage to increase its expressiveness and watchability. This research therefore developed a post-production component, called the virtual-real match moving system, for compositing real footage with virtual objects. The system uses a depth camera as a depth-sensing device to help synchronize the real-world color camera with the camera of the virtual world. The virtual-real match moving system involves three major processes: temporal depth fusion, camera tracking, and virtual-real synthesis preview. The depth images acquired by the depth camera are fused over time into a 3D reconstruction of the scene. From the relative pose between the 3D scene structure and the depth camera, the trajectory of the color camera is derived. This trajectory then guides the virtual camera to move in sync with the real camera; virtual objects are projected to generate virtual images, which are superimposed on the real images acquired by the color camera. The resulting images are called virtual-real composite preview images.
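The camera-tracking and compositing steps described above can be sketched in a few lines: once the color camera's trajectory is known, the virtual camera simply adopts the same pose to project virtual objects into the image, and the rendered pixels are overlaid on the real frame. This is a minimal illustration under an assumed pinhole camera model; the function names, intrinsics, and matrices are hypothetical, not the thesis's actual implementation:

```python
import numpy as np

def project(point_3d, pose, K):
    """Project a 3D world point into the image, given the tracked camera
    pose (4x4 world-to-camera transform) and pinhole intrinsics K (3x3)."""
    p = pose @ np.append(point_3d, 1.0)  # transform into camera coordinates
    uvw = K @ p[:3]                      # apply intrinsics
    return uvw[:2] / uvw[2]              # perspective divide -> pixel (u, v)

def composite(real_rgb, virtual_rgb, virtual_mask):
    """Overlay the rendered virtual image on the real frame, producing the
    virtual-real composite preview image."""
    out = real_rgb.copy()
    out[virtual_mask] = virtual_rgb[virtual_mask]  # keep real pixels elsewhere
    return out
```

Because the virtual camera shares the real camera's tracked pose, the projected virtual objects stay locked to the scene as the camera moves, which is the essence of match moving.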
A series of real lecture recording experiments was conducted, and the results show that the proposed smart lecture recording system can approximate the shooting and shot-selection techniques of a real human team. We also believe that the system need not be limited to lecture recording; given appropriate training data, it could also be suitable for recording stage performances, concerts, sporting events, and product launches.
Nowadays, e-learning (or distance learning) provides equal opportunities for learners in locations ranging from highly developed metropolises to remote less-developed countries. Lecture recording systems play a vital role in collecting spoken discourse for e-learning. However, in view of the growing development of e-learning, the lack of content is becoming a problem. This research presents a smart lecture recording (SLR) system that can record orations at the same level of quality as a human team, but with a reduced degree of human involvement. The proposed SLR system is composed of three principal components, referred to as virtual cameraman (VC), virtual director (VD), and virtual-real match moving (VRMM), respectively. The first two components, VC and VD, are online components, whereas the VRMM component is offline. The VC component is further divided into three subsystems: speaker cameraman (SC), audience cameraman (AC), and hall cameraman (HC). All these subsystems are automatic, and can take actions that include target and event detection, tracking, and view searching. The videos taken by these three subsystems are all forwarded to the VD system, in which the representative shot is chosen for recording or direct broadcasting. We refer to this function of the VD system as shot selection. The shot selection function operates based on the content analysis of the videos transmitted from the VC component. The capability of content analysis is pre-trained through a machine-learning process characterized by the counter-propagation neural network. In addition, the VD system possesses another pivotal function of visual instruction, through which it imitates the communication between a human director and human cameramen in the real world. Having completed a live speech recording, it is often necessary to include additional contents or materials in the shot collection of the speech in order to increase its expressivity and vitality. 
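The shot-selection decision described above is learned with a counter-propagation neural network, i.e., a winner-take-all Kohonen layer followed by a Grossberg outstar layer. A minimal sketch follows; the feature vectors, layer sizes, and per-camera score outputs are illustrative assumptions, not the thesis's actual configuration:

```python
import numpy as np

class CounterPropagationNet:
    """Minimal counter-propagation network: a competitive (Kohonen) hidden
    layer followed by a Grossberg outstar output layer."""

    def __init__(self, n_in, n_hidden, n_out, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W = rng.normal(size=(n_hidden, n_in))  # Kohonen weights
        self.U = np.zeros((n_hidden, n_out))        # Grossberg weights

    def _winner(self, x):
        # Competitive layer: the hidden unit closest to x wins.
        return np.argmin(np.linalg.norm(self.W - x, axis=1))

    def train(self, X, Y, epochs=50, alpha=0.2, beta=0.2):
        for _ in range(epochs):
            for x, y in zip(X, Y):
                j = self._winner(x)
                self.W[j] += alpha * (x - self.W[j])  # pull winner toward input
                self.U[j] += beta * (y - self.U[j])   # map winner to target scores
        return self

    def predict(self, x):
        # Output the scores associated with the winning hidden unit.
        return self.U[self._winner(x)]
```

In a shot-selection setting, the input would be a feature vector summarizing the current content of each camera feed (e.g., speaker motion, audience activity), and the output a score per camera; the shot with the highest score is forwarded for recording or broadcast.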
In this context, we develop a post-production component called the virtual-real match moving (VRMM) system for graphic/stereoscopic image composition. The input data to this system are provided by equipment consisting of a color camera and a depth camera. Three major processes are involved in the VRMM component: temporal depth fusion, camera tracking, and virtual-real synthesis preview. During temporal depth fusion, the depth images acquired by the depth camera are fused into a 3D reconstruction of the scene. Based on the reconstructed scene, the pose of the color camera is determined, which is then used to direct a virtual camera to generate synthetic images of a given 3D object model. The generated images are superimposed on the real images acquired by the color camera; the resultant images are called preview images. A series of experiments on real lectures has been conducted. The results show that the proposed SLR system can provide oration records close, to some extent, to those taken by real human teams. We believe that the proposed system need not be limited to live speeches; configured with appropriate training materials, it may also be suitable for recording stage performances, concerts, athletic competitions, and product launches.
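Temporal depth fusion combines many noisy, incomplete depth frames into a more reliable estimate. The sketch below shows the idea in its simplest per-pixel form, averaging frames from a fixed viewpoint while ignoring invalid (zero) readings; the actual system fuses moving-camera depth frames into a full 3D scene reconstruction, so this is only a simplified stand-in:

```python
import numpy as np

def fuse_depth(frames):
    """Fuse a sequence of noisy depth maps by per-pixel averaging,
    ignoring invalid (zero) depth readings."""
    acc = np.zeros(frames[0].shape, dtype=np.float64)
    weight = np.zeros(frames[0].shape, dtype=np.float64)
    for d in frames:
        valid = d > 0            # depth sensors report 0 where measurement failed
        acc[valid] += d[valid]
        weight[valid] += 1.0
    fused = np.zeros_like(acc)
    seen = weight > 0
    fused[seen] = acc[seen] / weight[seen]  # average only observed pixels
    return fused
```

Averaging over time both suppresses sensor noise and fills holes that appear in individual frames, which is what makes the fused depth usable for deriving the color camera's trajectory.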
Description
Keywords
Smart lecture recording system, Virtual cameraman, Virtual director, Virtual-real match moving, Shot selection, Visual instruction, Preview images