A Study on Handwritten Data Field Extraction from Form Documents (表單文件手寫資料欄位擷取之研究)
Date
2007
Abstract
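This study investigates the automated processing of form documents and proposes solutions to three problems in form processing: classifying handwritten fields, extracting handwritten fields, and extracting handwritten data. In the handwritten field extraction stage, the size, aspect ratio, global structural characteristics, and directional structural features of the objects in a form serve as classification features. To make these structural features easier to obtain, the study uses image coding to convert the blank form image into a simplified structure graph. To distinguish name fields (labels) from fill-in fields that contain descriptive text, the horizontal and vertical pixel projections of each field region are analyzed together with the distribution, size, and character spacing of the descriptive text.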
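The pixel projections mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a field region is already available as a binary grid (1 for ink, 0 for background), and the names `projection_profiles`, `runs_of_ink`, and `gaps_between` are hypothetical helpers introduced only for this example.

```python
def projection_profiles(region):
    """Return (row_sums, col_sums) for a binary region (1 = ink, 0 = background)."""
    row_sums = [sum(row) for row in region]          # horizontal projection
    col_sums = [sum(col) for col in zip(*region)]    # vertical projection
    return row_sums, col_sums


def runs_of_ink(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, where the profile exceeds threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def gaps_between(runs):
    """Return the widths of the blank gaps between consecutive runs."""
    return [b_start - a_end for (_, a_end), (b_start, _) in zip(runs, runs[1:])]


# Assumed toy region: two 2x2 glyphs separated by one blank column.
region = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
row_sums, col_sums = projection_profiles(region)
glyph_columns = runs_of_ink(col_sums)   # column extent of each glyph
spacing = gaps_between(glyph_columns)   # blank columns between glyphs
```

Run widths and gap widths of this kind are one plausible way to measure text size and spacing inside a field; how the thesis combines them into a decision is not stated in the abstract.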
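In the handwritten data extraction stage, the filled-in form image is first matched against known blank form templates. The handwritten field information of the blank form in the same class is then used to extract the handwritten field data from the filled-in form. Removing the frame lines breaks the handwritten strokes that cross them, so a method is proposed that identifies the segments where strokes intersect the frame lines and reconstructs the handwritten strokes in those segments, repairing the broken strokes.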
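As a rough illustration of stroke repair after frame-line removal, the sketch below pairs ink runs in the row just above a removed horizontal line with nearby runs in the row just below it, then fills the gap by linear interpolation of the run boundaries. This is a simplified stand-in under stated assumptions, not the thesis's algorithm: the binary-grid input, the `max_shift` pairing tolerance, and the function names `ink_runs` and `repair_strokes` are all hypothetical.

```python
def ink_runs(row):
    """Return (start, end) pairs, end exclusive, of consecutive ink pixels in a row."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs


def repair_strokes(image, top, bottom, max_shift=1):
    """Fill rows top..bottom (inclusive), where a horizontal frame line was removed.

    Runs above and below the gap are paired when their column ranges overlap
    or lie within max_shift columns (an assumed tolerance), and the run
    boundaries are interpolated linearly across the gap.
    """
    if top < 1 or bottom + 1 >= len(image) or top > bottom:
        raise ValueError("frame line must have an image row above and below it")
    above = ink_runs(image[top - 1])
    below = ink_runs(image[bottom + 1])
    steps = bottom - top + 2  # row distance from the row above to the row below
    repaired = [row[:] for row in image]  # leave the input untouched
    for a_start, a_end in above:
        for b_start, b_end in below:
            if b_start < a_end + max_shift and a_start < b_end + max_shift:
                for step in range(1, steps):
                    start = a_start + (b_start - a_start) * step // steps
                    end = a_end + (b_end - a_end) * step // steps
                    for x in range(start, end):
                        repaired[top + step - 1][x] = 1
    return repaired


# Assumed toy image: a vertical stroke in column 2 cut by a removed line
# on rows 2-3, plus a stray pixel in column 5 with no partner below.
broken = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
fixed = repair_strokes(broken, top=2, bottom=3)
```

Only runs that have a partner on the other side of the gap are bridged, so ink that merely touches the frame line from one side is left alone.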
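The test images fall into two categories: form images with a simple, ordinary layout and composite form images with a complex layout. The experimental results indicate that the proposed methods perform well on both types of form images.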
Form document analysis is one of the most essential tasks in document analysis and recognition, and the extraction of form fields and of filled-in data are two of its important parts. For form field extraction, the first major task is to classify the preprinted text, lines, check boxes, text boxes, and tables of a form. This thesis proposes a method based on direction-invariant global structural features and direction-dependent structural features to classify the form fields and then extract the fill-in spaces of a form document. Since tables can contain both name fields and data fields, the second task uses a method based on horizontal and vertical color histogram distribution features to segment the fields and extract the data fields. For filled-in data extraction, we propose a method based on a run-based algorithm and the idea of interpolation to detect the character strokes overlapped by the printed form frame and to reconstruct the strokes broken by removing the frame lines. Experiments on different types of form documents showed a 99% recognition rate for form field extraction and a 91% success rate for filled-in data extraction.
Keywords
Form document analysis and recognition, Form field extraction, Filled-in data extraction, Broken stroke reconstruction, Run-based algorithm