Text Readability Based on Representation Learning

Date

2020

Abstract

Text readability refers to the degree to which a text can be understood by its readers: the higher the readability of a text, the better the comprehension and learning retention that can be achieved. To help readers digest and comprehend documents, researchers have long sought readability models that can automatically and accurately estimate text readability. Conventional approaches to readability classification infer a readability model from a set of handcrafted features defined a priori and computed from the training documents, along with the readability levels of those documents. However, developing handcrafted features is not only labor-intensive and time-consuming but also demands specialized expertise. With recent advances in representation learning, salient features can be extracted from documents efficiently without recourse to such expertise, which offers a promising avenue of research on readability classification. In view of this, this study proposes several novel readability models based on representation learning techniques, capable of effectively analyzing documents that belong to different domains and cover a wide variety of topics. Compared with a baseline using a traditional model, the new model improves accuracy by 39.55% to reach 78.45%. We then combine different kinds of representation learning algorithms with general linguistic features, and accuracy improves by an even larger margin of 40.95% to reach 79.85%. Finally, this study also explores character-level representations to develop a novel readability model, which assesses the readability of Chinese texts with 78.66% accuracy. These results indicate that the readability features developed in this study can be used both to train a readability model for leveling domain-specific texts and in combination with more common linguistic features to enhance the efficacy of the model. As future work, we will explore additional training methods for constructing the semantic space and combine text summarization techniques, in order to distill salient aspects of text content that can further enhance the effectiveness of a readability model.
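The abstract does not spell out the pipeline, but its central idea, concatenating learned document representations with handcrafted linguistic features before training a readability classifier, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses TF-IDF plus truncated SVD as a stand-in LSA representation, two toy linguistic features, and placeholder documents and readability levels.

```python
# Minimal sketch (not the thesis pipeline): LSA-style document representations
# combined with a few handcrafted linguistic features, fed to a linear
# classifier that predicts readability levels. All data and hyperparameters
# below are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: documents paired with readability levels (e.g., grade bands).
docs = [
    "The cat sat on the mat. It was warm.",
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "Quantum entanglement correlates measurement outcomes of separated particles.",
]
levels = [1, 3, 5]

def linguistic_features(text: str) -> np.ndarray:
    """Two simple handcrafted features: average sentence length (in tokens)
    and average word length (in characters)."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    tokens = text.split()
    avg_sent_len = len(tokens) / max(len(sentences), 1)
    avg_word_len = sum(len(t) for t in tokens) / max(len(tokens), 1)
    return np.array([avg_sent_len, avg_word_len])

# Representation side: TF-IDF followed by truncated SVD (classic LSA).
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
doc_vectors = lsa.fit_transform(docs)

# Concatenate learned representations with the handcrafted linguistic features.
ling = np.vstack([linguistic_features(d) for d in docs])
features = np.hstack([doc_vectors, ling])

# Train a readability classifier on the combined feature space.
clf = LogisticRegression(max_iter=1000).fit(features, levels)

new_doc = "Light from the sun helps plants make food."
new_features = np.hstack([lsa.transform([new_doc])[0], linguistic_features(new_doc)])
print("Predicted readability level:", clf.predict([new_features])[0])
```

In the same spirit, the LSA stand-in could be swapped for Word2vec, fastText, StarSpace, or BERT document vectors (as listed in the keywords) without changing the downstream concatenation and classification steps.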

Keywords

Readability, Latent Semantic Analysis, Word2vec, fastText, StarSpace, Convolutional Neural Network, BERT
