WANG Hong, FANG Yanmei, HUANG Fangjun. A preliminary study on text style analysis based on information entropy [J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2020, 59(06): 113-125. DOI: 10.13471/j.cnki.acta.snus.2019.07.08.2019A052.
A preliminary study on text style analysis based on information entropy
It is proposed and verified that information entropy can serve as a quantitative measure of lexical richness. First, English novels are divided into four groups: magic/science fiction, mystery novels, humorous satirical novels, and children's literature. The authors then calculate the information entropy of each novel, compare the entropy of the four groups graphically, and check whether the differences among the categories are consistent with their expectations. The results show that children's literature has the lowest information entropy on average, while magic/science fiction generally has higher entropy. This agrees with previous studies and with ordinary reading experience, according to which magic/science fiction has higher vocabulary richness and children's literature has lower vocabulary richness. Finally, the authors use hypothesis testing to verify the differences in entropy among the categories, and conclude that information entropy can be used as an indicator of vocabulary richness.
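The per-novel entropy calculation described above can be sketched as follows. The abstract does not state how the text is tokenized or which logarithm base is used, so this sketch assumes a word-unigram distribution, a simple lowercase regular-expression tokenizer, and base-2 logarithms (bits per word). The function name `word_entropy` is hypothetical and is not taken from the paper.

```python
import math
import re
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy, in bits per word, of the word-frequency distribution.

    Assumptions (not from the paper): words are lowercase runs of letters
    or apostrophes, and each distinct word is one symbol.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0  # empty text carries no information
    counts = Counter(words)
    total = len(words)
    # H = -sum(p * log2(p)) over the relative frequency p of each distinct word
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A repetitive text scores lower than a varied one of the same length.
    print(word_entropy("the cat the cat the cat"))
    print(word_entropy("the cat saw a red bird"))
```

Under these assumptions, a text that keeps reusing the same few words yields a lower value than a text of equal length with more varied vocabulary, which is the property the abstract relies on when comparing categories.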
Keywords
information entropy; vocabulary richness; stylometry; hypothesis testing