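School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510006
WANG Hong (born 1996), male; research interests: machine learning and deep learning, digital media information security; E-mail: 13719175117@163.com
FANG Yanmei (born 1966), female; research interests: machine learning and deep learning, digital media information security; E-mail: fangym@mail.sysu.edu.cn
Print publication date: 2020-11-25
Received: 2019-07-08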
WANG Hong, FANG Yanmei, HUANG Fangjun. A preliminary study on text style analysis based on information entropy[J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2020, 59(06): 113-125. DOI: 10.13471/j.cnki.acta.snus.2019.07.08.2019A052.
This paper proposes information entropy as a quantitative measure of vocabulary richness and verifies that it does reflect the vocabulary richness of a text. English novels are first divided into four categories: fantasy/science fiction, mystery, humor and satire, and children's literature. The information entropy of each novel in each category is computed, and the four categories are compared using charts to see whether the differences agree with expectations drawn from previous studies of fiction style and from ordinary reading experience. The comparison shows that the entropy of children's literature is generally low, while that of fantasy/science fiction is generally high. This matches previous studies and reading experience, which indicate that fantasy/science fiction has relatively rich vocabulary and children's literature relatively limited vocabulary. Hypothesis testing is then used to verify the differences in entropy between the categories. These results indicate that information entropy can serve as an indicator of vocabulary richness.
Keywords: information entropy; vocabulary richness; stylometry; statistical hypothesis testing
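The abstract describes computing one information entropy value per novel. As a minimal sketch of what such a computation might look like, the Python function below estimates the Shannon entropy $H = -\sum_i p_i \log_2 p_i$ of a text's empirical word distribution, where $p_i$ is the relative frequency of the $i$-th distinct word. The tokenization rule (lowercased runs of letters and apostrophes), the base-2 logarithm, and the name `word_entropy` are assumptions made for illustration; this page does not state the authors' actual preprocessing.

```python
import math
import re
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy, in bits per word, of the empirical word distribution.

    Assumption: words are lowercased runs of letters and apostrophes.
    The original paper's tokenization may differ.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0  # no words, so no uncertainty to measure
    counts = Counter(words)
    total = len(words)
    # H = -sum(p * log2(p)) over the distinct words
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A repetitive text scores lower than a varied one of the same length.
    print(word_entropy("the cat saw the cat saw the cat"))
    print(word_entropy("a quick brown fox jumps over lazy dogs"))
```

Under this sketch, a text that repeats a small set of words yields a lower value than a text of the same length that uses many distinct words, which is the behaviour the abstract relies on when it treats entropy as an indicator of vocabulary richness.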