Development of environmental management lexicon based on new word discovery and its empirical application

  • Abstract: As the number of environmental policies and regulations in China continues to grow, collating, summarizing, analyzing, and interpreting them by purely manual means has become increasingly difficult. It is therefore of great significance to use computer technologies such as text mining to support information extraction, content analysis, and intelligent management of environmental policies and regulations. Accurate word segmentation (tokenization) is a prerequisite for all text mining functions. To improve the segmentation of policy texts, environmental policies and regulations published on the official websites of China's ecological and environmental departments at all levels were collected as the corpus, and an environmental management professional lexicon was built by combining new word discovery algorithms with manual supplementation and correction. The empirical results showed that adding the professional lexicon raised the segmentation accuracy of policy texts from 72.6% to 94.1% and reduced the misclassification rate of automatic policy text classification based on a support vector machine model by 22.7%. In addition, word frequency statistics and keyword extraction performed after adding the lexicon provided more comprehensive and more timely statistical information for environmental policy analysis.
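The abstract does not say which new word discovery algorithm was used. As an illustration only, the sketch below assumes one common unsupervised approach: character n-grams are scored by internal cohesion (pointwise mutual information) and by boundary freedom (entropy of the neighboring characters), and n-grams that pass both thresholds become lexicon candidates for manual review. The function name `discover_new_words`, the thresholds, and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def discover_new_words(texts, max_len=4, min_count=3, min_pmi=1.0, min_entropy=0.5):
    """Return {candidate: (count, cohesion, freedom)} for character n-grams.

    Assumed scoring (not confirmed by the paper):
      cohesion = minimum PMI over every two-part split of the n-gram
      freedom  = min(left-neighbor entropy, right-neighbor entropy)
    """
    ngram_counts = Counter()
    left_neighbors = defaultdict(Counter)
    right_neighbors = defaultdict(Counter)
    total_chars = 0

    for text in texts:
        total_chars += len(text)
        for i in range(len(text)):
            for n in range(1, max_len + 1):
                if i + n > len(text):
                    break
                gram = text[i:i + n]
                ngram_counts[gram] += 1
                if i > 0:
                    left_neighbors[gram][text[i - 1]] += 1
                if i + n < len(text):
                    right_neighbors[gram][text[i + n]] += 1

    def prob(gram):
        # Rough estimate: every n-gram count is normalized by the corpus length.
        return ngram_counts[gram] / total_chars

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    candidates = {}
    for gram, count in ngram_counts.items():
        if len(gram) < 2 or count < min_count:
            continue
        cohesion = min(
            math.log(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        freedom = min(entropy(left_neighbors[gram]), entropy(right_neighbors[gram]))
        if cohesion >= min_pmi and freedom >= min_entropy:
            candidates[gram] = (count, cohesion, freedom)
    return candidates


# Toy corpus (hypothetical): the term "排污许可" appears in varied contexts.
corpus = [
    "加强排污许可管理",
    "完善排污许可制度",
    "落实排污许可要求",
    "实施排污许可证核发",
]
found = discover_new_words(corpus)
print(sorted(found))  # ['排污许可']
```

In this toy corpus the fragments "排污" and "许可" are rejected because each is always followed or preceded by the same character, so its boundary entropy is zero, while the full term "排污许可" has varied neighbors on both sides. Candidates accepted this way would still need the manual supplementation and correction step that the abstract describes before being added to a segmenter's user dictionary.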

     
