School of Information, Guangdong University of Finance and Economics, Guangzhou 510320, Guangdong, China
Print publication date: 2014; online publication date: 2014-09-25.
YIN Hua, HU Yuping. An imbalanced feature selection algorithm based on random forest[J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2014, 53(5): 59-65.
High-dimensional, imbalanced data poses a challenge for data mining. Traditional feature selection methods assume a balanced class distribution and therefore perform poorly on imbalanced data. To address this problem, a new imbalanced feature selection algorithm, IBRFVS, is constructed using the variable selection mechanism embedded in random forests. IBRFVS builds diverse decision trees on balanced samples of the data and obtains each tree's feature importance measures by cross-validation. The final feature importance ranking is the weighted average of the per-tree importance measures, where each tree's weight is determined by the degree of agreement between that tree's predictions and the ensemble prediction. Experiments on UCI data sets comparing random forest hyperparameter settings and preprocessing show that, among four empirical choices of the hyperparameter K, IBRFVS performs most stably when K is the square root of the number of features, and that it outperforms traditional feature selection algorithms.
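The aggregation step described above, weighting each tree's importance vector by its agreement with the ensemble vote, can be sketched as follows. This is a minimal illustration assuming binary 0/1 labels and precomputed per-tree importances and predictions; the function and variable names are hypothetical, not the authors' implementation:

```python
import numpy as np

def aggregate_importance(tree_importances, tree_preds):
    """Sketch of the IBRFVS aggregation step (hypothetical names):
    weight each tree by its agreement with the ensemble majority
    vote, then take the weighted average of per-tree importances."""
    imp = np.asarray(tree_importances, dtype=float)   # (n_trees, n_features)
    preds = np.asarray(tree_preds, dtype=int)         # (n_trees, n_samples), 0/1
    # Ensemble prediction: majority vote over trees for each sample.
    ensemble = (preds.mean(axis=0) >= 0.5).astype(int)
    # Tree weight: fraction of samples where the tree agrees with the vote.
    weights = (preds == ensemble).mean(axis=1)
    weights = weights / weights.sum()                 # normalize to sum to 1
    # Final importance: weighted average of per-tree importance vectors.
    final = weights @ imp
    order = np.argsort(-final)                        # most important feature first
    return final, order
```

In the paper, the per-tree importance measures come from cross-validation on balanced samples; here they are taken as given inputs so that only the weighting and averaging logic is shown.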
Keywords: imbalanced data; high-dimensional data; feature selection; random forest
