Decision tree methods: applications for classification and prediction
Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets: the training dataset is used to build a decision tree model, and the validation dataset is used to decide on the tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms for developing decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
Key words: decision tree; data mining; classification; prediction
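The training/validation procedure described in the abstract can be illustrated with a minimal sketch. It is not the SPSS or SAS workflow the paper describes. It assumes a simplified CART-style tree (binary splits chosen by Gini impurity, no pruning), uses maximum depth as a stand-in for "tree size", and runs on synthetic data. All function names, the 70/30 split, the 10% label noise, and the depth range are illustrative assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build(rows, labels, max_depth):
    """Grow a tree: a leaf is an int label, a node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return majority
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            build([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build([r for r, _ in right], [y for _, y in right], max_depth - 1))

def predict(tree, row):
    """Walk from the root node to a leaf node and return its label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)

# Synthetic data (assumption): class 1 when both covariates exceed 0.5,
# with roughly 10% of the labels flipped as noise.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(300)]
y = []
for a, b in X:
    true_label = int(a > 0.5 and b > 0.5)
    y.append(1 - true_label if rng.random() < 0.1 else true_label)

# Training data builds the tree; validation data chooses the tree size.
X_train, y_train = X[:210], y[:210]
X_val, y_val = X[210:], y[210:]

scores = {d: accuracy(build(X_train, y_train, d), X_val, y_val) for d in range(7)}
best_depth = max(scores, key=scores.get)  # ties go to the smallest depth
print("validation accuracy by depth:", scores)
print("selected depth:", best_depth)
```

Selecting the size on held-out data rather than on the training data matters because training accuracy never decreases as the tree grows, so it cannot signal overfitting on its own.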