|This class implements the CHAID algorithm proposed by Kass (1980) for decision tree learning. CHAID consists of three steps: merging, splitting and stopping. A tree is grown by repeatedly using these three steps on each node starting from the root node. CHAID accepts nominal or ordinal categorical predictors only. If predictors are continuous, they have to be transformed into ordinal predictors before tree growing. |
CONVERTING CONTINUOUS PREDICTORS TO ORDINAL :
Continuous predictors are converted to ordinal by binning. The number of bins (K) has to be supplied by the user. Given K, a predictor is split in such a way that all the bins get the same number (more or less) of distinct predictor values. The maximum feature value in each bin is used as a breakpoint.
During the merging step, allowable pairs of categories of a predictor are evaluated for similarity. If the similarity of a pair is above a threshold, the categories constituting the pair are merged into a single category. The process is repeated until there is no pair left having high similarity between its categories. Similarity between categories is evaluated using the p_value
The splitting step selects which predictor to be used to best split the node. Selection is accomplished by comparing the adjusted p_value associated with each predictor. The predictor that has the smallest adjusted p_value is chosen for splitting the node.
The tree growing process stops if any of the following conditions is satisfied :
1. If a node becomes pure; that is, all cases in a node have identical values of the dependent variable, the node will not be split.
2. If all cases in a node have identical values for each predictor, the node will not be split.
3. If the current tree depth reaches the user specified maximum tree depth limit value, the tree growing process will stop.
4. If the size of a node is less than the user-specified minimum node size value, the node will not be split.
p_value CALCULATIONS FOR NOMINAL DEPENDENT VARIABLE:
If the dependent variable is nominal categorical, a contingency (or count) table is formed using classes of Y as columns and categories of the predictor X as rows. The p_value is computed using the entries of this table and Pearson chi-squared statistic, For more details, please see : http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_tree-chaid_pvalue_categorical.htm
p_value CALCULATIONS FOR ORDINAL DEPENDENT VARIABLE:
If the dependent variable Y is categorical ordinal, the null hypothesis of independence of X and Y is tested against the row effects model, with the rows being the categories of X and columns the classes of Y. Again Pearson chi- squared statistic is used (like nominal case) but two sets of expected cell frequencies are calculated. For more details : http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_tree-chaid_pvalue_ordinal.htm
p_value CALCULATIONS FOR CONTINUOUS DEPENDENT VARIABLE:
If the dependent variable Y is continuous, an ANOVA F test is performed that tests if the means of Y for different categories of X are the same. For more details please see : http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_tree-chaid_pvalue_scale.htm. More...