--- title: "Creating FFTs with FFTrees()" author: "Nathaniel Phillips and Hansjörg Neth" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: fft.bib csl: apa.csl vignette: > %\VignetteIndexEntry{Creating FFTs with FFTrees()} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo = FALSE} knitr::opts_chunk$set(collapse = FALSE, comment = "#>", prompt = FALSE, tidy = FALSE, echo = TRUE, message = FALSE, warning = FALSE, # Default figure options: dpi = 100, fig.align = 'center', fig.height = 6.0, fig.width = 6.5, out.width = "580px") ``` ```{r load-pkg, echo = FALSE, message = FALSE, results = 'hide'} library(FFTrees) ``` ## Details on the `FFTrees()` function This vignette starts by building a fast-and-frugal tree (FFT) from the `heartdisease` data ---\ also used in the [Tutorial: FFTs for heart disease](FFTrees_heart.html) and @phillips2017FFTrees\ --- but then explores additional aspects of the `FFTrees()` function. The goal of the `FFTrees()` function is to create FFTs from data and record all details of the problem specification, tree definitions, the classification process, and performance measures in an `FFTrees` object. As `FFTrees()` can handle two sets of data (for training vs. testing) and creates a range of FFTs, each with distinct process and performance characteristics, evaluating the function may take some time (typically a few seconds) and the structure of the resulting `FFTrees` object is quite complex. But as `FFTrees()` is at the heart of the **FFTrees** package, it pays to understand its arguments and the structure of an\ `FFTrees` object. ### Example: Predicting heart disease An FFT generally addresses binary classification problems: It attempts to classify the outcomes of a _criterion_ variable into one of two classes (i.e., True or False) based on a range of potential _predictor_ variables (aka. cues or features). A corresponding problem from the domain of clinical diagnostics is: - Which patient has heart disease, given some data on patients' general health and diagnostic symptoms? ```{r image, fig.align = "center", out.width="250px", echo = FALSE} knitr::include_graphics("CoronaryArtery.jpg") ``` To address this problem, the **FFTrees** package includes the `heartdisease` data. But rather than using the full dataset to fit our FFTs, we have split the data into a training set (`heart.train`), and test set (`heart.test`). Here is a peak at the corresponding data frames: ```{r heart-data} # Training data: head(heart.train) # Testing data: head(heart.test) ``` The critical dependent variable (or binary criterion variable) is `diagnosis`. This variable indicates whether a patient has heart disease (`diagnosis = TRUE`) or not (`diagnosis = FALSE`). All other variables in the dataset (e.g., `sex`, `age`, and several biological measurements) can be used as predictors (aka. cues). ### Creating trees with `FFTrees()` To illustrate the difference between fitting and prediction, we will _train_ the FFTs on `heart.train`, and _test_ their prediction performance in `heart.test`. Note that we can also automate the training / test split using the `train.p` argument in `FFTrees()`. Setting `train.p` will randomly split `train.p`\% of the original data into a training set. To create a set of FFTs, we use the `FFTrees()` function to create a new `FFTrees` object called `heart.fft`. 
To create a set of FFTs, we use the `FFTrees()` function to create a new `FFTrees` object called `heart.fft`. Here, we specify `diagnosis` as the binary criterion (or dependent variable), and include all other (independent) variables with `formula = diagnosis ~ .`:

```{r heart-fft, message = FALSE, results = 'hide'}
# Create an FFTrees object called heart.fft predicting diagnosis:
heart.fft <- FFTrees(formula = diagnosis ~.,
                     data = heart.train,
                     data.test = heart.test)
```

If we wanted to consider only specific variables, like\ `sex` and\ `age`, for our trees, we could specify `formula = diagnosis ~ age + sex`.

### Elements of an `FFTrees` object

The `FFTrees()` function returns an object of the `FFTrees` class. There are many elements in an `FFTrees` object. We can obtain these elements by printing their names:

```{r print-names}
# See the elements of an FFTrees object:
names(heart.fft)
```

Inspecting these elements provides a wealth of information on the range of FFTs contained in the current `FFTrees` object:

1. `criterion_name`: The name of the (predicted) binary criterion variable.

2. `cue_names`: The names of all potential cue variables (predictors) in the data.

3. `formula`: The formula used to create the `FFTrees` object.

4. `trees`: Information on all trees contained in the object, with list elements that specify their number\ `n`, the `best` tree, as well as tree `definitions`, verbal descriptions (`inwords`), `decisions`, and the performance characteristics (`stats` and `level_stats`) of each FFT.

5. `data`: The datasets used to `train` and `test` the FFTs (if applicable).

6. `params`: The parameters used for constructing FFTs (currently `r length(names(heart.fft$params))` parameters).

7. `competition`: Models and statistics of alternative classification algorithms (for `test`, `train`, and `models`).

8. `cues`: Information on all cue variables (predictors), with list elements that specify their `thresholds` and performance `stats` when training FFTs.

### Basic performance characteristics of FFTs

We can view basic information about the `FFTrees` object by printing its name. As the default tree construction algorithm\ `ifan` creates multiple trees with different exit structures, an `FFTrees` object typically contains many FFTs.

When printing an `FFTrees` object, we automatically see the performance on the _training_ data (i.e., for _fitting_, rather than _prediction_) and obtain information about the tree with the highest value of the `goal` statistic. By default, the `goal` is set to weighted accuracy\ `wacc`:

```{r print-fftrees-object-data-train}
# Training performance of the best tree (on "train" data, given current goal):
heart.fft  # same as: print(heart.fft, data = "train")
```

Before interpreting any model output, we need to carefully distinguish between an FFT's "Training" performance (for fitting training data) and its "Prediction" performance (for new test data). Unless we explicitly ask for `print(heart.fft, data = "test")`, the output of printing `heart.fft` will report on the fitting phase (i.e., `data = "train"` by default). To see the corresponding prediction performance, we could alternatively ask for:

```{r print-fftrees-object-data-test, eval = FALSE}
# Prediction performance of the best training tree (on "test" data):
print(heart.fft, data = "test")
```

When evaluating an FFT for either training or test data, we obtain a wide range of measures. After some general information on the `FFTrees` object, we see a verbal **Definition** of the best FFT (FFT\ #1). Key information for evaluating an FFT's performance is contained in the **Accuracy** panel (for either training or prediction).
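As an aside, both the `goal` used to rank trees and the sensitivity weight `sens.w` (which enters `wacc`) are arguments of `FFTrees()` and can be changed when constructing the trees. Here is a minimal sketch that weights sensitivity more heavily than specificity (the object name `heart_sens_fft` is purely illustrative):

```{r heart-fft-sens-w, eval = FALSE}
# Sketch: rank trees by wacc, but weight sensitivity at .70
# (rather than the default sens.w = .50):
heart_sens_fft <- FFTrees(formula = diagnosis ~ .,
                          data = heart.train,
                          data.test = heart.test,
                          goal = "wacc",
                          sens.w = .70)
```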
Here is a description of the frequency counts and corresponding statistics reported in the **Accuracy** panel:

| Statistic | Long name | Definition |
|:----- |:--------- |:----------------------------------|
| _Frequencies_: |     |     |
| `hi`   | Number of hits | $N(\text{Decision} = 1 \land \text{Truth} = 1)$ |
| `mi`   | Number of misses | $N(\text{Decision} = 0 \land \text{Truth} = 1)$ |
| `fa`   | Number of false-alarms | $N(\text{Decision} = 1 \land \text{Truth} = 0)$ |
| `cr`   | Number of correct rejections | $N(\text{Decision} = 0 \land \text{Truth} = 0)$ |
| `N`    | Number of cases | The total number of cases considered. |
| _Probabilities_: |     |     |
| `acc`  | Accuracy | The percentage of cases that were correctly classified. |
| `ppv`  | Positive predictive value | The percentage (or conditional probability) of positive decisions being correct (i.e., True\ + cases). |
| `npv`  | Negative predictive value | The percentage (or conditional probability) of negative decisions being correct (i.e., True\ - cases). |
| `sens` | Sensitivity | The percentage (or conditional probability) of true positive cases being correctly classified. |
| `spec` | Specificity | The percentage (or conditional probability) of true negative cases being correctly classified. |
| `wacc` | Weighted accuracy | The weighted average of sensitivity and specificity, where sensitivity is weighted by `sens.w` (by default, `sens.w = .50`). |
| _Frugality_: |     |     |
| `mcu`  | Mean cues used | The average number of cues needed to classify a case (a measure of a tree's frugality). |
| `pci`  | Percent cues ignored | The percentage of cue information that was _ignored_ when classifying cases with a given tree, i.e., `1 - mcu / cues.n`, where `cues.n` is the total number of cues in the data. |

**Table 1**: Description of FFTs' basic frequencies and corresponding accuracy and speed/frugality statistics.

To obtain the same information for _another_ FFT of an\ `FFTrees` object, we can call `print()` with a numeric `tree` parameter. For instance, the following expression would provide the basic performance characteristics of Tree\ 3:

```{r print-fftrees-object-else, eval = FALSE}
# Performance of alternative FFTs (Tree 3) in an FFTrees object:
print(heart.fft, tree = 3, data = "test")
```

Alternatively, we could visualize the same tree and its performance characteristics by calling `plot(heart.fft, tree = 3, data = "test")`.

See the [Accuracy statistics](FFTrees_accuracy_statistics.html) vignette for details on the accuracy statistics used throughout the **FFTrees** package.

### Cue performance information

`FFTrees()` determines a decision threshold for each cue (regardless of whether or not the cue is actually used in a tree) that maximizes the `goal` value of that cue when it is applied to the entire training dataset. The resulting cue accuracy statistics are stored in the `cues$stats` element of an `FFTrees` object. If the object has test data, we can also see the marginal cue accuracies in the test data (using the thresholds calculated from the training data). Here are the cue-level thresholds and training accuracies:

```{r fft-cues-stats-train}
# Decision thresholds and marginal classification training accuracies for each cue:
heart.fft$cues$stats$train
```
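If we only cared about the most accurate individual cues, we could sort this data frame by one of its accuracy columns. Here is a minimal sketch, assuming (as in the output above) that the data frame contains `cue` and `wacc` columns:

```{r fft-cues-top, eval = FALSE}
# Sketch: rank cues by their marginal training accuracy (wacc),
# assuming the `cue` and `wacc` columns shown in the output above:
cue_stats <- heart.fft$cues$stats$train
cue_stats[order(cue_stats$wacc, decreasing = TRUE), c("cue", "wacc")]
```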
We can also visualize the cue accuracies for the training data (in ROC\ space, i.e., showing each cue's hit rate by its false alarm rate) by calling `plot()` with the `what = "cues"` argument. This will show the sensitivities and specificities for each cue, with the top five cues highlighted and listed:

```{r fft-plot-cues, fig.width = 6, fig.height = 6, out.width = "500px", fig.align = 'center'}
# Visualize individual cue accuracies:
plot(heart.fft, what = "cues", main = "Cue accuracy for heartdisease")
```

See the [Plotting FFTrees objects](FFTrees_plot.html) vignette for more details on visualizing cue accuracies and FFTs.

### Tree definitions

The `trees$definitions` data frame contains the definitions (cues, classes, exits, thresholds, and directions) of all trees in an\ `FFTrees` object. The combination of these five pieces of information (as well as their order) defines and describes _how_ a tree makes decisions:

```{r fft-definitions}
# See the definitions of all trees:
heart.fft$trees$definitions
```

The separate levels (nodes) within a tree definition are separated by semicolons\ (`;`). To understand how to read these definitions, let's start by understanding Tree\ 1, the tree with the highest weighted accuracy during training:

- _Nodes, classes, and cues_:
    - Tree\ 1 has three cues in the order `thal`, `cp`, `ca`.
    - The classes of the cues are\ `c` (character), `c`, and\ `n` (numeric).

- _Exits, directions, and thresholds_:
    - The decision _exits_ for the cues are 1\ (positive), 0\ (negative), and\ 0.5 (both positive and negative). This means that the first cue only makes positive decisions, the second cue only makes negative decisions, and the third cue makes _both_ positive and negative decisions.
    - The decision _thresholds_ are `rd` and `fd` for the first cue, `a` for the second cue, and\ `0` for the third cue.
    - The cue _directions_ for predicting the criterion variable are\ `=` for the first cue, `=` for the second cue, and\ `>` for the third cue.

Importantly, these cue directions indicate how the tree _would_ make positive decisions _if_ it had a positive exit (i.e., predicted a Signal) for that cue. If the tree has a positive exit for the given cue, then cases that satisfy this threshold and direction are classified as having a _positive_ criterion value. However, if the tree has a negative exit for a given cue, then cases that do _not_ satisfy the given threshold and direction are classified as _negative_. Thus, the directions for cues with negative exits need to be negated (e.g., `=`\ becomes\ `!=`, `>`\ becomes\ `<=`, etc.).

From this information, we can understand and verbalize Tree\ 1 as follows:

1. If `thal` is equal to either\ `rd` or\ `fd`, predict a _positive_ criterion value.

2. Otherwise, if `cp` is _not_ equal to\ `a`, predict a _negative_ value.

3. Otherwise, if `ca` is greater than\ 0, predict a _positive_ value, else predict a _negative_ value.

Note that `heart.fft$trees$definitions` also reveals that Tree\ 3 and Tree\ 5 use the same cues and cue directions as Tree\ 1. However, they differ in the exit structures of the first and second cues (or nodes).

Applying the `inwords()` function to an `FFTrees` object returns a verbal description of a tree. For instance, to obtain a verbal description of the tree with the highest training accuracy (i.e., Tree\ #1), we can ask for:

```{r fft-inwords}
# Describe the best training tree (i.e., Tree #1):
inwords(heart.fft, tree = 1)
```

### Accuracy statistics of FFTs

The performance of an FFT on a specific dataset is characterized by a range of accuracy statistics.
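As a reminder, the `wacc` statistic that `FFTrees()` maximizes by default is simply the `sens.w`-weighted average of sensitivity and specificity. Here is a minimal arithmetic sketch (with made-up values, not results from `heart.fft`):

```{r wacc-sketch, eval = FALSE}
# Sketch: weighted accuracy as a sens.w-weighted mean of sens and spec
# (illustrative values only, not taken from heart.fft):
sens   <- .80  # sensitivity: p(Decision = 1 | Truth = 1)
spec   <- .75  # specificity: p(Decision = 0 | Truth = 0)
sens.w <- .50  # sensitivity weight (the FFTrees default)

wacc <- sens.w * sens + (1 - sens.w) * spec
wacc  # .5 * .80 + .5 * .75 = .775
```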
Here are the training statistics for all trees in `heart.fft`:

```{r fft-stats-train}
# Training statistics for all trees:
heart.fft$trees$stats$train
```

The corresponding statistics for the test data are:

```{r fft-stats-test}
# Testing statistics for all trees:
heart.fft$trees$stats$test
```

See the [Accuracy statistics](FFTrees_accuracy_statistics.html) vignette for the definitions of accuracy statistics used throughout the **FFTrees** package.

### Classification decisions

The `decisions` list contains the raw classification decisions for each tree and each case, as well as detailed information on the costs of each classification. For instance, here are the decisions made by Tree\ 1 on the training data:

```{r fft-decisions}
# Inspect the decisions of Tree 1:
heart.fft$trees$decisions$train$tree_1
```

### Predicting new cases with `predict()`

Once we have created an `FFTrees` object, we can use it to predict new data using `predict()`. First, we'll use the `heart.fft` object to make predictions for cases\ 1 through\ 10 of the `heartdisease` dataset. By default, the tree with the best training\ `wacc` value is used to predict the value of the binary criterion variable:

```{r fft-predict-class}
# Predict classes for new data from the best training tree:
predict(heart.fft, newdata = heartdisease[1:10, ])
```

If we wanted to use an alternative FFT of an `FFTrees` object for predicting the criterion outcomes of new data, we could specify its number in the `tree` argument of the `predict()` function.

To predict class probabilities, we can include the `type = "prob"` argument. This will return a matrix of predicted class probabilities, where the first column shows the probability of a case being classified as\ 0 / `FALSE`, and the second column shows the probability of a case being classified as\ 1 / `TRUE`:

```{r fft-predict-prob}
# Predict class probabilities for new data from the best training tree:
predict(heart.fft, newdata = heartdisease, type = "prob")
```

Use `type = "both"` to get both classification and probability predictions for cases:

```{r fft-predict-both}
# Predict both classes and probabilities:
predict(heart.fft, newdata = heartdisease, type = "both")
```

### Visualizing trees

Once we have created an `FFTrees` object using the `FFTrees()` function, we can visualize the tree (and ROC\ curves) using the `plot()` function. The following code will visualize the best training tree applied to the test data:

```{r fft-plot, fig.width = 7, fig.height = 7}
plot(heart.fft, main = "Heart Disease", decision.labels = c("Healthy", "Disease"))
```

See the [Plotting FFTrees objects](FFTrees_plot.html) vignette for more details on visualizing trees.

### Manually defining an FFT

We can also design a specific FFT and apply it to a dataset by using the `my.tree` argument of `FFTrees()`. To do so, we specify the FFT as a sentence, making sure to correctly spell the cue names as they appear in the data. Sets of factor cue values can be specified using curly braces (as in\ `{fd,rd}`). For example, we can manually define an FFT by using the sentence:

- `"If chol > 300, predict True. If thal = {fd,rd}, predict False. Otherwise, predict True"`

```{r fft-my-tree-def, results = 'hide'}
# Manually define a tree using the my.tree argument:
myheart_fft <- FFTrees(diagnosis ~.,
                       data = heartdisease,
                       my.tree = "If chol > 300, predict True.
                                  If thal = {fd,rd}, predict False.
Otherwise, predict True") ``` Here is a plot of the resulting FFT: ```{r fft-my-tree-plot} plot(myheart_fft, main = "Specifying a manual FFT") ``` As we can see, the performance of this particular tree is pretty terrible ---\ but this should motivate you to build better FFTs yourself! In addition to defining an FFT from its verbal description, we can edit and define sets of FFT definitions and evaluate them on data. See the [Manually specifying FFTs](FFTrees_mytree.html) vignette for details on editing, modifying, and evaluating specific FFTs. ## Vignettes Here is a complete list of the vignettes available in the **FFTrees** package: | | Vignette | Description | |--:|:------------------------------|:-------------------------------------------------| | | [Main guide: FFTrees overview](guide.html) | An overview of the **FFTrees** package | | 1 | [Tutorial: FFTs for heart disease](FFTrees_heart.html) | An example of using `FFTrees()` to model heart disease diagnosis | | 2 | [Accuracy statistics](FFTrees_accuracy_statistics.html) | Definitions of accuracy statistics used throughout the package | | 3 | [Creating FFTs with FFTrees()](FFTrees_function.html) | Details on the main `FFTrees()` function | | 4 | [Manually specifying FFTs](FFTrees_mytree.html) | How to directly create FFTs without using the built-in algorithms | | 5 | [Visualizing FFTs](FFTrees_plot.html) | Plotting `FFTrees` objects, from full trees to icon arrays | | 6 | [Examples of FFTs](FFTrees_examples.html) | Examples of FFTs from different datasets contained in the package | ## References