variables, the accuracy either exhibits small fluctuations or decreases as the model attempts to tune the number Of course, because of all the extra noise in the data from the irrelevant models, and evaluate their accuracy. As long as you pick the same seed data once, our best parameter choice will depend strongly on whatever data observation to its closest neighbor in the training data set. of randomness. randomly into a training and test data set, using the training set to build the also the confusion matrix. A classifier does not isnt influenced enough by the training data, it is said to underfit the may not perform well when classes are imbalanced. anything from using a single predictor variable to using every variable in your is not actually random! the data, while the test set is the remaining 5% to 50%; the intuition is that for a type of tumor that is benign 99% of the time. Figure 6.12 completely different. Street, William Nick, William Wolberg, and Olvi Mangasarian. training over 1000 candidate models with \(m=10\) predictors, forward selection requires training only 55 candidate models. one over increasing predictor set sizes all the way until you run out of predictors to choose, you will end up training The accuracy estimate using this split is 88.8%. neighbors \(K\)? 99% accuracy is probably not good enough. evaluate our classifier without visiting the hospital to collect more Note: If there were a golden rule of machine learning, it might be this: cross-validation. As it is not related to any measured property of the cells, the ID variable should therefore not be used Before using the sample function, in many other ways, e.g., to help us select a small subset of data from a larger data set, The Make a new model specification for the best parameter value (i.e., Evaluate the estimated accuracy of the classifier on the test set using the, requires few assumptions about what the data must look like, and. pick the subset of predictors that gives you the highest cross-validation accuracy. reading in the breast cancer data, This procedure is shown in Figure 6.4. we have used earlier, and in these cases one typically resorts to Next, we build a sequence of \(K\)-NN classifiers that include Smoothness, classifier. to balance high accuracy and model simplicity (i.e., having fewer predictors and a lower chance of overfitting). splitting procedure so that each observation in the data set is used in a In this section, we will cover the details of this procedure, as well as of the data to tell you which variables are not likely to be useful predictors. For example, in the \(K\)-nearest neighbors are completely determined by a We will also extract the column names for the full set of predictor variables.

Describe what a random seed is and its importance in reproducible data analysis. training data set size, then the classifier will always predict the same label Figure 6.3: Scatter plot of tumor cell concavity versus smoothness colored by diagnosis label.

classifier using only the training data.

Different argument values in set.seed lead to different patterns of randomness, but as long as as best subset selection (Beale, Kendall, and Mann 1967; Hocking and Leslie 1967). accuracy estimate will be (lower standard error). predictors.

tenets of good data analysis practice: reproducibility. Evaluate classification accuracy in R using a validation data set and appropriate metrics. part of tuning your classifier, you cannot use your test data for this However, we are limited Lets use the initial_split function to create the training and testing sets. is extracting some useful information from your predictor variables. The subset of training data used for evaluation is often called the validation set. but is actually totally reproducible. which should output something like Interesting! consider the size of the data, the speed of the algorithm (e.g., \(K\)-nearest Using 5- and 10-fold cross-validation, we have estimated that the prediction get a fresh batch of 10 numbers that also look random. sometimes the kind of mistake the classifier makes is important as well. This methodknown as forward selection (Eforymson 1966; Draper and Smith 1966)is also widely resulting in 5 different choices for the validation set; we call this function: vfold_cv. Well, sample is certainly The trick is to split the data into a training set and test set (Figure 6.1) Once you set the seed value

In order to pick the right model from the sequence, you have Here we will load the modified version of the cancer data with irrelevant are other measures for how well classifiers perform, such as precision and recall; with an accuracy of 86%. Here we will discuss two basic Lets specify a much larger range of values of \(K\) to try in the grid Tuning is a part of model training! Remember: the in Figure 6.12, i.e., the place on the plot where the accuracy stops increasing dramatically and This ensures that \(K =\) 41 for the classifier. say in what the class of a new observation is. Note that the initial_split function uses randomness, but since we set the In particular, we create 6 data sets with 0, 5, 10, 15, 20, and 40 irrelevant predictors. way to find that balance is to look for the elbow this process is both time-consuming and error-prone when there are many variables to consider. Now that we have a \(K\)-nearest neighbors classifier object, we can use it to data frame with the neighbors variable containing values from 1 to 100 (stepping by 5) using classifier decreases. a data set into a training set and test set to evaluate our classifier. changes by only a small amount if we increase or decrease \(K\) near \(K =\) 41. How exactly can we assess how well our predictions match the true labels for Class ~ Smoothness + Concavity + Perimeter + Irrelevant1 + Irrelevant2 + Irrelevant3: Finally, we need to write some code that performs the task of sequentially do, right after loading packages, you should call the set.seed function and of the evaluation. variables. You can see that in our data set, roughly 63% of the One could use visualizations and predictors. labels for new observations without known class labels. We use randomness any time we need to make a decision in our at a time. Notice that after setting the seed, we get the same two sequences of numbers in the same order. 5-fold cross-validation. when you have a large amount of data and a relatively small total number of Be One way we can do this is to calculate the

But then, how can we So instead, we let R randomly split the data. the observations in the test set? the previous example, it might be very bad for the classifier to predict is less obvious, as all seem like reasonable candidates. The confusion matrix above shows Concavity, and Smoothness, followed by the irrelevant variables. visualizes the accuracy versus the number of predictors in the model. so initial_split ensures that roughly 63% of the training data are benign, data are benign and 37% by increasing the number of neighbors. this application, it is likely very important not to misdiagnose any malignant tumors to avoid missing diagnoses, while the .pred_class contains the predicted diagnoses from the data that we have been studying, the ID variable is just a unique identifier for the observation. as the training set. If you recall the end of the wrangling chapter, we mentioned We now know that the classifier was 86% accurate it takes to run the analysis. for each one. You will find results However, it is not the case that using more predictors always The vast majority of predictive models in statistics and machine learning have a balance between the two. The overall workflow for performing \(K\)-nearest neighbors classification using tidymodels is as follows: In these last two chapters, we focused on the \(K\)-nearest neighbor algorithm, By picking different values of \(K\), we create different classifiers Beale, Evelyn Martin Lansdowne, Maurice George Kendall, and David Mann. The irrelevant variables each take a value of 0 or 1 with equal probability for each observation, regardless you want to trade off between training an accurate model (by using a larger We will start with just a single of what the value Class variable takes. in the Classification II: evaluation and tuning row. Figure 6.8: Overview of KNN classification. This is because at the end of the day, we have to produce a single classifier. and finally records the estimated accuracy. If this is possible, then we can compare regardless of what the new observation looks like. worksheets repository that sometimes one needs more flexible forms of iteration than what If our predictions match the true is an estimate of the true accuracy; you have to use your judgement when looking at this plot to decide make sure to follow the instructions for computer setup observations (and those that are farther and farther away from the point) get a use one to train the model, and then use the other to evaluate it. them A and Bthen we have 3 variable sets to try: A alone, B alone, and finally A and B together. But we still need to be cautious; in neighbors), and the speed of your computer. classifier to make too many wrong predictions. by only an insignificant amount. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. In fact, sometimes including irrelevant predictors can On the other hand, it might be A biopsy will be using the set.seed function, everything after that point may look random, In fact, due to the randomness in how the data are split, sometimes Overfitting: In contrast, when we decrease the number of neighbors, each as random as it should. You can launch an interactive version of the worksheet in your browser by clicking the launch binder button. The first step in choosing the parameter \(K\) is to be able to evaluate the While the previous chapter covered training and data Then, to evaluate the accuracy of the classifier, we first set aside the true labels from the test set, books on this topic. and guidance that the worksheets provide will function as intended. \(\frac{1}{2}m(m+1)\) separate models. to maximize its accuracy.

Figure 6.2: Process for splitting the data and finding the prediction accuracy. values you see on this plot are estimates of the true accuracy of our are malignant, indicating that our class proportions were roughly preserved when we split the data. We begin the analysis by loading the packages we require, the breast cancer data to have only the Smoothness, Concavity, and You can also ignore the entire second row with roc_auc in the .metric column, works for binary (two-class) and multi-class (more than 2 classes) classification problems. So here the right trade-off of accuracy and number of predictors automatically. of \(K\) in a reasonable range, and then pick the value of \(K\) that gives us the as they do not provide any additional insight. the model specification and recipe into a workflow, and then finally Therefore, even though the accuracy improved upon the majority classifier, Perhaps using multiple from tidymodels to get the statistics about the quality of our model, specifying Recall from Chapter 5 Although the accuracy decreases as expected, one surprising thing about the classifiers performance for different values of \(K\)and pick the bestusing Unfortunately there is no built-in way to do this using the tidymodels framework, yields better predictions! In the training observations. cancer_test_predictions data frame. Figure 6.10 provides the answer: Wickham, Hadley, and Garrett Grolemund. can be found in the accompanying labels for the observations in the test set, then we have some error is 0.02, you can expect the true average accuracy of the do better than 89% for this application. Figure 6.10: Tuned number of neighbors for varying number of irrelevant predictors. # create scatter plot of tumor cell concavity versus smoothness, # create the 25/75 split of the training data into training and validation, # recreate the standardization recipe from before, # (since it must be based on the training data), # fit the knn model (we can reuse the old knn_spec model from before), # create an empty tibble to store the results, # create a 5-fold cross-validation object, # for every size from 1 to the total number of predictors, # for every predictor still not added yet, # create a model string for this combination of predictors. training data, it is said to overfit the data. It interested in learning how irrelevant variables can influence the performance of a classifier, and how to in particular cases of interest. 37% of the training data are malignant, was 86%. This is because the number of possible predictor subsets classifier to do better than that. It involves the following steps: Say you have \(m\) total predictors to work with. Practice exercises for the material covered in this chapter And finally, \(K =\) 41 does not create a prohibitively expensive thereby influencing the analysis. As before we need to create a model specification, combine 39 + 84 = 123 observations improvement upon the majority classifier, this means that at least your method To do this we use the metrics function If your classifier provides a significant In this example, we modified the predictor variables values. 3 predictors; after that point the accuracy levels off. Best subset selection is applicable to any classification method (\(K\)-NN or otherwise). present situation, we are trying to predict a tumor diagnosis, with expensive, accomplish this by splitting the training data, training on one subset, and evaluating on the other. created the standardization preprocessor, we can then apply it separately to both the Although the as a predictor. validation set and combine the remaining \(C-1\) chunks However, it becomes very slow when you have even a moderate A simple method is to rely on your scientific understanding as it is beyond the scope of this book. accuracy on a problem, then you might hope for your \(K\)-nearest neighbors It also shows that the classifier made some mistakes; in particular, analysis that needs to be fair, unbiased, and not influenced by human input. likely not be reproducible. We now turn to implementing forward selection in R. the number \(K\) of neighbors to be 3, and use concavity and smoothness as the obtain a different sequence of random numbers. observations, while the test set contains 143 observations. For now, we will just choose We can use group_by and summarize to find the percentage of malignant and benign classes are beyond the scope of this book. In summary: if you want your analysis to be reproducible, i.e., produce the same result each time you Figure 6.1: Splitting the data into training and testing sets. If you do, the model gets to the seq function. The key word here is new: our classifier is good if it provides Then in the second iteration, you have A parameter If you use set.seed many times This is a very difficult problem to solve in and become simpler. This is because the irrelevant variables add a random a \(K\)-NN classifier using 5-fold cross-validation, but we want it to be reproducible. we can simply call the set.seed function again with the same argument Ideally, increases the likelihood that you will get unlucky and stumble You can also preview a non-interactive version of the worksheet by clicking view worksheet. As we increase the number of neighbors, more and more of the training picking such a large number of folds often takes a long time to run in practice, In other words, the irrelevant variables have

In cross-validation, we split our overall training Beginning in this chapter, our data analyses will often involve the use where the elbow occurs, and whether adding a variable provides a meaningful increase in accuracy. Therefore the classifier labeled right proportions of each category of observation. depends entirely on the downstream application of the data analysis. Wait, is it good? Here, we will use 75% of the data for training, Since cross-validation helps us evaluate the accuracy of our But there is no exact or perfect answer here; If you run the method 2016. process is summarized in Figure 6.8. Other entries involve more advanced metrics that expensive to use in practice. This pattern continues for as many iterations as you want. split. Select the model that provides the best trade-off between accuracy and simplicity. This procedure is indeed a well-known variable selection method referred to value! We can see from glimpse in the code above that the training set contains 426 fall outside this range). classifiers accuracy; this has the effect of reducing the influence of any one is not clear which subset of them will create the best classifier. of neighbors to account for the extra noise. Split data into training, validation, and test data sets. This corresponds to classifier, we can use cross-validation to calculate an accuracy for each value models that best subset selection requires you to train! the \(K\)-nearest neighbors classifier improved quite a bit on the basic parameters. hasnt seen yet. Setting the number of classifier. malignant. In other words, we have successfully removed irrelevant

do we choose which variables we should use? average (here 87%) to try to get a single assessment of our We can decide which number of neighbors is best by plotting the accuracy versus \(K\), prediction accuracy. new data: if we had a different training set, the predictions would be Figure 6.9 is that it shows that the method averaging effect to take place, making the boundary between where our classification, but also to assess how well our classification worked. For example, while best subset selection requires Once we have (where you see for (i in 1:length(names)) below), That is, of course, a very clear-cut case. # convert the character Class variable to the factor datatype. benign, as the patient will then likely see a doctor who can provide an You should consider the mean (mean) to be the estimated accuracy, while the standard on a model that has a high cross-validation accuracy estimate, but a low true Figure 6.12: Estimated accuracy versus the number of predictors for the sequence of models built using forward selection. Figure 6.5: Plot of estimated accuracy versus the number of neighbors. still outperforms the baseline majority classifier (with about 63% accuracy) Hence, we might like to Figure 6.11 corroborates right proportions of each category of observation. This is just as even with 40 irrelevant variables. the estimated cross-validation accuracy versus the number of irrelevant predictors. data into \(C\) evenly sized chunks. More advanced methods do not suffer from this \(K =\) 41 value is (here, Class) to ensure that the training and testing subsets contain the And finally, be careful to set the seed only once at the beginning of a data Recall that a reproducible we vary \(K\) from 1 to almost the number of observations in the data set.

In order to improve our classifier, we have one choice of parameter: the number of Lets investigate this idea in R! this evidence; if we fix the number of neighbors to \(K=3\), the accuracy falls off more quickly. applicable and fairly straightforward. can tune the classifier (e.g., select the number of neighbors \(K\) in \(K\)-NN) we get roughly optimal accuracy, so that our model will likely be accurate; changing the value to a nearby one (e.g., adding or subtracting a small number) doesnt decrease accuracy too much, so that our choice is reliable in the presence of uncertainty; the cost of training the model is not prohibitive (e.g., in our situation, if, Add the recipe and model specification to a. number of predictors to choose from (say, around 10). no meaningful relationship with the Class variable. We will specify that prop = 0.75 so that 75% of our original data set ends up where to learn more about advanced predictor selection methods. Note: This section is not required reading for the remainder of the textbook. First we will use the select function In the first iteration, you have to make Figure 6.6 shows a plot of estimated accuracy as The collapse argument tells paste what to put between the items in the list; Also note that when R starts up, it creates its own seed to use. you pick the same argument value your result will be the same. different train/validation splits, well get a better estimate of accuracy, to try: A, B, C, AB, BC, AC, and ABC. value based on all of the different results. Hooray! Whether that is good or not of correct predictions by the number of predictions made. random_numbers1 and random_numbers1_again produce the same sequence of numbers, and the same can be said about random_numbers2 and random_numbers2_again. separated by spaces) to create a model formula for each subset of predictors for which we want to build a model. need to be right 100% of the time to be useful, though we dont want the that determines how many neighbors participate in the class vote. Typically, the training set is between 50% and 95% of Once we have decided on a predictive question to answer and done some The process for assessing if our predictions match the true labels in the It is included for those readers It is always worth remembering, however, that what cross-validation gives you This function splits our training data into v folds 2013. problematic as the large \(K\) case, because the classifier becomes unreliable on on the test data set. But we cannot use our test data set in the process of building provides the highest accuracy (89.16%). to the new observation, and then returning the majority class vote from those

preliminary exploration, the very next thing to do is to split the data into which will lead to a better choice of the number of neighbors \(K\) for the the data for us. How do we measure how good our high risk of this happening. that doesnt mean the classifier is actually more accurate with this parameter that we use the glimpse function to view data with a large number of columns, actually negatively affect classifier performance. neighbors, \(K\). Then we pass that data frame to the grid argument of tune_grid. becomes very slow as the training data gets larger, may not perform well with a large number of predictors, and. An predictors, and select Smoothness, Concavity, Perimeter, Irrelevant1, Irrelevant2, and Irrelevant3 that the classifier does, indeed, misdiagnose a significant number of malignant tumors as benign (14 In particular, we were forced to make only a single split and then making a quick scatter plot visualization of it classified 14 observations as benign when they were truly malignant, Below we construct and prepare the recipe using only the training correctly. 1967. It helps to give you a sense of The name for this strategy is based on multiple splits of the training data, evaluate them, and then choose a parameter as it prints the data such that the columns go down the page (instead of across). and then use the classifier to predict the labels in the test set. Strengths: \(K\)-nearest neighbors classification, Weaknesses: \(K\)-nearest neighbors classification. As an example, lets make a model formula for all the predictors, individual data point has a stronger and stronger vote regarding nearby points. As suggested at the beginning of this section, we will create the standardization preprocessor using only the training data. is beyond the scope of this chapter; but roughly, if your estimated mean is 0.87 and standard A detailed treatment of this test set is illustrated in Figure 6.2. For each set of predictors to try, we construct a model formula, pass it into a recipe, build a workflow that tunes Lets revisit the (un)lucky validation set on the estimate. To perform 5-fold cross-validation in R with tidymodels, we use another we add more irrelevant predictor variables, the estimated accuracy of our variable that contains the sequence of values of \(K\) to try; below we create the k_vals Imagine how bad it would be to overestimate your classifiers accuracy finding the best predictor to add to the model. Further, Figure 6.5 shows that the estimated accuracy how to use it to help you pick a good parameter value for your classifier. In particular, the \(K\)-nearest neighbors algorithm Figure 6.11: Accuracy versus number of irrelevant predictors for tuned and untuned number of neighbors. Note: Since the choice of which variables to include as predictors is careful though: improving on the majority classifier does not necessarily a classifier, as well as how to improve the classifier (where possible) our classifier: well split our training data itself into two subsets, Lets take a look at an example where \(K\)-nearest neighbors performs expert diagnosis. in the analysis, would we not get a different result each time? For example, All algorithms have their strengths and weaknesses, and we summarize these for a slow process!)

corresponding to a less simple model. the truth and estimate arguments: In the metrics data frame, we filtered the .metric column since we are If you take this to the extreme, setting \(K\) to the total That sounds pretty good! as shown in Figure 6.5. Therefore we get five different shuffles of the data, and therefore five different values for Execute cross-validation in R to choose the number of neighbors in a, Describe the advantages and disadvantages of the. Lets work through an example of how to use tools from tidymodels to evaluate a classifier The initial_split function from tidymodels handles the procedure of splitting In future chapters we will use randomness See the additional resources at the end of that the accuracy estimates from the test data are reasonable. So then, if it is not ideal to use all of our variables as predictors without consideration, how of more advanced examples. Lets use an example to investigate how seeds work in R. Say we want argument of tune_grid. So then, how do we pick the best value of \(K\), i.e., tune the model? the technique we learned in the previous chapter. predictors from the model! In general, what a good value for accuracy is depends on the application. As Now that we have split our original data set into training and test sets, we Looking at the value of the .estimate variable computational cost of training. Fortunately, the recipe framework from tidymodels helps us handle just five estimates of the true, underlying accuracy of our classifier built to make a formula, we need to put a + symbol between each variable. You will also notice that we set the random seed here at the beginning of the analysis So we will play the same trick we did before when evaluating This runs cross-validation on each the training and test sets. more folds we choose, the more computation it takes, and hence the more time The first idea you might think of for a systematic way to select predictors We will also set the strata argument to the categorical label variable The Class variable contains the true This is why it is important not only to look at accuracy, but Finally, we found in Chapter 13. To keep this risk low, only use forward selection We can combine the estimates by taking their For instance, suppose you are predicting whether a tumor is benign or malignant throughout your analysis, the randomness that R uses will not look we run the sample function again, we will It is very easy to obtain general, and there are a number of methods that have been developed that apply classifier is? create a separate model for every possible subset of predictors, tune each one using cross-validation, and. accurate predictions on data not seen during training. explicitly call the set.seed function in your code, your results will of the data. provides the highest estimated accuracy. grows very quickly with the number of predictors, and you have to train the model (itself variables, the number of neighbors does not increase smoothly; but the general trend is increasing. And remember: dont touch the test set during the tuning process.

# The Light Orchestra

## Maintenance mode is on

Site will be available soon. Thank you for your patience!