For each of the models tested, an input table containing values for the factors identified in Objective 1 and corresponding N and P observations for each of the Ontario Provincial Water Quality Monitoring Network sites was used for training. The model was trained on 4 years of observations from 2013 to 2016. The nitrogen data set contains 1709 observations over 80 unique sites, and the phosphorus data set contains 2006 observations across 80 unique sites. Figure 6 below illustrates the process used to train the models.

**Figure 6.** Flow chart illustrating the steps taken to train the predictive models and the process used to predict N and P levels based on the best model.

**Model Testing**

In development of the prediction model, two ordinary least squares (OLS) linear regression models were tested, and two machine learning models were tested. Linear regression models were chosen because they are easy to understand and quick to implement, thus if these performed well they should be used. The two linear regression models included a “full linear model” with all of the predictors being included, and a step-wise linear model, which can be considered a slightly more optimized version of the full model. The step-wise model was created with the stepAIC function from the R package MASS (Modern Applied Statistics with S). The stepwise regression iteratively adds and removes parameters until the most appropriate and statistically significant group was selected to remove factors that are detrimental to the model's behavior (Montgomery et al. 2013).

Next, the two machine learning models were tested. Machine learning models take significantly more computing time to complete but are able to locate non-linear trends in the data that a linear regression model would be unable to discover. The first model used was Random Forest and was the main model of interest due to its usage in similar problems. The Random Forest algorithm takes the average of many decision trees created from subsets of the data and is commonly used for classification and regression-based prediction (Breiman, 2001). Random Forest has been used successfully using GIS for both categorization and regression-based problems (Peters et al. 2007; Oliveira et al. 2012; Vorpahl et al. 2012) and does not over-fit (Breiman, 2001). A second machine learning model called SVM (Support Vector Machine) was explored as well. SVM uses a technique where input vectors are non-linearly mapped to a very high-dimension feature space (Cortes & Vapnik, 1995). This method has been around for longer than Random Forest and can handle both categorical and regression tasks making it a good baseline for the performance of the more modern Random Forest algorithm. The comparison of these models is further explained in Objective 4. Both models were implemented with the Caret (Classification And Regression Training) R package, using the 10-fold cross validation resampling method repeated 10 times for tuning, as recommended for validating accuracy and tuning the model (Kohavi, 1995). Both were further tested using an arbitrarily chosen 80/20 training/testing data. After the training model was validated, the predictive model was created by retraining the model using the entire data set.

Both machine learning models were run using three different methods. First, a regression was conducted, which gives an exact value for N and P levels at a monitoring point. Next, a classification-based method was attempted with the machine learning algorithms using two and five classes. A visual representation of the classification methods used is shown in Figure 7 as an output from the Random Forest algorithm. Classification was explored because the accuracy can be directly calculated using a percentage of correct guesses. Classification reduces the complexity of the training task and reduces training time. This may be all a researcher needs in locating areas with high or low N and P. The two-class scenario represent high and low values, splitting data by the median of all observations. For five classes, the data were divided up by quantiles, leading to 20% of the observations being in each category. The classes are representative of “very low, low, medium, high and very high” for the purpose of the study.

The importance of each parameter in every model was calculated using the R function varImp (Calculation of Variable Importance for Regression and Classification Models).

**Figure 7.**** **Illustrates the two different kinds of classification methods used when developing the predictive model based on the Random Forest output; A represents the two category classification method with phosphorus values, classified as either high or low; B represents the five category classification method with phosphorus values ranging from very low to very high.

**Model Selection **

Testing multiple models and assessing which model is the most appropriate is a process known as “model selection” (Claeskens, 2016) and is designed to be expanded to more options in the future. The predictive models created various output values for error and accuracy that were compared to choose the best model. The metrics used to compare the regression-based output were RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R^{2}, and an additional MAE calculated using the predicted and actual test set values. Error is entirely dependent on the data, but models trained with the same data will have better performance when they have lower error. A higher R^{2} indicates a better model as the R^{2} value describes how well the model fits the data (Montgomery et al. 2013). These metrics were the main factors in choosing the best model. The metrics used to compare the classification-based schemes were Accuracy and Kappa calculated through model validation, and the accuracy using the test set. The Kappa statistic is the accuracy when controlling for random chance where 0.21–0.40 is fair, 0.41–0.60 is moderate, 0.61–0.80 is substantial model agreement (Landis & Koch, 1977).

**Model Prediction **