nonstat

2020-01-22

Introduction

The nonstat R package is designed to allow hydrologists to easily undertake trend and change point detection and non-stationary flood frequency analysis using annual maximum (AMAX) flow data. The package contains a number of functions that can be called by the user and which have associated help files. This document provides more details about the functions, their inputs and outputs, and how to use them. Practitioner guidance will be provided with the final version of the package that will complement this user guide by providing further information on the statistical methods, guidance on their application and some case study examples. It is recommended that analysis is carried out using RStudio, a free and open-source integrated development environment (IDE) for R.

This documentation contains examples of code for running each function and there is also a complete example script at the end of the document for all aspects of analysis that can be carried out using this package. All examples here use a folder called ‘Non-stationarity’ saved on the C drive, however, you can change the file path to anything you wish. The analysis does not need to be carried out in the same folder where the nonstat package is saved.

RStudio

Once you have installed RStudio, a shortcut should have been created which you can use to open it. If not, you can find the application in the directory you selected for saving RStudio under ‘bin/rstudio.exe’ (assuming the default settings have been followed):

When you first open RStudio, it will look similar to this, although there may also be a box in the top left quadrant:

To ensure that analyses are reproducible as far as possible, it is recommended that you create and save an R script (in RStudio: File -> New File -> R Script). This script will contain the lines of code that you wish to run. The script will open in the top left quadrant, as in the example below. The hash symbol can be used to comment out parts of the script that should not be run, for example, if you want to add a comment about what you are doing, or don’t want to run a certain line of code.

Lines of code in the script can be run in the console by clicking on the line to run and pressing ‘Ctrl + Enter’ or by clicking the ‘Run’ button at the top of the script. You can save the script as a .R file using File -> Save As. This can then be re-opened if you need to access the script again.

It is generally recommended to start a new analysis in a clean version of RStudio, where nothing is saved in the workspace that may interfere with the analysis. If you get errors that you have not previously encountered from the same function, then it is worth restarting RStudio to check that nothing is saved from previous analyses that may be interfering with the new analysis.

Package installation

The ‘nonstat’ R package will be supplied as a .tar.gz file via email.

The ‘nonstat’ package has a number of dependencies (other R packages on which the code depends). These are changepoint, data.table, dplyr, ggplot2, grid, gridExtra, lubridate, stats, stringr, texmex and trend. These must be installed first. You can see which packages you have installed in the ‘Packages’ tab, in the bottom right panel (assuming the default RStudio layout). Packages only need installing once, so you do not need to re-install packages that you already have. If you need to install any dependencies, this can be achieved using the following code (listing all missing packages, rather than the two in this example):

install.packages(c('changepoint', 'trend'))
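Alternatively, the following sketch checks which of the dependencies are missing and installs only those (the base packages ‘grid’ and ‘stats’ ship with R, so they will never appear in the missing list):

# Install only those dependencies that are not already present
deps <- c('changepoint', 'data.table', 'dplyr', 'ggplot2', 'grid', 'gridExtra',
          'lubridate', 'stats', 'stringr', 'texmex', 'trend')
missing <- setdiff(deps, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)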

Go to the email containing the .tar.gz file and save it to a directory with an appropriate name. You can use the code below to store that directory and file name in R, replacing the file path with the one to the directory you just created. It is worth noting that folder paths in R use a forward slash rather than a backslash.

path_to_file <- "C:/Non-stationarity/nonstat_0.2.0.tar.gz"

Now you can install the ‘nonstat’ package, using the following code. (It would also be possible to put the file path directly into the below line, rather than using ‘path_to_file’, if preferred.)

install.packages(path_to_file, repos = NULL, type="source")

If any of the package dependencies are still missing, you will get an error message like the following. In this example, the only missing package is ‘changepoint’. This means that the nonstat package has failed to install and you will need to install the missing packages before re-installing the nonstat package.

Once the package is installed, you should be able to see the package present in the ‘Packages’ tab, in the bottom right panel (assuming the default RStudio layout), as shown below:

Using the nonstat package

The functions in the nonstat package can be accessed by loading the package into the current R session by running the following code in the console (or by checking the box to the left of the package name in the ‘Packages’ list shown above). You will be able to see that other R packages on which nonstat depends are also loaded.

library(nonstat)

The help files for each function can be accessed in R by typing ‘?function’ where function is the function name, for example, ‘?fit.time’.

Data

Each function in the package is developed to work with flow data in .AM files, such as those found in the National River Flow Archive (NRFA) peak flows dataset. The dates must be in the standard format of 22 Jan 1960 for example; please ensure this is the case if you manually edit any files. Prior to running any functions, you need to create a directory containing two folders: ‘Inputs’ and ‘Outputs’. To tell R what this directory is, you can use the following code:

folder <- "C:/Non-stationarity/Data"

Every function in the package has the path to this directory as an input argument. If you are using the same folder for the whole analysis then this can just be set once at the beginning of the script.

Data for any gauges you wish to analyse must be in the ‘Inputs’ folder (many of the functions will loop through all the gauge data (.AM files) saved in this folder). Outputs will automatically be written out to the ‘Outputs’ folder.

General comments

For outputs from all functions, the gauge numbers come from the station number within the .AM file, so the station number will need changing within the file if multiple versions of the same gauge are used (if only the file name of the .AM file is changed, some of the results files will be overwritten). Outputs will be overwritten when the code is re-run if the gauge numbers are the same; however, output file names can be changed once the files have been produced if required. The gauge name comes from a look-up file (‘Gauge_metadata.csv’) which is provided with the package and needs to be placed in the ‘Inputs’ folder. If the gauge ID (number) does not exist in the look-up file then no gauge name will be included in the outputs. The file can be edited to include your gauge if you wish, as long as the format of the file remains the same. The existing IDs are those of the NRFA.

When running the scripts, it is worth keeping an eye out for errors or warnings that may appear in the R console. Errors generally stop the function so that it fails to complete its intended analysis, meaning that some or all results are likely to be missing; the problem will therefore need addressing. Some of the functions catch errors, rather than stopping the analysis, and write them out to .csv files instead. It is therefore important to check the warning and error files that are written out.
Warnings caution users without stopping the function from working, so you will probably get results, but it is possible that there was a problem with the inputs or the computation; for example, something may have been amended slightly to allow the function to work. Some warnings do not require action to be taken, if you are satisfied that the analysis is not being affected and the results are correct.
Each function in this package uses the same data pre-processing routine. If the data pre-processing produces warnings or errors for a given gauge, they will be written out in the automatically-generated ‘Input data warnings and errors.csv’ file. Errors are likely to be related to formatting of the input data, for example, an expected header being missing or one section of the file not ending with an ‘[END]’ row.
A warning message may also be produced if one of the gauges has a missing stage value in the .AM file; this is not a problem as it is only the flow values that are analysed, but it is worth checking that this is the problem in the .AM file to which the warning refers. In the unlikely situation that a flow value is missing in the .AM file for a date (water year) that is included in the file, then that row (water year) is automatically removed from the analysis (to enable the distribution fitting in some of the functions to work).

Optional parameters do not need to be included to run the function. Output folders and files will automatically be created for each function that is run.

For the code examples in this document, the lines below the code used to run the function starting with ‘#>’ show what appears in the console when the function runs, as an example of what you might see in your console when you run the function.

Trend and change point tests

There are three functions in the package that can carry out trend and change point tests and these are described in this section. All these functions can be run on multiple gauges and a list of the gauges will be printed in the R console. A progress bar will update after each gauge has been analysed.

MK.test

This function carries out the Mann-Kendall trend test on all gauges listed in a folder and uses functionality from the ‘trend’ R package. The Mann-Kendall test is a non-parametric trend test, widely used for monotonic trend testing. The results are automatically written out to the ‘Outputs/Mann-Kendall’ folder. The ‘Mann-Kendall’ folder will be automatically generated if it does not already exist. The function takes the following inputs:

The function can be run as follows. The progress bar can be seen to update after each gauge has been analysed and the total processing time is also stated. If there are any warnings from running the function, they will also be stated in the console.
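For example, using a 5% significance level (this matches the call in the complete example script at the end of this document):

# Run the Mann-Kendall test for all gauges in the 'Inputs' folder
folder <- "C:/Non-stationarity/Data"
MK.test(folder, sig.level = 5)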

The outputs from running the function are as follows:

‘Mann-Kendall results.csv’: A .csv file containing gauge information and results. Missing years are those in the middle of the AMAX series, not at the start or end, as there is no way of automatically determining that these are missing. The number of years reported is the number of years used in the analysis, excluding rejected years (i.e. those marked as rejected in the .AM file) and rows with missing flow data (if they exist).

An explanation of the results from the Mann-Kendall test can be found in https://cran.r-project.org/web/packages/trend/vignettes/trend.pdf. The Mann-Kendall Z score (MKZ) is a standardised version of the results, so can be used to compare stations.

Positive values of MKZ indicate increasing trends and negative ones indicate decreasing trends. The 2-sided p-value can be used to determine whether the results are statistically significant at a given significance level (as specified by the user) and the interpretation is reported in the next column. At a 5% significance level, this means that if the absolute value of MKZ is greater than 1.96 (or if the p-value is less than 0.05) then the null hypothesis (H0) of no trend is rejected. The conventional approach is then to (provisionally) accept a single alternative hypothesis H1, i.e. that there is a trend.
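As an illustration of the relationship between MKZ and the two-sided p-value (the MKZ value below is invented purely for demonstration):

MKZ <- 2.1                       # illustrative Mann-Kendall Z score
p_value <- 2 * pnorm(-abs(MKZ))  # two-sided p-value, approximately 0.036
p_value < 0.05                   # TRUE, so H0 of no trend would be rejected at the 5% level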

‘gauge_number (gauge name – if exists) trend plots.png’: One plot per gauge containing two graphs. The first shows a time series of the AMAX data (black lines) and smoothed data (red line), using locally-weighted polynomial regression (LOWESS). The second shows the autocorrelation function. This enables investigation into whether there is significant lag-1 serial correlation in the AMAX time series (i.e. whether the time series is dependent on its past), because the Mann-Kendall test requires serial independence of data. The plots look like this, where in the second plot, the x-axis is the lag for each autocorrelation estimate and the y-axis is the estimate:

The autocorrelation has been estimated at a range of lags (water years, given that AMAX data are being used). Be aware that missing years are not taken into account (i.e. the function assumes that the values are for consecutive years), so the estimates may not be entirely correct, particularly if there are lots of missing years throughout the time series. The lag-1 autocorrelation indicates the correlation between values in the time series and their preceding value and lag-2 indicates the correlation between values and the value two water years before. The Bedburn Beck example above shows autocorrelations for lags up to 17 years. The ACF at lag-0 is always 1. High values of autocorrelation indicate strong correlation between values at given lags. For time series where the current value is closely related to its preceding value, high values of autocorrelation would be expected for the shorter lags. For time series showing a periodic pattern, e.g. where the current value is closely related to the observation two before it, there would be a high value of autocorrelation for lag-2. For time series with no clear patterns, the values of autocorrelation should be relatively small. The blue, horizontal dashed lines on the plot represent lag-wise 95% confidence intervals centred around zero. Autocorrelation estimates at a given lag would be statistically significant at a 5% significance level if they are outside these lines (compared to a null hypothesis of no autocorrelation).
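If you wish to reproduce this kind of check for a single series outside the package, a minimal sketch using base R (the AMAX values are simulated purely for illustration):

set.seed(1)
amax <- rnorm(40, mean = 50, sd = 15)  # illustrative AMAX series of 40 water years
acf(amax, lag.max = 17)                # autocorrelation estimates with lag-wise 95% bounds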

Pettitt.test

This function carries out Pettitt’s test, which is designed to detect a sudden change in the mean of a time series. It is a non-parametric test, i.e. it makes no assumption about the distribution followed by the data. Results are automatically written out to the ‘Outputs/Pettitt’ folder. The ‘Pettitt’ folder will be automatically generated if it does not already exist. The function takes the following inputs:

The function can be run as follows. The progress bar can be seen to update after each gauge has been analysed, along with a statement that an image is being saved, and the total processing time is also stated. Any warnings will also be stated in the console.
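For example, again using a 5% significance level:

# Run Pettitt's test for all gauges in the 'Inputs' folder
Pettitt.test(folder, sig.level = 5)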

The outputs from running the function are as follows:

‘Pettitt results.csv’: A .csv file containing the results for all gauges. The year of the detected change point is given. The p-value can be used to identify whether the change in the mean AMAX flow from the period before to the period after the change point is significant. In this case, the null hypothesis, H0, is that there is no difference between the means of the earlier and later portions of each AMAX flow series. If the p-value is less than the user-specified significance level (e.g. 5% means that the p-value should be compared with 0.05), then H0 is rejected. The conventional approach is then to (provisionally) accept a single alternative hypothesis H1, i.e. that there is a sudden change. The interpretation is provided in the file. The absolute and percentage change in the mean between the periods before and after the change point is provided, as well as the direction of the change (positive or negative).

‘gauge_number (gauge name – if exists) Pettitt results.png’: A plot is also generated for each gauge showing the AMAX data with the detected change point shown as a vertical line, if significant. The plot looks like this:

PELT.test

One limitation of Pettitt’s test is that it can only detect a single change point. Additionally, it has been criticised for its tendency to classify gradual trends as sudden changes (Rougé et al., 2013). An alternative is the PELT (Pruned Exact Linear Time) test. As with most change point algorithms, PELT (Killick et al., 2012) tries to find the optimal segmentation in a time series. The method prevents identification of too many unrealistic change points. PELT is implemented through the ‘changepoint’ R package (Killick and Eckley, 2014) and allows detection of one or more change(s) in mean and variance of the AMAX flow data (this function is restricted to two change points; results from national analysis found no gauges where more than two change points were detected). The approach requires an assumption that the data follow a known distributional form. Since the flood time series are not generated from any known distribution, a log transformation is applied to transform the data to approximate Normality (Box and Cox, 1964). In effect, we assume that the AMAX flows follow a log-Normal distribution, which has been used in previous studies (e.g. Prosdocimi et al., 2014). It is worth checking that the log-Normal assumption is reasonable; this can be done by checking the Normal QQ plot which is an output from the package. Points close to the diagonal line indicate a reasonable log-Normal fit.
A minimum segment length is required, which prevents additional over-fitting and false positive changes at short time scales, so 10 years has been chosen here as a suitable length for local stationarity pre- and post-change. This means that the record length of the input data must be at least 20 years (two times the minimum segment length). If the record length of your data is shorter than this, the function will not work and the gauge will be listed in a spreadsheet in the ‘Outputs’ folder: ‘Gauges with record lengths too short to be analysed.csv’. The PELT.test function carries out this test and takes the following input:

The function can be run as follows. The progress bar can be seen to update after each gauge has been analysed, along with a statement that an image is being saved, and the total processing time is also stated. Any warnings will also be stated in the console.
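For example:

# Run the PELT test for all gauges in the 'Inputs' folder
PELT.test(folder)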

The outputs from running the function are as follows:

‘gauge_number (gauge name – if exists) Normal QQ plot.png’: A Normal QQ plot of the log-transformed data. This indicates whether the log-Normal assumption is reasonable - points close to the diagonal line should indicate a reasonable log-Normal fit. If the log-transformed data do not remotely follow a Normal distribution then the PELT results may not be reliable. The following figure shows two Normal QQ plots. The points are close to the line in (a) and hence the log-Normal assumption seems to be reasonable. In (b), however, the points are further from the line and it is possible that the log-Normal assumption is not valid and the PELT results may not be reliable.
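If you want to carry out an equivalent check by hand for a single AMAX series, a minimal sketch (with simulated data for illustration):

set.seed(1)
amax <- rlnorm(40, meanlog = 4, sdlog = 0.5)  # illustrative AMAX flows
qqnorm(log(amax))                             # Normal QQ plot of the log-transformed data
qqline(log(amax))                             # points near this line support the log-Normal assumption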

‘PELT results.csv’: A .csv file containing the results for all gauges. The year(s) of the detected change point(s) are given (some gauges will have no detected change points). For each detected change point, the absolute and percentage change in the mean and standard deviation between the periods before and after the change point is provided, as well as the direction of the change (positive or negative).

‘gauge_number (gauge name – if exists) PELT results.png’: A plot is also generated for each gauge showing the AMAX data with any detected change point shown as a vertical line. The plot looks like this:

Fitting non-stationary models

Traditional statistical methods for flood frequency analysis assume stationarity of extreme events, where observations of past flood events are assumed to represent the behaviour of future events. If non-stationarity is evident in our AMAX flow data, then our design estimates derived from traditional flood frequency analysis may be questionable.

To model non-stationarity, covariates can be incorporated into the parameters of statistical models used in flood frequency analysis. Time (i.e. the year in which a flood happened) may seem to be an obvious choice for a covariate, relating changes in flood frequency to time. However, time is a proxy for some changing physical process that is causing change in flood frequency. An alternative option would be to use physically-based covariates, such as rainfall or a climate index such as the North Atlantic Oscillation (NAO). Two reasons are sometimes given for this. The first is that physical covariates help remove some of the year-to-year variability in AMAX flows, enabling better identification of time-based trends and a better fit of the GEV distribution (Prosdocimi et al., 2014). The second is that they provide a more physically meaningful model of non-stationarity. Some physical covariates may open up the prospect of predicting the future evolution of the flood frequency curve (Šraj et al., 2016). They may not directly represent the physical processes that cause floods, but represent a potential step forward from the simplistic approach of modelling non-stationarity as a change over time.

This package has an option for using only time as a covariate in a statistical distribution, as well as an option for using physical covariates. The former is a simpler option, and can be interpreted as a trend test. Either the Generalised Extreme Value (GEV) or Generalised Logistic (GLO) distributions can be used. All model fitting in this package is carried out using functions from the ‘texmex’ package and all confidence intervals are calculated using bootstrapping.

Typically, flood risk estimates are defined using return periods, but in non-stationary models, this becomes more complicated. It is an open question as to how non-stationary models can best be used to inform design specifications, given the different ways of defining risk.

Some key definitions for this section are:

Return level - the value (in this analysis, the river flow) associated with a particular return period or probability. Hydrologists often refer to this as the design flood, or as a point on the flood frequency curve.

Conditional return level – the return level that would be expected if the covariates had a particular combination of values (i.e. conditional on those values). For instance, if the covariate was the water year, there could be a return level conditional on the water year being 2019-20.

This is therefore suitable for use given a model with time covariates, as one could use the conditional return level given the current year.

It is less useful to think in the same way when using physical covariates. For example, the 100-year return level given the 2019 rainfall amount is the expected return level under the (clearly hypothetical) condition that the rainfall always takes the value observed in 2019. In this case, an alternative approach would be to define a marginal return level, as used in Eastoe and Tawn (2009).

Marginal return level – the return level corresponding to the averaged encounter probability, where averaging is over covariates in a period of interest. Specifically it is the return level where the average encounter probability is equal to the reciprocal of the return period. (An encounter probability is the probability of an event occurring in a given number of years.)

Present-day marginal return level – the present-day expected return level for a particular exceedance probability, but not conditional on any particular value of a covariate such as annual rainfall. It is estimated by calculating the marginal return level, setting the time covariate to be the ‘current’ water year (last year in the flow and covariate data), and averaging over the full observed distribution of the physical covariates.
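As an illustration of an encounter probability, under stationarity the probability of at least one exceedance of the T-year flow in a period of n years can be calculated as follows (a purely illustrative calculation):

T <- 100          # return period in years
n <- 50           # period of interest in years
1 - (1 - 1/T)^n   # probability of at least one exceedance, approximately 0.39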

Fitting models where time is the only covariate

fit.time

The fit.time function fits stationary and non-stationary distributions (GEV or GLO) to AMAX flow data for all gauges provided by the user in an ‘Inputs’ folder. The three non-stationary models have a varying location parameter, a varying scale parameter, or both location and scale parameters varying. The parameters vary with a covariate. This function uses time as a covariate and automatically calculates a value for the covariate to associate with each AMAX flow in the input data for the non-stationary distribution fitting, i.e. the user does not need to supply the time covariate. The time covariate is an index of water year and the values are centred and scaled.
The results can be interpreted as a parametric trend test; there may be some catchments where this indicates that the data are non-stationary, but where no trend is detected by non-parametric trend tests, such as the Mann-Kendall test.
Parameters of the fitted distributions and the results from frequency analysis are automatically written out to .csv files in the ‘Outputs/Time covariate models’ folder. The ‘Time covariate models’ sub-folder will be automatically generated if it does not exist.
The function takes the following inputs, as documented in the package help files:

The function fits a GEV or GLO distribution using maximum likelihood estimation (MLE) for a stationary case and three non-stationary cases (varying location, varying scale and varying both location and scale parameters). The scale parameter is log-transformed to ensure positivity for all possible covariate values.

The outputs from running the function are as follows.

‘gauge_number (gauge name – if exists) GEV|GLO results.csv’: A results spreadsheet containing the conditional return levels (flow estimates) from the fitted distributions. The first columns show the AMAX flow data used in the analysis. The ‘IndexYear’ column is the automatically generated data representing time and the ‘ScaledIndexYear’ column is the centred and scaled version of this which is used as a covariate to represent time in the model fitting. For each distribution, the ‘Fit’ column is the type of fit and the subsequent columns are the results from that fit (including the user-specified confidence intervals). The results from fitting a stationary distribution are the same for every year, but for the non-stationary cases, the results vary with time (e.g. the 100-year flow in 1960 may be different from the 100-year flow in 2015), because they are conditional on a given year.
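To aid interpretation of the ‘ScaledIndexYear’ column, the sketch below shows one way a centred and scaled index of water year can be constructed; fit.time does this automatically and its exact scaling may differ, so this is for illustration only:

water_years <- 1960:2019                            # illustrative span of water years
index_year <- seq_along(water_years)                # index of water year (cf. 'IndexYear')
scaled_index_year <- as.numeric(scale(index_year))  # centred and scaled (cf. 'ScaledIndexYear')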

‘GEV|GLO distribution parameters and fits.csv’: A spreadsheet containing the parameters from the distribution fitting is also produced. There is one row per model, so four rows per gauging station.

The first columns provide information about each station. Missing years are those in the middle of the AMAX series, not at the start or end, as there is no way of automatically determining that these are missing. The number of years reported is the number of years used in the analysis, excluding rejected years and rows with missing flow data (if they exist). The ‘Fit’ column gives the model type that corresponds to the parameters in that row. The next columns are the parameters of the fitted distribution. For a stationary fit, the location, ln(scale) and shape parameters are provided, along with their standard errors. The scale parameter (\(\sigma\)) can be obtained by taking the exponential of the value provided (ln(\(\sigma\)) or \(\phi\)). For the non-stationary fits, the outputs change for the varying parameter(s). If the location parameter is varying then the location parameter (\(\mu\)) is replaced with \(\mu_0\) and \(\mu_1\) (mu0 and mu1 in the column names) and if the scale parameter is varying then the ln(scale) parameter (\(\phi\)) is replaced with \(\phi_0\) and \(\phi_1\) (phi0 and phi1 in the column names), as described in the following equations where the parameters vary with time, \(t\):
\(\mu(t) = \mu_0 + \mu_1t\),
\(\ln(\sigma(t)) = \phi(t) = \phi_0 + \phi_1t\).
Therefore, if the parameter is varying, then the column names should be read as mu0 and phi0, but if the parameter is not varying, they should be read as location and ln(scale).
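For example, converting a reported ln(scale) value back to the scale parameter (the value is illustrative):

phi <- 3.2         # illustrative ln(scale) value from the parameters file
sigma <- exp(phi)  # scale parameter, approximately 24.5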

The AIC and BIC are the Akaike Information Criterion and the Bayesian Information Criterion and represent the goodness of fit of the model. They are both methods of selecting the best fitting distribution for each station and take into account that there will be greater uncertainty with a larger number of varying parameters (i.e. they penalise overfitting). The lowest value indicates the best fit. Testing indicated that the BIC may be more useful because it tends to give preference to simpler models. It is worth noting that there can be small differences in the AIC or BIC value between different models for the same catchment and yet the models themselves can produce quite different results. It is therefore recommended that the results are assessed alongside diagnostic plots to determine which model is most suitable.

Another approach to find best fitting models is to use likelihood ratio testing. This is a preferred approach when comparing a small number of candidate models, when the likelihood ratio can be calculated for each nested pair of models. It is impractical when comparing hundreds of models, which can be the case when several covariates are being considered, but it is suitable for this case where there are only four models to compare. The last three columns therefore identify the best fit per gauge based on the AIC value, the BIC value and likelihood ratio testing.

‘gauge_number (gauge name – if exists) GEV|GLO diagnostic plots – fit type.png’: This contains three plots. The first is of the observed AMAX data, the second is a probability-probability (or PP) plot and the third is a quantile-quantile (or QQ) plot. A well-fitting model will have points lying close to the unit diagonal on both the PP and QQ plots. The QQ plots for all four models are also provided in ‘gauge_number (gauge name – if exists) GEV|GLO diagnostic plots – Compare four models.png’ for easy comparison.

There are also several files containing information about warnings and errors produced during the analysis. These are as follows:

‘GEV|GLO diagnostic plot warnings and errors.csv’: If there were any warnings or errors produced during the generation of the diagnostic plots then they will be stated in here. There is one column for warnings and one for errors for each type of model. It will be empty if there were no warnings or errors.

‘GEV|GLO fit time function warnings and errors.csv’: If there were any warnings or errors produced within the function used to fit the distributions and calculate return levels, they will be stated in here.

‘GEV|GLO model fitting warnings and errors.csv’: If there were any warnings or errors produced during the actual model fitting, they will be stated in here. There is one column for warnings and one for errors for each type of model.

If there are any errors, then any results from the process which caused the error will be missing.

It is possible that the message ‘Ratio of bias to standard error is high’ will appear in the RStudio console window during the model fitting. This is an indication of the variability of the bootstrapped estimates and can be ignored in this instance.

Once this function has been run, there are two functions that can be used for plotting the results. These are ‘plot.selected.distributions’ and ‘plot.all’. Note that the example code to run the ‘fit.time’ function is provided with the example plotting code in the ‘plot.all’ section.

plot.selected.distributions

This function plots the results from the fitted stationary distribution and one of the fitted non-stationary distributions selected by the user (e.g. the best fitting non-stationary model based on likelihood ratio testing) for one user-selected gauge using the outputs from fit.time for a range of return periods. It also plots the confidence interval for one return period on a second plot. These plots will be automatically written out to a .png file in the ‘Outputs/Time covariate models’ folder. The function takes the following inputs, as documented in the package help files:

The output from the function is a .png file containing two plots (‘gauge_number (gauge name – if exists) GEV|GLO user-specified distribution results and confidence interval plots.png’). The plots should look like this:

The first plot shows the conditional return levels from the fitted stationary and one user-selected non-stationary distribution for specified return periods. In the example above, the varying location and scale parameters model has been plotted, as stated in the plot title. The second plot shows confidence intervals for the conditional return levels for one specified return period (in this example, for the 100-year return period). It can be seen that the non-stationary results vary over time, with the 100-year flow increasing from 39 to 135 m³/s over the period of record.

plot.all

This function plots the results from the fitted stationary and three non-stationary distributions (varying location, varying scale and varying location and scale) with time as a covariate for a user-specified gauge. There will be one plot per specified return period. The plots will be automatically written out to a .png file in the ‘Outputs/Time covariate models’ folder. The function takes the following inputs, as documented in the package help files:

The outputs from the function are .png files (one per return period) containing plots showing results from all four fitted distributions and should look like the following. It can be seen that the results from the different models can vary hugely.

All these functions could be run as follows.

# Set up data to use as inputs to the fit.time function
dist <- "GEV"
folder <- "C:/Non-stationarity/Data"
RPs <- c(2, 10, 20, 50, 100)
CI <- 90
# Run the fitting function, saving the outputs to an object called 'GEVfits'.
GEVfits <- fit.time(dist, folder, RPs, CI)
#> 
#>   |                                                                 |   0%
#> [1] "2 gauging stations queued for analysis:"
#> C:/Non-stationarity/Data/Inputs/24004.AM
#> C:/Non-stationarity/Data/Inputs/76005.AM
#>  
#> [1] "C:/Non-stationarity/Data/Inputs/24004.AM"
#> Loading required namespace: parallel
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> 
#>   |================================                                 |  50%
#> [1] "C:/Non-stationarity/Data/Inputs/76005.AM"
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> 
#>   |=================================================================| 100%
#> [1] "*** Total processing time:"
#> Time difference of 1.651176 mins

These results are then saved in your R environment and can then be used in the plotting functions. If you wish to save the results so that they can be used later in the plotting functions (in a new R session), this can be achieved by saving the results as a .RDS file (a serialized version of the data which is saved with gzip compression by default).
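A minimal sketch of saving and later reloading the fitted results (the file path is illustrative):

# Save the fitted results for use in a later session
saveRDS(GEVfits, file = "C:/Non-stationarity/Data/Outputs/GEVfits.RDS")
# In a later session, reload them before running the plotting functions
GEVfits <- readRDS("C:/Non-stationarity/Data/Outputs/GEVfits.RDS")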

Once the fitting function has been run, you can run the plotting functions.

If you wish to plot results for more than one gauge at a time, it is possible to write some code to loop through each gauge. For plot.selected.distributions, this is only straightforward if the same type of non-stationary model is to be plotted for each gauge. The following code could be used to loop through the gauges.
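The gauge numbers below are illustrative and the fitted object is assumed to be the ‘GEVfits’ object created by fit.time above:

gauges <- c("24004", "76005")          # gauges to plot
type <- "Varying location and scale"   # same non-stationary model type for every gauge
RPs <- c(20, 50, 100)                  # return periods to plot
CI.RP <- 100                           # return period for the confidence interval plot
CI <- 90                               # confidence interval
for (gauge in gauges) {
  plot.selected.distributions(folder, gauge, type, GEVfits, RPs, CI.RP, CI)
  plot.all(folder, gauge, GEVfits, c(2, 10, 20, 50, 100))
}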

Fitting models with physical covariates

The functions described in this section allow users to explore incorporation of physical covariates into flood frequency analysis.

One important consideration is the effect of collinearity and confounding variables on results when including multiple covariates. It is desirable that covariates included in a non-stationary statistical model are independent, otherwise computational issues may arise during inference and the fitted model may suffer from problems with interpretability. When time is one of the covariates, it is recommended that this is achieved by detrending the physical covariates based on the time covariate. This can be achieved in the fit.phys.cov function by using the detrend.cov parameter.
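For illustration, one common way of detrending a covariate against time is to regress it on the time index and keep the residuals; fit.phys.cov performs its own detrending internally when detrend.cov is set, so the sketch below (with invented covariate values) is only to convey the idea:

water_years <- 1960:2019
rainfall <- rnorm(length(water_years))                       # illustrative covariate values
detrended_rainfall <- residuals(lm(rainfall ~ water_years))  # remove the linear time trend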

There are two steps to carry out analysis using physical covariates in this package. The first is fitting of models (varying the parameters by a range of covariate combinations) to flow data for all gauges in the ‘Inputs’ folder. The second is obtaining return level estimates (flows) for each gauge individually using one of the fitted models for that gauge.

Unlike the time covariate, physical covariates need to be supplied by the user in a .csv file (see the example file 24004_covariates.csv). This should be included in the ‘Inputs’ folder with the flow data. The gauge number in the covariates file name must match the one in the .AM file, so if it has been amended in the .AM file, e.g. 52006x, then the covariates file should be called 52006x_covariates.csv. A value of the covariate must be associated with every AMAX flow (e.g. a value of annual or seasonal rainfall would be required for every water year in the flow record). The flow and covariate data should therefore cover the same time period, although the package will automatically trim both data sets to use only the overlapping time period. The file must contain a maximum of seven physical covariates, otherwise the fit.phys.cov function will produce an error. Gauges without a covariate data file will be skipped.

The example file contains a range of covariates. The NAO and EA indices were obtained from NOAA (https://www.cpc.ncep.noaa.gov/data/teledoc/telecontents.shtml). The indices are calculated from anomalies that are standardised by monthly means and standard deviations calculated over a 1950-2000 baseline period.
In all cases, annual values of the covariates were calculated using water years.
The rainfall covariates were centred and scaled, subtracting the mean and dividing by the standard deviation. The other covariates were already standardised in a similar way. It is recommended that you centre and scale all covariates.
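A minimal sketch of centring and scaling a covariate column before analysis (the file path and the ‘Rainfall’ column name are illustrative):

covs <- read.csv("C:/Non-stationarity/Data/Inputs/24004_covariates.csv")
covs$Rainfall <- as.numeric(scale(covs$Rainfall))  # subtract the mean and divide by the standard deviation
write.csv(covs, "C:/Non-stationarity/Data/Inputs/24004_covariates.csv", row.names = FALSE)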

fit.phys.cov

As with the fit.time function, the fit.phys.cov function fits stationary and non-stationary GEV or GLO distributions to AMAX flow data for all gauges provided by the user in an ‘Inputs’ folder. The non-stationary models have varying location and/or scale parameters that can vary with one or more physical covariate(s) and/or time. Outputs from the function are automatically written out to the ‘Outputs/Physical covariates models’ folder. The ‘Physical covariates models’ sub-folder will be automatically generated if it does not exist. Time (an index of water year; calculated within the function) is automatically included as a covariate alongside the physical covariates, although the function can fit a range of models with various combinations of covariates, so time will not be included in every model fitted.

The function takes the following inputs, as documented in the package help files:

The function fits a GEV or GLO distribution using maximum likelihood estimation for a stationary case and a number of non-stationary cases (depending on the option selected). The scale parameter is log-transformed to ensure positivity for all possible covariate values.

The outputs from running the function are as follows.

‘gauge_number (gauge name – if exists) GEV|GLO fitting information (physical covariates).csv’: A spreadsheet containing the information about each fitted model. There is one row per model. The locFit column gives the regression equation of covariates included in the location parameter for that model. The scaleFit column gives the regression equation of covariates included in the scale parameter for that model. The warnings and errors columns give any warnings or errors that occurred during the fitting of the model (this is more likely when option 1 has been selected as there may be many covariates in one model). Models with warnings or errors should not be selected for calculating return levels. Therefore, it is important to check this spreadsheet before running the ‘res.phys.cov’ function. The AIC and BIC are as described for the models with only time as a covariate. This file also contains columns which rank the models in terms of AIC and BIC. If detrend.cov has been set, this is reported in the ‘DetrendingCovariate’ column, and the ‘CovariatesOption’ column records which option was selected when the function was run.

The ‘params’ column gives the parameters of the fitted distribution. For a stationary fit, the location, ln(scale) and shape parameters are provided. However, as for the time-only model fitting, for the non-stationary fits, the outputs change depending on the varying parameter(s). If the location parameter is varying then the location parameter (\(\mu\)) is replaced with a \(\mu\) value for the intercept and each covariate. This is equivalent to using \(\mu_0\) to \(\mu_n\) (as used in the fit.time function results where there was only one covariate and hence there was only a \(\mu_0\) and a \(\mu_1\)). If the scale parameter is varying then the ln(scale) parameter (\(\phi\)) is replaced with a \(\phi\) value for the intercept and each covariate, equivalent to \(\phi_0\) to \(\phi_n\). This is described in the following equations where the parameters vary with a vector of covariates, \(x\):
\(\mu(x) = \mu_0 + \mu_1x_1 + ... + \mu_nx_n\),
\(\ln(\sigma(x)) = \phi(x) = \phi_0 + \phi_1x_1 + ... + \phi_nx_n\),
where \(n\) is the number of covariates included in the parameter.
For example, in the above equation, if covariate \(x_1\) represents time and covariate \(x_2\) is a measure of rainfall, \(\mu_1\) would express how floods change over time, once variations in rainfall are taken into account by \(\mu_2\) (Prosdocimi et al., 2014).

‘gauge_number (gauge name – if exists) GEV|GLO model fits (physical covariates).RDS’: This is for reference purposes, if any information from the fitting process needs to be accessed again in the future. It is a list containing information for each gauge analysed. For each gauge, the following information is saved: dist (distribution used - taken from input parameter), model_fits (described below), all_data_raw (the data used in the model fitting - not detrended), gaugeName (name of the gauge) and detrend.cov (the string from the input parameter). Most of these are also returned from the function so that they can be used in the next step (see code example).

The model_fits element is another list containing the following information: best_model (details of the best fitting model selected based on AIC or BIC, as specified by the user), best_model_formula (the formula used to fit best_model), stat_model (details of the fitted stationary model) and all_models_metadata (metadata associated with fitting models to all combinations of covariates selected by the user; this is the same as the .csv file described above).

Once this function has been run, the res.phys.cov function can be run for each gauge to obtain return levels (flows) based on one of the fitted models, selected by the user.

res.phys.cov

The res.phys.cov function uses one of the fitted non-stationary models from the fit.phys.cov function to estimate conditional, marginal and present-day marginal return levels (as described in the ‘Fitting non-stationary models’ section of this document). Stationary return levels are also estimated based on the fitted stationary model. The function also outputs plots of the results. The following inputs are required:

The outputs from running the function are as follows.

‘gauge_number (gauge name – if exists) GEV|GLO model fits (physical covariates).RDS’: A .RDS file is returned which contains information relating to the analysis for the gauge (from the model used to estimate return levels to the results of the frequency analysis). This can be opened in R and referred to at a later date if required. Much of the information is also written out in a range of .csv files, as described below.

‘gauge_number (gauge name – if exists) GEV|GLO model formula.csv’: This contains the regression equations used in the location and scale parameters for the selected model (i.e. which covariates have been included).

‘gauge_number (gauge name – if exists) GEV|GLO conditional return levels.csv’: This is similar to the ‘gauge_number (gauge name – if exists) GEV|GLO results.csv’ output from the fit.time function. The Flow column is the observed AMAX flow data and the WaterYear column is the water year associated with each flow. Other columns at the beginning show the covariate data used in the model fitting (detrended, if detrend.cov has been used). If time is included as a covariate in the selected model, then the WaterYearIndex column is the automatically-generated index representing time and the ScaledWaterYearIndex column is the centred and scaled version of this, which was actually used as the time covariate in the model fitting. The conditional return levels (and their confidence intervals) are given in the remaining columns. These vary based on whichever covariates have been used in the model selected in this function for calculating return levels. As previously mentioned, given that multiple covariates may have been used, these results can be less useful than those derived from stationary models or the non-stationary models with only time as a covariate. For example, the return level may be conditional on a specific water year and rainfall amount.

‘gauge_number (gauge name – if exists) GEV|GLO marginal return levels.csv’: This provides the marginal return levels (and confidence intervals) for each return period. Unlike the conditional return levels, there is only one value per return period. It indicates the return level corresponding to a particular averaged exceedance probability (the encounter probability) during the entire period of record. For example, the 100-year marginal return level is the value expected to be exceeded once every 100 years assuming the covariate distribution in the observed period is unchanged. There may be a higher chance of that exceedance occurring in the second half of the record, if the conditional return levels are higher in that half.

‘gauge_number (gauge name – if exists) GEV|GLO present-day marginal return levels.csv’: This provides the present-day marginal return levels (and confidence intervals) for each return period. This is similar to the marginal return levels, except that the time covariate has been set to be the last year in the input flow and covariate data (representing the present-day) and then the averaging is done over the full observed distribution of the physical covariates.

‘gauge_number (gauge name – if exists) GEV|GLO stationary return levels.csv’: This provides the return levels for each return period based on a stationary model. This is the usual definition of a return level (the flow associated with a particular return period).

‘gauge_number (gauge name – if exists) GEV|GLO model diagnostic plots.png’: This contains the diagnostic plots relating to the selected model. These should be checked to make sure that the selected model has a good fit, rather than just relying on the AIC or BIC value. The plots are the same as those from the fit.time function.

‘gauge_number (gauge name – if exists) GEV|GLO correlation plot.png’: This is a plot of the flow data against all covariates in the input file (detrended, if detrend.cov has been used) and can be used to see whether flow is correlated with the covariates used. The plot will look like the following.

‘gauge_number (gauge name – if exists) GEV|GLO model fit plot.png’: This is a plot of flow against conditional return levels from the non-stationary model and stationary return levels from the stationary model for the observed period. An example can be seen below. The dots are the conditional return levels and these will move up and down each year, depending on the covariates. The stationary return levels are the dashed horizontal lines. If the selected model is stationary then only the stationary return levels will be plotted, as conditional return levels will not be available.

‘gauge_number (gauge name – if exists) GEV|GLO encounter probability plot.png’: This is a plot of flow against encounter probability, in this case, the probability of the flow being exceeded over the full period of record. An example can be seen below. The plot shows both the stationary return levels (‘Stationary flow estimates’) and marginal return levels (referred to as ‘Integrated flow estimates’, as in the practitioner guidance), including a confidence interval for the latter. The observed AMAX flows are plotted as short black lines on the y-axis. If the selected model is stationary, then marginal return levels will not be available and therefore only the stationary results will be plotted, along with their confidence interval.

Any warnings and errors associated with the analysis will be written out in the R console. Due to the use of bootstrapping to calculate confidence intervals, 200 (the number of bootstrap samples used) models are fitted using maximum likelihood estimation. There are therefore likely to be some errors and warnings associated with some of these models.
You may get messages stating that there were a number of warnings or errors when fitting the model for a particular bootstrap sample. This is for information only. You may then also get a message stating that there are errors or warnings in more than 10 of the bootstrap models for at least one return period. This means that there may be issues with at least 5% of the 200 bootstrap models used to find marginal or present-day marginal (present-day will be stated in the message if applicable) return level confidence intervals.
In addition, you may get a message stating the number of errors or warnings that occurred when calculating marginal return levels from the 200 models in order to calculate the confidence intervals over all the required return periods. Bootstrap samples with errors or warnings will probably result in NAs and hence not be used to calculate the confidence interval. There may also be additional errors from the model fitting that appear in the console, such as ‘Error in solve.default(o$hessian)’ or ‘Error in diag(o$cov)’. These imply poor fitting of the model. Therefore, if there are a large number of errors or warnings then it may be that less confidence should be placed in the values of the confidence interval.

Information about the bootstrapping results, warnings and errors is also saved in the .RDS file under ‘margRLsAllInfo’. If you do need to refer to this information, the explanation for each output can be found in the ‘BootstrapMargRLs’ function help file (this function is only used within other functions and hence is not included in this document). The information includes the marginal return levels for each bootstrap sample and return period, so you can see where the problems are. It is possible that there may be some poorly-fitting bootstrap models that have been used to produce results for lower return periods but have resulted in NAs for higher return periods, in which case, it is likely that any results produced are not sensible. If this is a problem with lots of the bootstrap samples, then the confidence interval for each return period may be less accurate. Additionally, any models containing warnings or errors in the ‘fitting information’ .csv file that is output from the fit.phys.cov function should not be used. If any confidence intervals are coming out the same as the estimate, this also implies that there were problems with the model fitting and that the results should not be used.
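If you need to dig into this information, a minimal sketch of loading the .RDS file and listing its contents (the file path below is illustrative and should be replaced with the actual output file for your gauge):

res <- readRDS("C:/Non-stationarity/Data/Outputs/Physical covariates models/24004 GEV model fits (physical covariates).RDS")
str(res, max.level = 1)          # list the top-level elements of the saved object
boot_info <- res$margRLsAllInfo  # bootstrapping results, warnings and errors (see the 'BootstrapMargRLs' help file)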

These two steps for using physical covariates can be run as in the example below.

# Fit models using physical covariates
folder <- "C:/Non-stationarity/Data"
dist <- "GEV"
# Run function and save outputs to an object named 'physCovModelFits'
physCovModelFits <- fit.phys.cov(dist, folder, detrend.cov = "ScaledWaterYearIndex", 
                                 option = 2, best_fit_method = "BIC")
#> 
#>   |                                                                 |   0%
#> [1] "2 gauging stations queued for analysis:"
#> C:/Non-stationarity/Data/Inputs/24004.AM
#> C:/Non-stationarity/Data/Inputs/76005.AM
#>  
#> [1] "C:/Non-stationarity/Data/Inputs/24004.AM"
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000 
#> [1] "C:/Non-stationarity/Data/Inputs/76005.AM"
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000
#> Ratio of bias to standard error is high
#> Replicate 100 
#> Replicate 200 
#> Replicate 300 
#> Replicate 400 
#> Replicate 500 
#> Replicate 600 
#> Replicate 700 
#> Replicate 800 
#> Replicate 900 
#> Replicate 1000 
#> 
#> [1] "*** Total processing time:"
#> Time difference of 1.228278 mins

# Obtain return levels based on one of the fitted models
# Set inputs
gauge <- "24004"
RPs <- c(2, 10, 20, 50, 100)
CI <- 90
# Run function
res.phys.cov(folder, physCovModelFits, gauge, RPs, CI, model_type = "Best", 
             locFit=NULL, scaleFit=NULL, present_day = TRUE)
#> null device 
#>           1

Complete example script to run all functions

# Setup -------------------------------------------------------------------
# Install package dependencies
install.packages(c('changepoint', 'trend'))

# Install the nonstat package from saved location
path_to_file <- "C:/Non-stationarity/nonstat_0.2.0.tar.gz"
install.packages(path_to_file, repos = NULL, type="source")

# Load the nonstat package
library(nonstat)

# Set the path to the folder containing the 'Inputs' and 'Outputs' folders
folder <- "C:/Non-stationarity/Data"

# Trend testing -----------------------------------------------------------
MK.test(folder, sig.level = 5)
Pettitt.test(folder, sig.level = 5)
PELT.test(folder)

# Flood frequency analysis (time only as a covariate) ---------------------

# Run the fitting function
dist = "GEV"
folder <- "C:/Non-stationarity/Data"
RPs <- c(2, 10, 20, 50, 100)
CI <- 90
GEVfits <- fit.time(dist, folder, RPs, CI)

# Run the plotting functions
gauge <- "24004"
type <- "Varying location and scale"
fit.results <- GEVfits
RPs <- c(20, 50, 100)  # return periods to plot
CI.RP <- 100  # return period for confidence interval plot
CI <- 90 # set the confidence interval to be 90%
plot.selected.distributions(folder, gauge, type, fit.results, RPs, CI.RP, CI)

RPs <- c(2, 10, 20, 50, 100) # return periods to plot
plot.all(folder, gauge, fit.results, RPs)

# Flood frequency analysis (physical covariates) --------------------------

# Run the fitting functions
detrend.cov <- "ScaledWaterYearIndex"
option <- 2
best_fit_method <- "BIC"
physCovModelFits <- fit.phys.cov(dist, folder, detrend.cov, option, best_fit_method)

# Obtain return levels based on one of the fitted models
gauge <- "24004"
RPs <- c(2, 10, 20, 50, 100)
CI <- 90
res.phys.cov(folder, physCovModelFits, gauge, RPs, CI, model_type = "Best", 
             locFit=NULL, scaleFit=NULL, present_day = TRUE)

References

Box, G. E. & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211-243.

Eastoe, E. F. & Tawn, J. A. (2009). Modelling non-stationary extremes with application to surface level ozone. Appl. Statist. 58, 22–45.

Killick, R., Fearnhead, P. & Eckley, I.A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500), 1590-1598.

Killick, R. & Eckley, I.A. (2014). changepoint: An R package for changepoint analysis. Journal of Statistical Software, 58(3), 1-19.

Prosdocimi, I., Kjeldsen, T. R. & Svensson, C. (2014). Non-stationarity in annual and seasonal series of peak flow and precipitation in the UK. Natural Hazards and Earth System Sciences, 14(5), 1125-1144.

Rougé, C., Ge, Y. & Cai, X. (2013). Detecting gradual and abrupt changes in hydrological records. Advances in Water Resour., 53, 33–44.

Šraj, M., Viglione, A., Parajka, J. & Blöschl, G. (2016). The influence of non-stationarity in extreme hydrological events on flood frequency estimation. J. Hydrol. Hydromech., 64(4), 426–437.