LoadRunner

Introduction

LoadRunner's purpose is to calculate "flux" or "loads" of dissolved elements, carried in streams and rivers. This could be for a single USGS water testing site, or for an entire watershed. The element could be a single USGS constituent code, or a list of codes that measure the 'same' quantity by different means.

For example, LoadRunner was built to answer questions such as "how much carbon does the Ohio River watershed carry?" - a major component of the carbon sequestered by the watershed.

At the core of LoadRunner is the USGS LOADEST program - a batch program requiring specially prepared input files. LoadRunner's job is to:

Take the source data in its native form - USGS water quality and flow data files (or hand-crafted files),
Massage it to create LOADEST input files,
Run LOADEST for you, then
Present the results in a convenient format for further analysis.

The rest of this page goes into more detail on the options and operation of LoadRunner, organized by the area of the LoadRunner screen they appear in.

The LoadRunner startup screen.
Before clicking the Run button, you have to supply the two Required Input data files.
Further Options allow you to customize the run. LoadRunner remembers your selections from run to run.
After a Run, two new areas - Output and Results - will appear.

Required Input - Data Files

LoadRunner can do nothing without:

Water quality data - the concentration of each element dissolved in the streamflow, and
Flow data - how much water the stream carried that day.

LoadRunner extracts the quantities of interest from these huge files, massages them into LOADEST format, and runs LOADEST on the result.

The current version of LoadRunner assumes you've already fetched or prepared this data for all the sites you're interested in. (You can put a whole list of sites into one file of each type.) A future version may automate retrieving the data for you. For now, you fetch the data yourself and save it onto your local computer, from:

Quality file - USGS Water Quality Samples for the Nation

Flow file - USGS Surface-Water Daily Data for the Nation

How to fetch USGS files.

OR : You can massage your own data (flow, quality, or both) into a format LoadRunner will accept.

How to prepare arbitrary data for LoadRunner.

Note that the water quality data consists of samples. The flow data is (almost) daily, over many decades. The quality data is far sparser, sometimes going years at a stretch with no samples at all. From what little sample data is available, LOADEST fits several mathematical models to that data, then selects the best-fitting model. It then applies that load model to calculate load as a function of date and flow, and reports the results.

LOADEST has a nice fat manual on that mathematical challenge.

But for LoadRunner's purposes, note that the "calibration data" - the information we give LOADEST to build its models, LOADEST's "calib.inp" - all comes from the quality data file, including the stream flow to match with each sample measurement. For the calibration data, LoadRunner only uses data from the flow file when the water quality file provides no flow measurement for the day of the sample measurement - a rare event.
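The flow fallback for calibration data can be sketched roughly as follows. This is a minimal illustration in Python, not LoadRunner's actual code: the function name calibration_flow and the date-keyed dictionary standing in for the flow file are assumptions made for the example.

```python
from datetime import date
from typing import Dict, Optional, Tuple


def calibration_flow(sample_day: date,
                     quality_flow: Optional[float],
                     flow_file: Dict[date, float]) -> Tuple[Optional[float], bool]:
    """Pick the flow to pair with one sample (hypothetical helper).

    Returns (flow, came_from_flow_file). The flow recorded in the
    quality file wins; the flow file is consulted only when the
    quality file has no flow for that sample day.
    """
    if quality_flow is not None:
        return quality_flow, False
    # Rare case: no flow with the sample, so fall back to the daily flow file.
    fallback = flow_file.get(sample_day)
    return fallback, fallback is not None
```

The boolean in the result corresponds to the kind of count the run log reports: how many calibration days used flow from the flow file rather than the quality file.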

The flow file is used as input for calculating loads after the models have been calibrated - LOADEST's "est.inp" file. If days are missing from the raw data, LoadRunner pads in the missing days with averaged data, so that you can get good-enough estimates for stream loads for full months and years.

The USGS does a wonderful job on data quality control, with regular maintenance and corrections. But note that any database that large and complex, built by so many people over so many years, is bound to have quirks and errors that have gone unnoticed. LoadRunner reports any problem it recognizes, in the pursuit of its own modest goals. If you disagree with what LoadRunner decided to do about the problem, you can go back to the original data files and edit them.

Options

Recall that the bulk of LoadRunner's job is to automate running LOADEST for you. Most of these options have to do with how it massages the raw data into posing a LOADEST question.

The most critical option is the list of ElemNum - what you want LoadRunner to calculate loads of. (This is only an "option" because LoadRunner was built to automate alkalinity load estimation.) Note that unit conversions are also entered here, if any are needed.

Other options in order are:

Model - A LOADEST option. Model 0 tells LOADEST to try its entire zoo of canned mathematical models, and pick the best fit.
Alternatively, you might want to give it a particular LOADEST model number to use, and skip trying the others. You might want to do this, for instance, if you studied the Susquehanna River watershed in careful detail, and found Model 4 works best. And then for consistency you want to re-run the watershed with one fixed model.
Note that for each site, LoadRunner spawns a new LOADEST run. So with Model 0 selected, each and every site is going to have its own personal best model selected. Which you may or may not want.

Sigma - Used to toss data values that are too far from the mean. So if Sigma is set to 1.0, LoadRunner discards any concentration measurement (element, in the quality file) that is more than one standard deviation from the mean.
Normally this is set high (Sigma 3.0, or 3 standard deviations), and catches data outliers that suffer from someone entering too many zeroes. Note that it's normally no use at all for catching data values that are too small.
Sigma testing is not applied to non-detect values. So if you set the < detect option to use 1/2, LoadRunner knows which values were properly marked as non-detects, and doesn't apply the sigma test.
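A rough sketch of such a sigma test is shown below. It assumes a plain arithmetic mean and a population standard deviation computed on the raw concentrations of the detected values; how LoadRunner actually computes these is not stated here, and the name sigma_filter is made up for the example.

```python
from statistics import mean, pstdev


def sigma_filter(samples, sigma=3.0):
    """Split samples into (kept, discarded) by distance from the mean.

    Each sample is a (value, is_non_detect) pair. Non-detects skip the
    test entirely and are always kept.
    """
    tested = [value for value, non_detect in samples if not non_detect]
    if len(tested) < 2:
        return list(samples), []
    mu = mean(tested)
    sd = pstdev(tested)  # population std dev: an assumption of this sketch
    kept, discarded = [], []
    for value, non_detect in samples:
        if non_detect or abs(value - mu) <= sigma * sd:
            kept.append((value, non_detect))
        else:
            discarded.append((value, non_detect))
    return kept, discarded
```

With nineteen samples of 10.0 and one of 1000.0, a Sigma of 3.0 discards only the 1000.0. Because concentrations cannot go below zero, the lower cutoff (mean minus 3 standard deviations) often lands below zero, which is one way to see why the test rarely catches values that are too small.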

FullYear - true/false. If set to false, LoadRunner does no extrapolation, but rather calculates loads from the first to last sample date, padding flows for any missing dates between. But often you want to estimate loads by year, and with FullYear set to false, your first and last years will generally be partial years. So when FullYear is true, LoadRunner runs from the beginning of the year of the first sample, to the end of the year of the last sample.
Note that this only affects the begin and end dates of the load estimation. Whatever those dates are, LoadRunner pads the interval with any missing flow dates as necessary, to give you full daily coverage within the interval.
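The effect of FullYear on the begin and end dates can be illustrated with a few lines of Python. The function name estimation_window is hypothetical; only the date rule itself comes from the description above.

```python
from datetime import date


def estimation_window(first_sample: date, last_sample: date, full_year: bool):
    """Return the (begin, end) dates of the load estimation."""
    if full_year:
        # Stretch to whole calendar years around the sample dates.
        return date(first_sample.year, 1, 1), date(last_sample.year, 12, 31)
    # No extrapolation: run from first to last sample date.
    return first_sample, last_sample
```

For samples running from 14 March 1953 to 2 September 2005, FullYear true gives 1 January 1953 through 31 December 2005, while FullYear false keeps the two sample dates as the limits.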

SplitSite - true/false. Almost always true - run each site separately, meaning calibrate the load model coefficients separately for each quality testing site. Quality and flow data are matched by USGS site number code, and results are reported on a per-site basis, then appended into allsite composite files.
The only case where you would set SplitSite to false is when you cannot match quality and flow data by USGS site number. For instance, perhaps the testing site moved a mile downstream and got a new site number in 1965. So you want to take the two site numbers, with their quality and flow data, and munge them all together. But if this were part of a larger project, a better approach would be to massage the USGS site numbers in the source data files.
Note that if you prepare your own data, and don't provide a site_no column, the site name of "NoSiteGiven" is attributed to the data. (This is true for flow or quality data.) So if, for example, you have hand-crafted qwdata with no site names, and a USGS dv file that you've decided matches that data, you would set SplitSite to false, so that "NoSiteGiven" gets matched to data with another site name. However, if both were hand-crafted files, neither with site_no column, their site names would match ("NoSiteGiven"), and the SplitSite setting wouldn't matter.

FlowGap - As mentioned, LoadRunner interpolates missing daily flow data. We added this because sometimes a year is missing 3 or 4 daily flow values, and we don't want to lose the annual and monthly statistics due to these minor gaps. The FlowGap option allows the user to define the maximum number of consecutive flow days LoadRunner should interpolate (default 7 days max). Days skipped because of this option simply don't appear in LOADEST's "est.inp" file, or LoadRunner's daily flux output. Because months and years report averages of the values that are present, those reports don't change (though occasionally a month may be missing).
Note that FlowGap is talking about flow data gaps, of days. The other gap options are talking about calibration data gaps, of years.
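The FlowGap rule can be sketched as below. Straight-line interpolation between the neighboring known days is an assumption of this sketch (the exact interpolation LoadRunner uses is not spelled out here), and fill_flow_gaps is a made-up name.

```python
from datetime import date, timedelta
from typing import Dict


def fill_flow_gaps(flows: Dict[date, float], max_gap: int = 7) -> Dict[date, float]:
    """Fill runs of missing days no longer than max_gap.

    flows maps date -> daily flow. Longer runs are left missing, so
    those days would simply be absent from the estimation input.
    """
    days = sorted(flows)
    filled = dict(flows)
    for prev, nxt in zip(days, days[1:]):
        missing = (nxt - prev).days - 1
        if 0 < missing <= max_gap:
            # Linear interpolation between the two known neighbors.
            step = (flows[nxt] - flows[prev]) / (missing + 1)
            for i in range(1, missing + 1):
                filled[prev + timedelta(days=i)] = flows[prev] + step * i
    return filled
```

Given flows on 1 January, 4 January and 20 January, the two-day gap is filled, but the fifteen-day gap exceeds the default of 7 and stays empty.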

GapYear% - This option allows you to place gaps for years with no water quality data. It has no effect on the LOADEST data modeling, only the LoadRunner output dates. When you see the gap in the output files, you get a better feel for the data coverage.
The simplest choices are "Any" and "None". "Any" means just go from beginning year, to end year, of the data, without checking for gaps. "None" means to skip any years that have no sample measurements.
The percentage number settings govern what to do in between. Say we have data from 1953 to 2005 - 53 years inclusive. If you say the maximum gap is 5%, in this run that would be three years (5% of 53 is 2.65, rounded to the nearest whole number). So a run of four consecutive years missing data would not be estimated, but gaps of three or fewer years would run.
Note that the run is not chopped up into separate groups of years run separately. So the LOADEST model fitting is done on the same data, whether gaps are suppressed or not, and that one model fitting is used for all years. This may not be quite what you want, if the load carrying behavior of the stream has changed substantially over the years - you might want the 'early years' estimated separately from the 'recent years'. But to do that, you need to chop up the input files by hand - LoadRunner has no options to help.
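The GapYear% arithmetic can be worked through in a short Python sketch. The function names are hypothetical, and rounding halves upward is an assumption; the text above only says the value is rounded to the nearest number.

```python
import math


def max_gap_years(first_year: int, last_year: int, gap_percent: float) -> int:
    """Largest run of empty years that still gets estimated."""
    span = last_year - first_year + 1  # inclusive year count
    return math.floor(span * gap_percent / 100 + 0.5)  # round half up (assumed)


def skipped_years(sample_years, first_year, last_year, gap_percent):
    """Years left out of the output: runs of empty years longer than the limit."""
    limit = max_gap_years(first_year, last_year, gap_percent)
    have = set(sample_years)
    skipped, run = [], []
    # One extra sentinel year past the end flushes a trailing run.
    for year in range(first_year, last_year + 2):
        if year <= last_year and year not in have:
            run.append(year)
        else:
            if len(run) > limit:
                skipped.extend(run)
            run = []
    return skipped
```

For 1953 to 2005 at 5%, max_gap_years returns 3, so a four-year hole such as 1960-1963 is skipped while a three-year hole such as 1970-1972 is still estimated. In this sketch a gap_percent of 0 behaves like the "None" setting; "Any" would simply bypass the check.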

preGap & postGap - We have found that sites with just a few data points anchoring either end of a data set can sometimes produce erroneous results. With these options, if the calibration data has gaps (from GapYear%), LoadRunner omits groups of years with too few points at the beginning or end of the run. For example, say a site had 1 or 2 data points before 1940, then fairly constant data from 1950-1985, then a lonely outlier in 2001. If LOADEST runs on all of these points, the beginning and ending outliers get overweighted.
Example: If preGapMin is 3, and there are only 2 calibration points before the first gap, those 2 points get dropped. Likewise for the last group of points after the last gap. This is only applied once. So if the points go 2-gap-1-gap-20-gap-1-gap-2, only the first 2 and last 2 would get dropped by preGapMin = postGapMin = 3. The next inward groups of 1 point don't get dropped.
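The example above can be expressed as a small Python sketch. It works on the count of calibration points in each group between gaps; the function name and the list-of-counts representation are assumptions made for illustration.

```python
def trim_end_groups(groups, pre_gap_min=3, post_gap_min=3):
    """Drop an undersized first and/or last group of calibration points.

    groups lists the number of calibration points in each run of years
    between gaps, in time order. Each end is checked once only.
    """
    kept = list(groups)
    # Only meaningful when gaps split the data into more than one group.
    if len(kept) > 1 and kept[0] < pre_gap_min:
        kept = kept[1:]
    if len(kept) > 1 and kept[-1] < post_gap_min:
        kept = kept[:-1]
    return kept
```

Calling trim_end_groups([2, 1, 20, 1, 2]) leaves [1, 20, 1]: the outer groups of 2 are dropped, and the inner groups of 1 survive because the test is applied only once at each end.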

< detect - This governs what to do with quality file concentration values that are given as less than a detection limit. LOADEST calls these "censored" values. Three options:
1. drop - Omit these values from the calibration data LoadRunner sends to LOADEST.
2. use 1/2 - Code as normal values (not non-detects!) using 1/2 of the detection limit stated for this value.
3. pass to LOADEST - Code as LOADEST "censored values", and let LOADEST do its best with them.
For example, say in 1955 there was a value of "< .5" and in 1995, a value of "< .1". With the < detect - use 1/2 option, LoadRunner sends these to LOADEST as ".25" and ".05". Neither value is "censored" in LOADEST's worldview, but both are marked with "<" as a caveat in the calibration file, for ease of human inspection. The normal data issues report gives counts of these values.
However, if you use option "pass to LOADEST", the values are unchanged and marked as censored for LOADEST. This is the correct thing to do if you're dealing with an element with many non-detects, such as pesticides. Unfortunately, we've encountered runs where LOADEST hung completely using such data. In all cases, the data was actually bad, but there was no safe way for LoadRunner to divine this. So, if you do use this option, and LOADEST hangs, you'll need to study the data carefully, and possibly consult LOADEST support. In a multi-site run, you could use one of the other options just to explore the dataset, then fix the problems and run with "pass to LOADEST" after resolving them.
Please see the LOADEST site section on "Publications that Discuss Detection Limits and Censored Data" for further discussion of when and why detection limits are an issue.
Note that the sigma test is not applied to non-detects, provided they were properly marked in the USGS quality data file. (Values of "0" are not properly marked, and simply dropped.)
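The three < detect choices can be summarized in a sketch like the following. The function name, the mode strings, and the (value, censored) return shape are assumptions for illustration; the "<" caveat mark written to the calibration file is not modeled.

```python
def encode_non_detect(limit: float, mode: str):
    """Translate one "< limit" value according to the < detect option.

    Returns None when the value is dropped, otherwise a pair
    (value_sent_to_loadest, censored_flag).
    """
    if mode == "drop":
        return None
    if mode == "use 1/2":
        # Sent as an ordinary value at half the stated detection limit.
        return limit / 2, False
    if mode == "pass to LOADEST":
        # Value unchanged, flagged as censored for LOADEST.
        return limit, True
    raise ValueError(f"unknown < detect mode: {mode!r}")
```

With "use 1/2", the values "< .5" and "< .1" from the example above become .25 and .05, neither flagged as censored.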

Elem - A short nickname used in file naming.

ElemNum - The most critical option - the list of USGS element codes you want LoadRunner to calculate loads of. Please see the ElemNum help file. Unit conversions are also entered on this line.

Site label - A longer comment included on each output file.

Run Directory - LoadRunner creates a working directory for its inputs and outputs and intermediate files - one directory for each click of the Run button. This option controls where to create those directories.

The Run Directory and Output

For each run (click of the Run button), LoadRunner creates a directory with three subdirectories:

inputs - Copies of the raw data you gave LoadRunner
loadest - Both inputs and outputs of the LOADEST runs, in a subdirectory for each USGS site included in the run. You can examine the calibration and estimation files and the raw results of the LOADEST program here.
outputs - LoadRunner's restatement of LOADEST's results, data issues found, etc.
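As a rough picture of that layout, the sketch below builds an equivalent directory tree in Python. The function name, the run directory name, and the use of the bare site number as the per-site subdirectory name are all assumptions; LoadRunner's real naming scheme may differ.

```python
from pathlib import Path


def make_run_directory(parent: Path, run_name: str, site_numbers) -> Path:
    """Create a run directory with inputs, loadest and outputs subdirectories."""
    run_dir = parent / run_name
    for sub in ("inputs", "loadest", "outputs"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    # One LOADEST working directory per site (naming is hypothetical).
    for site in site_numbers:
        (run_dir / "loadest" / site).mkdir(exist_ok=True)
    return run_dir
```

For a run with a single placeholder site "01234567", this yields inputs, loadest/01234567 and outputs under the run directory.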

After a run completes, an "Outputs" group of buttons is available on the right of the LoadRunner screen, allowing you to open several of these items directly, or the whole run directory for your rummaging convenience.

Typically a run covers many sites, since we are most interested in running whole watersheds. After each site is completed, the Output area buttons give you quick access to the result files for that site. Once all the sites have completed running, these per-site results are appended into files with allsite_ prepended to the file name. Note that xxx_model.txt summary files are only available site-by-site - there is no "allsite" composite form of that file.

More on the contents of these files in the Results section below.

Results

This text area at the bottom of the screen reports LoadRunner's progress on each site as it runs. Its contents are also written to the xxx_runlog.txt file - the first button in the Outputs area. The messages here include:

The run date/time and run directory name used for all LoadRunner files associated with the current job.

The arguments used for the batch LoadestRunner program underlying the LoadRunner user interface.

Each site number as its input is read, then processed.

Any data issues where info bits were discarded during reading.
1. Concentrations of 0.0 are discarded - even if true, LOADEST will crash, and in our experience, it isn't true.
2. Currently concentrations less than detection limit are discarded, because they also cause LOADEST to crash sometimes. We've contacted the LOADEST author about that.
3. Some lines just don't parse - a value of "N" in a numeric field, etc.
4. Values discarded for having failed the sigma test.
Note that unlike the data issues below, this is the end of the line for those info bits. After the "Processing < sitenum >" message, further data warnings are on data that was included in the main calculations.

If any gap years were omitted, the dates for those.

How many calibration measurements were used, and a caveat summary on that data:
1. The number of measurements the USGS marked as estimated data.
2. How many days used flow data from the flow file instead of quality file for calibration.
3. Measurements the USGS marked as below the detection limit (currently disabled - we ignore those because including them crashed LOADEST sometimes).
Calibration measurements with these "data issues" are clearly marked in the "calib.inp" file for this site/run (click the directory button under Output, then look in the "loadest" subdirectory for the LOADEST files of the site number of interest).
Calibration "data issues" are not marked on the daily flux output file. They really affect all days.
If there aren't enough calibration measurements available, this site's run stops. LOADEST needs a minimum of 12 calibration points, or it can't estimate stream loads.

How many days of flow data were used, and a caveat summary on that data:
1. The number of measurements the USGS marked as estimated data.
2. How many days LoadRunner padded with interpolated flow values, because there was no usable USGS data.
3. How many days LoadRunner supplied averaged flow values, because the USGS data had multiple entries.
4. How many days LoadRunner replaced a flow value between zero and 1.0, with an interpolated value.
Flow measurements with these "data issues" are clearly marked in the "est.inp" file for this site/run (click the directory button under Output, then look in the "loadest" subdirectory for the LOADEST files of the site number of interest).
Flow data issues are marked in the "Caveat" column on the daily xxx_flux.txt output files. This column doesn't appear in the xxx_monthflux.txt and xxx_annualflux.txt composited files.
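One of these flow caveats, averaging multiple entries for the same day, might look like the sketch below. The function name and the caveat code "avg" are invented for the example; the actual codes LoadRunner writes to the Caveat column are not listed here.

```python
from collections import defaultdict
from statistics import mean


def collapse_duplicate_days(rows):
    """Reduce (day, flow) rows to one flow per day.

    Returns {day: (flow, caveat)}; caveat is "avg" (a made-up code)
    when several entries for the day were averaged, else "".
    """
    by_day = defaultdict(list)
    for day, flow in rows:
        by_day[day].append(flow)
    result = {}
    for day, values in by_day.items():
        if len(values) > 1:
            result[day] = (mean(values), "avg")
        else:
            result[day] = (values[0], "")
    return result
```

Counting the days that carry a caveat gives the kind of per-site total the run log reports.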

Run completion time.

The Output Files

There are a number of output files written by LoadRunner. (xxx_ is whatever nickname you filled in for the Elem field under Options. sss_ is either a USGS site number or allsite, for the results of all sites appended together.)

xxx_runlog.txt - Described above - this is just a file copy of the Results area. It covers the whole multi-site run. (So it is an allsite_ file in effect.)

xxx_model.txt - There is one of these for each site. This file echoes the data issues that were written to the run log, and also extracts some of the more interesting info from the LOADEST xxx.out file - which mathematical model was selected, the coefficients chosen, and some statistics on how well the model fit the data. The last of these model files is dumped to the Results area after the last site completes processing.
The allsite_ version is not an appended version, but a table of site, model, statistics, and the model coefficients, one line per site.

sss_xxx_calib.txt - A restatement of the calibration data points (LOADEST calib.inp), in a form convenient for further analysis and plotting via Excel. Note that if a site has too few calibration points to run, LoadRunner still outputs sss_xxx_calib.txt, but it isn't included in allsite_xxx_calib.txt. In other words, the allsite_ version only includes sites that ran.
Days with quality data issues are marked with a code in the "Caveat" column.

sss_xxx_flux.txt - Daily flux estimates, in a form convenient for further analysis and plotting via Excel.
Days with flow data issues are marked with a code in the "Caveat" column. This column doesn't appear in the other ..._flux.txt files.

sss_xxx_monthflux.txt - Much like the daily flux, except the fields are the average values for the month.

sss_xxx_annualflux.txt - Much like the daily flux, except the fields are the average values for the year.
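The monthly file can be thought of as an average over whichever daily values are present, which is why short flow gaps leave it largely unchanged. Below is a minimal sketch of that averaging for a single flux column; the function name and the date-keyed dictionary are assumptions, and the real files carry more fields than this.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, Tuple


def monthly_average(daily_flux: Dict[date, float]) -> Dict[Tuple[int, int], float]:
    """Average daily flux by (year, month), using only the days present."""
    buckets = defaultdict(list)
    for day, flux in daily_flux.items():
        buckets[(day.year, day.month)].append(flux)
    return {key: mean(values) for key, values in sorted(buckets.items())}
```

Grouping by day.year alone instead of (year, month) would give the annual equivalent. A month with no daily values at all simply has no entry, matching the occasional missing month mentioned under FlowGap.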

Ginger Booth for Peter Raymond, January, 2008