
Using GIS to generate spatially-balanced random survey designs for natural resource applications

David M. Theobald1, Don L. Stevens, Jr.2 , Denis White3, N. Scott Urquhart4, and Anthony R. Olsen3

Running title: Spatially-balanced survey design using GIS

1 Natural Resource Ecology Lab, Colorado State University, Fort Collins, CO 80523-1499

2 Department of Statistics, Oregon State University, Corvallis, OR 97331-4501

3 U.S. Environmental Protection Agency, Western Ecology Division, Corvallis, OR 97333

4 Department of Statistics, Colorado State University, Fort Collins, CO 80523

Communicating author: D. Theobald, [email protected]

Submitted to: Environmental Management

Page 1 -- 7/6/2005

Abstract

Sampling of a population is frequently required to understand trends and patterns in natural resource management because financial and time constraints preclude a complete census. A rigorous probability-based survey design defines the framework that specifies where to sample so that inferences from the sample apply to the entire population. Such a framework should be used in natural resource and environmental management situations because it provides the mathematical foundation for statistical inference. Here we provide a short review of traditional probability-based survey designs and describe a recent approach called spatially-balanced sampling. We develop an implementation in a geographic information system (GIS), called the Reversed Randomized Quadrant-Recursive Raster algorithm. The implementation of this algorithm in GIS provides environmental managers with a practical, useful tool to generate simple, efficient, and robust survey designs for natural resource applications. Moreover, factors to modify the sampling intensity, such as strata, gradients, or accessibility, can be readily incorporated and visualized. We provide examples of survey designs for point-, line-, and area-based features (e.g., lakes, streams, and vegetation) generated using our Spatial Sampling tool.

Introduction

Understanding the status and trends of natural resources is an important goal in natural resource management. Because financial and time constraints preclude a complete census of an entire population of interest, sampling is frequently required. For example, the US National Park Service's Inventory & Monitoring Program is currently undergoing an extensive effort to determine park "vital signs". Developing useful monitoring designs is a key component to this effort (Oakley and others 2003). Monitoring design requires careful consideration of the resource


to be monitored (target population), what will be measured (indicator), how it will be measured (response design), where it will be monitored (survey design or site selection, but also called sample design), how frequently it will be monitored (time selection), and how measurements will be summarized (monitoring analysis). In this paper we have four goals. First, we briefly review the need for a rigorous survey design so that characteristics about the whole study area or population may be estimated from a sample or part of the population (Thompson 2002). In natural resource applications, selected features are typically identified by their location. This contrasts with the perspective under which classical sampling developed, namely that units were selected from a list. Consequently, natural resource sampling needs an alternative perspective: sampling in space. Second, we compare and contrast common designs for probability-based sampling. Third, we describe the advantages of a recently developed approach called spatially-balanced sampling (SBS) to make it better known to natural resource management scientists. We argue that SBS is a useful and practical alternative to simple random sampling. Finally, we describe a novel algorithm, implemented within a geographic information system (GIS) framework, that produces a spatially-balanced design. This implementation as a GIS tool makes the SBS method more accessible to non-specialists and facilitates survey design and evaluation.

Sampling populations

There are two scientifically defensible approaches for extrapolating from a sample to an entire population: model-based and design-based inference (Smith 1976; Särndal 1978; Hansen and others 1983). (Note: these model-based and design-based issues were also discussed in a spatial context by de Gruijter and Ter Braak (1990) and Brus and de Gruijter (1993)). First, inference can be based on explicit specification of the relationship between the subset and the


entirety. This is called model-based inference, where a "model" is used to describe the relationship between a subset (or sample) and a population. The model may be something as simple as an assumption that the observations are "representative" of the population to the extent that the mean of the sample should be close to the mean of the population. The advantage of model-based inference is that it enables very general and precise inference from limited data. For example, if the population response follows a normal distribution, only the mean and standard deviation are needed to infer the shape of the entire population distribution. In a sense, the inference "borrows strength" from the model: the model structure provides the framework for the inference, and the precision of the inference is judged relative to the model. The difficulty with model-based inference stems from the same basis as its advantage: the structure of the model. If the model is not a faithful description of reality, estimates of the population based on the model may have little resemblance to the true population values. Because precision is judged relative to the model, there may be no indication that the inference is substantially in error. The model may fit observed data well, resulting in apparently high precision for population estimates. Yet, if the observations from a selected subset of the population conform to the model while the rest of the population does not, the extrapolation and its apparent precision may be substantially in error. Practically speaking, most real-world ecological systems are simply too complex to use model-based inference methods with much reliability. The second approach for describing ecological systems is to select a subset from a population or study area via a probability sample and to use design-based inference methods.
This approach is called design-based inference because the resulting inference draws its generality and validity from the survey design -- not from any presumption of the correctness of


a model or representativeness of the sample locations. Sample locations are selected from space (rather than a list) with a known probability that may reflect other factors such as physical characteristics related to the response variable (e.g., topography, vegetation type, precipitation, etc.). Provided the design is properly applied, the resulting inference can be made with known confidence, and confidence can be increased by increasing sample size. Probability-based sampling should be used in natural resource and environmental management situations because it provides the mathematical foundation for statistical inference (Stehman 1999). The key characteristic of a probability-based sample is the specification of an inclusion probability (π) that is known and non-zero. That is, all locations within a study area (or all members of a population) have some known chance of being selected. This contrasts sharply with non-probability-based sampling, where samples are selected in an ad hoc manner and the likelihood of selection is not known. Violation of known and non-zero probability often occurs in natural resource applications when representative locations are chosen based on judgment alone, for example when locations are selected because of convenient access, or when patches of homogeneous vegetation meeting minimum size constraints are selected as "training sites" for classification of a remotely sensed image (Stehman 2001). A related issue is that exclusion zones are often used to reduce the area that needs to be sampled, such as restricting sampling to patch interiors, to locations near roads, or to certain types of patches. Exclusion zones narrow the search area, but because the inclusion probability within exclusion zones is zero, no inference from the samples can be extended to exclusion zones (Stehman 2001). Although exclusion zones seemingly make more efficient use of scarce sampling resources, they are often problematic.
For example, preliminary information on distribution and life history requirements of a species or process of interest may be used to exclude certain land


cover types from a survey design (e.g., non-riparian cover types are excluded). Three potential problems arise with the use of these exclusion zones. First, knowledge of where species "should" occur is occasionally limited, and scientists later are "surprised" to find species extending beyond where they were originally thought to reside (e.g., Thompson 2004). Second, spatial data on surrogate variables used to generate the exclusion zones commonly have some level of uncertainty associated with them (e.g., a polygon is misclassified), are not fine-grained enough to resolve important features on the ground (e.g., small, narrow strips of riparian vegetation), and/or are inadequate temporally (i.e., the wrong time of year or not current enough). Third, the variability of natural processes (like El Niño/La Niña cycles, fire/disease outbreaks, etc.) and human-induced changes such as climate change and land use change (e.g., urbanization, tourism, etc.) often cause shifts in the distribution of a species or process of interest. Without some, perhaps modest, allocation of resources to low quality or "not preferred" habitat, trends cannot be documented in a direct way (e.g., Peterson and others 1999). Because we believe there are important advantages of design-based inference using probability sampling, we will not consider non-probability-based survey designs further in this paper.

Probability-based survey designs

There are a variety of traditional, basic probability survey designs that have different strengths and weaknesses for natural resource applications. Stehman (1999) developed five criteria to examine the relative trade-offs of different survey designs. We briefly describe these criteria here and provide a summary in Table 1. Two related criteria are the need for low estimated variance and the quantification of uncertainty of estimated variance. Sampling should have low variance for estimates of important measures that are being sampled, and the variance


of the estimates should be acceptably small. A third criterion is the need for a spatially balanced, or well-distributed, sample. Commonly, natural resource data are spatially autocorrelated, which means that "everything is related to everything else, but near things are more related than distant things" (Tobler 1970, pg. 236): sample locations that are close together tend to be more similar than more distant ones (Stevens and Olsen 2004), and points that are farther apart tend to be more independent (Stehman 1999). Hence, a well-distributed survey design that is spatially balanced improves the precision of estimated values by capitalizing on likely spatial autocorrelation and maximizing spatial independence among sample locations. Further, designers of natural resource surveys cannot foresee all future applications of the resulting data. A spatially-balanced sample assures that a newly defined area will contain sample points approximately proportional to the presence of the resource and the extent of the new area. Spatial balance can be visualized by computing Thiessen or Voronoi polygons around sample locations. If the areas of the polygons tend to be similar (low variance in polygon area), then the sample is well balanced. In contrast, simple random sampling (SRS) generates designs or sample patterns with high variability in polygon area (high variance). A fourth criterion to consider is simplicity. A simple design makes it easier to generate sample locations through a straightforward procedure or computer algorithm, to implement the design and locate samples in the field (uploading locations to a GPS unit makes this easier), and to analyze the data, because methods to estimate accuracy and associated standard errors are straightforward and well-defined. Fifth, a survey design should be cost effective. The overall goal of a project usually is to develop estimates with an acceptable level of precision for the lowest cost.
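The Voronoi-based balance check described above can be sketched with a simple raster approximation (hypothetical code, not the authors' tool): every cell of a grid is assigned to its nearest sample point, a polygon's area is approximated by its cell count, and lower variance in those areas indicates better spatial balance.

```python
import numpy as np

def voronoi_area_variance(points, grid_size=100):
    # Assign every cell of a grid_size x grid_size raster to its nearest
    # sample point; each Voronoi polygon's area is approximated by the
    # number of cells assigned to that point.
    ys, xs = np.mgrid[0:grid_size, 0:grid_size]
    cells = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    d2 = ((cells[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    areas = np.bincount(d2.argmin(axis=1), minlength=len(points))
    return areas.var()

rng = np.random.default_rng(42)
random_pts = rng.uniform(0, 100, size=(16, 2))            # one SRS draw
gx, gy = np.meshgrid(np.arange(12.5, 100, 25), np.arange(12.5, 100, 25))
balanced_pts = np.column_stack([gx.ravel(), gy.ravel()])  # 4x4 systematic grid

# The well-distributed design should show much lower polygon-area variance
print(voronoi_area_variance(balanced_pts), voronoi_area_variance(random_pts))
```

With the same number of points, the systematic layout yields nearly equal polygon areas, while a single random draw typically produces both clusters and voids, inflating the variance.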


We added a sixth criterion to Stehman's (1999) list to emphasize some practical realities that natural resource researchers and managers commonly face: the need for flexibility. Because all projects face a variety of uncertain and often unforeseen forces, a flexible design is paramount. A primary uncertainty of most projects is that the actual number of samples that can be collected is not known reliably during the construction of the survey design. However, the cost to acquire each sample varies, depending on factors such as travel costs, ease of access, and possible economies of effort for nearby locations. In addition, the funding available to support a sampling effort often decreases during the course of a project. If the desired number of samples cannot be collected, then some survey designs may be compromised (e.g., systematic sampling). As a result of these uncertainties, flexibility in the design is a useful characteristic, so that adjustments to the original target sample size can be made after the survey design has been specified. Moreover, the extent of the population being sampled can change over time, so adjustments to the sampling frame (or study area) need to be made as refinements to the population are made. There are four common probability-based spatial survey designs: simple random, systematic, stratified, and cluster sampling (Table 1). Spatial simple random sampling (SRS) is a common technique used to sample from a population S by generating a series of random x,y values that are paired to form a set of s random sample locations. The location of each sample is assumed to be independent of other samples. The variance for this design is high, though variance estimation is straightforward (Stehman 1999). SRS is simple and flexible, and additional samples can be added to an existing set of samples.
Although theoretically it is possible to generate a well-distributed survey design using SRS, it is commonly recognized that any single realization of a set of s random points generated from SRS often has


clusters of samples or areas devoid of samples (Stevens and Olsen 2004). As a result, SRS is frequently not spatially well-balanced, leading to survey designs that are spatially inefficient because they do not adjust for spatial autocorrelation. Systematic sampling (SyS) locates sampling points regularly, usually equally spaced on a regular grid or along a linear feature covering the entire population. The design is simple to implement and ensures that all portions of a study area are represented in the sample and so the design is spatially well distributed. However, the variance of estimates can be high, and a variance estimate is not possible without some assumption about the spatial structure of the population. For example, one may assume that the population values are generated by a spatially random process to make up for the lack of randomness in the sample point selection. Also, this method is sensitive to the potential alignment of systematic grid axes with population features. For example, a periodic response coherent with the sampling interval could lead to extreme variation between sample replications (Gilbert 1987). Furthermore, extreme variation would not be detectable based on the results of a single sample draw (or realization). SyS provides good spatial coverage of the target population but allows only limited ability for different portions of the study area to be sampled with varying intensity. Variable intensity can be achieved by halving or doubling the grid spacing. Moreover, systematic survey designs are often compromised in practice by the need to have different sites sampled in different years or through non-response patterns (Cochran 1977). 
Non-response occurs when a sample location cannot be visited because of denial of access (e.g., a private land owner or military base) (Lesser 2001), because the feature thought to be located there does not exist (e.g., a dried-up stream or incorrect vegetation type), or because a location simply requires too much time and effort to reach or is dangerous to access (e.g., a steep rock cliff or swift stream). Occasionally, logistical, technical,


or safety issues in the field preclude collection of a sample at that location, such as the inability to use GPS to locate a sample point in a deep canyon or dense forest. Note that the non-response problem also plagues studies that employ remote sensing data (albeit less so), such as when aerial photo data are not available for a particular portion of a study area or a cloud covers a portion of an image. Stratified random sampling (StRS) provides some spatial structure to the overall population through strata, and each stratum is then sampled independently (commonly using SRS). Variable or unequal probability sampling (Overton 1993) is used when a type of resource occurs less frequently than other types. Variable probability sampling provides unbiased estimates of a target population, and the uncertainty can be quantified. For example, if a certain type of vegetation is much rarer in a landscape than a common type (e.g., riparian zones vs. coniferous forest), then the inclusion probability can be adjusted so that an adequate number of samples in the rare type (or stratum) is generated. (Typically, inclusion probabilities are computed by dividing the number of samples desired by the total number of elements in a population or stratum.) If strata are specified accurately, so that the distribution of the response variable of interest is more homogeneous within strata than between strata, then the variance may be reduced compared to SRS. However, when strata are specialized for certain situations, they quickly lose utility in other situations (lower flexibility), and they are sensitive to possible errors in the data (maps) used to define the strata. Designs with strong differences in the inclusion probability between strata (e.g., 1.0 for riparian, 0.1 for coniferous forest) are highly sensitive to possible errors in mapping of the strata, and can potentially be worse than an equal-probability design (Stevens and Olsen 1991).
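The parenthetical rule above (samples desired divided by elements per stratum) can be made concrete with a small, hypothetical example; the stratum names and frame sizes are invented for illustration only.

```python
# Hypothetical sampling frame: number of raster cells in each stratum,
# and the number of samples desired from each.
desired = {"riparian": 10, "coniferous": 10}
frame_size = {"riparian": 50, "coniferous": 5000}

# Inclusion probability per stratum = samples desired / total elements
pi = {stratum: desired[stratum] / frame_size[stratum] for stratum in desired}
print(pi)  # {'riparian': 0.2, 'coniferous': 0.002}
```

Even though both strata receive the same number of samples, a riparian cell is 100 times more likely to be selected than a coniferous-forest cell, which is exactly the unequal-probability behavior the rare type requires.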


Cluster sampling (CS) is often used when there is some natural spatial grouping of population units. For example, a cluster sample of basins might be selected and then one might observe all stream segments within the selected basins, or utilize two-stage sampling to select a sample of streams within the selected basins. Cluster sampling is often used for administrative or operational convenience, because it is often much easier to implement than SRS. The initial location of each cluster can be selected using SRS or a spatially constrained design, e.g., SyS or StRS. If two-stage sampling is used, SRS or a spatially constrained design could be used at the second stage. Adaptive cluster sampling is often used for rare or elusive species when individuals occur in groups (Thompson 2004). Sites are selected, usually by SRS. Each site is visited, and if a target feature is found at that site (e.g., a rare species observed), then sampling sites are added in adjacent areas. Because of tradeoffs between survey design issues and the practical challenges of field work, developing a statistically rigorous, efficient, robust, and flexible survey design in natural resource applications remains challenging (Table 1). SRS is straightforward and flexible, but has high variance, is not spatially well-balanced, and is inefficient when landscape composition is uneven (e.g., when rare landscape types are present). SyS is spatially well-balanced, but is inefficient in heterogeneous landscapes and inflexible to changes in the number of samples. StRS is typically more efficient than SRS and SyS, but is not necessarily well balanced spatially and is sensitive to possible errors in defining strata. In summary, traditional approaches including simple, systematic, stratified, and clustered survey designs have significant limitations for natural resource applications.


Spatially balanced survey design

Spatially-balanced sampling (SBS) is a relatively new approach to developing survey designs that are useful for natural resource researchers and practitioners (Stevens 1997; Di Zio and others 2004). The SBS approach generates survey designs that are probability-based, have low to moderate variance, are spatially well-balanced, and are simple and flexible. Our method to implement SBS is similar to the Generalized Random Tessellation Stratified (GRTS) algorithm, which has been described elsewhere (Stevens 1997; Stevens and Olsen 2000; Stevens and Olsen 2004). GRTS is a type of stratified survey design that is probability-based and has the advantage of providing a spatially-balanced design, and it has been used in a number of studies (e.g., Hall and others 2000; Herlihy and others 2000). Stevens and Olsen (2003) developed a variance estimator for GRTS. Spatially-balanced sampling leads to more efficient sampling, defined as providing more information per sample, because it attempts to maximize the spatial independence among sample locations. A useful way to measure the statistical efficiency of a survey design is to compute a spatial efficiency ratio (ER) of the variance of the areas of the Voronoi polygons formed by an SBS design vs. an SRS design (Stevens and Olsen 2004). If ER < 1.0, then the SBS design is more spatially efficient than SRS. There are several advantages of generating an SBS design within a geographic information system (GIS) framework. First, GIS is typically used to establish the sample or reference frame. That is, spatial data are typically needed to represent the population: the spatial extent and location of the study area to be sampled, or the set of geographic features to be sampled from (e.g., targeted vegetation types, sections of a stream that provide habitat for a target species, a set of lakes, etc.). Note that a full range of geographic features, including points


(e.g., centers of lakes or pre-defined stream reaches), lines (streams, roads, etc.), areas (vegetation patches, lakes, estuaries, etc.), and combinations of these are supported. Moreover, if certain geographic regions or particular features need to be sampled at different intensities (e.g. rare vegetation types or higher-order streams), then again a spatial dataset provides a convenient way to specify the unequal probabilities associated with different geographic features. Occasionally inclusion probabilities may need to vary continuously across a surface to reflect an environmental gradient such as elevation or precipitation. This can be accomplished easily using raster-based representation of spatial data. Second, GIS allows the visualization of a completed survey design so that the samples can be viewed within the context of other geographic layers, such as vegetation types, roads and trails for access, land ownership, etc. Examining the context of the samples is useful both to evaluate a survey design and to develop a logistical plan for collecting the field data. Third, we believe that a broader base of potential users will be reached. There are many natural resource professionals who use GIS regularly, but who may not be as familiar with statistical packages (such as S-Plus, SAS, or R) and the methods needed to generate survey designs. Most standard GIS provide a limited set of tools to construct rigorous spatial survey designs, and although SRS designs can be generated in most GIS, they typically require custom programs or scripts. 
For example, Theobald (2003) described random sampling with unequal probabilities using continuous (raster) data in ArcGIS (Environmental Systems Research Institute, Redlands, CA), Huber (2000) developed methods for systematic sampling and Jenness (2001) created a random point generator in ArcView v3 (Environmental Systems Research Institute, Redlands, CA), the IDRISI system (Clark Labs, Worcester, MA) has embedded the GStat package into their GIS (Pebesma and Wesseling 1998; Pebesma 2003), and sampling


routines have been written for GRASS in the r.le (landscape ecology; Baker and Cai 1992) and r.samp routines (Mitchell and others 2002). These existing extensions to GIS provide tools to generate simple, systematic, and stratified unequal probability designs; however, these tools produce survey designs that are subject to the limitations described previously. Below we provide a description of a new approach to generate spatially-balanced survey designs, called the Reversed Randomized Quadrant-Recursive Raster (RRQRR). We explain how this approach differs from GRTS and describe its implementation in ESRI ArcGIS v9. Finally, we provide case studies to illustrate the sampling methodology for geographic features represented by points, lines, and areas.

Methods

Generating a spatially-balanced sample can be accomplished by using a function that converts 2D space into 1D space. Rather than representing location using an x,y pair (or row and column), locations (or cells) can be numbered sequentially using a 1D ordering. Three common sequential ordering systems are row-major order (left to right, row by row); row-prime order, also known as boustrophedon ordering, meaning "like an ox plowing a field" (Goodchild and Grandfield 1983); and Peano scan ordering. However, the average distance between two sequential numbers can vary widely in these methods (Goodchild and Grandfield 1983), and there can be long jumps between values of adjacent cells in adjacent rows. RRQRR (like GRTS) is based on a hierarchical quadrant-recursive ordering using Morton order. Morton order generates "N" or "Z" shaped patterns of 2x2 quads that are composed of lower-left, upper-left, lower-right, and upper-right cells (labeled 1, 2, 3, and 4, respectively, for the "N" pattern) that can be nested at hierarchical scales. This creates a recursive, space-filling address (Saalfeld 1998). This quadrant-recursive property (Mark 1990) maximizes 2D proximal


relationships when converting from 2D to 1D space, so that 1D ordered addresses are close together in 2D space. Conceptually, five steps are needed to generate a SBS design using the RRQRR algorithm. First, the Morton address is computed for all locations within a study area or population. Although theoretically a 2D space has an infinite number of points in it, we approximate continuous space using a very large set of locations (i.e., >10^6) represented by a raster (or matrix of cells or points) data structure. The study area is subdivided into 4 equal-sized cells at the root level (L1), labeled 1, 2, 3, and 4 (Figure 1). Cells from L1 are then quartered and labeled in a similar fashion to create L2. The recursive subdivision continues until a very large set of locations is reached at Lk. These levels can then be stacked on top of one another and their values added cell-by-cell to create a hierarchical address that includes levels L1 to Lk, called the Morton address or M value for each cell at Lk. Individual cells can then be addressed by their hierarchical address, M, such as M11, M12, M13, M14, ... (for L2) or M111, M112, M113, M114, ... (for L3). A primary difference between the approaches is that GRTS is feature-based (point, line, or area), so that quadrant-recursion occurs only in areas where features are located, and only to a level k that resolves features in the sampling frame so that the sum of inclusion probabilities in a cell is less than 1. In contrast, RRQRR uses a raster to provide a discrete, but fine-grained, approximation of features, which can include any combination of geographic feature types. At the floor (or finest resolution), the sample location is approximated by the center of a cell.
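The hierarchical addressing can be sketched in a few lines (a minimal illustration under the assumptions that rows and columns are integer indices with row 0 at the bottom, and that quadrant digits follow the "N" pattern labeling above; this is not the authors' code).

```python
def morton_address(row, col, levels):
    # Quadrant digits from coarsest (L1) to finest (Lk), with row 0 at
    # the bottom: 1 = lower-left, 2 = upper-left, 3 = lower-right,
    # 4 = upper-right (the "N" pattern described in the text).
    digits = []
    for level in range(levels - 1, -1, -1):
        rbit = (row >> level) & 1   # which half vertically at this level
        cbit = (col >> level) & 1   # which half horizontally at this level
        digits.append(str(1 + 2 * cbit + rbit))
    return "".join(digits)

# A 4x4 grid has two levels: the lower-left cell is M11, the
# upper-left cell of the lower-left quadrant is M12, and the
# upper-right cell is M44.
print(morton_address(0, 0, 2))  # -> "11"
print(morton_address(1, 0, 2))  # -> "12"
print(morton_address(3, 3, 2))  # -> "44"
```

Each digit narrows the location to one quadrant of the previous quadrant, which is exactly the stacked, cell-by-cell address construction the paragraph describes.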

<Figure 1 about here>


The second step is to convert the Morton address to a 1D linear representation, or sequential ordering, called Morton order, denoted by T. This is done by sorting all Morton addresses and then assigning a linear sequence or order. In our example, M111 is placed at T0, M112 is placed at T1, and the rest are ordered sequentially, ending with address M444 placed at T63 (Figure 2a & 2b). Mark (1990) called this the quadrant-recursive order (QR-order). Note that a systematic uniform sampling of s locations can be created by computing an interval I such that: I = ((Tmax - Tmin) + 1) / s. Sample locations can then be determined by finding values of T whose remainder equals 0 when divided by I (modulo division). For example, if 4 samples in an 8x8 grid were needed, then I = 16, and samples would be located at T0, T16, T32, and T48.
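The sort-and-sample step can be sketched as follows (hypothetical code; for the 8x8, s = 4 example above the interval works out to I = 16, and the draw lands on T0, T16, T32, and T48).

```python
# Build the Morton order T for an 8x8 grid (k = 3 levels), then take a
# uniform systematic sample with interval I = ((Tmax - Tmin) + 1) / s.
size, levels, s = 8, 3, 4

def morton_address(row, col, levels):
    # Quadrant digits coarse-to-fine, "N" pattern, row 0 at the bottom
    return "".join(str(1 + 2 * ((col >> l) & 1) + ((row >> l) & 1))
                   for l in range(levels - 1, -1, -1))

# Sorting the addresses lexicographically assigns the T order
addresses = sorted(morton_address(r, c, levels)
                   for r in range(size) for c in range(size))
I = len(addresses) // s                          # (63 - 0 + 1) / 4 = 16
sample = [addresses[t] for t in range(0, len(addresses), I)]
print(I, sample)  # 16 ['111', '211', '311', '411']
```

The four selected addresses begin with the digits 1 through 4, i.e., one location in each of the four root-level quadrants.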

<Figure 2 about here>

The third step is to reverse the Morton addressing, so that the digit order L1 to Lk becomes Lk to L1, generating a reversed quadrant-recursive order. For example, M123 becomes M'321. Reversing M to M' brings out a remarkable property: the ordered sequence of locations T' generated from the reversed addresses M' approximates a uniform, systematic sampling pattern. That is, instead of using a sampling interval I to compute s sample locations, one simply has to identify a sequential list T' of s locations. For example, to get 4 sample locations, one could choose the locations that occur in the first four records: T'0, T'1, T'2, and T'3 (Figure 2c & 2d). To get additional samples, one simply proceeds down the T' list in sequence, without skipping. Moreover, any sequential subset of addresses has a similar spatial distribution to the entire list of addresses in the study area.
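The reversal step can be illustrated directly: sorting by the reversed digits yields the T' order, and any prefix of that order is spread across the grid (a sketch under the same labeling assumptions as before; not the authors' implementation).

```python
# Reversing each Morton address (M123 -> M'321) and re-sorting gives the
# T' order; a prefix of T' is spread roughly uniformly over the grid.
levels, size = 3, 8

def morton_address(row, col, levels):
    return "".join(str(1 + 2 * ((col >> l) & 1) + ((row >> l) & 1))
                   for l in range(levels - 1, -1, -1))

cells = {morton_address(r, c, levels): (r, c)
         for r in range(size) for c in range(size)}
t_prime = sorted(cells, key=lambda m: m[::-1])   # sort by reversed digits
first_four = [cells[m] for m in t_prime[:4]]
print(first_four)  # [(0, 0), (4, 0), (0, 4), (4, 4)]
```

The first four records of T' fall one per root-level quadrant, matching the claim that the head of the T' list already approximates a systematic sample.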


The fourth step introduces a random component to the design to obtain a reversed random quadrant-recursive raster (RRQRR) order. This is accomplished by randomly permuting, or re-ordering, the quadrant values (1, 2, 3, 4) for all quadrants at each level. For example, quadrant values could be: 1, 2, 4, 3 or 4, 3, 1, 2, etc. Permuting the quadrant values at each level ensures that samples are randomly located, but because the quadrant structure is implemented at a number of levels, ordered locations are also well distributed. To draw s samples of T', one simply begins at a random starting point along T' and identifies a sequential list of s samples. For example, to draw 4 samples from the population, one could sample at the locations represented by T'8, T'9, T'10, and T'11 (Figure 2e). A final step is needed to incorporate unequal inclusion probabilities. Again, the RRQRR and GRTS approaches differ. In GRTS, inclusion probabilities are applied to the 1D sequential ordering T, before reversing the Morton addresses. The inclusion probabilities are used as weights, in effect stretching T values along the 1D number line. A sampling interval I is computed for a defined number of samples. The reversed order T' is then computed for the T order, but only for the selected sample locations. Note that inclusion probabilities for GRTS are computed by dividing the number of samples desired by the user for each stratum by the total number, length, or area of the sampling frame (for each stratum). In RRQRR, the inclusion probabilities π are used to "filter" or remove locations in the T' raster. Locations where π < Rk are withdrawn from the full list of samples, where Rk is a new raster dataset containing random values drawn from a uniform distribution (0,1). The resulting RRQRR ordering may no longer be a sequential list of integers (e.g., 1, 2, 3, 4, 5, ...), but may have breaks in it (e.g., 1, 2, 4, 6, 7, ...); the ordering, however, remains important.
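The randomization and filtering steps can be sketched as below. Note the simplifying assumptions: one random relabeling of the quadrant digits is drawn per level (the full algorithm permutes quadrant values for all quadrants at each level), and a single uniform π stands in for a spatially varying inclusion-probability raster.

```python
import random

random.seed(1)
levels, size = 3, 8

def morton_address(row, col, levels):
    return "".join(str(1 + 2 * ((col >> l) & 1) + ((row >> l) & 1))
                   for l in range(levels - 1, -1, -1))

# Simplified randomization: one random relabeling of the digits 1-4 per
# level (index i), applied before reversing to produce the T' key.
perms = [random.sample("1234", 4) for _ in range(levels)]

def rrqrr_key(address):
    relabeled = "".join(perms[i][int(d) - 1] for i, d in enumerate(address))
    return relabeled[::-1]          # reverse coarse/fine digits -> T'

cells = [(r, c) for r in range(size) for c in range(size)]
order = sorted(cells, key=lambda rc: rrqrr_key(morton_address(*rc, levels)))

# Unequal probabilities: a cell is withdrawn when pi < R, with R drawn
# from Uniform(0,1) per cell; here pi = 0.5 everywhere for illustration,
# so roughly half of the ordered list survives the filter.
pi = 0.5
kept = [rc for rc in order if pi >= random.random()]
print(order[:4])   # four spatially balanced locations, one per quadrant
print(len(kept))
```

Because the per-level relabelings are bijections, the first four cells of the randomized order still land one per root-level quadrant, which is the "random yet well distributed" property the paragraph describes.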


RRQRR implementation

Implementing the RRQRR spatially-balanced survey design in GIS generally follows the conceptual steps outlined above. In RRQRR, Morton addresses M' and sequential values T' are generated for all locations within a maximum enclosing rectangle (the orientation is specified by the coordinate system of the sampling frame spatial dataset), not just for locations specified by the sample frame (i.e., the study area). The main input is a raster dataset, G, that represents both the sample frame and the inclusion probabilities (usually normalized to range from 0 to 1). That is, locations or cells in the study area have a non-null value > 0. Areas outside of the study area are assigned null values. Relative inclusion probabilities, π, or the likelihood of a location being selected relative to other locations, are such that 0 < π ≤ 1. The finest resolution that the algorithm will sample at is also specified by G. Thus, G simultaneously specifies the maximum enclosing rectangle, the sample frame, the inclusion probabilities, and the finest resolution at which the sampled locations will be generated. It is important to note that the entire maximum enclosing rectangle that encompasses G (the spatial extent or window) is tessellated. Thus, T' is computed initially for all locations both inside and outside of the study area, represented using a raster dataset (a matrix or grid of cells). Although it is slightly less computationally efficient to do so, there are some important advantages of computing the RRQRR order for all locations in the enclosing rectangle (discussed below). When converting feature-based (point, line, polygon) geographic datasets to raster datasets, three considerations drive the decision about resolution size. First, the resolution should, at a minimum, be able to distinguish all features in the population S. This resolution can be determined by setting the cell size to be no larger than half the minimum distance between features.
For example, if the minimum distance between the centroids of two lakes is 1,000 m, then the resolution should be 500 m or smaller. Second, because locations sampled from a raster dataset will be systematically aligned at the level of the dataset resolution, it may be desirable to create a finer resolution so that the cell size begins to approximate the limits of the ability to locate positions accurately in the field. That is, if a GPS has an accuracy of 10 m, then the resolution should be refined from 500 m to 10 m if possible. Third, the level of resolution needs to be balanced against practical constraints of file size, which increases with the square of the ratio of cell widths (e.g., moving from 500 m to 10 m cells would increase file size 2,500 times). To create the hierarchy of nested tessellations, a loop is performed for each level l. The number of levels, k, is the smallest integer such that 2^k equals or exceeds the number of rows or columns in G, whichever is greater. For example, if the study area is an 8 x 8 grid, then k=3; if the study area is 8 x 9, then k=4. At each level, the resolution, or cell width, is doubled from the finest resolution Lk up to the root level, L1. Thus, at Lk the resolution equals that of G, while at L1 the resolution is coarse enough that the number of columns and rows is ≤2. A straightforward implementation to generate Morton addresses would be to stack the Pk raster layers, where each level corresponds to a decimal place, such that level 1 would be located in the 1s place, level 2 in the 10s place, and so on. By simply reversing the level-to-place relationship, the M' address can also be obtained. However, because the maximum value in a 4-byte integer raster is 2,147,483,647, only 9 levels can be computed, resulting in a raster dataset with a maximum size of 512 x 512. At 30 m resolution, this represents a study area of only 15.3 km x 15.3 km, which is unlikely to be adequate for most applications.
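The level-count rule and the file-size relationship just described can be stated compactly. The helper names below are our own; the logic follows the formulation above (smallest k with 2^k covering the larger raster dimension):

```python
import math

def levels_needed(nrows, ncols):
    """Smallest k with 2**k >= max(nrows, ncols): the number of
    quadrant-recursive levels needed to cover the raster G."""
    return max(1, math.ceil(math.log2(max(nrows, ncols))))

def relative_file_size(coarse_cell, fine_cell):
    """Cell count (hence file size) grows with the square of the
    cell-width ratio, e.g., 500 m -> 10 m is a 2,500x increase."""
    return (coarse_cell / fine_cell) ** 2
```

For the examples in the text, an 8 x 8 grid gives k=3 and an 8 x 9 grid gives k=4.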
A solution to this size limitation is to map the Morton addresses M to a 1D sequence at each level, rather than ordering the Morton addresses after all levels are computed, which departs from the methodology of Stevens and Olsen (2004). For example, in a simple situation with uniform inclusion probabilities and k=3, the values at the finest level equal the randomly permuted values 1-4 in each quadrant. At the next level up, each cell contains 4 cells at the level below it, so the values should be 0, 4, 8, and 12. At the level above that, each cell contains 16 cells, so the values should be 0, 16, 32, and 48. For each cell at the finest resolution, these values are summed, creating ordered integers, T, ranging from 1 to 64. As a result, it is possible to represent a raster dataset with 2,147,483,647 cells (a square raster with 46,340 cells on a side). This approach can thus represent a study area over 1,390 km per side at 30 m resolution. For very large study areas, random samples can be computed for tiles, or sub-sections of the study area (which can be handled like separate strata); each tile's sample locations are then converted to a set of points and combined into a single file. Pre-computing the rank order also reduces the computational load of sorting potentially m x n records of Morton addresses. At this point, the entire spatial extent (not just the study area) is completely tessellated and each raster cell contains a value that specifies its sample selection order. At each level, randomly permuted values ranging from 1 to 4 are generated, running backwards (fine to coarse) from the Lk quadrants to the L1 quadrants. Permutation is accomplished by first creating a raster dataset of random values, Rk, drawn from a uniform distribution (0, 1). Next, the minimum Rk value within each quadrant is found (using a block focal function that generates a non-overlapping moving window of size 2 x 2). The location within the quadrant holding the minimum random value is assigned a value of 1, and that random value is then removed from Rk.
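The per-level summation in the worked example can be reproduced with a short recursive sketch. This is our own illustrative code, not the ArcGIS implementation: each quadrant at each level receives a randomly permuted offset scaled by the number of finest-level cells it contains, and the per-cell sums form a complete order over the raster.

```python
import random

def quadrant_order(size, rng):
    """Build the summed per-level order for a square raster whose
    width `size` is a power of two. At each level, the four quadrant
    offsets are a random permutation of 0, c, 2c, 3c, where c is the
    number of finest cells per quadrant; finer levels contribute
    their own independently permuted values."""
    if size == 1:
        return [[1]]  # base value of a single finest cell
    half = size // 2
    cells = half * half  # finest cells per quadrant at this level
    offsets = [v * cells for v in rng.sample(range(4), 4)]
    grid = [[0] * size for _ in range(size)]
    quads = [(0, 0), (0, half), (half, 0), (half, half)]
    for (r0, c0), off in zip(quads, offsets):
        sub = quadrant_order(half, rng)  # independent permutations below
        for r in range(half):
            for c in range(half):
                grid[r0 + r][c0 + c] = sub[r][c] + off
    return grid
```

For an 8 x 8 raster (k=3), the summed values are exactly the integers 1 through 64, i.e., a complete selection order with no duplicates.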
Again, the minimum Rk value within each quadrant (now with only 3 values) is located, assigned a value of 2, and the random value at that location is removed. This continues for the 3rd and 4th positions, resulting in a raster dataset, P, of randomly permuted values 1, 2, 3, and 4 for each quadrant. Inclusion probabilities are then used to "filter" or remove locations (Figure 3). There are two advantages of separating the ordered raster T' (the RRQRR order) from G (the sampling frame and inclusion probabilities). First, survey designs for multiple indicators can be usefully integrated by applying different inclusion probability "filters" to the same RRQRR order. The resulting survey designs will have survey locations in common to the maximum extent possible, depending on the spatial pattern of inclusion probabilities. Second, the original survey design is more robust to changes that accommodate revised knowledge about the sampling frame. For example, a stream reach that was not included in the original sampling frame, but is subsequently determined to provide habitat for a species of interest (and so is part of the population), can be added to the sampling frame. Note that if a relatively coarse resolution is used (controlled by G) and large portions of the study area have very low (<0.1) inclusion probabilities, then this unequal-probability selection may yield too few sample points. Samples can now be drawn from the resulting list of T' values, one at a time, starting from the largest value and working backwards (in descending order). There will be gaps in the list T', but one simply proceeds through the list in order (skipping locations that were filtered out). As a matter of convenience, the final step converts the raster representation of sample locations to point features, with associated attributes that specify the sampling order. Point features provide more flexibility in visualizing and displaying a survey design, and are easily "uploaded" to a GPS unit for field data collection.
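The repeated minimum-of-random-values permutation can be sketched as follows. This pure-Python version stands in for the block focal function described above and only illustrates the logic: within each non-overlapping 2 x 2 block, the cell holding the smallest uniform draw receives rank 1, the next smallest rank 2, and so on.

```python
import random

def permute_quadrants(nrows, ncols, rng):
    """Assign ranks 1-4 within each non-overlapping 2x2 block by
    repeatedly locating the block's smallest remaining uniform
    random value (a sketch of the focal-minimum trick, not the
    ArcGIS implementation). Assumes even nrows and ncols."""
    # Rk: a raster of uniform(0,1) random values.
    R = [[rng.random() for _ in range(ncols)] for _ in range(nrows)]
    P = [[0] * ncols for _ in range(nrows)]
    for br in range(0, nrows, 2):
        for bc in range(0, ncols, 2):
            cells = [(br + dr, bc + dc) for dr in (0, 1) for dc in (0, 1)]
            # Sorting by the random draw reproduces the iterative
            # find-minimum, assign, remove cycle in one pass.
            for rank, (r, c) in enumerate(
                    sorted(cells, key=lambda rc: R[rc[0]][rc[1]]), start=1):
                P[r][c] = rank
    return P
```

Every 2 x 2 block of the resulting P raster contains each of the values 1 through 4 exactly once, in a random arrangement.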


Results and case studies

We ran a series of simulation tests to confirm that our algorithm generated a spatially balanced design and to examine how the efficiency ratio ER changes as a function of the sample size and the proportion of samples relative to the entire population (because we are approximating a continuous surface with a raster dataset). Our algorithm achieves a ratio of between 0.3 and 0.4 for a reasonable proportion of samples. Figure 4 shows the ER results as a function of sample size. We recommend that the number of samples drawn from the spatially-balanced sample generated by our algorithm be limited to less than 0.1% of the total population. For example, this means that no more than 1,000 samples should be drawn for a population of 1,000,000 cells (i.e., a 1,000 x 1,000 cell study area). The RRQRR methodology can be used to draw samples from a population of geographic features that can be represented by any combination of the three feature types: points, lines, and areas (also called ecological resource types from a survey design perspective), in both continuous and discrete representations. The point feature type typically represents discrete, 0-dimensional objects, such as a population of individual trees or houses. A point feature dataset is converted to a raster dataset at a desired resolution to create the population frame or G raster. The values in raster cells that contain a feature (e.g., a point) are set to the feature's identification number. All other cells are assigned NoData values and are thus excluded from the population S. In order to sample individual features that are represented as either lines or polygons, it is common to reduce individual features to a point, and then to use the set of points as the population to sample from. For example, to create a sample of streams in Oregon, streams from a 1:100,000-scale map were divided into 200-foot segments, and the mid-point of each segment was represented as a point (Stevens 2002). Lakes are often represented as polygons but converted to points by using the center of each lake polygon (the label-center should be used, not the geographic centroid) to create the population of points. Note that in this case we did not differentiate lakes by their area (size), so each lake has an equal probability of being sampled (Figure 5). Another example is to sample from a population of watersheds or HUCs by representing the polygons with points using centroids, to answer the question: what proportion of HUCs are in poor condition?

<Figure 5 about here>
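The point-to-raster conversion described above can be illustrated with a minimal sketch. The cell indexing and the origin at (0, 0) are simplifying assumptions of ours; cells containing a feature store its identification number, and all other cells are NoData and thus excluded from the population S.

```python
NODATA = None  # stand-in for the raster NoData value

def points_to_raster(points, cell_size, nrows, ncols):
    """Convert point features (id, x, y) to a population raster G.
    A simplified sketch of the feature-to-raster step: coordinates
    are assumed to be measured from an origin at (0, 0)."""
    G = [[NODATA] * ncols for _ in range(nrows)]
    for fid, x, y in points:
        row = int(y // cell_size)
        col = int(x // cell_size)
        G[row][col] = fid  # cell takes the feature's id number
    return G
```

Only cells holding a feature id participate in the draw; NoData cells are skipped when samples are selected.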

Linear features such as streams can also be represented as continuous features, so that theoretically all possible locations along a stream are included in the population. In this case, linear features are converted directly to a raster dataset (Figure 6). This could be used to answer the question: how many kilometers of stream are in poor condition? Although conversion of a linear or areal feature to a raster dataset discretizes features into an even-sized matrix of cells, the degree to which the resulting design can be considered continuous depends on the resolution of the raster. In practice, there is a finite spatial scale that can be readily distinguished in the field (either related to how a feature is defined or limited by our ability to precisely measure position in the field with a GPS) or a minimum size that is typically used to define a feature. An areal feature such as a lake can also be represented as a continuous feature, by converting a polygon directly to a raster. In this case, if there are multiple lakes, then the probability that a lake will be sampled is proportional to its area (if equal inclusion probabilities are used). Other examples of sampling continuous areal features might be to sample for water clarity within a single lake, or to sample estuarine resources represented as a single large study area (not necessarily contiguous). For studies that examine land cover or vegetation patterns and trends, the study area is typically represented as a polygon (or strata by polygons) and samples are drawn from the population (of cells) within the study area (e.g., Figure 7).

<Figure 6 about here> <Figure 7 about here>

In addition to varying probabilities by zones (e.g., land cover types), inclusion probabilities can be based on a continuous covariate, such as an environmental gradient (e.g., elevation, mean precipitation, distance from roads, etc.). As an example, Figure 8 shows a survey design where the inclusion probabilities were inversely related to accessibility (measured as one-way travel time).

<Figure 8 about here>
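An accessibility-based inclusion probability like that in Figure 8 could be computed along these lines. The breakpoints (1 and 3 hours) and the probabilities (1.0 and 0.1) follow the figure caption, while the linear interpolation between them is our own illustrative assumption:

```python
def inclusion_from_travel_time(hours, p_near=1.0, p_far=0.1,
                               near=1.0, far=3.0):
    """Map one-way travel time (hours) to a relative inclusion
    probability, inversely related to remoteness. Locations within
    `near` hours get p_near; beyond `far` hours get p_far; values
    in between are linearly interpolated (an assumption here)."""
    if hours <= near:
        return p_near
    if hours >= far:
        return p_far
    frac = (hours - near) / (far - near)
    return p_near + frac * (p_far - p_near)
```

A raster of travel times can be passed cell-by-cell through this function to build the inclusion-probability surface G used by the filter step.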

Discussion

In addition to generating a spatially well-balanced, probability-based survey design, the reversed random Morton order (T') in the GRTS algorithm allows additional samples to be drawn from the sequential list without disrupting or compromising the survey design, which provides a number of practical benefits. First, this is immensely valuable for accommodating common problems with field data collection efforts. Non-response problems frequently occur during field collection, but for each non-response sample, an additional sample can be drawn, sequentially, from T'. Second, the average time to collect a sample is often not known when constructing a survey design, so researchers often struggle with selecting the number of samples to specify a priori. In the fortunate -- but less common -- situation when resources (time, money) remain after collecting s samples, additional samples can be drawn from T' and the sample size increased. Third, because the order is known when the survey design is specified, one can strategically prioritize collection of additional samples when field crews are in the general vicinity. That is, often the amount of time needed to get to a sample location is greater than the actual time spent collecting data at the site. When this is the case, a wise strategy might be to visit a group of samples that are close to one another for efficient data collection. For example, if one's goal is to collect at least 30 samples, and samples s1, s5, s7, and s30 (drawn from T') are located within a remote valley that has only one access road, then an efficient strategy might be to attempt to visit all 4 locations in a day (or at least on consecutive days). Imagine that samples s32 and s38 were also located in this same remote valley. To increase the number of sample locations visited, it might be efficient to collect s32, and possibly s38 as well -- if there is sufficient time. However, s32 and s38 cannot be used in any subsequent estimation if lower-numbered samples were not collected (or at least attempted). Fourth, because additional samples can be drawn, the efforts of separate projects (either different teams or agencies, or projects a few years later) can be combined. That is, as long as projects use the same survey design, additional samples can continue to be drawn, building a more comprehensive database with an increasing s. One useful feature of implementing RRQRR within GIS is the ability to incorporate logistical constraints during survey design through an estimate of the amount of time needed to gain access (on the ground) to a location.
That is, an accessibility measure (Theobald 2003) that estimates the remoteness of a location could be used to reduce the number of hard-to-access locations, but with known probability. This would allow projects to invest in a limited number of difficult points so that the full population can be estimated, and would allow programs to trade off practical logistical limitations a priori against known effects on sampling. Related to this, occasionally field data cannot be collected at a site because it is too difficult or dangerous to access. In this situation, it would be important to develop a map that depicts all locations in the study area that have similar (in)accessibility constraints, thus explicitly describing the target population about which strong inference can be made, and predicting "effective exclusion zones" that were not adequately sampled and about which rigorous estimates therefore cannot be made. It is important to note, however, the trade-off that low inclusion probabilities result in high analysis weights (an inverse function). We recognize that there are several limitations to implementing the RRQRR algorithm in GIS. First, because the population to be sampled must be converted to a raster, there may be file size limitations. As the ratio of the spatial extent (e.g., geographic coverage) to the necessary grain or resolution (e.g., raster cell size) increases, the raster datasets that represent the features to be sampled can become quite large, possibly generating files that are too large for practical use. Second, although theoretically the RRQRR approach can be extended beyond two-dimensional tessellations, so that sampling in three dimensions could be achieved (for example, for marine or meteorological applications), moving to three dimensions or beyond would currently be difficult to program in contemporary GIS. Third, our particular implementation of RRQRR requires ArcGIS v9, a commercial software package (Environmental Systems Research Institute, Redlands, CA), which may be difficult for some organizations to acquire (i.e., it is not freeware).
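The sequential replacement of non-responses discussed above can be sketched as a simple walk down the ordered list. The predicate responded and the helper name are hypothetical; the key property is that sites enter the usable sample strictly in T' order, with every lower-numbered site attempted first.

```python
def visit_plan(ordered_sites, target, responded):
    """Walk the ordered T' list, replacing each non-response with
    the next site in sequence until `target` usable samples are
    reached. `responded(site)` is a stand-in for whether a visit
    yields data; `attempted` records every site visited, since
    lower-numbered sites must be attempted before later ones can
    enter the estimate."""
    used, attempted = [], []
    for site in ordered_sites:
        attempted.append(site)
        if responded(site):
            used.append(site)
        if len(used) == target:
            break
    return used, attempted
```

Because the draw is sequential, two projects sharing the same design can each run this walk and pool their results into one larger, still-valid sample.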

Conclusion

We have developed a new way to generate spatially-balanced survey designs that is implemented in ArcGIS v9 through the Spatial Sampling tool. It simplifies the identification of survey sites because GIS is typically needed for establishing the sampling frame that sampling algorithms require. Unequal inclusion probabilities are easily incorporated by computing probabilities based on simple strata represented by geographic features (e.g., vegetation types), strata composed of multiple factors (e.g., elevation zones and vegetation types), or continuously varying inclusion probabilities that reflect an environmental gradient such as temperature or precipitation. Often the practical challenges of visiting remote or dangerous locations need to be considered during the survey design, and GIS can assist by enabling researchers to visualize a survey design and consider the trade-offs between the number of samples and the resources required to complete the survey. Finally, we hope that by describing this methodology and providing freely-available tools (the STARMAP Spatial Sampling tools, http://www.stat.colostate.edu/~nsu/starmap/), a broader base of natural resource professionals and researchers will be able to generate useful spatially-balanced, probability-based survey designs.

Acknowledgements

We thank J. Norman, E, and N. Peterson for programming and field assistance with this research and M. Farnsworth, J. Gross, B. Noon, and E. Peterson for helpful comments on previous drafts. This research was supported by funding from the STAR Research Assistance Agreement CR829095 awarded by the US Environmental Protection Agency. This paper has not been formally reviewed by EPA and the views expressed here are solely those of the authors. EPA does not endorse any products or commercial services mentioned in this paper.


Literature cited

Baker, W.L. and Y. Cai. 1992. The r.le programs for multiscale analysis of landscape structure using the GRASS geographical information system. Landscape Ecology 7: 291-302.

Brus, D.J. and J.J. de Gruijter. 1993. Design-based versus model-based estimates of spatial means: theory and application in environmental soil science. Environmetrics 4: 123-152.

Cochran, W.G. 1977. Sampling techniques. 3rd Ed. New York, NY: John Wiley and Sons.

Di Zio, S., L. Fontanella, and L. Ippoliti. 2004. Optimal spatial sampling schemes for environmental surveys. Environmental and Ecological Statistics 11(4): 397-411.

Gilbert, R.O. 1987. Statistical methods for environmental pollution monitoring. New York, NY: Van Nostrand Reinhold.

Goodchild, M.F. and A.W. Grandfield. 1983. Optimizing raster storage: an examination of four alternatives. Proceedings of AutoCarto 6, Ottawa, Canada. Pp. 400-407.

de Gruijter, J.J. and C.J.F. Ter Braak. 1990. Model free estimation from survey samples: a reappraisal of classical sampling theory. Mathematical Geology 22: 407-415.

Hall, R.K., A. Olsen, D. Stevens, B. Rosenbaum, P. Husby, G.A. Wolinsky, and D.T. Heggem. 2000. EMAP design and river reach file 3 (RF3) as a sample frame in the Central Valley, California. Environmental Monitoring and Assessment 64: 69-80.

Hansen, M.H., W.G. Madow, and B.J. Tepping. 1983. An evaluation of model-dependent and probability sampling inferences in sample surveys. Journal of the American Statistical Association 78: 776-793.

Herlihy, A.T., D.P. Larsen, S.G. Paulsen, N.S. Urquhart, and B.J. Rosenbaum. 2000. Designing a spatially balanced, randomized site selection process for regional stream surveys: the EMAP mid-Atlantic pilot study. Environmental Monitoring and Assessment 63: 92-113.

Huber, B. 2000. Sample: Designing random sampling programs with ArcView 3.2. Quantitative Decisions, Inc. www.quantdec.com/sample/index.htm (viewed 3 March 2004).

Jenness, J. 2001. Random point generator, v1.1. Jenness Enterprises, Flagstaff, Arizona.

Lesser, V.M. 2001. Applying survey research methods to account for denied access to research sites on private property. Wetlands 21(4): 639-647.

Mark, D.M. 1990. Neighbor-based properties of some orderings of two-dimensional space. Geographical Analysis 22: 145-157.

Mitchell, S., F. Csillag, and C. Tague. 2002. Advantages of open-source GIS to improve spatial environmental modeling. In Proceedings of the Open Source GIS GRASS users conference. Trento, Italy. 11-13 September.

Oakley, K.L., L.P. Thomas, and S.G. Fancy. 2003. Guidelines for long-term monitoring protocols. Wildlife Society Bulletin 31(4): 1000-1003.

Overton, W.S. 1993. Probability sampling and population inference in monitoring programs. In Environmental modeling with GIS. M.F. Goodchild, B.O. Parks, and L.T. Steyaert, eds. New York, NY: Oxford University Press. Pp. 470-480.

Pebesma, E.J. and C.G. Wesseling. 1998. Gstat, a program for geostatistical modeling, prediction and simulation. Computers and Geosciences 24(1): 17-31.

Peterson, S.A., N.S. Urquhart, and E.B. Welch. 1999. Sample representativeness: A must for reliable regional lake condition estimates. Environmental Science & Technology 33: 1559-1565.

Saalfeld, A. 1998. Sorting spatial data for sampling and other geographic applications. GeoInformatica 2: 37-57.

Särndal, C. 1978. Design-based and model-based inference for survey sampling. Scandinavian Journal of Statistics 5: 27-52.

Smith, T.H. 1976. The foundations of survey sampling: a review. Journal of the Royal Statistical Society, Series A 139: 183-204.

Stehman, S.V. 1999. Basic probability sampling designs for thematic map accuracy assessments. International Journal of Remote Sensing 20(12): 2423-2441.

Stehman, S.V. 2001. Statistical rigor and practical utility in thematic map accuracy assessment. Photogrammetric Engineering & Remote Sensing 67(6): 727-734.

Stevens, D.L., Jr. 1997. Variable density grid-based sampling designs for continuous spatial populations. Environmetrics 8: 164-195.

Stevens, D.L., Jr., and A.R. Olsen. 1991. Statistical issues in environmental monitoring and assessment. Proceedings of the Section on Statistics and the Environment. American Statistical Association, Alexandria, Virginia. Pp. 1-9.

Stevens, D.L., Jr., and A.R. Olsen. 2000. Spatially restricted random sampling designs for design-based and model-based estimation. In Accuracy 2000: Proceedings of the 4th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Delft University Press, The Netherlands. Pp. 609-616.

Stevens, D.L., Jr. 2002. Sample design and statistical analysis methods for the integrated biological and physical monitoring of Oregon streams. Oregon Department of Fish and Wildlife, Report Number OPSW-ODFW-2002-07.

Stevens, D.L., Jr., and A.R. Olsen. 2003. Variance estimation for spatially balanced samples of environmental resources. Environmetrics 14: 593-610.

Stevens, D.L., Jr., and A.R. Olsen. 2004. Spatially balanced sampling of natural resources. Journal of the American Statistical Association 99(465): 262-278.

Theobald, D.M. 2003. GIS Concepts and ArcGIS Methods. Fort Collins, Colorado: Conservation Planning Technologies.

Thompson, S.K. 2002. Sampling. 2nd Ed. New York, NY: John Wiley & Sons.

Thompson, W.L. (Ed.). 2004. Sampling rare or elusive species. Washington, D.C.: Island Press.

Tobler, W. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46: 234-240.


List of Tables and Figures

Table 1. Criteria to evaluate the relative trade-offs of different probability-based survey designs (after Stehman 1999). We added the spatially balanced category of survey design and criterion 6, flexibility.

Figure 1. Three levels of hierarchical-recursive subdivision. An example study area is subdivided at 3 levels: root level L0 (left), levels L0 and L1 (center), and levels L0, L1, and L2 (right).

Figure 2. Morton addressing and ordering. A. The Morton addresses for 3 levels (k=3, L3). B. The addresses can then simply be sorted in order and stretched to a 1D Morton-ordered line that follows the sequence of integer values from 0 to 63. C. Reversed Morton addresses. D. Sequential ordering of reversed Morton values. E. Randomized reversed sequential order. A sample of 4 points has been drawn from positions 8, 9, 10, and 11 (highlighted with circles).

Figure 3. A flowchart portraying how inclusion probabilities are used to "filter" the sample order. The sampling frame and inclusion probability raster is used to specify the enclosing rectangle used to generate the randomized reversed quadrant-recursive raster. The inclusion probability raster (G) is then compared against a random value raster (R, drawn from a uniform distribution) to generate a filter raster (where G > R).

Figure 4. The efficiency of the spatially balanced design, as measured by the ratio of the variances of Voronoi polygon areas for sample points generated from spatially balanced vs. simple random sampling. We ran 100 simulations for each sample size (10, 50, 100, 500, 1,000, 5,000, and 10,000) and averaged the variance.

Figure 5. Thirty lakes selected (circles) from 223 lakes (crosses) in the study area near Indian Peaks Wilderness and Rocky Mountain National Park, Colorado.

Figure 6. A sample of 50 points along a continuous representation of streams (at 30 m resolution), with inclusion probabilities set to sample equally among stream orders 1 (0.28), 2 (0.9), and 3 (1.0), shown using thin, medium, and thick lines, respectively.

Figure 7. An equal-probability, spatially-balanced survey for vegetation sampling in the Laramie Foothills, Larimer County, Colorado.

Figure 8. An unequal-probability, spatially-balanced survey in the Laramie Foothills (see Figure 7). An accessibility surface was created that depicted the one-way time to travel from the edge of the study area (near Fort Collins, Colorado) to all locations within the study area. Locations that were close (e.g., <1 hour) had a high inclusion probability (1.0), while locations that were further away (e.g., >3 hours) had a low inclusion probability (0.1).
