
Journal of Veterinary Behavior (2006) 1, 94-108


The development and assessment of temperament tests for adult companion dogs

Katy D. Taylor, BSc, PhD, and Daniel S. Mills, BVSc, PhD, Dipl. ECVBM-CA, MRCVS

From the Animal Behaviour, Cognition and Welfare Group, Department of Biological Sciences, University of Lincoln, Riseholme Park, Lincoln, United Kingdom.

KEYWORDS: dog; personality; reliability; temperament; validity

Abstract

Temperament tests have been created by a range of organizations and individuals in order to assess useful, predictable behavioral tendencies in working dogs and, increasingly, in companion dogs. For the latter group, such tests may help to select suitable pets from rescue centers or to identify those already in the population that are, or are likely to be, unsuitable as pets (e.g., those with behavior problems involving aggression). Unfortunately, many of these tests seem to have been developed without a systematic scientific approach. Perhaps as a result there are few reports of these tests in the scientific literature and even fewer that fully report their reliability and specific aspects of validity. This pattern is unfortunate, because the outcome of tests for companion dogs may have the potential to affect their welfare and survival. This paper attempts to encourage a more scientific approach to the development, conduct, and evaluation of temperament tests for adult companion dogs. Five key measures of the quality of a temperament test (purpose, standardization, reliability, validity, and practicality) are identified and explained in detail. Methods for the assessment of these qualities are given together with discussion of their limitations. © 2006 Elsevier Inc. All rights reserved.


Address reprint requests and correspondence: D.S. Mills, Animal Behaviour, Cognition and Welfare Group, Department of Biological Sciences, University of Lincoln, Riseholme Park, Lincoln, LN2 2LG, UK; Tel. 44 (0) 020 7619 6979. E-mail: [email protected]

The ability to select a dog for a particular role, particularly from a very young age, is an attractive idea for breeders and trainers. What might make this a feasible endeavor is the idea that individuals possess stable behavioral tendencies, i.e., they have what has been called "temperament." Temperament is defined as differences in behavior between individuals that are relatively consistently displayed when tested under similar situations (Diederich and Giffroy, 2006). Using this definition, these differences are considered to be the product of both genetically determined and acquired behavioral traits (Stur, 1987), and therefore the age at which they can be considered to be stable is still debatable. Terms such as "personality" (Gosling and John, 1999; Svartberg, 2005), "character" (Ruefenacht et al., 2002) and "emotional predispositions" (Sheppard and Mills, 2002) have also been used in the same context. Temperament is made up of traits that are "correlations of internal factors that cause consistent individual differences in behavior" (Eysenck, 1994). In an attempt to identify these traits, interested parties have developed behavioral tests that take multiple measures of the dog's behavior during a series of shorter tests, or subtests (Ledger, 1997). Often these measures are subjected to factor or principal component analysis, which are data reduction techniques that statistically identify consistently correlated measures within a data set and place them into factors (Goodloe and Borchelt, 1998). The composition of these factors can be used to describe the various behavioral traits exposed by the test and to predict the dog's behavior in another, similar situation.
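As a rough illustration of the data reduction step described above, the following sketch correlates a handful of behavioral measures and extracts the first component of the correlation matrix. The measure names and scores are invented for illustration and are not taken from any of the cited studies. Power iteration is used here only as a simple stand-in for a full factor or principal component analysis, which would normally be run in a statistics package.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(measures):
    """Correlate every measure with every other measure."""
    names = list(measures)
    return [[pearson(measures[a], measures[b]) for b in names] for a in names]


def first_component(matrix, iterations=200):
    """Dominant eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    size = len(matrix)
    vector = [1.0 / sqrt(size)] * size
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
        norm = sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    product = [sum(row[j] * vector[j] for j in range(size)) for row in matrix]
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return vector, eigenvalue


# Hypothetical 1-5 scores for six dogs on four measures (invented data).
scores = {
    "cowering": [1, 2, 2, 4, 5, 5],
    "tail_low": [1, 1, 3, 4, 4, 5],
    "retreat": [2, 1, 3, 3, 5, 5],
    "play": [5, 4, 4, 2, 2, 1],
}

loadings, eigenvalue = first_component(correlation_matrix(scores))
for name, loading in zip(scores, loadings):
    print(f"{name:10s} loading on first component: {loading:+.2f}")
print(f"share of total variance: {eigenvalue / len(scores):.0%}")
```

In this invented data set the three fear-related measures load in one direction and play loads in the other, which is the kind of pattern a test developer might label as a single "fearfulness" factor. The label itself remains an interpretation and is not produced by the analysis.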

1558-7878/$ -see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.jveb.2006.09.002

Taylor and Mills

Dog Temperament Tests

Tests to identify particular characteristics of interest, such as "sharpness" and "courage" have been a common feature of working dog associations and breed groups (Willis, 1995; Wilsson and Sundgren, 1997; Brenoe et al., 2002; Ruefenacht et al., 2002; Svartberg and Forkman, 2002; Courreau and Langlois, 2005; Fuchs et al., 2005). These might include assessment of the dog's hunting, tracking, or aggressive ability. Tests have also been developed for the assessment of the suitability of dogs as police (Slabbert and Odendaal, 1999), guide (Pfaffenberger et al., 1976; Goddard and Beilharz, 1983, 1984, 1985; Knol et al., 1988; Murphy, 1995), therapy (Fredrickson, 1993; Schaffer and Phillips, 1994), or assistance dogs (Weiss and Greenberg, 1997; Weiss, 2002; Lucidi et al., 2005). Over the past 15 years, interest has increased in the development of tests to specifically determine the suitability of dogs as pets. Many of these tests have focused on the assessment of problem behaviors, particularly those involving aggression, which may be associated with an increasing trend toward legislation to ban supposedly dangerous breeds. The possibility of assessing both undesirable or negative and desirable or positive behavioral traits has also been of particular interest to rescue and re-homing groups (Sternberg, 2002). It is hoped that behavioral assessments conducted in the shelter may then help staff match dogs to potential owners (Ledger, 1997) and/or predict behavior that might be problematic in the new home. The results of such assessments have the potential to directly affect the welfare of the dog, because problem behaviors can result in punishment (Hsu and Serpell, 2003), euthanasia, or (repeated) relinquishment to shelters (Arkow, 1994; Miller et al., 1996; Salman et al., 1998). Similarly, the welfare implications of an inaccurate assessment of potential aggressiveness can be disastrous for humans who encounter the dog.
For these reasons, if not for reasons of academic integrity, it is important that published temperament tests be accompanied by appropriate statistical evidence to support their specific claims, something highlighted by Goodloe (1996). Martin and Bateson (1993) have identified 3 specific measures (reliability, validity, and feasibility) that determine the quality of a behavior test. These measures determine whether a test is a good measure, the right measure, and a useful measure (Appendix A).

Reliability concerns the degree to which the test scores are free from errors of measurement (APA, 1985). To determine reliability, one must identify the consistency of the results across subtests, tests, observers, assessment centers, etc. Measures of reliability include consistency within the observer of the test (intraobserver), between observers (interobserver), within the dog (test-retest), and within components of measures designed to assess the same behavior (internal consistency). Evidence of the consistency, and hence the predictability, of the dog's behavior is what differentiates a temperament assessment from a behavioral one, although this fact is not always explicitly stated (Hsu and Serpell, 2003). Demonstration of test-retest reliability is therefore key for a temperament test (Marston and Bennett, 2003). Additionally, if tests are not reliable, they will not be valid (Diederich and Giffroy, 2006).

Validity concerns the appropriateness, meaningfulness, and usefulness of the specific inferences made from the test results (APA, 1985). Temperament tests need to ensure that they are actually assessing the trait(s) of interest (e.g., fearfulness) if they are to be valid. Validity assessments for temperament tests are fraught with difficulty, because it is unlikely that any test will be wholly predictive of a dog's behavioral reaction in any given circumstance. The aim of a temperament test is therefore to improve our knowledge of the dog and its likely future behavior above that of chance alone. The probability of this goal being achieved increases with limited context. Finally, the quality of temperament tests must also address issues of practicality and appropriateness for widespread or commercial use, whether this use is in rescue shelters or in breeding and training establishments. Tests that are impractical, overly long, and difficult to assess are unlikely to be performed accurately or reliably, if at all. Accordingly, a scientifically developed test will often require refinement for practical use. For test developers, two additional considerations need to be made in order to ensure that a test is reliable, valid, and feasible: consideration of the purpose of the test and standardization of the test procedure. If the goals of the temperament test are not clearly identified (i.e., the aspects of temperament that the testers wish to identify are not explicitly stated), then it is unlikely that the test will be valid. The next step in the development process is the selection of appropriate tests and corresponding scores for the dog's behavior. If this stage is not standardized and formalized, it is unlikely that the test will be reliable. It is important that these two additional prior requirements be fulfilled before the test developers can proceed to assessment of reliability and validity.
Jones and Gosling (2005) and Diederich and Giffroy (2006) have both recently reviewed temperament assessments in dogs. Jones and Gosling (2005) broadly considered the issues of reliability and validity for all forms of temperament assessment, including those derived from individual-based and general questionnaires, but left open the question of the quality of specific temperament tests used in practice. Nonetheless, they found evidence that the issue of reliability, in particular, had been poorly addressed, and evidence for validity was low for tests conducted on young dogs. Diederich and Giffroy (2006) specifically highlighted the lack of standardization of temperament tests for a range of dog roles. The initial aim for our paper was to review in detail the extent to which temperament tests specifically for adult pet


Journal of Veterinary Behavior, Vol 1, No 3, November/December 2006

dogs had demonstrated reliability, validity, and feasibility. Our search involved a PubMed and ScienceDirect search using the terms "dog" or "canine," "temperament" or "behavio(u)r," and "test." Only six papers relating to primary research were revealed in this search of the peer-reviewed literature. Van der Borg et al. (1991) and Hennessy et al. (2001) described tests to predict a range of problem behaviors in rescue dogs. De Palma et al. (2005) described a test to assess general temperament and re-homing suitability of rescue dogs. Netto and Planta (1997), van den Berg et al. (2003), and Kroll et al. (2004) described tests specifically to assess aggression in pet dogs. A number of other tests have been reported in conference proceedings (particularly those of the International Veterinary Behavior Meetings and the Companion Animal Behaviour Therapy Study Group) but have not been reported formally in the literature. This number includes tests for specific problem behaviors (McPherson and Bradshaw, 1998; Notari et al., 2005) and general temperament in rescue dogs (Heidenberger, 1993; Ledger and Baxter, 1997; Marder et al., 2003; Mondelli et al., 2003). The lack of publication is disappointing, because it is well known that many shelter organizations have also devised their own temperament tests (Sternberg, 2002). Given the lack of relevant, data-based scientific publications and the problems identified by other reviewers of this procedure, it is appropriate to review and reiterate the process of valid test development in order to provide a benchmark for future test developers. This paper reviews the range of evaluations required before claims can be made about either reliability or validity with the intent to guide future research and test developers. This process is broken down into identification of the purpose and content of the test, standardization, assessment of reliability and validity, and refinement for practical use, or feasibility (see Appendix A for definitions and Appendix B for key points for each of these).

Purpose and content of the temperament test

The first step in creating a valid and useful temperament test is careful consideration of its purpose (Appendix B). Test developers need to first consider why they want a temperament test. What behaviors or traits should the test reveal (e.g., fearlessness), and what behaviors and traits should it avoid revealing (e.g., aggressiveness, stress-related responses)? Determination of the purpose of a test is key to determining the method to be used to reveal the properties under investigation (e.g., specific subtests). Depending on their purpose (e.g., working dogs, assistance dogs, companion dogs), temperament tests may be expected to vary in their content, since different characteristics may be of more importance to the selectors than others. For example, Wilsson and Sundgren (1997) found that German shepherds typically used for police work scored higher for "defense drive" than did Labradors that were typically used as guide dogs. Working dog tests typically aim to identify differences between individuals in subjective categories such as bravery, courage, and sharpness. In order to make this differentiation, the tests may subject the dog to gunfire, prey objects, and mock attacks toward the dog or its handler (e.g., Wilsson and Sundgren, 1997; Svartberg, 2002; Ruefenacht et al., 2002; Courreau and Langlois, 2005). Clearly such tests (and the use of such terms) are of less relevance when trying to identify a potential pet. A factor that is often overlooked is the biological basis of the traits being explored. It has been argued that traits with a clearer biological basis (i.e., those which relate more directly to specific underlying behavioral systems, such as responsiveness to reward or punishment) will be easier to validate by independent objective means, such as physiological correlates, and they will have a clearer genetic basis for selective breeding purposes than will nebulous or artificial traits (e.g., working ability) (Sheppard and Mills, 2002). Twenty of the most common subtests used in tests to select pet dogs are listed in Table 1. Despite the apparent differences in specific aims of tests--for example, aggression (Netto and Planta, 1997) versus general temperament (De Palma et al., 2005)--tests for pet dogs are remarkably similar in their content. Although this similarity is owing, in part, to some tests being derived from others (e.g., Netto and Planta, 1997, and van den Berg et al., 2003, acknowledge van der Borg et al., 1991), it may also reflect some common aims (identification of aggressive tendencies, for example). These tests also share common features with tests for other roles, particularly those to select assistance dogs. For example, tests by Weiss and Greenberg (1997), Weiss (2002), and Lucidi et al. (2005) include approach cage, behavior on leash, umbrella test, other dog, etc. Indeed, these authors acknowledge that these tests may be just as applicable in identifying general suitability as a pet. Tests for pet dogs also share many similarities with tests for other working roles, e.g., approach by person, object play, noise stability (Svartberg, 2002). It is perhaps not surprising, then, that Svartberg (2005) found similarities in many of the characteristics of the dogs derived from working dog tests with a more general questionnaire of the owner regarding their pet. These similarities may reflect a common theme among all these tests: to identify a friendly and willing companion. As a result, comparison of the needs of the new test and the content of those already in existence may indicate some subtests that are relevant. However, it is important to carefully assess the needs of the overall test in relation to its content. Unqualified repetition from other tests, which may be asking subtly different questions, should be avoided. Although they may be criticized for their use of subjective terms (see above, and below), current tests for working dogs may be more valid than those for pet dogs since they are clearer in their requirements. Specific tests (e.g., dragging an object) are designed to test for specific traits (e.g., chase proneness) (Svartberg and Forkman, 2002). Specificity of the test and the described behavior is likely to increase

Table 1 The 20 most commonly reported subtests used in the temperament tests for pet dogs listed below

Test (study reference number): Description

Object play (1,2,4,6): Dog is engaged in play with toys, including a tug-of-war game.
Novel room test (1,3,4,6): The dog's general behavior when placed in a novel room is assessed. May also include tendency to approach an unfamiliar person within the room (5) and response to being left alone in the room for a few minutes (3).
Other dog (1,2,4,6): Dog is exposed to another dog(s) who is either held on a leash near the dog or placed in a next-door run.
Doll test (1,2,4,5): A child-sized doll is moved toward the dog.
Petting (2,4,5,6): The dog is stroked and petted (with an artificial hand if necessary).
Basic commands (1,2,6): Dog's response to basic commands such as "sit," "down," and "stay."
Ignore (2,4,6): Dog is ignored for a few moments; handler may turn away or not look at dog. Usually occurs midway through the test.
Threatening approach (1,2,4): Dog is approached in a threatening manner, which may involve rapid movements, staring, arms held up, mock hitting, and shouting.
Umbrella test (1,2,4): An umbrella with an automatic opening system is opened in front of the dog.
Food guarding (1,2): Dog is given a bowl of food or a large treat, which is then pushed away, using an artificial hand for safety. Particular note is made of aggressive reactions.
Restraint (1,2): Dog is held down on its back and/or restrained and inspected as in a vet exam.
Approach dog in kennel (1,6): Dog is approached while it is still in its kennel. Approach is usually friendly, involving crouching down and talking to the dog through the kennel bars.
Behavior on leash (1,2): Behavior of dog is monitored while it is walked on a lead for a short distance outside. Other stimuli (e.g., other dogs and people passing by) may be presented.
Noise stability (1,3): Dog is presented with a loud and startling noise, e.g., a car or air horn.
Skin sensitivity (2,4): The dog's skin is pinched, usually between the toes or groin.
Novel object test (2,3): Dog is presented with a novel object when loose within a novel room. Object may be a large bag dragged across floor (2) or a moving toy car (5).
Threaten handler (2,4): The dog's owner or handler is mock-threatened by an unfamiliar person who shouts, waves his or her arms and/or pretends to hit the handler.
Stare at dog (2): Dog is looked in the eyes until it looks away.
Collar (1): Response to a collar being put on; may also include a muzzle or headcollar.
Running loose (6): The dog is allowed to run loose in a large penned area. Reactions to other stimuli while "free" may also be assessed.

Study references: 1, Van der Borg et al. (1991); 2, Netto and Planta (1997); 3, Hennessy et al. (2001); 4, Van den Berg et al. (2003); 5, Kroll et al. (2004); 6, De Palma et al. (2005).

predictability (and therefore validity) (Wilsson and Sundgren, 1998). Tests conducted in relation to companion animals may also seek specific characteristics, such as sociability (Netto and Planta, 1997; van den Berg et al., 2003; Kroll et al., 2004), in order to minimize risks to human society. However, they may also seek more general information on temperament in order to help inform potential owners to optimize a match (e.g., van der Borg et al., 1991; Ledger and Baxter, 1997; De Palma et al., 2005). There is therefore a danger that tests for this purpose may lead test developers to try to cover a range of characteristics with a range of (very different) subtests (e.g., 21 subtests for van der Borg et al., 1991; 36 for Ledger and Baxter, 1997). This broadness of scope may be a problem, causing the overall assessment to be less sensitive than more specific tests, although this assumption has not yet been tested. A case in point is Weiss and Greenberg's (1997) selection test of shelter dogs as assistance dogs. The investigators tried to relate a series of subtests for general amiability, socialization, and fearfulness to ability to retrieve. While there was


presentations that may differ with gender. Different reactions by dogs toward male and female or familiar and unfamiliar observers have been reported elsewhere (Rappolt et al., 1979; Lore and Eisenberg, 1986; Wells and Hepper, 1999). This difference may be a consequence of differences in body language and/or behavior toward the dog, both of which have been reported to produce different reactions in the same dog (Millot, 1994; Hennessy et al., 1997, 1998; Vas et al., 2005). This finding may mean that the sex of the general tester may have to remain the same across testing situations, while making sure that the dog's reaction to both sexes is assessed at some point, if this issue is considered important to the test's goals. Standardization of the test can be seen as a minimum quality requirement. Most tests in the scientific literature appear to meet this criterion (Diederich and Giffroy, 2006). Nonetheless, it may be difficult to completely standardize a test, since the test itself relies heavily on the interaction between dog and observer/handler. Individual differences in observer behavior may alter the test result at an unintentional level. For example, a dog that is perceived to be more aggressive may pose more of a challenge to a handler than does a dog that appears timid. If a dog appears to be unpredictable and the tester is very variable in his or her behavior, we do not know if the dog is genuinely unpredictable, or if the dog is simply responding to the tester and his or her behaviors. In theory, the more constant the tester or test stimulus, the more predictable the dog will appear. This discussion highlights the importance of selecting only those tests that have been shown to be particularly reliable, despite these potential problems. In cases of testing reactions to novelty, reliability is a particular challenge, as repeat testing with the same object may result in habituation.
That said, different objects will differ in their novelty value depending on the dog's previous experience, which may be unknown. Two stimuli which appear quite similar to us might appear quite different to dogs, because dogs focus on different properties, such as olfactory cues, of which we are unaware. Standardization also relates to formalization of reporting of the dog's behavioral reactions, for example, by the use of check sheets or event recorders. It is also important to note that the time during which the dog is presented with the stimuli may not be the same as the time at which observation starts or ceases, and that this mismatch can have important consequences. For example, some tests recommend that the initial startle reaction of the dog is to be ignored because what the dog does after this startle is what is of interest (Houpt, 1994). Again, the specific procedure needs to be justified and clarified for future testers, especially in the absence of data supporting such time-dependent claims used in justification. Consideration also needs to be given to the choice of measures to report the dog's behavior. Various tests have employed different methods, ranging from attempting to count all behavioral responses (Ledger and Baxter, 1997;

evidence from their work, and others (e.g., Goddard and Beilharz, 1984), that fearfulness was related to the ability to retrieve, a more effective selection test may have been retrieval itself. Once a range of tests that might relate to the behaviors of interest has been identified, it is then important to assess these tests for their sensitivity and specificity in identifying these behaviors. Sensitivity is the proportion of true positives that are correctly identified by a test, whereas specificity is the proportion of true negatives that are correctly identified by a test. These parameters are often considered separately from validity (Diederich and Giffroy, 2006); however, if a test does not elicit the required behavior, e.g., aggression, then it is unlikely to be valid. In reality, this validity can be assessed initially quite simply by using a known population of animals with and without the property of interest in pilot studies. If a test fails at this level, then it is not worth proceeding further. Somewhat surprisingly, this step often appears to be overlooked, at least in the early stages, if not totally, in the development of many tests. Sensitivity and specificity are discussed further in relation to predictive validity.
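The two proportions defined above can be computed directly from a pilot study in which the status of each dog is already known. The sketch below uses invented data for ten dogs (four with a known history of aggression, six without); the function and variable names are illustrative only.

```python
def sensitivity_specificity(has_trait, test_positive):
    """Return (sensitivity, specificity) for paired lists of booleans.

    has_trait[i]     : the dog is known to show the behavior of interest
    test_positive[i] : the subtest elicited that behavior from the dog
    """
    pairs = list(zip(has_trait, test_positive))
    true_pos = sum(1 for known, flagged in pairs if known and flagged)
    false_neg = sum(1 for known, flagged in pairs if known and not flagged)
    true_neg = sum(1 for known, flagged in pairs if not known and not flagged)
    false_pos = sum(1 for known, flagged in pairs if not known and flagged)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical pilot: 4 dogs with a known history of aggression, 6 without.
known_aggressive = [True, True, True, True,
                    False, False, False, False, False, False]
subtest_elicited = [True, True, True, False,
                    False, False, True, False, False, False]

sens, spec = sensitivity_specificity(known_aggressive, subtest_elicited)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A subtest that scored poorly on either figure in such a pilot would be a candidate for removal before any further reliability or validity work is attempted. The function assumes that both groups are represented; an empty group would cause a division by zero.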


For a test to have any chance of being reliable and valid, standardization of the test procedure is a minimum requirement. Standardization relates to the extent to which a protocol for carrying out the test is provided and consideration for minimization of variability between tests has been made. In standardization, all potential sources of variability need to be identified and controlled for so that the only variable is the dog's response (Diederich and Giffroy, 2006). Considerations for sources of environmental variation are included in Appendix B and include location of the test, timing of the test, and type of stimuli presented. It is important that these factors remain the same for all subsequent tests on all dogs and therefore need to be determined in advance. As discussed in the following section, standardization helps to increase test-retest reliability (Goodloe, 1996). Particular consideration needs to be given to the characteristics of the animate objects presented to the dog. For example, Goddard and Beilharz (1985) found that potential guide dogs were more likely to behave confidently when they were presented with a juvenile dog and least confidently toward a mature male dog. The same consideration may also need to be paid to potential variation in the human being present during the tests, even if he or she features only as the presenter of other objects. Wickens et al. (1995) reported poor agreement between a male and a female assessor for many measures of the same dog during a similar test to McPherson and Bradshaw (1998). This finding may have been, in part, owing to a difference in the dog's reaction to the gender of the assessors and behavioral

narrow and limits valid use of statistical evaluation. Although testers may find the concept of two categories facilitates making decisions about dogs, it is unlikely that two categories really represent the variability of the dog's behavior. A discussion of the needed statistical analysis is beyond the scope of this paper, but it is important that those developing behavioral evaluations also carefully evaluate the assumptions associated with any statistical test that may be used with it. Standardization is reviewed in more detail by Diederich and Giffroy (2006), who also raise the issue of standardization between test protocols. They highlight discrepancies in the literature in both inconsistent presentation of stimuli to the dog and variation in measuring its response. As a result, factors relating to aggressiveness in working dog tests (Svartberg, 2002) may not be comparable to those relating to aggressiveness in rescue dogs (Ledger and Baxter, 1997; Diederich and Giffroy, 2006). This use of multiple definitions of terms is something to resolve if one is to compare temperament across a range of dog roles and within them.

van den Berg et al., 2003) to fitting the dog's reaction into a qualitative or ordinal category (Netto and Planta, 1997; Kroll et al., 2004) to subjectively assessing the dog on a range of characteristics (van der Borg et al., 1991; Mondelli et al., 2003). Typically, the first option lends itself to a large amount of data which have to be reduced, possibly by factor analysis, to identify potential traits. For example, Ledger and Baxter (1997) attempted to note into a voice recorder every movement of the dog and then subjected these data to principal component analysis. The number of factors described and their interpretation may therefore vary from study to study, making comparisons between studies difficult. Hennessy et al. (2001) used more objective measures such as activity level (e.g., number of line crossings) and number of vocalizations. Although doing so may reduce observer bias, many measures may need to be taken in order to obtain a more complete picture of the dog's reaction. Taking so many measures may not be feasible in practice. Finally, reducing a suite of behaviors to such minutiae may cause the overall quality of the dog's behavior to be lost (Feaver et al., 1986; Wemelsfelder and Farish, 2004). Such extraordinary detail may partially explain why Hennessy et al. (2001) found their test to have poor predictive power. In contrast, van der Borg et al. (1991) categorized dogs subjectively as aggressive, disobedient, and so on. This method is common in working dog tests, which often score the dog along subjective terms such as sharpness, courage, prey drive, and others. Such categorization can be used either throughout the test, or in one subtest (e.g., Wilsson and Sundgren, 1997). However, the assessment of such tests has been criticized for being based on an assessor's subjective review or judgment of the dog (Stur, 1987).
It is also unclear if these terms refer to behavior patterns shown by the animal or to the apparent function of the behavior (Moran and Fentress, 1979). Additionally, the names given to factors describing potential traits may be intuitively understood, but not easily defined (Murphy, 1998). The standardization of such qualitative measures requires more exploration in order to maximize transferability between testers and breed types and to minimize the risk of interpreting behaviors that may have a range of motivations (Sheppard and Mills, 2002). For example, what is considered normal or acceptable may depend on the breed of the dog. There is evidence for differences between breed types in behavioral signaling (Goodwin et al., 1996) and for behavior in tests (Vas et al., 2005), but the extent to which these differences are taken into account in the assessment of the dog is largely unexplored. A reasonable compromise between feasibility and completeness may be found in a semiquantitative assessment of the behavior. The use of five categories (from 1 [no aggression] to 5 [biting], as in Netto and Planta, 1997) seems adequate to cover for a range of behavioral reactions, whereas only two (pass or fail, as in Weiss, 2002; or "dog attacks the other dog or snarls" or "the dog exhibits no aggressive behavior," as in Lucidi et al., 2005), may be too


Intra-observer reliability

Intra-observer reliability measures the consistency of the reports of a single observer (Martin and Bateson, 1993). In theory, the observer's assessments should report similarly when the same dog is tested using the same test on another occasion. However, in order to control for behavioral changes on the part of the dog, rather than by the observer, it is recommended that intra-observer reliability be assessed by the use of video recordings (Martin and Bateson, 1993). In this way, the observer can compare the reports of the same test on two or more occasions. If more than one observer is used, it may be advisable to train observers first, in order to help maximize the chance of consistency (see Murphy, 1998), and then select only those that are most reliable. Intra-observer reliability may be considered as the most basic form of reliability assessment, but one that is equally important as others, since poor intra-observer reliability may contribute to poor test-retest reliability (Bowling, 1997). There appear to be few explicit reports of intraobserver reliability for temperament tests (for any use), with the notable exception of Murphy (1995). Intra-observer reliability is of particular concern when interpretations or subjective opinions of behavior are part of the scoring system, but it applies to all systems.
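One way to quantify the agreement between two scoring sessions of the same video recordings is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The text above does not prescribe a particular statistic, so the choice of kappa is an assumption made for illustration, and the scores below are invented.

```python
from collections import Counter


def cohen_kappa(first_viewing, second_viewing):
    """Cohen's kappa for two equally long lists of category scores."""
    n = len(first_viewing)
    # Proportion of recordings given the same score on both viewings.
    observed = sum(a == b for a, b in zip(first_viewing, second_viewing)) / n
    # Agreement expected by chance from the marginal score frequencies.
    counts_first = Counter(first_viewing)
    counts_second = Counter(second_viewing)
    expected = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)
    if expected == 1.0:
        # Both viewings used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores (1 = no aggression ... 5 = biting) given by ONE observer
# to the same ten video recordings on two separate occasions.
first_viewing = [1, 1, 2, 3, 3, 4, 5, 5, 2, 1]
second_viewing = [1, 1, 2, 3, 4, 4, 5, 5, 2, 2]

kappa = cohen_kappa(first_viewing, second_viewing)
print(f"intra-observer kappa = {kappa:.2f}")
```

Unweighted kappa treats every disagreement as equally serious. For an ordinal scale such as the one assumed here, a weighted variant that penalizes a 1-versus-5 disagreement more than a 3-versus-4 disagreement may be more appropriate.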

Inter-observer reliability

Inter-observer reliability measures the likelihood that different observers will assess the same dog on the same occasion in the same way (Martin and Bateson, 1993). Given that temperament tests are usually developed for use by several individuals and assessment centers, it is particularly important to assess the consistency of reports between them. Reasons for inconsistent reporting may include poorly designed tests and differences in perception, experience, and recording ability between observers (Murphy, 1995, 1998). It is also important to recognize that reliability between scientists developing the test does not guarantee reliability among those in the field for whom it may have been designed. Training of observers before they conduct assessments will help to maximize reliability between, as well as within, observers. If multiple observers cannot assess the same dog during the same test, then video recordings can be used so that all observers are watching the same example of behavior (Murphy, 1995). This is probably the better approach, since the dog and the tester may be affected by a crowd of observers. Video recordings are also useful for assessing any changes in consistency within dogs, testers, or observers over time and with experience. It is important to note, though, that any assessment of consistency should be of observations from the same medium; it is not advisable to compare real-time observations with those from videos, since both may fail to capture all events, but in different ways. Of the six tests for pet dogs identified earlier, only Kroll et al. (2004) and De Palma et al. (2005) reported on inter-observer reliability. Although they reported high levels of agreement, the reliability between observers for other forms of temperament tests, where reported, has been fairly low (Goddard and Beilharz, 1983; Murphy, 1995), suggesting that this is an important area of concern.

Test-retest reliability

Test-retest reliability measures the likelihood that the dog will behave in the same manner when the same test is conducted on another occasion. Consistency of behavioral reactions is vital to the concept of temperament (Svartberg et al., 2005). Only when the same response is observed in the dog during the same test can predictions be made about its reaction in similar situations. A number of factors might account for variation in the behavior of the same dog from one test to another, including hunger (Goodloe, 1996) or illness. Repeat testing of reactions to a novel object is particularly challenging, as previously discussed, and it is not surprising that differences in dogs' curiosity toward novel objects have been reported between two test occasions (Beerda et al., 1999; Siwak et al., 2001). Other difficulties with assessing test-retest reliability arise from the possibility that familiarity with the stimuli in the test procedure, including the tester, objects, and location, may change the dog's behavior. For example, in shelter settings, cortisol measurements from dogs vary over time (Hennessy et al., 1997), raising the possibility that the dog's behavioral reaction in this situation may also be more variable over time. Unfortunately, most authors have not reported an assessment of test-retest reliability. Of six studies of pet dog temperament, only Netto and Planta (1997) reported retesting a sample of the dogs 6 months later. The potential for changes attributable to experience, as opposed to inconsistency, needs to be appreciated in terms of learning theory. For example, it has been suggested that the dog may react less aggressively owing to lack of novelty of the stimuli (habituation) (Svartberg et al., 2005). It has also been suggested that dogs may react more aggressively in some repeated situations because they may have perceived that they "won" last time (i.e., they had been rewarded for avoiding a potential threat on the first occasion) (Netto and Planta, 1997). Of course, the extent to which the dog has learned from the situation and responded accordingly depends partially on its temperament. One way of minimizing this problem is to look at the consistency across tests in the rank order of behavioral reactions (Svartberg et al., 2005). This process helps to control for general shifts in behavior as a consequence of repeated exposure, and to identify those tests in which dogs genuinely react inconsistently. Partially because of the lack of reported test-retest reliability, it is difficult to make recommendations about the length of time that should elapse between tests. Clearly, the second test should not immediately follow the first, so that the stimuli retain some sense of novelty on the second occasion (De Palma et al., 2005). The more circumstances in which a dog is evaluated, the more likely that any test will have later predictive value, because the individual's response surface has been better defined (Overall, 2005). More investigation needs to be undertaken into how temperament tests might be assessed for test-retest reliability while taking these processes into account.

Internal consistency and unidimensionality

The assessment of internal consistency is applicable when the behaviors of interest are actually derivations from much larger data sets that have been subject to data reduction techniques such as factor or principal components analysis. Internal consistency refers to the extent to which the items within these factors assess the same dimension and can therefore be considered a measure of reliability (Eiser and Morse, 2001). The correlation between items within factors can be assessed using Cronbach alpha (Serpell and Hsu, 2001). Typically, values above 0.7 are considered desirable, but Clark and Watson (1995) recommend that the range of interitem correlations be examined because high correlations may suggest that the items are measuring the same trait, whereas low correlations suggest that the items are measuring entirely different traits. Accordingly, Clark and Watson (1995) recommend that most interitem correlations within a scale should range from 0.15 to 0.50 to maximize unidimensionality (≥0.15) while avoiding data redundancy (≤0.50). A number of researchers have applied data reduction to their initial behavioral observations (Hennessy et al., 2001; van den Berg et al., 2003; De Palma et al., 2005), but the issue of data redundancy appears to still be largely overlooked. Thus, tests may include redundant items that, at best, produce unnecessary repeat measures and, at worst, bias any final score toward the relevant dimension.

Content validity

Content validity evaluates whether the test measures what it should and whether the components of the measure cover all aspects of the behavior in question. Face validity is one aspect of content validity and refers to the subjective assessment of whether the item appears to be measuring the variable it claims to "on the face of it" (Eiser and Morse, 2001). For example, van den Berg et al. (2003) performed a principal components analysis on all the behaviors shown by the dogs in their aggression test. Six factors explained 66% of the variance, and each factor contained behaviors that described a general behavioral reaction. Behaviors in the first factor appeared to represent threatening behavior (stiff posture, staring, growling, and lip curling) and did not feature in other factors. Similarly, in Hennessy et al. (2001), the factor labeled "locomotor activity" was made up of behaviors related to activity, as defined by the number of lines crossed in a novel room test. These examples can be seen as providing evidence of content validity, although this conclusion was not explicitly stated by the authors. Perhaps a more rigorous methodology is to employ a panel of experts who evaluate both the test protocol and the behaviors observed for their similarity with the test purpose (Wiseman-Orr et al., 2004). However, to date, this particular method has not been reported for specific temperament tests, but only for questionnaire-based studies.

Evaluation of content validity is particularly pertinent for temperament tests, because these tests seek to assess the dog's general behavioral tendencies over only a short space of time and in response to limited stimuli. The dog's behavioral reaction at any given point in time is a product of both its temperament and the surrounding environment. Because temperament tests cannot re-create all possible scenarios, some have questioned whether temperament tests can ever have much validity (Sheppard and Mills, 2002; Kroll et al., 2004). The validity of temperament tests conducted within rescue shelters has been particularly criticized, because the environmental situation may be overwhelming and may mask the dog's true behavior, which might differ in a more normal situation, such as the home (Weiss and Greenberg, 1997; Marston and Bennett, 2003). Tests conducted in the shelter may not represent the home situation in a number of respects, not least the absence of relationships. For example, the high level of arousal, degree of novelty, and lack of escape routes may increase the chances of a caged dog showing defensive aggression. In the absence of the owner supporting the dog's behavior, these situations may also decrease the likelihood of aggression. Palestrini et al. (2005) recently reported that, in the absence of their owners during a veterinary examination, dogs had higher heart rates but were less likely to be aggressive compared to when their owners were present. Despite these concerns, none of the authors reviewed has specifically reported on the content validity of their tests, something that should be considered at the start of the development process.

Construct validity

Construct validity is the extent to which the items within a behavioral test measure the broad construct (i.e., temperament trait) that they were designed to measure. Construct validity is typically assessed by looking at the relationship between factors created by data reduction techniques. Convergent validity is shown when factors that theoretically should be related to each other correlate with each other. Discriminant validity is shown when factors that theoretically are not related to each other do not correlate (Bowling, 1997). Evaluation of construct validity appears to be more common in questionnaire-based assessments of temperament (Goodloe and Borchelt, 1998; Serpell and Hsu, 2001; Hsu and Serpell, 2003). For example, Goodloe and Borchelt (1998) found that their derived factor for friendliness correlated negatively with their factor for aggression to strangers, supporting the contention that the friendliness factor was in fact measuring what it purported to measure. This form of validity check could be done similarly for behavior tests, particularly those that have derived factors from data reduction techniques of multiple measures taken during each test. Although several authors used these techniques (Hennessy et al., 2001; van den Berg et al., 2003), none has specifically reported evaluation of construct validity.
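Both internal consistency and construct validity rest on correlations, among the items within a factor in the first case and between derived factors in the second. The Python sketch below shows one way these checks might be computed, using the standard formula for Cronbach's alpha and Pearson correlations. The function names, the behavior items, and all scores are hypothetical and are not taken from any of the studies cited.

```python
from itertools import combinations
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one score list per item (same dogs, same order)."""
    k = len(items)
    totals = [sum(dog_scores) for dog_scores in zip(*items)]
    summed_item_variance = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - summed_item_variance / variance(totals))


def interitem_correlations(items):
    """Pearson correlation for every pair of items within one factor."""
    return [pearson(a, b) for a, b in combinations(items, 2)]


# Hypothetical 1-5 scores for six dogs on three items of a "threat" factor.
threat_items = [
    [1, 2, 2, 4, 5, 3],  # growling
    [1, 1, 3, 4, 4, 3],  # staring
    [2, 2, 2, 5, 4, 3],  # stiff posture
]
alpha = cronbach_alpha(threat_items)
pairwise = interitem_correlations(threat_items)
# Clark and Watson (1995): most inter-item correlations should fall in 0.15-0.50;
# pairs above 0.50 are flagged here as candidates for redundancy.
redundant_pairs = [r for r in pairwise if r > 0.50]

# Hypothetical factor scores for the same dogs, for a construct-validity check:
# friendliness is expected to correlate negatively with aggression to strangers.
friendliness = [5, 4, 4, 2, 1, 3]
stranger_aggression = [1, 2, 1, 4, 5, 3]
r_factors = pearson(friendliness, stranger_aggression)
```

With these invented scores, alpha exceeds 0.7 but every inter-item correlation is above 0.50, which is the pattern the redundancy concern describes: a scale can look reliable while carrying items that largely repeat one another.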

Criterion validity

Criterion validity is the extent to which an association between the scores for each factor and an external criterion can be demonstrated. Strictly speaking, the criterion should be a gold standard; in the absence of one, concurrent validity may be a more appropriate term. Concurrent validity occurs when the measure or factor is found to vary alongside a more established measure that is supposed to be measuring the same construct, usually measured at the same time. Some consider concurrent validity to be a more robust aspect of construct validity (Eiser and Morse, 2001), which may explain the use of the latter term in other canine behavior studies (Serpell and Hsu, 2001; Hsu and Serpell, 2003; Svartberg, 2005), despite the comparison of different, concurrent assessments of the dog's behavior (i.e., reports of different persons, reports, and test results). For temperament tests of pet dogs, the external criterion for assessment of concurrent or predictive validity has been the report of the owner in the majority of cases. Such reports may be in the form of acknowledgment of aggressive behavior by the current owner (Netto and Planta, 1997; van den Berg et al., 2003) or of behavior problems once the dog is re-homed (van der Borg et al., 1991; Ledger and Baxter, 1997; Marder et al., 2003). For example, Ledger and Baxter (1997) reported correlations between owners' reports of the dogs' behavior and factors derived from their behavior in a series of subtests. More recently, Svartberg (2005) compared the results from working dog tests with owner report and found examples of agreement. For example, factors relating to playfulness from the temperament test correlated with the owner's answers relating to playfulness in their questionnaire. Consultation with a behavior specialist has also been used as the criterion (Kroll et al., 2004; Notari et al., 2005), although it is not clear how much of this also relies on the owner's report (Hsu and Serpell, 2003). The validity of the use of owner report in the form of questionnaires may be questioned, largely because these questionnaires themselves need to be validated (van der Borg et al., 1991). Particularly with behavior problems, there is concern that owners may not be good reporters because they may not be aware of the behavior (e.g., food guarding), or, if they are aware of it, they may not recognize it as a problem and so fail to report it (van der Borg et al., 1991). The behavior of the dog as reported by the owner may depend considerably on the owner (their level of expectation, experience, and their own temperament), although work on the validation of questionnaires is beginning (Serpell and Hsu, 2001; Hsu and Serpell, 2003; Sheppard and Mills, 2003). To date, there is a lack of evidence of correlation between owner report and more objective measures, such as those provided by direct observation of the dog in the home by another person (van der Borg et al., 1991). This lack of evidence exists partly because it is difficult to standardize the complex and variable nature of life in the home (Svartberg, 2005). Clearly, the test would need to be validated by watching the owner interact with the dog (as in Rooney et al., 2000), which could be facilitated by the use of video cameras. This method has been recommended for the objective identification of problem behavior in separation-related behavior problems (Lund and Jørgensen, 1999).

In the case of temperament tests for pet dogs, it is more common that there is a single outcome (e.g., a behavior problem) that can be compared with the test results. In this case, the predictive validity of the test can be assessed, which is a form of criterion validity. Predictive validity measures the extent to which a measure (e.g., the test results) can predict another measure in the future (e.g., reports of behavior problems in the home) (Ledger, 1997). The predictive power of the temperament test is often presented in terms of its sensitivity and specificity (van der Borg et al., 1991; Planta, 2001; Marder et al., 2003), as shown in Table 2. Using the terminology of Table 2, for a given behavior problem, sensitivity is the ability of the test to correctly identify dogs with the problem (a / (a + c)), whereas specificity is the ability of the test to correctly identify those that do not have the problem (d / (b + d)) (Overall, 1997). For example, van der Borg et al. (1991) reported that the sensitivity of their test to correctly predict behavior problems, overall, was 75%. The number of dogs that incorrectly fail the test (false positives) and the number of dogs that incorrectly pass the test (false negatives) also determine validity (Kroll et al., 2004). The number of true positives as a proportion of the total number of dogs that fail the test is the predictive value of the positive test (a / (a + b)). However, if the prevalence of a problem behavior is low, as it often is, even a highly specific test may give a high number of false positives because of the high number of dogs with no problems being tested (van der Borg et al., 1991). Because of this situation, some authors report the predictive value of the negative test, which is the number of true negatives (dogs without a problem) as a proportion of the total number of dogs that pass the test (d / (c + d)). An alternative is to present the likelihood ratio (LR), which gives the likelihood that the test, given the prevalence of the problem in the sample, predicts an increased (positive LR = sensitivity / (1 - specificity)) or decreased (negative LR = (1 - sensitivity) / specificity) risk of the problem in the dog. For example, Marder et al. (2003) reported that if the dog showed aggression in any situation in the test, there was a 90% chance it would do so in the new home. But if a dog did not show aggression in the test, there was still a 50% chance that it would do so in the new home. Therefore, overall their test provided moderate contributions to knowledge about various forms of aggression.

Table 2 Values used to determine the predictive value, sensitivity, and specificity of a temperament test to predict behavior problems in the home

                               Behavior problem in test?
Behavior problem in home?      Yes      No
Yes                            a        c
No                             b        d
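The quantities defined from Table 2 can be computed directly from the four cell counts. The Python sketch below is a minimal illustration under that layout; the function name `diagnostic_summary` and the counts are invented for the example and do not come from any of the studies cited.

```python
def diagnostic_summary(a, b, c, d):
    """Summary statistics for a 2x2 table laid out as in Table 2.

    a: problem in test and in home (true positives)
    b: problem in test, none in home (false positives)
    c: no problem in test, problem in home (false negatives)
    d: no problem in test or in home (true negatives)
    """
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
        "positive_lr": sensitivity / (1 - specificity),
        "negative_lr": (1 - sensitivity) / specificity,
    }


# Hypothetical follow-up of 200 re-homed dogs in which only 20 (10%)
# show the behavior problem in the home.
summary = diagnostic_summary(a=15, b=18, c=5, d=162)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts the test is 90% specific, yet only 15 of the 33 dogs that fail it go on to show the problem in the home, which illustrates the low-prevalence effect described above. The sketch does not guard against empty rows or columns or a specificity of exactly 1, any of which would cause a division by zero.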
The predictive validity of temperament tests used to match owners to dogs is also supported, to some extent, by the reduction in numbers of dogs returned to the re-homing centers (Heidenberger, 1993; Mondelli et al., 2003), particularly in comparison to centers that did not implement the matching procedure (Ledger and Stephen, 2004). However, the extent to which the test's predictability depends on the actual test results and matching procedure, or on the extra time taken to advise particularly inexperienced owners, is not yet known. An additional consideration when assessing the predictive validity of temperament tests is the potential for low prevalence of problem behavior, as reported by many authors (van der Borg et al., 1991; Murphy, 1998; Hennessy et al., 2001; Marder et al., 2003). Unless the sample is very large, low prevalence can lead not only to skewed data sets, but also to small sample sizes for the assessment of predictability of tests. Low prevalence of some problems may be partly due, for example, to dogs with severe problems being unlikely to be offered for adoption, a decision that will depend on the culture of each rescue shelter. Similarly, loss of dogs to follow-up once re-homed should not be considered as "missing at random," since the reasons for lack of follow-up can include that the dog has behavior problems and the owner either does not want to admit it, is too busy dealing with the dog to fill in the questionnaire, has had the dog destroyed, or has passed the dog on to someone else.

Feasibility for practical use

The ultimate aim of many temperament tests for pet dogs is that interested groups can perform the test themselves and make use of the results (Ledger and Baxter, 1997). Accordingly, the test needs to be standardized, short, easy to perform, and amenable to easy recording of the dog's response. Many of the tests reviewed here may be prohibitively long for practical use in a working environment like a shelter (Hsu and Serpell, 2003), taking one hour per dog (van der Borg et al., 1991; Ledger and Baxter, 1997; Netto and Planta, 1997; Marder et al., 2003), which is surprising because the majority of these tests were developed with shelter dogs in mind. Several authors have stated their intention to refine the test for use by shelter staff, but none of these revisions has been formally presented in the literature. The selection of a wide range of tests may be a useful exercise during the development of the test, particularly if knowledge about what might be predictive of future behavior is poor (Goodloe, 1996). However, in order to be feasible, refinement of the test needs to be undertaken, a process that may involve reducing the number of subtests and observations, simplifying the method of observation, and enabling decisions to be made about individual dogs based on the test results. Subtests and observations should be removed on the basis of poor intra-observer, inter-observer, and test-retest consistency, as described above, and in this order. Refinement for practical use is particularly pertinent for those tests that took a large number of measures of the dog's behavior and discussed the findings in terms of factors derived statistically from these (Ledger and Baxter, 1997; McPherson and Bradshaw, 1998; Hennessy et al., 2001; Marder et al., 2003; van den Berg et al., 2003; De Palma et al., 2005). When tests are changed for practical purposes, validity in the new setting must not be assumed without retesting.

In order to maximize efficiency when developing a test, the flow chart in Figure 1, which describes the entire process by which a temperament test may be evaluated as it is developed, may be followed. During the refinement process, additional consideration should also be given to the welfare of the animal being evaluated (Martin and Bateson, 1993), particularly when tests could provoke fear, anxiety, or aggressive behavior. Interestingly, the most fearful responses in the aggression test by van den Berg et al. (2003) were during the umbrella test (Table 1), rather than in other tests which might be perceived to be more threatening, such as crowds and gestures. This may be a concern, because the umbrella test is commonly used in other temperament studies (van der Borg et al., 1991; Ledger and Baxter, 1997; Weiss and Greenberg, 1997; Marder et al., 2003). King et al. (2003) evaluated behavioral and physiological reactions of dogs to the umbrella test and concluded that it constituted an intense startle test that could cause fear. In some respects, this reaction may be normal for dogs, which should lead test developers to question why they require it in their tests in the first place. The extensive test described by Netto and Planta (1997) to predict aggressiveness was, by their own admission, severe, since it included tests in which the dog was cornered and threatened by humans and other dogs. Of the dogs tested, 97% showed aggression and 67% bit at some point, with aggressive behavior increasing as the test progressed, despite the fact that not all dogs were considered aggressive before the test. A shortened version of the Netto and Planta (1997) test, called the "sociable-acceptable behavior - MAG test," has been presented (Planta, 2001; van den Berg et al., 2003) and may be seen as a refinement of Netto and Planta (1997). These shortened versions may have reduced the length of time the dog was under pressure but, in order to remain predictive, still retained severe elements of the longer test. Even so, they both reported a small decrease in predictability, which is to be expected from the removal of a number of additional, albeit unreliable or duplicated, explanatory variables. In this case, there appears to be a trade-off, particularly when testing for behavior problems, between maximizing predictability and minimizing the stress to the animal. It is important to note, however, that a temperament test can also be a positive experience for the dog, particularly if it contains elements of play and contact and can therefore contribute toward socialization or training time (Dunbar, 1989).

Figure 1 Flow diagram of the process in creating and evaluating a temperament test. Boxes in the diagram: Purpose of the test (Why do you want a temperament test? For which behaviors do you want to select?); Content (identify relevant subtests and perform sensitivity analysis); Standardization (formalize the test procedure and the method of recording behavior); Assess intra-observer reliability (disregard unreliable measures); Assess test-retest reliability (disregard unreliable measures); Assess content and construct validity; Assess inter-observer reliability from a single site (disregard unreliable measures); Assess criterion validity (predictive validity of the final test, in situ); Re-evaluate test (Does it need to be refined or to have new tests added?); Assess inter-observer reliability across a range of sites; Assess criterion validity (predictive validity of the test across a range of sites); Publish.


Fewer than ten reports of temperament tests specifically for the selection of suitable adult dogs as pets could be found in the peer-reviewed scientific literature. Even among these, the reports of reliability, validity, and feasibility are not complete, with authors typically reporting on one, but not all, aspects. The absence of reports of the methodology, reliability, and validity of temperament tests for dogs in general has been noted by a number of authors (Hsu and Serpell, 2003; Marston and Bennett, 2003; Jones and Gosling, 2005). Despite this fact, it is clear that many organizations carry out their own temperament tests. It is of concern to find that not only are many of these tests not apparently designed in consultation with behavioral scientists, but also that they have not been presented formally in the scientific literature. It is even more concerning that those that do appear in the literature have incomplete reports of the quality of these tests. It is important that this serious omission be remedied in order to facilitate the development of a consensus and the spread of good practice while reducing unnecessary replication, particularly in the design and evaluation of tests, which can be costly and lengthy. Failure to report the full details of the test procedure and how the dog's behavior is assessed prevents replication by others. Replication is a necessary requirement for scientific progress. In the absence of full, peer-reviewed reports of tests, this field is in danger of proceeding in a pseudoscientific manner. It is imperative that publishers also recognize the importance of such data. Given that important decisions about the future of individual dogs are made on the basis of these tests, this finding is all the more disturbing.

Overall, evidence of concurrent or predictive validity of the temperament tests reviewed was moderate to good, particularly for identification of specific behavior problems such as aggression and separation anxiety. However, the method of reporting on validity varied, with few authors reporting sensitivity and specificity, which makes it difficult to compare the strengths of one test over another. There is promising evidence from the limited studies that a matching program can reduce return rates (Heidenberger, 1993; Mondelli et al., 2003; Ledger and Stephen, 2004), the latter using a refined version of the test by Ledger and Baxter (1997). It is unfortunate that such studies have not been reported fully in the literature. It is important that such procedures are not only validated in the shelters in which they are used, to control for the differences between shelters, but that they are evaluated for their success and refined, so that the time cost:benefit ratio is optimized. This aspect is particularly pertinent when the process is at risk of limiting the choice available to prospective owners and increasing the chances that dogs remain un-homed when they might otherwise make a reasonable match.

Perhaps because most of the published tests discussed here demonstrated some degree of predictive or concurrent validity, the issue of reliability, in particular, has been largely overlooked. Omitting assessments of reliability means that there is no information regarding the extent to which the validity of the test results has been affected by inconsistency on the part of the tester, observer, or dog, or the use of a different assessment center from that in which the original test was developed. Assessing the reliability of measures, observers, and subtests during the development process, as shown in McPherson and Bradshaw (1998), should be considered a better methodology. These authors reported testing all their dogs twice and only included measures in their assessment tool that had high test-retest consistency between the two occasions. Since consistency of the dog's response is what characterizes temperament, it is particularly important that during development of the test, measures are chosen that most clearly demonstrate this consistency. Doing so will indicate at an earlier stage those tests and measures that are most resistant to inconsistency with time and observers and are therefore more likely to be predictive. In order to facilitate an understanding of how tests should be developed taking into account these quality checks, this paper includes a flow chart of this process. Definitions and key considerations for each stage are also provided for guidance (Appendices A and B). Hopefully, as more temperament tests are published, improvements to this scheme can be made. We hope this discussion will in some way help future test developers meet these challenges.


Acknowledgments

This paper formed part of a wider review of approaches to the assessment of temperament, welfare, and quality of life in kenneled dogs commissioned by Dogs Trust, U.K., and we are indebted to this organization for its support of this work. The first author was supported by this charity to undertake these reviews. We would also like to thank members of the Dogs Trust "quality of life working party" for their support and comments: Jon Bowen, John Bradshaw, Keith Butt, Rachel Casey, Philip Daubeny, Paul DeVile, Sarah Heath, Andrew Higgins, Matthew Leach, Sam Lindley, David Main, Joe Mayhew, Rose Mcllrath, Dirk Pfeiffer, Jacqueline Reid, Irene Rochlitz, Marian Scott, Jacqueline Stephen, Natalie Waran, Deborah Wells, and Lesley Wiseman-Orr. Finally, the authors would like to thank the three anonymous reviewers for their constructive comments.

References

Goodwin, D., Bradshaw, J.W.S., Wickens, S.M., 1996. Paedomorphosis affects agonistic visual signals of domestic dogs. Anim. Behav. 53, 297-304.
Gosling, S.D., John, O.J., 1999. Personality dimension in nonhuman animals: a cross-species review. Curr. Dir. Psychol. Sci. 8, 69-75.
Heidenberger, E., 1993. Rehabilitation of dogs kept in animal shelters. In: Proceedings of the 27th International Congress of the International Society for Applied Ethology, Berlin.
Hennessy, M.B., Davis, H.N., Williams, M.T., Mellott, C., Douglas, C.W., 1997. Plasma cortisol levels of dogs at a county animal shelter. Physiol. Behav. 62, 485-490.
Hennessy, M.B., Voith, V.L., Mazzei, S.J., Buttram, J., Miller, D.D., Linden, F., 2001. Behaviour and cortisol levels of dogs in a public shelter, and an exploration of the ability of these measures to predict problem behaviour after adoption. Appl. Anim. Behav. Sci. 73, 217-233.
Hennessy, M.B., Williams, M.T., Miller, D.D., Douglas, C.W., Voith, V.L., 1998. Influence of male and female petters on plasma cortisol and behaviour: can human interaction reduce the stress of dogs in a public animal shelter? Appl. Anim. Behav. Sci. 61, 63-77.
Houpt, K.A., 1994. The ontogeny of behaviour in dogs and cats. In: T.G. Hungerford Refresher Course for Veterinarians - Animal Behaviour, Proceedings 214 Post-Graduate Committee in Veterinary Science, University of Sydney, Australia, pp. 89-98.
Hsu, Y., Serpell, J.A., 2003. Development and validation of a questionnaire for measuring behaviour and temperament traits in pet dogs. J. Am. Vet. Med. Assoc. 223, 1293-1300.
Jones, A.C., Gosling, S.D., 2005. Temperament and personality in dogs (Canis familiaris): a review and evaluation of past research. Appl. Anim. Behav. Sci. 95, 1-53.
King, T., Hemsworth, P.H., Coleman, G.J., 2003. Fear of novel and startling stimuli in domestic dogs. Appl. Anim. Behav. Sci. 82, 45-64.
Knol, B.W., Roozendaal, C., van den Bogaard, L., Bouw, J., 1988. The suitability of dogs as guide dogs for the blind: criteria and testing procedures. Vet. Q. 10, 198-204.
Kroll, T.L., Houpt, K.A., Erb, H.N., 2004. The use of novel stimuli as indicators of aggressive behaviour in dogs. J. Am. Anim. Hosp. Assoc. 40, 13-19.
Ledger, R.A., 1997. Understanding owner-dog compatibility. Vet. Int. 9, 17-23.
Ledger, R.A., Baxter, M.R., 1997. The development of a validated test to assess the temperament of dogs in a rescue shelter. In: Proceedings of the 1st International Conference on Veterinary Behavioural Medicine, UFAW, Herts, UK, pp. 87-92.
Ledger, R.A., Stephen, J.M., 2004. Reducing dog return rates at rescue shelters: applying science for animal welfare. Anim. Welf. 13(S), 247.
Lore, R.K., Eisenberg, F.B., 1986. Avoidance reactions of domestic dogs to unfamiliar male and female humans in a kennel setting. Appl. Anim. Behav. Sci. 15, 261-266.
Lucidi, P., Bernabo, N., Panunzi, M., Dalla Villa, P., Mattioli, M., 2005. Ethotest: A new model to identify (shelter) dogs' skills as service animals or adoptable pets. Appl. Anim. Behav. Sci. 95, 103-122.
Lund, J.D., Jørgensen, M.C., 1999. Behaviour patterns and time course of activity in dogs with separation problems. Appl. Anim. Behav. Sci. 63, 219-236.
Marder, A.R., Engel, J.M., Carle, D., 2003. Predictability of a shelter dog behavioural assessment test. In: Proceedings of the 4th International Veterinary Behaviour Meeting, Caloundra, Australia, p. 143.
Marston, L.C., Bennett, P.C., 2003. Re-forging the bond-towards successful canine adoption. Appl. Anim. Behav. Sci. 83, 227-245.
Martin, P., Bateson, P., 1993. Measuring Behaviour: an introductory guide, 2nd ed. Cambridge University Press, Cambridge, UK.
McPherson, J.A., Bradshaw, J.W.S., 1998. A validated test of separation behaviour in kennelled rescue dogs. In: Proceedings of the 1998 Companion Animal Behaviour Therapy Study Group Study Day, Birmingham, UK, p. 33.


American Psychological Association, 1985. Standards for educational and psychological testing. American Psychological Association, Washington DC, USA, pp. 9. Arkow, P.S., 1994. A new look at pet "over-population." Anthrozöos. 7, 202-205. Beerda, B., Schilder, M.B.H., Van Hoof, J.A.R.A.M., De Vries, H.W., Mol, J.A., 1999. Chronic stress in dogs subjected to social and spatial restriction In: Behavioural responses. Physiol. Behav. 66, 233-242. Bowling, A., 1997. Measuring Health: a review of quality of life measurements scales, 2nd ed. Open University Press, Buckingham, UK. Brenoe, U.T., Larsgard, A.G., Johannessen, K.R., Uldal, S.H., 2002. Estimates of genetic parameters for hunting performance traits in three breeds of gun hunting dogs in Norway. Appl. Anim. Behav. Sci. 77, 209-215. Clark, L.A., Watson, D., 1995. Constructing validity: Basic issues in objective scale development. Psychol. Assess. 7, 309-319. Courreau, J-F., Langlois, B., 2005. Genetic parameters and environmental effects which characterise the defence ability of the Belgian shepherd dog. Appl. Anim. Behav. Sci. 91, 233-245. De Palma, C., Viggiano, E., Barillari, E., Palme, R., Dufour, A.B., Fantini, C., Natoli, E., 2005. Evaluating the temperament in shelter dogs. Behav. 142, 1307-1328. Diederich, C., Giffroy, J-M., 2006. Behavioural testing in dogs: A review of methodology in search for standardization. Appl. Anim. Behav. Sci. 97, 51-72. Dunbar, I., 1989. Standardized Testing. Pure-bred Dogs. American Kennel ClubGazette. March, 22­24. Eiser, C., Morse, R., 2001. Quality of life measures in chronic disease of childhood. Health Tech. Assess. 5, 1-156. Eysenck, M.W., 1994. Individual differences. normal and abnormal. Lawrence Erlbaum Associates, Hillside, NJ, USA. Feaver, J., Mendl, M., Bateson, P., 1986. A method for rating the individual distinctiveness of domestic cats. Anim. Behav. 34, 1016-1025. Fredrickson, M.A., 1993. 
Temperament testing procedures for animals involved in nursing home, school and hospital visiting programs through Delta Society Pet Partners. Appl. Anim. Behav. Sci. 37, 83. Fuchs, C., Gaillard, S., Gebhardt-Henrich, S., Ruefenacht, A., Steiger, T., 2005. External factors and reproducibility of the behaviour test in German shepherd dogs in Switzerland. Appl. Anim. Behav. Sci. 94, 287-301. Goddard, M.E., Beilharz, R.G., 1983. Genetics of traits which determine the suitability of dogs as guide dogs for the blind. Appl. Anim. Ethol. 9, 299-315. Goddard, M.E., Beilharz, R.G., 1984. The relationship of fearfulness, sex, age and experience on exploration and activity in dogs. Appl. Anim. Behav. Sci. 12, 267-278. Goddard, M.E., Beilharz, R.G., 1985. Individual variation in agonistic behaviour in dogs. Anim. Behav. 33, 1338-1342. Goodloe, L.P., 1996. Issues in description and measurement of temperament in companion dogs. In: Voith, V.L., Borchelt, P.L. (Eds), Readings in Companion Animal Behaviour, Veterinary Learning Systems, Trenton, NJ, USA, pp. 32-39. Goodloe, L.P., Borchelt, P.L., 1998. Companion dogs temperament traits. J. Appl. Anim. Welf. Sci. 1, 303-338.


Journal of Veterinary Behavior, Vol 1, No 3, November/December 2006

Miller, D.D., Staats, S.R., Partlo, C., Rada, K., 1996. Factors associated with the decision to surrender a pet to an animal shelter. J. Am. Vet. Med. Assoc. 209, 738-742.
Millot, J.L., 1994. Olfactory and visual cues in the interaction systems between dogs and children. Behav. Proc. 33, 177-188.
Mondelli, F., Montanari, S., Prato Previde, E., Valsecchi, P., 2003. Temperament evaluation of dogs housed in an Italian rescue shelter as a tool to increase adoption success. Anim. Welf. 13, 251.
Moran, G., Fentress, J.C., 1979. A search for order in wolf social behaviour. In: Klinghammer, E. (Ed.), The Behaviour and Ecology of Wolves. Garland Press, New York, pp. 245-283.
Murphy, J.A., 1995. Assessment of the temperament of potential guide dogs. Anthrozoös 8, 224-228.
Murphy, J.A., 1998. Describing categories of temperament in potential guide dogs for the blind. Appl. Anim. Behav. Sci. 58, 163-178.
Netto, W., Planta, D., 1997. Behavioural testing for aggression in the domestic dog. Appl. Anim. Behav. Sci. 52, 243-263.
Notari, L., Antoni, M., Gallicchio, B., Gazzano, A., 2005. Behavioural testing for dog (Canis familiaris) behaviour and owners' management in urban contexts: a preliminary study. In: Mills, D.S., Levine, E., Landsberg, G., Horowitz, D., Duxbury, M., Mertens, P., Meyer, K., Huntley, L.R., Reich, M., Willard, J. (Eds.), Current Issues and Research in Veterinary Medicine: Proceedings of the 5th International Veterinary Behaviour Meeting, Purdue University Press, West Lafayette, In., USA, pp. 181-183.
Overall, K.L., 1997. Clinical behavioural medicine for small animals. Mosby, St. Louis, Mo.
Overall, K.L., 2005. Proceedings of the Dogs Trust Meeting on Advances in Veterinary Behavioural Medicine London, 4th-7th November 2004: Veterinary behavioural medicine: a roadmap for the 21st century. Vet. J. 169, 130-143.
Palestrini, C., Baldoni, M., Riva, J., Verga, M., 2005. Evaluation of the owner's influence on dogs' behavioural and physiological reactions during the clinical examination. In: Mills, D.S., Levine, E., Landsberg, G., Horowitz, D., Duxbury, M., Mertens, P., Meyer, K., Huntley, L.R., Reich, M., Willard, J. (Eds.), Current Issues and Research in Veterinary Medicine: Proceedings of the 5th International Veterinary Behaviour Meeting, Purdue University Press, West Lafayette, In., pp. 277-279.
Pfaffenberger, J.C., Scott, J.P., Fuller, J.L., Ginsburg, B.E., Biefelt, A.W., 1976. Guide Dogs for the Blind: their selection, development and training. Elsevier, Amsterdam, the Netherlands.
Planta, D., 2001. Testing dogs for aggressive biting behaviour: The MAGtest (Sociable acceptable behaviour test) as an alternative for the aggression test. In: Proceedings 3rd International Congress of Veterinary Behavioural Medicine, Vancouver, Canada, pp. 142-144.
Rappolt, G.A., John, J., Thompson, N.S., 1979. Canine response to familiar and unfamiliar humans. Aggress. Behav. 5, 155-161.
Rooney, N.J., Bradshaw, J.W.S., Robinson, I.H., 2000. A comparison of dog-dog and dog-human play behaviour. Appl. Anim. Behav. Sci. 66, 235-248.
Ruefenacht, S., Gebhardt-Heinrich, S., Miyake, T., Gaillard, C., 2002. A behaviour test on German Shepherd dogs: heritability of seven different traits. Appl. Anim. Behav. Sci. 79, 113-132.
Salman, M.D., New, J.G., Scarlett, J.M., Kass, P.H., Ruch-Gallie, R., Hetts, S., 1998. Human and animal factors related to the relinquishment of dogs and cats in 12 selected animal shelters in the United States. J. Appl. Anim. Welf. Sci. 1, 207-226.
Schaffer, C.B., Phillips, J., 1994. The Tuskagee behaviour test for selecting therapy dogs. Appl. Anim. Behav. Sci. 39, 192.
Serpell, J.A., Hsu, Y., 2001. Development and validation of a novel method for evaluating behaviour and temperament in guide dogs. Appl. Anim. Behav. Sci. 72, 347-364.
Sheppard, G., Mills, D.S., 2002. The development of a psychometric scale for the evaluation of the emotional predispositions of pet dogs. Int. J. Comp. Psychol. 15, 201-222.
Sheppard, G., Mills, D.S., 2003. The validation of scales designed to measure positive and negative activation in dogs. In: Proceedings of the 4th International Veterinary Behaviour Meeting, Caloundra, Australia, pp. 37-45.
Siwak, C.T., Tapp, P.D., Milgram, N.W., 2001. Effect of age and level of cognitive function on spontaneous and exploratory behaviours in the beagle dog. Learn. Mem. 8, 317-325.
Slabbert, J.M., Odendaal, J.S.J., 1999. Early prediction of adult police dog efficiency - a longitudinal study. Appl. Anim. Behav. Sci. 64, 269-288.
Sternberg, S., 2002. Great dog adoptions: a guide for shelters. Latham Foundation for the Promotion of Humane Education, USA.
Stur, I., 1987. Genetic aspects of temperament and behaviour in dogs. J. Sm. Anim. Pract. 28, 957-964.
Svartberg, K., 2002. Shyness-boldness predicts performance in working dogs. Appl. Anim. Behav. Sci. 79, 157-174.
Svartberg, K., 2005. A comparison of behaviour in test and in everyday life: evidence of three consistent boldness-related personality traits in dogs. Appl. Anim. Behav. Sci. 91, 103-128.
Svartberg, K., Forkman, B., 2002. Personality traits in the domestic dog (Canis familiaris). Appl. Anim. Behav. Sci. 79, 133-155.
Svartberg, K., Tapper, I., Temrin, H., Radesater, T., Thorman, S., 2005. Consistency of personality traits in dogs. Anim. Behav. 69, 283-291.
Van den Berg, L., Schilder, M.B., Knol, B.W., 2003. Behaviour genetics of canine aggression: behavioural phenotyping of golden retrievers by means of an aggression test. Behav. Genet. 33, 469-483.
van der Borg, J.A.M., Netto, W.J., Planta, D.J.U., 1991. Behavioural testing dogs in animal shelters to predict problem behaviour. Appl. Anim. Behav. Sci. 32, 237-251.
Vas, J., Topal, J., Gacsi, M., Miklosi, A., Csayni, V., 2005. A friend or an enemy? Dogs' reaction to an unfamiliar person showing behavioural cues of threat and friendliness at different times. Appl. Anim. Behav. Sci. 94, 99-115.
Weiss, E., 2002. Selecting shelter dogs for service dog training. J. Appl. Anim. Welf. Sci. 5, 43-62.
Weiss, E., Greenberg, G., 1997. Service dog selection tests: effectiveness for dogs from animal shelters. Appl. Anim. Behav. Sci. 53, 297-308.
Wells, D.L., Hepper, P.G., 1999. Male and female dogs respond differently to men and women. Appl. Anim. Behav. Sci. 61, 341-349.
Wemelsfelder, F., Farish, M., 2004. Qualitative categories for the interpretation of sheep welfare: a review. Anim. Welf. 13, 261-268.
Wickens, S.M., Astell-Billings, I., McPherson, J.A., Gibb, R., Bradshaw, J.W.S., McBride, E.A., 1995. The behavioural assessment of dogs in animal shelters: inter-observer reliability and data redundancy. In: Proceedings of the 29th International Congress of the International Society for Applied Ethology, Potters Bar, UFAW, UK, pp. 127-128.
Willis, M.B., 1995. Genetic aspects of dog behaviour with particular reference to working ability. In: Serpell, J. (ed.), The domestic dog: its evolution, behaviour and interactions with people. Cambridge University Press, Cambridge, UK, pp. 51-64.
Wilsson, E., Sundgren, P.E., 1997. The use of a behaviour test for the selection of dogs for service and breeding I: Method of testing and evaluating test results in the adult dog, demands on different kinds of service dogs, sex and breed differences. Appl. Anim. Behav. Sci. 53, 279-295.
Wilsson, E., Sundgren, P., 1998. Behaviour test for eight-week old puppies - heritabilities of tested behaviour traits and its correspondence to later behaviour. Appl. Anim. Behav. Sci. 58, 151-162.
Wiseman-Orr, M.L., Nolan, A.M., Reid, J., Scott, E.M., 2004. Development of a questionnaire to measure the effects of chronic pain on health-related quality of life in dogs. Am. J. Vet. Res. 65, 1077-1084.

Taylor and Mills

Appendix A

Dog Temperament Tests


Term: Definition

Temperament: Individual, consistent behavioral tendencies.

Reliability: The consistency of reports.

Intra-observer reliability: The consistency of the reports of a single observer reporting on the behavior of a dog during the same test. This consistency is usually assessed by comparing the same observer's reports from a recording of the same test viewed on 2 separate occasions.

Inter-observer reliability: The consistency of the reports of two or more observers reporting on the behavior of a dog during the same test. This consistency is usually assessed by comparing the reports between observers who have viewed a recording of the same test.

Test-retest reliability: The consistency of (reports of) behavior of the same dog during the same test conducted on another occasion.

Internal consistency: The correlation between items within each factor, where factors are derivations of behavioral observations obtained by data reduction techniques.

Validity: The accuracy of reports.

Content validity: The extent to which items within a behavioral factor (group of observations) measure the broad construct (i.e., temperament trait) that they were designed to measure, "on the face of it."

Construct validity: The extent to which items within a behavioral factor (group of observations) measure the broad construct (i.e., temperament trait) that they were designed to measure, as assessed by their correlation with other similar and dissimilar factors (convergent and divergent validity).

Criterion validity: The association between the reports of the dog's behavior and an external criterion. In the absence of a gold standard, concurrent validity may be a more appropriate term. Concurrent validity is the agreement between the report of the dog's behavior and another, more established measure that is supposed to be measuring the same construct, usually taken at the same time. Predictive validity is the extent to which the report of the dog's behavior during the test correlates with a different assessment of its behavior in another, similar context in the future.
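As an illustration of how intra- or inter-observer reliability might be quantified, the following minimal Python sketch computes Cohen's kappa for two sets of categorical scores of the same recorded tests. The appendix does not prescribe a statistic, so the choice of kappa is an assumption, and the function name and the example scores are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two sets of categorical scores.

    The two lists may come from two observers (inter-observer) or from one
    observer viewing the same recordings twice (intra-observer).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Proportion of dogs on which the two reports agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each report's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reports used a single identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores for six dogs, rated from the same video recordings.
observer_1 = ["fearful", "calm", "calm", "fearful", "calm", "calm"]
observer_2 = ["fearful", "calm", "fearful", "fearful", "calm", "calm"]
kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f}")
```

Values near 1 indicate close agreement between the two reports, and values near 0 indicate agreement no better than chance.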

Appendix B

Key considerations for each stage of test development

Purpose of test: What is the test going to be used for? Which behaviors are you particularly interested in, with a view to selecting for or against? Identify from the literature which tests might be useful for this purpose. Perform preliminary tests on a sample of dogs to identify the most practical and sensitive tests.

Standardization: The procedure for each subtest within the larger test protocol needs to be formalized so that it can be consistently repeated by others. In order to reduce external variation, elements within each subtest need to be standardized, including:
location of test (specifically, familiarity of location to the dog; whether it is a cage, room, or open area; furnished or unfurnished);
timing of the test (in relation to time of day, oestrus cycle (?), feeding);
stimuli to be presented (size and type of objects, treats, same sex/age/type of dogs/people to be presented);
length of time for which the dog is exposed to the stimuli;
length of time for which the dog is observed, including the time observation starts;
whether the first, last, or overall reaction is recorded.

Reliability:

Intra-observer reliability: The consistency of the reports of a single observer reporting on the behavior of a dog during the same test. This reliability is usually assessed by comparing the same observer's reports from a video recording of the same test viewed on 2 separate occasions.

Inter-observer reliability: The consistency of the reports of two or more observers reporting on the behavior of a dog during the same test. This reliability is usually assessed by comparing the reports between observers who have viewed a video recording of the same test.

Test-retest reliability: The consistency of (reports of) behavior of the same dog during the same test conducted on another occasion. Assessment of this form of reliability is key to the concept of temperament. The test should be repeated on the same dog on another occasion in order to identify those behaviors that are stable over time.

Internal consistency: The correlation between items within each factor, where factors are derivations of behavioral observations obtained by data reduction techniques. The correlation between items within factors can be assessed using Cronbach alpha. Typically, values between 0.15 and 0.50 should be sought in order to maximize unidimensionality (above 0.15) and avoid data redundancy (below 0.50).
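The internal consistency check described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the example scores (five dogs rated on three items of one hypothetical factor) are assumptions and are not taken from the paper. The sketch reports Cronbach alpha and, alongside it, the mean inter-item correlation, since both summarize how strongly the items within a factor correlate.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach alpha for one factor.

    scores: one row per dog, one column per item within the factor.
    """
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # transpose to one tuple per item
    item_variance = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance / total_variance)


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def mean_inter_item_correlation(scores):
    """Average Pearson correlation over all pairs of items."""
    items = list(zip(*scores))
    return mean(pearson(a, b) for a, b in combinations(items, 2))


# Hypothetical ratings: 5 dogs (rows) x 3 items (columns) of one factor.
factor_scores = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]
alpha = cronbach_alpha(factor_scores)
inter_item = mean_inter_item_correlation(factor_scores)
print(f"alpha = {alpha:.2f}, mean inter-item r = {inter_item:.2f}")
```

Items that are near duplicates of one another push both statistics toward 1, which is the data redundancy that the upper bound is meant to flag.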



Validity:

Content validity: The extent to which items within a behavioral observation or factor measure the temperament trait they were intended to measure, "on the face of it." This may involve subjective assessment of the components (behaviors) that make up each behavioral category or trait (e.g., fearfulness) by the researchers or an "expert panel."

Construct validity: The extent to which factors expected to assess one temperament trait (as assessed above) correlate with other factors. Convergent validity is when factors that theoretically should be related to each other correlate. Discriminant validity is when factors that theoretically are not related to each other do not correlate.

Concurrent validity: The extent to which an association between the reports of the dog's behavior (during the test) and another measure can be demonstrated.

Predictive validity: The extent to which the report of the dog's behavior during the test correlates with a different assessment of its behavior in another, similar context in the future. Typically, the comparative measures have included the report of the owner or another expert via questionnaire, although direct observation of the dog in its new setting may be another.

Feasibility: Is the test of a reasonable length to gain information but not pose a burden on the dog or the organization? Is the test easy to perform and standardize? Is it easy to make a decision about the dog based on the test results? The process of refinement may require reduction of the number of subtests and/or observations, based on assessment of the reliability of these items between testers from different backgrounds. If the test is changed by refinement, it should be re-evaluated for content, construct, and predictive validity in its new setting.
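The correlations mentioned under construct, concurrent, and predictive validity could be computed in many ways. As one hedged illustration, the Python sketch below uses Spearman's rank correlation, which suits ordinal behavior scores. The choice of statistic is an assumption, as are the function names and the example data (test scores for six dogs and later owner questionnaire scores).

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: fearfulness score in the test, and the owner's
# questionnaire score for the same six dogs some months after adoption.
test_scores = [2, 5, 3, 1, 4, 4]
owner_scores = [1, 4, 3, 2, 5, 4]
rho = spearman_rho(test_scores, owner_scores)
print(f"Spearman rho = {rho:.2f}")
```

The same function could be applied to two factor scores from a single test session (convergent or discriminant validity) or to scores from two test occasions (test-retest reliability).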


