

Abstract

Over the last two decades, research has suggested that test performances and test scores are collaboratively achieved through interviewing/scoring processes, and that some test-takers may find themselves in unfair situations created by the paired interviewer. Most of these studies, however, have employed only holistic scores, and little is known about which analytic categories (e.g. pronunciation, grammar, fluency) are vulnerable to which sorts of interviewer behaviour. It therefore seems worthwhile to make explicit which aspects of interviewer variation have an impact on which analytic categories, so that such information can inform interviewer training by increasing interviewers' understanding of the influence of their performance on that of the candidate, and on the perceptions of the second or third raters who later rate audio- or video-recorded tapes. The present research, using an analytic scale, investigates the variability of interviewer behaviour, its influence on a candidate's performance, and the consequent perceptions raters form of the candidate's ability. The data were collected from two interview sessions involving the same candidate with two different interviewers, and the video-taped interviews were rated by 22 raters on five marking categories. The results show that significantly different scores were awarded for `pronunciation' and `fluency' in the two interviews. The reasons for the differences are discussed on the basis of Conversation Analysis (CA) findings and raters' commentaries. The paper concludes with some suggestions on how the potential unfairness caused by interviewer variability could be addressed.

1. Introduction

Over the last two decades, a number of studies have analysed the discourse of various speaking test formats, as research into the process of the test has come to be recognised as valuable for designing, describing, and, most importantly, validating oral proficiency tests (e.g. Young & He 1998, Lazaraton 2002).
Accordingly, more attention has been drawn to interlocutor behaviour in oral interview tests, and the variability of interviewers' behaviour



has been focused on as a potential source of unfairness. Some of these studies have described how interviewers' behaviour can vary, especially in the degree of speech accommodation, and its possible influence on candidates' production; others have also involved rating and investigated how scores can be affected by such interactional differences. Firstly, a variety of speech accommodation strategies which interviewers practise towards interviewees has been identified, such as slowing down speech, rephrasing questions and simplifying lexis. Such interviewer accommodation is regarded as a parallel phenomenon to `foreigner talk' discourse, in which native speakers accommodate their speech to non-natives in order to facilitate mutual understanding. Since one of the claimed advantages of oral interview tests over automated speaking tests using computers or telephones is that they allow non-native test-takers to `interact in an authentic communicative event ... extemporaneously' (Ross 1996: 34)1, the various supportive foreigner-talk practices that interviewers provide impromptu appear to be positive, supporting the claim that these interviews can, to some extent, tap features of `real-life conversation'. However, it has been pointed out that if interviewers accommodate candidates inconsistently, questions arise about the influence on candidate language use. For example, Lazaraton (1996) demonstrated that original complex questions might be reformulated as simple yes-no questions, or question prompts restated as statements which merely require the candidate's confirmation, and Ross & Berwick (1992) demonstrated that candidates at certain levels were likely to be over-accommodated.
Thus, if interviewers use more such foreigner talk than candidates need, based simply on an assumption about their proficiency, they may fail to push an interviewee's oral performance to its limits, resulting in possible unfairness: some candidates' best performance may not be elicited, or their lack of proficiency may not be revealed. Secondly, interactional differences have also been studied together with their impact on rating scores. These studies examined the impact on scores in cases where the rater and the interviewer are separate people2, and some interlocutors appear likely to give raters a better impression of candidates' performance than others. Brown & Hill (1997), using Rasch analysis, discovered a continuum of `interviewer difficulty'3 among a group of interviewers, and the difference between `the easiest


1 However, there is now a general consensus that, while oral interview interaction can tap some features of non-test conversation, it is essentially one specific case of asymmetrical, institutional interaction, which elicits only particular types of test-taker language functions (e.g. Simpson 2006). Additionally, it is important to note that not all of the similarities to natural conversation found in interview tests are argued to be positive features.
2 They differentiate an interviewer from a rater, as can happen, for example, when second or third raters mark candidates' performance while watching video- or audio-tapes.
3 The term `interviewer difficulty' is analogous to the better-known notion of `task difficulty' (Brown & Hill 1997, Brown 2003).

interlocutor' and `the most difficult interlocutor' among them amounted to a difference of 0.6 of a band on the IELTS speaking scale. They also found that the easier interviewer tended to shift topics more frequently and ask simpler questions, whereas the more difficult interviewer tended to ask more challenging questions and employ a wider range of interactional moves, such as interrupting and disagreeing. With the same data set, Brown (2003) used Conversation Analysis to examine why biased rating occurred, and found that, whilst `the easy interlocutor' made the candidate appear an effective communicator through her scaffolding, explicit questioning, smooth topic extension and frequent positive feedback, `the difficult interlocutor' made the candidate appear a poor communicator by confusing the test-taker with frequent topic shifts and by using ambiguous closed questions to elicit extended responses. From a slightly different perspective, McNamara & Lumley (1997) examined the relationship between ratings and the amount of rapport interviewers established with candidates, and between ratings and interviewer competence (both as judged by raters who listened to the audio-taped interviews). Their results indicated that perceived lack of interviewer rapport and competence led raters to award higher ratings. Concerning this counter-intuitive result, they concluded that `a perception of lack of competence on the part of the interlocutor may have been interpreted as raising an issue of fairness in the mind of the rater, who may then have made a sympathetic compensation to the candidate' (ibid.: 152).

As shown above, whilst the unpredictable nature of test interaction can contribute to test validity by incorporating foreigner talk as in non-test conversation, research has warned that the very characteristic that validates these tests can also be a source of unreliability, associated with a lack of standardisation across interviewers and potential unfairness to candidates. However, in order not to conclude hastily that strictly prescribed interviewer behaviour is preferable to more non-test-like interaction, this study investigates the precise effect of inter-interviewer variability, so that these research findings can usefully be applied to actual testing practice to improve fairness for candidates. In particular, what has not been clear in previous research is how interviewer differences are translated into analytic rating scores. As far as I am aware, analytic scales were utilised only by Shohamy (1983), who examined the impact of interviewer difference but did not find any significant result; the other studies have employed only holistic scales. Thus, the research questions of this study are as follows:
(1) When the same candidate is interviewed by two different interviewers, are there any analytic marking categories which are especially affected by the interviewer difference?



(2) If so, what types of interlocutor behaviour could have influenced the marks for those analytic components?

2. The Study: Method of Data Collection and Analyses

Two interview sessions were conducted with two different interviewers, A and B, and the same candidate, C. Both A and B are experienced teachers of English as a foreign language whose careers have brought them into contact with various non-native speakers of English. They are also experienced interviewers in speaking tests. Interviewer A was formally trained for IELTS and the Trinity College London ESOL test and subsequently went through regular retraining procedures, while B has not received any formal training. Candidate C is a Chinese student who had been studying English for Academic Purposes for two months at the University of Essex when she was interviewed. In order to focus more specifically on interviewer-interviewee discourse, the speaking test, lasting about 12 minutes, employed only a single picture description task to stimulate conversation, as briefly described in Table 1.4

Table 1: Interview Structure

1 Openings (1 minute)
2 Conversation on familiar topics (3 minutes): The interviewer asks the candidate to talk about him/herself.
3 Picture description (2 minutes)5: The interviewer asks the candidate to describe a photo. (Picture 1) A mother trying to cope with her child; (Picture 2) A boy in front of the TV.
4 Conversation on topics arising from the picture (5 minutes): The interviewer asks the candidate questions linked to the picture (from general questions to extended questions).
5 Closings (1 minute)

The interviews were video-taped for rating and transcription purposes. After a short individual discussion with the researcher on how to use the rating scale, the video-tapes were shown to 22 independent raters, who judged the candidate's performance in each of the two sessions. In order to avoid an order effect, A's interview was shown first to half of the raters, and B's interview first to the other half. The raters, D-Z, all have rating experience in speaking tests as well as teaching experience. Since this study aimed at examining


4 Since this study does not investigate how to minimise inter-interviewer variability, but rather seeks to discover possible sources of unfairness caused by it, a precisely prescribed interview framework was not given to the interviewers, although several questions to be asked were provided to them.
5 Pictures 1 and 2 were used in A's session and B's session respectively. The pictures were taken from a CAE practice book (Harrison & Kerr 1999). Two different pictures were employed so that the candidate could not perform better in the second session merely through a practice effect.



how the impact of inter-interviewer variability is realised on analytic scales, an analytic scale with five general marking categories was provided: pronunciation, grammar, vocabulary resources, fluency and interactive communication. The rating scale utilised is a modified version of a previous First Certificate in English (FCE) analytic rating scale by the University of Cambridge Local Examinations Syndicate (UCLES). Each category consists of four levels (rather than the original six) because it was considered that the criteria should be easily deployable by raters given the limited rater training in this study. In addition to providing scores on each category, raters were asked to summarise their reasons for awarding those scores, so that these retrospective verbal reports could help uncover any relationship between interviewer behaviour, the candidate's performance and their ratings. The rating data were first analysed quantitatively to see whether judges systematically gave different scores on certain analytic categories in one of the two sessions (for Research Question 1). Due to the limited rater training, inter-rater reliability in an absolute sense was low, showing only .6628 agreement among the 22 raters (average-measure intraclass correlation, SPSS ver. 11.0), whilst inter-rater reliability in a relative sense was at an acceptable level, .7701 (Cronbach's alpha, SPSS ver. 11.0). This can be interpreted as indicating that, although the raters differed in harshness, the scores awarded by the 22 raters fluctuated in a relatively correlated way. Secondly, to discover the interactional characteristics of the two interviewers, the video-taped interview sessions were transcribed employing CA conventions (Atkinson & Heritage 1984).6
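The distinction between the two reliability indices reported above (an average-measure intraclass correlation for absolute agreement, and Cronbach's alpha for relative consistency) can be sketched from a performances-by-raters score matrix. The following is a minimal illustration; the function names and the 6x3 score matrix are my own hypothetical constructions, not the study's data or its SPSS procedure.

```python
from statistics import mean, variance

def anova_ms(X):
    """Mean squares from a two-way layout: rows = rated performances, columns = raters."""
    n, k = len(X), len(X[0])
    grand = mean(x for row in X for x in row)
    ss_rows = k * sum((mean(row) - grand) ** 2 for row in X)
    ss_cols = n * sum((mean(col) - grand) ** 2 for col in zip(*X))
    ss_total = sum((x - grand) ** 2 for row in X for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    return ss_rows / (n - 1), ss_cols / (k - 1), ss_err / ((n - 1) * (k - 1))

def icc_agreement_k(X):
    """Average-measure ICC for absolute agreement, ICC(2,k): rater harshness counts against it."""
    n = len(X)
    msr, msc, mse = anova_ms(X)
    return (msr - mse) / (msr + (msc - mse) / n)

def cronbach_alpha(X):
    """Cronbach's alpha: relative consistency of the k raters (equivalent to ICC(3,k))."""
    k = len(X[0])
    col_var = sum(variance(col) for col in zip(*X))
    total_var = variance([sum(row) for row in X])
    return k / (k - 1) * (1 - col_var / total_var)

# Hypothetical ratings: 6 performances x 3 raters; rater 3 is systematically harsher.
scores = [[2, 2, 1], [3, 3, 2], [2, 3, 1], [4, 4, 3], [3, 2, 2], [4, 3, 3]]
print(round(icc_agreement_k(scores), 4))  # absolute agreement, pulled down by harshness
print(round(cronbach_alpha(scores), 4))   # relative consistency, unaffected by harshness
```

Because a systematically harsh rater lowers absolute agreement but not relative consistency, alpha can be acceptable while agreement-based ICC is low, which is the pattern reported for the 22 raters here.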
Although CA was developed to analyse mundane conversation, or non-test interaction, it is now widely applied to institutional interaction and succeeds in capturing pictures of context-specific interaction (Drew & Heritage 1992). Following CA convention, the analysis here also treats repeated listening during the production of a transcript as an important part of the analysis for discovery (Levinson 1983). The CA findings, along with the commentaries of the 22 raters, are used to explore the features of interviewer behaviour which might have caused the different ratings (for Research Question 2).

3. Results

3.1. Results of Quantitative Analysis: Effect on Rating


6 Transcription notation is as follows (Atkinson & Heritage 1984):
(1) Gaps and pauses: periods of silence; a micro-pause is shown as (.), and longer pauses appear as a time within parentheses
(2) Colon (:): a lengthened sound
(3) Dash (-): a cut-off
(4) .hhh: inhalation
(5) hhh: exhalation
(6) hah, huh: laughter
(7) (h): breathiness within a word
(8) Punctuation: intonation rather than clausal structure
(9) Equals sign (=): a latched utterance
(10) Brackets ([ ]): overlapping talk
(11) Arrow: a feature of interest to the analyst
(12) Empty parentheses ( ): words within parentheses are doubtful or uncertain
(13) Double parentheses (( )): non-vocal action
(14) > <: the talk speeds up
(15) < >: the talk slows down
(16) Underlining: a sound is emphasised



As shown in Table 2 and Figure 1,7 the mean scores of all analytic categories are higher in B's interview except `Vocabulary resources', and the tendency is most clearly observed in `Pronunciation' and `Fluency'.

Table 2: Rating Result

Analytic category           Interviewer   Mean   S.D.   (N=22)
Pronunciation               A             1.77   .69
                            B             2.00   .69
Grammar                     A             1.41   .59
                            B             1.45   .50
Vocabulary resources        A             1.73   .70
                            B             1.64   .66
Fluency                     A             1.64   .66
                            B             1.91   .53
Interactive communication   A             2.00   .87
                            B             2.05   .72

Figure 1: Rating Result

Paired-sample t-tests (SPSS ver. 11.0) indicate that, when the candidate was interviewed by interviewer B, she obtained significantly higher scores (p<.05) in `Pronunciation' (mean difference: -.2273; S.D.: .4289; t(21) = -2.485; p = .021) and `Fluency' (mean difference: -.2727; S.D.: .5505; t(21) = -2.324; p = .030), as shown in Table 3.8

Table 3: Paired-Sample T-tests

Pair                  Mean difference   S.D.    t        df   Sig. (two-tailed)
PRON_A - PRON_B       -.2273            .4289   -2.485   21   .021
GRAM_A - GRAM_B       -.0455            .3751   -.568    21   .576
VOCAB_A - VOCAB_B     .0909             .5263   .810     21   .427
FLU_A - FLU_B         -.2727            .5505   -2.324   21   .030
INTER_A - INTER_B     .0455             .7854   -.271    21   .789
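The paired-sample t statistics in Table 3 follow the standard formula: the mean of the 22 raters' score differences divided by the standard error of those differences. As a minimal sketch with made-up score lists (the variable names and data are illustrative, not the study's ratings):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-sample t statistic and degrees of freedom for two equal-length score lists."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))  # mean difference over its standard error
    return t, n - 1

# Hypothetical scores the same raters gave for session A and session B.
scores_a = [2, 1, 2, 2, 1, 2, 3, 2]
scores_b = [2, 2, 2, 3, 2, 2, 3, 3]
t, df = paired_t(scores_a, scores_b)
print(t, df)  # a negative t indicates higher scores in session B
```

Using each rater's own difference, rather than comparing the two groups of scores independently, removes between-rater harshness from the comparison, which is why a paired design suits repeated ratings by the same judges.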

3.2. Results of Qualitative Analysis: Interviewer Variability

Following Brown (2003), the analysis of the nature of the interaction is reported in terms



7 The error bars plot the 95% confidence intervals for the means.
8 One may think that the obtained mean differences for `Pronunciation' and `Fluency' are too small to be discussed. However, considering that each rating category consists of only four levels, which may not be sensitive enough to the differences, and that 5 out of 22 raters and 7 out of 22 raters gave better scores for `Pronunciation' and `Fluency' respectively in B's session, it seems plausible that it was not by chance that the raters perceived better performance by the candidate in the `Pronunciation' and `Fluency' categories in B's session.



of three phases of interaction: 1) questioning and topic nomination techniques, 2) topic expansion and management techniques, and 3) receipt tokens and feedback techniques. The interview by interviewer A is analysed first. Interviewer A seems to have a typical approach to questioning, especially when the question nominates a new topic. The following excerpt (1) is taken from the initial part of her interview.

(1) Interviewer A (I: interviewer, C: candidate)
1  I: And (.) an' Why did you choose to come to Essex?=
2     =>Why did you want to come to Essex to study?<
3  C: Ah:: Because I: (.) I will take the Master course eh:: to study, eh my major will
4     be (.) eh: (.8) economics or .hh international relation, so I think eh I choose this
5     university because the: ga::ment department and the economics department is
6     very .hh (.5) eh:: rentaful. So I choose this University?=
7  I: =Right, What are you going to do when you finish your studies?
8     Will you go back to Beijing?
9  C: Yah, of course. (.5) Haha:::=
10 I: =N' you want to be a manager? Or have your own company?

In line 1, she asks a question about C's reason for choosing Essex University and, before C responds, immediately rephrases the initial question in a `latched' (=) and speeded-up (><) fashion in line 2. After the candidate answers, A develops the topic by asking another question about her future plans in line 7, and this question is also followed by an easier question which projects the candidate's possible answer. Facilitated by the second question, C immediately answers `Yah, of course. (.5) Haha:::', but she fails to deal with the first question. Consequently, the interviewer returns to the unsuccessful question, not with the same wh-question but by asking for confirmation of two possible answers in line 10. This approach to questioning, in which she frequently rephrases her initial questions within her own turn, is typical of interviewer A.
Since she rephrases the initial production before the candidate's turn, and the candidate shows no facial expression of puzzlement during the initial questioning, the motive for rephrasing is not likely to be the candidate's misunderstanding, but more likely the interviewer's anticipation of being misunderstood. Once a topic has been introduced, A tends to expand it systematically, eliciting the candidate's response through various methods. For instance, she asks for more of C's opinions (see (2)), requests reasons for her previous answer (see (3)) and asks for examples (see (4)).



(2) Interviewer: A (I: interviewer, C: candidate)
1 C: Yeh, freedom .hh freedom for child n: (.8) if the child is crying, OK uh if
2    crying OK finish (.5) ah:: will be goo(h)d
3 I: Right. Do you agree with that? Or do you ( )

(3) Interviewer: A (I: interviewer, C: candidate)
1 C: =but I I think she uh:: doesn't ca(h)re .hh the ki(h)ds cry(hah)ing
2    (.5)
3 I: A(h)ll right. Hah hah ha What makes you say that?

(4) Interviewer: A (I: interviewer, C: candidate)
1 I: =So so in your idea or your point of view, what makes a good mother?
2    (.5)
3 C: Em:: (.) manage em something they should manage [and ha huh
4 I:                                                  [Right
5    (.5)
6 I: For example?

Concerning feedback, A rarely comments on the information given by the candidate. Instead, as in (1)-(5), she consistently employs an identical one-word token, `right', throughout the whole interview. Some CA research, such as Jefferson (1984), shows that `right' can be a device to bring current talk to a close and move to the next topic, although on other occasions it can also function as a continuer indicating `go ahead'. Therefore, whether consciously or unconsciously, A's frequent use of `right' may have presented her as being more in charge of controlling topics in this interaction. A also tends to replace possible verbalised receipt tokens with non-verbal behaviour, such as nodding, eye contact and smiling, during the candidate's in-progress story.

(5) Interviewer: A (I: interviewer, C: candidate)
1 C: uh:: Maybe this another person? maybe it's grandmothe-, grandma?
2 I: (1.0) ((nodding))
3 C: And:: ah: (.8) but I think the room is a little mess
4    ha[ha (.5) lot of toys lot of books and something:: (.5) a lot of things in the floor.
5 I:   [((smiling))
6    (1.0)
7 I: Right.

Following the description of A's interview, the analysis of B's interview is now presented. In contrast with A, B rarely rephrases his initial questions. However, he too seems to have a typical questioning method.
He frequently produces statements which are actually implicit question prompts rather than explicitly prompting the test-taker to produce



language. For example, in (6) below, after C's answer that TV is a beneficial source for learning English, B hypothesises that she may not care much about the programme itself, and states this with a falling intonation (.) in line 3, which is subsequently confirmed and elaborated by the candidate.

(6) Interviewer: B (I: interviewer, C: candidate)
1 C: =Uh: hh but em: for m(h)e, I think watching TV is good fo(h)r me(h) huh
2    for my Engli(h)sh. Hah ha=
3 I: =Oh OK Just for Eng[lish so doesn't matter what programme.
4 C:                    [just learn hahha
5 C: Yah. Do(h)esn't matter what. Ne(h)ws, advertisements, and something else,
6    I watch haha all all can improve my English,

A similar technique is also employed to develop topics. In order to maintain a topic, rather than asking a related question to expand it, he often `formulates' (Heritage 1985), or re-presents, what C has said, and tends to wait for the natural development of the interaction, as in (7). In this excerpt, the candidate is talking about her reaction to the room she has just described in a picture.

(7) Interviewer: B (I: interviewer, C: candidate)
1 C: =No, hah ha I don't like this roo(hh)m hah[ha
2 I:                                           [You do(h)n't like i(h)t?
3    [hu so not your style=
4 C: [hah hah =Yah hah[ha
5 I:                  [Ri(h)ght=
6 C: =I like more fashion, .hh more eh light room you know (.) I like I like eh:
7    some gla:ss, some something else. That that is not like this room hahha

In an assessment situation, using statements as implicit prompts, as in (6), and formulation, as in (7), are not always considered desirable, since these techniques only tacitly invite the candidate to continue talking and may draw a conclusion for the candidate, thus depriving them of the chance to do so themselves (Lazaraton 1996).
However, as formulation tends to be preferred, for instance in news interviews, as a way of preserving the interviewee's prior statement as a topic of further talk (Heritage 1985: 106), and as this candidate successfully interpreted B's implicit demands for more information and often elaborated her responses, as shown above, these techniques functioned rather effectively in the given interview. Additionally, as in (8) below, interviewer B tends to shift topics when the first questioning seems unsuccessful in terms of the possibility of natural topic expansion. This can be contrasted with



interviewer A, who returns to unsuccessful questions, as shown in (1), and systematically expands the topic through the various strategies seen in (2)-(4) above.

(8) Interviewer: B (I: interviewer, C: candidate)
1 I: =Do you have favourite artists or=
2 C: =Em:: (.5) hah hah my my favourite artist also is Chine(h)se. hah hah huh
3 I: Oh, Chinese.= =.hh What about mu- movies?
4 C: =Yeh=
5 I: Which [kind of movies do you like?
6 C:       [movie?

As for feedback, B frequently provides feedback comments, particularly positive evaluation, as in (9).9 Giving evaluation may sometimes be avoided in testing, since it may `mislead some candidates to believe that they are doing better than [they] actually are in the assessment; it also may impact on outcome ratings' (Lazaraton 1996: 161). Nevertheless, this type of positive feedback is regularly observed in classrooms to encourage learners (Ur 1996: 242), and may consequently sound like the natural native/non-native interaction observed here.

(9) Interviewer: B (I: interviewer, C: candidate)
1 I: So, parents should [be responsible.
2 C:                    [Yah yah yah Yes.
3 I: Oh OK Yah very good. yah, very very good. OK

As another method of feedback, B frequently echoes what C has uttered, as in (10). This receipt design may usefully display his involvement in her responses and help establish intersubjective understanding.

(10) Interviewer: B (I: interviewer, C: candidate)
1 I: Uh so you watch videos.
2 C: Um: Just (.) watching TV hah hah ha
3 I: Just TV

In sum, while A is explicit in her questioning and systematic in her topic development, B does so more implicitly. Moreover, while A's talk seems more teacher-like, with


9 One may wonder whether giving `positive' feedback was one of the factors leading raters to give higher marks in B's interview. However, if that were the case, the perceived positiveness in communication should have resulted in higher marks for `interactive communication' rather than for `pronunciation' and `fluency', as found in this study.



supportive behaviours and control of the topics the candidate must deal with, B's behaviour could be seen as more non-test-like, with his lower level of control over the interaction. Whilst A gave a minimal amount of verbal response tokens (`right' or non-verbal tokens), B was characterised by feedback such as evaluative comments and echoing.

3.3. Discussion

In this section, based on the CA findings described above, two possible reasons why different scores were awarded for the `Pronunciation' and `Fluency' components are discussed.10 Firstly, A's interview was more controlled, through her systematic questioning and topic development, than B's. Together with her considerable supportive behaviour, her guidance was more explicit than B's, and the topics the candidate had to talk about were clearly defined at every stage. Consequently, in order to deal with each clearly specified topic, the candidate may have been required to use unfamiliar vocabulary whose pronunciation she was not sure of. Some raters actually awarded better scores for the `Vocabulary' component in A's session, probably because A pushed the candidate to the limits of her vocabulary resources. On the other hand, as B exercised less control over the direction of the interview through his implicit questioning, and shifted topics when the candidate seemed to have difficulty expanding them, the candidate may have been able to avoid lexis whose pronunciation she did not fully know and might get wrong. Similarly, the candidate may have spoken more fluently in B's less directed interview, where she could talk about whatever she wanted and did not need to discuss any dispreferred topics in depth.11
This can be explained in terms of `avoidance strategies', more precisely `formal reduction strategies' (Faerch & Kasper 1984), which are `motivated by the language user's desire to use language correctly, i.e., to avoid errors, or fluently, i.e., to avoid rules and items which cannot be easily retrieved and smoothly articulated' (ibid.: 48). Secondly, the difference in the types and amount of feedback may have affected raters' perceptions of the candidate's fluency. Interviewer A was characterised as giving minimal comments (mostly `right') on the answers provided by the candidate and as replacing

10 It could also be argued that interviewer gender and age, or the different prompts used for the picture description task, could be factors. However, since these factors are beyond the scope of this study, the discussion here focuses only on the interactional differences between the two interviewers.
11 However, it is interesting that the candidate later commented that she felt more comfortable with A because A guided her talk very well. This can be interpreted as showing that the candidate had certain expectations about the role of interviewers in oral assessment, and that interviewer B, whose interaction was more conversation-like, did not meet her expectations, even though he let her perform better.




possible verbal receipt tokens with rich non-verbal receipt tokens (nodding, eye contact and smiling) during the candidate's in-progress utterances. This could be a result of the formal interviewer training A has experienced. In particular, giving evaluative comments is normally treated cautiously in interviewer training, as it can occasionally be problematic; for example, such comments may mislead candidates into believing that their performance is better than it actually is, and they may also impact on outcome ratings (Lazaraton 1996: 161). On the other hand, B frequently provided feedback not only through usual response tokens such as `uh huh', but particularly through positive evaluative comments and by echoing the candidate's utterances. In this sense, as shown in (5) above, interviewer A's minimal amount of feedback may have increased the amount of silence, which gave the raters the impression that the candidate was hesitant. This feature of A echoes Fulcher's (1996) account of why language produced in oral testing situations attracts comments about its being unnatural interaction: `the interviewer in the oral test appears to be highly sensitive to the possibility that the student needs time to plan what is going to be said next, and therefore the amount of overlapping speech may be much less than in less formal interaction' (ibid.: 217). Therefore, being sensitive not to interrupt the candidate's production, interviewer A may have failed to fill gaps which would normally be filled in mundane conversation, and this could have caused the raters to perceive that the candidate `was more hesitant when answering questions in A's interview, while she generally kept the flow of conversation going in B's' (Rater M).

4. Conclusions and Suggestions

Arising from the literature on interviewer behaviour in oral interview tests, this research has explored a more precise picture of the relationship between interlocutor behaviour and its impact on candidate performance and scores.
To summarise, the two interviewers examined here each had their own way of questioning, developing topics and reacting to the candidate's responses. These differences seem to have been translated into the different `Pronunciation' and `Fluency' scores, owing to the co-constructed nature of the assessment process. Although the results cannot be generalised, given the limited data treated here, this study has clearly exemplified a possible relationship between characteristics of interviewer behaviour and the particular components of language ability they affect. Further studies are required to capture a clearer, generalisable picture. It is, however, hoped that the findings of this research will contribute to a better understanding of the influence of interviewer behaviour on candidates' outcomes, and to fairness for candidates.



Lastly, I would like to suggest how the findings of this study, and of future studies along these lines, can be used to ensure that students are treated equally regardless of the interviewer they are paired with. Firstly, each examination board can refer to such research results to define what interlocutor support should or should not be provided in what circumstances, and to what extent interviewer variability can be allowed. Secondly, more emphasis should be given in interviewer training programmes to such empirical data, so that interviewers can become aware of themselves as being more than a mere conduit for question prompts. Although interviewer training, as Brown (2003: 19) points out, `has generally tended to be somewhat overlooked in relation to rater training, with interviewer behaviour rarely being scrutinized once initial training is completed', interviewer behaviour should be recognised as being as significant a factor in the fluctuation of students' scores as rater reliability is. Thirdly, existing rating scales can also be refined so that the role of the interviewer in test interaction is taken into consideration in rater training procedures (Ross 1992, Lazaraton 1996). In this way, if the contents of rater training programmes are organised in correspondence with interviewer guidelines and interviewer training programmes, mutual understanding between interviewers and raters about why particular interviewer behaviour is employed at each moment will stop interviewer behaviour from being a source of random score fluctuation. Rather, it will enhance both the validity and the reliability of oral interview tests by providing candidates with systematic, consistent `non-test-like' interlocutor interaction.

References

Atkinson, J. M. & Heritage, J. (eds.) (1984). Structures of social action. Cambridge: Cambridge University Press.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing 20. 1–25.
Brown, A. & Hill, K. (1997). Interviewer style and candidate performance in the IELTS oral interview. In Woods, S. (ed.), Research reports 1997 (vol. 1). Sydney: ELICOS. 173–191.
Drew, P. & Heritage, J. (1992). Analyzing talk at work: an introduction. In Drew, P. & Heritage, J. (eds.), Talk at work. Cambridge: Cambridge University Press. 3–65.
Faerch, C. & Kasper, G. (1984). Two ways of defining communication strategies. Language Learning 34. 45–63.
Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing 13. 208–238.
Harrison, M. & Kerr, R. (1999). C.A.E. practice tests. Oxford: Oxford University Press.
Heritage, J. (1985). Analyzing news interviews: aspects of the production of talk for an overhearing audience. In van Dijk, T. A. (ed.), Handbook of discourse analysis (vol. 3). London: Academic Press. 95–117.
Jefferson, G. (1984). On stepwise transition from talk about a trouble to inappropriately next-positioned matters. In Atkinson & Heritage (eds.), 191–222.
Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: the case of CASE. Language Testing 13. 151–172.
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge: Cambridge University Press.
Levinson, S. C. (1983). Pragmatics. Cambridge: Cambridge University Press.
McNamara, T. F. & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing 14. 140–156.
Ross, S. (1992). Accommodative questions in oral proficiency interviews. Language Testing 9. 173–186.
Ross, S. (1996). Formulae and inter-interviewer variation in oral proficiency interview discourse. Prospect 11. 3–16.
Ross, S. & Berwick, R. (1992). The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14. 159–176.
Shohamy, E. (1983). The stability of oral proficiency assessment on the oral interview testing procedures. Language Learning 33. 527–540.
Simpson, J. (2006). Differing expectations in the assessment of the speaking skills of ESOL learners. Linguistics and Education 17. 40–55.
Ur, P. (1996). A course in language teaching. Cambridge: Cambridge University Press.
Young, R. & He, A. W. (eds.) (1998). Talking and testing: discourse approaches to the assessment of oral proficiency. Amsterdam/Philadelphia: John Benjamins.
Fumiyo Nakatsuhara Department of Language and Linguistics University of Essex, Wivenhoe Park, Colchester, CO4 3SQ United Kingdom [email protected]