In lesson 4 (Experimental Design) we introduced the problem of potential independent variables: uncontrolled factors that could cause changes in a dependent variable other than the manipulation or intervention. Such factors are known as threats to the internal validity of a study. Internal validity, then, is concerned with the extent to which the design of the study allows us to attribute causality to the intervention and to rule out potential alternative explanations. It is the researcher's task to design the study in such a way as to rule out plausible alternative explanations so that he/she can be sure that it was the manipulation that caused the effect.
A second aspect of study validity is concerned with the extent to which the design of the study allows us to generalise the results to populations other than that from which the sample was drawn, or to similar populations in different settings or at different times. This is known as external or ecological validity. External validity is about how meaningful the results are when applied to the real world.
External validity: To what extent can we generalise the study findings to other people or settings?
In this lesson we will look in detail at the main threats to validity and
at how different study designs do or do not control for them. It is important
to note from the outset that one needs to consider the particular context
of the study in evaluating the potential for the different threats to
validity to be in operation; in some contexts a particular threat may
simply not apply.
Threats to validity can be loosely categorised into one of three types: threats relating to the passage of time, threats relating to selection of participants, and threats relating to testing and manipulations. The latter category includes reactive arrangements, a diverse set of threats which will be dealt with separately.
Threats relating to the passage of time are a potential problem whenever a repeated measures design is employed. There are five main threats in this category: maturation, history, mortality, instrumentation and statistical regression to the mean.
The reading ability of a class of children will be assessed, they will undergo the reading programme for a period of time, and then their reading ability will be assessed again. What is wrong with this design?
Realising his or her error, the researcher decides to employ a randomised pretest-posttest control group design:
Children are randomly assigned to one of two groups. One group receives the reading programme whilst the other receives a 'standard' teaching method. Both groups' reading ability is assessed pre- and posttest. How does this control for maturational effects?
Now, suppose the researcher decides to save time and effort and implement a randomised posttest only control group design:
Children are randomly assigned to either receive the new teaching method or not. No pretest of reading ability is given but both groups receive a posttest. As mentioned in the previous lesson, this is not an inherently weaker design than the previous one, although it may appear to be. Why not?
As an example, imagine that a big company decides to promote healthy eating,
reduce smoking and increase physical activity in its workforce. Workers
are given financial incentives for changing these health-related behaviours.
The scheme is evaluated by assessing the dietary, smoking and activity
habits of a randomly selected sample of workers prior to and following
the intervention period. Here we have a single group, pretest-posttest
design again. Suppose that at the same time that the scheme is running,
the local health authority launches a major health promotion initiative
targeting the whole community. This intervention involves providing information
and advice on leading a healthy lifestyle, disseminated through local
supermarkets, leisure centres, the media and so on. Here we have an historical
event that could influence the outcomes of the company's scheme. Any improvement
in health behaviours among the workforce could be due to the wider initiative
they have been exposed to, rather than the financial incentives.
The threat of mortality is concerned with dropouts from a study between pre- and posttest. We have already met this problem in the lesson on samples and sampling. Dropouts may be systematically different from participants who remain in a study. Thus when participants drop out, the nature of the sample changes. In a single group study, this means that a sample which starts out at pretest as at least reasonably representative of the population of interest may not be representative at posttest. For example, suppose a study was conducted to examine the effects of a motivational enhancement treatment for adherence to an exercise programme using a single group, pretest-posttest design. Participants who are less motivated to begin with might drop out, leaving more motivated participants behind. Any apparent increase in exercise may then be because the participants left at the end were highly motivated to adhere to the programme anyway, and so the increase may have nothing to do with the treatment.
However, we have already discussed two reasons why the single group design is weak, and there are more to come. What about a randomised pretest-posttest control group design? Well here, as with the two previous threats, randomisation comes to our rescue. If participants are randomly assigned to groups then they should be equated with respect to motivation and any other factors that might lead to dropout. Thus if there are dropouts, the same sorts of people should drop out from both groups.
However, mortality can still present problems, particularly if the treatment is time-consuming or requires effort on the part of the participants. Suppose that we implemented the study to assess the efficacy of the motivational treatment for exercise adherence using a randomised pretest-posttest control group design and that the intervention involved attending four one-hour long motivational enhancement sessions in addition to a programme of exercise classes which provide the adherence data. The control group does not have to attend any such sessions, just the exercise classes. Although the two groups are equated with respect to motivational factors initially by randomisation, the less motivated participants may drop out of the treatment group because of the time and effort involved. Thus the treatment group shrinks in a way that is not comparable with the control group. A mixture of more and less motivated individuals remains at posttest in the control group but the experimental group only comprises more motivated individuals. Thus any apparent advantage for the treatment at posttest may again simply be due to the motivation of the participants and not the treatment.
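The differential mortality scenario described here can be illustrated with a small simulation. This is only a sketch under invented assumptions: motivation is drawn from a standard normal distribution, class attendance depends on motivation alone (the treatment has no true effect), and anyone in the treatment group with motivation below a hypothetical cutoff of -0.5 drops out. The function name `simulate` and all of the numbers are illustrative, not taken from any real study.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def simulate(n_per_group=2000, dropout_below=-0.5):
    # Each participant is represented by a motivation score (assumed N(0, 1)).
    participants = [rng.gauss(0, 1) for _ in range(2 * n_per_group)]
    rng.shuffle(participants)  # random assignment to the two groups
    treatment = participants[:n_per_group]
    control = participants[n_per_group:]

    def attendance(motivation):
        # Classes attended depend on motivation only; the treatment adds nothing.
        return 20 + 3 * motivation + rng.gauss(0, 2)

    # Differential mortality: only the treatment group loses its
    # less motivated members (the extra sessions are too much effort).
    treatment_completers = [m for m in treatment if m >= dropout_below]

    treat_scores = [attendance(m) for m in treatment_completers]
    control_scores = [attendance(m) for m in control]
    return (
        statistics.mean(treat_scores),
        statistics.mean(control_scores),
        len(treatment_completers),
    )


treat_mean, control_mean, n_completers = simulate()
print(f"Treatment completers: {n_completers}, mean attendance {treat_mean:.1f}")
print(f"Control group mean attendance {control_mean:.1f}")
```

Under these assumptions the treatment group shows higher mean attendance at posttest even though the treatment itself does nothing, purely because of who remained in the group.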
So, mortality is controlled for only if there is not differential mortality between groups (i.e. different sorts of people dropping out from different groups). Or is it? Have a think about what might be the consequences if you did this study and you did get dropouts but not differential mortality. What further problems might this cause?
This is a rather more straightforward threat, usually. It is a problem when the way in which the dependent variable is measured varies from pretest to posttest or between different groups. This may be due to calibration errors in an instrument, the use of different instruments, or experimenters using the instruments in different ways.
In my previous incarnation as a hospital operating theatre technician (bet you didn't know that, did you?) I once tested five electronic blood pressure gauges, all of the same make. They gave five different readings, varying by as much as 10 mmHg! Suppose I used them in a study to assess the effects of relaxation on blood pressure. I use a single group, pretest-posttest design. At pretest I happen to use a gauge that measures 5 mmHg under the correct value and at posttest I use a gauge that measures 5 mmHg over the correct value. Even if in fact the treatment had reduced blood pressure by, say, 5 mmHg, I'd have to conclude that relaxation actually increased blood pressure! The lesson is, make sure you are using properly calibrated instruments.
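The arithmetic in the blood pressure example can be checked with a few lines of Python. The true pressures used here (130 and 125 mmHg) are assumed values chosen only for illustration, and `observed_change` is a hypothetical helper; the gauge errors of -5 and +5 mmHg come from the example.

```python
def observed_change(true_pre, true_post, pre_bias, post_bias):
    """Return the pre-to-post change a researcher would record
    when each gauge adds its own constant error to the true value."""
    measured_pre = true_pre + pre_bias
    measured_post = true_post + post_bias
    return measured_post - measured_pre


# Assumed true values: relaxation genuinely lowers pressure by 5 mmHg.
true_pre, true_post = 130, 125

# Pretest gauge reads 5 mmHg under, posttest gauge reads 5 mmHg over.
change = observed_change(true_pre, true_post, pre_bias=-5, post_bias=+5)
print(change)  # 5, i.e. an apparent increase of 5 mmHg
```

The recorded change is the true change plus the difference between the two gauge errors, so a 10 mmHg discrepancy between gauges is enough to reverse the sign of a 5 mmHg effect.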
Instrumentation can also be a threat when the researchers themselves are the instrument, as in observational studies. Suppose that a study is conducted to assess whether training coaches to give positive feedback actually increases their use of positive feedback. We assess the provision of positive feedback by having an observer record instances of it during a coaching session, then we give the treatment and observe the coaches again to see if they use more positive feedback. Suppose that in fact the training has no effect; the coaches' use of positive feedback does not increase. Can you think of any reason why the observer might record more instances of positive feedback at posttest than at pretest?
Statistical regression to the mean
This is a complex statistical phenomenon that can occur in pretest-posttest designs. Whenever we take a pretest measure, some people will happen to score low or high on the variable due to some factor that will not be present at posttest. Then when you take the posttest measure, those people will not score so low or high again. For example, suppose I gave the Research Methods class a multiple choice test on this lesson. Some of you will make some lucky guesses and give correct answers to questions you do not know the answer to, so your score will be inflated. The effect of this would be to drag the mean score for the class upwards. If I then gave you a second, comparable test at a later date, it is very unlikely that those who got lucky the first time round will do so again. So at posttest the class mean would be lower.
Many factors can create this regression artefact. Individuals might just be having a bad day, be in a bad mood, not be concentrating or whatever, and it can be difficult to identify what caused the problem. Statistical regression is particularly likely to occur when groups are selected on the basis of extreme scores at pretest as in some quasi-experimental designs. Recall that we discussed this approach in the previous lesson (the regression-discontinuity or cutoff design). The problem here is that some of the participants will give extreme scores at pretest only due to factors that will not be in operation at posttest. When participants are randomly assigned to groups, regression will not be a problem (provided randomisation works) because the factors causing the 'misleading' extreme scores will be randomly distributed across the groups.
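A simple simulation may help to show why selecting participants on extreme pretest scores invites regression. It is a sketch under assumed values: each person has a stable 'true' score, each test adds independent random error (luck, mood, concentration), no treatment is given to anyone, and the top 10% of pretest scorers are selected. All means, standard deviations and the 10% cutoff are invented for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

N = 10_000
# Stable underlying ability, plus independent error on each testing occasion.
true_scores = [rng.gauss(50, 10) for _ in range(N)]
pretest = [t + rng.gauss(0, 10) for t in true_scores]
posttest = [t + rng.gauss(0, 10) for t in true_scores]  # no treatment at all

# Select the participants with the most extreme (top 10%) pretest scores.
cutoff = sorted(pretest)[int(0.9 * N)]
selected = [i for i in range(N) if pretest[i] >= cutoff]

pre_mean = statistics.mean(pretest[i] for i in selected)
post_mean = statistics.mean(posttest[i] for i in selected)
whole_change = statistics.mean(posttest) - statistics.mean(pretest)

print(f"Selected group: pretest {pre_mean:.1f}, posttest {post_mean:.1f}")
print(f"Whole sample change: {whole_change:.2f}")
```

In this sketch the selected group's mean falls back towards the overall mean at posttest although nothing was done to them, while the mean of the whole sample barely moves. A researcher who had given the selected group a treatment intended to lower scores could easily mistake this artefact for a treatment effect.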
Unfortunately, randomisation does not always work. We once conducted a study assessing the effects of a relaxation treatment on preoperative anxiety and difficulty of anaesthesia in daycase surgery. The relaxation treatment was compared to a similar period of time listening to a short story (an attention-control condition; more on this later) and a no-treatment control condition using a randomised, pretest-posttest control group design. When plotted, the results for changes in state anxiety for the relaxation and no-treatment groups looked like this:
I have left out the results for the attention-control condition here for clarity; they were straight down the middle. Whenever you see results like this, warning bells should start ringing. Clearly, despite random assignment, the groups were not equated with respect to anxiety at pretest. Therefore the apparent decrease in anxiety in the treatment group and increase in the control group could have been due to statistical regression. When we submitted the study to a journal for publication one of the paper's reviewers, quite rightly, was quick to point this out. Fortunately, we had other data to counter this rival explanation and the paper was published.
These threats are concerned with biases introduced in assigning participants to treatment conditions in multiple group studies.
This threat involves bias resulting from differential selection when assigning participants to groups. If the groups are different to begin with, then they are most likely going to be different following the application of a treatment to one of the groups, regardless of whether or not the treatment has any effect. For example, suppose in our relaxation for preoperative anxiety study we had assigned more anxious participants to a control condition and less anxious participants to a relaxation treatment condition and used a static group comparison design:
We would be likely to find that the relaxation group was less anxious following
treatment than the control group. Obviously, though, this would be because
the relaxation group was less anxious anyway. Clearly, random assignment
to groups avoids this problem because the groups are equated at pretest.
In multiple group studies, selection biases can interact with any of the other threats to validity that we have already met and those still to come. For example, we can have selection X maturation. Here, if there is differential selection, changes in the dependent variable may be due to maturational changes in some groups that are not affecting other groups. For example, if a treatment group comprised younger participants than a control group then any changes in the treatment group might be due to maturation and not the treatment itself. Alternatively we could have selection X history. Here, if there is differential selection, changes in the dependent variable may be due to historical events experienced in one group that are not experienced in the other group. We might also have selection X regression if one group is extreme at pretest (relative to the population mean) whereas other groups are not.
Selection X manipulation
The selection X manipulation interaction is rather different. In this case, the effects of the manipulation only hold for the particular population sampled. It is therefore a threat to external validity. For example, a very large amount of psychological research has used undergraduate students as participants, simply because they are easy to get hold of. However, undergraduates differ in many ways from the population in general. One would hope, for example, that they are somewhat more intelligent and better educated than people in general. We might question, then, whether results obtained from undergraduates would apply to the general population.
Similarly, in the sports sciences a good deal of research uses sports performers who participate at relatively low levels of competition, because it is difficult to get more elite performers to take part in our studies. Suppose we find that some intervention 'works' when tested with lower level performers. We have no way of knowing whether the same intervention would work with elite athletes unless we go on to test it with that population. The problem goes further than that. If we went on to test the intervention with, say, Premier League footballers, how do we know that the same intervention would work equally well with top class sprinters? No design, no matter how strong in internal validity, can answer this question. In fact, this is a general problem with external validity that we must always bear in mind. Logically, we cannot generalise the results of an internally valid study to populations other than the one used to test an intervention. However, we do not normally have the resources to go on testing interventions with every possible population. At the end of the day, we have to make an informed judgment about the extent to which our findings are generalisable to different populations. It then becomes an issue of face validity: does it seem reasonable that the same results would be obtained from a different population?
These are threats concerned with the act of testing participants and with potential effects of the manipulation other than those that are intended.
Testing effects concern the effects of the act of taking the pretest on scores
at posttest. Posttest scores
may be affected by practice gained in taking the pretest, memory, familiarisation
with the setting in which the tests are conducted, and so on. For example,
the best way to improve your IQ scores is to take an IQ test! Typically,
scores improve by around 3-5 points from a first IQ test to a second test.
Of course, this does not mean that taking IQ tests improves your intelligence.
A pretest can change the way people respond to a posttest in other, more subtle ways. For example, suppose we get some young athletes to complete a questionnaire about attitudes towards drugs in sport. The act of filling in the questionnaire might make them go away and think more deeply about drugs in sport than they had previously. Then if you give them the same questionnaire again at a later date they may respond differently to how they did on the pretest, regardless of the effects of any intervention designed to change their attitudes. This is also referred to as pretest sensitisation: the pretest sensitises participants to the intervention so that they respond to it in ways that they would not respond if they did not have the pretest. The reactive effects of testing are in fact an interaction between the pretest and the intervention. This is a threat to external validity because, if on the basis of our studies we decided that our intervention worked, we would not normally be pretesting individuals when we went out and applied the intervention in real-life settings. Can you think of a design that would avoid this problem?
Reactive arrangements are not to be confused with the reactive effects of testing. These are a more general class of threats that are concerned with a participant's response to the research setting. Essentially the problem boils down to this: research settings are unnatural situations so we might expect participants in research to behave unnaturally. In other words, the research setting might influence participants to respond differently to how they would respond in a real-life setting.
Research is a social activity involving complex interpersonal interactions between the researchers and the participants. Consequently, reactive arrangements are sometimes referred to as social threats to validity. Unlike guinea pigs or laboratory rats, human research participants are thinking, rationalising organisms who will always try to make sense of their situation, interpret what is happening to them and anticipate what is expected of them. This becomes particularly problematic in psycho-social research situations. Social threats to internal validity, then, are concerned with the potential for any change in the dependent variables being due to social factors inherent in the research setting.
As an example, there has been much debate over many years as to whether it is possible to get people to perform acts under hypnosis that they would not normally perform when not hypnotised. One study in the early sixties seemed to support the idea that this was possible. Hypnotised participants were told to pick up and play with poisonous snakes, and they duly did so. Of course, the venom had previously been removed from the snakes, but the participants were not to know this. Orne and Evans (1965) then replicated this study but without the hypnosis and got the same effects! When asked why they had done something so seemingly dangerous, the participants replied that they knew that the experimenters were responsible people who would not really put them in any danger. In similar studies, Orne and his colleagues managed to get participants to commit all sorts of apparently dangerous or even violent acts, such as throwing 'acid' (actually just water) into other people's faces. The participants were behaving in the experimental situation in ways in which they would not normally behave and the results from the earlier experiment may not have been due to the manipulation (hypnotism) but to the participants' interpretation of the situation.
It is important to note that reactive arrangements are not just a potential problem in experimental studies. Any research setting is unusual and might prompt uncharacteristic responses from participants. If I stop people in the High Street and ask them to fill in a questionnaire, this is an unusual situation for them and they may act in unusual ways.
Whenever taking part in a study is time-consuming, boring, involves aversive procedures, effort, and so on, one might expect participants to be unmotivated or to lose motivation during the course of the study, which will then influence their responses. Furthermore, procedures that involve threats to the person's ego or the need to disclose personal information may lead them to respond with less than perfect honesty.
A related problem is that of social desirability response bias. This is a measurement issue. People generally tend to try to portray themselves in the best possible light. They might be unwilling to disclose their fears or weaknesses. So they could respond in ways which they think are socially desirable or acceptable. If I were to ask the Research Methods class how many of you regularly pick your noses, I bet that not all the nose-pickers among you would admit to it! As a more serious example, athletes might be unwilling to report feelings of competitive anxiety. When given an anxiety questionnaire, therefore, they might score lower than they really should. Similarly, when asked to self-report physical activity levels, individuals tend to overestimate the amount they do. They do not want to be seen as lazy, perhaps. In randomised multiple group studies with repeated measures, if we assume that tendencies to over- or underestimate are randomly distributed between groups, this is not such a big problem. Although the absolute scores obtained will be inaccurate, we should still be able to detect differential changes across time because it is likely that people in different groups will over- or underestimate to the same degree on different occasions.
The social desirability response bias is a relatively stable personality disposition. Some people have a tendency to respond more in a socially desirable way than others. The likelihood of a questionnaire being influenced by social desirability response bias can be assessed during its development. Scales are available that measure the tendency to respond in socially desirable ways, the most prominent being the Marlowe-Crowne Scale. This asks individuals to respond with true or false answers to a series of statements. For some of the statements it would be highly unlikely or unreasonable to expect anyone to truthfully respond positively or negatively. For example, one statement says 'I never get angry'. As most people do get angry on some occasions, it would not be reasonable to expect someone to respond with 'true' to this statement. If they do so then it is likely that they are responding with a social desirability response bias. We can use the scale in two ways. We can administer it with a new questionnaire and correlate scores from the two measures. A significant correlation would indicate that the new questionnaire is prone to this response bias. Secondly, we could administer the Marlowe-Crowne scale to identify individuals who tend to respond in a socially desirable way and eliminate them from our studies. A number of personality inventories incorporate similar items, known as lie scales, for this purpose.
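The first use of the scale, correlating it with a new questionnaire, might be sketched as follows. The scores below are made up purely for illustration and `pearson_r` is a hypothetical helper written out by hand. The sketch computes only the Pearson correlation coefficient; judging whether the correlation is significant would need a further test that is not shown here.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up scores for eight respondents, for illustration only.
social_desirability = [5, 12, 9, 20, 15, 7, 18, 11]
new_questionnaire = [22, 30, 25, 41, 33, 24, 38, 29]

r = pearson_r(social_desirability, new_questionnaire)
print(f"r = {r:.2f}")
```

With these invented numbers the two sets of scores rise together, which is the pattern that would raise concern that the new questionnaire is picking up social desirability rather than the construct it was designed to measure.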
No design, of itself, can control for motivational and response bias factors. We can only seek to minimise them. A method often used to maintain motivation is to offer a reward for participation, such as money, course credit for student participants, or the chance to enter a prize draw. Can you think of a problem with this?
The instructions given to participants, called the instructional set, can be used to promote truthfulness by reassuring them that their responses will be held in strict confidence and used for no other purpose than for answering the research question. Ideally, questionnaires and other measures should be administered in conditions of anonymity: no names are taken and it is clear to the respondents that the data-collector does not know who is completing which questionnaire booklet, in the same way that we collect your module evaluations at the end of each semester. Instructional sets for questionnaires also often include a statement reassuring respondents that the instrument is not a test and that there are no right or wrong answers. This is designed to prevent people from feeling that they are being put on the spot or that they should respond in a particular way.
If you use instructional sets in your own research, be careful how you word them. I once had a final year project student who told her participants: "This is not a test and there are no right or wrong answers. We simply want to know how you feel about exercise situations. Your responses will be held in the strictest confidence and not divulged to anyone but the researchers. Total animosity is guaranteed" !
Orne (1962) coined the term demand characteristics to describe those aspects of a research setting that may lead participants to anticipate what the study's hypothesis is. He defined demand characteristics as "... the totality of cues which convey an experimental hypothesis to the subject." When designing a study we need to ensure that we minimise (even if we cannot entirely eliminate) demand characteristics that might lead the participants by the nose to respond in the direction of our hypothesis.
Demand characteristics can be very subtle, though, and easy to overlook. I once conducted a study of the effects of goal-setting training for enhancing exercise adherence. Participants were randomly assigned to a goal-setting training group or an attention-control group that did not receive goal-setting training (again, more on attention-control later). At the end of the study period both groups were given a questionnaire designed to measure whether or not they were setting effective goals. This comprised scales assessing the extent to which they set specific, difficult, measurable, realistic, time-limited goals. The goal-setting training group scored significantly higher on these scales. However, independent quantitative and qualitative data suggested that they were not in fact setting more effective goals. Of course, in the goal-setting training programme I had trained the participants to set specific, difficult, measurable, realistic, time-limited goals. At posttest I then asked them if they set specific, difficult, measurable, realistic, time-limited goals. It is not surprising that they reported doing so because they would have expected that that was what I was looking for! I had to conclude that the apparent better goal-setting performance of the experimental group was very likely due to the demand characteristics of the research situation and not the training.
Another reactive threat concerns the participants' understanding of the purpose of a study. First, they may simply misunderstand the instructions given to them. Obviously, it is important to present instructions clearly and unambiguously. Pilot testing before launching a study can help identify any problems here.
There is a more subtle issue, though, to do with demand characteristics and the different ways in which people might respond to how they interpret the research situation. If participants know what the hypothesis is, they may do one of two things: comply with what they think is expected of them (the so-called 'good' participant) or react with defiance and deliberately seek to sabotage the study (the 'bad' participant). In either case they will be behaving in ways in which they would not normally behave.
In order to prevent this, we typically conceal the purpose of a study from the participants, although for ethical reasons they should always be debriefed afterwards. This does not necessarily solve the problem, though. Because they are thinking organisms who will always seek to make sense of what is happening, human participants will tend to try to guess at what the purpose of the study is, taking into account whatever cues (including demand characteristics) are available in the situation. They might well guess wrong, but in any case they could then react with compliance or defiance to what they perceive the study's purpose to be, again not behaving in ways in which they would normally behave in real-life settings.
Participants assigned to no-treatment control groups can cause problems of their own. First there is always a thorny ethical issue of withholding a treatment which the researchers believe will be of benefit. This is a serious problem in medical and other critical applied areas of research. If we believe that a new treatment will improve life for Parkinson's disease sufferers, for instance, or individuals with severe clinical depression, what right have we to withhold the treatment from some people just because they are participants in our study? The usual answer to this is that ultimately by conducting the study we can be sure that the treatment is in fact effective, or more effective than current treatments, and we can rule out hazardous side-effects. Furthermore, if we find that the treatment works, we can always give it to the control group participants at a later date, although it might be too late for some of them. So in the end the 'greater good' outweighs the withholding of treatment from some individuals.
Assignment to no-treatment control conditions, however, can also pose a threat to validity because of the way in which individuals might respond to knowing that they are not going to get a treatment. Suppose I advertise in the local press for volunteers to take part in a study examining the efficacy of a new method for losing weight, using a particular combination of dietary changes and physical activity. Lots of overweight people eagerly sign up for the study. Then I tell half of them that actually they are not going to get the treatment. Instead I want them to carry on with their normal eating and activity patterns. You can imagine that they are likely to be rather disappointed. It has been shown that control group participants can respond in one of two different ways. First, in their disappointment, they might become demotivated and actually eat more and do less activity than normal. This is called resentful demoralisation. If this were to happen then the treatment might appear to be more effective than it actually is. Alternatively, participants might think "Well, I'll show them what I think of this" and rigorously engage in more exercise and a stricter diet than normal, to get their own back, as it were. This is called compensatory rivalry. In this case the treatment might appear less effective than it actually is.
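The direction of the bias in each case can be seen with some simple arithmetic. In the sketch below the apparent treatment effect is taken to be the treatment group's weight change minus the control group's weight change. All of the weight changes (in kg) are hypothetical numbers chosen for illustration, and `apparent_effect` is an invented helper.

```python
def apparent_effect(treatment_change, control_change):
    """Apparent effect = change in treatment group minus change in control group."""
    return treatment_change - control_change


# Hypothetical: the treatment truly produces a 4 kg weight loss.
treatment_change = -4.0

# Control group behaves normally: weight does not change.
unbiased = apparent_effect(treatment_change, 0.0)

# Resentful demoralisation: disappointed controls gain 2 kg.
demoralised = apparent_effect(treatment_change, +2.0)

# Compensatory rivalry: controls diet and exercise, losing 3 kg.
rivalry = apparent_effect(treatment_change, -3.0)

print(unbiased, demoralised, rivalry)  # -4.0 -6.0 -1.0
```

With these numbers a true 4 kg loss looks like a 6 kg advantage under resentful demoralisation and only a 1 kg advantage under compensatory rivalry.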
The most commonly used method to avoid these problems is to employ a waiting list control condition. Participants in the no-treatment group are offered the prospect of receiving the treatment at a later date, often being told that the researchers do not have the resources to deliver the treatment immediately to all the people who have volunteered for the study. Thus they are put on a waiting list. In this way, at least they know that they will get the treatment and when. This procedure can strengthen the study design because you then have a second wave of intervention and can see whether any effect observed in the first wave is replicated.
Non-specific treatment effects are a major threat to internal validity in studies evaluating the efficacy of treatments. These are aspects of a treatment programme, often unavoidable, that are in addition to the specifics of the treatment itself. For example, my goal-setting training programme involved spending a considerable amount of time with the participants. Any effects of the treatment on their adherence to exercise may have had nothing to do with goal-setting training but may instead have been due to the time and attention they had received from me. This might have made them feel special, or perhaps obligated to me, and caused them to put more effort into exercising regularly.
These effects are often referred to as Hawthorne Effects from a classic series of studies conducted in the early part of the last century. Researchers were conducting experiments at the Hawthorne works of the Western Electric Company in Chicago to determine optimal working conditions for the workforce with a view to improving productivity. In general it was found that no matter what changes were made in conditions, productivity improved. In one series of studies, lighting was progressively increased and productivity improved. Then the lighting was progressively decreased, to the point where there was virtually no light available, and productivity still improved! Although the interpretation of these findings remains controversial, a widely held conclusion has been that it was the time, attention and encouragement given to the workers by the researchers that led to the strange results.
Hawthorne effects are rather analogous to placebo effects in medical research. As we discussed earlier, in evaluating the efficacy of new drugs or other treatments, some participants are given the drug whilst others receive an inert substance that looks like the drug but has no active ingredient. The aim is to be able to determine whether the drug is effective over and above any placebo effect. The mechanisms by which placebos exert an effect are not well understood, but there is no doubt that they do have an effect. In such studies, double blind designs are used: neither the researchers nor the participants know who is getting the real treatment and who is getting the placebo. Such designs are sometimes used in sport science research, for example when examining the effects of ergogenic aids like creatine phosphate. Hawthorne effects are very similar to placebo effects but are due to aspects of the social interaction between researchers and participants.
Another, related aspect of non-specific treatment effects concerns participants' expectations of benefit. If the treatment appears to the participants to be credible and likely to be effective, in other words if they expect to benefit from it, they may well do so, regardless of the actual efficacy of the intervention. Incidentally, although this may be one explanation for placebo effects, there are studies showing that placebos can work even when participants do not believe in the efficacy of the treatment.
The problem with psycho-social interventions is that it is usually impossible to conduct double blind procedures. How could I have conducted the goal-setting training study without knowing which participants were receiving the training? Furthermore, if I was comparing the goal-setting training with a no-treatment condition the participants themselves would know whether or not they were receiving a treatment. However, it is possible to attempt to control for non-specific treatment effects by using an attention-control condition. This is analogous to a placebo condition. It involves randomly assigning some participants to an alternative treatment that matches the experimental intervention in terms of the time and attention that the participants receive but without any active ingredient and, if possible, it should appear to the participants to be a credible treatment that will work. When using a credible attention-control condition it is possible to implement a single blind design: although the researchers know which group is receiving the experimental treatment, the participants do not.
In the goal-setting study my attention-control group received what I described to them as 'motivational training'. This involved meeting with the participants for the same amount of time as the goal-setting group, training them to self-monitor their fitness levels and completing decision balance sheets: self-assessments of the pros and cons of exercising. I chose these procedures because at that time the literature suggested that they would not have any enduring effects on exercise adherence. Nevertheless, I gave an explicit rationale as to why these procedures should work so that the participants would perceive them as credible and effective. The problem was that despite what the literature said, this treatment turned out to be almost as effective as goal-setting training in enhancing exercise adherence! This is a general problem with attention-control procedures: it is often very difficult to devise a credible treatment that is truly inert when conducting psycho-social interventions.
It is possible to measure expectations of benefit and the extent to which participants perceive an attention-control condition to be as credible a treatment as an experimental condition. Questionnaires are available that can be used to assess participants' expectancies of benefit and beliefs about treatment credibility. You can then determine whether or not treatment and attention-control groups are significantly different with regard to these factors. Ideally, there will be no difference.
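As a rough illustration of that group comparison, the sketch below runs a two-sided permutation test on the difference in mean credibility ratings. This is one of several tests that could be used (an independent-samples t-test is a common alternative) and is not the procedure used in the goal-setting study. The function name, the 1 to 7 rating scale and the scores themselves are invented for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns (observed_difference, estimated_p_value).
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the pooled scores to two groups at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical credibility ratings (1-7 scale), invented for illustration.
treatment_credibility = [6, 5, 6, 7, 5, 6, 4, 6]
attention_control_credibility = [5, 6, 5, 6, 6, 4, 5, 6]

diff, p = permutation_test(treatment_credibility, attention_control_credibility)
print(f"Mean difference = {diff:.2f}, p = {p:.3f}")
```

A large p-value here would be consistent with the two conditions being perceived as similarly credible, although with small groups a non-significant result is weak evidence that the groups are truly equivalent.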
Experimenter effects are concerned with biases introduced by the researchers themselves. Researchers are usually very committed to their work. Often their jobs or promotion prospects are dependent on producing good research. They may be driven by the prospect of fame (although not usually fortune, at least not in British academia!). Consequently, they want their experiments to 'work'. This may lead investigators to introduce biases into their studies that will tend to lead to the support of their hypotheses, even if they don't actually fabricate their data. Unfortunately, there are many examples of deliberate bias and data fabrication in the history of science, although the culprits usually get found out eventually.
A more subtle problem, though, is bias inadvertently or subconsciously introduced by the researcher. A researcher might, for example, treat experimental group participants more kindly or give them more encouragement or attention than control group participants. This is known as the Rosenthal Effect after Robert Rosenthal who studied this phenomenon in great depth. In one study, Rosenthal had psychology undergraduates take part in a lab practical in which they had to time mice running through a maze. The students did not know it but they were actually experimental participants. Half of them were told that their mice were 'maze bright': a special strain that had been bred to be good at running mazes. The other half were told that their mice were 'maze dull': a strain that was bad at maze running. In fact the mice were all of the same strain but it was found that the maze bright rodents ran the maze in significantly faster times than the maze dull ones. Subsequent observation showed that students who were given so-called bright mice treated their animals much more carefully and gently than those who had dull ones. The dull mice were therefore more scared than the bright ones and consequently took longer to run the maze.
Rosenthal and Jacobson (1968) conducted a similar experiment in a human context in a famous study that came to be known as Pygmalion in the Classroom. To cut a long and complex story short, schoolteachers were told that tests showed that some of their pupils were 'late bloomers' who were about to show a dramatic improvement in their academic learning. In fact, these 'special' pupils were randomly selected and no different to their classmates. At the end of the term, all the students were tested, and it was found that the late bloomers not only performed better according to their teachers' evaluations, but they also scored significantly higher on standardized IQ tests. It seems that the expectations of the teachers had translated into how they treated their pupils throughout the year, leading to greater improvements in the so-called late bloomers than their unlabelled counterparts. You can read more about this landmark study here.
So it is quite easy for researchers to unwittingly bias the results of an experiment. Imagine conducting a study to assess the effects of caffeine or some other ergogenic substance on treadmill running performance. It would be very easy to encourage the experimental group to run faster or longer than the control group, without even knowing you were doing it. It is for this reason that double blind procedures are used whenever possible. However, as we have seen, in many circumstances it is not possible to control for experimenter effects in this way. Like expectancy effects though, experimenter effects can be measured to ensure that different groups are being treated in the same way by the experimenters. You can give all groups a questionnaire tapping the extent to which participants perceive that the experimenters encouraged them, developed rapport or gave them special attention. Hopefully, there would be no significant difference between the groups in scores on the questionnaire, indicating that the experimenters had treated all sets of participants equally.
We have now examined in some detail the major threats to internal and external validity and how different types of design can or cannot control for them. For the purposes of the end-of-semester exam, you need to be thoroughly familiar with these issues. The following points are particularly important to note.
The randomised pretest-posttest control group design
This design controls for threats to internal validity provided that:
Randomisation works (groups are equated at pre-test)
All potential independent variables including reactive arrangements and non-specific treatment effects etc. are held constant
There is no differential mortality between groups.
Threats relating to the passage of time are controlled because they should manifest themselves equally in each group.
Threats relating to selection are controlled because participants are randomly assigned to groups so they should be equal with respect to any potential independent variables.
The testing threat is controlled because it should also manifest itself equally in both groups.
Because there is a pretest, the reactive effects of testing (external validity) are not controlled, so we do not know if we would get the same effects for a treatment when participants are not pretested.
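The logic of this design can be illustrated with a small simulation. The sketch below is not taken from any real study: the function name, sample size, treatment effect and size of the time-related change are all invented. It gives every participant the same time-related change (standing in for maturation, history or testing), adds a treatment effect only for the randomly assigned treatment group, and then compares the mean pretest-to-posttest change in the two groups.

```python
import random
from statistics import mean


def simulate_trial(n=200, treatment_effect=5.0, time_change=3.0, seed=1):
    """Simulate a randomised pretest-posttest control group design.

    Returns the mean pretest-to-posttest change in the treatment group
    and in the control group. All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    ids = list(range(n))
    rng.shuffle(ids)
    treated = set(ids[: n // 2])  # random assignment to treatment

    changes = {"treatment": [], "control": []}
    for pid in range(n):
        pretest = rng.gauss(50, 10)
        # Time-related change and measurement noise apply to everyone.
        posttest = pretest + time_change + rng.gauss(0, 2)
        if pid in treated:
            posttest += treatment_effect
            changes["treatment"].append(posttest - pretest)
        else:
            changes["control"].append(posttest - pretest)
    return mean(changes["treatment"]), mean(changes["control"])


treat_change, control_change = simulate_trial()
estimate = treat_change - control_change
print(f"Change in treatment group: {treat_change:.2f}")
print(f"Change in control group:   {control_change:.2f}")
print(f"Estimated treatment effect: {estimate:.2f}")
```

Looking at the treatment group alone would overstate the effect, because its change score includes the time-related change. Subtracting the control group's change removes that component, which is why threats that act equally on both groups do not bias the comparison.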
The posttest only control group design
This design also controls for threats to internal validity provided that:
Randomisation works (groups are equated at pretest)
There is no differential mortality between groups.
Threats relating to the passage of time are controlled because they should manifest themselves equally in each group.
Threats relating to selection are controlled because participants are randomly assigned to groups so they should be equal with respect to any potential independent variables.
The testing threat is controlled because there is no pretest.
Changes related to time (maturation, history etc.) are not measured because there is no pretest.
We cannot be sure that the groups are equated at pretest.
However, this design also controls for reactive effects of testing because there is no pretest.
Otherwise, we cannot control for threats to external validity. In an internally valid study we can only demonstrate that the effects of a treatment hold under the specific conditions of the study. We can only say for certain that the effects of a treatment hold for pre-tested participants from the particular population sampled at this point in time in this particular place … etc.
Now do the Threats to Validity MCQ self-test in Blackboard.