Quality in quantitative research
When assessing the quality of quantitative research, the most commonly considered concept is the validity of the research. Validity refers to how confident one can be that the conclusions of a study are accurate and how far those conclusions can be applied in contexts outside of the study. In quantitative research, there are two key types of validity: internal validity and external validity. Internal validity is the extent to which the design and execution of a study are free from bias. External validity is the extent to which the conclusions of a study can be generalised to a wider population.
Internal validity
Participant selection
When recruiting or selecting participants for a study, a number of potential biases pose risks to internal validity. Selection bias is the threat to internal validity caused by participants not being chosen in a way that reflects the particular population being studied. Other threats to validity come from self-selection bias, participation bias and attrition bias. Self-selection bias occurs when studies solicit participation (such as by asking people to volunteer to take part in a study or fill out a questionnaire). Those who choose to take part are likely to differ fundamentally from those who do not, and they may not be representative of the wider population. Indeed, they may over-represent people with strong opinions, interests or motivations, resulting in the study’s results being biased [189]. Participation bias is a related issue where people who have been invited to be part of a study choose not to be involved. Those who decline are likely to share common characteristics, and their absence from the sample could mean a study’s results are biased. During the course of a study, it is common for participants to drop out of the research. This loss of participants is called attrition.
Attrition bias is the influence that these departing participants might have on the results of the study. For example, if more participants are lost from the control group than the intervention group, this could influence the results because those leaving the study may share certain characteristics (such as health level, sex or age group). At the most extreme, in a medical study one group may have a large attrition rate because many of its participants experience worsening health or die during the study. Therefore, when the groups are compared at the end of the study, there may appear to be a difference due to the intervention which is actually caused by attrition [190].
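As a minimal illustration (a Python sketch using entirely hypothetical numbers), the example below shows how differential attrition alone can produce an apparent difference between two groups drawn from the same population, even though no intervention effect exists.

```python
# Illustrative sketch: differential attrition creating an apparent group
# difference where there is no true intervention effect (hypothetical numbers).
import random

random.seed(1)

def simulate_group(n, dropout_below):
    """Simulate one group of n participants with a health score (mean 50, SD 10).
    Participants whose score falls below `dropout_below` drop out before the
    final measurement, so only the healthier completers are measured."""
    scores = [random.gauss(50, 10) for _ in range(n)]
    completers = [s for s in scores if s >= dropout_below]
    return sum(completers) / len(completers), len(completers)

# Both groups are drawn from the same distribution, so there is no true effect.
control_mean, control_n = simulate_group(200, dropout_below=45)           # heavy attrition
intervention_mean, intervention_n = simulate_group(200, dropout_below=-1)  # no attrition

print(f"Control: {control_n} completers, mean score {control_mean:.1f}")
print(f"Intervention: {intervention_n} completers, mean score {intervention_mean:.1f}")
# The groups differ at the end of the study only because the control group lost
# its least healthy participants, not because of any intervention.
```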
Measurement use
There are many different data collection methods used to measure variables and factors in research studies. These include clinical measurements (such as blood pressure), interviews, questionnaires and observing participant behaviour. Different methods are used to answer different questions. A key concern is that the measurements used are reliable and valid for the question being asked. Reliability refers to how consistent a measure is. For example, if the same object is placed on weighing scales, the scales would be considered reliable if they showed the same weight each time (regardless of whether this weight is completely accurate). Reliability is important for internal validity because if measures are not reliable, differences found when comparing different groups may be attributable to an unreliable measurement, not true differences. Reliability issues become more of a risk when aspects of the measurement are not kept consistent. For example, interviews being carried out by different individuals may not be conducted in a reliably consistent way [191]. Measurement reliability differs from measurement validity. Measures that are valid provide an accurate reflection of the factor a researcher is interested in. For example, if reliability is a weighing scale showing the same weight for an object each time it is measured, validity would be that weight actually being accurate. In other words:
- If you weighed a 1kg bag of flour three times on unreliable scales, they might state the weight as 2kg, 1.5kg and 1kg
- If you weighed the 1kg bag of flour three times on reliable but invalid scales, they might state the weight as 2kg each time
- However, if you weighed a 1kg bag of flour three times on reliable and valid scales, they would state the weight as 1kg each time.
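The short Python sketch below makes the distinction concrete, using the illustrative readings from the list above: each hypothetical "scale" returns three repeated readings for a true 1 kg bag of flour.

```python
# Hypothetical scales returning three repeated readings for a 1 kg bag of flour.
def unreliable_scale(true_weight_kg):
    # Readings vary from measurement to measurement: not reliable.
    return [true_weight_kg + 1.0, true_weight_kg + 0.5, true_weight_kg]

def reliable_but_invalid_scale(true_weight_kg):
    # Readings are consistent but consistently wrong: reliable, not valid.
    return [true_weight_kg + 1.0] * 3

def reliable_and_valid_scale(true_weight_kg):
    # Readings are consistent and accurate: reliable and valid.
    return [true_weight_kg] * 3

for scale in (unreliable_scale, reliable_but_invalid_scale, reliable_and_valid_scale):
    print(scale.__name__, scale(1.0))
```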
Measurement validity also encompasses choosing the right measures for the factor being investigated. If a researcher was examining the effect of a new medication on cholesterol, a blood test to measure cholesterol would be a valid measure. Another indicator, such as blood pressure, might be related but would not be the most valid way of measuring blood cholesterol. Other potential measures would be completely invalid, such as lung capacity or maths ability.
Extraneous variables
The way in which interventions and measurements are carried out may also threaten the internal validity of a study. The observer-expectancy effect occurs when an experimenter carrying out an intervention influences the participants to produce the results they expect (intentionally or unintentionally). For example, if a researcher is expecting a new treatment to reduce blood pressure, they may take more care to relax participants in the intervention group before taking blood pressure measurements than those in the control group. This would mean that differences between the groups might be caused by the researcher influencing the participants as opposed to the intervention medication. Another way in which researchers may influence results is through the generation of demand characteristics. Demand characteristics are cues (explicit or implicit) that indicate to participants that they are expected to behave in a certain way. Studies that measure behaviour are particularly prone to demand characteristics. Where participants are being asked directly about their opinions or behaviour, they may exhibit a response bias where they answer questions in the way they believe a researcher wants them to. They may also exhibit social desirability bias, where they underreport behaviours or opinions that they believe are socially unacceptable or undesirable. Social desirability bias can be lessened when responses are anonymised and participants are assured that they will not be identifiable from their responses. Demand characteristics can also cause a placebo effect, where the expectation of showing an improvement in a clinical outcome (such as blood pressure or symptoms of depression) results in participants improving in these measures regardless of the intervention. For this reason, in clinical trials, placebos (a substance or treatment that should have no effect) may be given to control groups so that the intervention can be compared to the improvements that occur just from the placebo effect. However, more often the intervention is compared to the standard treatment rather than a true placebo.
Another way of measuring whether effects are caused by the research situation as opposed to the intervention is to conduct follow-ups. These are measures taken at various time periods following the end of a study to see whether effects persist. A persistent effect indicates that the intervention has long-term effects beyond the situation of the study. Follow-ups also indicate whether effects grow or diminish over time and allow researchers to predict how long the effect of an intervention might last.
One research practice to reduce the effects of observer-expectancy effects, response bias and demand characteristics is blinding. A study can be non-blind, single-blind, double-blind or triple-blind. In a non-blind study, the participants know which intervention they are receiving, as do the researchers carrying out the intervention and analysing the data. Non-blind studies are at the highest risk of influence from observer-expectancy effects, response bias and demand characteristics. In a single-blind study, participants do not know whether they are in an intervention or control group and/or are not aware of exactly what the study is investigating. Single-blinding reduces the effects of demand characteristics and response bias. However, even if participants are not told what is being investigated, they may still have an idea (for example, if a study asks obese individuals to eat a certain diet for six months). In addition, as the researchers conducting the study are aware of whether participants are in an intervention or control group, there is still potential bias from the observer-expectancy effect. Double-blind studies eliminate the observer-expectancy effect by ensuring the researchers carrying out the study do not know which participants are in intervention or control groups as well as the participants not being aware which group they are in. In this way, potential bias from observer-expectancy effects, response bias and demand characteristics is reduced. Triple-blind studies are similar to double-blind studies in that neither participants nor the researchers carrying out the intervention know which groups the participants are in. However, in addition, the researchers who carry out the final analysis also do not know which groups received an intervention. This eliminates bias from observer-expectancy effects entirely as it removes the opportunity for a researcher to deliberately or inadvertently analyse the data in a way that favours an expected result. Another way to reduce the risk of bias from researchers is for data to be collected and analysed by an independent evaluator as they will not have expectations of the results.
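As a rough sketch of how blinding can be supported in practice (the participant IDs and group codes below are hypothetical), group allocations can be hidden behind neutral codes, with the allocation key held by someone independent of data collection and analysis and revealed only once the analysis is complete.

```python
# Minimal sketch: concealing group allocation behind neutral codes so that
# researchers and analysts can work "blind" (hypothetical IDs and labels).
import random

random.seed(42)

participants = [f"P{i:03d}" for i in range(1, 21)]
random.shuffle(participants)

# Half to intervention, half to control; the allocation key is kept by an
# independent party and not shared with those running or analysing the study.
allocation_key = {}
for idx, participant in enumerate(participants):
    group = "intervention" if idx < len(participants) // 2 else "control"
    allocation_key[participant] = group

# What the blinded researchers and analysts see: only anonymous group codes.
code_for_group = {"intervention": "Group A", "control": "Group B"}
blinded_view = {p: code_for_group[g] for p, g in allocation_key.items()}

print(blinded_view)  # analysts see only "Group A" / "Group B"
# The allocation_key is revealed only after the analysis is complete, as in a
# triple-blind arrangement.
```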
Further threats to internal validity come from certain research designs. For example, recall bias is caused by an individual’s inability to remember accurately events that happened in the past; the distortion of their memory can affect study outcomes. Recall bias is a particular threat to the internal validity of case-control studies and some retrospective longitudinal studies, whose design is based around accurately recalling previous life events. Recall bias is also an issue for any other study design where remembering past events is important.
Carryover effects are a threat to the internal validity of within-subject designs because participants are exposed to all controls and interventions. For example, a study might compare whether mathematical problem-solving is improved by exercising or meditating beforehand. In a within-subject design, participants would take three mathematical problem-solving tests, each preceded by a different situation (doing nothing, exercising or meditating). There is a risk of bias from carryover effects because the researcher cannot be sure that one intervention is not influencing the outcomes measured after another. For example, the study might have the control maths test at 11:00, the meditation followed by a maths test at 13:00 and the exercise followed by a maths test at 15:00. The researcher cannot be sure that the 13:00 meditation did not influence mathematical problem-solving for the rest of the day, including for the 15:00 maths test.
Other threats to validity in the study described above come from order effects and sequence effects. These are the influences that the position of an intervention or control in the testing order has on the dependent variable (such as maths test scores). There may be an advantage, for example, to being the first in the day. Order effects can include practice effects, where performance improves with each subsequent intervention. For example, the scores for the final maths test of the day might be higher because participants have been practising and improving, not because of the intervention. Order effects can also include fatigue effects, where performance declines with each subsequent intervention. For example, the scores for the final maths test of the day might be lower purely because participants are tired from multiple testing, not because of the intervention. Sequence effects are the influences that come from the exact sequence in which interventions are carried out. For example, there may be some interaction in the study described above where meditating then exercising in the same day improves maths problem-solving. Similarly, sequence effects might be seen in a study taste-testing six different ice cream flavours. The exact sequence they are tasted in might mean that participants rate them by comparing them only to the preceding flavour. Order effects and sequence effects are not just a threat to the internal validity of within-subject designs but also to any study design where participants have a dependent variable measured multiple times (as found in many one-group designs). They are also a threat to the internal validity of certain methodologies, such as questionnaires and interviews, because the exact order in which questions are asked may lead to bias in the responses.
One way of ensuring that order effects and sequence effects do not skew results is counterbalancing. Counterbalancing ensures that all possible orders of presenting interventions and controls (or test items, such as questions in a questionnaire) are used. An alternative to counterbalancing that also reduces order effects or sequence effects is randomisation, where each participant receives the interventions/items in a different random order.
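A minimal sketch of both approaches, using the hypothetical maths-test conditions from the example above, is shown below: counterbalancing enumerates every possible order of the conditions, while randomisation simply shuffles the order independently for each participant.

```python
# Sketch of counterbalancing versus randomisation (hypothetical condition names).
import itertools
import random

conditions = ["control", "meditation", "exercise"]

# Counterbalancing: use every possible order of the conditions, assigning
# participants to orders so that each order is used equally often.
all_orders = list(itertools.permutations(conditions))
print(f"{len(all_orders)} possible orders:")
for order in all_orders:
    print("  ", order)

# Randomisation: each participant receives the conditions in their own random
# order (no guarantee that every possible order is used equally often).
random.seed(0)
for participant in ["P01", "P02", "P03"]:
    print(participant, random.sample(conditions, k=len(conditions)))
```

With three conditions there are only six possible orders, so full counterbalancing is practical; as the number of conditions grows, randomisation is often the more manageable option.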
External validity
Whereas internal validity is the extent to which a single study’s conclusions can be considered unbiased, external validity is the extent to which those conclusions can be applied to different circumstances (such as other populations, locations or time periods). There are several factors that contribute to external validity. Three key contributing factors (generalisability, replicability and applicability) will be discussed below.
Generalisability
Generalisability is the extent to which the findings of a study can be applied to other situations. Generalisability can be divided into population generalisability, environmental generalisability and temporal generalisability.
Population generalisability
Population generalisability is the extent to which the findings of a study could be applied to a wider population than just those individuals who took part in the research. When a study is undertaken, researchers aim for the findings to be valid for a particular population (such as individuals with heart disease, university students studying maths or people who have experienced domestic violence). To be considered generalisable to the wider population, the participants involved in the study should be a representative sample. A representative sample reflects the variety in characteristics of the population drawn from. For example, if a study was researching maternity services for first-time mothers in Huddersfield, the sample should represent the range of ages, economic groups, religions, cultures, races and other factors of all first-time mothers in Huddersfield. Another consideration for population generalisability is sample size (the number of participants involved in the study); a study is likely to be more externally valid the higher the proportion of a population that is sampled.
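One common way of making a sample reflect the population is proportionate stratified sampling. The sketch below (with entirely hypothetical age bands and proportions for the Huddersfield example) shows how each stratum can be allocated the same share of the sample as it holds in the population.

```python
# Sketch of proportionate stratified sampling quotas (hypothetical proportions).
# Suppose the population of first-time mothers falls into age bands as follows.
population_proportions = {"under 20": 0.10, "20-29": 0.45, "30-39": 0.40, "40+": 0.05}
sample_size = 200

# Each age band contributes the same share of the sample as it holds in the
# population, keeping the sample representative of that characteristic.
stratified_quotas = {
    band: round(sample_size * proportion)
    for band, proportion in population_proportions.items()
}
print(stratified_quotas)  # {'under 20': 20, '20-29': 90, '30-39': 80, '40+': 10}
```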
One issue for population generalisability is specific to studies that use animal models. These studies use research animals to provide potential insights into human physiology or behaviour. For example, animal models may be used to investigate the role of genetics in obesity or the effect of a new medication on memory. However, because of genetic and physiological differences between humans and various animals, findings for an animal model may not be generalisable to humans [192]. The findings from this sort of study cannot be immediately generalised to humans without further research and testing. However, many media stories report ‘ground-breaking findings’ or ‘miracle cures’ for various conditions without noting that the evidence has come entirely from animal studies [193].
Environmental generalisability
Environmental generalisability is the extent to which the findings of a study could be applied to a different area. Researchers, and any individual looking at research results, have to consider how generalisable the findings of an individual study are to different local areas, different regions or different countries. For example, the results of a study researching maternity services for first-time mothers in Huddersfield may not be generalisable to another location. There may be unique factors at play in one local area that may not be present in another. For example, hospitals in large towns (like Huddersfield) are likely to be closer together and have better transport links between them than hospitals in rural areas, and these factors could have a role in maternity services. Similarly, studies involving factors such as the ecology or geology of a region may not be generalisable to other locations where these factors differ.
Temporal generalisability
Temporal generalisability is the extent to which findings of a study could be applied to a different time period. For example, if the study of maternity services for first-time mothers in Huddersfield took place in 2019, its findings would likely still be generalisable to the situation currently, unless any major changes had taken place since the end of the study. However, if the study took place in the 1950s it would be likely to be less generalisable to today. If it took place in the 1850s, its findings would almost certainly not be generalisable to today.
Replicability
One way that researchers can be more confident of the external validity of their work is through replication of their results. Quality research is reproducible, meaning that its data could be analysed again or its methodology could be used to rerun the study with a different sample [194]. In order for research to be reproducible, researchers must either share the data collected from the study or share the details of how they ran their study, including how the sample was selected, how the methodology was carried out, how measurements were taken, and how results were analysed [195]. This allows other researchers to repeat the study and see if the results found in the original study hold true in other contexts (meaning the results have external validity) or if the results are unique to the context of the original study (meaning the results lack external validity) [196]. Replicability refers to the extent to which the results and conclusions of a study are corroborated by that study being run again. The more similar the results of a replication study are to the original study, the more likely the results are to be externally valid and, therefore, generalisable to a wider context. The more replications of a study there are with similar results, the greater the likely external validity. However, if a replication study finds results that are different to the original study, it may mean that extraneous variables have not been controlled for in the study design and, therefore, the research is not externally valid and generalisable.
Applicability
The extent to which a study has external validity affects the applicability of its findings. Applicability is how relevant the research findings are to situations in the real world. For example, a study that tests a new medication on individuals with heart disease might have high external validity, because the participants are going about their lives and taking a new medication instead of the standard medication. The findings of this study are highly applicable, because if the new medication is found to be more effective than the standard medication, the relevance is clear: to improve outcomes for individuals with heart disease, change from the standard medication to the new medication. However, other studies may be less obviously applicable. For example, a study might find that participants who ran on a treadmill for 30 minutes before completing a general knowledge test scored far higher than those who lay down on a sofa for 30 minutes before completing the same test. An individual reading this study might question its validity and applicability. It would be difficult to find a scenario in the real world where this insight could be applied directly. Instead, it might inform some wider discussions around whether exercising before completing mentally challenging tasks is more beneficial than resting.
Studies in highly controlled, artificial environments (such as in laboratories) may change the behaviour of participants and, therefore, the behaviour that is reported in the study may not be the behaviour that is exhibited in real life scenarios. For example, a student may show improvements on a maths test taken after an intervention in the laboratory, but the study would only have external validity if these improvements were found when that intervention was used to improve students’ scores in a real maths test (such as a GCSE maths exam) [197]. The measurements used in a study may also affect its external validity. For example, if a study used a questionnaire to measure participants’ anxiety levels around spiders before and after different interventions, participants might claim that they would be comfortable having a spider placed on their hand after receiving an intervention. However, this hypothetical situation presented in the questionnaire is very different to the ‘real life’ scenario of actually having a spider placed on their hand. If the study results were applicable in the real world, then any participant who said on the questionnaire that they would be comfortable having a spider placed on their hand would be able to hold a spider in real life.
Quality in qualitative research
As discussed previously, quantitative and qualitative research have different aims. Usually, quantitative research is more concerned with objectivity and the replicability of research findings, while qualitative research is more concerned with the data being an authentic and trustworthy reflection of the circumstances in which it was collected [198, 199]. These differences mean that some qualitative researchers reject quality measures used by quantitative researchers (such as validity) and adopt other measures instead [200, 201]. However, others advocate using the same terms but redefining them in the context of qualitative research [202-204]. The sections below consider how the terms reliability and validity are used in qualitative research, alongside some concepts specific to assessing quality in qualitative research.
Reliability
When considering reliability in quantitative research, the focus is on how accurately a concept is being measured and explained. However, qualitative research tries to generate understanding in an area. Therefore, reliability in qualitative research refers to how dependable the research method and the data generated were [205, 206]. For example, qualitative studies should make clear which methods have been used and what decisions have been made by the researcher during the research process. If this is clear and trustworthy, then it is likely that a second researcher carrying out that study would produce similar findings [207].
Validity
Some qualitative researchers reject the concept of validity being applied to qualitative research, preferring other terms (such as adequacy, trustworthiness, accuracy and credibility) or other measures instead [208-210]. However, many researchers emphasise the importance of assessing validity in qualitative research [211, 212]. When looking at internal validity in qualitative research, the focus is similar to that for quantitative research, where readers of a study will consider how appropriate the research process was. This can include whether the choice of methodology was the right choice for answering the research question and whether the sampling and data analysis were carried out appropriately [213]. Some quality assessments list multiple types of validity specific to qualitative or mixed methods research. For example, empathic validity is the extent to which the study increased empathy among participants and ethical validity is the extent to which the research outcomes and resulting changes are appropriate and fair [214]. Other qualitative researchers favour the use of credibility as a quality measure, instead of internal validity. This involves establishing that those who participated in the research find the results believable [215].
Although quantitative researchers tend to place a greater emphasis on the importance of external validity (and the related concepts of replicability and generalisability), some qualitative researchers also consider this an important measure of quality in their work [216]. Most qualitative research looks at a specific issue or phenomenon in a specific context and, therefore, widespread generalisability is not expected [217]. However, external validity can be demonstrated in qualitative research through a variety of techniques, such as triangulation (comparing two or more separate sources of information to see where findings align) [218, 219].
References
- Rovai, A. et al (2014). Social Science Research Design and Statistics: A Practitioner’s Guide to Research Methods and SPSS Analysis. Watertree Press.
- Dumville, J. et al (2006). Reporting attrition in randomised controlled trials. British Medical Journal.
- Department for International Development (2014). How to note: Assessing the strength of evidence. UK Government.
- Mestas, J. & Hughes, C. (2004). Of mice and not men: Differences between mouse and human immunology. The Journal of Immunology.
- Chakradhar, S. (2019). It’s just in mice! This scientist is calling out hype in science reporting. Stat.
- McNutt, M. (2014). Reproducibility. Science.
- McNutt, M. (2014). Reproducibility. Science.
- Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science.
- Wegener, D. & Blankenship, K. (2007). Ecological Validity. Encyclopaedia of Social Psychology.
- Bryman, A. (2012). Social research methods. Oxford University Press.
- Bryman, A. (2012). Social research methods. Oxford University Press.
- Noble, H. & Smith, J. (2015). Issues of validity and reliability in qualitative research. Evidence-Based Nursing.
- Bryman, A. et al (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology.
- Golafshani, N. (2003). Understanding reliability and validity in qualitative research. The Qualitative Report.
- Mays, N. & Pope, C. (2000). Assessing quality in qualitative research. BMJ.
- Bryman, A. et al (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology.
- Golafshani, N. (2003). Understanding reliability and validity in qualitative research. The Qualitative Report.
- Leung, L. (2015). Validity, reliability, and generalizability in qualitative research. Journal of Family Medicine and Primary Care.
- Noble, H. & Smith, J. (2015). Issues of validity and reliability in qualitative research. Evidence-Based Nursing.
- Cohen, D. & Crabtree, B. (2008). Evaluative criteria for qualitative research in health care: Controversies and recommendations. Annals of Family Medicine.
- Noble, H. & Smith, J. (2015). Issues of validity and reliability in qualitative research. Evidence-Based Nursing.
- Bryman, A. et al (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology.
- Mays, N. & Pope, C. (2000). Assessing quality in qualitative research. BMJ.
- Bryman, A. et al (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology.
- Leung, L. (2015). Validity, reliability, and generalizability in qualitative research. Journal of Family Medicine and Primary Care.
- International Collaboration for Participatory Health Research (ICPHR) (2013). Position Paper 1: What is Participatory Health Research?
- Bryman, A. et al (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology.
- Bryman, A. et al (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology.
- Cohen, D. & Crabtree, B. (2008). Evaluative criteria for qualitative research in health care: Controversies and recommendations. Annals of Family Medicine.
- Mays, N. & Pope, C. (2000). Assessing quality in qualitative research. BMJ.
- Bryman, A. et al (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology.