Statistical analysis attempts to provide a quantitative overview of the data and can be used to assess how much the data matches educated guesses made by the researcher at the start of the study (hypotheses). Modelling uses the data collected to predict the way a system (such as an ecosystem or an economy) works or will work in the future. Qualitative analysis attempts to aggregate and explain the data collected and may also develop or test theories from the data. The sections below provide a very brief overview of these different forms of data analysis but do not cover the breadth and complexity of all different forms of analysis.

Statistical analysis

Standard statistical analysis requires a quantitative (numerical) dataset. Therefore, statistical analysis can only be carried out by studies that either used quantitative data collection or used qualitative data collection and later quantified the data. There are two main types of statistics: descriptive statistics and inferential statistics [146]. There are other types of statistics that will not be covered in this briefing, such as graphical statistics (ways of visualising data) and Bayesian statistics (an approach that adds and analyses new data over time to explore the probability of conclusions).

Descriptive statistics are numerical summaries of a dataset. They can include measures of central tendency (the average of the dataset, such as the mean, median or mode) and measures of variability (such as by how much most of the data differs from the average of the dataset, or what the range of values is in a dataset [147]). For example, a researcher might use descriptive statistics in a study looking at the effect of a new type of teaching on maths test scores. The researcher could randomly assign participants into two groups (control and intervention), give them a series of lessons (either control teaching or intervention teaching) and then get them to complete a maths test. Descriptive statistics can describe the dataset for the researcher. The researcher might find that students who received control teaching got 60/100 on average on the test, but that the variation was wide, with most students scoring between 45 and 75 (summarised as 60±15). However, the students who received intervention teaching got 70/100 on average on the test and the variation was quite narrow, with most students scoring between 65 and 75 (summarised as 70±5). These are useful summaries of the data. However, descriptive statistics cannot be used to make conclusions about the dataset. From these statistics alone, the researcher cannot be sure whether the intervention is a more effective way of improving maths test scores. For making these conclusions, researchers use inferential statistics.
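
As a rough illustration of the summaries described above, the sketch below computes a mean, median and standard deviation for two small sets of test scores. The scores are invented for illustration and are not data from any study. The function name describe is also an assumption made for this example.

```python
import statistics

# Hypothetical test scores out of 100 (invented for illustration only).
control_scores = [45, 50, 55, 60, 60, 65, 70, 75]
intervention_scores = [65, 68, 69, 70, 70, 71, 72, 75]

def describe(scores):
    """Return the mean, median and sample standard deviation of a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "sd": statistics.stdev(scores),
    }

control_summary = describe(control_scores)
intervention_summary = describe(intervention_scores)

for name, summary in [("control", control_summary), ("intervention", intervention_summary)]:
    print(f"{name}: mean={summary['mean']:.1f}, median={summary['median']:.1f}, sd={summary['sd']:.1f}")
```

The standard deviation is one common measure of variability. A summary such as 60±15 is often, though not always, a mean plus or minus a standard deviation.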

Inferential statistics are mathematical ways of forming conclusions from a dataset. In research studies, they are used for three main purposes:

  • Seeing which educated guesses (hypotheses) made by a researcher at the start of the study were accurate (such as seeing if an intervention teaching method produces higher maths test scores than a control teaching method).
  • Estimating from a dataset how much a certain phenomenon is likely to be present in a larger population (such as using opinion-poll data from a small group to predict the voting habits of the country’s population in a general election) [148].
  • Exploring relationships between factors (such as examining the correlation between household income and spending habits).

Inferential statistics used for the first purpose (hypothesis testing) usually require the researcher to have made some hypotheses (educated guesses) at the start of their study. There are two types of hypothesis. The null hypothesis is usually the assumption that there is no relationship between the variables or factors being studied (or, if comparing a new intervention to a current intervention, that the new intervention is the same as or worse than the current one) [149]. For example, in the maths test study, the null hypothesis would be that there is no relationship between the teaching received and the test scores achieved by the students. The alternative hypothesis is the educated guess that there is a relationship between the independent variable and the dependent variable (or in the case of an observational study, that there is a relationship between the factors being studied) [150]. For example, in the maths test study, an alternative hypothesis would be that the intervention teaching will have an effect on test scores compared to the control. Statistical tests can then be used to see if the data indicate that the null hypothesis should be rejected in favour of the alternative. There are many types of statistical test used for hypothesis testing and many considerations for selecting which test is appropriate. The types of statistical test and considerations will not be discussed in this briefing because there are many other introductions to statistics that cover these topics [151]. However, there are some considerations for all inferential statistics, discussed below.

Hypothesis testing results in a numerical output that tells the researcher if their results are statistically significant [152]. Results are statistically significant when they are unlikely to have been caused by chance. Researchers consider a probability value (p-value). This is the probability that the results found during the study could have occurred if the null hypothesis were true. For example, a p-value of 0.05 (or 5%) means there is a 5% probability of obtaining those results (or more extreme results) if the null hypothesis were actually true. A small p-value casts doubt on the null hypothesis being true and so researchers would reject the null hypothesis in favour of the alternative hypothesis. Researchers set a p-value at which they consider a result statistically significant. For many areas of research, results are considered significant if the probability that they (or more extreme results) occurred by chance is less than 5% (written as p<0.05). Some research (such as clinical research) may set more stringent levels, with results only considered significant if the probability is less than 1% (written as p<0.01).
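
To make the idea of a p-value more concrete, the sketch below estimates one with a permutation test. This is only one of many possible tests and is chosen here because it needs nothing beyond the Python standard library; the briefing does not prescribe it. The scores, the function names and the number of shuffles are all assumptions made for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by permutation.

    If the null hypothesis were true, the group labels would be interchangeable,
    so the labels are shuffled many times and we count how often the shuffled
    difference is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical test scores (invented for illustration only).
control = [45, 50, 55, 60, 60, 65, 70, 75]
intervention = [65, 68, 69, 70, 70, 71, 72, 75]

p_value = permutation_p_value(control, intervention)
print(f"estimated p-value: {p_value:.3f}")
print("significant at p<0.05" if p_value < 0.05 else "not significant at p<0.05")
```

Because the p-value is estimated from random shuffles, a different seed would give a slightly different estimate.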

Although statistical significance gives a strong indication of a relationship between the variables/factors being studied, it is only reliable if the sample for the study was selected appropriately and the research process was carefully controlled. There are also additional considerations when looking at the output of hypothesis testing. A major consideration for results that are statistically significant is that even at p<0.05, results this extreme would still be expected up to 5% of the time if the null hypothesis were true. Therefore, a study may reject a null hypothesis and accept an alternative hypothesis when the results are down to chance. This is known as a type I error [153]. If results are found not to be statistically significant (and the null hypothesis is therefore not rejected), the opposite error can occur. Sometimes a study may fail to reject a null hypothesis when results were not down to chance because the results did not cross the p<0.05 threshold. This is known as a type II error [154]. One way that studies may fail to reject a null hypothesis is through being underpowered, often through the number of participants/measurements in the study (sample size) being too small. Statistical power is the probability that the test rejects the null hypothesis when a specific alternative hypothesis is true. It can be calculated before the start of a study based on the level of significance, the potential effect size and the sample size [155]. An underpowered study is one where the combination of these three factors means it is likely that the null hypothesis will fail to be rejected even if an alternative hypothesis is true. The most common way to increase the power of a study (and reduce the likelihood of a type II error) is to increase the sample size.
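
The relationship between significance level, effect size, sample size and power can be sketched with a simple calculation. The example below uses a normal approximation for a two-sided comparison of two equally sized groups. This is a simplifying assumption and not the only way to calculate power. The function name and the effect size of 0.5 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation: the test statistic is treated as normally distributed
    with mean effect_size * sqrt(n_per_group / 2) and a standard deviation of 1.
    effect_size is the difference between group means in standard deviations.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region when the alternative is true.
    return (1 - z.cdf(critical - shift)) + z.cdf(-critical - shift)

for n in (20, 50, 100, 200):
    print(f"n per group = {n:3d}: power is about {approximate_power(0.5, n):.2f}")
```

Under these assumptions, power rises as the sample size grows, which is why increasing the sample size is the usual remedy for an underpowered study.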

In recent years there has been a rise in criticism of statistical significance. Researchers have noted that choosing a cut-off point for results to be deemed significant creates an arbitrary distinction [156]. For example, one study might result in a p-value of 0.049 and be considered statistically significant while another might have a p-value of 0.051 (a value that is only a tiny bit higher) and be considered not significant. This creates the situation where there may be multiple studies investigating the same thing that present a mixture of significant and non-significant results. This may be one of the reasons for the replicability crisis in some disciplines, where studies with significant findings are repeated and do not find significant results a second time [157]. A focus on a set p-value for significance may also increase the risk of p-hacking, where researchers collect, select or analyse data in a way that favours a ‘statistically significant’ result (either deliberately or accidentally) [158].

Key concept 7: Why should we care about the sample?

If a researcher wants to be able to apply their findings to people/situations outside of their study, then they must consider the sample of individuals within their study. There are two key things to consider: the number of people in the study (the sample size) and how well they reflect the population the sample has been drawn from (the representativeness of the sample). Researchers decide the sample size and sampling method based on what is practical/possible and the population that they wish to generalise their findings to. Sample size is the number of participants involved in the study. Sample sizes matter to researchers because, generally, the larger the sample drawn from the population, the more generalisable the findings of a study will be to that population. There are standard statistical calculations that say how accurate a result is from a sample of a particular size. However, it is not necessarily the case that a larger sample size is always better. An important consideration is whether there is a representative sample. Even if a sample is large, if it is unrepresentative then it may not give accurate results. If the sample is already large and representative then recruiting more people to join the study increases cost, use of resources and time without necessarily making the results more accurate. A representative sample reflects the variety in characteristics of the population it is drawn from. As discussed earlier, randomisation allows researchers to assume that randomly assigned groups reflect the variety of characteristics within a certain population (such as age, height or sex). Similarly, if a random sample of individuals is chosen from the population to be involved in a study then a researcher can assume that it is likely to be a representative sample. This sampling method is called random sampling. However, it is not always possible to randomly select participants from a population.
This could be because the study design makes it impossible or the situation makes it impractical/unethical. Other means of drawing a sample can introduce bias. Selection bias is caused by participants not being chosen in a way that reflects the particular population being studied, meaning that the results of the study could be affected by extraneous variables. Different ways of sampling participants from a population can introduce different levels of bias. Stratified sampling divides a population of interest into different sub-populations based on certain characteristics (such as age group, sex or income level). Participants in each of the sub-populations are then randomly chosen and invited to be part of the study until there are enough people recruited from each of the sub-populations. Because there is randomisation in this sampling method, it is close to random sampling in limiting bias as it increases the likelihood that various characteristics in the population are reflected in the overall sample. Cluster sampling selects participants for a study based on their geographical location. For example, a cluster sample might involve an entire local school or a whole local hospital in the study. Cluster sampling is a simple and cost-effective way to recruit lots of participants. However, because the sample is only drawn from one location, it may not be representative of the population as a whole. For example, a local school may have a particularly high number of children from poor backgrounds or a local hospital may have patients with a high average age. Similar problems are presented by opportunity sampling (where a researcher recruits participants at a particular location and time, such as conducting a survey with shoppers at a supermarket on a Saturday morning) and snowball sampling (where participants recruit people that they know to take part in the study who then recruit/nominate other people that they know).
These sampling methods may result in a sample that is not representative of the population they are trying to research. For example, the type of people doing their shopping on a Saturday morning may be different from those doing their shopping the rest of the week. Equally, the characteristics of a group who all know each other may differ from the wider population. Another way people can join a study is voluntary sampling. Voluntary sampling recruits participants through advertising the study and the sample is then made up of people who volunteer to take part. Voluntary sampling is vulnerable to self-selection bias and participation bias where those who do (and do not) put themselves forward for a study may differ from the overall population, meaning that the sample is not representative of the wider population.
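
The difference between random sampling and stratified sampling can be sketched in a few lines. In the example below, the population, the age groups and the function names are all hypothetical. The stratified sample also assumes that each sub-population is sampled in proportion to its share of the population, which is one common choice rather than a requirement.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 people tagged with an age group.
population = (
    [{"id": i, "age_group": "18-34"} for i in range(0, 300)]
    + [{"id": i, "age_group": "35-64"} for i in range(300, 800)]
    + [{"id": i, "age_group": "65+"} for i in range(800, 1000)]
)

def simple_random_sample(people, size):
    """Every individual has the same chance of being selected."""
    return rng.sample(people, size)

def stratified_sample(people, size, key):
    """Randomly sample within each sub-population, in proportion to its share."""
    strata = {}
    for person in people:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        quota = round(size * len(members) / len(people))
        sample.extend(rng.sample(members, quota))
    return sample

random_sample = simple_random_sample(population, 100)
strat_sample = stratified_sample(population, 100, "age_group")

print("simple random:", Counter(p["age_group"] for p in random_sample))
print("stratified:   ", Counter(p["age_group"] for p in strat_sample))
```

The stratified sample matches the population's age mix by construction, while the simple random sample only matches it approximately.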

Even when a result is considered significant, the actual difference that an intervention makes compared to a control (the effect size) may be very small [159]. For example, a study looking at whether cardiovascular exercise once a week results in greater weight loss over a year than not exercising at all might find that participants who exercised did lose ‘statistically significantly more weight’. However, the actual amount of weight that they tended to lose compared to the control group might be 0.1kg. This would indicate that although the results were statistically significant (p<0.05), the effect size was very small. Researchers need to think about not only the statistical significance of their results, but also the relevance to the real world, which is based on the effect size.
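
One widely used way to express effect size is a standardised mean difference (often called Cohen's d), which divides the difference between group means by their pooled standard deviation. The sketch below is illustrative only: the briefing does not specify this measure, and the weight-loss figures are invented.

```python
import statistics
from math import sqrt

def cohens_d(group_a, group_b):
    """Standardised mean difference (Cohen's d) using a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical weight loss in kg over a year (invented for illustration only).
exercise = [0.6, 1.1, 1.6, 2.1, 2.1, 2.6, 3.1, 3.6]
no_exercise = [0.5, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5]

raw_difference = statistics.mean(exercise) - statistics.mean(no_exercise)
effect_size = cohens_d(exercise, no_exercise)
print(f"raw difference: {raw_difference:.2f} kg")
print(f"Cohen's d: {effect_size:.2f}")
```

Reporting the raw difference alongside a standardised measure helps readers judge whether a result matters in practice as well as statistically.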

As well as hypothesis testing, inferential statistics can be used to predict from a sample how widespread a certain characteristic or phenomenon would be in a particular population. For example, inferential statistics allow a researcher to predict the voting intentions for a national population from a survey of 10,000 people. With this type of inference, having a sample that is representative of the population as a whole is key. When using data from samples to make inferences on a population level, it is important to consider the level of uncertainty. Research is rarely 100% certain. Every time a researcher runs a study on a sample, there will be slightly different results, creating a level of uncertainty. Results are usually reported with a 95% confidence interval. For example, in a survey looking at voting intention, this confidence interval would mean that if the researcher ran the survey 100 times, they would expect the range they calculated to include the true voting intention in about 95 of those 100 surveys. Statistics can also be used to generalise from a sample to a whole population when predicting risk.
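
A 95% confidence interval for a poll result can be approximated as follows. This sketch uses the normal-approximation (Wald) interval for a proportion and assumes a simple random sample. The poll figures and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical poll: 4,200 of 10,000 respondents intend to vote for party A.
low, high = proportion_confidence_interval(4_200, 10_000)
print(f"estimated support: 42.0% (95% CI {low:.1%} to {high:.1%})")
```

The interval narrows as the sample size grows, but it only reflects sampling uncertainty. It does not correct for an unrepresentative sample.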

Key concept 8: What exactly is the risk?

Medical research studies often try to estimate the risk of conditions in a whole population using statistics, and may also research factors that increase this risk. These studies can produce two types of statistic: absolute risk and relative risk [160]. Absolute risk predicts the probability of an individual experiencing a particular event (such as developing a certain medical condition) in their lifetime. For example, the absolute risk of a woman developing breast cancer is 12.5% [161]. This means that out of every 1000 women, 125 will develop breast cancer in their lives (or, put another way, 1 in every 8 women). Relative risk is a statistic based on absolute risk that provides an indication of how much something raises the risk of experiencing a particular event. For example, a study could compare the absolute risk of developing breast cancer for women who do not drink alcohol and women who have one alcoholic drink a day. The absolute risk for teetotal women is 11.1% and the absolute risk for women who have one alcoholic drink a day is 11.7% [162]. That is to say that 111 out of 1000 teetotal women will develop breast cancer and 117 out of 1000 women who have one alcoholic drink a day will develop breast cancer. Relative risk compares those two absolute risks to give an indication of how much a factor increases/decreases risk. As 117 is 5% higher than 111, this indicates that having one drink a day increases the risk of developing breast cancer by 5% [163]. This 5% is the relative risk of having one drink a day. Knowing both the absolute risk and the relative risk is key for interpreting these statistics, as relative risk alone can be misleading. For example, a newspaper might publish a finding that wearing a red hat every day increases the risk of being attacked by a bird of prey by 400%. This may sound alarming, unless you knew that the absolute risk of ever being attacked by a bird of prey was 0.0001% (one in a million).
That would mean that your absolute risk of being attacked by a bird of prey if you wore a red hat every day would increase to 0.0005%. This would mean that of every million people who wore a red hat every day, only five of them would ever be attacked by a bird of prey. A huge increase in a risk that is very small still results in a small overall risk. Alternatively, if the underlying risk is very big, even just a small relative increase can result in a large increase overall. Understanding the difference between absolute risk and relative risk and knowing how they interrelate is often essential when scrutinising claims made in academia and the media. For example, the NHS ‘Behind the Headlines’ service often reviews health stories reported by the media to explain risk in relatable ways [164].
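
The arithmetic linking absolute and relative risk can be written out as a short sketch. The two absolute risks are the breast cancer figures quoted in this key concept; the function names are assumptions made for this example.

```python
def relative_increase(baseline_risk, exposed_risk):
    """Relative change in risk, as a fraction of the baseline risk."""
    return (exposed_risk - baseline_risk) / baseline_risk

def absolute_increase(baseline_risk, exposed_risk):
    """Difference between the two absolute risks."""
    return exposed_risk - baseline_risk

# Absolute risks quoted in the text above.
teetotal = 0.111
one_drink_a_day = 0.117

rel = relative_increase(teetotal, one_drink_a_day)
abs_diff = absolute_increase(teetotal, one_drink_a_day)
print(f"relative increase: {rel:.1%}")
print(f"absolute increase: {abs_diff:.1%} (about {abs_diff * 1000:.0f} extra cases per 1,000 women)")
```

Presenting both numbers together shows how a relative increase of around 5% corresponds to roughly 6 additional cases per 1,000 women.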

Modelling

Modelling can be carried out to test hypotheses, often alongside other forms of analysis. It can also be used to develop hypotheses. Sometimes in research it is not possible (or highly impractical) to measure something directly. In these instances, models may be used to predict the effect of a particular factor on a system [165, 166]. Models can be used in many different research areas, including ecology, engineering, astrophysics and economics [167, 168]. They can use quantitative or qualitative data. Some instances when models can be used include:

  • If a researcher is interested in the effect of a small factor on a very large system. For example, a researcher might want to know how a sea-level rise of 3 millimetres affects a coastal ecosystem.
  • If a researcher is interested in the effect of an event but creating a situation where the event occurs would be unethical or impossible. For example, a researcher might want to know how the extinction of a species could affect the ecosystem it lives in, but it would be unethical to cause an extinction to occur.
  • If a researcher is interested in the way that a system will develop in the future, given changes in one factor or developments in other interdependent systems (particularly when those systems are complex). For example, researchers make predictions on future levels of global warming for a given amount of future greenhouse gas emissions.

Research questions such as those above can be answered with the help of modelling. Modelling uses data that have been collected by researchers to create a simplified computer simulation of the system being investigated [169-171]. For example, a researcher could use different datasets collected by other studies to investigate how the extinction of a species would affect an ecosystem. They might have access to an observational dataset that spans many decades and records the estimated numbers of each type of species in an area. They might also have a separate dataset with detailed environmental data for the same time period (such as average daily rainfall and average temperature). Using these datasets, they might be able to build a model that describes the relationships between the numbers of each type of species, along with how overall species numbers are affected by environmental factors. Once a researcher has used these data to build a model, they can then use that model to create predictions. For example, they could predict how an increase in average daily rainfall would affect the number of all types of species. If the data used as input for the model show that high rainfall usually decreased the numbers of all species in the past, the model would likely predict that increased rainfall in the future would decrease the numbers of all species. The model could also predict how the extinction of one species might affect the other species in the ecosystem. If the data used as input for the model show that a reduction in species A was usually followed by a decrease in predator B and an increase in prey C, the model would likely predict that the extinction of species A would be followed by a decrease in predator B and an increase in prey C. The model might also be able to make predictions about how species numbers and environmental factors interact.
For example, it might predict larger decreases in predator B if species A went extinct and there was concurrently a high level of daily rainfall.
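
A very simple version of this kind of model can be sketched as a straight-line fit between rainfall and species numbers. Real ecological models are far more complex; the data, variable names and the choice of a least-squares line are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical yearly observations (invented for illustration only).
rainfall_mm = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
species_count = [120, 115, 111, 104, 100, 96, 89]

slope, intercept = fit_line(rainfall_mm, species_count)

def predict_species_count(rainfall):
    """Predict the species count for a given average daily rainfall."""
    return intercept + slope * rainfall

print(f"slope: {slope:.1f} individuals per extra mm of daily rainfall")
print(f"predicted count at 5.5 mm: {predict_species_count(5.5):.0f}")
```

Because higher rainfall went with lower counts in the input data, the fitted model predicts a further decrease when rainfall rises beyond the observed range.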

Models are often used in research studies and can sometimes be cited (for example, in the media) without a complete understanding of their limitations, particularly when they attempt to describe future developments. For example, from March 2019 news outlets began reporting on the predictions of an economic model [172]. A model looking at Government bonds in the USA over the past 60 years showed that, most of the time, investors could get better interest rates (and higher potential yields) if they lent money to the Government with a long-term bond of 10 years than if they lent money to the Government with a short-term bond of 1 year [173]. This is a standard yield curve. However, occasionally, investors could get better interest rates (and higher potential yields) if they lent money to the Government with a short-term bond of 1 year than if they lent money to the Government with a long-term bond of 10 years [174]. This is an inverted yield curve. What the model also showed was that when a standard yield curve switched to an inverted yield curve for three months, there was often a subsequent recession. This model predicts that if an inverted yield curve occurs in the future, it might signal that a recession will follow [175]. Therefore, when there was a switch from a standard yield curve to an inverted yield curve in March 2019, commentators suggested this was the prediction of a recession [176]. However, a prediction such as this cannot be seen as certain; no model is a perfect predictive tool and there are several limitations to drawing conclusions from them.
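
The yield curve signal described above can be expressed as a simple rule: the curve is inverted in any month where the 1-year yield is higher than the 10-year yield. The sketch below checks for three consecutive inverted months. It is a deliberately simplified illustration and not the model used in the cited research; the yields and function names are hypothetical.

```python
def months_inverted(ten_year_yields, one_year_yields):
    """Flag each month in which the short-term yield exceeds the long-term yield."""
    return [short > long for long, short in zip(ten_year_yields, one_year_yields)]

def has_sustained_inversion(inverted_flags, run_length=3):
    """Return True if the curve is inverted for run_length consecutive months."""
    run = 0
    for inverted in inverted_flags:
        run = run + 1 if inverted else 0
        if run >= run_length:
            return True
    return False

# Hypothetical monthly yields in percent (invented for illustration only).
ten_year = [2.9, 2.8, 2.7, 2.6, 2.5, 2.4]
one_year = [2.6, 2.6, 2.7, 2.7, 2.6, 2.5]

flags = months_inverted(ten_year, one_year)
print(flags)
print("sustained inversion:", has_sustained_inversion(flags))
```

A rule like this only flags a pattern that has often preceded recessions in the past. It does not establish that a recession will follow.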

These limitations are also illustrated by the use of Integrated Assessment Models (IAMs), which attempt to model the global economy, energy, land use, and climate systems to inform international climate change policy. These very complex models are used to predict hundreds of potential ‘pathways’ of possible future global warming and climate change. IAM pathways are intended to generate discussion and understanding of possible futures, but are not able to discern how likely, or feasible, a pathway is. They may for example seem to suggest that large amounts of uncertain or risky technology could be deployed in future to prevent dangerous climate change, which has led to the criticism that policy-makers may view them as a prescriptive tool [177]. However, IAMs, and other large predictive models, are merely illustrative, and usually require wider research evidence on the feasibility, costs or other aspects of their pathways, for making policy decisions.

As explained above, models are simplified versions of complex systems. Therefore, they rarely contain all the necessary data to make accurate predictions [178-180]. Similarly, models are not inherently objective because the information that is included in a model will usually need to be selected by the researcher. They are built with assumptions from the researcher about what may be important to include or exclude, or what the values for certain parameters within the model should be [181]. Therefore, it is possible to build a model that is biased and creates predictions to fit with an individual’s personal opinions (although there are ways of reducing this bias, such as using statistics to test which factors should and should not be included). Models may also be incomplete or flawed because of the data used as input [182]. The data may not have been collected in a rigorous way (see above) or there may be relatively little data available. Generally, models that are making predictions about smaller systems (such as an ecosystem in a fish tank) are considered more reliable than models making predictions about larger and more complex systems (such as the ecosystem of an ocean), especially where there are limited data. Finally, there are different ways to ‘train’ a model to produce predictions, using a type of machine learning. Some models are trained by a computer ‘learning’ rules by examining and re-examining large amounts of data in a way that is intended to mimic or outperform human learning (known as artificial neural networks) [183]. Other models are ‘trained’ by the researcher making decisions on what is, and is not, an important relationship in the data. If the latter way of training is used, this can again add bias into the process. 
There are ways of evaluating the reliability of models and these include the ability to accurately predict past events (especially if the model has not been trained on that set of data) and the ability to accurately predict future events [184].
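
One way to check a model against data it has not been trained on is to hold back part of the dataset. The sketch below fits a straight line to earlier observations and measures its average error on later, held-out observations. The data, the split point and the error measure (mean absolute error) are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(actual, predicted):
    """Average size of the gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical observations (invented for illustration only).
rainfall = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
species = [120, 115, 111, 104, 100, 96, 89, 86, 80, 74]

# Train on the first seven observations and hold back the last three.
train_x, test_x = rainfall[:7], rainfall[7:]
train_y, test_y = species[:7], species[7:]

slope, intercept = fit_line(train_x, train_y)
predictions = [intercept + slope * x for x in test_x]
held_out_error = mean_absolute_error(test_y, predictions)
print(f"mean absolute error on held-out data: {held_out_error:.2f}")
```

A small error on held-out data is encouraging, but it does not guarantee that the model will stay accurate under conditions outside the range of the data.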

Qualitative analysis

Qualitative data analysis develops and tests hypotheses by exploring a dataset (such as diaries, collections of pictures or transcripts of interviews). Some of the most common forms of qualitative analysis include analytic induction and thematic analysis [185].

Analytic induction involves a researcher reviewing a dataset and developing hypotheses. The researcher then collects more data and looks for any cases that do not fit the hypotheses. If there are cases that do not fit, the hypotheses are reformulated. The researcher then collects more data and looks for any cases that do not fit the reformulated hypotheses. This process continues until there are no cases that do not fit the hypotheses [186]. For example, a researcher might read through written accounts by homeless individuals to develop a hypothesis on what life events can trigger homelessness. They might develop a hypothesis that homelessness can be triggered by losing employment after reading ten cases. However, after reading more cases they might find other life events (such as relationship breakdown or eviction) may also trigger homelessness. Data collection would stop once there were no new life events found that appear to trigger homelessness. A key limitation in analytic induction is that data collection stops once all cases match the hypothesis; however, this does not mean that there are not other cases that would alter the hypothesis, it just means that these cases were not found.

Thematic analysis also develops hypotheses and collects new data throughout the process. The purpose of thematic analysis is to identify, analyse and interpret patterns within data. One of the ways that thematic analysis does this is through coding the data. Coding is a multi-step process where data (such as written autobiographical statements) are reviewed by a researcher and concepts and phrases that appear key to the researcher are highlighted and moved into sub-categories and categories [187]. This can be done manually or with the help of computer software that can tag and categorise data based on a researcher’s input. For example, a researcher reading interview transcripts of people who have survived a serious illness might notice that certain key words keep appearing in the data, such as ‘battle’, ‘fight’, ‘race’ or ‘journey’. They might initially code these at the word-level so there is a code for ‘battle’ and one for ‘fight’ and so on. They may then notice that there are sub-categories of comparing illness to war, sports and travel and move these codes into those sub-categories. After reviewing the sub-categories, the researcher may decide that these belong in the overall category of metaphors of illness. The codes and categories used are not decided in advance of coding but develop during the process and may also change over time. After initial coding and categorising, researchers identify the themes and relationships between categories and begin to develop hypotheses relating to them. For example, when looking at the metaphors of illness, the researcher may notice that this category occurs very often with another category about positive mindset and may develop a hypothesis as to how these two concepts interrelate. More data may be collected at this stage to test or back up the hypotheses.
Data collection and coding will usually stop when the researcher believes they have reached theoretical saturation (the point at which more coding/data collection is unlikely to further develop the hypotheses) [188]. As with analytic induction, a key limitation with thematic analysis is that there may be cases that would disprove or alter the hypothesis that were not found by the researcher.
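
The way software can support coding can be sketched with a small example that tags key words and rolls them up into sub-categories and a category. The mappings below mirror the illness metaphor example above, but the transcripts, names and matching rule are hypothetical. In practice the researcher's judgement drives the coding, and the mappings develop during analysis instead of being fixed in advance.

```python
from collections import Counter

# Hypothetical mappings built up by a researcher during coding.
code_to_subcategory = {
    "battle": "war",
    "fight": "war",
    "race": "sport",
    "journey": "travel",
}
subcategory_to_category = {
    "war": "metaphors of illness",
    "sport": "metaphors of illness",
    "travel": "metaphors of illness",
}

def code_transcript(text):
    """Return the word-level codes found in one transcript."""
    words = [word.strip(".,;:!?'\"").lower() for word in text.split()]
    return [word for word in words if word in code_to_subcategory]

# Invented transcript extracts (for illustration only).
transcripts = [
    "It was a long journey, and every day felt like a battle.",
    "I decided to fight it. It was a race against time.",
]

subcategory_counts = Counter()
for transcript in transcripts:
    for code in code_transcript(transcript):
        subcategory_counts[code_to_subcategory[code]] += 1

category_counts = Counter()
for subcategory, count in subcategory_counts.items():
    category_counts[subcategory_to_category[subcategory]] += count

print(subcategory_counts)
print(category_counts)
```

Counts like these can point a researcher towards recurring themes, but interpreting what the themes mean remains a qualitative judgement.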

Qualitative analysis, by its nature, emphasises subjectivity and context in analysing data. Qualitative research recognises that researchers approach their work with their own experiences/opinions. Researchers can see a lack of objectivity as inevitable, or even essential, in their work. Many qualitative studies also include information about potential influences on the researcher’s objectivity and can also include a self-reflective section where the researcher considers how their analysis may have been influenced by their experiences and opinions.

References

  146. Towards Data Science. Statistics: Descriptive and inferential.
  147. Towards Data Science. Descriptive statistics.
  148. Towards Data Science. Inferential statistics for data science.
  149. Towards Data Science. Inferential statistics for data science.
  150. Towards Data Science. Inferential statistics for data science.
  151. Government Statistical Service (2017). Statistics for policy professionals: Things that you need to know. UK Government.
  152. Towards Data Science. Statistical significance explained.
  153. Banerjee, A. et al (2009). Hypothesis testing, type I and type II errors. Industrial Psychiatry Journal.
  154. Banerjee, A. et al (2009). Hypothesis testing, type I and type II errors. Industrial Psychiatry Journal.
  155. Jones, S. et al (2004). An introduction to power and sample size estimation. Emergency Medicine Journal.
  156. Amrhein, V. et al (2019). Scientists rise up against statistical significance. Nature.
  157. Camerer, C. et al (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour.
  158. Ioannidis, J. (2019). What have we (not) learnt from millions of scientific papers with p-values? The American Statistician.
  159. Sullivan, G. & Feinn, R. (2012). Using effect size – or why the p-value is not enough. Journal of Graduate Medical Education.
  160. Sedgwick, P. (2012). Absolute and relative risks. British Medical Journal.
  161. Cancer Research UK. Behind the headlines: Low-level alcohol drinking and breast cancer.
  162. Cancer Research UK. Behind the headlines: Low-level alcohol drinking and breast cancer.
  163. Cancer Research UK. Behind the headlines: Low-level alcohol drinking and breast cancer.
  164. NHS. Behind the headlines.
  165. Encyclopaedia Britannica. Scientific Modeling.
  166. IMF. What are economic models?
  167. Encyclopaedia Britannica. Scientific Modeling.
  168. IMF. What are economic models?
  169. Encyclopaedia Britannica. Scientific Modeling.
  170. IMF. What are economic models?
  171. Grüne-Yanoff, T. (2009). Learning from Minimal Economic Models. Erkenntnis.
  172. BBC. Are markets signalling that a recession is due?
  173. Bauer, M. & Mertens, T. (2018). Economic forecasts with the yield curve. Economic Letters.
  174. Bauer, M. & Mertens, T. (2018). Economic forecasts with the yield curve. Economic Letters.
  175. Bauer, M. & Mertens, T. (2018). Economic forecasts with the yield curve. Economic Letters.
  176. BBC. Are markets signalling that a recession is due?
  177. Anderson, K. & Jewell, J. (2019). Debating the bedrock of climate-change mitigation scenarios. Nature News and Views Forum.
  178. Oberkampf, W. et al (2002). Error and uncertainty in modeling and simulation. Reliability Engineering & System Safety.
  179. Sciencing. Limitations of models in science.
  180. Grüne-Yanoff, T. (2009). Learning from Minimal Economic Models. Erkenntnis.
  181. Sciencing. Limitations of models in science.
  182. Sciencing. Limitations of models in science.
  183. Stanford University. Training computer models to accurately simulate nature’s variability.
  184. Grüne-Yanoff, T. (2009). Learning from Minimal Economic Models. Erkenntnis.
  185. Bryman, A. (2012). Social research methods. Oxford University Press.
  186. Bryman, A. (2012). Social research methods. Oxford University Press.
  187. Bryman, A. (2012). Social research methods. Oxford University Press.
  188. Bryman, A. (2012). Social research methods. Oxford University Press.