Impact of statistical significance and sample size on conclusions in sports science research – an analysis on the example of the relative age effect

Authors: Ib K. Keune

¹Department of Sport Science, Humboldt-Universität zu Berlin, Berlin, GER

Corresponding Author:

Ib K. Keune, M.Ed., PT
Philippstraße 13
10115 Berlin
ib.keune@hu-berlin.de
017681194891

Ib K. Keune is a doctoral student in sport sociology at the Humboldt-Universität zu Berlin. His research interests include the relative age effect in sports and its interaction with factors of social inequality, statistics and methods in sport research, and applied ethics in sports.

ABSTRACT

Purpose: The null hypothesis significance test (NHST) is a commonly applied statistical method for detecting effects in science, despite repeated criticism. Detractors argue that by focusing exclusively on NHST results, scientists fail to consider descriptive results, potentially leading to misinformed policy makers. They also point out that the influence of sample size on statistical power is often overlooked. This paper investigates whether this critique holds true in sport science research by analyzing the conclusions in publications about the relative age effect (RAE), an effect manifested in biased birth date patterns. Method: In an extensive content analysis, 7,247 samples listed in 647 sources were recorded and analyzed using binary logistic regression. Results: Findings show discrepancies between NHST results and birth patterns. Authors in RAE research rely more heavily on NHST results than on birth patterns to draw their conclusions regarding the presence of an RAE. In addition, findings indicate that NHST results are influenced by sample size, birth pattern, and the interaction of both. This interaction leads to an RAE being suspected more often in large samples than in small samples, even though birth patterns are more evenly distributed in large samples. Conclusion: As large samples are more likely to represent recreational sport and small samples are more likely to represent elite sport, the strong orientation towards NHST results for conclusions can lead to misinformation about the location of substantial RAEs. Applications in Sport: Similar reliance on NHST results and potential misinformation are also to be expected in other topics in sport research, where characteristics like elite status tend to accumulate in certain sample sizes. Decision-makers in sport should contextualize research findings. Researchers should use NHST appropriately and carefully and combine it with other statistical measures.

Key Words: null hypothesis significance testing, birthdate effect, metascience

INTRODUCTION

The use of null hypothesis significance testing (NHST), or at least certain ways of applying NHST, has been continuously criticized (1-14). Recently, critics have argued that, contrary to the original intention of the creators of NHST (15-19), NHST results are often dichotomized into significant and non-significant, and these results are uncritically taken as evidence supporting the null or an alternative hypothesis (2, 20-26). Another problem identified in the context of NHST is the role of sample size and its importance for statistical power, which, if left unaddressed, may lead to questionable results and conclusions (3, 18, 22, 27-31). Amrhein et al. (20) began their comment on NHST by describing a presentation in which an apparent difference between two groups was dismissed for lack of statistical significance, and they articulated concerns that errors in NHST lead to misinformed policy decisions. Inspired by this comment, this paper addresses the question of whether conclusions in research are predominantly based on NHST results and whether they differ from the implications of purely descriptive data. It also investigates whether sample size plays a role in determining statistical significance and the attribution of effects in published research, and whether there is evidence that the conclusions drawn may be misleading for policy makers.

The effect of age on group composition or on the performance of individuals in a group is called the Relative Age Effect (RAE). The RAE is used here as an example to illustrate the effect of the use of NHST within defined research parameters. It is assumed that it is only in supposedly age-homogeneous groups, such as school classes or junior teams in sport, that the relatively older participants are advantaged. This advantage is linked to biased birth patterns in these groups, which tend to have more older than younger individuals. For example, in the current German Football Association (DFB) Under-21 team, 17 players were born in the first half of the year and only 5 in the second half (32). The field of RAE research was also chosen because there have been discussions about NHST in this scientific field (33-34), including the argument that NHST is often not necessary in RAE research because population data are mostly used in situations where inference is not required (35).

To determine whether NHST results lead to misleading conclusions that differ from the implications of the descriptive birth pattern and whether sample size has an effect on research results, 647 publications have been analyzed to clarify whether a) the NHST is used in RAE research, b) conclusions about the RAE are based more on the NHST results or on the descriptive pattern, c) there is a discrepancy between the descriptive pattern and the NHST results, d) the sample size has an impact on the NHST results and the RAE conclusions, and finally e) the NHST (at least in this case) has the potential to misinform policy makers by pointing to different groups requiring public assistance than the descriptive patterns would suggest.

METHODS

Sample

Between December 2018 and December 2019, publications on RAE in sport were searched for using the search engines “Google Scholar”, “ResearchGate” and “PRIMUS”. In addition, all references cited in the retrieved publications were followed up, as were all publications by the identified authors that referred to the RAE.

The following publications were excluded: (i) Publications that deal with the Constituent Year Effects or Constant Year Effects, which are effects of different birth years on the selection of groups. Although these effects are often equated with the RAE, they are not the same, but they can interact with the RAE (36). (ii) Publications that deal with the RAE in groups that dropped out from sport. (iii) Publications that deal with the RAE as an effect of relative age on performance and not in terms of birth distributions.

The literature search identified 720 publications on the RAE published between 1973 and 2019, 647 of which were accessible. These included 434 articles in scientific journals, 155 conference papers or contributions to collective volumes, 18 doctoral, master’s or bachelor’s theses, and 41 other sources such as sports journals, blogs, or publications by sports associations. In an extensive content analysis, 7,247 samples listed in these sources were recorded. All data are available on request.

Variables

For the content analysis of this research project, a category system was developed to capture the following variables from each sport group mentioned in the texts: the conclusions of the respective authors as to whether an RAE is present or not; descriptive birth distribution; results of NHST; and sample size, as well as other characteristics such as type of sport and level of performance. The authors’ conclusions were categorized as “RAE”, “no RAE”, “not categorizable” and “others/no conclusion”. The conclusions “RAE” and “no RAE” were subjected to primary analysis; “not categorizable” conclusions became part of an iteration analysis. The results of this second analysis remained the same or became even more extreme (data available on request). NHST results were categorized as “significant”, “non-significant”, “both” and “others/no results”. Results categorized as significant and non-significant were included in the analysis. Group sizes were summarized as n<100, n=100-400, n=400-1000 and n>1000.

Sample size was handled as a categorical variable because of the varying characteristics of different sized samples (national team-based data analysis generally relies on small sample sizes, while association-based data analysis relies on large sample sizes) and the unequal distribution of these samples. An iteration analysis with sample size as a metric variable was conducted, which did not alter the results (data available on request). Descriptive birth pattern was coded as a categorical variable, categorizing the half-year birth distribution as “even” (up to 2% deviation from the reference population or an assumed uniform distribution), “small bias” (up to 10% deviation), “strong bias” (up to 30% deviation) and “extreme bias” (up to 50% deviation). All contributions were coded, and the results were entered into an SPSS database. If the same sample was studied several times in the literature, it was only recorded once. The full sample (N=7,124) was analyzed to determine whether NHST plays an important role in RAE research. All the following analyses used only samples containing NHST results, birth patterns and authors’ conclusions regarding RAE (N=3,053).
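The coding rules described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure used to build the SPSS database: the function names are hypothetical, "deviation" is assumed to mean the absolute difference in percentage points between the observed and the reference share of births in the first half of the year, and the handling of category boundaries (e.g., whether n = 400 falls into the second or third group) is likewise an assumption.

```python
def code_birth_pattern(first_half, second_half, reference_share=0.5):
    """Return the bias category for a half-year birth distribution.

    Assumption: deviation = absolute difference, in percentage points,
    between the observed and the reference first-half share.
    """
    share = first_half / (first_half + second_half)
    # Rounding guards against floating-point noise at the thresholds.
    deviation = round(abs(share - reference_share) * 100, 9)
    if deviation <= 2:
        return "even"
    if deviation <= 10:
        return "small bias"
    if deviation <= 30:
        return "strong bias"
    return "extreme bias"


def code_sample_size(n):
    """Return the sample-size category (boundary handling is an assumption)."""
    if n < 100:
        return "n<100"
    if n <= 400:
        return "n=100-400"
    if n <= 1000:
        return "n=400-1000"
    return "n>1000"


# Under-21 example from the introduction: 17 vs. 5 players.
u21_pattern = code_birth_pattern(17, 5)
u21_size = code_sample_size(17 + 5)
print(u21_pattern, u21_size)
```

Under these assumptions, the Under-21 example from the introduction (17 vs. 5 players) would be coded as a strong bias in a sample of the smallest size category.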

Data Analysis

1) To determine whether NHST plays an important role in RAE research, the percentage of samples within the database which contain NHST-based results was determined.

2) Additional analysis was conducted to determine whether conclusions in RAE research are in general based on NHST results or on the descriptive data associated with the sample by performing binary logistic regression and comparing the impact of biased birth patterns and NHST results (independent variables) on RAE attribution (dependent variable “conclusion: RAE”, reference: conclusion: “no RAE”). The results of this analysis were further visualized using descriptive statistics.

3) Analysis was then conducted to determine whether there were discrepancies between the NHST results and the descriptive data, as NHST results would not be misleading if they produced the same conclusion as the descriptive data. Discrepancies are defined in this case as strong biases in birth patterns appearing as statistically non-significant and supposedly even birth patterns appearing as statistically significant. To determine this, descriptive data were compared with NHST results.

4) Sample size, birth pattern and interaction terms of both (independent variables) were calculated and analyzed by performing binary logistic regression, to determine the effect of sample size in relation to birth pattern on NHST results (dependent variable: “statistically significant NHST result”, reference: “statistically non-significant NHST result”) and to determine whether sample size acts as a moderator of NHST results.

5) Analysis was then conducted to determine whether competition level is spread unevenly across sample sizes and, if so, whether this leads to potential misinformation about the competition level in which the RAE predominantly appears.

All statistical analyses were performed using IBM SPSS Statistics Version 25 (Armonk, NY: IBM Corp). The significance level was set at α = .05 for all statistical procedures.
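The definition of a discrepancy used in step 3 can be expressed as a simple rule. The following Python sketch is only an illustration of that rule, not the SPSS procedure used in the study; the function name and the example records are hypothetical, and it is assumed that "strong biases" in the definition also covers extreme biases.

```python
def is_discrepant(birth_pattern, nhst_result):
    """Flag a sample whose NHST result contradicts its descriptive pattern."""
    # Assumption: extreme biases are treated like strong biases here.
    biased_but_non_significant = (
        birth_pattern in ("strong bias", "extreme bias")
        and nhst_result == "non-significant"
    )
    even_but_significant = (
        birth_pattern == "even" and nhst_result == "significant"
    )
    return biased_but_non_significant or even_but_significant


# Hypothetical coded samples (birth pattern, NHST result).
samples = [
    ("strong bias", "non-significant"),
    ("even", "significant"),
    ("small bias", "significant"),
    ("extreme bias", "significant"),
]
flags = [is_discrepant(pattern, result) for pattern, result in samples]
print(flags)
```

Small biases are deliberately left unflagged in this sketch, since the definition above names only strong biases and even distributions.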

RESULTS

Extent of use of NHST in RAE research

NHSTs were used in 83.8% of the 647 publications. NHST results were given for 71.9% of the 7,247 samples listed in the publications.

Association of RAE with birth pattern and NHST results

Small (OR=2.087), strongly (OR=9.613) and extremely biased birth patterns (OR=10.725) each had a statistically significant impact on the conclusion that an RAE exists (Table 2). A statistically significant NHST result had a significant and far more pronounced effect on conclusions on the existence of an RAE (OR=2332.146). Table 3 and Table 4 show conclusions regarding the existence of RAEs separated by statistical significance. The tables also show that both NHST results and birth patterns affect the conclusions regarding RAE, but NHST results have a stronger impact. A statistically significant NHST result (Table 3) nearly determines a positive conclusion on the existence of an RAE, independently of the birth pattern associated with the sample.
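For readers less familiar with odds ratios, the following Python sketch shows how an unadjusted odds ratio is obtained from a 2×2 table. The counts are invented for illustration and are not taken from the database. The odds ratios reported above stem from a logistic regression that includes birth pattern and NHST result together, so they cannot be reproduced this way.

```python
def odds_ratio(group_a_yes, group_a_no, group_b_yes, group_b_no):
    """Unadjusted odds ratio of 'yes' in group A relative to group B."""
    odds_a = group_a_yes / group_a_no
    odds_b = group_b_yes / group_b_no
    return odds_a / odds_b


# Hypothetical counts: conclusions "RAE" vs. "no RAE" among samples with
# significant (group A) and non-significant (group B) NHST results.
example_or = odds_ratio(90, 10, 20, 80)
print(example_or)  # odds of 9.0 vs. 0.25, i.e. an odds ratio of 36.0
```

An odds ratio far above 1, as in this invented example, means that the odds of concluding "RAE" are many times higher when the NHST result is significant.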

Discrepancies of birth pattern and NHST results

The percentage of statistically significant results increases with the degree of bias in birth pattern. However, a strongly or extremely biased birth pattern does not always lead to statistically significant results (Table 5).

Effect of sample size on NHST results in relation to birth pattern

NHST results are statistically significantly influenced by large sample size (OR=13.033) and by strongly (OR=26.242) and extremely biased birth patterns (OR=62.898) (Table 6). In addition, all interaction terms reflect a strong effect of the interaction of both variables on significant NHST results (OR ranging from 2.903 to 15.419). The two non-significant interaction terms (n=100-400 with extreme bias and n>1000 with strong bias) arise because there are no non-significant cases in these subgroups. The interaction is partly visualized in Fig. 1-2. With increasing group size, strongly and extremely biased birth patterns become less frequent (Fig. 1) and statistically significant NHST results become more frequent (Fig. 2).

Figure 1. Birth pattern by sample size

Potential of NHST to misinform

The smallest and largest samples show the most pronounced differences between the descriptive birth pattern and the conclusion about the presence of an RAE (cf. Fig. 1 and Fig. 3). As assumed, these groups also differ with regard to competition level. Ninety percent of the smallest samples and 49.6% of the largest are associated with elite sport (Table 7). As a result, there are more positive conclusions on an RAE in recreational samples or samples of mixed competition levels than in elite sport samples (Table 8), even though the non-elite groups show twice as many even birth distributions, half as many strong biases in birth distribution, and no extreme biases in birth distribution compared with elite sport (Table 9).

Figure 3. Conclusions on RAE by sample size

DISCUSSION

The results show that 1) NHST is widely used in RAE research, 2) conclusions on RAE are based on NHST results rather than the descriptive results on birth pattern, 3) there are discrepancies between the descriptive results on birth pattern and NHST results, where evenly distributed births are marked as statistically significant and extremely unevenly distributed births are sometimes marked as non-significant, 4) sample size has an impact on statistical significance and thus on the conclusion regarding the existence of an RAE, and 5) the groups that show the largest discrepancies between the descriptive results on birth pattern and the conclusions on RAE also differ in their (non-)elite status, resulting, on a percentage basis, in more conclusions on RAE for recreational sport samples than for elite sport samples even though there are fewer biases in the birth pattern of recreational sport samples.

The high percentage of research projects that apply NHST in RAE research demonstrates once again that it is a standard method in scientific research (4-5, 9, 22, 37-41). The strong association between NHST results and authors’ conclusions on the existence of RAEs suggests that scientists who choose to rely on NHST bind themselves in advance to certain conclusions. This does not mean that a decision not to rely on the descriptive data is definitely wrong, unless the position taken by Gibbs et al. (35) is correct that population data are not an appropriate basis for NHST.

But the tendency to assume an RAE only in conjunction with a statistically significant p-value, which in turn depends on statistical power and sample size, leads to a discrepancy between the purely descriptive data and the conclusions about which sample contains an RAE. We see the greatest number of attributions of an RAE in the large samples, even though these are the samples with the most evenly distributed births and the fewest elite athletes. And we see the fewest attributions of an RAE in the small samples with the most extreme birth patterns and the greatest number of elite athletes. Not concluding that there is an RAE after finding no statistical significance in a small sample is justifiable, precisely because the low number leaves room for random effects. But proceeding this way carries the danger that (a) the informative value of the NHST results in combination with low sample size was overestimated (3, 18, 22, 28, 42-43); (b) the decision lacks contextualization (20, 25, 31), e.g., that systematic reviews of RAE literature find overall RAEs (44-45); (c) the NHST was not the appropriate statistical tool (35); and (d) the recorded decision was not the rejection of a conclusion at all but the conclusion “no RAE”. The latter is a misinterpretation that can be derived neither from the p-value (15, 18, 20, 22, 26, 46) nor, in the case at hand, from the context or the descriptive distribution.
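The dependence of the p-value on sample size can be illustrated with a chi-square goodness-of-fit test against an even split of births across the two halves of the year. The Python sketch below is an illustration under stated assumptions, not a re-analysis of the database: both samples are hypothetical, an expected 50/50 distribution is assumed, and the function names are introduced for this example only.

```python
import math


def chi_square_two_halves(n, first_half_share):
    """Chi-square statistic (df = 1) against an assumed even 50/50 split."""
    observed = (n * first_half_share, n * (1 - first_half_share))
    expected = n / 2
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_df1(chi_square):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    # For df = 1 the statistic is a squared standard normal variable.
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical small sample with a strong bias: 65% born in the first half.
p_small = p_value_df1(chi_square_two_halves(20, 0.65))

# Hypothetical large sample with a nearly even pattern: 52% in the first half.
p_large = p_value_df1(chi_square_two_halves(10_000, 0.52))

print(p_small > 0.05, p_large < 0.05)
```

In this constructed case the strongly biased small sample stays above the .05 threshold, while the nearly even large sample falls far below it, which mirrors the pattern of attributions described above.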

Also debatable is the acceptance of tiny effects in huge databases. Various authors call for the significance level to be adjusted to the sample size and for meaningful practical significance, not just statistical significance, to be aimed for (2, 24, 26-28, 31, 42-43, 47-48). Does a difference matter if it affects only very few individuals? This seems to be both an ethical and a statistical question.

But even when these RAEs are worth mentioning, there is an attribution bias which may adversely affect efforts to seek solutions. When no RAE is attributed to Lithuanian Olympic basketballers (49) because statistical significance is missing (χ²=1.200; p=0.68; N=15), despite the fact that 80% were born in the first half of the year, but an RAE is attributed to Swiss female skiers (50), where 0.3% fewer female skiers were born in the second half of the year than in the total female population (χ²=31.0; p<0.01; V=0.01; N=186,468), then decision makers may try to counter the RAE at the grassroots level and not in elite training.
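The contrast between the two examples becomes clearer when the reported chi-square values are converted into an effect size that does not grow with sample size. The Python sketch below uses Cohen's w, defined as the square root of chi-square divided by N. It is an illustration based solely on the chi-square and N values quoted above. The choice of w as the measure and the function name are assumptions made for this example and are not part of the cited studies.

```python
import math


def cohens_w(chi_square, n):
    """Cohen's w effect size: square root of chi-square divided by N."""
    return math.sqrt(chi_square / n)


# Values as quoted in the text for the two samples.
w_basketball = cohens_w(1.200, 15)     # about 0.283
w_skiing = cohens_w(31.0, 186_468)     # about 0.013

print(round(w_basketball, 3), round(w_skiing, 3))
```

On this measure the non-significant basketball sample shows an effect roughly twenty times larger than the significant skiing sample, whose value is of the same order as the V=0.01 reported for it.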

This bias may be even stronger if publication bias has also affected the database of this study, because articles with significant results are more likely to be published (51-53). It is important to note that it is not NHSTs that produce conclusions but researchers, because scientific and statistical inferences are not the same (11, 26). There are many ways to use NHST to derive unbiased conclusions, such as treating the p-value as a continuous variable without a sharp and arbitrary threshold at 0.05 (20, 22, 25-26) or taking effect size into account (27-28), to name only a few (26).

CONCLUSIONS

The problem illustrated here is transferable to other topics in sport science where samples with specific characteristics tend to accumulate in certain sample sizes, e.g., research on talent development and sport participation. This paper adds to the recent criticism of NHST by illustrating how the strong reliance on NHST results affects conclusion patterns when compared with descriptive results. It has been shown that in RAE-related research, large samples are more often associated with an RAE despite more even birth patterns. Since these large samples have different characteristics than smaller samples, misinformation based on these conclusions is possible. In the given case of RAE research, it seems likely that the presence of RAEs in recreational sport is overestimated, while in certain fields of elite sport it is underestimated. Discrepancies between descriptive results and conclusions derived from NHST results, which are influenced by sample size, are also to be expected in other scientific fields where the trust in NHST results is likely similar to that in sport science.

APPLICATIONS IN SPORT

Decision-makers in sport should contextualize research results when informing themselves through scientific literature. Researchers should use NHST appropriately and carefully by combining the results with other statistical measures.

Acknowledgements

The writing of this paper was supported by an Elsa Neumann scholarship.

REFERENCES

1. Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423–437.
2. Cohen, J. (1994). The earth is round (p < 0.05). American Psychologist, 49, 997–1003.
3. Freiman, J. A., Chalmers, T. C., Smith, H., Jr, & Kuebler, R. R. (1978). The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 “negative” trials. The New England Journal of Medicine, 299(13), 690–694.
4. Gill, J. (1999). The Insignificance of Null Hypothesis Significance Testing. Political Research Quarterly, 52, 647–674.
5. Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587–606.
6. Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about null hypothesis testing but were afraid to ask. In D. Kaplan (Ed.), Handbook on quantitative methods in the social sciences (pp. 391–408). Sage.
7. Hunter, J. E. (1997). Needed: A Ban on the Significance Test. Psychological Science, 8, 3–7.
8. Lambdin, Ch. (2012). Significance tests as sorcery: Science is empirical-Significance tests are not. Theory & Psychology, 22(1), 67–90.
9. McCloskey, D. N., & Ziliak, S. (1996). The Standard Error of Regressions. Journal of Economic Literature, 34, 97–114.
10. Meehl, P. E. (1978). Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
11. Rothman, K. J. (1986). Significance questing. Annals of Internal Medicine, 105, 445–447.
12. Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416–428.
13. Schmidt, F. L. (1996). Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for the Training of Researchers. Psychological Methods, 1, 115–129.
14. Serlin, R. C., & Lapsley, D. K. (1993). Rational appraisal of psychological research and the good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 199–228). Lawrence Erlbaum Associates, Inc.
15. Fisher, R. A. (1935). Statistical tests. Nature, 136(3438), 474–474.
16. Fisher, R. A. (1956). Statistical methods and scientific inference. Oliver and Boyd.
17. Neyman, J., & Pearson, E. S. (1928). On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Biometrika, 20A(1/2), 175–240.
18. Pearson, K. (1935). Statistical Tests. Nature, 136, 296–297.
19. Pearson, E. S. (1955). Statistical concepts in their relation to reality. Journal of the Royal Statistical Society (Series B), 17, 204–207.
20. Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567, 305–307.
21. Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. The American Statistician, 73(Suppl. 1), 262–270.
22. Greenland, S., Senn, St. J., Rothman, K. J., Carlin, J. B., Poole, Ch., Goodman, St. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
23. Hurlbert, S. H., Levine, R. A., & Utts, J. (2019). Coup de Grâce for a Tough Old Bull: “Statistically Significant” Expires. The American Statistician, 73(Suppl. 1), 352–357.
24. McShane, B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885–895.
25. McShane, B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon Statistical Significance. The American Statistician, 73(Suppl. 1), 235–245.
26. Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p < 0.05”. The American Statistician, 73(Suppl. 1), 1–19.
27. Betensky, R. (2019). The p-Value Requires Context, Not a Threshold. The American Statistician, 73(Suppl. 1), 115–117.
28. Field, A. (2019). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE.
29. Freeman, P. R. (1993). The Role of p-Values in Analysing Trial Results. Statistics in Medicine, 12, 1443–1452.
30. Gannon, M. A., de Bragança Pereira, C. A., & Polpo, A. (2019). Blending Bayesian and Classical Tools to Define Optimal Sample-Size-Dependent Significance Levels. The American Statistician, 73(Suppl. 1), 213–222.
31. Wellek, S. (2017). A Critical Evaluation of the Current p-Value Controversy. Biometrical Journal, 59, 854–900.
32. DFB (2022, September 24). Team und Trainer. https://www.dfb.de/u-21-maenner/team-und-trainer/
33. Buhre, T., & Tschernij, O. (2018, March 29). Sample distribution and research design are methodological dilemmas when identifying selection and using relative age as an explanation of results. The Sport Journal.
34. Delorme, N., Boiché, J., & Raspaud, M. (2010). Relative age effect in elite sports: Methodological bias or real discrimination? European Journal of Sport Science, 10(2), 91–96.
35. Gibbs, B. G., Shafer, K., & Dufur, M. J. (2015). Why infer? The use and misuse of population data in sport research. International Review for the Sociology of Sport, 50(1), 115–121.
36. Steingröver, C., Wattie, N., Baker, J., Helsen, W. F., & Schorer, J. (2017). The interaction between constituent year and within-1-year effects in elite German youth basketball. Scandinavian Journal of Medicine & Science in Sports, 27, 627–633.
37. Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null Hypothesis Testing: Problems, Prevalence, and an Alternative. The Journal of Wildlife Management, 64(4), 912–923.
38. Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Krüger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution: Ideas in the sciences (Vol. 2, pp. 11–33). MIT Press.
39. Hubbard, R. (2004). Alphabet soup: Blurring the distinctions between p’s and α’s in psychological research. Theory & Psychology, 14, 295–327.
40. McShane, B., & Gal, D. (2016). Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence. Management Science, 62, 1707–1718.
41. Sawyer, A. G., & Peter, P. (1983). The Significance of Statistical Significance Tests in Marketing Research. Journal of Marketing Research, 20, 122–133.
42. Boster, F. J. (2002). On making progress in communication science. Human Communication Research, 28(4), 473–490.
43. Levine, T. R., Weber, R., Hullett, C., Park, H. S., & Massi-Lindsey, L. L. (2008). A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research, 34(2), 171–187.
44. Cobley, S., Baker, J., Wattie, N., & McKenna, J. (2009). Annual age-grouping and athlete development: a meta-analytical review of relative age effects in sport. Sports Medicine, 39(3), 235–256.
45. Smith, K. L., Weir, P. L., Till, K., Romann, M., & Cobley, S. (2018). Relative Age Effects Across and Within Female Sport Contexts: A Systematic Review and Meta-Analysis. Sports Medicine, 48(6), 1451–1478.
46. Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. British Medical Journal, 311, 485.
47. Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.
48. Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). W. W. Norton and Company.
49. Werneck, F. Z., Coelho, E. F., de Oliveira, H. Z., Ribeiro Júnior, D. P., Almas, S. P., de Lima, J. R. P., Matta, M. O., & Figueiredo, A. J. (2016). Relative age effect in Olympic basketball athletes. Science & Sports, 31(3), 158–161.
50. Romann, M., & Fuchslocher, J. (2014). The need to consider relative age effects in women’s talent development process. Perceptual and Motor Skills, 118(3), 651–662.
51. Dickersin, K., Min, Y. I., & Meinert, C. L. (1992). Factors influencing publication of research results. Follow-up of applications submitted to two institutional review boards. Journal of the American Medical Association, 267(3), 374–378.
52. Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891–904.
53. Greenwald, A. G. (1975). Consequences of Prejudice Against the Null Hypothesis. Psychological Bulletin, 82, 1–20.