No data left behind

Infant research notoriously suffers from small samples, resulting in low power. Beyond increasing sample sizes, improving the reliability of our measurements can also increase power and help us find more reliable effects. Byers-Heinlein, Bergmann and Savalei (2021) provide both an analysis of the problem of (low) reliability and a number of valuable recommendations. One of these recommendations is to 'exclude unreliable data'. Although this may increase the effect size found in the remaining data, it can also unjustifiably bias the estimates when the cause of the unreliability is unknown. In such cases, it is better to embrace the variability and use it to characterize the population: variability is also informative. Modern analytical techniques can be used to deal with variability and with missing data. No data should be left behind!

Highlights:
- Variability and individual differences are the bread and butter of developmental science.
- Discarding variable/unreliable data carries the risk of biasing effect size estimates.
- Variable and missing data can be dealt with appropriately with modern analytical approaches.

Byers-Heinlein et al. (2021) argue that a lack of reliability hinders progress in infant research, and they provide recommendations for improving reliability. This is indeed much needed: better measurement instruments, more data/trials per infant, and better reporting of psychometric properties.


| COGNITIVE MECHANISMS AND INDIVIDUAL DIFFERENCES
The goal of much infant research is to establish a particular experimental effect, for example, a difference in looking time between the habituated stimulus and a novel stimulus. In such situations, the largest effect size will be obtained when all infants behave identically, that is, when there is no measurement error whatsoever. Note that there is a paradox looming here: when all infants behave identically, there are apparently no individual differences. In experimental psychology studying adults performing psychophysics tasks, getting rid of individual differences as much as possible may be an acceptable strategy; in infant research it is not. It is throwing away the proverbial baby with the bathwater. Without individual differences, there can be no study of development. Infants acquire skills in different orders and at different points in time, and older infants respond faster and more reliably than younger infants. Aiming for a situation where those differences do not exist is hence not the way forward; variability and the detection of effects or the cognitive mechanisms underlying them need to be reconciled; see, for example, the discussion in Borsboom (2006) and an example in decision making in van der Maas et al. (2011).
For such reconciliation to be achievable, good quality measurement instruments are certainly needed! The use of eye-tracking holds great promise in this respect, for a number of reasons. Data from eye-tracking are much richer than looking times; the whole pattern of fixations and saccades over time can be analysed (e.g., van Renswoude et al., 2020), providing a basis for analysing strategy differences (Kucharský et al., 2020). Basic eye-tracking measures are known to have good test-retest reliabilities (Wass et al., 2014). Moreover, eye-tracking tasks can be made more engaging for infants by making them gaze-contingent (Wang et al., 2012), and as such also naturally provide more data points per infant. All of these changes can increase the chances of finding reliable and (relatively) large effects.

| DEALING WITH NOISY AND MISSING DATA
Infant data are notoriously noisy, and even with improved measurement procedures part of this variability will remain, as will the fact that some infants complete more trials than others. If the goal is to increase effect sizes per se, one could argue that leaving out such unreliable data will have the desired effect. However, this also poses a serious threat to the validity of the research. In the example from the ManyBabies 1 study that Byers-Heinlein et al. (2021) discuss, the effect size increases when only infants who completed more trials are analysed.
It is not known, however (or at least it is not part of the analysis), why some infants complete fewer trials than others. Did they just happen to be hungry or thirsty and therefore complete fewer trials? That is, was the reason they completed fewer trials something outside the task they were performing, or not? Alternatively, they could have been less attentive for cognitive processing reasons: the infants could have habituated to the test trials sooner than other infants. Or these infants may have been less able to concentrate on these particular stimuli. Yet another possibility is simply that they were slightly younger than the infants who completed more trials; younger infants are typically more variable and may therefore complete fewer trials. The truth of the matter is that, in many cases, as infant researchers we will not know why some infants complete more trials than others. In such cases, it is not advisable to discard data from infants with fewer trials, even though that would increase the effect size.
The potential effect of discarding data is that the effect size estimate becomes biased. For example, if the number of completed trials indeed depends on age (which is quite likely in actual experiments!), discarding data on this basis will result in an effect size estimate that is biased towards older infants. It should also be noted here that large effect sizes are not a goal in and of themselves. In research, we need to establish an effect to draw conclusions about the existence of theoretical constructs and their relationships; there are many cases where we expect small effects (even when our measurement instruments are optimized). When these putative effects are theoretically interesting enough, we simply need to accept that we will require large samples to detect them. Rather than discarding unreliable data, it is more informative in this case to apply the sixth recommendation in Byers-Heinlein et al. (2021): the employment of more sophisticated analytical techniques. A number of further recommendations can be made about ways in which more value can be gained from variable, developmental data.
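The age-dependence argument can be illustrated with a minimal simulation. All numbers below are hypothetical and purely for illustration (they are not taken from ManyBabies 1 or any real study): the true effect grows with age, older infants complete more trials, and excluding infants with few completed trials then inflates the estimate simply by over-representing older infants.

```python
import numpy as np

# Hypothetical simulation: made-up numbers, not data from any real study.
rng = np.random.default_rng(0)

n = 200
age_months = rng.uniform(4, 12, n)            # hypothetical infant ages
true_effect = 0.1 * age_months                # the effect grows with age
# Older infants complete more of the 16 trials (deterministic here for
# clarity; real data would of course be noisier).
trials_completed = np.clip(np.round(1.8 * age_months), 1, 16)

full_estimate = true_effect.mean()
# 'Exclude unreliable data': keep only infants with at least 12 trials.
kept = trials_completed >= 12
subset_estimate = true_effect[kept].mean()

# The subset estimate is larger simply because younger infants were dropped.
print(f"all infants: {full_estimate:.3f}, after exclusion: {subset_estimate:.3f}")
```

The exclusion rule raises the estimated effect even though nothing about the underlying effect changed; the subsample is just older on average.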
First, even prior to doing any analysis, it is advisable to refrain from averaging trial data per infant. Averaging trial-based data per infant obscures differences in variability and reliability, whereas we would rather analyse them, for example, in random effects models. If significant random effects are found, follow-up analyses can look into the causes of such variability, for example, related to age, trial number, testing condition, and/or the strength of attention in the pre-test (habituation, familiarization) phase of the task.
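A small sketch with simulated looking times (all values hypothetical) shows what per-infant averaging throws away: two infants with nearly identical mean looking times can differ by an order of magnitude in trial-to-trial variability, information that trial-level (e.g., random effects) analyses retain but averages destroy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical trial-level looking times (seconds) for two infants with the
# same underlying mean but very different trial-to-trial variability.
infant_a = rng.normal(loc=6.0, scale=0.5, size=20)   # stable responder
infant_b = rng.normal(loc=6.0, scale=3.0, size=20)   # highly variable responder

# Per-infant averages look almost identical ...
print(f"means: {infant_a.mean():.2f} vs {infant_b.mean():.2f}")
# ... but the trial-level spread differs substantially, which a model fitted
# to the trial data (rather than the averages) would pick up.
print(f"SDs:   {infant_a.std(ddof=1):.2f} vs {infant_b.std(ddof=1):.2f}")
```

In practice, one would feed such trial-level data into a mixed (random effects) model rather than computing the spreads by hand; the point is only that averaging first makes the difference invisible.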
Second, consider using latent group or mixture models to capture heterogeneity. In random effects models, participant-related variability in the data is assumed to result from a continuous, typically Gaussian, distribution. In mixture models, individual differences are conceptualized as qualitative rather than continuous differences. Schaaf et al. (2019) apply a hierarchical Bayesian approach in a reinforcement learning task with mixture components that represent (1) participants who learn in the task and (2) participants who fail to learn and remain at guessing level throughout the trials. Note that such patterns of data may also be observed in habituation, where one group habituates and another group does not. Mixture analysis can be used to detect groups of participants that behave similarly without making prior assumptions about the characteristics of these groups (van der Maas & Straatemeier, 2008).
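The learners-versus-guessers idea can be sketched with a minimal two-component Gaussian mixture fitted by expectation-maximization. This is a bare-bones illustration on simulated accuracy scores (all numbers hypothetical), not the hierarchical Bayesian model of Schaaf et al. (2019):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical accuracy scores: a latent group at chance (~0.5) and a latent
# group that learned the task (~0.8); group membership is not observed.
scores = np.concatenate([rng.normal(0.5, 0.05, 60),    # "guessers"
                         rng.normal(0.8, 0.05, 40)])   # "learners"

def em_two_gaussians(x, n_iter=100):
    """Minimal EM for a two-component 1D Gaussian mixture."""
    mu = np.array([x.min(), x.max()])        # crude initialization
    sigma = np.array([x.std(), x.std()])
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each score
        dens = np.array([weight[k] / (sigma[k] * np.sqrt(2 * np.pi))
                         * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                         for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: update weights, means and standard deviations
        nk = resp.sum(axis=1)
        weight = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return weight, mu, sigma

weight, mu, sigma = em_two_gaussians(scores)
print(mu)   # component means should land near the two latent groups
```

The fitted component means recover the two latent groups without any labels being supplied, which is the sense in which mixture analysis detects groups "without making prior assumptions about the characteristics of these groups".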
Third, even missing data can be informative! Beyond dealing with missing data appropriately when applying analytical methods (cf. Nicholson et al., 2017), the pattern of missing data can be informative, for example, when it correlates with age (which is quite likely in developmental data) but also when it correlates with task condition. Task difficulty could play a role, leading to more missing data in one condition than in another, which is informative about the nature of the task. Analysing the number of completed (or missing) trials, possibly in interaction with age or other variables, can hence be informative about the task. Pattern mixture models have been developed for analysing such patterns of missing data (Hedeker & Gibbons, 1997).
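As a minimal illustration of treating missingness as data (again with entirely hypothetical numbers), one can simulate trial completion that depends on age and then analyse the completed-trial counts themselves:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: each infant attempts 16 trials, and the probability of
# finishing any given trial increases with age, so the missingness pattern
# itself carries information about age.
n = 300
age_months = rng.uniform(4, 12, n)
p_finish = np.clip(0.4 + 0.05 * age_months, 0, 1)   # older infants finish more
completed = rng.binomial(16, p_finish)

# Before discarding anything, analyse the missingness pattern itself:
r = np.corrcoef(age_months, completed)[0, 1]
print(f"correlation between age and completed trials: {r:.2f}")
```

A clear age-completion correlation like this is exactly the situation in which dropping low-trial infants would bias the sample towards older infants, and in which a pattern mixture model over the missingness patterns would be the more informative analysis.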
Discarding data is unnecessary and risky; no data should be left behind! Even when data are variable (or missing!), they can carry information about the cognitive processes that we are interested in. Getting familiar with the more sophisticated analytical techniques required to deal with variable data can be a daunting prospect. Fortunately, our field is moving towards larger collectives of collaborating researchers (Byers-Heinlein et al., 2020; Frank et al., 2017). In such collectives, a division of labour can be arranged more easily, and each member can focus on their specialties.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/icd.2339.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.