Today at Eco-Stats we discussed the PLoS ONE paper "Where Should I Send It? Optimizing the Submission Decision Process" which did some mathematical modelling to decide on an optimal approach to choosing the order of journals to send an ecological paper to. The main factors considered were time to acceptance (a function of time to review and acceptance rate) and impact factor of the journal. The authors wrote to the editorial boards of all - yes all - ISI-listed
journals in ecology, and another six general journals (e.g. Science,
PLoS ONE) that publish ecological papers. They got responses from 61
journals, yielding an interesting dataset available as an appendix to
their paper. I've reformatted it as a comma-delimited file here.

The authors derived a couple of metrics (e.g. to maximise expected citations) under a host of assumptions (which made me somewhat uncomfortable, as modelling papers often do), the endpoint was metrics that could be used to evaluate different publication strategies, e.g. Science then PNAS then Ecology Letters then...

Their results I found largely unsurprising - they highlighted a few target journals, of the ones they had data on, in particular Ecology Letters, Ecological Monographs and PloS One, which all scored high as compromises between impact factor and time to publication. Interestingly Science didn't come out smelling like roses, although this may be a function of the metrics they used and their implicit assumptions as much as anything else. They didn't have data on all journals, e.g. I would like to know about Nature, Trends in Ecology and Evolution or Methods in Ecology and Evolution. They expressed surprise that a pretty good strategy seemed to be submitting to journals in order of impact factor. They expected a loss of impact due to long times spent in review, I mean you end up bouncing around between journals for years and years. I think in practice that strategy would do worse than their model suggested, for most of us, because it didn't incorporate the positive correlation in outcomes from submitting the same paper to different journals (or more generally, any measure of how significant a given paper actually is).

Over time I've become more of a statistician than a modeller and so I was especially interested in the data behind this work, and I learnt the most just by looking at the raw data that was tucked away in an appendix. Here are a few choice graphs which explain the main drivers behind their results.

First, Impact Factor vs time in review:

There is a decent negative correlation between impact factor and time in review (r=-0.5). For those of us who have submitted a few papers to journals at each end of the spectrum this won't be news. This is presumably one of the reasons why a journal has high impact - faster acceptances has a direct effect on citation metrics, and increases the incentive to submit good papers there.

The Science journal is a bit of an outlier on this graph - it has the highest impact factor but a pretty average review time, more than twice as long as Ecology Letters, so if you take these numbers at face value (are they measured the same way across journals?), and if 50 days means a lot to you, there is a case for having Ecology Letters as your plan A rather than Science. Hmmm...

Good journals are towards the top left, and apart from Ecology Letters and Science we also have Ecological Monographs on the shortlist because it has a slightly shorter time in review than most journals with similar impact factors. Although I wonder how large that difference is relative to sampling error (would it come out to the left of the pack next year too?)...

Next graph is Impact Factor vs Acceptance rate:

There is a slightly stronger negative association this time (r=-0.6). I vaguely remember a bulletin article a few years ago suggesting no relation between impact factor and acceptance rate - that article used a small sample size and made the classic mistake of assuming that no evidence of a relationship means no relationship. Well given some more data clearly there is a relationship.

This time we are looking for papers towards to top-right. The journal fitting the bill is by far the biggest outlier, PLoS ONE - a journal with a different editorial policy to most that reviews largely for technical correctness rather than for novelty. It ends up with quite a high acceptance rate, and nevertheless manages a pretty high impact factor. But its impact factor was calculated across all disciplines, what is it when limited to just ecology papers?

So anyway, from looking at the raw data and taking it at face value, what would be your publishing strategy? A sensible (and relatively common) strategy is to first go for a high impact journal (or two) with relatively short turnaround times, which Ecology Letters is known for, and when you get tired/discouraged by lack of success, or when just trying to squeeze a paper out quickly, PLoS ONE is a good option. This is pretty much what the paper said using fancy metrics, I guess it is reassuring to get the same sort of answer from eyeballing scatterplots of the raw data.

There are a few simplifying assumptions in this discussion and in the paper itself - a key one is that all paper are treated as equal, when in fact some are more likely to be accepted than others, and some are more suited to some journals than others. There are assumptions like citations being the be-all and end-all, and the modelling in the original paper further assumed that the citations a paper will get are a function of the journal it is published in alone, and not to do with the quality of the paper that is published. But it's all good fun and there are certainly some lessons to be learnt here.

## Thursday, 28 April 2016

## Thursday, 21 April 2016

### Structural equation modeling

The Ecostatistics group gathered today for a discussion of the paper "Structural equation models: a review with applications to environmental epidemiology" by Brisa Sanchez et al. from the December 2005 edition of the Journal of the American Statistical Association.

We were sparked to read this paper due to multiple requests from clients at the Stats Central consulting lab, who were attempting to implement structural equation models (SEMs) but did not understand the methodology. An SEM is typically used as a model for multivariate responses that depend on latent variables. The methodology is used mostly in the field of psychometrics, for instance to model someone's score on several tests as reflecting some unobservable latent variable like "spatial intelligence" or "motor skills". The paper by Sanchez et al. provides a nice overview of SEMs for statisticians who don't necessarily keep up-to-date in the psychometrics literature, with two example analyses drawn from the field of environmental epidemiology.

The first example is a model that explains the concentration of lead in four body tissues (umbilical cord blood, maternal whole blood, the patella, and the tibia tibia) via five environmental covariates (time living in Mexico City, age, use of ceramics for cooking [both long-term and recent], and the concentration of lead in the air). This isn't a case you'd normally think of as requiring the use of an SEM, but the model was structured hierarchically so that some of the response variables were also covariates for predicting the others. The figure below depicts the model's structure, with each arrow indicating a linear model. It was reproduced from the manuscript.

Since there are no latent variables in this model, we wondered whether the result of using an SEM to fit the data jointly will be very different from generating each regression model individually, and using its fitted values as inputs to the next level of the model.

The other example is a model for performance on eleven neurobehavioral tests with mercury exposure as an unobserved covariate that affects the latent factors "motor ability" and "verbal ability". Whale consumption was used to predict the mercury exposure (this study was conducted in the remote Faeroe Islands north of Scotland), with hair and blood mercury used as surrogates that reflect the mercury exposure. The diagram of this model's structure is reproduced below. Note that the double-ended arrow between motor ability and verbal ability indicates that the two latent factors are correlated.

One thing to note from the diagrams is that these models use a lot of parameters and impose a quite specific structure, both of which can be problematic. The number of parameters leads to concerns about the identifiability of SEMs, which must be carefully checked in each case, and the results may be sensitive to the structure, which must be chosen a priori, using the theory of the scientific field being studied. Model checking is discussed in the paper, but in a general way (necessarily, given that this is a review paper).

Both examples use linear models and Gaussian distributions at every stage, though SEMs can be used with more flexible model equations such as GLMs. The R package sem and the STATA package gllamm are called out as the state of the art (circa 2005).

We got to discussing when an Ecostatistician might want to use an SEM in their work, and the discussion focused around our favorite application: multivariate abundance modeling (mvabund). In fact, most kinds of regression and latent variable models seem to be specific cases of structural equation models (though that's not usually the most productive way to think).

In sum, we found this paper to be a well-calibrated review article for statisticians who are looking to quickly understand the SEM methodology and see how it can be used in an example.

We were sparked to read this paper due to multiple requests from clients at the Stats Central consulting lab, who were attempting to implement structural equation models (SEMs) but did not understand the methodology. An SEM is typically used as a model for multivariate responses that depend on latent variables. The methodology is used mostly in the field of psychometrics, for instance to model someone's score on several tests as reflecting some unobservable latent variable like "spatial intelligence" or "motor skills". The paper by Sanchez et al. provides a nice overview of SEMs for statisticians who don't necessarily keep up-to-date in the psychometrics literature, with two example analyses drawn from the field of environmental epidemiology.

The first example is a model that explains the concentration of lead in four body tissues (umbilical cord blood, maternal whole blood, the patella, and the tibia tibia) via five environmental covariates (time living in Mexico City, age, use of ceramics for cooking [both long-term and recent], and the concentration of lead in the air). This isn't a case you'd normally think of as requiring the use of an SEM, but the model was structured hierarchically so that some of the response variables were also covariates for predicting the others. The figure below depicts the model's structure, with each arrow indicating a linear model. It was reproduced from the manuscript.

Since there are no latent variables in this model, we wondered whether the result of using an SEM to fit the data jointly will be very different from generating each regression model individually, and using its fitted values as inputs to the next level of the model.

The other example is a model for performance on eleven neurobehavioral tests with mercury exposure as an unobserved covariate that affects the latent factors "motor ability" and "verbal ability". Whale consumption was used to predict the mercury exposure (this study was conducted in the remote Faeroe Islands north of Scotland), with hair and blood mercury used as surrogates that reflect the mercury exposure. The diagram of this model's structure is reproduced below. Note that the double-ended arrow between motor ability and verbal ability indicates that the two latent factors are correlated.

One thing to note from the diagrams is that these models use a lot of parameters and impose a quite specific structure, both of which can be problematic. The number of parameters leads to concerns about the identifiability of SEMs, which must be carefully checked in each case, and the results may be sensitive to the structure, which must be chosen a priori, using the theory of the scientific field being studied. Model checking is discussed in the paper, but in a general way (necessarily, given that this is a review paper).

Both examples use linear models and Gaussian distributions at every stage, though SEMs can be used with more flexible model equations such as GLMs. The R package sem and the STATA package gllamm are called out as the state of the art (circa 2005).

We got to discussing when an Ecostatistician might want to use an SEM in their work, and the discussion focused around our favorite application: multivariate abundance modeling (mvabund). In fact, most kinds of regression and latent variable models seem to be specific cases of structural equation models (though that's not usually the most productive way to think).

In sum, we found this paper to be a well-calibrated review article for statisticians who are looking to quickly understand the SEM methodology and see how it can be used in an example.

Subscribe to:
Posts (Atom)