Tuesday, February 25, 2014

Why would any scientist fuss over making their data public and accessible?

Well, colour me naive. When PLoS announced their new data archiving policy a few days ago, I hardly felt like it was a "big deal". Providing a public archive of raw data used in a published study seems like a no-brainer (except in very limited circumstances, such as medical records or location data for at-risk species), and is becoming standard practice, right? Clearly my naivete knows no bounds, given some of the really negative reactions to it (here is one example, while the issue is discussed a bit more broadly here).

In the fields in which I am most active (at the intersection between Genomics, Genetics and Evolution), there have been numerous recent (and successful?) efforts to make sure data associated with studies is archived and publicly available in repositories. While the repositories are not perfect, data archiving seemed to be working and generally useful, and I was always happy to do it myself. Yes, some of the issues with getting the data and meta-data formatted for NCBI GEO (or SRA) could be annoying at times. But relative to all of the other effort in collecting and analyzing the data, writing the manuscript and getting it accepted for publication, the day spent making the data available to other researchers in the long term seemed pretty minor. Other scientists have always sent me reagents and (when they could find it) data, so this seemed like an easy way to be helpful and in line with the scientific process (and hopefully progress).

More importantly, I have tried to get data from other researchers over the years, with huge numbers of "old hard drive failures" always seeming to be the reason why it could not be made available to me. I have also recently been involved with a large meta-analysis of previously published data. Rarely was the raw data available, and because only summary statistics were available (rarely with associated measures of statistical uncertainty), we were very limited in what analyses we could do. There would have been so much more we could do if the raw data had been available.

So, I do not want other researchers to have to deal with these frustrations because of me. By archiving data generated by myself or members of my lab, other researchers can get it without hassling me, and I do not have to worry about finding it at a later date (like 10 years down the road), when it may take far more time to recover than putting it in a repository in the first place.

When most of the journals in evolutionary biology simultaneously started a data archiving policy (generally associated with DRYAD) a few years ago, I was quite happy. Not only did I put data from new studies up in DRYAD, but also data from my older studies (see here). I naively expected most evolutionary biologists to do the same. After all, there are many long term data sets in evolutionary biology that would be of great value, in particular for studies of natural selection and for estimating G matrices, where there is still much active methodological development. Some of the published data sets required heroic efforts to generate, and would be a huge community resource.

So I was a little surprised when DRYAD was not rapidly populated by all of these great legacy datasets. I think that folks "hoarding" data are a very small minority, and that the majority of folks were just very busy and this did not seem like a pressing issue to them. In any case, I have also spent some effort at my institution (Michigan State University) discussing the importance of data archiving with students. All of the benefits seem obvious: making our science more open, and making our data available for those who may be able to address interesting and novel questions in the future. Fundamentally, it is the data and the analysis (and interpretation) that represent much of the science we do. Our scientific papers are a summary of this work. Better to have it all (data, analysis and interpretation) available, no?

So, when PLoS made the announcement, this seemed like par for the course in biology. Funding agencies are mandating data management and sharing plans, and so are other journals. So who could be either shocked or dismayed by this?

Like I said, I can be naive.

Even after reading (and re-reading) the posts (above) and the discussion threads about these concerns, I am still baffled. Yes, there will be a few "corner cases" where privacy, safety or conservation concerns need to be considered. However, for the vast majority of studies in biology this is not the case, and in many of the remaining situations the data can be stripped of identifiers or variables to alleviate such issues.

So what's the problem?
Does it require a little work on the part of the authors of the studies? Perhaps a little. However, I always remind folks in the lab that the raw data they generate, and the scripts they use for analysis, will be made available. I find that, to my benefit, the scripts are much easier to read as a result. Furthermore, keeping these issues in mind makes it that much easier to get everything organized for archiving. The readme files we generate make sure we do not forget what variable names mean, or other such details. Handling data transformations or removing outliers in scripts means we can always go back and double check the influence of those observations.
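As a rough illustration of that last point, here is a minimal Python sketch of what "removing outliers in a script" can look like. It is not taken from our actual analysis scripts: the function names, the Tukey-fence rule (1.5 times the interquartile range) and the measurements are all assumptions made up for the example. The idea is simply that the raw values are never edited; observations are flagged in code, so the summary can be recomputed with and without them.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Mark values falling outside the Tukey fences (hypothetical rule for this sketch)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


def outlier_influence(values, k=1.5):
    """Summarise the data with and without the flagged observations."""
    flags = flag_outliers(values, k)
    kept = [v for v, is_outlier in zip(values, flags) if not is_outlier]
    return {
        "n_raw": len(values),
        "n_flagged": sum(flags),
        "mean_raw": statistics.mean(values),
        "mean_without_flagged": statistics.mean(kept),
    }


# Invented measurements, standing in for a raw data file that is left untouched.
raw_measurements = [1.02, 0.98, 1.05, 1.01, 0.99, 1.03, 2.40]

report = outlier_influence(raw_measurements)
for name, value in report.items():
    print(f"{name}: {value}")
```

Because the exclusion rule lives in the script rather than in a hand-edited spreadsheet, anyone who downloads the archived raw file can change `k`, or drop the rule entirely, and see how much the result depends on those observations.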

In their post, DrugMonkey suggests that the behavioural data they generate in their lab is too difficult to organize in such a fashion as to be broadly useful. I agree that raw video (if that is what they are collecting) remains difficult to archive, although perhaps figshare could be used, which is what we will try for our behavioural trials. However, we find that the text files from our "event recordings" are very easy to post, organize and generate meta-data for. Does the data need to be parsed from its "raw" format into one more useful for analysis? Sure, but we will also supply (as we do in our DRYAD data packages) the scripts to do so. Perhaps there is something I am missing about their concern. However, I do not concede their point about the difficulty of organizing their data for archiving. How hard should it really be to make such files useful to other experts in the field?

Even for simulations, we supply our scripts, configuration files, and sometimes data generated from the simulation (to replicate key figures that would take too long to generate by replicating the whole simulation).

It is always worth reminding ourselves (as scientists) of this quote (attributed to Sir William Bragg):

"The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them"