Tuesday, February 25, 2014

Why would any scientist fuss over making their data public and accessible?

Well, colour me naive. When PLoS announced their new data archiving policy a few days ago, I hardly felt like it was a "big deal". Providing a public archive of the raw data used in a published study seems like a no-brainer (except in very limited circumstances, such as medical records, location data for at-risk species, etc.), and is becoming standard practice, right? Clearly my naivete knows no bounds, given some of the really negative reactions to it (here is one example, while the issue is discussed a bit more broadly here).

In the fields in which I am most active (at the intersection of Genomics, Genetics and Evolution), there have been numerous recent (and successful?) efforts to make sure data associated with studies is archived and publicly available in repositories. While they (the repositories) are not perfect, data archiving seemed to be working and generally useful, and I was always happy to do so myself. Yes, some of the issues with getting the data and meta-data formatted for NCBI GEO (or SRA) could be annoying at times, but this was a minor concern relative to all of the other effort that goes into collecting and analyzing the data and writing the manuscript (and getting it accepted for publication). The day spent making the data available to other researchers in the long term seemed a small price. Other scientists have always sent me reagents and (when they could find it) data, so this seemed like an easy way to be helpful and in line with the scientific process (and hopefully progress).

More importantly, I have tried to get data from other researchers over the years, and a huge number of "old hard drive failures" always seemed to be the reason why the data could not be made available to me. I have also recently been involved with a large meta-analysis of previously published data. Rarely was the raw data available, and because only summary statistics were available (rarely with associated measures of statistical uncertainty), we were very limited in what analyses we could do. There would be so much more we could have done if the raw data had been available.

So, I do not want other researchers to have to deal with these frustrations because of me. If I archive data generated by myself or members of my lab, other researchers can get it without hassling me, and I do not have to worry about finding it at a later date (say, 10 years down the road), when recovering it may take far more time than putting it in a repository in the first place.

When most of the journals in evolutionary biology simultaneously started a data archiving policy (generally associated with DRYAD) a few years ago, I was quite happy. Not only did I put data from new studies up in DRYAD, but also data from my older studies (see here). I naively expected most evolutionary biologists to do the same. After all, there are many long-term data sets in evolutionary biology that would be of great value, in particular for studies of natural selection and for estimating G matrices, where there is still much active methodological development. Some of these data sets required heroic efforts to generate, and would be a huge community resource.

So I was a little surprised when DRYAD was not rapidly populated by all of these great legacy datasets. I think that folks "hoarding" data are a very small minority, and that the majority of folks were just very busy and this did not seem like a pressing issue to them. In any case, I have also spent some effort at my institution (Michigan State University) discussing the importance of data archiving with students. All of the benefits seem obvious: making our science more open, and making our data available for those who may be able to address interesting and novel questions in the future. Fundamentally, it is the data and the analysis (and interpretation) that represent much of the science we do. Our scientific papers are a summary of this work. Better to have it all (data, analysis and interpretation) available, no?

So, when PLoS made the announcement, this seemed like par for the course in biology. Funding agencies are mandating data management and sharing plans, as are other journals. So who could be either shocked or dismayed by this?

Like I said, I can be naive.

Even after reading (and re-reading) the posts (above) and discussion threads about these concerns, I am still baffled. Yes, there will be a few "corner cases" where privacy, safety or conservation concerns need to be considered. However, for the vast majority of studies in biology this is not the case, and in many of the remaining situations the data can be stripped of identifiers or variables to alleviate such issues.

So what's the problem?
Does it require a little work on the part of the authors of the studies? Perhaps a little. However, I always remind folks in the lab that the raw data they generate, and the scripts they use for analysis, will be made available. I find that, to my benefit, the scripts are much easier to read as a result. Furthermore, keeping these issues in mind makes it that much easier to get everything organized for archiving. The readme files we generate make sure we do not forget what variable names mean, or other such details. Handling data transformations or removing outliers in scripts means we can always go back and double check the influence of those observations.
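To make the point about handling outliers in scripts concrete, here is a minimal sketch in Python. It is only an illustration, not the scripts we actually use: the measurements, the function names (`flag_outliers`, `summarize`) and the cutoff of 3 median absolute deviations are all assumptions. The idea is simply that the raw values are never edited; observations are flagged in code, so the analysis can be rerun with and without them.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return one boolean per value; True marks a value lying more than
    `threshold` median absolute deviations (MADs) from the median.
    The threshold of 3 is an arbitrary choice for this illustration."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median, so nothing can be flagged.
        return [False] * len(values)
    return [abs(v - med) / mad > threshold for v in values]


def summarize(values, drop_outliers):
    """Return (mean, n), optionally leaving out the flagged observations.
    The input list itself is never modified."""
    flags = flag_outliers(values)
    kept = [v for v, flagged in zip(values, flags)
            if not (drop_outliers and flagged)]
    return statistics.mean(kept), len(kept)


# Hypothetical raw measurements; the last one looks like a recording error.
raw_measurements = [1.02, 0.98, 1.05, 1.01, 0.97, 3.40]

flags = flag_outliers(raw_measurements)
mean_all, n_all = summarize(raw_measurements, drop_outliers=False)
mean_trimmed, n_trimmed = summarize(raw_measurements, drop_outliers=True)

print(f"all observations:  mean={mean_all:.3f} (n={n_all})")
print(f"outliers excluded: mean={mean_trimmed:.3f} (n={n_trimmed})")
```

Because the exclusion rule lives in the script rather than in a hand-edited spreadsheet, anyone who downloads the archived raw data can see exactly which observations were dropped, and can change the rule to check how much it matters.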

In their post, DrugMonkey suggests that the behavioural data they generate in their lab is too difficult to organize in such a fashion as to be broadly useful. While I agree that raw video (if that is what they are collecting) still remains difficult (although perhaps figshare could be used, which is what we will try for our behavioural trials), we find that the text files from our "event recordings" are very easy to post, organize and generate meta-data for. Does the data need to be parsed from its "raw" format to one more useful for analysis? Sure, but we will also supply (as we do in our DRYAD data packages) the scripts to do so. Perhaps there is something I am missing about their concern. However, I do not concede their point about the difficulty of organizing their data for archiving. How hard should it really be to make such files useful to other experts in the field?
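As a rough sketch of what such a parsing script might look like, the Python below turns a "raw" event log into total time spent per behaviour. The file layout (tab-separated, a `time_s` column and an `event` column with `_start` and `_end` suffixes), the behaviour names and the function names are all hypothetical, made up for illustration; they are not the format our event recorder actually writes.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw event log: one row per key press during a trial.
RAW_EVENT_LOG = """\
time_s\tevent
0.0\tcourt_start
1.0\tgroom_start
2.5\tgroom_end
4.5\tcourt_end
6.0\tcourt_start
9.0\tcourt_end
"""


def parse_events(text):
    """Read the tab-separated log into a list of (time_s, event) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(float(row["time_s"]), row["event"]) for row in reader]


def bout_durations(events):
    """Sum the time between each '<behaviour>_start' and the next
    '<behaviour>_end', returning total seconds per behaviour."""
    open_bouts = {}              # behaviour -> time its current bout began
    totals = defaultdict(float)  # behaviour -> accumulated seconds
    for time_s, event in events:
        behaviour, _, kind = event.rpartition("_")
        if kind == "start":
            open_bouts[behaviour] = time_s
        elif kind == "end" and behaviour in open_bouts:
            totals[behaviour] += time_s - open_bouts.pop(behaviour)
    return dict(totals)


events = parse_events(RAW_EVENT_LOG)
durations = bout_durations(events)
print(durations)
```

Archiving the raw log, a readme describing the columns, and a short script along these lines is usually enough for another expert to get from the raw file to the variables used in the paper.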

Even for simulations, we supply our scripts, configuration files, and sometimes data generated from the simulation (to replicate key figures that would take too long to generate by replicating the whole simulation).

It is always worth reminding ourselves (as scientists) of this quote (attributed to Sir William Bragg):

"The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them"


  1. They don't just want you to make your stuff available. They want you to use standardized formats for that. They also don't specify to what degree the data have to be 'raw', which in some cases makes a huge difference, especially in non-standard experiments.

  2. No doubt there will be issues to work out. However, we have all managed to get through it with GenBank, NCBI GEO, SRA, etc. I think that honest efforts are largely what will be required. Who knows, I could be completely wrong on this. It happens to me frequently!

  3. I would not call the second link you provided a "really negative reaction". It seems like a fairly balanced piece which largely argues that any problems with data sharing can be overcome and simply quotes a couple of people who seem to have problems with data-sharing.

    1. I think you are right that the article at The Scientist is not "really negative", at least compared to the blog post by DrugMonkey. I will edit my post to reflect this.

      However, I do think they did not really establish that data sharing has increasingly become the norm in biological disciplines (GenBank, NCBI GEO, DRYAD, etc.). Still, I should not have cast it in such a light.

  4. This comment has been removed by the author.

  5. I am a big fan of making data available and agree that making standard genomics data available will be relatively easy, and most journals already require this. We generate a lot of metabolomics data, and the data are more complex and the notion of 'raw data' is much less clear than with DNA sequence. This is partly due to MS instrument makers creating software that is proprietary, and it is unlikely that making raw data available will be of the same benefit as with DNA sequence. Also, we have a lot of phenotypic data on a public website/database, and these kinds of data are really hard to standardize. By moving towards one size fits all, I think that journals are going well beyond the cost/benefit tipping point. Putting a manuscript together is getting more and more complex every year.

    1. Rob,

      The fact that you are making as much effort as you do to make all your data so easily available makes your concerns and suggestions that much more important to consider! I am far more concerned about those who have made no attempt (and for whatever reasons they have, do not wish to).

      I completely agree that for many data types there is no "one size fits all" solution, and I am not sure that PLoS or DRYAD have argued that there is. Certainly I have never had this issue myself. For our Genetics paper that just got accepted we have archived new data on both DRYAD and NCBI SRA, and used older data from DRYAD and NCBI GEO. There were no issues with putting in all the links to the data, and no complaints from the editors.

      While I know little about the metabolomics data types, I will say that at the beginning of the "NextGen" sequencing revolution, data types were all over the place, and "standard formats" took a while to coalesce. You probably remember this as well as I do. This suggests to me that for metabolomics it requires some effort on the part of the community to make decisions about data (and meta-data), and to get the companies to allow exporting to standard file types. Or is there something much more complicated?

  6. I agree with you that some level of (possibly post-processed) data can be made available to the benefit of everybody, but a simple citation is not sufficient credit for those who collect the data, especially when there is no restriction preventing the data provided from being used to "construct new stories". Something like a data-source tag, along with credit if one's data are used as a data source by theoretical/analytical papers, should be a prerequisite before open data deposition without restriction can be considered.

  7. I get your point. I have been reflecting, during this whole debate, on how the re-use of previously published data should be "counted" in terms of scientific productivity. I agree that in some cases, simply citing the paper & data may be insufficient.

    I really like the idea of a data-source tag. I have heard some related suggestions before, but I think it could work. If a data set is often re-analyzed then the authors should be able to consider it a product, much like we do for software. We would need to convince both our institutions and funding agencies to do this, but if the push was sufficiently broad I think it would be possible (but slow).

    Yet, requiring co-authorship for those who generated the data associated with a previously published paper could be very problematic, in particular in instances where the new study may be refuting findings from the original paper, or where the authors of the original paper disagree with inferences made when their data is used (say as part of a large meta-analysis).

    Furthermore, there is a real question of "who" should be the co-author with respect to data collection, since more often than not it is not the PI, but undergrads, grad students, technicians or post-docs who "collect" the data.

    I certainly do not have any simple answers, but I think the idea of a data-source tag can certainly be used. Obviously we can cite the data (in addition to the paper itself) as it is usually in a place with a DOI. However, this probably does not go far enough in many cases.

    I remain worried, though, that if we wait for funding agencies and institutions to catch up, it will be too slow, and a considerable amount of important data will be lost forever because those who held onto it eventually lost it or left science.

    Thanks so much for your comments!