Saturday, September 27, 2014

Sufficient biological replication is essential for differential expression analysis of RNA-seq

I just took part in a Twitter discussion about the trade-offs between sequencing depth and the number of independent biological replicates (per treatment group) for differential gene expression analysis. There are applications of RNA-seq where sequencing deeply (more than, say, 50 million reads for a given sample) can be important for discovery. However, most researchers I interact with are interested, at some level, in differential expression among groups (different genotypes, species, tissues, etc.). As with everything else that requires making estimates and quantifying the uncertainty of those estimates (minimally necessary for differential expression), you need independent biological samples within each group as well. The ENCODE guidelines suggest a minimum of 2 biological replicates per treatment group (well, they do not say "biological" replicates, but I will give them the benefit of the doubt).

However, numerous studies have demonstrated that 2 is rarely sufficient (see links below); I have no idea where ENCODE got this number from. Generally you want to aim for 4 or more replicates for simple experimental designs. Numerous studies have shown this (both by simulation and by rarefaction analysis). They also demonstrate that, on balance, beyond a certain read depth per sample (somewhere between 10 and 25 million reads) there are diminishing returns even for rare transcripts (in terms of differential expression), and that it is better to add independent biological replication (say, 5 samples each at 20 million reads) rather than depth (2 independent biological samples at 50 million reads each). The exact number depends on several factors, including biological variability (and measurement error) within groups, as well as experimental design. A number of tools have been developed to help folks figure out optimal designs.
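The replicate side of the trade-off can be sketched with a quick simulation. This is a deliberately simplified sketch, not one of the purpose-built design tools mentioned above: it runs a t-test on hypothetical log-scale expression values for a single gene, with made-up effect and variability numbers, and just shows how power climbs as biological replicates are added.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(20140927)

def power_sim(n_reps, effect=1.0, bio_sd=0.5, n_sims=2000, alpha=0.05):
    """Fraction of simulated experiments in which a two-sample t-test
    detects a true log2 fold change of `effect`, given `n_reps`
    biological replicates per group and between-replicate (biological)
    standard deviation `bio_sd` on the log scale."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, bio_sd, n_reps)
        treated = rng.normal(effect, bio_sd, n_reps)
        if stats.ttest_ind(control, treated).pvalue < alpha:
            hits += 1
    return hits / n_sims

for n in (2, 3, 4, 5):
    print(f"{n} replicates/group: power ~ {power_sim(n):.2f}")
```

With these made-up numbers, 2 replicates per group leaves most true differences undetected, while 4-5 replicates detect the majority. Deeper sequencing would shrink measurement error per sample, but it cannot shrink the biological variance term, which is why replicates win past a modest depth.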

Here are just a few such studies (there are many more, just wanted a handful for the moment).

Check out the linked post for a brief and succinct discussion of these and other issues.

And yes, depending on your questions, read length (and paired-end vs. single-end sequencing) also contributes!

Wednesday, September 24, 2014

Implementing Discovery

This post is the first (of two) with my suggestions for how to implement "open discovery" for answering scientific questions, but in a way that does not completely alienate current professional scientists. This matters in particular because of the current system, in which "credit" for answering questions translates to prestige, which directly translates to tangible material considerations (raises, being invited to give talks, grants, employment).

Some Background

Early last week I saw a tweet by @caseybergman:

This was posted in the context of how Casey plans to implement his scientific research in the coming months and years. Casey was one of two folks who introduced me to Twitter as a serious means of scientific communication, and I have gotten a lot out of our one-on-one conversations, so I went ahead and read the book he mentioned, Michael Nielsen's Reinventing Discovery. I was very inspired by it.

I am not very efficient at summarizing books, but you can read the first chapter for free online, and it does a good job of conveying the main message. Essentially, scientific discovery can be profoundly changed for the better (and in particular made much more efficient and productive) by opening up the ongoing research endeavour to the world, for any and all to actively and concurrently participate in. This goes well beyond (but does include) sharing all data, source code and manuscripts, which (if done at all) happens after most of the actual research has been completed. The approach advocated in the book is about setting up the important problems in the field (along with some progress on them), and inviting all scientists (professional and lay alike) to participate in answering the questions.

One of the main examples that Nielsen cites throughout the book is Gowers's Polymath Project, which used an open collaborative framework (on Gowers's blog) to find a mathematical proof that had previously eluded the mathematical community. I will not go into the details from the book or blog post, but will just point out that this turned out to work very well, and efficiently: it went from posing the question (along with some progress Gowers had made) to answering it in less than 40 days. Not surprisingly, this was followed by writing up the results in a scientific paper.

The book is full of various examples of how scientifically or computationally challenging problems have been addressed in the open on the internet. I will let you read the book and decide for yourself. While it is fair to say I was already primed (see here and here) for such a vision of scientific discovery, I did realize how much more potential there was.


Such an approach flies in the face of how many (including within the scientific community) perceive how science is done, and how credit for solving scientific problems is garnered. While both the process of scientific discovery and the communication of those discoveries have changed a great deal over the past few hundred years, it is fairly clear that more "recently" (post-World War II, anyway) a particular system has been built up for professional academic scientists (those who do their research and teach at universities and other institutes of higher learning).

There are two parts to the "standardization" (or possibly calcification) of scientific discovery and communication that bear considering, and why an immediate transformation to a completely open process of scientific discovery may not be easy (at least from the perspective of academic scientists).

Current practices in scientific communication

First is the means of scientific communication that has become accepted (and calcified). When an individual or group of scientists working on a particular problem have made (in their eyes) sufficient progress, they will usually communicate this in the form of a scientific paper, commonly by submitting a manuscript to a journal. An editor sends it out for peer review (to other experts in the field), who evaluate it for technical correctness, soundness of logic, and, at many journals, "novelty" of the ideas and findings. If some of these criteria are not met, the manuscript may be recommended for rejection; otherwise, for corrections (revisions) or for acceptance. For a given manuscript this process may repeat several times at the same journal, or at different journals (if rejected from the first).

Scientific prestige and current measures of productivity do translate to material benefits (for the scientist).

Assuming the paper is accepted, it is published in a scientific journal. The publication of the article itself, the place it is published, and the attention it receives (both in the scientific literature via citations, and in the popular press) can all "matter" for the nebulous ideas of "credit" and "prestige" for the scientists who did the work and wrote the paper. Indeed, these ideas of credit/prestige are at the heart of how scientists are evaluated at universities. Our employment (getting a job in the first place), career advancement, salaries, grant support, etc., can all depend (to varying degrees) on where and how much we publish. These are proxies for "research productivity".

Who gets credit for scientific breakthroughs?

The other piece of this (related to the idea of prestige) is the notion that much of the scientific work, and the "important breakthroughs," are done by lone individuals or small research groups. These breakthroughs are often popularized in textbooks and the media as having come out of nowhere (i.e., as research completely unlike what has been done before). However, most of the time it is pretty clear that (as with general relativity, or natural selection) related ideas and concepts were already percolating in the scientific community. That is not to play down the genius of folks like Einstein or Darwin, just to note that these breakthroughs rarely occur in an extreme intellectual vacuum.

The problem is that, even in modern universities, research institutes and funding agencies, these sorts of ideas persist, and the prestige for addressing a particular research problem goes to one or a few people. This is despite the fact that much highly related work (that set up the conditions for the breakthrough) happened before. Even for multi-authored papers, it is usually just a few of the authors who garner the credit for the findings/discoveries; in my field this is usually the first and final authors on the manuscript. The prestige associated with this leads to all sorts of benefits (like the ones mentioned above), as well as invitations to give talks around the world at conferences and universities. Thus there is a real material benefit in modern academic science to garnering this prestige (and getting the right position as an author).

The problem is that, in many fields, what determines author position can vary considerably. In my field, first authorship usually goes to the person who has contributed the most work and insight to the research in the paper, and the last position to the "senior scientist" in whose lab the work was done (who usually garnered the funds, and sometimes came up with the ideas). However, the difference in contribution between the first and second authors (or among subsequent sets of authors) is rarely quantified, or clear.

This difference in authorship position nonetheless means a great deal, materially, to the participating scientists. Being first author on a paper in a prestigious journal (even if there are 40 authors on the paper) may be necessary (although rarely sufficient) to get a job at a major research university. Being third author on each of three papers in that same journal (even if there are only 4 authors on each of those papers) will not carry nearly as much weight.

Thus an open, collaborative discovery system for scientific research is at odds with the current (socially constructed) system of academic rewards for professional scientists. This is by no means insurmountable. While Reinventing Discovery only touched on some possible solutions, in the coming days I will post about one possible idea for meeting such goals, in a way that can be easily integrated into the current system of publishing (if maybe not of rewards).

Saturday, May 10, 2014

Can we really "afford" not to estimate effect sizes?

A recent post over on DATA COLADA suggested that the sample sizes required to estimate effect sizes appropriately are prohibitive for most experiments. In particular this is the point they made:

"Only kind of the opposite because it is not that we shouldn’t try to estimate effect sizes; it is that, in the lab, we can’t afford to."

In response to their post, it has already been pointed out that even if an individual estimate of effect size (from a single study) has a high degree of statistical uncertainty (wide confidence limits), a combination of such estimates across many studies (in a meta-analysis) would actually have pretty reasonable uncertainty (far smaller than for any single experiment).

I think that there are a couple of other basic points to be made as well.

1) There is no reason not to report the effect size and its confidence intervals. These can be readily computed, so why not report them? Even if some folks still like to read tea leaves from p-values, the effect size helps in interpreting the biological (or other) effect of the particular variables under investigation.
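To show just how readily computed these quantities are, here is a minimal sketch for the simplest case, a two-group comparison, using invented numbers and the standard pooled (equal-variance) interval. Any stats package will give you the same thing.

```python
import numpy as np
from scipy import stats

def mean_diff_ci(a, b, conf=0.95):
    """Difference between two group means (the effect size, on the
    original measurement scale) with a pooled, equal-variance CI."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diff = b.mean() - a.mean()
    na, nb = len(a), len(b)
    # pooled variance and standard error of the difference
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    se = np.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    tcrit = stats.t.ppf(0.5 + conf / 2, df=na + nb - 2)
    return diff, (diff - tcrit * se, diff + tcrit * se)

# invented example measurements
control = [10.1, 9.8, 10.4, 10.0]
treated = [11.2, 11.0, 11.6, 10.9]
d, (lo, hi) = mean_diff_ci(control, treated)
print(f"effect = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Note that the interval itself already answers the significance question: if it excludes zero, the corresponding t-test would reject at the matching alpha, but the interval additionally tells you how big the effect plausibly is.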

2) The main argument from the DATA COLADA blog post seems to rest on the simulation they summarize in figure 2, and there are two important points to be made about it. First, for all the sample sizes they investigate in figure 2, the confidence intervals do not overlap zero, so the effect sizes also demonstrate the "significant effect," but with considerable additional information. Second, there is no loss of information in reporting effect sizes with confidence intervals: you can always do a formal significance test as well, although most of the time it will not provide further insight.

3) The "acceptable" width of the 95% confidence intervals is a discipline-specific issue (and the CIs are a function not just of sample size, but of the standard deviation of the observed data itself).
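That dependence on both quantities is easy to make concrete: the half-width of the interval for a two-group mean difference scales linearly with the standard deviation and shrinks only as one over the square root of the sample size. A small sketch (equal group sizes and a common SD assumed, both made up):

```python
import numpy as np
from scipy import stats

def ci_half_width(sd, n, conf=0.95):
    """Half-width of the CI for a difference between two group means,
    assuming equal group sizes `n` and a common standard deviation `sd`."""
    tcrit = stats.t.ppf(0.5 + conf / 2, df=2 * n - 2)
    return tcrit * sd * np.sqrt(2.0 / n)

for n in (5, 10, 20, 40):
    print(f"n = {n:2d} per group: +/- {ci_half_width(1.0, n):.2f} SD units")
```

Doubling the within-group SD doubles the interval, while quadrupling the sample size only halves it, which is exactly why what counts as an "acceptable" width has to be judged against the variability typical of your own discipline.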

So please report effect sizes and confidence intervals. Your field needs it far more than another example of P < 0.05.

Tuesday, February 25, 2014

Why would any scientist fuss over making their data public and accessible?

Well, colour me naive. When PLoS announced their new data archiving policy a few days ago, I hardly felt like it was a "big deal". Providing a public archive of the raw data used in a published study seems like a no-brainer (except in very limited circumstances involving medical records, location data for at-risk species, etc.), and is becoming standard practice, right? Clearly my naivete knows no bounds, given some of the really negative reactions to it (here is one example, and the issue is discussed a bit more broadly here).

In the fields where I am most active (at the intersection of genomics, genetics and evolution), there have been numerous recent (and successful?) efforts to make sure data associated with studies is archived and made publicly available in repositories. While the repositories are not perfect, data archiving seemed to be working and generally useful, and I was always happy to do it myself. Yes, some of the issues with getting the data and metadata formatted for NCBI GEO (or SRA) could be annoying at times, but this was such a minor concern relative to all of the other effort in collecting and analyzing the data and writing the manuscript (and getting it accepted for publication) that the day spent making it available to other researchers long term seemed pretty minor. Other scientists have always sent me reagents and (when they could find it) data, so this seemed like an easy way to be helpful and in line with the scientific process (and hopefully progress).

More importantly, I have tried to get data from other researchers over the years (with "old hard drive failures" always seeming to be the reason it could not be made available to me), and I have recently been involved in a large meta-analysis of previously published data. Rarely was the raw data available, and because only summary statistics were available (rarely with associated measures of statistical uncertainty), we were very limited in what analyses we could do. There would be so much more we could do if the raw data had been archived.

So I do not want other researchers to have to deal with these frustrations because of me. By archiving data generated by myself or members of my lab, other researchers can get it without hassling me, and I do not have to worry about finding it at a later date (like 10 years down the road), when it may take far more time to recover than putting it in a repository in the first place would have.

When, a few years ago, most of the journals in evolutionary biology simultaneously started a data archiving policy (generally associated with DRYAD), I was quite happy. Not only did I put data from new studies up in DRYAD, but also data from my older studies (see here). I naively expected most evolutionary biologists to do the same. After all, there are many long-term data sets in evolutionary biology that would be of great value, in particular for studies of natural selection and for estimating G matrices, where there is still much active methodological development. Some publications generated data sets requiring heroic efforts, and these would be a huge community resource.

So I was a little surprised when DRYAD was not rapidly populated with all of these great legacy datasets. I think that folks "hoarding" data are a very small minority; the majority were just very busy, and this did not seem like a pressing issue to them. In any case, I have also spent some effort at my institution (Michigan State University) discussing the importance of data archiving with students. All of the benefits seem obvious: making our science more open, and making our data available to those who may be able to address interesting and novel questions in the future. Fundamentally, it is the data and the analysis (and interpretation) that represent much of the science we do; our scientific papers are a summary of this work. Better to have it all (data, analysis and interpretation) available, no?

So, when PLoS made the announcement, it seemed like par for the course in biology. Funding agencies are mandating data management and sharing plans, and other journals are doing the same. So who could be either shocked or dismayed by this?

Like I said, I can be naive.

Even after reading (and re-reading) the posts (above) and discussion threads about concerns, I am still baffled. Yes, there will be a few "corner cases" where privacy, safety or conservation concerns need to be considered. However, for the vast majority of studies in biology this is not the case, or the data can be stripped of identifiers or sensitive variables to alleviate such issues.

So what's the problem?
Does it require a little work on the part of the authors of the studies? Perhaps a little. However, I always remind folks in the lab that the raw data they generate, and the scripts they use for analysis, will be made available. I find that, to my own benefit, the scripts end up much easier to read. Furthermore, keeping these issues in mind makes it that much easier to get everything organized for archiving. The readme files we generate make sure we do not forget what variable names mean, or other such details. Handling data transformations or removing outliers in scripts means we can always go back and double-check the influence of those observations.
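As a hypothetical illustration of that last point, outliers can be flagged in the analysis script rather than deleted from the raw file, so anyone using the archive can re-run the analysis with or without the excluded rows. The data frame, column names and cutoff rule here are all invented for the example:

```python
import pandas as pd

# Invented example data: one value (9.99) looks like a data-entry error.
raw = pd.DataFrame({
    "line": ["A", "A", "B", "B", "B"],
    "mass_mg": [1.92, 2.05, 1.88, 9.99, 1.95],
})

# Flag (rather than silently delete) values more than 3 robust SDs from
# the median, using the median absolute deviation (MAD) scaled to match
# the normal-distribution SD.
med = raw["mass_mg"].median()
mad = (raw["mass_mg"] - med).abs().median()
raw["excluded"] = (raw["mass_mg"] - med).abs() > 3 * 1.4826 * mad

# The archived raw file is untouched; the analysis proceeds on the
# filtered copy, and anyone can rerun it with the flagged rows left in.
clean = raw.loc[~raw["excluded"]]
```

Because the exclusion rule lives in the script next to the data, the readme only needs to explain the variable names; the "what was removed and why" question answers itself.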

In their post, DrugMonkey suggests that the behavioural data they generate in their lab is too difficult to organize in a fashion that would be broadly useful. While I agree that the raw video (if that is what they are collecting) remains difficult (although perhaps figshare could be used, which is what we will try for our behavioural trials), we find that the text files from our "event recordings" are very easy to post, organize and generate metadata for. Does the data need to be parsed from its "raw" format into one more useful for analysis? Sure, but we will also supply (as we do in our DRYAD data packages) the scripts to do so. Perhaps there is something I am missing about their concern, but I do not concede their point about the difficulty of organizing their data for archiving. How hard should it really be to make such files useful to other experts in the field?

Even for simulations, we supply our scripts, configuration files, and sometimes data generated from the simulation (to replicate key figures that would take too long to generate by replicating the whole simulation).

It is always worth reminding ourselves (as scientists) of this quote (attributed to Sir William Bragg):

"The important thing in science is not so much to obtain new facts as to discover new ways of thinking about them"