Genes Gone Wild: September 2014

I just took part in a Twitter discussion about the trade-offs between sequencing depth and the number of independent biological replicates (per treatment group) for differential gene expression analysis. There are applications of RNA-seq where sequencing deeply (more than, say, 50 million reads for a given sample) can be important for discovery. However, most researchers I interact with are interested at some level in differential expression among groups (different genotypes, species, tissues, etc.). As with everything else that requires making estimates and quantifying the uncertainty of those estimates (minimally necessary for differential expression), you need independent biological samples within each group. The ENCODE guidelines suggest a minimum of 2 biological replicates per treatment group (well, they do not say "biological" replicates, but I will give them the benefit of the doubt).

However, numerous studies have demonstrated, both by simulation and by rarefaction analysis, that 2 is rarely sufficient (see links below). I have no idea where ENCODE got this number from. Generally you want to aim for 4 or more for simple experimental designs. These studies also demonstrate that, on balance, beyond a certain read depth per sample (somewhere between 10 and 25 million reads) there are diminishing returns for rare transcripts (in terms of differential expression), and that it is better to do more independent biological replication (say 5 samples each at 20 million reads) rather than more depth (2 independent biological samples at 50 million reads each). The exact number depends on several factors, including biological variability (and measurement error) within groups, as well as experimental design. A number of tools have been developed to help folks figure out optimal designs.
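To make the depth versus replication arithmetic concrete, here is a minimal sketch in Python. It is not taken from any of the studies linked in this post. It assumes counts follow a negative binomial distribution, where the squared coefficient of variation of a single sample is 1/mean + dispersion, and that replicates are independent. The function name, the transcript fraction and the dispersion value are all hypothetical, chosen only for illustration.

```python
import math


def group_mean_cv2(reads_per_sample, n_replicates, transcript_fraction, dispersion):
    """Squared coefficient of variation of a group's mean count.

    Assumes negative binomial counts, where a single sample has
    CV^2 = 1/mean + dispersion, and independent replicates.
    """
    expected_count = reads_per_sample * transcript_fraction
    per_sample_cv2 = 1.0 / expected_count + dispersion
    return per_sample_cv2 / n_replicates


# Hypothetical values, chosen only for illustration.
RARE_FRACTION = 1e-6   # transcript makes up one read in a million
DISPERSION = 0.1       # biological variability between replicates

# Both designs use 100 million reads per group in total.
designs = {
    "5 samples x 20M reads": (20e6, 5),
    "2 samples x 50M reads": (50e6, 2),
}

for label, (depth, n) in designs.items():
    cv2 = group_mean_cv2(depth, n, RARE_FRACTION, DISPERSION)
    print(f"{label}: CV^2 of group mean = {cv2:.3f}")
```

Under these assumed values the five-replicate design has about half the variance of the two-replicate design, even though both use the same total number of reads. Extra depth only shrinks the counting-noise term (1/mean), while extra replicates shrink the counting noise and the biological variability together. If the dispersion is set to zero, the two designs come out the same, which is another way of saying that the case for replication rests on biological variability.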

Here are just a few such studies (there are many more, just wanted a handful for the moment).

http://www.ncbi.nlm.nih.gov/pubmed/24319002
http://www.ncbi.nlm.nih.gov/pubmed/25246651
http://www.ncbi.nlm.nih.gov/pubmed/22985019
http://www.ncbi.nlm.nih.gov/pubmed/22268221
http://www.ncbi.nlm.nih.gov/pubmed/23497356

Check out
http://bfg.oxfordjournals.org/content/early/2011/12/30/bfgp.elr041.full.pdf+html
for a brief and succinct discussion of these and other issues.

And yes, depending on your questions, read length and PE versus SE (paired-end versus single-end) reads also contribute!

This post is the first (of two) about my suggestions for how to implement "open discovery" for answering scientific questions, but in a way that does not completely alienate current professional scientists. This matters in particular because, in the current system, "credit" for answering questions translates to prestige, which translates directly to tangible material considerations (raises, invitations to give talks, grants, employment).

Some Background

Early last week I saw a tweet by @caseybergman:

@cdessimoz @MVickySchneider my thinking is heavily influenced by @michael_nielsen's book "Reinventing Discovery" http://t.co/yUvT3gldNb 2/2
— Casey Bergman (@caseybergman) September 12, 2014

This was posted in the context of how Casey plans to implement his scientific research in the coming months and years. Casey was one of two folks who introduced me to Twitter as a serious means of scientific communication, and I have always gotten a lot out of our one-on-one conversations, so I went ahead and read the book he mentioned, Reinventing Discovery by Michael Nielsen (http://michaelnielsen.org/). I was very inspired by the book.

I am not very efficient at summarizing books, but you can read the first chapter for free online. It does a good job of summarizing the main message of the book. Essentially, scientific discovery can be profoundly changed for the better (and in particular made much more efficient and productive) by opening up the ongoing research endeavour to the world, for any and all to actively and concurrently participate in. This goes well beyond (but does include) sharing all data, source code and manuscripts, which (if done at all) usually happens after most of the actual research has been completed. The approach advocated in the book is about setting up the important problems in the field (with some progress of research on those problems), and inviting all scientists (professional and lay scientists alike) to participate in answering the questions.

One of the main examples that Nielsen cites throughout the book is Tim Gowers' Polymath Project, which used an open collaborative framework (on Gowers' blog) to find a mathematical proof that had previously eluded the mathematical community. I will not go into the details from the book or blog post, but will just point out that this turned out to work very well, and efficiently, and went from asking the question (with some progress Gowers had made) to answering it in less than 40 days. Not surprisingly, this was followed up by writing up the results in a scientific paper.

The book is full of examples of how scientifically or computationally challenging problems have been addressed in the open on the internet. I will let you read the book and decide for yourself. While it is fair to say I was already primed (see here and here) for such a vision of scientific discovery, reading it made me realize how much more potential there was.

But...

Such an approach flies in the face of how many people (including within the scientific community) perceive science to be done, and how credit for solving scientific problems is garnered. While both the process of scientific discovery and the communication of those discoveries have changed a great deal over the past few hundred years, it is fairly clear that more "recently" (post World War II anyways) a particular system has been built up for professional academic scientists (those who do their research and teach at universities and other institutes of higher learning).

There are two parts to the "standardization" (or possibly calcification) of scientific discovery and communication that bear considering, and that explain why an immediate transformation to a completely open process of scientific discovery may not be easy (at least from the perspective of academic scientists).

Current practices in scientific communication

First is the means of scientific communication that has become accepted (and calcified). When an individual or group of scientists working on a particular problem has made (in their eyes) sufficient progress, they will usually communicate this in the form of a scientific paper. This is commonly done by submitting a manuscript to a journal. An editor sends it out for peer review to other experts in the field, who evaluate it for technical correctness, soundness of logic, and, for many journals, "novelty" of the ideas and findings. If some of these criteria are not met, the manuscript may be recommended for rejection, otherwise for corrections (revisions), or for acceptance. For a given manuscript this process may repeat several times at the same journal, or at different journals (if rejected from the first journal).

Scientific prestige and current measures of productivity do translate to material benefits (for the scientist).

Assuming the paper is accepted, it is published in a scientific journal. The publication of the article itself, the place it is published and the attention it receives (both in the scientific literature via citations, and in the popular press) can all "matter" for the nebulous ideas of "credit" and "prestige" for the scientists who did the work and wrote the paper. Indeed these ideas of "credit/prestige" are at the heart of how scientists are evaluated at universities. Our employment (getting a job in the first place), career advancement, salaries, grant support, etc. can all depend (to varying degrees) on where and how much we publish. These are proxies for "research productivity".

Who gets credit for scientific breakthroughs.

The other piece of this (and related to the idea of prestige) is the idea that much of the scientific work, and the "important breakthroughs", are done by lone individuals or small research groups. These "important breakthroughs" are often popularized in textbooks and the media as having come out of nowhere (i.e. that the research is completely unlike what has been done before). However, most of the time it is pretty clear that (as with general relativity, or natural selection) related ideas and concepts were percolating in the scientific community. That is not to play down the genius of folks like Einstein or Darwin, just that these breakthroughs rarely occur in an intellectual vacuum.

The problem is that even in modern universities, research institutes and funding agencies, these sorts of ideas persist, and the prestige for addressing a particular research problem goes to one or a few people. This is despite the fact that much highly related work (that set up the conditions for the breakthrough) happened before. Even for multi-authored papers, it is usually just a few of the authors who garner the credit for the findings/discoveries. In my field this is usually the people who are the first and final authors on the manuscript. The prestige associated with this leads to all sorts of benefits (like the ones mentioned above), as well as being invited to give talks around the world at conferences and universities. Thus there is a real materialistic benefit in modern academic science for garnering this prestige (and getting the right position as an author).

The problem is that what defines author position can vary considerably across fields. In my field the first author position is usually for the person who has contributed the most work and insight to the research in the paper, and the last position is for the "senior scientist" in whose lab the work was done (and who usually garnered the funds, and sometimes came up with the ideas). However, the difference in contribution between the first and second author (or subsequent sets of authors) is rarely quantified, or clear.

This difference in authorship position means a great deal for material concerns to the participating scientists though. Being first author on a paper in a prestigious journal (even if there are 40 authors on the paper) may be necessary (although rarely sufficient) to get a job at a major research university. However being third author on each of three papers in that same journal (even if there are only 4 authors on each of those papers), will not carry nearly as much weight.

Thus an open collaborative discovery system for scientific research is at odds with the current (socially constructed) system of academic awards for professional scientists. This is by no means insurmountable. While "Reinventing Discovery" only touched on some possible solutions, in the coming days I will post about one possible idea towards meeting such goals, in a way that can be easily integrated into the current system of publishing (but maybe not rewards).

Saturday, September 27, 2014

Sufficient biological replication is essential for differential expression analysis of RNA-seq

Wednesday, September 24, 2014

Implementing Discovery
