The GenABEL project: methodology to address real world problems

by Yurii Aulchenko

The GenABEL project is dedicated to developing statistical genomics methodology that has a large impact on the real world. From this perspective, methodology development includes the statistical methodology itself, its implementation in usable software, and the application of this software to real data in order to generate new knowledge.

Thus, we see methodology development as a three-stage process: mathematical formulation of the method, formulation and implementation of an algorithm in software, and, finally, application of the methodology to real data. In practice, most of the time it is the data that call for a new method. In that case, application comes before the mathematical formulation. The presence of all three stages, and the feedback between them, is a key aspect of our approach to statistical genomics methodology development.

Why are all three stages critical?

For example, you may develop something that looks like a nice and promising piece of math, but when you try to implement it, you find out that you did not completely understand the problem, or that you were operating under implicit assumptions that are unlikely to hold. You may also find out that the computational complexity is too high, and that you need to change the method to make it practically applicable.

Next, it is important to apply your methodology to simulated data (for which you know the answer): nonsense results will provide feedback on the implementation (is that a bug?) or even on the methodology (ah, formula 15-3 was wrong!).
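To make the simulated-data check concrete, here is a minimal sketch in Python. It is only an illustration, not GenABEL code: the function names (`simulate`, `ols_slope`), the additive 0/1/2 genotype coding, the allele frequency, the effect size and the tolerance are all assumptions chosen for the example. The idea is simply to simulate data with a known effect and check that the estimate comes back close to it.

```python
import random


def simulate(n, maf, beta, seed):
    """Simulate additive genotypes (0/1/2) and a phenotype with a known effect.

    Hypothetical toy model: phenotype = beta * genotype + standard normal noise.
    """
    rng = random.Random(seed)
    # Each genotype is the sum of two independent allele draws.
    genotypes = [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]
    phenotypes = [beta * g + rng.gauss(0.0, 1.0) for g in genotypes]
    return genotypes, phenotypes


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (the 'method' under test)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


TRUE_BETA = 0.5  # the answer we know because we simulated it
genotypes, phenotypes = simulate(n=5000, maf=0.3, beta=TRUE_BETA, seed=1)
estimate = ols_slope(genotypes, phenotypes)

# A large deviation from the simulated truth points to a bug or a wrong formula.
assert abs(estimate - TRUE_BETA) < 0.1, "estimate is far from the simulated truth"
```

If the final check fails, the question is exactly the one above: is it a bug in the implementation, or is the formula itself wrong?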

It is even more important to apply your methodology to REAL data as early as possible: it will provide feedback on the implementation (is it feasible to run my analysis in reality?) and on the methodology -- it is not that uncommon for a method to work on simulated data but fail miserably on real data! Also, trying to use real data will tell you which data formats people are using, and will eventually make your implementation really usable. There are examples of great methods implemented in software that requires such a specific data format that they become almost impossible to use.

To conclude: methodology development should be viewed as an integral process that includes development of the methodology itself, development of software, and application of the software to real data.

While such an integral approach is a tall order for an individual researcher or even a (smaller) group of researchers, it is feasible if an open source approach -- commons-based production by openly exchanging ideas, bartering (you help me, I help you) and collaboration -- is applied throughout. I will elaborate on this point in my further posts.

Disclaimer. This is my personal position, which is open for discussion. In this post, I am not speaking on behalf of the GenABEL project community, but rather seeding the discussion which will eventually set the project's standards.

I would like to thank Dr. Lennart Karssen for valuable and continuous discussion through which many of my views on the GenABEL project have evolved.