The GenABEL project: open source methodology

by Yurii Aulchenko

"The future is open source everything" (Linus Torvalds)

Open source development is commons-based peer production built on the open exchange of ideas and on collaboration. The GenABEL project provides infrastructure for applying the open source philosophy to methodology development in statistical genomics, with the public good as its aim. To my knowledge, applying open source ideas to methodology development is quite new.

In my view, ideally, the GenABEL project should operate on principles of open source and open standards throughout: for governance, methodology discussion, software implementation, extension, maintenance, documentation, publication, education, training, etc.

My view is also that, in the GenABEL project, open principles are critical for governance and for software development and maintenance; in other aspects, it is up to the individual contributor to decide how far he/she wants to go.

Open governance of the GenABEL project is a very important topic, which will be discussed in a separate post. Here, I address open principles in application to methodology, software development, documentation, etc.

It is important that methodology is developed through open discussion; in my experience, bringing your idea to discussion at an early stage saves time and improves quality. My suggestion is: if you have what seems to be a bright methodological idea, just send it to the genabel-development mailing list for discussion with other people. From the community, you will learn how original your idea is, and you may get people enthusiastic about it and willing to help with, e.g., code development and testing. We already have examples of this happening in the GenABEL project.

I can imagine that one may be afraid that an idea will be “stolen”. However, in my opinion, this is highly unlikely to happen: an individual stealing from peers will be excluded from the community right away, and will get very bad publicity. “Stealing” things is too risky a business for anyone who intends to stay in the field. Mind also that many ideas are “in the air”, and many people come to the same idea simultaneously and independently (e.g. the rediscovery of Mendel’s laws). Please also mind that an idea is worth very little if it does not translate into an application (e.g. through software development). It is just too easy to come up with ‘ideas’; implementing them is what makes them real (see Edison’s 1% inspiration and 99% perspiration).

I can also imagine that some people want to do methodology and software development "in private" for at least some time, because they are afraid that the idea/implementation will be used without proper reference/credit. I do not agree with this position. First, while you do run some risk by making a not-yet-published idea open and usable, I think the benefits of using community intelligence outweigh the risks. Secondly, even if you have a method published, only awareness of your method among the general public can prevent other people from reinventing "your" methodology and making a great fuss about it. The only way to get proper credit is to let a great many people know about your idea, that is, to disseminate it effectively. In this, the GenABEL project can help you.

Now, let us switch from ideas/methodology to implementation. Here again, we should stick to open source principles. My favorite license at the moment is the GNU GPL: anyone is free to copy, modify, and redistribute the code, provided the derivative code carries the same license (so it, too, is free to copy, modify, and redistribute; the license is recursive in a way). I also think all committers should have write access to the whole project code (which does not mean that someone would or should make stupid changes in other people’s code!). Again, my belief is that the source code should go public as soon as possible.

Here is an example to illustrate the above ideas. We released ProbABEL for public use in 2008, and it has been in wide use ever since. The ProbABEL paper was published ~1.5 years later, in 2010. I am happy we released ProbABEL in that way: a lot of feedback was provided between 2008 and 2010. Quite apart from the 'public good', I am also rather confident that our citation factor will be higher under this scenario than under an imaginary scenario in which we had postponed the public release of the software to 2010 (the publication date).

The next point is about open documentation. When one develops a tutorial, unless he/she wants to keep it private (?!), I suggest that it clearly carries a license. What really matters to me in a license is that other people can use my text as a starting point for further developments, can modify my text, make derivative works, and redistribute them (hopefully leading to increasingly better transmission of ideas, and to the public good). Of course, I also want my name to be mentioned and to get some credit. My current preference is CC BY-SA (http://creativecommons.org/licenses/; seems like GNU LGPL to me), and I am going to release ‘ABEL tutorial’ under it or something similar. Under this license, other people can make derivative works; this is definitely in the open source spirit.

At some point, we may think of publishing our 'community' set of tutorials and selling printed copies (charging for the physical copy and its delivery, not for the information!). This money can be used for purposes that are unequivocally for the good of the project as a whole, such as paying the costs of the Amazon Cloud hosting for our web server and forum.

Finally, it is only fair that a manuscript which has used the GenABEL project as a setting appears in open access: the community has contributed to the work, and has the right to access the results. I must admit, though, that the original GenABEL-package paper is not open access; back in 2007, this did not seem important. I will try to make that paper open access, but this will take some time.

Disclaimer. This is my personal position, which is open for discussion. In this post, I am not speaking on behalf of the GenABEL project community, but rather seeding the discussion, which will eventually set the project's broad standards.

I would like to thank Dr. Lennart Karssen for valuable and continuous discussion through which many of my views on the GenABEL project have evolved.