GenABEL tutorial

Copyright 2013, GenABEL project developers
Copyright 2007-2013, Yurii Aulchenko

June 19, 2013

THIS WORK IS LICENSED UNDER THE CREATIVE COMMONS ATTRIBUTION-SHAREALIKE 3.0 UNPORTED LICENSE. TO VIEW A COPY OF THIS LICENSE, VISIT HTTP://CREATIVECOMMONS.ORG/LICENSES/BY-SA/3.0/ OR SEND A LETTER TO CREATIVE COMMONS, 444 CASTRO STREET, SUITE 900, MOUNTAIN VIEW, CALIFORNIA, 94041, USA.

Contents

1 Overview
 1.1 Download necessary files
2 Introduction to R
 2.1 Basic R data types and operations
 2.2 Data frames
 2.3 Exploratory analysis of qualitative and quantitative traits
 2.4 Regression analysis
 2.5 Answers to exercises
3 Introduction to genetic association analysis in R
 3.1 Characterisation of genetic data
 3.2 Exploring genetic data with library genetics
 3.3 Genetic association analysis
 3.4 Example association analysis
 3.5 Exercise: Exploring genetic data using library genetics
 3.6 Answers to exercises
4 Introduction to the GenABEL-package
 4.1 General description of gwaa.data-class
 4.2 Accessing and modifying phenotypic data
 4.3 Sub-setting and coercing gwaa.data
 4.4 Exploring genetic data
 4.5 Answers to exercises
5 Genome-wide association analysis
 5.1 Data descriptives and first round of GWA analysis
 5.2 Genetic data QC
 5.3 Finding genetic sub-structure
 5.4 GWA association analysis
 5.5 Genome-wide association analysis exercise
 5.6 Answers to exercises
6 GWA analysis in presence of stratification: theory
 6.1 Genetic structure of populations
  6.1.1 Hardy-Weinberg equilibrium
  6.1.2 Inbreeding
  6.1.3 Mixture of genetic populations: Wahlund’s effect
 6.2 Effects of population structure on standard tests for association
  6.2.1 Standard tests for genetic association
  6.2.2 Effects of genetic structure on standard tests
  6.2.3 Genomic control
 6.3 Analysis of structured populations
  6.3.1 Structured association
  6.3.2 Mixed models based approach
  6.3.3 Estimation of kinship matrix from genomic data
  6.3.4 EIGENSTRAT and related methods
  6.3.5 Summary: what method to use?
 6.4 Links
7 GWA in presence of genetic stratification: practice
 7.1 Analysis with ethnic admixture
 7.2 Analysis of family data
 7.3 Example GWA analysis using family-based data
 7.4 Exercise: analysis of family data
 7.5 Answers to exercises
8 Imperfect knowledge about genotypes
 8.1 Motivation
 8.2 Input files
  8.2.1 SNP information file
  8.2.2 Genomic predictor file
  8.2.3 Phenotypic file
  8.2.4 Optional map file
 8.3 Running an analysis
  8.3.1 Basic analysis options
  8.3.2 Advanced analysis options
  8.3.3 Running multiple analyses at once: probabel.pl
 8.4 Output file format
 8.5 Preparing input files
 8.6 Memory use and performance
 8.7 Methodology
  8.7.1 Analysis of population-based data
  8.7.2 Analysis of pedigree data
 8.8 How to cite
9 Analysis of imputed data: an example
 9.1 Analysis of 500 directly typed SNPs
 9.2 Analysis of imputed data with ProbABEL-package
 9.3 Analysis of imputed data with MixABEL-package
 9.4 Answers to exercises
10 Meta-analysis of GWA scans
 10.1 Standard meta-analysis methods
 10.2 Exercise: meta-analysis of literature data
 10.3 Reporting GWA results for future meta-analysis
 10.4 Meta-analysis with MetABEL-package
 10.5 Answers to the exercise
  10.5.1 Exercise 9:
11 Analysis of selected region
 11.1 Exploring linkage disequilibrium
 11.2 Haplotype analysis
 11.3 Analysis of interactions
A Importing data to GenABEL-package
 A.1 Converting from preferred format
 A.2 Converting PLINK tped files
 A.3 Converting linkage-like files
 A.4 Converting from MACH format
 A.5 Converting from text format
B GenABEL internals
 B.1 Internal structure of gwaa.data-class

Chapter 1
Overview

This introduction is outdated: now the GenABEL-package is the project, the suite, and the package, see http://www.genabel.org/developers

GenABEL-package is an R library developed to facilitate Genome-Wide Association (GWA) analysis of binary and quantitative traits. GenABEL-package is implemented as an R library. R is a free, open source language and environment for general-purpose statistical analysis (available at http://www.r-project.org/). It implements powerful data management and analysis tools. Though it is not strictly necessary to learn everything about R to run GenABEL-package, it is highly recommended as this knowledge will improve flexibility and quality of your analysis.

Originally GenABEL-package was developed to facilitate GWA analysis of quantitative traits using data coming from extended families and/or collected form genetically isolated populations. At the same time GenABEL-package implements a large number of procedures used in analysis of population-based data; it supports analysis of binary and quantitative tarits, and of survival (time-till-event) data. Most up-to-date information about GenABEL-package can be found at the web site http://www.genabel.org.

This tutorial was originally written to serve as a set of exercises for the "Advances in population-based studies of complex genetic disorders" (GE03) course of the Netherlands Institute of Health Sciences (Nihes).

If you read this tutorial not as a part of the GE03 course, and you are eager to start with you GWA analysis without reading all the not-so-strictly-necessary staff, start directly from the section 5 ("Genome-wide association analysis").

Otherwise, you can start with R basics and simple association analyses using few SNPs in section 2, "Introduction to R". In the next section, 4 ("Introduction to the GenABEL-package") you will learn how to work with the gwaa.data-class, which is used to store GWA data in GenABEL-package and will perform some simple large-scale analyses.

In the next section, 5 ("Genome-wide association analysis"), you will do quality control of genetic data and do association analysis under realistic conditions. This section is the core of this tutorial.

The section 7 ("GWA in presence of genetic stratification: practice") is dedicated to analysis in the presence of population stratification and analysis of family-based data.

Genetic data imputations are covered in the section ??, "??".

The last section, 11 ("Analysis of selected region"), is dedicated to analysis of haplotype association and analysis of SNP interactions.

Information on importing the data from different file formats to GenABEL-package is given in appendix A ("Importing data to GenABEL-package"). Answers to exercises are provided at the end of the respective chapters.

Experienced R users start directly with the section (4, "Introduction to the GenABEL-package").

1.1 Download necessary files

This code needs to be run prior to other parts of tutorial. We reccommend that prior to any actions you create a new directory, say, ’exercisesGenABEL’, to keep all of your working tutorial files there.

Start R and make sure that your working directory is set to a proper location. Your current working directory can be queried by command ’getwd()’. Use ’setwd’ command to set the working directory.

The next lines of code kill the ’RData’ directory if it is present in your working directory (danger! danger!) to make new clean data installation. Paste this code into R:

unlink("RData",recursive=TRUE,force=TRUE)  
dir.create("RData")

Now, fetch the necessary data from the server. First, define the download procedure

myDownloads <- function(baseUrl,baseLocal,files) {  
for (cFile in files) {  
    cFileUrl <- paste(baseUrl,cFile,sep="")  
    cFileLocal <- paste(baseLocal,cFile,sep="")  
    tryDownload <- try(  
                       download.file(url=cFileUrl,destfile=cFileLocal)  
                       )  
    if ( is(tryDownload,"try-error") )  
      stop(paste("can not download",cFileUrl,"into",cFileLocal,":",tryDownload))  
  }  
}

Second, download data files:

baseUrl <- "http://www.genabel.org/sites/default/files/data/"  
baseLocal <- "RData/"  
dataFiles <- c(  
           "assocbase.RData",  
           "popdat.RData",  
           "mach1.out.mlinfo",  
           "mach1.mldose.fvi",  
           "mach1.mldose.fvd",  
           "rcT.PHE",  
           "gen0.illu",  
           "gen0.illuwos",  
           "gen0.tped",  
           "gen0.tfam",  
           "gen0.ped",  
           "map0.dat",  
           "emap0.dat",  
           "phe0.dat",  
           "ImputedDataAnalysis.RData")  
myDownloads(baseUrl,baseLocal,dataFiles)

That’s it! - now you are fully set to start with the GenABEL tutorial!

Chapter 2
Introduction to R

In this section we will consider the basic R data types and operations, as well as tools for the analysis of qualitative and quantitative traits. Only basic R functionality – the things which are crucial to know before we can proceed to genetic association analysis – will be covered within this section. If you want to make most of your data, though, we strongly recommend that you improve your knowledge of R using books other than this one. A number of excellent manuals (’An introduction to R’, ’Simple R’, ’Practical Regression and Anova using R’, and others) is available free of charge from the R project web-site (http://www.r-project.org).

In the first part of this chapter you will learn about the most important R data types and will learn how to work with R data. Next, we will cover exploratory data analysis. The chapter will end with an introduction to regression analysis.

2.1 Basic R data types and operations

In contrast with many other statistical analysis packages, analysis in R is not based on a graphical user interface, but is command line-based. When you first start R, a command prompt appears. To get help and overview of R, type help.start() on the command line and press enter. This will start your default internet browser and open the main page of the R documentation.

Let us first use R as a powerful calculator. You can directly operate with numbers in R. Try multiplying two by three:

  > 2*3

  [1] 6

Other standard arithmetic operations can be performed in similar manner:

  > 2/3

  [1] 0.6666667

(division)

  > 2^3

  [1] 8

(power)

  > 2-3

  [1] -1

(subtraction)

  > 2+3

  [1] 5

(summation)1 .

Mathematical functions, such as square roots, base-10 logarithm, and exponentiation, are available in R as well:

  > sqrt(5)
  [1] 2.236068
  > log10(2.24)
  [1] 0.350248
  > exp(0.35)
  [1] 1.419068

Here, we have computed e to the power of base-10 logarithm of the square root of the sum of two and three. After each operation, we have rounded the result to the two digits after the floating point – just to do less typing.

The arithmetic operations and functions can be nested and therefore we can obtain the above result in one line, and without the 2nd-digit approximation:

  > exp(log10(sqrt(2+3)))
  [1] 1.418337

R functions include not only the standard mathematical ones, but also a wide range of statistical function, for example, probability density functions of many probability distributions. We will make extensive use of these at a later stage, when computing significance and estimating statistical power.

For any function with a name say ’fun’, help may be obtained by typing ’help(fun)’ (or ?fun) on the command line.

R help pages have a standard layout, documenting usage of the function, explaining function arguments, providing details of implementation and/or usage, explaining the value returned by the function, and giving references and examples of the function use.

Most of the documented functions have examples of their usage at the end of the ’help’ page, and these examples can be evaluated in R. E.g. try ’example(log10)’.

Exercise 1. Explore help for Wilcoxon test
Explore the help page for the Wilcoxon test (function: wilcox.test) and answer the following questions:

1. 

When is the exact Wilcoxon test computed by default?

2. 

If the default conditions for the exact test are not satisfied, what approximation is used?

If you do not know the exact name for the function you look for, try ’help.search("query")’, where query is the keyword.

Exercise 2. Finding functions and help pages
Try to find out what are the functions to do the

1. 

Fisher exact test

2. 

T-test

One of the important R operations is assignment, which is done with the ’<-’ operator. A (new) variable name should be provided on the left-hand side of this operator and on the right-hand side, there must be either the name of an already existing variable or an expression. For example, we if want to assign the value ’2’ to the variable ’a’, and value ’3’ to the variable ’b’ we would use the assignment operator in the following way:

  > a <- 2
  > b <- 3

Typing the variable name on the R command line will return its value, e.g.

  > b
  [1] 3

Evaluation of the expression

  > exp(log10(sqrt(a+b)))
  [1] 1.418337

gives the expected result we have obtained earlier using numerical arguments.

While the variables ’a’ and ’b’ contain single numeric values, variables in general can be multi-dimensional; a one-dimensional example of such is the vector (or array). Let us create an example vector and experiment with it:

  > v <- c(1, 3, 5, 7, 11)

Here, ’c()’ is a function, which combines its arguments to make a vector. This vector is then assigned to a variable named ’v’.

Now, let us try different operations with this vector:

  > v + 1
  [1]  2  4  6  8 12

It is easy to see that the result is a vector, which is obtained by adding one to each element of the original vector v. Other arithmetic operations and mathematical functions behave in the same way, e.g. the operation is performed for each element of the vector, and the results are returned:

  > 1/v
  [1] 1.00000000 0.33333333 0.20000000 0.14285714 0.09090909
  > log(v)
  [1] 0.000000 1.098612 1.609438 1.945910 2.397895

What happens if two vectors are supplied as function arguments? Let us define a new vector

  > ov <- c(1, 2, 3, 4, 5)

and add it to the vector v:

  > v + ov
  [1]  2  5  8 11 16

You can see that the summation was done element-wise, i.e. the first element of the result vector is obtained as the sum of the first elements of v and ov, the second is the sum of the second elements, and so forth.

Other arithmetic operations with two vectors are performed in the same element-wise manner:

  > v * ov
  [1]  1  6 15 28 55

(multiplication)

  > v^ov
  [1]      1      9    125   2401 161051

(power).

The vector operations considered above returned a same-length vector as output. There are others – statistical and summary – functions which evaluate a vector as a whole and return a single value as output. For example, to obtain a sum of elements of a vector, use

  > sum(v)
  [1] 27

Other examples of such functions are length, returning the number of elements of a vector, mean, returning the mean, var, returning the variance, etc.:

  > length(v)
  [1] 5
  > mean(v)
  [1] 5.4
  > var(v)
  [1] 14.8

One of the basic, and probably most used, data operations in R is sub-setting. This refers to an operation which helps you deriving a subset of the data. Let us create a short vector and play a bit with sub-setting. This vector will contain 5 simple character strings:

  > a <- c("I am element 1", "I am element 2", "I am element 3",
  +        "I am element 4", "I am element 5")
  > a
  [1] "I am element 1" "I am element 2" "I am element 3" "I am element 4"
  [5] "I am element 5"

To find out what is the value of the i-th element of this vector, you can sub-set it by a[i]. For example the 3rd elements is:

  > a[3]
  [1] "I am element 3"

You can also select a bigger sub-set, e.g. all elements from 2 to 4:

  > a[c(2:4)]
  [1] "I am element 2" "I am element 3" "I am element 4"

Here, the operation c(2:4) stands for ’combine numbers from 2 to 4 into a vector’. An equivalent result is obtained by

  > a[c(2, 3, 4)]
  [1] "I am element 2" "I am element 3" "I am element 4"

We can also easily get disjoint elements; e.g. if you want to retrieve elements 1, 3, and 5, you can do that with

  > dje <- c(1, 3, 5)
  > dje
  [1] 1 3 5
  > a[dje]
  [1] "I am element 1" "I am element 3" "I am element 5"

One of the very attractive features of R data objects is the possibility to derive a sub-set based on some condition. Let us consider two vectors, tmphgt, containing the height of some subjects, and tmpids, containing their identification codes (IDs):

  > tmphgt <- c(150, 175, 182, 173, 192, 168)
  > tmphgt
  [1] 150 175 182 173 192 168
  > tmpids <- c("fem1", "fem2", "man1", "fem3", "man2", "man3")
  > tmpids
  [1] "fem1" "fem2" "man1" "fem3" "man2" "man3"

Imagine you need to derive the IDs of the people with height over 170 cm. To do that, we need to combine several operations. First, we should run the logical function > 170 on the height data:

  > vec <- (tmphgt > 170)
  > vec
  [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE

This returns a logical vector whose elements are ’TRUE’, when a particular element of the tmphgt satisfies the condition > 170. The returned logical vector, in turn, can be applied to sub-set any other vector of the same length2 , including itself. Thus if you want to see the heights in people that are taller than 170 cm, you can use

  > tmphgt[vec]
  [1] 175 182 173 192

As you can see, only the elements of tmphgt for which the corresponding value of vec was ’TRUE’, are returned. In the same manner, the logical vector vec can be applied to select elements of the vector of IDs:

  > tmpids[vec]
  [1] "fem2" "man1" "fem3" "man2"

You can combine more than one logical condition to derive sub-sets. For example, to see what are the IDs of people taller than 170 but shorter than 190 cm, you can use

  > vec <- (tmphgt>170 & tmphgt<190)
  > vec
  [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE
  > tmpids[vec]
  [1] "fem2" "man1" "fem3"

A better3 way to do logical sub-setting is to use the which() function on top of the logical vector. This function reports which elements are TRUE. To obtain the aforementioned result you can run:

  > vec <- which(tmphgt>170 & tmphgt<190)
  > vec
  [1] 2 3 4
  > tmpids[vec]
  [1] "fem2" "man1" "fem3"

You can see that now vec contains a vector whose elements are the indices of the elements of tmphgt for which the logical condition holds.

Sub-setting for 2D objects (matrices) is done in a similar manner. Let us construct a simple matrix and do several sub-setting operations on it:

  > a <- matrix(c(11, 12, 13,
  +               21, 22, 23,
  +               31, 32, 33
  +               ),
  +             nrow=3, ncol=3)
  > a
       [,1] [,2] [,3]
  [1,]   11   21   31
  [2,]   12   22   32
  [3,]   13   23   33

To obtain the element in the 2nd row and 2nd column, you can use

  > a[2, 2]
  [1] 22

To access the element from the second row and third column, use

  > a[2, 3]
  [1] 32

Note that here, the row index (2) comes first, and the column index (3) comes second.

To obtain the 2 × 2 set of elements contained in upper left corner, you can do

  > a[1:2, 1:2]
       [,1] [,2]
  [1,]   11   21
  [2,]   12   22

Or you can even get the variables that reside in corners:

  > a[c(1, 3), c(1, 3)]
       [,1] [,2]
  [1,]   11   31
  [2,]   13   33

If one of the dimensions is not specified, a complete vector is returned for this dimension. For example, here we retrieve the first row

  > a[1,]
  [1] 11 21 31

…and the third column

  > a[, 3]
  [1] 31 32 33

…or columns 1 and 3:

  > a[, c(1, 3)]
       [,1] [,2]
  [1,]   11   31
  [2,]   12   32
  [3,]   13   33

Another way to address elements of a matrix is to use a one-dimensional index. For example, if you want to access the element in the 2nd row and 2nd column, instead of

  > a[2, 2]
  [1] 22

you can use

  > a[5]
  [1] 22






123




1147
2258
3369





Table 2.1: Vector representation of a matrix. Elements in the table are the vector indices of the matrix elements.

This way of accessing the elements of a matrix is based on the fact that each matrix can be represented as a vector whose elements are numbered consecutively: the element in the upper-left corner has index 1, the element in the second row of the first column has index 2, and the last element in the bottom-right corner has the maximal value, as shown in Table 2.1.

You can sub-set matrices using logical conditions or indexes like you can with vectors. For example, if we want to see which elements of a are greater than 21, we can run

  > a > 21
        [,1]  [,2] [,3]
  [1,] FALSE FALSE TRUE
  [2,] FALSE  TRUE TRUE
  [3,] FALSE  TRUE TRUE

or, better

  > which(a > 21)
  [1] 5 6 7 8 9

Note that in the latter case, a vector whose elements give the 1-D indicess of the matrix, is returned. This vector indicates the elements of matrix a, for which the condition (a > 21) is satisfied.

You can obtain the values of the matrix’s elements for which the condition is fulfilled either by

  > a[a > 21]
  [1] 22 23 31 32 33

or using

  > a[which(a > 21)]
  [1] 22 23 31 32 33

Once again, the latter method should be preferred. Consider the example where some elements of the matrix are missing (NA) – a situation which is common in real data analysis. Let us replace element number 5 with NA and perform sub-setting operations on the resulting matrix:

  > a
       [,1] [,2] [,3]
  [1,]   11   21   31
  [2,]   12   22   32
  [3,]   13   23   33
  > a[5] <- NA
  > a
       [,1] [,2] [,3]
  [1,]   11   21   31
  [2,]   12   NA   32
  [3,]   13   23   33
  > a[a > 21]
  [1] NA 23 31 32 33
  > a[which(a > 21)]
  [1] 23 31 32 33

You can see that when a[a > 21] was used, not only the elements which are greater than 21 were returned, but also NA was. As a rule, this is not what you want, and which should be used unless you do want to make some use of the NA elements.

In this section, we have generated a number of R data objects. Some of these were numeric (e.g. vector of heights, tmphgt) and some were character, or string (e.g. vector of study IDs, tmpids). Sometimes you need to figure out what the class of a certain object is. This can be done using the class() function. For example,

  > tmphgt
  [1] 150 175 182 173 192 168
  > class(tmphgt)
  [1] "numeric"
  > tmpids
  [1] "fem1" "fem2" "man1" "fem3" "man2" "man3"
  > class(tmpids)
  [1] "character"

What happens if we try to find out the class of

  > a
       [,1] [,2] [,3]
  [1,]   11   21   31
  [2,]   12   NA   32
  [3,]   13   23   33

– an object, which contains a matrix?

  > class(a)
  [1] "matrix"

Results are expected – we find out that a is a matrix, which is correct. At the same time, a matrix is an upper-level class, which contains a number of elements, belonging to some lower-level (e.g. character/numeric) class. To see what is the class of the matrix elements, try

  > a[1, ]
  [1] 11 21 31
  > class(a[1, ])
  [1] "numeric"

which says that elements (at least of the first row) are numeric. Because all elements of a matrix should have the same class, we can conclude that a is a matrix containing numeric values.

At this point, it is worthwile inspecting what data objects were created during our work. This can be done with the ls() command:

  > ls()
  [1] "a"      "b"      "dje"    "old"    "ov"     "tmphgt" "tmpids" "v"
  [9] "vec"

Obviously, this ”list” command is very useful – you will soon find that it is just too easy to forget the name of a variable which took a long time to create. Sometimes you may wish to remove some of the data objects because you do not need then anymore. You can remove an object using the rm() command, where the names of objects to be deleted are listed as arguments. For example, to remove the tmphgt and tmpids variables you can use

  > rm(tmphgt, tmpids)

If you now look up what data objects are still left in you workspace with the ls() command

  > ls()
  [1] "a"   "b"   "dje" "old" "ov"  "v"   "vec"

you find that you have successfully deleted tmphgt and tmpids.

At this point, you can exit R by typing q() on the command line and pressing Enter.

\item You can get access to the top-level R documentation via the  
\texttt{help.start()} command. To search help for some keyword \texttt{keywrd},  
you can use the \texttt{help.search(keywrd)} command.  
To get a description of some function \texttt{fun}, use \texttt{help(fun)}.  
\item You can use R as a powerful calculator  
\item It is possible to get sub-sets of vectors and matrices by  
specifying an index value or a logical condition (of the same length as  
the vector / matrix) between square brackets  
(\texttt{[}, \texttt{]})  
\item When you obtain an element of a matrix with \texttt{[i, j]},  
\texttt{i} is the row and \texttt{j} is the column of the matrix.  
\item The function \texttt{which(A)} returns the index of the elements  
of A which are \texttt{TRUE}  
\item You can see the objects available in your workspace  
by using the \texttt{ls()} command  
\item Unnecessary objects (say, \texttt{tmphgt}) can be  
deleted from the workspace using the \texttt{rm} command,  
\eg \texttt{rm(tmphgt)}  
\item You can leave R using the \texttt{q()} command

Exercise 3. Exploring srdta
In this exercise, you will explore a few vectors representing different data on study subjects described in hte srdta example data set supplied together with GenABEL-package. First, you need to load GenABEL-package by typing

  > library(GenABEL)

and load the data by

  > data(srdta)

The vector containing the study subjects’ sex can be accessed through male(srdta); this vector’s value is 1 when the corresponding person is male and 0 otherwise. The vector containing SNP names can be accessed via snpnames(srdta), chromosome ID – through chromosome(srdta) and map – through map(srdta). Explore these vectors and answer the questions.

1. 

What is the ID and sex of the first person in the data set?

2. 

Of the 22nd person?

3. 

How many males are observed among the first hundred subjects?

4. 

How many FEMALES are among the 4th hundred?

5. 

What is the male proportion in the first 1000 people?

6. 

What is the FEMALE proportion in second 1000 (1001:2000) people?

7. 

What is name, chromosome and map position of 33rd marker?

8. 

What is distance between markers 25 and 26?

2.2 Data frames

A data frame is a class of R data which, basically, is a data table. In such tables, it is usually assumed that rows correspond to subjects (observations) and columns correspond to variables (characteristics) measured on these subjects. A nice feature of data frames is that columns (variables) have names, and the data can be addressed by referencing to these names4 .

We will explore R data frames using the example data set assoc. Start R with a double-click on the file named assocbase.RData. You can see the names of the loaded objects by using the ”list” command:

  > ls()
  [1] "assoc"

Thus, only one object is loaded. The class of this object is:

  > class(assoc)
  [1] "data.frame"

– a data frame.

The dimensionality of a data frame (or a matrix) can be determined by using the dim() command:

  > dim(assoc)
  [1] 250   7

Here, the first number corresponds to the number of rows (subjects) and the second to the number of columns (variables). Thus, the data frame assoc contains the data on 250 subjects, who are characterised by 7 variables each.

Let us now figure out what the names are of the 7 variables present in the data frame. To see what the variable names are, use the command names():

  > names(assoc)
  [1] "subj" "sex"  "aff"  "qt"   "snp4" "snp5" "snp6"

These variables correspond to the personal identifier (ID, variable subj), sex, affection status, quantitative trait qt and several SNPs. Each variable can have its own type (numeric, character, logical), but all variables must have the same length – thus forming a matrix-like data structure.

A variable from a data frame (say, fram), which has some name (say, nam) can be accessed through fram$nam. This will return a conventional vector, containing the values of the variable. For example, to see the affection status (aff) in the data frame assoc, use

  > assoc$aff
    [1] 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0
   [38] 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0
   [75] 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0
  [112] 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1
  [149] 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1
  [186] 1 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0
  [223] 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0

The aff (affected) variable here codes for a case/control status. Conventinally, cases are coded as 1 and controls as 0. You can also see several ”NA”s, which denotes a missing observation.

Exercise 4. Exploring assoc

1. 

Investigate the types of the variables present in data frame assoc. For each variable, write down the class.

A data frame may be thought of as a matrix which is a collection of (potentily different-type) vectors. All sub-setting operations discussed before for matrices are applicable to a data frame, while all operations dicussed for vectors are applicable to a data frame’s variables.

Thus, as any particular variable present in a data frame is a conventional vector, its elements can be accessed using the vector’s indices. For example, if you would like to know what are the ID, sex and affection status for the person with index 75, you can request

  > assoc$subj[75]
  [1] 75
  > assoc$sex[75]
  [1] 1
  > assoc$aff[75]
  [1] 0

Alternatively, using the matrix-style of sub-setting, you can see all the data for person 75:

  > assoc[75,]
     subj sex aff       qt snp4 snp5 snp6
  75   75   1   0 1.014664  A/B  B/A  B/B

In the same manner as with matrices, you can get data for e.g. subjects 5 to 15 by

  > assoc[5:15,]
     subj sex aff         qt snp4 snp5 snp6
  5     5   0   0  0.1009220  A/B  B/A  B/A
  6     6   1   0 -0.1724321  A/B  A/A  A/A
  7     7   0   0 -0.3378473  B/B  A/A  A/A
  8     8   0   0 -1.7112925  A/A  B/B <NA>
  9     9   1   0 -0.4815822  A/B  B/A  B/A
  10   10   1   0  1.2281232  A/A  B/B  B/B
  11   11   0   0  0.5993945  A/B  B/A  B/A
  12   12   0   0  1.9792190  A/A  B/B  B/B
  13   13   1   0  1.5435921  A/A  B/B  B/B
  14   14   0   0 -1.6242738  A/B  B/A  B/A
  15   15   0   0 -0.5160331  A/A  B/B  B/B

The result is actually a new data frame containing data only on people with index ranging from 5 to 15:

  > x <- assoc[5:15,]
  > class(x)
  [1] "data.frame"
  > dim(x)
  [1] 11  7

As well as with matrices and vectors, it is possible to sub-set elements of a data frame based on (a combination of) logical conditions. For example, if you are interested in people who have qt values over 1.4, you can find out what the indices of these people are:

  > vec <- which(assoc$qt>1.4)
  > vec
   [1]  12  13  33  41  54  68  72  76  89 106 118 142 156 161 175 181 193 219 241

and then show the complete data with

  > assoc$subj[vec]
   [1]  12  13  33  41  54  68  72  76  89 106 118 142 156 161 175 181 193 219 241

At the same time, if you only want to check what the IDs of these people are, try

  > assoc$subj[vec]
   [1]  12  13  33  41  54  68  72  76  89 106 118 142 156 161 175 181 193 219 241

Or, if we are interested to find what the IDs and the SNP genotypes are of these people, we can try

  > assoc[vec, c(1, 5, 6, 7)]
      subj snp4 snp5 snp6
  12    12  A/A  B/B  B/B
  13    13  A/A  B/B  B/B
  33    33  A/A  B/B  B/B
  41    41  A/A  B/A  B/A
  54    54  A/B  B/A  B/A
  68    68  A/A  B/B  B/B
  72    72  A/A  B/A  B/A
  76    76  A/B  B/A  B/A
  89    89  A/A  B/B  B/B
  106  106  A/B  B/A  B/A
  118  118  A/B  B/A  B/A
  142  142  A/B  B/A  B/A
  156  156  A/A  B/B  B/B
  161  161  A/B  B/A  B/A
  175  175  A/B  B/A  B/A
  181  181  A/B  B/A  B/A
  193  193  A/A  B/B  B/B
  219  219  A/B  B/A  B/A
  241  241  B/B  A/A  A/A

here, we select people identified by vec in the first dimension (subjects), and by c(1, 5, 6, 7) we select the first, fifth, sixth and seventh column (variable).

The same result can be obtained using variable names instead of the variables’ indices. To remind you the variable names can be found with:

  > names(assoc)
  [1] "subj" "sex"  "aff"  "qt"   "snp4" "snp5" "snp6"

And now make a vector of the variable names of interest and filter the data based on it:

  > namstoshow <- c("subj", "snp4", "snp5", "snp6")
  > assoc[vec, namstoshow]
      subj snp4 snp5 snp6
  12    12  A/A  B/B  B/B
  13    13  A/A  B/B  B/B
  33    33  A/A  B/B  B/B
  41    41  A/A  B/A  B/A
  54    54  A/B  B/A  B/A
  68    68  A/A  B/B  B/B
  72    72  A/A  B/A  B/A
  76    76  A/B  B/A  B/A
  89    89  A/A  B/B  B/B
  106  106  A/B  B/A  B/A
  118  118  A/B  B/A  B/A
  142  142  A/B  B/A  B/A
  156  156  A/A  B/B  B/B
  161  161  A/B  B/A  B/A
  175  175  A/B  B/A  B/A
  181  181  A/B  B/A  B/A
  193  193  A/A  B/B  B/B
  219  219  A/B  B/A  B/A
  241  241  B/B  A/A  A/A

A more convenient way to access data presented in a data frame is through "attaching" it to the R search path by

  > attach(assoc)

After that, the variables can be accessed directly, e.g.

  > subj[75]
  [1] 75

instead of assoc$subj[75].

While it is possible to explore the data presented in a data frame using the sub-setting operations and screen output, and modify certain data elements using the assignment (”<-”) operation, you can also explore and modify the data contained in a data frame5 by using the fix() command (e.g. try fix(assoc)). However, normally this is not necessary.

With attached data frames, a possible complication is that later on you may have several data frames which contain variables with the same names. The variable which will be used when you directly use the name would be the one from the data frame attached last. You can use the detach() function to remove a certain data frame from the search path, e.g. after

  > detach(assoc)

we cannot use a direct reference to the name (try subj[75]) anymore, but have to use the full path instead:

  > assoc$subj[75]
  [1] 75

\item The list of available objects can be viewed with \texttt{ls()};  
the class of some object \texttt{obj} can be examined with  
\texttt{class(obj)}.  
\item Simple summary statistics for numeric variables can be  
generated by using the \texttt{summary} function  
\item A histogram for some variable \texttt{var} can be generated  
by \texttt{hist(var)}.  
\item A variable with name \texttt{name} from a data frame  
\texttt{frame}, can be  
accessed through \texttt{frame\$name}.  
\item You can attach a data frame to the search path by  
\texttt{attach(frame)}. Then the variables contained in this  
data frame may be accessed directly. To detach the data  
frame (because, \eg, you are now interested in another data  
frame), use \texttt{detach(frame)}.

Exercise 5. Explore the phenotypic part of srdta
Load the srdta data object supplied with GenABEL by loading the package with library(GenABEL) and then loading the data with data(srdta). The srdta object contains a data frame with phenotypes. This data frame may be accessed through phdata(srdta). Explore this data frame and answer the questions

1. 

What is the value of the 4th variable for subject number 75?

2. 

What is the value of variable 1 for person 75? Check the value of this variable for the first ten people. Can you guess what the first variable is?

3. 

What is the sum of variable 2? Can you guess what data variable 2 contains?

2.3 Exploratory analysis of qualitative and quantitative traits

Let us now attach the data frame asscoc

  > attach(assoc)

and explore it.

Let us first check how many of the subjects are males. In the sex variable, males are coded with ”1” and females with ”0”. Therefore to see the total number of males, you can use

  > sum(sex==1)

  [1] 129

and to determine what is the proportion of males you can use

  > sum(sex==1)/length(sex)

  [1] 0.516

This way to compute the proportion would only work correctly if there are no missing observations (lenght() returns the total length of a variable, including NAs).

Because of the way the males are coded, the same answer is reached by

  > mean(sex)

  [1] 0.516

However, that would not have worked if the sex was coded differently, e.g. with ”1” for males and ”2” for females.

Let us now try to find out the mean of the quantitative trait qt. By definition, the mean of a variable, say x (with the i-th element denoted as xi) is

x = ΣNi=1xi
      N
where N is the number of measurements.

If we try to find out the mean of qt by direct use of this formula, we first need to find out the sum of the elements of qt. The sum() function of R precisely does the operation we need. However, if we try it

  > sum(qt)

  [1] -29.79333

this returns ”NA”. The problem is that the qt variable contains ”NA”s (try qt to see these) and then, by default, ”NA” is returned. We can, however, instruct the sum() function to remove ”NA”s from consideration:

  > sum(qt, na.rm=TRUE)

  [1] -29.79333

where na.rm=TRUE tells R that missing variables should be be removed (NonAvailable.ReMove=True)6 .

We can now try to compute the mean with

  > sum(qt, na.rm=TRUE)/length(qt)
  [1] -0.1191733

This result, however, is not correct. The length() function returns the total length of a vector, which includes ”NA”s as well. Thus we need to compute the number of elements in qt that are not missing.

For this, we can use R function is.na(). This function returns TRUE if the supplied argument is missing (NA) and FALSE otherwise. Let us apply this function to the vector assoc$qt:

  > is.na(qt)
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Indeed, the 0 missing elements are correctly identified. However, we are interested in elements which are not missing. To get these, we can use the logical function NOT (!), which changes all FALSE to TRUE and visa versa:

  > !is.na(qt)
    [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
   [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
   [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
   [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
   [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
   [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
   [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Thus the number of elements which are not missing7 is

  > sum(!is.na(qt))
  [1] 250

Finally, we can compute the mean of the qt with

  > sum(qt, na.rm=TRUE)/sum(!is.na(qt))
  [1] -0.1191733

While this way of computing the mean is enlightening in the sense of how missing values are treated, the same correct result should be normally achieved by supplying the na.rm=TRUE argument to the mean() function:

  > mean(qt, na.rm=TRUE)
  [1] -0.1191733

The function table(x) produces a frequency table for the variable x. Thus, we can use

  > table(sex)
  sex
    0   1
  121 129

which, again, tells us that there are 129 males and 121 females in this data set. This function excludes missing observations.

Tables of other qualitative variables, such as affection status and SNPs, can be generated in the same manner.

As with arithmetic operations and mathematical functions, most of the R operations can be combined within a single line. Let us try to combine logical conditions and the table() command to check the distribution of number of affected in men and women separately:

  > table(aff[which(sex==1)])
   0  1
  98 31
  > table(aff[which(sex==0)])
   0  1
  96 25

On the R command line pressing the ‘‘up-arrow'' button  
makes the last typed command re-appear (pressing it one more time will  
bring you to the one before the last, so on). This is very handy when  
you have to repeat the same analysis of different variables

Exercise 6. Explore assoc
Explore the phenotypic variables present in assoc

1. 

How many affected and unaffected are present in the data set?

2. 

What is the proportion of affected?

3. 

What is the distribution of snp4 (how many different genotype classes are present and what are the counts)?

Contingency tables for pairs of variables (cross-tables) can be generated in R using the table command we have used in previous section to explore frequency distributions. For example, if you want cross-tabulate sex and affection status in the data frame assoc, you can use

  > table(sex, aff)
     aff
  sex  0  1
    0 96 25
    1 98 31

Here, the first variable (sex) is presented in rows and the second (affection status) in columns.

As is usually the case with R, the output may be saved as a new object (of class ’table’, which is a variety of a matrix):

  > a <- table(sex, aff)
  > class(a)
  [1] "table"
  > a
     aff
  sex  0  1
    0 96 25
    1 98 31

and this object may be analysed further.

For example, we can easily get the number of affected male with

  > a[2, 2]
  [1] 31

Alternatively, we can analyse the resulting contingency table a with more complex functions. If we want to see proportions in this table, we can use

  > prop.table(a)
     aff
  sex     0     1
    0 0.384 0.100
    1 0.392 0.124

Needless to say, this is equivalent to

  > prop.table(table(assoc$sex, assoc$aff))
          0     1
    0 0.384 0.100
    1 0.392 0.124

In the above table, we see what proportion of people belong to four different classes (affected male, affected female, unaffected male and unaffected female). We may also be interested in the proportion of males in affected and unaffected. This may be achieved by

  > prop.table(a, 2)
     aff
  sex         0         1
    0 0.4948454 0.4464286
    1 0.5051546 0.5535714

telling us that 55.4% of affected individuals are male.

Alternatively, we may be interested in the proportion of affected among males/females. To answer this question, run

  > prop.table(a, 1)
     aff
  sex         0         1
    0 0.7933884 0.2066116
    1 0.7596899 0.2403101

telling us that 55.4% of male are affected.

Another useful contingency table analysis function is fisher.test, which implements the Fisher Exact Test of independence:

  > fisher.test(a)
   Fisher's Exact Test for Count Data
  
  data:  a
  p-value = 0.547
  alternative hypothesis: true odds ratio is not equal to 1
  95 percent confidence interval:
   0.6409648 2.3156591
  sample estimates:
  odds ratio
    1.213747

Exploration of genetic data within base R, though possible, may be a bit of a pain. For example, we can easily generate contingency table of SNP5 vs. affected status:

  > a <- table(aff, snp5)
  > a
     snp5
  aff A/A B/A B/B
    0  31  88  71
    1   9  26  17

We can also look up the proportion of affected among different genotypic groups

  > prop.table(a, 2)
     snp5
  aff       A/A       B/A       B/B
    0 0.7750000 0.7719298 0.8068182
    1 0.2250000 0.2280702 0.1931818

showing that proportion of cases is similar in the ’A/A’ and ’A/B’ genotypic groups and somewhat decreased in ’B/B’. It is easy to test if this affection is statistically independent of genotype by doing a χ2 test

  > chisq.test(a)
   Pearson's Chi-squared test
  
  data:  a
  X-squared = 0.3874, df = 2, p-value = 0.8239

which gives a (insignificant) genotypic association test with two degrees of freedom.

However, testing Hardy-Weinberg equilibrium, testing allelic effects, and even computation of allelic frequency is not so straightforward. Such specific genetic tests are implemented in special R libraries, such as genetics and GenABEL-package and will be covered in later sections of this document.

At this moment we will switch to exploratory analysis of quantitative traits. We will make use of the srdta data supplied with the GenABEL-package. As you can remember from an earlier exercise, that library is loaded with library(GenABEL) and the data are loaded with data(srdta). Then the phenotypic data frame may be accessed through phdata(srdta).

Exercise 7. Explore phenotypes in srdta
Explore the phenotypic data content of srdta object (phdata(srdta)).

1. 

How many observations and variables are present in the data frame?

2. 

What are the classes of these variables?

As was mentioned before, the function summary() generates a summary statistics for an object. For example, to see the summary for trait qt1, we can use

  > summary( phdata(srdta)$qt1 )
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  -4.6000 -0.9500 -0.3100 -0.2981  0.3800  3.2000       3

\texttt{summary} is quite a useful function which may operate  
in different ways for objects of different classes.  
Try \texttt{summary(phdata(srdta))}.

With R, it is also easy to explore the data graphically. For example, a histogram for qt1 may be generated by

  > hist( phdata(srdta)$qt1 )

The resulting histogram is shown in Figure 2.1.


pict
Figure 2.1: Histogram of qt1.

In a similar manner scatter-plots may be generated. To see the relation between qt1 and qt3, you can run

  > plot( phdata(srdta)$qt1, phdata(srdta)$qt3 )

The resulting plot is shown in Figure 2.2.


pict
Figure 2.2: Scatter-plot of qt1 against qt3.

The mean, median, minimum and maximum of the distribution of a trait may be found using the functions mean, median, min and max, respectively. The variance and standard deviation can be computed with var and sd.

To compute the correlation between two variables (or all variables in a matrix/data frame), use cor.

In GenABEL-package, there is a special function designed to facilitate phenotypic quality control. This function takes the names of variables and a data frame as an input, and returns summary statistics, list of outliers (using False Discovery Rate) and graphs.

For example, to do QC of sex, age and qt3, try

  > check.trait( c("sex", "age", "qt3"), phdata(srdta) )
  --------------------------------
  Trait sex has 2500 measurements
  Missing: 0 ( 0 %)
  Mean = 0.51 ; s.d. = 0.5
  NO outliers discovered for trait sex
  --------------------------------
  Trait age has 2500 measurements
  Missing: 0 ( 0 %)
  Mean = 50.0378 ; s.d. = 7.060125
  NO outliers discovered for trait age
  --------------------------------
  Trait qt3 has 2489 measurements
  Missing: 11 ( 0.44 %)
  Mean = 2.60859 ; s.d. = 1.101154
  NO outliers discovered for trait qt3

The corresponding graph is shown in figure 2.3.


pict
Figure 2.3: Quality control graphs for sex, age, qt3

Before you start with the exercise:  
if a function returns unexpected results, and you are confident that  
syntax was right, checking the help page is always a good idea!

Exercise 8. Exploring the phenotypic part of srdta
Explore the phdata part of the srdta object

1. 

How many people have age over 65 years?

2. 

What is the sex distribution (proportion of males) in the people over 65 years old?

3. 

What are the mean, median, minimum and maximum ages in the sample?

4. 

Compare the distribution of qt3 in people younger and older than 65 years. Use the function sd(A) to get standard deviation of A.

5. 

Produce distributions of different traits. Do you see something special?

6. 

What is the correlation between qt3 and age?

2.4 Regression analysis

While contingency tables, bi-plots and correlation are powerful tools to analyse relations between pairs of variables, a more general framework allowing investigation of the relation of an outcome to multiple predictors is regression. In R, the function lm implements linear regression modelling, and the function glm implements generalised linear regression. In this section, we will use these two functions to analyse quantitative and binary outcomes.

You can do linear regression to check if trait qt2 has a relation with sex and age by

  > a <- lm(phdata(srdta)$qt2 ~ phdata(srdta)$age + phdata(srdta)$sex)

The results of this analysis are stored in object ’a’, which has class ’lm’ and contains many sub-objects:

  > class(a)

  [1] "lm"

  > names(a)

   [1] "coefficients"  "residuals"     "effects"       "rank"
   [5] "fitted.values" "assign"        "qr"            "df.residual"
   [9] "xlevels"       "call"          "terms"         "model"

At this moment you do not need to understand all these sub-objects; the meaningful summary of analysis is produced with

  > summary(a)

  Call:
  lm(formula = phdata(srdta)$qt2 ~ phdata(srdta)$age + phdata(srdta)$sex)
  
  Residuals:
     Min     1Q Median     3Q    Max
   -5.65  -1.80  -1.03  -0.31 883.08
  
  Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
  (Intercept)       -1.55892    4.41667  -0.353    0.724
  phdata(srdta)$age  0.14022    0.08668   1.618    0.106
  phdata(srdta)$sex  1.30377    1.22393   1.065    0.287
  
  Residual standard error: 30.59 on 2497 degrees of freedom
  Multiple R-squared:  0.001518, Adjusted R-squared:  0.0007181
  F-statistic: 1.898 on 2 and 2497 DF,  p-value: 0.1501

You can see that qt2 is not associated with age or sex.

As before, to make easy access to your data (basically, to avoid typing phdata(srdta) before every trait name, you may attach the data to the search path:

  > attach(phdata(srdta))

Then, the above expression to run linear regression analysis simplifies to:

  > summary( lm(qt2 ~ age + sex) )

  Call:
  lm(formula = qt2 ~ age + sex)
  
  Residuals:
     Min     1Q Median     3Q    Max
   -5.65  -1.80  -1.03  -0.31 883.08
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) -1.55892    4.41667  -0.353    0.724
  age          0.14022    0.08668   1.618    0.106
  sex          1.30377    1.22393   1.065    0.287
  
  Residual standard error: 30.59 on 2497 degrees of freedom
  Multiple R-squared:  0.001518, Adjusted R-squared:  0.0007181
  F-statistic: 1.898 on 2 and 2497 DF,  p-value: 0.1501

with the same results.

Analysis of binary outcomes may be performed using glm function, using binomial family for the error distribution and the link function. For example, to figure out if your binary trait (bt) is associated with sex and age, you need to tell that this is binary trait:

  > a <- glm(bt ~ age + sex, family="binomial")
  > summary(a)

  Call:
  glm(formula = bt ~ age + sex, family = "binomial")
  
  Deviance Residuals:
     Min      1Q  Median      3Q     Max
  -1.992  -1.091  -0.444   1.094   1.917
  
  Coefficients:
               Estimate Std. Error z value Pr(>|z|)
  (Intercept) -4.639958   0.330519 -14.038  < 2e-16 ***
  age          0.088860   0.006463  13.749  < 2e-16 ***
  sex          0.379593   0.084138   4.512 6.44e-06 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 3450.5  on 2488  degrees of freedom
  Residual deviance: 3216.5  on 2486  degrees of freedom
    (11 observations deleted due to missingness)
  AIC: 3222.5
  
  Number of Fisher Scoring iterations: 4

There is strong association between bt and sex and age. If you want to characterise the strength of association to a binary trait with Odds Ratios, take the exponents of the regression coefficient. For example, the odds ratio associated with male is

  > exp(0.3796)

  [1] 1.4617

2.5 Answers to exercises

Answer of Exercise 1. Explore help for Wilcoxon test
By default (if ’exact’ is not specified), an exact p-value is computed if the samples contain less than 50 finite values and there are no ties. Otherwise, a normal approximation is used.

Answer of Exercise 2. Finding functions and help pages
Try help.search("Fisher") and help.search("Student t-test"). You will find that the corresponding functions are fisher.test t.test.

Answer of Exercise 3. Exploring srdta
For the first person the id is "p1" and the sex code is 1 (1=male, 0=female)

  > idnames(srdta)[1]
  [1] "p1"
  > male(srdta)[1]
  p1
   1

The id for the 22nd person is "p22" and sex code is 1:

  > idnames(srdta)[22]
  [1] "p22"
  > male(srdta)[22]
  p22
    1

Among the first 100 subjects, there are 53 males:

  > sum(male(srdta)[1:100])
  [1] 53

Among the 4th hundred subjects there are 45 females:

  > 100 - sum(male(srdta)[301:400])
  [1] 45

The male proportion among the first 1000 people is

  > mean(male(srdta)[1:1000])
  [1] 0.508

The female proportion among the second 1000 people is

  > 1 - mean(male(srdta)[1001:2000])
  [1] 0.476

Name, chromosome and map position of the 33rd marker are:

  > snpnames(srdta)[33]
  [1] "rs422"
  > chromosome(srdta)[33]
  rs422
    "1"
  > map(srdta)[33]
   rs422
  105500

The map positions for and distance between markers 25 and 26 are:

  > pos25 <- map(srdta)[25]
  > pos25
  rs365
  91250
  > pos26 <- map(srdta)[26]
  > pos26
  rs372
  92750
  > pos26 - pos25
  rs372
   1500

Answer of Exercise 4. Exploring assoc
Here is an script which automatically explores the classes of variables in assoc:

  > for (i in names(assoc)) {
  +   cat("Variable '", i ,"' has class '", class(assoc[, i]), "'\n", sep="")
  + }
  Variable 'subj' has class 'integer'
  Variable 'sex' has class 'numeric'
  Variable 'aff' has class 'numeric'
  Variable 'qt' has class 'numeric'
  Variable 'snp4' has class 'genotypefactor'
  Variable 'snp5' has class 'genotypefactor'
  Variable 'snp6' has class 'genotypefactor'

In this so-called for-loop the variable i cycles through all names in assoc and for each of them it uses the cat function to print the name of the variable and its class. The \n is the code for a new line.

Answer of Exercise 5. Explore the phenotypic part of srdta
Load the data and look at the few first rows of the phenotypic data frame:

  > data(srdta)
  > phdata(srdta)[1:5, ]
     id sex  age   qt1    qt2  qt3 bt
  p1 p1   1 43.4 -0.58   4.46 1.43  0
  p2 p2   1 48.2  0.80   6.32 3.90  1
  p3 p3   0 37.9 -0.52   3.26 5.05  1
  p4 p4   1 53.8 -1.55 888.00 3.76  1
  p5 p5   1 47.5  0.25   5.70 2.89  1

The value of the 4th variable of person 75:

  > phdata(srdta)[75, 4]
  [1] -0.04

The value for variable 1 is

  > phdata(srdta)[75, 1]
  [1] "p75"

Also, if we check the first 10 elements we see

  > phdata(srdta)[1:10, 1]
   [1] "p1"  "p2"  "p3"  "p4"  "p5"  "p6"  "p7"  "p8"  "p9"  "p10"

This is the individual ID.

The sum for variable 2 is

  > sum(phdata(srdta)[, 2])
  [1] 1275

This is the sex variable – so there are 1275 males in the data set.

Answer of Exercise 6. Explore assoc
The number of affected (coded with ’1’) and unaffected (’0’) is

  > table(aff)
  aff
    0   1
  194  56

The proportion of unaffected and affected is

  > prop.table(table(aff))
  aff
      0     1
  0.776 0.224

Distribution of the ’snp4’ is

  > t <- table(snp4)
  > t
  snp4
  A/A A/B B/B
  109 105  29
  > prop.table(t)
  snp4
        A/A       A/B       B/B
  0.4485597 0.4320988 0.1193416

Answer of Exercise 7. Explore phenotypes in srdta
Number of people:

  > nids(srdta)
  [1] 2500

Number of variables:

  > length( names( phdata(srdta) ) )
  [1] 7

The same – dimensions of phenotypic data frame:

  > dim(phdata(srdta))
  [1] 2500    7

The class of variables in phenotypic data frame:

  > for (i in names(phdata(srdta))) {
  +   cat("The class of variable '", i ,"' is '",
  +       class(phdata(srdta)[, i]), "'\n", sep="")
  + }
  The class of variable 'id' is 'character'
  The class of variable 'sex' is 'integer'
  The class of variable 'age' is 'numeric'
  The class of variable 'qt1' is 'numeric'
  The class of variable 'qt2' is 'numeric'
  The class of variable 'qt3' is 'numeric'
  The class of variable 'bt' is 'integer'

Answer of Exercise 8. Exploring the phenotypic part of srdta
To obtain the number of people with age > 65 y.o., you can use any of the following

  > sum( phdata(srdta)$age > 65 )
  [1] 48
  > vec <- which( phdata(srdta)$age > 65 )
  > length(vec)
  [1] 48

To get sex of these people use any of:

  > sx65 <- phdata(srdta)$sex[ phdata(srdta)$age > 65 ]
  > sx65
   [1] 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 0 0 0 0
  [39] 1 0 1 0 0 0 0 1 1 1
  > sx65 <- phdata(srdta)$sex[vec]
  > sx65
   [1] 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 0 0 0 0
  [39] 1 0 1 0 0 0 0 1 1 1

Thus, number of males is:

  > sum(sx65)
  [1] 26

To conclude, the proportion of males is 0.541666666666667.

The distributions of qt3 in people younger and older than 65 are:

  > summary(phdata(srdta)$qt3[vec])
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.730   2.690   3.480   3.499   4.265   5.840
  > sd( phdata(srdta)$qt3[vec], na.rm=TRUE )
  [1] 1.128701
  > young <- which(phdata(srdta)$age<65)
  > summary( phdata(srdta)$qt3[young] )
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    -1.97    1.83    2.58    2.59    3.35    6.34      11
  > sd( phdata(srdta)$qt3[young], na.rm=TRUE )
  [1] 1.093374

The Mean, median, min and max of age:

  > summary( phdata(srdta)$age )
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    24.10   45.10   50.00   50.04   54.80   71.60

The histogram for qt2 looks strange (you can generate it using hist(phdata(srdta)$qt2)): it seems there are a few very strong outliers. You can also see that with summary:

  > summary( phdata(srdta)$qt2 )
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.000   4.220   5.045   6.122   5.910 888.000

Chapter 3
Introduction to genetic association analysis in R

When analyzing several (dozens of) SNPs, facilities of base R are sufficient and efficient for data storage and analysis. Few specific test, such as these of Hardy-Weinberg Equilibrium (HWE) and Linkage Disequilibrium (LD), are implemented in different libraries, e.g. genetics and GenABEL-package.

In this section, we will describe library genetics and will make use of it to guide you through simple genetic analysis exercise using a small example data set. In the last part, you will investigate a bigger data set as based on the knowledge obtained in the first part, and will answer the questions.

3.1 Characterisation of genetic data

3.2 Exploring genetic data with library genetics

Library genetics was written by Gregory R. Warnes to facilitate analysis of genetic data in R. This library

Start R by double-click on the file ge03d1p1.RData. THIS FILE IS THE SAME AS THE assocbase.RData, so I load and patch it below Load library genetics, which we will need for testing HWE and computations of LD by

  > library(genetics)

  NOTE: THIS PACKAGE IS NOW OBSOLETE.
  
    The R-Genetics project has developed an set of enhanced genetics
    packages to replace 'genetics'. Please visit the project homepage
    at http://rgenetics.org for informtion.

The file you have loaded contains single data frame assocg. Let us load

  > #load("RData/ge03d1p1.RData")
  > # load assocbase data and convert snps into proper 'genetics' format
  > load("RData/assocbase.RData")
  > assocg <- assoc
  > assocg$snp4 <- as.genotype(assocg$snp4)
  > assocg$snp5 <- as.genotype(assocg$snp5)
  > assocg$snp6 <- as.genotype(assocg$snp6)

and briefly explore it:

  > class(assocg)

  [1] "data.frame"

  > names(assocg)

  [1] "subj" "sex"  "aff"  "qt"   "snp4" "snp5" "snp6"

  > dim(assocg)

  [1] 250   7

You can see that assocg looks remarkably similar to the previously explored data frame assoc (section 2.2, page 30). Indeed, they are almost equivalent. Let us present the data for the subjects 5 to 15 and compare this output to that presented on page 32:

  > assocg[5:15,]

     subj sex aff         qt snp4 snp5 snp6
  5     5   0   0  0.1009220  A/B  B/A  B/A
  6     6   1   0 -0.1724321  A/B  A/A  A/A
  7     7   0   0 -0.3378473  B/B  A/A  A/A
  8     8   0   0 -1.7112925  A/A  B/B <NA>
  9     9   1   0 -0.4815822  A/B  B/A  B/A
  10   10   1   0  1.2281232  A/A  B/B  B/B
  11   11   0   0  0.5993945  A/B  B/A  B/A
  12   12   0   0  1.9792190  A/A  B/B  B/B
  13   13   1   0  1.5435921  A/A  B/B  B/B
  14   14   0   0 -1.6242738  A/B  B/A  B/A
  15   15   0   0 -0.5160331  A/A  B/B  B/B

The data are identical. However, the SNP data presented in the new data frame have special class genotype, as implemented in genetics library:

  > class(assocg$snp4)

  [1] "genotype" "factor"

Previously, the SNP genotypes were coded as characters. This new way of presentation allows library genetics to recognise the SNP data as genetic and analyse them accordingly.

Let us attach the assocg data frame and explore what data analysis advantages are achieved by application of library genetics.

  > attach(assocg)

As we noted in section 2.2, testing Hardy-Weinberg equilibrium, testing allelic effects, and even computation of allelic frequency is not so straightforward in base R. These tests, are, however, easy with library genetics. To see the allelic frequencies and other summary statistics for a SNP, you can use

  > summary(snp4)

  Number of samples typed: 243 (97.2%)
  
  Allele Frequency: (2 alleles)
     Count Proportion
  A    323       0.66
  B    163       0.34
  NA    14         NA
  
  
  Genotype Frequency:
      Count Proportion
  B/B    29       0.12
  A/B   105       0.43
  A/A   109       0.45
  NA      7         NA
  
  Heterozygosity (Hu)  = 0.4467269
  Poly. Inf. Content   = 0.3464355

To check these characteristics in controls and cases separately, you can use

  > summary(snp4[aff==0])

  Number of samples typed: 190 (97.9%)
  
  Allele Frequency: (2 alleles)
     Count Proportion
  A    255       0.67
  B    125       0.33
  NA     8         NA
  
  
  Genotype Frequency:
      Count Proportion
  B/B    22       0.12
  A/B    81       0.43
  A/A    87       0.46
  NA      4         NA
  
  Heterozygosity (Hu)  = 0.4426469
  Poly. Inf. Content   = 0.3440288

  > summary(snp4[aff==1])

  Number of samples typed: 53 (94.6%)
  
  Allele Frequency: (2 alleles)
     Count Proportion
  A     68       0.64
  B     38       0.36
  NA     6         NA
  
  
  Genotype Frequency:
      Count Proportion
  B/B     7       0.13
  A/B    24       0.45
  A/A    22       0.42
  NA      3         NA
  
  Heterozygosity (Hu)  = 0.4643306
  Poly. Inf. Content   = 0.3541731

Let us check if HWE holds for the SNPs described in this data frame. We can do exact test for HWE by

  > HWE.exact(snp4)

   Exact Test for Hardy-Weinberg Equilibrium
  
  data:  snp4
  N11 = 109, N12 = 105, N22 = 29, N1 = 323, N2 = 163, p-value = 0.666

If you want to check HWE using controls only, you can do it by

  > HWE.exact(snp4[aff==0])

   Exact Test for Hardy-Weinberg Equilibrium
  
  data:  snp4[aff == 0]
  N11 = 87, N12 = 81, N22 = 22, N1 = 255, N2 = 125, p-value = 0.6244

Let us check if the there is LD between snp4 and snp5:

  > LD(snp4,snp5)

  Pairwise LD
  -----------
                     D        D'      Corr
  Estimates: 0.2009042 0.9997352 0.8683117
  
                X^2 P-value   N
  LD Test: 354.3636       0 235

The output shows results of the test for significance of LD, and estimates of the magnitude of LD (Dand correlation, r). To obtain r2, you can either square the correlation manually

  > 0.8683117*0.8683117

  [1] 0.7539652

or simply ask LD() to report it by

  > LD(snp4,snp5)$"R^2"

  [1] 0.7539652

The latter command is possible because the \texttt{LD()}  
function actually computes more things than it reports. This is quite common  
for \texttt{R} functions.  You can apply \texttt{names()} function to the  
analysis objects to see (at least part of) what was actually computed.  
Try  
\begin{Schunk}  
\begin{Sinput}  
> ld45 <- LD(snp4,snp5)  
\end{Sinput}  
\end{Schunk}  
and check what are the sub-objects contained in this analysis object  
\begin{Schunk}  
\begin{Sinput}  
> names(ld45)  
\end{Sinput}  
\begin{Soutput}  
[1] "call"    "D"       "D'"      "r"       "R^2"     "n"       "X^2"  
[8] "P-value"  
\end{Soutput}  
\end{Schunk}  
Any of these variables can be accessed through \texttt{object\$var}  
syntax, e.g. to check $D'$ we can use  
\begin{Schunk}  
\begin{Sinput}  
> ld45$"D'"  
\end{Sinput}  
\begin{Soutput}  
[1] 0.9997352  
\end{Soutput}  
\end{Schunk}

To check LD for more that two SNPs, we can compute an LD analysis object by

  > ldall <- LD(data.frame(snp4,snp5,snp6))

and later check

  > ldall$"P-value"

       snp4 snp5 snp6
  snp4   NA    0    0
  snp5   NA   NA    0
  snp6   NA   NA   NA

to see significance,

  > ldall$"D'"

       snp4      snp5      snp6
  snp4   NA 0.9997352 0.8039577
  snp5   NA        NA 0.9997231
  snp6   NA        NA        NA

for Dand

  > ldall$"R^2"

       snp4      snp5      snp6
  snp4   NA 0.7539652 0.5886602
  snp5   NA        NA 0.8278328
  snp6   NA        NA        NA

for r2.

You can also present e.g. r2 matrix as a plot by

  > image(ldall$"R^2")

A more neat way to present it requires specification of the set of threshold (break points) and colors to be used (you do not need to try this example if you do not want):

  > image(ldall$"R^2",breaks=c(0.5,0.6,0.7,0.8,0.9,1),col=heat.colors(5))

Resulting plot is shown at figure 3.1.


pict
Figure 3.1: r2 plot for snp4, snp5 and snp6

For any \texttt{R} command, you can get help by typing  
\texttt{help(command)}. Try \texttt{help(image)} if you are interested  
to understand what are "breaks" and "col"; or try  
\texttt{help(heat.colors)} to figure this color schema out.

Similar to our HWE checks, we may want to compute (and compare) LD in cases and controls separately:

  > ldcases <- LD(data.frame(snp4,snp5,snp6)[aff==1,])
  > ldcases$"R^2"

       snp4      snp5      snp6
  snp4   NA 0.7615923 0.6891558
  snp5   NA        NA 0.8943495
  snp6   NA        NA        NA

  > ldcontr <- LD(data.frame(snp4,snp5,snp6)[aff==0,])
  > ldcontr$"R^2"

       snp4      snp5      snp6
  snp4   NA 0.7512458 0.5616395
  snp5   NA        NA 0.8075894
  snp6   NA        NA        NA

and even present it results for cases and controls on the same graph (you do not need to produce this graph, which is presented at the figure 3.2):

  > image(ldcases$"R^2",breaks=c(0.5,0.6,0.7,0.8,0.9,1),col=heat.colors(5))
  > image(t(ldcontr$"R^2"),breaks=c(0.5,0.6,0.7,0.8,0.9,1),col=heat.colors(5),add=T)

pict

Figure 3.2: r2 plot for snp4, snp5 and snp6. Above diagonal: LD in cases; below: controls

3.3 Genetic association analysis

3.4 Example association analysis

Now, after we have described genetic and phenotypic data separately, we are ready to test association between these two. In previous sections, we showed that association between a binary trait and genotype may be analysed using contingency tables (functions table, prop.table, fisher.test, etc.). The association between a quantitative trait and genotype may be done using correlations, T-test, etc.

However, a more flexible analysis is possible when using regression modelling. First, we will investigate relation between the quantitative trait qt and the SNPs by using linear regression

  > mg <- lm(qt~snp4)

The lm command fits linear regression model to the data and returns an analysis object. The summary of analysis may be generated with

  > summary(mg)

  Call:
  lm(formula = qt ~ snp4)
  
  Residuals:
       Min       1Q   Median       3Q      Max
  -2.63700 -0.62291 -0.01225  0.58922  3.05561
  
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)
  (Intercept) -0.081114   0.092517  -0.877    0.382
  snp4A/B     -0.108366   0.132079  -0.820    0.413
  snp4B/B     -0.006041   0.201820  -0.030    0.976
  
  Residual standard error: 0.9659 on 240 degrees of freedom
    (7 observations deleted due to missingness)
  Multiple R-squared:  0.003049, Adjusted R-squared:  -0.005259
  F-statistic: 0.367 on 2 and 240 DF,  p-value: 0.6932

From the summary output, it is clear that the model assumes arbitrary (estimated) effects of the genotypes AA, AB and BB. Neither effect of AB nor BB is significant in this case. The global test on two degrees of freedom (bottom of the output) is also not significant.

If you want to include some covariate into your model, e.g. sex, you can easily do that by adding the term to the formula:

  > summary(lm(qt~sex+snp4))

  Call:
  lm(formula = qt ~ sex + snp4)
  
  Residuals:
       Min       1Q   Median       3Q      Max
  -2.66442 -0.62417 -0.00875  0.59705  3.08086
  
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)
  (Intercept) -0.110298   0.115260  -0.957    0.340
  sex          0.053018   0.124493   0.426    0.671
  snp4A/B     -0.104429   0.132628  -0.787    0.432
  snp4B/B     -0.002452   0.202340  -0.012    0.990
  
  Residual standard error: 0.9676 on 239 degrees of freedom
    (7 observations deleted due to missingness)
  Multiple R-squared:  0.003805, Adjusted R-squared:  -0.0087
  F-statistic: 0.3043 on 3 and 239 DF,  p-value: 0.8223

You can also allow for interaction by using the "*" operator

  > summary(lm(qt~sex*snp4))

  Call:
  lm(formula = qt ~ sex * snp4)
  
  Residuals:
       Min       1Q   Median       3Q      Max
  -2.57049 -0.64596 -0.00264  0.61094  3.01970
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) -0.20579    0.13834  -1.487    0.138
  sex          0.22649    0.18647   1.215    0.226
  snp4A/B      0.05222    0.19024   0.274    0.784
  snp4B/B      0.18071    0.28576   0.632    0.528
  sex:snp4A/B -0.30191    0.26566  -1.136    0.257
  sex:snp4B/B -0.35508    0.40531  -0.876    0.382
  
  Residual standard error: 0.9684 on 237 degrees of freedom
    (7 observations deleted due to missingness)
  Multiple R-squared:  0.01041, Adjusted R-squared:  -0.01047
  F-statistic: 0.4984 on 5 and 237 DF,  p-value: 0.7773

Note that both main effects of sex and snp4, and also effects of interaction are estimated in this model.

Of interest in genetic studies may be three other models: additive, dominant and recessive.

The additive model assumes that the difference between mean trait’s values between ’AA’ and ’BB’ is twice the difference between ’AA’ and ’BB’, that is the mean value of the trait for heterozygous genotypes is right in between the two homozygotes. To test additive model, we first need to recode the predictor (genotype) as a numeric factor to be used as covariate. This can be easy done with function as.numeric:

  > add4 <- as.numeric(snp4)-1

We can check if recoding was done correctly by producing the table

  > table(snp4,add4)

       add4
  snp4    0   1   2
    A/A 109   0   0
    A/B   0 105   0
    B/B   0   0  29

Now to test the additive model run

  > summary(lm(qt~add4))

  Call:
  lm(formula = qt ~ add4)
  
  Residuals:
       Min       1Q   Median       3Q      Max
  -2.54813 -0.62104 -0.02754  0.60584  3.00652
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) -0.10476    0.08710  -1.203    0.230
  add4        -0.03563    0.09133  -0.390    0.697
  
  Residual standard error: 0.9651 on 241 degrees of freedom
    (7 observations deleted due to missingness)
  Multiple R-squared:  0.0006313, Adjusted R-squared:  -0.003516
  F-statistic: 0.1522 on 1 and 241 DF,  p-value: 0.6968

The model assuming dominant action of the ’A’ allele means that the means of genotypes ’AA’ and ’AB’ are the same. This is equivalent to the model of recessive action of ’B’ allele. To code SNP4 according to this model, we can use function replace:

  > dom4 <- add4
  > dom4[dom4==2] <- 1
  > table(snp4,dom4)

       dom4
  snp4    0   1
    A/A 109   0
    A/B   0 105
    B/B   0  29

To test association with a binary outcome, we will use function glm with binomial family:

  > summary(glm(aff~snp4,family="binomial"))

  Call:
  glm(formula = aff ~ snp4, family = "binomial")
  
  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -0.7433  -0.7204  -0.6715  -0.6715   1.7890
  
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept)  -1.3749     0.2386  -5.761 8.35e-09 ***
  snp4A/B       0.1585     0.3331   0.476    0.634
  snp4B/B       0.2297     0.4952   0.464    0.643
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 254.91  on 242  degrees of freedom
  Residual deviance: 254.58  on 240  degrees of freedom
    (7 observations deleted due to missingness)
  AIC: 260.58
  
  Number of Fisher Scoring iterations: 4

To make a test of global significance of the SNP effect, you can use

  > anova(glm(aff~snp4,family="binomial"),test="Chisq")

  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
       Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                   242     254.91
  snp4  2  0.32894       240     254.58   0.8483

In the manner similar to that described for quantitative traits, additive and dominance/recessive models can be tested by proper coding of the genotypic variable, e.g. to test the additive model, use

  > summary(glm(aff~as.numeric(snp4),family="binomial"))

  Call:
  glm(formula = aff ~ as.numeric(snp4), family = "binomial")
  
  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -0.7549  -0.7139  -0.6747  -0.6747   1.7842
  
  Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
  (Intercept)       -1.4913     0.4164  -3.581 0.000342 ***
  as.numeric(snp4)   0.1272     0.2268   0.561 0.574994
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 254.91  on 242  degrees of freedom
  Residual deviance: 254.60  on 241  degrees of freedom
    (7 observations deleted due to missingness)
  AIC: 258.6
  
  Number of Fisher Scoring iterations: 4

Now you have learned all commands necessary to answer the questions of the next section.

Exit R by typing q() command (do not save image) and and proceed to the self exercise.

3.5 Exercise: Exploring genetic data using library genetics

Start R by double-click over the file ge03d1p2.RData (Windows) or by changing to the directory containing the data, starting R and loading the data set with load("ge03d1p2.RData") (Linux). Explore the data frame present and answer the questions.

Ex. 1 —  

How many SNPs are described in this data frame?

Ex. 2 —  

What is the frequency (proportion) of snp1 allele ’A’?

Ex. 3 —  

What is its frequency of ’A’ in affected (aff==1)?

Ex. 4 —  

How many cases and controls are present in the data set?

Ex. 5 —  

If all subjects are used to test HWE, are there any SNPs out of HWE at nominal P 0.05? Which ones?

Ex. 6 —  

If only controls are used to test the SNPs which are out of HWE in total sample, are these still out of HWE?

Ex. 7 —  

Which SNP pairs are in strong LD (r2 0.8)?

Ex. 8 —  

For SNPs in strong LD, what is r2 for separate samples of cases and controls?

Ex. 9 —  

Is there significant association between affection status and sex? What is P-value for association?

Ex. 10 —  

Is association between the disease and qt significant?

Ex. 11 —  

Which SNPs are significantly associated with affection status at nominal p-value 0.05? Use general genotypic (2 d.f.) model.

Ex. 12 —  

Test association between aff and snp5 and snp10, allowing for the SNPs interaction effect. Use arbitrary (not an additive) model. Do you observe significant interaction? How can you describe the model of concert action of snp5 and snp10?

Ex. 13 —  

Test for association between the quantitative trait qt and SNPs 1-10 using additive model. Which SNPs are associated at nominal P 0.05?

Ex. 14 —  

OPTIONAL, difficulty is medium, but may be time-consuming. If you adjust the analysis under additive model for sex, how do the findings change? Before doing the exercise, please check the answer to previous exercise – it shows a quick way to do testing for all 10 SNPs.

Ex. 15 —  

Which SNPs are associated with the quantitative trait qt at nominal P 0.05 when general genotypic (2 d.f. test) model is used?

Ex. 16 —  

ADVANCED: How can you describe the model of action of the significant SNPs? Test if the data are compatible with additive/dominant/recessive model.

3.6 Answers to exercises

Answer (Ex. 1) — 

The answer is 10 snps:

  > attach(popdat)
  > names(popdat)

   [1] "subj"  "sex"   "aff"   "qt"    "snp1"  "snp2"  "snp3"  "snp4"  "snp5"
  [10] "snp6"  "snp7"  "snp8"  "snp9"  "snp10"

Answer (Ex. 2) — 

The frequency of ’A’ in all subjects is 0.73:

  > summary(snp1)

  Number of samples typed: 2374 (95%)
  
  Allele Frequency: (2 alleles)
     Count Proportion
  A   3462       0.73
  B   1286       0.27
  NA   252         NA
  
  
  Genotype Frequency:
      Count Proportion
  B/B   199       0.08
  A/B   888       0.37
  A/A  1287       0.54
  NA    126         NA
  
  Heterozygosity (Hu)  = 0.3950646
  Poly. Inf. Content   = 0.3169762

Answer (Ex. 3) — 

The frequency of A in affected subjects is 0.7:

  > summary(snp1[aff==1])

  Number of samples typed: 519 (94.5%)
  
  Allele Frequency: (2 alleles)
     Count Proportion
  A    729        0.7
  B    309        0.3
  NA    60         NA
  
  
  Genotype Frequency:
      Count Proportion
  B/B    48       0.09
  A/B   213       0.41
  A/A   258       0.50
  NA     30         NA
  
  Heterozygosity (Hu)  = 0.4185428
  Poly. Inf. Content   = 0.3307192

Answer (Ex. 4) — 

There are 549 cases and 1951 controls:

  > table(aff)

  aff
     0    1
  1951  549

Answer (Ex. 5) — 

Only SNP 1 is out of HWE in the total sample. Here is a sciript testing all SNPs (no need to reproduce that, just check the results):

  > for (i in 1:10) {
  + snpname <- paste("snp",i,sep="")
  + cat("HWE P-value for SNP",snpname,"is",HWE.exact(get(snpname))$p.value,"\n")
  + }

  HWE P-value for SNP snp1 is 0.01083499
  HWE P-value for SNP snp2 is 1
  HWE P-value for SNP snp3 is 0.4197772
  HWE P-value for SNP snp4 is 0.8960298
  HWE P-value for SNP snp5 is 0.2960967
  HWE P-value for SNP snp6 is 0.5207056
  HWE P-value for SNP snp7 is 0.6284575
  HWE P-value for SNP snp8 is 0.1309458
  HWE P-value for SNP snp9 is 0.4457363
  HWE P-value for SNP snp10 is 0.7897327

Answer (Ex. 6) — 

Yes, SNP one is out of HWE also in controls:

  > HWE.exact(snp1[aff==0])

   Exact Test for Hardy-Weinberg Equilibrium
  
  data:  snp1[aff == 0]
  N11 = 1029, N12 = 675, N22 = 151, N1 = 2733, N2 = 977, p-value =
  0.008393

Answer (Ex. 7) — 

SNP pairs 4-5 and 5-6 have r2 0.8:

  > LD(popdat[,5:14])$"R^2"

        snp1  snp2  snp3  snp4  snp5  snp6  snp7  snp8  snp9 snp10
  snp1    NA 0.016 0.235 0.206 0.258 0.227 0.152 0.117 0.090 0.000
  snp2    NA    NA 0.004 0.004 0.005 0.004 0.000 0.000 0.000 0.000
  snp3    NA    NA    NA 0.602 0.457 0.346 0.641 0.031 0.042 0.001
  snp4    NA    NA    NA    NA 0.803 0.650 0.729 0.027 0.037 0.002
  snp5    NA    NA    NA    NA    NA 0.874 0.586 0.034 0.046 0.002
  snp6    NA    NA    NA    NA    NA    NA 0.670 0.030 0.040 0.002
  snp7    NA    NA    NA    NA    NA    NA    NA 0.020 0.027 0.003
  snp8    NA    NA    NA    NA    NA    NA    NA    NA 0.002 0.000
  snp9    NA    NA    NA    NA    NA    NA    NA    NA    NA 0.001
  snp10   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Answer (Ex. 8) — 

For controls,

  > #LD(popdat[aff==0,8:10])$"R^2"
  > LD(data.frame(snp4,snp5,snp6)[aff==0,])$"R^2"

       snp4     snp5      snp6
  snp4   NA 0.806591 0.6419715
  snp5   NA       NA 0.8661005
  snp6   NA       NA        NA

For cases,

  > #LD(popdat[aff==1,8:10])$"R^2"
  > LD(data.frame(snp4,snp5,snp6)[aff==1,])$"R^2"

       snp4      snp5      snp6
  snp4   NA 0.7951475 0.6773275
  snp5   NA        NA 0.9083237
  snp6   NA        NA        NA

Note that the fact that LD is higher in cases may mean nothing because the estimates of LD are biased upwards with smaller sample sizes. For example in a small sample (5 people) of controls we expect even higher LD because of strong upward bias:

  > LD(popdat[which(aff==0)[1:5],8:10])$"R^2"

       snp4      snp5      snp6
  snp4   NA 0.9995876 0.9995876
  snp5   NA        NA 0.9995876
  snp6   NA        NA        NA

More elaborate methods, such as that by Zaykin et al. (2006), are required to contrast LD between sample of unequal size.

Answer (Ex. 9) — 

There is no significant association:

  > t <- table(aff,sex)
  > t

     sex
  aff   0   1
    0 973 978
    1 260 289

  > fisher.test(t)

   Fisher's Exact Test for Count Data
  
  data:  t
  p-value = 0.3104
  alternative hypothesis: true odds ratio is not equal to 1
  95 percent confidence interval:
   0.9107753 1.3430565
  sample estimates:
  odds ratio
    1.105811

  > summary(glm(aff~sex,family=binomial()))

  Call:
  glm(formula = aff ~ sex, family = binomial())
  
  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -0.7196  -0.7196  -0.6882  -0.6882   1.7644
  
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept) -1.31970    0.06981  -18.90   <2e-16 ***
  sex          0.10062    0.09673    1.04    0.298
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 2632.0  on 2499  degrees of freedom
  Residual deviance: 2630.9  on 2498  degrees of freedom
  AIC: 2634.9
  
  Number of Fisher Scoring iterations: 4

Answer (Ex. 10) — 

There is no significant association:

  > summary(glm(aff~qt,family=binomial()))

  Call:
  glm(formula = aff ~ qt, family = binomial())
  
  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -0.7326  -0.7079  -0.7012  -0.6905   1.7675
  
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept) -1.26769    0.04832 -26.238   <2e-16 ***
  qt          -0.02514    0.04862  -0.517    0.605
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 2632.0  on 2499  degrees of freedom
  Residual deviance: 2631.7  on 2498  degrees of freedom
  AIC: 2635.7
  
  Number of Fisher Scoring iterations: 4

Answer (Ex. 11) — 

SNPs 5 and 10 are significantly associated:

  > for (i in 1:10) {
  + snpname <- paste("snp",i,sep="")
  + cat("\nTesting association between aff and SNP",snpname,":\n")
  + print(anova(glm(aff~get(snpname),family=binomial),test="Chisq"))
  + #print(summary(lm(qt~get(snpname)))$coef)
  + }

  Testing association between aff and SNP snp1 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2373     2493.4
  get(snpname)  2   5.4094      2371     2488.0  0.06689 .
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between aff and SNP snp2 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2373     2485.8
  get(snpname)  1  0.29367      2372     2485.5   0.5879
  
  Testing association between aff and SNP snp3 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2377     2503.0
  get(snpname)  2   2.6087      2375     2500.4   0.2714
  
  Testing association between aff and SNP snp4 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2389     2519.1
  get(snpname)  2   5.2755      2387     2513.8  0.07152 .
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between aff and SNP snp5 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2382     2440.4
  get(snpname)  2   9.2395      2380     2431.2 0.009855 **
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between aff and SNP snp6 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2379     2498.9
  get(snpname)  2   1.7969      2377     2497.1   0.4072
  
  Testing association between aff and SNP snp7 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2367     2487.9
  get(snpname)  2   1.3604      2365     2486.6   0.5065
  
  Testing association between aff and SNP snp8 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2370     2489.4
  get(snpname)  2   5.5375      2368     2483.9  0.06274 .
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between aff and SNP snp9 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2360     2476.8
  get(snpname)  2   1.1891      2358     2475.6   0.5518
  
  Testing association between aff and SNP snp10 :
  Analysis of Deviance Table
  
  Model: binomial, link: logit
  
  Response: aff
  
  Terms added sequentially (first to last)
  
  
               Df Deviance Resid. Df Resid. Dev Pr(>Chi)
  NULL                          2383     2475.1
  get(snpname)  2   6.7328      2381     2468.4  0.03451 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer (Ex. 12) — 

It appears that SNP10 genotype is only relevant in these who are homozygous for the low-risk A allele at the SNP5; in such cases SNP 10 allele B is risk increasing. In these homozygous for SNP 5 A, we observe highly significant increase in risk for heterozygotes for SNP10 and increased (though not significantly) risk for SNP 10 BB:

  > summary(glm(aff~snp5*snp10,family=binomial()))

  Call:
  glm(formula = aff ~ snp5 * snp10, family = binomial())
  
  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -0.9906  -0.7340  -0.6323  -0.5215   2.0310
  
  Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
  (Intercept)      -1.50840    0.08905 -16.938  < 2e-16 ***
  snp5A/A          -0.41802    0.19722  -2.120   0.0340 *
  snp5B/B           0.33441    0.13360   2.503   0.0123 *
  snp10A/B         -0.01403    0.18251  -0.077   0.9387
  snp10B/B         -0.14983    0.55277  -0.271   0.7863
  snp5A/A:snp10A/B  1.48369    0.32750   4.530 5.89e-06 ***
  snp5B/B:snp10A/B  0.12989    0.27441   0.473   0.6360
  snp5A/A:snp10B/B  0.82348    0.98963   0.832   0.4053
  snp5B/B:snp10B/B -0.28562    1.23104  -0.232   0.8165
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 2282.4  on 2268  degrees of freedom
  Residual deviance: 2242.9  on 2260  degrees of freedom
    (231 observations deleted due to missingness)
  AIC: 2260.9
  
  Number of Fisher Scoring iterations: 4

Answer (Ex. 13) — 

SNPs 1, 4, 5, 6 and 9 are significantly associated at nominal P 0.05. Here is a testing script (no need to reproduce that, just check the results):

  > for (i in 1:10) {
  + snpname <- paste("snp",i,sep="")
  + cat("\nTesting association between qt and SNP",snpname,":\n")
  + testmodel <- lm(qt~as.numeric(get(snpname)))
  + print(summary(testmodel)$coef)
  + }

  Testing association between qt and SNP snp1 :
                              Estimate Std. Error   t value    Pr(>|t|)
  (Intercept)              -0.11874800 0.05260279 -2.257447 0.024070746
  as.numeric(get(snpname))  0.08859657 0.03147693  2.814651 0.004923315
  
  Testing association between qt and SNP snp2 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)               0.09149145  0.1841115  0.4969352 0.6192808
  as.numeric(get(snpname)) -0.07749967  0.1806078 -0.4291047 0.6678860
  
  Testing association between qt and SNP snp3 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)               0.06382773 0.05629376  1.1338331 0.2569789
  as.numeric(get(snpname)) -0.02517149 0.02894125 -0.8697443 0.3845280
  
  Testing association between qt and SNP snp4 :
                              Estimate Std. Error   t value   Pr(>|t|)
  (Intercept)               0.14005988 0.05612775  2.495377 0.01264938
  as.numeric(get(snpname)) -0.07284539 0.02982557 -2.442380 0.01466282
  
  Testing association between qt and SNP snp5 :
                              Estimate Std. Error   t value   Pr(>|t|)
  (Intercept)              -0.14350846 0.06620992 -2.167477 0.03029734
  as.numeric(get(snpname))  0.07404874 0.02941437  2.517434 0.01188645
  
  Testing association between qt and SNP snp6 :
                              Estimate Std. Error   t value   Pr(>|t|)
  (Intercept)              -0.12737489 0.06768096 -1.881990 0.05995937
  as.numeric(get(snpname))  0.06724115 0.02924840  2.298969 0.02159304
  
  Testing association between qt and SNP snp7 :
                              Estimate Std. Error   t value  Pr(>|t|)
  (Intercept)               0.08884244 0.05475991  1.622399 0.1048511
  as.numeric(get(snpname)) -0.03774136 0.03152701 -1.197112 0.2313829
  
  Testing association between qt and SNP snp8 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)              -0.05214665 0.08116533 -0.6424744 0.5206274
  as.numeric(get(snpname))  0.06942327 0.07222881  0.9611576 0.3365710
  
  Testing association between qt and SNP snp9 :
                             Estimate Std. Error   t value     Pr(>|t|)
  (Intercept)              -0.2201742 0.07115711 -3.094199 0.0019965937
  as.numeric(get(snpname))  0.2110978 0.06112112  3.453761 0.0005625794
  
  Testing association between qt and SNP snp10 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)              -0.01474695 0.05749473 -0.2564921 0.7975930
  as.numeric(get(snpname))  0.03140888 0.04251458  0.7387789 0.4601141

Answer (Ex. 14) — 

Generally, results do not change much: still, SNPs 1, 4, 5, 6 and 9 are significantly associated, and p-values are close to these observed without adjustment For SNPs

  > for (i in 1:10) {
  + snpname <- paste("snp",i,sep="")
  + cat("\nTesting sex-adjusted association between qt and SNP",snpname,":\n")
  + testmodel <- lm(qt~sex+as.numeric(get(snpname)))
  + print(summary(testmodel)$coef)
  + }

  Testing sex-adjusted association between qt and SNP snp1 :
                              Estimate Std. Error    t value    Pr(>|t|)
  (Intercept)              -0.12783200 0.05660623 -2.2582672 0.024019551
  sex                       0.01766209 0.04061074  0.4349117 0.663666095
  as.numeric(get(snpname))  0.08868826 0.03148302  2.8170191 0.004887301
  
  Testing sex-adjusted association between qt and SNP snp2 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)               0.08169692 0.18525892  0.4409878 0.6592621
  sex                       0.01977118 0.04101612  0.4820343 0.6298261
  as.numeric(get(snpname)) -0.07770464 0.18063757 -0.4301688 0.6671120
  
  Testing sex-adjusted association between qt and SNP snp3 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)               0.05518829 0.06028776  0.9154146 0.3600669
  sex                       0.01626276 0.04057000  0.4008568 0.6885616
  as.numeric(get(snpname)) -0.02491788 0.02895328 -0.8606239 0.3895321
  
  Testing sex-adjusted association between qt and SNP snp4 :
                              Estimate Std. Error    t value   Pr(>|t|)
  (Intercept)               0.12611774 0.06040009  2.0880388 0.03690009
  sex                       0.02553529 0.04083273  0.6253633 0.53179244
  as.numeric(get(snpname)) -0.07227905 0.02984312 -2.4219671 0.01551079
  
  Testing sex-adjusted association between qt and SNP snp5 :
                              Estimate Std. Error    t value   Pr(>|t|)
  (Intercept)              -0.15272175 0.06880690 -2.2195704 0.02654187
  sex                       0.02009534 0.04076087  0.4930057 0.62205408
  as.numeric(get(snpname))  0.07357452 0.02943476  2.4995791 0.01250094
  
  Testing sex-adjusted association between qt and SNP snp6 :
                              Estimate Std. Error    t value   Pr(>|t|)
  (Intercept)              -0.14110680 0.07002874 -2.0149841 0.04401871
  sex                       0.03117861 0.04077551  0.7646406 0.44456146
  as.numeric(get(snpname))  0.06634602 0.02927437  2.2663517 0.02351939
  
  Testing sex-adjusted association between qt and SNP snp7 :
                              Estimate Std. Error   t value  Pr(>|t|)
  (Intercept)               0.06697438 0.05879604  1.139097 0.2547782
  sex                       0.04174468 0.04087001  1.021401 0.3071689
  as.numeric(get(snpname)) -0.03723286 0.03153066 -1.180846 0.2377825
  
  Testing sex-adjusted association between qt and SNP snp8 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)              -0.06807682 0.08455606 -0.8051087 0.4208378
  sex                       0.02760968 0.04102873  0.6729354 0.5010541
  as.numeric(get(snpname))  0.07124407 0.07228780  0.9855614 0.3244491
  
  Testing sex-adjusted association between qt and SNP snp9 :
                              Estimate Std. Error    t value     Pr(>|t|)
  (Intercept)              -0.22558672 0.07313030 -3.0847229 0.0020610506
  sex                       0.01311497 0.04074702  0.3218632 0.7475848790
  as.numeric(get(snpname))  0.21002414 0.06122367  3.4304402 0.0006129721
  
  Testing sex-adjusted association between qt and SNP snp10 :
                              Estimate Std. Error    t value  Pr(>|t|)
  (Intercept)              -0.02805887 0.06107457 -0.4594198 0.6459747
  sex                       0.02628396 0.04064009  0.6467496 0.5178563
  as.numeric(get(snpname))  0.03150834 0.04252005  0.7410231 0.4587525

Answer (Ex. 15) — 

SNPs 1, 4, 5 an 9 are significantly associated at nominal P 0.05. SNP 6 is only marginally significantly associated unde the general genotypic model. Here is a testing script (no need to reproduce that, just check the results):

  > for (i in 1:10) {
  + snpname <- paste("snp",i,sep="")
  + cat("\nTesting association between qt and SNP",snpname,":\n")
  + print(anova(lm(qt~get(snpname)),test="Chisq"))
  + #print(summary(lm(qt~get(snpname)))$coef)
  + }

  Testing association between qt and SNP snp1 :
  Analysis of Variance Table
  
  Response: qt
                 Df Sum Sq Mean Sq F value  Pr(>F)
  get(snpname)    2    7.8  3.8995  3.9845 0.01873 *
  Residuals    2371 2320.4  0.9787
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between qt and SNP snp2 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value Pr(>F)
  get(snpname)    1    0.18 0.18376  0.1841 0.6679
  Residuals    2372 2367.23 0.99799
  
  Testing association between qt and SNP snp3 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value Pr(>F)
  get(snpname)    2    3.24 1.61986   1.658 0.1907
  Residuals    2375 2320.41 0.97701
  
  Testing association between qt and SNP snp4 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value  Pr(>F)
  get(snpname)    2    7.68  3.8417  3.8628 0.02114 *
  Residuals    2387 2373.94  0.9945
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between qt and SNP snp5 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value  Pr(>F)
  get(snpname)    2    6.48  3.2418  3.2798 0.03781 *
  Residuals    2380 2352.48  0.9884
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between qt and SNP snp6 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value  Pr(>F)
  get(snpname)    2    5.49 2.74680  2.7808 0.06219 .
  Residuals    2377 2347.91 0.98776
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between qt and SNP snp7 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value Pr(>F)
  get(snpname)    2    4.02 2.01212  2.0368 0.1307
  Residuals    2365 2336.31 0.98787
  
  Testing association between qt and SNP snp8 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value Pr(>F)
  get(snpname)    2    1.38 0.68987  0.6924 0.5005
  Residuals    2368 2359.23 0.99630
  
  Testing association between qt and SNP snp9 :
  Analysis of Variance Table
  
  Response: qt
                 Df Sum Sq Mean Sq F value    Pr(>F)
  get(snpname)    2   15.6  7.8014  7.9982 0.0003453 ***
  Residuals    2358 2300.0  0.9754
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing association between qt and SNP snp10 :
  Analysis of Variance Table
  
  Response: qt
                 Df  Sum Sq Mean Sq F value Pr(>F)
  get(snpname)    2    1.19 0.59456  0.6041 0.5467
  Residuals    2381 2343.47 0.98424

Answer (Ex. 16) — 

For ’snp1’, though the data are compatible with either additive, dominant or recessive model, the additive model provides best fit to the data (largest p-value), while the recessive ’B’ model provide the wors fit (almost significantly worse than the general model):

  >  table(snp1,as.numeric(snp1))

  snp1     1    2    3
    A/A 1287    0    0
    A/B    0  888    0
    B/B    0    0  199

  >  table(snp1,(as.numeric(snp1)>=2))

  snp1  FALSE TRUE
    A/A  1287    0
    A/B     0  888
    B/B     0  199

  >  table(snp1,(as.numeric(snp1)>=3))

  snp1  FALSE TRUE
    A/A  1287    0
    A/B   888    0
    B/B     0  199

  >  model_gen <- lm(qt~snp1)
  >  summary(model_gen)

  Call:
  lm(formula = qt ~ snp1)
  
  Residuals:
      Min      1Q  Median      3Q     Max
  -3.5261 -0.6643 -0.0111  0.6765  3.5462
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) -0.02846    0.02758  -1.032   0.3022
  snp1A/B      0.08200    0.04316   1.900   0.0575 .
  snp1B/B      0.18644    0.07536   2.474   0.0134 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 0.9893 on 2371 degrees of freedom
    (126 observations deleted due to missingness)
  Multiple R-squared:  0.00335, Adjusted R-squared:  0.002509
  F-statistic: 3.985 on 2 and 2371 DF,  p-value: 0.01873

  >  model_add <- lm(qt~as.numeric(snp1))
  >  model_dom <- lm(qt~I(as.numeric(snp1)>=2))
  >  model_rec <- lm(qt~I(as.numeric(snp1)>=3))
  >  anova(model_add,model_gen,test="Chisq")

  Analysis of Variance Table
  
  Model 1: qt ~ as.numeric(snp1)
  Model 2: qt ~ snp1
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2372 2320.5
  2   2371 2320.4  1   0.04886   0.8232

  >  anova(model_dom,model_gen,test="Chisq")

  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(snp1) >= 2)
  Model 2: qt ~ snp1
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2372 2322.2
  2   2371 2320.4  1    1.7733   0.1783

  >  anova(model_rec,model_gen,test="Chisq")

  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(snp1) >= 3)
  Model 2: qt ~ snp1
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2372 2324.0
  2   2371 2320.4  1    3.5332  0.05743 .
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For SNPs 4, 5, 6, and 9 results are:

  >  for (i in c(4,5,6,9)) {
  +  snpname <- paste("snp",i,sep="")
  +  cat("\nTesting SNP",snpname,":\n")
  +  cursnp <- get(snpname)
  +  model_gen <- lm(qt~cursnp)
  +  print(summary(model_gen))
  +  model_add <- lm(qt~as.numeric(cursnp))
  +  model_dom <- lm(qt~I(as.numeric(cursnp)>=2))
  +  model_rec <- lm(qt~I(as.numeric(cursnp)>=3))
  +  print(anova(model_add,model_gen,test="Chisq"))
  +  print(anova(model_dom,model_gen,test="Chisq"))
  +  print(anova(model_rec,model_gen,test="Chisq"))
  +  }

  Testing SNP snp4 :
  
  Call:
  lm(formula = qt ~ cursnp)
  
  Residuals:
      Min      1Q  Median      3Q     Max
  -3.4311 -0.6636 -0.0013  0.6737  3.5489
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  0.02132    0.02972   0.717   0.4733
  cursnpA/A    0.02953    0.04423   0.668   0.5044
  cursnpB/B   -0.14481    0.06192  -2.339   0.0194 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 0.9973 on 2387 degrees of freedom
    (110 observations deleted due to missingness)
  Multiple R-squared:  0.003226, Adjusted R-squared:  0.002391
  F-statistic: 3.863 on 2 and 2387 DF,  p-value: 0.02114
  
  Analysis of Variance Table
  
  Model 1: qt ~ as.numeric(cursnp)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2388 2375.7
  2   2387 2373.9  1    1.7489   0.1848
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 2)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2388 2379.4
  2   2387 2373.9  1    5.4391  0.01936 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 3)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2388 2374.4
  2   2387 2373.9  1   0.44342   0.5043
  
  Testing SNP snp5 :
  
  Call:
  lm(formula = qt ~ cursnp)
  
  Residuals:
      Min      1Q  Median      3Q     Max
  -3.4719 -0.6589 -0.0084  0.6622  3.5285
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  0.01401    0.02878   0.487   0.6264
  cursnpA/A   -0.09667    0.05611  -1.723   0.0851 .
  cursnpB/B    0.05727    0.04607   1.243   0.2140
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 0.9942 on 2380 degrees of freedom
    (117 observations deleted due to missingness)
  Multiple R-squared:  0.002749, Adjusted R-squared:  0.00191
  F-statistic:  3.28 on 2 and 2380 DF,  p-value: 0.03781
  
  Analysis of Variance Table
  
  Model 1: qt ~ as.numeric(cursnp)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2381 2352.7
  2   2380 2352.5  1   0.22152   0.6359
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 2)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2381 2354.0
  2   2380 2352.5  1    1.5273   0.2138
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 3)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2381 2355.4
  2   2380 2352.5  1    2.9335  0.08494 .
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Testing SNP snp6 :
  
  Call:
  lm(formula = qt ~ cursnp)
  
  Residuals:
      Min      1Q  Median      3Q     Max
  -3.4784 -0.6753 -0.0064  0.6703  3.5324
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) -0.07617    0.05085  -1.498   0.1343
  cursnpB/A    0.09417    0.05886   1.600   0.1097
  cursnpB/B    0.14351    0.06096   2.354   0.0186 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 0.9939 on 2377 degrees of freedom
    (120 observations deleted due to missingness)
  Multiple R-squared:  0.002334, Adjusted R-squared:  0.001495
  F-statistic: 2.781 on 2 and 2377 DF,  p-value: 0.06219
  
  Analysis of Variance Table
  
  Model 1: qt ~ as.numeric(cursnp)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2378 2348.2
  2   2377 2347.9  1   0.27462    0.598
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 2)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2378 2349.1
  2   2377 2347.9  1    1.1967    0.271
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 3)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2378 2350.4
  2   2377 2347.9  1    2.5284   0.1096
  
  Testing SNP snp9 :
  
  Call:
  lm(formula = qt ~ cursnp)
  
  Residuals:
      Min      1Q  Median      3Q     Max
  -3.5482 -0.6673  0.0074  0.6546  3.6061
  
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)
  (Intercept) -0.006298   0.021562  -0.292  0.77026
  cursnpA/B    0.162230   0.065729   2.468  0.01365 *
  cursnpB/B    1.002439   0.313057   3.202  0.00138 **
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 0.9876 on 2358 degrees of freedom
    (139 observations deleted due to missingness)
  Multiple R-squared:  0.006738, Adjusted R-squared:  0.005896
  F-statistic: 7.998 on 2 and 2358 DF,  p-value: 0.0003453
  
  Analysis of Variance Table
  
  Model 1: qt ~ as.numeric(cursnp)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2359 2303.9
  2   2358 2300.0  1    3.9528  0.04411 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 2)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2359 2306.8
  2   2358 2300.0  1    6.7911 0.008324 **
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Analysis of Variance Table
  
  Model 1: qt ~ I(as.numeric(cursnp) >= 3)
  Model 2: qt ~ cursnp
    Res.Df    RSS Df Sum of Sq Pr(>Chi)
  1   2359 2305.9
  2   2358 2300.0  1     5.942  0.01358 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Chapter 4
Introduction to the GenABEL-package

In this section, you will become familiar with the GenABEL-package library, designed for GWA analysis. Compared to the genetics package, it provides specific facilities for storage and manipulation of large amounts of data, very fast tests for GWA analysis, and special functions to analyse and graphically present the results of GWA analysis (thus "analysis of analysis").

Start R and load the GenABEL-package library using the command

  > library(GenABEL)

After that, load the example data set using the command

  > data(srdta)

4.1 General description of gwaa.data-class

The object you have loaded, srdta, belongs to the gwaa.data class. This is a special R class developed to facilitate GWA analysis.

In GWA analysis, different types of data are used. These include the phenotypic and genotypic data on the study participants and chromosome and location of every SNP. For every SNP, it is desirable to know the details of coding (how are the alleles coded? – A, T, G, C? – as well as the strand – ’+’ or ’-’, ’top’ or ’bot’? – this coding is for).

One could attempt to store all phenotypes and genotypes together in a single table, using, e.g. one row per study subject; than the columns will correspond to study phenotypes and SNPs. For a typical GWA data set, this would lead to a table of few thousands rows and few hundreds of thousands to millions of columns. Such a format is generated when one downloads HapMap data for a region. To store GWA data in such tables internally, within R, proves to be inefficient. In GenABEL-package, a special data class, gwaa.data-class is used to store GWA data.

You may consider an object of gwaa.data-class as a ’black box’ from which you can get specific data using specific functions. If you are interested in the internal structure of the gwaa.data-class, you can find the description in section B.1 (Internal structure of gwaa.data-class).

The data frame, which contains all phenotypic data in the study may be accessed using the phdata function. Let us have a look at the first few rows of the phenotypic data frame of srdta:

  > phdata(srdta)[1:5,]

     id sex  age   qt1    qt2  qt3 bt
  p1 p1   1 43.4 -0.58   4.46 1.43  0
  p2 p2   1 48.2  0.80   6.32 3.90  1
  p3 p3   0 37.9 -0.52   3.26 5.05  1
  p4 p4   1 53.8 -1.55 888.00 3.76  1
  p5 p5   1 47.5  0.25   5.70 2.89  1

The rows of this data frame correspond to th estudy subjects, and the columns correspond to the variables. There are two default variables, which are always present in phdata. The first of these is “id”, which contains study subject identification code. This identification code can be arbitrary character, numer, or alphanumeric combination, but every person must be coded with an unique ID. The second default variable is “sex”, where males are coded with ones (“1”) and females are coded with zero (“0”).

It is important to understand that this data frame is not supposed to be directly modified by the user, as its structure is coupled to the structure of genotypic data. If at some point you need to manipulate (add/delete) the phenotypes included in phdata, you need to use such GenABEL-package functions as add.phdata and del.phdata (see section 4.2).

The other part of an object of gwaa.data-class is gtdata, which contains all GWA genetic information in an object of class snp.data class. It is not supposed to be modified directly by user. The genotypic data can be accessed through the gtdata function, e.g.

  > gtdata(srdta[1:10, 1:10])

  @nids = 10
  @nsnps = 10
  @nbytes = 3
  @idnames = p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
  @snpnames = rs10 rs18 rs29 rs65 rs73 rs114 rs128 rs130 rs143 rs150
  @chromosome = 1 1 1 1 1 1 1 1 1 1
  @coding =  08 0b 0c 03 04 03 0c 04 08 0f
  @strand =  01 01 02 01 01 01 02 01 01 01
  @map = 2500 3500 5750 13500 14250 24500 27000 27250 31000 33250
  @male = 1 1 0 1 1 0 0 1 0 0
  @gtps =
  40 40 40 80 40 40 40 40 c0 c0
  40 40 00 00 40 40 40 c0 40 40
  40 40 00 80 40 40 40 40 c0 c0

As you can see, these data are of little direct use as these are stored in an internal format – you need to coerce that to another data type if you want to manipulate/analyse these data using non-GenABEL-package functions (see section ??).

The number of individuals described in an object of gwaa.data-class can be accessed through nids function, e.g.

  > nids(srdta)

  [1] 2500

and the number of SNPs using the nsnps function:

  > nsnps(srdta)

  [1] 833

The IDs of the individuals included in the study can be accessed via the idnames function, for example the IDs of the first 7 individuals in the study are

  > idnames(srdta)[1:7]

  [1] "p1" "p2" "p3" "p4" "p5" "p6" "p7"

The sex of the individuals can be accessed using the male function:

  > male(srdta)[1:7]

  p1 p2 p3 p4 p5 p6 p7
   1  1  0  1  1  0  0

where males (heterogametic sex) are assigned with “1” and a homogametic sex (females) are assigned the value “0”.

Names of SNPs can be accessed using the snpnames function; for example the names of the first 10 SNPs in the srdta are

  > snpnames(srdta)[1:10]

   [1] "rs10"  "rs18"  "rs29"  "rs65"  "rs73"  "rs114" "rs128" "rs130" "rs143"
  [10] "rs150"

SNP annotation includes (shown for the first 10 SNPs only):

\item \GA{} uses a special data class, \texttt{gwaa.data-class}, to  
store GWA data.  
\item To access the content of an object of \texttt{gwaa.data-class},  
a number of functions is used

Exercise 1. Exploring IDs in srdta
Explore srdta.

1. 

How many people are included in the study?

2. 

How many of these are males?

3. 

How many are females?

4. 

What is male proportion?

Exercise 2. Exploring SNPs in srdta
Explore the SNPs contained in srdta using the functions to access SNP names (snpnames) and map (map) location

1. 

What are names of markers located after 2,490,000 b.p.?

2. 

Between 1,100,000 and 1,105,000 b.p.?

4.2 Accessing and modifying phenotypic data

As was already mentioned, the object returned by phdata contains phenotypic data and is a conventional data frame, wich obligatory includes ’id’ and ’sex’ variables, and ordered an a way that it couples to the genotypic data.

Being a data frame, phdata may be accessed using the corresponding methods:

  > phdata(srdta)[1:5, ]

     id sex  age   qt1    qt2  qt3 bt
  p1 p1   1 43.4 -0.58   4.46 1.43  0
  p2 p2   1 48.2  0.80   6.32 3.90  1
  p3 p3   0 37.9 -0.52   3.26 5.05  1
  p4 p4   1 53.8 -1.55 888.00 3.76  1
  p5 p5   1 47.5  0.25   5.70 2.89  1

  > class(phdata(srdta))

  [1] "data.frame"

  > phdata(srdta)[1:5, 2]

  [1] 1 1 0 1 1

  > phdata(srdta)[1:5, "sex"]

  [1] 1 1 0 1 1

  > phdata(srdta)$sex[1:5]

  [1] 1 1 0 1 1

The modification of the phenotypic data is performed using special methods, because of specific restrictions on phenotypic data frames. There are two main functions which allow you to add (add.phdata) and delete (del.phdata) phenotypes from phdata part of an object of gwaa.data-class.

For example, if you want to add a variable (say, the square of age) computed from the “age” variable of srdta

  > phdata(srdta)[1:5, ]

     id sex  age   qt1    qt2  qt3 bt
  p1 p1   1 43.4 -0.58   4.46 1.43  0
  p2 p2   1 48.2  0.80   6.32 3.90  1
  p3 p3   0 37.9 -0.52   3.26 5.05  1
  p4 p4   1 53.8 -1.55 888.00 3.76  1
  p5 p5   1 47.5  0.25   5.70 2.89  1

  > age2 <- phdata(srdta)$age^2

you need to use the add.phdata function:

  > srdta <- add.phdata(srdta, newph=age2, name="age_squared")
  > phdata(srdta)[1:5, ]

     id sex  age   qt1    qt2  qt3 bt age_squared
  p1 p1   1 43.4 -0.58   4.46 1.43  0     1883.56
  p2 p2   1 48.2  0.80   6.32 3.90  1     2323.24
  p3 p3   0 37.9 -0.52   3.26 5.05  1     1436.41
  p4 p4   1 53.8 -1.55 888.00 3.76  1     2894.44
  p5 p5   1 47.5  0.25   5.70 2.89  1     2256.25

You can add more than one variable at once using the same function, however, in this case the second (“newph”) argument of the function should be a data frame, which contains an ’id’ variable specifing the IDs of the individuals. Imagine we have the data for individuals ’p1’, ’p2’ and ’p7’ (we will generate random data for them; pay attention only to the result):

  > newvalues <- matrix( rnorm(3*5), 3, 5 )
  > newdata   <- data.frame(id=c("p1", "p2", "p7"),
  +                         ph1=1, ph2=1, ph3=1, ph4=1, ph5=1)
  > newdata[, c(2:6)] <- newvalues
  > newdata

    id        ph1        ph2        ph3        ph4       ph5
  1 p1 -0.6322620 -0.3698151 -1.5723992  0.1048902 -1.091918
  2 p2 -2.1418978  0.9183244 -0.7655322 -0.5194269 -0.707315
  3 p7 -0.2966435 -0.6518437  0.3486510 -0.2613273  1.435912

These data can be added to the phenotypic data with

  > srdta <- add.phdata(srdta,newdata)
  > phdata(srdta)[1:10, ]

       id sex  age   qt1    qt2  qt3 bt age_squared        ph1        ph2
  p1   p1   1 43.4 -0.58   4.46 1.43  0     1883.56 -0.6322620 -0.3698151
  p2   p2   1 48.2  0.80   6.32 3.90  1     2323.24 -2.1418978  0.9183244
  p3   p3   0 37.9 -0.52   3.26 5.05  1     1436.41         NA         NA
  p4   p4   1 53.8 -1.55 888.00 3.76  1     2894.44         NA         NA
  p5   p5   1 47.5  0.25   5.70 2.89  1     2256.25         NA         NA
  p6   p6   0 45.0  0.15   4.65 1.87  0     2025.00         NA         NA
  p7   p7   0 52.0 -0.56   4.64 2.49  0     2704.00 -0.2966435 -0.6518437
  p8   p8   1 42.5    NA   5.77 2.68  1     1806.25         NA         NA
  p9   p9   0 29.7 -2.26   0.71 1.45  0      882.09         NA         NA
  p10 p10   0 45.8 -1.32   3.26 0.85  0     2097.64         NA         NA
             ph3        ph4       ph5
  p1  -1.5723992  0.1048902 -1.091918
  p2  -0.7655322 -0.5194269 -0.707315
  p3          NA         NA        NA
  p4          NA         NA        NA
  p5          NA         NA        NA
  p6          NA         NA        NA
  p7   0.3486510 -0.2613273  1.435912
  p8          NA         NA        NA
  p9          NA         NA        NA
  p10         NA         NA        NA

Finally, if you need, you can delete some phenotypes from the phdata using del.phdata function. Let us delete the phenotypes we have just added:

  > srdta <- del.phdata(srdta,
  +                     c("age_squared", "ph1", "ph2", "ph3", "ph4", "ph5"))
  > phdata(srdta)[1:10, ]

       id sex  age   qt1    qt2  qt3 bt
  p1   p1   1 43.4 -0.58   4.46 1.43  0
  p2   p2   1 48.2  0.80   6.32 3.90  1
  p3   p3   0 37.9 -0.52   3.26 5.05  1
  p4   p4   1 53.8 -1.55 888.00 3.76  1
  p5   p5   1 47.5  0.25   5.70 2.89  1
  p6   p6   0 45.0  0.15   4.65 1.87  0
  p7   p7   0 52.0 -0.56   4.64 2.49  0
  p8   p8   1 42.5    NA   5.77 2.68  1
  p9   p9   0 29.7 -2.26   0.71 1.45  0
  p10 p10   0 45.8 -1.32   3.26 0.85  0

\item Phenotypic data contained in an object of \texttt{gwaa.data-class}  
can be accessed using the \texttt{phdata} functions  
\item You can add phenotypes using the \texttt{add.phdata} function  
\item You can delete phenotypes using the \texttt{del.phdata} function

4.3 Sub-setting and coercing gwaa.data

It is possible to sub-set the object, which stores the GWA data in the manner similar to that used for conventional R matrices and data frames. Very primitively, you may think of an object of class gwaa.data as a matrix whose rows correspond to study subjects and columns correspond to SNPs studied (though the actual object is more complicated). For example, if we would like to investigate the content of srdta for the first 5 people and 3 SNPs, we can run

  > ssubs <- srdta[1:5, 1:3]
  > class(ssubs)

  [1] "gwaa.data"
  attr(,"package")
  [1] "GenABEL"

As you can see, by sub-setting we obtained a smaller object of gwaa.data-class. The two major parts it contains are phenotypic data, which can be accessed through phdata (discussed in section 4.2):

  > phdata(ssubs)

     id sex  age   qt1    qt2  qt3 bt
  p1 p1   1 43.4 -0.58   4.46 1.43  0
  p2 p2   1 48.2  0.80   6.32 3.90  1
  p3 p3   0 37.9 -0.52   3.26 5.05  1
  p4 p4   1 53.8 -1.55 888.00 3.76  1
  p5 p5   1 47.5  0.25   5.70 2.89  1

and genotypc data, which can be accessed via the gtdata function:

  > gtdata(ssubs)

  @nids = 5
  @nsnps = 3
  @nbytes = 2
  @idnames = p1 p2 p3 p4 p5
  @snpnames = rs10 rs18 rs29
  @chromosome = 1 1 1
  @coding =  08 0b 0c
  @strand =  01 01 02
  @map = 2500 3500 5750
  @male = 1 1 0 1 1
  @gtps =
  40 40 40
  40 40 00

whose content is not quite straightforward to read.

To get human-readable information, a genotypic object should be coerced to a regular R data type, e.g. character, using the as.character() function:

  > as.character(gtdata(ssubs))

     rs10  rs18  rs29
  p1 "T/T" "G/G" "G/G"
  p2 "T/T" "G/G" NA
  p3 "T/T" "G/G" NA
  p4 "T/T" "G/G" NA
  p5 "T/T" "G/A" "G/G"

Another useful coercion is to "numeric":

  > as.numeric(gtdata(ssubs))

     rs10 rs18 rs29
  p1    0    0    0
  p2    0    0   NA
  p3    0    0   NA
  p4    0    0   NA
  p5    0    1    0

Note that conversion to numeric happened according to the underlying genotype and the rules specified by SNP coding:

  > coding(ssubs)

  rs10 rs18 rs29
  "TG" "GA" "GT"

– the genotype, which is made of the ’first’ allele of the ’code’ is converted to “0”, the heterozygote to “1” and a homozygote for the second allele is converted to “2”.

For example, when the coding is “GA”, as it is for the rs18 (the second SNP), homozygotes for the first allele, as specified by coding (“G”) are converted to zeros (“0”), heterozygotes are converted to ones (“1”), and homozygotes for the second allele (“A”) are converted to twos (“2”). Clearly, when numerically converted data are used for association analysis, the effects will be estimated for the second allele, while first will be used as a reference.

Genotypic data converted to standard R format can be used in any further analysis.

Several useful genetic analysis libraries were developed for R. These include genetics (analysis of linkage disequilibrium and many other useful functions) and haplo.stats (analysis of association between traits and haplotypes). These use their own genetic data formats.

One can translate GenABEL-package genetic data to the format used by "genetics" library using as.genotype():

  > as.genotype(gtdata(ssubs))

  NOTE: THIS PACKAGE IS NOW OBSOLETE.
  
    The R-Genetics project has developed an set of enhanced genetics
    packages to replace 'genetics'. Please visit the project homepage
    at http://rgenetics.org for informtion.
  
     rs10 rs18 rs29
  p1  T/T  G/G  G/G
  p2  T/T  G/G <NA>
  p3  T/T  G/G <NA>
  p4  T/T  G/G <NA>
  p5  T/T  G/A  G/G

To translate GenABEL-package data to the format used by "haplo.stats" you can use function as.hsgeno()

  > as.hsgeno(gtdata(ssubs))

     rs10.a1 rs10.a2 rs18.a1 rs18.a2 rs29.a1 rs29.a2
  p1       1       1       1       1       1       1
  p2       1       1       1       1      NA      NA
  p3       1       1       1       1      NA      NA
  p4       1       1       1       1      NA      NA
  p5       1       1       1       2       1       1

Actually, most users will not need the latter function, as GenABEL-package provides a functional interface to "haplo.stats" (such GenABEL-package functions as scan.haplo() and scan.haplo.2D()).

It is possible to select sub-sets of gwaa.data-class based not only on index (e.g. first 10 people and SNP number 33), but also based on names.

For example, if we would like to retrieve phenotypic data on people with IDs "p141", "p147" and "p2000", we can use

  > phdata( srdta[c("p141", "p147", "p2000"), ] )

           id sex  age   qt1  qt2  qt3 bt
  p141   p141   0 47.2  0.51 5.23 2.17  0
  p147   p147   0 43.2  0.14 4.47 1.73  0
  p2000 p2000   0 43.1 -1.53 2.78 2.70  1

here, the first part of expression sub-sets srdta on selected IDs, and the second tells which part of the retrieved sub-set we want to see. You can try srdta[c("p141", "p147", "p2000"),], but be prepared to see long output, as all information will be reported.

In a similar manner, we can also select on SNP name. For example, if we are interested to see information on SNPs "rs10" and "rs29" for above people, we can run

  > phdata(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])

           id sex  age   qt1  qt2  qt3 bt
  p141   p141   0 47.2  0.51 5.23 2.17  0
  p147   p147   0 43.2  0.14 4.47 1.73  0
  p2000 p2000   0 43.1 -1.53 2.78 2.70  1

  > gtdata(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])

  @nids = 3
  @nsnps = 2
  @nbytes = 1
  @idnames = p141 p147 p2000
  @snpnames = rs10 rs29
  @chromosome = 1 1
  @coding =  08 0c
  @strand =  01 02
  @map = 2500 5750
  @male = 0 0 0
  @gtps =
  40 40

To see the actual genotypes for the above three people and two SNPs, use

  > as.character(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])

        rs10  rs29
  p141  "T/T" "G/G"
  p147  "T/T" "G/G"
  p2000 "T/G" "G/T"

or

  > as.numeric(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])

        rs10 rs29
  p141     0    0
  p147     0    0
  p2000    1    1

Exercise 3. Exploring rs114
Explore genotypes for SNP "rs114"

1. 

What is the coding and which allele is the reference one?

2. 

What is the frequency of non-reference (“effect”) allele in total sample?

3. 

What is the frequency of the effect allele in males?

4. 

What is the frequency of the effect allele in females?

5. 

What is the frequency of the reference allele in the total sample, males and females?

\item It is possible to obtain subsets of objects of  
\texttt{gwaa.data-class} and \texttt{snp.data-class}  
using the standard 2D sub-setting model \texttt{[i, j]},  
where \texttt{i} corresponds to study