最近使用sommer计算G矩阵提示超过了默认的32位整数报错:Error: Mat::init(): requested size is too large; suggest to enable ARMA_64BIT_WORD,下面一起看看G矩阵构建和该错误的解决方法。
G矩阵是怎么来的,这里主要参考Vanraden的计算方法,其中$Z$为标记的设计矩阵,$p_i$为每个标记的次等位基因频率:
\[G=\frac{ZZ'}{2\sum p_{i}(1-p_i)}\]接下来我分别使用GAPIT3和sommer基于Vanraden方法构建G矩阵。首先软件将会过滤单态性标记,并根据次等位基因频率进一步过滤标记。GAPIT3中VanRaden方法输入的是0、1、2数值型矩阵,而 sommer中使用的是-1、0、1数值型矩阵:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
source("./GAPIT/R/GAPIT.kinship.VanRaden.R)"
# 种子
set.seed(666)
# 创建一个 5*10的0、1、2 矩阵
matrix_dat <- matrix(sample(0:2, 5 * 10, replace = TRUE), nrow = 5)
> matrix_dat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 1 1 0 0 1 0 1 1 0
[2,] 2 1 1 2 0 0 2 0 2 2
[3,] 2 1 0 2 2 1 2 2 2 1
[4,] 1 2 1 0 1 2 0 0 0 0
[5,] 2 0 2 0 2 1 2 0 2 1
> GAPIT.kinship.VanRaden(matrix_dat)
[1] "Calculating kinship with VanRaden method..."
[1] "substracting P..."
[1] "Getting X'X..."
[1] "Adjusting..."
[1] "Calculating kinship with VanRaden method: done"
[,1] [,2] [,3] [,4] [,5]
[1,] 0.9147982 -0.52017937 -0.6098655 0.6457399 -0.43049327
[2,] -0.5201794 1.40807175 0.1973094 -1.0134529 -0.07174888
[3,] -0.6098655 0.19730942 1.4529148 -0.8789238 -0.16143498
[4,] 0.6457399 -1.01345291 -0.8789238 1.7219731 -0.47533632
[5,] -0.4304933 -0.07174888 -0.1614350 -0.4753363 1.13901345
接下来使用sommer的A.mat()函数计算,注意sommer使用的是 -1 0 1数值型矩阵:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
library(sommer)
# 将0 1 2矩阵转换为-1 0 1矩阵
new_mat <- matrix_dat - 1
> new_mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 0 0 -1 -1 0 -1 0 0 -1
[2,] 1 0 0 1 -1 -1 1 -1 1 1
[3,] 1 0 -1 1 1 0 1 1 1 0
[4,] 0 1 0 -1 0 1 -1 -1 -1 -1
[5,] 1 -1 1 -1 1 0 1 -1 1 0
> A.mat(new_mat)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.9147982 -0.52017937 -0.6098655 0.6457399 -0.43049327
[2,] -0.5201794 1.40807175 0.1973094 -1.0134529 -0.07174888
[3,] -0.6098655 0.19730942 1.4529148 -0.8789238 -0.16143498
[4,] 0.6457399 -1.01345291 -0.8789238 1.7219731 -0.47533632
[5,] -0.4304933 -0.07174888 -0.1614350 -0.4753363 1.13901345
可以看到,两者得到的结果是相同的。然而,当标记数目较大时,sommer 可能会出现报错 “Error: Mat::init(): requested size is too large; suggest to enable ARMA_64BIT_WORD”。为了解决这个问题,我们需要修改 C++ 函数的源代码(MNR.cpp),并保存为 MNR_modify.cpp。以下是修改后的源代码:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4; indent-tabs-mode: nil; -*-
// we only include RcppArmadillo.h which pulls Rcpp.h in for us
#define ARMA_64BIT_WORD
#define ARMA_DONT_PRINT_ERRORS
#include "RcppArmadillo.h"
#include "stdlib.h"
#include <progress.hpp>
// [[Rcpp::export]]
arma::mat amat(const arma::mat & Xo, const bool & vanraden, double minMAF) {
// remove min.MAF
arma::rowvec pfreq = mean(Xo+1,0)/2; // frequency of p
arma::mat pqfreq = arma::join_cols(pfreq,1-pfreq); // frequencies of p and q
arma::rowvec MAF = min(pqfreq,0); // minor allele freqs
arma::uvec indexMAF = find(MAF > minMAF); // index for good markers > minMAF
arma::mat Xo2 = Xo.cols(indexMAF); // new X only with polymorphic markers
// remove monomorphic markers
arma::rowvec xVar = var(Xo2,0); // column variance
arma::uvec index = find(xVar > 0); // index for good markers
arma::mat X = Xo2.cols(index); // new X only with polymorphic markers
// initialize A
int p = X.n_cols;// number of markers
int n = X.n_rows;
arma::mat A(n,n);
if(vanraden == true){ // regular vanRaden
arma::rowvec ms012 = mean( X+1, 0 ); // means of columns
arma::rowvec freq = ms012/2;
double v = 2 * mean(freq % (1 - freq));
arma::mat one(n, 1, arma::fill::ones);
arma::mat freqmat = one * freq;
arma::mat W = (X + 1) - (2 * freqmat);
//
arma::mat K = W * W.t();
A = K/v/p;
}else{ // Endelman (currently we have a bug here)
// IN R: M <- scale(X, center = TRUE, scale = FALSE)
arma::rowvec ms = mean( X, 0 ); // means of columns
arma::mat M = X.each_row() - ms;
// IN R: tcrossprod(M)
arma::mat K = M * M.t();
// IN R: K/mean(diag(K)) mean(K.diag())
double v = mean(diagvec(K));
A = K/v;
}
return A;
}
然后使用A.mat函数计算G矩阵
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
A.mat <- function(X, min.MAF=0, return.imputed=FALSE){
################
## impute
missingCheck <- which(is.na(X), arr.ind = TRUE)
if(nrow(missingCheck) > 0){
cat("Imputing markers with mean value\n")
uniqueCols <- unique(missingCheck[,2])
X[,uniqueCols] <- apply(X[,uniqueCols],2,imputev)
}
vanraden=TRUE
##################
res <- amat(X, vanraden, min.MAF
)
colnames(res) <- rownames(res) <- rownames(X)
if(return.imputed){
return(list(X=X,A=res))
}else{
return(res)
}
}
library(Rcpp)
sourceCpp("path/to/MNR_modify.cpp")
big_G <- A.mat(train_biggeno)
修改后可以计算大型数据集的 G 矩阵而避免 32 位整数大小限制的错误。构建亲缘关系矩阵的方法和软件还有很多,大家可以根据自己的使用习惯进行选择。
参考:
Wang J., Zhang Z., GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction, Genomics, Proteomics & Bioinformatics (2021), doi: https://doi.org/10.1016/j.gpb.2021.08.005. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008 Nov;91(11):4414-23. doi: 10.3168/jds.2007-0980. PMID: 18946147. Lourenco, D.; Legarra, A.; Tsuruta, S.; Masuda, Y.; Aguilar, I.; Misztal, I. Single-Step Genomic Evaluations from Theory to Practice: Using SNP Chips and Sequence Data in BLUPF90. Genes 2020, 11, 790.
© 版权声明: 本博客所有文章除特别声明外,著作权归作者所有。转载请注明出处!