This is a simple tutorial illustrating the use of the unified local function to select low-density SNPs, as implemented in the selectSNPs R package. This R package is available for educational use only, and some advanced features are not included. Further information regarding this R package is available upon request to Dr. Xiao-Lin Nick Wu (nwu
Over the years, ad-hoc procedures were used for designing SNP chips. Often, the strategies were to select uniformly-distributed SNPs, either with or without a cutoff for SNP minor allele frequency (MAF), or select SNPs having strong association effects on quantitative traits, with some constraints considered in series. Recently, a multiple-objective, local optimization (MOLO) algorithm has been proposed, which evaluates multiple factors jointly when selecting optimal SNPs (Wu et al., 2016). This MOLO algorithm maximizes the adjusted SNP information (namely adjusted E score) under multiple constraints, e.g., on MAF, uniformness of SNP locations, the inclusion of obligatory SNPs, and the number and size of possible gaps. The computing of the adjusted E score (formula 12; Wu et al., 2016), however, is empirical, and it does not scale well between the uniformness of SNP locations and SNP informativeness. Additionally, the MOLO objective function does not accommodate selecting uniformly-distributed SNPs alone. We, therefore, proposed a unified, local function for optimal selection of SNPs in the present study, as an amendment to the local function in the MOLO algorithm.
Of the many factors to be considered, two are of essential importance for the design of low-density SNP chips: the information and the locations of the selected SNPs. In the MOLO algorithm, SNP information is measured by the Shannon entropy, which is a function of SNP MAF. The average Shannon entropy is referred to as the E score. In information theory, entropy is the average amount of information contained in each message received. For SNP selection, a message refers to an allele. Consider m SNPs, for example: the E score is computed for each SNP and then averaged across all SNPs, as follows:
\(E=-\frac{1}{m}\sum_{j=1}^{m}\left[q_j\log_2(q_j)+(1-q_j)\log_2(1-q_j)\right]\)
where \(q_j\) is the minor allele frequency of SNP j. For a single SNP, E is maximized and equals 1 (i.e., 1 bit) when \(q_j=(1-q_j)=0.5\). More generally, the total Shannon entropy of m selected SNPs is m bits when each SNP has two equally probable alleles. The E score is computed as the average of the Shannon entropy across the m loci, so its maximum is one. The relationship between the E score and MAF is shown in the figure below; the curve resembles a parabola, with the E score reaching its maximum at MAF = 0.5. Thus, maximizing the E score is equivalent to maximizing MAF, except that the former tends to select more SNPs of large MAF than the latter. Note that Shannon entropy can be computed similarly for all possible haplotypes. When multiple correlated populations are involved, SNPs can be selected jointly based on a weighted MAF or E score. Otherwise, population-specific SNPs will have to be selected based on the adjusted MAF or E score computed for each population.
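The E score formula above can be sketched in a few lines of base R. The function names `snp_entropy` and `e_score` are hypothetical helpers written for this illustration and are not part of the selectSNPs package; the `0 * log2(0)` terms are taken as 0, their limiting value.

```r
# Shannon entropy (in bits) of a biallelic SNP with minor allele frequency q.
# Hypothetical helper for illustration; not a selectSNPs function.
snp_entropy <- function(q) {
  stopifnot(all(q >= 0 & q <= 0.5))
  # p * log2(p), with the p = 0 case set to its limit of 0
  term <- function(p) ifelse(p > 0, p * log2(p), 0)
  -(term(q) + term(1 - q))
}

# E score: the average entropy across m SNPs.
e_score <- function(maf) mean(snp_entropy(maf))

maf <- c(0.05, 0.20, 0.35, 0.50)
round(snp_entropy(maf), 3)  # entropy rises with MAF and reaches 1 at MAF = 0.5
e_score(maf)                # average over the four SNPs
```

A monomorphic SNP (MAF = 0) carries no information under this measure, whereas a SNP with MAF = 0.5 carries exactly one bit.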
The uniformness of SNPs on each chromosome or chromosomal segment is measured by the U score. Let there be \(n_t\) selected SNPs on a chromosomal segment and let \(δ_j\) be the spacing distance between SNP j and SNP (j+1), for j=1,…,(\(n_t-1\)). Furthermore, let there be the same number of perfectly uniformly-distributed “virtual” SNPs (PUD-VSNPs) on the same chromosomal segment, flanked by the first SNP and the last SNP. Let the spacing between two neighboring PUD-VSNPs, say j and (j+1), be denoted by \(τ_j\). Then, the U score is computed as the square root of the ratio of the mean squared spacing between adjacent PUD-VSNPs to that between adjacent selected SNPs. That is,
\(U=\sqrt{\frac{(1/(n_t-1))\sum_{j=1}^{n_t-1}τ_j^2}{(1/(n_t-1))\sum_{j=1}^{n_t-1}δ_j^2}}\) \(=\sqrt{\frac{\sum_{j=1}^{n_t-1}τ_j^2}{\sum_{j=1}^{n_t-1}δ_j^2}}\)
Note that U≤1 because \(\frac{1}{(n_t-1)}\sum_{j=1}^{n_t-1}τ_j^2 ≤ \frac{1}{(n_t-1)}\sum_{j=1}^{n_t-1}δ_j^2\).
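The inequality holds because both sets of spacings add up to the same total (the distance between the first and the last SNP), and for a fixed total the sum of squares is smallest when all spacings are equal. A minimal base R sketch of the U score follows; `u_score` is a hypothetical helper for illustration, not a function of the selectSNPs package.

```r
# U score of a set of selected SNP positions on one chromosomal segment.
# Hypothetical helper for illustration; not a selectSNPs function.
u_score <- function(pos) {
  pos <- sort(pos)
  n <- length(pos)
  stopifnot(n >= 2, pos[n] > pos[1])
  delta <- diff(pos)                               # spacings of the selected SNPs
  tau <- rep((pos[n] - pos[1]) / (n - 1), n - 1)   # spacings of the virtual SNPs
  sqrt(sum(tau^2) / sum(delta^2))
}

u_score(c(0, 10, 20, 30, 40))  # evenly spaced SNPs give U = 1
u_score(c(0, 1, 2, 3, 40))     # clustered SNPs give U of about 0.54
```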
Finally, a weighted local score is computed as follows:
\(f=w_1×h^{t_1}+w_2×u^{t_2}\)
In the above, h is a vector of E scores, u is a vector of U scores, \(t_1\) and \(t_2\) are tuning parameters (between 0 and 100), and \(w_1\) and \(w_2\) are the weights for h and u, respectively, subject to \(w_1+w_2=1\). The above is also referred to as the unified local function because it scales well between the E score and the U score, and it allows selecting SNPs under various scenarios. For example, letting \(w_1=0\) and \(w_2=1\) leads to a subset of uniformly-distributed SNPs; letting \(w_1=1\) and \(w_2=0\) leads to a subset of high-MAF SNPs; letting \(0<w_1<1\) and \(0<w_2<1\) allows selecting locally-optimal SNPs, depending on the relative values of the two weights. The tuning parameters \(t_1\) and \(t_2\) determine the strength of the shrinkage on the contributions of the E score and the U score, respectively. The larger the shrinkage on the E score (or U score), the less likely it is that a SNP with a low E (or U) value will be selected. When \(t_1=0\), all SNPs contribute equally through their E scores, thus leading to uniformly-distributed SNPs. Similarly, when \(t_2=0\), all SNPs contribute equally through their U scores, thus leading to a set of SNPs with the largest minor allele frequencies within each chromosome region.
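The effect of the weights and tuning parameters can be illustrated with a small base R sketch. It only evaluates the formula for f; how the package derives per-SNP U scores and searches for the optimum is not shown in this tutorial, so the function `unified_score` and the candidate values below are hypothetical.

```r
# Unified local score f = w1 * h^t1 + w2 * u^t2, with w2 = 1 - w1.
# Hypothetical helper for illustration; not a selectSNPs function.
unified_score <- function(h, u, w1 = 0.5, t1 = 1, t2 = 1) {
  stopifnot(w1 >= 0, w1 <= 1, length(h) == length(u))
  w2 <- 1 - w1
  w1 * h^t1 + w2 * u^t2
}

# Three made-up candidate SNPs competing within one local region
h <- c(a = 0.95, b = 0.60, c = 0.30)  # E scores
u <- c(a = 0.50, b = 0.80, c = 0.99)  # U scores

names(which.max(unified_score(h, u, w1 = 1)))            # "a": E score only
names(which.max(unified_score(h, u, w1 = 0)))            # "c": U score only
names(which.max(unified_score(h, u, w1 = 0.5)))          # "a": 0.725 vs 0.700 vs 0.645
names(which.max(unified_score(h, u, w1 = 0.5, t1 = 0)))  # "c": E scores no longer discriminate
```

Setting \(t_1=0\) turns every \(h^{t_1}\) into 1, so the ranking is decided by the U scores alone even though \(w_1\) is not zero.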
There are three basic data classes: Locus, Chrom, and Map. The Locus class represents a locus for a gene or a genetic marker. It has six slots: Name, Chromosome, Position, Maf, Type, and Status. There are three possible values for the Type slot: A, B, and C, corresponding to the three Normalization_Bins used by Illumina. Bin C assays include all Infinium II designs requiring a single bead type. Bin A and B assays are Infinium I designs in the red and green channels, respectively, and are classified into one of these two bins based on the color channel required to detect the target alleles across the two bead types used in the Infinium I assay. There are two values for Status: 1 for obligatory SNPs and 0 otherwise. The Chrom class represents a chromosome, which has six slots of the same names, each of them being a vector except the chromosome name (Chromosome). In what follows, we illustrate how to create a Locus object and a Chrom object.
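library(selectSNPs)
# Create a Locus object for a single, non-obligatory SNP (Status = 0)
snp1 <- new("Locus",
            Name = "SNP1",
            Chromosome = "1",
            Position = 600000,
            Maf = 0.20,
            Type = "C",
            Status = as.integer(0))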
str(snp1)
## Formal class 'Locus' [package "selectSNPs"] with 6 slots
## ..@ Name : chr "SNP1"
## ..@ Chromosome: chr "1"
## ..@ Position : num 6e+05
## ..@ Maf : num 0.2
## ..@ Type : chr "C"
## ..@ Status : int 0
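# Create a Chrom object holding 10 evenly spaced SNPs on chromosome 1.
# Maf is drawn at random, so the values printed below will differ between runs.
chr1 <- new("Chrom",
            Chromosome = "1",
            Name = paste("SNP", 1:10, sep = ""),
            Position = seq(600000, 11000000, length.out = 10),
            Maf = runif(10, 0, 0.5),
            Type = rep("C", 10),
            Status = as.integer(rep(0, 10)))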
str(chr1)
## Formal class 'Chrom' [package "selectSNPs"] with 6 slots
## ..@ Chromosome: chr "1"
## ..@ Name : chr [1:10] "SNP1" "SNP2" "SNP3" "SNP4" ...
## ..@ Position : num [1:10] 600000 1755556 2911111 4066667 5222222 ...
## ..@ Maf : num [1:10] 0.245 0.116 0.137 0.301 0.122 ...
## ..@ Type : chr [1:10] "C" "C" "C" "C" ...
## ..@ Status : int [1:10] 0 0 0 0 0 0 0 0 0 0
The Map class is defined as a list of Chrom objects; the simplest Map object has only one chromosome. In simulation, for example, one can simulate chromosomes one by one and then construct a Map from the list of simulated chromosomes. In practice, however, one does not have to create Chrom objects in order to build a Map object. Instead, map information can be read from any input file into a data frame, and this data frame can then be converted into a Map object using the “as.Map” function.
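As a minimal sketch of the data-frame route, the base R code below reads map information from inline CSV text into a data frame whose columns carry the six slot names. The SNP records are made up for illustration. Because the arguments of `as.Map` are not documented in this tutorial, the conversion call is an assumption and is left commented out.

```r
# Made-up map records; in practice these would come from an input file.
txt <- "Name,Chromosome,Position,Maf,Type,Status
SNP1,1,600000,0.20,C,0
SNP2,1,1755556,0.35,C,1
SNP3,2,450000,0.10,C,0"

# Read with explicit column types so that Chromosome stays a character
# vector and Status an integer vector, matching the slot types shown above.
map_df <- read.csv(text = txt, stringsAsFactors = FALSE,
                   colClasses = c("character", "character", "numeric",
                                  "numeric", "character", "integer"))
str(map_df)

# Assumed call (exact arguments not shown in this tutorial):
# map <- as.Map(map_df)
```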
This package includes a bovine 80K SNP Map object (“bov80K”) as the example dataset. It has information for 76,694 SNPs on 30 chromosomes, with chromosome X represented by 30. Please note that these Maf data were arbitrarily taken and their coincidence with any cattle breed is incidental. Also, the Type and the Status values are exact, and they are used for demonstration only. A summary of this Map object is shown below.
## chrom nLoci Min Max Length mu.bw sd.bw min.bw max.bw
## 1 1 4519 67130 158855123 158787993 35137.86 20232.68 0 535471
## 2 2 3921 35126 136908437 136873311 34907.76 18568.73 0 232039
## 3 3 3571 25683 123148964 123123281 34478.66 29876.78 0 1362367
## 4 4 3405 17112 120615269 120598157 35417.96 18215.00 0 207572
## 5 5 3540 31485 125058666 125027181 35318.41 67729.38 0 3882807
## 6 6 3412 36766 122509741 122472975 35894.78 56064.14 0 3072483
## 7 7 3218 33543 112610067 112576524 34983.38 27328.12 0 1099995
## 8 8 3253 20855 113367096 113346241 34843.60 20444.64 0 377911
## 9 9 3087 28755 105688974 105660219 34227.48 18202.74 0 143454
## 10 10 3019 23914 104253580 104229666 34524.57 20124.00 0 303409
## 11 11 3103 39943 107274061 107234118 34558.21 17884.17 0 172449
## 12 12 2672 29043 91131021 91101978 34095.05 25095.28 0 453945
## 13 13 2478 33877 84229982 84196105 33977.44 19901.47 0 269493
## 14 14 2502 74984 84628243 84553259 33794.27 23740.40 0 471231
## 15 15 2582 68422 85257312 85188890 32993.37 18406.15 0 226131
## 16 16 2473 68542 81688070 81619528 33004.26 21384.37 0 443546
## 17 17 2240 23517 75132928 75109411 33530.99 23123.19 0 509683
## 18 18 2058 11508 65978584 65967076 32053.97 22029.43 0 334671
## 19 19 2018 81083 64044783 63963700 31696.58 17950.74 0 168041
## 20 20 2233 76307 71986227 71909920 32203.28 17564.79 0 170417
## 21 21 2227 83766 71573501 71489735 32101.36 20580.05 0 271219
## 22 22 1896 138506 61378199 61239693 32299.42 17874.77 0 355005
## 23 23 1746 15894 52465632 52449738 30039.94 21406.10 0 501309
## 24 24 1939 101549 62643699 62542150 32254.85 17051.92 0 204124
## 25 25 1416 25945 42851121 42825176 30243.77 15300.12 0 149559
## 26 26 1618 132614 51680135 51547521 31858.79 18213.97 0 215466
## 27 27 1440 18888 45388171 45369283 31506.45 23640.28 0 534557
## 28 28 1489 18262 46224056 46205794 31031.43 16308.51 0 122929
## 29 29 1597 108487 51492483 51383996 32175.33 19241.20 0 373797
## 30 30 2022 29245 148805559 148776314 73578.79 78233.51 0 867961
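The columns of this summary can be read as the number of loci, the first and last positions, the chromosome length covered, and the mean, standard deviation, minimum, and maximum of the spacing between adjacent SNPs. In the table, mu.bw equals Length divided by nLoci (for chromosome 1, 158787993 / 4519 ≈ 35137.86) and min.bw is 0, which is consistent with a spacing vector that starts with 0 for the first SNP. The base R sketch below reproduces such a summary under that assumption; `summarize_chrom` is a hypothetical helper, and the package's own computation may differ.

```r
# Per-chromosome summary of SNP positions.
# Hypothetical helper for illustration; not a selectSNPs function.
summarize_chrom <- function(pos) {
  pos <- sort(pos)
  bw <- c(0, diff(pos))  # assumption: the first SNP contributes a spacing of 0
  data.frame(nLoci = length(pos),
             Min = min(pos), Max = max(pos),
             Length = max(pos) - min(pos),
             mu.bw = mean(bw), sd.bw = sd(bw),
             min.bw = min(bw), max.bw = max(bw))
}

# Made-up positions on two chromosomes
df <- data.frame(Chromosome = c("1", "1", "1", "2", "2"),
                 Position = c(100, 400, 1000, 50, 250))
do.call(rbind, lapply(split(df$Position, df$Chromosome), summarize_chrom))
```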
A Map object can be converted into a data frame, and vice versa. The latter also demonstrates a convenient way of building a Map object from a data frame.
## Name Chromosome Position Maf Type Status
## 1 BovineHD0100000024 1 67130 0.0000000 C 0
## 2 BovineHD0100000035 1 120183 0.3265306 C 0
## 3 Hapmap43437-BTA-101873 1 135098 0.3945578 C 0
## 4 BovineHD0100000048 1 158820 0.3401361 C 0
## 5 BovineHD0100000057 1 183040 0.2857143 C 0
## 6 BovineHD0100000064 1 208728 0.3809524 C 0
## Formal class 'Map' [package "selectSNPs"] with 1 slot
## ..@ .Data:List of 3
## .. ..$ :Formal class 'Chrom' [package "selectSNPs"] with 6 slots
## .. .. .. ..@ Chromosome: chr "1"
## .. .. .. ..@ Name : chr [1:4519] "BovineHD0100000024" "BovineHD0100000035" "Hapmap43437-BTA-101873" "BovineHD0100000048" ...
## .. .. .. ..@ Position : num [1:4519] 67130 120183 135098 158820 183040 ...
## .. .. .. ..@ Maf : num [1:4519] 0 0.327 0.395 0.34 0.286 ...
## .. .. .. ..@ Type : chr [1:4519] "C" "C" "C" "C" ...
## .. .. .. ..@ Status : int [1:4519] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..$ :Formal class 'Chrom' [package "selectSNPs"] with 6 slots
## .. .. .. ..@ Chromosome: chr "10"
## .. .. .. ..@ Name : chr [1:3019] "BovineHD1000000003" "ARS-BFGL-NGS-6048" "BovineHD1000000010" "BovineHD1000000015" ...
## .. .. .. ..@ Position : num [1:3019] 23914 41893 79479 124115 141162 ...
## .. .. .. ..@ Maf : num [1:3019] 0.415 0.104 0 0.473 0.386 ...
## .. .. .. ..@ Type : chr [1:3019] "C" "C" "C" "C" ...
## .. .. .. ..@ Status : int [1:3019] 0 0 0 0 0 0 0 0 0 0 ...
## .. ..$ :Formal class 'Chrom' [package "selectSNPs"] with 6 slots
## .. .. .. ..@ Chromosome: chr "15"
## .. .. .. ..@ Name : chr [1:2582] "BovineHD1500000006" "BovineHD1500000009" "BovineHD1500000021" "BovineHD1500000046" ...
## .. .. .. ..@ Position : num [1:2582] 68422 80300 156662 236845 267912 ...
## .. .. .. ..@ Maf : num [1:2582] 0.279 0.136 0.136 0.306 0.136 ...
## .. .. .. ..@ Type : chr [1:2582] "C" "C" "C" "C" ...
## .. .. .. ..@ Status : int [1:2582] 0 0 0 0 0 0 0 0 0 0 ...
Using the unified local function, one can select a subset of SNPs by giving different values to \(w_1\) and \(w_2\), depending on the scenario for the low-density chip.
In the following, we select 2000 locally-optimal SNPs by letting \(w_1 = 0.5\) and \(w_2 = 0.5\). Note that the weights can be set differently, subject to \(w_1 + w_2 = 1\).
## Selecting SNPs by chromosomes ......
## 1 1 4519 119 119 ......
## 2 2 3921 103 103 ......
## 3 3 3571 93 93 ......
## 4 4 3405 91 91 ......
## 5 5 3540 94 94 ......
## 6 6 3412 92 92 ......
## 7 7 3218 85 85 ......
## 8 8 3253 85 85 ......
## 9 9 3087 79 79 ......
## 10 10 3019 78 78 ......
## 11 11 3103 81 81 ......
## 12 12 2672 68 68 ......
## 13 13 2478 63 63 ......
## 14 14 2502 63 63 ......
## 15 15 2582 64 64 ......
## 16 16 2473 61 61 ......
## 17 17 2240 56 56 ......
## 18 18 2058 49 49 ......
## 19 19 2018 48 48 ......
## 20 20 2233 54 54 ......
## 21 21 2227 54 54 ......
## 22 22 1896 46 46 ......
## 23 23 1746 39 39 ......
## 24 24 1939 47 47 ......
## 25 25 1416 32 32 ......
## 26 26 1618 38 38 ......
## 27 27 1440 34 34 ......
## 28 28 1489 34 34 ......
## 29 29 1597 38 38 ......
## 30 30 2022 112 112 ......
## chrom nLoci Min Max Length mu.bw sd.bw min.bw max.bw
## 1 1 119 67130 158855123 158787993 1334353 191003.5 0 1674693
## 2 2 103 35126 136908437 136873311 1328867 205977.6 0 1723685
## 3 3 93 25683 123148964 123123281 1323906 214869.9 0 1686917
## 4 4 91 17112 120615269 120598157 1325254 218750.4 0 1710898
## 5 5 94 31485 125058666 125027181 1330076 385842.4 0 3882807
## 6 6 92 36766 122509741 122472975 1331228 327096.9 0 3098609
## 7 7 85 33543 112610067 112576524 1324430 225257.9 0 1590326
## 8 8 85 20855 113367096 113346241 1333485 255233.1 0 1862202
## 9 9 79 28755 105688974 105660219 1337471 220194.7 0 1695823
## 10 10 78 23914 104253580 104229666 1336278 256542.5 0 1791934
## 11 11 81 39943 107274061 107234118 1323878 220572.8 0 1667816
## 12 12 68 29043 91131021 91101978 1339735 251360.3 0 1867156
## 13 13 63 33877 84229982 84196105 1336446 233912.8 0 1653434
## 14 14 63 74984 84628243 84553259 1342115 259103.5 0 1772625
## 15 15 64 68422 85257312 85188890 1331076 250314.3 0 1840177
## 16 16 61 68542 81688070 81619528 1338025 260168.5 0 1735964
## 17 17 56 23517 75132928 75109411 1341239 250309.2 0 1655978
## 18 18 49 11508 65978584 65967076 1346267 279663.6 0 1676967
## 19 19 48 81083 64044783 63963700 1332577 272953.6 0 1638613
## 20 20 54 76307 71986227 71909920 1331665 277971.0 0 1682367
## 21 21 54 83766 71573501 71489735 1323884 254895.4 0 1662699
## 22 22 46 138506 61378199 61239693 1331298 276348.9 0 1643236
## 23 23 39 15894 52465632 52449738 1344865 292536.2 0 1619427
## 24 24 47 101549 62643699 62542150 1330684 282062.2 0 1717716
## 25 25 32 25945 42851121 42825176 1338287 327191.3 0 1700183
## 26 26 38 132614 51680135 51547521 1356514 311446.1 0 1730693
## 27 27 34 18888 45388171 45369283 1334391 299555.3 0 1561287
## 28 28 34 18262 46224056 46205794 1358994 313161.3 0 1697944
## 29 29 38 108487 51492483 51383996 1352210 315076.5 0 1695469
## 30 30 112 29245 148805559 148776314 1328360 280646.1 0 1989342
If we let \(w_1 = 0\) and \(w_2 = 1\), it is equivalent to selecting SNPs based solely on the U scores, thus leading to a set of uniformly-distributed SNPs.
## Selecting SNPs by chromosomes ......
## 1 1 4519 119 119 ......
## 2 2 3921 103 103 ......
## 3 3 3571 93 93 ......
## 4 4 3405 91 91 ......
## 5 5 3540 94 94 ......
## 6 6 3412 92 92 ......
## 7 7 3218 85 85 ......
## 8 8 3253 85 85 ......
## 9 9 3087 79 79 ......
## 10 10 3019 78 78 ......
## 11 11 3103 81 81 ......
## 12 12 2672 68 68 ......
## 13 13 2478 63 63 ......
## 14 14 2502 63 63 ......
## 15 15 2582 64 64 ......
## 16 16 2473 61 61 ......
## 17 17 2240 56 56 ......
## 18 18 2058 49 49 ......
## 19 19 2018 48 48 ......
## 20 20 2233 54 54 ......
## 21 21 2227 54 54 ......
## 22 22 1896 46 46 ......
## 23 23 1746 39 39 ......
## 24 24 1939 47 47 ......
## 25 25 1416 32 32 ......
## 26 26 1618 38 38 ......
## 27 27 1440 34 34 ......
## 28 28 1489 34 34 ......
## 29 29 1597 38 38 ......
## 30 30 2022 112 112 ......
Likewise, if we let \(w_1 = 1\) and \(w_2 = 0\), it is equivalent to selecting SNPs based solely on the E scores, thus leading to a subset of SNPs with the highest minor allele frequencies in local chromosome regions.
## Selecting SNPs by chromosomes ......
## 1 1 4519 119 119 ......
## 2 2 3921 103 103 ......
## 3 3 3571 93 93 ......
## 4 4 3405 91 91 ......
## 5 5 3540 94 94 ......
## 6 6 3412 92 92 ......
## 7 7 3218 85 85 ......
## 8 8 3253 85 85 ......
## 9 9 3087 79 79 ......
## 10 10 3019 78 78 ......
## 11 11 3103 81 81 ......
## 12 12 2672 68 68 ......
## 13 13 2478 63 63 ......
## 14 14 2502 63 63 ......
## 15 15 2582 64 64 ......
## 16 16 2473 61 61 ......
## 17 17 2240 56 56 ......
## 18 18 2058 49 49 ......
## 19 19 2018 48 48 ......
## 20 20 2233 54 54 ......
## 21 21 2227 54 54 ......
## 22 22 1896 46 46 ......
## 23 23 1746 39 39 ......
## 24 24 1939 47 47 ......
## 25 25 1416 32 32 ......
## 26 26 1618 38 38 ......
## 27 27 1440 34 34 ......
## 28 28 1489 34 34 ......
## 29 29 1597 38 38 ......
## 30 30 2022 112 112 ......
Wu XL, Li H, Ferretti R, Simpson B, Walker J, Parham J, Mastro L, Qiu J, Schultz T, Tait RG Jr, Bauck S. A unified local objective function for optimally selecting SNPs on arrays for agricultural genomics applications. Anim Genet. 2020 (accepted).
Wu XL, Xu J, Feng G, Wiggans GR, Taylor JF, He J, Qian C, Qiu J, Simpson B, Walker J, Bauck S. Optimal design of low-density SNP arrays for genomic prediction: algorithm and applications. PLoS One. 2016;11(9):e0161719. doi: 10.1371/journal.pone.0161719.