Introduction

The quantitative dissection of complex traits into underlying genetic components has been the stated goal of many generations of quantitative geneticists (Falconer 1989). Recently, increased availability of molecular markers combined with enhanced statistical analysis techniques has given quantitative geneticists new tools. One simple approach to achieve the goal is to use quantitative trait locus (QTL) detection methods that exploit phenotypic and molecular marker data collected in designed bi-parental mapping populations of large size (Boer et al. 2007). However, such an approach has a serious limitation in that it explores only a small fraction of the genetic variance available in the reference population from which the two parents of the bi-parental mapping population are sampled. Additionally, analyses of mapping populations from different parents for the same trait can give inconsistent estimates of QTL positions and effect sizes (Beavis et al. 1991). QTL analysis of connected populations has been advocated as an alternative to increase the amount of genetic variability accounted for in the statistical model (Bink et al. 2002; Blanc et al. 2006). This approach is also expected to yield more consistent QTL mapping results. However, a common assumption in this approach is that the parents of the connected populations are unrelated and thus can be treated as independent (Blanc et al. 2006; Crepieux et al. 2005; Fang et al. 2011; Hayashi and Iwata 2009). While this assumption is convenient from the standpoint of the statistical analysis, it does not reflect the reality of most breeding programs and leads to loss of power in QTL estimation when the parents are in fact related.

The mapping resolution in QTL linkage studies depends, among other factors, on the number of meioses events accounted for in the statistical model. Therefore, accounting for meioses events that occurred in the ancestors of the parents of the current mapping population should be beneficial in the detection of QTL and the precise placement of these QTL on the genetic map. Meuwissen and Goddard (2000) proposed such methodology for the precise mapping of loci affecting quantitative traits. This methodology combines linkage and linkage-disequilibrium information where the latter is a function of the historical/ancestral recombinations. While Meuwissen and Goddard (2000) made use of population genetics theory to model linkage disequilibrium, an alternative scenario is possible when highly accurate pedigree and marker information exists for the recent ancestral generations of the parents of connected populations. This is especially true when the ancestral pedigree has been genotyped for many more genetic markers than the connected populations to be used for QTL mapping. However, explicit inclusion of the marker and pedigree information collected on ancestors into the dataset to be analyzed can create a significant missing marker data problem which requires significant imputation efforts when used for mapping experiments of a size typical for breeding programs working with commercial elite germplasm. Instead of including the high-density genotyped ancestors themselves into the statistical analysis, we propose an approach that collapses the marker and pedigree information from the ancestors into parental identity by descent (PIBD) information.

This article presents a novel Bayesian approach to combine PIBD information into a QTL linkage analysis framework. The PIBD information pertains to the parents of the connected populations that form the analysis dataset and this information may be obtained in various ways. However, it is assumed that this information is in the form of an IBD matrix specific to a particular genomic position. Here we extend the Bayesian hierarchical framework of Bink et al. (2008) to allow for latent ancestral alleles that are derived from the locus-specific PIBD matrices. The approach is empirically tested using simulated phenotypic and marker data conditional on a pedigree specific to a maize mapping population. Extensions and implications to other QTL mapping experiments are discussed.

Materials and methods

In conventional QTL linkage mapping it is common to assume independence among the parental alleles of the mapping population(s) (Blanc et al. 2006; Crepieux et al. 2005; Fang et al. 2011; Hayashi and Iwata 2009). Here, this assumption is replaced by allowing for putative dependencies between the parents based on ancestral pedigree and marker information.

In the description of the methodology and implementation, we will concentrate on mapping populations containing inbred lines, i.e., individuals are homozygous at all loci. Consequently, we may use the terms allele, haplotype, and individual synonymously. The theoretical concepts presented in this article can be readily adapted for outbred populations.

Consider a set of n j multiple mapping populations (J 1 − J m) that have a connectedness through a set of parents (I 1 − I n) as shown in Fig. 1a. We adopt a bi-allelic additive QTL model where Q (q) denotes the allele that increases (decreases) the quantitative trait value. The frequency of allele Q is denoted by p for which we assume a uniform prior distribution between 0 and 1. Furthermore, let vector λ denote the positions of all QTL in the model, where the number of putative QTL (N QTL) is treated as a random variable in our Bayesian approach and the prior distribution of the positions of the putative QTL along the marked genome is assumed uniform and continuous.

Fig. 1
figure 1

A graphical representation of QTL genotypes (QQ and qq) in an inbred pedigree of six connected mapping populations (J 1J 6) derived from crosses between unrelated inbred parents (I 1I 6). a provides the actual genotypes whereas b is a representation using 0/1 segregation indicators to describe parental origin in the mapping offspring. c shows the corresponding directed acyclic graphs for unrelated mapping parents with founder alleles (G)

Three types of data are available, i.e., phenotypic trait data (Y T ), low-density marker data on mapping populations (Y M ), and parental IBD data (Y D ). The parental IBD data are IBD probabilities among the parents of the mapping populations, available as symmetric matrices Q for a set of n Q positions along the genome. At each genomic position an element Q ij of Q is the probability that parents I i and I j are IBD.

Modeling QTL genotypes

Let n i denote the number of parents, n j the number of mapping populations (crosses), n o [j] the number of offspring in the jth mapping population, and n a the number of ancestral alleles. Then, we denote G as the (n i  × N QTL) matrix of parental alleles with \( i = 1,\; \ldots ,\;n_{i} )\) and \( qtl = 1,\; \ldots ,\;N_{{{\text{QTL}}}}\). Similarly, let S denote the (n O  × N QTL) matrix of segregation (or meiosis) indicators (Donnelly 1983; Lander and Green 1987), where n O is the total number of offspring across all mapping populations, i.e., \( n_{O} = \sum\nolimits_{j = 1}^{{n_{j} }} {n_{o} [j]} \); let C denote the (n i  × N QTL) matrix of ancestral class indicators; and let A denote the (n a × N QTL) matrix of ancestral alleles. Finally, for ease of readability we will suppress the subscripts pertaining to qtl and describe the concept as if only one QTL is assumed.

Original framework used to model QTL genotypes

In the original framework of Bink et al. (2008) the parental genotypes are assumed to be unrelated or independent (Fig. 1c). We denote this model as UNR in the remainder of this study. The prior distribution for the parental genotypes at a QTL is

$$ P\left( {{\mathbf{G}}\left| p \right.} \right) = \prod\limits_{i = 1}^{{n_{i} }} {P\left( {g_{i} \left| p \right.} \right)} $$
(1)

where g i denotes the genotype of parent i and p denotes the allele frequency at a QTL. The QTL genotypes of the offspring in the mapping populations (J 1 − J m) are defined as functions of the parental genotypes and segregation (meiosis) indicators (Donnelly 1983; Lander and Green 1987). For an inbred offspring o, the interpretation of a binary segregation indicator \( s_{\text{o}} \) is that 0 (1) pertains to the 1st (2nd) inbred parent of that mapping population (Fig. 1b). The prior distribution for the QTL segregation indicators of all offspring of all mapping populations is

$$ P\left( {{\mathbf{S}}\left| {{{\uplambda}},{\mathbf{Y}}_{\varvec{M}} } \right.} \right) = \prod\limits_{j = 1}^{{n_{j} }} {\prod\limits_{o = 1}^{{n_{o} [j]}} {P\left( {s_{o} \left| {{{\uplambda}},lfm_{o} ,rfm_{o} } \right.} \right)} } $$
(2)

where n o [j] is the number of offspring from population j and Y M pertains to marker data and lfm (rfm) denotes the left-sided (right-sided) informative flanking markers for inbred offspring o.

Framework used to model dependence of parental QTL genotypes—ancestral alleles

In contrast to the original framework of Bink et al. (2008) (Fig. 1), we now consider ancestral relationships in modeling QTL alleles of the parents of the mapping population (Fig. 2). In Fig. 2a an example of a complete ancestral pedigree is depicted using an assigned descent path of QTL genotypes from ancestors to the parents of the mapping population. Alternatively, if the ancestral pedigree is not modeled explicitly the QTL genotypes in the parents are assumed to be copies of anonymous ancestral alleles (Fig. 2b). The ancestral alleles are anonymous as the ancestral pedigree is not modeled explicitly in the proposed approach. For known pedigrees, it may be possible to link the ancestral alleles to real ancestors in the recorded pedigree (ter Braak et al. 2010). In our example, this holds for ancestor alleles A 1, A 2, and A 3 (Fig. 2b). Furthermore, the number of ancestral alleles (n a ) is considered random and may vary at different loci along the genome. We denote the identity of these ancestral copies as observed in the parents of the mapping population via ancestral class indicators I, II, etc. (Fig. 2b). The number of ancestral alleles equals three in the example in Fig. 1 where ancestral allele I, II, and III have, respectively, 2, 1, and 3 copies present in the parental genotypes. For an inbred parent i, the interpretation of an integer ancestral class indicator \( c_{i} \) is that its value a, \( a = 1,\; \ldots ,\;n_{a}\), pertains to the ath ancestral allele. Analogous to the segregation indicators, S, in (2), the prior distribution for the ancestral class indicators is

$$ P\left( {{\mathbf{C}}\left| {{{\uplambda}},{\mathbf{Y}}_{D} } \right.} \right) = \prod\limits_{i = 1}^{{n_{i} }} {P\left( {c_{i} \left| {{{\uplambda}},Y_{D} (l),Y_{D} (r)} \right.} \right)} $$
(3)

where Y D (l),(Y D (r)) denotes the immediate left (right) flanking PIBD information. The additional source of information Y D is used to account for locus-specific relationships among parents as generated by the ancestral pedigree and marker information. It will be discussed in detail below. Finally, the prior distribution for ancestral QTL alleles is

$$ P\left( {{\mathbf{A}}\left| p \right.} \right) = \prod\limits_{l = 1}^{{n_{a} }} {P\left( {a_{l} \left| p \right.} \right)} $$
(4)

where a l is the allele of ancestor l. This prior is similar to the prior of Eq. 1 of the parental genotypes in the original model of Bink et al. (2008), namely it assumes independence among ancestral alleles (Fig. 2a).

Fig. 2
figure 2

A graphical representation of an known ancestral pedigree with known transmission of QTL alleles (red arrows) is included in (a). This same information is condensed in (b) using assignments (I–III) linking the alleles of the mapping parents to the ancestral alleles (A 1A 6). Note that ancestral alleles may or may not coincide with real ancestors in the pedigree. The assignment to ancestral alleles is based on the TIBD classification of Table 1. c shows the corresponding Directed Acyclic Graphs for related mapping parents. Parent IBD data (an array of Q-matrices) to account for relatedness by replacing the founder alleles (G) by Ancestor alleles (A) and Class indicators (P)

Models for parental identity by descent (PIBD) data

The locus-specific PIBD matrices jointly form the PIBD data (denoted Y D , Fig. 2); thus Y D represents an array of Q matrices. An example of such a Q matrix is presented in Table 1A. These IBD probabilities cannot be used directly in the Bayesian sampling algorithms of Bink et al. (2008). However, two alternative models have been suggested to capture the information provided by these data (ter Braak et al. 2010). We discuss their implementation in a Bayesian framework below.

Table 1 Numerical example of IBD probability matrix Q among six inbred individuals (A) and the corresponding probability matrix P for the latent ancestor model (LAAM) (B). The IBD status matrix Q TIBD (C) and the corresponding ancestor assignments (P TIBD) for the threshold model (D), where the IBD status of pair I4–I6 has been adjusted for reason of transitivity. The assignments correspond to the example given in Fig. 1

Threshold IBD model (TIBD)

This first model uses a threshold on the IBD probabilities, for example a threshold value of 0.80 and is denoted as the TIBD model. If the IBD probability of two inbred individuals (i and j) sharing the same ancestral allele exceeds this threshold, the alleles are assumed to have the same ancestral allele and their IBD probability is substituted with a value of 1.0, i.e.,

$$ \begin{array}{ll} Q_{{T_{{I{\text{BD}}}} ,ij}} = 0&{\text{if }} {Q_{ij} < T_{\text{IBD} } } \\ {Q_{{T_{\text{IBD}} ,ij}} = 1} & {\text{otherwise}}\end{array} $$
(5)

Values below the threshold are replaced by 0 unless a transitivity problem arises (ter Braak et al. 2010). In case of a transitivity problem, the threshold is locally lowered to create consistency in IBD patterns (Table 1C). This threshold-based approach results in a crisp 0/1 matrix; for example the TIBD model for the example in Table 1A yields three ancestral allele classes (Table 1D). The inbreds I 1 and I 2 are copies of ancestral allele A 1, inbreds I 3, I 4, and I 6 are copies of ancestral allele A 3, and inbred I 5 is a copy of ancestral allele A 2. Note that the IBD probability between inbreds I 4 and I 6 is below the threshold but set to 1 because of the transitivity rule.

The position of a putative QTL is assumed to be continuous along the genome and can thus be in between positions at which PIBD data are available. The Q matrix pertaining to the putative position λqtl may be calculated as the weighted average of the Q matrices at two flanking positions λl and λr, (cf., Eq. 3)

$$ {\mathbf{Q}}_{{{{\uplambda}}_{\text{qtl}} }} = {{\left( {\left( {{{\uplambda}}_{\text{qtl}} - {{\uplambda}}_{\text{l}} } \right){\mathbf{Q}}_{l} + \left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{qtl}} } \right){\mathbf{Q}}_{r} } \right)} \mathord{\left/ {\vphantom {{\left( {\left( {{{\uplambda}}_{\text{qtl}} - {{\uplambda}}_{\text{l}} } \right){\mathbf{Q}}_{l} + \left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{qtl}} } \right){\mathbf{Q}}_{r} } \right)} {\left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{l}} } \right)}}} \right. \kern-\nulldelimiterspace} {\left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{l}} } \right)}} $$
(6)

This implies that matrix Q λ is fixed at any position along the genome and consequently the class indicators (P TIBD) are also fixed in this threshold model. The calculation of Q λ is performed at every sampling step in the Markov chain Monte Carlo (MCMC) simulation. The application of the weighted Q matrix (Eq. 6) to some initial datasets led to spurious results due to erroneous clustering of individuals based on the averaged probabilities. Another approach is to sample the Q matrix for a putative position λqtl between two flanking positions λl and λr as follows:

$$ {\mathbf{Q}}_{{{{\uplambda}}_{\text{qtl}} }} = \left\{ {\begin{array}{*{20}c} {{\mathbf{Q}}_{\text{l}} } & {{\text{with}}\quad {\text{Prob}} = \left( {{{\uplambda}}_{\text{qtl}} - {{\uplambda}}_{\text{l}} } \right)/\left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{l}} } \right)} \\ {{\mathbf{Q}}_{\text{r}} } & {{\text{with}}\quad {\text{Prob}} = \left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{qtl}} } \right)/\left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{l}} } \right)} \\ \end{array} } \right. $$
(7)

This sampling approach seems more robust as it does not suffer the problem that occurs in the weighted-average implementation (Eq. 6). In addition, a computational advantage of the sampling-based implementation is that the PIBD matrices can be processed once prior to the MCMC simulation to cluster individuals, given the threshold value, and thus have crisp 0/1 matrix results available for all PIBD positions instead of the original IBD probability matrices.

Latent Ancestor Allele Model (LAAM)

This approach starts with a simple model with K disjoint latent classes in which each parent belongs to precisely one latent class, i.e., one latent ancestor allele. We extend this model with probabilities. Let P be an n × K matrix with elements p ik being the probability that parent i belongs to class k.

$$ p_{ik} \ge 0 \;{\text{and}} \sum\limits_{k = 1}^{K} {p_{ik} } = 1 \left( {i \, = 1, \ldots ,n;k \, = 1, \ldots ,K} \right). $$
(8)

By drawing the class memberships for each parent i from the ith row of P independently, the probability that parents i and j fall in the same class is

$$ q_{ij}^{*} = \sum\limits_{k = 1}^{K} {P(i \in {\text{class}}(k) \wedge j \in {\text{class}}(k)) = } \sum\limits_{k = 1}^{K} {p_{ik} p_{jk} \quad \forall i \ne j} . $$
(9)

We have recently proposed several algorithms to find a matrix P such that \( q_{ij}^{*} \) is close to the observed Q ij for all i ≠ j (ter Braak et al. (2009, 2010). For the Q matrix in Table 1A, the P matrix with five latent ancestor allele classes that corresponds with PIBD matrix Q (zero RMSE between Q and Q *) is also given (Table 1B).

In the Bayesian algorithm we need 0/1 matrices such as the one in Table 1D, expressing a unique assignment of each parent to a single ancestral allele. We can sample such matrices from this prior model by sampling for each parent i independently its ancestral allele (class membership) according to its (row-wise) probabilities \( \left\{ {p_{ik} ,k = 1,..,K} \right\} \) in Table 1B.

The P LAAM matrix pertaining to position λqtl may be calculated from the matrix Q λ, where Q λ is obtained from Eq. 6. However, this is often computationally demanding within the MCMC simulation and the weighted-average Q λ (6) may lead to unreliable analysis results. Similar to the sampling of Q λ for P TIBD, the approach taken to compute the P LAAM matrices follows (7) for all n Q positions along the genome, i.e., a finite number, prior to the actual MCMC simulation. Let P l and P r be the P matrices pertaining to the flanking positions λl and λr. Then, the P matrix at the QTL position λ qtl

$$ {\mathbf{P}}_{{\lambda_{\text{qtl}} }} = \left\{ {\begin{array}{*{20}c} {{\mathbf{P}}_{\text{l}} } & {{\text{with}}\quad {\text{Prob}} = \left( {{{\uplambda}}_{\text{qtl}} - {{\uplambda}}_{\text{l}} } \right)/\left( {{{\uplambda}}_{r} - {{\uplambda}}_{\text{l}} } \right)} \\ {{\mathbf{P}}_{\text{r}} } & {{\text{with}}\quad {\text{Prob}} = \left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{qtl}} } \right)/\left( {{{\uplambda}}_{\text{r}} - {{\uplambda}}_{\text{l}} } \right)} \\ \end{array} } \right. $$
(10)

Sampling from Pl and Pr with these probabilities yields precisely the average IBD probabilities Qλ of Eq. 6 if Pl and Pr perfectly fit Ql and Qr, respectively (see “Appendix A”). In the case of a perfect fit, the sampling approach of Eq. 10 is therefore equivalent to that of calculating the P matrix corresponding to Q λ (6). In the case of a non-perfect fit, the two approaches are almost equivalent. Note that the latter sampling approach is not the same as sampling from an average of P l and P r.

Effective number of latent ancestors

The prior model introduces correlation among the alleles of the parents because parents with similar rows in P are likely to be assigned as offspring from the same latent ancestor and will thus receive more often the same allele than under the independence model. Thus, ter Braak et al. (2010) also propose to use the effective number of latent classes as a measure for genetic diversity (see “Appendix B”).

Bayesian Markov chain Monte Carlo QTL analysis

The utilization of PIBD data adds a new layer in the Bayesian hierarchical framework described by Bink et al. (2008) as presented in Fig. 2. We now present the linear model for the phenotypes and the joint posterior distribution of all random variables.

Data likelihood

The probability model for the trait phenotypes (Y T) is assumed to be

$$ P\left( {{\mathbf{Y}}_{\text{T}} \left| {\mu ,N_{\text{QTL}} ,{\mathbf{W}},\alpha } \right.} \right) = N\left( {{\mathbf{1}}\mu + \sum\limits_{k = 1}^{{N_{\text{QTL}} }} {{\mathbf{W}}_{k} \alpha_{k} } ,\sigma_{e}^{2} } \right), $$
(11)

where 1 is a unity vector pertaining to an overall mean effect (\( \mu \)) and \( \sigma_{e}^{2} \) is the residual variance for the trait of interest; the incidence matrix W (see also “Appendix C”) pertains to the QTL genotypes with additive effects α. Treating the number of bi-allelic QTL (N QTL) as a random variable, the number of columns and the length of vector α are a priori unknown. Let \( \theta = \{ \mu ,\alpha ,\lambda ,{\mathbf{p}},{\mathbf{q}},\sigma_{e}^{2} \} \), i.e., the set of location and dispersion variables given a particular number of QTL in the model. Vector q comprises the frequencies of marker alleles, which are included in the prior on the marker haplotypes (H) as shown in Fig. 2. Further details and prior assumptions on the linear model are as described by Bink et al. (2008) and are omitted here.

Joint posterior distribution

The probability model for the phenotypes and the prior distributions yield the joint posterior distribution. Let Y P and Y M denote the pedigree and marker data, respectively. The joint posterior distribution of all unknowns is now written as,

$$ \begin{gathered} P\left( {\theta ,N_{\text{QTL}} ,{\mathbf{W}}\left| {\mathbf{Y}} \right._{T} ,{\mathbf{Y}}_{P} ,{\mathbf{Y}}_{M} } \right) \propto P\left( {{\mathbf{Y}}_{T} \left| {\theta ,N_{\text{QTL}} ,{\mathbf{W}}} \right.} \right)P\left( {{\mathbf{Y}}_{P} \left| {\mathbf{H}} \right.,{\mathbf{W}}} \right)P\left( {{\mathbf{Y}}_{M} \left| {\mathbf{H}} \right.} \right) \hfill \\ \quad \, \times P\left( {{\mathbf{W}}\left| {p,\lambda ,{\mathbf{H}}} \right.} \right)P\left( {{\mathbf{H}}\left| q \right.} \right)P\left( {\theta \left| {N_{\text{QTL}} } \right.} \right)P\left( {N_{\text{QTL}} } \right) \hfill \\ \end{gathered} $$
(12)

Note that the incidence matrix W is fully determined by variables A, C, and S (“Appendix C”). For MCMC simulation, the sampling distributions of the random variables are derived from this joint posterior distribution by conditioning on all other variables. These conditional distributions are as described by Bink et al. (2008), except for the QTL ancestor class indicator variables (C) that will be presented in the following.

Posterior conditional distributions of ancestral class indicators

In the Markov chain simulation the sampling distribution of the ancestral class indicators is derived by treating all other random variables in the joint posterior distribution as fixed. The posterior (12) then reduces to multiplying the likelihood function and prior on class indicators (3),

$$ P\left( {{\mathbf{C}}\left| {\lambda ,} \right.{\mathbf{Y}}_{T} ,{\mathbf{Y}}_{F} } \right) \propto P\left( {{\mathbf{Y}}_{T} \left| {\theta ,N_{\text{QTL}} ,{\mathbf{A}},{\mathbf{C}},{\mathbf{S}}} \right.} \right)P\left( {{\mathbf{C}}\left| {\lambda ,{\mathbf{Y}}_{F} } \right.} \right) $$
(13)

In the TIBD model with the weighted Q (6) this sampling distribution is actually deterministic because of the conditioning on λ. Consequently, the ancestral class indicators are updated jointly with the position of the QTL. In the TIBD model with the sampled Q (7) we follow the same approach as presented below for the LAAM model.

In the LAAM model the sampling distribution of C is stochastic. We have implemented a Metropolis–Hastings updating algorithm for the joint sampling distribution of C and λ, i.e.,

$$ P\left( {\lambda ,{\mathbf{C}}\left| {{\mathbf{Y}}_{F} } \right.,{\mathbf{Y}}_{T} } \right) \propto P\left( {{\mathbf{Y}}_{T} \left| {\theta ,N_{\text{QTL}} ,{\mathbf{A}},{\mathbf{G}},{\mathbf{S}}} \right.} \right)P\left( {{\mathbf{G}}\left| {\lambda ,{\mathbf{Y}}_{T} } \right.} \right)P\left( \lambda \right) $$
(14)

where the algorithm starts with a normal random walk proposal distribution for λ and then proposes a C’ from the prior as follows. With the probabilities given in (10) we sample either the left-flanking matrix P l or the right-flanking matrix P r of the new position and call the resulting matrix P with elements \( \left\{ {p_{ik} } \right\}. \) We then sample for each parent i independently its ancestral allele (class membership) according to its (row-wise) probabilities \( \left\{ {p_{ik} ,k = 1,..,K} \right\}. \) The allele that is then proposed is the allele assigned to the ancestral class. Because we sample C from the prior and because the prior of λ is uniform, the acceptance ratio for the proposed values is then simply a ratio of the likelihoods of the current and proposed values.

The method of updating of the alleles of the ancestral classes is identical to that for updating alleles of parents in the model were parents are unrelated (UNR), which is Gibbs sampling. Note that the alleles of the parents are correlated in the TIBD and LAAM models because of the P matrix. Models TIBD and LAAM thus shift the independence assumption upward in the pedigree structure, namely from the parents to the latent ancestors.

Markov chain Monte Carlo simulation and posterior inference

The calculation of the above joint posterior distribution is analytically intractable, and we apply computer-intensive MCMC simulation (Gilks et al. 1996) to obtain draws from the joint posterior distribution. Different MCMC sampling algorithms are used, i.e., the Gibbs sampler (Gelman et al. 1995) when the conditional sampling distribution has a recognizable kernel and can directly be sampled from, and the Metropolis–Hastings algorithm (Gelman et al. 1995) when the conditional distribution cannot be sampled from directly. The sampling of ancestral class indicators under our new models TIBD and LAAM has been detailed above. To allow changes in model dimension, i.e., to increase or decrease the number of QTL in the model, we use the reversible jump MCMC method (Green 1995), similar to previous QTL model selection studies (Bink et al. 2002; Heath 1997; Sillanpaa and Arjas 1998). For each model we performed a Markov chain simulation of 500,000 (200,000) cycles for the pilot dataset (replicated datasets) and stored every 200th sample for posterior inference.

For all three models (UNR, TIBD, and LAAM), three values (1, 3, and 5) are evaluated for the mean of the Poisson distribution being the prior on the N QTL in the analyses of our simulated data. The stored draws from the joint posterior distribution were used for posterior inference on the variables of interest, most importantly the characteristics of QTL (number, position, size, genotypes). A linkage group was divided into 1-centiMorgan (cM) bins and the number of QTL per bin per cycle was used to calculate the posterior QTL intensity (Sillanpaa and Arjas 1998).

For model selection in the pilot dataset we used Bayes factors (Kass 1993; Kass and Raftery 1995) as a measure of evidence coming from the data for different QTL models. More precisely, we used the statistic \( 2 \times \ln \left( {BF} \right) \) that scales similar to a LOD score test statistic (Kass and Raftery 1995).

In the replicated datasets we adopt the approach of Hayashi and Iwata (2009) to assess the power and accuracy of the three models. The posterior QTL intensity for 1-cM bins along the linkage groups was calculated. Subsequently, the Summed QTL Intensity (SQI) was calculated by summing the QTL intensity over a single linkage group (Hayashi and Awata 2008). Thresholds of SQI were determined from empirical null distributions of the maximum SQI over all linkage groups obtained from 100 null data sets (no QTL were modeled on any linkage group). When SQI exceeds these thresholds for any linkage group, detection of a QTL was declared. For declared QTL the position and effect were calculated as the weighted average over the linkage group where the weights were equal to the QTL intensity. We also examined an alternative method: to declare a QTL, a SQI threshold value of 0.50 must be exceeded (regardless of model), and the posterior mode estimate of QTL location (and the estimated QTL effect pertaining to that location mode) is used for every declared QTL. Taking SQI threshold values other than 0.50—we explored a range between 0.2 and 0.8—yielded similar patterns in relative performance among the three models (results not presented). Furthermore, the SQI approach works well for linkage groups with only 1 (or no) QTL but cannot be applied to a linkage group with multiple QTL as in the pilot dataset.

Simulated data

To empirically test our models we use one pilot dataset and 100 replicated datasets with the same pedigree data and marker densities but with different trait architectures.

Pedigree of connected mapping populations

The simulated data set mimics a real QTL mapping experiment from an ongoing maize breeding program. That is, 16 inbred parents were crossed in an incomplete mating design to produce 30 mapping populations of Recombinant Inbred Lines (RILs) (Table 2). The number of crosses per parent ranged from 1 (parent 863) to 8 (parent 773) and the number of RILs per population ranges from 4 to 50. The entire dataset contains 1,072 RILs.

Table 2 Simulated QTL genotypes and estimated posterior genotype probabilities in the pilot dataset for map intervals with positive QTL evidence of the 16 mapping parents

Pedigree of ancestral generations

The known ancestral pedigree of the 16 parents of the connected mapping populations contains 162 inbred lines. Of these, 32 are founders, i.e., their parentages (pedigree) are assumed unknown.

Marker data

We simulated genetic data for 32 independent founders and their descendants according to the known (ancestral and mapping) pedigree structure. The pedigree contains multiple loops and the longest lineage is nine generations for any of the resulting 16 inbred parents. A gene-dropping method (Maccluer et al. 1986) was used to simulate Mendelian inheritance of marker and QTL alleles from parents to offspring while Haldane’s mapping function (Haldane 1919) was used to transform linkage distances into recombination fractions.

Pilot dataset

One linkage group of 150 cM was simulated that was covered by 16 equidistantly spaced bi-allelic SNP markers for the mapping populations (=sparse map). In contrast, the 16 parents and their ancestors were genotyped for 151 markers covering the same genome length (1 cM distance, = dense map).

Replicated datasets

Five linkage groups of 100 cM were simulated, each group covered by 11 and 101 equidistantly spaced bi-allelic SNP markers for the mapping populations and parents and ancestral pedigree members, respectively. So, the total numbers of SNP markers were 55 and 505 SNPs for the two subsets of the pedigree.

Phenotypic trait data

Pilot dataset

The phenotypes of all mapping individuals were simulated by assuming three QTL residing at positions 30, 60, and 140 cM on the linkage group. The distance between the first and second QTL was relatively small to assess differences in detection power and accuracy of estimates of closely linked QTL. The size of the additive effect for all of the three simulated QTL was set to 1.0.

Replicated datasets

Linkage groups 1–4 contained a single QTL at positions 22, 44, 66, and 88, respectively. Linkage group 5 did not harbor a QTL and was included to evaluate the false positives rate. The size of the additive effects for the QTL at the subsequent linkage groups were 1.3, 1.1, 0.9, and 0.7, respectively.

In both datasets, the allele frequency of all QTL was 0.5 in the ancestral founder population and linkage equilibrium among all loci was assumed. No other genetic or non-genetic variables were simulated and the residual variance was equal to 16.0. In the pilot dataset, the realized phenotypic variance was equal to 19.3 and the heritability of each of the three QTL was approximately 5%. The phenotypic distributions (not shown) for the whole population and the individual mapping populations were continuous and uni-modal as expected from the relatively large residual variance. The QTL allele frequency of 0.5 in the founders did not yield the same frequency in the mapping populations (or their parents) due to drift in the pedigree ancestral to the mapping parents. For the replicated datasets we calculated the fraction of segregation in the n j mapping populations by

$$ F_{\text{Segr}} = {{\left( {\sum\limits_{j = 1}^{{n_{j} }} {n_{o} [j] \times s_{o} [j]} } \right)} \mathord{\left/ {\vphantom {{\left( {\sum\limits_{j = 1}^{{n_{j} }} {n_{o} [j] \times s_{o} [j]} } \right)} {n_{O} }}} \right. \kern-\nulldelimiterspace} {n_{O} }} $$
(15)

where s o [j] is a binary indicator whether the jth mapping population is segregating (s o  = 1) or not (s o  = 0). The segregation indicator is weighted by the size of the mapping population (n o [j]) and the total number of mapping offspring is given by n O (defined previously).

Parent IBD data

For every marker position on the 1 cM map, IBD probabilities among the 16 inbred parents were calculated by using the FlexQTL software (Bink et al. 2008). Note that these 16 × 16 PIBD matrices were calculated (and stored) only once for a given dataset regardless of the number of traits to be analyzed or model specification in the analysis. The resulting 151 (505) PIBD matrices of the pilot dataset (replicated datasets) were used as the new additional data source in the Bayesian analysis (data Y D in Fig. 2). In the LAAM approach we used the least-squares approximation to obtain Latent Class probabilities allowing a maximum of 16 classes (ter Braak et al. 2010).

Results

Pilot dataset

The effective number of ancestral classes (see “Appendix B”) was computed for the TIBD and LAAM models and significant variation in this number was observed. That is, the mean (standard deviation) for the TIBD and LAAM models were 3.4 (0.96) and 3.1 (0.76), respectively. The lowest number was 1.6, implying a substantial probability (0.625) that two randomly chosen individuals belong to the same ancestral allele class. The highest number was more than 6. The effective number in the LAAM model was always smaller or equal to the effective number in the TIBD model, indicating that parents in the LAAM model may have a higher probability of sharing the same ancestral allele.

The posterior mean estimates for the overall mean and residual variance component for the UNR model deviated more from the simulated values than those from the TIBD and LAAM models, irrespective of the prior value for the number of QTL (Table 3). In all three models, the posterior estimates for the number of QTL and the total QTL variance (and consequently heritability) were inflated for higher mean values of the prior number of QTL. The Bayes Factors estimates correctly identified the proper QTL numbers for the TIBD and LAAM models, i.e., large Bayes Factor values (>20) for model with 3 QTL over a model with 2 QTL. For the UNR model the Bayes Factor estimate for the 3 over 2 QTL model was only moderate (values between 3.1 and 4.1).

Table 3 Posterior mean estimates for overall mean (μ), residual variance (σ 2e ), the number of QTL (N QTL), the QTL variance (σ 2QTL ), and heritability (h 2) for the pilot dataset

The peaks of the posterior intensity profiles for QTL position were in general close to the simulated values (30, 60, and 140 cM) for all three models (Fig. 3). The intensity profiles were clearly more peaked and higher for models TIBD and LAAM (Fig. 3). In these models, the intensity peaks were often somewhat left-shifted from the simulated QTL positions, especially for the QTL at 60 cM. For model UNR the intensity profile near the first two simulated QTL positions was relatively flat and stretched as these two QTL were closely linked. The value of the prior number of QTL seemed most influential on the QTL intensity profile for the UNR model (Fig. 3) where intensity profiles were lifted near QTL 1 and 2 while the profile broadened near QTL 3. These posterior intensity plots pointed to an increased accuracy in positioning the QTL on the linkage groups when adding ancestral information into the analysis.

Fig. 3
figure 3

Marginal posterior intensity profile estimates for QTL position along the 150 cM linkage map for three types of analyses, i.e., a unrelated (UNR) mapping parents; b related mapping parents with Threshold IBD (TIBD) algorithm; and c related mapping parents with Latent Ancestor Allele Model (LAAM) algorithm. The QTL positions of the simulated dataset are indicated by arrows on the x-axis (30, 60, and 140 cM). Three values (1, 3, and 5) for the prior number of QTL were studied

Bin-wise Bayes Factor estimates with a cut-off value of 2.0 are used to identify the most probable QTL regions. For each bin in these QTL regions, the estimates of the mean and 90% probability credible regions for QTL effect size are shown in Fig. 4 (for prior number of QTL equal to 3). The posterior mean estimates for all three models were very close to the simulated value of 1.0. The credible regions were smaller and more accurate in the TIBD and LAAM models than in the UNR model.

Fig. 4
figure 4

Marginal posterior mean (solid black line) and 0.90 probability credible region estimates (gray surfaces) for QTL effect size along the 150 cM linkage map for three types of analyses, i.e., a unrelated (UNR) mapping parents; b related mapping parents with Threshold IBD (TIBD) algorithm; and c related mapping parents with Latent Ancestor Allele Model (LAAM) algorithm, and prior E(N QTL = 3). Estimates are depicted only for areas with positive QTL signal (2 × ln(Bayes Factorbin) ≥ 2.0). The QTL positions of the simulated dataset are indicated by arrows on the x-axis (30, 60, and 140 cM). The dashed horizontal line indicates the simulated effect size

For these most probable QTL regions we computed the posterior probabilities for QTL genotypes for the 16 parents of the mapping population (Table 2). The posterior probability estimates were often inconclusive (0.3 ≤ P ≤ 0.7) for the UNR model, especially for the QTL at 30 and 60 cM, while for the QTL at 140 cM ten out of 16 parents had conclusive probability estimates (0.1 ≤ P or P ≥ 0.9). On the other hand, the genotype probability estimates for the TIBD and LAAM models were almost always conclusive for all three QTL positions (Table 2). These differences between models were also summarized by calculating the average absolute difference between the true and estimated QTL genotype probabilities. These summary values show that including PIBD information leads to four times more accurate genotype probability estimates.

Replicated datasets

The fraction of segregation in the mapping populations for the 100 replicated datasets is presented in Fig. 5. Note that the values were ordered within each QTL and ranged from 0.0 to 0.76. When the fraction was equal to zero the QTL was fixed for one of the alleles and consequently the QTL could not be detected in that particular replicate. For the QTL on linkage groups 1–4 the number of non-segregating QTL was equal to 18, 8, 12, and 13, respectively (Table 4), implying that for example more QTL could be detected for linkage group 2 than for linkage group 1.

Fig. 5
figure 5

Fraction of QTL segregation in offspring populations in the 100 replicated simulations (replicates in ascending order within QTL)

Table 4 Posterior inferences results on replicated datasets (100 replicates), using the SQI thresholds from 100 null datasets

Summed QTL Intensity threshold based on empirical null distribution

For the prior E(N QTL = 1), the 5% significance level of these empirical distributions were 0.127, 0.206, and 0.196 for the models UNR, TIBD, and LAAM, respectively. A plausible cause of the higher threshold for the TIBD and LAAM models is the following. Some (segments of) linkage groups are not segregating in the mapping parents (similar to fixed QTL in Fig. 5). These monomorphic regions are excluded in the TIBD and LAAM models which increases the prior probability for QTL on other linkage groups. This creates a higher variability in SQI among linkage groups and thus higher values for the maximum SQI of the linkage groups. The power of QTL detection was higher for the UNR model than for the other two models when using the SQI threshold based on 100 null datasets, except for the QTL on linkage group 3 (Table 4). The UNR model also yielded a much higher false discovery rate, i.e., 37 QTL were declared for linkage group 5. The posterior estimates for location were most biased for the UNR model and for the QTL at the extremes of the linkage groups. The bias is pointing to the middle of the linkage group and seems to be caused by the estimation protocol of Hayashi and Iwata (2009). Since all intervals along the linkage group are included in the estimation procedure, QTL at the extremes will have their estimated location biased toward the center of the linkage group. Especially for the QTL with smaller effect size the location cannot be precisely determined and positions further away are plausible as well. This bias in estimates of QTL location may be strongly reduced by considering the mode of the posterior mode estimates. For example, for the QTL position at the 4th linkage group the estimated mode was equal to 89 cM in all scenarios and both thresholds (results not shown). The accuracy, as represented by the standard deviation of QTL location, was always lower for the UNR model than for the TIBD and LAAM models. The accuracy of location estimates decreased (standard deviation estimates increased) for smaller QTLs (Table 4). The TIBD and LAAM models yielded very similar results for power and accuracy. The effect sizes were always underestimated for all three models, more severely for the UNR model, and this was also likely due to the estimation procedure of averaging along the whole linkage group.

SQI threshold equal to 0.5

Relative to the empirical null distribution threshold, the SQI > 0.5 threshold decreased the power of QTL detection for all three models (Table 5). The UNR model yielded lowest power and the power decreased consistently with QTL effect size. The highest power for the QTL on linkage group 2 for the TIBD and LAAM models can be explained by the highest number of replicates with segregation (92). The posterior mode estimates for QTL location yielded almost unbiased results, except for the smallest QTL (linkage group 4). Also, the estimates of QTL effects at the mode of QTL location were close to the simulated values for the models TIBD and LAAM, especially for the larger QTL effect sizes. The results for the UNR model were always inferior, especially for accuracy of the estimates as expressed by the standard deviations in Table 5. These latter results indicated that the original SQI protocol of Hayashi and Iwata (2009) may be further improved and extended to compare relative performance of different methods and models with respect to power and accuracy of QTL mapping for complex traits.

Table 5 Posterior inferences results on replicated datasets (100 replicates), using the same SQI threshold of 0.50 for all scenarios of analysis

Discussion

We present a novel approach to efficiently include genome-wide ancestral IBD information on parent alleles into the QTL analyses of multiple connected populations. Analysis of simulated data indicates improvement of mapping accuracy and power when genetic relationships between parents are modeled as opposed to treating the parents as independent. Two algorithms were implemented and tested. The threshold-algorithm benefits from ease of implementation and interpretation but may yield a crude classification of founder alleles, especially when PIBD probabilities are more intermediate between 0 and 1. Furthermore, consistency of classification needs to be checked via transitivity rules. The latent class algorithm is conceptually more appealing as it provides a more precise representation of the original IBD information along the genome. In our simulated datasets these two algorithms yielded the same posterior conclusions. Our current results indicate that the threshold values of 0.90 and 0.80 in the TIBD model yield very similar posterior results and mixing behavior; however, results and performance may become different with further lowering of this threshold. These implementation issues of the TIBD model are subject to further research.

The comparison of our proposed PIBD approach to a full pedigree analyses was impractical as the high marker density in the ancestral pedigree (Fig. 1) creates a major missing data problem in the mapping pedigree. The progeny in the mapping populations in the pilot dataset would have missing marker scores for 135 out of 151 loci. A comparison of our novel approach with a full pedigree analysis with all individuals genotyped for the sparse density map showed that the full pedigree analysis was almost as powerful, but that computation time was dramatically increased because of the added number of individuals and the additional number of generations in the pedigree which makes the sampling algorithms more time consuming. Even in a much smaller simulated example, i.e., considering a single biparental mapping population, the full pedigree analysis required over 50 times more computation time (results not shown).

The novel approach to include genome-wide ancestral IBD information in QTL mapping can be further extended to include a polygenic component which may account for QTL that cannot be picked up in the linkage detection, cf. Bink et al. (2008). When modeling the polygenic component, the use of a marker-based genome-wide average coancestry among founder individuals could be obtained from the PIBD matrices calculated for each chromosomal segment, cf. (NejatiJavaremi et al. 1997), and recently applied to genomic selection, e.g., (Habier et al. 2007). An alternative to this approach would be to use known pedigree relationships to construct the coancestry relationship matrix needed to account for the polygenic term.

In this study, we assume two alleles at a QTL which allows a straightforward extension to include non-additive effects, e.g., dominance and epistatic interactions. When primary interest is in additive gene actions, QTL models with many alleles may be advantageous to allow greater flexibility for panmictic populations (Hoeschele et al. 1997). However, two important implementation issues must be addressed. First, a multiple allele model may contain effects for alleles with little supporting phenotypic data and is thereby prone to less accurate results for QTL allelic effects (Hayashi and Iwata 2009). To draw accurate inference on the number of allelic effects, the allelic effects must differ substantially from each other (Jannink and Wu 2003). Second, the extension to dominance and higher-order interactions is not straightforward as many interaction effects will not be realized in the phenotypic data. The extension of our new approach to outbred populations will be straightforward in case dense marker data are available to unambiguously assign haplotypes to all parents of the mapping population. Then the dimensionality of the PIBD matrices simply doubles and the number of rows in matrix P also doubles. In our study we had access to accurate pedigree and marker data on ancestors of the mapping populations. When ancestral pedigree is unknown or DNA is not available on the members, the LD-based estimation method of Meuwissen and Goddard (2000) utilizing very dense marker data can be applied to obtain the location-specific parental IBD matrices in outbred and inbred populations (Bink and Meuwissen 2004).

The UNR model was the point of departure in our Bayesian approach and this model already accounts for the sharing of one or more common parents by multiple populations using a pedigree linkage approach. Other recent Bayesian approaches have been proposed that take unique QTL allelic effects for each of the mapping parents (Hayashi and Iwata 2009) with their simulated datasets containing a common reference parent in a star design. Fang et al. (2011) assumed the QTL alleles of all mapping individuals as samples from a normal distribution with a covariance matrix proportional to the IBD matrix that was calculated from the marker information on the mapping offspring and their parents. The modeling of the QTL as a random effect in a mixed model can be solved more efficiently using restricted maximum likelihood approaches. Mixed models for QTL mapping in real connected plant populations have been successful in wheat (Arbelbide and Bernardo 2006; Crepieux et al. 2005; Rosyara et al. 2009) and maize (van Eeuwijk et al. 2010), but the treatment of multiple QTL models is less straightforward for (closely) linked QTL as was the case in our pilot dataset.

The simulated datasets used in this study reflect a typical connected population structure as they contained 30 mapping populations derived from 16 connected parents with known ancestry up to 32 original founder individuals. These characteristics can easily be varied without changing the applicability of the method. For example, the idea of ancestral allele classes can also be applied to a single mapping population derived from two inbred parents. In that case, the ancestral PIBD data will indicate which genomic regions are shared by the two inbred parents and these regions can be excluded a priori to harbor QTL. This type of information may substantially increase mapping precision but is fully ignored in other linkage methods. Furthermore, plant breeders may consider a large number of small mapping populations derived from a large number of parents where these parents inherited a limited number of (unknown) ancestral alleles. The additional layer of ancestral allele classes will facilitate substantial power to associate phenotypic trait variation with genomic polymorphisms. The increase in numbers may require more efficient algorithms to include ancestral IBD information. The practicality of our new approach was well illustrated by the successful mapping of QTLs in hybrid selection programs (van Eeuwijk et al. 2010).

The increasing availability of cheap and abundant markers opens new ways to advance genetic progress in plant and animal breeding programs, such as whole genome selection approaches (Bernardo and Yu 2007; Meuwissen et al. 2001). However, the application of high-density (SNP) genotyping to all mapping populations grown within commercial breeding programs might still not be feasible due to economic reasons. Therefore, a substantial discrepancy in marker density between elite (selected) breeding lines and regular breeding populations can occur. Our approach tackles this potential discrepancy and exploits the available sources of information efficiently to map important genomic regions affecting complex traits.