PlmDCA
- Pseudo-likelihood maximization for protein multiple sequence alignments.
Package Features
- Learn pseudo-likelihood model from multiple sequence alignment
See the Index for the complete list of documented functions and types.
Overview
Protein families are given in the form of multiple sequence alignments (MSA) $D = (a^m_i \mid i = 1,\dots,L;\,m = 1,\dots,M)$ of $M$ proteins of aligned length $L$. The entries $a^m_i$ are either one of the standard 20 amino acids or the alignment gap $-$. In total, we have $q = 21$ possible different symbols in $D$. The aim of unsupervised generative modeling is to learn a statistical model $P(a_1,\dots,a_L)$ of (aligned) full-length sequences which faithfully reflects the variability found in $D$: sequences belonging to the protein family of interest should have comparably high probabilities, unrelated sequences very small probabilities. Here we propose a computationally efficient approach based on pseudo-likelihood maximization.
We start from the conditional probabilities $P(a_i \mid a_{\setminus i})$, where $a_{\setminus i} := \{a_1,\dots,a_{i-1},a_{i+1},\dots,a_L\}$, i.e. all residues of the amino-acid sequence except the one at position $i$.
Here, we use the following parametrization:
\[P(a_i |a_{\setminus i}) = \frac{\exp \left\{ h_i(a_i) + \sum_{j\neq i} J_{i,j}(a_i,a_j)\right\} }{z_i(a_{\setminus i})}\,,\]
where:
\[z_i(a_{\setminus i})= \sum_{a=1}^{q} \exp \left\{ h_i(a) + \sum_{j\neq i} J_{i,j}(a,a_j)\right\} \,,\]
is the normalization factor. The pseudo-likelihood maximization strategy aims at finding the values of the parameters $J, h$ that maximize the log-pseudo-likelihood:
\[{\cal L}_i = \frac{1}{M} \sum_{m=1}^M \left( h_i(a^m_i) + \sum_{j\neq i} J_{ij}(a_i^m,a_j^m) - \log z_i(a^m_{\setminus i})\right)\,.\]
In machine learning, this parametrization is known as softmax regression, a generalization of logistic regression to multi-class labels.
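The conditional probability above can be made concrete with a small numerical sketch. The code below is a hypothetical toy illustration (not part of PlmDCA's API): it evaluates $\log P(a_i \mid a_{\setminus i})$ for random parameters on a tiny alphabet and checks that the conditional distribution is normalized.

```julia
# Toy sketch of the pseudo-likelihood parametrization, with q = 3 states and
# L = 4 positions; h and J are random stand-ins for learned parameters.
q, L = 3, 4
h = randn(q, L)          # fields h_i(a)
J = randn(q, q, L, L)    # couplings J_{ij}(a, b); the diagonal blocks are unused

# h_i(a) + sum_{j != i} J_{ij}(a, a_j), for every state a at position i
function site_scores(seq::Vector{Int}, i::Int)
    s = h[:, i]
    for j in 1:L
        j == i && continue
        s .+= J[:, seq[j], i, j]
    end
    return s
end

# log P(a_i | a_{\i}) = score(a_i) - log z_i, where z_i sums over the q states
function logcond(seq::Vector{Int}, i::Int)
    s = site_scores(seq, i)
    return s[seq[i]] - log(sum(exp.(s)))
end

seq = [1, 3, 2, 1]
# Summing P(a | a_{\i}) over all q states at position 2 must give 1
p = sum(exp(logcond([seq[1], a, seq[3], seq[4]], 2)) for a in 1:q)
```

The site log-pseudo-likelihood ${\cal L}_i$ is then the weighted average of `logcond` over the $M$ sequences of the alignment.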
Usage
The typical pipeline to use the package is:
- Compute PlmDCA parameters from a multiple sequence alignment:
```
julia> res = plmdca(filefasta; kwds...)
```
Multithreading
To fully exploit multicore parallel computation, julia should be invoked with
```
$ julia -t nthreads # put here nthreads equal to the number of cores you want to use
```
If you want to set the number of threads permanently to the desired value, you can set the environment variable in your .bashrc, e.g. export JULIA_NUM_THREADS=24. More information is available in the Julia documentation on multithreading.
Load PlmDCA package
The package is on the General Registry. It can be installed from the package manager by
```
pkg> add PlmDCA
```
and
```
julia> using PlmDCA
```
Learn the parameters
There are two different learning strategies:
- The asymmetric one, invoked by the `plmdca_asym` method (the `plmdca` method points to the asymmetric strategy).
- The symmetric one, invoked by `plmdca_sym`. This method is slower and typically less accurate.
Both methods return the parameters res::PlmOut, a struct containing:
- `Jtensor`: the 4-dimensional array (q×q×L×L) of the couplings in zero-sum gauge `J[a,b,i,j]`
- `Htensor`: the 2-dimensional array (q×L) of the fields in zero-sum gauge `h[a,i]`
- `pslike`: a vector of length `L` containing the log-pseudo-likelihoods
- `score`: a vector of `(i,j,val)::Tuple{Int,Int,Float64}` containing the DCA score `val` relative to the pair `i,j`, in descending order.
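To illustrate how a DCA score can be derived from a coupling tensor like `Jtensor`, the sketch below computes the Frobenius norm of each q×q coupling block on random data. This is only a conceptual illustration: PlmDCA computes `score` internally, and its actual scoring additionally applies the average-product correction (APC).

```julia
using LinearAlgebra  # for norm

# Random stand-in for a coupling tensor in the same layout as Jtensor
q, L = 3, 4
J = randn(q, q, L, L)

# L×L matrix of Frobenius norms of the q×q coupling blocks:
# large F[i, j] suggests a strong direct interaction between positions i and j
F = [norm(J[:, :, i, j]) for i in 1:L, j in 1:L]
```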
For both methods, the algorithmic keyword arguments (with their default value) are:
- `epsconv::Real=1.0e-5` (convergence parameter)
- `maxit::Int=1000` (maximum number of iterations; normally there is no need to change it)
- `verbose::Bool=true` (set to `false` to suppress printing on screen)
- `method::Symbol=:LD_LBFGS` (optimization method)
Example:
```
res = plmdca("data/PF14/PF00014_mgap6.fasta.gz", verbose=false, lambdaJ=0.02, lambdaH=0.001);
```
Additional parameters include those of `PlmDCA.read_fasta`, which determine the effective sequence diversity.
Contact Prediction
Contact prediction is contained in the output of plmdca: the returned `::PlmOut` struct has a `score` field, a Vector of Tuples ranked in descending score order. Each Tuple contains $i,j,s_{ij}$, where $s_{ij}$ is the DCA score of the residue pair $i,j$. The residue numbering is that of the MSA, not that of the unaligned full-length sequences.
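A typical post-processing step is to keep only the top-ranked pairs that are well separated along the chain, since trivially close pairs are almost always in contact. The helper below is a hypothetical sketch (not part of PlmDCA) operating on a score vector with the same layout as `res.score`, using the common minimum-separation filter $|i-j| > 4$:

```julia
# Toy score vector in the same (i, j, score) layout as res.score,
# already sorted in descending score order
score = [(3, 9, 1.2), (1, 3, 0.9), (2, 8, 0.7), (5, 6, 0.5)]

# Keep the first k pairs whose sequence separation exceeds minsep
function top_contacts(score, k::Int; minsep::Int=4)
    out = Tuple{Int,Int,Float64}[]
    for (i, j, s) in score
        abs(i - j) > minsep || continue
        push!(out, (i, j, s))
        length(out) == k && break
    end
    return out
end

top_contacts(score, 2)  # keeps (3, 9, 1.2) and (2, 8, 0.7); skips (1, 3, 0.9)
```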
Methods Reference
PlmDCA.plmdca — Method
```
plmdca(filename::String; kwds...) -> PlmOut
```
Run `plmdca_asym` on the FASTA alignment in `filename`.
PlmDCA.plmdca_asym — Method
```
plmdca_asym(filename::String; kwds...) -> PlmOut
```
Run `plmdca_asym` on the FASTA alignment in `filename`. It returns a struct `::PlmOut` containing four fields:
- `Jtensor`: the 4-dimensional array (q×q×L×L) of the couplings in zero-sum gauge `J[a,b,i,j]`
- `Htensor`: the 2-dimensional array (q×L) of the fields in zero-sum gauge `h[a,i]`
- `pslike`: a vector of length `L` containing the log-pseudo-likelihoods
- `score`: a vector of `(i,j,val)::Tuple{Int,Int,Float64}` containing the DCA score `val` relative to the pair `i,j`, in descending order.

Optional arguments:
- `lambdaJ::Real=0.01` coupling L₂ regularization parameter (Lagrange multiplier)
- `lambdaH::Real=0.01` field L₂ regularization parameter (Lagrange multiplier)
- `epsconv::Real=1.0e-5` convergence value in minimization
- `maxit::Int=1000` maximum number of iterations in minimization
- `verbose::Bool=true` set to `false` to stop printing convergence info on `stdout`
- `method::Symbol=:LD_LBFGS` optimization strategy; see `NLopt.jl` for other options
See PlmDCA.read_fasta for additional parameters.
Example
```
julia> res = plmdca_asym("data/pf00014short.fasta", lambdaJ=0.01, lambdaH=0.01, epsconv=1e-12);
```
PlmDCA.plmdca_asym — Method
```
plmdca_asym(Z::Matrix{T}, W::Vector{Float64}; kwds...) -> PlmOut
```
Pseudo-likelihood maximization on the M×N alignment `Z` (numerically encoded in 1,…,21) and the M-dimensional normalized weight vector `W`.
It returns a struct ::PlmOut containing four fields:
- `Jtensor`: the 4-dimensional array (q×q×L×L) of the couplings in zero-sum gauge `J[a,b,i,j]`
- `Htensor`: the 2-dimensional array (q×L) of the fields in zero-sum gauge `h[a,i]`
- `pslike`: a vector of length `L` containing the log-pseudo-likelihoods
- `score`: a vector of `(i,j,val)::Tuple{Int,Int,Float64}` containing the DCA score `val` relative to the pair `i,j`, in descending order.

Optional arguments:
- `lambdaJ::Real=0.01` coupling L₂ regularization parameter (Lagrange multiplier)
- `lambdaH::Real=0.01` field L₂ regularization parameter (Lagrange multiplier)
- `epsconv::Real=1.0e-5` convergence value in minimization
- `maxit::Int=1000` maximum number of iterations in minimization
- `verbose::Bool=true` set to `false` to stop printing convergence info on `stdout`
- `method::Symbol=:LD_LBFGS` optimization strategy; see `NLopt.jl` for other options
Example
```
julia> res = plmdca_asym(Z, W, lambdaJ=0.01, lambdaH=0.01, epsconv=1e-12);
```
PlmDCA.read_fasta — Method
```
PlmDCA.read_fasta(filename; max_gap_fraction::Real, theta, remove_dups::Bool) -> W, Zint, N, M, q
```
Reads a FASTA file and returns the per-sequence weights, the integer representation of the sequences, the length of the sequences, the number of sequences, and the number of states.
See DCAUtils.read_fasta_alignment for the integer representation of sequences and the meaning of max_gap_fraction. If remove_dups is true, duplicate sequences are removed.
theta controls the sequence-weighting computation, see DCAUtils.compute_weights.
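The idea behind theta-based reweighting can be sketched in a few lines. This is a conceptual illustration only, not the actual implementation of `DCAUtils.compute_weights`: each sequence is down-weighted by the number of sequences lying within normalized Hamming distance `theta` of it, so that clusters of near-duplicates do not dominate the statistics.

```julia
# Conceptual sketch of theta-based sequence reweighting.
# Z holds one integer-encoded sequence per row (M sequences of length L).
function reweight(Z::Matrix{Int}, theta::Real)
    M, L = size(Z)
    n = ones(Int, M)  # each sequence counts itself
    for m1 in 1:M-1, m2 in m1+1:M
        d = sum(Z[m1, :] .!= Z[m2, :]) / L  # normalized Hamming distance
        if d < theta
            n[m1] += 1
            n[m2] += 1
        end
    end
    return 1.0 ./ n  # weight of sequence m is 1 / (number of its neighbors)
end

Z = [1 2 3; 1 2 3; 3 2 1]  # three toy sequences of length 3
W = reweight(Z, 0.2)       # the two identical sequences get weight 1/2 each
```

The sum of the weights (the "effective" number of sequences, often called Meff) measures the sequence diversity of the alignment.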