PlmDCA
- Pseudo-likelihood maximization for protein multiple sequence alignments.
Package Features
- Learn pseudo-likelihood model from multiple sequence alignment
See the Index for the complete list of documented functions and types.
Overview
Protein families are given in the form of multiple sequence alignments (MSA) $D = (a^m_i \mid i = 1,\dots,L;\,m = 1,\dots,M)$ of $M$ proteins of aligned length $L$. The entries $a^m_i$ are either one of the standard 20 amino acids or the alignment gap $-$. In total, we have $q = 21$ possible different symbols in $D$. The aim of unsupervised generative modeling is to learn a statistical model $P(a_1,\dots,a_L)$ of (aligned) full-length sequences which faithfully reflects the variability found in $D$: sequences belonging to the protein family of interest should have comparably high probabilities, unrelated sequences very small probabilities. Here we propose a computationally efficient approach based on pseudo-likelihood maximization.
We start from the conditional probabilities $P(a_i \mid a_{\setminus i})$, where $a_{\setminus i} := \{a_1,\dots,a_{i-1},a_{i+1},\dots,a_L\}$, i.e. all residues of the amino-acid sequence except the one at position $i$.
Here, we use the following parametrization:
\[P(a_i |a_{\setminus i}) = \frac{\exp \left\{ h_i(a_i) + \sum_{j\neq i} J_{i,j}(a_i,a_j)\right\} }{z_i(a_{\setminus i})}\,,\]
where:
\[z_i(a_{\setminus i})= \sum_{a=1}^{q} \exp \left\{ h_i(a) + \sum_{j\neq i} J_{i,j}(a,a_j)\right\} \,,\]
is the normalization factor. The pseudo-likelihood maximization strategy aims at finding the values of the parameters $J, h$ that maximize the log-pseudo-likelihood:
\[{\cal L}_i = \frac{1}{M} \sum_{m=1}^M \left( h_i(a^m_i) + \sum_{j\neq i} J_{ij}(a_i^m,a_j^m) - \log z_i(a^m_{\setminus i})\right)\,.\]
In machine learning, this parametrization is known as softmax regression, a generalization of logistic regression to multi-class labels.
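The conditional probability above can be made concrete with a small numerical sketch. The code below is a hypothetical toy illustration (not part of PlmDCA's API): it evaluates $\log P(a_i \mid a_{\setminus i})$ for random parameters on a tiny alphabet and checks that the conditional distribution is normalized.

```julia
# Toy sketch of the pseudo-likelihood parametrization, with q = 3 states and
# L = 4 positions; h and J are random stand-ins for learned parameters.
q, L = 3, 4
h = randn(q, L)          # fields h_i(a)
J = randn(q, q, L, L)    # couplings J_{ij}(a, b); the diagonal blocks are unused

# h_i(a) + sum_{j != i} J_{ij}(a, a_j), for every state a at position i
function site_scores(seq::Vector{Int}, i::Int)
    s = h[:, i]
    for j in 1:L
        j == i && continue
        s .+= J[:, seq[j], i, j]
    end
    return s
end

# log P(a_i | a_{\i}) = score(a_i) - log z_i, where z_i sums over the q states
function logcond(seq::Vector{Int}, i::Int)
    s = site_scores(seq, i)
    return s[seq[i]] - log(sum(exp.(s)))
end

seq = [1, 3, 2, 1]
# Summing P(a | a_{\i}) over all q states at position 2 must give 1
p = sum(exp(logcond([seq[1], a, seq[3], seq[4]], 2)) for a in 1:q)
```

The site log-pseudo-likelihood ${\cal L}_i$ is then the weighted average of `logcond` over the $M$ sequences of the alignment.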
Usage
The typical pipeline to use the package is:
- Compute PlmDCA parameters from a multiple sequence alignment:
```
julia> res = plmdca(filefasta; kwds...)
```
Multithreading
To fully exploit multicore parallel computation, julia should be invoked with
```
$ julia -t nthreads # put here nthreads equal to the number of cores you want to use
```
If you want to set the number of threads permanently to the desired value, you can set the environment variable in your .bashrc, e.g. export JULIA_NUM_THREADS=24. More information is available in the Julia documentation on multithreading.
Load PlmDCA package
The package is on the General Registry. It can be installed from the package manager by
```
pkg> add PlmDCA
```
and
```
julia> using PlmDCA
```
Learn the parameters
There are two different learning strategies:
- The asymmetric one, invoked by the `plmdca_asym` method (the `plmdca` method points to the asymmetric strategy).
- The symmetric one, invoked by `plmdca_sym`. This method is slower and typically less accurate.
Both methods return the parameters res::PlmOut, a struct containing:
- `Jtensor`: the 4-dimensional array (q×q×L×L) of the couplings in zero-sum gauge `J[a,b,i,j]`
- `Htensor`: the 2-dimensional array (q×L) of the fields in zero-sum gauge `h[a,i]`
- `pslike`: a vector of length `L` containing the log-pseudo-likelihoods
- `score`: a vector of `(i,j,val)::Tuple{Int,Int,Float64}` containing the DCA score `val` relative to the pair `i,j`, in descending order.
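To illustrate how a DCA score can be derived from a coupling tensor like `Jtensor`, the sketch below computes the Frobenius norm of each q×q coupling block on random data. This is only a conceptual illustration: PlmDCA computes `score` internally, and its actual scoring additionally applies the average-product correction (APC).

```julia
using LinearAlgebra  # for norm

# Random stand-in for a coupling tensor in the same layout as Jtensor
q, L = 3, 4
J = randn(q, q, L, L)

# L×L matrix of Frobenius norms of the q×q coupling blocks:
# large F[i, j] suggests a strong direct interaction between positions i and j
F = [norm(J[:, :, i, j]) for i in 1:L, j in 1:L]
```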
For both methods, the algorithmic keyword arguments (with their default value) are:
- `epsconv::Real=1.0e-5` (convergence parameter)
- `maxit::Int=1000` (maximum number of iterations; normally there is no need to change it)
- `verbose::Bool=true` (set to `false` to suppress printing on screen)
- `method::Symbol=:LD_LBFGS` (optimization method)
Example:
```
res = plmdca("data/PF14/PF00014_mgap6.fasta.gz", verbose=false, lambdaJ=0.02, lambdaH=0.001);
```
Additional parameters include those of `PlmDCA.read_fasta`, which determine the effective sequence diversity.
Contact Prediction
Contact prediction is contained in the output of plmdca: the returned `::PlmOut` struct has a `score` field, a Vector of Tuples ranked in descending score order. Each Tuple contains $i,j,s_{ij}$, where $s_{ij}$ is the DCA score of the residue pair $i,j$. The residue numbering is that of the MSA, not that of the unaligned full-length sequences.
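A typical post-processing step is to keep only the top-ranked pairs that are well separated along the chain, since trivially close pairs are almost always in contact. The helper below is a hypothetical sketch (not part of PlmDCA) operating on a score vector with the same layout as `res.score`, using the common minimum-separation filter $|i-j| > 4$:

```julia
# Toy score vector in the same (i, j, score) layout as res.score,
# already sorted in descending score order
score = [(3, 9, 1.2), (1, 3, 0.9), (2, 8, 0.7), (5, 6, 0.5)]

# Keep the first k pairs whose sequence separation exceeds minsep
function top_contacts(score, k::Int; minsep::Int=4)
    out = Tuple{Int,Int,Float64}[]
    for (i, j, s) in score
        abs(i - j) > minsep || continue
        push!(out, (i, j, s))
        length(out) == k && break
    end
    return out
end

top_contacts(score, 2)  # keeps (3, 9, 1.2) and (2, 8, 0.7); skips (1, 3, 0.9)
```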
Methods Reference
PlmDCA.plmdca — Method
```
plmdca(filename::String; kwds...) -> PlmOut
```
Run `plmdca_asym` on the FASTA alignment in `filename`.
PlmDCA.plmdca_asym — Method
```
plmdca_asym(filename::String; kwds...) -> PlmOut
```
Run `plmdca_asym` on the FASTA alignment in `filename`. It returns a struct `::PlmOut` containing four fields:
- `Jtensor`: the 4-dimensional array (q×q×L×L) of the couplings in zero-sum gauge `J[a,b,i,j]`
- `Htensor`: the 2-dimensional array (q×L) of the fields in zero-sum gauge `h[a,i]`
- `pslike`: a vector of length `L` containing the log-pseudo-likelihoods
- `score`: a vector of `(i,j,val)::Tuple{Int,Int,Float64}` containing the DCA score `val` relative to the pair `i,j`, in descending order.

Optional arguments:
- `lambdaJ::Real=0.01` coupling L₂ regularization parameter (Lagrange multiplier)
- `lambdaH::Real=0.01` field L₂ regularization parameter (Lagrange multiplier)
- `epsconv::Real=1.0e-5` convergence value in minimization
- `maxit::Int=1000` maximum number of iterations in minimization
- `verbose::Bool=true` set to `false` to stop printing convergence info on `stdout`
- `method::Symbol=:LD_LBFGS` optimization strategy; see `NLopt.jl` for other options
See PlmDCA.read_fasta for additional parameters.
Example
```
julia> res = plmdca_asym("data/pf00014short.fasta", lambdaJ=0.01, lambdaH=0.01, epsconv=1e-12);
```
PlmDCA.plmdca_asym — Method
```
plmdca_asym(Z::Matrix{T}, W::Vector{Float64}; kwds...) -> PlmOut
```
Pseudo-likelihood maximization on the M×N alignment `Z` (numerically encoded in 1,…,21) and the M-dimensional normalized weight vector `W`.
It returns a struct ::PlmOut containing four fields:
- `Jtensor`: the 4-dimensional array (q×q×L×L) of the couplings in zero-sum gauge `J[a,b,i,j]`
- `Htensor`: the 2-dimensional array (q×L) of the fields in zero-sum gauge `h[a,i]`
- `pslike`: a vector of length `L` containing the log-pseudo-likelihoods
- `score`: a vector of `(i,j,val)::Tuple{Int,Int,Float64}` containing the DCA score `val` relative to the pair `i,j`, in descending order.

Optional arguments:
- `lambdaJ::Real=0.01` coupling L₂ regularization parameter (Lagrange multiplier)
- `lambdaH::Real=0.01` field L₂ regularization parameter (Lagrange multiplier)
- `epsconv::Real=1.0e-5` convergence value in minimization
- `maxit::Int=1000` maximum number of iterations in minimization
- `verbose::Bool=true` set to `false` to stop printing convergence info on `stdout`
- `method::Symbol=:LD_LBFGS` optimization strategy; see `NLopt.jl` for other options
Example
```
julia> res = plmdca_asym(Z, W, lambdaJ=0.01, lambdaH=0.01, epsconv=1e-12);
```
PlmDCA.read_fasta — Method
```
PlmDCA.read_fasta(filename; max_gap_fraction::Real, theta, remove_dups::Bool) -> W, Zint, N, M, q
```
Reads a FASTA file and returns the per-sequence weights, the integer representation of the sequences, the length of the sequences, the number of sequences, and the number of states.
See DCAUtils.read_fasta_alignment for the integer representation of sequences and the meaning of max_gap_fraction. If remove_dups is true, duplicate sequences are removed.
theta controls the sequence-weighting computation, see DCAUtils.compute_weights.
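The idea behind theta-based reweighting can be sketched in a few lines. This is a conceptual illustration only, not the actual implementation of `DCAUtils.compute_weights`: each sequence is down-weighted by the number of sequences lying within normalized Hamming distance `theta` of it, so that clusters of near-duplicates do not dominate the statistics.

```julia
# Conceptual sketch of theta-based sequence reweighting.
# Z holds one integer-encoded sequence per row (M sequences of length L).
function reweight(Z::Matrix{Int}, theta::Real)
    M, L = size(Z)
    n = ones(Int, M)  # each sequence counts itself
    for m1 in 1:M-1, m2 in m1+1:M
        d = sum(Z[m1, :] .!= Z[m2, :]) / L  # normalized Hamming distance
        if d < theta
            n[m1] += 1
            n[m2] += 1
        end
    end
    return 1.0 ./ n  # weight of sequence m is 1 / (number of its neighbors)
end

Z = [1 2 3; 1 2 3; 3 2 1]  # three toy sequences of length 3
W = reweight(Z, 0.2)       # the two identical sequences get weight 1/2 each
```

The sum of the weights (the "effective" number of sequences, often called Meff) measures the sequence diversity of the alignment.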