ArDCA
- Autoregressive model learning for protein inference.
Package Features
- Learn model from multiple sequence alignment
- Sample from the model
See the Index for the complete list of documented functions and types.
Overview
Protein families are given in form of multiple sequence alignments (MSA) $D = (a^m_i |i = 1,\dots,L;\,m = 1,\dots,M)$ of $M$ proteins of aligned length $L$. The entries $a^m_i$ equal either one of the standard 20 amino acids, or the alignment gap $–$. In total, we have $q = 21$ possible different symbols in D. The aim of unsupervised generative modeling is to earn a statistical model $P(a_1,\dots,a_L)$ of (aligned) full-length sequences, which faithfully reflects the variability found in $D$: sequences belonging to the protein family of interest should have comparably high probabilities, unrelated sequences very small probabilities. Here we propose a computationally efficient approach based on autoregressive models.
We start from the exact decomposition:
\[P(a_1,\dots,a_L) = P(a_1) \cdot P(a_2|a_1) \cdot \dots \cdot P(a_L|a_1,\dots,a_{L-1})\]
Here, we use the following parametrization:
\[P(a_i | a_1,\dots,a_{i-1}) = \frac{\exp \left\{ h_i(a_i) + \sum_{j=1}^{i-1} J_{i,j}(a_i,a_j)\right\} }{z_i(a_1,\dots,a_{i-1})}\,,\]
where:
\[z_i(a_1,\dots,a_{i-1})= \sum_{a=1}^{q} \exp \left\{ h_i(a) + \sum_{j=1}^{i-1} J_{i,j}(a,a_j)\right\} \,,\]
is a the normalization factor. In machine learning, this parametrization is known as soft-max regression, the generalization of logistic regression to multi-class labels.
Usage
The typical pipeline to use the package is:
- Compute ArDCA parameters from a multiple sequence alignment:
julia> arnet,arvar=ardca(filefasta; kwds...)
- Generate
100
new sequences, and store it in an $L\times 100$ array of integers.
julia> Zgen = sample(arnet,100);
Multithreading
To fully exploit the the multicore parallel computation, julia should be invoked with
$ julia -t nthreads # put here nthreads equal to the number of cores you want to use
If you want to set permanently the number of threads to the desired value, you can either create a default environment variable export JULIA_NUM_THREADS=24
in your .bashrc
. More information here
Tutorial
We will assume we have a Multiple Sequence Alignment (MSA)in FASTA format. We aim at
- Given a MSA, generate a sample
- Given a MSA, predict contacts
- Given a MSA, predict the mutational effect in all (ungapped) position of a given target sequence
Load ArDCA package
The following cell loads the package ArDCA
(Warning: the first time it takes a while)
- The
mypkgdir
variable should be set to yourpath/to/package
dir.
We will use the PF00014 protein family available in data/PF14/
folder of the package/
mypkgdir = normpath(joinpath(pwd(),".."))
datadir=joinpath(mypkgdir,"data") # put here your path
using Pkg
Pkg.activate(mypkgdir)
using ArDCA
Learn the autoregressive parameters
As a preliminary step, we learn the field and the coupling parameters $h,J$ from the MSA. To do so we use the ardca
method that return the parameters (stored in arnet
in the cell below), and the alignment in numerical format and other algorithms variables (stored in arvar
in the cell below). The default autoregressive order is set to :ENTROPIC
. We set the $L_2$ regularization to 0.02 for the $J$ and 0.001 for the $h$.
The keyword arguments for the ardca
method are (with their default value):
lambdaJ::Real=0.01
coupling L₂ regularization parameter (lagrange multiplier)lambdaH::Real=0.01
field L₂ regularization parameter (lagrange multiplier)pc_factor::Real=0
pseudocount factor for calculation ofp0
epsconv::Real=1.0e-5
(convergence parameter)maxit::Int=1000
(maximum number of iteration - don't change)verbose::Bool=true
(set tofalse
to suppress printing on screen)method::Symbol=:LD_LBFGS
(optimization method)permorder::Union{Symbol,Vector{Ti}}=:ENTROPIC
(permutation order). Possible values are:[:NATURAL, :ENTROPIC, :REV_ENTROPIC, :RANDOM]
or a custom permutation vector.
arnet,arvar=ardca("data/PF14/PF00014_mgap6.fasta.gz", verbose=false, lambdaJ=0.02,lambdaH=0.001);
1. Sequence Generation
We now generate M
sequences using the sample
method:
M = 1_000;
generated_alignment = sample(arnet,M);
The generated alignment has is a 𝐿×𝑀 matrix of integer where 𝐿 is the alignment's length, and 𝑀 the number of samples.
Interestingly, we for each sequence we can also compute the likelihood with the samplewithweights method.
loglikelihood,generated_alignment = sample_with_weights(arnet,M);
2. Contact Prediction
We can compute the epistatic score for residue-residue contact prediction. To do so, we can use the epistatic_score
method. The epistatic score is computed on any target sequence of the MSA. Empirically, it turns out the the final score does not depend much on the choice of the target sequence.
The autput is contained in a Vector
of Tuple
ranked in descendic score order. Each Tuple
contains $i,j,s_{ij}$ where $s_{ij}$ is the epistatic score of the residue pair $i,j$. The residue numbering is that of the MSA, and not of the unaligned full sequences.
target_sequence = 1
score=epistatic_score(arnet,arvar,target_sequence)
3. Predicting mutational effects
For any reference sequence, we can easily predict the mutational effect for all single mutants. Of course we can extract this information only for the non-gapped residues of the target sequence.
This is done with the dms_single_site
method, which returns a q×L
matrix D
containing $\log(P(mut))/\log(P(wt))$ for all single site mutants of the reference sequence seqid
(i.e. the so-called wild type sequence), and idxgap
a vector of indices of the residues of the reference sequence that contain gaps (i.e. the 21 amino-acid) for which the score has no sense and is set by convention to +Inf
.
A negative value indicate a beneficial mutation, a value 0 indicate the wild-type amino-acid.
target_sequence = 1
D,idxgap=dms_single_site(arnet,arvar,target_sequence)
Methods Reference
ArDCA.ardca
ArDCA.ardca
ArDCA.dms_single_site
ArDCA.loglikelihood
ArDCA.loglikelihood
ArDCA.loglikelihood
ArDCA.loglikelihood
ArDCA.sample
ArDCA.sample_subsequence
ArDCA.sample_subsequence
ArDCA.sample_with_weights
ArDCA.siteloglikelihood
ArDCA.softmax!
ArDCA.ardca
— Methodardca(filename::String; kwds...)
Run ardca
on the fasta alignment in filename
Return two struct
: ::ArNet
(containing the inferred hyperparameters) and ::ArVar
Optional arguments:
max_gap_fraction::Real=0.9
maximum fraction of insert in the sequenceremove_dups::Bool=true
iftrue
remove duplicated sequencestheta=:auto
if:auto
compute reweighint automatically. Otherwise set aFloat64
value0 ≤ theta ≤ 1
lambdaJ::Real=0.01
coupling L₂ regularization parameter (lagrange multiplier)lambdaH::Real=0.01
field L₂ regularization parameter (lagrange multiplier)pc_factor::Real=1/size(Z,2)
pseudocount factor for calculation ofp0
, defaults to one over the number of sequences.epsconv::Real=1.0e-5
convergence value in minimzationmaxit::Int=1000
maximum number of iteration in minimizationverbose::Bool=true
set tofalse
to stop printing convergence info onstdout
method::Symbol=:LD_LBFGS
optimization strategy seeNLopt.jl
for other optionspermorder::Union{Symbol,Vector{Ti}}=:ENTROPIC
permutation order. Possible values are:NATURAL,:ENTROPIC,:REV_ENTROPIC,:RANDOM
or a custom permutation vector
Examples
julia> arnet, arvar = ardca("pf14.fasta", permorder=:ENTROPIC)
ArDCA.ardca
— Methodardca(Z::Array{Ti,2},W::Vector{Float64}; kwds...)
Auto-regressive analysis on the L×M alignment Z
(numerically encoded in 1,…,21), and the M
-dimensional normalized weight vector W
.
Return two struct
: ::ArNet
(containing the inferred hyperparameters) and ::ArVar
Optional arguments:
lambdaJ::Real=0.01
coupling L₂ regularization parameter (lagrange multiplier)lambdaH::Real=0.01
field L₂ regularization parameter (lagrange multiplier)pc_factor::Real=0
pseudocount factor for calculation ofp0
, defaults to one over the number of sequences.epsconv::Real=1.0e-5
convergence value in minimzationmaxit::Int=1000
maximum number of iteration in minimizationverbose::Bool=true
set tofalse
to stop printing convergence info onstdout
method::Symbol=:LD_LBFGS
optimization strategy seeNLopt.jl
for other optionspermorder::Union{Symbol,Vector{Ti}}=:ENTROPIC
permutation order. Possible values are:NATURAL,:ENTROPIC,:REV_ENTROPIC,:RANDOM
or a custom permutation vector
Examples
julia> arnet, arvar= ardca(Z,W,lambdaJ=0,lambdaH=0,permorder=:REV_ENTROPIC,epsconv=1e-12);
ArDCA.dms_single_site
— Methoddms_single_site(arnet::ArNet, arvar::ArVar, seqid::Int; pc::Float64=0.1)
Return a q×L
matrix of containing -log(P(mut))/log(P(seq))
for all single site mutants of the reference sequence seqid
, and a vector of the indices of the residues of the reference sequence that contain gaps (i.e. the 21 amino-acid) for which the score has no sense and is set by convention to +Inf
. A negative value indicate a beneficial mutation, a value 0 indicate the wild-type amino-acid.
ArDCA.loglikelihood
— Methodloglikelihood(x0::AbstractVector{T}, arnet::ArNet) where {T<:Integer}
Return the loglikelihood of sequence x0
encoded in integer values in 1:q
under the model arnet
`.
ArDCA.loglikelihood
— Methodloglikelihood(arnet::ArNet, arvar::ArVar)
Return the vector of loglikelihoods computed from arvar.Z
under the model arnet
. size(arvar.Z) == N,M
where N
is the sequences length, and M
the number of sequences. The returned vector has M
elements reweighted by arvar.W
ArDCA.loglikelihood
— Methodloglikelihood(x0::String, arnet::ArNet) where {T<:Integer})
Return the loglikelihood of the String
x0
under the model arnet
.
ArDCA.loglikelihood
— Methodloglikelihood(x0::Matrix{T}, arnet::ArNet) where {T<:Integer})
Return the vector of loglikelihoods computed from Matrix
x0
under the model arnet
. size(x0) == N,M
where N
is the sequences length, and M
the number of sequences. The returned vector has M
elements.
ArDCA.sample
— Methodsample(arnet::ArNet, msamples::Int)
Return a generated alignment in the form of a N × msamples
matrix of type ::Matrix{Int}
Examples
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Zgen=Zgen=sample(arnet,1000);
ArDCA.sample_subsequence
— Methodsample_subsequence(x::String, arnet::ArNet, msamples)
Return a generated alignment in the form of a N × msamples
matrix of type ::Matrix{Int}
and the relative probabilities under the model. The alignment is forced to start with with a sequence x
(in amino acid single letter alphabet) and then autoregressively generated.
Example
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_subsequence("MAKG",arnet,1000);
ArDCA.sample_subsequence
— Methodsample_subsequence(x::Vector{T}, arnet::ArNet, msamples)
Return a generated alignment in the form of a N × msamples
matrix of type ::Matrix{Int}
and the relative probabilities under the model. The alignment is forced to start with with a sequence x
(in integer number coding) and then autoregressively generated.
Example
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_subsequence([11,1,9,6],arnet,1000);
ArDCA.sample_with_weights
— Methodsample_with_weights(arnet::ArNet, msamples::Int)
Return a generated alignment in the form of a N × msamples
matrix of type ::Matrix{Int}
and the relative probabilities under the module
Examples
julia> arnet,arvar=ardca("file.fasta",verbose=true,permorder=:ENTROPIC, lambdaJ=0.001,lambdaH=0.001);
julia> Wgen,Zgen=sample_with_weights(arnet,1000);
ArDCA.siteloglikelihood
— Methodsiteloglikelihood(i::Int,arnet::ArNet, arvar::ArVar)
Return the loglikelihood relative to site i computed from arvar.Z
under the model arnet
.
ArDCA.softmax!
— Methodsoftmax(x::AbstractArray{<:Real})
Return the softmax transformation
applied to x