Multiple sequence alignment (MSA) is a fundamental tool in biology. Accurate alignment is crucial for downstream analyses in genomics, proteomics, and evolutionary studies, providing insights into biological processes and molecular interactions essential for understanding life.
Some of the current programs for sequence alignment include, but are not limited to:
Although MSA is not on the focus of the orthGS package, it is an important tool for phylogenetic reconstructions and orthology/paralogy analyses. Thus, in the current vignette we offer some indications to perform multiple alignments without leaving the R context.
The current version of MUSCLE is v5. Nevertheless, since most R packages are generally designed for earlier versions of MUSCLE (like v3), you may want to dowload an install MUSCLE v3.8.31 (https://drive5.com/muscle/downloads_v3.htm). In my computer I have renamed the binary to ´muscle3´ because I have also installed MUSCLE v5 (https://github.com/rcedgar/muscle/releases/tag/v5.3). Both versions of the software are accessible via my system’s PATH environment variable and can be invoked as ´muscle3´ (v3.8.31) or ´muscle´ (v5).
To download and install Clustal Omega you can go to http://www.clustal.org/omega/ and follows the
instructions. Nevertheless, if you have a linux based OS you may benefit
of the suitable package manager. For instance, for macOS I use
Homebrew
, which can be installed via CLI:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Then:
brew install brewsci/bio/clustal-omega
brew install brewsci/bio/muscle
will do the work for you.
For linux OS you may use either apt
, dnf
,
pacman
or whatever the package manager i suitable for your
system.
As an example, we will use a small sample of sequences that can be obtained from the orthGS package.
# ara <- subsetGS("Arabidopsis thaliana")
# muscle_aln <- msa(sequences = ara$prot, ids = ara$phylo_id, method = "muscle3")
# muscle_aln
If we opt to use Clustal Omega instead of MUSCLE:
The ape
package in R provides a number of functions,
which call their respective programs (executables) from R to align
multiple nucleotide (DNAbin) or amino acid (AAbin) sequences. The
binaries for the programs must be installed separately (as explained
above) and it is highly recommended to do this so that the execubables
are in a directory located on the PATH of the system. As a note of
interest ape
allows to call MUSCLE v5 and T-Coffee.
Note that the input sequences must be in format AAbin or DNAbin, which store sequences as binary vectors (less human-readable but more memory-efficient).
Type in the RStudio terminal ?ape::tcoffee
, for
instance, to obtain help in the use of these functions.
The msa
package in R provides an interface to multiple
sequence alignment algorithms (including Clustal Omega, ClustalW and
Muscle), allowing users to perform alignments directly from within the R
environment. The msa
R package is available via
Bioconductor. The simplest way to install the package is the following
(uncomment the lines before executing the chunk):
The muscle
package in R is an interface to the MUSCLE
(Multiple Sequence Comparison by Log-Expectation) algorithm, which is
widely used for performing multiple sequence alignments. The package
allows users to run the MUSCLE algorithm from within R, simplifying the
process of aligning biological sequences such as DNA, RNA, or protein
sequences.
We can still integrate MUSCLE v5 into our R workflow by calling it as an external command-line tool. Just for the sake of fun:
# # if (!require("BiocManager", quietly = TRUE))
# # install.packages("BiocManager")
# # BiocManager::install("Biostrings")
#
# sq <- ara$prot
# names(sq) <- ara$phylo_id
# seqs <- Biostrings::AAStringSet(sq)
#
# ## --- Save the input sequences in a temporary file
# destfile <- tempfile(pattern = "input_sequences",
# tmpdir = tempdir(),
# fileext = ".fasta")
# Biostrings::writeXStringSet(seqs, filepath = destfile)
#
#
# ## --- Run MUSCLE on the input sequences file
# system(paste("muscle -align ", destfile, " -output aligned_sequences.fasta", sep = ""))
#
# ## --- Read fasta alignment as dataframe
# m5 <- seqinr::read.fasta(file = "./aligned_sequences.fasta")
# m5df <- as.data.frame(matrix(rep(NA, length(m5[[1]]) * length(m5)), nrow = length(m5)) )
# for (i in 1:length(m5)){
# m5df[i, ] <- m5[[i]]
# }