## RESEARCH PROJECT REVIEW

We regularly review research projects for the following national and international institutions and agencies:

## METHODS AND SOFTWARE DEVELOPMENT

We have developed methods and several software packages for the analysis of phenotypes, indels, and nucleotide variability in single-locus, multilocus and genomic data, as well as coalescent simulators. Most of the code is available at https://github.com/CRAGENOMICA and here.

**Tang_Rsb** [https://github.com/sramosonsins/Tang_Rsb] *Calculation of Rsb statistics using genotype data.*

**S. E. Ramos-Onsins**. Used first in Ramirez-Ayala et al. GSE. 2021. https://doi.org/10.1186/s12711-020-00597-9

Software for calculating iES and Rsb statistics following Tang, Thornton & Stoneking, PLoS Biology 2007. It needs genotype input in PLINK format.
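As an illustration of the statistic itself, the following is a minimal sketch of Rsb as it is commonly described: the log-ratio of per-SNP iES values between two populations, standardized by the median and standard deviation of the log-ratios. It is not taken from the Tang_Rsb source code; the function name `rsb` and its list-based inputs are assumptions made for this example, and the iES values are assumed to be already computed.

```python
import math
import statistics


def rsb(ies_pop1, ies_pop2):
    """Standardized log-ratio of per-SNP iES values between two populations.

    Hypothetical helper: both arguments are lists with one positive iES
    value per SNP, in the same SNP order.
    """
    if len(ies_pop1) != len(ies_pop2):
        raise ValueError("both populations need one iES value per SNP")
    # Unstandardized statistic: ln(iES_pop1 / iES_pop2) at each SNP.
    log_ratios = [math.log(a / b) for a, b in zip(ies_pop1, ies_pop2)]
    center = statistics.median(log_ratios)
    spread = statistics.stdev(log_ratios)  # sample standard deviation
    return [(lr - center) / spread for lr in log_ratios]
```

Large positive values would point to longer haplotype homozygosity in the first population, large negative values to the second.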

**mstatspop** [https://github.com/CRAGENOMICA/mstatspop] *Statistical Analysis using Multiple Populations for Genomic Data: beta version*

**Ramos-Onsins, S. E.**, Ferretti, L., Raineri, E., Jené, J., Marmorini, G., Burgos, W., Vera, G. Used first in Guirao-Rico et al. 2017. *Heredity.* https://doi.org/10.1038/s41437-017-0002-9

This application calculates statistics of variability using multiple populations from tfasta (transposed fasta), fasta or ms-format files, as text or zip files (and optionally GTF files). The program has multiple options: missing values are allowed, and IUPAC codes for diploid individuals can also be processed. Fst comparisons and permutation tests are also calculated among all populations. Optimal tests of neutrality are calculated, but the GSL libraries must be included when compiling the code. The application can be pipelined with ms (or another simulator with the same output) and calculates the statistics for each replicate. Multiple output options are available. Sliding-window genomic analysis is performed using the tfasta format (converters are available). Variability estimates and neutrality tests based on the frequency spectrum that account for positions with missing values are now available.
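To make the kind of frequency-spectrum statistic mentioned above concrete, here is a minimal sketch that computes the number of segregating sites, Watterson's theta, nucleotide diversity and Tajima's D from 0/1 haplotype strings such as those printed by ms. It is not mstatspop's implementation: the function name `tajima_d` is hypothetical, and the sketch assumes complete data (no missing values) and a single population.

```python
import math


def tajima_d(haplotypes):
    """Return (S, theta_w, pi, D) for equal-length 0/1 haplotype strings."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("this sketch needs at least 4 haplotypes")
    # A site is segregating when both states occur in its column.
    columns = [col for col in zip(*haplotypes) if len(set(col)) > 1]
    s = len(columns)
    if s == 0:
        return 0, 0.0, 0.0, float("nan")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    # pi: mean number of pairwise differences, accumulated site by site.
    pairs = n * (n - 1) / 2
    pi = sum(col.count("1") * col.count("0") for col in columns) / pairs
    # Constants of Tajima's (1989) variance estimator.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return s, theta_w, pi, d
```

Applied to each replicate of a simulator's output, a function like this yields the per-replicate distribution of a statistic, which is the pipelined mode of use described above.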

**fastaconvtr** [https://github.com/CRAGENOMICA/fastaconvtr] *Conversor of fasta/tfasta alignments (plus GTF) to tfasta/fasta/ms format: beta version*

**Ramos-Onsins, S.E.** and Vera, G. Used first in Guirao-Rico et al. 2017. *Heredity.* https://doi.org/10.1038/s41437-017-0002-9

fastaconvtr is a command-line application that converts tfasta (transposed fasta) or fasta alignment files (zipped or not) into fasta, tfasta or ms format (zipped or not). The application also reads GTF annotation files and can filter the regions or positions of interest (e.g., coding, synonymous, nonsynonymous and others). It can also output a weight file, which gives the weight of each position (for example, to filter coding/non-coding positions) for later analyses with the mstatspop program. Diploid sequences in fasta format can be encoded using IUPAC codes. When missing data are considered, homozygous positions with both alleles known are coded in uppercase (e.g., A means AA), and lowercase is used for positions where only one allele is known (e.g., a means AN).
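The diploid encoding described above can be sketched as a small decoder. This is only an illustration, not code from fastaconvtr: the function name `decode_genotype` is hypothetical, the heterozygous codes are the standard IUPAC two-base ambiguity symbols, and the handling of `N` (both alleles missing) is an assumption.

```python
# Standard IUPAC two-base ambiguity codes, read here as heterozygous genotypes.
HETEROZYGOUS = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}


def decode_genotype(code):
    """Return the two alleles encoded by one character of a diploid fasta."""
    if len(code) != 1:
        raise ValueError("expected a single character")
    if code in "ACGT":
        return code + code         # uppercase base: both alleles known (A -> AA)
    if code in "acgt":
        return code.upper() + "N"  # lowercase base: one allele missing (a -> AN)
    if code in HETEROZYGOUS:
        return HETEROZYGOUS[code]  # heterozygote (R -> AG)
    if code == "N":
        return "NN"                # assumption: both alleles missing
    raise ValueError(f"unsupported code: {code!r}")
```

For example, `decode_genotype("a")` gives `"AN"` and `decode_genotype("R")` gives `"AG"`.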

**DnaSP 6** [http://www.ub.edu/dnasp/] *DnaSP 6: DNA Sequence Polymorphism Analysis of Large Data Sets*

Rozas, J., Ferrer-Mata, A., Sánchez-DelBarrio, J.C., Guirao-Rico, S., Librado, P., **Ramos-Onsins, S.E.**, Sánchez-Gracia, A. 2017. DnaSP 6: DNA Sequence Polymorphism Analysis of Large Datasets. *Mol. Biol. Evol.* 34: 3299-3302. DOI: 10.1093/molbev/msx248

A new version of the popular tool for performing exhaustive population genetic analyses on multiple sequence alignments. This major upgrade incorporates novel functionalities to analyze large data sets, such as those generated by high-throughput sequencing technologies. Among other features, DnaSP 6 implements: 1) modules for reading and analyzing data from genomic partitioning methods, such as RADseq or hybrid enrichment approaches, 2) faster methods scalable for high-throughput sequencing data, and 3) summary statistics for the analysis of multi-locus population genetics data. Furthermore, DnaSP 6 includes novel modules to perform single- and multi-locus coalescent simulations under a wide range of demographic scenarios. The DnaSP 6 program, with extensive documentation, is freely available at http://www.ub.edu/dnasp.

**mspar** [https://github.com/cmontemuino/mspar] *parallelized ms coalescent simulator* (2016)

Montemuiño C., Espinosa A., Moure J. C., Vera Rodriguez G., **Ramos-Onsins S.E.**, Hernández Budé P. Montemuiño et al. 2016. *Evolutionary Bioinformatics.* 12: 223–228. doi: 10.4137/EBo.s40268.

A parallel version of Hudson's popular coalescent simulator.

**PHENIX** [http://cran.r-project.org/web/packages/PHENIX/index.html] *PHENIX: an R package to estimate a size-controlled phenotypic integration index.* 2015.

Torices, R., **Muñoz-Pajares, A. J.**

PHENIX provides functions to estimate a size-controlled phenotypic integration index, a bootstrapping method to calculate confidence intervals, and a randomization method to simulate null distributions and test the statistical significance of the integration. PHENIX is an open source package written in R. The functions in this package estimate phenotypic integration while controlling for a third variable (e.g., the size of the studied organ). PHENIX helps to estimate and test the statistical significance of the magnitude of integration using one of the most widely used methodological approaches, while taking size into account.

**GHcaller** [https://github.com/brunonevado/GHcaller] [https://github.com/CRAGENOMICA/pGHcaller] *Genotype/Haplotype SNP caller (version 0.0.1) (02122013)*

Nevado, B., **Ramos-Onsins, S.E.**, Perez-Enciso, M. Mol. Ecol. 2014. doi: 10.1111/mec.12693

Navarro et al. 2017. *Evolutionary Bioinformatics.* 13: 1–11.

GHcaller is a C++ program that calls SNPs from an mpileup file. It outputs either genotypes or haplotypes, depending on read depth and genotype likelihoods. The data are converted into fasta format. The algorithm is based on Lynch (2009) Genetics 182: 295-301 and Roesti et al. (2012) Molecular Ecology 21: 2852-2862. The distribution includes a README and an examples folder.

**PopGenome** [http://cran.r-project.org/] *PopGenome: An efficient swiss army knife for population genomic analyses in R.* (2014)

Pfeifer, B., Wittelsbürger, U., **Ramos-Onsins, S.E.**, Lercher M. J. 2014. *Mol Biol Evol.* 31: 1929-36. doi: 10.1093/molbev/msu136.

We collaborated with Martin J. Lercher's group on the construction of this very useful R library, which performs population genetics calculations. It can efficiently process genome-scale data as well as large sets of individual loci.

**npstats** [https://github.com/lucaferretti/npstat] *npstats: Population genetics tests and estimators for pooled NGS data.* (2013)

Ferretti, L., **Ramos-Onsins, S.E.**, Perez-Enciso, M. 2013. *Molecular Ecology.* DOI: 10.1111/mec.12522

This code implements some population genetics tests and estimators that can be applied to pooled sequences from Next Generation Sequencing experiments. The statistics are described in the paper "Population genomics from pool sequencing".

**HKAdirect** [https://github.com/CRAGENOMICA/HKAdirect] *Multilocus HKA test (beta version 0.70b)*

**Ramos-Onsins, S.E.**, Raineri, E., Ferretti, L. Used first in Esteve-Codina et al. 2013. *BMC Genomics.*

This program computes the HKA test from a data table of a population and a single outgroup individual. It computes the expected polymorphism and divergence as well as the theta values per nucleotide, the time to the ancestor, the partial HKA for each locus (window), the chi-square and the P-value. When missing values are included, the variance of S is calculated by simulation, which takes some time.

**SIDIER** [http://cran.r-project.org/web/packages/sidier/index.html] *SIDIER: substitution and indel distances to infer evolutionary relationships* (2013).

**Muñoz-Pajares, A. J.** 2013. *Methods in Ecology and Evolution.* 4: 1195–1200.

SIDIER is a software package that allows inferring evolutionary relationships from gapped alignments using information contained in both substitutions and insertions/deletions (indels). SIDIER estimates the number of indel events that occurred during sequence evolution to obtain a distance matrix. This indel distance matrix may be combined with the substitution distance matrix calculated separately from the same data set. Using this software, the inferred evolutionary events can be represented by means of percolation networks. SIDIER is written in the open source R language and is freely available through the Comprehensive R Archive Network.

**mlcoalsim v2** [https://github.com/CRAGENOMICA/mlcoalsim-v2] [https://github.com/cmontemuino/mlcoalsim-v2] *Multilocus Coalescent Simulations using parallel computing for ABC analysis: beta version 1.9916b (20170515)*

**Ramos-Onsins, S. E.** Used first in Heidel et al. 2010. *Molecular Ecology.* doi: 10.1111/j.1365-294X.2010.04761.x

This multilocus coalescent simulation program performs coalescent simulations under several demographic models and a selective model. This version is based on the published mlcoalsim (v1) application. In this version, the parameters can be supplied in separate prior files. It can also run the simulations on more than one processor using MPI (defined by the user). Furthermore, the input file has changed significantly relative to the first version.

**MANVa** [https://github.com/CRAGENOMICA/manva] *Multilocus Analysis of Nucleotide Variation (beta version)*

**Ramos-Onsins, S.E.**, Mitchell-Olds, T. Used first in **Ramos-Onsins** et al. 2008. *Molecular Ecology.*

Analysis of nucleotide variation from a population genetics point of view. The application calculates a large number of summary statistics and neutrality tests for up to 32K independent loci in a population, with or without an outgroup species. The input files (one per locus) must be in a folder and in fasta or NBRF format. The annotation files must be in a folder, in GFF 2 format, and must have the same names as the fasta files (except for the extension, which must be .gff). This software also runs coalescent simulations and calculates probabilities for the fit of the observed data to the simulated data.
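Because the annotation files are matched to the alignments by file name, it can help to check the two folders before running an analysis. The sketch below is a hypothetical helper, not part of MANVa: the function name `pair_inputs` and the `.fasta` suffix are assumptions, and only the `.gff` extension comes from the description above.

```python
from pathlib import Path


def pair_inputs(fasta_dir, gff_dir, fasta_suffix=".fasta"):
    """Match each per-locus fasta file with the GFF file of the same name.

    Returns (pairs, missing): a dict mapping fasta paths to GFF paths, and
    a list of fasta file names that have no matching .gff file.
    """
    pairs, missing = {}, []
    for fasta in sorted(Path(fasta_dir).glob(f"*{fasta_suffix}")):
        # Same base name, extension replaced by .gff.
        gff = Path(gff_dir) / (fasta.stem + ".gff")
        if gff.is_file():
            pairs[fasta] = gff
        else:
            missing.append(fasta.name)
    return pairs, missing
```

A non-empty `missing` list flags loci that would be analyzed without annotation, or not at all, depending on how the program treats them.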

**mlcoalsim v1** [https://github.com/CRAGENOMICA/mlcoalsim-v1] *Multilocus Coalescent Simulations: multilocus coalescent simulations* (2007)

**Ramos-Onsins, S.E.**, Mitchell-Olds T. 2007. *Evolutionary Bioinformatics.* 2: 41–44.

The application program mlcoalsim (multilocus coalescent simulations) is designed to generate samples and calculate neutrality tests and other statistics under a stationary model, several demographic models or strong positive selection using coalescent theory. It performs multilocus analyses, and both linked and unlinked loci are supported. Multilocus statistics for unlinked loci are the average and the variance of each statistic. It also allows recurrent mutations (multiple hits). In addition, it includes heterogeneity in the mutation rate and in the recombination rate across the length of the sequence; hotspots or a constant value for all positions are possible for both mutation and recombination. This program is based on a previous version of Hudson’s coalescent program ms (Hudson, 2002), modified for the above purposes. The function that calculates minimum recombinant values is a modification of Wall’s code (Wall, 2000). The gamma function was partially obtained from the code of Grassly, Adachi and Rambaut (Grassly et al., 1997). This program is distributed under the GNU GPL License. KNOWN BUGS: The logistic change of Ne is not working properly.
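The multilocus summary for unlinked loci mentioned above (the average and the variance of each statistic across loci) can be sketched as follows. This is an illustration only: the function name `multilocus_summary` and the dictionary layout are assumptions, and the sample variance (n - 1 denominator) is used here, which may differ from what mlcoalsim computes.

```python
import statistics


def multilocus_summary(per_locus):
    """Average and variance across unlinked loci for each statistic.

    per_locus maps a statistic name to a list with one value per locus.
    """
    return {
        name: (statistics.mean(values), statistics.variance(values))
        for name, values in per_locus.items()
    }
```

For instance, `multilocus_summary({"pi": [1.0, 2.0, 3.0]})` returns a mean of 2.0 and a variance of 1.0 for `pi`.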