8.1 Mendel’s Experiments

Learning objectives.

  • Explain the scientific reasons for the success of Mendel’s experimental work
  • Describe the expected outcomes of monohybrid crosses involving dominant and recessive alleles

Johann Gregor Mendel (1822–1884) ( Figure 8.2 ) was a lifelong learner, teacher, scientist, and man of faith. As a young adult, he joined the Augustinian Abbey of St. Thomas in Brno in what is now the Czech Republic. Supported by the monastery, he taught physics, botany, and natural science courses at the secondary and university levels. In 1856, he began a decade-long research pursuit involving inheritance patterns in honeybees and plants, ultimately settling on pea plants as his primary model system (a system with convenient characteristics that is used to study a specific biological phenomenon to gain understanding to be applied to other systems). In 1865, Mendel presented the results of his experiments with nearly 30,000 pea plants to the local natural history society. He demonstrated that traits are transmitted faithfully from parents to offspring in specific patterns. In 1866, he published his work, Experiments in Plant Hybridization, 1 in the proceedings of the Natural History Society of Brünn. As stated earlier, in genetics, "parent" is often used to describe the individual organism(s) that contribute genetic material to an offspring, usually in the form of gamete cells.

Mendel’s work went virtually unnoticed by the scientific community, which incorrectly believed that the process of inheritance involved a blending of parental traits that produced an intermediate physical appearance in offspring. This hypothetical process appeared to be correct because of what we know now as continuous variation. Continuous variation is the range of small differences we see among individuals in a characteristic like human height. It does appear that offspring are a “blend” of their parents’ traits when we look at characteristics that exhibit continuous variation. Mendel worked instead with traits that show discontinuous variation . Discontinuous variation is the variation seen among individuals when each individual shows one of two—or a very few—easily distinguishable traits, such as violet or white flowers. Mendel’s choice of these kinds of traits allowed him to see experimentally that the traits were not blended in the offspring as would have been expected at the time, but that they were inherited as distinct traits. In 1868, Mendel became abbot of the monastery and exchanged his scientific pursuits for his pastoral duties. He was not recognized for his extraordinary scientific contributions during his lifetime; in fact, it was not until 1900 that his work was rediscovered, reproduced, and revitalized by scientists on the brink of discovering the chromosomal basis of heredity.

Mendel’s Crosses

Mendel’s seminal work was accomplished using the garden pea, Pisum sativum, to study inheritance. This species naturally self-fertilizes, meaning that pollen encounters ova within the same flower. Because every pea plant has both male reproductive organs and female reproductive organs, each plant produces both types of gametes required for reproduction: pollen and ova. In plants, just as in animals, reproductive organs are classified by the size of the gametes produced. The organs producing the smaller pollen are called male reproductive organs, while the organs producing the larger ova are called female reproductive organs.

In garden peas, the flower petals remain sealed tightly until pollination is completed to prevent the pollination of other plants. The result is highly inbred, or “true-breeding,” pea plants. These are plants that always produce offspring that look like the parent. By experimenting with true-breeding pea plants, Mendel avoided the appearance of unexpected traits in offspring that might occur if the plants were not true-breeding. The garden pea also grows to maturity within one season, meaning that several generations could be evaluated over a relatively short time. Finally, large quantities of garden peas could be cultivated simultaneously, allowing Mendel to conclude that his results did not come about simply by chance.

Mendel performed hybridizations, which involve mating two true-breeding individuals that have different traits. In the pea, which is naturally self-pollinating, this is done by manually transferring pollen from the anther of a mature pea plant of one variety to the stigma of a separate mature pea plant of the second variety.

Plants used in first-generation crosses were called P, or parental generation, plants (Figure 8.3). Mendel collected the seeds produced by the P plants that resulted from each cross and grew them the following season. These offspring were called the F1, or the first filial (filial = daughter or son), generation. Once Mendel examined the characteristics in the F1 generation of plants, he allowed them to self-fertilize naturally. He then collected and grew the seeds from the F1 plants to produce the F2, or second filial, generation. Mendel’s experiments extended beyond the F2 generation to the F3 generation, F4 generation, and so on, but it was the ratios of characteristics in the P, F1, and F2 generations that were the most intriguing and became the basis of Mendel’s postulates.

Garden Pea Characteristics Revealed the Basics of Heredity

In his 1866 publication, Mendel reported the results of his crosses involving seven different characteristics, each with two contrasting traits. A trait is defined as a variation in the physical appearance of a heritable characteristic. The characteristics included plant height, seed texture, seed color, flower color, pea-pod size, pea-pod color, and flower position. For the characteristic of flower color, for example, the two contrasting traits were white versus violet. To fully examine each characteristic, Mendel generated large numbers of F1 and F2 plants and reported results from thousands of F2 plants.

What results did Mendel find in his crosses for flower color? First, Mendel confirmed that he was using plants that bred true for white or violet flower color. Irrespective of the number of generations that Mendel examined, all self-crossed offspring of parents with white flowers had white flowers, and all self-crossed offspring of parents with violet flowers had violet flowers. In addition, Mendel confirmed that, other than flower color, the pea plants were physically identical. This was an important check to make sure that the two varieties of pea plants only differed with respect to one trait, flower color.

Once these validations were complete, Mendel applied the pollen from a plant with violet flowers to the stigma of a plant with white flowers. After gathering and sowing the seeds that resulted from this cross, Mendel found that 100 percent of the F1 hybrid generation had violet flowers. Conventional wisdom at that time would have predicted the hybrid flowers to be pale violet or for hybrid plants to have equal numbers of white and violet flowers. In other words, the contrasting parental traits were expected to blend in the offspring. Instead, Mendel’s results demonstrated that the white flower trait had completely disappeared in the F1 generation.

Importantly, Mendel did not stop his experimentation there. He allowed the F1 plants to self-fertilize and found that 705 plants in the F2 generation had violet flowers and 224 had white flowers. This was a ratio of 3.15 violet flowers to one white flower, or approximately 3:1. Mendel performed an additional experiment to ascertain differences in inheritance of traits carried in the pollen versus the ovum. When Mendel transferred pollen from a plant with violet flowers to fertilize the ova of a plant with white flowers and vice versa, he obtained approximately the same ratio irrespective of which gamete contributed which trait. This is called a reciprocal cross: a paired cross in which the respective traits of the male and female in one cross become the respective traits of the female and male in the other cross. For the other six characteristics that Mendel examined, the F1 and F2 generations behaved in the same way that they behaved for flower color. One of the two traits would disappear completely from the F1 generation, only to reappear in the F2 generation at a ratio of roughly 3:1 (Figure 8.4).
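The arithmetic behind the reported ratio can be checked in a few lines. The sketch below uses only the two counts given above. The chi-square goodness-of-fit statistic is not part of the text; it is added here as one common way to ask whether 705:224 is compatible with an exact 3:1 expectation.

```python
# Counts reported in the text for the F2 generation (flower color).
violet = 705
white = 224
total = violet + white

ratio = violet / white  # observed violet : white ratio

# Expected counts under a 3:1 dominant : recessive hypothesis.
expected_violet = total * 3 / 4
expected_white = total * 1 / 4

# Pearson chi-square statistic (1 degree of freedom for two classes).
chi_square = (
    (violet - expected_violet) ** 2 / expected_violet
    + (white - expected_white) ** 2 / expected_white
)

print(f"observed ratio: {ratio:.2f} : 1")
print(f"chi-square: {chi_square:.3f}")
```

The statistic comes out well below 3.841, the usual 5 percent critical value for one degree of freedom, so the observed counts do not contradict a 3:1 ratio.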

Upon compiling his results for many thousands of plants, Mendel concluded that the characteristics could be divided into expressed and latent traits. He called these dominant and recessive traits, respectively. Dominant traits are those that are inherited unchanged in a hybridization. Recessive traits become latent, or disappear in the offspring of a hybridization. The recessive trait does, however, reappear in the progeny of the hybrid offspring. An example of a dominant trait is the violet-colored flower trait. For this same characteristic (flower color), white-colored flowers are a recessive trait. The fact that the recessive trait reappeared in the F2 generation meant that the traits remained separate (and were not blended) in the plants of the F1 generation. Mendel proposed that this was because the plants possessed two copies of the trait for the flower-color characteristic, and that each parent transmitted one of their two copies to their offspring, where they came together. Moreover, the physical observation of a dominant trait could mean that the genetic composition of the organism included two dominant versions of the characteristic, or that it included one dominant and one recessive version. Conversely, the observation of a recessive trait meant that the organism lacked any dominant versions of this characteristic.
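Mendel's two-copy proposal can be illustrated by enumerating every pairing of parental copies, which is what a Punnett square does. In this minimal sketch the letters `V` (violet, dominant) and `v` (white, recessive) are notation chosen for illustration and do not appear in the text above.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Return genotype counts for all equally likely pairings of copies."""
    offspring = Counter()
    for a, b in product(parent1, parent2):
        # Sort so that "Vv" and "vV" count as the same genotype
        # (uppercase letters sort before lowercase ones).
        offspring["".join(sorted(a + b))] += 1
    return offspring

def phenotype(genotype):
    """Violet is dominant: a single 'V' copy is enough to show it."""
    return "violet" if "V" in genotype else "white"

f1 = cross("VV", "vv")  # true-breeding violet x true-breeding white
f2 = cross("Vv", "Vv")  # self-fertilized F1 hybrid

f2_phenotypes = Counter()
for genotype, n in f2.items():
    f2_phenotypes[phenotype(genotype)] += n

print(dict(f1))             # every F1 plant is a Vv hybrid
print(dict(f2))             # genotypes in a 1:2:1 proportion
print(dict(f2_phenotypes))  # phenotypes in a 3:1 proportion
```

All F1 offspring carry one copy of each version and appear violet, while the F2 enumeration reproduces the 3:1 violet-to-white proportion described above.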

  • 1 Johann Gregor Mendel, “Versuche über Pflanzenhybriden.” Verhandlungen des naturforschenden Vereines in Brünn , Bd. IV für das Jahr, 1865 Abhandlungen (1866):3–47. [for English translation, see http://www.mendelweb.org/Mendel.plain.html]

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/concepts-biology/pages/1-introduction
  • Authors: Samantha Fowler, Rebecca Roush, James Wise
  • Publisher/website: OpenStax
  • Book title: Concepts of Biology
  • Publication date: Apr 25, 2013
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/concepts-biology/pages/1-introduction
  • Section URL: https://openstax.org/books/concepts-biology/pages/8-1-mendels-experiments

© Jul 10, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

Module 8: Cell Division

Genetic Variation in Meiosis

Learning outcomes.

  • Understand how meiosis contributes to genetic diversity

The gametes produced in meiosis aren’t genetically identical to the starting cell, and they also aren’t identical to one another. As an example, consider the meiosis II diagram below, which shows the end products of meiosis for a simple cell with a diploid number of 2 n = 4 chromosomes. The four gametes produced at the end of meiosis II are all slightly different, each with a unique combination of the genetic material present in the starting cell.

As it turns out, there are many more potential gamete types than just the four shown in the diagram below, even for a simple cell with only four chromosomes. This diversity of possible gametes reflects two factors: crossing over and the random orientation of homologue pairs during metaphase of meiosis I.

  • Crossing over. The points where homologues cross over and exchange genetic material are chosen more or less at random, and they will be different in each cell that goes through meiosis. If meiosis happens many times, as it does in human ovaries and testes, crossovers will happen at many different points. This repetition produces a wide variety of recombinant chromosomes, chromosomes where fragments of DNA have been exchanged between homologues.
  • Random orientation of homologue pairs. The random orientation of homologue pairs during metaphase of meiosis I is another important source of gamete diversity.

Diagram showing the relationship between chromosome configuration at meiosis I and homologue segregation to gametes. The diagram depicts a simplified case in which an organism only has 2n = 4 chromosomes. In this case, four different types of gametes may be produced, depending on whether the maternal homologues are positioned on the same side or on opposite sides of the metaphase plate.

During metaphase I, each pair of homologues effectively flips a coin to decide which chromosome goes into which group. In a cell with just two pairs of homologous chromosomes, like the one in the diagram, random metaphase orientation allows for 2^2 = 4 different types of possible gametes. In a human cell, the same mechanism allows for 2^23 = 8,388,608 different types of possible gametes [1]. And that’s not even considering crossovers!

Given those kinds of numbers, it’s very unlikely that any two sperm or egg cells made by a person will be the same. It’s even more unlikely that you and your sibling(s) will be genetically identical, unless you happen to be identical twins, thanks to the process of fertilization (in which a unique egg from the maternal parent combines with a unique sperm from the paternal parent, making a zygote whose genotype is well beyond one-in-a-trillion!) [2] .
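The counts quoted above follow from simple powers of two. The sketch below enumerates the two-pair case and computes the human figures. The labels `M` and `P` for maternal and paternal homologues are illustrative, and crossing over is ignored, so these are lower bounds.

```python
from itertools import product

def gamete_types(n_pairs):
    """Number of gamete types produced by independent orientation alone."""
    return 2 ** n_pairs

# Enumerate the 2n = 4 case: each homologue pair contributes either its
# maternal (M) or paternal (P) chromosome to a gamete.
small_case = ["".join(combo) for combo in product("MP", repeat=2)]

human_gametes = gamete_types(23)

# Any egg type can meet any sperm type at fertilization.
zygote_combinations = human_gametes ** 2

print(small_case)
print(human_gametes)
print(zygote_combinations)
```

The last figure is about 70 trillion, which is why two siblings who are not identical twins are so unlikely to be genetically identical.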

Meiosis and fertilization create genetic variation by making new combinations of gene variants (alleles). In some cases, these new combinations may make an organism more or less fit (able to survive and reproduce), thus providing the raw material for natural selection. Genetic variation is important in allowing a population to adapt via natural selection and thus survive in the long term.

  • [1] Reece, J. B., L. A. Urry, M. L. Cain, S. A. Wasserman, P. V. Minorsky, and R. B. Jackson. "Genetic Variation Produced in Sexual Life Cycles Contributes to Evolution." In Campbell Biology, 263–65. 10th ed. San Francisco, CA: Pearson, 2011.
  • [2] Ibid.
  • Meiosis. Provided by : Khan Academy. Located at : https://www.khanacademy.org/science/biology/cellular-molecular-biology/meiosis/a/phases-of-meiosis . License : CC BY-NC-SA: Attribution-NonCommercial-ShareAlike


  • Open access
  • Published: 03 September 2024

Variant graph craft (VGC): a comprehensive tool for analyzing genetic variation and identifying disease-causing variants

  • Jennifer Li 1 ,
  • Andy Yang 2 ,
  • Benedito A. Carneiro 3 , 4 ,
  • Ece D. Gamsiz Uzun 4 , 5 , 6 , 7 , 9 ,
  • Lauren Massingham 8 &
  • Alper Uzun 4 , 5 , 6 , 7 , 9 , 10  

BMC Bioinformatics volume 25, Article number: 288 (2024)

The variant call format (VCF) file is a structured and comprehensive text file crucial for researchers and clinicians in interpreting and understanding genomic variation data. It contains essential information about variant positions in the genome, along with alleles, genotype calls, and quality scores. Analyzing and visualizing these files, however, poses significant challenges due to the need for diverse resources and robust features for in-depth exploration.

To address these challenges, we introduce variant graph craft (VGC), a VCF file visualization and analysis tool. VGC offers a wide range of features for exploring genetic variations, including extraction of variant data, intuitive visualization, and graphical representation of samples with genotype information. VGC is designed primarily for the analysis of patient cohorts, but it can also be adapted for use with individual probands or families. It integrates seamlessly with external resources, providing insights into gene function and variant frequencies in sample data. VGC includes gene function and pathway information from the Molecular Signatures Database (MSigDB) for GO terms, KEGG, Biocarta, Pathway Interaction Database, and Reactome. Additionally, it dynamically links to gnomAD for variant information and incorporates ClinVar data for pathogenic variant information. VGC supports the human genome assemblies GRCh37 and GRCh38, ensuring compatibility with a wide range of data sets, and accommodates various approaches to exploring genetic variation data. It can be tailored to specific user needs with optional phenotype input data.

Conclusions

In summary, VGC provides a comprehensive set of features tailored to researchers working with genomic variation data. Its intuitive interface, rapid filtering capabilities, and the flexibility to perform queries using custom groups make it an effective tool in identifying variants potentially associated with diseases. VGC operates locally, ensuring data security and privacy by eliminating the need for cloud-based VCF uploads, making it a secure and user-friendly tool. It is freely available at https://github.com/alperuzun/VGC .

Introduction

In recent years, advancements in genome sequencing technologies have enabled researchers to generate vast amounts of genomics data. However, with this flood of information comes the need for tools that can analyze and visualize this data effectively. One of the key challenges in analyzing genetic data is dealing with the complexity and the size of variant data stored in VCF files. These files contain information about genetic variations, including single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variations. Analyzing VCF files is a complex task that necessitates several steps, including indexing, filtering, extracting, visualization, and detailed analysis of genetic variations, preferably with annotations. The conventional approach to VCF file visualization predominantly relies on command-line tools, posing a significant challenge for those not well-versed in terminal-based operations.

While existing tools offer summaries and some level of interactivity, they face notable challenges, particularly in scalability and user-friendliness. One of the primary issues is scalability; handling large datasets can be daunting due to performance bottlenecks and inefficient data processing. This scalability challenge stems from the inherent complexity and size of genomic data, which requires robust and efficient tools to manage effectively [ 1 ]. Current tools such as vcflib, bio-vcf, cyvcf2, hts-nim, slivar and re-Searcher have been developed to provide solutions for processing VCF files, aiming to mitigate the scalability issue by optimizing for large datasets [ 2 , 3 ]. Another limitation of these tools is the lack of or limited interactivity, as many of them do not provide dynamic and interactive environments for exploring variant data. This can make it difficult for researchers to fully understand and analyze the data and explore potential associations between genetic variants and phenotypes. In addition, some of the existing VCF file visualizing tools can be confusing to use and may require significant expertise to operate effectively. Some tools have too many dependencies based on the origin of the programming language and new updates may crash the program, which can add to the complexity of using these tools. Furthermore, compatibility issues may arise due to the different VCF file formats used by different tools, which can make it difficult to compare results between different tools.

To address these challenges and limitations, several user-friendly VCF file visualization and analysis tools have been developed that offer a wide range of features for visualizing genetic variations and exporting filtered data. In the field of genomic research, there are several well-known bioinformatics tools that significantly enhance data analysis and visualization capabilities. These include IGV (Integrative Genomics Viewer), which offers an interactive platform for genomic datasets visualization [ 4 ]; VCF-Server, tailored for managing and querying VCF files [ 5 ]; VCF.Filter, allowing for the intricate filtering of VCF files [ 6 ]; and BrowseVCF, providing a user-friendly interface for VCF file exploration [ 7 ]. Additionally, GEMINI (Genome Exploration and Mining INteractive Interface) focuses on the integrative analysis and variant prioritization within VCF files [ 8 ]. VCF-Miner is a standalone, GUI-based tool for mining and filtering VCF file variants, using a MongoDB engine to identify relevant variants in various organisms [ 9 ]. VCFtools is a comprehensive package for manipulating and interpreting VCF files, including data comparison, summarization, and statistical analysis [ 10 ]. Visualization of Variants (VIVA) is designed for the intuitive visualization and analysis of genomic variants, facilitating complex data interpretation through a graphical interface [ 11 ]. Together, these tools form a robust suite for genomic data management, analysis, and visualization, catering to a variety of research needs in the genomics field. However, despite the improvements made, there is still room for further enhancements to improve scalability, customizability, interactivity, complexity, and compatibility. To overcome these limitations, we have developed Variant Graph Craft (VGC), a VCF analysis and visualization tool designed to extract and visualize variant data from VCF files with multiple customizable options. VGC is designed primarily for analyzing patient cohorts. However, VGC can also be adapted for the analysis of individual probands or families, providing flexibility for various research and clinical scenarios.

In addition to the tools for VCF visualization and analysis, the field of rare disease analysis benefits from numerous VCF annotation, filtering, and prioritization tools that integrate patient phenotype information. According to a comprehensive evaluation by Yuan et al., over 20 such tools, including both open-source and commercial options, have been developed to enhance the identification of disease-causing genes in patients with Mendelian disorders [ 12 ]. Tools like LIRICAL, AMELIE, and Exomiser, which use Human Phenotype Ontology (HPO) terms in conjunction with VCF files, have shown superior performance in accurately prioritizing candidate genes compared to those relying solely on phenotypic data [ 13 , 14 , 15 , 16 ].

VGC addresses several challenges associated with the analysis and visualization of genetic variation data from VCF files. It provides a platform for comprehensive variant data extraction and visualization, enabling users to efficiently browse through genetic variations with details on variant positions, alleles, genotype calls, and quality scores. By transforming complex genomic data into interactive graphical representations, VGC facilitates easy identification of patterns across samples, enhancing the understanding of genetic landscapes. The integration of information from publicly available databases such as MSigDB, KEGG, Biocarta, Pathway Interaction Database (PID), Reactome, gnomAD, and ClinVar enriches the analysis with valuable insights into gene functions, variant frequencies, and pathogenic variants. Operating locally, VGC ensures the privacy and security of sensitive genomic data, a critical feature that sidesteps the need for cloud uploads and thus addresses significant privacy concerns. Its compatibility with the human genome assemblies GRCh37 and GRCh38 ensures that VGC is adaptable and applicable to a wide array of genomic studies. Furthermore, the tool's ability to incorporate optional phenotype input data allows for customized analysis tailored to specific research questions or clinical contexts, thereby facilitating deeper investigations into genotype–phenotype relationships. Through these features, VGC overcomes scalability, interactivity, complexity, and data security challenges, establishing itself as a valuable resource for researchers and clinicians working in genomic variation analysis.

Implementation

VGC is a tool designed for analyzing variant data and visualizing VCF files. It utilizes a range of technologies and libraries to offer a user-friendly experience (Fig.  1 ).

figure 1

Design and integration of VGC. The query pipeline of VGC offers four distinct search options, as well as knowledge-based support with visualization and analysis. Within a given VCF file, users may choose to query single gene names or genomic locations as well as multiple genes or genomic locations simultaneously via file upload options. Relevant information pertaining to the queried variants is retrieved from stored files, thus allowing for efficient variant extraction from the uploaded VCF. The identified variations may then be displayed using interactive graphics, such as histograms, node graphs, spreadsheets, heat maps, sample comparisons, and gene data visualization. The pipeline is supported by several integrated databases and packages, allowing for rich analyses and visualizations

Programming languages, applications and libraries

VGC is a desktop application created using a JavaScript frontend and Java backend. The application is currently built using the webpack [ 17 ] module bundler, version 5.86.0, and packaged for macOS, Windows, and Linux using electron-forge [ 18 ]. Communication between the frontend and backend of VGC is handled by the Axios HTTP library [ 19 ]. Packaging with Electron allows the tool to be easily installed and run on a wide range of platforms and operating systems [ 20 ].

UI components are created using the React framework [ 21 ] version 18.2.0, and styled using Tailwind CSS [ 22 ]. To generate highly interactive and dynamic graphics for data visualization, the application utilizes a range of libraries, including Syncfusion [ 23 ], react-force-graph [ 24 ], and Recharts [ 25 ]. These libraries provide a range of tools and functionalities for the visualization and analysis of complex data sets.

Integration of publicly available databases

VGC draws from a range of public databases, including MSig Database for GO terms, as well as KEGG, Biocarta, PID, and Reactome [ 26 , 27 , 28 , 29 , 30 ]. By leveraging these powerful databases, VGC is able to provide users with rich and detailed information about the genetic pathways and functions associated with their variant data, allowing for deeper insights and a greater understanding of the underlying biology. VGC also includes a dynamic link to gnomAD for variant information, allowing users to easily access and explore genetic variation data from this well-known database [ 31 ]. Additionally, the tool includes ClinVar data for pathogenic variant information, providing users with different visualization options for identifying and understanding potentially harmful genetic mutations [ 32 ]. VGC supports the Human Genome Assemblies GRCh37 and GRCh38, ensuring compatibility with a wide range of data sets. The tool provides a range of options for exploring genetic variation, and can be tailored to the specific needs of the user by using optional phenotype input data.

Dynamic link to gnomAD for variant information

VGC's dynamic link to gnomAD, a widely used database for variant information, gives users a seamless connection to up-to-date and comprehensive variant data. The decision to implement a dynamic link specifically to gnomAD, as opposed to other databases, stems from its unique role as an aggregation database of genetic variation. This distinctive feature consolidates variant information from a variety of sources, providing a comprehensive resource. By establishing this dynamic link, VGC ensures that users have access to the latest information on variant frequencies and population-specific data. This integration enhances the accuracy and reliability of variant interpretation, empowering researchers to make informed decisions based on the most current genomic data available.
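The article does not describe how the link is constructed. One plausible reading is that a variant's coordinates are turned into a gnomAD browser URL, as in the sketch below. The function name, the `chrom-pos-ref-alt` identifier layout, and the dataset identifiers are assumptions made for illustration, not VGC's documented behavior, and should be checked against the gnomAD browser before use.

```python
from urllib.parse import urlencode

GNOMAD_BASE = "https://gnomad.broadinstitute.org/variant/"

# Assumed dataset identifiers per assembly; verify against the gnomAD browser.
DATASET_BY_ASSEMBLY = {"GRCh37": "gnomad_r2_1", "GRCh38": "gnomad_r4"}

def gnomad_url(chrom, pos, ref, alt, assembly):
    """Build a browser link for one variant (hypothetical helper)."""
    # Drop a leading "chr" so "chr1" and "1" give the same identifier.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    variant_id = f"{chrom}-{pos}-{ref}-{alt}"
    query = urlencode({"dataset": DATASET_BY_ASSEMBLY[assembly]})
    return f"{GNOMAD_BASE}{variant_id}?{query}"

url = gnomad_url("chr1", 55516888, "G", "GA", "GRCh37")
print(url)
```

Building the link at query time, rather than bundling a frequency table, is what keeps the displayed data current with each gnomAD release.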

Incorporation of ClinVar data for pathogenic variant information

Inclusion of ClinVar data within VGC provides information on pathogenic variants and their clinical significance. By incorporating ClinVar data, VGC enables users to assess the potential pathogenicity of identified variants. Users can access curated information on variants that have been associated with specific diseases or conditions. This integration aids in variant prioritization, helping users focus on variants that may have clinical implications and guiding further investigation.

Compatibility with human genome assemblies GRCh37 and GRCh38

VGC is designed to work seamlessly with these widely-used genome assemblies, ensuring compatibility with a broad range of datasets. By supporting both GRCh37 and GRCh38, VGC enables users to analyze genomic variation data generated using different platforms and datasets aligned to these assemblies. This compatibility enhances the versatility and applicability of VGC, making it a valuable tool for a wide range of genomics studies and research projects.

User input and preprocessing

Upon opening, VGC displays a “welcome” page, allowing users to begin analyses for genome assemblies GRCh37 or GRCh38 (Fig.  2 ). For a given analysis, users may input two files: (1) a required VCF file, and (2) a supplemental and optional phenotype file specifying sample groupings.

figure 2

VGC user interface on startup. Users may begin an analysis by selecting a genome assembly (GRCh37 or GRCh38) and uploading the respective VCF file

Extraction and indexing of VCF

When a new VCF file is uploaded to the program, VGC processes it to extract pertinent information, which is then stored in the user's file system. A new directory named “VGCGeneratedFiles” is created in the user's home directory, along with a corresponding directory that follows a specific naming scheme.

For each VCF file processed, a directory named “VGC_<filename>” is created. Inside these directories, two text files, named info_<filename> and index_<filename>, store important data. The info_<filename> file holds overall file information, such as the VCF file version, total number of samples, total number of chromosomes, number of variants, the header line, and a list of chromosomes in the file. The index_<filename> file contains chromosome-specific information. This indexing by VGC enhances response times for future queries. For each chromosome in the VCF file, the following details are listed in the index file: starting and ending lines, starting and ending positions, number of variants marked as “PASS,” and the count of pathogenic variants for that chromosome.
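The indexing pass described above can be sketched as a single scan over the file. VGC's backend is written in Java and its actual file layout is not given here, so the function name and dictionary keys below are hypothetical. The sketch also assumes that records for each chromosome are contiguous, and it omits the pathogenic-variant count, which would require ClinVar data.

```python
import io

def build_index(lines):
    """Collect per-chromosome line and position ranges from VCF text lines."""
    index = {}
    for line_no, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the header line, and blanks
        fields = line.rstrip("\n").split("\t")
        # Fixed VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
        chrom, pos, filt = fields[0], int(fields[1]), fields[6]
        entry = index.setdefault(chrom, {
            "start_line": line_no, "end_line": line_no,
            "start_pos": pos, "end_pos": pos, "pass_count": 0,
        })
        entry["end_line"] = line_no
        entry["start_pos"] = min(entry["start_pos"], pos)
        entry["end_pos"] = max(entry["end_pos"], pos)
        if filt == "PASS":
            entry["pass_count"] += 1
    return index

# A tiny made-up VCF used only to exercise the function.
sample_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "1\t250\t.\tC\tT\t12\tq10\t.\n"
    "2\t75\t.\tG\tA\t99\tPASS\t.\n"
)
index = build_index(sample_vcf)
print(index)
```

Storing line ranges alongside position ranges lets a later query jump directly to one chromosome's block instead of rescanning the whole file.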

Customization to suit individual user requirements by incorporating optional phenotype input data

VGC allows users to incorporate additional phenotype information, aligning the analysis with specific research questions or clinical contexts. By incorporating phenotype input data, VGC enables users to explore genetic variations in the context of specific phenotypic traits, enhancing the understanding of genotype–phenotype relationships. This customization feature makes VGC adaptable to various research and clinical scenarios, ensuring that users can leverage the tool to its full potential in their specific domain of interest.

User queries and visualization

Query options.

Users have the flexibility to search for specific genes or defined genomic ranges within the VCF file, enabling focused analysis of variants. When searching by gene, all variants corresponding to that gene within the VCF file are visualized. Alternatively, users can specify a genomic range, extracting and visualizing variants within the defined interval.

The variant extraction process utilizes the information stored in the index_<filename> file, which, as described earlier, provides the starting and ending lines of chromosomes within the VCF file. Depending on the user's selection of GRCh37 or GRCh38 as the reference genome assembly, the system accurately retrieves the relevant variants. Additionally, users can streamline their analysis by uploading a file containing multiple genes or genomic ranges, facilitating simultaneous querying of multiple genes or ranges. Variants associated with each queried gene or range are then extracted and visualized.
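As a rough illustration of index-assisted extraction, the Python sketch below scans only the lines belonging to the requested chromosome. The function `query_range` and the layout of `index` (a dictionary with `start_line`, `end_line`, `start_pos`, and `end_pos` per chromosome) are assumptions made for this example, not VGC's actual data structures. A gene query could be reduced to a range query by first looking up the gene's coordinates for the selected assembly; that lookup is not shown.

```python
def query_range(vcf_lines, index, chrom, start, end):
    """Return (chrom, pos, ref, alt) for variants with start <= pos <= end.

    Hypothetical sketch: `index` maps a chromosome name to the 1-based,
    inclusive line span and position span of its records in `vcf_lines`.
    """
    entry = index.get(chrom)
    if entry is None or end < entry["start_pos"] or start > entry["end_pos"]:
        return []  # chromosome absent, or interval outside its position span
    hits = []
    # Only the chromosome's own block of lines is scanned.
    for line in vcf_lines[entry["start_line"] - 1:entry["end_line"]]:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[0] == chrom and start <= pos <= end:
            hits.append((fields[0], pos, fields[3], fields[4]))
    return hits


def query_many(vcf_lines, index, ranges):
    """Run several (chrom, start, end) queries, as with an uploaded range file."""
    return {r: query_range(vcf_lines, index, *r) for r in ranges}
```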

Visualization options

VGC offers a diverse range of visualization options tailored to meet various analytical needs.

When a VCF file is initially uploaded, a default bar graph view will display all variants by chromosome present in the file, with each bar corresponding to the number of variants within a specific chromosome. Users can navigate through viewing history using forward and backward arrows. Hovering over a bar reveals details indicating the number of variants displayed as well as the corresponding genomic range. Clicking on a bar enables zoom functionality for a closer examination of variants within the selected data.

Variant data may also be presented in a structured table format, enhancing accessibility and ease of analysis. Users may choose to filter, sort, export, or otherwise manipulate data in a spreadsheet-like display.

For analysis of case–control studies, sample groupings, or sample genotypes, VGC provides a node graph visualization option. Users may toggle between 2D and 3D views, facilitating interactive exploration of variant relationships. Moreover, the tool provides Fisher’s Exact Test data for each variant relative to sample groups. The test assesses differences in variant abundance between designated groups (e.g., cases vs. controls) through Monte Carlo simulation. By analyzing a 2 × 3 matrix with the default number of simulations (n = 2000), potential associations between variants and sample groups can be discerned, aiding in phenotype-genotype analyses.
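A Monte Carlo version of Fisher's Exact Test on a 2 × 3 table can be sketched as follows. The text does not give VGC's exact procedure, so this is one common formulation under stated assumptions: rows are the two sample groups, columns are three genotype classes, column labels are shuffled so that both margins stay fixed, and the p-value is the share of simulated tables that are no more probable than the observed one. The function names and the example counts are hypothetical.

```python
import math
import random


def log_table_weight(table):
    """-sum(log(n_ij!)); for fixed margins, larger means a more probable table."""
    return -sum(math.lgamma(cell + 1) for row in table for cell in row)


def monte_carlo_fisher(table, n_sim=2000, seed=0):
    """Estimate a Fisher exact p-value for an r x c table by simulation."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # Expand the table into one (row label, column label) pair per sample.
    row_labels, col_labels = [], []
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            row_labels.extend([r] * cell)
            col_labels.extend([c] * cell)
    observed = log_table_weight(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(col_labels)  # permuting labels keeps both margins fixed
        sim = [[0] * n_cols for _ in range(n_rows)]
        for r, c in zip(row_labels, col_labels):
            sim[r][c] += 1
        if log_table_weight(sim) <= observed + 1e-7:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_sim + 1)


# Made-up counts: rows = cases, controls; columns = three genotype classes.
example_table = [[12, 5, 3], [4, 6, 10]]
p_value = monte_carlo_fisher(example_table)
```

Simulation avoids enumerating every table with the observed margins, which becomes expensive for tables larger than 2 × 2; the trade-off is that the p-value is an estimate whose resolution is limited by the number of simulations.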

Secure and private local environment for data analysis

VGC is designed to run on the local machine or servers, ensuring that users can work with their genomic data in a secure and confidential setting. By avoiding the need to upload VCF files to the cloud, VGC protects sensitive genomic data and addresses privacy concerns. This local deployment approach instills a sense of reassurance in users, as they can confidently maintain control over their data, ensuring it stays within their organization's infrastructure. VGC requires Java version 1.8 or higher to run and is compatible with Windows, Mac, and Linux, offering flexibility for users across different platforms.

VGC features advanced visualization tools for VCF files. Demonstrating VGC's capabilities, we present an example using whole exome sequencing data from preeclamptic patients and term mothers (Fig.  3 ). The dataset includes 143 samples: 61 early onset severe preeclamptic cases and 82 term mother controls [ 33 ]. Through VGC, we offer a detailed analysis of this dataset, emphasizing major trends, statistical findings, and key outcomes aligned with our research goals. The insights gleaned from this study significantly enhance our understanding of variants associated with preeclampsia and offer valuable information for future research and practical applications.

Figure 3. Schematic overview of case–control study to VGC input. To illustrate VGC's capabilities, we present a case study of early onset severe preeclamptic mothers (n = 61) and term mothers (n = 82). Whole exome sequencing of the described case–control samples and subsequent variant calling allowed for the creation of (1) a VCF file and (2) a customized phenotype file as VGC input.

Comprehensive variant data extraction and visualization

VGC excels in variant browsing, offering features that enable effective exploration and analysis of genetic variations. It efficiently retrieves crucial data such as variant positions, alleles, genotype calls, and quality scores, offering a comprehensive and structured view of genomic variations for researchers and clinicians. For example, we demonstrate the visualization of variants in TTN, a gene with pathogenic, nominally significant variants identified in univariate analysis (Fig. 4). TTN variants are displayed in a histogram, sorted by variant position. Variants in intronic and exonic regions are differentiated by color (Fig. 4a). Users have the option to filter variants by categories such as “ALL,” “PASS,” or “Pathogenic.” VGC’s visualization capabilities extend beyond basic displays, offering sophisticated graphical representations that deepen understanding of variant data (Fig. 4b–d). Its intuitive and interactive visualizations allow users to discern patterns, connections, and insights within the genomic variations. In these analyses, such as when visualizing variants of the TTN gene, users have the option to save the variant list with all existing features from the VCF file in four different file formats (.xlsx, .xls, .csv, .pdf). This functionality allows users to retain the gene of interest for later examination and facilitates the transfer of these files for further analysis. Additionally, after the initial presentation of the VCF file, subsequent sessions will benefit from quicker access since the file will have been indexed, enabling more efficient and rapid visualization for repeated use of the same files.
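The three filter categories mentioned above can be illustrated with a short Python sketch. The record layout (dictionaries with `chrom`, `pos`, `ref`, `alt`, and `filter` keys) and the `pathogenic_keys` lookup set are assumptions for illustration only; the text says pathogenicity is based on ClinVar data but does not describe how VGC matches variants against it.

```python
def filter_variants(records, category, pathogenic_keys=frozenset()):
    """Return the records that belong to one display category.

    Hypothetical sketch: `pathogenic_keys` is a caller-supplied set of
    (chrom, pos, ref, alt) tuples, for example built from a ClinVar export.
    """
    if category == "ALL":
        return list(records)
    if category == "PASS":
        return [r for r in records if r["filter"] == "PASS"]
    if category == "Pathogenic":
        return [
            r for r in records
            if (r["chrom"], r["pos"], r["ref"], r["alt"]) in pathogenic_keys
        ]
    raise ValueError(f"unknown category: {category!r}")
```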

Figure 4. Histogram-based variant browsing with VGC. (a) The VGC user interface upon query of TTN, a gene found to contain pathogenic variants in the uploaded file. (b) Variants per chromosome, non-filtered [top] vs. filtered by pathogenicity [bottom]. (c) Partially magnified view of variants in chromosome 1 for non-filtered [top] vs. filtered by pathogenicity [bottom]. (d) A detailed tooltip containing ClinVar-based information appears on hover when magnified to the single-position increment.

Graph representation of samples and genotype data

VGC simplifies the interpretation of intricate genomic variation data by converting it into intuitive graphs, offering a visual summary of samples and their genotypes (Fig.  5 ). By representing genotype data graphically, VGC enables users to effortlessly recognize patterns of genetic variation across different samples. This graphical format aids in exploring the relationships between genotypes, making it easier to identify common variants or unique genetic patterns within a population. Such a visual method enriches the users' comprehension of the genetic landscape and assists in uncovering potential links between genotypes and phenotypes.

Figure 5. Force-graph visualization of variant to sample-grouping relations. Blue nodes show variants, while dark and light gray nodes represent cases and controls.

Comparative analysis of VCF file analysis and visualization tools

To evaluate the effectiveness and unique features of VGC in comparison to other commonly used bioinformatics tools for VCF file analysis and visualization, we conducted a comparison based on several criteria: operating system compatibility, programming languages, user interfaces, Docker container support, genomic range support, variant annotation capabilities, and interactive visualization features. We selected tools that have been published in peer-reviewed journals to ensure the reliability and scientific validation of the comparison. Table 1 provides a detailed comparison of VGC with VIVA, VCF-Server, BrowseVCF, VCFtools, IGV, VCF.Filter, GEMINI, and VCF-Miner. The table highlights the distinct advantages of VGC, such as dynamic filtering and interactive HTML5 visualization. The comparative analysis underscores VGC's strengths in providing a comprehensive, user-friendly, and efficient solution for VCF file analysis and visualization.

The features of VGC provide a comprehensive solution for analyzing and visualizing genomic variation data quickly and securely. Its user-friendly interface allows users to navigate and analyze large datasets with ease. Fast filtering of millions of variants, which is crucial for researchers dealing with large-scale genomic data, lets users quickly identify the most relevant variants for further analysis. After the initial upload of a VCF file, even large files can be visualized in seconds in subsequent sessions. Users can add and query any number of user-defined groups (or phenotypes), which supports more targeted analysis of specific groups of individuals or genes. Analysis plans can be saved and reused, so researchers can reproduce previous analyses and compare results, which helps ensure that findings are robust and reliable. Rapid VCF file browsing, with support for multiple visualizations such as histograms, spreadsheets, node graphs, and heatmaps, helps users identify patterns and trends in genomic variation data. Querying by gene, range, position, and file upload gives users a range of options for finding specific variants of interest and studying their potential impact on health and disease.

The rapid identification and visualization of variant pathogenicity based on ClinVar data is another key advantage of VGC, allowing researchers to quickly identify potentially disease-causing variants that can be further investigated for their clinical significance. VGC also displays variant-to-sample genotype relations for user-defined groups, which benefits researchers studying the relationship between specific genetic variants and phenotypic traits. Integrated variant querying through the gnomAD, MSigDB, and ClinVar databases provides access to a wealth of public data that can enrich users' own analyses. VGC supports both the GRCh37 and GRCh38 human genome assemblies, which broadens its applicability and lets users work with the current reference assembly. Finally, because the software is designed to run on the local machine, with no VCF uploads to the cloud, users can work with sensitive data in a secure and private environment.

Despite these advancements, opportunities for further improvement remain. Integrating machine learning (ML) and large language models (LLMs) into VGC holds the promise of revolutionizing its capabilities in genomic analysis. Through predictive modeling, VGC could more effectively prioritize genetic variants of significance, while natural language processing (NLP) might automate the integration of scientific literature, enriching the context of variant data. Enhancing the tool's capacity to process even larger datasets would address existing scalability and efficiency challenges. Additionally, introducing more dynamic and customizable visualization options could further engage users by simplifying the interpretation of complex genomic data. A critical enhancement would be establishing a feedback system, enabling direct user input through GitHub or a dedicated site on Brown University's servers. This would allow the VGC team to quickly gather and act on user feedback, aligning the tool more closely with the genomic research community's evolving needs. Expanding integration with additional databases to capture emerging variant annotations and strengthening data privacy features, such as encrypted data storage, would also significantly enhance the tool's utility and user trust. Additionally, another potential future enhancement could involve implementing a feature that enables users to upload their own databases or annotation files. This functionality would allow users to annotate their VCF files using these personalized databases. By concentrating on these areas of development, VGC can continue to evolve to meet the growing demands of the genomic research community, offering state-of-the-art functionalities that keep pace with the latest developments in the field.

In conclusion, the available features of VGC provide a comprehensive solution for researchers dealing with genomic variation data. The user-friendly interface, fast filtering, and ability to query based on user-defined groups, make it an efficient and effective tool for identifying potentially disease-causing variants. The ability to save and reuse analysis plans, rapid VCF file browsing, and integrated variant querying through public databases, further enhance the software’s capabilities, making it a valuable resource for genomic research. The tool’s rapid VCF file browsing with histogram, spreadsheet, node graph, and heatmap support further enhances its usability.

Availability and requirements

Project name: Variant Graph Craft; Project home page: https://github.com/alperuzun/VGC; Operating system(s): Mac, Windows, Linux; Programming language: Java; Other requirements: Java 1.8 or higher; License: GPL-3.0 license. There are no restrictions on the use of VGC by non-academics.

Availability of data and materials

All data generated or analyzed during this study are included in this published article (Schuster, J., et al. Protein Network Analysis of Whole Exome Sequencing of Severe Preeclampsia. Front Genet 2021;12:765985.).

Abbreviations

VCF: Variant call format

VGC: Variant graph craft

HPO: Human Phenotype Ontology

MSigDB: Molecular Signatures Database

SNPs: Single nucleotide polymorphisms

IGV: Integrative genomics viewer

GEMINI: Genome Exploration and Mining INteractive Interface

VIVA: Visualization of variants

References

1. Campbell IM, Gambin T, Jhangiani S, Grove ML, Veeraraghavan N, Muzny DM, et al. Multiallelic positions in the human genome: challenges for genetic analyses. Hum Mutat. 2016;37(3):231–4.

2. Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput Biol. 2022;18(5):e1009123.

3. Karabayev D, Molkenov A, Yerulanuly K, Kabimoldayev I, Daniyarov A, Sharip A, et al. re-Searcher: GUI-based bioinformatics tool for simplified genomics data mining of VCF files. PeerJ. 2021;9:e11333.

4. Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92.

5. Jiang J, Gu J, Zhao T, Lu H. VCF-Server: A web-based visualization tool for high-throughput variant data mining and management. Mol Genet Genomic Med. 2019;7(7):e00641.

6. Muller H, Jimenez-Heredia R, Krolo A, Hirschmugl T, Dmytrus J, Boztug K, Bock C. VCF.Filter: interactive prioritization of disease-linked genetic variants from sequencing data. Nucleic Acids Res. 2017;45(W1):W567–72.

7. Salatino S, Ramraj V. BrowseVCF: a web-based application and workflow to quickly prioritize disease-causative variants in VCF files. Brief Bioinform. 2017;18(5):774–9.

8. Paila U, Chapman BA, Kirchner R, Quinlan AR. GEMINI: integrative exploration of genetic variation and genome annotations. PLoS Comput Biol. 2013;9(7):e1003153.

9. Hart SN, Duffy P, Quest DJ, Hossain A, Meiners MA, Kocher JP. VCF-Miner: GUI-based application for mining variants and annotations stored in VCF files. Brief Bioinform. 2016;17(2):346–51.

10. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.

11. Tollefson GA, Schuster J, Gelin F, Agudelo A, Ragavendran A, Restrepo I, et al. VIVA (VIsualization of VAriants): a VCF file visualization tool. Sci Rep. 2019;9(1):12648.

12. Yuan X, Wang J, Dai B, Sun Y, Zhang K, Chen F, et al. Evaluation of phenotype-driven gene prioritization methods for Mendelian diseases. Brief Bioinform. 2022. https://doi.org/10.1093/bib/bbac019.

13. Birgmeier J, Haeussler M, Deisseroth CA, Steinberg EH, Jagadeesh KA, Ratner AJ, et al. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Sci Transl Med. 2020;12:544.

14. Gargano MA, Matentzoglu N, Coleman B, Addo-Lartey EB, Anagnostopoulos AV, Anderton J, et al. The Human Phenotype Ontology in 2024: phenotypes around the world. Nucleic Acids Res. 2024;52(D1):D1333–46.

15. Robinson PN, Ravanmehr V, Jacobsen JOB, Danis D, Zhang XA, Carmody LC, et al. Interpretable clinical genomics with a likelihood ratio paradigm. Am J Hum Genet. 2020;107(3):403–17.

16. Smedley D, Jacobsen JO, Jager M, Kohler S, Holtgrewe M, Schubach M, et al. Next-generation diagnostics and disease-gene discovery with the Exomiser. Nat Protoc. 2015;10(12):2004–15.

17. Webpack. Available from: https://webpack.js.org/.

18. Electron Forge. Available from: https://www.electronforge.io/.

19. Axios, HTTP client for the browser and node.js. Available from: https://axios-http.com/docs/intro.

20. Electron. Available from: https://www.electronjs.org/.

21. React, the library for web and native user interfaces. Available from: https://react.dev/.

22. Tailwind CSS. Available from: https://tailwindcss.com/.

23. Syncfusion. Available from: https://www.syncfusion.com/.

24. React-Force-Graph. Available from: https://github.com/vasturiano/react-force-graph.

25. Recharts, composable charting library built on React components. Available from: https://recharts.org/en-US/.

26. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–50.

27. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9.

28. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.

29. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50(D1):D687–92.

30. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the pathway interaction database. Nucleic Acids Res. 2009;37:D674–9.

31. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfoldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434–43.

32. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062–7.

33. Schuster J, Tollefson GA, Zarate V, Agudelo A, Stabila J, Ragavendran A, et al. Protein network analysis of whole exome sequencing of severe preeclampsia. Front Genet. 2021;12:765985.

Acknowledgements

We would like to thank Professor Vasileios P. Kemerlis of the Department of Computer Science at Brown University for his invaluable advice on security in developing VGC.

Not applicable.

Author information

Authors and affiliations.

Department of Computer Science, Brown University, Providence, RI, 02912, USA

Jennifer Li

Department of Chemistry, Brown University, Providence, RI, 02912, USA

Lifespan Cancer Institute, Providence, RI, 02912, USA

Benedito A. Carneiro

Legorreta Cancer Center, Brown University, Providence, RI, 02912, USA

Benedito A. Carneiro, Ece D. Gamsiz Uzun & Alper Uzun

Department of Pathology and Laboratory Medicine, Rhode Island Hospital, Providence, RI, 02912, USA

Ece D. Gamsiz Uzun & Alper Uzun

Center for Computational Molecular Biology, Brown University, Providence, RI, 02912, USA

Department of Pathology and Laboratory Medicine, Alpert Medical School, Brown University, Providence, RI, 02912, USA

Department of Pediatrics, Division of Genetics, Warren Alpert Medical School, Brown University, Providence, RI, 02912, USA

Lauren Massingham

Center for Clinical Cancer Informatics and Data Science (CCIDS), Brown/Lifespan, Providence, RI, 02912, USA

Department of Pediatrics, Warren Alpert Medical School, Brown University, Providence, RI, 02912, USA


Contributions

J.L. and A.U. developed the method. J.L. drafted the manuscript. J.L. designed the user interface and all visualizations. J.L. and A.Y. implemented the method. A.Y. implemented the evidence-based information from external resources. J.L. built the packaging of the tool and A.Y. tested the tool on different operating systems. E.G.U. provided significant feedback on building pathogenic data visualization. B.A.C. provided different sets of VCF files for testing. L.M. and A.U. developed the general concept. J.L., A.Y., B.A.C., E.G.U., L.M., and A.U. critically revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Alper Uzun.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Li, J., Yang, A., Carneiro, B.A. et al. Variant graph craft (VGC): a comprehensive tool for analyzing genetic variation and identifying disease-causing variants. BMC Bioinformatics 25 , 288 (2024). https://doi.org/10.1186/s12859-024-05875-7


Received : 22 January 2024

Accepted : 18 July 2024

Published : 03 September 2024

DOI : https://doi.org/10.1186/s12859-024-05875-7


  • Genomic variation
  • Variant call format (VCF)
  • Variant graph craft (VGC)
  • Visualization
  • Genomic data analysis
  • Genotype information
  • Gene function
  • Pathogenic variants
  • Data security
  • User-friendly interface

BMC Bioinformatics

ISSN: 1471-2105

  • v.188(1); 2011 May

New Experiments for an Undivided Genetics

There used to be a broad split within the experimental genetics research community between those who did mechanistic research using homozygous laboratory strains and those who studied patterns of genetic variation in wild populations. The former benefited from the advantage of reproducible experiments, but faced difficulties of interpretation given possible genomic and evolutionary complexities. The latter research approach featured readily interpreted evolutionary and genomic contexts, particularly phylogeny, but was poor at determining functional significance. Such burgeoning experimental strategies as genome-wide analysis of quantitative trait loci, genotype–phenotype associations, and the products of experimental evolution are now fostering a unification of experimental genetic research that strengthens its scientific power.

MOST empirical research in genetics during the 20th century can be crudely lumped into two main aggregations of researchers. On one hand, there were those who performed functionally or mechanistically oriented research using homozygous or clonal strains, as well as their crosses, recombinants, and segregants. Experiments in this branch of the field often achieved high levels of reproducibility and qualitatively clear results. Much of what we learned about the machinery of inheritance was based on experimentation of this type, notwithstanding the many contributions that biochemistry and molecular biology have made to our understanding of the foundations of genetics using nongenetic experimental methods. This type of research will be labeled “Mendelian genetics” here for convenience because it is relatively akin to the work that Mendel and the founders of 20th-century genetics such as T. H. Morgan performed. The term Mendelian genetics is in conformity with relatively common usage, although this specific field was sometimes called “transmission genetics” and is now often syncretically grouped with “molecular genetics.” A common feature of such research was analysis of the functional effects of specific allelic differences in the laboratory.

On the other hand, there was 20th-century experimental research on the genetic variation of natural populations living in, or recently isolated from, the wild, research that is commonly referred to as “experimental population genetics.” This usage also has its difficulties, with alternative terms such as “experimental evolutionary genetics” and “quantitative genetics” sometimes overlapping or even subsuming the term experimental population genetics in some cases. This type of empirical genetic research was less common than Mendelian genetics and was often afflicted with controversies, most notably the neutralist debate of the late 20th century. In particular, as we will discuss, this type of research was more concerned with overall patterns of genetic variation and the evolutionary mechanisms that produced such genetic variation.

It will not be argued here that one of these experimental approaches was better than the other, but it is becoming ever more apparent that they cannot persist through this century as “twin solitudes” within the scientific community. Instead, it will be argued that experimental genetics is now being unified by means of genome-wide experimental strategies that bring its sundered parts together. This point of view is not entirely novel or unheralded (cf. Stern 2000, 2010; Houle 2010). We should also emphasize at the outset that we are not proposing a new genetic theory of any kind, nor do we have any notably original views concerning the likely results of the presently emerging genome-wide experimentation. Instead, we want to call attention to a new class of experiments that presage a fruitful reunification of genetics and to encourage others to overcome their inhibitions about practicing an undivided experimental genetics.

PAST DIVISION WITHIN EXPERIMENTAL GENETICS

It might surprise some Mendelian or molecular geneticists to learn that a number of population geneticists are critical of the extrapolation of Mendelian experimental results to natural populations in the wild. It might further surprise them that population geneticists are often highly critical of attempts to infer function in nature from experimental data. But those of us who were trained on the population-genetics side of biology have been dealing with this controversy for decades. Nielsen (2009) has supplied a recent example of this debate in population genetics. A key feature of Nielsen’s review is skepticism specifically about the use of functional information derived from Mendelian genetics: “the combination of a functional effect and selection does not demonstrate that selection acted on the specific trait in question” (p. 2488), and, in referring to research on the microcephalin locus, “Mutations in many different genes might cause microcephaly, but changes in these genes may not have been the underlying molecular cause for the increased brain size occurring during the evolution of man” (p. 2489).

Within the field of aging research, in which Mendelian and evolutionary geneticists have had some notable conflicts, the difficulty of proceeding from the study of mutant strains to the evolution of natural populations has likewise been discussed (e.g., Rose 1991; Van Voorhies et al. 2006). Specifically, the presence or absence of antagonistic pleiotropy in studies of specific mutant strains measured in a particular laboratory environment is demonstrably not a reliable guide to functional genetics in other environments, not even other laboratory environments. This is due, at least in part, to experimentally established propensities for genotype-by-environment interaction in functional characters (e.g., Leroi et al. 1994; Khazaeli et al. 2005). Moreover, random mutations derived from genotoxic treatments are not notably representative of variants that arise or persist in natural populations.

A relatively direct test of the applicability of Mendelian genetics results to the genetics of wild populations has been supplied by experiments focusing on loci that affect bristle number in Drosophila melanogaster. Several studies have ascertained the effects of specific polymorphisms that have been shown to be important in isogenic laboratory strains, such as genetic polymorphisms at the hairy locus in wild-caught flies, but have failed to find a correspondence (e.g., Macdonald and Long 2004). This is an important result because bristle-number phenotypes are easily scored, while D. melanogaster is a species that has been studied extensively in both Mendelian and population genetics. These features of fly bristle number favor our ability to identify commonalities between the findings of Mendelian and population genetics, but those commonalities are apparently not reliably found.

Thus there are good arguments to be made for skepticism concerning the empirical inference of population-genetics significance from typical experimental results in Mendelian genetics, however well-conducted and reproducible such experiments might be. If changes in genetic background make inferences about genetics strictly “local”—that is to say specific to the particular set of genotypes under study and the methods used to study them—then there are major problems making general inferences about the functional significance of particular allelic variants from the experiments of Mendelian genetics on their own. This is not to deny the possibility of such inferences in particular cases. But what is exciting at the present moment in the development of genetics is the emergence of genome-wide experimental techniques that offer general methods of connecting genetic variation to functional phenotypes.

On the other hand, there were significant problems with the interpretation of the results obtained from 20th-century studies of the experimental population genetics of wild populations. From its inception, there was a key difficulty facing such research, a difficulty that Sewall Wright may have realized sooner than anyone else as a result of his extensive collaboration with Theodosius Dobzhansky (Provine 1986). That difficulty is the problem of inferring which particular population genetics mechanism, or which combination of such mechanisms, is producing a particular pattern of genetic variation in a wild population. For example, genetic variation can be maintained by selectively balanced polymorphism or by the persistent immigration of individuals from genetically differentiated populations. Quite often, fluctuations in population structure, or “demography” as it is now called, can generate local patterns of genetic change similar to those of selection (Thornton and Andolfatto 2006), and if neither the extant population structure nor the mechanisms of natural selection are known with certitude, the mere characterization of changing genetic variation by population geneticists will only rarely yield useful conclusions. [For a recent example of this problem, see Hernandez et al. (2011).] These problems surfaced 70 years ago in the collaboration of Dobzhansky and Wright, and over the course of the development of their field in the 20th century, evolutionary geneticists became steadily more guarded in their inferences about the action of selection in wild populations. These difficulties did not escape the notice of other types of geneticists, who often had little use for population genetics research that they commonly found overly descriptive and mechanistically unfocused.

Mendelian genetics has yielded many promising findings concerning the basic features of gene transmission and expression, and experimental population genetics has revealed the abundant genetic variation to be found in many wild populations with great clarity. But both experimental strategies have faced clear obstacles impinging on their ability to address many of the most important questions in biology, such as the forces maintaining genetic variability in natural populations or the genetic constraints on the limits of organismal function. Further, it is at least reasonable to suggest that the separation of these strategies sustained these obstacles.

Stern (2000, 2010), to give just one example of a view somewhat similar to our own, has argued that Mendelian experimentation, as we define it here, largely ignored the problem of variation, while experimental population genetics unduly focused on aggregate statistical descriptions of genetic variation. Thus the former approach has revealed elements of Lewontin's (1974) genotype-to-phenotype map, but little about the causes of population-level genetic variation. The latter approach has revealed a great deal about population-genetics variation, but little about the specifics of pleiotropy, epistasis, and genotype-by-environment interaction. Thus Mendelian experiments have provided characteristically reductionist insights that are deficient in context, while experimental population genetics has not achieved much more than an overview of some of the “holistic” genetic properties of wild populations, as far as authors like Stern are concerned.

Here, however, the case will be made that empirical strategies for building a more fruitful, undivided experimental genetics are now available. Those who have always rejected the split between experimental research in Mendelian and population genetics can regard the present review as written in the spirit of reconciling disparate strands of experimental genetics. In either verbal formulation, the present intent is the same.

GENOME-WIDE DATA AND THE OLDER MENDELIAN AND POPULATION GENETICS

With the advent of relatively inexpensive whole-genome sequencing, whole-transcriptome sequencing, and genome-wide gene-expression assays, modern biology has turned a major corner. Previously, we argued that these new genomic technologies have led to a “new biology” of great promise (Rose and Oakley 2007). Perhaps less noticed is the degree to which whole-genome research has imperiled purist Mendelian or population genetics.

Key to the problems now facing Mendelian genetics is the degree to which the newly abundant genomic information reveals both extremely complex networks of genes affecting phenotypes and widespread epistatic effects of single substitutions on genome-wide gene-expression patterns. While it may be too early to claim that this high degree of complexity is more common than otherwise, what is indubitable is that such functional genomic complexity must now be considered a possibility. Thus, elegant Mendelian experiments that work out the effects of particular substitutions or mutations at a single locus or at a few loci face fundamental uncertainties. It may be that the mechanistic pathways that these experiments reveal are indeed a complete and sufficient genetic analysis if the phenotypes of interest are not underlain by a complex genetic network, but at present complexity appears to be more the rule than the exception.

This in turn poses a kind of induction problem. How, practically speaking, will Mendelian geneticists know when they have done enough experimental work to establish that the network components that they have already uncovered are all the components of importance? The number of experiments to be performed to achieve such empirical closure can be astronomically large, thanks to the combinatoric properties of genetic experiments. If 10 loci, each with three significant alleles, are to be combined in every possible way, there are $6^{10}$, or >60 million, genotypes to be assayed. This is not a reasonable experiment, even with automation.
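The count quoted above can be checked with a short calculation. The sketch below assumes diploid organisms and unordered genotypes, so that a locus with three alleles has three homozygous and three heterozygous genotypes, six in all; the function names are illustrative and not taken from the original.

```python
def genotypes_per_locus(n_alleles: int) -> int:
    """Unordered diploid genotypes at one locus: k homozygotes plus k(k-1)/2 heterozygotes."""
    return n_alleles * (n_alleles + 1) // 2


def total_genotypes(n_loci: int, n_alleles: int) -> int:
    """Multi-locus genotypes when every per-locus genotype is combined with every other."""
    return genotypes_per_locus(n_alleles) ** n_loci


per_locus = genotypes_per_locus(3)   # 6 genotypes at a three-allele locus
total = total_genotypes(10, 3)       # 6**10 = 60,466,176
print(per_locus, total)
```

Each additional locus multiplies the total by six, which is why exhaustive assembly of genotypes quickly becomes impractical.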

In a sense, this problem goes right to the core of what genetics should be about. Should genetics be about all the genotypes that can be constructed in a genetics laboratory, including genotypes with artificially induced mutations? If so, then it seems to be an astronomically vast endeavor.

Twentieth-century experimental population genetics, particularly as practiced by many of the intellectual descendants of Dobzhansky, had a solution to this problem. It focused instead on the genotypes that are to be found in natural populations in the wild. This is, at least logically, a tenable solution to the aforementioned combinatoric problem of Mendelian genetics experiments on all possible combinations of all possible alleles, while it also focuses attention on those genetic variants that are most relevant to study from the standpoint of high-function alleles.

But population genetics research on wild populations sacrifices the experimental power that Mendelian experiments provide. At the core of this sacrifice is that, to take genetic samples into the laboratory to breed for experimental work, the organisms sampled will be subjected to different environmental conditions and different population structures in almost all cases. Generically, 20th-century population geneticists created inbred lines from wild-caught founders and housed them in laboratories or greenhouses. If inbreeding proceeds with sufficient rapidity, it can overwhelm selection in the laboratory and preserve, over a large ensemble of inbred lines, allele frequencies that are close to those of the natural population(s) from which these lines were derived. But the resultant genotypes of these laboratories will not be a genome-wide reflection of the genotypes found in natural populations, if for no other reason than that they will be an extremely small subset of the genotypes to be found in a genetically polymorphic wild population that undergoes frequent recombination.

If the phenotypes associated with these inbred lines are assayed, then they can present problems of inbreeding depression when their ancestral wild population had high heterozygosity. Few experimental results are more common in genetics than hybrid vigor in crosses of such inbred laboratory strains, particularly when important functional characters closely related to fecundity, longevity, or viability are assayed. This kind of experimental result reveals the extent to which the genotypic structure of the ancestral wild population has been destroyed by the creation of inbred laboratory strains.

But a potentially more insidious problem for the study of laboratory strains derived from wild populations will be genotype-by-environment interaction. Unlike inbreeding depression, in which the sign of the effect is predictable, the effect of an evolutionarily novel laboratory environment on functional characters will be unpredictable as to both magnitude and sign. Sometimes the laboratory environment may artificially boost a functional character, and sometimes it may depress it. These effects in turn may depend on the particular genotype of an inbred laboratory strain. This problem afflicts the derived samples of population genetics as much as it does the arbitrarily assembled genotypes of Mendelian genetics.

Taken together, these problems hamper attempts to infer the causal factors that shape the population genetics of wild populations, in that the only straightforwardly interpretable experiments are those in wild conditions and the only entirely relevant genotypes are those that occur naturally in wild population(s). Only with plants and sessile invertebrates will such experiments in the wild normally be feasible if studies of functional characters are to be performed, and then the experimenter faces the palpable uncertainties of wild conditions, including weather, human encroachment, and sheer accident.

BREAKING FREE OF TRADITIONAL EXPERIMENTAL STRATEGIES IN GENETICS

A possible key to escaping from the dilemmas of experimental Mendelian and population genetics may lie in accepting principles similar to those of strong-inference experimentation in physical science. Instead of making doctrinal fetishes of the experimental concreteness of Mendelian genetics or the “naturalness” of the population genetics of wild populations, genetic experiments could instead focus on powerful tests of generalities, which should apply to broad classes of genetic systems without exception. Given this goal, experimental strategies could be chosen to maximize their power, rather than any other sense of appropriateness. Rather than continuing with established practices, geneticists might focus on new combinations of experimental tactics and technologies.

To be concrete, we will consider the now-burgeoning genome-wide experimental strategies that offer the prospect of an undivided experimental genetics: quantitative trait locus (QTL) analysis, genome-wide association (GWA) studies, and genomic studies of experimental evolution. We do not propose that these are the only strategies for an experimental genetics that seeks to overcome past dichotomies of empirical research. For example, Houle (2010) has offered an alternative strategy that seeks to match genome-wide data to more extensive characterization of what he calls the “phenome,” a strategy that is in some respects based on the biometrical research tradition and its evolutionary quantitative genetics offshoots of the late 20th century. While such an approach is somewhat beyond the scope of this review, we consider it also of significant promise.

QTL ANALYSIS

Before genome-wide sequencing technologies became inexpensive, QTL mapping was the most practical method for identifying genes involved in complex traits. This approach involves crossing individuals from stocks with very-well-characterized phenotypes and genotypes and determining the recombined regions of chromosomes that can be statistically associated with phenotypes in the hybrid offspring. Currently, the genetic dissection of quantitative traits is most feasible in well-characterized model systems; Drosophila and Caenorhabditis elegans are model organisms that have all the tools necessary for identifying QTL and characterizing them at the molecular level (Ayyadevara et al. 2001, 2003; Mackay 2004) and will serve as key illustrations for the purpose of the present discussion. In particular, we endeavor to point out the difficulties that face QTL analysis.

The question of how many QTL affect variation in a quantitative trait is not easily answered. The number of QTL mapped in any one experiment is almost certainly an underestimate. First, the two parent strains used in an experiment represent only a limited sample of the species-wide genetic variation for the trait in question. It should therefore not be surprising if different studies point to different QTL, even in a single species. QTL mapping experiments that involve parental stocks derived from laboratory selection experiments, however, will contain a more representative fraction of segregating variation than will crosses of two inbred lines. Second, the number of QTL is expected to increase with sample size, where sample size is the number of recombinant individuals. Increasing the sample size allows mapping of more QTL and of QTL with smaller effects. The frequency of overlap among QTL discovered in distinct interstrain crosses does allow pairwise estimates of the total number of loci of similar or greater significance to those observed in individual experiments. Three such comparisons agree in implicating 11–24 longevity QTL of comparable effect size (Shmookler Reis et al. 2006).
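One way to see how overlap between two crosses can yield an estimate of the total number of loci is a capture-recapture style calculation. The sketch below uses the Lincoln-Petersen estimator purely as an illustration; it is an assumption that this resembles the pairwise comparisons cited above, and the QTL labels and counts are hypothetical, not data from any study.

```python
def estimate_total_loci(found_a: set, found_b: set) -> float:
    """Lincoln-Petersen style estimate: N ~ n_a * n_b / n_shared.

    Assumes each cross detects loci independently and with equal
    probability per locus, which real QTL experiments may violate.
    """
    shared = len(found_a & found_b)
    if shared == 0:
        raise ValueError("no shared QTL: the estimator is undefined")
    return len(found_a) * len(found_b) / shared


# Hypothetical QTL labels detected in two different interstrain crosses.
cross_1 = {"q1", "q2", "q3", "q4", "q5", "q6"}
cross_2 = {"q4", "q5", "q6", "q7", "q8"}

estimate = estimate_total_loci(cross_1, cross_2)  # 6 * 5 / 3
print(estimate)
```

The less the two sets overlap, the larger the estimated total, which matches the intuition that rarely repeated discoveries imply many loci still undetected.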

The substantial body of work that has aimed to identify QTL affecting longevity in Drosophila serves as an appropriate example of the strengths and weaknesses of the QTL mapping approach in general. These results come both from recombinant inbred (RI) lines constructed from two parental stocks that were not selected for longevity (Nuzhdin et al. 1997; Pasyukova et al. 2000; Vieira et al. 2000; Leips et al. 2006) and from RI lines constructed from parental stocks that had undergone long-term selection for postponed aging (Curtsinger and Khazaeli 2002; Luckinbill and Golenberg 2002; Forbes et al. 2004; Valenzuela et al. 2004).

The authors of these studies admit that mapping longevity QTL is an imprecise initial step toward identifying genes responsible for longevity, as they can be localized only to approximate chromosomal regions. In addition, the extent to which different QTL mapping results from one or more laboratories can be compared is ill-defined (Shmookler Reis et al. 2006). Drosophila longevity QTL studies tend to identify 10–20 longevity “genes” that have large enough phenotypic effects to be detected (but see Curtsinger 2002). These large-effect QTL have been localized to the centromeric region of chromosome 2, and the left arm of chromosome 3, in independent studies (Valenzuela et al. 2004). Certainly, many loci with smaller phenotypic effects exist and have yet to be detected.

A recurring theme from studies of Drosophila QTL is that genotype-by-sex, genotype-by-environment, and epistatic interactions are common and complex (e.g., Dilda and Mackay 2002). Drosophila QTL are often sex and environment specific, and longevity QTL often show antagonistic pleiotropy (reviewed in Mackay 2001, 2004). If Drosophila QTL have variable effects depending on the sex, physical environment, and genetic environment in which the QTL are expressed, similar properties are to be expected for QTL in other organisms (cf. Shmookler Reis et al. 2006; Rockman et al. 2010).

Less than 2 decades ago, studies of the genetic architecture of Drosophila sensory bristle number were taken to imply that natural variation for this trait could be localized to polymorphisms in relatively few candidate genes (Mackay 1996). But even in these early studies, complications arose from sex-specific QTL effects and interactions between QTL. The challenge for the future will thus be to incorporate a systems-biology perspective into traditional QTL approaches to assess how the particular alleles of many genome-wide loci affect multiple quantitative traits and networks of transcriptional interactions.

MEDICAL GENOME-WIDE ASSOCIATION STUDIES

One of the few organisms that geneticists study in detail in very large numbers is the human. Furthermore, there has been abundant funding for genome-wide assays of human genetic variation, thanks to concern over its possible medical significance. In addition, there are abundant data concerning human medical phenotypes, most importantly the diagnosis of chronic endogenous conditions such as diabetes, hypertension, obesity, and other ailments that are long-sustained and not direct outcomes of infection.

Genome-wide association (GWA) studies use high-throughput methods to genotype panels of individuals at hundreds of thousands of sites and relate those sites to traits of clinical importance. GWA studies represent an important advance in discovering genetic variants influencing disease, but also have important limitations, including their potential for false-positive and false-negative results and for biases related to selection of study participants and genotyping errors (McCarthy et al. 2008).
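To make the idea of relating a genotyped site to a clinical trait concrete, the following sketch computes a single-SNP allelic association test for a case-control design. It assumes a plain Pearson chi-square test on a 2x2 table of allele counts, which is only one of several tests used in practice; the counts and function names are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    rows = [sum(r) for r in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(rows)
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column must contain at least one allele")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def chi_square_p_value_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical allele counts: 1000 cases and 1000 controls, two alleles each.
stat = allelic_chi_square(case_alt=600, case_ref=1400, ctrl_alt=500, ctrl_ref=1500)
p_value = chi_square_p_value_1df(stat)
print(f"chi-square = {stat:.3f}, p = {p_value:.2e}")
```

In a genome-wide scan this calculation is repeated at every genotyped site, so a nominally small p-value at one SNP does not by itself establish an association.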

The GWA approach permits surveys of the entire human genome in thousands of unrelated individuals, unconstrained by specific a priori mechanistic hypotheses regarding genetic associations with disease (Hirschhorn and Daly 2005). The genome-wide nature of GWA studies represents an important step beyond candidate gene studies that attempt to probe patterns of inheritance by focusing on single loci at a time, using the methods of Mendelian or population genetics. For conditions that are not traditional genetic diseases, GWA studies also represent a valuable advance over family-based linkage studies in which inheritance patterns of affected families are analyzed and related to genetic markers throughout the genome. These family-based linkage studies can successfully identify genes of large effect for traditional genetic diseases such as cystic fibrosis, but have been far less successful for common, complex disorders (e.g., Altmuller et al. 2001).

Some of the most important, and widely cited, human genetic studies at the present time are the medical case-control GWA studies, such as that of the Wellcome Trust Case-Control Consortium (2007). These studies have been able to uncover new single-nucleotide polymorphisms (SNPs) that are statistically associated with the onset of chronic diseases, often SNPs at loci that were not known to Mendelian human genetics prior to the advent of GWA studies. GWA studies are predicated on the “common disease, common variant” hypothesis, which generally assumes that common diseases can be attributed to genetic variants present in ∼5% of the population (Collins et al. 1997). If rarer disease-causing variants exist, or the effects of individual loci are small, we are unlikely to detect them with this approach.

Unfortunately, there are major inferential problems associated with the results of human GWA studies. As for all genome-wide research, there is a considerable multiple-testing problem, because numerous statistical tests are performed over the hundreds of thousands of SNPs, and often multiple clinical endpoints, examined in such studies. Without correcting for multiple-hypothesis testing, there will be a high false-discovery rate. Furthermore, given the significant linkage disequilibrium that characterizes the human genome, it is only rarely possible to specifically identify the SNP that functionally affects a disease risk, when a significant GWA result has been obtained (Shmookler Reis et al. 2006). For chronic diseases, the SNPs found to be statistically significant in GWA studies account for only a small fraction of their heritability (Manolio 2009). But GWA analysis of human height, which is the characteristic for which the most data are available and is more easily measured than chronic disease status, suggests that larger bodies of data and a different strategy of data analysis can account for much of the heritability of this character (Yang et al. 2010).
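The multiple-testing problem described above can be illustrated with two standard corrections. The sketch below implements the Bonferroni and Benjamini-Hochberg procedures; these are generic statistical methods chosen for illustration, not necessarily those used in the studies cited, and the p-values and the figure of 500,000 SNPs are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only tests with p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten association tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

n_uncorrected = sum(p <= 0.05 for p in p_values)
n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(n_uncorrected, n_bonferroni, n_bh)

# With a hypothetical 500,000 SNPs, the Bonferroni per-test threshold becomes very small.
print(0.05 / 500_000)
```

Bonferroni is the more conservative of the two, which is one reason loci of small effect are easily missed at genome-wide scale.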

But there are still other difficulties with GWA research. Like laboratory populations of model organisms, humans in industrialized nations live in relatively novel environments. It is certain that many of the selective forces and demographic features that have shaped human genetic variation over the past 100,000 years are no longer present under modern conditions, even though determining just what the ancestral conditions were is itself a formidable project. Thus, genotype-by-environment interaction is a problem that may afflict medical GWA studies. The now-common invocation of dietary change as an etiological factor in human disease (e.g., Lindeberg 2010) is itself an indication of this problem, particularly since the medical conditions that are known to be affected by diet, such as obesity, type II diabetes, cardiovascular disease, and many cancers, are the concern of much GWA research.

Nonetheless, there is a great deal to commend medical GWA research. It systematically studies variation over entire genomes, with its level of resolution depending on the state of the technology for genome characterization as well as the degree of linkage disequilibrium among locations across the genome. This is a great improvement over studies of a few candidate loci, even though GWA still faces grave limitations with respect to its ability to detect effects due to rare alleles. Although there are significant problems with genotype-by-environment interaction effects, GWA brings genetic information together with function, making it content-laden, particularly compared to the lack of functional content in much of population genetics. While the inference of specific functional roles for identifiable sequence differences from GWA data is plagued with significant uncertainties, as just outlined and discussed further below, at least functional questions are being addressed genome-wide.

GENOME-WIDE STUDIES OF EXPERIMENTAL EVOLUTION

While experiments in which model populations are made to evolve in response to culture conditions are of significant vintage, since 1980 they have been carried out with enough attention to population size, controls, and replication to make experimental evolution a relatively reliable experimental strategy (see Garland and Rose 2009). Of particular importance for functional and evolutionary interpretation, experimental evolution seeks to control the circumstances of both the present state and the ancestry of the populations that it studies. In Mendelian genetics, the evolutionary histories of the homozygous strains that it employs are often little known, and certainly are haphazardly controlled. While experimental population genetics can sometimes infer the ecology and the ancestry of the wild populations that it studies, the lines that are derived from the populations are characteristically subject to either an abrupt course of intense inbreeding or an ill-defined process of evolutionary domestication (cf. Simões et al. 2009).

The genetics of experimental evolution are of great relevance for the conundrums adduced here. This point has been conceded, at least implicitly, even by some of the most determined skeptics of the prospects for an undivided genetics (e.g., “We can repeat the experimental evolution of phages in the laboratory, and demonstrate that the same mutations go to fixation in repeated experiments conducted under the same conditions”; Nielsen 2009, p. 2488). In the same way, Dykhuizen and Dean (2009) show that experimental evolution in bacteria can be used to rigorously test the adequacy of functional hypotheses concerning specific genotypes when they are assembled in the laboratory. But perhaps more interesting, from the standpoint of integrating Mendelian and population genetics, are the open-ended experiments, in which populations undergo selection without having their genotypes “pre-assembled.” These experiments use populations of two basic kinds: clonal populations that accumulate de novo mutations and outbreeding sexual populations that have abundant standing genetic variation to begin with. Up until recently, it was chiefly the former type of population that was most amenable to genetic analysis (e.g., Riehle et al. 2001), but more recently inexpensive molecular-genetic technologies have allowed fairly good genetic characterization of experimental evolution in sexual populations (e.g., Teotónio et al. 2009). When such genetic characterization is extensive, it allows the experimenter to infer both the genetic substratum of a laboratory-defined adaptation and, conversely, the functional significance of genetic variants. As such, the genetics of experimental evolution constitute a natural bridge between the questions of Mendelian genetics and those of population genetics.

Experimental evolution is not without its difficulties and limitations. The population sizes used in laboratory studies of evolution are necessarily small and thus no doubt systematically smaller than those of most wild populations. With many populations handled in parallel, often by numerous experimenters, contamination between populations is always a risk, particularly in cases where the genetic variation that undergoes selection is generated by de novo mutation. The timescales of laboratory evolution experiments are quite limited relative to the timescales of evolution in the wild. Linkage disequilibrium will make the ascertainment of causally important sequences ambiguous, as it characteristically does. Again, there will be statistical problems of false discovery with genome-wide assays of the effects of evolution. Even though experimental evolutionists may think that they know which characters selection is targeting, they will not necessarily be right (see Leroi et al. 1994). Most importantly, the idea that an experimental evolution study necessarily serves as a particularly pertinent guide to the evolution of wild populations is at least dubious (cf. Huey and Rosenzweig 2009). But in pursuit of universal principles of genetics, this is a secondary point. If a law of genetics is alleged to be exceptionless, then we could falsify it using the genetics of laboratory experimental evolution, despite the likely disparities between that laboratory system and others in nature.

Recent work by Burke et al. (2010) highlights the utility of the experimental evolution strategy for genetic analysis, despite these limitations. This study examined whole-genome sequence data from populations ($N_e \sim 10^3$) of Drosophila that had experienced over 600 generations of selection for accelerated development, as well as their ancestral or control populations. Flies in the strongly selected populations studied by Burke et al. (2010) develop ∼20% faster than control flies and have evolved correlated phenotypic differences including smaller size, decreased stress resistance, and shorter mean life span. The primary goal of this study was to identify SNPs with significantly different allele frequencies in the experimental and control populations, as such loci can reasonably be associated with the aforementioned phenotypes. Burke et al. (2010) identified ∼24,000 such SNPs, and since linkage disequilibrium extends up to 30–100 kb in these populations (see Teotónio et al. 2009), these SNPs localized to several dozen genomic regions that responded strongly to selection.

The observation that >20,000 SNPs significantly change in frequency suggests a large and complex genetic network underlying the response to selection for accelerated development. Perhaps more interesting is that Burke et al. (2010) found no evidence for the complete fixation of any of these alleles. Although local losses in heterozygosity were observed in the same areas of the genome at which there was significant differentiation in allele frequencies, in no region did heterozygosity come close to zero. The failure to observe the traditional signature of a selective sweep in this study is not necessarily unanticipated, given that 600 generations might not be enough time for newly arisen mutations to fix. On the other hand, very little allele-frequency differentiation was observed between replicate populations experiencing the same selection treatment; in other words, it is unlikely that beneficial new mutations arose independently in replicate populations and are currently in the process of fixing. The major conclusion to be drawn from this work is that, unlike microbial evolution experiments, selection acts primarily on standing variation, and not on new mutations, in sexually reproducing systems undergoing experimental evolution.

These experimental evolutionary genomic results ostensibly create a genetic load paradox. How can these laboratory Drosophila populations sustain so much genetic variation for fitness-related characters such as developmental speed, early fecundity, etc.? But the genetic load paradox may be more apparent than real. Consider a simple case of balancing selection. For a locus at an overdominant equilibrium, the mean population fitness will be lower than the fitness of the heterozygote. This difference has been referred to as the segregational genetic load (Ewens 1979). More formally, genetic load, $L$, is defined as

$$L = \frac{w_{\max} - \bar{w}}{\bar{w}}, \qquad (1)$$

where $\bar{w}$ is the mean fitness and $w_{\max}$ is the maximum fitness among all genotypes in the population. If natural selection is assumed to act through strict viability selection, then a large genetic load has been taken to indicate that a population may be unable to numerically replace itself, which leads to extinction. Consequently, some (e.g., Kimura 1968) have used this argument to suggest that there is a limit to how much genetic variation can be maintained by natural selection. These arguments have been countered by pointing out that natural selection does not always act through the deaths of individuals; rather, processes like frequency-dependent selection will ameliorate the impact of selection on population viability. Indeed, in Drosophila it has been well documented that adaptation in many severe larval environments results in changes in larval feeding rates (see review in Mueller et al. 2005), which in turn affects competitive ability for food, which is a frequency-dependent process (Mueller 1988).

Another reason that genetic load calculations are likely to overstate the negative impact of selection is that the most-fit genotype may be vanishingly rare, in which case the difference between the average and most-fit genotype may be dramatically smaller. For example, consider a single locus with two alleles, $A_1$ and $A_2$. Suppose the fitnesses for genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ are $1 - s_1$, $1$, and $1 - s_2$, respectively. If $s_1, s_2 > 0$, then the equilibrium frequency of the $A_1$ allele is $\hat{p} = s_2/(s_1 + s_2)$. Now suppose we have 500 independent loci maintained by heterozygote advantage. The most-fit genotype is the multiple heterozygote with net fitness of $1^{500} = 1$. The mean fitness over all 500 loci is $[\hat{p}^2(1 - s_1) + 2\hat{p}(1 - \hat{p}) + (1 - \hat{p})^2(1 - s_2)]^{500}$. If $s_1 = s_2 = 0.02$, then from Equation 1 we calculate the load as 150. However, the likelihood of seeing the heterozygote at all 500 loci in a finite population is very small.
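The worked example above can be reproduced numerically. The sketch below assumes that load is measured relative to mean fitness, $L = (w_{\max} - \bar{w})/\bar{w}$, which is the form that recovers the load of about 150 quoted in the text, and that fitness is multiplicative across loci; the function names are illustrative.

```python
def equilibrium_frequency(s1: float, s2: float) -> float:
    """Equilibrium frequency of A1 under overdominance: p = s2 / (s1 + s2)."""
    return s2 / (s1 + s2)


def mean_fitness_per_locus(s1: float, s2: float) -> float:
    """Mean fitness at one locus at the overdominant equilibrium."""
    p = equilibrium_frequency(s1, s2)
    q = 1.0 - p
    return p * p * (1.0 - s1) + 2.0 * p * q + q * q * (1.0 - s2)


def genetic_load(s1: float, s2: float, n_loci: int) -> float:
    """Load relative to mean fitness, with the multiple heterozygote (fitness 1) as w_max."""
    w_bar = mean_fitness_per_locus(s1, s2) ** n_loci
    w_max = 1.0
    return (w_max - w_bar) / w_bar


w_one_locus = mean_fitness_per_locus(0.02, 0.02)   # 0.99 per locus
load = genetic_load(0.02, 0.02, 500)               # roughly 150, as quoted in the text
print(w_one_locus, load)
```

A per-locus fitness cost of only 1% compounds over 500 loci to a mean fitness below 1% of the maximum, which is what makes the theoretical load so large.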

To quantify this problem, we have created finite populations of 100 individuals and assigned them genotypes randomly at each of 500 loci, assuming that each locus was at the same overdominant equilibrium. We then calculated the load of this population of 100 individuals using Equation 1 , but w max was set to the highest fitness among the 100 individuals. We repeated this process 1000 times and computed the mean genetic load in these populations of 100 individuals and have contrasted this to the maximum expected load ( Table 1 ). The results in Table 1 show that the load in small samples is 10–20 times less than the maximum load.

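Table 1. Theoretical load ($L$), observed load ($\hat{L}$), and mean fitness ($\hat{w}$) in samples of 100 individuals

| $\hat{p}$ | 0.05 | 0.1 | 0.2 | 0.4 | 0.5 |
|---|---|---|---|---|---|
| $s_1$ | 0.038 | 0.018 | 0.008 | 0.003 | 0.002 |
| $L$ | 1.6 | 1.5 | 1.2 | 0.82 | 0.65 |
| $\hat{L}$ | 0.075 | 0.093 | 0.09 | 0.069 | 0.057 |
| $\hat{w}$ | 0.39 | 0.41 | 0.45 | 0.55 | 0.61 |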

The column heading $\hat{p}$ and the row heading $s_1$ refer to the equation of overdominance outlined in the text. In each case, $s_2 = 0.002$. The value of $s_1$ was chosen to give the equilibrium allele frequency indicated in the table.
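The finite-sample procedure described above can be sketched as follows. Several details are assumptions rather than the authors' stated method: genotypes are drawn in Hardy-Weinberg proportions at the equilibrium allele frequency, fitness is multiplicative across loci, and the load is computed as $(w_{\max} - \bar{w})/\bar{w}$ with $w_{\max}$ taken from the sample. The default number of replicates is kept small for speed; the text reports 1000.

```python
import random


def individual_fitness(p, s1, s2, n_loci, rng):
    """Multiplicative fitness of one individual with random genotypes at n_loci loci."""
    w = 1.0
    for _ in range(n_loci):
        first_is_a1 = rng.random() < p
        second_is_a1 = rng.random() < p
        if first_is_a1 and second_is_a1:
            w *= 1.0 - s1          # A1A1 homozygote
        elif not first_is_a1 and not second_is_a1:
            w *= 1.0 - s2          # A2A2 homozygote
        # heterozygotes have fitness 1, so w is unchanged
    return w


def sample_load(p, s1, s2, rng, n_loci=500, n_ind=100):
    """Load in one sample, with w_max set to the fittest sampled individual."""
    fitnesses = [individual_fitness(p, s1, s2, n_loci, rng) for _ in range(n_ind)]
    w_bar = sum(fitnesses) / n_ind
    return (max(fitnesses) - w_bar) / w_bar


def mean_sample_load(s1, s2, n_reps=20, seed=0):
    """Average sample load over n_reps independent samples."""
    rng = random.Random(seed)
    p = s2 / (s1 + s2)             # overdominant equilibrium frequency of A1
    return sum(sample_load(p, s1, s2, rng) for _ in range(n_reps)) / n_reps


if __name__ == "__main__":
    # Last column of Table 1: p = 0.5, s1 = s2 = 0.002
    print(round(mean_sample_load(0.002, 0.002), 3))
```

Because the fittest individual in a sample of 100 is still homozygous at many loci, the sample load stays far below the theoretical value computed against the multiple heterozygote.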

The significance of this finding is that outbred laboratory populations maintained under long-sustained selection regimes nonetheless can retain abundant genetic variation of evolutionary genetic and functional significance. This offers the prospect of using the genome-wide response to changes in the phenotypic focus of selection as a general-purpose tool for resolving questions of importance for both Mendelian and population genetics.

FROM GENOME-WIDE ANALYSIS BACK TO INDIVIDUAL GENES?

A natural goal for genetic research would be to proceed from genome-wide analysis to the functional dissection of individual genetic variants. This is perhaps most obvious in the case of QTL mapping, in that this experimental strategy is expressly focused on identifying specific regions of the genome that have a measurable effect on specific phenotypes. To the extent that QTL mapping identifies causally important specific quantitative trait nucleotides (QTNs), then such polymorphisms could be subjected to functional genetic validation using complementation tests among other standard Mendelian procedures. In the research of Burke et al. (2010), reasonably narrow areas of genomic differentiation were identified; with greater resequencing coverage, genome-wide sequencing of such experimentally evolved populations might provide candidate QTNs of comparable value to those of QTL analysis. Likewise, large-scale GWA research might also provide candidate SNPs that could be examined using more functional genetic studies. Alternatively, candidate SNPs identified by any of the three types of genome-wide scan could be subjected to such molecular genetic assays as RNA interference, overexpression in transgenic constructs, or allele replacement, at least with model organisms.

A significant problem impinging on such explicitly causal research is that the effect sizes of these candidate SNPs might be so small as to vitiate such direct assays of their phenotypic effects. The analysis of the Burke et al. (2010) data provided in its supplementary files suggests fairly stringent limits on the effect sizes of the SNPs involved, given that none of them showed evidence of a selective sweep during >600 generations of selection. Likewise, the small proportion of heritability accounted for by some GWA studies (Manolio et al. 2009) suggests that the candidate SNPs identified by such work often have a quantitatively small effect. Many inferred QTL may arise from local linkage disequilibrium among multiple QTNs, each of them having small effects individually. Andolfatto (2005) has estimated the likely selection coefficients of SNPs involved in selective sweeps and finds that they are quite small, suggesting that they have relatively small quantitative effects on measurable phenotypes. However, this is a question that is best resolved by experimental attempts to establish the magnitude of the phenotypic effects of SNPs identified by the kind of genome-wide experiments discussed here.

In an undivided experimental genetics, we should be free to take our experimental refutations and corroborations wherever we can, without regard to convention or habit. There are still good experiments to be done within the limited confines of both Mendelian genetics and experimental population genetics, as conventionally practiced. But there are now other experimental methods that have a good claim on our attention, particularly in an era when genomic technologies offer possibilities that were barely conceived of in the 1930s and 1940s, when the two wings of experimental genetics discussed here first started to drift apart.

We provided a brief sketch of some of the possibilities for genomically founded biology in an earlier article (Rose and Oakley 2007), but it might be useful to mention a few specifically promising lines of research for the undivided genetics that are now emerging:

  • Whole-genome resequencing of individuals obtained directly from wild populations should reveal the extent to which populations in nature sustain genetic variation and linkage disequilibrium at the nucleotide level, and what kind of structural and nucleotide-level variation is present in such populations.
  • Trajectories of genome-wide variation among replicated populations undergoing identical culture regimes in parallel over many generations at different population sizes should reveal the relative importance of genetic drift vs. deterministic patterns of selection in shaping genetic change over moderate evolutionary time periods.
  • Reverse experimental evolution (see Teotónio and Rose 2000, 2001) combined with whole-genome resequencing (cf. Teotónio et al. 2009) should reveal how common antagonistic pleiotropy is across entire genomes and thus indicate the potential for selection to sustain genetic polymorphisms.
  • It should be possible to use experimental evolution of replicated populations to test the relative importance of de novo mutation producing selective sweeps vs. shifting selectively sustained polymorphisms in the evolution of outbred sexual populations, provided enough generations of selection are monitored using whole-genome resequencing.
  • As discussed above, genome-wide scans followed by causal tests of the phenotypic effects of individual QTNs should test whether such QTNs have measurable phenotypic effects.
  • When QTNs with measurable phenotypic effects are identified, they could be tested for their pleiotropic and epistatic effects, and when these experiments are done over a variety of environments, they could also delimit patterns of genotype-by-environment interaction.

But as we move through the phenomena of genetics with the freedom to perform new kinds of experiments, we should be wary of supposing that the specifics of what we find in a particular laboratory or a particular set of samples from the wild will inductively generalize. Fortunately, although science grows out of a substratum of particulars, fruitful scientific debate is about general rules and theories, not about the artifacts generated by specific experimental methods. In the end, all experimental strategies are subject to difficulties that may render their conclusions suspect, suggesting that there is all the more reason to unburden ourselves of Procrustean methodologies and narrow views of the possibilities for genetic experimentation.

Acknowledgments

We are grateful to the anonymous referees, Robert J. S. Reis, Kevin R. Thornton, and Adam Wilkins for their comments on earlier drafts of this manuscript. The research of the authors has been supported by grants from the National Institutes of Health, National Science Foundation, and the University of California.

LITERATURE CITED

  • Altmüller J., Palmer L. J., Fischer G., Scherb H., Wjst M., 2001. Genome-wide scans of complex human diseases. Am. J. Hum. Genet. 69(5): 936–950
  • Andolfatto P., 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437: 1149–1152
  • Ayyadevara S., Ayyadevara R., Hou S., Thaden J. J., Shmookler Reis R. J., 2001. Genetic mapping of quantitative trait loci governing longevity of Caenorhabditis elegans in recombinant-inbred progeny of a Bergerac-BO × RC301 interstrain cross. Genetics 157: 655–666
  • Ayyadevara S., Ayyadevara R., Vertino A., Galecki A., Thaden J. J., et al., 2003. Genetic loci modulating fitness and life span in Caenorhabditis elegans: categorical trait interval mapping in CL2a × Bergerac-BO recombinant-inbred worms. Genetics 163: 557–570
  • Burke M. K., Dunham J. P., Shahrestani P., Thornton K. R., Rose M. R., et al., 2010. Genome-wide analysis of a long-term evolution experiment with Drosophila. Nature 467: 587–590
  • Collins F. S., Guyer M. S., Chakravarti A., 1997. Variations on a theme: cataloging human DNA sequence variation. Science 278(5343): 1580–1581
  • Curtsinger J. W., 2002. Sex specificity, life-span QTLs, and statistical power. J. Gerontol. 57A(12): B409–B414
  • Curtsinger J. W., Khazaeli A. A., 2002. Lifespan, QTLs, age-specificity, and pleiotropy in Drosophila. Mech. Ageing Dev. 123: 81–93
  • Dilda C. L., Mackay T. F. C., 2002. The genetic architecture of Drosophila sensory bristle number. Genetics 162: 1655–1674
  • Dykhuizen D. E., Dean A. M., 2009. Experimental evolution from the bottom up, pp. 67–89 in Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments, edited by Garland T., Rose M. R. University of California Press, Berkeley
  • Ewens W. J., 1979. Mathematical Population Genetics. Springer-Verlag, New York
  • Forbes S. N., Valenzuela R. K., Keim P., Service P. M., 2004. Quantitative trait loci affecting life span in replicated populations of Drosophila melanogaster. I. Composite interval mapping. Genetics 168: 301–311
  • Garland T., Rose M. R., 2009. Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments. University of California Press, Berkeley
  • Hernandez R. D., Kelley J. L., Elyashiv E., Melton S. C., Auton A., et al., 2011. Classic selective sweeps were rare in recent human evolution. Science 331(6019): 920–924
  • Hirschhorn J. N., Daly M. J., 2005. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6(2): 95–108
  • Houle D., 2010. Numbering the hairs on our heads: the shared challenge and promise of phenomics. Proc. Natl. Acad. Sci. USA 107(Supp. 1): 1793–1799
  • Huey R. B., Rosenzweig F., 2009. Laboratory evolution meets catch-22: balancing simplicity and realism, pp. 671–702 in Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments, edited by Garland T., Rose M. R. University of California Press, Berkeley
  • Khazaeli A. A., Van Voorhies W., Curtsinger J. W., 2005. The relationship between life span and adult body size is highly strain-specific in Drosophila melanogaster. Exp. Gerontol. 40: 377–385
  • Kimura M., 1968. Evolutionary rate at the molecular level. Nature 217: 624–626
  • Leips J., Gilligan P., Mackay T. F. C., 2006. Quantitative trait loci with age-specific effects on fecundity in Drosophila melanogaster. Genetics 172: 1595–1605
  • Leroi A. M., Chippindale A. K., Rose M. R., 1994. Long-term laboratory evolution of a genetic trade-off in Drosophila melanogaster. I. The role of genotype × environment interaction. Evolution 48: 1244–1257
  • Lewontin R. C., 1974. The Genetic Basis of Evolutionary Change. Columbia University Press, New York
  • Lindeberg S., 2010. Food and Western Disease: Health and Nutrition from an Evolutionary Perspective. Wiley-Blackwell, New York
  • Luckinbill L. S., Golenberg E. M., 2002. Genes affecting aging: mapping quantitative trait loci in Drosophila melanogaster using amplified fragment length polymorphisms (AFLPs). Genetica 114(2): 147–156
  • Macdonald S. J., Long A. D., 2004. A potential regulatory polymorphism upstream of hairy is not associated with bristle-number variation in wild-caught Drosophila. Genetics 167: 2127–2131
  • Mackay T. F. C., 1996. The nature of quantitative genetic variation revisited: lessons from Drosophila bristles. Bioessays 18: 113–121
  • Mackay T. F. C., 2001. Quantitative trait loci in Drosophila. Nat. Rev. Genet. 2: 11–20
  • Mackay T. F. C., 2004. The genetic architecture of quantitative traits: lessons from Drosophila. Curr. Opin. Genet. Dev. 14: 253–257
  • Manolio T. A., Collins F. S., Cox N. J., Goldstein D. B., Hindorff L. A., et al., 2009. Finding the missing heritability of complex diseases. Nature 461: 747–753
  • McCarthy M. I., Abecasis G. R., Cardon L. R., Goldstein D. B., Little J., et al., 2008. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9: 356–369
  • Mueller L. D., 1988. Density-dependent population growth and natural selection in food limited environments: the Drosophila model. Am. Nat. 132: 786–809
  • Mueller L. D., Folk D. G., Nguyen N., Nguyen P., Lam P., et al., 2005. Evolution of larval foraging behaviour in Drosophila and its effects on growth and metabolic rate. Physiol. Entomol. 30: 262–269
  • Nielsen R., 2009. Adaptationism: 30 years after Gould and Lewontin. Evolution 63: 2487–2490
  • Nuzhdin S. V., Pasyukova E. G., Dilda C. L., Zeng Z., Mackay T. F. C., 1997. Proc. Natl. Acad. Sci. USA 94: 9734–9739
  • Pasyukova E. G., Vieira C., Mackay T. F. C., 2000. Deficiency mapping of quantitative trait loci affecting longevity in Drosophila melanogaster. Genetics 156: 1129–1146
  • Provine W. B., 1986. Science and Its Conceptual Foundations: Sewall Wright and Evolutionary Biology. University of Chicago Press, Chicago
  • Riehle M. M., Bennett A. F., Long A. D., 2001. Genetic architecture of thermal adaptation in Escherichia coli. Proc. Natl. Acad. Sci. USA 98(2): 525–530
  • Rockman M. V., Skrovanek S. S., Kruglyak L., 2010. Selection at linked sites shapes heritable phenotypic variation in C. elegans. Science 330: 372–376
  • Rose M. R., 1991. Evolutionary Biology of Aging. Oxford University Press, Oxford
  • Rose M. R., Oakley T. H., 2007. The new biology: beyond the Modern Synthesis. Biol. Direct 2: 30
  • Shmookler Reis R. J., Kang P., Ayyadevara S., 2006. Quantitative trait loci define genes and pathways underlying genetic variation in longevity. Exp. Gerontol. 41: 1046–1054
  • Simões P., Santos J., Matos M., 2009. Experimental evolutionary domestication, pp. 89–110 in Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments, edited by Garland T., Rose M. R. University of California Press, Berkeley
  • Stern D. L., 2000. Perspective: evolutionary developmental biology and the problem of variation. Evolution 54(4): 1079–1091
  • Stern D. L., 2010. Evolution, Development, and the Predictable Genome. Roberts & Company Publishers, Greenwood Village, CO
  • Teotónio H., Rose M. R., 2000. Variation in the reversibility of evolution. Nature 408: 463–466
  • Teotónio H., Rose M. R., 2001. Perspective: reverse evolution. Evolution 55: 653–660
  • Teotónio H., Chelo I. M., Bradic M., Rose M. R., Long A. D., 2009. Experimental evolution reveals natural selection on standing genetic variation. Nat. Genet. 41: 251–257
  • Thornton K. R., Andolfatto P., 2006. Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics 172: 1607–1619
  • Valenzuela R. K., Forbes S. N., Keim P., Service P. M., 2004. Quantitative trait loci affecting life span in replicated populations of Drosophila melanogaster. II. Response to selection. Genetics 168: 313–324
  • Van Voorhies W. A., Curtsinger J. W., Rose M. R., 2006. Do longevity mutants always show trade-offs? Exp. Gerontol. 41: 1055–1058
  • Vieira C., Pasyukova E. G., Zeng Z., Brant Hackett J., Lyman R. F., et al., 2000. Genotype-environment interaction for quantitative trait loci affecting life span in Drosophila melanogaster. Genetics 154: 213–227
  • Wellcome Trust Case Control Consortium, 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145): 661–678
  • Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42(7): 565–569

siRNA Experiments - obtain sufficient data

The aim of experimental design is to efficiently obtain sufficient data—with the least effort and cost—from which scientifically and statistically valid conclusions can be drawn. Understanding the experimental goals and processes as well as the amount and sources of experimental variability are all important to designing successful siRNA experiments. This article will describe different sources of variation and how to determine the number of replicates that are ideal for your experiments.

Sources of Variation

Variability in siRNA Experiments

How Many Replicates?

  • Is it a research assay, screening assay, or release assay?
  • Will the procedure be run many times or infrequently?
  • How much validation was done?

Open Access

Peer-reviewed

Research Article

ECD-CDGI: An efficient energy-constrained diffusion model for cancer driver gene identification

Roles Data curation, Methodology, Writing – original draft

Affiliation School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, China

Roles Methodology, Supervision, Writing – review & editing

* Corresponding authors: LZ, XF, QZ

Roles Investigation

Affiliation College of Computer Science and Electronic Engineering, Hunan University, Changsha, China

Roles Supervision, Writing – review & editing

Roles Supervision

Affiliation Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China

  • Tao Wang, 
  • Linlin Zhuo, 
  • Yifan Chen, 
  • Xiangzheng Fu, 
  • Xiangxiang Zeng, 

PLOS

  • Published: August 30, 2024
  • https://doi.org/10.1371/journal.pcbi.1012400

This is an uncorrected proof.

The identification of cancer driver genes (CDGs) poses challenges due to the intricate interdependencies among genes and the influence of measurement errors and noise. We propose a novel energy-constrained diffusion (ECD)-based model for identifying CDGs, termed ECD-CDGI. This model is the first to design an ECD-Attention encoder by combining the ECD technique with an attention mechanism. The ECD-Attention encoder generates robust gene representations that reveal the complex interdependencies among genes while reducing the impact of data noise. We concatenate topological embeddings, extracted from gene-gene networks through graph transformers, to these gene representations. Extensive experiments across three testing scenarios show that the ECD-CDGI model not only identifies known CDGs proficiently but also efficiently uncovers unknown potential CDGs. Furthermore, compared to graph neural network (GNN)-based approaches, the ECD-CDGI model is less constrained by existing gene-gene networks, which enhances its capability to identify CDGs. Additionally, ECD-CDGI is open-source and freely available. We have also launched the model as a free online tool designed to expedite research on CDG identification.

Author summary

Cancer has become a major disease threatening human life and health. Cancer usually originates from abnormal gene activities, such as mutations and copy number variations. Mutations in cancer driver genes are crucial for the selective growth of tumor cells. Identifying cancer driver genes is crucial in cancer-related research and treatment strategies, as it helps understand cancer occurrence and development. However, the complex gene-gene interactions, measurement errors, and the prevalence of unlabeled data significantly complicate the identification of these driver genes. We developed a new method that integrates an energy-constrained diffusion mechanism with an attention mechanism to uncover implicit gene dependencies in biomolecular networks and generate robust gene representations. Extensive experiments demonstrated that our model accurately identifies known cancer driver genes and effectively discovers potential ones. Furthermore, we analyzed and predicted patient-specific mutated genes, enhancing our understanding of their pathogenesis and advancing precision medicine. In summary, our method offers a promising tool for advancing the identification of cancer driver genes.

Citation: Wang T, Zhuo L, Chen Y, Fu X, Zeng X, Zou Q (2024) ECD-CDGI: An efficient energy-constrained diffusion model for cancer driver gene identification. PLoS Comput Biol 20(8): e1012400. https://doi.org/10.1371/journal.pcbi.1012400

Editor: Jinyan Li, Chinese Academy of Sciences Shenzhen Institutes of Advanced Technology, CHINA

Received: October 25, 2023; Accepted: August 10, 2024; Published: August 30, 2024

Copyright: © 2024 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Our code and data are publicly available in the GitHub repository: https://github.com/taowang11/ECD-CDGI .

Funding: This work received partial support from the Natural Science Foundation of China under Grant No. 62302339, to L.Z. Additionally, this work was partially funded by the Natural Science Foundation of China under Grant No. 62372158 to X.F. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Cancer is typically driven by the accumulation of genetic variations, including single nucleotide variations, small insertions or deletions, and copy number variations [1, 2]. Gene mutations can lead to activation or inactivation, promoting cancer occurrence and metastasis. Mutations in cancer driver genes (CDGs) enable tumor cells to gain selective growth advantages in evading immune cell clearance and drug treatment [3, 4]. Therefore, developing methods to identify CDGs is of great significance for cancer pathologic research, as well as the development of cancer diagnosis, treatment, and targeted drugs [5]. Recent advances in next-generation sequencing technology have helped researchers generate a vast amount of cancer genomic data and classify somatic mutations in common and rare cancer types [6]. Nevertheless, systematically identifying CDGs from large-scale human cancer genomic data remains a significant challenge [7, 8].

Many computational methods and tools have been developed to address this challenging issue in the past few years. Traditional computational methods for identifying CDGs can be divided into two main categories: mutation frequency-based and network-based. The mutation frequency-based methods generally assume that mutations in driver genes have a higher probability of being recurrent across samples compared to non-driver genes, thus identifying significantly mutated genes as CDGs [9, 10]. The network-based methods consider cancer to result from mutations in multiple genes that collectively play essential roles in cancer-related biological pathways [11, 12]. Despite the remarkable achievements of these methods in studying gene variations, there are still some limitations. For example, mutation frequency-based methods often fail to detect driver genes with low mutation frequencies due to the lack of reliable background mutation frequencies. Additionally, when biological networks lack numerous associative relationships or are inundated with noisy data, network-based methods can identify driver genes with poor accuracy.
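The recurrence assumption behind mutation frequency-based methods can be illustrated with a toy significance test. This is a minimal sketch and not the procedure of any cited tool: it assumes a single uniform background probability that a gene is mutated in a sample, a binomial model, and a Bonferroni correction. The gene names, counts, and background rate are hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def significantly_mutated(counts, n_samples, background_rate, alpha=0.05):
    """Return {gene: p-value} for genes mutated more often than the background predicts."""
    threshold = alpha / len(counts)  # Bonferroni correction over tested genes
    hits = {}
    for gene, k in counts.items():
        p_value = binomial_upper_tail(k, n_samples, background_rate)
        if p_value < threshold:
            hits[gene] = p_value
    return hits


# Hypothetical cohort: number of samples (out of 200) in which each gene is mutated
counts = {"GENE_A": 15, "GENE_B": 3, "GENE_C": 2}
hits = significantly_mutated(counts, n_samples=200, background_rate=0.01)
print(sorted(hits))
```

In this toy cohort only the frequently mutated gene passes the threshold. A true driver mutated in just three samples would be indistinguishable from background, which is the low-frequency limitation described above.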

Recently, machine learning (ML) techniques, particularly deep learning methods, have achieved tremendous success in identifying CDGs [13–15]. ML-based approaches frame the prediction of driver genes as a classification task, leveraging available data and knowledge to identify driver genes or driver mutations. Typically, these methods utilize a low-dimensional representation of genes' multi-omic feature vectors, subsequently employing classifiers to identify CDGs. For instance, Parvandeh et al. utilized cancer gene network data to calculate the differences between nodes using the Minkowski distance [16]. They integrated the nearest neighbor algorithm and evolutionary scoring calculation to identify potential CDGs. Similarly, Han et al. trained an ensemble of models on various types of gene mutations and then applied the Poisson distribution coupled with Monte Carlo simulations to discover low-background mutation rate CDGs [17]. In another study, Habibi et al. combined mutation data, protein-protein interaction (PPI), and biological process networks. They calculated the score of gene features, engineered a gene-gene network significantly linked to cancer, and performed cluster analysis to study CDGs [18]. However, these traditional machine learning approaches face limitations due to their neglect of complex interactions inherent in gene-gene networks. Graph neural networks (GNNs) offer a promising solution to this constraint. By employing an iterative message passing and aggregation mechanism, GNNs are capable of learning low-dimensional embeddings that capture the complex relationships among genes, based on their interactions within the network [19].
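The iterative message passing and aggregation idea can be shown on a toy network. The sketch below uses plain mean aggregation with no learned weights or nonlinearities, which real GNN layers would add; the three-gene path graph and the feature vectors are hypothetical.

```python
def message_passing_round(features, neighbors):
    """One round: each node replaces its vector with the mean of its own and its neighbors' vectors."""
    updated = {}
    for node, vec in features.items():
        messages = [features[n] for n in neighbors.get(node, [])] + [vec]
        updated[node] = [
            sum(m[d] for m in messages) / len(messages) for d in range(len(vec))
        ]
    return updated


# Hypothetical path graph g1 - g2 - g3 with two-dimensional gene features
neighbors = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
features = {"g1": [1.0, 0.0], "g2": [0.0, 0.0], "g3": [0.0, 1.0]}

after_one = message_passing_round(features, neighbors)
after_two = message_passing_round(after_one, neighbors)
print(after_one["g1"], after_two["g1"])
```

After one round g1 has only seen its direct neighbor g2; after two rounds its second coordinate becomes nonzero because information from g3 has arrived through g2. Each additional round widens the neighborhood a node's embedding reflects.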

Consequently, GNNs have been instrumental in enhancing the accuracy of CDGs identification [20–22]. For example, the EMOGI model incorporates diverse multi-omics data, including copy number variation, methylation and PPI network to identify CDGs using graph convolutional neural networks (GCNs) [23]. The EMOGI model primarily focuses on a subset of genes in the PPI network, conducting training and evaluation solely at the node level. Building upon this, MTGCN integrates both CDG identification and interaction prediction tasks into a collaborative training framework, thereby improving the precision of CDG prediction [24]. These approaches utilize Chebyshev polynomials within the convolutional layers and separate the embeddings from their neighboring nodes during the aggregation process, which can effectively address the issue of "over-smoothing" often encountered with multiple iterative convolution operations. As a result, these models demonstrate superior performance compared to traditional GCNs [25] and Graph Attention Networks (GATs) [26]. However, these models do have their limitations. Specifically, biomolecular networks are typically highly heterogeneous, a condition primarily attributed to the diversity of genomic data, including gene expression, protein interactions, and metabolite profiles. To our knowledge, the message propagation in most GNN models is often influenced by nodes with high degrees. Consequently, this can lead to the masking or domination of gene features by heterogeneous, highly connected neighbors, which impedes the accurate representation of gene features. To overcome this limitation, Zhang et al. introduced the HGDC model based on graph diffusion models [27]. Initially, HGDC creates an auxiliary graph employing graph diffusion and random walk techniques and jointly trains it alongside the original graph to enhance node representation. Subsequently, it refines the propagation and aggregation mechanisms inherent in GCNs, making the model more suitable for heterogeneous biomolecular networks. Finally, it deploys a multi-layer attention classifier to accurately identify CDGs.

While existing models demonstrate strong performance in identifying CDGs, they have limitations. Most notably, these models often focus solely on the immediate neighborhood of nodes, overlooking potentially complex interdependencies between any two genes. Additionally, data noise introduced by errors in the collection process can further compromise performance. To address these challenges, we propose the ECD-CDGI model, which combines the diffusion process with an attention mechanism to unveil hidden relationships between any two genes and enhance CDG identification. In summary, the main contributions of this paper are as follows:

  • ECD-CDGI considers gene interactions as a diffusion process to maintain gene expression globally consistent in terms of the underlying structure while mitigating the effects of noisy data, and for the first time, realizes the combination of energy-constrained diffusion and attention mechanisms to identify CDGs.
  • We design an ECD-Attention encoder based on diffusion processes and attention mechanisms to capture implicit dependencies between genes in biomolecular networks. This approach generates robust gene representations, which are further enhanced by integrating topological information.
  • We introduce a hierarchical attention module to aggregate the output results across each layer during the information propagation process. By augmenting the diversity of node representations, this strategy subsequently improves the predictive accuracy of the ECD-CDGI model.
  • Extensive experiments indicate that the ECD-CDGI model possesses the ability to not only identify known CDGs but also efficiently uncover potential cancer genes. Moreover, compared to the GNN-based approach, the ECD-CDGI model exhibits lower constraints from gene-gene networks, which enhances its ability to identify potential cancer genes.

Materials and methods

The task of identifying CDGs generally draws upon multi-omics data sources including genomics, transcriptomics, proteomics, and metabolomics. The primary workflow entails applying dimensionality reduction techniques to these multi-omics datasets, effectively extracting the low-dimensional representations of genes in the biomolecular network in a reduced dimensional space. Subsequently, the representations of these genes are compared to the representations of known CDGs, enabling the prediction of CDGs. For the scope of this experiment, we utilize a gene set within a 58-dimensional feature space, as cited in the referenced work [ 27 ].

The efficacy of the proposed ECD-CDGI model in predicting CDGs was evaluated across three distinct biomolecular network datasets: PathNet [ 28 ], GGNet [ 29 ], and PPNet [ 30 ]. Specifically, the PathNet dataset comprises a network of interlinked biochemical pathways within cells or organisms, incorporating data from both KEGG and Reactome pathways. GGNet is constructed from RNA interaction data, forming a gene-gene network. Meanwhile, PPNet is extracted from the STRING database. Each of these datasets offers a unique perspective, contributing to a comprehensive evaluation of the model’s performance.

In this study, the term "cancer driver genes" refers to genes that are clearly identified and widely recognized for their crucial roles in the initiation and progression of tumors. These genes are categorized as positive samples. Specifically, 711 well-established driver genes were sourced from the NCG database [ 31 ], and an additional 85 high-confidence driver genes were identified using the DigSEE tool [ 32 ], totaling 796 genes. The positive samples across the PPNet, GGNet, and PathNet networks are derived from these genes. Additionally, drawing on prior findings [ 23 ], negative samples were selected by excluding genes that are 1) listed in the NCG database [ 31 ], 2) linked to "cancer pathways" in the KEGG database [ 33 ], 3) listed in the OMIM disease database [ 34 ], 4) predicted by MutSigdb [ 9 ] to be cancer-related, or 5) expressed in patterns similar to those of known cancer genes. Negative samples therefore comprise genes that are unlikely to be related to cancer. The data used in this study are presented in Table 1.
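The exclusion-based selection of negative samples described above can be sketched with plain set operations. This is only an illustration: the function name, the variable names, and the toy gene identifiers are hypothetical, and the real gene lists would come from the cited databases.

```python
def select_negative_genes(all_genes, exclusion_sets):
    """Return the genes that appear in none of the exclusion sets."""
    excluded = set().union(*exclusion_sets)
    return sorted(set(all_genes) - excluded)


# Toy identifiers standing in for real gene symbols (hypothetical data).
network_genes = ["G1", "G2", "G3", "G4", "G5", "G6", "G7"]
ncg_genes = {"G1"}                      # criterion 1: listed in NCG
kegg_cancer_pathway_genes = {"G2"}      # criterion 2: KEGG "cancer pathways"
omim_genes = {"G3"}                     # criterion 3: listed in OMIM
mutsig_predicted_genes = {"G4"}         # criterion 4: predicted cancer-related
similar_expression_genes = {"G1", "G5"} # criterion 5: similar expression

negatives = select_negative_genes(
    network_genes,
    [
        ncg_genes,
        kegg_cancer_pathway_genes,
        omim_genes,
        mutsig_predicted_genes,
        similar_expression_genes,
    ],
)
print(negatives)
```

A gene only needs to match one criterion to be excluded, which is why the exclusion sets are unioned before the difference is taken.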

https://doi.org/10.1371/journal.pcbi.1012400.t001

Problem formulation

The proposed ECD-CDGI model leverages an encoder grounded in both energy-constrained diffusion processes and attention mechanisms. To facilitate a comprehensive understanding of this model and its architecture, we delineate the foundational principles and associated technologies underpinning the model in this section.

Energy-constrained diffusion process

In this way, the diffusivity serves as a measure of the influence between any two nodes and can also be interpreted as attention of each node-node pair. This insight informs the architecture of encoders built on energy-constrained diffusion processes and attention mechanisms.

Model architecture

Fig 1 illustrates the architecture of the ECD-CDGI model, comprising primarily three modules: the Data Module, the Encoder Module (including ECD-Attention encoder, GNN encoder and Residual connection), and the Multi-layer Attention Module. To enrich the datasets, both the initial feature vectors of gene nodes in the biomolecular network and the network’s topological structure were extracted, as detailed in the materials section. To address the challenges posed by noisy observational data and latent dependencies among nodes within biomolecular networks, we design a novel encoder, termed ECD-Attention. This encoder is grounded in energy-constrained diffusion processes and attention mechanisms. Fig 1(D) illustrates the energy-constrained diffusion process, wherein the energy (information) from each node is distributed to all other nodes in the network, ensuring that the state of each node is influenced by that of every other node. Simultaneously, a GNN encoder is used to mine the topological structure of the biomolecular network, thereby augmenting gene representations. Employing a multi-layer attention mechanism, the proposed model assimilates information across multiple scales to efficiently identify CDGs.

The ECD-CDGI model employs an automatic approach to identify CDGs that comprises several key stages. Initially, the multi-omics data of genes within the biomolecular network are fed into the ECD-Attention encoder, while concurrently, the topological information is input into the GNN encoder. The features extracted from both encoders are then concatenated, followed by residual connections and layer normalization operations. Subsequently, leveraging the message propagation mechanism, the encoding process undergoes multiple iterations, generating multiple sets of gene representations. Ultimately, the multi-layered data are fused using the hierarchical attention module, resulting in the final node representations. These comprehensive representations are then employed to predict CDGs.

The architecture of the ECD-CDGI model includes three principal modules: (A) Data Module, (B) Encoder Module, and (C) Multi-layer Attention Module. (A) The Data Module contains the initial feature vectors and topological architecture of gene nodes within the biomolecular network. (B) The Encoder Module consists of three key components: a newly conceived ECD-Attention encoder based on the energy-constrained diffusion process (D), a GNN encoder, and a residual connection. (C) Employing a hierarchical structure, the Multi-layer Attention Module integrates data across various layers to formulate a comprehensive node representation, which is then used to identify CDGs effectively. (D) The energy-constrained diffusion process.

https://doi.org/10.1371/journal.pcbi.1012400.g001

ECD-Attention encoder.

Building on the insights gained from the Preliminary Section, the diffusion process is governed by energy constraints, which aim to reduce the overall system energy during diffusion, thereby stabilizing the system. Inspired by previous work [ 38 ], we introduce an ECD-Attention encoder that incorporates both energy-constrained diffusion and attention mechanisms. This encoder is crafted to ensure the local consistency of each gene node’s current state during the information propagation process, which resembles a diffusion process, while also preserving global consistency with other gene nodes in the biomolecular network. Notably, the encoder effectively dampens the impact of data noise and reveals latent interdependencies between genes. The following is a detailed presentation of the relevant principles and steps.

Leveraging the energy-constrained diffusion and attention mechanisms, the diffusivity matrix in the diffusion process can be reinterpreted as an attention matrix for gene-gene pairs. Echoing the principles outlined in the Preliminary Section, a straightforward dot-product method is employed to quantify the similarity between any two genes. Furthermore, within the energy-constrained diffusion process, the node state update rule considers the state of all nodes, meaning each node’s state is influenced by every other node. Node state updates are executed by integrating the complete node-node similarity matrix with the value vector. Clearly, this approach is well-suited for the Transformer architecture. In the Transformer architecture, node-node attentions resemble the signal propagation rate S observed in energy-constrained diffusion processes. This process normalizes the similarity between nodes using dot product and sigmoid operations.
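To make the idea concrete, here is a minimal sketch of one all-pairs update in which the pairwise weight is the sigmoid of a query-key dot product. Several details are assumptions rather than the authors' formulation: the sigmoid scores are row-normalized so each node's weights sum to one, the step size `tau` is a hypothetical parameter, and queries, keys, and values are passed in directly instead of being produced by learned projection matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ecd_attention_step(states, queries, keys, values, tau=0.5):
    """One diffusion-style update over all node pairs.

    Each node keeps a (1 - tau) share of its current state and receives a
    tau share of the attention-weighted average of every node's value vector.
    """
    n = len(states)
    dim = len(values[0])
    new_states = []
    for i in range(n):
        # Pairwise similarity: sigmoid of the query-key dot product.
        scores = [sigmoid(dot(queries[i], keys[j])) for j in range(n)]
        total = sum(scores)
        weights = [s / total for s in scores]  # assumed row normalization
        aggregated = [
            sum(weights[j] * values[j][d] for j in range(n)) for d in range(dim)
        ]
        new_states.append(
            [(1.0 - tau) * states[i][d] + tau * aggregated[d] for d in range(dim)]
        )
    return new_states


# Toy example: three nodes with two-dimensional states (hypothetical data).
toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = ecd_attention_step(toy_states, toy_states, toy_states, toy_states)
for row in updated:
    print([round(x, 3) for x in row])
```

Because every node attends to every other node, the cost of one step grows quadratically with the number of nodes in this naive form.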

GNN encoder

Residual connection

Multi-layer attention

To evaluate the efficacy of the ECD-CDGI model, we execute multiple sets of experiments using publicly available datasets. Initially, we engage in comparative analyses against state-of-the-art methods for CDG identification to validate the model’s superior capabilities. Subsequently, we design a series of ablation experiments to evaluate the individual contributions of various modules within the ECD-CDGI architecture. In the final phase, we delve into specific case studies and explore the scalability prospects of our proposed model.

Implementation detail

This study was implemented in Python with the PyTorch framework, focusing on parameters associated with the ECD-Attention encoder, GCN encoder, and multi-layer attention module, along with various hyperparameters. Genomic data served as the initial input for the model, with its dimensionality set at 58. In the ECD-Attention encoder, the transformation weight matrices are preset to a dimension of 100. The multi-layer attention module is configured with four layers by default, with each layer’s initial weight preset at 0.5. Both the ECD-Attention and GCN encoders are stacked across 4 layers. Other hyperparameters include a hidden layer dimension of 100, 100 training rounds, a default learning rate of 0.001, and Adam as the optimizer.
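For reference, the settings listed above can be gathered into a single configuration object. The class and field names below are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ECDCDGIConfig:
    """Hyperparameters as stated in the text (field names are illustrative)."""

    input_dim: int = 58                # dimensionality of the input gene features
    attention_dim: int = 100           # size of the ECD-Attention weight matrices
    hidden_dim: int = 100              # hidden layer dimension
    num_encoder_layers: int = 4        # ECD-Attention and GCN encoder layers
    num_attention_layers: int = 4      # layers fused by multi-layer attention
    initial_layer_weight: float = 0.5  # starting weight of each fused layer
    epochs: int = 100                  # training rounds
    learning_rate: float = 0.001
    optimizer: str = "Adam"


config = ECDCDGIConfig()
layer_weights = [config.initial_layer_weight] * config.num_attention_layers
print(config)
print(layer_weights)
```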

Comparison experiment

We designed a series of benchmarking experiments across three publicly accessible datasets, GGNet, PathNet, and PPNet, to compare the performance of our ECD-CDGI model with six other methods. These comprise three advanced CDG prediction models, EMOGI [ 23 ], MTGCN [ 24 ], and HGDC [ 27 ], as well as three conventional GNN models, GCN [ 25 ], GAT [ 26 ], and ChebNet [ 40 ]. To ensure a level playing field, each method was fed the same feature matrix corresponding to the biomolecular networks. We carried out ten rounds of 5-fold cross-validation for each model. The final performance metrics, represented by the average AUC and AUPR scores, are presented in Table 2.
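The evaluation protocol (ten repetitions of 5-fold cross-validation, averaged AUC) can be sketched as follows. This is a simplified illustration under stated assumptions: the folds are stratified by label, which the text does not specify; no model is trained, so a fixed toy score list stands in for model predictions; and AUPR is omitted for brevity. All function names are hypothetical.

```python
import random


def stratified_folds(labels, k, rng):
    """Split indices into k folds, keeping class proportions roughly equal."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def repeated_cv_splits(labels, k=5, repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples."""
    rng = random.Random(seed)
    for r in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield r, f, train, test


def auc_score(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy labels and stand-in scores (hypothetical; a real run would train a
# model on `train` and score the genes in `test`).
labels = [1] * 10 + [0] * 20
scores = [0.6 + 0.03 * i for i in range(10)] + [0.1 + 0.02 * i for i in range(20)]

fold_aucs = []
for _, _, train, test in repeated_cv_splits(labels, k=5, repeats=10, seed=0):
    pos = [scores[i] for i in test if labels[i] == 1]
    neg = [scores[i] for i in test if labels[i] == 0]
    fold_aucs.append(auc_score(pos, neg))

mean_auc = sum(fold_aucs) / len(fold_aucs)
print(len(fold_aucs), mean_auc)
```

Averaging over all 50 test folds reduces the influence of any single random partition on the reported metric.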

https://doi.org/10.1371/journal.pcbi.1012400.t002

As reflected in Table 2, EMOGI, MTGCN, HGDC, ChebNet, and our proposed ECD-CDGI model all performed well in the task of identifying CDGs, whereas the GCN and GAT models lagged behind. Notably, the EMOGI, MTGCN, HGDC, and ChebNet algorithms all employ Chebyshev polynomials to perform convolution operations. During the message propagation and aggregation phases, these models differentiate between neighboring nodes and the nodes themselves, thereby mitigating the performance degradation typically induced by over-smoothing. Building upon this, the HGDC model incorporates an auxiliary network crafted using graph diffusion technology and aims to enhance predictive accuracy through joint training with the original network. However, HGDC performs on par with, or even slightly worse than, the original ChebNet model. This suggests that the auxiliary network generated through graph diffusion techniques may introduce unpredictable noise.

Notably, our proposed ECD-CDGI model outperformed all competitors across all datasets. It led the second-best performing model by margins of 1.30%, 1.24%, and 2.13% in AUC, and by 1.57%, 2.02%, and 2.76% in AUPR. These results underscore the efficacy of the ECD-Attention encoder, which is grounded in energy-constrained diffusion and attention mechanisms. This encoder is adept at unveiling the complex interdependencies among genes. When combined with the GCN encoder to harness the topological information of the gene-gene network, it substantially enhances the quality of node representation. As illustrated in Fig 2, we plotted the ROC and PR curves for each model on the three datasets. The curves for the ECD-CDGI model consistently lie above those of the other models and remain stable across datasets. This provides additional evidence that the ECD-CDGI model is both efficient and reliable in identifying CDGs.

ROC curves for multiple models on (a) PPNet, (b) PathNet, and (c) GGNet datasets; PR curves for (d) PPNet, (e) PathNet, and (f) GGNet datasets.

https://doi.org/10.1371/journal.pcbi.1012400.g002

Ablation experiment

This section evaluates the individual contributions of four key modules within the ECD-CDGI model: the ECD-Attention encoder, the GCN encoder, the residual connection, and the multi-layer attention mechanism. To this end, we conducted ablation experiments across the three datasets, GGNet, PathNet, and PPNet, while holding other variables constant. The term ’w/o ECD-Att’ denotes a model configuration that removes the ECD-Attention encoder, relying solely on the GCN encoder. Conversely, ’w/o GCN’ signifies a setup where the GCN encoder is excluded, with only the ECD-Attention encoder in place. Likewise, ’w/o Residual’ means that the residual connection module has been removed, while ’w/o multi-Att’ means that the model drops the multi-layer attention mechanism and uses only the encoder’s final layer output for both training and prediction.

We performed ten rounds of 5-fold cross-validation for each model configuration across the three datasets. The results are summarized as average values for the AUC and AUPR metrics, as detailed in Table 3. In general, any version of the ECD-CDGI model that omits one of its key components, whether the ECD-Attention encoder, GCN encoder, residual connection, or multi-layer attention mechanism, experiences a decline in performance. The ECD-Attention encoder captures global information, revealing potential dependencies between indirectly connected genes. The GCN encoder receives information from neighboring nodes and effectively propagates messages based on gene interactions. Residual connections maximize the retention of original features during iterations, preventing the loss of information from nodes in previous layers. The multi-layer attention mechanism automatically learns weights and integrates the node representations produced at each iteration, enhancing model performance.

https://doi.org/10.1371/journal.pcbi.1012400.t003

Diving into details, the model’s performance declines slightly on the GGNet dataset when the GCN encoder is omitted, whereas a more substantial decrease is observed on both the PathNet and PPNet datasets. Intriguingly, this pattern is reversed when the ECD-Attention encoder is omitted. This suggests that the high heterogeneity and complex topological structure of the GGNet dataset may make it difficult for GCNs to effectively capture the intricate relationships and dependencies within the data. The finding also highlights the ECD-Attention encoder’s ability to uncover latent interdependencies among genes, thus boosting the model’s overall performance. Most notably, the model experiences its poorest performance when the Residual module is omitted, indicating its critical role in mitigating the over-smoothing arising during information propagation. It is noteworthy that the Residual module serves as a pivotal element within the ECD-Attention encoder, supplying essential information about the node’s current state during the energy-constrained diffusion process.

Skewed distribution and enrichment analysis

We conducted extensive experiments and analyses across the GGNet, PPNet, and PathNet datasets to evaluate the capability of our proposed ECD-CDGI model to identify previously unknown CDGs. To mitigate the influence of random variables, we ran the ECD-CDGI model through 100 iterations on each of these datasets, thereafter analyzing the predicted gene scores.

As illustrated in Fig 3 , the gene scores predicted by the ECD-CDGI model across all datasets exhibit a positive skewness. A scant number of genes gain conspicuously high scores, deviating from the central cluster of the data, while the majority of gene scores hover between -2 and 0. This is likely attributable to the fact that the overwhelming majority of genes are not CDGs, resulting in only subtle variations in their scores. In contrast, the outliers in the dataset suggest a small subset of genes with markedly higher scores, pointing to a heightened likelihood of them being CDGs. Overall, the ECD-CDGI model demonstrates a robust ability to differentiate these CDGs from other non-CDGs.
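Positive skewness of a score distribution can be checked numerically with the moment-based skewness coefficient. The sketch below uses hypothetical toy scores that mimic the described shape (most values between -2 and 0, a few high outliers); it is not the authors' analysis code.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical gene scores: a dense cluster below zero plus two outliers.
toy_scores = [-1.8, -1.5, -1.2, -1.0, -0.9, -0.7, -0.5, -0.3, 3.5, 5.0]
skew = sample_skewness(toy_scores)
print(round(skew, 3))
```

A value above zero indicates a longer right tail, which matches a setting where only a few genes receive conspicuously high scores.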

https://doi.org/10.1371/journal.pcbi.1012400.g003

We selected the 100 highest-scoring genes from each of the three networks and merged them, resulting in a total of 178 unique genes. This was done to assess the ECD-CDGI model’s ability to recognize these genes. These high-scoring genes were then subjected to enrichment analysis with reference to the DisGeNET database [ 41 ]. In Fig 4(A), each bar on the left represents a different cancer category; the length of the bar indicates the statistical significance of the gene set linked to that disease. A higher -log10(P) value corresponds to a lower p-value, suggesting a stronger association between the gene set and the disease. These results suggest that these high-scoring genes are significantly associated with various diseases, predominantly cancers, particularly pancreatic tumors. To further investigate these genes, we conducted pathway and process enrichment analyses using KEGG pathways, GO biological processes, and other resources, categorizing the genes into clusters based on similarities. In Fig 4(B), on the right, genes are depicted as nodes in different colors, each color representing a distinct enriched pathway. The size of each node correlates with the level of gene enrichment in the corresponding pathway. Purple lines between nodes indicate interactions among genes or the biological processes in which they participate. Of these, 44 genes (24.72%) showed significant enrichment in the "Cancer Pathway" (KEGG Pathway). These genes are likely pivotal in the genesis and progression of tumors. This underscores the capacity of the ECD-CDGI model to identify CDGs accurately, thereby aiding in the elucidation of cancer initiation and progression mechanisms as well as informing relevant treatment strategies.
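The bookkeeping in this paragraph (merging per-network top lists into a set of unique genes, the -log10 transform of p-values, and the 44-of-178 percentage) can be sketched as below. The function names and toy score tables are hypothetical.

```python
import math


def merge_top_genes(score_tables, k):
    """Union of the k highest-scoring genes from each score table."""
    merged = set()
    for scores in score_tables:
        ranked = sorted(scores, key=scores.get, reverse=True)
        merged.update(ranked[:k])
    return merged


def neg_log10(p_value):
    """Smaller p-values map to larger values, hence longer bars."""
    return -math.log10(p_value)


def percent(part, whole):
    return round(100.0 * part / whole, 2)


# Toy score tables for three networks (hypothetical gene names and scores).
tables = [
    {"A": 0.90, "B": 0.80, "C": 0.10},
    {"B": 0.90, "D": 0.70, "A": 0.20},
    {"A": 0.95, "E": 0.60, "B": 0.10},
]
unique_genes = merge_top_genes(tables, k=2)
print(sorted(unique_genes))

# Share of the 178 merged genes enriched in the KEGG cancer pathway.
print(percent(44, 178))
```

Overlap between the per-network lists is why three lists of 100 genes yield fewer than 300 unique genes.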

(a) Results of gene enrichment analysis for various cancers using the ECD-CDGI model; (b) Enrichment analysis leveraging KEGG pathways and GO biological processes.

https://doi.org/10.1371/journal.pcbi.1012400.g004

Identifying new cancer genes

To validate the efficacy of the ECD-CDGI model in identifying novel cancer genes, we conducted targeted experiments. Specifically, we computed the average prediction probabilities for four categories of genes: known CDGs, non-CDGs, a set of potential cancer genes from the ncg7.1 database, and other genes across the GGNet, PathNet, and PPNet datasets. The results detailed in Fig 5 reveal that known CDGs garnered the highest average predicted probabilities, while non-CDGs received the lowest. This underscores the ECD-CDGI model’s capability to accurately differentiate between CDGs and non-CDGs. Intriguingly, the average predicted probability for potential cancer genes was also markedly higher than that for non-CDGs and other genes. This suggests that the ECD-CDGI model is not only proficient in identifying known CDGs but is also adept at uncovering potential cancer genes.

https://doi.org/10.1371/journal.pcbi.1012400.g005

Case analysis

We undertook a comparative analysis to evaluate the adaptability of the ECD-CDGI model across diverse datasets. Specifically, we selected the 50 genes with the highest predictive scores from each of the GGNet, PPNet, and PathNet datasets, and then quantified the number and percentage of CDGs among them. These findings are visually represented in Fig 6(A) through a Venn diagram. Interestingly, a high-scoring gene that is unique to a single dataset is notably less likely to be a CDG than one that appears across multiple datasets. This observation indicates that genes scoring highly across various datasets are more likely to be CDGs. It is important to acknowledge that, due to inherent constraints in each dataset, such as noisy data, the complexity of multi-omics data, and variations in gene topological networks, the ECD-CDGI model may make inaccurate predictions. To mitigate these limitations, a cross-dataset analysis can be performed to enhance the precision in identifying CDGs.

(a) Venn diagram illustrating the quantity and proportion of CDGs identified by the ECD-CDGI model across three datasets. (b) Pie chart showing the proportion of known CDGs, cancer-related genes, and other genes identified as CDGs by the ECD-CDGI model on three datasets.

https://doi.org/10.1371/journal.pcbi.1012400.g006

Additionally, we analyzed the genes that were consistently identified across all three datasets. As depicted in Fig 6(B), out of the 26 genes analyzed, 19 were classified as CDGs, making up 73.08% of the total. Three genes, although not defined as CDGs, were listed as cancer-related in the ncg7.1 database, and constituted 11.54% of the sample. Four other genes, TTN, PCLO, LRP2, and RYR2, accounted for the remaining 15.38%. While these genes are not cataloged in the ncg7.1 database, existing literature [ 42 – 44 ] suggests their significant relevance to cancer.

To investigate patient-specific CDGs, we gathered and assessed patient-specific data using the ECD-CDGI model. Mutant genes with higher prediction scores are more likely to be specific driver genes, potentially accelerating cancer progression. Specifically, we utilized the Xena tool [ 45 ] to collect somatic mutation data from 5776 patients across 14 cancer types in the TCGA database [ 45 ]. Initially, we screened and retained genes present in the GGNet, PathNet, and PPNet networks from the patients’ mutant gene data. Building on this, we selected 5535 patients with five or more mutant genes for further analysis. We quantified the mutant genes of each patient (see Fig 7 ) and observed that some patients had fewer than five cancer driver genes, with 2.40% of patients lacking any cancer driver genes in their mutations. Prior studies suggest that having five or more cancer driver genes may correlate with individual cancer development [ 46 ]. Therefore, identifying patients’ specific CDGs is crucial for targeted treatment.
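The patient screening steps above (keep only mutated genes present in the networks, then retain patients with at least five such genes) can be sketched as a simple filter. The function names and toy data are hypothetical, and how the three networks' gene sets are combined is left to the caller because the text does not specify it.

```python
def filter_patients(patient_mutations, network_genes, min_genes=5):
    """Keep network genes per patient; drop patients with too few of them."""
    kept = {}
    for patient, genes in patient_mutations.items():
        in_network = set(genes) & network_genes
        if len(in_network) >= min_genes:
            kept[patient] = in_network
    return kept


def fraction_without_drivers(kept, driver_genes):
    """Share of retained patients whose mutated genes include no known driver."""
    without = sum(1 for genes in kept.values() if not genes & driver_genes)
    return without / len(kept)


# Toy data (hypothetical patient IDs and gene symbols).
network_genes = {"A", "B", "C", "D", "E", "F", "G", "H"}
known_drivers = {"A", "B"}
patient_mutations = {
    "P1": ["A", "B", "C", "D", "E", "Z"],  # 5 network genes, has drivers
    "P2": ["A", "B", "Z", "Y"],            # only 2 network genes, dropped
    "P3": ["C", "D", "E", "F", "G", "H"],  # 6 network genes, no drivers
}

kept_patients = filter_patients(patient_mutations, network_genes)
print(sorted(kept_patients))
print(fraction_without_drivers(kept_patients, known_drivers))
```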

https://doi.org/10.1371/journal.pcbi.1012400.g007

In this study, we assessed the ECD-CDGI model’s efficacy in identifying patient-specific CDGs for mutant genes, alongside relevant analyses. Specifically, the model was trained using omics data from 14 cancer types on three biomolecular networks: GGNet, PathNet, and PPNet. For each type of cancer, the model generated three predictive gene ranking lists. For each patient, the Rank algorithm [ 47 ] was employed to merge the three gene rankings into a consolidated final list. Subsequently, the top five mutant genes from the final ranking were selected as the specific CDGs for each patient. As illustrated in Fig 8 , within the PPNet network, the shortest distances between the identified driver genes were notably shorter than those between the mutant genes prior to screening. This suggests that the identified CDGs are closely interconnected, likely cooperating within shared biological pathways or functional modules. This tight linkage intensifies their impact on tumor formation, potentially accelerating tumor progression and malignancy.
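The two steps described here, merging three rankings and comparing shortest-path distances among selected genes, can be illustrated with a small sketch. Note the swapped-in technique: the text cites a specific Rank algorithm [ 47 ] whose details are not given here, so mean-rank aggregation is used purely as a stand-in. The function names, toy rankings, and toy graph are all hypothetical.

```python
from collections import deque
from itertools import combinations


def mean_rank_merge(rankings):
    """Merge ranked lists by average position (a stand-in aggregation rule)."""
    genes = set().union(*rankings)

    def avg_rank(gene):
        # Genes missing from a list are placed just after its last entry.
        return sum(r.index(gene) if gene in r else len(r) for r in rankings) / len(rankings)

    return sorted(genes, key=lambda g: (avg_rank(g), g))


def patient_drivers(merged_ranking, mutated_genes, top_n=5):
    """Top-ranked genes that are mutated in the given patient."""
    return [g for g in merged_ranking if g in mutated_genes][:top_n]


def shortest_path_length(adjacency, src, dst):
    """Breadth-first search distance in an unweighted graph, or None."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == dst:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def mean_pairwise_distance(adjacency, genes):
    """Average shortest-path length over all connected gene pairs."""
    dists = [shortest_path_length(adjacency, a, b) for a, b in combinations(sorted(genes), 2)]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists) if dists else None


# Toy rankings from three networks and a toy path graph A - B - C - D.
rankings = [["A", "B", "C", "D"], ["B", "A", "D", "C"], ["A", "C", "B", "D"]]
merged = mean_rank_merge(rankings)
print(merged)
print(patient_drivers(merged, {"B", "D", "X"}, top_n=2))

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(mean_pairwise_distance(graph, ["A", "B"]), mean_pairwise_distance(graph, ["A", "D"]))
```

A smaller mean pairwise distance among the selected genes than among all mutated genes is the kind of pattern the paragraph reports for PPNet.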

https://doi.org/10.1371/journal.pcbi.1012400.g008

In subsequent analyses, we focused on the top 500 genes with the highest prediction scores across the GGNet, PPNet, and PathNet datasets. After removing well-established CDGs, we consider the remaining genes as potential cancer genes. We then probed whether a relationship exists between these potential cancer genes identified by the ECD-CDGI and their connectivity to known CDGs.

As illustrated in Fig 9(A) and 9(B), for the PPNet and PathNet datasets, the Spearman correlation coefficients are both below 0.1, and the p-values are well above the 0.05 significance threshold, indicating no statistically significant correlation. Fig 9(C) reveals that in the GGNet dataset, the Spearman correlation coefficient is 0.17, with a p-value of 0.0238, falling below the 0.05 threshold, signifying a slight but statistically significant positive correlation between the two variables. These results suggest that the potential cancer genes identified by the ECD-CDGI model depend only weakly on their connectivity to known CDGs. Importantly, this implies that the ECD-CDGI model is less constrained by existing gene-gene networks in identifying potential cancer genes. As a result, it is better suited for the discovery of novel cancer genes, a task that proves challenging for methods based on GNNs.
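For readers unfamiliar with the statistic, the Spearman coefficient is the Pearson correlation of the rank-transformed variables. The sketch below computes the coefficient only (no p-value) on hypothetical toy data; the variable names are illustrative and not taken from the study.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: prediction scores versus number of links to known CDGs
# (hypothetical values chosen to show a weak relationship).
prediction_scores = [0.91, 0.85, 0.80, 0.78, 0.74, 0.70]
links_to_known_cdgs = [3, 7, 1, 5, 2, 6]
rho = spearman(prediction_scores, links_to_known_cdgs)
print(round(rho, 3))
```

A coefficient near zero means that a gene's score rank says little about how many known CDGs it is connected to.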

https://doi.org/10.1371/journal.pcbi.1012400.g009

Discussion and Conclusion

This study investigates the pivotal importance of identifying CDGs for both cancer research and clinical treatment, and evaluates various methodologies geared towards this purpose. While existing machine learning and deep learning techniques are indeed effective, they come with inherent limitations. Most notably, these methods often overlook the complex interdependencies between any two genes and may be compromised by noisy data, a byproduct of data collection oversights.

To address these shortcomings, we introduce the ECD-CDGI model, which incorporates an energy-constrained diffusion process and an attention mechanism. Combined with GNNs and multi-layer attention techniques, our model offers a robust tool for identifying CDGs. Our specially designed ECD-Attention encoder not only uncovers the complex global interrelationships between any two genes but also captures nuanced local information specific to individual gene nodes. Additionally, we integrate residual connections within the model’s layers to mitigate the performance degradation caused by over-smoothing during inter-layer information propagation. Employing GNN technology, the ECD-CDGI model is capable of extracting topological information from gene-gene networks and leverages a multi-layer attention mechanism for predicting CDGs. Comparison and ablation experiments conducted on public datasets confirm the model’s superior performance. We anticipate that the ECD-CDGI model will play a significant role in cancer research and treatment protocols, offering researchers an efficient tool for understanding the mechanisms of cancer development.

Despite its efficacy in CDG prediction, the ECD-CDGI model has certain limitations. Firstly, the presence of missing or erroneous links in biomolecular networks can compromise the model’s performance. Excessive errors or missing links can mislead the learning process and diminish the model’s accuracy. Secondly, while graph neural networks utilize the topological information in biomolecular networks effectively, the absence of comprehensive omics data still impacts their performance. In practical applications, critical omics data, including gene expression, protein interactions, and metabolite profiles, are often incomplete or unavailable. This lack of data can prevent the model from fully understanding gene network interactions, potentially misleading its learning process. Additionally, integrating and synergizing various types of omics data presents challenges due to differing data characteristics and noise levels, where improper handling could impair the model’s performance. To address these issues, future work will focus on mitigating the identified problems. Firstly, we plan to employ debiasing and sampling techniques to minimize the effects of erroneous or incomplete data. Additionally, we will explore multi-omics fusion techniques to fully leverage diverse datasets. Concurrently, we will assess imputation methods to further diminish the impact of data gaps in omics datasets.

  • Research article
  • Open access
  • Published: 03 September 2024

Nanopore sequencing provides snapshots of the genetic variation within salmonid alphavirus-3 (SAV3) during an ongoing infection in Atlantic salmon ( Salmo salar ) and brown trout ( Salmo trutta )

  • HyeongJin Roh   ORCID: orcid.org/0000-0002-1825-2375 1 ,
  • Kai Ove Skaftnesmo 1 ,
  • Dhamotharan Kannimuthu 1 ,
  • Abdullah Madhun 1 ,
  • Sonal Patel 1 , 2 ,
  • Bjørn Olav Kvamme 1 ,
  • H. Craig Morton 1 &
  • Søren Grove 1  

Veterinary Research volume 55, Article number: 106 (2024)

Frequent RNA virus mutations raise concerns about evolving virulent variants. The purpose of this study was to investigate genetic variation in salmonid alphavirus-3 (SAV3) over the course of an experimental infection in Atlantic salmon and brown trout. Atlantic salmon and brown trout parr were infected using a cohabitation challenge, and heart samples were collected for analysis of the SAV3 genome at 2-, 4- and 8-weeks post-challenge. PCR was used to amplify eight overlapping amplicons covering 98.8% of the SAV3 genome. The amplicons were subsequently sequenced using the Nanopore platform. Nanopore sequencing identified a multitude of single nucleotide variants (SNVs) and deletions. The variation was widespread across the SAV3 genome in samples from both species. Mostly, specific SNVs were observed in single fish at some sampling time points, but two relatively frequent (i.e., major) SNVs were observed in two out of four fish within the same experimental group. Two other, less frequent (i.e., minor) SNVs only showed an increase in frequency in brown trout. Nanopore reads were de novo clustered using a 99% sequence identity threshold. For each amplicon, a number of variant clusters were observed that were defined by relatively large deletions. Nonmetric multidimensional scaling analysis integrating the cluster data for eight amplicons indicated that late in infection, SAV3 genomes isolated from brown trout had greater variation than those from Atlantic salmon. The sequencing methods and bioinformatics pipeline presented in this study provide an approach to investigate the composition of genetic diversity during viral infections.

Introduction

The emergence of new viral strains with increased virulence is of great concern to the aquaculture sector. Salmonid alphavirus (SAV) is the causative agent of pancreas disease (PD) in Atlantic salmon ( Salmo salar ) and of sleeping disease (SD) in rainbow trout ( Oncorhynchus mykiss ). SAV is an enveloped, spherical, single-stranded positive-sense RNA virus with a diameter of ~70 nm belonging to the Togaviridae family. The SAV genome is approximately 12 kb long and comprises two open reading frames (ORF1 and ORF2) that both encode polyproteins [ 1 ]. ORF1 encodes four nonstructural proteins (nsP1, nsP2, nsP3, and nsP4) that are required for RNA synthesis [ 2 ]. As in other alphaviruses, SAV ORF2 likely encodes six structural proteins, i.e., C, E2, E3, 6K, E1 and TF, where C is the capsid protein and E1, E2 and E3 are constituents of the heterotrimeric spike proteins in the envelope [ 3 , 4 ]. 6K is an ion channel protein [ 5 ], whereas the TransFrame (TF) protein, known from several alphaviruses, is produced by a ribosomal –1 frameshift in 6K. The TF protein has the same N-terminus as 6K but a unique C-terminus, which may be relevant to virion stability, antigenicity, fusion, and tropism [ 4 , 6 ].

Since SAV was first identified in 1995, at least six subtypes have been described based on nucleotide sequence analysis of nsP3 and E2 [ 7 , 8 ]. More recently, the existence of a seventh genotype has been proposed based on an SAV isolate from Ballan wrasse ( Labrus bergylta ) [ 3 ]. The SAV subtypes show differences in geographical distribution, host range, and clinical manifestations [ 1 , 9 , 10 ]. SAV1 (salmon pancreas disease virus; SPDV) and SAV2 (sleeping disease virus; SDV) were characterized as two separate subtypes from approximately 1999–2000 [ 11 , 12 ]. The SAV3 subtype (Norwegian salmonid alphavirus; NSAV) was first characterized by Hodneland et al. [ 13 ]. Over the whole genome, the subtypes have been shown to share ~86–96% genetic identity [ 3 , 13 ].

Gallagher et al. [ 8 ] reported SAV sequencing data suggesting that individual farmed fish may become coinfected with different SAV subtypes. Infection of a host with two or more viral subtypes may be a basis for viral genetic changes via recombination. Similarly, a single SAV subtype transmitted from one host species or region to another may undergo genetic changes during adaptation [ 8 , 14 , 15 ]. RNA viruses generally have high mutation rates of between ~10⁻⁶ and 10⁻⁴ substitutions per nucleotide site per cell infection. A previous study estimated the SAV substitution rate to be approximately 1.70 (± 1.03) × 10⁻⁴ nt substitutions/site/year [ 16 ]. A more recent study of the genome-wide substitution rate for SAV3 estimated 7.351 × 10⁻⁵ substitutions per site per year, with a 95% highest posterior density range of 5.33 × 10⁻⁵ to 9.994 × 10⁻⁵ [ 17 ]. In addition, there is evidence that SAV can frequently undergo mutations and deletions even within a single host [ 8 , 18 ]. Petterson et al. [ 18 ] reported that many genome deletions are generated during natural SAV infection, and subsequent verification of frequent deletion mutations was achieved using nanopore sequencing methods [ 17 ]. The low fidelity of the RNA-dependent RNA polymerase (RdRp) and the high incidence of recombination via template switching during replication both contribute to this high mutation rate [ 19 , 20 , 21 ]. The copy choice model is a widely accepted mechanistic model for viral recombination and is particularly relevant for single-stranded positive-sense RNA viruses such as SAV [ 22 , 23 ]. In an infected cell, erroneous replication may produce considerable variation in the virus genome sequence and thus in the expressed viral proteins. 
In addition to this type of variation, selective pressure may also lead to “intracellular adaptations” that improve viral fitness in a particular host cell environment, including adaptations to codon and codon pair usage, improved suppression of the IFNα/β response and more [ 24 ]. Viral particles exiting infected cells may differ in the amino acid (aa) sequence of their capsid and spike proteins, leading to possible changes in their receptor binding affinities and specificities and hence potentially to changes in cell, tissue and host tropism. Virus particles with altered protein sequences may also be less prone to recognition by specific antibodies. With such variation and the inferred potential differences in viral function, fitness and adaptability, the viral consensus sequence may be insufficient to characterize a virus. Instead, the variation can be better understood as a mutant spectrum or quasispecies, which may provide a better definition of wild-type virus [ 25 ].
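
As a rough, illustrative piece of arithmetic (not part of the cited analyses), the per-site substitution rates quoted above can be scaled to a whole genome. The sketch below assumes a genome length of exactly 12,000 nt, which is a rounding of the ~12 kb stated earlier; the function and variable names are hypothetical.

```python
# Illustrative arithmetic only. GENOME_LENGTH_NT is an assumption:
# the text gives the SAV genome as "approximately 12 kb".
GENOME_LENGTH_NT = 12_000


def substitutions_per_genome_per_year(rate_per_site_per_year: float,
                                      genome_length_nt: int = GENOME_LENGTH_NT) -> float:
    """Scale a per-site, per-year substitution rate to the whole genome."""
    return rate_per_site_per_year * genome_length_nt


# Rates as quoted in the text (substitutions per site per year).
quoted_rates = {
    "earlier SAV estimate [16]": 1.70e-4,
    "SAV3 genome-wide estimate [17]": 7.351e-5,
    "SAV3 95% HPD lower bound [17]": 5.33e-5,
    "SAV3 95% HPD upper bound [17]": 9.994e-5,
}

for label, rate in quoted_rates.items():
    per_genome = substitutions_per_genome_per_year(rate)
    print(f"{label}: {per_genome:.2f} substitutions per genome per year")
```

Under this assumed genome length, the SAV3 genome-wide estimate corresponds to slightly less than one substitution per genome per year. This is a consensus-level rate over time and says nothing directly about the within-host variant spectrum discussed here.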

Long-read deep sequencing technologies, such as single-molecule real-time sequencing by Pacific Biosciences and Oxford Nanopore, have significantly contributed to the understanding and profiling of genetic variations in pathogens [ 26 , 27 , 28 , 29 ]. In particular, Oxford Nanopore long-read sequencing technology has proven useful for identifying new SAV genotypes and for profiling SAV mutation sites [ 3 , 8 , 30 ]. Until recently, a prevailing issue with long-read sequencing platforms has been the inherent low base-calling accuracy [ 31 ], which may lead to the misidentification of mutations in individual nanopore reads. Several methods have been proposed to overcome this limitation. Gallagher et al. [ 32 ] demonstrated that sequencing errors generated from the Oxford Nanopore platform can be minimized by achieving a sufficient sequencing depth. They found that a sequencing depth of more than 50 × was sufficient to accurately sequence the SAV genome. Aligning long reads to a consensus sequence is a standard pipeline for identifying single nucleotide polymorphisms (SNPs) and structural variants. However, the relatively high error rate in individual reads makes it difficult to distinguish rare minor variants within the cloud of nonvariant reads. As an alternative, unique molecular identifiers (UMIs) have been utilized to address sequencing errors, but other technical issues, such as accurate titration of input templates and sequencing depth, remain challenging [ 33 , 34 ]. Most recently, owing to improvements in the chemistry of sequencing library preparation kits, the structural and functional properties of nanopores, and base-calling algorithms, the accuracy of each raw read can now exceed 99.9% (> Q30) with the duplex basecalling algorithm [ 35 ]. 
By excluding reads found in low numbers, likely representing random sequencing errors, the sequencing fidelity of reads included in the analysis can be increased.

With such high accuracy of single reads, sequence diversity can be profiled by de novo clustering using high thresholds of sequence identity, a technique that is widely applied in microbiome studies from PCR amplicons. In such studies, sequence reads from PCR amplicons (e.g., from the 16S or 18S rRNA gene) can be clustered and classified as operational taxonomic units (OTUs) based on sequence identity [ 36 , 37 ]. Alongside the advantage of amplicon clustering, the high accuracy of single long reads enables the relatively precise profiling of minor variants within a sample. In other words, it allows for both the identification of genetic variation within a sample and de novo assembly of multiple complete genomes for viral variants, strains, and/or quasispecies within a sample. In this study, nanopore sequence reads were clustered based on sharing at least 99% sequence identity. The cluster containing the largest number of reads was designated the “major cluster”, while clusters with fewer sequence reads were defined as “minor clusters”. The consensus that can be generated from each cluster may provide an overview of the most frequent variants present in the analysed samples. In this study, we aimed to 1) develop an SAV3 variant identification method within a sample using high-accuracy nanopore reads; 2) identify major and minor SAV3 variants that arise during an active infection; and 3) explore potential genetic variations that occur when SAV3 infects either Atlantic salmon or brown trout.

Materials and methods

Fish and viral challenge.

Atlantic salmon and brown trout were reared at the Institute of Marine Research (IMR), Research Station in Matre (Masfjorden, Norway). Prior to viral challenge, the fish were transported to IMR’s fish disease laboratories in Bergen (Norway). The salmon and trout were acclimated in 400 L tanks supplied with freshwater at a flow rate of approximately 400 L h⁻¹. Commercial feed was provided twice daily, and the water temperature was maintained at 10–12 °C. The photoperiod was maintained at 12 h light and 12 h dark during both the acclimation and the experiment. Viral challenge was performed as a cohabitation challenge. In brief, naïve salmon shedder fish were injected intramuscularly with 2 × 50 µL of 1 × 10⁴ TCID₅₀ mL⁻¹ SAV3 inoculum [ 38 ]. The virus was propagated in CHH-1 cells, and passage 3 of the virus was used in this trial. The shedder fish were marked by adipose fin clipping so that cohabitant fish could be sampled selectively during the subsequent sampling period. Then, 30 salmon shedders and 70 naïve salmon or trout were transferred to 250 L experimental tanks where they remained for the duration of the cohabitation challenge experiment. At 2, 4, and 8 weeks post-challenge (wpc), sixteen cohabitant fish of each species were euthanized using an overdose of benzocaine (160 mg L⁻¹; Apotekproduksjon AS, Norway). This produced six experimental groups consisting of specific combinations of sampling time points and fish species (2wpc_Salmon, 4wpc_Salmon, 8wpc_Salmon, 2wpc_Trout, 4wpc_Trout, and 8wpc_Trout). Hearts were dissected from all the fish, transferred to RNALater (Ambion, TX, USA) and stored at −80 °C until further analysis. All experiments involving live animals were approved by the Norwegian Food Safety Authority (FOTS approval number 11260).

RNA extraction and quantitative PCR (qPCR)

Total RNA was extracted from the heart following the standard protocol of the Promega ReliaPrep™ simplyRNA HT 384 kit (Promega, WI, USA) on a Biomek 4000 Laboratory Automated Workstation (Beckman Coulter, CA, USA). The total RNA concentration was quantified using a NanoDrop™ 1000 spectrophotometer (Thermo Scientific, MA, USA), and the RNA samples were diluted to 100 ng µL⁻¹ using the same Biomek 4000 workstation. Quantitative RT-PCR was conducted using the AgPath-ID One Step RT-PCR kit (ThermoFisher, MA, USA) according to the manufacturer’s instructions with primers targeting the SAV3 nsP1 gene (F: 5′-CCGGCCCTGAACCAGTT-3′; R: 5′-GTAGCCAAGTGGGAGAAAGCT-3′ and probe: 6FAM-TCGAAGTGGTGGCCAG-MGBNFQ) [ 39 ]. Briefly, 200 ng of total RNA was added to a reaction mixture containing 400 nM forward and reverse primers and 160 nM probe in a total volume of 10 µL on a 384-well plate [ 39 ]. The qPCR protocol included reverse transcription (1 cycle: 45 °C/10 min), predenaturation (1 cycle: 95 °C/10 min), 40 cycles of amplification (95 °C/15 s and 60 °C/45 s) and fluorescence detection using a QuantStudio 5 real-time PCR system (Applied Biosystems, MA, USA).

Nanopore sequencing library preparation

Only heart samples with Ct values below 35 were included for analysis via nanopore sequencing. A total of 22 heart samples from salmon and trout at 2, 4, and 8 wpc were included in this experiment. Each experimental group (i.e., fish species at a specific sampling time point) included 3–4 samples, given the maximum of 24 barcodes available in the nanopore sequencing library used in this study (Additional file 1 ). From each sample, 1 µg of total RNA was added to a total of 10 µL of cDNA reaction mix containing 10X SuperScript reverse transcriptase, 5X VILO reaction and random hexamers (SuperScript VILO cDNA synthesis kit (Invitrogen, MA, USA)). The cDNA mixture was then sequentially incubated under the following conditions: 25 °C for 10 min, 42 °C for 60 min, 50 °C for 30 min, and 85 °C for 5 min. For each sample, eight sets of PCR primers were used to produce eight amplicons (amplicon1–amplicon8; amp1–amp8) that covered most of the SAV genome (Figure  1 A; Additional file 2 ). Briefly, the PCR mixture was prepared using the following components: 2 µL of 5X Q5 reaction buffer, 0.2 µL of 10 mM dNTPs, 0.1 µL of Q5 hot-start DNA polymerase (20 units mL⁻¹), primers (forward and reverse; 5 µM), 1 µL of cDNA (synthesized from 100 ng of total RNA), and DNase-free water up to 10 µL. The PCR conditions were as follows: 1 cycle of denaturation (98 °C for 30 s), 35 cycles of amplification (98 °C for 10 s, 62 °C for 30 s, and 72 °C for 3 min), and 1 cycle of post-extension (72 °C for 8 min). Amplicons were cleaned using AMPure XP beads according to the manufacturer’s guidelines (Beckman Coulter, CA, USA). Blunt end repair and DNA ligation were carried out using the NEBNext End Repair Module and NEBNext Ligation Sequencing Kit (NEBNext, MA, USA). A Native Barcoding Kit 24 (Q20+ and duplex enabled, Oxford Nanopore, UK) was used to assign one unique barcode per sample to all eight amplicons from that sample. 
All the barcoded samples were then pooled together and sequenced using a MinION flow cell (R10.4, Oxford Nanopore, UK).

figure 1

The SAV3 genome, amplicon details and the bioinformatic protocol applied in the study. A The ~12 kb SAV3 genome encodes four nonstructural proteins (nsP1-4) and five structural proteins (C-E1), and the eight overlapping amplicons (amp1-8) cover ~98.8% of its length. B Schematic diagram of the bioinformatic approaches used in the study. Gray boxes: from nanopore sequencing of amplicons to mapped SAV3 reads; Green box: identification of single nucleotide variants (SNVs); Blue boxes: workflow to identify consensus clusters inferred from SAV3 reads sharing at least 99% sequence identity.

Bioinformatics

Basecalling.

Basecalling was performed using the GPU-enabled Guppy 6.06 basecaller with the super accuracy configuration dna_r10.4_e8.1_sup.cfg. Since the accuracy of the raw reads is important for downstream variant calling analyses, we further implemented the newer duplex basecalling capability introduced by Oxford Nanopore (UK). Duplex tools were used to identify duplex pairs. The Guppy duplex basecalling command was then executed with the super accuracy configuration (dna_r10.4_e8.1_sup.cfg), and the duplex pair information identified in the prior step was used as input. The flags “--barcode_kits "SQK-NBD112-24" --trim_barcodes --trim_adapters --trim_strategy dna --require_barcodes_both_ends” were included in this command to ensure proper demultiplexing and trimming of adapter sequences.

Single nucleotide variant (SNV) identification

To identify single nucleotide variants (SNVs) occurring in salmon samples at 4 and 8 wpc and in all trout samples, a consensus genome was constructed from the reads from the salmon samples at 2 wpc. Briefly, the sequence reads from the 2wpc_Salmon experimental group were mapped onto the published SAV3 genome (SAV3-2-MR/10 isolate; GenBank accession: KC122926), after which Tablet (ver. 1.21.02.08) [ 40 ] was used to generate the “2wpc consensus genome”. All variant analyses were conducted using the 2wpc consensus genome. The FastQ files for each sample, identified by the barcodes, were mapped onto the 2wpc consensus genome using Bowtie2 with the “very sensitive” option [ 41 ]. The SAM file was converted to a sorted BAM file using samtools, and the variant calling file (vcf) was produced using BCFtools call with the option “-m” or “-mv” [ 42 , 43 ]. The terminology related to the analysis of SNVs conducted in this study is defined in Table  1 . Excluding primer binding site sequences, SNVs were identified using the variant calling command with the “-mv” option. Any of the three possible nucleotides that differed from the nucleotide in the reference genome at a polymorphic site was defined as an “SNV-allele”. SNV-alleles with an SNV-allele frequency (SNV-allele freq) of 5–60% were considered minor SNV-alleles, while SNV-alleles with an SNV-allele freq above 60% were considered major. For each sampling time point and fish species (i.e., experimental group), the number of major SNV-alleles was counted (Figure  2 ).
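
The major/minor thresholds described above can be sketched as a small classifier. This is a minimal illustration, not the authors' code. It assumes that the SNV-allele frequency is the number of reads carrying the variant nucleotide divided by the total number of reads covering the site (the precise definition is given in Table 1, which is not reproduced here), and the function name and the "below threshold" label are hypothetical.

```python
# Thresholds as stated in the text: 5-60% is minor, above 60% is major.
MINOR_MIN_FREQ = 0.05
MAJOR_MIN_FREQ = 0.60


def classify_snv_allele(variant_reads: int, total_reads: int) -> str:
    """Classify one SNV-allele at one site in one sample.

    Assumption: frequency = variant_reads / total_reads at that site.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")

    freq = variant_reads / total_reads
    if freq > MAJOR_MIN_FREQ:
        return "major"
    if freq >= MINOR_MIN_FREQ:
        return "minor"
    return "below threshold"


# Hypothetical read counts for three sites in one sample.
for site, (variant, depth) in {"site_a": (140, 200), "site_b": (24, 200), "site_c": (3, 200)}.items():
    print(site, classify_snv_allele(variant, depth))
```

A frequency of exactly 60% is treated as minor here because the text reserves "major" for frequencies above 60%.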

figure 2

The incidence of major SNV-alleles in the experimental groups. The individual locations of each SNV are marked on the SAV3 genome. The ratio of fish with major SNV-alleles in the various experimental groups (2wpc_Salmon, 4wpc_Salmon, 8wpc_Salmon, 2wpc_Trout, 4wpc_Trout, and 8wpc_Trout). 1 The positions of each gene on the SAV3 genome, 2 details of the major SNV-allele, 3 amino acid position numbering for each protein, and 4 resulting changes in amino acids, i.e., from WT (2wpc_Salmon consensus genome) to variant (changes shown in red), 5 experimental groups (i.e., fish species at specific sampling time points). Each experimental group in which one fish was shown to have an SNV is shown in bold black numbers and yellow. Each experimental group in which two fish were shown to have a specific SNV-allele is shown in red bold numbers and orange.

Identification of major and minor SAV3 cluster(s) in each amplicon

For each sample, all the sequence reads in the FastQ files were mapped onto each of the eight individual amplicons using Bowtie2 with the same options as described in the subsection “Single nucleotide variant (SNV) identification”. The reads from amplicon (amp) 7 and amp8 were pooled together for clustering because the amplicons overlapped somewhat (Figure  1 ). Antisense reads in the sets were transformed to complementary sense reads using FASTX-Toolkit [ 44 , 45 ]. The reads from each amplicon were de novo clustered (i.e., amp1-cluster to amp8-cluster) using qiime2 and a 99% sequence identity threshold [ 46 ]. In detail, the sample information and FastQ files were imported into a .qza file using the qiime2 “tools” option with the flags “--type SampleData[SequencesWithQuality]” and “--input-format SingleEndFastqManifestPhred33V2”. Then, the individual sequences and table files were extracted with “vsearch dereplicate-sequences”, and finally, de novo clustering was carried out through “vsearch cluster-features-de-novo” with the flag “--p-perc-identity 0.99”. Only reads not shorter than 90% of the amplicon length were included in the clustering, and only clusters that contained at least 0.5% of all reads for the given amplicon were used for further analysis. For each amplicon, the clusters passing the above criteria were then aligned, and phylogenetic trees were produced using the maximum likelihood phylogenetic method with 1000 bootstrap replicates in MEGA11 [ 47 , 48 ].
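
The two filtering criteria described above (read length of at least 90% of the amplicon length, and clusters holding at least 0.5% of the reads for the amplicon) can be sketched as follows. This is an illustrative sketch rather than the pipeline used in the study: the function names and the example read counts are hypothetical, and it assumes that the denominator for the 0.5% criterion is the sum of reads over all clusters of that amplicon.

```python
from typing import Dict, Tuple


def read_passes_length_filter(read_length: int, amplicon_length: int) -> bool:
    """True when the read is not shorter than 90% of the amplicon length."""
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    return read_length * 10 >= amplicon_length * 9


def label_clusters(read_counts: Dict[str, int]) -> Dict[str, Tuple[float, str]]:
    """Keep clusters with >= 0.5% of the amplicon's reads and label them.

    The retained cluster with the most reads is labelled "major" and all
    other retained clusters "minor". Proportions use all reads as denominator.
    """
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        return {}

    # 0.5% equals 1/200, so n / total >= 0.005 is the same as n * 200 >= total.
    kept = {name: n for name, n in read_counts.items() if n * 200 >= total_reads}
    if not kept:
        return {}

    major = max(kept, key=kept.get)
    return {
        name: (n / total_reads, "major" if name == major else "minor")
        for name, n in kept.items()
    }


# Hypothetical read counts per cluster for one amplicon.
example_counts = {"cluster1": 900, "cluster2": 80, "cluster3": 18, "cluster4": 2}
for name, (proportion, label) in label_clusters(example_counts).items():
    print(f"{name}: {proportion:.1%} of reads, {label}")
```

In this made-up example, cluster4 holds 0.2% of the reads and is discarded, while the remaining three clusters are retained.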

Visualization of the location of selected deletions and SNVs in the SAV3 spike protein

The amino acid sequences for E1, E2, and E3 from the 2wpc_consensus genome were used. The SAV3 spike protein structure was modelled using homology modelling in SWISS-MODEL in automated mode [ 49 ]. The 3D structure of the SAV3 spike protein model was visualized using PyMOL software [ 50 , 51 ]. The predicted 3D structure was used to visualize the location of the deletions observed in minor clusters that contained more than 10% of the reads. Additionally, the sites with nonsynonymous minor or major SNVs are shown in the 3D structure.

Statistical analysis

Duncan’s HSD one-way ANOVA was used for the statistical analysis of Ct values and relative cluster size data. Welch’s two-sample t test was used for the SNV freq and SNV-allele freq analyses. The threshold of the p value was set to less than 0.05. All the statistical analyses were carried out using the “haven” library in R [ 52 ]. The statistical significance of the frequency of major SNV-alleles compared to the amino acid composition of the SAV3 2wpc_Salmon consensus genome was confirmed using chi-square testing in R.

Results

The viral load in the samples included in the sequencing was assessed using qPCR. For Atlantic salmon, the mean Ct values were 28.9 ± 6.3, 22.6 ± 3.9, and 26.8 ± 0.4 at 2, 4, and 8 wpc, respectively. For trout, the parallel Ct values were 25.9 ± 4.0, 21.9 ± 0.8, and 33.4 ± 1.0, respectively. Significant differences in viral load measured by the Ct values between species were observed at 8 wpc (Additional file 3 ).

Nanopore sequencing

More than five million raw nanopore reads were contained in the Fast5 file obtained from the sequencing experiment using a single R10.4 nanopore flow cell. The Fast5 file was converted to nucleotide sequences using guppy 6.06 with the super accuracy base-calling algorithm, resulting in 5,278,494 reads with a median Phred quality score of 16.412 (equivalent to ~97.72% estimated accuracy). Using the duplex basecalling algorithm, we obtained 166,740 reads that passed the more rigorous filtering implemented in this method, corresponding to less than 3.2% of the total reads. However, the median Phred quality score was much greater at 24.109, equivalent to ~99.61% estimated accuracy (mean Phred quality score ± standard deviation = 25.116 ± 7.392). Among them, 97,761 reads could be properly identified by the barcode. This study exclusively employed high-quality sequence reads that were accurately identified by barcodes after duplex basecalling. On average, ~50% of the high-quality sequence reads (45,318 out of 97,791 reads) were successfully mapped onto the reference genome (Additional file 1 ). Upon examination of unmapped sequences, sequences harboring high similarity to SAV were identified but were characterized by the presence of sequence transpositions, inversions, large insertions, or deletions. Whether these unmapped sequences were PCR artefacts or originated from viral variation was not examined in this study.
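
The accuracy figures quoted alongside the Phred scores follow from the standard definition of the Phred quality score, Q = −10 · log10(P_error). The short sketch below reproduces that conversion for the median scores reported above; the function names are hypothetical and nothing here is taken from the study's own scripts.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Estimated per-base accuracy for a Phred score Q, i.e. 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Inverse conversion: Phred score for a given accuracy in (0, 1)."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return -10.0 * math.log10(1.0 - accuracy)


# Median Phred scores reported in the text, plus the Q30 benchmark.
for label, q in [("simplex, super accuracy", 16.412), ("duplex", 24.109), ("Q30", 30.0)]:
    print(f"{label}: Q{q} corresponds to {phred_to_accuracy(q):.2%} accuracy")
```

Because the scale is logarithmic, the gain from roughly Q16 to Q24 corresponds to about a sixfold reduction in the estimated per-base error rate.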

Major and minor mutation changes in SAV

Among the 22 samples, a total of 16 major SNV-alleles were identified in this study, and some of the major SNV-alleles were present in multiple samples (Figures  2 , 3 ). Most of these major SNV-alleles appeared to be randomly distributed across the sampling time points and between fish species. However, two major, nonsynonymous SNV-alleles were identified in two out of four fish (50%) in the same experimental group. These mutations, which are located in nsP2 (SNV-nsP2 3414 -T/C) and E2 (SNV-E2 1187 -T/C), resulted in changes from tyrosine to histidine and valine to alanine, respectively (Figure  3 ). We also noted that while arginine constituted only 6.3% (248/3906) of the amino acids in the 2wpc_Salmon consensus genome, 18.8% (3/16) of the major SNVs occurred in codons for arginine (Table  2 ). Arginine codons, therefore, were the site of major SNVs three times more frequently than would be expected based on their relative frequency in the genome ( P  = 0.0431). The remaining 19 amino acids did not harbor major SNVs at a frequency that was significantly higher or lower than their frequency within the 2wpc_Salmon consensus genome (Table  2 ). We also identified 7 minor SNV-alleles distributed in both nonstructural and structural genes (Figure  4 , Additional file 4 ). Most of the minor SNV-alleles resulted in nonsynonymous mutations. The trout group tended to show more frequent changes than did the salmon group, especially in the E2 gene. In the trout experimental groups, the two minor SNV-alleles, SNV-E2 412 and SNV-E2 432 , increased in SNV freq during the experiment. There was a distinctly greater proportion of SNV-E2 412 -T/C. For SNV-E2 432 , two specific variants, both of which produce a glutamic acid (E) to aspartic acid (D) change (SNV-E2 432 -G/T and SNV-E2 432 -G/C), had a distinct, though not significant, increase in proportion (Additional file 4 ).
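
To make the arginine comparison concrete, the sketch below runs a two-category chi-square goodness-of-fit test (arginine versus all other amino acids) on the counts given above. This is an assumption about how such a test could be set up, not a reconstruction of the authors' R analysis, so the resulting P value need not match the reported one exactly. With an expected count of about one, the chi-square approximation is also rough. The function name is hypothetical.

```python
import math
from typing import Tuple


def chi_square_two_categories(observed_hits: int, n_trials: int,
                              expected_fraction: float) -> Tuple[float, float]:
    """Goodness-of-fit test for one category versus the rest (1 degree of freedom).

    Returns the chi-square statistic and its P value. For 1 degree of freedom
    the chi-square survival function equals erfc(sqrt(x / 2)).
    """
    expected_hits = n_trials * expected_fraction
    expected_misses = n_trials - expected_hits
    observed_misses = n_trials - observed_hits

    statistic = ((observed_hits - expected_hits) ** 2 / expected_hits
                 + (observed_misses - expected_misses) ** 2 / expected_misses)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Counts from the text: 248 arginine codons among 3906, 3 of 16 major SNVs in arginine codons.
arginine_fraction = 248 / 3906
statistic, p_value = chi_square_two_categories(3, 16, arginine_fraction)
enrichment = (3 / 16) / arginine_fraction

print(f"arginine share of genome: {arginine_fraction:.1%}")
print(f"fold enrichment among major SNVs: {enrichment:.2f}")
print(f"chi-square = {statistic:.2f}, P = {p_value:.3f}")
```

The fold enrichment of roughly three matches the statement in the text, and this simplified test also falls below the 0.05 threshold.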

figure 3

Examples illustrating the difference in the frequency of selected SNV-alleles in individual fish/samples. For five fish ( A -a to B -c), sequence reads were aligned against the 2wpc_Salmon consensus genome sequence (upper, coloured sequence). The nucleotides in the reads that differed from the corresponding consensus nucleotides are shown in red. A Comparison of reads from two salmon samples at 2 wpc centred around the major SNV-allele nsP2 1672 -T/C. There is a distinct difference in the frequency of C in the nucleotide site nsP2 1672 between (fish) A -a and (fish) A -b. B Comparison of reads from two trout ( B -a and B -c) and one salmon ( B -b) sampled at 2 wpc, centred around the major SNV-allele, E2 1187 -T/C. There is a distinct difference in the frequency of C in the nucleotide site E2 1187 . Both major SNV-alleles lead to nonsynonymous changes in codons.

figure 4

Occurrence of minor SNVs in the experimental groups. A total of 7 SNVs were identified as minor, as they had an SNV freq between 5 and 60% in at least one experimental group. The locations of minor SNVs within the SAV3 genome are shown here. For each minor SNV, a Welch’s t test was used to compare the frequencies between the experimental groups and the 2wpc_Salmon consensus genome. 1 The positions of each gene in the SAV3 genome, 2 details of the minor SNVs, 3 amino acid position numbering for each protein, 4 SNV freq of the minor SNVs in the 2wpc_Salmon consensus genome, and 5 SNV freq of the minor SNVs in the experimental groups. The numbers inside brackets show P values from Welch’s t test comparing the SNV frequency in the experimental group with that of the 2wpc_Salmon consensus genome (bold letters indicate P values less than 0.05). The SNVs are highlighted with a background color that ranges from yellow (SNV freq of 5%) to red (the highest value), intensifying progressively as the values increase. Detailed information on the minor SNVs in the experimental groups is provided in Additional file 4 .

Amplicon clusters and phylogenetic analysis

Through de novo clustering, we identified 9,613 clusters comprising both mapped and unmapped sequences (Additional file 5 ). Among them, only 7 clusters in amp1, 3 in amp2, 3 in amp3, 8 in amp4, 2 in amp5, 4 in amp6, and 9 in amp7&8 met the thresholds defined for this study (Figures  5 , 6 and 7 ; Additional file 6 ). For each amplicon, there was a single major cluster that contained the majority (>45%) of reads, along with one or more minor cluster(s), each with a relatively small number of reads. As the clustering analysis applied a 99% identity threshold, larger deletions (> ~20 bp) influenced the resulting clusters much more than did shorter deletions and SNVs. The proportion of reads in each cluster varied across genome location, sampling time point, and host species. The 4wpc_Trout and 8wpc_Trout experimental groups had a significantly greater proportion of reads in some minor clusters than did the other experimental groups (Figures  5 , 6 and 7 ; Additional file 6 ). This was most prominent for Amp7&8_cluster2 and Amp7&8_cluster3 for 8wpc_Trout (Figure  7 ). Most of the minor clusters predominantly exhibited frameshift deletions; however, each cluster was composed of sequences with 99% identity, resulting in the practical coexistence of both in-frame and frameshift deletion reads. In addition, in some raw clusters that did not pass the threshold, sequence inversion, transposition, insertion, and deletion were observed (Additional file 5 ).

figure 5

Phylogenetic tree of the amp1 and amp2 clusters. The maximum likelihood algorithm was used to construct a phylogenetic tree of the identified clusters from the amplicons amp1 ( A ) and amp2 ( B ) (left side). The numbers (above 50%) near each branch indicate bootstrap values out of 1000 replications. The table on the right side shows the proportion of reads in each identified cluster (proportion mean ± standard deviation (SD)) for each experimental group (i.e., fish species at a specific sampling time point). The color gradient from gray to red indicates the proportion of reads in each cluster. For each cluster, the proportion of reads was compared between experimental groups using Duncan’s HSD one-way ANOVA. Different superscripted letters indicate statistically significant differences ( P value < 0.05).

figure 6

Phylogenetic tree of the amp3, amp4, and amp5 clusters. The maximum likelihood algorithm was used to construct a phylogenetic tree of the identified clusters from the amplicons amp3 ( A ), amp4 ( B ), and amp5 ( C ) (left side). The numbers (above 50%) near each branch indicate bootstrap values out of 1000 replications. The table on the right side shows the proportion of reads in each identified cluster (proportion mean ± standard deviation (SD)) for each experimental group (i.e., fish species at a specific sampling time point). The color gradient from gray to red indicates the proportion of reads in each cluster. For each cluster, the proportion of reads was compared between experimental groups using Duncan’s HSD one-way ANOVA. Different superscripted letters indicate statistically significant differences ( P value < 0.05).

figure 7

Phylogenetic tree of the amp6 and amp7&8 clusters. The maximum likelihood algorithm was used to construct a phylogenetic tree of the identified clusters from the amplicons amp6 ( A ) and amp7&8 ( B ) (left side). The numbers (above 50%) near each branch indicate bootstrap values out of 1000 replications. The table on the right side shows the proportion of reads in each identified cluster (proportion mean ± standard deviation (SD)) for each experimental group (i.e., fish species at a specific sampling time point). The color gradient from gray to red indicates the proportion of reads in each cluster. For each cluster, the proportion of reads was compared between experimental groups using Duncan’s HSD one-way ANOVA. Different superscripted letters indicate statistically significant differences ( P value < 0.05).

Nonmetric multidimensional scaling (NMDS) analysis of variation between experimental groups

NMDS analysis was used to analyse the variation (dissimilarity) between the experimental groups. In the NMDS analysis, 36 dimensions (i.e., the number of clusters) were condensed into two dimensions where the distance between experimental groups (and specimens) in an NMDS plot indicates the degree of similarity. At two weeks post-challenge, the experimental groups partially overlapped, and each showed relatively little variation between specimens (Figure  8 A). At four weeks post-challenge, the experimental groups no longer overlapped but still showed relatively little variation between specimens (Figure  8 B). At eight weeks post-challenge, the experimental groups were again partially overlapping but showed a distinct difference in variation between specimens (Figure  8 C).

figure 8

Nonmetric multidimensional scaling (NMDS) plots. The NMDS plots were generated from the read proportions of the 36 clusters from the amplicons amp1 to amp7&8 identified in this study. The distances on the plot reflect the similarities in the proportions of all clusters. Points closer together indicate a higher degree of similarity in cluster proportions, while points farther apart represent lower similarity. Figure  8 A–C depict the comparisons between different species (salmon in red and trout in blue) at 2- (2wpc_Salmon vs 2wpc_Trout), 4- (4wpc_Salmon vs 4wpc_Trout), and 8-wpc (8wpc_Salmon vs 8wpc_Trout), respectively. The ellipses indicate confidence limits of 0.25 (darker red or blue) and 0.5 (lighter red or blue) within the same group.

Visualization of selected mutations in the spike protein

A homology model of the SAV spike protein was constructed using SWISS-MODEL, and the model was subsequently used to visualize the location of selected mutations (Figure  9 ). Amp6_cluster2, Amp7&8_cluster2, and Amp7&8_cluster3 exceeded a mean proportion of reads of 10% in at least one experimental group, showing statistically significant differences. The consensus sequences from these clusters carry frameshift deletions located at the apical region of the spike protein. However, in reality, reads containing in-frame deletions and reads containing frameshift deletions coexist (Figures  9 B–D). The major nonsynonymous SNVs identified in the SAV spike protein are highlighted in green and yellow in Figure  9 E and Additional file 7 . The QMEANDisCo global score, ranging from 0 to 1, expresses the quality of a predicted model [ 53 ]. Higher QMEANDisCo scores indicate better quality and accuracy in the predicted protein structure. While the acceptable range for the QMEANDisCo global score may vary depending on the types of predicted proteins, a score above 0.50 generally implies that the predicted model is likely acceptable based on the established threshold [ 54 ]. The predicted SAV spike protein model based on the 2wpc_consensus sequence had a QMEANDisCo global score of 0.60 ± 0.05, which is comparable to that of other deposited models of alphavirus spike proteins (e.g., Q5WQY5; Chikungunya virus, QMEANDisCo global score of 0.65 ± 0.05). The deletions (Amp6_cluster2, Amp7&8_cluster2, Amp7&8_cluster3) and nonsynonymous mutations did not affect the QMEANDisCo global score, as the resulting models showed the same values.

Figure 9

Visualization of the locations of selected deletions and SNVs in the SAV3 spike protein. A 3D structural model of the SAV3 spike protein consisting of the E1, E2 and E3 subunits was constructed via homology modelling and visualized. A Space-filling model of the SAV3 spike protein, which is a trimeric protein that includes E1 (white), E2 (orange), and E3 (gray). B, C and D The deletions identified in Amp6_cluster2, Amp7&8_cluster2, and Amp7&8_cluster3, respectively, are highlighted in blue. E Nonsynonymous minor SNVs (E2 412 and E2 432) are highlighted in light green and yellow, respectively. Comprehensive views of the entire 3D structures from various orientations are available in Additional file 7. The QMEANDisCo global score shown in Figure 9A–E gives an overall model quality measurement between 0 and 1, where higher numbers indicate higher expected quality.

In the present study, we used the Nanopore long-read sequencing platform to sequence the salmonid alphavirus-3 (SAV3) genome from tissue samples collected from Atlantic salmon and brown trout at various time points during a virus challenge experiment. The primary source of SAV3 infection in cohabitants was the shedder fish. SAV3 sequences from the 2wpc_Salmon experimental group were analysed and used as a reference genome for the remaining experimental time points. The cohabitation challenge applied in this study has both advantages and disadvantages as a method for investigating SAV3 variants. The advantage of the cohabitation model is that it accurately replicates the actual route of waterborne SAV3 infection. However, cohabitation challenges also have potential limitations regarding two parameters: the actual dose of SAV3 to which cohabitant fish are exposed and the exact timing of their initial infection. These potential limitations should be noted when considering the population diversity of sequences within quasispecies at different time points post-infection.

Among the major nonsynonymous SNV-alleles, only two (SNV-nsP2 1672 -T/C and SNV-E2 1187 -T/C) were found in more than one fish. Of these, SNV-E2 1187 -T/C, located within the spike protein, is a nonsynonymous mutation that converts valine to alanine. This valine-to-alanine substitution may influence viral fitness and lead to phenotypic changes. Interestingly, Tsetsarkin et al. [55] investigated the impact of an alanine-to-valine mutation at position 226 in the E1 fusion protein of Chikungunya virus (CHIKV). CHIKV with a valine at this position (E1-226V) showed relatively rapid infection and an increased ability to infect Asian tiger mosquitos ( Ae. albopictus ) compared with yellow fever mosquitos ( Ae. aegypti ). Conversely, CHIKV with an alanine at this position (E1-226A) was better at infecting yellow fever mosquitos. That study highlights how a single substitution can alter the phenotypic characteristics of alphaviruses. Among several minor SNV-alleles identified between the experimental groups, only SNV-E2 412 -T/C was consistently and significantly more abundant in the trout experimental group and exhibited a distinct increase over time. At another site, two minor SNV-alleles (SNV-E2 432 -G/C and SNV-E2 432 -G/T), both of which lead to an E (glutamic acid) to D (aspartic acid) amino acid change, also increased in SNV-allele freq over time in the trout experimental group, but this increase was not statistically significant. In general, SNVs could alter viral tropism towards different hosts. The E2 protein is one of the three glycoproteins that make up the SAV spike protein and is one of the structural proteins where most immunogenic epitopes are located [56, 57]. Karlsen et al. [58] observed the influence of a mutation at position E2 206 , from proline (E2 206p ) to serine (E2 206s ), which is located in the receptor binding site. The authors found that viral growth and replication differed significantly between these mutants. The E2 206s mutant also reverted to E2 206p when the virus was inoculated into a cell line (BF2), indicating that SAV3 may adapt to its host and environment. In the present study, the minor SNVs (E2 412 and E2 432 ) identified in the E2 gene are located in the middle of the spike protein rather than in the receptor binding site. Hence, the effect of these nonsynonymous mutations is likely less pronounced or direct than that of the variant observed in the study by Karlsen et al. [58]. On the other hand, most deletion mutations identified from minor clusters in the spike protein (Amp6_cluster2, Amp7&8_cluster2, and Amp7&8_cluster3) are located in a region that faces outwards from the viral membrane. Deletions in these regions could influence cellular tropism. In addition, the minor SNV-nsP2 486 may introduce premature stop codons (TAG and TAA). Given that nonstructural proteins such as nsP2 regulate viral RNA synthesis, a premature stop codon would be expected to result in a defective viral polyprotein that is unable to perform its function.
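To make the stop-codon argument concrete, the following minimal sketch scans an in-frame coding sequence for its first stop codon and checks whether a single-nucleotide change creates one earlier than in the unmodified sequence. The function names, the toy sequence, and the chosen position are hypothetical; they are not the nsP2 sequence or coordinates from this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_index(cds):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    cds = cds.upper()
    for i in range(0, len(cds) - len(cds) % 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def apply_snv(cds, position, alt):
    """Return cds with the 1-based nucleotide `position` replaced by `alt`."""
    if not 1 <= position <= len(cds):
        raise ValueError("position outside sequence")
    return cds[:position - 1] + alt + cds[position:]


def introduces_premature_stop(cds, position, alt):
    """True if the SNV creates a stop codon earlier than in the original."""
    before = first_stop_index(cds)
    after = first_stop_index(apply_snv(cds, position, alt))
    if after is None:
        return False
    return before is None or after < before


# Hypothetical coding sequence: ATG CAG AAA GGC TAA (stop is the 5th codon).
toy_cds = "ATGCAGAAAGGCTAA"

if __name__ == "__main__":
    # C->T at nucleotide 4 turns CAG (Gln) into TAG (stop).
    print(introduces_premature_stop(toy_cds, 4, "T"))
```

A change that leaves the codon coding (for example G to A at nucleotide 6, CAG to CAA) is reported as not introducing a premature stop, which is the distinction the paragraph above relies on.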

In the cluster analysis, the reads in each identified cluster had at least 99% sequence identity. Given that the genetic identity among SAV subtypes ranges from ~86–96% [3], we used the threshold of 99% sequence identity in the cluster analyses to allow the study of intrasubtype variation. If, in contrast, a threshold lower than ~96% sequence identity had been used, the cluster analysis would not have been able to differentiate between SAV subtypes. Since the amplicons (and hence the reads) had an average length of approximately 2000 bp, the clusters, on average, differed from each other by at least 20 nucleotides (1% of 2000 bp). Using these threshold conditions inadvertently led to all the identified clusters being predominantly defined by larger deletions. When the reads in each identified cluster were “merged” into a defining consensus sequence, these deletions mostly led to a shift in the reading frame. This would suggest that these deletion-defined clusters should be considered nonproductive dead ends. It should be noted, however, that among the reads in these clusters, there were sequences with in-frame deletions that, in principle, could retain (some) functionality. Similarly, Gallagher et al. [17] identified many deletion mutations based on nanopore sequencing, and ~34% of deletions did not disrupt the protein-coding frame (in-frame mutation), which leaves open the possibility that not all observed deletions result in defective viral particles. In addition, the sizes of the complete SAV genomes varied slightly (SAV1 (AJ316244.1; 11,919 bp), SAV2 (AJ316246.1; 11,900 bp), SAV3 (KC122926.1; 11,887 bp), SAV4 (MH708651.1; 11,762 bp), SAV5 (MH708650.1; 11,804 bp), and SAV6 (MH238448.1; 11,726 bp)). This difference may ultimately stem from the frequent occurrence of deletion mutations in SAV. Overall, the cluster analysis of each of the 8 amplicons revealed little directional development (i.e., adaptation) at different sampling time points or between fish species. The only exception was for amplicons 1 and 7/8, where the frequency of some minor clusters increased for brown trout at 8 wpc.
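The arithmetic behind the identity threshold, and the distinction between in-frame and frameshift deletions, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, identity is treated as a simple percentage of an equal-length alignment, and a deletion is classified purely by whether its length is a multiple of three.

```python
def differences_at_threshold(length_bp, identity_percent):
    """Number of differing positions that corresponds to the identity threshold."""
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity_percent must be between 0 and 100")
    return length_bp * (100 - identity_percent) / 100


def classify_deletion(deletion_length):
    """Label a deletion in a coding region by its effect on the reading frame."""
    if deletion_length <= 0:
        raise ValueError("deletion_length must be positive")
    return "in-frame" if deletion_length % 3 == 0 else "frameshift"


if __name__ == "__main__":
    # 99% identity over a ~2000 bp amplicon corresponds to 20 differing positions.
    print(differences_at_threshold(2000, 99))
    # A ~96% threshold would tolerate 80 differing positions over the same length.
    print(differences_at_threshold(2000, 96))
    # Deletions of 21 nt keep the frame; deletions of 20 nt shift it.
    print(classify_deletion(21), classify_deletion(20))
```

Under these assumptions, any event large enough to separate two clusters must account for roughly 20 or more positions in a 2000 bp read, which is consistent with clusters being defined mostly by larger deletions rather than by isolated SNVs.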

NMDS analysis integrating the cluster data over all eight amplicons indicated that late in infection, SAV3 genomes from brown trout had higher levels of variation than did SAV3 genomes from salmon. At the first sampling time point (2 wpc), little difference was observed in the NMDS plot. By 4 wpc, the experimental groups had similar levels of variation but were still separated in the NMDS plot. In contrast, the groups overlapped at 8 wpc, but the brown trout experimental group showed distinctly more variation. Considering the distinct kinetics observed between salmon and trout at 8 wpc, the susceptibility of brown trout to SAV3 may be lower than that of other trout species. The observed higher variation in brown trout could be interpreted as SAV3 exploring the fitness landscape in a host to which it is not well adapted.

In conclusion, this study provides insight into the genetic variation in SAV3 in infected fish, revealing mostly random variation with no development in SNV freq during the experiment. Nevertheless, a few specific variants, such as SNV-E2 412 and SNV-E2 432 , increased in frequency with time, potentially showing viral adaptation to trout. We believe that this approach and bioinformatics pipeline will be useful for studies of viral variation and evolution.

Data Availability

The datasets used in this study are available from the corresponding author upon reasonable request.

Deperasińska I, Schulz P, Siwicki AK (2018) Salmonid alphavirus (SAV). J Vet Res 62:1

Pietilä MK, Hellström K, Ahola T (2017) Alphavirus polymerase and RNA replication. Virus Res 234:44–57

Tighe AJ, Gallagher MD, Carlsson J, Matejusova I, Swords F, Macqueen DJ, Ruane NM (2020) Nanopore whole genome sequencing and partitioned phylogenetic analysis supports a new salmonid alphavirus genotype (SAV7). Dis Aquat Organ 142:203–211

Firth AE, Chung BY, Fleeton MN, Atkins JF (2008) Discovery of frameshifting in Alphavirus 6K resolves a 20-year enigma. Virol J 5:108

Melton JV, Ewart GD, Weir RC, Board PG, Lee E, Gage PW (2002) Alphavirus 6K proteins form ion channels. J Biol Chem 277:46923–46931

Ramsey J, Mukhopadhyay S (2017) Disentangling the frames, the state of research on the alphavirus 6K and TF proteins. Viruses 9:228

Nelson R, McLoughlin M, Rowley H, Platten M, McCormick J (1995) Isolation of a toga-like virus from farmed Atlantic salmon Salmo salar with pancreas disease. Dis Aquat Organ 22:25–32

Gallagher MD, Matejusova I, Ruane NM, Macqueen DJ (2020) Genome-wide target enriched viral sequencing reveals extensive ‘hidden’ salmonid alphavirus diversity in farmed and wild fish populations. Aquac 522:735117

Herath TK, Thompson KD (2022) Salmonid alphavirus and pancreas disease. In: Aquaculture Pathophysiology. Elsevier, Amsterdam

Herath TK, Ashby AJ, Jayasuriya NS, Bron JE, Taylor JF, Adams A, Richards RH, Weidmann M, Ferguson HW, Taggart JB (2017) Impact of Salmonid alphavirus infection in diploid and triploid Atlantic salmon ( Salmo salar L.) fry. PLoS One 12:e0179192

Weston JH, Welsh MD, McLoughlin MF, Todd D (1999) Salmon pancreas disease virus, an alphavirus infecting farmed Atlantic salmon, Salmo salar L. Virol 256:188–195

Villoing S, Béarzotti M, Chilmonczyk S, Castric J, Brémont M (2000) Rainbow trout sleeping disease virus is an atypical alphavirus. J Virol 74:173–183

Hodneland K, Bratland A, Christie K, Endresen C, Nylund A (2005) New subtype of salmonid alphavirus (SAV), Togaviridae, from Atlantic salmon Salmo salar and rainbow trout Oncorhynchus mykiss in Norway. Dis Aquat Organ 66:113–120

Bruno D, Noguera P, Black J, Murray W, Macqueen D, Matejusova I (2014) Identification of a wild reservoir of salmonid alphavirus in common dab Limanda limanda , with emphasis on virus culture and sequencing. Aquac Environ Interact 5:89–98

Macqueen DJ, Eve O, Gundappa MK, Daniels RR, Gallagher MD, Alexandersen S, Karlsen M (2021) Genomic epidemiology of salmonid alphavirus in Norwegian aquaculture reveals recent Subtype-2 transmission dynamics and novel Subtype-3 lineages. Viruses 13:2549

Karlsen M, Hodneland K, Endresen C, Nylund A (2006) Genetic stability within the Norwegian subtype of salmonid alphavirus (family Togaviridae ). Arch Virol 151:861–874

Gallagher MD, Karlsen M, Petterson E, Haugland Ø, Matejusova I, Macqueen DJ (2020) Genome sequencing of SAV3 reveals repeated seeding events of viral strains in Norwegian aquaculture. Front Microbiol 11:524801

Petterson E, Stormoen M, Evensen Ø, Mikalsen AB, Haugland Ø (2013) Natural infection of Atlantic salmon ( Salmo salar L.) with salmonid alphavirus 3 generates numerous viral deletion mutants. J Gen Virol 94:1945–1954

Patterson EI, Khanipov K, Swetnam DM, Walsdorf S, Kautz TF, Thangamani S, Fofanov Y, Forrester NL (2020) Measuring alphavirus fidelity using non-infectious virus particles. Viruses 12:546

Peck KM, Lauring AS (2018) Complexities of viral mutation rates. J Virol 92:e01031-17

Stapleford KA, Rozen-Gagnon K, Das PK, Saul S, Poirier EZ, Blanc H, Vidalain P-O, Merits A, Vignuzzi M (2015) Viral polymerase-helicase complexes regulate replication fidelity to overcome intracellular nucleotide depletion. J Virol 89:11233–11244

Poirier EZ, Mounce BC, Rozen-Gagnon K, Hooikaas PJ, Stapleford KA, Moratorio G, Vignuzzi M (2016) Low-fidelity polymerases of alphaviruses recombine at higher rates to overproduce defective interfering particles. J Virol 90:2446–2454

Simon-Loriere E, Holmes EC (2011) Why do RNA viruses recombine? Nat Rev Microbiol 9:617–626

Sumpter R Jr, Wang C, Foy E, Loo Y-M, Gale M Jr (2004) Viral evolution and interferon resistance of hepatitis C virus RNA replication in a cell culture model. J Virol 78:11591–11604

Domingo E, Sheldon J, Perales C (2012) Viral quasispecies evolution. Microbiol Mol Biol Rev 76:159–216

Rhoads A, Au KF (2015) PacBio sequencing and its applications. Genom Proteom Bioinform 13:278–289

Takeda H, Yamashita T, Ueda Y, Sekine A (2019) Exploring the hepatitis C virus genome using single molecule real-time sequencing. World J Gastroenterol 25:4661

Freed NE, Vlková M, Faisal MB, Silander OK (2020) Rapid and inexpensive whole-genome sequencing of SARS-CoV-2 using 1200 bp tiled amplicons and Oxford Nanopore rapid barcoding. Biol Methods Protoc 5:bpaa014

Boldogkői Z, Moldován N, Balázs Z, Snyder M, Tombácz D (2019) Long-read sequencing–a powerful tool in viral transcriptome research. Trends Microbiol 27:578–592

Karst SM, Ziels RM, Kirkegaard RH, Sørensen EA, McDonald D, Zhu Q, Knight R, Albertsen M (2021) High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat Methods 18:165–169

Rang FJ, Kloosterman WP, de Ridder J (2018) From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol 19:90

Gallagher MD, Matejusova I, Nguyen L, Ruane NM, Falk K, Macqueen DJ (2018) Nanopore sequencing for rapid diagnostics of salmonid RNA viruses. Sci Rep 8:16307

Boone M, De Koker A, Callewaert N (2018) Capturing the ‘ome’: the expanding molecular toolbox for RNA and DNA library construction. Nucleic Acids Res 46:2701–2721

Brodin J, Hedskog C, Heddini A, Benard E, Neher RA, Mild M, Albert J (2015) Challenges with using primer IDs to improve accuracy of next generation sequencing. PLoS One 10:e0119123

Sanderson ND, Kapel N, Rodger G, Webster H, Lipworth S, Street TL, Peto T, Crook D, Stoesser N (2023) Comparison of R9.4.1/Kit10 and R10/Kit12 Oxford Nanopore flowcells and chemistries in bacterial genome reconstruction. Microb Genom 9:000910

Blaxter M, Mann J, Chapman T, Thomas F, Whitton C, Floyd R, Abebe E (2005) Defining operational taxonomic units using DNA barcode data. Philos Trans R Soc Lond B Biol Sci 360:1935–1943

Nguyen N-P, Warnow T, Pop M, White B (2016) A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity. NPJ Biofilms Microbi 2:1–8

Xu C, Guo T-C, Mutoloki S, Haugland Ø, Evensen Ø (2012) Gene expression studies of host response to Salmonid alphavirus subtype 3 experimental infections in Atlantic salmon. Vet Res 43:1–10

Hodneland K, Endresen C (2006) Sensitive and specific detection of Salmonid alphavirus using real-time PCR (TaqMan®). J Viro Methods 131:184–192

Milne I, Stephen G, Bayer M, Cock PJ, Pritchard L, Cardle L, Shaw PD, Marshall D (2013) Using Tablet for visual exploration of second-generation sequencing data. Brief Bioinform 14:193–202

Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359

Danecek P, Schiffels S, Durbin R (2014) Multiallelic calling model in bcftools (-m). June 2014

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079

Gordon A, Hannon GJ (2010) Fastx-toolkit. FASTQ/A short-reads preprocessing tools (unpublished). http://hannonlab.cshl.edu/fastx_toolkit/ . Accessed May 2022

Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F (2019) Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 37:852–857

Tamura K, Stecher G, Kumar S (2021) MEGA11: molecular evolutionary genetics analysis version 11. Mol Biol Evol 38:3022–3027

Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539

Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46:W296–W303

Bramucci E, Paiardini A, Bossa F, Pascarella S (2012) PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL. BMC Bioinformat 13:1–6

Yuan S, Chan HS, Hu Z (2017) Using PyMOL as a platform for computational drug design. Wiley Interdisciplinary Rev Comput Mol Sci 7:e1298

Wickham H, Miller E (2017) haven: Import and Export 'SPSS', 'Stata' and 'SAS' Files. R package version 1.1.2, 2018

Studer G, Rempfer C, Waterhouse AM, Gumienny R, Haas J, Schwede T (2020) QMEANDisCo—distance constraints applied on model quality estimation. Bioinformatics 36:1765–1771

Gupta K (2023) In silico structural and functional characterization of hypothetical proteins from Monkeypox virus. J Genet Eng Biotechnol 21:46

Tsetsarkin KA, Vanlandingham DL, McGee CE, Higgs S (2007) A single mutation in chikungunya virus affects vector specificity and epidemic potential. PLoS Pathog 3:e201

Hunt AR, Frederickson S, Maruyama T, Roehrig JT, Blair CD (2010) The first human epitope map of the alphaviral E1 and E2 proteins reveals a new E2 epitope with significant virus neutralizing activity. PLoS Negl Trop Dis 4:e739

Hikke MC, Braaen S, Villoing S, Hodneland K, Geertsema C, Verhagen L, Frost P, Vlak JM, Rimstad E, Pijlman GP (2014) Salmonid alphavirus glycoprotein E2 requires low temperature and E1 for virion formation and induction of protective immunity. Vaccine 32:6206–6212

Karlsen M, Andersen L, Blindheim SH, Rimstad E, Nylund A (2015) A naturally occurring substitution in the E2 protein of Salmonid alphavirus subtype 3 changes viral fitness. Virus Res 196:79–86

Acknowledgements

The authors would like to acknowledge the invaluable contributions from the technicians at the IMR and personnel at the fish disease laboratory.

Open access funding provided by Institute Of Marine Research. This study was funded by the Institute of Marine Research (Bergen, Norway) in the context of the Disease Transmission Project (15821) and the VIRAQ Project (15533).

Author information

Authors and affiliations.

Institute of Marine Research, Nordnes, PO Box 1870, 5817, Bergen, Norway

HyeongJin Roh, Kai Ove Skaftnesmo, Dhamotharan Kannimuthu, Abdullah Madhun, Sonal Patel, Bjørn Olav Kvamme, H. Craig Morton & Søren Grove

Norwegian Veterinary Institute, Bergen, Norway

Sonal Patel

Contributions

Conception and design of the study: HR, DK, HCM, and SG; acquisition and analysis of data: HR, KOS, AM, SP, and BOK; interpretation of data: HR, KOS, DK, HCM, and SG; drafting of manuscript: HR and SG; and revision of the manuscript: HR, KOS, DK, AM, SP, BOK, HCM, and SG. All authors read and approved the final manuscript.

Corresponding author

Correspondence to HyeongJin Roh.

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Handling editor: Stéphane Biacchesi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Basic characteristics of the reads and data used for sequencing and read mapping of the reference genome.

Additional file 2. Primers used in this study.

Additional file 3. The Ct values (mean ± SD) determined by RT-qPCR targeting the SAV3 nsP1 gene in the samples sequenced in this study.

Additional file 4. The frequency of minor SNVs in the experimental groups. A total of 7 SNVs were identified as minor, as they had an SNV freq between 5 and 60% in at least one experimental group. For each minor SNV, the table shows the frequency observed in the experimental groups and the results of Welch’s t test comparison of frequencies in the experimental groups and the 2wpc_Salmon consensus genome.

Additional file 5. All the raw de novo clusters identified in this study.

Additional file 6. Consensus nucleotide sequences of clusters that passed the threshold in this study.

Additional file 7. Visualization of the locations of selected deletions and SNVs in the SAV3 spike protein. A 3D structural model of the SAV3 spike protein consisting of the E1, E2 and E3 subunits was constructed via homology modelling and visualized in videos. (A) Space-filling model of the SAV3 spike protein, shown as a 12-meric protein including four E1 subunits (white), four E2 subunits (orange), and four E3 subunits (gray). (B, C and D) The deletions identified in Amp6_cluster2, Amp7&8_cluster2, and Amp7&8_cluster3, respectively, are highlighted in blue. (E) Nonsynonymous major SNVs (SNV-E2 1187 and SNV-E1 1321 ) are highlighted in green and purple, and two minor SNVs (SNV-E2 412 and SNV-E2 432 ) are shown in cyan and yellow.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Roh, H., Skaftnesmo, K.O., Kannimuthu, D. et al. Nanopore sequencing provides snapshots of the genetic variation within salmonid alphavirus-3 (SAV3) during an ongoing infection in Atlantic salmon (Salmo salar) and brown trout (Salmo trutta). Vet Res 55, 106 (2024). https://doi.org/10.1186/s13567-024-01349-z

Received: 19 February 2024

Accepted: 24 June 2024

Published: 03 September 2024

DOI: https://doi.org/10.1186/s13567-024-01349-z

  • Salmonid alphavirus (SAV)
  • nanopore sequencing
  • viral mutation
  • spike protein
  • viral heterogeneity

Veterinary Research

ISSN: 1297-9716

Evaluation of shoot collection timing and hormonal treatment on seedling rooting and growth in four poplar genomic groups.

  • 1. Introduction
  • 2. Materials and Methods
  • 2.1. Plant Material and Study Design
  • 2.2. Measurements
  • 2.3. Statistical Analysis
  • 3. Results and Discussion
  • 3.1. Effect of Shoot Collection Time on Seedling Rooting
  • 3.2. Effect of Shoot Collection Time on Seedling Height Growth
  • 3.3. Growth Stimulants and Clone Effect on Rooting
  • 4. Conclusions
  • Author Contributions
  • Data Availability Statement
  • Acknowledgments
  • Conflicts of Interest

  • Haissig, B.E. Carbohydrate relations during propagation of cuttings from sexually mature Pinus banksiana trees. Tree Physiol. 1989 , 5 , 319–328. [ Google Scholar ] [ CrossRef ]
  • Sun, J.; Li, H.; Chen, H.; Wang, T.; Quan, J.; Bi, H. The effect of hormone types, concentrations, and treatment times on the rooting traits of Morus ‘Yueshenda 10’ softwood cuttings. Life 2023 , 13 , 1032. [ Google Scholar ] [ CrossRef ]
  • Martínez, L.D.O.; Mendoza, O.J.; Valenzuela, M.C.; Serrano, P.A.; Olarte, S.J. Efecto de las giberelinas sobre el crecimiento y calidad de plántulas de tomate. Biotecnia 2013 , 15 , 56–60. [ Google Scholar ] [ CrossRef ]
  • Keswani, C.; Singh, S.P.; Cueto, L.; García-Estrada, C.; Mezaache-Aichour, S.; Glare, T.R.; Borriss, R.; Singh, S.P.; Angel Blázquez, M.; Sansinenea, E. Auxins of microbial origin and their use in agriculture. Appl. Microbiol. Biotechnol. 2020 , 104 , 8549–8565. [ Google Scholar ] [ CrossRef ]
  • Sampedro-Guerrero, J.; Vives-Peris, V.; Gomez-Cadenas, A.; Clausell-Terol, C. Efficient strategies for controlled release of nanoencapsulated phytohormones to improve plant stress tolerance. Plant Methods 2023 , 19 , 47. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Yuan, H.; Zhao, L.; Guo, W.; Yu, Y.; Tao, L.; Zhang, L.; Song, X.; Huang, W.; Cheng, L.; Chen, J.; et al. Exogenous application of phytohormones promotes growth and regulates expression of wood formation-related genes in Populus simonii × P. nigra . Int. J. Mol. Sci. 2019 , 20 , 792. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Zhou, M.; Li, Y.; Cheng, Z.; Zheng, X.; Cai, C.; Wang, H.; Lu, K.; Zhu, C.; Ding, Y. Important factors controlling gibberellin homeostasis in plant height regulation. J. Agric. Food Chem. 2023 , 71 , 15895–15907. [ Google Scholar ] [ CrossRef ] [ PubMed ]

Rooting of cuttings (%) by date of shoot collection

| Date of Shoot Collection | P. deltoides × P. nigra (Agathe-F) | P. maximowiczii × P. trichocarpa (Arges) | P. deltoides × P. trichocarpa (Donk) | P. × canadensis (F-448) |
| --- | --- | --- | --- | --- |
| 1 March | 88.6 ± 3.4 | 81.3 ± 3.1 | 80.1 ± 3.1 | 87.4 ± 3.4 |
| 10 March | 88.4 ± 3.3 | 85.0 ± 3.0 | 81.9 ± 3.0 | 90.6 ± 3.5 |
| 30 March | 93.4 ± 3.5 | 86.5 ± 3.1 | 91.6 ± 3.1 | 91.9 ± 3.5 |
| 10 April | 95.7 ± 3.6 | 83.1 ± 3.0 | 93.7 ± 3.2 | 93.0 ± 3.2 |
| 20 April | 93.1 ± 3.5 | 80.5 ± 2.9 | 86.4 ± 3.0 | 90.0 ± 3.4 |

Height (cm) by date of shoot collection

| Date of Shoot Collection | P. deltoides × P. nigra (Agathe-F) | P. maximowiczii × P. trichocarpa (Arges) | P. deltoides × P. trichocarpa (Donk) | P. × canadensis (F-448) |
| --- | --- | --- | --- | --- |
| 1 March | 124.47 ± 11.71 | 122.21 ± 11.15 | 84.78 ± 12.31 | 115.28 ± 12.15 |
| 10 March | 132.22 ± 10.20 | 118.63 ± 11.17 | 90.77 ± 16.22 | 120.48 ± 10.27 |
| 30 March | 131.77 ± 21.11 | 123.81 ± 14.31 | 89.18 ± 11.05 | 123.42 ± 11.12 |
| 10 April | 135.93 ± 19.09 | 121.91 ± 14.61 | 104.23 ± 21.18 | 118.38 ± 12.23 |
| 20 April | 123.04 ± 21.24 | 115.28 ± 21.79 | 85.09 ± 15.26 | 113.76 ± 17.77 |

Rooting of cuttings (%) by treatment and concentration (%)

| Treatment, Concentration (%) | P. deltoides × P. nigra (Agathe-F) | P. maximowiczii × P. trichocarpa (Arges) | P. deltoides × P. trichocarpa (Donk) | P. × canadensis (F-448) |
| --- | --- | --- | --- | --- |
| Control | 88.2 ± 3.3 | 82.7 ± 3.9 | 89.6 ± 3.6 | 80.6 ± 5.3 |
| IBA *, 0.002 | 97.7 ± 1.9 | 90.1 ± 2.3 | 95.9 ± 1.9 | 88.8 ± 4.7 |
| IAA **, 0.02 | 92.5 ± 2.4 | 87.7 ± 3.6 | 90.8 ± 1.4 | 91.2 ± 1.3 |
| IAA **, 0.2 | 84.2 ± 5.6 | 82.2 ± 2.4 | 87.4 ± 4.7 | 82.4 ± 5.7 |
| Cinnamic acid, 0.0001 | 100.0 ± 0.0 | 88.3 ± 4.6 | 93.1 ± 2.6 | 87.4 ± 3.6 |
| Cinnamic acid, 0.005 | 91.6 ± 2.8 | 84.3 ± 4.8 | 89.6 ± 2.5 | 81.2 ± 4.3 |

Height (cm) by treatment and concentration (%)

| Treatment, Concentration (%) | P. deltoides × P. nigra (Agathe-F) | P. maximowiczii × P. trichocarpa (Arges) | P. deltoides × P. trichocarpa (Donk) | P. × canadensis (F-448) |
| --- | --- | --- | --- | --- |
| Control | 119.47 ± 13.74 | 100.24 ± 13.19 | 85.78 ± 17.33 | 119.18 ± 15.15 |
| IBA *, 0.002 | 138.22 ± 11.21 | 128.63 ± 12.18 | 94.77 ± 19.62 | 125.41 ± 14.77 |
| IAA **, 0.02 | 131.77 ± 21.11 | 127.87 ± 18.30 | 89.08 ± 14.55 | 133.47 ± 21.12 |
| IAA **, 0.2 | 112.36 ± 15.22 | 112.35 ± 18.28 | 85.55 ± 14.99 | 99.74 ± 19.05 |
| Cinnamic acid, 0.0001 | 134.93 ± 9.59 | 121.91 ± 14.61 | 105.13 ± 20.13 | 127.33 ± 17.33 |
| Cinnamic acid, 0.005 | 111.14 ± 20.44 | 115.27 ± 14.77 | 86.19 ± 16.56 | 113.96 ± 14.78 |

Varnagirytė-Kabašinskienė, I.; Suchockas, V.; Urbaitis, G.; Žemaitis, P.; Muraškienė, M.; Čiuldienė, D.; Černiauskas, V.; Armoška, E.; Vigricas, E. Evaluation of Shoot Collection Timing and Hormonal Treatment on Seedling Rooting and Growth in Four Poplar Genomic Groups. Forests 2024 , 15 , 1530. https://doi.org/10.3390/f15091530

  • Published: 21 October 2015

Common garden experiments in the genomic era: new perspectives and opportunities

P de Villemereuil, O E Gaggiotti, M Mouterde & I Till-Bottraud

Heredity volume 116, pages 249–254 (2016)

  • Evolutionary genetics

The study of local adaptation is made difficult by many confounding evolutionary phenomena (for example, genetic drift and demographic history). When complex traits are involved in local adaptation, phenomena such as phenotypic plasticity further hamper the study of the complex relationships between phenotype, genotype and environment. In this perspective paper, we suggest that the common garden experiment, specifically designed to deal with phenotypic plasticity, has a clear role to play in the study of local adaptation, even (if not especially) in the genomic era. After a quick review of some high-throughput genotyping protocols relevant in the context of a common garden, we explore how to improve common garden analyses with dense marker panel data and recent statistical methods. We then show how combining approaches from population genomics and genome-wide association studies with the settings of a common garden can yield a very efficient, thorough and integrative study of local adaptation. In particular, evidence from genomic (for example, genome scan) and phenotypic origins constitutes independent insights into the possibility of local adaptation scenarios, and genome-wide association studies in the context of a common garden experiment make it possible to decipher the genetic bases of adaptive traits.

Introduction

Studying adaptation and the genetic bases of adaptive traits is an ambitious but daunting enterprise, especially for complex traits that have a polygenic basis and are strongly influenced by the environment. Indeed, uncovering evidence of genetic adaptation is almost always hampered by the pervasive effects of evolutionary phenomena such as genetic drift, phenotypic plasticity, complex demographic history and complex genetic architecture. In the particular case of local adaptation, evolutionary biologists have developed efficient tools to overcome these challenges and the common garden experiment is one of them. The rationale behind this protocol is to control for the effects of phenotypic plasticity and, to a certain extent, genotype-by-environment interactions by growing individuals from different populations in a common environment, and by using the quantitative genetics toolbox (see Box 1) to study the genetic bases of complex traits (for example, life history, morphological and physiological traits).

Because it makes it possible to unravel the genetic basis of complex phenotypes across various populations without the confounding effects of the corresponding environment, the common garden experiment is used to test for signals of local adaptation in traits of interest such as life history traits (Kawakami et al., 2011), phenology (Brachi et al., 2013) and allometric relationships (Gonda et al., 2011). Local adaptation might be suspected because of the existence of an environmental gradient such as latitude (Toräng et al., 2015) or altitude (Alberto et al., 2011), or because of the existence of several contrasting environments, such as sea and fresh water (DeFaveri and Merilä, 2014). In addition, common garden experiments are also used to study the consequences of local adaptation for conservation (McKay et al., 2001) or even for ecosystem functioning (Bassar et al., 2010). Despite its name, and although it has been used extensively with plants (Linhart and Grant, 1996), this experimental approach can also be applied to a large variety of organisms including fish (Bassar et al., 2010; DeFaveri and Merilä, 2014), invertebrates (Spitze, 1993; Luttikhuizen et al., 2003) and small mammals (Bozinovic et al., 2009). The main limitations of this experimental design are the need to breed the species and to grow the resulting offspring in laboratory or seminatural conditions. Common garden experiments can also be used to study genotype-by-environment interactions, by implementing the same design in different environments. Although replicating common garden experiments is logistically challenging, the outcomes of such experiments are highly rewarding, as genotype-by-environment effects are likely common and very important in the wild (Stinchcombe, 2014).
Note finally that, although common garden experiments are closely related to reciprocal transplant experiments (which aim at testing local adaptation by showing that the average fitness of local individuals is higher than the average fitness of aliens, see for example Ågren and Schemske, 2012), there are important philosophical and practical differences between the two types of experiments. The difference is that reciprocal transplants are designed to prove local adaptation, whereas common gardens are designed to study the genetic bases of traits, regardless of whether they are adaptive or not. In practice, reciprocal transplants will typically create differential survival, because the locals will survive better. This will be a confounding effect during the quantitative genetic analysis, because only the phenotypes of ‘fit’ individuals are available. Common gardens, by contrast, are often designed to be ‘softer’ on the individuals. Nevertheless, most of the elements in this article regarding common garden experiments can also be applied to reciprocal transplants, especially if one is interested in applying them to survival or some other measure of fitness.

To perform the quantitative genetic analyses of the studied traits, individuals from controlled families (that is, groups of individuals with known genealogy) are used. An average relatedness between individuals is derived from this known genealogy and allows inference of the within-population additive genetic variance $V_A$, whereas effects due to the population of origin allow inference of the between-population additive genetic variance $V_{pop}$. This is so because all individuals share the same environment and, therefore, any average difference between populations must have a genetic origin. The residual variance $V_R$ accounts for all other kinds of effects (for example, environmental). These variance components can be used to estimate the heritability of the trait:

$$h^2 = \frac{V_A}{V_A + V_R}$$

It is also possible to estimate $Q_{ST}$, a standardised measure of genetic differentiation for quantitative traits (Spitze, 1993; Edelaar et al., 2011). $Q_{ST}$ is defined as the ratio of the among-population (additive) genetic variance $V_{pop}$ over the total genetic variance (that is, including the within-population additive variance $V_A$), and in the case of diploid species is given by:

$$Q_{ST} = \frac{V_{pop}}{V_{pop} + 2V_A}$$

This parameter is a quantitative analogue of population genetics’ $F_{ST}$ and, under a hypothesis of neutrality, both should be equal. Hence, a common approach for distinguishing between neutral drift and local adaptation scenarios is to compare $Q_{ST}$ and $F_{ST}$ estimates. Consequently, individuals from a common garden experiment are typically genotyped to compute $F_{ST}$.
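As a minimal sketch of how the variance components feed into these two quantities, the snippet below computes heritability and Q_ST from point estimates. It assumes that heritability is taken as V_A / (V_A + V_R) and uses the Box 1 definition of Q_ST, V_pop / (V_pop + 2 V_A). The function names and the numerical values are hypothetical and serve only as an illustration.

```python
def heritability(v_a: float, v_r: float) -> float:
    """Within-population heritability, assuming phenotypic variance = V_A + V_R."""
    return v_a / (v_a + v_r)


def q_st(v_pop: float, v_a: float) -> float:
    """Q_ST for a diploid species: V_pop / (V_pop + 2 * V_A)."""
    return v_pop / (v_pop + 2.0 * v_a)


# Hypothetical variance components and neutral-marker F_ST (illustrative only).
v_a, v_pop, v_r = 2.0, 1.0, 6.0
f_st = 0.05

h2 = heritability(v_a, v_r)   # 2 / (2 + 6)
qst = q_st(v_pop, v_a)        # 1 / (1 + 2 * 2)
print(f"h2 = {h2:.2f}, Q_ST = {qst:.2f}, F_ST = {f_st:.2f}")
```

Comparing two point estimates in this way is only a first look: both Q_ST and F_ST are noisy estimators, so an actual test needs to account for their uncertainty.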

Despite the advantages of common garden experiments, the study of local adaptation in non-model species during the past decade has been strongly driven by the study of genetic markers in natural populations (Luikart et al., 2003). Typically, evolutionary biologists go to natural populations, sample tissue from the individuals, genotype them with high-throughput methods and then proceed with a genome scan analysis of selection (see, for example, Eckert et al., 2010; Bourret et al., 2013; Fischer et al., 2013). Although this method can be quite powerful, it has some limitations (for example, false positives, no information on the adaptive phenotype). Several calls have been made to independently validate the results of such analyses (see Buehler et al., 2014 for a striking example), possibly using common garden or reciprocal transplant experiments (Holderegger et al., 2008; Pardo-Diaz et al., 2014; Rellstab et al., 2015). Along these lines, this perspective paper addresses three main questions: where does the common garden experiment stand in the genomic era? In particular, what can common garden experiments bring to population genomics? Conversely, how can techniques from the genomic fields (for example, high-throughput genotyping and model-based inference of neutral evolution) extend the range and scope of common gardens?

It is important to note that population genomics aims at linking genotypes and environments through genome scan methods but often completely neglects to study the phenotypic traits under potential selection. There is much to gain by adding phenotypes into the equation (Cushman, 2014). Yet, because phenotypic plasticity is hard to distinguish from local adaptation in wild populations, it seems useless, or at least dubious, to use phenotypes directly obtained in the field. This simple fact lies at the heart of common garden experiments and we suggest here that this approach is ideally suited to jointly study genotypes, phenotypes and environments, especially when it is combined with high-throughput genotyping and powerful statistical methods. After a short introduction to the different high-throughput genotyping methods available in the context of a common garden experiment, we will discuss how those methods and powerful statistical tools can rejuvenate this classical approach. Finally, we will discuss the complementarity between population genomics and common garden experiments, and how an integrative analysis can deepen our understanding of local adaptation.

Box 1 Quantitative genetics glossary

Quantitative genetics: Theoretical framework used to study the genetic basis of (mostly) quantitative polygenic traits. It uses relatedness between individuals to partition the phenotypic variance into (among others) genetic and nongenetic components.

Relatedness: Probability of shared ancestry (identity by descent (IBD)) of any two homologous alleles sampled from two individuals. Can also be defined in terms of correlation of homologous alleles between two individuals when the reference population is the sample itself. Relatedness is always defined relative to a reference population (Wang, 2014).

Additive genetic variance ($V_A$): Variance component due to the additive effects of the alleles and genes responsible for the phenotype. Under general conditions (no epistasis, no inbreeding), this is the only component transmitted to the offspring generation.

Dominance variance ($V_D$): Genetic variance arising from interactions between alleles within each gene responsible for the phenotype. The dominance effect is perceptible only when comparing full-sibs and in the presence of mild to strong inbreeding (Wolak and Keller, 2014).

Parental effects: Direct or indirect effects of the parental phenotype on the offspring phenotype, apart from the genetic heredity of the phenotype. This includes, in particular, maternal energetic investment in offspring.

Heritability: Proportion of the phenotypic variance genetically transmissible to the offspring generation within a population. Calculated as the ratio of $V_A$ to the total phenotypic variance. The marker-based heritability is the proportion of phenotypic variance explained by the whole genetic marker panel, which is not necessarily equal to the true heritability.

$Q_{ST}$: Among-population genetic differentiation index. Ratio of the among-population additive genetic variance $V_{pop}$ to the total additive genetic variance (calculated as $V_{pop} + 2V_A$).

High-throughput genotyping in the context of a common garden

High-throughput genotyping refers to any genotyping method yielding a large number of markers, thus providing a dense marker panel across the genome. Given the focus on non-model species in this paper, we consider as few as 10 000 independent markers as fairly ‘dense’, provided that the genome of the species is not too large. For example, 10 000 single-nucleotide polymorphisms (SNPs) in a genome of size 100 Mbp would represent ∼3% of all SNPs if a SNP occurs every 300 bp.
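The arithmetic behind this example can be checked with a few lines of Python. The function name is hypothetical, and the calculation simply assumes evenly spaced SNPs, as the example in the text does.

```python
def sampled_snp_fraction(n_markers: int, genome_size_bp: int, snp_spacing_bp: int) -> float:
    """Fraction of all SNPs covered by a panel, assuming one SNP every snp_spacing_bp."""
    total_snps = genome_size_bp / snp_spacing_bp
    return n_markers / total_snps


# Values from the example in the text: 10 000 SNPs, 100 Mbp genome, one SNP per 300 bp.
fraction = sampled_snp_fraction(10_000, 100_000_000, 300)
print(f"{fraction:.1%}")
```

The same function shows how quickly the covered fraction shrinks for larger genomes, which is why the text qualifies the 10 000-marker threshold with "provided that the genome of the species is not too large".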

The most straightforward high-throughput genotyping method is whole-genome sequencing. This method yields the largest possible number of markers, and offers the densest genotyping. However, this technique requires high DNA quality and quantity, bioinformatics computation power and, most importantly, access to genomic resources (for example, genome assembly) within a relatively short phylogenetic range. The huge number of markers generated can also be problematic during the analyses because of high computation/memory requirements, high redundancy in information between linked markers and low signal-to-noise ratio. Still, whole-genome sequencing is the ultimate high-throughput genotyping method, yielding up to millions of SNP markers throughout the whole genome. With a decreasing cost and an increase in the number of species for which the whole genome has been sequenced over the years, it might soon become a recommended technique even for non-model species. A cheaper alternative to whole-genome sequencing is the SNP genotyping chip, although most of the limitations above still apply.

For now, the approach likely to be best suited for non-model species is genome representation sequencing. The overall principle of this approach is to sequence only restricted, but random, parts of the genome in order to decrease the sequencing effort, and hence the overall costs and computational efforts associated with genotyping. To do so, these approaches mainly use DNA digestion by restriction enzymes followed by a ligation of tags and primers and PCR amplification. This is akin to the principle underlying amplified fragment length polymorphism (AFLP) genotyping (Vos et al., 1995). Here, however, the DNA fragments (or at least some of them) are partially sequenced (∼100 bp) using next-generation technology such as Illumina HiSeq (Illumina Inc., San Diego, CA, USA). This kind of approach includes the genotyping-by-sequencing method (Elshire et al., 2011) and the family of restriction site-associated DNA sequencing methods (Miller et al., 2007; Baird et al., 2008).

The sequences obtained are then analysed using quality checks (that is, selecting reads according to their sequencing quality, local coverage, availability over all or most individuals and so on) and SNP calling pipelines in order to identify SNP markers. Note that contrary to the AFLP approach, markers derived from restriction site-associated DNA sequencing preferentially come from nonpolymorphic restriction sites and are codominant. Alternatively, when more than one SNP is present on a 100-bp sequence, they can be combined into a new marker with more than two alleles. The rationale behind this is that very close SNPs are likely to be strongly associated because of physical linkage, in which case fewer but independent markers composed of more alleles are often preferable to strongly linked SNPs. Genome representation protocols can yield up to several hundreds of thousands of SNPs, but more typically tens of thousands. This can be achieved at a cost comparable to, or up to 10 times, the cost of an AFLP analysis.
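The idea of combining several SNPs from the same ∼100-bp sequence into one multi-allelic marker can be sketched as follows. This is an illustration only: the input layout (two read-derived haplotype strings per individual), the function name and the toy data are assumptions, and it presumes that phase is known because the SNPs lie on the same read.

```python
from collections import Counter


def haplotype_alleles(read_haplotypes):
    """Recode the SNPs of one short locus as a single multi-allelic marker.

    read_haplotypes: {individual: (hap1, hap2)}, where each haplotype is the
    string of bases observed at the SNP positions of the locus (hypothetical layout).
    Returns the allele codes and each individual's recoded diploid genotype.
    """
    counts = Counter(hap for pair in read_haplotypes.values() for hap in pair)
    # Number alleles by decreasing frequency; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda hap: (-counts[hap], hap))
    codes = {hap: code for code, hap in enumerate(ordered)}
    recoded = {
        ind: tuple(sorted(codes[hap] for hap in pair))
        for ind, pair in read_haplotypes.items()
    }
    return codes, recoded


# Toy locus carrying two biallelic SNPs (A/G at the first site, C/T at the second).
example = {
    "ind1": ("AC", "AC"),
    "ind2": ("AC", "GT"),
    "ind3": ("GT", "AT"),
}
codes, recoded = haplotype_alleles(example)
print(codes)    # {'AC': 0, 'GT': 1, 'AT': 2}
print(recoded)  # {'ind1': (0, 0), 'ind2': (0, 1), 'ind3': (1, 2)}
```

In this toy case, two strongly linked biallelic SNPs become one marker with three observed alleles.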

From all of the above, it is clear that next-generation sequencing makes it possible to generate a very large number of markers for a moderate cost. When compared with AFLP markers, next-generation sequencing marker panels are denser, and the markers are codominant and less arbitrary in their interpretation (that is, no ‘binning’ process), hence better in every way, except possibly for their cost. Microsatellites, on the other hand, are very different: they usually provide very sparse panels (up to a few dozen markers), but are highly mutable and have a large allelic diversity. Although it has been argued that microsatellites are better markers to infer relatedness (Ritland, 2000), they typically yield smaller relatedness estimates than SNP or AFLP markers because of higher mutation rates (Uptmoor et al., 2003; El Rabey et al., 2013). They also yield smaller $F_{ST}$ estimates (Edelaar and Björklund, 2011) for the same reason. Finally, although in theory more accurate than SNPs for the same number of loci, they typically yield one to two orders of magnitude fewer loci, and hence they are less accurate in practice (Uptmoor et al., 2003).

A key issue is the number of individuals that need to be genotyped. Our view is that ideally all individuals from the experimental garden(s) should be genotyped, because this opens the way toward the more refined or novel analyses detailed below. However, some of the analyses suggested here (for example, genome scans) can be performed even when only a subsample of individuals has been genotyped. De Kort et al. (2014), for example, sampled one individual per family in their common garden experiment to combine it with population genomics (that is, genome scan) analyses. This cheaper subsampling procedure might be very attractive to researchers who are not interested in individual genotypes: that is, neither in the relatedness inference nor in the genome-wide association studies that are described below.

Common gardens 2.0: new markers and new methods

We are certainly not the first to encourage the evolutionary biology community to switch toward next-generation sequencing technology (Luikart et al., 2003; Savolainen et al., 2013), and it is clear that such a ‘revolution’ is already happening (reviewed in Pardo-Diaz et al., 2014). However, we wish here to emphasise the value of dense marker panels in the context of a common garden experiment.

As stated above, a study of the genetics of complex traits such as those measured in common garden experiments relies heavily on the relatedness between individuals, which is often assumed rather than measured, especially when individuals are siblings (see, for example, Hernández-Serrano et al., 2014). Yet, contrary to the parent–offspring relationship, the relatedness between siblings varies: the commonly used value of 0.25 between half-sibs, for example, is only an average, expected value. Hence, using realised relatedness, inferred from molecular data, can allow for better estimates in the sense that (1) they are more robust to error in the kinship assessment (for example, full-sibs instead of half-sibs) and (2) they reflect more accurately the variation in relatedness between siblings. Better relatedness estimates are useful because they will improve the precision of the estimates of $h^2$ and $Q_{ST}$. Note however that many markers are typically needed to obtain precise molecular estimates of relatedness (Uptmoor et al., 2003). Dense markers provided by high-throughput genotyping naturally fulfill this requirement.
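To make the idea of realised relatedness concrete, the sketch below computes a marker-based relationship matrix from 0/1/2 allele counts and applies it to simulated full-sib families. The estimator is a commonly used genomic relationship matrix (centred genotypes, scaled by twice the sum of p(1 − p)); it is swapped in here for illustration and is not necessarily the estimator the authors have in mind. The simulation setup (unlinked markers, two families of four sibs) and all function names are hypothetical.

```python
import random


def realised_relatedness(genotypes):
    """Marker-based relationship matrix from 0/1/2 allele counts.

    genotypes: list of per-individual lists, one allele count per marker.
    Allele frequencies are estimated from the sample itself, so the sample
    acts as the reference population.
    """
    n = len(genotypes)
    n_markers = len(genotypes[0])
    freqs = [sum(ind[k] for ind in genotypes) / (2.0 * n) for k in range(n_markers)]
    # Keep only markers that are polymorphic in the sample.
    keep = [k for k, p in enumerate(freqs) if 0.0 < p < 1.0]
    scale = 2.0 * sum(freqs[k] * (1.0 - freqs[k]) for k in keep)
    centred = [[ind[k] - 2.0 * freqs[k] for k in keep] for ind in genotypes]
    return [[sum(a * b for a, b in zip(zi, zj)) / scale for zj in centred]
            for zi in centred]


def simulate_full_sibs(rng, n_families=2, n_sibs=4, n_markers=2000):
    """Simulate full-sib families at unlinked biallelic markers (illustrative only)."""
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]

    def parent():
        # Two haplotypes, each allele drawn independently from the allele frequency.
        return [[int(rng.random() < p) for p in freqs] for _ in range(2)]

    genotypes, family = [], []
    for fam in range(n_families):
        mother, father = parent(), parent()
        for _ in range(n_sibs):
            # Each parent transmits one randomly chosen allele per marker.
            child = [mother[rng.randrange(2)][k] + father[rng.randrange(2)][k]
                     for k in range(n_markers)]
            genotypes.append(child)
            family.append(fam)
    return genotypes, family


rng = random.Random(1)
genotypes, family = simulate_full_sibs(rng)
G = realised_relatedness(genotypes)

n = len(family)
within = [G[i][j] for i in range(n) for j in range(i + 1, n) if family[i] == family[j]]
between = [G[i][j] for i in range(n) for j in range(i + 1, n) if family[i] != family[j]]
print("full-sib pairs: mean %.3f, min %.3f, max %.3f"
      % (sum(within) / len(within), min(within), max(within)))
print("between-family pairs: mean %.3f" % (sum(between) / len(between)))
```

Because the values are expressed relative to the sample used as reference population, between-family pairs can come out negative. The point of the illustration is the contrast between pairs and the spread among pairs that a pedigree would treat as identical.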

A large number of markers also allows the reconstruction of the family structure. Indeed, even when relatedness is precisely estimated, the family structure (that is, who is the mother/father of the individuals, which individuals are full- or half-sibs) is of utmost importance in order to account for many confounding effects such as dominance (Wolak and Keller, 2014), parental effects (for example, maternal, Wilson et al., 2010) or selfing (Gauzere et al., 2013). Note that maternal effects can also be accounted for by weighing seeds (in plants, Roach and Wulff, 1987) or reduced by using F2 generations (Roach and Wulff, 1987; Mousseau and Dingle, 1991). However, the possibility of using one of these methods will strongly depend on the studied species. According to Jones et al. (2010), brood size is one of the biggest limitations for parental reconstruction algorithms because of issues of unsampled alleles when too few segregating individuals are available. With many markers, even with low levels of polymorphism (such as SNPs), this is no longer an issue, as it becomes possible to reconstruct a large-enough proportion of the parental genomes to obtain high certainties of assignment, even for small brood sizes. Now that efficient algorithms such as those implemented in COLONY (Jones and Wang, 2010; Wang, 2012) are available, the number of markers should not be a problem. This software allows reconstructing the family structure, as well as inferring parental genotypes, while accounting for selfing or genotyping errors. Indeed, one crucial issue for parental inference with a large number of markers is to include possible genotyping errors that, if left unaccounted for, can severely bias the results (Wang, 2004).

The most innovative statistical method, especially designed to study common garden data, is probably the one developed by Ovaskainen et al. (2011) that overcomes several problems associated with the classical $F_{ST}$–$Q_{ST}$ comparisons. In order to avoid clumsy comparisons between two noisy estimators, Ovaskainen et al. (2011) conceived a model of neutral phenotypic differentiation between populations that is compared with phenotypic differentiation measured in a common garden experiment (that is, the genetic differentiation linked to the phenotype). When suspiciously strong phenotypic differentiation is observed compared with the neutral expectation, a local adaptation hypothesis can be proposed. The neutral model of phenotypic differentiation is actually a combination of a within-population ‘animal model’ (see Kruuk, 2004 for a description of the model) and an among-populations ‘F-model’ (see Gaggiotti and Foll, 2010 for a description of the model) of phenotypic evolution (Karhunen and Ovaskainen, 2012). By doing so, this model allows for a multivariate genetic analysis to be performed, for example, to infer genetic correlations and a G matrix. This is a perfect illustration of how models emerging from the field of population genomics (here the F-model) can be used to dramatically improve the analysis of common garden data sets. This method has been implemented in the DRIFTSEL package (Karhunen et al., 2013). Using this method, Karhunen et al. (2014) demonstrated the presence of strong footprints of local adaptation in several populations of nine-spine stickleback (Pungitius pungitius).

What is the use of common garden experiments in the genomic era?

It is well known in the domain of genome-wide association studies, which aim at uncovering the loci responsible for phenotypic variation, that such analyses should be performed with extreme caution because of the potential effect of hidden population structure. Especially important are the combined effects of genetic drift and gene flow, and the confounding effect of phenotypic plasticity. However, both of these problems can be overcome. Population structure can be accounted for by using appropriate models (see, for example, Nicholson et al., 2002; Beaumont and Balding, 2004) or methods (Frichot et al., 2013) from the genome scan literature. The second problem, on the other hand, is directly addressed by common garden experiments, which were specifically designed to control for phenotypic plasticity.

As a result, combining common garden experiments of non-model species with genome-wide association studies provides an opportunity for multiple-population genome-wide association studies (Brachi et al., 2013; Slavov et al., 2014). For a locally adapted trait, it would even be possible to differentiate markers explaining among-population phenotypic variability (by testing for among-population effects) from markers explaining within-population variability (by testing for within-population effects). The technique of within-group centring (Davis et al., 1961; van de Pol and Wright, 2009) could be used to this end. It simply consists in distinguishing between the mean-population effect and the within-population effect of each predictor of an association model, as follows:

$$y_{ij} = \mu + \beta_B \bar{x}_j + \beta_W (x_{ij} - \bar{x}_j) + u_j + e_{ij}$$

where $y_{ij}$ is the phenotype of individual $i$ in population $j$, $x_{ij}$ is its genotype and $\bar{x}_j$ the mean genotype in population $j$. The parameters $\mu$, $\beta_B$ and $\beta_W$ are the fixed effects of the model. Note that the within-population effects can be tested independently by using a parameter $\beta_{W,j}$ for each population $j$. The term $u_j$ stands for any population structure correction and $e_{ij}$ is the residual. This equation is simply an illustration of within-group centring and does not constitute a model per se. Accounting for population structure should help in distinguishing between neutral and selective scenarios for markers associated with between-population variability. As always (Korte and Farlow, 2013), the power of a genome-wide association study to actually detect loci linked to the phenotypic variability strongly depends on the extent of linkage disequilibrium and the density of markers along the genome, in addition to the sample size. Hence, the most useful, but most expensive, genotyping method for this kind of analysis is whole-genome sequencing. Note also that heterogeneity in recombination/mutation rates along the genome can generate false positives during such analyses (Korte and Farlow, 2013). Here, the number of populations is also of importance, as it will determine the power to detect significance for the parameter $\beta_B$. Note that Brachi et al. (2013) used a different approach of multiscale (local to worldwide variation) analysis and found very different results depending on the studied scale of local adaptation.

The approach that is probably the most typical of the genomic era is to scan genomes for signals of selection (mostly selective sweeps and local adaptation). Many methods have been developed in the past decades to detect local adaptation (Beaumont and Balding, 2004; Foll and Gaggiotti, 2008; Bonhomme et al., 2010; Coop et al., 2010; Frichot et al., 2013; Duforet-Frebourg et al., 2014; Guillot et al., 2014).
Despite considerable efforts to account for population structure, these methods have been shown to display high error rates (de Villemereuil et al., 2014; Lotterhos and Whitlock, 2014). Hence, validation of the results of a genome scan must always be done using independent tests. Gene ontologies and pathway analyses are the most common means of checking these results. However, it has been suggested that common garden experiments might be a very efficient complement to those analyses (De Kort et al., 2014; Lepais and Bacles, 2014; Rellstab et al., 2015).
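
As a rough illustration of the within-group centring discussed earlier in this section, the following sketch splits a genotype predictor into its population mean and the within-population deviation, then estimates the between- and within-population effects by ordinary least squares. It rests on several assumptions that are not in the text: the population structure term is omitted, genotypes are assumed to be coded as allele counts (0, 1 or 2), and the function names `within_group_centre` and `fit_centred_model` are hypothetical. A real analysis would use a mixed model with an appropriate structure correction.

```python
import numpy as np


def within_group_centre(x, pop):
    """Split a predictor into its population mean and the deviation from it.

    x   : genotype of each individual (assumed coded 0, 1 or 2)
    pop : population label of each individual
    """
    x = np.asarray(x, dtype=float)
    pop = np.asarray(pop)
    # Map arbitrary population labels to integer indices 0..K-1
    _, idx = np.unique(pop, return_inverse=True)
    pop_means = np.bincount(idx, weights=x) / np.bincount(idx)
    x_bar = pop_means[idx]          # mean genotype of each individual's population
    return x_bar, x - x_bar         # between-population and within-population parts


def fit_centred_model(y, x, pop):
    """Least-squares estimates of the intercept, between- and within-population effects.

    Illustrative only: no population structure correction is included.
    """
    x_bar, x_dev = within_group_centre(x, pop)
    design = np.column_stack([np.ones_like(x_bar), x_bar, x_dev])
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return {"mu": coef[0], "beta_B": coef[1], "beta_W": coef[2]}
```

Because the within-population deviations sum to zero in every population, the two genotype columns are orthogonal, which is what lets the between- and within-population effects be estimated separately.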

Performing genome-scan analyses using common garden data can have many advantages. First, if a strong adaptive signal is detected both using genome scan methods (that is, using genotypic and possibly environmental data) and using the phenotypic data from a common garden experiment, that will constitute two independent pieces of evidence favouring the hypothesis of local adaptation (Holderegger et al., 2008). As stated above, genome scan results need to be validated anyhow (Pardo-Diaz et al., 2014; Rellstab et al., 2015), and performing a common garden experiment is an elegant way to do so. We suggest that, whenever possible, combining genome scan approaches with common garden experiments is an efficient approach to the study of local adaptation. Second, by comparing the loci showing strong signals of differentiation and the loci associated with among-population phenotypic differentiation, it is possible to isolate candidate loci for local adaptation with very little information regarding the functional annotation of the species’ genome. Third, using the environmental information makes it possible not only to identify the selected phenotypes (that is, strongly differentiated genetically), but also to infer the environmental variable driving the selective pressure. In particular, if a locus is strongly associated with an environmental variable and with the among-population phenotypic differentiation, one might conclude that a relationship exists between the environmental variable and the phenotype (although only correlatively: each variable is a putative proxy for the real selective/selected variable).
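
The comparison of loci described above can be sketched as a simple set intersection. This is only an illustration: it assumes that each analysis yields one q-value per locus, and the function name `shared_candidates`, the dictionary inputs and the 0.05 threshold are hypothetical choices rather than anything prescribed by the methods cited.

```python
def shared_candidates(scan_qvalues, assoc_qvalues, alpha=0.05):
    """Return loci flagged by both a genome scan and an association analysis.

    scan_qvalues  : dict mapping locus name -> q-value from a genome scan
    assoc_qvalues : dict mapping locus name -> q-value for the among-population
                    phenotypic association
    alpha         : significance threshold applied to both analyses (assumed)
    """
    scan_hits = {locus for locus, q in scan_qvalues.items() if q <= alpha}
    assoc_hits = {locus for locus, q in assoc_qvalues.items() if q <= alpha}
    # Loci supported by both lines of evidence are the candidates of interest
    return sorted(scan_hits & assoc_hits)
```

Loci returned by such a comparison remain candidates only; the overlap does not by itself establish that a locus is under selection.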

An important problem when performing genome-scan analyses directly on common garden individuals is to correctly infer the source-population allele frequencies. The preferred way is simply to genotype the parents of the common garden individuals. However, this is not always possible (for example, genotyping the father for plants is impossible most of the time). In that case, allele frequencies inferred directly from the individuals should be accurate, as long as there is no sex-dependent allelic frequency bias. But the confidence in that inference will be overestimated because many related individuals were sampled. To account for this situation, a conservative solution is to calculate the allele frequencies based on the individuals of the common garden, but to consider that the sample size of these estimates is the number of parents that have generated the offspring. With this kind of data, all population-based methods (such as Bayescan, Foll and Gaggiotti, 2008, or BayEnv, Coop et al., 2010) can be used. A second solution, if the confidence in parental genotypic reconstruction is high enough, is to directly use the inferred genotypes of the parents, both to infer allele frequencies in the population and directly as data for individual-based genome scan methods. Yet, in practice, these data will always be inferred with some uncertainty, and the consequences of ignoring this uncertainty during post hoc analyses are unknown. Still, the interest of this approach is that individual-based methods (such as Latent Factor Mixed Model, Frichot et al., 2013, or PCAdapt, Duforet-Frebourg et al., 2014) can be used to analyse the data. A last solution is the one implemented by De Kort et al. (2014) that consists in using only one individual per family. Although this solution requires a sufficiently large number of families for each population, it has the compelling advantage of simplicity and efficiency.
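
Two of the solutions above, the conservative sample size and the one-individual-per-family subsample, can be sketched as follows. This is a minimal illustration under stated assumptions: individuals are diploid, loci are biallelic with genotypes coded as counts of one allele (0, 1 or 2), and the function names and the tuple layout of `individuals` are hypothetical. It is not the implementation used by any of the cited programs.

```python
import random
from collections import defaultdict


def conservative_allele_frequency(genotypes, n_parents):
    """Allele frequency from offspring, with sample size set to the number of parents.

    genotypes : allele counts (0, 1 or 2) of the common garden individuals
                from one population at one locus
    n_parents : number of parents that generated these offspring
    Returns the frequency, the conservative sample size (in individuals) and
    the corresponding rounded allele count out of 2 * n_parents gene copies.
    """
    freq = sum(genotypes) / (2 * len(genotypes))
    allele_count = round(freq * 2 * n_parents)
    return freq, n_parents, allele_count


def one_per_family(individuals, seed=None):
    """Keep one randomly chosen individual per family within each population.

    individuals : iterable of (individual_id, family_id, population_id) tuples
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for ind_id, family_id, pop_id in individuals:
        by_family[(pop_id, family_id)].append(ind_id)
    return {key: rng.choice(members) for key, members in by_family.items()}
```

The first function keeps the point estimate from all offspring but reports the reduced sample size, so that population-based methods do not overstate their confidence; the second discards data in exchange for independence among the retained individuals.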

Local adaptation is a play starring three actors: the environment, the phenotype and the genotype. The environment selects the phenotypes that are (partly) determined by a number of genes. The evolutionary result is a change in allele frequencies of the polymorphic coding genes. Understanding the relationships between the three actors requires precise but large-scale measurements, rigorous experiments and powerful statistical methods. Because phenotypic plasticity is such a pervasive phenomenon and because it is nearly impossible to account for its effect on in situ phenotypes, phenotypes should never be directly compared between different populations, unless a case is made that the comparison is safe enough (low environmental contrasts or little phenotypic plasticity). In contrast, common garden experiments are ideally suited to perform such kinds of analyses, and hence to study the phenotypic traits affected by local adaptation. Now that dense marker panels are obtainable for many individuals at a moderate cost, common garden experiments are expected to be performed more routinely. Of course, this is unless the biological characteristics (for example size, behaviour, generation time) prevent the applicability of this experiment. Common gardens could possibly even replace the field work required to obtain tissue samples for genotyping: as we mentioned, it would still allow for population genomics approaches, while guaranteeing independent validation through the study of phenotypes ( Pardo-Diaz et al., 2014 ; Rellstab et al., 2015 ), hence saving the cost of another genotyping campaign. As emphasised by Lepais and Bacles (2014) , deciphering the genetic basis of local adaptation will only be accomplished by combining all the information yielded by dense marker panels, careful experiments and in situ sampling and observations. 
Replicating common garden experiments in different environments can also provide insight into complicated relationships between the three actors such as genotype-by-environment interactions. High-throughput genotyping provides an abundance of genetic data. World-wide fine-scale databases (for example, WorldClim, Hijmans et al., 2005 ) and the advent of cheap in situ sensors also provide high-quality environmental data. However, collecting phenotypic data is still time consuming, tedious and sometimes expensive. It thus seems that the last challenge that needs to be overcome is the development of high-throughput phenotyping allowing for a scaling-up and a more widespread use of common garden experiments.

Ågren J, Schemske DW . (2012). Reciprocal transplants demonstrate strong adaptive differentiation of the model organism Arabidopsis thaliana in its native range. New Phytol 194 : 1112–1122.


Alberto F, Bouffier L, Louvet JM, Lamy JB, Delzon S, Kremer A . (2011). Adaptive responses for seed and leaf phenology in natural populations of sessile oak along an altitudinal gradient. J Evol Biol 24 : 1442–1454.


Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA et al . (2008). Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3 : e3376.


Bassar RD, Marshall MC, Lopez-Sepulcre A, Zandona E, Auer SK, Travis J et al . (2010). Local adaptation in Trinidadian guppies alters ecosystem processes. Proc Natl Acad Sci USA 107 : 3616–3621.


Beaumont MA, Balding DJ . (2004). Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol 13 : 969–980.

Bonhomme M, Chevalet C, Servin B, Boitard S, Abdallah J, Blott S et al . (2010). Detecting selection in population trees: the Lewontin and Krakauer test extended. Genetics 186 : 241–262.

Bourret V, Dionne M, Kent MP, Lien S, Bernatchez L . (2013). Landscape genomics in Atlantic salmon (Salmo salar): searching for gene-environment interactions driving local adaptation. Evolution 67 : 3469–3487.

Bozinovic F, Rojas JM, Broitman BR, Vasquez RA . (2009). Basal metabolism is correlated with habitat productivity among populations of degus (Octodon degus). Comp Biochem Physiol A Mol Integr Physiol 152 : 560–564.


Brachi B, Villoutreix R, Faure N, Hautekeete N, Piquot Y, Pauwels M et al . (2013). Investigation of the geographical scale of adaptive phenological variation and its underlying genetics in Arabidopsis thaliana. Mol Ecol 22 : 4222–4240.

Buehler D, Holderegger R, Brodbeck S, Schnyder E, Gugerli F . (2014). Validation of outlier loci through replication in independent data sets: a test on Arabis alpina. Ecol Evol 4 : 4296–4306.


Coop G, Witonsky D, Di Rienzo A, Pritchard JK . (2010). Using environmental correlations to identify loci underlying local adaptation. Genetics 185 : 1411–1423.

Cushman SA . (2014). Grand challenges in evolutionary and population genetics: the importance of integrating epigenetics, genomics, modeling, and experimentation. Front Genet 5 : 197.

Davis J, Spaeth J, Huson C . (1961). A technique for analyzing the effects of group composition. Am Soc Rev 26 : 215–225.

De Kort H, Vandepitte K, Bruun HH, Closset-Kopp D, Honnay O, Mergeay J . (2014). Landscape genomics and a common garden trial reveal adaptive differentiation to temperature across Europe in the tree species Alnus glutinosa. Mol Ecol 23 : 4709–4721.

de Villemereuil P, Frichot E, Bazin E, Francois O, Gaggiotti OE . (2014). Genome scan methods against more complex models: when and how much should we trust them? Mol Ecol 23 : 2006–2019.

DeFaveri J, Merilä J . (2014). Local adaptation to salinity in the three-spined stickleback? J Evol Biol 27 : 290–302.

Duforet-Frebourg N, Bazin E, Blum MGB . (2014). Genome scans for detecting footprints of local adaptation using a Bayesian factor model. Mol Biol Evol 31 : 2483–2495.

Eckert AJ, Bower AD, Gonzalez-Martinez SC, Wegrzyn JL, Coop G, Neale DB . (2010). Back to nature: ecological genomics of loblolly pine (Pinus taeda, Pinaceae). Mol Ecol 19 : 3789–3805.

Edelaar P, Björklund M . (2011). If FST does not measure neutral genetic differentiation, then comparing it with QST is misleading. Or is it? Mol Ecol e-pub ahead of print 16 March 2011 doi:10.1111/j.1365-294X.2011.05051.x.

Edelaar P, Burraco P, Gomez-Mestre I . (2011). Comparisons between QST and FST—how wrong have we been? Mol Ecol 20 : 4830–4839.


El Rabey H, Salem KF, Mattar MZ . (2013). The genetic diversity and relatedness of rice (Oryza sativa L.) cultivars as revealed by AFLP and SSRs markers. Life Sci J 10 : 1471–1479.


Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES et al . (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6 : e19379.

Fischer MC, Rellstab C, Tedder A, Zoller S, Gugerli F, Shimizu KK et al . (2013). Population genomic footprints of selection and associations with climate in natural populations of Arabidopsis halleri from the Alps. Mol Ecol 22 : 5594–5607.

Foll M, Gaggiotti OE . (2008). A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180 : 977–993.

Frichot E, Schoville SD, Bouchard G, Francois O . (2013). Testing for associations between loci and environmental gradients using latent factor mixed models. Mol Biol Evol 30 : 1687–1699.

Gaggiotti OE, Foll M . (2010). Quantifying population structure using the F-model. Mol Ecol Resour 10 : 821–830.

Gauzere J, Oddou-Muratorio S, Pichot C, Lefevre F, Klein E . (2013). Biases in quantitative genetic analyses using open-pollinated progeny tests from natural tree populations. Acta Bot Gallica 160 : 227–238.

Gonda A, Herczeg G, Merila J . (2011). Population variation in brain size of nine-spined sticklebacks (Pungitius pungitius) - local adaptation or environmentally induced variation? BMC Evol Biol 11 : 75.

Guillot G, Vitalis R, Le Rouzic A, Gautier M . (2014). Detecting correlation between allele frequencies and environmental variables as a signature of selection. A fast computational approach for genome-wide studies. Spatial Stat 8 : 145–155.

Hernández-Serrano A, Verdu M, Santos-del-Blanco L, Climent J, Gonzalez-Martinez SC, Pausas JG . (2014). Heritability and quantitative genetic divergence of serotiny, a fire-persistence plant trait. Ann Bot 114 : 571–577.

Hijmans RJ, Cameron SE, Parra JL, Jones PG, Jarvis A . (2005). Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25 : 1965–1978.

Holderegger R, Herrmann D, Poncet B, Gugerli F, Thuiller W, Taberlet P et al . (2008). Land ahead: using genome scans to identify molecular markers of adaptive relevance. Plant Ecol Divers 1 : 273–283.

Jones AG, Small CM, Paczolt KA, Ratterman NL . (2010). A practical guide to methods of parentage analysis. Mol Ecol Resour 10 : 6–30.

Jones OR, Wang J . (2010). COLONY: a program for parentage and sibship inference from multilocus genotype data. Mol Ecol Resour 10 : 551–555.

Karhunen M, Merila J, Leinonen T, Cano JM, Ovaskainen O . (2013). DRIFTSEL: an R package for detecting signals of natural selection in quantitative traits. Mol Ecol Resour 13 : 746–754.

Karhunen M, Ovaskainen O . (2012). Estimating population-level coancestry coefficients by an admixture F model. Genetics 192 : 609–617.

Karhunen M, Ovaskainen O, Herczeg G, Merila J . (2014). Bringing habitat information into statistical tests of local adaptation in quantitative traits: a case study of nine-spined sticklebacks. Evolution 68 : 559–568.

Kawakami T, Morgan TJ, Nippert JB, Ocheltree TW, Keith R, Dhakal P et al . (2011). Natural selection drives clinal life history patterns in the perennial sunflower species, Helianthus maximiliani. Mol Ecol 20 : 2318–2328.

Korte A, Farlow A . (2013). The advantages and limitations of trait analysis with GWAS: a review. Plant Methods 9 : 29.

Kruuk LEB . (2004). Estimating genetic parameters in natural populations using the 'animal model'. Philos Trans R Soc Lond B Biol Sci 359 : 873–890.

Lepais O, Bacles CFE . (2014). Two are better than one: combining landscape genomics and common gardens for detecting local adaptation in forest trees. Mol Ecol 23 : 4671–4673.

Linhart YB, Grant MC . (1996). Evolutionary significance of local genetic differentiation in plants. Annu Rev Ecol Syst 27 : 237–277.

Lotterhos KE, Whitlock MC . (2014). Evaluation of demographic history and neutral parameterization on the performance of FST outlier tests. Mol Ecol 23 : 2178–2192.

Luikart G, England PR, Tallmon D, Jordan S, Taberlet P . (2003). The power and promise of population genomics: from genotyping to genome typing. Nat Rev Genet 4 : 981–994.

Luttikhuizen PC, Drent J, Van Delden W, Piersma T . (2003). Spatially structured genetic variation in a broadcast spawning bivalve: quantitative vs. molecular traits. J Evol Biol 16 : 260–272.

McKay JK, Bishop JG, Lin JZ, Richards JH, Sala A, Mitchell-Olds T . (2001). Local adaptation across a climatic gradient despite small effective population size in the rare sapphire rockcress. Proc R Soc Lond B Biol Sci 268 : 1715–1721.


Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA . (2007). Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res 17 : 240–248.

Mousseau TA, Dingle H . (1991). Maternal effects in insect life histories. Annu Rev Entomol 36 : 511–534.

Nicholson G, Smith AV, Jónsson F, Gústafsson Ó, Stefánsson K, Donnelly P . (2002). Assessing population differentiation and isolation from single-nucleotide polymorphism data. J R Stat Soc B (Stat Methodol) 64 : 695–715.

Ovaskainen O, Karhunen M, Zheng C, Arias JMC, Merila J . (2011). A new method to uncover signatures of divergent and stabilizing selection in quantitative traits. Genetics 189 : 621–632.

Pardo-Diaz C, Salazar C, Jiggins CD . (2014). Towards the identification of the loci of adaptive evolution. Methods Ecol Evol e-pub ahead of print 12 February 2015 doi:10.1111/2041-210X.12324.

Rellstab C, Gugerli F, Eckert AJ, Hancock AM, Holderegger R . (2015). A practical guide to environmental association analysis in landscape genomics. Mol Ecol e-pub ahead of print 26 August 2015 doi:10.1111/mec.13322.

Ritland K . (2000). Marker-inferred relatedness as a tool for detecting heritability in nature. Mol Ecol 9 : 1195–1204.

Roach DA, Wulff RD . (1987). Maternal effects in plants. Annu Rev Ecol Syst 18 : 209–235.

Savolainen O, Lascoux M, Merila J . (2013). Ecological genomics of local adaptation. Nat Rev Genet 14 : 807–820.

Slavov GT, Nipper R, Robson P, Farrar K, Allison GG, Bosch M et al . (2014). Genome-wide association studies and prediction of 17 traits related to phenology, biomass and cell wall composition in the energy grass Miscanthus sinensis. New Phytol 201 : 1227–1239.

Spitze K . (1993). Population structure in Daphnia obtusa: quantitative genetic and allozymic variation. Genetics 135 : 367–374.


Stinchcombe JR . (2014) Cross-pollination of plants and animals: wild quantitative genetics and plant evolutionary genetics. In: Charmantier A, Garant D, Kruuk LE (eds). Quantitative Genetics in the Wild . Oxford University Press: Oxford, UK, pp 128–146.


Toräng P, Wunder J, Obeso JR, Herzog M, Coupland G, Ågren J . (2015). Large-scale adaptive differentiation in the alpine perennial herb Arabis alpina. New Phytol 206 : 459–470.

Uptmoor R, Wenzel W, Friedt W, Donaldson G, Ayisi K, Ordon F . (2003). Comparative analysis on the genetic relatedness of Sorghum bicolor accessions from Southern Africa by RAPDs, AFLPs and SSRs. Theor Appl Genet 106 : 1316–1325.

van de Pol M, Wright J . (2009). A simple method for distinguishing within- versus between-subject effects using mixed models. Anim Behav 77 : 753–758.

Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T, Hornes M et al . (1995). AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res 23 : 4407–4414.

Wang J . (2004). Sibship reconstruction from genetic data with typing errors. Genetics 166 : 1963–1979.

Wang J . (2012). Computationally efficient sibship and parentage assignment from multilocus marker data. Genetics 191 : 183–194.

Wang J . (2014). Marker-based estimates of relatedness and inbreeding coefficients: an assessment of current methods. J Evol Biol 27 : 518–530.

Wilson AJ, Reale D, Clements MN, Morrissey MM, Postma E, Walling CA et al . (2010). An ecologist's guide to the animal model. J Anim Ecol 79 : 13–26.

Wolak ME, Keller LF . (2014) Dominance genetic variance and inbreeding in natural populations. In: Charmantier A, Garant D, Kruuk LE (eds). Quantitative Genetics in the Wild . Oxford University Press: Oxford, UK, pp 104–127.


Acknowledgements

We thank the associate editor and three anonymous referees for their very thorough and relevant reviews that considerably improved the focus and quality of this manuscript. PdV was supported by a doctoral studentship from the French Ministère de la Recherche et de l'Enseignement Supérieur . OEG was supported by the Marine Alliance for Science and Technology for Scotland (MASTS).

Author information

Authors and affiliations.

Université Joseph Fourier, Centre National de la Recherche Scientifique, LECA, UMR 5553, Saint Martin d’Hères, France

P de Villemereuil, O E Gaggiotti, M Mouterde & I Till-Bottraud

Scottish Oceans Institute, University of St Andrews, Fife, UK

O E Gaggiotti


Corresponding author

Correspondence to P de Villemereuil .

Ethics declarations

Competing interests.

The authors declare no conflict of interest.


About this article

Cite this article.

de Villemereuil, P., Gaggiotti, O., Mouterde, M. et al. Common garden experiments in the genomic era: new perspectives and opportunities. Heredity 116 , 249–254 (2016). https://doi.org/10.1038/hdy.2015.93


Received : 17 April 2015

Revised : 05 August 2015

Accepted : 06 August 2015

Published : 21 October 2015

Issue Date : March 2016

DOI : https://doi.org/10.1038/hdy.2015.93
