• Medicine Pipeline

Breakthrough Science | Lightning Fast Development | Radical Collaboration

Process development services and technology platforms for your medicines across chemistry, crystallization, engineering and analytical.

The Achieve® platform.

Our platform technology and world-class scientific approach accelerate the development and launch of synthetic modalities: classic and highly potent synthetic molecules, ADCs, and peptides.

Chemistry

Chemistry services that support lightning-fast development from process definition to manufacturing introduction.

Crystallization

Crystallization services that take particles and powders off your critical path.

Engineering

Scale-up services that get it done right the first time.

Tech Transfer

Tech transfer services that slash the effort needed to introduce processes to manufacturing – and get it right the first time.

Analytical

Analytical services that serve the fastest synthetic process development programs in the world.

At APC, we deliver ADC and highly potent synthetic programs faster without sacrificing process understanding, scalability, or portability, thanks to our best-in-class HPAPI experts and fully equipped development facility.

Computational Fluid Dynamics

Explore the power of Computational Fluid Dynamics (CFD) - enhance reactor selection and design, and ensure the seamless tech transfer of your process.

Continuous Processing

Harness the power of our continuous processing expertise to develop greener, safer processes with smaller manufacturing footprints.

Crystallization Process Design

Crystallization process design leveraging simulations, modeling, data-rich experimentation, and industry experts can save months of development time. Learn how.

Crystallization Scale-Up / Scale-Down

Our team is steeped in engineering know-how and many of us have worked in GMP manufacturing environments, so the challenges of running complex chemistries and crystallizations at scale are well-known.

Filtration, Washing and Drying

Trial & error is not an efficient way to solve filtration, washing, and drying challenges. Partnering with solid state chemistry and engineering experts is!

Method Development, Optimization and Validation

Good analytical method development is crucial for smooth and efficient process development. Learn how our world-class analytical team can help.

Method Transfer

Fast & effective analytical method transfer of Final Product, APIs, Bulk Drug, Drug Substance (DS), Drug Product (DP), Intermediates, and more! Learn how.

Milling

Set up a scalable milling process, or produce precisely sized particles from your reactor to avoid milling steps, by leveraging world-class solid-state expertise.

Multivariate Data Analysis

MVDA experts uncover key info from complex & interrelated datasets for data-driven process development decisions that facilitate speed through your pipeline.

Particle Engineering

Leverage particle engineering expertise in pharmaceutical development to transform challenging precipitation into scalable crystallization processes. Learn how.

Powered by Modeling

Smarter DoEs, Recipe Definition, Reactor Design, & Crystallization - Pharmaceutical Process Modelling Ensures the Most is Made from Every Experiment. Learn how.

Powered by PAT

Process Analytical Technology (PAT) + Automation + Data Mgmt + Process Development Experts = Insights that Deliver First-Right-Time Results. Learn how.

Process Characterization

A "science-first" philosophy facilitates unrivalled insights & understanding and allows a full characterization of your process across all unit ops.

Process Chemistry

Accelerate your synthetic modality development and launch with an engineering-centric approach to process chemistry, whether classic or highly potent.

Process Definition

Optimize challenging steps related to process & product performance, with flexible options that never lock you into a single manufacturing approach or vendor.

Process Development

A laser-like focus on process development significantly reduces the time to get medicines to patients. Learn about our "Process Advancing Medicine" approach.

Reaction Scale-Up/Scale-Down

Delivering a robust process to your manufacturing location of choice requires deep process engineering and modelling expertise. Learn how we can help.

Solid State

Address morphology, size, flowability, solubility, & API formulation issues with world-leading solid-state analysis & particle engineering expertise.

Solvent Selection

Choosing the optimal solvent (or solvent combination) for your API can be experimentally intense. See how AI-driven solubility modelling can be a game changer.

Isolation & Purification | Liquid Phase Partitioning | Extractions | Short Cycle Times — Discuss Your Pharmaceutical Work-Up Challenges with Our Experts Today!

van der Schaar Lab

Generating and evaluating synthetic data: a two-sided research agenda

This post was created to accompany Mihaela van der Schaar’s invited talk, “Synthetic Data Generation and Assessment: Challenges, Methods, Impact,” given on December 14, 2021, at the Deep Generative Models and Downstream Applications Workshop held alongside NeurIPS 2021.

If you’d like to learn more about our lab’s research in the area of synthetic data generation and evaluation, you can find a full overview here.

Also, consider watching our Inspiration Exchange engagement series and registering to join upcoming sessions.

Other useful links: – Our lab’s publications – Mihaela van der Schaar on Twitter and LinkedIn

In this short post, we will explain the importance of synthetic data, and outline our lab’s ongoing work to create adaptable and logical frameworks and processes for synthetic data generation and evaluation.

Why do we need synthetic data?

Our purpose as a lab is to create new and powerful machine learning techniques and methods that can revolutionize healthcare. To catalyze such a revolution, we need high-quality data resources in a multitude of forms, including electronic health records, biobanks, and disease registries.

Access to such data, however, is complicated due to strict regulatory constraints (under frameworks such as HIPAA and GDPR), which are the result of perfectly valid concerns regarding the privacy of such data. As we have pointed out in the past, the lack of access to high-quality healthcare data represents a logjam that impedes machine learning research.

Several approaches—chiefly anonymization and deidentification—have been developed with the aim of rendering such datasets shareable without compromising privacy. Unfortunately, such approaches tend to be either highly disclosive or yield low-quality datasets due to the removal of too many fields.

Our lab has invested heavily in synthetic data research, which we see as the only way to  break the data logjam  in machine learning for healthcare. Using synthetic data approaches, a proximal version of the data can be shared that resembles real data, but contains no real samples for any specific individual.

As explained below, our research agenda has two sides: one exploring how synthetic data can be generated, and one seeking to establish standards and methods for evaluating synthetic datasets.

Towards a common “recipe” for synthetic data

Synthetic data has a broad range of potential uses. In healthcare, key applications include, but are not limited to:
– developing analytics (such as risk predictors or treatment effect estimators);
– facilitating reproducibility of clinical studies and analyses (due to the need to share the basis for such studies and analyses);
– augmenting small-sample datasets (such as for rare diseases or underrepresented patient subgroups; see RadialGAN);
– increasing the robustness and adaptability of machine learning models (for instance, transferring across hospitals); and
– simulating forward-looking data (including testing new policies).

All of these potential uses of synthetic data come with their own requirements and criteria for suitability. Naturally, there are also many different methods and approaches to generating synthetic data. To try to standardize this into a common framework, our lab has created a common “recipe” for synthetic data generation. This recipe comprises three steps, as outlined below.

Step 1: determine which generative model class to use

First, we must determine which generative modeling class to use. For instance, depending on the purpose of the synthetic dataset, we may use GANs, variational autoencoders, normalizing flows, or any number of other methods.

Step 2: construct an appropriate representation/structure for the type of data

Next, we need to construct an appropriate representation (for example, recurrent neural networks vs. convolutional neural networks, and so forth) for the type of data under consideration, which may vary from time-series datasets to images, notes, biomarkers, and beyond.

Step 3: incorporate required notions of privacy

Finally, it is important to ensure that the required type of privacy is incorporated, given the source dataset and the intended purpose of the synthetic dataset. This is relatively open to interpretation, since GDPR and HIPAA do not specify rigorous mathematical formulations of privacy requirements. This is why our lab proposed a new formalism for privacy using k-anonymity, based on GDPR and HIPAA, when working on ADS-GAN in 2020.
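To make the recipe concrete, the minimal sketch below represents the three choices as a configuration object and adds a generic k-anonymity-style check over quasi-identifiers. It is purely illustrative: it is not the ADS-GAN identifiability formalism, and all names (SynthesisConfig, satisfies_k_anonymity, the example columns) are assumptions.

```python
# Minimal sketch of the three-step "recipe" as a configuration object plus a
# generic k-anonymity-style check on quasi-identifiers. Illustrative only; not
# the ADS-GAN identifiability measure. All names here are assumptions.
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class SynthesisConfig:
    model_class: str = "GAN"             # Step 1: e.g. "GAN", "VAE", "normalizing_flow"
    representation: str = "RNN"          # Step 2: e.g. "RNN" for time series, "CNN" for images
    privacy_notion: str = "k-anonymity"  # Step 3: privacy requirement to enforce
    k: int = 5


def satisfies_k_anonymity(df: pd.DataFrame, quasi_identifiers: List[str], k: int) -> bool:
    """Return True if every combination of quasi-identifier values occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())


if __name__ == "__main__":
    config = SynthesisConfig(model_class="GAN", representation="RNN", k=3)
    synthetic = pd.DataFrame(
        {"age_band": ["60-70", "60-70", "60-70", "70-80", "70-80", "70-80"],
         "sex": ["F", "F", "F", "M", "M", "M"],
         "biomarker": [1.2, 0.8, 1.1, 0.9, 1.3, 1.0]}
    )
    print(satisfies_k_anonymity(synthetic, ["age_band", "sex"], config.k))  # True
```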

Evaluating synthetic data: the other side of the coin

As we hinted at the start of this post, generating synthetic data is only half of the challenge. We also need to be able to determine whether synthetic datasets are actually any good—and yet again, this is particularly challenging given the diversity of potential purposes for synthetic datasets.

A three-dimensional scale for evaluating synthetic data

In recent years, our lab has committed a great deal of time to exploring different approaches to evaluating synthetic datasets. One such project was our hide-and-seek privacy challenge, which ran as part of the NeurIPS 2020 competition track. Along the way, we have learned a number of important lessons—chief among which is the fact that a single-dimensional metric for evaluation is not enough.

What we need is to evaluate model performance as a point in a space that assesses the following three dimensions:
– fidelity (how “good” are the synthetic samples?);
– diversity (how much of the real data is covered, and how representative is this?); and
– generalization (how often does the model copy the training data?).

For this, we have developed new probabilistic, interpretable, and multidimensional quantities for assessing synthetic data. Further details can be found in a paper published in early 2021, or in Mihaela van der Schaar’s ICML 2021 tutorial on synthetic data.
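As a rough intuition for these three dimensions, the sketch below computes simple nearest-neighbour proxies for fidelity, diversity, and copying. These are illustrative stand-ins only, not the probabilistic metrics proposed in the paper above; the radius heuristic and all names are assumptions.

```python
# Simplified nearest-neighbour proxies for the three evaluation dimensions.
# Illustrative stand-ins, not the lab's published metrics.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def three_way_evaluation(real: np.ndarray, synthetic: np.ndarray) -> dict:
    # Typical spacing of the real data: distance from each real point to its
    # nearest *other* real point (k=2 because the closest neighbour is itself).
    real_spacing, _ = NearestNeighbors(n_neighbors=2).fit(real).kneighbors(real)
    radius = np.median(real_spacing[:, 1])

    # Fidelity proxy: fraction of synthetic points within one typical spacing
    # of some real point.
    d_syn_to_real, _ = NearestNeighbors(n_neighbors=1).fit(real).kneighbors(synthetic)
    fidelity = float(np.mean(d_syn_to_real[:, 0] <= radius))

    # Diversity proxy: fraction of real points that have a synthetic point
    # within the same radius (coverage of the real distribution).
    d_real_to_syn, _ = NearestNeighbors(n_neighbors=1).fit(synthetic).kneighbors(real)
    diversity = float(np.mean(d_real_to_syn[:, 0] <= radius))

    # Generalization proxy: fraction of synthetic points that are (near) exact
    # copies of a training record (lower is better).
    copying = float(np.mean(d_syn_to_real[:, 0] < 1e-8))

    return {"fidelity": fidelity, "diversity": diversity, "copying": copying}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))
    synthetic = rng.normal(size=(500, 4))
    print(three_way_evaluation(real, synthetic))
```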

New frontiers for synthetic data

We see synthetic data as an exciting, diverse, and highly promising area with many unexplored frontiers. Some of these are listed below.

Looking forward using synthetic data

As mentioned above, one particularly intriguing application of synthetic data is to simulate forward-looking data. We can, for example, create a simulation ecosystem that allows us to test a variety of new healthcare policies using synthetic data based on real observational datasets.

One noteworthy example of this is Medkit-Learn, a publicly available Python package providing simple and easy access to high-fidelity synthetic medical data, which we introduced at NeurIPS 2021. Medkit-Learn is more than “just” synthetic data: it offers a full benchmarking suite designed specifically for medical sequential decision modelling. It provides a standardized way to compare algorithms in a realistic medical setting, employing a generation process that disentangles the policy and environment dynamics. This allows for a range of customizations, thereby enabling systematic evaluation of algorithms’ robustness against specific challenges prevalent in healthcare.

The central object in Medkit is the scenario, made up of a domain, environment, and policy, which fully defines the synthetic setting. By disentangling the environment and policy dynamics, Medkit enables us to simulate decision-making behaviors with various tunable parameters. An example scenario: ICU patient trajectories with customized environment dynamics and a clinical policy. The output from Medkit is a batch dataset that can be used for training and evaluating methods for modelling human decision-making.
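The sketch below illustrates the idea of a scenario that separates environment dynamics from the acting policy. It is not the Medkit-Learn API (all classes and names here are hypothetical, and the dynamics are toy stand-ins); consult the Medkit-Learn documentation for the real interface.

```python
# Conceptual illustration of a "scenario" that separates environment dynamics
# from the acting policy. This is NOT the Medkit-Learn API; all names here are
# hypothetical and the dynamics are toy stand-ins.
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np


@dataclass
class Scenario:
    domain: str                                            # e.g. "ICU"
    environment: Callable[[np.ndarray, int], np.ndarray]   # state-transition dynamics
    policy: Callable[[np.ndarray], int]                    # maps observed state to an action


def rollout(scenario: Scenario, horizon: int, state_dim: int, seed: int = 0) -> List[Tuple[np.ndarray, int]]:
    """Generate one synthetic trajectory of (state, action) pairs."""
    rng = np.random.default_rng(seed)
    state = rng.normal(size=state_dim)
    trajectory = []
    for _ in range(horizon):
        action = scenario.policy(state)
        trajectory.append((state.copy(), action))
        state = scenario.environment(state, action)
    return trajectory


if __name__ == "__main__":
    # Toy linear dynamics and a threshold policy, standing in for learned models.
    icu = Scenario(
        domain="ICU",
        environment=lambda s, a: 0.9 * s + 0.1 * a + np.random.default_rng().normal(scale=0.05, size=s.shape),
        policy=lambda s: int(s.mean() > 0.0),
    )
    batch = [rollout(icu, horizon=24, state_dim=5, seed=i) for i in range(10)]
    print(len(batch), len(batch[0]))  # 10 trajectories of 24 steps each
```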

Turning unfair real-world data into fair synthetic data

A key concern of ours is the fairness of synthetic data. This is a particularly important problem since unfair data can lead to unfair downstream predictions. This is why our lab has been exploring approaches to creating fair synthetic data, which can be used to create fair predictive models.

This is a very challenging problem, since: – there are many different notions of fairness; – removing protected attributes (such as ethnicity) is generally insufficient; – fairness of the downstream model must be guaranteed at the data level; and – data utility must be preserved.

A prime example of our work to date is DECAF, which was first introduced in a paper published at NeurIPS 2021. DECAF aims to generate fair synthetic data using causally-aware generative networks, drawing on this causal perspective to provide an intuitive guideline for achieving different notions of fairness—with fairness guarantees given for the downstream setting. We have found DECAF to be a very effective approach to generating unbiased synthetic data.

In addition to the new frontiers described above, our lab is currently working on a range of other future directions, including: – synthetic multi-modal data (genetic, images, time-series, etc.); – generative models for asynchronous or sparse follow-up clinic visits; and – domain- and task-specific evaluation metrics.

If you’d like to learn more about our work on synthetic data, you can:
– revisit Mihaela van der Schaar’s invited talk (December 14) at the Deep Generative Models and Downstream Applications Workshop, held alongside NeurIPS 2021;
– read an overview of our work to date on synthetic data; and
– watch Mihaela van der Schaar’s ICML 2021 tutorial on synthetic data generation and assessment.

For a full list of the van der Schaar Lab’s publications, click here.

  • Mihaela van der Schaar

Mihaela van der Schaar is the John Humphrey Plummer Professor of Machine Learning, Artificial Intelligence and Medicine at the University of Cambridge and a Fellow at The Alan Turing Institute in London.

Mihaela has received numerous awards, including the Oon Prize on Preventative Medicine from the University of Cambridge (2018), a National Science Foundation CAREER Award (2004), 3 IBM Faculty Awards, the IBM Exploratory Stream Analytics Innovation Award, the Philips Make a Difference Award and several best paper awards, including the IEEE Darlington Award.

In 2019, she was identified by the National Endowment for Science, Technology and the Arts as the most-cited female AI researcher in the UK. She was also elected as a 2019 “Star in Computer Networking and Communications” by N²Women. Her research expertise spans signal and image processing, communication networks, network science, multimedia, game theory, distributed systems, machine learning, and AI.

Mihaela’s research focus is on machine learning, AI and operations research for healthcare and medicine.

Alex Chan graduated with a BSc in Statistics at University College London before moving to Cambridge for an MPhil in Machine Learning and Machine Intelligence.

Having started early in research, he won an EPSRC funding grant in the second year of his undergraduate degree for a project on Markov chain Monte Carlo mixing times, and earlier this year had his work on uncertainty calibration presented at ICML.

Much of Alex’s research will focus on understanding and building latent representations of human behavior, with a specific emphasis on understanding clinical decision-making (an important new area of focus for the lab’s research) through imitation, representation learning, and generative modeling. In Alex’s own words, replicating and understanding decision-making at a higher level is, in itself, incredibly interesting, but “also being able to apply it to healthcare is hugely important, and promises to actually make a difference to people’s lives in the near future.”

He is particularly interested in developing approximate Bayesian methods to appropriately handle the associated uncertainty that naturally arises in this setting and which is vital to understand.

Drawn to the lab’s special focus on healthcare, Alex notes that “No other area promises the same kind of potential for really having an impact with your research, and the lab benefits from the wide diversity of work being done alongside connections everywhere in both academia and industry.”

Alex’s studentship is sponsored by Microsoft Research.

Outside of machine learning, Alex captains and trains the novices of the Wolfson College Boat Club and occasionally keeps up with Krav Maga as a trainee instructor.

Nick Maxfield

From 2020 to 2022, Nick oversaw the van der Schaar Lab’s communications, including media relations, content creation, and maintenance of the lab’s online presence.

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Methods Guide for Effectiveness and Comparative Effectiveness Reviews [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2008-.

Quantitative Synthesis—An Update

Investigators: Sally C. Morton, Ph.D., M.Sc., M. Hassan Murad, M.D., M.P.H., Elizabeth O’Connor, Ph.D., Christopher S. Lee, Ph.D., R.N., Marika Booth, M.S., Benjamin W. Vandermeer, M.Sc., Jonathan M. Snowden, Ph.D., Kristen E. D’Anci, Ph.D., Rongwei Fu, Ph.D., Gerald Gartlehner, M.D., M.P.H., Zhen Wang, Ph.D., and Dale W. Steele, M.D., M.S.

Published: February 23, 2018.

Quantitative synthesis, or meta-analysis, is often essential for Comparative Effectiveness Reviews (CERs) to provide scientifically rigorous summary information. Quantitative synthesis should be conducted in a transparent and consistent way, with methodologies reported explicitly. This guide provides practical recommendations on conducting synthesis. The guide is not meant to be a textbook on meta-analysis, nor is it a comprehensive review of methods; rather, it is intended to provide a consistent approach for situations and decisions that are commonly faced by AHRQ Evidence-based Practice Centers (EPCs). The goal is to describe choices as explicitly as possible, and in the context of EPC requirements, with an appropriate degree of confidence.

This guide addresses issues in the order that they are usually encountered in a synthesis, though we acknowledge that the process is not always linear. We first consider the decision of whether or not to combine studies quantitatively. The next chapter addresses how to extract and utilize data from individual studies to construct effect sizes, followed by a chapter on statistical model choice. The fourth chapter considers quantifying and exploring heterogeneity. The fifth describes an indirect evidence technique that has not been included in previous guidance – network meta-analysis, also known as mixed treatment comparisons. The final section in the report lays out future research suggestions.

The Agency for Healthcare Research and Quality (AHRQ), through its Evidence-based Practice Centers (EPCs), sponsors the development of evidence reports and technology assessments to assist public- and private-sector organizations in their efforts to improve the quality of health care in the United States. The reports and assessments provide organizations with comprehensive, science-based information on common, costly medical conditions and new health care technologies and strategies. The EPCs systematically review the relevant scientific literature on topics assigned to them by AHRQ and conduct additional analyses when appropriate prior to developing their reports and assessments.

Strong methodological approaches to systematic review improve the transparency, consistency, and scientific rigor of these reports. Through a collaborative effort, the Effective Health Care (EHC) Program, the Agency for Healthcare Research and Quality (AHRQ), the EHC Program Scientific Resource Center, and the AHRQ Evidence-based Practice Centers have developed a Methods Guide for Comparative Effectiveness Reviews. This Guide presents issues key to the development of systematic reviews and describes recommended approaches for addressing difficult, frequently encountered methodological issues.

The Methods Guide for Comparative Effectiveness Reviews is a living document and will be updated as further empiric evidence develops and our understanding of better methods improves. We welcome comments on this Methods Guide paper. They may be sent by mail to the Task Order Officer named below at: Agency for Healthcare Research and Quality, 5600 Fishers Lane, Rockville, MD 20857, or by email to epc@ahrq.hhs.gov.

  • Gopal Khanna, M.B.A. Director Agency for Healthcare Research and Quality
  • Arlene S. Bierman, M.D., M.S. Director Center for Evidence and Practice Improvement Agency for Healthcare Research and Quality
  • Stephanie Chang, M.D., M.P.H. Director Evidence-based Practice Center Program Center for Evidence and Practice Improvement Agency for Healthcare Research and Quality
  • Elisabeth Kato, M.D., M.R.P. Task Order Officer Evidence-based Practice Center Program Center for Evidence and Practice Improvement Agency for Healthcare Research and Quality
  • Peer Reviewers

Prior to publication of the final evidence report, EPCs sought input from independent Peer Reviewers without financial conflicts of interest. However, the conclusions and synthesis of the scientific literature presented in this report do not necessarily represent the views of individual investigators.

Peer Reviewers must disclose any financial conflicts of interest greater than $10,000 and any other relevant business or professional conflicts of interest. Because of their unique clinical or content expertise, individuals with potential non-financial conflicts may be retained. The TOO and the EPC work to balance, manage, or mitigate any potential non-financial conflicts of interest identified.

  • Eric Bass, M.D., M.P.H Director, Johns Hopkins University Evidence-based Practice Center Professor of Medicine, and Health Policy and Management Johns Hopkins University Baltimore, MD
  • Mary Butler, M.B.A., Ph.D. Co-Director, Minnesota Evidence-based Practice Center Assistant Professor, Health Policy & Management University of Minnesota Minneapolis, MN
  • Roger Chou, M.D., FACP Director, Pacific Northwest Evidence-based Practice Center Portland, OR
  • Lisa Hartling, M.S., Ph.D. Director, University of Alberta Evidence-Practice Center Edmonton, AB
  • Susanne Hempel, Ph.D. Co-Director, Southern California Evidence-based Practice Center Professor, Pardee RAND Graduate School Senior Behavioral Scientist, RAND Corporation Santa Monica, CA
  • Robert L. Kane, M.D. * Co-Director, Minnesota Evidence-based Practice Center School of Public Health University of Minnesota Minneapolis, MN
  • Jennifer Lin, M.D., M.C.R. Director, Kaiser Permanente Research Affiliates Evidence-based Practice Center Investigator, The Center for Health Research, Kaiser Permanente Northwest Portland, OR
  • Christopher Schmid, Ph.D. Co-Director, Center for Evidence Synthesis in Health Professor of Biostatistics School of Public Health Brown University Providence, RI
  • Karen Schoelles, M.D., S.M., FACP Director, ECRI Evidence-based Practice Center Plymouth Meeting, PA
  • Tibor Schuster, Ph.D. Assistant Professor Department of Family Medicine McGill University Montreal, QC
  • Jonathan R. Treadwell, Ph.D. Associate Director, ECRI Institute Evidence-based Practice Center Plymouth Meeting, PA
  • Tom Trikalinos, M.D. Director, Brown Evidence-based Practice Center Director, Center for Evidence-based Medicine Associate Professor, Health Services, Policy & Practice Brown University Providence, RI
  • Meera Viswanathan, Ph.D. Director, RTI-UNC Evidence-based Practice Center Durham, NC RTI International Durham, NC
  • C. Michael White, Pharm. D., FCP, FCCP Professor and Head, Pharmacy Practice School of Pharmacy University of Connecticut Storrs, CT
  • Tim Wilt, M.D., M.P.H. Co-Director, Minnesota Evidence-based Practice Center Director, Minneapolis VA-Evidence Synthesis Program Professor of Medicine, University of Minnesota Staff Physician, Minneapolis VA Health Care System Minneapolis, MN

* Deceased March 6, 2017.

  • Introduction

The purpose of this document is to consolidate and update quantitative synthesis guidance provided in three previous methods guides. 1 – 3 We focus primarily on comparative effectiveness reviews (CERs), which are systematic reviews that compare the effectiveness and harms of alternative clinical options, and aim to help clinicians, policy makers, and patients make informed treatment choices. We focus on interventional studies and do not address diagnostic studies, individual patient level analysis, or observational studies, which are addressed elsewhere. 4

Quantitative synthesis, or meta-analysis, is often essential for CERs to provide scientifically rigorous summary information. Quantitative synthesis should be conducted in a transparent and consistent way with methodologies reported explicitly. This guide provides practical recommendations on conducting synthesis. The guide is not meant to be a textbook on meta-analysis nor is it a comprehensive review of methods, but rather it is intended to provide a consistent approach for situations and decisions that are commonly faced by Evidence-based Practice Centers (EPCs). The goal is to describe choices as explicitly as possible and in the context of EPC requirements, with an appropriate degree of confidence.

EPC investigators are encouraged to follow these recommendations but may choose to use alternative methods if deemed necessary after discussion with their AHRQ project officer. If alternative methods are used, investigators are required to provide a rationale for their choices, and if appropriate, to state the strengths and limitations of the chosen methods in order to promote consistency, transparency, and learning. In addition, several steps in meta-analysis require subjective judgment, such as when combining studies or incorporating indirect evidence. For each subjective decision, investigators should fully explain how the decision was reached.

This guide was developed by a workgroup comprising members from across the EPCs, as well as from the Scientific Resource Center (SRC) of the AHRQ Effective Healthcare Program. Through surveys and discussions among AHRQ, Directors of EPCs, the Scientific Resource Center, and the Methods Steering Committee, quantitative synthesis was identified as a high-priority methods topic, and a need was identified to update the original guidance. 1, 5 Once the Methods Workgroup was confirmed, the SRC solicited EPC workgroup volunteers, particularly those with quantitative methods expertise, including statisticians, librarians, thought leaders, and methodologists. Charged by AHRQ to update current guidance, the workgroup consisted of members from eight of 13 EPCs, the SRC, and AHRQ, and commenced in the fall of 2015. We conducted regular workgroup teleconference calls over the course of 14 months to discuss project direction and scope, assign and coordinate tasks, collect and analyze data, and discuss and edit draft documents. After constructing a draft table of contents, we surveyed all EPCs to ensure no topics of interest were missing.

The initial teleconference meeting was used to outline the draft, discuss the timeline, and agree upon a method for reaching consensus as described below. The larger workgroup then was split into subgroups each taking responsibility for a different chapter. The larger group participated in biweekly discussions via teleconference and email communication. Subgroups communicated separately (in addition to the larger meetings) to coordinate tasks, discuss the literature review results, and draft their respective chapters. Later, chapter drafts were combined into a larger document for workgroup review and discussion on the bi-weekly calls.

Literature Search and Review

A medical research librarian worked with each subgroup to identify a relevant search strategy for each chapter, and then combined these strategies into one overall search conducted for all chapters combined. The librarian conducted the search on the AHRQ SRC Methods Library, a bibliographic database curated by the SRC currently containing more than 16,000 citations of methodological works for systematic reviews and comparative effectiveness reviews, using descriptor and keyword strategies to identify quantitative synthesis methods research publications (descriptor search=all quantitative synthesis descriptors, and the keyword search=quantitative synthesis, meta-anal*, metaanal*, meta-regression in [anywhere field]). Search results were limited to English language and 2009 and later to capture citations published since AHRQ’s previous methods guidance on quantitative synthesis. Additional articles were identified from recent systematic reviews, reference lists of reviews and editorials, and through the expert review process.

The search yielded 1,358 titles and abstracts, which were reviewed by all workgroup members using ABSTRACKR software (available at http://abstrackr.cebm.brown.edu). Each subgroup separately identified articles relevant to its own chapter. Abstract review was done by a single reviewer; investigators included anything that could be potentially relevant. Each subgroup decided separately on final inclusion/exclusion based on full-text articles.

Consensus and Recommendations

Reaching consensus, where possible, is of great importance for AHRQ methods guidance. The workgroup recognized this importance in its first meeting and agreed on a process for informal consensus and conflict resolution. Disagreements were thoroughly discussed and, where possible, consensus was reached. If consensus was not reached, analytic options are discussed in the text. We did not employ a formal voting procedure to assess consensus.

A summary of the workgroup’s key conclusions and recommendations was circulated for comment by EPC Directors and AHRQ officers at a biannual EPC Director’s meeting in October 2016. In addition, a full draft was circulated to EPC Directors and AHRQ officers prior to peer review, and the manuscript was made available for public review. All comments have been considered by the team in the final preparation of this report.

Chapter 1. Decision to Combine Trials

1.1. Goals of the Meta-Analysis

Meta-analysis is a statistical method for synthesizing (also called combining or pooling) the benefits and/or harms of a treatment or intervention across multiple studies. The overarching goal of a meta-analysis is generally to provide the best estimate of the effect of an intervention. As part of that aspirational goal, results of a meta-analysis may inform a number of related questions, such as whether that best estimate represents something other than a null effect (is this intervention beneficial?), the range in which the true effect likely lies, whether it is appropriate to provide a single best estimate, and what study-level characteristics may influence the effect estimate. Before tackling these questions, it is necessary to answer a preliminary but fundamental question: Is it appropriate to pool the results of the identified studies? 6

Clinical, methodological, and statistical factors must all be considered when deciding whether to combine studies in a meta-analysis. Figure 1.1 depicts a decision tree to help investigators think through these important considerations, which are discussed below.

Figure 1.1. Pooling decision tree.

1.2. Clinical and Methodological Heterogeneity

Studies must be reasonably similar to be pooled in a meta-analysis. 1 Even when the review protocol identifies a coherent and fairly narrow body of literature, the actual included studies may represent a wide range of population, intervention, and study characteristics. Variations in these factors are referred to as clinical heterogeneity and methodological heterogeneity. 7 , 8 A third form of heterogeneity, statistical heterogeneity, will be discussed later.

The first step in the decision tree is to explore the clinical and methodological heterogeneity of the included studies (Step A, Figure 1.1 ). The goal is to identify groups of trials that are similar enough that an average effect would make a sensible summary. There is no objective measure or universally accepted standard for deciding whether studies are “similar enough” to pool; this decision is inherently a matter of judgment. 6 Verbeek and colleagues suggest working through key sources of variability in sequence, beginning with the clinical variables of intervention/exposure, control condition, and participants, before moving on to methodological areas such as study design, outcome, and follow-up time. When there is important variability in these areas, investigators should consider whether there are coherent subgroups of trials, rather than the full group, that can be pooled. 6

Clinical heterogeneity refers to characteristics related to the participants, interventions, types of outcomes, and study setting. Some have suggested that pooling may be acceptable when it is plausible that the underlying effects could be similar across subpopulations and variations in interventions and outcomes. 9 For example, in a review of a lipid-lowering medication, researchers might be comfortable combining studies that target younger and middle-aged adults, but expect different effects with older adults, who have high rates of comorbidities and other medication use. Others suggest that it may be acceptable to combine interventions with likely similar mechanisms of action. 6 For example, a researcher may combine studies of depression interventions that use a range of psychotherapeutic approaches, on the logic that they all aim to change a person’s thinking and behavior in order to improve mood, but not want to combine them with trials of antidepressants, whose mechanism of action is presumed to be biochemical.

Methodological heterogeneity refers to variations in study methods (e.g., study design, measures, and study conduct). A common question regarding study design is whether it is acceptable to combine studies that randomize individual participants with those that randomize clusters (e.g., when clinics, clinicians, or classrooms are randomized and individuals are nested within these units). We believe this is generally acceptable, with appropriate adjustment for cluster randomization as needed. 10 However, closer examination may show that the cluster randomized trials also tend to systematically differ on population or intervention characteristics from the individually-randomized trials. If so, subgroup analyses may be considered.
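As a minimal sketch of one common approximate adjustment (not a recommendation specific to this guidance), the example below shrinks a cluster-randomized arm's event count and sample size by the design effect, 1 + (m − 1) × ICC, before the data are combined with individually randomized trials; the cluster size and ICC are illustrative assumptions.

```python
# Sketch of the design-effect adjustment for a cluster-randomized trial arm
# before meta-analysis. The ICC and cluster size below are assumed values for
# illustration only.

def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Design effect = 1 + (m - 1) * ICC, with m the average cluster size."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def effective_counts(events: int, n: int, mean_cluster_size: float, icc: float):
    """Scale events and sample size down by the design effect."""
    de = design_effect(mean_cluster_size, icc)
    return events / de, n / de


if __name__ == "__main__":
    # A hypothetical cluster-randomized arm: 120 events among 800 patients in
    # clusters averaging 20 patients, with an assumed ICC of 0.02.
    eff_events, eff_n = effective_counts(events=120, n=800, mean_cluster_size=20, icc=0.02)
    print(f"design effect = {design_effect(20, 0.02):.2f}")
    print(f"effective events = {eff_events:.1f}, effective n = {eff_n:.1f}")
```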

Outcome measures are a common source of methodological heterogeneity. First, trials may have a wide array of specific instruments and cut-points for a common outcome. For example, a review considering pooling the binary outcome of depression prevalence may find measures that range from a depression diagnosis based on a clinical interview to scores above a cut-point on a screening instrument. One guiding principle is to consider pooling only when it is plausible that the underlying relative effects are consistent across specific definitions of an outcome. In addition, investigators should take steps to harmonize outcomes to the extent possible.

Second, there is also typically substantial variability in the statistics reported across studies (e.g., odds ratios, relative risks, hazard ratios, baseline and mean followup scores, change scores for each condition, between-group differences at followup, etc.). Methods to calculate or estimate missing statistics are available; 5 however, the investigators must ultimately weigh the tradeoff of potentially less accurate results (due to assumptions required to estimate missing data) against the potential advantage of pooling a more complete set of studies. If a substantial proportion of the studies require calculations that involve assumptions or estimates (rather than straightforward calculations) in order to combine them, then it may be preferable to show results in a table or forest plot without a pooled estimate.

1.3. Best Evidence Versus All Evidence

Sometimes the body of evidence comprises a single trial or small number of trials that clearly represent the best evidence, along with a number of additional trials that are much smaller or with other important limitations (Step B, Figure 1.1 ). The “best evidence” trials are generally very large trials with low risk of bias and with good generalizability to the population of interest. In this case, it may be appropriate to focus on the one or few “best” trials rather than combining them with the rest of the evidence, particularly when addressing rare events that small studies are underpowered to examine. 11 , 12 For example, an evidence base of one large, multi-center trial of an intervention to prevent stroke in patients with heart disease could be preferable to a pooled analysis of 4-5 small trials reporting few events, and combining the small trials with the large trial may introduce unnecessary uncertainty to the pooled estimate.

1.4. Assessing the Risk of Misleading Meta-analysis Results

Next, reviews should explore the risk that the meta-analysis will show results that do not accurately capture the true underlying effect (Step C, Figure 1.1 ). Tables, forest plots (without pooling), and some other preliminary statistical tests are useful tools for this stage. Several patterns can arise that should lead investigators to be cautious about combining studies.

Wide-Ranging Effect Sizes

Sometimes one study may show a large benefit and another study of the same intervention may show a small benefit. This may be due to random error, especially when the studies are small. However, this situation also raises the possibility that observed effects truly are widely variable in different subpopulations or situations. Another look at the population characteristics is warranted in this situation to see if the investigators can identify characteristics that are correlated with effect size and direction, potentially explaining clinical heterogeneity.

Even if no characteristic can be identified that explains why the intervention had such widely disparate effects, there could be unmeasured features that explain the difference. If the intervention really does have widely variable impact in different subpopulations, particularly if it is benefiting some patients and harming others, it would be misleading to report a single average effect.

Suspicion of Publication or Reporting Bias

Sometimes, due to lack of effect, trial results are never published (risking publication bias), or are only published in part (risking reporting bias). These missing results can introduce bias and reduce the precision of meta-analysis. 13 Investigators can explore the risk of reporting bias by comparing trials that do and do not report important outcomes to assess whether outcomes appear to be missing at random. 13 For example, investigators may have 30 trials of weight loss interventions with only 10 reporting blood pressure, which is considered an important outcome for the review. This pattern of results may indicate reporting bias as trials finding group differences in blood pressure were more likely to report blood pressure findings. On the other hand, perhaps most of the studies limited to patients with elevated cardiovascular disease (CVD) risk factors did report blood pressure. In this case, the investigators may decide to combine the studies reporting blood pressure that were conducted in high CVD risk populations. However, investigators should be clear about the applicable subpopulation. An examination of the clinical and methodological features of the subset of trials where blood pressure was reported is necessary to make an informed judgement about whether to conduct a meta-analysis.

Small Studies Effect

If small studies show larger effects than large studies, the pooled results may overestimate the true effect size, possibly due to publication or reporting bias. 14 When investigators have at least 10 trials to combine they should examine small studies effects using standard statistical tests such as the Egger test. 15 If there appears to be a small studies effect, the investigators may decide not to report pooled results since they could be misleading. On the other hand, small studies effects could be happening for other reasons, such as differences in sample characteristics, attrition, or assessment methods. These factors do not suggest bias, but should be explored to the degree possible. See Chapter 4 for more information about exploring heterogeneity.
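As a minimal illustration of such a check (assuming log-scale effect estimates and standard errors have already been extracted; the variable names and data are illustrative), Egger's regression test can be run in a few lines:

```python
# Sketch of Egger's regression test for small-study effects, assuming
# log-scale effect estimates and their standard errors have been extracted.
import numpy as np
import statsmodels.api as sm


def egger_test(effects, std_errors):
    """Regress the standardized effect (effect/SE) on precision (1/SE).

    A non-zero intercept suggests funnel-plot asymmetry (small-study effects).
    Returns the intercept estimate and its two-sided p-value.
    """
    effects = np.asarray(effects, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    z = effects / std_errors
    precision = 1.0 / std_errors
    X = sm.add_constant(precision)
    fit = sm.OLS(z, X).fit()
    return fit.params[0], fit.pvalues[0]


if __name__ == "__main__":
    # Illustrative log odds ratios and standard errors from 12 hypothetical trials.
    log_or = [0.41, 0.35, 0.52, 0.18, 0.60, 0.25, 0.47, 0.30, 0.55, 0.22, 0.38, 0.44]
    se = [0.30, 0.25, 0.35, 0.12, 0.40, 0.15, 0.33, 0.18, 0.38, 0.14, 0.27, 0.31]
    intercept, p_value = egger_test(log_or, se)
    print(f"Egger intercept = {intercept:.2f}, p = {p_value:.3f}")
```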

1.5. Special Considerations When Pooling a Small Number of Studies

When pooling a small number of studies (e.g., <10 studies), a number of considerations arise (Step E, Figure 1.1 ):

Rare Outcomes

Meta-analyses of rare binary outcomes are frequently underpowered, and tend to overestimate the true effect size, so pooling should be undertaken with caution. 11 A small difference in absolute numbers of events can result in large relative differences, usually with low precision (i.e., wide confidence intervals). This could result in misleading effect estimates if the analysis is limited to trials that are underpowered for the rare outcomes. 12 One example is all-cause mortality, which is frequently provided as part of the participant flow results, but may not be a primary outcome, may not have adjudication methods described, and typically occurs very rarely. Studies are often underpowered to detect differences in mortality if it is not a primary outcome. Investigators should consider calculating an optimal information size (OIS) when events are rare to see if the combined group of studies has sufficient power to detect group differences. This could be a concern even for a relatively large number of studies, if the total sample size is not very large. 16 See Chapter 3 for more detail on handling rare binary outcomes.
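The sketch below shows one way an optimal information size might be approximated, using a standard two-proportion power calculation; the event rates, alpha, and power are illustrative assumptions, not recommended values.

```python
# Illustrative optimal information size (OIS) approximation: the per-arm
# sample size a single adequately powered trial would need to detect an
# assumed difference in a rare binary outcome. All parameters are assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

control_rate = 0.02      # assumed control-group event rate (rare outcome)
treatment_rate = 0.015   # assumed event rate under a 25% relative risk reduction

effect_size = proportion_effectsize(control_rate, treatment_rate)
n_per_arm = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.80, ratio=1.0)

print(f"Required sample size per arm: {n_per_arm:.0f}")
print(f"Total information size: {2 * n_per_arm:.0f}")
# If the combined sample size of the pooled trials falls well short of this
# total, the meta-analysis is underpowered for the rare outcome.
```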

Small Sample Sizes

When pooling a relatively small number of studies, pooling should be undertaken with caution if the body of evidence is limited only to small studies. Results from small trials are less likely to be reliable than results of large trials, even when the risk of bias is low. 17 First, in small trials it is difficult to balance the proportion of patients in potentially important subgroups across interventions, and a difference between interventions of just a few patients in a subgroup can result in a large proportional difference between interventions. Characteristics that are rare are particularly at risk of being unbalanced in trials with small samples. In such situations there is no way to know if trial effects are due to the intervention or to differences in the intervention groups. In addition, patients are generally drawn from a narrower geographic range in small trials, making replication in other trials more uncertain. Finally, although it is not always the case, large trials are more likely to involve a level of scrutiny and standardization to ensure lower risk of bias than are small trials. Therefore, when the trials have small sample sizes, pooled effects are less likely to reflect the true effects of the intervention. In this case, the required or optimal information size can help the investigators determine whether the sample size is sufficient to conclude that results are likely to be stable and not due to random heterogeneity (i.e., truly significant or truly null results; not a type I or type II error). 16 , 18 An option in this case would be to pool the studies and acknowledge imprecision or other limitations when rating the strength of evidence.

What would be considered a “small” trial varies for different fields and outcomes. For addressing an outcome that only happens in 10% of the population, a small trial might be 100 to 200 per intervention arm, whereas a trial addressing a continuous quality of life measure may be small with 20 to 30 per intervention. Looking carefully at what the studies were powered to detect and the credibility of the power calculations may help determine what constitutes a “small” trial. Investigators should also consider how variable the impact of an intervention may be over different settings and subpopulations when determining how to weigh the importance of small studies. For example, the effects of a counseling intervention that relies on patients to change their behavior in order to reap health benefits may be more strongly influenced by characteristics of the patients and setting than a mechanical or chemical agent.

When the number of trials to be pooled is small, there is a heightened risk that statistical heterogeneity will be substantially underestimated, resulting in 95% confidence intervals that are inappropriately narrow and do not have 95% coverage. This is especially concerning when the number of studies being pooled is fewer than five to seven. 19 – 21

Accounting for these factors should guide an evaluation of whether it is advisable to pool the relatively small group of studies. As with many steps in the multi-stage decision to pool, the conclusion that a given investigator arrives at is subjective, although such evaluations should be guided by the criteria above. If consideration of these factors reassures investigators that the risk of bias associated with pooling is sufficiently low, then pooling can proceed. The next step of pooling, whether for a small, moderate, or large body of studies, is to consider statistical heterogeneity.

1.6. Statistical Heterogeneity

Once clinical and methodological heterogeneity and other factors described above have been deemed acceptable for pooling, investigators should next consider statistical heterogeneity (Step F, Figure 1.1). We discuss statistical heterogeneity in general in this chapter, and provide a deeper methodological discussion in Chapter 4. This initial consideration of statistical heterogeneity is accomplished by conducting a preliminary meta-analysis. Next, the investigator must decide whether the results of the meta-analysis are valid and should be presented, rather than simply showing tables or forest plots without pooled results. If statistical heterogeneity is very high, the investigators may question whether an “average” effect is really meaningful or useful. If there is a reasonably large number of trials, the investigators may shift to exploring effect modification in the presence of high heterogeneity; however, this may not be possible if few trials are available. While many would likely agree that pooling (or reporting pooled results) should be avoided when there are few studies and statistical heterogeneity is high, what constitutes “few” studies and “high” heterogeneity is a matter of judgment.

While there are a variety of methods for characterizing statistical heterogeneity, one common method is the I² statistic, the proportion of total variance in the pooled trials that is due to inter-study variance, as opposed to random variation. 22 The Cochrane manual 10 proposes ranges for interpreting I²: statistical heterogeneity associated with I² values of 0-40% might not be important, 30-60% may represent moderate heterogeneity, 50-90% may represent substantial heterogeneity, and 75-100% is considerable heterogeneity. Ranges overlap to reflect that other factors—such as the number and size of the trials and the magnitude and direction of the effect—must be taken into consideration. Other measures of statistical heterogeneity include Cochran’s Q and τ², but these heterogeneity statistics do not have intrinsic standardized scales that allow specific values to be characterized as “small,” “medium,” or “large” in any meaningful way. 23 However, τ² can be interpreted on the scale of the pooled effect, as the variance of the true effect. All these measures are discussed in more detail in Chapter 4.

Although widely used in quantitative synthesis, the I² statistic has come under criticism in recent years. One important issue with I² is that it can be an inaccurate reflection of statistical heterogeneity when there are few studies to pool and high statistical heterogeneity. 24, 25 For example, in random effects models (but not fixed effects models), calculations demonstrate that I² tends to underestimate true statistical heterogeneity when there are fewer than about 10 studies and the I² is 50% or more. 26 In addition, I² is correlated with the sample size of the included studies, generally increasing with larger samples. 27 Complicating this, meta-analyses of continuous measures tend to have higher heterogeneity than those of binary outcomes, and I² tends to increase as the number of studies increases when analyzing continuous outcomes, but not binary outcomes. 28, 29 This has prompted some authors to suggest that different standards may be considered for interpreting I² for meta-analyses of continuous and binary outcomes, but I² should only be considered reliable when there are a sufficient number of studies. 29 Unfortunately, there is not clear consensus regarding what constitutes a sufficient number of studies for a given amount of statistical heterogeneity, nor is it possible to be entirely prescriptive, given the limits of I² as a measure of heterogeneity. Thus, I² is one piece of information that should be considered, but generally should not be the primary deciding factor for whether to pool.
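For concreteness, a minimal sketch of DerSimonian-Laird random-effects pooling that reports Cochran's Q, τ², and I² is shown below; the log odds ratios and variances are purely illustrative, and this is a generic textbook implementation rather than EPC-specific guidance.

```python
# Sketch of a DerSimonian-Laird random-effects meta-analysis reporting
# Cochran's Q, tau-squared, and I-squared. Inputs are assumed to be log-scale
# effect estimates (e.g., log odds ratios) with their variances.
import numpy as np


def random_effects_pool(effects, variances):
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Fixed-effect (inverse-variance) pooling, needed to compute Q.
    w = 1.0 / variances
    fixed = np.sum(w * effects) / np.sum(w)

    # Cochran's Q and the DerSimonian-Laird estimate of between-study variance.
    Q = np.sum(w * (effects - fixed) ** 2)
    df = len(effects) - 1
    C = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / C)

    # I-squared: proportion of total variability attributable to heterogeneity.
    I2 = max(0.0, 100.0 * (Q - df) / Q) if Q > 0 else 0.0

    # Random-effects pooled estimate and 95% CI (log scale).
    w_star = 1.0 / (variances + tau2)
    pooled = np.sum(w_star * effects) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    return {"pooled": pooled, "ci": ci, "Q": Q, "tau2": tau2, "I2": I2}


if __name__ == "__main__":
    log_or = [0.30, 0.10, 0.45, 0.20, 0.05]   # hypothetical log odds ratios
    var = [0.04, 0.02, 0.06, 0.03, 0.02]      # their variances
    print(random_effects_pool(log_or, var))
```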

1.7. Conclusion

In the end, the decision to pool boils down to the question: will the results of a meta-analysis help you find a scientifically valid answer to a meaningful question? That is, will the meta-analysis provide something in addition to what can be understood from looking at the studies individually? Further, do the clinical, methodological, and statistical features of the body of studies permit them to be quantitatively combined and summarized in a valid fashion? Each of these decisions can be broken down into specific considerations (outlined in Figure 1.1). There is broad guidance to inform investigators in making each of these decisions, but generally the choices involved are subjective. The investigators’ scientific goal might factor into the evaluation of these considerations: for example, if investigators seek a general summary of the combined effect (e.g., direction only) versus an estimated effect size, the consideration of whether to pool may be weighed differently. In the end, to provide a meaningful result, the trials must be similar enough in content, procedures, and implementation to represent a cohesive group that is relevant to real practice/decision-making.

Recommendations

  • Use Figure 1.1 when deciding whether to pool studies

Chapter 2. Optimizing Use of Effect Size Data

2.1. Introduction

The employed methods for meta-analysis will depend upon the nature of the outcome data. The two most common data types encountered in trials are binary/dichotomous (e.g., dead or alive, patient admitted to hospital or not, treatment failure or success, etc.) and continuous (e.g., weight, systolic blood pressure, etc.). Some outcomes (e.g., heart rate, counts of common events) that are not strictly continuous are often treated as continuous for the purposes of meta-analysis, based on assumptions of normality and the belief that statistical methods that apply to normal distributions can be applicable to other distributions (the central limit theorem). Continuous outcomes are also frequently analyzed as binary outcomes when there are clinically meaningful cut-points or thresholds (e.g., a patient’s systolic blood pressure may be classified as low or high based on whether it is under or over 130 mmHg). While this type of dichotomization may be more clinically meaningful, it reduces statistical information, so investigators should provide their rationale for taking this approach.

Other less common data types that do not fit into either the binary or continuous categories include ordinal, categorical, rate, and time-to-event data. Meta-analyzing these types of data will usually require reporting of the relevant statistics (e.g., hazard ratio, proportional odds ratio, incidence rate ratio) by the study authors.

2.2. Nuances of Binary Effect Sizes

Data needed for binary effect size computation.

Under ideal circumstances, the minimal data necessary for the computation of effect sizes of binary data would be available in published trial documents or from original sources. Specifically, risk difference (RD), relative risk (RR), and odds ratios (OR) can be computed when the number of events (technically the number of cases in whom there was an event) and sample sizes are known for treatment and control groups. A schematic of one common approach to assembling binary data from trials for effect size computation is presented in Table 2.1 . This approach will facilitate conversion to analysis using commercially-available software such as Stata (College Station, TX) or Comprehensive Meta-Analysis (Englewood, NJ).

Table 2.1. Assembling binary data for effect size computation.

Assembling binary data for effect size computation.

In many instances, a single study (or subset of studies) to be included in the meta-analysis provides only one measure of association (an odds ratio, for example), and the sample sizes and event counts are not available. In that case, the meta-analytic effect size will be dictated by the available data. However, choosing the appropriate effect size is important for integrity and transparency, and every effort should be made to obtain all the data presented in Table 2.1. Note that CONSORT guidance requires that published trial reports include the number of events and sample sizes for both treatment and control groups. 30 PRISMA guidance likewise supports describing any processes for obtaining and confirming data from investigators 31 – a frequently required step.

When data are available only as an effect size in the original reports, it is important to extract both the point estimate and the associated 95% confidence interval. Having raw event data available as in Table 2.1 not only facilitates the computation of various effect sizes, but also allows for the application of either binomial (preferred) or normal likelihood approaches; 32 only the normal likelihood can be applied to summary statistics (e.g., an odds ratio and confidence interval in the primary study report).

Choosing Among Effect Size Options

One absolute measure and two relative measures are commonly used in meta-analyses involving binary data. The RD (an absolute measure) is a simple metric that is easily understood by clinicians, patients, and other stakeholders. The relative measures, RR or OR, are also used frequently. All three metrics should be considered additive, just on different scales. That is, RD is additive on a raw scale, RR on a log scale, and OR on a logit scale.

Risk Difference

The RD is easily understood by clinicians and patients alike, and is therefore most useful to aid decision making. However, the RD tends to be less consistent across studies than relative measures of effect size (RR and OR). Hence, the RD may be a preferred measure in meta-analyses when events are relatively common in the control groups and event proportions are similar across studies. When events are rare and/or when event rates differ across studies, however, the RD is not the preferred effect size for meta-analysis, because combined estimates based on the RD in such instances have more conservative confidence intervals and lower statistical power. The calculation of the RD and other effect size metrics from binary trial data can be performed using the labeling in Table 2.2.

Table 2.2. Organizing binary data for effect size computation.

Organizing binary data for effect size computation.

Equation Set 2.1. Risk Difference

  • RD = risk difference
  • V RD = variance of the risk difference
  • SE RD = standard error of the risk difference
  • LL RD = lower limit of the 95% confidence interval of the risk difference
  • UL RD = upper limit of the 95% confidence interval of the risk difference
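
For reference, a minimal sketch of the standard large-sample formulas, assuming Table 2.2 uses the conventional labeling in which A and B are the numbers of events and non-events in the treatment group, C and D are the numbers of events and non-events in the control group, $n_1 = A + B$, $n_2 = C + D$, and $N = n_1 + n_2$ (a labeling consistent with the Peto OR formula in Equation Set 2.4):

$$RD = \frac{A}{n_1} - \frac{C}{n_2}, \qquad V_{RD} = \frac{AB}{n_1^{3}} + \frac{CD}{n_2^{3}}, \qquad SE_{RD} = \sqrt{V_{RD}},$$
$$LL_{RD} = RD - 1.96\,SE_{RD}, \qquad UL_{RD} = RD + 1.96\,SE_{RD}.$$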

Number Needed To Treat Related to Risk Difference

  • NNT = number needed to treat

In the case of a negative RD, the number needed to harm (NNH), i.e., the number needed to treat for one patient to be harmed, is −1/RD.
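
As an illustrative calculation with hypothetical numbers: NNT = 1/RD, so an RD of 0.10 (a 10 percentage-point absolute risk reduction) gives NNT = 1/0.10 = 10, meaning roughly 10 patients must be treated for one additional patient to benefit. Confidence limits for the NNT are obtained by inverting the limits of the RD; for example, an RD interval of 0.05 to 0.20 corresponds to an NNT interval of 5 to 20.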

The Wald method 34 is commonly used to calculate confidence intervals for the NNT. It is reasonably adequate for large samples and probabilities not close to either 0 or 1; however, it can be less reliable for small samples, probabilities close to 0 or 1, or unbalanced trial designs. 35 An adjustment to the Wald method (i.e., adding pseudo-observations) helps mitigate concern about its application in small samples, 36 but it does not address the method’s other limitations. The Wilson method of calculating confidence intervals for the NNT, as described in detail by Newcombe, 37 has better coverage properties irrespective of sample size, is free of implausible results, and is argued to be easier to calculate than Wald confidence intervals. 35 Therefore, the Wilson method is preferable to the Wald method for calculating confidence intervals for the NNT. When considering using the NNT as the effect size in a meta-analysis, see the commentary by Lesaffre and Pledger 38 on the superior performance of combining on the RD scale and then converting to the NNT, as opposed to combining on the NNT scale directly.

It is important to note that the RR and OR are effectively equivalent for event rates below about 10%. In such cases, the RR is chosen over the OR simply for interpretability (an important consideration), not because of substantive differences. A potential drawback of the RR relative to the OR (or RD) is that the RR of an event is not the reciprocal of the RR for the non-occurrence of that event (e.g., using survival as the outcome instead of death). In contrast, switching between events and non-events is reciprocal in the OR metric and entails only a change in the sign of the log OR (i.e., the OR is inverted). If switching between death and survival, for example, is central to the meta-analysis, then the RR is likely not the binary effect size metric of choice unless all raw data are available and re-computation is possible.

The calculation of RR using binary data can be performed considering the labeling listed in Table 2.2 . Of particular note, the metrics of dispersion related to the RR are first computed in a natural log metric and then converted to the metric of RR.

Equation Set 2.2. Risk Ratio

  • RR = risk ratio
  • ln RR = natural log of the risk ratio
  • V lnRR = variance of the natural log of the risk ratio
  • SE lnRR = standard error of the natural log of the risk ratio
  • LL lnRR = lower limit of the 95% confidence interval of the natural log of the risk ratio
  • UL lnRR = upper limit of the 95% confidence interval of the natural log of the risk ratio
  • LL RR = lower limit of the 95% confidence interval of the risk ratio
  • UL RR = upper limit of the 95% confidence interval of the risk ratio
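
Using the same Table 2.2 labeling assumed above, the standard large-sample formulas are:

$$RR = \frac{A/n_1}{C/n_2}, \qquad \ln RR = \ln(RR), \qquad V_{\ln RR} = \frac{1}{A} - \frac{1}{n_1} + \frac{1}{C} - \frac{1}{n_2},$$
$$SE_{\ln RR} = \sqrt{V_{\ln RR}}, \qquad LL_{\ln RR} = \ln RR - 1.96\,SE_{\ln RR}, \qquad UL_{\ln RR} = \ln RR + 1.96\,SE_{\ln RR},$$
$$LL_{RR} = \exp(LL_{\ln RR}), \qquad UL_{RR} = \exp(UL_{\ln RR}).$$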

Therefore, while the definition of the outcome event needs to be consistent among the included studies when using any measure, the investigators should be particularly attentive to the definition of an outcome event when using an RR.

Odds Ratios

An alternative relative metric for binary data is the OR. Given that ORs are frequently presented in models with covariates, it is important to note that the OR is ‘non-collapsible’: the marginal OR and the covariate-adjusted (conditional) OR differ even in the absence of confounding, so the estimate depends on which covariates have been controlled for. This favors the reporting of RR over OR, particularly when outcomes are common and covariates are included. 39 The calculation of the OR using binary data can be performed using the labeling listed in Table 2.2. As with the RR, the metrics of dispersion related to the OR are first computed on the natural log scale and then converted to the OR metric.

Equation Set 2.3. Odds ratios

  • OR = odds ratio
  • ln OR = natural log of the odds ratio
  • V lnOR = variance of the natural log of the odds ratio
  • SE lnOR = standard error of the natural log of the odds ratio
  • LL lnOR = lower limit of the 95% confidence interval of the natural log of the odds ratio
  • UL lnOR = upper limit of the 95% confidence interval of the natural log of the odds ratio
  • LL OR = lower limit of the 95% confidence interval of the odds ratio
  • UL OR = upper limit of the 95% confidence interval of the odds ratio
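
Using the same Table 2.2 labeling assumed above, the standard large-sample formulas are:

$$OR = \frac{AD}{BC}, \qquad \ln OR = \ln(OR), \qquad V_{\ln OR} = \frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D},$$
$$SE_{\ln OR} = \sqrt{V_{\ln OR}}, \qquad LL_{\ln OR} = \ln OR - 1.96\,SE_{\ln OR}, \qquad UL_{\ln OR} = \ln OR + 1.96\,SE_{\ln OR},$$
$$LL_{OR} = \exp(LL_{\ln OR}), \qquad UL_{OR} = \exp(UL_{\ln OR}).$$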

A variation on the calculation of the OR is the Peto OR, commonly referred to as the assumption-free method of calculating the OR. The two key differences between the standard OR and the Peto OR are that the latter takes into consideration the expected number of events in the treatment group and incorporates a hypergeometric variance. Because of these differences, the Peto OR is preferred for binary studies with rare events, especially when event rates are less than 1%. In contrast, the Peto OR is biased when treatment effects are large, because the approximation centers on the null hypothesis, and when treatment and control groups are imbalanced. 40

Equation Set 2.4. Peto odds ratios

$$OR_{Peto} = \exp\!\left[ \frac{A - E(A)}{v} \right],$$ where $E(A) = \dfrac{n_1 (A + C)}{N}$ is the expected number of events in the treatment group and $v = \dfrac{n_1 n_2 (A + C)(B + D)}{N^2 (N - 1)}$ is the hypergeometric variance.

There is no perfect effect size for binary data; each has benefits and disadvantages. Criteria used to compare and contrast these measures include consistency over a set of studies, statistical properties, and interpretability. Key benefits and disadvantages of each are presented in Table 2.3. In the table, the term “baseline risk” refers to the proportion of subjects in the control group who experienced the event; the term “control rate” is sometimes used for this measure as well.

Table 2.3. Benefits and disadvantages of binary data effect sizes.

Benefits and disadvantages of binary data effect sizes.

Time-to-Event and Count Outcomes

For time-to-event data, the effect size measure is the hazard ratio (HR), which is commonly estimated from the Cox proportional hazards model. In the best-case scenario, the HR and associated 95% confidence interval are available from all studies, the time horizon is similar across studies, and there is evidence that the proportional hazards assumption was met in each study to be included in the meta-analysis. When these conditions are not met, an HR and associated dispersion can still be extracted and meta-analyzed. However, this approach raises concerns about reproducibility due to observer variation. 44

The incidence rate ratio (IRR) is used for count data and can be estimated from a Poisson or negative binomial regression model. The IRR is a relative metric based on counts of events (e.g., number of hospitalizations or hospital days) over time at risk (e.g., per person-year) compared between trial arms. It is important to consider how IRR estimates were derived in the individual studies, particularly with respect to adjustments for zero-inflation and/or over-dispersion, as these modeling decisions can be sources of between-study heterogeneity. Moreover, studies that include count data may have zero counts in both groups, which may require less common and more nuanced approaches to meta-analysis, such as Poisson regression with random intervention effects. 45
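
As a brief sketch (assuming, purely for illustration, that a and c are the event counts and T1 and T2 the total person-time at risk in the treatment and control arms), the IRR and its large-sample variance under a simple Poisson model are:

$$IRR = \frac{a/T_1}{c/T_2}, \qquad V_{\ln IRR} \approx \frac{1}{a} + \frac{1}{c},$$

with a 95% confidence interval obtained by exponentiating $\ln IRR \pm 1.96\sqrt{V_{\ln IRR}}$. This simple form ignores the over-dispersion and zero-inflation adjustments discussed above.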

2.3. Continuous Outcomes

Assembling data needed for effect size computation.

Meta-analysis of studies presenting continuous data requires both estimated differences between the two groups being compared and estimated standard errors of those differences. Estimating the between-group difference is easiest when the study provides the mean difference. While both a standardized mean difference and ratio of means could be given by the study authors, studies more often report means for each group. Thus, a mean difference or ratio of means often must be computed.

If an estimate of the standard error of the mean difference is not provided, studies commonly report confidence intervals, standard deviations, p-values, z-statistics, and/or t-statistics, from which the standard error of the mean difference can be computed. In the absence of any of these statistics, other methods are available to estimate the standard error. 45
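
For reference, some standard conversions (a sketch, assuming independent groups, a 95% confidence interval, and large samples):

$$SE_{MD} = \sqrt{\frac{s_T^2}{n_T} + \frac{s_C^2}{n_C}}, \qquad SE_{MD} = \frac{UL - LL}{2 \times 1.96}, \qquad SE_{MD} = \left|\frac{MD}{t}\right|,$$

where $s_T$ and $s_C$ are the group standard deviations, $n_T$ and $n_C$ the group sample sizes, $UL$ and $LL$ the reported confidence limits, and $t$ the reported t-statistic for the between-group comparison.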

(Weighted) Mean Difference

The mean difference (formerly known as the weighted mean difference) is the most common way of summarizing and pooling a continuous outcome in a meta-analysis. Pooled mean differences can be computed when every study in the analysis measures the outcome on the same scale or on scales that can be easily converted. For example, total weight can be pooled using the mean difference even if different studies reported weight in kilograms and pounds; however, it is not possible to pool quality of life measured with both the Self-Perceived Quality of Life scale (SPQL) and the 36-item Short Form Survey Instrument (SF-36), since these are not readily convertible to a common format.

Computation of the mean difference is straightforward and explained elsewhere. 5 Most software programs will require the mean, standard deviation, and sample size from each intervention group for each study in the meta-analysis, although, as mentioned above, other pieces of data may also be used.
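
To make the required inputs concrete, the following minimal Python sketch (with made-up study summaries) computes each study's mean difference and standard error and then combines them with simple inverse-variance (fixed effects) weighting; production analyses would typically use a dedicated meta-analysis package and a random effects model as discussed in Chapter 3.

```python
import math

# Hypothetical per-study summaries: (mean_T, sd_T, n_T, mean_C, sd_C, n_C)
studies = [
    (5.2, 1.9, 60, 4.1, 2.1, 58),
    (4.8, 2.3, 110, 4.0, 2.2, 112),
    (5.6, 2.0, 45, 4.9, 1.8, 47),
]

def mean_difference(m_t, sd_t, n_t, m_c, sd_c, n_c):
    """Mean difference and its standard error from group-level summaries."""
    md = m_t - m_c
    se = math.sqrt(sd_t ** 2 / n_t + sd_c ** 2 / n_c)
    return md, se

# Inverse-variance (fixed effects) pooling of the mean differences
weights, weighted_effects = [], []
for m_t, sd_t, n_t, m_c, sd_c, n_c in studies:
    md, se = mean_difference(m_t, sd_t, n_t, m_c, sd_c, n_c)
    w = 1.0 / se ** 2
    weights.append(w)
    weighted_effects.append(w * md)

pooled = sum(weighted_effects) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))
print(f"Pooled MD = {pooled:.2f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.2f} to {pooled + 1.96 * pooled_se:.2f})")
```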

Some studies report values as change from baseline, or alternatively present both baseline and final values. In these cases, it is possible to pool differences in final values from some studies with differences in change-from-baseline values from other studies, since both estimate the same quantity in a randomized controlled trial. If baseline values are unbalanced, it may be better to perform an ANCOVA analysis (see below). 5

Standardized Mean Difference

Sometimes different studies will assess the same outcome using different scales or metrics that cannot be readily converted to a common measure. In such instances the most common response is to compute a standardized mean difference (SMD) for each study and then pool these across all studies in the meta-analysis. By dividing the mean difference by a pooled estimate of the standard deviation, we theoretically put all scales in the same unit (standard deviation), and are then able to statistically combine all the studies. While the standardized mean difference could be used even when studies use the same metric, it is generally preferred to use mean difference. Interpretation of results is easier when the final pooled estimate is given in the same units as the original studies.

Several methods can compute SMDs. The most frequently used are Cohen’s d and Hedges’ g .

Cohen’s d

Cohen’s d is the simplest SMD computation; it is defined as the mean difference divided by the pooled standard deviation of the treatment and control groups. 5 For a given study, Cohen’s d can be computed as: $d = \dfrac{m_T - m_C}{s_{pooled}}$

Where $m_T$ and $m_C$ are the treatment and control means and $s_{pooled}$ is essentially the square root of the weighted average of the treatment and control variances.
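
For a two-arm trial, the pooled standard deviation referred to here is commonly computed as:

$$s_{pooled} = \sqrt{\frac{(n_T - 1)\,s_T^2 + (n_C - 1)\,s_C^2}{n_T + n_C - 2}},$$

where $n_T$, $n_C$, $s_T$, and $s_C$ are the treatment and control sample sizes and standard deviations.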

It has been shown that this estimate is biased in estimating the true population SMD, with the bias decreasing as the sample size increases (small-sample bias). 46 For this reason, Hedges’ g is more often used.

Hedges’ g

Hedges’ g is a transformation of Cohen’s d that adjusts for small-sample bias. The transformation involves multiplying Cohen’s d by a correction factor that depends on the total sample size. 5 This generally results in a slightly smaller value of Hedges’ g compared with Cohen’s d, but the reduction lessens as the total sample size increases. The formula is: $g = d\left(1 - \dfrac{3}{4N - 9}\right)$

Where N is the total trial sample size.

For very large sample sizes the two estimates will be very similar.
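
As a worked example with hypothetical values: for d = 0.50 and a total sample size of N = 40, g = 0.50 × (1 − 3/(4 × 40 − 9)) = 0.50 × (1 − 3/151) ≈ 0.49; with N = 400, the correction factor is 1 − 3/1591 ≈ 0.998 and g is essentially equal to d.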

Back Transformation of Pooled SMD

One disadvantage of reporting a standardized mean difference is that units of standard deviation are difficult to interpret clinically. Guidelines do exist but are often considered arbitrary and not applicable to all situations. 47 An alternative is to back-transform the pooled SMD into a scale used in one of the analyses. In theory, by multiplying the SMD (and its upper and lower confidence bounds) by the standard deviation of the original scale, one can obtain a pooled estimate on that original scale. The difficulty is that the true standard deviation is unknown and must be estimated from available data. Alternatives for estimation include using the standard deviation from the largest study or using a pooled estimate of the standard deviations across studies. 5 One should include a sensitivity analysis and be transparent about the approach used.
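
As an illustration with hypothetical numbers: a pooled SMD of 0.40 (95% CI 0.15 to 0.65), back-transformed using an estimated standard deviation of 12 points on the chosen reference scale, corresponds to a difference of about 4.8 points (95% CI 1.8 to 7.8) on that scale; repeating the calculation with an alternative standard deviation estimate (e.g., a pooled value across studies) serves as the recommended sensitivity analysis.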

Ratio of Means

Ratio of Means (RoM), also known as response ratio, has been presented as an alternative to the SMD when outcomes are reported in different non-convertible scales. As the name implies the RoM divides the treatment mean by the control mean rather than taking the difference between the two. The ratio can be interpreted as the percentage change in the mean value of the treatment group relative to the control group. By meta-analyzing across studies we are making the assumption that the relative change will be homogeneous across all studies, regardless of which scale was used to measure it. Similar to the risk ratio and odds ratio, the RoM is pooled on the log scale; computational formulas are readily available. 5
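
For reference, one commonly used delta-method approximation (a sketch, not the only available formulation) is:

$$\ln RoM = \ln\!\left(\frac{m_T}{m_C}\right), \qquad V_{\ln RoM} \approx \frac{s_T^2}{n_T\,m_T^2} + \frac{s_C^2}{n_C\,m_C^2},$$

with the pooled estimate and its confidence limits computed on the log scale and exponentiated for reporting.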

For the RoM to have any clinical meaning, it is required that in the scale being used, the values are always positive (or always negative) and that a value of “zero” truly means zero. For example, if the outcome were patient temperature, RoM would be a poor choice since a temperature of 0 degrees does not truly represent what we would think of as zero.

2.4. Special Topics

Crossover trials.

A crossover trial is one in which all patients receive, in sequence, both the treatment and control interventions. As a result, the same group of patients contributes outcome values under both the treatment and control conditions. When computing the standard error of the mean difference of a crossover trial, one must account for the correlation between the two sets of measurements, which arises because they are made on the same patients. 5 For most variables, the correlation will be positive, resulting in a smaller standard error than would be seen with the same values in a parallel trial.

To compute the correct pooled standard error requires an estimate of the correlation between the two treatment conditions. If the correlation is available, the pooled standard error can be computed using the following formula: $SE_P = \sqrt{SE_T^2 + SE_C^2 - 2\,r\,SE_T\,SE_C}$ (the minus sign reflects the fact that a positive within-patient correlation reduces the standard error).

Where r is the within-patient correlation and SE P , SE T , and SE C are the pooled, treatment, and control standard errors respectively

Most studies do not give the correlation or enough information to compute it, and thus it often has to be estimated based on investigator knowledge or imputed. 5 An imputation of 0.5 has been suggested as a good conservative estimate of correlation in the absence of any other information. 48
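
As an illustration with hypothetical values: if SE_T = SE_C = 2.0 and r = 0.5, then SE_P = √(4 + 4 − 2 × 0.5 × 2 × 2) = √4 = 2.0, compared with √8 ≈ 2.83 if the same data had come from a parallel design (r = 0); larger positive correlations shrink the standard error further.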

If a cross-over study reports its data by period, investigators have sometimes used first period data only when including cross-over trials in their meta-analyses—essentially treating the study as if it were a parallel design. This eliminates correlation issues, but has the disadvantage of omitting half the data from the trial.

Cluster Randomized Trials

Cluster randomized trials occur when patients are randomized to treatment and control in groups (clusters) rather than individually. If the units/subjects within clusters are positively correlated (as they usually are), then there is a loss of precision compared with a standard (non-clustered) parallel design of the same size. The design effect (DE) of a cluster randomized trial is the multiplicative factor by which the variance computed as if the trial were a standard parallel design must be inflated; equivalently, the standard error is multiplied by √DE. Reported results from cluster trials may not account for the design effect, and in that case it must be computed by the investigator. The formula for computing the design effect is: $DE = 1 + (M - 1)\,ICC$

Where M is the average cluster size and ICC is the intra-class correlation coefficient (see below).

Computation of the design effect involves a quantity known as the intra-class correlation coefficient (ICC), which is defined as the proportion of the total variance (i.e., within-cluster variance plus between-cluster variance) that is due to between-cluster variance. 5 ICCs are often not reported by cluster trials, so a value must be obtained from external literature or a plausible value must be assumed by the investigator.
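
As an illustration with hypothetical values: with an average cluster size of M = 20 and an assumed ICC of 0.05, DE = 1 + 19 × 0.05 = 1.95, so the standard error computed as if the trial were individually randomized would be multiplied by √1.95 ≈ 1.40 (equivalently, the effective sample size is the actual sample size divided by 1.95).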

Mean Difference and Baseline Imbalance

When trials report both baseline and follow-up measurements, there are three options for estimating the mean difference:

  • Use follow-up data.
  • Use change from baseline data.
  • Use an ANCOVA model that adjusts for the effects of baseline imbalance. 49

As long as trials are balanced at baseline, all three methods will give similar unbiased estimates of mean difference. 5 When baseline imbalance is present, it can be shown that using ANCOVA will give the best estimate of the true mean difference; however the parameters required to perform this analysis (mean and standard deviations of baseline, follow-up and change from baseline values) are usually not provided by the study authors. 50 If it is not feasible to perform an ANCOVA analysis, the choice of whether to use follow up or change from baseline values depends on the amount of correlation between baseline and final values. If the correlation is less than or equal to 0.5, then using the follow up values will be less biased (with respect to the estimate in the ANCOVA model) than using the change from baseline values. If the correlation is greater than 0.5, then change from baseline values will be less biased than using the follow up values. 51 There is evidence that these correlations are more often greater than 0.5, so the change from baseline means will usually be preferred if estimates of correlation are totally unobtainable. 52 A recent study 51 showed that all approaches were unbiased when there were both few trials and small sample sizes within the trials.

2.5. Recommendations

  • The analyst should consider carefully which binary measure to analyze.
  • If conversion to NNT or NNH is sought, then the risk difference is the preferred measure.
  • The risk ratio and odds ratio are likely to be more consistent than the risk difference when the studies differ in baseline risk.
  • The risk difference is not the preferred measure when the event is rare.
  • The risk ratio is not the preferred measure if switching between occurrence and non occurrence of the event is important to the meta-analysis.
  • The odds ratio can be misleading.
  • The mean difference is the preferred measure when studies use the same metric.
  • When calculating standardized mean difference, Hedges’ g is preferred over Cohen’s d due to the reduction in bias.
  • If baseline values are unbalanced, one should perform an ANCOVA analysis. If ANCOVA cannot be performed and the correlation is greater than 0.5, change from baseline values should be used to compute the mean difference. If the correlation is less than or equal to 0.5, follow-up values should be used.
  • Data from clustered randomized trials should be adjusted for the design effect.

Chapter 3. Choice of Statistical Model for Combining Studies

3.1. Introduction

Meta-analysis can be performed using either a fixed or a random effects model to provide a combined estimate of effect size. A fixed effects model assumes that there is one single treatment effect across studies and that any differences between observed effect sizes are due to sampling error. Under a random effects model, the treatment effects across studies are assumed to vary from study to study and to follow a random distribution. The differences between observed effect sizes are then due not only to sampling error, but also to variation in the true treatment effects. A random effects model usually assumes that the treatment effects across studies follow a normal distribution, though the validity of this assumption may be difficult to verify, especially when the number of studies is small. Alternative distributions 53 or distribution-free models 54 , 55 have also been proposed.

Recent advances in meta-analysis include the development of alternative models to the fixed or random effects models. For example, Doi et al. proposed an inverse variance heterogeneity model (the IVhet model) for the meta-analysis of heterogeneous clinical trials that uses an estimator under the fixed effect model assumption with a quasi-likelihood based variance structure. 56 Stanley and Doucouliagos proposed an unrestricted weighted least squares (WLS) estimator with multiplicative error for meta-analysis and claimed superiority to both conventional fixed and random effects models, 57 though Mawdsley et al. 58 found modest differences when compared with the random effects model. These methods have not been fully compared with the many estimators developed within the framework of the fixed and random effects models and are not readily available in most statistical packages; thus they will not be further considered here.

General Considerations for Model Choice

Considerations for model choice include but are not limited to heterogeneity across treatment effects, the number and size of included studies, the type of outcomes, and potential bias. We recommend against choosing a statistical model based on the significance level of a heterogeneity test, for example, picking a fixed effects model when the p-value for the test of heterogeneity is more than 0.10 and a random effects model when P < 0.10, since such an approach does not take the many factors for model choice into full consideration.

In practice, clinical and methodological heterogeneity are always present across a set of included studies. Variation among studies is inevitable whether or not the test of heterogeneity detects it. Therefore, we recommend random effects models, with special considerations for rare binary outcomes (discussed below in the section on combining rare binary outcomes). For a binary outcome, when the estimate of between-study heterogeneity is zero, a fixed effects model (e.g., the Mantel-Haenszel method, inverse variance method, Peto method (for OR), or fixed effects logistic regression) provides an effect estimate similar to that produced by a random effects model. The Peto method requires that no substantial imbalance exists between treatment and control group sizes within trials and treatment effects are not exceptionally large.

When a systematic review includes both small and large studies and the results of small studies are systematically different from those of the large ones, publication bias may be present and the assumption of a random distribution of effect sizes, in particular, a normal distribution, is not justified. In this case, neither the random effects model nor the fixed effects model provides an appropriate estimate and investigators may choose not to combine all studies. 10 Investigators can choose to combine only the large studies if they are well conducted with good quality and are expected to provide unbiased effect estimates. Other potential differences between small and large studies should also be examined.

Choice of Random Effects Model and Estimator

The most commonly used random effects model for combined effect estimates is based on an estimator developed by DerSimonian and Laird (DL) due to its simplicity and ease of implementation. 59 It is well recognized that the estimator does not adequately reflect the error associated with parameter estimation, in particular, when the number of studies is small, and between-study heterogeneity is high. 40 Refined estimators have been proposed by the original authors. 19 , 60 , 61 Other estimators have also been proposed to improve the DL estimator. Sidik and Jonkman (SJ) and Hartung and Knapp (HK) independently proposed a non-iterative variant of the DL estimator using the t-distribution and an adjusted confidence interval for the overall effect. 62 – 64 We refer to this as the HKSJ method. Biggerstaff–Tweedie (BT) proposed another variant of the DL method by incorporating error in the point estimate of between-study heterogeneity into the estimation of the overall effect. 65 There are also many other likelihood based estimators such as maximum likelihood estimate, restricted maximum likelihood estimate and profile likelihood (PL) methods, which better account for the uncertainty in the estimate of between-study variance. 19

Several simulation studies have been conducted to compare the performance of different estimators for combined effect size. 19 – 21 , 66 , 67 For example, Brockwell et al. showed the PL method provides an estimate with better coverage probability than the DL method. 19 Jackson et al. showed that with a small number of studies, the DL method did not provide adequate coverage probability, in particular, when there was moderate to large heterogeneity. 20 However, these results supported the usefulness of the DL method for larger samples. In contrast, the PL estimates resulted in coverage probability closer to nominal values. IntHout et al. compared the performance of the DL and HKSJ methods and showed that the HKSJ method consistently resulted in more adequate error rates than did the DL method, especially when the number of studies was small, though they did not evaluate coverage probability and power. 67 Kontopantelis and Reeves conducted the most comprehensive simulation studies to compare the performance of nine different methods and evaluated multiple performance measures including coverage probability, power, and overall effect estimation (accuracy of point estimates and error intervals). 21 When the goal is to obtain an accurate estimate of overall effect size and the associated error interval, they recommended using the DL method when heterogeneity is low and using the PL method when heterogeneity is high, where the definition of high heterogeneity varies by the number of studies. The PL method overestimated coverage probability in the absence of between-study heterogeneity. Methods like BT and HKSJ, despite being developed to address the limitations of the DL method, were frequently outperformed by the DL method. Encouragingly, Kontopantelis and Reeves also showed that regardless of the estimation method, results are highly robust against even very severe violations of the assumption of normally distributed effect sizes.

Recently there has been a call to use alternative random effects estimators to replace the universal use of the DerSimonian-Laird random effects model. 68 Based on the results from the simulation studies, the PL method appears to perform best overall, providing good performance across more scenarios than other methods, though it may overestimate the width of confidence intervals in small studies with low heterogeneity. 21 It is appropriate to use the DL method when heterogeneity is low. Another disadvantage of the PL method is that it does not always converge. In those situations, investigators may choose the DL method with sensitivity analyses using other methods, such as the HKSJ method. If non-convergence is due to high heterogeneity, investigators should also reevaluate the appropriateness of combining studies. The PL method (and the DL method) can be used to combine measures for continuous, count, and time-to-event data, as well as binary data when events are common. Note that the confidence interval produced by the PL method may not be symmetric. It is also worth noting that OR, RR, HR, and incidence rate ratio statistics should be analyzed on the logarithmic scale when the PL, DL, or HKSJ method is used. Finally, a Bayesian approach can also be used, since it takes the variation in all parameters into account (see the section on Bayesian methods, below).
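
To make the mechanics concrete, the following minimal Python sketch (with made-up log odds ratios and variances) implements the basic DerSimonian-Laird calculation; it is illustrative only and omits the PL iteration and HKSJ small-sample adjustments discussed above, which dedicated meta-analysis software provides.

```python
import math

# Hypothetical study effects (log odds ratios) and their within-study variances
effects = [0.35, 0.10, 0.48, -0.05, 0.22]
variances = [0.04, 0.02, 0.09, 0.03, 0.05]

k = len(effects)
w_fixed = [1.0 / v for v in variances]
fixed_mean = sum(w * y for w, y in zip(w_fixed, effects)) / sum(w_fixed)

# Cochran's Q and the DerSimonian-Laird estimate of between-study variance
q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(w_fixed, effects))
c = sum(w_fixed) - sum(w ** 2 for w in w_fixed) / sum(w_fixed)
tau2 = max(0.0, (q - (k - 1)) / c)

# Random effects weights and pooled estimate on the log scale
w_random = [1.0 / (v + tau2) for v in variances]
re_mean = sum(w * y for w, y in zip(w_random, effects)) / sum(w_random)
re_se = math.sqrt(1.0 / sum(w_random))

print(f"tau^2 = {tau2:.4f}")
print(f"Random-effects log OR = {re_mean:.3f} "
      f"(95% CI {re_mean - 1.96 * re_se:.3f} to {re_mean + 1.96 * re_se:.3f})")
# Exponentiate the estimate and limits to report an odds ratio.
```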

Role of Generalized Linear Mixed Effects Models

The different methods and estimators discussed above are generally used to combine effect measures directly (for example, mean difference, SMD, OR, RR, HR, and incidence rate ratio). For study-level aggregated binary data and count data, we also recommend the use of the generalized linear mixed effects model assuming random treatment effects. For aggregated binary data, a combined OR can be generated by assuming the binomial distribution with a logit link. It is also possible to generate a combined RR with the binomial distribution and a log link, though the model does not always converge. For aggregated count data, a combined rate ratio can be generated by assuming the Poisson distribution with a log link. Results from using the generalized linear models and directly combining effect measures are similar when the number of studies and/or the sample sizes are large.

3.2. A Special Case: Combining Rare Binary Outcomes

When combining rare binary outcomes (such as adverse event data), few or zero events often occur in one or both arms in some of the studies. In this case, the binomial distribution is not well approximated by the normal distribution and choosing an appropriate model becomes complicated. The DL method does not perform well with low-event-rate binary data. 43 , 69 A fixed effects model often outperforms the DL method even in the presence of heterogeneity. 70 When event rates are less than 1 percent, the Peto OR method has been shown to provide the least biased, most powerful combined estimates with the best confidence interval coverage, 43 provided the included studies have moderate effect sizes and the treatment and control groups are of relatively similar sizes. The Peto method does not perform well when either the studies are unbalanced or the studies have large ORs (outside the range of 0.2-5). 71 , 72 Otherwise, when treatment and control group sizes are very different, effect sizes are large, or events become more frequent (5 percent to 10 percent), the Mantel-Haenszel method (without a correction factor) or a fixed effects logistic regression provides better combined estimates.

Within the past few years, many methods have been proposed to analyze sparse data, ranging from simple averaging, 73 exact methods, 74 , 75 and Bayesian approaches 76 , 77 to various parametric models (e.g., generalized linear mixed effects models, the beta-binomial model, the Gamma-Poisson model, the bivariate binomial-normal model). Two dominant opinions are to avoid continuity corrections and to include studies with zero events in both arms in the meta-analysis. Great efforts have been made to develop methods that can include such studies.

Bhaumik et al. proposed the simple (unweighted) average (SA) treatment effect with the 0.5 continuity correction, and found that the bias of the SA estimate in the presence of even significant heterogeneity is minimal compared with the bias of MH estimates (with 0.5 correction). 73 A simple average was also advocated by Shuster. 78 However, potential confounding remains an issue for an unweighted estimator. Spittal et al. showed that Poisson regression works better than the inverse variance method for rare events. 79 Kuss et al. conducted a comprehensive simulation of eleven methods and recommended the beta-binomial model for the three common effect measures (OR, RR, and RD) as the preferred meta-analysis method for rare binary events with studies of zero events in one or both arms. 80 The beta-binomial model assumes that the observed events follow a binomial distribution and that the binomial probabilities follow a beta distribution. In Kuss’s simulation, which used a generalized linear model framework to model the treatment effect, an OR was estimated using a logit link and an RR using a log link. Instead of using an identity link, the RD was estimated based on the estimated event probabilities from the logit model. This comprehensive simulation examined methods that can incorporate data from studies with zero events in both arms and do not need any continuity correction, and compared only the Peto and MH methods as reference methods.

Given the development of new methods that can handle studies with zero events in both arms, we advise that older methods that use continuity corrections be avoided. Investigators should use valid methods that include studies with zero events in one or both arms. For studies with zero events in one arm, or studies with sparse binary data but no zero events, an estimate can be obtained using the Peto method, the Mantel-Haenszel method, or a logistic regression approach, without adding a correction factor, when the between-study heterogeneity is small. These methods are simple to use and readily available in standard statistical packages. When the between-study heterogeneity is large and/or there are studies with zero events in both arms, the more recently developed methods, such as the beta-binomial model, could be explored and used. However, investigators should note that no method gives completely unbiased estimates when events are rare; statistical methods can never completely solve the issue of sparse data. Investigators should always conduct sensitivity analyses 81 using alternative methods to check the robustness of results, and acknowledge the inadequacy of the data sources when presenting the meta-analysis results, in particular when the proportion of studies with zero events in both arms is high. If double-zero studies are to be excluded, they should be qualitatively summarized, for example by providing confidence intervals for the proportion of events in each arm.

A Note on an Exact Method for Sparse Binary Data

For rare binary events, the normal approximation and asymptotic theory for large sample size does not work satisfactorily and exact inference has been developed to overcome these limitations. Exact methods do not need continuity corrections. However, simulation analyses do not identify a clear advantage of early developed exact methods 75 , 82 over a logistic regression or the Mantel-Haenszel method even in situations where these exact methods would theoretically be advantageous. 43 Recent developments of exact methods include Tian et al.’s method of combining confidence intervals 83 and Liu et al.’s method of combining p-value functions. 84 Yang et al. 85 developed a general framework for meta-analysis of rare events by combining confidence distributions (CDs), and showed that Tian’s and Liu’s methods could be unified under the CD framework. Liu showed that exact methods performed better than the Peto method (except when studies are unbalanced) and the Mantel-Haenszel method, 84 though the comparative performance of these methods has not been thoroughly evaluated. Investigators may choose to use exact methods with considerations for the interpretation of effect measures, but we do not specifically recommend exact methods over other models discussed above.

3.3. Bayesian Methods

A Bayesian framework provides a unified and comprehensive approach to meta-analysis that accommodates a wide variety of outcomes, often using generalized linear models (GLMs) with normal, binomial, Poisson, and multinomial likelihoods and various link functions. 86

It should be noted that while these GLM models are routinely implemented in the frequentist framework, and are not specific to the Bayesian framework, extensions to more complex situations are most approachable using the Bayesian framework, for example, allowing for mixed treatment comparisons involving repeated measurements of a continuous outcome that varies over time. 87

There are several specific advantages inherent to the Bayesian framework. First, the Bayesian posterior parameter distributions fully incorporate the uncertainty of all parameters. These posterior distributions need not be assumed to be normal. 88 In random-effects meta-analysis, standard methods use only the most likely value of the between-study variance, 59 rather than incorporating the full uncertainty of each parameter. Thus, Bayesian credible intervals will tend to be wider than confidence intervals produced by some classical random-effects analysis such as the DL method. 89 However, when the number of studies is small, the between-study variance will be poorly estimated by both frequentist and Bayesian methods, and the use of vague priors can lead to a marked variation in results, 90 particularly when the model is used to predict the treatment effect in a future study. 91 A natural alternative is to use an informative prior distribution, based on observed heterogeneity variances in other, similar meta-analyses. 92 – 94

Full posterior distributions can provide a more informative summary of the likely value of parameters than the frequentist approach. When communicating results of meta-analysis to clinicians, the Bayesian framework allows direct probability statements to be made and provides the rank probability that a given treatment is best, second best, or worst (see the section on interpreting ranking probabilities and clinically important results in Chapter 5 below). Another advantage is that posterior distributions of functions of model parameters can be easily obtained such as the NNT. 86 Finally, the Bayesian approach allows full incorporation of parameter uncertainty from meta-analysis into decision analyses. 95

Until recently, Bayesian meta-analysis required specialized software such as WinBUGS, 96 OpenBUGS, 97 and JAGS. 98 , 99 Newer open source software platforms such as Stan 100 and Nimble 101 , 102 provide additional functionality and use BUGS-like modeling languages. In addition, there are user written commands that allow data processing in a familiar environment which then can be passed to WinBUGS, or JAGS for model fitting. 103 For example, in R, the package bmeta currently generates JAGS code to implement 22 models. 104 The R package gemtc similarly automates generation of JAGS code and facilitates assessment of model convergence and inconsistency. 105 , 106 On the other hand, Bayesian meta-analysis can be implemented in commonly used statistical packages. For example, SAS PROC MCMC can now implement at least some Bayesian hierarchical models 107 directly, as can Stata, version 14, via the bayesmh command. 108

When vague prior distributions are used, Bayesian estimates are usually similar to estimates obtained from the frequentist methods described above. 90 Use of informative priors requires care to avoid undue influence on the posterior estimates. Investigators should provide adequate justification for the choice of priors and conduct sensitivity analyses. Bayesian methods currently require more effort in programming, MCMC simulation, and convergence diagnostics.

A Note on Using a Bayesian Approach for Sparse Binary Data

It has been suggested that a Bayesian approach might be a valuable alternative for sparse event data, since Bayesian inference does not depend on asymptotic theory and takes into account all uncertainty in the model parameters. 109 The Bayesian fixed effects model provides good estimates when events are rare for binary data. 70 However, the choice of prior distribution, even when non-informative, may impact results, in particular when a large proportion of studies have zero events in one or both arms. 80 , 90 , 110 Nevertheless, other simulation studies found that when the overall baseline rate is very small and there is moderate or large heterogeneity, Bayesian hierarchical random effects models can provide less biased estimates of the effect measures and the heterogeneity parameters. 77 To reduce the impact of the prior distributions, objective Bayesian methods have been developed 76 , 111 with special attention paid to the coherence between the prior distributions of the study model parameters and the meta-parameter, 76 though this Bayesian model was developed outside the usual hierarchical normal random effects framework. Further evaluation of these methods is required before objective Bayesian methods can be recommended.

3.4. Recommendations

  • The PL method appears to generally perform best. The DL method is also appropriate when the between-study heterogeneity is low.
  • For study-level aggregated binary data and count data, the use of a generalized linear mixed effects model assuming random treatment effects is also recommended.
  • Methods that use continuity corrections should be avoided.
  • For studies with zero events in one arm, or studies with sparse binary data but no zero events, an estimate can be obtained using the Peto method, the Mantel-Haenszel method, or a logistic regression approach, without adding a correction factor, when the between-study heterogeneity is low.
  • When the between-study heterogeneity is high, and/or there are studies with zero events in both arms, more recently developed methods such as a beta-binomial model could be explored and used.
  • Sensitivity analyses should be conducted with acknowledgement of the inadequacy of data.
  • If investigators choose Bayesian methods, use of vague priors is supported.

Chapter 4. Quantifying, Testing, and Exploring Statistical Heterogeneity

4.1. Statistical Heterogeneity in Meta-Analysis

Statistical heterogeneity was explained in general in Chapter 1 . In this chapter, we provide a deeper discussion from a methodological perspective. Statistical heterogeneity must be expected, quantified and sufficiently addressed in meta-analyses. 112 We recommend performing graphic and quantitative exploration of heterogeneity in combination. 113 In this chapter, it is assumed that a well-specified research question has been posed, the relevant literature has been reviewed, and a set of trials meeting selection criteria have been identified. Even when trial selection criteria are aimed toward identifying studies that are adequately homogenous, it is common for trials included in a meta-analysis to differ considerably as a function of (clinical and/or methodological) heterogeneity that was reviewed in Chapter 1 . Even when these sources of heterogeneity have been accounted for, statistical heterogeneity often remains. Statistical heterogeneity refers to the situation where estimates across studies have greater variability than expected from chance variation alone. 113 , 114

4.2. Visually Inspecting Heterogeneity

Although simple histograms, box plots, and other related graphical methods of depicting effect estimates across studies may be helpful preliminarily, these approaches do not necessarily provide insight into statistical heterogeneity. However, forest and funnel plots can be helpful in the interpretation of heterogeneity particularly when examined in combination with quantitative results. 113 , 115

Forest Plots

Forest plots can help identify potential sources and the extent of statistical heterogeneity. Meta-analyses with limited heterogeneity will produce forest plots with gross visual overlap of the study confidence intervals and the summary estimate; in contrast, poor overlap is a crude sign of statistical heterogeneity. 115 An important recommendation is to graphically present between-study variance on forest plots of random effects meta-analyses using prediction intervals, which are on the same scale as the outcome. 93 The 95% prediction interval estimates where true effects would be expected for 95% of future studies. 93 When between-study variance is greater than zero, the prediction interval will cover a wider range than the confidence interval of the summary effect. 116 As proposed by Guddat et al. 117 and endorsed by IntHout et al., 116 including the prediction interval as a rectangle at the bottom of the forest plot helps differentiate between-study variation from the confidence interval of the summary effect, which is commonly depicted as a diamond.
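
For reference, the 95% prediction interval described here is commonly computed as (a sketch using the random effects estimates):

$$\hat{\mu} \pm t_{0.975,\,k-2}\,\sqrt{\hat{\tau}^2 + SE(\hat{\mu})^2},$$

where $\hat{\mu}$ is the random effects summary estimate, $SE(\hat{\mu})$ its standard error, $\hat{\tau}^2$ the estimated between-study variance, and $t_{0.975,\,k-2}$ the 97.5th percentile of the t distribution with k − 2 degrees of freedom for k studies.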

Funnel Plots

Funnel plots are often thought of as representing bias, but they can also aid in detecting sources of heterogeneity. Funnel plots plot the effect size observed in each study (x-axis) around the summary effect size against the precision of each study (typically standard error, variance, or precision on the y-axis). A meta-analysis that includes studies estimating the same underlying effect across a range of precision, with limited bias and heterogeneity, would produce a funnel plot resembling a symmetrical inverted funnel, with dispersion increasing among the less precise (i.e., smaller) studies. 115 In the presence of heterogeneity and/or bias, funnel plots take on an asymmetric pattern around the summary effect size and also show scatter outside the bounds of the 95% confidence limits. 115 Asymmetry in funnel plots can be difficult to detect visually, 118 and can be misleading due to multiple contributing factors. 113 , 119 , 120 Formal tests for funnel plot asymmetry (such as Egger’s test 15 for continuous outcomes, or the arcsine test proposed by Rucker et al. 27 for binary data) are available but should not be used in a meta-analysis involving fewer than 10 studies because of limited power. 113 Given the above cautions and considerations, funnel plots should only be used to complement other approaches in the preliminary analysis of heterogeneity.

4.3. Quantifying Heterogeneity

The null hypothesis of homogeneity in meta-analysis is that all studies are evaluating the same effect 22 (i.e., all studies have the same true effect parameter, which may or may not equal zero), and the alternative hypothesis is that at least one study has an effect that differs from the summary effect. This hypothesis is commonly tested with the Q statistic, defined below.

  • Where Q is the heterogeneity statistic,
  • w is the study weight based on inverse variance weighting,
  • x is the observed effect size in each trial, and
  • $\hat{x}_w$ is the summary estimate from a fixed effects meta-analysis.
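
With these definitions, the Q statistic is computed for k studies as:

$$Q = \sum_{i=1}^{k} w_i \left( x_i - \hat{x}_w \right)^2, \qquad \hat{x}_w = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}.$$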

The Q statistic is assumed to have an approximate χ² distribution with k − 1 degrees of freedom. When Q exceeds k − 1 and the associated p-value is low (typically, a p-value of <0.10 is used as a cut-off), the null hypothesis of homogeneity can be rejected. 22 , 122 Interpretation of a Q statistic in isolation is not advisable, however, because it has low statistical power in meta-analyses involving a limited number of studies 123 , 124 and may detect unimportant heterogeneity when the number of studies included in a meta-analysis is large. Importantly, since heterogeneity is expected in meta-analyses even without statistical tests to support that claim, non-significant Q statistics must not be interpreted as the absence of heterogeneity. Moreover, the interpretation of Q in meta-analyses is more complicated than typically represented, because the actual distribution of Q depends on the measure of effect 125 and is only approximately χ² in large samples. 122 Even if the null distribution of Q were χ², universally interpreting all values of Q greater than the mean of k − 1 as indicating heterogeneity would be an oversimplification. 122 There are expansions to approximate Q for meta-analyses of standardized mean differences, 125 risk differences, 125 and odds ratios 126 that should be used as alternatives to Q, particularly when the sample sizes of studies included in a meta-analysis are small. 122 The Q statistic and expansions thereof must be interpreted along with other heterogeneity statistics and with full consideration of their limitations.

Graphical Options for Examining Contributions to Q

Hardy and Thompson proposed using probability plots to investigate the contribution that each study makes to Q. 127 When each study is labeled, those deviating from the normal distribution in a probability plot have the greatest influence on Q. 127 Baujat and colleagues proposed another graphical method to identify studies that have the greatest impact on Q. 128 The Baujat plot places each study's contribution to the heterogeneity statistic on the horizontal axis and, on the vertical axis, the squared difference between the meta-analytic estimates with and without the i-th study divided by the estimated variance of the meta-analytic estimate without the i-th study. With this presentation, studies that have the greatest influence on Q fall in the upper right corner, where they are easy to identify visually. Smaller studies have been shown to contribute more to heterogeneity than larger studies, 129 which would be visually apparent in Baujat plots. We recommend using these graphical approaches only when there is significant heterogeneity, and only when it is important to identify the specific studies that are contributing to it.

Between-Study Variance

The most widely used estimator of between-study variance is the DerSimonian and Laird (DL) moment estimator, given below, where:

  • τ² is the between-study variance of the true effects,
  • DL denotes the DerSimonian and Laird estimate of τ²,
  • Q is the heterogeneity statistic (as above),
  • k − 1 is the degrees of freedom, and
  • w is the weight applied to each study based on inverse variance weighting.
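
With these definitions, the DL estimator is:

$$\hat{\tau}^2_{DL} = \max\!\left( 0,\; \frac{Q - (k - 1)}{\sum_i w_i - \sum_i w_i^2 \big/ \sum_i w_i} \right).$$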

Since variance cannot be less than zero, a τ² estimate less than zero is set to zero. The value of τ² is incorporated into the weights of random effects meta-analysis as presented in Chapter 3. Since the DerSimonian and Laird approach to τ² is derived in part from Q, the problems with Q described above also apply to the τ² parameter. 122 There are many alternatives to DerSimonian and Laird for estimating between-study variance. In a recent simulation, Veroniki and colleagues 121 compared 16 estimators of between-study variance; they argued that the Paule and Mandel 130 method is a better alternative to the DerSimonian and Laird estimator for continuous and binary data because it is less biased (i.e., yields larger estimates) when between-study variance is moderate to large. 121 At the time of this guidance, the Paule and Mandel method is only provisionally recommended as an alternative to DerSimonian and Laird. 129 , 131 Moreover, Veroniki and colleagues provided evidence that the restricted maximum likelihood estimator 132 is a better alternative to the DerSimonian and Laird estimator of between-study variance for continuous data because it yields similar values for low-to-moderate between-study variance and larger estimates in conditions of high between-study variance. 121

Inconsistency Across Studies

Another statistic that should be generated and interpreted even when Q is not statistically significant is I², the proportion of variability in effect sizes across studies that is explained by heterogeneity rather than by random error; I² is derived from Q and is computed as shown below. 22 , 133

  • Where Q is the heterogeneity statistic (as above),
  • k − 1 is the degrees of freedom,
  • τ² is the between-study variance, and
  • σ² is the within-study variance.
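
With these definitions, the two equivalent expressions for I² are:

$$I^2 = \max\!\left( 0,\; \frac{Q - (k - 1)}{Q} \right) \times 100\%, \qquad I^2 = \frac{\tau^2}{\tau^2 + \sigma^2} \times 100\%,$$

where σ² here represents a “typical” within-study variance across the included studies.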

I² is a metric of how much heterogeneity is influencing the meta-analysis. With a range from 0% (indicating no heterogeneity) to 100% (indicating that all of the observed variance is attributable to heterogeneity), the I² statistic has several advantages over other heterogeneity statistics, including its relative simplicity as a signal-to-noise ratio and its focus on how heterogeneity may be influencing interpretation of the meta-analysis. 59 It is important to note that I² increases with increasing study precision and hence is dependent on sample size. 27 Confidence/uncertainty intervals can be estimated for I² by various means, including Higgins’ test-based method. 22 , 23 Although the assumptions involved in the construction of 95% confidence intervals cannot be justified in all cases, I² confidence intervals based on frequentist assumptions generally provide sufficient coverage of uncertainty in meta-analyses. 133 In small meta-analyses, it has even been proposed that confidence intervals supplement or replace biased point estimates of I². 26 Because I² is based on Q or τ², any problems that influence Q or τ² (most notably the number of trials included in the meta-analysis) will also indirectly affect the computation of I². It is also important to consider that I² depends on which between-study variance estimator is used. For example, there is a high level of agreement between I² derived from the DerSimonian and Laird and the Paule and Mandel methods of estimating between-study variance. 131 In contrast, I² derived from other methods of estimating between-study variance can show low levels of agreement. 131

Based primarily on the observed distributions of I 2 across meta-analyses, there are ranges that are commonly used to further categorize heterogeneity. That is, I 2 values of 25%, 50%, and 75% have been proposed as working definitions of what could be considered low, moderate, and high proportions, respectively, of variability in effect sizes across studies that is explained by heterogeneity. 59 Currently, the Cochrane manual also includes ranges for interpreting I 2 (0%-40% might not be important, 30%-60% may represent moderate heterogeneity, 50-90% may represent substantial heterogeneity and 75-100% may represent considerable heterogeneity). 10 Irrespective of which categorization of I 2 is used, this statistic must be interpreted with the understanding of several nuances, including issues related to a small number of studies (i.e., fewer than 10), 24 – 26 and inherent differences in I 2 comparing binary and continuous effect sizes. 28 , 29 Moreover, I 2 of zero is often misinterpreted in published reports as being synonymous with the absence of heterogeneity despite upper confidence interval limits that most often would exceed 33% when calculated. 134 Finally, a high I 2 does not necessarily mean that dispersion occurs across a wide range of effect sizes, and a low I 2 does not necessarily mean that dispersion occurs across a narrow range of effect sizes; the I 2 is a signal-to-noise metric, not a statistic about the magnitude of heterogeneity.

4.4. Exploring Heterogeneity

Meta-regression.

Meta-regression is a common approach employed to examine the degree to which study-level factors explain statistical heterogeneity. 135 Random effects meta-regression, as compared with fixed effect meta-regression, allows residual heterogeneity (i.e., between-study variance that is not explained by study-level factors) to be incorporated into the model. 136 Because of this feature, among other benefits described below and in Chapter 3 , random effects meta-regression is recommended over fixed effect meta-regression. 137 Several statistical packages use by default a modified estimator of variance in random effects meta-regression that employs a t distribution in lieu of a standard normal distribution when calculating p-values and confidence intervals (i.e., the Knapp-Hartung modification). 138 This approach is recommended to help mitigate the false-positive rates that are common in meta-regression. 137 Since the earliest papers on random effects meta-regression, there has been general caution about the inherently low statistical power of analyses in which there are fewer than 10 studies for each study-level factor modelled. 136 Currently, the Cochrane manual recommends that there be at least 10 studies per characteristic modelled in meta-regression 10 because of the enduring concern about inflated false-positive rates with too few studies. 137 It is also reasonable to adjust the level of statistical significance to account for multiple comparisons when more than one characteristic is being investigated in meta-regression.
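The following is a minimal sketch of the computations behind random effects meta-regression with the Knapp-Hartung modification, not the implementation of any particular package. The between-study variance is assumed to have been estimated separately (e.g., by method of moments or REML), and the data are hypothetical.

```python
import numpy as np
from scipy import stats

def re_metareg_knapp_hartung(y, v, X, tau2):
    """Random effects meta-regression with the Knapp-Hartung adjustment.

    y    : study effect estimates
    v    : within-study variances
    X    : design matrix (first column of ones for the intercept)
    tau2 : between-study variance, estimated separately
    """
    y, v, X = np.asarray(y, float), np.asarray(v, float), np.asarray(X, float)
    k, p = X.shape
    W = np.diag(1.0 / (v + tau2))                  # random-effects weights
    XtWX_inv = np.linalg.inv(X.T @ W @ X)
    b = XtWX_inv @ X.T @ W @ y                     # weighted least-squares coefficients
    resid = y - X @ b
    s2 = float(resid @ W @ resid) / (k - p)        # Knapp-Hartung scale factor
    cov_b = s2 * XtWX_inv                          # (some packages truncate s2 at 1)
    se = np.sqrt(np.diag(cov_b))
    t = b / se
    pval = 2 * stats.t.sf(np.abs(t), df=k - p)     # t reference with k - p df, not normal
    return b, se, pval

# Hypothetical data: 12 trials, effect sizes regressed on mean participant age
rng = np.random.default_rng(1)
age = rng.uniform(40, 70, 12)
y = -0.5 + 0.01 * (age - 55) + rng.normal(0, 0.15, 12)
v = rng.uniform(0.02, 0.08, 12)
X = np.column_stack([np.ones(12), age])
b, se, pval = re_metareg_knapp_hartung(y, v, X, tau2=0.02)
print("coef:", np.round(b, 3), "SE:", np.round(se, 3), "p:", np.round(pval, 3))
```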

Beyond the statistical considerations important in meta-regression, there are also several important conceptual considerations. First, study-level characteristics to be considered in meta-regression should be pre-specified, scientifically defensible, and based on hypotheses. 8 , 10 This first consideration allows investigators to focus on factors that are believed to modify the effect of the intervention as opposed to clinically meaningless study-level characteristics. Arguably, it may not be possible to identify all study-level characteristics that may modify intervention effects; the focus of meta-regression should be on factors that are plausible. Second, meta-regression should be carried out under full consideration of ecological bias (i.e., the inherent problems associated with aggregating individual-level data). 139 As classic examples, meta-regression on mean study age or on the proportion of study participants who were female may suggest relationships that differ from how these modifiers actually operate within each trial. 135

Multiple Meta-regression

It may be desirable to examine the influence of more than one study-level factor on the heterogeneity observed in meta-analyses. Recalling the general cautions and specific recommendations about the inherently low statistical power of analyses in which there are fewer than 10 studies for each study-level factor modelled, 10 , 136 , 137 multiple meta-regression (that is, a meta-regression model with more than one study-level factor) should only be considered when study-level characteristics are pre-specified, scientifically defensible, and based on hypotheses, and when there are 10 or more studies for each study-level factor included in the model.

Subgroup Analysis

Subgroup analysis is another common approach employed to examine the degree to which study-level factors explain statistical heterogeneity. Since subgroup analysis is a type of meta-regression that incorporates a categorical study-level factor as opposed to a continuous study-level factor, it is similarly important that the grouping of studies to be considered in subgroup analysis be pre-specified, scientifically defensible, and based on hypotheses. 8 , 10 Like other forms of meta-regression, subgroup analyses have a high false-positive rate 137 and may be misleading when few studies are included. There are two general approaches to handling subgroups in meta-analysis. First, a common use is to perform meta-analyses within subgroups without any statistical between-group comparisons. A central problem with this approach is the tendency to misinterpret results from within separate groups as being comparative. That is, identification of groups wherein there is a significant summary effect and/or limited heterogeneity, and others wherein there is no significant summary effect and/or substantive heterogeneity, does not necessarily indicate that the subgroup factor explains overall heterogeneity. 10 Second, it is recommended to incorporate the subgrouping factor into a meta-regression framework. 140 Doing so allows for quantification of both within- and among-subgroup heterogeneity as well as formal statistical testing that informs whether the summary estimates are different across subgroups. Moreover, subgroup analysis in a meta-regression framework allows for formal testing of residual heterogeneity in a similar fashion to meta-regression using a continuous study-level factor.
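As an illustration of a formal between-subgroup comparison, the sketch below applies a common Q-based test for subgroup differences to random-effects pooled estimates from each subgroup; the subgroup labels and data are hypothetical. The meta-regression framing recommended above (a dummy-coded subgroup factor) would additionally quantify residual heterogeneity.

```python
import numpy as np
from scipy import stats

def pooled_random_effects(y, v):
    """DerSimonian-Laird random-effects pooled estimate and its variance for one subgroup."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v
    yf = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - yf) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(y) - 1)) / c)
    w_re = 1.0 / (v + tau2)
    return np.sum(w_re * y) / np.sum(w_re), 1.0 / np.sum(w_re)

# Hypothetical subgroups (e.g., trials in older vs. younger populations)
groups = {
    "older":   ([-0.40, -0.55, -0.30, -0.48], [0.05, 0.06, 0.04, 0.07]),
    "younger": ([-0.05,  0.10, -0.12],        [0.05, 0.08, 0.06]),
}
ests, variances = zip(*(pooled_random_effects(y, v) for y, v in groups.values()))
w = 1.0 / np.array(variances)
overall = np.sum(w * np.array(ests)) / np.sum(w)
Q_between = float(np.sum(w * (np.array(ests) - overall) ** 2))
p = stats.chi2.sf(Q_between, df=len(groups) - 1)   # test for subgroup differences
print(f"Q_between = {Q_between:.2f}, p = {p:.3f}")
```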

Detecting Outlying Studies

Given that removal of one or more studies from a meta-analysis may introduce bias into the results, 10 identification of outlier studies may help build the evidence necessary to justify removal. Visual examination of forest, funnel, normal probability, and Baujat plots (described in detail earlier in this chapter) alone may be helpful in identifying studies with inherent outlying characteristics. Additional procedures that may be helpful in interpreting the influence of single studies are quantifying the summary effect without each study (often called one study removed) and performing cumulative meta-analyses. One study removed procedures simply involve sequentially estimating the summary effect without each study to determine whether single studies are having a large influence on model results. Using cumulative meta-analysis, 141 it is possible to graph the accumulation of evidence across trials reporting a treatment effect. Simply put, this approach integrates all information up to and including each trial into summary estimates. By looking at the graphical output (from Stata’s metacum command or the R metafor cumul() function), one can examine large shifts in the summary effect that may serve as evidence for study removal. Another benefit of cumulative meta-analysis is detecting shifts in practice (e.g., guideline changes, new treatment approval or discontinuation) that would foster subgroup analysis.
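The two procedures described above can be illustrated with a short, self-contained sketch; in practice the packaged routines named in the text (e.g., Stata’s metacum or metafor’s cumul()) would normally be used. The effect sizes and variances below are hypothetical and ordered by publication year.

```python
import numpy as np

def dl_pool(y, v):
    """DerSimonian-Laird random-effects pooled estimate for a set of studies."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v
    yf = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - yf) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(y) - 1)) / c)
    w_re = 1.0 / (v + tau2)
    return np.sum(w_re * y) / np.sum(w_re)

# Hypothetical effect sizes (ordered by publication year) and variances
y = np.array([-0.10, -0.45, -0.38, -0.52, -0.05, -0.41])
v = np.array([0.060, 0.050, 0.045, 0.070, 0.055, 0.040])

# "One study removed": pooled estimate with each study left out in turn
loo = [dl_pool(np.delete(y, i), np.delete(v, i)) for i in range(len(y))]
print("leave-one-out estimates:", np.round(loo, 3))

# Cumulative meta-analysis: pooled estimate after each successive study is added
cumulative = [dl_pool(y[: i + 1], v[: i + 1]) for i in range(1, len(y))]
print("cumulative estimates:   ", np.round(cumulative, 3))
```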

Viechtbauer and Cheung proposed other methods that should be considered to help identify outliers. One option is to examine extensions of linear regression residual diagnostics by using studentized deleted residuals. 142 Other options are to examine the difference between the predicted average effect with and without each study (indicating by how many standard deviations the average effect changes) or to examine what effect the deletion of each study has on the fitted values of all studies simultaneously (in a metric similar to Cook’s distance). 142 Particularly in combination, these methods serve as diagnostics that are more formal than visual inspection and than the one study removed and cumulative meta-analysis procedures.

4.5. Special Topics

Baseline risk (control-rate) meta-regression.

For studies with binary outcomes, the “control rate” refers to the proportion of subjects in the control group who experienced the event. The control rate can be viewed as a surrogate for covariate differences between studies because it is influenced by illness severity, concomitant treatment, duration of follow-up, and/or other factors that may differ across studies. 143 , 144 Groups of patients with higher underlying risk for poor outcomes may experience different benefits and/or harms from treatment compared with groups of patients who have lower underlying risk. 145 Hence, the control-rate can be used to test for interactions between underlying population risk at baseline and treatment benefit.

To examine for an interaction between underlying population risk and treatment benefit, we recommend a simplified approach. First, generate a scatter plot of treatment effect against control rate to visually assess whether there may be a relationship between the two. Since the RD tends to be highly correlated with the control rate, 144 we recommend using an RR or OR when examining a treatment effect against the control rate in all steps. The purpose of generating a scatterplot is simply to give preliminary insight into how differences in baseline risk (control rate) may influence the amount of observed variability in effect sizes across studies. Second, use hierarchical meta-regression 144 or Bayesian meta-regression 146 models to formally test the interaction between underlying population risk and treatment benefit. Although a weighted regression has been proposed as an intermediate step between developing a scatter plot and meta-regression, this approach identifies a significant relationship between control rate and treatment effect twice as often as more suitable approaches (above), 144 , 146 and a negative finding would likely need to be replicated using meta-regression. Hence, the simplified two-step approach may help streamline the process.
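A minimal sketch of the first (visual) step is shown below, using hypothetical arm-level 2x2 data and the log OR scale; the second step (hierarchical or Bayesian meta-regression) is not shown.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical trial-level data: events/total in control and treatment arms
events_c = np.array([30, 55, 12, 80, 42])
n_c      = np.array([200, 250, 150, 400, 260])
events_t = np.array([22, 40, 10, 58, 30])
n_t      = np.array([205, 245, 155, 395, 255])

control_rate = events_c / n_c
log_or = np.log((events_t / (n_t - events_t)) / (events_c / (n_c - events_c)))

# Step 1 of the two-step approach: visual check for a relationship between
# treatment effect (log OR, not RD) and the control-group event rate
plt.scatter(control_rate, log_or)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Control-group event rate")
plt.ylabel("log odds ratio")
plt.title("Treatment effect vs. baseline (control) risk")
plt.show()
```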

Multivariate Meta-analysis

There are both inherent benefits and disadvantages of using meta-analysis to examine multiple outcomes simultaneously (that is, “multivariate meta-analysis”), and much methodological work has been done in both frequentist and Bayesian frameworks in recent years. 147 – 156 Some of these methods are readily available in statistical packages (for example, Stata mvmeta ).

One of the advantages of multivariate meta-analysis is being able to incorporate multiple outcomes into one model as opposed to the conduct of multiple univariate meta-analyses wherein the outcomes are handled as being independent. 150 Another advantage of multivariate meta-analysis is being able to gain insight into relationships among study outcomes. 150 , 157 An additional advantage of multivariate meta-analysis is that different clinical conclusions may be made; 150 it may be considered easier to present results from a single multivariate meta-analysis than from several univariate analyses that may make different assumptions. Further, multivariate methods may have the potential to reduce the impact of outcome reporting bias. 150 , 158 , 159

Several challenges of multivariate meta-analysis have also been described, including:

  • the disconnect between how outcomes are handled within each trial (typically in a univariate fashion) compared with a multivariate meta-analysis;
  • estimation difficulties particularly around correlations between outcomes (seldom reported; see Bland 160 for additional commentary);
  • overcoming assumptions of normally-distributed random effects with joint outcomes (difficult to justify with joint distributions);
  • marginal model improvement in the multivariate vs. univariate case (often not sufficient trade off in effort); and
  • amplification of publication bias (e.g., secondary outcomes are not published as frequently). 150

Another potential challenge is the appropriate quantification of heterogeneity in multivariate meta-analysis, but newer alternatives appear to make this less of a concern. These methods include, but are not limited to, the multivariate H 2 statistic (the ratio of a generalization of Q to its degrees of freedom) and an accompanying generalization of I 2 , I H 2 . 163 Finally, limited software for broad implementation of, and access to, multivariate meta-analysis has been a long-standing barrier to this approach. With currently available add-on or base statistical packages, however, multivariate meta-analysis can be more readily performed, 150 and emerging approaches to multivariate meta-analyses are available to be integrated into standard statistical output. 153 However, the gain in precision of parameter estimates is often modest, and the conclusions from the multivariate meta-analysis are often the same as those from univariate meta-analyses of the individual outcomes, 164 which may not justify the increased complexity and difficulty.

With the exception of diagnostic testing meta-analysis (which provides a natural situation to meta-analyze sensitivity and specificity simultaneously, but which is out of scope for this report) and network meta-analysis (a special case of multivariate meta-analysis with unique challenges, see Chapter 5 ), multivariate meta-analysis has not been widely used in practice. However, we are likely to see multivariate meta-analysis approaches become more accessible to stakeholders involved with systematic reviews. 160 In the interim, however, we do not recommend this approach be used routinely.

Dose-Response Meta-analysis

Considering different exposure or treatment levels has been a longstanding concern in meta-analyses involving binary outcomes, 165 , 166 and new methods have been developed to extend this approach to differences in means. 167 Meta-regression is commonly employed to test the relationship between exposure or treatment level and the intervention effect (i.e., dose-response). The best-case scenario for testing dose-response using meta-regression is when there are several trials comparing each dose level with control. That way, subgroup analysis can be performed to provide evidence of effect similarity within groups of studies at each dose, in addition to a gradient of treatment effects across groups. 10 Although incorporating a study-level average dose can be considered, this should only be done when there was limited-to-no variation in dosing within the intervention arms of the included studies. In many instances, exposure needs to be grouped for effective comparison (e.g., ever vs. never exposed), but doing so raises the issues of non-independence and covariance between estimates. 168 Hamling et al. developed a method of deriving relative effect and precision estimates for such alternative comparisons in meta-analysis that is more reasonable than methods that ignore the interdependence of estimates by level. 168 In the case of trials involving differences in means, dose-response models are estimated within each study in a first stage, and an overall curve is obtained by pooling the study-specific dose-response coefficients in a second stage. 167 A key benefit of this emerging approach is the ability to model non-linear dose-response curves of unspecified shape (including the cubic spline described in the derivation study). 167 Considering the inherent low statistical power associated with meta-regression in general, results of dose-response meta-regression should generally not be used to indicate that a dose response does not exist. 10
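To make the two-stage idea concrete, the sketch below fits a linear trend through the referent category within each study and then pools the study-specific slopes by inverse variance. It deliberately ignores the covariance among contrasts that share a referent, which full implementations of dose-response meta-analysis account for; all data are hypothetical.

```python
import numpy as np

def weighted_slope(dose, y, v, ref_index=0):
    """WLS slope of log RR on dose through the referent category (no intercept),
    ignoring covariance among contrasts sharing the referent (a simplification)."""
    dose, y, v = (np.asarray(a, float) for a in (dose, y, v))
    keep = np.arange(len(y)) != ref_index
    x = dose[keep] - dose[ref_index]
    yk, w = y[keep], 1.0 / v[keep]
    slope = np.sum(w * x * yk) / np.sum(w * x ** 2)   # stage 1: study-specific slope
    var = 1.0 / np.sum(w * x ** 2)                    # approximate variance of the slope
    return slope, var

# Hypothetical studies: dose levels, log relative risks vs. the referent, variances
studies = [
    ([0, 5, 10],    [0.0, 0.10, 0.22],       [0.0, 0.02, 0.03]),
    ([0, 4, 8, 12], [0.0, 0.06, 0.15, 0.30], [0.0, 0.02, 0.02, 0.04]),
    ([0, 6, 12],    [0.0, 0.12, 0.20],       [0.0, 0.03, 0.03]),
]
slopes, variances = zip(*(weighted_slope(d, y, v) for d, y, v in studies))
w = 1.0 / np.array(variances)
pooled = np.sum(w * np.array(slopes)) / np.sum(w)     # stage 2: inverse-variance pooling
se = np.sqrt(1.0 / np.sum(w))
print(f"pooled slope (log RR per unit dose): {pooled:.3f} (SE {se:.3f})")
```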

  • Statistical heterogeneity should be expected, visually inspected and quantified, and sufficiently addressed in all meta-analyses.
  • Prediction intervals should be included in all forest plots.
  • Investigators should consider evaluating multiple metrics of heterogeneity, between-study variance, and inconsistency (i.e., Q , τ 2 and I 2 along with their respective confidence intervals when possible).
  • A non-significant Q should not be interpreted as the absence of heterogeneity, and there are nuances to the interpretation of Q that carry over to the interpretation of τ 2 and I 2 .
  • Random effects meta-regression is the preferred approach and should be used with attention to the low power associated with limited studies (i.e., <10 studies per study-level factor) and to the potential for ecological bias.
  • We recommend a simplified two-step approach to control-rate meta-regression that involves scatter plotting and then hierarchical or Bayesian meta-regression.
  • Routine use of multivariate meta-analysis is not recommended.

Chapter 5. Network Meta-Analysis (Mixed Treatment Comparisons/Indirect Comparisons)

5.1. Rationale and Definition

Decision makers, whether patients, providers, or policymakers, generally want head-to-head estimates of the comparative effectiveness of the different interventions from which they have to choose. However, head-to-head trials are relatively uncommon. The majority of trials compare active agents with placebo, which has left patients and clinicians unable to compare across treatment options with sufficient certainty.

Therefore, an approach has emerged to compare agents indirectly. If we know that intervention A is better than B by a certain amount, and we know how B compares with C, we can indirectly infer the magnitude of effect comparing A with C. Occasionally, a very limited number of head-to-head trials are available (i.e., there may be a small number of trials directly comparing A with C). Such trials will likely produce imprecise estimates due to the small sample size and number of events. In this case, the indirect comparisons of A with C can be pooled with the direct comparisons to produce what is commonly called a network meta-analysis (NMA) estimate. The rationale for producing such an aggregate estimate is to increase precision and to utilize all the available evidence for decision making.

Frequently, more than two active interventions are available and stakeholders want to compare (rank) many interventions, creating a network of interventions with comparisons accounting for all the permutations of pairings within the network. The following guidance focuses on NMA of randomized controlled trials. NMA of nonrandomized studies is statistically possible; however, without randomization, NMA assumptions would likely not be satisfied and the results would not be reliable.

5.2. Assumptions

There are three key assumptions required for network meta-analysis to be valid:

I. Homogeneity of direct evidence

When important heterogeneity (unexplained differences in treatment effect) across trials is noted, confidence in a pooled estimate decreases. 169 This is true for any meta-analysis. In an NMA, direct evidence (within each pairwise comparison) should be sufficiently homogeneous. This can be evaluated using the standard methods for evaluating heterogeneity ( I 2 statistic, τ 2 , Cochran Q test, and visual inspection of forest plots for consistency of point estimates from individual trials and overlap of confidence intervals).

II. Transitivity, similarity or exchangeability

Patients enrolled in trials of different comparisons in a network need to be sufficiently similar in terms of the distribution of effect modifiers. In other words, patients should be similar to the extent that it is plausible that they were equally likely to have received any of the treatments in the network. 170 Similarly, active and placebo controlled interventions across trials need to be sufficiently similar in order to attribute the observed change in effect size to the change in interventions.

Transitivity cannot be assessed quantitatively. However, it can be evaluated conceptually. Researchers need to identify important effect modifiers in the network and assess whether differences reported by studies are large enough to affect the validity of the transitivity assumption.

III. Consistency (Between Direct and Indirect Evidence)

Comparing direct and indirect estimates in closed loops in a network demonstrates whether the network is consistent (previously called coherent). Important differences between direct and indirect evidence may invalidate combining them in a pooled NMA estimate.

Consistency refers to the agreement between indirect and direct comparison for the same treatment comparison. If a pooled effect size for a direct comparison is similar to the pooled effect size from indirect comparison, we say the network is consistent; otherwise, the network is inconsistent or incoherent. 171 , 172 Multiple causes have been proposed for inconsistency, such as differences in patients, treatments, settings, timing, and other factors.

Statistical models have been developed to assume consistency in the network (consistency models) or account for inconsistency between direct and indirect comparison (inconsistency models). Consistency is a key assumption/prerequisite for a valid network meta-analysis and should always be evaluated. If there is substantial inconsistency between direct and indirect evidence, a network meta-analysis should not be performed. Fortunately, inconsistency can be evaluated statistically.

5.3. Statistical Approaches

The simplest indirect comparison approach is to qualitatively compare the point estimates and the overlap of confidence intervals from two direct comparisons that use a common comparator. Two treatments are likely to have comparable effectiveness if their direct effects relative to a common comparator (e.g., placebo) have the same direction and magnitude, and if there is considerable overlap in their confidence intervals. However, such qualitative comparisons have to be interpreted cautiously because the degree to which confidence intervals overlap is not a reliable substitute for formal hypothesis testing. Formal testing methods adjust the comparison of the interventions by the results of their direct comparison with a common control group and at least partially preserve the advantages of randomization of the component trials. 173

Many statistical models for network meta-analysis have been developed and applied in the literature. These models range from simple indirect comparisons to more complex mixed effects and hierarchical models, developed in both Bayesian and frequentist frameworks, and using both contrast level and arm level data.

Simple Indirect Comparisons

Simple indirect comparisons apply when there is no closed loop in the evidence network. A closed loop means that each comparison in a particular loop has both direct and indirect evidence. At least three statistical methods are available to conduct simple indirect comparisons: (1) the adjusted indirect comparison method proposed by Bucher et al, 174 (2) logistic regression, and (3) random effects meta-regression.

When there are only two sets of trials that share a common comparator B, say A vs. B and C vs. B, Bucher’s method is sufficient to provide the indirect estimate of A vs. C as: log(OR AC ) = log(OR AB ) - log(OR CB ) and

Var(Log(OR AC )) = Var(Log(OR AB )) + Var(Log(OR CB )), where OR is the odds ratio. Bucher’s method is valid only under a normality assumption on the log scale.
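A minimal sketch of this calculation is shown below, reconstructing the variances of the log odds ratios from reported 95% confidence intervals; the pooled odds ratios and intervals are hypothetical.

```python
import math

def bucher_indirect(or_ab, ci_ab, or_cb, ci_cb, z=1.96):
    """Adjusted indirect comparison of A vs. C from A vs. B and C vs. B (Bucher's method).

    or_xy : pooled odds ratio for X vs. Y
    ci_xy : its 95% confidence interval as (lower, upper)
    """
    log_or_ac = math.log(or_ab) - math.log(or_cb)
    # Back-calculate variances of the log odds ratios from the 95% CIs
    var_ab = ((math.log(ci_ab[1]) - math.log(ci_ab[0])) / (2 * z)) ** 2
    var_cb = ((math.log(ci_cb[1]) - math.log(ci_cb[0])) / (2 * z)) ** 2
    se_ac = math.sqrt(var_ab + var_cb)
    return (math.exp(log_or_ac),
            math.exp(log_or_ac - z * se_ac),
            math.exp(log_or_ac + z * se_ac))

# Hypothetical pooled results: A vs. B and C vs. B (each against the common comparator B)
or_ac, lo, hi = bucher_indirect(0.70, (0.55, 0.89), 0.90, (0.72, 1.12))
print(f"indirect OR (A vs. C) = {or_ac:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```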

Logistic regression uses arm-level dichotomous outcomes data and is limited to odds ratios as the measure of effect. By contrast, meta-regression and adjusted indirect comparisons typically use contrast-level data and can be extended to risk ratios, risk differences, mean difference and any other effect measures. Under ideal circumstances (i.e., no differences in prognostic factors exist among included studies), all three methods result in unbiased estimates of direct effects. 175 Meta-regression (as implemented in Stata, metareg ) and adjusted indirect comparisons are the most convenient approaches for comparing trials with two treatment arms. A simulation study supports the use of random effects for either of these approaches. 175

Mixed Effects and Hierarchical Models

More complex statistical models are required for more complex networks with closed loops where a treatment effect could be informed by both direct and indirect evidence. These models typically assume random treatment effects and take the complex data structure into account, and may be broadly categorized as mixed effects, or hierarchical models.

Frequentist Approach

Lumley proposed the term “network meta-analysis” and the first network meta-analysis model in the frequentist framework, constructing a random-effects inconsistency model that incorporates sampling variability, heterogeneity, and inconsistency. 176 The inconsistency follows a common random-effects distribution with a mean of 0. The model can use arm-level and contrast-level data and can be easily implemented in statistical software, including the lme function in R. However, studies included in the meta-analysis cannot have more than two arms.

Further development of network meta-analysis models in the frequentist framework addressed how to handle multi-armed trials as well as new methods of assessing inconsistency. 171 , 177 – 179 Salanti et al. provided a general network meta-analysis formulation with either contrast-based data or arm-based data, and defined the inconsistency in a standard way as the difference between ‘direct’ evidence and ‘indirect’ evidence. 177 In contrast, White et al. and Higgins et al. proposed to use a treatment-by-design interaction to evaluate inconsistency of evidence, and developed consistency and inconsistency models based on contrast-based multivariate random effects meta-regression. 171 , 178 These models can be implemented using network , a suite of commands in Stata with input data being either arm-level or contrast level.

Bayesian Approach

Lu and Ades proposed the first Bayesian network meta-analysis model for multi-arm studies that included both direct and indirect evidence. 180 The treatment effects are represented by basic parameters and functional parameters. Basic parameters are effect parameters that are directly compared to the baseline treatment, and functional parameters are represented as functions of basic parameters. Evidence inconsistency is defined as a function of a functional parameter and at least two basic parameters. The Bayesian model has been extended to incorporate study-level covariates in an attempt to explain between-study heterogeneity and reduce inconsistency, 181 to allow for repeated measurements of a continuous endpoint that varies over time, 87 or to appraise novelty effects. 182 A Bayesian multinomial network meta-analysis model was also developed for unordered (nominal) categorical outcomes allowing for partially observed data in which exact event counts may not be known for each category. 183 Additionally, Dias et al. set out a generalized linear model framework for the synthesis of data from randomized controlled trials, which could be applied to binary outcomes, continuous outcomes, rate models, competing risks, or ordered category outcomes. 86

Commonly, a vague (flat) prior is chosen for the treatment effect and heterogeneity parameters in Bayesian network meta-analysis. A vague prior distribution for heterogeneity, however, may not be appropriate when the number of studies is small. 184 An informative prior for heterogeneity can be obtained from empirically derived predictive distributions for the degree of heterogeneity expected in various settings (depending on the outcomes assessed and comparisons made). 185 In the NMA framework, frequentist and Bayesian approaches often provide similar results, particularly because of the common practice of using non-informative priors in Bayesian analyses. 186 – 188 Frequentist approaches, when implemented in a statistical package, are easily applied in real-life data analysis. Bayesian approaches are highly adaptable to complex evidence structures and provide a very flexible modeling framework, but they require a better understanding of model specification and specialized programming skills.

Arm-Based Versus Contrast-Based Models

It is important to differentiate arm-based/contrast-based models from arm-level/contrast-level data. Arm-level and contrast-level data describe how outcomes are reported in the original studies. Arm-level data represent raw data per study arm (e.g., the number of events from a trial per group); while contrast-level data show the difference in outcomes between arms in the form of absolute or relative effect size (e.g., mean difference or the odds ratio of events).

Contrast-based models resemble the traditional approaches used in meta-analysis of direct comparisons. Absolute or relative effect sizes and associated variances are first estimated (per study) and then pooled to produce an estimate of the treatment comparison. Contrast-based models preserve randomization and, largely, alleviate risk of observed and unobserved imbalance between arms within a study. They use effect sizes relative to the comparison group and reduce the variability of outcomes across studies. Contrast-based models are the dominant approach used in direct meta-analysis and network meta-analysis in current practice.

Arm-based models depend on directly combining the observed absolute effect size in individual arms across studies; thereby producing a pooled rate or mean of the outcome per arm. Estimates can be compared among arms to produce a comparative effect size. Arm-based models break randomization; therefore, the comparative estimate will likely be at an increased risk of bias. Following this approach, nonrandomized studies or even noncomparative studies can be included in the analysis. Multiple models have been proposed for the arm-based approach, especially in the Bayesian framework. 177 , 189 – 192 However, the validity of arm-based methods is under debate. 178 , 193 , 194

Assessing Consistency

Network meta-analysis generates results for all pairwise comparisons; however, consistency can only be evaluated when at least one closed loop exists in the network. In other words, at least one treatment comparison in the network must be informed by both direct and indirect evidence. Many statistical methods are available to assess consistency. 173 , 174 , 176 , 195 – 200

These methods can generally be categorized into two types: (1) an overall consistency measure for the whole network; and (2) a loop-based approach in which direct and indirect estimates are compared. In the following section, we will focus on a few widely used methods in the literature.

  • Single Measure for Network Consistency : These approaches use a single measure that represents consistency for the whole network. Lumley assumes that, for each treatment comparison (with or without direct evidence), there is a different inconsistency factor; and the inconsistency factor varies for all treatment comparisons and follows a common random-effects distribution. The variance of the differences, ω, also called incoherence, measures the overall inconsistency of the network. 176 A ω above 0.25 suggests substantial inconsistency and in this case, network meta-analysis may be considered inappropriate. 201
  • Global Wald Test : Another approach is to use global Wald test, which tests an inconsistency factor that follows a Χ 2 distribution under the null consistency assumption. 178 A p-value less than 0.10 can be used to determine statistical significance. Rejection of the null is evidence that the model is not consistent.
  • Z-test : A simple z-test can be used to compare the difference of the pooled effect sizes between direct and indirect comparisons. 174 Benefits of this approach include simplicity, ease of application, and the ability to identify specific loops with large inconsistency (see the sketch after this list). Limitations include the need for multiple correlated tests.
  • Side-splitting : A “node” is a treatment and a “side” (or edge) is a comparison. Dias et al. suggest that each comparison can be assessed by comparing the pooled estimate from direct evidence with the pooled estimate from the remaining (indirect) evidence. 196 Side-splitting (sometimes referred to as node-splitting) can be implemented using the Stata network sidesplit command or the R gemtc package.
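A minimal sketch of the z-test for a single loop, on the log odds ratio scale, is shown below; the direct and indirect estimates and their standard errors are hypothetical.

```python
import math
from scipy import stats

def direct_vs_indirect_z(d_direct, se_direct, d_indirect, se_indirect):
    """z-test comparing pooled direct and indirect estimates (on the log scale)
    for the same treatment comparison within one loop."""
    diff = d_direct - d_indirect                       # inconsistency factor for the loop
    se_diff = math.sqrt(se_direct ** 2 + se_indirect ** 2)
    z = diff / se_diff
    p = 2 * stats.norm.sf(abs(z))
    return diff, z, p

# Hypothetical loop: direct log OR = -0.30 (SE 0.15); indirect log OR = 0.05 (SE 0.20)
diff, z, p = direct_vs_indirect_z(-0.30, 0.15, 0.05, 0.20)
print(f"inconsistency factor = {diff:.2f}, z = {z:.2f}, p = {p:.3f}")
```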

Several graphical tools have been developed to describe inconsistency. One is the inconsistency plot developed by Chaimani et al. 197 Similar to a forest plot, the inconsistency plot graphically presents an inconsistency factor (the absolute difference between the direct and indirect estimates) and related confidence interval for each of the triangular and quadratic loops in the network. The Stata ifplot command can be used for this purpose.

It is important to understand the limitations of these methods. Lack of statistical significance of an inconsistency test does not prove consistency in the network. Similar to Cochran’s Q test of heterogeneity testing in traditional meta-analysis (which is often underpowered), statistical tests for inconsistency in NMA are also commonly underpowered due to the limited number of studies in direct comparisons.

When substantial inconsistency between direct and indirect evidence is identified, investigators have several options:

  • Abandon NMA and only perform traditional meta-analysis;
  • Present the results from inconsistency models (that incorporate inconsistency) and acknowledge the limited trustworthiness of the NMA estimates;
  • Split the network to eliminate the inconsistent nodes;
  • Attempt to explain the causes of inconsistency by conducting network meta-regression to test for possible covariates causing the inconsistency; and
  • Use only direct estimates for the pairwise NMA comparisons that show inconsistency (i.e., use direct estimates for inconsistent comparisons and use NMA estimates for consistent comparisons).

5.4. Considerations of Model Choice and Software

Consideration of indirect evidence.

Empirical explorations suggest that direct and indirect comparisons often agree, 174 – 176 , 202 – 204 but with notable exceptions. 205 In principle, the validity of combining direct and indirect evidence relies on the transitivity assumption. In practice, however, trials can vary in numerous ways, including population characteristics, interventions and cointerventions, length of follow-up, loss to follow-up, and study quality. Given the limited information in many publications and the inclusion of multiple treatments, the validity of combining direct and indirect evidence is often unverifiable. The statistical methods to evaluate inconsistency generally have low power and are confounded by the presence of statistical heterogeneity; they often fail to detect inconsistency in the evidence network.

Moreover, network meta-analysis, like all other meta-analytic approaches, constitutes an observational study, and residual confounding can always be present. Systematic differences in characteristics among trials in a network can bias network meta-analysis results. In addition, all other considerations for meta-analyses, such as the choice of effect measures or heterogeneity, also apply to network meta-analysis. Therefore, in general, investigators should compare competing interventions based on direct evidence from head-to-head RCTs whenever possible. When head-to-head RCT data are sparse or unavailable but indirect evidence is sufficient, investigators may consider incorporating indirect evidence and network meta-analysis as an additional analytical tool. If the investigators choose to ignore indirect evidence, they should explain why.

Choice of Method

Although the development of network meta-analysis models has exploded in the last 10 years, there has been no systematic evaluation of their comparative performance, and the validity of the model assumptions in practice is generally hard to verify.

Investigators may choose a frequentist or Bayesian mode of inference based on the research team expertise, the complexity of the evidence network, and/or the research question. If investigators believe that the use of prior information is needed and that the data are insufficient to capture all the information available, then they should use a Bayesian model. On the other hand, a frequentist model is appropriate if one wants inferences to be based only on the data that can be incorporated into a likelihood.

Whichever method the investigators choose, they should assess the consistency of the direct and indirect evidence, the invariance of treatment effects across studies, and the appropriateness of the chosen method on a case-by-case basis, paying special attention to comparability across different sets of trials. Investigators should explicitly state the assumptions underlying indirect comparisons and conduct sensitivity analysis to check those assumptions. If the results are not robust, findings from indirect comparisons should be considered inconclusive, and interpretation of findings should explicitly address these limitations. Investigators should also note that simple adjusted indirect comparisons are generally underpowered, needing four times as many equally sized studies to achieve the same power as direct comparisons, and frequently lead to indeterminate results with wide confidence intervals. 174 , 175

When the evidence of a network of interventions is consistent, investigators can combine direct and indirect evidence using network meta-analysis models. Conversely, they should refrain from combining multiple sources of evidence from an inconsistent (i.e., incoherent) network where there are substantial differences between direct and indirect evidence that cannot be resolved by conditioning on the known covariates. Investigators should make efforts to explain the differences between direct and indirect evidence based upon study characteristics, though little guidance and consensus exists on how to interpret the results.

Lastly, the network geometry ( Figure 5.1 ) can also affect the choice of analysis method as demonstrated in Table 5.1 .

Figure 5.1. Common network geometry (simple indirect comparison, star, network with at least one closed loop).

Table 5.1. Impact of network geometry on choice of analysis method.


Commonly Used Software

Many statistical packages are available to implement NMA. BUGS software (Bayesian inference Using Gibbs Sampling; WinBUGS, OpenBUGS) is a popular choice for conducting Bayesian NMA 206 that offers flexible model specification including NMA meta-regression. JAGS and Stan are alternative choices for Bayesian NMA. Stata provides user-written routines ( http://www.mtm.uoi.gr/index.php/stata-routines-for-network-meta-analysis ) that can be used to conduct frequentist NMA. In particular, the Stata command network is a suite of programs for importing data for network meta-analysis, running a contrast-based network meta-analysis, assessing inconsistency, and graphing the data and results. Further, in the R environment, three packages, gemtc ( http://cran.r-project.org/web/packages/gemtc/index.html ), pcnetmeta ( http://cran.r-project.org/web/packages/pcnetmeta/index.html ), and netmeta ( http://cran.r-project.org/web/packages/netmeta/index.html ), have been developed for Bayesian ( gemtc , pcnetmeta ) or frequentist ( netmeta ) NMA. The packages also include methods to assess heterogeneity and inconsistency, and data visualizations, and allow users to perform NMA with minimal programming. 207

5.5. Inference From Network Meta-analysis

Stakeholders (users of evidence) require a rating of the strength of a body of evidence. The strength of evidence reflects how much certainty we should have in the estimates.

The general framework for assessing the strength of evidence used by the EPC program is described elsewhere. For NMA, however, guidance is evolving and may require some additional computations; therefore, we briefly discuss possible approaches to rating the strength of evidence. We also discuss inference from the rankings and probabilities commonly presented with a network meta-analysis.

Approaches for Rating the Strength of Evidence

The original EPC and GRADE guidance was simple and involved rating down all evidence derived from indirect comparisons (or NMA with mostly indirect evidence) for indirectness. Therefore, following this original GRADE guidance, evidence derived from most NMAs would be rated to have moderate strength at best. 208 Subsequently, Salanti et al. evaluated the transitivity assumption and network inconsistency under the indirectness and inconsistency domains of GRADE respectively. They judged the risk of bias based on a ‘contribution matrix’ which gives the percentage contribution of each direct estimate to each network meta-analysis estimate. 209 A final global judgment of the strength of evidence is made for the overall rankings in a network.

More recently, GRADE published a new approach that is based on evaluating the strength of evidence for each comparison separately rather than making a judgment on the whole network. 210 The rationale for not making such an overarching judgment is that the strength of evidence (certainty in the estimates) is expected to be different for different comparisons. The approach requires presenting the three estimates for each comparison (direct, indirect, and network estimates), then rating the strength of evidence separately for each one.

In summary, researchers conducting NMA should present their best judgment on the strength of evidence to facilitate decision-making. Innovations and newer methodology are constantly evolving in this area.

Interpreting Ranking Probabilities and Clinical Importance of Results

Network meta-analysis results are commonly presented as probabilities of being most effective and as rankings of treatments. Results are also presented as the surface under the cumulative ranking curve (SUCRA). SUCRA is a simple transformation of the mean rank that is used to provide a hierarchy of the treatments accounting both for the location and the variance of all relative treatment effects. SUCRA would be 1 when a treatment is certain to be the best and 0 when a treatment is certain to be the worst. 211 Such presentations should be interpreted with caution since they can be quite misleading.

  • Such estimates are usually very imprecise. An empirical evaluation of 58 NMAs showed that the median width of the 95% CIs of SUCRA estimates was 65% (the first quartile was 38%; the third quartile was 80%). In 28% of networks, there was a 50% or greater probability that the best-ranked treatment was actually not the best. No evidence showed a difference between the best-ranked intervention and the second or third best-ranked interventions in 90% and 71% of comparisons, respectively.
  • When rankings suggest superiority of an agent over others, the absolute difference between this intervention and other active agents could be trivial. Converting the relative effect to an absolute effect is often needed to present results that are meaningful to clinical practice and relevant to decision making. 212 Such results can be presented for patient groups with varying baseline risks. The source of baseline risk can be obtained from observational studies judged to be most representative of the population of interest, from the average baseline risk of the control arms of the randomized trials included in meta-analysis, or from a risk stratification tool if one is known and commonly used in practice. 213
  • Rankings hide the fact that each comparison may have its own risk of bias, limitations, and strength of evidence.
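For reference, SUCRA as described above can be computed directly from a rank-probability matrix produced by an NMA; the following sketch uses hypothetical probabilities for four treatments.

```python
import numpy as np

# Hypothetical rank-probability matrix: rows = treatments,
# columns = ranks (column 0 = probability of being ranked best)
rank_probs = np.array([
    [0.60, 0.25, 0.10, 0.05],   # treatment A
    [0.25, 0.40, 0.25, 0.10],   # treatment B
    [0.10, 0.25, 0.40, 0.25],   # treatment C
    [0.05, 0.10, 0.25, 0.60],   # treatment D
])

a = rank_probs.shape[1]                          # number of treatments
cum = np.cumsum(rank_probs, axis=1)[:, : a - 1]  # P(ranked in the top k), k = 1..a-1
sucra = cum.sum(axis=1) / (a - 1)                # 1 = certainly best, 0 = certainly worst
for name, s in zip("ABCD", sucra):
    print(f"SUCRA({name}) = {s:.2f}")
```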

5.6. Presentation and Reporting

Reporting of an NMA should include:

  • Rationale for conducting an NMA, the mode of inference (e.g., Bayesian, frequentist), and the model choice (random effects vs. fixed effects; consistency vs. inconsistency model; common heterogeneity assumption; etc.);
  • Software and syntax/commands used;
  • Choice of priors for any Bayesian analyses;
  • Graphical presentation of the network structure and geometry;
  • Pairwise effect sizes to allow comparative effectiveness inference; and
  • Assessment of the extent of consistency between the direct and indirect estimates.
  • A network meta-analysis should always be based on a rigorous systematic review.
  • Homogeneity of direct evidence
  • Transitivity, similarity, or exchangeability
  • Consistency (between direct and indirect evidence)
  • Investigators may choose a frequentist or Bayesian mode of inference based on the research team’s expertise, the complexity of the evidence network, and the research question.
  • Evaluating inconsistency is a major and mandatory component of network meta-analysis.
  • Evaluating inconsistency should not be based only on conducting a global test. A loop-based approach can identify the comparisons that cause inconsistency.
  • Inference based on the rankings and probabilities of treatments being most effective should be used cautiously. Rankings and probabilities can be misleading and should be interpreted based on the magnitude of pairwise effect sizes. Differences across interventions may not be clinically important despite such rankings.
Future Research Suggestions

The following are suggestions for directions in future research for each of the topics by chapter.

Chapter 1. Decision To Combine Trials

  • Guidance regarding the minimum number of trials one can validly pool at given levels of statistical heterogeneity
  • Research on ratio of means—both clinical interpretability and mathematical consistency across studies compared with standardized mean difference
  • Research on use of ANCOVA models for adjusting baseline imbalance
  • Software packages that more easily enable use of different information
  • Methods to handle zeros in the computation of binary outcomes
  • Evidence on which metrics, and language used to describe these metrics, are most helpful in conveying meta-analysis results to multiple stakeholders
  • Evaluate newly developed statistical models for combining typical effect measures (e.g., mean difference, OR, RR, and/or RD) and compare with current methods
  • Heterogeneity statistics for meta-analyses involving a small number of studies
  • Guidance on specification of hypotheses in meta-regression
  • Guidance on reporting of relationships among study outcomes to facilitate multivariate meta-analysis

Chapter 5. Network Meta-analysis (Mixed Treatment Comparisons/Indirect Comparisons)

  • Methods for combining individual patient data with aggregated data
  • Methods for integrating evidence from RCTs and observational studies
  • Models for time-to-event data
  • User friendly software similar to that available for traditional meta-analysis
  • Evidence to support model choice

This report is based on research conducted by the Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Centers’ 2016 Methods Workgroup. The findings and conclusions in this document are those of the authors, who are responsible for its contents; the findings and conclusions do not necessarily represent the views of AHRQ. Therefore, no statement in this report should be construed as an official position of AHRQ or of the U.S. Department of Health and Human Services.

None of the investigators have any affiliations or financial involvement that conflicts with the material presented in this report.

This research was funded through contracts from the Agency for Healthcare Research and Quality to the following Evidence-based Practice Centers: Mayo Clinic (290-2015-00013-I); Kaiser Permanente (290-2015-00007-I); RAND Corporation (290-2015-00010-I); Alberta (290-2015-00001-I); Pacific Northwest (290-2015-00009-I); RTI (290-2015-00011-I); Brown (290-2015-00002-I); and the Scientific Resource Center (290-2012-00004-C).

The information in this report is intended to help health care decisionmakers—patients and clinicians, health system leaders, and policy makers, among others—make well-informed decisions and thereby improve the quality of health care services. This report is not intended to be a substitute for the application of clinical judgment. Anyone who makes decisions concerning the provision of clinical care should consider this report in the same way as any medical reference and in conjunction with all other pertinent information (i.e., in the context of available resources and circumstances presented by individual patients).

This report is made available to the public under the terms of a licensing agreement between the author and the Agency for Healthcare Research and Quality. This report may be used and reprinted without permission except those copyrighted materials that are clearly noted in the report. Further reproduction of those copyrighted materials is prohibited without the express permission of copyright holders.

AHRQ or U.S. Department of Health and Human Services endorsement of any derivative products that may be developed from this report, such as clinical practice guidelines, other quality enhancement tools, or reimbursement or coverage policies may not be stated or implied.

Persons using assistive technology may not be able to fully access information in this report. For assistance, contact epc@ahrq.hhs.gov .

Suggested citation: Morton SC, Murad MH, O’Connor E, Lee CS, Booth M, Vandermeer BW, Snowden JM, D’Anci KE, Fu R, Gartlehner G, Wang Z, Steele DW. Quantitative Synthesis—An Update. Methods Guide for Comparative Effectiveness Reviews. (Prepared by the Scientific Resource Center under Contract No. 290-2012-0004-C). AHRQ Publication No. 18-EHC007-EF. Rockville, MD: Agency for Healthcare Research and Quality; February 2018. Posted final reports are located on the Effective Health Care Program search page. https://doi.org/10.23970/AHRQEPCMETHGUIDE3

Prepared for: Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services, 5600 Fishers Lane, Rockville, MD 20857, www.ahrq.gov Contract No.: 290-2012-00004-C . Prepared by: Scientific Resource Center, Portland, OR


statistics

  • Contributors
  • Valuing Black Lives
  • Black Issues in Philosophy
  • Blog Announcements
  • Climate Matters
  • Genealogies of Philosophy
  • Graduate Student Council (GSC)
  • Graduate Student Reflection
  • Into Philosophy
  • Member Interviews
  • On Congeniality
  • Philosophy as a Way of Life
  • Philosophy in the Contemporary World
  • Precarity and Philosophy
  • Recently Published Book Spotlight
  • Starting Out in Philosophy
  • Syllabus Showcase
  • Teaching and Learning Video Series
  • Undergraduate Philosophy Club
  • Women in Philosophy
  • Diversity and Inclusiveness
  • Issues in Philosophy
  • Public Philosophy
  • Work/Life Balance
  • Submissions
  • Journal Surveys
  • APA Connect

Logo

The Teaching Workshop: The Analytic Synthetic Distinction

Welcome again to The Teaching Workshop, where your questions related to pedagogy are answered. Each post features questions submitted by readers with answers from others within the profession. Have a question? Send it to PhilTeacherWorkshop@gmail.com , or participate in the APA Teaching Workshop on Facebook.

I find myself having to teach the analytic-synthetic distinction in almost every class, as the distinction arises in our readings. In an upper-level epistemology or metaphysics class, you have the time to explain the background of the distinction, why it is important, and how it has evolved. But in an introductory philosophy or introductory ethics class, you often have only 10 minutes or so to explain the distinction before you have to move on to the rest of the reading. What should I tell introductory philosophy students about the analytic-synthetic distinction? What should I say about why philosophers think this distinction is so important?

From Edgar Valdez :

I think it is important to remember that even though the initial discussion of the analytic-synthetic distinction can be dealt with in a few minutes, it is a lesson that requires reinforcement throughout the semester.  Invariably teaching the analytic-synthetic distinction first arises in my introductory courses around the same time as the introduction of the synthetic a priori , even if not the same day. This timing is usually helpful as the introductory remarks I have on the distinction are best situated in conjunction with some remarks on the a priori – a posteriori distinction. I emphasize that these are two different distinctions we can make regarding truth judgments and contrasting them often helps draw out the salient features. While the a priori – a posteriori distinction concerns how we are in a position to make certain judgments, the analytic-synthetic distinction is about the information contained in the judgment. Put another way, the first describes how we can say something, while the second describes the kind of thing we can say. I mention at this point that this kind of distinction sets up a battle that we will later consider between method and content as a way of emphasizing the distinction.

Before considering some examples, I go over two (for our purposes equivalent) ways of describing an analytic judgment: a judgment that is true by definition and a judgment whose predicate is contained within the subject. This language usually leads me to open the floor to any self-proclaimed grammar gurus to explain the difference between a subject and a predicate. I resist the temptation to use mathematical judgments as examples (since they are the ones Kant will only complicate later when we turn to the synthetic a priori) and stick to genus-species examples for analytic judgments. Among others, I regularly use “the cat is a mammal” and “the bachelor is a male.” This is the kind of information that is contained within the subject. In the sense of being true by definition, our very expression of what the subject is calls for expressing the predicate in question. A cat is a mammal that… A bachelor is a male that… Without getting into concept intension and extension, I explain the idea of a predicate being contained within a subject as meaning that the predicate is one of the concepts we would list if we were thinking of all the concepts we would need to establish the concept of the subject.

When it comes to synthetic judgments, I find adding visual predicates to the analytic examples to be the least ambiguous step. The cat is black. The bachelor is tall (and depending on the class, a reference to the television show might get some laughter).  Blackness and tallness are not predicates we would ever arrive at simply by considering our concepts of cat and bachelor. And of course, we can think of lots of cats that are not black, though they are no less cats. And likewise lots of short bachelors. Synthetic judgments add to the concept of the subject. At this point, I will do some quick etymology on the words analysis and synthesis to emphasize that in one case, we are breaking down the concept of the subject and in another we are putting something with the concept of the subject.

To talk about the importance of the distinction, I usually—with the help of Hume—explain that synthetic judgments often come at the cost of experience and that if we hope to maintain an a priori method, we are usually confined to analytic judgments. The examples I have chosen then set up a nice contrast between two kinds of possible explanations of the world. How different are the worldviews of those that think that philosophy should aspire to a posteriori synthetic judgments and those that think we should stick to a priori analytic judgments? Since introductory philosophy students tend to have empiricist tendencies, I jokingly ask them if they would want to sign up for the course where we discuss that all cats are mammals and all bachelors are male. Though, of course, that’s what we just did.

From Gillian Russell :

I may actually be the unique worst person in the world to answer this question, as I’ve reached a point where I need 5 hours to talk about Quine on the topic.  So I estimate that you’d want to reserve … let’s see about … 30 hours for the distinction more generally…

I think a common problem that intro students have with analyticity is seeing how it is different from necessity and apriority. I suspect one reason for this is that the standard examples of such claims tend to be similar, and as a result adequately explaining the ideas requires more than just those examples. They also need: i) a short gloss for each term, ii) some examples of philosophers with motivated reasons for thinking they come apart or go together (e.g. positivists and Kant respectively), and iii) examples of views where analyticity is expected to do some heavy lifting (e.g. a priori knowledge or the linguistic doctrine of necessary truth).

With intro students I’d be happy to use the glosses below and suggest that they learn them off by heart.  I’d list them on a board off to the side (building up the list and the glosses as I went along) or on a class handout.

  • analytic — true in virtue of meaning alone
  • synthetic — true in virtue of both meaning and the way the world is
  • a priori — justification is independent of experience
  • a posteriori — justification dependent on experience
  • necessary — could not have been otherwise
  • contingent — could have been otherwise

Here are two things you can do in your ten minutes:

  • Telling the Two-Factor Story .

You can introduce the analytic/synthetic distinction by telling the “two-factor” story, which goes a bit like this:

Normally, sentences are true in part because of what they mean, and in part because of the way the world is.  The sentence “snow is white” is true in part because snow is a certain colour (if it were black the sentence would be false) and in part because of what the sentence means (if “snow is white” meant what “2+2=5” means then the sentence would be false).  Normal sentences like that are called synthetic .

Analytic sentences, by contrast, are supposed to be special:  they are true in virtue of their meaning alone, and so no matter what the world is like they will be true.

Whether or not any such sentences exist is controversial, but a frequently discussed example is “all bachelors are unmarried”.    The idea is that the meaning of the words in this sentence is sufficient to guarantee its truth.  So a good way to remember what “analytic” means is to memorise the gloss “true in virtue of meaning.”

  • Analyticity and A priori Knowledge .

The two-factor story is one way to introduce the bare distinction, but I also want to help students see why the distinction is important and how it differs from related distinctions.

So next I would introduce the distinction between a priori and a posteriori knowledge, giving putative examples of each, and being careful to use different examples from the ones used for analyticity (maybe “sugar dissolves in water” for the a posteriori and “2+2=4” for the a priori ).  I would emphasise that the status of these examples is up for review later if we decide we’ve made a mistake (perhaps nothing is a priori ?)

I’d then raise the question of how a priori knowledge is possible. We have some understanding of the mechanism that allows us to know that sugar dissolves in water: visual perception.  But how do we know things about numbers?

Analyticity might help us answer that question. Why are we justified in thinking that “all bachelors are unmarried” is true? It is not as if we went out and interviewed all the bachelors and asked them whether they were unmarried. We seem to be able to know that it is true without perceiving any particular bachelors. Perhaps that is because we know what “bachelor” means, and this tells us that in order to count as a bachelor, you have to be unmarried. Our knowledge of the meaning of the sentence seems sufficient for knowledge of its truth.

Now,  if “2+2=4” and other a priori truths were like this, then we might be able to know their truth by understanding the symbols and words they contain:  just as you can know that all bachelors are unmarried without going out and surveying bachelors, so perhaps you can know that 2+2=4 without being able to perceive abstract objects like the number 4.   Analyticity might provide an epistemology for mathematics that explains how it can be a priori .

Is analyticity the only way we can get a priori knowledge? Some philosophers, such as the Logical Positivists, defended an affirmative answer to this question. (I’d encourage the enthusiastic students to take a look at Ayer’s Language, Truth and Logic , since this book is short and highly accessible.)

BUT not everyone agrees that all a priori knowledge is analytic.  Kant, for example, thinks that there is non-analytic a priori knowledge.   I’d give some of Kant’s examples of the synthetic a priori (e.g. every effect has a cause, 5+7=12, etc.) and quote the part of the 1st Critique where he denies that arithmetical claims are analytic .

Kant still thinks that arithmetic is a priori , so he is subscribing to the existence of synthetic a priori knowledge— a priori knowledge that is not analytic.  The question of exactly how  there could be synthetic a priori knowledge is a difficult one. But it’s a central question of the Critique of Pure Reason and leads Kant to some very interesting views about reality.

If you have time, you could give the class a list of 10 sentences and give them 5 minutes to divide them into analytic and synthetic. (e.g. “5+7=12,” “all green things are extended”.)  Then ask for votes on the answers and talk about the difficult or interesting cases.

Ten minutes are surely up even though there’s so much more to say! But I think this is a reasonable way to start.

Additional Resources:

  • Teaching & Learning Guide for the Analytic/Synthetic Distinction
  • Teaching Difficult Concepts with the ADEPT method

Can you also help answer this question? Join the conversation in the comments below, email us, Jennifer Morton and Michelle Saint, at PhilTeacherWorkshop@gmail.com, or participate in the APA Teaching Workshop on Facebook. Remember, the best answers are constructive and specific.

  • Editor: Jeremy Cushing
  • Teaching Committee
  • Teaching Workshop


If you are teaching intro to modern, or if your students can be assumed to have had intro to modern, it can be helpful to introduce the distinction historically through Leibniz and Kant. Leibniz wants to know, what relation needs to obtain between subject and predicate to make the judgment true and, as it turns out, the only answer he can come up with is concept containment. (At one point in the Arnauld correspondence he even says, “or else I don’t know what truth is”!) Kant thinks, clearly there are examples of true statements that don’t exhibit concept containment. The ones that exhibit concept containment are the analytic ones of course, but what relation might obtain between a subject concept and predicate concept to make a judgment true where there is NOT concept containment (i.e., a synthetic judgment)? Kant’s answer is, the subject concept and predicate concept must be joined in an object. (This, of course, sets up Kant’s theory of synthetic a priori judgments: the object in which the two concepts are joined must be somehow provided by the understanding itself, as in a geometric construction.)

This may just be because I’m an early modern specialist, but I find this story about the context in which the distinction arose helpful for understanding what it is and what it does. Of course, not every class is an appropriate forum for covering Leibniz and Kant on truth!


Synthetic and analytical thinking

Synthetisches und analytisches Denken

  • Published: January 1987
  • Volume 326, pages 320–323 (1987)



  • Franz M. Wuketits


It is shown that scientific research is not a linear process of gaining information and accumulating data and facts; rather, it is characterized by a model showing the cyclic structure of data gathering and theory construction, of inductive and deductive methods. Analytical and synthetic methods are linked together and form inseparable components of the texture of science.



Author information

Authors and Affiliations

Institute of Philosophy, University of Vienna, Universitäts-Str., A-1010, Wien, Austria

Franz M. Wuketits


About this article

Wuketits, F.M. Synthetic and analytical thinking. Z. Anal. Chem. 326, 320–323 (1987). https://doi.org/10.1007/BF00469778


Received: 31 January 1986

Issue Date: January 1987

DOI: https://doi.org/10.1007/BF00469778


  • Physical Chemistry
  • Analytical Chemistry
  • Inorganic Chemistry
  • Data Gathering
  • Synthetic Method


The Analytic/Synthetic Distinction

“Analytic” sentences, such as “Ophthalmologists are doctors,” are those whose truth seems to be knowable by knowing the meanings of the constituent words alone, unlike the more usual “synthetic” ones, such as “Ophthalmologists are ill-humored,” whose truth is knowable by both knowing the meaning of the words and something about the world. Beginning with Frege, many philosophers hoped to show that knowledge of logic and mathematics and other apparently a priori domains, such as much of philosophy and the foundations of science, could be shown to be analytic by careful “conceptual analysis.” This project encountered a number of problems that have seemed so intractable as to lead some philosophers, particularly Quine, to doubt the reality of the distinction. There have been a number of interesting reactions to this scepticism, both in philosophy and in linguistics, but it has yet to be shown that the distinction will ever be able to ground the a priori in the way that philosophers had hoped.

  • 2.1 Mathematics
  • 2.2 Science
  • 3.1 The Status of the Primitives
  • 3.2 The Paradox of Analysis
  • 3.3 Problems with Logicism
  • 3.4 Convention
  • 3.5 Problems with Verificationism
  • 3.6 Quine on Meaning in Linguistics
  • 3.7 Explaining Away the Appearance of the Analytic
  • 4.1 Neo-Cartesianism
  • 4.2 Externalist Theories of Meaning
  • 4.3 Explanatory Strategies
  • 4.4 Chomskyan Strategies in Linguistics
  • 5. Conclusion
  • Bibliography
  • Other Internet Resources
  • Related Entries

1. The Intuitive Distinction

Compare the following two sets of sentences:

I.
(1) Some doctors that specialize on eyes are ill-humored.
(2) Some ophthalmologists are ill-humored.
(3) Many bachelors are ophthalmologists.
(4) People who run damage their bodies.
(5) If Holmes killed Sikes, then Watson must be dead.

II.
(6) All doctors that specialize on eyes are doctors.
(7) All ophthalmologists are doctors.
(8) All bachelors are unmarried.
(9) People who run move their bodies.
(10) If Holmes killed Sikes, then Sikes is dead.

Most competent English speakers who know the meanings of all the constituent words would find an obvious difference between the two sets: whereas they might wonder about the truth or falsity of those of set I, they would find themselves pretty quickly incapable of doubting those of II. Unlike the former, these latter seem to be known automatically, “ just by virtue of knowing just what the words mean ,” as many might spontaneously put it. Indeed, a denial of any of them would seem to be in some important way unintelligible , very like a contradiction in terms . Although there is, as we shall see, a great deal of dispute about these italicized ways of drawing the distinction, and even about whether it is real, philosophers standardly refer to sentences of the first class as “synthetic,” those of the second as (at least apparently) “analytic.” Many philosophers have hoped that the apparent necessity and a priori status of the claims of logic, mathematics and much of philosophy would prove to be due to these claims being analytic, i.e., explaining why such claims seemed to be true “in all possible worlds,” and knowable to be so “independently of experience.” This view has led them to regard philosophy as consisting in large part in the “analysis” of the meanings of the relevant claims, words and concepts (hence “analytic” philosophy, although the term has long ceased to have any such specific commitment, and refers now more generally to philosophy done in the associated closely reasoned style).

Although there are anticipations of the notion of the analytic in Locke and Hume in their talk of “relations of ideas,” the specific terms “analytic” and “synthetic” themselves were introduced by Kant (1781/1998) at the beginning of his Critique of Pure Reason , where he wrote:

In all judgments in which the relation of a subject to the predicate is thought (if I only consider affirmative judgments, since the application to negative ones is easy) this relation is possible in two different ways. Either the predicate B belongs to the subject A as something that is (covertly) contained in this concept A ; or B lies entirely outside the concept A , though to be sure it stands in connection with it. In the first case, I call the judgment analytic, in the second synthetic. (A:6-7)

He provided as an example of an analytic judgment, “All bodies are extended”: in thinking of a body we can't help but also think of something extended in space; that would seem to be just part of what is meant by “body.” He contrasted this with “All bodies are heavy,” where the predicate (“is heavy”) “is something entirely different from that which I think in the mere concept of body in general” (A7), and we must put together, or “synthesize,” the different concepts, body and heavy (sometimes such concepts are called “ampliative,” “amplifying” a concept beyond what is “contained” in it).

Kant tried to spell out his “containment” metaphor for the analytic in two ways. To see that any of set II is true, he wrote, “I need only to analyze the concept, i.e., become conscious of the manifold that I always think in it, in order to encounter this predicate therein” (A7). But then, picking up a suggestion of Leibniz , he went on to claim:

I merely draw out the predicate in accordance with the principle of contradiction, and can thereby at the same time become conscious of the necessity of the judgment. (A7)

As Katz (1988) recently emphasized, this second definition is significantly different from the “containment” idea, since now, in its appeal to the powerful method of proof by contradiction, the analytic would include all of the (potentially infinite) deductive consequences of a particular claim, most of which could not be plausibly regarded as “contained” in the concept expressed in the claim. For starters, “Bachelors are unmarried or the moon is blue” is a logical consequence of “Bachelors are unmarried”—its denial contradicts the latter (a denial of a disjunction is a denial of each disjunct)—but clearly nothing about the color of the moon is remotely “contained in” the concept bachelor. To avoid such consequences, Katz (e.g., 1972, 1988) goes on to try to develop a serious theory based upon only the initial containment idea, as, along different lines, does Pietroski (2005).
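
To spell out the logical step in symbols (the letters p and q below are labels introduced here purely for illustration, with p standing for “Bachelors are unmarried” and q for “The moon is blue”):

$$p \;\vdash\; p \lor q \qquad \text{(disjunction introduction)}$$

$$\neg(p \lor q) \;\vdash\; \neg p \land \neg q \qquad \text{(De Morgan)}$$

So anyone who denies the disjunction is thereby committed to denying p itself, which is why the disjunction counts as a deductive consequence of p even though the concept bachelor “contains” nothing about the moon.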

One reason Kant may not have noticed the differences between his different characterizations of the analytic was that his conception of “logic” seems to have been confined to Aristotelian syllogistic, and so didn't include the full resources of modern logic, where the differences between the two characterizations become more glaring (see MacFarlane 2002). Indeed, he demarcates the category of the analytic chiefly in order to contrast it with what he regards as the more important category of the synthetic, which he famously thinks is not confined, as one might initially suppose, merely to the empirical. While some trivial a priori claims might be analytic, for Kant the seriously interesting ones were synthetic. He argues that even so elementary an example in arithmetic as “7+5=12,” is synthetic, since the concept of “12” is not contained in the concepts of “7,” “5,” or “+,”: appreciating the truth of the proposition would seem to require some kind of active synthesis of the mind uniting the different constituent thoughts. And so we arrive at the category of the “synthetic a priori ,” whose very possibility became a major concern of his work. He tries to show that the activity of “synthesis” was the source of the important cases of a priori knowledge, not only in arithmetic, but also in geometry, the foundations of physics, ethics, and philosophy generally, a view that set the stage for much of the philosophical discussions of the subsequent century (see Coffa 1991:pt I).

Apart from geometry, Kant, himself, didn't focus much on the case of mathematics. But, as mathematics in the 19th C. began reaching new heights of sophistication, worries were increasingly raised about its foundations as well. It was specifically in response to this latter problem that Gottlob Frege (1884/1980) tried to improve upon Kant's formulations of the analytic, and presented what is widely regarded as the next significant discussion of the topic.

Frege (1884/1950:§§5,88) and others noted a number of problems with Kant's “containment” metaphor. In the first place, as Kant himself probably would have agreed, the criterion would need to be freed of “psychologistic” suggestions, or claims about merely the accidental thought processes of thinkers, as opposed to claims about truth and justification that are presumably at issue with the analytic. In particular, mere associations are not always matters of meaning: someone might regularly associate bachelors with being harried, but this wouldn't therefore seriously be a part of the meaning of “bachelor” (“an unharried bachelor” is not contradictory). But, secondly, although the denial of a genuinely analytic claim may well be a “contradiction,” it isn't clear what makes it so: there is no explicit contradiction in the thought of a married bachelor, in the way that there is in the thought of a bachelor who is not a bachelor. “Married bachelor” has at least the same explicit logical form as “harried bachelor.” Rejecting “a married bachelor” as contradictory would seem to have no justification other than the claim that “All bachelors are unmarried” is analytic, and so cannot serve to justify or explain that claim.
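
One way to make the surface-form point vivid (the predicate letters here are mine, introduced purely for illustration):

$$\text{“married bachelor”: } Mx \land Bx \qquad \text{“harried bachelor”: } Hx \land Bx \qquad \text{“bachelor who is not a bachelor”: } Bx \land \neg Bx$$

Only the third wears the form of a contradiction, $\varphi \land \neg\varphi$, on its sleeve; deriving a contradiction from the first requires the further premise $\forall x\,(Bx \rightarrow \neg Mx)$, which is just the analyticity claim that was supposed to be explained.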

Even were Kant to have solved these problems, it isn't clear how his notion of “containment” would cover all the cases of what seem to many to be as “analytic” as any of set II. Thus, consider:

II. (cont.) (11) If Bob is married to Sue, then Sue is married to Bob. (12) Anyone who's an ancestor of an ancestor of Bob is an ancestor of Bob. (13) If x is bigger than y, and y is bigger than z, then x is bigger than z. (14) If something is red, then it's colored.

The symmetry of the marriage relation, or the transitivity of “ancestor” and “bigger than” are not obviously “contained in” the corresponding thoughts in the way that the idea of extension is plausibly “contained in” the notion of body, or male in the notion of bachelor. (14) has seemed particularly troublesome: what else besides “colored” could be included in the analysis? Red is colored and what else? It is hard to see what else to “add”—except red itself! (See §3.4 below for further discussion.)

Frege attempted to remedy the situation by completely rethinking the foundations of logic, developing what we now think of as modern symbolic logic . He defined a perfectly precise “formal” language, i.e., a language characterized by the “form” –standardly, the shape – of its expressions, and he carefully set out an account of the syntax and semantics of what are called the “logical constants,” such as “and”, “or”, “not”, “all” and “some”, showing how to capture a very wide class of valid inferences. Just how these constants are selected is a matter of some dispute (see Logical Constants ), but intuitively, the constants can be thought of as those parts of language that don't “point” or “function referentially”, aiming to refer to something in the world, in the way that ordinary nouns, verbs and adjectives seem to do: “dogs” refers to dogs, “clever” to clever and/or clever things, and even “Zeus” aims to refer to a Greek god; but words like “or” and “all” don't seem to function referentially at all: it doesn't seem to make sense to think of there being “or”s in the world, along with the dogs and their properties.

This distinction between referring expressions and logical constants allows us to define a logical truth as a sentence that is true no matter what referring expressions occur in it. Consequently,

(6) All doctors that specialize on eyes are doctors.

counts as a (strict) logical truth: no matter what referring expressions we put in for “doctor”, “eyes” and “specialize on” in (6), the sentence will remain true. For example, substituting “cats” for “doctors”, “mice” for “eyes” and “chase” for “specialize on”, we get:

(15) All cats that chase mice are cats.

(Note that we idealize to non-ambiguity, all occurrences of the same spelt words having the same reference.) But what about the others of set II? Substituting “cats” for “doctors” and “mice” for “ophthalmologists” in

(7) All ophthalmologists are doctors.

we get:

(16) All mice are cats.

which is patently false, as would similar such substitutions render the rest of the examples of II. So how are we to capture these apparent analyticities?

Here Frege appealed to the notion of definition , or—presuming that definitions preserve “meaning” (see §4.2 below)— synonymy : the non-logical analytic truths are those that can be converted to (strict) logical truths by substitution of definitions for defined terms, or synonyms for synonyms. Since “mice” is not synonymous with “ophthalmologist”, (16) is not a substitution of the required sort. We need, instead, a substitution of the definition of “ophthalmologist”, i.e., “doctor that specializes on eyes”, which would convert (7) into our earlier purely logical truth:
(6) All doctors that specialize on eyes are doctors.

Of course, these notions of definition, meaning and synonymy would themselves need to be clarified, but this wasn't thought to be particularly urgent until Quine raised serious questions about them much later (see §3.6ff below).
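
A minimal formal sketch of the substitution test may help (the predicate letters are my own, chosen only for illustration: O for “is an ophthalmologist,” D for “is a doctor,” S for “specializes on eyes”):

$$(7)\quad \forall x\,(Ox \rightarrow Dx)$$

$$\text{definition:}\quad Ox \;=_{df}\; Dx \land Sx$$

$$(6)\quad \forall x\,((Dx \land Sx) \rightarrow Dx)$$

Sentence (6) comes out true under every interpretation of D and S, so it is a logical truth in the strict sense; (7) qualifies as analytic, on Frege's criterion, only because the definitional substitution converts it into (6).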

Frege was mostly interested in formalizing arithmetic, and so considered the logical forms of a relative minority of natural language sentences in a deliberately spare formalism. Work on the logical (or syntactic) structure of the full range of sentences of natural language has blossomed since then, initially in the work of Russell (1905), in his famous theory of definite descriptions, where the criterion is applied to whole phrases in context, but then especially in the work of Chomsky and other “generative” linguists (see §4.3 below). Whether Frege's criterion of analyticity will work for the rest of II and other analyticities depends upon the details of those proposals (see, e.g., Katz 1972, Montague 1974, Hornstein 1984 and Pietroski 2005).

2. High Hopes

Why should philosophy be interested in what would seem to be a purely linguistic notion? Because, especially in the first half of the Twentieth Century, many philosophers thought it could perform crucial epistemological work, providing an account, first, of our apparently a priori knowledge of mathematics, and then—with a little help from British empiricism—of our understanding of claims about the spatio-temporal world as well. Indeed, “conceptual analysis” soon came to constitute the very way particularly Anglophone philosophers characterized their work. Many additionally thought it would perform the metaphysical work of explaining the truth and necessity of mathematics, showing not only how it is we could know about these topics independently of experience, but how they could be true in all possible worlds . This latter ambition was sometimes not distinguished from the former one, although it is no longer shared by most philosophers still interested in the analytic (see Devitt 1996 for discussion, and Glock 2003:ch 3 for an interesting attempt to resuscitate the metaphysical work). In this entry we will focus primarily on the epistemological project.

The problem of accounting for mathematical knowledge is arguably one of the oldest and hardest problems in Western philosophy. It is easy enough to understand: ordinarily we acquire knowledge about the world by our senses. If we are interested in, for example, whether it's raining outside, how many birds are on the beach, or whether fish sleep, we look and see, or turn to others who do. It is a widespread view that Western sciences owe their tremendous successes precisely to relying on just such “empirical” (experiential, experimental) methods. However, it is also a patent fact about all these sciences, and even our ordinary ways of counting birds and fish, that they depend on mathematics; and mathematics does not seem to be known on the basis of experience. Mathematicians don't do experiments in the way that chemists, biologists or other “natural scientists” do. They seem simply to think , at most with pencil and paper as an aid to memory. In any case, they don't try to justify their claims by reference to experiments: “Twice two is four” is not justified by observing that pairs of pairs tend in all cases observed so far to be quadruples.

But how could mere processes of thought issue in any knowledge about the independently existing external world? The belief that it could would seem to involve some kind of mysticism; and, indeed, many “naturalistic” philosophers have felt that the appeals of “Rationalist” philosophers like Plato, Descartes, Leibniz and, more recently, Katz (1988, 1990), Bealer (1987) and Bonjour (1998), to some special faculty of “rational intuition,” seem no better off than appeals to “revelation” to establish theology.

Here's where the analytic seemed to many to offer a more promising alternative. Perhaps all the truths of arithmetic could be shown to be analytic by Frege's criterion, i.e., by showing that they could all be converted into logical truths by substitution of synonyms for synonyms. Of course, the relevant synonyms were not quite as obvious as “ophthalmologist” and “eye doctor”; one needed to engage in a rigorous process of “logical analysis” of the meanings of such words as “number”, “plus”, “exponent”, “limit”, “integral”, etc. But this is what Frege set out to do, and in his train, Russell and the young Ludwig Wittgenstein, launching the program of logicism, often with great insight and at least some success (see §5 below).

But why stop at arithmetic? If logical analysis could illuminate the foundations of mathematics by showing how it could all be derived from logic by substitution of synonyms, perhaps it could also illuminate the foundations of the rest of our knowledge by showing how its claims could similarly be derived from logic and experience. Such was the hope and program of Logical Positivism , championed by, e.g., Moritz Schlick, A.J. Ayer and, especially, Rudolf Carnap. Of course, such a proposal did presume that all of our concepts were “derived” either from logic or experience, but this seemed in keeping with the then prevailing presumptions of empiricism, which, they assumed, had been vindicated by the immense success of the empirical sciences.

How were our concepts of, e.g., space, time, causation, or material objects analytically related to experience? For the Positivists, the answer seemed obvious: by tests. Taking a page from the American philosopher, C.S. Peirce, they proposed various versions of their Verifiability Theory of Meaning, according to which the meaning (or “cognitive significance”) of any sentence was the conditions of its empirical (dis)confirmation. Thus, to say that there was an electric current of a certain magnitude in a wire was to say that, if one were to attach the terminals of an ammeter to the ends of the wire, the needle would point to that very magnitude, a claim that would be disconfirmed if it didn't. Closer to “experience”: to say that there was a cat on a mat was just to say that certain patterns of sensation (certain familiar visual, tactile and aural appearances) were to be expected under certain circumstances. After all, it seemed to them, as it seemed to Locke, Berkeley and Hume centuries earlier, that all our concepts are derived from sensory experiences, and, if so, then all our concepts must involve some or other kind of construction from those experiences. For the Positivists, these earlier empiricists had erred only in thinking that the mechanism of construction was mere association. But association can't even account for the structure of a judgment, such as “Salt comes in shakers,” which is not merely the excitation of its constituent ideas, along the lines of “salt” exciting “pepper,” but involves combining the nouns “salt” and “shakers” with the predicate “x comes in y” in a very particular way (see Kant 1781/1998:A111-2 and Frege 1892/1966). That is, our thoughts and claims about the world have some kind of logical structure, of a sort that seems to begin to be revealed by Frege's proposals. Equipped with Frege's logic, it was possible to provide a more plausible formulation of conceptual empiricism: our claims about the empirical world were to be analyzed into the (dis)confirming experiences out of which they must somehow have been logically constructed.

The project of providing “analyses” of concepts in this way of especially problematic ones like those concerning, for example, material objects, knowledge, perception, causation, freedom, the self, was pursued by Positivists and other “analytic” philosophers for a considerable period (see Carnap 1928/67 for some rigorous examples, Ayer 1934/52 for more accessible ones). With regard to material object claims, the program came to be known as “phenomenalism”; with regard to the theoretical claims of science, as “operationalism” ; and with regard to the claims about people's mental lives, as “analytical behaviorism” (the relevant experiential basis of mental claims in general being taken to be observations of others' behavior). But, although these programs became extremely influential, and some form of the verifiability criterion was often (and sometimes still is) invoked in physics and psychology to constrain theoretical speculation, they seldom, if ever, met with any serious success. No sooner was an analysis, say, of “material object” or “expectation,” proposed than serious counterexamples were raised and the analysis revised, only to be faced with still further counterexamples. Despite what seemed its initial plausibility, philosophers came to suspect that the criterion, and with it the very notion of analyticity itself, rested on some fundamental mistakes.

3. Problems with the Distinction

An issue that Frege's criterion didn't address is the status of the basic sentences of logic themselves. Are the logical truths themselves a priori because they, too, are “analytic”? But what makes them so? Is it that anyone who understands their wording just must see that they are true? If so, how are we to make sense of disputes about the laws of logic, of the sort that are raised, for example, by mathematical intuitionists, who deny the Law of Excluded Middle, or, more recently, by “para-consistent” logicians, who argue for the toleration even of contradictions to avoid certain paradoxes (see Williamson 2006 for discussion)? Moreover, given that the infinitude of logical truths needs to be “generated” by rules of inference, wouldn't that be a reason for regarding them as “synthetic” in Kant's sense (see Frege 1884/1980:§88, Katz 1988:58-9 and MacFarlane 2002)? Most worrisome is a challenge raised by Quine (1956/76:§II): how does claiming logical truths to be analytic differ from merely claiming them to be obviously and universally correct, i.e., widely and firmly held beliefs, indistinguishable in kind from banalities like “The earth has existed for many years” or “There have been black dogs”?

A further problem arises for the non-logical vocabulary. The sentences reporting our experiences seemed to have some kind of analytic connection with those experiences –a normal sighted person failing to apply “looks red” in clear cases arguably fails to understand the words. But there was a serious question about just what “experience” should be taken to be: was it the sort of encounter with ordinary middle-sized objects such as tables and chairs, the weather and bodily actions, in terms of which most people would readily describe their perceptual experience? Or was it some sort of “un-conceptualized” play of sense impressions that it would take something like the training of an articulate impressionist artist to describe? This latter suggestion seemed to involve a “myth of the given” (Sellars 1956), or the dubious assumption that there was something given in our experience that was entirely un-interpreted by our understanding. This was a claim about which serious doubts were raised by psychologists (e.g., Bruner 1957) and philosophers of science (e.g., Hanson 1958 and Kuhn 1962). Considered closely, ordinary “observations” can be seen to be shot through with conceptual presuppositions: even so guarded a report as “It smells to me like tarragon” arguably involves the conceptualized memory of the smell of tarragon, and what one's earlier experiences were like. If so, and if there were consequently no privileged set of sentences reporting experience in an unbiased way, then the rug would seem to have been pulled from under some of the main presumptions and motivations for the Positivist program: what would be the significance of “analyzing” the meaning of a claim into merely what a particular theorist had (arbitrarily?) decided to regard as primitive?

Recent developments in psychology, however, suggest that human minds may well contain sensory and motor “modules” whose primitives would be epistemically distinctive, even if they do involve some limited degree of conceptual interpretation (see Modularity of Mind and Fodor 1983, 1984). And so the analytical Positivist program might be recast in terms of the reduction of all concepts to these sensorimotor primitives, a project that is sometimes implicit in cognitive psychology and artificial intelligence.

Another problem with the entire program was raised by Langford (1942): why should analyses be of any conceivable interest? After all, if analysis consists in providing the definition of an expression, then it should be merely providing a synonym for it, and this should be wholly uninformative, as un-informative as the claim that unmarried males are unmarried. But the proposed reductions of, say, material object statements to sensory ones were often fairly complex, had to be studied and learned, and so hardly seemed uninformative. So how could they count as seriously analytic? This is “the paradox of analysis,” which can be seen as dormant in Frege's own move from his (1884) focus on definitions to his more controversial (1892) doctrine of sense, where two senses are distinct if and only if someone can think a thought containing the one but not the other, as in the case of the senses of “the morning star” and “the evening star.” If definitions preserve sense, then, whenever one thought the defined concept, one would be thinking also the definition. But few of Frege's definitions, much less those of the Positivists, seemed remotely to have this character (see Bealer 1982, Dummett 1991 and Horty 1993, 2007 for discussion).

A related problem, discussed by Bealer (1998), is the possible proliferation of candidate analyses. The concept of a circle can be analyzed as the concept of a set of co-planar points equidistant from a given point and as a closed figure of constant curvature. Not only do both of these analyses seem informative, the equivalence between them would need to be shown by some serious geometry, which, especially since the advent of non-Euclidean geometries and Einstein's theories of relativity, could no longer be assumed to be justified merely on the basis of logic and definitions.

These problems, so far, can be regarded as relatively technical, for which further technical moves within the program might be made. For example, one might make further distinctions within the theory of sense between an expression's content and the specific “linguistic vehicle” for its expression, as in Fodor (1990a) and Horty (1993, 2007); and maybe distinguish between the truth-conditional content of an expression and its idiosyncratic role, or “ character ,” in a language system, along the lines of a distinction Kaplan (1989) introduced to deal with indexical and demonstrative expressions (such as “I,” “now” and “that”; see Demonstratives , Narrow Mental Content and White 1982). Perhaps analyses could be regarded as providing a particular “vehicle,” having a specific “character,” that could account for why one could entertain a certain concept without entertaining its analysis.

However, the problems with the program seemed to many philosophers to be deeper than merely technical. By far, the most telling and influential of the criticisms both of the program, and then of analyticity in general, were those of the American philosopher, W.V. Quine, who began as a great champion of the program (see esp. his 1934), and whose subsequent objections therefore carry special weight. The reader is well-advised to consult especially his (1956/76) for as rich and deep a discussion of the issues as one might find. The next two sections abbreviate some of that discussion.

Although the pursuit of the logicist program gave rise to a great many insights into the nature of mathematical concepts, not long after its inception it began encountering substantial difficulties. For Frege, the most calamitous came early on in a letter from Russell, in which Russell pointed out that one of Frege's crucial axioms for arithmetic was actually inconsistent. His intuitively quite plausible “Basic Law V” (sometimes called “the unrestricted Comprehension Axiom”) had committed him to the existence of a set for every predicate. But what, asked Russell, of the predicate “x is not a member of itself”? If there were a set for that predicate, that set itself would be a member of itself if and only if it wasn't; consequently, there could be no such set. Frege's Basic Law V couldn't be true (but see Frege's Logic, Theorem, and Foundations for Arithmetic and recent discussion of Frege's program in §5 below).
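
The difficulty can be stated in two lines of modern notation (the name R is just a convenient label for exposition):

$$R \;=\; \{\,x : x \notin x\,\}$$

$$R \in R \;\longleftrightarrow\; R \notin R$$

Unrestricted comprehension guarantees that such a set R exists, and the resulting biconditional is a flat contradiction, which is why Basic Law V cannot be true as it stands.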

What was especially upsetting about “Russell's paradox” was that there seemed to be no intuitively satisfactory way to repair set theory in a way that could lay claim to being as obvious and merely a matter of logic or meaning in the way that Positivists had hoped to show it to be. Various proposals were made, but all of them were tailored precisely to avoid the paradox, and seemed to have little independent appeal. Certainly none of them appeared to be analytic. As Quine (1956/76, §V) observed, in the actual practice of choosing axioms for set theory, we are left “making deliberate choices and setting them forth unaccompanied by any attempt at justification other than in terms of their elegance and convenience,” appeals to the meanings of terms be hanged (although see Boolos 1971).

Perhaps, however, these “deliberate choices” could themselves be seen as affording a basis for analytic claims. For aren't matters of meaning in the end really matters about the deliberate or implicit conventions with which words are used? Someone, for example, could invest a particular word, say, “schmuncle,” with a specific meaning merely by stipulating that it mean, say, unmarried uncle. Wouldn't that afford a basis for claiming then that “A schmuncle is an uncle” is analytic, or “true by virtue of the (stipulated) meanings of the words alone”? Carnap (1947) proposed setting out the “meaning postulates” of a scientific language in just this way. This had the further advantage of allowing terms to be “implicitly defined” by their roles in such postulates, which might be a theory's laws or axioms. This strategy seems especially appropriate for defining logical constants, as well as dealing with cases like (11)-(14) above, e.g. “Red is a color,” where mere “containment” seemed not to suffice. So perhaps what philosophical analysis is doing is revealing the tacit conventions of ordinary language, an approach particularly favored by Ayer (1934/52).
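
As a rough sketch of what such postulates might look like (the formalizations below are mine, offered only to illustrate the idea):

$$\forall x\,(\mathit{Red}(x) \rightarrow \mathit{Colored}(x))$$

$$\forall x\,(\mathit{Schmuncle}(x) \leftrightarrow (\mathit{Uncle}(x) \land \neg \mathit{Married}(x)))$$

On this approach, “A schmuncle is an uncle” would count as analytic because it follows by logic alone from a postulate laid down by stipulation; the question Quine presses below is whether that status amounts to anything more than the postulate's simply having been adopted.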

Quine (1956, §§IV-V) goes on to address the complex role(s) of convention in mathematics and science. Drawing on his earlier discussion (1936/76) of the conventionality of logic, he argues that logic could not be established by such conventions, since

the logical truths, being infinite in number, must be given by general conventions rather than singly; and logic is needed then in the meta-theory, in order to apply the general conventions to individual cases (1956:p.115).

This is certainly an argument that ought to give the proponents of the conventionality of logic pause: how could one hope to set out the general conventions for “all” or “if…then…” without using the notions of “all” and “if…then…”? (A complex issue remains about whether conventional rules might not be “implicit” in a practice, and so implicitly definable in terms of it; see Lewis (1969), Boghossian (1997), Horwich (2000), Hale and Wright (2000) and §4.1 below for discussion). Turning to set theory and then the rest of science, Quine goes on to argue that, although stipulative definition, what he calls “legislative postulation,”

contributes truths which become integral to the corpus of truths, the artificiality of their origin does not linger as a localized quality, but suffuses the corpus. (1956:pp. 119-20)

This certainly seems to accord with scientific practice. Even if Newton, say, had himself explicitly set out “F=ma” as a stipulated definition of “F”, this wouldn't really settle the interesting philosophical question of whether “F=ma” is justified by its being analytic, or “true by meaning alone,” since our taking his stipulation seriously would depend upon our acceptance of his theory as a whole, in particular upon “the elegance and convenience” it brought to the rest of our physical theory of the world (see Harman 1996:p399 for a nice discussion of how “something that is true by stipulative definition can turn out to be false”). As Quine goes on to observe:

[S]urely the justification of any theoretical hypothesis can, at the time of hypothesis, consist in no more than the elegance and convenience which the hypothesis brings to the containing bodies of laws and data. How then are we to delimit the category of legislative postulation, short of including under it every new act of scientific hypothesis? (1956:p.121)

Carnap's legislated “meaning postulates” should therefore be regarded as just an arbitrary selection of sentences a theory presents as true, a selection perhaps useful for purposes of exposition, but no more significant than the selection of certain towns in Ohio as “starting points” for a journey (1953/80:35). Invoking his famous holistic metaphor of the “web of belief,” Quine concludes:

the lore of our fathers is a fabric of sentences [which] develops and changes, through more or less arbitrary and deliberate revisions and additions of our own, more or less directly occasioned by the continuing stimulation of our sense organs. It is a pale grey lore, black with fact and white with convention. But I have found no substantial reasons for concluding that there are any quite black threads in it, or any white ones. (1956:p.132)

These last passages express a tremendously influential view of Quine's that led several generations of philosophers to despair not only of the analytic-synthetic distinction, but of the category of a priori knowledge entirely. The view has come to be called “confirmation holism,” and Quine had expressed it more shortly a few years earlier, in his widely read article, “Two Dogmas of Empiricism” (1953, ch. 2):

our statements about the external world face the tribunal of sense experience not individually, but only as a corporate body. (1953/80, p. 41)

Indeed, the “two dogmas” that the article discusses are (i) the belief in the intelligibility of the distinction itself, and (ii), what Quine regards as the flip side of the same coin, the belief that “each statement, taken in isolation from its fellows, can admit of confirmation or infirmation at all” (p. 41), i.e., the very (version of the) Verifiability Theory of Meaning we have seen the Positivists enlisted in their effort to “analyze” the claims of science and commonsense. (Ironically enough, Quine, himself, continued to adhere to a verifiability conception of meaning, his confirmation holism leading him merely to embrace a meaning holism and his notorious “thesis of the indeterminacy of translation”; see his 1986:p155 and the next section).

Quine bases his “confirmation holism” upon observations of Duhem (1914/54), who drew attention to the myriad ways in which theories are supported by evidence, and the fact that an hypothesis is not (dis)confirmed merely by some specific experiment considered in isolation from an immense amount of surrounding theory. Thus, to take our earlier example, applying an ammeter to a copper wire will be a good test that there's a current in the wire, only if the device is in working order, the wire is composed of normal copper, there aren't any other forces at work that might disturb the measurement—and, especially, only if the background laws of physics that have informed the design of the measurement are in fact sufficiently correct. A failure in the ammeter to register a current could, after all, be due to a failure of any of these other conditions, which is, of course, why experimenters spend so much time and money constructing experiments to “control” for them. Moreover, with a small change in our theories, or just in our understanding of the conditions for measurement, we might change the tests on which we rely, but often without changing the meaning of the sentences whose truth we might be trying to test (which, as Putnam 1965/75 pointed out, is precisely what practicing scientists regularly do).

What is novel—and highly controversial—about Quine's understanding of these commonplace observations is his extension of them to claims presumed by most people (e.g., by Duhem himself) to be outside its scope, viz., the whole of mathematics and even logic! It is this extension that seems to undermine the traditional a priori status of these latter domains, since it appears to open the possibility of a revision of logic or mathematics in the interest of the plausibility of the overall resulting theory—containing both the empirical claims and those of logic and mathematics. Perhaps this wouldn't be so bad should the revisability of logic and mathematics permit their ultimately admitting of a justification that didn't involve experience. But this is ruled out by Quine's insistence that scientific theories (with their logic and mathematics) are confirmed “only” as “corporate bodies.” (It's not clear what entitles Quine to this crucial “only,” but his doctrine has been read as standardly including it; see Rey 1998 for discussion). Certainly, though, as an observation about the revisability of claims of logic and meaning, Quine's claim seems right. As Putnam (1968/75) argued, enlarging on Quine's theme, it could turn out to be rational to revise even elementary logic in view of the surprising results of quantum mechanics, and it is not hard to imagine discovering that a homely purported analytic truth, such as “cats are animals,” could be given up in light of discovering that the little things are really cleverly disguised robots controlled from Mars (Putnam 1962; see Katz 1990:pp216ff for a reply).

Quine's discussion of the role of convention in science seems right; but how about the role of meaning in ordinary natural language? Is it really true that in the “pale grey lore” of all the sentences we accept, there aren't some that are “white” somehow “by virtue of the very meanings of their words”? What about our examples in our earlier set II? What about sentences that merely link patent synonyms, as in “Lawyers are attorneys,” or “A fortnight is a period of fourteen days”? As Grice and Strawson (1956) and Putnam (1962) pointed out, it is unlikely that so intuitively plausible a distinction should turn out to have no basis at all in fact. Quine addresses this issue, first, in his (1953/80, chs. 1 and 3), and then in a much larger way in chapter 2 of his (1960) and many subsequent writings.

Quine (1953) pressed his objection to analyticity further to the very ideas of synonymy and the linguistic meaning of an expression, on which, we saw, Frege's criterion of analyticity crucially relied. His objection is that he sees no way to make any serious explanatory sense of them. In his (1953) he explores plausible explanations in terms of “definition,” “intension,” “possibility,” and “contradiction,” plausibly pointing out that each of these notions stands in precisely as much need of explanation as synonymy itself (recall our observation in §1.2 above regarding the lack of overt contradiction in “married bachelor”). They form what seems to be a (viciously?) small “closed curve in space” (p. 30). Although many have wondered whether this is a particularly fatal flaw in any of these notions –circularities notoriously abound among many fundamental notions– it led Quine to be sceptical of the lot of them.

Quine (1960) further supported his case by sketching a full-fledged theory of language that does without any theory of determinate meaning. Indeed, a consequence of his theory is that translation (i.e., the identification of two expressions from different languages as having the same meaning) is “indeterminate”; there is “no fact of the matter” about whether two expressions do or do not have the same meaning (see Indeterminacy of Translation ). And it's a consequence of this view that there are pretty much no facts of the matter about people's mental lives at all! For, if there is no fact of the matter about whether two people mean the same thing by their words, then there is no fact of the matter about whether they ever have mental states with the same content; and consequently no fact of the matter about what anyone ever thinks. Quine himself took this consequence in stride –he was, after all, a behaviorist– regarding it as “of a piece” with Brentano's thesis of the irreducibility of the intentional; it's just that for him, unlike for Brentano, it simply showed the “baselessness of intentional idioms and the emptiness of a science of intention” (1960, p. 221). Needless to say, many subsequent philosophers have not been happy with this view, and have wondered where Quine's argument has gone wrong.

One reservation many have had about Quine's argument concerns how to explain the appearance of the analytic. Most people, for example, would distinguish our original two sets of sentences (§1) by saying that sentences of the second set, such as “All ophthalmologists are eye doctors,” could be known to be true just by knowing the meanings of the constituent words. Moreover, they might agree about an indefinite number of further examples, e.g., that pediatricians are doctors for children, that grandfathers are parents of parents, that sauntering is a kind of movement, that pain is a mental state, and that food is stuff that people eat. As Grice and Strawson (1956) and Putnam (1962) stressed, it's implausible to suppose that there's nothing people are getting at in these judgments.

Here, once again, Quine invoked his metaphor of the web of belief, claiming that sentences are more or less revisable, depending upon how “peripheral” or “central” their position is in the web. The appearance of certain sentences being “analytic” is simply due to their being, like the laws of logic and mathematics, comparatively central, and so given up, if ever, only under extreme pressure from the peripheral forces of experience. But no sentence is absolutely immune from revision; all sentences are thereby empirical, and none is actually analytic.

There are a number of problems with this explanation. In the first place, centrality and the appearance of analyticity don't seem to be so closely related. As Quine (1960) himself noted, there are plenty of central, unrevisable beliefs that don't seem analytic (e.g., “The earth has existed for more than five years,” “Some people have eyes,” “Mass-energy is conserved”), and many standard examples of what seems analytic aren't seriously central: “Bachelors are unmarried” and “Aunts are sisters” are notoriously trivial, and could easily be revised if someone really cared.

Secondly, it's not mere unrevisability that seems distinctive of the analytic, but rather a certain sort of unintelligibility: for all the unrevisability of “Some people have eyes,” it's perfectly possible to imagine it to be false. What's peculiar about the analytic is that denials of it often seem unintelligible: we can't seriously imagine a married bachelor. Indeed, far from unrevisability explaining analyticity, it would seem to be analyticity that explains unrevisability: the only reason one balks at denying bachelors are unmarried is that that's just what “bachelor” means!

It is important to note here a crucial change that Quine (and earlier Positivists) casually introduced into the characterization of the a priori, and consequently into much of the now common understanding of the analytic. Where Kant and others had traditionally assumed that the a priori concerned beliefs “justifiable independently of experience,” Quine and many other philosophers of the time came to regard it as consisting of beliefs “unrevisable in the light of experience.” And, as we have seen, a similar status is accorded the at least apparently analytic. However, this would imply that someone who takes something to be analytic or a priori must regard herself as infallible about it, forever unwilling to revise it in light of further evidence or argument. But this is a further claim that many defenders of the traditional notions need not embrace. A claim might in fact be analytic and justifiable independently of experience, but nevertheless perfectly revisable in the light of it. Experience, after all, might mislead us, as it (perhaps) misled Putnam when he suggested revising logic in light of difficulties in quantum mechanics, or revising “cats are animals,” were we to discover the things were robots. Just what claims are genuinely analytic might not be available at the introspective or behavioral surface of our lives, merely in our dispositions to assent to or dissent from sentences, as Quine (1960) supposes. Those dispositions might be hidden more deeply in our psychology, and our access to them as fallible as our access to any other such facts about ourselves. The genuinely analytic may be a matter of reflective philosophical analysis or abstract linguistic theory (see Bonjour 1998, Rey 1998, Field 2000, and §4.3 below for further discussion).

In his important commentary on Quine's discussion, Hilary Putnam (1962/75) tried to rescue what he thought were theoretically innocuous examples of analytic truths by appeal to what he called “one-criterion” concepts, concepts such as [bachelor], [widow], or [ophthalmologist], for which there seems to be only one “way to tell” whether they apply. However, as Fodor (1998) pointed out, so stated, this account won't suffice either, since the notion of “criterion” seems no better off than “analytic.” Moreover, if there were just one way to tell whether a concept applies, there would seem, trivially, to be indefinitely many further ways: for example, just ask someone who knows the one way; or ask someone who knows someone who knows; and so on. So now we would be faced with saying which of these ways is genuinely “criterial,” which would seem to leave us with the same problem we faced in saying which way is “analytic.”

Fodor (1998) tries to improve on Putnam's proposal by suggesting that the criterion that appears to be analytic is the one on which all the other criteria depend, but which does not depend upon them. Thus, telling that someone is a bachelor by checking out his gender and marriage status doesn't depend upon telling by asking his friends, but telling by asking his friends does depend upon telling by his gender and marriage status; and so we have an explanation of why “bachelors are unmarried males” seems analytic but, says Fodor, without its actually being so (perhaps somewhat surprisingly, given his general “asymmetric dependence” theory of content; see his 1990 and cf. Horwich 1998).

However, such asymmetric dependencies among criteria alone will not “explain (away)” either the reality or the appearance of the analytic, since there would appear to be asymmetric dependencies of the proposed sort in non-analytic cases. Natural kinds are dramatic cases in point (see Putnam 1962, 1970/75, 1975). At some stage in history probably the only way anyone could tell whether something was a case of polio was to see whether there was a certain constellation of standard symptoms; other ways (including asking others) asymmetrically depended upon that way. But this wouldn't make “All polio cases exhibit the standard symptoms” remotely analytic—after all, the standard symptoms for many diseases can be misleading. For everyone might also have thought that, with further research, there could in principle come to be better ways to tell (which is, of course, precisely what happened).

Indeed, these cases of “deep” natural kinds contrast dramatically with cases of more superficial kinds like “bachelor,” whose nature is pretty much exhausted by the linguistics of the matter. Again, unlike the case of polio and its symptoms, the reason that gender and marriage status are the best way to tell whether someone is a bachelor is that that's just what “bachelor” means. Indeed, should a doctor propose revising the test for polio in the light of better theory (perhaps reversing the dependency of certain tests), this would not even appear to involve a change in meaning. Should, however, a feminist propose, in the light of better politics, revising the use of “bachelor” to include women, this obviously would. If the appearance of the analytic is to be explained away, the explanation needs to account for such differences in our understanding of different words (see Rey 2005 for further discussion).

4. Post-Quinean Strategies

There has been a wide variety of responses to Quine's attack. Some, for example, Davidson (1980), Stich (1983) and Dennett (1987), seem simply to accept it and try to account for our practice of meaning ascription within its “non-factual” bounds. Since they follow Quine in at least claiming to forswear the analytic, we will not consider their views further here. Others, who might be (loosely) called “neo-Cartesians,” reject Quine's attack as simply so much prejudice of the empiricism and naturalism which they take to be his own uncritical dogmas (§4.1 in what follows). Still others hope simply to find a way to break out of the “intentional circle,” and provide an account of at least what it means for one thing (a state of the brain, for example) to mean (or “carry the information about”) another external phenomenon in the world (§4.2). Perhaps the most trenchant reaction has been that of empirically oriented linguists and philosophers, who look to a specific explanatory role the analytic may play in an account of thought and talk (§4.3). This role is currently being explored in considerable detail in the now various areas of research inspired by the revolutionary linguistic theories of Chomsky (§4.4).

The most unsympathetic response to Quine's challenges has been essentially to stare him down and insist upon an inner faculty of “intuition” whereby the truth of certain claims is simply “grasped” directly through, as Bonjour (1998) puts it:

an act of rational insight or rational intuition … [that] is seemingly (a) direct or immediate, nondiscursive, and yet also (b) intellectual or reason-governed … [It] depends upon nothing beyond an understanding of the propositional content itself…. (p. 102)

Bealer (1987, 1999) defends similar proposals. Neither Bonjour nor Bealer is in fact particularly concerned to defend the analytic by such claims, but their recourse to mere understanding of propositional content is certainly what many defenders of the analytic have had in mind. Katz (1998, pp. 44-5), for example, made the very same appeal to intuitions explicitly on behalf of the analytic claims supported by his semantic theory (although he could also be interpreted as sometimes having adopted the more sophisticated strategy of §4.3 below). Somewhat more modestly, Peacocke (1992, 2005) claims that possession of certain logical concepts requires that a person find certain inferences “primitively compelling,” or compelling not by reason of some inference or in any way that takes “their correctness…as answerable to anything else” (p. 6). In a similar vein, Boghossian (1997) appeals to rational inferential practices that might implicitly define at least the logical connectives (see Harman 1996 and Horwich 2000 for discussion).

Perhaps the most modest reply along these lines emerges from a suggestion of David Lewis (1972/80), who proposes implicitly to define common terms, e.g., psychological ones, by “platitudes”:

Include only platitudes that are common knowledge among us –everyone knows them, everyone knows that everyone else knows them, and so on. For the meanings of our words are common knowledge, and I am going to claim that names of mental states derive their meaning from these platitudes. (1972/80, p. 212)

He later (1994, p. 416) amends this suggestion to allow for the “folk theory” that may tacitly underlie our ordinary use of, e.g., mental terms. Enlarging on this idea, Frank Jackson (1998) emphasizes the role of intuitions about possible cases, as well as the need sometimes to massage such intuitions so as to arrive at “the hypothesis that best makes sense of [folk] responses” (p. 36; see also pp. 34-5 and Slote 1966).

The Quinean reply to all these approaches as they stand is pretty straightforward, and, in a way, expresses what many regard as the real heart of his challenge to all proponents of the analytic: how in the end are we to distinguish such claims of “rational insight,” “primitive compulsion,” inferential practice or folk belief from merely some deeply held empirical conviction, indeed, from mere dogma? Isn't the history of thought littered with what have turned out to be deeply mistaken claims, inferences and platitudes that people at the time have found “rationally” and/or “primitively compelling,” say, with regard to God, sin, biology, sexuality, or even patterns of reasoning themselves? Consider, for example, the resistance Kahneman, Slovic and Tversky (1982) observed people display to correction of the fallacies they commit in a surprising range of ordinary thought; or, in a more disturbing vein, how the gifted mathematician John Nash claimed that his delusional ideas “about supernatural beings came to me the same way that my mathematical ideas did” (Nasar 1998, p. 11). Introspected episodes, primitive compulsions, intuitions about possibilities, or even tacit folk theories alone are not going to distinguish the analytic, since these all may be due as much to people's empirical theories as to any special knowledge of meaning (see Devitt 2005, 2007 for vigorous critiques of neo-Cartesian approaches along such Quinean lines). Moreover, as Williamson (2006) has stressed, the mere fact that reasonable people often disagree about the rules of logic is reason to suppose that finding some set of rules compelling is not essential to possessing the logical concepts the rules involve. Jackson (1998, pp. 29-30) may be quite right to stress the need for some account of meaning in order to distinguish theories of some phenomenon that can be said to be still about that very phenomenon (so-called “reductionist” theories, as in the case of theories of water) from those that in effect deny its existence (so-called “eliminativist” theories, as in the case of standard explanations of devils and witches). Quine (1960, pp. 264-6) himself might not care. But, for those who do, there needs to be some more principled recourse than merely to beliefs or intuitions. (We'll consider other strategies in §§4.2 and 4.3 below.)

A particularly vivid way to feel the force of Quine's challenge is afforded by a recent case that came before the Ontario Superior Court concerning whether laws that confined marriage to heterosexual couples violated the equal protection clause of the constitution (see Halpern et al. 2001). The question was regarded as turning in part on the meaning of the word “marriage,” and each party to the dispute solicited affidavits from philosophers, one of whom claimed that the meaning of the word was tied to heterosexuality, the other that it wasn't. Putting aside the complex socio-political issues, Quine's challenge can be regarded as a reasonably sceptical request to know precisely what the argument is about, and how any serious theory of the world might settle it. It certainly wouldn't be sufficient merely to claim that marriage is/isn't necessarily heterosexual on the basis of “platitudes,” much less on “an act of rational insight [into] the propositional content itself”; or because speakers found the inference from marriage to heterosexuality “primitively compelling” and couldn't imagine gay people getting married! (In this connection see also the data of “experimental philosophy” in §4.4 below.)

Externalist theories try to meet at least part of Quine's challenge by arguing that matters of meaning need not rely on connections among thoughts or beliefs, in the way the tradition had encouraged philosophers to suppose, but may instead involve relations between words and the phenomena in the world that they pick out. This suggestion gradually emerged in the work of Putnam (1962/75, 1965/75 and 1975), Kripke (1972/80) and Burge (1979, 1986), but it takes the form of positive theories in, e.g., the work of Dretske (1981, 1988) and Fodor (1987, 1990b, 1992), who base meaning in various forms of natural co-variation between states of the mind/brain and external phenomena (see indicator semantics); and in the work of Millikan (1984), Papineau (1987) and Neander (1995), who look to mechanisms of natural selection (see teleosemantics). If these theories were to succeed in providing a genuine explanation of intentionality (a success that is by no means undisputed; see Loewer 1996 and Rey 2005), they would go some way towards saving at least intentional psychology from Quine's challenge.

Although these strategies may well save intentionality and meaning, they do so, of course, only by forsaking the high hopes we noted in §2 that philosophers harbored for the analytic. For externalists are typically committed to counting expressions as “synonymous” if they happen to be linked in the right way to the same external phenomena, even if a thinker doesn't realize that they are! Consequently, by at least the Fregean criterion, they would seem to be committed to counting as “analytic” such patently empirical sentences as “Water is H2O,” “Salt is NaCl” or “Mark Twain is Samuel Clemens,” since in each of these cases something co-varies with the expression on one side of the identity if and only if it co-varies with the expression on the other (similar problems arise for teleosemantics). But this might not faze an externalist like Fodor (1998), who is concerned only to save intentional psychology, and might otherwise share Quine's scepticism about philosophers' appeals to the analytic and a priori.

Of course, an externalist could just allow that some analytic truths, e.g., “water is H2O,” are in fact “external” and subject to empirical (dis)confirmation. Such a view would actually comport well with an older philosophical tradition less interested in the meanings of our words and concepts, and more interested in the “essences” of the worldly phenomena they pick out. Locke (1690/1975: II, 31, vi), for example, posited “real” essences of things rather along the lines resuscitated by Putnam (1975) and Kripke (1972/80), the real essences being the conditions in the world, independent of our thought, that make something the thing it is, as being H2O makes something water, or, to take the striking examples of diseases noted by Putnam (1962), as being the activation of a certain virus makes something polio. But, again, such an external view would still dash the hopes of philosophers looking to the analytic to explain a priori knowledge (see Bealer 1987 and Jackson 1998 for strategies to assimilate such empirical cases to nevertheless a priori analysis).

An interesting possibility raised by an externalist theory is that the beliefs that are responsible for a person's competence with a term or concept might actually turn out to be false. Putnam (1975), for example, suggested that the competence conditions for a term might involve both some kind of external relation to the term's referent and stereotypical beliefs, e.g., that lemons are yellow, tigers striped, water a liquid, even though it's perfectly possible for there to be exceptions to such claims. It may be essential to knowing the meaning of a term at least that such claims are regularly believed by users of it. On this account, then, a claim might turn out to be analytic and false! A competent user perhaps needs at least to “feel the pull” of certain claims, such as that tigers are striped, which she might ultimately nevertheless recognize to be mistaken (in Peacocke's 1992 phrase, they might feel a “primitive compulsion” in this regard, even if it turned out to be a compulsion they need to learn to resist!). Rather than understanding “analytic” to mean “known to be true by virtue of meaning,” one might understand it merely as “justified by virtue of meaning,” a prima facie justification that could simply be overridden by other, global theoretical considerations.

A promising strategy for replying to Quine's challenge that might begin to provide what the neo-Cartesian wants can be found in recent proposals of Michael Devitt (1996, 2002) and, independently, of Paul Horwich (1998, 2005). In a way analogous to Fodor's claims about asymmetric dependence that we just noted, they emphasize how the meaning properties of a term are the ones that play a basic explanatory role with regard to the use of the term generally, the ones in virtue of which, ultimately, the term is used with that meaning. For example, the use of “red” to refer to the color of blood, roses, stop signs, etc., is arguably explained by its use to refer to certain colors in good light, but not vice versa: the latter use is “basic” to all the other uses. Similarly, uses of “and” explanatorily depend upon its basic use in inferences to and from the sentences it conjoins. Devitt and Horwich differ about the proper locus of such a strategy. Horwich thinks of it mainly with regard to the use of terms in natural language, and only marginally allows what Devitt (2002) argues is required, that it be applied primarily to terms in a “language of thought.”

There are two potential drawbacks to these strategies. The first is that they still risk Quinean scepticism about meaning and the analytic. For, if Quine (1960) is right about the psychology of language use, then there are no sufficiently local, explanatorily basic facts on which all other uses of a term depend. In particular, those uses of a term involved in the expression of belief, either in thought or talk, will likely be explained by the same processes of confirmation that Quine argued depend on the character of one's belief system as a whole, and not upon some local feature of that system, in the way that an “analytic” claim would have to be (cf. Gibbard 2008). Of course, Quine might be wrong about this psychology. But, to put it mildly, the verdict on that issue is not quite in (see Fodor 1983, 2000 for a perhaps surprising endorsement of Quine's view, and the next section for some qualified alternatives to it).

In any case, a second drawback of this strategy is that it risks rendering matters of meaning far less “transparent” and introspectively accessible than Cartesians have standardly supposed. For, as the worry about our psychology being Quinean makes vivid, there is little reason to suppose that what is explanatorily basic about one's use of a term in thought or talk is a matter that is available to introspection or common knowledge. As in the case of “marriage” mentioned earlier, but certainly with respect to other philosophically problematic notions, just which properties, if any, are explanatorily basic may not be an issue that is at all easy to determine. At best, if the strategy is to save meaning and the analytic from Quinean scepticism, it is probably best pursued along the Chomskyan lines to which we now turn.

Beginning in the 1950s, Chomsky (1965, 1968/2006) began to revolutionize linguistics by presenting substantial reasons for supposing that its proper subject matter was not people's superficial linguistic behavior, or “performance,” but rather the generative rules that constitute their underlying linguistic “competence.” This opened up the possibility of a response to Quine's scepticism from within his own naturalistic and at least methodologically empiricist framework, since Chomsky's work empirically refuted the behaviorist theory of language on which Quine's account had often explicitly relied (as in his 1960; see Chomsky 1959 for the refutation).

The data that have concerned Chomsky himself have always largely involved syntactic properties of natural language, but he sometimes construes them broadly to include at least some “analytic” examples, as when he writes, “it seems reasonable to suppose that semantic relations between words like persuade, intend, believe, can be expressed in purely linguistic terms (namely: if I persuade you to go, then you intend to go)…” (Chomsky 1977/98:142). Along these lines (and in arguments that could be sustained independently of the appeals to intuition we considered in §4.1), Katz (1972) drew attention to related semantic data, such as subjects' agreements about, e.g., synonymy, redundancy, antonymy, and implication, and developed a theory systematically relating syntactic and semantic structure to account for them (see Pietroski 2005 for more recent and cautious suggestions along related lines). Since, as we have seen (§3.7), the explanations offered by Quine, Putnam and Fodor in terms of centrality and/or preferred ways of telling seem simply empirically inadequate, perhaps the best explanation of these phenomena is to be had in a theory of the human language faculty.

It might be thought that appeals to such data beg the question against Quine, since, as Quine (1967) pointed out, so much as asking subjects to say whether two expressions are synonymous, antonymous, or implicative simply transfers the burden of determining what is being discussed from the theorist to the informant. Imagine, again, a person being asked whether marriage entails heterosexuality as a matter of “the meaning of the word.” One can sympathize with someone being at a loss as to what to say. In any case, what is the possible significance of people's answers? The same question can be raised here as before: How do we distinguish a genuine analytic report from merely an expression of a firmly held belief?

The Chomskyan actually has the seed of an interesting reply. For part of Chomsky's view has to do with the modularity of the natural language faculty: whether a sentence is grammatical or not depends not on its relation to our thought and communicative projects, but entirely on its conformity with the internal principles of that specific faculty. It is easy for us to produce in our behavior strings of words that may communicate information effectively, but which may violate those principles. An ungrammatical sentence like “Bill is the man I wanna take a walk” might suffice on occasion for thought and communication (of “Bill is the man who I want to take a walk”), but it's a striking fact that speakers of English (even four-year-old ones!) nevertheless find it problematic (see Crain and Lillo-Martin 1999). The existence of the language faculty as a separate faculty may simply be an odd, but psychologically real, fact about us, and it may thereby supply a real basis for commitments about not only what is or isn't grammatical, but also what is or isn't a matter of natural language meaning. On this view, if one were to deny an analytic truth, one would simply be violating a principle of one's natural language, which, on Chomskyan views, it's perfectly possible to do: people often speak in a way that appears to violate patent analyticities (“Ann is his real mother, despite Zoe being his biological one,” “Carl's still really a bachelor, even though he's been married for years”), and scientists regularly do so with their introduction of technical ways of talking, as in the case of “Water is H2O” (which Chomskyans claim is not a sentence in a natural language!). Indeed, at least in some cases one might combine a Chomskyan view with an externalist one, and allow that some of the meaning-constitutive rules for a term can turn out to be false (§4.2 above).

The burden of such a reply lies, however, in actually producing a linguistic theory that sustains a principled class of sentences that could be isolated in this way and that, per the suggestion of §4.3, might play the basic explanatory role of meaning. It is, as yet, by no means obvious that this can be done. As we saw, Fodor (1983, 2000) argues on behalf of Quine's claim about the confirmation holism of belief fixation generally; more specifically, Fodor et al. (1975) raised doubts about whether any kind of “semantic decomposition” is psychologically real, and Fodor (1970, 1998) has contested some of the most prized examples of analyticities, such as (10) in set II above, linking killing to death (but see Pietroski 2002 for a reply). Moreover, many linguists (e.g., Jackendoff 1992, Pustejovsky 1995) proceed somewhat insouciantly to include under issues of “meaning” and “conceptual structure” issues that are patently matters of just ordinary belief and sometimes mere phenomenology. For example, Jackendoff and others have called attention to the heavy use of spatial metaphors in many grammatical constructions. But such facts don't entail that the concepts of the domains to which these metaphors are applied (say, the structure of the mind, social relations, or mathematics) are themselves somehow intrinsically spatial, or really thought by anyone to be. People's conceptions of these domains may often be spatialized. However, conceptions (ordinary beliefs, metaphors, associations) are one thing; people's concepts quite another: two mathematicians, after all, can have the thought that there is no largest prime, even if one of them thinks of numbers spatially and the other purely algebraically.

In considering Chomskyan theories of the analytic, it is important to bear in mind that, while the theory may be as methodologically empiricist as any theory ought to be, it explicitly rejects empiricist conceptions of meaning and mind themselves. Chomsky is famous for having resuscitated Rationalist doctrines of “innate ideas,” according to which many ideas have their origins not in experience, but in our innate endowment. And there's certainly no commitment in semantic programs like those of Katz, Jackendoff or Pustejovsky to anything like the “reduction” of all concepts to the sensorimotor primitives eyed by the Positivists. Of course, just how we come by the meanings of whatever primitive concepts their theories do endorse is a question they would seriously have to confront (cf. Fodor 1990b, 1998).

Recently, some philosophers have offered empirical evidence that might be taken to undermine these efforts to empirically ground the analytic, casting doubt on just how robust the data for the analytic might be. The movement of “experimental philosophy” has pointed to evidence of considerable malleability of subjects' “intuitions” with regard to the standard kinds of thought experiments on which defenses of analytic claims typically rely. Thus, Weinberg, Nichols and Stich (2001) found significant cultural differences between the responses of Asian and Western students regarding whether someone counted as having knowledge in a standard “Gettier” example of accidental justified true belief; and Knobe (2003) found that non-philosophers' judgments about whether an action is intentional depended on the (particularly negative) moral qualities of the action, and not, as is presumed by most philosophers, on whether the action was merely intended by the agent. Questions, of course, could be raised about these experimental results (How well did the subjects understand the project of assessing intuitions? Did the experiments sufficiently control for the multitudinous “pragmatic” effects endemic to polling procedures? To what extent are the target terms merely polysemous, allowing for different uses in different contexts?). However, the results do serve to show how the determination of meaning and analytic truths can be regarded as a more difficult empirical question than philosophers have traditionally supposed (see Bishop and Trout 2005 and Alexander and Weinberg 2007 for extensive discussion).

Suppose linguistics were to succeed in delineating a class of analytic sentences grounded in a special language faculty. Would such sentences serve the purposes for which we noted earlier (§2) philosophers had enlisted them?

Perhaps some of them would. An empirical grounding of the analytic might provide us with an understanding of what constitutes a person's competence with a concept. Given that Quinean scepticism about the analytic is a source of his scepticism about the determinacy of cognitive states (see §3.6 above), such a grounding may be crucial for a realistic psychology. Moreover, setting out the constitutive conditions for possessing a concept might be of some interest to philosophers generally, since many of the crucial questions they ask concern the proper understanding of ordinary notions such as material object, person, action, freedom, god, the good, or the beautiful. Suppose, further, that a domain, such as perhaps ethics or aesthetics, is “response dependent,” constituted by our conceptual responses; suppose, that is, that what determines the nature of, say, the good, the funny, or the beautiful are simply the principles underlying people's competence with their concepts of them. If so, then it might not be implausible to claim that successful conceptual analysis had provided us with a priori knowledge of that domain.

But, of course, many philosophers have wanted more than these essentially psychological gains. They have hoped that analytic claims might provide a basis for a priori knowledge of domains that exist independently of our concepts. An important case in point is the very case of arithmetic that motivated much of the discussion of the analytic in the first place. Recent work of Crispin Wright (1983) and others on the logicist program has shown how a version of Frege's program might be rescued by appealing not to his problematic Basic Law V, but instead merely to what is called “Hume's Principle,” the claim that for the number of Fs to be equal to the number of Gs is for there to be a “one-to-one correspondence” between the Fs and the Gs (as in the case of the fingers of a normal right hand and a left one). According to what is now regarded as “Frege's Theorem,” the Peano axioms for arithmetic can be derived from this principle in standard second-order logic (see Frege's Logic, Theorem, and Foundations for Arithmetic). Now, Wright has urged that Hume's Principle might be regarded as analytic, and perhaps this claim could be sustained by an examination of the language faculty along the lines of §4.4. If so, then wouldn't that vindicate the suggestion that arithmetic can be known a priori?
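For concreteness, Hume's Principle can be put in symbols as follows (the symbolization is offered only as an illustrative sketch in standard second-order notation, not as a quotation from Wright or Boolos):

    #F = #G ↔ F ≈ G,

where “F ≈ G” (“the Fs and the Gs are equinumerous”) abbreviates the second-order claim that some relation R correlates the Fs one-to-one with the Gs:

    ∃R [∀x (Fx → ∃!y (Gy ∧ Rxy)) ∧ ∀y (Gy → ∃!x (Fx ∧ Rxy))].

Frege's Theorem, in these terms, is the result that the Dedekind-Peano axioms are derivable in second-order logic once the biconditional above is added as the sole axiom governing the number operator “#”.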

Not obviously, since Hume's Principle is a claim not merely about the concepts F and G, but about the presumably concept-independent facts about the number of things that are F and the number of things that are G; and, we can ask, what justifies any claim about them? As Boolos (1997) asks in response to Wright:

If numbers are supposed to be identical if and only if the concepts they are numbers of are equinumerous, what guarantee do we have that every concept has a number? (p. 253)

One might reasonably worry that such an unrestricted claim about the extensions of concepts risks the same fate that befell Frege's Basic Law V, and will be shown to be inconsistent (see §3.3 above). In the light of that fate and of subsequent developments in set theory, it is hard to see how to justify believing Hume's Principle without appealing, as Quine claimed we must (§3.4), to “the elegance and convenience which the hypothesis brings to the containing bodies of laws and data,” i.e., to our best overall theory of the world (see Wright 1999 and Horwich 2000 for subtle discussions of the issues).

The problem here becomes even more obvious in non-mathematical cases. For example, philosophers have wanted to claim not merely that our concepts of red and green exclude the possibility of our thinking that something is both colors all over, but that this possibility is ruled out for the actual colors, red and green, themselves. It is therefore no accident that Bonjour's (1998, pp. 184-5) defense of a priori knowledge turns on including the very properties of red and green themselves as constituents of the analytic propositions we grasp. But it is just such a wonderful coincidence between merely our concepts and actual worldly properties that a linguistic semantics alone does not obviously ensure.

But suppose there in fact existed a wonderful correspondence between our concepts and the world, indeed, a deeply reliable, counterfactual supporting correspondence whereby it was in fact metaphysically impossible for certain claims constitutive of those concepts not to be true. This is, of course, not implausible in the case of logic and arithmetic, and is entirely compatible with, e.g., Boolos' reasonable doubts about them (after all, it's always possible to doubt what is in fact a necessary truth). Such necessary correspondences between thought and the world might then serve as a basis for claims to a priori knowledge in at least a reliabilist epistemology, where what's important is not a believer's ability to justify his claims, but merely the reliability of the processes by which he arrived at them. Indeed, in the case of logic and arithmetic, the beliefs might be arrived at by steps that were not only necessarily reliable, but might also be taken to be so by the believer, in ways that might in fact depend in no way upon experience, but only on his competence with the relevant concepts (Kitcher 1980, Rey 1998 and Goldman 1999 explore this strategy).

Such a reliabilist approach, though, might be less than fully satisfying to someone interested in the traditional a priori analytic. For, although someone might turn out in fact to have a priori analytic knowledge of this sort, she might not know that she does (reliabilist epistemologists standardly forgo the “KK Principle,” according to which if one knows that p, then one knows that one knows that p). Knowledge that the relevant claims were knowable a priori might itself be possible only through an empirically informed understanding of, e.g., one's language faculty, and, à la Quine, through its consonance with the rest of one's theory of the world. But the trouble then is that the claim that people have a capacity to arrive at knowledge in deductively reliable ways seems quite precarious. Williamson (2006), for example, points out that present psychological research on the nature of human reasoning suggests that people, even on reflection, are surprisingly poor at appreciating deductively valid arguments. As logic teachers will attest, appreciating the standard rules even of natural deduction is for many people often a difficult intellectual achievement. Consequently, people's general competence with logical notions may not in fact consist in any grip on valid logical rules; and so whatever rules do underlie that competence may well turn out not to be the kind of absolutely reliable guide to the world on which the above reliabilist defense of a priori analytic knowledge seems to depend. In any case, in view merely of the serious possibility that these pessimistic conclusions are true, it's hard to see how any appeal to the analytic to establish the truth of any controversial claim in any mind-independent domain could have any special justificatory force.

In fact, once one appreciates the serious doubt about whether even our most fundamental concepts correspond to anything in the world, it is unclear that we really expect or even want the analytic to provide knowledge of concept-independent domains. Consider, for example, the common puzzle about the possibility that computers might actually think and enjoy a mental life. In response to this puzzle, some philosophers (e.g., Wittgenstein 1953/67:p97e, Ziff 1959) have suggested that it's analytic that a thinking thing must be alive, a suggestion that certainly seems to accord with many folk intuitions (many people who might cheerfully accept a computational explanation of a thought process often balk at the suggestion that an inanimate machine engaging in that computation would actually be thinking). Suppose this claim were sustained by a Chomskyan theory, showing that the ordinary notion expressed by the natural language word “thinking” is, indeed, correctly applied only to living things, and not to artifactual computers. Should this really satisfy the person worried about the possibility of artificial thought? It's hard to see why. For the serious question that concerns people worried about whether artifacts could think concerns whether those artifacts could in fact share the real, theoretically interesting, explanatory properties of being a thinking thing (cf. Jackson 1998:pp34-5). We might have no reason to suppose that being alive actually figures among them, and so conclude that, despite these (supposed) constraints of natural language, inanimate computers could come to “think” after all. Indeed, perhaps the belief that thinking things must be alive is an example of a false belief that, we saw in §§4.2-4.3, an externalist Chomskyan could claim is part of the constitutive conditions on “think” (again, one doesn't have the concept unless one feels the pull). Alternatively, of course, one could insist on adhering to whatever meaning constraints turn out to be imposed by natural language and so, perhaps, deny that inanimate computers could ever think. But, if the explanatory point were correct, it would be hard to see how this would amount to anything more than a verbal quibble: so computers don't “think”; they “think*” instead.

In sum: an account of the language faculty might provide a basis for ascribing competence with the concepts that that faculty might deploy, and thereby a basis for intentional realism and a distinction between analytic and synthetic claims. It might also provide a basis for a priori analytic knowledge of claims about concept-dependent domains, such as those of ethics and aesthetics. However, in the case of concept-independent domains, such as logic and mathematics, or the nature of worldly phenomena like life or mind, the prospects seem more problematic. There may be analytic claims to be had here, but, in the immortal words of Putnam (1965/75, p. 36), they would “cut no philosophical ice…bake no philosophical bread and wash no philosophical windows.” We would just have to be satisfied with theorizing about the concept-independent domains themselves, without the benefit of knowing anything about them “by virtue of knowing the meanings of our words alone.” Reflecting on the difficulties of the past century's efforts on behalf of the analytic, it's not clear why anyone would really want to insist otherwise.

  • Alexander, J. and Weinberg, J. (2007) “Analytic Epistemology and Experimental Philosophy,” Philosophy Compass, 2(1): 56-80
  • Ayer, A.J. (1934/52), Language, Truth and Logic , New York: Dover
  • Bealer, G. (1982), Quality and Concept , Oxford: Oxford University Press
  • Bealer, G. (1987), “The Philosophical Limits of Scientific Essentialism,” in J. Tomberlin, Philosophical Perspectives I, Metaphysics, Atascadero (CA): Ridgeview Press
  • Bealer, G. (1998), “Analyticity,” Routledge Encyclopedia of Philosophy , New York: Routledge
  • Bealer, G. (1999). “A Theory of the A Priori ,” Philosophical Perspectives , 13, Epistemology, 1999, 29-55.
  • Bishop, M. and Trout, J. (2005), Epistemology and the Psychology of Human Judgment , Oxford University Press
  • Boghossian, P. (1997), “Analyticity,” in B. Hale and C. Wright (eds.), A Companion to the Philosophy of Language , Oxford: Blackwell
  • Bonjour, L. (1998), In Defense of Pure Reason , Cambridge University Press
  • Boolos, G. (1971), “The Iterative Conception of Set,” Journal of Philosophy 68: 215-32
  • Boolos, G. (1997), “Is Hume's Principle Analytic?”, in Heck, R. (ed.), Language, Thought and Logic , Oxford University Press, pp245-61
  • Bruner, J. (1957), “On Perceptual Readiness,” Psychological Review 64: 123-52
  • Burge, T. (1979), “Individualism and the Mental,” Mid-West Studies in Philosophy , IV: 73-121
  • Burge, T. (1986), “Individualism and Psychology,” Philosophical Review , XCV/1: 3-46
  • Carnap, R. (1928/67), The Logical Structure of the World and Pseudoproblems in Philosophy, trans. by R. George, Berkeley: University of California Press
  • Carnap, R. (1947), Meaning and Necessity , Chicago: University of Chicago Press
  • Chomsky, N. (1959/64), “Review of Skinner's Verbal Behavior,” in J. Fodor and J. Katz (eds.), The Structure of Language: Readings in the Philosophy of Language , Englewood Cliffs: Prentice Hall
  • Chomsky, N. (1965), Aspects of the Theory of Syntax , Cambridge: MIT Press
  • Chomsky, N. (1968/2006), Language and Mind , 3rd ed., Cambridge University Press
  • Chomsky, N. (2000), New Horizons in the Study of Language , Cambridge: Cambridge University Press.
  • Coffa, J. (1991), The Semantic Tradition from Kant to Carnap: to the Vienna Station , Cambridge: Cambridge University Press
  • Crain, S. and Lillo-Martin, D. (1999), An Introduction to Linguistic Theory and Language Acquisition , Oxford: Blackwell
  • Davidson, D. (1980), Truth and Meaning , Oxford: Oxford University Press
  • Dennett, D. (1987), The Intentional Stance , Cambridge: MIT (Bradford)
  • Devitt, M. (2002), “Meaning and Use,” Philosophy and Phenomenological Research , LXV(1):106-21
  • Devitt, M. (2005), “There is No A Priori ,” in Contemporary Debates in Epistemology , Sosa, E. and Steup, M (eds.), Cambridge, MA: Blackwell, pp105-115.
  • Devitt, M. (2007), “No Place for the A Priori ,” in Shaffer, M. and Weber, M. (eds.), New Views of the A Priori in Physical Theory , Amsterdam: Rodopi Press
  • Dretske, F. (1981), Knowledge and the Flow of Information , Cambridge, MA: MIT Press
  • Dretske, F. (1987), Explaining Behavior: Reasons in a World of Causes , Cambridge, MA: MIT Press
  • Duhem, P. (1914/54), The Aim and Structure of Physical Theory, tr. by P. Wiener, Princeton: Princeton University Press
  • Dummett, M. (1991), Frege and Other Philosophers , Oxford: Oxford University Press
  • Field, H. (2000), “A Priority as an Evaluative Notion,” in Boghossian, P. and Peacocke (eds.), New Essays on the A Priori , Oxford University Press, pp117-49.
  • Fodor, J. (1980/90), “Psychosemantics, or Where do Truth Conditions Come From?” in Lycan, W. (ed.), Mind and Cognition , Oxford: Blackwell
  • Fodor, J. (1981), “The Present Status of the Innateness Controversy,” in his RePresentations , Cambridge (MA): MIT (Bradford), pp257-31
  • Fodor, J. (1983), Modularity of Mind , Cambridge, MA: MIT Press
  • Fodor, J. (1984) “Observation Reconsidered” Philosophy of Science , 51, 23-43
  • Fodor, J. (1990a), “Substitution Arguments and The Individuation of Beliefs”, in Fodor, J. (1990b), pp161-76
  • Fodor, J. (1990b), A Theory of Content and Other Essays , Cambridge, MA: MIT Press
  • Fodor, J. (1992), “Replies,” in Loewer, B. and Rey, G., Meaning in Mind: Fodor and His Critics , Oxford: Blackwell
  • Fodor, J. (1998), Concepts: Where Cognitive Science Went Wrong , Cambridge, MA: MIT Press
  • Fodor, J. (2000), The Mind Doesn't Work That Way , Cambridge: MIT Press
  • Fodor, J.D., Fodor, J.A., and Garrett, M. (1975), “The Psychological Unreality of Semantic Representations,” Linguistic Inquiry, 6:515-31
  • Frege, G. (1884/1980), The Foundations of Arithmetic , 2nd revised ed., London: Blackwell
  • Frege, G. (1892a/1966), “On Sense and Reference,” in P.Geach and M. Black (eds.), Translations from the Works of Gottlob Frege , Oxford: Blackwell, pp56-78.
  • Frege, G. (1892b/1966), “On Concept and Object,” in P.Geach and M. Black (eds.), Translations from the Works of Gottlob Frege , Oxford: Blackwell, pp42-55.
  • Glock, H. (2003), Quine and Davidson on Language, Thought and Reality , Cambridge University Press.
  • Goldman, A. (1999), “A Priori Warrant and Naturalistic Epistemology,” in Tomberlin, J. E., ed., Philosophical Perspectives , v. 13., Cambridge, UK: Blackwell
  • Grice, P. and Strawson, P. (1956), “In Defense of a Dogma,” Philosophical Review LXV 2:141-58
  • Halpern et al. v. Attorney General of Canada et al. (Court file 684/00), and Metropolitan Community Church of Toronto v. Attorney General of Canada et al. (Court file 30/2001), in the Ontario Superior Court of Justice (Divisional Court), November 2001.
  • Hanson, N. (1958), Patterns of Discovery: an Inquiry into the Conceptual Foundations of Science , Cambridge University Press
  • Harman, G. (1996), “Analyticity Regained,” Nous 30:3, pp392-40
  • Hornstein, N. (1984), Logic as Grammar , Cambridge, MA: MIT Press
  • Horty, J. (1993), “Frege on the Psychological Significance of Definitions,” Philosophical Studies , 69: 113-153
  • Horty, J. (2007), Frege on Definitions: a Case Study of Semantic Content , Oxford University Press
  • Horwich, P. (1998), Meaning , Oxford University Press
  • Horwich, P. (2000), “Stipulation, Meaning and Apriority,” in Boghossian, P. and Peacocke, C. (eds.), New Essays on the A Priori , Oxford University Press, pp150-69
  • Horwich, P. (2005), Reflections on Meaning , Oxford University Press
  • Jackendoff, R. (1992), Languages of the Mind: Essays on Mental Representation , Cambridge, MA: MIT Press
  • Jackson, F. (1998), From Metaphysics to Ethics: a Defence of Conceptual Analysis , Oxford University Press
  • Kahneman, D., Slovic, P., and Tversky, A. (1982), Judgments Under Uncertainty: Heuristics and Biases , Cambridge: University of Cambridge Press
  • Kant, I. (1781/1998), The Critique of Pure Reason , trans. by P. Guyer and A.W.Wood, Cambridge University Press
  • Kaplan, D. (1989), “Demonstratives,” in Almog, J., Perry, J. and Wettstein, H., Themes from Kaplan , Oxford University Press
  • Katz, J. (1972), Semantic Theory , New York: Harper and Row
  • Katz, J. (1988), Cogitations , Oxford: Oxford University Press
  • Katz, J. (1990), The Metaphysics of Meaning , Oxford: Oxford University Press
  • Kitcher, P. (1980), “ A Priori knowledge,” The Philosophical Review , v. 86. pp. 3-23
  • Knobe, J. (2003), “Intentional Action and Side Effects in Ordinary Language,” Analysis 63: 190-3
  • Kripke, S. (1972/80) Naming and Necessity , Cambridge (MA): Harvard University Press
  • Kuhn, T. (1962), The Structure of Scientific Revolutions , Chicago: University of Chicago Press
  • Lewis, D. (1969), Convention: a Philosophical Study , Cambridge: Harvard University Press
  • Locke, J. (1690/1975), An Essay Concerning Human Understanding , edited by Peter Nidditch, Oxford: Clarendon Press
  • Loewer, B. (1996), “A Guide to Naturalizing Semantics,” in Wright, C. and Hale, B., A Companion to Philosophy of Language , Oxford: Blackwell, pp108-26
  • MacFarlane, J. (2002), “Frege, Kant, and the Logic of Logicism”, Philosophical Review 111(1):25-65
  • Millikan, R. (1984), Language, Thought and Other Biological Categories , Cambridge, MA: MIT Press
  • Nasar, S. (1998), A Beautiful Mind, New York: Touchstone
  • Neander, K. (1995), “Misrepresenting and Malfunctioning,” Philosophical Studies 79:109-41
  • Peacocke, C. (1992), A Study of Concepts , Cambridge: MIT
  • Peacocke, Christopher. (2005), “The A Priori ,” in Jackson F. and Smith, M. (eds.), The Oxford Handbook of Contemporary Philosophy , Oxford University Press, pp739-63
  • Pietroski, P. (2002), “Small Verbs, Complex Events: Analyticity without Synonymy,” in L. Antony and N. Hornstein, Chomsky and His Critics , Oxford: Blackwell
  • Pietroski, P. (2005), Events and Semantic Architecture , Oxford University Press
  • Pustejovsky, J. (1995), The Generative Lexicon , Cambridge: MIT Press
  • Putnam, H. (1962), “It Ain't Necessarily So,” Journal of Philosophy , LIX: 658-671
  • Putnam, H. (1965/75), “The Analytic and the Synthetic,” in his Philosophical Papers , vol. 2, Cambridge: Cambridge University Press
  • Putnam, H. (1970/75), “Is Semantics Possible?”, Metaphilosophy, 1: 189-201 reprinted in his Philosophical Papers , vol. 2, Cambridge University Press
  • Putnam, H. (1975), “The Meaning of ‘Meaning’,” in his Philosophical Papers, vol. 2, Cambridge University Press
  • Quine, W. (1934/1990) “Lectures on Carnap”, in R. Creath (ed.), Dear Carnap, Dear Van , Berkeley: University of California Press
  • Quine, W. (1936/76), “Truth by Convention,” in his Ways of Paradox and Other Essays, 2nd ed., Cambridge, MA: Harvard University Press
  • Quine, W.(1953/80), From a Logical Point of View , 2nd ed., Cambridge, MA: Harvard University Press
  • Quine, W. (1956/76), “Carnap and Logical Truth,” in his Ways of Paradox and Other Essays , 2nd ed., Cambridge, MA: Harvard University Press
  • Quine, W. (1967), “On a Suggestion of Katz,” The Journal of Philosophy , 64: 52-4
  • Quine, W. (1986), “Reply to Vuillemin,” in P. Schilpp (ed.), The Philosophy of W.V. Quine, LaSalle: Open Court
  • Rey, G. (1998), “A Naturalistic A Priori ,” Philosophical Studies 92: 25-43
  • Rey, G. (2005), “Philosophical Analyses as Cognitive Psychology: the Case of Empty Concepts,”in Cohen, H. and Lefebvre (eds.), Handbook of Categorization in Cognitive Science , New York: Elsevier, pp72-89
  • Russell, B. (1905), “On Denoting”, Mind 14:479-93
  • Sellars, W. (1956), “Empiricism and the Philosophy of Mind,” in M. Scriven, P. Feyerabend, and G. Maxwell (eds.), Minnesota Studies in the Philosophy of Science, vol. I, Minneapolis: University of Minnesota Press, pp. 253-329
  • Slote, M. (1966), “The Theory of Important Criteria,” Journal of Philosophy , 63:211-24
  • Stich, S. (1983), From Folk Psychology to Cognitive Science, Cambridge (MA): MIT Press
  • Weinberg, J., Nichols, S. and Stich, S. (2001), “Normativity and Epistemic Intuitions,” Philosophical Topics 29:429-60
  • White, S. (1982), “Partial Character and the Language of Thought,” Pacific Philosophical Quarterly , 63:347-65
  • Williamson, T. (2006), “Conceptual Truth,” The Aristotelian Society , Supplementary Volume 80, pp1-41.
  • Wittgenstein, L. (1953/67), Philosophical Investigations , 3rd ed., Oxford: Blackwell
  • Wright, C. (1983), Frege's Conception of Numbers as Objects , Aberdeen University Press
  • Wright, C. (1999), “Is Hume's Principle Analytic?”, Notre Dame Journal of Formal Logic , 40(1), pp6-30
  • Ziff, P. (1959), “The Feelings of Robots,” Analysis , 19:64-8

[Please contact the author with suggestions.]

analysis | a priori justification and knowledge | behaviorism | Carnap, Rudolf | definitions | epistemology | Frege, Gottlob | Kant, Immanuel | logical constants | logical positivism | logical truth | logicism | meaning, theories of | metaphysics | naturalism | operationalism | phenomenology | Quine, Willard van Orman | rationalism vs. empiricism | Russell, Bertrand | verificationism


  • Review Article
  • Published: 21 August 2019

Synthetic organic chemistry driven by artificial intelligence

  • A. Filipa de Almeida (ORCID: orcid.org/0000-0002-8399-0710),
  • Rui Moreira &
  • Tiago Rodrigues (ORCID: orcid.org/0000-0002-1581-5654)

Nature Reviews Chemistry, volume 3, pages 589–604 (2019)


  • Cheminformatics
  • Computational chemistry
  • Organic chemistry
  • Chemical synthesis

Synthetic organic chemistry underpins several areas of chemistry, including drug discovery, chemical biology, materials science and engineering. However, the execution of complex chemical syntheses in itself requires expert knowledge, usually acquired over many years of study and hands-on laboratory practice. The development of technologies with potential to streamline and automate chemical synthesis is a half-century-old endeavour yet to be fulfilled. Renewed interest in artificial intelligence (AI), driven by improved computing power, data availability and algorithms, is overturning the limited success previously obtained. In this Review, we discuss the recent impact of AI on different tasks of synthetic chemistry and dissect selected examples from the literature. By examining the underlying concepts, we aim to demystify AI for bench chemists in order that they may embrace it as a tool rather than fear it as a competitor, spur future research by pinpointing the gaps in knowledge and delineate how chemical AI will run in the era of digital chemistry.



Acknowledgements

A.F.A. acknowledges Fundação para a Ciência e Tecnologia (FCT) Portugal for financial support through a PhD grant (PD/BD/143125/2019). T.R. is an assistant researcher (investigador auxiliar) supported by FCT Portugal (CEECIND/00887/2017) and acknowledges FCT/FEDER (02/SAICT/2017, grant 28333) for funding. The authors thank the reviewers for their comments.

Author information

Authors and Affiliations

Research Institute for Medicines (iMed.ULisboa), Faculty of Pharmacy, Universidade de Lisboa, Lisboa, Portugal

A. Filipa de Almeida & Rui Moreira

Instituto de Medicina Molecular (iMM) João Lobo Antunes, Faculdade de Medicina da Universidade de Lisboa, Lisboa, Portugal

Tiago Rodrigues


Contributions

The authors contributed equally to all aspects of the article.

Corresponding author

Correspondence to Tiago Rodrigues.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information

Nature Reviews Chemistry thanks R. Lewis and B. Maryasin for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Natural language processing (NLP). Area of computer science that deals with the recognition, processing and analysis of human (natural) language.

SMARTS (SMILES arbitrary target specification). A notation for accurate substructural feature identification and atom typing.

SMILES (simplified molecular-input line-entry system). A notation that describes chemical structures using ASCII strings.
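
As an illustration of the two notations, the short Python sketch below (not code from the Review; it assumes the open-source RDKit toolkit is installed) parses molecules from SMILES strings and uses a SMARTS query to flag a carboxylic acid substructure.

    # Hypothetical sketch: SMILES parsing and SMARTS substructure matching with RDKit.
    from rdkit import Chem

    carboxylic_acid = Chem.MolFromSmarts("C(=O)[OH]")           # SMARTS query pattern
    for smiles in ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"]:        # aspirin, benzene
        mol = Chem.MolFromSmiles(smiles)                        # SMILES string -> molecule object
        print(smiles, mol.HasSubstructMatch(carboxylic_acid))   # True for aspirin, False for benzene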

Molecular fingerprint. A method that maps substructural information into a bit string; the bit length (size) and the level of detail of the encoded features are defined by the user.

Tanimoto similarity. A measure that quantifies the similarity between molecules on a scale from 0 to 1, where 0 denotes complete dissimilarity and 1 denotes identity.
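
The sketch below (again an illustration assuming RDKit, not code from the Review) turns two SMILES strings into 2048-bit Morgan fingerprints and scores their Tanimoto similarity.

    # Hypothetical sketch: hashed circular (Morgan) fingerprints and Tanimoto similarity.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
    salicylic_acid = Chem.MolFromSmiles("O=C(O)c1ccccc1O")

    fp1 = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
    fp2 = AllChem.GetMorganFingerprintAsBitVect(salicylic_acid, 2, nBits=2048)

    # Tanimoto = shared on-bits / union of on-bits, bounded by [0, 1].
    print(DataStructs.TanimotoSimilarity(fp1, fp2))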

Softmax function. A function that normalizes a vector of J real values into a probability distribution of J probabilities in the interval [0, 1] that sum to 1.
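
A minimal softmax implementation, shown here only to make the definition concrete (NumPy assumed):

    # Hypothetical sketch: softmax maps arbitrary scores to probabilities that sum to 1.
    import numpy as np

    def softmax(scores):
        shifted = scores - np.max(scores)   # subtract the maximum for numerical stability
        exps = np.exp(shifted)
        return exps / exps.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10]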

Gaussian process. A machine-learning method that places a probability distribution over possible functions; a prior belief regarding an event is refined through Bayesian inference as data accumulate.
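
A minimal sketch of the idea (scikit-learn assumed; the data are invented): a Gaussian process regressor is fitted to a few yield-like observations and queried for its predictive mean and uncertainty, the two quantities a Bayesian optimizer would trade off when proposing the next experiment.

    # Hypothetical sketch: Gaussian process regression on toy reaction data.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    X = np.array([[0.0], [0.3], [0.7], [1.0]])   # e.g. a normalized reaction parameter
    y = np.array([0.10, 0.55, 0.80, 0.40])       # e.g. observed yields (invented)

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3).fit(X, y)
    mean, std = gp.predict(np.array([[0.5]]), return_std=True)
    print(mean, std)                             # posterior belief about an untested condition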

Linear discriminant analysis (LDA). A machine-learning method that finds the linear combinations of features that best separate classes and is used for dimensionality reduction and classification.

Support vector machine (SVM). A machine-learning method that separates data points in a high-dimensional feature space using mathematical functions called kernels.
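
For concreteness, a toy kernel-SVM classifier (scikit-learn assumed; descriptors and labels are invented):

    # Hypothetical sketch: an RBF-kernel SVM separating two toy classes of 2D descriptors.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
    y = np.array([0, 0, 1, 1])                        # two invented reaction-outcome classes

    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    print(clf.predict([[0.15, 0.15], [0.85, 0.85]]))  # -> [0 1]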

Transfer learning. A method for fine-tuning a model trained on a larger set of related data; it is employed when only limited data are available to answer the research question at hand.
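
A schematic illustration of the fine-tuning idea (scikit-learn assumed; this is a toy stand-in, not any method from the Review): a small neural network is first fitted on a larger, related dataset and then refitted on a handful of target examples while reusing the learned weights.

    # Hypothetical sketch: crude transfer learning via warm-started refitting.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X_big, y_big = rng.normal(size=(500, 16)), rng.integers(0, 2, size=500)   # related, data-rich task
    X_small, y_small = rng.normal(size=(20, 16)), np.array([0, 1] * 10)       # small target task

    clf = MLPClassifier(hidden_layer_sizes=(32,), warm_start=True, max_iter=300, random_state=0)
    clf.fit(X_big, y_big)        # "pre-training" on the data-rich task
    clf.fit(X_small, y_small)    # "fine-tuning": learned weights carry over into the second fit
    print(clf.predict(X_small[:3]))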


About this article

Cite this article

de Almeida, A.F., Moreira, R. & Rodrigues, T. Synthetic organic chemistry driven by artificial intelligence. Nat Rev Chem 3, 589–604 (2019). https://doi.org/10.1038/s41570-019-0124-0


Accepted: 19 July 2019

Published: 21 August 2019

Issue Date: October 2019

DOI: https://doi.org/10.1038/s41570-019-0124-0


This article is cited by

Computational drug development for membrane protein targets.

  • Xiaolin Sun
  • Horst Vogel

Nature Biotechnology (2024)

Artificial molecular pumps

  • J. Fraser Stoddart

Nature Reviews Methods Primers (2024)

  • Ramil Babazade
  • Yousung Jung

Scientific Data (2024)

Molecular set representation learning

  • Maria Boulougouri
  • Pierre Vandergheynst
  • Daniel Probst

Nature Machine Intelligence (2024)

Computational synthesis design for controlled degradation and revalorization

  • Anna Żądło-Dobrowolska
  • Karol Molga
  • Bartosz A. Grzybowski

Nature Synthesis (2024)


