What is a scientific hypothesis?

A scientific hypothesis is a tentative, testable explanation for a phenomenon in the natural world. It's the initial building block in the scientific method. Many describe it as an "educated guess" based on prior knowledge and observation. That description is partly accurate, but a hypothesis is more informed than a guess: an "educated guess" suggests a random prediction based on a person's expertise, whereas developing a hypothesis requires active observation and background research.

The basic idea of a hypothesis is that there is no predetermined outcome. For an idea to be termed a scientific hypothesis, it has to be one that can be supported or refuted through carefully crafted experimentation or observation. This requirement, known as falsifiability or testability, was advanced in the mid-20th century by Austrian-British philosopher Karl Popper in his famous book "The Logic of Scientific Discovery" (Routledge, 1959).

A key function of a hypothesis is to derive predictions about the results of future experiments and then perform those experiments to see whether they support the predictions.

A hypothesis is usually written in the form of an if-then statement, which gives a possibility (if) and explains what may happen because of the possibility (then). The statement could also include "may," according to California State University, Bakersfield.

Here are some examples of hypothesis statements:

  • If garlic repels fleas, then a dog that is given garlic every day will not get fleas.
  • If sugar causes cavities, then people who eat a lot of candy may be more prone to cavities.
  • If ultraviolet light can damage the eyes, then maybe this light can cause blindness.

A useful hypothesis should be testable and falsifiable. That means that it should be possible to prove it wrong. A theory that can't be proved wrong is nonscientific, according to Karl Popper's 1963 book "Conjectures and Refutations."

An example of an untestable statement is, "Dogs are better than cats." That's because the definition of "better" is vague and subjective. However, an untestable statement can be reworded to make it testable. For example, the previous statement could be changed to this: "Owning a dog is associated with higher levels of physical fitness than owning a cat." With this statement, the researcher can take measures of physical fitness from dog and cat owners and compare the two.

Types of scientific hypotheses

In an experiment, researchers generally state their hypotheses in two ways. The null hypothesis predicts that there will be no relationship between the variables tested, or no difference between the experimental groups. The alternative hypothesis predicts the opposite: that there will be a difference between the experimental groups. This is usually the hypothesis scientists are most interested in, according to the University of Miami.

For example, a null hypothesis might state, "There will be no difference in the rate of muscle growth between people who take a protein supplement and people who don't." The alternative hypothesis would state, "There will be a difference in the rate of muscle growth between people who take a protein supplement and people who don't."

If the results of the experiment show a relationship between the variables, then the null hypothesis has been rejected in favor of the alternative hypothesis, according to the book "Research Methods in Psychology" (BCcampus, 2015).
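As a rough illustration of how a null hypothesis might be rejected in practice, the sketch below runs an exact permutation test on the protein-supplement example. The muscle-gain numbers, the 0.05 threshold, and the function name `permutation_p_value` are all assumptions made for illustration; they do not come from any study cited here, and a permutation test is only one of several tests a researcher might choose.

```python
from itertools import combinations
from statistics import mean

# Hypothetical muscle gain in kilograms; invented for illustration only.
supplement = [2.1, 2.6, 1.9, 2.8, 2.4]
no_supplement = [1.8, 2.0, 1.7, 2.2, 1.6]

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = group_a + group_b
    total = sum(pooled)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    count = 0
    # If the null hypothesis is true, the group labels are interchangeable,
    # so every way of splitting the pooled values is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count

p = permutation_p_value(supplement, no_supplement)
alpha = 0.05  # conventional threshold, assumed here
if p < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
print(f"p = {p:.3f}: {decision}")
```

The p-value is the share of label shufflings that produce a difference at least as large as the one observed. A small value means the data would be surprising if the supplement made no difference, which is why the null hypothesis is rejected rather than the alternative being "proved."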

There are other ways to describe an alternative hypothesis. The alternative hypothesis above does not specify a direction of the effect, only that there will be a difference between the two groups. That type of prediction is called a two-tailed hypothesis. If a hypothesis specifies a certain direction (for example, that people who take a protein supplement will gain more muscle than people who don't), it is called a one-tailed hypothesis, according to William M. K. Trochim, a professor of Policy Analysis and Management at Cornell University.
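The difference between one-tailed and two-tailed hypotheses shows up in how a p-value is computed. The sketch below uses a simple sign test on a hypothetical paired design, which is an assumption chosen for brevity: the counts (8 of 10 pairs favoring the supplement) are invented, and `binomial_tail` is a helper defined here, not a library function.

```python
from math import comb

def binomial_tail(n, k):
    """P(X >= k) when X counts successes in n fair coin flips."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Hypothetical paired study: in 8 of 10 matched pairs, the person taking
# the supplement gained more muscle than the person who did not.
n_pairs = 10
supplement_wins = 8

# One-tailed: only results favoring the supplement count as evidence.
one_tailed = binomial_tail(n_pairs, supplement_wins)

# Two-tailed: equally lopsided results in either direction count.
# Doubling is valid here because the fair-coin distribution is symmetric.
two_tailed = min(1.0, 2 * one_tailed)

print(f"one-tailed p = {one_tailed:.4f}")
print(f"two-tailed p = {two_tailed:.4f}")
```

With the same data, the one-tailed p-value is half the two-tailed one, which is why the direction of a hypothesis should be stated before the data are examined.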

Sometimes, errors take place during an experiment. These errors can happen in one of two ways. A type I error occurs when the null hypothesis is rejected even though it is true. This is also known as a false positive. A type II error occurs when the null hypothesis is not rejected even though it is false. This is also known as a false negative, according to the University of California, Berkeley.
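A small simulation can make the two error types concrete. The sketch below assumes a two-sided z-test with a known standard deviation of 1, a sample size of 25, and a true effect of 0.5 when the null hypothesis is false; these settings, the seeds, and the function names are illustrative choices, not values from the sources cited here.

```python
import random
from statistics import mean

def rejects_null(sample, z_crit=1.96):
    """Two-sided z-test of 'the true mean is 0', assuming a known SD of 1."""
    z = mean(sample) * len(sample) ** 0.5
    return abs(z) > z_crit

def rejection_rate(true_mean, n=25, trials=2000, seed=0):
    """Fraction of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    rejections = sum(
        rejects_null([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return rejections / trials

# Null hypothesis is true (mean 0), yet some experiments reject it.
type_1_rate = rejection_rate(true_mean=0.0)

# Null hypothesis is false (mean 0.5), yet some experiments fail to reject it.
type_2_rate = 1 - rejection_rate(true_mean=0.5, seed=1)

print(f"type I (false positive) rate:  {type_1_rate:.3f}")
print(f"type II (false negative) rate: {type_2_rate:.3f}")
```

Under these assumptions the type I rate should land near the 5% implied by the 1.96 cutoff, while the type II rate depends on the effect size and sample size.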

A hypothesis can be rejected or modified, but it can never be proved correct with complete certainty. For example, a scientist can form a hypothesis stating that if a certain type of tomato has a gene for red pigment, that type of tomato will be red. During research, the scientist then finds that each tomato of this type is red. Though the findings support the hypothesis, there may be a tomato of that type somewhere in the world that isn't red. Thus, the hypothesis is supported by the evidence, but it cannot be shown to hold 100% of the time.

Scientific theory vs. scientific hypothesis

The best hypotheses are simple. They deal with a relatively narrow set of phenomena. But theories are broader; they generally combine multiple hypotheses into a general explanation for a wide range of phenomena, according to the University of California, Berkeley. For example, a hypothesis might state, "If animals adapt to suit their environments, then birds that live on islands with lots of seeds to eat will have differently shaped beaks than birds that live on islands with lots of insects to eat." After testing many hypotheses like these, Charles Darwin formulated an overarching theory: the theory of evolution by natural selection.

"Theories are the ways that we make sense of what we observe in the natural world," Tanner said. "Theories are structures of ideas that explain and interpret facts." 

Additional resources

  • Read more about writing a hypothesis, from the American Medical Writers Association.
  • Find out why a hypothesis isn't always necessary in science, from The American Biology Teacher.
  • Learn about null and alternative hypotheses, from Prof. Essa on YouTube.

Bibliography

Encyclopedia Britannica, "Scientific Hypothesis," Jan. 13, 2022. https://www.britannica.com/science/scientific-hypothesis

Karl Popper, "The Logic of Scientific Discovery," Routledge, 1959.

California State University, Bakersfield, "Formatting a testable hypothesis." https://www.csub.edu/~ddodenhoff/Bio100/Bio100sp04/formattingahypothesis.htm

Karl Popper, "Conjectures and Refutations," Routledge, 1963.

Price, P., Jhangiani, R., & Chiang, I., "Research Methods in Psychology, 2nd Canadian Edition," BCcampus, 2015.

University of Miami, "The Scientific Method." http://www.bio.miami.edu/dana/161/evolution/161app1_scimethod.pdf

William M.K. Trochim, "Research Methods Knowledge Base." https://conjointly.com/kb/hypotheses-explained/

University of California, Berkeley, "Multiple Hypothesis Testing and False Discovery Rate." https://www.stat.berkeley.edu/~hhuang/STAT141/Lecture-FDR.pdf

University of California, Berkeley, "Science at multiple levels." https://undsci.berkeley.edu/article/0_0_0/howscienceworks_19

Alina Bradford

1.2 The Process of Science

Learning objectives.

  • Identify the shared characteristics of the natural sciences
  • Understand the process of scientific inquiry
  • Compare inductive reasoning with deductive reasoning
  • Describe the goals of basic science and applied science

Like geology, physics, and chemistry, biology is a science that gathers knowledge about the natural world. Specifically, biology is the study of life. The discoveries of biology are made by a community of researchers who work individually and together using agreed-on methods. In this sense, biology, like all sciences, is a social enterprise like politics or the arts. The methods of science include careful observation, record keeping, logical and mathematical reasoning, experimentation, and submitting conclusions to the scrutiny of others. Science also requires considerable imagination and creativity; a well-designed experiment is commonly described as elegant, or beautiful. Like politics, science has considerable practical implications and some science is dedicated to practical applications, such as the prevention of disease (see Figure 1.15). Other science proceeds largely motivated by curiosity. Whatever its goal, there is no doubt that science, including biology, has transformed human existence and will continue to do so.

The Nature of Science

Biology is a science, but what exactly is science? What does the study of biology share with other scientific disciplines? Science (from the Latin scientia, meaning "knowledge") can be defined as knowledge about the natural world.

Science is a very specific way of learning, or knowing, about the world. The history of the past 500 years demonstrates that science is a very powerful way of knowing about the world; it is largely responsible for the technological revolutions that have taken place during this time. There are, however, areas of knowledge and human experience to which the methods of science cannot be applied. These include such things as answering purely moral questions, aesthetic questions, or what can be generally categorized as spiritual questions. Science cannot investigate these areas because they are outside the realm of material phenomena, the phenomena of matter and energy, and cannot be observed and measured.

The scientific method is a method of research with defined steps that include experiments and careful observation. The steps of the scientific method will be examined in detail later, but one of the most important aspects of this method is the testing of hypotheses. A hypothesis is a suggested explanation for an event, which can be tested. Hypotheses, or tentative explanations, are generally produced within the context of a scientific theory. A generally accepted scientific theory is a thoroughly tested and confirmed explanation for a set of observations or phenomena. Scientific theory is the foundation of scientific knowledge. In addition, in many scientific disciplines (less so in biology) there are scientific laws, often expressed in mathematical formulas, which describe how elements of nature will behave under certain specific conditions. There is not an evolution of hypotheses through theories to laws as if they represented some increase in certainty about the world. Hypotheses are the day-to-day material that scientists work with and they are developed within the context of theories. Laws are concise descriptions of parts of the world that are amenable to formulaic or mathematical description.

Natural Sciences

What would you expect to see in a museum of natural sciences? Frogs? Plants? Dinosaur skeletons? Exhibits about how the brain functions? A planetarium? Gems and minerals? Or maybe all of the above? Science includes such diverse fields as astronomy, biology, computer sciences, geology, logic, physics, chemistry, and mathematics (Figure 1.16). However, those fields of science related to the physical world and its phenomena and processes are considered natural sciences. Thus, a museum of natural sciences might contain any of the items listed above.

There is no complete agreement when it comes to defining what the natural sciences include. For some experts, the natural sciences are astronomy, biology, chemistry, earth science, and physics. Other scholars choose to divide natural sciences into life sciences, which study living things and include biology, and physical sciences, which study nonliving matter and include astronomy, physics, and chemistry. Some disciplines such as biophysics and biochemistry build on two sciences and are interdisciplinary.

Scientific Inquiry

One thing is common to all forms of science: an ultimate goal “to know.” Curiosity and inquiry are the driving forces for the development of science. Scientists seek to understand the world and the way it operates. Two methods of logical thinking are used: inductive reasoning and deductive reasoning.

Inductive reasoning is a form of logical thinking that uses related observations to arrive at a general conclusion. This type of reasoning is common in descriptive science. A life scientist such as a biologist makes observations and records them. These data can be qualitative (descriptive) or quantitative (consisting of numbers), and the raw data can be supplemented with drawings, pictures, photos, or videos. From many observations, the scientist can infer conclusions (inductions) based on evidence. Inductive reasoning involves formulating generalizations inferred from careful observation and the analysis of a large amount of data. Brain studies often work this way. Many brains are observed while people are doing a task. The part of the brain that lights up, indicating activity, is then demonstrated to be the part controlling the response to that task.

Deductive reasoning or deduction is the type of logic used in hypothesis-based science. In deductive reasoning, the pattern of thinking moves in the opposite direction as compared to inductive reasoning. Deductive reasoning is a form of logical thinking that uses a general principle or law to predict specific results. From those general principles, a scientist can deduce and predict the specific results that would be valid as long as the general principles are valid. For example, a prediction would be that if the climate is becoming warmer in a region, the distribution of plants and animals should change. Comparisons have been made between distributions in the past and the present, and the many changes that have been found are consistent with a warming climate. Finding the change in distribution is evidence that the climate change conclusion is a valid one.

Both types of logical thinking are related to the two main pathways of scientific study: descriptive science and hypothesis-based science. Descriptive (or discovery) science aims to observe, explore, and discover, while hypothesis-based science begins with a specific question or problem and a potential answer or solution that can be tested. The boundary between these two forms of study is often blurred, because most scientific endeavors combine both approaches. Observations lead to questions, questions lead to forming a hypothesis as a possible answer to those questions, and then the hypothesis is tested. Thus, descriptive science and hypothesis-based science are in continuous dialogue.

Hypothesis Testing

Biologists study the living world by posing questions about it and seeking science-based responses. This approach is common to other sciences as well and is often referred to as the scientific method. The scientific method was used even in ancient times, but it was first documented by England’s Sir Francis Bacon (1561–1626) (Figure 1.17), who set up inductive methods for scientific inquiry. The scientific method is not exclusively used by biologists but can be applied to almost anything as a logical problem-solving method.

The scientific process typically starts with an observation (often a problem to be solved) that leads to a question. Let’s think about a simple problem that starts with an observation and apply the scientific method to solve the problem. One Monday morning, a student arrives at class and quickly discovers that the classroom is too warm. That is an observation that also describes a problem: the classroom is too warm. The student then asks a question: “Why is the classroom so warm?”

Recall that a hypothesis is a suggested explanation that can be tested. To solve a problem, several hypotheses may be proposed. For example, one hypothesis might be, “The classroom is warm because no one turned on the air conditioning.” But there could be other responses to the question, and therefore other hypotheses may be proposed. A second hypothesis might be, “The classroom is warm because there is a power failure, and so the air conditioning doesn’t work.”

Once a hypothesis has been selected, a prediction may be made. A prediction is similar to a hypothesis but it typically has the format “If . . . then . . . .” For example, the prediction for the first hypothesis might be, “If the student turns on the air conditioning, then the classroom will no longer be too warm.”

A hypothesis must be testable to ensure that it is valid. For example, a hypothesis that depends on what a bear thinks is not testable, because it can never be known what a bear thinks. It should also be falsifiable, meaning that it can be disproven by experimental results. An example of an unfalsifiable hypothesis is “Botticelli’s Birth of Venus is beautiful.” There is no experiment that might show this statement to be false. To test a hypothesis, a researcher will conduct one or more experiments designed to eliminate one or more of the hypotheses. This is important. A hypothesis can be disproven, or eliminated, but it can never be proven. Science does not deal in proofs like mathematics. If an experiment fails to disprove a hypothesis, then we find support for that explanation, but this is not to say that down the road a better explanation will not be found, or a more carefully designed experiment will be found to falsify the hypothesis.
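The logic of eliminating hypotheses can be sketched in a few lines of code. The example below reuses the two warm-classroom hypotheses; the predictions attached to them, the recorded observations, and the names `Hypothesis` and `evaluate` are assumptions invented for illustration. Note that a passing test yields "supported, not proven," never "proven."

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str
    prediction: str  # observation expected if the hypothesis holds

def evaluate(hypotheses, observations):
    """Label each hypothesis by comparing its prediction with what was observed."""
    results = {}
    for h in hypotheses:
        outcome = observations.get(h.prediction)  # True, False, or None if untested
        if outcome is None:
            results[h.statement] = "untested"
        elif outcome:
            results[h.statement] = "supported, not proven"
        else:
            results[h.statement] = "rejected"
    return results

hypotheses = [
    Hypothesis("No one turned on the air conditioning",
               "room cools after the air conditioning is switched on"),
    Hypothesis("A power failure stopped the air conditioning",
               "classroom lights do not work"),
]

# Hypothetical observations, invented for illustration.
observations = {
    "room cools after the air conditioning is switched on": True,
    "classroom lights do not work": False,
}

results = evaluate(hypotheses, observations)
for statement, status in results.items():
    print(f"{statement}: {status}")
```

A failed prediction removes a hypothesis from consideration, while a confirmed prediction only leaves it standing until a better test or explanation comes along.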

Each experiment will have one or more variables and one or more controls. A variable is any part of the experiment that can vary or change during the experiment. A control is a part of the experiment that does not change. Look for the variables and controls in the example that follows. As a simple example, an experiment might be conducted to test the hypothesis that phosphate limits the growth of algae in freshwater ponds. A series of artificial ponds are filled with water and half of them are treated by adding phosphate each week, while the other half are treated by adding a salt that is known not to be used by algae. The variable here is the phosphate (or lack of phosphate), the experimental or treatment cases are the ponds with added phosphate and the control ponds are those with something inert added, such as the salt. Just adding something is also a control against the possibility that adding extra matter to the pond has an effect. If the treated ponds show greater growth of algae, then we have found support for our hypothesis. If they do not, then we reject our hypothesis. Be aware that rejecting one hypothesis does not determine whether or not the other hypotheses can be accepted; it simply eliminates one hypothesis that is not valid (Figure 1.18). Using the scientific method, the hypotheses that are inconsistent with experimental data are rejected.
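The structure of the pond experiment (one variable, a treatment group, and a control group) can be laid out as a short sketch. Everything numeric below is invented: the number of ponds, the random assignment step (which the text does not mention but is a common design choice), and the growth values, which are simulated placeholders rather than results.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the sketch is repeatable
ponds = [f"pond_{i}" for i in range(1, 11)]

# Randomly assign half of the ponds to each condition so that pre-existing
# differences between ponds are not systematically tied to the treatment.
treated = set(rng.sample(ponds, k=len(ponds) // 2))
assignment = {p: ("phosphate" if p in treated else "inert salt") for p in ponds}

def summarize(growth_by_pond):
    """Mean growth per condition and the treated-minus-control difference."""
    groups = {"phosphate": [], "inert salt": []}
    for pond, growth in growth_by_pond.items():
        groups[assignment[pond]].append(growth)
    means = {condition: mean(values) for condition, values in groups.items()}
    return means, means["phosphate"] - means["inert salt"]

# Hypothetical algal growth in arbitrary units; placeholder values only.
growth = {
    p: (14.0 if assignment[p] == "phosphate" else 9.0) + rng.uniform(-1, 1)
    for p in ponds
}

means, difference = summarize(growth)
print(means)
print(f"treated minus control: {difference:.2f}")
```

The comparison only means something because the control ponds also received an addition. How the sign of the difference is read depends on the prediction derived from the hypothesis before the experiment began.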

In recent years a new approach of testing hypotheses has developed as a result of an exponential growth of data deposited in various databases. Using computer algorithms and statistical analyses of data in databases, a new field of so-called "data research" (also referred to as "in silico" research) provides new methods of data analyses and their interpretation. This will increase the demand for specialists in both biology and computer science, a promising career opportunity.

Visual Connection

In the example below, the scientific method is used to solve an everyday problem. Which part in the example below is the hypothesis? Which is the prediction? Based on the results of the experiment, is the hypothesis supported? If it is not supported, propose some alternative hypotheses.

  • My toaster doesn’t toast my bread.
  • Why doesn’t my toaster work?
  • There is something wrong with the electrical outlet.
  • If something is wrong with the outlet, my coffeemaker also won’t work when plugged into it.
  • I plug my coffeemaker into the outlet.
  • My coffeemaker works.

In practice, the scientific method is not as rigid and structured as it might at first appear. Sometimes an experiment leads to conclusions that favor a change in approach; often, an experiment brings entirely new scientific questions to the puzzle. Many times, science does not operate in a linear fashion; instead, scientists continually draw inferences and make generalizations, finding patterns as their research proceeds. Scientific reasoning is more complex than the scientific method alone suggests.

Basic and Applied Science

The scientific community has been debating for the last few decades about the value of different types of science. Is it valuable to pursue science for the sake of simply gaining knowledge, or does scientific knowledge only have worth if we can apply it to solving a specific problem or bettering our lives? This question focuses on the differences between two types of science: basic science and applied science.

Basic science or “pure” science seeks to expand knowledge regardless of the short-term application of that knowledge. It is not focused on developing a product or a service of immediate public or commercial value. The immediate goal of basic science is knowledge for knowledge’s sake, though this does not mean that in the end it may not result in an application.

In contrast, applied science, or “technology,” aims to use science to solve real-world problems, making it possible, for example, to improve a crop yield, find a cure for a particular disease, or save animals threatened by a natural disaster. In applied science, the problem is usually defined for the researcher.

Some individuals may perceive applied science as “useful” and basic science as “useless.” A question these people might pose to a scientist advocating knowledge acquisition would be, “What for?” A careful look at the history of science, however, reveals that basic knowledge has resulted in many remarkable applications of great value. Many scientists think that a basic understanding of science is necessary before an application is developed; therefore, applied science relies on the results generated through basic science. Other scientists think that it is time to move on from basic science and instead to find solutions to actual problems. Both approaches are valid. It is true that there are problems that demand immediate attention; however, few solutions would be found without the help of the knowledge generated through basic science.

One example of how basic and applied science can work together to solve practical problems occurred after the discovery of DNA structure led to an understanding of the molecular mechanisms governing DNA replication. Strands of DNA, unique in every human, are found in our cells, where they provide the instructions necessary for life. During DNA replication, new copies of DNA are made, shortly before a cell divides to form new cells. Understanding the mechanisms of DNA replication enabled scientists to develop laboratory techniques that are now used to identify genetic diseases, pinpoint individuals who were at a crime scene, and determine paternity. Without basic science, it is unlikely that applied science could exist.

Another example of the link between basic and applied research is the Human Genome Project, a study in which each human chromosome was analyzed and mapped to determine the precise sequence of DNA subunits and the exact location of each gene. (The gene is the basic unit of heredity represented by a specific DNA segment that codes for a functional molecule.) Other organisms have also been studied as part of this project to gain a better understanding of human chromosomes. The Human Genome Project (Figure 1.19) relied on basic research carried out with non-human organisms and, later, with the human genome. An important end goal eventually became using the data for applied research seeking cures for genetically related diseases.

While research efforts in both basic science and applied science are usually carefully planned, it is important to note that some discoveries are made by serendipity, that is, by means of a fortunate accident or a lucky surprise. Penicillin was discovered when biologist Alexander Fleming accidentally left a petri dish of Staphylococcus bacteria open. An unwanted mold grew, killing the bacteria. The mold turned out to be Penicillium, and a new critically important antibiotic was discovered. In a similar manner, Percy Lavon Julian was an established medicinal chemist working on a way to mass produce compounds with which to manufacture important drugs. He was focused on using soybean oil in the production of progesterone (a hormone important in the menstrual cycle and pregnancy), but it wasn't until water accidentally leaked into a large soybean oil storage tank that he found his method. Immediately recognizing the resulting substance as stigmasterol, a primary ingredient in progesterone and similar drugs, he began the process of replicating and industrializing the process in a manner that has helped millions of people. Even in the highly organized world of science, luck, when combined with an observant, curious mind focused on the types of reasoning discussed above, can lead to unexpected breakthroughs.

Reporting Scientific Work

Whether scientific research is basic science or applied science, scientists must share their findings for other researchers to expand and build upon their discoveries. Communication and collaboration within and between subdisciplines of science are key to the advancement of knowledge in science. For this reason, an important aspect of a scientist’s work is disseminating results and communicating with peers. Scientists can share results by presenting them at a scientific meeting or conference, but this approach can reach only the limited few who are present. Instead, most scientists present their results in peer-reviewed articles that are published in scientific journals. Peer-reviewed articles are scientific papers that are reviewed, usually anonymously, by a scientist’s colleagues, or peers. These colleagues are qualified individuals, often experts in the same research area, who judge whether or not the scientist’s work is suitable for publication. The process of peer review helps to ensure that the research described in a scientific paper or grant proposal is original, significant, logical, and thorough. Grant proposals, which are requests for research funding, are also subject to peer review. Scientists publish their work so other scientists can reproduce their experiments under similar or different conditions to expand on the findings.

Many journals, as well as the popular press, do not use a peer-review system. A large number of online open-access journals (journals with articles available without cost) are now available; many of them use rigorous peer-review systems, but some do not. Results of any studies published in these forums without peer review are not reliable and should not form the basis for other scientific work. In one exception, journals may allow a researcher to cite a personal communication from another researcher about unpublished results with the cited author’s permission.

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/concepts-biology/pages/1-introduction
  • Authors: Samantha Fowler, Rebecca Roush, James Wise
  • Publisher/website: OpenStax
  • Book title: Concepts of Biology
  • Publication date: Apr 25, 2013
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/concepts-biology/pages/1-introduction
  • Section URL: https://openstax.org/books/concepts-biology/pages/1-2-the-process-of-science

© Apr 26, 2024 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

Controlled experiments

Introduction

How are hypotheses tested?

  • One pot of seeds gets watered every afternoon.
  • The other pot of seeds doesn't get any water at all.

Control and experimental groups

Independent and dependent variables

Independent variables

Dependent variables

Variability and repetition

Controlled experiment case study: CO2 and coral bleaching

  • What your control and experimental groups would be
  • What your independent and dependent variables would be
  • What results you would predict in each group

Experimental setup

  • Some corals were grown in tanks of normal seawater, which is not very acidic (pH around 8.2). The corals in these tanks served as the control group.
  • Other corals were grown in tanks of seawater that were more acidic than usual due to addition of CO2. One set of tanks was medium-acidity (pH about 7.9), while another set was high-acidity (pH about 7.65). Both the medium-acidity and high-acidity groups were experimental groups.
  • In this experiment, the independent variable was the acidity (pH) of the seawater. The dependent variable was the degree of bleaching of the corals.
  • The researchers used a large sample size and repeated their experiment. Each tank held 5 fragments of coral, and there were 5 identical tanks for each group (control, medium-acidity, and high-acidity).

Figure: Experimental setup to test effects of water acidity on coral bleaching. Control group: Coral fragments are placed in a tank of normal seawater (pH 8.2). Experimental group 1: Coral fragments are placed in a tank of slightly acidified seawater (pH 7.9). Experimental group 2: Coral fragments are placed in a tank of more strongly acidified seawater (pH 7.65). The water acidity is the independent variable. 8 weeks are allowed to pass for each of the tanks. Control group: Corals are about 10% bleached on average. Experimental group 1 (medium acidity): Corals are about 20% bleached on average. Experimental group 2 (higher acidity): Corals are about 40% bleached on average. Degree of coral bleaching is the dependent variable.

Note: None of these tanks was "acidic" on an absolute scale. That is, the pH values were all above the neutral pH of 7.0. However, the two groups of experimental tanks were moderately and highly acidic to the corals, that is, relative to their natural habitat of plain seawater.
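The relationship between the independent and dependent variables in this setup can be tabulated with a short sketch. The pH values and average bleaching percentages are the ones given in the description above; the variable names and the simple monotonic-trend check are illustrative assumptions, not the researchers' analysis.

```python
# (group name, seawater pH, average percent of coral bleached after 8 weeks)
# Values are taken from the experiment description above.
groups = [
    ("control", 8.2, 10),
    ("medium acidity", 7.9, 20),
    ("high acidity", 7.65, 40),
]

control_bleaching = groups[0][2]

# Extra bleaching in each group relative to the control, in percentage points.
extra_bleaching = {name: bleached - control_bleaching for name, _, bleached in groups}

# Does the dependent variable (bleaching) rise steadily as the
# independent variable (pH) falls?
ordered = sorted(groups, key=lambda g: g[1], reverse=True)
monotonic = all(a[2] < b[2] for a, b in zip(ordered, ordered[1:]))

for name, ph, bleached in groups:
    print(f"{name:>15}: pH {ph}, {bleached}% bleached, "
          f"+{extra_bleaching[name]} points vs control")
print("bleaching increases as pH decreases:", monotonic)
```

Comparing each experimental group against the control, rather than looking at the raw percentages alone, is what separates the effect of acidity from the baseline bleaching that occurred even in normal seawater.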

Analyzing the results



What Is a Hypothesis? (Science)

If...,Then...


A hypothesis (plural hypotheses) is a proposed explanation for an observation. The definition depends on the subject.

In science, a hypothesis is part of the scientific method. It is a prediction or explanation that is tested by an experiment. Observations and experiments may disprove a scientific hypothesis, but can never entirely prove one.

In the study of logic, a hypothesis is an if-then proposition, typically written in the form, "If X, then Y."

In common usage, a hypothesis is simply a proposed explanation or prediction, which may or may not be tested.

Writing a Hypothesis

Most scientific hypotheses are proposed in the if-then format because it's easy to design an experiment to see whether or not a cause and effect relationship exists between the independent variable and the dependent variable. The hypothesis is written as a prediction of the outcome of the experiment.

Null Hypothesis and Alternative Hypothesis

Statistically, it's easier to test a precise claim that two variables are unrelated than to demonstrate their connection directly. So, scientists often propose the null hypothesis. The null hypothesis assumes changing the independent variable will have no effect on the dependent variable.

In contrast, the alternative hypothesis suggests changing the independent variable will have an effect on the dependent variable. Designing an experiment to test this hypothesis can be trickier because there are many ways to state an alternative hypothesis.

For example, consider a possible relationship between getting a good night's sleep and getting good grades. The null hypothesis might be stated: "The number of hours of sleep students get is unrelated to their grades" or "There is no correlation between hours of sleep and grades."

An experiment to test this hypothesis might involve collecting data, recording average hours of sleep and grades for each student. If students who get eight hours of sleep generally do better than students who get four hours of sleep or 10 hours of sleep, the null hypothesis might be rejected.

But the alternative hypothesis is harder to propose and test. The most general statement would be: "The amount of sleep students get affects their grades." The hypothesis might also be stated as "If you get more sleep, your grades will improve" or "Students who get nine hours of sleep have better grades than those who get more or less sleep."

In an experiment, you can collect the same data, but because the alternative hypothesis can be stated in so many ways, the statistical analysis is less likely to support any one version with a high level of confidence.

Usually, a scientist starts out with the null hypothesis. From there, it may be possible to propose and test an alternative hypothesis, to narrow down the relationship between the variables.
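One way to test the null hypothesis "there is no correlation between hours of sleep and grades" is a permutation test, sketched below. The sleep and grade numbers are invented purely for illustration, and the function names are hypothetical. Note that the Pearson correlation used here only detects a straight-line relationship, so it would not capture a pattern in which eight hours is better than both four and ten.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Estimate how often shuffled data looks at least as correlated as the real data.

    Shuffling the grades breaks any real link to sleep, so the shuffled
    correlations show what the null hypothesis would produce by chance.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical data: average hours of sleep and grade for ten students.
hours = [4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
grades = [60, 64, 66, 70, 69, 75, 78, 82, 85, 88]

if __name__ == "__main__":
    p = permutation_p_value(hours, grades)
    print("r =", round(pearson_r(hours, grades), 3), "p =", p)
    print("reject null" if p < 0.05 else "fail to reject null")
```

A small p-value means the observed pattern would be unusual if sleep and grades were unrelated, so the null hypothesis is rejected. A large p-value means only that the data fail to reject it, not that the null hypothesis has been proven.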

Example of a Hypothesis

Examples of a hypothesis include:

  • If you drop a rock and a feather, (then) they will fall at the same rate.
  • Plants need sunlight in order to live. (if sunlight, then life)
  • Eating sugar gives you energy. (if sugar, then energy)

Module 1: Introduction to Biology

Experiments and Hypotheses

Learning Outcomes

  • Form a hypothesis and use it to design a scientific experiment

Now we’ll focus on the methods of scientific inquiry. Science often involves making observations and developing hypotheses. Experiments and further observations are often used to test the hypotheses.

A scientific experiment is a carefully organized procedure in which the scientist intervenes in a system to change something, then observes the result of the change. Scientific inquiry often involves doing experiments, though not always. For example, a scientist studying the mating behaviors of ladybugs might begin with detailed observations of ladybugs mating in their natural habitats. While this research may not be experimental, it is scientific: it involves careful and verifiable observation of the natural world. The same scientist might then treat some of the ladybugs with a hormone hypothesized to trigger mating and observe whether these ladybugs mated sooner or more often than untreated ones. This would qualify as an experiment because the scientist is now making a change in the system and observing the effects.

Forming a Hypothesis

When conducting scientific experiments, researchers develop hypotheses to guide experimental design. A hypothesis is a suggested explanation that is both testable and falsifiable. You must be able to test your hypothesis through observations and research, and it must be possible to prove your hypothesis false.

For example, Michael observes that maple trees lose their leaves in the fall. He might then propose a possible explanation for this observation: “cold weather causes maple trees to lose their leaves in the fall.” This statement is testable. He could grow maple trees in a warm enclosed environment such as a greenhouse and see if their leaves still dropped in the fall. The hypothesis is also falsifiable. If the leaves still dropped in the warm environment, then clearly temperature was not the main factor in causing maple leaves to drop in autumn.

In the Try It below, you can practice recognizing scientific hypotheses. As you consider each statement, try to think as a scientist would: can I test this hypothesis with observations or experiments? Is the statement falsifiable? If the answer to either of these questions is “no,” the statement is not a valid scientific hypothesis.

Practice Questions

Determine whether each following statement is a scientific hypothesis.

Air pollution from automobile exhaust can trigger symptoms in people with asthma.

  • a. No. This statement is not testable or falsifiable.
  • b. No. This statement is not testable.
  • c. No. This statement is not falsifiable.
  • d. Yes. This statement is testable and falsifiable.

Natural disasters, such as tornadoes, are punishments for bad thoughts and behaviors.

a: No. This statement is not testable or falsifiable. “Bad thoughts and behaviors” are excessively vague and subjective variables that would be impossible to measure or agree upon in a reliable way. The statement might be “falsifiable” if you came up with a counterexample: a “wicked” place that was not punished by a natural disaster. But some would question whether the people in that place were really wicked, and others would continue to predict that a natural disaster was bound to strike that place at some point. There is no reason to suspect that people’s immoral behavior affects the weather unless you bring up the intervention of a supernatural being, making this idea even harder to test.

Testing a Vaccine

Let’s examine the scientific process by discussing an actual scientific experiment conducted by researchers at the University of Washington. These researchers investigated whether a vaccine may reduce the incidence of human papillomavirus (HPV) infection. The experimental process and results were published in an article titled “A controlled trial of a human papillomavirus type 16 vaccine.”

Preliminary observations made by the researchers who conducted the HPV experiment are listed below:

  • Human papillomavirus (HPV) is the most common sexually transmitted virus in the United States.
  • There are about 40 different types of HPV. A significant number of people who have HPV are unaware of it because many of these viruses cause no symptoms.
  • Some types of HPV can cause cervical cancer.
  • About 4,000 women a year die of cervical cancer in the United States.

Practice Question

Researchers have developed a potential vaccine against HPV and want to test it. What is the first testable hypothesis that the researchers should study?

  • HPV causes cervical cancer.
  • People should not have unprotected sex with many partners.
  • People who get the vaccine will not get HPV.
  • The HPV vaccine will protect people against cancer.

Experimental Design

You’ve successfully identified a hypothesis for the University of Washington’s study on HPV: People who get the HPV vaccine will not get HPV.

The next step is to design an experiment that will test this hypothesis. There are several important factors to consider when designing a scientific experiment. First, scientific experiments must have an experimental group. This is the group that receives the experimental treatment necessary to address the hypothesis.

The experimental group receives the vaccine, but how can we know if the vaccine made a difference? Many things may change HPV infection rates in a group of people over time. To clearly show that the vaccine was effective in helping the experimental group, we need to include in our study an otherwise similar control group that does not get the treatment. We can then compare the two groups and determine if the vaccine made a difference. The control group shows us what happens in the absence of the factor under study.

However, the control group cannot get “nothing.” Instead, the control group often receives a placebo. A placebo is a procedure that has no expected therapeutic effect—such as giving a person a sugar pill or a shot containing only plain saline solution with no drug. Scientific studies have shown that the “placebo effect” can alter experimental results because when individuals are told that they are or are not being treated, this knowledge can alter their actions or their emotions, which can then alter the results of the experiment.

Moreover, if the doctor knows which group a patient is in, this can also influence the results of the experiment. Without saying so directly, the doctor may show, through body language or other subtle cues, their views about whether the patient is likely to get well. These cues can then alter the patient’s experience and change the results of the experiment. Therefore, many clinical studies are “double blind.” In these studies, neither the doctor nor the patient knows which group the patient is in until all experimental results have been collected.

Both placebo treatments and double-blind procedures are designed to prevent bias. Bias is any systematic error that makes a particular experimental outcome more or less likely. Errors can happen in any experiment: people make mistakes in measurement, instruments fail, computer glitches can alter data. But most such errors are random and don’t favor one outcome over another. Bias is different because it pushes results in one direction: for example, patients’ belief in a treatment can make it more likely to appear to “work.” Placebos and double-blind procedures are used to level the playing field so that both groups of study subjects are treated equally and share similar beliefs about their treatment.

The scientists who are researching the effectiveness of the HPV vaccine will test their hypothesis by separating 2,392 young women into two groups: the control group and the experimental group. Answer the following questions about these two groups.

  • a. This group is given a placebo.
  • b. This group is deliberately infected with HPV.
  • c. This group is given nothing.
  • d. This group is given the HPV vaccine.

Answers:

  • Control group: a. This group is given a placebo. A placebo will be a shot, just like the HPV vaccine, but it will have no active ingredient. It may change people’s thinking or behavior to have such a shot given to them, but it will not stimulate the immune systems of the subjects in the same way as predicted for the vaccine itself.
  • Experimental group: d. This group is given the HPV vaccine. The experimental group will receive the HPV vaccine and researchers will then be able to see if it works, when compared to the control group.
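The group assignment and double-blind procedure described above could be sketched as follows. This is a minimal illustration, not the procedure used in the actual trial: the equal split, the six-digit kit codes, and the function names are all assumptions made for this example.

```python
import random


def assign_groups(subject_ids, seed=None):
    """Randomly split subjects into a vaccine group and a placebo group.

    Assumption for illustration: the split is into two halves.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    assignment = {sid: "vaccine" for sid in shuffled[:half]}
    assignment.update({sid: "placebo" for sid in shuffled[half:]})
    return assignment


def blind(assignment, seed=None):
    """Hide the treatment behind opaque kit codes (hypothetical scheme).

    Doctors and subjects see only the kit code. The unblinding key is held
    separately until all results have been collected.
    """
    rng = random.Random(seed)
    kit_codes = rng.sample(range(100000, 1000000), len(assignment))
    kit_for_subject = {}
    unblinding_key = {}
    for (sid, treatment), kit in zip(assignment.items(), kit_codes):
        kit_for_subject[sid] = kit
        unblinding_key[kit] = treatment
    return kit_for_subject, unblinding_key


if __name__ == "__main__":
    assignment = assign_groups(range(1, 2393), seed=1)  # 2,392 subjects
    kits, key = blind(assignment, seed=2)
    print("subject 1 receives kit", kits[1])
```

Random assignment keeps the two groups similar on average, and the kit codes mean that neither the person giving the shot nor the person receiving it can tell which group they are in.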

Experimental Variables

A variable is a characteristic of a subject (in this case, of a person in the study) that can vary over time or among individuals. Sometimes a variable takes the form of a category, such as male or female; often a variable can be measured precisely, such as body height. Ideally, only one variable is different between the control group and the experimental group in a scientific experiment. Otherwise, the researchers will not be able to determine which variable caused any differences seen in the results. For example, imagine that the people in the control group were, on average, much more sexually active than the people in the experimental group. If, at the end of the experiment, the control group had a higher rate of HPV infection, could you confidently determine why? Maybe the experimental subjects were protected by the vaccine, but maybe they were protected by their low level of sexual contact.

To avoid this situation, experimenters make sure that their subject groups are as similar as possible in all variables except for the variable that is being tested in the experiment. This variable, or factor, will be deliberately changed in the experimental group. The one variable that is different between the two groups is called the independent variable. An independent variable is known or hypothesized to cause some outcome. Imagine an educational researcher investigating the effectiveness of a new teaching strategy in a classroom. The experimental group receives the new teaching strategy, while the control group receives the traditional strategy. It is the teaching strategy that is the independent variable in this scenario. In an experiment, the independent variable is the variable that the scientist deliberately changes or imposes on the subjects.

Dependent variables are known or hypothesized consequences; they are the effects that result from changes or differences in an independent variable. In an experiment, the dependent variables are those that the scientist measures before, during, and particularly at the end of the experiment to see if they have changed as expected. The dependent variable must be stated so that it is clear how it will be observed or measured. Rather than comparing “learning” among students (which is a vague and difficult-to-measure concept), an educational researcher might choose to compare test scores, which are very specific and easy to measure.

In any real-world example, many, many variables might affect the outcome of an experiment, yet only one or a few independent variables can be tested. Other variables must be kept as similar as possible between the study groups and are called control variables. For our educational research example, if the control group consisted only of people between the ages of 18 and 20 and the experimental group contained people between the ages of 30 and 35, we would not know if it was the teaching strategy or the students’ ages that played a larger role in the results. To avoid this problem, a good study will be set up so that each group contains students with a similar age profile. In a well-designed educational research study, student age will be a controlled variable, along with other possibly important factors like gender, past educational achievement, and pre-existing knowledge of the subject area.
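A quick check that a control variable such as age is similar between groups might look like the sketch below. The ages and the two-year tolerance are hypothetical values chosen for illustration, and comparing group means is only one simple way to check balance.

```python
from statistics import mean


def is_balanced(control_values, experimental_values, tolerance):
    """Return True if the two group means differ by no more than `tolerance`."""
    return abs(mean(control_values) - mean(experimental_values)) <= tolerance


# Hypothetical ages, mirroring the poorly designed study described above.
control_ages = [18, 19, 20, 19]
mismatched_ages = [30, 32, 35, 31]
# Hypothetical ages for a better-matched experimental group.
matched_ages = [19, 18, 20, 20]

if __name__ == "__main__":
    print(is_balanced(control_ages, mismatched_ages, tolerance=2))
    print(is_balanced(control_ages, matched_ages, tolerance=2))
```

If the check fails, any difference in the dependent variable could be due to age rather than to the independent variable.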

What is the independent variable in this experiment?

  • Sex (all of the subjects will be female)
  • Presence or absence of the HPV vaccine
  • Presence or absence of HPV (the virus)

List three control variables other than age.

What is the dependent variable in this experiment?

  • Sex (male or female)
  • Rates of HPV infection
  • Age (years)

  • Revision and adaptation. Authored by: Shelli Carter and Lumen Learning. Provided by: Lumen Learning. License: CC BY-NC-SA: Attribution-NonCommercial-ShareAlike
  • Scientific Inquiry. Provided by: Open Learning Initiative. Located at: https://oli.cmu.edu/jcourse/workbook/activity/page?context=434a5c2680020ca6017c03488572e0f8. Project: Introduction to Biology (Open + Free). License: CC BY-NC-SA: Attribution-NonCommercial-ShareAlike


How to Write a Great Hypothesis

Hypothesis Definition, Format, Examples, and Tips

Kendra Cherry, MS, is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."



A hypothesis is a tentative statement about the relationship between two or more variables. It is a specific, testable prediction about what you expect to happen in a study. It is a preliminary answer to your question that helps guide the research process.

Consider a study designed to examine the relationship between sleep deprivation and test performance. The hypothesis might be: "This study is designed to assess the hypothesis that sleep-deprived people will perform worse on a test than individuals who are not sleep-deprived."

At a Glance

A hypothesis is crucial to scientific research because it offers a clear direction for what the researchers are looking to find. This allows them to design experiments to test their predictions and add to our scientific knowledge about the world. This article explores how a hypothesis is used in psychology research, how to write a good hypothesis, and the different types of hypotheses you might use.

The Hypothesis in the Scientific Method

In the scientific method, whether it involves research in psychology, biology, or some other area, a hypothesis represents what the researchers think will happen in an experiment. The scientific method involves the following steps:

  • Forming a question
  • Performing background research
  • Creating a hypothesis
  • Designing an experiment
  • Collecting data
  • Analyzing the results
  • Drawing conclusions
  • Communicating the results

The hypothesis is a prediction, but it involves more than a guess. Most of the time, the hypothesis begins with a question which is then explored through background research. At this point, researchers then begin to develop a testable hypothesis.

Unless you are conducting an exploratory study, your hypothesis should always explain what you expect to happen.

In a study exploring the effects of a particular drug, the hypothesis might be that researchers expect the drug to have some type of effect on the symptoms of a specific illness. In psychology, the hypothesis might focus on how a certain aspect of the environment might influence a particular behavior.

Remember, a hypothesis does not have to be correct. While the hypothesis predicts what the researchers expect to see, the goal of the research is to determine whether this guess is right or wrong. When conducting an experiment, researchers might explore numerous factors to determine which ones might contribute to the ultimate outcome.

In many cases, researchers may find that the results of an experiment do not support the original hypothesis. When writing up these results, the researchers might suggest other options that should be explored in future studies.

In many cases, researchers might draw a hypothesis from a specific theory or build on previous research. For example, prior research has shown that stress can impact the immune system. So a researcher might hypothesize: "People with high-stress levels will be more likely to contract a common cold after being exposed to the virus than people who have low-stress levels."

In other instances, researchers might look at commonly held beliefs or folk wisdom. "Birds of a feather flock together" is one example of folk adage that a psychologist might try to investigate. The researcher might pose a specific hypothesis that "People tend to select romantic partners who are similar to them in interests and educational level."

Elements of a Good Hypothesis

So how do you write a good hypothesis? When trying to come up with a hypothesis for your research or experiments, ask yourself the following questions:

  • Is your hypothesis based on your research on a topic?
  • Can your hypothesis be tested?
  • Does your hypothesis include independent and dependent variables?

Before you come up with a specific hypothesis, spend some time doing background research. Once you have completed a literature review, start thinking about potential questions you still have. Pay attention to the discussion section in the journal articles you read. Many authors will suggest questions that still need to be explored.

How to Formulate a Good Hypothesis

To form a hypothesis, you should take these steps:

  • Collect as many observations about a topic or problem as you can.
  • Evaluate these observations and look for possible causes of the problem.
  • Create a list of possible explanations that you might want to explore.
  • After you have developed some possible hypotheses, think of ways that you could confirm or disprove each hypothesis through experimentation. This is known as falsifiability.

Falsifiability of a Hypothesis

In the scientific method, falsifiability is an important part of any valid hypothesis. In order to test a claim scientifically, it must be possible that the claim could be proven false.

Students sometimes mistake falsifiability for the claim that something is false, which is not the case. Falsifiability means that if a claim were false, it would be possible to demonstrate that it is false.

One of the hallmarks of pseudoscience is that it makes claims that cannot be refuted or proven false.

The Importance of Operational Definitions

A variable is a factor or element that can be changed and manipulated in ways that are observable and measurable. However, the researcher must also define how the variable will be manipulated and measured in the study.

Operational definitions are specific definitions for all relevant factors in a study. This process helps make vague or ambiguous concepts detailed and measurable.

For example, a researcher might operationally define the variable "test anxiety" as the results of a self-report measure of anxiety experienced during an exam. A "study habits" variable might be defined by the amount of studying that actually occurs as measured by time.

These precise descriptions are important because many things can be measured in various ways. Clearly defining these variables and how they are measured helps ensure that other researchers can replicate your results.

Replicability

One of the basic principles of any type of scientific research is that the results must be replicable.

Replication means repeating an experiment in the same way to produce the same results. By clearly detailing the specifics of how the variables were measured and manipulated, other researchers can better understand the results and repeat the study if needed.

Some variables are more difficult than others to define. For example, how would you operationally define a variable such as aggression? For obvious ethical reasons, researchers cannot create a situation in which a person behaves aggressively toward others.

To measure this variable, the researcher must devise a measurement that assesses aggressive behavior without harming others. The researcher might utilize a simulated task to measure aggressiveness in this situation.

Hypothesis Checklist

  • Does your hypothesis focus on something that you can actually test?
  • Does your hypothesis include both an independent and dependent variable?
  • Can you manipulate the variables?
  • Can your hypothesis be tested without violating ethical standards?

Hypothesis Types

The hypothesis you use will depend on what you are investigating and hoping to find. Some of the main types of hypotheses that you might use include:

  • Simple hypothesis: This type of hypothesis suggests there is a relationship between one independent variable and one dependent variable.
  • Complex hypothesis: This type suggests a relationship between three or more variables, such as two independent variables and one dependent variable.
  • Null hypothesis: This hypothesis suggests no relationship exists between two or more variables.
  • Alternative hypothesis: This hypothesis states the opposite of the null hypothesis.
  • Statistical hypothesis: This hypothesis uses statistical analysis to evaluate a representative population sample and then generalizes the findings to the larger group.
  • Logical hypothesis: This hypothesis assumes a relationship between variables without collecting data or evidence.

Hypothesis Format

A hypothesis often follows a basic format of "If {this happens} then {this will happen}." One way to structure your hypothesis is to describe what will happen to the dependent variable if you change the independent variable.

The basic format might be: "If {these changes are made to a certain independent variable}, then we will observe {a change in a specific dependent variable}."
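The if-then format above can be treated as a fill-in-the-blank template. The sketch below is illustrative only: the class name, its fields, and the exact wording of the null statement are assumptions made for this example, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypothesisTemplate:
    """Hypothetical helper that fills in the if-then format described above."""

    independent_change: str  # the change made to the independent variable
    dependent_variable: str  # what is measured
    direction: str           # predicted change, e.g. "higher" or "lower"

    def if_then(self) -> str:
        """State the prediction as an if-then sentence."""
        return (
            f"If {self.independent_change}, then we will observe "
            f"{self.direction} {self.dependent_variable}."
        )

    def null(self) -> str:
        """State the matching 'no effect' version of the same hypothesis."""
        return (
            f"If {self.independent_change}, then we will observe "
            f"no difference in {self.dependent_variable}."
        )


if __name__ == "__main__":
    h = HypothesisTemplate(
        independent_change="students eat breakfast before a math exam",
        dependent_variable="math exam scores",
        direction="higher",
    )
    print(h.if_then())
    print(h.null())
```

Writing the hypothesis this way forces you to name both the independent and the dependent variable, which is one of the questions to ask when checking a hypothesis.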

Hypotheses Examples

A few examples of simple hypotheses:

  • "Students who eat breakfast will perform better on a math exam than students who do not eat breakfast."
  • "Students who experience test anxiety before an English exam will get lower scores than students who do not experience test anxiety."
  • "Motorists who talk on the phone while driving will be more likely to make errors on a driving course than those who do not talk on the phone."
  • "Children who receive a new reading intervention will have higher reading scores than children who do not receive the intervention."

Examples of a complex hypothesis include:

  • "People with high-sugar diets and sedentary activity levels are more likely to develop depression."
  • "Younger people who are regularly exposed to green, outdoor areas have better subjective well-being than older adults who have limited exposure to green spaces."

Examples of a null hypothesis include:

  • "There is no difference in anxiety levels between people who take St. John's wort supplements and those who do not."
  • "There is no difference in scores on a memory recall task between children and adults."
  • "There is no difference in aggression levels between children who play first-person shooter games and those who do not."

Examples of an alternative hypothesis:

  • "People who take St. John's wort supplements will have less anxiety than those who do not."
  • "Adults will perform better on a memory task than children."
  • "Children who play first-person shooter games will show higher levels of aggression than children who do not." 

Collecting Data on Your Hypothesis

Once a researcher has formed a testable hypothesis, the next step is to select a research design and start collecting data. The research method depends largely on exactly what they are studying. There are two basic types of research methods: descriptive research and experimental research.

Descriptive Research Methods

Descriptive research methods such as case studies, naturalistic observations, and surveys are often used when conducting an experiment is difficult or impossible. These methods are best used to describe different aspects of a behavior or psychological phenomenon.

Once a researcher has collected data using descriptive methods, a correlational study can examine how the variables are related. This research method might be used to investigate a hypothesis that is difficult to test experimentally.

Experimental Research Methods

Experimental methods are used to demonstrate causal relationships between variables. In an experiment, the researcher systematically manipulates a variable of interest (known as the independent variable) and measures the effect on another variable (known as the dependent variable).

Unlike correlational studies, which can only be used to determine if there is a relationship between two variables, experimental methods can be used to determine the actual nature of the relationship: whether changes in one variable actually cause another to change.

The hypothesis is a critical part of any scientific exploration. It represents what researchers expect to find in a study or experiment. In situations where the hypothesis is unsupported by the research, the research still has value. Such research helps us better understand how different aspects of the natural world relate to one another. It also helps us develop new hypotheses that can then be tested in the future.


By Kendra Cherry, MSEd

Kendra Cherry, MS, is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."

The Difference Between Hypothesis and Theory

A hypothesis is an assumption, an idea that is proposed for the sake of argument so that it can be tested to see if it might be true.

In the scientific method, the hypothesis is constructed before any applicable research has been done, apart from a basic background review. You ask a question, read up on what has been studied before, and then form a hypothesis.

A hypothesis is usually tentative; it's an assumption or suggestion made strictly for the objective of being tested.

A theory, in contrast, is a principle that has been formed as an attempt to explain things that have already been substantiated by data. It is used in the names of a number of principles accepted in the scientific community, such as the Big Bang Theory. Because of the rigors of experimentation and control, it is understood to be more likely to be true than a hypothesis is.

In non-scientific use, however, hypothesis and theory are often used interchangeably to mean simply an idea, speculation, or hunch, with theory being the more common choice.

Since this casual use does away with the distinctions upheld by the scientific community, hypothesis and theory are prone to being wrongly interpreted even when they are encountered in scientific contexts—or at least, contexts that allude to scientific study without making the critical distinction that scientists employ when weighing hypotheses and theories.

The most common occurrence is when theory is interpreted—and sometimes even gleefully seized upon—to mean something having less truth value than other scientific principles. (The word law applies to principles so firmly established that they are almost never questioned, such as the law of gravity.)

This mistake is one of projection: since we use theory in general to mean something lightly speculated, then it's implied that scientists must be talking about the same level of uncertainty when they use theory to refer to their well-tested and reasoned principles.

The distinction has come to the forefront particularly on occasions when the content of science curricula in schools has been challenged, notably when a school board in Georgia put stickers on textbooks stating that evolution was "a theory, not a fact, regarding the origin of living things." As Kenneth R. Miller, a cell biologist at Brown University, has said, a theory "doesn't mean a hunch or a guess. A theory is a system of explanations that ties together a whole bunch of facts. It not only explains those facts, but predicts what you ought to find from other observations and experiments."

While theories are never completely infallible, they form the basis of scientific reasoning because, as Miller said, "to the best of our ability, we've tested them, and they've held up."

  • proposition
  • supposition

hypothesis, theory, law mean a formula derived by inference from scientific data that explains a principle operating in nature.

hypothesis implies insufficient evidence to provide more than a tentative explanation.

theory implies a greater range of evidence and greater likelihood of truth.

law implies a statement of order and relation in nature that has been found to be invariable under the same conditions.

Word History

Greek, from hypotithenai to put under, suppose, from hypo- + tithenai to put — more at do

1641, in the meaning defined at sense 1a

Phrases Containing hypothesis

  • counter-hypothesis
  • nebular hypothesis
  • null hypothesis
  • planetesimal hypothesis
  • Whorfian hypothesis

Cite this Entry

“Hypothesis.” Merriam-Webster.com Dictionary , Merriam-Webster, https://www.merriam-webster.com/dictionary/hypothesis. Accessed 19 May. 2024.

Null hypothesis

Null hypothesis n., plural: null hypotheses [nʌl haɪˈpɒθɪsɪs] Definition: a hypothesis that is valid or presumed true until invalidated by a statistical test

Null Hypothesis Definition

A null hypothesis is defined as “the commonly accepted fact (such as the sky is blue)” that researchers aim to reject or nullify.

More formally, we can define a null hypothesis as “a statistical theory suggesting that no statistical relationship exists between given observed variables”.

In biology, the null hypothesis is used to nullify or reject a common belief. The researcher carries out research aimed at rejecting the commonly accepted belief.

What Is a Null Hypothesis?

A hypothesis is defined as a theory or an assumption that is based on limited evidence. It requires further experiments and testing for confirmation, and that testing can show the hypothesis to be either false or true (Blackwelder, 1982).

For example, Susie assumes that mineral water supports better growth and nourishment of plants than distilled water. To test this hypothesis, she runs an experiment for almost a month, watering some plants with mineral water and others with distilled water.

A hypothesis stating that there is no statistically significant relationship between two variables is called a null hypothesis. The investigator tries to disprove such a hypothesis. In the plant example above, the null hypothesis is:

There is no statistical relationship between the type of water given to plants and their growth and nourishment.

Usually, an investigator tries to prove the null hypothesis wrong and to explain the relationship between the two variables.

The opposite of the null hypothesis is known as the alternate hypothesis. In the plant example, the alternate hypothesis is:

There is a statistical relationship between the type of water given to plants and their growth and nourishment.

The example below shows the difference between null vs alternative hypotheses:

Alternate Hypothesis: The world is round.

Null Hypothesis: The world is not round.

Copernicus and many other scientists tried to prove the null hypothesis wrong. Through their experiments and testing, they persuaded people that the alternate hypothesis was correct. Had they not shown experimentally that the null hypothesis was wrong, people would not have believed them or accepted the alternate hypothesis as true.

The alternative and null hypotheses for Susie’s assumption are:

  • Null Hypothesis: If one plant is watered with distilled water and the other with mineral water, then there is no difference in the growth and nourishment of these two plants.
  • Alternative Hypothesis:  If one plant is watered with distilled water and the other with mineral water, then the plant with mineral water shows better growth and nourishment.

The null hypothesis states that there is no statistically significant relationship. The relationship in question can be within a single set of variables or between two sets of variables.

Most people consider the null hypothesis true and correct. Scientists perform experiments and research in order to prove the null hypothesis wrong, or nullify it. For this purpose, they design an alternate hypothesis that they think is correct. The symbol for the null hypothesis is H₀ (read as "H null" or "H zero").

Why is it named the “Null”?

The name "null" indicates that scientists are working to prove the hypothesis false, i.e., to nullify it. This sometimes confuses readers, who may think the statement is empty or says nothing, but it is not. A more fitting name would be "nullifiable hypothesis" rather than "null hypothesis".

Why do we need to assess it? Why not just verify an alternate one?

In science, the scientific method is used. It involves a series of steps that scientists perform so that a hypothesis can be shown to be false or true. Scientists do this to check whether the new hypothesis has any limitation or inadequacy. Experiments are designed with both the alternative and the null hypothesis in mind, which makes the research sound. Leaving the null hypothesis out of a study harms the research: it suggests that the researcher is not taking the work seriously and simply wants to present the results as correct.

Development of the Null

In statistics, the first step is to formulate the alternate and null hypotheses from the given problem. Splitting the problem into small steps makes the path to the solution easier.

How to write a null hypothesis?

Writing a null hypothesis consists of two steps:

  • First, ask a question.
  • Second, restate the question in a form that assumes no relationship among the variables.

In other words, assume that the treatment does not have any effect.

The usual recovery duration after knee surgery is considered to be about 8 weeks.

A researcher thinks that the recovery period may be longer if patients go to a physiotherapist for rehabilitation twice per week instead of three times per week, i.e., recovery is shorter when the patient goes three times per week rather than twice.

Step 1: Identify the hypothesis in the problem. The hypothesis can be a single word or a statement. In the above example, the hypothesis is:

“The expected recovery period in knee rehabilitation is more than 8 weeks”

Step 2: Make a mathematical statement from the hypothesis. The average is represented by μ, so the hypothesis can be written as:

H₁: μ > 8

In the above equation, the hypothesis is denoted by H₁, the average is denoted by μ, and > indicates that the average is greater than eight.

Step 3: State what holds if the hypothesis is not correct, i.e., the rehabilitation period does not exceed 8 weeks.

There are two options: the recovery takes either less than 8 weeks or exactly 8 weeks.

H₀: μ ≤ 8

In the above equation, the null hypothesis is denoted by H₀, the average is denoted by μ, and ≤ indicates that the average is less than or equal to eight.

What will happen if the scientist does not have any knowledge about the outcome?

Problem: An investigator studies the post-operative effect of radical exercise on patients who have had knee surgery. The exercise may either improve the recovery or make it worse. The usual recovery time is 8 weeks.

Step 1: State the null hypothesis, i.e., the exercise has no effect and the recovery time remains about 8 weeks.

H₀: μ = 8

In the above equation, the null hypothesis is denoted by H₀, the average is denoted by μ, and the equal sign (=) shows that the average is equal to eight.

Step 2: State the alternate hypothesis, which is the reverse of the null hypothesis: what happens if the treatment (exercise) has an effect?

H₁: μ ≠ 8

In the above equation, the alternate hypothesis is denoted by H₁, the average is denoted by μ, and the not-equal sign (≠) indicates that the average is not equal to eight.
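
As a minimal sketch of how the two knee-recovery setups above could be tested, the Python snippet below runs a one-sample z-test for both the one-sided case (alternate hypothesis μ > 8) and the two-sided case (alternate hypothesis μ ≠ 8). The sample size, sample mean, and population standard deviation are hypothetical values chosen for illustration, and the z-test assumes the population standard deviation is known. Neither the numbers nor the choice of test comes from the source.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample_mean, mu0, sigma, n, alternative):
    """Return (z, p) for a one-sample z-test against the null value mu0.

    alternative="greater"   tests H1: mu > mu0
    alternative="two-sided" tests H1: mu != mu0
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    if alternative == "greater":
        p = 1.0 - std_normal.cdf(z)
    elif alternative == "two-sided":
        p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater' or 'two-sided'")
    return z, p


# Hypothetical data: 36 patients, mean recovery 8.5 weeks,
# assumed known population standard deviation of 1.5 weeks.
z, p_one_sided = z_test_p_value(8.5, 8.0, 1.5, 36, "greater")
_, p_two_sided = z_test_p_value(8.5, 8.0, 1.5, 36, "two-sided")
print(f"z = {z:.2f}, one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

For the same data, the two-sided P-value is twice the one-sided one, which is why the direction of the alternate hypothesis should be fixed before looking at the results. With real data and an unknown standard deviation, a t-test would usually be the more appropriate choice.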

Significance Tests

A significance test is performed to assess how well the data agree with the null hypothesis. The null hypothesis does not itself contain data; it is a statement about a numerical characteristic of the population. That characteristic can take different forms, such as a mean or a proportion, a difference of means or proportions, or an odds ratio.

The P-value is the main statistical result of a significance test of the null hypothesis. Its definition and symbols are as follows:

  • P-value = Pr(data or data more extreme | H₀ true)
  • | = “given”
  • Pr = probability
  • H₀ = the null hypothesis

The first stage of Null Hypothesis Significance Testing (NHST) is to form an alternate and a null hypothesis. Together, these state the research question concisely.

Null Hypothesis = no effect of treatment, no difference, no association

Alternative Hypothesis = effective treatment, difference, association

When to reject the null hypothesis?

Researchers reject the null hypothesis if it is proven wrong after experimentation. Until then, they treat the null hypothesis as true while trying to strengthen the alternate hypothesis. The binomial test, performed on a sample, illustrates the procedure in the following steps (Frick, 1995).

Step 1: Read the research question carefully and formulate a null hypothesis. Verify that the sample supports a binomial proportion. Determine the value the binomial parameter would take if there were no difference.

Show the null hypothesis as:

H₀: p = the value of p if H₀ is true

To see how much the observed data deviate from the value proposed by the null hypothesis, calculate the sample proportion.

Step 2: Choose the test statistic for the binomial test under the null hypothesis. The test must be based on exact probabilities. Also write down the probability mass function (pmf) that applies when the null hypothesis is true.

When H₀ is true, X ~ b(n, p)

n = size of the sample

p = assumed value if H₀ is true

Step 3: Find the P-value, i.e., the probability of data at least as extreme as the data observed.

For an increase: P-value = Pr(X ≥ x)

x = observed number of successes

For a decrease: P-value = Pr(X ≤ x)

Step 4: Report the findings in a detailed, descriptive way.

  • Sample proportion
  • The direction of difference (either increases or decreases)
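
The binomial-test steps above can be sketched in Python using only the standard library. The example assumes hypothetical data (20 trials, 5 observed successes, and p = 0.10 under the null hypothesis); the function names are illustrative and do not come from any particular package.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes when X ~ b(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(x, n, p):
    """Pr(X >= x): P-value when the alternative is an increase."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))


def lower_tail(x, n, p):
    """Pr(X <= x): P-value when the alternative is a decrease."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))


# Hypothetical sample: n = 20 trials, x = 5 successes, null value p = 0.10.
n, x, p_null = 20, 5, 0.10
sample_proportion = x / n               # Step 1: how far is the data from p_null?
p_value = upper_tail(x, n, p_null)      # Step 3: exact upper-tail probability

# Step 4: report the proportion, the direction, and the P-value together.
direction = "increase" if sample_proportion > p_null else "decrease"
print(f"sample proportion = {sample_proportion:.2f} ({direction}), P-value = {p_value:.4f}")
```

Because the tail probabilities are summed directly from the pmf, the result is exact rather than a normal approximation, which matters for small samples like this one.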

Perceived Problems With the Null Hypothesis

Variable or model selection and, in some cases, limited information are the main issues that affect the testing of the null hypothesis. Statistical tests of the null hypothesis are not particularly strong, and the threshold for significance is arbitrary (Gill, 1999). The main criticism of null hypothesis testing is that null hypotheses are essentially always false to begin with.

There is another problem with the α-level. This is an often ignored but well-known problem. The value of the α-level has no theoretical basis, so the conventional values, most commonly 0.1, 0.05, or 0.01, are arbitrary. Using a fixed value of α creates two categories (significant and non-significant). This arbitrary rejection or non-rejection is also a problem when the practical question is the strength of the evidence on a scientific matter.

The P-value is of foremost importance in the testing of the null hypothesis, but as an inferential tool and for interpretation, it has a problem. The P-value is the probability of getting a test statistic at least as extreme as the observed one, assuming the null hypothesis is true.

The main point about the definition is that the P-value is not based on the observed results alone.

Moreover, the evidence against the null hypothesis is overstated because of unobserved results. The P-value is more than just a statement; it is meant to be a precise statement about the evidence from the observed results or data. For this reason, some researchers find P-values objectionable and prefer not to use null hypothesis testing. It is also clear that the P-value depends strictly on the null hypothesis, since it is a statistic computed under it. In some carefully designed experiments, the null distribution of the statistic and the actual sampling distribution are closely related, but this is not possible in observational studies.

Some researchers have pointed out that the P-value depends on the sample size. Even if the true difference is small, the null hypothesis may be rejected when the sample is large. This shows the difference between biological importance and statistical significance (Killeen, 2005).
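
The dependence of the P-value on sample size can be illustrated with a small sketch. It assumes a two-sided one-sample z-test with a known standard deviation, and the observed difference (0.1) and standard deviation (1.0) are hypothetical values held fixed while only the sample size changes.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(observed_diff, sigma, n):
    """Two-sided z-test P-value for a given observed difference from the null value."""
    z = observed_diff / (sigma / sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Same small observed difference, increasingly large samples.
sample_sizes = (25, 100, 400, 1600)
p_values = {n: two_sided_p(0.1, 1.0, n) for n in sample_sizes}

for n, p in p_values.items():
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"n = {n:5d}  P-value = {p:.5f}  ->  {verdict} at alpha = 0.05")
```

The effect size never changes in this sketch, yet the conclusion flips from "not significant" to "significant" as n grows, which is why reporting the effect size and its precision alongside the P-value is often recommended.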

Another issue is the fixed α-level, e.g., 0.1. Depending on the α-level, the null hypothesis for a large sample may be accepted or rejected. Even if the sample size is infinite and the null hypothesis is true, there is still a chance of a Type I error. For this reason, the approach is not considered consistent and reliable. A further problem is that the size and precision of the estimated effect cannot be known from the test alone. The only solution is to state the size of the effect and its precision.

Null Hypothesis Examples

Here are some examples:

Example 1: Hypotheses with One Sample of One Categorical Variable

In the human population, about 10% of people are left-handed, i.e., they prefer to do tasks with their left hand. Suppose a researcher at Penn State claims that students at the College of Arts and Architecture are more likely to be left-handed than the general population. In this case, there is only one sample, and the population proportion for that sample is compared with the known population value.

  • Research Question: Are artists more likely to be left-handed than people in the general population?
  • Response Variable: Classification of each student as left-handed or right-handed.
  • Null Hypothesis: Students in the College of Arts and Architecture are no more likely to be left-handed than people in the general population (the proportion of left-handed students in the college population is 10%, or p = 0.10).

Example 2: Hypotheses with One Sample of One Measurement Variable

A generic brand of the antihistamine diphenhydramine is sold as capsules with a 50 mg dose. The manufacturer is concerned that the machine has gone out of calibration and is no longer making capsules with the appropriate dose.

  • Research Question: Do the data suggest that the population mean dosage differs from 50 mg?
  • Response Variable: Dosage of the active ingredient, determined by chemical assay.
  • Null Hypothesis: On average, the dosage of capsules of this brand is 50 mg (population mean dosage = 50 mg).

Example 3: Hypotheses with Two Samples of One Categorical Variable

Many people choose vegetarian meals on a daily basis. A researcher believes that females prefer vegetarian meals more than males do.

  • Research Question: Do the data suggest that females (women) regularly prefer vegetarian meals more than males (men)?
  • Response Variable: Classification of each person as vegetarian or non-vegetarian.
  • Grouping Variable: Gender
  • Null Hypothesis: Gender is not linked to a preference for vegetarian meals (population percent of women who eat vegetarian meals regularly = population percent of men who eat vegetarian meals regularly, or p women = p men).

Example 4: Hypotheses with Two Samples of One Measurement Variable

Obesity and being overweight are among today's major health issues. Research is performed to test whether a low-carbohydrate diet leads to faster weight loss than a low-fat diet.

  • Research Question: Do the data suggest that, on average, a low-carbohydrate diet leads to faster weight loss than a low-fat diet?
  • Response Variable: Weight loss (pounds)
  • Explanatory Variable: Type of diet, either low-carbohydrate or low-fat
  • Null Hypothesis: There is no significant difference between the mean weight loss of people on a low-carbohydrate diet and that of people on a low-fat diet (population mean weight loss on a low-carbohydrate diet = population mean weight loss on a low-fat diet).
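
One way to test a null hypothesis of equal means in two samples, sketched here under stated assumptions, is a permutation test: if diet type has no effect, the group labels are interchangeable, so the observed difference in means can be compared with the differences obtained after shuffling the labels. The weight-loss figures below are hypothetical, and the permutation test is a swapped-in illustration rather than the method used in the source.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation P-value for H0: both groups share the same mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical weight loss in pounds for two small groups.
low_carb = [12, 14, 15, 13, 16, 14, 15, 13]
low_fat = [5, 6, 4, 7, 5, 6, 5, 4]
p_value = permutation_p_value(low_carb, low_fat)
print(f"P-value = {p_value:.4f}")
```

A small P-value here means that a difference as large as the observed one rarely arises when labels are assigned at random, which counts as evidence against the null hypothesis of equal means.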

Example 5: Hypotheses about the relationship between Two Categorical Variables

A case-control study was performed. The study enrolled nonsmokers, both stroke patients and controls. The subjects were of the same occupation and age, and each was asked whether someone in their home or close surroundings smokes.

  • Research Question: Does second-hand smoke increase the chances of stroke?
  • Variables: There are two categorical variables: case status (control or stroke patient) and whether a smoker lives in the same house. Under the alternative, the chances of having a stroke are higher if a person lives with a smoker.
  • Null Hypothesis: There is no significant relationship between passive smoking and stroke or brain attack (the odds ratio between stroke and passive smoking is equal to 1).
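
To make the "odds ratio equal to 1" null hypothesis concrete, the sketch below computes an odds ratio and an approximate 95% confidence interval from a 2×2 table, using the usual large-sample (Woolf) approximation on the log scale. The counts are hypothetical and the function name is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_with_ci(a, b, c, d, confidence=0.95):
    """Odds ratio and approximate confidence interval for a 2x2 table.

    a = exposed cases,    b = unexposed cases
    c = exposed controls, d = unexposed controls
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # about 1.96 for 95%
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 40 of 100 stroke patients and 20 of 100 controls
# live with a smoker.
or_hat, ci_low, ci_high = odds_ratio_with_ci(40, 60, 20, 80)
excludes_one = not (ci_low <= 1.0 <= ci_high)
print(f"OR = {or_hat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f}), excludes 1: {excludes_one}")
```

If the interval excludes 1, the data are inconsistent with the null hypothesis at the corresponding significance level; if it contains 1, the null hypothesis is not rejected.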

Example 6: Hypotheses about the relationship between Two Measurement Variables

A financial expert observes what appears to be a positive relationship between the variation in stock price and the quantity of stock bought by non-management employees.

  • Response Variable: Daily change in price
  • Explanatory Variable: Stock bought by non-management employees
  • Null Hypothesis: The association between the daily stock price change ($) and the daily stock-buying by non-management employees ($) = 0.
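
A null hypothesis of zero association between two measurement variables is commonly checked with the Pearson correlation coefficient. The sketch below computes r and an approximate two-sided P-value using Fisher's z-transformation, which assumes roughly bivariate normal data. The eight data points are hypothetical, and the helper names are illustrative.

```python
from math import atanh, sqrt
from statistics import NormalDist, fmean


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fisher_z_p_value(r, n):
    """Approximate two-sided P-value for H0: correlation = 0 (requires |r| < 1, n > 3)."""
    z = atanh(r) * sqrt(n - 3)  # approximately standard normal under H0
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical daily figures: employee stock purchases and price changes ($).
purchases = [1, 2, 3, 4, 5, 6, 7, 8]
price_change = [0.2, 0.1, 0.4, 0.3, 0.6, 0.5, 0.8, 0.7]

r = pearson_r(purchases, price_change)
p_value = fisher_z_p_value(r, len(purchases))
print(f"r = {r:.3f}, approximate P-value = {p_value:.4f}")
```

A correlation near zero gives a P-value near 1, while a strong correlation in even a small sample gives a small P-value. Either way, the test speaks only to association, not to causation.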

Example 7: Hypotheses about comparing the relationship between Two Measurement Variables in Two Samples

  • Research Question: Is the relationship between the bill paid in a restaurant and the tip given to the waiter linear? Is this relationship different for dining and family restaurants?
  • Explanatory Variable: Total bill amount
  • Response Variable: Amount of tip
  • Null Hypothesis: The relationship between the total bill amount and the tip is the same at family and dining restaurants.

  • Blackwelder, W. C. (1982). “Proving the null hypothesis” in clinical trials. Controlled Clinical Trials, 3(4), 345–353.
  • Frick, R. W. (1995). Accepting the null hypothesis. Memory & Cognition, 23(1), 132–138.
  • Gill, J. (1999). The insignificance of null hypothesis significance testing. Political Research Quarterly, 52(3), 647–674.
  • Killeen, P. R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16(5), 345–353.

©BiologyOnline.com. Content provided and moderated by Biology Online Editors.

Last updated on June 16th, 2022

  • Open access
  • Published: 13 May 2024

SCIPAC: quantitative estimation of cell-phenotype associations

  • Dailin Gan 1 ,
  • Yini Zhu 2 ,
  • Xin Lu 2 , 3 &
  • Jun Li   ORCID: orcid.org/0000-0003-4353-5761 1  

Genome Biology volume 25, Article number: 119 (2024)

Numerous algorithms have been proposed to identify cell types in single-cell RNA sequencing data, yet a fundamental problem remains: determining associations between cells and phenotypes such as cancer. We develop SCIPAC, the first algorithm that quantitatively estimates the association between each cell in single-cell data and a phenotype. SCIPAC also provides a p-value for each association and applies to data with virtually any type of phenotype. We demonstrate SCIPAC’s accuracy in simulated data. On four real cancerous or noncancerous datasets, insights from SCIPAC help interpret the data and generate new hypotheses. SCIPAC requires minimum tuning and is computationally very fast.

Single-cell RNA sequencing (scRNA-seq) technologies are revolutionizing biomedical research by providing comprehensive characterizations of diverse cell populations in heterogeneous tissues [1, 2]. Unlike bulk RNA sequencing (RNA-seq), which measures the average expression profile of the whole tissue, scRNA-seq gives the expression profiles of thousands of individual cells in the tissue [3, 4, 5, 6, 7]. Based on this rich data, cell types may be discovered/determined in an unsupervised (e.g., [8, 9]), semi-supervised (e.g., [10, 11, 12, 13]), or supervised manner (e.g., [14, 15, 16]). Despite the fast development, there are still limitations with scRNA-seq technologies. Notably, the cost for each scRNA-seq experiment is still high; as a result, most scRNA-seq data are from a single or a few biological samples/tissues. Very few datasets consist of large numbers of samples with different phenotypes, e.g., cancer vs. normal. This makes it very difficult to determine how a cell type contributes to a phenotype based on single-cell studies (especially if the cell type is discovered in a completely unsupervised manner or if people have limited knowledge of this cell type). For example, without having single-cell data from multiple cancer patients and multiple normal controls, it could be hard to computationally infer whether a cell type may promote or inhibit cancer development. However, such association can be critical for cancer research [17], disease diagnosis [18], cell-type targeted therapy development [19], etc.

Fortunately, this difficulty may be overcome by borrowing information from bulk RNA-seq data. Over the past decade, a considerable amount of bulk RNA-seq data from a large number of samples with different phenotypes has been accumulated and made available through databases like The Cancer Genome Atlas (TCGA) [20] and cBioPortal [21, 22]. Data in these databases often contain comprehensive patient phenotype information, such as cancer status, cancer stages, survival status and time, and tumor metastasis. Combining single-cell data from a single or a few individuals and bulk data from a relatively large number of individuals regarding a particular phenotype can be a cost-effective way to determine how a cell type contributes to the phenotype. A recent method, Scissor [23], took an essential step in this direction. It uses single-cell and bulk RNA-seq data with phenotype information to classify the cells into three discrete categories: Scissor+, Scissor−, and null cells, corresponding to cells that are positively associated, negatively associated, and not associated with the phenotype.

Here, we present a method that takes another big step in this direction: Single-Cell and bulk data-based Identifier for Phenotype Associated Cells, or SCIPAC for short. SCIPAC enables quantitative estimation of the strength of association between each cell in a scRNA-seq dataset and a phenotype, with the help of bulk RNA-seq data with phenotype information. Moreover, SCIPAC also enables the estimation of the statistical significance of the association. That is, it gives a p-value for the association between each cell and the phenotype. Furthermore, SCIPAC enables the estimation of association between cells and an ordinal phenotype (e.g., different stages of cancer), which could be informative as people may not only be interested in the emergence/existence of cancer (cancer vs. healthy, a binary problem) but also in the progression of cancer (different stages of cancer, an ordinal problem).

To study the performance of SCIPAC, we first apply SCIPAC to simulated data under three schemes. SCIPAC shows high accuracy with low false positive rates. We further show the broad applicability of SCIPAC on real datasets across various diseases, including prostate cancer, breast cancer, lung cancer, and muscular dystrophy. The association inferred by SCIPAC is highly informative. In real datasets, some cell types have definite and well-studied functions, while others are less well-understood: their functions may be disease-dependent or tissue-dependent, and they may contain different sub-types with distinct functions. In the former case, SCIPAC’s results agree with current biological knowledge. In the latter case, SCIPAC’s discoveries inspire the generation of new hypotheses regarding the roles and functions of cells under different conditions.

An overview of the SCIPAC algorithm

SCIPAC is a computational method that identifies cells in single-cell data that are associated with a given phenotype. This phenotype can be binary (e.g., cancer vs. normal), ordinal (e.g., cancer stage), continuous (e.g., quantitative traits), or survival (i.e., survival time and status). SCIPAC uses input data consisting of three parts: single-cell RNA-seq data that measures the expression of p genes in m cells, bulk RNA-seq data that measures the expression of the same set of p genes in n samples/tissues, and the statuses/values of the phenotype of the n bulk samples/tissues. The output of SCIPAC is the strength and the p -value of the association between each cell and the phenotype.

SCIPAC proposes the following definition of “association” between a cell and a phenotype: A group of cells that are likely to play a similar role in the phenotype (such as cells of a specific cell type or sub-type, cells in a particular state, cells in a cluster, cells with similar expression profiles, or cells with similar functions) is considered to be positively/negatively associated with a phenotype if an increase in their proportion within the tissue likely indicates an increased/decreased probability of the phenotype’s presence. SCIPAC assigns the same association to all cells within such a group. Taking cancer as an example phenotype, if increasing the proportion of a cell type indicates a higher chance of having cancer (binary), having a higher cancer stage (ordinal), or a higher hazard rate (survival), all cells in this cell type are positively associated with cancer.

The SCIPAC algorithm consists of four steps. First, the cells in the single-cell data are grouped into clusters according to their expression profiles. The Louvain algorithm from the Seurat package [ 24 , 25 ] is used as the default clustering algorithm, but the user may choose any clustering algorithm they prefer. Alternatively, if information on the cell types or other groupings of cells is available a priori, it may be supplied to SCIPAC as the cell clusters, and this clustering step can be skipped. In the second step, a regression model is learned from bulk gene expression data with the phenotype. Depending on the type of the phenotype, this model can be logistic regression, ordinary linear regression, proportional odds model, or Cox proportional hazards model. To achieve a higher prediction power with less variance, by default, the elastic net (a blend of Lasso and ridge regression [ 26 ]) is used to fit the model. In the third step, SCIPAC computes the association strength \(\Lambda\) between each cell cluster and the phenotype based on a mathematical formula that we derive. Finally, the p -values are computed. The association strength and its p -value between a cell cluster and the phenotype are given to all cells in the cluster.
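
The bookkeeping around these steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `PHENOTYPE_MODELS` and the function `assign_cluster_results_to_cells` are hypothetical names, the numeric values are toy inputs, and the formula for \(\Lambda\) and the p -value computation (given in the “ Methods ” section) are not reproduced. The sketch shows only the pairing of phenotype types with regression models and the final step, in which every cell inherits the result of its cluster.

```python
from typing import Dict, List, Tuple

# Pairing of phenotype types with the regression models named in the text.
# The dictionary itself is a hypothetical helper, not part of SCIPAC.
PHENOTYPE_MODELS: Dict[str, str] = {
    "binary": "logistic regression",
    "continuous": "ordinary linear regression",
    "ordinal": "proportional odds model",
    "survival": "Cox proportional hazards model",
}


def assign_cluster_results_to_cells(
    cell_clusters: List[str],
    cluster_results: Dict[str, Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Give each cell the (association strength, p-value) pair of its cluster."""
    return [cluster_results[cluster] for cluster in cell_clusters]


# Toy input (made-up numbers): five cells in three clusters.
cell_clusters = ["c1", "c1", "c2", "c3", "c2"]
cluster_results = {"c1": (0.8, 0.001), "c2": (-0.5, 0.02), "c3": (0.1, 0.6)}
per_cell = assign_cluster_results_to_cells(cell_clusters, cluster_results)
```

Cells that share a cluster receive identical values, which is why the clustering resolution also sets the resolution of the reported associations.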

SCIPAC requires minimal tuning. When the cell-type information is given in step 1, SCIPAC has no (hyper)parameters. Otherwise, the Louvain algorithm used in step 1 has a “resolution” parameter that controls the number of cell clusters: a larger resolution results in more clusters. SCIPAC inherits this parameter as its only parameter. Since SCIPAC gives the same association strength and p -value to cells from the same cluster, this parameter also determines the resolution of results provided by SCIPAC. Thus, we still call it “resolution” in SCIPAC. Because of its meaning, we recommend setting it so that the number of cell clusters given by the clustering algorithm is comparable to, or reasonably larger than, the number of cell types (or sub-types) in the data. We will see that the performance of SCIPAC is insensitive to this resolution parameter, and the default value 2.0 typically works well.

The details of the SCIPAC algorithm are given in the “ Methods ” section.

Performance in simulated data

We assess the performance of SCIPAC in simulated data under three different schemes. The first scheme is simple and consists of only three cell types. The second scheme is more complicated and consists of seven cell types, which better imitates actual scRNA-seq data. In the third scheme, we simulate cells under different cell development stages to test the performance of SCIPAC under an ordinal phenotype. Details of the simulation are given in Additional file 1.

Simulation scheme I

Under this scheme, the single-cell data consists of three cell types: one is positively associated with the phenotype, one is negatively associated, and the third is not associated (we call it “null”). Figure 1 a gives the UMAP [ 27 ] plot of the three cell types, and Fig. 1 b gives the true associations of these three cell types with the phenotype, with red, blue, and light gray denoting positive, negative, and null associations.

figure 1

UMAP visualization and numeric measures of the simulated data under scheme I. All the plots in a–e are scatterplots of the two-dimensional single-cell data given by UMAP. The x and y axes represent the two dimensions, and their scales are not shown as their specific values are not directly relevant. Points in the plots represent single cells, and they are colored differently in each subplot to reflect different information/results. a Cell types. b True associations. The associations between cell types 1, 2, and 3 and the phenotype are positive, negative, and null, respectively. c Association strengths \(\Lambda\) given by SCIPAC under different resolutions. Red/blue represents the sign of \(\Lambda\) , and the shade gives the absolute value of \(\Lambda\) . Every cell is colored red or blue since no \(\Lambda\) is exactly zero. Below each subplot, Res stands for resolution, and K stands for the number of cell clusters given by this resolution. d p -values given by SCIPAC. Only cells with p -value \(< 0.05\) are colored red (positive association) or blue (negative association); others are colored white. e Results given by Scissor under different \(\alpha\) values. Red, blue, and light gray stand for Scissor+, Scissor−, and background (i.e., null) cells. f F1 scores and g FSC for SCIPAC and Scissor under different parameter values. For SCIPAC, each bar is the value under a resolution/number of clusters. For Scissor, each bar is the value under an \(\alpha\)

We apply SCIPAC to the simulated data. For the resolution parameter (see the “ Methods ” section), values 0.5, 1.0, and 1.5 give 3, 4, and 4 clusters, respectively, close to the actual number of cell types. They are good choices based on the guidance for choosing this parameter. To show how SCIPAC behaves under parameter misspecification, we also set the resolution up to 4.0, which gives a whopping 61 clusters. Figure 1 c and d give the association strengths \(\Lambda\) and the p -values given by four different resolutions (results under other resolutions are provided in Additional file 1: Fig. S1 and S2). In Fig. 1 c, red and blue denote positive and negative associations, respectively, and the shade of the color represents the strength of the association, i.e., the absolute value of \(\Lambda\) . Every cell is colored blue or red, as none of \(\Lambda\) is exactly zero. In Fig. 1 d, red and blue denote positive and negative associations that are statistically significant ( p -value \(< 0.05\) ). Cells whose associations are not statistically significant ( p -value \(\ge 0.05\) ) are shown in white. To avoid confusion, it is worth repeating that cells that are colored in red/blue in Fig. 1 c are shown in red/blue in Fig. 1 d only if they are statistically significant; otherwise, they are colored white in Fig. 1 d.
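
The coloring rule used in Fig. 1 d can be written as a small function. The sketch below is illustrative only: `call_association` and its `threshold` argument are hypothetical names, and the handling of a strength of exactly zero is an assumption added for completeness (the text notes that no \(\Lambda\) is exactly zero in this simulation).

```python
def call_association(strength: float, p_value: float, threshold: float = 0.05) -> str:
    """Label a cell from its association strength and p-value.

    A cell is called "positive" or "negative" only if its p-value is below
    the threshold; otherwise it is called "null" (shown in white in Fig. 1d).
    """
    if p_value >= threshold:
        return "null"
    if strength > 0:
        return "positive"
    if strength < 0:
        return "negative"
    # Assumption: a strength of exactly zero carries no direction.
    return "null"


# Made-up (strength, p-value) pairs for three cells.
calls = [call_association(s, p) for s, p in [(0.7, 0.001), (-0.4, 0.03), (0.2, 0.4)]]
```

The third cell has a positive strength and would be colored red in Fig. 1 c, but it is called null here because its p -value is not below 0.05.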

From Fig. 1 c, d (as well as Additional file 1: Fig. S1 and S2), it is clear that the results of SCIPAC are highly consistent under different resolution values, including both the estimated association strengths and the p -values. It is also clear that SCIPAC is highly accurate: most truly associated cells are identified as significant, and most, if not all, truly null cells are identified as null.

As the first algorithm that quantitatively estimates the association strength and the first algorithm that gives the p -value of the association, SCIPAC does not have a real competitor. A previous algorithm, Scissor, is able to classify cells into three discrete categories according to their associations with the phenotype. We therefore compare SCIPAC with Scissor with respect to the ability to differentiate positively associated, negatively associated, and null cells.

Running Scissor requires tuning a parameter called \(\alpha\) , which is a number between 0 and 1 that balances the amount of regularization for the smoothness and for the sparsity of the associations. The Scissor R package does not provide a default value for this \(\alpha\) or a function to help select this value. The Scissor paper suggests the following criterion: “the number of Scissor-selected cells should not exceed a certain percentage of total cells (default 20%) in the single-cell data. In each experiment, a search on the above searching list is performed from the smallest to the largest until a value of \(\alpha\) meets the above criteria.” In practice, we have found that this criterion does not often work properly, as the truly associated cells may not compose 20% of all cells in actual data. Therefore, instead of setting \(\alpha\) to any particular value, we set \(\alpha\) values that span the whole range of \(\alpha\) to see the best possible performance of Scissor.

The performance of Scissor in our simulation data under four different \(\alpha\) values is shown in Fig. 1 e, and results under more \(\alpha\) values are shown in Additional file 1: Fig. S3. In the figures, red, blue, and light gray denote Scissor+, Scissor−, and null (called “background” in Scissor) cells, respectively. The results of Scissor have several characteristics different from SCIPAC. First, Scissor does not give the strength or statistical significance of the association, and thus the colors of the cells in the figures do not have different shades. Second, different \(\alpha\) values give very different results. Greater \(\alpha\) values generally give fewer Scissor+ and Scissor− cells, but there are additional complexities. One complexity is that the Scissor+ (or Scissor−) cells under a greater \(\alpha\) value are not a strict subset of Scissor+ (or Scissor−) cells under a smaller \(\alpha\) value. For example, the number of truly negatively associated cells detected as Scissor− increases when \(\alpha\) increases from 0.01 to 0.30. Another complexity is that the direction of the association may flip as \(\alpha\) increases. For example, most cells of cell type 2 are identified as Scissor+ under \(\alpha =0.01\) , but many of them are identified as Scissor− under larger \(\alpha\) values. Third, Scissor does not achieve high power and low false-positive rate at the same time under any \(\alpha\) . No matter what the \(\alpha\) value is, there is only a small proportion of cells from cell type 2 that are correctly identified as negatively associated, and there is always a non-negligible proportion of null cells (i.e., cells from cell type 3) that are incorrectly identified as positively or negatively associated. Fourth, Scissor+ and Scissor− cells can be close to each other in the figure, even under a large \(\alpha\) value. This means that cells with nearly identical expression profiles are detected to be associated with the phenotype in opposite directions, which makes the results difficult to interpret.

SCIPAC overcomes the difficulties of Scissor and gives results that are more informative (quantitative strengths with p -values), more accurate (both high power and low false-positive rate), less sensitive to the tuning parameter, and easier to interpret (cells with similar expression typically have similar associations to the phenotype).

SCIPAC’s higher accuracy than Scissor in differentiating positively associated, negatively associated, and null cells can also be measured numerically using the F1 score and the fraction of sign correctness (FSC). F1, which is the harmonic mean of precision and recall, is a commonly used measure of calling accuracy. Note that precision and recall are only defined for two-class problems, which try to classify desired signals/discoveries (so-called “positives”) against noise/trivial results (so-called “negatives”). Our case, on the other hand, is a three-class problem: positive association, negative association, and null. To compute F1, we combine the positive and negative associations and treat them as “positives,” and treat null as “negatives.” This F1 score ignores the direction of the association; thus, it alone is not enough to describe the performance of an association-detection algorithm. For example, an algorithm may have a perfect F1 score even if it incorrectly calls all negative associations positive. To measure an algorithm’s ability to determine the direction of the association, we propose a statistic called FSC, defined as the fraction of true discoveries that also have the correct direction of the association. The F1 score and FSC are numbers between 0 and 1, and higher values are preferred. A mathematical definition of these two measures is given in Additional file 1.
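
The two measures can be illustrated with a short Python sketch. It follows the verbal definitions above rather than the formal ones in Additional file 1, so the details are assumptions: labels are coded as +1 (positive), −1 (negative), and 0 (null), a “true discovery” is read as a truly associated cell that is called associated in either direction, and the function name `f1_and_fsc` is hypothetical.

```python
from typing import Sequence, Tuple


def f1_and_fsc(truth: Sequence[int], called: Sequence[int]) -> Tuple[float, float]:
    """Compute (F1, FSC) for labels coded as +1, -1, or 0 (null).

    F1 treats any non-null label as a "positive" and ignores direction.
    FSC is the fraction of true discoveries whose direction is also correct.
    """
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")
    pairs = list(zip(truth, called))
    tp = sum(1 for t, c in pairs if t != 0 and c != 0)  # true discoveries
    fp = sum(1 for t, c in pairs if t == 0 and c != 0)  # null cells called associated
    fn = sum(1 for t, c in pairs if t != 0 and c == 0)  # associated cells missed
    sign_ok = sum(1 for t, c in pairs if t != 0 and c == t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    fsc = sign_ok / tp if tp else 0.0
    return f1, fsc


# Toy example: every negative association is wrongly called positive.
truth = [1, 1, -1, -1, 0, 0]
called = [1, 1, 1, 1, 0, 0]
f1, fsc = f1_and_fsc(truth, called)  # f1 == 1.0, fsc == 0.5
```

The toy example reproduces the situation described in the text: F1 is perfect because every associated cell is detected, while FSC drops to 0.5 because half of the discoveries have the wrong direction.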

Figure 1 f, g show the F1 score and FSC of SCIPAC and Scissor under different values of tuning parameters. The F1 score of Scissor is between 0.2 and 0.7 under different \(\alpha\) ’s. The FSC of Scissor increases from around 0.5 to nearly 1 as \(\alpha\) increases, but Scissor does not achieve high F1 and FSC scores at the same time under any \(\alpha\) . On the other hand, the F1 score of SCIPAC is close to perfection when the resolution parameter is properly set, and it is still above 0.90 even if the resolution parameter is set too large. The FSC of SCIPAC is always above 0.96 under different resolutions. That is, SCIPAC achieves high F1 and FSC scores simultaneously under a wide range of resolutions, representing a much higher accuracy than Scissor.

Simulation scheme II

This more complicated simulation scheme has seven cell types, which are shown in Fig. 2 a. As shown in Fig. 2 b, cell types 1 and 3 are negatively associated (colored blue), 2 and 4 are positively associated (colored red), and 5, 6, and 7 are not associated (colored light gray).

figure 2

UMAP visualization of the simulated data under a–g  scheme II and h–k  scheme III. a  Cell types. b  True associations. c , d  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution. e  Results given by Scissor under different \(\alpha\) values. f  F1 scores and g  FSC for SCIPAC and Scissor under different parameter values. h  Cell differentiation paths. The four paths have the same starting location, which is in the center, but different ending locations. This can be considered as a progenitor cell type differentiating into four specialized cell types. i  Cell differentiation steps. These steps are used to create four stages, each containing 500 steps. Thus, this plot of differentiation steps can also be viewed as the plot of true association strengths. j , k  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution

The association strengths and p -values given by SCIPAC under the default resolution are illustrated in Fig. 2 c, d, respectively. Results under several other resolutions are given in Additional file 1: Fig. S4 and S5. Again, we find that SCIPAC gives highly consistent results under different resolutions. SCIPAC successfully identifies three out of the four truly associated cell types. For the other truly associated cell type, cell type 1, SCIPAC correctly recognizes its association with the phenotype as negative, although the p -values are not significant enough. The F1 score is 0.85, and the FSC is greater than 0.99, as shown in Fig. 2 f, g.

The results of Scissor under four different \(\alpha\) values are given in Fig. 2 e. (More shown in Additional file 1: Fig. S6.) Under this highly challenging simulation scheme, Scissor can only identify one out of four truly associated cell types. Its F1 score is below 0.4.

Simulation scheme III

This simulation scheme is designed to assess the performance of SCIPAC for ordinal phenotypes. We simulate cells along four cell-differentiation paths with the same starting location but different ending locations, as shown in Fig. 2 h. These cells can be considered as a progenitor cell population differentiating into four specialized cell types. In Fig. 2 i, the “step” reflects their position in the differentiation path, with step 0 meaning the start and step 2000 meaning the end of the differentiation. Then, the “stage” is generated according to the step: cells in steps 0 \(\sim\) 500, 501 \(\sim\) 1000, 1001 \(\sim\) 1500, and 1501 \(\sim\) 2000 are assigned to stages I, II, III, and IV, respectively. This stage is treated as the ordinal phenotype. Under this simulation scheme, Fig. 2 i also gives the actual associations, and all cells are associated with the phenotype.
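
The mapping from differentiation step to stage can be written out directly. The sketch below simply encodes the four step ranges listed above; the function name `stage_from_step` is a hypothetical choice for illustration.

```python
def stage_from_step(step: int) -> str:
    """Map a differentiation step (0 to 2000) to the ordinal stage I-IV."""
    if not 0 <= step <= 2000:
        raise ValueError("step must lie between 0 and 2000")
    if step <= 500:
        return "I"
    if step <= 1000:
        return "II"
    if step <= 1500:
        return "III"
    return "IV"


# Stages at the boundaries of the four step ranges.
boundary_stages = [stage_from_step(s) for s in (0, 500, 501, 1000, 1001, 1500, 1501, 2000)]
```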

The results of SCIPAC under the default resolution are shown in Fig. 2 j, k. Clearly, the associations SCIPAC identifies are highly consistent with the truth. Particularly, it successfully identifies the cells in the center as early-stage cells and most cells at the end of branches as last-stage cells. The results of SCIPAC under other resolutions are given in Additional file 1: Fig. S7 and S8, which are highly consistent. Scissor does not work with ordinal phenotypes; thus, no results are reported here.

Performance in real data

We consider four real datasets: a prostate cancer dataset, a breast cancer dataset, a lung cancer dataset, and a muscular dystrophy dataset. The bulk RNA-seq data of the three cancer datasets are obtained from the TCGA database, and that of the muscular dystrophy dataset is obtained from a published paper [ 28 ]. A detailed description of these datasets is given in Additional file 1. We will use these datasets to assess the performance of SCIPAC on different types of phenotypes. The cell type information (i.e., which cell belongs to which cell type) is available for the first three datasets, but we ignore this information so that we can make a fair comparison with Scissor, which cannot utilize this information.

Prostate cancer data with a binary phenotype

We use the single-cell expression of 8,700 cells from prostate-cancer tumors sequenced by [ 29 ]. The cell types of these cells are known and given in Fig. 3 a. The bulk data comprises 550 TCGA-PRAD (prostate adenocarcinoma) samples with phenotype (cancer vs. normal) information. Here the phenotype is cancer, and it is binary: present or absent.

figure 3

UMAP visualization of the prostate cancer data, with a zoom-in view for the red-circled region (cell type MNP). a True cell types. BE, HE, and CE stand for basal, hillock, and club epithelial cells, respectively; LE-KLK3 and LE-KLK4 stand for luminal epithelial cells with high levels of kallikrein related peptidase 3 and 4, respectively; and MNP stands for mononuclear phagocytes. In the zoom-in view, the sub-types of MNP cells are given. b Association strengths \(\Lambda\) given by SCIPAC under the default resolution. The cyan-circled cells are B cells, which are estimated by SCIPAC as negatively associated with cancer but estimated by Scissor as Scissor+ or null. c p -values given by SCIPAC. The MNP cell type, which is red-circled in the plot, is estimated by SCIPAC to be strongly negatively associated with cancer but estimated by Scissor to be positively associated with cancer. d Results given by Scissor under different \(\alpha\) values

Results from SCIPAC with the default resolution are shown in Fig. 3 b, c (results with other resolutions, given in Additional file 1: Fig. S9 and S10, are highly consistent with results here). Compared with results from Scissor, shown in Fig. 3 d, results from SCIPAC again show three advantages. First, results from SCIPAC are richer and more comprehensive. SCIPAC gives estimated associations and the corresponding p -values, and the estimated associations are quantitative (shown in Fig. 3 b as different shades of red or blue) instead of discrete (shown in Fig. 3 d as a uniform shade of red, blue, or light gray). Second, SCIPAC’s results can be easier to interpret as the red and blue colors are more block-wise instead of scattered. Third, Scissor produces multiple sets of results that vary with the parameter \(\alpha\) , which has no default value or tuning guidance, whereas a single set of results from SCIPAC under its default settings typically suffices.

Comparing the results from our SCIPAC method with those from Scissor is non-trivial, as the latter’s outcomes are scattered and include multiple sets. We propose the following procedure to summarize the inferred association of a known cell type with the phenotype using a specific method (Scissor under a specific \(\alpha\) value, or SCIPAC with the default setting). We first calculate the proportion of cells in this cell type identified as Scissor+ (by Scissor at a specific \(\alpha\) value) or as significantly positively associated (by SCIPAC), denoted by \(p_{+}\) . We also calculate the proportion of all cells, encompassing any cell type, which are identified as Scissor+ or significantly positively associated, serving as the average background strength, denoted by \(p_{a}\) . Then, we compute the log odds ratio for this cell type to be positively associated with the phenotype compared to the background, represented as:

\(\rho _+ = \log \left( \frac{p_{+}/(1-p_{+})}{p_{a}/(1-p_{a})} \right)\)

Similarly, the log odds ratio for the cell type to be negatively associated with the phenotype, \(\rho _-\) , is computed in a parallel manner.
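
As a numerical illustration, the sketch below computes a log odds ratio from the two proportions, assuming the usual form (the log of the odds of \(p_{+}\) divided by the odds of \(p_{a}\) ) and assuming the natural logarithm. The function name `log_odds_ratio` and the choice to reject proportions of exactly 0 or 1 are illustrative assumptions, not a description of how the paper handles such cases.

```python
import math


def log_odds_ratio(p_type: float, p_background: float) -> float:
    """Log odds ratio of a cell type's proportion against the background.

    p_type:       proportion of cells of the cell type called associated
    p_background: proportion of all cells called associated
    Assumes the natural logarithm; proportions must lie strictly in (0, 1).
    """
    for p in (p_type, p_background):
        if not 0.0 < p < 1.0:
            raise ValueError("proportions must lie strictly between 0 and 1")
    odds_type = p_type / (1.0 - p_type)
    odds_background = p_background / (1.0 - p_background)
    return math.log(odds_type / odds_background)


# Made-up proportions: 80% of the cell type vs. 20% of all cells are called positive.
rho_plus = log_odds_ratio(0.8, 0.2)  # log(16), roughly 2.77
```

The same function applies to \(\rho _-\) when the proportions of negatively associated cells are supplied instead.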

For SCIPAC, a cell type is summarized as positively associated with the phenotype if \(\rho _+ \ge 1\) and \(\rho _- < 1\)  and negatively associated if \(\rho _- \ge 1\) and \(\rho _+ < 1\) . If neither condition is met, the association is inconclusive. For Scissor, we apply it under six different \(\alpha\) values: 0.01, 0.05, 0.10, 0.15, 0.20, and 0.25. A cell type is summarized as positively associated with the phenotype if \(\rho _+ \ge 1\) and \(\rho _- < 1\) in at least four of these \(\alpha\) values and negatively associated if \(\rho _- \ge 1\) and \(\rho _+ < 1\) in at least four \(\alpha\) values. If these criteria are not met, the association is deemed inconclusive. The above computation of log odds ratios and the determination of associations are performed only on cell types that each compose at least 1% of the cell population, to ensure adequate power.
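
These decision rules can be sketched as follows. The function names (`direction`, `summarize_scissor`) and the `min_votes` argument are hypothetical, and the numeric inputs are made up; the thresholds and the four-of-six rule are taken from the paragraph above.

```python
from typing import Sequence, Tuple


def direction(rho_plus: float, rho_minus: float) -> str:
    """Apply the threshold rule to one pair of log odds ratios."""
    if rho_plus >= 1 and rho_minus < 1:
        return "positive"
    if rho_minus >= 1 and rho_plus < 1:
        return "negative"
    return "inconclusive"


def summarize_scissor(rhos: Sequence[Tuple[float, float]], min_votes: int = 4) -> str:
    """Summarize Scissor results over several alpha values.

    rhos holds one (rho_plus, rho_minus) pair per alpha value; a direction is
    reported only if it is supported under at least min_votes alpha values.
    """
    calls = [direction(rp, rm) for rp, rm in rhos]
    if calls.count("positive") >= min_votes:
        return "positive"
    if calls.count("negative") >= min_votes:
        return "negative"
    return "inconclusive"


# SCIPAC: a single pair of log odds ratios under the default setting.
scipac_call = direction(1.8, -0.3)

# Scissor: six made-up pairs, one per alpha value.
scissor_call = summarize_scissor(
    [(1.2, 0.1), (1.5, 0.0), (0.4, 0.2), (1.1, 0.3), (0.9, 0.8), (0.2, 1.4)]
)
```

In the toy Scissor input, the positive rule is met under only three of the six \(\alpha\) values, so the summary is inconclusive.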

For the prostate cancer data, the log odds ratios for each cell type using each method are presented in Tables S1 and S2. The final associations determined for each cell type are summarized in Table S3. In the last column of this table, we also indicate whether the conclusions drawn from SCIPAC and Scissor are consistent or not.

We find that SCIPAC’s results agree with Scissor on most cell types. However, there are three exceptions: mononuclear phagocytes (MNPs), B cells, and LE-KLK4.

MNPs are red-circled and shown in a zoom-in view in each sub-figure of Fig. 3 . Most cells in this cell type are colored red in Fig. 3 d but colored dark blue in Fig. 3 b. In other words, while Scissor determines that this cell type is Scissor+, SCIPAC makes the opposite inference. Moreover, SCIPAC is confident about its judgment, giving small p -values, as shown in Fig. 3 c. Determining which inference is closer to the biological truth is not easy, as biologically MNPs contain a number of sub-types that each have different functions [ 30 , 31 ]. Fortunately, this cell population has been studied in detail in the original paper that generated this dataset [ 29 ], and the sub-type information of each cell is provided there: this MNP population contains six sub-types, which are dendritic cells (DC), M1 macrophages (Mac1), metallothionein-expressing macrophages (Mac-MT), M2 macrophages (Mac2), proliferating macrophages (Mac-cycling), and monocytes (Mono), as shown in the zoom-in view of Fig. 3 a. Among these six sub-types, DC, Mac1, and Mac-MT are believed to inhibit cancer development and can serve as targets in cancer immunotherapy [ 29 ]; they compose more than 60% of all MNP cells in this dataset. SCIPAC makes the correct inference on this majority of MNP cells. Another sub-type, Mac2, is reported to promote tumor development [ 32 ], but it composes less than \(15\%\) of the MNPs. How the other two sub-types, Mac-cycling and Mono, are associated with cancer is less studied. Overall, the results given by SCIPAC are more consistent with the current biological knowledge.

B cells are cyan-circled in Fig. 3 b. B cells are generally believed to have anti-tumor activity by producing tumor-reactive antibodies and forming tertiary lymphoid structures [ 29 , 33 ]. This means that B cells are likely to be negatively associated with cancer. SCIPAC successfully identifies this negative association, while Scissor fails.

LE-KLK4, a subtype of cancer cells, is thought to be positively associated with the tumor phenotype [ 29 ]. SCIPAC successfully identified this positive association, in contrast to Scissor, which failed to do so (in the figure, a proportion of LE-KLK4 cells are identified as Scissor+, especially under the smallest \(\alpha\) value; however, this proportion is not significantly higher than the background Scissor+ level under the majority of \(\alpha\) values).

In summary, across all three cell types, the results from SCIPAC appear to be more consistent with current biological knowledge. For more discussions regarding this dataset, refer to Additional file 1.

Breast cancer data with an ordinal phenotype

The scRNA-seq data for breast cancer are from [ 34 ], and we use the 19,311 cells from the five HER2+ tumor tissues. The true cell types are shown in Fig. 4 a. The bulk data include 1215 TCGA-BRCA samples with information on the cancer stage (I, II, III, or IV), which is treated as an ordinal phenotype.

figure 4

UMAP visualization of the breast cancer data. a  True cell types. CAFs stand for cancer-associated fibroblasts, PB stands for plasmablasts and PVL stands for perivascular-like cells. b , c  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution. Cyan-circled are a group of T cells that are estimated by SCIPAC to be most significantly associated with the cancer stage in the negative direction, and orange-circled are a group of T cells that are estimated by SCIPAC to be significantly positively associated with the cancer stage. d  DE analysis of the cyan-circled T cells vs. all the other T cells. e  DE analysis of the cyan-circled T cells vs. all the other cells. f  Expression of CD8+ T cell marker genes in the cyan-circled cells and all the other cells. g  DE analysis of the orange-circled T cells vs. all the other cells. h  Expression of regulatory T cell marker genes in the orange-circled cells and all the other cells

Association strengths and p -values given by SCIPAC under the default resolution are shown in Fig. 4 b, c. Results under other resolutions are given in Additional file 1: Fig. S11 and S12, and again they are highly consistent with results under the default resolution. We do not present results from Scissor, as Scissor does not handle ordinal phenotypes.

In the SCIPAC results, cells that are most strongly and statistically significantly associated with the phenotype in the positive direction are the cancer-associated fibroblasts (CAFs). This finding agrees with the literature: CAFs contribute to therapy resistance and metastasis of cancer cells via the production of secreted factors and direct interaction with cancer cells [ 35 ], and they are also active players in breast cancer initiation and progression [ 36 , 37 , 38 , 39 ]. Another large group of cells identified as positively associated with the phenotype is the cancer epithelial cells. They are malignant cells in breast cancer tissues and are thus expected to be associated with severe cancer stages.

Of the cells identified as negatively associated with severe cancer stages, a large portion of T cells is the most noticeable. Biologically, T cells contain many sub-types, including CD4+, CD8+, regulatory T cells, and more, and their functions are diverse in the tumor microenvironment [ 40 ]. To explore SCIPAC’s discoveries, we compare T cells that are identified as most statistically significant, with p -values \(< 10^{-6}\) and circled in Fig. 4 d, with the other T cells. Differential expression (DE) analysis (details about DE analysis and other analyses are given in Additional file 1) identifies seven genes upregulated in these most significant T cells. Of these seven genes, at least five are supported by the literature: CCL4, XCL1, IFNG, and GZMB are associated with CD8+ T cell infiltration; they have been shown to have anti-tumor functions and are involved in cancer immunotherapy [ 41 , 42 , 43 ]. Also, IL2 has been shown to serve an important role in combination therapies for autoimmunity and cancer [ 44 ]. We also perform an enrichment analysis [ 45 ], in which a pathway called Myc stands out with a \(\textit{p}\text{-value}<10^{-7}\) , much smaller than all other pathways. Myc is downregulated in the T cells that are identified as most negatively associated with cancer stage progress. This agrees with current biological knowledge about this pathway: Myc is known to contribute to malignant cell transformation and tumor metastasis [ 46 , 47 , 48 ].

Above, we compared the T cells that are most significantly associated with cancer stages in the negative direction with the other T cells using DE and pathway analysis, and the results suggest that these cells may be tumor-infiltrated CD8+ T cells with tumor-inhibition functions. To check this hypothesis, we perform DE analysis of these cells against all other cells (i.e., the other T cells and all the other cell types). The DE genes are shown in Fig. 4 e. It can be noted that CD8+ T cell marker genes such as CD8A, CD8B, and GZMK are upregulated. We further obtain CD8+ T cell marker genes from CellMarker [ 49 ] and check their expression, as illustrated in Fig. 4 f. Marker genes CD8A, CD8B, CD3D, GZMK, and CD7 show significantly higher expression in these T cells. This again supports our hypothesis that these cells are tumor-infiltrated CD8+ T cells that have anti-tumor functions.

Interestingly, not all T cells are identified as negatively associated with severe cancer stages; a group of T cells is identified as positively associated, as circled in Fig. 4 c. To explore the function of this group of T cells, we perform DE analysis of these T cells against the other T cells. The DE genes are shown in Fig. 4 g. Based on the literature, six out of eight over-expressed genes are associated with cancer development. The high expression of NUSAP1 gene is associated with poor patient overall survival, and this gene also serves as a prognostic factor in breast cancer [ 50 , 51 , 52 ]. Gene MKI67 has been treated as a candidate prognostic predictor for cancer proliferation [ 53 , 54 ]. The over-expression of RRM2 has been linked to higher proliferation and invasiveness of malignant cells [ 55 , 56 ], and the upregulation of RRM2 in breast cancer suggests it to be a possible prognostic indicator [ 57 , 58 , 59 , 60 , 61 , 62 ]. The high expression of UBE2C gene always occurs in cancers with a high degree of malignancy, low differentiation, and high metastatic tendency [ 63 ]. For gene TOP2A, it has been proposed that the HER2 amplification in HER2 breast cancers may be a direct result of the frequent co-amplification of TOP2A [ 64 , 65 , 66 ], and there is a high correlation between the high expressions of TOP2A and the oncogene HER2 [ 67 , 68 ]. Gene CENPF is a cell cycle-associated gene, and it has been identified as a marker of cell proliferation in breast cancers [ 69 ]. The over-expression of these genes strongly supports the correctness of the association identified by SCIPAC. To further validate this positive association, we perform DE analysis of these cells against all the other cells. We find that the top marker genes obtained from CellMarker [ 49 ] for the regulatory T cells, which are known to be immunosuppressive and promote cancer progression [ 70 ], are over-expressed with statistical significance, as shown in Fig. 4 h. This finding again provides strong evidence that the positive association identified by SCIPAC for this group of T cells is correct.

Lung cancer data with survival information

The scRNA-seq data for lung cancer are from [ 71 ], and we use two lung adenocarcinoma (LUAD) patients’ data with 29,888 cells. The true cell types are shown in Fig. 5 a. The bulk data consist of 576 TCGA-LUAD samples with survival status and time.

Fig. 5. UMAP visualization of a–d the lung cancer data and e–g the muscular dystrophy data. a True cell types. b, c Association strengths \(\Lambda\) and p-values given by SCIPAC under the default resolution. d Results given by Scissor under different \(\alpha\) values. e, f Association strengths \(\Lambda\) and p-values given by SCIPAC under the default resolution. Circled are a group of cells that are identified by SCIPAC as significantly positively associated with the disease but identified by Scissor as null. g Results given by Scissor under different \(\alpha\) values

Association strengths and p-values given by SCIPAC are shown in Fig. 5 b, c (results under other resolutions are given in Additional file 1: Figs. S13 and S14). In Fig. 5 c, most cells with statistically significant associations are CD4+ T cells or B cells. These associations are negative, meaning that the abundance of these cells is associated with a reduced death rate, i.e., longer survival time. This agrees with the literature: CD4+ T cells primarily mediate anti-tumor immunity and are associated with favorable prognosis in lung cancer patients [ 72 , 73 , 74 ]; B cells also show anti-tumor functions in all stages of human lung cancer development and play an essential role in anti-tumor responses [ 75 , 76 ].

The results by Scissor under different \(\alpha\) values are shown in Fig. 5 d. The highly scattered Scissor+ and Scissor− cells make identifying and interpreting meaningful phenotype-associated cell groups difficult.

Muscular dystrophy data with a binary phenotype

This dataset contains cells from four facioscapulohumeral muscular dystrophy (FSHD) samples and two control samples [ 77 ]. We pool all 7047 cells from these six samples. The true cell types of these cells are unknown. The bulk data consist of 27 FSHD patients and eight controls from [ 28 ]. Here the phenotype is FSHD, and it is binary: present or absent.

The results of SCIPAC with the default resolution are given in Fig. 5 e, f. Results under other resolutions are highly similar (shown in Additional file 1: Figs. S15 and S16). For comparison, results given by Scissor under different \(\alpha\) values are presented in Fig. 5 g. The agreement between the results of SCIPAC and Scissor is clear. For example, both methods identify cells located at the top and lower left parts of the UMAP plots as negatively associated with FSHD, and cells located at the center and right parts of the UMAP plots as positively associated. However, the discrepancies in their results are also evident. The most pronounced one is a large group of cells (circled in Fig. 5 f) that are identified by SCIPAC as significantly positively associated but are completely ignored by Scissor. Examining this group of cells, we find that over 90% (424 out of 469) come from the FSHD patients, and less than 10% come from the control samples. However, cells from FSHD patients make up only 73% (5133) of all the 7047 cells. This statistically significant (p-value \(<10^{-15}\), Fisher’s exact test) over-representation (odds ratio = 3.51) suggests that the positive association identified by SCIPAC is likely to be correct.
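The arithmetic behind this over-representation check can be reproduced from the counts quoted above. The following is only an illustrative sketch using the Python standard library. It assumes that the reported odds ratio compares the odds of FSHD origin inside the circled group with the odds among all pooled cells (this reading reproduces 3.51), and it computes a one-sided hypergeometric tail as one possible form of Fisher's exact test; the exact contingency table and sidedness used for the reported p-value are not stated in the text.

```python
from fractions import Fraction
from math import comb

# Counts quoted in the text.
group_fshd, group_total = 424, 469   # circled group of cells
all_fshd, all_total = 5133, 7047     # all pooled cells

group_ctrl = group_total - group_fshd   # control cells in the group
all_ctrl = all_total - all_fshd         # control cells overall

# Assumed reading: odds inside the group divided by odds among all cells.
odds_ratio = (group_fshd / group_ctrl) / (all_fshd / all_ctrl)


def hypergeom_upper_tail(k, draws, successes, population):
    """P(X >= k) when `draws` cells are sampled without replacement."""
    failures = population - successes
    total = comb(population, draws)
    hits = sum(
        comb(successes, x) * comb(failures, draws - x)
        for x in range(k, min(draws, successes) + 1)
    )
    # Exact integer arithmetic, converted to float only at the end.
    return float(Fraction(hits, total))


p_one_sided = hypergeom_upper_tail(group_fshd, group_total, all_fshd, all_total)
print(round(odds_ratio, 2), p_one_sided < 1e-15)
```

Under these assumptions the odds ratio rounds to 3.51 and the one-sided tail probability falls below \(10^{-15}\), in line with the values reported above.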

SCIPAC is computationally highly efficient. On an 8-core machine with a 2.50 GHz CPU and 16 GB RAM, SCIPAC takes 7, 24, and 2 seconds to finish all the computation and give the estimated association strengths and p-values on the prostate cancer, lung cancer, and muscular dystrophy datasets, respectively. For reference, Scissor takes 314, 539, and 171 seconds, respectively.

SCIPAC works with various phenotype types, including binary, continuous, survival, and ordinal. It can easily accommodate other types by using a proper regression model with a systematic component in the form of Eq. 3 (see the “ Methods ” section). For example, a Poisson or negative binomial log-linear model can be used if the phenotype is a count (i.e., non-negative integer).

In SCIPAC’s definition of association, a cell type is associated with the phenotype if increasing the proportion of this cell type changes the probability that the phenotype occurs. The strength of association represents the extent of the increase or decrease in this probability. For a binary response, this change is measured by the log odds ratio. For example, if the association strength of cell type A is twice that of cell type B, increasing cell type A by a certain proportion leads to twice the change in the log odds ratio of having the phenotype compared to increasing cell type B by the same proportion. The association strength under other types of phenotypes can be interpreted similarly, with the major difference lying in the measure of change in probability. For quantitative, ordinal, and survival outcomes, the measures used are the difference in the quantitative outcome, the log odds ratio of the right-tail probability, and the log hazard ratio, respectively. Despite the differences in the exact form of the association strength under different types of phenotypes, the underlying concept remains the same: a larger (absolute value of) association strength indicates that the same increase/decrease in a cell type leads to a larger change in the occurrence of the phenotype.

As SCIPAC utilizes both bulk RNA-seq data with phenotype and single-cell RNA-seq data, the estimated associations for the cells are influenced by the choice of the bulk data. Although different bulk data can yield varying estimations of the association for the same single cells, the estimated associations appear to be reasonably robust even when minor changes are made to the bulk data. See Additional file 1 for further discussions.

When using the Louvain algorithm in the Seurat package to cluster cells, SCIPAC’s default resolution is 2.0, larger than the default setting of Seurat. This allows for the identification of potential subtypes within the major cell type and enables the estimation of individual association strengths. Consequently, a more detailed and comprehensive description of the association between single cells and the phenotype can be obtained by SCIPAC.

When applying SCIPAC to real datasets, we made a deliberate choice to disregard the cell annotation provided by the original publications and instead relied on the inferred cell clusters produced by the Louvain algorithm. We made this decision for several reasons. Firstly, we aimed to ensure a fair comparison with Scissor, as it does not utilize cell-type annotations. Secondly, the original annotation might not be sufficiently comprehensive or detailed. Presumed cell types could potentially encompass multiple subtypes, each of which may exhibit distinct associations with the phenotype under investigation. In such cases, employing the Louvain algorithm with a relatively high resolution, which is the default setting in SCIPAC, enables us to differentiate between these subtypes and allows SCIPAC to assign varying association strengths to each subtype.

SCIPAC fits the regression model using the elastic net, a machine-learning algorithm that maximizes a penalized version of the likelihood. The elastic net can be replaced by other penalized estimates of regression models, such as SCAD [ 78 ], without altering the rest of the SCIPAC algorithm. The combination of a regression model and a penalized estimation algorithm such as the elastic net has shown comparable or higher prediction power than other sophisticated methods such as random forests, boosting, or neural networks in numerous applications, especially for gene expression data [ 79 ]. However, there can still be datasets where other models have higher prediction power. Incorporating these models into SCIPAC is left for future work.

The use of metacells is becoming an efficient way to handle large single-cell datasets [ 80 , 81 , 82 , 83 ]. Conceptually, SCIPAC can incorporate metacells and their representatives as an alternative to its default setting of using cell clusters/types and their centroids. We have explored this aspect using metacells provided by SEACells [ 81 ]. Details are given in Additional file 1. Our comparative analysis reveals that combining SCIPAC with SEACells results in significantly reduced performance compared to using SCIPAC directly on the original single-cell data. The primary reason appears to be the subpar performance of SEACells in cell grouping, especially when contrasted with the Louvain algorithm. Given these findings, we do not recommend using metacells provided by SEACells for SCIPAC applications at the current stage.

Conclusions

SCIPAC is a novel algorithm for studying the associations between cells and phenotypes. Compared to the previous algorithm, SCIPAC gives a much more detailed and comprehensive description of the associations by enabling a quantitative estimation of the association strength and by providing a quality-control measure, the p-value. Underlying SCIPAC are a general statistical model that accommodates virtually all types of phenotypes, including ordinal (and potentially count) phenotypes that have never been considered before, and a concise closed-form mathematical formula that quantifies the association, which minimizes the computational load. The mathematical conciseness also largely frees SCIPAC from parameter tuning. The only parameter (i.e., the resolution) barely changes the results given by SCIPAC. Overall, compared with its predecessor, SCIPAC is a substantially more capable tool: it is more informative, versatile, robust, and user-friendly.

The improvement in accuracy is also remarkable. In simulated data, SCIPAC achieves high power and low false positives, which is evident from the UMAP plot, F1 score, and FSC score. In real data, SCIPAC gives results that are consistent with current biological knowledge for cell types whose functions are well understood. For cell types whose functions are less studied or more multifaceted, SCIPAC gives support to certain biological hypotheses or helps identify/discover cell sub-types.

Methods

SCIPAC’s identification of cell-phenotype associations closely follows its definition of association: when increasing the fraction of a cell type increases (or decreases) the probability for a phenotype to be present, this cell type is positively (or negatively) associated with the phenotype.

The increase of the fraction of a cell type

For a bulk sample, let vector \(\varvec{G} \in \mathbb {R}^p\) be its expression profile, that is, its expression on the p genes. Suppose there are K cell types in the tissue, and let \(\varvec{g}_{k}\) be the representative expression of the k-th cell type. It is commonly assumed that \(\varvec{G}\) can be decomposed as

\[\varvec{G} = \sum _{k = 1}^{K}\gamma _{k}\varvec{g}_{k}, \tag{1}\]

where \(\gamma _{k}\) is the proportion of cell type k in the bulk tissue, with \(\sum _{k = 1}^{K}\gamma _{k} = 1\) . This equation links the bulk and single-cell expression data.

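Now consider increasing cells from cell type k by \(\Delta \gamma\) proportion of the original number of cells. Then, the new proportion of cell type k becomes \(\frac{\gamma _{k} + \Delta \gamma }{1 + \Delta \gamma }\), and the new proportion of cell type \(j \ne k\) becomes \(\frac{\gamma _{j}}{1 + \Delta \gamma }\) (note that the new proportions of all cell types should still add up to 1). Thus, the bulk expression profile with the increase of cell type k becomes

\[\varvec{G}^* = \frac{\gamma _{k} + \Delta \gamma }{1 + \Delta \gamma }\varvec{g}_{k} + \sum _{j \ne k}\frac{\gamma _{j}}{1 + \Delta \gamma }\varvec{g}_{j}.\]

Substituting Eq. 1, we get

\[\varvec{G}^* = \frac{1}{1 + \Delta \gamma }\left( \varvec{G} + \Delta \gamma \, \varvec{g}_{k}\right) . \tag{2}\]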

Interestingly, this expression of \(\varvec{G}^*\) does not include \(\gamma _{1}, \ldots , \gamma _{K}\). This means that there is no need to actually compute \(\gamma _{1}, \ldots , \gamma _{K}\) in Eq. 1. These proportions could otherwise be estimated using cell-type decomposition software, but accurate and robust decomposition is non-trivial [ 84 , 85 , 86 ]. See Additional file 1 for a more in-depth discussion on the connections of SCIPAC with decomposition/deconvolution.
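The cancellation of the proportions can be checked numerically. The sketch below is illustrative only: the profiles, proportions, and function names are made up for the example. It computes the new bulk profile in two ways, once by renormalizing the proportions and remixing, and once from the closed form \((\varvec{G} + \Delta \gamma \, \varvec{g}_{k})/(1 + \Delta \gamma )\) implied by the derivation, and confirms that the two agree.

```python
def mix(weights, profiles):
    """Weighted sum of expression profiles (lists of equal length)."""
    n_genes = len(profiles[0])
    return [sum(w * g[i] for w, g in zip(weights, profiles)) for i in range(n_genes)]


def bulk_after_increase(gamma, profiles, k, delta):
    """Renormalize proportions after adding `delta` of cell type k, then remix."""
    new_gamma = [(g + delta if j == k else g) / (1 + delta) for j, g in enumerate(gamma)]
    return mix(new_gamma, profiles)


def bulk_closed_form(G, g_k, delta):
    """Closed form (G + delta * g_k) / (1 + delta); needs no proportions."""
    return [(G_i + delta * g_i) / (1 + delta) for G_i, g_i in zip(G, g_k)]


# Toy example: three cell types, three genes (hypothetical values).
profiles = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [4.0, 1.0, 0.0]]
gamma = [0.5, 0.3, 0.2]
G = mix(gamma, profiles)

direct = bulk_after_increase(gamma, profiles, k=1, delta=0.25)
closed = bulk_closed_form(G, profiles[1], delta=0.25)
agree = all(abs(a - b) < 1e-12 for a, b in zip(direct, closed))
print(agree)
```

The closed form only needs the observed bulk profile and the cell-type centroid, which is why no deconvolution step appears in the rest of the method.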

The change in chance of a phenotype

In this section, we consider how the increase in the fraction of a cell type will change the chance for a binary phenotype such as cancer to occur. Other types of phenotypes will be considered in the next section.

Let \(\pi (\varvec{G})\) be the chance that the phenotype occurs for an individual with gene expression profile \(\varvec{G}\). We assume a logistic regression model to describe the relationship between \(\pi (\varvec{G})\) and \(\varvec{G}\):

\[\log \frac{\pi (\varvec{G})}{1 - \pi (\varvec{G})} = \beta _{0} + \varvec{\beta }^T\varvec{G}. \tag{3}\]

Here, the left-hand side is the log odds of the phenotype, \(\beta _{0}\) is the intercept, and \(\varvec{\beta }\) is a length-p vector of coefficients. In the section after the next, we will describe how we obtain \(\beta _{0}\) and \(\varvec{\beta }\) from the data.

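When cells from cell type k are increased by \(\Delta \gamma\), \(\varvec{G}\) in Eq. 3 becomes \(\varvec{G}^*\). Substituting Eq. 2, we get

\[\log \frac{\pi (\varvec{G}^*)}{1 - \pi (\varvec{G}^*)} = \beta _{0} + \frac{1}{1 + \Delta \gamma }\varvec{\beta }^T\left( \varvec{G} + \Delta \gamma \, \varvec{g}_{k}\right) . \tag{4}\]

We further take the difference between Eqs. 4 and 3 and get

\[\log \frac{\pi (\varvec{G}^*)/\left( 1 - \pi (\varvec{G}^*)\right) }{\pi (\varvec{G})/\left( 1 - \pi (\varvec{G})\right) } = \frac{\Delta \gamma }{1 + \Delta \gamma }\varvec{\beta }^T(\varvec{g}_{k} - \varvec{G}). \tag{5}\]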

The left-hand side of this equation is the log odds ratio (i.e., the change of log odds). On the right-hand side, \(\frac{\Delta \gamma }{1 + \Delta \gamma }\) is an increasing function of \(\Delta \gamma\), and \(\varvec{\beta }^T(\varvec{g}_{k} - \varvec{G})\) is independent of \(\Delta \gamma\). This indicates that given any specific \(\Delta \gamma\), the log odds ratio under over-representation of cell type k is proportional to

\[\lambda _k = \varvec{\beta }^T(\varvec{g}_{k} - \varvec{G}). \tag{6}\]

\(\lambda _k\) describes the strength of the effect of increasing cell type k in a bulk sample with expression profile \(\varvec{G}\). Given the presence of numerous bulk samples, employing multiple \(\lambda _k\)’s could be cumbersome and obscure the overall effect of a particular cell type. To concisely summarize the association of cell type k, we propose averaging their effects. Because \(\lambda _k\) is linear in \(\varvec{G}\), the average effect over all bulk samples is

\[\Lambda _k = \varvec{\beta }^T(\varvec{g}_{k} - \bar{\varvec{G}}), \tag{7}\]

where \(\bar{\varvec{G}}\) is the average expression profile of all bulk samples.

\(\Lambda _k\) summarizes how strongly over-representation of cell type k affects the probability that the phenotype is present. Its sign represents the direction of the change: a positive value means an increase in probability, and a negative value means a decrease in probability. Its absolute value represents the strength of the effect. In SCIPAC, we call \(\Lambda _k\) the association strength of cell type k and the phenotype.
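As a small illustration of this definition, the sketch below computes the per-sample effect \(\varvec{\beta }^T(\varvec{g}_{k} - \varvec{G})\) for each bulk sample and the association strength \(\varvec{\beta }^T(\varvec{g}_{k} - \bar{\varvec{G}})\), and checks that the latter equals the average of the former. All numbers and function names are hypothetical and chosen only for the example; this is not the SCIPAC implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def per_sample_effect(beta, g_k, G):
    """lambda_k = beta^T (g_k - G) for one bulk expression profile G."""
    return dot(beta, [a - b for a, b in zip(g_k, G)])


def association_strength(beta, g_k, bulk):
    """Lambda_k = beta^T (g_k - G_bar), with G_bar the mean bulk profile."""
    n = len(bulk)
    G_bar = [sum(col) / n for col in zip(*bulk)]
    return per_sample_effect(beta, g_k, G_bar)


# Toy inputs (hypothetical): 3 genes, 3 bulk samples.
beta = [0.5, -1.0, 0.0]
g_k = [4.0, 0.5, 3.0]
bulk = [[1.0, 1.0, 0.0], [3.0, 2.0, 1.0], [2.0, 0.0, 5.0]]

Lambda_k = association_strength(beta, g_k, bulk)
mean_lambda = sum(per_sample_effect(beta, g_k, G) for G in bulk) / len(bulk)
print(Lambda_k, mean_lambda)
```

Here the positive value indicates that, under the fitted coefficients, adding cells of this type would raise the log odds of the phenotype; the third gene contributes nothing because its coefficient is zero.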

Note that this derivation does not involve the likelihood, although the computation of \(\varvec{\beta }\) does. The derivation serves as a definition of the association strength rather than as an estimation procedure.

Definition of the association strength for other types of phenotype

Our definition of \(\Lambda _k\) relies on vector \(\varvec{\beta }\) . In the case of a binary phenotype, \(\varvec{\beta }\) are the coefficients of a logistic regression that describes a linear relationship between the expression profile and the log odds of having the phenotype, as shown in Eq. 3 . For other types of phenotype, \(\varvec{\beta }\) can be defined/computed similarly.

For a quantitative (i.e., continuous) phenotype, an ordinary linear regression can be used, and the left-hand side of Eq. 3 is changed to the quantitative value of the phenotype.

For a survival phenotype, a Cox proportional hazards model can be used, and the left-hand side of Eq. 3 is changed to the log hazard ratio.

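For an ordinal phenotype, we use a proportional odds model with level-specific intercepts \(\beta _{0j}\):

\[\log \frac{\Pr (Y_{i} \ge j + 1 | X)}{\Pr (Y_{i} \le j | X)} = \beta _{0j} + \varvec{\beta }^T X,\]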

where \(j \in \{1, 2, \ldots , (J - 1)\}\) and J is the number of ordinal levels. It should be noted that here we use the right-tail probability \(\Pr (Y_{i} \ge j + 1 | X)\) instead of the commonly used cumulative probability (left-tail probability) \(\Pr (Y_{i} \le j | X)\). This change makes the interpretation consistent with other types of phenotypes: in our model, a larger value on the right-hand side indicates a larger chance for \(Y_{i}\) to have a higher level, which in turn guarantees that the sign of the association strength defined according to this \(\varvec{\beta }\) has the usual meaning: a positive \(\Lambda _k\) value means a positive association with the phenotype. Using the cancer stage as an example, a positive \(\Lambda _k\) means that the over-representation of cell type k increases the chance of a higher cancer stage. In contrast, using the commonly used cumulative probability leads to a counter-intuitive, reversed interpretation.
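The effect of using the right-tail probability can be illustrated with a few lines of Python. The sketch assumes the right-tail logit takes the form of a level-specific intercept plus a shared linear predictor \(\varvec{\beta }^T X\); the intercept values and the function name are hypothetical. It shows that a larger linear predictor raises the probability of every higher level, which is the sign convention described above.

```python
import math


def right_tail_probs(intercepts, eta):
    """Pr(Y >= j + 1) for j = 1..J-1 under a right-tail proportional odds model.

    `intercepts` holds the J-1 level-specific intercepts (assumed decreasing
    in j so that the tail probabilities are decreasing); `eta` is the shared
    linear predictor beta^T x.
    """
    return [1.0 / (1.0 + math.exp(-(b0 + eta))) for b0 in intercepts]


# Hypothetical intercepts for J = 4 ordinal levels (e.g., cancer stages).
intercepts = [1.0, 0.0, -1.5]

low = right_tail_probs(intercepts, eta=-0.5)
high = right_tail_probs(intercepts, eta=0.5)

# A larger linear predictor increases the chance of reaching each higher level.
print(all(h > l for h, l in zip(high, low)))
```

With the left-tail (cumulative) parametrization the same increase in the linear predictor would raise the probability of lower levels instead, which is the reversed interpretation the text warns about.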

Computation of the association strength in practice

In practice, \(\varvec{\beta }\) in Eq. 3 needs to be learned from the bulk data. By default, SCIPAC uses the elastic net, a popular and powerful penalized regression method:

In this model, \(l(\beta _{0}, \varvec{\beta })\) is a log-likelihood of the linear model (i.e., logistic regression for a binary phenotype, ordinary linear regression for a quantitative phenotype, Cox proportional hazards model for a survival phenotype, and proportional odds model for an ordinal phenotype). \(\alpha\) is a number between 0 and 1 that controls the mix of \(\ell _1\) and \(\ell _2\) penalties, and \(\lambda\) is the penalty strength. SCIPAC fixes \(\alpha\) at 0.4 (see Additional file 1 for discussions on this choice) and uses 10-fold cross-validation to choose \(\lambda\) automatically. This way, neither \(\alpha\) nor \(\lambda\) needs to be tuned by the user.
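To make the roles of \(\alpha\) and \(\lambda\) concrete, the following sketch evaluates a penalized objective for a binary phenotype. It is an assumption-laden illustration rather than SCIPAC's code: it follows the parametrization documented for glmnet, in which the penalty is \(\lambda [\alpha \Vert \varvec{\beta }\Vert _1 + \frac{1 - \alpha }{2}\Vert \varvec{\beta }\Vert _2^2]\) and the log-likelihood is averaged over samples. The exact scaling used in the paper's own formula is not shown here, and the function names are hypothetical.

```python
import math


def log_likelihood(beta0, beta, X, y):
    """Bernoulli log-likelihood of a logistic model on rows X with 0/1 labels y."""
    total = 0.0
    for x, label in zip(X, y):
        eta = beta0 + sum(b * v for b, v in zip(beta, x))
        total += label * eta - math.log1p(math.exp(eta))
    return total


def elastic_net_penalty(beta, alpha):
    """alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2 (intercept unpenalized)."""
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq


def penalized_objective(beta0, beta, X, y, lam, alpha=0.4):
    """Quantity to maximize: mean log-likelihood minus lam times the penalty."""
    return log_likelihood(beta0, beta, X, y) / len(y) - lam * elastic_net_penalty(beta, alpha)


# Toy data (hypothetical): 4 bulk samples, 2 genes.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]
print(penalized_objective(0.0, [1.0, -2.0], X, y, lam=0.1))
```

Setting `alpha=1.0` gives the lasso penalty and `alpha=0.0` the ridge penalty; the default of 0.4 in the sketch mirrors the value SCIPAC fixes, while `lam` would in practice be chosen by cross-validation.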

In SCIPAC, the fitting and cross-validation of the elastic net are done by calling the ordinalNet [ 87 ] R package for the ordinal phenotype and by calling the glmnet R package [ 88 , 89 , 90 , 91 ] for other types of phenotypes.

The computation of the association strength, as defined by Eq. 7, requires not only \(\varvec{\beta }\) but also \(\varvec{g}_k\) and \(\bar{\varvec{G}}\). \(\bar{\varvec{G}}\) is simply the average expression profile of all bulk samples. On the other hand, \(\varvec{g}_k\) requires knowing the cell type of each cell. By default, SCIPAC does not assume this information to be given, and it uses the Louvain clustering implemented in the Seurat [ 24 , 25 ] R package to infer it. This clustering algorithm has one tuning parameter called “resolution.” SCIPAC sets its default value to 2.0, and the user can choose other values. With the inferred or given cell types, \(\varvec{g}_k\) is computed as the centroid (i.e., the mean expression profile) of cells in cluster k.

Given \(\varvec{\beta }\), \(\bar{\varvec{G}}\), and \(\varvec{g}_k\), the association strength can be computed using Eq. 7. Knowing the association strength for each cell type and the cell-type label for each cell, we also know the association strength for every single cell. In practice, we standardize the association strengths for all cells. That is, we compute the mean and standard deviation of the association strengths of all cells and use them to center and scale the association strengths, respectively. We have found that such standardization makes SCIPAC more robust to possible imbalance in the sample sizes of the bulk data across phenotype groups.
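The centroid and standardization steps can be sketched as follows. This is a minimal illustration with made-up values and hypothetical function names; in particular, the use of the sample standard deviation (dividing by n − 1, as R's `sd` does) is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, stdev


def centroid(cells):
    """Mean expression profile of the cells in one cluster."""
    return [mean(col) for col in zip(*cells)]


def standardized_cell_strengths(cluster_strength, labels):
    """Assign each cell its cluster's strength, then center and scale over all cells.

    Assumes at least two clusters with different strengths, so the standard
    deviation is non-zero.
    """
    raw = [cluster_strength[c] for c in labels]
    mu, sd = mean(raw), stdev(raw)
    return [(v - mu) / sd for v in raw]


# Toy example: cluster 0 has three cells, cluster 1 has one cell.
cluster_cells = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
print(centroid(cluster_cells))

cluster_strength = {0: 2.0, 1: -1.0}   # hypothetical Lambda_k per cluster
labels = [0, 0, 0, 1]                  # cluster label of each cell
z = standardized_cell_strengths(cluster_strength, labels)
print(z)
```

Because the mean and standard deviation are taken over cells rather than over clusters, large clusters carry more weight in the standardization than small ones.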

Computation of the p-value

SCIPAC uses the non-parametric bootstrap [ 92 ] to compute the standard deviation and hence the p-value of the association. Fifty bootstrap samples, which are believed to be enough to compute the standard error of most statistics [ 93 ], are generated for the bulk expression data, and each is used to compute (standardized) \(\Lambda\) values for all the cells. For cell i, let its original \(\Lambda\) value be \(\Lambda _i\), and the bootstrapped values be \(\Lambda _i^{(1)}, \ldots , \Lambda _i^{(50)}\). A z-score is then computed using

and then the p-value is computed according to the cumulative distribution function of the standard Gaussian distribution. See Additional file 1 for more discussions on the calculation of the p-value.
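A minimal sketch of this step is given below. Two details are assumptions made for illustration, since they are not spelled out here: the z-score is taken as the original \(\Lambda _i\) divided by the sample standard deviation of its bootstrap replicates, and the p-value is two-sided. The function name and the numbers are hypothetical, and only five replicates are used to keep the example short.

```python
from statistics import NormalDist, stdev


def bootstrap_p_value(strength, boot_strengths):
    """Return (z, p) with z = strength / sd(bootstrap replicates), two-sided p."""
    z = strength / stdev(boot_strengths)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical standardized association strength of one cell and its
# bootstrap replicates (SCIPAC uses fifty; five are shown for brevity).
strength = 1.2
boot = [0.8, 1.0, 1.4, 1.2, 1.6]

z, p = bootstrap_p_value(strength, boot)
print(round(z, 3), p < 0.001)
```

A cell whose association strength is large relative to its bootstrap variability receives a small p-value, whereas a strength of zero gives a p-value of 1 under this construction.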

Availability of data and materials

The simulated datasets [ 94 ] under three schemes are available at Zenodo with DOI 10.5281/zenodo.11013320 [ 95 ]. The SCIPAC package is available on GitHub at https://github.com/RavenGan/SCIPAC under the MIT license [ 96 ]. The source code of SCIPAC is also deposited at Zenodo with DOI 10.5281/zenodo.11013696 [ 97 ]. A vignette of the R package is available on the GitHub page and in Additional file 2. The prostate cancer scRNA-seq data are obtained from the Prostate Cell Atlas https://www.prostatecellatlas.org [ 29 ]; the scRNA-seq data for breast cancer are from the Gene Expression Omnibus (GEO) under accession number GSE176078 [ 34 , 98 ]; the scRNA-seq data for lung cancer are from E-MTAB-6149 [ 99 ] and E-MTAB-6653 [ 71 , 100 ]; the scRNA-seq data for facioscapulohumeral muscular dystrophy are from the GEO under accession number GSE122873 [ 101 ]. The bulk RNA-seq data are obtained from the TCGA database via the TCGAbiolinks (ver. 2.25.2) R package [ 102 ]. More details about the simulated and real scRNA-seq and bulk RNA-seq data can be found in Additional file 1.

References

Yofe I, Dahan R, Amit I. Single-cell genomic approaches for developing the next generation of immunotherapies. Nat Med. 2020;26(2):171–7.

Zhang Q, He Y, Luo N, Patel SJ, Han Y, Gao R, et al. Landscape and dynamics of single immune cells in hepatocellular carcinoma. Cell. 2019;179(4):829–45.

Fan J, Slowikowski K, Zhang F. Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp Mol Med. 2020;52(9):1452–65.

Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.

Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.

Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science. 2018;360(6385):176–82.

Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12.

Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJ, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20(1):1–19.

Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746.

Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol. 2021;22(1):1–18.

Pliner HA, Shendure J, Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat Methods. 2019;16(10):983–6.

Zhang AW, O’Flanagan C, Chavez EA, Lim JL, Ceglia N, McPherson A, et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods. 2019;16(10):1007–15.

Zhang Z, Luo D, Zhong X, Choi JH, Ma Y, Wang S, et al. SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes. 2019;10(7):531.

Johnson TS, Wang T, Huang Z, Yu CY, Wu Y, Han Y, et al. LAmbDA: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection. Bioinformatics. 2019;35(22):4696–706.

Ma F, Pellegrini M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics. 2020;36(2):533–8.

Tan Y, Cahan P. SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species. Cell Syst. 2019;9(2):207–13.

Salcher S, Sturm G, Horvath L, Untergasser G, Kuempers C, Fotakis G, et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell. 2022;40(12):1503–20.

Good Z, Sarno J, Jager A, Samusik N, Aghaeepour N, Simonds EF, et al. Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse. Nat Med. 2018;24(4):474–83.

Wagner J, Rapsomaniki MA, Chevrier S, Anzeneder T, Langwieder C, Dykgers A, et al. A single-cell atlas of the tumor and immune ecosystem of human breast cancer. Cell. 2019;177(5):1330–45.

Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20.

Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Disc. 2012;2(5):401–4.

Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6(269):1.

Sun D, Guan X, Moran AE, Wu LY, Qian DZ, Schedin P, et al. Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nat Biotechnol. 2022;40(4):527–38.

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;2008(10):P10008.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM III, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–902.

Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20.

McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 2018. arXiv preprint arXiv:1802.03426 .

Wong CJ, Wang LH, Friedman SD, Shaw D, Campbell AE, Budech CB, et al. Longitudinal measures of RNA expression and disease activity in FSHD muscle biopsies. Hum Mol Genet. 2020;29(6):1030–43.

Tuong ZK, Loudon KW, Berry B, Richoz N, Jones J, Tan X, et al. Resolving the immune landscape of human prostate at a single-cell level in health and cancer. Cell Rep. 2021;37(12):110132.

Hume DA. The mononuclear phagocyte system. Curr Opin Immunol. 2006;18(1):49–53.

Hume DA, Ross IL, Himes SR, Sasmono RT, Wells CA, Ravasi T. The mononuclear phagocyte system revisited. J Leukoc Biol. 2002;72(4):621–7.

Raggi F, Bosco MC. Targeting mononuclear phagocyte receptors in cancer immunotherapy: new perspectives of the triggering receptor expressed on myeloid cells (TREM-1). Cancers. 2020;12(5):1337.

Largeot A, Pagano G, Gonder S, Moussay E, Paggetti J. The B-side of cancer immunity: the underrated tune. Cells. 2019;8(5):449.

Wu SZ, Al-Eryani G, Roden DL, Junankar S, Harvey K, Andersson A, et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet. 2021;53(9):1334–47.

Fernández-Nogueira P, Fuster G, Gutierrez-Uzquiza Á, Gascón P, Carbó N, Bragado P. Cancer-associated fibroblasts in breast cancer treatment response and metastasis. Cancers. 2021;13(13):3146.

Ao Z, Shah SH, Machlin LM, Parajuli R, Miller PC, Rawal S, et al. Identification of cancer-associated fibroblasts in circulating blood from patients with metastatic breast cancer. Cancer Res. 2015;75(22):4681–7.

Arcucci A, Ruocco MR, Granato G, Sacco AM, Montagnani S. Cancer: an oxidative crosstalk between solid tumor cells and cancer associated fibroblasts. BioMed Res Int. 2016;2016.  https://pubmed.ncbi.nlm.nih.gov/27595103/ .

Buchsbaum RJ, Oh SY. Breast cancer-associated fibroblasts: where we are and where we need to go. Cancers. 2016;8(2):19.

Ruocco MR, Avagliano A, Granato G, Imparato V, Masone S, Masullo M, et al. Involvement of breast cancer-associated fibroblasts in tumor development, therapy resistance and evaluation of potential therapeutic strategies. Curr Med Chem. 2018;25(29):3414–34.

Savas P, Virassamy B, Ye C, Salim A, Mintoff CP, Caramia F, et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat Med. 2018;24(7):986–93.

Bassez A, Vos H, Van Dyck L, Floris G, Arijs I, Desmedt C, et al. A single-cell map of intratumoral changes during anti-PD1 treatment of patients with breast cancer. Nat Med. 2021;27(5):820–32.

Romero JM, Grünwald B, Jang GH, Bavi PP, Jhaveri A, Masoomian M, et al. A four-chemokine signature is associated with a T-cell-inflamed phenotype in primary and metastatic pancreatic cancer. Clin Cancer Res. 2020;26(8):1997–2010.

Tamura R, Yoshihara K, Nakaoka H, Yachida N, Yamaguchi M, Suda K, et al. XCL1 expression correlates with CD8-positive T cells infiltration and PD-L1 expression in squamous cell carcinoma arising from mature cystic teratoma of the ovary. Oncogene. 2020;39(17):3541–54.

Hernandez R, Põder J, LaPorte KM, Malek TR. Engineering IL-2 for immunotherapy of autoimmunity and cancer. Nat Rev Immunol. 2022:22:1–15.  https://pubmed.ncbi.nlm.nih.gov/35217787/ .

Korotkevich G, Sukhov V, Budin N, Shpak B, Artyomov MN, Sergushichev A. Fast gene set enrichment analysis. BioRxiv. 2016:060012.  https://www.biorxiv.org/content/10.1101/060012v3.abstract .

Dang CV. MYC on the path to cancer. Cell. 2012;149(1):22–35.

Gnanaprakasam JR, Wang R. MYC in regulating immunity: metabolism and beyond. Genes. 2017;8(3):88.

Oshi M, Takahashi H, Tokumaru Y, Yan L, Rashid OM, Matsuyama R, et al. G2M cell cycle pathway score as a prognostic biomarker of metastasis in estrogen receptor (ER)-positive breast cancer. Int J Mol Sci. 2020;21(8):2921.

Zhang X, Lan Y, Xu J, Quan F, Zhao E, Deng C, et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 2019;47(D1):D721–8.

Chen L, Yang L, Qiao F, Hu X, Li S, Yao L, et al. High levels of nucleolar spindle-associated protein and reduced levels of BRCA1 expression predict poor prognosis in triple-negative breast cancer. PLoS ONE. 2015;10(10):e0140572.

Li M, Yang B. Prognostic value of NUSAP1 and its correlation with immune infiltrates in human breast cancer. Crit Rev TM Eukaryot Gene Expr. 2022;32(3).  https://pubmed.ncbi.nlm.nih.gov/35695609/ .

Zhang X, Pan Y, Fu H, Zhang J. Nucleolar and spindle associated protein 1 (NUSAP1) inhibits cell proliferation and enhances susceptibility to epirubicin in invasive breast cancer cells by regulating cyclin D kinase (CDK1) and DLGAP5 expression. Med Sci Monit: Int Med J Exp Clin Res. 2018;24:8553.

Geyer FC, Rodrigues DN, Weigelt B, Reis-Filho JS. Molecular classification of estrogen receptor-positive/luminal breast cancers. Adv Anat Pathol. 2012;19(1):39–53.

Karamitopoulou E, Perentes E, Tolnay M, Probst A. Prognostic significance of MIB-1, p53, and bcl-2 immunoreactivity in meningiomas. Hum Pathol. 1998;29(2):140–5.

Duxbury MS, Whang EE. RRM2 induces NF- \(\kappa\) B-dependent MMP-9 activation and enhances cellular invasiveness. Biochem Biophys Res Commun. 2007;354(1):190–6.

Zhou BS, Tsai P, Ker R, Tsai J, Ho R, Yu J, et al. Overexpression of transfected human ribonucleotide reductase M2 subunit in human cancer cells enhances their invasive potential. Clin Exp Metastasis. 1998;16(1):43–9.

Zhang H, Liu X, Warden CD, Huang Y, Loera S, Xue L, et al. Prognostic and therapeutic significance of ribonucleotide reductase small subunit M2 in estrogen-negative breast cancers. BMC Cancer. 2014;14(1):1–16.

Putluri N, Maity S, Kommagani R, Creighton CJ, Putluri V, Chen F, et al. Pathway-centric integrative analysis identifies RRM2 as a prognostic marker in breast cancer associated with poor survival and tamoxifen resistance. Neoplasia. 2014;16(5):390–402.

Koleck TA, Conley YP. Identification and prioritization of candidate genes for symptom variability in breast cancer survivors based on disease characteristics at the cellular level. Breast Cancer Targets Ther. 2016;8:29.

Li Jp, Zhang Xm, Zhang Z, Zheng Lh, Jindal S, Liu Yj. Association of p53 expression with poor prognosis in patients with triple-negative breast invasive ductal carcinoma. Medicine. 2019;98(18).  https://pubmed.ncbi.nlm.nih.gov/31045815/ .

Gong MT, Ye SD, Lv WW, He K, Li WX. Comprehensive integrated analysis of gene expression datasets identifies key anti-cancer targets in different stages of breast cancer. Exp Ther Med. 2018;16(2):802–10.

Chen Wx, Yang Lg, Xu Ly, Cheng L, Qian Q, Sun L, et al. Bioinformatics analysis revealing prognostic significance of RRM2 gene in breast cancer. Biosci Rep. 2019;39(4).  https://pubmed.ncbi.nlm.nih.gov/30898978/ .

Hao Z, Zhang H, Cowell J. Ubiquitin-conjugating enzyme UBE2C: molecular biology, role in tumorigenesis, and potential as a biomarker. Tumor Biol. 2012;33(3):723–30.

Arriola E, Rodriguez-Pinilla SM, Lambros MB, Jones RL, James M, Savage K, et al. Topoisomerase II alpha amplification may predict benefit from adjuvant anthracyclines in HER2 positive early breast cancer. Breast Cancer Res Treat. 2007;106(2):181–9.

Knoop AS, Knudsen H, Balslev E, Rasmussen BB, Overgaard J, Nielsen KV, et al. Retrospective analysis of topoisomerase IIa amplifications and deletions as predictive markers in primary breast cancer patients randomly assigned to cyclophosphamide, methotrexate, and fluorouracil or cyclophosphamide, epirubicin, and fluorouracil: Danish Breast Cancer Cooperative Group. J Clin Oncol. 2005;23(30):7483–90.

Tanner M, Isola J, Wiklund T, Erikstein B, Kellokumpu-Lehtinen P, Malmstrom P, et al. Topoisomerase II \(\alpha\) gene amplification predicts favorable treatment response to tailored and dose-escalated anthracycline-based adjuvant chemotherapy in HER-2/neu-amplified breast cancer: Scandinavian Breast Group Trial 9401. J Clin Oncol. 2006;24(16):2428–36.

Arriola E, Moreno A, Varela M, Serra JM, Falo C, Benito E, et al. Predictive value of HER-2 and topoisomerase II \(\alpha\) in response to primary doxorubicin in breast cancer. Eur J Cancer. 2006;42(17):2954–60.

Järvinen TA, Tanner M, Bärlund M, Borg Å, Isola J. Characterization of topoisomerase II \(\alpha\) gene amplification and deletion in breast cancer. Gene Chromosome Cancer. 1999;26(2):142–50.

Landberg G, Erlanson M, Roos G, Tan EM, Casiano CA. Nuclear autoantigen p330d/CENP-F: a marker for cell proliferation in human malignancies. Cytom J Int Soc Anal Cytol. 1996;25(1):90–8.

Bettelli E, Carrier Y, Gao W, Korn T, Strom TB, Oukka M, et al. Reciprocal developmental pathways for the generation of pathogenic effector TH17 and regulatory T cells. Nature. 2006;441(7090):235–8.

Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat Med. 2018;24(8):1277–89.

Bremnes RM, Busund LT, Kilvær TL, Andersen S, Richardsen E, Paulsen EE, et al. The role of tumor-infiltrating lymphocytes in development, progression, and prognosis of non-small cell lung cancer. J Thorac Oncol. 2016;11(6):789–800.

Schalper KA, Brown J, Carvajal-Hausdorf D, McLaughlin J, Velcheti V, Syrigos KN, et al. Objective measurement and clinical significance of TILs in non–small cell lung cancer. J Natl Cancer Inst. 2015;107(3):dju435.

Tay RE, Richardson EK, Toh HC. Revisiting the role of CD4+ T cells in cancer immunotherapy—new insights into old paradigms. Cancer Gene Ther. 2021;28(1):5–17.

Dieu-Nosjean MC, Goc J, Giraldo NA, Sautès-Fridman C, Fridman WH. Tertiary lymphoid structures in cancer and beyond. Trends Immunol. 2014;35(11):571–80.

Wang Ss, Liu W, Ly D, Xu H, Qu L, Zhang L. Tumor-infiltrating B cells: their role and application in anti-tumor immunity in lung cancer. Cell Mol Immunol. 2019;16(1):6–18.

van den Heuvel A, Mahfouz A, Kloet SL, Balog J, van Engelen BG, Tawil R, et al. Single-cell RNA sequencing in facioscapulohumeral muscular dystrophy disease etiology and development. Hum Mol Genet. 2019;28(7):1064–75.

Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–60.

Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: data mining, inference, and prediction, vol. 2. New York: Springer; 2009.

Baran Y, Bercovich A, Sebe-Pedros A, Lubling Y, Giladi A, Chomsky E, et al. MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions. Genome Biol. 2019;20(1):1–19.

Persad S, Choo ZN, Dien C, Sohail N, Masilionis I, Chaligné R, et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nat Biotechnol. 2023;41:1–12.  https://pubmed.ncbi.nlm.nih.gov/36973557/ .

Ben-Kiki O, Bercovich A, Lifshitz A, Tanay A. Metacell-2: a divide-and-conquer metacell algorithm for scalable scRNA-seq analysis. Genome Biol. 2022;23(1):100.

Bilous M, Tran L, Cianciaruso C, Gabriel A, Michel H, Carmona SJ, et al. Metacells untangle large and complex single-cell transcriptome networks. BMC Bioinformatics. 2022;23(1):336.

Avila Cobos F, Alquicira-Hernandez J, Powell JE, Mestdagh P, De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat Commun. 2020;11(1):1–14.

Jin H, Liu Z. A benchmark for RNA-seq deconvolution analysis under dynamic testing environments. Genome Biol. 2021;22(1):1–23.

Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat Commun. 2019;10(1):380.

Wurm MJ, Rathouz PJ, Hanlon BM. Regularized ordinal regression and the ordinalNet R package. 2017. arXiv preprint arXiv:1706.05003 .

Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1.

Simon N, Friedman J, Hastie T. A blockwise descent algorithm for group-penalized multiresponse and multinomial regression. 2013. arXiv preprint arXiv:1311.6529 .

Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1.

Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, et al. Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B Stat Methodol. 2012;74(2):245–66.

Efron B. Bootstrap methods: another look at the jackknife. In: Breakthroughs in statistics. New York: Springer; 1992. pp. 569–593.

Efron B, Tibshirani RJ. An introduction to the bootstrap. London: CRC Press; 1994.

Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017;18(1):174.

Gan D, Zhu Y, Lu X, Li J. Simulated datasets used in SCIPAC analysis. Zenodo. 2024. https://doi.org/10.5281/zenodo.11013320 .

Gan D, Zhu Y, Lu X, Li J. SCIPAC R package. GitHub. 2024. https://github.com/RavenGan/SCIPAC . Accessed 24 Apr 2024.

Gan D, Zhu Y, Lu X, Li J. SCIPAC source code. Zenodo. 2024. https://doi.org/10.5281/zenodo.11013696 .

Wu SZ, Al-Eryani G, Roden DL, Junankar S, Harvey K, Andersson A, et al. A single-cell and spatially resolved atlas of human breast cancers. Datasets. 2021. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176078 . Gene Expression Omnibus. Accessed 1 Oct 2022.

Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Datasets. 2018. https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6149 . ArrayExpress. Accessed 24 July 2022.

Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Datasets. 2018. https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6653 . ArrayExpress. Accessed 24 July 2022.

van den Heuvel A, Mahfouz A, Kloet SL, Balog J, van Engelen BG, Tawil R, et al. Single-cell RNA sequencing in facioscapulohumeral muscular dystrophy disease etiology and development. Datasets. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122873 . Gene Expression Omnibus. Accessed 13 Aug 2022.

Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44(8):e71.

Review history

The review history is available as Additional file 3.

Peer review information

Veronique van den Berghe was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

This work is supported by the National Institutes of Health (R01CA280097 to X.L. and J.L., R01CA252878 to J.L.) and the DOD BCRP Breakthrough Award, Level 2 (W81XWH2110432 to J.L.).

Author information

Authors and affiliations.

Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, 46556, IN, USA

Dailin Gan & Jun Li

Department of Biological Sciences, Boler-Parseghian Center for Rare and Neglected Diseases, Harper Cancer Research Institute, Integrated Biomedical Sciences Graduate Program, University of Notre Dame, Notre Dame, 46556, IN, USA

Yini Zhu & Xin Lu

Tumor Microenvironment and Metastasis Program, Indiana University Melvin and Bren Simon Comprehensive Cancer Center, Indianapolis, 46202, IN, USA

Contributions

J.L. conceived and supervised the study. J.L. and D.G. proposed the methods. D.G. implemented the methods and analyzed the data. D.G. and J.L. drafted the paper. D.G., Y.Z., X.L., and J.L. interpreted the results and revised the paper.

Corresponding author

Correspondence to Jun Li .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1. Supplementary materials that include additional results and plots.

Additional file 2. A vignette of the SCIPAC package.

Additional file 3. Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Gan, D., Zhu, Y., Lu, X. et al. SCIPAC: quantitative estimation of cell-phenotype associations. Genome Biol 25 , 119 (2024). https://doi.org/10.1186/s13059-024-03263-1

Received : 30 January 2023

Accepted : 30 April 2024

Published : 13 May 2024

DOI : https://doi.org/10.1186/s13059-024-03263-1

  • Phenotype association
  • Single cell
  • RNA sequencing
  • Cancer research

Genome Biology

ISSN: 1474-760X

Title: Spatial Cognition: a Wave Hypothesis

Abstract: Animals build Bayesian 3D models of their surroundings to control their movements. There is strong selection pressure to make these models as precise as possible, given their sense data. A previous paper has described how a precise 3D model of space can be built by object tracking. This only works if 3D locations are stored with high spatial precision. Neural models of 3D spatial memory have large random errors, too large to support the tracking model. An alternative is described, in which neurons couple to a wave excitation in the brain, representing 3D space. This can give high spatial precision, fast response, and other benefits. Three lines of evidence support the wave hypothesis: (1) it has better precision and speed than neural spatial memory, and is good enough to support object tracking; (2) the central body of the insect brain, whose form is highly conserved across all insect species, is well suited to hold a wave; and (3) the thalamus, whose round shape is conserved across all mammal species, is well suited to hold a wave. These lines of evidence strongly support the wave hypothesis.

The Fermi Paradox and the Berserker Hypothesis: Exploring Cosmic Silence Through Science Fiction

In the realm of cosmic conundrums, the Fermi Paradox stands out: why, in a universe replete with billions of stars and planets, have we yet to find any signs of extraterrestrial intelligent life? The “berserker hypothesis,” a spine-chilling explanation rooted in science and popularized by science fiction, suggests a grim answer to this enduring mystery.

The concept’s moniker traces back to Fred Saberhagen’s “Berserker” series of novels, and it paints a picture of the cosmos where intelligent life forms are systematically eradicated by self-replicating probes, known as “berserkers.” These probes, initially intended to explore and report back, turn rogue and annihilate any signs of civilizations they encounter. The hypothesis emerges as a rather dark twist on the concept of von Neumann probes—machines capable of self-replication using local resources, which could theoretically colonize the galaxy rapidly.

Diving into the technicalities, the berserker hypothesis operates as a potential solution to the Hart-Tipler conjecture, which posits the lack of detectable probes as evidence that no intelligent life exists outside our solar system. Instead, this hypothesis flips the script: the absence of such probes doesn’t point to a lack of life but rather to the possibility that these probes have become cosmic predators, leaving a trail of silence in their wake.

Astronomer David Brin’s chilling summation underscores the potential severity of the hypothesis: “It need only happen once for the results of this scenario to become the equilibrium conditions in the Galaxy…because all were killed shortly after discovering radio.” If these berserker probes exist and are as efficient as theorized, then humanity’s attempts at communication with extraterrestrial beings could be akin to lighting a beacon for our own destruction.

Despite its foundation in speculative thought, the hypothesis has not escaped scientific evaluation. Anders Sandberg and Stuart Armstrong from the Future of Humanity Institute speculated that, given the vastness of the universe and even a slow replication rate, these berserker probes, if they existed, would likely have already found and destroyed us. It’s both a chilling and somewhat reassuring analysis that treads the line between fiction and potential reality.
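The scale behind that argument can be made concrete with a back-of-envelope calculation. The sketch below is not taken from Sandberg and Armstrong's analysis; it assumes, purely for illustration, a galaxy of roughly 10^11 stars, a probe population that doubles once every million years, and a galactic age on the order of 10^10 years. All three constants are hypothetical round numbers.

```python
import math

# Illustrative assumptions, not figures from the article or any cited analysis.
STARS_IN_GALAXY = 1e11        # rough order of magnitude for the Milky Way
DOUBLING_TIME_YEARS = 1e6     # hypothetical, deliberately slow replication rate
GALAXY_AGE_YEARS = 1e10       # rough order of magnitude


def doublings_to_cover(targets: float) -> int:
    """Number of doublings for one probe to grow to at least `targets` probes."""
    return math.ceil(math.log2(targets))


doublings = doublings_to_cover(STARS_IN_GALAXY)
years_needed = doublings * DOUBLING_TIME_YEARS
fraction_of_age = years_needed / GALAXY_AGE_YEARS

print(f"doublings needed: {doublings}")
print(f"years needed: {years_needed:.1e}")
print(f"fraction of galactic age: {fraction_of_age:.4f}")
```

Under these assumptions, a few dozen doublings are enough to match the number of stars, which takes a small fraction of the galaxy's age. The model ignores travel time between stars, so it understates the real timescale; it only illustrates why exponential replication makes even slow rates add up quickly.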

Within the eclectic array of solutions to the Fermi Paradox, the berserker hypothesis stands out for its seamless blend of science fiction inspiration and scientific discourse. It connects with other notions such as the Great Filter, which suggests that life elsewhere in the universe is being systematically snuffed out before it can reach a space-faring stage, and the Dark Forest hypothesis, which posits that civilizations remain silent to avoid detection by such cosmic hunters.

Relevant articles:

– TIL about the berserker hypothesis, a proposed solution to the Fermi paradox stating the reason why we haven’t found other sentient species yet is because those species have been wiped out by self-replicating “berserker” probes.

– The Berserker Hypothesis: The Darkest Explanation Of The Fermi Paradox

– Beyond “Fermi’s Paradox” VI: What is the Berserker Hypothesis?

Biology LibreTexts

1.3: Scientific Theories

As you view Know the Difference (Between Hypothesis and Theory), focus on these concepts: the controversy surrounding the words “hypothesis” and “theory”; the scientific use of the words “hypothesis” and “theory”; the criteria for a “hypothesis”; the National Academy of Sciences definition of “theory”; and the meaning of the statement “theories are the bedrock of our understanding of nature.”

The Theory of Evolution

The theory of evolution by natural selection is a scientific theory. Evolution is a change in the characteristics of living things over time. Evolution occurs by a process called natural selection. In natural selection, some living things produce more offspring than others, so they pass more genes to the next generation than others do. Over many generations, this can lead to major changes in the characteristics of living things. The theory of evolution by natural selection explains how living things are changing today and how modern living things have descended from ancient life forms that no longer exist on Earth. No evidence has been identified that proves this theory is incorrect. More on the theory of evolution will be presented in additional concepts.
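The effect of unequal offspring numbers over many generations can be illustrated with a small calculation. The following is a minimal sketch, assuming a very simple model with two heritable types and fixed relative offspring numbers; the 10% reproductive advantage, the 1% starting frequency, and the function names are hypothetical values chosen for illustration, not figures from this text.

```python
def next_frequency(p: float, w_variant: float, w_other: float) -> float:
    """One generation of selection: weight each type by its relative offspring count."""
    mean_fitness = p * w_variant + (1.0 - p) * w_other
    return p * w_variant / mean_fitness


def simulate(p0: float, w_variant: float, w_other: float, generations: int) -> list[float]:
    """Return the variant's frequency in every generation, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], w_variant, w_other))
    return freqs


# Hypothetical scenario: a rare variant (1%) that leaves 10% more offspring.
freqs = simulate(p0=0.01, w_variant=1.1, w_other=1.0, generations=100)
for gen in (0, 25, 50, 75, 100):
    print(f"generation {gen:3d}: variant frequency = {freqs[gen]:.3f}")
```

In this toy model, a small per-generation advantage is enough for an initially rare type to become the overwhelming majority within about a hundred generations, which is the sense in which natural selection "can lead to major changes" over time.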

The Cell Theory

The cell theory is another important scientific theory of biology. According to the cell theory, the cell is the smallest unit of structure and function of all living organisms, all living organisms are made up of at least one cell, and living cells always come from other living cells. Once again, no evidence has been identified that proves this theory is incorrect. More on the cell theory will be presented in additional concepts.

The Germ Theory

The germ theory of disease, also called the pathogenic theory of medicine, is a scientific theory that proposes that microorganisms are the cause of many diseases. Like the other scientific theories, lots of evidence has been identified that supports this theory, and no evidence has been identified that proves the theory is incorrect.

  • With repeated testing, some hypotheses may eventually become scientific theories. A scientific theory is a broad explanation for events that is widely accepted as true.
  • Evolution is a change in species over time. Evolution occurs by natural selection.
  • The cell theory states that all living things are made up of cells, and living cells always come from other living cells.
  • The germ theory proposes that microorganisms are the cause of many diseases.

Explore More

Use these resources to answer the questions that follow.

Explore More I

  • Darwinian Evolution - Science and Theory at Non-Majors Biology: http://www.hippocampus.org/Biology .
  • How is the word “theory” used in common language?
  • How is the word “theory” used in science?
  • Provide a detailed definition for a “scientific theory”.

Explore More II

  • Concepts and Methods in Biology - Theories and Laws at Non-Majors Biology: http://www.hippocampus.org/Biology .
  • What is a scientific law?
  • What is a scientific theory?
  • Give two examples of scientific theories.
  • Can a scientific theory become a law? Why or why not?
  • Contrast how the term theory is used in science and in everyday language.
  • Explain how a hypothesis could become a theory.
  • Describe the evidence that proves the cell theory is incorrect.

COMMENTS

  1. Hypothesis

    Biology definition: A hypothesis is a supposition or tentative explanation for (a group of) phenomena, (a set of) facts, or a scientific inquiry that may be tested, verified or answered by further investigation or methodological experiment. It is like a scientific guess.

  2. The scientific method (article)

    The scientific method. At the core of biology and other sciences lies a problem-solving approach called the scientific method. The scientific method has five basic steps, plus one feedback step: Make an observation. Ask a question. Form a hypothesis, or testable explanation. Make a prediction based on the hypothesis.

  3. Scientific hypothesis

    scientific hypothesis, an idea that proposes a tentative explanation about a phenomenon or a narrow set of phenomena observed in the natural world. The two primary features of a scientific hypothesis are falsifiability and testability, which are reflected in an "If…then" statement summarizing the idea and in the ability to be supported or refuted through observation and experimentation.

  4. Biology and the scientific method review

    Meaning. Biology. The study of living things. Observation. Noticing and describing events in an orderly way. Hypothesis. A scientific explanation that can be tested through experimentation or observation. Controlled experiment. An experiment in which only one variable is changed.

  5. 4.14: Experiments and Hypotheses

    Forming a Hypothesis. When conducting scientific experiments, researchers develop hypotheses to guide experimental design. A hypothesis is a suggested explanation that is both testable and falsifiable. You must be able to test your hypothesis, and it must be possible to prove your hypothesis true or false.

  6. 1.1 The Science of Biology

    This is a very broad definition because the scope of biology is vast. ... Recall that a hypothesis is a suggested explanation that one can test. To solve a problem, one can propose several hypotheses. For example, one hypothesis might be, "The classroom is warm because no one turned on the air conditioning." However, there could be other ...

  7. What is a scientific hypothesis?

    Bibliography. A scientific hypothesis is a tentative, testable explanation for a phenomenon in the natural world. It's the initial building block in the scientific method. Many describe it as an ...

  8. 1.3: The Science of Biology

    Proposing a Hypothesis. Recall that a hypothesis is an educated guess that can be tested. Hypotheses often also include an explanation for the educated guess. To solve one problem, several hypotheses may be proposed. For example, the student might believe that his friend is tall because he drinks a lot of milk.

  9. 1.2 The Process of Science

    A hypothesis is a suggested explanation for an event, which can be tested. Hypotheses, or tentative explanations, are generally produced within the context of a scientific theory. A generally accepted scientific theory is a thoroughly tested and confirmed explanation for a set of observations or phenomena.

  10. Controlled experiments (article)

    When possible, scientists test their hypotheses using controlled experiments. A controlled experiment is a scientific test done under controlled conditions, meaning that just one (or a few) factors are changed at a time, while all others are kept constant. We'll look closely at controlled experiments in the next section.

  11. What Is a Hypothesis? The Scientific Method

    A hypothesis (plural hypotheses) is a proposed explanation for an observation. The definition depends on the subject. In science, a hypothesis is part of the scientific method. It is a prediction or explanation that is tested by an experiment. Observations and experiments may disprove a scientific hypothesis, but can never entirely prove one.

  12. Experiments and Hypotheses

    When conducting scientific experiments, researchers develop hypotheses to guide experimental design. A hypothesis is a suggested explanation that is both testable and falsifiable. You must be able to test your hypothesis through observations and research, and it must be possible to prove your hypothesis false. For example, Michael observes that ...

  13. How to Write a Strong Hypothesis

    Developing a hypothesis (with example) Step 1. Ask a question. Writing a hypothesis begins with a research question that you want to answer. The question should be focused, specific, and researchable within the constraints of your project. Example: Research question.

  14. Hypothesis: Definition, Examples, and Types

    A hypothesis is a tentative statement about the relationship between two or more variables. It is a specific, testable prediction about what you expect to happen in a study. It is a preliminary answer to your question that helps guide the research process. Consider a study designed to examine the relationship between sleep deprivation and test ...

  15. Hypothesis Definition & Meaning

    hypothesis: [noun] an assumption or concession made for the sake of argument. an interpretation of a practical situation or condition taken as the ground for action.

  16. 1.1.2: The Science of Biology

    In simple terms, biology is the study of life. This is a very broad definition because the scope of biology is vast. Biologists may study anything from the microscopic or submicroscopic view of a cell to ecosystems and the whole living planet (Figure 1.2). Listening to the daily news, you will quickly realize how many aspects of biology we ...

  17. Biology

    biology, study of living things and their vital processes. The field deals with all the physicochemical aspects of life. The modern tendency toward cross-disciplinary research and the unification of scientific knowledge and investigation from different fields has resulted in significant overlap of the field of biology with other scientific ...

  18. Null hypothesis

    Biology definition: A null hypothesis is an assumption or proposition where an observed difference between two samples of a statistical population is purely accidental and not due to systematic causes. It is the hypothesis to be investigated through statistical hypothesis testing so that when refuted indicates that the alternative hypothesis is true. Thus, a null hypothesis is a hypothesis ...

  19. Biology Unit 1 hypothesis Flashcards

    Hypothesis. is a scientific explanation for a set of observations that can be tested in ways that support or reject it. Controlled experiment. when only one variable is changed and all of the other variables stay unchanged or controlled. Independent variable. is the one that deliberately changed. Dependent variable.

  20. The steps of the scientific method are: observation, hypothesis

    Observation: Beekeepers notice a decline in bee populations coinciding with the increased use of a new pesticide in their area. They observe fewer bees visiting flowers, reduced honey production, and weaker hive activity. Hypothesis: Based on the observation, the hypothesis is formulated: "The new pesticide is adversely affecting bee populations." This hypothesis is a tentative explanation for ...

  21. Meta-analyses reveal support for the Social Intelligence Hypothesis

    The Social Intelligence Hypothesis (SIH) is one of the leading explanations for the evolution of cognition. Since its inception a vast body of literature investigating the predictions of the SIH has accumulated, using a variety of methodologies and species. However, the generalisability of the hypothesis remains unclear. To gain an understanding of the robustness of the SIH as an explanation ...

  22. SCIPAC: quantitative estimation of cell-phenotype associations

    SCIPAC enables quantitative estimation of the strength of association between each cell in a scRNA-seq data and a phenotype, with the help of bulk RNA-seq data with phenotype information. Moreover, SCIPAC also enables the estimation of the statistical significance of the association. That is, it gives a p-value for the association between each ...

  23. 1.2: The Science of Biology

    This is a very broad definition because the scope of biology is vast. Biologists may study anything from the microscopic or submicroscopic view of a cell to ecosystems and the whole living planet (Figure 1.2.1). Listening to the daily news, you will quickly realize how many aspects of biology are discussed every day.

  24. 1.1: The Science of Biology

    A second hypothesis might be, "The classroom is warm because there is a power failure, and so the air conditioning doesn't work." Once a hypothesis has been selected, the student can make a prediction. A prediction is similar to a hypothesis but it typically has the format "If . . . then . . . ."

  25. [2405.10112] Spatial Cognition: a Wave Hypothesis

    Quantitative Biology > Neurons and Cognition. arXiv:2405.10112 (q-bio) [Submitted on 16 May 2024] Title: Spatial ... This can give high spatial precision, fast response, and other benefits. Three lines of evidence support the wave hypothesis: (1) it has better precision and speed than neural spatial memory, and is good enough to support object ...

  26. A new 'rule of biology' may have come to light, expanding insight ...

    A rule of biology, sometimes called a biological law, describes a recognized pattern or truism among living organisms. Allen's rule, for example, states that among warm-blooded animals, those ...

  27. 1.2: The Science of Biology

    A hypothesis is a statement/prediction that can be tested by experimentation. A theory is an explanation for a set of observations or phenomena that is supported by extensive research and that can be used as the basis for further research. Inductive reasoning draws on observations to infer logical conclusions based on the evidence.

  28. The Fermi Paradox and the Berserker Hypothesis: Exploring Cosmic ...

    The "berserker hypothesis," a spine-chilling explanation rooted in science and popularized by science fiction, suggests a grim answer to this enduring mystery. The concept's moniker traces ...

  29. 1.3: Scientific Theories

    Scientific Theories. With repeated testing, some hypotheses may eventually become scientific theories. Keep in mind, a hypothesis is a possible answer to a scientific question. A scientific theory is a broad explanation for events that is widely accepted as true. To become a theory, a hypothesis must be tested over and over again, and it must be supported by a great deal of evidence.