Hypothesis Testing with the Binomial Distribution


Hypothesis Testing

To carry out a hypothesis test with the binomial distribution, we calculate the probability, $p$, of obtaining the observed result or any result more extreme, assuming the null hypothesis is true. We compare this to the level of significance $\alpha$. If $p>\alpha$ then we do not reject the null hypothesis. If $p<\alpha$ we reject the null hypothesis and accept the alternative hypothesis.

Worked Example

A coin is tossed twenty times, landing on heads six times. Perform a hypothesis test at a $5$% significance level to see if the coin is biased.

First, we need to write down the null and alternative hypotheses. In this case, letting $p$ denote the probability that the coin lands on heads,

$H_0\colon p = 0.5 \quad$ versus $\quad H_1\colon p < 0.5$ (the coin is biased in favour of tails).

The important thing to note here is that we only need a one-tailed test as the alternative hypothesis says “in favour of tails”. A two-tailed test would be the result of an alternative hypothesis saying “The coin is biased”.

We need to calculate more than just the probability that it lands on heads exactly $6$ times. If it had landed on heads fewer than $6$ times, that would be even more evidence that the coin is biased in favour of tails. Consequently we need to add up the probabilities of it landing on heads $0$ times, $1$ time, $2$ times, $\ldots$ all the way up to $6$ times. Although a direct calculation is possible, it is much quicker to use the cumulative binomial distribution table. This gives $\mathrm{P}[X\leq 6] = 0.058$.

We are asked to perform the test at a $5$% significance level. This means that if there is less than a $5$% chance of getting $6$ or fewer heads, the result is so unlikely that we have sufficient evidence to claim the coin is biased in favour of tails. Here our $p$-value is $0.058>0.05$, so we do not reject the null hypothesis. We don't have sufficient evidence to claim the coin is biased.

But what if the coin had landed on heads just $5$ times? Again we read from the cumulative binomial distribution tables, which give $\mathrm{P}[X\leq 5] = 0.021 < 0.05$, so we would have rejected the null hypothesis and accepted the alternative hypothesis. The point at which we switch from not rejecting the null hypothesis to rejecting it is therefore $5$ heads, which means that $5$ is the critical value.
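These probabilities can also be checked in software rather than read from tables. The following is a minimal R sketch (not part of the original example) that reproduces the cumulative probabilities and the critical value for this one-tailed test at the 5% level:

    # X ~ Binomial(n = 20, p = 0.5) under the null hypothesis
    n <- 20; p0 <- 0.5; alpha <- 0.05

    pbinom(6, size = n, prob = p0)   # P(X <= 6) = 0.0577 -> do not reject H0
    pbinom(5, size = n, prob = p0)   # P(X <= 5) = 0.0207 -> reject H0

    # Critical value: the largest k with P(X <= k) < alpha
    k <- 0:n
    max(k[pbinom(k, n, p0) < alpha])  # gives 5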

How to Do Hypothesis Testing with Binomial Distribution

A hypothesis test has the objective of testing different results against each other. You use it to check a new result against something you already believe is true. In a hypothesis test, you check whether the new alternative hypothesis $H_A$ should challenge and replace the existing null hypothesis $H_0$.

Hypothesis tests are either one-sided or two-sided. In a one-sided test, the alternative hypothesis is left-sided with $p < p_0$ or right-sided with $p > p_0$. In a two-sided test, the alternative hypothesis is $p \neq p_0$. In all three cases, $p_0$ is the pre-existing probability you are comparing against, and $p$ is the probability you are trying to find out about.

Note! In hypothesis testing, you calculate how likely the observed result is under the null hypothesis in order to decide between the null and the alternative hypothesis.


For example, a high observed value of $p$ gives you reason to believe that the alternative hypothesis $H_A\colon p > p_0$ is reasonable.

There is a drug on the market that you know cures $85$% of all patients. A company has come up with a new drug they believe is better than what is already on the market. This new drug has cured 92 of 103 patients in tests. Determine whether the new drug really is better than the old one.

This is a classic case of hypothesis testing with the binomial distribution. Follow the recipe above to answer the task, choosing a $5$% significance level since this is not medication for a serious illness.

The alternative hypothesis in this case is that the new drug is better. The reason for this is that you only need to decide whether to approve the drug for sale, and for that the new drug must be better. With $p$ as the cure probability of the new drug, the hypotheses are

$H_0\colon p = 0.85 \quad$ versus $\quad H_A\colon p > 0.85.$

Under $H_0$, the number of cured patients $X$ follows a binomial distribution with $n = 103$ and $p = 0.85$, and the $p$-value is $\mathrm{P}[X \geq 92] \approx 0.136$. This result indicates that there is a $13.6$% chance that $92$ or more patients would be cured even with the old medicine,

so $H_0$ cannot be rejected. The new drug does not enter the market.

If the $p$-value had been less than the level of significance, that would have meant that the new drug, represented by the alternative hypothesis, is better, and you could say so with statistical significance.
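As a cross-check (not part of the original example), the same test can be run in R with an exact binomial test; the exact p-value should be close to the 13.6% quoted above, though it need not match it to every decimal:

    # H0: p = 0.85, HA: p > 0.85, with 92 cures out of 103 patients
    binom.test(x = 92, n = 103, p = 0.85, alternative = "greater")

    # Equivalently, the one-sided tail probability under H0:
    1 - pbinom(91, size = 103, prob = 0.85)   # P(X >= 92)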


Binomial test

The binomial test is a hypothesis test used when there is a categorical variable with exactly two categories, e.g., gender with the categories "male" and "female". The binomial test can then check whether the frequency distribution of the variable corresponds to an expected distribution, e.g.:

  • Men and women are equally represented.
  • The proportion of women is 54%.

The first example is the special case in which you test whether the distribution across the two categories is purely random; in that case, the probability of occurrence is set to 50%.

The binomial test can therefore be used to test whether or not the frequency distribution of a sample is the same as that of the population.

The binomial test checks whether the frequency distribution of a variable with two values/categories in the sample corresponds to the distribution in the population.

Hypotheses in binomial test

In the two-tailed (non-directional) case, the hypotheses of the binomial test are:

  • Null hypothesis: The frequency distribution of the sample corresponds to that of the population.
  • Alternative hypothesis: The frequency distribution of the sample does not correspond to that of the population.

Thus, the non-directional hypothesis only tests whether there is a difference or not, but not in which direction this difference goes.

In the one-tailed (directional) case, the aim is to investigate whether the probability of occurrence of a category in the sample is greater or less than a given or true percentage.

In this case, one category is defined as the "success", and the test checks whether the true "probability of success" is smaller or larger than that observed in the sample.

The alternative hypothesis then results in:

  • Alternative hypothesis: True probability of success is smaller/larger than specified value

Binomial test calculation

To calculate a binomial test you need the sample size, the number of cases with the positive ("success") outcome, and the assumed probability of occurrence in the population.

The possible alternative hypotheses, each of which has its own p-value, are:

  • True probability of success is less than 0.35
  • True probability of success is not equal to 0.35
  • True probability of success is greater than 0.35

Binomial test example

A possible example for a binomial test would be the question of whether the gender ratio among marketing majors at University XY differs significantly from that of all business students at University XY (the population).

Listed below are the students majoring in marketing; women make up 55% of the total business degree program.

Marketing student Gender
1 female
2 male
3 female
4 female
5 female
6 male
7 female
8 male
9 female
10 female

Binomial test with DATAtab:

Calculate the example in the statistics calculator: simply paste the table above, including the header row, into the hypothesis test calculator.

DATAtab gives you the following result for this example data:


Interpretation of a Binomial Test

With an expected proportion of 55%, the p-value is 0.528. The p-value is therefore above the significance level of 5%, so the result is not significant and the null hypothesis cannot be rejected. In terms of content, this means that the gender ratio in the marketing specialization (the sample) does not differ significantly from that of all business administration students at University XY (the population).
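The same result can be reproduced outside DATAtab; for instance, a short R sketch of the exact two-sided binomial test for the 7 women out of 10 students in the table above:

    # 7 of the 10 marketing students are female; expected proportion 55%
    binom.test(x = 7, n = 10, p = 0.55, alternative = "two.sided")
    # two-sided p-value ~= 0.53, so H0 is not rejected at the 5% level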


Hypothesis Testing Using the Binomial Distribution


Many people involved in criminology and criminal justice research spend time making predictions about populations in the real world. These predictions tend to be based on a theoretical framework and are formally stated as hypotheses in order to answer a specific research question. Using inferential statistics (see Chap. 6 ), we can test to what extent our data support these hypotheses and provide empirical evidence to support (or reject) our expectations in R . This chapter uses a simulated dataset of results from a crime reduction intervention for at-risk youth to explore how the binomial distribution allows us to generalize from a sample of 100 participants in a study to the wider population.




Key terms from the chapter glossary:

  • Binomial distribution: The probability or sampling distribution for an event that has only two possible outcomes.

  • Directional hypothesis: A research hypothesis that indicates a specific type of outcome by specifying the nature of the relationship that is expected.

  • Representativeness: The extent to which a study sample is reflective of the population from which it is drawn. A study is said to have high external validity when the sample used is representative of the population to which inferences are made.

  • Non-directional hypothesis: A research hypothesis that does not indicate a specific type of outcome, stating only that there is a relationship or a difference.

  • Non-parametric tests: Tests that do not make an assumption about the distribution of the population, also called distribution-free tests.

  • Null hypothesis: A statement that reduces the research question to a simple assertion to be tested by the researcher. The null hypothesis normally suggests that there is no relationship or no difference.

  • Parametric tests: Tests that make an assumption about the shape of the population distribution.

  • Type I error: Also known as alpha error and false-positive. The mistake made when a researcher rejects the null hypothesis on the basis of a sample statistic (i.e., claiming that there is a relationship) when in fact the null hypothesis is true (i.e., there is actually no such relationship in the population).


About this chapter

Wooditch, A., Johnson, N.J., Solymosi, R., Medina Ariza, J., Langton, S. (2021). Hypothesis Testing Using the Binomial Distribution. In: A Beginner’s Guide to Statistics for Criminology and Criminal Justice Using R. Springer, Cham. https://doi.org/10.1007/978-3-030-50625-4_8


4.3 Binomial Distribution

There are three characteristics of a binomial experiment.

  • There are a fixed number of trials. Think of trials as repetitions of an experiment. The letter n denotes the number of trials.
  • There are only two possible outcomes, called "success" and "failure," for each trial. The letter p denotes the probability of a success on one trial, and q denotes the probability of a failure on one trial. p + q = 1.
  • The n trials are independent and are repeated using identical conditions. Because the n trials are independent, the outcome of one trial does not help in predicting the outcome of another trial. Another way of saying this is that for each individual trial, the probability, p , of a success and probability, q , of a failure remain the same. For example, randomly guessing at a true-false statistics question has only two outcomes. If a success is guessing correctly, then a failure is guessing incorrectly. Suppose Joe always guesses correctly on any statistics true-false question with probability p = 0.6. Then, q = 0.4. This means that for every true-false statistics question Joe answers, his probability of success ( p = 0.6) and his probability of failure ( q = 0.4) remain the same.

The outcomes of a binomial experiment fit a binomial probability distribution . The random variable X = the number of successes obtained in the n independent trials.

The mean, μ, and variance, σ², for the binomial probability distribution are μ = np and σ² = npq. The standard deviation, σ, is then σ = √(npq).

Any experiment that has characteristics two and three and where n = 1 is called a Bernoulli Trial (named after Jacob Bernoulli who, in the late 1600s, studied them extensively). A binomial experiment takes place when the number of successes is counted in one or more Bernoulli Trials.

Example 4.9

At ABC College, the withdrawal rate from an elementary physics course is 30% for any given term. This implies that, for any given term, 70% of the students stay in the class for the entire term. A "success" could be defined as an individual who withdrew. The random variable X = the number of students who withdraw from the randomly selected elementary physics class.

Try It 4.9

The state health board is concerned about the amount of fruit available in school lunches. Forty-eight percent of schools in the state offer fruit in their lunches every day. This implies that 52% do not. What would a "success" be in this case?

Example 4.10

Suppose you play a game that you can only either win or lose. The probability that you win any game is 55%, and the probability that you lose is 45%. Each game you play is independent. If you play the game 20 times, write the function that describes the probability that you win 15 of the 20 times. Here, if you define X as the number of wins, then X takes on the values 0, 1, 2, 3, ..., 20. The probability of a success is p = 0.55. The probability of a failure is q = 0.45. The number of trials is n = 20. The probability question can be stated mathematically as P ( x = 15).

Try It 4.10

A trainer is teaching a rescued dolphin to catch live fish before returning it to the wild. The probability that the dolphin successfully catches a fish is 35%, and the probability that the dolphin does not successfully catch the fish is 65%. Out of 20 attempts, you want to find the probability that the dolphin succeeds 12 times. State the probability question mathematically.

Example 4.11

A coin has been altered so that the probability of heads is 0.25 rather than 0.5, and it is flipped 5 times. Each flip is independent. What is the probability of getting more than 3 heads? Let X = the number of heads in 5 flips of the altered coin. X takes on the values 0, 1, 2, 3, 4, 5. Since the coin is altered, p = 0.25 and q = 0.75. The number of trials is n = 5. State the probability question mathematically.

First develop the probability density function fully and graph it. With the fully developed probability density function we can simply read off the solution to the question: P(x > 3) = P(x = 4) + P(x = 5) = 0.0146 + 0.0010 = 0.0156. We add the two individual probabilities because of the addition rule from Probability Topics.

Figure 4.2 also allows us to see the link between the probability density function and probability as area. We also see in Figure 4.2 the skew of the binomial distribution when p is not equal to 0.5: the distribution is skewed right because p = 0.25, giving μ = np = 1.25.
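As an aside (not part of the original text), the tail probability in Example 4.11 can be checked with R's binomial functions:

    # X ~ Binomial(n = 5, p = 0.25); probability of more than 3 heads
    dbinom(4, size = 5, prob = 0.25) + dbinom(5, size = 5, prob = 0.25)  # 0.0156
    1 - pbinom(3, size = 5, prob = 0.25)                                 # same value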

Try It 4.11

A fair, six-sided die is rolled ten times. Each roll is independent. You want to find the probability of rolling a one more than three times. State the probability question mathematically.

Example 4.12

Approximately 70% of statistics students do their homework in time for it to be collected and graded. Each student does homework independently. In a statistics class of 50 students, what is the probability that at least 40 will do their homework on time? Students are selected randomly.

a. This is a binomial problem because there is only a success or a __________, there are a fixed number of trials, and the probability of a success is 0.70 for each trial.

b. If we are interested in the number of students who do their homework on time, then how do we define X ?

c. What values does x take on?

d. What is a "failure," in words?

e. If p + q = 1, then what is q ?

f. The words "at least" translate as what kind of inequality for the probability question P ( x ____ 40).

b. X = the number of statistics students who do their homework on time

c. 0, 1, 2, …, 50

d. Failure is defined as a student who does not complete their homework on time.

The probability of a success is p = 0.70. The number of trials is n = 50.

e. q = 0.30

f. greater than or equal to (≥) The probability question is P ( x ≥ 40).

Try It 4.12

Sixty-five percent of people pass the state driver’s exam on the first try. A group of 50 individuals who have taken the driver’s exam is randomly selected. Give two reasons why this is a binomial problem.

Notation for the Binomial: B = Binomial Probability Distribution Function

X ~ B ( n , p )

Read this as " X is a random variable with a binomial distribution." The parameters are n and p ; n = number of trials, p = probability of a success on each trial.

Example 4.13

It has been stated that about 41% of adult workers have a high school diploma but do not pursue any further education. If 20 adult workers are randomly selected, find the probability that at most 12 of them have a high school diploma but do not pursue any further education. How many adult workers do you expect to have a high school diploma but do not pursue any further education?

Let X = the number of workers who have a high school diploma but do not pursue any further education.

X takes on the values 0, 1, 2, ..., 20 where n = 20, p = 0.41, and q = 1 – 0.41 = 0.59. X ~ B (20, 0.41)

Find P ( x ≤ 12). P ( x ≤ 12) = 0.9738. (calculator or computer)

Using the TI-83, 83+, 84, 84+ Calculator

Go into 2nd DISTR. The syntax for the instructions is as follows:

To calculate P(x = value): binompdf(n, p, number). If "number" is left out, the result is the binomial probability table. To calculate P(x ≤ value): binomcdf(n, p, number). If "number" is left out, the result is the cumulative binomial probability table. For this problem: after you are in 2nd DISTR, arrow down to binomcdf. Press ENTER. Enter 20,0.41,12). The result is P(x ≤ 12) = 0.9738.

If you want to find P ( x = 12), use the pdf (binompdf). If you want to find P ( x > 12), use 1 - binomcdf(20,0.41,12).

The probability that at most 12 workers have a high school diploma but do not pursue any further education is 0.9738.

The graph of X ~ B (20, 0.41) is as follows:

The y -axis contains the probability of x , where X = the number of workers who have only a high school diploma.

The number of adult workers that you expect to have a high school diploma but not pursue any further education is the mean, μ = np = (20)(0.41) = 8.2.

The formula for the variance is σ² = npq. The standard deviation is σ = √(npq) = √((20)(0.41)(0.59)) ≈ 2.20.
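For readers not using a TI calculator, the same quantities can be obtained in R; this is a supplementary sketch rather than part of the OpenStax text, with pbinom and dbinom playing the roles of binomcdf and binompdf:

    # X ~ Binomial(n = 20, p = 0.41)
    pbinom(12, size = 20, prob = 0.41)       # P(x <= 12) = 0.9738
    dbinom(12, size = 20, prob = 0.41)       # P(x = 12)
    1 - pbinom(12, size = 20, prob = 0.41)   # P(x > 12)

    n <- 20; p <- 0.41; q <- 1 - p
    n * p               # mean = 8.2
    sqrt(n * p * q)     # standard deviation ~= 2.20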

Try It 4.13

About 32% of students participate in a community volunteer program outside of school. If 30 students are selected at random, find the probability that at most 14 of them participate in a community volunteer program outside of school. Use the TI-83+ or TI-84 calculator to find the answer.

Example 4.14

In the 2013 Jerry’s Artarama art supplies catalog, there are 560 pages. Eight of the pages feature signature artists. Suppose we randomly sample 100 pages. Let X = the number of pages that feature signature artists.

Find:

  • What values does x take on?
  • the probability that two pages feature signature artists
  • the probability that at most six pages feature signature artists
  • the probability that more than three pages feature signature artists
  • Using the formulas, calculate the (i) mean and (ii) standard deviation.

Solution:

  • x = 0, 1, 2, 3, 4, 5, 6, 7, 8
  • P(x = 2) = binompdf(100, 8/560, 2) = 0.2466
  • P(x ≤ 6) = binomcdf(100, 8/560, 6) = 0.9994
  • P(x > 3) = 1 – P(x ≤ 3) = 1 – binomcdf(100, 8/560, 3) = 1 – 0.9443 = 0.0557
  • Mean = np = (100)(8/560) = 800/560 ≈ 1.4286
  • Standard deviation = √(npq) = √((100)(8/560)(552/560)) ≈ 1.1867

Try It 4.14

According to a Gallup poll, 60% of American adults prefer saving over spending. Let X = the number of American adults out of a random sample of 50 who prefer saving to spending.

  • What is the probability distribution for X ?
  • the probability that 25 adults in the sample prefer saving over spending
  • the probability that at most 20 adults prefer saving
  • the probability that more than 30 adults prefer saving
  • Using the formulas, calculate the (i) mean and (ii) standard deviation of X .

Example 4.15

The lifetime risk of developing cancer is about one in 67 (1.5%). Suppose we randomly sample 200 people. Let X = the number of people who will develop cancer.

  • Use your calculator to find the probability that at most eight people develop cancer.
  • Is it more likely that five or six people will develop cancer? Justify your answer numerically.

Solution:

  • X ~ B(200, 0.015)
  • Mean = np = (200)(0.015) = 3; standard deviation = √(npq) = √((200)(0.015)(0.985)) ≈ 1.719
  • P(x ≤ 8) = 0.9965
  • The probability that five people develop cancer is 0.1011. The probability that six people develop cancer is 0.0500, so five cases is more likely.
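The "justify numerically" part of Example 4.15 can likewise be verified in R (a supplementary sketch, not from the original text):

    # X ~ Binomial(n = 200, p = 0.015)
    pbinom(8, size = 200, prob = 0.015)   # P(x <= 8) ~= 0.9965
    dbinom(5, size = 200, prob = 0.015)   # ~= 0.1011
    dbinom(6, size = 200, prob = 0.015)   # ~= 0.0500 -> five cases is more likely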

Try It 4.15

During a certain NBA season, a player for the Los Angeles Clippers had the highest field goal completion rate in the league. This player scored with 61.3% of his shots. Suppose you choose a random sample of 80 shots made by this player during the season. Let X = the number of shots that scored points.

  • Use your calculator to find the probability that this player scored with 60 of these shots.
  • Find the probability that this player scored with more than 50 of these shots.

Example 4.16

The following example illustrates a problem that is not binomial. It violates the condition of independence. ABC College has a student advisory committee made up of ten staff members and six students. The committee wishes to choose a chairperson and a recorder. What is the probability that the chairperson and recorder are both students? The names of all committee members are put into a box, and two names are drawn without replacement. The first name drawn determines the chairperson and the second name the recorder. There are two trials. However, the trials are not independent because the outcome of the first trial affects the outcome of the second trial. The probability of a student on the first draw is 6/16. The probability of a student on the second draw is 5/15 when the first draw selects a student, and 6/15 when the first draw selects a staff member. The probability of drawing a student's name changes for each of the trials and, therefore, violates the condition of independence.

Try It 4.16

A lacrosse team is selecting a captain. The names of all the seniors are put into a hat, and the first three that are drawn will be the captains. The names are not replaced once they are drawn (one person cannot be two captains). You want to see if the captains all play the same position. State whether this is binomial or not and state why.

Textbook content in this section is from OpenStax, Introductory Statistics 2e by Barbara Illowsky and Susan Dean, licensed under a Creative Commons Attribution License. Access for free at https://openstax.org/books/introductory-statistics-2e/pages/4-3-binomial-distribution


Binomial Hypothesis Test

When calculating probabilities using the binomial distribution, we can calculate these probabilities for an individual value (\(P(x = a)\)) or a cumulative value \(P(x<a), \space P(x\leq a), \space P(x\geq a)\).



In hypothesis testing, we are testing whether these calculated probabilities lead us to accept or reject a hypothesis.

We will be focusing on regions of binomial distribution ; therefore, we are looking at cumulative values.

Types of hypotheses

There are two main types of hypotheses:

The null hypothesis (\(H_0\)) is the hypothesis we assume to be true, and it assumes there is no difference between certain characteristics of a population. Any difference is purely down to chance.

The alternative hypothesis (\(H_1\)) is the hypothesis we try to prove using the data we have been given.

We can either:

Accept the null hypothesis OR

Reject the null hypothesis and accept the alternative hypothesis .

What are the steps to undertake a hypothesis test?

There are some key terms we need to understand before we look at the steps of hypothesis testing :

Critical value – this is the value where we go from accepting to rejecting the null hypothesis.

Critical region – the region where we are rejecting the null hypothesis.

Significance level – the level of accuracy we are measuring, given as a percentage. When we find the probability of the critical value, it should be as close to the significance level as possible.

One-tailed test – the alternative hypothesis states that the probability is greater than, or less than, the value in the null hypothesis.

Two-tailed test – the alternative hypothesis states only that the probability is not equal to the value in the null hypothesis.

So when we undertake a hypothesis test, generally speaking, these are the steps we use:

STEP 1 – Establish a null and alternative hypothesis, with relevant probabilities which will be stated in the question.

STEP 2 – Assign probabilities to our null and alternative hypotheses.

STEP 3 – Write out our binomial distribution .

STEP 4 – Calculate probabilities using binomial distribution. (Hint: To calculate our probabilities, we do not need to use our long-winded formula, but in the Casio Classwiz calculator, we can go to Menu -> Distribution -> Binomial CD and enter n as our number in the sample, p as our probability, and X as what we are trying to calculate).

STEP 5 – Check against significance level (whether this is greater than or less than the significance level).

STEP 6 – Accept or reject the null hypothesis.

Let's look at a few examples to explain what we are doing.

One-tailed test example

As stated above a one-tailed hypothesis test is one where the probability of the alternative hypothesis is either greater than or less than the null hypothesis.

A researcher is investigating whether people can identify the difference between Diet Coke and full-fat coke. He suspects that people are guessing. 20 people are selected at random, and 14 make a correct identification. He carries out a hypothesis test.

a) Briefly explain why the null hypothesis should be \(H_0: p = 0.5\), where \(p\) is the probability of making a correct identification.

b) Complete the test at the 5% significance level.

a) If people are simply guessing, each identification is equally likely to be right or wrong, so the probability of a correct identification is \(p = 0.5\); this is therefore the null hypothesis.

b)
\(\begin{align} H_0: p = 0.5 \\ H_1: p > 0.5\end{align}\)
\(X \sim B(20,0.5)\)
\(P(X \geq 14) = 1 - P(X \leq 13) = 0.05765914916 > 0.05\)
Since this probability is greater than the significance level, we do not reject \(H_0\): there is insufficient evidence that people can tell the difference.
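A quick R check of this one-tailed test (an illustrative sketch; the original explanation uses the calculator's Binomial CD function instead):

    # H0: p = 0.5, H1: p > 0.5, with 14 correct identifications out of 20
    1 - pbinom(13, size = 20, prob = 0.5)   # P(X >= 14) ~= 0.0577 > 0.05 -> do not reject H0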

Two-tailed test example

In a two-tailed test, the probability of our alternative hypothesis is just not equal to the probability of the null hypothesis.

A coffee shop provides free espresso refills. The probability that a randomly chosen customer uses these refills is stated to be 0.35. A random sample of 20 customers is chosen, and 9 of them have used the free refills.

Carry out a hypothesis test to a 5% significance level to see if the probability that a randomly chosen customer uses the refills is different to 0.35.

Let \(p\) be the proportion of people who use the free espresso refills. Then

\(\begin{align} H_0: p = 0.35 \\ H_1: p \ne 0.35 \end{align}\)

\(X \sim B(20,0.35)\)

Since 9 is above the expected value \(np = 7\), we look at the upper tail: \(P(X \geq 9) = 1 - P(X \leq 8) \approx 0.24\). Because this is a two-tailed test, we compare this probability with half the significance level, 0.025. Since \(0.24 > 0.025\), we do not reject \(H_0\).

So our key difference with two-tailed tests is that we compare the value to half the significance level rather than the actual significance level.

Critical values and critical regions

Remember from earlier that critical values are the values at which we move from accepting to rejecting the null hypothesis. A binomial distribution is a discrete distribution; therefore, the critical value has to be an integer.

The statistical tables in the formula booklet can help us find these; however, because the binomial distribution is discrete, we cannot usually find a value whose cumulative probability exactly equals the significance level.

Therefore the best way to find critical values and critical regions is to use a calculator, trying values until we find an acceptable one:

STEP 1 - Plug in some values until we reach a point where, for two consecutive values, one cumulative probability is above the significance level and the other is below.

STEP 2 - The value whose probability is below the significance level is the critical value.

STEP 3 - The critical region is the region at or beyond the critical value.

Let's look at this through a few examples.

Worked examples for critical values and critical regions

A mechanic is checking to see how many faulty bolts he has. He is told that 30% of the bolts are faulty. He has a sample of 25 bolts. He believes that less than 30% are faulty. Calculate the critical value and the critical region.

Let's use the above steps to help us out. With \(p\) the proportion of faulty bolts,

\(\begin{align} H_0: p = 0.3 \\ H_1: p < 0.3 \end{align}\)

\(X \sim B(25,0.3)\)

Trying values at the lower tail: \(P(X \leq 4) \approx 0.090\), which is above the 5% significance level, while \(P(X \leq 3) \approx 0.033\) is below it. So the critical value is 3 and the critical region is \(X \leq 3\).

A teacher believes that 40% of the students watch TV for two hours a day. A student disagrees and believes that students watch either more or less than two hours. In a sample of 30 students, calculate the critical regions.

As this is a two-tailed test, there are two critical regions, one on the lower end and one on the higher end. Also, remember the probability we are comparing with is that of half the significance level.

\(\begin{align} H_0: p = 0.4 \\ H_1: p \ne 0.4 \end{align}\)

\(X \sim B(30,0.4)\)

Lower tail (compare with 0.025):
\(\begin{align}&P(X \leq 5) = 0.005658796379 \\ &P(X \leq 6) = 0.01718302499 \\ &P(X \leq 7) = 0.0435241189 \end{align}\)
so the lower critical value is 6 and the lower critical region is \(X \leq 6\).

Upper tail (compare with 0.025): \(P(X \geq 17) \approx 0.048\) is above 0.025, while \(P(X \geq 18) \approx 0.021\) is below it, so the upper critical value is 18 and the upper critical region is \(X \geq 18\).
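The trial-and-error search can be automated; here is an illustrative R sketch (not part of the original explanation) for the two-tailed example above:

    # X ~ Binomial(n = 30, p = 0.4), two-tailed test at the 5% level
    n <- 30; p0 <- 0.4; alpha <- 0.05
    k <- 0:n

    # Lower critical region: largest k with P(X <= k) < alpha/2
    max(k[pbinom(k, n, p0) < alpha / 2])          # 6  -> region X <= 6

    # Upper critical region: smallest k with P(X >= k) < alpha/2
    min(k[1 - pbinom(k - 1, n, p0) < alpha / 2])  # 18 -> region X >= 18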

Binomial Hypothesis Test - Key takeaways

  • Hypothesis testing is the process of using binomial distribution to help us reject or accept null hypotheses.
  • A null hypothesis is what we assume to be happening.
  • If data disprove a null hypothesis, we must accept an alternative hypothesis.
  • We use binomial CD on the calculator to help us shortcut calculating the probability values.
  • The critical value is the value where we start rejecting the null hypothesis.
  • The critical region is the region either below or above the critical value.
  • Two-tailed tests contain two critical regions and critical values.


Frequently Asked Questions about Binomial Hypothesis Test

How many samples do you need for the binomial hypothesis test?

There isn't a fixed number of samples; whatever sample size you are given is used as \(n\) in \(X \sim B(n, p)\).

What is the null hypothesis for a binomial test?

The null hypothesis is what we assume is true before we conduct our hypothesis test.

What does a binomial test show?

It shows how likely the observed number of successes is, for a fixed number of trials with two possible outcomes, under a hypothesised probability of success.

What is the p value in the binomial test?

The p-value is the probability, calculated assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed.



Test if two binomial distributions are statistically different from each other

I have three groups of data, each with a binomial distribution (i.e. each group has elements that are either success or failure). I do not have a predicted probability of success, but instead can only rely on the success rate of each as an approximation for the true success rate. I have only found this question , which is close but does not seem to exactly deal with this scenario.

To simplify down the test, let's just say that I have 2 groups (3 can be extended from this base case).

Group Trials $n_i$ Successes $k_i$ Percentage $p_i$
Group 1 2455 1556 63.4%
Group 2 2730 1671 61.2%

I don't have an expected success probability, only what I know from the samples.

The success rate of each of the sample is fairly close. However my sample sizes are also quite large. If I check the CDF of the binomial distribution to see how different it is from the first (where I'm assuming the first is the null test) I get a very small probability that the second could be achieved.

1-BINOM.DIST(1556,2455,61.2%,TRUE) = 0.012

However, this does not take into account any variance of the first result, it just assumes the first result is the test probability.

Is there a better way to test if these two samples of data are actually statistically different from one another?

  • statistical-significance
  • binomial-distribution
  • bernoulli-distribution


  • $\begingroup$ Another question I came across that didn't really help much: stats.stackexchange.com/questions/82059/… $\endgroup$ –  Scott Commented Aug 28, 2014 at 17:15
  • $\begingroup$ Does this question help? stats.stackexchange.com/questions/25299/… $\endgroup$ –  Eric Commented Aug 28, 2014 at 17:32
  • 4 $\begingroup$ In R, you could use prop.test : prop.test(c(1556, 1671), c(2455, 2730)) . $\endgroup$ –  COOLSerdash Commented Aug 28, 2014 at 17:48
  • 3 $\begingroup$ Could be done as a two-sample (binomial) proportions test, or a 2x2 chi-square $\endgroup$ –  Glen_b Commented Aug 28, 2014 at 17:58
  • 4 $\begingroup$ Extending the base case from two groups to three could be problematic, because the tests will be interdependent: you will need a binomial version of ANOVA to handle that. $\endgroup$ –  whuber ♦ Commented Mar 21, 2016 at 1:42

6 Answers

The solution is a simple google away: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing

So you would like to test the following null hypothesis against the given alternative

$H_0:p_1=p_2$ versus $H_A:p_1\neq p_2$

So you just need to calculate the test statistic which is

$$z=\frac{\hat p_1-\hat p_2}{\sqrt{\hat p(1-\hat p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$$

where $\hat p=\frac{n_1\hat p_1+n_2\hat p_2}{n_1+n_2}$ .

So now, in your problem, $\hat p_1=.634$ , $\hat p_2=.612$ , $n_1=2455$ and $n_2=2730.$

Once you calculate the test statistic, you just need to calculate the corresponding critical region value to compare your test statistic too. For example, if you are testing this hypothesis at the 95% confidence level then you need to compare the absolute value of your test statistic against the critical region value of $z_{\alpha/2}=1.96$ (for this two tailed test).

Now, if $|z|>z_{\alpha/2}$ then you may reject the null hypothesis, otherwise you must fail to reject the null hypothesis.

Well this solution works for the case when you are comparing two groups, but it does not generalize to the case where you want to compare 3 groups.

You could however use a Chi Squared test to test if all three groups have equal proportions as suggested by @Eric in his comment above: " Does this question help? stats.stackexchange.com/questions/25299/ … – Eric"
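For readers who want to reproduce the arithmetic, here is a small R sketch of the pooled two-proportion z-test described in this answer, using the counts from the question (the numeric results are approximate):

    # Counts from the question
    k <- c(1556, 1671); n <- c(2455, 2730)
    p_hat  <- k / n                     # 0.634, 0.612
    p_pool <- sum(k) / sum(n)           # pooled estimate under H0: p1 = p2

    z <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * (1/n[1] + 1/n[2]))
    z                                   # ~= 1.61
    2 * pnorm(-abs(z))                  # two-sided p-value ~= 0.11 -> do not reject H0 at 5%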


  • 14 $\begingroup$ Thanks @Dan. As many times with Google, knowing the right term to search for is the first hurdle. I did take a look at the chi-squared test. The problem there, as with where I was first getting stuck, is that my expected calculation is based on the sample. I can't therefore provide an expected value, because my samples are used to determine that expected value. $\endgroup$ –  Scott Commented Aug 28, 2014 at 18:33
  • 2 $\begingroup$ A related explanation of using this test can be found here: itl.nist.gov/div898/handbook/prc/section3/prc33.htm (currently, the Wikipedia page does not provide a walk-through example). $\endgroup$ –  wwwilliam Commented Apr 18, 2017 at 3:29
  • 2 $\begingroup$ Can someone help me prove the standard deviation of the difference between the two binomial distributions, in other words prove that : $$\sqrt{\hat p (1-\hat p)(\frac{1}{n_1} + \frac{1}{n_2})} = \sqrt{\frac{\hat p_1 (1-\hat p_1)}{n_1} + \frac{\hat p_2 (1-\hat p_2)}{n_2}}$$ $\endgroup$ –  Tanguy Commented Aug 4, 2018 at 9:36
  • 2 $\begingroup$ answer to my question can be found here: stats.stackexchange.com/questions/361015/… $\endgroup$ –  Tanguy Commented Aug 8, 2018 at 10:22
  • 1 $\begingroup$ FYI, this test can be described as a "two-tailed two-proportion pooled z-test". The calculation is described in detail here: stattrek.com/hypothesis-test/difference-in-proportions.aspx $\endgroup$ –  user2739472 Commented Sep 13, 2020 at 15:01

In R the answer is calculated as:
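Presumably this refers to the two-sample proportion test suggested in the comments above, e.g.:

    # Two-sample test of equal proportions (Yates' continuity correction by default)
    prop.test(x = c(1556, 1671), n = c(2455, 2730))
    # gives a two-sided p-value of roughly 0.11, in line with the z-test above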


  • 20 $\begingroup$ Would you consider writing a little bit more than providing the R function? Naming the function does not help in understanding the problem and not everyone use R, so it would be no help for them. $\endgroup$ –  Tim Commented Dec 8, 2014 at 13:52
  • 2 $\begingroup$ This is the most exact statistical answer, and works for small numbers of observations (see the following: itl.nist.gov/div898/handbook/prc/section3/prc33.htm ). $\endgroup$ –  Andrew Mao Commented Mar 30, 2015 at 19:36
  • 1 $\begingroup$ Fishers exact test en.wikipedia.org/wiki/Fisher's_exact_test $\endgroup$ –  Keith Commented May 22, 2015 at 23:59

Just a summary:

Dan's and abaumann's answers suggest testing under a binomial model where the null hypothesis is a unified single binomial model with its mean estimated from the empirical data. Their answers are correct in theory, but they rely on a normal approximation, since the distribution of the test statistic does not exactly follow a normal distribution. Therefore, they are only correct for large sample sizes.

David's answer, on the other hand, points to a nonparametric approach using Fisher's exact test. The information is here: https://en.wikipedia.org/wiki/Fisher%27s_exact_test It can be applied to small sample sizes but is hard to calculate for big sample sizes.

Which test to use, and how much to trust your p-value, is a judgement call; there are always trade-offs in whichever test you choose.


  • 2 $\begingroup$ Are you trying to suggest that sample sizes in the thousands, with likely parameter values near $1/2$, are not large for this purpose? $\endgroup$ –  whuber ♦ Commented May 3, 2016 at 1:03
  • 1 $\begingroup$ For this case, I think you could use Dan's method but compute the p value in an exact way (binomial) and approxiamte way (normal Z>Φ−1(1−α/2)Z>Φ−1(1−α/2) and Z<Φ−1(α/2) ) to compare whether they are close enough. $\endgroup$ –  Code42 Commented May 6, 2016 at 4:39
  • 1 $\begingroup$ +1 Not because sample sizes weren't large enough, but because the answer fits the title question and answers it for any sample size - therefore being useful for readers arriving here guided by title text (or Google) and having an smaller sample size in mind. $\endgroup$ –  Pere Commented Jul 10, 2021 at 10:45

As suggested in other answers and comments, you can use an exact test that takes into account the origin of the data. Under the null hypothesis that the probability of success $\theta$ is the same in both experiments,

$$P \bigl(\begin{smallmatrix}k_1 & k_2 \\ n_1-k_1 & n_2-k_2\end{smallmatrix}\bigr) = \binom{n_1}{k_1}\binom{n_2}{k_2}\theta^{{k_1 + k_2}}\left({1-\theta}\right)^{{\left(n_1-k_1\right)+\left(n_2-k_2\right)}} $$ Notice that $P$ is not the p value, but the probability of this result under the null hypothesis. To calculate the p value, we need to consider all the cases whose $P$ is not higher than for our result. As noted in the question, the main problem is that we do not know the value of $\theta$ . This is why it is called a nuisance parameter.

Fisher's test solves this problem by making the experimental design conditional, meaning that the only contingency tables that are considered for the calculation are those where the sum of the number of successes is the same as in the example ( $1556 + 1671 = 3227$ ). This condition may not be in accordance with the experimental design, but it also means that we do not need to deal with the nuisance parameter.

There are also unconditional exact tests. For instance, Barnard's test estimates the most likely value of the nuisance parameter and directly uses the binomial distribution with that parameter. Obviously, the problem here is how to calculate $\theta$ , and there may be more than one answer for that. The original approach is to find the value of $\theta$ that maximizes $P$ . Here you can find an explanation of both tests.

I have recently uploaded a preprint that employs a similar strategy to that of Barnard's test. However, instead of estimating $\theta$ , this method (tentatively called m-test) considers every possible value of this parameter and integrates all the results. Using the same notation as in the question,

$P \bigl(\begin{smallmatrix}k_1 & k_2 \\ n_1-k_1 & n_2-k_2\end{smallmatrix}\bigr) = \binom{n_1}{k_1}\binom{n_2}{k_2}\int_{0}^{1}\theta^{{k_1 + k_2}}\left({1-\theta}\right)^{{\left(n_1-k_1\right)+\left(n_2-k_2\right)}}d\theta$

The calculation of the p value can be simplified using the properties of the integral, as shown in the article. Preliminary tests with Monte Carlo simulations suggest that the m-test is more powerful than the other exact tests at different significance levels. As a bonus, this test can be easily extended to more than two experiments, and also to more than two outcomes. The only limitation is in the speed, as many cases need to be considered. I have also prepared an R package to use the test ( https://github.com/vqf/mtest ). In this example, on my computer, the calculation takes about 20 seconds, whereas Barnard's test takes much longer.


Original post: Dan's answer is actually incorrect, not to offend anyone. A z-test is used only if your data follows a standard normal distribution. In this case, your data follows a binomial distribution, therefore use a chi-squared test if your sample is large or Fisher's test if your sample is small.

Edit: My mistake, apologies to @Dan. A z-test is valid here if your variables are independent. If this assumption is not met or unknown, a z-test may be invalid.


  • 4 $\begingroup$ The "only if" part is an extreme position unlikely to be shared by many. No data actually follow a normal distribution. Few data actually behave as if drawn randomly and independently from a normal distribution. Nevertheless, z tests continue to be effective because the distributions of statistics (such as the difference of means) to which they apply can be extremely well approximated by normal distributions. In fact, the appeal to a $\chi^2$ test relies on the same asymptotic assumptions as a z test does! $\endgroup$ –  whuber ♦ Commented Mar 21, 2016 at 1:44
  • $\begingroup$ If you believe in the CLT, then the normal distribution does commonly exist. $\endgroup$ –  Ryan Commented Mar 21, 2016 at 2:49
  • 5 $\begingroup$ @Ryan Well, I believe in the CLT but it doesn't say anything about n=30 or n=300 or n=5000. You don't actually get normality unless you somehow manage to have infinite sample sizes, or you somehow started with normality. Questions about how close we are to normality when taking averages are not addressed by the CLT.. (We can consider those questions but we don't use the CLT to find out if the approximation is any good.) $\endgroup$ –  Glen_b Commented Jul 26, 2016 at 5:12

Your test statistic is $Z = \frac{\hat{p_1}-\hat{p_2}}{\sqrt{\hat{p}(1-\hat{p})(1/n_1+1/n_2)}}$, where $\hat{p}=\frac{n_1\hat{p_1}+n_2\hat{p_2}}{n_1+n_2}$.

The critical regions are $Z > \Phi^{-1}(1-\alpha/2)$ and $Z<\Phi^{-1}(\alpha/2)$ for the two-tailed test with the usual adjustments for a one-tailed test.




Binomial models uncover biological variation during feature selection of droplet-based single-cell RNA sequencing


  • Breanne Sparta, 
  • Timothy Hamilton, 
  • Gunalan Natesan, 
  • Samuel D. Aragones, 
  • Eric J. Deeds


  • Published: September 6, 2024
  • https://doi.org/10.1371/journal.pcbi.1012386

This is an uncorrected proof.

Abstract

Effective analysis of single-cell RNA sequencing (scRNA-seq) data requires a rigorous distinction between technical noise and biological variation. In this work, we propose a simple feature selection model, termed “Differentially Distributed Genes” or DDGs, where a binomial sampling process for each mRNA species produces a null model of technical variation. Using scRNA-seq data where cell identities have been established a priori , we find that the DDG model of biological variation outperforms existing methods. We demonstrate that DDGs distinguish a validated set of real biologically varying genes, minimize neighborhood distortion, and enable accurate partitioning of cells into their established cell-type groups.

Author summary

Single-cell omics technologies measure tens of thousands of genes in up to millions of individual cells. Yet, the sheer dimensionality of the data poses a challenge to its intelligibility. A typical first step in reducing the dimensionality is to apply a feature selection model that distinguishes real biological signals from technical noise. Yet without an appropriate model of technical noise, feature selection can introduce bias into the downstream analysis of the data. In this work, we demonstrate that, in the analysis of single-cell RNA sequencing data, the standard approach of finding Highly Variable Genes (HVGs) induces severe distortion and bias into the analysis of data, when compared to true biological variation that is known a priori . To address this issue, we present a new feature selection model and demonstrate that our model outperforms existing methods in its ability to accurately identify real biological variation.

Citation: Sparta B, Hamilton T, Natesan G, Aragones SD, Deeds EJ (2024) Binomial models uncover biological variation during feature selection of droplet-based single-cell RNA sequencing. PLoS Comput Biol 20(9): e1012386. https://doi.org/10.1371/journal.pcbi.1012386

Editor: Jean Fan, Johns Hopkins University Whiting School of Engineering, UNITED STATES OF AMERICA

Received: January 30, 2024; Accepted: August 5, 2024; Published: September 6, 2024

Copyright: © 2024 Sparta et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All code used in this work is freely available at: https://github.com/DeedsLab/Differentially-Distributed-Genes .

Funding: This work was supported by an NIH IRACDA postdoctoral fellowship K12-GM106996 to BS and NIH R01-GM143378 to EJD. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Single-cell RNA sequencing has advanced the resolution at which variation in gene expression can be observed [ 1 ]. Recent improvements in scRNA-seq technology have enabled the measurement of tens of thousands of genes across hundreds of thousands to millions of cells [ 2 ]. Yet, interpreting transcriptional variation in these extremely high-dimensional datasets remains a challenge at the forefront of biological research [ 3 ].

In addition to the high levels of biological variation observed in scRNA-seq data, the sparsity of the data can pose challenges during downstream analysis procedures [ 4 ]. Single cells contain very small quantities of mRNA, and even with advanced microfluidic technologies, the capture probability for any given mRNA is low [ 5 ]. As a result, a typical scRNA-seq experiment can generate a gene-by-cell expression matrix where ~95% of the entries are zeros. The observation of this large fraction of zeros, colloquially termed “drop-outs”, has left the impression that scRNA-seq data are zero-inflated [ 6 – 9 ]. However, there is a growing body of empirical and theoretical work demonstrating that, given the distributions of gene-count data, we do not observe more zeros in scRNA-seq data than would be expected from a sampling process in which the probability of capturing any given mRNA molecule is low [ 10 – 12 ].

A major contributor to this observation is the use of technologies in droplet-based scRNA-seq approaches that reduce technical error [ 13 – 16 ]. In these experiments, individual mRNAs from lysed cells are bound to Unique Molecular Identifiers, or UMIs, prior to PCR-based amplification, thus overcoming amplification bias when measuring mRNA abundance [ 17 ]. Despite these advances, scRNA-seq approaches still entail a low capture probability for individual mRNA molecules, and the resulting sparsity of this high-dimensional data poses challenges for the analysis of biological variation [ 3 ]. Standard approaches often employ a feature selection step, where observed variation in counts of mRNA species across cells is compared to a model of measurement noise [ 18 , 19 ]. The aim of this feature selection step is to identify genes whose expression levels vary due to meaningful biological differences in the population, thus varying more than what would be expected due to technical noise alone. This step is often applied prior to cell-type clustering, such that clustering is performed on only the set of genes that are expected to be biologically informative.

While many null models of biological variation have been proposed, which feature selection method is most appropriate remains unresolved [ 19 – 22 ]. The most popular approach involves finding “Highly Variable Genes” (or HVGs), which identifies genes with higher variance than what would be expected given the average UMI count for each gene [ 23 – 25 ]. Yet, the procedure itself introduces bias into variance estimates, through a series of nonlinear transformations that are performed prior to the HVG selection step [ 10 ]. In the standard analysis pipeline, raw UMI counts are first transformed with a counts-per-million (CPM) normalization and then with a log+1 transformation. These transformations are motivated by the ideas of normalizing for cell-size factors and stabilizing the variance for genes whose averages are orders of magnitude apart, respectively. However, both transformations have the undesirable effect of increasing the distance between 0 and non-zero values within a gene’s distribution [ 10 , 17 , 26 ]. As a result, these transformations artificially increase the variance of genes, and genes with mean counts that are closer to zero are disproportionately inflated [ 10 ]. Further, CPM normalization aims to adjust counts for potential read-depth differences that occur from differences in cell size. However, in droplet-based approaches, the assumption that cell size can affect read depth lacks empirical support. As such, the utility of the HVG procedure has not been well established, and it can potentially skew downstream cell-clustering and analysis results.

As a result, there is an ongoing effort to improve feature selection in scRNA-seq. One common alternative is to model the abundance of zeros, or ‘dropouts’, in gene distributions and to identify features that have more zeros than expected. For instance, a recent software package developed by Andrews and Hemberg offers three different null models that develop an expectation for the relationship between the ‘dropout’ rate and the mean expression level of each gene: M3Drop, NBDrop, and NBDisp. In M3Drop, a dropout rate parameter is fit to the whole transcriptome, and a Michaelis-Menten-style hyperbolic function is used to identify outliers where the gene-specific dropout rate exceeds the population expectation [ 19 ]. NBDrop uses a negative binomial distribution to model the fraction of zeros per gene as a function of the mean gene count that is adjusted for the sequencing efficiency, or total number of counts, for each cell. NBDisp uses the same negative binomial model as NBDrop, but is similar to the HVG method in that it uses linear regression to model the relationship between the mean and the estimated dispersion to identify a set of overly-dispersed genes. In another method similar to the NBDrop approach, Townes and co-workers identified a set of genes with a greater fraction of zeros than is expected, using a multinomial model where the probability of an mRNA being captured depends on its relative abundance within each cell [ 10 ].

In each of these previous studies, feature selection models are evaluated based on clustering performance using biological data with “ground truth” cell identities, or by their ability to recover differentially expressed genes that have been identified using bulk RNA-seq methods. Using these approaches, it has been demonstrated that the HVG method only marginally improves performance compared to a randomized feature selection control [ 19 ]. Yet, the HVG method remains extremely popular [ 18 – 23 , 27 – 35 ]. In addition to uncertainty about the ability of various methods to identify an informative set of genes with real biological variation, the HVG, NBDisp, and Townes methods all entail arbitrary decisions about the number of features to use in the downstream clustering procedure. Because the choice of feature selection model can significantly alter a study’s results, there is a need to further develop statistically grounded feature selection approaches, and to subject these methods to rigorous tests that evaluate their capacity to identify genes that exhibit bona fide biological variation within a population.

In this work, we propose a simple binomial model that can be used to identify differentially distributed genes, or DDGs. In this model, each mRNA has the same probability of being captured during the initial phase of the scRNA-seq experiment, which results in a binomial sampling process of the mRNA molecules within the cell. We then consider the simple null model where there is no biological variation in a gene’s expression across the population, and calculate the probability that purely technical variation due to the mRNA capture process could generate the observed expression patterns. While similar in spirit to the multinomial model developed by Townes [ 10 ], our inclusion of a specific null model allows us to calculate a p-value that represents the chance that the observed variation in a gene’s expression could be explained purely by technical noise. As a result, our approach allows researchers to define a standard False Discovery Rate (FDR) that sets the number of false positives they are willing to accept, rather than an arbitrary cutoff in the number of genes to consider for further analysis.

We validated this model using a standard synthetic cDNA spike-in scRNA-seq data set, where all of the variation should arise from the sampling process. We found that the vast majority of these genes were consistent with the null model, as expected. We further tested our approach by clustering cells from a study of FACS-sorted lymphocytes, where cell identity markers are obtained prior to scRNA-seq [ 15 ]. We compared our model to a number of existing feature selection approaches, and found that DDGs better retain the structure of variation among established cell identities and more accurately identify genes that are differentially expressed between cell types. DDGs also provide a feature set that better partitions groups of cells in accordance with their established cell identity labels. We found that, while feature selection only marginally improves cell clustering performance compared to the full feature set, the DDG approach enables dimensionality reduction without loss of cell neighborhood structures. Overall, our findings suggest that, compared to existing methods, DDGs can more comprehensively identify genes whose expression patterns demonstrate bona fide biological variation.

Characterizing patterns of gene expression variation across tissues and multicellular organisms

Single-cell RNA sequencing technology has promised to map functional diversity by quantifying global gene expression patterns at the resolution of individual cells. This approach has the potential to revolutionize our understanding of how gene expression, cellular identities, and tissue function are related. Yet elucidating the molecular organization of tissue function remains an ongoing challenge. In scRNA-seq experiments, the dominant approach is to identify cell-type specific changes in gene expression across varying experimental contexts. Here, it is hypothesized that clustering the transcriptome can produce a model of cell types, where cells of the same type are expected to express similar sets of genes and have similar functions [ 23 , 36 , 37 ]. To gain intuition about the structure of transcriptional variation, we took the inverse approach, and sought to characterize groups of genes that are expressed across similar sets of cells. To identify and compare communities of genes that are expressed in similar patterns across complex tissues, we performed gene-clustering across a diverse set of scRNA-seq data sets.

Louvain clustering was performed on the genes of scRNA-seq data collected from human lymphocytes, mouse bladder, mouse kidney, hydra, or planaria (Figs 1 and S1 ) [ 15 , 27 , 28 , 30 , 35 , 38 ]. Across these samples we observed three general expression patterns: 1) A group of genes that are nearly ubiquitously expressed across all cells in the sample. These genes consist largely of ribosomal protein genes and genes involved in energy production, such as ATP synthase. 2) Sparsely expressed genes that are found in few, non-overlapping sets of cells, with no apparent structure in the expression pattern. 3) Groups of genes that are differentially expressed in distinct groups of cells ( Fig 1 ).

Fig 1.

A) Violin plots depicting the estimated probability density for gene clusters in Zheng-9 lymphocytes. Each column represents a particular cell, and the violin plot is colored by the average expression level across the set of genes within the gene cluster. B) Kernel density estimation for the distribution of average gene expression across cells for each gene cluster group in the Zheng-9 lymphocytes. C) Kernel density estimation for the distribution of fraction of cells each gene in the gene cluster is identified in, for the Zheng-9 lymphocytes. D) Heat map depicting the normalized expression level of genes grouped by gene cluster membership in the Hydra. Each row represents a gene and column represents a particular cell, and the entry is colored by the normalized expression level. E) Kernel density estimation for the distribution of average gene expression across cells for each gene cluster group in the Hydra. F) Kernel density estimation for the distribution of fraction of cells each gene in the gene cluster is identified in for the Hydra.

https://doi.org/10.1371/journal.pcbi.1012386.g001

We hypothesized that genes expressed in the third, differentially expressed group are involved in the production of distinct cellular identities. In general, the physiological diversity of the sample correlated with the number of gene clusters that are expressed in distinct groups of cells. For example, in the Zheng-9 lymphocyte data we observed one cluster of ubiquitously expressed genes (cluster 4), one cluster of sparsely expressed genes (cluster 1), and two clusters of differentially expressed genes (clusters 2 & 3) ( Fig 1A , 1B and 1C ). In contrast, in the Hydra data, we found six differentially expressed clusters ( Fig 1D , 1E and 1F ).

Across our gene clusters, we observed a trend between the mean expression level per gene and the fraction of cells in which each gene was identified (Figs 1B , 1C , 1E , 1F and S1 ). Genes with high mean expression levels are observed in nearly every sequenced cell. In contrast, of all the gene groups, the subset of genes which appear to be expressed at random have the lowest mean expression levels. Groups of genes that are differentially expressed have expression levels that fall between these two extremes. This trend between a gene's mean expression level and the number of cells in which that gene is observed suggests that methods relating these two metrics may be useful in identifying genes that capture real biological variation.

A binomial model of mRNA capture identifies genes expressed in fewer cells than expected

For the analysis of scRNA-seq data, many procedures for identifying biologically varying genes have been developed. In the standard HVG feature selection approach, researchers model the mean-variance relationship (dispersion) of transformed gene count data and select those genes that are the most variable for downstream cell-type clustering. The HVG procedure depends on the assumption that variance is proportional to the mean, across the span of mean values observed in the data. Yet, in general, the mean-variance relationship depends on the distribution the sample was drawn from, and in scRNA-seq data, the distribution of particular gene expression patterns across cells is not known. Further, the variance of a gene may be underestimated when the mean counts are close to zero [ 39 ]. To satisfy the assumptions of the HVG model and enable comparison of dispersion across genes whose means span orders of magnitude, a log+1 transformation is applied to the normalized count-by-cell matrix. However, rather than achieving a variance-stabilizing effect, this transformation disproportionately increases the dispersion of genes where a greater fraction of the gene counts per cell are zeros. As a result, the HVG approach has demonstrated the capacity to enrich for a set of genes biased to have low expression values [ 10 ]. Whether HVGs are biologically varying genes, and whether HVGs are operationally useful for cell-type clustering, have not been systematically characterized.

Other existing models make different assumptions about the structure of gene expression and the capture process of scRNA-seq technology. The M3Drop, NBDrop, NBDisp, and Townes models all make subtly different assumptions about the relationships between the fraction of zeros per gene and the size of a cell. In the M3Drop method, the drop-out process is modeled as a kinetic process, such that there exists error around the mean counts per gene [ 19 ]. In the NBDrop and NBDisp methods, the relationship between the fraction of zeros and mean gene count is modified by the total number of reads across each cell the gene is observed in [ 19 ]. These models assume that the capture efficiency changes depending on the size of a cell. Similarly, the Townes method models the mRNA capture step as a process where there is competition to be counted, essentially positing that there is a fixed number of mRNA counts per cell [ 10 ].

In this work, we develop a different null model of variation in counts based on a binomial sampling process under simple assumptions. In our model, the observed mean expression level of a particular gene is used to develop an expectation of the fraction of cells in which we would find that gene if the gene were expressed at exactly the same level across cells ( Fig 2A ). This model does not require assumptions about the underlying distributions of mRNA in cells, nor the mean-variance relationship of gene expression. Based on empirical data, we propose that it is more accurate to model the binding of mRNA to UMI-coated beads as a stochastic process that is neither saturated nor affected by the size of a cell. In our model, every mRNA has an equal probability of binding to a bead, giving a binomial sampling process where each mRNA molecule can be considered as an independent “trial” with a fixed capture probability.

Fig 2.

A) Schematic of the null model for biologically varying genes. The first panel illustrates uniformly (pink) and differentially (cyan) distributed gene expression across a sample of cells, with illustrated histograms of each gene’s count distribution across the set of cells. The second panel illustrates the mRNA capture as a binomial process, where the probability of capture for each mRNA is stochastic. The third panel depicts the expected relationship of the average mRNA level and the number of cells each gene is observed in, where each ‘x’ on the graph represents a specific gene. The pink line illustrates the expected relationship if the only variation that is observed arises from the binomial sampling process. B-D) Scatter of average mRNA count per gene versus the number of cells each gene is identified in for three datasets: B) the synthetic ERCC data, C) the Zheng-9 lymphocytes, and D) the Hydra. In the top panel each gene is colored by the P-value computed from the DDG model, while in the bottom panel each gene is colored by the coefficient of variation.

https://doi.org/10.1371/journal.pcbi.1012386.g002

This model is described in detail in the Methods section, but we will briefly explain it here. First, consider some gene in the genome, which we will call gene $i$. For any cell $j$ in the dataset, we define the observed amount of mRNA in that cell to be $m_{i,j}$ and the real amount of mRNA the cell had before the capture process to be $M_{i,j}$. In other words, $M_{i,j}$ is the actual amount of mRNA for gene $i$ that was in cell $j$, while $m_{i,j}$ is the amount of counts we end up actually observing for that gene in the experiment. We now imagine that, in each cell in each droplet in the experiment, there is a fixed “capture probability” $p_c$; this is just the chance that any given mRNA molecule is captured on a bead, amplified effectively, etc., and thus is detected as a UMI. We assume that this probability is the same for all genes and for all cells; we will discuss how this parameter is determined, and the sensitivity of our results to the value of this parameter, in detail below. This leads to a simple model where every mRNA in every cell is subjected to a Bernoulli trial, leading to a binomial sampling process with a total of $M_{i,j}$ trials ( Fig 2A ).
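To make this concrete, the relationships implied by this description can be written as follows; this is a sketch under the stated assumptions (a fixed capture probability $p_c$ and, under the null hypothesis, the same true count $M_i$ in every cell), and the authors' exact formulation is given in their Methods:

$\mathrm{P}[\text{gene } i \text{ detected in cell } j] = 1 - (1 - p_c)^{M_{i,j}}, \qquad E(m_{i,j}) = p_c\,M_{i,j}.$

Under the null hypothesis $M_{i,j} = M_i$ for every cell $j$, with $M_i \approx E(m_i)/p_c$, so the expected number of cells in which gene $i$ is detected out of $N_{\text{cells}}$ total cells is

$E(N_{c,i}) = N_{\text{cells}}\left[1 - (1 - p_c)^{E(m_i)/p_c}\right].$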


One major difference between our proposed model and the model developed by Townes and co-workers [ 10 ] is in how cell-size factors are handled. In our null model, every cell starts with an identical amount of mRNA for gene $i$, and cell-size effects can still contribute to biological variation. In other words, variation in the population that arises from cases where a cell is larger, and thus has more mRNA for gene $i$ than a cell that is smaller, will deviate from the predictions of our null model. In the Townes method, the probability of capture for each mRNA is capped by its proportional expression in each cell, and natural variation in total mRNAs per cell cannot be accounted for.

To test the relationship between mean expression level and number of cells expressing a given gene, we used scRNA-seq data created from a sample where no biological variation exists. In this experiment, scRNA-seq was performed on droplets spiked with a set of standard, synthetic ERCC control cDNA, and lacking any biological cells [ 15 ]. We observed that, in the absence of real biological variation, the vast majority of the spiked cDNA species fell along the expected relationship predicted by a binomial sampling process (i.e. the equation above, Fig 2B ). In addition to simply predicting the relationship between mean mRNA expression and the total number of cells in which a gene is observed, we can use our model to calculate the probability that we would observe $N_{c,i}$ for a given gene $i$ given its observed mean expression level $E(m_i)$ (see Methods ). This gives us a natural way to define a p-value for the null model, which is the probability of obtaining the observed number (or fewer) of cells expressing that gene, given $E(m_i)$. We term genes for which the observed expression pattern yields significant p-values “Differentially Distributed Genes” or DDGs, since the variation in the expression pattern for such a gene deviates significantly from what we would expect if the gene were expressed identically across all the cells. We use the standard Benjamini-Hochberg procedure to correct for multiple hypothesis testing. Interestingly, if we set the False Discovery Rate (FDR) to 1%, only 5 of the 93 ERCC spike-ins are significant ( Fig 2B ), suggesting that the bulk of the technical variation in droplet-based scRNA-seq can be explained by a binomial sampling process. The FDR can be modified in order to make the threshold for inclusion of a gene in the DDG set more or less stringent.
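The test described above can be sketched compactly in Python; this is an illustration of the calculation as we read it, not the authors' implementation (their code is available at the GitHub repository listed under Data Availability), and the function name is ours:

    import numpy as np
    from scipy.stats import binom
    from statsmodels.stats.multitest import multipletests

    def ddg_pvalues(counts, p_c=0.05, fdr=0.01):
        # counts: genes-by-cells matrix of raw UMI counts
        n_cells = counts.shape[1]
        mean_umi = counts.mean(axis=1)                    # E(m_i) for each gene
        M_i = mean_umi / p_c                              # inferred true count under the null
        q_detect = 1.0 - (1.0 - p_c) ** M_i               # P(gene detected in a given cell)
        n_detected = (counts > 0).sum(axis=1)             # observed N_{c,i}
        pvals = binom.cdf(n_detected, n_cells, q_detect)  # P[X <= N_{c,i}] under the null
        reject, qvals, _, _ = multipletests(pvals, alpha=fdr, method="fdr_bh")
        return pvals, qvals, reject                       # reject == True marks a DDG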

Given a binomial sampling process for mRNA capture, we can expect the fraction of mRNAs that are actually detected in an experiment to vary depending on the probability of mRNA capture. In droplet-based scRNA-seq, this capture probability is expected to be between 5–10% [ 5 ], but can be difficult to estimate for any given experiment in the absence of appropriate controls. Since the DDG model requires specifying a value for the capture probability, $p_c$, we next evaluated our model’s robustness to this parameter choice. To do this, we created a Gaussian Mixture Model (GMM) with 3 cell types, and 3000 cells per cell type. There are also a total of 1000 genes for each cell. If a gene is chosen as a “marker” for a given cell type, then for all the cells of that type, the counts of that gene are drawn from a Gaussian with a mean of 35 and a standard deviation of 2. If a gene is not a marker gene, then the count value is drawn from a Gaussian with a mean of 15 and a standard deviation of 1. Note that these values were taken from rough estimates of useful “marker genes” in datasets where we know the cell type identities a priori (see Methods ). As a result, there are a total of 900 marker genes, 300 genes being markers for each of the 3 cell types. These should be “true” DDGs. There are also 100 genes that are not markers for anything, and serve as negative controls.

We then simulated a binomial sampling process by performing a Bernoulli trial for each “UMI count” in the data, varying the capture probability from 1% up to 50%. We find that under reasonable parameter regimes where some noise exists, but not enough to wash out the signal entirely, the DDG model is able to recover 100% of the ground truth genes ( S2C and S2F Fig ). This is true even when the DDG model parameter for capture probability is set to 5%, but the true capture probability in the simulation ranges from 2%-20%. This suggests that the DDG model is fairly robust to misspecification of the capture probability, at least within a reasonable range. Another interesting result of this GMM is that the model never mis-identified a “non-marker” gene as a DDG. This indicates that, even though our null model assumes that every cell has exactly the same number of mRNAs, small variation in the initial starting number of mRNAs does not interfere with the identification of genes that vary in a significant way biologically. Interestingly, the HVG approach was not able to recover the full set of ground truth marker genes for this simulated dataset, suggesting the HVG approach may struggle to find genes with bona fide significant differences across cell types ( S2F Fig ).
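In outline, the simulation described above can be reproduced roughly as follows (parameter values are the ones quoted in the text; the authors' actual script may differ in detail):

    import numpy as np

    rng = np.random.default_rng(0)
    n_types, cells_per_type, n_genes, markers_per_type = 3, 3000, 1000, 300
    true_capture = 0.10   # simulated capture probability; varied from 1% to 50% in the text

    # Non-marker baseline: Gaussian(15, 1); markers: Gaussian(35, 2) in their own cell type.
    counts = rng.normal(15, 1, size=(n_types * cells_per_type, n_genes))
    for t in range(n_types):
        rows = slice(t * cells_per_type, (t + 1) * cells_per_type)
        cols = slice(t * markers_per_type, (t + 1) * markers_per_type)
        counts[rows, cols] = rng.normal(35, 2, size=(cells_per_type, markers_per_type))
    counts = np.clip(np.round(counts), 0, None).astype(int)   # genes 900-999 are never markers

    # Binomial "capture": each simulated molecule survives an independent Bernoulli trial.
    observed = rng.binomial(counts, true_capture)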

We next applied our DDG model to scRNA-seq data with real biological variation, using data collected from the Zheng-9 lymphocytes as well as the Hydra. When we plotted the mean mRNA count versus the number of cells each gene is detected in, we observed a large fraction of genes that deviate from the line predicted by our model of technical variation ( Fig 2C and 2D ). When each gene is colored with our p-value statistic, we observe that the more statistically differentially distributed genes tend to have higher mean mRNA values. Interestingly, when we color each gene by the coefficient of variation (standard deviation/mean) we observe the reverse trend: genes with low mean counts have coefficients of variation that are orders of magnitude higher than those of the more abundantly expressed genes ( Fig 2B , 2C and 2D ). These statistics indicate that the HVG procedure may be conflating measurement noise with biological variation. Further, if the HVG and DDG methods were to be applied to raw gene counts from the same sample, the two methods may identify non-overlapping feature sets.

Differentially distributed mRNAs exhibit different modes of variation

After identifying our sets of DDGs, we next sought to visualize the distribution of these genes across complex tissues. Because cell identities for the Zheng-9 lymphocyte data were annotated prior to sampling the transcriptomes, we estimated the probability density for each gene across each of the nine different cell classes. In this orthogonally annotated data, we observed three general patterns of gene expression that can occur independently of mean expression value. 1) For genes with non-significant p-values, we find uniform distributions both across and within different cell types ( Fig 3B ). This class consists of genes whose expression levels do not vary more than what would be expected from a binomial capture process. These genes also tend to include, but are not limited to, genes that have mean expression values close to zero (Figs 3A and 4 ). 2) In the Zheng-9 lymphocytes, we observe DDGs that are differentially expressed in specific cell types ( Fig 3C ). For example, out of the 9 annotated cell types, the gene GNLY is only highly expressed in Natural Killer cells, as it encodes a specialized protein with antimicrobial activity, called granulysin. 3) We also find DDGs that quantitatively vary both across and within cell types ( Fig 3D ). The proteins encoded by these genes include general cell-function proteins such as ribosomal proteins and S100 family calcium binding proteins, as well as proteins with established immune function. For example, we observed increasing counts of the GZMM serine protease across different classes of activated lymphocytes. We also observed different levels of the CD52 peptide, which is associated with the mobility of lymphocytes, as well as other cytokine receptor and antigen-associated genes.

Fig 3.

A) Scatter of average mRNA count per gene versus the number of cells each gene is identified in for the lymphocyte data, cropped to highlight the three example genes depicted across the row. B-D) Kernel density estimation for the distribution of gene counts across cells in the lymphocyte data. The distribution of each gene is plotted for each cell type separately. In each column three example gene distributions are depicted to show B) genes that are not significantly differentially distributed, C) genes that are differentially expressed in specific cell types, and D) genes that quantitatively vary both across and within cell types.

https://doi.org/10.1371/journal.pcbi.1012386.g003

Fig 4.

A-F) Scatter of average mRNA count per gene versus the number of cells each gene is identified in for the lymphocyte data, where each gene is colored by the computed p-value. The fraction of significantly distributed genes is indicated for A) cytotoxic T cells, B) the Zheng-9 lymphocyte mix, C) hydra, D) mouse bladder, E) mouse kidney, and F) planaria. G) Mean fraction of DDGs over developmental time in mouse limb bud data. Error bars show confidence interval around the mean, estimated using the 10 re-sampled sets.

https://doi.org/10.1371/journal.pcbi.1012386.g004

A natural method for determining the number of genes in an scRNA-seq feature set

Several feature selection procedures use an arbitrary cutoff for determining the number of biologically informative genes. In the HVG method, typically 2,000–5,000 of the most variable genes are chosen to create the cell-by-gene distance matrix upon which cell-type clustering is performed. Arbitrary cutoffs are also imposed by the Townes and NBDisp methods, while the M3Drop and NBDrop models use an FDR-adjusted test statistic. Rather than choosing an arbitrary number of genes for an analysis, we hypothesized that we should expect the number of biologically informative genes to increase with the number of different cell identities within a sample. To test this hypothesis, we calculated the fraction of DDGs over samples with increasing tissue complexities.

For tissues with more diverse physiologies, we recovered a greater fraction of differentially distributed genes. For example, we found a fraction of only 3.8% DDGs in an isolated sample of cytotoxic T cells, compared to 15.2% DDGs across the Zheng mix of 9 different lymphocyte cell types ( Fig 4A and 4B ). Similarly, in the mouse bladder, where three major cell types are expected, 27.4% of the genes are differentially distributed, while we find 39.1% DDGs in the more physiologically diverse mouse kidneys ( Fig 4D and 4E ). When we calculated the fraction of DDGs across scRNA-seq data collected from whole multicellular organisms, we observed 54.3% DDGs in the hydra and 55.4% DDGs in planaria ( Fig 4C and 4F ). We then repeated this experiment using the two other feature selection models where a natural test statistic determines the number of significant genes for each sample. Interestingly, with the NBDrop method, we observed a similar trend as with the DDGs, where the number of genes increased as a function of tissue complexity ( S3B Fig ). However, the overall number of genes was lower for each sample, and the kidney sample deviated from the trend with higher numbers of significant genes than the two whole-organism samples. In contrast, the M3Drop method produced a fraction of significant genes that decreased with increasing tissue complexity, contradicting our knowledge of gene expression variation across complex tissues ( S3C Fig ).

We next tested the hypothesis that the fraction of differentially distributed genes increases over developmental time. For this experiment, we used scRNA-seq data collected from mouse limb buds spanning embryonic days 10.5–15 of development [ 40 ]. To ensure equivalent statistical power, we randomly sampled 5,000 cells from each time point ten times, calculated the fraction of DDGs, and generated confidence intervals across our samples. We found that as developmental time progressed, the fraction of DDGs generally increased, corroborating our hypothesis that diversity of cell types corresponds to diversity in gene expression ( Fig 4G ). Yet we observed that the earliest and latest limb bud samples (e10.5 and e15) deviated from this trend, potentially due to significant variation in the average number of UMI counts per cell detected in these two samples ( S3D–S3F Fig ). We next sought to investigate how different experimental variables may affect the number of DDGs recovered in a sample.

One critical parameter that often varies between experiments is the total number of UMIs captured in each cell, which is often referred to as “sequencing depth.” To understand how variation in this parameter can alter the number of DDGs that are identified, we performed an experiment where we randomly retained only a fraction of counts from the original Zheng 9 lymphocyte data, as well as data generated from a different data set on cell lines obtained from 10X Genomics. To do this, we performed a Bernoulli trial for each UMI count in the data, with a capture probability of 50% or 90%, calculated the set of feature genes, and repeated each experiment 10 times. When we compare the set of DDGs identified using the down-sampled data to the original DDG set, we find that re-sampling the cell line data at either 90% or even 50% probability recovers at least 90% of the original DDGs ( S4A Fig ). In the Zheng9 case, however, which is already fairly sparse to begin with, there is a larger effect, especially for the 50% probability case ( S4C Fig ). These results suggest that the set of DDGs identified is robust to minor variation in sequencing depth/coverage, but starts to fail when most of the variation in the data is actually just technical noise; in other words, when there is no signal to detect. Interestingly, when we applied the HVG approach to this experiment, we found no real change in the HVGs, even for the Zheng9 data where the down-sampling produced a dataset with an extreme lack of signal ( S4B and S4D Fig ). This highlights the fact that the HVG model struggles to separate biological and technical sources of variation.

We next tested the idea that mis-specifying the capture probability parameter, $p_c$, in the DDG model can alter the number of DDGs identified in data. To do this, we titrated the value of $p_c$ from 1% to 80% for a down-sampling experiment similar to that described above, across several datasets used in this study. While we find that the number of DDGs can vary greatly when different $p_c$ values are used, the number of DDGs is relatively stable within the expected experimental range of 5–10% mRNA capture ( S5A–S5I Fig ).

The fraction of DDGs recovered also depends on how many cells are used to generate the observations. Using the lymphocyte data, where cell types were first identified using FACS prior to scRNA-sequencing, we can approximate a set of “real” DDGs using a supervised approach. We calculated the set of genes whose means are significantly different across the nine cell-type groups, using the Wilcoxon rank-sum test. Next, to evaluate how the number of cells affects the power of our model, we randomly sampled each of the nine types of lymphocytes over a range of 100 to 10,000 cells, then calculated the set of DDGs. As the number of cells increases, we find that the number of DDGs recovered increases linearly ( S6A Fig ). When we compare the overlap between the predicted DDGs and the “real” DDGs obtained from the Wilcoxon rank-sum test, we find that as we increase the number of cells, the number of “real” DDGs recovered approaches saturation ( S6D Fig ). Together, these results suggest that the power of the DDG model depends linearly on the number of cells observed, yet the number of biologically variable genes is limited, and with an increasing number of cells, the DDG model can recover an increasing fraction of true variable genes.
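One plausible way to construct such a supervised “Wilcoxon” gene set is scanpy's one-vs-rest rank-sum test; this is a hedged sketch (the input file and label column names are hypothetical, and the authors' exact procedure may differ):

    import scanpy as sc

    adata = sc.read_h5ad("zheng9_with_facs_labels.h5ad")     # hypothetical input
    sc.tl.rank_genes_groups(adata, groupby="facs_label", method="wilcoxon")

    result = adata.uns["rank_genes_groups"]
    wilcoxon_genes = set()
    for group in result["names"].dtype.names:                # one-vs-rest per cell type
        for gene, p_adj in zip(result["names"][group], result["pvals_adj"][group]):
            if p_adj < 0.01:                                 # BH-adjusted significance
                wilcoxon_genes.add(gene)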

Preservation of variance structure during feature selection

After having validated our DDG model with technical and biological controls, we proceeded to characterize the operational utility of different feature selection methods. In the standard scRNA-seq analysis workflow, feature selection is motivated by the idea that dimensionality reduction can remove axes of variation that arise due to sampling noise, and thus improve the identification of cells that vary in similar and biologically informative ways. Feature selection typically implies a 10-20-fold reduction of genes that are used to represent cells, before the dimensionality is further reduced by principal component analysis, and clustering algorithms are applied. However, the extent to which various feature-selection models can preserve the variance structure of high-dimensional data, or of real biological variation, has yet to be established.

To evaluate whether cell neighborhood structure can be preserved by various feature selection methods, we calculated the distortion induced to cell neighborhoods by dimensionality reduction using a metric called the Average Jaccard Distance (AJD). The AJD is the average, over cells, of one minus the Jaccard similarity (the size of the intersection divided by the size of the union) between a cell's set of k-nearest neighbors in the high-dimensional space and its set of k-nearest neighbors in the reduced-dimensional space ( Fig 5A ) [ 41 ]. If the AJD = 0, all cell neighbors are the same, and no distortion is induced by the dimensionality reduction. In contrast, if the AJD = 1, dimensionality reduction has permuted all of the neighbors, such that none of the original k-nearest neighbors in the high-dimensional data remain in the reduced-dimensional projection.
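A direct implementation of this metric might look like the following sketch (using scikit-learn's nearest-neighbor search; the function name is ours):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def average_jaccard_distance(X_full, X_reduced, k=20):
        # k-nearest neighbors of every cell in each space (excluding the cell itself)
        nn_full = NearestNeighbors(n_neighbors=k + 1).fit(X_full)
        nn_red = NearestNeighbors(n_neighbors=k + 1).fit(X_reduced)
        idx_full = nn_full.kneighbors(X_full, return_distance=False)[:, 1:]
        idx_red = nn_red.kneighbors(X_reduced, return_distance=False)[:, 1:]
        dists = []
        for a, b in zip(idx_full, idx_red):
            sa, sb = set(a), set(b)
            dists.append(1.0 - len(sa & sb) / len(sa | sb))   # Jaccard distance per cell
        return float(np.mean(dists))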

Fig 5.

A) Schematic of the Average Jaccard Distance as a measure of distortion induced by dimensionality reduction B,C) Heatmap of pairwise Average Jaccard Distances for each feature set for the B) Zheng-9 lymphocyte mix and C) Hydra.

https://doi.org/10.1371/journal.pcbi.1012386.g005

Using the data where all genes are included as a reference, we calculated the neighborhood distortion induced when various feature selection methods were applied to the Zheng lymphocytes ( Fig 5B ). Using the set of true biologically varying genes obtained from applying the Wilcoxon test across the FACS-sorted lymphocytes, we first established a minimum expectation of distortion induced by feature selection. We found that when cell neighborhoods were calculated using 20 nearest neighbors and the supervised set of 8,883 genes, the variance structure of the full 20,000-gene set was largely preserved. The Wilcoxon set produced an AJD of only 0.12. In comparison, when the HVGs were used as the basis for the dimensionality-reduced space, the AJD was 0.88 in reference to both the all-gene and Wilcoxon gene neighborhoods. This high level of distortion induced by using just the set of HVGs to construct cell neighborhoods suggests that the majority of biological variation expressed in the local neighborhood structure has been lost. In contrast, when we calculated the distortion induced by the DDG set, we found an AJD of only 0.19, indicating that the DDG set preserves the high-dimensional neighborhood structure nearly as well as the Wilcoxon genes, even though the DDG set includes only about one third the number of genes of the Wilcoxon set. We next compared the preservation of variance by DDGs to that of other models. We found that the other two binomial-based models were more comparable to the DDGs in their preservation of structure, with the Townes method producing an AJD of 0.23, and NBDrop producing greater distortion with an AJD value of 0.56. In contrast, the M3Drop procedure more significantly permutes the neighborhood structure relative to both the all-gene and Wilcoxon-gene neighborhoods, giving an AJD of 0.98. These trends were also observed in the Hydra: relative to the high-dimensional data, HVGs greatly permuted the neighborhood structures, while DDGs induce almost no distortion ( Fig 5C ).

Recovery of biologically meaningful genes by different feature selection methods

We next evaluated whether various feature selection models could recover the specific set of differentially expressed genes that were generated using the supervised Wilcoxon rank-sum test on the FACS-labeled lymphocytes. Out of all the feature selection methods tested, the DDG set recovered the largest fraction of the Wilcoxon genes, specifically sharing 2,807 out of a total of 2,835 DDGs. In contrast, HVGs only recover 1,333 ( Fig 6A ). We calculated the Jaccard Index to account for the proportional overlap, as different feature selection methods identify different numbers of feature genes in the same data. M3Drop had the highest proportional overlap with the Wilcoxon set, with DDGs having the second highest, while the HVGs had the smallest overlap ( Fig 6B ).

Fig 6.

A) Venn diagrams of set overlap for each feature set and the set of genes identified in the Zheng-9 lymphocyte mix using the Wilcoxon rank sum test. B) Jaccard index for each feature set illustrating the size-adjusted overlap with the Wilcoxon gene set in the lymphocyte mix. C,D) Scatter of the mean mRNA count versus dispersion per gene, colored by set membership for feature sets in the lymphocyte data. C) compares HVGs and Wilcoxon genes while D) compares DDGs and Wilcoxon genes.

https://doi.org/10.1371/journal.pcbi.1012386.g006

To further explore this question of capturing biologically meaningful genes, we took a second approach and compared our feature sets with historically established marker genes that are often used in cell-type annotation. We obtained marker gene sets for three datasets including lymphocytes, pancreas, and brain cells from the Panglao database [ 42 ], and calculated the fraction of marker genes identified by different feature selection models. We find that M3Drop, followed by the DDG model, recovers the highest fraction of historically established marker genes ( S7B Fig ). While the Wilcoxon and M3Drop feature sets include a higher number of marker genes compared to the DDGs, the DDG set contains significantly fewer genes, suggesting that a larger proportion of DDGs overlap with historically established cell-type markers ( S7A Fig ).

Next, to understand these differences in feature sets, and to develop intuition about what kinds of bias may be introduced by various feature selection models, we plotted the mean mRNA count against the “dispersion” (the variance-to-mean ratio) for each gene in the Zheng lymphocyte data. Here, each gene is colored by its membership in the supervised Wilcoxon set, a particular feature selection set, or both the Wilcoxon set and the feature set ( Fig 6C and 6D ). When we compared the Wilcoxon genes to the HVGs, we observed that HVGs tend to have lower mean expression, but high dispersion ( Fig 6C ). This finding corroborates the previous hypothesis that the HVG procedure biases the output towards a set of genes with low mean expression but high variance. In contrast, DDGs fail to identify Wilcoxon genes with lower expression levels, potentially due to a lack of power at lower mean expression levels.

We then compared the DDG model to the other, more similar, binomial-based methods ( S8B–S8D Fig ). Interestingly, the genes included in the DDG set are similar to those in the Townes set, yet each feature set has a unique subset of genes. Additionally, the Townes method tends to select genes that have higher expression means than the DDGs, suggesting a lesser ability to identify differentially expressed genes with lower expression values ( S8C Fig ). Meanwhile, the NBDrop method produces a set of genes that is a subset of the DDGs, yet contains far fewer genes. The NBDrop method also tends to select the subset of DDGs that have high dispersion values, again indicating weaker power to resolve differentially expressed genes. Taken with the previous findings, these results suggest that compared to other methods, our binomial model with simpler assumptions can identify a greater fraction of true biologically varying genes.

Recovery of orthogonal labels by different feature sets

We next sought to test the operational goal of feature selection, which is to enable recovery of physiologically similar cells that have similar transcriptional profiles. To evaluate which feature set is most informative for this task, we first used the FACS-labeled lymphocytes, where biological identities are determined a priori based on historically-established notions of major immune cell types. In this experiment, we apply the standard approach of Louvain clustering to the dimensionality-reduced data and evaluate how well each feature set can recover the original FACS labels.

For each feature-selected group, we titrated the Louvain resolution parameter from 0.1 to 1, calculated the adjusted Rand index (ARI) to compare set membership between Louvain clusters and FACS labels across all cells, and plotted the best score per group [ 43 ]. In the first experiment, we tested whether clustering feature-selected genes using the raw UMI counts could recover the appropriate lymphocyte cell types. We found that none of the feature selection groups achieved an ARI score greater than 0.5, indicating a failure to appropriately partition the cells into the historically established cell-type groups ( S9G Fig ). Yet, rather than using raw UMI counts, the standard analysis pipeline clusters on feature-selected data that has been CPM and log+1 transformed, and then further reduced by principal component analysis (PCA). The application of PCA is motivated by the idea that reducing the sparseness of the data while retaining the variance structure will enhance the performance of the Louvain clustering algorithm. To test this idea, we performed Louvain clustering at various steps in the analysis pipeline by using raw counts, PCA-transformed counts, and log+1-CPM-PCA transformed counts. We found that using either PCA-transformed counts or the full log+1-CPM-PCA transformation only marginally improved the performance of various feature selection sets, bringing the best ARI scores to slightly greater than 0.6 ( S9H and S9I Fig ).
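In outline, this resolution titration can be written as follows; a sketch using scanpy and scikit-learn, with a hypothetical input file and label column:

    import numpy as np
    import scanpy as sc
    from sklearn.metrics import adjusted_rand_score

    adata = sc.read_h5ad("zheng9_feature_selected.h5ad")    # hypothetical feature-selected input
    sc.pp.normalize_total(adata, target_sum=1e6)            # CPM
    sc.pp.log1p(adata)                                      # log+1
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)

    best_ari = 0.0
    for res in np.arange(0.1, 1.01, 0.1):
        sc.tl.louvain(adata, resolution=float(res), key_added="louvain_tmp")
        ari = adjusted_rand_score(adata.obs["facs_label"], adata.obs["louvain_tmp"])
        best_ari = max(best_ari, ari)                       # keep the best score per feature set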

To identify where the Louvain algorithm was failing, we visually inspected the tSNE projections where cells were colored by either their original FACS labels or their Louvain-identified clusters [ 44 ]. We observed that none of the assayed feature sets were able to recover the appropriate FACS labels for the subsets of related T cell lineages ( S9 Fig ). This observation corroborates the existing idea that there might not exist a transcriptional basis that can separate these related T cell lineages into distinct groups [ 44 – 47 ]. Based on these observations, we next performed a second FACS-label recovery experiment, where we merged the FACS labels from two subsets of specialized T cell lineages into two distinct supersets. Specifically, we combined naïve cytotoxic cells, cytotoxic T cells, and memory T cells into one set, and naïve T, helper T and regulatory T cells into another set. Using this coarser grouping, we proceeded to evaluate the performance of different feature sets.

Several gene sets were able to recover a large majority of the FACS labels, giving ARI scores over 0.9 ( Fig 7 ). These gene sets include all genes, the Wilcoxon genes, the DDGs and the Townes genes. We found that log+1-CPM-PCA transformed counts again perform slightly better than no transformations or only PCA transformation, potentially by permuting neighborhood structure in a way that is more compatible with the Louvain algorithm ( Fig 7G , 7H and 7F ). Interestingly, the variance-based approaches, including the HVG set, failed to produce an ARI of greater than 0.6 ( Fig 7G , 7H and 7F ), even with the minor T-cell lineages merged into major groupings. Inspection of the tSNE projections for the HVG set suggested a loss of the separatrix that distinguishes the two separate T cell lineages. This loss of biologically informative axes of variation in the HVG-selected data likely impedes the ability of the Louvain algorithm to recover the original FACS labels.

Fig 7.

A-F) tSNE projections for dimensionality-reduced data from the Zheng-9 lymphocyte mix. Principal component analysis was used to reduce (A,D) all genes, (B,E) HVGs, or (C,F) DDGs. Cells are either colored by original FACS-based set membership, where T-cell lineages were merged into two super sets (A-C), or Louvain cluster membership (D-I). G-I) Quantification of FACS label recovery by various methods of dimensionality reduction across different feature sets. Louvain clustering was performed across a titration of resolution parameters, cluster labels were compared to original FACS labels where T-cell lineages were merged into two super sets, and the highest adjusted rand index is plotted when G) raw UMI counts, H) PCA-transformed counts, or F) Log+1-CPM-PCA-counts were used as a basis.

https://doi.org/10.1371/journal.pcbi.1012386.g007

To rule out the possibility that the HVG method was losing biologically relevant information simply due to the choice of too few feature genes, we repeated the label recovery experiment using increasing numbers of HVGs. We first considered cases where the number of HVGs was set to 2,000 (the original number we used, and a typical default value), 4,000 and 8,000. We applied these feature numbers to several datasets with reasonable approaches to orthogonally labeling the “ground truth” of the relevant cell types for clustering: the Zheng 9 FACS-sorted lymphocytes, a multiplexed mixture of B and T cell lines, a multiplexed mixture of A20 and NIH3T3 cell lines, and Citeseq data from spleen and lymph nodes [ 15 , 48 , 49 , 50 ]. In the case of the B and T cell lines, and the A20/NIH3T3 cell lines, the cells were labeled before being sequenced in a multiplexed way, allowing for a priori identification of the cell line of origin of each cell in the dataset. While the Citeseq data did not do this, it uses oligo-tagged antibodies to measure surface protein expression, allowing us to annotate the cell-type identity of each cell in this dataset in a method analogous to FACS-sorting of historically established marker genes. All the data was subject to the same “standard pipeline” set of transformations (CPM normalization, log transformation, PCA, etc.) before being used as the basis for Louvain clustering.

We found that across all four datasets evaluated, varying the number of HVGs from 2000, 4000, and 8000, had minimal effect on the ability of Louvain clustering to recover the appropriate cell type labels ( S10 – S14 Fig ). In the FACS sorted lymphocytes, regardless of the number of genes used, HVGs only achieve an ARI of about 0.5, indicating a lack of similarity between the FACS labels, and the HVG-based Louvain clusters ( S10A–S10C and S12A–S12C Figs). We saw a similar lack of sensitivity to the number of HVGs chosen in the cell line datasets ( S11A and S11B Fig ), though we should note that all feature selection approaches performed extremely well on these data, likely because cell lines are very transcriptionally distinct and thus are relatively straightforward to cluster effectively. Interestingly, varying the number of HVGs had a slight effect on the ARI for the Citeseq data ( S11C Fig ). In contrast to the cell line data, however, overall clustering performance on this dataset was quite low (with ARIs around 0.4), which we suspect is due to the fact that the Citeseq protein measurements are less reliable in determining true cell types than the procedure used for FACS.

In analogy to changing the number of HVGs, we also varied the “effective capture probability” parameter in the DDG model, which can strongly influence the number of DDGs identified as significant at any given FDR threshold. For instance, varying the capture probability between 1% and 80% changes the number of significant genes from ~10% to ~75% in the Zheng 9 data ( S5B Fig ). As with varying the number of HVGs, this has little impact on clustering performance, suggesting that the DDG approach with a reasonably low capture probability (around 5–10%) provides a meaningful basis for clustering ( S10D–S10F and S12D–S12G Figs).

At first glance it might seem that the results on both the cell-line and Citeseq data do not provide much additional insight into the utility of various feature selection approaches. As mentioned above, clustering appears to be trivial for the cell-line dataset, as any feature set and any number of features, even randomly selected features, are able to identify the appropriate cell type classification across all cells in the data ( S11A and S11B Fig ). While this is true for an optimal Louvain cluster resolution parameter, the correct cell-type membership is not typically known a priori , and one would not typically be able to determine the optimal clustering resolution without this knowledge. With this in mind, we find that the DDG approach more frequently produces clusters that are concordant with ground truth cell types when a range of cluster resolutions are evaluated ( S10 and S12 – S14 Figs). In summary, when there are well-defined, known cell identities for each cell in a dataset, the DDG approach performs remarkably well, either outperforming existing feature selection techniques (as in the Zheng9 data) or performing equally well but being far more robust to the choice of parameter values (as in the cell line data).

Performance of different feature sets in RNA velocity

Finally, as not all scRNA-seq experiments are expected to produce discrete cell type clusters, we tested how the DDG method performed in RNA velocity analysis. RNA velocity uses counts of unspliced and spliced mRNA species to build a model that orders cells along a pseudo-temporal axis, describing how cells progress through differentiation. Using available data from the pancreas and dentate gyrus, we computed the temporal position of each cell, referred to as the "latent time," using either the HVGs, the DDGs, or 2,000 random genes as the input to the RNA velocity package scVelo [51]. We found that the latent time computed using the DDGs was more similar to the latent time computed using the HVGs than to the latent time computed using the random genes (S15J–S15M and S16J–S16M Figs). However, since the "ground truth" (i.e., the real velocity for each gene, or the real latent time for each cell) is not known, it is difficult to determine to what extent HVGs or DDGs capture the true underlying variation. That being said, it is clear that DDGs provide a useful basis for performing this analysis and do not generate results that are out of line with current feature selection approaches. We should also note that the surprisingly strong performance of the random subset of genes in this experiment suggests that significant further effort is needed in feature selection for RNA velocity analysis.
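As a rough illustration of how latent time can be computed on a restricted feature set and then compared between feature sets, the sketch below follows the public scVelo API. The example dataset loader, the placeholder gene lists, and the use of the dynamical model are illustrative assumptions rather than a record of the exact settings used in this analysis.

```python
import scvelo as scv
from scipy.stats import pearsonr

def latent_time_for(adata_raw, genes):
    """Run scVelo's dynamical model on one feature set and return latent time per cell."""
    adata = adata_raw[:, [g for g in genes if g in adata_raw.var_names]].copy()
    scv.pp.normalize_per_cell(adata)
    scv.pp.log1p(adata)
    scv.pp.moments(adata)                     # neighbors + moments of spliced/unspliced counts
    scv.tl.recover_dynamics(adata)            # fit transcriptional dynamics per gene
    scv.tl.velocity(adata, mode="dynamical")
    scv.tl.velocity_graph(adata)
    scv.tl.latent_time(adata)
    return adata.obs["latent_time"].to_numpy()

adata_raw = scv.datasets.pancreas()           # example dataset with spliced/unspliced layers
# Placeholder gene lists standing in for the real HVG and DDG sets.
hvg_genes = list(adata_raw.var_names[:2000])
ddg_genes = list(adata_raw.var_names[1000:3000])

r, _ = pearsonr(latent_time_for(adata_raw, hvg_genes),
                latent_time_for(adata_raw, ddg_genes))
print(f"Pearson correlation of latent times: {r:.2f}")
```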

A major challenge in the study of single cell biology is understanding how variation in gene expression maps to cell physiology. Single-cell RNA sequencing seeks to address this challenge by increasing the resolution at which transcriptional variation can be measured. Yet the interpretation of these datasets remains subject to the very mapping problem the technology seeks to overcome. The dominant approach to interpreting transcriptional variation relies on clustering groups of transcriptionally similar cells, and then comparing differences in gene expression patterns between the groups. Ideally, unsupervised clustering can partition cells that vary in similar ways into biologically meaningful groups. Defining the most appropriate notion of transcriptional similarity, and the best way to delineate cells into cell-type classes, remains an ongoing challenge [36, 37].

Nonetheless, an operational goal of scRNA-seq studies is to identify genes that are related to biological variation. However, the nature of scRNA-seq data poses challenges for this goal, as the low probability and stochasticity of the scRNA-seq capture process generates sparse, noisy, and extremely high-dimensional mRNA count data. These raw UMI measurements can conflate biological and technical variation and may create challenges for existing unsupervised clustering algorithms. Therefore, a critical step in the analysis lies in developing models of measurement noise that distinguish which axes of variation are biological, and which arise due to technical noise in the measurement.

To achieve this task, null models of biological variation must be evaluated against empirical data. In this work we propose a feature selection model with minimal assumptions. While our model is similar to the NBDrop and Townes methods, it relaxes constraints on how cell size affects the mRNA capture process and treats every mRNA molecule as having an equal probability of being captured. We benchmark how our model performs compared to other popular methods by using data where known biological variation has been established in advance. Specifically, the Zheng lymphocyte data, in which cells were sorted into major cell-type classes and annotated with FACS labels prior to sequencing, enable us to directly map biological variation to transcriptional variation. In addition, because the Zheng lymphocyte dataset contains a relatively large number of cells, we were able to test model performance using several rigorous metrics.

We demonstrate that, compared to other approaches, our DDG method produces the most accurate mapping of established physiological variation to dimensionality-reduced transcriptional variation. First, the DDG method assigns a p-value to each gene during feature selection, providing a natural FDR cutoff by which the number of features can be determined after correction for multiple hypothesis testing. We show that, when applied to samples of increasing tissue complexity, this approach is the only method tested that recovers an increasing number of biologically varying genes. Second, DDGs preserve the high-dimensional neighborhood structure of the raw and Wilcoxon-selected count-by-cell data, suggesting that the major axes of transcriptional variation are retained in the reduced dataset. Further, using the orthogonally annotated lymphocyte data, we find that the DDG feature selection method best recovers a set of supervised differentially expressed genes, and that the DDG set provides a basis for unsupervised clustering that enables recovery of the original FACS cell-identity labels. In other datasets where reasonable ground-truth labels are available, such as the cell line data, we showed that the DDG approach provides a more robust basis for clustering. DDGs can also be used to identify genes with meaningful biological variation for analyses other than clustering, such as RNA velocity, although more work will be required to fully understand how feature selection influences the results of such analyses.

Taken together, these findings support the use of DDGs in improving the standard scRNA-seq analysis pipeline. Our finding that HVGs severely distort the neighborhood structure of the high-dimensional data and perform poorly across all other tested metrics corroborates the existing hypothesis that HVGs create an almost arbitrary basis for cell type clustering. We argue that the HVG method biases feature selection toward genes with low expression values, and that this popular method is not necessarily the most appropriate approach to dimensionality reduction or feature selection for scRNA-seq data. Moreover, our cell-clustering results suggest that feature selection offers only a marginal improvement over clustering cells using the full, raw count-by-cell matrix. Nonetheless, dimensionality reduction using DDG feature selection can offer an advantage when computational power is limited. DDGs also offer a viable dimensionality reduction approach for other computationally intensive techniques, such as manifold learning, when PCA may not be the most appropriate method for identifying manifolds of transcriptional variation at the single-cell level.

DDGs identify genes whose expression is differentially distributed within specific subsets of specialized cell types, as well as genes that vary more continuously both within and across transcriptionally similar groups of cells. The DDG subset thus provides more information than standard differential expression tests, which are not designed to identify genes that vary quantitatively across multiple cell classes or within a particular group. As such, the standard scRNA-seq analysis pipeline overlooks potentially interesting manifolds of transcriptional variation. This selective vision highlights a more general struggle in biological research. Classifying cells into discrete types can be operationally useful for studying changes in gene expression. However, to what extent transcriptional identity maps to discrete objects, and whether physiologically distinct cell types correspond to well-separated regions in transcriptional space, remain underexplored questions [52]. If we cannot reliably draw boundaries around groups of transcriptionally similar cells, the types of questions that we may ask of them are limited. Future work might use the DDG method to retain informative axes of biological variation within a more topologically permissive framework for understanding how single-cell variation enables biological organization.

Binomial model of the mRNA capture process

In this model, the observed UMI count $m_{i,j}$ for gene $i$ in cell $j$ is treated as a binomial sample of the $M_{i,j}$ mRNA molecules actually present in the cell, with each molecule captured independently with probability $p_c$:

$$\mathrm{P}\left[m_{i,j} \,\middle|\, M_{i,j}, p_c\right] = \binom{M_{i,j}}{m_{i,j}}\, p_c^{\,m_{i,j}}\,\left(1 - p_c\right)^{M_{i,j} - m_{i,j}}.$$

Expected number of cells expressing a gene i

A critical parameter in this model is the ground-truth amount of mRNA in each cell for each gene, $M_{i,j}$. We obviously do not know this number, since all we have are the observed values $m_{i,j}$. The simplest null model for $M_{i,j}$ is that there is no biological variation in the mRNA levels for gene $i$ in the population; in other words, imagine that every single cell started out with the same number of copies of mRNA for that gene (i.e. $M_{i,j} = M_{i,k}\ \forall\, j, k$). Of course, even in this scenario there would be some variation in the resulting scRNA-seq data, since the sampling process is binomial and cells would differ simply because the capture process is stochastic. Our ultimate goal is to identify genes whose expression pattern is inconsistent with this null expectation; in other words, genes whose variation within the data cannot be explained purely by the stochastic process of mRNA capture during the experiment, i.e. technical noise. For simplicity, we call this constant amount of mRNA in each cell $M_i$, since under the null model it no longer depends on the identity of the cell.

Under this null model, every cell contains $M_i$ copies of the mRNA for gene $i$, so the probability that a given cell registers at least one UMI for that gene is $1 - (1 - p_c)^{M_i}$. Across $N$ cells, the expected number of cells expressing gene $i$, which we denote $n_i$, is therefore

$$\mathrm{E}\left[n_i\right] = N\left[1 - \left(1 - p_c\right)^{M_i}\right].$$

p-value calculation under the null model

Given this expectation, a p-value can be assigned to each gene by asking how likely it is, under the binomial null model, to observe a number of expressing cells at least as extreme as the $n_i$ actually seen in the data.

Since any given dataset contains a large number of genes (~20,000 or so), applying this model to define a set of "significant" DDGs requires correction for multiple hypothesis testing. As is standard for this type of approach, we apply the Benjamini-Hochberg procedure to control the false discovery rate. One can therefore specify the false discovery rate one is willing to tolerate: based on our results for the ERCC controls, we set this to 1% for the calculations presented here, but the threshold can be made more or less stringent depending on the circumstances.

Of course, this null model is extremely simplistic, as it posits absolutely no underlying variation in the "true" amount of mRNA in each cell, $M_{i,j}$; it is thus a model for cases where variation in the observed mRNA levels can be explained as purely technical. Interestingly, our results reveal that, despite the simplicity of the model, many datasets contain a large number of genes whose variation is indistinguishable from this null model. This includes a large number of HVGs, which is perhaps not surprising given the low expression levels observed in these gene sets. Nonetheless, it is important to note that many such genes have expression patterns that cannot be statistically distinguished from purely technical variation. One could, of course, relax the constraint that $M_{i,j}$ is constant across cells. This would be equivalent to specifying some null distribution of $M_{i,j}$ values in which the variation in the starting amounts is real but not biologically interesting. Doing so, however, requires making strong assumptions about what such uninteresting variation might look like. We therefore focus here on the simple null model that identifies cases where all the observed variation could be explained purely by technical noise, and leave the exploration of alternative models to future work.
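To make this concrete, the following is a minimal sketch of how a binomial test of this kind, followed by Benjamini-Hochberg correction, could be implemented. The use of the number of expressing cells as the test statistic follows the quantities defined above, but the estimate of the constant null abundance $M_i$ (rescaling total observed counts by the capture probability) and the two-sided form of the test are illustrative assumptions, not a description of the authors' exact implementation, which is available at the repository linked below.

```python
import numpy as np
from scipy.stats import binom
from statsmodels.stats.multitest import multipletests

def ddg_style_test(counts, p_c=0.05, fdr=0.01):
    """Flag genes whose number of expressing cells deviates from a binomial null.

    counts : (n_cells, n_genes) array of raw UMI counts
    p_c    : assumed effective capture probability
    fdr    : Benjamini-Hochberg false discovery rate threshold
    """
    n_cells = counts.shape[0]
    # Assumed estimate of the constant per-cell abundance M_i under the null.
    M_i = np.ceil(counts.sum(axis=0) / (p_c * n_cells)).astype(int)
    # Probability that a cell shows at least one UMI for gene i under the null.
    q_i = 1.0 - (1.0 - p_c) ** M_i
    # Observed number of expressing cells per gene.
    n_expr = (counts > 0).sum(axis=0)
    # Two-sided p-value from the binomial distribution of the number of expressing cells.
    lower = binom.cdf(n_expr, n_cells, q_i)
    upper = binom.sf(n_expr - 1, n_cells, q_i)
    pvals = np.minimum(1.0, 2.0 * np.minimum(lower, upper))
    # Benjamini-Hochberg correction across all genes.
    significant, qvals, _, _ = multipletests(pvals, alpha=fdr, method="fdr_bh")
    return significant, qvals

# Toy usage: 500 cells x 3 genes drawn from the null model itself.
rng = np.random.default_rng(0)
toy_counts = rng.binomial(n=[30, 30, 30], p=0.05, size=(500, 3))
flagged, _ = ddg_style_test(toy_counts, p_c=0.05)
```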

All code for calculation of DDG p-values may be found at:

https://github.com/DeedsLab/Differentially-Distributed-Genes

Gaussian Mixture Model

To test the DDG model described above, we generated a simple GMM that allows us to specify a priori a set of genes that have meaningful biological variation and a set of genes that do not. The idea is to sample a set of values of $M_{i,j}$ for a group of cells from a Gaussian, simulating a distribution of "ground truth" mRNA values, and then apply the binomial capture process to these mRNA levels to obtain simulated "observed" $m_{i,j}$ values. To do this, we created a simulated dataset with 3 distinct cell types and 1,000 total genes. Each cell type has 300 "marker genes." A marker gene has two states, "high" and "low": it is high in cells of the corresponding type and low in all other cells. In other words, if we call our three cell types A, B, and C, then for cell type A there are 300 genes with high expression in every cell of type A and low expression in every cell of types B and C. Since there are 300 marker genes for each cell type, this leaves 100 "ubiquitously" expressed genes, which we took to be low in all cells.

To generate this simulated data, we used a different Gaussian for each of the two states, high and low. To estimate parameters for these Gaussians, we used the data from the A20 and NIH3T3 cell lines discussed above, since these are very clearly different cells with many differentially expressed marker genes. We analyzed the count distributions of genes that were both relatively highly expressed (observed average expression levels of about 1) and significantly differentially expressed between the two cell types according to the standard Wilcoxon rank-sum test. Based on these data, we set the mean to 35 and the standard deviation to 2 for genes in the "high" expression state, and used a mean of 15 and a standard deviation of 1 for genes in the "low" expression state. Note that, since counts are inherently discrete and this GMM produces non-integer values, we used the standard round() function to round each sampled number to the nearest integer.

To generate the simulated dataset, we independently sampled values for all 1,000 genes in cells of types A, B, and C, simulating 3,000 cells of each type for a total of 9,000 cells. After generating a gene expression vector for each cell, we simulated the experiment by independently applying the binomial capture model described above to each gene in each cell. In other words, the simulation provided an $M_{i,j}$ value for each gene in each cell, from which we sampled observed $m_{i,j}$ values using the binomial sampling process described above, with a capture probability of 5% for this particular simulation. The resulting simulated data were then subjected to the DDG analysis pipeline described above.
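Because the simulation is fully specified by the parameters above, it is straightforward to reproduce; the NumPy sketch below mirrors the stated design (three cell types, 300 marker genes each plus 100 ubiquitous genes, a "high" state of roughly N(35, 2), a "low" state of roughly N(15, 1), 3,000 cells per type, and 5% binomial capture). It is an illustrative re-implementation of the described procedure, not the authors' original code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_types, cells_per_type = 3, 3000
genes_per_type, n_ubiquitous = 300, 100
n_genes = n_types * genes_per_type + n_ubiquitous   # 1,000 genes in total
p_capture = 0.05

def ground_truth_counts(mean, sd, shape):
    """Sample 'ground truth' mRNA counts from a Gaussian, rounded to non-negative integers."""
    return np.clip(np.round(rng.normal(mean, sd, shape)), 0, None).astype(int)

blocks = []
for t in range(n_types):
    # Every gene starts in the "low" state for this cell type...
    M = ground_truth_counts(15, 1, (cells_per_type, n_genes))
    # ...except this type's own 300 marker genes, which are in the "high" state.
    markers = slice(t * genes_per_type, (t + 1) * genes_per_type)
    M[:, markers] = ground_truth_counts(35, 2, (cells_per_type, genes_per_type))
    blocks.append(M)

M_true = np.vstack(blocks)                           # 9,000 cells x 1,000 genes
labels = np.repeat(["A", "B", "C"], cells_per_type)  # cell type of each simulated cell

# Simulate the scRNA-seq experiment: each molecule is captured with probability p_capture.
m_observed = rng.binomial(M_true, p_capture)
```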

Supporting information

S1 Fig. A-C) Kernel density estimation of the distribution of average gene expression across cells for each gene cluster group in A) mouse bladder, B) mouse kidney, and C) Planaria. D-F) Kernel density estimation of the distribution of the fraction of cells in which each gene in the gene cluster is detected, for D) mouse bladder, E) mouse kidney, and F) Planaria.

https://doi.org/10.1371/journal.pcbi.1012386.s001

S2 Fig. A simulation scheme to test ground-truth marker gene recovery by DDG and HVG models.

A Gaussian mixture model (GMM) with 3 cell types, 3,000 cells per cell type, and 1,000 total genes was generated. For each cell type, 300 genes were chosen as "marker genes" and their count values were drawn from a Gaussian with a mean of 35 and standard deviation of 2. If a gene is not a marker gene for a given cell type, its count value was drawn from a Gaussian with a mean of 15 and standard deviation of 1. As a result, there are a total of 900 ground-truth marker genes, 300 for each cell type. The model was then down-sampled by performing a Bernoulli trial for each gene count in the data, varying the "capture probability" from 1% to 50%. A) Fraction of counts with a value of 0 in each down-sampled trial. B) Fraction of genes identified as DDGs when our DDG model was applied to each down-sampled version of the GMM; here, the capture probability was set to the exact value used in the simulated experiment. C) Fraction of ground-truth marker genes contained in the DDG set when the DDG model used the accurate capture probability. D) Fraction of ground-truth marker genes contained in the HVG set when the HVG model was specified to recover 900 HVGs for each down-sampled trial. E) Fraction of genes identified as DDGs when the DDG model parameter for capture probability was set at 5%. F) Fraction of ground-truth marker genes contained in the DDG set when the DDG model parameter for capture probability was set at 5%.

https://doi.org/10.1371/journal.pcbi.1012386.s002

S3 Fig. A-C) Mean fraction of significant features over increasing tissue complexity using A) the DDG method, B) the NBDrop method, and C) the M3Drop method. D-F) Quality control metrics for the limb bud data. D) Mean total number of genes for each sample, with error bars depicting confidence intervals around the mean. E) Mean total number of UMI counts for each sample, with error bars depicting confidence intervals around the mean. F) Distribution of UMI counts per cell in each limb bud sample.

https://doi.org/10.1371/journal.pcbi.1012386.s003

S4 Fig. Modeling the effect of “sequencing depth” on feature selection.

A20 & NIH3T3 cell line data (A,B) and Zheng 9 lymphocyte data (C,D) was down-sampled by performing a Bernoulli trial for each UMI count in the data, using a 50% or 90% rate. Each experiment was repeated 10 times, and standard deviation bars are plotted on each graph. A) Fraction of the original A20 & NIH3T3 DDG set that was recovered in the down-sampled cases. B) Fraction of the original A20 & NIH3T3 HVG set that was recovered in the down-sampled cases. C) Fraction of the original Zheng 9 DDG set that was recovered in the down-sampled cases. D) Fraction of the original Zheng 9 HVG set that was recovered in the down-sampled cases.

https://doi.org/10.1371/journal.pcbi.1012386.s004

S5 Fig. The effect of the capture probability parameter on DDG selection. A-I) For each dataset, the capture probability parameter in the DDG model, $p_c$, was titrated from 1% up to 80%. Bar graphs illustrate the number of DDGs identified at different $p_c$ values for A) 10x scRNA-seq data generated from Cytotoxic T cells purified by FACS, B) 10x scRNA-seq data generated from the full set of Zheng 9 lymphocytes, C) Citeseq data generated from spleen and lymph node cells, D) 10x scRNA-seq data generated from the A20 cell line, E) 10x scRNA-seq data generated from the NIH3T3 cell line, F) 10x scRNA-seq data generated from the A20 and NIH3T3 cell lines, G) 10x scRNA-seq data generated from the Raji B cell line, H) 10x scRNA-seq data generated from the Jurkat T cell line, and I) 10x scRNA-seq data generated from the Raji B and Jurkat T cell lines.

https://doi.org/10.1371/journal.pcbi.1012386.s005

S6 Fig. Modeling the effect of number of cells on DDG selection.

A-D) Empirical power estimations for the DDG method using the Zheng-9 lymphocyte mix, where a new set of DDGs was computed with increasing sample size. A) Mean number of DDGs as a function of increasing number of cells per cell type in the Zheng-9 lymphocyte mix. B) Mean total number of genes as a function of increasing cell number per sample. C) Mean overlap of new DDG sets with original DDG set calculated from the full, 5k cells per type Zheng-9 lymphocyte data. D) Mean overlap of new DDG sets with original Wilcoxon set of differentially expressed genes, as a function of number of cells per cell type.

https://doi.org/10.1371/journal.pcbi.1012386.s006

S7 Fig. Comparison of feature sets with historically established marker genes.

Cell-type-specific marker gene lists were downloaded from the Panglao database and used for this analysis. A) Number of genes in each feature set for the Zheng 9 lymphocytes. B) Fraction of the marker gene set that is recovered in each feature set of the Zheng 9 lymphocytes. C) Quantification of FACS label recovery when the full set of marker genes was added to each feature set or when each feature set was used alone. Louvain clustering was performed across a titration of resolution parameters, cluster labels were compared to original FACS labels with T-cell lineages merged into two supersets, and the highest adjusted Rand index is plotted when log+1-CPM-PCA counts were used as a basis. D) Fraction of the marker gene set that is recovered in each feature set of the pancreas data. E) Fraction of the marker gene set that is recovered in each feature set of the dentate gyrus data.

https://doi.org/10.1371/journal.pcbi.1012386.s007

S8 Fig. A) Venn diagram of set overlap for all feature sets calculated for the Zheng-9 lymphocyte mix. B) Venn diagrams of set overlap for the DDGs and the two other binomial-based feature selection methods applied to the lymphocyte data. C,D) Scatter plots of the mean mRNA count versus dispersion per gene, colored by set membership for feature sets in the lymphocyte data; C) compares the Townes genes and DDGs while D) compares the NBDrop genes and DDGs.

https://doi.org/10.1371/journal.pcbi.1012386.s008

S9 Fig. A-F) tSNE projections for dimensionality-reduced data from the Zheng-9 lymphocyte mix. Principal component analysis was used to reduce (A,D) all genes, (B,E) HVGs, or (C,F) DDGs after log+1 and CPM transforming the count data. Cells are colored either by original FACS-based set, with T-cell lineages merged into two supersets (A-C), or by Louvain cluster membership (D-F). G-I) Quantification of FACS label recovery by various methods of dimensionality reduction across different feature sets. Louvain clustering was performed across a titration of resolution parameters, cluster labels were compared to original FACS labels with T-cell lineages merged into two supersets, and the highest adjusted Rand index is plotted when G) raw UMI counts, H) PCA-transformed counts, or I) log+1-CPM-PCA counts were used as a basis.

https://doi.org/10.1371/journal.pcbi.1012386.s009

S10 Fig. A-F) Quantification of FACS label recovery by various methods of dimensionality reduction across different feature sets. Louvain clustering was performed across a range of resolution parameters, cluster labels were compared to original FACS labels with T-cell lineages merged into two supersets, and the highest adjusted Rand index is plotted when A) raw UMI counts of HVGs, B) PCA-transformed HVG counts, C) log+1-CPM-PCA HVG counts, D) raw UMI counts of DDGs, E) PCA-transformed DDG counts, or F) log+1-CPM-PCA DDG counts were used as a basis.

https://doi.org/10.1371/journal.pcbi.1012386.s010

S11 Fig. A-C) Quantification of orthogonal label recovery by various methods of dimensionality reduction across different feature sets. Louvain clustering was performed across a titration of resolution parameters, cluster labels were compared to cell-type labels, and the highest adjusted Rand index is plotted when A) log+1-CPM-PCA counts from the multiplexed A20 and NIH3T3 cell lines, B) log+1-CPM-PCA counts from the multiplexed B and T cell lines, or C) log+1-CPM-PCA counts from the cell-surface-protein-gated Citeseq data were used as a basis.

https://doi.org/10.1371/journal.pcbi.1012386.s011

S12 Fig. A-G) Quantification of FACS label recovery by various methods of dimensionality reduction across different feature sets and different Louvain clustering resolution parameters for the Zheng 9 lymphocyte data. Louvain clustering was performed across a titration of resolution parameters, cluster labels were compared to original FACS labels with T-cell lineages merged into two supersets, and the adjusted Rand index is plotted when A) log+1-CPM-PCA counts of 2e3 HVGs, B) 4e3 HVGs, or C) 8e3 HVGs, or log+1-CPM-PCA counts of DDGs identified with a $p_c$ of D) 1%, E) 2%, F) 5%, or G) 10% were used as a basis.

https://doi.org/10.1371/journal.pcbi.1012386.s012

S13 Fig. A-H) Quantification of multiplex label recovery by various methods of dimensionality reduction across different feature sets and different Louvain clustering resolution parameters for the multiplexed A20 and NIH3T3 cell line data. Louvain clustering was performed across a titration of resolution parameters, cluster labels were compared to original cell-type labels, and the adjusted Rand index is plotted when A) log+1-CPM-PCA counts of all genes, B) log+1-CPM-PCA counts of 2e3 HVGs, C) 4e3 HVGs, or D) 8e3 HVGs, or log+1-CPM-PCA counts of DDGs identified with a $p_c$ of E) 1%, F) 2%, G) 5%, or H) 10% were used as a basis.

https://doi.org/10.1371/journal.pcbi.1012386.s013

S14 Fig. A-H) Quantification of multiplex label recovery by various methods of dimensionality reduction across different feature sets and different Louvain clustering resolution parameters for the multiplexed B and T cell line data. Louvain clustering was performed across a titration of resolution parameters, cluster labels were compared to original cell-type labels, and the adjusted Rand index is plotted when A) log+1-CPM-PCA counts of all genes, B) log+1-CPM-PCA counts of 2e3 HVGs, C) 4e3 HVGs, or D) 8e3 HVGs, or log+1-CPM-PCA counts of DDGs identified with a $p_c$ of E) 1%, F) 2%, G) 5%, or H) 10% were used as a basis.

https://doi.org/10.1371/journal.pcbi.1012386.s014

S15 Fig. RNA velocity analysis of Pancreas data using different feature sets.

A-C) RNA velocity was performed using the scVelo software package with HVGs as the basis. A) UMAP projection of pancreas cells colored by HVG-based velocity and cluster membership. B) UMAP projection of pancreas cells colored by HVG-inferred latent time. C) Heatmap of gene counts for the top 300 genes contributing to the HVG-based latent time estimate, where each row represents a gene and each column represents an individual cell ordered along the latent time axis. D-F) RNA velocity was performed using the scVelo software package with DDGs as the basis. D) UMAP projection of pancreas cells colored by DDG-based velocity and cluster membership. E) UMAP projection of pancreas cells colored by DDG-inferred latent time. F) Heatmap of gene counts for the top 300 genes contributing to the DDG-based latent time estimate, where each row represents a gene and each column represents an individual cell ordered along the latent time axis. G-I) RNA velocity was performed using the scVelo software package with 2e3 random genes as the basis. G) UMAP projection of pancreas cells colored by random-gene-based velocity and cluster membership. H) UMAP projection of pancreas cells colored by random-gene-inferred latent time. I) Heatmap of gene counts for the top 300 genes contributing to the random-gene-based latent time estimate, where each row represents a gene and each column represents an individual cell ordered along the latent time axis. J-L) Scatter plots of latent times assigned to each cell for J) HVGs and DDGs, K) HVGs and random genes, and L) DDGs and random genes. M) Pearson correlations of the latent times inferred for each pair of feature sets compared.

https://doi.org/10.1371/journal.pcbi.1012386.s015

S16 Fig. RNA velocity analysis of Dentate Gyrus data using different feature sets.

A-C) RNA velocity was performed using the scVelo software package with HVGs as the basis. A) UMAP projection of dentate gyrus cells colored by HVG-based velocity and cluster membership. B) UMAP projection of dentate gyrus cells colored by HVG-inferred latent time. C) Heatmap of gene counts for the top 300 genes contributing to the HVG-based latent time estimate, where each row represents a gene and each column represents an individual cell ordered along the latent time axis. D-F) RNA velocity was performed using the scVelo software package with DDGs as the basis. D) UMAP projection of dentate gyrus cells colored by DDG-based velocity and cluster membership. E) UMAP projection of dentate gyrus cells colored by DDG-inferred latent time. F) Heatmap of gene counts for the top 300 genes contributing to the DDG-based latent time estimate, where each row represents a gene and each column represents an individual cell ordered along the latent time axis. G-I) RNA velocity was performed using the scVelo software package with 2e3 random genes as the basis. G) UMAP projection of dentate gyrus cells colored by random-gene-based velocity and cluster membership. H) UMAP projection of dentate gyrus cells colored by random-gene-inferred latent time. I) Heatmap of gene counts for the top 300 genes contributing to the random-gene-based latent time estimate, where each row represents a gene and each column represents an individual cell ordered along the latent time axis. J-L) Scatter plots of latent times assigned to each cell for J) HVGs and DDGs, K) HVGs and random genes, and L) DDGs and random genes. M) Pearson correlations of the latent times inferred for each pair of feature sets compared.

https://doi.org/10.1371/journal.pcbi.1012386.s016

Acknowledgments

The authors thank Tom Kolokotrones, Roy Wollman, and members of the Deeds lab for many helpful discussions and comments.

  • 48. 40k Mixture of Mouse Cell Lines, Multiplexed Samples, 4 Probe Barcodes. [cited 2024 Aug 29]. Database: 10x Genomics [internet]. Available from: https://www.10xgenomics.com/datasets/40k-mixture-of-mouse-cell-lines-multiplexed-samples-4-probe-barcodes-1-standard
  • 49. 10k 1:1 Mixture of Raji and Jurkat Cells Multiplexed, 2 CMOs. [cited 2024 Aug 29]. Database: 10x Genomics [internet]. Available from: https://www.10xgenomics.com/datasets/10-k-1-1-mixture-of-raji-and-jurkat-cells-multiplexed-2-cm-os-3-1-standard-6-0-0
