| Test | Scenario | Interpretation |
| --- | --- | --- |
| Z-test | Used when dealing with large sample sizes or when the population standard deviation is known. | A small p-value (less than 0.05) indicates strong evidence against the null hypothesis, leading to its rejection. |
| t-test | Appropriate for small sample sizes or when the population standard deviation is unknown. | Interpreted similarly to the Z-test: a small p-value leads to rejection of the null hypothesis. |
| Chi-square test | Used for tests of independence or goodness-of-fit. | A small p-value indicates a significant association between the categorical variables, leading to rejection of the null hypothesis. |
| F-test | Commonly used in Analysis of Variance (ANOVA) to compare variances between groups. | A small p-value suggests that at least one group mean differs from the others, leading to rejection of the null hypothesis. |
| Correlation test (e.g., Pearson's r) | Measures the strength and direction of a linear relationship between two continuous variables. | A small p-value indicates a significant linear relationship between the variables, leading to rejection of the null hypothesis that there is no correlation. |
In general, a small p-value indicates that the observed data is unlikely to have occurred by random chance alone, which leads to the rejection of the null hypothesis. However, it’s crucial to choose the appropriate test based on the nature of the data and the research question, as well as to interpret the p-value in the context of the specific test being used.
The table below shows the importance of the p-value and the kinds of errors that can occur during hypothesis testing.

| | Fail to reject H0 | Reject H0 |
| --- | --- | --- |
| Null hypothesis is true | Correct decision | Type I error |
| Null hypothesis is false | Type II error | Correct decision |

Type I error: incorrect rejection of the null hypothesis. Its probability is denoted by α (the significance level). Type II error: incorrect acceptance (failure to reject) of the null hypothesis. Its probability is denoted by β; the power of the test is 1 − β.
A researcher wants to investigate whether there is a significant difference in mean height between males and females in a population of university students.
Suppose we have the following data:
Let us walk through the process of calculating the p-value step by step.
H0: There is no significant difference in mean height between males and females.
H1: There is a significant difference in mean height between males and females.
The appropriate test statistic for this scenario is the two-sample t-test, which compares the means of two independent groups.
The t-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.
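As a rough sketch of that calculation, the two-sample t-statistic can be computed directly from summary statistics. The means, standard deviations, and sample sizes below are hypothetical values chosen only to illustrate the formula; they are not the data from this example:

```python
import math

# Hypothetical summary statistics (not the actual study data)
mean_m, sd_m, n_m = 175.0, 7.0, 32   # males
mean_f, sd_f, n_f = 168.0, 6.0, 33   # females

# Standard error of the difference between the two sample means
se_diff = math.sqrt(sd_m**2 / n_m + sd_f**2 / n_f)

# t-statistic: difference between sample means relative to its standard error
t_stat = (mean_m - mean_f) / se_diff
print(round(t_stat, 2))  # 4.32
```

With real data, the same arithmetic (or a library routine) would yield the t-statistic reported below.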
So, the calculated two-sample t-test statistic (t) is approximately 5.13.
The t-distribution is used for the two-sample t-test. The degrees of freedom for the t-distribution are determined by the sample sizes of the two groups.
The t-distribution is a probability distribution with tails that are thicker than those of the normal distribution.
The degrees of freedom (63) represent the variability available in the data to estimate the population parameters. In the context of the two-sample t-test, higher degrees of freedom provide a more precise estimate of the population variance, influencing the shape and characteristics of the t-distribution.
The t-distribution is symmetric and bell-shaped, similar to the normal distribution. As the degrees of freedom increase, the t-distribution approaches the shape of the standard normal distribution. Practically, it affects the critical values used to determine statistical significance and confidence intervals.
Step 5: Calculate the Critical Value.
To find the critical t-value for 63 degrees of freedom at a chosen significance level, we can either consult a t-table or use statistical software, such as the scipy.stats module in Python.
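A minimal sketch of that lookup with scipy.stats, assuming a two-tailed test at a significance level of 0.05 and the 63 degrees of freedom from this example:

```python
from scipy import stats

alpha = 0.05
df = 63

# Two-tailed critical value: the t quantile leaving alpha/2 in each tail
t_critical = stats.t.ppf(1 - alpha / 2, df)
print(round(t_critical, 3))  # close to 2.0
```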
Comparing with T-Statistic:
Since the calculated t-statistic (5.13) is much larger than the critical value, the observed difference between the sample means is unlikely to have occurred by random chance alone. Therefore, we reject the null hypothesis.
In case the significance level is not specified, consider the below general inferences while interpreting your results.
Graphically, the p-value corresponds to the area in the tails of the sampling distribution beyond the observed test statistic. [As shown in Fig 1]
Fig 1: Graphical Representation
The p-value in hypothesis testing is influenced by several factors:
Understanding these factors is crucial for interpreting p-values accurately and making informed decisions in hypothesis testing.
The p-value is a crucial concept in statistical hypothesis testing, serving as a guide for making decisions about the significance of the observed relationship or effect between variables.
Let’s consider a scenario where a tutor believes that the average exam score of their students is equal to the national average (85). The tutor collects a sample of exam scores from their students and performs a one-sample t-test to compare it to the population mean (85).
Since 0.7059 > 0.05, we fail to reject the null hypothesis. This means that, based on the sample data, there isn't enough evidence to claim a significant difference between the exam scores of the tutor's students and the national average. The tutor would retain the null hypothesis, suggesting that the average exam score of their students is statistically consistent with the national average.
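A sketch of such a one-sample t-test with scipy; the exam scores below are hypothetical stand-ins (the tutor's actual data, which produced p = 0.7059, is not shown in this example):

```python
from scipy import stats

# Hypothetical exam scores from the tutor's students
scores = [82, 88, 85, 90, 80, 87, 84, 86, 83, 85]

# One-sample t-test against the national average of 85
result = stats.ttest_1samp(scores, popmean=85)
print(result.pvalue > 0.05)  # True: fail to reject the null hypothesis
```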
The p-value is a crucial concept in statistical hypothesis testing, providing a quantitative measure of the strength of evidence against the null hypothesis. It guides decision-making by comparing the p-value to a chosen significance level, typically 0.05. A small p-value indicates strong evidence against the null hypothesis, suggesting a statistically significant relationship or effect. However, the p-value is influenced by various factors and should be interpreted alongside other considerations, such as effect size and context.
Can a p-value be greater than 1?
A p-value is a probability, and probabilities must be between 0 and 1. Therefore, a p-value greater than 1 is not possible.
What does a p-value of 0.01 mean?
It means that the observed test statistic is unlikely to occur by chance if the null hypothesis is true: there is a 1% chance of observing the test statistic, or a more extreme one, under the null hypothesis.
What is a good p-value?
A p-value less than or equal to 0.05 is typically taken as evidence against the null hypothesis, indicating that the observed relationship or effect is statistically significant.
What does the p-value of a model parameter represent?
It is a measure of the statistical significance of a parameter in the model. It represents the probability of obtaining the observed value of the parameter, or a more extreme one, assuming the null hypothesis is true.
What does a low p-value mean?
A low p-value means that the observed test statistic is unlikely to occur by chance if the null hypothesis is true. It suggests that the observed relationship or effect is statistically significant and not due to random sampling variation.
When comparing results in hypothesis testing, a lower p-value indicates stronger evidence against the null hypothesis.
In relation to machine learning, linear regression is defined as a predictive modeling technique that allows us to build a model which can help predict continuous response variables as a function of a linear combination of explanatory or predictor variables. While training linear regression models, we need to rely on hypothesis testing to determine the relationship between the response and predictor variables. In the case of the linear regression model, two types of hypothesis testing are done: t-tests and F-tests. In other words, two types of statistics are used to assess whether a linear regression model exists relating the response and predictor variables: t-statistics and F-statistics. As data scientists, it is of utmost importance to determine whether linear regression is the correct choice of model for a particular problem, and this can be done by performing hypothesis testing on the linear regression response and predictor variables. Often, these concepts are not very clear to many data scientists. In this blog post, we will discuss linear regression and the hypothesis testing related to t-statistics and F-statistics, and provide an example to illustrate how these concepts work.
A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.
There are two different kinds of linear regression models. They are as follows:
While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function .
The residual [latex]e_i[/latex] of the ith observation is represented as follows, where [latex]Y_i[/latex] is the ith observed value of the response variable and [latex]\hat{Y_i}[/latex] is the prediction for the ith observation.
[latex]e_i = Y_i - \hat{Y_i}[/latex]
The residual sum of squares can be represented as the following:
[latex]RSS = e_1^2 + e_2^2 + e_3^2 + \ldots + e_n^2[/latex]
The least-squares method represents the algorithm that minimizes the above term, RSS.
Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only the estimates and thus, there will be standard errors associated with each of the coefficients. Recall that the standard error is used to calculate the confidence interval in which the mean value of the population parameter would exist. In other words, it represents the error of estimating a population parameter based on the sample data. The value of the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.
[latex]SE(\mu) = \frac{\sigma}{\sqrt{N}}[/latex]
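For instance, a quick sketch of evaluating this standard-error formula (the σ and N values below are made up for illustration):

```python
import math

sigma = 10.0   # hypothetical population standard deviation
n = 25         # hypothetical sample size

# Standard error of the mean: sigma divided by the square root of N
se = sigma / math.sqrt(n)
print(se)  # 2.0
```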
Thus, without analyzing aspects such as the standard error associated with the coefficients, it cannot be claimed that the linear regression coefficients are the most suitable ones. This is where hypothesis testing is needed. Before we get into why hypothesis testing is needed for the linear regression model, let's briefly review what hypothesis testing is.
Before getting into the hypothesis testing concepts in relation to the linear regression model, let's train a multivariate or multiple linear regression model and print the summary output of the model, which will be referred to in the next section.
The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:
install.packages("mlbench")
library(mlbench)
data("BostonHousing")
Once the data is loaded, the code shown below can be used to create the linear regression model.
attach(BostonHousing)

BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)

summary(BostonHousing.lm)
Executing the above command will result in the creation of a linear regression model with the response variable as medv and predictor variables as crim, chas, rad, and lstat. The following represents the details related to the response and predictor variables:
The following is the output of the summary command, which prints the details relating to the model, including hypothesis-testing details for the coefficients (t-statistics) and for the model as a whole (F-statistics).
Hypothesis tests are statistical procedures used to test a claim or assumption about the underlying distribution of a population based on sample data. Here are the key steps of doing hypothesis tests with linear regression models:
The reasons why we need to do hypothesis tests in the case of a linear regression model are the following:
While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant. First, the coefficient for each predictor variable is estimated. Then, an individual hypothesis test is done for each coefficient to determine whether the relationship between the response and that particular predictor variable is statistically significant, based on the sample data used for training the model. If the null hypothesis for a coefficient is rejected, this indicates that there is evidence of a relationship between the response and that predictor variable. T-statistics are used for this hypothesis testing because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table to decide whether to reject the null hypothesis about the relationship between the response and predictor variable. If the value falls in the critical region, the null hypothesis is rejected, which means there is a statistically significant relationship between the response and that predictor variable. In addition to t-tests, an F-test is performed to test the null hypothesis that no linear regression model exists, i.e., that the values of all the coefficients are zero (0). Learn more about linear regression and the t-test in this blog – Linear regression t-test: formula, example.
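In symbols (a standard formulation, stated here for reference rather than quoted from this post), the t-statistic for an individual coefficient and the overall F-statistic can be written as:

[latex]t = \frac{\hat{\beta_j}}{SE(\hat{\beta_j})}[/latex]

[latex]F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}[/latex]

where [latex]\hat{\beta_j}[/latex] is the estimated coefficient for the jth predictor, TSS is the total sum of squares, RSS is the residual sum of squares, p is the number of predictors, and n is the number of observations.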
Salvatore S. Mangiafico
Traditionally when students first learn about the analysis of experiments, there is a strong focus on hypothesis testing and making decisions based on p-values. Hypothesis testing is important for determining if there are statistically significant effects. However, readers of this book should not place undue emphasis on p-values. Instead, they should realize that p-values are affected by sample size, and that a low p-value does not necessarily suggest a large effect or a practically meaningful effect. Summary statistics, plots, effect size statistics, and practical considerations should be used. The goal is to determine: a) statistical significance, b) effect size, c) practical importance. These are all different concepts, and they will be explored below.
Most of what we’ve covered in this book so far is about producing descriptive statistics: calculating means and medians, plotting data in various ways, and producing confidence intervals. The bulk of the rest of this book will cover statistical inference: using statistical tests to draw some conclusion about the data. We’ve already done this a little bit in earlier chapters by using confidence intervals to conclude if means are different or not among groups.
As Dr. Nic mentions in her article in the “References and further reading” section, this is the part where people sometimes get stumped. It is natural for most of us to use summary statistics or plots, but jumping to statistical inference needs a little change in perspective. The idea of using some statistical test to answer a question isn’t a difficult concept, but some of the following discussion gets a little theoretical. The video from the Statistics Learning Center in the “References and further reading” section does a good job of explaining the basis of statistical inference.
One important thing to gain from this chapter is an understanding of how to use the p -value, alpha , and decision rule to test the null hypothesis. But once you are comfortable with that, you will want to return to this chapter to have a better understanding of the theory behind this process.
Another important thing is to understand the limitations of relying on p -values, and why it is important to assess the size of effects and weigh practical considerations.
The packages used in this chapter include:
The following commands will install these packages if they are not already installed:
if(!require(lsr)){install.packages("lsr")}
The null and alternative hypotheses.
The statistical tests in this book rely on testing a null hypothesis, which has a specific formulation for each test. The null hypothesis always describes the case where e.g. two groups are not different or there is no correlation between two variables, etc.
The alternative hypothesis is the contrary of the null hypothesis, and so describes the cases where there is a difference among groups or a correlation between two variables, etc.
Notice that the definitions of null hypothesis and alternative hypothesis have nothing to do with what you want to find or don't want to find, or what is interesting or not interesting, or what you expect to find or what you don’t expect to find. If you were comparing the height of men and women, the null hypothesis would be that the height of men and the height of women were not different. Yet, you might find it surprising if you found this hypothesis to be true for some population you were studying. Likewise, if you were studying the income of men and women, the null hypothesis would be that the income of men and women are not different, in the population you are studying. In this case you might be hoping the null hypothesis is true, though you might be unsurprised if the alternative hypothesis were true. In any case, the null hypothesis will take the form that there is no difference between groups, there is no correlation between two variables, or there is no effect of this variable in our model.
Most of the tests in this book rely on using a statistic called the p -value to evaluate if we should reject, or fail to reject, the null hypothesis.
Given the assumption that the null hypothesis is true , the p -value is defined as the probability of obtaining a result equal to or more extreme than what was actually observed in the data.
We’ll unpack this definition in a little bit.
The p -value for the given data will be determined by conducting the statistical test.
This p -value is then compared to a pre-determined value alpha . Most commonly, an alpha value of 0.05 is used, but there is nothing magic about this value.
If the p -value for the test is less than alpha , we reject the null hypothesis.
If the p -value is greater than or equal to alpha , we fail to reject the null hypothesis.
For an example of using the p -value for hypothesis testing, imagine you have a coin you will toss 100 times. The null hypothesis is that the coin is fair—that is, that it is equally likely that the coin will land on heads as land on tails. The alternative hypothesis is that the coin is not fair. Let’s say for this experiment you throw the coin 100 times and it lands on heads 95 times out of those hundred. The p -value in this case would be the probability of getting 95, 96, 97, 98, 99, or 100 heads, or 0, 1, 2, 3, 4, or 5 heads, assuming that the null hypothesis is true .
This is what we call a two-sided test, since we are testing both extremes suggested by our data: getting 95 or greater heads or getting 95 or greater tails. In most cases we will use two sided tests.
You can imagine that the p -value for this data will be quite small. If the null hypothesis is true, and the coin is fair, there would be a low probability of getting 95 or more heads or 95 or more tails.
Using a binomial test, the p -value is < 0.0001.
(Actually, R reports it as < 2.2e-16, which is shorthand for the number in scientific notation, 2.2 × 10^-16, which is 0.00000000000000022, with 15 zeros after the decimal point.)
Assuming an alpha of 0.05, since the p -value is less than alpha , we reject the null hypothesis. That is, we conclude that the coin is not fair.
binom.test(5, 100, 0.5)
Exact binomial test

number of successes = 5, number of trials = 100, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
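For readers working in Python rather than R, the same exact test can be sketched with scipy (assuming scipy >= 1.7, where binomtest is available):

```python
from scipy import stats

# Exact binomial test: 5 heads in 100 tosses against a fair coin (p = 0.5)
result = stats.binomtest(5, n=100, p=0.5)
print(result.pvalue < 0.0001)  # True: far below any conventional alpha
```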
As another example, imagine we are considering two classrooms, and we have counts of students who passed a certain exam. We want to know if one classroom had statistically more passes or failures than the other.
In our example each classroom will have 10 students. The data is arranged into a contingency table.
Classroom  Passed  Failed
A               8       2
B               3       7
We will use Fisher’s exact test to test if there is an association between Classroom and the counts of passed and failed students. The null hypothesis is that there is no association between Classroom and Passed/Failed , based on the relative counts in each cell of the contingency table.
Input =("
Classroom  Passed  Failed
A               8       2
B               3       7
")

Matrix = as.matrix(read.table(textConnection(Input),
                              header=TRUE,
                              row.names=1))

Matrix
  Passed Failed
A      8      2
B      3      7
fisher.test(Matrix)
Fisher's Exact Test for Count Data

p-value = 0.06978
The reported p -value is 0.070. If we use an alpha of 0.05, then the p -value is greater than alpha , so we fail to reject the null hypothesis. That is, we did not have sufficient evidence to say that there is an association between Classroom and Passed/Failed .
More extreme data in this case would be if the counts in the upper left or lower right (or both!) were greater.
Classroom  Passed  Failed
A               9       1
B               3       7

Classroom  Passed  Failed
A              10       0
B               3       7

and so on, with Classroom B...
In most cases we would want to consider as "extreme" not only the results when Classroom A has a high frequency of passing students, but also results when Classroom B has a high frequency of passing students. This is called a two-sided or two-tailed test. If we were only concerned with one classroom having a high frequency of passing students, relatively, we would instead perform a one-sided test. The default for the fisher.test function is two-sided, and usually you will want to use two-sided tests.
Classroom  Passed  Failed
A               2       8
B               7       3

Classroom  Passed  Failed
A               1       9
B               7       3

Classroom  Passed  Failed
A               0      10
B               7       3

and so on, with Classroom B...
In both cases, "extreme" means there is a stronger association between Classroom and Passed/Failed .
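As a cross-check in Python (a sketch; scipy's fisher_exact performs the same two-sided test on a 2 x 2 table):

```python
from scipy import stats

# Contingency table: rows are Classrooms A and B, columns are Passed/Failed
table = [[8, 2],
         [3, 7]]

odds_ratio, p_value = stats.fisher_exact(table, alternative='two-sided')
print(round(p_value, 5))  # ~0.06978, matching the R output above
```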
Wait, does this make any sense?
Recall that the definition of the p-value is: given the assumption that the null hypothesis is true, the probability of obtaining a result equal to or more extreme than what was actually observed in the data.
The astute reader might be asking herself, “If I’m trying to determine if the null hypothesis is true or not, why would I start with the assumption that the null hypothesis is true? And why am I using a probability of getting certain data given that a hypothesis is true? Don’t I want to instead determine the probability of the hypothesis given my data?”
The answer is yes , we would like a method to determine the likelihood of our hypothesis being true given our data, but we use the Null Hypothesis Significance Test approach since it is relatively straightforward, and has wide acceptance historically and across disciplines.
In practice we do use the results of the statistical tests to reach conclusions about the null hypothesis.
Technically, the p -value says nothing about the alternative hypothesis. But logically, if the null hypothesis is rejected, then its logical complement, the alternative hypothesis, is supported. Practically, this is how we handle significant p -values, though this practical approach generates disapproval in some theoretical circles.
Note the language used when testing the null hypothesis. Based on the results of our statistical tests, we either reject the null hypothesis, or fail to reject the null hypothesis.
This is somewhat similar to the approach of a jury in a trial. The jury either finds sufficient evidence to declare someone guilty, or fails to find sufficient evidence to declare someone guilty.
Failing to convict someone isn’t necessarily the same as declaring someone innocent. Likewise, if we fail to reject the null hypothesis, we shouldn’t assume that the null hypothesis is true. It may be that we didn’t have sufficient samples to get a result that would have allowed us to reject the null hypothesis, or maybe there are some other factors affecting the results that we didn’t account for. This is similar to an “innocent until proven guilty” stance.
For the most part, the statistical tests we use are based on probability, and our data could always be the result of chance. Considering the coin flipping example above, if we did flip a coin 100 times and came up with 95 heads, we would be compelled to conclude that the coin was not fair. But 95 heads could happen with a fair coin strictly by chance.
We can, therefore, make two kinds of errors in testing the null hypothesis:
• A Type I error occurs when the null hypothesis really is true, but based on our decision rule we reject the null hypothesis. In this case, our result is a false positive; we think there is an effect (unfair coin, association between variables, difference among groups) when really there isn't. The probability of making this kind of error is alpha, the same alpha we used in our decision rule.
• A Type II error occurs when the null hypothesis is really false, but based on our decision rule we fail to reject the null hypothesis. In this case, our result is a false negative ; we have failed to find an effect that really does exist. The probability of making this kind of error is called beta .
The following table summarizes these errors.
                      Reality
                      ___________________________________
Decision of test      Null is true          Null is false

Reject null           Type I error          Correctly
hypothesis            (prob. = alpha)       reject null
                                            (prob. = 1 – beta)

Retain null           Correctly             Type II error
hypothesis            retain null           (prob. = beta)
                      (prob. = 1 – alpha)
The statistical power of a test is a measure of the ability of the test to detect a real effect. It is related to the effect size, the sample size, and our chosen alpha level.
The effect size is a measure of how unfair a coin is, how strong the association is between two variables, or how large the difference is among groups. As the effect size increases or as the number of observations we collect increases, or as the alpha level increases, the power of the test increases.
Statistical power in the table above is indicated by 1 – beta , and power is the probability of correctly rejecting the null hypothesis.
An example should make these relationships clear. Imagine we are sampling a large group of 7th-grade students for their height. That is, the group is the population, and we are sampling a sub-set of these students. In reality, for students in the population, the girls are taller than the boys, but the difference is small (that is, the effect size is small), and there is a lot of variability in students' heights. You can imagine that in order to detect the difference between girls and boys we would have to measure many students. If we fail to sample enough students, we might make a Type II error. That is, we might fail to detect the actual difference in heights between sexes.
If we had a different experiment with a larger effect size—for example the weight difference between mature hamsters and mature hedgehogs—we might need fewer samples to detect the difference.
Note also that our chosen alpha plays a role in the power of our test, too. All things being equal, across many tests, if we decrease our alpha, that is, insist on a lower rate of Type I errors, we are more likely to commit a Type II error, and so have a lower power. This is analogous to a case of a meticulous jury that has a very high standard of proof to convict someone. In this case, the likelihood of a false conviction is low, but the likelihood of letting a guilty person go free is relatively high.
The level of alpha is traditionally set at 0.05 in some disciplines, though there is sometimes reason to choose a different value.
One situation in which the alpha level is increased is in preliminary studies in which it is better to include potentially significant effects even if there is not strong evidence for keeping them. In this case, the researcher is accepting an inflated chance of Type I errors in order to decrease the chance of Type II errors.
Imagine an experiment in which you wanted to see if various environmental treatments would improve student learning. In a preliminary study, you might have many treatments, with few observations each, and you want to retain any potentially successful treatments for future study. For example, you might try playing classical music, improved lighting, complimenting students, and so on, and see if there is any effect on student learning. You might relax your alpha value to 0.10 or 0.15 in the preliminary study to see what treatments to include in future studies.
On the other hand, in situations where a Type I, false positive, error might be costly in terms of money or people’s health, a lower alpha can be used, perhaps, 0.01 or 0.001. You can imagine a case in which there is an established treatment for cancer, and a new treatment is being tested. Because the new treatment is likely to be expensive and to hold people’s lives in the balance, a researcher would want to be very sure that the new treatment is more effective than the established treatment. In reality, the researchers would not just lower the alpha level, but also look at the effect size, submit the research for peer review, replicate the study, be sure there were no problems with the design of the study or the data collection, and weigh the practical implications.
In theory, as a researcher, you would determine the alpha level you feel is appropriate. That is, the probability of making a Type I error when the null hypothesis is in fact true.
In reality, though, 0.05 is almost always used in most fields for readers of this book. Choosing a different alpha value will rarely go without question. It is best to keep with the 0.05 level unless you have good justification for another value, or are in a discipline where other values are routinely used.
One good practice is to report actual p-values from analyses. It is fine to also simply say, e.g. “The dependent variable was significantly correlated with variable A (p < 0.05).” But I prefer when possible to say, “The dependent variable was significantly correlated with variable A (p = 0.026).”
It is probably best to avoid using terms like “marginally significant” or “borderline significant” for p-values less than 0.10 but greater than 0.05, though you might encounter similar phrases. It is better to simply report the p-values of tests or effects in a straightforward manner. If you had cause to include certain model effects or results from other tests, they can be reported as e.g., “Variables correlated with the dependent variable with p < 0.15 were A, B, and C.”
Considering some of the examples presented, it may have occurred to the reader to ask if the null hypothesis is ever really true. For example, in some population of 7th graders, if we could measure everyone in the population to a high degree of precision, then there must be some difference in height between girls and boys. This is an important limitation of null hypothesis significance testing. Often, if we have many observations, even small effects will be reported as significant. This is one reason why it is important to not rely too heavily on p-values, but to also look at the size of the effect and practical considerations. In this example, if we sampled many students and the difference in heights was 0.5 cm, even if significant, we might decide that this effect is too small to be of practical importance, especially relative to an average height of 150 cm. (Here, the difference would be 0.3% of the average height.)
Practical importance and statistical significance.
It is important to remember to not let p -values be the only guide for drawing conclusions. It is equally important to look at the size of the effects you are measuring, as well as take into account other practical considerations like the costs of choosing a certain path of action.
For example, imagine we want to compare the SAT scores of two SAT preparation classes with a t -test.
Class.A = c(1500, 1505, 1505, 1510, 1510, 1510, 1515, 1515, 1520, 1520)
Class.B = c(1510, 1515, 1515, 1520, 1520, 1520, 1525, 1525, 1530, 1530)

t.test(Class.A, Class.B)
Welch Two Sample t-test

t = -3.3968, df = 18, p-value = 0.003214

mean of x mean of y
     1511      1521
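The arithmetic behind the Welch statistic is language-agnostic, so as an aside here is a minimal sketch in Python (not part of the R workflow above) that reproduces the reported t value: the difference in means divided by the standard error of that difference.

```python
from statistics import mean, stdev
from math import sqrt

class_a = [1500, 1505, 1505, 1510, 1510, 1510, 1515, 1515, 1520, 1520]
class_b = [1510, 1515, 1515, 1520, 1520, 1520, 1525, 1525, 1530, 1530]

def welch_t(x, y):
    # Welch t statistic: difference in means over the standard error of
    # that difference, without assuming equal variances.
    se = sqrt(stdev(x)**2 / len(x) + stdev(y)**2 / len(y))
    return (mean(x) - mean(y)) / se

print(round(welch_t(class_a, class_b), 4))  # -3.3968, matching t.test()
```

The p-value then comes from comparing this statistic to a t distribution with the Welch–Satterthwaite degrees of freedom (18 here), which is the step t.test() performs for us.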
The p -value is reported as 0.003, so we would consider there to be a significant difference between the two classes ( p < 0.05).
But we have to ask ourselves the practical question, is a difference of 10 points on the SAT large enough for us to care about? What if enrolling in one class costs significantly more than the other class? Is it worth the extra money for a difference of 10 points on average?
It should be remembered that p -values do not indicate the size of the effect being studied. It shouldn’t be assumed that a small p -value indicates a large difference between groups, or vice-versa.
For example, in the SAT example above, the p -value is fairly small, but the size of the effect (difference between classes) in this case is relatively small (10 points, especially small relative to the range of scores students receive on the SAT).
Conversely, the size of the effect could be relatively large, but if there is a lot of variability in the data or the sample size is not large enough, the p-value could be relatively large.
In this example, the SAT scores differ by 100 points between classes, but because the variability is greater than in the previous example, the p -value is not significant.
Class.C = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500)
Class.D = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600)

t.test(Class.C, Class.D)
Welch Two Sample t-test

t = -1.4174, df = 18, p-value = 0.1735

mean of x mean of y
     1290      1390
boxplot(cbind(Class.C, Class.D))
It should also be remembered that p -values are affected by sample size. For a given effect size and variability in the data, as the sample size increases, the p -value is likely to decrease. For large data sets, small effects can result in significant p -values.
As an example, let’s take the data from Class.C and Class.D and double the number of observations for each without changing the distribution of the values in each, and rename them Class.E and Class.F .
Class.E = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500,
            1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500)
Class.F = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600,
            1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600)

t.test(Class.E, Class.F)
Welch Two Sample t-test

t = -2.0594, df = 38, p-value = 0.04636

mean of x mean of y
     1290      1390
boxplot(cbind(Class.E, Class.F))
Notice that the p -value is lower for the t -test for Class.E and Class.F than it was for Class.C and Class.D . Also notice that the means reported in the output are the same, and the box plots would look the same.
One way to account for the effect of sample size on our statistical tests is to consider effect size statistics. These statistics reflect the size of the effect in a standardized way, and are unaffected by sample size.
An appropriate effect size statistic for a t -test is Cohen’s d . It takes the difference in means between the two groups and divides by the pooled standard deviation of the groups. Cohen’s d equals zero if the means are the same, and increases to infinity as the difference in means increases relative to the standard deviation.
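The calculation is simple enough to do by hand. Here is a sketch in Python of the conventional pooled-standard-deviation version of Cohen's d, applied to the Class.C and Class.D scores (lsr's cohensD offers several variants through its method argument, so exact output may differ slightly by convention):

```python
from statistics import mean, variance
from math import sqrt

def cohens_d(x, y):
    # Difference in means divided by the pooled (n-1 denominator)
    # standard deviation of the two groups.
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return abs(mean(x) - mean(y)) / sqrt(pooled_var)

class_c = [1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500]
class_d = [1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600]

print(round(cohens_d(class_c, class_d), 2))  # 0.63
```

So the class means differ by roughly 0.6 pooled standard deviations, even though the t-test above did not reach significance.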
In the following, note that Cohen’s d is not affected by the sample size difference in the Class.C / Class.D and the Class.E / Class.F examples.
library(lsr)

cohensD(Class.C, Class.D, method = "raw")
cohensD(Class.E, Class.F, method = "raw")
Effect size statistics are standardized so that they are not affected by the units of measurements of the data. This makes them interpretable across different situations, or if the reader is not familiar with the units of measurement in the original data. A Cohen’s d of 1 suggests that the two means differ by one pooled standard deviation. A Cohen’s d of 0.5 suggests that the two means differ by one-half the pooled standard deviation.
For example, if we create new variables— Class.G and Class.H —that are the SAT scores from the previous example expressed as a proportion of a 1600 score, Cohen’s d will be the same as in the previous example.
Class.G = Class.E / 1600
Class.H = Class.F / 1600

Class.G
Class.H

cohensD(Class.G, Class.H, method="raw")
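Both invariances—to sample size (for duplicated data) and to units of measurement—can be checked directly. A sketch in Python: the identical values reported for the two sample sizes imply that the "raw" method uses population (n-denominator) standard deviations, so this sketch assumes that convention.

```python
from statistics import mean, pstdev

def d_raw(x, y):
    # Cohen's d with population (n-denominator) standard deviations; with
    # equal group sizes the pooled variance is the mean of the two variances.
    sp = ((pstdev(x)**2 + pstdev(y)**2) / 2) ** 0.5
    return abs(mean(x) - mean(y)) / sp

class_c = [1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500]
class_d = [1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600]

# Double the observations (Class.E / Class.F) and rescale to a 0-1 scale
# (Class.G / Class.H): neither step changes d.
class_g = [score / 1600 for score in class_c * 2]
class_h = [score / 1600 for score in class_d * 2]

print(abs(d_raw(class_c, class_d) - d_raw(class_g, class_h)) < 1e-9)  # True
```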
Statistics is not like a trial.
When analyzing data, the analyst should not approach the task as would a lawyer for the prosecution. That is, the analyst should not be searching for significant effects and tests, but should instead be like an independent investigator using lines of evidence to find out what is most likely to be true given the data, graphical analysis, and statistical analysis available.
The problem of multiple p -values
One concept that will be important in the following discussion is that when there are multiple tests producing multiple p-values, there is an inflation of the Type I error rate. That is, there is a higher chance of making false-positive errors.
This follows mathematically from the definition of alpha. If we allow a probability of 0.05, or a 5% chance, of making a Type I error for any one test, then as we do more and more tests, the chance that at least one of them produces a false positive becomes greater and greater.
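For independent tests the arithmetic is straightforward: the chance of at least one false positive among k tests is one minus the chance of no false positives in any of them. A short sketch:

```python
# Family-wise error rate for k independent tests, each run at level alpha:
# P(at least one false positive) = 1 - P(no false positive in any test).
def familywise_error_rate(alpha, k):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
# With 20 independent tests at alpha = 0.05, the chance of at least one
# false positive is already about 64%.
```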
One way we deal with the problem of multiple p -values in statistical analyses is to adjust p -values when we do a series of tests together (for example, if we are comparing the means of multiple groups).
There are various p -value adjustments available in R. In some cases, we will use FDR, which stands for false discovery rate , and in R is an alias for the Benjamini and Hochberg method. There are also cases in which we’ll use Tukey range adjustment to correct for the family-wise error rate.
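To make the FDR idea concrete, here is a minimal re-implementation sketch in Python of the Benjamini–Hochberg logic behind R's p.adjust(p, method = "BH") (a sketch for illustration; in practice you would simply call p.adjust):

```python
def p_adjust_bh(pvalues):
    # Benjamini-Hochberg: multiply each p-value by n / rank (rank in
    # ascending order), then enforce monotonicity from the largest p down,
    # capping at 1.
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print([round(p, 4) for p in p_adjust_bh([0.01, 0.02, 0.03, 0.04, 0.05])])
# [0.05, 0.05, 0.05, 0.05, 0.05]
```

For comparison, a Bonferroni adjustment of the same five p-values would give 0.05, 0.10, 0.15, 0.20, and 0.25—illustrating how much more conservative it is.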
Unfortunately, students in analysis of experiments courses often learn to use Bonferroni adjustment for p -values. This method is simple to do with hand calculations, but is excessively conservative in most situations, and, in my opinion, antiquated.
There are other p -value adjustment methods, and the choice of which one to use is dictated either by which are common in your field of study, or by doing enough reading to understand which are statistically most appropriate for your application.
The statistical tests covered in this book assume that tests are preplanned for their p-values to be accurate. That is, in theory, you set out an experiment, collect the data as planned, and then say “I’m going to analyze it with this kind of model and do these post-hoc tests afterwards,” report these results, and that’s all you would do.
Some authors emphasize this idea of preplanned tests. In contrast is an exploratory data analysis approach that relies upon examining the data with plots and using simple tests like correlation tests to suggest what statistical analysis makes sense.
If an experiment is set out in a specific design, then usually it is appropriate to use the analysis suggested by this design.
When approaching data from an exploratory perspective, it is important to avoid p-value hacking. Imagine the case in which the researcher collects many different measurements across a range of subjects. The researcher might be tempted to simply try different tests and models to relate one variable to another, for all the variables. He might continue to do this until he found a test with a significant p-value.
But this would be a form of p -value hacking.
Because an alpha value of 0.05 allows us to make a false-positive error five percent of the time, finding one p -value below 0.05 after several successive tests may simply be due to chance.
Some forms of p -value hacking are more egregious. For example, if one were to collect some data, run a test, and then continue to collect data and run tests iteratively until a significant p -value is found.
A related issue in science is that there is a bias to publish, or to report, only significant results. This can also lead to an inflation of the false-positive rate. As a hypothetical example, imagine if there are currently 20 similar studies being conducted testing a similar effect—let’s say the effect of glucosamine supplements on joint pain. If 19 of those studies found no effect and so were discarded, but one study found an effect using an alpha of 0.05, and was published, is this really any support that glucosamine supplements decrease joint pain?
"statistically significant".
In the context of this book, the term "significant" means "statistically significant".
Whenever the decision rule finds that p < alpha , the difference in groups, the association, or the correlation under consideration is then considered "statistically significant" or "significant".
No effect size or practical considerations enter into determining whether an effect is “significant” or not. The only exception is that test assumptions and requirements for appropriate data must also be met in order for the p -value to be valid.
What you need to consider :
• The null hypothesis
• p , alpha , and the decision rule,
• Your result. That is, whether the difference in groups, the association, or the correlation is significant or not.
• The p -value
• The conclusion, e.g. "There was a significant difference in the mean heights of boys and girls in the class." It is best to preface this with the "reject" or "fail to reject" language concerning your decision about the null hypothesis.
In the context of this book, I use the term "size of the effect" to suggest the use of summary statistics to indicate how large an effect is. This may be, for example, the difference in two medians. I try to reserve the term “effect size” to refer to the use of effect size statistics. This distinction isn’t necessarily common.
Usually you will consider an effect in relation to the magnitude of measurements. That is, you might look at the difference in medians as a percent of the median of one group or of the global median. Or, you might look at the difference in medians in relation to the range of answers. For example, a one-point difference on a 5-point Likert item. Counts might be expressed as proportions of totals or subsets.
What you should report on assignments :
• The size of the effect. That is, the difference in medians or means, the difference in counts, or the proportions of counts among groups.
• Where appropriate, the size of the effect expressed as a percentage or proportion.
• If there is an effect size statistic—such as r, epsilon-squared, phi, Cramér's V, or Cohen's d—report this and its interpretation (small, medium, large), and incorporate it into your conclusion.
If there is a significant result, the question of practical importance asks if the difference or association is large enough to matter in the real world.
If there is no significant result, the question of practical importance asks if the difference or association is large enough to warrant another look, for example by running another test with a larger sample size or one that better controls variability in observations.
• Your conclusion as to whether this effect is large enough to be important in the real world.
• The context, explanation, or support to justify your conclusion.
• In some cases you might include considerations that aren't included in the data presented. Examples might include the cost of one treatment over another, including time investment, or whether there is a large risk in selecting one treatment over another (e.g., if people's lives are on the line).
Significant.
xkcd.com/882/
xkcd.com/892/
xkcd.com/1478/
Types of experimental designs.
A true experimental design assigns treatments in a systematic manner. The experimenter must be able to manipulate the experimental treatments and assign them to subjects. Since treatments are randomly assigned to subjects, a causal inference can be made for significant results. That is, we can say that the variation in the dependent variable is caused by the variation in the independent variable.
For interval/ratio data, traditional experimental designs can be analyzed with specific parametric models, assuming other model assumptions are met. These traditional experimental designs include:
• Completely random design
• Randomized complete block design
• Factorial
• Split-plot
• Latin square
Often a researcher cannot assign treatments to individual experimental units, but can assign treatments to groups. For example, if students are in a specific grade or class, it would not be practical to randomly assign students to grades or classes. But different classes could receive different treatments (such as different curricula). Causality can be inferred cautiously if treatments are randomly assigned and there is some understanding of the factors that affect the outcome.
In observational studies, the independent variables are not manipulated, and no treatments are assigned. Surveys are often like this, as are studies of natural systems without experimental manipulation. Statistical analysis can reveal the relationships among variables, but causality cannot be inferred. This is because there may be other unstudied variables that affect the measured variables in the study.
Good sampling practices are critical for producing good data. In general, samples need to be collected in a random fashion so that bias is avoided.
In survey data, bias is often introduced by a self-selection bias. For example, internet or telephone surveys include only those who respond to these requests. Might there be some relevant difference in the variables of interest between those who respond to such requests and the general population being surveyed? Or bias could be introduced by the researcher selecting some subset of potential subjects, for example only surveying a 4-H program with particularly cooperative students and ignoring other clubs. This is sometimes called “convenience sampling”.
In election forecasting, good pollsters need to account for selection bias and other biases in the survey process. For example, if a survey is done by landline telephone, those being surveyed are more likely to be older than the general population of voters, and so likely to have a bias in their voting patterns.
It is sometimes necessary to change experimental conditions during the course of an experiment. Equipment might fail, or unusual weather may prevent making meaningful measurements.
But in general, it is much better to plan ahead and be consistent with measurements.
People sometimes have the tendency to change measurement frequency or experimental treatments during the course of a study. This inevitably causes headaches in trying to analyze data, and makes writing up the results messy. Try to avoid this.
If you are testing an experimental treatment, include a check treatment that almost certainly will have an effect and a control treatment that almost certainly won’t. A control treatment will receive no treatment and a check treatment will receive a treatment known to be successful. In an educational setting, perhaps a control group receives no instruction on the topic but on another topic, and the check group will receive standard instruction.
Including checks and controls helps with the analysis in a practical sense, since they serve as standard treatments against which to compare the experimental treatments. In the case where the experimental treatments have similar effects, controls and checks allow you to say, for example, “Means for all the experimental treatments were similar, but were higher than the mean for the control, and lower than the mean for the check treatment.”
It often happens that measuring equipment fails or that a certain measurement doesn’t produce the expected results. It is therefore helpful to include measurements of several variables that can capture the potential effects. Perhaps test scores of students won’t show an effect, but a self-assessment question on how much students learned will.
Including additional independent variables that might affect the dependent variable is often helpful in an analysis. In an educational setting, you might assess student age, grade, school, town, background level in the subject, or how well they are feeling that day.
The effects of covariates on the dependent variable may be of interest in themselves. But also, including covariates in an analysis can better model the data, sometimes making treatment effects clearer or making a model better meet model assumptions.
The NHST controversy.
Particularly in the fields of psychology and education, there has been much criticism of the null hypothesis significance test approach. From my reading, the main complaints against NHST tend to be:
• Students and researchers don’t really understand the meaning of p -values.
• p -values don’t include important information like confidence intervals or parameter estimates.
• p -values have properties that may be misleading, for example that they do not represent effect size, and that they change with sample size.
• We often treat an alpha of 0.05 as a magical cutoff value.
Personally, I don’t find these to be very convincing arguments against the NHST approach.
The first complaint is in some sense pedantic: Like so many things, students and researchers learn the definition of p -values at some point and then eventually forget. This doesn’t seem to impact the usefulness of the approach.
The second point has weight only if researchers use only p-values to draw conclusions from statistical tests. As this book points out, one should always consider the size of the effects and practical considerations of the effects, as well as present findings in tabular or graphical form, including confidence intervals or measures of dispersion. There is no reason why parameter estimates, goodness-of-fit statistics, and confidence intervals can’t be included when a NHST approach is followed.
The properties in the third point also don’t count much as criticism if one is using p -values correctly. One should understand that it is possible to have a small effect size and a small p -value, and vice-versa. This is not a problem, because p -values and effect sizes are two different concepts. We shouldn’t expect them to be the same. The fact that p -values change with sample size is also in no way problematic to me. It makes sense that when there is a small effect size or a lot of variability in the data that we need many samples to conclude the effect is likely to be real.
(One case where I think the considerations in the preceding point are commonly problematic is when people use statistical tests to check for the normality or homogeneity of data or model residuals. As sample size increases, these tests are better able to detect small deviations from normality or homoscedasticity. Too many people use them and think their model is inappropriate because the test can detect a small effect size, that is, a small deviation from normality or homoscedasticity).
The fourth point is a good one. It doesn’t make much sense to come to one conclusion if our p -value is 0.049 and the opposite conclusion if our p -value is 0.051. But I think this can be ameliorated by reporting the actual p -values from analyses, and relying less on p -values to evaluate results.
Overall it seems to me that these complaints condemn poor practices that the authors observe: not reporting the size of effects in some manner; not including confidence intervals or measures of dispersion; basing conclusions solely on p -values; and not including important results like parameter estimates and goodness-of-fit statistics.
Estimates and confidence intervals.
One approach to determining statistical significance is to use estimates and confidence intervals. Estimates could be statistics like means, medians, proportions, or other calculated statistics. This approach can be very straightforward, easy for readers to understand, and easy to present clearly.
The most popular competitor to the NHST approach is Bayesian inference. Bayesian inference has the advantage of calculating the probability of the hypothesis given the data , which is what we thought we should be doing in the “Wait, does this make any sense?” section above. Essentially it takes prior knowledge about the distribution of the parameters of interest for a population and adds the information from the measured data to reassess some hypothesis related to the parameters of interest. If the reader will excuse the vagueness of this description, it makes intuitive sense. We start with what we suspect to be the case, and then use new data to assess our hypothesis.
One disadvantage of the Bayesian approach is that it is not obvious in most cases what could be used for legitimate prior information. A second disadvantage is that conducting Bayesian analysis is not as straightforward as the tests presented in this book.
[Video] “Understanding statistical inference” from Statistics Learning Center (Dr. Nic). 2015. www.youtube.com/watch?v=tFRXsngz4UQ .
[Video] “Hypothesis tests, p-value” from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=0zZYBALbZgg .
[Video] “Understanding the p-value” from Statistics Learning Center (Dr. Nic). 2011.
www.youtube.com/watch?v=eyknGvncKLw .
[Video] “Important statistical concepts: significance, strength, association, causation” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=FG7xnWmZlPE .
“Understanding statistical inference” from Dr. Nic. 2015. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/understanding-statistical-inference/ .
“Basic concepts of hypothesis testing” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/hypothesistesting.html .
“Hypothesis testing” , section 4.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .
“Hypothesis Testing with One Sample”, sections 9.1–9.2 in Openstax. 2013. Introductory Statistics . openstax.org/textbooks/introductory-statistics .
"Proving causation" from Dr. Nic. 2013. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/proving-causation/ .
[Video] “Variation and Sampling Error” from Statistics Learning Center (Dr. Nic). 2014. www.youtube.com/watch?v=y3A0lUkpAko .
[Video] “Sampling: Simple Random, Convenience, systematic, cluster, stratified” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=be9e-Q-jC-0 .
“Confounding variables” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/confounding.html .
“Overview of data collection principles” , section 1.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .
“Observational studies and sampling strategies” , section 1.4, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .
“Experiments” , section 1.5, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .
1. Which of the following pair is the null hypothesis?
A) The number of heads from the coin is not different from the number of tails.
B) The number of heads from the coin is different from the number of tails.
2. Which of the following pair is the null hypothesis?
A) The height of boys is different than the height of girls.
B) The height of boys is not different than the height of girls.
3. Which of the following pair is the null hypothesis?
A) There is an association between classroom and sex. That is, there is a difference in counts of girls and boys between the classes.
B) There is no association between classroom and sex. That is, there is no difference in counts of girls and boys between the classes.
4. We flip a coin 10 times and it lands on heads 7 times. We want to know if the coin is fair.
a. What is the null hypothesis?
b. Looking at the code below, and assuming an alpha of 0.05,
What do you decide (use the reject or fail to reject language)?
c. In practical terms, what do you conclude?
binom.test(7, 10, 0.5)
Exact binomial test

number of successes = 7, number of trials = 10, p-value = 0.3438
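For reference, the two-sided p-value reported by binom.test() can be reconstructed directly from the binomial distribution. A sketch in Python, following the usual convention of summing the probabilities of all outcomes no more likely than the observed one:

```python
from math import comb

def binom_test_two_sided(successes, n, p=0.5):
    # Exact two-sided binomial test: sum the probabilities of every outcome
    # whose probability is no greater than that of the observed count.
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = pmf[successes]
    return sum(prob for prob in pmf if prob <= observed * (1 + 1e-7))

print(round(binom_test_two_sided(7, 10), 4))  # 0.3438, as reported above
```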
5. We measure the height of 9 boys and 9 girls in a class, in centimeters. We want to know if one group is taller than the other.
c. In practical terms, what do you conclude? Address the practical importance of the results.
Girls = c(152, 150, 140, 160, 145, 155, 150, 152, 147)
Boys  = c(144, 142, 132, 152, 137, 147, 142, 144, 139)

t.test(Girls, Boys)
Welch Two Sample t-test

t = 2.9382, df = 16, p-value = 0.009645

mean of x mean of y
 150.1111  142.1111
mean(Boys)
sd(Boys)
quantile(Boys)

mean(Girls)
sd(Girls)
quantile(Girls)

boxplot(cbind(Girls, Boys))
6. We count the number of boys and girls in two classrooms. We are interested to know if there is an association between the classrooms and the number of girls and boys. That is, does the proportion of boys and girls differ statistically across the two classrooms?
Classroom  Girls  Boys
A          13     7
B          5      15
Input =(" Classroom Girls Boys A 13 7 B 5 15 ") Matrix = as.matrix(read.table(textConnection(Input), header=TRUE, row.names=1)) fisher.test(Matrix)
Fisher's Exact Test for Count Data

p-value = 0.02484
Matrix

rowSums(Matrix)
colSums(Matrix)

prop.table(Matrix, margin=1)   ### Proportions for each row

barplot(t(Matrix),
        beside = TRUE,
        legend = TRUE,
        ylim   = c(0, 25),
        xlab   = "Class",
        ylab   = "Count")
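For reference, the Fisher p-value reported above can be reconstructed from the hypergeometric distribution. A sketch in Python: enumerate every 2×2 table with the same margins and sum the probabilities of those no more probable than the observed table.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    # Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    # sum the hypergeometric probabilities of all tables with the same
    # margins that are no more probable than the observed table.
    row1, col1, n = a + b, a + c, a + b + c + d
    def prob(k):  # probability that the top-left cell equals k
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    observed = prob(a)
    kmin, kmax = max(0, row1 + col1 - n), min(row1, col1)
    return sum(prob(k) for k in range(kmin, kmax + 1)
               if prob(k) <= observed * (1 + 1e-7))

# Girls/Boys counts for classrooms A and B from the table above
print(round(fisher_exact_two_sided(13, 7, 5, 15), 5))  # 0.02484
```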
7. Why should you not rely solely on p -values to make a decision in the real world? (You should have at least two reasons.)
8. Create your own example to show the importance of considering the size of the effect . Describe the scenario: what the research question is, and what kind of data were collected. You may make up data and provide real results, or report hypothetical results.
9. Create your own example to show the importance of weighing other practical considerations . Describe the scenario: what the research question is, what kind of data were collected, what statistical results were reached, and what other practical considerations were brought to bear.
10. What is 5e-4 in common decimal notation?
©2016 by Salvatore S. Mangiafico. Rutgers Cooperative Extension, New Brunswick, NJ.
Non-commercial reproduction of this content, with attribution, is permitted. For-profit reproduction without permission is prohibited.
If you use the code or information in this site in a published work, please cite it as a source. Also, if you are an instructor and use this book in your course, please let me know. My contact information is on the About the Author of this Book page.
Mangiafico, S.S. 2016. Summary and Analysis of Extension Program Evaluation in R, version 1.20.07, revised 2024. rcompanion.org/handbook/ . (Pdf version: rcompanion.org/documents/RHandbookProgramEvaluation.pdf .)
Regarding the p-value of multiple linear regression analysis, the introduction from Minitab's website is shown below.
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable.
For example, I have a resultant MLR model $y = 0.46753 X_1 - 0.2668 X_2 + 1.6193 X_3 + 4.5424 X_4 + 14.48$, and the output is shown below. Then a $y$ can be calculated using this equation.
Based on the introduction above, the null hypothesis is that the coefficient equals 0. My understanding is that the coefficient, for example the coefficient of $X_4$, will be set to 0 and another y will be calculated as $y_2 = 0.46753 X_1 - 0.2668 X_2 + 1.6193 X_3 + 0 \cdot X_4 + 14.48$. Then a paired t-test is conducted for $y$ and $y_2$, but the p-value of this t-test is 6.9e-12, which does not equal 0.1292 (the p-value of the coefficient of $X_4$).
Can anyone help on the correct understanding? Many thanks!
This is incorrect for a couple of reasons:
The model "without" X4 will not necessarily have the same coefficient estimates for the other values. Fit the reduced model and see for yourself.
The statistical test for the coefficient does not concern the "mean" values of Y obtained from the two sets of predictions. With an intercept in the model, the predicted $Y$ will always have the same grand mean as the observed $Y$, so a paired t-test of the two sets of predictions would have a t statistic of 0 and a p-value of 1. The same holds for the residuals. Your t-test had the wrong value per the point above.
The statistical test which is conducted for the statistical significance of the coefficient is a one-sample t-test. This is confusing since we do not have a "sample" of multiple coefficients for X4, but we have an estimate of the distributional properties of such a sample using the central limit theorem. The mean and standard error describe the location and shape of such a limiting distribution. If you take the column "Est" and divide by "SE" and compare to the appropriate t distribution (approximately standard normal for large samples), this gives you the p-values in the 4th column.
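The Est/SE calculation can be sketched in a few lines. This uses the standard normal rather than the exact t distribution, so it is only a large-sample approximation of what regression software reports, and the numbers plugged in below are illustrative, not taken from the question's output:

```python
from math import erf, sqrt

def two_sided_p_from_normal(estimate, std_error):
    # Wald-style test of a coefficient against 0: z = Est / SE, compared
    # to a standard normal distribution (a large-sample stand-in for the
    # t distribution that regression software actually uses).
    z = abs(estimate / std_error)
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

# Sanity check with a familiar value: z = 1.96 gives p close to 0.05
print(round(two_sided_p_from_normal(1.96, 1.0), 3))  # 0.05
```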
A fourth point: a criticism of Minitab's help page. Such a help file could not, in a paragraph, summarize years of statistical training, so I need not contend with the whole thing. But to say that a "predictor" is "a meaningful addition" is vague and probably incorrect. The rationale for choosing which variables to include in a multivariate model is subtle and relies on scientific reasoning, not statistical inference.
Your initial interpretation of p-values appears correct, which is that only the intercept has a coefficient that's significantly different from 0. You'll notice that the estimate of the coefficient for x4 is still quite high, but there's enough error that it's not significantly different from 0.
Your paired t test of y1 and y2 suggests that the models are different from one another. That's to be expected, in one model you included a large but imprecise coefficient that's contributing quite a bit to your model. There's no reason to think that the p-value of these models being different from one another should be the same as the p-value of the coefficient of x4 being different from 0.
I used the linearHypothesis function in order to test whether two regression coefficients are significantly different. Do you have any idea how to interpret these results?
Here is my output:
Short Answer
Your F statistic is 104.34 and its p-value is 2.2e-16. The p-value suggests that we can reject the null hypothesis that the two coefficients are equal at any level of significance commonly used in practice.
Had your p-value been greater than 0.05, by convention you would not reject the null hypothesis.
Long Answer
The linearHypothesis function tests whether the difference between the coefficients is significant: in your example, whether the two betas cancel each other out, i.e. β1 − β2 = 0.
Linear hypothesis tests are performed using F-statistics. They compare your estimated model against a restricted model in which your hypothesis (the restriction) is imposed.
An alternative linear hypothesis test is whether β1 or β2 is nonzero: we jointly test the hypotheses β1 = 0 and β2 = 0 rather than testing each one at a time. The joint null is rejected when at least one of the individual hypotheses can be rejected. In practice, you provide both linear restrictions to be tested as strings.
Here are a few examples of the many ways you can test hypotheses:
You can test a linear combination of coefficients.
You can also test several restrictions jointly.
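R's linearHypothesis (from the car package) computes this internally. As a language-neutral illustration, here is a small numpy sketch of the same F-statistic for a single restriction β1 − β2 = 0, on simulated data where the restriction is true; all variable names and numbers are illustrative, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True coefficients: beta1 = beta2 = 2, so the restriction beta1 - beta2 = 0 holds
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimates
resid = y - X @ beta
df_resid = n - X.shape[1]
s2 = resid @ resid / df_resid                  # residual variance estimate

R = np.array([[0.0, 1.0, -1.0]])               # restriction matrix: beta1 - beta2 = 0
r = np.array([0.0])
q = R.shape[0]                                 # number of restrictions

# F = (R b - r)' [R (X'X)^-1 R']^-1 (R b - r) / (q * s2)
middle = R @ np.linalg.inv(X.T @ X) @ R.T
diff = R @ beta - r
F = (diff @ np.linalg.solve(middle, diff)) / (q * s2)
print(float(F))    # small F: the restriction is consistent with the data
```

A large F (like the 104.34 in the question) would instead indicate that the restriction is incompatible with the data.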
Aside from the t statistics, which test the predictive power of each variable in the presence of all the others, another test that can be used is the overall F-test (the one reported at the bottom of a linear model summary).
This tests the null hypothesis that all of the β’s are equal to zero against the alternative that allows them to take any values. If we reject this null hypothesis (which we do because the p-value is small), then this is the same as saying there is enough evidence to conclude that at least one of the covariates has predictive power in our linear model, i.e. that using a regression is predictively ‘better’ than just guessing the average.
So basically, you are testing whether all coefficients are zero (or some other arbitrary linear hypothesis), as opposed to a t-test, which tests individual coefficients.
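The overall F-test described above can be written as a comparison of residual sums of squares between the intercept-only model and the full model. A minimal Python sketch on simulated data (illustrative only, not the poster's data):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 150
x1, x2 = rng.normal(size=(2, n))
y = 0.5 + 1.5 * x1 - 1.0 * x2 + rng.normal(size=n)

# Unrestricted model: intercept + x1 + x2
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ssr_ur = np.sum((y - X @ beta) ** 2)

# Restricted model under H0: beta1 = beta2 = 0 (intercept only,
# i.e. just guessing the average)
ssr_r = np.sum((y - y.mean()) ** 2)

q = 2                        # number of restrictions
k = X.shape[1]               # parameters in the unrestricted model
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k))
print(F > 10)                # a large F rejects H0: at least one covariate predicts y
```

Since the simulated covariates genuinely predict y, the F statistic here is large and the null is rejected, matching the interpretation above.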
The answer given above is detailed enough, except that for this test we are more interested in the two variables. The linear hypothesis here does not test the null that all of the β’s are equal to zero against an alternative that allows them to take any values; it restricts only the two variables of interest, which makes this test equivalent to a t-test.
BMC Health Services Research, volume 24, Article number: 977 (2024)
Missed nursing care and the practice environment are widely recognized as two crucial contextual factors that significantly impact the quality of nursing care. This study assessed the current status of missed nursing care and the characteristics of the nursing practice environment in Iran. Additionally, this study aimed to explore the relationship between these two variables.
We conducted a cross-sectional study from May 2021 to January 2022 in which we investigated 255 nurses. We utilized the Missed Nursing Care Survey, the Nursing Work Index-Practice Environment Scale, and a demographic questionnaire to gather the necessary information. We used the Shapiro‒Wilk test, Pearson correlation coefficient test, and multiple linear regression test in SPSS version 20 for the data analyses.
According to the present study, 41% of nurses regularly or often overlooked certain aspects of care, resulting in an average score of 32.34 ± 7.43 for missed nursing care. Attending patient care conferences, providing patient bathing and skin care, and assisting with toileting needs were the items contributing most to the score. The overall practice environment was unfavorable, with a mean score of 2.25 ± 0.51. Interestingly, ‘nursing foundations for quality of care’ was identified as the sole predictor of missed nursing care, with a β value of -0.22 and a p-value of 0.036.
This study identified attending patient care interdisciplinary team meetings and delivering basic care promptly as the most prevalent instances of missed nursing care. Unfortunately, the surveyed hospitals exhibited an undesirable practice environment, which correlated with a higher incidence of missed nursing care. These findings highlight the crucial impact of nurses’ practice environment on care delivery. Addressing the challenges in the practice environment is essential for reducing instances of missed care, improving patient outcomes, and enhancing overall healthcare quality.
Missed Nursing Care (MNC) is the failure to provide any necessary aspect of patient care, partially or entirely, or delay in delivering it [ 1 ]. MNCs can have severe side effects on patients, including safety threats [ 2 ] and even mortality [ 3 ]. It also significantly decreases the quality of nursing care [ 4 ]. MNC can also have adverse and destructive effects on nurses, including decreased job satisfaction, increased absenteeism, and the intention to leave their jobs [ 5 ]. As a result, MNCs have become a key focus of nursing researchers in recent years and are widely recognized as a significant global problem [ 6 ].
A literature review revealed that MNCs are multidimensional and vary significantly in frequency and elements across different research communities [ 7 ]. In Iran, information regarding MNCs is limited. According to our search, only one reliable study [ 8 ] has been conducted on this topic in the last five years. Chegini et al. conducted a study that showed that the percentage of participants who missed care was 72.1%. The most common tasks of missed nursing care included patient discharge planning and teaching, emotional support for patients and their families, interdisciplinary care conferences, and patient education regarding their illness, tests, and diagnostic procedures. Although the study by Chegini et al. has provided valuable information, the generalizability of its results is limited due to its small sample size. The study included nurses from only medical-surgical wards and used the census sampling method.
MNC is influenced by various individual and organizational factors [ 9 ]. In a systematic review, Chiappinotto et al. identified significant factors contributing to MNC, such as low nurse-to-patient ratios, high workloads, and poor work environments. Moreover, stress, job dissatisfaction, and inadequate education among nurses were recognized as crucial elements. Furthermore, patient clinical instability was found to further worsen MNC [ 10 ]. However, some researchers argue that organizational and environmental factors are more influential in causing MNC than individual factors [ 11 ].
Another influential organizational variable on nursing performance is the practice environment (PE) [ 12 ]. PE in nursing is inclusive of material and human resources, a cooperative environment, and other elements related to the environment that directly or indirectly affect how care is provided [ 13 ]. PE is involved in nurses’ burnout [ 14 ], job satisfaction, retention in nursing [ 15 ], and overall quality of nursing care [ 16 ]. As with MNC, evidence suggests that PE varies across different hospitals and wards within a hospital [ 17 ]. For instance, a study conducted by Choi & Boyle in the U.S. demonstrated that pediatric wards had more favorable PEs than did medical-surgical wards. Previous studies have also shown that MNCs differ across poor, moderate, and suitable PEs: a weak PE has been found to increase MNCs [ 18 ], while an optimal PE reduces MNCs [ 17 ]. Due to the global significance of MNCs and PEs for quality of care, and the variability of these two variables across different sociocultural contexts, it is essential to understand the weaknesses of MNCs and PEs in every community thoroughly. Therefore, this study aimed to determine the status of MNCs, the characteristics of PEs, and the relationships between these two variables among nurses working in two teaching hospitals.
The present study was a cross-sectional study conducted from May 30, 2021, to January 19, 2022. The study was conducted according to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines. The study included nurses employed in the medical-surgical, emergency, and intensive care units of two major teaching hospitals in Zanjan Province. This province is situated in the northwestern region of Iran and has a population of approximately 1,016,000 people. To be eligible for participation in the study, the nurses needed to meet the following specific inclusion criteria:
A minimum of three months of work experience in the desired ward.
Holding a bachelor’s degree or higher.
Consenting to participate in the study.
We utilized Formula 1 for a finite population to determine the sample size, with N (total population) = 553, power (the probability of correctly rejecting the null hypothesis) = 0.80, standard deviation (SD) = 13.97, d (margin of error, or precision) = 1.2, and Z (standardized value for the corresponding level of confidence) = 1.96. Based on these values, the formula indicated that a minimum sample size of 246 was required. The standard deviation used in Formula 1 was taken from a recent study comparable to our work, conducted by Park et al. [ 18 ], which recorded the mean and standard deviation of the MNC and PE as 84.06 ± 13.79 and 2.92 ± 0.25, respectively; we used the larger standard deviation (that of the MNC) to ensure a larger sample size. Allowing for the possibility of spoiled questionnaires, we prepared and distributed 270 questionnaires among the selected nurses. Fifteen questionnaires were excluded from the study due to incomplete data, leaving 255 of the 270 distributed questionnaires for data analysis.
We utilized a systematic random method to select the nurses for the study. In the first step, a list of nurses working in the desired wards was obtained, and the sampling frame was prepared. In the second step, each nurse was assigned a number from a table of random numbers, generating a new, randomly ordered sampling frame. In the third step, we calculated the sampling interval between study samples, denoted as ‘K’, using the formula K = N/n: dividing the total population (N) of 553 by the sample size (n) of 270 gives approximately 2. The first nurse was selected randomly from the new sampling frame, and each subsequent participant was taken at an interval of two from the previous selection.
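As an illustrative aside, the systematic random selection described above can be sketched in a few lines of Python; the helper function and seed are hypothetical, while N and n are taken from the text:

```python
import random

def systematic_sample(frame, k, seed=None):
    """Select every k-th unit from a randomly ordered sampling frame,
    starting at a random position within the first interval."""
    rng = random.Random(seed)
    start = rng.randrange(k)          # random start in [0, k)
    return frame[start::k]

# Figures from the study: N = 553 nurses, n = 270 questionnaires to distribute
N, n = 553, 270
k = max(1, round(N / n))              # sampling interval, approximately 2 here
nurses = list(range(1, N + 1))        # stand-in for the shuffled sampling frame
sample = systematic_sample(nurses, k, seed=42)
print(k, len(sample))                 # interval 2, roughly N / k nurses selected
```

With K = 2, roughly every second nurse in the randomized frame is selected, which yields somewhat more than the 270 questionnaires needed; the study would then distribute to the first 270 selected.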
To collect the data, we used three different questionnaires: (a) a demographic profile form, (b) the Missed Nursing Care (MISSCARE) Survey, and (c) the Nursing Work Index-Practice Environment Scale (NWI-PES). The demographic profile included various variables, including sex, age, marital status, educational degree, work experience, position, shift work, employment type, and ward type.
In this study, we utilized the MISSCARE survey (MISSED) to assess MNC. We chose the MISSED based on its extensive utilization and strong psychometric properties, as evidenced in the literature. As noted by Chiappinotto et al. [ 10 ], 34 out of the 58 studies reviewed utilized a version of the MISSCARE survey, highlighting its reliability and validity in assessing MNC. The MISSCARE Survey consists of two parts: Part ‘A’ and Part ‘B’. Part ‘A’ included the most missed care components, while Part ‘B’ included the reasons for missing nursing care. We utilized part ‘A’ of the questionnaire, which constituted 24 items of the MISSCARE Survey. Each of the 24 items comprises five answer options: 1) rarely or never missed, 2) occasionally missed, 3) frequently missed, 4) always missed, and 5) nonapplicable. Kalisch & Williams included the option of ‘nonapplicable’ to account for nurses who operate in situations where certain care activities may not be performed [ 19 ]. The total score range of this survey is 24–96, where higher scores indicate a greater probability of missed care. In line with the findings of a previous study [ 17 ], we considered the combination of “frequently missed” and “always missed” options as missed care to demonstrate the frequency of missed nursing care. The MISSCARE Survey has undergone psychometric analysis, and its applicability has been approved for the nursing community in Iran [ 20 ]. The internal consistency of the tool was measured based on Cronbach’s alpha coefficient (α = 0.88) in this study.
The psychometric analysis of the NWI-PES has been conducted, and its usage has been approved [ 21 ]. Developed by Lake in 2002 and authorized by the National Quality Forum (NQF), this scale comprises thirty-one items and operates on a four-point Likert scale, with scores ranging from four to one. The response options were strongly agree = 4, somewhat agree = 3, somewhat disagree = 2, and strongly disagree = 1. According to [ 22 ], the possible score range of the whole scale and its subscales is one to four. The NWI-PES comprises five subscales:
‘Nurse participation in hospital affairs’, with nine items.
‘Staffing and resource adequacy’, with four items.
‘Collegial nurse‒physician relations’, with three items.
‘Nursing foundations for quality of care’, with ten items.
‘Nurse manager ability, leadership, and support of nurses’, with five items.
A scale midpoint greater than 2.5 is considered an acceptable PE [ 22 ]. The NWI-PES demonstrated high internal consistency, with a Cronbach’s alpha of 0.93. The Cronbach’s alpha for each of the subscales of the NWI-PES was computed. The results were as follows: ‘nurse participation in hospital affairs,’ α = 0.88; ‘nursing foundations for quality-of-care,’ α = 0.72; ‘staffing and resource adequacy,’ α = 0.87; ‘collegial nurse‒physician relations,’ α = 0.90; and ‘nurse manager ability, leadership, and support of nurses,’ α = 0.84.
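Cronbach's alpha, reported above for the scale and its subscales, can be computed directly from an item-score matrix. A minimal Python sketch with toy data (not the study's data; the function name is illustrative):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Toy data: 5 respondents x 3 fairly consistent Likert items
scores = [[4, 4, 3], [2, 2, 2], [3, 3, 3], [1, 2, 1], [4, 3, 4]]
print(round(cronbach_alpha(scores), 2))   # ≈ 0.93 for these toy scores
```

Values near 0.9, like the 0.93 reported for the NWI-PES, indicate high internal consistency among the items.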
We computed the means and standard deviations of the MNC and PE scores and utilized the Shapiro‒Wilk test to determine the normality of the data distribution. The results revealed that the data followed a normal distribution. We employed the Pearson correlation coefficient to determine the correlation between PEs and MNCs. Furthermore, we conducted a multiple linear regression test to examine whether changes in the MNC score, as the dependent variable, were associated with changes in the PE subscale scores. Before conducting the multiple linear regression analysis, we evaluated its assumptions and confirmed that they were met. We confirmed the assumption of independent errors by using the Durbin–Watson test. Homoscedasticity and linearity assumptions were assessed through P-P plots. Multicollinearity was examined by determining the variance inflation factor (VIF) and tolerance [ 23 ]. The VIF ranged from 1.006 (TOL = 0.99) for ‘collegial nurse‒physician relations’ to 1.04 (TOL = 0.96) for ‘nursing foundations for quality-of-care.’ Independent t tests and ANOVA were used to evaluate the associations between demographic variables and MNCs. The statistical analysis of the data was conducted using SPSS software version 24, and a P value lower than 0.05 was used to indicate statistical significance.
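The VIF and tolerance checks mentioned above (the study ran them in SPSS) amount to regressing each predictor on the others and computing VIF = 1/(1 − R²), with tolerance = 1/VIF. An illustrative Python sketch on simulated, nearly uncorrelated predictors (standing in for the NWI-PES subscale scores; all names are hypothetical):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: regress it on the other columns
    (plus an intercept) and return 1 / (1 - R^2). Tolerance is 1 / VIF."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    r2 = 1.0 - (resid @ resid) / tss
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))          # three nearly uncorrelated predictors
print([round(vif(X, j), 2) for j in range(3)])   # all close to 1: little collinearity
```

VIF values near 1, like the 1.006 to 1.04 range reported in the study, indicate that multicollinearity is not a concern.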
The majority of the participants were females (84.3%), were married (68.2%), and were employed on a 5-year contract (46.7%).
In addition, almost all of the participants (95.7%) had a Bachelor of Science in Nursing (BSN) degree, and a significant proportion (45.8%) worked in medical-surgical wards. Most of the respondents (91.4%) were staff nurses, and 89.8% of them worked in rotational shift work.
The participants’ average age and work experience were 33.94 ± 7.40 and 9.25 ± 7.14 years, respectively (Table 1 ).
The overall mean score for MNCs, with a score ranging from 24 to 96, was 32.34 ± 7.43. Of the total nurses, 41% reported that they always or frequently missed at least one aspect of nursing care. Based on the findings, the items with the highest mean score in descending order were attending an interdisciplinary patient care conference, patient bathing or skin care, assisting with toileting needs within 5 min of request, mouth care, and feeding the patient when the food was still warm (Table 2 ).
The mean MNC score was significantly greater for male nurses than for female nurses (X̄1 = 36.25, X̄2 = 31.56; t = -3.738, p < 0.001). Other demographic and occupational variables of the nurses, such as age, marital status, degree, work experience, position, rotational shift work, type of employment, and working place, had no significant association with MNCs ( p > 0.05).
The overall mean score for PE was 2.25 ± 0.51. Among the different subscales of the PE scale, the highest mean score was observed for ‘collegial nurse‒physician relations’ (M = 2.45, SD = 0.72). Furthermore, the mean scores for “nursing foundations for quality of care”, “nurse manager ability, leadership, and support of nurses”, and “nurse participation in hospital affairs” were 2.43 ± 0.58, 2.23 ± 0.65, and 2.16 ± 0.58, respectively. The lowest mean score was observed for ‘staffing and resource adequacy’ (M = 1.81, SD = 0.64).
The study’s results indicate a significant and negative correlation between the mean score of PEs and the overall mean score of MNCs ( r = -0.18, p = 0.002). There was also a significant association between the overall mean score of MNCs and two of the five NWI-PES subscales: “nursing foundations for quality of care” ( r = -0.21, p < 0.001) and “nurse manager ability, leadership, and support of nurses” ( r = -0.16, p = 0.006).
According to Table 3 , linear regression analysis showed that only “nursing foundations for quality of care” (β = -0.22, p = 0.036) of the five NWI-PES subscales could predict MNC.
The main objective of this study was to determine the status of MNCs, the characteristics of PEs, and the relationships between these two variables among Iranian nurses working in two teaching hospitals. The findings showed that 41% of nurses reported frequently or always missing at least one aspect of nursing care. A systematic review reported that 55–98% of nurses missed at least one element of nursing care [ 24 ]. The overall mean MNC score in our study was 32.3; a literature review revealed that this was lower than the scores reported in the United States, Turkey, and Australia, though not Iceland [ 25 ]. Comparing our results with those from other countries [ 26 ], relatively little missed care was reported in our study. Like many previous studies, we used self-reporting to collect the data, and since MNC touches on essential aspects of nursing ethics, the lower mean MNC score in our study might reflect two biases that compromise truth-telling: an “acquiescence response style” (the tendency to respond positively) and “social desirability bias” (the tendency to present oneself as socially acceptable rather than fully reflecting reality). The item with the highest mean missed-care score was attending interdisciplinary patient care conferences. Not attending such training sessions can erode knowledge and leave nursing care less up to date, ultimately reducing the quality of care provided to patients [ 27 ]. This finding is consistent with that of another study conducted in Brazil [ 7 ]. Based on our field experiences and observations, several factors seem to play a significant role in missing this care, including the following:
Time limitations due to the nursing shortage.
Inappropriate timing of training classes or conferences, and conflicts with daily schedules.
Lack of support and encouragement from managers, especially hospital managers.
Inappropriate, poorly equipped venues for classes.
Improper teaching methods, such as lecturing instead of using newer teaching methods.
Lack of proper reminders for nurses regarding the date, time, and place of meetings.
Our study revealed that the lowest scores for missed care were related to items such as ‘bedside glucose monitoring as ordered’, ‘peripheral IV/central line site care and assessments according to hospital policy’, and ‘vital signs assessments as ordered.’ The lower scores associated with this care could be attributed to the use of an accurate system for recording patients in patient files, together with additional unique records above patients’ beds in the current research environment, which helps staff remember and check this care more often. These care tasks are crucial parts of a patient’s vital nursing care and should be performed during each work shift to monitor the patient’s hemodynamic status; this information about each patient was provided to the assigned nurse during the shift handover. Similar findings regarding ‘blood sugar control’ were reported by Smith et al. [ 17 ] in the U.S.
Our study revealed a low PE score among the participating nurses. Given that nurses carry major caring responsibilities, including performing technical procedures, making decisions, and leading patient care, such tasks are adversely affected by a poor practice environment. Consequently, patient and family satisfaction decreases, and adverse patient outcomes, such as mortality and infection, may increase. Azevedo Filho et al. also demonstrated a poor nursing practice environment in Brazil [ 13 ], consistent with our study results. In another study [ 17 ], the average PE score was significantly greater than that in our study and that of Azevedo Filho et al. [ 13 ]. The high score in the Smith et al. research population could be because the surveyed hospitals were magnet hospitals, where there is more focus on creating a healthier and more desirable work environment. Our study revealed a significant inverse correlation between PE characteristics and MNCs; in other words, missed nursing care increases significantly under unfavorable PEs, although this relationship was not strong. Several researchers have emphasized the importance of providing qualified nursing services and improving the nursing work environment [ 17 ].
Among the different dimensions of PE, “nursing foundations for quality of care” and “nurse manager ability, leadership, and support of nurses” had significant relationships with MNCs. These findings suggest that targeted interventions aimed at improving each dimension of PE can help reduce the incidence of MNCs. Additionally, stronger nurse manager ability and leadership should be accompanied by reduced missed care, because nursing managers are responsible for managing the working conditions of nurses, determining their duties, coordinating existing resources, and developing basic nursing settings for the quality of patient care [ 28 ].
Regarding the relationship between nurses’ occupational and demographic variables and MNCs, our finding that male nurses had a significantly higher mean missed-care score is consistent with Blackman et al. [ 29 ], who likewise indicated that men’s mean score for missed care is significantly greater than women’s. A study conducted in Iran also showed that the quality of nursing care delivered by female nurses is greater than that delivered by male nurses [ 30 ]. Women may tend to care for patients more attentively, resulting in less missed care. Apart from gender, the results of our study suggested no correlation between MNCs and the other occupational and demographic variables of nurses.
The study offers insights into missed nursing care and its relationship with the practice environment. However, several limitations should be considered. The study’s cross-sectional design creates potential biases, which may limit our ability to establish causation. Additionally, the reliance on self-reports introduces the likelihood of response bias. Furthermore, the study focused on specific hospitals in Zanjan Province, which may restrict the generalizability of the findings to a broader context. Confounding factors, which are inherent to observational studies, might influence the observed relationships. Despite the abovementioned limitations, the study provides valuable contributions to comprehending the complex dynamics between the practice environment and missed nursing care.
According to our study, nurses consistently miss a significant portion of nursing care, with patient-related team meetings and training sessions being the most overlooked and basic nursing care the second most commonly overlooked. These findings highlight a possible lack of awareness of, or inadequate planning for, critical sessions, which demands increased attention. The unfavorable practice environment identified in the hospitals under study highlights the urgent need for improvement by planners and senior managers. Our findings demonstrated a statistically significant relationship between the practice environment and missed nursing care, indicating that improving the practice environment could help reduce the number of missed care cases. Managerial competencies, particularly leadership, are vital in preventing overlooked nursing care. These results provide essential insights for the field, highlighting the importance of targeted improvements in practice environments to improve patient care outcomes. Our research provides a foundation for future research and interventions to optimize nursing care delivery.
No datasets were generated or analysed during the current study.
ANOVA: Analysis of Variance
MNC: Missed Nursing Care
NQF: National Quality Forum
PE: Practice Environment
VIF: Variance Inflation Factor
Kalisch BJ, Landstrom GL, Hinshaw AS. Missed nursing care: a concept analysis. J Adv Nurs. 2009;65(7):1509–17. https://doi.org/10.1111/j.1365-2648.2009.05027.x
Cho SH, Lee JY, You SJ, Song KJ, Hong KJ. Nurse staffing, nurses prioritization, missed care, quality of nursing care, and nurse outcomes. Int J Nurs Pract. 2020;26(1):e12803. https://doi.org/10.1111/ijn.12803
Ball JE, Bruyneel L, Aiken LH, Sermeus W, Sloane DM, Rafferty AM, et al. Postoperative mortality, missed care and nurse staffing in nine countries: a cross-sectional study. Int J Nurs Stud. 2018;78:10–5. https://doi.org/10.1016/j.ijnurstu.2017.08.004
Lake ET, Riman KA, Sloane DM. Improved work environments and staffing lead to less missed nursing care: a panel study. J Nurs Manag. 2020;28(8):2157–65. https://doi.org/10.1111/jonm.12970
Chaboyer W, Harbeck E, Lee BO, Grealish L. Missed nursing care: an overview of reviews. KJMS. 2021;37(2):82–91. https://doi.org/10.1002/kjm2.12308
Nahasaram ST, Ramoo V, Lee WL. Missed nursing care in the Malaysian context: a cross-sectional study from nurses’ perspective. J Nurs Manag. 2021;29(6):1848–56. https://doi.org/10.1111/jonm.13281
Lima JCd, Silva AEBC, Caliri MHL. Omission of nursing care in hospitalization units. Rev Lat-Am Enferm. 2020;28. https://doi.org/10.1590/1518-8345.3138.3233
Chegini Z, Jafari-Koshki T, Kheiri M, Behforoz A, Aliyari S, Mitra U, et al. Missed nursing care and related factors in Iranian hospitals: a cross‐sectional survey. J Nurs Manag. 2020;28(8):2205–15. https://doi.org/10.1111/jonm.13055
Duffy JR, Culp S, Padrutt T. Description and factors associated with missed nursing care in an acute care community hospital. JONA. 2018;48(7/8):361–7. https://doi.org/10.1097/NNA.0000000000000630
Chiappinotto S, Papastavrou E, Efstathiou G, Andreou P, Stemmer R, Ströhm C, et al. Antecedents of unfinished nursing care: a systematic review of the literature. BMC Nurs. 2022;21(1):137. https://doi.org/10.1186/s12912-022-00890-6
Ausserhofer D, Zander B, Busse R, Schubert M, De Geest S, Rafferty AM, et al. Prevalence, patterns and predictors of nursing care left undone in European hospitals: results from the multicountry cross-sectional RN4CAST study. BMJ Qual Saf. 2014;23(2):126–35. https://doi.org/10.1136/bmjqs-2013-002318
Amaliyah E, Tukimin S. The relationship between working environment and quality of nursing care: an integrative literature review. Br J Health Care Manag. 2021;27(7):194–200. https://doi.org/10.12968/bjhc.2020.0043
Azevedo FMd, Rodrigues MCS, Cimiotti JP. Nursing practice environment in intensive care units. ACTA Paul Enferm. 2018;31:217–23. https://doi.org/10.1590/1982-0194201800031
Knupp AM, Patterson ES, Ford JL, Zurmehly J, Patrick T. Associations among nurse fatigue, individual nurse factors, and aspects of the nursing practice environment. J Nurs Adm. 2018;48(12):642–8. https://doi.org/10.1097/nna.0000000000000693
Al Sabei SD, Labrague LJ, Miner Ross A, Karkada S, Albashayreh A, Al Masroori F, et al. Nursing work environment, turnover intention, job burnout, and quality of care: the moderating role of job satisfaction. J Nurs Scholarsh. 2020;52(1):95–104. https://doi.org/10.1111/jnu.12528
Lake ET, de Cordova PB, Barton S, Singh S, Agosto PD, Ely B, et al. Missed nursing care in pediatrics. Hosp Pediatr. 2017;7(7):378–84. https://doi.org/10.1542/hpeds.2016-0141
Smith JG, Morin KH, Wallace LE, Lake ET. Association of the nurse work environment, collective efficacy, and missed care. West J Nurs Res. 2018;40(6):779–98. https://doi.org/10.1177/0193945917734159
Park SH, Hanchett M, Ma C. Practice environment characteristics associated with missed nursing care. J Nurs Scholarsh. 2018;50(6):722–30. https://doi.org/10.1111/jnu.12434
Kalisch BJ, Williams RA. Development and psychometric testing of a tool to measure missed nursing care. J Nurs Adm. 2009;39(5):211–9. https://doi.org/10.1097/NNA.0b013e3181a23cf5
Khajooee R, Bagherian B, Dehghan M, Azizzadeh Forouzi M. Missed nursing care and its related factors from the points of view of nurses affiliated to Kerman University of Medical Sciences in 2017. Hayat. 2019;25(1):11–24.
Elmi S, Hassankhani H, Abdollahzadeh F, Abadi MAJ, Scott J, Nahamin M. Validity and reliability of the Persian practice environment scale of nursing work index. IJNMR. 2017;22(2):106. https://doi.org/10.4103/1735-9066.205953
Lake ET. Development of the practice environment scale of the nursing work index. Res Nurs Health. 2002;25(3):176–88. https://doi.org/10.1002/nur.10032
Gustafsson N, Leino-Kilpi H, Prga I, Suhonen R, Stolt M. Missed care from the patient’s perspective–a scoping review. Patient Prefer Adherence. 2020;25:383–400. https://doi.org/10.2147/PPA.S238024
Jones TL, Hamilton P, Murry N. Unfinished nursing care, missed care, and implicitly rationed care: state of the science review. Int J Nurs Stud. 2015;52(6):1121–37. https://doi.org/10.1016/j.ijnurstu.2015.02.012
Bragadóttir H, Burmeister EA, Terzioglu F, Kalisch BJ. The association of missed nursing care and determinants of satisfaction with current position for direct-care nurses—an international study. J Nurs Manag. 2020;28(8):1851–60. https://doi.org/10.1111/jonm.13051
Bruzios K, Harwood E. Issues of response styles. Wiley Encyclopedia Personality Individual Differences: Meas Assess. 2020:169–73. https://doi.org/10.1002/9781118970843.ch99
Price S, Reichert C. The importance of continuing professional development to career satisfaction and patient care: meeting the needs of novice to mid-to late-career nurses throughout their career span. Adm Sci. 2017;7(2):17. https://doi.org/10.3390/admsci7020017
Stalpers D, Van Der Linden D, Kaljouw MJ, Schuurmans MJ. Nurse-perceived quality of care in intensive care units and associations with work environment characteristics: a multicenter survey study. J Adv Nurs. 2017;73(6):1482–90. https://doi.org/10.1111/jan.13242
Blackman I, Papastavrou E, Palese A, Vryonides S, Henderson J, Willis E. Predicting variations to missed nursing care: a three-nation comparison. J Nurs Manag. 2018;26(1):33–41. https://doi.org/10.1111/jonm.12514
Khaki S, Esmaeilpourzanjani S, Mashouf S. Nursing cares quality in nurses. S J Nursing, Midwifery and Paramedical Faculty. 2018;3:1–14. https://doi.org/10.29252/sjnmp.3.4.1
Download references
We want to thank all the nurses who participated in this study. Their invaluable contributions were crucial in making this research possible. We would also like to thank the hospitals in Zanjan Province for their cooperation and support during the data collection. Furthermore, we would like to acknowledge the Zanjan University of Medical Sciences’ Biomedical Research Ethics Committee for approving and overseeing the ethical aspects of this research. We are grateful for their collaboration and commitment to advancing healthcare research, which made this study possible.
This work was supported by the Research and Technology Deputy of Zanjan University of Medical Sciences, Zanjan, Iran (grant number: A-11-86-17).
Authors and affiliations.
Department of Medical-Surgical Nursing, School of Nursing and Midwifery, Zanjan University of Medical Sciences, Zanjan, Iran
Somayeh Babaei
Department of Psychiatric Nursing, School of Nursing and Midwifery, Zanjan University of Medical Sciences, Mahdavi St., Zanjan, 4515789589, Iran
Kourosh Amini
Department of Critical Care Nursing, School of Nursing and Midwifery, Zanjan University of Medical Sciences, Zanjan, Iran
Farhad Ramezani-Badr
You can also search for this author in PubMed Google Scholar
Study design: KA. Data collection: SB. Data analysis: KA, FR. Study supervision: KA. Manuscript writing: KA, SB, FR. Critical revisions for important intellectual content: KA, FR.
Correspondence to Kourosh Amini .
Ethics approval and consent to participate.
The research proposal with the code IR.ZUMS.REC.1399.053 was approved by the Zanjan University of Medical Sciences’ Biomedical Research Ethics Committee (ZUMS.REC). We obtained written informed consent from all participants and preserved the confidential identity of each participant throughout the study. Before using the two MISSCARE Survey and Practice Environment Scale questionnaires, permission was obtained from the developers of the participants (Professor Kalisch and Professor Lake, respectively) through email.
Not applicable.
The authors declare no competing interests.
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .
Reprints and permissions
Cite this article.
Babaei, S., Amini, K. & Ramezani-Badr, F. Unveiling missed nursing care: a comprehensive examination of neglected responsibilities and practice environment challenges. BMC Health Serv Res 24 , 977 (2024). https://doi.org/10.1186/s12913-024-11386-1
Download citation
Received : 31 January 2024
Accepted : 01 August 2024
Published : 23 August 2024
DOI : https://doi.org/10.1186/s12913-024-11386-1
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative
ISSN: 1472-6963
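The height-comparison scenario described earlier (testing whether mean height differs between male and female university students) can be sketched as a pooled two-sample t-test. The sample values below are hypothetical, chosen purely for illustration; substitute the actual study data before drawing any conclusions.

```python
# Hypothetical worked example: pooled two-sample t-test for a difference in
# mean height between male and female university students.
# NOTE: the sample values are illustrative, not real survey data.
import math
import statistics

male = [175.3, 180.1, 172.8, 178.4, 176.9, 181.2, 174.5, 179.0]    # heights in cm
female = [162.4, 165.7, 160.9, 167.2, 163.8, 166.1, 161.5, 164.9]  # heights in cm

n1, n2 = len(male), len(female)
mean1, mean2 = statistics.mean(male), statistics.mean(female)
var1, var2 = statistics.variance(male), statistics.variance(female)  # sample variances

# Pooled standard deviation (assumes equal population variances).
sp = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# Test statistic for H0: mu_male = mu_female.
t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Critical value for a two-tailed test at alpha = 0.05 with df = n1 + n2 - 2 = 14.
t_crit = 2.145
print(f"t = {t:.3f} (critical values: ±{t_crit})")
if abs(t) > t_crit:
    print("Reject H0: the mean heights differ significantly.")
else:
    print("Fail to reject H0: no significant difference detected.")
```

Equivalently, a p-value can be read off directly with `scipy.stats.ttest_ind(male, female)` if SciPy is available; a p-value below 0.05 corresponds to `|t|` exceeding the critical value above.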