HCA Healthc J Med. 2020;1(2).
Introduction to Research Statistical Analysis: An Overview of the Basics

Christian Vandever

1 HCA Healthcare Graduate Medical Education

Abstract

This article covers many statistical ideas essential to research statistical analysis. Sample size is explained through the concepts of statistical significance level and power. Variable types and definitions are included to clarify what is needed for interpreting the analysis. Categorical and quantitative variable types are defined, as well as response and predictor variables. Statistical tests described include t-tests, ANOVA and chi-square tests. Multiple regression is also explored for both logistic and linear regression. Finally, the most common statistics produced by these methods are explored.

Introduction

Statistical analysis is necessary for any research project seeking to make quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, while not diving too deep into any specific methodology. Some of the information is more applicable to retrospective projects, where analysis is performed on data that have already been collected, but most of it will be suitable for any type of research. This primer is intended to help the reader understand research results in coordination with a statistician, not to perform the actual analysis. Analysis is commonly performed using statistical programming software such as R, SAS or SPSS. These allow the analysis to be replicated while minimizing the risk of error. Resources are listed later for those working on analysis without a statistician.

After coming up with a hypothesis for a study, including any variables to be used, one of the first steps is to think about the patient population to which the question applies. Results are only relevant to the population that the underlying data represent. Since it is impractical to include everyone with a certain condition, a subset of the population of interest should be taken. This subset should be large enough to have power, which means there is enough data to deliver significant results and accurately reflect the study’s population.

The first statistics of interest are related to significance level and power, alpha and beta. Alpha (α) is the significance level and probability of a type I error, the rejection of the null hypothesis when it is true. The null hypothesis is generally that there is no difference between the groups compared. A type I error is also known as a false positive. An example would be an analysis that finds one medication statistically better than another, when in reality there is no difference in efficacy between the two. Beta (β) is the probability of a type II error, the failure to reject the null hypothesis when it is actually false. A type II error is also known as a false negative. This occurs when the analysis finds there is no difference in two medications when in reality one works better than the other. Power is defined as 1-β and should be calculated prior to running any sort of statistical testing. Ideally, alpha should be as small as possible while power should be as large as possible. Power generally increases with a larger sample size, but so does cost and the effect of any bias in the study design. Additionally, as the sample size gets bigger, the chance for a statistically significant result goes up even though these results can be small differences that do not matter practically. Power calculators include the magnitude of the effect in order to combat the potential for exaggeration and only give significant results that have an actual impact. The calculators take inputs like the mean, effect size and desired power, and output the required minimum sample size for analysis. Effect size is calculated using statistical information on the variables of interest. If that information is not available, most tests have commonly used values for small, medium or large effect sizes.
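
As an illustration, the sketch below performs the kind of a-priori sample size calculation described above for a two-group comparison. It is a minimal sketch in Python (the R, SAS or SPSS tools mentioned earlier work similarly); the medium effect size of 0.5 is an assumed input for illustration, not a value from this article.

```python
# A minimal sketch of an a-priori sample size calculation for a
# two-sample t-test. The effect size (Cohen's d = 0.5, "medium") is an
# illustrative assumption; alpha and power follow common conventions.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,          # assumed standardized difference between groups
    alpha=0.05,               # significance level: type I error probability
    power=0.80,               # 1 - beta: probability of detecting a true effect
    alternative="two-sided",
)
print(f"Minimum sample size per group: {n_per_group:.0f}")  # about 64
```

Lowering alpha, raising the desired power or assuming a smaller effect size all increase the required sample size.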

When the desired patient population is decided, the next step is to define the variables previously chosen to be included. Variables come in different types that determine which statistical methods are appropriate and useful. One way variables can be split is into categorical and quantitative variables. (Table 1) Categorical variables place patients into groups, such as gender, race and smoking status. Quantitative variables measure or count some quantity of interest. Common quantitative variables in research include age and weight. An important note is that there can often be a choice for whether to treat a variable as quantitative or categorical. For example, in a study looking at body mass index (BMI), BMI could be defined as a quantitative variable or as a categorical variable, with each patient’s BMI listed as a category (underweight, normal, overweight, and obese) rather than the discrete value. The decision whether a variable is quantitative or categorical will affect what conclusions can be made when interpreting results from statistical tests. Keep in mind that since quantitative variables are treated on a continuous scale, it would be inappropriate to transform a variable like which medication was given into a quantitative variable with values 1, 2 and 3.

Table 1. Categorical vs. Quantitative Variables

Categorical Variables | Quantitative Variables
Categorize patients into discrete groups | Continuous values that measure a variable
Patient categories are mutually exclusive | For time-based studies, there would be a new variable for each measurement at each time
Examples: race, smoking status, demographic group | Examples: age, weight, heart rate, white blood cell count

Both of these types of variables can also be split into response and predictor variables. (Table 2) Predictor variables are explanatory, or independent, variables that help explain changes in a response variable. Conversely, response variables are outcome, or dependent, variables whose changes can be partially explained by the predictor variables.

Table 2. Response vs. Predictor Variables

Response Variables | Predictor Variables
Outcome variables | Explanatory variables
Should be the result of the predictor variables | Should help explain changes in the response variables
One variable per statistical test | Can be multiple variables that may have an impact on the response variable
Can be categorical or quantitative | Can be categorical or quantitative

Choosing the correct statistical test depends on the types of variables defined and the question being answered. Some common statistical tests include t-tests, ANOVA and chi-square tests.

T-tests compare whether there are differences in a quantitative variable between two values of a categorical variable. For example, a t-test could be useful to compare the length of stay for knee replacement surgery patients between those that took apixaban and those that took rivaroxaban. A t-test could examine whether there is a statistically significant difference in the length of stay between the two groups. The t-test will output a p-value, a number between zero and one, which represents the probability that the two groups could be as different as they are in the data, if they were actually the same. A value closer to zero suggests that the difference, in this case for length of stay, is more statistically significant than a number closer to one. Prior to collecting the data, set a significance level, the previously defined alpha. Alpha is typically set at 0.05, but is commonly reduced in order to limit the chance of a type I error, or false positive. Going back to the example above, if alpha is set at 0.05 and the analysis gives a p-value of 0.039, then a statistically significant difference in length of stay is observed between apixaban and rivaroxaban patients. If the analysis gives a p-value of 0.91, then there was no statistical evidence of a difference in length of stay between the two medications. Other statistical summaries or methods examine how big of a difference that might be. These other summaries are known as post-hoc analysis since they are performed after the original test to provide additional context to the results.
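
A minimal sketch of this comparison in Python follows; the length-of-stay values are fabricated placeholders, and scipy performs the test.

```python
# A minimal sketch of the two-group t-test described above, comparing
# length of stay (days) between two anticoagulant groups. All data
# values are fabricated for illustration.
from scipy import stats

los_apixaban = [3.1, 2.8, 4.0, 3.5, 2.9, 3.3]
los_rivaroxaban = [3.9, 4.2, 3.7, 4.5, 3.6, 4.1]

t_stat, p_value = stats.ttest_ind(los_apixaban, los_rivaroxaban)
alpha = 0.05  # significance level set before collecting the data
if p_value < alpha:
    print(f"p = {p_value:.3f}: statistically significant difference in length of stay")
else:
    print(f"p = {p_value:.3f}: no statistical evidence of a difference")
```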

Analysis of variance, or ANOVA, tests for mean differences in a quantitative variable between values of a categorical variable, typically with three or more values, which distinguishes it from a t-test. ANOVA could add patients given dabigatran to the previous population and evaluate whether the length of stay was significantly different across the three medications. If the p-value is lower than the designated significance level, then the hypothesis that length of stay was the same across the three medications is rejected. Summaries and post-hoc tests could then be performed to look at the differences in length of stay and identify which individual medications showed statistically significant differences from the others. A chi-square test examines the association between two categorical variables. An example would be to consider whether the rate of having a post-operative bleed is the same across patients provided with apixaban, rivaroxaban and dabigatran. A chi-square test can compute a p-value determining whether the bleeding rates were significantly different or not. Post-hoc tests could then give the bleeding rate for each medication, as well as a breakdown as to which specific medications may have a significantly different bleeding rate from each other.
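
A minimal sketch of both tests in Python; all counts and lengths of stay are fabricated placeholders, not study data.

```python
# A minimal sketch of the ANOVA and chi-square tests described above.
# f_oneway compares mean length of stay across three medication groups;
# chi2_contingency tests whether post-operative bleed rates differ.
from scipy import stats

los_apixaban = [3.1, 2.8, 4.0, 3.5, 2.9]
los_rivaroxaban = [3.9, 4.2, 3.7, 4.5, 3.6]
los_dabigatran = [3.4, 3.0, 3.8, 3.2, 3.6]

f_stat, p_anova = stats.f_oneway(los_apixaban, los_rivaroxaban, los_dabigatran)

# Contingency table: rows = medication, columns = [bleed, no bleed]
bleed_table = [[5, 95],   # apixaban
               [9, 91],   # rivaroxaban
               [7, 93]]   # dabigatran
chi2, p_chi2, dof, expected = stats.chi2_contingency(bleed_table)

print(f"ANOVA p-value: {p_anova:.3f}, chi-square p-value: {p_chi2:.3f}")
```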

A slightly more advanced way of examining a question can come through multiple regression. Regression allows more predictor variables to be analyzed and can act as a control when looking at associations between variables. Common control variables are age, sex and any comorbidities likely to affect the outcome variable that are not closely related to the other explanatory variables. Control variables can be especially important in reducing the effect of bias in a retrospective population. Since retrospective data were not collected with the research question in mind, it is important to eliminate threats to the validity of the analysis. Testing that controls for confounding variables, such as regression, is often more valuable with retrospective data because it can ease these concerns.

The two main types of regression are linear and logistic. Linear regression is used to predict differences in a quantitative, continuous response variable, such as length of stay. Logistic regression predicts differences in a dichotomous, categorical response variable, such as 90-day readmission. So whether the outcome variable is categorical or quantitative, regression can be appropriate. An example of each type can be found in two similar cases. For both examples, define the predictor variables as age, gender and anticoagulant usage. In the first, use the predictor variables in a linear regression to evaluate their individual effects on length of stay, a quantitative variable. In the second, use the same predictor variables in a logistic regression to evaluate their individual effects on whether the patient had a 90-day readmission, a dichotomous categorical variable. Analysis can compute a p-value for each included predictor variable to determine whether it is significantly associated with the response.

The statistical tests in this article generate an associated test statistic, which determines the probability of obtaining the observed results if there were no association between the compared variables. These results often come with coefficients, which give the degree of the association and the degree to which one variable changes with another. Most tests, including all listed in this article, also have confidence intervals, which give a range for the association with a specified level of confidence.

Even if these tests do not give statistically significant results, the results are still important. Not reporting statistically insignificant findings creates a bias in research: ideas can be repeated enough times that eventually statistically significant results are reached, even though there is no true effect. In some cases with very large sample sizes, p-values will almost always be significant. In this case the effect size is critical, as even the smallest, practically meaningless differences can be found to be statistically significant.
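
A minimal sketch of the two regression examples above in Python; the data frame, column names and values are fabricated assumptions for illustration only.

```python
# A minimal sketch of the linear and logistic regressions described above,
# using statsmodels' formula interface on fabricated data. All column
# names and values are illustrative assumptions, not from this article.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(40, 90, n),
    "gender": rng.choice(["F", "M"], n),
    "anticoagulant": rng.choice(["apixaban", "rivaroxaban", "dabigatran"], n),
})
df["length_of_stay"] = 2 + 0.03 * df["age"] + rng.normal(0, 1, n)  # quantitative outcome
df["readmit_90day"] = rng.binomial(1, 0.2, n)                      # dichotomous outcome

# Linear regression: quantitative response (length of stay)
linear = smf.ols("length_of_stay ~ age + C(gender) + C(anticoagulant)", data=df).fit()

# Logistic regression: dichotomous response (90-day readmission)
logistic = smf.logit("readmit_90day ~ age + C(gender) + C(anticoagulant)", data=df).fit()

# Each summary reports a coefficient, p-value and confidence interval per predictor.
print(linear.summary())
print(logistic.summary())
```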

These variables and tests are just some things to keep in mind before, during and after the analysis process in order to make sure that the statistical reports are supporting the questions being answered. The patient population, types of variables and statistical tests are all important things to consider in the process of statistical analysis. Any results are only as useful as the process used to obtain them. This primer can be used as a reference to help ensure appropriate statistical analysis.

Definitions of Statistical Terms

Alpha (α): the significance level and probability of a type I error; the probability of a false positive
Analysis of variance (ANOVA): test observing mean differences in a quantitative variable between values of a categorical variable, typically with three or more values, distinguishing it from a t-test
Beta (β): the probability of a type II error; the probability of a false negative
Categorical variable: places patients into groups, such as gender, race or smoking status
Chi-square test: examines the association between two categorical variables
Confidence interval: a range for the association with a specified level of confidence, 95% for example
Control variables: variables likely to affect the outcome variable that are not closely related to the other explanatory variables
Hypothesis: the idea being tested by statistical analysis
Linear regression: regression used to predict differences in a quantitative, continuous response variable, such as length of stay
Logistic regression: regression used to predict differences in a dichotomous, categorical response variable, such as 90-day readmission
Multiple regression: regression utilizing more than one predictor variable
Null hypothesis: the hypothesis that there are no significant differences for the variable(s) being tested
Patient population: the population the data are collected to represent
Post-hoc analysis: analysis performed after the original test to provide additional context to the results
Power: 1 − β, the probability of avoiding a type II error (a false negative)
Predictor variable: explanatory, or independent, variable that helps explain changes in a response variable
p-value: a value between zero and one representing the probability of obtaining results at least as extreme as those observed if the null hypothesis were true, usually compared against a significance level to judge statistical significance
Quantitative variable: variable measuring or counting some quantity of interest
Response variable: outcome, or dependent, variable whose changes can be partially explained by the predictor variables
Retrospective study: a study using previously existing data that were not originally collected for the purposes of the study
Sample size: the number of patients or observations used for the study
Significance level: alpha, the probability of a type I error, usually compared to a p-value to determine statistical significance
Statistical analysis: analysis of data using statistical testing to examine a research hypothesis
Statistical testing: testing used to examine the validity of a hypothesis using statistical calculations
Statistical significance: whether to reject the null hypothesis, i.e., whether the p-value is below the threshold of a predetermined significance level
t-test: test comparing whether there are differences in a quantitative variable between two values of a categorical variable

Funding Statement

This research was supported (in whole or in part) by HCA Healthcare and/or an HCA Healthcare affiliated entity.

Conflicts of Interest

The author declares he has no conflicts of interest.

Christian Vandever is an employee of HCA Healthcare Graduate Medical Education, an organization affiliated with the journal’s publisher.

The views expressed in this publication represent those of the author(s) and do not necessarily represent the official views of HCA Healthcare or any of its affiliated entities.

Data Science: the impact of statistics


Claus Weihs and Katja Ickstadt


In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview of different proposed structures of Data Science and address the impact of statistics on such steps as data acquisition and enrichment, data exploration, data analysis and modeling, validation, and representation and reporting. Also, we indicate fallacies that arise when statistical reasoning is neglected.


1 Introduction and premise

Data Science as a scientific discipline is influenced by informatics, computer science, mathematics, operations research, and statistics as well as the applied sciences.

In 1996, for the first time, the term Data Science was included in the title of a statistical conference (International Federation of Classification Societies (IFCS) “Data Science, classification, and related methods”) [37]. Even though the term was coined by statisticians, the public image of Data Science often stresses the importance of computer science and business applications much more, in particular in the era of Big Data.

Already in the 1970s, the ideas of John Tukey [43] changed the viewpoint of statistics from a purely mathematical setting, e.g., statistical testing, to deriving hypotheses from data (exploratory setting), i.e., trying to understand the data before hypothesizing.

Another root of Data Science is Knowledge Discovery in Databases (KDD) [ 36 ] with its sub-topic Data Mining . KDD already brings together many different approaches to knowledge discovery, including inductive learning, (Bayesian) statistics, query optimization, expert systems, information theory, and fuzzy sets. Thus, KDD is a big building block for fostering interaction between different fields for the overall goal of identifying knowledge in data.

Nowadays, these ideas are combined in the notion of Data Science, leading to different definitions. One of the most comprehensive definitions of Data Science was recently given by Cao as the formula [ 12 ]:

data science = (statistics + informatics + computing + communication + sociology + management) | (data + environment + thinking) .

In this formula, sociology stands for the social aspects and | (data + environment + thinking) means that all the mentioned sciences act on the basis of data, the environment and the so-called data-to-knowledge-to-wisdom thinking.

A recent, comprehensive overview of Data Science, provided by Donoho in 2015 [16], focuses on the evolution of Data Science from statistics. Indeed, as early as 1997, there was an even more radical view suggesting that statistics be renamed Data Science [50]. And in 2015, a number of ASA leaders [17] released a statement about the role of statistics in Data Science, saying that “statistics and machine learning play a central role in data science.”

In our view, statistical methods are crucial in most fundamental steps of Data Science. Hence, the premise of our contribution is:

Statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty.

This paper aims at addressing the major impact of statistics on the most important steps in Data Science.

2 Steps in data science

One of the forerunners of Data Science from a structural perspective is the famous CRISP-DM (Cross-Industry Standard Process for Data Mining), which is organized in six main steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment [10]; see Table 1, left column. Ideas like CRISP-DM are now fundamental for applied statistics.

In our view, the main steps in Data Science have been inspired by CRISP-DM and have evolved, leading to, e.g., our definition of Data Science as a sequence of the following steps: Data Acquisition and Enrichment, Data Storage and Access , Data Exploration, Data Analysis and Modeling, Optimization of Algorithms , Model Validation and Selection, Representation and Reporting of Results, and Business Deployment of Results . Note that topics in small capitals indicate steps where statistics is less involved, cp. Table  1 , right column.

Usually, these steps are not just conducted once but are iterated in a cyclic loop. In addition, it is common to alternate between two or more steps. This holds especially for the steps Data Acquisition and Enrichment , Data Exploration , and Statistical Data Analysis , as well as for Statistical Data Analysis and Modeling and Model Validation and Selection .

Table  1 compares different definitions of steps in Data Science. The relationship of terms is indicated by horizontal blocks. The missing step Data Acquisition and Enrichment in CRISP-DM indicates that that scheme deals with observational data only. Moreover, in our proposal, the steps Data Storage and Access and Optimization of Algorithms are added to CRISP-DM, where statistics is less involved.

The list of steps for Data Science may even be enlarged, see, e.g., Cao in [ 12 ], Figure 6, cp. also Table  1 , middle column, for the following recent list: Domain-specific Data Applications and Problems, Data Storage and Management, Data Quality Enhancement, Data Modeling and Representation, Deep Analytics, Learning and Discovery, Simulation and Experiment Design, High-performance Processing and Analytics, Networking, Communication, Data-to-Decision and Actions.

In principle, Cao’s and our proposal cover the same main steps. However, in parts, Cao’s formulation is more detailed; e.g., our step Data Analysis and Modeling corresponds to Data Modeling and Representation, Deep Analytics, Learning and Discovery . Also, the vocabularies differ slightly, depending on whether the respective background is computer science or statistics. In that respect note that Experiment Design in Cao’s definition means the design of the simulation experiments.

In what follows, we will highlight the role of statistics discussing all the steps, where it is heavily involved, in Sects.  2.1 – 2.6 . These coincide with all steps in our proposal in Table  1 except steps in small capitals. The corresponding entries Data Storage and Access and Optimization of Algorithms are mainly covered by informatics and computer science , whereas Business Deployment of Results is covered by Business Management .

2.1 Data acquisition and enrichment

Design of experiments (DOE) is essential for a systematic generation of data when the effect of noisy factors has to be identified. Controlled experiments are fundamental for robust process engineering to produce reliable products despite variation in the process variables. On the one hand, even controllable factors contain a certain amount of uncontrollable variation that affects the response. On the other hand, some factors, like environmental factors, cannot be controlled at all. Nevertheless, at least the effect of such noisy influencing factors should be controlled by, e.g., DOE.

DOE can be utilized, e.g.,

to systematically generate new data ( data acquisition ) [ 33 ],

for systematically reducing data bases [ 41 ], and

for tuning (i.e., optimizing) parameters of algorithms [ 1 ], i.e., for improving the data analysis methods (see Sect.  2.3 ) themselves.
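
As a minimal illustration of the first point, the Python sketch below enumerates a full factorial design, the simplest DOE scheme; the factors and levels are invented placeholders, and each run would then be executed or simulated to obtain a response.

```python
# A minimal sketch of systematic data generation via a full factorial
# design. Factors and levels are illustrative assumptions; every
# combination of levels defines one experimental run.
from itertools import product

factors = {
    "temperature": [150, 170, 190],   # hypothetical process factors
    "pressure": [1.0, 1.5],
    "catalyst": ["A", "B"],
}

design = [dict(zip(factors, levels)) for levels in product(*factors.values())]
for run_id, run in enumerate(design, start=1):
    print(run_id, run)   # 3 * 2 * 2 = 12 runs
```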

Simulations [ 7 ] may also be used to generate new data. A tool for the enrichment of data bases to fill data gaps is the imputation of missing data [ 31 ].

Such statistical methods for data generation and enrichment need to be part of the backbone of Data Science. The exclusive use of observational data without any noise control distinctly diminishes the quality of data analysis results and may even lead to wrong result interpretation. The hope for “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” [ 4 ] appears to be wrong due to noise in the data.

Thus, experimental design is crucial for the reliability, validity, and replicability of our results.

2.2 Data exploration

Exploratory statistics is essential for data preprocessing to learn about the contents of a data base. Exploration and visualization of observed data were, in a way, initiated by John Tukey [43]. Since that time, the most laborious part of data analysis, namely data understanding and transformation, has become an important part of statistical science.

Data exploration or data mining is fundamental for the proper usage of analytical methods in Data Science. The most important contribution of statistics is the notion of distribution . It allows us to represent variability in the data as well as (a-priori) knowledge of parameters, the concept underlying Bayesian statistics. Distributions also enable us to choose adequate subsequent analytic models and methods.
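
A minimal sketch of this idea in Python: fitting a normal distribution to observed data (simulated here as a stand-in) in order to report variability alongside a location estimate.

```python
# A minimal sketch of the role of distributions in exploration: fit a
# normal distribution to data and use it to quantify variability.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=500)  # stand-in for observed data

mu, sigma = stats.norm.fit(sample)                  # estimated location and spread
lo, hi = stats.norm.interval(0.95, loc=mu, scale=sigma)
print(f"mean={mu:.2f}, sd={sigma:.2f}, central 95% range=({lo:.2f}, {hi:.2f})")
```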

2.3 Statistical data analysis

Finding structure in data and making predictions are the most important steps in Data Science. Here, in particular, statistical methods are essential since they are able to handle many different analytical tasks. Important examples of statistical data analysis methods are the following.

Hypothesis testing is one of the pillars of statistical analysis. Questions arising in data-driven problems can often be translated into hypotheses. Also, hypotheses are the natural links between underlying theory and statistics. Since statistical hypotheses are related to statistical tests, questions and theory can be tested on the available data. Using the same data in multiple tests often makes it necessary to correct significance levels. In applied statistics, correct multiple testing is one of the most important problems, e.g., in pharmaceutical studies [15]. Ignoring such techniques would lead to many more significant results than justified.
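
As a small illustration, the Python sketch below applies a Holm correction to a set of raw p-values; the p-values are fabricated, and statsmodels stands in for the correction machinery mentioned above.

```python
# A minimal sketch of multiple-testing correction, assuming a set of raw
# p-values from repeated tests on the same data. multipletests implements
# Bonferroni, Holm and other corrections.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.012, 0.034, 0.048, 0.21]  # fabricated example p-values
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
print(p_adjusted)  # corrected p-values are larger than the raw ones
print(reject)      # which hypotheses remain significant after correction
```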

Classification methods are basic for finding and predicting subpopulations from data. In the so-called unsupervised case, such subpopulations are to be found from a data set without a-priori knowledge of any cases of such subpopulations. This is often called clustering.

In the so-called supervised case, classification rules should be found from a labeled data set for the prediction of unknown labels when only influential factors are available.

Nowadays, there is a plethora of methods for the unsupervised [22] as well as for the supervised case [2].
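
A minimal sketch of both cases in Python with scikit-learn, using its bundled iris data as a stand-in: k-means clustering for the unsupervised case and logistic regression as a supervised classification rule. The method choices are illustrative, not taken from the paper.

```python
# A minimal sketch of unsupervised clustering and supervised classification.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Unsupervised: find subpopulations without using the labels y
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn a classification rule from the labeled data set
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))  # training accuracy of the learned rule
```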

In the age of Big Data, a new look at the classical methods appears to be necessary, though, since most of the time the calculation effort of complex analysis methods grows faster than linearly with the number of observations n or the number of features p. In the case of Big Data, i.e., if n or p is large, this leads to prohibitive calculation times and to numerical problems. This results both in the comeback of simpler optimization algorithms with low time-complexity [9] and in the re-examination of traditional methods in statistics and machine learning for Big Data [46].

Regression methods are the main tool to find global and local relationships between features when the target variable is measured. Depending on the distributional assumption for the underlying data, different approaches may be applied. Under the normality assumption, linear regression is the most common method, while generalized linear regression is usually employed for other distributions from the exponential family [ 18 ]. More advanced methods comprise functional regression for functional data [ 38 ], quantile regression [ 25 ], and regression based on loss functions other than squared error loss like, e.g., Lasso regression [ 11 , 21 ]. In the context of Big Data, the challenges are similar to those for classification methods given large numbers of observations n (e.g., in data streams) and / or large numbers of features p . For the reduction of n , data reduction techniques like compressed sensing, random projection methods [ 20 ] or sampling-based procedures [ 28 ] enable faster computations. For decreasing the number p to the most influential features, variable selection or shrinkage approaches like the Lasso [ 21 ] can be employed, keeping the interpretability of the features. (Sparse) principal component analysis [ 21 ] may also be used.
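
A minimal sketch of the Lasso for decreasing p, in Python with scikit-learn; the synthetic data, with only 5 of 50 features informative, and the penalty strength are illustrative assumptions.

```python
# A minimal sketch of Lasso regression for feature reduction: the L1
# penalty shrinks uninformative coefficients exactly to zero, keeping
# the interpretability of the surviving features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))  # most are shrunk to zero
```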

Time series analysis aims at understanding and predicting temporal structure [ 42 ]. Time series are very common in studies of observational data, and prediction is the most important challenge for such data. Typical application areas are the behavioral sciences and economics as well as the natural sciences and engineering. As an example, let us have a look at signal analysis, e.g., speech or music data analysis. Here, statistical methods comprise the analysis of models in the time and frequency domains. The main aim is the prediction of future values of the time series itself or of its properties. For example, the vibrato of an audio time series might be modeled in order to realistically predict the tone in the future [ 24 ] and the fundamental frequency of a musical tone might be predicted by rules learned from elapsed time periods [ 29 ].

In econometrics, multiple time series and their co-integration are often analyzed [ 27 ]. In technical applications, process control is a common aim of time series analysis [ 34 ].
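
A minimal sketch of time series prediction in Python: an ARIMA model from statsmodels fitted to a simulated autoregressive series. The model order and the data are illustrative assumptions, not from the paper.

```python
# A minimal sketch of time series prediction with an ARIMA model,
# fitted to a simulated AR(1) process.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):                 # simulate an AR(1) process
    y[t] = 0.7 * y[t - 1] + rng.normal()

model = ARIMA(y, order=(1, 0, 0)).fit()
print(model.forecast(steps=5))          # predict the next five values
```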

2.4 Statistical modeling

Complex interactions between factors can be modeled by graphs or networks . Here, an interaction between two factors is modeled by a connection in the graph or network [ 26 , 35 ]. The graphs can be undirected as, e.g., in Gaussian graphical models, or directed as, e.g., in Bayesian networks. The main goal in network analysis is deriving the network structure. Sometimes, it is necessary to separate (unmix) subpopulation specific network topologies [ 49 ].

Stochastic differential and difference equations can represent models from the natural and engineering sciences [ 3 , 39 ]. The finding of approximate statistical models solving such equations can lead to valuable insights for, e.g., the statistical control of such processes, e.g., in mechanical engineering [ 48 ]. Such methods can build a bridge between the applied sciences and Data Science.

Local models and globalization. Typically, statistical models are only valid in sub-regions of the domain of the involved variables. Then, local models can be used [8]. The analysis of structural breaks can be basic to identify the regions for local modeling in time series [5]. Also, the analysis of concept drifts can be used to investigate model changes over time [30].

In time series, there are often hierarchies of more and more global structures. For example, in music, a basic local structure is given by the notes and more and more global ones by bars, motifs, phrases, parts etc. In order to find global properties of a time series, properties of the local models can be combined to more global characteristics [ 47 ].

Mixture models can also be used for the generalization of local to global models [ 19 , 23 ]. Model combination is essential for the characterization of real relationships since standard mathematical models are often much too simple to be valid for heterogeneous data or bigger regions of interest.
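
A minimal sketch in Python: a two-component Gaussian mixture fitted to data pooled from two heterogeneous subgroups. Ignoring the mixture, a single global mean would describe neither subgroup. Data and component count are illustrative assumptions.

```python
# A minimal sketch of a mixture model: EM recovers the two subgroups
# hidden in the pooled sample.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 300),
                       rng.normal(4, 1, 700)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())    # recovered component means, near -3 and 4
print(gmm.weights_)          # recovered mixing proportions, near 0.3 / 0.7
```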

2.5 Model validation and model selection

In cases where more than one model is proposed for, e.g., prediction, statistical tests for comparing models are helpful to structure the models, e.g., concerning their predictive power [ 45 ].

Predictive power is typically assessed by means of so-called resampling methods where the distribution of power characteristics is studied by artificially varying the subpopulation used to learn the model. Characteristics of such distributions can be used for model selection [ 7 ].
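
A minimal sketch of such resampling in Python: 5-fold cross-validation estimates the predictive power of two candidate models on a stand-in data set, and the spread of the fold scores informs model selection. Models and data are illustrative choices, not from the paper.

```python
# A minimal sketch of resampling-based model comparison via
# cross-validation; the fold scores approximate the distribution of
# predictive power for each candidate model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # accuracy per fold
    print(f"{name}: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```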

Perturbation experiments offer another possibility to evaluate the performance of models. In this way, the stability of the different models against noise is assessed [ 32 , 44 ].

Meta-analysis as well as model averaging are methods to evaluate combined models [ 13 , 14 ].

Model selection has become more and more important in recent years, since the number of classification and regression models proposed in the literature is growing at an ever-increasing pace.

2.6 Representation and reporting

Visualization to interpret the structures found and the storing of models in an easy-to-update form are very important tasks in statistical analyses to communicate the results and safeguard data analysis deployment. Deployment is decisive for obtaining interpretable results in Data Science. It is the last step in CRISP-DM [10] and underlies the data-to-decision and action step in Cao [12].

Besides visualization and adequate model storing, for statistics, the main task is reporting of uncertainties and review [ 6 ].

3 Fallacies

The statistical methods described in Sect.  2 are fundamental for finding structure in data and for obtaining deeper insight into data, and thus, for a successful data analysis. Ignoring modern statistical thinking or using simplistic data analytics/statistical methods may lead to avoidable fallacies. This holds, in particular, for the analysis of big and/or complex data.

As mentioned at the end of Sect.  2.2 , the notion of distribution is the key contribution of statistics. Not taking into account distributions in data exploration and in modeling restricts us to report values and parameter estimates without their corresponding variability. Only the notion of distributions enables us to predict with corresponding error bands.

Moreover, distributions are the key to model-based data analytics. For example, unsupervised learning can be employed to find clusters in data. If additional structure like dependency on space or time is present, it is often important to infer parameters like cluster radii and their spatio-temporal evolution. Such model-based analysis heavily depends on the notion of distributions (see [ 40 ] for an application to protein clusters).

If more than one parameter is of interest, it is advisable to compare univariate hypothesis testing approaches to multiple procedures, e.g., in multiple regression, and to choose the most adequate model by variable selection. Restricting oneself to univariate testing would ignore relationships between variables.

Deeper insight into data might require more complex models, like, e.g., mixture models for detecting heterogeneous groups in data. When ignoring the mixture, the result often represents a meaningless average, and learning the subgroups by unmixing the components might be needed. In a Bayesian framework, this is enabled by, e.g., latent allocation variables in a Dirichlet mixture model. For an application of decomposing a mixture of different networks in a heterogeneous cell population in molecular biology see [ 49 ].

A mixture model might represent mixtures of components of very unequal sizes, with small components (outliers) being of particular importance. In the context of Big Data, naïve sampling procedures are often employed for model estimation. However, these have the risk of missing small mixture components. Hence, model validation or sampling according to a more suitable distribution as well as resampling methods for predictive power are important.

4 Conclusion

Following the above assessment of the capabilities and impacts of statistics our conclusion is:

The role of statistics in Data Science is under-estimated, e.g., compared to computer science. This holds, in particular, for the areas of data acquisition and enrichment as well as for advanced modeling needed for prediction.

Stimulated by this conclusion, statisticians are well-advised to play their role more assertively in this modern and well-accepted field of Data Science.

Only complementing and/or combining mathematical methods and computational algorithms with statistical reasoning, particularly for Big Data, will lead to scientific results based on suitable approaches. Ultimately, only a balanced interplay of all sciences involved will lead to successful solutions in Data Science.

Adenso-Diaz, B., Laguna, M.: Fine-tuning of algorithms using fractional experimental designs and local search. Oper. Res. 54 (1), 99–114 (2006)


Aggarwal, C.C. (ed.): Data Classification: Algorithms and Applications. CRC Press, Boca Raton (2014)


Allen, E., Allen, L., Arciniega, A., Greenwood, P.: Construction of equivalent stochastic differential equation models. Stoch. Anal. Appl. 26 , 274–297 (2008)


Anderson, C.: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine https://www.wired.com/2008/06/pb-theory/ (2008)

Aue, A., Horváth, L.: Structural breaks in time series. J. Time Ser. Anal. 34 (1), 1–16 (2013)

Berger, R.E.: A scientific approach to writing for engineers and scientists. IEEE PCS Professional Engineering Communication Series IEEE Press, Wiley (2014)


Bischl, B., Mersmann, O., Trautmann, H., Weihs, C.: Resampling methods for meta-model validation with recommendations for evolutionary computation. Evol. Comput. 20 (2), 249–275 (2012)

Bischl, B., Schiffner, J., Weihs, C.: Benchmarking local classification methods. Comput. Stat. 28 (6), 2599–2619 (2013)

Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 (2016)

Brown, M.S.: Data Mining for Dummies. Wiley, London (2014)

Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

Cao, L.: Data science: a comprehensive overview. ACM Comput. Surv. (2017). https://doi.org/10.1145/3076253

Claeskens, G., Hjort, N.L.: Model Selection and Model Averaging. Cambridge University Press, Cambridge (2008)

Cooper, H., Hedges, L.V., Valentine, J.C.: The Handbook of Research Synthesis and Meta-analysis. Russell Sage Foundation, New York City (2009)

Dmitrienko, A., Tamhane, A.C., Bretz, F.: Multiple Testing Problems in Pharmaceutical Statistics. Chapman and Hall/CRC, London (2009)

Donoho, D.: 50 Years of Data Science. http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf (2015)

Dyk, D.V., Fuentes, M., Jordan, M.I., Newton, M., Ray, B.K., Lang, D.T., Wickham, H.: ASA Statement on the Role of Statistics in Data Science. http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/ (2015)

Fahrmeir, L., Kneib, T., Lang, S., Marx, B.: Regression: Models, Methods and Applications. Springer, Berlin (2013)

Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, Berlin (2006)


Geppert, L., Ickstadt, K., Munteanu, A., Quedenfeld, J., Sohler, C.: Random projections for Bayesian regression. Stat. Comput. 27 (1), 79–101 (2017). https://doi.org/10.1007/s11222-015-9608-z


Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of Cluster Analysis. Chapman & Hall, London (2015)

Klein, H.U., Schäfer, M., Porse, B.T., Hasemann, M.S., Ickstadt, K., Dugas, M.: Integrative analysis of histone chip-seq and transcription data using Bayesian mixture models. Bioinformatics 30 (8), 1154–1162 (2014)

Knoche, S., Ebeling, M.: The musical signal: physically and psychologically, chap 2. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 15–68. CRC Press, Boca Raton (2017)

Koenker, R.: Quantile Regression. Econometric Society Monographs, vol. 38 (2010)

Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT press, Cambridge (2009)

Lütkepohl, H.: New Introduction to Multiple Time Series Analysis. Springer, Berlin (2010)

Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pp. 91–99. http://jmlr.org/proceedings/papers/v32/ma14.html (2014)

Martin, R., Nagathil, A.: Digital filters and spectral analysis, chap 4. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 111–143. CRC Press, Boca Raton (2017)

Mejri, D., Limam, M., Weihs, C.: A new dynamic weighted majority control chart for data streams. Soft Comput. 22(2), 511–522. https://doi.org/10.1007/s00500-016-2351-3

Molenberghs, G., Fitzmaurice, G., Kenward, M.G., Tsiatis, A., Verbeke, G.: Handbook of Missing Data Methodology. CRC Press, Boca Raton (2014)

Molinelli, E.J., Korkut, A., Wang, W.Q., Miller, M.L., Gauthier, N.P., Jing, X., Kaushik, P., He, Q., Mills, G., Solit, D.B., Pratilas, C.A., Weigt, M., Braunstein, A., Pagnani, A., Zecchina, R., Sander, C.: Perturbation Biology: Inferring Signaling Networks in Cellular Systems. arXiv preprint arXiv:1308.5193 (2013)

Montgomery, D.C.: Design and Analysis of Experiments, 8th edn. Wiley, London (2013)

Oakland, J.: Statistical Process Control. Routledge, London (2007)

Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Los Altos (1988)


Piateski, G., Frawley, W.: Knowledge Discovery in Databases. MIT Press, Cambridge (1991)

Press, G.: A Very Short History of Data Science. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/ (2013). [last visit: March 19, 2017]

Ramsay, J., Silverman, B.W.: Functional Data Analysis. Springer, Berlin (2005)

Särkkä, S.: Applied Stochastic Differential Equations. https://users.aalto.fi/~ssarkka/course_s2012/pdf/sde_course_booklet_2012.pdf (2012). [last visit: March 6, 2017]

Schäfer, M., Radon, Y., Klein, T., Herrmann, S., Schwender, H., Verveer, P.J., Ickstadt, K.: A Bayesian mixture model to quantify parameters of spatial clustering. Comput. Stat. Data Anal. 92 , 163–176 (2015). https://doi.org/10.1016/j.csda.2015.07.004

Schiffner, J., Weihs, C.: D-optimal plans for variable selection in data bases. Technical Report, 14/09, SFB 475 (2009)

Shumway, R.H., Stoffer, D.S.: Time Series Analysis and Its Applications: With R Examples. Springer, Berlin (2010)

Tukey, J.W.: Exploratory Data Analysis. Pearson, London (1977)

Vatcheva, I., de Jong, H., Mars, N.: Selection of perturbation experiments for model discrimination. In: Horn, W. (ed.) Proceedings of the 14th European Conference on Artificial Intelligence, ECAI-2000, IOS Press, pp 191–195 (2000)

Vatolkin, I., Weihs, C.: Evaluation, chap 13. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 329–363. CRC Press, Boca Raton (2017)

Weihs, C.: Big data classification — aspects on many features. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds.) Solving Large Scale Learning Tasks: Challenges and Algorithms, Springer Lecture Notes in Artificial Intelligence, vol. 9580, pp. 139–147 (2016)

Weihs, C., Ligges, U.: From local to global analysis of music time series. In: Morik, K., Siebes, A., Boulicault, J.F. (eds.) Detecting Local Patterns, Springer Lecture Notes in Artificial Intelligence, vol. 3539, pp. 233–245 (2005)

Weihs, C., Messaoud, A., Raabe, N.: Control charts based on models derived from differential equations. Qual. Reliab. Eng. Int. 26 (8), 807–816 (2010)

Wieczorek, J., Malik-Sheriff, R.S., Fermin, Y., Grecco, H.E., Zamir, E., Ickstadt, K.: Uncovering distinct protein-network topologies in heterogeneous cell populations. BMC Syst. Biol. 9 (1), 24 (2015)

Wu, J.: Statistics = data science? http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf (1997)


Acknowledgements

The authors would like to thank the editor, the guest editors and all reviewers for valuable comments on an earlier version of the manuscript. They also thank Leo Geppert for fruitful discussions.

Author information

Authors and Affiliations

Computational Statistics, TU Dortmund University, 44221, Dortmund, Germany

Claus Weihs

Mathematical Statistics and Biometric Applications, TU Dortmund University, 44221, Dortmund, Germany

Katja Ickstadt


Corresponding author

Correspondence to Claus Weihs.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Weihs, C., Ickstadt, K.: Data Science: the impact of statistics. Int J Data Sci Anal 6, 189–194 (2018). https://doi.org/10.1007/s41060-018-0102-5


Received: 20 March 2017

Accepted: 25 January 2018

Published: 16 February 2018

Issue Date: November 2018

DOI: https://doi.org/10.1007/s41060-018-0102-5


  • Structures of data science
  • Impact of statistics on data science
  • Fallacies in data science


Statistics for Research Students

(2 reviews)


Erich C Fein, Toowoomba, Australia

John Gilmour, Toowoomba, Australia

Tanya Machin, Toowoomba, Australia

Liam Hendry, Toowoomba, Australia

Copyright Year: 2022

ISBN 13: 9780645326109

Publisher: University of Southern Queensland

Language: English

Conditions of Use: Attribution

Reviewed by Sojib Bin Zaman, Assistant Professor, James Madison University on 3/18/24


Comprehensiveness rating: 5

From exploring data in Chapter One to learning advanced methodologies such as moderation and mediation in Chapter Seven, the reader is guided through the entire process of statistical methodology. With each chapter covering a different statistical technique and methodology, students gain a comprehensive understanding of statistical research techniques.

Content Accuracy rating: 5

During my review of the textbook, I did not find any notable errors or omissions. In my opinion, the material was comprehensive, resulting in an enjoyable learning experience.

Relevance/Longevity rating: 5

A majority of the textbook's content is aligned with current trends, advancements, and enduring principles in the field of statistics. Several emerging methodologies and technologies are incorporated into this textbook to enhance students' statistical knowledge. It will be a valuable resource in the long run if students and researchers can properly utilize this textbook.

Clarity rating: 5

A clear explanation of complex statistical concepts such as moderation and mediation is provided in the writing style. Examples and problem sets are provided in the textbook in a comprehensive and well-explained manner.

Consistency rating: 5

Each chapter maintains consistent formatting and language, with resources organized consistently. Headings and subheadings worked well.

Modularity rating: 5

The textbook is well-structured, featuring cohesive chapters that flow smoothly from one to another. It is carefully crafted with a focus on defining terms clearly, facilitating understanding, and ensuring logical flow.

Organization/Structure/Flow rating: 5

From basic to advanced concepts, this book provides clarity of progression, logical arrangement of sections and chapters, and effective headings and subheadings that guide readers. Further, the organization provides students with a lot of information on complex statistical methodologies.

Interface rating: 5

The available formats included PDFs, online access, and e-books. The e-book interface was particularly appealing to me, as it provided seamless navigation and viewing of content without compromising usability.

Grammatical Errors rating: 5

I found no significant errors in this document, and the overall quality of the writing was commendable. There was a high level of clarity and coherence in the text, which contributed to a positive reading experience.

Cultural Relevance rating: 5

The content of the book, as well as its accompanying examples, demonstrates a dedication to inclusivity by taking into account cultural diversity and a variety of perspectives. Furthermore, the material actively promotes cultural diversity, which enables readers to develop a deeper understanding of various cultural contexts and experiences.

In summary, this textbook provides a comprehensive resource tailored for advanced statistics courses, characterized by meticulous organization and practical supplementary materials. This book also provides valuable insights into the interpretation of computer output that enhance a greater understanding of each concept presented.

Reviewed by Zhuanzhuan Ma, Assistant Professor, University of Texas Rio Grande Valley on 3/7/24


The textbook covers all necessary areas and topics for students who want to conduct research in statistics. It includes foundational concepts, application methods, and advanced statistical techniques relevant to research methodologies.

The textbook presents statistical methods and data accurately, with up-to-date statistical practices and examples.

Relevance/Longevity rating: 4

The textbook's content is relevant to current research practices. The book includes contemporary examples and case studies that are currently prevalent in research communities. One small drawback is that the textbook does not include example code for conducting data analysis.

The textbook breaks down complex statistical methods into understandable segments. All the concepts are clearly explained. The authors use diagrams, examples, and all kinds of explanations to facilitate learning for students with varying levels of background knowledge.

The terminology, framework, and presentation style (e.g. concepts, methodologies, and examples) seem consistent throughout the book.

The textbook is well organized: each chapter and section can be used independently without losing the context necessary for understanding. Also, the modular structure allows instructors and students to adapt the materials for different study plans.

The textbook is well-organized and progresses from basic concepts to more complex methods, making it easier for students to follow along. There is a logical flow of the content.

The digital format of the textbook has an interface whose design, layout, and navigational features make it easy for readers to use.

The quality of writing is very high. The well-written texts help both instructors and students to follow the ideas clearly.

The textbook does not perpetuate stereotypes or biases and is inclusive in its examples, language, and perspectives.

Table of Contents

  • Acknowledgement of Country
  • Accessibility Information
  • About the Authors
  • Introduction
  • I. Chapter One - Exploring Your Data
  • II. Chapter Two - Test Statistics, p Values, Confidence Intervals and Effect Sizes
  • III. Chapter Three - Comparing Two Group Means
  • IV. Chapter Four - Comparing Associations Between Two Variables
  • V. Chapter Five - Comparing Associations Between Multiple Variables
  • VI. Chapter Six - Comparing Three or More Group Means
  • VII. Chapter Seven - Moderation and Mediation Analyses
  • VIII. Chapter Eight - Factor Analysis and Scale Reliability
  • IX. Chapter Nine - Nonparametric Statistics


About the Book

This book aims to help you understand and navigate statistical concepts and the main types of statistical analyses essential for research students. 

About the Contributors

Dr Erich C. Fein is an Associate Professor at the University of Southern Queensland. He received substantial training in research methods and statistics during his PhD program at Ohio State University. He currently teaches four courses in research methods and statistics. His research involves leadership, occupational health, and motivation, as well as issues related to research methods, such as the article "Safeguarding Access and Safeguarding Meaning as Strategies for Achieving Confidentiality."

Dr John Gilmour is a Lecturer at the University of Southern Queensland and a Postdoctoral Research Fellow at the University of Queensland. His research focuses on the locational and temporal analyses of crime, and the evaluation of police training and procedures. John has worked across many different sectors including PTSD, social media, criminology, and medicine.

Dr Tanya Machin  is a Senior Lecturer and Associate Dean at the University of Southern Queensland. Her research focuses on social media and technology across the lifespan. Tanya has co-taught Honours research methods with Erich, and is also interested in ethics and qualitative research methods. Tanya has worked across many different sectors including primary schools, financial services, and mental health.

Dr Liam Hendry  is a Lecturer at the University of Southern Queensland. His research interests focus on long-term and short-term memory, measurement of human memory, attention, learning & diverse aspects of cognitive psychology.




Test-Negative Study Designs for Evaluating Vaccine Effectiveness

  • 1 Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, Georgia

The evaluation of vaccines continues long after initial regulatory approval. Postapproval observational studies are often used to investigate aspects of vaccine effectiveness (VE) that clinical trials cannot feasibly assess. These include long-term effectiveness, effectiveness within subgroups, effectiveness against rare outcomes, and effectiveness as the circulating pathogen changes. 1 Policymakers rely on these data to guide vaccine recommendations or formulation updates. 2


Dean N, Amin AB. Test-Negative Study Designs for Evaluating Vaccine Effectiveness. JAMA. Published online June 12, 2024. doi:10.1001/jama.2024.5633


Welcome to the Purdue Online Writing Lab (Purdue OWL)

The Online Writing Lab (the Purdue OWL) at Purdue University houses writing resources and instructional material, and we provide these as a free service at Purdue. Students, members of the community, and users worldwide will find information to assist with many writing projects. Teachers and trainers may use this material for in-class and out-of-class instruction.

The On-Campus and Online versions of Purdue OWL assist clients in their development as writers—no matter what their skill level—with on-campus consultations, online participation, and community engagement. The Purdue OWL serves the Purdue West Lafayette and Indianapolis campuses and coordinates with local literacy initiatives. The Purdue OWL offers global support through online reference materials and services.


The state of AI in early 2024: Gen AI adoption spikes and starts to generate value

If 2023 was the year the world discovered generative AI (gen AI), 2024 is the year organizations truly began using—and deriving business value from—this new technology. In the latest McKinsey Global Survey on AI, 65 percent of respondents report that their organizations are regularly using gen AI, nearly double the percentage from our previous survey just ten months ago. Respondents' expectations for gen AI's impact remain as high as they were last year, with three-quarters predicting that gen AI will lead to significant or disruptive change in their industries in the years ahead.

About the authors

This article is a collaborative effort by Alex Singla, Alexander Sukharevsky, Lareina Yee, and Michael Chui, with Bryce Hall, representing views from QuantumBlack, AI by McKinsey, and McKinsey Digital.

Organizations are already seeing material benefits from gen AI use, reporting both cost decreases and revenue jumps in the business units deploying the technology. The survey also provides insights into the kinds of risks presented by gen AI—most notably, inaccuracy—as well as the emerging practices of top performers to mitigate those challenges and capture value.

AI adoption surges

Interest in generative AI has also brightened the spotlight on a broader set of AI capabilities. For the past six years, AI adoption by respondents' organizations has hovered at about 50 percent. This year, the survey finds that adoption has jumped to 72 percent (Exhibit 1). And the interest is truly global in scope. Our 2023 survey found that AI adoption did not reach 66 percent in any region; this year, more than two-thirds of respondents in nearly every region say their organizations are using AI. Organizations based in Central and South America are the exception, with 58 percent of respondents there reporting AI adoption. Looking by industry, the biggest increase in adoption can be found in professional services, which here includes organizations focused on human resources, legal services, management consulting, market research, R&D, tax preparation, and training.

Also, responses suggest that companies are now using AI in more parts of the business. Half of respondents say their organizations have adopted AI in two or more business functions, up from less than a third of respondents in 2023 (Exhibit 2).

Gen AI adoption is most common in the functions where it can create the most value

Most respondents now report that their organizations—and they as individuals—are using gen AI. Sixty-five percent of respondents say their organizations are regularly using gen AI in at least one business function, up from one-third last year. The average organization using gen AI is doing so in two functions, most often in marketing and sales and in product and service development—two functions in which previous research ("The economic potential of generative AI: The next productivity frontier," McKinsey, June 14, 2023) determined that gen AI adoption could generate the most value—as well as in IT (Exhibit 3). The biggest increase from 2023 is found in marketing and sales, where reported adoption has more than doubled. Yet across functions, only two use cases, both within marketing and sales, are reported by 15 percent or more of respondents.

Gen AI also is weaving its way into respondents' personal lives. Compared with 2023, respondents are much more likely to be using gen AI at work and even more likely to be using gen AI both at work and in their personal lives (Exhibit 4). The survey finds upticks in gen AI use across all regions, with the largest increases in Asia–Pacific and Greater China. Respondents at the highest seniority levels, meanwhile, show larger jumps in the use of gen AI tools for work and outside of work compared with their midlevel-management peers. Looking at specific industries, respondents working in energy and materials and in professional services report the largest increase in gen AI use.

Investments in gen AI and analytical AI are beginning to create value

The latest survey also shows how different industries are budgeting for gen AI. Responses suggest that, in many industries, organizations are about equally as likely to be investing more than 5 percent of their digital budgets in gen AI as they are in nongenerative, analytical-AI solutions (Exhibit 5). Yet in most industries, larger shares of respondents report that their organizations spend more than 20 percent on analytical AI than on gen AI. Looking ahead, most respondents—67 percent—expect their organizations to invest more in AI over the next three years.

Where are those investments paying off? For the first time, our latest survey explored the value created by gen AI use by business function. The function in which the largest share of respondents report seeing cost decreases is human resources. Respondents most commonly report meaningful revenue increases (of more than 5 percent) in supply chain and inventory management (Exhibit 6). For analytical AI, respondents most often report seeing cost benefits in service operations—in line with what we found last year —as well as meaningful revenue increases from AI use in marketing and sales.

Inaccuracy: The most recognized and experienced risk of gen AI use

As businesses begin to see the benefits of gen AI, they’re also recognizing the diverse risks associated with the technology. These can range from data management risks such as data privacy, bias, or intellectual property (IP) infringement to model management risks, which tend to focus on inaccurate output or lack of explainability. A third big risk category is security and incorrect use.

Respondents to the latest survey are more likely than they were last year to say their organizations consider inaccuracy and IP infringement to be relevant to their use of gen AI, and about half continue to view cybersecurity as a risk (Exhibit 7).

Conversely, respondents are less likely than they were last year to say their organizations consider workforce and labor displacement to be relevant risks and are not increasing efforts to mitigate them.

In fact, inaccuracy—which can affect use cases across the gen AI value chain, ranging from customer journeys and summarization to coding and creative content—is the only risk that respondents are significantly more likely than last year to say their organizations are actively working to mitigate.

Some organizations have already experienced negative consequences from the use of gen AI, with 44 percent of respondents saying their organizations have experienced at least one consequence (Exhibit 8). Respondents most often report inaccuracy as a risk that has affected their organizations, followed by cybersecurity and explainability.

Our previous research has found that there are several elements of governance that can help in scaling gen AI use responsibly ("Implementing generative AI with speed and safety," McKinsey Quarterly, March 13, 2024), yet few respondents report having these risk-related practices in place. For example, just 18 percent say their organizations have an enterprise-wide council or board with the authority to make decisions involving responsible AI governance, and only one-third say gen AI risk awareness and risk mitigation controls are required skill sets for technical talent.

Bringing gen AI capabilities to bear

The latest survey also sought to understand how, and how quickly, organizations are deploying these new gen AI tools. We have found three archetypes for implementing gen AI solutions ("Technology's generational moment with generative AI: A CIO and CTO guide," McKinsey, July 11, 2023): takers use off-the-shelf, publicly available solutions; shapers customize those tools with proprietary data and systems; and makers develop their own foundation models from scratch. Across most industries, the survey results suggest that organizations are finding off-the-shelf offerings applicable to their business needs—though many are pursuing opportunities to customize models or even develop their own (Exhibit 9). About half of reported gen AI uses within respondents' business functions are utilizing off-the-shelf, publicly available models or tools, with little or no customization. Respondents in energy and materials, technology, and media and telecommunications are more likely to report significant customization or tuning of publicly available models or developing their own proprietary models to address specific business needs.

Respondents most often report that their organizations required one to four months from the start of a project to put gen AI into production, though the time it takes varies by business function (Exhibit 10). It also depends upon the approach for acquiring those capabilities. Not surprisingly, reported uses of highly customized or proprietary models are 1.5 times more likely than off-the-shelf, publicly available models to take five months or more to implement.

Gen AI high performers are excelling despite facing challenges

Gen AI is a new technology, and organizations are still early in the journey of pursuing its opportunities and scaling it across functions. So it’s little surprise that only a small subset of respondents (46 out of 876) report that a meaningful share of their organizations’ EBIT can be attributed to their deployment of gen AI. Still, these gen AI leaders are worth examining closely. These, after all, are the early movers, who already attribute more than 10 percent of their organizations’ EBIT to their use of gen AI. Forty-two percent of these high performers say more than 20 percent of their EBIT is attributable to their use of nongenerative, analytical AI, and they span industries and regions—though most are at organizations with less than $1 billion in annual revenue. The AI-related practices at these organizations can offer guidance to those looking to create value from gen AI adoption at their own organizations.

To start, gen AI high performers are using gen AI in more business functions—an average of three functions, while others average two. They, like other organizations, are most likely to use gen AI in marketing and sales and product or service development, but they’re much more likely than others to use gen AI solutions in risk, legal, and compliance; in strategy and corporate finance; and in supply chain and inventory management. They’re more than three times as likely as others to be using gen AI in activities ranging from processing of accounting documents and risk assessment to R&D testing and pricing and promotions. While, overall, about half of reported gen AI applications within business functions are utilizing publicly available models or tools, gen AI high performers are less likely to use those off-the-shelf options than to either implement significantly customized versions of those tools or to develop their own proprietary foundation models.

What else are these high performers doing differently? For one thing, they are paying more attention to gen-AI-related risks. Perhaps because they are further along on their journeys, they are more likely than others to say their organizations have experienced every negative consequence from gen AI we asked about, from cybersecurity and personal privacy to explainability and IP infringement. Given that, they are more likely than others to report that their organizations consider those risks, as well as regulatory compliance, environmental impacts, and political stability, to be relevant to their gen AI use, and they say they take steps to mitigate more risks than others do.

Gen AI high performers are also much more likely to say their organizations follow a set of risk-related best practices (Exhibit 11). For example, they are nearly twice as likely as others to involve the legal function and embed risk reviews early on in the development of gen AI solutions—that is, to “ shift left .” They’re also much more likely than others to employ a wide range of other best practices, from strategy-related practices to those related to scaling.

In addition to experiencing the risks of gen AI adoption, high performers have encountered other challenges that can serve as warnings to others (Exhibit 12). Seventy percent say they have experienced difficulties with data, including defining processes for data governance, developing the ability to quickly integrate data into AI models, and an insufficient amount of training data, highlighting the essential role that data play in capturing value. High performers are also more likely than others to report experiencing challenges with their operating models, such as implementing agile ways of working and effective sprint performance management.

About the research

The online survey was in the field from February 22 to March 5, 2024, and garnered responses from 1,363 participants representing the full range of regions, industries, company sizes, functional specialties, and tenures. Of those respondents, 981 said their organizations had adopted AI in at least one business function, and 878 said their organizations were regularly using gen AI in at least one function. To adjust for differences in response rates, the data are weighted by the contribution of each respondent’s nation to global GDP.
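As a concrete illustration of that weighting step, the sketch below computes GDP-based respondent weights; the nations, GDP shares, and response counts are hypothetical, and the survey's actual procedure may differ in detail.

# Weight respondents so each nation's influence on the results matches its
# share of global GDP rather than its share of survey responses.
# All figures are hypothetical, for illustration only.
gdp_share = {"A": 0.25, "B": 0.18, "C": 0.04}   # nation's share of global GDP
responses = {"A": 600, "B": 150, "C": 30}       # respondents per nation

total_responses = sum(responses.values())
weights = {
    nation: gdp_share[nation] / (responses[nation] / total_responses)
    for nation in responses
}
print({n: round(w, 3) for n, w in weights.items()})
# {'A': 0.325, 'B': 0.936, 'C': 1.04}
# Nation A supplied 77 percent of responses but only 25 percent of GDP,
# so its respondents are down-weighted relative to B's and C's.

Any reported statistic is then a weighted average: multiply each response by its respondent's weight, sum, and divide by the sum of the weights.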

Alex Singla and Alexander Sukharevsky are global coleaders of QuantumBlack, AI by McKinsey, and senior partners in McKinsey's Chicago and London offices, respectively; Lareina Yee is a senior partner in the Bay Area office, where Michael Chui, a McKinsey Global Institute partner, is a partner; and Bryce Hall is an associate partner in the Washington, DC, office.

They wish to thank Kaitlin Noe, Larry Kanter, Mallika Jhamb, and Shinjini Srivastava for their contributions to this work.

This article was edited by Heather Hanselman, a senior editor in McKinsey’s Atlanta office.




The urgent need for designing greener drugs

Tomas Brodin, Michael G. Bertram, Kathryn E. Arnold, Alistair B. A. Boxall, Bryan W. Brooks, Daniel Cerveny, Manuela Jörg, Karen A. Kidd, Unax Lertxundi, Jake M. Martin, Lauren T. May, Erin S. McCallum, Marcus Michelangeli, Charles R. Tyler, Bob B. M. Wong, Klaus Kümmerer & Gorka Orive

Nature Sustainability (2024)


  • Drug regulation
  • Environmental impact

The pervasive contamination of ecosystems with active pharmaceutical ingredients poses a serious threat to biodiversity, ecosystem services and public health. Urgent action is needed to design greener drugs that maintain efficacy but also minimize environmental impact.



Acknowledgements

We acknowledge funding support from the Swedish Research Council Formas (2018-00828 to T.B., 2020-02293 to M.G.B., 2020-00981 to E.S.M., 2020-01052 to D.C., 2022-00503 to M.M. and 2022-02796/2023-01253 to J.M.M.), the Kempe Foundations (SMK-1954, SMK21-0069 and JCSMK23-0078 to M.G.B.), the Swedish Research Council VR (2022-03368 to E.S.M.), the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement (101061889 to M.M.), Research England (131911 to M.J.), the Spanish Ministry of Economy, Industry and Competitiveness (PID2022-139746OB-I00/AEI/10.13039/501100011033 to G.O.), the Australian Research Council (FT190100014 and DP220100245 to B.B.M.W.), the Jarislowsky Foundation (to K.A.K.), a Royal Society of New Zealand Catalyst Leaders Fellowship (ILF-CAW2201 to B.W.B.) and the National Institute of Environmental Health Sciences of the National Institutes of Health (1P01ES028942 to B.W.B.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

These authors contributed equally: Tomas Brodin, Michael G. Bertram.

These authors jointly supervised this work: Klaus Kümmerer, Gorka Orive.

Authors and Affiliations

Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Umeå, Sweden

Tomas Brodin, Michael G. Bertram, Daniel Cerveny, Jake M. Martin, Erin S. McCallum & Marcus Michelangeli

Department of Zoology, Stockholm University, Stockholm, Sweden

Michael G. Bertram & Jake M. Martin

School of Biological Sciences, Monash University, Melbourne, Victoria, Australia

Michael G. Bertram, Marcus Michelangeli & Bob B. M. Wong

Department of Environment and Geography, University of York, York, UK

Kathryn E. Arnold & Alistair B. A. Boxall

Department of Environmental Science, Baylor University, Waco, TX, USA

Bryan W. Brooks

Faculty of Fisheries and Protection of Waters, University of South Bohemia in České Budějovice, Vodňany, Czech Republic

Daniel Cerveny

Medicinal Chemistry Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia

Manuela Jörg

Centre for Cancer, Chemistry – School of Natural and Environmental Sciences, Newcastle University, Newcastle Upon Tyne, UK

Department of Biology, McMaster University, Hamilton, Ontario, Canada

Karen A. Kidd

Bioaraba Health Research Institute, Osakidetza Basque Health Service, Araba Mental Health Network, Araba Psychiatric Hospital, Pharmacy Service, Vitoria-Gasteiz, Spain

Unax Lertxundi

Drug Discovery Biology, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia

Lauren T. May

School of Environment and Science, Griffith University, Nathan, Queensland, Australia

Marcus Michelangeli

Biosciences, Faculty of Health and Life Sciences, University of Exeter, Exeter, UK

Charles R. Tyler

Institute of Sustainable Chemistry, Leuphana University Lüneburg, Lüneburg, Germany

Klaus Kümmerer

International Sustainable Chemistry Collaborative Centre (ISC3), Bonn, Germany

Laboratory of Pharmaceutics, School of Pharmacy, University of the Basque Country, Vitoria-Gasteiz, Spain

Gorka Orive

Biomedical Research Networking Centre in Bioengineering, Biomaterials and Nanomedicine, Vitoria-Gasteiz, Spain

Bioaraba, NanoBioCel Research Group, Vitoria-Gasteiz, Spain


Corresponding authors

Correspondence to Tomas Brodin, Michael G. Bertram or Gorka Orive.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review information

Nature Sustainability thanks Lydia Niemi, Terrence Collins and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.


About this article

Cite this article

Brodin, T., Bertram, M.G., Arnold, K.E. et al. The urgent need for designing greener drugs. Nat Sustain (2024). https://doi.org/10.1038/s41893-024-01374-y

Published: 05 June 2024

DOI: https://doi.org/10.1038/s41893-024-01374-y


COMMENTS

  1. Home

    Overview. Statistical Papers is a forum for presentation and critical assessment of statistical methods encouraging the discussion of methodological foundations and potential applications. The Journal stresses statistical methods that have broad applications, giving special attention to those relevant to the economic and social sciences.

  2. (PDF) Use of Statistics in Research

The function of statistics in research is to serve as a tool in designing research, analyzing its data, and drawing conclusions therefrom. Most research studies result in an extensive ...

  3. Introduction to Research Statistical Analysis: An Overview of the

    Introduction. Statistical analysis is necessary for any research project seeking to make quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, while not diving too deep into any specific methodology.

  4. (PDF) An Overview of Statistical Data Analysis

[email protected]. August 21, 2019. Abstract. The use of statistical software in academia and enterprises has been evolving over the last years. More often than not, students, professors ...

  5. (PDF) Data Science: the impact of statistics

In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and ...

  6. Data Science: the impact of statistics

    In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview over different proposed structures of Data Science and address the impact of statistics on such steps as data ...

  7. PDF Introduction to Statistics

Statistics is a branch of mathematics used to summarize, analyze, and interpret a group of numbers or observations. We begin by introducing two general types of statistics: • Descriptive statistics: statistics that summarize observations. • Inferential statistics: statistics used to interpret the meaning of descriptive statistics.

  8. An Introduction to Probability and Statistics

This book on probability theory and mathematical statistics is designed for a course meeting 4 hours per week, or a two-semester course meeting 3 hours per week. It is designed primarily for advanced seniors and beginning graduate students, but it can also be used by students in physics and engineering with strong backgrounds.

  9. F DESCRIPTIVE AND INFERENTIAL STATISTICS

Statistical methods of data analysis form the cornerstone of quantitative-empirical research in the Social Sciences, Humanities, and Economics. Historically, the bulk of knowledge available in Statistics emerged in the context of the analysis of (nowadays large) data sets from observational and experimental measurements in the Natural Sciences.

  10. PDF A Review of Basic Statistical Concepts

important calculations that lie at the very heart of statistics. The hands-on approach of this book emphasizes logic over rote calculation, capitalizes on your knowledge of everyday events, and attempts to pique your innate curiosity with realistic research problems that can best be solved by understanding statistics.

  11. PDF The What and the Why of Statistics

    There are two major reasons why learning statistics may be of value to you. First, you are constantly exposed to statistics every day of your life. Marketing surveys, voting polls, and the findings of social research appear daily in newspapers and popular magazines. By learning statistics, you will become a sharper consumer of statistical material.

  12. PDF Learning to Use Statistics in Research: a Case Study of Learning in A

The purpose of this paper is to document learning opportunities for consultants and clients during statistical consulting sessions as a means to assess the role of a statistical consulting centre in the research and teaching functions of a university.

  13. PDF Statistics Education Research Journal

    are found. The paper discusses implications for the specification of the skills needed for accessing, filtering, comprehending, and critically evaluating information in these products. Directions for future research and educational practice are outlined. Keywords: Statistics education research; Statistical literacy; Official statistics;

  14. PDF The Significance of Statistics in Mind-Matter Research

2. Statistics and the Scientific Process. Throughout this paper the terms "statistics" and "statistical methods" are used in the broad context of an academic subject area, including the design ... (Journal of Scientific Exploration, Vol. 13, No. 4, pp. 615-638, 1999).

  15. Statistics for Research Students

    I. Chapter One - Exploring Your Data. II. Chapter Two - Test Statistics, p Values, Confidence Intervals and Effect Sizes. III. Chapter Three- Comparing Two Group Means. IV. Chapter Four - Comparing Associations Between Two Variables. V. Chapter Five- Comparing Associations Between Multiple Variables. VI.

  16. Statistical Research Papers by Topic

The Statistical Research Report Series (RR) covers research in statistical methodology and estimation. View Statistical Research reports by their topics. (Page last revised October 8, 2021.)

  17. PDF Introduction to Statistics

    Descriptive statistics are typically presented graphically, in tabular form (in tables), or as summary statistics (single values). Data, or numeric measurements, are the values summarized using descriptive statistics. Presenting data in summary can clarify research findings for small and large data sets. Inferential statistics:

  18. PDF Understanding Research Results: Statistical Inference

UNDERSTANDING RESEARCH RESULTS: STATISTICAL INFERENCE. A FEW TERMS. SAMPLES AND POPULATIONS. Inferential statistics are necessary because the results of a given study are based on data obtained from a single sample of research participants, and data are not based on an entire population of scores. (A worked contrast of descriptive and inferential statistics appears in the short sketch after this list.)

  19. (PDF) Introduction to Descriptive statistics

Similarly, descriptive statistics are used to summarize and analyze data in a variety of academic areas, including psychology, sociology, economics, education, and epidemiology [3]. Descriptive ...

  20. PDF Anatomy of a Statistics Paper (with examples)

important writing you will do for the paper. IMHO your reader will either be interested and continue on with your paper, or... A scholarly introduction is respectful of the literature. In my experience, the introduction is part of a paper that I will outline relatively early in the process, but will finish and repeatedly edit at the end of the ...

  21. Research Papers / Publications

Research Papers / Publications. Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Seyed Hamed Hassani, Dongsheng Ding, One-Shot Safety Alignment for Large Language Models via Optimal Dualization. Xinmeng Huang, Shuo Li, Mengxin Yu, Matteo Sesia, Seyed Hamed Hassani, Insup Lee, Osbert Bastani ...

  22. Test-Negative Study Designs for Evaluating Vaccine Effectiveness

    This JAMA Guide to Statistics and Methods article explains the test-negative study design, an observational study design routinely used to estimate vaccine effectiveness, and examines its use in a study that estimated the performance of messenger RNA boosters against the Omicron variant.

  23. Epidemic outcomes following government responses to COVID-19 ...

COVID-19 was—and to a large extent remains—the most meaningful health event in recent global history. Unlike the 2003 Severe Acute Respiratory Syndrome (SARS) epidemic, it spread globally; unlike Zika, everyone is at risk of infection with COVID-19; and unlike recent swine flu pandemics, the disease severity and mortality from COVID-19 were so high it led to life expectancy reversals in ...

  24. Welcome to the Purdue Online Writing Lab

    The Online Writing Lab (the Purdue OWL) at Purdue University houses writing resources and instructional material, and we provide these as a free service at Purdue.

  25. 1. Why so many non-econ papers by economists? 2. What's on the math GRE

    Andrew, could you elaborate on your answer to #2, specifically: "…and what does this have to do with stat Ph.D. programs?" In your opinion, is "strongly recommending" the Math subject test a good idea, i.e., is it likely to be a reliable and valid measure of aptitude for stats grad school, conditional on other qualifications?

  26. The state of AI in early 2024: Gen AI adoption spikes and starts to

    About the research. The online survey was in the field from February 22 to March 5, 2024, and garnered responses from 1,363 participants representing the full range of regions, industries, company sizes, functional specialties, and tenures. Of those respondents, 981 said their organizations had adopted AI in at least one business function, and ...

  27. The Space Omics and Medical Atlas (SOMA) and international ...

    Here, we present the Space Omics and Medical Atlas (SOMA), an integrated data and sample repository for clinical, cellular, and multi-omic research profiles from a diverse range of missions ...

  28. (PDF) The most-cited statistical papers

Only a few of the most influential papers in the field of statistics are included on our list. Four of our most-cited papers, Duncan (1955), Kramer (1956), and ...

  29. The urgent need for designing greener drugs

    The pervasive contamination of ecosystems with active pharmaceutical ingredients poses a serious threat to biodiversity, ecosystem services and public health. Urgent action is needed to design ...

  30. Reversible Nucleic Acid Storage in Deconstructable Glassy Polymer

The rapid decline in DNA sequencing costs has fueled the demand for nucleic acid collection to unravel genomic information, develop treatments for genetic diseases, and track emerging biological threats. Current approaches to maintaining these nucleic acid collections hinge on continuous electricity for maintaining low temperatures and on intricate cold-chain logistics. Inspired by the millennia ...
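Returning to the distinction drawn in entries 7, 17, and 18 above, the short sketch below shows descriptive and inferential statistics side by side; the two groups of scores are hypothetical, and SciPy's independent-samples t-test stands in for the inferential step.

# Descriptive statistics summarize the observations themselves; inferential
# statistics use samples to draw conclusions about the populations they
# represent. The scores below are hypothetical, for illustration only.
from statistics import mean, stdev
from scipy import stats

group_a = [72, 75, 78, 80, 83, 85]
group_b = [68, 70, 74, 76, 77, 79]

# Descriptive: summarize each sample with a mean and standard deviation.
print(f"A: mean = {mean(group_a):.1f}, sd = {stdev(group_a):.1f}")
print(f"B: mean = {mean(group_b):.1f}, sd = {stdev(group_b):.1f}")

# Inferential: test whether the two population means plausibly differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

The descriptive lines only restate the data in compact form; the t-test's p-value is what licenses a conclusion beyond these twelve observations.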