Overview of Data Quality: Examining the Dimensions, Antecedents, and Impacts of Data Quality

  • Published: 10 February 2023
  • Volume 15, pages 1159–1178 (2024)


  • Jingran Wang 2,
  • Peigong Li 1,
  • Zhenxing Lin 1,
  • Stavros Sindakis (ORCID: orcid.org/0000-0002-3542-364X) 4 &
  • Sakshi Aggarwal 5


Competition in the business world is fierce, and poor decisions can bring disaster to firms, especially in the big data era. Decision quality is determined by data quality, which refers to the degree to which data are usable. Data is the most valuable resource of the twenty-first century. The open data (OD) movement offers publicly accessible data for the growth of a knowledge-based society. As a result, the idea of OD is a valuable information technology (IT) instrument for promoting personal, societal, and economic growth. To advance these processes globally, users must manage the quality of the OD they use in practice: using data in science or practice merely for the sake of using it is of little value unless the data conform to norms, standards, and other criteria. This article provides an overview of the dimensions, subdimensions, and metrics utilized in research publications on OD evaluation. To better understand data quality, we review the literature on data quality studies in information systems and identify the data quality dimensions, antecedents, and their impacts. In this study, the notion of “Data Analytics Competency” is developed and validated as a five-dimensional formative measure (i.e., data quality, the bigness of data, analytical skills, domain knowledge, and tool sophistication), and its effect on corporate decision-making performance (i.e., decision quality and decision efficiency) is empirically examined. By doing so, we provide several research suggestions that information systems (IS) researchers can leverage when investigating future research in data quality.


Introduction

Competition in the business world is fierce, and poor decisions can bring disaster to firms. For example, Nokia’s decline from leadership in the telecommunications industry resulted from overestimating its brand strength and continuing to assume that its superior hardware design would win over users long after the iPhone’s release (Surowiecki, 2013). Making the right decisions leads to better performance (Goll & Rasheed, 2005; Zouari & Abdelhedi, 2021).

Knowledge is a foundational value in our society. Data must be free and open, since they are a fundamental requirement for knowledge discovery. In terms of science and application, the open data idea is still in its infancy. The development of open government lies at the heart of this political and economic movement. The President’s Memorandum on Transparency and Open Government, which launched the US open data project in 2009, was followed by the UK government’s open data program, established in 2011. While public sectors host the bulk of open data activities, open data extends beyond “open government” to include areas such as science, economics, and culture. Open data is also becoming more significant in research and has the potential to enhance public institutional governance (Kanwar & Sanjeeva, 2022; Khan et al., 2021; Šlibar et al., 2021). Thus, open data may be viewed from various angles, each offering a range of direct and indirect advantages. For example, the economic perspective makes the case that open data-based innovation promotes economic expansion. The political and strategic viewpoints emphasize concerns such as security and privacy. The social angle focuses on the advantages of data usage for society and on how all citizens might realize the benefits of open data (Danish et al., 2019; Šlibar et al., 2021).

As noted above, numerous studies have found that open data initiatives strive to promote societal values and benefits. A few instances of the social, political, and economic benefits follow. Political and social gains include greater openness, increased citizen engagement and empowerment, public trust in government, new government services for citizens, creative social services, improved policy-making procedures, and advances in knowledge modeling. There are also a number of economic advantages, such as stronger economic growth, increased competitiveness, greater innovation, the development of new goods and services, and the emergence of new industries that add to the economy (Cho et al., 2021; Ebabu Engidaw, 2021; Šlibar et al., 2021).

In the big data era, data-driven forecasting, real-time analytics, and performance management tools are aspects of next-generation decision support systems (Hosack et al., 2012). High-quality decisions based on data analytics can help companies gain a sustained competitive advantage (Davenport & Harris, 2007; Russo et al., 2015). Data-driven decision-making, a newer form of decision-making, refers to the practice of basing decisions on the analysis of data rather than purely on intuition or experience (Abdouli & Omri, 2021; Provost & Fawcett, 2013). In data-driven decision-making, data is at the core of the process and shapes the quality of the decision. The success of data-driven decision-making depends on data quality, which refers to the degree to which data are usable (Pipino et al., 2002; Price et al., 2013). This research seeks to investigate (1) what kinds of data can be viewed as high-quality, (2) what factors influence data quality, and (3) how data quality influences decision-making.

The scope of the paper revolves around three methodologies used to examine the dimensions of data quality and to synthesize those dimensions. The findings in the sections below show that data quality has many characteristics, with accuracy, completeness, consistency, timeliness, and relevance considered the most significant. Additionally, the paper identifies two important factors that affect data quality, time constraints and data user experience, which are frequently discussed in the literature. By doing this, we clearly illustrate the problems with data quality, point out the gaps in the literature, and raise three key concerns about big data quality.

Moreover, the study’s main contributions are beneficial for upcoming academicians and researchers, as the literature review emphasizes the benefits of utilizing data analytics tools on firm decision-making performance. Research that quantitatively demonstrates the influence of the successful use of data analytics (data analytics competency) on firm decision-making has been lacking; our study fills this gap. This line of research is essential because improving firms’ decision-making performance is the overarching goal of data analysis in the field of data analytics, and understanding the factors affecting it is a novel contribution to the field.

The literature review is built on 29 articles related to data quality. By examining the fundamental aspects of data quality and its impact on decision-making and end users, we take a first step towards a more in-depth understanding of the factors that influence data quality. Previous research has paid insufficient attention to these areas, which we aim to highlight and address.

In addition, a thorough review of previous works in the same field helped us identify the research gap. This organized review was divided into several steps: identifying keywords, analyzing citations, calibrating the search strings, and classifying articles by their abstracts. Across all the database searches, we found that these 29 articles best discuss data quality, its dimensions, its constructs, and its impact on decision-making, and are more relevant than the other articles considered. These articles determine the factors that influence data quality, and the framework provided illustrates a complete description of those factors.

The paper is organized as follows. The literature review is divided into three sections. In the first section, we review the literature and briefly identify the dimensions of data quality. In the second section, we summarize the antecedents of data quality. In the third section, we summarize the impacts of data quality. We then discuss future opportunities for dimensions of big data quality that have been neglected in the data quality literature. Finally, we propose several research directions for future studies in data quality.

Literature Review

Data is so essential to modern life that some have referred to it as the “new oil.” A current illustration of the significance of data is the management of the COVID-19 pandemic. The early detection of the virus, the prediction of its spread, and the evaluation of the effects of lockdowns were all made possible by data gathered from location-based social network posts and mobility records from telecommunications networks, which allowed for data-driven public health decision-making (Dakkak et al., 2021; Shumetie & Watabaji, 2019).

As a result, terms with the prefix “data-driven,” such as “data-driven organization” or “data-driven services,” are appearing more frequently. According to Google Books Ngram, the word “data-driven” has become increasingly popular over the past 10 years. Data-driven creation, defined as the organization’s capacity to acquire, analyze, and use data to produce new goods and services, generate efficiency, and achieve a competitive advantage, is a trend that also applies to the development of software (Dakkak et al., 2021; Maradana et al., 2017; Prifti & Alimehmeti, 2017). More and more software organizations are implementing data-driven development techniques to take advantage of the potential that data offers. Cloud and web-based software companies such as Facebook, Google, and Microsoft have been tracking user behavior, identifying user preferences, and running experiments utilizing data from the production environment. The adoption of data-driven techniques is happening more slowly in software-intensive embedded systems, where businesses are still modernizing to include capabilities for continuous data collection from in-service systems. The majority of embedded systems organizations use ad hoc techniques to meet the demands of the individual, the team, or the customer rather than a systematic and ongoing method for collecting data from in-service products (Carayannis et al., 2012; Cho et al., 2021; Dakkak et al., 2021; Šlibar et al., 2021). Therefore, Dakkak et al. (2021) discussed the two areas we use to identify the data gathering challenges:

Customer agreement: Consent between the consumer and the case study organization to obtain and share information is one of the critical obstacles to data gathering. Since the data is generated by customer-owned products, it is considered customer property. Except for basic product configuration data, some clients have strict data-sharing policies that forbid data collection and sharing. Data required during special initiatives, such as the launch of new goods and features, is provided on request, just like data required for fault finding and troubleshooting.

Other clients permit automatic data collection but set restrictions on the types of data that may be gathered, when they can be gathered, how they will be used, and how they will be moved. This is frequently the case with clients who have contracts for services like customer support, optimization, or operations, where data could be used for these reasons and must only be available to those carrying out these tasks. The data itself is now used to evolve these services to become data-driven, even while consumers with service-specific data collection agreements prohibit the data from being used for continuous software enhancements (Cho et al., 2021 ; Dakkak et al., 2021 ).

Technical difficulties: We have observed some technical challenges related to continuous data collection, including:

Impact on the performance of the product: While some data, such as network performance evaluations, can be gathered from in-service products without any adverse effect on their operations, other data must be instrumented before collection because generating and collecting them consumes internal resources, such as processor and memory capacity, and thus degrades the product’s performance.

Data dependability: Given the variety of data kinds, it may be misleading to consider one data type in isolation from the quality standpoint. While a single piece of data can be evaluated based on certain quality indicators like integrity, developing a comprehensive picture of data quality necessitates a connection between several data sources (Dakkak et al., 2021; Khan et al., 2021; Šlibar et al., 2021).

Data Quality Dimensions

Data quality is the core of big data analytics-based decision support. It is not a unidimensional concept but a multidimensional concept (Ballou & Pazer, 1985 ; Pipino et al., 2002 ). The identified dimensions include accessibility, amount of data, believability, completeness, concise representation, consistent representation, ease of manipulation, free of error, interpretability, objectivity, relevancy, reputation, security, timeliness, understandability, and value-added (Abdouli & Omri, 2021 ; Pipino et al., 2002 ). Furthermore, Cho et al. ( 2021 ) highlighted that data quality dimensions are constructs used when evaluating data and are criteria or features of data quality that are thought to be crucial for a particular user’s task. For instance, completeness (e.g., are measured values present?), conformance (e.g., do data values comply with prescribed requirements and layouts?), and plausibility (e.g., are data values credible?) could all be used to evaluate the quality of data. Since data quality has multiple dimensions, how studies are conducted on the dimensions of data quality and which dimensions are the most popular are two questions we want to review in this section. In the data quality literature, three approaches are commonly used to study data quality dimensions (Wang & Strong, 1996 ).

The first approach is an intuitive approach based on the researchers’ past experience or intuitive understanding of which dimensions are essential (Wang & Strong, 1996). The intuitive approach was used in early studies of data quality (Bailey & Pearson, 1983; Ballou & Pazer, 1985; DeLone & McLean, 1992; Ives et al., 1983; Laudon, 1986; Morey, 1982). For example, Bailey and Pearson (1983) viewed accuracy, timeliness, precision, reliability, currency, and completeness as important dimensions of the data quality of output information. Ives et al. (1983) viewed relevancy, volume, accuracy, precision, currency, timeliness, and completeness as important dimensions of data quality for output information. Ballou and Pazer (1985) likewise viewed accuracy, completeness, consistency, and timeliness as data quality dimensions. Laudon (1986) used accuracy, completeness, and unambiguousness as essential attributes of data quality. DeLone and McLean (1992) used accuracy, timeliness, consistency, completeness, relevance, and reliability as data quality dimensions. Studies also argue that consistency is important to data quality (Ballou & Tayi, 1999; Bouchoucha & Benammou, 2020). Many studies use an intuitive approach to define data quality dimensions because each study can choose the dimensions relevant to its specific purpose. In other words, the intuitive approach allows scholars to choose specific dimensions based on their research context or purpose.

A second approach is a theoretical approach that studies data quality from the perspective of the data manufacturing process. Wang et al. (1995) viewed information systems as data manufacturing systems: just as a manufacturing system works on raw material inputs to produce tangible products, an information system acts on raw data input (such as a file, record, single number, report, or spreadsheet) to generate output data or data products (e.g., a corrected mailing list or a sorted file). In some other data manufacturing systems, this data output can in turn be used as raw data. The phrase “data manufacturing” urges academics and industry professionals to look for extra-disciplinary comparisons that can help with knowledge transfer from the context of product assurance to the field of data quality. The phrase “data product” is used to underline that the data output has a value that is passed on to consumers, whether inside or outside the business (Feki & Mnif, 2016; Wang et al., 1995).

From the data manufacturing standpoint, the quality of data products is decided by consumers. In other words, the actual use of data determines the notion of data quality (Wand & Wang, 1996 ). Thus, Wand and Wang ( 1996 ) posit that the analysis of data quality dimensions should be based on four assumptions: (1) information systems can represent real-world systems; (2) information systems design is based on the interpretation of real-world systems; (3) users can infer a view of the real-world systems from the representation created by information systems; (4) only issues related to the internal view are part of the model (Wand & Wang, 1996 ). Based on the representation, interpretation, inference, and internal view assumptions, they proposed intrinsic data quality dimensions, including complete, unambiguous, meaningful, and correct data (Wand & Wang, 1996 ). The theoretical approach provided a more detailed and complete set of data quality dimensions, which are natural and inherent to the data product.

A third approach is empirical, which focuses on analyzing data quality from the user’s viewpoint. A tenet of the empirical approach is the belief that the quality of the data product is decided by its consumers (Wang & Strong, 1996). One representative study, by Wang and Strong (1996), defined the dimensions and evaluation of data quality by collecting information from data consumers. Intrinsic DQ holds that data has quality in and of itself; accuracy is one of the four dimensions that make up this category. Contextual DQ draws attention to the necessity of considering data quality as a component of the job at hand; that is, data must be pertinent, timely, complete, and of an acceptable volume to provide value. Representational DQ and accessibility DQ highlight the relevance of systems: to be effective, a system must display data in a form that is comprehensible, easy to grasp, and consistently expressed (Ghasemaghaei et al., 2018; Ouechtati, 2022; Wang & Strong, 1996). This study argues that a preliminary conceptual framework for data quality should include four aspects: accessible, interpretable, relevant, and accurate. The authors further refined their model into four dimensions: (1) intrinsic data quality, meaning that data should be not only accurate and objective but also believable and reputable; (2) contextual data quality, meaning that data quality must be considered within the context of the task; (3) representational data quality, meaning that data quality should cover both the format and the meaning of data; and (4) accessible data quality, which is also a significant dimension of data quality from the consumer’s viewpoint.
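To make this grouping concrete, the sketch below renders the four Wang and Strong (1996) categories and their constituent dimensions as a simple lookup table in Python. The category assignments follow the framework as summarized above; the table layout and helper function are our own illustration, not an artifact of the original study.

```python
# Wang and Strong's (1996) four data quality categories as a lookup table,
# so the grouping of dimensions discussed in the text is explicit.
DQ_CATEGORIES = {
    "intrinsic": ["accuracy", "objectivity", "believability", "reputation"],
    "contextual": ["value-added", "relevancy", "timeliness", "completeness",
                   "appropriate amount of data"],
    "representational": ["interpretability", "ease of understanding",
                         "concise representation", "consistent representation"],
    "accessibility": ["accessibility", "access security"],
}

def category_of(dimension: str) -> str | None:
    """Return the Wang-Strong category a dimension belongs to, if any."""
    for category, dims in DQ_CATEGORIES.items():
        if dimension in dims:
            return category
    return None

print(category_of("timeliness"))     # -> contextual
print(category_of("believability"))  # -> intrinsic
```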

Fisher and Kingma (2001) applied data quality dimensions of this kind to analyze the causes of two disasters in US history: the explosion of the space shuttle Challenger and the mistaken firing by the USS Vincennes. Accuracy, timeliness, consistency, completeness, relevancy, and fitness for use were used as data quality dimensions (Barkhordari et al., 2019; Fisher & Kingma, 2001). In their study, accuracy means a lack of error between recorded and real-world values; timeliness means the recorded value is up to date; completeness concerns whether all relevant data is recorded; consistency means data values do not change across records; relevance means data should relate to the issues at hand; and fitness for use means data should serve the user’s purpose (Fisher & Kingma, 2001; Strong, 1997; Tayi & Ballou, 1998). Data quality should depend on purpose (Shankaranarayanan & Cai, 2006). This category of data quality is also used in credit risk management. Parssian et al. (2004) viewed information as a product and presented a method to assess data quality for information products. They focused mainly on accuracy and completeness because they considered these two factors the most important for decision-making (Parssian et al., 2004; Reforgiato Recupero et al., 2016). They viewed information as a product but still evaluated the factors from the user’s viewpoint. Studies have also evaluated a model for cost-effective data quality management in customer relationship management (CRM) (Even et al., 2010). Moges et al. (2013) argued that completeness, interpretability, reputability, traceability, ease of understanding, appropriate amount, alignment, and concise representation are important dimensions of data quality in credit risk management (Danish et al., 2019; Moges et al., 2013). These studies treat data quality dimensions as involving the voice of data consumers. Examining data quality dimensions from the user’s point of view is one of the most critical characteristics of empirical approaches (Even et al., 2010; Fisher & Kingma, 2001; Moges et al., 2013; Parssian et al., 2004; Shankaranarayanan & Cai, 2006; Strong, 1997; Tayi & Ballou, 1998).
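The dimension definitions above lend themselves to simple operational metrics. The following is a minimal sketch, not drawn from any of the reviewed studies, of how completeness, timeliness, consistency, and accuracy might be scored on a small table with pandas; the column names, freshness window, and ground-truth series are hypothetical.

```python
# Hypothetical operationalization of four data quality dimensions on a table.
import pandas as pd

def completeness(df):
    """Share of cells that are non-null (all relevant data recorded)."""
    return float(df.notna().to_numpy().mean())

def timeliness(df, ts_col, max_age_days=365):
    """Share of records updated within a freshness window (values up to date)."""
    age = pd.Timestamp.now() - pd.to_datetime(df[ts_col])
    return float((age <= pd.Timedelta(days=max_age_days)).mean())

def consistency(df, key_col, value_col):
    """Share of keys whose value does not vary across records."""
    n_unique = df.groupby(key_col)[value_col].nunique(dropna=True)
    return float((n_unique <= 1).mean())

def accuracy(df, value_col, reference):
    """Share of recorded values matching a trusted real-world reference."""
    return float((df[value_col] == reference.reindex(df.index)).mean())

records = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "city": ["Boston", "Boston", "Austin", None],
    "updated_at": ["2024-01-05", "2024-01-06", "2022-07-01", "2024-01-02"],
})
truth = pd.Series(["Boston", "Boston", "Austin", "Denver"])  # assumed ground truth

print(f"completeness: {completeness(records):.2f}")             # 11/12 cells filled
print(f"timeliness:   {timeliness(records, 'updated_at'):.2f}")
print(f"consistency:  {consistency(records, 'customer_id', 'city'):.2f}")
print(f"accuracy:     {accuracy(records, 'city', truth):.2f}")  # 3/4 match truth
```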

Intuitive approaches are the easiest way to examine data quality dimensions, and theoretical approaches are supported by theory. However, both approaches overlook the user, the most important judge of data quality: data consumers decide whether data is of high or poor quality. At the same time, it is difficult to prove from fundamental principles that the results obtained from empirical approaches are complete and precise (Wang et al., 1995; Prifti & Alimehmeti, 2017). Based on the studies reviewed, we summarize data quality dimensions and comparative studies (Table 1). The results indicate that completeness, accuracy, timeliness, consistency, and relevance are the top five dimensions of data quality mentioned in studies.

Factors that Influence Data Quality (Antecedents)

Several studies have tried to determine the factors that influence data quality. Wang et al. (1995) proposed a framework with seven elements that influence data quality: management responsibility, operation and assurance costs, research and development, production, distribution, personnel management, and legal. This framework provides a complete description of the factors influencing data quality, but it is challenging to implement because of its complexity. Ballou and Pazer (1995) studied the accuracy-timeliness tradeoff and argued that accuracy improves with time and will increase data quality (Ballou & Pazer, 1995; Šlibar et al., 2021). Experience also influences data quality by affecting the usage of incomplete data (Ahituv et al., 1998): if decision-makers are familiar with the data, they may be able to use intuition to compensate for problems (Chengalur-Smith et al., 1999). Studies have also indicated that information overload affects data quality by reducing data accuracy (Berghel, 1997; Cho et al., 2021). Later, scholars pointed out that information overload, experience level, and time constraints impact data quality by influencing the way decision-makers use information (e.g., Fisher & Kingma, 2001). The top ten antecedents of data quality have been identified through a literature review, and Table 2 presents a summary of them.

Impact of Data Quality

The impact of data quality on decision-making and the impact of data quality on end users are the two main themes. Studies of the impact of data quality on decision-making frequently use data quality information (DQI), a general evaluation of the quality of a data set (Chengalur-Smith et al., 1999; Fisher et al., 2003). After considering the decision environment, Chengalur-Smith et al. (1999) argued that DQI influences decision-making differently across tasks, decision strategies, and the format in which DQI is presented. Later, Fisher et al. (2003) examined the influence of experience and time on the use of DQI in the decision-making process and developed a detailed model of the factors linking DQI and the decision outcome. They argued that (1) experts use DQI more frequently than novices; (2) managerial experience positively influences the usage of DQI, but domain experience does not; (3) DQI is useful for managers with little domain-specific experience, and training experts in the use of DQI would be worthwhile; and (4) the availability of DQI has more influence on decision-makers who feel time pressure than on those who do not (Cho et al., 2021; Fisher et al., 2003). According to Price and Shanks (2011), metadata depicting data quality (DQ) can be viewed as DQ tags. They found that DQ tags can not only increase decision time but also change decision choices: DQ tags are associated with increased cognitive processing in the early decision-making phases, which delays the generation of decision alternatives (Price & Shanks, 2011).

Another line of research on the impact of data quality on decision-making focuses on implementing data quality management to support decision-making. Data quality management frameworks were mainly built on the information product view (Ballou et al., 1998; Wang et al., 1998), from which total data quality management (TDQM) and the information product map (IPMAP) were developed. Studies of data quality management have focused more on context. For example, Shankaranarayanan and Cai (2006) constructed a data quality standard framework for B2B e-commerce, proposing a three-layer solution based on IPMAP and the IP view: the DQ 9000 quality management standard, standardized data quality specification metadata, and third-party DQ certification issuers (Sarkheyli & Sarkheyli, 2019; Shankaranarayanan & Cai, 2006).

Representative research on the impact of data quality on end users was conducted by Foshay et al. (2007). They argued that end-user metadata affects user attitudes toward data in databases, and that end-user metadata elements strongly influence user attitudes toward data in the warehouse. Metadata has an impact similar to that of the “other factors” in their study: data quality, business intelligence tool utility, and user views of training quality. Together with these other characteristics, metadata factors appear to have a considerable impact on attitudes. This finding is important: it implies that metadata plays a significant role in determining whether a user forms a favorable opinion of a data warehouse (Dranev et al., 2020; Foshay et al., 2007) (Table 3).

The study’s “other factors” do not seem to have much of a direct impact on the utilization of data warehouses; like the metadata factors, they influence use indirectly. Of all the other criteria, perceived data quality had the most significant impact on users’ attitudes toward data. User perceptions of the data available from the warehouse therefore strongly shape how useful and easy to use the data warehouse is thought to be, and perceived usefulness and perceived ease of use in turn influence the amount of use of the data warehouse to a moderately substantial extent. It would thus seem that variables other than perceived usefulness and usability also play a role in deciding how widely data warehouses are used. Finally, the degree to which end-user metadata quality and use influence user attitudes depends critically on the user’s experience accessing a data warehouse (Foshay et al., 2007; Zhuo et al., 2021).
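As a concrete rendering of the DQ-tag idea from Price and Shanks (2011) discussed above, the sketch below attaches quality metadata to individual data values so a decision-maker can weigh them before use. The tag vocabulary and data structure are our own hypothetical choices, not taken from the original study.

```python
# A minimal sketch of DQ tags: quality metadata attached to individual values.
from dataclasses import dataclass
from typing import Any

@dataclass
class TaggedValue:
    value: Any
    accuracy: str      # e.g., "verified", "unverified", "suspect" (assumed labels)
    completeness: str  # e.g., "complete", "partial"
    source: str        # provenance note for the data consumer

record = {
    "annual_revenue": TaggedValue(1_200_000, "unverified", "partial", "self-reported"),
    "employee_count": TaggedValue(48, "verified", "complete", "payroll system"),
}

# A decision rule might flag or discount values whose tags signal low quality.
for field, tagged in record.items():
    flag = "  <- review before deciding" if tagged.accuracy != "verified" else ""
    print(f"{field} = {tagged.value} [{tagged.accuracy}/{tagged.completeness}]{flag}")
```

Consistent with the finding above, surfacing such tags gives decision-makers more to process in the early phases of a decision, which is the mechanism through which DQ tags were observed to increase decision time.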

Through the literature review of these 29 articles related to data quality in the IS field, we found that data quality has multiple dimensions, and that completeness (Bailey & Pearson, 1983 ; Ballou & Pazer, 1985 ; Côrte-Real et al., 2020 ; DeLone & McLean, 1992 ; Even et al., 2010 ; Fisher & Kingma, 2001 ; Ives et al., 1983 ; Laudon, 1986 ; Moges et al., 2013 ; Parssian et al., 2004 ; Shankaranarayanan & Cai, 2006 ; Šlibar et al., 2021 ; Wand & Wang, 1996 ; Wang & Strong, 1996 ; Zouari & Abdelhedi, 2021 ), accuracy (Bailey & Pearson, 1983 ; Ballou & Pazer, 1985 ; Dakkak et al., 2021 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ghasemaghaei et al., 2018 ; Ives et al., 1983 ; Juddoo & George, 2018 ; Laudon, 1986 ; Morey, 1982 ; Parssian et al., 2004 ; Safarov, 2019 ; Wand & Wang, 1996 ; Wang & Strong, 1996 ), timeliness (Bailey & Pearson, 1983 ; Ballou & Pazer, 1985 ; Cho et al., 2021 ; Côrte-Real et al., 2020 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ives et al., 1983 ; Šlibar et al., 2021 ; Wang & Strong, 1996 ), consistency (Ballou & Pazer, 1985 ; Ballou & Tayi, 1999 ; Cho et al., 2021 ; Dakkak et al., 2021 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ghasemaghaei et al., 2018 ; Wang & Strong, 1996 ), and relevance (Bailey & Pearson, 1983 ; Côrte-Real et al., 2020 ; Dakkak et al., 2021 ; DeLone & McLean, 1992 ; Fisher & Kingma, 2001 ; Ives et al., 1983 ; Klein et al., 2018 ; Šlibar et al., 2021 ; Wang & Strong, 1996 ) are the top five data quality dimensions mentioned in studies.

However, existing studies focus on multiple dimensions of traditional data quality and do not address the new dimensions of big data quality. In traditional data, timeliness is important; yet one new attribute of big data is its real-time delivery, so we do not know whether timeliness will still play an important role among the dimensions of big data. Volume is also a distinctive attribute of big data, and only a few papers address it (Ives et al., 1983; Moges et al., 2013; Šlibar et al., 2021; Wang & Strong, 1996). One reason traditional data quality highlights the role of volume is that it used to be hard to get enough data; in the era of big data, however, there are enormous amounts of data, and volume is no longer a big issue. Therefore, we do not know whether volume will remain an important attribute of big data quality. Value-added is also one of the traditional data quality dimensions (Dakkak et al., 2021; Sarkheyli & Sarkheyli, 2019; Wang & Strong, 1996). However, the value of big data remains uncertain, and whether value will become an important new attribute of big data quality needs further study. Recent studies indicate that volume, variety, velocity, value, and veracity (5 V) are five common characteristics of big data (Cho et al., 2021; Firmani et al., 2019; Gordon, 2013; Hook et al., 2018). Nevertheless, few studies have investigated the impacts of the 5 V dimensions on big data quality.

Existing studies have also identified several factors that influence data quality, such as time pressure (Ballou & Pazer, 1995; Cho et al., 2021; Côrte-Real et al., 2020; Fisher & Kingma, 2001; Mock, 1971), data user experience (Ahituv et al., 1998; Chengalur-Smith et al., 1999; Cho et al., 2021; Fisher & Kingma, 2001), and information overload (Berghel, 1997; Fisher & Kingma, 2001; Hook et al., 2018). There are, however, few studies explaining what new factors influence big data quality. For example, existing studies discuss time pressure (Ballou & Pazer, 1995; Fisher & Kingma, 2001; Mock, 1971; Zhuo et al., 2021) more than information overload (Berghel, 1997; Fisher & Kingma, 2001; Klein et al., 2018; Sarkheyli & Sarkheyli, 2019). But big data volumes are vast and data arrive in real time, which makes information overload a more serious problem than time pressure. The variety dimension of big data means that data structures are heterogeneous, which can cause problems when unstructured data is converted into structured data. Human error or system error may also constitute new factors influencing big data quality. Most research related to data quality considers how data quality impacts decision-making, but no studies have discussed the still-unknown impacts of big data quality. Recent studies indicate that future decisions will be based on data analytics and that our world is data-driven (Davenport & Harris, 2007; Juddoo & George, 2018; Loveman, 2003).

Based on the literature review and the research gaps identified, we propose several future research directions related to data quality within the big data context. First, future studies on data quality dimensions should focus more on the 5 V dimensions of big data quality to identify new attributes of big data quality; future research should also examine possible changes in the other quality dimensions, such as accuracy and timeliness. Second, future research should identify the new impacts of big data quality on decision-making, answering how big data quality influences decision-making and uncovering other issues related to big data quality (Davenport & Patil, 2012; Safarov, 2019). Third, future research should investigate the various factors influencing big data quality. Finally, future research should actively investigate how to leverage a firm’s capabilities to improve big data quality.

There is some evidence that adopting data analytics tools can help businesses make better decisions, yet studies show that many businesses that invested in data analytics were unable to fully utilize these capabilities. A study that quantitatively demonstrates the influence of the successful use of data analytics (data analytics competency) on firm decision-making is lacking, despite the academic and practitioner literature emphasizing the benefit of employing data analytics tools on firm decision-making effectiveness. We therefore set out to investigate how this impact operates. Understanding the elements affecting it is a novel addition to the data analytics literature, because increasing firms’ decision-making performance is the ultimate purpose of data analysis in the realm of data analytics. In this research, we filled this knowledge gap by using Huber’s (1990) theory of the effects of advanced IT on decision-making and Bharadwaj’s (2000) framework of key IT-based resources to describe and justify data analytics competency for enterprises as a multidimensional formative index, and to create and validate a framework predicting the role of data analytics competency in firm decision-making performance (i.e., decision quality and efficiency). These two contributions represent fresh characteristics that have not yet been discussed in the IS literature.
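The idea of a formative index can be illustrated with a small sketch: the construct is computed from its dimensions rather than merely reflected by them. The dimension scores and weights below are invented for illustration; in the actual study, the measurement model would be estimated statistically rather than fixed by hand.

```python
# Hypothetical formative index: data analytics competency computed as a
# weighted composite of its five dimensions (weights and scores are invented).
dimensions = {
    "data_quality": 4.1,        # survey scores on a 1-5 scale (assumed)
    "bigness_of_data": 3.6,
    "analytical_skills": 4.4,
    "domain_knowledge": 3.9,
    "tool_sophistication": 3.2,
}
weights = {
    "data_quality": 0.25,
    "bigness_of_data": 0.15,
    "analytical_skills": 0.25,
    "domain_knowledge": 0.20,
    "tool_sophistication": 0.15,
}

competency = sum(weights[d] * score for d, score in dimensions.items())
print(f"data analytics competency index: {competency:.2f}")
```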

Furthermore, in this work, various techniques were used to identify the data quality aspects of user-generated wearable device data. A literature analysis and a survey were conducted to understand the data quality issues investigators face and their perspectives on data quality dimensions, and domain specialists then chose the appropriate dimensions based on this information (Cho et al., 2021; Ghasemaghaei et al., 2018).

Completeness

In this analysis, the contextual data quality characteristics of breadth and density completeness were considered crucial for conducting research. It is critical to evaluate the breadth and completeness of data sets, especially those gathered in a bring-your-own-device research environment. Researchers can also define completeness as the number of valid days required within a specific data collection period, or the frequency with which data must be present for an individual’s data to be included in the analysis. Further research is required to establish how completeness is defined in research studies, because recently launched devices can measure numerous data types and gather data continuously for years (Côrte-Real et al., 2020).
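As a minimal sketch of the valid-day notion of completeness described above (not taken from the cited studies), the following scores a participant’s wearable data against a wear-time threshold; the column names and the 600-minute cutoff are assumptions.

```python
# Valid-day completeness: fraction of days in the collection window on which
# the device was worn long enough for the day to count as "valid".
import pandas as pd

def valid_day_completeness(df, min_minutes_worn=600):
    per_day = df.groupby(df["timestamp"].dt.date)["minutes_worn"].sum()
    window_days = (df["timestamp"].max() - df["timestamp"].min()).days + 1
    valid_days = int((per_day >= min_minutes_worn).sum())
    return valid_days / window_days

wear = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-03-01", "2023-03-01", "2023-03-02", "2023-03-04"]),
    "minutes_worn": [400, 300, 720, 90],
})
# Two of the four days in the window reach 600 worn minutes -> 0.50.
print(f"valid-day completeness: {valid_day_completeness(wear):.2f}")
```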

Conformance

While value, relational, and computational conformance are all seen as crucial aspects of wearable device data, they present difficulties for data administration and quality evaluation. Value and relational conformance can be evaluated only against the data dictionary and relational model specific to the model, brand, and version of the device, and only where this information is publicly available.
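To illustrate what checking value conformance against such a device-specific data dictionary might look like, here is a small sketch; the field specifications below are invented stand-ins for a vendor’s actual dictionary.

```python
# Value conformance: each value is checked against the type and range (or
# allowed set) that a device's data dictionary prescribes for its field.
data_dictionary = {
    "heart_rate":  {"type": int, "min": 25, "max": 250},
    "step_count":  {"type": int, "min": 0,  "max": 100_000},
    "sleep_stage": {"type": str, "allowed": {"awake", "light", "deep", "rem"}},
}

def value_conforms(field, value):
    spec = data_dictionary.get(field)
    if spec is None or not isinstance(value, spec["type"]):
        return False
    if "allowed" in spec:
        return value in spec["allowed"]
    return spec["min"] <= value <= spec["max"]

sample = {"heart_rate": 61, "step_count": -40, "sleep_stage": "rem"}
for field, value in sample.items():
    verdict = "conforms" if value_conforms(field, value) else "violates dictionary"
    print(f"{field}={value!r}: {verdict}")  # step_count=-40 violates its range
```

Relational conformance could be checked analogously, by validating the links between records against the device’s relational model, where that model is published.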

Plausibility

Plausibility fits with researchers’ demands for precise data values. For example, the data might be judged implausible when step counts are higher than expected but the associated heart rate values are lower than usual. Before beginning an investigation, researchers frequently create their own ad hoc standards for judging the plausibility of the data. However, creating a collection of prospective data quality criteria requires extensive domain expertise and expert time. Therefore, developing a knowledge base of data quality rules for user-generated wearable device data would not only help future researchers save time but also eliminate the need for ad hoc data quality guidelines (Cho et al., 2021; Dakkak et al., 2021).
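A knowledge base of such rules could be as simple as a library of reusable checks. The sketch below encodes the step-count/heart-rate example from the text as one rule; the thresholds are hypothetical and would in practice come from domain experts.

```python
# One plausibility rule from the text: high step counts paired with unusually
# low heart rate are flagged as implausible. Thresholds are assumed.
import pandas as pd

def flag_implausible(df, step_threshold=10_000, hr_threshold=70):
    suspicious = (df["daily_steps"] > step_threshold) & (df["mean_hr"] < hr_threshold)
    return df.assign(implausible=suspicious)

days = pd.DataFrame({
    "daily_steps": [3_200, 18_500, 12_000],
    "mean_hr":     [68,    62,     95],
})
# Day 2 (18,500 steps, mean HR 62) is flagged; the others pass the rule.
print(flag_implausible(days))
```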

Theoretical Implications

In this paper, we reviewed the literature related to data quality and presented previous studies on the topic. We identified the three approaches used to study the dimensions of data quality and summarized those dimensions by referencing several works of research in this area. The results indicated that data quality has multiple dimensions, with accuracy, completeness, consistency, timeliness, and relevance viewed as the important dimensions of data quality. Through the literature review, we also identified two important factors that influence data quality, time pressure and data user experience, which are frequently mentioned throughout the literature. We further identified the impact of data quality through existing studies of data quality impacts and found that many studies examined the impact of data quality on decision-making. By doing so, we depicted a clear picture of issues related to data quality, identified the gaps in existing research, and proposed three important questions related to big data quality.

A study that quantitatively demonstrates the influence of the successful use of data analytics (data analytics competency) on firm decision-making is lacking, despite the academic and practitioner literature emphasizing the benefit of employing data analytics tools on firm decision-making performance (Bridge, 2022 ). We, therefore, set out to investigate how this impact operates. Understanding the elements affecting it is a novel addition to the data analytics literature because increasing firms’ decision-making performance is the overarching goal of data analysis in the realm of data analytics.

Surprisingly, subsequent analyses show that although the size of the data dramatically improves the quality of firm decision-making, it has no discernible effect on firm decision efficiency. This indicates that while having large amounts of data is a great resource for increasing the quality of company decisions, it does not increase the speed at which those decisions can be made; the difficulties of gathering, managing, and evaluating massive amounts of data may be to blame. Decision quality and decision efficiency were strongly affected by all other first-order constructs, including data quality, analytical skill, domain expertise, and tool sophistication.

Our literature review does have limitations. Although we summarized data quality dimensions, antecedents, and impacts, we may have overlooked others due to the limited number of papers reviewed. For a comprehensive understanding of data quality, we suggest that further research review more papers related to data quality to reveal additional dimensions, antecedents, and impacts.

Managerial Implications

The findings of this study have significant ramifications for managers who use data analytics to their benefit. Organizations make sizable investments in these technologies primarily to enhance decision-making performance. Managers must therefore pay close attention to strengthening the dimensions of data analytics competency, because these dimensions explain a large portion of the variance in decision-making performance (Ghasemaghaei et al., 2018). Without such competency, the use of analytical tools may fail to enhance organizational decision-making performance.

Companies could, for instance, invest in training to enhance employees’ analytical skills in order to improve firm decision-making; when employees are equipped with the skills their jobs demand, the quality of their work increases. Furthermore, managers must ensure that staff members who utilize data analytics to make important choices have the necessary domain expertise to use the tools correctly and interpret the findings. When purchasing data analytics tools, managers can use careful selection procedures to ensure the chosen tools are powerful enough to support all the data required for current and upcoming analytical jobs. Finally, managers who want to increase their firm’s data analytics proficiency must invest in data quality to speed up information processing and increase the efficiency of business decisions.

Ideas for Future Research

It is essential to recognize the limits of this study, as with all studies. First, factors other than data analytics proficiency can affect how well a company makes decisions; future research is necessary to better understand how other factors (such as organizational structure and business procedures) affect the effectiveness of company decision-making. Second, open data research is a new area of study, and the initial literature review indicates that the current assessment of open data within existing research has room for improvement. This improvement can target a variety of open data paradigm elements, including commonly acknowledged dataset attributes, publishing specifications for open datasets, adherence to specific policies, necessary open data infrastructure functionalities, dataset assessment processes, openness, accountability, involvement or collaboration, and evaluation of the economic, social, political, and human value of open data initiatives. Because open data is, by definition, free, accessible to the general public, nonexclusive (unrestricted by copyrights, patents, etc.), open-licensed, and structured for usability, its use may benefit a variety of stakeholders. These advantages can include the creation of new jobs, economic expansion, the introduction of new goods and services, the improvement of existing ones, a rise in citizen participation, and assistance in decision-making. Consequently, the open data paradigm illustrates how IT may support social, economic, and personal growth.

Finally, this study did not explicitly cover the decisions that were made. Future research is necessary to examine the impact of data analytics competency and each of its dimensions on decision-making consequences in particular contexts, as the relative importance of data analytics competency and its dimensions may change depending on the type of decision being made (e.g., recruitment processes, marketing promotions).

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Abdouli, M., & Omri, A. (2021). Exploring the nexus among FDI inflows, environmental quality, human capital, and economic growth in the Mediterranean region. Journal of the Knowledge Economy, 12 (2), 788–810.

Google Scholar  

Ahituv, N., Igbaria, M., & Sella, A. V. (1998). The effects of time pressure and completeness of information on decision making. Journal of Management Information Systems, 15 (2), 153–172.

Bailey, J. E., & Pearson, S. W. (1983). Development of a tool for measuring and analyzing computer user satisfaction. Management Science, 29 (5), 530–545.

Ballou, D. P., & Pazer, H. L. (1985). Modeling data and process quality in multi-input, multi-output information systems. Management Science, 31 (2), 150–162.

Ballou, D. P., & Pazer, H. L. (1995). Designing information systems to optimize the accuracy-timeliness tradeoff. Information Systems Research, 6 (1), 51–72.

Ballou, D. P., & Tayi, G. K. (1999). Enhancing data quality in data warehouse environments. Communications of the ACM, 42 (1), 73–78.

Ballou, D., Wang, R., Pazer, H., & Tayi, G. K. (1998). Modeling information manufacturing systems to determine information product quality. Management Science, 44 (4), 462–484.

Barkhordari, S., Fattahi, M., & Azimi, N. A. (2019). The impact of knowledge-based economy on growth performance: Evidence from MENA countries. Journal of the Knowledge Economy, 10 (3), 1168–1182.

Berghel, H. (1997). Cyberspace 2000: Dealing with information overload. Communications of the ACM, 40 (2), 19–24.

Bharadwaj, A. S. (2000). A resource-based perspective on information technology capability and firm performance: An empirical investigation. MIS Quarterly , 169–196.

Bouchoucha, N., & Benammou, S. (2020). Does institutional quality matter foreign direct investment? Evidence from African countries. Journal of the Knowledge Economy, 11 (1), 390–404.

Bridge, J. (2022). A quantitative study of the relationship of data quality dimensions and user satisfaction with cyber threat intelligence (Doctoral dissertation, Capella University).

Carayannis, E. G., Barth, T. D., & Campbell, D. F. (2012). The Quintuple Helix innovation model: Global warming as a challenge and driver for innovation. Journal of Innovation and Entrepreneurship, 1 (1), 1–12.

Chengalur-Smith, I. N., Ballou, D. P., & Pazer, H. L. (1999). The impact of data quality information on decision making: An exploratory analysis. IEEE Transactions on Knowledge and Data Engineering, 11 (6), 853–864.

Cho, S., Weng, C., Kahn, M. G., & Natarajan, K. (2021). Identifying data quality dimensions for person-generated wearable device data: Multi-method study. JMIR mHealth and uHealth, 9 (12), e31618.

Côrte-Real, N., Ruivo, P., & Oliveira, T. (2020). Leveraging internet of things and big data analytics initiatives in European and American firms: Is data quality a way to extract business value? Information & Management, 57 (1), 103141.

Dakkak, A., Zhang, H., Mattos, D. I., Bosch, J., & Olsson, H. H. (2021, December). Towards continuous data collection from in-service products: Exploring the relation between data dimensions and collection challenges. In  2021 28th Asia-Pacific Software Engineering Conference (APSEC)  (pp. 243–252). IEEE.

Danish, R. Q., Asghar, J., Ahmad, Z., & Ali, H. F. (2019). Factors affecting “entrepreneurial culture”: The mediating role of creativity. Journal of Innovation and Entrepreneurship, 8 (1), 1–12.

Davenport, T. H., & Harris, J. G. (2007). Competing on analytics: The new science of winning . Harvard Business Press.

Davenport, T. H., & Patil, D. J. (2012). Data scientist. Harvard Business Review, 90 (5), 70–76.

DeLone, W. H., & McLean, E. R. (1992). Information systems success: The quest for the dependent variable. Information Systems Research, 3 (1), 60–95.

Dranev, Y., Izosimova, A., & Meissner, D. (2020). Organizational ambidexterity and performance: Assessment approaches and empirical evidence. Journal of the Knowledge Economy, 11 (2), 676–691.

Ebabu Engidaw, A. (2021). The effect of external factors on industry performance: The case of Lalibela City micro and small enterprises, Ethiopia. Journal of Innovation and Entrepreneurship, 10 (1), 1–14.

Even, A., Shankaranarayanan, G., & Berger, P. D. (2010). Evaluating a model for cost-effective data quality management in a real-world CRM setting. Decision Support Systems, 50 (1), 152–163.

Feki, C., & Mnif, S. (2016). Entrepreneurship, technological innovation, and economic growth: Empirical analysis of panel data. Journal of the Knowledge Economy, 7 (4), 984–999.

Firmani, D., Tanca, L., & Torlone, R. (2019). Ethical dimensions for data quality. Journal of Data and Information Quality (JDIQ), 12 (1), 1–5.

Fisher, C. W., & Kingma, B. R. (2001). Criticality of data quality as exemplified in two disasters. Information & Management, 39 (2), 109–116.

Fisher, C. W., Chengalur-Smith, I., & Ballou, D. P. (2003). The impact of experience and time on the use of data quality information in decision making. Information Systems Research, 14 (2), 170–188.

Foshay, N., Mukherjee, A., & Taylor, A. (2007). Does data warehouse end-user metadata add value? Communications of the ACM, 50 (11), 70–77.

Ghasemaghaei, M., Ebrahimi, S., & Hassanein, K. (2018). Data analytics competency for improving firm decision making performance. The Journal of Strategic Information Systems, 27 (1), 101–113.

Goll, I., & Rasheed, A. A. (2005). The relationships between top management demographic characteristics, rational decision making, environmental munificence, and firm performance. Organization Studies, 26 (7), 999–1023.

Gordon, K. (2013). What is big data? Itnow, 55 (3), 12–13.

Hook, D. W., Porter, S. J., & Herzog, C. (2018). Dimensions: Building context for search and evaluation. Frontiers in Research Metrics and Analytics, 3 , 23.

Hosack, B., Hall, D., Paradice, D., & Courtney, J. F. (2012). A look toward the future: Decision support systems research is alive and well. Journal of the Association for Information Systems, 13 (5), 3.

Huber, G. P. (1990). A theory of the effects of advanced information technologies on organizational design, intelligence, and decision making. Academy of Management Review, 15 (1), 47–71. https://www.jstor.org/stable/258105

Ives, B., Olson, M. H., & Baroudi, J. J. (1983). The measurement of user information satisfaction. Communications of the ACM, 26 (10), 785–793.

Juddoo, S., & George, C. (2018). Discovering most important data quality dimensions using latent semantic analysis. In  2018 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD)  (pp. 1–6). IEEE.

Kanwar, A., & Sanjeeva, M. (2022). Student satisfaction survey: A key for quality improvement in the higher education institution. Journal of Innovation and Entrepreneurship, 11 (1), 1–10.

Khan, R. U., Salamzadeh, Y., Shah, S. Z. A., & Hussain, M. (2021). Factors affecting women entrepreneurs’ success: A study of small-and medium-sized enterprises in emerging market of Pakistan. Journal of Innovation and Entrepreneurship, 10 (1), 1–21.

Klein, R. H., Klein, D. B., & Luciano, E. M. (2018). Open Government Data: Concepts, approaches and dimensions over time. Revista Economia & Gestão, 18 (49), 4–24.

Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM, 29 (1), 4–11.

Loveman, G. (2003). Diamonds in the data mine. Harvard Business Review, 81 (5), 109–113.

Maradana, R. P., Pradhan, R. P., Dash, S., Gaurav, K., Jayakumar, M., & Chatterjee, D. (2017). Does innovation promote economic growth? Evidence from European countries. Journal of Innovation and Entrepreneurship, 6 (1), 1–23.

Mock, T. J. (1971). Concepts of information value and accounting. The Accounting Review, 46 (4), 765–778.

Moges, H. T., Dejaeger, K., Lemahieu, W., & Baesens, B. (2013). A multidimensional analysis of data quality for credit risk management: New insights and challenges. Information & Management, 50 (1), 43–58.

Morey, R. C. (1982). Estimating and improving the quality of information in a MIS. Communications of the ACM, 25 (5), 337–342.

Ouechtati, I. (2022). Financial inclusion, institutional quality, and inequality: An empirical analysis. Journal of the Knowledge Economy , 1–25.  https://doi.org/10.1007/s13132-022-00909-y

Parssian, A., Sarkar, S., & Jacob, V. S. (2004). Assessing data quality for information products: Impact of selection, projection, and Cartesian product. Management Science, 50 (7), 967–982.

Pipino, L. L., Lee, Y. W., & Wang, R. Y. (2002). Data quality assessment. Communications of the ACM, 45 (4), 211–218.

Price, R., & Shanks, G. (2011). The impact of data quality tags on decision-making outcomes and process. Journal of the Association for Information Systems, 12 (4), 1.

Price, D. P., Stoica, M., & Boncella, R. J. (2013). The relationship between innovation, knowledge, and performance in family and non-family firms: An analysis of SMEs. Journal of Innovation and Entrepreneurship, 2 (1), 1–20.

Prifti, R., & Alimehmeti, G. (2017). Market orientation, innovation, and firm performance—An analysis of Albanian firms. Journal of Innovation and Entrepreneurship, 6 (1), 1–19.

Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1 (1), 51–59.

Reforgiato Recupero, D., Castronovo, M., Consoli, S., Costanzo, T., Gangemi, A., Grasso, L., ... & Spampinato, E. (2016). An innovative, open, interoperable citizen engagement cloud platform for smart government and users’ interaction. Journal of the Knowledge Economy, 7 (2), 388–412.

Russo, G., Marsigalia, B., Evangelista, F., Palmaccio, M., & Maggioni, M. (2015). Exploring regulations and scope of the Internet of Things in contemporary companies: A first literature analysis. Journal of Innovation and Entrepreneurship, 4 (1), 1–13.

Safarov, I. (2019). Institutional dimensions of open government data implementation: Evidence from the Netherlands, Sweden, and the UK. Public Performance & Management Review, 42 (2), 305–328.

Sarkheyli, A., & Sarkheyli, E. (2019). Smart megaprojects in smart cities, dimensions, and challenges. In  Smart Cities Cybersecurity and Privacy  (pp. 269–277). Elsevier.

Shankaranarayanan, G., & Cai, Y. (2006). Supporting data quality management in decision-making. Decision Support Systems, 42 (1), 302–317.

Shumetie, A., & Watabaji, M. D. (2019). Effect of corruption and political instability on enterprises’ innovativeness in Ethiopia: Pooled data based. Journal of Innovation and Entrepreneurship, 8 (1), 1–19.

Šlibar, B., Oreški, D., & Begičević Ređep, N. (2021). Importance of the open data assessment: An insight into the (meta) data quality dimensions. SAGE Open, 11 (2), 21582440211023176.

Strong, D. M. (1997). IT process designs for improving information quality and reducing exception handling: A simulation experiment. Information & Management, 31 (5), 251–263.

Surowiecki, J. (2013). “Where Nokia Went Wrong,” Retrieved February 20, 2020, from  http://www.newyorker.com/business/currency/where-Nokia-went-wrong

Tayi, G. K., & Ballou, D. P. (1998). Examining data quality. Communications of the ACM, 41 (2), 54–57.

Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39 (11), 86–95.

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12 (4), 5–33.

Wang, R. Y., Lee, Y. W., Pipino, L. L., & Strong, D. M. (1998). Manage your information as a product. MIT Sloan Management Review, 39 (4), 95.

Wang, R. Y., Storey, V. C., & Firth, C. P. (1995). A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering, 7 (4), 623–640.

Zhuo, Z., Muhammad, B., & Khan, S. (2021). Underlying the relationship between governance and economic growth in developed countries. Journal of the Knowledge Economy, 12 (3), 1314–1330.

Zouari, G., & Abdelhedi, M. (2021). Customer satisfaction in the digital era: Evidence from Islamic banking. Journal of Innovation and Entrepreneurship, 10 (1), 1–18.


The paper was supported by the H-E-B School of Business & Administration, the University of the Incarnate Word, and the Social Science Foundation of the Ministry of Education of China (16YJA630025).

Author information

Authors and Affiliations

School of Accounting, Shanghai Lixin University of Accounting and Finance, 2800 Wenxiang Rd, Songjiang District, Shanghai, 201620, China

Peigong Li & Zhenxing Lin

Sogang Business School, Sogang University, 35 Baekbeom-Ro, Mapo-Gu, Seoul, South Korea

Jingran Wang

H-E-B School of Business & Administration, University of the Incarnate Word, 4301 Broadway, San Antonio, TX, 78209, USA

School of Social Sciences, Hellenic Open University, 18 Aristotelous Street, Patras, 26335, Greece

Stavros Sindakis

Institute of Strategy, Entrepreneurship and Education for Growth, Athens, Greece

Sakshi Aggarwal


Corresponding author

Correspondence to Zhenxing Lin .


Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Wang, J., Liu, Y., Li, P. et al. Overview of Data Quality: Examining the Dimensions, Antecedents, and Impacts of Data Quality. J Knowl Econ 15, 1159–1178 (2024). https://doi.org/10.1007/s13132-022-01096-6

Received: 08 August 2022

Accepted: 20 December 2022

Published: 10 February 2023

Issue Date: March 2024

DOI: https://doi.org/10.1007/s13132-022-01096-6


Keywords

  • Big data analytics
  • Data quality
  • Decision-making
  • Economic growth
  • IT-based resources


Assessing the practice of data quality evaluation in a national clinical data research network through a systematic scoping review in the era of real-world data

Tianchen Lyu, Alexander Loiacono, Tonatiuh Mendoza Viramontes, Gloria Lipori, Mattia Prosperi, Thomas J. George Jr, Christopher A. Harle, Elizabeth A. Shenkman, William Hogan


Corresponding Author: Jiang Bian, PhD, College of Medicine, University of Florida, 2197 Mowry Road Suite 122, PO Box 100177, Gainesville, FL 32610-0177, USA ( [email protected] )

Received 2020 Jul 28; Revised 2020 Sep 13; Accepted 2020 Sep 18; Collection date 2020 Dec.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Objective

To synthesize data quality (DQ) dimensions and assessment methods of real-world data, especially electronic health records, through a systematic scoping review and to assess the practice of DQ assessment in the national Patient-Centered Clinical Research Network (PCORnet).

Materials and Methods

We started with 3 widely cited DQ literature—2 reviews from Chan et al (2010) and Weiskopf et al (2013a) and 1 DQ framework from Kahn et al (2016)—and expanded our review systematically to cover relevant articles published up to February 2020. We extracted DQ dimensions and assessment methods from these studies, mapped their relationships, and organized a synthesized summarization of existing DQ dimensions and assessment methods. We reviewed the data checks employed by the PCORnet and mapped them to the synthesized DQ dimensions and methods.

Results

We analyzed a total of 3 reviews, 20 DQ frameworks, and 226 DQ studies and extracted 14 DQ dimensions and 10 assessment methods. We found that completeness, concordance, and correctness/accuracy were commonly assessed. Element presence, validity check, and conformance check were commonly used DQ assessment methods and were the main focuses of the PCORnet data checks.

Discussion

Definitions of DQ dimensions and methods were not consistent in the literature, and DQ assessment practice was unevenly distributed (eg, usability and ease-of-use were rarely discussed). Challenges in DQ assessment exist, given the complex and heterogeneous nature of real-world data.

Conclusion

The practice of DQ assessment is still limited in scope. Future work is warranted to generate understandable, executable, and reusable DQ measures.

Keywords: data quality assessment, real-world data, clinical data research network, electronic health record, PCORnet

INTRODUCTION

There has been a surge of national and international clinical research networks (CRNs) curating immense collections of real-world data (RWD) from diverse sources and of different data types, such as electronic health records (EHRs) and administrative claims, among many others. One prominent CRN example is the national Patient-Centered Clinical Research Network (PCORnet) 1 , 2 funded by the Patient-Centered Outcomes Research Institute (PCORI), which contains data on more than 66 million patients across the United States (US). 3 The OneFlorida Clinical Research Consortium, 4 first created in 2009, is 1 of the 9 CRNs contributing to the national PCORnet. The OneFlorida network currently includes 12 healthcare organizations that provide care for more than 60% of Floridians through 4100 physicians, 914 clinical practices, and 22 hospitals covering all 67 Florida counties. 5 The centerpiece of the OneFlorida network is its Data Trust, a centralized data repository that contains longitudinal and robust patient-level records of approximately 15 million Floridians from various sources, including Medicaid and Medicare programs, cancer registries, vital statistics, and EHR systems from its clinical partners. Both the amount and the types of data collected by OneFlorida are staggering.

With the rise of the US Food and Drug Administration (FDA) Real-World Evidence (RWE) program, RWD such as those in OneFlorida are increasingly important to support a wide range of healthcare and regulatory decisions. 6 , 7 RWD are also playing an increasingly critical role in various other national initiatives, such as learning health systems, 8 , 9 comparative effectiveness research, 10 and pragmatic clinical trials. 11 Nevertheless, concerns over the quality of RWD remain: DQ issues such as incompleteness, inconsistency, and inaccuracy are widely reported and discussed. 12 , 13 To maximize the utility of RWD, data quality should be systematically assessed and understood.

The literature on DQ assessment is rich, with a number of DQ frameworks developed over time. Wang et al (1996) 14 proposed a conceptual framework for assessing DQ aspects that are important to data consumers. McGilvray (2008) 15 described 10 steps to quality data, where DQ assessment is an important step. Chan et al (2010) 16 conducted a literature review on EHR DQ and summarized 3 DQ aspects: accuracy, completeness, and comparability. Nahm (2012) 17 defined 10 DQ dimensions (eg, accuracy, currency, completeness) specific to clinical research with a framework for DQ practice. Kahn et al (2012) 18 proposed the "fit-for-use by data consumers" concept with a process model for multisite DQ assessment. Weiskopf et al (2013a) 19 provided an updated literature review on EHR DQ and identified 5 DQ dimensions: completeness, correctness, concordance, plausibility, and currency. They then focused on completeness in their follow-up work (ie, Weiskopf et al [2013b] 20 ). Liaw et al (2013) 21 summarized the most reported dimensions in DQ assessment. Zozus et al (2014) 22 conducted a literature review to identify the DQ dimensions that most affect the capacity of data to support research conclusions. Johnson et al (2015) 23 developed an ontology to define DQ dimensions to enable automated computation of DQ measures. García-de-León-Chocano et al (2015) 24 described a DQ assessment framework and constructed a set of processes. Kahn et al (2016) 25 developed the "harmonized data quality assessment terminology" that organizes DQ assessment into 3 categories: conformance, completeness, and plausibility. Reimer et al (2016) 26 developed a framework based on the 5 DQ dimensions from Weiskopf et al (2013a), 19 with a focus on longitudinal data repositories. Khare et al (2017) 27 summarized DQ issues and mapped them to the harmonized DQ terms. Smith et al (2017) 28 shared a framework for assessing the DQ of administrative data. Weiskopf et al (2017) 29 developed a 3×3 DQ assessment guideline, in which they selected 3 core dimensions from the 5 dimensions defined in Weiskopf et al (2013a), 19 with 3 core DQ constructs per dimension. Lee et al (2018) 30 modified the dimensions defined in Kahn et al (2016) 25 to support specific research tasks. Feder (2018) 31 described common DQ domains and approaches. Terry et al (2019) 32 proposed a model for assessing EHR DQ, derived from the 5 dimensions in Weiskopf et al (2013a). 19 Nordo et al (2019) 33 proposed outcome metrics for the use of EHR data, including measures related to DQ. Bloland et al (2019) 34 offered a framework that describes immunization data in terms of 3 key characteristics (ie, data quality, usability, and utilization). Henley-Smith et al (2019) 35 derived a 2-level DQ framework based on Kahn et al (2016). 25 Charnock (2019) 36 conducted a systematic review focusing on the importance of accuracy and completeness in the secondary use of EHR data.

However, the literature on DQ assessment of EHR data is due for an update, as the latest review article on this topic, Weiskopf et al (2013a), 19 covered the literature before 2012. Further, few studies have assessed the practice of DQ assessment in large clinical networks. Callahan et al (2017) 37 mapped the data checks in 6 clinical networks to their DQ assessment framework, the harmonized data quality assessment by Kahn et al (2016). 25 One of the networks Callahan et al (2017) 37 assessed is the Pediatric Learning Health System (PEDSnet), which, like OneFlorida, also contributes to the national PCORnet. Qualls et al (2018), 38 from the PCORnet data coordinating center, presented the existing PCORnet DQ framework (called "data characterization"), in which they focused on only 3 DQ dimensions: data model conformance, data plausibility, and data completeness, initially with 13 DQ checks. They reported that the data characterization process they put in place has led to improvements in foundational DQ (eg, elimination of conformance errors, decrease in outliers, and more complete data for key analytic variables). As our OneFlorida network contributes to the PCORnet, we participate in the data characterization process, which has evolved significantly since Qualls et al (2018). 38 Thus, our study aims to identify gaps in the existing PCORnet data characterization process. To obtain a more complete picture of DQ dimensions and methods, we first conducted a systematic scoping review of the existing DQ literature related to RWD. Through the scoping review, we organized the existing DQ dimensions as well as the methods used to assess these DQ dimensions. We then reviewed the DQ dimensions and corresponding DQ methods used in the PCORnet data characterization process (8 versions since 2016) to assess the DQ practice in PCORnet and how it has evolved.

MATERIALS AND METHODS

We followed a typical systematic review process to synthesize relevant literature, extract DQ dimensions and DQ methods, map their relationships, and map them to the PCORnet data checks. Throughout the process, 2 team members (TL and AL) independently carried out the review, extraction, and mapping processes in each step; disagreements between the 2 reviewers were resolved through discussion with a third team member (JB) first and then with the entire study team if necessary. We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline and generated the PRISMA flow diagram.

A systematic scoping review of data quality assessment literature

We started with 3 widely cited core references on EHR DQ assessment, including 2 review articles from Chan et al (2010) 16 and Weiskopf et al (2013a), 19 and 1 DQ framework from Kahn et al (2016). 25 First, we summarized and mapped the DQ dimensions in these 3 core references. We merged dimensions that are similar in concept but named differently. For example, Chan et al (2010) 16 defined "data accuracy" as whether the data "can accurately reflect an underlying state of interest," while Weiskopf et al (2013a) 19 defined it as "data correctness" (ie, "whether the data is true"). Then we synthesized the methods used to assess these DQ dimensions. Weiskopf et al (2013a) 19 summarized the DQ assessment methods, while Chan et al (2010) 16 and Kahn et al (2016) 25 only provided definitions and examples of how to measure the different DQ dimensions. Thus, we mapped these definitions and examples to the methods reported in Weiskopf et al (2013a) 19 according to their dimension definitions and measurement examples. For example, Chan et al (2010) 16 defined "completeness" as "the level of missing data" and discussed various studies that have shown the variation in the amount of missing data across different data areas (eg, problem lists and medication lists) and clinical settings, while Kahn et al (2016) 25 provided examples of how to measure "completeness" (eg, "the encounter ID variable has missing values"). Thus, we mapped "completeness" to the method of checking "element presence" (ie, "whether or not desired data elements are present") defined in Weiskopf et al (2013a). 19 We created new categories if the measurement examples could not be mapped to existing methods in Weiskopf et al (2013a). 19 For example, Kahn et al (2016) 25 defined a "conformance" dimension that cannot be mapped to any of the methods defined in Weiskopf et al (2013a). 19 Thus, we created a new method term, "conformance check," to assess "whether the values that are present meet syntactic or structural constraints." Kahn et al (2016) 25 gave examples of conformance checks, such as requiring that the variable sex only take the values "Male," "Female," or "Unknown."

We then reviewed the literature cited in the 3 core references. Chan et al (2010) 16 and Weiskopf et al (2013a) 19 reviewed individual papers that conducted DQ assessment experiments, while the DQ framework from Kahn et al (2016) 25 is based on 9 other frameworks (the full text of 1 was not available) and the literature review by Weiskopf et al (2013a). 19 For completeness, we extracted the extra dimensions that were mentioned in the 8 available frameworks but not included in the framework from Kahn et al (2016). 25 We also summarized the methods for these additional dimensions according to the measurement examples given in the original frameworks.

We then reviewed the articles that were cited in the 2 core review papers: Chan et al (2010) 16 and Weiskopf et al (2013a). 19 We mapped the dimensions and methods mentioned in these articles to the ones we extracted from Kahn et al (2016). 25 During this process, we revised the definitions of the dimensions and methods to make them more inclusive of the different literature.

Weiskopf et al (2013a) 19 is the latest review article covering DQ literature, up to January 2012. Thus, we conducted an additional review of DQ assessment literature published from 2012 through February 2020. We identified 2 groups of search keywords (ie, DQ-related and EHR-related keywords), mainly drawn from the 3 core references. The search strategy, including the keywords, is detailed in Supplementary Appendix A. An article was included if it assessed the quality of data derived from EHR systems using clearly defined DQ measurements (even if the primary goal of the study was not to assess DQ).

We then extracted the DQ dimensions and methods from these new articles, merged those similar to existing ones, and created new dimensions and methods where necessary. Through this process, we arrived at a comprehensive list of dimensions, their concise definitions, and the methods commonly used to assess them.

Map the PCORnet data characterization checks to the data quality dimensions and methods

We reviewed the measurements in the PCORnet data checks (from version 1 published in 2016 to version 8 as of 2020) 38 , 39 and mapped them to the dimensions and methods we summarized above. Two reviewers (TL and AL) independently carried out the mapping tasks, and conflicts were resolved by a third reviewer (JB) through group discussions.

RESULTS

Data quality dimensions and assessment methods summarized from the 3 core references

Data quality dimensions

Overall, we extracted 12 dimensions (ie, currency, correctness/accuracy, plausibility, completeness, concordance, comparability, conformance, flexibility, relevance, usability/ease-of-use, security, and information loss and degradation) from the 3 core references and then mapped the relationships among them.

Chan et al (2010) 16 conducted a systematic review on EHR DQ literature from January 2004 to June 2009 focusing on how DQ affects quality of care measures. They extracted 3 DQ aspects: (1) accuracy , including data currency and granularity; (2) completeness ; and (3) comparability .

Weiskopf et al (2013a) 19 performed a literature review of EHR DQ assessment methodology, covering articles published before February 2012. They identified 27 unique DQ terms/dimensions. After merging DQ terms with similar definitions and excluding dimensions that had no measurement (ie, no description of how the DQ dimension is measured), they retained 5 dimensions: (1) completeness, (2) correctness, (3) concordance, (4) plausibility, and (5) currency.

Kahn et al (2016) 25 proposed a DQ assessment framework for secondary use of EHR data, consisting of 3 DQ dimensions: (1) conformance with 3 subcategories: value conformance , relational conformance , and computational conformance ; (2) completeness ; and (3) plausibility with 3 subcategories: uniqueness plausibility , atemporal plausibility , and temporal plausibility . Each DQ dimension can be assessed in 2 different DQ assessment contexts: verification (ie, “ how data values match expectations with respect to metadata constraints, system assumptions, and local knowledge ”), and validation (ie, “ the alignment of data values with respect to relevant external benchmarks ”).

For comprehensiveness, we also reviewed the 8 DQ frameworks cited by Kahn et al (2016) 25 and included any new DQ dimension reported in at least 2 of the 8 frameworks. A total of 5 additional dimensions were identified: (1) flexibility from Wang et al (1996); 14 (2) relevance from Liaw et al (2013); 21 (3) usability/ease-of-use from McGilvray (2008); 15 (4) security from Liaw et al (2013); 21 and (5) information loss and degradation from Zozus et al (2014). 22

Data quality assessment methods

A total of 10 DQ assessment methods were identified: 7 from Weiskopf et al (2013a), 19 1 from Chan et al (2010) 16 and Kahn et al (2016), 25 and 2 from the 8 frameworks referenced by Kahn et al (2016). 25

Out of the 3 core references, only Weiskopf et al (2013a) 19 explicitly summarized 7 DQ assessment methods, including (1) gold standard ; (2) data element agreement ; (3) element presence ; (4) data source agreement ; (5) distribution comparison ; (6) validity check ; and (7) log review .

From the other 2 core references, we summarized 3 new DQ assessment methods: (1) conformance check from both Chan et al (2010) 16 and Kahn et al (2016); 25 (2) qualitative assessment from Liaw et al (2013), 21 a DQ framework referenced in Kahn et al (2016); 25 and (3) security analysis from Liaw et al (2013). 21

Review of individual data quality assessment studies with updated literature search

We first reviewed the 87 individual DQ assessment studies cited in the 2 systematic review articles, Chan et al (2010) 16 and Weiskopf et al (2013a), 19 extracted the DQ measurements used, and mapped them to the 12 DQ dimensions and 10 DQ assessment methods. Through this process, we revised the definitions of the DQ dimensions and methods where necessary. Figure 1A shows our review process.

Figure 1.

The flow chart of the literature review process: (A) individual studies identified from Chan et al (2010) and Weiskopf et al (2013a), and (B) new data quality related articles (both individual studies and review/framework articles) published from 2012 to February 2020.

Further, since the review from Weiskopf et al (2013a) 19 only covered the literature before 2012, we conducted an additional review of the literature on EHR DQ assessment published from 2012 up until February 2020. Figure 1B illustrates our literature search process following the PRISMA flow diagram.

Through this process, we identified 1072 publications and excluded 743 articles through title and abstract screening. During full-text screening, 172 articles were excluded because (1) the full text was not accessible (n = 19); (2) the paper was not relevant to DQ or lacked sufficient detail on the methods used to assess DQ (n = 147); or (3) the data of interest were not derived from clinical data systems (n = 6). In the end, 157 new articles were included, of which 139 were individual studies and 16 were review articles or frameworks. Four of the 16 review/framework articles were already included in the 3 core references; thus, effectively, we identified 12 new review or framework articles and reviewed 139 new individual DQ assessment studies published from 2012 through February 2020. The list of all reviewed articles is in Supplementary Appendix B.

Review of the newly identified DQ frameworks and review articles

From the 12 newly identified DQ frameworks or reviews, we extracted the DQ dimensions and assessment methods and mapped them to the existing 12 DQ dimensions and 10 methods we had extracted from the 3 core references, refining the original definitions where necessary. We did not identify any new DQ methods, but we identified 2 new DQ dimensions: (1) consistency (ie, "pertains to the constancy of the data, at the desired degree of detail for the study purpose, within and across databases and data sets" from Feder [2018] 31 ) and (2) understandability/interpretability (ie, "the ease with which a user can understand the data" from Smith et al [2017] 28 ). The concept of consistency from Feder (2018) 31 can be connected to concordance in Weiskopf et al (2013a) 19 and various other dimensions (eg, plausibility from Kahn et al [2016] 25 ), especially comparability from Chan et al (2010). 16 Nevertheless, consistency, based on the definitions and examples from Feder (2018), 31 covers a broader and more abstract concept pertaining to the constancy (ie, "the quality of being faithful and dependable") of the data.

Review of individual studies published after 2012

For the 139 individual studies, we extracted the type of data (eg, EHR or claims), the DQ dimensions, and the assessment methods, including the specific DQ measurements where mentioned. Figure 2 shows the results. No new DQ dimensions or assessment methods were identified from these studies.

Figure 2.

The numbers of studies by (A) data type, (B) DQ dimension, and (C) DQ assessment method.

A summary of DQ dimensions and assessment methods

We summarized the 14 DQ dimensions and 10 DQ assessment methods and mapped the relationships among them as shown in Figure 3 . Following Kahn et al (2016), 25 we categorized the DQ dimensions and methods into 2 contexts: verification (ie, can be assessed using the information within the dataset or using common knowledge) and validation (ie, can be assessed using external resources such as compared with external data sources and checked against data standards). However, 6 DQ dimensions (ie, flexibility, relevance, usability, security, information loss and degradation, and understandability/interpretability) and 2 DQ assessment methods (ie, qualitative assessment and security analyses) cannot be categorized into either context.

Figure 3.

A summarization of existing DQ dimensions and DQ assessment methods.

In the broader DQ literature, there is also the concept of intrinsic DQ versus extrinsic DQ. 14 , 40 Intrinsic DQ denotes that "data have quality in its own right" 14 and is "independent of the context in which data is produced and used," 40 while extrinsic DQ, although not explicitly defined, is more sensitive to external environments, considering the context of the task at hand (ie, contextual DQ 40 ) and the information systems that store and deliver the data (ie, accessibility DQ and representational DQ 40 ). In our context, D1–D7 are more related to intrinsic DQ, while D8–D14 may fall into the extrinsic DQ category. Note that there is also literature that defines intrinsic versus extrinsic DQ in terms of how they can be assessed (ie, "this measure is called intrinsic if it does not require any additional data besides the dataset, otherwise it is called extrinsic" 41 ); however, such definitions may be incomplete and imprecise. For example, correctness/accuracy (D2) is part of the intrinsic DQ defined in Strong et al (1997) 40 but can be assessed with external datasets in the context of validation.

Tables 1 and 2 show the definitions and the reference frameworks or reviews from which we extracted the definitions for DQ dimensions and DQ methods, respectively.

Data quality dimensions summarized from existing DQ frameworks and reviews

D3-1, D3-2, and D3-3 are subcategories of D3; D7-1, D7-2, and D7-3 are subcategories of D7.

Data quality assessment methods summarized from existing DQ frameworks and reviews

Map the PCORnet data characterization checks to the synthesized DQ dimensions and methods

Table 3 shows the result of mapping existing PCORnet data characterization checks to the 14 DQ dimensions and 10 DQ assessment methods.

Mapping PCORnet data characterization checks to the 14 DQ dimensions and 10 DQ assessment methods

Data in the PCORnet follows the PCORnet common data model (CDM). Both the PCORnet CDM and the PCORnet data checks specifications are available at https://pcornet.org/data-driven-common-model/.

DISCUSSION

As evident from the large number of studies we identified (3 review articles, 20 DQ frameworks, and 226 DQ-relevant studies), the literature on the quality of real-world clinical data, such as EHR and claims data, for secondary research use is rich. Nevertheless, the definitions of and the relationships among the different DQ dimensions are not as clear as they could be. For example, even though we merged accuracy with correctness into 1 DQ dimension, accuracy/correctness (D2), the original accuracy dimension (ie, "the extent to which data accurately reflects an underlying state of interest includes timeliness and granularity") as defined by Chan et al (2010) 16 actually contains both correctness (ie, "data were considered correct when the information they contained was true") and plausibility (ie, "actual values as a representation of a real-world") as defined by Weiskopf et al (2013a) 19 and Kahn et al (2016), 25 respectively. Further, some DQ dimensions are quite broad and overlap conceptually with other dimensions. For example, comparability can be mapped to completeness, concordance, and consistency depending on the perspective (eg, the frequency or value of a data element).

In terms of DQ assessment methods, similarly overlapping definitions exist. For example, the difference between distribution comparison (M7) and validity check (M4) is subtle: the original definition of distribution comparison in Weiskopf et al (2013a) 19 refers to comparing a data element to an external authoritative resource (eg, comparing the prevalence of diabetes patients calculated from an EHR system to the general diabetes prevalence of that area), while a validity check as defined in Kahn et al (2016) 25 examines whether the value of a data element is out of the normal range (ie, an outlier).

The practice of DQ assessment is not evenly distributed. As shown in Figure 2, most studies that mentioned DQ assessments focused on completeness (D4), concordance (D5), correctness/accuracy (D2), and plausibility (D3), while element presence (M2), data source agreement (M6), validity check (M4), and data element agreement (M3) were the most used DQ methods, reflecting which aspects of DQ matter in real-world studies. We made similar observations when examining the DQ assessment practice in the PCORnet. As shown in Tables 3 and 4, out of all the data checks in the PCORnet data characterization process, the most used data checks are element presence (M2, 25 checks), validity check (M4, 11 checks), and conformance check (M5, 11 checks), and the most examined DQ dimensions are completeness (D4, 21 checks), conformance (D7, 16 checks), and plausibility (D3, 13 checks), which raises the question of why other DQ dimensions and methods are not widely used in practice, especially in a CRN environment.

The numbers of PCORnet data checks mapped to each DQ dimension and DQ assessment method

Abbreviations: DC, data check; DQ, data quality.

The reasons may be manifold. First, the data from different sites of a CRN are heterogeneous in syntax (eg, file formats), schema (eg, data models and structures), and even semantics (eg, the meanings or interpretations of variables). This stems not only from differences between EHR vendors (eg, Cerner vs Epic) but also from differences in how the same vendor system is implemented. For example, Epic's flexibility in allowing arbitrary flow sheets to meet different use cases also creates inconsistency in data capture at the source. Common data models (CDMs) and common data elements are common approaches to addressing these inconsistencies by transforming the source data into an interoperable common data framework. However, it is worth noting that standardization and harmonization of heterogeneous data sources are always difficult after the fact, when the data have already been collected. For example, in the OneFlorida network, although partners are required to provide a data dictionary of their source data, the units of measure are often neglected, leading to situations such as average patient heights vastly higher than conventional wisdom (see the sketch following this discussion). Our investigation of this DQ issue revealed that certain partners used centimeters rather than inches (as dictated by the PCORnet CDM) as the unit of measure. Such "human" errors are inevitable, and a rigorous DQ assessment process is critical to identifying them.

Second, even though DQ is widely recognized as important, it is difficult to have a comprehensive process that captures all DQ issues from the start. The approach the PCORnet takes is to have different levels of DQ assessment processes, where the general data checks (shown in Table 3) capture common and easy-to-catch errors, while a study-specific data characterization process informs whether the data at hand can address a study's specific objectives.

Third, some DQ dimensions and methods, although easy to understand in concept, are difficult to operationalize and execute in reality. For example, usability/ease-of-use (D10) and security (D11), although straightforward to understand, lack well-defined executable measures. These DQ dimensions are nonetheless important, and more effort on methods and tools to assess dimensions such as flexibility (D8), usability/ease-of-use (D10), security (D11), and understandability/interpretability (D14) is needed to fill these knowledge gaps.

There are also a few studies 21 , 23 that attempted to develop ontologies of DQ to "enable automated computation of data quality measures" and to "make data validation more common and reproducible." However, these efforts, although much needed, have not led to wide adoption. The "harmonized data quality assessment terminology" proposed by Kahn et al (2016), 25 although not comprehensive, covers common and important aspects of DQ assessment practice, and further expansion is warranted. Another interesting observation is that, out of the 226 DQ assessment studies, only 1 study 42 discussed the importance of reporting DQ assessment results. It recommends, and we agree, that "reporting on both general and analysis-specific data quality features" is critical to ensure transparency and consistency in computing, reporting, and comparing the DQ of different datasets. These aspects of DQ assessment also deserve further investigation.

LIMITATIONS

First, we only used PubMed to search for relevant articles; thus, we may have missed potentially relevant studies indexed in other databases (eg, Web of Science). Second, our review focused on qualitatively synthesizing DQ dimensions and DQ assessment methods and did not go into detail about how these dimensions and methods can be applied. Further comprehensive investigation into which DQ checks and measures are concrete and executable is also warranted.

CONCLUSIONS

Our review highlights the wide awareness and recognition of DQ issues in RWD, especially EHR data. Although the practice of DQ assessment exists, it is still limited in scope. With the rapid adoption and increasing promotion of research using RWD, DQ issues will become increasingly important and call for attention from the research communities. However, different DQ strategies may be needed given the complex and heterogeneous nature of RWD, and DQ issues should not be treated alone but rather in full consideration with other data-related issues, such as selection bias. Adding DQ reporting to the now widely recognized FAIR (ie, Findability, Accessibility, Interoperability, and Reusability) data principles may benefit the broader research community. Nevertheless, future work is warranted to generate understandable, executable, and reusable DQ measures and their associated assessments.

This work was mainly supported by the University of Florida’s Creating the Healthiest Generation—Moonshot initiative and also supported in part by National Institutes of Health (NIH) grants UL1TR001427, R01CA246418, R21AG068717, as well as Patient-Centered Outcomes Research Institute (PCORI) grants PCORI ME-2018C3-14754 and the OneFlorida Clinical Research Consortium (CDRN-1501-26692). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or PCORI.

AUTHOR CONTRIBUTIONS

JB, BH, and ES designed the initial concepts and framework for the proposed systematic scoping reviewing; TL, AL, and JB carried out the review and annotation process; TL, AL, and JB wrote the initial draft of the manuscript. TM, GL, YG, MP, YW, CH, TG, ES, and BH provided critical feedback and edited the manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT

None declared.

REFERENCES

  • 1. Collins FS, Hudson KL, Briggs JP, Lauer MS.. PCORnet: turning a dream into reality. J Am Med Inform Assoc 2014; 21 (4): 576–7. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 2. Corley DA, Feigelson HS, Lieu TA, McGlynn EA.. Building data infrastructure to evaluate and improve quality: PCORnet. J Oncol Pract 2015; 11 (3): 204–6. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 3. PCORnet. Data-Driven | The National Patient-Centered Clinical Research Network. 2019. https://pcornet.org/data-driven-common-model/ Accessed July 21, 2020.
  • 4. Shenkman E, Hurt M, Hogan W, et al. OneFlorida Clinical Research Consortium: linking a clinical and translational science institute with a community-based distributive medical education model. Acad Med 2018; 93 (3): 451–5. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 5. OneFlorida. OneFlorida Clinical Research Consortium. 2020. https://onefloridaconsortium.org/ Accessed July 21, 2020
  • 6. US FDA. Real-World Evidence. 2020. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence Accessed July 21, 2020
  • 7. Sherman RE, Anderson SA, Dal Pan GJ, et al. Real-world evidence—what is it and what can it tell us? N Engl J Med 2016; 375 (23): 2293–7. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 8. Olsen L, Aisner D, McGinnis JM, Institute of Medicine (U.S.), eds. The Learning Healthcare System: Workshop Summary. Washington, DC: The National Academies Press; 2007. [ PubMed ] [ Google Scholar ]
  • 9. Budrionis A, Bellika JG.. The learning healthcare system: where are we now? A systematic review. J Biomed Inform 2016; 64: 87–92. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 10. Sox HC. Comparative effectiveness research: a report from the Institute of Medicine. Ann Intern Med 2009; 151 (3): 203. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 11. Ford I, Norrie J.. Pragmatic trials. N Engl J Med 2016; 375 (5): 454–63. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 12. Botsis T, Hartvigsen G, Chen F, Weng C.. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinform 2010; 2010: 1–5. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 13. Bae CJ, Griffith S, Fan Y, et al. The challenges of data quality evaluation in a joint data warehouse. eGEMs 2015; 3 (1): 12. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 14. Wang RY, Strong DM. Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst 1996; 12 (4): 5–33. [ Google Scholar ]
  • 15. McGilvray D. Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. MIT Information Quality Industry Symposium 2009. http://mitiq.mit.edu/IQIS/Documents/CDOIQS_200977/Papers/01_02_T1D.pdf  Accessed July 21, 2020.
  • 16. Chan KS, Fowles JB, Weiner JP.. Review: electronic health records and the reliability and validity of quality measures: a review of the literature. Med Care Res Rev 2010; 67 (5): 503–27. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 17. Nahm M. Data quality in clinical research In: Richesson RL, Andrews JE, eds. Clinical Research Informatics. London: Springer; 2012: 175–201. [ Google Scholar ]
  • 18. Kahn MG, Raebel MA, Glanz JM, Riedlinger K, Steiner JF. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care 2012; 50: S21–9. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 19. Weiskopf NG, Weng C.. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 2013; 20 (1): 144–51. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 20. Weiskopf NG, Hripcsak G, Swaminathan S, Weng C.. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform 2013; 46 (5): 830–6. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 21. Liaw ST, Rahimi A, Ray P, et al. Towards an ontology for data quality in integrated chronic disease management: a realist review of the literature. Int J Med Inform 2013; 82 (1): 10–24. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 22. Zozus MN, Hammond WE, Green BB, et al. Assessing Data Quality for Healthcare Systems Data Used in Clinical Research (Version 1.0). https://dcricollab.dcri.duke.edu/sites/NIHKR/KR/Assessing-data-quality_V1%200.pdf#search=Assessing%20Data%20Quality%20for%20Healthcare%20Systems%20Data%20Used%20in%20Clinical%20Research Accessed July 21, 2020.
  • 23. Johnson SG, Speedie S, Simon G, Kumar V, Westra BL. A data quality ontology for the secondary use of EHR data. AMIA Annu Symp Proc 2015; 2015: 1937–46. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 24. García-de-León-Chocano R, Sáez C, Muñoz-Soler V, García-de-León-González R, García-Gómez JM. Construction of quality-assured infant feeding process of care data repositories: definition and design (Part 1). Comput Biol Med 2015; 67: 95–103. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 25. Kahn MG, Callahan TJ, Barnard J, et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. eGEMs 2016; 4 (1): 18. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 26. Reimer AP, Milinovich A, Madigan EA.. Data quality assessment framework to assess electronic medical record data for use in research. Int J Med Inform 2016; 90: 40–7. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 27. Khare R, Utidjian L, Ruth BJ, et al. A longitudinal analysis of data quality in a large pediatric data research network. J Am Med Inform Assoc 2017; 24 (6): 1072–9. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 28. Smith M, Lix LM, Azimaee M, et al. Assessing the quality of administrative data for research: a framework from the Manitoba Centre for Health Policy. J Am Med Inform Assoc 2018; 25 (3): 224–9. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 29. Weiskopf NG, Bakken S, Hripcsak G, Weng C.. A data quality assessment guideline for electronic health record data reuse. EGEMS (Wash DC) 2017; 5 (1): 14. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 30. Lee K, Weiskopf N, Pathak J. A Framework for Data Quality Assessment in Clinical Research Datasets. AMIA Annu Symp Proc 2018; 2017: 1080–9. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 31. Feder SL. Data quality in electronic health records research: quality domains and assessment methods. West J Nurs Res 2018; 40 (5): 753–66. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 32. Terry AL, Stewart M, Cejic S, et al. A basic model for assessing primary health care electronic medical record data quality. BMC Med Inform Decis Mak 2019; 19 (1): 30. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 33. Nordo A, Eisenstein EL, Garza M, Hammond WE, Zozus MN. Evaluative outcomes in direct extraction and use of EHR data in clinical trials. Stud Health Technol Inform 2019; 257: 333–40. [ PubMed ] [ Google Scholar ]
  • 34. Bloland P, MacNeil A.. Defining & assessing the quality, usability, and utilization of immunization data. BMC Public Health 2019; 19 (1): 19. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 35. Henley-Smith S, Boyle D, Gray K.. Improving a secondary use health data warehouse: proposing a multi-level data quality framework. eGEMs 2019; 7 (1): 38. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 36. Charnock V. Electronic healthcare records and data quality. Health Info Libr J 2019; 36 (1): 91–5. [ DOI ] [ PubMed ] [ Google Scholar ]
  • 37. Callahan TJ, Bauck AE, Bertoch D, et al. A comparison of data quality assessment checks in six data sharing networks. eGEMs 2017; 5 (1): 8. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 38. Qualls LG, Phillips TA, Hammill BG, et al. Evaluating foundational data quality in the National Patient-Centered Clinical Research Network (PCORnet). eGEMs 2018; 6 (1): 3. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • 39. PCORnet. PCORnet Data Checks, version 8. 2020. https://pcornet.org/wp-content/uploads/2020/03/PCORnet-Data-Checks-v8.pdf Accessed July 21, 2020.
  • 40. Strong DM, Lee YW, Wang RY.. Data quality in context. Commun ACM 1997; 40 (5): 103–10. [ Google Scholar ]
  • 41. Mocnik F-B, Mobasheri A, Griesbaum L, Eckle M, Jacobs C, Klonner C.. A grounding-based ontology of data quality measures. JOSIS 2018; (16): 1–25. [ Google Scholar ]
  • 42. Kahn MG, Brown JS, Chun AT, et al. Transparent reporting of data quality in distributed data networks. eGEMs 2015; 3 (1): 7. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]



Defining and Improving Data Quality in Medical Registries: A Literature Review, Case Study, and Generic Framework

Danielle G. T. Arts, MSc, Nicolette F. de Keizer, PhD, Gert-Jan Scheffer, MD, PhD


Correspondence and reprints: Danielle G. T. Arts, MSc, Department of Medical Informatics and Department of Intensive Care, J2-257, Academic Medical Center, P.O. Box 22700, 1100 DE, Amsterdam, The Netherlands; e-mail: < [email protected] >.

Received 2002 Feb 1; Accepted 2002 Jun 17.

Over the past years the number of medical registries has increased sharply. Their value strongly depends on the quality of the data contained in the registry. To optimize data quality, special procedures have to be followed. A literature review and a case study of data quality formed the basis for the development of a framework of procedures for data quality assurance in medical registries. Procedures in the framework have been divided into procedures for the coordinating center of the registry (central) and procedures for the centers where the data are collected (local). These central and local procedures are further subdivided into (a) the prevention of insufficient data quality, (b) the detection of imperfect data and their causes, and (c) actions to be taken / corrections. The framework can be used to set up a new registry or to identify procedures in existing registries that need adjustment to improve data quality.

Several developments in healthcare, such as progress in information technology and increasing demands for accountability, have led to an increase in the number of medical registries over recent years. We define a medical registry as a systematic collection of a clearly defined set of health and demographic data for patients with specific health characteristics, held in a central database for a predefined purpose (based on Solomon et al. 1 ). The specific patient characteristics (e.g., the presence of a disease or whether an intervention has taken place) determine which patients should be registered. Medical registries can serve different purposes—for instance, as a tool to monitor and improve quality of care or as a resource for epidemiological research. 2 One example is the National Intensive Care Evaluation (NICE) registry, which contains data from patients who have been admitted to Dutch intensive care units (ICUs) and provides insight into the effectiveness and efficiency of Dutch intensive care. 3

To be useful, data in a medical registry must be of good quality. In practice, however, the wrong patients are quite frequently registered, and data items may be recorded inaccurately or not at all. 4– 8 To optimize the quality of medical registry data, participating centers should follow certain procedures designed to minimize inaccurate and incomplete data. The objective of this study was to identify causes of insufficient data quality and to compile procedures for data quality assurance in medical registries into a framework. By data quality assurance we mean the set of planned and systematic procedures that take place before, during, and after data collection to guarantee the quality of data in a database. Our proposed framework of procedures for data quality assurance is intended to serve as a reference during the start-up of a registry. Furthermore, comparing current procedures in existing medical registries with the proposed procedures in the framework should allow the identification of possible adjustments in the organization to improve data quality.

Literature Review

To gain insight into the concept of data quality, we searched the literature for definitions. For the development of the data quality framework we searched the literature for (a) types and causes of data errors and (b) procedures that can minimize the occurrence of data errors in a registry database. An automated literature search was done using the Medline and Embase databases. The following text words and MeSH headings were used in this search: “data quality,” “registries,” “data collection,” “validity,” “accuracy,” “quality control,” and combinations of these terms. The automated search spanned the years from 1990 to 2000. To supplement the automated search, a manual search was done for papers referenced by other papers, papers and authors known by reputation, and papers from personal databases and the Internet. To search the Internet we used the same terms as for the literature search. The manual search was not restricted to the medical domain or to a specified time period. Papers were considered relevant if they described the analysis of data quality in a registry database or a trial database in terms of types and frequencies of data errors or causes of insufficient data quality. In addition, we selected papers that described procedures for the control and the assurance of data quality in registry or trial databases. We considered registry and trial databases as systematic collections of a prespecified set of data items for all people with a specific characteristic of interest (e.g., patients admitted to intensive care) for a predefined purpose, such as evaluative research.

Methods for the Case Study

We analysed the types and causes of data errors that may occur in a registry by performing a case study at two ICUs that had collected data for the NICE registry for at least one year. The NICE dataset contains 96 variables representing characteristics of the patients and the outcome of ICU treatment. It includes demographic data, admission and discharge data, and all variables necessary to calculate severity-of-illness scores and mortality risks according to the prognostic models APACHE II and III, 9, 10 SAPS II, 11 MPM II, 12 and LODS. 13 Thirty-nine variables are categorical (e.g., the presence of chronic renal insufficiency prior to ICU admission), 48 are numerical (e.g., the highest systolic blood pressure during the first 24 hours of ICU admission), 7 are dates/times, and 2 are character strings (Appendix 1).

One of the ICUs in the case study collected data automatically by extracting the data from its electronic patient data management system (PDMS) 14 into a local registry database. The other ICU collected the data manually by filling in case record forms (CRFs) that were then entered by hand into a local registry database. Each month the data from the local databases of both ICUs are transferred to the central registry database at the NICE coordinating center. Data flows are shown in Figure 1.

Figure 1

Results from the case study: types and percentages of newly occurring data errors at the different steps in the data collection process.

For each ICU we retrieved from the central registry database the records of 20 randomly selected patients who had been admitted in September or October of 1999. To evaluate the accuracy and the completeness of the data, we compared the data from the central registry database, the local database, and the CRFs with the gold standard data. The gold standard data were re-abstracted from the paper patient record or the PDMS by one of the authors (DA). Registered values were considered inaccurate if (1) a categorical value was not equal to the gold standard value or (2) a numerical value deviated from the gold standard value by more than an acceptable margin (e.g., a deviation in systolic blood pressure of > 10 mmHg below or above the gold standard systolic blood pressure). The appendix contains a complete list of variables and criteria. A data item was considered incomplete when it was not registered even though it was available in the paper-based record or the PDMS.

By means of a structured interview with the physicians responsible for the data collection process, we gained insight into the local organization of the data collection. This information, together with discussion of the discovered data errors, helped us to identify the causes of insufficient data quality. The causes of insufficient data quality that we found through the literature review and the case study were grouped according to their place in the data collection process.

Quality Assurance Framework

Procedures for data quality assurance were collected through the literature review, as described before. According to their characteristics, the procedures were placed in a framework that maps to the grouping of data error causes obtained from the literature review and the case study.

Data Quality Definitions from Literature

According to the International Standards Organisation (ISO) definition, quality is “the totality of features and characteristics of an entity that bears on its ability to satisfy stated and implied needs” (ISO 8402-1986, Quality-Vocabulary). Similarly, in the context of a medical registry, data quality can be defined as “the totality of features and characteristics of a data set that bear on its ability to satisfy the needs that result from the intended use of the data.” Many researchers point out that data quality is an issue that needs to be assessed from the data users’ perspective. For example, according to Abate et al., 15 data are of the required quality if they satisfy “the requirements stated in a particular specification and the specification reflects the implied needs of the user.”

The review of relevant literature yielded a large number of distinct data quality attributes that might determine usability. Most of the data quality attributes in the literature had ambiguous definitions or were not defined at all, and in some cases multiple terms were used in different articles for a single data quality attribute. The two most frequently cited data quality attributes were “accuracy” and “completeness.” 6, 7, 15– 31 Of all the data quality attributes found in the literature, these two are the most relevant in the context of the case study. Based on the definitions found in the literature, we formulated clear, unambiguous definitions for (1) data accuracy (the extent to which registered data are in conformity with the truth) and (2) data completeness (the extent to which all necessary data that could have been registered have actually been registered).

Types and Causes of Data Errors

A review of the relevant literature 4, 17– 19, 32– 38 yielded a number of types of data errors. Van der Putten et al. 36 divide data errors into interpretation errors, documentation errors, and coding errors. Other authors divide data errors into systematic (type I) errors and random (type II) errors. 17 Knatterud et al. 35 additionally mention bias as a category of data errors, which can cause random as well as systematic errors. Causes of systematic data errors include programming errors, 32, 34 unclear definitions for data items, 4, 33, 36 and violation of the data collection protocol. 34, 35 Random data errors can be caused by, for instance, inaccurate data transcription and typing errors 18, 32, 35, 39 or illegible handwriting in the patient record. 19, 34 Clarke 32 describes only two causes of data errors, inaccurate data transcription and programming errors in the software used; these two causes are the most frequently cited in the literature. Inaccurate data transcription occurs during the actual data collection process, whereas programming errors arise in the procedures that precede it. Other examples of the latter are the lack of clear definitions for data items and guidelines for data collection 4, 33, 36 and insufficient training of participants in using the data definitions and guidelines. 34, 37

Having identified several possible types and causes of data errors through the literature review, we performed a case study at the NICE registry to analyze the types and causes of data errors in actual practice.

Case Study of Data Quality in the NICE Registry

Figure 1 displays the frequencies of inaccurate and incomplete data as they occurred during the different steps in the data collection process. Because some variables were not documented in the PDMS for any patient, 4.0% of the data was not available in the extraction source. After the extraction of the data from the PDMS, another 1.4% of the data was incomplete because the extraction software lacked queries for some data items. Of the extracted data, 1.7% was inaccurate because of programming errors in the extraction software. Programming errors in the software used for transferring data into the central registry database rendered another 0.6% of the data incomplete and 0.9% incorrect. For example, one extraction query did not return the longest prothrombin time in the first 24 hours of ICU admission but in the entire ICU stay. In the end, the central registry database contained 2.0% inaccurate and 6.0% incomplete data for the hospital with automatic data collection.
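The prothrombin-time example is a windowing error: the query aggregated over the whole ICU stay instead of the first 24 hours after admission. A minimal sketch of the bug and its fix follows, assuming the measurements sit in a pandas DataFrame; column names and values are hypothetical, not those of the NICE extraction software.

```python
# Illustration of the windowing error described above (hypothetical data).
import pandas as pd

measurements = pd.DataFrame({
    "patient_id": [1, 1, 1],
    "timestamp": pd.to_datetime(
        ["2001-03-01 08:00", "2001-03-01 20:00", "2001-03-03 09:00"]),
    "prothrombin_time_s": [14.2, 18.9, 22.5],
})
admission = pd.Timestamp("2001-03-01 06:00")

# Buggy variant: maximum over the entire ICU stay.
buggy = measurements["prothrombin_time_s"].max()

# Correct variant: restrict to the first 24 hours after ICU admission.
window = measurements["timestamp"] < admission + pd.Timedelta(hours=24)
correct = measurements.loc[window, "prothrombin_time_s"].max()

print(buggy, correct)  # 22.5 18.9
```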

In the hospital using manual data collection, after transcription of data from the paper patient record to the CRF, 4.8% of the data was inaccurate and 3.3% was incomplete. We identified several causes of these errors, such as inaccurate transcription of the data and inaccurate calculation of derived variables such as daily urinary output and alveolar-arterial oxygenation difference. A relatively frequent cause of error was nonadherence to data definitions: contrary to the definitions, data outside the first 24 hours of ICU admission were frequently registered. The next step in the data collection process is the manual entry of the data from the CRFs into the local registry database. Inaccurate typing was a relatively infrequent cause of data errors (0.6%), and during entry some data that were inaccurate or incomplete on the CRFs were corrected or completed. Programming errors in the software used for transferring the data into the central registry database made another 3.6% of the data items incomplete. In the end, the central registry database contained 4.6% inaccurate and 5.0% incomplete data for the hospital with manual data collection.

The causes of inaccurate data that we found through the literature review and the case study, and the types of data errors that they cause, are presented in Table 1. The causes are grouped according to the related stage in the registration process and subdivided into causes at the central and at the local level. The central level refers to the coordinating center that sets up the registry and transfers data sets from all participating centers into the central registry database; the local level refers to the sites where the actual data collection takes place.

Causes of Insufficient Data Quality in Medical Registries at the Different Stages in the Registry Process and Type (Systematic or Random) of Data Errors

Proposed Framework with Procedures for Data Quality Improvement

Many different quality assurance procedures have been discussed in the literature. Whitney et al. 38 discuss data quality in longitudinal studies and make a distinction between quality assurance procedures and quality control procedures. Quality assurance consists of activities undertaken before data collection to ensure that the data are of the highest possible quality at the time of collection; examples are a clear and extensive study design and training of data collectors. Quality control takes place during and after data collection and is aimed at identifying and correcting sources of data errors; examples are completeness checks and site visits. 38

Knatterud et al. 35 describe guidelines for the standardization of quality assurance in clinical trials. According to the authors, quality assurance should include prevention, detection, and action, from the beginning of data collection through publication of the study results. Important aspects of prevention are the selection and training of adequate and motivated personnel and the design of a data collection protocol. Detection of data errors can be achieved through routine monitoring of the data, which means that they are compared with data in another, independent data source. Finally, action implies that data errors are corrected and the causes of data errors are resolved.

To standardize data collection in a registry, the data items to be collected should be provided with clear data definitions, and standardized guidelines for data collection methods must be designed. 18, 19, 32–35, 38 Many authors recommend training the persons involved in the registry. 5, 18, 33–35, 38, 40–42 Topics to cover in training sessions are the scope of the registry, the data collection protocol, and the data definitions. Participants should be trained centrally to guarantee standardization of the data collection procedures across participating centers.

Two studies 20, 43 showed that data quality improves when the CRF contains fewer open-ended questions. To reduce the chance of errors occurring during collection, several authors recommend collecting the data as close as possible to the original data source and as soon as the data are available. 19, 40, 44, 45 Ideally, the data should be entered by the clinician or obtained directly from the relevant electronic data source (e.g., a laboratory system). 19

Several different methods can be applied to detect errors at data entry. For example, the registry data can be entered twice, preferably by two independent persons; detected inconsistencies, which indicate data-entry errors, should be checked and the data re-entered. 19, 38, 41, 44, 46 Automatic domain or consistency checks on the data at data entry, data extraction, or data transfer can also detect anomalous data, as illustrated in the sketch after this paragraph. 8, 19, 32, 38, 42, 44, 45 Not all data errors can be detected through automatic data checks, however: data errors that are still within the predefined range will not be uncovered. Therefore, in addition to the automatic checking of the data, a visual check of the entered data is recommended. 8, 44 Analyses of the data (e.g., simple cross-tabulations) can also help to uncover anomalies in data patterns. 32, 35, 44 The coordinating center of a registry can control data quality by visiting the participating centers and performing data audits, in which a sample of the data from the central registry database is compared with the original source data (e.g., in the paper patient record). 34, 35, 38, 42
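As an illustration of the detection methods named above, the sketch below implements a domain (range) check, a cross-field consistency check, and a double-entry comparison. The variables, plausible ranges, and field names are illustrative assumptions, not the NICE registry's actual checks.

```python
# Hypothetical sketch of automatic checks at data entry; the plausible
# ranges and field names are assumptions for illustration only.

DOMAIN = {"heart_rate": (0, 300), "age_years": (0, 130)}

def domain_errors(record: dict) -> list[str]:
    """Flag values that fall outside their predefined plausible range."""
    return [f for f, (lo, hi) in DOMAIN.items()
            if f in record and not lo <= record[f] <= hi]

def consistency_errors(record: dict) -> list[str]:
    """Flag combinations of values that cannot all be true."""
    errors = []
    if record.get("discharge_date") and record.get("admission_date"):
        if record["discharge_date"] < record["admission_date"]:
            errors.append("discharge before admission")
    return errors

def double_entry_diff(entry_a: dict, entry_b: dict) -> list[str]:
    """List fields where two independent entries disagree; these should be
    checked against the source document and re-entered."""
    return [f for f in entry_a if entry_a.get(f) != entry_b.get(f)]

entry_a = {"apache_ii_score": 24, "heart_rate": 310}
entry_b = {"apache_ii_score": 42, "heart_rate": 310}
print(domain_errors(entry_a))               # ['heart_rate'] -- outside 0-300
print(double_entry_diff(entry_a, entry_b))  # ['apache_ii_score']
```

Note that, as the text observes, such checks only catch values outside the predefined ranges or inconsistent combinations; a plausible but wrong value (e.g., heart_rate mistyped as 110 instead of 101) passes silently, which is why visual checks and audits remain necessary.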

Based on the causes of insufficient data quality found in the case study and on experiences described in the literature, we compiled a list of procedures for data quality assurance. These procedures have been placed in a framework (Table 2) in which quality assurance procedures are divided into "central" and "local" procedures. Central and local procedures are further subdivided into three phases: (a) the prevention of insufficient data quality, (b) the detection of inaccurate or incomplete data and their causes, and (c) corrections or actions to be taken to improve data quality. This grouping of the data quality assurance procedures resembles the grouping of causes of data errors in Table 1. Preventive procedures are aimed at the causes of inaccurate data due to deficiencies in the set-up or the organization of the registry. Detection of inaccurate data takes place during the local data collection process and during the transfer of data into the central database. Finally, detected data inaccuracies and their causes should lead to actions that improve data quality.

Framework of Procedures for the Assurance of Data Quality in Medical Registries

The definitions of data quality that we found clearly give "the data requirements that proceed from the intended use" a pivotal position. Thus, the intended use of registry data determines the necessary properties of the data. For example, in a registry that is used to calculate incidence rates of diseases, it is essential to include all existing patient cases. In other cases (e.g., registries used for case-control studies), it is essential to correctly record the characteristics of registered patients, such as diagnoses; the exact number of included patients is then of minor importance. 17

Definitions of data quality and data quality attributes in the literature are frequently unclear, ambiguous, or unavailable. Before designing a plan for quality assurance of registry data, a clear description of the attributes that constitute data quality is necessary. Additionally, standard definitions of data quality and data quality attributes are needed to compare data quality among registries or within a registry at different points in time.

Investigating the causes of data errors is a prerequisite for reducing them. The literature search that we performed was not conducted fully in accordance with the methodology of a systematic review; nevertheless, we believe that we captured most of the relevant articles. In addition to the literature search, we performed a case study in which we analyzed the types and causes of data errors in the NICE registry at the different steps in the data collection process, from the patient record to the central registry database. We did not question the quality of the data documented in the patient record, either electronic or paper-based, although ample evidence in the literature indicates that the patient record is generally not completely free of data errors. 47–49 For our case study, however, the paper-based patient record or the PDMS was the most reliable source available. The case study showed missing values for some variables (e.g., admission source, ICU discharge reason) for each patient in the PDMS, because these variables were not configured in the PDMS. This was a relatively large source of systematically missing data for the hospital with automatic data collection. The paper-based records also lacked some of the data that were obligatory in the registry; this, however, was a minor cause of randomly missing data.

Our case study showed that with automatic data collection, data errors are mostly systematic and caused by programming errors. The disadvantage of programming errors is that they can cause a large number of data errors; on the other hand, once a programming error is detected, its cause can easily be resolved. With manual data collection, errors appeared to be mostly random. Most data errors occurred during recording of the data on the CRFs, due to inaccurate transcription or nonadherence to data definitions. The fact that most data errors in this case occur during transcription of data to the CRF corresponds to the results of other studies. 39 The percentages of inaccurate and incomplete data for manual data collection are comparable to those found in other studies. 7, 39, 42 The literature search yielded no articles presenting results of data quality analyses for automatically extracted data with which to compare the results of our case study.

It is unrealistic to aim for a registry database that is completely free of errors; some errors will remain undetected and uncorrected regardless of quality assurance, editing, and auditing. Implementing procedures for data quality assurance can improve data quality but cannot guarantee error-free data. As is evident from the literature review, the assurance of data quality can entail many different procedures. Procedures were selected for our framework when they were practically feasible and when they seemed likely to prevent, detect, or correct frequently occurring errors. The procedures in the framework can therefore be expected to be effective in improving data quality.

Procedures in the framework are divided into "central" and "local" procedures. This division was applied because most registries consist of several participating local centers that collect the data and send them to the central coordinating center that sets up and maintains the registry. Prud'homme et al. 42 similarly described quality assurance procedures separately for all parties involved in a trial, such as the clinical centers, the coding center, and the data coordination center. Central and local procedures in our framework are further subdivided into the prevention of insufficient data quality, the detection of (causes of) insufficient data quality, and the actions to be taken. Knatterud et al. 35 and Wyatt 19 made similar divisions. The framework of procedures for data quality assurance matches the table of causes of data errors (Table 1): if, for example, the local causes of data errors lie mainly in the data collection phase, quality assurance procedures from the local detection section of the framework should be implemented.

In developing the framework, we considered the two most commonly used methods of data collection: manual and automatic. We did not consider alternative data entry methods such as OCR/OMR (optical character or mark recognition) scanning of registry data. Since manual and automatic data collection predominate, the framework should be applicable for reviewing the organization of most medical registries.

The developed quality assurance framework for medical registries will also be useful for reviewing procedures and improving data quality in clinical trials. Good clinical practice guidelines for clinical trials state that procedures for quality assurance have to be implemented at every step in the data collection process. 50 To ensure the quality of data in clinical trials, the framework described in this article could be a suitable addition to the good clinical practice guidelines.

We believe that the framework proposed in this article can be a helpful tool for setting up a high-quality medical registry; our experiences with the NICE registry and with three other national and international registries support this opinion. In existing registries, it can be a useful tool for identifying adjustments to the data collection process. Further research, such as pre- and post-measurements of data quality, should be conducted to determine whether implementation of the framework in a registry in fact reduces the percentage of data errors.

Acknowledgments

The authors thank Martien Limburg and Jeremy Wyatt for their valuable comments on this manuscript.

Appendix 1

List of NICE variables, the severity-of-illness models in which they are used, their data types, and the criteria used for the analysis of data accuracy. Data accuracy was analyzed by comparing the registry data with the gold standard (g.s.) data.

References

  • 1. Solomon D, et al. Evaluation and implementation of public-health registries. Public Health Rep. 1991;106(2):142–50.
  • 2. van Romunde L, Passchier J. Medische registraties: doel, methoden en gebruik [in Dutch]. Ned Tijdschr Geneeskd. 1992;136(33):1592–5.
  • 3. de Keizer N, Bosman R, Joore J, et al. Een nationaal kwaliteitssysteem voor intensive care [in Dutch]. Medisch Contact. 1999;54(8):276–9.
  • 4. Goldhill DR, Sumner A. APACHE II, data accuracy and outcome prediction. Anaesthesia. 1998;53(10):937–43.
  • 5. Lorenzoni L, Da Cas R, Aparo UL. The quality of abstracting medical information from the medical record: the impact of training programmes. Int J Qual Health Care. 1999;11(3):209–13.
  • 6. Seddon D, Williams E. Data quality in population-based cancer registration: an assessment of the Merseyside and Cheshire Cancer Registry. Br J Cancer. 1997;76(5):667–74.
  • 7. Barrie J, Marsh D. Quality of data in the Manchester orthopaedic database. Br Med J. 1992;304:159–62.
  • 8. Horbar JD, Leahy KA. An assessment of data quality in the Vermont-Oxford Trials Network database. Control Clin Trials. 1995;16(1):51–61.
  • 9. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818–29.
  • 10. Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system: risk prediction of hospital mortality for critically ill hospitalized adults. Chest. 1991;100(6):1619–36.
  • 11. Le Gall JR, Lemeshow S, Saulnier F. A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA. 1993;270(24):2957–63.
  • 12. Lemeshow S, Teres D, Klar J, et al. Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. JAMA. 1993;270(20):2478–86.
  • 13. Le Gall JR, Klar J, Lemeshow S, et al. The Logistic Organ Dysfunction system: a new way to assess organ dysfunction in the intensive care unit. ICU Scoring Group. JAMA. 1996;276(10):802–10.
  • 14. Weiss Y, Sprung C. Patient data management systems in critical care. Curr Opin Crit Care. 1996;2:187–92.
  • 15. Abate M, Diegert K, Allen H. A hierarchical approach to improving data quality. Data Qual J. 1998;33(4):365–9.
  • 16. Tayi G, Ballou D. Examining data quality. Commun ACM. 1998;41(2):54–7.
  • 17. Sørensen HT, Sabroe S, Olsen J. A framework for evaluation of secondary data sources for epidemiological research. Int J Epidemiol. 1996;25(2):435–42.
  • 18. Hilner JE, McDonald A, Van Horn L, et al. Quality control of dietary data collection in the CARDIA study. Control Clin Trials. 1992;13(2):156–69.
  • 19. Wyatt J. Acquisition and use of clinical data for audit and research. J Eval Clin Pract. 1995;1(1):15–27.
  • 20. Teperi J. Multi method approach to the assessment of data quality in the Finnish Medical Birth Registry. J Epidemiol Community Health. 1993;47(3):242–7.
  • 21. Hosking J, Newhouse M, Bagniewska A, Hawkins B. Data collection and transcription. Control Clin Trials. 1995;16:66S–103S.
  • 22. AHIMA. The American Health Information Management Association practice brief: Data Quality Management Model. June 1998. Available at http://www.ahima.org/journal/pb/98-06.html. Accessed Sept. 15, 2001.
  • 23. DISA. DOD Guidelines on Data Quality Management. 2000. Available at http://datadmn.disa.mil/dqpaper.html. Accessed Aug. 15, 2001.
  • 24. Shroyer AL, Edwards FH, Grover FL. Updates to the Data Quality Review Program: the Society of Thoracic Surgeons Adult Cardiac National Database. Ann Thorac Surg. 1998;65(5):1494–7.
  • 25. Golberg J, Gelfand H, Levy P. Registry evaluation methods: a review and case study. Epidemiol Rev. 1980;2:210–20.
  • 26. Teppo L, Pukkala E, Lehtonen M. Data quality and quality control of a population-based cancer registry: experience in Finland. Acta Oncol. 1994;33(4):365–9.
  • 27. Grover FL, Shroyer AL, Edwards FH, et al. Data quality review program: the Society of Thoracic Surgeons Adult Cardiac National Database. Ann Thorac Surg. 1996;62(4):1229–31.
  • 28. Wand Y, Wang R. Anchoring data quality dimensions in ontological foundations. Commun ACM. 1996;39(11):86–95.
  • 29. Maudsley G, Williams EM. What lessons can be learned for cancer registration quality assurance from data users? Skin cancer as an example. Int J Epidemiol. 1999;28(5):809–15.
  • 30. Mathieu R, Khalil O. Data quality in the database systems course. Sept 1998. Available at http://www.dataquality.com/998mathieu.htm. Accessed Sept. 15, 2001.
  • 31. Kaomea P. Valuation of data quality: a decision analysis approach. Sept 1994. Available at http://web.mit.edu/tdqm/www/papers/94/94-09.html. Accessed Aug. 15, 2001.
  • 32. Clarke PA. Data validation. In: Clinical Data Management. Chichester: John Wiley & Sons; 1993. pp. 189–212.
  • 33. Clive RE, Ocwieja KM, Kamell L, et al. A national quality improvement effort: cancer registry data. J Surg Oncol. 1995;58(3):155–61.
  • 34. Gassman JJ, Owen WW, Kuntz TE, et al. Data quality assurance, monitoring, and reporting. Control Clin Trials. 1995;16(2 Suppl):104S–136S.
  • 35. Knatterud GL, Rockhold FW, George SL, et al. Guidelines for quality assurance in multicenter trials: a position paper. Control Clin Trials. 1998;19(5):477–93.
  • 36. van der Putten E, van der Velden JW, Siers A, Hamersma EA. A pilot study on the quality of data management in a cancer clinical trial. Control Clin Trials. 1987;8(2):96–100.
  • 37. Stein HD, Nadkarni P, Erdos J, Miller PL. Exploring the degree of concordance of coded and textual data in answering clinical queries from a clinical data repository. J Am Med Inform Assoc. 2000;7(1):42–54.
  • 38. Whitney CW, Lind BK, Wahl PW. Quality assurance and quality control in longitudinal studies. Epidemiol Rev. 1998;20(1):71–80.
  • 39. Vantongelen K, Rotmensz N, van der Schueren E. Quality control of validity of data collected in clinical trials. EORTC Study Group on Data Management (SGDM). Eur J Cancer Clin Oncol. 1989;25(8):1241–7.
  • 40. Wang R, Kon H. Toward total data quality management (TDQM). June 1992. Available at http://web.mit.edu/tdqm/papers/92-02.html. Accessed Aug. 15, 2001.
  • 41. Neaton J, Duchene AG, Svendsen KH, Wentworth D. An examination of the efficiency of some quality assurance methods commonly employed in clinical trials. Stat Med. 1990;9(1–2):115–23; discussion 124.
  • 42. Prud'homme GJ, Canner PL, Cutler JA. Quality assurance and monitoring in the Hypertension Prevention Trial. Hypertension Prevention Trial Research Group. Control Clin Trials. 1989;10(3 Suppl):84S–94S.
  • 43. Gissler M, Teperi J, Hemminki E, Merilainen J. Data quality after restructuring a national medical registry. Scand J Soc Med. 1995;23(1):75–80.
  • 44. Blumenstein BA. Verifying keyed medical research data. Stat Med. 1993;12(17):1535–42.
  • 45. Christiansen DH, Hosking JD, Dannenberg AL, Williams OD. Computer-assisted data collection in multicenter epidemiologic research: the Atherosclerosis Risk in Communities Study. Control Clin Trials. 1990;11(2):101–15.
  • 46. Day S, Fayers P, Harvey D. Double data entry: what value, what price? Control Clin Trials. 1998;19(1):15–24.
  • 47. Aronsky D, Haug PJ. Assessing the quality of clinical data in a computer-based record for calculating the pneumonia severity index. J Am Med Inform Assoc. 2000;7(1):55–65.
  • 48. Hogan W, Wagner M. Accuracy of data in computer-based patient records. J Am Med Inform Assoc. 1997;5:342–55.
  • 49. Wagner M, Hogan W. The accuracy of medication data in an outpatient electronic medical record. J Am Med Inform Assoc. 1996;3:234–44.
  • 50. World Health Organization. Guidelines for Good Clinical Practice (GCP) for Trials on Pharmaceutical Products. WHO Technical Report Series; 1995. Annex 3.

Data Quality in Health Research: Integrative Literature Review

Affiliations

  • 1 Ribeirão Preto School of Medicine, University of Sao Paulo, Ribeirão Preto, Brazil.
  • 2 Polytechnic Institute of Leiria, Leiria, Portugal.
  • 3 Institute for Systems and Computers Engineering, Coimbra, Portugal.
  • 4 Center for Research in Health Technologies and Services, Porto, Portugal.
  • PMID: 37906223
  • PMCID: PMC10646672
  • DOI: 10.2196/41446

Background: Decision-making and strategies to improve service delivery must be supported by reliable health data to generate consistent evidence on health status. The data quality management process must ensure the reliability of the collected data; consequently, various methodologies to improve the quality of services are applied in the health field. At the same time, scientific research is constantly evolving to improve data quality through better reproducibility and the empowerment of researchers, and it offers patient groups tools for secure data sharing and privacy compliance.

Objective: Through an integrative literature review, the aim of this work was to identify and evaluate digital health technology interventions designed to support the conduct of health research based on data quality.

Methods: A search was conducted in 6 electronic scientific databases in January 2022: PubMed, SCOPUS, Web of Science, Institute of Electrical and Electronics Engineers Digital Library, Cumulative Index of Nursing and Allied Health Literature, and Latin American and Caribbean Health Sciences Literature. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist and flowchart were used to visualize the search strategy results in the databases.

Results: After analyzing and extracting the outcomes of interest, 33 papers were included in the review. The studies covered the period of 2017-2021 and were conducted in 22 countries. Key findings revealed variability and a lack of consensus in assessing data quality domains and metrics. Data quality factors included the research environment, application time, and development steps. Strategies for improving data quality involved using business intelligence models, statistical analyses, data mining techniques, and qualitative approaches.

Conclusions: The main barriers to health data quality are technical, motivational, economic, political, legal, ethical, organizational, human resource, and methodological. The data quality process and techniques, from precollection through gathering, postcollection, and analysis, are critical for the final result of a study and for the quality of processes and decision-making in a health care organization. The findings highlight the need for standardized practices and collaborative efforts to enhance data quality in health research. Finally, context guides decisions regarding data quality strategies and techniques.

International registered report identifier (IRRID): RR2-10.1101/2022.05.31.22275804.

Keywords: artificial intelligence; data quality; database; decision-making; digital governance; digital health; e-management; health data; health services; health stakeholders; health system; reliability; research; research network; review.

©Filipe Andrade Bernardi, Domingos Alves, Nathalia Crepaldi, Diego Bettiol Yamada, Vinícius Costa Lima, Rui Rijo. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 31.10.2023.


