Progress of HMIS implementation at Film City’s Hospital up to February 2019
Element | Target | Achieved |
---|---|---|
Total desktops installed | 968 | 624 |
Total laptops installed | 25 | 18 |
Total printers installed | 821 | 719 |
Total trolleys distributed | 200 | 200 |
Total modules working | 32 | 32 |
Number of data entry operators recruited | 80 | 62 |
Digitization through electronic medical records (EMRs) at Film City’s Hospital
EMR usage report | December 2018 | January 2019 | February 2019 |
---|---|---|---|
Total patient registration | 16,755 | 26,727 | 21,245 |
Total EMR completed | 2,958 | 9,628 | 8,352 |
Percentage EMR completed | 17.65% | 36.02% | 39.31% |
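The percentage row can be reproduced from the two count rows above it. The following is a minimal Python sketch of that arithmetic; the variable and function names are illustrative and are not part of the HMIS itself.

```python
# Monthly counts copied from the EMR usage report above.
registrations = {"December 2018": 16755, "January 2019": 26727, "February 2019": 21245}
emr_completed = {"December 2018": 2958, "January 2019": 9628, "February 2019": 8352}


def emr_completion_pct(completed: int, registered: int) -> float:
    """Return completed EMRs as a percentage of registrations, rounded to 2 places."""
    return round(100 * completed / registered, 2)


completion = {
    month: emr_completion_pct(emr_completed[month], registrations[month])
    for month in registrations
}

for month, pct in completion.items():
    print(f"{month}: {pct:.2f}%")
```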
Table showing distribution of customized computer trolleys at Film City’s Hospital
Section of hospital | Number of trolleys distributed |
---|---|
Ward 7 | 4 |
Neuro OT | 2 |
TB OPD | 1 |
Medical OPD | 20 |
ICCU 1 | 5 |
Pharmacy department | 4 |
Medical store | 5 |
AMO office | 1 |
Dispensary | 5 |
Ward 1 | 4 |
Pulmonary medicine | 4 |
Ward 2 | 4 |
Ward 3 | 4 |
MSW | 7 |
OPD 1 | 28 |
Laundry | 1 |
Ward 4 | 10 |
Ward 5 | 8 |
Endocrinology department | 3 |
OPD 2 | 15 |
Ward 6 | 5 |
Ward 7 | 3 |
ICCU 2 | 10 |
Miscellaneous | 47 |
Total | 200 |
Changes in key performance indicators (KPIs) at Film City’s Hospital after HMIS implementation
KPI | December 2018 (%) | January 2019 (%) | February 2019 (%) |
---|---|---|---|
Percentage EMR completed | 17.65 | 36.02 | 39.31 |
Average overtime hours worked per employee | 10.31 | 6.40 | 6.35 |
Percentage of training programs in which Informatics was included | 35.00 | 47.00 | 52.00 |
Employee turnover | 10.05 | 11.25 | 13.12 |
Employee satisfaction | 74.67 | 69.25 | 62.33 |
Percentage of incidents reported | 5.2 | 4.5 | 3.9 |
Readmission rate | 9.5 | 10.5 | 9.5 |
Patient satisfaction | 77.67 | 80.05 | 82.25 |
Percentage of monthly complaints | 17.8 | 17.4 | 15.2 |
Percentage change in imaging turnaround time | 46.25 | 40.45 | 39.25 |
Percentage of reports generated per full time radiologist | 52.52 | 55.33 | 61.65 |
Percentage change in report turnaround time | 44.35 | 38.32 | 34.25 |
Percent change in operational cost | 43.66 | 41.25 | 42.33 |
Percentage of dispensing errors reported | 4.5 | 4.3 | 3.8 |
Downtime | 24.50 | 14.25 | 19.55 |
Percentage of instances of lost data/images | 21.66 | 20.05 | 17.25 |
Average time spent per service
Services | Before HMIS | After HMIS |
---|---|---|
Registration process | 00:04:16 | 00:02:47 |
Consultation process | 00:03:21 | 00:03:16 |
Discharge process | 00:05:53 | 00:04:48 |
Drug dispensing | 00:04:22 | 00:03:18 |
Blood test | 00:04:33 | 00:03:47 |
Ultrasonography | 00:04:54 | 00:04:38 |
MRI | 00:05:57 | 00:05:01 |
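To compare the services on a common footing, the before and after durations can be converted to seconds and expressed as an absolute and a relative saving. The sketch below assumes the times are in hh:mm:ss format, as the table suggests; the helper names are illustrative.

```python
from datetime import timedelta

# (before HMIS, after HMIS) durations copied from the table above.
service_times = {
    "Registration process": ("00:04:16", "00:02:47"),
    "Consultation process": ("00:03:21", "00:03:16"),
    "Discharge process": ("00:05:53", "00:04:48"),
    "Drug dispensing": ("00:04:22", "00:03:18"),
    "Blood test": ("00:04:33", "00:03:47"),
    "Ultrasonography": ("00:04:54", "00:04:38"),
    "MRI": ("00:05:57", "00:05:01"),
}


def parse_hms(text: str) -> timedelta:
    """Parse an hh:mm:ss string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def time_saved(before: str, after: str) -> tuple[int, float]:
    """Return (seconds saved, percentage saved relative to the 'before' time)."""
    before_s = parse_hms(before).total_seconds()
    after_s = parse_hms(after).total_seconds()
    saved = before_s - after_s
    return int(saved), round(100 * saved / before_s, 1)


savings = {name: time_saved(*times) for name, times in service_times.items()}

for name, (seconds, pct) in savings.items():
    print(f"{name}: {seconds} s saved ({pct}%)")
```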
Average gain per unit volume of the services
Services | Before HMIS | After HMIS |
---|---|---|
Pharmacy services | Rs. 11 | Rs. 14 |
Consultation services | Rs. 10 | Rs. 11 |
Laboratory services | Rs. 8 | Rs. 12 |
Radiology services | Rs. 11 | Rs. 12 |
Endoscopy services | Rs. 8 | Rs. 9 |
Employees' (n = 75) responses for HMIS
Element | Mean response on five-point scale (Strongly disagree to Strongly agree) |
---|---|
Confidence in using HMIS | 3.10 |
Simplicity of use | 3.01 |
Inconsistency of system | 2.87 |
Well integrated system | 2.96 |
Complex system | 3.48 |
Feel need for learning/training | 3.99 |
Requirement for technical assistance | 3.98 |
Francesco Lubrano, Giuseppe Varavallo, Fabrizio Bertone, Olivier Terzo
LINKS Foundation, via Boggio 61, Turin, Italy
Effective management of hospitals and health care facilities is based on knowledge of the available resources (e.g. staff, beds, services). Furthermore, during emergencies, a reliable information exchange system is a crucial factor in providing a timely response. This paper describes the Hospital Availability Management System (HAMS), a software system developed in the framework of the EU-funded SAFECARE project. The main goal of HAMS is to provide the current status of a hospital (or health-care facility) to the internal staff, but also to first responders (paramedics, firefighters, civil protection, etc.), so that the flow of patients can be managed correctly. Beyond the data coming from the normal operations of a hospital, HAMS is able to integrate inputs from incident detection systems deployed in the hospital and to automatically update availability data after cyber and/or physical incidents, also taking into account the propagation of impacts among interconnected assets. Finally, HAMS implements the OASIS EDXL-HAVE standard, to allow the exchange of information in an open and interoperable format.
During an emergency, it is important for hospitals to be able to communicate with each other and with emergency care providers about the shortage or availability of their resources in terms of bed and staff capacity. With this information, first responders can manage the flow of patients as effectively as possible, which improves the response time and the resilience of the health service during emergencies.
For example, the emergency related to the spread of the COVID-19 virus in Italy required the activation of the Remote Control Center for Health Rescue (CROSS - Centrale Remota Operazioni Soccorso Sanitario) by the Italian Department of Civil Protection. This remote control centre acts in cooperation with the regional contact points to monitor and manage the available resources for hospitals and healthcare facilities on the whole national territory. Its goal is to give support to the areas where the emergency occurred and, if needed, to get access to resources of nearby areas. The mechanism is based on requests of resources (beds, personnel, etc.) that the CROSS platform aims to satisfy, identifying which other areas can provide the needed resources.
As a consequence, effective management of emergencies and crises depends on each healthcare facility knowing the status of its own resources and on the timely availability, reliability and intelligibility of that information. Fast communication of incidents and subsequent processing of availability are therefore key to providing relevant information as soon as possible, allowing emergency managers to take more accurate decisions. Furthermore, a common protocol/language for exchanging availability data among the different stakeholders is needed to facilitate overall management.
The Hospital Availability Management System (HAMS), developed in the framework of the EU-funded SAFECARE project, has been designed to support hospitals in both aspects. The role of HAMS is thus to manage the availability of hospital assets and to provide hospital status and asset availability information in case of emergency. On the one hand, HAMS provides operators with the current availability of hospital resources through a graphical interface. Thanks to the integration with incident detection systems and impact propagation models, HAMS considers not only health emergencies but also incidents (physical or cyber) that can hinder the normal operations of the facility. On the other hand, HAMS is able to export data in a format compliant with the EDXL-HAVE standard [ 1 ].
This paper describes the HAMS system, its context and the innovation it brings, also in comparison with similar existing systems. Section 2 describes current approaches to the design of systems for emergency management in hospitals. Section 3 gives an overall description of the larger system in which HAMS is one of the building blocks. Finally, Sect. 4 describes the HAMS system, its architecture and its integration with the other modules developed within the SAFECARE project.
One of the essential parts of a hospital management system is the management of information about resource availability. A system that handles the hospital status and the availability of its resources is in charge of tracking occupancy rates, calculating the number of required employees and estimating the number of available employees and other resources such as departments, beds, services, medical equipment, drugs, etc. Such information is of primary importance in emergency situations, and the different software systems that handle it should exchange it through a common language. Several standards have been developed for this purpose, and this section describes software that has implemented the EDXL-HAVE standard.
One of the first software systems based on this standard was the Sahana Disaster Management System (DMS) [ 2 , 3 ]. The Sahana DMS was used in 2010 during the earthquake emergency in Haiti, in particular in the city of Port-au-Prince. It helped to handle the flow of victims by sharing data about hospital availability with emergency managers.
Liapis et al. [ 4 ] described how, within the IMPRESS project, they implemented a hospital availability management system through which hospitals and other health care institutions can exchange information about facilities and resources. The hospital availability data are entered by hospital operators, who report bed, staff and service availability to the crisis center and first responders. In this case, operators usually receive a request from another hospital or an emergency call center and answer it by reporting the availability of the hospital.
The Health Resources Availability Mapping System (HeRAMS) [ 5 , 6 ], developed by the WHO and the Global Health Cluster, is another relevant example. Its purpose is to evaluate the availability of services and resources in hospitals located in territories experiencing a crisis or health emergency. The system is based on surveys carried out in hospitals to collect information about the availability of health resources and services such as staff, beds, medical equipment and drugs. The results of the surveys are reported in an interactive dashboard that visualizes the status of hospital resources. Based on the results, the WHO, in collaboration with the local health ministries, develops analytical reports to plan future measures to improve the situation. This solution is therefore useful in helping governments manage health services during emergencies.
The analysis of the main projects in the management of hospital availability shows that the use of a standard during a crisis or emergency is essential for exchanging information quickly and reliably between different hospital systems.
The SAFECARE project is developing an integrated solution for the cyber and physical security of the healthcare sector in general [ 9 ]. As such, the HAMS service is a component plugged into a more complex infrastructure, consisting of cyber and physical incident detection systems and a centralised system capable of combining and storing incoming data and evaluating potential impacts when security incidents occur (Fig. 1 ).
Fig. 1. SAFECARE global architecture
Data about hospital assets are statically stored in a database that, in SAFECARE terminology, is called the Central Database (CDB). Such data include departments, medical devices, facilities, personnel, etc. Moreover, dynamic information and messages such as fire alarms, physical access control alarms, malware detections and so on are automatically generated by various sensors and systems and are generally directed to human operators, who can validate or reject them. Once an incident is validated by a human operator, the potential impacts corresponding to that incident are evaluated and simulated. Impacts are a list of assets that may have been involved in the incident; for each asset, a corresponding likelihood and severity are estimated. With the information contained in incidents and impacts, HAMS can evaluate and update the availability and status of each resource. The key is to optimize the way the availability of an asset in the system is updated when it changes.
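The step from validated impacts to updated availability can be illustrated with a minimal Python sketch. It is not the SAFECARE implementation: the 0 to 1 scales for likelihood and severity, the product used as a risk score, the two thresholds and the three-level status are all assumptions made for illustration, as are the asset identifiers.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical availability levels, ordered from best to worst."""
    AVAILABLE = 0
    DEGRADED = 1
    UNAVAILABLE = 2


@dataclass(frozen=True)
class Impact:
    asset_id: str
    likelihood: float  # assumed to be in [0, 1]
    severity: float    # assumed to be in [0, 1]


def status_from_impact(impact: Impact,
                       degraded_at: float = 0.3,
                       unavailable_at: float = 0.7) -> Status:
    """Map an impact to a status using an assumed likelihood * severity score."""
    risk = impact.likelihood * impact.severity
    if risk >= unavailable_at:
        return Status.UNAVAILABLE
    if risk >= degraded_at:
        return Status.DEGRADED
    return Status.AVAILABLE


def apply_impacts(availability: dict[str, Status],
                  impacts: list[Impact]) -> dict[str, Status]:
    """Return a new availability map, keeping the worst status seen per asset."""
    updated = dict(availability)
    for impact in impacts:
        current = updated.get(impact.asset_id, Status.AVAILABLE)
        updated[impact.asset_id] = max(current, status_from_impact(impact))
    return updated


# Example: a validated incident touches the IT network and one MRI scanner.
availability = {
    "it-network": Status.AVAILABLE,
    "mri-1": Status.AVAILABLE,
    "ward-3": Status.AVAILABLE,
}
impacts = [Impact("it-network", 0.9, 0.9), Impact("mri-1", 0.8, 0.5)]
updated = apply_impacts(availability, impacts)

for asset_id, status in updated.items():
    print(asset_id, status.name)
```

Keeping the worst status per asset means that several impacts on the same asset never improve its availability; restoring an asset would be a separate, explicit operation.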
Relevant data.
When an incident occurs in a healthcare facility, such as a hospital, the internal staff must have updated information on the availability status of several elements in order to respond adequately to the incident and safely continue the hospital activities for patients and staff. The required information can be grouped into three main categories: hospital assets (including services), bed capacity and staff availability.
Hospital assets include all the medical devices inside the hospital. Knowing which assets are available allows the hospital staff to understand which kinds of patients can be accepted and which have to be transferred to another facility. Beyond medical devices, hospital assets include all the services required for the proper operation and management of the hospital. These services are crucial for providing effective assistance to patients and users and for guaranteeing their security and safety, even if some of them may not seem essential at first sight. For example, the IT system is not specifically related to the treatment of a patient. However, it is crucial for managing and recording patients' personal data and for protecting those data from unauthorized access.
Finally, two essential elements that a hospital management system must handle are the number of available beds and the available staff. The number of total and available beds should not be expressed as a single total for the entire healthcare facility, but for each medical ward, in order to provide a clear picture of how many patients, and which of them, can be admitted to the facility. Strictly related to the bed capacity is the assessment of available staff (doctors, nurses, paramedics, etc.), as they are crucial for assisting patients. Thus bed and staff availability are related, and the availability of a ward or a hospital strictly depends on these two elements. According to this principle, in some open standards like EDXL-HAVE, they are considered together, and the bed capacity parameter reflects fully staffed and equipped beds.
As described above, the HAMS relies heavily on the EDXL-HAVE standard to represent data internally and to share them with other systems. This section describes the main data types actually used by the HAMS, through a detailed description of the standard. EDXL [ 7 ] is a set of standards approved by OASIS to manage the entire emergency life cycle. It was developed to exchange and share information easily between different emergency systems. EDXL-HAVE (HAVE) [ 8 ] is an XML messaging standard developed by OASIS in the context of emergency management. A HAVE schema consists of a root element that uniquely identifies the organization that is responsible for the reporting facilities. Figure 2 shows the HAMS data model based on the main EDXL-HAVE data types. Each facility is described through several attributes and a list of sub-elements that allow a complete description of hospital departments, services, and resources.
HAMS internal data model
HAVE is the top-level container element for Hospital Availability Exchange (HAVE) message. It has the following attributes:
HAVE element has also a list of facilities. Each Facility contains the following main attributes:
Facilities can have several sub-elements, such as services, operations and resources. Each Service is represented by a set of attributes:
Systems that are not considered medical assets but that are fundamental for the proper operation of the healthcare facility are represented as Operation elements. Operations are characterized by a name, a kind and a status.
Finally, medical devices and staff are represented by the resource element and staffing element. Through these elements it is possible to represent the status of the resources (medical devices and staff) in terms of offers or needs too.
The HAMS has been designed as a web application, following the client-server paradigm.
Fig. 3 describes the internal architecture of the HAMS and its interconnections with other systems. In the HAMS architecture, two different parts can be identified:
HAMS internal architecture
Taking into account the provisioning of availability data and the possibility of manually reporting information into the platform, the HAMS can be considered stand-alone software for managing the availability of a hospital. Its operation, however, is fully integrated with the SAFECARE system and, in particular, it is an important phase of the incident lifecycle.
At boot time, the HAMS populates its internal data structure with all the relevant information regarding the hospital, obtaining the static data from the Central Database. In the SAFECARE project, the Central Database, through the Data Exchange Layer, exposes some REST APIs that the HAMS module can use to get information about the various assets of the hospital and the corresponding baseline availability data. Following an approach similar to that of the HAVE standard, asset availability status is mapped using a two-level approach: a Boolean value (yes or no), indicating whether the asset is available, and a colour code (green, yellow, red) that gives more detail on the availability. If an asset is marked with a “green” status, it is working in normal conditions and is thus fully available; if it is marked with a “yellow” status, the asset is still available, but it has been involved in an incident, so specific attention is needed to prevent its status from deteriorating; finally, if it is marked with a “red” status, there is a severe/extreme deviation from normal operation, making the asset unavailable. Furthermore, if an asset is a department or a facility, the static data provided by the Central Database module include the total number of beds as well as the number of staff members.
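The two-level status mapping described above can be illustrated with a minimal Python sketch. The class and field names (`Colour`, `AssetStatus`, `asset_id`) are hypothetical and not taken from the HAMS code base, and deriving the Boolean flag from the colour code is an assumption based on the description of the three colours.

```python
from dataclasses import dataclass
from enum import Enum


class Colour(Enum):
    GREEN = "green"    # normal operation, fully available
    YELLOW = "yellow"  # involved in an incident, still available
    RED = "red"        # severe/extreme deviation, not available


@dataclass
class AssetStatus:
    asset_id: str      # hypothetical identifier, for illustration only
    colour: Colour

    @property
    def available(self) -> bool:
        # First level of the mapping: a Boolean flag.
        # Assumption: only a "red" status makes the asset unavailable.
        return self.colour is not Colour.RED


mri = AssetStatus("mri-01", Colour.YELLOW)
ct = AssetStatus("ct-02", Colour.RED)
print(mri.available, ct.available)
```

Keeping the Boolean as a derived property avoids the two levels drifting apart; a real implementation might store both values independently.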
When fully operational, the HAMS receives messages from the incident detection modules through the Data Exchange Layer. These messages provide data on the assets involved in an incident, together with a severity level. Incidents are evaluated and validated by specialized human operators, so they are considered reliable. Based on the assets involved and on the severity of the incident, the internal logic of the HAMS applies several policies to decide automatically whether the status of the involved assets needs to be updated. For example, if a physical incident reports loitering and suspicious behaviour by two people in a hall, the incident will be managed but the HAMS will not update any availability. In contrast, if a cyber incident reports a high-severity attack on a medical device or an IT system, the HAMS will update the status of these assets.
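The kind of policy described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the severity scale, the asset type names and the "high" threshold are hypothetical, chosen so that the two examples in the text behave as described; the actual HAMS policies are not specified here.

```python
from dataclasses import dataclass

# Assumed ordering of severity levels (not taken from the SAFECARE specification).
SEVERITY_ORDER = ["low", "medium", "high"]

# Assumed asset types whose availability can be affected by an incident.
CRITICAL_ASSET_TYPES = {"medical_device", "it_system"}


@dataclass
class Incident:
    kind: str        # "physical" or "cyber"
    severity: str    # one of SEVERITY_ORDER
    asset_type: str  # e.g. "hall", "medical_device", "it_system"


def should_update_availability(incident: Incident, threshold: str = "high") -> bool:
    """Return True if the status of the involved asset should be updated."""
    severe = SEVERITY_ORDER.index(incident.severity) >= SEVERITY_ORDER.index(threshold)
    return severe and incident.asset_type in CRITICAL_ASSET_TYPES


loitering = Incident(kind="physical", severity="low", asset_type="hall")
cyber_attack = Incident(kind="cyber", severity="high", asset_type="medical_device")
print(should_update_availability(loitering))     # incident is managed, no update
print(should_update_availability(cyber_attack))  # asset status is updated
```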
The updated status and availability are shown to the final user through the graphic interface. At the same time, the HAMS module updates a specific table of the Central Database in order to keep track of the history of availability changes.
After an incident is validated by a human operator, it is forwarded to the HAMS and to the other decision modules present in the SAFECARE system. One of these modules is the Impact Propagation module [ 10 ]. This software, triggered by incident messages, evaluates the incident taking into account the directly involved assets and the severity, and it provides a list of assets that could potentially be impacted, simulating potential cascading effects of that incident. Thus, the output of the Impact Propagation module is a list of assets, each with a likelihood that indicates how likely it is for the asset to be affected by the incident. Once this process ends, the list of potentially involved assets is also forwarded to the HAMS. Based on these values, the HAMS computes the final hospital availability after the incident, updating the status of medical devices as well as bed and staff availability if necessary. Updates of status and resource availability are stored in the Central Database and shown by the HAMS web interface, so that users can visualise updated information. These features, combined with the standardized data model and the possibility of retrieving the hospital status through a specific REST API, make the HAMS an innovative tool in its application field.
This paper describes the Hospital Availability Management System (HAMS), developed as a sub-module of a more complex system that manages cyber and physical security in hospitals, which are considered critical infrastructures. Having updated information about the status and availability of medical devices, beds, and medical staff is crucial during emergencies. The HAMS can manage this information, allowing authorized users to get data through a web interface or through a REST API that exports data according to the EDXL-HAVE format. It provides such information to other software and management systems, which are able to gather data from different infrastructures and to provide indications to first responders. This can improve health service resilience and is useful for rerouting the flow of patients in case of an incident. Through the integration with the SAFECARE system, the HAMS aims to automatically update data about the availability of hospital assets, speeding up this process. Thus, the HAMS can be considered a step towards a fully automatic system able to update the availability of single assets based on incidents.
This research received funding from the European Union’s H2020 Research and Innovation Action “Secure societies – Protecting freedom and security of Europe and its citizens” challenge, under grant agreement 787002 (project SAFECARE).
1 https://www.safecare-project.eu/ .
Leonard Barolli, Email: pj.ca.tif@illorab .
Aneta Poniszewska-Maranda, Email: [email protected] .
Tomoya Enokido, Email: pj.ca.sir@one .
Francesco Lubrano, Email: [email protected] .
Federico Stirano, Email: [email protected] .
Giuseppe Varavallo, Email: [email protected] .
Fabrizio Bertone, Email: [email protected] .
Olivier Terzo, Email: [email protected] .
22 Pages Posted: 31 May 2022
Rikesh Kumar, Galgotias University.
Date Written: May 9, 2022
A Hospital Management System is an organized, computerized system designed and programmed to handle the day-to-day operations and management of hospital activities. The program can handle inpatients, outpatients, records, database treatments, illness status, and billing in the pharmacy and labs. It also maintains hospital information such as ward ID, doctors in charge, and the administering department. A major problem for patients today is obtaining their reports after a consultation: many hospitals manage reports in their own systems, but the reports are not available to patients once they are outside the hospital. In this project, we provide an additional facility to store reports in the database and make them available from anywhere in the world.
Plot No.2, Sector 17-A Yamuna Expressway Greater Noida, Uttar Pradesh 201306 India
BMC Health Services Research, volume 24, Article number: 860 (2024)
Governments worldwide are facing growing pressure to increase transparency, as citizens demand greater insight into decision-making processes and public spending. An example is the release of open healthcare data to researchers, as healthcare is one of the top economic sectors. Significant information systems development and computational experimentation are required to extract meaning and value from these datasets. We use a large open health dataset provided by the New York State Statewide Planning and Research Cooperative System (SPARCS) containing 2.3 million de-identified patient records. One of the fields in these records is a patient’s length of stay (LoS) in a hospital, which is crucial in estimating healthcare costs and planning hospital capacity for future needs. Hence it would be very beneficial for hospitals to be able to predict the LoS early. The area of machine learning offers a potential solution, which is the focus of the current paper.
We investigated multiple machine learning techniques including feature engineering, regression, and classification trees to predict the length of stay (LoS) of all the hospital procedures currently available in the dataset. Whereas many researchers focus on LoS prediction for a specific disease, a unique feature of our model is its ability to simultaneously handle 285 diagnosis codes from the Clinical Classification System (CCS). We focused on the interpretability and explainability of input features and the resulting models. We developed separate models for newborns and non-newborns.
The study yields promising results, demonstrating the effectiveness of machine learning in predicting LoS. The best R² scores achieved are noteworthy: 0.82 for newborns using linear regression and 0.43 for non-newborns using CatBoost regression. Focusing on cardiovascular disease refines the predictive capability, achieving an improved R² score of 0.62. The models not only demonstrate high performance but also provide understandable insights. For instance, birth weight is employed for predicting LoS in newborns, while diagnostic-related group classification proves valuable for non-newborns.
Our study showcases the practical utility of machine learning models in predicting LoS during patient admittance. The emphasis on interpretability ensures that the models can be easily comprehended and replicated by other researchers. Healthcare stakeholders, including providers, administrators, and patients, stand to benefit significantly. The findings offer valuable insights for cost estimation and capacity planning, contributing to the overall enhancement of healthcare management and delivery.
Democratic governments worldwide are placing increasing importance on transparency, as this leads to better governance, market efficiency, improvement, and acceptance of government policies. This is highlighted by reports from the Organization for Economic Co-operation and Development (OECD), an international organization whose mission is to shape policies that foster prosperity, equality, opportunity and well-being for all [ 1 ]. Openness and transparency have been recognized as pillars of democracy, and also as a means of fostering sustainable development goals [ 2 ], which is a major focus of the United Nations ( https://sustainabledevelopment.un.org/sdg16 ).
An important government function is to provide for the healthcare needs of its citizens. The U.S. spends about $3.6 trillion a year on healthcare, which represents 18% of its GDP [ 3 ]. Other developed nations spend around 10% of their GDP on healthcare. The percentage of GDP spent on healthcare is rising as populations age. Consequently, research on healthcare expenditure and patient outcomes is crucial to maintain viable national economies. It is advantageous for nations to combine investigations by the private sector, government sector, non-profit agencies, and universities to find the best solutions. A promising path is to make health data open, which allows investigators from all sectors to participate and contribute their expertise. Though there are obvious patient privacy concerns, open health data has been made available by organizations such as New York State Statewide Planning and Research Cooperative System (SPARCS) [ 4 ].
Once the data is made available, it needs to be suitably processed to extract meaning and insights that will help healthcare providers and patients. We favor the creation and use of an open-source analytics system so that the entire research community can benefit from the effort [ 5 , 6 , 7 ]. As a concrete demonstration of the utility of our system and approach, we revealed that there is a growing incidence of mental health issues amongst adolescents in specific counties in New York State [ 8 ]. This has resulted in targeted interventions to address these problems in these communities [ 8 ]. Knowing where the problems lie allows policymakers and funding agencies to direct resources where needed.
Healthcare in the U.S. is largely provided through private insurance companies and it is difficult for patients to reliably understand what their expected healthcare costs are [ 9 , 10 ]. It is ironic that consumers can readily find prices of electronics items, books, clothes etc. online, but cannot find information about healthcare as easily. The availability of healthcare information including costs, incidence of diseases, and the expected length of stay for different procedures will allow consumers and patients to make better and more informed choices. For instance, in the U.S., patients can budget pre-tax contributions to health savings accounts, or decide when to opt for an elective surgery based on the expected duration of that procedure.
To achieve this capability, it is essential to have the underlying data and models that interpret the data. Our goal in this paper is twofold: (a) to demonstrate how to design an analytics system that works with open health data and (b) to apply it to a problem of interest to both healthcare providers and patients. Significant advances have been made recently in the fields of data mining, machine-learning and artificial intelligence, with growing applications in healthcare [ 11 ]. To make our work concrete, we use our machine-learning system to predict the length of stay (LoS) in hospitals given the patient information in the open healthcare data released by New York State SPARCS [ 4 ].
The LoS is an important variable in determining healthcare costs, as costs directly increase for longer stays. The analysis by Jones [ 12 ] shows that the trends in LoS, hospital bed capacity and population growth have to be carefully analyzed for capacity planning and to ensure that adequate healthcare can be provided in the future. With certain health conditions such as cardiovascular disease, the hospital LoS is expected to increase due to the aging of the population in many countries worldwide [ 13 ]. During the COVID-19 pandemic, hospital bed capacity became a critical issue [ 14 ], and many regions in the world experienced a shortage of healthcare resources. Hence it is desirable to have models that can predict the LoS for a variety of diseases from available patient data.
The LoS is usually unknown at the time a patient is admitted. Hence, the objective of our research is to investigate whether we can predict the patient LoS from variables collected at the time of admission. By building a predictive model through machine learning techniques, we demonstrate that it is possible to predict the LoS from data that includes the Clinical Classifications Software (CCS) diagnosis code, severity of illness, and the need for surgery. We investigate several analytics techniques including feature selection, feature encoding, feature engineering, model selection, and model training in order to thoroughly explore the choices that affect eventual model performance. By using a linear regression model, we obtain an R² value of 0.42 when we predict the LoS from a set of 23 patient features. The success of our model will be beneficial to healthcare providers and policymakers for capacity planning purposes and to understand how to control healthcare costs. Patients and consumers can also use our model to estimate the LoS for procedures they are undergoing or for planning elective surgeries.
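For readers less familiar with the metric, the coefficient of determination R² compares the squared error of the predictions with that of a baseline that always predicts the mean LoS. The following dependency-free sketch uses the standard definition; the function name and the toy values are illustrative only and are not data from the study.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


# Toy LoS values in days (made up for illustration).
actual = [1, 2, 3]
perfect = r2_score(actual, [1, 2, 3])   # predictions match exactly
baseline = r2_score(actual, [2, 2, 2])  # always predicting the mean
print(perfect, baseline)
```

A value of 1 means perfect prediction, and a value of 0 means the model does no better than predicting the mean LoS for every patient.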
Stone et al. [ 15 ] present a survey of techniques used to predict the LoS, which include statistical and arithmetic methods, intelligent data mining approaches and operations-research based methods. Lequertier et al. [ 16 ] surveyed methods for LoS prediction.
The main gap in the literature is that most methods focus on analyzing trends in the LoS, predict the LoS only for specific conditions, or restrict their analysis to data from specific hospitals. For instance, Sridhar et al. [ 17 ] created a model to predict the LoS for joint replacements in rural hospitals in the state of Montana by using a training set with 127 patients and a test set with 31 patients. In contrast, we have developed our model to predict the LoS for 285 different CCS diagnosis codes, over a set of 2.3 million patients across all hospitals in New York State. The CCS diagnosis code refers to the code used by the Clinical Classifications Software system, which encompasses 285 possible diagnosis and procedure categories [ 18 ]. Since the CCS diagnosis codes are too numerous to list, we give a few examples that we analyzed, including but not limited to abdominal hernia, acute myocardial infarction, acute renal failure, behavioral disorders, bladder cancer, Hodgkin's disease, multiple sclerosis, multiple myeloma, schizophrenia, septicemia, and varicose veins. To the best of our knowledge, there are no other models that predict the LoS on such a variety of diagnosis codes, with a patient sample greater than 2 million records, and with freely available open data. Hence, our investigation is unique from this point of view.
Sotodeh et al. [ 19 ] developed a Markov model to predict the LoS in intensive care unit patients. Ma et al. [ 20 ] used decision tree methods to predict LoS in 11,206 patients with respiratory disease.
Burn et al. examined trends in the LoS for patients undergoing hip replacement and knee replacement in the U.K. [ 21 ]. Their study demonstrated a steady decline in the LoS from 1997–2012. The purpose of their study was to determine factors that contributed to this decline, and they identified improved surgical techniques such as fast-track arthroplasty. However, they did not develop any machine-learning models to predict the LoS.
Hachesu et al. examined the LoS for cardiac disease patients [ 22 ] and found that blood pressure is an important predictor of LoS. Garcia et al. determined factors influencing the LoS for patients undergoing treatment for hip fracture [ 23 ]. Vekaria et al. analyzed the variability of LoS for COVID-19 patients [ 24 ]. Arjannikov et al. [ 25 ] used positive-unlabeled learning to develop a predictive model for LoS.
Gupta et al. [ 26 ] conducted a meta-analysis of previously published papers on the role of nutrition in the LoS of cancer patients, and found that nutrition status is especially important in predicting LoS for gastrointestinal cancer. Similarly, Almashrafi et al. [ 27 ] performed a meta-analysis of existing literature on cardiac patients and reviewed factors affecting their LoS. However, they did not develop quantitative models in their work. Kalgotra et al. [ 28 ] used recurrent neural networks to build a prediction model for LoS.
Daghistani et al. [ 13 ] developed a machine learning model to predict length of stay for cardiac patients. They used a database of 16,414 patient records and classified the length of stay into three classes: short LoS (< 3 days), intermediate LoS (3–5 days) and long LoS (> 5 days). They used detailed patient information, including blood test results, blood pressure, and patient history including smoking habits. Such detailed information is not available in the much larger SPARCS dataset that we utilized in our study.
Awad et al. [ 29 ] provide a comprehensive review of various techniques to predict the LoS. Though simple statistical methods have been used in the past, they make assumptions that the LoS is normally distributed, whereas the LoS has an exponential distribution [ 29 ]. Consequently, it is preferable to use techniques that do not make assumptions about the distribution of the data. Candidate techniques include regression, classification and regression trees, random forests, and neural networks. Rather than using statistical parametric techniques that fit parameters to specific statistical distributions, we favor data-driven techniques that apply machine-learning.
In 2020, during the height of the COVID-19 pandemic, the Lancet, a premier medical journal, drew widespread rebuke [ 30 , 31 , 32 ] for publishing a paper based on questionable data. Many medical journals published expressions of concern [ 33 , 34 ]. The Lancet itself retracted the questionable paper [ 35 ], which is available at [ 36 ] with the stamp “retracted” placed on all pages. One possible way to prevent such incidents is for top medical journals to require authors to make their data available for verification by the scientific community. Patient privacy concerns can be mitigated by de-identifying the records made available, as is already done by the New York State SPARCS effort [ 4 ]. Our methodology and analytics system design will become more relevant in the future, as there is a desire to prevent a repetition of the Lancet debacle. Even before the Lancet incident, public trust in medicine and healthcare policy was declining [ 37 ]. This situation continues today, with multiple factors at play, including biased news reporting in mainstream media [ 38 ]. A desirable solution is to make these fields more transparent, by releasing data to the public and explaining the various decisions in terms that the public can understand. The research in this paper demonstrates how such a solution can be developed.
We describe the following three requirements of an ideal system for processing open healthcare data:
Utilize open-source platforms to permit easy replicability and reproducibility.
Create interpretable and explainable models.
Demonstrate an understanding of how the input features determine the outcomes of interest.
The first requirement captures the need for research to be easily reproduced by peers in the field. There is growing concern that scientific results are becoming hard for researchers to reproduce [ 39 , 40 , 41 ]. This undermines the validity of the research and ultimately hurts the fields. Baker termed this the “reproducibility crisis”, and performed an analysis of the top factors that lead to irreproducibility of research [ 39 ]. Two of the top factors consist of the unavailability of raw data and code.
The second requirement addresses the need for the machine-learning models to produce explanations of their results. Though deep-learning models are popular today, they have been criticized for functioning as black-boxes, and the precise working of the model is hard to discern. In the field of healthcare, it is more desirable to have models that can be explained easily [ 42 ]. Unless healthcare providers understand how a model works, they will be reluctant to apply it in their practice. For instance, Reyes et al. determined that interpretable Artificial Intelligence systems can be better verified, trusted, and adopted in radiology practice [ 43 ].
The third requirement shows that it is important for relevant patient features to be captured that can be related to the outcomes of interest, such as LoS, total cost, mortality rate etc. Furthermore, healthcare providers should be able to understand the influence of these features on the performance of the model [ 44 ]. This is especially critical when feature engineering methods are used to combine existing features and create new features.
In the subsequent sections, we present our design for a healthcare analytics system that satisfies these requirements. We apply this methodology to the specific problem of predicting the LoS.
We have designed the overall system architecture as shown in Fig. 1 . This system is built to handle any open data source. We have shown the New York SPARCS as one of the data sources for the sake of specificity. Our framework can be applied to data from multiple sources such as the Center for Medicare and Medicaid Services (CMS in the U.S.) as shown in our previous work [ 6 ]. We chose a Python-based framework that utilizes Pandas [ 45 ] and Scikit learn [ 46 ]. Python is currently the most popular programming language for engineering and system design applications [ 47 ].
Shows the system architecture. We use Python-based open-source tools such as Pandas and Scikit-Learn to implement the system
In Fig. 2 , we provide a detailed overview of the necessary processing stages. The specific algorithms used in each stage are described in the following sections.
Shows the processing stages in our analytics pipeline
Recent research has shown that it is highly desirable for machine learning models used in the healthcare domain to be explainable to healthcare providers and professionals [ 48 ]. Hence, we focused on the interpretability and explainability of input features in our dataset and the models we chose to explore. We restricted our investigation to models that are explainable, including regression models, multinomial logistic regression, random forests, and decision trees. We also developed separate models for newborns and non-newborns.
During our investigation, we utilized open-health data provided by the New York State SPARCS system. The data we accessed was from the year 2016, which was the most recent year available at the time. This data was provided in the form of a CSV file, containing 2,343,429 rows and 34 columns. Each row contains de-identified in-patient discharge information. The dataset columns contained various types of information. They included geographic descriptors related to the hospital where care was provided, demographic descriptors such as patient race, ethnicity, and age, medical descriptors such as the CCS diagnosis code, APR DRG code, severity of illness, and length of stay. Additionally, payment descriptors were present, which included information about the type of insurance, total charges, and total cost of the procedure.
Detailed descriptions of all the elements in the data can be found in [ 49 ]. The CCS diagnosis code has been described earlier. The term “DRG” stands for Diagnostic Related Group [ 49 ], which is used by the Center for Medicare and Medicaid services in the U.S. for reimbursement purposes [ 50 ].
The data includes all patients who underwent inpatient procedures at all New York State Hospitals [ 51 ]. The payment for the care can come from multiple sources: Department of Corrections, Federal/State/Local/Veterans Administration, Managed Care, Medicare, Medicaid, Miscellaneous, Private Health Insurance, and Self-Pay. The dataset sourced from the New York State SPARCS system, encompassing a wider patient population beyond Medicare/Medicaid, holds greater value compared to datasets exclusively composed of Medicare/Medicaid patients. For instance, Gilmore et al. analyzed only Medicare patients [ 52 ].
We examine the distribution of the LoS in the dataset, as shown in Fig. 3 . We note that the providers of the data have truncated the length of stay to 120 days. This explains the peak we see at the tail of the distribution.
Distribution of the length of stay in the dataset
We identified 36,280 samples (1.55% of the data) with missing values and discarded them from further analysis. We also removed samples with Type of Admission = ‘Unknown’ (0.02% of samples). The final dataset therefore has 2,306,668 samples. The columns ‘Payment Typology 2’ and ‘Payment Typology 3’ have missing values in ≥ 50% of samples; these were replaced by the string ‘None’.
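The cleaning rules above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' code (which is based on Pandas): the function name and the toy rows are hypothetical, and it assumes that the two payment typology columns are filled with ‘None’ before rows with other missing values are dropped.

```python
# Columns whose missing values are replaced instead of causing the row to be dropped.
OPTIONAL_COLUMNS = ("Payment Typology 2", "Payment Typology 3")


def clean_rows(rows):
    """Apply the cleaning rules to a list of dict records."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy, so the input is left untouched
        for col in OPTIONAL_COLUMNS:
            if row.get(col) in (None, ""):
                row[col] = "None"
        if any(value in (None, "") for value in row.values()):
            continue  # drop rows with any other missing value
        if row["Type of Admission"] == "Unknown":
            continue  # drop rows with unknown admission type
        cleaned.append(row)
    return cleaned


# Toy records (made up for illustration).
rows = [
    {"Type of Admission": "Emergency", "Length of Stay": 3,
     "Payment Typology 2": None, "Payment Typology 3": ""},
    {"Type of Admission": "Unknown", "Length of Stay": 2,
     "Payment Typology 2": "Medicaid", "Payment Typology 3": None},
    {"Type of Admission": "Elective", "Length of Stay": None,
     "Payment Typology 2": "Medicare", "Payment Typology 3": None},
]
cleaned = clean_rows(rows)
print(len(cleaned))
```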
We note that approximately 10% of the dataset consists of rows representing newborns. We treat this group as a separate category. We found that the ‘Birth Weight’ feature had a zero value for non-newborn samples. Accordingly, to better use the ‘Birth Weight’ feature, we partitioned the data into two classes: newborns and non-newborns. This results in two classes of models, one for newborns and the second for all other patients. We removed the ‘Birth Weight’ feature in the input for the non-newborn samples as its value was zero for those samples.
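The partition into newborn and non-newborn samples might look like the sketch below. It assumes that a positive ‘Birth Weight’ identifies a newborn record, which follows from the observation that the feature is zero for non-newborns but is not stated as the authors' exact criterion; the function name and toy rows are hypothetical.

```python
def split_newborns(rows):
    """Split records into newborns and non-newborns using 'Birth Weight'."""
    newborns, others = [], []
    for row in rows:
        if row["Birth Weight"] > 0:
            newborns.append(dict(row))
        else:
            # The feature carries no information for non-newborns, so drop it.
            others.append({k: v for k, v in row.items() if k != "Birth Weight"})
    return newborns, others


# Toy records (made up for illustration); birth weight in grams.
rows = [
    {"Birth Weight": 3200, "Length of Stay": 2},
    {"Birth Weight": 0, "Length of Stay": 5},
    {"Birth Weight": 0, "Length of Stay": 1},
]
newborns, others = split_newborns(rows)
print(len(newborns), len(others))
```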
The columns ‘Total Costs’ and, similarly, ‘Total Charges’ are usually proportional to the LoS, and it would not be fair to use these variables to predict the LoS. Hence, we removed these columns. We found that the columns ‘Discharge Year’ and ‘Abortion Edit Indicator’ are redundant for LoS prediction models, and we removed them. We also removed the columns ‘CCS Diagnosis Description’, ‘CCS Procedure Description’, ‘APR DRG Description’, ‘APR MDC Description’, and ‘APR Severity of Illness Description’ as we were given their corresponding numerical codes as features.
Since the focus of this paper is on the prediction of the LoS, we analyzed the distribution of LoS values in the dataset.
We developed regression models using all the LoS values, from 1 to 120 days. We also developed classification models where we discretized the LoS into specific bins. Since the distribution of LoS values is not uniform, and is heavily clustered around smaller values, we discretized the LoS into a small number of bins, e.g., 6 to 8 bins.
We utilized 10% of the data as a holdout test-set, which was not seen during the training phase. For the remaining 90% of the data, we used tenfold cross-validation in order to train the model and determine the best parameters to use.
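A minimal sketch of this evaluation protocol with scikit-learn is given below. The synthetic data, the random forest estimator, and the small `max_depth` grid are assumptions chosen only to keep the example fast; they are not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data (assumption): 500 samples, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Hold out 10% of the data; it is never touched during model selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# Tenfold cross-validation on the remaining 90% to choose parameters.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={"max_depth": [3, 6]},  # illustrative grid only
    cv=cv,
    scoring="r2",
)
search.fit(X_dev, y_dev)

# Final, one-time evaluation on the untouched holdout set.
test_r2 = search.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

Keeping the holdout split outside the cross-validation loop means the reported test score is not influenced by the parameter search.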
Many variables in the dataset are categorical, e.g., the variable “APR Severity of Illness Description” takes values in the set [Major, Minor, Moderate, Extreme]. We used distribution-dependent target encoding techniques and one-hot techniques to improve the model performance [ 53 ]. We replaced each categorical value with the product of the mean LoS and the median LoS for that category value. The encoded feature can then better capture how the distribution of the LoS depends on the value of the categorical feature.
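The target encoding described above (mean LoS multiplied by median LoS per category value) can be sketched as follows. The function name `target_encode_mean_x_median`, the column name ‘Length of Stay’, and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def target_encode_mean_x_median(train: pd.DataFrame, column: str,
                                target: str = "Length of Stay") -> pd.Series:
    """Map each category value to mean(LoS) * median(LoS) within that category."""
    stats = train.groupby(column)[target].agg(["mean", "median"])
    return stats["mean"] * stats["median"]

# Toy rows invented for illustration; these are not SPARCS records.
train = pd.DataFrame({
    "APR Severity of Illness Description":
        ["Minor", "Minor", "Major", "Major", "Major", "Extreme"],
    "Length of Stay": [1, 3, 4, 6, 11, 20],
})

mapping = target_encode_mean_x_median(train, "APR Severity of Illness Description")
train["severity_encoded"] = train["APR Severity of Illness Description"].map(mapping)
print(mapping)
```

For the toy data, ‘Minor’ has mean 2 and median 2 (encoded as 4), while ‘Major’ has mean 7 and median 6 (encoded as 42). In a cross-validation setting, such a mapping would normally be computed on the training folds only and then applied to the validation fold, so that the target does not leak into the features.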
For the linear regression model [ 54 ], we sampled a set of 6 categorical features, [‘Type of Admission’, ‘Patient Disposition’, ‘APR Severity of Illness Code’, ‘APR Medical Surgical Description’, ‘APR MDC Code’] which we target encoded with the mean of the LoS and the median of the LoS. We then one-hot encoded every feature (all features are categorical) and for each such one-hot encoded feature, created a new feature for each of the features in the sampled set, by replacing the ones in the one-hot encoded feature with the value of the corresponding feature in the sampled set. For example, we one-hot encoded ‘Operating Certificate Number’, and for samples where ‘Operating Certificate Number’ was 3, we created 6 features, each where samples having the value 3 were assigned the target encoded values of the sampled set features, and the other samples were assigned zero. We used such techniques to exploit the linear relation between LoS and each feature.
According to the sklearn documentation [ 55 ], a random forest regressor is “a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting”. The random forest regressor leverages ensemble learning based on many randomized decision trees to make accurate and robust predictions for regression problems. The averaging of many trees protects against single trees overfitting the training data.
The random forest classifier is also an ensemble learning technique and uses many randomized decision trees to make predictions for classification problems. The 'wisdom of crowds' concept suggests that the decision made by a larger group of people is typically better than an individual. The random forest classifier uses this intuition, and allows each decision tree to make a prediction. Finally, the most popular predicted class is chosen as the overall classification.
For the Random Forest Regressor [ 56 , 57 ] and Random Forest Classifier [ 58 ], we used only a similar distribution-dependent target encoding, as random forest classifiers and regressors are unsuitable for sparse one-hot encoded columns.
Multinomial logistic regression is a type of regression analysis that predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. It allows for more than two discrete outcomes, extending binomial logistic regression for binary classification to models with multiple class membership. For the multinomial logistic regression model [ 59 ], we used only one-hot encoding, and not target encoding, as the target value was categorical.
Finally, we experimented with combinations of target encoding and one-hot encoding. We can either use target encoding, or one-hot encoding, or both. When both encodings are employed, the dimensionality of the data increases to accommodate the one-hot encoded features. For each combination of encodings, we also experimented with different regression models including linear regression and random forest regression.
We experimented with different feature selection methods. Since the focus of our work is on developing interpretable and explainable models, we used SHAP analysis to determine relevant features.
We examine the importance of different features in the dataset. We used the SHAP value (SHapley Additive exPlanations), a popular measure for feature importance [ 60 ]. Intuitively, the SHAP value measures the difference in model predictions when a feature is used versus omitted. It is captured by the following formula.
\({{\varnothing }}_{i}=\sum_{S\subseteq \{1,\dots ,n\}\setminus \{i\}}\frac{|S|!\,(n-|S|-1)!}{n!}\left[p(S\cup \{i\})-p(S)\right]\)
where \({{\varnothing }}_{i}\) is the SHAP value of feature \(i\) , \(p\) is the prediction by the model, \(n\) is the number of features and \(S\) is any set of features that does not include the feature \(i\) . The specific model we used for the prediction was the random forest regressor where we target-encoded all features with the product of the mean and the median of the LoS, since most of the features were categorical.
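To make the definition concrete, the sketch below computes exact Shapley values by brute-force enumeration of all feature subsets S that exclude feature i, weighting each change in prediction by the standard Shapley weight |S|!(n − |S| − 1)!/n!. This only illustrates the definition; practical SHAP implementations use much faster model-specific algorithms, and the toy prediction function `toy_predict` is an assumption.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, n_features):
    """Exact Shapley values for a set function predict(frozenset) -> float."""
    features = range(n_features)
    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Shapley weight for subsets of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Change in prediction when feature i is added to S.
                total += weight * (predict(s | {i}) - predict(s))
        phi.append(total)
    return phi

def toy_predict(s):
    """Hypothetical model: baseline 2, two main effects, one interaction."""
    value = 2.0
    if 0 in s:
        value += 3.0
    if 1 in s:
        value += 1.0
    if 0 in s and 1 in s:
        value += 2.0  # interaction shared between features 0 and 1
    return value

phi = shapley_values(toy_predict, n_features=3)
print(phi)
```

In this toy case the interaction term is split equally between features 0 and 1, the unused feature 2 receives a value of zero, and the values sum to the difference between the full prediction and the baseline.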
One approach to the problem is to bin the LoS into different classes, and train a classifier to predict which class an input sample falls in. We binned the LoS into roughly balanced classes as follows: 1 day, 2 days, 3 days, 4–6 days, > 6 days. This strategy is based on the distribution of the LoS as shown earlier in Figs. 3 and 4 .
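The binning above can be sketched with NumPy as follows. The integer class indices 0 to 4 and the helper name `bin_los` are assumptions made for illustration.

```python
import numpy as np

BIN_LABELS = ["1 day", "2 days", "3 days", "4-6 days", ">6 days"]

def bin_los(los_days):
    """Map LoS in days to class indices 0..4 for the five bins above."""
    # Right-closed upper edges: LoS <= 1, <= 2, <= 3, <= 6, and everything above.
    edges = np.array([1, 2, 3, 6])
    return np.digitize(los_days, edges, right=True)

los = np.array([1, 2, 3, 4, 6, 7, 120])
classes = bin_los(los)
for days, c in zip(los, classes):
    print(days, "->", BIN_LABELS[c])
```

Changing the `edges` array is enough to experiment with other discretizations, such as a larger number of bins.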
A density plot of the distribution of the length of stay. The area under the curve is 1. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot
We used three different classification models, comprising the following:
Multinomial Logistic Regression
Random Forest Classifier
CatBoost classifier [ 62 ].
We used a Multinomial Logistic Regression model [ 59 ] trained and tested using tenfold cross validation to classify the LoS into one of the bins. The multinomial logistic regression model is capable of providing explainable results, which is part of the requirements. We used the feature engineering techniques described in the previous section.
We used a Random Forest Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins. We used a maximum depth of 10 so as to get explainable insights into the model.
Finally, we used a CatBoost Classifier model trained and tested using tenfold cross validation to classify the LoS into one of the bins.
We used three different regression models with the feature engineering techniques mentioned above (see the Feature encoding section). These comprise:
Linear regression
CatBoost regression
Random forest regression
The linear regression was implemented using the nn.Linear() function in the open source library PyTorch [ 63 ]. We used the ‘Adam’ optimization algorithm [ 64 ] in mini-batch settings to train the model weights for linear regression.
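A minimal sketch of linear regression with `nn.Linear` trained by Adam on mini-batches is shown below. The synthetic data, learning rate, batch size, and number of epochs are assumptions chosen for illustration and are not the settings used in the study.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (assumption): y = X w + 4 + small noise.
X = torch.randn(512, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ true_w + 4.0 + 0.01 * torch.randn(512, 1)

model = nn.Linear(in_features=3, out_features=1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)  # illustrative lr
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

for epoch in range(100):
    for xb, yb in loader:
        optimizer.zero_grad()            # reset gradients from the previous step
        loss = loss_fn(model(xb), yb)    # mean squared error on the mini-batch
        loss.backward()                  # backpropagate
        optimizer.step()                 # Adam update of weights and bias

with torch.no_grad():
    final_mse = loss_fn(model(X), y).item()
print(final_mse, model.weight.data, model.bias.data)
```

With one-hot or target-encoded inputs, the same structure applies; only `in_features` changes to match the width of the encoded feature matrix.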
We investigated CatBoost regression in order to create models with minimal feature sets, whereby models with a low number of input features would provide adequate results. Accordingly, we trained a CatBoost Regressor [ 65 ] in order to determine the relationship between combinations of features and the prediction accuracy as determined by the R² correlation score.
The random forest regression was implemented using the function RandomForestRegressor() in scikit learn [ 55 ].
For the regression models, we used the following metrics to compare the model performance.
The R² score and the p-value. We use a significance level of α = 0.05 (5%) for our statistical tests. If the p-value is small, i.e. less than α = 0.05, then the R² score is statistically significant.
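One way to compute these two quantities is sketched below. It assumes that the p-value comes from a Pearson correlation test between the actual and predicted LoS; the exact test is not specified in the text above, so this is an assumption. The synthetic predictions are likewise invented for illustration.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

# Synthetic stand-in values (assumption): noisy predictions of an integer LoS.
rng = np.random.default_rng(0)
actual = rng.integers(1, 30, size=200).astype(float)
predicted = actual + rng.normal(scale=3.0, size=200)

r2 = r2_score(actual, predicted)           # coefficient of determination
r, p_value = pearsonr(actual, predicted)   # correlation and its two-sided p-value

ALPHA = 0.05
significant = p_value < ALPHA
print(f"R2={r2:.3f}, r={r:.3f}, p={p_value:.2e}, significant={significant}")
```

Note that `r2_score` can be negative for a poor model, whereas the squared Pearson correlation cannot, so the two quantities should not be used interchangeably.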
For classifier models, we used the following metrics to compare the model performance.
True positive rate, false negative rate, and F1 score [ 66 ].
We computed the Brier score using Brier’s original calculation in his paper [ 67 ]. In this formulation, for R classes the Brier score B can vary between 0 and R, with 0 being the best score possible.
\(B=\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{R}{\left({\widehat{y}}_{i,c}-{I}_{i,c}\right)}^{2}\)
where N is the number of samples, \({\widehat{y}}_{i,c}\) is the class probability as per the model and \({I}_{i,c}=1\) if the i th sample belongs to class c and \({I}_{i,c}=0\) if it does not belong to class c .
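A short NumPy sketch of the multi-class Brier score in Brier’s original form (the per-sample sum of squared differences between predicted class probabilities and the one-hot class indicator, averaged over samples) is given below. The function name `brier_score_original` and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def brier_score_original(probs, labels):
    """Mean over samples of sum_c (y_hat[i, c] - I[i, c])**2."""
    probs = np.asarray(probs, dtype=float)
    n_samples, n_classes = probs.shape
    # One-hot indicator I[i, c] = 1 if sample i belongs to class c, else 0.
    indicator = np.zeros_like(probs)
    indicator[np.arange(n_samples), labels] = 1.0
    return float(np.mean(np.sum((probs - indicator) ** 2, axis=1)))

# Perfectly confident and correct predictions give the best score of 0.
perfect = brier_score_original([[1, 0, 0], [0, 1, 0]], [0, 1])
# A uniform guess over three classes for a sample of class 0.
uniform = brier_score_original([[1 / 3, 1 / 3, 1 / 3]], [0])
print(perfect, uniform)
```

For the uniform guess, the score is (2/3)² + 2 × (1/3)² = 2/3.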
We used the Delong test [ 68 ] to compare the AUC for different classifiers.
These metrics will allow other researchers to replicate our study and provide benchmarks for future improvements.
In this section we present the results of applying the techniques in the Methods section.
We provide descriptive statistics that help the reader understand the distributions of the variables of interest.
Table 1 summarizes basic statistical properties of the LoS variable.
Figure 5 shows the distribution of the LoS variable for newborns.
This figure depicts the distribution of the LoS variable for newborns
Table 2 shows the top 20 APR DRG descriptions based on their frequency of occurrence in the dataset.
Figure 6 shows the distribution of the LoS variable for the top 20 most frequently occurring APR DRG descriptions shown in Table 2 .
A 3-d plot showing the distribution of the LoS for the top-20 most frequently occurring APR DRG descriptions. The x-axis (horizontal) depicts the LoS, the y-axis shows the APR DRG codes and the z-axis shows the density or frequency of occurrence of the LoS
We experimented with different encoding schemes for the categorical variables and for each encoding we examined different regression techniques. Our results are shown in Table 3 . We experimented with the three encoding schemes shown in the first column. The last row in the table shows a combination of one-hot encoding and target encoding, where the number of columns in the dataset is increased to accommodate one-hot encoded feature values for categorical variables.
We obtained the SHAP plots using a Random Forest Regressor trained with target-encoded features.
Figures 7 and 8 show the SHAP values plots obtained for the features in the newborn partition of the dataset. We find that the features, “APR DRG Code”, “APR Severity of Illness Code”, “Patient Disposition”, “CCS Procedure Code”, are very useful in predicting the LoS. For instance, high feature values for “APR Severity of Illness Code”, which are encoded by red dots have higher SHAP values than the blue dots, which correspond to low feature values.
SHAP Value plot for newborns
1-D SHAP plot, in order of decreasing feature importance: top to bottom (for non-newborns)
A similar interpretation can be applied to the features in the non-newborn partition of the dataset. We note that “Operating Certificate Number” is among the top-10 most important features in both the newborn and non-newborn partitions. This finding is discussed in the Discussion section.
From Fig. 9 , we observe that as the severity of illness code increases from 1 to 4, there is a corresponding increase in the SHAP values.
A 2-D plot showing the relationship between SHAP values for one feature, “APR Severity of Illness Code”, and the feature values themselves (non-newborns)
To further understand the relationship between the APR Severity of Illness code and the LoS, we created the plot in Fig. 10 . This shows that the most frequently occurring APR Severity of Illness code is 1 (Minor), and that the most frequently occurring LoS is 2 days. We provide this 2-D projection of the overall distribution of the multi-dimensional data as a way of understanding the relationship between the input features and the target variable, LoS.
A density plot showing the relationship between APR Severity of Illness Code and the LoS. The color scale on the right determines the interpretation of colors in the plot. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot
Similarly, Fig. 11 shows the relationship between the birth weight and the length of stay. The most common length of stay is two days.
A density plot showing the distribution of the birth weight values (in grams) versus the LoS. The colorbar on the right shows the interpretation of color values shown in the plot. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot
We obtained a classification accuracy of 46.98% using Multinomial Logistic Regression with tenfold cross-validation in the 5-class classification task for non-newborn cases. The confusion matrix in Fig. 12 shows that the highest density of correctly classified samples is in or close to the diagonal region. The regions where our model fails occur between adjacent classes, as can be inferred from the confusion matrix.
Confusion matrix for classification of non-newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers
For the newborn cases, we obtained a classification accuracy of 60.08% using a Random Forest Classification model with tenfold cross-validation in the 5-class classification task. The confusion matrix in Fig. 13 shows that the majority of data samples lie in or close to the diagonal region. The regions where our model does not do well occur between adjacent classes, as can be inferred from the confusion matrix.
Confusion matrix for classification of newborns. The number inside each square along the diagonal represents the number of correctly classified samples. The color is coded so lighter colors represent lower numbers
The density plot in Fig. 14 shows the relationship between the actual LoS and the predicted LoS. For a LoS of 2 days, the centroid of the predicted LoS cluster is between 2 and 3 days.
Shows the density plot of the predicted length of stay versus actual length of stay for the classifier model for non-newborns. We used a kernel density estimation with a Gaussian kernel [ 61 ] to generate the plot
A quantitative depiction of our model errors is shown in Fig. 15 . The values in Fig. 15 are interpreted as follows. Referring to the column for LoS = 2, the top row shows that 51% of the predicted LoS values for an actual stay of 2 days are also 2 days (zero error), that 23% of the predicted values for LoS equal to 2 days have an error of 1 day, and so on. The relatively high values in the top row indicate that the model is performing well, with an error of less than 1 day. There are relatively few instances of errors between 2 and 3 days (typically less than 10% of the values show up in this row). The only exception is for the class corresponding to LoS greater than 8 days. The truncation of the data to produce this class results in larger model errors specifically for this class.
Shows the distribution of correctly predicted LoS values for each class used in our model. Along the columns, we depict the different classes used in the model, consisting of LoS equal to 1, 2, 3 …8, and more than 8. Each row depicts different errors made in the prediction. For instance, the top row depicts an error of less than or equal to one day between the actual LoS and the predicted LoS. The second row from the top depicts an error which is greater than 1 and less than or equal to 2 days. And so on for the other rows, for non-newborns
Figures 16 and 17 show the scatter plots for the linear regression models. The exact line represents a line with slope 1, and a perfect model would be one that produced all points lying on this line.
Scatter plot showing an instance of a linear regression fit to the data (newborns). The R² score is 0.82. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)
Scatter plot for linear regression (non-newborns). The R² score is 0.42. The blue line represents an exact fit, where the predicted LoS equals the actual LoS (slope of the line is 1)
Figure 18 shows a density plot depicting the relationship between the predicted length of stay and the actual length of stay.
Shows the density plot of the predicted length of stay versus actual length of stay for the classifier model for non-newborns. We used a kernel density estimation with a Gaussian kernel [ 40 ] to generate the plot. The best fit regression line to our predictions is shown in green, whereas the blue line represents the ideal fit (line of slope 1, where actual LoS and predicted LoS are equal)
Most of the existing literature on LoS prediction is based on data for specific disease conditions such as cancer or cardiac disease. Hence, in order to understand which CCS diagnosis codes produce good model fits, we produced the plot in Fig. 19 .
This figure shows the three CCS diagnosis codes that produced the top three R² scores using linear regression. These are 101, 100 and 109. The three CCS Diagnosis codes that produced the lowest R² scores are 159, 657, and 659
We provide the following descriptions in Tables 4 and 5 for the 3 CCS Diagnosis Codes in Fig. 19 with the top R² scores using linear regression.
Similarly, the following table shows the 3 CCS Diagnosis Codes in Fig. 19 for the lowest R² scores using linear regression.
We trained a CatBoost Regressor [ 65 ] on the complete dataset in order to determine the relationship between combinations of features and the prediction accuracy as determined by the R² correlation score. This is shown in Fig. 20 .
The labels for each row on the left show combinations of different input features. A CatBoost regression model was developed using the selected combination of features. The R² correlation scores for each model are shown in the bar graph
We can infer from Fig. 20 that only four features (‘APR MDC Code’, ‘APR Severity of Illness Code’, ‘APR DRG Code’, ‘Patient Disposition’) are sufficient for the model to reach very close to its maximum performance. We obtain concordant results when using other regression models for the same experiment.
We used a random forest tree approach to generate the trees in Figs. 21 and 22 .
A random forest tree that represents a best-fit model to the data for newborns. With 4 levels of the decision tree, the R² score is 0.65
A random forest tree using only a tree of depth 3 that represents a best-fit model to the data for non-newborns. The R² score is 0.28. We can generate trees with greater depth that better fit the data, but we have shown only a depth of 3 for the sake of readability in the printed version of this paper. Otherwise, the tree would be too large to be legible on this page. The main point in this figure is to showcase the ease of interpretation of the working of the model through rules
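The rule-based reading of a shallow tree can be sketched with scikit-learn as follows. As a simplifying assumption, a single `DecisionTreeRegressor` of depth 3 stands in for one tree of a random forest, and the data are synthetic values invented for illustration, not SPARCS records.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (assumption): low birth weight implies a longer stay.
rng = np.random.default_rng(0)
birth_weight = rng.uniform(500, 4500, size=1000)   # grams
severity = rng.integers(1, 5, size=1000)           # codes 1..4
los = (np.where(birth_weight < 2000, 20.0, 2.5)
       + 1.5 * (severity - 1)
       + rng.normal(scale=0.5, size=1000))

X = np.column_stack([birth_weight, severity])
feature_names = ["Birth Weight", "APR Severity of Illness Code"]

# A depth-3 tree keeps the rule set small enough to read.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, los)

# Print the tree as nested if/else rules with the predicted LoS at each leaf.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each printed path from the root to a leaf is a rule of the form "if feature ≤ threshold, then predicted LoS = value", which is the kind of output that can be inspected by a non-specialist.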
We used tenfold cross validation to determine the regression scores. The results are summarized in Tables 6 and 7 .
We computed the multi-class classifier metrics for logistic regression, using one-hot encoding for non-newborns. The results are presented in Table 8 . The first row represents the accuracy of the classifier when Class 0 is compared against the rest of the classes. A similar interpretation applies to the other rows in the table, i.e., one-versus-rest. The macro average gives the balanced recall and precision, and the resulting F1 score. The weighted average gives a support-weighted (number of samples) average of the individual class metrics. The overall accuracy is computed by dividing the number of accurate predictions (49,686) by the total number of samples (105,932), which yields a value of 0.47.
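The difference between the macro and weighted averages can be sketched with scikit-learn as follows. The toy labels are invented for illustration; only the two counts in `reported_accuracy` are taken from the text above.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels for a 3-class problem (assumption, not study data).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 2, 0]

accuracy = accuracy_score(y_true, y_pred)

# Macro: unweighted mean of the per-class metrics.
macro = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
# Weighted: per-class metrics averaged with weights equal to class support.
weighted = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print("accuracy", accuracy)
print("macro    (precision, recall, F1):", macro[:3])
print("weighted (precision, recall, F1):", weighted[:3])

# Overall accuracy from the counts quoted in the text.
reported_accuracy = 49_686 / 105_932
print(round(reported_accuracy, 2))
```

The support-weighted recall always equals the overall accuracy, whereas the macro average treats every class equally regardless of how many samples it contains.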
For the category of non-newborns, Fig. 23 provides a graphical plot that visualizes the ROC curves for the different multiclass classifiers we developed.
This figure applies to data concerning non-newborns. We show the multiclass ROC curves for the performance of the catboost classifier for the different classes shown. The area under the ROC curve is 0.7844
In Table 9 we compare the performance of our multiclass classifier using logistic regression developed on 2016 SPARCS data against 2017 SPARCS data.
In order to compare the performance of the different classifiers, we computed the AUC measures reported in Table 10 . Figure 24 visualizes the data in Table 10 and Fig. 25 visualizes the data in Table 11 . In Tables 12 and 13 we report the results of computing the Delong test for non-newborns and newborns respectively. In Tables 14 and 15 we report the results of computing the Brier scores for non-newborns and newborns respectively.
A bar chart that depicts the data in Table 10 for non-newborns
A bar chart that depicts the data in Table 11
In Table 16 we present the parameter and hyperparameter values used in the different models.
Due to space restrictions, we show additional results in the Appendix/Supplementary Material. These results are in tabular form and describe the R² scores for different segmentations of the variables in the dataset, e.g. according to age group, severity of illness code, etc.
The most significant result we obtain is shown in Figs. 21 and 22 , which provide an interpretable view of the working of the decision trees using random forest modeling. Figure 21 for newborns shows that the birth weight features prominently in the decision tree, occurring at the root node. Low birth weights are represented on the left side of the tree and are typically associated with longer hospital stays. Higher birth weights occur on the right side of the tree, and the node in the bottom row with 189,574 samples shows that the most frequently occurring predicted stay is 2.66 days. Figure 22 for non-newborns shows that the features of “APR DRG Code”, “APR Severity of Illness Code” and “Patient Disposition” are the most important top-level features to predict the LoS. This provides a relatively simple rule-based model, which can be easily interpreted by healthcare providers as well as patients. For instance, the right-most branch of the tree classifies the input data into a relatively high LoS (46 days) when the APR DRG Code is greater than 813.55 and the APR Severity of Illness Code is less than 91.
The results in Fig. 19 and Table 4 show that if we restrict our model to specific CCS Diagnosis descriptions such as “coronary atherosclerosis and other heart disease”, we obtain a good R² score of 0.62. The objective of our work is not to cherry-pick CCS Diagnosis codes that produce good results, but rather to develop a single model for the entire SPARCS dataset to obtain a birds-eye perspective. For future work, we can explicitly build separate models for each CCS Diagnosis code, and that could have relevance to specific medical specialties, such as cardiovascular care.
Similarly, the results in Fig. 19 and Table 5 show that there are CCS Diagnosis codes corresponding to schizophrenia and mood disorders that produce a poor model fit. Factors that contribute to this include the type of data in the SPARCS dataset, where information about patient vitals, medications, or a patient’s income level is not provided, and the inherent variability in treating schizophrenia and mood disorders. Baeza et al. [ 69 ] identified several variables that affect the LoS in psychiatric patients, which include psychiatric admissions in the previous years, psychiatric rating scale scores, history of attempted suicide, and not having sufficient income. Such variables are not provided in the SPARCS dataset. Hence a policy implication is to collect and make such data available, perhaps as a separate dataset focused on mental health issues, which have proven challenging to treat.
Figures 16 and 17 show that a better regression fit is obtained when a specific CCS Diagnosis code is used to build the model, such as “Newborn” in Fig. 16 . To put these results in context, we note that it is difficult to obtain a high R² value for healthcare datasets in general, and especially for large numbers of patient samples that span multiple hospitals. For instance, Bertsimas [ 70 ] reported an R² value of 0.2 and Kshirsagar [ 71 ] reported an R² value of 0.33 for similar types of prediction problems as studied in this paper.
Further details for a segmentation of R² scores by the different variable categories are shown in the Appendix/Supplementary Material section. For instance, the table corresponding to Age Groups shows that there is close agreement between the mean of the predicted LoS from our model and the actual LoS. Furthermore, the mean LoS increases steadily from 4.8 days for Age group 0–17 to 6.4 days for ages 70 or older. A discussion of these tables is outside the scope of this paper. However, they are being provided to help other researchers form hypotheses for further investigations or to find supporting evidence for ongoing research.
Table 3 shows that the best encoding scheme is to combine target encoding with one-hot encoding and then apply linear regression. This produces an R² score of 0.42 for the non-newborn data, which is the best fit we could obtain. This table also shows that significant improvements can be obtained by exploring the search space which consists of different strategies of feature encoding and regression methods. There is no theoretical framework which determines the optimum choice, and the best method is to conduct an experimental search. An important contribution of the current paper is to explore this search space so that other researchers can use and build upon our methodology.
The distribution of errors in Fig. 15 shows that the truncation we employed at a LoS of 8 days produces artifacts in the prediction model as all stays of greater than 8 days are lumped into one class. Nevertheless, the distribution of LoS values in Fig. 4 shows that a relatively small number of data samples have LoS greater than 8 days. In the future, we will investigate different truncation levels, and this is outside the scope of the current paper. By using our methodology, the truncation level can also be tuned by practitioners in the field, including hospital administrators and other researchers.
Our results in Fig. 7 show that certain features are not useful in predicting the LoS. The SHAP plot shows that features such as race, gender, and ethnicity are not useful in predicting the LoS. It would have been interesting if this were not the case, as that implies that there is systemic bias based on race, gender or ethnicity. For instance, a person with a given race may have a smaller LoS based on their demographic identity. This would be unacceptable in the medical field. It is satisfying to see that a large and detailed healthcare dataset does not show evidence of bias.
To place this finding in context, racial bias is an important area of research in the U.S., especially in fields such as criminology and access to financial services such as loans. In the U.S., it is well known that there is a disproportional imprisonment of black and Hispanic males [ 72 ]. Researchers working on criminal justice have determined that there is racial bias in the process of sentencing and granting parole, with blacks being adversely affected [ 73 ]. This bias is reinforced through any algorithms that are trained on the underlying data. There is evidence that banks discriminate against applicants for loans based on their race or gender [ 74 ].
This does not appear to be the case in our analysis of the SPARCS data. Though we did not specifically investigate the issue of racial bias in the LoS, the feature analysis we conducted automatically provides relevant answers. Other researchers including those in the U.K. [ 21 ] have also determined that gender does not have an effect on LoS or costs. Hence the results in the current paper are consistent with the findings of other researchers in other countries working on entirely different datasets.
From Table 6 we see that in the case of data concerning non-newborns, the catboost regression performs the best, with an R² score of 0.432. The p-value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through catboost regression is statistically significant. Similarly, the p-values for linear regression and random forest regression indicate that these models produce predictions that are statistically significant, i.e. they did not occur by random chance.
From Table 7 that refers to data from newborns, the linear regression performs the best, with an R² score of 0.82. The p-value is less than 0.01, indicating that the correlation between the actual and predicted values of LoS through linear regression is statistically significant. Similarly, the p-values for random forest regression and catboost regression indicate that these models produce predictions that are statistically significant.
We examine the performance of classifiers on non-newborn data, as shown in Tables 10 and 12 . The Delong test conducted in Table 12 shows that there is a statistically significant difference between the AUCs of the pairwise comparisons of the models. Hence, we conclude that the catboost classifier performs the best with an average AUC of 0.7844. We also note that there is a marginal improvement in performance when we use the catboost classifier instead of the random forest classifier. Both the catboost classifier and the random forest classifier perform better than logistic regression. We conclude that the best performing model for non-newborns is the catboost classifier, followed by the random forest classifier, and then logistic regression.
In the case of newborn data, we examine the performance of the classifiers as shown in Tables 11 and 13 . From Table 13 , we note that the p-values in all the rows are less than 0.05, except for the binary class “one vs. rest for class 3”, random forests vs. catboost. Hence, for this particular comparison between the random forest classifier and the catboost classifier for “one vs. rest for class 3”, we cannot conclude that there is a statistically significant difference between the performance of these two classifiers. From Table 11 we observe that the AUCs of these two classifiers are very similar. We also note that only about 10% of the dataset consists of newborn cases.
From Table 14 we note that the Brier score for the catboost classifier is the lowest. A lower Brier score indicates better performance. According to the Brier scores for the non-newborn data, the catboost classifier performs the best, followed by the random forest classifier and then logistic regression. Table 15 shows that for newborns, the random forest classifier performs the best, followed by the catboost classifier and logistic regression. The performance of the random forest classifier and catboost classifier are very similar.
From a practical perspective, it may make sense to use a catboost classifier on both newborn and non-newborn data as it simplifies the processing pipeline. The ultimate decision rests with the administrators and implementers of these decision systems in the hospital environment.
Burn et al. observe [ 21 ] that though the U.S. has reported similar declines in LoS as in the U.K., the overall costs of joint replacement have risen. The U.K. government created policies to encourage the formation of specialist centers for joint replacement, which have resulted in a reduction in the LoS as well as cost reductions. The results and analysis presented in our current paper can help educate patients and healthcare consumers about trends in healthcare costs and how they can be reduced. An informed and educated electorate can press their elected representatives to make changes to the healthcare system to benefit the populace.
Hachesu et al. examined the LoS for cardiac disease patients [ 22 ] where they used data from around 5000 patients and considered 35 input variables to build a predictive model. They found that the LoS was longer in patients with high blood pressure. In contrast, our method uses data from 2.5 million patients and considers multiple disease conditions simultaneously. We also do not have access to patient vitals such as blood pressure measurements, due to the limitation of the existing New York State SPARCS data.
Garcia et al. [ 23 ] conducted a study of elderly patients (age greater than 60) to understand factors governing the LoS for hip fracture treatment. They used 660 patient records and determined that the most significant variable was the American Society of Anesthesiologists (ASA) classification system. The ASA score ranges from 1–5 and captures the anesthesiologist’s impression of a patient’s health and comorbidities at the time of surgery. Garcia et al. showed a monotonically increasing relationship between the ASA score and the LoS. However, they did not build a specific predictive model. Their work shows that it is possible to find single variables with significant information content in order to estimate the LoS. The New York SPARCS dataset that we used does not contain the ASA score. Hence a policy implication of our research is to alert the healthcare authorities to include variables such as the ASA score, where relevant, in datasets released in the future. The additional storage required is very small (one additional byte per patient record).
Arjannikov et al. [ 25 ] developed predictive models by binarizing the data into two categories, e.g. LoS ≤ 2 days or LoS > 2 days. In our work, we did not employ such a discretization. In contrast, we used continuous regression techniques as well as classification into more than two bins. It is preferable to stay as close to the actual data as possible.
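The difference between a binarized target and a multi-bin target can be sketched as follows. This is only an illustration: the function names and the bin edges (2, 5, and 10 days) are assumptions chosen for the example and are not the bins used in our experiments.

```python
from bisect import bisect_left


def binarize_los(los_days, threshold=2):
    """Two-class target: 0 if LoS <= threshold days, 1 otherwise."""
    return 0 if los_days <= threshold else 1


def bin_los(los_days, upper_edges=(2, 5, 10)):
    """Multi-class target: index of the first bin whose upper edge is >= LoS.

    Stays longer than the last edge fall into a final open-ended bin.
    The edges are hypothetical and must be sorted in increasing order.
    """
    return bisect_left(upper_edges, los_days)


stays = [1, 2, 3, 7, 30]  # hypothetical lengths of stay in days
binary_labels = [binarize_los(d) for d in stays]
multi_labels = [bin_los(d) for d in stays]
```

With the binary target, stays of 3, 7, and 30 days all receive the same label, whereas the multi-bin target keeps them apart; a regression target would retain the day counts themselves.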
Almashrafi et al. [ 27 ] and Cots et al. [ 75 ] observed that larger hospitals tended to have longer LoS for patients undergoing cardiac surgery. Though we did not specifically examine cardiac surgery outcomes, our feature analysis indicated that the hospital operating certificate number had lower relevance than other features such as DRG codes. Nevertheless, the SHAP plots in Fig. 7 and Fig. 8 show that the hospital operating certificate number occurs within the top 10 features in order of SHAP values. We will investigate this relationship in more detail in future research, as it requires determining the size of the hospital from the operating certificate number and creating an appropriate machine-learning model. The Appendix contains results that show certain operating certificate numbers that produce a good model fit to the data.
A major focus of our research is on building interpretable and explainable models. Based on the principle of parsimony, it is preferable to utilize models which involve fewer features. This will provide simpler explanations to healthcare professionals as well as patients. We have shown through Fig. 20 that a model with five features performs just as well as a model with seven features. These features also make intuitive sense and the model’s operation can be understood by both patients and healthcare providers.
Patients in the U.S. increasingly have to pay for medical procedures out-of-pocket as insurance payments do not cover all the expenses, leading to unexpectedly large bills [ 76 ]. Many patients in the U.S. also do not possess health insurance, with the consequence that they are charged the highest prices [ 77 ]. Kullgren et al. observe that patients in the U.S. need to be discerning healthcare consumers [ 78 ], as they can optimize the value they receive from out-of-pocket spending. In addition to estimating the cost of medical procedures, patients will also benefit from estimating the expected duration for a procedure such as joint replacement. This will allow them to budget adequate time for their medical procedures. Patients and consumers will benefit from obtaining estimates from an unbiased open data source such as New York State SPARCS and the use of our model.
Other researchers have developed specific LoS models for particular health conditions, such as cardiac disease [ 22 ], hip replacement [ 21 ], cancer [ 26 ], or COVID-19 [ 24 ]. In addition, researchers typically assume a prior statistical distribution for the outcomes, such as a Weibull distribution [ 24 ]. However, we have not made any assumptions of specific prior statistical distributions, nor have we restricted our analysis to specific diseases. Consequently, our model and techniques should be more widely applicable, especially in the face of rapidly changing disease trajectories worldwide.
Our study is based exclusively on freely available open health data. Consequently, we cannot control the granularity of the data and must use the data as-is. We are unable to obtain more detailed patient information, such as physiological variables (e.g. blood pressure and heart rate variability) at the time of admittance and during the stay. Hospitals, healthcare providers, and insurers have access to this data. However, there is no mandate for them to make this available to researchers outside their own organizations. Sometimes they sell de-identified data to interested parties such as pharmaceutical companies [ 79 ]. Due to the high costs involved in purchasing this data, researchers worldwide, especially in developing countries, are at a disadvantage in developing AI algorithms for healthcare.
There is growing recognition that medical researchers need to standardize data formats and tools used for their analysis, and share them openly. One such effort is the organization for Observational Health Data Sciences and Informatics (OHDSI) as described in [ 80 ].
Twitter has demonstrated an interesting path forward, where a small percentage of its data was made available freely to all users for non-commercial purposes through an API [ 81 ]. Recently, Twitter has made a larger proportion of its data available to qualified academic researchers [ 82 ]. In the future, the profit motives of companies need to be balanced with considerations for the greater public good. An advantage of using the Twitter model is that it spurs more academic research and allows universities to train students and the workforce of the future on real-world and relevant datasets.
In the U.S., a new law went into effect in January 2021 requiring hospitals to make pricing data available publicly. The premise is that having this data would provide better transparency into the working of the healthcare system in the U.S. and lead to cost efficiencies. However, most hospitals are not in compliance with this law [ 83 ]. Concerted efforts by government officials as well as pressure by the public will be necessary to achieve compliance. If the eventual release of such data is not accompanied by a corresponding interest shown by academicians, healthcare researchers, policymakers, and the public, it is likely that the very premise of the utility of this data will be called into question. Furthermore, merely dumping large quantities of data into the public domain is unlikely to benefit anyone. Hence research efforts such as the one presented in this paper will be valuable in demonstrating the utility of this data to all stakeholders.
Our machine-learning pipeline can easily be applied to new data that will be released periodically by New York SPARCS, and also to hospital pricing data [ 83 ]. Due to our open-source methodology, other researchers can easily extend our work and apply it to extract meaning from open health data. This improves reproducibility, which is an essential aspect of science. We will make our code available on GitHub to interested researchers for non-commercial purposes.
Our models are restricted to the data available through New York State SPARCS, which does not provide detailed information about patient vitals. More detailed physiological data is available through the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) framework [ 84 ], though for a smaller number of patients. We plan to extend our methodology to handle such data in the future. Another limitation of our study is that it does not account for patient co-morbidities. This arises from the de-identification process used to release the SPARCS data, where patient information is removed. Hence we are unable to analyze multiple hospital admissions for a given patient, possibly for different conditions. The main advantage of our approach is that it uses large-scale population data (2.3 million patients) but at a coarse level of granularity, where physiological data is not available. Nevertheless, our approach provides a high-level view of the operation of the healthcare system, which provides valuable insights.
There is growing interest in using data analytics to increase government transparency and inform policymaking. It is expected that the meaning and insights gained from such evidence-based analysis will translate to better policies and optimal usage of the available infrastructure. This requires cooperation between computer scientists, domain experts, and policy makers. Open healthcare data is especially valuable in this context due to its economic significance. This paper presents an open-source analytics system to conduct evidence-based analysis on openly available healthcare data.
The goal is to develop interpretable machine learning models that identify key drivers and make accurate predictions related to healthcare costs and utilization. Such models can provide actionable insights to guide healthcare administrators and policy makers. A specific illustration is provided via a robust machine learning pipeline that predicts hospital length of stay across 285 disease categories based on 2.3 million de-identified patient records. The length of stay is directly related to costs.
We focused on the interpretability and explainability of input features and the resulting models. Hence, we developed separate models for newborns and non-newborns, given differences in input features. The best performing model for non-newborn data was catboost regression, which achieved an R² score of 0.43. The best performing model for newborns was linear regression, which achieved an R² score of 0.82. Key newborn predictors included birth weight, while non-newborn models relied heavily on the diagnostic related group classification. This demonstrates model interpretability, which is important for adoption. There is an opportunity to further improve performance for specific diseases. If we restrict our analysis to cardiovascular disease, we obtain an improved R² score of 0.62.
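For readers less familiar with the R² score quoted above, the following minimal sketch shows how it is computed from observed and predicted lengths of stay: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The function name `r_squared` and the toy values are assumptions for illustration and do not come from our experiments.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted lengths of stay, in days.
observed = [2, 4, 6, 8]
predicted = [3, 3, 7, 7]
score = r_squared(observed, predicted)

# A model that always predicts the mean stay explains none of the variance.
baseline = r_squared(observed, [5, 5, 5, 5])
```

A score of 0.82 therefore means that the model accounts for 82% of the variance in length of stay around its mean, whereas a score of 0 corresponds to always predicting the mean.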
The presented approach has several desirable qualities. Firstly, transparency and reproducibility are enabled through the open-source methodology. Secondly, the model generalizability facilitates insights across numerous disease states. Thirdly, the technical framework can easily integrate new data while allowing modular extensions by the research community. Lastly, the evidence generated can readily inform multiple key stakeholders including healthcare administrators planning capacity, policy makers optimizing delivery, and patients making medical decisions.
Data is publicly available at the website mentioned in the paper, https://www.health.ny.gov/statistics/sparcs/
There is an “About Us” tab in the website which contains all the contact details. The authors have nothing to do with this website as it is maintained by New York State.
Gurría A. Openness and Transparency - Pillars for Democracy, Trust and Progress. OECD.org. Available: https://www.oecd.org/unitedstates/opennessandtransparency-pillarsfordemocracytrustandprogress.htm . Accessed 28 June 2024.
Jetzek T. The Sustainable Value of Open Government Data: Uncovering the Generative Mechanisms of Open Data through a Mixed Methods Approach. Copenhagen Business School, Institut for IT-Ledelse Department of IT Management. 2015.
Move fast and heal things: How health care is turning into a consumer product. The Economist. 2022. https://www.economist.com/business/how-health-care-is-turning-into-a-consumer-product/21807114 . Accessed 28 June 2024.
New York State Department Of Health, Statewide Planning and Research Cooperative System (SPARCS). https://www.health.ny.gov/statistics/sparcs/ . Accessed 5 Oct 2022.
Rao AR, Chhabra A, Das R, Ruhil V. A framework for analyzing publicly available healthcare data. In 2015 17th International Conference on E-health Networking, Application & Services (IEEE HealthCom). 2015: IEEE, pp. 653–656.
Rao AR, Clarke D. A fully integrated open-source toolkit for mining healthcare big-data: architecture and applications. In IEEE International Conference on Healthcare Informatics ICHI, Chicago. 2016: IEEE, pp. 255–261.
Rao AR, Garai S, Dey S, Peng H. PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data. SN Computer Science. 2021;2(6):1–22.
Rao AR, Rao S, Chhabra R. Rising mental health incidence among adolescents in Westchester, NY. Community Ment Health J. 2021:1–1.
Boylan J F. My $145,000 Surprise Medical Bill. New York Times. 2020. https://www.nytimes.com/2020/02/19/opinion/surprise-medical-bill.html . Accessed 28 June 2024.
Peterson K, Bykowicz J. Congress Debates Push to End Surprise Medical Billing. Wall Street J. 2020. https://www.wsj.com/articles/congress-debates-push-to-end-surprise-medical-billing-11589448603 . Accessed 28 June 2024.
Wang S, Zhang J, Fu Y, Li Y. ACM TIST Special Issue on Deep Learning for Spatio-Temporal Data: Part 1. 12th ed. NY: ACM New York; 2021. p. 1–3.
Jones R. lining length of stay and future bed numbers. BJHCM. 2015;21(9):440–1.
Daghistani TA, Elshawi R, Sakr S, Ahmed AM, Al-Thwayee A, Al-Mallah MH. Predictors of in-hospital length of stay among cardiac patients: a machine learning approach. Int J Cardiol. 2019;288:140–7.
Sen-Crowe B, Sutherland M, McKenney M, Elkbuli A. A closer look into global hospital beds capacity and resource shortages during the COVID-19 pandemic. J Surg Res. 2021;260:56–63.
Stone K, Zwiggelaar R, Jones P, Mac Parthaláin N. A systematic review of the prediction of hospital length of stay: Towards a unified framework. PLOS Digital Health. 2022;1(4):e0000017.
Lequertier V, Wang T, Fondrevelle J, Augusto V, Duclos A. Hospital length of stay prediction methods: a systematic review. Med Care. 2021;59(10):929–38.
Sridhar S, Whitaker B, Mouat-Hunter A, McCrory B. Predicting Length of Stay using machine learning for total joint replacements performed at a rural community hospital. PLoS ONE. 2022;17(11):e0277479.
CCS (Clinical Classifications Software) - Synopsis. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CCS/index.html . Accessed 13 Jan 2022.
Sotoodeh M, Ho JC. Improving length of stay prediction using a hidden Markov model. AMIA Summits on Translational Science Proceedings. 2019;2019:425.
Ma F, Yu L, Ye L, Yao DD, Zhuang W. Length-of-stay prediction for pediatric patients with respiratory diseases using decision tree methods. IEEE J Biomed Health Inform. 2020;24(9):2651–62.
Burn E, et al. Trends and determinants of length of stay and hospital reimbursement following knee and hip replacement: evidence from linked primary care and NHS hospital records from 1997 to 2014. BMJ Open. 2018;8(1):e019146.
Hachesu PR, Ahmadi M, Alizadeh S, Sadoughi F. Use of data mining techniques to determine and predict length of stay of cardiac patients. Healthcare informatics research. 2013;19(2):121–9.
Garcia AE, et al. Patient variables which may predict length of stay and hospital costs in elderly patients with hip fracture. J Orthop Trauma. 2012;26(11):620–3.
Vekaria B, et al. Hospital length of stay for COVID-19 patients: Data-driven methods for forward planning. BMC Infect Dis. 2021;21(1):1–15.
Arjannikov T, Tzanetakis G. An empirical investigation of PU learning for predicting length of stay. In 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI). 2021: IEEE, pp. 41–47.
Gupta D, Vashi PG, Lammersfeld CA, Braun DP. Role of nutritional status in predicting the length of stay in cancer: a systematic review of the epidemiological literature. Ann Nutr Metab. 2011;59(2–4):96–106.
Almashrafi A, Elmontsri M, Aylin P. Systematic review of factors influencing length of stay in ICU after adult cardiac surgery. BMC Health Serv Res. 2016;16(1):318.
Kalgotra P, Sharda R. When will I get out of the hospital? Modeling Length of Stay using Comorbidity Networks. J Manag Inf Syst. 2021;38(4):1150–84.
Awad A, Bader-El-Den M, McNicholas J. Patient length of stay and mortality prediction: a survey. Health Serv Manage Res. 2017;30(2):105–20.
Editorial-Board. The Lancet, HCL and Trump. Wall Street J. 2020. https://www.wsj.com/articles/the-lancet-hcl-and-trump-11591226880 . Accessed 28 June 2024.
Servick K, Enserink M. A mysterious company’s coronavirus papers in top medical journals may be unraveling. Science. 2020. https://www.science.org/content/article/mysterious-company-s-coronavirus-papers-top-medical-journals-may-be-unraveling . Accessed 28 June 2024.
Gabler E, Rabin RC. The Doctor Behind the Disputed Covid Data. New York Times. 2020. https://www.nytimes.com/2020/07/27/science/coronavirus-retracted-studies-data.html . Accessed 28 June 2024.
Lancet-Editors. Expression of concern: Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis. 2020;395:10240. https://www.science.org/content/article/mysterious-company-s-coronavirus-papers-topmedical-journals-may-be-unraveling . Accessed 28 June 2024.
Editorial-Board. Expression of Concern: Mehra MR et al. Cardiovascular Disease, Drug Therapy, and Mortality in Covid-19. N Engl J Med. 2020. https://www.nejm.org/doi/full/10.1056/NEJMoa2007621 . Accessed 28 June 2024.
Hopkins JS, Gold R. Authors Retract Studies That Found Risks of Using Antimalaria Drugs Against Covid-19. Wall Street J. 2020. https://www.wsj.com/articles/authors-retract-study-that-found-risks-of-using-antimalaria-drug-against-covid-19-11591299329 . Accessed 28 June 2024.
https://www.thelancet.com/pdfs/journals/lancet/PIIS0140-6736(20)31180-6.pdf . Accessed 9 Jan 2022.
Wolfensberger M, Wrigley A. Trust in Medicine. Cambridge University Press. 2019. ISBN-13: 978-1108487191.
Bhattacharya J, Nicholson T. A Deceptive Covid Study, Unmasked. Wall Street J. 2022. https://www.wsj.com/articles/deceptive-covid-study-unmasked-abc-misleading-omicron-north-carolina-students-duke-mask-test-to-stay-11641933613 . Accessed 28 June 2024.
Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4.
Begley CG, Ioannidis JP. Reproducibility in science: improving the standard for basic and preclinical research. Circ Res. 2015;116(1):116–26.
Eisner D. Reproducibility of science: Fraud, impact factors and carelessness. J Mol Cell Cardiol. 2018;114:364–8.
Wang F, Kaushal R, Khullar D. Should health care demand interpretable artificial intelligence or accept “black box” medicine? Am College Phys. 2020;172:59–60.
Reyes M, et al. On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol Art Intell. 2020;2(3):e190043.
Savadjiev P, et al. Demystification of AI-driven medical image interpretation: past, present and future. Eur Radiol. 2019;29(3):1616–24.
McKinney W. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc.; 2012.
Pedregosa F, et al. Scikit-learn: Machine learning in Python. J Machine Learn Res. 2011;12:2825–30.
Cass S. The top programming languages: Our latest rankings put Python on top-again-[Careers]. IEEE Spectr. 2020;57(8):22–22.
Tjoa E, Guan C. A survey on explainable artificial intelligence (xai): Toward medical xai. IEEE Transactions on Neural Networks and Learning Systems. 2020.
https://www.health.ny.gov/statistics/sparcs/docs/sparcs_data_dictionary.xlsx . Accessed 28 June 2024.
Design and development of the Diagnosis Related Group (DRG). https://www.cms.gov/icd10m/version37-fullcode-cms/fullcode_cms/Design_and_development_of_the_Diagnosis_Related_Group_(DRGs).pdf . Accessed 5 Oct 2022.
ARTICLE 28, Hospitals, Public Health (PBH) CHAPTER 45. 2023. Available: https://www.nysenate.gov/legislation/laws/PBH/A28 . Accessed 28 June 2024.
Gilmore‐Bykovskyi A, et al. Disparities in 30‐day readmission rates among Medicare enrollees with dementia. J Am Geriatr Soc. 2023.
Rodríguez P, Bautista MA, Gonzalez J, Escalera S. Beyond one-hot encoding: Lower dimensional target embedding. Image Vis Comput. 2018;75:21–31.
Montgomery DC, Peck EA, Vining GG. Introduction to linear regression analysis. 6th ed. John Wiley & Sons; 2021. ISBN-13 978-1119578727.
Random forest regressor in sklearn. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html . Accessed 28 June 2024.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003;43(6):1947–58.
Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.
Böhning D. Multinomial logistic regression algorithm. Ann Inst Stat Math. 1992;44(1):197–200.
Vaid A, et al. Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients With COVID-19 in New York City: Model Development and Validation. J Med Internet Res. 2020;22(11):e24018.
Density Estimation. https://scikit-learn.org/stable/modules/density.html . Accessed 5 Oct 2022.
CatBoost, a high-performance open source library for gradient boosting on decision trees. Available: https://catboost.ai/ and https://catboost.ai/en/docs/concepts/python-usages-examples . Accessed 28 June 2024.
PyTorch documentation for torch.nn, the basic building blocks for graphs. Available: https://pytorch.org/docs/stable/nn.html . Accessed 28 June 2024.
Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. arXiv preprint arXiv:1706.09516. 2017.
Tharwat A. Classification assessment methods. Applied computing and informatics. 2020;17(1):168–92.
Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3.
DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988:837–45.
Baeza FL, da Rocha NS, Fleck MP. Predictors of length of stay in an acute psychiatric inpatient facility in a general hospital: a prospective study. Brazilian Journal of Psychiatry. 2017;40:89–96.
Bertsimas D, et al. Algorithmic prediction of health-care costs. Oper Res. 2008;56(6):1382–92.
Kshirsagar R. Accurate and Interpretable Machine Learning for Transparent Pricing of Health Insurance Plans. Presented at the AAAI 2021 Conference. 2021.
Ulmer J, Painter-Davis N, Tinik L. Disproportional imprisonment of Black and Hispanic males: Sentencing discretion, processing outcomes, and policy structures. Justice Q. 2016;33(4):642–81.
Angwin J, Larson J, Mattu S, Kirchner L. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica. 2016.
Steil JP, Albright L, Rugh JS, Massey DS. The social structure of mortgage discrimination. Hous Stud. 2018;33(5):759–76.
Cots F, Mercadé L, Castells X, Salvador X. Relationship between hospital structural level and length of stay outliers: Implications for hospital payment systems. Health Policy. 2004;68(2):159–68.
Evans M, McGinty T. Hospital Prices Are Arbitrary. Just Look at the Kingsburys’ $100,000 Bill. Wall Street J. 2021. https://www.wsj.com/articles/hospital-prices-arbitrary-healthcare-medical-bills-insurance-11635428943 . Accessed 28 June 2024.
Evans M. Hospitals Often Charge Uninsured People the Highest Prices, New Data Show. Wall Street J. 2021. https://www.wsj.com/articles/hospitals-often-charge-uninsured-people-the-highest-prices-new-data-show-11625584448 . Accessed 28 June 2024.
Kullgren JT, et al. A survey of Americans with high-deductible health plans identifies opportunities to enhance consumer behaviors. Health Aff. 2019;38(3):416–24.
Wetsman N. Hospitals are selling treasure troves of medical data — what could go wrong? The Verge. 2021. Available: https://www.theverge.com/2021/6/23/22547397/medical-records-health-data-hospitals-research . Accessed 28 June 2024.
Hripcsak G, et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 2015;216:574–8.
Gabarron E, Dorronzoro E, Rivera-Romero O, Wynn R. Diabetes on Twitter: a sentiment analysis. J Diabetes Sci Technol. 2019;13(3):439–44.
Statt N. Twitter is opening up its full tweet archive to academic researchers for free. The Verge. 2021. Available: https://www.theverge.com/2021/1/26/22250203/twitter-academic-research-public-tweet-archive-free-access . Accessed 28 June 2024.
Evans M, Mathews AW, McGinty T. Hospitals Still Not Fully Complying With Federal Price-Disclosure Rules. Wall Street J. 2021. https://www.wsj.com/articles/hospital-price-public-biden-11640882507 .
Johnson AE, et al. MIMIC-III, a freely accessible critical care database. Scientific data. 2016;3(1):1–9.
We are grateful to the New York State SPARCS program for making the data available freely to the public. We greatly appreciate the feedback provided by the anonymous reviewers which helped in improving the quality of this manuscript.
No external funding was available for this research.
Authors and affiliations.
Indian Institute of Technology, Delhi, India
Raunak Jain, Mrityunjai Singh & Rahul Garg
Fairleigh Dickinson University, Teaneck, NJ, USA
A. Ravishankar Rao
Raunak Jain, Mrityunjai Singh, A. Ravishankar Rao, and Rahul Garg contributed equally to all stages of preparation of the manuscript.
Correspondence to A. Ravishankar Rao.
Ethics approval and consent to participate.
Not applicable as no human subjects were used in our study.
Not applicable.
The authors declare no competing interests.
Publisher’s note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Cite this article.
Jain, R., Singh, M., Rao, A.R. et al. Predicting hospital length of stay using machine learning on a large open health dataset. BMC Health Serv Res 24, 860 (2024). https://doi.org/10.1186/s12913-024-11238-y
Received : 19 June 2023
Accepted : 24 June 2024
Published : 29 July 2024
DOI : https://doi.org/10.1186/s12913-024-11238-y
ISSN: 1472-6963
A hospital is a place that needs more effective and efficient management of information, people, and assets. This paper demonstrates the design and implementation of an autonomous system with mern ...
International Journal of Engineering and Management Research 12(5):135-149; October 2022; 12(5):135-149 ... The paper presents [5], HMS website development offers a timely solution to mitigate the ...
1.2 Abstract. Hospital Management System provides the benefits of streamlined operations, enhanced administration, control, superior patient care, strict cost control and improved ...
This paper presents a web-based hospital management system that allows patients, doctors, and administrators to interact with the hospital's information system through a web interface. The system is built using HTML5/CSS3, JavaScript, Bootstrap, XAMPP, PHP, MySQL, and TCPDF technologies. Web-Based Hospital Management System (HMS) enables various hospital and medical processes to be performed ...
Furthermore, during emergencies, a reliable exchange information system is a crucial factor in providing a timely response. This paper describes the Hospital Availability Management System (HAMS), a software developed in the framework of the EU-funded SAFECARE project. The main goal of HAMS is to provide the current status of a hospital (or ...
This paper introduces a cutting-edge Hospital Management System (HMS) that combines blockchain technology and NFC hardware to enhance data management, streamline operations, and improve patient outcomes. The system utilizes blockchain as a secure ledger for patient records, ensuring data integrity and privacy. NFC-enabled wristbands/cards grant easy access to patient data for healthcare ...
Abstract. This research delves into the development and deployment of a Smart Hospital Management System (HMS), a pivotal solution for enhancing administrative efficiency and elevating patient care standards within healthcare institutions.
This research work is on design and construction of Hospital Management System (HMS). The system provides the benefits of streamlined operations, enhanced administration & control, superior patient care, strict cost control and improved profitability. The system uses JAVA as the front-end software which is an Object Oriented Programming ...
In particular, by introducing a tracking system for real-time assets (e.g., medical devices, medical supplies, and pharmaceutical products) based on beacon sensors and tags, medical institutions can improve the efficiency of logistics management related to hospital work and the workflow of medical staff [15,16] (Figure 1).
For this initiative, hospital management information system (HMIS) has to be implemented across 400+ health facilities in the city.,A case study methodology was adopted to study HMIS implementation. ... performance management, outcome analyses and research studies (Cowan, 2005; ... (2001), " White paper - reducing the frequency of errors in ...
Publication: 12 May 2020. Abstract. Today's web-based technology offers many online services in almost every field. Every major industry is converting and establishing a digital front for all ...
The paper looks at assessing ... "Advanced Hospital Management System" includes Registration of patients, storing their details into the system and also computerized billing in the pharmacy, and labs. Our software has the facility to give a ... analysis made in this research, it was recommend that General hospital and other medical centre ...
Health Services Management Research is an authoritative research based journal providing expert information on all aspects of healthcare management. Examining the real issues confronting health services management, it analyses policy initiatives and healthcare systems worldwide and provides evidence-based research to guide management decision-making.
Hospital Management System is an information management system designed to help manage the various aspects of a hospital (administrative, clinical and financial). It helps in monitoring and controlling the hospital's daily transactions, as well as the hospital's performance. It also helps to address the critical requirements of the
An intelligent hospital information management system was developed to assist the patient at the front desk of a hospital. The patient will be able to learn about the doctors, appointment times, relevant departments, laboratory tests and the specific medicine for his/her medical situation. The system will provide an intelligent front desk information service for the patients at the hospital ...
Their description is as follows: [1] Intelligent Hospital Management System by B. Koyuncu and H. Koyuncu: Helped to set the kind of tasks to be done and handled without increasing complexity of the particular task. [2] Integrated EIP for HealthCare by S.H. Hseih and J.L. Chen: The paper helped us to implement MySQL databases efficiently and ...
This paper describes the Hospital Availability Management System (HAMS), software developed in the framework of the EU-funded SAFECARE project. The main goal of HAMS is to provide the current status of a hospital (or health-care facility) to the internal staff, but also to first responders (paramedics, firefighters, civil protection, etc.) in ...
A Hospital Management System is an organized, computerized system designed and programmed to deal with the day-to-day operations and management of hospital activities. The program can look after inpatients, outpatients, records, database treatments, status illness, and billings in the pharmacy and labs.
The purpose of this paper is to design and implement a hospital management system that stores data in a database instead of on paper, eliminates redundancy of data, speeds up the process of ...
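The idea of storing records in a database without redundant data can be illustrated with a minimal sketch. It is not the schema from the paper quoted above: the table names (`patients`, `visits`), the column names, and the helper functions are all hypothetical, and SQLite is used only because it ships with Python. Each patient is stored once, enforced by a uniqueness constraint, and visits refer to the patient by key instead of repeating the patient's details.

```python
import sqlite3


def create_schema(conn):
    """Create two hypothetical tables: one row per patient, many visits per patient."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS patients (
            patient_id  INTEGER PRIMARY KEY,
            national_id TEXT NOT NULL UNIQUE,   -- prevents duplicate registrations
            full_name   TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS visits (
            visit_id    INTEGER PRIMARY KEY,
            patient_id  INTEGER NOT NULL REFERENCES patients(patient_id),
            department  TEXT NOT NULL,
            visit_date  TEXT NOT NULL           -- ISO 8601 date string
        );
        """
    )


def register_patient(conn, national_id, full_name):
    """Insert the patient if not already present and return the patient_id."""
    conn.execute(
        "INSERT OR IGNORE INTO patients (national_id, full_name) VALUES (?, ?)",
        (national_id, full_name),
    )
    row = conn.execute(
        "SELECT patient_id FROM patients WHERE national_id = ?", (national_id,)
    ).fetchone()
    return row[0]


def record_visit(conn, patient_id, department, visit_date):
    """Store a visit that points at the patient row instead of copying patient details."""
    cur = conn.execute(
        "INSERT INTO visits (patient_id, department, visit_date) VALUES (?, ?, ?)",
        (patient_id, department, visit_date),
    )
    return cur.lastrowid


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    pid = register_patient(conn, "ID-001", "A. Patient")
    # Registering the same person again returns the existing row.
    same_pid = register_patient(conn, "ID-001", "A. Patient")
    record_visit(conn, pid, "Medical OPD", "2019-02-01")
    print(pid == same_pid)
```

Keeping patient details in one table and referencing them from `visits` is one common way to avoid repeating the same data on every record; a production system would also need access control, auditing, and validation that this sketch omits.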
International Journal for Research in Engineering Application & Management (IJREAM) 1 ISSN : 2494-9150 Vol-01, Issue 11, FEB 2016. ... diagnosis details, while system output is to get these details on to the screen. The Hospital Management System can be entered using a username and password. It is accessible either by an administrator or ...
A hospital is a place that needs more effective and efficient management of information, people, and assets. This paper demonstrates the design and implementation of an autonomous system with MERN ...
Stone et al. [] present a survey of techniques used to predict the length of stay (LoS), which include statistical and arithmetic methods, intelligent data mining approaches and operations-research based methods. Lequertier et al. [] surveyed methods for LoS prediction. The main gap in the literature is that most methods focus on analyzing trends in the LoS or predicting the LoS only for specific conditions or ...
The purpose of this study is to develop a computerized hospital management system that will upgrade the quality of information management and efficiency of the hospital employees. Waterfall Model ...